Wasserstein Flow Meets Replicator Dynamics: A Mean-Field Analysis of Representation Learning in Actor-Critic
Abstract
Actor-critic (AC) algorithms, empowered by neural networks, have had significant empirical success in recent years. However, most of the existing theoretical support for AC algorithms focuses on the case of linear function approximations, or linearized neural networks, where the feature representation is fixed throughout training. Such a limitation fails to capture the key aspect of representation learning in neural AC, which is pivotal in practical problems. In this work, we take a mean-field perspective on the evolution and convergence of feature-based neural AC. Specifically, we consider a version of AC where the actor and critic are represented by overparameterized two-layer neural networks and are updated with two-timescale learning rates. The critic is updated by temporal-difference (TD) learning with a larger stepsize while the actor is updated via proximal policy optimization (PPO) with a smaller stepsize. In the continuous-time and infinite-width limiting regime, when the timescales are properly separated, we prove that neural AC finds the globally optimal policy at a sublinear rate. Additionally, we prove that the feature representation induced by the critic network is allowed to evolve within a neighborhood of the initial one.
1 Introduction
In reinforcement learning (RL) (Sutton and Barto, 2018), an agent aims to learn the optimal policy that maximizes the expected total reward obtained from interactions with an environment. Policy-based RL algorithms achieve such a goal by directly optimizing the expected total reward as a function of the policy, which often involves two components: policy evaluation and policy improvement. Specifically, policy evaluation refers to estimating the value function of the current policy, which characterizes the performance of the current policy and reveals the updating direction for finding a better policy, which is known as policy improvement. Algorithms that explicitly incorporate these two ingredients are called actor-critic (AC) methods (Konda and Tsitsiklis, 2000), where the actor and the critic refer to the policy and its corresponding value function, respectively.
Recently, in RL applications with large state spaces, actor-critic methods have achieved striking empirical successes when empowered by expressive function approximators such as neural networks (Agostinelli et al., 2019; Akkaya et al., 2019; Berner et al., 2019; Duan et al., 2016; Silver et al., 2016, 2017; Vinyals et al., 2019). These successes benefit from the data-dependent representations learned by neural networks. Unfortunately, however, the theoretical understanding of this data-dependent benefit is very limited. The classical theory of AC focuses on the case of linear function approximation, where the actor and critic are represented using linear functions with the feature mapping fixed throughout learning (Bhatnagar et al., 2008, 2009; Konda and Tsitsiklis, 2000). Meanwhile, a few recent works establish convergence and optimality of AC with overparameterized neural networks (Fu et al., 2020; Liu et al., 2019; Wang et al., 2019), where the neural network training is captured by the Neural Tangent Kernel (NTK) (Jacot et al., 2018). Specifically, with properly designed parameter initialization and stepsizes, and sufficiently large network widths, the neural networks employed by both actor and critic can be assumed to be well approximated by linear functions of a random feature vector. In other words, from the point of view of representation learning, the features induced by these algorithms are by assumption infinitesimally close to the initial featural representation, which is data-independent.
In this work, we make initial steps towards understanding how representation learning comes into play in neural AC. Specifically, we address the following questions:
Going beyond the NTK regime, does neural AC provably find the globally optimal policy? How does the feature representation associated with the neural network evolve along with neural AC?
We focus on a version of AC where the critic performs temporal-difference (TD) learning (Sutton, 1988) for policy evaluation and the actor improves its policy via proximal policy optimization (PPO) (Schulman et al., 2017), which corresponds to a Kullback-Leibler (KL) divergence regularized optimization problem, with the critic providing the update direction. Moreover, we utilize two-timescale updates where both the actor and critic are updated at each iteration but with the critic having a much larger stepsize. In other words, the critic is updated at a faster timescale. Meanwhile, we represent the critic explicitly as a two-layer overparameterized neural network and parameterize the actor implicitly via the critic and PPO updates. To examine convergence, we study the evolution of the actor and critic in a continuous-time limit with the network width going to infinity. In such a regime, the actor update is closely connected to replicator dynamics (Börgers and Sarin, 1997; Hennes et al., 2020; Schuster and Sigmund, 1983) and the critic update is captured by a semigradient flow in the Wasserstein space (Villani, 2008). Moreover, the semigradient flow runs at a faster timescale according to the two-timescale mechanism.
It turns out that the separation of timescales plays an important role in the convergence analysis. In particular, the continuous-time limit enables us to separately analyze the evolution of actor and critic and then combine these results to get final theoretical guarantees. Specifically, focusing solely on the actor, we prove that the time-averaged suboptimality of the actor converges sublinearly to zero up to the time-averaged policy evaluation error associated with critic updates. Moreover, for the critic, under regularity conditions, we connect the Bellman error to the Wasserstein distance and show that the time-averaged policy evaluation error also converges sublinearly to zero. Therefore, we show that neural AC provably achieves global optimality at a sublinear rate. Furthermore, regarding representation learning, we show that the critic induces a data-dependent feature representation within an neighborhood of the initial representation in terms of the Wasserstein distance, where is a sufficiently large scaling parameter.
Our technical analysis rests on three ingredients: (i) infinite-dimensional variational inequalities with a one-point monotonicity property (Harker and Pang, 1990), (ii) a mean-field perspective on neural networks (Chizat and Bach, 2018b; Mei et al., 2019, 2018; Sirignano and Spiliopoulos, 2020a, b), and (iii) two-timescale stochastic approximation (Borkar, 2009; Kushner and Yin, 2003). In particular, in the infinite-width limit, the neural network and its induced feature representation are identified with a distribution over the parameter space. The mean-field perspective enables us to characterize the evolution of such a distribution within the Wasserstein space via a certain partial differential equation (PDE) (Ambrosio and Gigli, 2013; Ambrosio et al., 2008; Villani, 2003, 2008). For policy evaluation, this PDE is given by the semigradient flow induced by TD learning. We characterize the policy evaluation error by showing that the mean-field Bellman error satisfies a version of one-point monotonicity tailored to the Wasserstein space. Moreover, our actor analysis exploits the geometry of policy optimization, which shows that the expected total reward, as a function of the policy, also enjoys one-point monotonicity in the policy space. Finally, the actor and critic errors are connected via two-timescale stochastic approximation. To the best of our knowledge, this is the first time that convergence and global optimality guarantees have been obtained for neural AC.
Related Work. AC with linear function approximation has been studied extensively in the literature. In particular, using a two-timescale stochastic approximation via ordinary differential equations, Konda and Tsitsiklis (2000); Bhatnagar et al. (2008) and Bhatnagar et al. (2009) establish asymptotic convergence guarantees in the continuous-time limiting regime. More recently, using more sophisticated optimization techniques, various works (Wu et al., 2020; Xu et al., 2020b, a; Hong et al., 2020; Khodadadian et al., 2021) have established discrete-time convergence guarantees that show that linear AC converges sublinearly to either a stationary point or the globally optimal policy. Furthermore, when overparameterized neural networks are employed, Liu et al. (2019); Wang et al. (2019) and Fu et al. (2020) prove that neural AC converges to the global optimum at a sublinear rate. In these works, the initial value of the network parameters and the learning rates are chosen such that both actor and critic updates are captured by the NTK. In other words, when the network width is sufficiently large, such a version of neural AC is well approximated by its linear counterpart via the neural tangent feature. In comparison, we establish a mean-field analysis that has a different scaling than the NTK regime. We also establish finite-time convergence to global optimality, and more importantly, the feature representation induced by the critic is data-dependent and allowed to evolve within a much larger neighborhood around the initial one.
Our work is also related to a recent line of research on understanding stochastic gradient descent (SGD) for supervised learning problems involving an overparameterized two-layer neural network under the mean-field regime. See, e.g., Chizat and Bach (2018b); Mei et al. (2018, 2019); Javanmard et al. (2019); Wei et al. (2019); Fang et al. (2019b, a); Chen et al. (2020a); Sirignano and Spiliopoulos (2020b, a); Lu et al. (2020) and the references therein. In the continuous-time and infinite-width limit, these works show that SGD for neural network training is captured by a Wasserstein gradient flow (Villani, 2008; Ambrosio et al., 2008; Ambrosio and Gigli, 2013) of an energy function that corresponds to the objective function in supervised learning. In contrast, our analysis combines such a mean-field analysis with TD learning and two-timescale stochastic approximation, which are tailored specifically to AC. Moreover, our critic is updated via TD learning, which is a semigradient algorithm: there is no objective functional whose gradient TD learning follows. Thus, in the mean-field regime, our critic is given by a Wasserstein semigradient flow, which also differs from these existing works.
Additionally, our work is closely related to Zhang et al. (2020) and Agazzi and Lu (2019), who provide mean-field analyses for neural TD-learning and Q-learning (Watkins and Dayan, 1992). In comparison, we focus on AC, which is a two-timescale policy optimization algorithm. Finally, Agazzi and Lu (2020) study softmax policy gradient with neural network policies in the mean-field regime, where policy gradient is cast as a Wasserstein gradient flow with respect to the expected total reward. Their algorithm assumes that the critic has direct access to the desired value function and is thus single-timescale. Moreover, the convergence guarantee in Agazzi and Lu (2020) is asymptotic. In comparison, our AC algorithm is two-timescale, and we establish non-asymptotic sublinear convergence guarantees to global optimality.
Notation. We denote by the set of probability measures over the measurable space . Given a curve , we denote by its derivative with respect to time. For an operator and a measure , we denote by the push forward of through . We denote by the chi-squared divergence between probability measures and , which is defined as . Given two probability measures and , we denote the Kullback-Leibler divergence or the relative entropy from to by . For , we define the weighted homogeneous Sobolev norm as . We denote by the -norm with respect to probability measure . We denote by the semidirect product, i.e., for and transition kernel . For a function , we denote by its Lipschitz constant. We denote a normal distribution on by , where is the mean value and is the covariance matrix.
2 Background
In this section, we first introduce the policy optimization problem and the actor-critic method. We then present the definition of the Wasserstein space.
2.1 Policy Optimization and Actor-Critic
We consider a Markov decision process (MDP) given by , where is the state space, is the action space, is the discount factor, is the transition kernel, is the reward function, and is the initial state distribution. Without loss of generality, we assume that and , where . We remark that as long as the state-action space is bounded, we can normalize it to lie within the unit sphere. Given a policy , at the th step, the agent takes an action at state according to and observes a reward . The environment then transitions to the next state according to the transition kernel . Note that the policy induces Markov chains on both and . Considering the Markov chain on , we denote the induced Markov transition kernel by , which is defined as . Likewise, we denote the Markov transition kernel on by , which is defined as . Let be a probability measure on . We then define the visitation measure induced by policy and starting from as
(2.1)
where is the trajectory generated by starting from and following policy thereafter. If holds, we then denote such a visitation measure by . Furthermore, we denote by the marginal distribution of visitation measure with respect to . In particular, when holds in (2.1), it follows that . In policy optimization, we aim to maximize the expected total reward defined as follows,
where we denote by the expectation with respect to and for . We define the action value function and the state value function as follows,
(2.2)
where we denote by the inner product on the action space . Correspondingly, the advantage function is defined as
The action value function corresponds to the minimizer of the following mean-squared Bellman error (MSBE),
(2.3)
where is the sampling distribution depending on the policy , which will be defined by (4.8) in §4.2. Note that when has full support, i.e., , is the unique global minimizer of the MSBE. Consequently, the policy optimization problem can be written as the following bilevel optimization problem,
(2.4)
The inner problem in (2.4) is known as the policy evaluation subproblem, while the outer problem is the policy improvement subproblem. One of the most popular ways to solve the policy optimization problem is the actor-critic (AC) method (Sutton and Barto, 2018), where the critic evaluates the current policy and the actor then updates its policy according to the critic’s evaluation.
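As a concrete illustration of the visitation measure in (2.1) (an aside, not part of the paper's development), the following sketch estimates the discounted state-action visitation measure of a toy finite MDP by Monte Carlo rollouts; all numerical values and names below are illustrative placeholders.

```python
import numpy as np

def estimate_visitation(P, pi, init_dist, gamma, n_rollouts=5_000, horizon=200, rng=None):
    """Monte Carlo estimate of the discounted state-action visitation measure,
    i.e., (1 - gamma) * sum_t gamma^t P(s_t = s, a_t = a)."""
    rng = rng or np.random.default_rng(0)
    n_states, n_actions = pi.shape
    counts = np.zeros((n_states, n_actions))
    for _ in range(n_rollouts):
        s = rng.choice(n_states, p=init_dist)
        for t in range(horizon):
            a = rng.choice(n_actions, p=pi[s])
            counts[s, a] += gamma ** t        # accumulate discounted occupancy
            s = rng.choice(n_states, p=P[s, a])
    return (1 - gamma) * counts / n_rollouts   # normalizes to (approximately) a probability measure

# Toy example: 2 states, 2 actions (placeholder numbers).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.7, 0.3]]])       # P[s, a] is a distribution over next states
pi = np.array([[0.5, 0.5], [0.5, 0.5]])        # uniform policy
print(estimate_visitation(P, pi, init_dist=np.array([1.0, 0.0]), gamma=0.9))
```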
2.2 Wasserstein Space
Let be a Polish space. We denote by the set of probability measures with finite second moments. Then, the Wasserstein-2 distance between is defined as follows,
where the infimum is taken over the random variables and on and we denote by the distribution of a random variable . We call the Wasserstein () space, which is an infinite-dimensional manifold (Villani, 2008). See §A.1 for more details.
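To make the definition concrete, here is a small numerical sketch (an illustrative aside, not part of the analysis) that estimates the Wasserstein-2 distance between two one-dimensional empirical measures; in one dimension the optimal coupling is the monotone (quantile) coupling, so pairing sorted samples attains the infimum.

```python
import numpy as np

def wasserstein2_1d(x, y):
    """W2 distance between two 1-D empirical measures with equal sample size,
    computed via the monotone (sorted-sample) coupling, which is optimal in 1-D."""
    x_sorted = np.sort(np.asarray(x, dtype=float))
    y_sorted = np.sort(np.asarray(y, dtype=float))
    assert x_sorted.shape == y_sorted.shape, "equal sample sizes assumed"
    return np.sqrt(np.mean((x_sorted - y_sorted) ** 2))

rng = np.random.default_rng(0)
mu_samples = rng.normal(loc=0.0, scale=1.0, size=10_000)
nu_samples = rng.normal(loc=2.0, scale=1.0, size=10_000)
# For two Gaussians with equal variance, W2 equals the distance between the means (= 2 here).
print(wasserstein2_1d(mu_samples, nu_samples))
```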
3 Algorithm
Two-timescale actor-critic. We consider a two-timescale actor-critic (AC) algorithm (Kakade, 2002; Peters and Schaal, 2008) for the policy optimization problem in (2.4). For policy evaluation, we parameterize the critic with a neural network and update the parameter via temporal-difference (TD) learning (Sutton, 1988). For policy improvement, we update the actor policy via proximal policy optimization (PPO) (Schulman et al., 2017). Our algorithm is two-timescale since both the actor and critic are updated at each iteration with different stepsizes. Specifically, we parameterize the critic by the following neural network with width and parameter ,
(3.1)
Here is the activation function and is the scaling parameter. Such a structure also appears in Chizat and Bach (2018a); Mei et al. (2019) and Chen et al. (2020b). In a discrete-time finite-width (DF) scenario, at the th iteration, the critic and actor are updated as follows,
(3.2) | |||
(3.3) |
where and . Here is the policy of the actor at the th iteration, is the corresponding weighting distribution, and are the stepsizes for the DF-PPO update and the DF-TD update, respectively. In (3.2), the scaling of arises since our update falls into the lazy-training regime (Chizat and Bach, 2018a). In the sequel, we denote by the relative TD timescale. Note that in a double-loop AC algorithm, the critic can usually be solved to high precision. In two-timescale AC, however, even with the KL-divergence term in (3.3), which regularizes the policy update and helps to improve the local estimation quality of the TD update, the critic used to update the actor’s policy can still be far from the true action value function . Since the policy evaluation problem is not fully solved at each iteration, two-timescale AC is more computationally efficient but more challenging to analyze theoretically.
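For intuition about the two-timescale updates in (3.2) and (3.3), here is a minimal, self-contained sketch on a toy finite MDP. It is only a caricature of the algorithm analyzed in this paper, under several simplifying assumptions: the critic is a width-m two-layer network over one-hot state-action features, the expectations in the TD and PPO updates are replaced by single-sample estimates, the KL-regularized PPO step is applied in its multiplicative closed form, and all constants (width, scaling, stepsizes) are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy MDP (placeholder quantities).
n_states, n_actions, gamma = 5, 3, 0.9
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # transition kernel P[s, a]
R = rng.uniform(0.0, 1.0, size=(n_states, n_actions))             # reward function

# Two-layer critic: Q(s, a) = (alpha / m) * sum_i b_i * tanh(w_i^T phi(s, a)).
m, alpha, d = 256, 5.0, n_states * n_actions
W = rng.normal(size=(m, d))
b = rng.choice([-1.0, 1.0], size=m)

def phi(s, a):                       # one-hot state-action feature
    x = np.zeros(d)
    x[s * n_actions + a] = 1.0
    return x

def q_value(s, a):
    return alpha / m * np.dot(b, np.tanh(W @ phi(s, a)))

# Actor: tabular policy, updated by the multiplicative (KL-regularized) PPO step.
pi = np.full((n_states, n_actions), 1.0 / n_actions)

eta_actor, eta_critic = 1e-3, 1e-1   # critic stepsize is larger (faster timescale)
s = rng.integers(n_states)
for k in range(5_000):
    a = rng.choice(n_actions, p=pi[s])
    s_next = rng.choice(n_states, p=P[s, a])
    a_next = rng.choice(n_actions, p=pi[s_next])

    # DF-TD: semigradient step on a single-sample Bellman error.
    h = np.tanh(W @ phi(s, a))
    delta = alpha / m * np.dot(b, h) - (R[s, a] + gamma * q_value(s_next, a_next))
    grad_b = alpha / m * delta * h
    grad_W = alpha / m * delta * np.outer(b * (1.0 - h ** 2), phi(s, a))
    b -= eta_critic * grad_b
    W -= eta_critic * grad_W

    # DF-PPO: multiplicative (replicator-like) policy update, then normalize.
    q_s = np.array([q_value(s, a_) for a_ in range(n_actions)])
    pi[s] *= np.exp(eta_actor * q_s)
    pi[s] /= pi[s].sum()

    s = s_next

print("greedy actions per state:", pi.argmax(axis=1))
```

The ratio between the two stepsizes plays the role of the relative TD timescale: increasing it lets the critic track the value function of the current policy more closely between policy updates.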
Mean-field (MF) Limit. To analyze the convergence of two-timescale AC with neural networks, we study the mean-field limiting regime (Mei et al., 2018, 2019). Here, by the mean-field limit, we refer to an infinite-width limit, i.e., for the neural network width in (3.1), and a continuous-time limit, i.e., where for the stepsize in (3.2) and (3.3). For independently sampled from a distribution , we can write the infinite-width limit of (3.1) as
(3.4)
In the sequel, we denote by the distribution of for the infinite-width limit of the neural network at the th iteration. We further let and be the continuous-time limits of and , respectively. The existence of will be shown shortly after. As derived in Zhang et al. (2020), the mean-field limit of the DF-TD update in (3.2) is
(3.5)
where is the relative TD timescale and
(3.6)
is a vector field. Here is taken with respect to and . It remains to characterize the mean-field limit of the DF-PPO update in (3.3). By solving the maximization problem in (3.3), the infinite-width limit of DF-PPO update can be written in closed form as
where is the normalizing factor such that for any . By letting and , such a result directly shows that the limit of exists. Therefore, in the mean-field limit, we have , which can be further written as . Here we have and is the continuous-time limit of . Furthermore, noting that , the mean-field limit of the DF-PPO update in (3.3) is
(3.7)
The two updates (3.5) and (3.7) correspond to the mean-field limits of (3.2) and (3.3), respectively, and together serve as the mean-field limit of the two-timescale AC. In particular, we remark that the MF-TD update in (3.5) for the critic is captured by a semigradient flow in the Wasserstein space (Villani, 2008) while the MF-PPO update in (3.7) for the actor resembles the replicator dynamics (Schuster and Sigmund, 1983; Börgers and Sarin, 1997; Hennes et al., 2020). See §A.2 for more discussion of the replicator dynamics. Note that such a framework is applicable to continuous state and action spaces. In this paper, we aim to provide a theoretical analysis of the mean-field limit of the two-timescale AC.
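To relate the mean-field critic in (3.4) to its finite-width counterpart (3.1), the following sketch (illustrative constants only, not part of the analysis) evaluates a width-m critic whose parameters are drawn i.i.d. from a fixed distribution; by the law of large numbers, the finite-width output concentrates around its mean-field value as the width grows.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, d = 2.0, 4                          # scaling parameter and input dimension (placeholders)
x = rng.normal(size=d)                     # a fixed state-action input (s, a)
x /= np.linalg.norm(x)                     # normalized input, as in Section 2.1

def finite_width_critic(m):
    """Q^(m)(s, a) = alpha / m * sum_i b_i * tanh(w_i^T x) with parameters drawn from rho_0."""
    W = rng.normal(size=(m, d))            # rho_0: standard Gaussian over w
    b = rng.choice([-1.0, 1.0], size=m)    # symmetric output weights
    return alpha / m * np.dot(b, np.tanh(W @ x))

# As the width m grows, the finite-width critic concentrates around its mean-field
# limit alpha * E_rho0[ b * tanh(w^T x) ], which equals 0 for this symmetric rho_0.
for m in (10, 100, 10_000, 1_000_000):
    print(m, finite_width_critic(m))
```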
4 Main Result
In this section, we first establish the convergence of the MF-PPO update in §4.1. Then, under additional assumptions, we establish the optimality and convergence of the mean-field two-timescale AC in §4.2.
4.1 Convergence of Mean-field PPO
For the MF-PPO update in (3.7), we establish the following theorem on global optimality and convergence rate.
Theorem 4.1 (Convergence of MF-PPO).
Let be the optimal policy and be the initial policy. Then, it holds that
(4.1)
where is an evaluation distribution for the policy evaluation error and is the expected KL-divergence between and . Furthermore, letting , where is a base distribution and , the concentrability coefficient is given by
Proof.
See §B.1 for a detailed proof. ∎
The concentrability coefficient commonly appears in the reinforcement learning literature (Szepesvári and Munos, 2005; Munos and Szepesvári, 2008; Antos et al., 2008; Farahmand et al., 2010; Scherrer et al., 2015; Farahmand et al., 2016; Lazaric et al., 2016; Liu et al., 2019; Wang et al., 2019). In contrast to the more standard form of the concentrability coefficient, note that does not depend on the updates of the algorithm. To show the convergence of MF-PPO, our condition here is much weaker since we only need to specify a base distribution such that .
Theorem 4.1 shows that the MF-PPO converges to the globally optimal policy at a rate of up to the policy evaluation error. Such a theorem implies the global optimality and convergence of a double-loop AC algorithm, where the critic is solved to high precision and the policy evaluation error is sufficiently small. In the sequel, we consider a more challenging setting, where the critic is updated simultaneously along with the update of the actor’s policy .
4.2 Global Optimality and Convergence of Two-timescale AC
In this section, we aim to provide an upper bound on the policy evaluation error when the critic and the actor are updated simultaneously. Specifically, the actor is updated via MF-PPO in (3.7) and the critic is updated via the MF-TD in (3.5). The smooth function in the parameterization of the Q function in (3.4) is taken to be the following two-layer neural network,
(4.2)
where is the activation function, is the parameter, and is an odd and invertible function with scaling hyper-parameter . It then holds that , where and are the dimensions of and , respectively. It is worth noting that the function class of for is the same as
(4.3)
which captures a rich function class by the universal function approximation theorem (Barron, 1993; Pinkus, 1999). We remark that we introduce the rescaling function in (4.2) to avoid working with the space of probability measures over in (4.3), which has a boundary and thus lacks the regularity needed in the study of optimal transport. Furthermore, note that we introduce a hyper-parameter in the Q function in (3.4). Thus, we are using to represent , which causes an “over-representation” when . Such over-representation appears to be essential for our analysis. To anticipate briefly, we note that controls the gap in the average total reward over time when the relative timescale is properly selected according to Theorem 4.6. Furthermore, such an influence is imposed through Lemma 4.4, which shows that the Wasserstein distance between and is upper bounded by . In what follows, we consider the initialization of the TD update to be , which implies that . We next impose a regularity assumption on the two-layer neural network .
Assumption 4.2 (Regularity of the Neural Network).
For the two-layer neural network defined in (4.2), we assume that the following properties hold.
-
(i)
The rescaling function is odd, -Lipschitz continuous, -smooth, and invertible. Meanwhile, the inverse is locally Lipschitz continuous. In particular, we assume that is -Lipschitz continuous in .
-
(ii)
The activation function is odd, -bounded, -Lipschitz continuous, and -smooth.
We remark that Assumption 4.2 is not restrictive and is satisfied by a large family of neural networks, e.g., and . Noting that , Assumption 4.2 implies that the function in (4.2) is odd with respect to and and is also bounded, Lipschitz continuous, and smooth in the parameter domain, that is,
(4.4)
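As an illustrative sanity check (the paper's own example activations are not reproduced here; tanh is simply one natural choice), the following snippet verifies numerically on a grid that tanh is odd, bounded by 1, 1-Lipschitz, and has a bounded second derivative, in the spirit of Assumption 4.2(ii).

```python
import numpy as np

x = np.linspace(-10.0, 10.0, 200_001)
f = np.tanh(x)
f1 = 1.0 - np.tanh(x) ** 2                         # first derivative of tanh
f2 = -2.0 * np.tanh(x) * (1.0 - np.tanh(x) ** 2)   # second derivative of tanh

print(np.max(np.abs(f + np.tanh(-x))))   # ~0: tanh is odd
print(np.max(np.abs(f)))                 # <= 1: bounded
print(np.max(np.abs(f1)))                # <= 1: 1-Lipschitz
print(np.max(np.abs(f2)))                # ~0.77: bounded second derivative (smooth)
```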
We then impose the following assumption on the MDP.
Assumption 4.3 (Regularity of the MDP).
For the MDP , we assume the following properties hold.
-
(i)
The reward function and the transition kernel admit the following representations with respect to the activation function ,
(4.5) (4.6) where and are probability measures in for any , is a positive scaling parameter, and is a nonnegative function.
- (ii)
-
(iii)
We assume that there exists an absolute constant such that
where is the weighted homogeneous Sobolev norm.
We remark that by assuming to be a probability measure and that in (4.6), the representation of the transition kernel does not lose generality. Specifically, the function class of (4.6) is the same as
See §C.1 for a detailed proof. Assumption 4.3 generalizes the linear MDP in Yang and Wang (2019b, a); Cai et al. (2019a) and Jin et al. (2020). In contrast, our representation of the reward function and the transition kernel benefits from the universal function approximation theorem and is thus not as restrictive as the original linear MDP assumption. Note that the infinite-width neural network has a two-layer structure by . We establish the following lemma on the regularity of the representation of the action value function by such a neural network.
Lemma 4.4 (Regularity of Representation of ).
Proof.
See §B.2 for a detailed proof. ∎
Property (i) of Lemma 4.4 shows that the action value function can be parameterized with the infinite-width two-layer neural network in (3.4). Note that a larger captures a larger function class in (4.3). Without loss of generality, we assume that holds in the sequel. Hence, by Property (ii), we have that for any policy . In particular, it holds by Property (i) of Lemma 4.4 that and we have the following theorem to characterize such an error with regard to the space.
Theorem 4.5 (Upper Bound of Policy Evaluation Error).
Suppose that Assumptions 4.2 and 4.3 hold and is the initial distribution. We specify the weighting distribution in MF-TD (3.5) as , where is the evaluation distribution for the policy evaluation error in Theorem 4.1. Then, it holds that
(4.7)
where
Here and are defined in (4.4) of Assumption 4.2, is the relative TD timescale, is the scaling parameter of the neural network, and is the scaled metric. Moreover, the constant depends on the discount factor , the scaling parameter in (4.2), and the absolute constants defined in Assumptions 4.2 and 4.3.
Proof.
See §B.3 for a detailed proof. ∎
Here we give a heuristic discussion of how to upper bound in (4.7). If holds for any , by in Lemma 4.4 and the triangle inequality of the distance (Villani, 2008), it follows that and . Integrating both sides of (4.7) over time and averaging, the policy evaluation error is then upper bounded by . Inspired by this fact, we introduce the following restarting mechanism to ensure .

Restarting Mechanism. Let be a threshold, where is the upper bound for by Lemma 4.4, is a constant scaling parameter for the restarting threshold, is the distribution of the parameters in the neural network at time , and is the initial distribution. Whenever we detect that reaches in the update, we pause and reset to by resampling the parameters from . Then, we reset the critic with the newly sampled parameters while keeping the actor’s policy unchanged and continue the update.
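The following schematic sketch illustrates the restarting logic only; the routines td_step and ppo_step are hypothetical placeholders, and the Wasserstein distance to the initial parameter distribution is replaced by a computable upper bound obtained from the particle-trajectory coupling (since the Wasserstein-2 distance is an infimum over couplings, pairing each particle with its own initialization is admissible).

```python
import numpy as np

def w2_upper_bound(theta, theta_init):
    """Upper bound on W2(rho_t, rho_0) from the coupling that pairs each
    particle with its own initialization (any coupling upper-bounds the infimum)."""
    return np.sqrt(np.mean(np.sum((theta - theta_init) ** 2, axis=1)))

def train_with_restarts(theta_init, td_step, ppo_step, threshold, num_steps):
    """Two-timescale updates with the restarting mechanism: whenever the bound on
    the W2 distance to the initial parameter distribution reaches `threshold`,
    the critic parameters are reset to a fresh draw from rho_0 (here: the stored
    initial particles), while the actor's policy is left unchanged."""
    theta = theta_init.copy()
    num_restarts = 0
    for _ in range(num_steps):
        ppo_step(theta)                   # actor update (slow timescale), uses current critic
        theta = td_step(theta)            # critic update (fast timescale)
        if w2_upper_bound(theta, theta_init) >= threshold:
            theta = theta_init.copy()     # restart: reset the critic, keep the policy
            num_restarts += 1
    return theta, num_restarts
```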
The restarting mechanism guarantees by restricting the distribution of the parameters to be close to . Moreover, by letting , we ensure that is realizable by since , which means that the neural network is capable of capturing the representation of the action value function . We remark that by letting , we allow to deviate from up to in the restarting mechanism. In contrast, the NTK regime (Cai et al., 2019b), which corresponds to letting in (3.1), only allows to deviate from by the chi-squared divergence . That is, the NTK regime fails to induce a feature representation significantly different from the initial one. Before moving on, we summarize the construction of the weighting distribution in Theorems 4.1 and 4.5 as follows,
(4.8)
where is the base distribution. Now we have the following theorem that characterizes the global optimality and convergence of the two-timescale AC with restarting mechanism.
Theorem 4.6 (Global Optimality and Convergence Rate of Two-timescale AC with Restarting Mechanism).
Suppose that (4.8) and Assumptions 4.2 and 4.3 hold. With the restarting mechanism, it holds that
(4.9)
where we have
Here , and are defined in Assumptions 4.2 and 4.3, is the upper bound for in Lemma 4.4, depends on the discount factor and the absolute constants defined in Assumptions 4.2 and 4.3, and is the scaling parameter for the restarting threshold. Additionally, the total restarting number satisfies the following inequality,
Proof.
See §B.4 for a detailed proof. ∎
Note that for a given MDP with starting distribution , the expected KL-divergence and the concentrability coefficient are both independent of the two-timescale update. We remark that our condition for (4.9) to be bounded is not restrictive. Specifically, we only need a given and such that the KL-divergence and the concentrability coefficient , which is a weakening relative to the standard usage of the concentrability coefficient in the literature (Szepesvári and Munos, 2005; Munos and Szepesvári, 2008; Antos et al., 2008; Farahmand et al., 2010; Scherrer et al., 2015; Farahmand et al., 2016; Lazaric et al., 2016; Liu et al., 2019; Wang et al., 2019).
The first term (a) on the right-hand side of (4.9) diminishes as . The second term (b) corresponds to the policy evaluation error. We give an example to demonstrate the convergence of the two-timescale AC. We let the scaling parameter for the restarting threshold. By letting , it holds that decreases at a rate of as and . Note that shows that the critic has a larger relative TD timescale in (3.5). As for the total number of restartings , it holds that as , which induces a tradeoff, i.e., a larger guarantees a smaller gap in but yields more restartings and potentially requires a larger relative TD timescale.
5 Acknowledgement
Zhaoran Wang acknowledges National Science Foundation (Awards 2048075, 2008827, 2015568, 1934931), Simons Institute (Theory of Reinforcement Learning), Amazon, J.P. Morgan, and Two Sigma for their support.
References
- Agazzi and Lu (2019) Agazzi, A. and Lu, J. (2019). Temporal-difference learning for nonlinear value function approximation in the lazy training regime. arXiv preprint arXiv:1905.10917.
- Agazzi and Lu (2020) Agazzi, A. and Lu, J. (2020). Global optimality of softmax policy gradient with single hidden layer neural networks in the mean-field regime. arXiv preprint arXiv:2010.11858.
- Agostinelli et al. (2019) Agostinelli, F., McAleer, S., Shmakov, A. and Baldi, P. (2019). Solving the Rubik’s cube with deep reinforcement learning and search. Nature Machine Intelligence, 1 356–363.
- Akkaya et al. (2019) Akkaya, I., Andrychowicz, M., Chociej, M., Litwin, M., McGrew, B., Petron, A., Paino, A., Plappert, M., Powell, G., Ribas, R. et al. (2019). Solving Rubik’s cube with a robot hand. arXiv preprint arXiv:1910.07113.
- Ambrosio and Gigli (2013) Ambrosio, L. and Gigli, N. (2013). A user’s guide to optimal transport. In Modelling and Optimisation of Flows on Networks. Springer, 1–155.
- Ambrosio et al. (2008) Ambrosio, L., Gigli, N. and Savaré, G. (2008). Gradient Flows: In Metric Spaces and in the Space of Probability Measures. Springer.
- Antos et al. (2008) Antos, A., Szepesvári, C. and Munos, R. (2008). Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learning, 71 89–129.
- Barron (1993) Barron, A. R. (1993). Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory, 39 930–945.
- Berner et al. (2019) Berner, C., Brockman, G., Chan, B., Cheung, V., Debiak, P., Dennison, C., Farhi, D., Fischer, Q., Hashme, S., Hesse, C. et al. (2019). Dota 2 with large scale deep reinforcement learning. arXiv preprint arXiv:1912.06680.
- Bhatnagar et al. (2008) Bhatnagar, S., Ghavamzadeh, M., Lee, M. and Sutton, R. S. (2008). Incremental natural actor-critic algorithms. In Advances in Neural Information Processing Systems.
- Bhatnagar et al. (2009) Bhatnagar, S., Sutton, R. S., Ghavamzadeh, M. and Lee, M. (2009). Natural actor-critic algorithms. Automatica, 45 2471–2482.
- Börgers and Sarin (1997) Börgers, T. and Sarin, R. (1997). Learning through reinforcement and replicator dynamics. Journal of Economic Theory, 77 1–14.
- Borkar (2009) Borkar, V. S. (2009). Stochastic Approximation: A Dynamical Systems Viewpoint. Springer.
- Cai et al. (2019a) Cai, Q., Yang, Z., Jin, C. and Wang, Z. (2019a). Provably efficient exploration in policy optimization. arXiv preprint arXiv:1912.05830.
- Cai et al. (2019b) Cai, Q., Yang, Z., Lee, J. D. and Wang, Z. (2019b). Neural temporal-difference learning converges to global optima. In Advances in Neural Information Processing Systems.
- Chen et al. (2020a) Chen, M., Wang, Y., Liu, T., Yang, Z., Li, X., Wang, Z. and Zhao, T. (2020a). On computation and generalization of generative adversarial imitation learning. arXiv preprint arXiv:2001.02792.
- Chen et al. (2020b) Chen, Z., Cao, Y., Gu, Q. and Zhang, T. (2020b). Mean-field analysis of two-layer neural networks: Non-asymptotic rates and generalization bounds. arXiv preprint arXiv:2002.04026.
- Chizat and Bach (2018a) Chizat, L. and Bach, F. (2018a). A note on lazy training in supervised differentiable programming. arXiv preprint arXiv:1812.07956.
- Chizat and Bach (2018b) Chizat, L. and Bach, F. (2018b). On the global convergence of gradient descent for over-parameterized models using optimal transport. In Advances in Neural Information Processing Systems.
- Duan et al. (2016) Duan, Y., Chen, X., Houthooft, R., Schulman, J. and Abbeel, P. (2016). Benchmarking deep reinforcement learning for continuous control. In International Conference on Machine Learning.
- Fang et al. (2019a) Fang, C., Dong, H. and Zhang, T. (2019a). Over parameterized two-level neural networks can learn near optimal feature representations. arXiv preprint arXiv:1910.11508.
- Fang et al. (2019b) Fang, C., Gu, Y., Zhang, W. and Zhang, T. (2019b). Convex formulation of overparameterized deep neural networks. arXiv preprint arXiv:1911.07626.
- Farahmand et al. (2016) Farahmand, A.-m., Ghavamzadeh, M., Szepesvári, C. and Mannor, S. (2016). Regularized policy iteration with nonparametric function spaces. The Journal of Machine Learning Research, 17 4809–4874.
- Farahmand et al. (2010) Farahmand, A.-m., Szepesvári, C. and Munos, R. (2010). Error propagation for approximate policy and value iteration. In Advances in Neural Information Processing Systems.
- Friedrichs (1944) Friedrichs, K. O. (1944). The identity of weak and strong extensions of differential operators. Transactions of the American Mathematical Society, 55 132–151.
- Fu et al. (2020) Fu, Z., Yang, Z. and Wang, Z. (2020). Single-timescale actor-critic provably finds globally optimal policy. arXiv preprint arXiv:2008.00483.
- Harker and Pang (1990) Harker, P. T. and Pang, J.-S. (1990). Finite-dimensional variational inequality and nonlinear complementarity problems: A survey of theory, algorithms and applications. Mathematical Programming, 48 161–220.
- Hennes et al. (2020) Hennes, D., Morrill, D., Omidshafiei, S., Munos, R., Perolat, J., Lanctot, M., Gruslys, A., Lespiau, J.-B., Parmas, P., Duéñez-Guzmán, E. et al. (2020). Neural replicator dynamics: Multiagent learning via hedging policy gradients. In International Conference on Autonomous Agents and MultiAgent Systems.
- Hong et al. (2020) Hong, M., Wai, H.-T., Wang, Z. and Yang, Z. (2020). A two-timescale framework for bilevel optimization: Complexity analysis and application to actor-critic. arXiv preprint arXiv:2007.05170.
- Jacot et al. (2018) Jacot, A., Gabriel, F. and Hongler, C. (2018). Neural tangent kernel: Convergence and generalization in neural networks. In Advances in Neural Information Processing Systems.
- Javanmard et al. (2019) Javanmard, A., Mondelli, M. and Montanari, A. (2019). Analysis of a two-layer neural network via displacement convexity. arXiv preprint arXiv:1901.01375.
- Jin et al. (2020) Jin, C., Yang, Z., Wang, Z. and Jordan, M. I. (2020). Provably efficient reinforcement learning with linear function approximation. In Conference on Learning Theory.
- Kakade and Langford (2002) Kakade, S. and Langford, J. (2002). Approximately optimal approximate reinforcement learning. In International Conference on Machine Learning, vol. 2.
- Kakade (2002) Kakade, S. M. (2002). A natural policy gradient. In Advances in Neural Information Processing Systems.
- Khodadadian et al. (2021) Khodadadian, S., Doan, T. T., Maguluri, S. T. and Romberg, J. (2021). Finite sample analysis of two-time-scale natural actor-critic algorithm. arXiv preprint arXiv:2101.10506.
- Konda and Tsitsiklis (2000) Konda, V. R. and Tsitsiklis, J. N. (2000). Actor-critic algorithms. In Advances in Neural Information Processing Systems.
- Kushner and Yin (2003) Kushner, H. and Yin, G. G. (2003). Stochastic Approximation and Recursive Algorithms and Applications. Springer.
- Lazaric et al. (2016) Lazaric, A., Ghavamzadeh, M. and Munos, R. (2016). Analysis of classification-based policy iteration algorithms. The Journal of Machine Learning Research, 17 583–612.
- Liu et al. (2019) Liu, B., Cai, Q., Yang, Z. and Wang, Z. (2019). Neural proximal/trust region policy optimization attains globally optimal policy. arXiv preprint arXiv:1906.10306.
- Lu et al. (2020) Lu, Y., Ma, C., Lu, Y., Lu, J. and Ying, L. (2020). A mean field analysis of deep resnet and beyond: Towards provably optimization via overparameterization from depth. In International Conference on Machine Learning. PMLR.
- Mei et al. (2019) Mei, S., Misiakiewicz, T. and Montanari, A. (2019). Mean-field theory of two-layers neural networks: dimension-free bounds and kernel limit. In Conference on Learning Theory. PMLR.
- Mei et al. (2018) Mei, S., Montanari, A. and Nguyen, P.-M. (2018). A mean field view of the landscape of two-layer neural networks. Proceedings of the National Academy of Sciences, 115 E7665–E7671.
- Munos and Szepesvári (2008) Munos, R. and Szepesvári, C. (2008). Finite-time bounds for fitted value iteration. The Journal of Machine Learning Research, 9 815–857.
- Otto and Villani (2000) Otto, F. and Villani, C. (2000). Generalization of an inequality by Talagrand and links with the logarithmic Sobolev inequality. Journal of Functional Analysis, 173 361–400.
- Peters and Schaal (2008) Peters, J. and Schaal, S. (2008). Natural actor-critic. Neurocomputing, 71 1180–1190.
- Peyre (2011) Peyre, R. (2011). Comparison between distance and norm, and localisation of wasserstein distance. arXiv preprint arXiv:1104.4631.
- Pinkus (1999) Pinkus, A. (1999). Approximation theory of the MLP model in neural networks. Acta Numerica, 8 143–195.
- Scherrer et al. (2015) Scherrer, B., Ghavamzadeh, M., Gabillon, V., Lesner, B. and Geist, M. (2015). Approximate modified policy iteration and its application to the game of Tetris. The Journal of Machine Learning Research, 16 1629–1676.
- Schulman et al. (2017) Schulman, J., Wolski, F., Dhariwal, P., Radford, A. and Klimov, O. (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
- Schuster and Sigmund (1983) Schuster, P. and Sigmund, K. (1983). Replicator dynamics. Journal of Theoretical Biology, 100 533–538.
- Silver et al. (2016) Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M. et al. (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529 484–489.
- Silver et al. (2017) Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A. et al. (2017). Mastering the game of Go without human knowledge. Nature, 550 354.
- Sirignano and Spiliopoulos (2020a) Sirignano, J. and Spiliopoulos, K. (2020a). Mean field analysis of neural networks: A central limit theorem. Stochastic Processes and their Applications, 130 1820–1852.
- Sirignano and Spiliopoulos (2020b) Sirignano, J. and Spiliopoulos, K. (2020b). Mean field analysis of neural networks: A law of large numbers. SIAM Journal on Applied Mathematics, 80 725–752.
- Sutton (1988) Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3 9–44.
- Sutton and Barto (2018) Sutton, R. S. and Barto, A. G. (2018). Reinforcement Learning: An Introduction. MIT press.
- Szepesvári and Munos (2005) Szepesvári, C. and Munos, R. (2005). Finite time bounds for sampling based fitted value iteration. In International Conference on Machine Learning. ACM.
- Villani (2003) Villani, C. (2003). Topics in Optimal Transportation. American Mathematical Society.
- Villani (2008) Villani, C. (2008). Optimal Transport: Old and New. Springer.
- Vinyals et al. (2019) Vinyals, O., Babuschkin, I., Czarnecki, W. M., Mathieu, M., Dudzik, A., Chung, J., Choi, D. H., Powell, R., Ewalds, T., Georgiev, P. et al. (2019). Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature, 575 350–354.
- Wang et al. (2019) Wang, L., Cai, Q., Yang, Z. and Wang, Z. (2019). Neural policy gradient methods: Global optimality and rates of convergence. arXiv preprint arXiv:1909.01150.
- Watkins and Dayan (1992) Watkins, C. and Dayan, P. (1992). Q-learning. Machine Learning, 8 279–292.
- Wei et al. (2019) Wei, C., Lee, J. D., Liu, Q. and Ma, T. (2019). Regularization matters: Generalization and optimization of neural nets vs their induced kernel. In Advances in Neural Information Processing Systems.
- Wu et al. (2020) Wu, Y., Zhang, W., Xu, P. and Gu, Q. (2020). A finite time analysis of two time-scale actor critic methods. arXiv preprint arXiv:2005.01350.
- Xu et al. (2020a) Xu, T., Wang, Z. and Liang, Y. (2020a). Improving sample complexity bounds for actor-critic algorithms. arXiv preprint arXiv:2004.12956.
- Xu et al. (2020b) Xu, T., Wang, Z. and Liang, Y. (2020b). Non-asymptotic convergence analysis of two time-scale (natural) actor-critic algorithms. arXiv preprint arXiv:2005.03557.
- Yang and Wang (2019a) Yang, L. F. and Wang, M. (2019a). Reinforcement learning in feature space: Matrix bandit, kernels, and regret bound. arXiv preprint arXiv:1905.10389.
- Yang and Wang (2019b) Yang, L. F. and Wang, M. (2019b). Sample-optimal parametric q-learning using linearly additive features. arXiv preprint arXiv:1902.04779.
- Zhang et al. (2020) Zhang, Y., Cai, Q., Yang, Z., Chen, Y. and Wang, Z. (2020). Can temporal-difference and Q-learning learn representation? A mean-field theory. arXiv preprint arXiv:2006.04761.
Appendix A Supplement to the Background
In this section, we present some background on Wasserstein space and replicator dynamics.
A.1 Wasserstein Space
Let be a Polish space. We denote by the set of probability measures with finite second moments. Then, the Wasserstein-2 () distance between is defined as follows,
(A.1) |
where the infimum is taken over the random variables and on and we denote by the distribution of a random variable . We call the Wasserstein space, which is an infinite-dimensional manifold (Villani, 2008). In particular, we define the tangent vector at as for the corresponding curve with . Under certain regularity conditions, the continuity equation corresponds to a vector field , which endows the infinite-dimensional manifold with a weak Riemannian structure in the following sense (Villani, 2008). Given any tangent vectors and at and the corresponding vector fields , which satisfy and , respectively, we define the inner product of and as follows,
(A.2) |
which yields a Riemannian metric. Such a Riemannian metric further induces a norm for any tangent vector at any , which allows us to write the Wasserstein-2 distance defined in (A.1) as follows,
(A.3) |
Here denotes for any . In particular, the infimum in (A.3) is attained by the geodesic connecting . Moreover, the geodesics on are constant-speed, that is,
(A.4) |
In the Wasserstein space , a curve is defined to be absolutely continuous if there exists , i.e., , such that
Such an absolutely continuous curve allows us to define the metric derivative in as follows,
(A.5) |
By Ambrosio et al. (2008), the metric derivative is connected to the norm of the tangent vector by
(A.6) |
Furthermore, we introduce the Wasserstein-1 distance, which is defined as
for any with finite first moments. The Wasserstein-1 distance has the following dual representation (Ambrosio et al., 2008),
(A.7) |
A.2 Replicator Dynamics
The replicator dynamics originally arises in the study of evolutionary game theory (Schuster and Sigmund, 1983). For a function , the replicator dynamics is given by the differential equation
where . As for the PPO update in (3.7), for a fixed , letting and , we see that (3.7) corresponds to a replicator dynamics if . Note that in the simultaneous update of the critic and the actor, we do not have access to the true action value function . Thus, we use the estimate computed by the critic step to guide the actor in the PPO update, which takes the form of a replicator dynamics in the continuous-time limit.
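As a concrete illustration (not taken from the paper), the following sketch integrates the standard replicator equation on a finite action set with a fixed fitness vector using forward Euler; the probability mass concentrates on the highest-fitness action, mirroring how the MF-PPO update shifts mass toward actions with larger estimated values.

```python
import numpy as np

def replicator_step(pi, f, dt):
    """One forward-Euler step of the replicator dynamics
    d/dt pi(a) = pi(a) * (f(a) - <f, pi>)."""
    avg_fitness = np.dot(f, pi)
    pi = pi + dt * pi * (f - avg_fitness)
    return pi / pi.sum()               # renormalize to correct discretization error

f = np.array([1.0, 0.5, 0.2])          # fixed "fitness" (e.g., Q-values of three actions)
pi = np.full(3, 1.0 / 3.0)             # start from the uniform distribution
for _ in range(2_000):
    pi = replicator_step(pi, f, dt=0.01)
print(pi)                               # mass concentrates on the first (highest-fitness) action
```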
Appendix B Proofs of Main Results
In this section, we give detailed proofs of the theorems and present a detailed statement of Lemma 4.4.
B.1 Proof of Theorem 4.1
Proof.
Following from the performance difference lemma (Kakade and Langford, 2002), we have
where is the visitation measure induced by from and is the advantage function. Note that the continuous PPO dynamics in (3.7) can be equivalently written as . Thus, we have
(B.1) |
For the first term on the right-hand side of (B.1), it holds that,
(B.2) | ||||
where the last equality holds by noting that . For the second term on the right-hand side of (B.1), by the Cauchy-Schwarz inequality, we have
(B.3) | ||||
(B.4) |
Plugging (B.1), (B.1), and (B.1) into (B.1), we have
(B.5) |
By further letting , it holds for the concentrability coefficient in (B.1) that
(B.6) |
By further letting , it holds for the concentrability coefficient in (B.1) that
(B.7) |
Plugging (B.6) and (B.7) into (B.1) and integrating both sides of (B.1), we have
where . Thus, we complete the proof of Theorem 4.1. ∎
B.2 Detailed Statement of Lemma 4.4
We give a detailed version of Lemma 4.4 as follows.
Lemma B.1 (Regularity of Representation of ).
Proof.
See §C.2 for a detailed proof. ∎
B.3 Proof of Theorem 4.5
Proof.
For notational simplicity, we let . By Property (i) of Lemma B.1, it holds that , where . This prompts us to study the distance between and . By the first variation formula in Lemma E.2, it holds that
(B.8) |
Here is the geodesic connecting and , and is its time-inverse, i.e., . In addition, we denote by and the derivatives of the geodesics and with respect to , respectively. We denote by the corresponding vector field at , which satisfies . For the first term of (B.8), it holds that
(B.9) |
where the first equality follows from (A.2) and the third equality follows from by Property (ii) of Lemma B.1. For the first term on the right-hand side of (B.3), we have
(B.10) |
where the first equality holds by the definition of in (3.6) and the last equality follows from Stokes’ formula. Note that we have by the definition of the vector field . Thus, it holds for (B.3) that
(B.11) |
where the last equality follows from the definition of in (3.4). We let with respect to a specific and . Recall that for the weighting distribution , we set . Hence, for the integrand of (B.3), we have
(B.12) |
where the equality follows from the definition of in (3.6) and the inequality follows from the Cauchy-Schwarz inequality. We define as a mapping operator such that . We rewrite (B.3) as
(B.13) |
By the definition of and the definition of visitation measure in (2.1), it holds that . Hence, we have and it holds for (B.13) that
(B.14) |
where the first inequality holds by noting that and the last inequality holds by noting that . Combining (B.3), (B.13), and (B.3) together, it holds for the first term of (B.3) that
(B.15) |
where the last inequality holds by the Cauchy-Schwarz inequality. For the second term of (B.3), we have
(B.16) |
where the first equality holds by the Eulerian representation of the geodesic in Lemma E.3 that and Stokes’ formula. Here, we denote by the outer product between two vectors. The inequality of (B.3) follows from and the property of the geodesic in (A.4) that .
For the second term on the right-hand side of (B.8), we denote by the corresponding vector field at such that . Then, it holds that
(B.17) |
where the first equality follows from (A.2), the inequality follows from the Cauchy-Schwarz inequality and the facts that by (A.2) and that by (A.4). The last equality of (B.17) holds by (A.6). Plugging the definition of metric derivative in (A.5) into (B.17), we have
(B.18) |
where the second inequality follows from Property (iv) of Lemma B.1 and the equality follows from the MF-PPO update in (3.7). For the approximation of the advantage function , it holds that
(B.19) |
Plugging (B.3) into (B.3), we have
(B.20) |
where the last inequality follows from . By first plugging (B.3) and (B.3) into (B.3), and then plugging (B.3) and (B.3) into (B.8), we have
(B.21) |
Plugging Lemma D.1 into (B.3), we have
(B.22) |
Note that is the geodesic connecting and . By Lemma D.2, we have
(B.23) |
Plugging (B.23) into (B.3), it follows that
where is the scaled metric. Thus, we complete the proof of Theorem 4.5. ∎
B.4 Proof of Theorem 4.6
Proof.
We remark that the restarting mechanism produces discontinuity in while remains continuous. Let denote the restarting points in , where and is the total restarting number in . Let and denote the moments just before and after the restarting occurring at , respectively. According to the restarting mechanism, we have , and . Recall that we set . By (4.7) of Theorem 4.5, it holds that
(B.24) |
For simplicity, we let , , , , and . By summing the integrals of (B.4) over , and , we have
(B.25) |
Note that we have by the triangle inequality of distance, the restarting mechanism and Property (iii) of Lemma B.1. It thus holds for (B.4) that
(B.26) |
Note that we have by the triangle inequality of distance, the restarting mechanism and Property (iii) of Lemma B.1. It thus holds for (B.4) that
(B.27) |
By setting and plugging (B.27) into (4.1) of Theorem 4.1, we have
(B.28) |
where is the concentrability coefficient and is the KL-divergence between and . Here the second inequality follows from the Cauchy-Schwarz inequality. Therefore, we complete the proof of (4.9) in Theorem 4.6. In what follows, we aim to upper bound the total restarting number . Recall that we have and according to the restarting mechanism. Thus, it holds for (B.4) that
Since , it follows that
which upper bounds the total restarting number . Hence, we complete the proof of Theorem 4.6. ∎
Appendix C Proofs of Supporting Lemmas
In this section, we give detailed proof of the supporting lemmas.
C.1 Generality of Transition Kernel Representation
Recall that we have the following representation of the transition kernel in Assumption 4.3,
(C.1) |
where and . Such a representation has the same function class as
(C.2) |
where is a finite signed measure.
Proof.
For simplicity, let and denote the function class represented by (C.1) and (C.2), respectively. Note that for any transition kernel represented by (C.1), by letting , such a transition kernel can be equivalently represented by in (C.2). Thus, it holds that . Therefore, we only need to prove , which is equivalent to proving that for any given by (C.2), the signed measure can be non-negative. If that is the case, by letting and , we can have (C.2) equivalently represented by (C.1). Note that there always exist non-negative functions and such that . Since is an odd function, it holds that
Thus, by letting and , it holds that , where and . Hence, any can be equivalently represented by (C.1) and it follows that . Thus, (C.1) has the same function class as (C.2) and we complete the proof. ∎
C.2 Proof of Detailed Version of Lemma 4.4
Proof.
We begin with a sketch of the proof of Lemma B.1. We first construct functions and . With the use of mollifiers, we prove that there exists a function satisfying (C.6) and then formulate a construction of , which allows us to obtain . For the proof of Property (iii), using Talagrand’s inequality and the chi-squared divergence, we establish a constant upper bound for . For the proof of Property (iv), by exploiting the inequality between the distance and the weighted homogeneous Sobolev norm, we upper bound up to .
Proof of Property (i) of Lemma B.1. We give a proof of Property (i) by a construction of . For notational simplicity, we let . By definitions of the action value function and the state value function in (2.2), we have
(C.3) |
where
(C.4) | |||
(C.5) |
Here, the second equality in (C.2) holds by (i) of Assumption 4.3. We construct by , i.e., , where is defined to be a probability measure in such that
(C.6) |
We remark that such a exists and we will provide a construction later. Since we have , , and by Assumption 4.3, it turns out that is a probability density function according to (C.4), which further suggests that . Plugging (C.6) into (C.2), it holds that
where the equality holds by noting that in (4.2) and that . Furthermore, by letting , we have
where the second equality holds by noting that is odd with respect to and and that is an even function. Thus, we finish the construction of and also complete the proof of Property (i) in Lemma B.1.
A Construction for . Recall that we have defined in (C.6). Here, we provide a construction for with properties that will facilitate our analysis. Ideally, we want to have global support and concentrate around its mean, which motivates us to take to be a Gaussian distribution with large variance. Recall that we assume that is -Lipschitz continuous on in Assumption 4.2. Let be the probability density function of the standard Gaussian distribution, i.e., . Then, is the probability density function such that . We define the function as follows,
(C.7) |
where denotes the convolution. Note that can be viewed as a class of mollifiers (Friedrichs, 1944). In particular, let
(C.8) |
where and characterize the Lipschitz continuity and smoothness of respectively and is -Lipschitz continuous in by Assumption 4.2. For the approximation error of mollifier , it holds that
where the last inequality follows from . Similarly, we have . By definition of in (C.8), it further holds that
(C.9) | |||
(C.10) |
Note that is a monotonic function with in by Assumption 4.2. With regard to (C.10), it follows that is also monotonic in and that
(C.11) |
Furthermore, by (C.9) and the continuity of in , we have
(C.12) |
The monotonicity of , (C.11), and (C.12) together show that exists and is -Lipschitz continuous in . Moreover, since is an odd function, it holds by (C.7) that is also an odd function with . Hence, it holds that . Furthermore, by (C.5), it holds that
where the first inequality follows from the fact that and the last inequality follows from Assumption 4.3 that . By setting , it holds that , which indicates that exists and allows to be the probability density function such that . For the mean value , recalling that , it thus holds that
(C.13) |
Following from (C.7), we have
Hence, our construction of here is in line with the definition of in (C.6). In the sequel, we consider to hold throughout.
Proof of Property (ii) of Lemma B.1. Here we show that is a direct result of in Property (i). Note that
(C.14) |
holds by the definition of the action value function in (2.2) for any . Since we have proved, by plugging (C.14) into the definition of in (3.6), where is substituted for , it follows that . Thus, we complete the proof of Property (ii) of Lemma B.1.
Proof of Property (iii) of Lemma B.1.
In what follows, we aim to upper bound . We summarize our aforementioned constructions as follows,
(C.15) | |||
(C.16) | |||
(C.17) | |||
(C.18) |
Plugging (C.16) into the definition of Chi-squared divergence, we have
(C.19) |
Note that we have by (C.8) and by (C.13). Hence, we have upper bounded. As for in (C.15), we have
(C.20) |
where the inequality holds by Property (iii) of Lemma D.3. For the first term on the right-hand side of (C.2), we have
(C.21) |
where the equality holds by Property (i) of Lemma D.3, the first inequality holds by noting that and the last inequality follows from by Assumption 4.3. Hence, the first term on the right-hand side of (C.2) is upper bounded. As for the second term, by Property (iv) of Lemma D.3, we have
(C.22) |
where the last inequality holds by noting that , , , and that by Assumption 4.3. Hence, it holds that the second term of (C.2) is also upper bounded. Plugging (C.2) and (C.2) into (C.2), we can establish the upper bound for . Furthermore, by Property (ii) of Lemma D.3 and noting that , we have upper bounded as well, that is,
where depends on absolute constants occurring in (C.19), (C.2), and (C.2), i.e., , , , , , , , and in Assumptions 4.2 and 4.3. Since , it holds for any that,
(C.23) |
where the first inequality follows from Talagrand’s inequality in Lemma E.4. Plugging into (C.2), we complete the proof of Property (iii) of Lemma B.1.
Proof of Property (iv) of Lemma B.1. Upper Bounding . Following from the property of the distance with respect to Gaussian distributions, we have . Recall that we have and that is -Lipschitz continuous on . It then holds that
(C.24) |
Meanwhile, we have
(C.25) |
where the last inequality holds by (ii) of Assumption 4.3. Plugging (C.2) into (C.24), it holds for that
(C.26) |
Upper Bounding . By definition of in (C.15), we have
(C.27) |
For , it holds that
(C.28) |
where the last inequality holds by noting that
(C.29) |
Here the inequality in (C.29) holds by (C.2) and the fact that . For , by Lemma E.5, it holds that
(C.30) |
Recall that we have defined in (C.15) that
where , , and , which indicate that is in the convex hull of and . Hence, by Property (i) of Lemma D.4, it holds that
(C.31) |
Furthermore, following from (C.27) and , by Property (ii) of Lemma D.4, it holds for that
(C.32) |
Plugging (iii) of Assumption 4.3, (C.2), and (C.29) into (C.2), we have
(C.33) |
By simply substituting for in both (C.2) and (C.2), it also holds for that
(C.34) |
Combining (C.30), (C.31), (C.2), and (C.2), we have
(C.35) |
Upper Bounding . Note that we have in (C.17). By Lemma D.5, we have
(C.36) |
where depends on the discount factor and the absolute constants in Assumptions 4.2 and 4.3 according to (C.26) and (C.2). By the performance difference lemma (Kakade and Langford, 2002), we have
(C.37) |
Here the inequality follows from . We let . Plugging (C.2) into (C.36), we have
Recall that we have in (C.18). Then, by Lemma D.6, it holds that
which completes the proof of Lemma B.1. ∎
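For completeness, the performance difference lemma of Kakade and Langford (2002), which is invoked in (C.37) above, reads as follows in one standard form (generic notation; the normalization of the visitation measure used in the paper may differ):
\[
J(\pi') - J(\pi) \;=\; \frac{1}{1-\gamma}\,\mathbb{E}_{s\sim d^{\pi'}}\,\mathbb{E}_{a\sim\pi'(\cdot\mid s)}\bigl[A^{\pi}(s,a)\bigr],
\qquad
A^{\pi}(s,a) \;=\; Q^{\pi}(s,a) - V^{\pi}(s),
\]
where $d^{\pi'}(s) = (1-\gamma)\sum_{t\ge 0}\gamma^{t}\,\mathbb{P}(s_t = s \mid \pi')$ is the discounted state visitation distribution of $\pi'$.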
Appendix D Technical Results
In this section, we state and prove some technical results used in the proof of main theorems and lemmas.
Proof.
Following from Assumptions 4.3 and 4.2, we have that for any and , which implies that for any . Note that for any . Thus, by (A.7) and the inequality between distance and distance (Villani, 2003), we have for any and that
(D.3) |
which completes the proof of (D.1) in Lemma D.1. Following from the definition of in (3.6), we have for any and that
Here the last inequality follows from (D.3) and the fact that for any and , which follows from Assumptions 4.3 and 4.2. Thus, we complete the proof of Lemma D.1. ∎
Lemma D.2.
For and the geodesic connecting and , we have
(D.4) |
Proof.
We give a proof by contradiction. Note that is the geodesic connecting and . Assume there exists such that
Then, according to the triangle inequality of metric (Villani, 2008), we have
and by the same argument, which contradicts the definition of a geodesic, namely that . Hence, such does not exist and (D.4) holds. Thus, we complete the proof of Lemma D.2. ∎
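For reference, the defining property of a (constant-speed) geodesic in $(\mathcal{P}_2, W_2)$, written in generic notation, is
\[
W_2(\rho_s,\rho_t) \;=\; |t-s|\,W_2(\rho_0,\rho_1) \qquad \text{for all } s,t\in[0,1],
\]
which in particular forces $W_2(\rho_0,\rho_t)+W_2(\rho_t,\rho_1) = W_2(\rho_0,\rho_1)$ at every intermediate point; this is the property that the triangle-inequality argument above contradicts.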
Lemma D.3.
The Chi-squared divergence has the following properties.
-
(i)
For any probability measure , function , and , we have
-
(ii)
For any probability measures and , functions and , we have
where is the product of and , and is the product measure of and , i.e., and , respectively.
-
(iii)
For any probability measure , functions and , we have
-
(iv)
For any probability measure , function and , we have
Proof.
Proof of Property (i) of Lemma D.3. By definition of the Chi-squared divergence, we have
Thus, we complete the proof of Property (i) of Lemma D.3. Proof of Property (ii) of Lemma D.3. Let and . It then holds that
By further noting that , , , and , we have
Thus, we complete the proof of Property (ii) of Lemma D.3. Proof of Property (iii) of Lemma D.3. Let and . It then holds that
By further noting that and , we have
where the inequality follows from the Cauchy-Schwarz inequality. Thus, we complete the proof of Property (iii) of Lemma D.3. Proof of Property (iv) of Lemma D.3. Let . It then holds that
where the first equality holds by noting that and the inequality follows from the Cauchy-Schwarz inequality. Thus, we complete the proof of Property (iv) of Lemma D.3. ∎
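For orientation, recall that in generic notation the chi-squared divergence between probability measures $\mu \ll \nu$ is
\[
\chi^2(\mu \,\|\, \nu)
\;=\; \int \Bigl(\frac{\mathrm{d}\mu}{\mathrm{d}\nu} - 1\Bigr)^{2} \mathrm{d}\nu
\;=\; \int \Bigl(\frac{\mathrm{d}\mu}{\mathrm{d}\nu}\Bigr)^{2} \mathrm{d}\nu \;-\; 1,
\]
and that it tensorizes over product measures by Fubini's theorem,
\[
1 + \chi^2\bigl(\mu_1\otimes\mu_2 \,\big\|\, \nu_1\otimes\nu_2\bigr)
\;=\; \bigl(1+\chi^2(\mu_1\,\|\,\nu_1)\bigr)\bigl(1+\chi^2(\mu_2\,\|\,\nu_2)\bigr).
\]
These standard identities are recorded only as background; the statements of Properties (i)-(iv), which additionally involve weighting functions, may take a different but related form.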
Lemma D.4.
For weighted homogeneous Sobolev norm defined by
we have the following properties.
-
(i)
For a group of probability measures and , if is in the convex hull of , i.e., there exists such that both and hold, it then holds that
-
(ii)
Assume that we have measures and . Let be two functions on such that . Then, by letting and , we have
Proof.
Proof of Property (i) of Lemma D.4. By definition of the weighted homogeneous Sobolev norm, we have
(D.5)
(D.6) |
where the first equality holds by noting that . To illustrate the last equality, we denote by and the allowed function classes for in (D.5) and in (D.6), respectively. Note that we have for any , since the constraints on each in (D.6) are a relaxation of the constraints on in (D.5). Hence, the supremum taken over is no larger than the supremum taken over for any . Therefore, the supremum over is no larger than the smallest supremum over , and (D.6) holds. Furthermore, we have
(D.7)
(D.8)
where the first equality holds by noting that (D.8) is a rescaling of (D.7) with respect to in the constraints of (D.7). Here, we let
Then, it holds that . Otherwise, there must exist such that and that holds for any , which contradicts our conditions that and . Therefore, it further holds that
Thus, we complete the proof of Property (i) of Lemma D.4. Proof of Property (ii) of Lemma D.4. Let . Then, we have . Let and . Then, we have and that . Since , it holds that , where . We further let . Then, it holds that and that . Therefore, we have
By the definition of the weighted homogeneous Sobolev norm, it further holds that
(D.9) |
We assume that the supremum in (D.9) is attained at . Then, we have and that
where the last inequality holds by noting that . Thus, we complete the proof of Property (ii) of Lemma D.4. ∎
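For orientation, a common variational form of a weighted homogeneous (dual) Sobolev norm, written in generic notation, is
\[
\|\chi\|_{\dot{H}^{-1}(\rho)}
\;=\; \sup\Bigl\{\, \int f \,\mathrm{d}\chi \;:\; f \in C_c^{\infty},\ \int \|\nabla f\|^{2} \,\mathrm{d}\rho \le 1 \Bigr\}
\]
for a signed measure $\chi$ with total mass zero and a weighting measure $\rho$; the norm used in Lemma D.4 may differ in its admissible function class and normalization, but the duality and rescaling arguments above are of the same flavor.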
Lemma D.5.
For probability measures and , it holds that
Proof.
Lemma D.6.
For a probability density function , let and . Then, we have
Proof.
Recall the definition of the Wasserstein-2 distance:
where is the set of all couplings of and . We assume that the infimum is reached by , i.e.,
We denote by the distribution such that
where is the Dirac delta function and it holds that . Hence, it follows that
Thus, we complete the proof of Lemma D.6. ∎
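For reference, the Wasserstein-2 distance recalled at the beginning of the proof above is, in generic notation,
\[
W_2^{2}(\mu,\nu) \;=\; \inf_{\gamma \in \Pi(\mu,\nu)} \int \|x - y\|^{2} \,\mathrm{d}\gamma(x,y),
\]
where $\Pi(\mu,\nu)$ denotes the set of all couplings of $\mu$ and $\nu$, i.e., joint distributions whose marginals are $\mu$ and $\nu$; the explicit coupling constructed in the proof is one admissible element of this set, which is what yields the stated upper bound.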
Appendix E Auxiliary Lemmas
We use the definition of absolutely continuous curves in given in Ambrosio et al. (2008).
Definition E.1 (Absolutely Continuous Curve).
Let be a curve. Then, we say is an absolutely continuous curve if there exists a square-integrable function such that
for any .
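In generic notation, the standard definition from Ambrosio et al. (2008) reads: a curve $(\rho_t)_{t\in[0,1]}$ in $(\mathcal{P}_2, W_2)$ is absolutely continuous if there exists a square-integrable function $g\colon[0,1]\to[0,\infty)$ such that
\[
W_2(\rho_s,\rho_t) \;\le\; \int_{s}^{t} g(r)\,\mathrm{d}r \qquad \text{for all } 0\le s\le t\le 1;
\]
the time interval $[0,1]$ is chosen here only for concreteness.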
Then, we have the following first variation formula.
Lemma E.2 (First Variation Formula, Theorem 8.4.7 in Ambrosio et al. (2008)).
Given and an absolutely continuous curve , let be the geodesic connecting and . It holds that
where , , and the inner product is defined in (A.2).
Lemma E.3 (Eulerian Representation of Geodesics, Proposition 5.38 in Villani (2003)).
Let be a geodesic and be the corresponding vector field such that . It holds that
where is the outer product between two vectors.
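In generic notation, the Eulerian description of a Wasserstein geodesic $(\rho_t)$ with velocity field $v_t$ typically consists of a continuity equation coupled with a pressureless transport equation,
\[
\partial_t \rho_t + \operatorname{div}(\rho_t v_t) = 0,
\qquad
\partial_t(\rho_t v_t) + \operatorname{div}(\rho_t\, v_t \otimes v_t) = 0,
\]
where $v_t \otimes v_t$ denotes the outer product; the second equation expresses that mass travels along straight lines at constant speed. The precise statement of Lemma E.3 may include regularity assumptions omitted in this sketch.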
Lemma E.4 (Talagrand’s Inequality, Corollary 2.1 in Otto and Villani (2000)).
Let be . It holds for any that
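In generic notation, Talagrand's transportation inequality for the standard Gaussian measure $\nu$ on $\mathbb{R}^{D}$ states that, for any probability measure $\mu \ll \nu$,
\[
W_2(\mu,\nu) \;\le\; \sqrt{2\,\mathrm{KL}(\mu\,\|\,\nu)} \;\le\; \sqrt{2\,\chi^{2}(\mu\,\|\,\nu)},
\]
where the second inequality uses $\mathrm{KL}(\mu\|\nu) \le \log\bigl(1+\chi^{2}(\mu\|\nu)\bigr) \le \chi^{2}(\mu\|\nu)$. The version in Otto and Villani (2000) holds more generally for reference measures satisfying a logarithmic Sobolev inequality, with the constant depending on the log-Sobolev constant, so the constants in Lemma E.4 may differ from those displayed here.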
Lemma E.5 (Theorem 1 in Peyre (2011)).
Let be two probability measures in . Then, it holds that
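The comparison result of Peyre (2011) is often quoted in the following generic form: for probability measures $\mu$ and $\nu$ on $\mathbb{R}^{D}$ with suitable densities,
\[
W_2(\mu,\nu) \;\le\; 2\,\|\mu - \nu\|_{\dot{H}^{-1}(\nu)},
\]
relating the Wasserstein-2 distance to a weighted homogeneous (dual) Sobolev norm; the exact weighting measure and constant in Lemma E.5 may differ from this illustrative statement.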
Appendix F Conclusions and Limitations
In this work, we study the time evolution of a two-timescale AC algorithm represented by two-layer neural networks in the mean-field limit. Specifically, the actor updates its policy via proximal policy optimization, which is closely related to the replicator dynamics, while the critic is updated by temporal-difference learning, which is captured by a semigradient flow in the Wasserstein space. By introducing a restarting mechanism, we establish the convergence and optimality of AC with overparameterized two-layer neural networks. However, this study has limitations. In this work, we only study the continuous-time limiting regime, which is an idealized setting with infinitesimal learning rates, and establish finite-time convergence and optimality guarantees in that regime. Finite-time results for the more realistic discrete-time setting are left for future research.