
Permutation Invariant Policy Optimization for Mean-Field Multi-Agent Reinforcement Learning: A Principled Approach

Yan Li, Lingxiao Wang, Jiachen Yang, Ethan Wang,
Zhaoran Wang, Tuo Zhao, Hongyuan Zha
Yan Li, Jiachen Yang, Ethan Wang, and Tuo Zhao are affiliated with Georgia Institute of Technology; Lingxiao Wang and Zhaoran Wang are affiliated with Northwestern University; Hongyuan Zha is affiliated with The Chinese University of Hong Kong, Shenzhen. Tuo Zhao is the corresponding author; Email: tourzhao@gatech.edu.
Abstract

Multi-agent reinforcement learning (MARL) becomes more challenging in the presence of more agents, as the size of the joint state and action spaces grows exponentially in the number of agents. To address this challenge of scale, we identify a class of cooperative MARL problems with permutation invariance, and formulate them as a mean-field Markov decision process (MDP). To exploit the permutation invariance therein, we propose the mean-field proximal policy optimization (MF-PPO) algorithm, at the core of which is a permutation-invariant actor-critic neural architecture. We prove that MF-PPO attains the globally optimal policy at a sublinear rate of convergence. Moreover, its sample complexity is independent of the number of agents. We validate the theoretical advantages of MF-PPO with numerical experiments in the multi-agent particle environment (MPE). In particular, we show that the inductive bias introduced by the permutation-invariant neural architecture enables MF-PPO to outperform existing competitors with a smaller number of model parameters, which is key to its generalization performance.

1 Introduction

Multi-Agent Reinforcement Learning (Littman, 1994; Zhang et al., 2019) generalizes Reinforcement Learning (Sutton and Barto, 2018) to address the sequential decision-making problem of multiple agents maximizing their own long term rewards while interacting with one another in a common environment. With breakthroughs in deep learning, MARL algorithms equipped with deep neural networks have seen significant empirical successes in various domains, including simulated autonomous driving (Shalev-Shwartz et al., 2016), multi-agent robotic control (Matarić, 1997; Kober et al., 2013), and E-sports (Vinyals et al., 2019).

Despite tremendous successes, MARL is notoriously hard to scale to the many-agent setting, as the size of the state-action space grows exponentially with the number of agents. This phenomenon has recently been described as the curse of many agents (Menda et al., 2018). To tackle this challenge, we focus on cooperative MARL, where agents work together to maximize their team reward (Panait and Luke, 2005). We identify and exploit a key property of cooperative MARL with homogeneous agents, namely invariance with respect to the permutation of agents. Such permutation invariance can be found in many real-world scenarios with homogeneous agents, such as distributed control of multiple autonomous vehicles and team sports (Cao et al., 2013; Kalyanakrishnan et al., 2006), and also in heterogeneous groups where invariance holds within each group (Liu et al., 2019b). More importantly, we find that permutation invariance has significant practical implications, as the optimal value functions remain invariant when the joint state-action pairs are permuted. This observation strongly advocates a permutation invariant design for learning, which reduces the effective search space of the policy/value functions from exponential to polynomial dependence on the number of agents.

Several empirical methods have been proposed to incorporate permutation invariance into MARL. Liu et al. (2019b) implement a permutation invariant critic based on the Graph Convolutional Network (GCN) (Kipf and Welling, 2017). Sunehag et al. (2017) propose value decomposition, which, together with parameter sharing, leads to a joint critic network that is permutation invariant over agents. While these methods are heuristic, we are the first to provide theoretical principles for introducing permutation invariance as an inductive bias for learning value functions and policies in homogeneous systems. In addition, we adopt the DeepSet architecture (Zaheer et al., 2017), which is well suited to handling the homogeneity of agents, induces permutation invariance with much simpler operations, and is considerably more parameter-efficient.

To scale up learning in MARL problems with a large, or even infinite, number of agents, mean-field approximation has been explored to directly model the population behavior of the agents. Yang et al. (2017) consider a mean-field game with deterministic linear state transitions, and show that it can be reformulated as a mean-field MDP whose mean-field state resides in a finite-dimensional probability simplex. Yang et al. (2018) take the mean-field approximation over actions, so that the interaction between any given agent and the population is approximated by the interaction between the agent's action and the averaged actions of its neighboring agents. However, the motivation for averaging over local actions remains unclear, and it generally requires a sparse graph over agents; in practice, properly identifying such structure demands extensive prior knowledge. Carmona et al. (2019) motivate a mean-field MDP from the perspective of mean-field control. The mean-field state therein lies in a probability simplex and is thus continuous in nature; to enable the ensuing Q-learning algorithm, discretization of the joint state-action space is necessary. Wang et al. (2020) motivate a mean-field MDP from permutation invariance, but assume a central controller coordinating the actions of all the agents, and hence only address the curse of many agents arising from the exponential blowup of the joint state space. In contrast, our formulation of the mean-field approximation allows agents to take their own local actions without resorting to a centralized controller.

We propose a mean-field Markov decision process motivated by the permutation invariance structure of cooperative MARL, which can be viewed as the natural limit of the finite-agent MDP as the number of agents goes to infinity. Such a mean-field MDP generalizes the traditional MDP, with each state representing a distribution over the state space of a single agent. The mean-field MDP provides a tractable formulation for modeling MDPs with many, even infinitely many, agents. We further propose the Mean-Field Proximal Policy Optimization (MF-PPO) algorithm, at the core of which is a pair of permutation invariant actor and critic neural networks. These networks are implemented based on DeepSet (Zaheer et al., 2017), which uses convolution-type operations to induce permutation invariance over the set of inputs. We show that with sufficiently many agents, MF-PPO converges to the optimal policy of the mean-field MDP with a sublinear sample complexity independent of the number of agents. To support our theory, we conduct numerical experiments on the benchmark multi-agent particle environment (MPE) and show that our proposed method requires fewer model parameters and attains better performance than multiple baselines.

Notations. For a set $X$, we denote by $\mathcal{P}(X)$ the set of distributions on $X$. $\delta_{x}$ denotes the Dirac measure supported at $x$. For $\mathbf{s}=(s_{1},\ldots,s_{N})$, we use $\mathbf{s}\overset{\mathrm{i.i.d.}}{\sim}p$ to denote that each $s_{i}$ is independently sampled from distribution $p$. For a function $f:X\to\mathbb{R}$ and a distribution $\pi\in\mathcal{P}(X)$, we write $\langle f,\pi\rangle=\mathbb{E}_{a\sim\pi}f(a)$. We write $[m]$ in short for $\{1,\ldots,m\}$.

2 Problem Setup

We focus on multi-agent systems with cooperative, homogeneous agents, where the agents are of a similar nature and hence cannot be distinguished from one another. Specifically, we consider a discrete-time control problem with $N$ agents, formulated as a Markov decision process $({\mathcal{S}}^{N},\mathcal{A}^{N},\mathbb{P},\mathrm{r})$. We define the joint state space ${\mathcal{S}}^{N}$ as the Cartesian product of the finite state space ${\mathcal{S}}$ of each agent, and similarly define the joint action space $\mathcal{A}^{N}$. The homogeneous nature of the system is reflected in the transition kernel $\mathbb{P}$ and the shared reward $\mathrm{r}$, which satisfy:

\mathrm{r}(\mathbf{s}_{t},\mathbf{a}_{t})=\mathrm{r}\left(\kappa(\mathbf{s}_{t}),\kappa(\mathbf{a}_{t})\right),\qquad \mathbb{P}(\mathbf{s}_{t+1}|\mathbf{s}_{t},\mathbf{a}_{t})=\mathbb{P}\left(\kappa(\mathbf{s}_{t+1})|\kappa(\mathbf{s}_{t}),\kappa(\mathbf{a}_{t})\right), \quad (1)

for all $(\mathbf{s}_{t},\mathbf{a}_{t})\in{\mathcal{S}}^{N}\times\mathcal{A}^{N}$ and any permutation mapping $\kappa(\cdot)$. In other words, it is the configuration, rather than the identities of the agents, that affects the team reward, and the transition to the next configuration depends solely on the current configuration. See Figure 1 for an illustration. Such permutation invariance can be found in many real-world scenarios, such as distributed control of multiple autonomous vehicles, team sports, and social economic systems (Zheng et al., 2020; Cao et al., 2013; Kalyanakrishnan et al., 2006).

Figure 1: Illustration of permutation invariance. Exchanging identities of the agents (first and third row) does not change the transition of the underlying system (second row).
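To make property (1) concrete, the following minimal Python sketch (our own illustrative example, not the paper's environment) defines a team reward that depends only on the configuration of agent states and checks that permuting agent identities leaves it unchanged.

```python
import numpy as np

# A toy check of property (1): a team reward that depends only on the multiset
# (configuration) of agent states, so permuting agent identities cannot change it.
# The reward function itself is a hypothetical example.
def team_reward(states, n_states=5):
    counts = np.bincount(states, minlength=n_states)   # empirical state configuration
    return -float(np.var(counts))                      # e.g., reward even coverage of states

rng = np.random.default_rng(0)
states = rng.integers(0, 5, size=10)      # N = 10 agents, |S| = 5 local states
perm = rng.permutation(10)                # a permutation kappa of agent identities
assert np.isclose(team_reward(states), team_reward(states[perm]))
```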

Our goal is to find the optimal policy $\nu$ within a policy class which maximizes the expected discounted reward

V^{\nu}(\mathbf{s})=(1-\gamma)\,\mathbb{E}\left\{\sum_{t=0}^{\infty}\gamma^{t}\mathrm{r}(\mathbf{s}_{t},\mathbf{a}_{t})\,\Big|\,\mathbf{s}_{0}=\mathbf{s},\,\mathbf{a}_{t}\sim\nu(\mathbf{s}_{t}),\,\forall t\geq 0\right\}.

Our first result shows that the permutation invariance of the system justifies a permutation invariant network design for learning.

Proposition 2.1.

For cooperative MARL satisfying (1), there exists an optimal policy $\nu^{*}$ that is permutation invariant, i.e., $\nu^{*}(\mathbf{s},\mathbf{a})=\nu^{*}\left(\kappa(\mathbf{s}),\kappa(\mathbf{a})\right)$ for any permutation mapping $\kappa(\cdot)$. In addition, for any permutation invariant policy $\nu$, the value function $V^{\nu}(\cdot)$ and the state-action value function $Q^{\nu}(\cdot)$ are also permutation invariant:

V^{\nu}(\mathbf{s})=V^{\nu}\left(\kappa(\mathbf{s})\right),\qquad Q^{\nu}(\mathbf{s},\mathbf{a})=Q^{\nu}\left(\kappa(\mathbf{s}),\kappa(\mathbf{a})\right),

where $Q^{\nu}(\mathbf{s},\mathbf{a})=\mathbb{E}_{\mathbf{s}^{\prime}}\left\{\mathrm{r}(\mathbf{s},\mathbf{a})+\gamma V^{\nu}(\mathbf{s}^{\prime})\right\}$.

Proposition 2.1 has an important implication for architecture design, as it states that it suffices to search within the permutation invariant policy class and value function class. To the best of our knowledge, this is the first theoretical justification of permutation invariant network design for learning with homogeneous agents.

We focus on the factorized policy class with a parameter sharing scheme, where each agent makes its own decision given its local state. Specifically, the joint policy $\nu$ factorizes as $\nu(\mathbf{a}|\mathbf{s})=\prod_{i=1}^{N}\mu(a_{i}|s_{i})$, where $\mu(\cdot)$ denotes the shared local mapping. Such a policy class is widely adopted in the celebrated centralized training – decentralized execution paradigm (Lowe et al., 2017), due to its light overhead in the deployment phase and favorable performance. However, directly learning such a factorized policy remains challenging, as each agent needs to estimate its state-action value function, denoted by $Q^{\nu}(\mathbf{s},\mathbf{a})$. The corresponding search space is of size $(|{\mathcal{S}}|\times|\mathcal{A}|)^{N}$, which scales exponentially with the number of agents. This large search space poses a significant roadblock for efficient learning, and was recently coined the curse of many agents (Menda et al., 2018).
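The factorization with parameter sharing can be sketched as follows (a minimal illustration with discrete local states/actions and a hypothetical two-layer local network mu; not the paper's implementation).

```python
import torch
import torch.nn as nn

# Factorized policy nu(a|s) = prod_i mu(a_i|s_i) with parameter sharing:
# every agent applies the same local network mu to its own local state.
class SharedLocalPolicy(nn.Module):
    def __init__(self, n_states, n_actions, hidden=64):
        super().__init__()
        self.mu = nn.Sequential(
            nn.Embedding(n_states, hidden), nn.ReLU(), nn.Linear(hidden, n_actions)
        )

    def forward(self, local_states):          # local_states: LongTensor of shape (N,)
        logits = self.mu(local_states)        # one set of weights shared by all agents
        return torch.distributions.Categorical(logits=logits)

policy = SharedLocalPolicy(n_states=5, n_actions=3)
dist = policy(torch.randint(0, 5, (10,)))                 # N = 10 agents
joint_action = dist.sample()                              # one local action per agent
joint_log_prob = dist.log_prob(joint_action).sum()        # log nu(a|s) = sum_i log mu(a_i|s_i)
```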

To address the curse of many agents, we exploit the homogeneity of the system and take a mean-field approximation. We begin by taking the perspective of agent $i$, chosen arbitrarily from the $N$ agents. We denote the state of this agent by $s$ and the states of the remaining agents by $\mathbf{s}_{r}$. One can verify that permuting the states of the other agents leaves the value function unchanged; moreover, the value function can be characterized as a function of the local state and the empirical state distribution over the rest of the agents.

Proposition 2.2.

For any permutation mapping $\kappa(\cdot)$, the value function satisfies $V^{\nu}(s,\mathbf{s}_{r})=V^{\nu}(s,\kappa(\mathbf{s}_{r}))$. Additionally, there exists $g_{\nu}$ such that

V^{\nu}(s,\mathbf{s}_{r})=g_{\nu}(s,\widehat{\mathrm{p}}_{\mathbf{s}_{r}}),

where $\widehat{\mathrm{p}}_{\mathbf{s}_{r}}=\frac{1}{N}\sum_{s\in\mathbf{s}_{r}}\delta_{s}$ is the empirical state distribution over the rest of the agents.

For a system with a large number of agents (e.g., financial markets, social networks), the empirical state distribution can be seen as a concrete realization of the underlying population distribution. Motivated by this observation and Proposition 2.2, we formulate the following mean-field MDP, which can be seen as the limit of the finite-agent MDP in the presence of infinitely many homogeneous agents.

Definition 2.1 (mean-field MDP).

The mean-field MDP consists of the following elements: state $(s,\mathrm{d}_{\mathcal{S}})\in\overline{{\mathcal{S}}}\subseteq{\mathcal{S}}\times\mathcal{P}({\mathcal{S}})$; action $\overline{a}\in\overline{\mathcal{A}}\subseteq\mathcal{A}^{{\mathcal{S}}}$; reward $\mathrm{r}(s,\mathrm{d}_{\mathcal{S}},\overline{a})$; transition kernel $\mathbb{P}(s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime}|s,\mathrm{d}_{\mathcal{S}},\overline{a})$.

The defined mean-field MDP has an intimate connection with the previously discussed finite-agent MDP: here $(s,\mathrm{d}_{\mathcal{S}})$ represents the joint state viewed from an individual agent with state $s$; each agent makes a local decision using the shared local mapping $\overline{a}:{\mathcal{S}}\to\mathcal{A}$ and receives a local reward, whose cumulative effect is captured by the team reward $\mathrm{r}(s,\mathrm{d}_{\mathcal{S}},\overline{a})$; finally, the joint state transits to $(s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime})$ according to the kernel $\mathbb{P}(s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime}|s,\mathrm{d}_{\mathcal{S}},\overline{a})$. The mean-field MDP thus has a step-by-step correspondence with the finite-agent MDP with homogeneous agents, making it a natural generalization in the presence of a large number of agents.

Our goal is to efficiently learn a policy $\pi$ that maximizes the expected discounted reward. To facilitate the ensuing discussion, we define the value function as

V^{\pi}(s,\mathrm{d}_{\mathcal{S}})=(1-\gamma)\,\mathbb{E}\left\{\sum_{t=0}^{\infty}\gamma^{t}\mathrm{r}(s_{t},\mathrm{d}_{{\mathcal{S}},t},\overline{a}_{t})\right\},

where $(s_{0},\mathrm{d}_{{\mathcal{S}},0})=(s,\mathrm{d}_{\mathcal{S}})$ and $\overline{a}_{t}\sim\pi(s_{t},\mathrm{d}_{{\mathcal{S}},t})$ for all $t\geq 0$, and the Q-function as

Q^{\pi}(s,\mathrm{d}_{\mathcal{S}},\overline{a})=(1-\gamma)\,\mathbb{E}\left\{\sum_{t=0}^{\infty}\gamma^{t}\mathrm{r}(s_{t},\mathrm{d}_{{\mathcal{S}},t},\overline{a}_{t})\right\},

where $(s_{0},\mathrm{d}_{{\mathcal{S}},0})=(s,\mathrm{d}_{\mathcal{S}})$, $\overline{a}_{0}=\overline{a}$, and $\overline{a}_{t}\sim\pi(s_{t},\mathrm{d}_{{\mathcal{S}},t})$ for $t\geq 1$. The optimal policy is then denoted by

\pi^{*}\in\mathop{\mathrm{argmax}}_{\pi}V^{\pi}(s,\mathrm{d}_{\mathcal{S}}).

Despite the intuitive analogy to the finite-agent MDP, solving the mean-field MDP poses some unique challenges. In addition to the unknown transition kernel and reward, the mean-field MDP takes a distribution as part of its state, of which we only have partial information during training. In the following section, we propose our Mean-Field Proximal Policy Optimization (MF-PPO) algorithm, which, with a careful architecture design, solves such a mean-field MDP efficiently in a model-free fashion.

3 Mean-field Proximal Policy Optimization

Our algorithm falls into the actor-critic learning paradigm, consisting of alternating iterations of policy evaluation and policy improvement. The unique features of MF-PPO are that: (1) it exploits the permutation invariance of the system, drastically reducing the search space of the actor/critic networks and enabling much more efficient learning; (2) it can handle a varying number of agents. For simplicity of exposition, we consider a fixed number of agents here.

Throughout the rest of the section, we assume that for any joint state $(s,\mathrm{d}_{{\mathcal{S}}})\in{\mathcal{S}}\times\mathcal{P}({\mathcal{S}})$, the agent has access to $N$ i.i.d. samples $\{s_{i}\}_{i=1}^{N}$ from $\mathrm{d}_{{\mathcal{S}}}$. We denote the concatenation of such samples by $\mathbf{s}\in{\mathcal{S}}^{N}$ and write $\mathbf{s}\overset{\mathrm{i.i.d.}}{\sim}\mathrm{d}_{{\mathcal{S}}}$.

MF-PPO maintains a pair of actor (denoted by $F^{A}$) and critic (denoted by $F^{Q}$) networks, and uses the actor network to induce an energy-based policy $\pi(\overline{a}|s,\mathrm{d}_{{\mathcal{S}}})$. Specifically, given state $(s,\mathrm{d}_{{\mathcal{S}}})$, the actor network induces a distribution on the action set $\overline{\mathcal{A}}$ according to

\pi(\overline{a}|s,\mathrm{d}_{{\mathcal{S}}})\propto\exp\left\{\tau^{-1}F^{A}(s,\mathrm{d}_{{\mathcal{S}}},\overline{a})\right\},

where $\tau$ denotes the temperature parameter. We use $\pi\propto\exp\{F^{A}\}$ to denote the dependency of the policy on the energy function.
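A minimal sketch of this energy-based policy, assuming a finite action set and hypothetical placeholder energy values:

```python
import numpy as np

# pi(a|s, d_S) ∝ exp{ F^A(s, d_S, a) / tau } over a finite action set.
def energy_based_policy(energies, tau):
    logits = np.asarray(energies, dtype=float) / tau
    logits -= logits.max()                      # subtract max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

energies = [0.2, 1.5, -0.3]                     # hypothetical F^A values for three actions
pi = energy_based_policy(energies, tau=0.5)     # smaller tau => sharper (greedier) policy
action = np.random.default_rng(0).choice(len(pi), p=pi)
```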

\bullet Permutation-invariant Actor and Critic. We adopt a permutation invariant design for the actor and critic networks. Specifically, given an individual state $s\in{\mathcal{S}}$ and sampled states $\mathbf{s}\overset{\mathrm{i.i.d.}}{\sim}\mathrm{d}_{{\mathcal{S}}}$, the actor (resp. critic) network $F^{A}$ (resp. $F^{Q}$) satisfies

F^{A}(s,\mathbf{s},\overline{a})=F^{A}(s,\kappa(\mathbf{s}),\overline{a})

for any permutation mapping $\kappa$. With permutation invariance, the search space of the actor/critic network depends only polynomially on the number of agents $N$.

Proposition 3.1.

The search space of a permutation invariant actor (critic) network is of the order

\left(\sum_{k=1}^{\min\{|{\mathcal{S}}|,N\}}\binom{N-1}{k-1}\binom{|{\mathcal{S}}|}{k}\right)|{\mathcal{S}}|\,|\overline{\mathcal{A}}|.

Additionally, if $|{\mathcal{S}}|<N$, then the search space depends on $N$ at the order of $N^{|{\mathcal{S}}|}$.

Compared to architectures without permutation invariance, whose search space depends on $N$ at the order of $(|{\mathcal{S}}|\,|\mathcal{A}|)^{N}$, the search space of MF-PPO can be exponentially smaller.
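A quick numerical illustration of this gap (our own toy sizes, not from the paper):

```python
from math import comb

# Counts from Proposition 3.1: permutation-invariant search space
# versus the (|S||A|)^N count without invariance.
def invariant_search_space(S, A, N):
    return sum(comb(N - 1, k - 1) * comb(S, k) for k in range(1, min(S, N) + 1)) * S * A

S, A, N = 5, 3, 20
print(invariant_search_space(S, A, N))   # grows like N^|S| when |S| < N
print((S * A) ** N)                      # grows exponentially in N
```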

Motivated by the characterization of permutation invariant set functions in Zaheer et al. (2017), the actor/critic networks in MF-PPO take the DeepSet form, i.e.,

F^{A}(s,\mathbf{s},\overline{a})=h\left(\frac{1}{N}\sum_{s^{\prime}\in\mathbf{s}}\phi(s,s^{\prime},\overline{a})\right).

Both networks first aggregate local information by averaging the outputs of a sub-network shared among agents, before feeding the aggregated information into a subsequent network. See Figure 2 for an illustration. Thanks to the average pooling layer and the preceding parameter sharing scheme, the network output is unchanged when the ordering of the agents is permuted. Compared to a Graph Convolutional Network (Kipf and Welling, 2017), the averaging operations are well suited for homogeneous agents and more parameter-efficient.
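A minimal sketch of such a DeepSet-style critic (layer sizes and dimensions are our own illustrative choices, not the paper's configuration):

```python
import torch
import torch.nn as nn

# F^Q(s, s_1..s_N, a) = h( (1/N) * sum_i phi(s, s_i, a) ): shared sub-network phi,
# average pooling over agents, then a subsequent network h.
class DeepSetCritic(nn.Module):
    def __init__(self, s_dim, a_dim, hidden=64):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(2 * s_dim + a_dim, hidden), nn.ReLU())
        self.h = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, s, others, a):
        # s: (s_dim,) local state; others: (N, s_dim) sampled states; a: (a_dim,) action.
        N = others.shape[0]
        x = torch.cat([s.expand(N, -1), others, a.expand(N, -1)], dim=-1)
        pooled = self.phi(x).mean(dim=0)       # average pooling => permutation invariance
        return self.h(pooled)

critic = DeepSetCritic(s_dim=4, a_dim=2)
s, others, a = torch.randn(4), torch.randn(10, 4), torch.randn(2)
perm = torch.randperm(10)
assert torch.allclose(critic(s, others, a), critic(s, others[perm], a))
```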

Naturally, given the joint state-action pair $(s,\mathrm{d}_{{\mathcal{S}}},\overline{a})$, the actor network is given by

F^{A}(s,\mathrm{d}_{{\mathcal{S}}},\overline{a})=h\left(\mathbb{E}_{s^{\prime}\sim\mathrm{d}_{{\mathcal{S}}}}\phi(s,s^{\prime},\overline{a})\right).

We assume $F^{A}$ is parameterized by a neural network with parameters $\alpha\in\mathbb{R}^{D}$, which are learned during training. We denote the function class of all possible actor networks by $\mathcal{F}^{A}$. The same architecture design applies to the critic network, with learnable parameters denoted by $\theta\in\mathbb{R}^{D}$ and the function class denoted by $\mathcal{F}^{Q}$. MF-PPO then consists of successive iterations of the policy evaluation and policy improvement steps described in detail below.

Figure 2: Illustration of a DeepSet parameterized critic network.

\bullet Policy Evaluation. At the $k$-th iteration of MF-PPO, we first update the critic network $F^{Q}$ by minimizing the mean squared Bellman error while holding the actor network $F^{A,k}$ fixed. We denote the policy induced by the actor network by $\pi_{k}$, the stationary state distribution of policy $\pi_{k}$ by $\nu_{k}$, and the stationary state-action distribution by $\sigma_{k}=\nu_{k}\pi_{k}$. We then follow the update

\theta_{k}=\mathop{\mathrm{argmin}}_{\theta\in\mathbb{B}(\theta_{0},R_{\theta})}\mathbb{E}_{\sigma_{k}}\Big\{F^{Q}_{\theta}(s,\mathrm{d}_{{\mathcal{S}}},\overline{a})-\left[{\mathcal{T}}^{\pi_{k}}F^{Q}_{\theta}\right](s,\mathrm{d}_{{\mathcal{S}}},\overline{a})\Big\}^{2}, \quad (2)

where $\mathbb{B}(\theta_{0},R_{\theta})$ denotes the Euclidean ball with radius $R_{\theta}$ centered at the initial parameter $\theta_{0}$, and the Bellman evaluation operator ${\mathcal{T}}^{\pi}$ is given by

{\mathcal{T}}^{\pi}F^{Q}_{\theta}(s,\mathrm{d}_{{\mathcal{S}}},\overline{a})=\mathbb{E}\left\{(1-\gamma)\mathrm{r}(s,\mathrm{d}_{{\mathcal{S}}},\overline{a})+\gamma F^{Q}_{\theta}(s^{\prime},\mathrm{d}^{\prime}_{{\mathcal{S}}},\overline{a}^{\prime})\right\},

where $\mathrm{d}^{\prime}_{{\mathcal{S}}}\sim\mathbb{P}(\cdot|\mathrm{d}_{{\mathcal{S}}},\overline{a})$ and $\overline{a}^{\prime}\sim\pi(\cdot|s^{\prime},\mathrm{d}^{\prime}_{{\mathcal{S}}})$. We solve (2) by a $T$-step temporal-difference (TD) update and output $\theta_{k}=\theta(T)$. At the $t$-th iteration of the TD update,

\theta(t+1/2)=\theta(t)-\eta\Big\{F^{Q}_{\theta(t)}(s,\mathbf{s},\overline{a})-(1-\gamma)\mathrm{r}(s,\mathrm{d}_{{\mathcal{S}}},\overline{a})-\gamma F^{Q}_{\theta(t)}(s^{\prime},\mathbf{s}^{\prime},\overline{a}^{\prime})\Big\}\nabla_{\theta}F^{Q}_{\theta(t)}(s,\mathbf{s},\overline{a}),
\theta(t+1)=\Pi_{\mathbb{B}(\theta_{0},R_{\theta})}\left(\theta(t+1/2)\right),

where we sample

(s,\mathrm{d}_{{\mathcal{S}}},\overline{a})\sim\sigma_{k},\quad(s^{\prime},\mathrm{d}^{\prime}_{{\mathcal{S}}})\sim\mathbb{P}(\cdot|s,\mathrm{d}_{{\mathcal{S}}},\overline{a}),\quad\overline{a}^{\prime}\sim\pi_{k}(\cdot|s^{\prime},\mathrm{d}^{\prime}_{{\mathcal{S}}}),\quad\mathbf{s}\overset{\mathrm{i.i.d.}}{\sim}\mathrm{d}_{{\mathcal{S}}},\quad\mathbf{s}^{\prime}\overset{\mathrm{i.i.d.}}{\sim}\mathrm{d}^{\prime}_{{\mathcal{S}}}.

We use $\Pi_{\mathrm{X}}(\cdot)$ to denote the orthogonal projection onto the set $\mathrm{X}$, and $\eta$ to denote the step size. The details are summarized in Algorithm 2 in Appendix A.
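A minimal sketch of one projected TD step of this form, where F_Q and grad_F_Q are hypothetical callables standing in for the critic network and its parameter gradient:

```python
import numpy as np

# One projected semi-gradient TD step onto the ball B(theta0, R).
def projected_td_step(theta, theta0, R, eta, gamma, r, x, x_next, F_Q, grad_F_Q):
    delta = F_Q(theta, x) - (1.0 - gamma) * r - gamma * F_Q(theta, x_next)  # TD error
    theta_half = theta - eta * delta * grad_F_Q(theta, x)                   # gradient step
    diff = theta_half - theta0
    norm = np.linalg.norm(diff)
    if norm > R:                                   # orthogonal projection onto B(theta0, R)
        theta_half = theta0 + R * diff / norm
    return theta_half
```

For instance, with a linear critic $F_Q(\theta,x)=\theta^{\top}x$ one would pass `grad_F_Q = lambda theta, x: x`, recovering the classical projected linear TD(0) update.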

\bullet Policy Improvement. Following the policy evaluation step, MF-PPO updates its policy by updating the actor network $F^{A}$, which is the energy function associated with the policy. We update the policy network by

\pi_{k+1}=\mathop{\mathrm{argmax}}_{\pi\propto\exp\{F^{A,k+1}\},\,F^{A,k+1}\in\mathcal{F}^{A}}\mathbb{E}_{\nu_{k}}\Big\{\left\langle F^{Q}_{\theta_{k}}(s,\mathrm{d}_{{\mathcal{S}}},\cdot),\pi(\cdot|s,\mathrm{d}_{{\mathcal{S}}})\right\rangle-\upsilon_{k}\mathrm{KL}\left(\pi(\cdot|s,\mathrm{d}_{{\mathcal{S}}})\,\big\|\,\pi_{k}(\cdot|s,\mathrm{d}_{{\mathcal{S}}})\right)\Big\}.

The update rule intuitively reads as increasing the probability of choosing action $\overline{a}$ if it yields a higher value of the critic network $F^{Q}(s,\mathrm{d}_{{\mathcal{S}}},\overline{a})$, which can be viewed as a softened version of policy iteration (Bertsekas, 2011). Additionally, the proximity parameter $\upsilon_{k}$ controls the softness of the update, with $\upsilon_{k}\to 0$ recovering vanilla policy iteration. Moreover, without the constraint $F^{A,k+1}\in\mathcal{F}^{A}$, the update has a closed-form expression, and $\pi_{k+1}$ itself is another energy-based policy.

Proposition 3.2.

Let $\pi_{k}\propto\exp\left(\tau_{k}^{-1}F^{A}_{\alpha_{k}}\right)$ denote the energy-based policy. Then the update

\overline{\pi}_{k+1}=\mathop{\mathrm{argmax}}_{\pi}\mathbb{E}_{\nu_{k}}\Big\{\left\langle F^{Q}_{\theta_{k}}(s,\mathrm{d}_{{\mathcal{S}}},\cdot),\pi(\cdot|s,\mathrm{d}_{{\mathcal{S}}})\right\rangle-\upsilon_{k}\mathrm{KL}\left(\pi(\cdot|s,\mathrm{d}_{{\mathcal{S}}})\,\big\|\,\pi_{k}(\cdot|s,\mathrm{d}_{{\mathcal{S}}})\right)\Big\}

yields $\overline{\pi}_{k+1}\propto\exp\{\upsilon_{k}^{-1}F^{Q}_{\theta_{k}}+\tau_{k}^{-1}F^{A}_{\alpha_{k}}\}$.
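For intuition, here is a sketch of the standard argument behind Proposition 3.2. Since the objective decouples across joint states, it suffices to maximize pointwise; the first-order condition with a multiplier $\lambda$ for the normalization constraint reads

F^{Q}_{\theta_{k}}(s,\mathrm{d}_{\mathcal{S}},\overline{a})-\upsilon_{k}\left(\log\frac{\pi(\overline{a}|s,\mathrm{d}_{\mathcal{S}})}{\pi_{k}(\overline{a}|s,\mathrm{d}_{\mathcal{S}})}+1\right)+\lambda=0,

so that $\overline{\pi}_{k+1}(\overline{a}|s,\mathrm{d}_{\mathcal{S}})\propto\pi_{k}(\overline{a}|s,\mathrm{d}_{\mathcal{S}})\exp\{\upsilon_{k}^{-1}F^{Q}_{\theta_{k}}(s,\mathrm{d}_{\mathcal{S}},\overline{a})\}\propto\exp\{\upsilon_{k}^{-1}F^{Q}_{\theta_{k}}(s,\mathrm{d}_{\mathcal{S}},\overline{a})+\tau_{k}^{-1}F^{A}_{\alpha_{k}}(s,\mathrm{d}_{\mathcal{S}},\overline{a})\}$, using $\pi_{k}\propto\exp\{\tau_{k}^{-1}F^{A}_{\alpha_{k}}\}$.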

To account for the fact that the representable energy functions of the actor network reside in $\mathcal{F}^{A}$, we update the policy by projecting the energy function of $\overline{\pi}_{k+1}$ back onto $\mathcal{F}^{A}$. Specifically, writing

\pi_{k+1}\propto\exp\{\tau_{k+1}^{-1}F^{A}_{\alpha_{k+1}}\},

we recover the next actor network $F^{A}_{\alpha_{k+1}}$ (i.e., the energy) by solving the following regression task:

\alpha^{*}_{k+1}=\mathop{\mathrm{argmin}}_{\alpha\in\mathbb{B}(\alpha_{0},R_{\alpha})}\mathbb{E}_{\widetilde{\sigma}_{k}}\Big\{F^{A}_{\alpha}(s,\mathrm{d}_{{\mathcal{S}}},\overline{a})-\tau_{k+1}\left[\upsilon_{k}^{-1}F^{Q}_{\theta_{k}}(s,\mathrm{d}_{{\mathcal{S}}},\overline{a})+\tau_{k}^{-1}F^{A}_{\alpha_{k}}(s,\mathrm{d}_{{\mathcal{S}}},\overline{a})\right]\Big\}^{2}, \quad (3)

where $\widetilde{\sigma}_{k}=\nu_{k}\pi_{0}$. We approximately solve (3) via $T$-step stochastic gradient descent (SGD), and output

\alpha_{k+1}=\alpha(T)\approx\alpha_{k+1}^{*}.

At the $t$-th iteration of SGD,

\alpha(t+1/2)=\alpha(t)-\eta\,\nabla_{\alpha}F^{A}_{\alpha(t)}(s,\mathbf{s},\overline{a})\Big\{F^{A}_{\alpha(t)}(s,\mathbf{s},\overline{a})-\tau_{k+1}\left(\upsilon_{k}^{-1}F^{Q}_{\theta_{k}}(s,\mathbf{s},\overline{a})+\tau_{k}^{-1}F^{A}_{\alpha_{k}}(s,\mathbf{s},\overline{a})\right)\Big\},
\alpha(t+1)=\Pi_{\mathbb{B}(\alpha_{0},R_{\alpha})}\left(\alpha(t+1/2)\right),

where we sample $(s,\mathrm{d}_{{\mathcal{S}}},\overline{a})\sim\widetilde{\sigma}_{k}$ and $\mathbf{s}\overset{\mathrm{i.i.d.}}{\sim}\mathrm{d}_{{\mathcal{S}}}$, and $\eta$ is the step size. The details are summarized in Algorithm 3 of Appendix A. Finally, we present the complete MF-PPO procedure in Algorithm 1.

Algorithm 1 Mean-Field Proximal Policy Optimization
  Require: Mean-field MDP $(\overline{{\mathcal{S}}},\overline{\mathcal{A}},\mathbb{P},\mathrm{r})$, discount factor $\gamma$; number of outer iterations $K$, number of inner updates $T$; policy update parameter $\upsilon$, step size $\eta$, projection radii $R_{\alpha},R_{\theta}$.
  Initialize: $\tau_{0}\leftarrow 1$, $F^{A,0}\leftarrow 0$, $\pi_{0}\propto\exp\{\tau_{0}^{-1}F^{A,0}\}$ (uniform policy).
  for $k=0,\ldots,K-1$ do
     Set temperature parameter $\tau_{k+1}\leftarrow\upsilon\sqrt{K}/(k+1)$ and proximity parameter $\upsilon_{k}\leftarrow\upsilon\sqrt{K}$
     Solve (2) to update the critic network $F^{Q}_{\theta_{k}}$, using the TD update (Algorithm 2)
     Solve (3) to update the actor network $F^{A}_{\alpha_{k+1}}$, using the SGD update (Algorithm 3)
     Update policy: $\pi_{k+1}\propto\exp\{\tau_{k+1}^{-1}F^{A}_{\alpha_{k+1}}\}$
  end for
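The outer loop of Algorithm 1 can be sketched as follows (our own pseudocode-style illustration; td_policy_evaluation and sgd_policy_improvement are hypothetical stand-ins for Algorithms 2 and 3).

```python
import math

# Outer loop of Algorithm 1 with the temperature/proximity schedules used in the analysis.
def mf_ppo(K, T, upsilon, td_policy_evaluation, sgd_policy_improvement):
    tau = 1.0
    actor_energy = 0.0                      # F^{A,0} = 0, i.e., pi_0 is uniform
    upsilon_k = upsilon * math.sqrt(K)      # proximity parameter, constant over k
    for k in range(K):
        tau_next = upsilon * math.sqrt(K) / (k + 1)              # temperature schedule
        critic = td_policy_evaluation(actor_energy, tau, T)      # solve (2), Algorithm 2
        actor_energy = sgd_policy_improvement(                   # solve (3), Algorithm 3
            critic, actor_energy, tau, tau_next, upsilon_k, T)
        tau = tau_next                      # pi_{k+1} ∝ exp{F^{A,k+1} / tau_{k+1}}
    return actor_energy, tau
```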

4 Global Convergence of MF-PPO

We present the global convergence of the MF-PPO algorithm for a two-layer permutation-invariant parameterization of the actor and critic networks. We remark that our analysis can be extended to multi-layer permutation-invariant networks; we present the two-layer case for simplicity of exposition. Specifically, the actor and critic networks take the form

F^{A}_{\alpha}(s,\mathbf{s},\overline{a})=\frac{1}{\sqrt{m}N}\sum_{j=1}^{m_{A}}\sum_{s^{\prime}\in\mathbf{s}}u_{j}\,\sigma\left(\alpha_{j}^{\top}(s,s^{\prime},\overline{a})\right),
F^{Q}_{\theta}(s,\mathbf{s},\overline{a})=\frac{1}{\sqrt{m}N}\sum_{j=1}^{m_{Q}}\sum_{s^{\prime}\in\mathbf{s}}v_{j}\,\sigma\left(\theta_{j}^{\top}(s,s^{\prime},\overline{a})\right),

where $m_{A}$ (resp. $m_{Q}$) is the width of the actor (resp. critic) network, and $\sigma(x)=\max\{x,0\}$ denotes the ReLU activation. We randomly initialize the output weights $u_{j}$ (resp. $v_{j}$) and the first-layer weights $\alpha_{0}=[\alpha_{0,1}^{\top},\ldots,\alpha_{0,m_{A}}^{\top}]^{\top}\in\mathbb{R}^{d\cdot m_{A}}$ (resp. $\theta_{0}\in\mathbb{R}^{d\cdot m_{Q}}$) by

u_{j}\sim\mathrm{Unif}\{-1,+1\},\quad\alpha_{0,j}\sim\mathcal{N}(0,\mathrm{I}_{d}/d),\quad\forall j\in[m].

For ease of analysis, we take $m=m_{A}=m_{Q}$ and share the initialization of $\alpha_{0}$ and $\theta_{0}$ (resp. $u_{0}$ and $v_{0}$). Additionally, we keep the $u_{j}$'s fixed during training, and keep $\alpha$ (resp. $\theta$) within the ball $\mathbb{B}(\alpha_{0},R_{A})$ (resp. $\mathbb{B}(\theta_{0},R_{Q})$) throughout training. We define the following function class, which characterizes the previously defined actor/critic networks for large network width.

Definition 4.1.

Given $R_{\beta}>0$, define the function class over the domain ${\mathcal{S}}\times{\mathcal{S}}\times\mathcal{A}$ by

\mathcal{F}_{\beta,m}=\bigg\{f_{\beta}(\cdot)~\Big|~f_{\beta}(s,s^{\prime},\overline{a})=\frac{1}{\sqrt{m}}\sum_{j=1}^{m}v_{j}\mathbbm{1}\{\beta_{0,j}^{\top}(s,s^{\prime},\overline{a})>0\}\,\beta_{j}^{\top}(s,s^{\prime},\overline{a}),~\|\beta-\beta_{0}\|_{2}\leq R_{\beta}\bigg\},

where $v_{j},\beta_{0,j}$ are random weights sampled according to

v_{j}\sim\mathrm{Unif}\{-1,+1\},\quad\beta_{0,j}\sim\mathcal{N}(0,\mathrm{I}_{d}/d),\quad\forall j\in[m].

$\mathcal{F}_{\beta,m}$ also induces the function class over ${\mathcal{S}}\times\mathcal{P}({\mathcal{S}})\times\mathcal{A}$ given by

\mathcal{F}^{\mathcal{P}}_{\beta,m}=\left\{F(\cdot)~\Big|~F(s,\mathrm{d}_{\mathcal{S}},\overline{a})=\mathbb{E}_{s^{\prime}\sim\mathrm{d}_{\mathcal{S}}}f_{\beta}(s,s^{\prime},\overline{a}),~f_{\beta}\in\mathcal{F}_{\beta,m}\right\}.

It is well known that, for large network width $m$, functions within $\mathcal{F}_{\beta,m}$ approximate functions in the reproducing kernel Hilbert space associated with the kernel $\mathrm{K}(x,y)=\mathbb{E}_{z\sim\mathcal{N}(0,\mathrm{I}_{d}/d)}\left\{\mathbbm{1}(z^{\top}x>0,z^{\top}y>0)\right\}$ (Jacot et al., 2018; Chizat and Bach, 2018; Cai et al., 2019; Arora et al., 2019), whose RKHS norm is bounded by $R_{\beta}$. For large $R_{\beta}$ and $m$, $\mathcal{F}_{\beta,m}$ represents a rich class of functions. Additionally, functions within $\mathcal{F}^{\mathcal{P}}_{\beta,m}$ can be viewed as mean embeddings of the joint state-action pair into the RKHS (Muandet et al., 2016; Song et al., 2009; Smola et al., 2007). Below, we make one technical assumption stating that $\mathcal{F}^{\mathcal{P}}_{\theta,m_{Q}}$ is rich enough to represent the Q-functions of all policies within our policy class.

Assumption 1.

For any policy $\pi$ induced by $F^{A}\in\mathcal{F}^{A}$, we have $Q^{\pi}\in\mathcal{F}^{\mathcal{P}}_{\theta,m_{Q}}$.

We impose two mild conditions stating (1) boundedness of the reward, and (2) regularity of the joint state and the policy.

Assumption 2.

The reward function satisfies $\mathrm{r}(\cdot)\leq\overline{r}$ for some $\overline{r}>0$. Additionally, for any $(s,\mathrm{d}_{\mathcal{S}})\in\overline{{\mathcal{S}}}$ and any $\pi$, there exists $c>0$ such that

\mathbb{E}_{s^{\prime}\sim\mathrm{d}_{\mathcal{S}},\,\overline{a}\sim\pi(\cdot|s,\mathrm{d}_{\mathcal{S}})}\left\{\mathbbm{1}\left(\lvert z^{\top}(s,s^{\prime},\overline{a})\rvert\leq t\right)\right\}\leq c\cdot\frac{t}{\|z\|_{2}}

for any $z\in\mathbb{R}^{d}$ and $t>0$.

We remark that the condition in Assumption 2 holds whenever ${\mathcal{S}}$ is bounded and $\mathrm{d}_{\mathcal{S}}\times\pi$ has a bounded density.

We measure the progress of MF-PPO in Algorithm 1 using the expected total reward

\mathcal{L}(\pi)=\mathbb{E}_{\nu^{*}}\left[V^{\pi}(s,\mathrm{d}_{\mathcal{S}})\right]=\mathbb{E}_{\nu^{*}}\left\{\left\langle Q^{\pi}(s,\mathrm{d}_{\mathcal{S}},\cdot),\pi(\cdot|s,\mathrm{d}_{\mathcal{S}})\right\rangle\right\}, \quad (4)

where $\nu^{*}$ is the stationary state distribution of the optimal policy $\pi^{*}$. We also denote by $\sigma^{*}$ the stationary state-action distribution induced by $\pi^{*}$. Note that we have

\mathcal{L}(\pi^{*})=\mathbb{E}_{\nu^{*}}\left[V^{\pi^{*}}(s,\mathrm{d}_{\mathcal{S}})\right]\geq\mathbb{E}_{\nu^{*}}\left[V^{\pi}(s,\mathrm{d}_{\mathcal{S}})\right]=\mathcal{L}(\pi)

for any policy $\pi$. Our main result is presented in the following theorem, showing that $\mathcal{L}(\pi_{k})$ converges to $\mathcal{L}(\pi^{*})$ at a sublinear rate.

Theorem 4.1 (Global Convergence of MF-PPO).

Under Assumptions 1 and 2, the policies $\{\pi_{k}\}_{k=1}^{K}$ generated by Algorithm 1 satisfy

\min_{0\leq k\leq K}\{\mathcal{L}(\pi^{*})-\mathcal{L}(\pi_{k})\}\leq\frac{\upsilon\left(\log|\overline{\mathcal{A}}|+\sum_{k=1}^{K-1}(\varepsilon_{k}+\varepsilon_{k}^{\prime})\right)}{(1-\gamma)\sqrt{K}}+\frac{M}{(1-\gamma)\upsilon\sqrt{K}},

where $M=\mathbb{E}_{\nu^{*}}\left\{\max_{\overline{a}\in\overline{\mathcal{A}}}\big(F^{Q}_{\theta_{0}}(s,\mathrm{d}_{\mathcal{S}},\overline{a})\big)^{2}\right\}+2R_{A}^{2}$, and $\varepsilon_{k}$, $\varepsilon_{k}^{\prime}$ are defined in Lemma 4.3. In particular, suppose at each iteration of MF-PPO we observe

N=\Omega\left(K^{3}R_{A}^{4}+KR_{Q}^{4}\right)

agents, and the actor/critic network widths satisfy

m_{A}=\Omega\left(K^{6}R_{A}^{10}+K^{4}R_{A}^{10}\lvert\overline{\mathcal{A}}\rvert^{2}\right),\qquad m_{Q}=\Omega\left(K^{2}R_{Q}^{10}\right),

and

T=\Omega\left(K^{3}R_{A}^{4}+KR_{Q}^{4}\right),

then we have

\min_{0\leq k\leq K}\{\mathcal{L}(\pi^{*})-\mathcal{L}(\pi_{k})\}\leq\frac{\upsilon^{2}\left(\log|\overline{\mathcal{A}}|+\mathcal{O}(1)\right)+M}{(1-\gamma)\upsilon\sqrt{K}}.

Theorem 4.1 states that, given sufficiently many agents and large enough actor/critic networks, MF-PPO attains global optimality at a sublinear rate. Our result shows that, when solving the mean-field MDP, having more agents serves as a blessing rather than a curse. In addition, as demonstrated in our proof sketch, there exists an inherent phase transition depending on the number of agents: the final optimality gap is dominated by the statistical error for a small number of agents (first phase), and by the optimization error for a large number of agents (second phase).

The complete proof of Theorem 4.1 requires a careful analysis of the errors from the policy evaluation step (2) and the policy improvement step (3). The analysis of the outer iterations of MF-PPO can be viewed as approximate mirror descent, which needs to account for how the evaluation and improvement errors interact. Intuitively, the tuple $(\varepsilon_{k},\varepsilon^{\prime}_{k})$ describes the effect of the policy update when using approximate policy evaluation and policy improvement, and will be further clarified in the ensuing discussion. We present here the skeleton of our proof and defer the technical details to the appendix.

Proof Sketch. We first establish the global convergence of the policy evaluation step (2) and the policy improvement step (3).

Lemma 4.1 (Policy Evaluation).

Under the same assumptions as in Theorem 4.1, let $Q^{\pi_{k}}$ denote the Q-function of policy $\pi_{k}$, and let

\epsilon_{k}=\mathbb{E}_{\mathrm{init},\sigma_{k}}\left\{F_{\overline{\theta}(T)}^{Q}(\cdot)-Q^{\pi_{k}}(\cdot)\right\}^{2}

denote the policy evaluation error, where $\overline{\theta}(T)$ is the output of Algorithm 2. Then we have

\epsilon_{k}=\mathcal{O}\left(\frac{R_{Q}^{2}}{T^{1/2}}+\frac{R_{Q}^{5/2}}{m_{Q}^{1/4}}+\frac{R_{Q}^{2}}{N^{1/2}}+\frac{R_{Q}^{3}}{m_{Q}^{1/2}}\right).
Lemma 4.2 (Policy Improvement).

Under the same assumptions as in Theorem 4.1, let

\epsilon_{k+1}^{\prime}=\mathbb{E}_{\mathrm{init},\widetilde{\sigma}_{k}}\left\{F^{A}_{\overline{\alpha}(T)}(\cdot)-\tau_{k+1}\left(\upsilon_{k}^{-1}F^{Q}_{\theta_{k}}(\cdot)+\tau_{k}^{-1}F^{A}_{\alpha_{k}}(\cdot)\right)\right\}^{2}

denote the policy optimization error, where $\overline{\alpha}(T)$ is the output of Algorithm 3. Then we have

\epsilon_{k+1}^{\prime}=\mathcal{O}\left(\frac{R_{A}^{2}}{T^{1/2}}+\frac{R_{A}^{5/2}}{m_{A}^{1/4}}+\frac{R_{A}^{2}}{N^{1/2}}+\frac{R_{A}^{3}}{m_{A}^{1/2}}\right).

Lemmas 4.1 and 4.2 show that, despite non-convexity, both the policy evaluation and the policy improvement steps converge approximately to the globally optimal solution. In particular, given networks with large width, for a small number of iterations $T$ the optimization error $\mathcal{O}(T^{-1/2})$ dominates the optimality gap, while for a large number of iterations $T$ the statistical error $\mathcal{O}(N^{-1/2})$ dominates.

With Lemmas 4.1 and 4.2 in hand, we illustrate the main argument of the proof of Theorem 4.1. Let us first consider the ideal case $\epsilon_{k}=\epsilon^{\prime}_{k+1}=0$. For $\epsilon_{k}=0$, we obtain the exact Q-function of policy $\pi_{k}$; for $\epsilon^{\prime}_{k+1}=0$, we obtain the ideal energy-based updated policy defined in Proposition 3.2. That is,

\pi_{k+1}=\mathop{\mathrm{argmax}}_{\pi}\mathbb{E}_{\nu_{k}}\Big\{\left\langle Q^{\pi_{k}}(s,\mathrm{d}_{\mathcal{S}},\cdot),\pi(\cdot|s,\mathrm{d}_{\mathcal{S}})\right\rangle-\upsilon_{k}\mathrm{KL}\left(\pi(\cdot|s,\mathrm{d}_{\mathcal{S}})\|\pi_{k}(\cdot|s,\mathrm{d}_{\mathcal{S}})\right)\Big\}. \quad (5)

Without function approximation, problem (5) can be solved by treating each joint state $(s,\mathrm{d}_{\mathcal{S}})$ independently; hence one can apply the well-known three-point lemma from mirror descent (Chen and Teboulle, 1993) and obtain, for all $(s,\mathrm{d}_{\mathcal{S}})\in{\mathcal{S}}\times\mathcal{P}({\mathcal{S}})$:

\left\langle Q^{\pi_{k}}(s,\mathrm{d}_{\mathcal{S}},\cdot),\pi^{*}(\cdot|s,\mathrm{d}_{\mathcal{S}})-\pi_{k}(\cdot|s,\mathrm{d}_{\mathcal{S}})\right\rangle
\leq\upsilon_{k}\Big\{\mathrm{KL}\left(\pi^{*}(\cdot|s,\mathrm{d}_{\mathcal{S}})\|\pi_{k}(\cdot|s,\mathrm{d}_{\mathcal{S}})\right)-\mathrm{KL}\left(\pi^{*}(\cdot|s,\mathrm{d}_{\mathcal{S}})\|\pi_{k+1}(\cdot|s,\mathrm{d}_{\mathcal{S}})\right)-\mathrm{KL}\left(\pi_{k+1}(\cdot|s,\mathrm{d}_{\mathcal{S}})\|\pi_{k}(\cdot|s,\mathrm{d}_{\mathcal{S}})\right)\Big\}
\quad+\left\langle Q^{\pi_{k}}(s,\mathrm{d}_{\mathcal{S}},\cdot),\pi_{k+1}(\cdot|s,\mathrm{d}_{\mathcal{S}})-\pi_{k}(\cdot|s,\mathrm{d}_{\mathcal{S}})\right\rangle.

By Lemma 6.1 in Kakade and Langford (2002), taking the expectation of the left-hand side over $\nu^{*}$ yields exactly $(1-\gamma)\left\{\mathcal{L}(\pi^{*})-\mathcal{L}(\pi_{k})\right\}$. Hence we have

(1-\gamma)\left\{\mathcal{L}(\pi^{*})-\mathcal{L}(\pi_{k})\right\}
\leq\upsilon_{k}\mathbb{E}_{\nu^{*}}\Big\{\mathrm{KL}\left(\pi^{*}(\cdot|s,\mathrm{d}_{\mathcal{S}})\|\pi_{k}(\cdot|s,\mathrm{d}_{\mathcal{S}})\right)-\mathrm{KL}\left(\pi^{*}(\cdot|s,\mathrm{d}_{\mathcal{S}})\|\pi_{k+1}(\cdot|s,\mathrm{d}_{\mathcal{S}})\right)-\mathrm{KL}\left(\pi_{k+1}(\cdot|s,\mathrm{d}_{\mathcal{S}})\|\pi_{k}(\cdot|s,\mathrm{d}_{\mathcal{S}})\right)\Big\}
\quad+\mathbb{E}_{\nu^{*}}\left\langle Q^{\pi_{k}}(s,\mathrm{d}_{\mathcal{S}},\cdot),\pi_{k+1}(\cdot|s,\mathrm{d}_{\mathcal{S}})-\pi_{k}(\cdot|s,\mathrm{d}_{\mathcal{S}})\right\rangle.

Pinsker's inequality,
\mathrm{KL}\left(\pi_{k+1}(\cdot|s,\mathrm{d}_{\mathcal{S}})\|\pi_{k}(\cdot|s,\mathrm{d}_{\mathcal{S}})\right)\geq\frac{1}{2}\|\pi_{k+1}(\cdot|s,\mathrm{d}_{\mathcal{S}})-\pi_{k}(\cdot|s,\mathrm{d}_{\mathcal{S}})\|_{1}^{2},
combined with the observation $\|Q^{\pi_{k}}(s,\mathrm{d}_{\mathcal{S}},\cdot)\|_{\infty}\leq\overline{r}$ and the basic inequality $-ax^{2}+bx\leq b^{2}/(4a)$ for $a>0$, gives us

(1-\gamma)\left\{\mathcal{L}(\pi^{*})-\mathcal{L}(\pi_{k})\right\}\leq\upsilon_{k}\mathbb{E}_{\nu^{*}}\Big\{\mathrm{KL}\left(\pi^{*}(\cdot|s,\mathrm{d}_{\mathcal{S}})\|\pi_{k}(\cdot|s,\mathrm{d}_{\mathcal{S}})\right)-\mathrm{KL}\left(\pi^{*}(\cdot|s,\mathrm{d}_{\mathcal{S}})\|\pi_{k+1}(\cdot|s,\mathrm{d}_{\mathcal{S}})\right)\Big\}+\overline{r}^{2}/(2\upsilon_{k}). \quad (6)
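In more detail, the term $\overline{r}^{2}/(2\upsilon_{k})$ in (6) comes from bounding, for each joint state, the last two terms of the display preceding Pinsker's inequality: writing $x=\|\pi_{k+1}(\cdot|s,\mathrm{d}_{\mathcal{S}})-\pi_{k}(\cdot|s,\mathrm{d}_{\mathcal{S}})\|_{1}$,

\left\langle Q^{\pi_{k}}(s,\mathrm{d}_{\mathcal{S}},\cdot),\pi_{k+1}(\cdot|s,\mathrm{d}_{\mathcal{S}})-\pi_{k}(\cdot|s,\mathrm{d}_{\mathcal{S}})\right\rangle-\upsilon_{k}\mathrm{KL}\left(\pi_{k+1}(\cdot|s,\mathrm{d}_{\mathcal{S}})\|\pi_{k}(\cdot|s,\mathrm{d}_{\mathcal{S}})\right)\leq\overline{r}\,x-\frac{\upsilon_{k}}{2}x^{2}\leq\frac{\overline{r}^{2}}{2\upsilon_{k}}.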

By setting $\upsilon_{k}=\mathcal{O}(\sqrt{K})$ and telescoping the above inequality from $k=0$ to $K-1$, we obtain

\min_{0\leq k\leq K-1}\{\mathcal{L}(\pi^{*})-\mathcal{L}(\pi_{k})\}=\mathcal{O}(1/\sqrt{K}).
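Filling in the telescoping step: with $\upsilon_{k}=\upsilon\sqrt{K}$ and $\pi_{0}$ uniform, so that $\mathbb{E}_{\nu^{*}}\mathrm{KL}(\pi^{*}(\cdot|s,\mathrm{d}_{\mathcal{S}})\|\pi_{0}(\cdot|s,\mathrm{d}_{\mathcal{S}}))\leq\log|\overline{\mathcal{A}}|$, summing (6) over $k=0,\ldots,K-1$ telescopes the KL terms and gives

(1-\gamma)\sum_{k=0}^{K-1}\left\{\mathcal{L}(\pi^{*})-\mathcal{L}(\pi_{k})\right\}\leq\upsilon\sqrt{K}\log|\overline{\mathcal{A}}|+\frac{K\,\overline{r}^{2}}{2\upsilon\sqrt{K}}=\left(\upsilon\log|\overline{\mathcal{A}}|+\frac{\overline{r}^{2}}{2\upsilon}\right)\sqrt{K},

and dividing by $(1-\gamma)K$ and bounding the minimum by the average yields the $\mathcal{O}(1/\sqrt{K})$ rate.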

Note that the key element in the global convergence of MF-PPO is the recursion (6), which holds whenever we have the exact Q-function of the current policy and no function approximation is used when updating the next policy. In practice, MF-PPO conducts approximate policy evaluation ($\epsilon_{k}>0$) and, after obtaining the approximate Q-function, conducts an approximate policy improvement step ($\epsilon^{\prime}_{k+1}>0$). In addition, the error in approximating the Q-function introduced in the evaluation step can be further compounded in the improvement step. Nevertheless, recursion (6) still holds approximately, with additional terms representing the policy evaluation/improvement errors.

Lemma 4.3 (Liu et al. (2019a)).

Let $\epsilon_{k}$ (evaluation error) and $\epsilon_{k+1}^{\prime}$ (improvement error) be defined as in Lemma 4.1 and Lemma 4.2, respectively. We have

(1-\gamma)\left(\mathcal{L}(\pi^{*})-\mathcal{L}(\pi_{k})\right)\leq\upsilon_{k}\mathbb{E}_{\nu^{*}}\Big\{\mathrm{KL}\left(\pi^{*}(\cdot|s,\mathrm{d}_{\mathcal{S}})\|\pi_{k}(\cdot|s,\mathrm{d}_{\mathcal{S}})\right)-\mathrm{KL}\left(\pi^{*}(\cdot|s,\mathrm{d}_{\mathcal{S}})\|\pi_{k+1}(\cdot|s,\mathrm{d}_{\mathcal{S}})\right)\Big\}+\upsilon_{k}\left(\varepsilon_{k}+\varepsilon_{k}^{\prime}\right)+\upsilon_{k}^{-1}M, \quad (7)

where

\varepsilon_{k}=\tau_{k+1}^{-1}\epsilon_{k+1}^{\prime}\phi_{k+1}^{*}+\upsilon_{k}^{-1}\epsilon_{k}\psi_{k}^{*},\qquad\varepsilon_{k}^{\prime}=\lvert\mathcal{A}\rvert\,\tau_{k+1}^{-2}(\epsilon_{k+1}^{\prime})^{2},\qquad M=\mathbb{E}_{\nu^{*}}\left\{\max_{\overline{a}\in\overline{\mathcal{A}}}\left[F^{Q}_{\theta_{0}}(s,\mathrm{d}_{\mathcal{S}},\overline{a})\right]^{2}\right\}+2R_{A}^{2}.

In addition, $\phi_{k}^{*}$ and $\psi_{k}^{*}$ are defined by

\phi_{k}^{*}=\mathbb{E}_{\widetilde{\sigma}_{k}}\left[\lvert\mathrm{d}\pi^{*}/\mathrm{d}\pi_{0}-\mathrm{d}\pi_{k}/\mathrm{d}\pi_{0}\rvert^{2}\right]^{1/2},\qquad\psi_{k}^{*}=\mathbb{E}_{\sigma_{k}}\left[\lvert\mathrm{d}\sigma^{*}/\mathrm{d}\sigma_{k}-\mathrm{d}(\nu^{*}\times\pi_{k})/\mathrm{d}\sigma_{k}\rvert^{2}\right]^{1/2}.

Finally, telescoping inequality (7) of Lemma 4.3 from $k=0$ to $K-1$ completes the proof of Theorem 4.1.

5 Experiments

We perform experiments on the benchmark multi-agent particle environment (MPE) used in prior work (Lowe et al., 2017; Mordatch and Abbeel, 2018; Liu et al., 2019b). In the cooperative navigation task, $N$ agents, each with position $x_{i}\in\mathbb{R}^{2}$, must move to cover $N$ fixed landmarks at positions $y_{i}\in\mathbb{R}^{2}$. They receive a team reward $R=-\sum_{i=1}^{N}\min_{j\in[N]}\lVert y_{i}-x_{j}\rVert_{2}$. In the cooperative push task, $N$ agents with positions $x_{i}\in\mathbb{R}^{2}$ work together to push a ball at position $x\in\mathbb{R}^{2}$ to a fixed landmark $y\in\mathbb{R}^{2}$. They receive a team reward $R=-\min_{j\in[N]}\lVert x_{j}-x\rVert_{2}-\lVert x-y\rVert_{2}$. Both tasks involve homogeneous agents.
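Both team rewards are easy to compute directly from the formulas above; the following sketch uses random placeholder positions rather than states from the environment.

```python
import numpy as np

# Team rewards for the two MPE tasks; note both are permutation invariant over agents.
def navigation_reward(agent_pos, landmark_pos):
    # R = - sum_i min_j || y_i - x_j ||_2
    dists = np.linalg.norm(landmark_pos[:, None, :] - agent_pos[None, :, :], axis=-1)
    return -dists.min(axis=1).sum()

def push_reward(agent_pos, ball_pos, landmark_pos):
    # R = - min_j || x_j - ball ||_2 - || ball - landmark ||_2
    return (-np.linalg.norm(agent_pos - ball_pos, axis=-1).min()
            - np.linalg.norm(ball_pos - landmark_pos))

rng = np.random.default_rng(0)
agents, landmarks = rng.normal(size=(6, 2)), rng.normal(size=(6, 2))
print(navigation_reward(agents, landmarks))
print(push_reward(agents, ball_pos=rng.normal(size=2), landmark_pos=rng.normal(size=2)))
```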

We instantiate MF-PPO by building on the original PPO algorithm (Schulman et al., 2017) with a centralized permutation invariant value-function critic, represented by a DeepSet (Zaheer et al., 2017) network with two hidden layers. We use a standard two-layer multi-layer perceptron (MLP) for the actor network in all algorithms. One version, labeled MF, has a centralized actor network that outputs the mean and diagonal covariance of a Gaussian distribution over the joint continuous action space. A second, decentralized version with parameter sharing, labeled MF-PS, has a decentralized actor network that maps individual observations to the parameters of a Gaussian distribution over the individual action space.

Figure 3 (panels: (a) N=6, (b) N=15, (c) N=30, (d) N=200): Performance versus number of environment steps in the multi-agent cooperative navigation task, over five independent runs per method. Each point is the average reward over 1000 evaluation episodes. MF, based on the DeepSet invariant critic and built on either PPO or MADDPG, significantly outperforms other critic representations for various numbers of agents.

We compare with two other critic representations: one that uses an MLP for the centralized critic, labeled MLP, and one that uses a graph convolutional network for the critic (Liu et al., 2019b), labeled GCN. Note that the GCN representation is permutation invariant if one imposes a fully-connected graph over the agents in the MPE, but this invariance property does not hold for general graphs. We also compare with an extension of Yang et al. (2018) to continuous action spaces, labeled MF-A, in which each independent DDPG agent $i$ has a decentralized critic $Q(s_{i},a_{i},\bar{a}_{i})$ that takes in the mean of all other agents' actions $\bar{a}_{i}:=\frac{1}{N-1}\sum_{j\neq i}a_{j}$. Empirically, since we find that off-policy RL learns faster than on-policy RL in the MPE with larger numbers of agents, regardless of the critic representation, we make all comparisons on top of MADDPG (Lowe et al., 2017) for all cases with $N\geq 15$. For a fair comparison of the critic representations, we ensure that all neural network architectures contain approximately the same number of trainable weights.

For the cooperative navigation task, Figure 3 shows that the permutation invariant critic representation based on DeepSet enables MF and MF-PS to learn faster or reach higher performance than all other representations and methods in the MPE with 6, 15, 30, and 200 agents. Figure 4(a) shows that MF consistently and significantly outperforms GCN as the number of parameters in their critic networks varies over a range, with all other settings fixed. In particular, MF requires far fewer critic parameters to achieve higher performance than GCN. For the cooperative push task, Figure 4(b) demonstrates a similar performance improvement of MF over all other methods. We provide additional experimental details in Appendix B.

Figure 4 (panels: (a) varying critic size, (b) N=15): (a) MF outperforms GCN even with a smaller number of critic network parameters (N=3). (b) Performance versus number of environment steps in the multi-agent cooperative push task. MF based on the DeepSet invariant critic significantly outperforms other critic representations.

Computational Improvements. Theorem 4.1 states that to obtain a small optimality gap in MF-PPO, one needs to compute the update on a large number of agents. We remark that with the dual embedding techniques developed in Dai et al. (2017), one can avoid computation on all the agents by sampling a small number of agents to compute the update. This technique could be readily incorporated into MF-PPO to improve its computational efficiency.

6 Conclusion

We propose a principled approach to exploit agent homogeneity and permutation invariance through the mean-field approximation in MARL. Moreover, our results are also the first to provably show the global convergence of MARL algorithms with neural networks as function approximators. This is in sharp contrast to current practices, which are mostly heuristic methods without convergence guarantees.

References

  • Arora et al. (2019) Arora, S., Du, S. S., Hu, W., Li, Z. and Wang, R. (2019). Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. arXiv preprint arXiv:1901.08584 .
  • Bertsekas (2011) Bertsekas, D. P. (2011). Approximate policy iteration: A survey and some new methods. Journal of Control Theory and Applications 9 310–335.
  • Bloem-Reddy and Teh (2019) Bloem-Reddy, B. and Teh, Y. W. (2019). Probabilistic symmetry and invariant neural networks. arXiv preprint arXiv:1901.06082 .
  • Cai et al. (2019) Cai, Q., Yang, Z., Lee, J. D. and Wang, Z. (2019). Neural temporal-difference learning converges to global optima. In Advances in Neural Information Processing Systems.
  • Cao et al. (2013) Cao, Y., Yu, W., Ren, W. and Chen, G. (2013). An overview of recent progress in the study of distributed multi-agent coordination. IEEE Transactions on Industrial informatics 9 427–438.
  • Carmona et al. (2019) Carmona, R., Laurière, M. and Tan, Z. (2019). Model-free mean-field reinforcement learning: mean-field mdp and mean-field q-learning. arXiv preprint arXiv:1910.12802 .
  • Chen and Teboulle (1993) Chen, G. and Teboulle, M. (1993). Convergence analysis of a proximal-like minimization algorithm using bregman functions. SIAM Journal on Optimization 3 538–543.
  • Chizat and Bach (2018) Chizat, L. and Bach, F. (2018). A note on lazy training in supervised differentiable programming. arXiv preprint arXiv:1812.07956 8.
  • Dai et al. (2017) Dai, B., Shaw, A., Li, L., Xiao, L., He, N., Liu, Z., Chen, J. and Song, L. (2017). Sbeed: Convergent reinforcement learning with nonlinear function approximation. arXiv preprint arXiv:1712.10285 .
  • Jacot et al. (2018) Jacot, A., Gabriel, F. and Hongler, C. (2018). Neural tangent kernel: Convergence and generalization in neural networks. In Advances in neural information processing systems.
  • Kakade and Langford (2002) Kakade, S. and Langford, J. (2002). Approximately optimal approximate reinforcement learning. In ICML, vol. 2.
  • Kalyanakrishnan et al. (2006) Kalyanakrishnan, S., Liu, Y. and Stone, P. (2006). Half field offense in robocup soccer: A multiagent reinforcement learning case study. In Robot Soccer World Cup. Springer.
  • Kipf and Welling (2017) Kipf, T. N. and Welling, M. (2017). Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations.
  • Kober et al. (2013) Kober, J., Bagnell, J. A. and Peters, J. (2013). Reinforcement learning in robotics: A survey. The International Journal of Robotics Research 32 1238–1274.
  • Littman (1994) Littman, M. L. (1994). Markov games as a framework for multi-agent reinforcement learning. In Machine Learning Proceedings 1994. Elsevier, 157–163.
  • Liu et al. (2019a) Liu, B., Cai, Q., Yang, Z. and Wang, Z. (2019a). Neural proximal/trust region policy optimization attains globally optimal policy. arXiv preprint arXiv:1906.10306 .
  • Liu et al. (2019b) Liu, I.-J., Yeh, R. A. and Schwing, A. G. (2019b). Pic: Permutation invariant critic for multi-agent deep reinforcement learning. arXiv preprint arXiv:1911.00025 .
  • Lowe et al. (2017) Lowe, R., Wu, Y., Tamar, A., Harb, J., Abbeel, O. P. and Mordatch, I. (2017). Multi-agent actor-critic for mixed cooperative-competitive environments. In Advances in Neural Information Processing Systems.
  • Matarić (1997) Matarić, M. J. (1997). Reinforcement learning in the multi-robot domain. In Robot colonies. Springer, 73–83.
  • Menda et al. (2018) Menda, K., Chen, Y.-C., Grana, J., Bono, J. W., Tracey, B. D., Kochenderfer, M. J. and Wolpert, D. (2018). Deep reinforcement learning for event-driven multi-agent decision processes. IEEE Transactions on Intelligent Transportation Systems 20 1259–1268.
  • Mordatch and Abbeel (2018) Mordatch, I. and Abbeel, P. (2018). Emergence of grounded compositional language in multi-agent populations. In Thirty-Second AAAI Conference on Artificial Intelligence.
  • Muandet et al. (2016) Muandet, K., Fukumizu, K., Sriperumbudur, B. and Schölkopf, B. (2016). Kernel mean embedding of distributions: A review and beyond. arXiv preprint arXiv:1605.09522 .
  • Panait and Luke (2005) Panait, L. and Luke, S. (2005). Cooperative multi-agent learning: The state of the art. Autonomous agents and multi-agent systems 11 387–434.
  • Schulman et al. (2017) Schulman, J., Wolski, F., Dhariwal, P., Radford, A. and Klimov, O. (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 .
  • Shalev-Shwartz et al. (2016) Shalev-Shwartz, S., Shammah, S. and Shashua, A. (2016). Safe, multi-agent, reinforcement learning for autonomous driving. arXiv preprint arXiv:1610.03295 .
  • Smola et al. (2007) Smola, A., Gretton, A., Song, L. and Schölkopf, B. (2007). A hilbert space embedding for distributions. In International Conference on Algorithmic Learning Theory. Springer.
  • Song et al. (2009) Song, L., Huang, J., Smola, A. and Fukumizu, K. (2009). Hilbert space embeddings of conditional distributions with applications to dynamical systems. In Proceedings of the 26th Annual International Conference on Machine Learning.
  • Sunehag et al. (2017) Sunehag, P., Lever, G., Gruslys, A., Czarnecki, W. M., Zambaldi, V., Jaderberg, M., Lanctot, M., Sonnerat, N., Leibo, J. Z., Tuyls, K. et al. (2017). Value-decomposition networks for cooperative multi-agent learning. arXiv preprint arXiv:1706.05296 .
  • Sutton and Barto (2018) Sutton, R. S. and Barto, A. G. (2018). Reinforcement learning: An introduction. MIT press.
  • Vinyals et al. (2019) Vinyals, O., Babuschkin, I., Czarnecki, W. M., Mathieu, M., Dudzik, A., Chung, J., Choi, D. H., Powell, R., Ewalds, T., Georgiev, P. et al. (2019). Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature 575 350–354.
  • Wang et al. (2020) Wang, L., Yang, Z. and Wang, Z. (2020). Breaking the curse of many agents: Provable mean embedding q-iteration for mean-field reinforcement learning. In International Conference on Machine Learning. PMLR.
  • Yang et al. (2017) Yang, J., Ye, X., Trivedi, R., Xu, H. and Zha, H. (2017). Learning deep mean field games for modeling large population behavior. arXiv preprint arXiv:1711.03156 .
  • Yang et al. (2018) Yang, Y., Luo, R., Li, M., Zhou, M., Zhang, W. and Wang, J. (2018). Mean field multi-agent reinforcement learning. arXiv preprint arXiv:1802.05438 .
  • Zaheer et al. (2017) Zaheer, M., Kottur, S., Ravanbakhsh, S., Poczos, B., Salakhutdinov, R. R. and Smola, A. J. (2017). Deep sets. In Advances in neural information processing systems.
  • Zhang et al. (2019) Zhang, K., Yang, Z. and Başar, T. (2019). Multi-agent reinforcement learning: A selective overview of theories and algorithms. arXiv preprint arXiv:1911.10635 .
  • Zheng et al. (2020) Zheng, S., Trott, A., Srinivasa, S., Naik, N., Gruesbeck, M., Parkes, D. C. and Socher, R. (2020). The ai economist: Improving equality and productivity with ai-driven tax policies. arXiv preprint arXiv:2004.13332 .

 

Supplementary Material for A Permutation Invariant Approach to Deep Mean-Field Multi-Agent Reinforcement Learning

 

Appendix A Algorithms for Policy Evaluation/Improvement

Algorithm 2 Policy Evaluation via TD
  Require: Mean-field MDP (𝒮¯,𝒜¯,,r,γ)(\overline{{\mathcal{S}}},\overline{\mathcal{A}},\mathbb{P},r,\gamma), sample {(st,𝐬t,a¯t,st,𝐬t,a¯t)}\{(s_{t},\mathbf{s}_{t},\overline{a}_{t},s_{t}^{\prime},\mathbf{s}_{t}^{\prime},\overline{a}_{t}^{\prime})\}, number of iterations TT
  Initialize: wj(0)Unif{1,+1},θj(0)𝒩(0,Id/d),j[mQ]w_{j}(0)\sim\mathrm{Unif}\{-1,+1\},~{}~{}~{}\theta_{j}(0)\sim\mathcal{N}(0,\mathrm{I}_{d}/d),~{}\forall j\in[m_{Q}]
  Set step size ηT1/2\eta\leftarrow T^{-1/2}
  for t=0,,T1t=0,\ldots,T-1 do
      Set (s,𝐬,a¯,s,𝐬,a¯)(st,𝐬t,a¯t,st,𝐬t,a¯t)(s,\mathbf{s},\overline{a},s^{\prime},\mathbf{s}^{\prime},\overline{a}^{\prime})\leftarrow(s_{t},\mathbf{s}_{t},\overline{a}_{t},s_{t}^{\prime},\mathbf{s}_{t}^{\prime},\overline{a}_{t}^{\prime})
      θ(t+1/2)=θ(t)η{Fθ(t)Q(s,𝐬,a¯)(1γ)r(s,d𝒮,a¯)γFθ(t)Q(s,𝐬,a¯)}θFθ(t)Q(s,𝐬,a¯)\theta(t+1/2)=\theta(t)-\eta\left\{F^{Q}_{\theta(t)}(s,\mathbf{s},\overline{a})-(1-\gamma)r(s,\mathrm{d}_{\mathcal{S}},\overline{a})-\gamma F^{Q}_{\theta(t)}(s,\mathbf{s}^{\prime},\overline{a}^{\prime})\right\}\nabla_{\theta}F^{Q}_{\theta(t)}(s,\mathbf{s},\overline{a})
  θ(t+1)=Π𝔹(Rθ,θ0)(θ(t+1/2))\theta(t+1)=\Pi_{\mathbb{B}(R_{\theta},\theta_{0})}(\theta(t+1/2))
  end for
  Take ergodic average: θ¯T1Tt=0T1θ(t)\overline{\theta}_{T}\leftarrow\frac{1}{T}\sum_{t=0}^{T-1}\theta(t)
  Output: Fθ¯TQF^{Q}_{\overline{\theta}_{T}}
Algorithm 3 Policy Improvement via SGD
  Require: Mean-field MDP (𝒮¯,𝒜¯,ℙ,r,γ)(\overline{{\mathcal{S}}},\overline{\mathcal{A}},\mathbb{P},r,\gamma), sample {(st,𝐬t,a¯t)}\{(s_{t},\mathbf{s}_{t},\overline{a}_{t})\}, number of iterations TT
  Initialize: vj(0)Unif{1,+1},αj(0)𝒩(0,Id/d),j[mA]v_{j}(0)\sim\mathrm{Unif}\{-1,+1\},~{}~{}~{}\alpha_{j}(0)\sim\mathcal{N}(0,\mathrm{I}_{d}/d),~{}\forall j\in[m_{A}]
  Set step size ηT1/2\eta\leftarrow T^{-1/2}
  for t=0,,T1t=0,\ldots,T-1 do
      Set (s,𝐬,a¯)(st,𝐬t,a¯t)(s,\mathbf{s},\overline{a})\leftarrow(s_{t},\mathbf{s}_{t},\overline{a}_{t})
      α(t+1/2)=α(t)η{Fα(t)A(s,𝐬,a¯)τk+1(υk1FθkQ(s,𝐬,a¯)+τk1FαkA(s,𝐬,a¯))}αFα(t)A(s,𝐬,a¯)\alpha(t+1/2)=\alpha(t)-\eta\left\{F^{A}_{\alpha(t)}(s,\mathbf{s},\overline{a})-\tau_{k+1}(\upsilon_{k}^{-1}F^{Q}_{\theta_{k}}(s,\mathbf{s},\overline{a})+\tau_{k}^{-1}F^{A}_{\alpha_{k}}(s,\mathbf{s},\overline{a}))\right\}\nabla_{\alpha}F^{A}_{\alpha(t)}(s,\mathbf{s},\overline{a})
      α(t+1)=Π𝔹(Rα,α0)(α(t+1/2))\alpha(t+1)=\Pi_{\mathbb{B}(R_{\alpha},\alpha_{0})}\left(\alpha(t+1/2)\right)
  end for
  Take ergodic average: α¯T1Tt=0T1α(t)\overline{\alpha}_{T}\leftarrow\frac{1}{T}\sum_{t=0}^{T-1}\alpha(t)
  Output: Fα¯TAF^{A}_{\overline{\alpha}_{T}}
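For concreteness, a minimal numpy sketch of one iteration of each of the two loops above is given below. It is an illustration only: the helper names (mean_field_net, net_grad, project_ball, td_step, improvement_step), the flattened feature construction, and the way the reward is passed in are assumptions of this sketch, not the authors' implementation; the update rules themselves follow the two algorithm boxes above.

import numpy as np

def mean_field_net(params, w, s, states, a_bar):
    # F(s, s_vec, a_bar) = (1/N) sum_i (1/sqrt(m)) sum_j w_j relu(params_j . (s, s_i, a_bar)),
    # with s, each s_i, and a_bar given as 1-D numpy arrays and params of shape (m, d).
    m = params.shape[0]
    feats = np.stack([np.concatenate([s, si, a_bar]) for si in states])   # (N, d)
    return float(np.mean(np.maximum(feats @ params.T, 0.0) @ w) / np.sqrt(m))

def net_grad(params, w, s, states, a_bar):
    # gradient of the network output w.r.t. the first-layer weights (ReLU subgradient at 0 set to 0)
    m = params.shape[0]
    feats = np.stack([np.concatenate([s, si, a_bar]) for si in states])
    active = (feats @ params.T > 0).astype(float)                          # (N, m)
    return (active * w).T @ feats / (np.sqrt(m) * feats.shape[0])          # (m, d)

def project_ball(params, params0, radius):
    # Euclidean projection onto the ball {beta : ||beta - beta(0)||_2 <= radius}
    diff = params - params0
    nrm = np.linalg.norm(diff)
    return params0 + diff * min(1.0, radius / max(nrm, 1e-12))

def td_step(theta, theta0, w, transition, reward, gamma, R_theta, eta):
    # one iteration of Algorithm 2 (policy evaluation via TD)
    (s, states, a_bar), (s2, states2, a_bar2) = transition
    delta = (mean_field_net(theta, w, s, states, a_bar)
             - (1.0 - gamma) * reward
             - gamma * mean_field_net(theta, w, s2, states2, a_bar2))
    theta_half = theta - eta * delta * net_grad(theta, w, s, states, a_bar)
    return project_ball(theta_half, theta0, R_theta)

def improvement_step(alpha, alpha0, v, theta_k, w_q, alpha_k, v_k, state_action,
                     tau_next, tau_k, upsilon_k, R_alpha, eta):
    # one iteration of Algorithm 3: regress onto tau_{k+1}(upsilon_k^{-1} F^Q + tau_k^{-1} F^A)
    s, states, a_bar = state_action
    target = tau_next * (mean_field_net(theta_k, w_q, s, states, a_bar) / upsilon_k
                         + mean_field_net(alpha_k, v_k, s, states, a_bar) / tau_k)
    delta = mean_field_net(alpha, v, s, states, a_bar) - target
    alpha_half = alpha - eta * delta * net_grad(alpha, v, s, states, a_bar)
    return project_ball(alpha_half, alpha0, R_alpha)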

Appendix B Experimental details

\diamondsuit Environment. We used the open-source code for the multi-agent particle environments provided by Liu et al. (2019b), which itself is based on the original code by Lowe et al. (2017), without any modification. Please refer to Liu et al. (2019b, Appendix B) for complete details.

\diamondsuit Computing infrastructure and runtime. Experiments were run on Intel Xeon 6154 CPUs, using one core for each independent policy training session. Average training time was approximately 4 hours for N=15N=15 and N=30N=30 with 1.5e6 steps, and 12 hours for N=200N=200 with 7.25e5 steps.

B.1 Implementation

\diamondsuit GCN (we label this architecture “GCN” to distinguish it from alternative ways of instantiating a permutation invariant critic). For experiments based on MADDPG, we re-ran the open-source code provided by Liu et al. (2019b) without modification. For experiments based on PPO, we used the GCN network of Kipf and Welling (2017) as the critic in PPO. The GCN performance reported in this work is the result of our own experimental runs.

\diamondsuit MF. In the PPO-based implementation, the joint policy is parameterized by a multi-layer perceptron (MLP) with two hidden layers of size 128 and ReLU activation, which takes in the concatenation of all agents’ observations and outputs the mean and diagonal covariance of a joint Gaussian policy. In the MADDPG-based implementation, each decentralized actor (i.e., policy) network is an MLP with two hidden layers of size 187 and ReLU activation, which takes in each agent’s individual observation and outputs the agent’s real-valued action vector. The centralized critic has the form V(s)=f3(f2(∑i=1Nf1(si)))V(s)=f_{3}(f_{2}(\sum_{i=1}^{N}f_{1}(s_{i}))), where f1f_{1} and f2f_{2} are hidden layers with ReLU activation of size h=190/205/187/187/187h=190/205/187/187/187 (for N=3/6/15/30/200N=3/6/15/30/200 agents, respectively, chosen so that the number of trainable critic parameters approximately matches that of the GCN critic used in Liu et al. (2019b)), and f3f_{3} is a linear layer with output size 1.
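The centralized MF critic above is a Deep Sets style network. A minimal numpy sketch of the forward pass V(s)=f3(f2(Σi f1(si))) is shown below; the weight shapes and initialization are illustrative assumptions, and only the sum over per-agent embeddings is taken from the description above. The permutation check at the end illustrates why this critic is permutation invariant by construction.

import numpy as np

def mf_critic_forward(obs, params):
    # obs: (N, d_obs) array of per-agent observations; params: dict of weights.
    W1, b1, W2, b2, W3, b3 = (params[k] for k in ("W1", "b1", "W2", "b2", "W3", "b3"))
    h1 = np.maximum(obs @ W1 + b1, 0.0)        # f1 applied to each agent, (N, h)
    pooled = h1.sum(axis=0)                    # permutation-invariant sum pooling, (h,)
    h2 = np.maximum(pooled @ W2 + b2, 0.0)     # f2, (h,)
    return (h2 @ W3 + b3).item()               # f3: linear layer with scalar output

# illustrative sizes: N = 15 agents, d_obs = 10, hidden size h = 187
rng = np.random.default_rng(0)
N, d_obs, h = 15, 10, 187
params = {"W1": rng.normal(size=(d_obs, h)) * 0.1, "b1": np.zeros(h),
          "W2": rng.normal(size=(h, h)) * 0.1,     "b2": np.zeros(h),
          "W3": rng.normal(size=(h, 1)) * 0.1,     "b3": np.zeros(1)}
obs = rng.normal(size=(N, d_obs))
v1 = mf_critic_forward(obs, params)
v2 = mf_critic_forward(obs[rng.permutation(N)], params)   # permuting agents leaves V unchanged
assert np.isclose(v1, v2)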

\diamondsuit MLP. The centralized critic is an MLP with two hidden layers of size 138/187/75/42/7 for N=3,6,15,30,200N=3,6,15,30,200 agents, respectively, such that the total number of trainable parameters is approximately equal to that of the GCN critic (Liu et al., 2019b). The joint policy has the same representation as in MF for both the PPO-based and MADDPG-based implementations.

\diamondsuit Hyperparameters. We used the Adam optimizer for all algorithms. For experiments based on PPO, the common hyperparameters across all algorithms are: discount factor γ=0.99\gamma=0.99, KL divergence target dtarg=0.01d_{\text{targ}}=0.01, KL divergence initial coefficient β=3\beta=3, GAE λ=0.95\lambda=0.95 (see Schulman et al. (2017)), policy entropy loss coefficient 0.01, max gradient norm 0.50.5, and actor learning rate 10410^{-4}. Each PPO training step uses 5 epochs with minibatch size 2. We swept the critic learning rate for each critic architecture, choosing 10310^{-3} for MF, 10410^{-4} for GCN and MLP, and 10210^{-2} for MF-A.

For experiments based on MADDPG, the common hyperparameters are: actor learning rate 0.010.01, critic learning rate 0.010.01, batch size 1024, discount factor γ=0.95\gamma=0.95, replay buffer size 10510^{5}, 8 actor and critic updates per 100 environment steps, and target network update rate τ=0.01\tau=0.01.
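Collected as a configuration sketch, the values from the two paragraphs above read as follows; the dictionary layout and key names are illustrative assumptions, not those of any specific codebase.

PPO_CONFIG = {"gamma": 0.99, "kl_target": 0.01, "kl_init_coef": 3.0, "gae_lambda": 0.95,
              "entropy_coef": 0.01, "max_grad_norm": 0.5, "actor_lr": 1e-4,
              "epochs_per_step": 5, "minibatch_size": 2,
              "critic_lr": {"MF": 1e-3, "GCN": 1e-4, "MLP": 1e-4, "MF-A": 1e-2}}

MADDPG_CONFIG = {"actor_lr": 0.01, "critic_lr": 0.01, "batch_size": 1024, "gamma": 0.95,
                 "buffer_size": int(1e5), "updates_per_100_steps": 8, "target_tau": 0.01}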

Appendix C Proofs in Section 2

Proof of Proposition 2.1.

We first prove the first part of the claim. We begin by showing that the optimal Q-function QQ^{*} is permutation invariant, i.e., Q(𝐬,𝐚)=Q(κ(𝐬),κ(𝐚))Q^{*}(\mathbf{s},\mathbf{a})=Q^{*}(\kappa(\mathbf{s}),\kappa(\mathbf{a})) for every permutation mapping κ\kappa. Note that QQ^{*} is the unique solution to the Bellman optimality equation over the space of QQ-functions

Q(𝐬,𝐚)\displaystyle Q(\mathbf{s},\mathbf{a}) =𝔼𝐬(|𝐬,𝐚){r(𝐬,𝐚)+max𝐚Q(𝐬,𝐚)}\displaystyle=\mathbb{E}_{\mathbf{s}^{\prime}\sim\mathbb{P}(\cdot|\mathbf{s},\mathbf{a})}\left\{r(\mathbf{s},\mathbf{a})+\max_{\mathbf{a}^{\prime}}Q(\mathbf{s}^{\prime},\mathbf{a}^{\prime})\right\}
=r(𝐬,𝐚)+𝐬(𝐬|𝐬,𝐚)max𝐚Q(𝐬,𝐚).\displaystyle=r(\mathbf{s},\mathbf{a})+\sum_{\mathbf{s}^{\prime}}\mathbb{P}(\mathbf{s}^{\prime}|\mathbf{s},\mathbf{a})\max_{\mathbf{a}^{\prime}}Q(\mathbf{s}^{\prime},\mathbf{a}^{\prime}). (9)

Or equivalently, from the permutation invariance of rr and \mathbb{P} defined in (1)

Q(κ(𝐬),κ(𝐚))\displaystyle Q(\kappa(\mathbf{s}),\kappa(\mathbf{a})) =r(κ(𝐬),κ(𝐚))+𝐬(κ(𝐬)|κ(𝐬),κ(𝐚))max𝐚Q(κ(𝐬),κ(𝐚))\displaystyle=r(\kappa(\mathbf{s}),\kappa(\mathbf{a}))+\sum_{\mathbf{s}^{\prime}}\mathbb{P}(\kappa(\mathbf{s}^{\prime})|\kappa(\mathbf{s}),\kappa(\mathbf{a}))\max_{\mathbf{a}^{\prime}}Q(\kappa(\mathbf{s}^{\prime}),\kappa(\mathbf{a}^{\prime}))
=r(𝐬,𝐚)+𝐬(𝐬|𝐬,𝐚)max𝐚Q(κ(𝐬),κ(𝐚)).\displaystyle=r(\mathbf{s},\mathbf{a})+\sum_{\mathbf{s}^{\prime}}\mathbb{P}(\mathbf{s}^{\prime}|\mathbf{s},\mathbf{a})\max_{\mathbf{a}^{\prime}}Q(\kappa(\mathbf{s}^{\prime}),\kappa(\mathbf{a}^{\prime})).

That is, Q(κ(𝐬),κ(𝐚))Q(\kappa(\mathbf{s}),\kappa(\mathbf{a})) is also a solution to the Bellman optimality condition (9). From the uniqueness of the solution, we have Q(𝐬,𝐚)=Q(κ(𝐬),κ(𝐚))Q^{*}(\mathbf{s},\mathbf{a})=Q^{*}(\kappa(\mathbf{s}),\kappa(\mathbf{a})). Hence the optimal policy ν(𝐬)=argmax𝐚Q(𝐬,𝐚)\nu^{*}(\mathbf{s})=\mathop{\mathrm{argmax}}_{\mathbf{a}}Q^{*}(\mathbf{s},\mathbf{a}) is permutation invariant.

For the second part of the proposition, we give the detailed proof for the permutation invariance of the value function VV. For any permutation invariant policy ν\nu, let Rν(𝐬)=𝔼𝐚νR(𝐬,𝐚)R^{\nu}(\mathbf{s})=\mathbb{E}_{\mathbf{a}\sim\nu}R(\mathbf{s},\mathbf{a}) and ν(𝐬′|𝐬)=𝔼𝐚νℙ(𝐬′|𝐬,𝐚)\mathbb{P}^{\nu}(\mathbf{s}^{\prime}|\mathbf{s})=\mathbb{E}_{\mathbf{a}\sim\nu}\mathbb{P}(\mathbf{s}^{\prime}|\mathbf{s},\mathbf{a}) denote the key elements of the induced Markov reward process. From the permutation invariance defined in (1), both RνR^{\nu} and ν(𝐬′|𝐬)\mathbb{P}^{\nu}(\mathbf{s}^{\prime}|\mathbf{s}) are permutation invariant. We define the kk-step value function VkνV^{\nu}_{k} as the expected cumulative reward from time 0 to time kk; then:

V1ν(𝐬)\displaystyle V^{\nu}_{1}(\mathbf{s}) =Rν(𝐬),\displaystyle=R^{\nu}(\mathbf{s}),
\displaystyle V^{\nu}_{k}(\mathbf{s})=R^{\nu}(\mathbf{s})+\gamma\,\mathbb{E}_{\mathbf{s}^{\prime}\sim\mathbb{P}^{\nu}(\cdot|\mathbf{s})}V^{\nu}_{k-1}(\mathbf{s}^{\prime}),~{}~{}\forall k>1.

We can see from the above recursion that VkνV^{\nu}_{k} is permutation invariant for all k1k\geq 1. From Vν=limkVkνV^{\nu}=\lim_{k\to\infty}V^{\nu}_{k}, we conclude that VνV^{\nu} is also permutation invariant.

For permutation invariance of Q-function, recall that Qν(𝐬,𝐚)Q^{\nu}(\mathbf{s},\mathbf{a}) is the unique solution of the following Bellman evaluation equation

Q(𝐬,𝐚)\displaystyle Q(\mathbf{s},\mathbf{a}) =𝔼𝐬(|𝐬,𝐚){r(𝐬,𝐚)+𝔼𝐚ν(|𝐬)Q(𝐬,𝐚)}\displaystyle=\mathbb{E}_{\mathbf{s}^{\prime}\sim\mathbb{P}(\cdot|\mathbf{s},\mathbf{a})}\left\{r(\mathbf{s},\mathbf{a})+\mathbb{E}_{\mathbf{a}^{\prime}\sim\nu(\cdot|\mathbf{s}^{\prime})}Q(\mathbf{s}^{\prime},\mathbf{a}^{\prime})\right\}
\displaystyle=r(\mathbf{s},\mathbf{a})+\sum_{\mathbf{s}^{\prime},\mathbf{a}^{\prime}}\mathbb{P}(\mathbf{s}^{\prime}|\mathbf{s},\mathbf{a})\,\nu(\mathbf{a}^{\prime}|\mathbf{s}^{\prime})\,Q(\mathbf{s}^{\prime},\mathbf{a}^{\prime}). (10)

From the permutation invariance of rr, \mathbb{P} and ν\nu, the function (\mathbf{s},\mathbf{a})\mapsto Q^{\nu}(\kappa(\mathbf{s}),\kappa(\mathbf{a})) is also a solution to (10), as

\displaystyle Q(\kappa(\mathbf{s}),\kappa(\mathbf{a}))=r(\mathbf{s},\mathbf{a})+\sum_{\mathbf{s}^{\prime},\mathbf{a}^{\prime}}\mathbb{P}(\mathbf{s}^{\prime}|\mathbf{s},\mathbf{a})\,\nu(\mathbf{a}^{\prime}|\mathbf{s}^{\prime})\,Q(\kappa(\mathbf{s}^{\prime}),\kappa(\mathbf{a}^{\prime})).

Since QνQ^{\nu} is the unique solution to (10), we have Qν(𝐬,𝐚)=Qν(κ(𝐬),κ(𝐚))Q^{\nu}(\mathbf{s},\mathbf{a})=Q^{\nu}(\kappa(\mathbf{s}),\kappa(\mathbf{a})).

Proof of Proposition 2.2.

The proof that Vν(s,𝐬r)=Vν(s,κ(𝐬r))V^{\nu}(s,\mathbf{s}_{r})=V^{\nu}(s,\kappa(\mathbf{s}_{r})) follows from exactly the same ingredients as the proof of Proposition 2.1. By Theorem 11 in Bloem-Reddy and Teh (2019), there exists a function h_{\nu} such that V^{\nu}(s,\mathbf{s}_{r})\overset{a.s.}{=}h_{\nu}(s,\eta,\sum_{s^{\prime}\in\mathbf{s}_{r}}\frac{1}{|\mathbf{s}_{r}|}\delta_{s^{\prime}}), where ηUnif[0,1]\eta\sim\mathrm{Unif}[0,1]. Proposition 2.2 then follows by letting g_{\nu}(s,\sum_{s^{\prime}\in\mathbf{s}_{r}}\frac{1}{|\mathbf{s}_{r}|}\delta_{s^{\prime}})=\mathbb{E}_{\eta}h_{\nu}(s,\eta,\sum_{s^{\prime}\in\mathbf{s}_{r}}\frac{1}{|\mathbf{s}_{r}|}\delta_{s^{\prime}}).

Appendix D Proofs in Section 3

Proof of Proposition 3.1.

For any permutation invariant actor/critic function F(s,𝐬,a¯)=F(s,κ(𝐬),a¯)F(s,\mathbf{s},\overline{a})=F(s,\kappa(\mathbf{s}),\overline{a}), we consider the size of the tabular representation of such functions and refer to it as the size of the search space. We say 𝐬\mathbf{s} and 𝐬\mathbf{s}^{\prime} are permutation equivalent if 𝐬=κ(𝐬)\mathbf{s}=\kappa(\mathbf{s}^{\prime}) for some permutation mapping κ\kappa. Observe that (i) permutation equivalence is a valid equivalence relation over the space of all possible 𝐬\mathbf{s}; (ii) 𝐬\mathbf{s} and 𝐬\mathbf{s}^{\prime} are permutation equivalent if and only if every value vv occurs the same number of times in 𝐬\mathbf{s} and 𝐬\mathbf{s}^{\prime}. One can verify that for 𝐬\mathbf{s} containing NN elements, each taking values in 𝒮{\mathcal{S}}, the number of equivalence classes induced by the permutation equivalence relation is ∑k=1min{|𝒮|,N}(N−1k−1)(|𝒮|k)\sum_{k=1}^{\min\{|{\mathcal{S}}|,N\}}\binom{N-1}{k-1}\binom{|{\mathcal{S}}|}{k}. The claim of Proposition 3.1 follows immediately. ∎
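As a quick sanity check of the counting argument (a standalone illustration, not part of the proof), the number of equivalence classes equals the number of size-N multisets over 𝒮, and the stars-and-bars expression above can be verified by brute force for small N and |𝒮|:

from itertools import product
from math import comb

def classes_formula(S, N):
    # sum_{k=1}^{min(S, N)} C(N-1, k-1) * C(S, k)
    return sum(comb(N - 1, k - 1) * comb(S, k) for k in range(1, min(S, N) + 1))

def classes_bruteforce(S, N):
    # count distinct joint states up to permutation, i.e., distinct sorted N-tuples
    return len({tuple(sorted(joint)) for joint in product(range(S), repeat=N)})

for S, N in [(2, 3), (3, 4), (4, 5), (5, 3)]:
    assert classes_formula(S, N) == classes_bruteforce(S, N) == comb(N + S - 1, N)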

Proof of Proposition 3.2.

Note that the maximization problem can be solved separately for each s,d𝒮s,\mathrm{d}_{\mathcal{S}}. Hence we can equivalently solve

\overline{\pi}(\cdot|s,\mathrm{d}_{\mathcal{S}})\in\mathop{\mathrm{argmax}}_{\pi(\cdot|s,\mathrm{d}_{\mathcal{S}})\in\mathbb{R}^{\overline{\mathcal{A}}}}\left[\left\langle F^{Q}_{\theta_{k}}(s,\mathrm{d}_{\mathcal{S}},\cdot),\pi(\cdot|s,\mathrm{d}_{\mathcal{S}})\right\rangle-\upsilon_{k}\mathrm{KL}\left(\pi(\cdot|s,\mathrm{d}_{\mathcal{S}})\|\pi_{k}(\cdot|s,\mathrm{d}_{\mathcal{S}})\right)\right],

which is an optimization problem over finite-dimensional parameter π(|s,d𝒮)𝒜¯\pi(\cdot|s,\mathrm{d}_{\mathcal{S}})\in\mathbb{R}^{\overline{\mathcal{A}}}. Setting the derivative w.r.t. π(a¯|s,d𝒮)\pi(\overline{a}|s,\mathrm{d}_{\mathcal{S}}) for each a¯𝒜¯\overline{a}\in\overline{\mathcal{A}} equal to zero, we have

\displaystyle F^{Q}_{\theta_{k}}(s,\mathrm{d}_{\mathcal{S}},\overline{a})-\upsilon_{k}\left(\log\left(\frac{\overline{\pi}(\overline{a}|s,\mathrm{d}_{\mathcal{S}})}{\pi_{k}(\overline{a}|s,\mathrm{d}_{\mathcal{S}})}\right)+1\right)=0,

or equivalently, \overline{\pi}(\overline{a}|s,\mathrm{d}_{\mathcal{S}})\propto\pi_{k}(\overline{a}|s,\mathrm{d}_{\mathcal{S}})\exp\{\upsilon_{k}^{-1}F^{Q}_{\theta_{k}}(s,\mathrm{d}_{\mathcal{S}},\overline{a})\}. Plugging in the definition πk∝exp(τk−1FαkA)\pi_{k}\propto\exp\left(\tau_{k}^{-1}F^{A}_{\alpha_{k}}\right) completes the proof. ∎
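A small numerical illustration of this closed form over a finite action set is given below; the energy values and temperatures are arbitrary stand-ins for the neural parameterizations F^Q and F^A, and softmax is used to normalize the exponentiated energies.

import numpy as np

def softmax(x):
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

# illustrative energies over three actions, at a fixed (s, d_S)
F_Q = np.array([1.0, 0.2, -0.5])     # stand-in for F^Q_{theta_k}(s, d_S, .)
F_A = np.array([0.3, 0.1, 0.0])      # stand-in for F^A_{alpha_k}(s, d_S, .)
tau_k, upsilon_k = 1.0, 0.5

pi_k = softmax(F_A / tau_k)                          # pi_k    proportional to exp(tau_k^{-1} F^A)
pi_bar = softmax(F_A / tau_k + F_Q / upsilon_k)      # pi_bar  proportional to pi_k * exp(upsilon_k^{-1} F^Q)
assert np.allclose(pi_bar, softmax(np.log(pi_k) + F_Q / upsilon_k))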

Appendix E Proofs in Section 4

We provide a unified analysis of the policy evaluation and policy improvement steps. Both steps can be described as minimizing the following mean-squared loss:

\beta(t+1)=\mathop{\mathrm{argmin}}_{\beta\in\mathbb{B}(R_{\beta},\beta_{0})}\mathbb{E}_{(s,\mathrm{d}_{\mathcal{S}},\overline{a})\sim\rho}\left[(F_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a})-\zeta_{F_{\beta}}(s,\mathrm{d}_{\mathcal{S}},\overline{a}))^{2}\right], (11)

where the operator ζ\zeta maps function FβF_{\beta} to

ζFβ(s,d𝒮,a¯)=𝔼s,d𝒮(|s,d𝒮,a¯),a¯π(|s,d𝒮)[τξ(s,d𝒮,a¯)+μFβ(s,d𝒮,a¯)].\displaystyle\zeta_{F_{\beta}}(s,\mathrm{d}_{\mathcal{S}},\overline{a})=\mathbb{E}_{s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime}\sim\mathbb{P}(\cdot|s,\mathrm{d}_{\mathcal{S}},\overline{a}),\overline{a}^{\prime}\sim\pi(\cdot|s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime})}\left[\tau\xi(s,\mathrm{d}_{\mathcal{S}},\overline{a})+\mu F_{\beta}(s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime})\right]. (12)

Formulation (11) includes policy improvement and policy evaluation as special cases:

Policy Improvement. This corresponds to ρ=σ~k\rho=\widetilde{\sigma}_{k}, \xi(s,\mathrm{d}_{\mathcal{S}},\overline{a})=\tau_{k+1}(\upsilon_{k}^{-1}F^{Q}_{\theta_{k}}(s,\mathrm{d}_{\mathcal{S}},\overline{a})+\tau_{k}^{-1}F^{A}_{\alpha_{k}}(s,\mathrm{d}_{\mathcal{S}},\overline{a})), and \tau=1,\mu=0.

Policy Evaluation. This corresponds to ρ=σk\rho=\sigma_{k}, τ=1−γ\tau=1-\gamma, μ=γ\mu=\gamma, ξ(s,d𝒮,a¯)=r(s,d𝒮,a¯)\xi(s,\mathrm{d}_{\mathcal{S}},\overline{a})=r(s,\mathrm{d}_{\mathcal{S}},\overline{a}). Note that ζFθ(s,d𝒮,a¯)\zeta_{F_{\theta}}(s,\mathrm{d}_{\mathcal{S}},\overline{a}) equals the Bellman operator applied to FθF_{\theta}: ζFθ(s,d𝒮,a¯)=𝒯πFθ(s,d𝒮,a¯)\zeta_{F_{\theta}}(s,\mathrm{d}_{\mathcal{S}},\overline{a})={\mathcal{T}}^{\pi}F_{\theta}(s,\mathrm{d}_{\mathcal{S}},\overline{a}).

To solve problem (11), we consider the following generic update rule:

\displaystyle\beta(t+1/2)=\beta(t)-\eta\left(F_{\beta(t)}(s,\mathrm{d}_{\mathcal{S}},\overline{a})-\tau\xi(s,\mathrm{d}_{\mathcal{S}},\overline{a})-\mu F_{\beta(t)}(s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime})\right)\nabla_{\beta}F_{\beta(t)}(s,\mathrm{d}_{\mathcal{S}},\overline{a}), (13)
\displaystyle\beta(t+1)=\Pi_{\mathbb{B}(R_{\beta},\beta_{0})}(\beta(t+1/2)), (14)

where (s,\mathrm{d}_{\mathcal{S}},\overline{a})\sim\rho, (s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime})\sim\mathbb{P}(\cdot|s,\mathrm{d}_{\mathcal{S}},\overline{a}), \overline{a}^{\prime}\sim\pi(\cdot|s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime}), and \eta>0 is the step size.

Instead of knowing d𝒮\mathrm{d}_{\mathcal{S}} exactly, we only have access to it through NN independent samples, represented as the states of the NN agents. Hence, we perform the following update:

\displaystyle\beta(t+1/2)=\beta(t)-\eta\left(F_{\beta(t)}(s,\mathbf{s},\overline{a})-\tau\xi(s,\mathrm{d}_{\mathcal{S}},\overline{a})-\mu F_{\beta(t)}(s^{\prime},\mathbf{s}^{\prime},\overline{a}^{\prime})\right)\nabla_{\beta}F_{\beta(t)}(s,\mathbf{s},\overline{a}), (15)
\displaystyle\beta(t+1)=\Pi_{\mathbb{B}(R_{\beta},\beta_{0})}(\beta(t+1/2)), (16)

where (s,d𝒮,a¯)ρ,(s,d𝒮)(|s,d𝒮,a¯),a¯π(|s,d𝒮)(s,\mathrm{d}_{\mathcal{S}},\overline{a})\sim\rho,(s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime})\sim\mathbb{P}(\cdot|s,\mathrm{d}_{\mathcal{S}},\overline{a}),\overline{a}^{\prime}\sim\pi(\cdot|s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime}); and 𝐬i.i.d.d𝒮,𝐬i.i.d.d𝒮\mathbf{s}\overset{\mathrm{i.i.d.}}{\sim}\mathrm{d}_{\mathcal{S}},\mathbf{s}^{\prime}\overset{\mathrm{i.i.d.}}{\sim}\mathrm{d}_{\mathcal{S}}^{\prime}.

Policy Improvement Updates. For ρ=σ~k\rho=\widetilde{\sigma}_{k}, τ=1,μ=0,Rβ=Rα\tau=1,\mu=0,R_{\beta}=R_{\alpha} and \xi(s,\mathrm{d}_{\mathcal{S}},\overline{a})=\tau_{k+1}(\upsilon_{k}^{-1}F^{Q}_{\theta_{k}}(s,\mathrm{d}_{\mathcal{S}},\overline{a})+\tau_{k}^{-1}F^{A}_{\alpha_{k}}(s,\mathrm{d}_{\mathcal{S}},\overline{a})), we recover the policy improvement update in Algorithm 3.

Policy Evaluation Updates. For ρ=σk\rho=\sigma_{k}, τ=(1−γ),μ=γ,ξ(s,d𝒮,a¯)=r(s,d𝒮,a¯),Rβ=Rθ\tau=(1-\gamma),\mu=\gamma,\xi(s,\mathrm{d}_{\mathcal{S}},\overline{a})=r(s,\mathrm{d}_{\mathcal{S}},\overline{a}),R_{\beta}=R_{\theta}, we recover the policy evaluation update in Algorithm 2.

We define a few more pieces of notation before we proceed:

(residual): δβ(t)(s,d𝒮,a¯,s,d𝒮,a¯)=Fβ(t)(s,d𝒮,a¯)τξ(s,d𝒮,a¯)μFβ(t)(s,d𝒮,a¯),\displaystyle~{}~{}~{}\delta_{\beta(t)}(s,\mathrm{d}_{\mathcal{S}},\overline{a},s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime})=F_{\beta(t)}(s,\mathrm{d}_{\mathcal{S}},\overline{a})-\tau\xi(s,\mathrm{d}_{\mathcal{S}},\overline{a})-\mu F_{\beta(t)}(s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime}),
(stochastic semi-gradient): \displaystyle~{}~{}~{}g_{\beta(t)}(s,\mathrm{d}_{\mathcal{S}},\overline{a},s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime})=\delta_{\beta(t)}(s,\mathrm{d}_{\mathcal{S}},\overline{a},s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime})\nabla_{\beta}F_{\beta(t)}(s,\mathrm{d}_{\mathcal{S}},\overline{a}),
(population semi-gradient): \displaystyle~{}~{}~{}\overline{g}_{\beta(t)}=\mathbb{E}_{s,\mathrm{d}_{\mathcal{S}},\overline{a},s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime}}\,g_{\beta(t)}(s,\mathrm{d}_{\mathcal{S}},\overline{a},s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime}),

where (s,d𝒮,a¯)ρ,(s,d𝒮)(|s,d𝒮,a¯),a¯πk(|s,d𝒮)(s,\mathrm{d}_{\mathcal{S}},\overline{a})\sim\rho,(s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime})\sim\mathbb{P}(\cdot|s,\mathrm{d}_{\mathcal{S}},\overline{a}),\overline{a}^{\prime}\sim\pi_{k}(\cdot|s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime}).

We also define the associated finite sample approximation:

\displaystyle\widehat{\delta}_{\beta(t)}(s,\mathbf{s},\overline{a},s^{\prime},\mathbf{s}^{\prime},\overline{a}^{\prime})=F_{\beta(t)}(s,\mathbf{s},\overline{a})-\tau\xi(s,\mathrm{d}_{\mathcal{S}},\overline{a})-\mu F_{\beta(t)}(s^{\prime},\mathbf{s}^{\prime},\overline{a}^{\prime}),
\displaystyle\widehat{g}_{\beta(t)}(s,\mathbf{s},\overline{a},s^{\prime},\mathbf{s}^{\prime},\overline{a}^{\prime})=\widehat{\delta}_{\beta(t)}(s,\mathbf{s},\overline{a},s^{\prime},\mathbf{s}^{\prime},\overline{a}^{\prime})\nabla_{\beta}F_{\beta(t)}(s,\mathbf{s},\overline{a}),

where \mathbf{s}\stackrel{\mathrm{i.i.d.}}{\sim}\mathrm{d}_{\mathcal{S}} and \mathbf{s}^{\prime}\stackrel{\mathrm{i.i.d.}}{\sim}\mathrm{d}_{\mathcal{S}}^{\prime}. Note that \widehat{g} is not an unbiased estimator of g, since \mathbb{E}_{\mathbf{s},\mathbf{s}^{\prime}}\widehat{g}(s,\mathbf{s},\overline{a},s^{\prime},\mathbf{s}^{\prime},\overline{a}^{\prime})\neq g(s,\mathrm{d}_{\mathcal{S}},\overline{a},s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime}).

E.1 Linearization at Initialization

For a two layer ReLU network defined as fβ,u(s,x,a)=1mj=1mujσ(βj(s,x,a))f_{\beta,u}(s,x,a)=\frac{1}{\sqrt{m}}\sum_{j=1}^{m}u_{j}\sigma(\beta_{j}^{\top}(s,x,a)), where σ(x)=max{0,x}\sigma(x)=\max\{0,x\}, we define its linearization at the initialization

fβ0(s,x,a)=1mj=1muj𝟙{βj(0)(s,x,a)>0}βj(s,x,a).\displaystyle f^{0}_{\beta}(s,x,a)=\frac{1}{\sqrt{m}}\sum_{j=1}^{m}u_{j}\mathbbm{1}\{\beta_{j}(0)^{\top}(s,x,a)>0\}\beta_{j}^{\top}(s,x,a).

Note that fβ0(s,x,a¯)=βfβ(0)(s,x,a),βf^{0}_{\beta}(s,x,\overline{a})=\left\langle\nabla_{\beta}f_{\beta(0)}(s,x,a),\beta\right\rangle, which equivalently means fβ0f^{0}_{\beta} is the local linearization of fβf_{\beta} at its initialization β(0)\beta(0) (note we do not train the second layer uu).

We define

Fβ0(s,𝐬,a¯)=1Ni=1Nfβ0(s,si,a¯),Fβ0(s,d𝒮,a¯)=𝔼xd𝒮fβ0(s,x,a¯),\displaystyle F^{0}_{\beta}(s,\mathbf{s},\overline{a})=\frac{1}{N}\sum_{i=1}^{N}f^{0}_{\beta}(s,s_{i},\overline{a}),~{}~{}~{}F^{0}_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a})=\mathbb{E}_{x\sim\mathrm{d}_{\mathcal{S}}}f^{0}_{\beta}(s,x,\overline{a}),

and similarly define δ0,g0,g¯0,δ^0\delta^{0},g^{0},\overline{g}^{0},\widehat{\delta}^{0}, and g^0\widehat{g}^{0}, with everything related to F_{\beta} and f_{\beta} replaced by F^{0}_{\beta} and f^{0}_{\beta}.
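A brief numpy sketch of the two quantities just defined, for a single (s, 𝐬, a̅), is given below; the network sizes and inputs are arbitrary illustrations, and the rows of Z are stand-ins for the concatenated features (s, s_i, a̅). It evaluates the two-layer ReLU network f_β, its linearization f^0_β at the initialization β(0), and their empirical mean-field averages F_β(s,𝐬,a̅) and F^0_β(s,𝐬,a̅) over the N agent states.

import numpy as np

rng = np.random.default_rng(1)
m, d, N = 256, 12, 30                           # hidden units, feature dimension, number of agents
beta0 = rng.normal(size=(m, d)) / np.sqrt(d)    # beta_j(0) ~ N(0, I_d / d)
u = rng.choice([-1.0, 1.0], size=m)             # u_j ~ Unif{-1, +1}, kept fixed (not trained)
beta = beta0 + 0.05 * rng.normal(size=(m, d))   # a perturbed parameter inside a small ball around beta(0)

def f(b, z):                                    # f_beta(z) = (1/sqrt(m)) sum_j u_j relu(b_j . z)
    return (u * np.maximum(b @ z, 0.0)).sum() / np.sqrt(m)

def f_lin(b, z):                                # f^0_beta(z): ReLU activation pattern frozen at beta(0)
    return (u * (beta0 @ z > 0) * (b @ z)).sum() / np.sqrt(m)

Z = rng.normal(size=(N, d))
Z /= np.linalg.norm(Z, axis=1, keepdims=True)   # enforce ||(s, x, a_bar)||_2 <= 1
F_emp = np.mean([f(beta, z) for z in Z])        # F_beta(s, s_vec, a_bar)
F0_emp = np.mean([f_lin(beta, z) for z in Z])   # F^0_beta(s, s_vec, a_bar)
print(abs(F_emp - F0_emp))                      # small when beta stays close to beta(0)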

Our main objective is to show that the objective in (11) is well approximated when everything related to FF is replaced by its local linearization F0F^{0}, and that the linearized stochastic semi-gradient g^0\widehat{g}^{0} remains close to the population semi-gradient g¯\overline{g} when the number of hidden units mm and the number of agents NN are sufficiently large. We can then conclude that learning with the stochastic semi-gradient g^\widehat{g} suffices to solve (11) to high accuracy in this regime. For ease of exposition, we also make the following assumptions.

Assumption 3.

We assume (s,x,a¯)21\|(s,x,\overline{a})\|_{2}\leq 1 for all (s,x,a¯)𝒮×𝒮×𝒜(s,x,\overline{a})\in{\mathcal{S}}\times{\mathcal{S}}\times\mathcal{A}.

Assumption 4.

The function ξ(s,d𝒮,a¯)\xi(s,\mathrm{d}_{\mathcal{S}},\overline{a}) satisfies: for each (s,d𝒮)𝒮¯(s,\mathrm{d}_{\mathcal{S}})\in\overline{{\mathcal{S}}} and any action aa, we have (ξ(s,d𝒮,a¯))2𝔼xd𝒮τ1(fβ(0)(s,x,a¯))2+τ2Rβ2+τ3(\xi(s,\mathrm{d}_{\mathcal{S}},\overline{a}))^{2}\leq\mathbb{E}_{x\sim\mathrm{d}_{\mathcal{S}}}\tau_{1}(f_{\beta(0)}(s,x,\overline{a}))^{2}+\tau_{2}R_{\beta}^{2}+\tau_{3}.

Remark: We will verify that Assumption 4 holds with (τ1,τ2,τ3)=(0,0,r¯2)(\tau_{1},\tau_{2},\tau_{3})=(0,0,\overline{r}^{2}) for the policy evaluation step, and with (τ1,τ2,τ3)=(4,4,0)(\tau_{1},\tau_{2},\tau_{3})=(4,4,0) for the policy improvement step.

We first bound the difference between Fβ0(s,d𝒮,a¯)F^{0}_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a}) and Fβ(s,d𝒮,a¯)F_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a}).

Lemma E.1.

Under Assumptions 2 and 3, we have

𝔼init,ρ|Fβ0(s,d𝒮,a¯)Fβ(s,d𝒮,a¯)|2𝒪(Rβ3m1/2).\displaystyle\mathbb{E}_{\mathrm{init},\rho}|F^{0}_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a})-F_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a})|^{2}\leq\mathcal{O}(R_{\beta}^{3}m^{-1/2}).
Proof.

By the definition of Fβ0(s,d𝒮,a¯)F^{0}_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a}) and Fβ(s,d𝒮,a¯)F_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a}), and Jensen’s inequality, we have

𝔼init,s,d𝒮,a¯|Fβ0(s,d𝒮,a¯)Fβ(s,d𝒮,a¯)|\displaystyle\mathbb{E}_{\mathrm{init},s,\mathrm{d}_{\mathcal{S}},\overline{a}}|F^{0}_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a})-F_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a})| 𝔼init,s,d𝒮,a¯,x|fβ0(s,x,a¯)fβ(s,x,a¯)|.\displaystyle\leq\mathbb{E}_{\mathrm{init},s,\mathrm{d}_{\mathcal{S}},\overline{a},x}|f^{0}_{\beta}(s,x,\overline{a})-f_{\beta}(s,x,\overline{a})|.

We can further bound the right hand side with

𝔼init,s,d𝒮,a¯,x|fβ0(s,x,a¯)fβ(s,x,a¯)|\displaystyle\mathbb{E}_{\mathrm{init},s,\mathrm{d}_{\mathcal{S}},\overline{a},x}|f^{0}_{\beta}(s,x,\overline{a})-f_{\beta}(s,x,\overline{a})|
\displaystyle=\mathbb{E}_{\mathrm{init},s,\mathrm{d}_{\mathcal{S}},\overline{a},x}\frac{1}{\sqrt{m}}\sum_{j=1}^{m}\lvert\mathbbm{1}\{\beta_{j}(0)^{\top}(s,x,\overline{a})>0\}-\mathbbm{1}\{\beta_{j}^{\top}(s,x,\overline{a})>0\}\rvert\,\lvert\beta_{j}^{\top}(s,x,\overline{a})\rvert
\displaystyle\leq 𝔼init,s,d𝒮,a¯,x1mj=1m|𝟙{βj(0)(s,x,a¯)>0}𝟙{βj(s,x,a¯)>0}|βjβj(0)2,\displaystyle\mathbb{E}_{\mathrm{init},s,\mathrm{d}_{\mathcal{S}},\overline{a},x}\frac{1}{\sqrt{m}}\sum_{j=1}^{m}\lvert\mathbbm{1}\{\beta_{j}(0)^{\top}(s,x,\overline{a})>0\}-\mathbbm{1}\{\beta_{j}^{\top}(s,x,\overline{a})>0\}\rvert\|\beta_{j}-\beta_{j}(0)\|_{2},

where the last inequality comes from the fact that \mathbbm{1}\{\beta_{j}(0)^{\top}(s,x,\overline{a})>0\}\neq\mathbbm{1}\{\beta_{j}^{\top}(s,x,\overline{a})>0\} implies \lvert\beta_{j}^{\top}(s,x,\overline{a})\rvert\leq\lvert(\beta_{j}-\beta_{j}(0))^{\top}(s,x,\overline{a})\rvert\leq\|\beta_{j}-\beta_{j}(0)\|_{2}\|(s,x,\overline{a})\|_{2}\leq\|\beta_{j}-\beta_{j}(0)\|_{2}. Applying the Cauchy–Schwarz inequality, we have

𝔼init,s,d𝒮,a|Fβ0(s,d𝒮,a¯)Fβ(s,d𝒮,a¯)|2\displaystyle\mathbb{E}_{\mathrm{init},s,\mathrm{d}_{\mathcal{S}},a}|F^{0}_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a})-F_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a})|^{2}
\displaystyle\leq\mathbb{E}_{\mathrm{init},s,\mathrm{d}_{\mathcal{S}},\overline{a},x}\frac{1}{m}\left(\sum_{j=1}^{m}\lvert\mathbbm{1}\{\beta_{j}(0)^{\top}(s,x,\overline{a})>0\}-\mathbbm{1}\{\beta_{j}^{\top}(s,x,\overline{a})>0\}\rvert\right)\left(\sum_{j=1}^{m}\|\beta_{j}-\beta_{j}(0)\|_{2}^{2}\right)
\displaystyle\leq 𝔼init,s,d𝒮,a¯,xRβ2mj=1m𝟙{|βj(0)(s,x,a¯)|<βj(0)βj2}\displaystyle\mathbb{E}_{\mathrm{init},s,\mathrm{d}_{\mathcal{S}},\overline{a},x}\frac{R_{\beta}^{2}}{m}\sum_{j=1}^{m}\mathbbm{1}\{\lvert\beta_{j}(0)^{\top}(s,x,\overline{a})\rvert<\|\beta_{j}(0)-\beta_{j}\|_{2}\}
\displaystyle\leq 𝔼initcRβ2mj=1mβj(0)βj2βj(0)2\displaystyle\mathbb{E}_{\mathrm{init}}\frac{cR_{\beta}^{2}}{m}\sum_{j=1}^{m}\frac{\|\beta_{j}(0)-\beta_{j}\|_{2}}{\|\beta_{j}(0)\|_{2}}
\displaystyle\leq 𝔼initcRβ2m(j=1mβj(0)βj22)1/2(j=1mβj(0)22)1/2=𝒪(cRβ3m1/2),\displaystyle\mathbb{E}_{\mathrm{init}}\frac{cR_{\beta}^{2}}{m}\left(\sum_{j=1}^{m}\|\beta_{j}(0)-\beta_{j}\|_{2}^{2}\right)^{1/2}\left(\sum_{j=1}^{m}\|\beta_{j}(0)\|_{2}^{-2}\right)^{1/2}=\mathcal{O}\left(\frac{cR_{\beta}^{3}}{m^{1/2}}\right),

where in the third inequality we use Assumption 2.

Next, we bound the difference between g¯\overline{g} and g¯0\overline{g}^{0}.

Lemma E.2.

Under Assumptions 2, 3 and 4, we have

𝔼initg¯βg¯β022𝒪(Rβ3m1/2).\displaystyle\mathbb{E}_{\mathrm{init}}\|\overline{g}_{\beta}-\overline{g}^{0}_{\beta}\|_{2}^{2}\leq\mathcal{O}\left(\frac{R_{\beta}^{3}}{m^{1/2}}\right).
Proof.

With decomposition

\displaystyle g_{\beta}-g^{0}_{\beta}=\left(F_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a})-F^{0}_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a})-\mu\bigl(F_{\beta}(s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime})-F^{0}_{\beta}(s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime})\bigr)\right)\nabla_{\beta}F_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a})
\displaystyle\quad+\left(F^{0}_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a})-\tau\xi(s,\mathrm{d}_{\mathcal{S}},\overline{a})-\mu F^{0}_{\beta}(s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime})\right)\left(\nabla_{\beta}F_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a})-\nabla_{\beta}F^{0}_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a})\right),

we apply basic inequality a+b222(a22+b22)\|a+b\|_{2}^{2}\leq 2(\|a\|_{2}^{2}+\|b\|_{2}^{2}) and obtain

\displaystyle\|\overline{g}_{\beta}-\overline{g}^{0}_{\beta}\|_{2}^{2}\leq 2\left(\mathbb{E}_{s,\mathrm{d}_{\mathcal{S}},\overline{a},s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime}}\lvert F_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a})-F^{0}_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a})-\mu(F_{\beta}(s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime})-F^{0}_{\beta}(s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime}))\rvert\,\|\nabla_{\beta}F_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a})\|_{2}\right)^{2}
\displaystyle\quad+2\left(\mathbb{E}_{s,\mathrm{d}_{\mathcal{S}},\overline{a},s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime}}\lvert F^{0}_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a})-\tau\xi(s,\mathrm{d}_{\mathcal{S}},\overline{a})-\mu F^{0}_{\beta}(s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime})\rvert\,\|\nabla_{\beta}F_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a})-\nabla_{\beta}F^{0}_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a})\|_{2}\right)^{2}. (17)

We have that both \nabla_{\beta}F_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a}) and \nabla_{\beta}F^{0}_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a}) take the form \mathbb{E}_{x\sim\mathrm{d}_{\mathcal{S}}}\frac{1}{\sqrt{m}}\left(\mathbbm{1}\{\cdot\}(s,x,\overline{a}),\ldots,\mathbbm{1}\{\cdot\}(s,x,\overline{a})\right), with the indicators evaluated at \beta_{j} and \beta_{j}(0), respectively. With the assumption that \|(s,x,\overline{a})\|_{2}\leq 1, this gives \|\nabla_{\beta}F_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a})\|_{2}\leq 1. In addition, we have

\displaystyle\mathbb{E}_{\mathrm{init},s,\mathrm{d}_{\mathcal{S}},\overline{a},s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime}}\lvert F_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a})-F^{0}_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a})-\mu(F_{\beta}(s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime})-F^{0}_{\beta}(s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime}))\rvert^{2}=\mathcal{O}(R_{\beta}^{3}m^{-1/2}),

as implied by Lemma E.1. Hence the first term in (17) is of order 𝒪(Rβ3m−1/2)\mathcal{O}(R_{\beta}^{3}m^{-1/2}).

To bound the second term in (17), by the Cauchy–Schwarz inequality we have

\displaystyle\left(\mathbb{E}_{s,\mathrm{d}_{\mathcal{S}},\overline{a},s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime}}\lvert F^{0}_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a})-\tau\xi(s,\mathrm{d}_{\mathcal{S}},\overline{a})-\mu F^{0}_{\beta}(s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime})\rvert\,\|\nabla_{\beta}F_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a})-\nabla_{\beta}F^{0}_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a})\|_{2}\right)^{2}
\displaystyle\leq\mathbb{E}_{s,\mathrm{d}_{\mathcal{S}},\overline{a},s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime}}\lvert F^{0}_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a})-\tau\xi(s,\mathrm{d}_{\mathcal{S}},\overline{a})-\mu F^{0}_{\beta}(s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime})\rvert^{2}\;\mathbb{E}_{s,\mathrm{d}_{\mathcal{S}},\overline{a}}\|\nabla_{\beta}F_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a})-\nabla_{\beta}F^{0}_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a})\|_{2}^{2}.

From Jensen’s inequality we have

\displaystyle\lvert F^{0}_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a})-\tau\xi(s,\mathrm{d}_{\mathcal{S}},\overline{a})-\mu F^{0}_{\beta}(s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime})\rvert^{2}\leq\mathbb{E}\lvert f^{0}_{\beta}(s,x,\overline{a})-\tau\xi(s,\mathrm{d}_{\mathcal{S}},\overline{a})-\mu f^{0}_{\beta}(s^{\prime},x^{\prime},\overline{a}^{\prime})\rvert^{2}
3𝔼[(fβ0(s,x,a¯))2+(τξ(s,d𝒮,a¯))2+(μfβ0(s,x,a¯))2].\displaystyle\leq 3\mathbb{E}\left[(f^{0}_{\beta}(s,x,\overline{a}))^{2}+(\tau\xi(s,\mathrm{d}_{\mathcal{S}},\overline{a}))^{2}+(\mu f^{0}_{\beta}(s^{\prime},x^{\prime},\overline{a}^{\prime}))^{2}\right].

Since \nabla f^{0}_{\beta}=\nabla f^{0}_{\beta(0)} for all β\beta, we have

fβ0(s,x,a¯)fβ(0)0(s,x,a¯)+βfβ02ββ(0)2fβ(0)0(s,x,a¯)+Rβ.\displaystyle f^{0}_{\beta}(s,x,\overline{a})\leq f^{0}_{\beta(0)}(s,x,\overline{a})+\|\nabla_{\beta}f^{0}_{\beta}\|_{2}\|\beta-\beta(0)\|_{2}\leq f^{0}_{\beta(0)}(s,x,\overline{a})+R_{\beta}.

We have 3\left(f^{0}_{\beta}(s,x,\overline{a})^{2}+(\mu f^{0}_{\beta}(s^{\prime},x^{\prime},\overline{a}^{\prime}))^{2}\right)\leq 6\left(f_{\beta(0)}(s,x,\overline{a})^{2}+(\mu f_{\beta(0)}(s^{\prime},x^{\prime},\overline{a}^{\prime}))^{2}\right)+12R_{\beta}^{2}. Note that from Assumption 4, \xi(s,\mathrm{d}_{\mathcal{S}},\overline{a})^{2}\leq\tau_{1}\mathbb{E}_{x\sim\mathrm{d}_{\mathcal{S}}}(f_{\beta(0)}(s,x,\overline{a}))^{2}+\tau_{2}R_{\beta}^{2}+\tau_{3}. Hence

\displaystyle\mathbb{E}_{s,\mathrm{d}_{\mathcal{S}},\overline{a},s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime}}\lvert F^{0}_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a})-\tau\xi(s,\mathrm{d}_{\mathcal{S}},\overline{a})-\mu F^{0}_{\beta}(s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime})\rvert^{2}
\displaystyle\leq\mathbb{E}_{s,\mathrm{d}_{\mathcal{S}},\overline{a},s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime}}\mathbb{E}_{x,x^{\prime}}\left[6f_{\beta(0)}(s,x,\overline{a})^{2}+6\mu^{2}f_{\beta(0)}(s^{\prime},x^{\prime},\overline{a}^{\prime})^{2}+12R_{\beta}^{2}+3\tau^{2}\xi(s,\mathrm{d}_{\mathcal{S}},\overline{a})^{2}\right]
\displaystyle=\mathbb{E}_{s,\mathrm{d}_{\mathcal{S}},x,\overline{a}}\left[6f_{\beta(0)}(s,x,\overline{a})^{2}+6\mu^{2}f_{\beta(0)}(s,x,\overline{a})^{2}+12R_{\beta}^{2}+3\tau^{2}\xi(s,\mathrm{d}_{\mathcal{S}},\overline{a})^{2}\right]
\displaystyle\leq\mathbb{E}_{s,\mathrm{d}_{\mathcal{S}},x,\overline{a}}\left[6f_{\beta(0)}(s,x,\overline{a})^{2}+6\mu^{2}f_{\beta(0)}(s,x,\overline{a})^{2}+12R_{\beta}^{2}+3\tau^{2}\tau_{1}f_{\beta(0)}(s,x,\overline{a})^{2}+3\tau^{2}\tau_{2}R_{\beta}^{2}+3\tau^{2}\tau_{3}\right]. (18)

On the other hand, we have

\displaystyle\mathbb{E}_{s,\mathrm{d}_{\mathcal{S}},\overline{a}}\|\nabla_{\beta}F_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a})-\nabla_{\beta}F^{0}_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a})\|_{2}^{2}
\displaystyle\leq 1m𝔼s,d𝒮,a,x(𝟙{β1(0)(s,x,a¯)>0}𝟙{β1(s,x,a¯)>0},,𝟙{βm(0)(s,x,a¯)>0}𝟙{βm(s,x,a¯)>0})22(s,x,a¯)22\displaystyle\frac{1}{m}\mathbb{E}_{s,\mathrm{d}_{\mathcal{S}},a,x}\|\left(\mathbbm{1}\{\beta_{1}(0)^{\top}(s,x,\overline{a})>0\}-\mathbbm{1}\{\beta_{1}^{\top}(s,x,\overline{a})>0\},\ldots,\mathbbm{1}\{\beta_{m}(0)^{\top}(s,x,\overline{a})>0\}-\mathbbm{1}\{\beta_{m}^{\top}(s,x,\overline{a})>0\}\right)\|_{2}^{2}\|(s,x,\overline{a})\|_{2}^{2}
\displaystyle\leq 𝔼s,d𝒮,a¯,x1mj=1m(𝟙{βj(0)(s,x,a¯)>0}𝟙{βj(s,x,a¯)>0})2\displaystyle\mathbb{E}_{s,\mathrm{d}_{\mathcal{S}},\overline{a},x}\frac{1}{m}\sum_{j=1}^{m}(\mathbbm{1}\{\beta_{j}(0)^{\top}(s,x,\overline{a})>0\}-\mathbbm{1}\{\beta_{j}^{\top}(s,x,\overline{a})>0\})^{2}
\displaystyle\leq\mathbb{E}_{s,\mathrm{d}_{\mathcal{S}},\overline{a},x}\frac{1}{m}\sum_{j=1}^{m}\mathbbm{1}\{\lvert\beta_{j}(0)^{\top}(s,x,\overline{a})\rvert\leq\|\beta_{j}(0)-\beta_{j}\|_{2}\}
\displaystyle\leq cmj=1mβj(0)βj2βj(0)2\displaystyle\frac{c}{m}\sum_{j=1}^{m}\frac{\|\beta_{j}(0)-\beta_{j}\|_{2}}{\|\beta_{j}(0)\|_{2}}
\displaystyle\leq\frac{c}{m}\left(\sum_{j=1}^{m}\|\beta_{j}(0)-\beta_{j}\|_{2}^{2}\right)^{1/2}\left(\sum_{j=1}^{m}\|\beta_{j}(0)\|_{2}^{-2}\right)^{1/2}
\displaystyle\leq cRβm(j=1mβj(0)22)1/2,\displaystyle\frac{cR_{\beta}}{m}\left(\sum_{j=1}^{m}\|\beta_{j}(0)\|_{2}^{-2}\right)^{1/2}, (19)

where in the fourth inequality we use Assumption 1, and in the final inequality we use \left(\sum_{j=1}^{m}\|\beta_{j}(0)-\beta_{j}\|_{2}^{2}\right)^{1/2}\leq R_{\beta}.

Combining (18) and (19), to bound the second term in (17), it remains to bound the following:

𝔼s,d𝒮,a¯,x{fβ(0)(s,x,a¯)2(j=1mβj(0)22)1/2(cRβm)}\displaystyle\mathbb{E}_{s,\mathrm{d}_{\mathcal{S}},\overline{a},x}\left\{f_{\beta(0)}(s,x,\overline{a})^{2}\left(\sum_{j=1}^{m}\|\beta_{j}(0)\|_{2}^{-2}\right)^{1/2}\left(\frac{cR_{\beta}}{m}\right)\right\}
=\displaystyle= cRβm2𝔼s,d𝒮,a¯,x(j=1mσ2(βj(0)(s,x,a¯))+klmukulσ(βk(0)(s,x,a¯))σ(βl(0)(s,x,a¯)))(j=1mβj(0)22)1/2\displaystyle\frac{cR_{\beta}}{m^{2}}\mathbb{E}_{s,\mathrm{d}_{\mathcal{S}},\overline{a},x}\left(\sum_{j=1}^{m}\sigma^{2}(\beta_{j}(0)^{\top}(s,x,\overline{a}))+\sum_{k\neq l}^{m}u_{k}u_{l}\sigma(\beta_{k}(0)^{\top}(s,x,\overline{a}))\sigma(\beta_{l}(0)^{\top}(s,x,\overline{a}))\right)\left(\sum_{j=1}^{m}\|\beta_{j}(0)\|_{2}^{-2}\right)^{1/2}
\displaystyle\leq cRβm2𝔼s,d𝒮,a¯,x(j=1mβj(0)22+klmukulσ(βk(0)(s,x,a¯))σ(βl(0)(s,x,a¯)))(j=1mβj(0)22)1/2,\displaystyle\frac{cR_{\beta}}{m^{2}}\mathbb{E}_{s,\mathrm{d}_{\mathcal{S}},\overline{a},x}\left(\sum_{j=1}^{m}\|\beta_{j}(0)\|_{2}^{2}+\sum_{k\neq l}^{m}u_{k}u_{l}\sigma(\beta_{k}(0)^{\top}(s,x,\overline{a}))\sigma(\beta_{l}(0)^{\top}(s,x,\overline{a}))\right)\left(\sum_{j=1}^{m}\|\beta_{j}(0)\|_{2}^{-2}\right)^{1/2},

Taking expectation with respect to initialization and noticing that 𝔼init{ukul}=0\mathbb{E}_{\mathrm{init}}\left\{u_{k}u_{l}\right\}=0 for klk\neq l, we have

cRβm2𝔼init,s,d𝒮,a¯,x(j=1mβj(0)22+klmukulσ(βk(0)(s,x,a¯))σ(βl(0)(s,x,a¯)))(j=1mβj(0)22)1/2\displaystyle\frac{cR_{\beta}}{m^{2}}\mathbb{E}_{\mathrm{init},s,\mathrm{d}_{\mathcal{S}},\overline{a},x}\left(\sum_{j=1}^{m}\|\beta_{j}(0)\|_{2}^{2}+\sum_{k\neq l}^{m}u_{k}u_{l}\sigma(\beta_{k}(0)^{\top}(s,x,\overline{a}))\sigma(\beta_{l}(0)^{\top}(s,x,\overline{a}))\right)\left(\sum_{j=1}^{m}\|\beta_{j}(0)\|_{2}^{-2}\right)^{1/2}
=\displaystyle= cRβm2𝔼init(j=1mβj(0)22)(j=1mβj(0)22)1/2\displaystyle\frac{cR_{\beta}}{m^{2}}\mathbb{E}_{\mathrm{init}}\left(\sum_{j=1}^{m}\|\beta_{j}(0)\|_{2}^{2}\right)\left(\sum_{j=1}^{m}\|\beta_{j}(0)\|_{2}^{-2}\right)^{1/2}
\displaystyle\leq cRβm2𝔼init1/2(j=1mβj(0)22)2𝔼init1/2(j=1mβj(0)22)\displaystyle\frac{cR_{\beta}}{m^{2}}\mathbb{E}_{\mathrm{init}}^{1/2}\left(\sum_{j=1}^{m}\|\beta_{j}(0)\|_{2}^{2}\right)^{2}\mathbb{E}_{\mathrm{init}}^{1/2}\left(\sum_{j=1}^{m}\|\beta_{j}(0)\|_{2}^{-2}\right)
=\displaystyle= 𝒪(cRβm1/2).\displaystyle\mathcal{O}\left(\frac{cR_{\beta}}{m^{1/2}}\right).

Then the second term in (17) is of order 𝒪(max{Rβ3m−1/2,Rβm−1/2})=𝒪(Rβ3m−1/2)\mathcal{O}\left(\max\{\frac{R_{\beta}^{3}}{m^{1/2}},\frac{R_{\beta}}{m^{1/2}}\}\right)=\mathcal{O}\left(\frac{R_{\beta}^{3}}{m^{1/2}}\right), which completes the proof. ∎

We then bound the variance of the stochastic semi-gradient gβ(s,d𝒮,a¯,s′,d𝒮′,a¯′)g_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a},s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime}).

Lemma E.3.

With Assumption 4, there exists ς2=𝒪(Rβ2)\varsigma^{2}=\mathcal{O}(R_{\beta}^{2}), such that:

𝔼init,s,d𝒮,a¯,s,d𝒮,a¯gβ(s,d𝒮,a¯,s,d𝒮,a¯)g¯β22ς2.\displaystyle\mathbb{E}_{\mathrm{init},s,\mathrm{d}_{\mathcal{S}},\overline{a},s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime}}\|g_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a},s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime})-\overline{g}_{\beta}\|_{2}^{2}\leq\varsigma^{2}.
Proof.

We have

𝔼s,d𝒮,a¯,s,d𝒮,a¯gβ(s,d𝒮,a¯,s,d𝒮,a¯)g¯β22\displaystyle\mathbb{E}_{s,\mathrm{d}_{\mathcal{S}},\overline{a},s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime}}\|g_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a},s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime})-\overline{g}_{\beta}\|_{2}^{2} 𝔼s,d𝒮,a¯,s,d𝒮,a¯gβ(s,d𝒮,a¯,s,d𝒮,a¯)22\displaystyle\leq\mathbb{E}_{s,\mathrm{d}_{\mathcal{S}},\overline{a},s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime}}\|g_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a},s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime})\|_{2}^{2}
\displaystyle\leq\mathbb{E}_{s,\mathrm{d}_{\mathcal{S}},\overline{a},s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime}}\|\delta_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a},s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime})\nabla_{\beta}F_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a})\|_{2}^{2}.

With \nabla_{\beta}F_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a})=\mathbb{E}_{x}\nabla_{\beta}f_{\beta}(s,x,\overline{a})=\mathbb{E}_{x}\frac{1}{\sqrt{m}}\left(\mathbbm{1}\{\beta_{1}^{\top}(s,x,\overline{a})>0\}(s,x,\overline{a}),\ldots,\mathbbm{1}\{\beta_{m}^{\top}(s,x,\overline{a})>0\}(s,x,\overline{a})\right), we have \|\nabla_{\beta}F_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a})\|_{2}\leq 1. Then

𝔼s,d𝒮,a¯,s,d𝒮,a¯gβ(s,d𝒮,a¯,s,d𝒮,a¯)g¯β22\displaystyle\mathbb{E}_{s,\mathrm{d}_{\mathcal{S}},\overline{a},s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime}}\|g_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a},s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime})-\overline{g}_{\beta}\|_{2}^{2}
\displaystyle\leq\mathbb{E}_{s,\mathrm{d}_{\mathcal{S}},\overline{a},s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime}}\left[\delta_{\beta}^{2}(s,\mathrm{d}_{\mathcal{S}},\overline{a},s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime})\right]
=\displaystyle= 𝔼s,d𝒮,a¯,s,d𝒮,a¯(Fβ(s,d𝒮,a¯)τξ(s,d𝒮,a¯)μFβ(s,d𝒮,a¯))2\displaystyle\mathbb{E}_{s,\mathrm{d}_{\mathcal{S}},\overline{a},s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime}}\left(F_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a})-\tau\xi(s,\mathrm{d}_{\mathcal{S}},\overline{a})-\mu F_{\beta}(s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime})\right)^{2}
\displaystyle\leq 𝔼s,d𝒮,a¯,s,d𝒮,a¯,x,x(fβ(s,x,a¯)τξ(s,d𝒮,a¯)μfβ(s,x,a¯))2\displaystyle\mathbb{E}_{s,\mathrm{d}_{\mathcal{S}},\overline{a},s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime},x,x^{\prime}}\left(f_{\beta}(s,x,\overline{a})-\tau\xi(s,\mathrm{d}_{\mathcal{S}},\overline{a})-\mu f_{\beta}(s^{\prime},x^{\prime},\overline{a}^{\prime})\right)^{2}
\displaystyle\leq 3𝔼s,d𝒮,a¯,x[(1+μ2)(fβ(s,x,a¯))2+(τξ(s,d𝒮,a¯))2]\displaystyle 3\mathbb{E}_{s,\mathrm{d}_{\mathcal{S}},\overline{a},x}\left[(1+\mu^{2})\left(f_{\beta}(s,x,\overline{a})\right)^{2}+\left(\tau\xi(s,\mathrm{d}_{\mathcal{S}},\overline{a})\right)^{2}\right]
\displaystyle\leq 3𝔼s,d𝒮,a¯,x[(1+μ2)(fβ(0)(s,x,a¯))2+τ1(τfβ(0)(s,x,a¯))2]+𝒪(Rβ2)+𝒪(τ3).\displaystyle 3\mathbb{E}_{s,\mathrm{d}_{\mathcal{S}},\overline{a},x}\left[(1+\mu^{2})\left(f_{\beta(0)}(s,x,\overline{a})\right)^{2}+\tau_{1}\left(\tau f_{\beta(0)}(s,x,\overline{a})\right)^{2}\right]+\mathcal{O}(R_{\beta}^{2})+\mathcal{O}(\tau_{3}).

where in the last inequality we use |fβ(0)(s,x,a¯)fβ(s,x,a¯)|ββ(0)2Rβ\lvert f_{\beta(0)}(s,x,\overline{a})-f_{\beta}(s,x,\overline{a})\rvert\leq\|\beta-\beta(0)\|_{2}\leq R_{\beta} and Assumption 4. Finally, note that

𝔼init,s,d𝒮,a¯,x[fβ(0)(s,x,a¯)2]\displaystyle\mathbb{E}_{\mathrm{init},s,\mathrm{d}_{\mathcal{S}},\overline{a},x}\left[f_{\beta(0)}(s,x,\overline{a})^{2}\right] =𝔼init,s,d𝒮,a¯,x1m(j=1mσ2(βj(0)(s,x,a¯))+klmukulσ(βk(0)(s,x,a¯))σ(βl(0)(s,x,a¯)))\displaystyle=\mathbb{E}_{\mathrm{init},s,\mathrm{d}_{\mathcal{S}},\overline{a},x}\frac{1}{m}\left(\sum_{j=1}^{m}\sigma^{2}(\beta_{j}(0)^{\top}(s,x,\overline{a}))+\sum_{k\neq l}^{m}u_{k}u_{l}\sigma(\beta_{k}(0)^{\top}(s,x,\overline{a}))\sigma(\beta_{l}(0)^{\top}(s,x,\overline{a}))\right)
=𝔼init,s,d𝒮,a¯,x1m(j=1mσ2(βj(0)(s,x,a¯)))\displaystyle=\mathbb{E}_{\mathrm{init},s,\mathrm{d}_{\mathcal{S}},\overline{a},x}\frac{1}{m}\left(\sum_{j=1}^{m}\sigma^{2}(\beta_{j}(0)^{\top}(s,x,\overline{a}))\right)
𝔼init1m(j=1mβj(0)22)=1.\displaystyle\leq\mathbb{E}_{\mathrm{init}}\frac{1}{m}\left(\sum_{j=1}^{m}\|\beta_{j}(0)\|_{2}^{2}\right)=1.

The claim follows immediately. ∎

We then bound the variance of g^(s,𝐬,a,s,𝐬,a¯)\widehat{g}(s,\mathbf{s},a,s^{\prime},\mathbf{s^{\prime}},\overline{a}^{\prime}) for fixed d𝒮,d𝒮\mathrm{d}_{\mathcal{S}},\mathrm{d}_{\mathcal{S}}^{\prime}.

Lemma E.4.

For any s,d𝒮,a¯,s′,d𝒮′,a¯′s,\mathrm{d}_{\mathcal{S}},\overline{a},s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime}, let \mathbf{s}\overset{\mathrm{i.i.d.}}{\sim}\mathrm{d}_{\mathcal{S}} and \mathbf{s}^{\prime}\overset{\mathrm{i.i.d.}}{\sim}\mathrm{d}_{\mathcal{S}}^{\prime} with |𝐬|=|𝐬′|=N|\mathbf{s}|=|\mathbf{s}^{\prime}|=N. Then we have

𝔼init,𝐬,𝐬g^β(s,𝐬,a¯,s,𝐬,a¯)gβ(s,d𝒮,a¯,s,d𝒮,a¯)22𝒪(Rβ2N).\displaystyle\mathbb{E}_{\mathrm{init},\mathbf{s},\mathbf{s}^{\prime}}\|\widehat{g}_{\beta}(s,\mathbf{s},\overline{a},s^{\prime},\mathbf{s^{\prime}},\overline{a}^{\prime})-g_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a},s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime})\|_{2}^{2}\leq\mathcal{O}\left(\frac{R_{\beta}^{2}}{N}\right).
Proof.

We have

g^β(s,𝐬,a¯,s,𝐬,a¯)gβ(s,d𝒮,a¯,s,d𝒮,a¯)\displaystyle\widehat{g}_{\beta}(s,\mathbf{s},\overline{a},s^{\prime},\mathbf{s^{\prime}},\overline{a}^{\prime})-g_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a},s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime})
\displaystyle=\widehat{\delta}_{\beta}(s,\mathbf{s},\overline{a},s^{\prime},\mathbf{s}^{\prime},\overline{a}^{\prime})\,\nabla_{\beta}\frac{1}{N}\sum_{x\in\mathbf{s}}f_{\beta}(s,x,\overline{a})-\delta_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a},s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime})\,\nabla_{\beta}\mathbb{E}_{x\sim\mathrm{d}_{\mathcal{S}}}f_{\beta}(s,x,\overline{a})
\displaystyle=\left(\widehat{\delta}_{\beta}(s,\mathbf{s},\overline{a},s^{\prime},\mathbf{s}^{\prime},\overline{a}^{\prime})-\delta_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a},s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime})\right)\nabla_{\beta}\frac{1}{N}\sum_{x\in\mathbf{s}}f_{\beta}(s,x,\overline{a})
\displaystyle\quad+\delta_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a},s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime})\left(\frac{1}{N}\sum_{x\in\mathbf{s}}\nabla_{\beta}f_{\beta}(s,x,\overline{a})-\nabla_{\beta}\mathbb{E}_{x\sim\mathrm{d}_{\mathcal{S}}}f_{\beta}(s,x,\overline{a})\right). (20)

For the first term, note that \Bigl{\lVert}\frac{1}{N}\sum_{x\in\mathbf{s}}\nabla_{\beta}f_{\beta}(s,x,\overline{a})\Bigr{\rVert}_{2}\leq 1; we then have

\displaystyle\mathbb{E}_{\mathbf{s},\mathbf{s}^{\prime}}\Bigl{\lVert}\left(\widehat{\delta}_{\beta}(s,\mathbf{s},\overline{a},s^{\prime},\mathbf{s}^{\prime},\overline{a}^{\prime})-\delta_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a},s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime})\right)\frac{1}{N}\sum_{x\in\mathbf{s}}\nabla_{\beta}f_{\beta}(s,x,\overline{a})\Bigr{\rVert}_{2}^{2}
\displaystyle\leq 𝔼𝐬,𝐬(δ^β(s,𝐬,a¯,s,𝐬,a¯)δβ(s,d𝒮,a¯,s,d𝒮,a¯))2\displaystyle\mathbb{E}_{\mathbf{s},\mathbf{s}^{\prime}}\left(\widehat{\delta}_{\beta}(s,\mathbf{s},\overline{a},s^{\prime},\mathbf{s^{\prime}},\overline{a}^{\prime})-\delta_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a},s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime})\right)^{2}
\displaystyle\leq 𝔼x,x(fβ(s,x,a¯)τξ(s,d𝒮,a¯)μfβ(s,x,a¯))2/N\displaystyle\mathbb{E}_{x,x^{\prime}}\left(f_{\beta}(s,x,\overline{a})-\tau\xi(s,\mathrm{d}_{\mathcal{S}},\overline{a})-\mu f_{\beta}(s^{\prime},x^{\prime},\overline{a}^{\prime})\right)^{2}/N
\displaystyle\leq 3𝔼x,x[(fβ(0)(s,x,a¯))2+μ2(fβ(0)(s,x,a¯))2+τ1(τfβ(0)(s,x,a¯))2]/N+𝒪(Rβ2/N)+𝒪(τ3).\displaystyle 3\mathbb{E}_{x,x^{\prime}}\left[\left(f_{\beta(0)}(s,x,\overline{a})\right)^{2}+\mu^{2}\left(f_{\beta(0)}(s^{\prime},x^{\prime},\overline{a}^{\prime})\right)^{2}+\tau_{1}\left(\tau f_{\beta(0)}(s,x,\overline{a})\right)^{2}\right]/N+\mathcal{O}(R_{\beta}^{2}/N)+\mathcal{O}(\tau_{3}).

where in the last inequality we use an argument similar to that in Lemma E.3. As shown in Lemma E.3, we have \mathbb{E}_{\mathrm{init},x}\left[f_{\beta(0)}(s,x,\overline{a})^{2}\right]\leq 1. Hence the first term in (20) is of order 𝒪(Rβ2/N)\mathcal{O}(R_{\beta}^{2}/N).

To bound the second term in (20), from Jensen’s inequality we have

𝔼initδβ(s,d𝒮,a¯,s,d𝒮,a¯)2\displaystyle\mathbb{E}_{\mathrm{init}}\delta_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a},s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime})^{2}
\displaystyle\leq 𝔼init,x,x(fβ(s,x,a¯)τξ(s,d𝒮,a¯)μfβ(s,x,a¯))2\displaystyle\mathbb{E}_{\mathrm{init},x,x^{\prime}}\left(f_{\beta}(s,x,\overline{a})-\tau\xi(s,\mathrm{d}_{\mathcal{S}},\overline{a})-\mu f_{\beta}(s^{\prime},x^{\prime},\overline{a}^{\prime})\right)^{2}
\displaystyle\leq 3𝔼init,x,x[(fβ(0)(s,x,a¯))2+μ2(fβ(0)(s,x,a¯))2+τ1(τfβ(0)(s,x,a¯))2]+𝒪(Rβ2)+𝒪(τ3)\displaystyle 3\mathbb{E}_{\mathrm{init},x,x^{\prime}}\left[\left(f_{\beta(0)}(s,x,\overline{a})\right)^{2}+\mu^{2}\left(f_{\beta(0)}(s^{\prime},x^{\prime},\overline{a}^{\prime})\right)^{2}+\tau_{1}\left(\tau f_{\beta(0)}(s,x,\overline{a})\right)^{2}\right]+\mathcal{O}(R_{\beta}^{2})+\mathcal{O}(\tau_{3})
=\displaystyle= 𝒪(Rβ2).\displaystyle\mathcal{O}\left(R_{\beta}^{2}\right).

On the other hand, we have

\displaystyle\mathbb{E}_{\mathbf{s}}\Bigl{\lVert}\frac{1}{N}\sum_{x\in\mathbf{s}}\nabla_{\beta}f_{\beta}(s,x,\overline{a})-\nabla_{\beta}\mathbb{E}_{x\sim\mathrm{d}_{\mathcal{S}}}f_{\beta}(s,x,\overline{a})\Bigr{\rVert}_{2}^{2}\leq\mathbb{E}_{x\sim\mathrm{d}_{\mathcal{S}}}\|\nabla_{\beta}f_{\beta}(s,x,\overline{a})\|_{2}^{2}/N\leq 1/N.

Therefore, taking the squared norm on both sides of (20) and using the Cauchy–Schwarz inequality, we have

\displaystyle\mathbb{E}_{\mathrm{init},\mathbf{s},\mathbf{s}^{\prime}}\|\widehat{g}_{\beta}(s,\mathbf{s},\overline{a},s^{\prime},\mathbf{s}^{\prime},\overline{a}^{\prime})-g_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a},s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime})\|_{2}^{2}
\displaystyle\leq 2\,\mathbb{E}_{\mathrm{init},\mathbf{s},\mathbf{s}^{\prime}}\left(\widehat{\delta}_{\beta}(s,\mathbf{s},\overline{a},s^{\prime},\mathbf{s}^{\prime},\overline{a}^{\prime})-\delta_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a},s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime})\right)^{2}\Bigl{\lVert}\frac{1}{N}\sum_{x\in\mathbf{s}}\nabla_{\beta}f_{\beta}(s,x,\overline{a})\Bigr{\rVert}_{2}^{2}
\displaystyle\quad+2\,\mathbb{E}_{\mathrm{init}}\left\{\delta^{2}_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a},s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime})\right\}\mathbb{E}_{\mathrm{init},\mathbf{s}}\Bigl{\lVert}\frac{1}{N}\sum_{x\in\mathbf{s}}\nabla_{\beta}f_{\beta}(s,x,\overline{a})-\nabla_{\beta}\mathbb{E}_{x\sim\mathrm{d}_{\mathcal{S}}}f_{\beta}(s,x,\overline{a})\Bigr{\rVert}_{2}^{2}
=\displaystyle= 𝒪(Rβ2/N).\displaystyle\mathcal{O}(R_{\beta}^{2}/N).

With Lemma E.3 and Lemma E.4, we can now bound the difference between g^β\widehat{g}_{\beta} and g¯β\overline{g}_{\beta}.

Lemma E.5.

With Assumption 4, there exists ς2=𝒪(Rβ2)\varsigma^{2}=\mathcal{O}(R_{\beta}^{2}), such that

𝔼init,s,d𝒮,a¯,s,d𝒮,a¯,𝐬,𝐬g^βg¯β22ς2+𝒪(Rβ2/N).\displaystyle\mathbb{E}_{\mathrm{init},s,\mathrm{d}_{\mathcal{S}},\overline{a},s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime},\mathbf{s},\mathbf{s}^{\prime}}\|\widehat{g}_{\beta}-\overline{g}_{\beta}\|_{2}^{2}\leq\varsigma^{2}+\mathcal{O}(R_{\beta}^{2}/N).
Proof.

We have

𝔼g^βg¯β22\displaystyle\mathbb{E}\|\widehat{g}_{\beta}-\overline{g}_{\beta}\|_{2}^{2}\leq 2𝔼g^β(s,𝐬,a,s,𝐬,a)gβ(s,d𝒮,a¯,s,d𝒮,a¯)22\displaystyle 2\mathbb{E}\|\widehat{g}_{\beta}(s,\mathbf{s},a,s^{\prime},\mathbf{s}^{\prime},a^{\prime})-g_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a},s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime})\|_{2}^{2}
+2𝔼gβ(s,d𝒮,a¯,s,d𝒮,a¯)g¯β22\displaystyle+2\mathbb{E}\|g_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a},s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime})-\overline{g}_{\beta}\|_{2}^{2}

Now applying Lemma E.3 and Lemma E.4, the claim follows immediately. ∎

We are now ready to show the convergence of updates (13) and (14) for solving problem (11). We denote by F^{0}_{\beta_{*}} the function within the class \mathcal{F}_{R,m} that satisfies the stationarity condition:

\Pi_{\mathbb{B}(R_{\beta},\beta_{0})}(\beta_{*}-\overline{g}^{0}_{\beta_{*}})=\beta_{*}, (21)

where \Pi_{X}(\cdot) denotes the Euclidean projection onto the set X. Condition (21) is equivalent to:

\left\langle\overline{g}^{0}_{\beta_{*}},\beta-\beta_{*}\right\rangle\geq 0\quad\text{for all }\beta\in\mathbb{B}(R_{\beta},\beta_{0}).

Note that \overline{g}^{0}_{\beta_{*}}=\mathbb{E}_{s,\mathrm{d}_{\mathcal{S}},\overline{a},s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime}}\left\{\delta_{\beta_{*}}^{0}(s,\mathrm{d}_{\mathcal{S}},\overline{a},s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime})\nabla F^{0}_{\beta_{*}}(s,\mathrm{d}_{\mathcal{S}},\overline{a})\right\}, and \left\langle\nabla F^{0}_{\beta_{*}}(s,\mathrm{d}_{\mathcal{S}},\overline{a}),\beta-\beta_{*}\right\rangle=F^{0}_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a})-F^{0}_{\beta_{*}}(s,\mathrm{d}_{\mathcal{S}},\overline{a}). Hence we have

\mathbb{E}_{s,\mathrm{d}_{\mathcal{S}},\overline{a},s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime}}\left\{\delta_{\beta_{*}}^{0}(s,\mathrm{d}_{\mathcal{S}},\overline{a},s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime})\left(F^{0}_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a})-F^{0}_{\beta_{*}}(s,\mathrm{d}_{\mathcal{S}},\overline{a})\right)\right\}\geq 0\quad\text{for all }\beta\in\mathbb{B}(R_{\beta},\beta_{0}). (22)

With the definition of the operator \zeta_{F} in (12), we know that

\mathbb{E}_{s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime}}\delta_{\beta_{*}}^{0}(s,\mathrm{d}_{\mathcal{S}},\overline{a},s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime})=F^{0}_{\beta_{*}}(s,\mathrm{d}_{\mathcal{S}},\overline{a})-\zeta_{F^{0}_{\beta_{*}}}(s,\mathrm{d}_{\mathcal{S}},\overline{a}). (23)

From (22) and (23), we have

\left\langle F^{0}_{\beta_{*}}-\zeta_{F^{0}_{\beta_{*}}},F^{0}_{\beta}-F^{0}_{\beta_{*}}\right\rangle_{\rho}\geq 0,

which is equivalent to F^{0}_{\beta_{*}}=\Pi_{\mathcal{F}_{R_{\beta},m}}(\zeta_{F^{0}_{\beta_{*}}}). That is, F^{0}_{\beta_{*}} is the projection (with respect to the metric induced by \left\langle\cdot,\cdot\right\rangle_{\rho}) of the image of itself under the operator \zeta.
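To make this projected fixed-point characterization concrete, the sketch below runs a population-level projected temporal-difference iteration in a toy linear function class and then checks the stationarity condition (21) at the limit point. Everything in the snippet (the random Markov reward process, the feature matrix Phi, the radius R_beta) is an illustrative stand-in rather than an object from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy Markov reward process: n states, row-stochastic P, reward r, discount mu.
n, d, mu = 20, 5, 0.8
P = rng.dirichlet(np.ones(n), size=n)
r = rng.uniform(size=n)
Phi = rng.normal(size=(n, d)) / np.sqrt(d)   # fixed features, analog of grad F at initialization

# Sample states from the stationary distribution of P, the standard condition
# under which the population semi-gradient iteration below converges.
evals, evecs = np.linalg.eig(P.T)
rho = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
rho = rho / rho.sum()

R_beta = 5.0                                  # radius of the parameter ball

def project_ball(beta, radius=R_beta):
    nrm = np.linalg.norm(beta)
    return beta if nrm <= radius else beta * (radius / nrm)

def population_semi_gradient(beta):
    # g_bar(beta) = E_rho[(F_beta(s) - r(s) - mu * E[F_beta(s') | s]) * grad F_beta(s)]
    delta = Phi @ beta - (r + mu * P @ (Phi @ beta))
    return Phi.T @ (rho * delta)

eta, beta = 0.05, np.zeros(d)
for _ in range(20000):
    beta = project_ball(beta - eta * population_semi_gradient(beta))

# Stationarity condition (21): the limit point is a fixed point of the projected update.
residual = project_ball(beta - eta * population_semi_gradient(beta)) - beta
print(np.linalg.norm(residual))  # close to zero
```

When the unconstrained solution lies outside the ball, the limit point sits on the boundary and satisfies the variational inequality above rather than a zero-gradient condition, which is exactly the distinction drawn here.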

Theorem E.1.

Let \{\beta(t)\}_{t=0}^{T-1} be generated by updates (15) and (16), and define \overline{\beta}_{T}=\frac{1}{T}\sum_{t=0}^{T-1}\beta(t). Then we have

\mathbb{E}_{\mathrm{init},s,\mathrm{d}_{\mathcal{S}},\overline{a}}\left[F^{0}_{\beta_{*}}(s,\mathrm{d}_{\mathcal{S}},\overline{a})-F_{\overline{\beta}_{T}}(s,\mathrm{d}_{\mathcal{S}},\overline{a})\right]^{2}\leq\mathcal{O}\left(\frac{R_{\beta}^{2}}{T^{1/2}}+\frac{R_{\beta}^{5/2}}{m^{1/4}}+\frac{R_{\beta}^{2}}{N^{1/2}}+\frac{R_{\beta}^{3}}{m^{1/2}}\right).
Proof.

Conditioning on the t-th iteration, we have

\mathbb{E}_{s,\mathrm{d}_{\mathcal{S}},\overline{a},s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime},\mathbf{s},\mathbf{s}^{\prime}}\left[\|\beta(t+1)-\beta_{*}\|_{2}^{2}|\beta(t)\right]
=\mathbb{E}\|\Pi_{\mathbb{B}^{0}_{R_{\beta}}}(\beta(t)-\eta\widehat{g}_{\beta(t)})-\Pi_{\mathbb{B}^{0}_{R_{\beta}}}(\beta_{*}-\eta\overline{g}^{0}_{\beta_{*}})\|_{2}^{2}
\leq\mathbb{E}\|\beta(t)-\eta\widehat{g}_{\beta(t)}-\beta_{*}+\eta\overline{g}^{0}_{\beta_{*}}\|_{2}^{2}
=\mathbb{E}\left(\|\beta(t)-\beta_{*}\|_{2}^{2}-\eta\left\langle\beta(t)-\beta_{*},\widehat{g}_{\beta(t)}-\overline{g}^{0}_{\beta_{*}}\right\rangle+\eta^{2}\|\widehat{g}_{\beta(t)}-\overline{g}^{0}_{\beta_{*}}\|_{2}^{2}\right), (24)

where the inequality follows from the non-expansiveness of the Euclidean projection.
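For completeness, the non-expansiveness used in this step follows from the variational characterization of the Euclidean projection onto a convex set X, namely \left\langle x-\Pi_{X}(x),z-\Pi_{X}(x)\right\rangle\leq 0 for all z\in X. Applying this at x with z=\Pi_{X}(y), at y with z=\Pi_{X}(x), and adding the two inequalities yields

\|\Pi_{X}(x)-\Pi_{X}(y)\|_{2}^{2}\leq\left\langle\Pi_{X}(x)-\Pi_{X}(y),x-y\right\rangle\leq\|\Pi_{X}(x)-\Pi_{X}(y)\|_{2}\|x-y\|_{2},

so that \|\Pi_{X}(x)-\Pi_{X}(y)\|_{2}\leq\|x-y\|_{2}.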

We first consider the second term in (24). We have

\mathbb{E}_{s,\mathrm{d}_{\mathcal{S}},\overline{a},s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime},\mathbf{s},\mathbf{s}^{\prime}}\left\langle\beta(t)-\beta_{*},\widehat{g}_{\beta(t)}-\overline{g}^{0}_{\beta_{*}}\right\rangle
=\mathbb{E}\left[\left\langle\beta(t)-\beta_{*},\widehat{g}_{\beta(t)}-g_{\beta(t)}\right\rangle+\left\langle\beta(t)-\beta_{*},g_{\beta(t)}-\overline{g}_{\beta(t)}\right\rangle+\left\langle\beta(t)-\beta_{*},\overline{g}_{\beta(t)}-\overline{g}^{0}_{\beta_{*}}\right\rangle\right]
=\mathbb{E}\left\langle\beta(t)-\beta_{*},\widehat{g}_{\beta(t)}-g_{\beta(t)}\right\rangle+\left\langle\beta(t)-\beta_{*},\overline{g}_{\beta(t)}-\overline{g}^{0}_{\beta_{*}}\right\rangle
\geq\mathbb{E}\left\langle\beta(t)-\beta_{*},\overline{g}_{\beta(t)}-\overline{g}^{0}_{\beta_{*}}\right\rangle-R_{\beta}\left(\mathbb{E}\|\widehat{g}_{\beta(t)}-g_{\beta(t)}\|_{2}^{2}\right)^{1/2}
\geq\mathbb{E}\left\langle\beta(t)-\beta_{*},\overline{g}_{\beta(t)}-\overline{g}^{0}_{\beta(t)}\right\rangle+\left\langle\beta(t)-\beta_{*},\overline{g}^{0}_{\beta(t)}-\overline{g}^{0}_{\beta_{*}}\right\rangle-R_{\beta}\left(\mathbb{E}\|\widehat{g}_{\beta(t)}-g_{\beta(t)}\|_{2}^{2}\right)^{1/2}, (25)

where the second equality uses the definition \mathbb{E}\,g_{\beta(t)}=\overline{g}_{\beta(t)}. We further have

\left\langle\beta(t)-\beta_{*},\overline{g}_{\beta(t)}-\overline{g}^{0}_{\beta(t)}\right\rangle\geq-R_{\beta}\|\overline{g}_{\beta(t)}-\overline{g}^{0}_{\beta(t)}\|_{2}. (26)

Using again \overline{g}^{0}_{\beta}=\mathbb{E}\left\{\delta_{\beta}^{0}(s,\mathrm{d}_{\mathcal{S}},\overline{a},s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime})\nabla F^{0}_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a})\right\}, together with \nabla F^{0}_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a})=\nabla F^{0}_{\beta_{*}}(s,\mathrm{d}_{\mathcal{S}},\overline{a}) for all \beta and \left\langle\nabla F^{0}_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a}),\beta-\beta_{*}\right\rangle=F^{0}_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a})-F^{0}_{\beta_{*}}(s,\mathrm{d}_{\mathcal{S}},\overline{a}), we have

\mathbb{E}_{\mathrm{init},s,\mathrm{d}_{\mathcal{S}},\overline{a},s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime}}\left\langle\beta(t)-\beta_{*},\overline{g}^{0}_{\beta(t)}-\overline{g}^{0}_{\beta_{*}}\right\rangle
=\mathbb{E}\left(\delta^{0}_{\beta(t)}(s,\mathrm{d}_{\mathcal{S}},\overline{a},s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime})-\delta^{0}_{\beta_{*}}(s,\mathrm{d}_{\mathcal{S}},\overline{a},s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime})\right)\left(F^{0}_{\beta(t)}(s,\mathrm{d}_{\mathcal{S}},\overline{a})-F^{0}_{\beta_{*}}(s,\mathrm{d}_{\mathcal{S}},\overline{a})\right)
=\mathbb{E}\left(F^{0}_{\beta(t)}(s,\mathrm{d}_{\mathcal{S}},\overline{a})-F^{0}_{\beta_{*}}(s,\mathrm{d}_{\mathcal{S}},\overline{a})-\mu\left[F^{0}_{\beta(t)}(s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime})-F^{0}_{\beta_{*}}(s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime})\right]\right)\left(F^{0}_{\beta(t)}(s,\mathrm{d}_{\mathcal{S}},\overline{a})-F^{0}_{\beta_{*}}(s,\mathrm{d}_{\mathcal{S}},\overline{a})\right)
\geq(1-\mu)\mathbb{E}_{\mathrm{init},s,\mathrm{d}_{\mathcal{S}},\overline{a}}\left(F^{0}_{\beta_{*}}(s,\mathrm{d}_{\mathcal{S}},\overline{a})-F^{0}_{\beta(t)}(s,\mathrm{d}_{\mathcal{S}},\overline{a})\right)^{2}, (27)

where in the last line we used the Cauchy-Schwarz inequality, together with the fact that (s,\mathrm{d}_{\mathcal{S}},\overline{a}) and (s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime}) share the same distribution. That is,

\mathbb{E}_{\mathrm{init},s,\mathrm{d}_{\mathcal{S}},\overline{a},s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime}}\left[F^{0}_{\beta(t)}(s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime})-F^{0}_{\beta_{*}}(s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime})\right]\left[F^{0}_{\beta(t)}(s,\mathrm{d}_{\mathcal{S}},\overline{a})-F^{0}_{\beta_{*}}(s,\mathrm{d}_{\mathcal{S}},\overline{a})\right]
\leq\left(\mathbb{E}\left[F^{0}_{\beta(t)}(s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime})-F^{0}_{\beta_{*}}(s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime})\right]^{2}\mathbb{E}\left[F^{0}_{\beta(t)}(s,\mathrm{d}_{\mathcal{S}},\overline{a})-F^{0}_{\beta_{*}}(s,\mathrm{d}_{\mathcal{S}},\overline{a})\right]^{2}\right)^{1/2}
=\mathbb{E}_{\mathrm{init},s,\mathrm{d}_{\mathcal{S}},\overline{a}}\left[F^{0}_{\beta(t)}(s,\mathrm{d}_{\mathcal{S}},\overline{a})-F^{0}_{\beta_{*}}(s,\mathrm{d}_{\mathcal{S}},\overline{a})\right]^{2}.

Combining (25), (26), and (27), the second term in (24) can be bounded by

-\eta\mathbb{E}\left\langle\beta(t)-\beta_{*},\widehat{g}_{\beta(t)}-\overline{g}^{0}_{\beta_{*}}\right\rangle
\leq-\eta\mathbb{E}\left[-R_{\beta}\|\overline{g}_{\beta(t)}-\overline{g}^{0}_{\beta(t)}\|_{2}+(1-\mu)\mathbb{E}\left(F^{0}_{\beta_{*}}(s,\mathrm{d}_{\mathcal{S}},\overline{a})-F^{0}_{\beta(t)}(s,\mathrm{d}_{\mathcal{S}},\overline{a})\right)^{2}\right]
+\eta R_{\beta}\left(\mathbb{E}\|\widehat{g}_{\beta(t)}-g_{\beta(t)}\|_{2}^{2}\right)^{1/2}. (28)

To bound the third term in (24), we have

\mathbb{E}\|\widehat{g}_{\beta(t)}-\overline{g}^{0}_{\beta_{*}}\|_{2}^{2}\leq 3\mathbb{E}\left(\|\widehat{g}_{\beta(t)}-\overline{g}_{\beta(t)}\|_{2}^{2}+\|\overline{g}_{\beta(t)}-\overline{g}^{0}_{\beta(t)}\|_{2}^{2}+\|\overline{g}^{0}_{\beta(t)}-\overline{g}^{0}_{\beta_{*}}\|_{2}^{2}\right). (29)

We can use Lemma E.5 and Lemma E.2 to control the first two terms in (29). To bound the third term, note that \nabla F^{0}_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a})=\nabla F_{\beta(0)}(s,\mathrm{d}_{\mathcal{S}},\overline{a}) for any \beta and \|\nabla F_{\beta(0)}(s,\mathrm{d}_{\mathcal{S}},\overline{a})\|_{2}\leq 1, so that

\|\overline{g}^{0}_{\beta(t)}-\overline{g}^{0}_{\beta_{*}}\|_{2}^{2}\leq\left[\mathbb{E}\left|\delta^{0}_{\beta(t)}(s,\mathrm{d}_{\mathcal{S}},\overline{a},s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime})-\delta^{0}_{\beta_{*}}(s,\mathrm{d}_{\mathcal{S}},\overline{a},s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime})\right|\,\|\nabla F^{0}_{\beta(0)}(s,\mathrm{d}_{\mathcal{S}},\overline{a})\|_{2}\right]^{2}
\leq\mathbb{E}\left[\left(\delta^{0}_{\beta(t)}(s,\mathrm{d}_{\mathcal{S}},\overline{a},s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime})-\delta^{0}_{\beta_{*}}(s,\mathrm{d}_{\mathcal{S}},\overline{a},s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime})\right)\|\nabla F^{0}_{\beta(0)}(s,\mathrm{d}_{\mathcal{S}},\overline{a})\|_{2}\right]^{2}
\leq\mathbb{E}\left[\delta^{0}_{\beta(t)}(s,\mathrm{d}_{\mathcal{S}},\overline{a},s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime})-\delta^{0}_{\beta_{*}}(s,\mathrm{d}_{\mathcal{S}},\overline{a},s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime})\right]^{2}
=\mathbb{E}\left(F^{0}_{\beta(t)}(s,\mathrm{d}_{\mathcal{S}},\overline{a})-F^{0}_{\beta_{*}}(s,\mathrm{d}_{\mathcal{S}},\overline{a})-\mu\left[F^{0}_{\beta(t)}(s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime})-F^{0}_{\beta_{*}}(s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime})\right]\right)^{2}
\leq 2(1+\mu^{2})\mathbb{E}\left[F^{0}_{\beta(t)}(s,\mathrm{d}_{\mathcal{S}},\overline{a})-F^{0}_{\beta_{*}}(s,\mathrm{d}_{\mathcal{S}},\overline{a})\right]^{2}, (30)

where the last inequality follows from the fact that (s,\mathrm{d}_{\mathcal{S}},\overline{a}) and (s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime}) share the same distribution.
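In more detail, writing X=F^{0}_{\beta(t)}(s,\mathrm{d}_{\mathcal{S}},\overline{a})-F^{0}_{\beta_{*}}(s,\mathrm{d}_{\mathcal{S}},\overline{a}) and X^{\prime} for the same difference evaluated at (s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime}), the last step of (30) uses

\mathbb{E}\left(X-\mu X^{\prime}\right)^{2}\leq 2\mathbb{E}X^{2}+2\mu^{2}\mathbb{E}(X^{\prime})^{2}=2(1+\mu^{2})\mathbb{E}X^{2},

where the final equality holds because X and X^{\prime} are identically distributed.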

Combining (28), (29), and (30), and conditioning on \beta(t), we have the following bound for (24):

\mathbb{E}\left[\|\beta(t+1)-\beta_{*}\|_{2}^{2}|\beta(t)\right]
\leq\|\beta(t)-\beta_{*}\|_{2}^{2}-\left(\eta(1-\mu)-6(1+\mu^{2})\eta^{2}\right)\mathbb{E}\left[F^{0}_{\beta_{*}}(s,\mathrm{d}_{\mathcal{S}},\overline{a})-F^{0}_{\beta(t)}(s,\mathrm{d}_{\mathcal{S}},\overline{a})\right]^{2}
+\eta R_{\beta}\left(\mathbb{E}\|\widehat{g}_{\beta(t)}-g_{\beta(t)}\|_{2}^{2}\right)^{1/2}+\eta R_{\beta}\|\overline{g}_{\beta(t)}-\overline{g}^{0}_{\beta(t)}\|_{2}
+3\eta^{2}\mathbb{E}\left(\|\widehat{g}_{\beta(t)}-\overline{g}_{\beta(t)}\|_{2}^{2}+\|\overline{g}_{\beta(t)}-\overline{g}^{0}_{\beta(t)}\|_{2}^{2}\right).

From Lemma E.4, we have \mathbb{E}\|\widehat{g}_{\beta(t)}-g_{\beta(t)}\|_{2}^{2}=\mathcal{O}(R_{\beta}^{2}/N). From Lemma E.2, we have \mathbb{E}\|\overline{g}_{\beta(t)}-\overline{g}^{0}_{\beta(t)}\|_{2}=\mathcal{O}\left(R_{\beta}^{3/2}/m^{1/4}\right). From Lemma E.5, we have \mathbb{E}\|\widehat{g}_{\beta(t)}-\overline{g}_{\beta(t)}\|_{2}^{2}=\mathcal{O}(R_{\beta}^{2}). Hence

\mathbb{E}\|\beta(t+1)-\beta_{*}\|_{2}^{2}
\leq\|\beta(t)-\beta_{*}\|_{2}^{2}-\left(\eta(1-\mu)-6(1+\mu^{2})\eta^{2}\right)\mathbb{E}\left[F^{0}_{\beta_{*}}(s,\mathrm{d}_{\mathcal{S}},\overline{a})-F^{0}_{\beta(t)}(s,\mathrm{d}_{\mathcal{S}},\overline{a})\right]^{2}
+\eta\mathcal{O}\left(\frac{R_{\beta}^{2}}{N^{1/2}}+\frac{R_{\beta}^{5/2}}{m^{1/4}}\right)+\eta^{2}\mathcal{O}(R_{\beta}^{2}).

Rearranging and telescoping, we have

\sum_{t=0}^{T-1}\mathbb{E}\left[F^{0}_{\beta_{*}}(s,\mathrm{d}_{\mathcal{S}},\overline{a})-F^{0}_{\beta(t)}(s,\mathrm{d}_{\mathcal{S}},\overline{a})\right]^{2}
\leq\left(\eta(1-\mu)-6(1+\mu^{2})\eta^{2}\right)^{-1}\mathbb{E}\left[\|\beta_{0}-\beta_{*}\|_{2}^{2}+\eta T\mathcal{O}\left(\frac{R_{\beta}^{2}}{N^{1/2}}+\frac{R_{\beta}^{5/2}}{m^{1/4}}\right)+T\eta^{2}\mathcal{O}(R_{\beta}^{2})\right]. (31)

Since each summand on the left-hand side of (31) is convex with respect to \beta(t) (recall that F^{0}_{\beta} is affine in \beta), we can average over \{\beta(t)\}_{t=0}^{T-1}. Defining \overline{\beta}_{T}=\frac{1}{T}\sum_{t=0}^{T-1}\beta(t), Jensen's inequality gives

\mathbb{E}\left[F^{0}_{\beta_{*}}(s,\mathrm{d}_{\mathcal{S}},\overline{a})-F^{0}_{\overline{\beta}_{T}}(s,\mathrm{d}_{\mathcal{S}},\overline{a})\right]^{2}
\leq\left(\eta(1-\mu)-6(1+\mu^{2})\eta^{2}\right)^{-1}\mathbb{E}_{\mathrm{init}}\left[\frac{\|\beta_{0}-\beta_{*}\|_{2}^{2}}{T}+\eta\mathcal{O}\left(\frac{R_{\beta}^{2}}{N^{1/2}}+\frac{R_{\beta}^{5/2}}{m^{1/4}}\right)+\eta^{2}\mathcal{O}(R_{\beta}^{2})\right].

Now taking \eta=\mathcal{O}(T^{-1/2}), we have

\mathbb{E}\left[F^{0}_{\beta_{*}}(s,\mathrm{d}_{\mathcal{S}},\overline{a})-F^{0}_{\overline{\beta}_{T}}(s,\mathrm{d}_{\mathcal{S}},\overline{a})\right]^{2}\leq\mathcal{O}\left(\frac{R_{\beta}^{2}}{T^{1/2}}+\frac{R_{\beta}^{5/2}}{m^{1/4}}+\frac{R_{\beta}^{2}}{N^{1/2}}\right).
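For concreteness, one sufficient instance of the \eta=\mathcal{O}(T^{-1/2}) scaling (stated here as an illustrative choice rather than the exact constant used elsewhere in the paper) is

\eta=\min\left\{\frac{1-\mu}{12(1+\mu^{2})},\,T^{-1/2}\right\},

under which \eta(1-\mu)-6(1+\mu^{2})\eta^{2}\geq\eta(1-\mu)/2, so the prefactor in the previous display is \mathcal{O}(1/\eta)=\mathcal{O}(T^{1/2}), and \|\beta_{0}-\beta_{*}\|_{2}^{2}\leq R_{\beta}^{2} turns the first term into the \mathcal{O}(R_{\beta}^{2}/T^{1/2}) term above.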

We can now apply Lemma E.1 and conclude that:

\mathbb{E}\left[F^{0}_{\beta_{*}}(s,\mathrm{d}_{\mathcal{S}},\overline{a})-F_{\overline{\beta}_{T}}(s,\mathrm{d}_{\mathcal{S}},\overline{a})\right]^{2}\leq 2\mathbb{E}\left[\left(F^{0}_{\beta_{*}}(s,\mathrm{d}_{\mathcal{S}},\overline{a})-F^{0}_{\overline{\beta}_{T}}(s,\mathrm{d}_{\mathcal{S}},\overline{a})\right)^{2}+\left(F^{0}_{\overline{\beta}_{T}}(s,\mathrm{d}_{\mathcal{S}},\overline{a})-F_{\overline{\beta}_{T}}(s,\mathrm{d}_{\mathcal{S}},\overline{a})\right)^{2}\right]
\leq\mathcal{O}\left(\frac{R_{\beta}^{2}}{T^{1/2}}+\frac{R_{\beta}^{5/2}}{m^{1/4}}+\frac{R_{\beta}^{2}}{N^{1/2}}+\frac{R_{\beta}^{3}}{m^{1/2}}\right). ∎

We can now specialize Theorem E.1 to policy evaluation and policy optimization. The four terms in the bound correspond to the optimization error (the T^{-1/2} term), the linearization error of the overparameterized network (the m^{-1/4} and m^{-1/2} terms), and the error from approximating the mean-field state distribution with N sampled agent states (the N^{-1/2} term).

Proof of Lemma 4.1.

Note that F^{0}_{\theta_{*}} satisfies F^{0}_{\theta_{*}}=\Pi_{\mathcal{F}^{\mathcal{P}}_{R_{\theta},m_{Q}}}(\zeta_{F^{0}_{\theta_{*}}}). For policy evaluation, by definition (12), the operator \zeta equals the Bellman evaluation operator: \zeta_{F_{\theta}}(s,\mathrm{d}_{\mathcal{S}},\overline{a})=\left[{\mathcal{T}}^{\pi_{k}}F_{\theta}\right](s,\mathrm{d}_{\mathcal{S}},\overline{a}). Hence we have F^{0}_{\theta_{*}}=\Pi_{\mathcal{F}^{\mathcal{P}}_{R_{\theta},m_{Q}}}(\left[{\mathcal{T}}^{\pi_{k}}F^{0}_{\theta_{*}}\right]). By Assumption 1, \left[{\mathcal{T}}^{\pi_{k}}F^{0}_{\theta_{*}}\right]\in\mathcal{F}^{\mathcal{P}}_{R_{\theta},m_{Q}}, and F^{0}_{\theta_{*}}\in\mathcal{F}^{\mathcal{P}}_{R_{\theta},m_{Q}} as well. Since the projection onto \mathcal{F}^{\mathcal{P}}_{R_{\theta},m_{Q}} leaves its elements unchanged, we have F^{0}_{\theta_{*}}=\left[{\mathcal{T}}^{\pi_{k}}F^{0}_{\theta_{*}}\right], and since Q^{\pi_{k}} is the unique fixed point of {\mathcal{T}}^{\pi_{k}}, this implies F^{0}_{\theta_{*}}=Q^{\pi_{k}}. The claim then follows immediately by applying Theorem E.1. ∎

Proof of Lemma 4.2.

Note that here we have F^{0}_{\alpha_{*}}=\Pi_{\mathcal{F}_{R_{\alpha},m_{A}}}(\zeta_{F^{0}_{\alpha_{*}}}). For policy optimization, the operator \zeta is defined by \zeta_{F^{0}_{\alpha_{*}}}(s,\mathrm{d}_{\mathcal{S}},\overline{a})=\tau_{k+1}(\upsilon_{k}^{-1}F^{Q}_{\theta_{k}}(s,\mathrm{d}_{\mathcal{S}},\overline{a})+\tau_{k}^{-1}F^{A}_{\alpha_{k}}(s,\mathrm{d}_{\mathcal{S}},\overline{a})). Then we have

\mathbb{E}_{\widetilde{\sigma}_{k}}\left[F^{A}_{\overline{\alpha}(T)}(s,\mathrm{d}_{\mathcal{S}},\overline{a})-\tau_{k+1}(\upsilon_{k}^{-1}F^{Q}_{\theta_{k}}(s,\mathrm{d}_{\mathcal{S}},\overline{a})+\tau_{k}^{-1}F^{A}_{\alpha_{k}}(s,\mathrm{d}_{\mathcal{S}},\overline{a}))\right]^{2}
\leq 2\mathbb{E}_{\widetilde{\sigma}_{k}}\left[F^{A}_{\overline{\alpha}(T)}(s,\mathrm{d}_{\mathcal{S}},\overline{a})-F^{0}_{\alpha_{*}}(s,\mathrm{d}_{\mathcal{S}},\overline{a})\right]^{2}+2\mathbb{E}_{\widetilde{\sigma}_{k}}\left[F^{0}_{\alpha_{*}}(s,\mathrm{d}_{\mathcal{S}},\overline{a})-\tau_{k+1}(\upsilon_{k}^{-1}F^{Q}_{\theta_{k}}(s,\mathrm{d}_{\mathcal{S}},\overline{a})+\tau_{k}^{-1}F^{A}_{\alpha_{k}}(s,\mathrm{d}_{\mathcal{S}},\overline{a}))\right]^{2}
\leq 2\mathbb{E}_{\widetilde{\sigma}_{k}}\left[\tau_{k+1}(\upsilon_{k}^{-1}F^{0}_{\theta_{k}}(s,\mathrm{d}_{\mathcal{S}},\overline{a})+\tau_{k}^{-1}F^{0}_{\alpha_{k}}(s,\mathrm{d}_{\mathcal{S}},\overline{a}))-\tau_{k+1}(\upsilon_{k}^{-1}F^{Q}_{\theta_{k}}(s,\mathrm{d}_{\mathcal{S}},\overline{a})+\tau_{k}^{-1}F^{A}_{\alpha_{k}}(s,\mathrm{d}_{\mathcal{S}},\overline{a}))\right]^{2}
+2\mathbb{E}_{\widetilde{\sigma}_{k}}\left[F^{A}_{\overline{\alpha}(T)}(s,\mathrm{d}_{\mathcal{S}},\overline{a})-F^{0}_{\alpha_{*}}(s,\mathrm{d}_{\mathcal{S}},\overline{a})\right]^{2}
\leq\mathcal{O}\left(\frac{R_{\alpha}^{2}}{T^{1/2}}+\frac{R_{\alpha}^{5/2}}{m_{A}^{1/4}}+\frac{R_{\alpha}^{2}}{N^{1/2}}+\frac{R_{\alpha}^{3}}{m_{A}^{1/2}}\right),

where the second inequality comes from the definition of the projection and F^{0}_{\alpha_{*}}=\Pi_{\mathcal{F}_{R_{\alpha},m_{A}}}(\zeta_{F^{0}_{\alpha_{*}}}), and the last inequality comes from a direct application of Theorem E.1 and Lemma E.1, together with the fact that F^{A}_{\alpha} and F^{Q}_{\theta} share the same initialization. ∎
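One way to read the regression target \tau_{k+1}(\upsilon_{k}^{-1}F^{Q}_{\theta_{k}}+\tau_{k}^{-1}F^{A}_{\alpha_{k}}) is through the energy-based form of the policy: if \pi_{k}(\cdot|s,\mathrm{d}_{\mathcal{S}})\propto\exp(\tau_{k}^{-1}F^{A}_{\alpha_{k}}(s,\mathrm{d}_{\mathcal{S}},\cdot)), then fitting the new energy to this target reproduces the proximal update \pi_{k+1}\propto\pi_{k}\exp(\upsilon_{k}^{-1}F^{Q}_{\theta_{k}}). The sketch below checks this identity numerically at a single (s,\mathrm{d}_{\mathcal{S}}); it assumes the energy-based parameterization and exact fitting, and none of the variable names come from the paper.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    p = np.exp(z)
    return p / p.sum()

rng = np.random.default_rng(2)
num_actions = 4
tau_k, tau_kp1, upsilon_k = 1.0, 0.5, 2.0     # hypothetical temperatures / step size

F_A_k = rng.normal(size=num_actions)          # current energy F^A at a fixed (s, d_S)
Q_k = rng.normal(size=num_actions)            # current critic F^Q at the same (s, d_S)

# Fit the new energy to the target of Lemma 4.2 (exactly, for illustration).
F_A_kp1 = tau_kp1 * (Q_k / upsilon_k + F_A_k / tau_k)

pi_k = softmax(F_A_k / tau_k)
pi_kp1_energy = softmax(F_A_kp1 / tau_kp1)            # policy induced by the new energy
pi_kp1_proximal = pi_k * np.exp(Q_k / upsilon_k)      # mirror-descent / PPO-style update
pi_kp1_proximal = pi_kp1_proximal / pi_kp1_proximal.sum()

print(np.allclose(pi_kp1_energy, pi_kp1_proximal))    # True
```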

Proof of Theorem 4.1.

From Lemma 4.3, we know that

(1-\gamma)\left(\mathcal{L}(\pi^{*})-\mathcal{L}(\pi_{k})\right)
\leq\mathbb{E}_{\nu_{*}}\big{\{}\upsilon_{k}\left[\mathrm{KL}\left(\pi^{*}(\cdot|s,\mathrm{d}_{\mathcal{S}})\|\pi_{k}(\cdot|s,\mathrm{d}_{\mathcal{S}})\right)-\mathrm{KL}\left(\pi^{*}(\cdot|s,\mathrm{d}_{\mathcal{S}})\|\pi_{k+1}(\cdot|s,\mathrm{d}_{\mathcal{S}})\right)+\varepsilon_{k}+\varepsilon_{k}^{\prime}\right]+\upsilon_{k}^{-1}M\big{\}}, (32)

where \epsilon_{k} and \epsilon_{k+1}^{\prime} are defined as in Lemma 4.1 and Lemma 4.2, respectively. In addition, \varepsilon_{k}=\tau_{k+1}^{-1}\epsilon_{k+1}^{\prime}\phi_{k+1}^{*}+\upsilon_{k}^{-1}\epsilon_{k}\psi_{k}^{*}, \varepsilon_{k}^{\prime}=\lvert\mathcal{A}\rvert\tau_{k+1}^{-2}(\epsilon_{k+1}^{\prime})^{2}, and M=\mathbb{E}_{\nu^{*}}\left[\max_{a\in\mathcal{A}}(F^{Q}_{\theta_{0}}(s,\mathrm{d}_{\mathcal{S}},\overline{a}))^{2}\right]+2R_{\alpha}^{2}. Here \phi_{k}^{*} and \psi_{k}^{*} are defined by:

\phi_{k}^{*}=\mathbb{E}_{\widetilde{\sigma}_{k}}\left[\lvert\mathrm{d}\pi^{*}/\mathrm{d}\pi_{0}-\mathrm{d}\pi_{k}/\mathrm{d}\pi_{0}\rvert^{2}\right]^{1/2},
\psi_{k}^{*}=\mathbb{E}_{\sigma_{k}}\left[\lvert\mathrm{d}\sigma^{*}/\mathrm{d}\sigma_{k}-\mathrm{d}(\nu^{*}\times\pi_{k})/\mathrm{d}\sigma_{k}\rvert^{2}\right]^{1/2}.

Now summing (32) from k=0 to K-1, with \upsilon_{k}=\upsilon\sqrt{K}, we have

(1-\gamma)K\min_{0\leq k\leq K-1}\left(\mathcal{L}(\pi^{*})-\mathcal{L}(\pi_{k})\right)
\leq\mathbb{E}_{\nu_{*}}\big{\{}\upsilon\sqrt{K}\left[\mathrm{KL}\left(\pi^{*}(\cdot|s,\mathrm{d}_{\mathcal{S}})\|\pi_{0}(\cdot|s,\mathrm{d}_{\mathcal{S}})\right)-\mathrm{KL}\left(\pi^{*}(\cdot|s,\mathrm{d}_{\mathcal{S}})\|\pi_{K}(\cdot|s,\mathrm{d}_{\mathcal{S}})\right)+\sum_{k=0}^{K-1}(\varepsilon_{k}+\varepsilon_{k}^{\prime})\right]+\upsilon^{-1}\sqrt{K}M\big{\}}.

Since we initialize the policy \pi_{0} to be the uniform policy, we have \mathrm{KL}\left(\pi^{*}(\cdot|s,\mathrm{d}_{\mathcal{S}})\|\pi_{0}(\cdot|s,\mathrm{d}_{\mathcal{S}})\right)\leq\log\lvert\mathcal{A}\rvert. Rearranging, and dropping the nonnegative term \mathrm{KL}\left(\pi^{*}(\cdot|s,\mathrm{d}_{\mathcal{S}})\|\pi_{K}(\cdot|s,\mathrm{d}_{\mathcal{S}})\right), we obtain

\min_{0\leq k\leq K-1}\{\mathcal{L}(\pi^{*})-\mathcal{L}(\pi_{k})\}\leq\frac{\upsilon^{2}\left(\log|\mathcal{A}|+\sum_{k=0}^{K-1}(\varepsilon_{k}+\varepsilon_{k}^{\prime})\right)+M}{(1-\gamma)\upsilon\sqrt{K}}. (33)

From Lemma 4.1 and Lemma 4.2, we have

\epsilon_{k}=\mathcal{O}\left(\frac{R_{\theta}^{2}}{T^{1/2}}+\frac{R_{\theta}^{5/2}}{m_{Q}^{1/4}}+\frac{R_{\theta}^{2}}{N^{1/2}}+\frac{R_{\theta}^{3}}{m_{Q}^{1/2}}\right),\quad\epsilon_{k+1}^{\prime}=\mathcal{O}\left(\frac{R_{\alpha}^{2}}{T^{1/2}}+\frac{R_{\alpha}^{5/2}}{m_{A}^{1/4}}+\frac{R_{\alpha}^{2}}{N^{1/2}}+\frac{R_{\alpha}^{3}}{m_{A}^{1/2}}\right).

To control \varepsilon_{k} and \varepsilon_{k}^{\prime}, we have

\tau_{k+1}^{-1}\epsilon_{k+1}^{\prime}\phi_{k+1}^{*}=\mathcal{O}\left(kK^{-1}\phi_{k+1}^{*}\left(R_{\alpha}^{2}T^{-1/2}+R_{\alpha}^{5/2}m_{A}^{-1/4}+R_{\alpha}^{2}N^{-1/2}\right)\right),
\lvert\mathcal{A}\rvert\tau_{k+1}^{-2}(\epsilon_{k+1}^{\prime})^{2}=\mathcal{O}\left(k^{2}K^{-1}\lvert\mathcal{A}\rvert\left(R_{\alpha}^{2}T^{-1/2}+R_{\alpha}^{5/2}m_{A}^{-1/4}+R_{\alpha}^{2}N^{-1/2}\right)^{2}\right),
\upsilon_{k}^{-1}\epsilon_{k}\psi_{k}^{*}=\mathcal{O}\left(K^{-1/2}\psi_{k}^{*}\left(R_{\theta}^{2}T^{-1/2}+R_{\theta}^{2}N^{-1/2}+R_{\theta}^{5/2}m_{Q}^{-1/4}\right)\right),

if m_{A}=\Omega(R_{\alpha}^{2}) and m_{Q}=\Omega(R_{\theta}^{2}). One can verify that with the choice of N, T, m_{A}, m_{Q} in Theorem 4.1, we have \varepsilon_{k}=\varepsilon_{k}^{\prime}=\mathcal{O}(K^{-1}). Plugging this into (33), we obtain the second part of Theorem 4.1. ∎
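As a back-of-the-envelope check of this last step, treat \phi_{k}^{*}, \psi_{k}^{*}, and \lvert\mathcal{A}\rvert as \mathcal{O}(1) (an assumption made only for this sketch). Then it suffices to choose

T=\Omega(R_{\alpha}^{4}K^{2}+R_{\theta}^{4}K),\quad N=\Omega(R_{\alpha}^{4}K^{2}+R_{\theta}^{4}K),\quad m_{A}=\Omega(R_{\alpha}^{10}K^{4}),\quad m_{Q}=\Omega(R_{\theta}^{10}K^{2}),

which makes R_{\alpha}^{2}T^{-1/2}+R_{\alpha}^{5/2}m_{A}^{-1/4}+R_{\alpha}^{2}N^{-1/2}=\mathcal{O}(K^{-1}) and R_{\theta}^{2}T^{-1/2}+R_{\theta}^{2}N^{-1/2}+R_{\theta}^{5/2}m_{Q}^{-1/4}=\mathcal{O}(K^{-1/2}). Since k\leq K-1, each of the three displayed terms is then \mathcal{O}(K^{-1}), so \sum_{k=0}^{K-1}(\varepsilon_{k}+\varepsilon_{k}^{\prime})=\mathcal{O}(1) and the right-hand side of (33) is \mathcal{O}\left(\frac{\upsilon^{2}(\log\lvert\mathcal{A}\rvert+1)+M}{(1-\gamma)\upsilon\sqrt{K}}\right), consistent with the \mathcal{O}(1/\sqrt{K}) rate of Theorem 4.1.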