
Permutation Invariant Policy Optimization for Mean-Field Multi-Agent Reinforcement Learning: A Principled Approach

Yan Li, Lingxiao Wang, Jiachen Yang, Ethan Wang,
Zhaoran Wang, Tuo Zhao, Hongyuan Zha
Yan Li, Jiachen Yang, Ethan Wang, and Tuo Zhao are affiliated with Georgia Institute of Technology; Lingxiao Wang and Zhaoran Wang are affiliated with Northwestern University; Hongyuan Zha is affiliated with The Chinese University of Hong Kong, Shenzhen. Tuo Zhao is the corresponding author; Email: tourzhao@gatech.edu.
Abstract

Multi-agent reinforcement learning (MARL) becomes more challenging in the presence of more agents, as the size of the joint state and action spaces grows exponentially in the number of agents. To address this challenge of scale, we identify a class of cooperative MARL problems with permutation invariance, and formulate them as a mean-field Markov decision process (MDP). To exploit the permutation invariance therein, we propose the mean-field proximal policy optimization (MF-PPO) algorithm, at the core of which is a permutation-invariant actor-critic neural architecture. We prove that MF-PPO attains the globally optimal policy at a sublinear rate of convergence. Moreover, its sample complexity is independent of the number of agents. We validate the theoretical advantages of MF-PPO with numerical experiments in the multi-agent particle environment (MPE). In particular, we show that the inductive bias introduced by the permutation-invariant neural architecture enables MF-PPO to outperform existing competitors with a smaller number of model parameters, which is key to its generalization performance.

1 Introduction

Multi-Agent Reinforcement Learning (Littman, 1994; Zhang et al., 2019) generalizes Reinforcement Learning (Sutton and Barto, 2018) to address the sequential decision-making problem of multiple agents maximizing their own long term rewards while interacting with one another in a common environment. With breakthroughs in deep learning, MARL algorithms equipped with deep neural networks have seen significant empirical successes in various domains, including simulated autonomous driving (Shalev-Shwartz et al., 2016), multi-agent robotic control (Matarić, 1997; Kober et al., 2013), and E-sports (Vinyals et al., 2019).

Despite tremendous successes, MARL is notoriously hard to scale to the many-agent setting, as the size of the state-action space grows exponentially with the number of agents. This phenomenon has recently been described as the curse of many agents (Menda et al., 2018). To tackle this challenge, we focus on cooperative MARL, where agents work together to maximize their team reward (Panait and Luke, 2005). We identify and exploit a key property of cooperative MARL with homogeneous agents, namely invariance with respect to the permutation of agents. Such permutation invariance can be found in many real-world scenarios with homogeneous agents, such as distributed control of multiple autonomous vehicles and team sports (Cao et al., 2013; Kalyanakrishnan et al., 2006), and also in heterogeneous groups where invariance holds within each group (Liu et al., 2019b). More importantly, we find that permutation invariance has significant practical implications, as the optimal value functions remain invariant when the joint state-action pairs are permuted. This observation strongly advocates a permutation invariant design for learning, which reduces the effective search space of the policy/value functions from exponential to polynomial dependence on the number of agents.

Several empirical methods have been proposed to incorporate permutation invariance into MARL. Liu et al. (2019b) implement a permutation invariant critic based on the Graph Convolutional Network (GCN) (Kipf and Welling, 2017). Sunehag et al. (2017) propose value decomposition, which, together with parameter sharing, leads to a joint critic network that is permutation invariant over agents. While these methods are heuristic, we are the first to provide theoretical principles for introducing permutation invariance as an inductive bias for learning value functions and policies in homogeneous systems. In addition, we adopt the DeepSet architecture (Zaheer et al., 2017), which is well suited to handling the homogeneity of agents, induces permutation invariance with much simpler operations, and is considerably more parameter-efficient.

To scale up learning in MARL problems with a large, or even infinite, number of agents, mean-field approximation has been explored to directly model the population behavior of the agents. Yang et al. (2017) consider a mean-field game with deterministic linear state transitions, and show that it can be reformulated as a mean-field MDP whose mean-field state resides in a finite-dimensional probability simplex. Yang et al. (2018) take the mean-field approximation over actions, so that the interaction between any given agent and the population is approximated by the interaction between the agent's action and the averaged actions of its neighboring agents. However, the motivation for averaging over local actions remains unclear, and it generally requires a sparse graph over agents; in practice, properly identifying such structure demands extensive prior knowledge. Carmona et al. (2019) motivate a mean-field MDP from the perspective of mean-field control. The mean-field state therein lies in a probability simplex and is thus continuous in nature; to enable the ensuing Q-learning algorithm, discretization of the joint state-action space is necessary. Wang et al. (2020) motivate a mean-field MDP from permutation invariance, but assume a central controller coordinating the actions of all the agents, and hence only address the curse of many agents arising from the exponential blowup of the joint state space. In contrast, our formulation of the mean-field approximation allows agents to take their own local actions without resorting to a centralized controller.

We propose a mean-field Markov decision process motivated by the permutation invariance structure of cooperative MARL, which can be viewed as the natural limit of the finite-agent MDP as the number of agents goes to infinity. Such a mean-field MDP generalizes the traditional MDP, with each state representing a distribution over the state space of a single agent. The mean-field MDP provides a tractable formulation for modeling MDPs with many, even infinitely many, agents. We further propose the Mean-Field Proximal Policy Optimization (MF-PPO) algorithm, at the core of which is a pair of permutation invariant actor and critic neural networks. These networks are implemented based on DeepSet (Zaheer et al., 2017), which uses convolution-type operations to induce permutation invariance over the set of inputs. We show that with sufficiently many agents, MF-PPO converges to the optimal policy of the mean-field MDP with a sublinear sample complexity independent of the number of agents. To support our theory, we conduct numerical experiments on the benchmark multi-agent particle environment (MPE) and show that our proposed method requires fewer model parameters and attains better performance than multiple baselines.

Notations. For a set $X$, we denote by $\mathcal{P}(X)$ the set of distributions on $X$. $\delta_{x}$ denotes the Dirac measure supported at $x$. For $\mathbf{s}=(s_{1},\ldots,s_{N})$, we use $\mathbf{s}\overset{\mathrm{i.i.d.}}{\sim}p$ to denote that each $s_{i}$ is independently sampled from distribution $p$. For a function $f:X\to\mathbb{R}$ and a distribution $\pi\in\mathcal{P}(X)$, we write $\langle f,\pi\rangle=\mathbb{E}_{a\sim\pi}f(a)$. We write $[m]$ in short for $\{1,\ldots,m\}$.

2 Problem Setup

We focus on multi-agent systems with cooperative, homogeneous agents, where the agents are of a similar nature and hence cannot be distinguished from one another. Specifically, we consider a discrete-time control problem with $N$ agents, formulated as a Markov decision process $({\mathcal{S}}^{N},\mathcal{A}^{N},\mathbb{P},\mathrm{r})$. We define the joint state space ${\mathcal{S}}^{N}$ as the Cartesian product of the finite state space ${\mathcal{S}}$ of each agent, and similarly define the joint action space $\mathcal{A}^{N}$. The homogeneous nature of the system is reflected in the transition kernel $\mathbb{P}$ and the shared reward $\mathrm{r}$, which satisfy:

\mathrm{r}(\mathbf{s}_{t},\mathbf{a}_{t})=\mathrm{r}\left(\kappa(\mathbf{s}_{t}),\kappa(\mathbf{a}_{t})\right),\qquad \mathbb{P}(\mathbf{s}_{t+1}|\mathbf{s}_{t},\mathbf{a}_{t})=\mathbb{P}\left(\kappa(\mathbf{s}_{t+1})|\kappa(\mathbf{s}_{t}),\kappa(\mathbf{a}_{t})\right), \quad (1)

for all $(\mathbf{s}_{t},\mathbf{a}_{t})\in{\mathcal{S}}^{N}\times\mathcal{A}^{N}$ and any permutation mapping $\kappa(\cdot)$. In other words, it is the configuration, rather than the identities of the agents, that affects the team reward, and the transition to the next configuration depends solely on the current configuration. See Figure 1 for an illustration. Such permutation invariance can be found in many real-world scenarios, such as distributed control of multiple autonomous vehicles, team sports, and social economic systems (Zheng et al., 2020; Cao et al., 2013; Kalyanakrishnan et al., 2006).

Figure 1: Illustration of permutation invariance. Exchanging identities of the agents (first and third row) does not change the transition of the underlying system (second row).
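To make property (1) concrete, the following minimal Python sketch (our own illustrative example, not the paper's environment) defines a team reward that depends only on the configuration of agent states and checks that permuting agent identities leaves it unchanged.

```python
import numpy as np

# A toy check of property (1): a team reward that depends only on the multiset
# (configuration) of agent states, so permuting agent identities cannot change it.
# The reward function itself is a hypothetical example.
def team_reward(states, n_states=5):
    counts = np.bincount(states, minlength=n_states)   # empirical state configuration
    return -float(np.var(counts))                      # e.g., reward even coverage of states

rng = np.random.default_rng(0)
states = rng.integers(0, 5, size=10)      # N = 10 agents, |S| = 5 local states
perm = rng.permutation(10)                # a permutation kappa of agent identities
assert np.isclose(team_reward(states), team_reward(states[perm]))
```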

Our goal is to find the optimal policy $\nu$ within a policy class which maximizes the expected discounted reward

V^{\nu}(\mathbf{s})=(1-\gamma)\,\mathbb{E}\left\{\sum_{t=0}^{\infty}\gamma^{t}\mathrm{r}(\mathbf{s}_{t},\mathbf{a}_{t})\,\Big|\,\mathbf{s}_{0}=\mathbf{s},\,\mathbf{a}_{t}\sim\nu(\mathbf{s}_{t}),\,\forall t\geq 0\right\}.

Our first result shows that the permutation invariance of the system justifies a permutation invariant network design for learning.

Proposition 2.1.

For cooperative MARL satisfying (1), there exists an optimal policy $\nu^{*}$ that is permutation invariant, i.e., $\nu^{*}(\mathbf{s},\mathbf{a})=\nu^{*}\left(\kappa(\mathbf{s}),\kappa(\mathbf{a})\right)$ for any permutation mapping $\kappa(\cdot)$. In addition, for any permutation invariant policy $\nu$, the value function $V^{\nu}(\cdot)$ and the state-action value function $Q^{\nu}(\cdot)$ are also permutation invariant:

V^{\nu}(\mathbf{s})=V^{\nu}\left(\kappa(\mathbf{s})\right),\qquad Q^{\nu}(\mathbf{s},\mathbf{a})=Q^{\nu}\left(\kappa(\mathbf{s}),\kappa(\mathbf{a})\right),

where $Q^{\nu}(\mathbf{s},\mathbf{a})=\mathbb{E}_{\mathbf{s}^{\prime}}\left\{\mathrm{r}(\mathbf{s},\mathbf{a})+\gamma V^{\nu}(\mathbf{s}^{\prime})\right\}$.

Proposition 2.1 has an important implication for architecture design, as it states that it suffices to search within the permutation invariant policy class and value function class. To the best of our knowledge, this is the first theoretical justification of permutation invariant network design for learning with homogeneous agents.

We focus on the factorized policy class with a parameter sharing scheme, where each agent makes its own decision given its local state. Specifically, the joint policy $\nu$ factorizes as $\nu(\mathbf{a}|\mathbf{s})=\prod_{i=1}^{N}\mu(a_{i}|s_{i})$, where $\mu(\cdot)$ denotes the shared local mapping. Such a policy class is widely adopted in the celebrated centralized training – decentralized execution paradigm (Lowe et al., 2017), due to its light overhead in the deployment phase and favorable performance. However, directly learning such a factorized policy remains challenging, as each agent needs to estimate its state-action value function, denoted by $Q^{\nu}(\mathbf{s},\mathbf{a})$. The corresponding search space is of size $(|{\mathcal{S}}|\times|\mathcal{A}|)^{N}$, which scales exponentially with the number of agents. This large search space poses a significant roadblock for efficient learning, and was recently coined the curse of many agents (Menda et al., 2018).
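The factorization with parameter sharing can be sketched as follows (a minimal illustration with discrete local states/actions and a hypothetical two-layer local network mu; not the paper's implementation).

```python
import torch
import torch.nn as nn

# Factorized policy nu(a|s) = prod_i mu(a_i|s_i) with parameter sharing:
# every agent applies the same local network mu to its own local state.
class SharedLocalPolicy(nn.Module):
    def __init__(self, n_states, n_actions, hidden=64):
        super().__init__()
        self.mu = nn.Sequential(
            nn.Embedding(n_states, hidden), nn.ReLU(), nn.Linear(hidden, n_actions)
        )

    def forward(self, local_states):          # local_states: LongTensor of shape (N,)
        logits = self.mu(local_states)        # one set of weights shared by all agents
        return torch.distributions.Categorical(logits=logits)

policy = SharedLocalPolicy(n_states=5, n_actions=3)
dist = policy(torch.randint(0, 5, (10,)))                 # N = 10 agents
joint_action = dist.sample()                              # one local action per agent
joint_log_prob = dist.log_prob(joint_action).sum()        # log nu(a|s) = sum_i log mu(a_i|s_i)
```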

To address the curse of many agents, we exploit the homogeneity of the system and take a mean-field approximation. We begin by taking the perspective of agent $i$, chosen arbitrarily from the $N$ agents. We denote the state of this agent by $s$ and the states of the remaining agents by $\mathbf{s}_{r}$. One can verify that permuting the states of the other agents leaves the value function unchanged; moreover, the value function can be characterized as a function of the local state and the empirical state distribution over the rest of the agents.

Proposition 2.2.

For any permutation mapping $\kappa(\cdot)$, the value function satisfies $V^{\nu}(s,\mathbf{s}_{r})=V^{\nu}(s,\kappa(\mathbf{s}_{r}))$. Additionally, there exists $g_{\nu}$ such that

V^{\nu}(s,\mathbf{s}_{r})=g_{\nu}(s,\widehat{\mathrm{p}}_{\mathbf{s}_{r}}),

where $\widehat{\mathrm{p}}_{\mathbf{s}_{r}}=\frac{1}{N}\sum_{s\in\mathbf{s}_{r}}\delta_{s}$ is the empirical state distribution over the rest of the agents.

For a system with a large number of agents (e.g., financial markets, social networks), the empirical state distribution can be seen as a concrete realization of the underlying population distribution. Motivated by this observation and Proposition 2.2, we formulate the following mean-field MDP, which can be seen as the limit of the finite-agent MDP in the presence of infinitely many homogeneous agents.

Definition 2.1 (mean-field MDP).

The mean-field MDP consists of the following elements: state $(s,\mathrm{d}_{\mathcal{S}})\in\overline{{\mathcal{S}}}\subseteq{\mathcal{S}}\times\mathcal{P}({\mathcal{S}})$; action $\overline{a}\in\overline{\mathcal{A}}\subseteq\mathcal{A}^{{\mathcal{S}}}$; reward $\mathrm{r}(s,\mathrm{d}_{\mathcal{S}},\overline{a})$; transition kernel $\mathbb{P}(s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime}|s,\mathrm{d}_{\mathcal{S}},\overline{a})$.

The defined mean-field MDP has an intimate connection with the previously discussed finite-agent MDP: here $(s,\mathrm{d}_{\mathcal{S}})$ represents the joint state viewed from an individual agent with state $s$; each agent makes a local decision using the shared local mapping $\overline{a}:{\mathcal{S}}\to\mathcal{A}$ and receives a local reward, whose cumulative effect is captured by the team reward $\mathrm{r}(s,\mathrm{d}_{\mathcal{S}},\overline{a})$; finally, the joint state transits to $(s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime})$ according to the kernel $\mathbb{P}(s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime}|s,\mathrm{d}_{\mathcal{S}},\overline{a})$. The mean-field MDP thus has a step-by-step correspondence with the finite-agent MDP with homogeneous agents, making it a natural generalization in the presence of a large number of agents.

Our goal is to efficiently learn a policy $\pi$ that maximizes the expected discounted reward. To facilitate the ensuing discussion, we define the value function as

V^{\pi}(s,\mathrm{d}_{\mathcal{S}})=(1-\gamma)\,\mathbb{E}\left\{\sum_{t=0}^{\infty}\gamma^{t}\mathrm{r}(s_{t},\mathrm{d}_{{\mathcal{S}},t},\overline{a}_{t})\right\},

where $(s_{0},\mathrm{d}_{{\mathcal{S}},0})=(s,\mathrm{d}_{\mathcal{S}})$ and $\overline{a}_{t}\sim\pi(s_{t},\mathrm{d}_{{\mathcal{S}},t})$ for all $t\geq 0$, and the Q-function as

Q^{\pi}(s,\mathrm{d}_{\mathcal{S}},\overline{a})=(1-\gamma)\,\mathbb{E}\left\{\sum_{t=0}^{\infty}\gamma^{t}\mathrm{r}(s_{t},\mathrm{d}_{{\mathcal{S}},t},\overline{a}_{t})\right\},

where $(s_{0},\mathrm{d}_{{\mathcal{S}},0})=(s,\mathrm{d}_{\mathcal{S}})$, $\overline{a}_{0}=\overline{a}$, and $\overline{a}_{t}\sim\pi(s_{t},\mathrm{d}_{{\mathcal{S}},t})$ for $t\geq 1$. The optimal policy is then denoted by

\pi^{*}\in\mathop{\mathrm{argmax}}_{\pi}V^{\pi}(s,\mathrm{d}_{\mathcal{S}}).

Despite the intuitive analogy to the finite-agent MDP, solving the mean-field MDP poses some unique challenges. In addition to the unknown transition kernel and reward, the mean-field MDP takes a distribution as part of its state, of which we only have partial information during training. In the following section, we propose our Mean-Field Proximal Policy Optimization (MF-PPO) algorithm, which, with a careful architecture design, solves such a mean-field MDP efficiently in a model-free fashion.

3 Mean-field Proximal Policy Optimization

Our algorithm falls into the actor-critic learning paradigm, consisting of alternating iterations of policy evaluation and policy improvement. The unique features of MF-PPO are that: (1) it exploits the permutation invariance of the system, drastically reducing the search space of the actor/critic networks and enabling much more efficient learning; (2) it can handle a varying number of agents. For simplicity of exposition, we consider a fixed number of agents here.

Throughout the rest of the section, we assume that for any joint state $(s,\mathrm{d}_{{\mathcal{S}}})\in{\mathcal{S}}\times\mathcal{P}({\mathcal{S}})$, the agent has access to $N$ i.i.d. samples $\{s_{i}\}_{i=1}^{N}$ from $\mathrm{d}_{{\mathcal{S}}}$. We denote the concatenation of such samples by $\mathbf{s}\in{\mathcal{S}}^{N}$ and write $\mathbf{s}\overset{\mathrm{i.i.d.}}{\sim}\mathrm{d}_{{\mathcal{S}}}$.

MF-PPO maintains a pair of actor (denoted by $F^{A}$) and critic (denoted by $F^{Q}$) networks, and uses the actor network to induce an energy-based policy $\pi(\overline{a}|s,\mathrm{d}_{{\mathcal{S}}})$. Specifically, given state $(s,\mathrm{d}_{{\mathcal{S}}})$, the actor network induces a distribution on the action set $\overline{\mathcal{A}}$ according to

\pi(\overline{a}|s,\mathrm{d}_{{\mathcal{S}}})\propto\exp\left\{\tau^{-1}F^{A}(s,\mathrm{d}_{{\mathcal{S}}},\overline{a})\right\},

where $\tau$ denotes the temperature parameter. We use $\pi\propto\exp\{F^{A}\}$ to denote the dependency of the policy on the energy function.
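A minimal sketch of this energy-based policy, assuming a finite action set and hypothetical placeholder energy values:

```python
import numpy as np

# pi(a|s, d_S) ∝ exp{ F^A(s, d_S, a) / tau } over a finite action set.
def energy_based_policy(energies, tau):
    logits = np.asarray(energies, dtype=float) / tau
    logits -= logits.max()                      # subtract max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

energies = [0.2, 1.5, -0.3]                     # hypothetical F^A values for three actions
pi = energy_based_policy(energies, tau=0.5)     # smaller tau => sharper (greedier) policy
action = np.random.default_rng(0).choice(len(pi), p=pi)
```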

\bullet Permutation-invariant Actor and Critic. We adopt a permutation invariant design for the actor and critic networks. Specifically, given an individual state $s\in{\mathcal{S}}$ and sampled states $\mathbf{s}\overset{\mathrm{i.i.d.}}{\sim}\mathrm{d}_{{\mathcal{S}}}$, the actor (resp. critic) network $F^{A}$ (resp. $F^{Q}$) satisfies

F^{A}(s,\mathbf{s},\overline{a})=F^{A}(s,\kappa(\mathbf{s}),\overline{a})

for any permutation mapping $\kappa$. With permutation invariance, the search space of the actor/critic network depends only polynomially on the number of agents $N$.

Proposition 3.1.

The search space of a permutation invariant actor (critic) network is of the order

\left(\sum_{k=1}^{\min\{|{\mathcal{S}}|,N\}}\binom{N-1}{k-1}\binom{|{\mathcal{S}}|}{k}\right)|{\mathcal{S}}|\,|\overline{\mathcal{A}}|.

Additionally, if $|{\mathcal{S}}|<N$, then the search space depends on $N$ at the order of $N^{|{\mathcal{S}}|}$.

Compared to architectures without permutation invariance, whose search space depends on $N$ at the order of $(|{\mathcal{S}}|\,|\mathcal{A}|)^{N}$, the search space of MF-PPO can be exponentially smaller.
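A quick numerical illustration of this gap (our own toy sizes, not from the paper):

```python
from math import comb

# Counts from Proposition 3.1: permutation-invariant search space
# versus the (|S||A|)^N count without invariance.
def invariant_search_space(S, A, N):
    return sum(comb(N - 1, k - 1) * comb(S, k) for k in range(1, min(S, N) + 1)) * S * A

S, A, N = 5, 3, 20
print(invariant_search_space(S, A, N))   # grows like N^|S| when |S| < N
print((S * A) ** N)                      # grows exponentially in N
```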

Motivated by the characterization of permutation invariant set functions in Zaheer et al. (2017), the actor/critic networks in MF-PPO take the DeepSet form, i.e.,

F^{A}(s,\mathbf{s},\overline{a})=h\left(\frac{1}{N}\sum_{s^{\prime}\in\mathbf{s}}\phi(s,s^{\prime},\overline{a})\right).

Both networks first aggregate local information by averaging the outputs of a sub-network shared among agents, before feeding the aggregated information into a subsequent network. See Figure 2 for an illustration. Thanks to the average pooling layer and the preceding parameter sharing scheme, the network output is unchanged when the ordering of the agents is permuted. Compared to a Graph Convolutional Network (Kipf and Welling, 2017), the averaging operations are well suited for homogeneous agents and more parameter-efficient.
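A minimal sketch of such a DeepSet-style critic (layer sizes and dimensions are our own illustrative choices, not the paper's configuration):

```python
import torch
import torch.nn as nn

# F^Q(s, s_1..s_N, a) = h( (1/N) * sum_i phi(s, s_i, a) ): shared sub-network phi,
# average pooling over agents, then a subsequent network h.
class DeepSetCritic(nn.Module):
    def __init__(self, s_dim, a_dim, hidden=64):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(2 * s_dim + a_dim, hidden), nn.ReLU())
        self.h = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, s, others, a):
        # s: (s_dim,) local state; others: (N, s_dim) sampled states; a: (a_dim,) action.
        N = others.shape[0]
        x = torch.cat([s.expand(N, -1), others, a.expand(N, -1)], dim=-1)
        pooled = self.phi(x).mean(dim=0)       # average pooling => permutation invariance
        return self.h(pooled)

critic = DeepSetCritic(s_dim=4, a_dim=2)
s, others, a = torch.randn(4), torch.randn(10, 4), torch.randn(2)
perm = torch.randperm(10)
assert torch.allclose(critic(s, others, a), critic(s, others[perm], a))
```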

Naturally, given the joint state-action pair $(s,\mathrm{d}_{{\mathcal{S}}},\overline{a})$, the actor network is given by

F^{A}(s,\mathrm{d}_{{\mathcal{S}}},\overline{a})=h\left(\mathbb{E}_{s^{\prime}\sim\mathrm{d}_{{\mathcal{S}}}}\phi(s,s^{\prime},\overline{a})\right).

We assume $F^{A}$ is parameterized by a neural network with parameters $\alpha\in\mathbb{R}^{D}$, which are learned during training. We denote the function class of all possible actor networks by $\mathcal{F}^{A}$. The same architecture design applies to the critic network, with learnable parameters denoted by $\theta\in\mathbb{R}^{D}$ and the function class denoted by $\mathcal{F}^{Q}$. MF-PPO then consists of successive iterations of the policy evaluation and policy improvement steps described in detail below.

Figure 2: Illustration of a DeepSet parameterized critic network.

\bullet Policy Evaluation. At the $k$-th iteration of MF-PPO, we first update the critic network $F^{Q}$ by minimizing the mean squared Bellman error while holding the actor network $F^{A,k}$ fixed. We denote the policy induced by the actor network by $\pi_{k}$, the stationary state distribution of policy $\pi_{k}$ by $\nu_{k}$, and the stationary state-action distribution by $\sigma_{k}=\nu_{k}\pi_{k}$. We then follow the update

\theta_{k}=\mathop{\mathrm{argmin}}_{\theta\in\mathbb{B}(\theta_{0},R_{\theta})}\mathbb{E}_{\sigma_{k}}\Big\{F^{Q}_{\theta}(s,\mathrm{d}_{{\mathcal{S}}},\overline{a})-\left[{\mathcal{T}}^{\pi_{k}}F^{Q}_{\theta}\right](s,\mathrm{d}_{{\mathcal{S}}},\overline{a})\Big\}^{2}, \quad (2)

where $\mathbb{B}(\theta_{0},R_{\theta})$ denotes the Euclidean ball with radius $R_{\theta}$ centered at the initial parameter $\theta_{0}$, and the Bellman evaluation operator ${\mathcal{T}}^{\pi}$ is given by

{\mathcal{T}}^{\pi}F^{Q}_{\theta}(s,\mathrm{d}_{{\mathcal{S}}},\overline{a})=\mathbb{E}\left\{(1-\gamma)\mathrm{r}(s,\mathrm{d}_{{\mathcal{S}}},\overline{a})+\gamma F^{Q}_{\theta}(s^{\prime},\mathrm{d}^{\prime}_{{\mathcal{S}}},\overline{a}^{\prime})\right\},

where $\mathrm{d}^{\prime}_{{\mathcal{S}}}\sim\mathbb{P}(\cdot|\mathrm{d}_{{\mathcal{S}}},\overline{a})$ and $\overline{a}^{\prime}\sim\pi(\cdot|s^{\prime},\mathrm{d}^{\prime}_{{\mathcal{S}}})$. We solve (2) by a $T$-step temporal-difference (TD) update and output $\theta_{k}=\theta(T)$. At the $t$-th iteration of the TD update,

\theta(t+1/2)=\theta(t)-\eta\Big\{F^{Q}_{\theta(t)}(s,\mathbf{s},\overline{a})-(1-\gamma)\mathrm{r}(s,\mathrm{d}_{{\mathcal{S}}},\overline{a})-\gamma F^{Q}_{\theta(t)}(s^{\prime},\mathbf{s}^{\prime},\overline{a}^{\prime})\Big\}\nabla_{\theta}F^{Q}_{\theta(t)}(s,\mathbf{s},\overline{a}),
\theta(t+1)=\Pi_{\mathbb{B}(\theta_{0},R_{\theta})}\left(\theta(t+1/2)\right),

where we sample

(s,\mathrm{d}_{{\mathcal{S}}},\overline{a})\sim\sigma_{k},\quad(s^{\prime},\mathrm{d}^{\prime}_{{\mathcal{S}}})\sim\mathbb{P}(\cdot|s,\mathrm{d}_{{\mathcal{S}}},\overline{a}),\quad\overline{a}^{\prime}\sim\pi_{k}(\cdot|s^{\prime},\mathrm{d}^{\prime}_{{\mathcal{S}}}),\quad\mathbf{s}\overset{\mathrm{i.i.d.}}{\sim}\mathrm{d}_{{\mathcal{S}}},\quad\mathbf{s}^{\prime}\overset{\mathrm{i.i.d.}}{\sim}\mathrm{d}^{\prime}_{{\mathcal{S}}}.

We use $\Pi_{\mathrm{X}}(\cdot)$ to denote the orthogonal projection onto the set $\mathrm{X}$, and $\eta$ to denote the step size. The details are summarized in Algorithm 2 in Appendix A.
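A minimal sketch of one projected TD step of this form, where F_Q and grad_F_Q are hypothetical callables standing in for the critic network and its parameter gradient:

```python
import numpy as np

# One projected semi-gradient TD step onto the ball B(theta0, R).
def projected_td_step(theta, theta0, R, eta, gamma, r, x, x_next, F_Q, grad_F_Q):
    delta = F_Q(theta, x) - (1.0 - gamma) * r - gamma * F_Q(theta, x_next)  # TD error
    theta_half = theta - eta * delta * grad_F_Q(theta, x)                   # gradient step
    diff = theta_half - theta0
    norm = np.linalg.norm(diff)
    if norm > R:                                   # orthogonal projection onto B(theta0, R)
        theta_half = theta0 + R * diff / norm
    return theta_half
```

For instance, with a linear critic $F_Q(\theta,x)=\theta^{\top}x$ one would pass `grad_F_Q = lambda theta, x: x`, recovering the classical projected linear TD(0) update.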

\bullet Policy Improvement. Following the policy evaluation step, MF-PPO updates its policy by updating the actor network $F^{A}$, which is the energy function associated with the policy. We update the policy network by

\pi_{k+1}=\mathop{\mathrm{argmax}}_{\pi\propto\exp\{F^{A,k+1}\},\,F^{A,k+1}\in\mathcal{F}^{A}}\mathbb{E}_{\nu_{k}}\Big\{\left\langle F^{Q}_{\theta_{k}}(s,\mathrm{d}_{{\mathcal{S}}},\cdot),\pi(\cdot|s,\mathrm{d}_{{\mathcal{S}}})\right\rangle-\upsilon_{k}\mathrm{KL}\left(\pi(\cdot|s,\mathrm{d}_{{\mathcal{S}}})\,\big\|\,\pi_{k}(\cdot|s,\mathrm{d}_{{\mathcal{S}}})\right)\Big\}.

The update rule intuitively reads as increasing the probability of choosing action $\overline{a}$ if it yields a higher value of the critic network $F^{Q}(s,\mathrm{d}_{{\mathcal{S}}},\overline{a})$, which can be viewed as a softened version of policy iteration (Bertsekas, 2011). Additionally, the proximity parameter $\upsilon_{k}$ controls the softness of the update, with $\upsilon_{k}\to 0$ recovering vanilla policy iteration. Moreover, without the constraint $F^{A,k+1}\in\mathcal{F}^{A}$, the update has a closed-form expression, and $\pi_{k+1}$ itself is another energy-based policy.

Proposition 3.2.

Let $\pi_{k}\propto\exp\left(\tau_{k}^{-1}F^{A}_{\alpha_{k}}\right)$ denote the energy-based policy. Then the update

\overline{\pi}_{k+1}=\mathop{\mathrm{argmax}}_{\pi}\mathbb{E}_{\nu_{k}}\Big\{\left\langle F^{Q}_{\theta_{k}}(s,\mathrm{d}_{{\mathcal{S}}},\cdot),\pi(\cdot|s,\mathrm{d}_{{\mathcal{S}}})\right\rangle-\upsilon_{k}\mathrm{KL}\left(\pi(\cdot|s,\mathrm{d}_{{\mathcal{S}}})\,\big\|\,\pi_{k}(\cdot|s,\mathrm{d}_{{\mathcal{S}}})\right)\Big\}

yields $\overline{\pi}_{k+1}\propto\exp\{\upsilon_{k}^{-1}F^{Q}_{\theta_{k}}+\tau_{k}^{-1}F^{A}_{\alpha_{k}}\}$.
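For intuition, here is a sketch of the standard argument behind Proposition 3.2. Since the objective decouples across joint states, it suffices to maximize pointwise; the first-order condition with a multiplier $\lambda$ for the normalization constraint reads

F^{Q}_{\theta_{k}}(s,\mathrm{d}_{\mathcal{S}},\overline{a})-\upsilon_{k}\left(\log\frac{\pi(\overline{a}|s,\mathrm{d}_{\mathcal{S}})}{\pi_{k}(\overline{a}|s,\mathrm{d}_{\mathcal{S}})}+1\right)+\lambda=0,

so that $\overline{\pi}_{k+1}(\overline{a}|s,\mathrm{d}_{\mathcal{S}})\propto\pi_{k}(\overline{a}|s,\mathrm{d}_{\mathcal{S}})\exp\{\upsilon_{k}^{-1}F^{Q}_{\theta_{k}}(s,\mathrm{d}_{\mathcal{S}},\overline{a})\}\propto\exp\{\upsilon_{k}^{-1}F^{Q}_{\theta_{k}}(s,\mathrm{d}_{\mathcal{S}},\overline{a})+\tau_{k}^{-1}F^{A}_{\alpha_{k}}(s,\mathrm{d}_{\mathcal{S}},\overline{a})\}$, using $\pi_{k}\propto\exp\{\tau_{k}^{-1}F^{A}_{\alpha_{k}}\}$.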

To account for the fact that the representable energy functions of the actor network reside in $\mathcal{F}^{A}$, we update the policy by projecting the energy function of $\overline{\pi}_{k+1}$ back onto $\mathcal{F}^{A}$. Specifically, writing

\pi_{k+1}\propto\exp\{\tau_{k+1}^{-1}F^{A}_{\alpha_{k+1}}\},

we recover the next actor network $F^{A}_{\alpha_{k+1}}$ (i.e., the energy) by solving the following regression task:

\alpha^{*}_{k+1}=\mathop{\mathrm{argmin}}_{\alpha\in\mathbb{B}(\alpha_{0},R_{\alpha})}\mathbb{E}_{\widetilde{\sigma}_{k}}\Big\{F^{A}_{\alpha}(s,\mathrm{d}_{{\mathcal{S}}},\overline{a})-\tau_{k+1}\left[\upsilon_{k}^{-1}F^{Q}_{\theta_{k}}(s,\mathrm{d}_{{\mathcal{S}}},\overline{a})+\tau_{k}^{-1}F^{A}_{\alpha_{k}}(s,\mathrm{d}_{{\mathcal{S}}},\overline{a})\right]\Big\}^{2}, \quad (3)

where $\widetilde{\sigma}_{k}=\nu_{k}\pi_{0}$. We approximately solve (3) via $T$-step stochastic gradient descent (SGD), and output

\alpha_{k+1}=\alpha(T)\approx\alpha_{k+1}^{*}.

At the $t$-th iteration of SGD,

\alpha(t+1/2)=\alpha(t)-\eta\,\nabla_{\alpha}F^{A}_{\alpha(t)}(s,\mathbf{s},\overline{a})\Big\{F^{A}_{\alpha(t)}(s,\mathbf{s},\overline{a})-\tau_{k+1}\left(\upsilon_{k}^{-1}F^{Q}_{\theta_{k}}(s,\mathbf{s},\overline{a})+\tau_{k}^{-1}F^{A}_{\alpha_{k}}(s,\mathbf{s},\overline{a})\right)\Big\},
\alpha(t+1)=\Pi_{\mathbb{B}(\alpha_{0},R_{\alpha})}\left(\alpha(t+1/2)\right),

where we sample $(s,\mathrm{d}_{{\mathcal{S}}},\overline{a})\sim\widetilde{\sigma}_{k}$ and $\mathbf{s}\overset{\mathrm{i.i.d.}}{\sim}\mathrm{d}_{{\mathcal{S}}}$, and $\eta$ is the step size. The details are summarized in Algorithm 3 of Appendix A. Finally, we present the complete MF-PPO procedure in Algorithm 1.

Algorithm 1 Mean-Field Proximal Policy Optimization
  Require: Mean-field MDP $(\overline{{\mathcal{S}}},\overline{\mathcal{A}},\mathbb{P},\mathrm{r})$, discount factor $\gamma$; number of outer iterations $K$, number of inner updates $T$; policy update parameter $\upsilon$, step size $\eta$, projection radii $R_{\alpha},R_{\theta}$.
  Initialize: $\tau_{0}\leftarrow 1$, $F^{A,0}\leftarrow 0$, $\pi_{0}\propto\exp\{\tau_{0}^{-1}F^{A,0}\}$ (uniform policy).
  for $k=0,\ldots,K-1$ do
     Set temperature parameter $\tau_{k+1}\leftarrow\upsilon\sqrt{K}/(k+1)$ and proximity parameter $\upsilon_{k}\leftarrow\upsilon\sqrt{K}$
     Solve (2) to update the critic network $F^{Q}_{\theta_{k}}$, using the TD update (Algorithm 2)
     Solve (3) to update the actor network $F^{A}_{\alpha_{k+1}}$, using the SGD update (Algorithm 3)
     Update policy: $\pi_{k+1}\propto\exp\{\tau_{k+1}^{-1}F^{A}_{\alpha_{k+1}}\}$
  end for
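The outer loop of Algorithm 1 can be sketched as follows (our own pseudocode-style illustration; td_policy_evaluation and sgd_policy_improvement are hypothetical stand-ins for Algorithms 2 and 3).

```python
import math

# Outer loop of Algorithm 1 with the temperature/proximity schedules used in the analysis.
def mf_ppo(K, T, upsilon, td_policy_evaluation, sgd_policy_improvement):
    tau = 1.0
    actor_energy = 0.0                      # F^{A,0} = 0, i.e., pi_0 is uniform
    upsilon_k = upsilon * math.sqrt(K)      # proximity parameter, constant over k
    for k in range(K):
        tau_next = upsilon * math.sqrt(K) / (k + 1)              # temperature schedule
        critic = td_policy_evaluation(actor_energy, tau, T)      # solve (2), Algorithm 2
        actor_energy = sgd_policy_improvement(                   # solve (3), Algorithm 3
            critic, actor_energy, tau, tau_next, upsilon_k, T)
        tau = tau_next                      # pi_{k+1} ∝ exp{F^{A,k+1} / tau_{k+1}}
    return actor_energy, tau
```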

4 Global Convergence of MF-PPO

We present the global convergence of the MF-PPO algorithm for a two-layer permutation-invariant parameterization of the actor and critic networks. We remark that our analysis can be extended to multi-layer permutation-invariant networks; we present the two-layer case for simplicity of exposition. Specifically, the actor and critic networks take the form

F^{A}_{\alpha}(s,\mathbf{s},\overline{a})=\frac{1}{\sqrt{m}N}\sum_{j=1}^{m_{A}}\sum_{s^{\prime}\in\mathbf{s}}u_{j}\,\sigma\left(\alpha_{j}^{\top}(s,s^{\prime},\overline{a})\right),
F^{Q}_{\theta}(s,\mathbf{s},\overline{a})=\frac{1}{\sqrt{m}N}\sum_{j=1}^{m_{Q}}\sum_{s^{\prime}\in\mathbf{s}}v_{j}\,\sigma\left(\theta_{j}^{\top}(s,s^{\prime},\overline{a})\right),

where $m_{A}$ (resp. $m_{Q}$) is the width of the actor (resp. critic) network, and $\sigma(x)=\max\{x,0\}$ denotes the ReLU activation. We randomly initialize the output weights $u_{j}$ (resp. $v_{j}$) and the first-layer weights $\alpha_{0}=[\alpha_{0,1}^{\top},\ldots,\alpha_{0,m_{A}}^{\top}]^{\top}\in\mathbb{R}^{d\cdot m_{A}}$ (resp. $\theta_{0}\in\mathbb{R}^{d\cdot m_{Q}}$) by

u_{j}\sim\mathrm{Unif}\{-1,+1\},\quad\alpha_{0,j}\sim\mathcal{N}(0,\mathrm{I}_{d}/d),\quad\forall j\in[m].

For ease of analysis, we take $m=m_{A}=m_{Q}$ and share the initialization of $\alpha_{0}$ and $\theta_{0}$ (resp. $u_{0}$ and $v_{0}$). Additionally, we keep the $u_{j}$'s fixed during training, and keep $\alpha$ (resp. $\theta$) within the ball $\mathbb{B}(\alpha_{0},R_{A})$ (resp. $\mathbb{B}(\theta_{0},R_{Q})$) throughout training. We define the following function class, which characterizes the previously defined actor/critic networks for large network width.

Definition 4.1.

Given $R_{\beta}>0$, define the function class over the domain ${\mathcal{S}}\times{\mathcal{S}}\times\mathcal{A}$ by

\mathcal{F}_{\beta,m}=\bigg\{f_{\beta}(\cdot)~\Big|~f_{\beta}(s,s^{\prime},\overline{a})=\frac{1}{\sqrt{m}}\sum_{j=1}^{m}v_{j}\mathbbm{1}\{\beta_{0,j}^{\top}(s,s^{\prime},\overline{a})>0\}\,\beta_{j}^{\top}(s,s^{\prime},\overline{a}),~\|\beta-\beta_{0}\|_{2}\leq R_{\beta}\bigg\},

where $v_{j},\beta_{0,j}$ are random weights sampled according to

v_{j}\sim\mathrm{Unif}\{-1,+1\},\quad\beta_{0,j}\sim\mathcal{N}(0,\mathrm{I}_{d}/d),\quad\forall j\in[m].

$\mathcal{F}_{\beta,m}$ also induces the function class over ${\mathcal{S}}\times\mathcal{P}({\mathcal{S}})\times\mathcal{A}$ given by

\mathcal{F}^{\mathcal{P}}_{\beta,m}=\left\{F(\cdot)~\Big|~F(s,\mathrm{d}_{\mathcal{S}},\overline{a})=\mathbb{E}_{s^{\prime}\sim\mathrm{d}_{\mathcal{S}}}f_{\beta}(s,s^{\prime},\overline{a}),~f_{\beta}\in\mathcal{F}_{\beta,m}\right\}.

It is well known that, for large network width $m$, functions within $\mathcal{F}_{\beta,m}$ approximate functions in the reproducing kernel Hilbert space associated with the kernel $\mathrm{K}(x,y)=\mathbb{E}_{z\sim\mathcal{N}(0,\mathrm{I}_{d}/d)}\left\{\mathbbm{1}(z^{\top}x>0,z^{\top}y>0)\right\}$ (Jacot et al., 2018; Chizat and Bach, 2018; Cai et al., 2019; Arora et al., 2019), whose RKHS norm is bounded by $R_{\beta}$. For large $R_{\beta}$ and $m$, $\mathcal{F}_{\beta,m}$ represents a rich class of functions. Additionally, functions within $\mathcal{F}^{\mathcal{P}}_{\beta,m}$ can be viewed as mean embeddings of the joint state-action pair into the RKHS (Muandet et al., 2016; Song et al., 2009; Smola et al., 2007). Below, we make one technical assumption stating that $\mathcal{F}^{\mathcal{P}}_{\theta,m_{Q}}$ is rich enough to represent the Q-functions of all policies within our policy class.

Assumption 1.

For any policy $\pi$ induced by $F^{A}\in\mathcal{F}^{A}$, we have $Q^{\pi}\in\mathcal{F}^{\mathcal{P}}_{\theta,m_{Q}}$.

We impose two mild conditions stating (1) boundedness of the reward, and (2) regularity of the joint state and the policy.

Assumption 2.

The reward function satisfies $\mathrm{r}(\cdot)\leq\overline{r}$ for some $\overline{r}>0$. Additionally, for any $(s,\mathrm{d}_{\mathcal{S}})\in\overline{{\mathcal{S}}}$ and any $\pi$, there exists $c>0$ such that

\mathbb{E}_{s^{\prime}\sim\mathrm{d}_{\mathcal{S}},\,\overline{a}\sim\pi(\cdot|s,\mathrm{d}_{\mathcal{S}})}\left\{\mathbbm{1}\left(\lvert z^{\top}(s,s^{\prime},\overline{a})\rvert\leq t\right)\right\}\leq c\cdot\frac{t}{\|z\|_{2}}

for any $z\in\mathbb{R}^{d}$ and $t>0$.

We remark that the condition in Assumption 2 holds whenever ${\mathcal{S}}$ is bounded and $\mathrm{d}_{\mathcal{S}}\times\pi$ has a bounded density.

We measure the progress of MF-PPO in Algorithm 1 using the expected total reward

\mathcal{L}(\pi)=\mathbb{E}_{\nu^{*}}\left[V^{\pi}(s,\mathrm{d}_{\mathcal{S}})\right]=\mathbb{E}_{\nu^{*}}\left\{\left\langle Q^{\pi}(s,\mathrm{d}_{\mathcal{S}},\cdot),\pi(\cdot|s,\mathrm{d}_{\mathcal{S}})\right\rangle\right\}, \quad (4)

where $\nu^{*}$ is the stationary state distribution of the optimal policy $\pi^{*}$. We also denote by $\sigma^{*}$ the stationary state-action distribution induced by $\pi^{*}$. Note that we have

\mathcal{L}(\pi^{*})=\mathbb{E}_{\nu^{*}}\left[V^{\pi^{*}}(s,\mathrm{d}_{\mathcal{S}})\right]\geq\mathbb{E}_{\nu^{*}}\left[V^{\pi}(s,\mathrm{d}_{\mathcal{S}})\right]=\mathcal{L}(\pi)

for any policy $\pi$. Our main result is presented in the following theorem, showing that $\mathcal{L}(\pi_{k})$ converges to $\mathcal{L}(\pi^{*})$ at a sublinear rate.

Theorem 4.1 (Global Convergence of MF-PPO).

Under Assumptions 1 and 2, the policies $\{\pi_{k}\}_{k=1}^{K}$ generated by Algorithm 1 satisfy

\min_{0\leq k\leq K}\{\mathcal{L}(\pi^{*})-\mathcal{L}(\pi_{k})\}\leq\frac{\upsilon\left(\log|\overline{\mathcal{A}}|+\sum_{k=1}^{K-1}(\varepsilon_{k}+\varepsilon_{k}^{\prime})\right)}{(1-\gamma)\sqrt{K}}+\frac{M}{(1-\gamma)\upsilon\sqrt{K}},

where $M=\mathbb{E}_{\nu^{*}}\left\{\max_{\overline{a}\in\overline{\mathcal{A}}}\big(F^{Q}_{\theta_{0}}(s,\mathrm{d}_{\mathcal{S}},\overline{a})\big)^{2}\right\}+2R_{A}^{2}$, and $\varepsilon_{k}$, $\varepsilon_{k}^{\prime}$ are defined in Lemma 4.3. In particular, suppose at each iteration of MF-PPO we observe

N=\Omega\left(K^{3}R_{A}^{4}+KR_{Q}^{4}\right)

agents, and the actor/critic network widths satisfy

m_{A}=\Omega\left(K^{6}R_{A}^{10}+K^{4}R_{A}^{10}\lvert\overline{\mathcal{A}}\rvert^{2}\right),\qquad m_{Q}=\Omega\left(K^{2}R_{Q}^{10}\right),

and

T=\Omega\left(K^{3}R_{A}^{4}+KR_{Q}^{4}\right),

then we have

\min_{0\leq k\leq K}\{\mathcal{L}(\pi^{*})-\mathcal{L}(\pi_{k})\}\leq\frac{\upsilon^{2}\left(\log|\overline{\mathcal{A}}|+\mathcal{O}(1)\right)+M}{(1-\gamma)\upsilon\sqrt{K}}.

Theorem 4.1 states that, given sufficiently many agents and large enough actor/critic networks, MF-PPO attains global optimality at a sublinear rate. Our result shows that, when solving the mean-field MDP, having more agents serves as a blessing rather than a curse. In addition, as demonstrated in our proof sketch, there exists an inherent phase transition depending on the number of agents: the final optimality gap is dominated by the statistical error for a small number of agents (first phase), and by the optimization error for a large number of agents (second phase).

The complete proof of Theorem 4.1 requires a careful analysis of the errors from the policy evaluation step (2) and the policy improvement step (3). The analysis of the outer iterations of MF-PPO can be viewed as approximate mirror descent, which needs to account for how the evaluation and improvement errors interact. Intuitively, the tuple $(\varepsilon_{k},\varepsilon^{\prime}_{k})$ describes the effect of the policy update when using approximate policy evaluation and policy improvement, and will be further clarified in the ensuing discussion. We present here the skeleton of our proof and defer the technical details to the appendix.

Proof Sketch. We first establish the global convergence of the policy evaluation step (2) and the policy improvement step (3).

Lemma 4.1 (Policy Evaluation).

Under the same assumptions as in Theorem 4.1, let $Q^{\pi_{k}}$ denote the Q-function of policy $\pi_{k}$, and let

\epsilon_{k}=\mathbb{E}_{\mathrm{init},\sigma_{k}}\left\{F_{\overline{\theta}(T)}^{Q}(\cdot)-Q^{\pi_{k}}(\cdot)\right\}^{2}

denote the policy evaluation error, where $\overline{\theta}(T)$ is the output of Algorithm 2. Then we have

\epsilon_{k}=\mathcal{O}\left(\frac{R_{Q}^{2}}{T^{1/2}}+\frac{R_{Q}^{5/2}}{m_{Q}^{1/4}}+\frac{R_{Q}^{2}}{N^{1/2}}+\frac{R_{Q}^{3}}{m_{Q}^{1/2}}\right).
Lemma 4.2 (Policy Improvement).

Under the same assumptions as in Theorem 4.1, let

\epsilon_{k+1}^{\prime}=\mathbb{E}_{\mathrm{init},\widetilde{\sigma}_{k}}\left\{F^{A}_{\overline{\alpha}(T)}(\cdot)-\tau_{k+1}\left(\upsilon_{k}^{-1}F^{Q}_{\theta_{k}}(\cdot)+\tau_{k}^{-1}F^{A}_{\alpha_{k}}(\cdot)\right)\right\}^{2}

denote the policy optimization error, where $\overline{\alpha}(T)$ is the output of Algorithm 3. Then we have

\epsilon_{k+1}^{\prime}=\mathcal{O}\left(\frac{R_{A}^{2}}{T^{1/2}}+\frac{R_{A}^{5/2}}{m_{A}^{1/4}}+\frac{R_{A}^{2}}{N^{1/2}}+\frac{R_{A}^{3}}{m_{A}^{1/2}}\right).

Lemmas 4.1 and 4.2 show that, despite non-convexity, both the policy evaluation and the policy improvement steps converge approximately to the globally optimal solution. In particular, given networks with large width, for a small number of iterations $T$ the optimization error $\mathcal{O}(T^{-1/2})$ dominates the optimality gap, while for a large number of iterations $T$ the statistical error $\mathcal{O}(N^{-1/2})$ dominates.

With Lemmas 4.1 and 4.2 in hand, we illustrate the main argument of the proof of Theorem 4.1. Let us first consider the ideal case $\epsilon_{k}=\epsilon^{\prime}_{k+1}=0$. For $\epsilon_{k}=0$, we obtain the exact Q-function of policy $\pi_{k}$; for $\epsilon^{\prime}_{k+1}=0$, we obtain the ideal energy-based updated policy defined in Proposition 3.2. That is,

\pi_{k+1}=\mathop{\mathrm{argmax}}_{\pi}\mathbb{E}_{\nu_{k}}\Big\{\left\langle Q^{\pi_{k}}(s,\mathrm{d}_{\mathcal{S}},\cdot),\pi(\cdot|s,\mathrm{d}_{\mathcal{S}})\right\rangle-\upsilon_{k}\mathrm{KL}\left(\pi(\cdot|s,\mathrm{d}_{\mathcal{S}})\|\pi_{k}(\cdot|s,\mathrm{d}_{\mathcal{S}})\right)\Big\}. \quad (5)

Without function approximation, problem (5) can be solved by treating each joint state $(s,\mathrm{d}_{\mathcal{S}})$ independently; hence one can apply the well-known three-point lemma from mirror descent (Chen and Teboulle, 1993) and obtain, for all $(s,\mathrm{d}_{\mathcal{S}})\in{\mathcal{S}}\times\mathcal{P}({\mathcal{S}})$:

\left\langle Q^{\pi_{k}}(s,\mathrm{d}_{\mathcal{S}},\cdot),\pi^{*}(\cdot|s,\mathrm{d}_{\mathcal{S}})-\pi_{k}(\cdot|s,\mathrm{d}_{\mathcal{S}})\right\rangle
\leq\upsilon_{k}\Big\{\mathrm{KL}\left(\pi^{*}(\cdot|s,\mathrm{d}_{\mathcal{S}})\|\pi_{k}(\cdot|s,\mathrm{d}_{\mathcal{S}})\right)-\mathrm{KL}\left(\pi^{*}(\cdot|s,\mathrm{d}_{\mathcal{S}})\|\pi_{k+1}(\cdot|s,\mathrm{d}_{\mathcal{S}})\right)-\mathrm{KL}\left(\pi_{k+1}(\cdot|s,\mathrm{d}_{\mathcal{S}})\|\pi_{k}(\cdot|s,\mathrm{d}_{\mathcal{S}})\right)\Big\}
\quad+\left\langle Q^{\pi_{k}}(s,\mathrm{d}_{\mathcal{S}},\cdot),\pi_{k+1}(\cdot|s,\mathrm{d}_{\mathcal{S}})-\pi_{k}(\cdot|s,\mathrm{d}_{\mathcal{S}})\right\rangle.

By Lemma 6.1 in Kakade and Langford (2002), taking the expectation of the left-hand side over $\nu^{*}$ yields exactly $(1-\gamma)\left\{\mathcal{L}(\pi^{*})-\mathcal{L}(\pi_{k})\right\}$. Hence we have

(1-\gamma)\left\{\mathcal{L}(\pi^{*})-\mathcal{L}(\pi_{k})\right\}
\leq\upsilon_{k}\mathbb{E}_{\nu^{*}}\Big\{\mathrm{KL}\left(\pi^{*}(\cdot|s,\mathrm{d}_{\mathcal{S}})\|\pi_{k}(\cdot|s,\mathrm{d}_{\mathcal{S}})\right)-\mathrm{KL}\left(\pi^{*}(\cdot|s,\mathrm{d}_{\mathcal{S}})\|\pi_{k+1}(\cdot|s,\mathrm{d}_{\mathcal{S}})\right)-\mathrm{KL}\left(\pi_{k+1}(\cdot|s,\mathrm{d}_{\mathcal{S}})\|\pi_{k}(\cdot|s,\mathrm{d}_{\mathcal{S}})\right)\Big\}
\quad+\mathbb{E}_{\nu^{*}}\left\langle Q^{\pi_{k}}(s,\mathrm{d}_{\mathcal{S}},\cdot),\pi_{k+1}(\cdot|s,\mathrm{d}_{\mathcal{S}})-\pi_{k}(\cdot|s,\mathrm{d}_{\mathcal{S}})\right\rangle.

Pinsker's inequality,
\mathrm{KL}\left(\pi_{k+1}(\cdot|s,\mathrm{d}_{\mathcal{S}})\|\pi_{k}(\cdot|s,\mathrm{d}_{\mathcal{S}})\right)\geq\frac{1}{2}\|\pi_{k+1}(\cdot|s,\mathrm{d}_{\mathcal{S}})-\pi_{k}(\cdot|s,\mathrm{d}_{\mathcal{S}})\|_{1}^{2},
combined with the observation $\|Q^{\pi_{k}}(s,\mathrm{d}_{\mathcal{S}},\cdot)\|_{\infty}\leq\overline{r}$ and the basic inequality $-ax^{2}+bx\leq b^{2}/(4a)$ for $a>0$, gives us

(1-\gamma)\left\{\mathcal{L}(\pi^{*})-\mathcal{L}(\pi_{k})\right\}\leq\upsilon_{k}\mathbb{E}_{\nu^{*}}\Big\{\mathrm{KL}\left(\pi^{*}(\cdot|s,\mathrm{d}_{\mathcal{S}})\|\pi_{k}(\cdot|s,\mathrm{d}_{\mathcal{S}})\right)-\mathrm{KL}\left(\pi^{*}(\cdot|s,\mathrm{d}_{\mathcal{S}})\|\pi_{k+1}(\cdot|s,\mathrm{d}_{\mathcal{S}})\right)\Big\}+\overline{r}^{2}/(2\upsilon_{k}). \quad (6)
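In more detail, the term $\overline{r}^{2}/(2\upsilon_{k})$ in (6) comes from bounding, for each joint state, the last two terms of the display preceding Pinsker's inequality: writing $x=\|\pi_{k+1}(\cdot|s,\mathrm{d}_{\mathcal{S}})-\pi_{k}(\cdot|s,\mathrm{d}_{\mathcal{S}})\|_{1}$,

\left\langle Q^{\pi_{k}}(s,\mathrm{d}_{\mathcal{S}},\cdot),\pi_{k+1}(\cdot|s,\mathrm{d}_{\mathcal{S}})-\pi_{k}(\cdot|s,\mathrm{d}_{\mathcal{S}})\right\rangle-\upsilon_{k}\mathrm{KL}\left(\pi_{k+1}(\cdot|s,\mathrm{d}_{\mathcal{S}})\|\pi_{k}(\cdot|s,\mathrm{d}_{\mathcal{S}})\right)\leq\overline{r}\,x-\frac{\upsilon_{k}}{2}x^{2}\leq\frac{\overline{r}^{2}}{2\upsilon_{k}}.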

By setting $\upsilon_{k}=\mathcal{O}(\sqrt{K})$ and telescoping the above inequality from $k=0$ to $K-1$, we obtain

\min_{0\leq k\leq K-1}\{\mathcal{L}(\pi^{*})-\mathcal{L}(\pi_{k})\}=\mathcal{O}(1/\sqrt{K}).
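Filling in the telescoping step: with $\upsilon_{k}=\upsilon\sqrt{K}$ and $\pi_{0}$ uniform, so that $\mathbb{E}_{\nu^{*}}\mathrm{KL}(\pi^{*}(\cdot|s,\mathrm{d}_{\mathcal{S}})\|\pi_{0}(\cdot|s,\mathrm{d}_{\mathcal{S}}))\leq\log|\overline{\mathcal{A}}|$, summing (6) over $k=0,\ldots,K-1$ telescopes the KL terms and gives

(1-\gamma)\sum_{k=0}^{K-1}\left\{\mathcal{L}(\pi^{*})-\mathcal{L}(\pi_{k})\right\}\leq\upsilon\sqrt{K}\log|\overline{\mathcal{A}}|+\frac{K\,\overline{r}^{2}}{2\upsilon\sqrt{K}}=\left(\upsilon\log|\overline{\mathcal{A}}|+\frac{\overline{r}^{2}}{2\upsilon}\right)\sqrt{K},

and dividing by $(1-\gamma)K$ and bounding the minimum by the average yields the $\mathcal{O}(1/\sqrt{K})$ rate.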

Note that the key element in the global convergence of MF-PPO is the recursion (6), which holds whenever we have the exact Q-function of the current policy and no function approximation is used when updating the next policy. In practice, MF-PPO conducts approximate policy evaluation ($\epsilon_{k}>0$) and, after obtaining the approximate Q-function, conducts an approximate policy improvement step ($\epsilon^{\prime}_{k+1}>0$). In addition, the error in approximating the Q-function introduced in the evaluation step can be further compounded in the improvement step. Nevertheless, recursion (6) still holds approximately, with additional terms representing the policy evaluation/improvement errors.

Lemma 4.3 (Liu et al. (2019a)).

Let $\epsilon_{k}$ (evaluation error) and $\epsilon_{k+1}^{\prime}$ (improvement error) be defined as in Lemma 4.1 and Lemma 4.2, respectively. We have

(1-\gamma)\left(\mathcal{L}(\pi^{*})-\mathcal{L}(\pi_{k})\right)\leq\upsilon_{k}\mathbb{E}_{\nu^{*}}\Big\{\mathrm{KL}\left(\pi^{*}(\cdot|s,\mathrm{d}_{\mathcal{S}})\|\pi_{k}(\cdot|s,\mathrm{d}_{\mathcal{S}})\right)-\mathrm{KL}\left(\pi^{*}(\cdot|s,\mathrm{d}_{\mathcal{S}})\|\pi_{k+1}(\cdot|s,\mathrm{d}_{\mathcal{S}})\right)\Big\}+\upsilon_{k}\left(\varepsilon_{k}+\varepsilon_{k}^{\prime}\right)+\upsilon_{k}^{-1}M, \quad (7)

where

\varepsilon_{k}=\tau_{k+1}^{-1}\epsilon_{k+1}^{\prime}\phi_{k+1}^{*}+\upsilon_{k}^{-1}\epsilon_{k}\psi_{k}^{*},\qquad\varepsilon_{k}^{\prime}=\lvert\mathcal{A}\rvert\,\tau_{k+1}^{-2}(\epsilon_{k+1}^{\prime})^{2},\qquad M=\mathbb{E}_{\nu^{*}}\left\{\max_{\overline{a}\in\overline{\mathcal{A}}}\left[F^{Q}_{\theta_{0}}(s,\mathrm{d}_{\mathcal{S}},\overline{a})\right]^{2}\right\}+2R_{A}^{2}.

In addition, $\phi_{k}^{*}$ and $\psi_{k}^{*}$ are defined by

\phi_{k}^{*}=\mathbb{E}_{\widetilde{\sigma}_{k}}\left[\lvert\mathrm{d}\pi^{*}/\mathrm{d}\pi_{0}-\mathrm{d}\pi_{k}/\mathrm{d}\pi_{0}\rvert^{2}\right]^{1/2},\qquad\psi_{k}^{*}=\mathbb{E}_{\sigma_{k}}\left[\lvert\mathrm{d}\sigma^{*}/\mathrm{d}\sigma_{k}-\mathrm{d}(\nu^{*}\times\pi_{k})/\mathrm{d}\sigma_{k}\rvert^{2}\right]^{1/2}.

Finally, telescoping inequality (7) of Lemma 4.3 from $k=0$ to $K-1$ completes the proof of Theorem 4.1.

5 Experiments

We perform experiments on the benchmark multi-agent particle environment (MPE) used in prior work (Lowe et al., 2017; Mordatch and Abbeel, 2018; Liu et al., 2019b). In the cooperative navigation task, $N$ agents, each with position $x_{i}\in\mathbb{R}^{2}$, must move to cover $N$ fixed landmarks at positions $y_{i}\in\mathbb{R}^{2}$. They receive a team reward $R=-\sum_{i=1}^{N}\min_{j\in[N]}\lVert y_{i}-x_{j}\rVert_{2}$. In the cooperative push task, $N$ agents with positions $x_{i}\in\mathbb{R}^{2}$ work together to push a ball at position $x\in\mathbb{R}^{2}$ to a fixed landmark $y\in\mathbb{R}^{2}$. They receive a team reward $R=-\min_{j\in[N]}\lVert x_{j}-x\rVert_{2}-\lVert x-y\rVert_{2}$. Both tasks involve homogeneous agents.
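Both team rewards are easy to compute directly from the formulas above; the following sketch uses random placeholder positions rather than states from the environment.

```python
import numpy as np

# Team rewards for the two MPE tasks; note both are permutation invariant over agents.
def navigation_reward(agent_pos, landmark_pos):
    # R = - sum_i min_j || y_i - x_j ||_2
    dists = np.linalg.norm(landmark_pos[:, None, :] - agent_pos[None, :, :], axis=-1)
    return -dists.min(axis=1).sum()

def push_reward(agent_pos, ball_pos, landmark_pos):
    # R = - min_j || x_j - ball ||_2 - || ball - landmark ||_2
    return (-np.linalg.norm(agent_pos - ball_pos, axis=-1).min()
            - np.linalg.norm(ball_pos - landmark_pos))

rng = np.random.default_rng(0)
agents, landmarks = rng.normal(size=(6, 2)), rng.normal(size=(6, 2))
print(navigation_reward(agents, landmarks))
print(push_reward(agents, ball_pos=rng.normal(size=2), landmark_pos=rng.normal(size=2)))
```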

We instantiate MF-PPO by building on the original PPO algorithm (Schulman et al., 2017) with a centralized permutation invariant value-function critic, represented by a DeepSet (Zaheer et al., 2017) network with two hidden layers. We use a standard two-layer multi-layer perceptron (MLP) for the actor network in all algorithms. One version, labeled MF, has a centralized actor network that outputs the mean and diagonal covariance of a Gaussian distribution over the joint continuous action space. A second, decentralized version with parameter sharing, labeled MF-PS, has a decentralized actor network that maps individual observations to the parameters of a Gaussian distribution over the individual action space.

Figure 3 (panels: (a) N=6, (b) N=15, (c) N=30, (d) N=200): Performance versus number of environment steps in the multi-agent cooperative navigation task, over five independent runs per method. Each point is the average reward over 1000 evaluation episodes. MF, based on the DeepSet invariant critic and built on either PPO or MADDPG, significantly outperforms other critic representations for various numbers of agents.

We compare with two other critic representations: one that uses an MLP for the centralized critic, labeled MLP, and one that uses a graph convolutional network for the critic (Liu et al., 2019b), labeled GCN. Note that the GCN representation is permutation invariant if one imposes a fully-connected graph over the agents in the MPE, but this invariance property does not hold for general graphs. We also compare with an extension of Yang et al. (2018) to continuous action spaces, labeled MF-A, in which each independent DDPG agent $i$ has a decentralized critic $Q(s_{i},a_{i},\bar{a}_{i})$ that takes in the mean of all other agents' actions $\bar{a}_{i}:=\frac{1}{N-1}\sum_{j\neq i}a_{j}$. Empirically, since we find that off-policy RL learns faster than on-policy RL in the MPE with larger numbers of agents, regardless of the critic representation, we make all comparisons on top of MADDPG (Lowe et al., 2017) for all cases with $N\geq 15$. For a fair comparison of the critic representations, we ensure that all neural network architectures contain approximately the same number of trainable weights.

For the cooperative navigation task, Figure 3 shows that the permutation invariant critic representation based on DeepSet enables MF and MF-PS to learn faster or reach higher performance than all other representations and methods in the MPE with 6, 15, 30, and 200 agents. Figure 4(a) shows that MF consistently and significantly outperforms GCN as the number of parameters in their critic networks varies over a range, with all other settings fixed. In particular, MF requires far fewer critic parameters to achieve higher performance than GCN. For the cooperative push task, Figure 4(b) demonstrates a similar performance improvement of MF over all other methods. We provide additional experimental details in Appendix B.

Figure 4 (panels: (a) varying critic size, (b) N=15): (a) MF outperforms GCN even with a smaller number of critic network parameters (N=3). (b) Performance versus number of environment steps in the multi-agent cooperative push task. MF based on the DeepSet invariant critic significantly outperforms other critic representations.

Computational Improvements. Theorem 4.1 states that to obtain a small optimality gap in MF-PPO, one needs to compute the update on a large number of agents. We remark that with the dual embedding techniques developed in Dai et al. (2017), one can avoid computation on all the agents by sampling a small number of agents to compute the update. This technique could be readily incorporated into MF-PPO to improve its computational efficiency.

6 Conclusion

We propose a principled approach to exploit agent homogeneity and permutation invariance through the mean-field approximation in MARL. Moreover, our results are also the first to provably show the global convergence of MARL algorithms with neural networks as function approximators. This is in sharp contrast to current practices, which are mostly heuristic methods without convergence guarantees.

References

  • Arora et al. (2019) Arora, S., Du, S. S., Hu, W., Li, Z. and Wang, R. (2019). Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. arXiv preprint arXiv:1901.08584 .
  • Bertsekas (2011) Bertsekas, D. P. (2011). Approximate policy iteration: A survey and some new methods. Journal of Control Theory and Applications 9 310–335.
  • Bloem-Reddy and Teh (2019) Bloem-Reddy, B. and Teh, Y. W. (2019). Probabilistic symmetry and invariant neural networks. arXiv preprint arXiv:1901.06082 .
  • Cai et al. (2019) Cai, Q., Yang, Z., Lee, J. D. and Wang, Z. (2019). Neural temporal-difference learning converges to global optima. In Advances in Neural Information Processing Systems.
  • Cao et al. (2013) Cao, Y., Yu, W., Ren, W. and Chen, G. (2013). An overview of recent progress in the study of distributed multi-agent coordination. IEEE Transactions on Industrial informatics 9 427–438.
  • Carmona et al. (2019) Carmona, R., Laurière, M. and Tan, Z. (2019). Model-free mean-field reinforcement learning: mean-field mdp and mean-field q-learning. arXiv preprint arXiv:1910.12802 .
  • Chen and Teboulle (1993) Chen, G. and Teboulle, M. (1993). Convergence analysis of a proximal-like minimization algorithm using bregman functions. SIAM Journal on Optimization 3 538–543.
  • Chizat and Bach (2018) Chizat, L. and Bach, F. (2018). A note on lazy training in supervised differentiable programming. arXiv preprint arXiv:1812.07956 8.
  • Dai et al. (2017) Dai, B., Shaw, A., Li, L., Xiao, L., He, N., Liu, Z., Chen, J. and Song, L. (2017). Sbeed: Convergent reinforcement learning with nonlinear function approximation. arXiv preprint arXiv:1712.10285 .
  • Jacot et al. (2018) Jacot, A., Gabriel, F. and Hongler, C. (2018). Neural tangent kernel: Convergence and generalization in neural networks. In Advances in neural information processing systems.
  • Kakade and Langford (2002) Kakade, S. and Langford, J. (2002). Approximately optimal approximate reinforcement learning. In ICML, vol. 2.
  • Kalyanakrishnan et al. (2006) Kalyanakrishnan, S., Liu, Y. and Stone, P. (2006). Half field offense in robocup soccer: A multiagent reinforcement learning case study. In Robot Soccer World Cup. Springer.
  • Kipf and Welling (2017) Kipf, T. N. and Welling, M. (2017). Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations.
  • Kober et al. (2013) Kober, J., Bagnell, J. A. and Peters, J. (2013). Reinforcement learning in robotics: A survey. The International Journal of Robotics Research 32 1238–1274.
  • Littman (1994) Littman, M. L. (1994). Markov games as a framework for multi-agent reinforcement learning. In Machine Learning Proceedings 1994. Elsevier, 157–163.
  • Liu et al. (2019a) Liu, B., Cai, Q., Yang, Z. and Wang, Z. (2019a). Neural proximal/trust region policy optimization attains globally optimal policy. arXiv preprint arXiv:1906.10306 .
  • Liu et al. (2019b) Liu, I.-J., Yeh, R. A. and Schwing, A. G. (2019b). Pic: Permutation invariant critic for multi-agent deep reinforcement learning. arXiv preprint arXiv:1911.00025 .
  • Lowe et al. (2017) Lowe, R., Wu, Y., Tamar, A., Harb, J., Abbeel, O. P. and Mordatch, I. (2017). Multi-agent actor-critic for mixed cooperative-competitive environments. In Advances in Neural Information Processing Systems.
  • Matarić (1997) Matarić, M. J. (1997). Reinforcement learning in the multi-robot domain. In Robot colonies. Springer, 73–83.
  • Menda et al. (2018) Menda, K., Chen, Y.-C., Grana, J., Bono, J. W., Tracey, B. D., Kochenderfer, M. J. and Wolpert, D. (2018). Deep reinforcement learning for event-driven multi-agent decision processes. IEEE Transactions on Intelligent Transportation Systems 20 1259–1268.
  • Mordatch and Abbeel (2018) Mordatch, I. and Abbeel, P. (2018). Emergence of grounded compositional language in multi-agent populations. In Thirty-Second AAAI Conference on Artificial Intelligence.
  • Muandet et al. (2016) Muandet, K., Fukumizu, K., Sriperumbudur, B. and Schölkopf, B. (2016). Kernel mean embedding of distributions: A review and beyond. arXiv preprint arXiv:1605.09522 .
  • Panait and Luke (2005) Panait, L. and Luke, S. (2005). Cooperative multi-agent learning: The state of the art. Autonomous agents and multi-agent systems 11 387–434.
  • Schulman et al. (2017) Schulman, J., Wolski, F., Dhariwal, P., Radford, A. and Klimov, O. (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 .
  • Shalev-Shwartz et al. (2016) Shalev-Shwartz, S., Shammah, S. and Shashua, A. (2016). Safe, multi-agent, reinforcement learning for autonomous driving. arXiv preprint arXiv:1610.03295 .
  • Smola et al. (2007) Smola, A., Gretton, A., Song, L. and Schölkopf, B. (2007). A hilbert space embedding for distributions. In International Conference on Algorithmic Learning Theory. Springer.
  • Song et al. (2009) Song, L., Huang, J., Smola, A. and Fukumizu, K. (2009). Hilbert space embeddings of conditional distributions with applications to dynamical systems. In Proceedings of the 26th Annual International Conference on Machine Learning.
  • Sunehag et al. (2017) Sunehag, P., Lever, G., Gruslys, A., Czarnecki, W. M., Zambaldi, V., Jaderberg, M., Lanctot, M., Sonnerat, N., Leibo, J. Z., Tuyls, K. et al. (2017). Value-decomposition networks for cooperative multi-agent learning. arXiv preprint arXiv:1706.05296 .
  • Sutton and Barto (2018) Sutton, R. S. and Barto, A. G. (2018). Reinforcement learning: An introduction. MIT press.
  • Vinyals et al. (2019) Vinyals, O., Babuschkin, I., Czarnecki, W. M., Mathieu, M., Dudzik, A., Chung, J., Choi, D. H., Powell, R., Ewalds, T., Georgiev, P. et al. (2019). Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature 575 350–354.
  • Wang et al. (2020) Wang, L., Yang, Z. and Wang, Z. (2020). Breaking the curse of many agents: Provable mean embedding q-iteration for mean-field reinforcement learning. In International Conference on Machine Learning. PMLR.
  • Yang et al. (2017) Yang, J., Ye, X., Trivedi, R., Xu, H. and Zha, H. (2017). Learning deep mean field games for modeling large population behavior. arXiv preprint arXiv:1711.03156 .
  • Yang et al. (2018) Yang, Y., Luo, R., Li, M., Zhou, M., Zhang, W. and Wang, J. (2018). Mean field multi-agent reinforcement learning. arXiv preprint arXiv:1802.05438 .
  • Zaheer et al. (2017) Zaheer, M., Kottur, S., Ravanbakhsh, S., Poczos, B., Salakhutdinov, R. R. and Smola, A. J. (2017). Deep sets. In Advances in neural information processing systems.
  • Zhang et al. (2019) Zhang, K., Yang, Z. and Başar, T. (2019). Multi-agent reinforcement learning: A selective overview of theories and algorithms. arXiv preprint arXiv:1911.10635 .
  • Zheng et al. (2020) Zheng, S., Trott, A., Srinivasa, S., Naik, N., Gruesbeck, M., Parkes, D. C. and Socher, R. (2020). The ai economist: Improving equality and productivity with ai-driven tax policies. arXiv preprint arXiv:2004.13332 .

 

Supplementary Material for A Permutation Invariant Approach to Deep Mean-Field Multi-Agent Reinforcement Learning

 

Appendix A Algorithms for Policy Evaluation/Improvement

Algorithm 2 Policy Evaluation via TD
  Require: Mean-field MDP (𝒮¯,𝒜¯,,r,γ)(\overline{{\mathcal{S}}},\overline{\mathcal{A}},\mathbb{P},r,\gamma), sample {(st,𝐬t,a¯t,st,𝐬t,a¯t)}\{(s_{t},\mathbf{s}_{t},\overline{a}_{t},s_{t}^{\prime},\mathbf{s}_{t}^{\prime},\overline{a}_{t}^{\prime})\}, number of iterations TT
  Initialize: wj(0)Unif{1,+1},θj(0)𝒩(0,Id/d),j[mQ]w_{j}(0)\sim\mathrm{Unif}\{-1,+1\},~{}~{}~{}\theta_{j}(0)\sim\mathcal{N}(0,\mathrm{I}_{d}/d),~{}\forall j\in[m_{Q}]
  Set step size ηT1/2\eta\leftarrow T^{-1/2}
  for t=0,,T1t=0,\ldots,T-1 do
      Set (s,𝐬,a¯,s,𝐬,a¯)(st,𝐬t,a¯t,st,𝐬t,a¯t)(s,\mathbf{s},\overline{a},s^{\prime},\mathbf{s}^{\prime},\overline{a}^{\prime})\leftarrow(s_{t},\mathbf{s}_{t},\overline{a}_{t},s_{t}^{\prime},\mathbf{s}_{t}^{\prime},\overline{a}_{t}^{\prime})
      θ(t+1/2)=θ(t)η{Fθ(t)Q(s,𝐬,a¯)(1γ)r(s,d𝒮,a¯)γFθ(t)Q(s,𝐬,a¯)}θFθ(t)Q(s,𝐬,a¯)\theta(t+1/2)=\theta(t)-\eta\left\{F^{Q}_{\theta(t)}(s,\mathbf{s},\overline{a})-(1-\gamma)r(s,\mathrm{d}_{\mathcal{S}},\overline{a})-\gamma F^{Q}_{\theta(t)}(s,\mathbf{s}^{\prime},\overline{a}^{\prime})\right\}\nabla_{\theta}F^{Q}_{\theta(t)}(s,\mathbf{s},\overline{a})
  θ(t+1)=Π𝔹(Rθ,θ0)(θ(t+1/2))\theta(t+1)=\Pi_{\mathbb{B}(R_{\theta},\theta_{0})}(\theta(t+1/2))
  end for
  Take ergodic average: θ¯T1Tt=0T1θ(t)\overline{\theta}_{T}\leftarrow\frac{1}{T}\sum_{t=0}^{T-1}\theta(t)
  Output: Fθ¯TQF^{Q}_{\overline{\theta}_{T}}
Algorithm 3 Policy Improvement via SGD
  Require: Mean-field MDP (𝒮¯,𝒜¯,ℙ,r,γ)(\overline{{\mathcal{S}}},\overline{\mathcal{A}},\mathbb{P},r,\gamma), sample {(st,𝐬t,a¯t)}\{(s_{t},\mathbf{s}_{t},\overline{a}_{t})\}, number of iterations TT
  Initialize: vj(0)Unif{1,+1},αj(0)𝒩(0,Id/d),j[mA]v_{j}(0)\sim\mathrm{Unif}\{-1,+1\},~{}~{}~{}\alpha_{j}(0)\sim\mathcal{N}(0,\mathrm{I}_{d}/d),~{}\forall j\in[m_{A}]
  Set step size ηT1/2\eta\leftarrow T^{-1/2}
  for t=0,,T1t=0,\ldots,T-1 do
      Set (s,𝐬,a¯)(st,𝐬t,a¯t)(s,\mathbf{s},\overline{a})\leftarrow(s_{t},\mathbf{s}_{t},\overline{a}_{t})
      α(t+1/2)=α(t)η{Fα(t)A(s,𝐬,a¯)τk+1(υk1FθkQ(s,𝐬,a¯)+τk1FαkA(s,𝐬,a¯))}αFα(t)A(s,𝐬,a¯)\alpha(t+1/2)=\alpha(t)-\eta\left\{F^{A}_{\alpha(t)}(s,\mathbf{s},\overline{a})-\tau_{k+1}(\upsilon_{k}^{-1}F^{Q}_{\theta_{k}}(s,\mathbf{s},\overline{a})+\tau_{k}^{-1}F^{A}_{\alpha_{k}}(s,\mathbf{s},\overline{a}))\right\}\nabla_{\alpha}F^{A}_{\alpha(t)}(s,\mathbf{s},\overline{a})
      α(t+1)=Π𝔹(Rα,α0)(α(t+1/2))\alpha(t+1)=\Pi_{\mathbb{B}(R_{\alpha},\alpha_{0})}\left(\alpha(t+1/2)\right)
  end for
  Take ergodic average: α¯T1Tt=0T1α(t)\overline{\alpha}_{T}\leftarrow\frac{1}{T}\sum_{t=0}^{T-1}\alpha(t)
  Output: Fα¯TAF^{A}_{\overline{\alpha}_{T}}
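For concreteness, a minimal numpy sketch of one iteration of each of the two loops above is given below. It is an illustration only: the helper names (mean_field_net, net_grad, project_ball, td_step, improvement_step), the flattened feature construction, and the way the reward is passed in are assumptions of this sketch, not the authors' implementation; the update rules themselves follow the two algorithm boxes above.

import numpy as np

def mean_field_net(params, w, s, states, a_bar):
    # F(s, s_vec, a_bar) = (1/N) sum_i (1/sqrt(m)) sum_j w_j relu(params_j . (s, s_i, a_bar)),
    # with s, each s_i, and a_bar given as 1-D numpy arrays and params of shape (m, d).
    m = params.shape[0]
    feats = np.stack([np.concatenate([s, si, a_bar]) for si in states])   # (N, d)
    return float(np.mean(np.maximum(feats @ params.T, 0.0) @ w) / np.sqrt(m))

def net_grad(params, w, s, states, a_bar):
    # gradient of the network output w.r.t. the first-layer weights (ReLU subgradient at 0 set to 0)
    m = params.shape[0]
    feats = np.stack([np.concatenate([s, si, a_bar]) for si in states])
    active = (feats @ params.T > 0).astype(float)                          # (N, m)
    return (active * w).T @ feats / (np.sqrt(m) * feats.shape[0])          # (m, d)

def project_ball(params, params0, radius):
    # Euclidean projection onto the ball {beta : ||beta - beta(0)||_2 <= radius}
    diff = params - params0
    nrm = np.linalg.norm(diff)
    return params0 + diff * min(1.0, radius / max(nrm, 1e-12))

def td_step(theta, theta0, w, transition, reward, gamma, R_theta, eta):
    # one iteration of Algorithm 2 (policy evaluation via TD)
    (s, states, a_bar), (s2, states2, a_bar2) = transition
    delta = (mean_field_net(theta, w, s, states, a_bar)
             - (1.0 - gamma) * reward
             - gamma * mean_field_net(theta, w, s2, states2, a_bar2))
    theta_half = theta - eta * delta * net_grad(theta, w, s, states, a_bar)
    return project_ball(theta_half, theta0, R_theta)

def improvement_step(alpha, alpha0, v, theta_k, w_q, alpha_k, v_k, state_action,
                     tau_next, tau_k, upsilon_k, R_alpha, eta):
    # one iteration of Algorithm 3: regress onto tau_{k+1}(upsilon_k^{-1} F^Q + tau_k^{-1} F^A)
    s, states, a_bar = state_action
    target = tau_next * (mean_field_net(theta_k, w_q, s, states, a_bar) / upsilon_k
                         + mean_field_net(alpha_k, v_k, s, states, a_bar) / tau_k)
    delta = mean_field_net(alpha, v, s, states, a_bar) - target
    alpha_half = alpha - eta * delta * net_grad(alpha, v, s, states, a_bar)
    return project_ball(alpha_half, alpha0, R_alpha)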

Appendix B Experimental details

\diamondsuit Environment. We used the open-source code for the multi-agent particle environments provided by Liu et al. (2019b), which itself is based on the original code by Lowe et al. (2017), without any modification. Please refer to Liu et al. (2019b, Appendix B) for complete details.

\diamondsuit Computing infrastructure and runtime. Experiments were run on Intel Xeon 6154 CPUs, using one core for each independent policy training session. Average training time was approximately 4 hours for N=15N=15 and N=30N=30 with 1.5e6 steps, and 12 hours for N=200N=200 with 7.25e5 steps.

B.1 Implementation

\diamondsuit GCN (we label this architecture “GCN” to distinguish it from alternative ways of instantiating a permutation invariant critic). For experiments based on MADDPG, we re-ran the open-source code provided by Liu et al. (2019b) without modification. For experiments based on PPO, we used the GCN network of Kipf and Welling (2017) as the critic in PPO. The GCN performance reported in this work is the result of our own experimental runs.

\diamondsuit MF. In the PPO-based implementation, the joint policy is parameterized by a multi-layer perceptron (MLP) with two hidden layers of size 128 and ReLU activation, which takes in the concatenation of all agents’ observations and outputs the mean and diagonal covariance of a joint Gaussian policy. In the MADDPG-based implementation, each decentralized actor (i.e., policy) network is an MLP with two hidden layers of size 187 and ReLU activation, which takes in each agent’s individual observation and outputs the agent’s real-valued action vector. The centralized critic has the form V(s)=f3(f2(∑i=1Nf1(si)))V(s)=f_{3}(f_{2}(\sum_{i=1}^{N}f_{1}(s_{i}))), where f1f_{1} and f2f_{2} are hidden layers with ReLU activation of size h=190/205/187/187/187h=190/205/187/187/187 (for N=3/6/15/30/200N=3/6/15/30/200 agents, respectively, chosen so that the number of trainable critic parameters approximately matches that of the GCN critic used in Liu et al. (2019b)), and f3f_{3} is a linear layer with output size 1.
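The centralized MF critic above is a Deep Sets style network. A minimal numpy sketch of the forward pass V(s)=f3(f2(Σi f1(si))) is shown below; the weight shapes and initialization are illustrative assumptions, and only the sum over per-agent embeddings is taken from the description above. The permutation check at the end illustrates why this critic is permutation invariant by construction.

import numpy as np

def mf_critic_forward(obs, params):
    # obs: (N, d_obs) array of per-agent observations; params: dict of weights.
    W1, b1, W2, b2, W3, b3 = (params[k] for k in ("W1", "b1", "W2", "b2", "W3", "b3"))
    h1 = np.maximum(obs @ W1 + b1, 0.0)        # f1 applied to each agent, (N, h)
    pooled = h1.sum(axis=0)                    # permutation-invariant sum pooling, (h,)
    h2 = np.maximum(pooled @ W2 + b2, 0.0)     # f2, (h,)
    return (h2 @ W3 + b3).item()               # f3: linear layer with scalar output

# illustrative sizes: N = 15 agents, d_obs = 10, hidden size h = 187
rng = np.random.default_rng(0)
N, d_obs, h = 15, 10, 187
params = {"W1": rng.normal(size=(d_obs, h)) * 0.1, "b1": np.zeros(h),
          "W2": rng.normal(size=(h, h)) * 0.1,     "b2": np.zeros(h),
          "W3": rng.normal(size=(h, 1)) * 0.1,     "b3": np.zeros(1)}
obs = rng.normal(size=(N, d_obs))
v1 = mf_critic_forward(obs, params)
v2 = mf_critic_forward(obs[rng.permutation(N)], params)   # permuting agents leaves V unchanged
assert np.isclose(v1, v2)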

\diamondsuit MLP. The centralized critic is an MLP with two hidden layers of size 138/187/75/42/7 for N=3,6,15,30,200N=3,6,15,30,200 agents, respectively, such that the total number of trainable parameters is approximately equal to that of the GCN critic (Liu et al., 2019b). The joint policy has the same representation as in MF for both the PPO-based and MADDPG-based implementations.

\diamondsuit Hyperparameters. We used the Adam optimizer for all algorithms. For experiments based on PPO, the common hyperparameters across all algorithms are: discount factor γ=0.99\gamma=0.99, KL divergence target dtarg=0.01d_{\text{targ}}=0.01, KL divergence initial coefficient β=3\beta=3, GAE λ=0.95\lambda=0.95 (see Schulman et al. (2017)), policy entropy loss coefficient 0.01, max gradient norm 0.50.5, and actor learning rate 10410^{-4}. Each PPO training step uses 5 epochs with minibatch size 2. We swept the critic learning rate for each critic architecture, choosing 10310^{-3} for MF, 10410^{-4} for GCN and MLP, and 10210^{-2} for MF-A.

For experiments based on MADDPG, the common hyperparameters are: actor learning rate 0.010.01, critic learning rate 0.010.01, batch size 1024, discount factor γ=0.95\gamma=0.95, replay buffer size 10510^{5}, 8 actor and critic updates per 100 environment steps, and target network update rate τ=0.01\tau=0.01.
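Collected as a configuration sketch, the values from the two paragraphs above read as follows; the dictionary layout and key names are illustrative assumptions, not those of any specific codebase.

PPO_CONFIG = {"gamma": 0.99, "kl_target": 0.01, "kl_init_coef": 3.0, "gae_lambda": 0.95,
              "entropy_coef": 0.01, "max_grad_norm": 0.5, "actor_lr": 1e-4,
              "epochs_per_step": 5, "minibatch_size": 2,
              "critic_lr": {"MF": 1e-3, "GCN": 1e-4, "MLP": 1e-4, "MF-A": 1e-2}}

MADDPG_CONFIG = {"actor_lr": 0.01, "critic_lr": 0.01, "batch_size": 1024, "gamma": 0.95,
                 "buffer_size": int(1e5), "updates_per_100_steps": 8, "target_tau": 0.01}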

Appendix C Proofs in Section 2

Proof of Proposition 2.1.

We first prove the first part of the claim. We begin by showing that the optimal Q-function QQ^{*} is permutation invariant, i.e., Q(𝐬,𝐚)=Q(κ(𝐬),κ(𝐚))Q^{*}(\mathbf{s},\mathbf{a})=Q^{*}(\kappa(\mathbf{s}),\kappa(\mathbf{a})) for every permutation mapping κ\kappa. Note that QQ^{*} is the unique solution to the Bellman optimality equation over the space of QQ-functions

Q(𝐬,𝐚)\displaystyle Q(\mathbf{s},\mathbf{a}) =𝔼𝐬(|𝐬,𝐚){r(𝐬,𝐚)+max𝐚Q(𝐬,𝐚)}\displaystyle=\mathbb{E}_{\mathbf{s}^{\prime}\sim\mathbb{P}(\cdot|\mathbf{s},\mathbf{a})}\left\{r(\mathbf{s},\mathbf{a})+\max_{\mathbf{a}^{\prime}}Q(\mathbf{s}^{\prime},\mathbf{a}^{\prime})\right\}
=r(𝐬,𝐚)+𝐬(𝐬|𝐬,𝐚)max𝐚Q(𝐬,𝐚).\displaystyle=r(\mathbf{s},\mathbf{a})+\sum_{\mathbf{s}^{\prime}}\mathbb{P}(\mathbf{s}^{\prime}|\mathbf{s},\mathbf{a})\max_{\mathbf{a}^{\prime}}Q(\mathbf{s}^{\prime},\mathbf{a}^{\prime}). (9)

Or equivalently, from the permutation invariance of rr and \mathbb{P} defined in (1)

Q(κ(𝐬),κ(𝐚))\displaystyle Q(\kappa(\mathbf{s}),\kappa(\mathbf{a})) =r(κ(𝐬),κ(𝐚))+𝐬(κ(𝐬)|κ(𝐬),κ(𝐚))max𝐚Q(κ(𝐬),κ(𝐚))\displaystyle=r(\kappa(\mathbf{s}),\kappa(\mathbf{a}))+\sum_{\mathbf{s}^{\prime}}\mathbb{P}(\kappa(\mathbf{s}^{\prime})|\kappa(\mathbf{s}),\kappa(\mathbf{a}))\max_{\mathbf{a}^{\prime}}Q(\kappa(\mathbf{s}^{\prime}),\kappa(\mathbf{a}^{\prime}))
=r(𝐬,𝐚)+𝐬(𝐬|𝐬,𝐚)max𝐚Q(κ(𝐬),κ(𝐚)).\displaystyle=r(\mathbf{s},\mathbf{a})+\sum_{\mathbf{s}^{\prime}}\mathbb{P}(\mathbf{s}^{\prime}|\mathbf{s},\mathbf{a})\max_{\mathbf{a}^{\prime}}Q(\kappa(\mathbf{s}^{\prime}),\kappa(\mathbf{a}^{\prime})).

That is, Q(κ(𝐬),κ(𝐚))Q(\kappa(\mathbf{s}),\kappa(\mathbf{a})) is also a solution to the Bellman optimality condition (9). From the uniqueness of the solution, we have Q(𝐬,𝐚)=Q(κ(𝐬),κ(𝐚))Q^{*}(\mathbf{s},\mathbf{a})=Q^{*}(\kappa(\mathbf{s}),\kappa(\mathbf{a})). Hence the optimal policy ν(𝐬)=argmax𝐚Q(𝐬,𝐚)\nu^{*}(\mathbf{s})=\mathop{\mathrm{argmax}}_{\mathbf{a}}Q^{*}(\mathbf{s},\mathbf{a}) is permutation invariant.

For the second part of the proposition, we give the detailed proof for the permutation invariance of the value function VV. For any permutation invariant policy ν\nu, let Rν(𝐬)=𝔼𝐚νR(𝐬,𝐚)R^{\nu}(\mathbf{s})=\mathbb{E}_{\mathbf{a}\sim\nu}R(\mathbf{s},\mathbf{a}) and ν(𝐬′|𝐬)=𝔼𝐚νℙ(𝐬′|𝐬,𝐚)\mathbb{P}^{\nu}(\mathbf{s}^{\prime}|\mathbf{s})=\mathbb{E}_{\mathbf{a}\sim\nu}\mathbb{P}(\mathbf{s}^{\prime}|\mathbf{s},\mathbf{a}) denote the key elements of the induced Markov reward process. From the permutation invariance defined in (1), both RνR^{\nu} and ν(𝐬′|𝐬)\mathbb{P}^{\nu}(\mathbf{s}^{\prime}|\mathbf{s}) are permutation invariant. We define the kk-step value function VkνV^{\nu}_{k} as the expected cumulative reward from time 0 to time kk; then:

V1ν(𝐬)\displaystyle V^{\nu}_{1}(\mathbf{s}) =Rν(𝐬),\displaystyle=R^{\nu}(\mathbf{s}),
\displaystyle V^{\nu}_{k}(\mathbf{s})=R^{\nu}(\mathbf{s})+\gamma\,\mathbb{E}_{\mathbf{s}^{\prime}\sim\mathbb{P}^{\nu}(\cdot|\mathbf{s})}V^{\nu}_{k-1}(\mathbf{s}^{\prime}),~{}~{}\forall k>1.

We can see from the above recursion that VkνV^{\nu}_{k} is permutation invariant for all k1k\geq 1. From Vν=limkVkνV^{\nu}=\lim_{k\to\infty}V^{\nu}_{k}, we conclude that VνV^{\nu} is also permutation invariant.

For permutation invariance of Q-function, recall that Qν(𝐬,𝐚)Q^{\nu}(\mathbf{s},\mathbf{a}) is the unique solution of the following Bellman evaluation equation

Q(𝐬,𝐚)\displaystyle Q(\mathbf{s},\mathbf{a}) =𝔼𝐬(|𝐬,𝐚){r(𝐬,𝐚)+𝔼𝐚ν(|𝐬)Q(𝐬,𝐚)}\displaystyle=\mathbb{E}_{\mathbf{s}^{\prime}\sim\mathbb{P}(\cdot|\mathbf{s},\mathbf{a})}\left\{r(\mathbf{s},\mathbf{a})+\mathbb{E}_{\mathbf{a}^{\prime}\sim\nu(\cdot|\mathbf{s}^{\prime})}Q(\mathbf{s}^{\prime},\mathbf{a}^{\prime})\right\}
\displaystyle=r(\mathbf{s},\mathbf{a})+\sum_{\mathbf{s}^{\prime},\mathbf{a}^{\prime}}\mathbb{P}(\mathbf{s}^{\prime}|\mathbf{s},\mathbf{a})\,\nu(\mathbf{a}^{\prime}|\mathbf{s}^{\prime})\,Q(\mathbf{s}^{\prime},\mathbf{a}^{\prime}). (10)

From the permutation invariance of rr, \mathbb{P} and ν\nu, the function (\mathbf{s},\mathbf{a})\mapsto Q^{\nu}(\kappa(\mathbf{s}),\kappa(\mathbf{a})) is also a solution to (10), as

\displaystyle Q(\kappa(\mathbf{s}),\kappa(\mathbf{a}))=r(\mathbf{s},\mathbf{a})+\sum_{\mathbf{s}^{\prime},\mathbf{a}^{\prime}}\mathbb{P}(\mathbf{s}^{\prime}|\mathbf{s},\mathbf{a})\,\nu(\mathbf{a}^{\prime}|\mathbf{s}^{\prime})\,Q(\kappa(\mathbf{s}^{\prime}),\kappa(\mathbf{a}^{\prime})).

Since QνQ^{\nu} is the unique solution to (10), we have Qν(𝐬,𝐚)=Qν(κ(𝐬),κ(𝐚))Q^{\nu}(\mathbf{s},\mathbf{a})=Q^{\nu}(\kappa(\mathbf{s}),\kappa(\mathbf{a})).

Proof of Proposition 2.2.

The proof that Vν(s,𝐬r)=Vν(s,κ(𝐬r))V^{\nu}(s,\mathbf{s}_{r})=V^{\nu}(s,\kappa(\mathbf{s}_{r})) follows from exactly the same ingredients as the proof of Proposition 2.1. By Theorem 11 in Bloem-Reddy and Teh (2019), there exists a function h_{\nu} such that V^{\nu}(s,\mathbf{s}_{r})\overset{a.s.}{=}h_{\nu}(s,\eta,\sum_{s^{\prime}\in\mathbf{s}_{r}}\frac{1}{|\mathbf{s}_{r}|}\delta_{s^{\prime}}), where ηUnif[0,1]\eta\sim\mathrm{Unif}[0,1]. Proposition 2.2 then follows by letting g_{\nu}(s,\sum_{s^{\prime}\in\mathbf{s}_{r}}\frac{1}{|\mathbf{s}_{r}|}\delta_{s^{\prime}})=\mathbb{E}_{\eta}h_{\nu}(s,\eta,\sum_{s^{\prime}\in\mathbf{s}_{r}}\frac{1}{|\mathbf{s}_{r}|}\delta_{s^{\prime}}).

Appendix D Proofs in Section 3

Proof of Proposition 3.1.

For any permutation invariant actor/critic function F(s,𝐬,a¯)=F(s,κ(𝐬),a¯)F(s,\mathbf{s},\overline{a})=F(s,\kappa(\mathbf{s}),\overline{a}), we consider the size of the tabular representation of such functions and refer to it as the size of the search space. We say 𝐬\mathbf{s} and 𝐬\mathbf{s}^{\prime} are permutation equivalent if 𝐬=κ(𝐬)\mathbf{s}=\kappa(\mathbf{s}^{\prime}) for some permutation mapping κ\kappa. Observe that (i) permutation equivalence is a valid equivalence relation over the space of all possible 𝐬\mathbf{s}; (ii) 𝐬\mathbf{s} and 𝐬\mathbf{s}^{\prime} are permutation equivalent if and only if every value vv occurs the same number of times in 𝐬\mathbf{s} and 𝐬\mathbf{s}^{\prime}. One can verify that for 𝐬\mathbf{s} containing NN elements, each taking values in 𝒮{\mathcal{S}}, the number of equivalence classes induced by the permutation equivalence relation is ∑k=1min{|𝒮|,N}(N−1k−1)(|𝒮|k)\sum_{k=1}^{\min\{|{\mathcal{S}}|,N\}}\binom{N-1}{k-1}\binom{|{\mathcal{S}}|}{k}. The claim of Proposition 3.1 follows immediately. ∎
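As a quick sanity check of the counting argument (a standalone illustration, not part of the proof), the number of equivalence classes equals the number of size-N multisets over 𝒮, and the stars-and-bars expression above can be verified by brute force for small N and |𝒮|:

from itertools import product
from math import comb

def classes_formula(S, N):
    # sum_{k=1}^{min(S, N)} C(N-1, k-1) * C(S, k)
    return sum(comb(N - 1, k - 1) * comb(S, k) for k in range(1, min(S, N) + 1))

def classes_bruteforce(S, N):
    # count distinct joint states up to permutation, i.e., distinct sorted N-tuples
    return len({tuple(sorted(joint)) for joint in product(range(S), repeat=N)})

for S, N in [(2, 3), (3, 4), (4, 5), (5, 3)]:
    assert classes_formula(S, N) == classes_bruteforce(S, N) == comb(N + S - 1, N)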

Proof of Proposition 3.2.

Note that the maximization problem can be solved separately for each s,d𝒮s,\mathrm{d}_{\mathcal{S}}. Hence we can equivalently solve

\overline{\pi}(\cdot|s,\mathrm{d}_{\mathcal{S}})\in\mathop{\mathrm{argmax}}_{\pi(\cdot|s,\mathrm{d}_{\mathcal{S}})\in\mathbb{R}^{\overline{\mathcal{A}}}}\left[\left\langle F^{Q}_{\theta_{k}}(s,\mathrm{d}_{\mathcal{S}},\cdot),\pi(\cdot|s,\mathrm{d}_{\mathcal{S}})\right\rangle-\upsilon_{k}\mathrm{KL}\left(\pi(\cdot|s,\mathrm{d}_{\mathcal{S}})\|\pi_{k}(\cdot|s,\mathrm{d}_{\mathcal{S}})\right)\right],

which is an optimization problem over finite-dimensional parameter π(|s,d𝒮)𝒜¯\pi(\cdot|s,\mathrm{d}_{\mathcal{S}})\in\mathbb{R}^{\overline{\mathcal{A}}}. Setting the derivative w.r.t. π(a¯|s,d𝒮)\pi(\overline{a}|s,\mathrm{d}_{\mathcal{S}}) for each a¯𝒜¯\overline{a}\in\overline{\mathcal{A}} equal to zero, we have

\displaystyle F^{Q}_{\theta_{k}}(s,\mathrm{d}_{\mathcal{S}},\overline{a})-\upsilon_{k}\left(\log\left(\frac{\overline{\pi}(\overline{a}|s,\mathrm{d}_{\mathcal{S}})}{\pi_{k}(\overline{a}|s,\mathrm{d}_{\mathcal{S}})}\right)+1\right)=0,

or equivalently, \overline{\pi}(\overline{a}|s,\mathrm{d}_{\mathcal{S}})\propto\pi_{k}(\overline{a}|s,\mathrm{d}_{\mathcal{S}})\exp\{\upsilon_{k}^{-1}F^{Q}_{\theta_{k}}(s,\mathrm{d}_{\mathcal{S}},\overline{a})\}. Plugging in the definition πk∝exp(τk−1FαkA)\pi_{k}\propto\exp\left(\tau_{k}^{-1}F^{A}_{\alpha_{k}}\right) completes the proof. ∎
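A small numerical illustration of this closed form over a finite action set is given below; the energy values and temperatures are arbitrary stand-ins for the neural parameterizations F^Q and F^A, and softmax is used to normalize the exponentiated energies.

import numpy as np

def softmax(x):
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

# illustrative energies over three actions, at a fixed (s, d_S)
F_Q = np.array([1.0, 0.2, -0.5])     # stand-in for F^Q_{theta_k}(s, d_S, .)
F_A = np.array([0.3, 0.1, 0.0])      # stand-in for F^A_{alpha_k}(s, d_S, .)
tau_k, upsilon_k = 1.0, 0.5

pi_k = softmax(F_A / tau_k)                          # pi_k    proportional to exp(tau_k^{-1} F^A)
pi_bar = softmax(F_A / tau_k + F_Q / upsilon_k)      # pi_bar  proportional to pi_k * exp(upsilon_k^{-1} F^Q)
assert np.allclose(pi_bar, softmax(np.log(pi_k) + F_Q / upsilon_k))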

Appendix E Proofs in Section 4

We provide a unified analysis of the policy evaluation and policy improvement steps. Both steps can be described as minimizing the following mean-squared loss:

\beta(t+1)=\mathop{\mathrm{argmin}}_{\beta\in\mathbb{B}(R_{\beta},\beta_{0})}\mathbb{E}_{(s,\mathrm{d}_{\mathcal{S}},\overline{a})\sim\rho}\left[(F_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a})-\zeta_{F_{\beta}}(s,\mathrm{d}_{\mathcal{S}},\overline{a}))^{2}\right], (11)

where the operator ζ\zeta maps function FβF_{\beta} to

ζFβ(s,d𝒮,a¯)=𝔼s,d𝒮(|s,d𝒮,a¯),a¯π(|s,d𝒮)[τξ(s,d𝒮,a¯)+μFβ(s,d𝒮,a¯)].\displaystyle\zeta_{F_{\beta}}(s,\mathrm{d}_{\mathcal{S}},\overline{a})=\mathbb{E}_{s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime}\sim\mathbb{P}(\cdot|s,\mathrm{d}_{\mathcal{S}},\overline{a}),\overline{a}^{\prime}\sim\pi(\cdot|s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime})}\left[\tau\xi(s,\mathrm{d}_{\mathcal{S}},\overline{a})+\mu F_{\beta}(s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime})\right]. (12)

Formulation (11) includes policy improvement and policy evaluation as special cases:

Policy Improvement. This corresponds to ρ=σ~k\rho=\widetilde{\sigma}_{k}, \xi(s,\mathrm{d}_{\mathcal{S}},\overline{a})=\tau_{k+1}(\upsilon_{k}^{-1}F^{Q}_{\theta_{k}}(s,\mathrm{d}_{\mathcal{S}},\overline{a})+\tau_{k}^{-1}F^{A}_{\alpha_{k}}(s,\mathrm{d}_{\mathcal{S}},\overline{a})), and \tau=1,\mu=0.

Policy Evaluation. This corresponds to ρ=σk\rho=\sigma_{k}, τ=1−γ\tau=1-\gamma, μ=γ\mu=\gamma, ξ(s,d𝒮,a¯)=r(s,d𝒮,a¯)\xi(s,\mathrm{d}_{\mathcal{S}},\overline{a})=r(s,\mathrm{d}_{\mathcal{S}},\overline{a}). Note that ζFθ(s,d𝒮,a¯)\zeta_{F_{\theta}}(s,\mathrm{d}_{\mathcal{S}},\overline{a}) equals the Bellman operator applied to FθF_{\theta}: ζFθ(s,d𝒮,a¯)=𝒯πFθ(s,d𝒮,a¯)\zeta_{F_{\theta}}(s,\mathrm{d}_{\mathcal{S}},\overline{a})={\mathcal{T}}^{\pi}F_{\theta}(s,\mathrm{d}_{\mathcal{S}},\overline{a}).

To solve problem (11), we consider the following generic update rule:

\displaystyle\beta(t+1/2)=\beta(t)-\eta\left(F_{\beta(t)}(s,\mathrm{d}_{\mathcal{S}},\overline{a})-\tau\xi(s,\mathrm{d}_{\mathcal{S}},\overline{a})-\mu F_{\beta(t)}(s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime})\right)\nabla_{\beta}F_{\beta(t)}(s,\mathrm{d}_{\mathcal{S}},\overline{a}), (13)
\displaystyle\beta(t+1)=\Pi_{\mathbb{B}(R_{\beta},\beta_{0})}(\beta(t+1/2)), (14)

where (s,\mathrm{d}_{\mathcal{S}},\overline{a})\sim\rho, (s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime})\sim\mathbb{P}(\cdot|s,\mathrm{d}_{\mathcal{S}},\overline{a}), \overline{a}^{\prime}\sim\pi(\cdot|s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime}), and \eta>0 is the step size.

Instead of knowing d𝒮\mathrm{d}_{\mathcal{S}} exactly, we only have access to it through NN independent samples, represented as the states of the NN agents. Hence, we perform the following update:

\displaystyle\beta(t+1/2)=\beta(t)-\eta\left(F_{\beta(t)}(s,\mathbf{s},\overline{a})-\tau\xi(s,\mathrm{d}_{\mathcal{S}},\overline{a})-\mu F_{\beta(t)}(s^{\prime},\mathbf{s}^{\prime},\overline{a}^{\prime})\right)\nabla_{\beta}F_{\beta(t)}(s,\mathbf{s},\overline{a}), (15)
\displaystyle\beta(t+1)=\Pi_{\mathbb{B}(R_{\beta},\beta_{0})}(\beta(t+1/2)), (16)

where (s,d𝒮,a¯)ρ,(s,d𝒮)(|s,d𝒮,a¯),a¯π(|s,d𝒮)(s,\mathrm{d}_{\mathcal{S}},\overline{a})\sim\rho,(s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime})\sim\mathbb{P}(\cdot|s,\mathrm{d}_{\mathcal{S}},\overline{a}),\overline{a}^{\prime}\sim\pi(\cdot|s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime}); and 𝐬i.i.d.d𝒮,𝐬i.i.d.d𝒮\mathbf{s}\overset{\mathrm{i.i.d.}}{\sim}\mathrm{d}_{\mathcal{S}},\mathbf{s}^{\prime}\overset{\mathrm{i.i.d.}}{\sim}\mathrm{d}_{\mathcal{S}}^{\prime}.

Policy Improvement Updates. For ρ=σ~k\rho=\widetilde{\sigma}_{k}, τ=1,μ=0,Rβ=Rα\tau=1,\mu=0,R_{\beta}=R_{\alpha} and \xi(s,\mathrm{d}_{\mathcal{S}},\overline{a})=\tau_{k+1}(\upsilon_{k}^{-1}F^{Q}_{\theta_{k}}(s,\mathrm{d}_{\mathcal{S}},\overline{a})+\tau_{k}^{-1}F^{A}_{\alpha_{k}}(s,\mathrm{d}_{\mathcal{S}},\overline{a})), we recover the policy improvement update in Algorithm 3.

Policy Evaluation Updates. For ρ=σk\rho=\sigma_{k}, τ=(1−γ),μ=γ,ξ(s,d𝒮,a¯)=r(s,d𝒮,a¯),Rβ=Rθ\tau=(1-\gamma),\mu=\gamma,\xi(s,\mathrm{d}_{\mathcal{S}},\overline{a})=r(s,\mathrm{d}_{\mathcal{S}},\overline{a}),R_{\beta}=R_{\theta}, we recover the policy evaluation update in Algorithm 2.

We define a few more pieces of notation before we proceed:

(residual): δβ(t)(s,d𝒮,a¯,s,d𝒮,a¯)=Fβ(t)(s,d𝒮,a¯)τξ(s,d𝒮,a¯)μFβ(t)(s,d𝒮,a¯),\displaystyle~{}~{}~{}\delta_{\beta(t)}(s,\mathrm{d}_{\mathcal{S}},\overline{a},s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime})=F_{\beta(t)}(s,\mathrm{d}_{\mathcal{S}},\overline{a})-\tau\xi(s,\mathrm{d}_{\mathcal{S}},\overline{a})-\mu F_{\beta(t)}(s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime}),
(stochastic semi-gradient): \displaystyle~{}~{}~{}g_{\beta(t)}(s,\mathrm{d}_{\mathcal{S}},\overline{a},s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime})=\delta_{\beta(t)}(s,\mathrm{d}_{\mathcal{S}},\overline{a},s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime})\nabla_{\beta}F_{\beta(t)}(s,\mathrm{d}_{\mathcal{S}},\overline{a}),
(population semi-gradient): \displaystyle~{}~{}~{}\overline{g}_{\beta(t)}=\mathbb{E}_{s,\mathrm{d}_{\mathcal{S}},\overline{a},s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime}}\,g_{\beta(t)}(s,\mathrm{d}_{\mathcal{S}},\overline{a},s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime}),

where (s,d𝒮,a¯)ρ,(s,d𝒮)(|s,d𝒮,a¯),a¯πk(|s,d𝒮)(s,\mathrm{d}_{\mathcal{S}},\overline{a})\sim\rho,(s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime})\sim\mathbb{P}(\cdot|s,\mathrm{d}_{\mathcal{S}},\overline{a}),\overline{a}^{\prime}\sim\pi_{k}(\cdot|s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime}).

We also define the associated finite sample approximation:

\displaystyle\widehat{\delta}_{\beta(t)}(s,\mathbf{s},\overline{a},s^{\prime},\mathbf{s}^{\prime},\overline{a}^{\prime})=F_{\beta(t)}(s,\mathbf{s},\overline{a})-\tau\xi(s,\mathrm{d}_{\mathcal{S}},\overline{a})-\mu F_{\beta(t)}(s^{\prime},\mathbf{s}^{\prime},\overline{a}^{\prime}),
\displaystyle\widehat{g}_{\beta(t)}(s,\mathbf{s},\overline{a},s^{\prime},\mathbf{s}^{\prime},\overline{a}^{\prime})=\widehat{\delta}_{\beta(t)}(s,\mathbf{s},\overline{a},s^{\prime},\mathbf{s}^{\prime},\overline{a}^{\prime})\nabla_{\beta}F_{\beta(t)}(s,\mathbf{s},\overline{a}),

where \mathbf{s}\stackrel{\mathrm{i.i.d.}}{\sim}\mathrm{d}_{\mathcal{S}} and \mathbf{s}^{\prime}\stackrel{\mathrm{i.i.d.}}{\sim}\mathrm{d}_{\mathcal{S}}^{\prime}. Note that \widehat{g} is not an unbiased estimator of g, since \mathbb{E}_{\mathbf{s},\mathbf{s}^{\prime}}\widehat{g}(s,\mathbf{s},\overline{a},s^{\prime},\mathbf{s}^{\prime},\overline{a}^{\prime})\neq g(s,\mathrm{d}_{\mathcal{S}},\overline{a},s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime}).

E.1 Linearization at Initialization

For a two layer ReLU network defined as fβ,u(s,x,a)=1mj=1mujσ(βj(s,x,a))f_{\beta,u}(s,x,a)=\frac{1}{\sqrt{m}}\sum_{j=1}^{m}u_{j}\sigma(\beta_{j}^{\top}(s,x,a)), where σ(x)=max{0,x}\sigma(x)=\max\{0,x\}, we define its linearization at the initialization

fβ0(s,x,a)=1mj=1muj𝟙{βj(0)(s,x,a)>0}βj(s,x,a).\displaystyle f^{0}_{\beta}(s,x,a)=\frac{1}{\sqrt{m}}\sum_{j=1}^{m}u_{j}\mathbbm{1}\{\beta_{j}(0)^{\top}(s,x,a)>0\}\beta_{j}^{\top}(s,x,a).

Note that fβ0(s,x,a¯)=βfβ(0)(s,x,a),βf^{0}_{\beta}(s,x,\overline{a})=\left\langle\nabla_{\beta}f_{\beta(0)}(s,x,a),\beta\right\rangle, which equivalently means fβ0f^{0}_{\beta} is the local linearization of fβf_{\beta} at its initialization β(0)\beta(0) (note we do not train the second layer uu).

We define

Fβ0(s,𝐬,a¯)=1Ni=1Nfβ0(s,si,a¯),Fβ0(s,d𝒮,a¯)=𝔼xd𝒮fβ0(s,x,a¯),\displaystyle F^{0}_{\beta}(s,\mathbf{s},\overline{a})=\frac{1}{N}\sum_{i=1}^{N}f^{0}_{\beta}(s,s_{i},\overline{a}),~{}~{}~{}F^{0}_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a})=\mathbb{E}_{x\sim\mathrm{d}_{\mathcal{S}}}f^{0}_{\beta}(s,x,\overline{a}),

and similarly define δ0,g0,g¯0,δ^0\delta^{0},g^{0},\overline{g}^{0},\widehat{\delta}^{0}, and g^0\widehat{g}^{0}, with everything related to F_{\beta} and f_{\beta} replaced by F^{0}_{\beta} and f^{0}_{\beta}.
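A brief numpy sketch of the two quantities just defined, for a single (s, 𝐬, a̅), is given below; the network sizes and inputs are arbitrary illustrations, and the rows of Z are stand-ins for the concatenated features (s, s_i, a̅). It evaluates the two-layer ReLU network f_β, its linearization f^0_β at the initialization β(0), and their empirical mean-field averages F_β(s,𝐬,a̅) and F^0_β(s,𝐬,a̅) over the N agent states.

import numpy as np

rng = np.random.default_rng(1)
m, d, N = 256, 12, 30                           # hidden units, feature dimension, number of agents
beta0 = rng.normal(size=(m, d)) / np.sqrt(d)    # beta_j(0) ~ N(0, I_d / d)
u = rng.choice([-1.0, 1.0], size=m)             # u_j ~ Unif{-1, +1}, kept fixed (not trained)
beta = beta0 + 0.05 * rng.normal(size=(m, d))   # a perturbed parameter inside a small ball around beta(0)

def f(b, z):                                    # f_beta(z) = (1/sqrt(m)) sum_j u_j relu(b_j . z)
    return (u * np.maximum(b @ z, 0.0)).sum() / np.sqrt(m)

def f_lin(b, z):                                # f^0_beta(z): ReLU activation pattern frozen at beta(0)
    return (u * (beta0 @ z > 0) * (b @ z)).sum() / np.sqrt(m)

Z = rng.normal(size=(N, d))
Z /= np.linalg.norm(Z, axis=1, keepdims=True)   # enforce ||(s, x, a_bar)||_2 <= 1
F_emp = np.mean([f(beta, z) for z in Z])        # F_beta(s, s_vec, a_bar)
F0_emp = np.mean([f_lin(beta, z) for z in Z])   # F^0_beta(s, s_vec, a_bar)
print(abs(F_emp - F0_emp))                      # small when beta stays close to beta(0)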

Our main objective is to show that the objective in (11) is well approximated when everything related to FF is replaced by its local linearization F0F^{0}, and that the linearized stochastic semi-gradient g^0\widehat{g}^{0} remains close to the population semi-gradient g¯\overline{g} when the number of hidden units mm and the number of agents NN are sufficiently large. We can then conclude that learning with the stochastic semi-gradient g^\widehat{g} suffices to solve (11) to high accuracy in this regime. For ease of exposition, we also make the following assumptions.

Assumption 3.

We assume (s,x,a¯)21\|(s,x,\overline{a})\|_{2}\leq 1 for all (s,x,a¯)𝒮×𝒮×𝒜(s,x,\overline{a})\in{\mathcal{S}}\times{\mathcal{S}}\times\mathcal{A}.

Assumption 4.

The function ξ(s,d𝒮,a¯)\xi(s,\mathrm{d}_{\mathcal{S}},\overline{a}) satisfies: for each (s,d𝒮)𝒮¯(s,\mathrm{d}_{\mathcal{S}})\in\overline{{\mathcal{S}}} and any action aa, we have (ξ(s,d𝒮,a¯))2𝔼xd𝒮τ1(fβ(0)(s,x,a¯))2+τ2Rβ2+τ3(\xi(s,\mathrm{d}_{\mathcal{S}},\overline{a}))^{2}\leq\mathbb{E}_{x\sim\mathrm{d}_{\mathcal{S}}}\tau_{1}(f_{\beta(0)}(s,x,\overline{a}))^{2}+\tau_{2}R_{\beta}^{2}+\tau_{3}.

Remark: We will verify that Assumption 4 holds with (τ1,τ2,τ3)=(0,0,r¯2)(\tau_{1},\tau_{2},\tau_{3})=(0,0,\overline{r}^{2}) for the policy evaluation step, and with (τ1,τ2,τ3)=(4,4,0)(\tau_{1},\tau_{2},\tau_{3})=(4,4,0) for the policy improvement step.

We first bound the difference between Fβ0(s,d𝒮,a¯)F^{0}_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a}) and Fβ(s,d𝒮,a¯)F_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a}).

Lemma E.1.

Under Assumptions 2 and 3, we have

𝔼init,ρ|Fβ0(s,d𝒮,a¯)Fβ(s,d𝒮,a¯)|2𝒪(Rβ3m1/2).\displaystyle\mathbb{E}_{\mathrm{init},\rho}|F^{0}_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a})-F_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a})|^{2}\leq\mathcal{O}(R_{\beta}^{3}m^{-1/2}).
Proof.

By the definition of Fβ0(s,d𝒮,a¯)F^{0}_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a}) and Fβ(s,d𝒮,a¯)F_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a}), and Jensen’s inequality, we have

𝔼init,s,d𝒮,a¯|Fβ0(s,d𝒮,a¯)Fβ(s,d𝒮,a¯)|\displaystyle\mathbb{E}_{\mathrm{init},s,\mathrm{d}_{\mathcal{S}},\overline{a}}|F^{0}_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a})-F_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a})| 𝔼init,s,d𝒮,a¯,x|fβ0(s,x,a¯)fβ(s,x,a¯)|.\displaystyle\leq\mathbb{E}_{\mathrm{init},s,\mathrm{d}_{\mathcal{S}},\overline{a},x}|f^{0}_{\beta}(s,x,\overline{a})-f_{\beta}(s,x,\overline{a})|.

We can further bound the right hand side with

𝔼init,s,d𝒮,a¯,x|fβ0(s,x,a¯)fβ(s,x,a¯)|\displaystyle\mathbb{E}_{\mathrm{init},s,\mathrm{d}_{\mathcal{S}},\overline{a},x}|f^{0}_{\beta}(s,x,\overline{a})-f_{\beta}(s,x,\overline{a})|
\displaystyle=\mathbb{E}_{\mathrm{init},s,\mathrm{d}_{\mathcal{S}},\overline{a},x}\frac{1}{\sqrt{m}}\sum_{j=1}^{m}\lvert\mathbbm{1}\{\beta_{j}(0)^{\top}(s,x,\overline{a})>0\}-\mathbbm{1}\{\beta_{j}^{\top}(s,x,\overline{a})>0\}\rvert\,\lvert\beta_{j}^{\top}(s,x,\overline{a})\rvert
\displaystyle\leq 𝔼init,s,d𝒮,a¯,x1mj=1m|𝟙{βj(0)(s,x,a¯)>0}𝟙{βj(s,x,a¯)>0}|βjβj(0)2,\displaystyle\mathbb{E}_{\mathrm{init},s,\mathrm{d}_{\mathcal{S}},\overline{a},x}\frac{1}{\sqrt{m}}\sum_{j=1}^{m}\lvert\mathbbm{1}\{\beta_{j}(0)^{\top}(s,x,\overline{a})>0\}-\mathbbm{1}\{\beta_{j}^{\top}(s,x,\overline{a})>0\}\rvert\|\beta_{j}-\beta_{j}(0)\|_{2},

where the last inequality comes from the fact that \mathbbm{1}\{\beta_{j}(0)^{\top}(s,x,\overline{a})>0\}\neq\mathbbm{1}\{\beta_{j}^{\top}(s,x,\overline{a})>0\} implies \lvert\beta_{j}^{\top}(s,x,\overline{a})\rvert\leq\lvert(\beta_{j}-\beta_{j}(0))^{\top}(s,x,\overline{a})\rvert\leq\|\beta_{j}-\beta_{j}(0)\|_{2}\|(s,x,\overline{a})\|_{2}\leq\|\beta_{j}-\beta_{j}(0)\|_{2}. Applying the Cauchy–Schwarz inequality, we have

𝔼init,s,d𝒮,a|Fβ0(s,d𝒮,a¯)Fβ(s,d𝒮,a¯)|2\displaystyle\mathbb{E}_{\mathrm{init},s,\mathrm{d}_{\mathcal{S}},a}|F^{0}_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a})-F_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a})|^{2}
\displaystyle\leq\mathbb{E}_{\mathrm{init},s,\mathrm{d}_{\mathcal{S}},\overline{a},x}\frac{1}{m}\left(\sum_{j=1}^{m}\lvert\mathbbm{1}\{\beta_{j}(0)^{\top}(s,x,\overline{a})>0\}-\mathbbm{1}\{\beta_{j}^{\top}(s,x,\overline{a})>0\}\rvert\right)\left(\sum_{j=1}^{m}\|\beta_{j}-\beta_{j}(0)\|_{2}^{2}\right)
\displaystyle\leq 𝔼init,s,d𝒮,a¯,xRβ2mj=1m𝟙{|βj(0)(s,x,a¯)|<βj(0)βj2}\displaystyle\mathbb{E}_{\mathrm{init},s,\mathrm{d}_{\mathcal{S}},\overline{a},x}\frac{R_{\beta}^{2}}{m}\sum_{j=1}^{m}\mathbbm{1}\{\lvert\beta_{j}(0)^{\top}(s,x,\overline{a})\rvert<\|\beta_{j}(0)-\beta_{j}\|_{2}\}
\displaystyle\leq 𝔼initcRβ2mj=1mβj(0)βj2βj(0)2\displaystyle\mathbb{E}_{\mathrm{init}}\frac{cR_{\beta}^{2}}{m}\sum_{j=1}^{m}\frac{\|\beta_{j}(0)-\beta_{j}\|_{2}}{\|\beta_{j}(0)\|_{2}}
\displaystyle\leq 𝔼initcRβ2m(j=1mβj(0)βj22)1/2(j=1mβj(0)22)1/2=𝒪(cRβ3m1/2),\displaystyle\mathbb{E}_{\mathrm{init}}\frac{cR_{\beta}^{2}}{m}\left(\sum_{j=1}^{m}\|\beta_{j}(0)-\beta_{j}\|_{2}^{2}\right)^{1/2}\left(\sum_{j=1}^{m}\|\beta_{j}(0)\|_{2}^{-2}\right)^{1/2}=\mathcal{O}\left(\frac{cR_{\beta}^{3}}{m^{1/2}}\right),

where in the third inequality we use Assumption 2.

Next, we bound the difference between g¯\overline{g} and g¯0\overline{g}^{0}.

Lemma E.2.

Under Assumptions 2, 3 and 4, we have

𝔼initg¯βg¯β022𝒪(Rβ3m1/2).\displaystyle\mathbb{E}_{\mathrm{init}}\|\overline{g}_{\beta}-\overline{g}^{0}_{\beta}\|_{2}^{2}\leq\mathcal{O}\left(\frac{R_{\beta}^{3}}{m^{1/2}}\right).
Proof.

With decomposition

\displaystyle g_{\beta}-g^{0}_{\beta}=\left(F_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a})-F^{0}_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a})-\mu\bigl(F_{\beta}(s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime})-F^{0}_{\beta}(s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime})\bigr)\right)\nabla_{\beta}F_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a})
\displaystyle\quad+\left(F^{0}_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a})-\tau\xi(s,\mathrm{d}_{\mathcal{S}},\overline{a})-\mu F^{0}_{\beta}(s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime})\right)\left(\nabla_{\beta}F_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a})-\nabla_{\beta}F^{0}_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a})\right),

we apply basic inequality a+b222(a22+b22)\|a+b\|_{2}^{2}\leq 2(\|a\|_{2}^{2}+\|b\|_{2}^{2}) and obtain

\displaystyle\|\overline{g}_{\beta}-\overline{g}^{0}_{\beta}\|_{2}^{2}\leq 2\left(\mathbb{E}_{s,\mathrm{d}_{\mathcal{S}},\overline{a},s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime}}\lvert F_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a})-F^{0}_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a})-\mu(F_{\beta}(s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime})-F^{0}_{\beta}(s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime}))\rvert\,\|\nabla_{\beta}F_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a})\|_{2}\right)^{2}
\displaystyle\quad+2\left(\mathbb{E}_{s,\mathrm{d}_{\mathcal{S}},\overline{a},s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime}}\lvert F^{0}_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a})-\tau\xi(s,\mathrm{d}_{\mathcal{S}},\overline{a})-\mu F^{0}_{\beta}(s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime})\rvert\,\|\nabla_{\beta}F_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a})-\nabla_{\beta}F^{0}_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a})\|_{2}\right)^{2}. (17)

We have that both \nabla_{\beta}F_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a}) and \nabla_{\beta}F^{0}_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a}) take the form \mathbb{E}_{x\sim\mathrm{d}_{\mathcal{S}}}\frac{1}{\sqrt{m}}\left(\mathbbm{1}\{\cdot\}(s,x,\overline{a}),\ldots,\mathbbm{1}\{\cdot\}(s,x,\overline{a})\right), with the indicators evaluated at \beta_{j} and \beta_{j}(0), respectively. With the assumption that \|(s,x,\overline{a})\|_{2}\leq 1, this gives \|\nabla_{\beta}F_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a})\|_{2}\leq 1. In addition, we have

\displaystyle\mathbb{E}_{\mathrm{init},s,\mathrm{d}_{\mathcal{S}},\overline{a},s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime}}\lvert F_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a})-F^{0}_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a})-\mu(F_{\beta}(s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime})-F^{0}_{\beta}(s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime}))\rvert^{2}=\mathcal{O}(R_{\beta}^{3}m^{-1/2}),

as implied by Lemma E.1. Hence the first term in (17) is of order 𝒪(Rβ3m−1/2)\mathcal{O}(R_{\beta}^{3}m^{-1/2}).

To bound the second term in (17), by the Cauchy–Schwarz inequality we have

\displaystyle\left(\mathbb{E}_{s,\mathrm{d}_{\mathcal{S}},\overline{a},s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime}}\lvert F^{0}_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a})-\tau\xi(s,\mathrm{d}_{\mathcal{S}},\overline{a})-\mu F^{0}_{\beta}(s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime})\rvert\,\|\nabla_{\beta}F_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a})-\nabla_{\beta}F^{0}_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a})\|_{2}\right)^{2}
\displaystyle\leq\mathbb{E}_{s,\mathrm{d}_{\mathcal{S}},\overline{a},s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime}}\lvert F^{0}_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a})-\tau\xi(s,\mathrm{d}_{\mathcal{S}},\overline{a})-\mu F^{0}_{\beta}(s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime})\rvert^{2}\;\mathbb{E}_{s,\mathrm{d}_{\mathcal{S}},\overline{a}}\|\nabla_{\beta}F_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a})-\nabla_{\beta}F^{0}_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a})\|_{2}^{2}.

From Jensen’s inequality we have

\displaystyle\lvert F^{0}_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a})-\tau\xi(s,\mathrm{d}_{\mathcal{S}},\overline{a})-\mu F^{0}_{\beta}(s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime})\rvert^{2}\leq\mathbb{E}\lvert f^{0}_{\beta}(s,x,\overline{a})-\tau\xi(s,\mathrm{d}_{\mathcal{S}},\overline{a})-\mu f^{0}_{\beta}(s^{\prime},x^{\prime},\overline{a}^{\prime})\rvert^{2}
3𝔼[(fβ0(s,x,a¯))2+(τξ(s,d𝒮,a¯))2+(μfβ0(s,x,a¯))2].\displaystyle\leq 3\mathbb{E}\left[(f^{0}_{\beta}(s,x,\overline{a}))^{2}+(\tau\xi(s,\mathrm{d}_{\mathcal{S}},\overline{a}))^{2}+(\mu f^{0}_{\beta}(s^{\prime},x^{\prime},\overline{a}^{\prime}))^{2}\right].

Since \nabla f^{0}_{\beta}=\nabla f^{0}_{\beta(0)} for all β\beta, we have

fβ0(s,x,a¯)fβ(0)0(s,x,a¯)+βfβ02ββ(0)2fβ(0)0(s,x,a¯)+Rβ.\displaystyle f^{0}_{\beta}(s,x,\overline{a})\leq f^{0}_{\beta(0)}(s,x,\overline{a})+\|\nabla_{\beta}f^{0}_{\beta}\|_{2}\|\beta-\beta(0)\|_{2}\leq f^{0}_{\beta(0)}(s,x,\overline{a})+R_{\beta}.

We have 3\left(f^{0}_{\beta}(s,x,\overline{a})^{2}+(\mu f^{0}_{\beta}(s^{\prime},x^{\prime},\overline{a}^{\prime}))^{2}\right)\leq 6\left(f_{\beta(0)}(s,x,\overline{a})^{2}+(\mu f_{\beta(0)}(s^{\prime},x^{\prime},\overline{a}^{\prime}))^{2}\right)+12R_{\beta}^{2}. Note that from Assumption 4, \xi(s,\mathrm{d}_{\mathcal{S}},\overline{a})^{2}\leq\tau_{1}\mathbb{E}_{x\sim\mathrm{d}_{\mathcal{S}}}(f_{\beta(0)}(s,x,\overline{a}))^{2}+\tau_{2}R_{\beta}^{2}+\tau_{3}. Hence

\displaystyle\mathbb{E}_{s,\mathrm{d}_{\mathcal{S}},\overline{a},s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime}}\lvert F^{0}_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a})-\tau\xi(s,\mathrm{d}_{\mathcal{S}},\overline{a})-\mu F^{0}_{\beta}(s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime})\rvert^{2}
\displaystyle\leq\mathbb{E}_{s,\mathrm{d}_{\mathcal{S}},\overline{a},s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime}}\mathbb{E}_{x,x^{\prime}}\left[6f_{\beta(0)}(s,x,\overline{a})^{2}+6\mu^{2}f_{\beta(0)}(s^{\prime},x^{\prime},\overline{a}^{\prime})^{2}+12R_{\beta}^{2}+3\tau^{2}\xi(s,\mathrm{d}_{\mathcal{S}},\overline{a})^{2}\right]
\displaystyle=\mathbb{E}_{s,\mathrm{d}_{\mathcal{S}},x,\overline{a}}\left[6f_{\beta(0)}(s,x,\overline{a})^{2}+6\mu^{2}f_{\beta(0)}(s,x,\overline{a})^{2}+12R_{\beta}^{2}+3\tau^{2}\xi(s,\mathrm{d}_{\mathcal{S}},\overline{a})^{2}\right]
\displaystyle\leq\mathbb{E}_{s,\mathrm{d}_{\mathcal{S}},x,\overline{a}}\left[6f_{\beta(0)}(s,x,\overline{a})^{2}+6\mu^{2}f_{\beta(0)}(s,x,\overline{a})^{2}+12R_{\beta}^{2}+3\tau^{2}\tau_{1}f_{\beta(0)}(s,x,\overline{a})^{2}+3\tau^{2}\tau_{2}R_{\beta}^{2}+3\tau^{2}\tau_{3}\right]. (18)

On the other hand, we have

\displaystyle\mathbb{E}_{s,\mathrm{d}_{\mathcal{S}},\overline{a}}\|\nabla_{\beta}F_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a})-\nabla_{\beta}F^{0}_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a})\|_{2}^{2}
\displaystyle\leq 1m𝔼s,d𝒮,a,x(𝟙{β1(0)(s,x,a¯)>0}𝟙{β1(s,x,a¯)>0},,𝟙{βm(0)(s,x,a¯)>0}𝟙{βm(s,x,a¯)>0})22(s,x,a¯)22\displaystyle\frac{1}{m}\mathbb{E}_{s,\mathrm{d}_{\mathcal{S}},a,x}\|\left(\mathbbm{1}\{\beta_{1}(0)^{\top}(s,x,\overline{a})>0\}-\mathbbm{1}\{\beta_{1}^{\top}(s,x,\overline{a})>0\},\ldots,\mathbbm{1}\{\beta_{m}(0)^{\top}(s,x,\overline{a})>0\}-\mathbbm{1}\{\beta_{m}^{\top}(s,x,\overline{a})>0\}\right)\|_{2}^{2}\|(s,x,\overline{a})\|_{2}^{2}
\displaystyle\leq 𝔼s,d𝒮,a¯,x1mj=1m(𝟙{βj(0)(s,x,a¯)>0}𝟙{βj(s,x,a¯)>0})2\displaystyle\mathbb{E}_{s,\mathrm{d}_{\mathcal{S}},\overline{a},x}\frac{1}{m}\sum_{j=1}^{m}(\mathbbm{1}\{\beta_{j}(0)^{\top}(s,x,\overline{a})>0\}-\mathbbm{1}\{\beta_{j}^{\top}(s,x,\overline{a})>0\})^{2}
\displaystyle\leq\mathbb{E}_{s,\mathrm{d}_{\mathcal{S}},\overline{a},x}\frac{1}{m}\sum_{j=1}^{m}\mathbbm{1}\{\lvert\beta_{j}(0)^{\top}(s,x,\overline{a})\rvert\leq\|\beta_{j}(0)-\beta_{j}\|_{2}\}
\displaystyle\leq cmj=1mβj(0)βj2βj(0)2\displaystyle\frac{c}{m}\sum_{j=1}^{m}\frac{\|\beta_{j}(0)-\beta_{j}\|_{2}}{\|\beta_{j}(0)\|_{2}}
\displaystyle\leq\frac{c}{m}\left(\sum_{j=1}^{m}\|\beta_{j}(0)-\beta_{j}\|_{2}^{2}\right)^{1/2}\left(\sum_{j=1}^{m}\|\beta_{j}(0)\|_{2}^{-2}\right)^{1/2}
\displaystyle\leq cRβm(j=1mβj(0)22)1/2,\displaystyle\frac{cR_{\beta}}{m}\left(\sum_{j=1}^{m}\|\beta_{j}(0)\|_{2}^{-2}\right)^{1/2}, (19)

where in the fourth inequality we use Assumption 1, and in the final inequality we use \left(\sum_{j=1}^{m}\|\beta_{j}(0)-\beta_{j}\|_{2}^{2}\right)^{1/2}\leq R_{\beta}.

Combining (18) and (19), to bound the second term in (17), it remains to bound the following:

𝔼s,d𝒮,a¯,x{fβ(0)(s,x,a¯)2(j=1mβj(0)22)1/2(cRβm)}\displaystyle\mathbb{E}_{s,\mathrm{d}_{\mathcal{S}},\overline{a},x}\left\{f_{\beta(0)}(s,x,\overline{a})^{2}\left(\sum_{j=1}^{m}\|\beta_{j}(0)\|_{2}^{-2}\right)^{1/2}\left(\frac{cR_{\beta}}{m}\right)\right\}
=\displaystyle= cRβm2𝔼s,d𝒮,a¯,x(j=1mσ2(βj(0)(s,x,a¯))+klmukulσ(βk(0)(s,x,a¯))σ(βl(0)(s,x,a¯)))(j=1mβj(0)22)1/2\displaystyle\frac{cR_{\beta}}{m^{2}}\mathbb{E}_{s,\mathrm{d}_{\mathcal{S}},\overline{a},x}\left(\sum_{j=1}^{m}\sigma^{2}(\beta_{j}(0)^{\top}(s,x,\overline{a}))+\sum_{k\neq l}^{m}u_{k}u_{l}\sigma(\beta_{k}(0)^{\top}(s,x,\overline{a}))\sigma(\beta_{l}(0)^{\top}(s,x,\overline{a}))\right)\left(\sum_{j=1}^{m}\|\beta_{j}(0)\|_{2}^{-2}\right)^{1/2}
\displaystyle\leq cRβm2𝔼s,d𝒮,a¯,x(j=1mβj(0)22+klmukulσ(βk(0)(s,x,a¯))σ(βl(0)(s,x,a¯)))(j=1mβj(0)22)1/2,\displaystyle\frac{cR_{\beta}}{m^{2}}\mathbb{E}_{s,\mathrm{d}_{\mathcal{S}},\overline{a},x}\left(\sum_{j=1}^{m}\|\beta_{j}(0)\|_{2}^{2}+\sum_{k\neq l}^{m}u_{k}u_{l}\sigma(\beta_{k}(0)^{\top}(s,x,\overline{a}))\sigma(\beta_{l}(0)^{\top}(s,x,\overline{a}))\right)\left(\sum_{j=1}^{m}\|\beta_{j}(0)\|_{2}^{-2}\right)^{1/2},

Taking expectation with respect to initialization and noticing that 𝔼init{ukul}=0\mathbb{E}_{\mathrm{init}}\left\{u_{k}u_{l}\right\}=0 for klk\neq l, we have

cRβm2𝔼init,s,d𝒮,a¯,x(j=1mβj(0)22+klmukulσ(βk(0)(s,x,a¯))σ(βl(0)(s,x,a¯)))(j=1mβj(0)22)1/2\displaystyle\frac{cR_{\beta}}{m^{2}}\mathbb{E}_{\mathrm{init},s,\mathrm{d}_{\mathcal{S}},\overline{a},x}\left(\sum_{j=1}^{m}\|\beta_{j}(0)\|_{2}^{2}+\sum_{k\neq l}^{m}u_{k}u_{l}\sigma(\beta_{k}(0)^{\top}(s,x,\overline{a}))\sigma(\beta_{l}(0)^{\top}(s,x,\overline{a}))\right)\left(\sum_{j=1}^{m}\|\beta_{j}(0)\|_{2}^{-2}\right)^{1/2}
=\displaystyle= cRβm2𝔼init(j=1mβj(0)22)(j=1mβj(0)22)1/2\displaystyle\frac{cR_{\beta}}{m^{2}}\mathbb{E}_{\mathrm{init}}\left(\sum_{j=1}^{m}\|\beta_{j}(0)\|_{2}^{2}\right)\left(\sum_{j=1}^{m}\|\beta_{j}(0)\|_{2}^{-2}\right)^{1/2}
\displaystyle\leq cRβm2𝔼init1/2(j=1mβj(0)22)2𝔼init1/2(j=1mβj(0)22)\displaystyle\frac{cR_{\beta}}{m^{2}}\mathbb{E}_{\mathrm{init}}^{1/2}\left(\sum_{j=1}^{m}\|\beta_{j}(0)\|_{2}^{2}\right)^{2}\mathbb{E}_{\mathrm{init}}^{1/2}\left(\sum_{j=1}^{m}\|\beta_{j}(0)\|_{2}^{-2}\right)
=\displaystyle= 𝒪(cRβm1/2).\displaystyle\mathcal{O}\left(\frac{cR_{\beta}}{m^{1/2}}\right).

Then the second term in (17) is of order 𝒪(max{Rβ3m−1/2,Rβm−1/2})=𝒪(Rβ3m−1/2)\mathcal{O}\left(\max\{\frac{R_{\beta}^{3}}{m^{1/2}},\frac{R_{\beta}}{m^{1/2}}\}\right)=\mathcal{O}\left(\frac{R_{\beta}^{3}}{m^{1/2}}\right), which completes the proof. ∎

We then bound the variance of the stochastic semi-gradient gβ(s,d𝒮,a¯,s′,d𝒮′,a¯′)g_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a},s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime}).

Lemma E.3.

With Assumption 4, there exists ς2=𝒪(Rβ2)\varsigma^{2}=\mathcal{O}(R_{\beta}^{2}), such that:

𝔼init,s,d𝒮,a¯,s,d𝒮,a¯gβ(s,d𝒮,a¯,s,d𝒮,a¯)g¯β22ς2.\displaystyle\mathbb{E}_{\mathrm{init},s,\mathrm{d}_{\mathcal{S}},\overline{a},s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime}}\|g_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a},s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime})-\overline{g}_{\beta}\|_{2}^{2}\leq\varsigma^{2}.
Proof.

We have

𝔼s,d𝒮,a¯,s,d𝒮,a¯gβ(s,d𝒮,a¯,s,d𝒮,a¯)g¯β22\displaystyle\mathbb{E}_{s,\mathrm{d}_{\mathcal{S}},\overline{a},s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime}}\|g_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a},s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime})-\overline{g}_{\beta}\|_{2}^{2} 𝔼s,d𝒮,a¯,s,d𝒮,a¯gβ(s,d𝒮,a¯,s,d𝒮,a¯)22\displaystyle\leq\mathbb{E}_{s,\mathrm{d}_{\mathcal{S}},\overline{a},s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime}}\|g_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a},s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime})\|_{2}^{2}
\displaystyle\leq\mathbb{E}_{s,\mathrm{d}_{\mathcal{S}},\overline{a},s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime}}\|\delta_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a},s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime})\nabla_{\beta}F_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a})\|_{2}^{2}.

With \nabla_{\beta}F_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a})=\mathbb{E}_{x}\nabla_{\beta}f_{\beta}(s,x,\overline{a})=\mathbb{E}_{x}\frac{1}{\sqrt{m}}\left(\mathbbm{1}\{\beta_{1}^{\top}(s,x,\overline{a})>0\}(s,x,\overline{a}),\ldots,\mathbbm{1}\{\beta_{m}^{\top}(s,x,\overline{a})>0\}(s,x,\overline{a})\right), we have \|\nabla_{\beta}F_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a})\|_{2}\leq 1. Then

𝔼s,d𝒮,a¯,s,d𝒮,a¯gβ(s,d𝒮,a¯,s,d𝒮,a¯)g¯β22\displaystyle\mathbb{E}_{s,\mathrm{d}_{\mathcal{S}},\overline{a},s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime}}\|g_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a},s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime})-\overline{g}_{\beta}\|_{2}^{2}
\displaystyle\leq\mathbb{E}_{s,\mathrm{d}_{\mathcal{S}},\overline{a},s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime}}\left[\delta_{\beta}^{2}(s,\mathrm{d}_{\mathcal{S}},\overline{a},s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime})\right]
=\displaystyle= 𝔼s,d𝒮,a¯,s,d𝒮,a¯(Fβ(s,d𝒮,a¯)τξ(s,d𝒮,a¯)μFβ(s,d𝒮,a¯))2\displaystyle\mathbb{E}_{s,\mathrm{d}_{\mathcal{S}},\overline{a},s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime}}\left(F_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a})-\tau\xi(s,\mathrm{d}_{\mathcal{S}},\overline{a})-\mu F_{\beta}(s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime})\right)^{2}
\displaystyle\leq 𝔼s,d𝒮,a¯,s,d𝒮,a¯,x,x(fβ(s,x,a¯)τξ(s,d𝒮,a¯)μfβ(s,x,a¯))2\displaystyle\mathbb{E}_{s,\mathrm{d}_{\mathcal{S}},\overline{a},s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime},x,x^{\prime}}\left(f_{\beta}(s,x,\overline{a})-\tau\xi(s,\mathrm{d}_{\mathcal{S}},\overline{a})-\mu f_{\beta}(s^{\prime},x^{\prime},\overline{a}^{\prime})\right)^{2}
\displaystyle\leq 3𝔼s,d𝒮,a¯,x[(1+μ2)(fβ(s,x,a¯))2+(τξ(s,d𝒮,a¯))2]\displaystyle 3\mathbb{E}_{s,\mathrm{d}_{\mathcal{S}},\overline{a},x}\left[(1+\mu^{2})\left(f_{\beta}(s,x,\overline{a})\right)^{2}+\left(\tau\xi(s,\mathrm{d}_{\mathcal{S}},\overline{a})\right)^{2}\right]
\displaystyle\leq 3𝔼s,d𝒮,a¯,x[(1+μ2)(fβ(0)(s,x,a¯))2+τ1(τfβ(0)(s,x,a¯))2]+𝒪(Rβ2)+𝒪(τ3).\displaystyle 3\mathbb{E}_{s,\mathrm{d}_{\mathcal{S}},\overline{a},x}\left[(1+\mu^{2})\left(f_{\beta(0)}(s,x,\overline{a})\right)^{2}+\tau_{1}\left(\tau f_{\beta(0)}(s,x,\overline{a})\right)^{2}\right]+\mathcal{O}(R_{\beta}^{2})+\mathcal{O}(\tau_{3}).

where in the last inequality we use |fβ(0)(s,x,a¯)fβ(s,x,a¯)|ββ(0)2Rβ\lvert f_{\beta(0)}(s,x,\overline{a})-f_{\beta}(s,x,\overline{a})\rvert\leq\|\beta-\beta(0)\|_{2}\leq R_{\beta} and Assumption 4. Finally, note that

𝔼init,s,d𝒮,a¯,x[fβ(0)(s,x,a¯)2]\displaystyle\mathbb{E}_{\mathrm{init},s,\mathrm{d}_{\mathcal{S}},\overline{a},x}\left[f_{\beta(0)}(s,x,\overline{a})^{2}\right] =𝔼init,s,d𝒮,a¯,x1m(j=1mσ2(βj(0)(s,x,a¯))+klmukulσ(βk(0)(s,x,a¯))σ(βl(0)(s,x,a¯)))\displaystyle=\mathbb{E}_{\mathrm{init},s,\mathrm{d}_{\mathcal{S}},\overline{a},x}\frac{1}{m}\left(\sum_{j=1}^{m}\sigma^{2}(\beta_{j}(0)^{\top}(s,x,\overline{a}))+\sum_{k\neq l}^{m}u_{k}u_{l}\sigma(\beta_{k}(0)^{\top}(s,x,\overline{a}))\sigma(\beta_{l}(0)^{\top}(s,x,\overline{a}))\right)
=𝔼init,s,d𝒮,a¯,x1m(j=1mσ2(βj(0)(s,x,a¯)))\displaystyle=\mathbb{E}_{\mathrm{init},s,\mathrm{d}_{\mathcal{S}},\overline{a},x}\frac{1}{m}\left(\sum_{j=1}^{m}\sigma^{2}(\beta_{j}(0)^{\top}(s,x,\overline{a}))\right)
𝔼init1m(j=1mβj(0)22)=1.\displaystyle\leq\mathbb{E}_{\mathrm{init}}\frac{1}{m}\left(\sum_{j=1}^{m}\|\beta_{j}(0)\|_{2}^{2}\right)=1.

The claim follows immediately. ∎

We then bound the variance of g^(s,𝐬,a,s,𝐬,a¯)\widehat{g}(s,\mathbf{s},a,s^{\prime},\mathbf{s^{\prime}},\overline{a}^{\prime}) for fixed d𝒮,d𝒮\mathrm{d}_{\mathcal{S}},\mathrm{d}_{\mathcal{S}}^{\prime}.

Lemma E.4.

For any s,d𝒮,a¯,s′,d𝒮′,a¯′s,\mathrm{d}_{\mathcal{S}},\overline{a},s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime}, let \mathbf{s}\overset{\mathrm{i.i.d.}}{\sim}\mathrm{d}_{\mathcal{S}} and \mathbf{s}^{\prime}\overset{\mathrm{i.i.d.}}{\sim}\mathrm{d}_{\mathcal{S}}^{\prime} with |𝐬|=|𝐬′|=N|\mathbf{s}|=|\mathbf{s}^{\prime}|=N. Then we have

𝔼init,𝐬,𝐬g^β(s,𝐬,a¯,s,𝐬,a¯)gβ(s,d𝒮,a¯,s,d𝒮,a¯)22𝒪(Rβ2N).\displaystyle\mathbb{E}_{\mathrm{init},\mathbf{s},\mathbf{s}^{\prime}}\|\widehat{g}_{\beta}(s,\mathbf{s},\overline{a},s^{\prime},\mathbf{s^{\prime}},\overline{a}^{\prime})-g_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a},s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime})\|_{2}^{2}\leq\mathcal{O}\left(\frac{R_{\beta}^{2}}{N}\right).
Proof.

We have

g^β(s,𝐬,a¯,s,𝐬,a¯)gβ(s,d𝒮,a¯,s,d𝒮,a¯)\displaystyle\widehat{g}_{\beta}(s,\mathbf{s},\overline{a},s^{\prime},\mathbf{s^{\prime}},\overline{a}^{\prime})-g_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a},s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime})
\displaystyle=\widehat{\delta}_{\beta}(s,\mathbf{s},\overline{a},s^{\prime},\mathbf{s}^{\prime},\overline{a}^{\prime})\,\nabla_{\beta}\frac{1}{N}\sum_{x\in\mathbf{s}}f_{\beta}(s,x,\overline{a})-\delta_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a},s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime})\,\nabla_{\beta}\mathbb{E}_{x\sim\mathrm{d}_{\mathcal{S}}}f_{\beta}(s,x,\overline{a})
\displaystyle=\left(\widehat{\delta}_{\beta}(s,\mathbf{s},\overline{a},s^{\prime},\mathbf{s}^{\prime},\overline{a}^{\prime})-\delta_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a},s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime})\right)\nabla_{\beta}\frac{1}{N}\sum_{x\in\mathbf{s}}f_{\beta}(s,x,\overline{a})
\displaystyle\quad+\delta_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a},s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime})\left(\frac{1}{N}\sum_{x\in\mathbf{s}}\nabla_{\beta}f_{\beta}(s,x,\overline{a})-\nabla_{\beta}\mathbb{E}_{x\sim\mathrm{d}_{\mathcal{S}}}f_{\beta}(s,x,\overline{a})\right). (20)

For the first term, note that \Bigl{\lVert}\frac{1}{N}\sum_{x\in\mathbf{s}}\nabla_{\beta}f_{\beta}(s,x,\overline{a})\Bigr{\rVert}_{2}\leq 1; we then have

\displaystyle\mathbb{E}_{\mathbf{s},\mathbf{s}^{\prime}}\Bigl{\lVert}\left(\widehat{\delta}_{\beta}(s,\mathbf{s},\overline{a},s^{\prime},\mathbf{s}^{\prime},\overline{a}^{\prime})-\delta_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a},s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime})\right)\frac{1}{N}\sum_{x\in\mathbf{s}}\nabla_{\beta}f_{\beta}(s,x,\overline{a})\Bigr{\rVert}_{2}^{2}
\displaystyle\leq 𝔼𝐬,𝐬(δ^β(s,𝐬,a¯,s,𝐬,a¯)δβ(s,d𝒮,a¯,s,d𝒮,a¯))2\displaystyle\mathbb{E}_{\mathbf{s},\mathbf{s}^{\prime}}\left(\widehat{\delta}_{\beta}(s,\mathbf{s},\overline{a},s^{\prime},\mathbf{s^{\prime}},\overline{a}^{\prime})-\delta_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a},s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime})\right)^{2}
\displaystyle\leq 𝔼x,x(fβ(s,x,a¯)τξ(s,d𝒮,a¯)μfβ(s,x,a¯))2/N\displaystyle\mathbb{E}_{x,x^{\prime}}\left(f_{\beta}(s,x,\overline{a})-\tau\xi(s,\mathrm{d}_{\mathcal{S}},\overline{a})-\mu f_{\beta}(s^{\prime},x^{\prime},\overline{a}^{\prime})\right)^{2}/N
\displaystyle\leq 3𝔼x,x[(fβ(0)(s,x,a¯))2+μ2(fβ(0)(s,x,a¯))2+τ1(τfβ(0)(s,x,a¯))2]/N+𝒪(Rβ2/N)+𝒪(τ3).\displaystyle 3\mathbb{E}_{x,x^{\prime}}\left[\left(f_{\beta(0)}(s,x,\overline{a})\right)^{2}+\mu^{2}\left(f_{\beta(0)}(s^{\prime},x^{\prime},\overline{a}^{\prime})\right)^{2}+\tau_{1}\left(\tau f_{\beta(0)}(s,x,\overline{a})\right)^{2}\right]/N+\mathcal{O}(R_{\beta}^{2}/N)+\mathcal{O}(\tau_{3}).

where in the last inequality we use an argument similar to that in Lemma E.3. As shown in Lemma E.3, we have \mathbb{E}_{\mathrm{init},x}\left[f_{\beta(0)}(s,x,\overline{a})^{2}\right]\leq 1. Hence the first term in (20) is of order 𝒪(Rβ2/N)\mathcal{O}(R_{\beta}^{2}/N).

To bound the second term in (20), from Jensen’s inequality we have

𝔼initδβ(s,d𝒮,a¯,s,d𝒮,a¯)2\displaystyle\mathbb{E}_{\mathrm{init}}\delta_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a},s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime})^{2}
\displaystyle\leq 𝔼init,x,x(fβ(s,x,a¯)τξ(s,d𝒮,a¯)μfβ(s,x,a¯))2\displaystyle\mathbb{E}_{\mathrm{init},x,x^{\prime}}\left(f_{\beta}(s,x,\overline{a})-\tau\xi(s,\mathrm{d}_{\mathcal{S}},\overline{a})-\mu f_{\beta}(s^{\prime},x^{\prime},\overline{a}^{\prime})\right)^{2}
\displaystyle\leq 3𝔼init,x,x[(fβ(0)(s,x,a¯))2+μ2(fβ(0)(s,x,a¯))2+τ1(τfβ(0)(s,x,a¯))2]+𝒪(Rβ2)+𝒪(τ3)\displaystyle 3\mathbb{E}_{\mathrm{init},x,x^{\prime}}\left[\left(f_{\beta(0)}(s,x,\overline{a})\right)^{2}+\mu^{2}\left(f_{\beta(0)}(s^{\prime},x^{\prime},\overline{a}^{\prime})\right)^{2}+\tau_{1}\left(\tau f_{\beta(0)}(s,x,\overline{a})\right)^{2}\right]+\mathcal{O}(R_{\beta}^{2})+\mathcal{O}(\tau_{3})
=\displaystyle= 𝒪(Rβ2).\displaystyle\mathcal{O}\left(R_{\beta}^{2}\right).

On the other hand, we have

\displaystyle\mathbb{E}_{\mathbf{s}}\Bigl{\lVert}\frac{1}{N}\sum_{x\in\mathbf{s}}\nabla_{\beta}f_{\beta}(s,x,\overline{a})-\nabla_{\beta}\mathbb{E}_{x\sim\mathrm{d}_{\mathcal{S}}}f_{\beta}(s,x,\overline{a})\Bigr{\rVert}_{2}^{2}\leq\mathbb{E}_{x\sim\mathrm{d}_{\mathcal{S}}}\|\nabla_{\beta}f_{\beta}(s,x,\overline{a})\|_{2}^{2}/N\leq 1/N.

Therefore, taking the squared norm on both sides of (20) and using the Cauchy–Schwarz inequality, we have

\displaystyle\mathbb{E}_{\mathrm{init},\mathbf{s},\mathbf{s}^{\prime}}\|\widehat{g}_{\beta}(s,\mathbf{s},\overline{a},s^{\prime},\mathbf{s}^{\prime},\overline{a}^{\prime})-g_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a},s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime})\|_{2}^{2}
\displaystyle\leq 2\,\mathbb{E}_{\mathrm{init},\mathbf{s},\mathbf{s}^{\prime}}\left(\widehat{\delta}_{\beta}(s,\mathbf{s},\overline{a},s^{\prime},\mathbf{s}^{\prime},\overline{a}^{\prime})-\delta_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a},s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime})\right)^{2}\Bigl{\lVert}\frac{1}{N}\sum_{x\in\mathbf{s}}\nabla_{\beta}f_{\beta}(s,x,\overline{a})\Bigr{\rVert}_{2}^{2}
\displaystyle\quad+2\,\mathbb{E}_{\mathrm{init}}\left\{\delta^{2}_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a},s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime})\right\}\mathbb{E}_{\mathrm{init},\mathbf{s}}\Bigl{\lVert}\frac{1}{N}\sum_{x\in\mathbf{s}}\nabla_{\beta}f_{\beta}(s,x,\overline{a})-\nabla_{\beta}\mathbb{E}_{x\sim\mathrm{d}_{\mathcal{S}}}f_{\beta}(s,x,\overline{a})\Bigr{\rVert}_{2}^{2}
=\displaystyle= 𝒪(Rβ2/N).\displaystyle\mathcal{O}(R_{\beta}^{2}/N).

With Lemma E.3 and Lemma E.4, we can now bound the difference between g^β\widehat{g}_{\beta} and g¯β\overline{g}_{\beta}.

Lemma E.5.

With Assumption 4, there exists ς2=𝒪(Rβ2)\varsigma^{2}=\mathcal{O}(R_{\beta}^{2}), such that

𝔼init,s,d𝒮,a¯,s,d𝒮,a¯,𝐬,𝐬g^βg¯β22ς2+𝒪(Rβ2/N).\displaystyle\mathbb{E}_{\mathrm{init},s,\mathrm{d}_{\mathcal{S}},\overline{a},s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime},\mathbf{s},\mathbf{s}^{\prime}}\|\widehat{g}_{\beta}-\overline{g}_{\beta}\|_{2}^{2}\leq\varsigma^{2}+\mathcal{O}(R_{\beta}^{2}/N).
Proof.

We have

𝔼g^βg¯β22\displaystyle\mathbb{E}\|\widehat{g}_{\beta}-\overline{g}_{\beta}\|_{2}^{2}\leq 2𝔼g^β(s,𝐬,a,s,𝐬,a)gβ(s,d𝒮,a¯,s,d𝒮,a¯)22\displaystyle 2\mathbb{E}\|\widehat{g}_{\beta}(s,\mathbf{s},a,s^{\prime},\mathbf{s}^{\prime},a^{\prime})-g_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a},s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime})\|_{2}^{2}
+2𝔼gβ(s,d𝒮,a¯,s,d𝒮,a¯)g¯β22\displaystyle+2\mathbb{E}\|g_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a},s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime})-\overline{g}_{\beta}\|_{2}^{2}

Now applying Lemma E.3 and Lemma E.4, the claim follows immediately. ∎

We are now ready to show the convergence of updates (13) and (14) for solving problem (11). We denote by F^{0}_{\beta_{*}} the function within the class \mathcal{F}_{R,m} that satisfies the stationarity condition:

\Pi_{\mathbb{B}(R_{\beta},\beta_{0})}(\beta_{*}-\overline{g}^{0}_{\beta_{*}})=\beta_{*}, (21)

where \Pi_{X}(\cdot) denotes the Euclidean projection onto the set X. Condition (21) is equivalent to:

\left\langle\overline{g}^{0}_{\beta_{*}},\beta-\beta_{*}\right\rangle\geq 0\quad\text{for all }\beta\in\mathbb{B}(R_{\beta},\beta_{0}).

Note that \overline{g}^{0}_{\beta_{*}}=\mathbb{E}_{s,\mathrm{d}_{\mathcal{S}},\overline{a},s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime}}\left\{\delta_{\beta_{*}}^{0}(s,\mathrm{d}_{\mathcal{S}},\overline{a},s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime})\nabla F^{0}_{\beta_{*}}(s,\mathrm{d}_{\mathcal{S}},\overline{a})\right\}, and \left\langle\nabla F^{0}_{\beta_{*}}(s,\mathrm{d}_{\mathcal{S}},\overline{a}),\beta-\beta_{*}\right\rangle=F^{0}_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a})-F^{0}_{\beta_{*}}(s,\mathrm{d}_{\mathcal{S}},\overline{a}). Hence we have

\mathbb{E}_{s,\mathrm{d}_{\mathcal{S}},\overline{a},s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime}}\left\{\delta_{\beta_{*}}^{0}(s,\mathrm{d}_{\mathcal{S}},\overline{a},s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime})\left(F^{0}_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a})-F^{0}_{\beta_{*}}(s,\mathrm{d}_{\mathcal{S}},\overline{a})\right)\right\}\geq 0\quad\text{for all }\beta\in\mathbb{B}(R_{\beta},\beta_{0}). (22)

With the definition of the operator \zeta_{F} in (12), we know that

\mathbb{E}_{s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime}}\delta_{\beta_{*}}^{0}(s,\mathrm{d}_{\mathcal{S}},\overline{a},s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime})=F^{0}_{\beta_{*}}(s,\mathrm{d}_{\mathcal{S}},\overline{a})-\zeta_{F^{0}_{\beta_{*}}}(s,\mathrm{d}_{\mathcal{S}},\overline{a}). (23)

From (22) and (23), we have

\left\langle F^{0}_{\beta_{*}}-\zeta_{F^{0}_{\beta_{*}}},F^{0}_{\beta}-F^{0}_{\beta_{*}}\right\rangle_{\rho}\geq 0,

which is equivalent to F^{0}_{\beta_{*}}=\Pi_{\mathcal{F}_{R_{\beta},m}}(\zeta_{F^{0}_{\beta_{*}}}). That is, F^{0}_{\beta_{*}} is the projection (with respect to the metric induced by \left\langle\cdot,\cdot\right\rangle_{\rho}) of the image of itself under the operator \zeta.
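To make this projected fixed-point characterization concrete, the sketch below runs a population-level projected temporal-difference iteration in a toy linear function class and then checks the stationarity condition (21) at the limit point. Everything in the snippet (the random Markov reward process, the feature matrix Phi, the radius R_beta) is an illustrative stand-in rather than an object from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy Markov reward process: n states, row-stochastic P, reward r, discount mu.
n, d, mu = 20, 5, 0.8
P = rng.dirichlet(np.ones(n), size=n)
r = rng.uniform(size=n)
Phi = rng.normal(size=(n, d)) / np.sqrt(d)   # fixed features, analog of grad F at initialization

# Sample states from the stationary distribution of P, the standard condition
# under which the population semi-gradient iteration below converges.
evals, evecs = np.linalg.eig(P.T)
rho = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
rho = rho / rho.sum()

R_beta = 5.0                                  # radius of the parameter ball

def project_ball(beta, radius=R_beta):
    nrm = np.linalg.norm(beta)
    return beta if nrm <= radius else beta * (radius / nrm)

def population_semi_gradient(beta):
    # g_bar(beta) = E_rho[(F_beta(s) - r(s) - mu * E[F_beta(s') | s]) * grad F_beta(s)]
    delta = Phi @ beta - (r + mu * P @ (Phi @ beta))
    return Phi.T @ (rho * delta)

eta, beta = 0.05, np.zeros(d)
for _ in range(20000):
    beta = project_ball(beta - eta * population_semi_gradient(beta))

# Stationarity condition (21): the limit point is a fixed point of the projected update.
residual = project_ball(beta - eta * population_semi_gradient(beta)) - beta
print(np.linalg.norm(residual))  # close to zero
```

When the unconstrained solution lies outside the ball, the limit point sits on the boundary and satisfies the variational inequality above rather than a zero-gradient condition, which is exactly the distinction drawn here.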

Theorem E.1.

Let \{\beta(t)\}_{t=0}^{T-1} be generated by updates (15) and (16), and define \overline{\beta}_{T}=\frac{1}{T}\sum_{t=0}^{T-1}\beta(t). Then we have

\mathbb{E}_{\mathrm{init},s,\mathrm{d}_{\mathcal{S}},\overline{a}}\left[F^{0}_{\beta_{*}}(s,\mathrm{d}_{\mathcal{S}},\overline{a})-F_{\overline{\beta}_{T}}(s,\mathrm{d}_{\mathcal{S}},\overline{a})\right]^{2}\leq\mathcal{O}\left(\frac{R_{\beta}^{2}}{T^{1/2}}+\frac{R_{\beta}^{5/2}}{m^{1/4}}+\frac{R_{\beta}^{2}}{N^{1/2}}+\frac{R_{\beta}^{3}}{m^{1/2}}\right).
Proof.

Conditioning on the t-th iteration, we have

\mathbb{E}_{s,\mathrm{d}_{\mathcal{S}},\overline{a},s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime},\mathbf{s},\mathbf{s}^{\prime}}\left[\|\beta(t+1)-\beta_{*}\|_{2}^{2}|\beta(t)\right]
=\mathbb{E}\|\Pi_{\mathbb{B}^{0}_{R_{\beta}}}(\beta(t)-\eta\widehat{g}_{\beta(t)})-\Pi_{\mathbb{B}^{0}_{R_{\beta}}}(\beta_{*}-\eta\overline{g}^{0}_{\beta_{*}})\|_{2}^{2}
\leq\mathbb{E}\|\beta(t)-\eta\widehat{g}_{\beta(t)}-\beta_{*}+\eta\overline{g}^{0}_{\beta_{*}}\|_{2}^{2}
=\mathbb{E}\left(\|\beta(t)-\beta_{*}\|_{2}^{2}-\eta\left\langle\beta(t)-\beta_{*},\widehat{g}_{\beta(t)}-\overline{g}^{0}_{\beta_{*}}\right\rangle+\eta^{2}\|\widehat{g}_{\beta(t)}-\overline{g}^{0}_{\beta_{*}}\|_{2}^{2}\right), (24)

where the inequality follows from the non-expansiveness of the Euclidean projection.
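For completeness, the non-expansiveness used in this step follows from the variational characterization of the Euclidean projection onto a convex set X, namely \left\langle x-\Pi_{X}(x),z-\Pi_{X}(x)\right\rangle\leq 0 for all z\in X. Applying this at x with z=\Pi_{X}(y), at y with z=\Pi_{X}(x), and adding the two inequalities yields

\|\Pi_{X}(x)-\Pi_{X}(y)\|_{2}^{2}\leq\left\langle\Pi_{X}(x)-\Pi_{X}(y),x-y\right\rangle\leq\|\Pi_{X}(x)-\Pi_{X}(y)\|_{2}\|x-y\|_{2},

so that \|\Pi_{X}(x)-\Pi_{X}(y)\|_{2}\leq\|x-y\|_{2}.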

We first consider the second term in (24). We have

\mathbb{E}_{s,\mathrm{d}_{\mathcal{S}},\overline{a},s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime},\mathbf{s},\mathbf{s}^{\prime}}\left\langle\beta(t)-\beta_{*},\widehat{g}_{\beta(t)}-\overline{g}^{0}_{\beta_{*}}\right\rangle
=\mathbb{E}\left[\left\langle\beta(t)-\beta_{*},\widehat{g}_{\beta(t)}-g_{\beta(t)}\right\rangle+\left\langle\beta(t)-\beta_{*},g_{\beta(t)}-\overline{g}_{\beta(t)}\right\rangle+\left\langle\beta(t)-\beta_{*},\overline{g}_{\beta(t)}-\overline{g}^{0}_{\beta_{*}}\right\rangle\right]
=\mathbb{E}\left\langle\beta(t)-\beta_{*},\widehat{g}_{\beta(t)}-g_{\beta(t)}\right\rangle+\left\langle\beta(t)-\beta_{*},\overline{g}_{\beta(t)}-\overline{g}^{0}_{\beta_{*}}\right\rangle
\geq\mathbb{E}\left\langle\beta(t)-\beta_{*},\overline{g}_{\beta(t)}-\overline{g}^{0}_{\beta_{*}}\right\rangle-R_{\beta}\left(\mathbb{E}\|\widehat{g}_{\beta(t)}-g_{\beta(t)}\|_{2}^{2}\right)^{1/2}
\geq\mathbb{E}\left\langle\beta(t)-\beta_{*},\overline{g}_{\beta(t)}-\overline{g}^{0}_{\beta(t)}\right\rangle+\left\langle\beta(t)-\beta_{*},\overline{g}^{0}_{\beta(t)}-\overline{g}^{0}_{\beta_{*}}\right\rangle-R_{\beta}\left(\mathbb{E}\|\widehat{g}_{\beta(t)}-g_{\beta(t)}\|_{2}^{2}\right)^{1/2}, (25)

where the second equality uses the definition \mathbb{E}\,g_{\beta(t)}=\overline{g}_{\beta(t)}. We further have

\left\langle\beta(t)-\beta_{*},\overline{g}_{\beta(t)}-\overline{g}^{0}_{\beta(t)}\right\rangle\geq-R_{\beta}\|\overline{g}_{\beta(t)}-\overline{g}^{0}_{\beta(t)}\|_{2}. (26)

Using again \overline{g}^{0}_{\beta}=\mathbb{E}\left\{\delta_{\beta}^{0}(s,\mathrm{d}_{\mathcal{S}},\overline{a},s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime})\nabla F^{0}_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a})\right\}, together with \nabla F^{0}_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a})=\nabla F^{0}_{\beta_{*}}(s,\mathrm{d}_{\mathcal{S}},\overline{a}) for all \beta and \left\langle\nabla F^{0}_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a}),\beta-\beta_{*}\right\rangle=F^{0}_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a})-F^{0}_{\beta_{*}}(s,\mathrm{d}_{\mathcal{S}},\overline{a}), we have

\mathbb{E}_{\mathrm{init},s,\mathrm{d}_{\mathcal{S}},\overline{a},s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime}}\left\langle\beta(t)-\beta_{*},\overline{g}^{0}_{\beta(t)}-\overline{g}^{0}_{\beta_{*}}\right\rangle
=\mathbb{E}\left(\delta^{0}_{\beta(t)}(s,\mathrm{d}_{\mathcal{S}},\overline{a},s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime})-\delta^{0}_{\beta_{*}}(s,\mathrm{d}_{\mathcal{S}},\overline{a},s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime})\right)\left(F^{0}_{\beta(t)}(s,\mathrm{d}_{\mathcal{S}},\overline{a})-F^{0}_{\beta_{*}}(s,\mathrm{d}_{\mathcal{S}},\overline{a})\right)
=\mathbb{E}\left(F^{0}_{\beta(t)}(s,\mathrm{d}_{\mathcal{S}},\overline{a})-F^{0}_{\beta_{*}}(s,\mathrm{d}_{\mathcal{S}},\overline{a})-\mu\left[F^{0}_{\beta(t)}(s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime})-F^{0}_{\beta_{*}}(s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime})\right]\right)\left(F^{0}_{\beta(t)}(s,\mathrm{d}_{\mathcal{S}},\overline{a})-F^{0}_{\beta_{*}}(s,\mathrm{d}_{\mathcal{S}},\overline{a})\right)
\geq(1-\mu)\mathbb{E}_{\mathrm{init},s,\mathrm{d}_{\mathcal{S}},\overline{a}}\left(F^{0}_{\beta_{*}}(s,\mathrm{d}_{\mathcal{S}},\overline{a})-F^{0}_{\beta(t)}(s,\mathrm{d}_{\mathcal{S}},\overline{a})\right)^{2}, (27)

where in the last line we used the Cauchy-Schwarz inequality, together with the fact that (s,\mathrm{d}_{\mathcal{S}},\overline{a}) and (s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime}) share the same distribution. That is,

\mathbb{E}_{\mathrm{init},s,\mathrm{d}_{\mathcal{S}},\overline{a},s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime}}\left[F^{0}_{\beta(t)}(s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime})-F^{0}_{\beta_{*}}(s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime})\right]\left[F^{0}_{\beta(t)}(s,\mathrm{d}_{\mathcal{S}},\overline{a})-F^{0}_{\beta_{*}}(s,\mathrm{d}_{\mathcal{S}},\overline{a})\right]
\leq\left(\mathbb{E}\left[F^{0}_{\beta(t)}(s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime})-F^{0}_{\beta_{*}}(s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime})\right]^{2}\mathbb{E}\left[F^{0}_{\beta(t)}(s,\mathrm{d}_{\mathcal{S}},\overline{a})-F^{0}_{\beta_{*}}(s,\mathrm{d}_{\mathcal{S}},\overline{a})\right]^{2}\right)^{1/2}
=\mathbb{E}_{\mathrm{init},s,\mathrm{d}_{\mathcal{S}},\overline{a}}\left[F^{0}_{\beta(t)}(s,\mathrm{d}_{\mathcal{S}},\overline{a})-F^{0}_{\beta_{*}}(s,\mathrm{d}_{\mathcal{S}},\overline{a})\right]^{2}.

Combining (25), (26), and (27), the second term in (24) can be bounded by

-\eta\mathbb{E}\left\langle\beta(t)-\beta_{*},\widehat{g}_{\beta(t)}-\overline{g}^{0}_{\beta_{*}}\right\rangle
\leq-\eta\mathbb{E}\left[-R_{\beta}\|\overline{g}_{\beta(t)}-\overline{g}^{0}_{\beta(t)}\|_{2}+(1-\mu)\mathbb{E}\left(F^{0}_{\beta_{*}}(s,\mathrm{d}_{\mathcal{S}},\overline{a})-F^{0}_{\beta(t)}(s,\mathrm{d}_{\mathcal{S}},\overline{a})\right)^{2}\right]
+\eta R_{\beta}\left(\mathbb{E}\|\widehat{g}_{\beta(t)}-g_{\beta(t)}\|_{2}^{2}\right)^{1/2}. (28)

To bound the third term in (24), we have

\mathbb{E}\|\widehat{g}_{\beta(t)}-\overline{g}^{0}_{\beta_{*}}\|_{2}^{2}\leq 3\mathbb{E}\left(\|\widehat{g}_{\beta(t)}-\overline{g}_{\beta(t)}\|_{2}^{2}+\|\overline{g}_{\beta(t)}-\overline{g}^{0}_{\beta(t)}\|_{2}^{2}+\|\overline{g}^{0}_{\beta(t)}-\overline{g}^{0}_{\beta_{*}}\|_{2}^{2}\right). (29)

We can use Lemma E.5 and Lemma E.2 to control the first two terms in (29). To bound the third term, note that \nabla F^{0}_{\beta}(s,\mathrm{d}_{\mathcal{S}},\overline{a})=\nabla F_{\beta(0)}(s,\mathrm{d}_{\mathcal{S}},\overline{a}) for any \beta and \|\nabla F_{\beta(0)}(s,\mathrm{d}_{\mathcal{S}},\overline{a})\|_{2}\leq 1, so that

\|\overline{g}^{0}_{\beta(t)}-\overline{g}^{0}_{\beta_{*}}\|_{2}^{2}\leq\left[\mathbb{E}\left|\delta^{0}_{\beta(t)}(s,\mathrm{d}_{\mathcal{S}},\overline{a},s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime})-\delta^{0}_{\beta_{*}}(s,\mathrm{d}_{\mathcal{S}},\overline{a},s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime})\right|\,\|\nabla F^{0}_{\beta(0)}(s,\mathrm{d}_{\mathcal{S}},\overline{a})\|_{2}\right]^{2}
\leq\mathbb{E}\left[\left(\delta^{0}_{\beta(t)}(s,\mathrm{d}_{\mathcal{S}},\overline{a},s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime})-\delta^{0}_{\beta_{*}}(s,\mathrm{d}_{\mathcal{S}},\overline{a},s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime})\right)\|\nabla F^{0}_{\beta(0)}(s,\mathrm{d}_{\mathcal{S}},\overline{a})\|_{2}\right]^{2}
\leq\mathbb{E}\left[\delta^{0}_{\beta(t)}(s,\mathrm{d}_{\mathcal{S}},\overline{a},s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime})-\delta^{0}_{\beta_{*}}(s,\mathrm{d}_{\mathcal{S}},\overline{a},s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime})\right]^{2}
=\mathbb{E}\left(F^{0}_{\beta(t)}(s,\mathrm{d}_{\mathcal{S}},\overline{a})-F^{0}_{\beta_{*}}(s,\mathrm{d}_{\mathcal{S}},\overline{a})-\mu\left[F^{0}_{\beta(t)}(s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime})-F^{0}_{\beta_{*}}(s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime})\right]\right)^{2}
\leq 2(1+\mu^{2})\mathbb{E}\left[F^{0}_{\beta(t)}(s,\mathrm{d}_{\mathcal{S}},\overline{a})-F^{0}_{\beta_{*}}(s,\mathrm{d}_{\mathcal{S}},\overline{a})\right]^{2}, (30)

where the last inequality follows from the fact that (s,\mathrm{d}_{\mathcal{S}},\overline{a}) and (s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime}) share the same distribution.
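In more detail, writing X=F^{0}_{\beta(t)}(s,\mathrm{d}_{\mathcal{S}},\overline{a})-F^{0}_{\beta_{*}}(s,\mathrm{d}_{\mathcal{S}},\overline{a}) and X^{\prime} for the same difference evaluated at (s^{\prime},\mathrm{d}_{\mathcal{S}}^{\prime},\overline{a}^{\prime}), the last step of (30) uses

\mathbb{E}\left(X-\mu X^{\prime}\right)^{2}\leq 2\mathbb{E}X^{2}+2\mu^{2}\mathbb{E}(X^{\prime})^{2}=2(1+\mu^{2})\mathbb{E}X^{2},

where the final equality holds because X and X^{\prime} are identically distributed.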

Combining (28), (29), and (30), and conditioning on \beta(t), we have the following bound for (24):

\mathbb{E}\left[\|\beta(t+1)-\beta_{*}\|_{2}^{2}|\beta(t)\right]
\leq\|\beta(t)-\beta_{*}\|_{2}^{2}-\left(\eta(1-\mu)-6(1+\mu^{2})\eta^{2}\right)\mathbb{E}\left[F^{0}_{\beta_{*}}(s,\mathrm{d}_{\mathcal{S}},\overline{a})-F^{0}_{\beta(t)}(s,\mathrm{d}_{\mathcal{S}},\overline{a})\right]^{2}
+\eta R_{\beta}\left(\mathbb{E}\|\widehat{g}_{\beta(t)}-g_{\beta(t)}\|_{2}^{2}\right)^{1/2}+\eta R_{\beta}\|\overline{g}_{\beta(t)}-\overline{g}^{0}_{\beta(t)}\|_{2}
+3\eta^{2}\mathbb{E}\left(\|\widehat{g}_{\beta(t)}-\overline{g}_{\beta(t)}\|_{2}^{2}+\|\overline{g}_{\beta(t)}-\overline{g}^{0}_{\beta(t)}\|_{2}^{2}\right).

From Lemma E.4, we have \mathbb{E}\|\widehat{g}_{\beta(t)}-g_{\beta(t)}\|_{2}^{2}=\mathcal{O}(R_{\beta}^{2}/N). From Lemma E.2, we have \mathbb{E}\|\overline{g}_{\beta(t)}-\overline{g}^{0}_{\beta(t)}\|_{2}=\mathcal{O}\left(R_{\beta}^{3/2}/m^{1/4}\right). From Lemma E.5, we have \mathbb{E}\|\widehat{g}_{\beta(t)}-\overline{g}_{\beta(t)}\|_{2}^{2}=\mathcal{O}(R_{\beta}^{2}). Hence

\mathbb{E}\|\beta(t+1)-\beta_{*}\|_{2}^{2}
\leq\|\beta(t)-\beta_{*}\|_{2}^{2}-\left(\eta(1-\mu)-6(1+\mu^{2})\eta^{2}\right)\mathbb{E}\left[F^{0}_{\beta_{*}}(s,\mathrm{d}_{\mathcal{S}},\overline{a})-F^{0}_{\beta(t)}(s,\mathrm{d}_{\mathcal{S}},\overline{a})\right]^{2}
+\eta\mathcal{O}\left(\frac{R_{\beta}^{2}}{N^{1/2}}+\frac{R_{\beta}^{5/2}}{m^{1/4}}\right)+\eta^{2}\mathcal{O}(R_{\beta}^{2}).

Rearranging and telescoping, we have

\sum_{t=0}^{T-1}\mathbb{E}\left[F^{0}_{\beta_{*}}(s,\mathrm{d}_{\mathcal{S}},\overline{a})-F^{0}_{\beta(t)}(s,\mathrm{d}_{\mathcal{S}},\overline{a})\right]^{2}
\leq\left(\eta(1-\mu)-6(1+\mu^{2})\eta^{2}\right)^{-1}\mathbb{E}\left[\|\beta_{0}-\beta_{*}\|_{2}^{2}+\eta T\mathcal{O}\left(\frac{R_{\beta}^{2}}{N^{1/2}}+\frac{R_{\beta}^{5/2}}{m^{1/4}}\right)+T\eta^{2}\mathcal{O}(R_{\beta}^{2})\right]. (31)

Since each summand on the left-hand side of (31) is convex with respect to \beta(t) (recall that F^{0}_{\beta} is affine in \beta), we can average over \{\beta(t)\}_{t=0}^{T-1}. Defining \overline{\beta}_{T}=\frac{1}{T}\sum_{t=0}^{T-1}\beta(t), Jensen's inequality gives

\mathbb{E}\left[F^{0}_{\beta_{*}}(s,\mathrm{d}_{\mathcal{S}},\overline{a})-F^{0}_{\overline{\beta}_{T}}(s,\mathrm{d}_{\mathcal{S}},\overline{a})\right]^{2}
\leq\left(\eta(1-\mu)-6(1+\mu^{2})\eta^{2}\right)^{-1}\mathbb{E}_{\mathrm{init}}\left[\frac{\|\beta_{0}-\beta_{*}\|_{2}^{2}}{T}+\eta\mathcal{O}\left(\frac{R_{\beta}^{2}}{N^{1/2}}+\frac{R_{\beta}^{5/2}}{m^{1/4}}\right)+\eta^{2}\mathcal{O}(R_{\beta}^{2})\right].

Now taking \eta=\mathcal{O}(T^{-1/2}), we have

\mathbb{E}\left[F^{0}_{\beta_{*}}(s,\mathrm{d}_{\mathcal{S}},\overline{a})-F^{0}_{\overline{\beta}_{T}}(s,\mathrm{d}_{\mathcal{S}},\overline{a})\right]^{2}\leq\mathcal{O}\left(\frac{R_{\beta}^{2}}{T^{1/2}}+\frac{R_{\beta}^{5/2}}{m^{1/4}}+\frac{R_{\beta}^{2}}{N^{1/2}}\right).
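For concreteness, one sufficient instance of the \eta=\mathcal{O}(T^{-1/2}) scaling (stated here as an illustrative choice rather than the exact constant used elsewhere in the paper) is

\eta=\min\left\{\frac{1-\mu}{12(1+\mu^{2})},\,T^{-1/2}\right\},

under which \eta(1-\mu)-6(1+\mu^{2})\eta^{2}\geq\eta(1-\mu)/2, so the prefactor in the previous display is \mathcal{O}(1/\eta)=\mathcal{O}(T^{1/2}), and \|\beta_{0}-\beta_{*}\|_{2}^{2}\leq R_{\beta}^{2} turns the first term into the \mathcal{O}(R_{\beta}^{2}/T^{1/2}) term above.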

We can now apply Lemma E.1 and conclude that:

\mathbb{E}\left[F^{0}_{\beta_{*}}(s,\mathrm{d}_{\mathcal{S}},\overline{a})-F_{\overline{\beta}_{T}}(s,\mathrm{d}_{\mathcal{S}},\overline{a})\right]^{2}\leq 2\mathbb{E}\left[\left(F^{0}_{\beta_{*}}(s,\mathrm{d}_{\mathcal{S}},\overline{a})-F^{0}_{\overline{\beta}_{T}}(s,\mathrm{d}_{\mathcal{S}},\overline{a})\right)^{2}+\left(F^{0}_{\overline{\beta}_{T}}(s,\mathrm{d}_{\mathcal{S}},\overline{a})-F_{\overline{\beta}_{T}}(s,\mathrm{d}_{\mathcal{S}},\overline{a})\right)^{2}\right]
\leq\mathcal{O}\left(\frac{R_{\beta}^{2}}{T^{1/2}}+\frac{R_{\beta}^{5/2}}{m^{1/4}}+\frac{R_{\beta}^{2}}{N^{1/2}}+\frac{R_{\beta}^{3}}{m^{1/2}}\right). ∎

We can now specialize Theorem E.1 to policy evaluation and policy optimization. The four terms in the bound correspond to the optimization error (the T^{-1/2} term), the linearization error of the overparameterized network (the m^{-1/4} and m^{-1/2} terms), and the error from approximating the mean-field state distribution with N sampled agent states (the N^{-1/2} term).

Proof of Lemma 4.1.

Note that F^{0}_{\theta_{*}} satisfies F^{0}_{\theta_{*}}=\Pi_{\mathcal{F}^{\mathcal{P}}_{R_{\theta},m_{Q}}}(\zeta_{F^{0}_{\theta_{*}}}). For policy evaluation, by definition (12), the operator \zeta equals the Bellman evaluation operator: \zeta_{F_{\theta}}(s,\mathrm{d}_{\mathcal{S}},\overline{a})=\left[{\mathcal{T}}^{\pi_{k}}F_{\theta}\right](s,\mathrm{d}_{\mathcal{S}},\overline{a}). Hence we have F^{0}_{\theta_{*}}=\Pi_{\mathcal{F}^{\mathcal{P}}_{R_{\theta},m_{Q}}}(\left[{\mathcal{T}}^{\pi_{k}}F^{0}_{\theta_{*}}\right]). By Assumption 1, \left[{\mathcal{T}}^{\pi_{k}}F^{0}_{\theta_{*}}\right]\in\mathcal{F}^{\mathcal{P}}_{R_{\theta},m_{Q}}, and F^{0}_{\theta_{*}}\in\mathcal{F}^{\mathcal{P}}_{R_{\theta},m_{Q}} as well. Since the projection onto \mathcal{F}^{\mathcal{P}}_{R_{\theta},m_{Q}} leaves its elements unchanged, we have F^{0}_{\theta_{*}}=\left[{\mathcal{T}}^{\pi_{k}}F^{0}_{\theta_{*}}\right], and since Q^{\pi_{k}} is the unique fixed point of {\mathcal{T}}^{\pi_{k}}, this implies F^{0}_{\theta_{*}}=Q^{\pi_{k}}. The claim then follows immediately by applying Theorem E.1. ∎

Proof of Lemma 4.2.

Note that here we have F^{0}_{\alpha_{*}}=\Pi_{\mathcal{F}_{R_{\alpha},m_{A}}}(\zeta_{F^{0}_{\alpha_{*}}}). For policy optimization, the operator \zeta is defined by \zeta_{F^{0}_{\alpha_{*}}}(s,\mathrm{d}_{\mathcal{S}},\overline{a})=\tau_{k+1}(\upsilon_{k}^{-1}F^{Q}_{\theta_{k}}(s,\mathrm{d}_{\mathcal{S}},\overline{a})+\tau_{k}^{-1}F^{A}_{\alpha_{k}}(s,\mathrm{d}_{\mathcal{S}},\overline{a})). Then we have

\mathbb{E}_{\widetilde{\sigma}_{k}}\left[F^{A}_{\overline{\alpha}(T)}(s,\mathrm{d}_{\mathcal{S}},\overline{a})-\tau_{k+1}(\upsilon_{k}^{-1}F^{Q}_{\theta_{k}}(s,\mathrm{d}_{\mathcal{S}},\overline{a})+\tau_{k}^{-1}F^{A}_{\alpha_{k}}(s,\mathrm{d}_{\mathcal{S}},\overline{a}))\right]^{2}
\leq 2\mathbb{E}_{\widetilde{\sigma}_{k}}\left[F^{A}_{\overline{\alpha}(T)}(s,\mathrm{d}_{\mathcal{S}},\overline{a})-F^{0}_{\alpha_{*}}(s,\mathrm{d}_{\mathcal{S}},\overline{a})\right]^{2}+2\mathbb{E}_{\widetilde{\sigma}_{k}}\left[F^{0}_{\alpha_{*}}(s,\mathrm{d}_{\mathcal{S}},\overline{a})-\tau_{k+1}(\upsilon_{k}^{-1}F^{Q}_{\theta_{k}}(s,\mathrm{d}_{\mathcal{S}},\overline{a})+\tau_{k}^{-1}F^{A}_{\alpha_{k}}(s,\mathrm{d}_{\mathcal{S}},\overline{a}))\right]^{2}
\leq 2\mathbb{E}_{\widetilde{\sigma}_{k}}\left[\tau_{k+1}(\upsilon_{k}^{-1}F^{0}_{\theta_{k}}(s,\mathrm{d}_{\mathcal{S}},\overline{a})+\tau_{k}^{-1}F^{0}_{\alpha_{k}}(s,\mathrm{d}_{\mathcal{S}},\overline{a}))-\tau_{k+1}(\upsilon_{k}^{-1}F^{Q}_{\theta_{k}}(s,\mathrm{d}_{\mathcal{S}},\overline{a})+\tau_{k}^{-1}F^{A}_{\alpha_{k}}(s,\mathrm{d}_{\mathcal{S}},\overline{a}))\right]^{2}
+2\mathbb{E}_{\widetilde{\sigma}_{k}}\left[F^{A}_{\overline{\alpha}(T)}(s,\mathrm{d}_{\mathcal{S}},\overline{a})-F^{0}_{\alpha_{*}}(s,\mathrm{d}_{\mathcal{S}},\overline{a})\right]^{2}
\leq\mathcal{O}\left(\frac{R_{\alpha}^{2}}{T^{1/2}}+\frac{R_{\alpha}^{5/2}}{m_{A}^{1/4}}+\frac{R_{\alpha}^{2}}{N^{1/2}}+\frac{R_{\alpha}^{3}}{m_{A}^{1/2}}\right),

where the second inequality comes from the definition of the projection and F^{0}_{\alpha_{*}}=\Pi_{\mathcal{F}_{R_{\alpha},m_{A}}}(\zeta_{F^{0}_{\alpha_{*}}}), and the last inequality comes from a direct application of Theorem E.1 and Lemma E.1, together with the fact that F^{A}_{\alpha} and F^{Q}_{\theta} share the same initialization. ∎
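One way to read the regression target \tau_{k+1}(\upsilon_{k}^{-1}F^{Q}_{\theta_{k}}+\tau_{k}^{-1}F^{A}_{\alpha_{k}}) is through the energy-based form of the policy: if \pi_{k}(\cdot|s,\mathrm{d}_{\mathcal{S}})\propto\exp(\tau_{k}^{-1}F^{A}_{\alpha_{k}}(s,\mathrm{d}_{\mathcal{S}},\cdot)), then fitting the new energy to this target reproduces the proximal update \pi_{k+1}\propto\pi_{k}\exp(\upsilon_{k}^{-1}F^{Q}_{\theta_{k}}). The sketch below checks this identity numerically at a single (s,\mathrm{d}_{\mathcal{S}}); it assumes the energy-based parameterization and exact fitting, and none of the variable names come from the paper.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    p = np.exp(z)
    return p / p.sum()

rng = np.random.default_rng(2)
num_actions = 4
tau_k, tau_kp1, upsilon_k = 1.0, 0.5, 2.0     # hypothetical temperatures / step size

F_A_k = rng.normal(size=num_actions)          # current energy F^A at a fixed (s, d_S)
Q_k = rng.normal(size=num_actions)            # current critic F^Q at the same (s, d_S)

# Fit the new energy to the target of Lemma 4.2 (exactly, for illustration).
F_A_kp1 = tau_kp1 * (Q_k / upsilon_k + F_A_k / tau_k)

pi_k = softmax(F_A_k / tau_k)
pi_kp1_energy = softmax(F_A_kp1 / tau_kp1)            # policy induced by the new energy
pi_kp1_proximal = pi_k * np.exp(Q_k / upsilon_k)      # mirror-descent / PPO-style update
pi_kp1_proximal = pi_kp1_proximal / pi_kp1_proximal.sum()

print(np.allclose(pi_kp1_energy, pi_kp1_proximal))    # True
```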

Proof of Theorem 4.1.

From Lemma 4.3, we know that

(1-\gamma)\left(\mathcal{L}(\pi^{*})-\mathcal{L}(\pi_{k})\right)
\leq\mathbb{E}_{\nu_{*}}\big{\{}\upsilon_{k}\left[\mathrm{KL}\left(\pi^{*}(\cdot|s,\mathrm{d}_{\mathcal{S}})\|\pi_{k}(\cdot|s,\mathrm{d}_{\mathcal{S}})\right)-\mathrm{KL}\left(\pi^{*}(\cdot|s,\mathrm{d}_{\mathcal{S}})\|\pi_{k+1}(\cdot|s,\mathrm{d}_{\mathcal{S}})\right)+\varepsilon_{k}+\varepsilon_{k}^{\prime}\right]+\upsilon_{k}^{-1}M\big{\}}, (32)

where \epsilon_{k} and \epsilon_{k+1}^{\prime} are defined as in Lemma 4.1 and Lemma 4.2, respectively. In addition, \varepsilon_{k}=\tau_{k+1}^{-1}\epsilon_{k+1}^{\prime}\phi_{k+1}^{*}+\upsilon_{k}^{-1}\epsilon_{k}\psi_{k}^{*}, \varepsilon_{k}^{\prime}=\lvert\mathcal{A}\rvert\tau_{k+1}^{-2}(\epsilon_{k+1}^{\prime})^{2}, and M=\mathbb{E}_{\nu^{*}}\left[\max_{a\in\mathcal{A}}(F^{Q}_{\theta_{0}}(s,\mathrm{d}_{\mathcal{S}},\overline{a}))^{2}\right]+2R_{\alpha}^{2}. Here \phi_{k}^{*} and \psi_{k}^{*} are defined by:

\phi_{k}^{*}=\mathbb{E}_{\widetilde{\sigma}_{k}}\left[\lvert\mathrm{d}\pi^{*}/\mathrm{d}\pi_{0}-\mathrm{d}\pi_{k}/\mathrm{d}\pi_{0}\rvert^{2}\right]^{1/2},
\psi_{k}^{*}=\mathbb{E}_{\sigma_{k}}\left[\lvert\mathrm{d}\sigma^{*}/\mathrm{d}\sigma_{k}-\mathrm{d}(\nu^{*}\times\pi_{k})/\mathrm{d}\sigma_{k}\rvert^{2}\right]^{1/2}.

Now summing (32) from k=0 to K-1, with \upsilon_{k}=\upsilon\sqrt{K}, we have

(1-\gamma)K\min_{0\leq k\leq K-1}\left(\mathcal{L}(\pi^{*})-\mathcal{L}(\pi_{k})\right)
\leq\mathbb{E}_{\nu_{*}}\big{\{}\upsilon\sqrt{K}\left[\mathrm{KL}\left(\pi^{*}(\cdot|s,\mathrm{d}_{\mathcal{S}})\|\pi_{0}(\cdot|s,\mathrm{d}_{\mathcal{S}})\right)-\mathrm{KL}\left(\pi^{*}(\cdot|s,\mathrm{d}_{\mathcal{S}})\|\pi_{K}(\cdot|s,\mathrm{d}_{\mathcal{S}})\right)+\sum_{k=0}^{K-1}(\varepsilon_{k}+\varepsilon_{k}^{\prime})\right]+\upsilon^{-1}\sqrt{K}M\big{\}}.

Since we initialize the policy \pi_{0} to be the uniform policy, we have \mathrm{KL}\left(\pi^{*}(\cdot|s,\mathrm{d}_{\mathcal{S}})\|\pi_{0}(\cdot|s,\mathrm{d}_{\mathcal{S}})\right)\leq\log\lvert\mathcal{A}\rvert. Rearranging, and dropping the nonnegative term \mathrm{KL}\left(\pi^{*}(\cdot|s,\mathrm{d}_{\mathcal{S}})\|\pi_{K}(\cdot|s,\mathrm{d}_{\mathcal{S}})\right), we obtain

\min_{0\leq k\leq K-1}\{\mathcal{L}(\pi^{*})-\mathcal{L}(\pi_{k})\}\leq\frac{\upsilon^{2}\left(\log|\mathcal{A}|+\sum_{k=0}^{K-1}(\varepsilon_{k}+\varepsilon_{k}^{\prime})\right)+M}{(1-\gamma)\upsilon\sqrt{K}}. (33)

From Lemma 4.1 and Lemma 4.2, we have

\epsilon_{k}=\mathcal{O}\left(\frac{R_{\theta}^{2}}{T^{1/2}}+\frac{R_{\theta}^{5/2}}{m_{Q}^{1/4}}+\frac{R_{\theta}^{2}}{N^{1/2}}+\frac{R_{\theta}^{3}}{m_{Q}^{1/2}}\right),\quad\epsilon_{k+1}^{\prime}=\mathcal{O}\left(\frac{R_{\alpha}^{2}}{T^{1/2}}+\frac{R_{\alpha}^{5/2}}{m_{A}^{1/4}}+\frac{R_{\alpha}^{2}}{N^{1/2}}+\frac{R_{\alpha}^{3}}{m_{A}^{1/2}}\right).

To control \varepsilon_{k} and \varepsilon_{k}^{\prime}, we have

\tau_{k+1}^{-1}\epsilon_{k+1}^{\prime}\phi_{k+1}^{*}=\mathcal{O}\left(kK^{-1}\phi_{k+1}^{*}\left(R_{\alpha}^{2}T^{-1/2}+R_{\alpha}^{5/2}m_{A}^{-1/4}+R_{\alpha}^{2}N^{-1/2}\right)\right),
\lvert\mathcal{A}\rvert\tau_{k+1}^{-2}(\epsilon_{k+1}^{\prime})^{2}=\mathcal{O}\left(k^{2}K^{-1}\lvert\mathcal{A}\rvert\left(R_{\alpha}^{2}T^{-1/2}+R_{\alpha}^{5/2}m_{A}^{-1/4}+R_{\alpha}^{2}N^{-1/2}\right)^{2}\right),
\upsilon_{k}^{-1}\epsilon_{k}\psi_{k}^{*}=\mathcal{O}\left(K^{-1/2}\psi_{k}^{*}\left(R_{\theta}^{2}T^{-1/2}+R_{\theta}^{2}N^{-1/2}+R_{\theta}^{5/2}m_{Q}^{-1/4}\right)\right),

if m_{A}=\Omega(R_{\alpha}^{2}) and m_{Q}=\Omega(R_{\theta}^{2}). One can verify that with the choice of N, T, m_{A}, m_{Q} in Theorem 4.1, we have \varepsilon_{k}=\varepsilon_{k}^{\prime}=\mathcal{O}(K^{-1}). Plugging this into (33), we obtain the second part of Theorem 4.1. ∎
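As a back-of-the-envelope check of this last step, treat \phi_{k}^{*}, \psi_{k}^{*}, and \lvert\mathcal{A}\rvert as \mathcal{O}(1) (an assumption made only for this sketch). Then it suffices to choose

T=\Omega(R_{\alpha}^{4}K^{2}+R_{\theta}^{4}K),\quad N=\Omega(R_{\alpha}^{4}K^{2}+R_{\theta}^{4}K),\quad m_{A}=\Omega(R_{\alpha}^{10}K^{4}),\quad m_{Q}=\Omega(R_{\theta}^{10}K^{2}),

which makes R_{\alpha}^{2}T^{-1/2}+R_{\alpha}^{5/2}m_{A}^{-1/4}+R_{\alpha}^{2}N^{-1/2}=\mathcal{O}(K^{-1}) and R_{\theta}^{2}T^{-1/2}+R_{\theta}^{2}N^{-1/2}+R_{\theta}^{5/2}m_{Q}^{-1/4}=\mathcal{O}(K^{-1/2}). Since k\leq K-1, each of the three displayed terms is then \mathcal{O}(K^{-1}), so \sum_{k=0}^{K-1}(\varepsilon_{k}+\varepsilon_{k}^{\prime})=\mathcal{O}(1) and the right-hand side of (33) is \mathcal{O}\left(\frac{\upsilon^{2}(\log\lvert\mathcal{A}\rvert+1)+M}{(1-\gamma)\upsilon\sqrt{K}}\right), consistent with the \mathcal{O}(1/\sqrt{K}) rate of Theorem 4.1.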