
Scalable and Sample Efficient Distributed Policy Gradient Algorithms in Multi-Agent Networked Systems

Xin Liu
ShanghaiTech University
[email protected]
   Honghao Wei
University of Michigan, Ann Arbor
[email protected]
   Lei Ying
University of Michigan, Ann Arbor
[email protected]
Abstract

This paper studies a class of multi-agent reinforcement learning (MARL) problems where the reward that an agent receives depends on the states of other agents, but the next state only depends on the agent's own current state and action. We name it REC-MARL, standing for REward-Coupled Multi-Agent Reinforcement Learning. REC-MARL has a range of important applications such as real-time access control and distributed power control in wireless networks. This paper presents a distributed policy gradient algorithm for REC-MARL. The proposed algorithm is distributed in two aspects: (i) the learned policy is a distributed policy that maps a local state of an agent to its local action, and (ii) the learning/training is distributed, during which each agent updates its policy based on its own and its neighbors' information. The algorithm converges to a stationary policy, and its iteration complexity bounds depend only on the dimensions of the local state and action spaces. Experimental results for real-time access control and power control in wireless networks show that our policy significantly outperforms state-of-the-art algorithms and well-known benchmarks.

1 Introduction

Multi-Agent Reinforcement Learning, or MARL for short, considers a reinforcement learning problem where multiple agents interact with each other and with the environment to finish a common task or achieve a common objective. The key difference between MARL and single-agent RL is that each agent in MARL only observes a subset of the state, receives an individual reward, and does not have global information. Examples include multi-access networks, where each user senses collisions locally but the users need to coordinate with each other to maximize the network throughput; power control in wireless networks, where each node can only measure its local interference but the nodes must collectively decide the transmit power levels to maximize a network-wide utility; task offloading in edge computing, where each user observes its local quality of service (QoS) but the users coordinate the tasks offloaded to edge servers to maintain a high QoS for the network; and teams of robots, where each robot can only sense its own surrounding environment but the team needs to cooperate and coordinate to accomplish a rescue task.

MARL raises two fundamental aspects that are different from single-agent RL. The first is that the policy space (the set of feasible policies) is restricted to local policies, since each agent has to decide an action based only on the information available to it. The second is that while distributed learning, e.g., distributed policy gradient, is desired, it is impossible in general MARL because an agent does not know the global state and rewards. One popular approach to address the second issue is the centralized-learning and distributed-execution paradigm [8], where data samples are collected by a central entity, which uses data from all agents to learn a local policy for each agent. Therefore, the learning occurs at the central entity, but the learned policy is a local policy.

In this paper, we consider a special class of MARL, which we call REward-Coupled Multi-Agent Reinforcement Learning (REC-MARL). In REC-MARL, the problem is coupled through the reward functions. More specifically, the reward function of agent n depends on agent n's state and action and its neighbors' states and actions. However, the next state of agent n only depends on agent n's current state and action, and is independent of other agents' states and actions. In contrast, in general MARL, the transition kernels of the agents are coupled, so the next state of an agent depends on other agents' states and actions.

While REC-MARL is more restrictive than MARL, a number of applications of MARL are actually REC-MARL. In wireless networks, where agents are network nodes or devices, and the state of an agent could be its queue length or transmit power, the next state only depends on the agent's current state and its current action (see Section 2 for a detailed description). For REC-MARL, this paper proposes distributed policy gradient algorithms and establishes their local convergence (i.e., the algorithms achieve a stationary policy). The main contributions are summarized as follows.

  • We establish a perfect decomposition of value functions and policy gradients in Lemmas 1 and 2, respectively, where we show that the global value functions (policy gradients) can be written as a summation of local value functions (policy gradients). This decomposition significantly reduces the complexity of the value functions and motivates our distributed multi-agent policy gradient algorithms.

  • We propose a Temporal-Difference (TD) learning based regularized distributed multi-agent policy gradient algorithm, named TD-RDAC, in Algorithm 2. We prove in Theorem 3 that TD-RDAC achieves local convergence at rate \tilde{O}\left(NS_{\max}A_{\max}/\sqrt{T}\right), which depends only on the maximum sizes of the local state and action spaces, instead of the sizes of the global state and action spaces.

  • We apply TD-RDAC to the applications of real-time access control and power control in ad hoc wireless networks, both of which are well-known and challenging network resource management problems. Our experiments show that TD-RDAC significantly outperforms the state-of-the-art algorithm [20] and well-known benchmarks [2, 21] in these two applications.

Related Work

MARL for networking: MARL has been applied to various networking applications (e.g., content caching [25], packet routing [23], video transcoding and transmission [3], and transportation network control [6]). The work [25] proposed a multi-agent advantage actor-critic algorithm for large-scale video caching at the network edge and reduced content access latency and traffic cost. The proposed algorithm requires knowledge of neighbors' policies, which is usually not observable. [23] studied the packet routing problem in wide-area networks (WANs) and applied MARL to minimize the average packet completion time. The method adopted a dynamic consensus algorithm to estimate the global reward, which incurs a heavy communication overhead. The work [3] proposed a multi-agent actor-critic algorithm for crowd-sourced livecast services and improved the average throughput and transmission delay performance. The proposed algorithm requires a centralized controller to maintain the global state information. The most related work is [6], which utilized a spatial discount factor to decouple the correlation of distant agents and designed networked MARL solutions for traffic signal control and cooperative cruise control. However, it requires a dedicated communication module to maintain a local estimate of the global state, which again incurs a large communication cost. Moreover, the theoretical performance of the algorithms in [25, 23, 3, 6] is not investigated.

Provable MARL: There has been a great amount of work addressing the issues of scalability and sample complexity in MARL. A popular paradigm is centralized learning and distributed execution (see, e.g., [8]), where agents share a centralized critic but have individual actors. Both the critic and the actors are trained at a central server, but the actors are local policies that lead to a distributed implementation. [29] proposed a decentralized actor-critic algorithm for a cooperative scenario, where each agent has its individual critic and actor with a shared global state. The proposed algorithm converges to a stationary point of the objective function. Motivated by [29], [4] and [10] provided finite-time analyses of decentralized actor-critic algorithms for infinite-horizon discounted-reward MDPs and average-reward MDPs, respectively. [30] studied a policy gradient algorithm for multi-agent stochastic games, where each agent maximizes its reward by taking actions independently based on the global state. It established local convergence for general stochastic games and global convergence for Markov potential games. The centralized critic or shared states (even with decentralized actors) require collecting global information and centralized training. The work [19] exploits the network structure to develop a localized or scalable actor-critic (SAC) framework that is proved to achieve an O(\gamma^{k+1})-approximation of a stationary point in \gamma-discounted-reward MDPs. [20] and [13] extended the SAC framework in [19] to the settings of infinite-horizon average-reward MDPs and stochastic/time-varying communication graphs. The SAC framework in [19] is efficient in implementing the paradigm of distributed learning and distributed execution. It is also worth mentioning that in a recent work [31], the authors studied a kernel-coupled setting and established that localized policy gradient converges to a neighborhood of the globally optimal policy, where the distance to global optimality depends polynomially on the number of hops, k, and could be a constant when k is small. Another line of research in MARL that addresses the scalability issue is the mean-field approximation (MFA) [9, 7, 27], where agents are homogeneous and an individual agent interacts or plays games with the approximated "mean" behavior. However, the MFA approach is only applicable to a homogeneous system. Finally, [16] and [26] studied weakly-coupled MDPs, where individual MDPs are independent and coupled through constraints, instead of through coupled rewards as in our setting.

Different from these existing works, this paper considers reward-coupled multi-agent reinforcement learning (REC-MARL), establishes the local convergence of policy gradient algorithms with both distributed learning and distributed execution, and demonstrates their efficiency on classical and challenging resource management problems.

2 Model

We consider a multi-agent system where the agents are connected by an interaction graph \mathcal{G}=(\mathcal{I},\mathcal{E}) with \mathcal{I} and \mathcal{E} being the set of nodes and edges, respectively. The system consists of N=|\mathcal{I}| agents who are connected by the edges in \mathcal{E}. Each agent n\in\mathcal{I} is associated with a \gamma-discounted Markov decision process (\mathcal{S}_{n},\mathcal{A}_{n},r_{n},\mathcal{P}_{n},\gamma), where \mathcal{S}_{n} is the state space, \mathcal{A}_{n} is the action space, r_{n} is the reward function, and \mathcal{P}_{n} is the transition kernel. Define the neighborhood of agent n (including itself) to be \mathcal{N}(n). Define the states and actions of the neighbors of agent n to be s_{\mathcal{N}(n)} and a_{\mathcal{N}(n)}, respectively. We next formally define REward-Coupled Multi-Agent Reinforcement Learning.

REC-MARL: The reward of agent n depends on its neighbors' states and actions: r_{n}(s_{\mathcal{N}(n)},a_{\mathcal{N}(n)}). The transition kernel of agent n, \mathcal{P}_{n}(\cdot|s_{n},a_{n}), only depends on its own state s_{n} and action a_{n}. Agent n uses a local policy \pi_{n}:\mathcal{S}_{n}\to\mathcal{A}_{n}, where \pi_{n}(a_{n}|s_{n}) is the probability of taking action a_{n} in state s_{n}.

Given a REC-MARL problem, the global state space is \mathcal{S}=\mathcal{S}_{1}\times\mathcal{S}_{2}\times\cdots\times\mathcal{S}_{N} with S_{\max}=\max_{n}|\mathcal{S}_{n}|; the global action space is \mathcal{A}=\mathcal{A}_{1}\times\mathcal{A}_{2}\times\cdots\times\mathcal{A}_{N} with A_{\max}=\max_{n}|\mathcal{A}_{n}|; the global reward function is r(s,a)=\frac{1}{N}\sum_{n=1}^{N}r_{n}(s_{\mathcal{N}(n)},a_{\mathcal{N}(n)}); and the transition kernel is \mathcal{P}(s^{\prime}|s,a),\forall s,s^{\prime}\in\mathcal{S},\forall a\in\mathcal{A}, with \mathcal{P}(s^{\prime}|s,a)=\prod_{n=1}^{N}\mathcal{P}_{n}(s^{\prime}_{n}|s_{n},a_{n}).

In this paper, we study the softmax policy parameterized by \theta_{n}:\mathcal{S}_{n}\times\mathcal{A}_{n}\to\mathbb{R}, that is,

\pi_{\theta_{n}}(a_{n}|s_{n})=\frac{e^{\theta_{n}(a_{n},s_{n})}}{\sum_{a^{\prime}_{n}\in\mathcal{A}_{n}}{e^{\theta_{n}(a^{\prime}_{n},s_{n})}}},\quad\forall n.

The global policy \pi_{\theta}:\mathcal{S}\times\mathcal{A}\to\mathbb{R} with \theta=[\theta_{1},\theta_{2},\cdots,\theta_{N}] is as follows:

\pi_{\theta}(a|s)=\prod_{n=1}^{N}\pi_{\theta_{n}}(a_{n}|s_{n}).
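To make the parameterization concrete, the following is a minimal sketch of the tabular softmax policy above and of the product form of the global policy. The table shapes, random parameters, and function names are illustrative assumptions, not part of the paper.

```python
import numpy as np

def local_policy(theta_n: np.ndarray, s_n: int) -> np.ndarray:
    """pi_{theta_n}(.|s_n): softmax over the row theta_n[s_n, :] of the local table."""
    logits = theta_n[s_n] - theta_n[s_n].max()   # subtract max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

def global_policy_prob(thetas, s, a) -> float:
    """pi_theta(a|s) = prod_n pi_{theta_n}(a_n|s_n) for a joint state s and joint action a."""
    p = 1.0
    for theta_n, s_n, a_n in zip(thetas, s, a):
        p *= local_policy(theta_n, s_n)[a_n]
    return p

# usage: two agents with 3 local states and 2 local actions each
rng = np.random.default_rng(0)
thetas = [rng.normal(size=(3, 2)) for _ in range(2)]
print(global_policy_prob(thetas, s=(1, 2), a=(0, 1)))
```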

In this paper, we study \gamma-discounted infinite-horizon Markov decision processes (MDPs) with (s(t),a(t)) being the state and action of the MDP at time t. For a global policy \pi_{\theta}, its value function, action value function (Q-function), and advantage function given the initial state and action (s(0),a(0)) are defined below:

V^{\pi_{\theta}}(s) := \mathbb{E}\left[\sum_{t=0}^{\infty}\gamma^{t}r(s(t),a(t))\Big{|}s(0)=s\right],
Q^{\pi_{\theta}}(s,a) := \mathbb{E}\left[\sum_{t=0}^{\infty}\gamma^{t}r(s(t),a(t))\Big{|}s(0)=s,a(0)=a\right],
A^{\pi_{\theta}}(s,a) := Q^{\pi_{\theta}}(s,a)-V^{\pi_{\theta}}(s).

Let \rho be the initial state distribution. Define V^{\pi_{\theta}}(\rho)=\mathbb{E}_{s\sim\rho}[V^{\pi_{\theta}}(s)]. Further define the discounted occupancy measure to be

d^{\pi_{\theta}}_{\rho}(s)=(1-\gamma)\sum_{t=0}^{\infty}\gamma^{t}\mathcal{P}(s(t)=s|\rho,\pi_{\theta}). \quad (1)
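As a small illustration of the value function definition above, the sketch below estimates V^{\pi}(s) by truncated Monte Carlo rollouts on a toy single-agent 2-state MDP and compares it with the exact Bellman solution V=(I-\gamma P_{\pi})^{-1}r_{\pi} (the matrix form also used in Appendix B). The transition matrix, reward table, and policy are illustrative assumptions.

```python
import numpy as np

GAMMA = 0.9
P = np.array([                                  # P[s, a, s']: toy transition kernel
    [[0.8, 0.2], [0.1, 0.9]],                   # from state 0
    [[0.5, 0.5], [0.6, 0.4]],                   # from state 1
])
R = np.array([[1.0, 0.0], [0.5, 2.0]])          # R[s, a]: toy reward table
pi = np.array([[0.7, 0.3], [0.4, 0.6]])         # pi[s, a]: a fixed stochastic policy

def mc_value(s0, episodes=2000, horizon=100, seed=0):
    """Monte Carlo estimate of V^pi(s0) with a truncated infinite horizon."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(episodes):
        s, disc, ret = s0, 1.0, 0.0
        for _ in range(horizon):
            a = rng.choice(2, p=pi[s])
            ret += disc * R[s, a]
            s = rng.choice(2, p=P[s, a])
            disc *= GAMMA
        total += ret
    return total / episodes

# Exact V^pi from the Bellman equation, for comparison.
P_pi = np.einsum('sa,sat->st', pi, P)           # state transition matrix under pi
r_pi = np.einsum('sa,sa->s', pi, R)             # expected one-step reward under pi
V = np.linalg.solve(np.eye(2) - GAMMA * P_pi, r_pi)
print(mc_value(0), V[0])
```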

We seek policies with distributed learning and distributed execution in this paper. We will decompose the global Q-function as a sum of local Q-functions so that the agents can collectively optimize the local policies. Before presenting the details of our solution, we introduce two representative applications in wireless networks: real-time access control and distributed power control.

Real-time access control in wireless networks: Consider a wireless access network with N nodes (agents) and M access points as shown in Figure 1(a). At the beginning of each time slot, a packet arrives at node n with probability w_{n}. The packet is associated with deadline d, and q_{m} is the probability of a successful transmission to access point m. A packet is removed if either it is sent to an access point (not necessarily delivered successfully) or it expires. At each time, each node decides whether to transmit a packet to one of the access points or to keep silent. A collision happens if multiple nodes send packets to the same access point simultaneously. Therefore, the throughput of a node depends not only on its own decision but also on its neighboring nodes' decisions. In particular, the throughput of node n is

r_{n}(a_{\mathcal{N}(n)})=\sum_{m\in\mathcal{AP}(n)}a_{n,m}\prod_{k\neq n,k\in\mathcal{N}(n)}(1-a_{k,m})q_{m}, \quad (2)

where a_{n,m}\in\{0,1\} indicates whether node n transmits a packet to access point m\in\mathcal{AP}(n). The goal of real-time access control is to maximize the network throughput.

The problem of real-time access control is challenging because i) the throughput functions are non-convex and highly coupled functions of other nodes' decisions and ii) the system parameters (e.g., w_{n} and q_{m}) are unknown. This problem can be formulated as a REC-MARL problem. In particular, for node/agent n, the MDP formulation is (Q_{n},a_{n},r_{n}(a_{\mathcal{N}(n)}),Q^{\prime}_{n}): the state of agent n is the queue length Q_{n}=\{Q_{n}^{l}\}_{l} with Q_{n}^{l} being the number of packets with remaining time l; the action is a_{n,m} with m\in\mathcal{AP}(n) (one of the access points of agent n); r_{n}(a_{\mathcal{N}(n)}) is the reward for agent n; and Q^{\prime}_{n} is its next queue state.
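The following is a small sketch of the throughput reward in Eq. (2), assuming the joint action is given as a 0/1 matrix A[n, m] (node n transmits to access point m), q[m] is the success probability of access point m, ap_of[n] lists the access points reachable from node n, and neighbors[n] lists the nodes in \mathcal{N}(n); all variable names and numbers are illustrative.

```python
import numpy as np

def throughput(n, A, q, neighbors, ap_of):
    """r_n(a_{N(n)}) = sum_{m in AP(n)} a_{n,m} * prod_{k != n, k in N(n)} (1 - a_{k,m}) * q[m]."""
    r = 0.0
    for m in ap_of[n]:                                         # access points reachable from node n
        no_collision = np.prod([1 - A[k, m] for k in neighbors[n] if k != n])
        r += A[n, m] * no_collision * q[m]
    return r

# usage: 3 nodes, 2 access points; nodes 0 and 2 collide on access point 0
A = np.array([[1, 0], [0, 1], [1, 0]])
q = [0.9, 0.95]
ap_of = {0: [0], 1: [0, 1], 2: [0]}
neighbors = {0: [0, 1, 2], 1: [0, 1, 2], 2: [0, 1, 2]}
print([throughput(n, A, q, neighbors, ap_of) for n in range(3)])   # [0.0, 0.95, 0.0]
```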

Distributed power control in wireless networks: The other application of REC-MARL is distributed power control in wireless networks. Consider a network with N wireless links (agents) as shown in Figure 1(b). Each link can control its transmission rate by adapting its power level, and neighboring links interfere with each other. Therefore, the transmission rate of a link depends not only on its own transmit power but also on its neighboring links' transmit power. In particular, the reward/utility of link n is its transmission rate minus a linear power cost:

\log\left(1+\frac{p_{n}G_{n,n}}{\sum_{m\neq n,m\in\mathcal{N}(n)}p_{m}G_{m,n}+\sigma_{n}}\right)-u_{n}p_{n}, \quad (3)

where G_{m,n} is the channel gain from node m to node n, p_{n} is the transmit power of node n, \sigma_{n} is the noise power at user n, and u_{n} is a trade-off parameter. The goal of distributed power control is to maximize the total reward of the network.

Distributed power control is also challenging because the reward function is non-convex and highly coupled, and the channel conditions are unknown and dynamic. This problem can also be formulated as a REC-MARL problem. In particular, for link/agent n, the MDP formulation is (p_{n},a_{n},r_{n}(p_{\mathcal{N}(n)}),p^{\prime}_{n}): the state of agent n is the power level p_{n}\in\mathbb{Z}^{+} with p_{n}\leq p_{\max} (p_{\max} is the maximum power constraint); the action is a_{n}\in\{0,-1,+1\} (keep, decrease, or increase the power level by one); r_{n}(p_{\mathcal{N}(n)}) is the reward for agent n; and p^{\prime}_{n} is the next power level.
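The sketch below illustrates the per-link utility in Eq. (3) and the power-level transition described above. The channel gains, noise powers, prices, and neighborhood map are illustrative placeholders chosen for the example, not the paper's experimental values.

```python
import numpy as np

def link_reward(n, p, G, sigma, u, neighbors):
    """log(1 + SINR_n) - u_n * p_n, with interference only from neighbors of link n."""
    interference = sum(p[m] * G[m, n] for m in neighbors[n] if m != n)
    sinr = p[n] * G[n, n] / (interference + sigma[n])
    return np.log(1.0 + sinr) - u[n] * p[n]

def next_power(p_n, a_n, p_max):
    """State transition: a_n in {0, -1, +1} keeps/decreases/increases the power level."""
    return int(np.clip(p_n + a_n, 0, p_max))

# usage: 3 links on a line, each interfering only with its adjacent links
G = np.array([[1.0, 0.1, 0.0], [0.1, 1.0, 0.1], [0.0, 0.1, 1.0]])
p = np.array([3, 2, 4])
sigma, u = [0.1] * 3, [0.1] * 3
neighbors = {0: [0, 1], 1: [0, 1, 2], 2: [1, 2]}
print([round(link_reward(n, p, G, sigma, u, neighbors), 3) for n in range(3)])
print(next_power(p_n=3, a_n=+1, p_max=5))
```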

In the following, we present distributed multi-agent reinforcement learning algorithms that can be used to solve these two problems and show that our algorithm outperforms the existing benchmarks.

Figure 1: Applications of REC-MARL in Wireless Networks: (a) real-time access control; (b) distributed power control.

3 Decomposition of Value and Policy Gradient Functions

To implement the paradigm of distributed learning and distributed execution, we first establish the decomposition of the value and policy gradient functions in Lemma 1 and Lemma 2, respectively. These functions can be represented locally and computed/estimated by exchanging information with neighbors.

3.1 Decomposition of value functions

We decompose the global value (Q) function into individual value (Q) functions and show in Lemma 1 that each local function depends only on the agent's neighborhood. The proof can be found in Appendix A.1.

Lemma 1

Given a multi-agent network, where each agent n is associated with a \gamma-discounted Markov decision process defined by (\mathcal{S}_{n},\mathcal{A}_{n},r_{n},\mathcal{P}_{n},\gamma), the neighborhood of agent n is defined by \mathcal{N}(n), the network reward function is r(s,a)=\frac{1}{N}\sum_{n=1}^{N}r_{n}(s_{\mathcal{N}(n)},a_{\mathcal{N}(n)}), and the transition kernel is \mathcal{P}(s^{\prime}|s,a)=\prod_{n=1}^{N}\mathcal{P}_{n}(s^{\prime}_{n}|s_{n},a_{n}). The policy of the network is \pi_{\theta}:\mathcal{S}\to\mathcal{A} with each agent n using a local policy \pi_{\theta_{n}}:\mathcal{S}_{n}\to\mathcal{A}_{n}. We have

V^{\pi_{\theta}}(s)=\frac{1}{N}\sum_{n=1}^{N}V^{\pi_{\theta}}_{n}(s_{\mathcal{N}(n)}),~{}~{}~{}Q^{\pi_{\theta}}(s,a)=\frac{1}{N}\sum_{n=1}^{N}Q^{\pi_{\theta}}_{n}(s_{\mathcal{N}(n)},a_{\mathcal{N}(n)}).
Remark 1

We have decomposed the global value functions exactly into the local value functions of individual agents. The decomposition is “perfect”, which is different from the exponential decay property of Lemma 3 in [19] or the polynomial decay property of Proposition 2 in [31].
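A toy numerical check of Lemma 1 can be done by exact linear solves. The sketch below, under illustrative assumptions (2 agents that are neighbors of each other, 2 local states and 2 local actions each, random local kernels, uniform local policies, and random coupled rewards), verifies that the value of the averaged reward equals the average of the local value functions; all numbers are placeholders, not from the paper.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
GAMMA, N, S_loc, A_loc = 0.9, 2, 2, 2

def random_kernel():
    P = rng.random((S_loc, A_loc, S_loc))
    return P / P.sum(axis=-1, keepdims=True)

P_loc = [random_kernel() for _ in range(N)]                        # P_n(s_n'|s_n,a_n)
pi_loc = [np.full((S_loc, A_loc), 1.0 / A_loc) for _ in range(N)]  # uniform local policies
r_loc = [rng.random((S_loc, S_loc, A_loc, A_loc)) for _ in range(N)]  # r_n(s1,s2,a1,a2)

states = list(itertools.product(range(S_loc), repeat=N))
actions = list(itertools.product(range(A_loc), repeat=N))

def policy_averaged(reward_n):
    """E_{a~pi}[reward_n(s,a)] as a vector over joint states."""
    out = np.zeros(len(states))
    for i, s in enumerate(states):
        for a in actions:
            pa = np.prod([pi_loc[n][s[n], a[n]] for n in range(N)])
            out[i] += pa * reward_n[s[0], s[1], a[0], a[1]]
    return out

# Joint transition matrix under the product policy.
P_pi = np.zeros((len(states), len(states)))
for i, s in enumerate(states):
    for a in actions:
        pa = np.prod([pi_loc[n][s[n], a[n]] for n in range(N)])
        for j, sp in enumerate(states):
            P_pi[i, j] += pa * np.prod([P_loc[n][s[n], a[n], sp[n]] for n in range(N)])

M = np.linalg.inv(np.eye(len(states)) - GAMMA * P_pi)
V_n = [M @ policy_averaged(r_loc[n]) for n in range(N)]            # local value functions
V_global = M @ policy_averaged(sum(r_loc) / N)                     # value of r = (1/N) sum_n r_n
print(np.allclose(V_global, sum(V_n) / N))                         # True: the decomposition holds
```

With only 2 agents the neighborhoods cover both agents, so the check exercises the additive split of Lemma 1 rather than the locality of each V_n.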

3.2 Decomposition of Policy Gradient

We show that the policy gradient can also be decomposed exactly, as with the Q-function. Recall the definitions of V^{\pi_{\theta}}(\rho) and d^{\pi_{\theta}}_{\rho}; we first state the classical policy gradient theorem of [24].

Theorem 1 ([24])

Let \pi_{\theta} be a policy parameterized by \theta for a \gamma-discounted Markov decision process. The policy gradient is

\nabla_{\theta}V^{\pi_{\theta}}(\rho)= \frac{1}{1-\gamma}\mathbb{E}_{s\sim d^{\pi_{\theta}}_{\rho},a\sim\pi_{\theta}(\cdot|s)}\left[Q^{\pi_{\theta}}(s,a)\nabla_{\theta}\log\pi_{\theta}(a|s)\right].

Theorem 1 has motivated policy gradient methods in the single-agent setting (e.g., [12]) and the multi-agent setting (e.g., [14]). However, it cannot be applied directly to a large-scale multi-agent network due to the large global state space. Therefore, we decompose the policy gradient in the following lemma. The proof can be found in Appendix A.2.

Lemma 2

Given a multi-agent network, where each agent n is associated with a \gamma-discounted Markov decision process defined by (\mathcal{S}_{n},\mathcal{A}_{n},r_{n},\mathcal{P}_{n},\gamma), the neighborhood of agent n is defined by \mathcal{N}(n), the network reward function is r(s,a)=\frac{1}{N}\sum_{n=1}^{N}r_{n}(s_{\mathcal{N}(n)},a_{\mathcal{N}(n)}), and the transition kernel is \mathcal{P}(s^{\prime}|s,a)=\prod_{n=1}^{N}\mathcal{P}_{n}(s^{\prime}_{n}|s_{n},a_{n}). The policy of the network is \pi_{\theta}:\mathcal{S}\to\mathcal{A} with each agent n using a local policy \pi_{\theta_{n}}:\mathcal{S}_{n}\to\mathcal{A}_{n}. We have

\nabla_{\theta_{n}}V^{\pi_{\theta}}(\rho)= \frac{1}{1-\gamma}\mathbb{E}_{s\sim d^{\pi_{\theta}}_{\rho},a\sim\pi_{\theta}(\cdot|s)}\left[\frac{1}{N}\sum_{k\in\mathcal{N}(n)}Q^{\pi_{\theta}}_{k}(s_{\mathcal{N}(k)},a_{\mathcal{N}(k)})\nabla_{\theta_{n}}\log\pi_{\theta_{n}}(a_{n}|s_{n})\right]
= \frac{1}{1-\gamma}\mathbb{E}_{s\sim d^{\pi_{\theta}}_{\rho},a\sim\pi_{\theta}(\cdot|s)}\left[\frac{1}{N}\sum_{k\in\mathcal{N}(n)}A^{\pi_{\theta}}_{k}(s_{\mathcal{N}(k)},a_{\mathcal{N}(k)})\nabla_{\theta_{n}}\log\pi_{\theta_{n}}(a_{n}|s_{n})\right].
Remark 2

Lemma 2 implies that the policy gradient of agent n can be computed/estimated by exchanging the quantities Q^{\pi_{\theta}}_{k}(s_{\mathcal{N}(k)},a_{\mathcal{N}(k)}) or A^{\pi_{\theta}}_{k}(s_{\mathcal{N}(k)},a_{\mathcal{N}(k)}),\forall k\in\mathcal{N}(n), with its neighbors. This decomposition is the key to implementing the efficient paradigm of distributed learning and distributed execution and motivates the algorithms in the following sections.

4 Distributed Multi-Agent Policy Gradient Algorithms

In this section, we introduce distributed policy gradient (DPG) based algorithms for the multi-agent system. Before introducing the algorithms, we assume that the initial state distribution \mu is strictly positive:

Assumption 1

The initial state distribution \mu satisfies \min_{s_{n}}\mu(s_{n})>0 for any agent n.

This assumption ensures sufficient exploration of the state space for each agent and is common in the literature [1, 15]. We first propose a regularized distributed policy gradient algorithm assuming access to an inexact gradient, and then introduce TD-learning methods to estimate the gradient.

4.1 Regularized Distributed Policy Gradient Algorithm with Inexact Gradient

Assuming access to an estimated gradient \nabla\hat{V}^{\pi_{\theta^{t}}}(\rho) at any time t, we study the relative entropy regularized objective for the network, as in [1, 28]:

L_{\lambda}(\theta)=V^{\pi_{\theta}}(\rho)+\sum_{n=1}^{N}\frac{\lambda}{|\mathcal{S}_{n}||\mathcal{A}_{n}|}\sum_{s_{n},a_{n}}\log\pi_{\theta_{n}}(a_{n}|s_{n}),

where \lambda is a positive constant and \log\pi_{\theta_{n}}(a_{n}|s_{n}) is the regularizer that prevents the probability of taking action a_{n} at state s_{n} from approaching zero for any agent n. With the regularized value function L_{\lambda}(\theta), we present a DPG algorithm with the inexact gradient in Algorithm 1, where the inexact gradient is usually estimated with TD-learning methods (see Section 4.2).

Algorithm 1 Distributed Multi-Agent Policy Gradient with Inexact Gradient
1:  Initialization: parameters \eta, \lambda, and \theta^{0}_{n},\forall n\in\mathcal{I}.
2:  for t=1,\dots,T do
3:     Estimate Policy Gradient: \nabla_{\theta_{n}}\hat{L}_{\lambda}(\theta^{t}) for each agent n\in\mathcal{I}.
4:     Update: \theta_{n}^{t+1}=\theta_{n}^{t}+\eta\nabla_{\theta_{n}}\hat{L}_{\lambda}(\theta^{t}).
5:  end for
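As a minimal sketch of Algorithm 1's update rule, the loop below assumes each agent exposes an (inexact) gradient oracle `grad_hat(n, thetas)` for \nabla_{\theta_n}\hat{L}_{\lambda}(\theta^t); the oracle itself is a hypothetical placeholder (e.g., the TD-based estimator of Section 4.2) and is not specified here.

```python
import numpy as np

def distributed_pg(grad_hat, thetas, eta=0.05, T=100):
    """Run T rounds of Algorithm 1; every agent n updates only its own table theta_n."""
    for _ in range(T):
        grads = [grad_hat(n, thetas) for n in range(len(thetas))]  # line 3: estimate gradients
        for n, g in enumerate(grads):                              # line 4: local ascent step
            thetas[n] = thetas[n] + eta * g
    return thetas
```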

By analyzing the dynamics of \nabla L^{\pi_{\theta^{t}}}(\rho) in Algorithm 1, we can establish an upper bound on the cumulative \mathbb{E}\left[||\nabla L^{\pi_{\theta^{t}}}(\rho)||^{2}\right] in Theorem 2. The proof can be found in Appendix B.

Theorem 2

Let \beta^{\prime}=\frac{48N^{2}}{(1-\gamma)^{3}}+\sum_{n}\frac{2\lambda}{|\mathcal{S}_{n}|} and \epsilon_{t}=\mathbb{E}\left[||\nabla\hat{L}^{\pi_{\theta^{t}}}(\rho)-\nabla L^{\pi_{\theta^{t}}}(\rho)||^{2}\right]. Under Algorithm 1 with learning rate \eta\leq 1/\beta^{\prime}, we have

\sum_{t=1}^{T}\mathbb{E}\left[||\nabla L^{\pi_{\theta^{t}}}(\rho)||^{2}\right]\leq\sum_{t=1}^{T}\epsilon_{t}+2\beta^{\prime}\left(L^{*}(\rho)-L^{\pi^{1}}(\rho)\right).

Theorem 2 implies a local convergence result for Algorithm 1 given a reasonably good gradient estimator, e.g., \sum_{t=1}^{T}\epsilon_{t}=O(\sqrt{T}), which is related to the quality of the estimated value functions according to Lemma 2. Next, we utilize TD-learning to estimate the value functions, provide an actor-critic type of algorithm, and establish its local convergence.

4.2 Regularized Distributed Actor Critic Algorithm

Motivated by [19], we utilize TD-learning to estimate the gradient \nabla_{\theta_{n}}\hat{V}^{\pi_{\theta}}(\rho) according to Lemma 2 and provide an actor-critic type of algorithm (TD-RDAC) for the multi-agent setting in Algorithm 2. From Lemma 2, we have

\nabla_{\theta_{n}}L^{\pi_{\theta}}(\rho)=\nabla_{\theta_{n}}V^{\pi_{\theta}}(\rho)+\frac{\lambda}{|\mathcal{S}_{n}||\mathcal{A}_{n}|}\sum_{s_{n},a_{n}}\nabla_{\theta_{n}}\log\pi_{\theta_{n}}(a_{n}|s_{n})
=\frac{1}{1-\gamma}\mathbb{E}_{s\sim d^{\pi_{\theta}}_{\rho},a\sim\pi_{\theta}(\cdot|s)}\left[\frac{1}{N}\sum_{k\in\mathcal{N}(n)}A^{\pi_{\theta}}_{k}(s_{\mathcal{N}(k)},a_{\mathcal{N}(k)})\nabla_{\theta_{n}}\log\pi_{\theta_{n}}(a_{n}|s_{n})\right]+\frac{\lambda}{|\mathcal{S}_{n}|}\left(\frac{1}{|\mathcal{A}_{n}|}-\pi_{\theta_{n}}\right).

For each agent n, estimating the gradient \nabla_{\theta_{n}}\hat{V}^{\pi_{\theta}}(\rho) or \nabla_{\theta_{n}}\hat{L}^{\pi_{\theta}}(\rho) requires aggregating the advantage functions (or their estimators) A^{\pi_{\theta}}_{k}(s_{\mathcal{N}(k)},a_{\mathcal{N}(k)}) from its neighbors k\in\mathcal{N}(n). The actor-critic algorithm is summarized in Algorithm 2. It contains T outer loops and H inner loops. Lines 4 to 8 perform TD learning to estimate the value function of each individual agent. Line 10 computes the TD-error based on the learned value function of each agent. Line 11 estimates the policy gradient by aggregating the neighbors' TD-errors. Line 12 updates the policy parameters. Lines 3 to 13 are repeated for T rounds.

Algorithm 2 Regularized Distributed Actor-Critic Algorithm
1:  Initialization: \theta^{0}, \lambda, and \eta.
2:  for t=1,\dots,T do
3:     Initialize \hat{V}_{0}^{\pi_{\theta^{t}}}(\cdot)=0 and sample a state s^{0}\sim\rho.
4:     for h=1,\dots,H do
5:        for n=1,\dots,N do
6:           Update the value function of each agent at step h:
\hat{V}^{\pi_{\theta^{t}},h+1}_{n}(s_{\mathcal{N}(n)}(h)) =(1-\alpha)\hat{V}^{\pi_{\theta^{t}},h}_{n}(s_{\mathcal{N}(n)}(h))+\alpha\left(r_{n}^{h}(s_{\mathcal{N}(n)}(h),a_{\mathcal{N}(n)}(h))+\gamma\hat{V}^{\pi_{\theta^{t}},h}_{n}(s_{\mathcal{N}(n)}(h+1))\right),
7:        end for
8:     end for
9:     for n=1,\dots,N do
10:        Compute the TD-error at each step h:
\delta^{\pi_{\theta^{t}}}_{n}(s_{\mathcal{N}(n)}(h),a_{\mathcal{N}(n)}(h)) =r_{n}^{h}(s_{\mathcal{N}(n)}(h),a_{\mathcal{N}(n)}(h))+\gamma\hat{V}^{\pi_{\theta^{t}},H}_{n}(s_{\mathcal{N}(n)}(h+1))-\hat{V}^{\pi_{\theta^{t}},H}_{n}(s_{\mathcal{N}(n)}(h)),
11:        Estimate the policy gradient by aggregating the neighbors' TD-errors:
\nabla_{\theta_{n}}\hat{V}^{\pi_{\theta^{t}}}(\rho) =\frac{1}{1-\gamma}\sum_{h=1}^{H}\gamma^{h}\frac{1}{N}\sum_{k\in\mathcal{N}(n)}\hat{\delta}^{\pi_{\theta^{t}}}_{k}(s_{\mathcal{N}(k)}(h),a_{\mathcal{N}(k)}(h))\nabla_{\theta_{n}}\log\pi_{\theta_{n}}(a_{n}(h)|s_{n}(h)),
12:        Update: \theta_{n}^{t+1}=\theta_{n}^{t}+\eta\nabla_{\theta_{n}}\hat{V}^{\pi_{\theta^{t}}}(\rho)+\frac{\lambda}{|\mathcal{S}_{n}|}\left(\frac{1}{|\mathcal{A}_{n}|}-\pi_{\theta_{n}}\right).
13:     end for
14:  end for
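The following is a condensed sketch of one outer iteration (lines 3 to 13) of Algorithm 2 in a tabular setting. The environment interface `step(s, a, rng) -> (rewards, s_next)`, the initial-state sampler `sample_s0`, the neighborhood map, and the dictionary-based tables standing in for \hat{V}_n are assumptions made for illustration only; the inner step index starts at 0 here.

```python
from collections import defaultdict
import numpy as np

def softmax_probs(theta_n, s_n):
    z = theta_n[s_n] - theta_n[s_n].max()
    p = np.exp(z)
    return p / p.sum()

def td_rdac_round(thetas, step, sample_s0, neighbors, n_actions,
                  gamma=0.9, alpha=0.1, eta=0.05, lam=0.01, H=200, seed=0):
    rng = np.random.default_rng(seed)
    N = len(thetas)
    V = [defaultdict(float) for _ in range(N)]           # line 3: initialize local critics to 0
    s = sample_s0(rng)
    traj = []
    for h in range(H):                                   # lines 4-8: TD learning for each agent
        a = tuple(rng.choice(n_actions, p=softmax_probs(thetas[n], s[n])) for n in range(N))
        rewards, s_next = step(s, a, rng)
        for n in range(N):
            key = tuple(s[k] for k in neighbors[n])
            key_next = tuple(s_next[k] for k in neighbors[n])
            V[n][key] = (1 - alpha) * V[n][key] + alpha * (rewards[n] + gamma * V[n][key_next])
        traj.append((s, a, rewards, s_next))
        s = s_next
    for n in range(N):                                   # lines 9-13: actor update
        grad = np.zeros_like(thetas[n])
        for h, (s_h, a_h, rewards, s_hp1) in enumerate(traj):
            agg = 0.0
            for k in neighbors[n]:                       # line 11: aggregate neighbors' TD-errors
                key = tuple(s_h[j] for j in neighbors[k])
                key_next = tuple(s_hp1[j] for j in neighbors[k])
                agg += rewards[k] + gamma * V[k][key_next] - V[k][key]   # line 10: TD-error
            probs = softmax_probs(thetas[n], s_h[n])
            glog = -probs
            glog[a_h[n]] += 1.0                          # grad of log softmax w.r.t. theta_n[s_n, :]
            grad[s_h[n]] += (gamma ** h) * (agg / N) * glog / (1 - gamma)
        probs_all = np.array([softmax_probs(thetas[n], sn) for sn in range(thetas[n].shape[0])])
        reg = (1.0 / thetas[n].shape[1] - probs_all) * lam / thetas[n].shape[0]
        thetas[n] = thetas[n] + eta * grad + reg         # line 12: regularized parameter update
    return thetas
```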

Intuitively, Algorithm 2 requires a large number of inner iterations H to guarantee the convergence of the value function estimates so that the estimated policy gradient is accurate. Before proving the local convergence of Algorithm 2, we introduce Assumption 2, a common assumption (e.g., [22, 18]) on the mixing time of the Markov decision process.

Assumption 2

Assume \{s_{n}(h)\}_{h},\forall n, is a Markov chain with geometric mixing time under any policy \pi_{\theta_{n}}, i.e., there exists a constant K such that for any h\geq K\log(1/\epsilon), the total variation distance between \mathcal{P}(z^{h}_{n}|z^{0}_{n}) and \pi_{\theta_{n}}(z^{h}_{n}) satisfies

\|\mathcal{P}(z^{h}_{n}|z^{0}_{n})-\pi(z^{h}_{n})\|_{TV}\leq\epsilon,\quad\forall n.

To avoid the complicated conditions on T and H, we present the order-wise results in Theorem 3. The proofs and the detailed requirements on T and H can be found in Appendix C.

Theorem 3

Suppose T is large and H=O(T). Let the learning rate be \eta=O(1/\sqrt{T}). Then Algorithm 2 satisfies

\min_{1\leq t\leq T}\mathbb{E}[||\nabla L^{\pi_{\theta^{t}}}(\rho)||^{2}]\leq O\left(\frac{NS_{\max}A_{\max}}{(1-\gamma)^{4}}\sqrt{\frac{\log T}{T}}\right).

Theorem 3 shows that Algorithm 2 returns a stationary policy for large T. Moreover, the bound only depends on the local dimensions of the states and actions, i.e., S_{\max} and A_{\max}, which means our algorithm is scalable and can be applied to large-scale networks.

5 Experiments

In this section, we evaluate TD-RDAC on the real-time access control and distributed power control problems in wireless networks described in Section 2.

Figure 2: Wireless network topology for real-time access control (left-hand side) and power control (right-hand side).

Real-time access control in wireless networks: We considered the reward/utility of node n defined in (2). We compared Algorithm 2 with the ALOHA algorithm [21] and the scalable actor-critic (SAC) algorithm in [20]. The classical distributed solution to this problem is the localized ALOHA algorithm [21], where node n sends the earliest packet with a certain probability, and the packet is sent to a random access point m\in\mathcal{AP}(n) in its available set, with probability proportional to q_{m} and inversely proportional to the number of nodes that can potentially transmit to access point m. Note that to compare with the ALOHA algorithm [21] and SAC in [20], we apply Algorithm 2 in a slightly different setting where a packet is removed from its queue if either it is delivered successfully or it expires.

In our experiments, we considered two types of network topology: a line network and a grid network, as shown on the left-hand side of Figure 2. The tabular method is used to represent value functions since the size of the table is tractable in this experiment.

The line network has 6 nodes and 5 access points, as shown in the upper left of Figure 2. We considered two settings in the line network with the same arrival probabilities but distinct successful transmission probabilities, representing reliable/unreliable environments. Specifically, the arrival probabilities are w=[0.5,0.3,0.5,0.5,0.3,0.5]; the transmission probabilities of the reliable/unreliable environments are q=[0.9,0.95,0.9,0.95,0.9] and q=[0.5,0.6,0.7,0.6,0.5], respectively. The deadline of packets is d=2. We ran 9 different instances with different random seeds and the shaded region represents the 95% confidence interval. We observe that our algorithm increases steadily in Figures 3(a) and 3(b) and outperforms both the SAC algorithm in [20] and the ALOHA algorithm in the reliable and unreliable environments, where the converged rewards (our algorithm vs. SAC vs. ALOHA) are (0.814 vs. 0.735 vs. 0.334) and (0.503 vs. 0.464 vs. 0.222) for the reliable and unreliable environments, respectively.

The grid network is similar to that in [20], shown in the lower left of Figure 2. The network has 36 nodes and 25 access points. The arrival probability w_{n} of node n and the successful transmission probability q_{m} of access point m are generated uniformly at random from [0,1]. The deadline of packets is again set to d=2. We observe that our algorithm again increases steadily in Figure 3(c), even in this relatively dense environment, and outperforms both the SAC algorithm in [20] and the ALOHA algorithm [21], where the converged rewards (our algorithm vs. SAC vs. ALOHA) are (0.393 vs. 0.369 vs. 0.138).

Figure 3: Performance comparison in real-time access networks: (a) reliable line access network; (b) unreliable line access network; (c) grid access network.
Figure 4: Performance comparison in distributed power control: (a) 6-link network; (b) 9-link network; (c) 25-link network.

Distributed power control in wireless networks: We considered the reward/utility of each agent to be the difference between a logarithmic function of the signal-to-interference-plus-noise ratio (SINR) and a linear pricing function proportional to the transmitted power, as in (3). In the literature, the problem has been formulated as a linear pricing game [2, 5]. We compared Algorithm 2 with the DPC algorithm [2] and SAC in [20]. In our experiments, we first studied a 6-link setting (a 3\times 2 grid) where the tabular method is tractable for representing value functions. Then we studied two settings, 9-link and 25-link, as shown on the right-hand side of Figure 2, where blue circles denote links. For both settings, we use neural network (NN) approximation for the value functions. With NN approximation, we utilized a replay buffer [17] and double Q-learning [11] to stabilize the learning process.

For the reward/utility in (3), we set G_{n,n}=1 and G_{m,n}=\kappa/({dist}_{m,n})^{2} with fading factor \kappa=0.1, where {dist}_{m,n} is the distance between nodes m and n. The average noise power is \sigma_{n}=0.1. We set the trade-off parameter u_{n}=0.1,\forall n. For each environment, we ran 9 different instances with different random seeds and the shaded region represents the 95% confidence interval.

For the 6-link network, we represent the value functions with tables in Algorithm 2. We observe in Figure 4(a) that the learning processes are quite stable and that our algorithm outperforms the SAC and DPC algorithms as the training time increases: the converged rewards are (17.60 vs. 17.43 vs. 16.77). For the 9-link network, we approximate the value functions with neural networks in Algorithm 2. We observe in Figure 4(b) that Algorithm 2 outperforms the SAC and DPC algorithms as the training time increases (28.92 vs. 28.76 vs. 27.64). For the 25-link network, Figure 4(c) shows the results with a heterogeneous topology. Again, we observe that our algorithm increases steadily in Figure 4(c) and outperforms the SAC and DPC algorithms (98.35 vs. 95.70 vs. 95.34).

Finally, we note that both the ALOHA and DPC algorithms are model-based and the agents need to know the system parameters (the successful transmission probabilities in the real-time access control problem, and the channel gains and interference levels in the power control problem), while our algorithm is model-free and does not need these parameters.

6 Conclusion

In this paper, we studied a weakly coupled multi-agent reinforcement learning problem where the agents' reward functions are coupled but the transition kernels are not. We proposed a TD-learning based regularized distributed actor-critic algorithm (TD-RDAC) and established its local convergence. We demonstrated the effectiveness of TD-RDAC on two important applications in wireless networks, real-time access control and distributed power control, and showed that our algorithm outperforms the existing benchmarks in the literature.

Acknowledgements

The authors are very grateful to Rui Hu for his insightful discussion and comments.

References

  • [1] Alekh Agarwal, Sham M Kakade, Jason D Lee, and Gaurav Mahajan. Optimality and approximation with policy gradient methods in markov decision processes. In Proceedings of Thirty Third Conference on Learning Theory, 2020.
  • [2] Tansu Alpcan, Tamer Başar, R. Srikant, and Eitan Altman. CDMA uplink power control as a non-cooperative game. Proceedings of the IEEE Conference on Decision and Control, 2001.
  • [3] Xingyan Chen, Changqiao Xu, Mu Wang, Zhonghui Wu, Shujie Yang, Lujie Zhong, and Gabriel-Miro Muntean. A universal transcoding and transmission method for livecast with networked multi-agent reinforcement learning. In IEEE Conference on Computer Communications, 2021.
  • [4] Ziyi Chen, Yi Zhou, Rong-Rong Chen, and Shaofeng Zou. Sample and communication-efficient decentralized actor-critic algorithms with finite-time analysis. In Proceedings of the 39th International Conference on Machine Learning, 2022.
  • [5] Mung Chiang, Prashanth Hande, Tian Lan, and Chee Wei Tan. Power control in wireless cellular networks. Foundations and Trends® in Networking, 2008.
  • [6] Tianshu Chu, Sandeep Chinchali, and Sachin Katti. Multi-agent reinforcement learning for networked system control. In International Conference on Learning Representations, 2020.
  • [7] Romuald Elie, Pérolat Julien, Mathieu Laurière, Matthieu Geist, and Olivier Pietquin. On the convergence of model free learning in mean field games. AAAI Conference on Artificial Intelligence, 2020.
  • [8] Jakob Foerster, Gregory Farquhar, Triantafyllos Afouras, Nantas Nardelli, and Shimon Whiteson. Counterfactual multi-agent policy gradients. AAAI Conference on Artificial Intelligence, 2018.
  • [9] Xin Guo, Anran Hu, Renyuan Xu, and Junzi Zhang. Learning mean-field games. In Advances in Neural Information Processing Systems, 2019.
  • [10] FNU Hairi, Jia Liu, and Songtao Lu. Finite-time convergence and sample complexity of multi-agent actor-critic reinforcement learning with average reward. In The Tenth International Conference on Learning Representations, 2022.
  • [11] Hado van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double q-learning. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, 2016.
  • [12] Vijay Konda and John Tsitsiklis. Actor-critic algorithms. In S. Solla, T. Leen, and K. Müller, editors, Advances in Neural Information Processing Systems, volume 12. MIT Press, 2000.
  • [13] Yiheng Lin, Guannan Qu, Longbo Huang, and Adam Wierman. Multi-agent reinforcement learning in stochastic networked systems. In Advances in Neural Information Processing Systems, 2021.
  • [14] Ryan Lowe, Yi Wu, Aviv Tamar, Jean Harb, Pieter Abbeel, and Igor Mordatch. Multi-agent actor-critic for mixed cooperative-competitive environments. In Proceedings of the 31st International Conference on Neural Information Processing Systems, 2017.
  • [15] Jincheng Mei, Chenjun Xiao, Csaba Szepesvari, and Dale Schuurmans. On the global convergence rates of softmax policy gradient methods. In Proceedings of the 37th International Conference on Machine Learning, 2020.
  • [16] Nicolas Meuleau, Milos Hauskrecht, Kee-Eung Kim, Leonid Peshkin, Leslie Pack Kaelbling, Thomas Dean, and Craig Boutilier. Solving very large weakly coupled markov decision processes. American Association for Artificial Intelligence, 1998.
  • [17] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing atari with deep reinforcement learning. In International Conference on Neural Information Processing Systems, 2013.
  • [18] Guannan Qu, Yiheng Lin, Adam Wierman, and Na Li. Scalable multi-agent reinforcement learning for networked systems with average reward. arXiv preprint arXiv:2006.06626, 2020.
  • [19] Guannan Qu, Adam Wierman, and Na Li. Scalable reinforcement learning of localized policies for multi-agent networked systems. In Proceedings of the 2nd Conference on Learning for Dynamics and Control, 2020.
  • [20] Guannan Qu, Adam Wierman, and Na Li. Scalable reinforcement learning for multiagent networked systems. Operations Research, 2022.
  • [21] Lawrence G. Roberts. Aloha packet system with and without slots and capture. SIGCOMM Comput. Commun. Rev., 1975.
  • [22] R. Srikant and Lei Ying. Finite-time error bounds for linear stochastic approximation and TD learning. In Proceedings of the 22nd Annual Conference on Learning Theory (COLT), 2019.
  • [23] Shan Sun, Mariam Kiran, and Wei Ren. MAMRL: Exploiting multi-agent meta reinforcement learning in WAN traffic engineering. arXiv preprint arXiv:2111.15087, 2021.
  • [24] Richard S. Sutton, David McAllester, Satinder Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Proceedings of the 12th International Conference on Neural Information Processing Systems, 1999.
  • [25] Fangxin Wang, Feng Wang, Jiangchuan Liu, Ryan Shea, and Lifeng Sun. Intelligent video caching at network edge: A multi-agent deep reinforcement learning approach. In IEEE Conference on Computer Communications, 2020.
  • [26] Xiaohan Wei, Hao Yu, and Michael J. Neely. Online learning in weakly coupled markov decision processes: A convergence time study. Proc. ACM Meas. Anal. Comput. Syst., April 2018.
  • [27] Qiaomin Xie, Zhuoran Yang, Zhaoran Wang, and Andreea Minca. Learning while playing in mean-field games: Convergence and optimality. In Proceedings of the 38th International Conference on Machine Learning, 2021.
  • [28] Junzi Zhang, Jongho Kim, Brendan O’Donoghue, and Stephen Boyd. Sample efficient reinforcement learning with reinforce. In Proceedings of the AAAI Conference on Artificial Intelligence, 2021.
  • [29] Kaiqing Zhang, Zhuoran Yang, Han Liu, Tong Zhang, and Tamer Basar. Fully decentralized multi-agent reinforcement learning with networked agents. In Proceedings of the 35th International Conference on Machine Learning, 2018.
  • [30] Runyu Zhang, Zhaolin Ren, and Na Li. Gradient play in multi-agent markov stochastic games: Stationary points and convergence. arXiv preprint arXiv:2106.00198, 2021.
  • [31] Yizhou Zhang, Guannan Qu, Pan Xu, Yiheng Lin, Zaiwei Chen, and Adam Wierman. Global convergence of localized policy iteration in networked multi-agent reinforcement learning. arXiv preprint arXiv:2211.17116, November 2022.

Appendix A Decomposition of Value Functions

A.1 Proof of Lemma 1

Proof 1

We provide the proof of the decomposition for the Q-function; a similar argument holds for the value function V^{\pi_{\theta}}(s). According to the definitions of the reward and the Q-function, we have

Q^{\pi_{\theta}}(s,a)= \frac{1}{N}\sum_{n=1}^{N}\mathbb{E}\left[\sum_{t=1}^{\infty}\gamma^{t}r_{n}(s_{\mathcal{N}(n)}^{t},a_{\mathcal{N}(n)}^{t})\Big{|}s^{0}=s,a^{0}=a\right]
= \frac{1}{N}\sum_{n=1}^{N}\mathbb{E}\left[\sum_{t=1}^{\infty}\gamma^{t}r_{n}(s_{\mathcal{N}(n)}^{t},a_{\mathcal{N}(n)}^{t})\Big{|}s_{\mathcal{N}(n)}^{0}=s_{\mathcal{N}(n)},a_{\mathcal{N}(n)}^{0}=a_{\mathcal{N}(n)}\right]
= \frac{1}{N}\sum_{n=1}^{N}Q^{\pi_{\theta}}_{n}(s_{\mathcal{N}(n)},a_{\mathcal{N}(n)}),

where the second equality holds because the reward and transitions of agent n only depend on its neighbors' states and actions, since for any trajectory of agent n up to time t^{\prime} we have

\mathcal{P}(s^{0},a^{0},\cdots,s_{\mathcal{N}(n)}^{t^{\prime}},a_{\mathcal{N}(n)}^{t^{\prime}}) =\prod_{t=1}^{t^{\prime}}\prod_{k\in\mathcal{N}(n)}{\pi_{\theta_{k}}(a_{k}^{t}|s_{k}^{t})\mathcal{P}(s_{k}^{t}|s_{k}^{t-1},a_{k}^{t-1})}
=\prod_{k\in\mathcal{N}(n)}\prod_{t=1}^{t^{\prime}}{\pi_{\theta_{k}}(a_{k}^{t}|s_{k}^{t})\mathcal{P}(s_{k}^{t}|s_{k}^{t-1},a_{k}^{t-1})}
=\mathcal{P}(s_{\mathcal{N}(n)}^{0},a_{\mathcal{N}(n)}^{0},\cdots,s_{\mathcal{N}(n)}^{t^{\prime}},a_{\mathcal{N}(n)}^{t^{\prime}}),

which implies that \mathcal{P}(s_{\mathcal{N}(n)}^{t^{\prime}},a_{\mathcal{N}(n)}^{t^{\prime}}) only depends on (s_{\mathcal{N}(n)}^{0},a_{\mathcal{N}(n)}^{0}).

A.2 Proof of Lemma 2

Proof 2

According to Theorem 1, we have

\nabla_{\theta_{n}}V^{\pi_{\theta}}(\rho)= \frac{1}{1-\gamma}\mathbb{E}_{s\sim d^{\pi_{\theta}}_{\rho},a\sim\pi_{\theta}(\cdot|s)}\left[Q^{\pi_{\theta}}(s,a)\nabla_{\theta_{n}}\log\pi_{\theta}(a|s)\right]
= \frac{1}{1-\gamma}\mathbb{E}_{s\sim d^{\pi_{\theta}}_{\rho},a\sim\pi_{\theta}(\cdot|s)}\left[Q^{\pi_{\theta}}(s,a)\nabla_{\theta_{n}}\log\pi_{\theta_{n}}(a_{n}|s_{n})\right]
= \frac{1}{1-\gamma}\mathbb{E}_{s\sim d^{\pi_{\theta}}_{\rho},a\sim\pi_{\theta}(\cdot|s)}\left[\frac{1}{N}\sum_{n=1}^{N}Q^{\pi_{\theta}}(s_{\mathcal{N}(n)},a_{\mathcal{N}(n)})\nabla_{\theta_{n}}\log\pi_{\theta_{n}}(a_{n}|s_{n})\right]
= \frac{1}{1-\gamma}\mathbb{E}_{s\sim d^{\pi_{\theta}}_{\rho},a\sim\pi_{\theta}(\cdot|s)}\left[\frac{1}{N}\sum_{k\in\mathcal{N}(n)}Q^{\pi_{\theta}}(s_{\mathcal{N}(k)},a_{\mathcal{N}(k)})\nabla_{\theta_{n}}\log\pi_{\theta_{n}}(a_{n}|s_{n})\right],

where the second equality holds because of the localized policies \pi_{\theta}(a|s)=\prod_{n}\pi_{\theta_{n}}(a_{n}|s_{n}); the third equality holds because of Lemma 1; and the last equality holds because for any k\notin\mathcal{N}(n),

\mathbb{E}_{s\sim d^{\pi_{\theta}}_{\rho},a\sim\pi_{\theta}(\cdot|s)}\left[Q^{\pi_{\theta}}(s_{\mathcal{N}(k)},a_{\mathcal{N}(k)})\nabla_{\theta_{n}}\log\pi_{\theta_{n}}(a_{n}|s_{n})\right]
= \mathbb{E}_{s\sim d^{\pi_{\theta}}_{\rho}}\left[\sum_{a}Q^{\pi_{\theta}}(s_{\mathcal{N}(k)},a_{\mathcal{N}(k)})\prod_{j\neq n}\pi_{\theta_{j}}(a_{j}|s_{j})\nabla_{\theta_{n}}\pi_{\theta_{n}}(a_{n}|s_{n})\right]
= \mathbb{E}_{s\sim d^{\pi_{\theta}}_{\rho}}\left[\sum_{a_{-n}}Q^{\pi_{\theta}}(s_{\mathcal{N}(k)},a_{\mathcal{N}(k)})\prod_{j\neq n}\pi_{\theta_{j}}(a_{j}|s_{j})\sum_{a_{n}}\nabla_{\theta_{n}}\pi_{\theta_{n}}(a_{n}|s_{n})\right]
= 0.

Note that \nabla_{\theta_{n}}V^{\pi_{\theta}}(\rho)=\frac{1}{1-\gamma}\mathbb{E}_{s\sim d^{\pi_{\theta}}_{\rho},a\sim\pi_{\theta}(\cdot|s)}\left[A^{\pi_{\theta}}(s,a)\nabla_{\theta_{n}}\log\pi_{\theta}(a|s)\right], and the second equality for the advantage function holds by following the same steps.

Appendix B Proof of Theorem 2

To prove Theorem 2, we first prove the smoothness of the regularized value function L.

B.1 Smoothness of the L function

Recall the definition of the L function:

L_{\lambda}(\theta)=V^{\pi_{\theta}}(\rho)+R(\theta),

where R(\theta)=\sum_{n=1}^{N}\frac{\lambda}{|\mathcal{S}_{n}||\mathcal{A}_{n}|}\sum_{s_{n},a_{n}}\log\pi_{\theta_{n}}(a_{n}|s_{n}). We show that V^{\pi_{\theta}}(\rho) is \frac{48N^{2}}{(1-\gamma)^{3}}-smooth in Lemma 3 and R(\theta) is \sum_{n}\frac{2\lambda}{|\mathcal{S}_{n}|}-smooth in Lemma 4, respectively, which together imply that L^{\pi_{\theta}}(s) is \beta^{\prime}:=\frac{48N^{2}}{(1-\gamma)^{3}}+\sum_{n}\frac{2\lambda}{|\mathcal{S}_{n}|}-smooth.

Lemma 3 (Smoothness)

Under Algorithm 1, V^{\pi_{\theta}}(s) is 48N^{2}/(1-\gamma)^{3}-smooth, i.e.,

\left|V^{\pi_{\theta}}(s)-V^{\pi_{\theta^{\prime}}}(s)-\langle\frac{\partial V^{\pi_{\theta}}(s)}{\partial\theta},\theta-\theta^{\prime}\rangle\right|\leq\frac{24N^{2}}{(1-\gamma)^{3}}||\theta-\theta^{\prime}||^{2}.
Proof 3

The Bellman equation (for a network) is

V^{\pi_{\theta}}(s)=\sum_{a}\pi_{\theta}(a|s)r(s,a)+\gamma\sum_{a}\pi_{\theta}(a|s)\sum_{s^{\prime}}\mathcal{P}(s^{\prime}|s,a)V^{\pi_{\theta}}(s^{\prime}).

Let

r_{\theta}(s)=\sum_{a}\pi_{\theta}(a|s)r(s,a)~{}~{}\text{and}~{}~{}\mathcal{P}_{\theta}(s,s^{\prime})=\sum_{a}\pi_{\theta}(a|s)\mathcal{P}(s^{\prime}|s,a).

Let u_{s} be the vector whose s-th entry is one and all other entries are zero. The Bellman equation can then be written as

V^{\pi_{\theta}}(s)= r_{\theta}(s)+\gamma\langle\mathcal{P}_{\theta}(s,\cdot),V^{\pi_{\theta}}(\cdot)\rangle,

which implies that V^{\pi_{\theta}}\in\mathbb{R}^{|\mathcal{S}|} satisfies

V^{\pi_{\theta}}=(I-\gamma\mathcal{P}(\theta))^{-1}r_{\theta}:=M_{\theta}r_{\theta},

where I is the identity matrix. Let \theta_{\alpha}=\theta+\alpha\nu, and we have \nu^{\dagger}\frac{\partial^{2}V^{\pi_{\theta}}(s)}{\partial\theta^{2}}\nu=\frac{\partial^{2}V^{\pi_{\theta_{\alpha}}}(s)}{\partial\alpha^{2}}|_{\alpha=0}. With a slight abuse of notation, let r_{\alpha}=r_{\theta_{\alpha}}, \mathcal{P}_{\alpha}=\mathcal{P}_{\theta_{\alpha}}, \pi_{\alpha}=\pi_{\theta_{\alpha}}, and M_{\alpha}=M_{\theta_{\alpha}}. We need to compute the first-order gradient

\frac{\partial V^{\pi_{\theta_{\alpha}}}(s)}{\partial\alpha}= \gamma u_{s}^{T}\frac{\partial M_{\alpha}}{\partial\alpha}r_{\alpha}+u_{s}^{T}M_{\alpha}\frac{\partial r_{\alpha}}{\partial\alpha}
= \gamma u_{s}^{T}M_{\alpha}\frac{\partial\mathcal{P}_{\alpha}}{\partial\alpha}M_{\alpha}r_{\alpha}+u_{s}^{T}M_{\alpha}\frac{\partial r_{\alpha}}{\partial\alpha},

where we use the derivative of the inverse matrix \frac{\partial(I-\gamma\mathcal{P}(\theta))^{-1}}{\partial\alpha}=(I-\gamma\mathcal{P}(\theta))^{-1}\frac{\partial(I-\gamma\mathcal{P}(\theta))}{\partial\alpha}(I-\gamma\mathcal{P}(\theta))^{-1}; and the second-order gradient

\frac{\partial^{2}V^{\pi_{\theta_{\alpha}}}(s)}{\partial\alpha^{2}}= 2\gamma^{2}u_{s}^{T}M_{\alpha}\frac{\partial\mathcal{P}_{\alpha}}{\partial\alpha}M_{\alpha}\frac{\partial\mathcal{P}_{\alpha}}{\partial\alpha}M_{\alpha}r_{\alpha}+\gamma u_{s}^{T}M_{\alpha}\frac{\partial^{2}\mathcal{P}_{\alpha}}{\partial\alpha^{2}}M_{\alpha}r_{\alpha}
+\gamma u_{s}^{T}M_{\alpha}\frac{\partial\mathcal{P}_{\alpha}}{\partial\alpha}M_{\alpha}\frac{\partial r_{\alpha}}{\partial\alpha}+u_{s}^{T}M_{\alpha}\frac{\partial^{2}r_{\alpha}}{\partial\alpha^{2}}.

Therefore, we need to establish the upper bounds on

\frac{\partial\mathcal{P}_{\alpha}}{\partial\alpha},~{}~{}\frac{\partial^{2}\mathcal{P}_{\alpha}}{\partial\alpha^{2}},~{}~{}\frac{\partial r_{\alpha}}{\partial\alpha},~{}~{}\frac{\partial^{2}r_{\alpha}}{\partial\alpha^{2}},

which requires computing

\frac{\partial\pi_{\alpha}}{\partial\alpha},~{}~{}\frac{\partial^{2}\pi_{\alpha}}{\partial\alpha^{2}}.

Recall that \pi_{\theta}(a|s)=\prod_{n=1}^{N}\pi_{\theta_{n}}(a_{n}|s_{n}). Let us compute

\frac{\partial\pi_{\alpha}(a|s)}{\partial\alpha}\Big{|}_{\alpha=0} =\langle\frac{\partial\pi_{\theta}(a|s)}{\partial\theta(s,\cdot)},\nu(s,\cdot)\rangle=\sum_{n=1}^{N}\langle\frac{\partial\pi_{\theta}(a|s)}{\partial\theta_{n}(s_{n},\cdot)},\nu_{n}(s_{n},\cdot)\rangle
=\sum_{n=1}^{N}\prod_{k\neq n}\pi_{\theta_{k}}(a_{k}|s_{k})~{}\langle\frac{\partial\pi_{\theta_{n}}(a_{n}|s_{n})}{\partial\theta_{n}(s_{n},\cdot)},\nu_{n}(s_{n},\cdot)\rangle
=\sum_{n=1}^{N}\pi_{\theta}(a|s)\left(\nu_{n}(s_{n},a_{n})-\langle\pi_{\theta_{n}}(\cdot|s_{n}),\nu_{n}(s_{n},\cdot)\rangle\right),

where the last equality holds because

\frac{\partial\pi_{\theta_{n}}(a_{n}|s_{n})}{\partial\theta_{n}(s_{n},a^{\prime}_{n})}=\pi_{\theta_{n}}(a_{n}|s_{n})(\mathbb{I}_{a_{n}=a^{\prime}_{n}}-\pi_{\theta_{n}}(a^{\prime}_{n}|s_{n})),

and it implies

\langle\frac{\partial\pi_{\theta_{n}}(a_{n}|s_{n})}{\partial\theta_{n}(s_{n},\cdot)},\nu_{n}(s_{n},\cdot)\rangle=\nu_{n}(s_{n},a_{n})-\langle\pi_{\theta_{n}}(\cdot|s_{n}),\nu_{n}(s_{n},\cdot)\rangle.

Therefore, we have

\sum_{a}\left|\frac{\partial\pi_{\alpha}(a|s)}{\partial\alpha}\Big{|}_{\alpha=0}\right| =\sum_{a}\left|\sum_{n=1}^{N}\pi_{\theta}(a|s)\left(\nu_{n}(s_{n},a_{n})-\langle\pi_{\theta_{n}}(\cdot|s_{n}),\nu_{n}(s_{n},\cdot)\rangle\right)\right|
\leq 2\max_{a}\sum_{n=1}^{N}|\nu_{n}(s_{n},a_{n})|
\leq 2N||\nu||_{2}.

Next, let’s compute

\frac{\partial^{2}\pi_{\alpha}(a|s)}{\partial\alpha^{2}}\Big{|}_{\alpha=0} =\sum_{n=1}^{N}\sum_{n^{\prime}=1}^{N}\langle\frac{\partial^{2}\pi_{\theta}(a|s)}{\partial\theta_{n}(s_{n},\cdot)\partial\theta_{n^{\prime}}(s_{n^{\prime}},\cdot)}\nu_{n}(s_{n},\cdot),\nu_{n^{\prime}}(s_{n^{\prime}},\cdot)\rangle,

where we need to compute

\left(\frac{\partial^{2}\pi_{\theta}(a|s)}{\partial\theta_{n}(s_{n},\cdot)^{2}}\right)_{i,j}= \mathbb{I}_{a_{n}=i}\pi_{\theta}(a|s)\left(\mathbb{I}_{a_{n}=j}-\pi_{\theta_{n}}(j|s_{n})\right)-\pi_{\theta}(a|s)\pi_{\theta_{n}}(i|s_{n})\left(\mathbb{I}_{i=j}-\pi_{\theta_{n}}(j|s_{n})\right)
-\pi_{\theta}(a|s)\pi_{\theta_{n}}(i|s_{n})\left(\mathbb{I}_{a_{n}=j}-\pi_{\theta_{n}}(j|s_{n})\right),

and

(2πθ(a|s)θn(sn,)θn(sn,))i,j=\displaystyle\left(\frac{\partial^{2}\pi_{\theta}(a|s)}{\partial\theta_{n}(s_{n},\cdot)\theta_{n^{\prime}}(s_{n^{\prime}},\cdot)}\right)_{i,j}= πθ(a|s)(𝕀an=iπθn(i|sn))(𝕀an=jπθn(j|sn)).\displaystyle\pi_{\theta}(a|s)\left(\mathbb{I}_{a_{n}=i}-\pi_{\theta_{n}}(i|s_{n})\right)\left(\mathbb{I}_{a_{n^{\prime}}=j}-\pi_{\theta_{n^{\prime}}}(j|s_{n^{\prime}})\right).

Then we have

|2πθ(a|s)θn(sn,)2νn(sn,),νn(sn,)/π(a|s)|\displaystyle\left|\langle\frac{\partial^{2}\pi_{\theta}(a|s)}{\partial\theta_{n}(s_{n},\cdot)^{2}}\nu_{n}(s_{n},\cdot),\nu_{n}(s_{n},\cdot)\rangle/\pi(a|s)\right|
\displaystyle\leq νn2(sn,an)+2|νn(sn,an)jπθn(j|sn)νn(sn,j)|\displaystyle\nu^{2}_{n}(s_{n},a_{n})+2|\nu_{n}(s_{n},a_{n})\sum_{j}\pi_{\theta_{n}}(j|s_{n})\nu_{n}(s_{n},j)|
+|iπθn(i|s)νn2(i|s)+2ijπθn(i|sn)πθn(j|sn)νn(sn,i)νn(sn,j)|\displaystyle+|\sum_{i}\pi_{\theta_{n}}(i|s)\nu^{2}_{n}(i|s)+2\sum_{i}\sum_{j}\pi_{\theta_{n}}(i|s_{n})\pi_{\theta_{n}}(j|s_{n})\nu_{n}(s_{n},i)\nu_{n^{\prime}}(s_{n^{\prime}},j)|
\displaystyle\leq 6maxanνn2(sn,an),\displaystyle 6\max_{a_{n}}\nu^{2}_{n}(s_{n},a_{n}),

and

|2πθ(a|s)θn(sn,)θn(sn,)νn(sn,),νn(sn,)/π(a|s)|\displaystyle\left|\langle\frac{\partial^{2}\pi_{\theta}(a|s)}{\partial\theta_{n}(s_{n},\cdot)\theta_{n^{\prime}}(s_{n^{\prime}},\cdot)}\nu_{n}(s_{n},\cdot),\nu_{n^{\prime}}(s_{n^{\prime}},\cdot)\rangle/\pi(a|s)\right|
\displaystyle\leq |νn(sn,an)νn(sn,an)|+|νn(sn,an)jπθn(j|sn)νn(sn,j)|\displaystyle|\nu_{n}(s_{n},a_{n})\nu_{n^{\prime}}(s_{n^{\prime}},a_{n^{\prime}})|+|\nu_{n}(s_{n},a_{n})\sum_{j}\pi_{\theta_{n^{\prime}}}(j|s_{n^{\prime}})\nu_{n^{\prime}}(s_{n^{\prime}},j)| (4)
+|νn(sn,an)iπθn(i|sn)νn(sn,i)|+ij|πθn(i|sn)πθn(j|sn)νn(sn,i)νn(sn,j)|\displaystyle+|\nu_{n^{\prime}}(s_{n^{\prime}},a_{n^{\prime}})\sum_{i}\pi_{\theta_{n}}(i|s_{n})\nu_{n}(s_{n},i)|+\sum_{i}\sum_{j}|\pi_{\theta_{n}}(i|s_{n})\pi_{\theta_{n^{\prime}}}(j|s_{n^{\prime}})\nu_{n}(s_{n},i)\nu_{n^{\prime}}(s_{n^{\prime}},j)| (5)
\displaystyle\leq 4maxsn,sn,an,an|νn(sn,an)νn(sn,an)|\displaystyle 4\max_{s_{n},s_{n^{\prime}},a_{n},a_{n^{\prime}}}|\nu_{n}(s_{n},a_{n})\nu_{n^{\prime}}(s_{n^{\prime}},a_{n^{\prime}})| (6)

which implies

|2πα(a|s)α2|α=0|6N2||ν||22\displaystyle\left|\frac{\partial^{2}\pi_{\alpha}(a|s)}{\partial\alpha^{2}}|_{\alpha=0}\right|\leq 6N^{2}||\nu||_{2}^{2} (7)

We are now ready to bound

\begin{align*}
\left|\frac{\partial^{2}V^{\pi_{\theta_{\alpha}}}(s)}{\partial\alpha^{2}}\Big|_{\alpha=0}\right|\leq{}&2\gamma^{2}\left|u_{s}^{T}M_{\alpha}\frac{\partial\mathcal{P}_{\alpha}}{\partial\alpha}M_{\alpha}\frac{\partial\mathcal{P}_{\alpha}}{\partial\alpha}M_{\alpha}r_{\alpha}\right|+\gamma\left|u_{s}^{T}M_{\alpha}\frac{\partial^{2}\mathcal{P}_{\alpha}}{\partial\alpha^{2}}M_{\alpha}r_{\alpha}\right|\\
&+\gamma\left|u_{s}^{T}M_{\alpha}\frac{\partial\mathcal{P}_{\alpha}}{\partial\alpha}M_{\alpha}\frac{\partial r_{\alpha}}{\partial\alpha}\right|+\left|u_{s}^{T}M_{\alpha}\frac{\partial^{2}r_{\alpha}}{\partial\alpha^{2}}\right|\\
\leq{}&\left[\frac{8N^{2}}{(1-\gamma)^{3}}+\frac{6N^{2}}{(1-\gamma)^{2}}+\frac{4N^{2}}{(1-\gamma)^{2}}+\frac{6N^{2}}{(1-\gamma)^{2}}\right]\|\nu\|_{2}^{2}\\
\leq{}&\frac{24N^{2}}{(1-\gamma)^{3}}\|\nu\|_{2}^{2}.
\end{align*}

Finally, we have

\[
\nu^{T}\frac{\partial^{2}V^{\pi_{\theta}}(s)}{\partial\theta^{2}}\nu\leq\frac{24N^{2}}{(1-\gamma)^{3}}\|\nu\|_{2}^{2},
\]

which completes the proof.
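The computation above relies on the resolvent $M_{\alpha}=(I-\gamma\mathcal{P}_{\alpha})^{-1},$ under which the value function admits the representation $V^{\pi}(s)=u_{s}^{T}Mr$ and each row of $M$ sums to $1/(1-\gamma),$ consistent with the $(1-\gamma)$ powers in the bound. The following sketch checks these two facts on a small random MDP; the MDP itself is an illustrative placeholder, not an example from the paper.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(1)
S, gamma = 5, 0.9

# Random state-to-state transition matrix and reward under a fixed policy
P = rng.random((S, S)); P /= P.sum(axis=1, keepdims=True)
r = rng.random(S)

# Resolvent form: V = (I - gamma P)^{-1} r, with row s picked out by u_s
M = np.linalg.inv(np.eye(S) - gamma * P)
V_resolvent = M @ r

# Cross-check with iterative Bellman evaluation V <- r + gamma P V
V = np.zeros(S)
for _ in range(2000):
    V = r + gamma * P @ V
assert np.allclose(V, V_resolvent, atol=1e-8)

# Each row of M sums to 1/(1-gamma) = 10
print(M.sum(axis=1))
\end{verbatim}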

Lemma 4

Under Algorithm 1, $R(\theta)$ is $\sum_{n}\frac{2\lambda}{|\mathcal{S}_{n}|}$-smooth.

Proof 4

For any $n$ and $s_{n},$ we have

\begin{align*}
\frac{\partial R(\theta)}{\partial\theta_{n}(s_{n},\cdot)}={}&\frac{1}{|\mathcal{S}_{n}|}\left(\frac{1}{|\mathcal{A}_{n}|}-\pi_{\theta_{n}}(\cdot|s_{n})\right),\\
\frac{\partial^{2}R(\theta)}{\partial\theta_{n}(s_{n},\cdot)^{2}}={}&\frac{1}{|\mathcal{S}_{n}|}\left(-\text{diag}(\pi_{\theta_{n}}(\cdot|s_{n}))+\pi_{\theta_{n}}(\cdot|s_{n})\pi_{\theta_{n}}(\cdot|s_{n})^{T}\right),
\end{align*}

which implies

\begin{align*}
\left\langle\frac{\partial^{2}R(\theta)}{\partial\theta_{n}(s_{n},\cdot)^{2}}\nu_{n}(s_{n},\cdot),\nu_{n}(s_{n},\cdot)\right\rangle\leq{}&\left|\langle\text{diag}(\pi_{\theta_{n}}(\cdot|s_{n}))\nu_{n}(s_{n},\cdot),\nu_{n}(s_{n},\cdot)\rangle\right|+\langle\pi_{\theta_{n}}(\cdot|s_{n}),\nu_{n}(s_{n},\cdot)\rangle^{2}\\
\leq{}&2\|\nu_{n}(s_{n},\cdot)\|^{2}.
\end{align*}

Finally, we have

\[
\left\langle\frac{\partial^{2}R(\theta)}{\partial\theta^{2}}\nu,\nu\right\rangle\leq 2\|\nu\|^{2},
\]

which completes the proof.
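The gradient and Hessian formulas in the proof above are consistent with a log-barrier-type regularizer of the form $R(\theta)=\sum_{n}\frac{1}{|\mathcal{S}_{n}|}\sum_{s_{n}}\frac{1}{|\mathcal{A}_{n}|}\sum_{a_{n}}\log\pi_{\theta_{n}}(a_{n}|s_{n});$ assuming that form (our reading, stated here as an assumption), they can be checked numerically for one agent and one state as in the sketch below.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(2)
A, Sn = 3, 5                 # |A_n| and |S_n| (illustrative sizes)
theta = rng.normal(size=A)   # logits theta_n(s_n, .)

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

# Assumed per-(agent, state) regularizer term (assumption, see lead-in):
# R_n(s_n) = (1/|S_n|) * (1/|A_n|) * sum_a log pi(a|s_n)
def R(th):
    return (1.0 / Sn) * np.mean(np.log(softmax(th)))

pi = softmax(theta)
grad_analytic = (1.0 / Sn) * (1.0 / A - pi)
hess_analytic = (1.0 / Sn) * (-np.diag(pi) + np.outer(pi, pi))

eps = 1e-5
grad_fd = np.array([(R(theta + eps * e) - R(theta - eps * e)) / (2 * eps)
                    for e in np.eye(A)])
assert np.allclose(grad_analytic, grad_fd, atol=1e-7)

# Quadratic form <H nu, nu> is (loosely) bounded by 2 ||nu||^2, as used in Lemma 4
nu = rng.normal(size=A)
print(nu @ hess_analytic @ nu, "<=", 2 * nu @ nu)
\end{verbatim}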

B.2 Proving Theorem 2

Recall that $\eta\leq 1/\beta^{\prime}$ and $D(t)=L^{*}(\rho)-L^{\pi_{\theta^{t}}}(\rho).$ Given $\theta^{t}$ and $\nabla\hat{L}^{\pi_{\theta^{t}}}(\rho),$ we have

\begin{align*}
D(t+1)-D(t)={}&L^{\pi_{\theta^{t}}}(\rho)-L^{\pi_{\theta^{t+1}}}(\rho)\\
={}&L^{\pi_{\theta^{t}}}(\rho)-L^{\pi_{\theta^{t+1}}}(\rho)+\langle\nabla L^{\pi_{\theta^{t}}}(\rho),\theta^{t+1}-\theta^{t}\rangle-\langle\nabla L^{\pi_{\theta^{t}}}(\rho),\theta^{t+1}-\theta^{t}\rangle\\
\leq{}&\frac{\beta^{\prime}}{2}\|\theta^{t+1}-\theta^{t}\|^{2}-\langle\nabla L^{\pi_{\theta^{t}}}(\rho),\theta^{t+1}-\theta^{t}\rangle\\
={}&\frac{\beta^{\prime}\eta^{2}}{2}\|\nabla\hat{L}^{\pi_{\theta^{t}}}(\rho)\|^{2}-\eta\langle\nabla L^{\pi_{\theta^{t}}}(\rho),\nabla\hat{L}^{\pi_{\theta^{t}}}(\rho)\rangle\\
={}&\frac{1}{2\beta^{\prime}}\|\nabla\hat{L}^{\pi_{\theta^{t}}}(\rho)-\nabla L^{\pi_{\theta^{t}}}(\rho)\|^{2}-\frac{1}{2\beta^{\prime}}\|\nabla L^{\pi_{\theta^{t}}}(\rho)\|^{2}.
\end{align*}

Since $\epsilon_{t}=\mathbb{E}\left[\|\nabla\hat{L}^{\pi_{\theta^{t}}}(\rho)-\nabla L^{\pi_{\theta^{t}}}(\rho)\|^{2}\right],$ we have

\[
\mathbb{E}\left[D(t+1)-D(t)\right]\leq\frac{1}{2\beta^{\prime}}\mathbb{E}\left[\|\nabla\hat{L}^{\pi_{\theta^{t}}}(\rho)-\nabla L^{\pi_{\theta^{t}}}(\rho)\|^{2}\right]-\frac{1}{2\beta^{\prime}}\mathbb{E}\left[\|\nabla L^{\pi_{\theta^{t}}}(\rho)\|^{2}\right],
\]

which implies

\begin{align*}
\frac{1}{2\beta^{\prime}}\mathbb{E}\left[\|\nabla L^{\pi_{\theta^{t}}}(\rho)\|^{2}\right]\leq{}&\frac{1}{2\beta^{\prime}}\mathbb{E}\left[\|\nabla\hat{L}^{\pi_{\theta^{t}}}(\rho)-\nabla L^{\pi_{\theta^{t}}}(\rho)\|^{2}\right]+\mathbb{E}\left[D(t)-D(t+1)\right]\\
={}&\frac{\epsilon_{t}}{2\beta^{\prime}}+\mathbb{E}\left[D(t)-D(t+1)\right].
\end{align*}

Summing over $t=1,2,\cdots,T,$ we have

\[
\sum_{t=1}^{T}\mathbb{E}\left[\|\nabla L^{\pi_{\theta^{t}}}(\rho)\|^{2}\right]\leq\sum_{t=1}^{T}\epsilon_{t}+2\beta^{\prime}\left(L^{*}(\rho)-L^{\pi^{1}}(\rho)\right).
\]
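The telescoping argument can be illustrated on a toy smooth surrogate with artificially noisy gradients; the quadratic objective, noise level, and horizon below are placeholders (not the MARL objective), chosen so that $\beta^{\prime}=1$ and $\eta=1/\beta^{\prime},$ and the script checks the final inequality numerically.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(3)
d, T = 4, 200
beta_p = 1.0                 # smoothness constant beta' of the toy objective
eta = 1.0 / beta_p

# Toy objective: L(x) = -0.5 ||x||^2, so grad L(x) = -x and L* = 0
L = lambda x: -0.5 * x @ x
x = rng.normal(size=d)       # theta^1
L1 = L(x)

sum_grad_sq, sum_eps = 0.0, 0.0
for _ in range(T):
    g = -x                                # true gradient
    g_hat = g + 0.1 * rng.normal(size=d)  # noisy gradient estimate
    sum_grad_sq += g @ g
    sum_eps += (g_hat - g) @ (g_hat - g)
    x = x + eta * g_hat                   # gradient-ascent step

# Check: sum_t ||grad L||^2 <= sum_t eps_t + 2 beta' (L* - L(theta^1))
print(sum_grad_sq, "<=", sum_eps + 2 * beta_p * (0.0 - L1))
\end{verbatim}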

Appendix C Proof of Theorem 3

Recall that the estimated gradient and the true gradient are

\begin{align*}
\nabla_{\theta_{n}}\hat{V}^{\pi_{\theta}}(\rho)={}&\sum_{h=1}^{H}\gamma^{h}\frac{1}{N}\sum_{k\in\mathcal{N}(n)}\hat{\delta}^{\pi_{\theta},h}_{k}(s_{\mathcal{N}(k)}(h),a_{\mathcal{N}(k)}(h))\nabla_{\theta_{n}}\log\pi_{\theta_{n}}(a_{n}(h)|s_{n}(h)),\\
\nabla_{\theta_{n}}V^{\pi_{\theta}}(\rho)={}&\sum_{h=1}^{\infty}\gamma^{h}\mathbb{E}\left[\frac{1}{N}\sum_{k\in\mathcal{N}(n)}\delta^{\pi_{\theta},h}_{k}(s_{\mathcal{N}(k)}(h),a_{\mathcal{N}(k)}(h))\nabla_{\theta_{n}}\log\pi_{\theta_{n}}(a_{n}(h)|s_{n}(h))\right].
\end{align*}

Further define $\nabla_{\theta_{n}}\tilde{V}^{\pi_{\theta}}(\rho)$ and $\nabla_{\theta_{n}}\bar{V}^{\pi_{\theta}}(\rho)$ to decompose the policy gradient as follows:

\begin{align*}
\nabla_{\theta_{n}}\tilde{V}^{\pi_{\theta}}(\rho)={}&\sum_{h=1}^{H}\gamma^{h}\frac{1}{N}\sum_{k\in\mathcal{N}(n)}\delta^{\pi_{\theta},h}_{k}(s_{\mathcal{N}(k)}(h),a_{\mathcal{N}(k)}(h))\nabla_{\theta_{n}}\log\pi_{\theta_{n}}(a_{n}(h)|s_{n}(h)),\\
\nabla_{\theta_{n}}\bar{V}^{\pi_{\theta}}(\rho)={}&\sum_{h=1}^{H}\gamma^{h}\mathbb{E}\left[\frac{1}{N}\sum_{k\in\mathcal{N}(n)}\delta^{\pi_{\theta},h}_{k}(s_{\mathcal{N}(k)}(h),a_{\mathcal{N}(k)}(h))\nabla_{\theta_{n}}\log\pi_{\theta_{n}}(a_{n}(h)|s_{n}(h))\right].
\end{align*}

The error is decomposed as

\begin{align*}
&\nabla_{\theta_{n}}\hat{L}^{\pi_{\theta}}(\rho)-\nabla_{\theta_{n}}{L}^{\pi_{\theta}}(\rho)\\
={}&\underbrace{\nabla_{\theta_{n}}\hat{L}^{\pi_{\theta}}(\rho)-\nabla_{\theta_{n}}\tilde{L}^{\pi_{\theta}}(\rho)}_{e_{1,n}}+\underbrace{\nabla_{\theta_{n}}\tilde{L}^{\pi_{\theta}}(\rho)-\nabla_{\theta_{n}}\bar{L}^{\pi_{\theta}}(\rho)}_{e_{2,n}}+\underbrace{\nabla_{\theta_{n}}\bar{L}^{\pi_{\theta}}(\rho)-\nabla_{\theta_{n}}{L}^{\pi_{\theta}}(\rho)}_{e_{3,n}},
\end{align*}

where the error $e_{1,n}$ captures the gap between the estimated TD-error and the true TD-error under policy $\pi_{\theta};$ the error $e_{2,n}$ captures the gap between the sample-based TD-error estimate and its expectation; and the error $e_{3,n}$ captures the truncation error from estimating the discounted state visitation with the finite horizon $H.$
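To make the decomposition concrete, the sketch below assembles the truncated local estimator $\nabla_{\theta_{n}}\hat{V}^{\pi_{\theta}}(\rho)$ for one agent from a single trajectory, using the tabular softmax score function; the neighborhood, trajectory, and TD-error values are random placeholders standing in for the quantities produced by Algorithm 2.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(4)
N, H, gamma = 6, 20, 0.95
Sn, An = 3, 2                        # local state/action sizes of agent n

def softmax_rows(theta):
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

theta_n = rng.normal(size=(Sn, An))  # tabular softmax parameters of agent n
neighbors = [0, 2, 5]                # placeholder neighborhood N(n)

# Placeholder local trajectory and neighbors' estimated TD errors
s_traj = rng.integers(Sn, size=H + 1)
a_traj = rng.integers(An, size=H + 1)
delta_hat = rng.normal(size=(N, H + 1))   # hat{delta}_k^h for all agents k

pi_n = softmax_rows(theta_n)
grad_n = np.zeros((Sn, An))
for h in range(1, H + 1):
    s, a = s_traj[h], a_traj[h]
    # grad_{theta_n} log pi_n(a|s): indicator structure of the softmax score
    score = np.zeros((Sn, An))
    score[s, :] = -pi_n[s, :]
    score[s, a] += 1.0
    weight = (gamma ** h) * delta_hat[neighbors, h].sum() / N
    grad_n += weight * score

print(np.linalg.norm(grad_n))
\end{verbatim}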

To prove Theorem 3, we establish the following lemma that relates $\sum_{t=1}^{T}\mathbb{E}[\|\nabla L^{\pi_{\theta^{t}}}(\rho)\|^{2}]$ to the errors $e_{1},e_{2},$ and $e_{3}.$

Lemma 5

Under Algorithm 1, we have

\begin{align*}
\sum_{t=1}^{T}\mathbb{E}[\|\nabla L^{\pi_{\theta^{t}}}(\rho)\|^{2}]\leq{}&3\eta\sum_{t=1}^{T}\mathbb{E}\left[\|e_{1}(t)\|^{2}+\|e_{2}(t)\|^{2}+\|e_{3}(t)\|^{2}\right]\\
&+\left(\frac{2}{(1-\gamma)^{2}}+2\lambda N\right)\sum_{t=1}^{T}\mathbb{E}\left[\|e_{1}(t)\|+\|e_{3}(t)\|\right]\\
&+\left(\beta^{\prime}\eta-2\right)\sum_{t=1}^{T}\mathbb{E}\left[\langle\nabla L^{\pi_{\theta^{t}}}(\rho),e_{2}(t)\rangle\right]\\
&+\frac{2}{\eta}\left(L^{*}(\rho)-L^{\pi^{1}}(\rho)\right).
\end{align*}
Proof 5

Define $D(t)=L^{*}(\rho)-L^{\pi_{\theta^{t}}}(\rho).$ Following the same steps as in the proof of Theorem 2, we have

\begin{align*}
D(t+1)-D(t)\leq{}&\frac{\beta^{\prime}\eta^{2}}{2}\|\nabla\hat{L}^{\pi_{\theta^{t}}}(\rho)\|^{2}-\eta\langle\nabla L^{\pi_{\theta^{t}}}(\rho),\nabla\hat{L}^{\pi_{\theta^{t}}}(\rho)\rangle\\
={}&\frac{\beta^{\prime}\eta^{2}}{2}\|\nabla L^{\pi_{\theta^{t}}}(\rho)+e_{1}(t)+e_{2}(t)+e_{3}(t)\|^{2}\\
&-\eta\langle\nabla L^{\pi_{\theta^{t}}}(\rho),\nabla L^{\pi_{\theta^{t}}}(\rho)+e_{1}(t)+e_{2}(t)+e_{3}(t)\rangle\\
\leq{}&\left(\frac{\beta^{\prime}\eta^{2}}{2}-\eta\right)\|\nabla L^{\pi_{\theta^{t}}}(\rho)\|^{2}+\frac{\beta^{\prime}\eta^{2}}{2}\|e_{1}(t)+e_{2}(t)+e_{3}(t)\|^{2}\\
&+\left(\frac{\beta^{\prime}\eta^{2}}{2}-\eta\right)\langle\nabla L^{\pi_{\theta^{t}}}(\rho),e_{1}(t)+e_{2}(t)+e_{3}(t)\rangle.
\end{align*}

Recalling that $\eta$ and $\beta^{\prime}$ satisfy $\beta^{\prime}\eta\leq 1,$ we have

\begin{align*}
D(t+1)-D(t)\leq{}&-\frac{\eta}{2}\|\nabla L^{\pi_{\theta^{t}}}(\rho)\|^{2}+\frac{3\eta}{2}\left(\|e_{1}(t)\|^{2}+\|e_{2}(t)\|^{2}+\|e_{3}(t)\|^{2}\right)\\
&+\eta\|\nabla L^{\pi_{\theta^{t}}}(\rho)\|\left(\|e_{1}(t)\|+\|e_{3}(t)\|\right)\\
&+\left(\frac{\beta^{\prime}\eta^{2}}{2}-\eta\right)\langle\nabla L^{\pi_{\theta^{t}}}(\rho),e_{2}(t)\rangle,
\end{align*}

which implies that

\begin{align*}
\sum_{t=1}^{T}\mathbb{E}[\|\nabla L^{\pi_{\theta^{t}}}(\rho)\|^{2}]\leq{}&3\eta\sum_{t=1}^{T}\mathbb{E}\left[\|e_{1}(t)\|^{2}+\|e_{2}(t)\|^{2}+\|e_{3}(t)\|^{2}\right]\\
&+2\sum_{t=1}^{T}\mathbb{E}\left[\|\nabla L^{\pi_{\theta^{t}}}(\rho)\|\left(\|e_{1}(t)\|+\|e_{3}(t)\|\right)\right]\\
&+\left(\beta^{\prime}\eta-2\right)\sum_{t=1}^{T}\mathbb{E}\left[\langle\nabla L^{\pi_{\theta^{t}}}(\rho),e_{2}(t)\rangle\right]\\
&+\frac{2}{\eta}\left(L^{*}(\rho)-L^{\pi^{1}}(\rho)\right).\tag{8}
\end{align*}

Next, we provide an upper bound on $\|\nabla L^{\pi_{\theta^{t}}}(\rho)\|.$ Note that

\begin{align*}
\delta^{\pi_{\theta},h}_{n}(s_{\mathcal{N}(n)}(h),a_{\mathcal{N}(n)}(h))={}&r^{h}_{n}(s_{\mathcal{N}(n)}(h),a_{\mathcal{N}(n)}(h))+\gamma\hat{V}^{\pi_{\theta},H}_{n}(s_{\mathcal{N}(n)}(h+1))-\hat{V}^{\pi_{\theta},H}_{n}(s_{\mathcal{N}(n)}(h))\\
\leq{}&1+\frac{\gamma}{1-\gamma}=\frac{1}{1-\gamma},
\end{align*}

and

\[
\nabla_{\theta_{n}(s^{\prime}_{n},a^{\prime}_{n})}\log\pi_{\theta_{n}}(a_{n}(h)|s_{n}(h))=\mathbb{I}_{s_{n}(h)=s^{\prime}_{n}}\left(\mathbb{I}_{a_{n}(h)=a^{\prime}_{n}}-\pi_{\theta_{n}}(a_{n}^{\prime}|s_{n}(h))\right).
\]

We immediately have

\[
\|\nabla L^{\pi_{\theta^{t}}}(\rho)\|\leq\frac{2N}{(1-\gamma)^{2}}+2\lambda N.
\]

Finally, we complete the proof by substituting the upper bound into (8).
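The score-function bound $\|\nabla_{\theta_{n}}\log\pi_{\theta_{n}}(a_{n}|s_{n})\|\leq 2$ used above (and again in Lemmas 8 and 9) can be spot-checked for the tabular softmax parameterization; the sketch below samples random logits and reports the largest norm observed, which in these runs stays below $\sqrt{2}\leq 2.$

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(5)

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

# Sample many random logits and actions; track ||grad_{theta(s,.)} log pi(a|s)||
worst = 0.0
for _ in range(10000):
    An = int(rng.integers(2, 10))
    pi = softmax(rng.normal(scale=3.0, size=An))
    a = int(rng.integers(An))
    score = -pi.copy()
    score[a] += 1.0            # e_a - pi(.|s)
    worst = max(worst, float(np.linalg.norm(score)))

print(worst, "<= sqrt(2) <= 2")
\end{verbatim}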

In the following subsections, we provide upper bounds on the error terms in Lemma 5. We define $\{f-g\}(x):=f(x)-g(x)$ and occasionally drop the index $t$ in $\pi_{\theta^{t}}$ to simplify notation.

C.1 Bounds on $\|e_{1}(t)\|^{2},\|e_{2}(t)\|^{2},$ and $\|e_{3}(t)\|^{2}$

Lemma 6
\[
\|e_{1}(t)\|^{2}+\|e_{2}(t)\|^{2}+\|e_{3}(t)\|^{2}\leq\frac{12N}{(1-\gamma)^{4}}.
\]
Proof 6

As in the proof of Lemma 5, it can be verified that $\|\nabla_{\theta_{n}}{V}^{\pi_{\theta}}(\rho)\|,$ $\|\nabla_{\theta_{n}}\tilde{V}^{\pi_{\theta}}(\rho)\|,$ and $\|\nabla_{\theta_{n}}\bar{V}^{\pi_{\theta}}(\rho)\|$ are all bounded by

\[
\frac{2}{(1-\gamma)^{2}}.
\]

Therefore, each of the error terms $\|e_{1}\|^{2},\|e_{2}\|^{2},$ and $\|e_{3}\|^{2}$ is bounded by

\[
\frac{4N}{(1-\gamma)^{4}},
\]

which completes the proof.

C.2 Bound on $\langle\nabla L^{\pi_{\theta^{t}}}(\rho),e_{2}(t)\rangle$

Lemma 7
\[
\mathbb{E}\left[\left|\sum_{t=1}^{T}\langle\nabla L^{\pi_{\theta^{t}}}(\rho),e_{2}(t)\rangle\right|\right]\leq\left(\frac{4N}{(1-\gamma)^{4}}+\frac{2\lambda N}{(1-\gamma)^{2}}\right)\sqrt{T\log T}.
\]
Proof 7

The sequence $\{\langle\nabla L^{\pi_{\theta^{t}}}(\rho),e_{2}(t)\rangle\}_{t}$ is a martingale difference sequence with respect to the filtration $\{\mathcal{F}_{t}\}$ because it satisfies

  • $\left|\langle\nabla L^{\pi_{\theta^{t}}}(\rho),e_{2}(t)\rangle\right|\leq\|\nabla L^{\pi_{\theta^{t}}}(\rho)\|\,\|e_{2}(t)\|\leq\frac{4N}{(1-\gamma)^{4}}+\frac{2\lambda N}{(1-\gamma)^{2}};$

  • $\mathbb{E}[\langle\nabla L^{\pi_{\theta^{t}}}(\rho),e_{2}(t)\rangle|\mathcal{F}_{t-1}]=0.$

Therefore, applying the Azuma–Hoeffding inequality to the sequence yields

\[
\mathcal{P}\left(\left|\sum_{t=1}^{T}\langle\nabla L^{\pi_{\theta^{t}}}(\rho),e_{2}(t)\rangle\right|\geq\left(\frac{4N}{(1-\gamma)^{4}}+\frac{2\lambda N}{(1-\gamma)^{2}}\right)\sqrt{2T\log T}\right)\leq\frac{1}{T}.
\]

Since the inner products are bounded, the complementary event of probability at most $1/T$ contributes only a lower-order term to the expectation, which completes the proof.
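As an illustration of the concentration step (not a replacement for it), one can simulate a bounded martingale difference sequence and compare the empirical tail frequency with the Azuma–Hoeffding envelope; the sequence below is a synthetic $\pm 1$ coin-flip martingale, not the actual error sequence $e_{2}(t).$

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(6)
T, c, trials = 1000, 1.0, 2000    # horizon, per-step bound |X_t| <= c, repetitions

# Bounded martingale differences: X_t = c * (+1 or -1 with equal probability)
X = c * rng.choice([-1.0, 1.0], size=(trials, T))
partial_sums = np.abs(X.sum(axis=1))

# Two-sided Azuma-Hoeffding: P(|sum| >= c sqrt(2 T log T)) <= 2 exp(-log T) = 2/T
threshold = c * np.sqrt(2 * T * np.log(T))
print((partial_sums >= threshold).mean(), "<=", 2.0 / T)
\end{verbatim}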

C.3 Bound on $\|e_{3,n}(t)\|$

Lemma 8
\[
\mathbb{E}\left[\|e_{3,n}(t)\|\right]\leq\frac{2\gamma^{H}}{1-\gamma},\quad\forall 1\leq n\leq N,~\forall 1\leq t\leq T.
\]
Proof 8

We have

\begin{align*}
\|e_{3,n}(t)\|={}&\left\|\sum_{h=H}^{\infty}\gamma^{h}\mathbb{E}\left[\frac{1}{N}\sum_{k\in\mathcal{N}(n)}\delta^{\pi_{\theta}}_{k}(s_{\mathcal{N}(k)}(h),a_{\mathcal{N}(k)}(h))\nabla_{\theta_{n}}\log\pi_{\theta_{n}}(a_{n}(h)|s_{n}(h))\right]\right\|\\
\leq{}&\frac{4|\mathcal{N}(n)|}{(1-\gamma)N}\left\|\sum_{h=H}^{\infty}\gamma^{h}\mathbb{E}\left[\nabla_{\theta_{n}}\log\pi_{\theta_{n}}(a_{n}(h)|s_{n}(h))\right]\right\|\\
\leq{}&\frac{4\gamma^{H}}{1-\gamma}.
\end{align*}

C.4 Bounds on $\|e_{1,n}\|$

Lemma 9
\[
\mathbb{E}\left[\|e_{1,n}(t)\|\right]\leq\frac{4\max_{k\in\mathcal{N}(n)}\|\hat{V}^{\pi_{\theta^{t}}}_{k}-V^{\pi_{\theta^{t}}}_{k}\|}{1-\gamma},\quad\forall 1\leq n\leq N,~\forall 1\leq t\leq T.
\]
Proof 9

Recall the error

\begin{align*}
e_{1,n}={}&\nabla_{\theta_{n}}\hat{L}^{\pi_{\theta}}(\rho)-\nabla_{\theta_{n}}\tilde{L}^{\pi_{\theta}}(\rho)\\
={}&\sum^{H}_{h=1}\gamma^{h}\frac{1}{N}\sum_{k\in\mathcal{N}(n)}\{\hat{\delta}^{\pi_{\theta}}_{k}-\delta^{\pi_{\theta}}_{k}\}(s_{\mathcal{N}(k)}(h),a_{\mathcal{N}(k)}(h))\nabla_{\theta_{n}}\log\pi_{\theta_{n}}(a_{n}(h)|s_{n}(h)),
\end{align*}

which implies that

\begin{align*}
\|e_{1,n}\|\leq{}&\sum^{H}_{h=1}\gamma^{h}\frac{1}{N}\sum_{k\in\mathcal{N}(n)}\left|\{\hat{\delta}^{\pi_{\theta}}_{k}-\delta^{\pi_{\theta}}_{k}\}(s_{\mathcal{N}(k)}(h),a_{\mathcal{N}(k)}(h))\right|\,\|\nabla_{\theta_{n}}\log\pi_{\theta_{n}}(a_{n}(h)|s_{n}(h))\|\\
\leq{}&\sum^{H}_{h=1}\gamma^{h}\frac{1}{N}\sum_{k\in\mathcal{N}(n)}\|\hat{\delta}^{\pi_{\theta}}_{k}-\delta^{\pi_{\theta}}_{k}\|\,\|\nabla_{\theta_{n}}\log\pi_{\theta_{n}}(a_{n}(h)|s_{n}(h))\|\\
\leq{}&2\sum^{H}_{h=1}\gamma^{h}\max_{k\in\mathcal{N}(n)}\|\hat{\delta}^{\pi_{\theta}}_{k}-\delta^{\pi_{\theta}}_{k}\|\\
\leq{}&\frac{4\max_{k\in\mathcal{N}(n)}\|\hat{V}^{\pi_{\theta}}_{k}-V^{\pi_{\theta}}_{k}\|}{1-\gamma},
\end{align*}

where the third inequality holds because $\|\nabla_{\theta_{n}}\log\pi_{\theta_{n}}(a_{n}|s_{n})\|\leq 2$ and $|\mathcal{N}(n)|\leq N,$ and the last inequality holds because of the definition of the TD-error $\delta$ together with $\sum_{h=1}^{H}\gamma^{h}\leq\frac{1}{1-\gamma}.$

Since $e_{1}(t)$ is directly related to the estimation error of the value function, we follow [22] to establish a finite-time analysis of $\hat{V}^{\pi_{\theta^{t}},h}-V^{\pi_{\theta^{t}},h}.$ We write the value function update in a linear representation as in [22]. Define $z_{n}^{h}=(s_{\mathcal{N}(n)}(h),a_{\mathcal{N}(n)}(h))$ and a vector $u(z_{n}^{h})$ whose $z_{n}^{h}$th entry is one and whose other entries are zero. We drop the indices $t$ and $n$ to simplify notation, because we evaluate a fixed policy $\pi_{\theta^{t}}$ for each $t$ and the following derivation holds for any $n.$ We represent $\hat{V}^{\pi_{\theta},h}(z_{n}^{h})=\langle\Theta^{h},u(z_{n}^{h})\rangle$ with parameter $\Theta^{h}\in\mathbb{R}^{|\mathcal{S}_{n}|\times|\mathcal{A}_{n}|}.$ Recall the value function update in Algorithm 2:

\begin{align*}
&\hat{V}^{\pi_{\theta},h+1}_{n}(s_{\mathcal{N}(n)}(h),a_{\mathcal{N}(n)}(h))\\
={}&(1-\alpha)\hat{V}^{\pi_{\theta},h}_{n}(s_{\mathcal{N}(n)}(h))+\alpha\left(r^{h}_{n}(s_{\mathcal{N}(n)}(h),a_{\mathcal{N}(n)}(h))+\gamma\hat{V}^{\pi_{\theta},h}_{n}(s_{\mathcal{N}(n)}(h+1))\right),
\end{align*}

which can be written in the linear form

\[
\Theta^{h+1}=\Theta^{h}+\alpha\left(-u(z^{h})(u^{T}(z^{h})-\gamma u^{T}(z^{h+1}))\Theta^{h}+r(z^{h})u(z^{h})\right).
\]

Defining $A(z^{h},z^{h+1})=-u(z^{h})(u^{T}(z^{h})-\gamma u^{T}(z^{h+1}))$ and $b(z^{h})=r(z^{h})u(z^{h}),$ we have

\[
\Theta^{h+1}=\Theta^{h}+\alpha\left(A(z^{h},z^{h+1})\Theta^{h}+b(z^{h})\right).\tag{9}
\]
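The recursion (9) is tabular TD(0) written with one-hot features. The sketch below runs it on a synthetic Markov reward process over $z$-pairs and compares $\Theta^{H}$ with the exact fixed point; the chain, rewards, and constant step size $\alpha$ are illustrative placeholders, and the remaining gap is of the kind quantified by the finite-time analysis below.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(7)
Z, gamma, alpha, H = 6, 0.9, 0.05, 200000

# Synthetic Markov chain over z = (s_N(n), a_N(n)) pairs, rewards in [0, 1]
P = rng.random((Z, Z)); P /= P.sum(axis=1, keepdims=True)
r = rng.random(Z)

def one_hot(z):
    u = np.zeros(Z); u[z] = 1.0
    return u

# Tabular TD(0) in the linear form (9): Theta <- Theta + alpha (A Theta + b)
Theta = np.zeros(Z)
z = int(rng.integers(Z))
for _ in range(H):
    z_next = int(rng.choice(Z, p=P[z]))
    u, u_next = one_hot(z), one_hot(z_next)
    A = -np.outer(u, u - gamma * u_next)
    b = r[z] * u
    Theta = Theta + alpha * (A @ Theta + b)
    z = z_next

# Exact fixed point of the Bellman equation for this chain
V_exact = np.linalg.solve(np.eye(Z) - gamma * P, r)
print(np.max(np.abs(Theta - V_exact)))
\end{verbatim}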

Define $\Phi$ to be the matrix whose columns are $u(z^{h}),$ $\Lambda$ to be the diagonal matrix with $\Lambda(z^{h},z^{h})=\pi(z^{h}),$ where $\pi(z^{h})$ is the stationary distribution of the Markov chain $Z^{h},$ $\Gamma$ to be the transition kernel matrix, and $r$ to be the vector whose $z^{h}$th entry is $r(z^{h}).$ Further define $\tilde{A}=-\Phi\Lambda(\Phi^{T}-\gamma\Gamma\Phi^{T})$ and $\tilde{b}=\Phi\Lambda r.$ The corresponding (shifted) ODE of (9) is

\[
\dot{\theta}=\tilde{A}\theta+\tilde{b},
\]

whose equilibrium $\theta^{*}$ is also the solution to the Bellman equation, i.e.,

\[
V^{\pi_{\theta},h}(z^{h})=\langle\theta^{*},u(z^{h})\rangle.
\]

We then study $\|\Theta^{h}-\theta^{*}\|$ by leveraging the drift analysis in [22]. The Lyapunov equation of the ODE is $-I=\tilde{A}^{T}P+P\tilde{A}$ with a symmetric positive definite $P,$ where $\xi_{\max}$ and $\xi_{\min}$ are the largest and smallest eigenvalues of $P.$

Based on Assumption 2, let $\tau$ be the mixing time such that the total variation distance between $\mathcal{P}(z^{h}|z^{0})$ and $\pi(z^{h})$ is at most $\frac{\xi_{\max}}{0.9}\frac{\log H}{H},$ that is,

\[
\|\mathcal{P}(z^{h}|z^{0})-\pi(z^{h})\|_{TV}\leq\frac{\xi_{\max}}{0.9}\frac{\log H}{H}.
\]

Based on these definitions, we state Theorem 7 in [22] as follows.

Theorem 4 (Theorem 7 in [22])

For any $h>\tau$ and $\epsilon$ such that $256\epsilon\tau+\epsilon\xi_{\max}\leq 0.05,$ we have the following finite-time bound:

\[
\mathbb{E}\left[\|\Theta^{h}-\theta^{*}\|^{2}\right]\leq\frac{\xi_{\max}}{\xi_{\min}}\left(1-\frac{0.9\epsilon}{\xi_{\max}}\right)^{h-\tau}(1.5\|\Theta^{0}\|+0.5)^{2}+\frac{980\xi_{\max}^{2}}{\xi_{\min}}\epsilon\tau.
\]

By setting $\epsilon=\frac{\xi_{\max}}{0.9}\frac{\log H}{H-\tau},$ we can invoke Theorem 7 in [22] and obtain the following lemma.

Lemma 10

For any $H>\tau$ such that $\frac{256\xi_{\max}\tau\log H}{H-\tau}+\frac{\xi_{\max}^{2}\log H}{H-\tau}\leq 0.045,$ the estimation error of the value function satisfies

\[
\mathbb{E}[\|\hat{V}^{\pi_{\theta},H}-V^{\pi_{\theta}}\|^{2}]\leq\frac{\xi_{\max}}{4\xi_{\min}}\frac{1}{H}+\frac{1090\xi_{\max}^{3}}{\xi_{\min}}\frac{\tau\log H}{H-\tau}.
\]
Proof 10

Setting $\epsilon=\frac{\xi_{\max}}{0.9}\frac{\log H}{H-\tau}$ and $\Theta^{0}=0$ in Theorem 4, we have

\begin{align*}
\mathbb{E}\left[\|\Theta^{H}-\theta^{*}\|^{2}\right]\leq{}&\frac{\xi_{\max}}{4\xi_{\min}}\left(1-\frac{\log H}{H-\tau}\right)^{H-\tau}+\frac{1090\xi_{\max}^{3}}{\xi_{\min}}\frac{\tau\log H}{H-\tau}\\
\leq{}&\frac{\xi_{\max}}{4\xi_{\min}}\frac{1}{H}+\frac{1090\xi_{\max}^{3}}{\xi_{\min}}\frac{\tau\log H}{H-\tau},
\end{align*}

which completes the proof because of the linear representation of the value function Θh,u(zh).\langle\Theta^{h},u(z^{h})\rangle.

C.5 Proving Theorem 3

Let $H=T+\tau$ and $\eta=1/\sqrt{T}.$ We consider a small $\lambda$ and a large $T$ such that $\lambda N\log A_{\max}\leq 1$ and $\log(1/\gamma)\geq\log T/T.$ Combining all of the lemmas, we have

\begin{align*}
&\sum_{t=1}^{T}\mathbb{E}[\|\nabla L^{\pi_{\theta^{t}}}(\rho)\|^{2}]\\
\leq{}&\frac{36N\sqrt{T}}{(1-\gamma)^{4}}+\left(\frac{8N}{(1-\gamma)^{2}}+8\lambda N\right)\sum_{n=1}^{N}\frac{T\gamma^{H}}{(1-\gamma)^{2}}\\
&+\left(\frac{4N}{(1-\gamma)^{4}}+\frac{2\lambda N}{(1-\gamma)^{2}}\right)\sqrt{T\log T}+2\left(L^{*}(\rho)-L^{\pi^{1}}(\rho)\right)\sqrt{T}\\
&+\left(\frac{8N}{(1-\gamma)^{3}}+\frac{8\lambda N}{1-\gamma}\right)\sqrt{\left(\frac{\xi_{\max}}{4\xi_{\min}}+\frac{1090\xi_{\max}^{3}}{\xi_{\min}}\right)\tau T\log(\tau+T)}\\
\leq{}&\frac{74}{(1-\gamma)^{4}}\sqrt{\left(\frac{\xi_{\max}}{4\xi_{\min}}+\frac{1090\xi_{\max}^{3}}{\xi_{\min}}\right)\tau T\log(\tau+T)},
\end{align*}

where the last inequality holds because

\[
\gamma^{H}\leq 1/\sqrt{T}~~\text{and}~~L^{*}(\rho)-L^{\pi^{1}}(\rho)\leq\frac{1}{1-\gamma}+\lambda N\log A_{\max}.
\]

This completes the proof of Theorem 3, since

\begin{align*}
\min_{1\leq t\leq T}\mathbb{E}[\|\nabla L^{\pi_{\theta^{t}}}(\rho)\|^{2}]\leq{}&\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}[\|\nabla L^{\pi_{\theta^{t}}}(\rho)\|^{2}]\\
\leq{}&\frac{74}{(1-\gamma)^{4}}\sqrt{\left(\frac{\xi_{\max}}{4\xi_{\min}}+\frac{1090\xi_{\max}^{3}}{\xi_{\min}}\right)\frac{\tau\log(\tau+T)}{T}}.
\end{align*}