
Gradient play in stochastic games: stationary points, convergence, and sample complexity

Runyu (Cathy) Zhang, Zhaolin Ren, Na Li R. Zhang, Z. Ren, and N. Li are affiliated with Harvard School of Engineering and Applied Sciences, (e-mail: [email protected], [email protected], [email protected]) This research is funded by NSF CAREER ECCS-1553407, NSF AI institute 2112085, NSF CNS: 2003111, ONR YIP N00014-19-1-2217.
(January 2022)
Abstract

We study the performance of the gradient play algorithm for stochastic games (SGs), where each agent tries to maximize its own total discounted reward by making decisions independently based on current state information which is shared between agents. Policies are directly parameterized by the probability of choosing a certain action at a given state. We show that Nash equilibria (NEs) and first-order stationary policies are equivalent in this setting, and give a local convergence rate around strict NEs. Further, for a subclass of SGs called Markov potential games (which includes the setting with identical rewards as an important special case), we design a sample-based reinforcement learning algorithm and give a non-asymptotic global convergence rate analysis for both exact gradient play and our sample-based learning algorithm. Our result shows that the number of iterations to reach an \epsilon-NE scales linearly, instead of exponentially, with the number of agents. Local geometry and local stability are also considered, where we prove that strict NEs are local maxima of the total potential function and fully-mixed NEs are saddle points.

1 Introduction

Multi-agent systems find applications in a wide range of societal systems, e.g. electric grids, traffic networks, smart buildings and smart cities, etc. Given the complexity of these systems, multi-agent reinforcement learning (MARL) has gained increasing attention in recent years (e.g. [1, 2]). Among MARL algorithms, policy gradient-type methods are highly popular because of their flexibility and capability to incorporate structured state and action spaces. However, while many recent works [3, 4, 5, 6, 7] have studied the performance of multi-agent policy gradient algorithms, due to a lack of understanding of the optimization landscape in these multi-agent learning problems, most works can only show convergence to a first-order stationary point. A deeper understanding of the quality of these stationary points is missing even in the simple identical-reward multi-agent RL setting.

In this paper, we investigate this problem from a game-theoretic perspective. We model the multi-agent system as a stochastic game (SG) where agents take independent stochastic policies and can have different reward functions. The study of SGs dates back to as early as the 1950s by [8] with a series of follow-up works on developing NE-seeking algorithms, especially in the RL setting (e.g. [9, 10, 11, 12, 13, 14] and citations therein). While well-known classical algorithms for solving SGs are mostly value-based, such as Nash-Q learning [15], Hyper-Q learning [16], and WoLF-PHC [17], gradient-based algorithms have also started to gain popularity in recent years due to their advantages as mentioned earlier (e.g. [18, 19, 20]).

In this work, we aim to gain a deeper understanding of the structure of first-order stationary points and the dynamical behavior of these gradient-based methods, with a particular focus on answering the following questions: 1) How do the first-order stationary points relate to the NEs of the underlying game? 2) What is the stability of individual NEs? 3) How can agents learn from samples in this environment?

These questions have already been widely discussed in other settings, e.g., one-shot (stateless) finite-action games [21, 22, 23, 24, 25, 26, 27, 28, 29, 30], one-shot continuous games [31], zero-sum linear quadratic (LQ) games [32], etc. There are both negative and positive results depending on the settings. For one-shot continuous games, [31] proved a negative result suggesting that gradient flow has stationary points (even local maxima) that are not necessarily NEs. Conversely, [32] designed projected nested-gradient methods that provably converge to NEs in zero-sum LQ games. However, much less is known in the tabular setting of SGs with finite state-action spaces.

Contributions. We consider the gradient play algorithm for infinite-horizon, discounted-reward SGs with independent, directly parameterized agent policies. By generalizing the gradient domination property in [33] to the SG setting, we first establish the equivalence of first-order stationary policies and Nash equilibria (Theorem 1). This result suggests that even if agents have an identical reward, the first-order stationary points are only equivalent to Nash equilibria, which are usually non-unique and have different reward values. This is fundamentally different from the centralized learning case [33], where first-order stationary points can be shown to be globally optimal.

Then we study the convergence of gradient play for SGs. For general games, it is known that gradient play may fail to obtain global convergence [21, 22, 23, 24]. Thus we first focus on characterizing local properties in the general case. In particular, we characterize the structure of strict NEs and show that gradient play locally converges to strict NEs within finitely many steps (Theorem 2).

Next we study a special class of SGs called Markov potential games (MPGs) [34, 35, 36], which includes identical-reward multi-agent RL [37, 38, 39, 40, 41, 42] as an important special case. Concurrently, this work and [36] have established the global convergence rate to a NE for gradient play under MPGs (Theorem 3). However, the result does not specify which NE the policies converge to. Since there can be many NEs with poor global value, global convergence results alone have limited implications for algorithm performance. This motivates us to study the local geometry around specific types of NEs. We show that strict NEs are local maxima of the total potential function, and thus stable points under gradient play, while fully mixed NEs are saddle points, and thus unstable points under gradient play (Theorem 4).

Then, we design a fully decentralized sample-based gradient play algorithm and prove that it can find an \epsilon-Nash equilibrium with high probability using \widetilde{O}\left(\frac{n}{\epsilon^{6}}\textup{poly}\left(\frac{1}{1-\gamma},|\mathcal{S}|,\max_{i}|\mathcal{A}_{i}|\right)\right) samples (Theorem 5; here |\mathcal{S}| and |\mathcal{A}_{i}| denote the sizes of the state space and of agent i's action space, respectively). The key enabler of our algorithm is the existence of an underlying averaged MDP for each agent when the other agents' policies are fixed. Our learning method can be viewed as a model-based policy evaluation method with respect to agents' averaged MDPs. This averaged MDP concept could be applied to design many other MARL algorithms, especially policy-evaluation-based methods.

Comparison to other works on NE learning for SGs: There are some recent studies on general SGs with finite state-action spaces; however, either the structure of the SGs or the methods they consider differ from our setting. For example, [43, 44] consider learning coarse correlated equilibria (CCE) rather than NEs for finite-horizon general-sum games; [45] and [46] propose decentralized learning algorithms for weakly acyclic games, which include identical-interest games as a special case, but only consider asymptotic convergence; [47, 48] consider convergence to NEs for two-player zero-sum games. In addition, [3, 6, 7] consider slightly different MARL settings, where agents collaboratively maximize the sum of the agents' rewards with either full or partial state observation; they also require communication between neighboring agents for better global coordination.

For the MPG subclass, [49, 36, 43, 50, 51] study convergence to a NE. [43] designs the Nash-CA (Nash Coordinate Ascent) algorithm, which requires agents to update sequentially and does not belong to the gradient-based algorithm class. While [50, 51] consider gradient-based algorithms, they study softmax policies, which differ from directly parameterized policies. [52] considers policy gradient with function approximation. [36] is the work most related to this paper; it also studies the performance of gradient-based algorithms under direct parameterization. It establishes a global convergence rate and develops sample complexity results for gradient play, but does not study the local geometry for general SGs. Additionally, the sample-based algorithm considered in [36] is based on Monte-Carlo gradient estimation, which might suffer from high variance in practice and is very different from our algorithm, which estimates the gradient by estimating the "model" of agents' averaged MDPs. Moreover, our concept of "averaged" MDPs could also serve as a useful tool for the design and analysis of other MARL algorithms.

2 Problem setting and preliminaries

We consider a stochastic game (SG, [8]) \mathcal{M}=(N,\mathcal{S},\mathcal{A}=\mathcal{A}_{1}\times\dots\times\mathcal{A}_{n},P,r=(r_{1},\dots,r_{n}),\gamma,\rho) with n agents, which is specified by: an agent set N=\{1,2,\dots,n\}; a finite state space \mathcal{S}; a finite action space \mathcal{A}_{i} for each agent i\in N; a transition model P, where P(s'|s,a)=P(s'|s,a_{1},\dots,a_{n}) is the probability of transitioning into state s' upon taking action a:=(a_{1},\dots,a_{n}) in state s, with a_{i}\in\mathcal{A}_{i} the action of agent i; agent i's reward function r_{i}:\mathcal{S}\times\mathcal{A}\rightarrow[0,1]; a discount factor \gamma\in[0,1); and an initial state distribution \rho over \mathcal{S}.

A stochastic policy π:𝒮Δ(𝒜)\pi:\mathcal{S}\rightarrow\Delta(\mathcal{A}) (where Δ(𝒜)\Delta(\mathcal{A}) is the probability simplex over 𝒜\mathcal{A}) specifies a strategy in which agents choose their actions jointly based on the current state in a stochastic fashion, i.e. Pr(at|st)=π(at|st)\Pr(a_{t}|s_{t})=\pi(a_{t}|s_{t}). A decentralized stochastic policy is a special subclass of stochastic policies, with π=π1××πn\pi=\pi_{1}\times\ldots\times\pi_{n}, where πi:𝒮Δ(𝒜i)\pi_{i}:\mathcal{S}\rightarrow\Delta(\mathcal{A}_{i}). For decentralized stochastic policies, each agent takes its action based on the current state ss independently of other agents’ choices of actions, i.e.:

Pr(at|st)=π(at|st)=i=1nπi(ai,t|st),at=(a1,t,,an,t).\textstyle\Pr(a_{t}|s_{t})=\pi(a_{t}|s_{t})=\prod_{i=1}^{n}\pi_{i}(a_{i,t}|s_{t}),a_{t}\!=\!(a_{1,t},\!\dots\!,a_{n,t}).\vspace{-3pt}

For notational simplicity, we define:  πI(aI|s):=iIπi(ai|s)\pi_{I}(a_{I}|s):=\prod_{i\in I}\pi_{i}(a_{i}|s), where INI\subseteq N is an index set. Further, we use the notation i-i to denote the index set N\{i}N\backslash\{i\}.

We consider direct decentralized policy parameterization, where agent ii’s policy is parameterized by θi\theta_{i}:

πi,θi(ai|s)=θi,(s,ai),i=1,2,,n.\pi_{i,\theta_{i}}(a_{i}|s)=\theta_{i,(s,a_{i})},\quad i=1,2,\dots,n. (1)

For notational simplicity, we abbreviate \pi_{i,\theta_{i}}(a_{i}|s) as \pi_{\theta_{i}}(a_{i}|s), and \theta_{i,(s,a_{i})} as \theta_{s,a_{i}}. Here \theta_{i}\in\Delta(\mathcal{A}_{i})^{|\mathcal{S}|}, i.e. \theta_{i} is subject to the constraints \theta_{s,a_{i}}\geq 0 and \sum_{a_{i}\in\mathcal{A}_{i}}\theta_{s,a_{i}}=1 for all s\in\mathcal{S}. The global joint policy is given by \pi_{\theta}(a|s)=\prod_{i=1}^{n}\pi_{\theta_{i}}(a_{i}|s)=\prod_{i=1}^{n}\theta_{s,a_{i}}. We use \mathcal{X}_{i}:=\Delta(\mathcal{A}_{i})^{|\mathcal{S}|} and \mathcal{X}:=\mathcal{X}_{1}\times\cdots\times\mathcal{X}_{n} to denote the feasible regions of \theta_{i} and \theta, respectively.
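To make the parameterization concrete, the following minimal Python sketch samples a joint action from a decentralized product policy under direct parameterization (1); the array shapes and helper names are our own illustrative conventions, not from the paper.

```python
import numpy as np

# A minimal sketch of direct decentralized parameterization (1): agent i's policy
# is just the table theta_i[s, a_i], one probability distribution per state.

rng = np.random.default_rng(0)

def random_direct_policy(num_states, num_actions):
    """Sample theta_i from Delta(A_i)^|S|: each row is a distribution over actions."""
    theta = rng.random((num_states, num_actions))
    return theta / theta.sum(axis=1, keepdims=True)

def sample_joint_action(thetas, s):
    """Draw a = (a_1, ..., a_n) with Pr(a|s) = prod_i theta_i[s, a_i]."""
    return tuple(rng.choice(theta_i.shape[1], p=theta_i[s]) for theta_i in thetas)

# Example: n = 2 agents, |S| = 3 states, |A_1| = 2 and |A_2| = 4 actions.
thetas = [random_direct_policy(3, 2), random_direct_policy(3, 4)]
print(sample_joint_action(thetas, s=1))
```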

Agent ii’s value function [53] Viθ:𝒮,iNV_{i}^{\theta}:\mathcal{S}\rightarrow\mathbb{R},i\in N is defined as the discounted sum of future rewards starting at state ss via executing πθ\pi_{\theta}, i.e.

Viθ(s):=𝔼[t=0γtri(st,at)|πθ,s0=s],\textstyle V_{i}^{\theta}(s):=\mathbb{{E}}\left[\sum_{t=0}^{\infty}\gamma^{t}r_{i}(s_{t},a_{t})\big{|}~{}\pi_{\theta},s_{0}=s\right],

where the expectation is with respect to the random trajectory τ=(st,at,ri,t)t=0\tau=(s_{t},a_{t},r_{i,t})_{t=0}^{\infty} where atπθ(|st),st+1=P(|st,at)a_{t}\sim\pi_{\theta}(\cdot|s_{t}),s_{t+1}=P(\cdot|s_{t},a_{t}). We denote agent ii’s total reward starting from initial state s0ρs_{0}\sim\rho as:

Ji(θ)=Ji(θ1,,θn):=𝔼s0ρViθ(s0).\quad\textstyle J_{i}(\theta)=J_{i}(\theta_{1},\dots,\theta_{n}):=\mathbb{{E}}_{s_{0}\sim\rho}V_{i}^{\theta}(s_{0}).\vspace{-1pt}
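As a rough illustration of this definition, the sketch below estimates J_1(\theta) by Monte Carlo rollouts on a small synthetic two-agent game; the game (P, r1, \rho) and all names are made up for illustration only, and the infinite discounted sum is truncated at a finite horizon.

```python
import numpy as np

# Rough Monte Carlo estimate of J_1(theta) = E_{s0 ~ rho}[V_1^theta(s0)]: roll out
# the joint policy and accumulate discounted rewards (synthetic game, truncated sum).

rng = np.random.default_rng(1)
nS, nA1, nA2, gamma = 3, 2, 2, 0.9
P = rng.random((nS, nA1, nA2, nS)); P /= P.sum(axis=-1, keepdims=True)
r1 = rng.random((nS, nA1, nA2))                      # agent 1's reward in [0, 1]
rho = np.full(nS, 1.0 / nS)
theta = [np.full((nS, nA1), 1.0 / nA1), np.full((nS, nA2), 1.0 / nA2)]

def estimate_J1(theta, episodes=1000, horizon=100):
    total = 0.0
    for _ in range(episodes):
        s = rng.choice(nS, p=rho)
        ret, disc = 0.0, 1.0
        for _ in range(horizon):                     # truncate t = 0, ..., horizon-1
            a1 = rng.choice(nA1, p=theta[0][s])
            a2 = rng.choice(nA2, p=theta[1][s])
            ret += disc * r1[s, a1, a2]
            disc *= gamma
            s = rng.choice(nS, p=P[s, a1, a2])
        total += ret
    return total / episodes

print(estimate_J1(theta))
```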

In the game setting, Nash equilibrium is often used to characterize the performance of agents’ policies.

Definition 1.

(Nash equilibrium, c.f. [54, 55]) A policy θ=(θ1,,θn)\theta^{*}=(\theta_{1}^{*},\dots,\theta_{n}^{*}) is called a Nash equilibrium (NE) if

Ji(θi,θi)Ji(θi,θi),θi𝒳i,iNJ_{i}(\theta_{i}^{*},\theta_{-i}^{*})\geq J_{i}(\theta_{i}^{\prime},\theta_{-i}^{*}),\quad\forall\theta_{i}^{\prime}\in\mathcal{X}_{i},\quad i\in N

The equilibrium is called a strict NE if the inequality holds strictly for all \theta_{i}'\in\mathcal{X}_{i} with \theta_{i}'\neq\theta_{i}^{*} and all i\in N. The equilibrium is called a pure NE if \theta^{*} corresponds to a deterministic policy. The equilibrium is called a mixed NE if it is not pure. Further, the equilibrium is called a fully mixed NE if every entry of \theta^{*} is strictly positive, i.e., \theta_{s,a_{i}}^{*}>0 for all a_{i}\in\mathcal{A}_{i}, s\in\mathcal{S}, i\in N.

We define the discounted state visitation distribution [53] dθd_{\theta} of a policy πθ\pi_{\theta} given an initial state distribution ρ\rho as:

dθ(s):=𝔼s0ρ(1γ)t=0γtPrθ(st=s|s0),\textstyle d_{\theta}(s):=\mathbb{{E}}_{s_{0}\sim\rho}(1-\gamma)\sum_{t=0}^{\infty}\gamma^{t}\textup{Pr}^{\theta}(s_{t}=s|s_{0}), (2)

where Prθ(st=s|s0)\textup{Pr}^{\theta}(s_{t}=s|s_{0}) is the state visitation probability that st=ss_{t}=s when executing πθ\pi_{\theta} starting at state s0s_{0}. Throughout the paper, we make the following assumption on the SGs we study.

Assumption 1.

The stochastic game \mathcal{M} satisfies:  dθ(s)>0,s𝒮,θ𝒳d_{\theta}(s)>0,~{}\forall s\in\mathcal{S},~{}\forall\theta\in\mathcal{X}.

Assumption 1 requires that every state is visited with positive probability, which is a standard assumption for convergence proofs in the RL literature (e.g. [33, 56, 36, 48]). Note that this assumption is easily satisfied if the initial distribution \rho satisfies \rho(s)>0 for all s\in\mathcal{S}.

Similar to centralized RL [53], define agent ii’s QQ-function QiθQ_{i}^{\theta} and its advantage function AiθA_{i}^{\theta} as:

Qiθ(s,a):=𝔼[t=0γtri(st,at)|πθ,s0=s,a0=a],Aiθ(s,a):=Qiθ(s,a)Viθ(s).\begin{split}Q_{i}^{\theta}(s,a)&:=\mathbb{{E}}\left[\sum_{t=0}^{\infty}\gamma^{t}r_{i}(s_{t},a_{t})\big{|}~{}\pi_{\theta},s_{0}=s,a_{0}=a\right],\\ A_{i}^{\theta}(s,a)&:=Q_{i}^{\theta}(s,a)-V_{i}^{\theta}(s).\end{split}

‘Averaged’ Markov decision process (MDP): We further define agent ii’s ‘averaged’ Q-function Qiθ¯:𝒮×𝒜i\overline{Q_{i}^{\theta}}:\mathcal{S}\times\mathcal{A}_{i}\rightarrow\mathbb{R} and ‘averaged’ advantage-function Aiθ¯:𝒮×𝒜i\overline{A_{i}^{\theta}}:\mathcal{S}\times\mathcal{A}_{i}\rightarrow\mathbb{R} as:

Qiθ¯(s,ai):=aiπθi(ai|s)Qiθ(s,ai,ai),Aiθ¯(s,ai):=aiπθi(ai|s)Aiθ(s,ai,ai).\begin{split}\textstyle\overline{Q_{i}^{\theta}}(s,a_{i})&\textstyle:=\sum_{a_{-i}}\pi_{\theta_{-i}}(a_{-i}|s)Q_{i}^{\theta}(s,a_{i},a_{-i}),\\ \textstyle\overline{A_{i}^{\theta}}(s,a_{i})&\textstyle:=\sum_{a_{-i}}\pi_{\theta_{-i}}(a_{-i}|s)A_{i}^{\theta}(s,a_{i},a_{-i}).\end{split} (3)

Similarly, we define agent ii’s ‘averaged’ transition probability distribution Piθ¯:𝒮×𝒮×𝒜i\overline{P_{i}^{\theta}}:\mathcal{S}\times\mathcal{S}\times\mathcal{A}_{i}\rightarrow\mathbb{R}, and ‘averaged’ reward riθ¯:𝒮×𝒜i\overline{r_{i}^{\theta}}:\mathcal{S}\times\mathcal{A}_{i}\rightarrow\mathbb{R} as:

Piθ¯(s|s,ai):=aiπθi(ai|s)P(s|s,ai,ai),riθ¯(s,ai):=aiπθi(ai|s)ri(s,ai,ai)\begin{split}\textstyle\overline{P_{i}^{\theta}}(s^{\prime}|s,a_{i})&\textstyle:=\sum_{a_{-i}}\pi_{\theta{-i}}(a_{-i}|s)P(s^{\prime}|s,a_{i},a_{-i}),\\ \textstyle\overline{r_{i}^{\theta}}(s,a_{i})&\textstyle:=\sum_{a_{-i}}\pi_{\theta{-i}}(a_{-i}|s)r_{i}(s,a_{i},a_{-i})\end{split}

From its definition, the averaged Q-function satisfies the following Bellman equation:

Lemma 1.

Qiθ¯\overline{Q_{i}^{\theta}} satisfies:

Qiθ¯(s,ai)=riθ¯(s,ai)+γs,aiπθi(ai|s)Piθ¯(s|s,ai)Qiθ¯(s,ai)\overline{Q_{i}^{\theta}}(s,\!a_{i})\!=\!\overline{r_{i}^{\theta}}(s,\!a_{i})\!\!+\!\!\gamma\!\!\sum_{s^{\prime},a_{i}^{\prime}}\!\!\pi_{\theta_{i}}\!(a_{i}^{\prime}|s^{\prime})\overline{P_{i}^{\theta}}\!(s^{\prime}|s,a_{i})\overline{Q_{i}^{\theta}}(s^{\prime},a_{i}^{\prime})\vspace{-10pt} (4)

Lemma 1 suggests that the averaged Q-function Qiθ¯\overline{Q_{i}^{\theta}} is indeed the Q-function for the MDP defined on action space 𝒜i\mathcal{A}_{i}, with riθ¯,Piθ¯\overline{r_{i}^{\theta}},\overline{P_{i}^{\theta}} as its stage reward and transition probability, respectively. We define this MDP as the ‘averaged’ MDP of agent ii, i.e., iθ=(𝒮,𝒜i,Piθ¯,riθ¯,γ,ρ)\mathcal{M}_{i}^{\theta}=(\mathcal{S},\mathcal{A}_{i},\overline{P_{i}^{\theta}},\overline{r_{i}^{\theta}},\gamma,\rho). The notion of an ‘averaged’ MDP will serve as an important intuition when designing the sample-based algorithm. Note that the ‘averaged’ MDP is only well-defined when the policies of the other agents θi\theta_{-i} are kept fixed. When this is indeed the case, agent ii can be treated as an independent learner with respect to its own ‘averaged’ MDP. Thus, various classical policy evaluation RL algorithms can then be applied. Additionally, we can apply the performance difference lemma [57] to the averaged MDP to derive a corresponding lemma for SGs which is useful throughout the paper.
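As a concrete illustration of the 'averaged' MDP, the following sketch marginalizes a two-agent game's transition kernel and reward over agent 2's fixed policy, as in (3) and the definitions above; the array-shape conventions and the function name are ours.

```python
import numpy as np

# Agent 1's 'averaged' MDP in a two-agent game: with agent 2's policy theta2 fixed,
# average the joint transition kernel and reward over a2 ~ pi_{theta2}(.|s).
# Shape conventions: P is (S, A1, A2, S'), r1 is (S, A1, A2), theta2 is (S, A2).

def averaged_mdp_agent1(P, r1, theta2):
    # P_bar[s, a1, s'] = sum_{a2} theta2[s, a2] * P[s, a1, a2, s']
    P_bar = np.einsum('sabt,sb->sat', P, theta2)
    # r_bar[s, a1] = sum_{a2} theta2[s, a2] * r1[s, a1, a2]
    r_bar = np.einsum('sab,sb->sa', r1, theta2)
    return P_bar, r_bar
```

With (P_bar, r_bar) in hand, agent 1 can run standard policy-evaluation routines on its averaged MDP, which is how the sample-based algorithm in Section 4.3 proceeds (with estimated quantities in place of the exact ones).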

Lemma 2.

(Performance difference lemma for SGs; proof in Appendix B) Let \theta'=(\theta_{i}',\theta_{-i}). Then

Ji(θi,θi)Ji(θi,θi)=11γs,aidθ(s)πθi(ai|s)Aiθ¯(s,ai).J_{i}(\theta_{i}^{\prime},\theta_{-i})-J_{i}(\theta_{i},\theta_{-i})=\frac{1}{1-\gamma}\sum_{s,a_{i}}d_{\theta^{\prime}}(s)\pi_{\theta^{\prime}_{i}}(a_{i}|s)\overline{A_{i}^{\theta}}(s,a_{i}).

Note that in the single agent case (n=1n=1), Lemma 2 is the same as the original performance difference lemma known in literature, e.g., Lemma 6.1 in [57].

3 Gradient Play for General Stochastic Games

Under direct distributed parameterization, the gradient play algorithm is given by:

θi(t+1)=Proj𝒳i(θi(t)+ηθiJi(θi(t))),η>0.\theta_{i}^{(t+1)}=\text{Proj}_{\mathcal{X}_{i}}(\theta_{i}^{(t)}+\eta\nabla_{\theta_{i}}J_{i}(\theta_{i}^{(t)})),~{}\eta\!>\!0. (5)

Gradient play can be viewed as a ‘better response’ strategy, where agents update their own parameters by gradient ascent with respect to their own rewards.
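For concreteness, one projected gradient-play step (5) can be implemented as below. The sort-based simplex projection is a standard routine (our implementation choice, not something prescribed by the paper), and the gradient array grad_i is assumed to be supplied, e.g., by the exact formula of Lemma 3 in the next subsection.

```python
import numpy as np

# One projected gradient-play step (5) for agent i under direct parameterization.

def project_to_simplex(v):
    """Euclidean projection of v onto {x : x >= 0, sum(x) = 1}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    k = np.nonzero(u - css / np.arange(1, len(v) + 1) > 0)[0][-1]
    return np.maximum(v - css[k] / (k + 1), 0.0)

def gradient_play_step(theta_i, grad_i, eta):
    """theta_i, grad_i have shape (|S|, |A_i|); each state's row is projected back
    onto the probability simplex Delta(A_i)."""
    raw = theta_i + eta * grad_i
    return np.apply_along_axis(project_to_simplex, 1, raw)
```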

A first-order stationary point is defined as such:

Definition 2.

(First-order stationary policy) A policy θ=(θ1,,θn)\theta^{*}=(\theta_{1}^{*},\dots,\theta_{n}^{*}) is called a first-order stationary policy if (θiθi)θiJi(θ)0,θi𝒳i,iN(\theta_{i}^{\prime}-\theta_{i}^{*})^{\top}\nabla_{\theta_{i}}J_{i}(\theta^{*})\leq 0,\ \forall\theta_{i}^{\prime}\in\mathcal{X}_{i},\ i\in N.

It is not hard to verify that \theta^{*} is a first-order stationary policy if and only if it is a fixed point under gradient play (5). Comparing Definition 1 (of NE) and Definition 2, we see that NEs are first-order stationary policies, but the converse is not obvious: for each agent i, first-order stationarity does not directly imply that \theta_{i}^{*} is optimal among all possible \theta_{i} given \theta_{-i}^{*}. Interestingly, however, we will show that NEs are in fact equivalent to first-order stationary policies, due to a gradient domination property established later. Before that, we first calculate the explicit form of the gradient \nabla_{\theta_{i}}J_{i}.

Policy gradient theorem [58] gives an efficient formula for the gradient:

θ𝔼s0ρViθ(s0)=11γ𝔼sdθ,aπθ(|s)[θlogπθ(a|s)Qiθ(s,a)],\nabla_{\!\theta}\mathbb{{E}}_{s_{0}\sim\rho}\!V_{i}^{\theta}(s_{0})\!=\!\frac{1}{1\!-\!\gamma}\mathbb{{E}}_{s\sim d_{\theta}\!,a\sim\pi_{\theta}(\!\cdot|s)}[\nabla_{\theta}\log\pi_{\theta}(a|s)Q_{i}^{\theta}(s,a)], (6)

Applying (6), the gradient θiJi\nabla_{\theta_{i}}J_{i} can be written explicitly as follows:

Lemma 3.

(Proof see Appendix D) For direct distributed parameterization (1),

Ji(θ)θs,ai=11γdθ(s)Qiθ¯(s,ai)\frac{\partial J_{i}(\theta)}{\partial{\theta_{s,a_{i}}}}=\frac{1}{1-\gamma}d_{\theta}(s)\overline{Q_{i}^{\theta}}(s,a_{i}) (7)
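When the model (P, r_i) is known, the formula (7) can be evaluated exactly. The sketch below does so for agent 1 of a two-agent game, solving the Bellman equation of Lemma 1 for \overline{Q_{1}^{\theta}} and using the closed form for d_{\theta} that also appears later in (29); the array shapes follow the earlier sketches and are our own convention.

```python
import numpy as np

# Exact evaluation of (7) for agent 1 of a two-agent game with known model.
# Shapes: P is (S, A1, A2, S'), r1 is (S, A1, A2), theta_i is (S, A_i), rho is (S,).

def exact_gradient_agent1(P, r1, theta1, theta2, rho, gamma):
    nS, nA1 = theta1.shape
    P_bar = np.einsum('sabt,sb->sat', P, theta2)           # averaged transitions (S, A1, S')
    r_bar = np.einsum('sab,sb->sa', r1, theta2)            # averaged rewards (S, A1)
    # Q_bar = (I - gamma*M)^{-1} r_bar, with M[(s,a1),(s',a1')] = theta1[s',a1'] * P_bar[s,a1,s'].
    M = np.einsum('sat,tb->satb', P_bar, theta1).reshape(nS * nA1, nS * nA1)
    Q_bar = np.linalg.solve(np.eye(nS * nA1) - gamma * M, r_bar.reshape(-1)).reshape(nS, nA1)
    # State kernel under the joint policy, and d_theta = (1-gamma)(I - gamma*P_S^T)^{-1} rho.
    P_S = np.einsum('sat,sa->st', P_bar, theta1)
    d = (1.0 - gamma) * np.linalg.solve(np.eye(nS) - gamma * P_S.T, rho)
    # (7): dJ_1/dtheta_{s, a1} = d(s) * Q_bar(s, a1) / (1 - gamma).
    return d[:, None] * Q_bar / (1.0 - gamma)
```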

3.1 Gradient domination and the equivalence between NE and first-order stationary policy.

Lemma 4.1 in [33] established gradient domination for centralized tabular MDP under direct parameterization. We can show that a similar property still holds for SGs.

Lemma 4.

(Gradient domination) For direct distributed parameterization (1), we have that for any θ=(θ1,,θn)𝒳\theta=(\theta_{1},\dots,\theta_{n})\in\mathcal{X} and any θi𝒳i,iN\theta_{i}^{\prime}\in\mathcal{X}_{i},~{}i\in N:

Ji(θi,θi)Ji(θi,θi)dθdθmaxθ¯i𝒳i(θ¯iθi)θiJi(θ),J_{i}(\theta_{i}^{\prime},\!\theta_{\!-\!i})-J_{i}(\theta_{i},\!\theta_{\!-\!i})\!\leq\!\left\|\frac{d_{\theta^{\prime}}}{d_{\theta}}\right\|_{\infty}\!\max_{\overline{\theta}_{i}\!\in\!\mathcal{X}_{i}}(\overline{\theta}_{i}-\theta_{i})^{\top}\nabla_{\theta_{i}}J_{i}(\theta), (8)

where dθdθ:=maxsdθ(s)dθ(s)\left\|\frac{d_{\theta^{\prime}}}{d_{\theta}}\right\|_{\infty}:=\max_{s}\frac{d_{\theta^{\prime}}(s)}{d_{\theta}(s)}, and θ=(θi,θi)\theta^{\prime}=(\theta_{i}^{\prime},\theta_{-i}).

Proof.

According to Lemma 2:

Ji(θi,θi)Ji(θi,θi)=11γs,aidθ(s)πθi(ai|s)Aiθ¯(s,ai).\!J_{i}(\theta_{i}^{\prime},\!\theta_{-i})\!-\!J_{i}(\theta_{i},\!\theta_{-i})\!=\!\frac{1}{1\!-\!\gamma}\!\sum_{s,a_{i}}\!d_{\theta^{\prime}}\!(s)\pi_{\theta^{\prime}_{i}}(a_{i}|s)\overline{A_{i}^{\theta}}(s,\!a_{i}).\!\vspace{-5pt}

From the definition of ‘averaged’ advantage function:

aiπθi(ai|s)Aiθ¯(s,ai)=0,s𝒮\textstyle\sum_{a_{i}}\pi_{\theta_{i}}(a_{i}|s)\overline{A^{\theta}_{i}}(s,a_{i})=0,\quad\forall s\in\mathcal{S}

which implies: maxai𝒜iAiθ¯(s,ai)0,\max_{a_{i}\in\mathcal{A}_{i}}\overline{A^{\theta}_{i}}(s,a_{i})\geq 0, thus we have that:

Ji(θi,θi)Ji(θi,θi)=11γs,aidθ(s)πθi(ai|s)Aiθ¯(s,ai)sdθ(s)1γmaxai𝒜iAiθ¯(s,ai)=sdθ(s)dθ(s)dθ(s)1γmaxai𝒜iAiθ¯(s,ai)11γdθdθsdθ(s)maxai𝒜iAiθ¯(s,ai).\begin{split}&J_{i}(\theta_{i}^{\prime},\theta_{-i})-J_{i}(\theta_{i},\theta_{-i})=\frac{1}{1-\gamma}\sum_{s,a_{i}}d_{\theta^{\prime}}(s)\pi_{\theta^{\prime}_{i}}(a_{i}|s)\overline{A_{i}^{\theta}}(s,a_{i})\\ &\leq\!\!\sum_{s}\!\frac{d_{\theta^{\prime}}(s)}{1-\gamma}\max_{a_{i}\in\mathcal{A}_{i}}\!\overline{A_{i}^{\theta}}(s,a_{i})\!=\!\sum_{s}\frac{d_{\theta^{\prime}}(s)}{d_{\theta}(s)}\frac{d_{\theta}(s)}{1\!-\!\gamma}\max_{a_{i}\in\mathcal{A}_{i}}\overline{A_{i}^{\theta}}(s,a_{i})\\ &\leq\frac{1}{1-\gamma}\left\|\frac{d_{\theta^{\prime}}}{d_{\theta}}\right\|_{\infty}\sum_{s}d_{\theta}(s)\max_{a_{i}\in\mathcal{A}_{i}}\overline{A_{i}^{\theta}}(s,a_{i}).\end{split} (9)

We can rewrite 11γsdθ(s)maxai𝒜iAiθ¯(s,ai)\frac{1}{1-\gamma}\sum_{s}d_{\theta}(s)\max_{a_{i}\in\mathcal{A}_{i}}\overline{A_{i}^{\theta}}(s,a_{i}) as:

11γsdθ(s)maxai𝒜iAiθ¯(s,ai)=11γmaxθ¯i𝒳is,aidθ(s)πθ¯i(ai|s)Aiθ¯(s,ai)=maxθ¯i𝒳is,ai(πθ¯i(ai|s)πθi(ai|s))11γdθ(s)Aiθ¯(s,ai)=maxθ¯i𝒳i(s,ai(πθ¯i(ai|s)πθi(ai|s))11γdθ(s)Qiθ¯(s,ai)s11γdθ(s)V(s)ai(πθ¯i(ai|s)πθi(ai|s)))=0=maxθ¯i𝒳is,ai(πθ¯i(ai|s)πθi(ai|s))11γdθ(s)Qiθ¯(s,ai)=maxθ¯i𝒳i(θ¯iθi)θiJi(θ).\begin{split}&\frac{1}{1-\gamma}\sum_{s}d_{\theta}(s)\max_{a_{i}\in\mathcal{A}_{i}}\overline{A_{i}^{\theta}}(s,a_{i})\\ &=\frac{1}{1-\gamma}\max_{\overline{\theta}_{i}\in\mathcal{X}_{i}}\sum_{s,a_{i}}d_{\theta}(s)\pi_{\overline{\theta}_{i}}(a_{i}|s)\overline{A_{i}^{\theta}}(s,a_{i})\\ &=\max_{\overline{\theta}_{i}\in\mathcal{X}_{i}}\sum_{s,a_{i}}(\pi_{\overline{\theta}_{i}}(a_{i}|s)-\pi_{\theta_{i}}(a_{i}|s))\frac{1}{1-\gamma}d_{\theta}(s)\overline{A_{i}^{\theta}}(s,a_{i})\\ &=\max_{\overline{\theta}_{i}\in\mathcal{X}_{i}}\left(\sum_{s,a_{i}}(\pi_{\overline{\theta}_{i}}(a_{i}|s)-\pi_{\theta_{i}}(a_{i}|s))\frac{1}{1-\gamma}d_{\theta}(s)\overline{Q_{i}^{\theta}}(s,a_{i})\right.\\ &\quad-\underbrace{\left.\sum_{s}\frac{1}{1-\gamma}d_{\theta}(s)V(s)\sum_{a_{i}}(\pi_{\overline{\theta}_{i}}(a_{i}|s)-\pi_{\theta_{i}}(a_{i}|s))\right)}_{=0}\\ &=\max_{\overline{\theta}_{i}\in\mathcal{X}_{i}}\sum_{s,a_{i}}(\pi_{\overline{\theta}_{i}}(a_{i}|s)-\pi_{\theta_{i}}(a_{i}|s))\frac{1}{1-\gamma}d_{\theta}(s)\overline{Q_{i}^{\theta}}(s,a_{i})\\ &=\max_{\overline{\theta}_{i}\in\mathcal{X}_{i}}(\overline{\theta}_{i}-\theta_{i})^{\top}\nabla_{\theta_{i}}J_{i}(\theta).\end{split} (10)

Substituting this into (9), we may conclude that

Ji(θi,θi)Ji(θi,θi)dθdθmaxθ¯i𝒳i(θ¯iθi)θiJi(θ)J_{i}(\theta_{i}^{\prime},\theta_{-i})-J_{i}(\theta_{i},\theta_{-i})\leq\left\|\frac{d_{\theta^{\prime}}}{d_{\theta}}\right\|_{\infty}\max_{\overline{\theta}_{i}\in\mathcal{X}_{i}}(\overline{\theta}_{i}-\theta_{i})^{\top}\nabla_{\theta_{i}}J_{i}(\theta)

and this completes the proof. ∎

For the single-agent case (n=1n=1), (8) is consistent with the result in [33], i.e.:  J(θ)J(θ)dθdθmaxθ¯𝒳(θ¯θ)J(θ)J(\theta^{\prime})-J(\theta)\leq\left\|\frac{d_{\theta^{\prime}}}{d_{\theta}}\right\|_{\infty}\max_{\overline{\theta}\in\mathcal{X}}(\overline{\theta}-\theta)^{\top}\nabla J(\theta). However, when there are multiple agents, the condition is much weaker because the inequality requires θi\theta_{-i} to be fixed. When n=1n=1, gradient domination rules out the existence of stationary points that are not global optima. For the multi-agent case, the property can no longer guarantee the equivalence between first-order stationarity and global optimality; instead, it links the stationary points with NEs as shown in the following theorem.

Theorem 1.

Under Assumption 1, first-order stationary policies and NEs are equivalent.

Proof.

The definition of a Nash equilibrium naturally implies first order stationarity, because for any θi𝒳i\theta_{i}\in\mathcal{X}_{i}:

J_{i}((1-\eta)\theta_{i}^{*}+\eta\theta_{i},\theta_{-i}^{*})-J_{i}(\theta_{i}^{*},\theta_{-i}^{*})=\eta(\theta_{i}-\theta_{i}^{*})^{\top}\nabla_{\theta_{i}}J_{i}(\theta^{*})+o(\eta\|\theta_{i}-\theta_{i}^{*}\|)\leq 0,\quad\forall~\eta>0

Letting η0\eta\rightarrow 0 gives the first order stationary condition:

(θiθi)θiJi(θ)0,θi𝒳i,(\theta_{i}-\theta_{i}^{*})^{\top}\nabla_{\theta_{i}}J_{i}(\theta^{*})\leq 0,\quad\forall\theta_{i}\in\mathcal{X}_{i},

It remains to be shown that all first order stationary policies are Nash equilibria. From Assumption 1 we know that for any pair of parameters θ,θ\theta^{\prime},\theta^{*},   dθdθ<+.\left\|\frac{d_{\theta^{\prime}}}{d_{\theta^{*}}}\right\|_{\infty}\!<\!+\infty.
Take θ=(θi,θi),θ=(θi,θi)\theta^{\prime}=(\theta_{i}^{\prime},\theta_{-i}^{*}),\theta^{*}=(\theta_{i}^{*},\theta_{-i}^{*}). According to Lemma 4, we have that for any first order stationary policy θ\theta^{*},

Ji(θi,θi)Ji(θi,θi)dθdθmaxθ¯i𝒳i(θ¯iθi)θiJi(θ)0,J_{i}(\theta_{i}^{\prime},\theta_{-i}^{*})\!-\!J_{i}(\theta_{i}^{*},\theta_{-i}^{*})\!\leq\!\left\|\frac{d_{\theta^{\prime}}}{d_{\theta^{*}}}\right\|_{\infty}\!\!\max_{\overline{\theta}_{i}\in\mathcal{X}_{i}}(\overline{\theta}_{i}\!-\!\theta_{i}^{*})^{\!\!\top}\nabla_{\theta_{i}}\!\!J_{i}(\theta^{*})\!\leq\!0,

which completes the proof. ∎

We briefly note here that the equivalence between first-order stationary points and NEs holds for all SGs that satisfy Assumption 1. One implication of the theorem is that, in the identical-interest case where agents share the same reward, the first-order stationary points of the decentralized policy parameterization can only be guaranteed to be NEs; note that NEs are often non-unique and often have different objective values. This is in contrast to the single-agent/centralized case, where first-order stationary points are globally optimal [33].

3.2 Local convergence for strict NEs

Although the equivalence of NEs and stationary points under gradient play has been established, it is in fact difficult to show that gradient play converges to these stationary points. Even in the simpler static (stateless) game setup, gradient play might fail to converge [21, 22, 23, 24]. One major difficulty is that the vector field \{\nabla_{\theta_{i}}J_{i}(\theta)\}_{i=1}^{n} is not a conservative vector field, so its dynamics may display complicated behavior. Thus, as a preliminary study, instead of global convergence we focus on local convergence and restrict our study to a special subset of NEs: strict NEs. We begin by giving the following characterization of strict NEs:

Lemma 5.

Given a stochastic game \mathcal{M}, any strict NE \theta^{*} is pure, meaning that for each i and s, there exists an a_{i}^{*}(s) such that \theta_{s,a_{i}}^{*}=\mathbf{1}\{a_{i}=a_{i}^{*}(s)\}. Additionally, we have

i)\displaystyle i)\ ai(s)=argmaxaiAiθ¯(s,ai),\displaystyle\textstyle a_{i}^{*}(s)=\operatorname*{arg\,max}_{a_{i}}\overline{A_{i}^{\theta^{*}}}(s,a_{i}),
ii)\displaystyle ii)\ Aiθ¯(s,ai(s))=0,\displaystyle\textstyle\overline{A_{i}^{\theta^{*}}}(s,a_{i}^{*}(s))=0,
iii)\displaystyle iii)\ Aiθ¯(s,ai)<0,aiai(s).\displaystyle\textstyle\overline{A_{i}^{\theta^{*}}}(s,a_{i})<0,~{}\forall~{}a_{i}\neq a_{i}^{*}(s).

Based on this lemma, we define the following for studying the local convergence of a strict NE θ\theta^{*}:

Δiθ(s):=minaiai(s)|Aiθ¯(s,ai)|,Δθ:=minimins11γdθ(s)Δiθ(s)>0.\begin{split}&\textstyle\Delta_{i}^{\theta^{*}}(s):=\min_{a_{i}\neq a_{i}^{*}(s)}\left|\overline{A_{i}^{\theta^{*}}}(s,a_{i})\right|,\\ &\textstyle\Delta^{\theta^{*}}:=\min_{i}\min_{s}\frac{1}{1-\gamma}d_{\theta^{*}}(s)\Delta_{i}^{\theta^{*}}(s)>0.\vspace{-2pt}\end{split} (11)
Theorem 2.

(Local finite-time convergence around strict NEs) Define the metric on policy parameters D(\theta||\theta'):=\max_{1\leq i\leq n}\max_{s\in\mathcal{S}}\|\theta_{i,s}-\theta'_{i,s}\|_{1}, where \|\cdot\|_{1} denotes the \ell_{1}-norm. Suppose \theta^{*} is a strict Nash equilibrium. Then for any \theta^{(0)} such that D(\theta^{(0)}||\theta^{*})\leq\frac{\Delta^{\theta^{*}}(1-\gamma)^{3}}{8n|\mathcal{S}|\left(\sum_{i=1}^{n}|\mathcal{A}_{i}|\right)}, running gradient play (5) guarantees D(\theta^{(t+1)}||\theta^{*})\leq\max\left\{D(\theta^{(t)}||\theta^{*})-\frac{\eta\Delta^{\theta^{*}}}{2},0\right\}, which means that gradient play converges within \lceil\frac{2D(\theta^{(0)}||\theta^{*})}{\eta\Delta^{\theta^{*}}}\rceil steps.

Proofs are deferred to Appendix E.

Remark 1.

Note that the local convergence in Theorem 2 only requires a finite number of steps. The key insight of the proof is that the gradient always points towards \theta^{*} and that the algorithm projects the gradient update onto the probability simplex; thus, by picking the stepsize \eta arbitrarily large, exact convergence can be achieved in just one step. However, the caveat is that we need to assume that the initial policy is sufficiently close to \theta^{*}. For numerical stability, one should pick reasonable stepsizes to accommodate random initializations. Theorem 2 also shows that the radius of the region of attraction of a strict NE is at least \frac{\Delta^{\theta^{*}}(1-\gamma)^{3}}{8n|\mathcal{S}|\left(\sum_{i=1}^{n}|\mathcal{A}_{i}|\right)}; thus a \theta^{*} with a larger \Delta^{\theta^{*}}, i.e., a larger value gap between the optimal action and other actions, has a larger region of attraction. We further remark that Theorem 2 only concerns the local convergence property; hence, the theorem should be interpreted as follows: if there exists a strict NE, then it is locally asymptotically stable under gradient play. It does not claim anything about the existence of strict NEs or about global convergence.

4 Gradient play for Markov potential games

We have discussed that the main problem for the global convergence of gradient play for general SGs is that the vector field {θiJi(θ)}i=1n\{\nabla_{\theta_{i}}J_{i}(\theta)\}_{i=1}^{n} is not conservative. Thus, in this section, we restrict our analysis to a special subclass where the vector field is conservative, which in turn enjoys global convergence. This subclass is generally referred to as a Markov potential game (MPG) in the literature.

Definition 3.

(Markov potential game [36]) A stochastic game \mathcal{M} is called a Markov potential game if there exists a potential function ϕ:𝒮×𝒜1××𝒜n\phi:\mathcal{S}\times\mathcal{A}_{1}\times\cdots\times\mathcal{A}_{n}\!\rightarrow\!\mathbb{R} such that for any agent ii and any pair of policy parameters (θi,θi),(θi,θi)(\theta_{i}^{\prime},\theta_{-i}),(\theta_{i},\theta_{-i}) :

𝔼[t=0γtri(st,at)|π=(θi,θi),s0=s]𝔼[t=0γtri(st,at)|π=(θi,θi),s0=s]\displaystyle\mathbb{{E}}\left[\sum_{t=0}^{\infty}\gamma^{t}r_{i}(s_{t},a_{t})\big{|}\pi=(\theta_{i}^{\prime},\theta_{-i}),s_{0}=s\right]-\mathbb{{E}}\left[\sum_{t=0}^{\infty}\gamma^{t}r_{i}(s_{t},a_{t})\big{|}\pi=(\theta_{i},\theta_{-i}),s_{0}=s\right]
=\displaystyle= 𝔼[t=0γtϕ(st,at)|π=(θi,θi),s0=s]𝔼[t=0γtϕ(st,at)|π=(θi,θi),s0=s],s.\displaystyle\mathbb{{E}}\left[\sum_{t=0}^{\infty}\gamma^{t}\phi(s_{t},a_{t})\big{|}\pi=(\theta_{i}^{\prime},\theta_{-i}),s_{0}=s\right]-\mathbb{{E}}\left[\sum_{t=0}^{\infty}\gamma^{t}\phi(s_{t},a_{t})\big{|}\pi=(\theta_{i},\theta_{-i}),s_{0}=s\right],~{}~{}\forall~{}s.

As shown in the definition, the condition for an MPG is admittedly rather strong and difficult to verify for general SGs. [35, 34] found that continuous MPGs can model applications such as the great fish war [59], the stochastic lake game [60], medium access control [35], etc. There are also efforts attempting to identify conditions under which an SG is an MPG, e.g., [35, 36, 61]. In Appendix A, we provide a more detailed discussion of MPGs, including a necessary condition for MPGs, counterexamples of stage-wise potential games that are not MPGs, sufficient conditions for an SG to be an MPG, and application examples of MPGs. Nevertheless, identifying necessary and sufficient conditions and broadening the applications of MPGs are important future directions.

Given a policy \theta, we define the 'total potential function' \Phi(\theta):=\mathbb{E}_{s_{0}\sim\rho(\cdot)}\left[\sum_{t=0}^{\infty}\gamma^{t}\phi(s_{t},a_{t})\big|~\pi_{\theta}\right]. (Note that our definition of MPG is slightly stronger than the definition in [36], as it requires the total potential function to take this particular form of a discounted sum of the potential function; however, most of our results (Theorem 3 and Theorem 5) still hold under the weaker definition in [36], and Theorem 4 is the only result that relies on the stronger version.) We also define the quantities \Phi_{\max},\Phi_{\min} as \Phi_{\max}:=\frac{\phi_{\max}}{1-\gamma}, \Phi_{\min}:=\frac{\phi_{\min}}{1-\gamma}, where \phi_{\min}:=\min_{s,a}\phi(s,a) and \phi_{\max}:=\max_{s,a}\phi(s,a). It can be easily verified that \Phi_{\min}\leq\Phi(\theta)\leq\Phi_{\max} for all \theta. The following proposition guarantees that an MPG has at least one NE and that it is a pure NE (proof in Appendix F).

Proposition 1.

For a Markov potential game, there is at least one global maximum θ\theta^{*} of the total potential function Φ\Phi, i.e.: θargmaxθ𝒳Φ(θ)\theta^{*}\in\operatorname*{arg\,max}_{\theta\in\mathcal{X}}\Phi(\theta) that is a pure NE.

From the definition of the total potential function we obtain the following relationship

Ji(θi,θi)Ji(θi,θi)=Φ(θi,θi)Φ(θi,θi).J_{i}(\theta_{i}^{\prime},\theta_{-i})-J_{i}(\theta_{i},\theta_{-i})=\Phi(\theta_{i}^{\prime},\theta_{-i})-\Phi(\theta_{i},\theta_{-i}). (12)

Thus,

θiJi(θ)=θiΦ(θ),\nabla_{\theta_{i}}J_{i}(\theta)=\nabla_{\theta_{i}}\Phi(\theta),

which means that gradient play (5) is equivalent to running projected gradient ascent with respect to the total potential function Φ\Phi, i.e.:

θ(t+1)=Proj𝒳(θ(t)+ηθΦ(θi(t))),η>0.\theta^{(t+1)}=\text{Proj}_{\mathcal{X}}(\theta^{(t)}+\eta\nabla_{\theta}\Phi(\theta_{i}^{(t)})),~{}\eta>0.

4.1 Global convergence

With the above property, we can establish global convergence of gradient play to an \epsilon-NE for MPGs. For the sake of self-completeness, we include the theorem here. Before that, we define the \epsilon-NE.

Definition 4.

(ϵ\epsilon-Nash equilibrium) Define the ‘NE-gap’ of a policy θ\theta as:

NE-gapi(θ)\displaystyle\textup{{NE-gap}}_{i}(\theta) :=maxθi𝒳iJi(θi,θi)Ji(θi,θi);\displaystyle:=\max_{\theta_{i}^{\prime}\in\mathcal{X}_{i}}J_{i}(\theta_{i}^{\prime},\theta_{-i})-J_{i}(\theta_{i},\theta_{-i});
NE-gap(θ)\displaystyle\textup{{NE-gap}}(\theta) :=maxiNE-gapi(θ).\displaystyle:=\max_{i}\textup{{NE-gap}}_{i}(\theta).

A policy \theta is an \epsilon-Nash equilibrium if \textup{NE-gap}(\theta)\leq\epsilon.
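In small games where the model is known, NE-gap_i can be computed exactly by exploiting the averaged-MDP view of Section 2: with \theta_{-i} fixed, \max_{\theta_{i}'}J_{i}(\theta_{i}',\theta_{-i}) is the optimal value of agent i's averaged MDP. The sketch below does this for agent 1 of a two-agent game; it is our own evaluation utility, not part of the paper's learning algorithm, and reuses the shape conventions of the earlier sketches.

```python
import numpy as np

# Exact NE-gap_1(theta) for a two-agent game with known model: the gap equals
# rho^T (V_star - V_theta), where V_star is the optimal value of agent 1's
# averaged MDP and V_theta is the value of the current theta1 on that MDP.

def ne_gap_agent1(P, r1, theta1, theta2, rho, gamma, iters=2000):
    P_bar = np.einsum('sabt,sb->sat', P, theta2)     # (S, A1, S')
    r_bar = np.einsum('sab,sb->sa', r1, theta2)      # (S, A1)
    # Value iteration: best response of agent 1 to the fixed theta2.
    V_star = np.zeros(P_bar.shape[0])
    for _ in range(iters):
        V_star = np.max(r_bar + gamma * (P_bar @ V_star), axis=1)
    # Policy evaluation of the current theta1 on the same averaged MDP.
    r_pi = np.einsum('sa,sa->s', r_bar, theta1)
    P_pi = np.einsum('sat,sa->st', P_bar, theta1)
    V_theta = np.linalg.solve(np.eye(len(rho)) - gamma * P_pi, r_pi)
    return float(rho @ (V_star - V_theta))
```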

Theorem 3.

Under Assumption 1, running gradient play (5) with stepsize \eta=\frac{(1-\gamma)^{3}}{2\sum_{i=1}^{n}|\mathcal{A}_{i}|}, the NE-gap of \theta^{(t)} asymptotically converges to 0, i.e., \lim_{t\rightarrow\infty}\textup{NE-gap}(\theta^{(t)})=0. Further, we have:

1T1tTNE-gap(θ(t))2ϵ2,whenever T64M2(ΦmaxΦmin)|𝒮|i=1n|𝒜i|(1γ)3ϵ2,\begin{split}&\quad\frac{1}{T}\sum_{1\leq t\leq T}\textup{{NE-gap}}(\theta^{(t)})^{2}\leq\epsilon^{2},\\ \textup{whenever }~{}&T\geq\frac{64M^{2}(\Phi_{\max}-\Phi_{\min})|\mathcal{S}|\sum_{i=1}^{n}|\mathcal{A}_{i}|}{(1-\gamma)^{3}\epsilon^{2}},\end{split} (13)

where M:=\max_{\theta,\theta'\in\mathcal{X}}\left\|\frac{d_{\theta}}{d_{\theta'}}\right\|_{\infty} (by Assumption 1, this quantity is well-defined). Another way to interpret the result is that the averaged bound \frac{1}{T}\sum_{1\leq t\leq T}\textup{NE-gap}(\theta^{(t)})^{2}\leq\epsilon^{2} translates into a constant-probability guarantee on a single \textup{NE-gap}(\theta^{(t)}): if we pick one \theta^{(t)} uniformly at random from 1\leq t\leq T, then \textup{NE-gap}(\theta^{(t)})^{2}\leq 3\epsilon^{2} with probability at least 2/3 (the constants 3 and 2/3 can be replaced by \frac{1}{1-p} and p for any p\in(0,1)).

The factor MM is also known as the distribution mismatch coefficient that characterizes how the state visitation varies with the policies. Given an initial state distribution ρ\rho that has positive measure on every state, MM can be at least bounded by M11γmaxθdθρ11γ1minsρ(s)M\!\leq\!\frac{1}{1-\gamma}\max_{\theta}\left\|\frac{d_{\theta}}{\rho}\right\|_{\infty}\!\leq\!\frac{1}{1-\gamma}\frac{1}{\min_{s}\rho(s)}. The proof structure of Theorem 3 resembles the proof of convergence for single-agent MDPs in [33], where they leverage classical nonconvex optimization results [62, 63] and gradient domination to get the convergence rate of O(64γdθρ2|𝒮||𝒜|(1γ)6ϵ2)O\left(\frac{64\gamma\left\|\frac{d_{\theta^{*}}}{\rho}\right\|_{\infty}^{2}|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^{6}\epsilon^{2}}\right) to the global optimum (see Appendix G for proof details). In fact, our result matches this bound when there is only one agent (the exponential factor on (1γ)(1\!-\!\gamma) looks slightly different because some factors are hidden implicitly in MM and (ΦmaxΦmin)(\Phi_{\max}\!-\!\Phi_{\min}) in our bound).

4.2 Local geometry of NEs

Theorem 3 guarantees that gradient play converges to an NE; however, the theorem does not specify which NE it converges to. The quality of different NEs can vary significantly. For example, consider a simple two-agent identical-interest normal-form game with the reward table given in Table 1. There are three NEs. Two of them are strict NEs, where both agents choose the same action, i.e. a_{1}=a_{2}=1 or 2; both yield reward 1. The third is a fully mixed NE, where both agents choose actions 1 and 2 uniformly at random with probability \frac{1}{2}; this NE only yields reward \frac{1}{2}. This significant quality difference between different types of NEs motivates us to further investigate whether gradient play can find NEs of relatively good quality. Since the NE that gradient play converges to depends on the initialization and on the local geometry around the NE, as a preliminary study we characterize the local geometry and landscape around strict NEs and fully mixed NEs (stated in the following theorem). Non-strict, non-fully-mixed NEs require further investigation.

          a_2 = 1   a_2 = 2
a_1 = 1      1         0
a_1 = 2      0         1
Table 1: Reward table of the two-agent identical-interest coordination game.
Theorem 4.

For a Markov potential game with Φmin<Φmax\Phi_{\min}\!<\!\Phi_{\max} (i.e., Φ\Phi is not a constant function):

  • A strict NE θ\theta^{*} is equivalent to a strict local maximum of the total potential function Φ\Phi, i.e.: δ,\exists~{}\delta, such that for all θ𝒳,θθ\theta\!\in\!\mathcal{X},\theta\!\neq\!\theta^{*} that satisfies θθδ,\|\theta-\theta^{*}\|\leq\delta, we have  Φ(θ)<Φ(θ)\Phi(\theta)<\Phi(\theta^{*}).

  • Any fully mixed NE θ\theta^{*} is a saddle point of the total potential function Φ\Phi, i.e., Φ(θ)=0\|\nabla\Phi(\theta^{*})\|=0, and δ>0,θ𝒳,such thatθθδ and Φ(θ)>Φ(θ).\forall~{}\delta>0,~{}~{}\exists~{}\theta\in\mathcal{X},~{}\text{such that}~{}\|\theta\!-\!\theta^{*}\|\leq\delta\text{ and }\Phi(\theta)\!>\!\Phi(\theta^{*}).

The full proof of Theorem 4 is deferred to Appendix H.
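As a quick sanity check of Theorem 4 on the Table 1 game (viewed as a one-shot identical-interest game, whose potential is just the common expected reward), the snippet below evaluates the potential at the two strict NEs and at the fully mixed NE, and probes two nearby directions to exhibit the saddle structure; it is purely illustrative.

```python
# Numerical check of the Table 1 coordination game.
# p = Pr(a_1 = 1), q = Pr(a_2 = 1); the potential is the common expected reward.

def potential(p, q):
    return p * q + (1 - p) * (1 - q)

print(potential(1.0, 1.0), potential(0.0, 0.0))   # strict NEs: potential 1.0
print(potential(0.5, 0.5))                        # fully mixed NE: potential 0.5
d = 0.05
print(potential(0.5 + d, 0.5 + d))                # 0.505 > 0.5: an ascent direction
print(potential(0.5 + d, 0.5 - d))                # 0.495 < 0.5: a descent direction
```

Both ascent and descent directions exist arbitrarily close to the fully mixed NE, consistent with its characterization as a saddle point of the potential.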

Remark 2.

Theorem 4 implies that strict NEs are asymptotically locally stable under first-order methods such as gradient play, while fully mixed NEs are unstable under gradient play. Note that the theorem does not claim stability or instability for other types of NEs, e.g., pure NEs that are not strict, or NEs that are mixed but not fully mixed. Nonetheless, we believe that these preliminary results can serve as a valuable platform towards a better understanding of the geometry of the problem. We remark that the conclusion about strict NEs in Theorem 4 does not hold for settings other than tabular MPGs; for instance, for continuous games, one can use quadratic functions to construct simple counterexamples [31]. Also, similar to Remark 1, this theorem focuses on the local geometry of the NEs but does not claim the global existence of, or convergence to, either strict NEs or fully mixed NEs.

4.3 Sample-based learning: algorithm and sample complexity

In this section, we no longer assume access to the exact gradient, but instead need to estimate it via samples. Throughout the section, we make the following additional assumption on MPGs:

Assumption 2.

((τ,σS)(\tau,\sigma_{S})-Sufficient exploration on states) There exist a positive integer τ\tau and a σS(0,1)\sigma_{S}\in(0,1) such that for any policy θ\theta and any initial state-action pair (s,ai),i(s,a_{i}),~{}\forall i, we have

Prθ(sτ|s0=s,a0=a)σS,sτ,\textstyle\Pr^{\theta}(s_{\tau}|s_{0}=s,a_{0}=a)\geq\sigma_{S},~{}~{}\forall s_{\tau}, (14)

i.e., it poses a condition on the mixing time of the Markov chain induced by any policy θ\theta: there exists a sufficiently long time τ\tau, so the probability of being at any state at time τ\tau is at least σS\sigma_{S} for any initial state and action pair.

Note that similar assumptions are common in proving finite-time convergence of RL algorithms (e.g. [7, 64, 65]), where ergodicity of the Markov chain induced by certain policies is generally assumed, which results in every state-action pair being visited with positive probability under the stationary distribution.

We further introduce the state transition probability under \theta, \overline{P_{\mathcal{S}}^{\theta}}:\mathcal{S}\times\mathcal{S}\rightarrow\mathbb{R}, defined as:

P𝒮θ¯(s|s):=aπθ(a|s)P(s|s,a).\textstyle\overline{P_{\mathcal{S}}^{\theta}}(s^{\prime}|s):=\sum_{a}\pi_{\theta}(a|s)P(s^{\prime}|s,a).

We consider fully decentralized learning, where agent i's observation at time t only includes the state s_{t}, its own action a_{i,t}, and its own reward r_{i,t}:=r_{i}(s_{t},a_{t}). Such fully decentralized learning is possible because, when \theta_{-i} is fixed, agent i can be treated as an independent learner whose underlying MDP is the 'averaged' MDP described in Section 2. With this key observation, we design a 'model-based' on-policy learning algorithm, where agents perform policy evaluation in the inner loop and gradient ascent in the outer loop. The algorithm is provided in Algorithm 1. Roughly, it consists of three main steps: 1) (Inner loop) Estimate the averaged transition probabilities and rewards \overline{P_{i}^{\theta}},\overline{r_{i}^{\theta}},\overline{P_{\mathcal{S}}^{\theta}} using on-policy samples. 2) (Inner loop) Calculate the averaged Q-function \overline{Q_{i}^{\theta}} and the discounted state visitation distribution d_{\theta}, and compute the estimated gradient accordingly. 3) (Outer loop) Run projected gradient ascent with the estimated gradients. Before discussing our algorithm in more detail, we highlight that the idea of using the 'averaged' MDP can be used to design other learning methods, including model-free methods, e.g., using temporal difference methods to perform policy evaluation. One caveat is that the 'averaged' MDP is only well-defined when all the other agents use fixed policies. This makes it difficult to extend the two-timescale framework (i.e., with an inner loop and an outer loop) to single-timescale settings, which is an interesting future direction. Further, note that the current algorithm requires full state observation; extending it to the case with only partial observability remains an intriguing open question. We would also like to point out that the algorithm initialization still requires extra consensus/coordination among the players to agree on the hyperparameters T_{J},T_{G}, etc., which guarantees that agents go through the same equal-length phases to sample trajectories and compute gradient estimates.

0:  learning rate η\eta, greedy parameter α\alpha, sample trajectory length TJT_{J}, total iteration steps TGT_{G}
  For each agent ii
  for k=0,1,TG1k=0,1\dots,T_{G}-1 do
     for t=0,1,,TJt=0,1,\dots,T_{J} do
         Sample s0ρs_{0}\sim\rho, implement policy θ(k)\theta^{(k)} and collect trajectory 𝒟i(k)\mathcal{D}_{i}^{(k)}: 𝒟i(k)𝒟i(k){st,ai,t,ri,t},ai,tπθi(k)(|st)\mathcal{D}_{i}^{(k)}\!\leftarrow\!\mathcal{D}_{i}^{(k)}\!\cup\!\{\!s_{t},a_{i,t},r_{i,t}\!\},~{}a_{i,t}\!\sim\!\pi_{\theta_{i}^{(k)}}(\cdot|s_{t})
     end for
     Estimate Piθ^,riθ^,P𝒮θ^,Miθ^\widehat{P_{i}^{\theta}},\widehat{r_{i}^{\theta}},\widehat{P_{\mathcal{S}}^{\theta}},\widehat{M_{i}^{\theta}} by (19), (24),  (27) respectively.
     Calculate Qiθ^,dθ^\widehat{Q_{i}^{\theta}},\widehat{d_{\theta}} by (28),  (29) respectively.
     Estimate the gradient by (30):
     Run projected gradient ascent as in (31)
  end for
Algorithm 1 Sample-based learning

Step 1: empirical estimation of Piθ¯,riθ¯,P𝒮θ¯\overline{P_{i}^{\theta}},\overline{r_{i}^{\theta}},\overline{P_{\mathcal{S}}^{\theta}}: Given a sequence {st,ai,t,ri,t}t=0TJ\{s_{t},a_{i,t},r_{i,t}\}_{t=0}^{T_{J}} generated by a policy θ:=(θi,θi)\theta:=(\theta_{i},\theta_{-i}), the empirical estimation Piθ^\widehat{P_{i}^{\theta}} of Piθ¯\overline{P_{i}^{\theta}} is given by:

Piθ^(s|s,ai)\displaystyle\widehat{P_{i}^{\theta}}(s^{\prime}|s,a_{i}) :={t=0TJ1𝟏{st+1=s,st=s,ai,t=ai}t=1TJ1𝟏{st=s,ai,t=ai},for t=1TJ1𝟏{st=s,ai,t=ai}1;𝟏{s=s},for t=1TJ1𝟏{st=s,ai,t=ai}=0.\displaystyle:=\left\{\begin{array}[]{ll}\frac{\sum_{t=0}^{T_{J}-1}\mathbf{1}\{s_{t+1}=s^{\prime},s_{t}=s,a_{i,t}=a_{i}\}}{\sum_{t=1}^{T_{J}-1}\mathbf{1}\{s_{t}=s,a_{i,t}=a_{i}\}},\vspace{5pt}\\ \quad~{}~{}{\small\textup{for }{\sum_{t=1}^{T_{J}-1}\mathbf{1}\{s_{t}=s,a_{i,t}=a_{i}\}}\geq 1};\vspace{5pt}\\ \mathbf{1}\{s^{\prime}=s\},\vspace{5pt}\\ \quad~{}~{}{\small\textup{for }{\sum_{t=1}^{T_{J}-1}\mathbf{1}\{s_{t}=s,a_{i,t}=a_{i}\}}=0}.\end{array}\right.\vspace{-5pt} (19)

Here we separately treat the special case where the state and action pair is not visited through the whole trajectory, i.e., t=1TJ1𝟏{st=s,ai,t=ai}=0{\sum_{t=1}^{T_{J}-1}\mathbf{1}\{s_{t}=s,a_{i,t}=a_{i}\}}=0 to make Piθ^\widehat{P_{i}^{\theta}} well-defined.

Similarly, the estimates riθ^,P𝒮θ^\widehat{r_{i}^{\theta}},\widehat{P_{\mathcal{S}}^{\theta}} of riθ¯,P𝒮θ¯\overline{r_{i}^{\theta}},\overline{P_{\mathcal{S}}^{\theta}} are given by:

riθ^(s,ai)\displaystyle\widehat{r_{i}^{\theta}}(s,a_{i}) :={t=0TJ𝟏{st=s,ai,t=ai}ri,tt=0TJ𝟏{st=s,ai,t=ai},for t=1TJ1𝟏{st=s,ai,t=ai}1;0,for t=1TJ1𝟏{st=s,ai,t=ai}=0.\displaystyle\!:=\!\left\{\begin{array}[]{ll}\frac{\sum_{t=0}^{T_{J}}\mathbf{1}\{s_{t}=s,a_{i,t}=a_{i}\}r_{i,t}}{\sum_{t=0}^{T_{J}}\mathbf{1}\{s_{t}=s,a_{i,t}=a_{i}\}},\vspace{5pt}\\ \quad{\small\textup{for }{\sum_{t=1}^{T_{J}-1}\mathbf{1}\{s_{t}=s,a_{i,t}=a_{i}\}}\geq 1};\vspace{5pt}\\ 0,\\ \quad{\small\textup{for }{\sum_{t=1}^{T_{J}-1}\mathbf{1}\{s_{t}=s,a_{i,t}=a_{i}\}}=0}.\end{array}\right. (24)
P𝒮θ^(s|s)\displaystyle\widehat{P_{\mathcal{S}}^{\theta}}(s^{\prime}|s) :={t=0TJ1𝟏{st+1=s,st=s}t=1TJ1𝟏{st=s},t=1TJ1𝟏{st=s}1;𝟏{s=s},t=1TJ1𝟏{st=s}=0.\displaystyle\!:=\!\left\{\begin{array}[]{ll}\frac{\sum_{t=0}^{T_{J}-1}\mathbf{1}\{s_{t+1}=s^{\prime},s_{t}=s\}}{\sum_{t=1}^{T_{J}-1}\mathbf{1}\{s_{t}=s\}},~{}\sum_{t=1}^{T_{J}\!-\!1}\!\mathbf{1}\{s_{t}\!=\!s\}\!\geq\!1;\vspace{5pt}\\ \mathbf{1}\{s^{\prime}=s\},\qquad\qquad\quad~{}\sum_{t=1}^{T_{J}\!-\!1}\!\mathbf{1}\{s_{t}\!=\!s\}\!=\!0.\end{array}\right. (27)
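The estimators (19), (24), (27) can be implemented directly from one on-policy trajectory; the sketch below is a minimal version for a single agent i, with integer-coded states and actions, the paper's self-loop and zero-reward fallbacks for unvisited pairs, and minor index bookkeeping of our own.

```python
import numpy as np

# Step 1 of Algorithm 1 for agent i: empirical estimates of the averaged model
# from one trajectory {(s_t, a_{i,t}, r_{i,t})}_{t=0}^{T_J}.

def estimate_averaged_model(states, actions_i, rewards_i, nS, nAi):
    T = len(states) - 1                       # number of observed transitions
    P_hat = np.zeros((nS, nAi, nS))
    PS_hat = np.zeros((nS, nS))
    r_sum = np.zeros((nS, nAi))
    r_cnt = np.zeros((nS, nAi))
    for t in range(T):
        s, ai, sp = states[t], actions_i[t], states[t + 1]
        P_hat[s, ai, sp] += 1.0
        PS_hat[s, sp] += 1.0
    for t in range(T + 1):
        r_sum[states[t], actions_i[t]] += rewards_i[t]
        r_cnt[states[t], actions_i[t]] += 1.0
    for s in range(nS):
        for ai in range(nAi):
            n = P_hat[s, ai].sum()
            P_hat[s, ai] = P_hat[s, ai] / n if n > 0 else np.eye(nS)[s]   # self-loop fallback
        m = PS_hat[s].sum()
        PS_hat[s] = PS_hat[s] / m if m > 0 else np.eye(nS)[s]
    r_hat = np.divide(r_sum, r_cnt, out=np.zeros_like(r_sum), where=r_cnt > 0)
    return P_hat, r_hat, PS_hat
```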

Step 2: estimation of Qiθ¯,dθ\overline{Q_{i}^{\theta}},d_{\theta}: We slightly abuse notation and use Qiθ¯,riθ¯|𝒮||𝒜i|\overline{Q_{i}^{\theta}},\overline{r_{i}^{\theta}}\in\mathbb{R}^{|\mathcal{S}||\mathcal{A}_{i}|} to also denote the vectors corresponding to the averaged Q-function and reward function of agent ii. Similarly, ρ,dθ|𝒮|\rho,d_{\theta}\in\mathbb{R}^{|\mathcal{S}|} are used to denote the vectors for ρ(s)\rho(s) and dθ(s)d_{\theta}(s). Define Miθ|𝒮||𝒜i|×|𝒮||𝒜i|M_{i}^{\theta}\in\mathbb{R}^{|\mathcal{S}||\mathcal{A}_{i}|\times|\mathcal{S}||\mathcal{A}_{i}|}:

\overline{M^{\theta}_{i}}_{(s,a_{i})\rightarrow(s',a_{i}')}:=\pi_{\theta_{i}}(a_{i}'|s')\overline{P_{i}^{\theta}}(s'|s,a_{i}).

Then from Lemma 1, Qiθ¯\overline{Q_{i}^{\theta}} is given by:

(IγMiθ¯)Qiθ¯=riθ¯Qiθ¯=(IγMiθ¯)1riθ¯.(I-\gamma\overline{M^{\theta}_{i}})\overline{Q_{i}^{\theta}}=\overline{r_{i}^{\theta}}~{}\Longrightarrow~{}\overline{Q_{i}^{\theta}}=(I-\gamma\overline{M^{\theta}_{i}})^{\!-\!1}\overline{r_{i}^{\theta}}.

The estimated averaged Q-function \widehat{Q_{i}^{\theta}} is then given as follows (by the Perron-Frobenius theorem, the absolute values of the eigenvalues of \widehat{M^{\theta}_{i}} are upper bounded by 1, which guarantees that the matrix I-\gamma\widehat{M^{\theta}_{i}} is invertible):

Qiθ^=(IγMiθ^)1riθ^, where Miθ^(s,ai)(s,ai):=πθi(ai|s)Piθ^(s|s,ai).\begin{split}&\widehat{Q_{i}^{\theta}}=(I-\gamma\widehat{M^{\theta}_{i}})^{\!-\!1}\widehat{r_{i}^{\theta}},\\ \textup{ where }&\widehat{M^{\theta}_{i}}_{(s,a_{i})\rightarrow(s^{\prime},a_{i}^{\prime})}:=\pi_{\theta_{i}}(a_{i}^{\prime}|s^{\prime})\widehat{P_{i}^{\theta}}(s^{\prime}|s,a_{i}).\end{split} (28)

Similarly, from (2), we have that dθd_{\theta} and dθ^\widehat{d_{\theta}} are given by (derivation see Appendix C):

dθ=(1γ)(IγP𝒮θ¯)1ρ,dθ^:=(1γ)(IγP𝒮θ^)1ρ.\textstyle{d_{\theta}}\!=\!(1\!-\!\gamma)\left(I\!-\!\gamma\overline{P_{\mathcal{S}}^{\theta}}^{\top}\right)^{\!-\!1}\rho,\quad\widehat{d_{\theta}}\!:=\!(1\!-\!\gamma)\left(I\!-\!\gamma\widehat{P_{\mathcal{S}}^{\theta}}^{\top}\!\!\right)^{\!-\!1}\rho. (29)

Then accordingly, the estimated gradient is computed as:

^θs,aiJi(θ(k))=11γdθ^(s)Qiθ^(s,ai).\widehat{\partial}_{\theta_{s,a_{i}}}J_{i}(\theta^{(k)})=\frac{1}{1-\gamma}\widehat{d_{\theta}}(s)\widehat{Q_{i}^{\theta}}(s,a_{i}). (30)
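Step 2 amounts to two linear solves and an elementwise product; a minimal sketch, assuming the shape conventions of the Step 1 sketch, is given below.

```python
import numpy as np

# Step 2 of Algorithm 1: plug the estimated model into (28)-(30).
# Shapes: P_hat (S, A_i, S'), r_hat (S, A_i), PS_hat (S, S'), theta_i (S, A_i), rho (S,).

def estimated_gradient(P_hat, r_hat, PS_hat, theta_i, rho, gamma):
    nS, nAi = r_hat.shape
    # (28): Q_hat = (I - gamma*M_hat)^{-1} r_hat, M_hat[(s,a),(s',a')] = theta_i[s',a'] P_hat[s,a,s'].
    M_hat = np.einsum('sat,tb->satb', P_hat, theta_i).reshape(nS * nAi, nS * nAi)
    Q_hat = np.linalg.solve(np.eye(nS * nAi) - gamma * M_hat, r_hat.reshape(-1)).reshape(nS, nAi)
    # (29): d_hat = (1 - gamma) * (I - gamma * PS_hat^T)^{-1} rho.
    d_hat = (1.0 - gamma) * np.linalg.solve(np.eye(nS) - gamma * PS_hat.T, rho)
    # (30): estimated partial derivative with respect to theta_{s, a_i}.
    return d_hat[:, None] * Q_hat / (1.0 - gamma)
```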

Step 3: Projected gradient ascent onto the set of \alpha-greedy policies: Let U_{n}=[\frac{1}{n},\dots,\frac{1}{n}]\in\Delta(n) denote the n-dimensional uniform distribution. Define \Delta^{\alpha}(n):=\{\theta~|~\exists\theta'\in\Delta(n)~\text{s.t.}~\theta=(1-\alpha)\theta'+\alpha U_{n}\}. We use \mathcal{X}_{i}^{\alpha}:=\Delta^{\alpha}(|\mathcal{A}_{i}|)^{|\mathcal{S}|} and \mathcal{X}^{\alpha}:=\mathcal{X}_{1}^{\alpha}\times\mathcal{X}_{2}^{\alpha}\times\cdots\times\mathcal{X}_{n}^{\alpha} to denote the sets of \alpha-greedy policies for \theta_{i} and \theta, respectively. After every gradient ascent step, the parameter \theta is further projected onto \mathcal{X}^{\alpha}, i.e.:

θi(k+1)=Proj𝒳iα(θi(k)+η^θiJi(θ(k))).\theta_{i}^{(k+1)}=Proj_{\mathcal{X}_{i}^{\alpha}}(\theta_{i}^{(k)}+\eta\widehat{\nabla}_{\theta_{i}}J_{i}(\theta^{(k)})). (31)

The reason for projecting onto \mathcal{X}^{\alpha} instead of \mathcal{X} is to ensure that every action is selected with positive probability, so that the averaged Q-function can be estimated relatively accurately. Intuitively, a larger \alpha introduces a larger additional error in the NE-gap, while a smaller \alpha requires more samples to estimate the gradient; the choice of \alpha trades off these two effects.
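Since \Delta^{\alpha}(n) is the probability simplex shrunk toward the uniform distribution U_{n}, one way to implement the projection in (31) (our own choice, not prescribed by the paper) is to rescale, project onto the plain simplex, and map back, as sketched below; the simplex-projection routine is repeated here so the snippet is self-contained.

```python
import numpy as np

# Step 3 of Algorithm 1: the projected update (31) onto the alpha-greedy set X_i^alpha.

def project_to_simplex(v):
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    k = np.nonzero(u - css / np.arange(1, len(v) + 1) > 0)[0][-1]
    return np.maximum(v - css[k] / (k + 1), 0.0)

def project_to_alpha_simplex(v, alpha):
    """Projection onto Delta^alpha(n) = {(1-alpha)*y + alpha*U : y in Delta(n)}."""
    U = np.full(len(v), 1.0 / len(v))
    return (1.0 - alpha) * project_to_simplex((v - alpha * U) / (1.0 - alpha)) + alpha * U

def sample_based_step(theta_i, grad_hat_i, eta, alpha):
    """One outer-loop update: ascend along the estimated gradient, then project each
    state's row of theta_i onto Delta^alpha(|A_i|)."""
    raw = theta_i + eta * grad_hat_i
    return np.array([project_to_alpha_simplex(row, alpha) for row in raw])
```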

Theorem 5.

(Sample complexity) Assume that the MPG satisfies Assumption 2. Let M:=maxθ,θ𝒳dθdθM:=\max_{\theta,\theta^{\prime}\in\mathcal{X}}\left\|\frac{d_{\theta}}{d_{\theta^{\prime}}}\right\|_{\infty}. In Algorithm 1, for η(1γ)34i|𝒜i|\eta\leq\frac{(1-\gamma)^{3}}{4\sum_{i}|\mathcal{A}_{i}|}, α=(1γ)ϵ6M\alpha=\frac{(1-\gamma)\epsilon}{6M}, and

TJ206976τnM4|𝒮|3maxi|𝒜i|3(1γ)8ϵ4σS2log(16τTG|𝒮|2i|𝒜i|δ)+τ,TG648M2(ΦmaxΦmin)|𝒮|ηϵ2,\begin{split}T_{\!J}\!&\geq\!\!\frac{206976\tau nM^{4}|\mathcal{S}|^{3}\!\max_{i}\!|\mathcal{A}_{i}|^{3}}{(1-\gamma)^{8}\epsilon^{4}\sigma_{S}^{2}}\!\log\!\left(\!\frac{16\tau T_{\!G}|\mathcal{S}|^{2}\!\sum_{i}\!|\mathcal{A}_{i}|\!}{\delta}\!\right)\!+\!\tau,\\ &\quad\qquad\qquad~{}T_{G}\!\geq\!\frac{648M^{2}(\!\Phi_{\max}\!-\!\Phi_{\min}\!)|\mathcal{S}|}{\eta\epsilon^{2}},\end{split}

with probability at least 1δ1\!-\!\delta, we have that:

1TGk=1TGNE-gap(θ(k))2ϵ2.\frac{1}{T_{G}}\sum\nolimits_{k=1}^{T_{G}}\textup{{NE-gap}}(\theta^{(k)})^{2}\leq\epsilon^{2}.

That is, with a proper choice of stepsize, e.g., η=(1γ)34i|𝒜i|\eta\!=\!\frac{(1-\gamma)^{3}}{4\sum_{i}\!|\mathcal{A}_{i}|}, the algorithm can find an ϵ\epsilon-NE with probability at least 1δ1-\delta with

TJTGO~(nϵ6poly(11γ,|𝒮|,maxi|𝒜i|))T_{J}T_{G}\sim\tilde{O}\left(\frac{n}{\epsilon^{6}}\textup{poly}\left(\frac{1}{1-\gamma},|\mathcal{S}|,\max_{i}|\mathcal{A}_{i}|\right)\right)\vspace{-10pt} (32)

samples, where O~\tilde{O} hides log factors.

We first compare our result with a closely related work on the sample complexity of learning MPGs [36]. Interestingly, both sample complexities are O(1/\epsilon^{6}). It is an interesting question whether such dependence is fundamental for learning with simultaneously-updating agents. However, [36] considers Monte-Carlo, model-free gradient estimation, while our algorithm takes a model-based approach, which suffers less from high variance, and our notion of an 'averaged' MDP can potentially be extended to other settings.

Proof Sketch: The proof of Theorem 5 consists of three major steps. The first step is to bound the estimation error of the parameters \overline{P_{i}^{\theta}},\overline{r_{i}^{\theta}},\overline{P_{\mathcal{S}}^{\theta}} of the 'averaged' MDP; this step leverages Assumption 1 and the Azuma-Hoeffding inequality to obtain high-probability bounds for the parameters. The second step translates the estimation error of the 'averaged' MDP into a gradient estimation error. The third step treats the gradient estimation step as an oracle that returns biased gradient information, where the bias is the estimation error; the final result is obtained by analyzing the performance of projected gradient ascent with biased gradients. The detailed proofs are provided in Appendix J.

Comparison with centralized learning: The best known sample complexity bound for single-agent/centralized MDPs is $\tilde{O}\left(\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^{3}\epsilon^{2}}\right)$ [66]. Compared with (32), the centralized bound scales better with respect to $\epsilon,|\mathcal{S}|,|\mathcal{A}_{i}|,\frac{1}{1-\gamma}$. However, as argued in the previous subsection, the total action space $|\mathcal{A}|=\prod_{i=1}^{n}|\mathcal{A}_{i}|$ in the centralized bound scales exponentially with the number of agents $n$, while our complexity bound scales only linearly. Here, we briefly state the fundamental difficulties of learning in the SG setting compared with centralized learning, which also explains why our bound scales worse with respect to the factors $\epsilon,|\mathcal{S}|,|\mathcal{A}_{i}|,\frac{1}{1-\gamma}$. 1) Firstly, the optimization landscape in the SG setting is more complicated. For centralized learning, the gradient domination property is stronger and accelerated gradient methods (e.g. via natural policy gradient or entropy regularization) can speed up the convergence of exact gradient methods from $O(\frac{1}{\epsilon^{2}})$ to $O(\frac{1}{\epsilon})$ [33], or even $O(\log(\frac{1}{\epsilon}))$ [56]. In contrast, in multi-agent settings these methods can no longer improve the dependence on $\epsilon$ because of the more complicated landscape, which makes the outer-loop complexity $T_{G}$ larger. 2) Secondly, the behavior of the other agents makes the environment non-stationary, i.e., the averaged Q-function $\overline{Q_{i}^{\theta}}$ as well as the averaged transition probability $\overline{P_{i}^{\theta}}$ depends on the other agents' policy $\theta_{-i}$. Thus, unlike centralized learning, where the state transition probability matrix can be estimated in an off-policy or even offline manner, i.e. using data samples from different policies, $\overline{P_{i}^{\theta}}$ can only be estimated in an online manner, using samples generated by exactly the same policy $\theta$, which increases the inner-loop complexity $T_{J}$. 3) Thirdly, the complicated interactions amongst agents require more care during the learning process. Algorithms designed for centralized learning that achieve near-optimal sample complexity are generally Q-learning-type algorithms. However, in SGs, it can be shown that having each agent maximize its own averaged Q-function may lead to non-convergent behavior. Thus, we need to consider algorithms that update in a less aggressive manner, e.g. soft Q-learning or policy gradient (which is considered in this paper).

5 Numerical simulations

             $a_{2}=1$    $a_{2}=2$
$a_{1}=1$    (-1,-1)      (-3,0)
$a_{1}=2$    (0,-3)       (-2,-2)

Table 2: Game 1: Reward

             $s_{2}=1$    $s_{2}=2$
$s_{1}=1$    2            0
$s_{1}=2$    0            1

Table 3: Game 2: Reward

This section studies three numerical examples to corroborate our theoretical results. The multi-stage prisoner's dilemma (Game 1) confirms the local stability results for general-sum SGs; the coordination game (Game 2) examines local stability as well as the convergence rate of exact gradient play for MPGs; the state-based coordination game (Game 3) tests the performance of the sample-based algorithm proposed in Section 4.3.

5.1 Game 1: multi-stage prisoner’s dilemma

The first example — the multi-stage prisoner's dilemma model [45] — studies exact gradient play for general SGs. It is a 2-agent SG with $\mathcal{S}=\mathcal{A}_{1}=\mathcal{A}_{2}=\{1,2\}$. The reward of each agent, $r_{i}(s,a_{1},a_{2}),~i\in\{1,2\}$, is independent of the state $s$ and is given by Table 2. The state transition probability is determined by the agents' previous actions:

P(s_{t+1}=1|(a_{1,t},a_{2,t})=(1,1))=1-\epsilon,\qquad P(s_{t+1}=1|(a_{1,t},a_{2,t})\neq(1,1))=\epsilon.

Here action $a_{i}=1$ means that agent $i$ chooses to cooperate and $a_{i}=2$ means betray. The state $s$ serves as a noisy indicator, with error rate $\epsilon$, of whether both agents cooperated ($s_{t}=1$) or not ($s_{t}=2$) in the previous stage $t-1$.
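For concreteness, the transition and reward model of Game 1 described above can be written down in a few lines; the sketch below (with an array layout chosen purely for illustration) encodes Table 2 and the transition rule, using action index 0 for 'cooperate' ($a_{i}=1$) and 1 for 'betray' ($a_{i}=2$).

```python
import numpy as np

# Multi-stage prisoner's dilemma (Game 1): S = A1 = A2 = {1, 2}, indexed 0/1 below.
eps = 0.1  # error rate of the state indicator

# Rewards r_i(s, a1, a2) are state-independent (Table 2); entry [a1, a2] = (r1, r2).
R = np.array([[(-1, -1), (-3, 0)],
              [(0, -3), (-2, -2)]], dtype=float)   # shape (2, 2, 2)

# Transition P(s'=1 | a1, a2): 1-eps if both cooperate, eps otherwise.
def next_state_prob(a1, a2):
    p_s1 = 1.0 - eps if (a1, a2) == (0, 0) else eps
    return np.array([p_s1, 1.0 - p_s1])            # distribution over s' in {1, 2}

# Example: after mutual cooperation the indicator reports s=1 with probability 1-eps.
print(next_state_prob(0, 0))   # [0.9, 0.1]
print(R[0, 1])                 # agent 1 cooperates, agent 2 betrays -> (-3, 0)
```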

Figure 1: (Game 1:) Convergence to the cooperative NE

$\epsilon$    $\Delta^{\theta^{*}}(s=1)$    convergence ratio
0.1           433.3                          (47.8 $\pm$ 5.1)%
0.05          979.3                          (66.3 $\pm$ 4.3)%
0.01          2498.6                         (77.4 $\pm$ 2.8)%

Table 4: (Game 1:) Relationship between $\Delta^{\theta^{*}}(s=1)$, the convergence ratio, and $\epsilon$. $\Delta^{\theta^{*}}(s=1)$ is calculated using (11). The convergence ratio is calculated as $\frac{\#\textup{Trials that converge to }\theta^{*}}{\#\textup{Total number of trials}}$.

Figure 2: (Game 2:) Starting from a close neighborhood of a fully mixed NE

Figure 3: (Game 2:) Total reward for multiple runs

Figure 4: (Game 2:) NE-gap for multiple runs

The single-stage game corresponds to the famous prisoner's dilemma, and it is well known that it has a unique NE $(a_{1},a_{2})=(2,2)$, where both agents decide to betray. The dilemma arises from the fact that there exists a joint non-NE strategy $(1,1)$ under which both players obtain a higher reward than under the NE. However, in the multi-stage case, the introduction of the additional state $s$ allows agents to make decisions based on whether they have cooperated before. It turns out that cooperation can be achieved provided that the discount factor $\gamma$ is close to $1$ and the indicator $s$ is accurate enough, i.e. $\epsilon$ is close to $0$. Apart from the always-betray strategy, where both agents betray regardless of $s$, there is another strict NE $\theta^{*}$ given by $\theta^{*}_{s=1,a_{i}=1}=1,~\theta^{*}_{s=2,a_{i}=1}=0$, where agents cooperate given that they have cooperated in the previous stage, and betray otherwise.

We simulate gradient play for this model and focus mainly on convergence to the cooperative equilibrium $\theta^{*}$. We fix $\gamma=0.95$. The initial policy is set as $\theta^{(0)}_{s=1,a_{i}=1}=1-0.4\delta_{i},~\theta^{(0)}_{s=2,a_{i}=1}=0$, where the $\delta_{i}$'s are uniformly sampled from $[0,1]$. This initialization means that, at the beginning, both agents are willing to cooperate to some extent given that they cooperated at the previous stage. Figure 1 shows a trial converging to the NE starting from a randomly initialized policy. We then study the size of the region of attraction of $\theta^{*}$ and how it varies with the indicator's error rate $\epsilon$, which is shown in Table 4. The size of the region of attraction of $\theta^{*}$ is reflected by the convergence ratio ($\frac{\#\textup{Trials that converge to }\theta^{*}}{\#\textup{Total number of trials}}$) over multiple trials with different initial points. Here we compute one ratio using 100 trials, and the mean and standard deviation (std) are obtained by computing the ratio 10 times using different trials. An empirical estimate of the volume of the region is the convergence ratio times the volume of the uniform sampling area; hence the larger the ratio, the larger the region of attraction. Intuitively speaking, the more accurately the state $s$ reflects the agents' cooperation in the previous stage, the less incentive the agents have to betray when observing $s=1$, i.e., the larger $\Delta^{\theta^{*}}(s=1)$ becomes, and thus the larger the convergence ratio. This intuition matches the simulation results as well as the theoretical guarantee on local convergence around a strict NE in Theorem 2.
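The convergence ratio reported in Table 4 can be estimated with a short Monte Carlo loop. The sketch below assumes a routine gradient_play (not shown) that runs exact gradient play from a given initialization and returns the final values of $\theta_{s,a_{i}=1}$; the policy layout and tolerance are our own illustrative choices.

```python
import numpy as np

def convergence_ratio(gradient_play, n_trials=100, tol=1e-2, seed=0):
    """Fraction of random initializations that converge to the cooperative NE theta*.
    `gradient_play` is assumed to run exact gradient play and return theta_{s, a_i=1}
    as a (2 agents) x (2 states) array."""
    rng = np.random.default_rng(seed)
    theta_star = np.array([[1.0, 0.0], [1.0, 0.0]])    # cooperate at s=1, betray at s=2
    hits = 0
    for _ in range(n_trials):
        delta = rng.uniform(0.0, 1.0, size=2)           # delta_i ~ Uniform[0, 1]
        theta0 = np.column_stack([1.0 - 0.4 * delta, np.zeros(2)])
        hits += int(np.max(np.abs(gradient_play(theta0) - theta_star)) < tol)
    return hits / n_trials

# Repeating this estimate 10 times with different seeds gives the mean and std in Table 4.
```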

5.2 Game 2: coordination game

Our second example is based on the coordination game [67]. It is an identical-reward game, which is a special class of Markov potential games. Consider a 2-agent identical-reward coordination game with state space $\mathcal{S}=\mathcal{S}_{1}\times\mathcal{S}_{2}$ and action space $\mathcal{A}=\mathcal{A}_{1}\times\mathcal{A}_{2}$, where $\mathcal{S}_{1}=\mathcal{S}_{2}=\mathcal{A}_{1}=\mathcal{A}_{2}=\{1,2\}$. The state transition probability is given by:

P(s_{i,t+1}=1|a_{i,t}=1)=1-\epsilon,\qquad P(s_{i,t+1}=1|a_{i,t}=2)=\epsilon,

where $i=1,2$. The reward table is given by Table 3. Here we can view the actions $\{1,2\}$ as two different social networks that the agents can choose. They are rewarded only if they are in the same network, and network 1 has a higher reward than network 2. The state $s_{i}$ stands for the network that agent $i$ actually ends up in after taking an action, and $\epsilon$ captures the randomness of reaching a network after taking an action.

There is at least one fully-mixed NE where both agents join network 1 with probability $\frac{1-3\epsilon}{3(1-2\epsilon)}$ regardless of the current occupancy of the networks, and there are 13 different strict NEs, which can be verified numerically. Figure 2 shows a gradient play trajectory whose initial point lies in a close neighborhood of the mixed NE. As the algorithm progresses, the trajectory in Figure 2 diverges from the mixed NE, indicating that the fully-mixed NE is indeed a saddle point. This corroborates our finding in Theorem 4. Figure 3 shows the evolution of the total reward $J(\theta^{(t)})$ under gradient play for different random initial points $\theta^{(0)}$. Different initial points converge to one of the 13 strict NEs, each with a different total reward (some strict NEs with relatively small regions of attraction are omitted in the figure). While the total rewards differ, Figure 4 shows that the NE-gap of each trajectory (corresponding to the same initial points as in Figure 3) converges to 0. This suggests that the algorithm is indeed able to converge to a NE. Notice that the NE-gaps do not decrease monotonically.

5.3 Game 3: state-based coordination game

Figure 5: (Game 3:) State-based coordination game; rewards are nonzero only if both players are located at the same shaded grid cell

Figure 6: Total reward $J(\theta^{(t)})$ keeps increasing

Figure 7: Marginal distributions $d_{\theta}^{x}(s_{1,x},s_{2,x})$ and $d_{\theta}^{y}(s_{1,y},s_{2,y})$

Figure 8: NE-gap converges to a value close to zero. Here the NE-gap is measured by $\max_{i}\max_{(s,a_{i})}\overline{A_{i}^{\theta}}(s,a_{i})$

Our third numerical example studies the empirical performance of the sample-based learning algorithm, Algorithm 1. Here we consider a generalization of the coordination game (Game 2) in which the two players now try to coordinate on a 2D grid. The two-player state-based coordination game on a $3\times 3$ grid is defined as follows: the state space is $\mathcal{S}=\mathcal{S}_{1}\times\mathcal{S}_{2}$, $\mathcal{S}_{1}=\mathcal{S}_{2}=\mathcal{S}_{x}\times\mathcal{S}_{y}=\{x_{a},x_{b},x_{c}\}\times\{y_{a},y_{b},y_{c}\}$, and the action space is $\mathcal{A}=\mathcal{A}_{1}\times\mathcal{A}_{2}$, $\mathcal{A}_{1}=\mathcal{A}_{2}=\{\textup{Stay},\textup{Left},\textup{Right},\textup{Up},\textup{Down}\}$, i.e., each agent can choose to stay at its current grid cell or move left/right/up/down to a neighboring cell. We assume that there is random noise in the transition: the agent might end up in a cell neighboring the target location with error probability $\epsilon$. The reward is given by:

r(s_{1},s_{2})=\mathbf{1}\{s_{1}=s_{2}=\{x_{a},y_{a}\}\textup{ or }\{x_{c},y_{c}\}\},

i.e., the two agents are rewarded only if they are at the upper-left or lower-right corner at the same time.

For the numerical simulation, we take $T_{G}=300,T_{J}=10000,\alpha=0.1,\eta=10,\epsilon=0.1$; the numerical results are displayed in Figures 6–8. Figure 6 shows that the total reward increases as the number of iterations increases, and Figure 8 shows that the NE-gap converges to a value close to zero. However, because we project the policy onto the $\alpha$-greedy set $\mathcal{X}^{\alpha}$, the NE-gap cannot converge to exactly zero. Figure 7 visualizes the discounted state visitation distribution. To make the visualization more intuitive, we look at the marginalized discounted state visitation distribution $d_{\theta}^{x}$ defined below:

d_{\theta}^{x}(s_{1,x},s_{2,x})=\sum_{s_{1,y},s_{2,y}}d_{\theta}(s_{1,x},s_{1,y},s_{2,x},s_{2,y}).

$d_{\theta}^{y}$ is defined similarly. From Figure 7 we can see that most of the probability mass concentrates on $\{(x_{a},x_{a}),(x_{c},x_{c})\}$ and $\{(y_{a},y_{a}),(y_{c},y_{c})\}$, indicating that the two agents are able to coordinate most of the time.
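The marginalization used for Figure 7 is a plain summation over the $y$-coordinates. A small NumPy sketch, assuming $d_{\theta}$ is stored as an array indexed by $(s_{1,x},s_{1,y},s_{2,x},s_{2,y})$ (an ordering we pick for illustration), is:

```python
import numpy as np

# d_theta as a joint distribution over (s1x, s1y, s2x, s2y) on the 3x3 grid.
d_theta = np.random.rand(3, 3, 3, 3)
d_theta /= d_theta.sum()             # normalize so it is a probability distribution

d_x = d_theta.sum(axis=(1, 3))       # marginal over (s_{1,x}, s_{2,x}): sums out both y coordinates
d_y = d_theta.sum(axis=(0, 2))       # marginal over (s_{1,y}, s_{2,y})

assert np.isclose(d_x.sum(), 1.0) and np.isclose(d_y.sum(), 1.0)
```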

6 Conclusion and Discussion

This paper studies the optimization landscape and the convergence of gradient play for SGs. For general SGs, we establish local convergence around strict NEs. For MPGs, we establish global convergence with respect to the NE-gap, as well as local stability results for strict NEs and fully-mixed NEs. A sample-based NE-learning algorithm with a sample complexity guarantee is also proposed in this setting. There are many interesting future directions. Firstly, the current MPG assumption is relatively strong compared with the notion of potential games in the one-shot setup, which might restrict its application to broader settings. More effort is needed to identify other special classes of SGs that facilitate efficient learning, and it would also be meaningful to investigate real-life applications, such as dynamic congestion and routing. Secondly, other sample-based learning methods, such as actor-critic, natural policy gradient, or Gauss–Newton methods, could also be considered and might improve the sample complexity.

References

  • [1] F. Daneshfar and H. Bevrani, “Load–frequency control: a ga-based multi-agent reinforcement learning,” IET generation, transmission & distribution, vol. 4, no. 1, pp. 13–26, 2010.
  • [2] S. Shalev-Shwartz, S. Shammah, and A. Shashua, “Safe, multi-agent, reinforcement learning for autonomous driving,” ArXiv, vol. abs/1610.03295, 2016.
  • [3] K. Zhang, Z. Yang, H. Liu, T. Zhang, and T. Basar, “Fully decentralized multi-agent reinforcement learning with networked agents,” in International Conference on Machine Learning.   PMLR, 2018, pp. 5872–5881.
  • [4] H.-T. Wai, Z. Yang, Z. Wang, and M. Hong, “Multi-agent reinforcement learning via double averaging primal-dual optimization,” ser. NIPS’18.   Red Hook, NY, USA: Curran Associates Inc., 2018, p. 9672–9683.
  • [5] T. Chen, K. Zhang, G. B. Giannakis, and T. Başar, “Communication-efficient policy gradient methods for distributed reinforcement learning,” IEEE Transactions on Control of Network Systems, vol. 9, no. 2, pp. 917–929, 2022.
  • [6] Y. Li, Y. Tang, R. Zhang, and N. Li, “Distributed reinforcement learning for decentralized linear quadratic control: A derivative-free policy optimization approach,” 2019.
  • [7] G. Qu, A. Wierman, and N. Li, “Scalable reinforcement learning of localized policies for multi-agent networked systems,” in Learning for Dynamics and Control.   PMLR, 2020, pp. 256–266.
  • [8] L. S. Shapley, “Stochastic games,” Proceedings of the national academy of sciences, vol. 39, no. 10, pp. 1095–1100, 1953.
  • [9] M. L. Littman, “Markov games as a framework for multi-agent reinforcement learning,” in Machine learning proceedings 1994.   Elsevier, 1994, pp. 157–163.
  • [10] M. Bowling and M. Veloso, “An analysis of stochastic game theory for multiagent reinforcement learning,” Carnegie-Mellon Univ Pittsburgh Pa School of Computer Science, Tech. Rep., 2000.
  • [11] Y. Shoham, R. Powers, and T. Grenager, “Multi-agent reinforcement learning: a critical survey,” Technical report, Stanford University, Tech. Rep., 2003.
  • [12] L. Buşoniu, R. Babuška, and B. De Schutter, “Multi-agent reinforcement learning: An overview,” Innovations in multi-agent systems and applications-1, pp. 183–221, 2010.
  • [13] M. Lanctot, V. Zambaldi, A. Gruslys, A. Lazaridou, K. Tuyls, J. Pérolat, D. Silver, and T. Graepel, “A unified game-theoretic approach to multiagent reinforcement learning,” in Proceedings of the 31st International Conference on Neural Information Processing Systems, 2017.
  • [14] K. Zhang, Z. Yang, and T. Başar, “Multi-agent reinforcement learning: A selective overview of theories and algorithms,” Handbook of reinforcement learning and control, pp. 321–384, 2021.
  • [15] J. Hu and M. P. Wellman, “Nash Q-learning for general-sum stochastic games,” Journal of machine learning research, vol. 4, no. Nov, pp. 1039–1069, 2003.
  • [16] G. Tesauro, “Extending Q-learning to general adaptive multi-agent systems,” Advances in neural information processing systems, vol. 16, pp. 871–878, 2003.
  • [17] M. Bowling and M. Veloso, “Rational and convergent learning in stochastic games,” in International joint conference on artificial intelligence, vol. 17, no. 1.   Citeseer, 2001, pp. 1021–1026.
  • [18] S. Abdallah and V. Lesser, “A multiagent reinforcement learning algorithm with non-linear dynamics,” Journal of Artificial Intelligence Research, vol. 33, pp. 521–549, 2008.
  • [19] C. Zhang and V. Lesser, “Multi-agent learning with policy prediction,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 24, no. 1, 2010.
  • [20] J. Foerster, R. Y. Chen, M. Al-Shedivat, S. Whiteson, P. Abbeel, and I. Mordatch, “Learning with opponent-learning awareness,” in Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems.   Richland, SC: International Foundation for Autonomous Agents and Multiagent Systems, 2018.
  • [21] L. Shapley, “Some topics in two-person games,” Advances in game theory, vol. 52, pp. 1–29, 1964.
  • [22] V. P. Crawford, “Learning behavior and mixed-strategy Nash equilibria,” Journal of Economic Behavior & Organization, vol. 6, no. 1, pp. 69–78, 1985.
  • [23] J. S. Jordan, “Three problems in learning mixed-strategy Nash equilibria,” Games and Economic Behavior, vol. 5, no. 3, pp. 368–386, 1993.
  • [24] V. Krishna and T. Sjöström, “On the convergence of fictitious play,” Mathematics of Operations Research, vol. 23, no. 2, pp. 479–511, 1998.
  • [25] J. S. Shamma and G. Arslan, “Dynamic fictitious play, dynamic gradient play, and distributed convergence to Nash equilibria,” IEEE Transactions on Automatic Control, vol. 50, no. 3, pp. 312–327, 2005.
  • [26] E. Kohlberg and J.-F. Mertens, “On the strategic stability of equilibria,” Econometrica: Journal of the Econometric Society, pp. 1003–1037, 1986.
  • [27] E. Van Damme, Stability and perfection of Nash equilibria.   Springer, 1991, vol. 339.
  • [28] D. Foster and H. P. Young, “Regret testing: Learning to play Nash equilibrium without knowing you have an opponent,” Theoretical Economics, vol. 1, no. 3, pp. 341–367, 2006.
  • [29] F. Germano and G. Lugosi, “Global Nash convergence of foster and young’s regret testing,” Games and Economic Behavior, vol. 60, no. 1, pp. 135–154, 2007.
  • [30] J. R. Marden, H. P. Young, G. Arslan, and J. S. Shamma, “Payoff-based dynamics for multiplayer weakly acyclic games,” SIAM Journal on Control and Optimization, vol. 48, no. 1, pp. 373–396, 2009.
  • [31] E. Mazumdar, L. J. Ratliff, and S. S. Sastry, “On gradient-based learning in continuous games,” SIAM Journal on Mathematics of Data Science, vol. 2, no. 1, pp. 103–131, 2020.
  • [32] K. Zhang, Z. Yang, and T. Basar, “Policy optimization provably converges to Nash equilibria in zero-sum linear quadratic games,” in Advances in Neural Information Processing Systems, vol. 32, 2019.
  • [33] A. Agarwal, S. M. Kakade, J. D. Lee, and G. Mahajan, “On the theory of policy gradient methods: Optimality, approximation, and distribution shift,” 2020.
  • [34] D. González-Sánchez and O. Hernández-Lerma, Discrete–time stochastic control and dynamic potential games: the Euler–Equation approach.   Springer Science & Business Media, 2013.
  • [35] S. V. Macua, J. Zazo, and S. Zazo, “Learning parametric closed-loop policies for Markov potential games,” in International Conference on Learning Representations, 2018.
  • [36] S. Leonardos, W. Overman, I. Panageas, and G. Piliouras, “Global convergence of multi-agent policy gradient in Markov potential games,” in International Conference on Learning Representations, 2022.
  • [37] M. Tan, “Multi-agent reinforcement learning: Independent vs. cooperative agents,” in Proceedings of the tenth international conference on machine learning, 1993, pp. 330–337.
  • [38] C. Claus and C. Boutilier, “The dynamics of reinforcement learning in cooperative multiagent systems,” AAAI/IAAI, vol. 1998, no. 746-752, p. 2, 1998.
  • [39] L. Panait and S. Luke, “Cooperative multi-agent learning: The state of the art,” Autonomous agents and multi-agent systems, vol. 11, no. 3, pp. 387–434, 2005.
  • [40] L. Matignon, G. J. Laurent, and N. Le Fort-Piat, “Independent reinforcement learners in cooperative markov games: a survey regarding coordination problems,” The Knowledge Engineering Review, vol. 27, no. 1, pp. 1–31, 2012.
  • [41] Z. Zhang, Y.-S. Ong, D. Wang, and B. Xue, “A collaborative multiagent reinforcement learning method based on policy gradient potential,” IEEE transactions on cybernetics, vol. 51, no. 2, pp. 1015–1027, 2019.
  • [42] A. Oroojlooy and D. Hajinezhad, “A review of cooperative multi-agent deep reinforcement learning,” Applied Intelligence, vol. 53, no. 11, pp. 13 677–13 722, 2023.
  • [43] Z. Song, S. Mei, and Y. Bai, “When can we learn general-sum Markov games with a large number of players sample-efficiently?” 2021.
  • [44] C. Jin, Q. Liu, Y. Wang, and T. Yu, “V-learning–a simple, efficient, decentralized algorithm for multiagent rl,” arXiv preprint arXiv:2110.14555, 2021.
  • [45] G. Arslan and S. Yüksel, “Decentralized Q-learning for stochastic teams and games,” IEEE Transactions on Automatic Control, vol. 62, no. 4, pp. 1545–1558, 2016.
  • [46] B. Yongacoglu, G. Arslan, and S. Yüksel, “Decentralized learning for optimality in stochastic dynamic teams and games with local control and global state information,” IEEE Transactions on Automatic Control, vol. 67, no. 10, pp. 5230–5245, 2022.
  • [47] C. Daskalakis, D. J. Foster, and N. Golowich, “Independent policy gradient methods for competitive reinforcement learning,” Advances in neural information processing systems, vol. 33, pp. 5527–5540, 2020.
  • [48] A. Ozdaglar, M. O. Sayin, and K. Zhang, “Independent learning in stochastic games,” arXiv preprint arXiv:2111.11743, 2021.
  • [49] R. Zhang, Z. Ren, and N. Li, “Gradient play in multi-agent Markov stochastic games: Stationary points and local geometry,” in International Symposium on Mathematical Theory of Networks and Systems, 2022.
  • [50] R. Zhang, J. Mei, B. Dai, D. Schuurmans, and N. Li, “On the global convergence rates of decentralized softmax gradient play in markov potential games,” Advances in Neural Information Processing Systems, vol. 35, pp. 1923–1935, 2022.
  • [51] R. Fox, S. M. Mcaleer, W. Overman, and I. Panageas, “Independent natural policy gradient always converges in Markov potential games,” in International Conference on Artificial Intelligence and Statistics.   PMLR, 2022, pp. 4414–4425.
  • [52] D. Ding, C.-Y. Wei, K. Zhang, and M. Jovanovic, “Independent policy gradient for large-scale markov potential games: Sharper rates, function approximation, and game-agnostic convergence,” in International Conference on Machine Learning.   PMLR, 2022, pp. 5166–5220.
  • [53] A. Agarwal, N. Jiang, S. M. Kakade, and W. Sun, “Reinforcement learning: Theory and algorithms,” CS Dept., UW Seattle, Seattle, WA, USA, Tech. Rep, vol. 32, 2019.
  • [54] D. Fudenberg and J. Tirole, Game theory.   MIT press, 1991.
  • [55] M. Maschler, S. Zamir, and E. Solan, Game theory.   Cambridge University Press, 2020.
  • [56] J. Mei, C. Xiao, C. Szepesvari, and D. Schuurmans, “On the global convergence rates of softmax policy gradient methods,” in Proceedings of the 37th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, vol. 119.   PMLR, 13–18 Jul 2020, pp. 6820–6829.
  • [57] S. M. Kakade and J. Langford, “Approximately optimal approximate reinforcement learning,” in Machine Learning, Proceedings of the Nineteenth International Conference (ICML 2002), University of New South Wales, Sydney, Australia, July 8-12, 2002, 2002, pp. 267–274.
  • [58] R. S. Sutton, D. A. McAllester, S. P. Singh, Y. Mansour et al., “Policy gradient methods for reinforcement learning with function approximation.” in NIPs, vol. 99.   Citeseer, 1999, pp. 1057–1063.
  • [59] D. Levhari and L. Mirman, “The great fish war: An example using a dynamic Cournot-Nash solution,” Bell Journal of Economics, vol. 11, no. 1, pp. 322–334, 1980.
  • [60] W. D. Dechert and S. O’Donnell, “The stochastic lake game: A numerical solution,” Journal of Economic Dynamics and Control, vol. 30, no. 9-10, pp. 1569–1587, 2006.
  • [61] D. H. Mguni, Y. Wu, Y. Du, Y. Yang, Z. Wang, M. Li, Y. Wen, J. Jennings, and J. Wang, “Learning in nonzero-sum stochastic games with potentials,” in International Conference on Machine Learning, 2021. [Online]. Available: https://api.semanticscholar.org/CorpusID:232257740
  • [62] A. Beck, First-Order Methods in Optimization.   Philadelphia, PA: Society for Industrial and Applied Mathematics, 2017. [Online]. Available: https://epubs.siam.org/doi/abs/10.1137/1.9781611974997
  • [63] S. Ghadimi and G. Lan, “Accelerated gradient methods for nonconvex nonlinear and stochastic programming,” Mathematical Programming, vol. 156, no. 1-2, pp. 59–99, 2016.
  • [64] R. Srikant and L. Ying, “Finite-time error bounds for linear stochastic approximation and TD learning,” in Conference on Learning Theory.   PMLR, 2019, pp. 2803–2830.
  • [65] G. Li, Y. Wei, Y. Chi, Y. Gu, and Y. Chen, “Sample complexity of asynchronous q-learning: Sharper analysis and variance reduction,” IEEE Transactions on Information Theory, vol. 68, no. 1, pp. 448–473, 2022.
  • [66] A. Sidford, M. Wang, X. Wu, L. Yang, and Y. Ye, “Near-optimal time and sample complexities for solving Markov decision processes with a generative model,” in Advances in Neural Information Processing Systems, vol. 31, 2018.
  • [67] R. Cooper, Coordination games.   Cambridge University Press, 1999.
  • [68] D. Monderer and L. S. Shapley, “Potential games,” Games and economic behavior, vol. 14, no. 1, pp. 124–143, 1996.
  • [69] N. Bertrand, N. Markey, S. Sadhukhan, and O. Sankur, “Dynamic network congestion games,” arXiv preprint arXiv:2009.13632, 2020.
  • [70] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction.   Cambridge, MA, USA: A Bradford Book, 2018.
  • [71] W. Hoeffding, “Probability inequalities for sums of bounded random variables,” in The collected works of Wassily Hoeffding.   Springer, 1994, pp. 409–426.
  • [72] K. Azuma, “Weighted sums of certain dependent random variables,” Tohoku Mathematical Journal, Second Series, vol. 19, no. 3, pp. 357–367, 1967.
  • [73] W. Wang and M. Á. Carreira-Perpiñán, “Projection onto the probability simplex: An efficient algorithm with a simple proof, and an application,” CoRR, vol. abs/1309.1541, 2013. [Online]. Available: http://arxiv.org/abs/1309.1541

Appendix A More about Markov potential games

This section is dedicated to a more thorough understanding of Markov potential games; it includes a necessary condition, a sufficient condition, and a few (counter)examples.

A.1. A necessary condition and counterexamples

Definition 5.

([68]) Define a path in the parameter space as $\tau=(\theta^{(0)},\theta^{(1)},\dots,\theta^{(N)})$, where $\theta^{(t)}$ and $\theta^{(t+1)}$ differ in only one component $i_{t}$, i.e. $\theta^{(t+1)}=(\theta^{(t+1)}_{i_{t}},\theta^{(t)}_{-i_{t}})$. A closed path is a path such that $\theta^{(0)}=\theta^{(N)}$. Define:

I(\tau):=\sum_{t=1}^{N}J_{i_{t}}(\theta^{(t)})-J_{i_{t}}(\theta^{(t-1)}).

The following lemma is a direct generalization of Theorem 2.8 in [68] to the MPG setting:

Lemma 6.

For Markov potential games, $I(\tau)=0$ for any finite closed path $\tau$.

Proof.

The proof follows directly from the definition of an MPG:

I(\tau)=\sum_{t=1}^{N}J_{i_{t}}(\theta^{(t)})-J_{i_{t}}(\theta^{(t-1)})
=\sum_{t=1}^{N}\Phi(\theta^{(t)})-\Phi(\theta^{(t-1)})
=\Phi(\theta^{(N)})-\Phi(\theta^{(0)})
=0. ∎

Although the proof of Lemma 6 is straightforward, the lemma serves as a useful tool for showing that a game is not a MPG. For example, applying it we can show that none of the following conditions is sufficient for a SG to be a MPG.

Proposition 2.

None of the following conditions on a SG implies that it is a MPG:

(1) There exists $\phi(s,a)$ such that, at each $s$, $r_{i}(s,a_{i}^{\prime},a_{-i})-r_{i}(s,a_{i},a_{-i})=\phi(s,a_{i}^{\prime},a_{-i})-\phi(s,a_{i},a_{-i})$;

(2) There exists $\phi(s,a)$ such that, for every $s,\hat{s}$, $r_{i}(s,a_{i}^{\prime},a_{-i})-r_{i}(\hat{s},a_{i},a_{-i})=\phi(s,a_{i}^{\prime},a_{-i})-\phi(\hat{s},a_{i},a_{-i})$;

(3) The rewards $r_{i}$ are independent of $s$ and admit a potential function, i.e., $r_{i}(a_{i},a_{-i})-r_{i}(a_{i}^{\prime},a_{-i})=\phi(a_{i},a_{-i})-\phi(a_{i}^{\prime},a_{-i})$.

Proof.

(of Proposition 2) A simple counterexample showing that the conditions in Proposition 2 are not sufficient is the multi-stage prisoner's dilemma (Game 1) introduced in the numerical section (Section 5). Since the reward table of the multi-stage prisoner's dilemma is the same as that of the one-shot prisoner's dilemma (which is known to be a potential game), Game 1 satisfies condition (3) in Proposition 2, which implies condition (2), which in turn implies condition (1). In the following we use Lemma 6 to show that Game 1 is nevertheless not a MPG. We define the following individual policies:

\theta_{i}^{\text{Defect}}:\quad\theta^{\text{Defect}}_{s=1,a_{i}=1}=0,~~\theta^{\text{Defect}}_{s=2,a_{i}=1}=0
\theta_{i}^{\text{Coop}}:\quad\theta^{\text{Coop}}_{s=1,a_{i}=1}=1,~~\theta^{\text{Coop}}_{s=2,a_{i}=1}=0
\theta_{i}^{\text{Always\_coop}}:\quad\theta^{\text{Always\_coop}}_{s=1,a_{i}=1}=1,~~\theta^{\text{Always\_coop}}_{s=2,a_{i}=1}=1

Let:

\theta^{(0)}=(\theta_{1}^{\text{Defect}},\theta_{2}^{\text{Coop}}),\quad\theta^{(1)}=(\theta_{1}^{\text{Coop}},\theta_{2}^{\text{Coop}}),\quad\theta^{(2)}=(\theta_{1}^{\text{Coop}},\theta_{2}^{\text{Always\_coop}}),\quad\theta^{(3)}=(\theta_{1}^{\text{Defect}},\theta_{2}^{\text{Always\_coop}}),

and define a closed path $\tau$ by:

\tau=(\theta^{(0)},\theta^{(1)},\theta^{(2)},\theta^{(3)},\theta^{(4)}),\quad\theta^{(4)}=\theta^{(0)}.

For ease of calculation, we set $\epsilon=0$ and the initial state $s_{0}=1$ in Game 1. A direct calculation of the total rewards along this path shows that the increments $J_{i_{t}}(\theta^{(t)})-J_{i_{t}}(\theta^{(t-1)})$ sum to a strictly positive value, i.e., $I(\tau)>0$. This indicates that although Game 1 satisfies condition (3) (as well as conditions (1) and (2)), it is still not a MPG. ∎

A.2. A sufficient condition

Proposition 2 suggests that the MPG assumption is quite restrictive: even if the reward table of a SG is, at every state, that of a one-shot potential game, the SG may still not be a MPG. Nevertheless, the following condition is sufficient for a stochastic game to be a MPG:

Lemma 7.

A stochastic game is a MPG if condition (1) in Proposition 2 is satisfied and the transition probability is action-independent, i.e., $P(s^{\prime}|s,a)=P(s^{\prime}|s)$.

Proof.

$P(s^{\prime}|s,a)=P(s^{\prime}|s)$ implies that the discounted state visitation distribution $d_{\theta}$ does not depend on $\theta$, so we denote it by $d(s)$ instead. Condition (1) implies that $\phi(s,a_{i},a_{-i})-r_{i}(s,a_{i},a_{-i})$ depends only on $s$ and $a_{-i}$ but not on $a_{i}$, so we denote the difference by $\delta_{i}(s,a_{-i})$, i.e.,

\phi(s,a_{i},a_{-i})-r_{i}(s,a_{i},a_{-i})=\delta_{i}(s,a_{-i}).

The total reward of agent $i$ can be written as:

J_{i}(\theta)=\sum_{s}d(s)\sum_{a}\pi_{\theta}(a|s)r_{i}(s,a)=\sum_{s}d(s)\sum_{a_{i}}\pi_{\theta_{i}}(a_{i}|s)\sum_{a_{-i}}\pi_{\theta_{-i}}(a_{-i}|s)r_{i}(s,a_{i},a_{-i}).

Similarly, the total potential function can be written as:

\Phi(\theta)=\sum_{s}d(s)\sum_{a}\pi_{\theta}(a|s)\phi(s,a)=\sum_{s}d(s)\sum_{a_{i}}\pi_{\theta_{i}}(a_{i}|s)\sum_{a_{-i}}\pi_{\theta_{-i}}(a_{-i}|s)\phi(s,a_{i},a_{-i}).

Thus,

\Phi(\theta)-J_{i}(\theta)=\sum_{s}d(s)\sum_{a_{i}}\pi_{\theta_{i}}(a_{i}|s)\sum_{a_{-i}}\pi_{\theta_{-i}}(a_{-i}|s)\left(\phi(s,a_{i},a_{-i})-r_{i}(s,a_{i},a_{-i})\right)
=\sum_{s}d(s)\sum_{a_{i}}\pi_{\theta_{i}}(a_{i}|s)\sum_{a_{-i}}\pi_{\theta_{-i}}(a_{-i}|s)\delta_{i}(s,a_{-i})
=\sum_{s}d(s)\sum_{a_{-i}}\pi_{\theta_{-i}}(a_{-i}|s)\delta_{i}(s,a_{-i}),

which does not depend on the parameter $\theta_{i}$, i.e.,

\Phi(\theta_{i}^{\prime},\theta_{-i})-J_{i}(\theta_{i}^{\prime},\theta_{-i})=\Phi(\theta_{i},\theta_{-i})-J_{i}(\theta_{i},\theta_{-i}),\quad\forall(\theta_{i}^{\prime},\theta_{-i}),(\theta_{i},\theta_{-i})\in\mathcal{X},

which completes the proof. ∎

A.3. MPG with local states and an application example

From Proposition 2, we see that it is difficult for a SG to be a MPG even if the game is a potential game at each state, and Lemma 7 covers only the very special case where the actions do not affect the state, so that the MPG is merely a collection of one-shot potential games. To provide a MPG beyond the identical-interest case and the case in Lemma 7, inspired by the settings in [7] and [35], we consider a special multi-agent setting where $\mathcal{S}=\mathcal{S}_{1}\times\dots\times\mathcal{S}_{n}$ and $\mathcal{S}_{i}$ is the local state space of agent $i$. In addition, the transition probability takes the decomposed form $P(s^{\prime}|s,a)=\prod_{i=1}^{n}P(s_{i}^{\prime}|s_{i},a_{i})$. The rest of the SG setting is the same as in Section 2. Deviating slightly from the main text, we consider localized policies where each agent takes actions based on its own local state,

\pi_{\theta}(a_{t}|s_{t})=\prod_{i=1}^{n}\pi_{\theta_{i}}(a_{i,t}|s_{i,t}),

with the localized direct parameterization:

\pi_{\theta_{i}}(a_{i,t}|s_{i,t})=\theta_{(s_{i},a_{i})},\quad\theta_{i}\in\Delta(\mathcal{A}_{i})^{|\mathcal{S}_{i}|}.

We use $\mathcal{X}_{i}^{\textup{local}}:=\Delta(\mathcal{A}_{i})^{|\mathcal{S}_{i}|}$ to denote the feasible region of $\theta_{i}$, and the feasible region of $\theta$ is denoted by $\mathcal{X}^{\textup{local}}:=\mathcal{X}_{1}^{\textup{local}}\times\dots\times\mathcal{X}_{n}^{\textup{local}}$.

Lemma 8.

If there is a function $\phi(s,a)$ such that for every agent $i$, $r_{i}(s_{i},s_{-i},a_{i},a_{-i})=\phi(s_{i},s_{-i},a_{i},a_{-i})+\psi_{i}(s_{-i},a_{-i})$, where $\psi_{i}$ only depends on $s_{-i},a_{-i}$, then this SG is a MPG, i.e., for any parameters $(\theta_{i}^{\prime},\theta_{-i}),(\theta_{i},\theta_{-i})\in\mathcal{X}^{\textup{local}}$, the equation in Definition 3 is satisfied.

The proof is straightforward given the local structure of the MDP and the localized policies. This class of MPGs covers nontrivial multi-agent applications such as medium access control [35] and dynamic congestion control [69]. Below we present medium access control as one such example.

Real application - medium access control. We consider the discretized version of the dynamic medium access control game introduced in [35], where each agent is a user that tries to transmit data over a single transmission medium by injecting power into the wireless network. Each user's goal is to maximize its data rate and battery lifespan. If multiple users transmit at the same time, they interfere with each other and decrease their data rates. Here user $i$'s state is $s_{i}\in\mathcal{S}_{i}=\{0,1,\dots,B_{i,\max}\}$, which denotes its own battery level, where $B_{i,\max}$ is its initial battery level, and we use $\delta_{i}$ to denote its discharging factor. Its action $a_{i}\in\mathcal{A}_{i}=\{0,1,\dots,P_{i,\max}\}$ denotes the power injected into the network at each time step, where $P_{i,\max}$ is the maximum allowed power. The state transition is deterministic and describes the discharging of the battery proportionally to the transmission power:

s_{i,t+1}=s_{i,t}-\delta_{i}a_{i,t}.

The stage reward of user $i$ is given by:

r_{i}(s,a)=\log\left(1+\frac{|h_{i}|^{2}a_{i}}{1+\sum_{j\neq i}|h_{j}|^{2}a_{j}}\right)+\alpha s_{i},

where $h_{i}$ is the random fading channel coefficient of user $i$.

By noticing that $r_{i}(s,a)=\log\left(1+\sum_{j=1}^{n}|h_{j}|^{2}a_{j}\right)+\alpha\sum_{j}s_{j}-\log\left(1+\sum_{j\neq i}|h_{j}|^{2}a_{j}\right)-\alpha\sum_{j\neq i}s_{j}$, we can apply Lemma 8 to verify that the medium access control problem is indeed a MPG, with potential function $\phi$ given by:

\phi(s,a)=\log\left(1+\sum_{i=1}^{n}|h_{i}|^{2}a_{i}\right)+\alpha\sum_{i=1}^{n}s_{i}.
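The decomposition required by Lemma 8 can also be checked numerically: $r_{i}(s,a)-\phi(s,a)$ should not change when user $i$ varies its own $(s_{i},a_{i})$. A small sketch, with arbitrary channel gains and an illustrative value of $\alpha$, is:

```python
import numpy as np

rng = np.random.default_rng(0)
n, alpha = 3, 0.5
h2 = rng.uniform(0.5, 2.0, size=n)        # |h_i|^2, arbitrary channel gains

def r(i, s, a):
    interference = 1.0 + np.sum(h2 * a) - h2[i] * a[i]
    return np.log(1.0 + h2[i] * a[i] / interference) + alpha * s[i]

def phi(s, a):
    return np.log(1.0 + np.sum(h2 * a)) + alpha * np.sum(s)

s = np.array([5.0, 3.0, 4.0])             # battery levels
a = np.array([2.0, 1.0, 0.0])             # transmit powers
i = 0
base = r(i, s, a) - phi(s, a)             # should equal psi_i(s_{-i}, a_{-i})
for si_new, ai_new in [(5.0, 0.0), (2.0, 3.0), (0.0, 1.0)]:
    s2, a2 = s.copy(), a.copy()
    s2[i], a2[i] = si_new, ai_new
    assert np.isclose(r(i, s2, a2) - phi(s2, a2), base)   # difference independent of (s_i, a_i)
```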

Appendix B Proof of Lemma 2

Proof.

According to the performance difference lemma (cf. [57]), letting $\theta^{\prime}:=(\theta_{i}^{\prime},\theta_{-i})$,

J_{i}(\theta_{i}^{\prime},\theta_{-i})-J_{i}(\theta_{i},\theta_{-i})=\frac{1}{1-\gamma}\sum_{s,a}d_{\theta^{\prime}}(s)\pi_{\theta^{\prime}}(a|s)A_{i}^{\theta}(s,a)
=\frac{1}{1-\gamma}\sum_{s,a_{i}}d_{\theta^{\prime}}(s)\pi_{\theta^{\prime}_{i}}(a_{i}|s)\sum_{a_{-i}}\pi_{\theta_{-i}}(a_{-i}|s)A_{i}^{\theta}(s,a_{i},a_{-i})
=\frac{1}{1-\gamma}\sum_{s,a_{i}}d_{\theta^{\prime}}(s)\pi_{\theta^{\prime}_{i}}(a_{i}|s)\overline{A_{i}^{\theta}}(s,a_{i}). ∎

Appendix C Derivation of (29) (calculation of $d_{\theta}$)

From the definition of $d_{\theta}$:

d_{\theta}(s)=\mathbb{E}_{s_{0}\sim\rho}(1-\gamma)\sum_{t=0}^{\infty}\gamma^{t}\textup{Pr}^{\theta}(s_{t}=s|s_{0}),

we have that:

d_{\theta}=(1-\gamma)\left(\rho+\gamma\overline{P_{S}^{\theta}}^{\top}\rho+\gamma^{2}(\overline{P_{S}^{\theta}}^{2})^{\top}\rho+\cdots\right)
=(1-\gamma)\left(I+\gamma\overline{P_{S}^{\theta}}+\gamma^{2}\overline{P_{S}^{\theta}}^{2}+\cdots\right)^{\top}\rho
=(1-\gamma)(I-\gamma\overline{P_{S}^{\theta}})^{-\top}\rho.
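In code, this amounts to a single linear solve. A small NumPy sketch, using an arbitrary 'averaged' transition matrix $\overline{P_{S}^{\theta}}$ (rows indexed by the current state) purely for illustration, is:

```python
import numpy as np

gamma = 0.9
rho = np.array([0.5, 0.3, 0.2])                  # initial state distribution
P_bar = np.array([[0.7, 0.2, 0.1],               # averaged transitions, P_bar[s, s'] = P_S^theta(s' | s)
                  [0.1, 0.8, 0.1],
                  [0.2, 0.3, 0.5]])

# d_theta = (1 - gamma) (I - gamma * P_bar)^{-T} rho, i.e. solve (I - gamma P_bar)^T d = (1 - gamma) rho.
d_theta = np.linalg.solve((np.eye(3) - gamma * P_bar).T, (1.0 - gamma) * rho)

assert np.isclose(d_theta.sum(), 1.0)            # d_theta is a probability distribution
```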

Appendix D Proof of Lemma 3

Proof.

According to the policy gradient theorem (6):

\frac{\partial J_{i}(\theta)}{\partial\theta_{s,a_{i}}}=\frac{1}{1-\gamma}\sum_{s^{\prime}}\sum_{a^{\prime}}d_{\theta}(s^{\prime})\pi_{\theta}(a^{\prime}|s^{\prime})\frac{\partial\log\pi_{\theta}(a^{\prime}|s^{\prime})}{\partial\theta_{s,a_{i}}}Q_{i}^{\theta}(s^{\prime},a^{\prime}).

Since, for the direct parameterization:

\frac{\partial\log\pi_{\theta}(a^{\prime}|s^{\prime})}{\partial\theta_{s,a_{i}}}=\frac{\partial\log\pi_{\theta_{i}}(a_{i}^{\prime}|s^{\prime})}{\partial\theta_{s,a_{i}}}=\mathbf{1}\{a_{i}^{\prime}=a_{i},s^{\prime}=s\}\frac{1}{\theta_{s,a_{i}}}=\mathbf{1}\{a_{i}^{\prime}=a_{i},s^{\prime}=s\}\frac{1}{\pi_{\theta_{i}}(a_{i}|s)},

we have that:

\frac{\partial J_{i}(\theta)}{\partial\theta_{s,a_{i}}}=\frac{1}{1-\gamma}\sum_{s^{\prime}}\sum_{a^{\prime}}d_{\theta}(s^{\prime})\pi_{\theta}(a^{\prime}|s^{\prime})\mathbf{1}\{a_{i}^{\prime}=a_{i},s^{\prime}=s\}\frac{1}{\pi_{\theta_{i}}(a_{i}|s)}Q_{i}^{\theta}(s^{\prime},a^{\prime})
=\frac{1}{1-\gamma}\sum_{a_{-i}^{\prime}}d_{\theta}(s)\pi_{\theta_{i}}(a_{i}|s)\pi_{\theta_{-i}}(a_{-i}^{\prime}|s)\frac{1}{\pi_{\theta_{i}}(a_{i}|s)}Q_{i}^{\theta}(s,a_{i},a_{-i}^{\prime})
=\frac{1}{1-\gamma}d_{\theta}(s)\sum_{a_{-i}^{\prime}}\pi_{\theta_{-i}}(a_{-i}^{\prime}|s)Q_{i}^{\theta}(s,a_{i},a_{-i}^{\prime})
=\frac{1}{1-\gamma}d_{\theta}(s)\overline{Q_{i}^{\theta}}(s,a_{i}). ∎

Appendix E Proof of Theorem 2 and Lemma 5

Proof.

(Lemma 5) For a given strict NE $\theta^{*}$, pick for each state $s$

a_{i}^{*}(s)\in\operatorname*{arg\,max}_{a_{i}}\overline{A_{i}^{\theta^{*}}}(s,a_{i}),

and set $\theta_{i}$ to be:

\theta_{s,a_{i}}=\mathbf{1}\{a_{i}=a_{i}^{*}(s)\}.

Let $\theta:=(\theta_{i},\theta_{-i}^{*})$. From the performance difference lemma (Lemma 2):

J_{i}(\theta_{i},\theta_{-i}^{*})-J_{i}(\theta_{i}^{*},\theta_{-i}^{*})=\frac{1}{1-\gamma}\sum_{s,a_{i}}d_{\theta}(s)\pi_{\theta_{i}}(a_{i}|s)\overline{A_{i}^{\theta^{*}}}(s,a_{i})=\frac{1}{1-\gamma}\sum_{s}d_{\theta}(s)\max_{a_{i}}\overline{A_{i}^{\theta^{*}}}(s,a_{i})\geq 0.

Because $\theta^{*}$ is a strict NE, the inequality above forces $\theta_{i}=\theta_{i}^{*}$ and $\max_{a_{i}}\overline{A_{i}^{\theta^{*}}}(s,a_{i})=0$ for every $s$. Strictness of the NE also implies the uniqueness of $a_{i}^{*}(s)$, and thus,

\overline{A_{i}^{\theta^{*}}}(s,a_{i})<0,~~\forall~a_{i}\neq a_{i}^{*}(s),

which completes the proof of the lemma. ∎

The proof of Theorem 2 relies on the following auxiliary lemma, whose proof we defer to Appendix L.

Lemma (Auxiliary).

Let $\mathcal{X}$ denote the probability simplex of dimension $n$. Suppose $\theta\in\mathcal{X}$, $g\in\mathbb{R}^{n}$, and that there exist $i^{*}\in\{1,2,\dots,n\}$ and $\Delta>0$ such that:

\theta_{i^{*}}\geq\theta_{i},\quad\forall i\neq i^{*},\qquad g_{i^{*}}\geq g_{i}+\Delta,\quad\forall i\neq i^{*}.

Let

\theta^{\prime}=Proj_{\mathcal{X}}(\theta+g),

then:

\theta^{\prime}_{i^{*}}\geq\min\left\{1,\theta_{i^{*}}+\frac{\Delta}{2}\right\}.
Proof.

(Theorem 2) For a fixed agent $i$ and state $s$, the gradient play update rule (5) for the policy $\theta_{i,s}$ is given by:

\theta_{i,s}^{(t+1)}=Proj_{\Delta(|\mathcal{A}_{i}|)}\left(\theta_{i,s}^{(t)}+\frac{\eta}{1-\gamma}d_{\theta^{(t)}}(s)\overline{Q_{i}^{\theta^{(t)}}}(s,\cdot)\right), (33)

where $\Delta(|\mathcal{A}_{i}|)$ denotes the probability simplex of dimension $|\mathcal{A}_{i}|$ and $\overline{Q_{i}^{\theta^{(t)}}}(s,\cdot)$ is the $|\mathcal{A}_{i}|$-dimensional vector whose $a_{i}$-th element equals $\overline{Q_{i}^{\theta^{(t)}}}(s,a_{i})$. We will show that this update rule satisfies the conditions of the auxiliary lemma above, which will then allow us to prove that

D(\theta^{(t+1)}||\theta^{*})\leq\max\left\{0,D(\theta^{(t)}||\theta^{*})-\frac{\eta\Delta^{\theta^{*}}}{2}\right\}.

Letting $a_{i}^{*}(s)$ be defined as in Lemma 5, we have that:

\frac{1}{1-\gamma}d_{\theta^{(t)}}(s)\overline{Q_{i}^{\theta^{(t)}}}(s,a_{i}^{*}(s))-\frac{1}{1-\gamma}d_{\theta^{(t)}}(s)\overline{Q_{i}^{\theta^{(t)}}}(s,a_{i})
\geq\frac{1}{1-\gamma}d_{\theta^{*}}(s)\overline{Q_{i}^{\theta^{*}}}(s,a_{i}^{*}(s))-\frac{1}{1-\gamma}d_{\theta^{*}}(s)\overline{Q_{i}^{\theta^{*}}}(s,a_{i})
\quad-\left|\frac{1}{1-\gamma}d_{\theta^{*}}(s)\overline{Q_{i}^{\theta^{*}}}(s,a_{i}^{*}(s))-\frac{1}{1-\gamma}d_{\theta^{(t)}}(s)\overline{Q_{i}^{\theta^{(t)}}}(s,a_{i}^{*}(s))\right|
\quad-\left|\frac{1}{1-\gamma}d_{\theta^{*}}(s)\overline{Q_{i}^{\theta^{*}}}(s,a_{i})-\frac{1}{1-\gamma}d_{\theta^{(t)}}(s)\overline{Q_{i}^{\theta^{(t)}}}(s,a_{i})\right|
\geq\frac{1}{1-\gamma}d_{\theta^{*}}(s)\left(\overline{A_{i}^{\theta^{*}}}(s,a_{i}^{*}(s))-\overline{A_{i}^{\theta^{*}}}(s,a_{i})\right)-2\|\nabla_{\theta_{i}}J_{i}(\theta^{(t)})-\nabla_{\theta_{i}}J_{i}(\theta^{*})\| (34)
\geq\Delta^{\theta^{*}}-\frac{4}{(1-\gamma)^{3}}\left(\sum_{i=1}^{n}|\mathcal{A}_{i}|\right)\|\theta^{(t)}-\theta^{*}\| (35)
\geq\Delta^{\theta^{*}}-\frac{4}{(1-\gamma)^{3}}\left(\sum_{i=1}^{n}|\mathcal{A}_{i}|\right)\sum_{i=1}^{n}\sum_{s}\|\theta^{(t)}_{i,s}-\theta^{*}_{i,s}\|_{1}
\geq\Delta^{\theta^{*}}-\frac{4}{(1-\gamma)^{3}}n|\mathcal{S}|\left(\sum_{i=1}^{n}|\mathcal{A}_{i}|\right)D(\theta^{(t)}||\theta^{*}),

where the step from (34) to (35) uses the smoothness property in Lemma 17.

We proceed by induction: suppose that for all $\ell\leq t-1$ we have

D(\theta^{(\ell+1)}||\theta^{*})\leq\max\left\{D(\theta^{(\ell)}||\theta^{*})-\frac{\eta\Delta^{\theta^{*}}}{2},0\right\};

thus

D(\theta^{(t)}||\theta^{*})\leq D(\theta^{(0)}||\theta^{*})\leq\frac{\Delta^{\theta^{*}}(1-\gamma)^{3}}{8n|\mathcal{S}|\left(\sum_{i=1}^{n}|\mathcal{A}_{i}|\right)}.

Then we can further conclude that:

\frac{1}{1-\gamma}d_{\theta^{(t)}}(s)\overline{Q_{i}^{\theta^{(t)}}}(s,a_{i}^{*}(s))-\frac{1}{1-\gamma}d_{\theta^{(t)}}(s)\overline{Q_{i}^{\theta^{(t)}}}(s,a_{i})
\geq\Delta^{\theta^{*}}-\frac{4}{(1-\gamma)^{3}}n|\mathcal{S}|\left(\sum_{i=1}^{n}|\mathcal{A}_{i}|\right)D(\theta^{(t)}||\theta^{*})
\geq\frac{\Delta^{\theta^{*}}}{2},\quad\forall~a_{i}\neq a_{i}^{*}(s).

Additionally, for $D(\theta^{(t)}||\theta^{*})\leq\frac{\Delta^{\theta^{*}}(1-\gamma)^{3}}{8n|\mathcal{S}|\left(\sum_{i=1}^{n}|\mathcal{A}_{i}|\right)}$, we may conclude that:

\theta^{(t)}_{s,a_{i}^{*}(s)}\geq 1/2\geq\theta^{(t)}_{s,a_{i}}\quad\forall a_{i}\neq a_{i}^{*}(s).

Then, applying the auxiliary lemma to (33), we have:

\theta^{(t+1)}_{s,a_{i}^{*}(s)}\geq\min\left\{1,\theta^{(t)}_{s,a_{i}^{*}(s)}+\frac{\eta\Delta^{\theta^{*}}}{4}\right\}
\Longrightarrow\quad\|\theta^{(t+1)}_{i,s}-\theta^{*}_{i,s}\|_{1}=2\left(1-\theta^{(t+1)}_{s,a_{i}^{*}(s)}\right)\leq\max\left\{0,\|\theta^{(t)}_{i,s}-\theta^{*}_{i,s}\|_{1}-\frac{\eta\Delta^{\theta^{*}}}{2}\right\},\quad\forall~s\in\mathcal{S},~i=1,2,\dots,n
\Longrightarrow\quad D(\theta^{(t+1)}||\theta^{*})\leq\max\left\{0,D(\theta^{(t)}||\theta^{*})-\frac{\eta\Delta^{\theta^{*}}}{2}\right\},

which completes the proof. ∎

Appendix F Proof of Proposition 1

Proof.

First of all, by the definitions of MPG and NE, a global maximizer of the total potential function is a NE. We now show that such a maximizer can be chosen to be a deterministic policy. From classical results (e.g. [70]), we know that there is an optimal deterministic centralized policy

\pi^{*}(a=(a_{1},\dots,a_{n})|s)=\mathbf{1}\{a=a^{*}(s)=(a_{1}^{*}(s),\dots,a_{n}^{*}(s))\}

that maximizes:

\pi^{*}=\operatorname*{arg\,max}_{\pi:\mathcal{S}\rightarrow\Delta(\mathcal{A})}\mathbb{E}\left[\sum_{t=0}^{\infty}\gamma^{t}\phi(s_{t},a_{t})\,\big{|}\,\pi,s_{0}=s\right].

We now show that this centralized policy can also be represented under the direct distributed policy parameterization. Setting $\theta^{*}$ as:

\pi_{\theta_{i}^{*}}(a_{i}|s)=\mathbf{1}\{a_{i}=a_{i}^{*}(s)\},

we have:

\pi^{*}(a|s)=\prod_{i=1}^{n}\pi_{\theta_{i}^{*}}(a_{i}|s).

Since $\pi^{*}$ globally maximizes the discounted sum of the potential function $\phi$ among all centralized policies, which include all directly, distributedly parameterized policies, $\theta^{*}$ also globally maximizes the total potential function $\Phi$ among all direct distributed parameterizations, which completes the proof. ∎

Appendix G Proof of Theorem 3

G.1 Useful Optimization Lemmas

Lemma 9.

Let $\Phi(\theta)$ be $\beta$-smooth in $\theta$ and define the gradient mapping:

G^{\eta}(\theta):=\frac{1}{\eta}\left(Proj_{\mathcal{X}}(\theta+\eta\nabla\Phi(\theta))-\theta\right).

The update rule for projected gradient ascent is:

\theta^{+}=\theta+\eta G^{\eta}(\theta)=Proj_{\mathcal{X}}(\theta+\eta\nabla\Phi(\theta)).

Then:

(\theta^{\prime}-\theta^{+})^{\top}\nabla\Phi(\theta^{+})\leq(1+\eta\beta)\|G^{\eta}(\theta)\|\|\theta^{\prime}-\theta^{+}\|\quad\forall\theta^{\prime}\in\mathcal{X}.
Proof.

By a standard property of Euclidean projections onto a convex set, we get that

(\theta+\eta\nabla\Phi(\theta)-\theta^{+})^{\top}(\theta^{\prime}-\theta^{+})\leq 0
\Longrightarrow~~\eta\nabla\Phi(\theta)^{\top}(\theta^{\prime}-\theta^{+})+(\theta-\theta^{+})^{\top}(\theta^{\prime}-\theta^{+})\leq 0
\Longrightarrow~~\eta\nabla\Phi(\theta)^{\top}(\theta^{\prime}-\theta^{+})-\eta G^{\eta}(\theta)^{\top}(\theta^{\prime}-\theta^{+})\leq 0
\Longrightarrow~~\nabla\Phi(\theta)^{\top}(\theta^{\prime}-\theta^{+})\leq\|G^{\eta}(\theta)\|\|\theta^{\prime}-\theta^{+}\|
\Longrightarrow~~\nabla\Phi(\theta^{+})^{\top}(\theta^{\prime}-\theta^{+})\leq\|G^{\eta}(\theta)\|\|\theta^{\prime}-\theta^{+}\|+(\nabla\Phi(\theta^{+})-\nabla\Phi(\theta))^{\top}(\theta^{\prime}-\theta^{+})
\leq\|G^{\eta}(\theta)\|\|\theta^{\prime}-\theta^{+}\|+\beta\|\theta^{+}-\theta\|\|\theta^{\prime}-\theta^{+}\|
=(1+\eta\beta)\|G^{\eta}(\theta)\|\|\theta^{\prime}-\theta^{+}\|. ∎

Lemma 10.

(Sufficient ascent) Suppose $\Phi(\theta)$ is $\beta$-smooth. Let $\theta^{+}=Proj_{\mathcal{X}}(\theta+\eta\nabla\Phi(\theta))$. Then for $\eta\leq\frac{1}{\beta}$,

\Phi(\theta^{+})-\Phi(\theta)\geq\frac{\eta}{2}\|G^{\eta}(\theta)\|^{2}.
Proof.

From the smoothness property we have that:

\Phi(\theta^{+})-\Phi(\theta)\geq\nabla_{\theta}\Phi(\theta)^{\top}(\theta^{+}-\theta)-\frac{\beta}{2}\|\theta^{+}-\theta\|^{2}. (36)

Since $\theta^{+}=Proj_{\mathcal{X}}(\theta+\eta\nabla\Phi(\theta))$, we have that:

(\theta+\eta\nabla\Phi(\theta)-\theta^{+})^{\top}(\theta^{\prime}-\theta^{+})\leq 0,~~\forall~\theta^{\prime}\in\mathcal{X}.

Taking $\theta^{\prime}=\theta$, we get:

\nabla\Phi(\theta)^{\top}(\theta^{+}-\theta)\geq\frac{1}{\eta}\|\theta^{+}-\theta\|^{2}.

Thus:

\Phi(\theta^{+})-\Phi(\theta)\geq\nabla_{\theta}\Phi(\theta)^{\top}(\theta^{+}-\theta)-\frac{\beta}{2}\|\theta^{+}-\theta\|^{2}
\geq\left(\frac{1}{\eta}-\frac{\beta}{2}\right)\|\theta^{+}-\theta\|^{2}
\geq\frac{1}{2\eta}\|\theta^{+}-\theta\|^{2}
=\frac{\eta}{2}\|G^{\eta}(\theta)\|^{2},

which completes the proof. ∎
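The sufficient ascent inequality of Lemma 10 is easy to sanity-check numerically. The sketch below uses a toy $\beta$-smooth objective $\Phi(\theta)=-\|\theta-c\|^{2}$ over the probability simplex (so $\beta=2$) together with the sorting-based simplex projection of [73]; the specific objective and constants are our own illustrative choices.

```python
import numpy as np

def proj_simplex(v):
    # Euclidean projection onto the probability simplex (sorting-based, cf. [73]).
    n = v.size
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / np.arange(1, n + 1) > 0)[0][-1]
    return np.maximum(v + (1.0 - css[rho]) / (rho + 1.0), 0.0)

# Toy beta-smooth objective over the simplex: Phi(theta) = -||theta - c||^2, beta = 2.
c = np.array([0.7, 0.2, 0.4])
Phi = lambda th: -np.sum((th - c) ** 2)
grad = lambda th: -2.0 * (th - c)
beta, eta = 2.0, 0.5                               # eta <= 1/beta

rng = np.random.default_rng(1)
for _ in range(100):
    theta = proj_simplex(rng.normal(size=3))
    theta_plus = proj_simplex(theta + eta * grad(theta))
    G = (theta_plus - theta) / eta                 # gradient mapping G^eta(theta)
    # Lemma 10: Phi(theta+) - Phi(theta) >= (eta/2) * ||G^eta(theta)||^2
    assert Phi(theta_plus) - Phi(theta) >= eta / 2 * np.dot(G, G) - 1e-10
```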

Lemma 11.

(Corollary of Lemma 10) For $\Phi(\theta)$ that is $\beta$-smooth and bounded, $\Phi_{\min}\leq\Phi(\theta)\leq\Phi_{\max}$, running projected gradient ascent

\theta^{(t+1)}=Proj_{\mathcal{X}}(\theta^{(t)}+\eta\nabla\Phi(\theta^{(t)}))

with $\eta=\frac{1}{\beta}$ guarantees that:

\lim_{t\rightarrow+\infty}\|G^{\eta}(\theta^{(t)})\|=0.

Further, we have that:

\frac{1}{T}\sum_{t=0}^{T-1}\|G^{\eta}(\theta^{(t)})\|^{2}\leq\frac{2\beta(\Phi_{\max}-\Phi_{\min})}{T}. (37)
Proof.

From Lemma 10 we have:

\Phi(\theta^{(t+1)})-\Phi(\theta^{(t)})\geq\frac{1}{2\beta}\|G^{\eta}(\theta^{(t)})\|^{2}\geq 0.

Thus $\Phi(\theta^{(t)})$ is non-decreasing, and since it is bounded, $\Phi(\theta^{(t)})$ converges asymptotically to some value $\Phi^{*}$, which shows that

\lim_{t\rightarrow\infty}\|G^{\eta}(\theta^{(t)})\|=0.

Additionally, summing the sufficient ascent bound over $t=0,\dots,T-1$ gives

\Phi(\theta^{(T)})-\Phi(\theta^{(0)})\geq\sum_{t=0}^{T-1}\frac{1}{2\beta}\|G^{\eta}(\theta^{(t)})\|^{2}
\Longrightarrow~~\frac{1}{T}\sum_{t=0}^{T-1}\|G^{\eta}(\theta^{(t)})\|^{2}\leq\frac{2\beta(\Phi_{\max}-\Phi_{\min})}{T},

which completes the proof. ∎

G.2 Proof of Theorem 3

Proof.

Recall the definition of the gradient mapping:

G^{\eta}(\theta)=\frac{1}{\eta}\left(Proj_{\mathcal{X}}(\theta+\eta\nabla\Phi(\theta))-\theta\right).

From the gradient domination property (8) we have that:

\textup{NE-gap}_{i}(\theta^{(t+1)})=\max_{\theta_{i}^{\prime}\in\mathcal{X}_{i}}J_{i}(\theta_{i}^{\prime},\theta_{-i}^{(t+1)})-J_{i}(\theta_{i}^{(t+1)},\theta_{-i}^{(t+1)})
\leq\max_{\theta_{i}^{\prime}\in\mathcal{X}_{i}}\left\|\frac{d_{(\theta_{i}^{\prime},\theta_{-i}^{(t+1)})}}{d_{(\theta_{i}^{(t+1)},\theta_{-i}^{(t+1)})}}\right\|_{\infty}\max_{\overline{\theta}_{i}\in\mathcal{X}_{i}}\left(\overline{\theta}_{i}-\theta_{i}^{(t+1)}\right)^{\top}\nabla_{\theta_{i}}J_{i}(\theta^{(t+1)})
\leq M\max_{\overline{\theta}_{i}\in\mathcal{X}_{i}}\left(\overline{\theta}_{i}-\theta_{i}^{(t+1)}\right)^{\top}\nabla_{\theta_{i}}\Phi(\theta^{(t+1)})
\leq M(1+\eta\beta)\max_{\overline{\theta}_{i}\in\mathcal{X}_{i}}\|\overline{\theta}_{i}-\theta_{i}^{(t+1)}\|\|G^{\eta}(\theta^{(t)})\|
\leq 2M(1+\eta\beta)\sqrt{|\mathcal{S}|}\|G^{\eta}(\theta^{(t)})\|
\leq 4M\sqrt{|\mathcal{S}|}\|G^{\eta}(\theta^{(t)})\|,

where the last two steps use $\|\overline{\theta}_{i}-\theta_{i}^{(t+1)}\|\leq 2\sqrt{|\mathcal{S}|}$ and $\eta\beta\leq 1$. Thus

\textup{NE-gap}(\theta^{(t+1)})\leq 4M\sqrt{|\mathcal{S}|}\|G^{\eta}(\theta^{(t)})\|.

Then from Lemma 11 we have that:

limtGη(θ(t))=0limtNE-gap(θ(t))=0,\lim_{t\rightarrow\infty}\|G^{\eta}(\theta^{(t)})\|=0~{}~{}\Longrightarrow~{}~{}\lim_{t\rightarrow\infty}\textup{{NE-gap}}(\theta^{(t)})=0,

and that:

1Tt=0T1Gη(θ(t))22β(ΦmaxΦmin)T\displaystyle\frac{1}{T}\sum_{t=0}^{T-1}\|G^{\eta}(\theta^{(t)})\|^{2}\leq\frac{2\beta(\Phi_{\max}-\Phi_{\min})}{T}
\displaystyle\Longrightarrow~{}~{} 1Tt=1TNE-gap(θ(t))232βM2|𝒮|(ΦmaxΦmin)T\displaystyle\frac{1}{T}\sum_{t=1}^{T}\textup{{NE-gap}}(\theta^{(t)})^{2}\leq\frac{32\beta M^{2}|\mathcal{S}|(\Phi_{\max}-\Phi_{\min})}{T}

we obtain the required bound of ϵ\epsilon by setting:

32βM2|𝒮|(ΦmaxΦmin)Tϵ2,\frac{32\beta M^{2}|\mathcal{S}|(\Phi_{\max}-\Phi_{\min})}{T}\leq\epsilon^{2},

or equivalently

T\displaystyle T 32M2β(ΦmaxΦmin)|𝒮|ϵ2\displaystyle\geq\frac{32M^{2}\beta(\Phi_{\max}-\Phi_{\min})|\mathcal{S}|}{\epsilon^{2}}
=64M2(ΦmaxΦmin)|𝒮|i|𝒜i|ϵ2(1γ)3,\displaystyle=\frac{64M^{2}(\Phi_{\max}-\Phi_{\min})|\mathcal{S}|\sum_{i}|\mathcal{A}_{i}|}{\epsilon^{2}(1-\gamma)^{3}},

which completes the proof. ∎
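As a quick sanity check on the final bound, the following snippet evaluates the iteration count for a hypothetical set of problem parameters (all numbers below are made up purely for illustration; MM is the constant from the theorem):

```python
# Hypothetical parameters: constant M, potential range, |S|, sum_i |A_i|,
# discount factor gamma, and target accuracy epsilon.
M, phi_range, n_states, sum_actions, gamma, eps = 2.0, 1.0, 10, 6, 0.9, 0.1
T = 64 * M**2 * phi_range * n_states * sum_actions / (eps**2 * (1 - gamma)**3)
print(f"T >= {T:.3e}")  # about 1.5e9 iterations for this worst-case bound
```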

Appendix H Proof of Theorem 4

Proof.

(of the first claim) The proof relies on Lemma 5 in Section 3, so we recommend that readers go through Lemma 5 first. The lemma immediately implies that a strict NE θ\theta^{*} must be deterministic. Let ai(s),Δiθ(s),Δiθa_{i}^{*}(s),\Delta_{i}^{\theta^{*}}(s),\Delta_{i}^{\theta^{*}} be defined as in Lemma 5.

For any θ𝒳\theta\in\mathcal{X}, a Taylor expansion gives:

Φ(θ)Φ(θ)\displaystyle\Phi(\theta)-\Phi(\theta^{*}) =(θθ)Φ(θ)+o(θθ)\displaystyle=(\theta-\theta^{*})^{\top}\nabla\Phi(\theta^{*})+o(\|\theta-\theta^{*}\|)
=i(θiθi)θiJi(θ)+o(θθ)\displaystyle=\sum_{i}(\theta_{i}-\theta_{i}^{*})^{\top}\nabla_{\theta_{i}}J_{i}(\theta^{*})+o(\|\theta-\theta^{*}\|)
=11γisaidθ(s)Aiθ¯(s,ai)(θs,aiθs,ai)+o(θθ)\displaystyle=\frac{1}{1-\gamma}\sum_{i}\sum_{s}\sum_{a_{i}}d_{\theta^{*}}(s)\overline{A_{i}^{\theta^{*}}}(s,a_{i})(\theta_{s,a_{i}}-\theta^{*}_{s,a_{i}})+o(\|\theta-\theta^{*}\|)
11γisdθ(s)Δiθ(s)(aiai(s)(θs,aiθs,ai))+o(θθ)\displaystyle\leq-\frac{1}{1-\gamma}\sum_{i}\sum_{s}d_{\theta^{*}}(s)\Delta_{i}^{\theta^{*}}(s)\left(\sum_{a_{i}\!\neq\!a_{i}^{*}(s)}(\theta_{s,a_{i}}-\theta^{*}_{s,a_{i}})\right)+o(\|\theta-\theta^{*}\|)
=11γisdθ(s)Δiθ(s)12θi,sθi,s1+o(θθ)\displaystyle=-\frac{1}{1-\gamma}\sum_{i}\sum_{s}d_{\theta^{*}}(s)\Delta_{i}^{\theta^{*}}(s)\frac{1}{2}\|\theta_{i,s}-\theta_{i,s}^{*}\|_{1}+o(\|\theta-\theta^{*}\|)
Δθ2isθi,sθi,s1+o(θθ)\displaystyle\leq-\frac{\Delta^{\theta^{*}}}{2}\sum_{i}\sum_{s}\|\theta_{i,s}-\theta_{i,s}^{*}\|_{1}+o(\|\theta-\theta^{*}\|)
Δθ2θθ+o(θθ).\displaystyle\leq-\frac{\Delta^{\theta^{*}}}{2}\|\theta-\theta^{*}\|+o(\|\theta-\theta^{*}\|).

Thus for θθ\|\theta-\theta^{*}\| sufficiently small,

Φ(θ)Φ(θ)<0 holds,\Phi(\theta)-\Phi(\theta^{*})<0\textup{ holds,}

which shows that strict NEs are strict local maxima. We now show the converse.

Strict local maxima satisfy first-order stationarity by definition, and thus by Theorem 1 they are also NEs; it remains to show that they are strict. We argue by contradiction: suppose there exists a strict local maximum θ\theta^{*} that is a non-strict NE, i.e., there exists θi𝒳i,θiθi\theta_{i}^{\prime}\in\mathcal{X}_{i},\theta_{i}^{\prime}\neq\theta_{i}^{*} such that:

Ji(θi,θi)=Ji(θi,θi)J_{i}(\theta_{i}^{\prime},\theta_{-i}^{*})=J_{i}(\theta_{i}^{*},\theta_{-i}^{*})

According to (10) and first-order stationarity of θ\theta^{*}:

11γsdθ(s)maxai𝒜iAiθ¯(s,ai)=maxθ¯i𝒳i(θ¯iθi)θiJi(θ)0.\displaystyle\frac{1}{1-\gamma}\sum_{s}d_{\theta^{*}}(s)\max_{a_{i}\in\mathcal{A}_{i}}\overline{A_{i}^{\theta^{*}}}(s,a_{i})=\max_{\overline{\theta}_{i}\in\mathcal{X}_{i}}(\overline{\theta}_{i}-\theta_{i}^{*})^{\top}\nabla_{\theta_{i}}J_{i}(\theta^{*})\leq 0.

Since maxai𝒜iAiθ¯(s,ai)0\max_{a_{i}\in\mathcal{A}_{i}}\overline{A_{i}^{\theta}}(s,a_{i})\geq 0 for all θ\theta, we may conclude:

maxai𝒜iAiθ¯(s,ai)=0,s𝒮.\max_{a_{i}\in\mathcal{A}_{i}}\overline{A_{i}^{\theta^{*}}}(s,a_{i})=0,~{}~{}\forall~{}s\in\mathcal{S}.

We denote θ:=(θi,θi)\theta^{\prime}:=(\theta_{i}^{\prime},\theta_{-i}^{*}); then, according to Lemma 2,

0=Ji(θi,θi)Ji(θi,θi)=11γs,aidθ(s)πθi(ai|s)Aiθ¯(s,ai)0.\displaystyle 0=J_{i}(\theta_{i}^{\prime},\theta_{-i}^{*})-J_{i}(\theta_{i}^{*},\theta_{-i}^{*})=\frac{1}{1-\gamma}\sum_{s,a_{i}}d_{\theta^{\prime}}(s)\pi_{\theta^{\prime}_{i}}(a_{i}|s)\overline{A_{i}^{\theta^{*}}}(s,a_{i})\leq 0.

Since dθ(s)>0,sd_{\theta^{\prime}}(s)>0,~{}\forall~{}s, this further implies that

aiπθi(ai|s)Aiθ¯(s,ai)=0,s𝒮,\sum_{a_{i}}\pi_{\theta^{\prime}_{i}}(a_{i}|s)\overline{A_{i}^{\theta^{*}}}(s,a_{i})=0,~{}~{}\forall~{}s\in\mathcal{S},

i.e., πθi(ai|s)\pi_{\theta^{\prime}_{i}}(a_{i}|s) is nonzero only if Aiθ¯(s,ai)=0\overline{A_{i}^{\theta^{*}}}(s,a_{i})=0. Define θiη:=ηθi+(1η)θi\theta_{i}^{\eta}:=\eta\theta_{i}^{\prime}+(1-\eta)\theta_{i}^{*}, then

aiπθiη(ai|s)Aiθ¯(s,ai)=0,s𝒮.\sum_{a_{i}}\pi_{\theta^{\eta}_{i}}(a_{i}|s)\overline{A_{i}^{\theta^{*}}}(s,a_{i})=0,~{}~{}\forall~{}s\in\mathcal{S}.

Thus let θη:=(θiη,θi)\theta^{\eta}:=(\theta^{\eta}_{i},\theta^{*}_{-i})

Ji(θiη,θi)Ji(θi,θi)=11γs,aidθη(s)πθiη(ai|s)Aiθ¯(s,ai)=0.\displaystyle J_{i}(\theta_{i}^{\eta},\theta_{-i}^{*})-J_{i}(\theta_{i}^{*},\theta_{-i}^{*})=\frac{1}{1-\gamma}\sum_{s,a_{i}}d_{\theta^{\eta}}(s)\pi_{\theta^{\eta}_{i}}(a_{i}|s)\overline{A_{i}^{\theta^{*}}}(s,a_{i})=0.

Since Ji(θiη,θi)Ji(θi,θi)=Φ(θη)Φ(θ)J_{i}(\theta_{i}^{\eta},\theta_{-i}^{*})-J_{i}(\theta_{i}^{*},\theta_{-i}^{*})=\Phi(\theta^{\eta})-\Phi(\theta^{*}) by the definition of the potential function, and θiηθi0\|\theta_{i}^{\eta}-\theta_{i}^{*}\|\rightarrow 0 as η0\eta\rightarrow 0, this contradicts the assumption that θ\theta^{*} is a strict local maximum. This shows that all strict local maxima are strict NEs, which completes the proof. ∎

Proof.

(of the second claim) First, we define the corresponding value function, QQ-function and advantage function for potential function ϕ\phi.

Vϕθ(s)\displaystyle V_{\phi}^{\theta}(s) :=𝔼[t=0γtϕ(st,at)|π=θ,s0=s]\displaystyle:=\mathbb{{E}}\left[\sum_{t=0}^{\infty}\gamma^{t}\phi(s_{t},a_{t})\big{|}~{}\pi=\theta,s_{0}=s\right]
Qϕθ(s,a)\displaystyle Q_{\phi}^{\theta}(s,a) :=𝔼[t=0γtϕ(st,at)|π=θ,s0=s,a0=a]\displaystyle:=\mathbb{{E}}\left[\sum_{t=0}^{\infty}\gamma^{t}\phi(s_{t},a_{t})\big{|}~{}\pi=\theta,s_{0}=s,a_{0}=a\right]
Aϕθ(s,a)\displaystyle A_{\phi}^{\theta}(s,a) :=Qϕθ(s,a)Vϕθ(s).\displaystyle:=Q_{\phi}^{\theta}(s,a)-V_{\phi}^{\theta}(s).

For an index set {1,2,,n}\mathcal{I}\subseteq\{1,2,\dots,n\}, we define the averaged advantage function of the potential with respect to \mathcal{I} as:

Aϕ,θ¯(s,a):=aπθ(a|s)Aϕθ(s,a,a).\overline{A_{\phi,\mathcal{I}}^{\theta}}(s,a_{\mathcal{I}}):=\sum_{a_{-\mathcal{I}}}\pi_{\theta_{-\mathcal{I}}}(a_{-\mathcal{I}}|s)A_{\phi}^{\theta}(s,a_{\mathcal{I}},a_{-\mathcal{I}}).

We choose an index set {1,2,,n}\mathcal{I}\subseteq\{1,2,\dots,n\} for which there exist s,as^{*},a_{\mathcal{I}}^{*} satisfying:

Aϕ,θ¯(s,a)>0,\overline{A_{\phi,\mathcal{I}}^{\theta^{*}}}(s^{*},a_{\mathcal{I}}^{*})>0, (38)

and that for any other index set \mathcal{I}^{\prime} with smaller cardinality:

Aϕ,θ¯(s,a)0,s,a,||<||.\overline{A_{\phi,\mathcal{I}^{\prime}}^{\theta^{*}}}(s,a_{\mathcal{I}^{\prime}})\leq 0,~{}~{}\forall~{}s,a_{\mathcal{I}^{\prime}},~{}~{}\forall~{}|\mathcal{I}^{\prime}|<|\mathcal{I}|. (39)

Because Φ\Phi is not constant, such an index set \mathcal{I} is guaranteed to exist. Further, since

aπθ(a|s)Aϕ,θ¯(s,a)=0,s,\sum_{a_{\mathcal{I}^{\prime}}}\pi_{\theta_{\mathcal{I}^{\prime}}^{*}}(a_{\mathcal{I}^{\prime}}|s)\overline{A_{\phi,\mathcal{I}^{\prime}}^{\theta^{*}}}(s,a_{\mathcal{I}^{\prime}})=0,~{}~{}\forall~{}s,

and θ\theta^{*} is fully-mixed, combining with (39) we have that:

Aϕ,θ¯(s,a)=0,s,a,||<||.\overline{A_{\phi,\mathcal{I}^{\prime}}^{\theta^{*}}}(s,a_{\mathcal{I}^{\prime}})=0,~{}~{}\forall~{}s,a_{\mathcal{I}^{\prime}},~{}~{}\forall~{}|\mathcal{I}^{\prime}|<|\mathcal{I}|. (40)

We set θ:=(θ,θ)\theta:=(\theta_{\mathcal{I}},\theta_{-\mathcal{I}}^{*}), where θ\theta_{\mathcal{I}} is a convex combination of θ,θ𝒳\theta_{\mathcal{I}}^{*},\theta_{\mathcal{I}}^{\prime}\in\mathcal{X}:

θ=(1η)θ+ηθ,η>0.\theta_{\mathcal{I}}=(1-\eta)\theta_{\mathcal{I}}^{*}+\eta\theta_{\mathcal{I}}^{\prime},~{}~{}\eta>0.

According to the performance difference lemma (Lemma 2), we have:

(1γ)(Φ(θ,θ)Φ(θ,θ))=s,adθ(s)πθ(a|s)Aϕ,θ¯(s,a)\displaystyle(1-\gamma)\left(\Phi(\theta_{\mathcal{I}},\theta_{-\mathcal{I}}^{*})-\Phi(\theta_{\mathcal{I}}^{*},\theta_{-\mathcal{I}}^{*})\right)=\sum_{s,a_{\mathcal{I}}}d_{\theta}(s)\pi_{\theta_{\mathcal{I}}}(a_{\mathcal{I}}|s)\overline{A_{\phi,\mathcal{I}}^{\theta^{*}}}(s,a_{\mathcal{I}})
=s,adθ(s)i((1η)πθi(ai|s)+ηπθi(ai|s))Aϕ,θ¯(s,a)\displaystyle=\sum_{s,a_{\mathcal{I}}}d_{\theta}(s)\prod_{i\in\mathcal{I}}\left((1-\eta)\pi_{\theta_{i}^{*}}(a_{i}|s)+\eta\pi_{\theta_{i}^{\prime}}(a_{i}|s)\right)\overline{A_{\phi,\mathcal{I}}^{\theta^{*}}}(s,a_{\mathcal{I}})
=s,adθ(s)((1η)πθi0(ai0|s)+ηπθi0(ai0|s))i\{i0}((1η)πθi(ai|s)+ηπθi(ai|s))Aϕ,θ¯(s,a),(i0)\displaystyle=\sum_{s,a_{\mathcal{I}}}d_{\theta}(s)\left((1\!-\!\eta)\pi_{\theta_{i_{0}}^{*}}(a_{i_{0}}|s)+\eta\pi_{\theta_{i_{0}}^{\prime}}(a_{i_{0}}|s)\right)\!\prod_{i\in\mathcal{I}\!\backslash\!\{\!i_{0}\!\}\!}\!\left((1\!-\!\eta)\pi_{\theta_{i}^{*}}(a_{i}|s)+\eta\pi_{\theta_{i}^{\prime}}(a_{i}|s)\right)\overline{A_{\phi,\mathcal{I}}^{\theta^{*}}}(s,a_{\mathcal{I}}),~{}~{}(\forall~{}i_{0}\in\mathcal{I})
=(1η)s,adθ(s)i\{i0}((1η)πθi(ai|s)+ηπθi(ai|s))Aϕ,\{i0}θ¯(s,a\{i0})\displaystyle=(1-\eta)\sum_{s,a_{\mathcal{I}}}d_{\theta}(s)\prod_{i\in\mathcal{I}\backslash\{\!i_{0}\!\}\!}\!\left((1\!-\!\eta)\pi_{\theta_{i}^{*}}(a_{i}|s)+\eta\pi_{\theta_{i}^{\prime}}(a_{i}|s)\right)\overline{A_{\phi,\mathcal{I}\backslash\{i_{0}\}}^{\theta^{*}}}(s,a_{\mathcal{I}\backslash\{i_{0}\}})
+ηs,adθ(s)πθi0(ai0|s)i\{i0}((1η)πθi(ai|s)+ηπθi(ai|s))Aϕ,θ¯(s,a).\displaystyle\quad+\eta\sum_{s,a_{\mathcal{I}}}d_{\theta}(s)\pi_{\theta_{i_{0}}^{\prime}}(a_{i_{0}}|s)\prod_{i\in\mathcal{I}\backslash\{i_{0}\}}\left((1-\eta)\pi_{\theta_{i}^{*}}(a_{i}|s)+\eta\pi_{\theta_{i}^{\prime}}(a_{i}|s)\right)\overline{A_{\phi,\mathcal{I}}^{\theta^{*}}}(s,a_{\mathcal{I}}).

According to (40), we know that:

Aϕ,\{i0}θ¯(s,a\{i0})=0,\overline{A_{\phi,\mathcal{I}\backslash\{i_{0}\}}^{\theta^{*}}}(s,a_{\mathcal{I}\backslash\{i_{0}\}})=0,

thus

(1γ)(Φ(θ,θ)Φ(θ,θ))=\displaystyle(1-\gamma)\left(\Phi(\theta_{\mathcal{I}},\theta_{-\mathcal{I}}^{*})-\Phi(\theta_{\mathcal{I}}^{*},\theta_{-\mathcal{I}}^{*})\right)=
ηs,adθ(s)πθi0(ai0|s)i\{i0}((1η)πθi(ai|s)+ηπθi(ai|s))Aϕ,θ¯(s,a).\displaystyle\eta\sum_{s,a_{\mathcal{I}}}d_{\theta}(s)\pi_{\theta_{i_{0}}^{\prime}}(a_{i_{0}}|s)\prod_{i\in\mathcal{I}\backslash\{i_{0}\}}\left((1-\eta)\pi_{\theta_{i}^{*}}(a_{i}|s)+\eta\pi_{\theta_{i}^{\prime}}(a_{i}|s)\right)\overline{A_{\phi,\mathcal{I}}^{\theta^{*}}}(s,a_{\mathcal{I}}).

Applying the same procedure recursively and using the fact that:

Aϕ,\{i}θ¯(s,a\{i})=0,i,\overline{A_{\phi,\mathcal{I}\backslash\{i\}}^{\theta^{*}}}(s,a_{\mathcal{I}\backslash\{i\}})=0,~{}~{}\forall~{}i\in\mathcal{I},

we get:

Φ(θ,θ)Φ(θ,θ)=η||1γs,adθ(s)iπθi(ai|s)Aϕ,θ¯(s,a).\displaystyle\Phi(\theta_{\mathcal{I}},\theta_{-\mathcal{I}}^{*})-\Phi(\theta_{\mathcal{I}}^{*},\theta_{-\mathcal{I}}^{*})=\frac{\eta^{|\mathcal{I}|}}{1-\gamma}\sum_{s,a_{\mathcal{I}}}d_{\theta}(s)\prod_{i\in\mathcal{I}}\pi_{\theta_{i}^{\prime}}(a_{i}|s)\overline{A_{\phi,\mathcal{I}}^{\theta^{*}}}(s,a_{\mathcal{I}}).

Set πθi(ai|s)\pi_{\theta_{i}^{\prime}}(a_{i}|s) as:

πθi(ai|s)\displaystyle\pi_{\theta_{i}^{\prime}}(a_{i}|s^{*}) ={1ai=ai0otherwise\displaystyle=\left\{\begin{array}[]{cc}1&a_{i}=a_{i}^{*}\\ 0&\textup{otherwise}\end{array}\right.
πθi(ai|s)\displaystyle\pi_{\theta_{i}^{\prime}}(a_{i}|s) =πθi(ai|s),ss,\displaystyle=\pi_{\theta_{i}^{*}}(a_{i}|s),\quad s\neq s^{*},

where s,ais^{*},a_{i}^{*} are defined in (38). Then:

Φ(θ,θ)Φ(θ,θ)=η||1γdθ(s)Aϕ,θ¯(s,a)>0,\Phi(\theta_{\mathcal{I}},\theta_{-\mathcal{I}}^{*})-\Phi(\theta_{\mathcal{I}}^{*},\theta_{-\mathcal{I}}^{*})=\frac{\eta^{|\mathcal{I}|}}{1-\gamma}d_{\theta}(s^{*})\overline{A_{\phi,\mathcal{I}}^{\theta^{*}}}(s^{*},a_{\mathcal{I}}^{*})>0,

which completes the proof. ∎

Appendix I Bounding the gradient estimation error of Algorithm 1

The accuracy of gradient estimation is essential in the sample-based Algorithm 1. In this section, we give a high-probability bound on the estimation error, stated in the following theorem:

Theorem 6.

(Error bound for gradient estimation) Assume that the stochastic game satisfies Assumption 2. In Algorithm 1, for

TJ32τ(1+α)2|𝒮|3i|𝒜i|maxi|𝒜i|2(1γ)6ϵg2α2σS2log(16τTG|𝒮|2i|𝒜i|δ)+1,T_{J}\geq\frac{32\tau(1+\alpha)^{2}|\mathcal{S}|^{3}\sum_{i}|\mathcal{A}_{i}|\max_{i}|\mathcal{A}_{i}|^{2}}{(1-\gamma)^{6}\epsilon_{g}^{2}\alpha^{2}\sigma_{S}^{2}}\log\left(\frac{16\tau T_{G}|\mathcal{S}|^{2}\sum_{i}|\mathcal{A}_{i}|}{\delta}\right)+1,

with probability at least 1δ1-\delta, we have:

^Φ(θ(k))Φ(θ(k))2ϵg,0kTG1.\|\widehat{\nabla}\Phi(\theta^{(k)})-\nabla\Phi(\theta^{(k)})\|_{2}\leq\epsilon_{g},~{}~{}\forall~{}0\leq k\leq T_{G}-1.

The proof of the theorem consists of bounding the estimation errors of Qiθ¯\overline{Q_{i}^{\theta}} (Section I.1) and dθd_{\theta} (Section I.2). We first introduce a definition of ‘sufficient exploration’ that plays an important role in this section.

In Assumption 2 of the main text we introduced (τ,σS)(\tau,\sigma_{S})-sufficient exploration on states. In this section we introduce a similar notion, (τ,σ)(\tau,\sigma)-sufficient exploration:

Definition 6.

((τ,σ)(\tau,\sigma)-Sufficient Exploration) A stochastic game together with a policy θ\theta is said to satisfy the (τ,σ)(\tau,\sigma)-sufficient exploration condition if there exist a positive integer τ\tau and σ(0,1)\sigma\in(0,1) such that, for policy θ\theta, any agent ii, and any initial state-action pair (s,a)(s,a), we have

Pr(sτ,ai,τ|s0=s,a0=a)σ,sτ,ai,τ.\Pr(s_{\tau},a_{i,\tau}|s_{0}=s,a_{0}=a)\geq\sigma,~{}~{}\forall s_{\tau},a_{i,\tau}.

Note that ‘(τ,σ)(\tau,\sigma)-sufficient exploration’ is a stronger condition than ‘(τ,σS)(\tau,\sigma_{S})-sufficient exploration on states’. Additionally, it is not hard to verify that any stochastic game satisfying (τ,σS)(\tau,\sigma_{S})-sufficient exploration on states, together with any policy θ𝒳α\theta\in\mathcal{X}^{\alpha}, also satisfies the (τ,ασSmaxi|𝒜i|)(\tau,\frac{\alpha\sigma_{S}}{\max_{i}|\mathcal{A}_{i}|})-sufficient exploration condition, since every policy in 𝒳α\mathcal{X}^{\alpha} assigns probability at least α|𝒜i|\frac{\alpha}{|\mathcal{A}_{i}|} to each action of agent ii.

I.1 Bounding the estimation error of the averaged-Q function

We first state the main theorem in this subsection:

Theorem 7.

(Estimation error of averaged Q-functions) Assume that the stochastic game with policy θ\theta satisfies the (τ,σ)(\tau,\sigma)-sufficient exploration condition (Definition 6). Then for a fixed ii, running Algorithm 1 guarantees that:

Pr(Qiθ^Qiθ¯ϵ)4τ|𝒮|2|𝒜i|exp((1γ)4ϵ2σ2Tτ32|𝒮|2),\Pr\left(\|\widehat{Q_{i}^{\theta}}-\overline{Q_{i}^{\theta}}\|_{\infty}\geq\epsilon\right)\leq 4\tau|\mathcal{S}|^{2}|\mathcal{A}_{i}|\exp\left(-\frac{(1-\gamma)^{4}\epsilon^{2}\sigma^{2}\lfloor\frac{T}{\tau}\rfloor}{32|\mathcal{S}|^{2}}\right),

further:

Pr(Qiθ^Qiθ¯ϵ,i)8τ|𝒮|2(i=1n|𝒜i|)exp((1γ)4ϵ2σ2Tτ32|𝒮|2),\Pr\left(\|\widehat{Q_{i}^{\theta}}-\overline{Q_{i}^{\theta}}\|_{\infty}\geq\epsilon,~{}\exists~{}i\right)\leq 8\tau|\mathcal{S}|^{2}\left(\sum_{i=1}^{n}|\mathcal{A}_{i}|\right)\exp\left(-\frac{(1-\gamma)^{4}\epsilon^{2}\sigma^{2}\lfloor\frac{T}{\tau}\rfloor}{32|\mathcal{S}|^{2}}\right),

i.e., when

TJ32τ|𝒮|2(1γ)4ϵ2σ2log(8τ|𝒮|2i|𝒜i|δ)+τT_{J}\geq\frac{32\tau|\mathcal{S}|^{2}}{(1-\gamma)^{4}\epsilon^{2}\sigma^{2}}\log\left(\frac{8\tau|\mathcal{S}|^{2}\sum_{i}|\mathcal{A}_{i}|}{\delta}\right)+\tau

with probability at least 1δ1-\delta, Qiθ^Qiθ¯ϵ,i\|\widehat{Q_{i}^{\theta}}-\overline{Q_{i}^{\theta}}\|_{\infty}\leq\epsilon,~{}\forall~{}i.

In the following, we introduce several lemmas that play an important role in bounding the estimation error of the averaged Q-function:

Lemma 12.

Assume that the stochastic game with policy θ\theta satisfies the (τ,σ)(\tau,\sigma)-sufficient exploration condition (Definition 6). Then for fixed s,s,ais^{\prime},s,a_{i} and any ϵ1\epsilon\leq 1,

Pr(|Piθ^(s|s,ai)Piθ¯(s|s,ai)|ϵ)4τexp(ϵ2σ2Tτ32)\Pr\left(\left|\widehat{P_{i}^{\theta}}(s^{\prime}|s,a_{i})-\overline{P_{i}^{\theta}}(s^{\prime}|s,a_{i})\right|\geq\epsilon\right)\leq 4\tau\exp\left(-\frac{\epsilon^{2}\sigma^{2}\lfloor\frac{T}{\tau}\rfloor}{32}\right)
Proof.

According to the definition of Piθ^\widehat{P_{i}^{\theta}}, we have that

{Piθ^(s|s,ai)Piθ¯(s|s,ai)ϵ}\displaystyle\left\{\widehat{P_{i}^{\theta}}(s^{\prime}|s,a_{i})-\overline{P_{i}^{\theta}}(s^{\prime}|s,a_{i})\geq\epsilon\right\}
{t=0T1(𝟏{st+1=s,st=s,ai,t=ai}(Piθ¯(s|s,ai)+ϵ)𝟏{st=s,ai,t=ai})0}\displaystyle\subseteq\left\{\sum_{t=0}^{T-1}\left(\mathbf{1}\{s_{t+1}=s^{\prime},s_{t}=s,a_{i,t}=a_{i}\}-(\overline{P_{i}^{\theta}}(s^{\prime}|s,a_{i})+\epsilon)\mathbf{1}\{s_{t}=s,a_{i,t}=a_{i}\}\right)\geq 0\right\}
{t=0T1𝟏{st=s,ai,t=ai}=0}\displaystyle\qquad\mathop{\cup}\left\{\sum_{t=0}^{T-1}\mathbf{1}\{s_{t}=s,a_{i,t}=a_{i}\}=0\right\}
m=0τ1{k=0T1mτ(𝟏{skτ+m+1=s,skτ+m=s,ai,kτ+m=ai}(Piθ¯(s|s,ai)+ϵ)𝟏{skτ+m=s,ai,kτ+m=ai})0}\displaystyle\subseteq\mathop{\cup}_{m=0}^{\tau-1}\left\{\sum_{k=0}^{\lfloor\frac{T-1-m}{\tau}\rfloor}\!\left(\mathbf{1}\{s_{k\tau+m+1}=s^{\prime},s_{k\tau+m}=s,a_{i,k\tau+m}=a_{i}\}\!-\!(\overline{P_{i}^{\theta}}(s^{\prime}|s,a_{i})+\epsilon)\mathbf{1}\{s_{k\tau+m}=s,a_{i,k\tau+m}=a_{i}\}\right)\geq 0\!\right\}
m=0τ1{k=0T1mτ𝟏{skτ+m=s,ai,kτ+m=ai}=0}\displaystyle\qquad\mathop{\cup}_{m=0}^{\tau-1}\left\{\sum_{k=0}^{\lfloor\frac{T-1-m}{\tau}\rfloor}\mathbf{1}\{s_{k\tau+m}=s,a_{i,k\tau+m}=a_{i}\}=0\right\}

Let:

Am\displaystyle A_{m} :={k=0T1mτ(𝟏{skτ+m+1=s,skτ+m=s,ai,kτ+m=ai}(Piθ¯(s|s,ai)+ϵ)𝟏{skτ+m=s,ai,kτ+m=ai})0}\displaystyle:=\left\{\sum_{k=0}^{\lfloor\frac{T-1-m}{\tau}\rfloor}\left(\mathbf{1}\{s_{k\tau+m+1}\!=\!s^{\prime},s_{k\tau+m}\!=\!s,a_{i,k\tau+m}\!=\!a_{i}\}-(\overline{P_{i}^{\theta}}(s^{\prime}|s,a_{i})+\epsilon)\mathbf{1}\{s_{k\tau+m}\!=\!s,a_{i,k\tau+m}\!=\!a_{i}\}\right)\geq 0\right\}
Am\displaystyle A_{m}^{\prime} :={k=0T1mτ𝟏{skτ+m=s,ai,kτ+m=ai}=0}\displaystyle:=\left\{\sum_{k=0}^{\lfloor\frac{T-1-m}{\tau}\rfloor}\mathbf{1}\{s_{k\tau+m}=s,a_{i,k\tau+m}=a_{i}\}=0\right\}
Xk,m\displaystyle X_{k,m} :=𝟏{skτ+m+1=s,skτ+m=s,ai,kτ+m=ai}(Piθ¯(s|s,ai)+ϵ)𝟏{skτ+m=s,ai,kτ+m=ai}\displaystyle:=\mathbf{1}\{s_{k\tau+m+1}=s^{\prime},s_{k\tau+m}=s,a_{i,k\tau+m}=a_{i}\}-(\overline{P_{i}^{\theta}}(s^{\prime}|s,a_{i})+\epsilon)\mathbf{1}\{s_{k\tau+m}=s,a_{i,k\tau+m}=a_{i}\}
Xk,m\displaystyle X_{k,m}^{\prime} :=𝟏{skτ+m=s,ai,kτ+m=ai}\displaystyle:=\mathbf{1}\{s_{k\tau+m}=s,a_{i,k\tau+m}=a_{i}\}
Yk,m\displaystyle Y_{k,m} :=Xk,m𝔼[Xk,m|(k1)τ+m]\displaystyle:=X_{k,m}-\mathbb{{E}}[X_{k,m}|\mathcal{F}_{(k-1)\tau+m}]
Yk,m\displaystyle Y_{k,m}^{\prime} :=Xk,m𝔼[Xk,m|(k1)τ+m]\displaystyle:=X_{k,m}^{\prime}-\mathbb{{E}}[X_{k,m}^{\prime}|\mathcal{F}_{(k-1)\tau+m}]

Then {Yk,m}k=0T1mτ\{Y_{k,m}\}_{k=0}^{\lfloor\frac{T-1-m}{\tau}\rfloor} is a martingale difference sequence. Because ϵ1\epsilon\leq 1, it is easy to verify that |Xk,m|2,|Xk,m|1|X_{k,m}|\leq 2,|X_{k,m}^{\prime}|\leq 1. We have that:

|Yk,m||Xk,m|+𝔼[|Xk,m||(k1)τ+m]4,|Yk,m||Xk,m|+𝔼[|Xk,m||(k1)τ+m]2.|Y_{k,m}|\leq|X_{k,m}|+\mathbb{{E}}[|X_{k,m}||\mathcal{F}_{(k-1)\tau+m}]\leq 4,~{}~{}|Y_{k,m}^{\prime}|\leq|X_{k,m}^{\prime}|+\mathbb{{E}}[|X_{k,m}^{\prime}||\mathcal{F}_{(k-1)\tau+m}]\leq 2.

Further,

𝔼[Xk,m|(k1)τ+m]=𝔼[𝟏{skτ+m=s,ai,kτ+m=ai}|(k1)τ+m]σ,\mathbb{{E}}[X^{\prime}_{k,m}|\mathcal{F}_{(k-1)\tau+m}]=\mathbb{{E}}[\mathbf{1}\{s_{k\tau+m}=s,a_{i,k\tau+m}=a_{i}\}|\mathcal{F}_{(k-1)\tau+m}]\geq\sigma,

and that

𝔼[Xk,m|(k1)τ+m]\displaystyle\mathbb{{E}}[X_{k,m}|\mathcal{F}_{(k-1)\tau+m}]
=𝔼[𝟏{skτ+m+1=s,skτ+m=s,ai,kτ+m=ai}(Piθ¯(s|s,ai)+ϵ)𝟏{skτ+m=s,ai,kτ+m=ai}|(k1)τ+m]\displaystyle=\mathbb{{E}}[\mathbf{1}\{s_{k\tau+m+1}=s^{\prime},s_{k\tau+m}=s,a_{i,k\tau+m}=a_{i}\}-(\overline{P_{i}^{\theta}}(s^{\prime}|s,a_{i})+\epsilon)\mathbf{1}\{s_{k\tau+m}=s,a_{i,k\tau+m}=a_{i}\}|\mathcal{F}_{(k-1)\tau+m}] (41)
=ϵ𝔼[𝟏{skτ+m=s,ai,kτ+m=ai}|(k1)τ+m]ϵσ.\displaystyle=-\epsilon\mathbb{{E}}[\mathbf{1}\{s_{k\tau+m}=s,a_{i,k\tau+m}=a_{i}\}|\mathcal{F}_{(k-1)\tau+m}]\leq-\epsilon\sigma. (42)

To move from (41) to (42), we used the fact that:

𝔼[𝟏{st+1=s,st=s,ai,t=ai}|t1]\displaystyle\mathbb{{E}}[\mathbf{1}\{s_{t+1}=s^{\prime},s_{t}=s,a_{i,t}=a_{i}\}|\mathcal{F}_{t-1}] =P(s|st1,at1)aiπθ(ai,ai|s)P(s|s,ai,ai)\displaystyle=P(s|s_{t-1},a_{t-1})\sum_{a_{-i}}\pi_{\theta}(a_{i},a_{-i}|s)P(s^{\prime}|s,a_{i},a_{-i})
=P(s|st1,at1)πθi(ai|s)aiπθi(ai|s)P(s|s,ai,ai)\displaystyle=P(s|s_{t-1},a_{t-1})\pi_{\theta_{i}}(a_{i}|s)\sum_{a_{-i}}\pi_{\theta_{-i}}(a_{-i}|s)P(s^{\prime}|s,a_{i},a_{-i})
=P(s|st1,at1)πθi(ai|s)Piθ¯(s|s,ai)\displaystyle=P(s|s_{t-1},a_{t-1})\pi_{\theta_{i}}(a_{i}|s)\overline{P_{i}^{\theta}}(s^{\prime}|s,a_{i})
=𝔼[Piθ¯(s|s,ai)𝟏{st=s,ai,t=ai}|t1]\displaystyle=\mathbb{{E}}[\overline{P_{i}^{\theta}}(s^{\prime}|s,a_{i})\mathbf{1}\{s_{t}=s,a_{i,t}=a_{i}\}|\mathcal{F}_{t-1}]

and the inequality in (42) is derived directly from Definition 6.

According to the Azuma–Hoeffding inequality [71, 72]:

Pr(Am)\displaystyle\Pr(A_{m}) =Pr(k=0T1mτXk,m0)\displaystyle=\Pr\left(\sum_{k=0}^{\lfloor\frac{T-1-m}{\tau}\rfloor}X_{k,m}\geq 0\right)
=Pr(k=0T1mτYk,mk=0T1mτ𝔼[Xk,m|(k1)τ+m])\displaystyle=\Pr\left(\sum_{k=0}^{\lfloor\frac{T-1-m}{\tau}\rfloor}Y_{k,m}\geq-\sum_{k=0}^{\lfloor\frac{T-1-m}{\tau}\rfloor}\mathbb{{E}}[X_{k,m}|\mathcal{F}_{(k-1)\tau+m}]\right)
Pr(k=0T1mτYk,mT1m+ττϵσ)\displaystyle\leq\Pr\left(\sum_{k=0}^{\lfloor\frac{T-1-m}{\tau}\rfloor}Y_{k,m}\geq\left\lfloor\frac{T-1-m+\tau}{\tau}\right\rfloor\epsilon\sigma\right)
exp(ϵ2σ2Tτ32)\displaystyle\leq\exp\left(-\frac{\epsilon^{2}\sigma^{2}\left\lfloor\frac{T}{\tau}\right\rfloor}{32}\right)

Similarly, from Azuma-Hoeffding inequality,

Pr(Am)\displaystyle\Pr(A_{m}^{\prime}) =Pr(k=0T1mτXk,m=0)\displaystyle=\Pr\left(\sum_{k=0}^{\lfloor\frac{T-1-m}{\tau}\rfloor}X_{k,m}^{\prime}=0\right)
=Pr(k=0T1mτYk,m=k=0T1mτ𝔼[Xk,m|(k1)τ+m])\displaystyle=\Pr\left(\sum_{k=0}^{\lfloor\frac{T-1-m}{\tau}\rfloor}Y_{k,m}^{\prime}=-\sum_{k=0}^{\lfloor\frac{T-1-m}{\tau}\rfloor}\mathbb{{E}}[X_{k,m}^{\prime}|\mathcal{F}_{(k-1)\tau+m}]\right)
Pr(k=0T1mτYk,mT1m+ττσ)\displaystyle\leq\Pr\left(\sum_{k=0}^{\lfloor\frac{T-1-m}{\tau}\rfloor}Y_{k,m}^{\prime}\leq-\left\lfloor\frac{T-1-m+\tau}{\tau}\right\rfloor\sigma\right)
exp(σ2Tτ8)\displaystyle\leq\exp\left(-\frac{\sigma^{2}\left\lfloor\frac{T}{\tau}\right\rfloor}{8}\right) (43)

Thus

Pr(Piθ^(s|s,ai)Piθ¯(s|s,ai)ϵ)m=0τ1Pr(Am)+Pr(Am)\displaystyle\Pr\left(\widehat{P_{i}^{\theta}}(s^{\prime}|s,a_{i})-\overline{P_{i}^{\theta}}(s^{\prime}|s,a_{i})\geq\epsilon\right)\leq\sum_{m=0}^{\tau-1}\Pr\left(A_{m}\right)+\Pr\left(A_{m}^{\prime}\right)
τexp(ϵ2σ2Tτ32)+τexp(σ2Tτ8)2τexp(ϵ2σ2Tτ32)\displaystyle\leq\tau\exp\left(-\frac{\epsilon^{2}\sigma^{2}\left\lfloor\frac{T}{\tau}\right\rfloor}{32}\right)+\tau\exp\left(-\frac{\sigma^{2}\left\lfloor\frac{T}{\tau}\right\rfloor}{8}\right)\leq 2\tau\exp\left(-\frac{\epsilon^{2}\sigma^{2}\left\lfloor\frac{T}{\tau}\right\rfloor}{32}\right)

Similarly

Pr(Piθ^(s|s,ai)Piθ¯(s|s,ai)ϵ)2τexp(ϵ2σ2Tτ32)\displaystyle\Pr\left(\widehat{P_{i}^{\theta}}(s^{\prime}|s,a_{i})-\overline{P_{i}^{\theta}}(s^{\prime}|s,a_{i})\leq-\epsilon\right)\leq 2\tau\exp\left(-\frac{\epsilon^{2}\sigma^{2}\left\lfloor\frac{T}{\tau}\right\rfloor}{32}\right)
Pr(|Piθ^(s|s,ai)Piθ¯(s|s,ai)|ϵ)4τexp(ϵ2σ2Tτ32)\displaystyle\Longrightarrow\Pr\left(\left|\widehat{P_{i}^{\theta}}(s^{\prime}|s,a_{i})-\overline{P_{i}^{\theta}}(s^{\prime}|s,a_{i})\right|\geq\epsilon\right)\leq 4\tau\exp\left(-\frac{\epsilon^{2}\sigma^{2}\left\lfloor\frac{T}{\tau}\right\rfloor}{32}\right)

which completes the proof. ∎

Lemma 13.

Assume that the stochastic game with policy θ\theta satisfies the (τ,σ)(\tau,\sigma)-sufficient exploration condition (Definition 6). Then for fixed s,ais,a_{i} and any ϵ1\epsilon\leq 1,

Pr(|riθ^(s,ai)riθ¯(s,ai)|ϵ)4τexp(ϵ2σ2Tτ32)\Pr\left(\left|\widehat{r_{i}^{\theta}}(s,a_{i})-\overline{r_{i}^{\theta}}(s,a_{i})\right|\geq\epsilon\right)\leq 4\tau\exp\left(-\frac{\epsilon^{2}\sigma^{2}\lfloor\frac{T}{\tau}\rfloor}{32}\right)
Proof.

The proof is similar to that of Lemma 12.

{riθ^(s,ai)riθ¯(s,ai)ϵ}\displaystyle\left\{\widehat{r_{i}^{\theta}}(s,a_{i})-\overline{r_{i}^{\theta}}(s,a_{i})\geq\epsilon\right\}
{t=0T𝟏{st=s,ai,t=ai}ri(st,at)(riθ¯(s,ai)+ϵ)𝟏{st=s,ai,t=ai}0}\displaystyle\subseteq\left\{\sum_{t=0}^{T}\mathbf{1}\{s_{t}=s,a_{i,t}=a_{i}\}r_{i}(s_{t},a_{t})-(\overline{r_{i}^{\theta}}(s,a_{i})+\epsilon)\mathbf{1}\{s_{t}=s,a_{i,t}=a_{i}\}\geq 0\right\}
{t=0T1𝟏{st=s,ai,t=ai}=0}\displaystyle\qquad\mathop{\cup}\left\{\sum_{t=0}^{T-1}\mathbf{1}\{s_{t}=s,a_{i,t}=a_{i}\}=0\right\}
m=0τ1{k=0Tmτ𝟏{skτ+m=s,ai,kτ+m=ai}ri(skτ+m,akτ+m)(riθ¯(s,ai)+ϵ)𝟏{skτ+m=s,ai,kτ+m=ai}0}\displaystyle\subseteq\mathop{\cup}_{m=0}^{\tau-1}\left\{\sum_{k=0}^{\lfloor\frac{T-m}{\tau}\rfloor}\mathbf{1}\{s_{k\tau+m}=s,a_{i,k\tau+m}=a_{i}\}r_{i}(s_{k\tau+m},a_{k\tau+m})-(\overline{r_{i}^{\theta}}(s,a_{i})+\epsilon)\mathbf{1}\{s_{k\tau+m}=s,a_{i,k\tau+m}=a_{i}\}\geq 0\right\}
m=0τ1{k=0T1mτ𝟏{skτ+m=s,ai,kτ+m=ai}=0}\displaystyle\qquad\mathop{\cup}_{m=0}^{\tau-1}\left\{\sum_{k=0}^{\lfloor\frac{T-1-m}{\tau}\rfloor}\mathbf{1}\{s_{k\tau+m}=s,a_{i,k\tau+m}=a_{i}\}=0\right\}

Let:

Am\displaystyle A_{m} :={k=0Tmτ𝟏{skτ+m=s,ai,kτ+m=ai}ri(skτ+m,akτ+m)(riθ¯(s,ai)+ϵ)𝟏{skτ+m=s,ai,kτ+m=ai}0}\displaystyle:=\left\{\sum_{k=0}^{\lfloor\frac{T-m}{\tau}\rfloor}\mathbf{1}\{s_{k\tau+m}=s,a_{i,k\tau+m}=a_{i}\}r_{i}(s_{k\tau+m},a_{k\tau+m})-(\overline{r_{i}^{\theta}}(s,a_{i})+\epsilon)\mathbf{1}\{s_{k\tau+m}=s,a_{i,k\tau+m}=a_{i}\}\geq 0\right\}
Am\displaystyle A_{m}^{\prime} :={k=0T1mτ𝟏{skτ+m=s,ai,kτ+m=ai}=0}\displaystyle:=\left\{\sum_{k=0}^{\lfloor\frac{T-1-m}{\tau}\rfloor}\mathbf{1}\{s_{k\tau+m}=s,a_{i,k\tau+m}=a_{i}\}=0\right\}
Xk,m\displaystyle X_{k,m} :=𝟏{skτ+m=s,ai,kτ+m=ai}ri(skτ+m,akτ+m)(riθ¯(s,ai)+ϵ)𝟏{skτ+m=s,ai,kτ+m=ai}\displaystyle:=\mathbf{1}\{s_{k\tau+m}=s,a_{i,k\tau+m}=a_{i}\}r_{i}(s_{k\tau+m},a_{k\tau+m})-(\overline{r_{i}^{\theta}}(s,a_{i})+\epsilon)\mathbf{1}\{s_{k\tau+m}=s,a_{i,k\tau+m}=a_{i}\}
Yk,m\displaystyle Y_{k,m} :=Xk,m𝔼[Xk,m|(k1)τ+m]\displaystyle:=X_{k,m}-\mathbb{{E}}[X_{k,m}|\mathcal{F}_{(k-1)\tau+m}]

Then {Yk,m}k=0T1mτ\{Y_{k,m}\}_{k=0}^{\lfloor\frac{T-1-m}{\tau}\rfloor} is a martingale difference sequence. Because ϵ1\epsilon\leq 1, it is easy to verify that |Xk,m|2|X_{k,m}|\leq 2. We have that:

|Yk,m||Xk,m|+𝔼[|Xk,m||(k1)τ+m]4.|Y_{k,m}|\leq|X_{k,m}|+\mathbb{{E}}[|X_{k,m}||\mathcal{F}_{(k-1)\tau+m}]\leq 4.

Further,

𝔼[Xk,m|(k1)τ+m]\displaystyle\mathbb{{E}}[X_{k,m}|\mathcal{F}_{(k-1)\tau+m}]
=𝔼[𝟏{skτ+m=s,ai,kτ+m=ai}ri(skτ+m,akτ+m)(riθ¯(s,ai)+ϵ)𝟏{skτ+m=s,ai,kτ+m=ai}|(k1)τ+m]\displaystyle=\mathbb{{E}}[\mathbf{1}\{s_{k\tau+m}=s,a_{i,k\tau+m}=a_{i}\}r_{i}(s_{k\tau+m},a_{k\tau+m})-(\overline{r_{i}^{\theta}}(s,a_{i})+\epsilon)\mathbf{1}\{s_{k\tau+m}=s,a_{i,k\tau+m}=a_{i}\}|\mathcal{F}_{(k-1)\tau+m}]
=ϵ𝔼[𝟏{skτ+m=s,ai,kτ+m=ai}|(k1)τ+m]ϵσ\displaystyle=-\epsilon\mathbb{{E}}[\mathbf{1}\{s_{k\tau+m}=s,a_{i,k\tau+m}=a_{i}\}|\mathcal{F}_{(k-1)\tau+m}]\leq-\epsilon\sigma

The step from the second line to the third line uses the fact that:

𝔼[𝟏{st=s,ai,t=ai}ri(st,at)|t1]\displaystyle\mathbb{{E}}[\mathbf{1}\{s_{t}=s,a_{i,t}=a_{i}\}r_{i}(s_{t},a_{t})|\mathcal{F}_{t-1}] =P(s|st1,at1)aiπθ(ai,ai|s)ri(s,ai,ai)\displaystyle=P(s|s_{t-1},a_{t-1})\sum_{a_{-i}}\pi_{\theta}(a_{i},a_{-i}|s)r_{i}(s,a_{i},a_{-i})
=P(s|st1,at1)πθi(ai|s)riθ¯(s,ai)\displaystyle=P(s|s_{t-1},a_{t-1})\pi_{\theta_{i}}(a_{i}|s)\overline{r_{i}^{\theta}}(s,a_{i})
=𝔼[riθ¯(s,ai)𝟏{st=s,ai,t=ai}|t1]\displaystyle=\mathbb{{E}}[\overline{r_{i}^{\theta}}(s,a_{i})\mathbf{1}\{s_{t}=s,a_{i,t}=a_{i}\}|\mathcal{F}_{t-1}]

and the inequality in the third line is derived directly from Definition 6.

According to the Azuma–Hoeffding inequality:

Pr(Am)\displaystyle\Pr(A_{m}) =Pr(k=0TmτXk,m0)\displaystyle=\Pr\left(\sum_{k=0}^{\lfloor\frac{T-m}{\tau}\rfloor}X_{k,m}\geq 0\right)
=Pr(k=0TmτYk,mk=0Tmτ𝔼[Xk,m|(k1)τ+m])\displaystyle=\Pr\left(\sum_{k=0}^{\lfloor\frac{T-m}{\tau}\rfloor}Y_{k,m}\geq-\sum_{k=0}^{\lfloor\frac{T-m}{\tau}\rfloor}\mathbb{{E}}[X_{k,m}|\mathcal{F}_{(k-1)\tau+m}]\right)
Pr(k=0TmτYk,mTm+ττϵσ)\displaystyle\leq\Pr\left(\sum_{k=0}^{\lfloor\frac{T-m}{\tau}\rfloor}Y_{k,m}\geq\left\lfloor\frac{T-m+\tau}{\tau}\right\rfloor\epsilon\sigma\right)
exp(ϵ2σ2Tτ32)\displaystyle\leq\exp\left(-\frac{\epsilon^{2}\sigma^{2}\left\lfloor\frac{T}{\tau}\right\rfloor}{32}\right)

As in (43), we have that:

Pr(Am)exp(σ2Tτ8)\Pr(A_{m}^{\prime})\leq\exp\left(-\frac{\sigma^{2}\left\lfloor\frac{T}{\tau}\right\rfloor}{8}\right)

Thus

Pr(riθ^(s,ai)riθ¯(s,ai)ϵ)m=0τ1Pr(Am)+Pr(Am)2τexp(ϵ2σ2Tτ32)\displaystyle\Pr\left(\widehat{r_{i}^{\theta}}(s,a_{i})-\overline{r_{i}^{\theta}}(s,a_{i})\geq\epsilon\right)\leq\sum_{m=0}^{\tau-1}\Pr\left(A_{m}\right)+\Pr\left(A_{m}^{\prime}\right)\leq 2\tau\exp\left(-\frac{\epsilon^{2}\sigma^{2}\left\lfloor\frac{T}{\tau}\right\rfloor}{32}\right)

Similarly

Pr(riθ^(s,ai)riθ¯(s,ai)ϵ)2τexp(ϵ2σ2Tτ32)\displaystyle\Pr\left(\widehat{r_{i}^{\theta}}(s,a_{i})-\overline{r_{i}^{\theta}}(s,a_{i})\leq-\epsilon\right)\leq 2\tau\exp\left(-\frac{\epsilon^{2}\sigma^{2}\left\lfloor\frac{T}{\tau}\right\rfloor}{32}\right)
Pr(|riθ^(s,ai)riθ¯(s,ai)|ϵ)4τexp(ϵ2σ2Tτ32)\displaystyle\Longrightarrow\Pr\left(\left|\widehat{r_{i}^{\theta}}(s,a_{i})-\overline{r_{i}^{\theta}}(s,a_{i})\right|\geq\epsilon\right)\leq 4\tau\exp\left(-\frac{\epsilon^{2}\sigma^{2}\left\lfloor\frac{T}{\tau}\right\rfloor}{32}\right)

which completes the proof. ∎

Lemmas 12 and 13 lead to the following corollary:

Corollary 1.
Pr(Miθ^Miθ¯ϵ)\displaystyle\Pr(\|\widehat{M^{\theta}_{i}}-\overline{M^{\theta}_{i}}\|_{\infty}\geq\epsilon) 4τ|𝒮|2|𝒜i|exp(ϵ2σ2Tτ32|𝒮|2)\displaystyle\leq 4\tau|\mathcal{S}|^{2}|\mathcal{A}_{i}|\exp\left(-\frac{\epsilon^{2}\sigma^{2}\lfloor\frac{T}{\tau}\rfloor}{32|\mathcal{S}|^{2}}\right) (44)
Pr(riθ^riθ¯ϵ)\displaystyle\Pr(\|\widehat{r^{\theta}_{i}}-\overline{r^{\theta}_{i}}\|_{\infty}\geq\epsilon) 4τ|𝒮||𝒜i|exp(ϵ2σ2Tτ32)\displaystyle\leq 4\tau|\mathcal{S}||\mathcal{A}_{i}|\exp\left(-\frac{\epsilon^{2}\sigma^{2}\lfloor\frac{T}{\tau}\rfloor}{32}\right) (45)
Proof.

We first prove (44).

Miθ^Miθ¯\displaystyle\|\widehat{M^{\theta}_{i}}-\overline{M^{\theta}_{i}}\|_{\infty} =max(s,ai)(s,ai)πθi(ai|s)|Piθ^(s|s,ai)Piθ¯(s|s,ai)|\displaystyle=\max_{(s,a_{i})}\sum_{(s^{\prime},a_{i}^{\prime})}\pi_{\theta_{i}}(a_{i}^{\prime}|s^{\prime})\left|\widehat{P_{i}^{\theta}}(s^{\prime}|s,a_{i})-\overline{P_{i}^{\theta}}(s^{\prime}|s,a_{i})\right|
=max(s,ai)s|Piθ^(s|s,ai)Piθ¯(s|s,ai)|\displaystyle=\max_{(s,a_{i})}\sum_{s^{\prime}}\left|\widehat{P_{i}^{\theta}}(s^{\prime}|s,a_{i})-\overline{P_{i}^{\theta}}(s^{\prime}|s,a_{i})\right|

Thus,

{Miθ^Miθ¯ϵ}\displaystyle\left\{\|\widehat{M^{\theta}_{i}}-\overline{M^{\theta}_{i}}\|_{\infty}\geq\epsilon\right\} =(s,ai){s|Piθ^(s|s,ai)Piθ¯(s|s,ai)|ϵ}\displaystyle=\mathop{\cup}_{(s,a_{i})}\left\{\sum_{s^{\prime}}\left|\widehat{P_{i}^{\theta}}(s^{\prime}|s,a_{i})-\overline{P_{i}^{\theta}}(s^{\prime}|s,a_{i})\right|\geq\epsilon\right\}
(s,ai)s{|Piθ^(s|s,ai)Piθ¯(s|s,ai)|ϵ|𝒮|}\displaystyle\subseteq\mathop{\cup}_{(s,a_{i})}\mathop{\cup}_{s^{\prime}}\left\{\left|\widehat{P_{i}^{\theta}}(s^{\prime}|s,a_{i})-\overline{P_{i}^{\theta}}(s^{\prime}|s,a_{i})\right|\geq\frac{\epsilon}{|\mathcal{S}|}\right\}

Then according to Lemma 12,

Pr(Miθ^Miθ¯ϵ)\displaystyle\Pr\left(\|\widehat{M^{\theta}_{i}}-\overline{M^{\theta}_{i}}\|_{\infty}\geq\epsilon\right) (s,s,ai)Pr(|Piθ^(s|s,ai)Piθ¯(s|s,ai)|ϵ|𝒮|)\displaystyle\leq\sum_{(s^{\prime},s,a_{i})}\Pr\left(\left|\widehat{P_{i}^{\theta}}(s^{\prime}|s,a_{i})-\overline{P_{i}^{\theta}}(s^{\prime}|s,a_{i})\right|\geq\frac{\epsilon}{|\mathcal{S}|}\right)
4τ|𝒮|2|𝒜i|exp(ϵ2σ2Tτ32|𝒮|2).\displaystyle\leq 4\tau|\mathcal{S}|^{2}|\mathcal{A}_{i}|\exp\left(-\frac{\epsilon^{2}\sigma^{2}\lfloor\frac{T}{\tau}\rfloor}{32|\mathcal{S}|^{2}}\right).

Now we prove (45). Since

{riθ^riθ¯ϵ}=(s,ai){|riθ^(s,ai)riθ¯(s,ai)|ϵ},\displaystyle\left\{\|\widehat{r^{\theta}_{i}}-\overline{r^{\theta}_{i}}\|_{\infty}\geq\epsilon\right\}=\mathop{\cup}_{(s,a_{i})}\left\{\left|\widehat{r_{i}^{\theta}}(s,a_{i})-\overline{r_{i}^{\theta}}(s,a_{i})\right|\geq\epsilon\right\},

according to Lemma 13,

Pr(riθ^riθ¯ϵ)\displaystyle\Pr\left(\|\widehat{r^{\theta}_{i}}-\overline{r^{\theta}_{i}}\|_{\infty}\geq\epsilon\right) (s,ai)Pr(|riθ^(s,ai)riθ¯(s,ai)|ϵ)\displaystyle\leq\sum_{(s,a_{i})}\Pr\left(\left|\widehat{r_{i}^{\theta}}(s,a_{i})-\overline{r_{i}^{\theta}}(s,a_{i})\right|\geq\epsilon\right)
4τ|𝒮||𝒜i|exp(ϵ2σ2Tτ32),\displaystyle\leq 4\tau|\mathcal{S}||\mathcal{A}_{i}|\exp\left(-\frac{\epsilon^{2}\sigma^{2}\left\lfloor\frac{T}{\tau}\right\rfloor}{32}\right),

which completes the proof of the corollary. ∎
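The indicator events used in Lemmas 12 and 13 correspond to natural count-based estimates of Piθ¯\overline{P_{i}^{\theta}} and riθ¯\overline{r_{i}^{\theta}} formed from a single trajectory. The exact estimators are specified in Algorithm 1 in the main text; the following Python sketch is only meant to illustrate this counting construction, with the trajectory format and variable names as assumptions.

```python
import numpy as np

def empirical_estimates(states, acts_i, rewards_i, n_states, n_actions_i):
    """Count-based estimates of P_i(s'|s,a_i) and r_i(s,a_i) from one trajectory."""
    visit = np.zeros((n_states, n_actions_i))
    trans = np.zeros((n_states, n_actions_i, n_states))
    rsum = np.zeros((n_states, n_actions_i))
    for t in range(len(rewards_i)):
        s, ai = states[t], acts_i[t]
        visit[s, ai] += 1.0
        rsum[s, ai] += rewards_i[t]
        if t + 1 < len(states):
            trans[s, ai, states[t + 1]] += 1.0      # transition counts
    count = np.maximum(visit, 1.0)                  # guard unvisited pairs
    return trans / count[:, :, None], rsum / count  # (P_hat, r_hat)
```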

We are now ready to prove Theorem 7.

Proof.

(of Theorem 7) From the definition of Qiθ^,Qiθ¯\widehat{Q_{i}^{\theta}},\overline{Q_{i}^{\theta}},

Qiθ¯\displaystyle\overline{Q_{i}^{\theta}} =(IγMiθ¯)1riθ¯,\displaystyle=(I-\gamma\overline{M^{\theta}_{i}})^{-1}\overline{r_{i}^{\theta}},
Qiθ^\displaystyle\widehat{Q_{i}^{\theta}} =(IγMiθ^)1riθ^,\displaystyle=(I-\gamma\widehat{M^{\theta}_{i}})^{-1}\widehat{r_{i}^{\theta}},

we have that

Qiθ^Qiθ¯\displaystyle\|\widehat{Q_{i}^{\theta}}-\overline{Q_{i}^{\theta}}\|_{\infty} =(IγMiθ^)1riθ^(IγMiθ¯)1riθ¯\displaystyle=\left\|(I-\gamma\widehat{M^{\theta}_{i}})^{-1}\widehat{r_{i}^{\theta}}-(I-\gamma\overline{M^{\theta}_{i}})^{-1}\overline{r_{i}^{\theta}}\right\|_{\infty}
=(IγMiθ¯)1(riθ^riθ¯)+((IγMiθ^)1(IγMiθ¯)1)riθ^\displaystyle=\left\|(I-\gamma\overline{M^{\theta}_{i}})^{-1}(\widehat{r_{i}^{\theta}}-\overline{r_{i}^{\theta}})+\left((I-\gamma\widehat{M^{\theta}_{i}})^{-1}-(I-\gamma\overline{M^{\theta}_{i}})^{-1}\right)\widehat{r_{i}^{\theta}}\right\|_{\infty}
(IγMiθ¯)1(riθ^riθ¯)+γ(IγMiθ¯)1(Miθ^Miθ¯)(IγMiθ^)1riθ^.\displaystyle\leq\left\|(I-\gamma\overline{M^{\theta}_{i}})^{-1}(\widehat{r_{i}^{\theta}}-\overline{r_{i}^{\theta}})\right\|_{\infty}+\left\|\gamma(I-\gamma\overline{M^{\theta}_{i}})^{-1}(\widehat{M^{\theta}_{i}}-\overline{M^{\theta}_{i}})(I-\gamma\widehat{M^{\theta}_{i}})^{-1}\widehat{r_{i}^{\theta}}\right\|_{\infty}.

Because both Miθ¯\overline{M^{\theta}_{i}} and Miθ^\widehat{M^{\theta}_{i}} are transition probability matrices, we have:

Miθ¯x\displaystyle\|\overline{M^{\theta}_{i}}x\|_{\infty} x\displaystyle\leq\|x\|_{\infty}
Miθ^x\displaystyle\|\widehat{M^{\theta}_{i}}x\|_{\infty} x\displaystyle\leq\|x\|_{\infty}
(IγMiθ¯)1x\displaystyle\|(I-\gamma\overline{M^{\theta}_{i}})^{-1}x\|_{\infty} 11γx\displaystyle\leq\frac{1}{1-\gamma}\|x\|_{\infty}
(IγMiθ^)1x\displaystyle\|(I-\gamma\widehat{M^{\theta}_{i}})^{-1}x\|_{\infty} 11γx\displaystyle\leq\frac{1}{1-\gamma}\|x\|_{\infty}

Thus,

Qiθ^Qiθ¯\displaystyle\|\widehat{Q_{i}^{\theta}}-\overline{Q_{i}^{\theta}}\|_{\infty} (IγMiθ¯)1(riθ^riθ¯)+γ(IγMiθ¯)1(Miθ^Miθ¯)(IγMiθ^)1riθ^\displaystyle\leq\left\|(I-\gamma\overline{M^{\theta}_{i}})^{-1}(\widehat{r_{i}^{\theta}}-\overline{r_{i}^{\theta}})\right\|_{\infty}+\left\|\gamma(I-\gamma\overline{M^{\theta}_{i}})^{-1}(\widehat{M^{\theta}_{i}}-\overline{M^{\theta}_{i}})(I-\gamma\widehat{M^{\theta}_{i}})^{-1}\widehat{r_{i}^{\theta}}\right\|_{\infty}
11γriθ^riθ¯+γ(1γ)2Miθ^Miθ¯riθ^\displaystyle\leq\frac{1}{1-\gamma}\|\widehat{r_{i}^{\theta}}-\overline{r_{i}^{\theta}}\|_{\infty}+\frac{\gamma}{(1-\gamma)^{2}}\|\widehat{M^{\theta}_{i}}-\overline{M^{\theta}_{i}}\|_{\infty}\|\widehat{r_{i}^{\theta}}\|_{\infty}
11γriθ^riθ¯+γ(1γ)2Miθ^Miθ¯\displaystyle\leq\frac{1}{1-\gamma}\|\widehat{r_{i}^{\theta}}-\overline{r_{i}^{\theta}}\|_{\infty}+\frac{\gamma}{(1-\gamma)^{2}}\|\widehat{M^{\theta}_{i}}-\overline{M^{\theta}_{i}}\|_{\infty}

Thus if

riθ^riθ¯(1γ)2ϵ,Miθ^Miθ¯(1γ)2ϵ,\displaystyle\|\widehat{r_{i}^{\theta}}-\overline{r_{i}^{\theta}}\|_{\infty}\leq(1-\gamma)^{2}\epsilon,\quad\|\widehat{M^{\theta}_{i}}-\overline{M^{\theta}_{i}}\|_{\infty}\leq(1-\gamma)^{2}\epsilon,

we have that:

Qiθ^Qiθ¯ϵ.\|\widehat{Q_{i}^{\theta}}-\overline{Q_{i}^{\theta}}\|_{\infty}\leq\epsilon.

Thus, from Corollary 1,

Pr(Qiθ^Qiθ¯ϵ)\displaystyle\Pr\left(\|\widehat{Q_{i}^{\theta}}-\overline{Q_{i}^{\theta}}\|_{\infty}\geq\epsilon\right) Pr(riθ^riθ¯(1γ)2ϵ)+Pr(Miθ^Miθ¯(1γ)2ϵ)\displaystyle\leq\Pr\left(\|\widehat{r_{i}^{\theta}}-\overline{r_{i}^{\theta}}\|_{\infty}\geq(1-\gamma)^{2}\epsilon\right)+\Pr\left(\|\widehat{M^{\theta}_{i}}-\overline{M^{\theta}_{i}}\|_{\infty}\geq(1-\gamma)^{2}\epsilon\right)
4τ|𝒮||𝒜i|exp((1γ)4ϵ2σ2Tτ32)+4τ|𝒮|2|𝒜i|exp((1γ)4ϵ2σ2Tτ32|𝒮|2)\displaystyle\leq 4\tau|\mathcal{S}||\mathcal{A}_{i}|\exp\left(-\frac{(1-\gamma)^{4}\epsilon^{2}\sigma^{2}\lfloor\frac{T}{\tau}\rfloor}{32}\right)+4\tau|\mathcal{S}|^{2}|\mathcal{A}_{i}|\exp\left(-\frac{(1-\gamma)^{4}\epsilon^{2}\sigma^{2}\lfloor\frac{T}{\tau}\rfloor}{32|\mathcal{S}|^{2}}\right)
8τ|𝒮|2|𝒜i|exp((1γ)4ϵ2σ2Tτ32|𝒮|2),\displaystyle\leq 8\tau|\mathcal{S}|^{2}|\mathcal{A}_{i}|\exp\left(-\frac{(1-\gamma)^{4}\epsilon^{2}\sigma^{2}\lfloor\frac{T}{\tau}\rfloor}{32|\mathcal{S}|^{2}}\right),

which completes the proof. ∎
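Once Piθ^\widehat{P_{i}^{\theta}} and riθ^\widehat{r_{i}^{\theta}} are available, the identity Qiθ^=(IγMiθ^)1riθ^\widehat{Q_{i}^{\theta}}=(I-\gamma\widehat{M^{\theta}_{i}})^{-1}\widehat{r_{i}^{\theta}} used above amounts to a single linear solve. Below is a minimal sketch, with state–action pairs flattened into one index and the policy πθi\pi_{\theta_{i}} assumed to be given as an |𝒮|×|𝒜i||\mathcal{S}|\times|\mathcal{A}_{i}| array (the data layout is an assumption made for illustration):

```python
import numpy as np

def averaged_q(P_hat, r_hat, pi_i, gamma):
    """Solve Q = r + gamma*M*Q with M[(s,a),(s',a')] = P_hat[s,a,s'] * pi_i[s',a']."""
    n_states, n_actions = r_hat.shape
    n = n_states * n_actions
    M = np.zeros((n, n))
    for s in range(n_states):
        for a in range(n_actions):
            for sp in range(n_states):
                for ap in range(n_actions):
                    M[s * n_actions + a, sp * n_actions + ap] = P_hat[s, a, sp] * pi_i[sp, ap]
    Q = np.linalg.solve(np.eye(n) - gamma * M, r_hat.reshape(n))
    return Q.reshape(n_states, n_actions)
```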

I.2 Bounding the estimation error of dθd_{\theta}

We first state our main result:

Theorem 8.

(Estimation error of dθd_{\theta}) Under Assumption 2,

Pr(dθ^dθ1ϵ)4τ|𝒮|2exp((1γ)2ϵ2σS2Tτ32γ2|𝒮|2),\Pr\left(\|\widehat{d_{\theta}}-d_{\theta}\|_{1}\geq\epsilon\right)\leq 4\tau|\mathcal{S}|^{2}\exp\left(-\frac{(1-\gamma)^{2}\epsilon^{2}\sigma_{S}^{2}\left\lfloor\frac{T}{\tau}\right\rfloor}{32\gamma^{2}|\mathcal{S}|^{2}}\right),

i.e., when

T32τ|𝒮|2(1γ)2ϵ2σS2log(4τ|𝒮|2δ)+1,T\geq\frac{32\tau|\mathcal{S}|^{2}}{(1-\gamma)^{2}\epsilon^{2}\sigma_{S}^{2}}\log\left(\frac{4\tau|\mathcal{S}|^{2}}{\delta}\right)+1,

with probability at least 1δ1-\delta, dθ^dθ1ϵ\|\widehat{d_{\theta}}-d_{\theta}\|_{1}\leq\epsilon.

Similar to the previous section, the proof of the theorem begins by bounding the estimation error |P𝒮θ^(s|s)P𝒮θ¯(s|s)||\widehat{P_{\mathcal{S}}^{\theta}}(s^{\prime}|s)-\overline{P_{\mathcal{S}}^{\theta}}(s^{\prime}|s)|.

Lemma 14.

Under Assumption 2, for fixed s,ss^{\prime},s and any ϵ1\epsilon\leq 1,

Pr(|P𝒮θ^(s|s)P𝒮θ¯(s|s)|ϵ)4τexp(ϵ2σS2Tτ32)\Pr\left(\left|\widehat{P_{\mathcal{S}}^{\theta}}(s^{\prime}|s)-\overline{P_{\mathcal{S}}^{\theta}}(s^{\prime}|s)\right|\geq\epsilon\right)\leq 4\tau\exp\left(-\frac{\epsilon^{2}\sigma_{S}^{2}\left\lfloor\frac{T}{\tau}\right\rfloor}{32}\right)
Proof.

According to the definition of P𝒮θ^\widehat{P_{\mathcal{S}}^{\theta}}, we have that

{P𝒮θ^(s|s)P𝒮θ¯(s|s)ϵ}\displaystyle\left\{\widehat{P_{\mathcal{S}}^{\theta}}(s^{\prime}|s)-\overline{P_{\mathcal{S}}^{\theta}}(s^{\prime}|s)\geq\epsilon\right\}
{t=0T1(𝟏{st+1=s,st=s}(P𝒮θ¯(s|s)+ϵ)𝟏{st=s})0}{t=0T1𝟏{st=s}=0}\displaystyle\subseteq\left\{\sum_{t=0}^{T-1}\left(\mathbf{1}\{s_{t+1}=s^{\prime},s_{t}=s\}-(\overline{P_{\mathcal{S}}^{\theta}}(s^{\prime}|s)+\epsilon)\mathbf{1}\{s_{t}=s\}\right)\geq 0\right\}\cup\left\{\sum_{t=0}^{T-1}\mathbf{1}\{s_{t}=s\}=0\right\}
m=0τ1{k=0T1mτ(𝟏{skτ+m+1=s,skτ+m=s}(P𝒮θ¯(s|s)+ϵ)𝟏{skτ+m=s})0}\displaystyle\subseteq\mathop{\cup}_{m=0}^{\tau-1}\left\{\sum_{k=0}^{\lfloor\frac{T-1-m}{\tau}\rfloor}\left(\mathbf{1}\{s_{k\tau+m+1}=s^{\prime},s_{k\tau+m}=s\}-(\overline{P_{\mathcal{S}}^{\theta}}(s^{\prime}|s)+\epsilon)\mathbf{1}\{s_{k\tau+m}=s\}\right)\geq 0\right\}
m=0τ1{k=0T1mτ𝟏{skτ+m=s}=0}\displaystyle\qquad\mathop{\cup}_{m=0}^{\tau-1}\left\{\sum_{k=0}^{\lfloor\frac{T-1-m}{\tau}\rfloor}\mathbf{1}\{s_{k\tau+m}=s\}=0\right\}

Let:

Am\displaystyle A_{m} :={k=0T1mτ(𝟏{skτ+m+1=s,skτ+m=s}(P𝒮θ¯(s|s)+ϵ)𝟏{skτ+m=s})0}\displaystyle:=\left\{\sum_{k=0}^{\lfloor\frac{T-1-m}{\tau}\rfloor}\left(\mathbf{1}\{s_{k\tau+m+1}=s^{\prime},s_{k\tau+m}=s\}-(\overline{P_{\mathcal{S}}^{\theta}}(s^{\prime}|s)+\epsilon)\mathbf{1}\{s_{k\tau+m}=s\}\right)\geq 0\right\}
Am\displaystyle A_{m}^{\prime} :={k=0T1mτ𝟏{skτ+m=s}=0}\displaystyle:=\left\{\sum_{k=0}^{\lfloor\frac{T-1-m}{\tau}\rfloor}\mathbf{1}\{s_{k\tau+m}=s\}=0\right\}
Xk,m\displaystyle X_{k,m} :=𝟏{skτ+m+1=s,skτ+m=s}(P𝒮θ¯(s|s)+ϵ)𝟏{skτ+m=s}\displaystyle:=\mathbf{1}\{s_{k\tau+m+1}=s^{\prime},s_{k\tau+m}=s\}-(\overline{P_{\mathcal{S}}^{\theta}}(s^{\prime}|s)+\epsilon)\mathbf{1}\{s_{k\tau+m}=s\}
Xk,m\displaystyle X_{k,m}^{\prime} :=𝟏{skτ+m=s}\displaystyle:=\mathbf{1}\{s_{k\tau+m}=s\}
Yk,m\displaystyle Y_{k,m} :=Xk,m𝔼[Xk,m|(k1)τ+m]\displaystyle:=X_{k,m}-\mathbb{{E}}[X_{k,m}|\mathcal{F}_{(k-1)\tau+m}]
Yk,m\displaystyle Y_{k,m}^{\prime} :=Xk,m𝔼[Xk,m|(k1)τ+m]\displaystyle:=X_{k,m}^{\prime}-\mathbb{{E}}[X_{k,m}^{\prime}|\mathcal{F}_{(k-1)\tau+m}]

Then {Yk,m}k=0T1mτ\{Y_{k,m}\}_{k=0}^{\lfloor\frac{T-1-m}{\tau}\rfloor} is a martingale difference sequence. Because ϵ1\epsilon\leq 1, it is easy to verify that |Xk,m|2,|Xk,m|1|X_{k,m}|\leq 2,|X_{k,m}^{\prime}|\leq 1. We have that:

|Yk,m||Xk,m|+𝔼[|Xk,m||(k1)τ+m]4,|Yk,m||Xk,m|+𝔼[|Xk,m||(k1)τ+m]2.|Y_{k,m}|\leq|X_{k,m}|+\mathbb{{E}}[|X_{k,m}||\mathcal{F}_{(k-1)\tau+m}]\leq 4,~{}~{}|Y_{k,m}^{\prime}|\leq|X_{k,m}^{\prime}|+\mathbb{{E}}[|X_{k,m}^{\prime}||\mathcal{F}_{(k-1)\tau+m}]\leq 2.

Further,

𝔼[Xk,m|(k1)τ+m]=𝔼[𝟏{skτ+m=s}|(k1)τ+m]σS,\mathbb{{E}}[X^{\prime}_{k,m}|\mathcal{F}_{(k-1)\tau+m}]=\mathbb{{E}}[\mathbf{1}\{s_{k\tau+m}=s\}|\mathcal{F}_{(k-1)\tau+m}]\geq\sigma_{S},

and that

𝔼[Xk,m|(k1)τ+m]\displaystyle\mathbb{{E}}[X_{k,m}|\mathcal{F}_{(k-1)\tau+m}]
=𝔼[𝟏{skτ+m+1=s,skτ+m=s}(P𝒮θ¯(s|s)+ϵ)𝟏{skτ+m=s}|(k1)τ+m]\displaystyle=\mathbb{{E}}[\mathbf{1}\{s_{k\tau+m+1}=s^{\prime},s_{k\tau+m}=s\}-(\overline{P_{\mathcal{S}}^{\theta}}(s^{\prime}|s)+\epsilon)\mathbf{1}\{s_{k\tau+m}=s\}|\mathcal{F}_{(k-1)\tau+m}]
=ϵ𝔼[𝟏{skτ+m=s}|(k1)τ+m]ϵσS\displaystyle=-\epsilon\mathbb{{E}}[\mathbf{1}\{s_{k\tau+m}=s\}|\mathcal{F}_{(k-1)\tau+m}]\leq-\epsilon\sigma_{S}

The step from the second line to the third line uses the fact that:

𝔼[𝟏{st+1=s,st=s}|t1]\displaystyle\mathbb{{E}}[\mathbf{1}\{s_{t+1}=s^{\prime},s_{t}=s\}|\mathcal{F}_{t-1}] =P(s|st1,at1)aπθ(a|s)P(s|s,a)\displaystyle=P(s|s_{t-1},a_{t-1})\sum_{a}\pi_{\theta}(a|s)P(s^{\prime}|s,a)
=P(s|st1,at1)P𝒮θ¯(s|s)\displaystyle=P(s|s_{t-1},a_{t-1})\overline{P_{\mathcal{S}}^{\theta}}(s^{\prime}|s)
=𝔼[P𝒮θ¯(s|s)𝟏{st=s}|t1]\displaystyle=\mathbb{{E}}[\overline{P_{\mathcal{S}}^{\theta}}(s^{\prime}|s)\mathbf{1}\{s_{t}=s\}|\mathcal{F}_{t-1}]

and the inequality in the third line is derived directly from Assumption 2.

According to the Azuma–Hoeffding inequality:

Pr(Am)\displaystyle\Pr(A_{m}) =Pr(k=0T1mτXk,m0)\displaystyle=\Pr\left(\sum_{k=0}^{\lfloor\frac{T-1-m}{\tau}\rfloor}X_{k,m}\geq 0\right)
=Pr(k=0T1mτYk,mk=0T1mτ𝔼[Xk,m|(k1)τ+m])\displaystyle=\Pr\left(\sum_{k=0}^{\lfloor\frac{T-1-m}{\tau}\rfloor}Y_{k,m}\geq-\sum_{k=0}^{\lfloor\frac{T-1-m}{\tau}\rfloor}\mathbb{{E}}[X_{k,m}|\mathcal{F}_{(k-1)\tau+m}]\right)
Pr(k=0T1mτYk,mT1m+ττϵσS)\displaystyle\leq\Pr\left(\sum_{k=0}^{\lfloor\frac{T-1-m}{\tau}\rfloor}Y_{k,m}\geq\left\lfloor\frac{T-1-m+\tau}{\tau}\right\rfloor\epsilon\sigma_{S}\right)
exp(ϵ2σS2Tτ32)\displaystyle\leq\exp\left(-\frac{\epsilon^{2}\sigma_{S}^{2}\left\lfloor\frac{T}{\tau}\right\rfloor}{32}\right)

Similarly, from Azuma-Hoeffding inequality,

Pr(Am)\displaystyle\Pr(A_{m}^{\prime}) =Pr(k=0T1mτXk,m=0)\displaystyle=\Pr\left(\sum_{k=0}^{\lfloor\frac{T-1-m}{\tau}\rfloor}X_{k,m}^{\prime}=0\right)
=Pr(k=0T1mτYk,m=k=0T1mτ𝔼[Xk,m|(k1)τ+m])\displaystyle=\Pr\left(\sum_{k=0}^{\lfloor\frac{T-1-m}{\tau}\rfloor}Y_{k,m}^{\prime}=-\sum_{k=0}^{\lfloor\frac{T-1-m}{\tau}\rfloor}\mathbb{{E}}[X_{k,m}^{\prime}|\mathcal{F}_{(k-1)\tau+m}]\right)
Pr(k=0T1mτYk,mT1m+ττσS)\displaystyle\leq\Pr\left(\sum_{k=0}^{\lfloor\frac{T-1-m}{\tau}\rfloor}Y_{k,m}^{\prime}\leq-\left\lfloor\frac{T-1-m+\tau}{\tau}\right\rfloor\sigma_{S}\right)
exp(σS2Tτ8)\displaystyle\leq\exp\left(-\frac{\sigma_{S}^{2}\left\lfloor\frac{T}{\tau}\right\rfloor}{8}\right)

Thus

Pr(P𝒮θ^(s|s)P𝒮θ¯(s|s)ϵ)m=0τ1Pr(Am)+Pr(Am)\displaystyle\Pr\left(\widehat{P_{\mathcal{S}}^{\theta}}(s^{\prime}|s)-\overline{P_{\mathcal{S}}^{\theta}}(s^{\prime}|s)\geq\epsilon\right)\leq\sum_{m=0}^{\tau-1}\Pr\left(A_{m}\right)+\Pr\left(A_{m}^{\prime}\right)
τexp(ϵ2σS2Tτ32)+τexp(σS2Tτ8)2τexp(ϵ2σS2Tτ32)\displaystyle\leq\tau\exp\left(-\frac{\epsilon^{2}\sigma_{S}^{2}\left\lfloor\frac{T}{\tau}\right\rfloor}{32}\right)+\tau\exp\left(-\frac{\sigma_{S}^{2}\left\lfloor\frac{T}{\tau}\right\rfloor}{8}\right)\leq 2\tau\exp\left(-\frac{\epsilon^{2}\sigma_{S}^{2}\left\lfloor\frac{T}{\tau}\right\rfloor}{32}\right)

Similarly

Pr(P𝒮θ^(s|s)P𝒮θ¯(s|s)ϵ)2τexp(ϵ2σS2Tτ32)\displaystyle\Pr\left(\widehat{P_{\mathcal{S}}^{\theta}}(s^{\prime}|s)-\overline{P_{\mathcal{S}}^{\theta}}(s^{\prime}|s)\leq-\epsilon\right)\leq 2\tau\exp\left(-\frac{\epsilon^{2}\sigma_{S}^{2}\left\lfloor\frac{T}{\tau}\right\rfloor}{32}\right)
Pr(|P𝒮θ^(s|s)P𝒮θ¯(s|s)|ϵ)4τexp(ϵ2σS2Tτ32)\displaystyle\Longrightarrow\Pr\left(\left|\widehat{P_{\mathcal{S}}^{\theta}}(s^{\prime}|s)-\overline{P_{\mathcal{S}}^{\theta}}(s^{\prime}|s)\right|\geq\epsilon\right)\leq 4\tau\exp\left(-\frac{\epsilon^{2}\sigma_{S}^{2}\left\lfloor\frac{T}{\tau}\right\rfloor}{32}\right)

which completes the proof. ∎

Corollary 2.
Pr(P𝒮θ^P𝒮θ¯ϵ)\displaystyle\Pr\left(\left\|\widehat{P_{\mathcal{S}}^{\theta}}-\overline{P_{\mathcal{S}}^{\theta}}\right\|_{\infty}\geq\epsilon\right) 4τ|𝒮|2exp(ϵ2σS2Tτ32|𝒮|2)\displaystyle\leq 4\tau|\mathcal{S}|^{2}\exp\left(-\frac{\epsilon^{2}\sigma_{S}^{2}\lfloor\frac{T}{\tau}\rfloor}{32|\mathcal{S}|^{2}}\right) (46)
Proof.
P𝒮θ^P𝒮θ¯\displaystyle\left\|\widehat{P_{\mathcal{S}}^{\theta}}-\overline{P_{\mathcal{S}}^{\theta}}\right\|_{\infty} =maxss|P𝒮θ^(s|s)P𝒮θ¯(s|s)|\displaystyle=\max_{s}\sum_{s^{\prime}}\left|\widehat{P_{\mathcal{S}}^{\theta}}(s^{\prime}|s)-\overline{P_{\mathcal{S}}^{\theta}}(s^{\prime}|s)\right|

Thus,

{P𝒮θ^P𝒮θ¯ϵ}\displaystyle\left\{\left\|\widehat{P_{\mathcal{S}}^{\theta}}-\overline{P_{\mathcal{S}}^{\theta}}\right\|_{\infty}\geq\epsilon\right\} =s{s|P𝒮θ^(s|s)P𝒮θ¯(s|s)|ϵ}\displaystyle=\mathop{\cup}_{s}\left\{\sum_{s^{\prime}}\left|\widehat{P_{\mathcal{S}}^{\theta}}(s^{\prime}|s)-\overline{P_{\mathcal{S}}^{\theta}}(s^{\prime}|s)\right|\geq\epsilon\right\}
ss{|P𝒮θ^(s|s)P𝒮θ¯(s|s)|ϵ|𝒮|}\displaystyle\subseteq\mathop{\cup}_{s}\mathop{\cup}_{s^{\prime}}\left\{\left|\widehat{P_{\mathcal{S}}^{\theta}}(s^{\prime}|s)-\overline{P_{\mathcal{S}}^{\theta}}(s^{\prime}|s)\right|\geq\frac{\epsilon}{|\mathcal{S}|}\right\}

Then according to Lemma 14,

Pr(P𝒮θ^P𝒮θ¯ϵ)\displaystyle\Pr\left(\left\|\widehat{P_{\mathcal{S}}^{\theta}}-\overline{P_{\mathcal{S}}^{\theta}}\right\|_{\infty}\geq\epsilon\right) (s,s)Pr(|P𝒮θ^(s|s)P𝒮θ¯(s|s)|ϵ|𝒮|)\displaystyle\leq\sum_{(s^{\prime},s)}\Pr\left(\left|\widehat{P_{\mathcal{S}}^{\theta}}(s^{\prime}|s)-\overline{P_{\mathcal{S}}^{\theta}}(s^{\prime}|s)\right|\geq\frac{\epsilon}{|\mathcal{S}|}\right)
4τ|𝒮|2exp(ϵ2σS2Tτ32|𝒮|2).\displaystyle\leq 4\tau|\mathcal{S}|^{2}\exp\left(-\frac{\epsilon^{2}\sigma_{S}^{2}\lfloor\frac{T}{\tau}\rfloor}{32|\mathcal{S}|^{2}}\right).

which completes the proof of the corollary. ∎

Proof.

(Proof of Theorem 8)

dθ^dθ1\displaystyle\|\widehat{d_{\theta}}-d_{\theta}\|_{1} =(1γ)((IγP𝒮θ^)1(IγP𝒮θ¯)1)ρ1\displaystyle=(1-\gamma)\|\left(\left(I-\gamma\widehat{P_{\mathcal{S}}^{\theta}}^{\top}\right)^{-1}-\left(I-\gamma\overline{P_{\mathcal{S}}^{\theta}}^{\top}\right)^{-1}\right)\rho\|_{1}
(1γ)(IγP𝒮θ^)1(IγP𝒮θ¯)11ρ1\displaystyle\leq(1-\gamma)\left\|\left(I-\gamma\widehat{P_{\mathcal{S}}^{\theta}}^{\top}\right)^{-1}-\left(I-\gamma\overline{P_{\mathcal{S}}^{\theta}}^{\top}\right)^{-1}\right\|_{1}\|\rho\|_{1}
(1γ)(IγP𝒮θ^)1(IγP𝒮θ¯)1ρ1\displaystyle\leq(1-\gamma)\left\|\left(I-\gamma\widehat{P_{\mathcal{S}}^{\theta}}\right)^{-1}-\left(I-\gamma\overline{P_{\mathcal{S}}^{\theta}}\right)^{-1}\right\|_{\infty}\|\rho\|_{1}
=γ(1γ)(IγP𝒮θ¯)1(P𝒮θ^P𝒮θ¯)(IγP𝒮θ^)1\displaystyle=\gamma(1-\gamma)\left\|\left(I-\gamma\overline{P_{\mathcal{S}}^{\theta}}\right)^{-1}\left(\widehat{P_{\mathcal{S}}^{\theta}}-\overline{P_{\mathcal{S}}^{\theta}}\right)\left(I-\gamma\widehat{P_{\mathcal{S}}^{\theta}}\right)^{-1}\right\|_{\infty}
γ1γP𝒮θ^P𝒮θ¯\displaystyle\leq\frac{\gamma}{1-\gamma}\left\|\widehat{P_{\mathcal{S}}^{\theta}}-\overline{P_{\mathcal{S}}^{\theta}}\right\|_{\infty}

Thus

Pr(dθ^dθ1ϵ)Pr(P𝒮θ^P𝒮θ¯1γγϵ)\displaystyle\Pr\left(\|\widehat{d_{\theta}}-d_{\theta}\|_{1}\geq\epsilon\right)\leq\Pr\left(\left\|\widehat{P_{\mathcal{S}}^{\theta}}-\overline{P_{\mathcal{S}}^{\theta}}\right\|_{\infty}\geq\frac{1-\gamma}{\gamma}\epsilon\right)
4τ|𝒮|2exp((1γ)2ϵ2σS2Tτ32γ2|𝒮|2),\displaystyle\leq 4\tau|\mathcal{S}|^{2}\exp\left(-\frac{(1-\gamma)^{2}\epsilon^{2}\sigma_{S}^{2}\left\lfloor\frac{T}{\tau}\right\rfloor}{32\gamma^{2}|\mathcal{S}|^{2}}\right),

which completes the proof. ∎
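The closed form dθ=(1γ)(IγP𝒮θ)1ρd_{\theta}=(1-\gamma)(I-\gamma\overline{P_{\mathcal{S}}^{\theta}}^{\top})^{-1}\rho used in this proof (and its plug-in counterpart dθ^\widehat{d_{\theta}}) can likewise be evaluated with a single linear solve; a minimal sketch, assuming P_S is given as the |𝒮|×|𝒮||\mathcal{S}|\times|\mathcal{S}| row-stochastic state-transition matrix and rho as the initial state distribution:

```python
import numpy as np

def discounted_visitation(P_S, rho, gamma):
    """d = (1 - gamma) * (I - gamma * P_S^T)^{-1} rho."""
    return (1.0 - gamma) * np.linalg.solve(np.eye(len(rho)) - gamma * P_S.T, rho)
```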

I.3 Proof of Theorem 6

Proof.

Since the stochastic game satisfies (τ,σS)(\tau,\sigma_{S})-sufficient exploration on states, for any θ𝒳α\theta\in\mathcal{X}^{\alpha} it also satisfies (τ,ασSmaxi|𝒜i|)(\tau,\frac{\alpha\sigma_{S}}{\max_{i}|\mathcal{A}_{i}|})-sufficient exploration. Substituting this into Theorem 7, we have that for

TJ32τ(1+α)2|𝒮|3i|𝒜i|maxi|𝒜i|2(1γ)6ϵg2α2σS2log(16τTG|𝒮|2i|𝒜i|δ)+1,T_{J}\geq\frac{32\tau(1+\alpha)^{2}|\mathcal{S}|^{3}\sum_{i}|\mathcal{A}_{i}|\max_{i}|\mathcal{A}_{i}|^{2}}{(1-\gamma)^{6}\epsilon_{g}^{2}\alpha^{2}\sigma_{S}^{2}}\log\left(\frac{16\tau T_{G}|\mathcal{S}|^{2}\sum_{i}|\mathcal{A}_{i}|}{\delta}\right)+1, (47)

with probability at least 1δ2TG1-\frac{\delta}{2T_{G}},

Qθ(k)¯Qθ(k)^(1γ)ϵg(1+α)|𝒮|i|𝒜i|.\|\overline{Q^{\theta^{(k)}}}-\widehat{Q^{\theta^{(k)}}}\|_{\infty}\leq\frac{(1-\gamma)\epsilon_{g}}{(1+\alpha)\sqrt{|\mathcal{S}|\sum_{i}|\mathcal{A}_{i}|}}.

Similarly, applying Theorem 8, we have that with probability at least 1δ2TG1-\frac{\delta}{2T_{G}},

dθ(k)dθ(k)^1(1γ)2ϵgα(1+α)|𝒮|i|𝒜i|.\|d_{\theta^{(k)}}-\widehat{d_{\theta^{(k)}}}\|_{1}\leq\frac{(1-\gamma)^{2}\epsilon_{g}\alpha}{(1+\alpha)\sqrt{|\mathcal{S}|\sum_{i}|\mathcal{A}_{i}|}}.

Since:

|[Φ(θ)^Φ(θ)](s,ai)|\displaystyle\left|\left[\nabla\Phi(\theta)-\widehat{\nabla}\Phi(\theta)\right]_{(s,a_{i})}\right| =|11γdθ(s)Qiθ¯(s,ai)11γdθ^(s)Qiθ^(s,ai)|\displaystyle=\left|\frac{1}{1-\gamma}d_{\theta}(s)\overline{Q_{i}^{\theta}}(s,a_{i})-\frac{1}{1-\gamma}\widehat{d_{\theta}}(s)\widehat{Q_{i}^{\theta}}(s,a_{i})\right|
|11γdθ(s)(Qiθ¯(s,ai)Qiθ^(s,ai))|+|11γQiθ^(s,ai)(dθ(s)dθ^(s))|\displaystyle\leq\left|\frac{1}{1-\gamma}d_{\theta}(s)(\overline{Q_{i}^{\theta}}(s,a_{i})-\widehat{Q_{i}^{\theta}}(s,a_{i}))\right|+\left|\frac{1}{1-\gamma}\widehat{Q_{i}^{\theta}}(s,a_{i})(d_{\theta}(s)-\widehat{d_{\theta}}(s))\right|
11γ|Qiθ¯(s,ai)Qiθ^(s,ai)|+1(1γ)2|dθ(s)dθ^(s)|\displaystyle\leq\frac{1}{1-\gamma}|\overline{Q_{i}^{\theta}}(s,a_{i})-\widehat{Q_{i}^{\theta}}(s,a_{i})|+\frac{1}{(1-\gamma)^{2}}|d_{\theta}(s)-\widehat{d_{\theta}}(s)|
ϵg(1+α)|𝒮|j|𝒜j|+ϵgα(1+α)|𝒮|j|𝒜j|\displaystyle\leq\frac{\epsilon_{g}}{(1+\alpha)\sqrt{|\mathcal{S}|\sum_{j}|\mathcal{A}_{j}|}}+\frac{\epsilon_{g}\alpha}{(1+\alpha)\sqrt{|\mathcal{S}|\sum_{j}|\mathcal{A}_{j}|}}
=ϵg|𝒮|j|𝒜j|\displaystyle=\frac{\epsilon_{g}}{\sqrt{|\mathcal{S}|\sum_{j}|\mathcal{A}_{j}|}}

Thus, with probability at least 1δ1-\delta,

Φ(θ(k))^Φ(θ(k))22\displaystyle\|\nabla\Phi(\theta^{(k)})-\widehat{\nabla}\Phi(\theta^{(k)})\|_{2}^{2} =isai|[Φ(θ(k))^Φ(θ(k))](s,ai)|2ϵg2,0kTG1,\displaystyle=\sum_{i}\sum_{s}\sum_{a_{i}}\left|\left[\nabla\Phi(\theta^{(k)})-\widehat{\nabla}\Phi(\theta^{(k)})\right]_{(s,a_{i})}\right|^{2}\leq\epsilon_{g}^{2},~{}~{}\forall~{}0\leq k\leq T_{G}-1,

which completes the proof. ∎
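Putting the two estimates together, the coordinate formula [^Φ(θ)](s,ai)=11γdθ^(s)Qiθ^(s,ai)\left[\widehat{\nabla}\Phi(\theta)\right]_{(s,a_{i})}=\frac{1}{1-\gamma}\widehat{d_{\theta}}(s)\widehat{Q_{i}^{\theta}}(s,a_{i}) used above can be assembled as in the following sketch (the agent-by-agent stacking order is an assumption made only for illustration):

```python
import numpy as np

def gradient_estimate(d_hat, Q_hat_per_agent, gamma):
    """Stack (1/(1-gamma)) * d_hat[s] * Q_hat_i[s, a_i] over agents i into one vector."""
    parts = [(d_hat[:, None] * Q_i).ravel() / (1.0 - gamma) for Q_i in Q_hat_per_agent]
    return np.concatenate(parts)
```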

Appendix J Proof of Theorem 5

Notations:

We define the following variables that will be useful in the analysis:

G^η(θ):=1η(Proj𝒳(θ+η^Φ(θ))θ)\widehat{G}^{\eta}(\theta):=\frac{1}{\eta}\left(Proj_{\mathcal{X}}(\theta+\eta\widehat{\nabla}\Phi(\theta))-\theta\right)
G^η,α(θ):=1η(Proj𝒳α(θ+η^Φ(θ))θ).\widehat{G}^{\eta,\alpha}(\theta):=\frac{1}{\eta}\left(Proj_{\mathcal{X}^{\alpha}}(\theta+\eta\widehat{\nabla}\Phi(\theta))-\theta\right).
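Since the projection onto 𝒳α\mathcal{X}^{\alpha} decomposes over agents and states, G^η,α(θ)\widehat{G}^{\eta,\alpha}(\theta) can be computed blockwise from a standard simplex projection, assuming (as suggested by the construction in the proof of Lemma 16 below) that Δα\Delta^{\alpha} coincides with the rescaled simplex (1α)Δ+αU(1-\alpha)\Delta+\alpha U. The sketch below further assumes θ\theta and ^Φ(θ)\widehat{\nabla}\Phi(\theta) are stored as lists of per-(agent, state) vectors; this data layout is an illustrative assumption.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection onto the probability simplex (sort-based algorithm)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > (css - 1.0))[0][-1]
    tau = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(v - tau, 0.0)

def project_simplex_alpha(v, alpha):
    """Projection onto Delta^alpha = (1-alpha)*Delta + alpha*Uniform (alpha in [0,1))."""
    u = np.full(len(v), 1.0 / len(v))
    return (1.0 - alpha) * project_simplex((v - alpha * u) / (1.0 - alpha)) + alpha * u

def gradient_mapping_alpha(theta_blocks, grad_blocks, eta, alpha):
    """Blockwise hat{G}^{eta,alpha}(theta) = (Proj_{X^alpha}(theta + eta*grad) - theta)/eta."""
    return [(project_simplex_alpha(th + eta * g, alpha) - th) / eta
            for th, g in zip(theta_blocks, grad_blocks)]
```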

J.1 Optimization Lemmas

Lemma 15.

(Sufficient ascent) Suppose Φ(θ)\Phi(\theta) is β\beta-smooth. Let θ+=Proj𝒳α(θ+η^Φ(θ))\theta^{+}=Proj_{\mathcal{X}^{\alpha}}(\theta+\eta\widehat{\nabla}\Phi(\theta)). Then for η12β\eta\leq\frac{1}{2\beta},

Φ(θ+)Φ(θ)η4G^η,α(θ)2η2Φ(θ)^Φ(θ)2\begin{split}\Phi(\theta^{+})-\Phi(\theta)\geq\frac{\eta}{4}\|\widehat{G}^{\eta,\alpha}(\theta)\|^{2}-\frac{\eta}{2}\left\|\nabla\Phi(\theta)-\widehat{\nabla}\Phi(\theta)\right\|^{2}\end{split}
Proof.

From the smoothness property we have that:

Φ(θ+)Φ(θ)θΦ(θ)(θ+θ)β2θ+θ2\Phi(\theta^{+})-\Phi(\theta)\geq\nabla_{\theta}\Phi(\theta)^{\top}(\theta^{+}-\theta)-\frac{\beta}{2}\|\theta^{+}-\theta\|^{2}

Since θ+=Proj𝒳α(θ+η^Φ(θ))\theta^{+}=Proj_{\mathcal{X}^{\alpha}}(\theta+\eta\widehat{\nabla}\Phi(\theta)), we have that:

(θ+η^Φ(θ)θ+)(θθ+)0,θ𝒳α(\theta+\eta\widehat{\nabla}\Phi(\theta)-\theta^{+})^{\top}(\theta^{\prime}-\theta^{+})\leq 0,~{}~{}\forall~{}\theta^{\prime}\in\mathcal{X}^{\alpha}

Taking θ=θ\theta^{\prime}=\theta, we get:

^Φ(θ)(θ+θ)1ηθ+θ2.\widehat{\nabla}\Phi(\theta)^{\top}(\theta^{+}-\theta)\geq\frac{1}{\eta}\|\theta^{+}-\theta\|^{2}.

Thus:

Φ(θ)(θ+θ)=(Φ(θ)^Φ(θ))(θ+θ)+^Φ(θ)(θ+θ)\displaystyle\nabla\Phi(\theta)^{\top}(\theta^{+}-\theta)=\left(\nabla\Phi(\theta)-\widehat{\nabla}\Phi(\theta)\right)^{\top}(\theta^{+}-\theta)+\widehat{\nabla}\Phi(\theta)^{\top}(\theta^{+}-\theta)
η2Φ(θ)^Φ(θ)212ηθ+θ2+^Φ(θ)(θ+θ)\displaystyle\geq-\frac{\eta}{2}\left\|\nabla\Phi(\theta)-\widehat{\nabla}\Phi(\theta)\right\|^{2}-\frac{1}{2\eta}\left\|\theta^{+}-\theta\right\|^{2}+\widehat{\nabla}\Phi(\theta)^{\top}(\theta^{+}-\theta)
η2Φ(θ)^Φ(θ)212ηθ+θ2+1ηθ+θ2\displaystyle\geq-\frac{\eta}{2}\left\|\nabla\Phi(\theta)-\widehat{\nabla}\Phi(\theta)\right\|^{2}-\frac{1}{2\eta}\left\|\theta^{+}-\theta\right\|^{2}+\frac{1}{\eta}\|\theta^{+}-\theta\|^{2}
=12ηθ+θ2η2Φ(θ)^Φ(θ)2\displaystyle=\frac{1}{2\eta}\|\theta^{+}-\theta\|^{2}-\frac{\eta}{2}\left\|\nabla\Phi(\theta)-\widehat{\nabla}\Phi(\theta)\right\|^{2}

Thus from (36):

Φ(θ+)Φ(θ)\displaystyle\Phi(\theta^{+})-\Phi(\theta) (12ηβ2)θ+θ2η2Φ(θ)^Φ(θ)2\displaystyle\geq\left(\frac{1}{2\eta}-\frac{\beta}{2}\right)\|\theta^{+}-\theta\|^{2}-\frac{\eta}{2}\left\|\nabla\Phi(\theta)-\widehat{\nabla}\Phi(\theta)\right\|^{2}
14ηθ+θ2η2Φ(θ)^Φ(θ)2\displaystyle\geq\frac{1}{4\eta}\|\theta^{+}-\theta\|^{2}-\frac{\eta}{2}\left\|\nabla\Phi(\theta)-\widehat{\nabla}\Phi(\theta)\right\|^{2}
=η4G^η,α(θ)2η2Φ(θ)^Φ(θ)2\displaystyle=\frac{\eta}{4}\|\widehat{G}^{\eta,\alpha}(\theta)\|^{2}-\frac{\eta}{2}\left\|\nabla\Phi(\theta)-\widehat{\nabla}\Phi(\theta)\right\|^{2}

which completes the proof. ∎

Lemma 15 immediately results in the following corollary:

Corollary 3.

(of Lemma 15) In Algorithm 1, suppose ^Φ(θ(k))Φ(θ(k))ϵg\|\widehat{\nabla}\Phi(\theta^{(k)})-\nabla\Phi(\theta^{(k)})\|\leq\epsilon_{g} holds for every 0kTG10\leq k\leq T_{G}-1. Then running Algorithm 1 guarantees that:

1TGk=0TG1G^η,α(θ(k))24(ΦmaxΦmin)ηTG+2ϵg2\frac{1}{T_{G}}\sum_{k=0}^{T_{G}-1}\|\widehat{G}^{\eta,\alpha}(\theta^{(k)})\|^{2}\leq\frac{4(\Phi_{\max}-\Phi_{\min})}{\eta T_{G}}+2\epsilon_{g}^{2}
Proof.

From Lemma 15 we have that:

Φ(θ(k+1))Φ(θ(k))\displaystyle\Phi(\theta^{(k+1)})-\Phi(\theta^{(k)}) η4G^η,α(θ(k))2η2Φ(θ(k))^Φ(θ(k))2\displaystyle\geq\frac{\eta}{4}\|\widehat{G}^{\eta,\alpha}(\theta^{(k)})\|^{2}-\frac{\eta}{2}\left\|\nabla\Phi(\theta^{(k)})-\widehat{\nabla}\Phi(\theta^{(k)})\right\|^{2}
η4G^η,α(θ(k))2η2ϵg2.\displaystyle\geq\frac{\eta}{4}\|\widehat{G}^{\eta,\alpha}(\theta^{(k)})\|^{2}-\frac{\eta}{2}\epsilon_{g}^{2}.

Thus

1TGk=0TG1G^η,α(θ(k))2\displaystyle\frac{1}{T_{G}}\sum_{k=0}^{T_{G}-1}\|\widehat{G}^{\eta,\alpha}(\theta^{(k)})\|^{2} 4(Φ(θ(0))Φ(θ(TG)))ηTG+2ϵg2\displaystyle\leq\frac{4(\Phi(\theta^{(0)})-\Phi(\theta^{(T_{G})}))}{\eta T_{G}}+2\epsilon_{g}^{2}
4(ΦmaxΦmin)ηTG+2ϵg2\displaystyle\leq\frac{4(\Phi_{\max}-\Phi_{\min})}{\eta T_{G}}+2\epsilon_{g}^{2}

Lemma 16.

(First-order stationarity and G^η,α(θ)\|\widehat{G}^{\eta,\alpha}(\theta)\|) Suppose Φ(θ)\Phi(\theta) is β\beta-smooth. Let θ+=Proj𝒳α(θ+η^Φ(θ))\theta^{+}=Proj_{\mathcal{X}^{\alpha}}(\theta+\eta\widehat{\nabla}\Phi(\theta)). Then:

θΦ(θ+)(θθ+)[(1+ηβ)G^η,α(θ)+^Φ(θ)Φ(θ)]θθ+,θ𝒳α.\nabla_{\theta}\Phi(\theta^{+})^{\top}(\theta^{\prime}-\theta^{+})\leq\left[(1+\eta\beta)\|\widehat{G}^{\eta,\alpha}(\theta)\|+\|\widehat{\nabla}\Phi(\theta)-\nabla\Phi(\theta)\|\right]\|\theta^{\prime}-\theta^{+}\|,\quad\forall\theta^{\prime}\in\mathcal{X}^{\alpha}. (48)

Further:

maxθ¯i𝒳iθiΦ(θ+)(θ¯iθi+)2|𝒮|[(1+ηβ)G^η,α(θ)+^Φ(θ)Φ(θ)]+2α1γ\max_{\overline{\theta}_{i}\in\mathcal{X}_{i}}\nabla_{\theta_{i}}\Phi(\theta^{+})^{\top}(\overline{\theta}_{i}-\theta_{i}^{+})\leq 2\sqrt{|\mathcal{S}|}\left[(1+\eta\beta)\|\widehat{G}^{\eta,\alpha}(\theta)\|+\|\widehat{\nabla}\Phi(\theta)-\nabla\Phi(\theta)\|\right]+\frac{2\alpha}{1-\gamma} (49)
Proof.

Since θ+=Proj𝒳α(θ+η^Φ(θ))\theta^{+}=Proj_{\mathcal{X}^{\alpha}}(\theta+\eta\widehat{\nabla}\Phi(\theta)), we have:

(θ+η^Φ(θ)θ+)(θθ+)\displaystyle(\theta+\eta\widehat{\nabla}\Phi(\theta)-\theta^{+})^{\top}(\theta^{\prime}-\theta^{+}) 0θ𝒳α\displaystyle\leq 0~{}~{}\forall~{}\theta^{\prime}\in\mathcal{X}^{\alpha}
η^Φ(θ)(θθ+)\displaystyle\Longrightarrow~{}\eta\widehat{\nabla}\Phi(\theta)^{\top}(\theta^{\prime}-\theta^{+}) (θθ+)(θθ+)\displaystyle\leq(\theta-\theta^{+})^{\top}(\theta^{\prime}-\theta^{+})
ηΦ(θ)(θθ+)\displaystyle\Longrightarrow~{}\eta\nabla\Phi(\theta)^{\top}(\theta^{\prime}-\theta^{+}) (θθ+)(θθ+)+η(Φ(θ)^Φ(θ))(θθ+)\displaystyle\leq(\theta-\theta^{+})^{\top}(\theta^{\prime}-\theta^{+})+\eta(\nabla\Phi(\theta)-\widehat{\nabla}\Phi(\theta))^{\top}(\theta^{\prime}-\theta^{+})
ηΦ(θ+)(θθ+)\displaystyle\Longrightarrow~{}\eta\nabla\Phi(\theta^{+})^{\top}(\theta^{\prime}-\theta^{+}) (θθ+)(θθ+)+η(Φ(θ)^Φ(θ))(θθ+)\displaystyle\leq(\theta-\theta^{+})^{\top}(\theta^{\prime}-\theta^{+})+\eta(\nabla\Phi(\theta)-\widehat{\nabla}\Phi(\theta))^{\top}(\theta^{\prime}-\theta^{+})
+η(Φ(θ+)Φ(θ))(θθ+)\displaystyle\quad+\eta(\nabla\Phi(\theta^{+})-\nabla\Phi(\theta))^{\top}(\theta^{\prime}-\theta^{+})
ηΦ(θ+)(θθ+)\displaystyle\Longrightarrow~{}\eta\nabla\Phi(\theta^{+})^{\top}(\theta^{\prime}-\theta^{+}) (θθ++ηΦ(θ)^Φ(θ)+ηΦ(θ+)Φ(θ))θθ+\displaystyle\leq(\|\theta-\theta^{+}\|+\eta\|\nabla\Phi(\theta)-\widehat{\nabla}\Phi(\theta)\|+\eta\|\nabla\Phi(\theta^{+})-\nabla\Phi(\theta)\|)\|\theta^{\prime}-\theta^{+}\|
(θθ++ηΦ(θ)^Φ(θ)+ηβθ+θ)θθ+\displaystyle\leq(\|\theta-\theta^{+}\|+\eta\|\nabla\Phi(\theta)-\widehat{\nabla}\Phi(\theta)\|+\eta\beta\|\theta^{+}-\theta\|)\|\theta^{\prime}-\theta^{+}\|
=[(1+ηβ)θθ++ηΦ(θ)^Φ(θ)]θθ+\displaystyle=\left[(1+\eta\beta)\|\theta-\theta^{+}\|+\eta\|\nabla\Phi(\theta)-\widehat{\nabla}\Phi(\theta)\|\right]\|\theta^{\prime}-\theta^{+}\|
Φ(θ+)(θθ+)\displaystyle\Longrightarrow~{}\nabla\Phi(\theta^{+})^{\top}(\theta^{\prime}-\theta^{+}) [(1+ηβ)G^η,α(θ)+Φ(θ)^Φ(θ)]θθ+,\displaystyle\leq\left[(1+\eta\beta)\|\widehat{G}^{\eta,\alpha}(\theta)\|+\|\nabla\Phi(\theta)-\widehat{\nabla}\Phi(\theta)\|\right]\|\theta^{\prime}-\theta^{+}\|,

which proves (48). We now prove (49). For any \theta_{i,s}^{\prime}\in\Delta(|\mathcal{A}_{i}|), we know that (1-\alpha)\theta_{i,s}^{\prime}+\alpha U_{|\mathcal{A}_{i}|}\in\Delta^{\alpha}(|\mathcal{A}_{i}|). Let U_{i}:=[\underbrace{U_{|\mathcal{A}_{i}|},\dots,U_{|\mathcal{A}_{i}|}}_{|\mathcal{S}|\text{ times}}]; then for any \theta_{i}^{\prime}\in\mathcal{X}_{i}, (1-\alpha)\theta_{i}^{\prime}+\alpha U_{i}\in\mathcal{X}_{i}^{\alpha}.

Thus:

θiΦ(θ+)(θiθi+)\displaystyle\nabla_{\theta_{i}}\Phi(\theta^{+})^{\top}(\theta_{i}^{\prime}-\theta_{i}^{+}) θiΦ(θ+)((1α)θi+αUiθi+)+θiΦ(θ+)(θi(1α)θiαUi)\displaystyle\leq\nabla_{\theta_{i}}\Phi(\theta^{+})^{\top}((1-\alpha)\theta_{i}^{\prime}+\alpha U_{i}-\theta_{i}^{+})+\nabla_{\theta_{i}}\Phi(\theta^{+})^{\top}(\theta_{i}^{\prime}-(1-\alpha)\theta_{i}^{\prime}-\alpha U_{i})
[(1+ηβ)G^η,α(θ)+^Φ(θ)Φ(θ)](1α)θi+αUiθi+\displaystyle\leq\left[(1+\eta\beta)\|\widehat{G}^{\eta,\alpha}(\theta)\|+\|\widehat{\nabla}\Phi(\theta)-\nabla\Phi(\theta)\|\right]\|(1-\alpha)\theta_{i}^{\prime}+\alpha U_{i}-\theta_{i}^{+}\|
+θiΦ(θ+)(θi(1α)θiαUi)\displaystyle\qquad+\nabla_{\theta_{i}}\Phi(\theta^{+})^{\top}(\theta_{i}^{\prime}-(1-\alpha)\theta_{i}^{\prime}-\alpha U_{i})
2|𝒮|[(1+ηβ)G^η,α(θ)+^Φ(θ)Φ(θ)]+αθiΦ(θ+)(θiUi)\displaystyle\leq 2\sqrt{|\mathcal{S}|}\left[(1+\eta\beta)\|\widehat{G}^{\eta,\alpha}(\theta)\|+\|\widehat{\nabla}\Phi(\theta)-\nabla\Phi(\theta)\|\right]+\alpha\nabla_{\theta_{i}}\Phi(\theta^{+})^{\top}(\theta_{i}^{\prime}-U_{i})

Since

θiΦ(θ+)(θiUi)\displaystyle\nabla_{\theta_{i}}\Phi(\theta^{+})^{\top}(\theta_{i}^{\prime}-U_{i}) =sdθ(s)Qi,sθ¯(θi,sU|𝒜i|)\displaystyle=\sum_{s}d_{\theta}(s)\overline{Q_{i,s}^{\theta}}^{\top}(\theta_{i,s}^{\prime}-U_{|\mathcal{A}_{i}|})
sdθ(s)Qi,sθ¯θi,sU|𝒜i|1\displaystyle\leq\sum_{s}d_{\theta}(s)\|\overline{Q_{i,s}^{\theta}}\|_{\infty}\|\theta_{i,s}^{\prime}-U_{|\mathcal{A}_{i}|}\|_{1}
sdθ(s)21γ21γ,\displaystyle\leq\sum_{s}d_{\theta}(s)\frac{2}{1-\gamma}\leq\frac{2}{1-\gamma},

we have that:

\displaystyle\nabla_{\theta_{i}}\Phi(\theta^{+})^{\top}(\theta_{i}^{\prime}-\theta_{i}^{+}) \displaystyle\leq 2\sqrt{|\mathcal{S}|}\left[(1+\eta\beta)\|\widehat{G}^{\eta,\alpha}(\theta)\|+\|\widehat{\nabla}\Phi(\theta)-\nabla\Phi(\theta)\|\right]+\frac{2\alpha}{1-\gamma}

Taking the maximum over \theta_{i}^{\prime}\in\mathcal{X}_{i} yields (49), which completes the proof. ∎

J.2 Proof of Theorem 5

Proof.

Recall that Φ\Phi is β\beta-smooth with β=2(1γ)3(i=1n|𝒜i|)\beta=\frac{2}{(1-\gamma)^{3}}\left(\sum_{i=1}^{n}|\mathcal{A}_{i}|\right). The step size η\eta in Theorem 5 satisfies η(1γ)34i=1n|𝒜i|=12β\eta\leq\frac{(1-\gamma)^{3}}{4\sum_{i=1}^{n}|\mathcal{A}_{i}|}=\frac{1}{2\beta}.

Recall the gradient domination property:

NE-gapi(θ(k+1))\displaystyle\textup{{NE-gap}}_{i}(\theta^{(k+1)}) =maxθi𝒳iJi(θi,θi(k+1))Ji(θi(k+1),θi(k+1))\displaystyle=\max_{\theta_{i}^{\prime}\in\mathcal{X}_{i}}J_{i}(\theta_{i}^{\prime},\theta_{-i}^{(k+1)})-J_{i}(\theta_{i}^{(k+1)},\theta_{-i}^{(k+1)})
Mmaxθi𝒳i(θiθi(k+1))θiΦ(θ(k+1))\displaystyle\leq M\max_{\theta_{i}^{\prime}\in\mathcal{X}_{i}}(\theta_{i}^{\prime}-\theta_{i}^{(k+1)})^{\top}\nabla_{\theta_{i}}\Phi(\theta^{(k+1)})

Suppose \|\widehat{\nabla}\Phi(\theta^{(k)})-\nabla\Phi(\theta^{(k)})\|_{\infty}\leq\epsilon_{g} for all 0\leq k\leq T_{G}-1. Then, applying Lemma 16 with \theta=\theta^{(k)} (so that \theta^{+}=\theta^{(k+1)}),

NE-gap(θ(k+1))\displaystyle\textup{{NE-gap}}(\theta^{(k+1)}) maxiNE-gapi(θ(k+1))Mmaximaxθi𝒳i(θiθi(k+1))θiΦ(θ(k+1))\displaystyle\leq\max_{i}\textup{{NE-gap}}_{i}(\theta^{(k+1)})\leq M\max_{i}\max_{\theta_{i}^{\prime}\in\mathcal{X}_{i}}(\theta_{i}^{\prime}-\theta_{i}^{(k+1)})^{\top}\nabla_{\theta_{i}}\Phi(\theta^{(k+1)})
2M|𝒮|[(1+ηβ)G^η,α(θ(k))+ϵg]+2αM1γ\displaystyle\leq 2M\sqrt{|\mathcal{S}|}\left[(1+\eta\beta)\|\widehat{G}^{\eta,\alpha}(\theta^{(k)})\|+\epsilon_{g}\right]+\frac{2\alpha M}{1-\gamma}

Thus,

1TGk=0TG1NE-gap(θ(k+1))2\displaystyle\frac{1}{T_{G}}\sum_{k=0}^{T_{G}-1}\textup{{NE-gap}}(\theta^{(k+1)})^{2} 1TGk=0TG13×[4M2|𝒮|(1+ηβ)2G^η,α(θ(k))2+4M2|𝒮|ϵg2+4α2M2(1γ)2]\displaystyle\leq\frac{1}{T_{G}}\sum_{k=0}^{T_{G}-1}3\times\left[4M^{2}|\mathcal{S}|(1+\eta\beta)^{2}\|\widehat{G}^{\eta,\alpha}(\theta^{(k)})\|^{2}+4M^{2}|\mathcal{S}|\epsilon_{g}^{2}+\frac{4\alpha^{2}M^{2}}{(1-\gamma)^{2}}\right]
=12M2|𝒮|ϵg2+12α2M2(1γ)2+12M2|𝒮|(1+ηβ)2(1TGk=0TG1G^η,α(θ(k))2)\displaystyle=12M^{2}|\mathcal{S}|\epsilon_{g}^{2}+\frac{12\alpha^{2}M^{2}}{(1-\gamma)^{2}}+12M^{2}|\mathcal{S}|(1+\eta\beta)^{2}\left(\frac{1}{T_{G}}\sum_{k=0}^{T_{G}-1}\|\widehat{G}^{\eta,\alpha}(\theta^{(k)})\|^{2}\right)

From Corollary 3, we have that

1TGk=0TG1NE-gap(θ(k+1))2\displaystyle\frac{1}{T_{G}}\sum_{k=0}^{T_{G}-1}\textup{{NE-gap}}(\theta^{(k+1)})^{2} 12M2|𝒮|ϵg2+12α2M2(1γ)2+12M2|𝒮|(1+ηβ)2(4(ΦmaxΦmin)ηTG+2ϵg2)\displaystyle\leq 12M^{2}|\mathcal{S}|\epsilon_{g}^{2}+\frac{12\alpha^{2}M^{2}}{(1-\gamma)^{2}}+12M^{2}|\mathcal{S}|(1+\eta\beta)^{2}\left(\frac{4(\Phi_{\max}-\Phi_{\min})}{\eta T_{G}}+2\epsilon_{g}^{2}\right)
66M2|𝒮|ϵg2+12α2M2(1γ)2+108M2|𝒮|(ΦmaxΦmin)ηTG\displaystyle\leq 66M^{2}|\mathcal{S}|\epsilon_{g}^{2}+\frac{12\alpha^{2}M^{2}}{(1-\gamma)^{2}}+\frac{108M^{2}|\mathcal{S}|(\Phi_{\max}-\Phi_{\min})}{\eta T_{G}} (50)

Substituting

α=(1γ)ϵ6M,ϵg=ϵ233M|𝒮| and TG648M2(ΦmaxΦmin)|𝒮|ηϵ2\displaystyle\alpha=\frac{(1-\gamma)\epsilon}{6M},~{}~{}\epsilon_{g}=\frac{\epsilon}{2\sqrt{33}M\sqrt{|\mathcal{S}|}}\textup{ and }T_{G}\geq\frac{648M^{2}(\Phi_{\max}-\Phi_{\min})|\mathcal{S}|}{\eta\epsilon^{2}}

into the above inequality, we obtain:

1TGk=0TG1NE-gap(θ(k+1))2ϵ22+ϵ23+ϵ26=ϵ2\frac{1}{T_{G}}\sum_{k=0}^{T_{G}-1}\textup{{NE-gap}}(\theta^{(k+1)})^{2}\leq\frac{\epsilon^{2}}{2}+\frac{\epsilon^{2}}{3}+\frac{\epsilon^{2}}{6}=\epsilon^{2}
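To verify the arithmetic, note that under these choices the three terms in (50) evaluate as

66M^{2}|\mathcal{S}|\epsilon_{g}^{2}=\frac{66M^{2}|\mathcal{S}|\,\epsilon^{2}}{132M^{2}|\mathcal{S}|}=\frac{\epsilon^{2}}{2},\qquad\frac{12\alpha^{2}M^{2}}{(1-\gamma)^{2}}=\frac{12M^{2}}{(1-\gamma)^{2}}\cdot\frac{(1-\gamma)^{2}\epsilon^{2}}{36M^{2}}=\frac{\epsilon^{2}}{3},\qquad\frac{108M^{2}|\mathcal{S}|(\Phi_{\max}-\Phi_{\min})}{\eta T_{G}}\leq\frac{108}{648}\,\epsilon^{2}=\frac{\epsilon^{2}}{6}.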

Substituting the above values of \alpha and \epsilon_{g} into Theorem 6 gives:

TJ206976τnM4|𝒮|3maxi|𝒜i|3(1γ)8ϵ4σS2log(16τTG|𝒮|2i|𝒜i|δ)+1T_{J}\geq\frac{206976\tau nM^{4}|\mathcal{S}|^{3}\max_{i}|\mathcal{A}_{i}|^{3}}{(1-\gamma)^{8}\epsilon^{4}\sigma_{S}^{2}}\log\left(\frac{16\tau T_{G}|\mathcal{S}|^{2}\sum_{i}|\mathcal{A}_{i}|}{\delta}\right)+1

which completes the proof. ∎

Appendix K Smoothness

Lemma 17.

(Smoothness for Direct Distributed Parameterization) Assume that 0\leq r_{i}(s,a)\leq 1 for all s,a and i=1,2,\dots,n. Then:

g(θ)g(θ)2(1γ)3(i=1n|𝒜i|)θθ,\|g(\theta^{\prime})-g(\theta)\|\leq\frac{2}{(1-\gamma)^{3}}\left(\sum_{i=1}^{n}|\mathcal{A}_{i}|\right)\|\theta^{\prime}-\theta\|, (51)

where g(\theta):=\left(\nabla_{\theta_{1}}J_{1}(\theta),\dots,\nabla_{\theta_{n}}J_{n}(\theta)\right) denotes the stacked partial gradients \{\nabla_{\theta_{i}}J_{i}(\theta)\}_{i=1}^{n}.

The proof of Lemma 17 depends on the following lemma:

Lemma 18.
θiJi(θ)θiJi(θ)2(1γ)3|𝒜i|j=1n|𝒜j|θjθj\|\nabla_{\theta_{i}}J_{i}(\theta^{\prime})-\nabla_{\theta_{i}}J_{i}(\theta)\|\leq\frac{2}{(1-\gamma)^{3}}\sqrt{|\mathcal{A}_{i}|}\sum_{j=1}^{n}\sqrt{|\mathcal{A}_{j}|}\|\theta_{j}^{\prime}-\theta_{j}\| (52)

Lemma 17 is a simple corollary of Lemma 18.

Proof.

(Proof of Lemma 17)

g(θ)g(θ)2\displaystyle\|g(\theta^{\prime})-g(\theta)\|^{2} =i=1nθiJi(θ)θiJi(θ)2\displaystyle=\sum_{i=1}^{n}\|\nabla_{\theta_{i}}J_{i}(\theta^{\prime})-\nabla_{\theta_{i}}J_{i}(\theta)\|^{2}
(2(1γ)3)2i|𝒜i|(j=1n|𝒜j|θjθj)2\displaystyle\leq\left(\frac{2}{(1-\gamma)^{3}}\right)^{2}\sum_{i}|\mathcal{A}_{i}|\left(\sum_{j=1}^{n}\sqrt{|\mathcal{A}_{j}|}\|\theta_{j}^{\prime}-\theta_{j}\|\right)^{2}
(2(1γ)3)2i|𝒜i|(j=1n|𝒜j|)(j=1nθjθj2)\displaystyle\leq\left(\frac{2}{(1-\gamma)^{3}}\right)^{2}\sum_{i}|\mathcal{A}_{i}|\left(\sum_{j=1}^{n}|\mathcal{A}_{j}|\right)\left(\sum_{j=1}^{n}\|\theta_{j}^{\prime}-\theta_{j}\|^{2}\right)
=(2(1γ)3)2(i=1n|𝒜i|)2θθ2,\displaystyle=\left(\frac{2}{(1-\gamma)^{3}}\right)^{2}\left(\sum_{i=1}^{n}|\mathcal{A}_{i}|\right)^{2}\|\theta^{\prime}-\theta\|^{2},

which completes the proof. ∎

Lemma 18 is equivalent to the following lemma:

Lemma 19.
|Ji(θi+αui,θi)Ji(θi+αui,θi)α|α=0|2(1γ)3|𝒜i|j=1n|𝒜j|θjθj,u=1\left|\frac{\partial J_{i}(\theta_{i}^{\prime}+\alpha u_{i},\theta_{-i}^{\prime})-\partial J_{i}(\theta_{i}+\alpha u_{i},\theta_{-i})}{\partial\alpha}\Big{|}_{\alpha=0}\right|\leq\frac{2}{(1-\gamma)^{3}}\sqrt{|\mathcal{A}_{i}|}\sum_{j=1}^{n}\sqrt{|\mathcal{A}_{j}|}\|\theta_{j}^{\prime}-\theta_{j}\|,\quad\forall\|u\|=1 (53)
Proof.

(Lemma 19) Define:

πi,α(ai|s)\displaystyle\pi_{i,\alpha}(a_{i}|s) :=πθi+αui(ai|s)=θs,ai+αuai,s\displaystyle:=\pi_{\theta_{i}+\alpha u_{i}}(a_{i}|s)=\theta_{s,a_{i}}+\alpha u_{a_{i},s}
\displaystyle\pi_{i,\alpha}^{\prime}(a_{i}|s) \displaystyle:=\pi_{\theta_{i}^{\prime}+\alpha u_{i}}(a_{i}|s)=\theta_{s,a_{i}}^{\prime}+\alpha u_{a_{i},s}
πα(a|s)\displaystyle\pi_{\alpha}(a|s) :=πθi+αui(ai|s)πθi(ai|s)\displaystyle:=\pi_{\theta_{i}+\alpha u_{i}}(a_{i}|s)\pi_{\theta_{-i}}(a_{-i}|s)
\displaystyle\pi_{\alpha}^{\prime}(a|s) \displaystyle:=\pi_{\theta_{i}^{\prime}+\alpha u_{i}}(a_{i}|s)\pi_{\theta_{-i}^{\prime}}(a_{-i}|s)
Qiα(s,a)\displaystyle Q_{i}^{\alpha}(s,a) :=Q(θi+αui,θi)(s,a)\displaystyle:=Q_{(\theta_{i}+\alpha u_{i},\theta_{-i})}(s,a)
\displaystyle d_{\alpha}^{\prime}(s) \displaystyle:=d_{(\theta_{i}^{\prime}+\alpha u_{i},\theta_{-i}^{\prime})}(s)

According to the cost difference lemma,

|Ji(θi+αui,θi)Ji(θi+αui,θi)α|α=0|\displaystyle\quad\left|\frac{\partial J_{i}(\theta_{i}^{\prime}+\alpha u_{i},\theta_{-i}^{\prime})-\partial J_{i}(\theta_{i}+\alpha u_{i},\theta_{-i})}{\partial\alpha}\Big{|}_{\alpha=0}\right|
=11γ|s,adα(s)πα(a|s)Aiα(s,a)α|α=0|\displaystyle=\frac{1}{1-\gamma}\left|\frac{\partial\sum_{s,a}d_{\alpha^{\prime}}(s)\pi^{\prime}_{\alpha}(a|s)A_{i}^{\alpha}(s,a)}{\partial\alpha}\Big{|}_{\alpha=0}\right|
=11γ|s,adα(s)(πα(a|s)πα(a|s))Qiα(s,a)α|α=0|\displaystyle=\frac{1}{1-\gamma}\left|\frac{\partial\sum_{s,a}d_{\alpha^{\prime}}(s)\left(\pi^{\prime}_{\alpha}(a|s)-\pi_{\alpha}(a|s)\right)Q_{i}^{\alpha}(s,a)}{\partial\alpha}\Big{|}_{\alpha=0}\right|
11γ(|s,adθ(s)πα(a|s)πα(a|s)α|α=0Qiθ(s,a)|Part A\displaystyle\leq\frac{1}{1-\gamma}\left(\underbrace{\left|\sum_{s,a}d_{\theta}^{\prime}(s)\frac{\partial\pi^{\prime}_{\alpha}(a|s)-\partial\pi_{\alpha}(a|s)}{\partial\alpha}\Big{|}_{\alpha=0}Q_{i}^{\theta}(s,a)\right|}_{\text{Part A}}\right.
+|s,adθ(s)(πθ(a|s)πθ(a|s))Qiα(s,a)α|α=0|Part B\displaystyle+\underbrace{\left|\sum_{s,a}d_{\theta}^{\prime}(s)(\pi^{\prime}_{\theta}(a|s)-\pi_{\theta}(a|s))\frac{\partial Q_{i}^{\alpha}(s,a)}{\partial\alpha}\Big{|}_{\alpha=0}\right|}_{\text{Part B}}
+|s,adα(s)α|α=0(πθ(a|s)πθ(a|s))Qiθ(s,a)|Part C)\displaystyle+\left.\underbrace{\left|\sum_{s,a}\frac{\partial d_{\alpha}^{\prime}(s)}{\partial\alpha}\Big{|}_{\alpha=0}(\pi^{\prime}_{\theta}(a|s)-\pi_{\theta}(a|s))Q_{i}^{\theta}(s,a)\right|}_{\text{Part C}}\right)

Thus:

Part A =|s,adθ(s)πα(a|s)πα(a|s)α|α=0Qiθ(s,a)|\displaystyle=\left|\sum_{s,a}d_{\theta}^{\prime}(s)\frac{\partial\pi^{\prime}_{\alpha}(a|s)-\partial\pi_{\alpha}(a|s)}{\partial\alpha}\Big{|}_{\alpha=0}Q_{i}^{\theta}(s,a)\right|
=|s,adθ(s)uai,s(πθi(ai|s)πθi(ai|s))Qiθ(s,a)|\displaystyle=\left|\sum_{s,a}d_{\theta}^{\prime}(s)u_{a_{i},s}(\pi_{\theta_{-i}^{\prime}}(a_{-i}|s)-\pi_{\theta_{-i}}(a_{-i}|s))Q_{i}^{\theta}(s,a)\right| (54)
11γ|sdθ(s)ai|uai,s|ai|πθi(ai|s)πθi(ai|s)||\displaystyle\leq\frac{1}{1-\gamma}\left|\sum_{s}d_{\theta}^{\prime}(s)\sum_{a_{i}}|u_{a_{i},s}|\sum_{a_{-i}}\left|\pi_{\theta_{-i}^{\prime}}(a_{-i}|s)-\pi_{\theta_{-i}}(a_{-i}|s)\right|\right| (55)
11γ(maxsai|uai,s|)sdθ(s)2dTV(πθi(|s)||πθi(|s))\displaystyle\leq\frac{1}{1-\gamma}\left(\max_{s}\sum_{a_{i}}|u_{a_{i},s}|\right)\sum_{s}d_{\theta}^{\prime}(s)2d_{\text{TV}}(\pi_{\theta_{-i}^{\prime}}(\cdot|s)||\pi_{\theta_{-i}}(\cdot|s)) (56)
11γ(maxsai|uai,s|)sdθ(s)ji2dTV(πθj(|s)||πθj(|s))\displaystyle\leq\frac{1}{1-\gamma}\left(\max_{s}\sum_{a_{i}}|u_{a_{i},s}|\right)\sum_{s}d_{\theta}^{\prime}(s)\sum_{j\neq i}2d_{\text{TV}}(\pi_{\theta_{j}^{\prime}}(\cdot|s)||\pi_{\theta_{j}}(\cdot|s)) (57)
=11γ(maxsai|uai,s|)sdθ(s)jiθj,sθj,s1\displaystyle=\frac{1}{1-\gamma}\left(\max_{s}\sum_{a_{i}}|u_{a_{i},s}|\right)\sum_{s}d_{\theta}^{\prime}(s)\sum_{j\neq i}\|\theta^{\prime}_{j,s}-\theta_{j,s}\|_{1} (58)
11γ|𝒜i|sdθ(s)ji|𝒜j|θj,sθj,s\displaystyle\leq\frac{1}{1-\gamma}\sqrt{|\mathcal{A}_{i}|}\sum_{s}d_{\theta}^{\prime}(s)\sum_{j\neq i}\sqrt{|\mathcal{A}_{j}|}\|\theta_{j,s}^{\prime}-\theta_{j,s}\| (59)
11γ|𝒜i|ji|𝒜j|sdθ(s)2sθj,sθj,s2\displaystyle\leq\frac{1}{1-\gamma}\sqrt{|\mathcal{A}_{i}|}\sum_{j\neq i}\sqrt{|\mathcal{A}_{j}|}\sqrt{\sum_{s}d_{\theta^{\prime}}(s)^{2}}\sqrt{\sum_{s}\|\theta_{j,s}^{\prime}-\theta_{j,s}\|^{2}} (60)
=11γ|𝒜i|ji|𝒜j|sdθ(s)2θjθj\displaystyle=\frac{1}{1-\gamma}\sqrt{|\mathcal{A}_{i}|}\sum_{j\neq i}\sqrt{|\mathcal{A}_{j}|}\sqrt{\sum_{s}d_{\theta^{\prime}}(s)^{2}}\|\theta_{j}^{\prime}-\theta_{j}\|
11γ|𝒜i|ji|𝒜j|θjθj\displaystyle\leq\frac{1}{1-\gamma}\sqrt{|\mathcal{A}_{i}|}\sum_{j\neq i}\sqrt{|\mathcal{A}_{j}|}\|\theta_{j}^{\prime}-\theta_{j}\|
11γ|𝒜i|j=1n|𝒜j|θjθj,\displaystyle\leq\frac{1}{1-\gamma}\sqrt{|\mathcal{A}_{i}|}\sum_{j=1}^{n}\sqrt{|\mathcal{A}_{j}|}\|\theta_{j}^{\prime}-\theta_{j}\|,

where the step from (54) to (55) uses the fact that |Q_{i}^{\theta}(s,a)|\leq\frac{1}{1-\gamma}. The step from (56) to (57) relies on the following property of the total variation distance:

dTV(πθi(|s)||πθi(|s))jidTV(πθj(|s)||πθj(|s))d_{\text{TV}}(\pi_{\theta_{-i}^{\prime}}(\cdot|s)||\pi_{\theta_{-i}}(\cdot|s))\leq\sum_{j\neq i}d_{\text{TV}}(\pi_{\theta_{j}^{\prime}}(\cdot|s)||\pi_{\theta_{j}}(\cdot|s))

The step from (58) to (59) uses:

maxsai|uai,s||𝒜i|,u1\displaystyle\max_{s}\sum_{a_{i}}|u_{a_{i},s}|\leq\sqrt{|\mathcal{A}_{i}|},~{}~{}\|u\|\leq 1
θj,sθj,s1|𝒜j|θj,sθj,s\displaystyle\|\theta^{\prime}_{j,s}-\theta_{j,s}\|_{1}\leq\sqrt{|\mathcal{A}_{j}|}\|\theta_{j,s}^{\prime}-\theta_{j,s}\|

both of which can be verified immediately by applying the Cauchy–Schwarz inequality.

Before bounding Part B, we first define \widetilde{P}(\alpha) as the state-action transition matrix under \pi_{\alpha}:

[P~(α)](s,a)(s,a)=πα(a|s)P(s|s,a)\left[\widetilde{P}(\alpha)\right]_{(s,a)\rightarrow(s^{\prime},a^{\prime})}=\pi_{\alpha}(a^{\prime}|s^{\prime})P(s^{\prime}|s,a)

Then we have that:

[P~(α)α|α=0](s,a)(s,a)=uai,sπθi(ai|s)P(s|s,a)\left[\frac{\partial\widetilde{P}(\alpha)}{\partial\alpha}\Big{|}_{\alpha=0}\right]_{(s,a)\rightarrow(s^{\prime},a^{\prime})}=u_{a_{i}^{\prime},s^{\prime}}\pi_{\theta_{-i}}(a_{-i}^{\prime}|s^{\prime})P(s^{\prime}|s,a)

For an arbitrary vector xx:

[P~(α)α|α=0x](s,a)\displaystyle\left[\frac{\partial\widetilde{P}(\alpha)}{\partial\alpha}\Big{|}_{\alpha=0}x\right]_{(s,a)} =s,auai,sπθi(ai|s)P(s|s,a)xs,a\displaystyle=\sum_{s^{\prime},a^{\prime}}u_{a_{i}^{\prime},s^{\prime}}\pi_{\theta_{-i}}(a_{-i}^{\prime}|s^{\prime})P(s^{\prime}|s,a)x_{s^{\prime},a^{\prime}}
xs,a|uai,s|πθi(ai|s)P(s|s,a)\displaystyle\leq\|x\|_{\infty}\sum_{s^{\prime},a^{\prime}}|u_{a_{i}^{\prime},s^{\prime}}|\pi_{\theta_{-i}}(a_{-i}^{\prime}|s^{\prime})P(s^{\prime}|s,a)
=xsP(s|s,a)ai|uai,s|aiπθi(ai|s)\displaystyle=\|x\|_{\infty}\sum_{s^{\prime}}P(s^{\prime}|s,a)\sum_{a_{i}^{\prime}}|u_{a_{i}^{\prime},s^{\prime}}|\sum_{a_{-i}^{\prime}}\pi_{\theta_{-i}}(a_{-i}^{\prime}|s^{\prime})
xsP(s|s,a)|𝒜i|aiπθi(ai|s)\displaystyle\leq\|x\|_{\infty}\sum_{s^{\prime}}P(s^{\prime}|s,a)\sqrt{|\mathcal{A}_{i}|}\sum_{a_{-i}^{\prime}}\pi_{\theta_{-i}}(a_{-i}^{\prime}|s^{\prime})
|𝒜i|x\displaystyle\leq\sqrt{|\mathcal{A}_{i}|}\|x\|_{\infty}

Thus:

\left\|\frac{\partial\widetilde{P}(\alpha)}{\partial\alpha}\Big{|}_{\alpha=0}x\right\|_{\infty}\leq\sqrt{|\mathcal{A}_{i}|}\|x\|_{\infty}

Similarly, we can define \widetilde{P}(\alpha)^{\prime} as the state-action transition matrix under \pi_{\alpha}^{\prime}, and one can easily check that

\left\|\frac{\partial\widetilde{P}(\alpha)^{\prime}}{\partial\alpha}\Big{|}_{\alpha=0}x\right\|_{\infty}\leq\sqrt{|\mathcal{A}_{i}|}\|x\|_{\infty}

Define:

M(α):=(IγP~(α))1,M(α):=(IγP~(α))1.M(\alpha):=\left(I-\gamma\widetilde{P}(\alpha)\right)^{-1},\quad M(\alpha)^{\prime}:=\left(I-\gamma\widetilde{P}(\alpha)^{\prime}\right)^{-1}.

Because:

M(\alpha)=\left(I-\gamma\widetilde{P}(\alpha)\right)^{-1}=\sum_{n=0}^{\infty}\gamma^{n}\widetilde{P}(\alpha)^{n},

every entry of M(\alpha) is nonnegative and M(\alpha)\mathbf{1}=\frac{1}{1-\gamma}\mathbf{1}, which implies:

M(α)x11γx,\|M(\alpha)x\|_{\infty}\leq\frac{1}{1-\gamma}\|x\|_{\infty},

and similarly

M(α)x11γx.\|M(\alpha)^{\prime}x\|_{\infty}\leq\frac{1}{1-\gamma}\|x\|_{\infty}.

Now we are ready to bound Part B. Because:

Qiα(s,a)\displaystyle Q_{i}^{\alpha}(s,a) =e(s,a)M(α)ri\displaystyle=e_{(s,a)}^{\top}M(\alpha)r_{i}
Qiα(s,a)α\displaystyle\Longrightarrow~{}~{}\frac{\partial Q_{i}^{\alpha}(s,a)}{\partial\alpha} =e(s,a)M(α)αri=γe(s,a)M(α)P~(α)αM(α)ri\displaystyle=e_{(s,a)}^{\top}\frac{\partial M(\alpha)}{\partial\alpha}r_{i}=\gamma e_{(s,a)}^{\top}M(\alpha)\frac{\partial\widetilde{P}(\alpha)}{\partial\alpha}M(\alpha)r_{i}
|Qiα(s,a)α|\displaystyle\Longrightarrow~{}~{}\left|\frac{\partial Q_{i}^{\alpha}(s,a)}{\partial\alpha}\right| γM(α)P~(α)αM(α)ri\displaystyle\leq\gamma\left\|M(\alpha)\frac{\partial\widetilde{P}(\alpha)}{\partial\alpha}M(\alpha)r_{i}\right\|_{\infty}
γ(1γ)2|𝒜i|\displaystyle\leq\frac{\gamma}{(1-\gamma)^{2}}\sqrt{|\mathcal{A}_{i}|}

Thus,

Part B =|s,adθ(s)(πθ(a|s)πθ(a|s))Qiα(s,a)α|α=0|\displaystyle=\left|\sum_{s,a}d_{\theta}^{\prime}(s)(\pi^{\prime}_{\theta}(a|s)-\pi_{\theta}(a|s))\frac{\partial Q_{i}^{\alpha}(s,a)}{\partial\alpha}\Big{|}_{\alpha=0}\right|
s,adθ(s)|πθ(a|s)πθ(a|s)||Qiα(s,a)α|α=0|\displaystyle\leq\sum_{s,a}d_{\theta}^{\prime}(s)\left|\pi^{\prime}_{\theta}(a|s)-\pi_{\theta}(a|s)\right|\left|\frac{\partial Q_{i}^{\alpha}(s,a)}{\partial\alpha}\Big{|}_{\alpha=0}\right|
γ(1γ)2|𝒜i|sdθ(s)2dTV(πθ(|s)||πθ(|s))\displaystyle\leq\frac{\gamma}{(1-\gamma)^{2}}\sqrt{|\mathcal{A}_{i}|}\sum_{s}d_{\theta}^{\prime}(s)2d_{\textup{TV}}(\pi_{\theta^{\prime}}(\cdot|s)||\pi_{\theta}(\cdot|s))
γ(1γ)2|𝒜i|sdθ(s)j2dTV(πθj(|s)||πθj(|s))\displaystyle\leq\frac{\gamma}{(1-\gamma)^{2}}\sqrt{|\mathcal{A}_{i}|}\sum_{s}d_{\theta}^{\prime}(s)\sum_{j}2d_{\textup{TV}}(\pi_{\theta^{\prime}_{j}}(\cdot|s)||\pi_{\theta_{j}}(\cdot|s))
=γ(1γ)2|𝒜i|sdθ(s)jθj,sθj,s1\displaystyle=\frac{\gamma}{(1-\gamma)^{2}}\sqrt{|\mathcal{A}_{i}|}\sum_{s}d_{\theta}^{\prime}(s)\sum_{j}\|\theta_{j,s}^{\prime}-\theta_{j,s}\|_{1}
γ(1γ)2|𝒜i|sdθ(s)j=1n|𝒜j|θj,sθj,s\displaystyle\leq\frac{\gamma}{(1-\gamma)^{2}}\sqrt{|\mathcal{A}_{i}|}\sum_{s}d_{\theta}^{\prime}(s)\sum_{j=1}^{n}\sqrt{|\mathcal{A}_{j}|}\|\theta_{j,s}^{\prime}-\theta_{j,s}\|
γ(1γ)2|𝒜i|j=1n|𝒜j|sdθ(s)2sθj,sθj,s2\displaystyle\leq\frac{\gamma}{(1-\gamma)^{2}}\sqrt{|\mathcal{A}_{i}|}\sum_{j=1}^{n}\sqrt{|\mathcal{A}_{j}|}\sqrt{\sum_{s}d_{\theta}^{\prime}(s)^{2}}\sqrt{\sum_{s}\|\theta_{j,s}^{\prime}-\theta_{j,s}\|^{2}}
γ(1γ)2|𝒜i|j=1n|𝒜j|θjθj\displaystyle\leq\frac{\gamma}{(1-\gamma)^{2}}\sqrt{|\mathcal{A}_{i}|}\sum_{j=1}^{n}\sqrt{|\mathcal{A}_{j}|}\|\theta_{j}^{\prime}-\theta_{j}\|

We now turn to Part C:

dα(s)\displaystyle d_{\alpha}^{\prime}(s) =(1γ)sρ(s)aπα(a|s)e(s,a)M(α)ae(s,a)\displaystyle=(1-\gamma)\sum_{s^{\prime}}\rho(s^{\prime})\sum_{a^{\prime}}\pi_{\alpha}^{\prime}(a^{\prime}|s^{\prime})e_{(s^{\prime},a^{\prime})}^{\top}M(\alpha)^{\prime}\sum_{a^{\prime\prime}}e_{(s,a^{\prime\prime})}
dα(s)α\displaystyle\Longrightarrow~{}~{}\frac{\partial d_{\alpha}^{\prime}(s)}{\partial\alpha} =(1γ)(sρ(s)aπα(a|s)αe(s,a)v1M(α)ae(s,a)\displaystyle=(1-\gamma)\left(\underbrace{\sum_{s^{\prime}}\rho(s^{\prime})\sum_{a^{\prime}}\frac{\partial\pi_{\alpha}^{\prime}(a^{\prime}|s^{\prime})}{\partial\alpha}e_{(s^{\prime},a^{\prime})}^{\top}}_{v_{1}^{\top}}M(\alpha)^{\prime}\sum_{a^{\prime\prime}}e_{(s,a^{\prime\prime})}\right.
+sρ(s)aπα(a|s)e(s,a)v2M(α)αae(s,a))\displaystyle\left.+\underbrace{\sum_{s^{\prime}}\rho(s^{\prime})\sum_{a^{\prime}}\pi_{\alpha}^{\prime}(a^{\prime}|s^{\prime})e_{(s^{\prime},a^{\prime})}^{\top}}_{v_{2}^{\top}}\frac{\partial M(\alpha)^{\prime}}{\partial\alpha}\sum_{a^{\prime\prime}}e_{(s,a^{\prime\prime})}\right)
\displaystyle=(1-\gamma)\left(v_{1}^{\top}M(\alpha)^{\prime}+\gamma v_{2}^{\top}M(\alpha)^{\prime}\frac{\partial\widetilde{P}(\alpha)^{\prime}}{\partial\alpha}M(\alpha)^{\prime}\right)\sum_{a^{\prime\prime}}e_{(s,a^{\prime\prime})}

Note that v1,v2v_{1},v_{2} are constant vectors that are independent of the choice of ss. Additionally:

v11\displaystyle\|v_{1}\|_{1} =sρ(s)aπα(a|s)αe(s,a)1\displaystyle=\left\|\sum_{s}\rho(s)\sum_{a}\frac{\partial\pi_{\alpha}^{\prime}(a|s)}{\partial\alpha}e_{(s,a)}\right\|_{1}
=sρ(s)a|πα(a|s)α|\displaystyle=\sum_{s}\rho(s)\sum_{a}\left|\frac{\partial\pi_{\alpha}^{\prime}(a|s)}{\partial\alpha}\right|
=sρ(s)a|uai,s|πθi(ai|s)\displaystyle=\sum_{s}\rho(s)\sum_{a}\left|u_{a_{i},s}\right|\pi_{\theta_{-i}^{\prime}}(a_{-i}|s)
\displaystyle\leq\sum_{s}\rho(s)\sum_{a_{i}}\left|u_{a_{i},s}\right|\leq\sqrt{|\mathcal{A}_{i}|}
v21\displaystyle\|v_{2}\|_{1} =sρ(s)aπα(a|s)e(s,a)1\displaystyle=\|\sum_{s}\rho(s)\sum_{a}\pi_{\alpha}^{\prime}(a|s)e_{(s,a)}\|_{1}
=sρ(s)aπα(a|s)=1\displaystyle=\sum_{s}\rho(s)\sum_{a}\pi_{\alpha}^{\prime}(a|s)=1

Thus:

Part C =|s,adα(s)α|α=0(πθ(a|s)πθ(a|s))Qiθ(s,a)|\displaystyle=\left|\sum_{s,a}\frac{\partial d_{\alpha}^{\prime}(s)}{\partial\alpha}\Big{|}_{\alpha=0}(\pi^{\prime}_{\theta}(a|s)-\pi_{\theta}(a|s))Q_{i}^{\theta}(s,a)\right|
\displaystyle=(1-\gamma)\left|\left(v_{1}^{\top}M(0)^{\prime}+\gamma v_{2}^{\top}M(0)^{\prime}\frac{\partial\widetilde{P}(\alpha)^{\prime}}{\partial\alpha}\Big{|}_{\alpha=0}M(0)^{\prime}\right)\underbrace{\sum_{s,a}\sum_{a^{\prime}}e_{(s,a^{\prime})}(\pi^{\prime}_{\theta}(a|s)-\pi_{\theta}(a|s))Q_{i}^{\theta}(s,a)}_{v_{3}}\right|
(1γ)(11γv11v3+γ(1γ)2|𝒜i|v21v3)\displaystyle\leq(1-\gamma)\left(\frac{1}{1-\gamma}\|v_{1}\|_{1}\|v_{3}\|_{\infty}+\frac{\gamma}{(1-\gamma)^{2}}\sqrt{|\mathcal{A}_{i}|}\|v_{2}\|_{1}\|v_{3}\|_{\infty}\right)
|𝒜i|1γv3\displaystyle\leq\frac{\sqrt{|\mathcal{A}_{i}|}}{1-\gamma}\|v_{3}\|_{\infty}

Additionally:

|[v3](s0,a0)|\displaystyle\left|[v_{3}]_{(s_{0},a_{0})}\right| =|a(πθ(a|s0)πθ(a|s0))Qiθ(s0,a)|\displaystyle=\left|\sum_{a}(\pi_{\theta^{\prime}}(a|s_{0})-\pi_{\theta}(a|s_{0}))Q_{i}^{\theta}(s_{0},a)\right|
11γa|πθ(a|s0)πθ(a|s0)|\displaystyle\leq\frac{1}{1-\gamma}\sum_{a}|\pi_{\theta^{\prime}}(a|s_{0})-\pi_{\theta}(a|s_{0})|
=11γ2dTV(πθ(|s0)||πθ(|s0))\displaystyle=\frac{1}{1-\gamma}2d_{\textup{TV}}(\pi_{\theta^{\prime}}(\cdot|s_{0})||\pi_{\theta}(\cdot|s_{0}))
11γj=1n2dTV(πθj(|s0)||πθj(|s0))\displaystyle\leq\frac{1}{1-\gamma}\sum_{j=1}^{n}2d_{\textup{TV}}(\pi_{\theta_{j}^{\prime}}(\cdot|s_{0})||\pi_{\theta_{j}}(\cdot|s_{0}))
\displaystyle=\frac{1}{1-\gamma}\sum_{j=1}^{n}\|\theta_{j,s_{0}}^{\prime}-\theta_{j,s_{0}}\|_{1}
\displaystyle\leq\frac{1}{1-\gamma}\sum_{j=1}^{n}\sqrt{|\mathcal{A}_{j}|}\|\theta_{j,s_{0}}^{\prime}-\theta_{j,s_{0}}\|
11γj=1n|𝒜j|θjθj\displaystyle\leq\frac{1}{1-\gamma}\sum_{j=1}^{n}\sqrt{|\mathcal{A}_{j}|}\|\theta_{j}^{\prime}-\theta_{j}\|

Combining the above inequalities we get:

Part C|𝒜i|1γv3|𝒜i|(1γ)2j=1n|𝒜j|θjθj\text{Part C}\leq\frac{\sqrt{|\mathcal{A}_{i}|}}{1-\gamma}\|v_{3}\|_{\infty}\leq\frac{\sqrt{|\mathcal{A}_{i}|}}{(1-\gamma)^{2}}\sum_{j=1}^{n}\sqrt{|\mathcal{A}_{j}|}\|\theta_{j}^{\prime}-\theta_{j}\|

Summing up Parts A–C and using \frac{1}{1-\gamma}+\frac{\gamma}{(1-\gamma)^{2}}+\frac{1}{(1-\gamma)^{2}}=\frac{2}{(1-\gamma)^{2}}, we get:

|Ji(θi+αui,θi)Ji(θi+αui,θi)α|α=0|\displaystyle\left|\frac{\partial J_{i}(\theta_{i}^{\prime}+\alpha u_{i},\theta_{-i}^{\prime})-\partial J_{i}(\theta_{i}+\alpha u_{i},\theta_{-i})}{\partial\alpha}\Big{|}_{\alpha=0}\right| 11γ(Part A+Part B+Part C)\displaystyle\leq\frac{1}{1-\gamma}(\text{Part A}+\text{Part B}+\text{Part C})
2(1γ)3|𝒜i|j=1n|𝒜j|θjθj,\displaystyle\leq\frac{2}{(1-\gamma)^{3}}\sqrt{|\mathcal{A}_{i}|}\sum_{j=1}^{n}\sqrt{|\mathcal{A}_{j}|}\|\theta_{j}^{\prime}-\theta_{j}\|,

which completes the proof. ∎

Appendix L Auxiliary

We now restate and prove the auxiliary lemma recalled in Appendix E.

Proof.

Let y=\theta+g. Without loss of generality, assume that i^{*}=1 and that:

y1>y2y3yn.y_{1}>y_{2}\geq y_{3}\geq\cdots\geq y_{n}.

Using the KKT conditions, one can derive an efficient algorithm for computing Proj_{\mathcal{X}}(y) [73], which consists of the following steps (a short numerical sketch of these steps is given after the proof):

  1. Find \rho:=\max\{1\leq j\leq n:y_{j}+\frac{1}{j}\left(1-\sum_{i=1}^{j}y_{i}\right)>0\};

  2. Set \lambda:=\frac{1}{\rho}\left(1-\sum_{i=1}^{\rho}y_{i}\right);

  3. Set \theta^{\prime}_{i}=\max\{y_{i}+\lambda,0\}.

From the algorithm, we have that:

λ\displaystyle\lambda =1ρ(1i=1ρyi)=1ρ(1i=1ρ(θi+gi))\displaystyle=\frac{1}{\rho}\left(1-\sum_{i=1}^{\rho}y_{i}\right)=\frac{1}{\rho}\left(1-\sum_{i=1}^{\rho}(\theta_{i}+g_{i})\right)
=1ρ(1i=1ρθi)1ρi=1ρgi\displaystyle=\frac{1}{\rho}\left(1-\sum_{i=1}^{\rho}\theta_{i}\right)-\frac{1}{\rho}\sum_{i=1}^{\rho}g_{i}
1ρi=1ρgi.\displaystyle\geq-\frac{1}{\rho}\sum_{i=1}^{\rho}g_{i}.

If ρ2\rho\geq 2,

θ1\displaystyle\theta^{\prime}_{1} =max{y1+λ,0}y1+λθ1+g11ρi=1ρgi\displaystyle=\max\{y_{1}+\lambda,0\}\geq y_{1}+\lambda\geq\theta_{1}+g_{1}-\frac{1}{\rho}\sum_{i=1}^{\rho}g_{i}
θ1+(11ρ)g11ρi=2ρ(g1Δ)=θ1+ρ1ρΔθ1+Δ2.\displaystyle\geq\theta_{1}+(1-\frac{1}{\rho})g_{1}-\frac{1}{\rho}\sum_{i=2}^{\rho}(g_{1}-\Delta)=\theta_{1}+\frac{\rho-1}{\rho}\Delta\geq\theta_{1}+\frac{\Delta}{2}.

If ρ=1\rho=1,

θ1=y1+λ=y1+(1y1)=1.\theta^{\prime}_{1}=y_{1}+\lambda=y_{1}+(1-y_{1})=1.

Thus:

θ1min{1,θ1+Δ2},\theta^{\prime}_{1}\geq\min\{1,\theta_{1}+\frac{\Delta}{2}\},

which completes the proof. ∎
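For reference, the three projection steps above can be implemented in a few lines; the following is a minimal NumPy sketch (the function name is ours, and the routine sorts y internally, whereas the proof assumes y is already sorted without loss of generality).

import numpy as np

def project_simplex(y):
    # Euclidean projection of y onto the probability simplex, following steps 1-3.
    y = np.asarray(y, dtype=float)
    u = np.sort(y)[::-1]                                   # sort in decreasing order
    cumsum = np.cumsum(u)
    js = np.arange(1, y.size + 1)
    rho = np.nonzero(u + (1.0 - cumsum) / js > 0)[0][-1]   # step 1 (0-indexed)
    lam = (1.0 - cumsum[rho]) / (rho + 1.0)                # step 2
    return np.maximum(y + lam, 0.0)                        # step 3

# Quick check: the output is a valid probability vector.
theta_plus = project_simplex(np.array([0.6, 0.3, 0.1]) + np.array([0.5, -0.2, -0.3]))
assert np.isclose(theta_plus.sum(), 1.0) and np.all(theta_plus >= 0)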

Appendix M Numerical Simulation Details

Verification of the fully mixed NE in Game 2

We now verify that joining network 1 with probability \frac{1-3\epsilon}{3(1-2\epsilon)}, i.e.:

πθi(ai=1|s)=13ϵ3(12ϵ),s𝒮,i=1,2,\pi_{\theta_{i}}(a_{i}=1|s)=\frac{1-3\epsilon}{3(1-2\epsilon)},~{}~{}\forall s\in\mathcal{S},~{}~{}i=1,2,

is indeed an NE. First, observe that

Prθ(si,t+1=1)\displaystyle\textup{Pr}^{\theta}(s_{i,t+1}=1) =(13ϵ3(12ϵ))P(si,t+1=1|ai,t=1)+(113ϵ3(12ϵ))P(si,t+1=1|ai,t=2)\displaystyle=\left(\frac{1-3\epsilon}{3(1-2\epsilon)}\right)P(s_{i,t+1}=1|a_{i,t}=1)+\left(1-\frac{1-3\epsilon}{3(1-2\epsilon)}\right)P(s_{i,t+1}=1|a_{i,t}=2)
=(13ϵ3(12ϵ))(1ϵ)+(113ϵ3(12ϵ))ϵ=13.\displaystyle=\left(\frac{1-3\epsilon}{3(1-2\epsilon)}\right)(1-\epsilon)+\left(1-\frac{1-3\epsilon}{3(1-2\epsilon)}\right)\epsilon=\frac{1}{3}.

Thus,

V(s)\displaystyle V(s) =r(s)+t=1𝔼stγtr(st)=r(s)+2γ3(1γ),\displaystyle=r(s)+\sum_{t=1}^{\infty}\mathbb{{E}}_{s_{t}}\gamma^{t}r(s_{t})=r(s)+\frac{2\gamma}{3(1-\gamma)},
Qθi¯(s,ai)\displaystyle\overline{Q^{\theta}_{i}}(s,a_{i}) =r(s)+γs,aiP(s|ai,ai)πθi(ai|s)V(s)\displaystyle=r(s)+\gamma\sum_{s^{\prime},a_{-i}}P(s^{\prime}|a_{i},a_{-i})\pi_{\theta_{-i}}(a_{-i}|s)V(s^{\prime})
\displaystyle=r(s)+\gamma\sum_{s_{i}^{\prime}\in\{1,2\}}\left(P(s_{i}^{\prime}|a_{i})\textup{Pr}^{\theta}(s_{-i}=1)r(s_{i}^{\prime},s_{-i}=1)+P(s_{i}^{\prime}|a_{i})\textup{Pr}^{\theta}(s_{-i}=2)r(s_{i}^{\prime},s_{-i}=2)\right)+\frac{2\gamma^{2}}{3(1-\gamma)}
=r(s)+γP(si=1|ai)(13r(si=1,si=1)+23r(si=1,si=2))\displaystyle=r(s)+\gamma P(s_{i}^{\prime}=1|a_{i})\left(\frac{1}{3}r(s_{i}^{\prime}=1,s_{-i}=1)+\frac{2}{3}r(s_{i}^{\prime}=1,s_{-i}=2)\right)
\displaystyle\quad+\gamma P(s_{i}^{\prime}=2|a_{i})\left(\frac{1}{3}r(s_{i}^{\prime}=2,s_{-i}=1)+\frac{2}{3}r(s_{i}^{\prime}=2,s_{-i}=2)\right)+\frac{2\gamma^{2}}{3(1-\gamma)}
=r(s)+23γ+2γ23(1γ)=r(s)+2γ3(1γ)=V(s),\displaystyle=r(s)+\frac{2}{3}\gamma+\frac{2\gamma^{2}}{3(1-\gamma)}=r(s)+\frac{2\gamma}{3(1-\gamma)}=V(s),

which implies that:

(θiθi)θiJi(θ)=0,θi𝒳i,i=1,2,(\theta_{i}^{\prime}-\theta_{i})^{\top}\nabla_{\theta_{i}}J_{i}(\theta)=0,\quad\forall\theta_{i}^{\prime}\in\mathcal{X}_{i},\quad i=1,2,

i.e., \theta satisfies first-order stationarity. Since d_{\theta}(s)>0 holds for any valid \theta, by Theorem 1, \theta is an NE.
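As a quick numerical sanity check of the transition computation above, the following snippet verifies that the mixed strategy indeed gives \textup{Pr}^{\theta}(s_{i,t+1}=1)=\frac{1}{3}; we use \epsilon=0.1 purely as an illustrative value, since the identity holds for any \epsilon<1/3.

eps = 0.1                                  # illustrative value; P(s'=1|a=1) = 1-eps, P(s'=1|a=2) = eps
p = (1 - 3 * eps) / (3 * (1 - 2 * eps))    # probability of joining network 1
prob_next_is_1 = p * (1 - eps) + (1 - p) * eps
assert abs(prob_next_is_1 - 1.0 / 3.0) < 1e-12   # equals 1/3, matching the calculation above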

Computation of strict NEs in Game 2

The computation of strict NEs is done numerically, using the criterion in Lemma 5. We enumerate all 2^{8} possible deterministic policies and check whether the conditions in Lemma 5 hold. For \epsilon=0.1, \gamma=0.95, and an initial distribution set as:

ρ(s1=i,s2=j)=1/4,i,j{1,2},\rho(s_{1}=i,s_{2}=j)=1/4,~{}~{}i,j\in\{1,2\},

the numerical calculation shows that there exist 13 different strict NEs.
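The enumeration itself is straightforward to set up; below is a schematic Python sketch of this loop. Each agent's deterministic policy assigns one of 2 actions to each of the 4 joint states, giving 2^{8}=256 profiles. The actual check of the Lemma 5 conditions requires the value functions of Game 2 and is not reproduced here, so it is left as a user-supplied callback (is_strict_ne, a hypothetical placeholder).

from itertools import product

STATES = [(s1, s2) for s1 in (1, 2) for s2 in (1, 2)]   # 4 joint states of Game 2
ACTIONS = (1, 2)                                        # join network 1 or network 2

def enumerate_strict_nes(is_strict_ne):
    # Enumerate all 2^8 = 256 deterministic policy profiles (2 agents x 4 states,
    # 2 actions each) and keep those passing the user-supplied Lemma 5 check.
    strict_nes = []
    for choices in product(ACTIONS, repeat=2 * len(STATES)):
        policy = {(i, s): choices[i * len(STATES) + k]   # action of agent i in joint state s
                  for i in range(2) for k, s in enumerate(STATES)}
        if is_strict_ne(policy):
            strict_nes.append(policy)
    return strict_nes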