
Gradient play in stochastic games: stationary points, convergence, and sample complexity

Runyu (Cathy) Zhang, Zhaolin Ren, Na Li R. Zhang, Z. Ren, and N. Li are affiliated with Harvard School of Engineering and Applied Sciences, (e-mail: [email protected], [email protected], [email protected]) This research is funded by NSF CAREER ECCS-1553407, NSF AI institute 2112085, NSF CNS: 2003111, ONR YIP N00014-19-1-2217.
(January 2022)
Abstract

We study the performance of the gradient play algorithm for stochastic games (SGs), where each agent tries to maximize its own total discounted reward by making decisions independently based on current state information which is shared between agents. Policies are directly parameterized by the probability of choosing a certain action at a given state. We show that Nash equilibria (NEs) and first-order stationary policies are equivalent in this setting, and give a local convergence rate around strict NEs. Further, for a subclass of SGs called Markov potential games (which includes the setting with identical rewards as an important special case), we design a sample-based reinforcement learning algorithm and give a non-asymptotic global convergence rate analysis for both exact gradient play and our sample-based learning algorithm. Our result shows that the number of iterations to reach an \epsilon-NE scales linearly, instead of exponentially, with the number of agents. Local geometry and local stability are also considered, where we prove that strict NEs are local maxima of the total potential function and fully-mixed NEs are saddle points.

1 Introduction

Multi-agent systems find applications in a wide range of societal systems, e.g. electric grids, traffic networks, smart buildings and smart cities, etc. Given the complexity of these systems, multi-agent reinforcement learning (MARL) has gained increasing attention in recent years (e.g. [1, 2]). Among MARL algorithms, policy gradient-type methods are highly popular because of their flexibility and capability to incorporate structured state and action spaces. However, while many recent works [3, 4, 5, 6, 7] have studied the performance of multi-agent policy gradient algorithms, due to a lack of understanding of the optimization landscape in these multi-agent learning problems, most works can only show convergence to a first-order stationary point. A deeper understanding of the quality of these stationary points is missing even in the simple identical-reward multi-agent RL setting.

In this paper, we investigate this problem from a game-theoretic perspective. We model the multi-agent system as a stochastic game (SG) where agents take independent stochastic policies and can have different reward functions. The study of SGs dates back to as early as the 1950s by [8] with a series of follow-up works on developing NE-seeking algorithms, especially in the RL setting (e.g. [9, 10, 11, 12, 13, 14] and citations therein). While well-known classical algorithms for solving SGs are mostly value-based, such as Nash-Q learning [15], Hyper-Q learning [16], and WoLF-PHC [17], gradient-based algorithms have also started to gain popularity in recent years due to their advantages as mentioned earlier (e.g. [18, 19, 20]).

In this work, we aim to gain a deeper understanding of the structure of first-order stationary points and the dynamical behavior of these gradient-based methods, with a particular focus on answering the following questions: 1) How do the first-order stationary points relate to the NEs of the underlying game? 2) What is the stability of individual NEs? 3) How can agents learn from samples in this environment?

These questions have already been widely discussed in other settings, e.g., one-shot (stateless) finite-action games [21, 22, 23, 24, 25, 26, 27, 28, 29, 30], one-shot continuous games [31], zero-sum linear quadratic (LQ) games [32], etc. There are both negative and positive results depending on the settings. For one-shot continuous games, [31] proved a negative result suggesting that gradient flow has stationary points (even local maxima) that are not necessarily NEs. Conversely, [32] designed projected nested-gradient methods that provably converge to NEs in zero-sum LQ games. However, much less is known in the tabular setting of SGs with finite state-action spaces.

Contributions. We consider the gradient play algorithm for infinite-horizon, discounted-reward SGs with independent, directly parameterized agent policies. By generalizing the gradient domination property in [33] to the SG setting, we first establish the equivalence of first-order stationary policies and Nash equilibria (Theorem 1). This result suggests that even if agents have an identical reward, the first-order stationary points are only equivalent to Nash equilibria, which are usually non-unique and have different reward values. This is fundamentally different from the centralized learning case [33], where first-order stationary points can be shown to be globally optimal.

Then we study the convergence of gradient play for SGs. For general games, it is known that gradient play may fail to obtain global convergence [21, 22, 23, 24]. Thus we first focus on characterizing local properties in the general case. In particular, we characterize the structure of strict NEs and show that gradient play locally converges to strict NEs within finitely many steps (Theorem 2).

Next we study a special class of SGs called Markov potential games (MPGs) [34, 35, 36], which includes identical-reward multi-agent RL [37, 38, 39, 40, 41, 42] as an important special case. Concurrently, this work and [36] have established the global convergence rate to a NE for gradient play under MPGs (Theorem 3). However, the result does not specify which NE the policies converge to. Since there can be many NEs with poor global value, global convergence results alone have limited implications for algorithm performance. This motivates us to study the local geometry around specific types of NEs. We show that strict NEs are local maxima of the total potential function, and thus stable points under gradient play, while fully mixed NEs are saddle points, and thus unstable points under gradient play (Theorem 4).

Then, we design a fully decentralized sample-based gradient play algorithm and prove that it can find an \epsilon-Nash equilibrium with high probability using \widetilde{O}\left(\frac{n}{\epsilon^{6}}\textup{poly}\left(\frac{1}{1-\gamma},|\mathcal{S}|,\max_{i}|\mathcal{A}_{i}|\right)\right) samples (Theorem 5; here |\mathcal{S}| and |\mathcal{A}_{i}| denote the sizes of the state space and of agent i's action space, respectively). The key enabler of our algorithm is the existence of an underlying averaged MDP for each agent when the other agents' policies are fixed. Our learning method can be viewed as a model-based policy evaluation method with respect to agents' averaged MDPs. This averaged MDP concept could be applied to design many other MARL algorithms, especially policy-evaluation-based methods.

Comparison to other works on NE learning for SGs: There are some recent studies on general SGs with finite state-action spaces; however, either the structure of the SGs or the methods they consider differ from our setting. For example, [43, 44] consider learning coarse correlated equilibria (CCE) rather than NEs for finite-horizon general-sum games; [45] and [46] propose decentralized learning algorithms for weakly acyclic games, which include identical-interest games as a special case, but only consider asymptotic convergence; [47, 48] consider convergence to NEs for two-player zero-sum games. In addition, [3, 6, 7] consider slightly different MARL settings, where agents collaboratively maximize the sum of the agents' rewards with either full or partial state observation; they also require communication between neighboring agents for better global coordination.

For the MPG subclass, [49, 36, 43, 50, 51] study convergence to a NE. [43] designs the Nash-CA (Nash Coordinate Ascent) algorithm, which requires agents to update sequentially and does not belong to the gradient-based algorithm class. While [50, 51] consider gradient-based algorithms, they study softmax policies, which differ from directly parameterized policies. [52] considers policy gradient with function approximation. [36] is the work most related to this paper; it also studies the performance of gradient-based algorithms under direct parameterization. It establishes a global convergence rate and develops sample complexity results for gradient play, but does not study the local geometry for general SGs. Additionally, the sample-based algorithm considered in [36] is based on Monte-Carlo gradient estimation, which might suffer from high variance in practice and is very different from our algorithm, which estimates the gradient by estimating the "model" of agents' averaged MDPs. Moreover, our concept of "averaged" MDPs could also serve as a useful tool for the design and analysis of other MARL algorithms.

2 Problem setting and preliminaries

We consider a stochastic game (SG, [8]) \mathcal{M}=(N,\mathcal{S},\mathcal{A}=\mathcal{A}_{1}\times\dots\times\mathcal{A}_{n},P,r=(r_{1},\dots,r_{n}),\gamma,\rho) with n agents, which is specified by: an agent set N=\{1,2,\dots,n\}; a finite state space \mathcal{S}; a finite action space \mathcal{A}_{i} for each agent i\in N; a transition model P, where P(s'|s,a)=P(s'|s,a_{1},\dots,a_{n}) is the probability of transitioning into state s' upon taking action a:=(a_{1},\dots,a_{n}) in state s, with a_{i}\in\mathcal{A}_{i} the action of agent i; agent i's reward function r_{i}:\mathcal{S}\times\mathcal{A}\rightarrow[0,1]; a discount factor \gamma\in[0,1); and an initial state distribution \rho over \mathcal{S}.

A stochastic policy π:𝒮Δ(𝒜)\pi:\mathcal{S}\rightarrow\Delta(\mathcal{A}) (where Δ(𝒜)\Delta(\mathcal{A}) is the probability simplex over 𝒜\mathcal{A}) specifies a strategy in which agents choose their actions jointly based on the current state in a stochastic fashion, i.e. Pr(at|st)=π(at|st)\Pr(a_{t}|s_{t})=\pi(a_{t}|s_{t}). A decentralized stochastic policy is a special subclass of stochastic policies, with π=π1××πn\pi=\pi_{1}\times\ldots\times\pi_{n}, where πi:𝒮Δ(𝒜i)\pi_{i}:\mathcal{S}\rightarrow\Delta(\mathcal{A}_{i}). For decentralized stochastic policies, each agent takes its action based on the current state ss independently of other agents’ choices of actions, i.e.:

Pr(at|st)=π(at|st)=i=1nπi(ai,t|st),at=(a1,t,,an,t).\textstyle\Pr(a_{t}|s_{t})=\pi(a_{t}|s_{t})=\prod_{i=1}^{n}\pi_{i}(a_{i,t}|s_{t}),a_{t}\!=\!(a_{1,t},\!\dots\!,a_{n,t}).\vspace{-3pt}

For notational simplicity, we define:  πI(aI|s):=iIπi(ai|s)\pi_{I}(a_{I}|s):=\prod_{i\in I}\pi_{i}(a_{i}|s), where INI\subseteq N is an index set. Further, we use the notation i-i to denote the index set N\{i}N\backslash\{i\}.

We consider direct decentralized policy parameterization, where agent ii’s policy is parameterized by θi\theta_{i}:

πi,θi(ai|s)=θi,(s,ai),i=1,2,,n.\pi_{i,\theta_{i}}(a_{i}|s)=\theta_{i,(s,a_{i})},\quad i=1,2,\dots,n. (1)

For notational simplicity, we abbreviate \pi_{i,\theta_{i}}(a_{i}|s) as \pi_{\theta_{i}}(a_{i}|s), and \theta_{i,(s,a_{i})} as \theta_{s,a_{i}}. Here \theta_{i}\in\Delta(\mathcal{A}_{i})^{|\mathcal{S}|}, i.e. \theta_{i} is subject to the constraints \theta_{s,a_{i}}\geq 0 and \sum_{a_{i}\in\mathcal{A}_{i}}\theta_{s,a_{i}}=1 for all s\in\mathcal{S}. The global joint policy is given by \pi_{\theta}(a|s)=\prod_{i=1}^{n}\pi_{\theta_{i}}(a_{i}|s)=\prod_{i=1}^{n}\theta_{s,a_{i}}. We use \mathcal{X}_{i}:=\Delta(\mathcal{A}_{i})^{|\mathcal{S}|} and \mathcal{X}:=\mathcal{X}_{1}\times\cdots\times\mathcal{X}_{n} to denote the feasible regions of \theta_{i} and \theta, respectively.
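To make the parameterization concrete, the following minimal Python sketch samples a joint action from a decentralized product policy under direct parameterization (1); the array shapes and helper names are our own illustrative conventions, not from the paper.

```python
import numpy as np

# A minimal sketch of direct decentralized parameterization (1): agent i's policy
# is just the table theta_i[s, a_i], one probability distribution per state.

rng = np.random.default_rng(0)

def random_direct_policy(num_states, num_actions):
    """Sample theta_i from Delta(A_i)^|S|: each row is a distribution over actions."""
    theta = rng.random((num_states, num_actions))
    return theta / theta.sum(axis=1, keepdims=True)

def sample_joint_action(thetas, s):
    """Draw a = (a_1, ..., a_n) with Pr(a|s) = prod_i theta_i[s, a_i]."""
    return tuple(rng.choice(theta_i.shape[1], p=theta_i[s]) for theta_i in thetas)

# Example: n = 2 agents, |S| = 3 states, |A_1| = 2 and |A_2| = 4 actions.
thetas = [random_direct_policy(3, 2), random_direct_policy(3, 4)]
print(sample_joint_action(thetas, s=1))
```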

Agent ii’s value function [53] Viθ:𝒮,iNV_{i}^{\theta}:\mathcal{S}\rightarrow\mathbb{R},i\in N is defined as the discounted sum of future rewards starting at state ss via executing πθ\pi_{\theta}, i.e.

Viθ(s):=𝔼[t=0γtri(st,at)|πθ,s0=s],\textstyle V_{i}^{\theta}(s):=\mathbb{{E}}\left[\sum_{t=0}^{\infty}\gamma^{t}r_{i}(s_{t},a_{t})\big{|}~{}\pi_{\theta},s_{0}=s\right],

where the expectation is with respect to the random trajectory τ=(st,at,ri,t)t=0\tau=(s_{t},a_{t},r_{i,t})_{t=0}^{\infty} where atπθ(|st),st+1=P(|st,at)a_{t}\sim\pi_{\theta}(\cdot|s_{t}),s_{t+1}=P(\cdot|s_{t},a_{t}). We denote agent ii’s total reward starting from initial state s0ρs_{0}\sim\rho as:

Ji(θ)=Ji(θ1,,θn):=𝔼s0ρViθ(s0).\quad\textstyle J_{i}(\theta)=J_{i}(\theta_{1},\dots,\theta_{n}):=\mathbb{{E}}_{s_{0}\sim\rho}V_{i}^{\theta}(s_{0}).\vspace{-1pt}
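As a rough illustration of this definition, the sketch below estimates J_1(\theta) by Monte Carlo rollouts on a small synthetic two-agent game; the game (P, r1, \rho) and all names are made up for illustration only, and the infinite discounted sum is truncated at a finite horizon.

```python
import numpy as np

# Rough Monte Carlo estimate of J_1(theta) = E_{s0 ~ rho}[V_1^theta(s0)]: roll out
# the joint policy and accumulate discounted rewards (synthetic game, truncated sum).

rng = np.random.default_rng(1)
nS, nA1, nA2, gamma = 3, 2, 2, 0.9
P = rng.random((nS, nA1, nA2, nS)); P /= P.sum(axis=-1, keepdims=True)
r1 = rng.random((nS, nA1, nA2))                      # agent 1's reward in [0, 1]
rho = np.full(nS, 1.0 / nS)
theta = [np.full((nS, nA1), 1.0 / nA1), np.full((nS, nA2), 1.0 / nA2)]

def estimate_J1(theta, episodes=1000, horizon=100):
    total = 0.0
    for _ in range(episodes):
        s = rng.choice(nS, p=rho)
        ret, disc = 0.0, 1.0
        for _ in range(horizon):                     # truncate t = 0, ..., horizon-1
            a1 = rng.choice(nA1, p=theta[0][s])
            a2 = rng.choice(nA2, p=theta[1][s])
            ret += disc * r1[s, a1, a2]
            disc *= gamma
            s = rng.choice(nS, p=P[s, a1, a2])
        total += ret
    return total / episodes

print(estimate_J1(theta))
```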

In the game setting, Nash equilibrium is often used to characterize the performance of agents’ policies.

Definition 1.

(Nash equilibrium, c.f. [54, 55]) A policy θ=(θ1,,θn)\theta^{*}=(\theta_{1}^{*},\dots,\theta_{n}^{*}) is called a Nash equilibrium (NE) if

Ji(θi,θi)Ji(θi,θi),θi𝒳i,iNJ_{i}(\theta_{i}^{*},\theta_{-i}^{*})\geq J_{i}(\theta_{i}^{\prime},\theta_{-i}^{*}),\quad\forall\theta_{i}^{\prime}\in\mathcal{X}_{i},\quad i\in N

The equilibrium is called a strict NE if the inequality holds strictly for all \theta_{i}'\in\mathcal{X}_{i} with \theta_{i}'\neq\theta_{i}^{*} and all i\in N. The equilibrium is called a pure NE if \theta^{*} corresponds to a deterministic policy. The equilibrium is called a mixed NE if it is not pure. Further, the equilibrium is called a fully mixed NE if every entry of \theta^{*} is strictly positive, i.e., \theta_{s,a_{i}}^{*}>0 for all a_{i}\in\mathcal{A}_{i}, s\in\mathcal{S}, i\in N.

We define the discounted state visitation distribution [53] dθd_{\theta} of a policy πθ\pi_{\theta} given an initial state distribution ρ\rho as:

dθ(s):=𝔼s0ρ(1γ)t=0γtPrθ(st=s|s0),\textstyle d_{\theta}(s):=\mathbb{{E}}_{s_{0}\sim\rho}(1-\gamma)\sum_{t=0}^{\infty}\gamma^{t}\textup{Pr}^{\theta}(s_{t}=s|s_{0}), (2)

where Prθ(st=s|s0)\textup{Pr}^{\theta}(s_{t}=s|s_{0}) is the state visitation probability that st=ss_{t}=s when executing πθ\pi_{\theta} starting at state s0s_{0}. Throughout the paper, we make the following assumption on the SGs we study.

Assumption 1.

The stochastic game \mathcal{M} satisfies:  dθ(s)>0,s𝒮,θ𝒳d_{\theta}(s)>0,~{}\forall s\in\mathcal{S},~{}\forall\theta\in\mathcal{X}.

Assumption 1 requires that every state is visited with positive probability, which is a standard assumption for convergence proofs in the RL literature (e.g. [33, 56, 36, 48]). Note that this assumption is easily satisfied if the initial distribution \rho satisfies \rho(s)>0 for all s\in\mathcal{S}.

Similar to centralized RL [53], define agent ii’s QQ-function QiθQ_{i}^{\theta} and its advantage function AiθA_{i}^{\theta} as:

Qiθ(s,a):=𝔼[t=0γtri(st,at)|πθ,s0=s,a0=a],Aiθ(s,a):=Qiθ(s,a)Viθ(s).\begin{split}Q_{i}^{\theta}(s,a)&:=\mathbb{{E}}\left[\sum_{t=0}^{\infty}\gamma^{t}r_{i}(s_{t},a_{t})\big{|}~{}\pi_{\theta},s_{0}=s,a_{0}=a\right],\\ A_{i}^{\theta}(s,a)&:=Q_{i}^{\theta}(s,a)-V_{i}^{\theta}(s).\end{split}

‘Averaged’ Markov decision process (MDP): We further define agent ii’s ‘averaged’ Q-function Qiθ¯:𝒮×𝒜i\overline{Q_{i}^{\theta}}:\mathcal{S}\times\mathcal{A}_{i}\rightarrow\mathbb{R} and ‘averaged’ advantage-function Aiθ¯:𝒮×𝒜i\overline{A_{i}^{\theta}}:\mathcal{S}\times\mathcal{A}_{i}\rightarrow\mathbb{R} as:

Qiθ¯(s,ai):=aiπθi(ai|s)Qiθ(s,ai,ai),Aiθ¯(s,ai):=aiπθi(ai|s)Aiθ(s,ai,ai).\begin{split}\textstyle\overline{Q_{i}^{\theta}}(s,a_{i})&\textstyle:=\sum_{a_{-i}}\pi_{\theta_{-i}}(a_{-i}|s)Q_{i}^{\theta}(s,a_{i},a_{-i}),\\ \textstyle\overline{A_{i}^{\theta}}(s,a_{i})&\textstyle:=\sum_{a_{-i}}\pi_{\theta_{-i}}(a_{-i}|s)A_{i}^{\theta}(s,a_{i},a_{-i}).\end{split} (3)

Similarly, we define agent ii’s ‘averaged’ transition probability distribution Piθ¯:𝒮×𝒮×𝒜i\overline{P_{i}^{\theta}}:\mathcal{S}\times\mathcal{S}\times\mathcal{A}_{i}\rightarrow\mathbb{R}, and ‘averaged’ reward riθ¯:𝒮×𝒜i\overline{r_{i}^{\theta}}:\mathcal{S}\times\mathcal{A}_{i}\rightarrow\mathbb{R} as:

Piθ¯(s|s,ai):=aiπθi(ai|s)P(s|s,ai,ai),riθ¯(s,ai):=aiπθi(ai|s)ri(s,ai,ai)\begin{split}\textstyle\overline{P_{i}^{\theta}}(s^{\prime}|s,a_{i})&\textstyle:=\sum_{a_{-i}}\pi_{\theta{-i}}(a_{-i}|s)P(s^{\prime}|s,a_{i},a_{-i}),\\ \textstyle\overline{r_{i}^{\theta}}(s,a_{i})&\textstyle:=\sum_{a_{-i}}\pi_{\theta{-i}}(a_{-i}|s)r_{i}(s,a_{i},a_{-i})\end{split}

From its definition, the averaged Q-function satisfies the following Bellman equation:

Lemma 1.

Qiθ¯\overline{Q_{i}^{\theta}} satisfies:

Qiθ¯(s,ai)=riθ¯(s,ai)+γs,aiπθi(ai|s)Piθ¯(s|s,ai)Qiθ¯(s,ai)\overline{Q_{i}^{\theta}}(s,\!a_{i})\!=\!\overline{r_{i}^{\theta}}(s,\!a_{i})\!\!+\!\!\gamma\!\!\sum_{s^{\prime},a_{i}^{\prime}}\!\!\pi_{\theta_{i}}\!(a_{i}^{\prime}|s^{\prime})\overline{P_{i}^{\theta}}\!(s^{\prime}|s,a_{i})\overline{Q_{i}^{\theta}}(s^{\prime},a_{i}^{\prime})\vspace{-10pt} (4)

Lemma 1 suggests that the averaged Q-function Qiθ¯\overline{Q_{i}^{\theta}} is indeed the Q-function for the MDP defined on action space 𝒜i\mathcal{A}_{i}, with riθ¯,Piθ¯\overline{r_{i}^{\theta}},\overline{P_{i}^{\theta}} as its stage reward and transition probability, respectively. We define this MDP as the ‘averaged’ MDP of agent ii, i.e., iθ=(𝒮,𝒜i,Piθ¯,riθ¯,γ,ρ)\mathcal{M}_{i}^{\theta}=(\mathcal{S},\mathcal{A}_{i},\overline{P_{i}^{\theta}},\overline{r_{i}^{\theta}},\gamma,\rho). The notion of an ‘averaged’ MDP will serve as an important intuition when designing the sample-based algorithm. Note that the ‘averaged’ MDP is only well-defined when the policies of the other agents θi\theta_{-i} are kept fixed. When this is indeed the case, agent ii can be treated as an independent learner with respect to its own ‘averaged’ MDP. Thus, various classical policy evaluation RL algorithms can then be applied. Additionally, we can apply the performance difference lemma [57] to the averaged MDP to derive a corresponding lemma for SGs which is useful throughout the paper.
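As a concrete illustration of the 'averaged' MDP, the following sketch marginalizes a two-agent game's transition kernel and reward over agent 2's fixed policy, as in (3) and the definitions above; the array-shape conventions and the function name are ours.

```python
import numpy as np

# Agent 1's 'averaged' MDP in a two-agent game: with agent 2's policy theta2 fixed,
# average the joint transition kernel and reward over a2 ~ pi_{theta2}(.|s).
# Shape conventions: P is (S, A1, A2, S'), r1 is (S, A1, A2), theta2 is (S, A2).

def averaged_mdp_agent1(P, r1, theta2):
    # P_bar[s, a1, s'] = sum_{a2} theta2[s, a2] * P[s, a1, a2, s']
    P_bar = np.einsum('sabt,sb->sat', P, theta2)
    # r_bar[s, a1] = sum_{a2} theta2[s, a2] * r1[s, a1, a2]
    r_bar = np.einsum('sab,sb->sa', r1, theta2)
    return P_bar, r_bar
```

With (P_bar, r_bar) in hand, agent 1 can run standard policy-evaluation routines on its averaged MDP, which is how the sample-based algorithm in Section 4.3 proceeds (with estimated quantities in place of the exact ones).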

Lemma 2.

(Performance difference lemma for SGs; proof in Appendix B) Let \theta'=(\theta_{i}',\theta_{-i}). Then

Ji(θi,θi)Ji(θi,θi)=11γs,aidθ(s)πθi(ai|s)Aiθ¯(s,ai).J_{i}(\theta_{i}^{\prime},\theta_{-i})-J_{i}(\theta_{i},\theta_{-i})=\frac{1}{1-\gamma}\sum_{s,a_{i}}d_{\theta^{\prime}}(s)\pi_{\theta^{\prime}_{i}}(a_{i}|s)\overline{A_{i}^{\theta}}(s,a_{i}).

Note that in the single agent case (n=1n=1), Lemma 2 is the same as the original performance difference lemma known in literature, e.g., Lemma 6.1 in [57].

3 Gradient Play for General Stochastic Games

Under direct distributed parameterization, the gradient play algorithm is given by:

θi(t+1)=Proj𝒳i(θi(t)+ηθiJi(θi(t))),η>0.\theta_{i}^{(t+1)}=\text{Proj}_{\mathcal{X}_{i}}(\theta_{i}^{(t)}+\eta\nabla_{\theta_{i}}J_{i}(\theta_{i}^{(t)})),~{}\eta\!>\!0. (5)

Gradient play can be viewed as a ‘better response’ strategy, where agents update their own parameters by gradient ascent with respect to their own rewards.
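For concreteness, one projected gradient-play step (5) can be implemented as below. The sort-based simplex projection is a standard routine (our implementation choice, not something prescribed by the paper), and the gradient array grad_i is assumed to be supplied, e.g., by the exact formula of Lemma 3 in the next subsection.

```python
import numpy as np

# One projected gradient-play step (5) for agent i under direct parameterization.

def project_to_simplex(v):
    """Euclidean projection of v onto {x : x >= 0, sum(x) = 1}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    k = np.nonzero(u - css / np.arange(1, len(v) + 1) > 0)[0][-1]
    return np.maximum(v - css[k] / (k + 1), 0.0)

def gradient_play_step(theta_i, grad_i, eta):
    """theta_i, grad_i have shape (|S|, |A_i|); each state's row is projected back
    onto the probability simplex Delta(A_i)."""
    raw = theta_i + eta * grad_i
    return np.apply_along_axis(project_to_simplex, 1, raw)
```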

A first-order stationary point is defined as such:

Definition 2.

(First-order stationary policy) A policy θ=(θ1,,θn)\theta^{*}=(\theta_{1}^{*},\dots,\theta_{n}^{*}) is called a first-order stationary policy if (θiθi)θiJi(θ)0,θi𝒳i,iN(\theta_{i}^{\prime}-\theta_{i}^{*})^{\top}\nabla_{\theta_{i}}J_{i}(\theta^{*})\leq 0,\ \forall\theta_{i}^{\prime}\in\mathcal{X}_{i},\ i\in N.

It is not hard to verify that \theta^{*} is a first-order stationary policy if and only if it is a fixed point under gradient play (5). Comparing Definition 1 (of NE) and Definition 2, we see that NEs are first-order stationary policies, but the converse is not obvious: for each agent i, first-order stationarity does not directly imply that \theta_{i}^{*} is optimal among all possible \theta_{i} given \theta_{-i}^{*}. Interestingly, however, we will show that NEs are in fact equivalent to first-order stationary policies, due to a gradient domination property established later. Before that, we first calculate the explicit form of the gradient \nabla_{\theta_{i}}J_{i}.

Policy gradient theorem [58] gives an efficient formula for the gradient:

θ𝔼s0ρViθ(s0)=11γ𝔼sdθ,aπθ(|s)[θlogπθ(a|s)Qiθ(s,a)],\nabla_{\!\theta}\mathbb{{E}}_{s_{0}\sim\rho}\!V_{i}^{\theta}(s_{0})\!=\!\frac{1}{1\!-\!\gamma}\mathbb{{E}}_{s\sim d_{\theta}\!,a\sim\pi_{\theta}(\!\cdot|s)}[\nabla_{\theta}\log\pi_{\theta}(a|s)Q_{i}^{\theta}(s,a)], (6)

Applying (6), the gradient θiJi\nabla_{\theta_{i}}J_{i} can be written explicitly as follows:

Lemma 3.

(Proof see Appendix D) For direct distributed parameterization (1),

Ji(θ)θs,ai=11γdθ(s)Qiθ¯(s,ai)\frac{\partial J_{i}(\theta)}{\partial{\theta_{s,a_{i}}}}=\frac{1}{1-\gamma}d_{\theta}(s)\overline{Q_{i}^{\theta}}(s,a_{i}) (7)
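When the model (P, r_i) is known, the formula (7) can be evaluated exactly. The sketch below does so for agent 1 of a two-agent game, solving the Bellman equation of Lemma 1 for \overline{Q_{1}^{\theta}} and using the closed form for d_{\theta} that also appears later in (29); the array shapes follow the earlier sketches and are our own convention.

```python
import numpy as np

# Exact evaluation of (7) for agent 1 of a two-agent game with known model.
# Shapes: P is (S, A1, A2, S'), r1 is (S, A1, A2), theta_i is (S, A_i), rho is (S,).

def exact_gradient_agent1(P, r1, theta1, theta2, rho, gamma):
    nS, nA1 = theta1.shape
    P_bar = np.einsum('sabt,sb->sat', P, theta2)           # averaged transitions (S, A1, S')
    r_bar = np.einsum('sab,sb->sa', r1, theta2)            # averaged rewards (S, A1)
    # Q_bar = (I - gamma*M)^{-1} r_bar, with M[(s,a1),(s',a1')] = theta1[s',a1'] * P_bar[s,a1,s'].
    M = np.einsum('sat,tb->satb', P_bar, theta1).reshape(nS * nA1, nS * nA1)
    Q_bar = np.linalg.solve(np.eye(nS * nA1) - gamma * M, r_bar.reshape(-1)).reshape(nS, nA1)
    # State kernel under the joint policy, and d_theta = (1-gamma)(I - gamma*P_S^T)^{-1} rho.
    P_S = np.einsum('sat,sa->st', P_bar, theta1)
    d = (1.0 - gamma) * np.linalg.solve(np.eye(nS) - gamma * P_S.T, rho)
    # (7): dJ_1/dtheta_{s, a1} = d(s) * Q_bar(s, a1) / (1 - gamma).
    return d[:, None] * Q_bar / (1.0 - gamma)
```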

3.1 Gradient domination and the equivalence between NE and first-order stationary policy.

Lemma 4.1 in [33] established gradient domination for centralized tabular MDP under direct parameterization. We can show that a similar property still holds for SGs.

Lemma 4.

(Gradient domination) For direct distributed parameterization (1), we have that for any θ=(θ1,,θn)𝒳\theta=(\theta_{1},\dots,\theta_{n})\in\mathcal{X} and any θi𝒳i,iN\theta_{i}^{\prime}\in\mathcal{X}_{i},~{}i\in N:

Ji(θi,θi)Ji(θi,θi)dθdθmaxθ¯i𝒳i(θ¯iθi)θiJi(θ),J_{i}(\theta_{i}^{\prime},\!\theta_{\!-\!i})-J_{i}(\theta_{i},\!\theta_{\!-\!i})\!\leq\!\left\|\frac{d_{\theta^{\prime}}}{d_{\theta}}\right\|_{\infty}\!\max_{\overline{\theta}_{i}\!\in\!\mathcal{X}_{i}}(\overline{\theta}_{i}-\theta_{i})^{\top}\nabla_{\theta_{i}}J_{i}(\theta), (8)

where dθdθ:=maxsdθ(s)dθ(s)\left\|\frac{d_{\theta^{\prime}}}{d_{\theta}}\right\|_{\infty}:=\max_{s}\frac{d_{\theta^{\prime}}(s)}{d_{\theta}(s)}, and θ=(θi,θi)\theta^{\prime}=(\theta_{i}^{\prime},\theta_{-i}).

Proof.

According to Lemma 2:

Ji(θi,θi)Ji(θi,θi)=11γs,aidθ(s)πθi(ai|s)Aiθ¯(s,ai).\!J_{i}(\theta_{i}^{\prime},\!\theta_{-i})\!-\!J_{i}(\theta_{i},\!\theta_{-i})\!=\!\frac{1}{1\!-\!\gamma}\!\sum_{s,a_{i}}\!d_{\theta^{\prime}}\!(s)\pi_{\theta^{\prime}_{i}}(a_{i}|s)\overline{A_{i}^{\theta}}(s,\!a_{i}).\!\vspace{-5pt}

From the definition of ‘averaged’ advantage function:

aiπθi(ai|s)Aiθ¯(s,ai)=0,s𝒮\textstyle\sum_{a_{i}}\pi_{\theta_{i}}(a_{i}|s)\overline{A^{\theta}_{i}}(s,a_{i})=0,\quad\forall s\in\mathcal{S}

which implies: maxai𝒜iAiθ¯(s,ai)0,\max_{a_{i}\in\mathcal{A}_{i}}\overline{A^{\theta}_{i}}(s,a_{i})\geq 0, thus we have that:

Ji(θi,θi)Ji(θi,θi)=11γs,aidθ(s)πθi(ai|s)Aiθ¯(s,ai)sdθ(s)1γmaxai𝒜iAiθ¯(s,ai)=sdθ(s)dθ(s)dθ(s)1γmaxai𝒜iAiθ¯(s,ai)11γdθdθsdθ(s)maxai𝒜iAiθ¯(s,ai).\begin{split}&J_{i}(\theta_{i}^{\prime},\theta_{-i})-J_{i}(\theta_{i},\theta_{-i})=\frac{1}{1-\gamma}\sum_{s,a_{i}}d_{\theta^{\prime}}(s)\pi_{\theta^{\prime}_{i}}(a_{i}|s)\overline{A_{i}^{\theta}}(s,a_{i})\\ &\leq\!\!\sum_{s}\!\frac{d_{\theta^{\prime}}(s)}{1-\gamma}\max_{a_{i}\in\mathcal{A}_{i}}\!\overline{A_{i}^{\theta}}(s,a_{i})\!=\!\sum_{s}\frac{d_{\theta^{\prime}}(s)}{d_{\theta}(s)}\frac{d_{\theta}(s)}{1\!-\!\gamma}\max_{a_{i}\in\mathcal{A}_{i}}\overline{A_{i}^{\theta}}(s,a_{i})\\ &\leq\frac{1}{1-\gamma}\left\|\frac{d_{\theta^{\prime}}}{d_{\theta}}\right\|_{\infty}\sum_{s}d_{\theta}(s)\max_{a_{i}\in\mathcal{A}_{i}}\overline{A_{i}^{\theta}}(s,a_{i}).\end{split} (9)

We can rewrite 11γsdθ(s)maxai𝒜iAiθ¯(s,ai)\frac{1}{1-\gamma}\sum_{s}d_{\theta}(s)\max_{a_{i}\in\mathcal{A}_{i}}\overline{A_{i}^{\theta}}(s,a_{i}) as:

11γsdθ(s)maxai𝒜iAiθ¯(s,ai)=11γmaxθ¯i𝒳is,aidθ(s)πθ¯i(ai|s)Aiθ¯(s,ai)=maxθ¯i𝒳is,ai(πθ¯i(ai|s)πθi(ai|s))11γdθ(s)Aiθ¯(s,ai)=maxθ¯i𝒳i(s,ai(πθ¯i(ai|s)πθi(ai|s))11γdθ(s)Qiθ¯(s,ai)s11γdθ(s)V(s)ai(πθ¯i(ai|s)πθi(ai|s)))=0=maxθ¯i𝒳is,ai(πθ¯i(ai|s)πθi(ai|s))11γdθ(s)Qiθ¯(s,ai)=maxθ¯i𝒳i(θ¯iθi)θiJi(θ).\begin{split}&\frac{1}{1-\gamma}\sum_{s}d_{\theta}(s)\max_{a_{i}\in\mathcal{A}_{i}}\overline{A_{i}^{\theta}}(s,a_{i})\\ &=\frac{1}{1-\gamma}\max_{\overline{\theta}_{i}\in\mathcal{X}_{i}}\sum_{s,a_{i}}d_{\theta}(s)\pi_{\overline{\theta}_{i}}(a_{i}|s)\overline{A_{i}^{\theta}}(s,a_{i})\\ &=\max_{\overline{\theta}_{i}\in\mathcal{X}_{i}}\sum_{s,a_{i}}(\pi_{\overline{\theta}_{i}}(a_{i}|s)-\pi_{\theta_{i}}(a_{i}|s))\frac{1}{1-\gamma}d_{\theta}(s)\overline{A_{i}^{\theta}}(s,a_{i})\\ &=\max_{\overline{\theta}_{i}\in\mathcal{X}_{i}}\left(\sum_{s,a_{i}}(\pi_{\overline{\theta}_{i}}(a_{i}|s)-\pi_{\theta_{i}}(a_{i}|s))\frac{1}{1-\gamma}d_{\theta}(s)\overline{Q_{i}^{\theta}}(s,a_{i})\right.\\ &\quad-\underbrace{\left.\sum_{s}\frac{1}{1-\gamma}d_{\theta}(s)V(s)\sum_{a_{i}}(\pi_{\overline{\theta}_{i}}(a_{i}|s)-\pi_{\theta_{i}}(a_{i}|s))\right)}_{=0}\\ &=\max_{\overline{\theta}_{i}\in\mathcal{X}_{i}}\sum_{s,a_{i}}(\pi_{\overline{\theta}_{i}}(a_{i}|s)-\pi_{\theta_{i}}(a_{i}|s))\frac{1}{1-\gamma}d_{\theta}(s)\overline{Q_{i}^{\theta}}(s,a_{i})\\ &=\max_{\overline{\theta}_{i}\in\mathcal{X}_{i}}(\overline{\theta}_{i}-\theta_{i})^{\top}\nabla_{\theta_{i}}J_{i}(\theta).\end{split} (10)

Substituting this into (9), we may conclude that

Ji(θi,θi)Ji(θi,θi)dθdθmaxθ¯i𝒳i(θ¯iθi)θiJi(θ)J_{i}(\theta_{i}^{\prime},\theta_{-i})-J_{i}(\theta_{i},\theta_{-i})\leq\left\|\frac{d_{\theta^{\prime}}}{d_{\theta}}\right\|_{\infty}\max_{\overline{\theta}_{i}\in\mathcal{X}_{i}}(\overline{\theta}_{i}-\theta_{i})^{\top}\nabla_{\theta_{i}}J_{i}(\theta)

and this completes the proof. ∎

For the single-agent case (n=1n=1), (8) is consistent with the result in [33], i.e.:  J(θ)J(θ)dθdθmaxθ¯𝒳(θ¯θ)J(θ)J(\theta^{\prime})-J(\theta)\leq\left\|\frac{d_{\theta^{\prime}}}{d_{\theta}}\right\|_{\infty}\max_{\overline{\theta}\in\mathcal{X}}(\overline{\theta}-\theta)^{\top}\nabla J(\theta). However, when there are multiple agents, the condition is much weaker because the inequality requires θi\theta_{-i} to be fixed. When n=1n=1, gradient domination rules out the existence of stationary points that are not global optima. For the multi-agent case, the property can no longer guarantee the equivalence between first-order stationarity and global optimality; instead, it links the stationary points with NEs as shown in the following theorem.

Theorem 1.

Under Assumption 1, first-order stationary policies and NEs are equivalent.

Proof.

The definition of a Nash equilibrium naturally implies first order stationarity, because for any θi𝒳i\theta_{i}\in\mathcal{X}_{i}:

J_{i}((1-\eta)\theta_{i}^{*}+\eta\theta_{i},\theta_{-i}^{*})-J_{i}(\theta_{i}^{*},\theta_{-i}^{*})=\eta(\theta_{i}-\theta_{i}^{*})^{\top}\nabla_{\theta_{i}}J_{i}(\theta^{*})+o(\eta\|\theta_{i}-\theta_{i}^{*}\|)\leq 0,\quad\forall~\eta>0

Letting η0\eta\rightarrow 0 gives the first order stationary condition:

(θiθi)θiJi(θ)0,θi𝒳i,(\theta_{i}-\theta_{i}^{*})^{\top}\nabla_{\theta_{i}}J_{i}(\theta^{*})\leq 0,\quad\forall\theta_{i}\in\mathcal{X}_{i},

It remains to be shown that all first order stationary policies are Nash equilibria. From Assumption 1 we know that for any pair of parameters θ,θ\theta^{\prime},\theta^{*},   dθdθ<+.\left\|\frac{d_{\theta^{\prime}}}{d_{\theta^{*}}}\right\|_{\infty}\!<\!+\infty.
Take θ=(θi,θi),θ=(θi,θi)\theta^{\prime}=(\theta_{i}^{\prime},\theta_{-i}^{*}),\theta^{*}=(\theta_{i}^{*},\theta_{-i}^{*}). According to Lemma 4, we have that for any first order stationary policy θ\theta^{*},

Ji(θi,θi)Ji(θi,θi)dθdθmaxθ¯i𝒳i(θ¯iθi)θiJi(θ)0,J_{i}(\theta_{i}^{\prime},\theta_{-i}^{*})\!-\!J_{i}(\theta_{i}^{*},\theta_{-i}^{*})\!\leq\!\left\|\frac{d_{\theta^{\prime}}}{d_{\theta^{*}}}\right\|_{\infty}\!\!\max_{\overline{\theta}_{i}\in\mathcal{X}_{i}}(\overline{\theta}_{i}\!-\!\theta_{i}^{*})^{\!\!\top}\nabla_{\theta_{i}}\!\!J_{i}(\theta^{*})\!\leq\!0,

which completes the proof. ∎

We briefly note here that the equivalence between first-order stationary points and NEs holds for all SGs that satisfy Assumption 1. One implication of the theorem is that, in the identical-interest case where agents share the same reward, the first-order stationary points of the decentralized policy parameterization can only be guaranteed to be NEs; note that NEs are often non-unique and often have different objective values. This is in contrast to the single-agent/centralized case, where first-order stationary points are globally optimal [33].

3.2 Local convergence for strict NEs

Although the equivalence of NEs and stationary points under gradient play has been established, it is in fact difficult to show that gradient play converges to these stationary points. Even in the simpler static (stateless) game setup, gradient play might fail to converge [21, 22, 23, 24]. One major difficulty is that the vector field \{\nabla_{\theta_{i}}J_{i}(\theta)\}_{i=1}^{n} is not a conservative vector field, so its dynamics may display complicated behavior. Thus, as a preliminary study, instead of global convergence we focus on local convergence and restrict our study to a special subset of NEs: strict NEs. We begin by giving the following characterization of strict NEs:

Lemma 5.

Given a stochastic game \mathcal{M}, any strict NE \theta^{*} is pure, meaning that for each i and s, there exists an a_{i}^{*}(s) such that \theta_{s,a_{i}}^{*}=\mathbf{1}\{a_{i}=a_{i}^{*}(s)\}. Additionally, we have

i)\displaystyle i)\ ai(s)=argmaxaiAiθ¯(s,ai),\displaystyle\textstyle a_{i}^{*}(s)=\operatorname*{arg\,max}_{a_{i}}\overline{A_{i}^{\theta^{*}}}(s,a_{i}),
ii)\displaystyle ii)\ Aiθ¯(s,ai(s))=0,\displaystyle\textstyle\overline{A_{i}^{\theta^{*}}}(s,a_{i}^{*}(s))=0,
iii)\displaystyle iii)\ Aiθ¯(s,ai)<0,aiai(s).\displaystyle\textstyle\overline{A_{i}^{\theta^{*}}}(s,a_{i})<0,~{}\forall~{}a_{i}\neq a_{i}^{*}(s).

Based on this lemma, we define the following for studying the local convergence of a strict NE θ\theta^{*}:

Δiθ(s):=minaiai(s)|Aiθ¯(s,ai)|,Δθ:=minimins11γdθ(s)Δiθ(s)>0.\begin{split}&\textstyle\Delta_{i}^{\theta^{*}}(s):=\min_{a_{i}\neq a_{i}^{*}(s)}\left|\overline{A_{i}^{\theta^{*}}}(s,a_{i})\right|,\\ &\textstyle\Delta^{\theta^{*}}:=\min_{i}\min_{s}\frac{1}{1-\gamma}d_{\theta^{*}}(s)\Delta_{i}^{\theta^{*}}(s)>0.\vspace{-2pt}\end{split} (11)
Theorem 2.

(Local finite-time convergence around strict NEs) Define the metric on policy parameters D(\theta||\theta'):=\max_{1\leq i\leq n}\max_{s\in\mathcal{S}}\|\theta_{i,s}-\theta'_{i,s}\|_{1}, where \|\cdot\|_{1} denotes the \ell_{1}-norm. Suppose \theta^{*} is a strict Nash equilibrium. Then for any \theta^{(0)} such that D(\theta^{(0)}||\theta^{*})\leq\frac{\Delta^{\theta^{*}}(1-\gamma)^{3}}{8n|\mathcal{S}|\left(\sum_{i=1}^{n}|\mathcal{A}_{i}|\right)}, running gradient play (5) guarantees D(\theta^{(t+1)}||\theta^{*})\leq\max\left\{D(\theta^{(t)}||\theta^{*})-\frac{\eta\Delta^{\theta^{*}}}{2},0\right\}, which means that gradient play converges within \lceil\frac{2D(\theta^{(0)}||\theta^{*})}{\eta\Delta^{\theta^{*}}}\rceil steps.

Proofs are deferred to Appendix E.

Remark 1.

Note that the local convergence in Theorem 2 only requires a finite number of steps. The key insight of the proof is that the gradient always points towards \theta^{*} and that the algorithm projects the gradient update onto the probability simplex; thus, by picking the stepsize \eta arbitrarily large, exact convergence can be achieved in just one step. However, the caveat is that we need to assume that the initial policy is sufficiently close to \theta^{*}. For numerical stability, one should pick reasonable stepsizes to accommodate random initializations. Theorem 2 also shows that the radius of the region of attraction of a strict NE is at least \frac{\Delta^{\theta^{*}}(1-\gamma)^{3}}{8n|\mathcal{S}|\left(\sum_{i=1}^{n}|\mathcal{A}_{i}|\right)}; thus a \theta^{*} with a larger \Delta^{\theta^{*}}, i.e., a larger value gap between the optimal action and other actions, has a larger region of attraction. We further remark that Theorem 2 only concerns the local convergence property; hence, the theorem should be interpreted as follows: if there exists a strict NE, then it is locally asymptotically stable under gradient play. It does not claim anything about the existence of strict NEs or about global convergence.

4 Gradient play for Markov potential games

We have discussed that the main problem for the global convergence of gradient play for general SGs is that the vector field {θiJi(θ)}i=1n\{\nabla_{\theta_{i}}J_{i}(\theta)\}_{i=1}^{n} is not conservative. Thus, in this section, we restrict our analysis to a special subclass where the vector field is conservative, which in turn enjoys global convergence. This subclass is generally referred to as a Markov potential game (MPG) in the literature.

Definition 3.

(Markov potential game [36]) A stochastic game \mathcal{M} is called a Markov potential game if there exists a potential function ϕ:𝒮×𝒜1××𝒜n\phi:\mathcal{S}\times\mathcal{A}_{1}\times\cdots\times\mathcal{A}_{n}\!\rightarrow\!\mathbb{R} such that for any agent ii and any pair of policy parameters (θi,θi),(θi,θi)(\theta_{i}^{\prime},\theta_{-i}),(\theta_{i},\theta_{-i}) :

𝔼[t=0γtri(st,at)|π=(θi,θi),s0=s]𝔼[t=0γtri(st,at)|π=(θi,θi),s0=s]\displaystyle\mathbb{{E}}\left[\sum_{t=0}^{\infty}\gamma^{t}r_{i}(s_{t},a_{t})\big{|}\pi=(\theta_{i}^{\prime},\theta_{-i}),s_{0}=s\right]-\mathbb{{E}}\left[\sum_{t=0}^{\infty}\gamma^{t}r_{i}(s_{t},a_{t})\big{|}\pi=(\theta_{i},\theta_{-i}),s_{0}=s\right]
=\displaystyle= 𝔼[t=0γtϕ(st,at)|π=(θi,θi),s0=s]𝔼[t=0γtϕ(st,at)|π=(θi,θi),s0=s],s.\displaystyle\mathbb{{E}}\left[\sum_{t=0}^{\infty}\gamma^{t}\phi(s_{t},a_{t})\big{|}\pi=(\theta_{i}^{\prime},\theta_{-i}),s_{0}=s\right]-\mathbb{{E}}\left[\sum_{t=0}^{\infty}\gamma^{t}\phi(s_{t},a_{t})\big{|}\pi=(\theta_{i},\theta_{-i}),s_{0}=s\right],~{}~{}\forall~{}s.

As shown in the definition, the condition for an MPG is admittedly rather strong and difficult to verify for general SGs. [35, 34] found that continuous MPGs can model applications such as the great fish war [59], the stochastic lake game [60], medium access control [35], etc. There are also efforts attempting to identify conditions under which an SG is an MPG, e.g., [35, 36, 61]. In Appendix A, we provide a more detailed discussion of MPGs, including a necessary condition for MPGs, counterexamples of stage-wise potential games that are not MPGs, sufficient conditions for an SG to be an MPG, and application examples of MPGs. Nevertheless, identifying necessary and sufficient conditions and broadening the applications of MPGs are important future directions.

Given a policy \theta, we define the 'total potential function' \Phi(\theta):=\mathbb{E}_{s_{0}\sim\rho(\cdot)}\left[\sum_{t=0}^{\infty}\gamma^{t}\phi(s_{t},a_{t})\big|~\pi_{\theta}\right]. (Note that our definition of MPG is slightly stronger than the definition in [36], as it requires the total potential function to take this particular form of a discounted sum of the potential function; however, most of our results (Theorem 3 and Theorem 5) still hold under the weaker definition in [36], and Theorem 4 is the only result that relies on the stronger version.) We also define the quantities \Phi_{\max},\Phi_{\min} as \Phi_{\max}:=\frac{\phi_{\max}}{1-\gamma}, \Phi_{\min}:=\frac{\phi_{\min}}{1-\gamma}, where \phi_{\min}:=\min_{s,a}\phi(s,a) and \phi_{\max}:=\max_{s,a}\phi(s,a). It can be easily verified that \Phi_{\min}\leq\Phi(\theta)\leq\Phi_{\max} for all \theta. The following proposition guarantees that an MPG has at least one NE and that it is a pure NE (proof in Appendix F).

Proposition 1.

For a Markov potential game, there is at least one global maximum θ\theta^{*} of the total potential function Φ\Phi, i.e.: θargmaxθ𝒳Φ(θ)\theta^{*}\in\operatorname*{arg\,max}_{\theta\in\mathcal{X}}\Phi(\theta) that is a pure NE.

From the definition of the total potential function we obtain the following relationship

Ji(θi,θi)Ji(θi,θi)=Φ(θi,θi)Φ(θi,θi).J_{i}(\theta_{i}^{\prime},\theta_{-i})-J_{i}(\theta_{i},\theta_{-i})=\Phi(\theta_{i}^{\prime},\theta_{-i})-\Phi(\theta_{i},\theta_{-i}). (12)

Thus,

θiJi(θ)=θiΦ(θ),\nabla_{\theta_{i}}J_{i}(\theta)=\nabla_{\theta_{i}}\Phi(\theta),

which means that gradient play (5) is equivalent to running projected gradient ascent with respect to the total potential function Φ\Phi, i.e.:

θ(t+1)=Proj𝒳(θ(t)+ηθΦ(θi(t))),η>0.\theta^{(t+1)}=\text{Proj}_{\mathcal{X}}(\theta^{(t)}+\eta\nabla_{\theta}\Phi(\theta_{i}^{(t)})),~{}\eta>0.

4.1 Global convergence

With the above property, we can establish global convergence of gradient play to an \epsilon-NE for MPGs. For the sake of self-completeness, we include the theorem here. Before that, we define the \epsilon-NE.

Definition 4.

(ϵ\epsilon-Nash equilibrium) Define the ‘NE-gap’ of a policy θ\theta as:

NE-gapi(θ)\displaystyle\textup{{NE-gap}}_{i}(\theta) :=maxθi𝒳iJi(θi,θi)Ji(θi,θi);\displaystyle:=\max_{\theta_{i}^{\prime}\in\mathcal{X}_{i}}J_{i}(\theta_{i}^{\prime},\theta_{-i})-J_{i}(\theta_{i},\theta_{-i});
NE-gap(θ)\displaystyle\textup{{NE-gap}}(\theta) :=maxiNE-gapi(θ).\displaystyle:=\max_{i}\textup{{NE-gap}}_{i}(\theta).

A policy \theta is an \epsilon-Nash equilibrium if \textup{NE-gap}(\theta)\leq\epsilon.
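In small games where the model is known, NE-gap_i can be computed exactly by exploiting the averaged-MDP view of Section 2: with \theta_{-i} fixed, \max_{\theta_{i}'}J_{i}(\theta_{i}',\theta_{-i}) is the optimal value of agent i's averaged MDP. The sketch below does this for agent 1 of a two-agent game; it is our own evaluation utility, not part of the paper's learning algorithm, and reuses the shape conventions of the earlier sketches.

```python
import numpy as np

# Exact NE-gap_1(theta) for a two-agent game with known model: the gap equals
# rho^T (V_star - V_theta), where V_star is the optimal value of agent 1's
# averaged MDP and V_theta is the value of the current theta1 on that MDP.

def ne_gap_agent1(P, r1, theta1, theta2, rho, gamma, iters=2000):
    P_bar = np.einsum('sabt,sb->sat', P, theta2)     # (S, A1, S')
    r_bar = np.einsum('sab,sb->sa', r1, theta2)      # (S, A1)
    # Value iteration: best response of agent 1 to the fixed theta2.
    V_star = np.zeros(P_bar.shape[0])
    for _ in range(iters):
        V_star = np.max(r_bar + gamma * (P_bar @ V_star), axis=1)
    # Policy evaluation of the current theta1 on the same averaged MDP.
    r_pi = np.einsum('sa,sa->s', r_bar, theta1)
    P_pi = np.einsum('sat,sa->st', P_bar, theta1)
    V_theta = np.linalg.solve(np.eye(len(rho)) - gamma * P_pi, r_pi)
    return float(rho @ (V_star - V_theta))
```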

Theorem 3.

Under Assumption 1, running gradient play (5) with stepsize \eta=\frac{(1-\gamma)^{3}}{2\sum_{i=1}^{n}|\mathcal{A}_{i}|}, the NE-gap of \theta^{(t)} asymptotically converges to 0, i.e., \lim_{t\rightarrow\infty}\textup{NE-gap}(\theta^{(t)})=0. Further, we have:

1T1tTNE-gap(θ(t))2ϵ2,whenever T64M2(ΦmaxΦmin)|𝒮|i=1n|𝒜i|(1γ)3ϵ2,\begin{split}&\quad\frac{1}{T}\sum_{1\leq t\leq T}\textup{{NE-gap}}(\theta^{(t)})^{2}\leq\epsilon^{2},\\ \textup{whenever }~{}&T\geq\frac{64M^{2}(\Phi_{\max}-\Phi_{\min})|\mathcal{S}|\sum_{i=1}^{n}|\mathcal{A}_{i}|}{(1-\gamma)^{3}\epsilon^{2}},\end{split} (13)

where M:=\max_{\theta,\theta'\in\mathcal{X}}\left\|\frac{d_{\theta}}{d_{\theta'}}\right\|_{\infty} (by Assumption 1, this quantity is well-defined). Another way to interpret the result is that the averaged bound \frac{1}{T}\sum_{1\leq t\leq T}\textup{NE-gap}(\theta^{(t)})^{2}\leq\epsilon^{2} translates into a constant-probability guarantee on a single \textup{NE-gap}(\theta^{(t)}): if we pick one \theta^{(t)} uniformly at random from 1\leq t\leq T, then \textup{NE-gap}(\theta^{(t)})^{2}\leq 3\epsilon^{2} with probability at least 2/3 (the constants 3 and 2/3 can be replaced by \frac{1}{1-p} and p for any p\in(0,1)).

The factor MM is also known as the distribution mismatch coefficient that characterizes how the state visitation varies with the policies. Given an initial state distribution ρ\rho that has positive measure on every state, MM can be at least bounded by M11γmaxθdθρ11γ1minsρ(s)M\!\leq\!\frac{1}{1-\gamma}\max_{\theta}\left\|\frac{d_{\theta}}{\rho}\right\|_{\infty}\!\leq\!\frac{1}{1-\gamma}\frac{1}{\min_{s}\rho(s)}. The proof structure of Theorem 3 resembles the proof of convergence for single-agent MDPs in [33], where they leverage classical nonconvex optimization results [62, 63] and gradient domination to get the convergence rate of O(64γdθρ2|𝒮||𝒜|(1γ)6ϵ2)O\left(\frac{64\gamma\left\|\frac{d_{\theta^{*}}}{\rho}\right\|_{\infty}^{2}|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^{6}\epsilon^{2}}\right) to the global optimum (see Appendix G for proof details). In fact, our result matches this bound when there is only one agent (the exponential factor on (1γ)(1\!-\!\gamma) looks slightly different because some factors are hidden implicitly in MM and (ΦmaxΦmin)(\Phi_{\max}\!-\!\Phi_{\min}) in our bound).

4.2 Local geometry of NEs

Theorem 3 guarantees that gradient play converges to an NE; however, the theorem does not specify which NE it converges to. The quality of different NEs can vary significantly. For example, consider a simple two-agent identical-interest normal-form game with the reward table given in Table 1. There are three NEs. Two of them are strict NEs, where both agents choose the same action, i.e. a_{1}=a_{2}=1 or 2; both yield reward 1. The third is a fully mixed NE, where both agents choose actions 1 and 2 uniformly at random with probability \frac{1}{2}; this NE only yields reward \frac{1}{2}. This significant quality difference between different types of NEs motivates us to further investigate whether gradient play can find NEs of relatively good quality. Since the NE that gradient play converges to depends on the initialization and on the local geometry around the NE, as a preliminary study we characterize the local geometry and landscape around strict NEs and fully mixed NEs (stated in the following theorem). Non-strict, non-fully-mixed NEs require further investigation.

          a_2 = 1   a_2 = 2
a_1 = 1      1         0
a_1 = 2      0         1
Table 1: Reward table of the two-agent identical-interest coordination game.
Theorem 4.

For a Markov potential game with Φmin<Φmax\Phi_{\min}\!<\!\Phi_{\max} (i.e., Φ\Phi is not a constant function):

  • A strict NE θ\theta^{*} is equivalent to a strict local maximum of the total potential function Φ\Phi, i.e.: δ,\exists~{}\delta, such that for all θ𝒳,θθ\theta\!\in\!\mathcal{X},\theta\!\neq\!\theta^{*} that satisfies θθδ,\|\theta-\theta^{*}\|\leq\delta, we have  Φ(θ)<Φ(θ)\Phi(\theta)<\Phi(\theta^{*}).

  • Any fully mixed NE θ\theta^{*} is a saddle point of the total potential function Φ\Phi, i.e., Φ(θ)=0\|\nabla\Phi(\theta^{*})\|=0, and δ>0,θ𝒳,such thatθθδ and Φ(θ)>Φ(θ).\forall~{}\delta>0,~{}~{}\exists~{}\theta\in\mathcal{X},~{}\text{such that}~{}\|\theta\!-\!\theta^{*}\|\leq\delta\text{ and }\Phi(\theta)\!>\!\Phi(\theta^{*}).

The full proof of Theorem 4 is deferred to Appendix H.
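As a quick sanity check of Theorem 4 on the Table 1 game (viewed as a one-shot identical-interest game, whose potential is just the common expected reward), the snippet below evaluates the potential at the two strict NEs and at the fully mixed NE, and probes two nearby directions to exhibit the saddle structure; it is purely illustrative.

```python
# Numerical check of the Table 1 coordination game.
# p = Pr(a_1 = 1), q = Pr(a_2 = 1); the potential is the common expected reward.

def potential(p, q):
    return p * q + (1 - p) * (1 - q)

print(potential(1.0, 1.0), potential(0.0, 0.0))   # strict NEs: potential 1.0
print(potential(0.5, 0.5))                        # fully mixed NE: potential 0.5
d = 0.05
print(potential(0.5 + d, 0.5 + d))                # 0.505 > 0.5: an ascent direction
print(potential(0.5 + d, 0.5 - d))                # 0.495 < 0.5: a descent direction
```

Both ascent and descent directions exist arbitrarily close to the fully mixed NE, consistent with its characterization as a saddle point of the potential.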

Remark 2.

Theorem 4 implies that strict NEs are asymptotically locally stable under first-order methods such as gradient play, while fully mixed NEs are unstable under gradient play. Note that the theorem does not claim stability or instability for other types of NEs, e.g., pure NEs that are not strict, or NEs that are mixed but not fully mixed. Nonetheless, we believe that these preliminary results can serve as a valuable platform towards a better understanding of the geometry of the problem. We remark that the conclusion about strict NEs in Theorem 4 does not hold for settings other than tabular MPGs; for instance, for continuous games, one can use quadratic functions to construct simple counterexamples [31]. Also, similar to Remark 1, this theorem focuses on the local geometry of the NEs but does not claim the global existence of, or convergence to, either strict NEs or fully mixed NEs.

4.3 Sample-based learning: algorithm and sample complexity

In this section, we no longer assume access to the exact gradient, but instead need to estimate it via samples. Throughout the section, we make the following additional assumption on MPGs:

Assumption 2.

((τ,σS)(\tau,\sigma_{S})-Sufficient exploration on states) There exist a positive integer τ\tau and a σS(0,1)\sigma_{S}\in(0,1) such that for any policy θ\theta and any initial state-action pair (s,ai),i(s,a_{i}),~{}\forall i, we have

Prθ(sτ|s0=s,a0=a)σS,sτ,\textstyle\Pr^{\theta}(s_{\tau}|s_{0}=s,a_{0}=a)\geq\sigma_{S},~{}~{}\forall s_{\tau}, (14)

i.e., it poses a condition on the mixing time of the Markov chain induced by any policy θ\theta: there exists a sufficiently long time τ\tau, so the probability of being at any state at time τ\tau is at least σS\sigma_{S} for any initial state and action pair.

Note that similar assumptions are common in proving finite-time convergence of RL algorithms (e.g. [7, 64, 65]), where ergodicity of the Markov chain induced by certain policies is generally assumed, which results in every state-action pair being visited with positive probability under the stationary distribution.

We further introduce the state transition probability under \theta, \overline{P_{\mathcal{S}}^{\theta}}:\mathcal{S}\times\mathcal{S}\rightarrow\mathbb{R}, defined as:

P𝒮θ¯(s|s):=aπθ(a|s)P(s|s,a).\textstyle\overline{P_{\mathcal{S}}^{\theta}}(s^{\prime}|s):=\sum_{a}\pi_{\theta}(a|s)P(s^{\prime}|s,a).

We consider fully decentralized learning, where agent i's observation at time t only includes the state s_{t}, its own action a_{i,t}, and its own reward r_{i,t}:=r_{i}(s_{t},a_{t}). Such fully decentralized learning is possible because, when \theta_{-i} is fixed, agent i can be treated as an independent learner whose underlying MDP is the 'averaged' MDP described in Section 2. With this key observation, we design a 'model-based' on-policy learning algorithm, where agents perform policy evaluation in the inner loop and gradient ascent in the outer loop. The algorithm is provided in Algorithm 1. Roughly, it consists of three main steps: 1) (Inner loop) Estimate the averaged transition probabilities and rewards \overline{P_{i}^{\theta}},\overline{r_{i}^{\theta}},\overline{P_{\mathcal{S}}^{\theta}} using on-policy samples. 2) (Inner loop) Calculate the averaged Q-function \overline{Q_{i}^{\theta}} and the discounted state visitation distribution d_{\theta}, and compute the estimated gradient accordingly. 3) (Outer loop) Run projected gradient ascent with the estimated gradients. Before discussing our algorithm in more detail, we highlight that the idea of using the 'averaged' MDP can be used to design other learning methods, including model-free methods, e.g., using temporal difference methods to perform policy evaluation. One caveat is that the 'averaged' MDP is only well-defined when all the other agents use fixed policies. This makes it difficult to extend the two-timescale framework (i.e., with an inner loop and an outer loop) to single-timescale settings, which is an interesting future direction. Further, note that the current algorithm requires full state observation; extending it to the case with only partial observability remains an intriguing open question. We would also like to point out that the algorithm initialization still requires extra consensus/coordination among the players to agree on the hyperparameters T_{J},T_{G}, etc., which guarantees that agents go through the same equal-length phases to sample trajectories and compute gradient estimates.

0:  learning rate η\eta, greedy parameter α\alpha, sample trajectory length TJT_{J}, total iteration steps TGT_{G}
  For each agent ii
  for k=0,1,TG1k=0,1\dots,T_{G}-1 do
     for t=0,1,,TJt=0,1,\dots,T_{J} do
         Sample s0ρs_{0}\sim\rho, implement policy θ(k)\theta^{(k)} and collect trajectory 𝒟i(k)\mathcal{D}_{i}^{(k)}: 𝒟i(k)𝒟i(k){st,ai,t,ri,t},ai,tπθi(k)(|st)\mathcal{D}_{i}^{(k)}\!\leftarrow\!\mathcal{D}_{i}^{(k)}\!\cup\!\{\!s_{t},a_{i,t},r_{i,t}\!\},~{}a_{i,t}\!\sim\!\pi_{\theta_{i}^{(k)}}(\cdot|s_{t})
     end for
     Estimate Piθ^,riθ^,P𝒮θ^,Miθ^\widehat{P_{i}^{\theta}},\widehat{r_{i}^{\theta}},\widehat{P_{\mathcal{S}}^{\theta}},\widehat{M_{i}^{\theta}} by (19), (24),  (27) respectively.
     Calculate Qiθ^,dθ^\widehat{Q_{i}^{\theta}},\widehat{d_{\theta}} by (28),  (29) respectively.
     Estimate the gradient by (30):
     Run projected gradient ascent as in (31)
  end for
Algorithm 1 Sample-based learning

Step 1: empirical estimation of Piθ¯,riθ¯,P𝒮θ¯\overline{P_{i}^{\theta}},\overline{r_{i}^{\theta}},\overline{P_{\mathcal{S}}^{\theta}}: Given a sequence {st,ai,t,ri,t}t=0TJ\{s_{t},a_{i,t},r_{i,t}\}_{t=0}^{T_{J}} generated by a policy θ:=(θi,θi)\theta:=(\theta_{i},\theta_{-i}), the empirical estimation Piθ^\widehat{P_{i}^{\theta}} of Piθ¯\overline{P_{i}^{\theta}} is given by:

Piθ^(s|s,ai)\displaystyle\widehat{P_{i}^{\theta}}(s^{\prime}|s,a_{i}) :={t=0TJ1𝟏{st+1=s,st=s,ai,t=ai}t=1TJ1𝟏{st=s,ai,t=ai},for t=1TJ1𝟏{st=s,ai,t=ai}1;𝟏{s=s},for t=1TJ1𝟏{st=s,ai,t=ai}=0.\displaystyle:=\left\{\begin{array}[]{ll}\frac{\sum_{t=0}^{T_{J}-1}\mathbf{1}\{s_{t+1}=s^{\prime},s_{t}=s,a_{i,t}=a_{i}\}}{\sum_{t=1}^{T_{J}-1}\mathbf{1}\{s_{t}=s,a_{i,t}=a_{i}\}},\vspace{5pt}\\ \quad~{}~{}{\small\textup{for }{\sum_{t=1}^{T_{J}-1}\mathbf{1}\{s_{t}=s,a_{i,t}=a_{i}\}}\geq 1};\vspace{5pt}\\ \mathbf{1}\{s^{\prime}=s\},\vspace{5pt}\\ \quad~{}~{}{\small\textup{for }{\sum_{t=1}^{T_{J}-1}\mathbf{1}\{s_{t}=s,a_{i,t}=a_{i}\}}=0}.\end{array}\right.\vspace{-5pt} (19)

Here we separately treat the special case where the state and action pair is not visited through the whole trajectory, i.e., t=1TJ1𝟏{st=s,ai,t=ai}=0{\sum_{t=1}^{T_{J}-1}\mathbf{1}\{s_{t}=s,a_{i,t}=a_{i}\}}=0 to make Piθ^\widehat{P_{i}^{\theta}} well-defined.

Similarly, the estimates riθ^,P𝒮θ^\widehat{r_{i}^{\theta}},\widehat{P_{\mathcal{S}}^{\theta}} of riθ¯,P𝒮θ¯\overline{r_{i}^{\theta}},\overline{P_{\mathcal{S}}^{\theta}} are given by:

riθ^(s,ai)\displaystyle\widehat{r_{i}^{\theta}}(s,a_{i}) :={t=0TJ𝟏{st=s,ai,t=ai}ri,tt=0TJ𝟏{st=s,ai,t=ai},for t=1TJ1𝟏{st=s,ai,t=ai}1;0,for t=1TJ1𝟏{st=s,ai,t=ai}=0.\displaystyle\!:=\!\left\{\begin{array}[]{ll}\frac{\sum_{t=0}^{T_{J}}\mathbf{1}\{s_{t}=s,a_{i,t}=a_{i}\}r_{i,t}}{\sum_{t=0}^{T_{J}}\mathbf{1}\{s_{t}=s,a_{i,t}=a_{i}\}},\vspace{5pt}\\ \quad{\small\textup{for }{\sum_{t=1}^{T_{J}-1}\mathbf{1}\{s_{t}=s,a_{i,t}=a_{i}\}}\geq 1};\vspace{5pt}\\ 0,\\ \quad{\small\textup{for }{\sum_{t=1}^{T_{J}-1}\mathbf{1}\{s_{t}=s,a_{i,t}=a_{i}\}}=0}.\end{array}\right. (24)
P𝒮θ^(s|s)\displaystyle\widehat{P_{\mathcal{S}}^{\theta}}(s^{\prime}|s) :={t=0TJ1𝟏{st+1=s,st=s}t=1TJ1𝟏{st=s},t=1TJ1𝟏{st=s}1;𝟏{s=s},t=1TJ1𝟏{st=s}=0.\displaystyle\!:=\!\left\{\begin{array}[]{ll}\frac{\sum_{t=0}^{T_{J}-1}\mathbf{1}\{s_{t+1}=s^{\prime},s_{t}=s\}}{\sum_{t=1}^{T_{J}-1}\mathbf{1}\{s_{t}=s\}},~{}\sum_{t=1}^{T_{J}\!-\!1}\!\mathbf{1}\{s_{t}\!=\!s\}\!\geq\!1;\vspace{5pt}\\ \mathbf{1}\{s^{\prime}=s\},\qquad\qquad\quad~{}\sum_{t=1}^{T_{J}\!-\!1}\!\mathbf{1}\{s_{t}\!=\!s\}\!=\!0.\end{array}\right. (27)
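The estimators (19), (24), (27) can be implemented directly from one on-policy trajectory; the sketch below is a minimal version for a single agent i, with integer-coded states and actions, the paper's self-loop and zero-reward fallbacks for unvisited pairs, and minor index bookkeeping of our own.

```python
import numpy as np

# Step 1 of Algorithm 1 for agent i: empirical estimates of the averaged model
# from one trajectory {(s_t, a_{i,t}, r_{i,t})}_{t=0}^{T_J}.

def estimate_averaged_model(states, actions_i, rewards_i, nS, nAi):
    T = len(states) - 1                       # number of observed transitions
    P_hat = np.zeros((nS, nAi, nS))
    PS_hat = np.zeros((nS, nS))
    r_sum = np.zeros((nS, nAi))
    r_cnt = np.zeros((nS, nAi))
    for t in range(T):
        s, ai, sp = states[t], actions_i[t], states[t + 1]
        P_hat[s, ai, sp] += 1.0
        PS_hat[s, sp] += 1.0
    for t in range(T + 1):
        r_sum[states[t], actions_i[t]] += rewards_i[t]
        r_cnt[states[t], actions_i[t]] += 1.0
    for s in range(nS):
        for ai in range(nAi):
            n = P_hat[s, ai].sum()
            P_hat[s, ai] = P_hat[s, ai] / n if n > 0 else np.eye(nS)[s]   # self-loop fallback
        m = PS_hat[s].sum()
        PS_hat[s] = PS_hat[s] / m if m > 0 else np.eye(nS)[s]
    r_hat = np.divide(r_sum, r_cnt, out=np.zeros_like(r_sum), where=r_cnt > 0)
    return P_hat, r_hat, PS_hat
```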

Step 2: estimation of Qiθ¯,dθ\overline{Q_{i}^{\theta}},d_{\theta}: We slightly abuse notation and use Qiθ¯,riθ¯|𝒮||𝒜i|\overline{Q_{i}^{\theta}},\overline{r_{i}^{\theta}}\in\mathbb{R}^{|\mathcal{S}||\mathcal{A}_{i}|} to also denote the vectors corresponding to the averaged Q-function and reward function of agent ii. Similarly, ρ,dθ|𝒮|\rho,d_{\theta}\in\mathbb{R}^{|\mathcal{S}|} are used to denote the vectors for ρ(s)\rho(s) and dθ(s)d_{\theta}(s). Define Miθ|𝒮||𝒜i|×|𝒮||𝒜i|M_{i}^{\theta}\in\mathbb{R}^{|\mathcal{S}||\mathcal{A}_{i}|\times|\mathcal{S}||\mathcal{A}_{i}|}:

\overline{M^{\theta}_{i}}_{(s,a_{i})\rightarrow(s',a_{i}')}:=\pi_{\theta_{i}}(a_{i}'|s')\overline{P_{i}^{\theta}}(s'|s,a_{i}).

Then from Lemma 1, Qiθ¯\overline{Q_{i}^{\theta}} is given by:

(IγMiθ¯)Qiθ¯=riθ¯Qiθ¯=(IγMiθ¯)1riθ¯.(I-\gamma\overline{M^{\theta}_{i}})\overline{Q_{i}^{\theta}}=\overline{r_{i}^{\theta}}~{}\Longrightarrow~{}\overline{Q_{i}^{\theta}}=(I-\gamma\overline{M^{\theta}_{i}})^{\!-\!1}\overline{r_{i}^{\theta}}.

The estimated averaged Q-function \widehat{Q_{i}^{\theta}} is then given as follows (by the Perron-Frobenius theorem, the absolute values of the eigenvalues of \widehat{M^{\theta}_{i}} are upper bounded by 1, which guarantees that the matrix I-\gamma\widehat{M^{\theta}_{i}} is invertible):

Qiθ^=(IγMiθ^)1riθ^, where Miθ^(s,ai)(s,ai):=πθi(ai|s)Piθ^(s|s,ai).\begin{split}&\widehat{Q_{i}^{\theta}}=(I-\gamma\widehat{M^{\theta}_{i}})^{\!-\!1}\widehat{r_{i}^{\theta}},\\ \textup{ where }&\widehat{M^{\theta}_{i}}_{(s,a_{i})\rightarrow(s^{\prime},a_{i}^{\prime})}:=\pi_{\theta_{i}}(a_{i}^{\prime}|s^{\prime})\widehat{P_{i}^{\theta}}(s^{\prime}|s,a_{i}).\end{split} (28)

Similarly, from (2), we have that dθd_{\theta} and dθ^\widehat{d_{\theta}} are given by (derivation see Appendix C):

dθ=(1γ)(IγP𝒮θ¯)1ρ,dθ^:=(1γ)(IγP𝒮θ^)1ρ.\textstyle{d_{\theta}}\!=\!(1\!-\!\gamma)\left(I\!-\!\gamma\overline{P_{\mathcal{S}}^{\theta}}^{\top}\right)^{\!-\!1}\rho,\quad\widehat{d_{\theta}}\!:=\!(1\!-\!\gamma)\left(I\!-\!\gamma\widehat{P_{\mathcal{S}}^{\theta}}^{\top}\!\!\right)^{\!-\!1}\rho. (29)

Then accordingly, the estimated gradient is computed as:

^θs,aiJi(θ(k))=11γdθ^(s)Qiθ^(s,ai).\widehat{\partial}_{\theta_{s,a_{i}}}J_{i}(\theta^{(k)})=\frac{1}{1-\gamma}\widehat{d_{\theta}}(s)\widehat{Q_{i}^{\theta}}(s,a_{i}). (30)
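Step 2 amounts to two linear solves and an elementwise product; a minimal sketch, assuming the shape conventions of the Step 1 sketch, is given below.

```python
import numpy as np

# Step 2 of Algorithm 1: plug the estimated model into (28)-(30).
# Shapes: P_hat (S, A_i, S'), r_hat (S, A_i), PS_hat (S, S'), theta_i (S, A_i), rho (S,).

def estimated_gradient(P_hat, r_hat, PS_hat, theta_i, rho, gamma):
    nS, nAi = r_hat.shape
    # (28): Q_hat = (I - gamma*M_hat)^{-1} r_hat, M_hat[(s,a),(s',a')] = theta_i[s',a'] P_hat[s,a,s'].
    M_hat = np.einsum('sat,tb->satb', P_hat, theta_i).reshape(nS * nAi, nS * nAi)
    Q_hat = np.linalg.solve(np.eye(nS * nAi) - gamma * M_hat, r_hat.reshape(-1)).reshape(nS, nAi)
    # (29): d_hat = (1 - gamma) * (I - gamma * PS_hat^T)^{-1} rho.
    d_hat = (1.0 - gamma) * np.linalg.solve(np.eye(nS) - gamma * PS_hat.T, rho)
    # (30): estimated partial derivative with respect to theta_{s, a_i}.
    return d_hat[:, None] * Q_hat / (1.0 - gamma)
```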

Step 3: Projected gradient ascent onto the set of \alpha-greedy policies: Let U_{n}=[\frac{1}{n},\dots,\frac{1}{n}]\in\Delta(n) denote the n-dimensional uniform distribution. Define \Delta^{\alpha}(n):=\{\theta~|~\exists\theta'\in\Delta(n)~\text{s.t.}~\theta=(1-\alpha)\theta'+\alpha U_{n}\}. We use \mathcal{X}_{i}^{\alpha}:=\Delta^{\alpha}(|\mathcal{A}_{i}|)^{|\mathcal{S}|} and \mathcal{X}^{\alpha}:=\mathcal{X}_{1}^{\alpha}\times\mathcal{X}_{2}^{\alpha}\times\cdots\times\mathcal{X}_{n}^{\alpha} to denote the sets of \alpha-greedy policies for \theta_{i} and \theta, respectively. After every gradient ascent step, the parameter \theta is further projected onto \mathcal{X}^{\alpha}, i.e.:

θi(k+1)=Proj𝒳iα(θi(k)+η^θiJi(θ(k))).\theta_{i}^{(k+1)}=Proj_{\mathcal{X}_{i}^{\alpha}}(\theta_{i}^{(k)}+\eta\widehat{\nabla}_{\theta_{i}}J_{i}(\theta^{(k)})). (31)

The reason for projecting onto \mathcal{X}^{\alpha} instead of \mathcal{X} is to ensure that every action is selected with positive probability, so that the averaged Q-function can be estimated relatively accurately. Intuitively, a larger \alpha introduces a larger additional error in the NE-gap, while a smaller \alpha requires more samples to estimate the gradient; the choice of \alpha trades off these two effects.
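Since \Delta^{\alpha}(n) is the probability simplex shrunk toward the uniform distribution U_{n}, one way to implement the projection in (31) (our own choice, not prescribed by the paper) is to rescale, project onto the plain simplex, and map back, as sketched below; the simplex-projection routine is repeated here so the snippet is self-contained.

```python
import numpy as np

# Step 3 of Algorithm 1: the projected update (31) onto the alpha-greedy set X_i^alpha.

def project_to_simplex(v):
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    k = np.nonzero(u - css / np.arange(1, len(v) + 1) > 0)[0][-1]
    return np.maximum(v - css[k] / (k + 1), 0.0)

def project_to_alpha_simplex(v, alpha):
    """Projection onto Delta^alpha(n) = {(1-alpha)*y + alpha*U : y in Delta(n)}."""
    U = np.full(len(v), 1.0 / len(v))
    return (1.0 - alpha) * project_to_simplex((v - alpha * U) / (1.0 - alpha)) + alpha * U

def sample_based_step(theta_i, grad_hat_i, eta, alpha):
    """One outer-loop update: ascend along the estimated gradient, then project each
    state's row of theta_i onto Delta^alpha(|A_i|)."""
    raw = theta_i + eta * grad_hat_i
    return np.array([project_to_alpha_simplex(row, alpha) for row in raw])
```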

Theorem 5.

(Sample complexity) Assume that the MPG satisfies Assumption 2. Let M:=maxθ,θ𝒳dθdθM:=\max_{\theta,\theta^{\prime}\in\mathcal{X}}\left\|\frac{d_{\theta}}{d_{\theta^{\prime}}}\right\|_{\infty}. In Algorithm 1, for η(1γ)34i|𝒜i|\eta\leq\frac{(1-\gamma)^{3}}{4\sum_{i}|\mathcal{A}_{i}|}, α=(1γ)ϵ6M\alpha=\frac{(1-\gamma)\epsilon}{6M}, and

TJ206976τnM4|𝒮|3maxi|𝒜i|3(1γ)8ϵ4σS2log(16τTG|𝒮|2i|𝒜i|δ)+τ,TG648M2(ΦmaxΦmin)|𝒮|ηϵ2,\begin{split}T_{\!J}\!&\geq\!\!\frac{206976\tau nM^{4}|\mathcal{S}|^{3}\!\max_{i}\!|\mathcal{A}_{i}|^{3}}{(1-\gamma)^{8}\epsilon^{4}\sigma_{S}^{2}}\!\log\!\left(\!\frac{16\tau T_{\!G}|\mathcal{S}|^{2}\!\sum_{i}\!|\mathcal{A}_{i}|\!}{\delta}\!\right)\!+\!\tau,\\ &\quad\qquad\qquad~{}T_{G}\!\geq\!\frac{648M^{2}(\!\Phi_{\max}\!-\!\Phi_{\min}\!)|\mathcal{S}|}{\eta\epsilon^{2}},\end{split}

with probability at least 1δ1\!-\!\delta, we have that:

1TGk=1TGNE-gap(θ(k))2ϵ2.\frac{1}{T_{G}}\sum\nolimits_{k=1}^{T_{G}}\textup{{NE-gap}}(\theta^{(k)})^{2}\leq\epsilon^{2}.

That is, with a proper choice of stepsize, e.g., η=(1γ)34i|𝒜i|\eta\!=\!\frac{(1-\gamma)^{3}}{4\sum_{i}\!|\mathcal{A}_{i}|}, the algorithm can find an ϵ\epsilon-NE with probability at least 1δ1-\delta with

TJTGO~(nϵ6poly(11γ,|𝒮|,maxi|𝒜i|))T_{J}T_{G}\sim\tilde{O}\left(\frac{n}{\epsilon^{6}}\textup{poly}\left(\frac{1}{1-\gamma},|\mathcal{S}|,\max_{i}|\mathcal{A}_{i}|\right)\right)\vspace{-10pt} (32)

samples, where O~\tilde{O} hides log factors.

We first compare our result with a closely related work on the sample complexity of learning MPGs [36]. Interestingly, both sample complexities are O(1/\epsilon^{6}). It is an interesting question whether such dependence is fundamental for learning with simultaneously-updating agents. However, [36] considers Monte-Carlo, model-free gradient estimation, while our algorithm takes a model-based approach, which suffers less from high variance, and our notion of an 'averaged' MDP can potentially be extended to other settings.

Proof Sketch: The proof of Theorem 5 consists of three major steps. The first step is to bound the estimation error of the parameters \overline{P_{i}^{\theta}},\overline{r_{i}^{\theta}},\overline{P_{\mathcal{S}}^{\theta}} of the 'averaged' MDP; this step leverages Assumption 1 and the Azuma-Hoeffding inequality to obtain high-probability bounds for the parameters. The second step translates the estimation error of the 'averaged' MDP into a gradient estimation error. The third step treats the gradient estimation step as an oracle that returns biased gradient information, where the bias is the estimation error; the final result is obtained by analyzing the performance of projected gradient ascent with biased gradients. The detailed proofs are provided in Appendix J.

Comparison with centralized learning: The best known sample complexity bound for single-agent/centralized MDPs is $\tilde{O}\left(\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^{3}\epsilon^{2}}\right)$ [66]. Compared with (32), the centralized bound scales better with respect to $\epsilon,|\mathcal{S}|,|\mathcal{A}_{i}|,\frac{1}{1-\gamma}$. However, as argued in the previous subsection, the total action space $|\mathcal{A}|=\prod_{i=1}^{n}|\mathcal{A}_{i}|$ in the centralized bound scales exponentially with the number of agents $n$, while our complexity bound scales only linearly. Here, we briefly state the fundamental difficulties of learning in the SG setting compared with centralized learning, which also explains why our bound scales worse with respect to the factors $\epsilon,|\mathcal{S}|,|\mathcal{A}_{i}|,\frac{1}{1-\gamma}$. 1) Firstly, the optimization landscape in the SG setting is more complicated. For centralized learning, the gradient domination property is stronger and accelerated gradient methods (e.g. via natural policy gradient or entropy regularization) can speed up the convergence of exact gradient methods from $O(\frac{1}{\epsilon^{2}})$ to $O(\frac{1}{\epsilon})$ [33], or even $O(\log(\frac{1}{\epsilon}))$ [56]. In contrast, in multi-agent settings these methods can no longer improve the dependence on $\epsilon$ because of the more complicated landscape, which makes the outer-loop complexity $T_{G}$ larger. 2) Secondly, the behavior of the other agents makes the environment non-stationary, i.e., the averaged Q-function $\overline{Q_{i}^{\theta}}$ as well as the averaged transition probability $\overline{P_{i}^{\theta}}$ depends on the other agents' policy $\theta_{-i}$. Thus, unlike centralized learning, where the state transition probability matrix can be estimated in an off-policy or even offline manner, i.e. using data samples from different policies, $\overline{P_{i}^{\theta}}$ can only be estimated in an online manner, using samples generated by exactly the same policy $\theta$, which increases the inner-loop complexity $T_{J}$. 3) Thirdly, the complicated interactions amongst agents require more care during the learning process. Algorithms designed for centralized learning that achieve near-optimal sample complexity are generally Q-learning-type algorithms. However, in SGs, it can be shown that having each agent maximize its own averaged Q-function may lead to non-convergent behavior. Thus, we need to consider algorithms that update in a less aggressive manner, e.g. soft Q-learning or policy gradient (which is considered in this paper).

5 Numerical simulations

             $a_{2}=1$    $a_{2}=2$
$a_{1}=1$    (-1,-1)      (-3,0)
$a_{1}=2$    (0,-3)       (-2,-2)

Table 2: Game 1: Reward

             $s_{2}=1$    $s_{2}=2$
$s_{1}=1$    2            0
$s_{1}=2$    0            1

Table 3: Game 2: Reward

This section studies three numerical examples to corroborate our theoretical results. The multi-stage prisoner's dilemma (Game 1) confirms the local stability results for general-sum SGs; the coordination game (Game 2) examines local stability as well as the convergence rate of exact gradient play for MPGs; the state-based coordination game (Game 3) tests the performance of the sample-based algorithm proposed in Section 4.3.

5.1 Game 1: multi-stage prisoner’s dilemma

The first example — the multi-stage prisoner's dilemma model [45] — studies exact gradient play for general SGs. It is a 2-agent SG with $\mathcal{S}=\mathcal{A}_{1}=\mathcal{A}_{2}=\{1,2\}$. The reward of each agent, $r_{i}(s,a_{1},a_{2}),~i\in\{1,2\}$, is independent of the state $s$ and is given by Table 2. The state transition probability is determined by the agents' previous actions:

P(s_{t+1}=1|(a_{1,t},a_{2,t})=(1,1))=1-\epsilon,\qquad P(s_{t+1}=1|(a_{1,t},a_{2,t})\neq(1,1))=\epsilon.

Here action $a_{i}=1$ means that agent $i$ chooses to cooperate and $a_{i}=2$ means betray. The state $s$ serves as a noisy indicator, with error rate $\epsilon$, of whether both agents cooperated ($s_{t}=1$) or not ($s_{t}=2$) in the previous stage $t-1$.
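For concreteness, the transition and reward model of Game 1 described above can be written down in a few lines; the sketch below (with an array layout chosen purely for illustration) encodes Table 2 and the transition rule, using action index 0 for 'cooperate' ($a_{i}=1$) and 1 for 'betray' ($a_{i}=2$).

```python
import numpy as np

# Multi-stage prisoner's dilemma (Game 1): S = A1 = A2 = {1, 2}, indexed 0/1 below.
eps = 0.1  # error rate of the state indicator

# Rewards r_i(s, a1, a2) are state-independent (Table 2); entry [a1, a2] = (r1, r2).
R = np.array([[(-1, -1), (-3, 0)],
              [(0, -3), (-2, -2)]], dtype=float)   # shape (2, 2, 2)

# Transition P(s'=1 | a1, a2): 1-eps if both cooperate, eps otherwise.
def next_state_prob(a1, a2):
    p_s1 = 1.0 - eps if (a1, a2) == (0, 0) else eps
    return np.array([p_s1, 1.0 - p_s1])            # distribution over s' in {1, 2}

# Example: after mutual cooperation the indicator reports s=1 with probability 1-eps.
print(next_state_prob(0, 0))   # [0.9, 0.1]
print(R[0, 1])                 # agent 1 cooperates, agent 2 betrays -> (-3, 0)
```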

Figure 1: (Game 1:) Convergence to the cooperative NE

$\epsilon$    $\Delta^{\theta^{*}}(s=1)$    convergence ratio
0.1           433.3                          (47.8 $\pm$ 5.1)%
0.05          979.3                          (66.3 $\pm$ 4.3)%
0.01          2498.6                         (77.4 $\pm$ 2.8)%

Table 4: (Game 1:) Relationship between $\Delta^{\theta^{*}}(s=1)$, the convergence ratio, and $\epsilon$. $\Delta^{\theta^{*}}(s=1)$ is calculated using (11). The convergence ratio is calculated as $\frac{\#\textup{Trials that converge to }\theta^{*}}{\#\textup{Total number of trials}}$.

Figure 2: (Game 2:) Starting from a close neighborhood of a fully mixed NE

Figure 3: (Game 2:) Total reward for multiple runs

Figure 4: (Game 2:) NE-gap for multiple runs

The single-stage game corresponds to the famous prisoner's dilemma, and it is well known that it has a unique NE $(a_{1},a_{2})=(2,2)$, where both agents decide to betray. The dilemma arises from the fact that there exists a joint non-NE strategy $(1,1)$ under which both players obtain a higher reward than under the NE. However, in the multi-stage case, the introduction of the additional state $s$ allows agents to make decisions based on whether they have cooperated before. It turns out that cooperation can be achieved provided that the discount factor $\gamma$ is close to $1$ and the indicator $s$ is accurate enough, i.e. $\epsilon$ is close to $0$. Apart from the always-betray strategy, where both agents betray regardless of $s$, there is another strict NE $\theta^{*}$ given by $\theta^{*}_{s=1,a_{i}=1}=1,~\theta^{*}_{s=2,a_{i}=1}=0$, where agents cooperate given that they have cooperated in the previous stage, and betray otherwise.

We simulate gradient play for this model and focus mainly on convergence to the cooperative equilibrium $\theta^{*}$. We fix $\gamma=0.95$. The initial policy is set as $\theta^{(0)}_{s=1,a_{i}=1}=1-0.4\delta_{i},~\theta^{(0)}_{s=2,a_{i}=1}=0$, where the $\delta_{i}$'s are uniformly sampled from $[0,1]$. This initialization means that, at the beginning, both agents are willing to cooperate to some extent given that they cooperated at the previous stage. Figure 1 shows a trial converging to the NE starting from a randomly initialized policy. We then study the size of the region of attraction of $\theta^{*}$ and how it varies with the indicator's error rate $\epsilon$, which is shown in Table 4. The size of the region of attraction of $\theta^{*}$ is reflected by the convergence ratio ($\frac{\#\textup{Trials that converge to }\theta^{*}}{\#\textup{Total number of trials}}$) over multiple trials with different initial points. Here we compute one ratio using 100 trials, and the mean and standard deviation (std) are obtained by computing the ratio 10 times using different trials. An empirical estimate of the volume of the region is the convergence ratio times the volume of the uniform sampling area; hence the larger the ratio, the larger the region of attraction. Intuitively speaking, the more accurately the state $s$ reflects the agents' cooperation in the previous stage, the less incentive the agents have to betray when observing $s=1$, i.e., the larger $\Delta^{\theta^{*}}(s=1)$ becomes, and thus the larger the convergence ratio. This intuition matches the simulation results as well as the theoretical guarantee on local convergence around a strict NE in Theorem 2.
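The convergence ratio reported in Table 4 can be estimated with a short Monte Carlo loop. The sketch below assumes a routine gradient_play (not shown) that runs exact gradient play from a given initialization and returns the final values of $\theta_{s,a_{i}=1}$; the policy layout and tolerance are our own illustrative choices.

```python
import numpy as np

def convergence_ratio(gradient_play, n_trials=100, tol=1e-2, seed=0):
    """Fraction of random initializations that converge to the cooperative NE theta*.
    `gradient_play` is assumed to run exact gradient play and return theta_{s, a_i=1}
    as a (2 agents) x (2 states) array."""
    rng = np.random.default_rng(seed)
    theta_star = np.array([[1.0, 0.0], [1.0, 0.0]])    # cooperate at s=1, betray at s=2
    hits = 0
    for _ in range(n_trials):
        delta = rng.uniform(0.0, 1.0, size=2)           # delta_i ~ Uniform[0, 1]
        theta0 = np.column_stack([1.0 - 0.4 * delta, np.zeros(2)])
        hits += int(np.max(np.abs(gradient_play(theta0) - theta_star)) < tol)
    return hits / n_trials

# Repeating this estimate 10 times with different seeds gives the mean and std in Table 4.
```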

5.2 Game 2: coordination game

Our second example is based on the coordination game [67]. It is an identical-reward game, which is a special class of Markov potential games. Consider a 2-agent identical-reward coordination game with state space $\mathcal{S}=\mathcal{S}_{1}\times\mathcal{S}_{2}$ and action space $\mathcal{A}=\mathcal{A}_{1}\times\mathcal{A}_{2}$, where $\mathcal{S}_{1}=\mathcal{S}_{2}=\mathcal{A}_{1}=\mathcal{A}_{2}=\{1,2\}$. The state transition probability is given by:

P(s_{i,t+1}=1|a_{i,t}=1)=1-\epsilon,\qquad P(s_{i,t+1}=1|a_{i,t}=2)=\epsilon,

where $i=1,2$. The reward table is given by Table 3. Here we can view the actions $\{1,2\}$ as two different social networks that the agents can choose. They are rewarded only if they are in the same network, and network 1 has a higher reward than network 2. The state $s_{i}$ stands for the network that agent $i$ actually ends up in after taking an action, and $\epsilon$ captures the randomness of reaching a network after taking an action.

There is at least one fully-mixed NE where both agents join network 1 with probability $\frac{1-3\epsilon}{3(1-2\epsilon)}$ regardless of the current occupancy of the networks, and there are 13 different strict NEs, which can be verified numerically. Figure 2 shows a gradient play trajectory whose initial point lies in a close neighborhood of the mixed NE. As the algorithm progresses, the trajectory in Figure 2 diverges from the mixed NE, indicating that the fully-mixed NE is indeed a saddle point. This corroborates our finding in Theorem 4. Figure 3 shows the evolution of the total reward $J(\theta^{(t)})$ under gradient play for different random initial points $\theta^{(0)}$. Different initial points converge to one of the 13 strict NEs, each with a different total reward (some strict NEs with relatively small regions of attraction are omitted in the figure). While the total rewards differ, Figure 4 shows that the NE-gap of each trajectory (corresponding to the same initial points as in Figure 3) converges to 0. This suggests that the algorithm is indeed able to converge to a NE. Notice that the NE-gaps do not decrease monotonically.

5.3 Game 3: state-based coordination game

Figure 5: (Game 3:) State-based coordination game; rewards are nonzero only if both players are located at the same shaded grid cell

Figure 6: Total reward $J(\theta^{(t)})$ keeps increasing

Figure 7: Marginal distributions $d_{\theta}^{x}(s_{1,x},s_{2,x})$ and $d_{\theta}^{y}(s_{1,y},s_{2,y})$

Figure 8: NE-gap converges to a value close to zero. Here the NE-gap is measured by $\max_{i}\max_{(s,a_{i})}\overline{A_{i}^{\theta}}(s,a_{i})$

Our third numerical example studies the empirical performance of the sample-based learning algorithm, Algorithm 1. Here we consider a generalization of the coordination game (Game 2) in which the two players now try to coordinate on a 2D grid. The two-player state-based coordination game on a $3\times 3$ grid is defined as follows: the state space is $\mathcal{S}=\mathcal{S}_{1}\times\mathcal{S}_{2}$, $\mathcal{S}_{1}=\mathcal{S}_{2}=\mathcal{S}_{x}\times\mathcal{S}_{y}=\{x_{a},x_{b},x_{c}\}\times\{y_{a},y_{b},y_{c}\}$, and the action space is $\mathcal{A}=\mathcal{A}_{1}\times\mathcal{A}_{2}$, $\mathcal{A}_{1}=\mathcal{A}_{2}=\{\textup{Stay},\textup{Left},\textup{Right},\textup{Up},\textup{Down}\}$, i.e., each agent can choose to stay at its current grid cell or move left/right/up/down to a neighboring cell. We assume that there is random noise in the transition: the agent might end up in a cell neighboring the target location with error probability $\epsilon$. The reward is given by:

r(s_{1},s_{2})=\mathbf{1}\{s_{1}=s_{2}=\{x_{a},y_{a}\}\textup{ or }\{x_{c},y_{c}\}\},

i.e., the two agents are rewarded only if they are at the upper-left or lower-right corner at the same time.

For the numerical simulation, we take $T_{G}=300,T_{J}=10000,\alpha=0.1,\eta=10,\epsilon=0.1$; the numerical results are displayed in Figures 6–8. Figure 6 shows that the total reward increases as the number of iterations increases, and Figure 8 shows that the NE-gap converges to a value close to zero. However, because we project the policy onto the $\alpha$-greedy set $\mathcal{X}^{\alpha}$, the NE-gap cannot converge to exactly zero. Figure 7 visualizes the discounted state visitation distribution. To make the visualization more intuitive, we look at the marginalized discounted state visitation distribution $d_{\theta}^{x}$ defined below:

d_{\theta}^{x}(s_{1,x},s_{2,x})=\sum_{s_{1,y},s_{2,y}}d_{\theta}(s_{1,x},s_{1,y},s_{2,x},s_{2,y}).

$d_{\theta}^{y}$ is defined similarly. From Figure 7 we can see that most of the probability mass concentrates on $\{(x_{a},x_{a}),(x_{c},x_{c})\}$ and $\{(y_{a},y_{a}),(y_{c},y_{c})\}$, indicating that the two agents are able to coordinate most of the time.
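The marginalization used for Figure 7 is a plain summation over the $y$-coordinates. A small NumPy sketch, assuming $d_{\theta}$ is stored as an array indexed by $(s_{1,x},s_{1,y},s_{2,x},s_{2,y})$ (an ordering we pick for illustration), is:

```python
import numpy as np

# d_theta as a joint distribution over (s1x, s1y, s2x, s2y) on the 3x3 grid.
d_theta = np.random.rand(3, 3, 3, 3)
d_theta /= d_theta.sum()             # normalize so it is a probability distribution

d_x = d_theta.sum(axis=(1, 3))       # marginal over (s_{1,x}, s_{2,x}): sums out both y coordinates
d_y = d_theta.sum(axis=(0, 2))       # marginal over (s_{1,y}, s_{2,y})

assert np.isclose(d_x.sum(), 1.0) and np.isclose(d_y.sum(), 1.0)
```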

6 Conclusion and Discussion

This paper studies the optimization landscape and the convergence of gradient play for SGs. For general SGs, we establish local convergence around strict NEs. For MPGs, we establish global convergence with respect to the NE-gap, as well as local stability results for strict NEs and fully-mixed NEs. A sample-based NE-learning algorithm with a sample complexity guarantee is also proposed in this setting. There are many interesting future directions. Firstly, the current MPG assumption is relatively strong compared with the notion of potential games in the one-shot setup, which might restrict its application to broader settings. More effort is needed to identify other special classes of SGs that facilitate efficient learning, and it would also be meaningful to investigate real-life applications, such as dynamic congestion and routing. Secondly, other sample-based learning methods, such as actor-critic, natural policy gradient, or Gauss–Newton methods, could also be considered and might improve the sample complexity.

References

  • [1] F. Daneshfar and H. Bevrani, “Load–frequency control: a ga-based multi-agent reinforcement learning,” IET generation, transmission & distribution, vol. 4, no. 1, pp. 13–26, 2010.
  • [2] S. Shalev-Shwartz, S. Shammah, and A. Shashua, “Safe, multi-agent, reinforcement learning for autonomous driving,” ArXiv, vol. abs/1610.03295, 2016.
  • [3] K. Zhang, Z. Yang, H. Liu, T. Zhang, and T. Basar, “Fully decentralized multi-agent reinforcement learning with networked agents,” in International Conference on Machine Learning.   PMLR, 2018, pp. 5872–5881.
  • [4] H.-T. Wai, Z. Yang, Z. Wang, and M. Hong, “Multi-agent reinforcement learning via double averaging primal-dual optimization,” ser. NIPS’18.   Red Hook, NY, USA: Curran Associates Inc., 2018, p. 9672–9683.
  • [5] T. Chen, K. Zhang, G. B. Giannakis, and T. Başar, “Communication-efficient policy gradient methods for distributed reinforcement learning,” IEEE Transactions on Control of Network Systems, vol. 9, no. 2, pp. 917–929, 2022.
  • [6] Y. Li, Y. Tang, R. Zhang, and N. Li, “Distributed reinforcement learning for decentralized linear quadratic control: A derivative-free policy optimization approach,” 2019.
  • [7] G. Qu, A. Wierman, and N. Li, “Scalable reinforcement learning of localized policies for multi-agent networked systems,” in Learning for Dynamics and Control.   PMLR, 2020, pp. 256–266.
  • [8] L. S. Shapley, “Stochastic games,” Proceedings of the national academy of sciences, vol. 39, no. 10, pp. 1095–1100, 1953.
  • [9] M. L. Littman, “Markov games as a framework for multi-agent reinforcement learning,” in Machine learning proceedings 1994.   Elsevier, 1994, pp. 157–163.
  • [10] M. Bowling and M. Veloso, “An analysis of stochastic game theory for multiagent reinforcement learning,” Carnegie-Mellon Univ Pittsburgh Pa School of Computer Science, Tech. Rep., 2000.
  • [11] Y. Shoham, R. Powers, and T. Grenager, “Multi-agent reinforcement learning: a critical survey,” Technical report, Stanford University, Tech. Rep., 2003.
  • [12] L. Buşoniu, R. Babuška, and B. De Schutter, “Multi-agent reinforcement learning: An overview,” Innovations in multi-agent systems and applications-1, pp. 183–221, 2010.
  • [13] M. Lanctot, V. Zambaldi, A. Gruslys, A. Lazaridou, K. Tuyls, J. Pérolat, D. Silver, and T. Graepel, “A unified game-theoretic approach to multiagent reinforcement learning,” in Proceedings of the 31st International Conference on Neural Information Processing Systems, 2017.
  • [14] K. Zhang, Z. Yang, and T. Başar, “Multi-agent reinforcement learning: A selective overview of theories and algorithms,” Handbook of reinforcement learning and control, pp. 321–384, 2021.
  • [15] J. Hu and M. P. Wellman, “Nash Q-learning for general-sum stochastic games,” Journal of machine learning research, vol. 4, no. Nov, pp. 1039–1069, 2003.
  • [16] G. Tesauro, “Extending Q-learning to general adaptive multi-agent systems,” Advances in neural information processing systems, vol. 16, pp. 871–878, 2003.
  • [17] M. Bowling and M. Veloso, “Rational and convergent learning in stochastic games,” in International joint conference on artificial intelligence, vol. 17, no. 1.   Citeseer, 2001, pp. 1021–1026.
  • [18] S. Abdallah and V. Lesser, “A multiagent reinforcement learning algorithm with non-linear dynamics,” Journal of Artificial Intelligence Research, vol. 33, pp. 521–549, 2008.
  • [19] C. Zhang and V. Lesser, “Multi-agent learning with policy prediction,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 24, no. 1, 2010.
  • [20] J. Foerster, R. Y. Chen, M. Al-Shedivat, S. Whiteson, P. Abbeel, and I. Mordatch, “Learning with opponent-learning awareness,” in Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems.   Richland, SC: International Foundation for Autonomous Agents and Multiagent Systems, 2018.
  • [21] L. Shapley, “Some topics in two-person games,” Advances in game theory, vol. 52, pp. 1–29, 1964.
  • [22] V. P. Crawford, “Learning behavior and mixed-strategy Nash equilibria,” Journal of Economic Behavior & Organization, vol. 6, no. 1, pp. 69–78, 1985.
  • [23] J. S. Jordan, “Three problems in learning mixed-strategy Nash equilibria,” Games and Economic Behavior, vol. 5, no. 3, pp. 368–386, 1993.
  • [24] V. Krishna and T. Sjöström, “On the convergence of fictitious play,” Mathematics of Operations Research, vol. 23, no. 2, pp. 479–511, 1998.
  • [25] J. S. Shamma and G. Arslan, “Dynamic fictitious play, dynamic gradient play, and distributed convergence to Nash equilibria,” IEEE Transactions on Automatic Control, vol. 50, no. 3, pp. 312–327, 2005.
  • [26] E. Kohlberg and J.-F. Mertens, “On the strategic stability of equilibria,” Econometrica: Journal of the Econometric Society, pp. 1003–1037, 1986.
  • [27] E. Van Damme, Stability and perfection of Nash equilibria.   Springer, 1991, vol. 339.
  • [28] D. Foster and H. P. Young, “Regret testing: Learning to play Nash equilibrium without knowing you have an opponent,” Theoretical Economics, vol. 1, no. 3, pp. 341–367, 2006.
  • [29] F. Germano and G. Lugosi, “Global Nash convergence of foster and young’s regret testing,” Games and Economic Behavior, vol. 60, no. 1, pp. 135–154, 2007.
  • [30] J. R. Marden, H. P. Young, G. Arslan, and J. S. Shamma, “Payoff-based dynamics for multiplayer weakly acyclic games,” SIAM Journal on Control and Optimization, vol. 48, no. 1, pp. 373–396, 2009.
  • [31] E. Mazumdar, L. J. Ratliff, and S. S. Sastry, “On gradient-based learning in continuous games,” SIAM Journal on Mathematics of Data Science, vol. 2, no. 1, pp. 103–131, 2020.
  • [32] K. Zhang, Z. Yang, and T. Basar, “Policy optimization provably converges to Nash equilibria in zero-sum linear quadratic games,” in Advances in Neural Information Processing Systems, vol. 32, 2019.
  • [33] A. Agarwal, S. M. Kakade, J. D. Lee, and G. Mahajan, “On the theory of policy gradient methods: Optimality, approximation, and distribution shift,” 2020.
  • [34] D. González-Sánchez and O. Hernández-Lerma, Discrete–time stochastic control and dynamic potential games: the Euler–Equation approach.   Springer Science & Business Media, 2013.
  • [35] S. V. Macua, J. Zazo, and S. Zazo, “Learning parametric closed-loop policies for Markov potential games,” in International Conference on Learning Representations, 2018.
  • [36] S. Leonardos, W. Overman, I. Panageas, and G. Piliouras, “Global convergence of multi-agent policy gradient in Markov potential games,” in International Conference on Learning Representations, 2022.
  • [37] M. Tan, “Multi-agent reinforcement learning: Independent vs. cooperative agents,” in Proceedings of the tenth international conference on machine learning, 1993, pp. 330–337.
  • [38] C. Claus and C. Boutilier, “The dynamics of reinforcement learning in cooperative multiagent systems,” AAAI/IAAI, vol. 1998, no. 746-752, p. 2, 1998.
  • [39] L. Panait and S. Luke, “Cooperative multi-agent learning: The state of the art,” Autonomous agents and multi-agent systems, vol. 11, no. 3, pp. 387–434, 2005.
  • [40] L. Matignon, G. J. Laurent, and N. Le Fort-Piat, “Independent reinforcement learners in cooperative markov games: a survey regarding coordination problems,” The Knowledge Engineering Review, vol. 27, no. 1, pp. 1–31, 2012.
  • [41] Z. Zhang, Y.-S. Ong, D. Wang, and B. Xue, “A collaborative multiagent reinforcement learning method based on policy gradient potential,” IEEE transactions on cybernetics, vol. 51, no. 2, pp. 1015–1027, 2019.
  • [42] A. Oroojlooy and D. Hajinezhad, “A review of cooperative multi-agent deep reinforcement learning,” Applied Intelligence, vol. 53, no. 11, pp. 13 677–13 722, 2023.
  • [43] Z. Song, S. Mei, and Y. Bai, “When can we learn general-sum Markov games with a large number of players sample-efficiently?” 2021.
  • [44] C. Jin, Q. Liu, Y. Wang, and T. Yu, “V-learning–a simple, efficient, decentralized algorithm for multiagent rl,” arXiv preprint arXiv:2110.14555, 2021.
  • [45] G. Arslan and S. Yüksel, “Decentralized Q-learning for stochastic teams and games,” IEEE Transactions on Automatic Control, vol. 62, no. 4, pp. 1545–1558, 2016.
  • [46] B. Yongacoglu, G. Arslan, and S. Yüksel, “Decentralized learning for optimality in stochastic dynamic teams and games with local control and global state information,” IEEE Transactions on Automatic Control, vol. 67, no. 10, pp. 5230–5245, 2022.
  • [47] C. Daskalakis, D. J. Foster, and N. Golowich, “Independent policy gradient methods for competitive reinforcement learning,” Advances in neural information processing systems, vol. 33, pp. 5527–5540, 2020.
  • [48] A. Ozdaglar, M. O. Sayin, and K. Zhang, “Independent learning in stochastic games,” arXiv preprint arXiv:2111.11743, 2021.
  • [49] R. Zhang, Z. Ren, and N. Li, “Gradient play in multi-agent Markov stochastic games: Stationary points and local geometry,” in International Symposium on Mathematical Theory of Networks and Systems, 2022.
  • [50] R. Zhang, J. Mei, B. Dai, D. Schuurmans, and N. Li, “On the global convergence rates of decentralized softmax gradient play in markov potential games,” Advances in Neural Information Processing Systems, vol. 35, pp. 1923–1935, 2022.
  • [51] R. Fox, S. M. Mcaleer, W. Overman, and I. Panageas, “Independent natural policy gradient always converges in Markov potential games,” in International Conference on Artificial Intelligence and Statistics.   PMLR, 2022, pp. 4414–4425.
  • [52] D. Ding, C.-Y. Wei, K. Zhang, and M. Jovanovic, “Independent policy gradient for large-scale markov potential games: Sharper rates, function approximation, and game-agnostic convergence,” in International Conference on Machine Learning.   PMLR, 2022, pp. 5166–5220.
  • [53] A. Agarwal, N. Jiang, S. M. Kakade, and W. Sun, “Reinforcement learning: Theory and algorithms,” CS Dept., UW Seattle, Seattle, WA, USA, Tech. Rep, vol. 32, 2019.
  • [54] D. Fudenberg and J. Tirole, Game theory.   MIT press, 1991.
  • [55] M. Maschler, S. Zamir, and E. Solan, Game theory.   Cambridge University Press, 2020.
  • [56] J. Mei, C. Xiao, C. Szepesvari, and D. Schuurmans, “On the global convergence rates of softmax policy gradient methods,” in Proceedings of the 37th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, vol. 119.   PMLR, 13–18 Jul 2020, pp. 6820–6829.
  • [57] S. M. Kakade and J. Langford, “Approximately optimal approximate reinforcement learning,” in Machine Learning, Proceedings of the Nineteenth International Conference (ICML 2002), University of New South Wales, Sydney, Australia, July 8-12, 2002, 2002, pp. 267–274.
  • [58] R. S. Sutton, D. A. McAllester, S. P. Singh, Y. Mansour et al., “Policy gradient methods for reinforcement learning with function approximation.” in NIPs, vol. 99.   Citeseer, 1999, pp. 1057–1063.
  • [59] D. Levhari and L. Mirman, “The great fish war: An example using a dynamic Cournot-Nash solution,” Bell Journal of Economics, vol. 11, no. 1, pp. 322–334, 1980.
  • [60] W. D. Dechert and S. O’Donnell, “The stochastic lake game: A numerical solution,” Journal of Economic Dynamics and Control, vol. 30, no. 9-10, pp. 1569–1587, 2006.
  • [61] D. H. Mguni, Y. Wu, Y. Du, Y. Yang, Z. Wang, M. Li, Y. Wen, J. Jennings, and J. Wang, “Learning in nonzero-sum stochastic games with potentials,” in International Conference on Machine Learning, 2021. [Online]. Available: https://api.semanticscholar.org/CorpusID:232257740
  • [62] A. Beck, First-Order Methods in Optimization.   Philadelphia, PA: Society for Industrial and Applied Mathematics, 2017. [Online]. Available: https://epubs.siam.org/doi/abs/10.1137/1.9781611974997
  • [63] S. Ghadimi and G. Lan, “Accelerated gradient methods for nonconvex nonlinear and stochastic programming,” Mathematical Programming, vol. 156, no. 1-2, pp. 59–99, 2016.
  • [64] R. Srikant and L. Ying, “Finite-time error bounds for linear stochastic approximation and TD learning,” in Conference on Learning Theory.   PMLR, 2019, pp. 2803–2830.
  • [65] G. Li, Y. Wei, Y. Chi, Y. Gu, and Y. Chen, “Sample complexity of asynchronous q-learning: Sharper analysis and variance reduction,” IEEE Transactions on Information Theory, vol. 68, no. 1, pp. 448–473, 2022.
  • [66] A. Sidford, M. Wang, X. Wu, L. Yang, and Y. Ye, “Near-optimal time and sample complexities for solving Markov decision processes with a generative model,” in Advances in Neural Information Processing Systems, vol. 31, 2018.
  • [67] R. Cooper, Coordination games.   Cambridge University Press, 1999.
  • [68] D. Monderer and L. S. Shapley, “Potential games,” Games and economic behavior, vol. 14, no. 1, pp. 124–143, 1996.
  • [69] N. Bertrand, N. Markey, S. Sadhukhan, and O. Sankur, “Dynamic network congestion games,” arXiv preprint arXiv:2009.13632, 2020.
  • [70] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction.   Cambridge, MA, USA: A Bradford Book, 2018.
  • [71] W. Hoeffding, “Probability inequalities for sums of bounded random variables,” in The collected works of Wassily Hoeffding.   Springer, 1994, pp. 409–426.
  • [72] K. Azuma, “Weighted sums of certain dependent random variables,” Tohoku Mathematical Journal, Second Series, vol. 19, no. 3, pp. 357–367, 1967.
  • [73] W. Wang and M. Á. Carreira-Perpiñán, “Projection onto the probability simplex: An efficient algorithm with a simple proof, and an application,” CoRR, vol. abs/1309.1541, 2013. [Online]. Available: http://arxiv.org/abs/1309.1541

Appendix A More about Markov potential games

This section is dedicated to a more thorough understanding of Markov potential games; it includes a necessary condition, a sufficient condition, and a few (counter)examples.

A.1. A necessary condition and counterexamples

Definition 5.

([68]) Define a path in the parameter space as $\tau=(\theta^{(0)},\theta^{(1)},\dots,\theta^{(N)})$, where $\theta^{(t)}$ and $\theta^{(t+1)}$ differ in only one component $i_{t}$, i.e. $\theta^{(t+1)}=(\theta^{(t+1)}_{i_{t}},\theta^{(t)}_{-i_{t}})$. A closed path is a path such that $\theta^{(0)}=\theta^{(N)}$. Define:

I(\tau):=\sum_{t=1}^{N}J_{i_{t}}(\theta^{(t)})-J_{i_{t}}(\theta^{(t-1)}).

The following lemma is a direct generalization of Theorem 2.8 in [68] to the MPG setting:

Lemma 6.

For Markov potential games, $I(\tau)=0$ for any finite closed path $\tau$.

Proof.

The proof follows directly from the definition of an MPG:

I(\tau)=\sum_{t=1}^{N}J_{i_{t}}(\theta^{(t)})-J_{i_{t}}(\theta^{(t-1)})
=\sum_{t=1}^{N}\Phi(\theta^{(t)})-\Phi(\theta^{(t-1)})
=\Phi(\theta^{(N)})-\Phi(\theta^{(0)})
=0. ∎

Although the proof of Lemma 6 is straightforward, the lemma serves as a useful tool for showing that a game is not a MPG. For example, applying it we can show that none of the following conditions is sufficient for a SG to be a MPG.

Proposition 2.

None of the following conditions on a SG implies that it is a MPG:

(1) There exists $\phi(s,a)$ such that, at each $s$, $r_{i}(s,a_{i}^{\prime},a_{-i})-r_{i}(s,a_{i},a_{-i})=\phi(s,a_{i}^{\prime},a_{-i})-\phi(s,a_{i},a_{-i})$;

(2) There exists $\phi(s,a)$ such that, for every $s,\hat{s}$, $r_{i}(s,a_{i}^{\prime},a_{-i})-r_{i}(\hat{s},a_{i},a_{-i})=\phi(s,a_{i}^{\prime},a_{-i})-\phi(\hat{s},a_{i},a_{-i})$;

(3) The rewards $r_{i}$ are independent of $s$ and admit a potential function, i.e., $r_{i}(a_{i},a_{-i})-r_{i}(a_{i}^{\prime},a_{-i})=\phi(a_{i},a_{-i})-\phi(a_{i}^{\prime},a_{-i})$.

Proof.

(of Proposition 2) A simple counterexample showing that the conditions in Proposition 2 are not sufficient is the multi-stage prisoner's dilemma (Game 1) introduced in the numerical section (Section 5). Since the reward table of the multi-stage prisoner's dilemma is the same as that of the one-shot prisoner's dilemma (which is known to be a potential game), Game 1 satisfies condition (3) in Proposition 2, which implies condition (2), which in turn implies condition (1). In the following we use Lemma 6 to show that Game 1 is nevertheless not a MPG. We define the following individual policies:

\theta_{i}^{\text{Defect}}:\quad\theta^{\text{Defect}}_{s=1,a_{i}=1}=0,~~\theta^{\text{Defect}}_{s=2,a_{i}=1}=0
\theta_{i}^{\text{Coop}}:\quad\theta^{\text{Coop}}_{s=1,a_{i}=1}=1,~~\theta^{\text{Coop}}_{s=2,a_{i}=1}=0
\theta_{i}^{\text{Always\_coop}}:\quad\theta^{\text{Always\_coop}}_{s=1,a_{i}=1}=1,~~\theta^{\text{Always\_coop}}_{s=2,a_{i}=1}=1

Let:

\theta^{(0)}=(\theta_{1}^{\text{Defect}},\theta_{2}^{\text{Coop}}),\quad\theta^{(1)}=(\theta_{1}^{\text{Coop}},\theta_{2}^{\text{Coop}}),\quad\theta^{(2)}=(\theta_{1}^{\text{Coop}},\theta_{2}^{\text{Always\_coop}}),\quad\theta^{(3)}=(\theta_{1}^{\text{Defect}},\theta_{2}^{\text{Always\_coop}}),

and define a closed path $\tau$ by:

\tau=(\theta^{(0)},\theta^{(1)},\theta^{(2)},\theta^{(3)},\theta^{(4)}),\quad\theta^{(4)}=\theta^{(0)}.

For ease of calculation, we set $\epsilon=0$ and the initial state $s_{0}=1$ in Game 1. A direct calculation of the total rewards along this path shows that the increments $J_{i_{t}}(\theta^{(t)})-J_{i_{t}}(\theta^{(t-1)})$ sum to a strictly positive value, i.e., $I(\tau)>0$. This indicates that although Game 1 satisfies condition (3) (as well as conditions (1) and (2)), it is still not a MPG. ∎

A.2. A sufficient condition

Proposition 2 suggests that the MPG assumption is quite restrictive: even if the reward table of a SG is, at every state, that of a one-shot potential game, the SG may still not be a MPG. Nevertheless, the following condition is sufficient for a stochastic game to be a MPG:

Lemma 7.

A stochastic game is a MPG if condition (1) in Proposition 2 is satisfied and the transition probability is action-independent, i.e., $P(s^{\prime}|s,a)=P(s^{\prime}|s)$.

Proof.

$P(s^{\prime}|s,a)=P(s^{\prime}|s)$ implies that the discounted state visitation distribution $d_{\theta}$ does not depend on $\theta$, so we denote it by $d(s)$ instead. Condition (1) implies that $\phi(s,a_{i},a_{-i})-r_{i}(s,a_{i},a_{-i})$ depends only on $s$ and $a_{-i}$ but not on $a_{i}$, so we denote the difference by $\delta_{i}(s,a_{-i})$, i.e.,

\phi(s,a_{i},a_{-i})-r_{i}(s,a_{i},a_{-i})=\delta_{i}(s,a_{-i}).

The total reward of agent $i$ can be written as:

J_{i}(\theta)=\sum_{s}d(s)\sum_{a}\pi_{\theta}(a|s)r_{i}(s,a)=\sum_{s}d(s)\sum_{a_{i}}\pi_{\theta_{i}}(a_{i}|s)\sum_{a_{-i}}\pi_{\theta_{-i}}(a_{-i}|s)r_{i}(s,a_{i},a_{-i}).

Similarly, the total potential function can be written as:

\Phi(\theta)=\sum_{s}d(s)\sum_{a}\pi_{\theta}(a|s)\phi(s,a)=\sum_{s}d(s)\sum_{a_{i}}\pi_{\theta_{i}}(a_{i}|s)\sum_{a_{-i}}\pi_{\theta_{-i}}(a_{-i}|s)\phi(s,a_{i},a_{-i}).

Thus,

\Phi(\theta)-J_{i}(\theta)=\sum_{s}d(s)\sum_{a_{i}}\pi_{\theta_{i}}(a_{i}|s)\sum_{a_{-i}}\pi_{\theta_{-i}}(a_{-i}|s)\left(\phi(s,a_{i},a_{-i})-r_{i}(s,a_{i},a_{-i})\right)
=\sum_{s}d(s)\sum_{a_{i}}\pi_{\theta_{i}}(a_{i}|s)\sum_{a_{-i}}\pi_{\theta_{-i}}(a_{-i}|s)\delta_{i}(s,a_{-i})
=\sum_{s}d(s)\sum_{a_{-i}}\pi_{\theta_{-i}}(a_{-i}|s)\delta_{i}(s,a_{-i}),

which does not depend on the parameter $\theta_{i}$, i.e.,

\Phi(\theta_{i}^{\prime},\theta_{-i})-J_{i}(\theta_{i}^{\prime},\theta_{-i})=\Phi(\theta_{i},\theta_{-i})-J_{i}(\theta_{i},\theta_{-i}),\quad\forall(\theta_{i}^{\prime},\theta_{-i}),(\theta_{i},\theta_{-i})\in\mathcal{X},

which completes the proof. ∎

A.3. MPG with local states and an application example

From Proposition 2, we see that it is difficult for a SG to be a MPG even if the game is a potential game at each state, and Lemma 7 covers only the very special case where the actions do not affect the state, so that the MPG is merely a collection of one-shot potential games. To provide a MPG beyond the identical-interest case and the case in Lemma 7, inspired by the settings in [7] and [35], we consider a special multi-agent setting where $\mathcal{S}=\mathcal{S}_{1}\times\dots\times\mathcal{S}_{n}$ and $\mathcal{S}_{i}$ is the local state space of agent $i$. In addition, the transition probability takes the decomposed form $P(s^{\prime}|s,a)=\prod_{i=1}^{n}P(s_{i}^{\prime}|s_{i},a_{i})$. The rest of the SG setting is the same as in Section 2. Deviating slightly from the main text, we consider localized policies where each agent takes actions based on its own local state,

\pi_{\theta}(a_{t}|s_{t})=\prod_{i=1}^{n}\pi_{\theta_{i}}(a_{i,t}|s_{i,t}),

with the localized direct parameterization:

\pi_{\theta_{i}}(a_{i,t}|s_{i,t})=\theta_{(s_{i},a_{i})},\quad\theta_{i}\in\Delta(\mathcal{A}_{i})^{|\mathcal{S}_{i}|}.

We use $\mathcal{X}_{i}^{\textup{local}}:=\Delta(\mathcal{A}_{i})^{|\mathcal{S}_{i}|}$ to denote the feasible region of $\theta_{i}$, and the feasible region of $\theta$ is denoted by $\mathcal{X}^{\textup{local}}:=\mathcal{X}_{1}^{\textup{local}}\times\dots\times\mathcal{X}_{n}^{\textup{local}}$.

Lemma 8.

If there is a function $\phi(s,a)$ such that for every agent $i$, $r_{i}(s_{i},s_{-i},a_{i},a_{-i})=\phi(s_{i},s_{-i},a_{i},a_{-i})+\psi_{i}(s_{-i},a_{-i})$, where $\psi_{i}$ only depends on $s_{-i},a_{-i}$, then this SG is a MPG, i.e., for any parameters $(\theta_{i}^{\prime},\theta_{-i}),(\theta_{i},\theta_{-i})\in\mathcal{X}^{\textup{local}}$, the equation in Definition 3 is satisfied.

The proof is straightforward given the local structure of the MDP and the localized policies. This class of MPGs covers nontrivial multi-agent applications such as medium access control [35] and dynamic congestion control [69]. Below we present medium access control as one such example.

Real application - medium access control. We consider the discretized version of the dynamic medium access control game introduced in [35], where each agent is a user that tries to transmit data over a single transmission medium by injecting power into the wireless network. Each user's goal is to maximize its data rate and battery lifespan. If multiple users transmit at the same time, they interfere with each other and decrease their data rates. Here user $i$'s state is $s_{i}\in\mathcal{S}_{i}=\{0,1,\dots,B_{i,\max}\}$, which denotes its own battery level, where $B_{i,\max}$ is its initial battery level, and we use $\delta_{i}$ to denote its discharging factor. Its action $a_{i}\in\mathcal{A}_{i}=\{0,1,\dots,P_{i,\max}\}$ denotes the power injected into the network at each time step, where $P_{i,\max}$ is the maximum allowed power. The state transition is deterministic and describes the discharging of the battery proportionally to the transmission power:

s_{i,t+1}=s_{i,t}-\delta_{i}a_{i,t}.

The stage reward of user $i$ is given by:

r_{i}(s,a)=\log\left(1+\frac{|h_{i}|^{2}a_{i}}{1+\sum_{j\neq i}|h_{j}|^{2}a_{j}}\right)+\alpha s_{i},

where $h_{i}$ is the random fading channel coefficient of user $i$.

By noticing that $r_{i}(s,a)=\log\left(1+\sum_{j=1}^{n}|h_{j}|^{2}a_{j}\right)+\alpha\sum_{j}s_{j}-\log\left(1+\sum_{j\neq i}|h_{j}|^{2}a_{j}\right)-\alpha\sum_{j\neq i}s_{j}$, we can apply Lemma 8 to verify that the medium access control problem is indeed a MPG, with potential function $\phi$ given by:

\phi(s,a)=\log\left(1+\sum_{i=1}^{n}|h_{i}|^{2}a_{i}\right)+\alpha\sum_{i=1}^{n}s_{i}.
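The decomposition required by Lemma 8 can also be checked numerically: $r_{i}(s,a)-\phi(s,a)$ should not change when user $i$ varies its own $(s_{i},a_{i})$. A small sketch, with arbitrary channel gains and an illustrative value of $\alpha$, is:

```python
import numpy as np

rng = np.random.default_rng(0)
n, alpha = 3, 0.5
h2 = rng.uniform(0.5, 2.0, size=n)        # |h_i|^2, arbitrary channel gains

def r(i, s, a):
    interference = 1.0 + np.sum(h2 * a) - h2[i] * a[i]
    return np.log(1.0 + h2[i] * a[i] / interference) + alpha * s[i]

def phi(s, a):
    return np.log(1.0 + np.sum(h2 * a)) + alpha * np.sum(s)

s = np.array([5.0, 3.0, 4.0])             # battery levels
a = np.array([2.0, 1.0, 0.0])             # transmit powers
i = 0
base = r(i, s, a) - phi(s, a)             # should equal psi_i(s_{-i}, a_{-i})
for si_new, ai_new in [(5.0, 0.0), (2.0, 3.0), (0.0, 1.0)]:
    s2, a2 = s.copy(), a.copy()
    s2[i], a2[i] = si_new, ai_new
    assert np.isclose(r(i, s2, a2) - phi(s2, a2), base)   # difference independent of (s_i, a_i)
```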

Appendix B Proof of Lemma 2

Proof.

According to the performance difference lemma (cf. [57]), letting $\theta^{\prime}:=(\theta_{i}^{\prime},\theta_{-i})$,

J_{i}(\theta_{i}^{\prime},\theta_{-i})-J_{i}(\theta_{i},\theta_{-i})=\frac{1}{1-\gamma}\sum_{s,a}d_{\theta^{\prime}}(s)\pi_{\theta^{\prime}}(a|s)A_{i}^{\theta}(s,a)
=\frac{1}{1-\gamma}\sum_{s,a_{i}}d_{\theta^{\prime}}(s)\pi_{\theta^{\prime}_{i}}(a_{i}|s)\sum_{a_{-i}}\pi_{\theta_{-i}}(a_{-i}|s)A_{i}^{\theta}(s,a_{i},a_{-i})
=\frac{1}{1-\gamma}\sum_{s,a_{i}}d_{\theta^{\prime}}(s)\pi_{\theta^{\prime}_{i}}(a_{i}|s)\overline{A_{i}^{\theta}}(s,a_{i}). ∎

Appendix C Derivation of (29) (calculation of $d_{\theta}$)

From the definition of $d_{\theta}$:

d_{\theta}(s)=\mathbb{E}_{s_{0}\sim\rho}(1-\gamma)\sum_{t=0}^{\infty}\gamma^{t}\textup{Pr}^{\theta}(s_{t}=s|s_{0}),

we have that:

d_{\theta}=(1-\gamma)\left(\rho+\gamma\overline{P_{S}^{\theta}}^{\top}\rho+\gamma^{2}(\overline{P_{S}^{\theta}}^{2})^{\top}\rho+\cdots\right)
=(1-\gamma)\left(I+\gamma\overline{P_{S}^{\theta}}+\gamma^{2}\overline{P_{S}^{\theta}}^{2}+\cdots\right)^{\top}\rho
=(1-\gamma)(I-\gamma\overline{P_{S}^{\theta}})^{-\top}\rho.
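In code, this amounts to a single linear solve. A small NumPy sketch, using an arbitrary 'averaged' transition matrix $\overline{P_{S}^{\theta}}$ (rows indexed by the current state) purely for illustration, is:

```python
import numpy as np

gamma = 0.9
rho = np.array([0.5, 0.3, 0.2])                  # initial state distribution
P_bar = np.array([[0.7, 0.2, 0.1],               # averaged transitions, P_bar[s, s'] = P_S^theta(s' | s)
                  [0.1, 0.8, 0.1],
                  [0.2, 0.3, 0.5]])

# d_theta = (1 - gamma) (I - gamma * P_bar)^{-T} rho, i.e. solve (I - gamma P_bar)^T d = (1 - gamma) rho.
d_theta = np.linalg.solve((np.eye(3) - gamma * P_bar).T, (1.0 - gamma) * rho)

assert np.isclose(d_theta.sum(), 1.0)            # d_theta is a probability distribution
```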

Appendix D Proof of Lemma 3

Proof.

According to the policy gradient theorem (6):

\frac{\partial J_{i}(\theta)}{\partial\theta_{s,a_{i}}}=\frac{1}{1-\gamma}\sum_{s^{\prime}}\sum_{a^{\prime}}d_{\theta}(s^{\prime})\pi_{\theta}(a^{\prime}|s^{\prime})\frac{\partial\log\pi_{\theta}(a^{\prime}|s^{\prime})}{\partial\theta_{s,a_{i}}}Q_{i}^{\theta}(s^{\prime},a^{\prime}).

Since, for the direct parameterization:

\frac{\partial\log\pi_{\theta}(a^{\prime}|s^{\prime})}{\partial\theta_{s,a_{i}}}=\frac{\partial\log\pi_{\theta_{i}}(a_{i}^{\prime}|s^{\prime})}{\partial\theta_{s,a_{i}}}=\mathbf{1}\{a_{i}^{\prime}=a_{i},s^{\prime}=s\}\frac{1}{\theta_{s,a_{i}}}=\mathbf{1}\{a_{i}^{\prime}=a_{i},s^{\prime}=s\}\frac{1}{\pi_{\theta_{i}}(a_{i}|s)},

we have that:

\frac{\partial J_{i}(\theta)}{\partial\theta_{s,a_{i}}}=\frac{1}{1-\gamma}\sum_{s^{\prime}}\sum_{a^{\prime}}d_{\theta}(s^{\prime})\pi_{\theta}(a^{\prime}|s^{\prime})\mathbf{1}\{a_{i}^{\prime}=a_{i},s^{\prime}=s\}\frac{1}{\pi_{\theta_{i}}(a_{i}|s)}Q_{i}^{\theta}(s^{\prime},a^{\prime})
=\frac{1}{1-\gamma}\sum_{a_{-i}^{\prime}}d_{\theta}(s)\pi_{\theta_{i}}(a_{i}|s)\pi_{\theta_{-i}}(a_{-i}^{\prime}|s)\frac{1}{\pi_{\theta_{i}}(a_{i}|s)}Q_{i}^{\theta}(s,a_{i},a_{-i}^{\prime})
=\frac{1}{1-\gamma}d_{\theta}(s)\sum_{a_{-i}^{\prime}}\pi_{\theta_{-i}}(a_{-i}^{\prime}|s)Q_{i}^{\theta}(s,a_{i},a_{-i}^{\prime})
=\frac{1}{1-\gamma}d_{\theta}(s)\overline{Q_{i}^{\theta}}(s,a_{i}). ∎

Appendix E Proof of Theorem 2 and Lemma 5

Proof.

(Lemma 5) For a given strict NE $\theta^{*}$, pick for each state $s$

a_{i}^{*}(s)\in\operatorname*{arg\,max}_{a_{i}}\overline{A_{i}^{\theta^{*}}}(s,a_{i}),

and set $\theta_{i}$ to be:

\theta_{s,a_{i}}=\mathbf{1}\{a_{i}=a_{i}^{*}(s)\}.

Let $\theta:=(\theta_{i},\theta_{-i}^{*})$. From the performance difference lemma (Lemma 2):

J_{i}(\theta_{i},\theta_{-i}^{*})-J_{i}(\theta_{i}^{*},\theta_{-i}^{*})=\frac{1}{1-\gamma}\sum_{s,a_{i}}d_{\theta}(s)\pi_{\theta_{i}}(a_{i}|s)\overline{A_{i}^{\theta^{*}}}(s,a_{i})=\frac{1}{1-\gamma}\sum_{s}d_{\theta}(s)\max_{a_{i}}\overline{A_{i}^{\theta^{*}}}(s,a_{i})\geq 0.

Because $\theta^{*}$ is a strict NE, the inequality above forces $\theta_{i}=\theta_{i}^{*}$ and $\max_{a_{i}}\overline{A_{i}^{\theta^{*}}}(s,a_{i})=0$ for every $s$. Strictness of the NE also implies the uniqueness of $a_{i}^{*}(s)$, and thus,

\overline{A_{i}^{\theta^{*}}}(s,a_{i})<0,~~\forall~a_{i}\neq a_{i}^{*}(s),

which completes the proof of the lemma. ∎

The proof of Theorem 2 relies on the following auxiliary lemma, whose proof we defer to Appendix L.

Lemma (Auxiliary).

Let $\mathcal{X}$ denote the probability simplex of dimension $n$. Suppose $\theta\in\mathcal{X}$, $g\in\mathbb{R}^{n}$, and that there exist $i^{*}\in\{1,2,\dots,n\}$ and $\Delta>0$ such that:

\theta_{i^{*}}\geq\theta_{i},\quad\forall i\neq i^{*},\qquad g_{i^{*}}\geq g_{i}+\Delta,\quad\forall i\neq i^{*}.

Let

\theta^{\prime}=Proj_{\mathcal{X}}(\theta+g),

then:

\theta^{\prime}_{i^{*}}\geq\min\left\{1,\theta_{i^{*}}+\frac{\Delta}{2}\right\}.
Proof.

(Theorem 2) For a fixed agent $i$ and state $s$, the gradient play update rule (5) for the policy $\theta_{i,s}$ is given by:

\theta_{i,s}^{(t+1)}=Proj_{\Delta(|\mathcal{A}_{i}|)}\left(\theta_{i,s}^{(t)}+\frac{\eta}{1-\gamma}d_{\theta^{(t)}}(s)\overline{Q_{i}^{\theta^{(t)}}}(s,\cdot)\right), (33)

where $\Delta(|\mathcal{A}_{i}|)$ denotes the probability simplex of dimension $|\mathcal{A}_{i}|$ and $\overline{Q_{i}^{\theta^{(t)}}}(s,\cdot)$ is the $|\mathcal{A}_{i}|$-dimensional vector whose $a_{i}$-th element equals $\overline{Q_{i}^{\theta^{(t)}}}(s,a_{i})$. We will show that this update rule satisfies the conditions of the auxiliary lemma above, which will then allow us to prove that

D(\theta^{(t+1)}||\theta^{*})\leq\max\left\{0,D(\theta^{(t)}||\theta^{*})-\frac{\eta\Delta^{\theta^{*}}}{2}\right\}.

Letting $a_{i}^{*}(s)$ be defined as in Lemma 5, we have that:

\frac{1}{1-\gamma}d_{\theta^{(t)}}(s)\overline{Q_{i}^{\theta^{(t)}}}(s,a_{i}^{*}(s))-\frac{1}{1-\gamma}d_{\theta^{(t)}}(s)\overline{Q_{i}^{\theta^{(t)}}}(s,a_{i})
\geq\frac{1}{1-\gamma}d_{\theta^{*}}(s)\overline{Q_{i}^{\theta^{*}}}(s,a_{i}^{*}(s))-\frac{1}{1-\gamma}d_{\theta^{*}}(s)\overline{Q_{i}^{\theta^{*}}}(s,a_{i})
\quad-\left|\frac{1}{1-\gamma}d_{\theta^{*}}(s)\overline{Q_{i}^{\theta^{*}}}(s,a_{i}^{*}(s))-\frac{1}{1-\gamma}d_{\theta^{(t)}}(s)\overline{Q_{i}^{\theta^{(t)}}}(s,a_{i}^{*}(s))\right|
\quad-\left|\frac{1}{1-\gamma}d_{\theta^{*}}(s)\overline{Q_{i}^{\theta^{*}}}(s,a_{i})-\frac{1}{1-\gamma}d_{\theta^{(t)}}(s)\overline{Q_{i}^{\theta^{(t)}}}(s,a_{i})\right|
\geq\frac{1}{1-\gamma}d_{\theta^{*}}(s)\left(\overline{A_{i}^{\theta^{*}}}(s,a_{i}^{*}(s))-\overline{A_{i}^{\theta^{*}}}(s,a_{i})\right)-2\|\nabla_{\theta_{i}}J_{i}(\theta^{(t)})-\nabla_{\theta_{i}}J_{i}(\theta^{*})\| (34)
\geq\Delta^{\theta^{*}}-\frac{4}{(1-\gamma)^{3}}\left(\sum_{i=1}^{n}|\mathcal{A}_{i}|\right)\|\theta^{(t)}-\theta^{*}\| (35)
\geq\Delta^{\theta^{*}}-\frac{4}{(1-\gamma)^{3}}\left(\sum_{i=1}^{n}|\mathcal{A}_{i}|\right)\sum_{i=1}^{n}\sum_{s}\|\theta^{(t)}_{i,s}-\theta^{*}_{i,s}\|_{1}
\geq\Delta^{\theta^{*}}-\frac{4}{(1-\gamma)^{3}}n|\mathcal{S}|\left(\sum_{i=1}^{n}|\mathcal{A}_{i}|\right)D(\theta^{(t)}||\theta^{*}),

where the step from (34) to (35) uses the smoothness property in Lemma 17.

We proceed by induction: suppose that for all $\ell\leq t-1$ we have

D(\theta^{(\ell+1)}||\theta^{*})\leq\max\left\{D(\theta^{(\ell)}||\theta^{*})-\frac{\eta\Delta^{\theta^{*}}}{2},0\right\};

thus

D(\theta^{(t)}||\theta^{*})\leq D(\theta^{(0)}||\theta^{*})\leq\frac{\Delta^{\theta^{*}}(1-\gamma)^{3}}{8n|\mathcal{S}|\left(\sum_{i=1}^{n}|\mathcal{A}_{i}|\right)}.

Then we can further conclude that:

\frac{1}{1-\gamma}d_{\theta^{(t)}}(s)\overline{Q_{i}^{\theta^{(t)}}}(s,a_{i}^{*}(s))-\frac{1}{1-\gamma}d_{\theta^{(t)}}(s)\overline{Q_{i}^{\theta^{(t)}}}(s,a_{i})
\geq\Delta^{\theta^{*}}-\frac{4}{(1-\gamma)^{3}}n|\mathcal{S}|\left(\sum_{i=1}^{n}|\mathcal{A}_{i}|\right)D(\theta^{(t)}||\theta^{*})
\geq\frac{\Delta^{\theta^{*}}}{2},\quad\forall~a_{i}\neq a_{i}^{*}(s).

Additionally, for $D(\theta^{(t)}||\theta^{*})\leq\frac{\Delta^{\theta^{*}}(1-\gamma)^{3}}{8n|\mathcal{S}|\left(\sum_{i=1}^{n}|\mathcal{A}_{i}|\right)}$, we may conclude that:

\theta^{(t)}_{s,a_{i}^{*}(s)}\geq 1/2\geq\theta^{(t)}_{s,a_{i}}\quad\forall a_{i}\neq a_{i}^{*}(s).

Then, applying the auxiliary lemma to (33), we have:

\theta^{(t+1)}_{s,a_{i}^{*}(s)}\geq\min\left\{1,\theta^{(t)}_{s,a_{i}^{*}(s)}+\frac{\eta\Delta^{\theta^{*}}}{4}\right\}
\Longrightarrow\quad\|\theta^{(t+1)}_{i,s}-\theta^{*}_{i,s}\|_{1}=2\left(1-\theta^{(t+1)}_{s,a_{i}^{*}(s)}\right)\leq\max\left\{0,\|\theta^{(t)}_{i,s}-\theta^{*}_{i,s}\|_{1}-\frac{\eta\Delta^{\theta^{*}}}{2}\right\},\quad\forall~s\in\mathcal{S},~i=1,2,\dots,n
\Longrightarrow\quad D(\theta^{(t+1)}||\theta^{*})\leq\max\left\{0,D(\theta^{(t)}||\theta^{*})-\frac{\eta\Delta^{\theta^{*}}}{2}\right\},

which completes the proof. ∎

Appendix F Proof of Proposition 1

Proof.

First of all, by the definitions of MPG and NE, a global maximizer of the total potential function is a NE. We now show that such a maximizer can be chosen to be a deterministic policy. From classical results (e.g. [70]), we know that there is an optimal deterministic centralized policy

\pi^{*}(a=(a_{1},\dots,a_{n})|s)=\mathbf{1}\{a=a^{*}(s)=(a_{1}^{*}(s),\dots,a_{n}^{*}(s))\}

that maximizes:

\pi^{*}=\operatorname*{arg\,max}_{\pi:\mathcal{S}\rightarrow\Delta(\mathcal{A})}\mathbb{E}\left[\sum_{t=0}^{\infty}\gamma^{t}\phi(s_{t},a_{t})\,\big{|}\,\pi,s_{0}=s\right].

We now show that this centralized policy can also be represented under the direct distributed policy parameterization. Setting $\theta^{*}$ as:

\pi_{\theta_{i}^{*}}(a_{i}|s)=\mathbf{1}\{a_{i}=a_{i}^{*}(s)\},

we have:

\pi^{*}(a|s)=\prod_{i=1}^{n}\pi_{\theta_{i}^{*}}(a_{i}|s).

Since $\pi^{*}$ globally maximizes the discounted sum of the potential function $\phi$ among all centralized policies, which include all directly, distributedly parameterized policies, $\theta^{*}$ also globally maximizes the total potential function $\Phi$ among all direct distributed parameterizations, which completes the proof. ∎

Appendix G Proof of Theorem 3

G.1 Useful Optimization Lemmas

Lemma 9.

Let $\Phi(\theta)$ be $\beta$-smooth in $\theta$ and define the gradient mapping:

G^{\eta}(\theta):=\frac{1}{\eta}\left(Proj_{\mathcal{X}}(\theta+\eta\nabla\Phi(\theta))-\theta\right).

The update rule for projected gradient ascent is:

\theta^{+}=\theta+\eta G^{\eta}(\theta)=Proj_{\mathcal{X}}(\theta+\eta\nabla\Phi(\theta)).

Then:

(\theta^{\prime}-\theta^{+})^{\top}\nabla\Phi(\theta^{+})\leq(1+\eta\beta)\|G^{\eta}(\theta)\|\|\theta^{\prime}-\theta^{+}\|\quad\forall\theta^{\prime}\in\mathcal{X}.
Proof.

By a standard property of Euclidean projections onto a convex set, we get that

(\theta+\eta\nabla\Phi(\theta)-\theta^{+})^{\top}(\theta^{\prime}-\theta^{+})\leq 0
\Longrightarrow~~\eta\nabla\Phi(\theta)^{\top}(\theta^{\prime}-\theta^{+})+(\theta-\theta^{+})^{\top}(\theta^{\prime}-\theta^{+})\leq 0
\Longrightarrow~~\eta\nabla\Phi(\theta)^{\top}(\theta^{\prime}-\theta^{+})-\eta G^{\eta}(\theta)^{\top}(\theta^{\prime}-\theta^{+})\leq 0
\Longrightarrow~~\nabla\Phi(\theta)^{\top}(\theta^{\prime}-\theta^{+})\leq\|G^{\eta}(\theta)\|\|\theta^{\prime}-\theta^{+}\|
\Longrightarrow~~\nabla\Phi(\theta^{+})^{\top}(\theta^{\prime}-\theta^{+})\leq\|G^{\eta}(\theta)\|\|\theta^{\prime}-\theta^{+}\|+(\nabla\Phi(\theta^{+})-\nabla\Phi(\theta))^{\top}(\theta^{\prime}-\theta^{+})
\leq\|G^{\eta}(\theta)\|\|\theta^{\prime}-\theta^{+}\|+\beta\|\theta^{+}-\theta\|\|\theta^{\prime}-\theta^{+}\|
=(1+\eta\beta)\|G^{\eta}(\theta)\|\|\theta^{\prime}-\theta^{+}\|. ∎

Lemma 10.

(Sufficient ascent) Suppose $\Phi(\theta)$ is $\beta$-smooth. Let $\theta^{+}=Proj_{\mathcal{X}}(\theta+\eta\nabla\Phi(\theta))$. Then for $\eta\leq\frac{1}{\beta}$,

\Phi(\theta^{+})-\Phi(\theta)\geq\frac{\eta}{2}\|G^{\eta}(\theta)\|^{2}.
Proof.

From the smoothness property we have that:

\Phi(\theta^{+})-\Phi(\theta)\geq\nabla_{\theta}\Phi(\theta)^{\top}(\theta^{+}-\theta)-\frac{\beta}{2}\|\theta^{+}-\theta\|^{2}. (36)

Since $\theta^{+}=Proj_{\mathcal{X}}(\theta+\eta\nabla\Phi(\theta))$, we have that:

(\theta+\eta\nabla\Phi(\theta)-\theta^{+})^{\top}(\theta^{\prime}-\theta^{+})\leq 0,~~\forall~\theta^{\prime}\in\mathcal{X}.

Taking $\theta^{\prime}=\theta$, we get:

\nabla\Phi(\theta)^{\top}(\theta^{+}-\theta)\geq\frac{1}{\eta}\|\theta^{+}-\theta\|^{2}.

Thus:

\Phi(\theta^{+})-\Phi(\theta)\geq\nabla_{\theta}\Phi(\theta)^{\top}(\theta^{+}-\theta)-\frac{\beta}{2}\|\theta^{+}-\theta\|^{2}
\geq\left(\frac{1}{\eta}-\frac{\beta}{2}\right)\|\theta^{+}-\theta\|^{2}
\geq\frac{1}{2\eta}\|\theta^{+}-\theta\|^{2}
=\frac{\eta}{2}\|G^{\eta}(\theta)\|^{2},

which completes the proof. ∎
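The sufficient ascent inequality of Lemma 10 is easy to sanity-check numerically. The sketch below uses a toy $\beta$-smooth objective $\Phi(\theta)=-\|\theta-c\|^{2}$ over the probability simplex (so $\beta=2$) together with the sorting-based simplex projection of [73]; the specific objective and constants are our own illustrative choices.

```python
import numpy as np

def proj_simplex(v):
    # Euclidean projection onto the probability simplex (sorting-based, cf. [73]).
    n = v.size
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / np.arange(1, n + 1) > 0)[0][-1]
    return np.maximum(v + (1.0 - css[rho]) / (rho + 1.0), 0.0)

# Toy beta-smooth objective over the simplex: Phi(theta) = -||theta - c||^2, beta = 2.
c = np.array([0.7, 0.2, 0.4])
Phi = lambda th: -np.sum((th - c) ** 2)
grad = lambda th: -2.0 * (th - c)
beta, eta = 2.0, 0.5                               # eta <= 1/beta

rng = np.random.default_rng(1)
for _ in range(100):
    theta = proj_simplex(rng.normal(size=3))
    theta_plus = proj_simplex(theta + eta * grad(theta))
    G = (theta_plus - theta) / eta                 # gradient mapping G^eta(theta)
    # Lemma 10: Phi(theta+) - Phi(theta) >= (eta/2) * ||G^eta(theta)||^2
    assert Phi(theta_plus) - Phi(theta) >= eta / 2 * np.dot(G, G) - 1e-10
```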

Lemma 11.

(Corollary of Lemma 10) For $\Phi(\theta)$ that is $\beta$-smooth and bounded, $\Phi_{\min}\leq\Phi(\theta)\leq\Phi_{\max}$, running projected gradient ascent

\theta^{(t+1)}=Proj_{\mathcal{X}}(\theta^{(t)}+\eta\nabla\Phi(\theta^{(t)}))

with $\eta=\frac{1}{\beta}$ guarantees that:

\lim_{t\rightarrow+\infty}\|G^{\eta}(\theta^{(t)})\|=0.

Further, we have that:

\frac{1}{T}\sum_{t=0}^{T-1}\|G^{\eta}(\theta^{(t)})\|^{2}\leq\frac{2\beta(\Phi_{\max}-\Phi_{\min})}{T}. (37)
Proof.

From Lemma 10 we have:

\Phi(\theta^{(t+1)})-\Phi(\theta^{(t)})\geq\frac{1}{2\beta}\|G^{\eta}(\theta^{(t)})\|^{2}\geq 0.

Thus $\Phi(\theta^{(t)})$ is non-decreasing, and since it is bounded, $\Phi(\theta^{(t)})$ converges asymptotically to some value $\Phi^{*}$, which shows that

\lim_{t\rightarrow\infty}\|G^{\eta}(\theta^{(t)})\|=0.

Additionally, summing the sufficient ascent bound over $t=0,\dots,T-1$ gives

\Phi(\theta^{(T)})-\Phi(\theta^{(0)})\geq\sum_{t=0}^{T-1}\frac{1}{2\beta}\|G^{\eta}(\theta^{(t)})\|^{2}
\Longrightarrow~~\frac{1}{T}\sum_{t=0}^{T-1}\|G^{\eta}(\theta^{(t)})\|^{2}\leq\frac{2\beta(\Phi_{\max}-\Phi_{\min})}{T},

which completes the proof. ∎

G.2 Proof of Theorem 3

Proof.

Recall the definition of the gradient mapping:

G^{\eta}(\theta)=\frac{1}{\eta}\left(Proj_{\mathcal{X}}(\theta+\eta\nabla\Phi(\theta))-\theta\right).

From the gradient domination property (8) we have that:

\textup{NE-gap}_{i}(\theta^{(t+1)})=\max_{\theta_{i}^{\prime}\in\mathcal{X}_{i}}J_{i}(\theta_{i}^{\prime},\theta_{-i}^{(t+1)})-J_{i}(\theta_{i}^{(t+1)},\theta_{-i}^{(t+1)})
\leq\max_{\theta_{i}^{\prime}\in\mathcal{X}_{i}}\left\|\frac{d_{(\theta_{i}^{\prime},\theta_{-i}^{(t+1)})}}{d_{(\theta_{i}^{(t+1)},\theta_{-i}^{(t+1)})}}\right\|_{\infty}\max_{\overline{\theta}_{i}\in\mathcal{X}_{i}}\left(\overline{\theta}_{i}-\theta_{i}^{(t+1)}\right)^{\top}\nabla_{\theta_{i}}J_{i}(\theta^{(t+1)})
\leq M\max_{\overline{\theta}_{i}\in\mathcal{X}_{i}}\left(\overline{\theta}_{i}-\theta_{i}^{(t+1)}\right)^{\top}\nabla_{\theta_{i}}\Phi(\theta^{(t+1)})
\leq M(1+\eta\beta)\max_{\overline{\theta}_{i}\in\mathcal{X}_{i}}\|\overline{\theta}_{i}-\theta_{i}^{(t+1)}\|\|G^{\eta}(\theta^{(t)})\|
\leq 2M(1+\eta\beta)\sqrt{|\mathcal{S}|}\|G^{\eta}(\theta^{(t)})\|
\leq 4M\sqrt{|\mathcal{S}|}\|G^{\eta}(\theta^{(t)})\|,

where the last two steps use $\|\overline{\theta}_{i}-\theta_{i}^{(t+1)}\|\leq 2\sqrt{|\mathcal{S}|}$ and $\eta\beta\leq 1$. Thus

\textup{NE-gap}(\theta^{(t+1)})\leq 4M\sqrt{|\mathcal{S}|}\|G^{\eta}(\theta^{(t)})\|.

Then from Lemma 11 we have that:

limtGη(θ(t))=0limtNE-gap(θ(t))=0,\lim_{t\rightarrow\infty}\|G^{\eta}(\theta^{(t)})\|=0~{}~{}\Longrightarrow~{}~{}\lim_{t\rightarrow\infty}\textup{{NE-gap}}(\theta^{(t)})=0,

and that:

1Tt=0T1Gη(θ(t))22β(ΦmaxΦmin)T\displaystyle\frac{1}{T}\sum_{t=0}^{T-1}\|G^{\eta}(\theta^{(t)})\|^{2}\leq\frac{2\beta(\Phi_{\max}-\Phi_{\min})}{T}
\displaystyle\Longrightarrow~{}~{} 1Tt=1TNE-gap(θ(t))232βM2|𝒮|(ΦmaxΦmin)T\displaystyle\frac{1}{T}\sum_{t=1}^{T}\textup{{NE-gap}}(\theta^{(t)})^{2}\leq\frac{32\beta M^{2}|\mathcal{S}|(\Phi_{\max}-\Phi_{\min})}{T}

we obtain the required bound of ϵ\epsilon by setting:

32βM2|𝒮|(ΦmaxΦmin)Tϵ2,\frac{32\beta M^{2}|\mathcal{S}|(\Phi_{\max}-\Phi_{\min})}{T}\leq\epsilon^{2},

or equivalently

T\displaystyle T 32M2β(ΦmaxΦmin)|𝒮|ϵ2\displaystyle\geq\frac{32M^{2}\beta(\Phi_{\max}-\Phi_{\min})|\mathcal{S}|}{\epsilon^{2}}
=64M2(ΦmaxΦmin)|𝒮|i|𝒜i|ϵ2(1γ)3,\displaystyle=\frac{64M^{2}(\Phi_{\max}-\Phi_{\min})|\mathcal{S}|\sum_{i}|\mathcal{A}_{i}|}{\epsilon^{2}(1-\gamma)^{3}},

which completes the proof. ∎
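As a quick sanity check on the final bound, the following snippet evaluates the iteration count for a hypothetical set of problem parameters (all numbers below are made up purely for illustration; MM is the constant from the theorem):

```python
# Hypothetical parameters: constant M, potential range, |S|, sum_i |A_i|,
# discount factor gamma, and target accuracy epsilon.
M, phi_range, n_states, sum_actions, gamma, eps = 2.0, 1.0, 10, 6, 0.9, 0.1
T = 64 * M**2 * phi_range * n_states * sum_actions / (eps**2 * (1 - gamma)**3)
print(f"T >= {T:.3e}")  # about 1.5e9 iterations for this worst-case bound
```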

Appendix H Proof of Theorem 4

Proof.

(of the first claim) The proof relies on Lemma 5 in Section 3, so we recommend that readers go through Lemma 5 first. The lemma immediately implies that a strict NE θ\theta^{*} must be deterministic. Let ai(s),Δiθ(s),Δiθa_{i}^{*}(s),\Delta_{i}^{\theta^{*}}(s),\Delta_{i}^{\theta^{*}} be defined as in Lemma 5.

For any θ𝒳\theta\in\mathcal{X}, a Taylor expansion gives:

Φ(θ)Φ(θ)\displaystyle\Phi(\theta)-\Phi(\theta^{*}) =(θθ)Φ(θ)+o(θθ)\displaystyle=(\theta-\theta^{*})^{\top}\nabla\Phi(\theta^{*})+o(\|\theta-\theta^{*}\|)
=i(θiθi)θiJi(θ)+o(θθ)\displaystyle=\sum_{i}(\theta_{i}-\theta_{i}^{*})^{\top}\nabla_{\theta_{i}}J_{i}(\theta^{*})+o(\|\theta-\theta^{*}\|)
=11γisaidθ(s)Aiθ¯(s,ai)(θs,aiθs,ai)+o(θθ)\displaystyle=\frac{1}{1-\gamma}\sum_{i}\sum_{s}\sum_{a_{i}}d_{\theta^{*}}(s)\overline{A_{i}^{\theta^{*}}}(s,a_{i})(\theta_{s,a_{i}}-\theta^{*}_{s,a_{i}})+o(\|\theta-\theta^{*}\|)
11γisdθ(s)Δiθ(s)(aiai(s)(θs,aiθs,ai))+o(θθ)\displaystyle\leq-\frac{1}{1-\gamma}\sum_{i}\sum_{s}d_{\theta^{*}}(s)\Delta_{i}^{\theta^{*}}(s)\left(\sum_{a_{i}\!\neq\!a_{i}^{*}(s)}(\theta_{s,a_{i}}-\theta^{*}_{s,a_{i}})\right)+o(\|\theta-\theta^{*}\|)
=11γisdθ(s)Δiθ(s)12θi,sθi,s1+o(θθ)\displaystyle=-\frac{1}{1-\gamma}\sum_{i}\sum_{s}d_{\theta^{*}}(s)\Delta_{i}^{\theta^{*}}(s)\frac{1}{2}\|\theta_{i,s}-\theta_{i,s}^{*}\|_{1}+o(\|\theta-\theta^{*}\|)
Δθ2isθi,sθi,s1+o(θθ)\displaystyle\leq-\frac{\Delta^{\theta^{*}}}{2}\sum_{i}\sum_{s}\|\theta_{i,s}-\theta_{i,s}^{*}\|_{1}+o(\|\theta-\theta^{*}\|)
Δθ2θθ+o(θθ).\displaystyle\leq-\frac{\Delta^{\theta^{*}}}{2}\|\theta-\theta^{*}\|+o(\|\theta-\theta^{*}\|).

Thus for θθ\|\theta-\theta^{*}\| sufficiently small,

Φ(θ)Φ(θ)<0 holds,\Phi(\theta)-\Phi(\theta^{*})<0\textup{ holds,}

which shows that strict NEs are strict local maxima. We now show the converse.

Strict local maxima satisfy first-order stationarity by definition, and thus by Theorem 1 they are also NEs; it remains to show that they are strict. We argue by contradiction: suppose there exists a strict local maximum θ\theta^{*} that is a non-strict NE, i.e., there exists θi𝒳i,θiθi\theta_{i}^{\prime}\in\mathcal{X}_{i},\theta_{i}^{\prime}\neq\theta_{i}^{*} such that:

Ji(θi,θi)=Ji(θi,θi)J_{i}(\theta_{i}^{\prime},\theta_{-i}^{*})=J_{i}(\theta_{i}^{*},\theta_{-i}^{*})

According to (10) and first-order stationarity of θ\theta^{*}:

11γsdθ(s)maxai𝒜iAiθ¯(s,ai)=maxθ¯i𝒳i(θ¯iθi)θiJi(θ)0.\displaystyle\frac{1}{1-\gamma}\sum_{s}d_{\theta^{*}}(s)\max_{a_{i}\in\mathcal{A}_{i}}\overline{A_{i}^{\theta^{*}}}(s,a_{i})=\max_{\overline{\theta}_{i}\in\mathcal{X}_{i}}(\overline{\theta}_{i}-\theta_{i}^{*})^{\top}\nabla_{\theta_{i}}J_{i}(\theta^{*})\leq 0.

Since maxai𝒜iAiθ¯(s,ai)0\max_{a_{i}\in\mathcal{A}_{i}}\overline{A_{i}^{\theta}}(s,a_{i})\geq 0 for all θ\theta, we may conclude:

maxai𝒜iAiθ¯(s,ai)=0,s𝒮.\max_{a_{i}\in\mathcal{A}_{i}}\overline{A_{i}^{\theta^{*}}}(s,a_{i})=0,~{}~{}\forall~{}s\in\mathcal{S}.

We denote θ:=(θi,θi)\theta^{\prime}:=(\theta_{i}^{\prime},\theta_{-i}^{*}); then, according to Lemma 2,

0=Ji(θi,θi)Ji(θi,θi)=11γs,aidθ(s)πθi(ai|s)Aiθ¯(s,ai)0.\displaystyle 0=J_{i}(\theta_{i}^{\prime},\theta_{-i}^{*})-J_{i}(\theta_{i}^{*},\theta_{-i}^{*})=\frac{1}{1-\gamma}\sum_{s,a_{i}}d_{\theta^{\prime}}(s)\pi_{\theta^{\prime}_{i}}(a_{i}|s)\overline{A_{i}^{\theta^{*}}}(s,a_{i})\leq 0.

Since dθ(s)>0,sd_{\theta^{\prime}}(s)>0,~{}\forall~{}s, this further implies that

aiπθi(ai|s)Aiθ¯(s,ai)=0,s𝒮,\sum_{a_{i}}\pi_{\theta^{\prime}_{i}}(a_{i}|s)\overline{A_{i}^{\theta^{*}}}(s,a_{i})=0,~{}~{}\forall~{}s\in\mathcal{S},

i.e., πθi(ai|s)\pi_{\theta^{\prime}_{i}}(a_{i}|s) is nonzero only if Aiθ¯(s,ai)=0\overline{A_{i}^{\theta^{*}}}(s,a_{i})=0. Define θiη:=ηθi+(1η)θi\theta_{i}^{\eta}:=\eta\theta_{i}^{\prime}+(1-\eta)\theta_{i}^{*}, then

aiπθiη(ai|s)Aiθ¯(s,ai)=0,s𝒮.\sum_{a_{i}}\pi_{\theta^{\eta}_{i}}(a_{i}|s)\overline{A_{i}^{\theta^{*}}}(s,a_{i})=0,~{}~{}\forall~{}s\in\mathcal{S}.

Thus let θη:=(θiη,θi)\theta^{\eta}:=(\theta^{\eta}_{i},\theta^{*}_{-i})

Ji(θiη,θi)Ji(θi,θi)=11γs,aidθη(s)πθiη(ai|s)Aiθ¯(s,ai)=0.\displaystyle J_{i}(\theta_{i}^{\eta},\theta_{-i}^{*})-J_{i}(\theta_{i}^{*},\theta_{-i}^{*})=\frac{1}{1-\gamma}\sum_{s,a_{i}}d_{\theta^{\eta}}(s)\pi_{\theta^{\eta}_{i}}(a_{i}|s)\overline{A_{i}^{\theta^{*}}}(s,a_{i})=0.

Since Ji(θiη,θi)Ji(θi,θi)=Φ(θη)Φ(θ)J_{i}(\theta_{i}^{\eta},\theta_{-i}^{*})-J_{i}(\theta_{i}^{*},\theta_{-i}^{*})=\Phi(\theta^{\eta})-\Phi(\theta^{*}) by the definition of the potential function, and θiηθi0\|\theta_{i}^{\eta}-\theta_{i}^{*}\|\rightarrow 0 as η0\eta\rightarrow 0, this contradicts the assumption that θ\theta^{*} is a strict local maximum. This shows that all strict local maxima are strict NEs, which completes the proof. ∎

Proof.

(of the second claim) First, we define the corresponding value function, QQ-function and advantage function for potential function ϕ\phi.

Vϕθ(s)\displaystyle V_{\phi}^{\theta}(s) :=𝔼[t=0γtϕ(st,at)|π=θ,s0=s]\displaystyle:=\mathbb{{E}}\left[\sum_{t=0}^{\infty}\gamma^{t}\phi(s_{t},a_{t})\big{|}~{}\pi=\theta,s_{0}=s\right]
Qϕθ(s,a)\displaystyle Q_{\phi}^{\theta}(s,a) :=𝔼[t=0γtϕ(st,at)|π=θ,s0=s,a0=a]\displaystyle:=\mathbb{{E}}\left[\sum_{t=0}^{\infty}\gamma^{t}\phi(s_{t},a_{t})\big{|}~{}\pi=\theta,s_{0}=s,a_{0}=a\right]
Aϕθ(s,a)\displaystyle A_{\phi}^{\theta}(s,a) :=Qϕθ(s,a)Vϕθ(s).\displaystyle:=Q_{\phi}^{\theta}(s,a)-V_{\phi}^{\theta}(s).

For an index set {1,2,,n}\mathcal{I}\subseteq\{1,2,\dots,n\}, we define the averaged advantage function of the potential with respect to \mathcal{I} as:

Aϕ,θ¯(s,a):=aπθ(a|s)Aϕθ(s,a,a).\overline{A_{\phi,\mathcal{I}}^{\theta}}(s,a_{\mathcal{I}}):=\sum_{a_{-\mathcal{I}}}\pi_{\theta_{-\mathcal{I}}}(a_{-\mathcal{I}}|s)A_{\phi}^{\theta}(s,a_{\mathcal{I}},a_{-\mathcal{I}}).

We choose an index set {1,2,,n}\mathcal{I}\subseteq\{1,2,\dots,n\} for which there exist s,as^{*},a_{\mathcal{I}}^{*} satisfying:

Aϕ,θ¯(s,a)>0,\overline{A_{\phi,\mathcal{I}}^{\theta^{*}}}(s^{*},a_{\mathcal{I}}^{*})>0, (38)

and that for any other index set \mathcal{I}^{\prime} with smaller cardinality:

Aϕ,θ¯(s,a)0,s,a,||<||.\overline{A_{\phi,\mathcal{I}^{\prime}}^{\theta^{*}}}(s,a_{\mathcal{I}^{\prime}})\leq 0,~{}~{}\forall~{}s,a_{\mathcal{I}^{\prime}},~{}~{}\forall~{}|\mathcal{I}^{\prime}|<|\mathcal{I}|. (39)

Because Φ\Phi is not constant, such an index set \mathcal{I} is guaranteed to exist. Further, since

aπθ(a|s)Aϕ,θ¯(s,a)=0,s,\sum_{a_{\mathcal{I}^{\prime}}}\pi_{\theta_{\mathcal{I}^{\prime}}^{*}}(a_{\mathcal{I}^{\prime}}|s)\overline{A_{\phi,\mathcal{I}^{\prime}}^{\theta^{*}}}(s,a_{\mathcal{I}^{\prime}})=0,~{}~{}\forall~{}s,

and θ\theta^{*} is fully-mixed, combining with (39) we have that:

Aϕ,θ¯(s,a)=0,s,a,||<||.\overline{A_{\phi,\mathcal{I}^{\prime}}^{\theta^{*}}}(s,a_{\mathcal{I}^{\prime}})=0,~{}~{}\forall~{}s,a_{\mathcal{I}^{\prime}},~{}~{}\forall~{}|\mathcal{I}^{\prime}|<|\mathcal{I}|. (40)

We set θ:=(θ,θ)\theta:=(\theta_{\mathcal{I}},\theta_{-\mathcal{I}}^{*}), where θ\theta_{\mathcal{I}} is a convex combination of θ,θ𝒳\theta_{\mathcal{I}}^{*},\theta_{\mathcal{I}}^{\prime}\in\mathcal{X}:

θ=(1η)θ+ηθ,η>0.\theta_{\mathcal{I}}=(1-\eta)\theta_{\mathcal{I}}^{*}+\eta\theta_{\mathcal{I}}^{\prime},~{}~{}\eta>0.

According to the performance difference lemma (Lemma 2), we have:

(1γ)(Φ(θ,θ)Φ(θ,θ))=s,adθ(s)πθ(a|s)Aϕ,θ¯(s,a)\displaystyle(1-\gamma)\left(\Phi(\theta_{\mathcal{I}},\theta_{-\mathcal{I}}^{*})-\Phi(\theta_{\mathcal{I}}^{*},\theta_{-\mathcal{I}}^{*})\right)=\sum_{s,a_{\mathcal{I}}}d_{\theta}(s)\pi_{\theta_{\mathcal{I}}}(a_{\mathcal{I}}|s)\overline{A_{\phi,\mathcal{I}}^{\theta^{*}}}(s,a_{\mathcal{I}})
=s,adθ(s)i((1η)πθi(ai|s)+ηπθi(ai|s))Aϕ,θ¯(s,a)\displaystyle=\sum_{s,a_{\mathcal{I}}}d_{\theta}(s)\prod_{i\in\mathcal{I}}\left((1-\eta)\pi_{\theta_{i}^{*}}(a_{i}|s)+\eta\pi_{\theta_{i}^{\prime}}(a_{i}|s)\right)\overline{A_{\phi,\mathcal{I}}^{\theta^{*}}}(s,a_{\mathcal{I}})
=s,adθ(s)((1η)πθi0(ai0|s)+ηπθi0(ai0|s))i\{i0}((1η)πθi(ai|s)+ηπθi(ai|s))Aϕ,θ¯(s,a),(i0)\displaystyle=\sum_{s,a_{\mathcal{I}}}d_{\theta}(s)\left((1\!-\!\eta)\pi_{\theta_{i_{0}}^{*}}(a_{i_{0}}|s)+\eta\pi_{\theta_{i_{0}}^{\prime}}(a_{i_{0}}|s)\right)\!\prod_{i\in\mathcal{I}\!\backslash\!\{\!i_{0}\!\}\!}\!\left((1\!-\!\eta)\pi_{\theta_{i}^{*}}(a_{i}|s)+\eta\pi_{\theta_{i}^{\prime}}(a_{i}|s)\right)\overline{A_{\phi,\mathcal{I}}^{\theta^{*}}}(s,a_{\mathcal{I}}),~{}~{}(\forall~{}i_{0}\in\mathcal{I})
=(1η)s,adθ(s)i\{i0}((1η)πθi(ai|s)+ηπθi(ai|s))Aϕ,\{i0}θ¯(s,a\{i0})\displaystyle=(1-\eta)\sum_{s,a_{\mathcal{I}}}d_{\theta}(s)\prod_{i\in\mathcal{I}\backslash\{\!i_{0}\!\}\!}\!\left((1\!-\!\eta)\pi_{\theta_{i}^{*}}(a_{i}|s)+\eta\pi_{\theta_{i}^{\prime}}(a_{i}|s)\right)\overline{A_{\phi,\mathcal{I}\backslash\{i_{0}\}}^{\theta^{*}}}(s,a_{\mathcal{I}\backslash\{i_{0}\}})
+ηs,adθ(s)πθi0(ai0|s)i\{i0}((1η)πθi(ai|s)+ηπθi(ai|s))Aϕ,θ¯(s,a).\displaystyle\quad+\eta\sum_{s,a_{\mathcal{I}}}d_{\theta}(s)\pi_{\theta_{i_{0}}^{\prime}}(a_{i_{0}}|s)\prod_{i\in\mathcal{I}\backslash\{i_{0}\}}\left((1-\eta)\pi_{\theta_{i}^{*}}(a_{i}|s)+\eta\pi_{\theta_{i}^{\prime}}(a_{i}|s)\right)\overline{A_{\phi,\mathcal{I}}^{\theta^{*}}}(s,a_{\mathcal{I}}).

According to (40), we know that:

Aϕ,\{i0}θ¯(s,a\{i0})=0,\overline{A_{\phi,\mathcal{I}\backslash\{i_{0}\}}^{\theta^{*}}}(s,a_{\mathcal{I}\backslash\{i_{0}\}})=0,

thus

(1γ)(Φ(θ,θ)Φ(θ,θ))=\displaystyle(1-\gamma)\left(\Phi(\theta_{\mathcal{I}},\theta_{-\mathcal{I}}^{*})-\Phi(\theta_{\mathcal{I}}^{*},\theta_{-\mathcal{I}}^{*})\right)=
ηs,adθ(s)πθi0(ai0|s)i\{i0}((1η)πθi(ai|s)+ηπθi(ai|s))Aϕ,θ¯(s,a).\displaystyle\eta\sum_{s,a_{\mathcal{I}}}d_{\theta}(s)\pi_{\theta_{i_{0}}^{\prime}}(a_{i_{0}}|s)\prod_{i\in\mathcal{I}\backslash\{i_{0}\}}\left((1-\eta)\pi_{\theta_{i}^{*}}(a_{i}|s)+\eta\pi_{\theta_{i}^{\prime}}(a_{i}|s)\right)\overline{A_{\phi,\mathcal{I}}^{\theta^{*}}}(s,a_{\mathcal{I}}).

Applying the same procedure recursively and using the fact that:

Aϕ,\{i}θ¯(s,a\{i})=0,i,\overline{A_{\phi,\mathcal{I}\backslash\{i\}}^{\theta^{*}}}(s,a_{\mathcal{I}\backslash\{i\}})=0,~{}~{}\forall~{}i\in\mathcal{I},

we get:

Φ(θ,θ)Φ(θ,θ)=η||1γs,adθ(s)iπθi(ai|s)Aϕ,θ¯(s,a).\displaystyle\Phi(\theta_{\mathcal{I}},\theta_{-\mathcal{I}}^{*})-\Phi(\theta_{\mathcal{I}}^{*},\theta_{-\mathcal{I}}^{*})=\frac{\eta^{|\mathcal{I}|}}{1-\gamma}\sum_{s,a_{\mathcal{I}}}d_{\theta}(s)\prod_{i\in\mathcal{I}}\pi_{\theta_{i}^{\prime}}(a_{i}|s)\overline{A_{\phi,\mathcal{I}}^{\theta^{*}}}(s,a_{\mathcal{I}}).

Set πθi(ai|s)\pi_{\theta_{i}^{\prime}}(a_{i}|s) as:

πθi(ai|s)\displaystyle\pi_{\theta_{i}^{\prime}}(a_{i}|s^{*}) ={1ai=ai0otherwise\displaystyle=\left\{\begin{array}[]{cc}1&a_{i}=a_{i}^{*}\\ 0&\textup{otherwise}\end{array}\right.
πθi(ai|s)\displaystyle\pi_{\theta_{i}^{\prime}}(a_{i}|s) =πθi(ai|s),ss,\displaystyle=\pi_{\theta_{i}^{*}}(a_{i}|s),\quad s\neq s^{*},

where s,ais^{*},a_{i}^{*} are defined in (38). Then:

Φ(θ,θ)Φ(θ,θ)=η||1γdθ(s)Aϕ,θ¯(s,a)>0,\Phi(\theta_{\mathcal{I}},\theta_{-\mathcal{I}}^{*})-\Phi(\theta_{\mathcal{I}}^{*},\theta_{-\mathcal{I}}^{*})=\frac{\eta^{|\mathcal{I}|}}{1-\gamma}d_{\theta}(s^{*})\overline{A_{\phi,\mathcal{I}}^{\theta^{*}}}(s^{*},a_{\mathcal{I}}^{*})>0,

which completes the proof. ∎

Appendix I Bounding the gradient estimation error of Algorithm 1

The accuracy of gradient estimation is essential in the sample-based Algorithm 1. In this section, we give a high-probability bound on the estimation error, stated in the following theorem:

Theorem 6.

(Error bound for gradient estimation) Assume that the stochastic game satisfies Assumption 2. In Algorithm 1, for

TJ32τ(1+α)2|𝒮|3i|𝒜i|maxi|𝒜i|2(1γ)6ϵg2α2σS2log(16τTG|𝒮|2i|𝒜i|δ)+1,T_{J}\geq\frac{32\tau(1+\alpha)^{2}|\mathcal{S}|^{3}\sum_{i}|\mathcal{A}_{i}|\max_{i}|\mathcal{A}_{i}|^{2}}{(1-\gamma)^{6}\epsilon_{g}^{2}\alpha^{2}\sigma_{S}^{2}}\log\left(\frac{16\tau T_{G}|\mathcal{S}|^{2}\sum_{i}|\mathcal{A}_{i}|}{\delta}\right)+1,

with probability at least 1δ1-\delta, we have:

^Φ(θ(k))Φ(θ(k))2ϵg,0kTG1.\|\widehat{\nabla}\Phi(\theta^{(k)})-\nabla\Phi(\theta^{(k)})\|_{2}\leq\epsilon_{g},~{}~{}\forall~{}0\leq k\leq T_{G}-1.

The proof of the theorem consists of bounding the estimation errors of Qiθ¯\overline{Q_{i}^{\theta}} (Section I.1) and dθd_{\theta} (Section I.2). We first introduce a definition of ‘sufficient exploration’ that plays an important role in this section.

In Assumption 2 of the main text we introduced (τ,σS)(\tau,\sigma_{S})-sufficient exploration on states. In this section we introduce a similar notion, (τ,σ)(\tau,\sigma)-sufficient exploration:

Definition 6.

((τ,σ)(\tau,\sigma)-Sufficient Exploration) A stochastic game together with a policy θ\theta is said to satisfy the (τ,σ)(\tau,\sigma)-sufficient exploration condition if there exist a positive integer τ\tau and σ(0,1)\sigma\in(0,1) such that, for policy θ\theta, any agent ii, and any initial state-action pair (s,a)(s,a), we have

Pr(sτ,ai,τ|s0=s,a0=a)σ,sτ,ai,τ.\Pr(s_{\tau},a_{i,\tau}|s_{0}=s,a_{0}=a)\geq\sigma,~{}~{}\forall s_{\tau},a_{i,\tau}.

Note that ‘(τ,σ)(\tau,\sigma)-sufficient exploration’ is a stronger condition than ‘(τ,σS)(\tau,\sigma_{S})-sufficient exploration on states’. Additionally, it is not hard to verify that any stochastic game satisfying (τ,σS)(\tau,\sigma_{S})-sufficient exploration on states, together with any policy θ𝒳α\theta\in\mathcal{X}^{\alpha}, also satisfies the (τ,ασSmaxi|𝒜i|)(\tau,\frac{\alpha\sigma_{S}}{\max_{i}|\mathcal{A}_{i}|})-sufficient exploration condition, since every policy in 𝒳α\mathcal{X}^{\alpha} assigns probability at least α|𝒜i|\frac{\alpha}{|\mathcal{A}_{i}|} to each action of agent ii.

I.1 Bounding the estimation error of the averaged-Q function

We first state the main theorem in this subsection:

Theorem 7.

(Estimation error of averaged Q-functions) Assume that the stochastic game with policy θ\theta satisfies the (τ,σ)(\tau,\sigma)-sufficient exploration condition (Definition 6). Then for a fixed ii, running Algorithm 1 guarantees that:

Pr(Qiθ^Qiθ¯ϵ)4τ|𝒮|2|𝒜i|exp((1γ)4ϵ2σ2Tτ32|𝒮|2),\Pr\left(\|\widehat{Q_{i}^{\theta}}-\overline{Q_{i}^{\theta}}\|_{\infty}\geq\epsilon\right)\leq 4\tau|\mathcal{S}|^{2}|\mathcal{A}_{i}|\exp\left(-\frac{(1-\gamma)^{4}\epsilon^{2}\sigma^{2}\lfloor\frac{T}{\tau}\rfloor}{32|\mathcal{S}|^{2}}\right),

further:

Pr(Qiθ^Qiθ¯ϵ,i)8τ|𝒮|2(i=1n|𝒜i|)exp((1γ)4ϵ2σ2Tτ32|𝒮|2),\Pr\left(\|\widehat{Q_{i}^{\theta}}-\overline{Q_{i}^{\theta}}\|_{\infty}\geq\epsilon,~{}\exists~{}i\right)\leq 8\tau|\mathcal{S}|^{2}\left(\sum_{i=1}^{n}|\mathcal{A}_{i}|\right)\exp\left(-\frac{(1-\gamma)^{4}\epsilon^{2}\sigma^{2}\lfloor\frac{T}{\tau}\rfloor}{32|\mathcal{S}|^{2}}\right),

i.e., when

TJ32τ|𝒮|2(1γ)4ϵ2σ2log(8τ|𝒮|2i|𝒜i|δ)+τT_{J}\geq\frac{32\tau|\mathcal{S}|^{2}}{(1-\gamma)^{4}\epsilon^{2}\sigma^{2}}\log\left(\frac{8\tau|\mathcal{S}|^{2}\sum_{i}|\mathcal{A}_{i}|}{\delta}\right)+\tau

with probability at least 1δ1-\delta, Qiθ^Qiθ¯ϵ,i\|\widehat{Q_{i}^{\theta}}-\overline{Q_{i}^{\theta}}\|_{\infty}\leq\epsilon,~{}\forall~{}i.

In the following, we introduce several lemmas that play an important role in bounding the estimation error of the averaged Q-function:

Lemma 12.

Assume that the stochastic game with policy θ\theta satisfies the (τ,σ)(\tau,\sigma)-sufficient exploration condition (Definition 6). Then for fixed s,s,ais^{\prime},s,a_{i} and any ϵ1\epsilon\leq 1,

Pr(|Piθ^(s|s,ai)Piθ¯(s|s,ai)|ϵ)4τexp(ϵ2σ2Tτ32)\Pr\left(\left|\widehat{P_{i}^{\theta}}(s^{\prime}|s,a_{i})-\overline{P_{i}^{\theta}}(s^{\prime}|s,a_{i})\right|\geq\epsilon\right)\leq 4\tau\exp\left(-\frac{\epsilon^{2}\sigma^{2}\lfloor\frac{T}{\tau}\rfloor}{32}\right)
Proof.

According to the definition of Piθ^\widehat{P_{i}^{\theta}}, we have that

{Piθ^(s|s,ai)Piθ¯(s|s,ai)ϵ}\displaystyle\left\{\widehat{P_{i}^{\theta}}(s^{\prime}|s,a_{i})-\overline{P_{i}^{\theta}}(s^{\prime}|s,a_{i})\geq\epsilon\right\}
{t=0T1(𝟏{st+1=s,st=s,ai,t=ai}(Piθ¯(s|s,ai)+ϵ)𝟏{st=s,ai,t=ai})0}\displaystyle\subseteq\left\{\sum_{t=0}^{T-1}\left(\mathbf{1}\{s_{t+1}=s^{\prime},s_{t}=s,a_{i,t}=a_{i}\}-(\overline{P_{i}^{\theta}}(s^{\prime}|s,a_{i})+\epsilon)\mathbf{1}\{s_{t}=s,a_{i,t}=a_{i}\}\right)\geq 0\right\}
{t=0T1𝟏{st=s,ai,t=ai}=0}\displaystyle\qquad\mathop{\cup}\left\{\sum_{t=0}^{T-1}\mathbf{1}\{s_{t}=s,a_{i,t}=a_{i}\}=0\right\}
m=0τ1{k=0T1mτ(𝟏{skτ+m+1=s,skτ+m=s,ai,kτ+m=ai}(Piθ¯(s|s,ai)+ϵ)𝟏{skτ+m=s,ai,kτ+m=ai})0}\displaystyle\subseteq\mathop{\cup}_{m=0}^{\tau-1}\left\{\sum_{k=0}^{\lfloor\frac{T-1-m}{\tau}\rfloor}\!\left(\mathbf{1}\{s_{k\tau+m+1}=s^{\prime},s_{k\tau+m}=s,a_{i,k\tau+m}=a_{i}\}\!-\!(\overline{P_{i}^{\theta}}(s^{\prime}|s,a_{i})+\epsilon)\mathbf{1}\{s_{k\tau+m}=s,a_{i,k\tau+m}=a_{i}\}\right)\geq 0\!\right\}
m=0τ1{k=0T1mτ𝟏{skτ+m=s,ai,kτ+m=ai}=0}\displaystyle\qquad\mathop{\cup}_{m=0}^{\tau-1}\left\{\sum_{k=0}^{\lfloor\frac{T-1-m}{\tau}\rfloor}\mathbf{1}\{s_{k\tau+m}=s,a_{i,k\tau+m}=a_{i}\}=0\right\}

Let:

Am\displaystyle A_{m} :={k=0T1mτ(𝟏{skτ+m+1=s,skτ+m=s,ai,kτ+m=ai}(Piθ¯(s|s,ai)+ϵ)𝟏{skτ+m=s,ai,kτ+m=ai})0}\displaystyle:=\left\{\sum_{k=0}^{\lfloor\frac{T-1-m}{\tau}\rfloor}\left(\mathbf{1}\{s_{k\tau+m+1}\!=\!s^{\prime},s_{k\tau+m}\!=\!s,a_{i,k\tau+m}\!=\!a_{i}\}-(\overline{P_{i}^{\theta}}(s^{\prime}|s,a_{i})+\epsilon)\mathbf{1}\{s_{k\tau+m}\!=\!s,a_{i,k\tau+m}\!=\!a_{i}\}\right)\geq 0\right\}
Am\displaystyle A_{m}^{\prime} :={k=0T1mτ𝟏{skτ+m=s,ai,kτ+m=ai}=0}\displaystyle:=\left\{\sum_{k=0}^{\lfloor\frac{T-1-m}{\tau}\rfloor}\mathbf{1}\{s_{k\tau+m}=s,a_{i,k\tau+m}=a_{i}\}=0\right\}
Xk,m\displaystyle X_{k,m} :=𝟏{skτ+m+1=s,skτ+m=s,ai,kτ+m=ai}(Piθ¯(s|s,ai)+ϵ)𝟏{skτ+m=s,ai,kτ+m=ai}\displaystyle:=\mathbf{1}\{s_{k\tau+m+1}=s^{\prime},s_{k\tau+m}=s,a_{i,k\tau+m}=a_{i}\}-(\overline{P_{i}^{\theta}}(s^{\prime}|s,a_{i})+\epsilon)\mathbf{1}\{s_{k\tau+m}=s,a_{i,k\tau+m}=a_{i}\}
Xk,m\displaystyle X_{k,m}^{\prime} :=𝟏{skτ+m=s,ai,kτ+m=ai}\displaystyle:=\mathbf{1}\{s_{k\tau+m}=s,a_{i,k\tau+m}=a_{i}\}
Yk,m\displaystyle Y_{k,m} :=Xk,m𝔼[Xk,m|(k1)τ+m]\displaystyle:=X_{k,m}-\mathbb{{E}}[X_{k,m}|\mathcal{F}_{(k-1)\tau+m}]
Yk,m\displaystyle Y_{k,m}^{\prime} :=Xk,m𝔼[Xk,m|(k1)τ+m]\displaystyle:=X_{k,m}^{\prime}-\mathbb{{E}}[X_{k,m}^{\prime}|\mathcal{F}_{(k-1)\tau+m}]

Then {Yk,m}k=0T1mτ\{Y_{k,m}\}_{k=0}^{\lfloor\frac{T-1-m}{\tau}\rfloor} is a martingale difference sequence. Because ϵ1\epsilon\leq 1, it is easy to verify that |Xk,m|2,|Xk,m|1|X_{k,m}|\leq 2,|X_{k,m}^{\prime}|\leq 1. We have that:

|Yk,m||Xk,m|+𝔼[|Xk,m||(k1)τ+m]4,|Yk,m||Xk,m|+𝔼[|Xk,m||(k1)τ+m]2.|Y_{k,m}|\leq|X_{k,m}|+\mathbb{{E}}[|X_{k,m}||\mathcal{F}_{(k-1)\tau+m}]\leq 4,~{}~{}|Y_{k,m}^{\prime}|\leq|X_{k,m}^{\prime}|+\mathbb{{E}}[|X_{k,m}^{\prime}||\mathcal{F}_{(k-1)\tau+m}]\leq 2.

Further,

𝔼[Xk,m|(k1)τ+m]=𝔼[𝟏{skτ+m=s,ai,kτ+m=ai}|(k1)τ+m]σ,\mathbb{{E}}[X^{\prime}_{k,m}|\mathcal{F}_{(k-1)\tau+m}]=\mathbb{{E}}[\mathbf{1}\{s_{k\tau+m}=s,a_{i,k\tau+m}=a_{i}\}|\mathcal{F}_{(k-1)\tau+m}]\geq\sigma,

and that

𝔼[Xk,m|(k1)τ+m]\displaystyle\mathbb{{E}}[X_{k,m}|\mathcal{F}_{(k-1)\tau+m}]
=𝔼[𝟏{skτ+m+1=s,skτ+m=s,ai,kτ+m=ai}(Piθ¯(s|s,ai)+ϵ)𝟏{skτ+m=s,ai,kτ+m=ai}|(k1)τ+m]\displaystyle=\mathbb{{E}}[\mathbf{1}\{s_{k\tau+m+1}=s^{\prime},s_{k\tau+m}=s,a_{i,k\tau+m}=a_{i}\}-(\overline{P_{i}^{\theta}}(s^{\prime}|s,a_{i})+\epsilon)\mathbf{1}\{s_{k\tau+m}=s,a_{i,k\tau+m}=a_{i}\}|\mathcal{F}_{(k-1)\tau+m}] (41)
=ϵ𝔼[𝟏{skτ+m=s,ai,kτ+m=ai}|(k1)τ+m]ϵσ.\displaystyle=-\epsilon\mathbb{{E}}[\mathbf{1}\{s_{k\tau+m}=s,a_{i,k\tau+m}=a_{i}\}|\mathcal{F}_{(k-1)\tau+m}]\leq-\epsilon\sigma. (42)

To move from (41) to (42), we used the fact that:

𝔼[𝟏{st+1=s,st=s,ai,t=ai}|t1]\displaystyle\mathbb{{E}}[\mathbf{1}\{s_{t+1}=s^{\prime},s_{t}=s,a_{i,t}=a_{i}\}|\mathcal{F}_{t-1}] =P(s|st1,at1)aiπθ(ai,ai|s)P(s|s,ai,ai)\displaystyle=P(s|s_{t-1},a_{t-1})\sum_{a_{-i}}\pi_{\theta}(a_{i},a_{-i}|s)P(s^{\prime}|s,a_{i},a_{-i})
=P(s|st1,at1)πθi(ai|s)aiπθi(ai|s)P(s|s,ai,ai)\displaystyle=P(s|s_{t-1},a_{t-1})\pi_{\theta_{i}}(a_{i}|s)\sum_{a_{-i}}\pi_{\theta_{-i}}(a_{-i}|s)P(s^{\prime}|s,a_{i},a_{-i})
=P(s|st1,at1)πθi(ai|s)Piθ¯(s|s,ai)\displaystyle=P(s|s_{t-1},a_{t-1})\pi_{\theta_{i}}(a_{i}|s)\overline{P_{i}^{\theta}}(s^{\prime}|s,a_{i})
=𝔼[Piθ¯(s|s,ai)𝟏{st=s,ai,t=ai}|t1]\displaystyle=\mathbb{{E}}[\overline{P_{i}^{\theta}}(s^{\prime}|s,a_{i})\mathbf{1}\{s_{t}=s,a_{i,t}=a_{i}\}|\mathcal{F}_{t-1}]

and the inequality in (42) is derived directly from Definition 6.

According to the Azuma–Hoeffding inequality [71, 72]:

Pr(Am)\displaystyle\Pr(A_{m}) =Pr(k=0T1mτXk,m0)\displaystyle=\Pr\left(\sum_{k=0}^{\lfloor\frac{T-1-m}{\tau}\rfloor}X_{k,m}\geq 0\right)
=Pr(k=0T1mτYk,mk=0T1mτ𝔼[Xk,m|(k1)τ+m])\displaystyle=\Pr\left(\sum_{k=0}^{\lfloor\frac{T-1-m}{\tau}\rfloor}Y_{k,m}\geq-\sum_{k=0}^{\lfloor\frac{T-1-m}{\tau}\rfloor}\mathbb{{E}}[X_{k,m}|\mathcal{F}_{(k-1)\tau+m}]\right)
Pr(k=0T1mτYk,mT1m+ττϵσ)\displaystyle\leq\Pr\left(\sum_{k=0}^{\lfloor\frac{T-1-m}{\tau}\rfloor}Y_{k,m}\geq\left\lfloor\frac{T-1-m+\tau}{\tau}\right\rfloor\epsilon\sigma\right)
exp(ϵ2σ2Tτ32)\displaystyle\leq\exp\left(-\frac{\epsilon^{2}\sigma^{2}\left\lfloor\frac{T}{\tau}\right\rfloor}{32}\right)

Similarly, from Azuma-Hoeffding inequality,

Pr(Am)\displaystyle\Pr(A_{m}^{\prime}) =Pr(k=0T1mτXk,m=0)\displaystyle=\Pr\left(\sum_{k=0}^{\lfloor\frac{T-1-m}{\tau}\rfloor}X_{k,m}^{\prime}=0\right)
=Pr(k=0T1mτYk,m=k=0T1mτ𝔼[Xk,m|(k1)τ+m])\displaystyle=\Pr\left(\sum_{k=0}^{\lfloor\frac{T-1-m}{\tau}\rfloor}Y_{k,m}^{\prime}=-\sum_{k=0}^{\lfloor\frac{T-1-m}{\tau}\rfloor}\mathbb{{E}}[X_{k,m}^{\prime}|\mathcal{F}_{(k-1)\tau+m}]\right)
Pr(k=0T1mτYk,mT1m+ττσ)\displaystyle\leq\Pr\left(\sum_{k=0}^{\lfloor\frac{T-1-m}{\tau}\rfloor}Y_{k,m}^{\prime}\leq-\left\lfloor\frac{T-1-m+\tau}{\tau}\right\rfloor\sigma\right)
exp(σ2Tτ8)\displaystyle\leq\exp\left(-\frac{\sigma^{2}\left\lfloor\frac{T}{\tau}\right\rfloor}{8}\right) (43)

Thus

Pr(Piθ^(s|s,ai)Piθ¯(s|s,ai)ϵ)m=0τ1Pr(Am)+Pr(Am)\displaystyle\Pr\left(\widehat{P_{i}^{\theta}}(s^{\prime}|s,a_{i})-\overline{P_{i}^{\theta}}(s^{\prime}|s,a_{i})\geq\epsilon\right)\leq\sum_{m=0}^{\tau-1}\Pr\left(A_{m}\right)+\Pr\left(A_{m}^{\prime}\right)
τexp(ϵ2σ2Tτ32)+τexp(σ2Tτ8)2τexp(ϵ2σ2Tτ32)\displaystyle\leq\tau\exp\left(-\frac{\epsilon^{2}\sigma^{2}\left\lfloor\frac{T}{\tau}\right\rfloor}{32}\right)+\tau\exp\left(-\frac{\sigma^{2}\left\lfloor\frac{T}{\tau}\right\rfloor}{8}\right)\leq 2\tau\exp\left(-\frac{\epsilon^{2}\sigma^{2}\left\lfloor\frac{T}{\tau}\right\rfloor}{32}\right)

Similarly

Pr(Piθ^(s|s,ai)Piθ¯(s|s,ai)ϵ)2τexp(ϵ2σ2Tτ32)\displaystyle\Pr\left(\widehat{P_{i}^{\theta}}(s^{\prime}|s,a_{i})-\overline{P_{i}^{\theta}}(s^{\prime}|s,a_{i})\leq-\epsilon\right)\leq 2\tau\exp\left(-\frac{\epsilon^{2}\sigma^{2}\left\lfloor\frac{T}{\tau}\right\rfloor}{32}\right)
Pr(|Piθ^(s|s,ai)Piθ¯(s|s,ai)|ϵ)4τexp(ϵ2σ2Tτ32)\displaystyle\Longrightarrow\Pr\left(\left|\widehat{P_{i}^{\theta}}(s^{\prime}|s,a_{i})-\overline{P_{i}^{\theta}}(s^{\prime}|s,a_{i})\right|\geq\epsilon\right)\leq 4\tau\exp\left(-\frac{\epsilon^{2}\sigma^{2}\left\lfloor\frac{T}{\tau}\right\rfloor}{32}\right)

which completes the proof. ∎

Lemma 13.

Assume that the stochastic game with policy θ\theta satisfies the (τ,σ)(\tau,\sigma)-sufficient exploration condition (Definition 6). Then for fixed s,ais,a_{i} and any ϵ1\epsilon\leq 1,

Pr(|riθ^(s,ai)riθ¯(s,ai)|ϵ)4τexp(ϵ2σ2Tτ32)\Pr\left(\left|\widehat{r_{i}^{\theta}}(s,a_{i})-\overline{r_{i}^{\theta}}(s,a_{i})\right|\geq\epsilon\right)\leq 4\tau\exp\left(-\frac{\epsilon^{2}\sigma^{2}\lfloor\frac{T}{\tau}\rfloor}{32}\right)
Proof.

The proof is similar to that of Lemma 12.

{riθ^(s,ai)riθ¯(s,ai)ϵ}\displaystyle\left\{\widehat{r_{i}^{\theta}}(s,a_{i})-\overline{r_{i}^{\theta}}(s,a_{i})\geq\epsilon\right\}
{t=0T𝟏{st=s,ai,t=ai}ri(st,at)(riθ¯(s,ai)+ϵ)𝟏{st=s,ai,t=ai}0}\displaystyle\subseteq\left\{\sum_{t=0}^{T}\mathbf{1}\{s_{t}=s,a_{i,t}=a_{i}\}r_{i}(s_{t},a_{t})-(\overline{r_{i}^{\theta}}(s,a_{i})+\epsilon)\mathbf{1}\{s_{t}=s,a_{i,t}=a_{i}\}\geq 0\right\}
{t=0T1𝟏{st=s,ai,t=ai}=0}\displaystyle\qquad\mathop{\cup}\left\{\sum_{t=0}^{T-1}\mathbf{1}\{s_{t}=s,a_{i,t}=a_{i}\}=0\right\}
m=0τ1{k=0Tmτ𝟏{skτ+m=s,ai,kτ+m=ai}ri(skτ+m,akτ+m)(riθ¯(s,ai)+ϵ)𝟏{skτ+m=s,ai,kτ+m=ai}0}\displaystyle\subseteq\mathop{\cup}_{m=0}^{\tau-1}\left\{\sum_{k=0}^{\lfloor\frac{T-m}{\tau}\rfloor}\mathbf{1}\{s_{k\tau+m}=s,a_{i,k\tau+m}=a_{i}\}r_{i}(s_{k\tau+m},a_{k\tau+m})-(\overline{r_{i}^{\theta}}(s,a_{i})+\epsilon)\mathbf{1}\{s_{k\tau+m}=s,a_{i,k\tau+m}=a_{i}\}\geq 0\right\}
m=0τ1{k=0T1mτ𝟏{skτ+m=s,ai,kτ+m=ai}=0}\displaystyle\qquad\mathop{\cup}_{m=0}^{\tau-1}\left\{\sum_{k=0}^{\lfloor\frac{T-1-m}{\tau}\rfloor}\mathbf{1}\{s_{k\tau+m}=s,a_{i,k\tau+m}=a_{i}\}=0\right\}

Let:

Am\displaystyle A_{m} :={k=0Tmτ𝟏{skτ+m=s,ai,kτ+m=ai}ri(skτ+m,akτ+m)(riθ¯(s,ai)+ϵ)𝟏{skτ+m=s,ai,kτ+m=ai}0}\displaystyle:=\left\{\sum_{k=0}^{\lfloor\frac{T-m}{\tau}\rfloor}\mathbf{1}\{s_{k\tau+m}=s,a_{i,k\tau+m}=a_{i}\}r_{i}(s_{k\tau+m},a_{k\tau+m})-(\overline{r_{i}^{\theta}}(s,a_{i})+\epsilon)\mathbf{1}\{s_{k\tau+m}=s,a_{i,k\tau+m}=a_{i}\}\geq 0\right\}
Am\displaystyle A_{m}^{\prime} :={k=0T1mτ𝟏{skτ+m=s,ai,kτ+m=ai}=0}\displaystyle:=\left\{\sum_{k=0}^{\lfloor\frac{T-1-m}{\tau}\rfloor}\mathbf{1}\{s_{k\tau+m}=s,a_{i,k\tau+m}=a_{i}\}=0\right\}
Xk,m\displaystyle X_{k,m} :=𝟏{skτ+m=s,ai,kτ+m=ai}ri(skτ+m,akτ+m)(riθ¯(s,ai)+ϵ)𝟏{skτ+m=s,ai,kτ+m=ai}\displaystyle:=\mathbf{1}\{s_{k\tau+m}=s,a_{i,k\tau+m}=a_{i}\}r_{i}(s_{k\tau+m},a_{k\tau+m})-(\overline{r_{i}^{\theta}}(s,a_{i})+\epsilon)\mathbf{1}\{s_{k\tau+m}=s,a_{i,k\tau+m}=a_{i}\}
Yk,m\displaystyle Y_{k,m} :=Xk,m𝔼[Xk,m|(k1)τ+m]\displaystyle:=X_{k,m}-\mathbb{{E}}[X_{k,m}|\mathcal{F}_{(k-1)\tau+m}]

Then {Yk,m}k=0T1mτ\{Y_{k,m}\}_{k=0}^{\lfloor\frac{T-1-m}{\tau}\rfloor} is a martingale difference sequence. Because ϵ1\epsilon\leq 1, it is easy to verify that |Xk,m|2|X_{k,m}|\leq 2. We have that:

|Yk,m||Xk,m|+𝔼[|Xk,m||(k1)τ+m]4.|Y_{k,m}|\leq|X_{k,m}|+\mathbb{{E}}[|X_{k,m}||\mathcal{F}_{(k-1)\tau+m}]\leq 4.

Further,

𝔼[Xk,m|(k1)τ+m]\displaystyle\mathbb{{E}}[X_{k,m}|\mathcal{F}_{(k-1)\tau+m}]
=𝔼[𝟏{skτ+m=s,ai,kτ+m=ai}ri(skτ+m,akτ+m)(riθ¯(s,ai)+ϵ)𝟏{skτ+m=s,ai,kτ+m=ai}|(k1)τ+m]\displaystyle=\mathbb{{E}}[\mathbf{1}\{s_{k\tau+m}=s,a_{i,k\tau+m}=a_{i}\}r_{i}(s_{k\tau+m},a_{k\tau+m})-(\overline{r_{i}^{\theta}}(s,a_{i})+\epsilon)\mathbf{1}\{s_{k\tau+m}=s,a_{i,k\tau+m}=a_{i}\}|\mathcal{F}_{(k-1)\tau+m}]
=ϵ𝔼[𝟏{skτ+m=s,ai,kτ+m=ai}|(k1)τ+m]ϵσ\displaystyle=-\epsilon\mathbb{{E}}[\mathbf{1}\{s_{k\tau+m}=s,a_{i,k\tau+m}=a_{i}\}|\mathcal{F}_{(k-1)\tau+m}]\leq-\epsilon\sigma

The step from the second line to the third line uses the fact that:

𝔼[𝟏{st=s,ai,t=ai}ri(st,at)|t1]\displaystyle\mathbb{{E}}[\mathbf{1}\{s_{t}=s,a_{i,t}=a_{i}\}r_{i}(s_{t},a_{t})|\mathcal{F}_{t-1}] =P(s|st1,at1)aiπθ(ai,ai|s)ri(s,ai,ai)\displaystyle=P(s|s_{t-1},a_{t-1})\sum_{a_{-i}}\pi_{\theta}(a_{i},a_{-i}|s)r_{i}(s,a_{i},a_{-i})
=P(s|st1,at1)πθi(ai|s)riθ¯(s,ai)\displaystyle=P(s|s_{t-1},a_{t-1})\pi_{\theta_{i}}(a_{i}|s)\overline{r_{i}^{\theta}}(s,a_{i})
=𝔼[riθ¯(s,ai)𝟏{st=s,ai,t=ai}|t1]\displaystyle=\mathbb{{E}}[\overline{r_{i}^{\theta}}(s,a_{i})\mathbf{1}\{s_{t}=s,a_{i,t}=a_{i}\}|\mathcal{F}_{t-1}]

and the inequality in the third line is derived directly from Definition 6.

According to the Azuma–Hoeffding inequality:

Pr(Am)\displaystyle\Pr(A_{m}) =Pr(k=0TmτXk,m0)\displaystyle=\Pr\left(\sum_{k=0}^{\lfloor\frac{T-m}{\tau}\rfloor}X_{k,m}\geq 0\right)
=Pr(k=0TmτYk,mk=0Tmτ𝔼[Xk,m|(k1)τ+m])\displaystyle=\Pr\left(\sum_{k=0}^{\lfloor\frac{T-m}{\tau}\rfloor}Y_{k,m}\geq-\sum_{k=0}^{\lfloor\frac{T-m}{\tau}\rfloor}\mathbb{{E}}[X_{k,m}|\mathcal{F}_{(k-1)\tau+m}]\right)
Pr(k=0TmτYk,mTm+ττϵσ)\displaystyle\leq\Pr\left(\sum_{k=0}^{\lfloor\frac{T-m}{\tau}\rfloor}Y_{k,m}\geq\left\lfloor\frac{T-m+\tau}{\tau}\right\rfloor\epsilon\sigma\right)
exp(ϵ2σ2Tτ32)\displaystyle\leq\exp\left(-\frac{\epsilon^{2}\sigma^{2}\left\lfloor\frac{T}{\tau}\right\rfloor}{32}\right)

As in (43), we have that:

Pr(Am)exp(σ2Tτ8)\Pr(A_{m}^{\prime})\leq\exp\left(-\frac{\sigma^{2}\left\lfloor\frac{T}{\tau}\right\rfloor}{8}\right)

Thus

Pr(riθ^(s,ai)riθ¯(s,ai)ϵ)m=0τ1Pr(Am)+Pr(Am)2τexp(ϵ2σ2Tτ32)\displaystyle\Pr\left(\widehat{r_{i}^{\theta}}(s,a_{i})-\overline{r_{i}^{\theta}}(s,a_{i})\geq\epsilon\right)\leq\sum_{m=0}^{\tau-1}\Pr\left(A_{m}\right)+\Pr\left(A_{m}^{\prime}\right)\leq 2\tau\exp\left(-\frac{\epsilon^{2}\sigma^{2}\left\lfloor\frac{T}{\tau}\right\rfloor}{32}\right)

Similarly

Pr(riθ^(s,ai)riθ¯(s,ai)ϵ)2τexp(ϵ2σ2Tτ32)\displaystyle\Pr\left(\widehat{r_{i}^{\theta}}(s,a_{i})-\overline{r_{i}^{\theta}}(s,a_{i})\leq-\epsilon\right)\leq 2\tau\exp\left(-\frac{\epsilon^{2}\sigma^{2}\left\lfloor\frac{T}{\tau}\right\rfloor}{32}\right)
Pr(|riθ^(s,ai)riθ¯(s,ai)|ϵ)4τexp(ϵ2σ2Tτ32)\displaystyle\Longrightarrow\Pr\left(\left|\widehat{r_{i}^{\theta}}(s,a_{i})-\overline{r_{i}^{\theta}}(s,a_{i})\right|\geq\epsilon\right)\leq 4\tau\exp\left(-\frac{\epsilon^{2}\sigma^{2}\left\lfloor\frac{T}{\tau}\right\rfloor}{32}\right)

which completes the proof. ∎

Lemmas 12 and 13 lead to the following corollary:

Corollary 1.
Pr(Miθ^Miθ¯ϵ)\displaystyle\Pr(\|\widehat{M^{\theta}_{i}}-\overline{M^{\theta}_{i}}\|_{\infty}\geq\epsilon) 4τ|𝒮|2|𝒜i|exp(ϵ2σ2Tτ32|𝒮|2)\displaystyle\leq 4\tau|\mathcal{S}|^{2}|\mathcal{A}_{i}|\exp\left(-\frac{\epsilon^{2}\sigma^{2}\lfloor\frac{T}{\tau}\rfloor}{32|\mathcal{S}|^{2}}\right) (44)
Pr(riθ^riθ¯ϵ)\displaystyle\Pr(\|\widehat{r^{\theta}_{i}}-\overline{r^{\theta}_{i}}\|_{\infty}\geq\epsilon) 4τ|𝒮||𝒜i|exp(ϵ2σ2Tτ32)\displaystyle\leq 4\tau|\mathcal{S}||\mathcal{A}_{i}|\exp\left(-\frac{\epsilon^{2}\sigma^{2}\lfloor\frac{T}{\tau}\rfloor}{32}\right) (45)
Proof.

We first prove (44).

Miθ^Miθ¯\displaystyle\|\widehat{M^{\theta}_{i}}-\overline{M^{\theta}_{i}}\|_{\infty} =max(s,ai)(s,ai)πθi(ai|s)|Piθ^(s|s,ai)Piθ¯(s|s,ai)|\displaystyle=\max_{(s,a_{i})}\sum_{(s^{\prime},a_{i}^{\prime})}\pi_{\theta_{i}}(a_{i}^{\prime}|s^{\prime})\left|\widehat{P_{i}^{\theta}}(s^{\prime}|s,a_{i})-\overline{P_{i}^{\theta}}(s^{\prime}|s,a_{i})\right|
=max(s,ai)s|Piθ^(s|s,ai)Piθ¯(s|s,ai)|\displaystyle=\max_{(s,a_{i})}\sum_{s^{\prime}}\left|\widehat{P_{i}^{\theta}}(s^{\prime}|s,a_{i})-\overline{P_{i}^{\theta}}(s^{\prime}|s,a_{i})\right|

Thus,

{Miθ^Miθ¯ϵ}\displaystyle\left\{\|\widehat{M^{\theta}_{i}}-\overline{M^{\theta}_{i}}\|_{\infty}\geq\epsilon\right\} =(s,ai){s|Piθ^(s|s,ai)Piθ¯(s|s,ai)|ϵ}\displaystyle=\mathop{\cup}_{(s,a_{i})}\left\{\sum_{s^{\prime}}\left|\widehat{P_{i}^{\theta}}(s^{\prime}|s,a_{i})-\overline{P_{i}^{\theta}}(s^{\prime}|s,a_{i})\right|\geq\epsilon\right\}
(s,ai)s{|Piθ^(s|s,ai)Piθ¯(s|s,ai)|ϵ|𝒮|}\displaystyle\subseteq\mathop{\cup}_{(s,a_{i})}\mathop{\cup}_{s^{\prime}}\left\{\left|\widehat{P_{i}^{\theta}}(s^{\prime}|s,a_{i})-\overline{P_{i}^{\theta}}(s^{\prime}|s,a_{i})\right|\geq\frac{\epsilon}{|\mathcal{S}|}\right\}

Then according to Lemma 12,

Pr(Miθ^Miθ¯ϵ)\displaystyle\Pr\left(\|\widehat{M^{\theta}_{i}}-\overline{M^{\theta}_{i}}\|_{\infty}\geq\epsilon\right) (s,s,ai)Pr(|Piθ^(s|s,ai)Piθ¯(s|s,ai)|ϵ|𝒮|)\displaystyle\leq\sum_{(s^{\prime},s,a_{i})}\Pr\left(\left|\widehat{P_{i}^{\theta}}(s^{\prime}|s,a_{i})-\overline{P_{i}^{\theta}}(s^{\prime}|s,a_{i})\right|\geq\frac{\epsilon}{|\mathcal{S}|}\right)
4τ|𝒮|2|𝒜i|exp(ϵ2σ2Tτ32|𝒮|2).\displaystyle\leq 4\tau|\mathcal{S}|^{2}|\mathcal{A}_{i}|\exp\left(-\frac{\epsilon^{2}\sigma^{2}\lfloor\frac{T}{\tau}\rfloor}{32|\mathcal{S}|^{2}}\right).

Now we prove (45). Since

{riθ^riθ¯ϵ}=(s,ai){|riθ^(s,ai)riθ¯(s,ai)|ϵ},\displaystyle\left\{\|\widehat{r^{\theta}_{i}}-\overline{r^{\theta}_{i}}\|_{\infty}\geq\epsilon\right\}=\mathop{\cup}_{(s,a_{i})}\left\{\left|\widehat{r_{i}^{\theta}}(s,a_{i})-\overline{r_{i}^{\theta}}(s,a_{i})\right|\geq\epsilon\right\},

according to Lemma 13,

Pr(riθ^riθ¯ϵ)\displaystyle\Pr\left(\|\widehat{r^{\theta}_{i}}-\overline{r^{\theta}_{i}}\|_{\infty}\geq\epsilon\right) (s,ai)Pr(|riθ^(s,ai)riθ¯(s,ai)|ϵ)\displaystyle\leq\sum_{(s,a_{i})}\Pr\left(\left|\widehat{r_{i}^{\theta}}(s,a_{i})-\overline{r_{i}^{\theta}}(s,a_{i})\right|\geq\epsilon\right)
4τ|𝒮||𝒜i|exp(ϵ2σ2Tτ32),\displaystyle\leq 4\tau|\mathcal{S}||\mathcal{A}_{i}|\exp\left(-\frac{\epsilon^{2}\sigma^{2}\left\lfloor\frac{T}{\tau}\right\rfloor}{32}\right),

which completes the proof of the corollary. ∎
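The indicator events used in Lemmas 12 and 13 correspond to natural count-based estimates of Piθ¯\overline{P_{i}^{\theta}} and riθ¯\overline{r_{i}^{\theta}} formed from a single trajectory. The exact estimators are specified in Algorithm 1 in the main text; the following Python sketch is only meant to illustrate this counting construction, with the trajectory format and variable names as assumptions.

```python
import numpy as np

def empirical_estimates(states, acts_i, rewards_i, n_states, n_actions_i):
    """Count-based estimates of P_i(s'|s,a_i) and r_i(s,a_i) from one trajectory."""
    visit = np.zeros((n_states, n_actions_i))
    trans = np.zeros((n_states, n_actions_i, n_states))
    rsum = np.zeros((n_states, n_actions_i))
    for t in range(len(rewards_i)):
        s, ai = states[t], acts_i[t]
        visit[s, ai] += 1.0
        rsum[s, ai] += rewards_i[t]
        if t + 1 < len(states):
            trans[s, ai, states[t + 1]] += 1.0      # transition counts
    count = np.maximum(visit, 1.0)                  # guard unvisited pairs
    return trans / count[:, :, None], rsum / count  # (P_hat, r_hat)
```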

We are now ready to prove Theorem 7.

Proof.

(of Theorem 7) From the definition of Qiθ^,Qiθ¯\widehat{Q_{i}^{\theta}},\overline{Q_{i}^{\theta}},

Qiθ¯\displaystyle\overline{Q_{i}^{\theta}} =(IγMiθ¯)1riθ¯,\displaystyle=(I-\gamma\overline{M^{\theta}_{i}})^{-1}\overline{r_{i}^{\theta}},
Qiθ^\displaystyle\widehat{Q_{i}^{\theta}} =(IγMiθ^)1riθ^,\displaystyle=(I-\gamma\widehat{M^{\theta}_{i}})^{-1}\widehat{r_{i}^{\theta}},

we have that

Qiθ^Qiθ¯\displaystyle\|\widehat{Q_{i}^{\theta}}-\overline{Q_{i}^{\theta}}\|_{\infty} =(IγMiθ^)1riθ^(IγMiθ¯)1riθ¯\displaystyle=\left\|(I-\gamma\widehat{M^{\theta}_{i}})^{-1}\widehat{r_{i}^{\theta}}-(I-\gamma\overline{M^{\theta}_{i}})^{-1}\overline{r_{i}^{\theta}}\right\|_{\infty}
=(IγMiθ¯)1(riθ^riθ¯)+((IγMiθ^)1(IγMiθ¯)1)riθ^\displaystyle=\left\|(I-\gamma\overline{M^{\theta}_{i}})^{-1}(\widehat{r_{i}^{\theta}}-\overline{r_{i}^{\theta}})+\left((I-\gamma\widehat{M^{\theta}_{i}})^{-1}-(I-\gamma\overline{M^{\theta}_{i}})^{-1}\right)\widehat{r_{i}^{\theta}}\right\|_{\infty}
(IγMiθ¯)1(riθ^riθ¯)+γ(IγMiθ¯)1(Miθ^Miθ¯)(IγMiθ^)1riθ^.\displaystyle\leq\left\|(I-\gamma\overline{M^{\theta}_{i}})^{-1}(\widehat{r_{i}^{\theta}}-\overline{r_{i}^{\theta}})\right\|_{\infty}+\left\|\gamma(I-\gamma\overline{M^{\theta}_{i}})^{-1}(\widehat{M^{\theta}_{i}}-\overline{M^{\theta}_{i}})(I-\gamma\widehat{M^{\theta}_{i}})^{-1}\widehat{r_{i}^{\theta}}\right\|_{\infty}.

Because both Miθ¯\overline{M^{\theta}_{i}} and Miθ^\widehat{M^{\theta}_{i}} are transition probability matrices, we have:

Miθ¯x\displaystyle\|\overline{M^{\theta}_{i}}x\|_{\infty} x\displaystyle\leq\|x\|_{\infty}
Miθ^x\displaystyle\|\widehat{M^{\theta}_{i}}x\|_{\infty} x\displaystyle\leq\|x\|_{\infty}
(IγMiθ¯)1x\displaystyle\|(I-\gamma\overline{M^{\theta}_{i}})^{-1}x\|_{\infty} 11γx\displaystyle\leq\frac{1}{1-\gamma}\|x\|_{\infty}
(IγMiθ^)1x\displaystyle\|(I-\gamma\widehat{M^{\theta}_{i}})^{-1}x\|_{\infty} 11γx\displaystyle\leq\frac{1}{1-\gamma}\|x\|_{\infty}

Thus,

Qiθ^Qiθ¯\displaystyle\|\widehat{Q_{i}^{\theta}}-\overline{Q_{i}^{\theta}}\|_{\infty} (IγMiθ¯)1(riθ^riθ¯)+γ(IγMiθ¯)1(Miθ^Miθ¯)(IγMiθ^)1riθ^\displaystyle\leq\left\|(I-\gamma\overline{M^{\theta}_{i}})^{-1}(\widehat{r_{i}^{\theta}}-\overline{r_{i}^{\theta}})\right\|_{\infty}+\left\|\gamma(I-\gamma\overline{M^{\theta}_{i}})^{-1}(\widehat{M^{\theta}_{i}}-\overline{M^{\theta}_{i}})(I-\gamma\widehat{M^{\theta}_{i}})^{-1}\widehat{r_{i}^{\theta}}\right\|_{\infty}
11γriθ^riθ¯+γ(1γ)2Miθ^Miθ¯riθ^\displaystyle\leq\frac{1}{1-\gamma}\|\widehat{r_{i}^{\theta}}-\overline{r_{i}^{\theta}}\|_{\infty}+\frac{\gamma}{(1-\gamma)^{2}}\|\widehat{M^{\theta}_{i}}-\overline{M^{\theta}_{i}}\|_{\infty}\|\widehat{r_{i}^{\theta}}\|_{\infty}
11γriθ^riθ¯+γ(1γ)2Miθ^Miθ¯\displaystyle\leq\frac{1}{1-\gamma}\|\widehat{r_{i}^{\theta}}-\overline{r_{i}^{\theta}}\|_{\infty}+\frac{\gamma}{(1-\gamma)^{2}}\|\widehat{M^{\theta}_{i}}-\overline{M^{\theta}_{i}}\|_{\infty}

Thus if

riθ^riθ¯(1γ)2ϵ,Miθ^Miθ¯(1γ)2ϵ,\displaystyle\|\widehat{r_{i}^{\theta}}-\overline{r_{i}^{\theta}}\|_{\infty}\leq(1-\gamma)^{2}\epsilon,\quad\|\widehat{M^{\theta}_{i}}-\overline{M^{\theta}_{i}}\|_{\infty}\leq(1-\gamma)^{2}\epsilon,

we have that:

Qiθ^Qiθ¯ϵ.\|\widehat{Q_{i}^{\theta}}-\overline{Q_{i}^{\theta}}\|_{\infty}\leq\epsilon.

Thus, from Corollary 1,

Pr(Qiθ^Qiθ¯ϵ)\displaystyle\Pr\left(\|\widehat{Q_{i}^{\theta}}-\overline{Q_{i}^{\theta}}\|_{\infty}\geq\epsilon\right) Pr(riθ^riθ¯(1γ)2ϵ)+Pr(Miθ^Miθ¯(1γ)2ϵ)\displaystyle\leq\Pr\left(\|\widehat{r_{i}^{\theta}}-\overline{r_{i}^{\theta}}\|_{\infty}\geq(1-\gamma)^{2}\epsilon\right)+\Pr\left(\|\widehat{M^{\theta}_{i}}-\overline{M^{\theta}_{i}}\|_{\infty}\geq(1-\gamma)^{2}\epsilon\right)
4τ|𝒮||𝒜i|exp((1γ)4ϵ2σ2Tτ32)+4τ|𝒮|2|𝒜i|exp((1γ)4ϵ2σ2Tτ32|𝒮|2)\displaystyle\leq 4\tau|\mathcal{S}||\mathcal{A}_{i}|\exp\left(-\frac{(1-\gamma)^{4}\epsilon^{2}\sigma^{2}\lfloor\frac{T}{\tau}\rfloor}{32}\right)+4\tau|\mathcal{S}|^{2}|\mathcal{A}_{i}|\exp\left(-\frac{(1-\gamma)^{4}\epsilon^{2}\sigma^{2}\lfloor\frac{T}{\tau}\rfloor}{32|\mathcal{S}|^{2}}\right)
8τ|𝒮|2|𝒜i|exp((1γ)4ϵ2σ2Tτ32|𝒮|2),\displaystyle\leq 8\tau|\mathcal{S}|^{2}|\mathcal{A}_{i}|\exp\left(-\frac{(1-\gamma)^{4}\epsilon^{2}\sigma^{2}\lfloor\frac{T}{\tau}\rfloor}{32|\mathcal{S}|^{2}}\right),

which completes the proof. ∎
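Once Piθ^\widehat{P_{i}^{\theta}} and riθ^\widehat{r_{i}^{\theta}} are available, the identity Qiθ^=(IγMiθ^)1riθ^\widehat{Q_{i}^{\theta}}=(I-\gamma\widehat{M^{\theta}_{i}})^{-1}\widehat{r_{i}^{\theta}} used above amounts to a single linear solve. Below is a minimal sketch, with state–action pairs flattened into one index and the policy πθi\pi_{\theta_{i}} assumed to be given as an |𝒮|×|𝒜i||\mathcal{S}|\times|\mathcal{A}_{i}| array (the data layout is an assumption made for illustration):

```python
import numpy as np

def averaged_q(P_hat, r_hat, pi_i, gamma):
    """Solve Q = r + gamma*M*Q with M[(s,a),(s',a')] = P_hat[s,a,s'] * pi_i[s',a']."""
    n_states, n_actions = r_hat.shape
    n = n_states * n_actions
    M = np.zeros((n, n))
    for s in range(n_states):
        for a in range(n_actions):
            for sp in range(n_states):
                for ap in range(n_actions):
                    M[s * n_actions + a, sp * n_actions + ap] = P_hat[s, a, sp] * pi_i[sp, ap]
    Q = np.linalg.solve(np.eye(n) - gamma * M, r_hat.reshape(n))
    return Q.reshape(n_states, n_actions)
```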

I.2 Bounding the estimation error of dθd_{\theta}

We first state our main result:

Theorem 8.

(Estimation error of dθd_{\theta}) Under Assumption 2,

Pr(dθ^dθ1ϵ)4τ|𝒮|2exp((1γ)2ϵ2σS2Tτ32γ2|𝒮|2),\Pr\left(\|\widehat{d_{\theta}}-d_{\theta}\|_{1}\geq\epsilon\right)\leq 4\tau|\mathcal{S}|^{2}\exp\left(-\frac{(1-\gamma)^{2}\epsilon^{2}\sigma_{S}^{2}\left\lfloor\frac{T}{\tau}\right\rfloor}{32\gamma^{2}|\mathcal{S}|^{2}}\right),

i.e., when

T32τ|𝒮|2(1γ)2ϵ2σS2log(4τ|𝒮|2δ)+1,T\geq\frac{32\tau|\mathcal{S}|^{2}}{(1-\gamma)^{2}\epsilon^{2}\sigma_{S}^{2}}\log\left(\frac{4\tau|\mathcal{S}|^{2}}{\delta}\right)+1,

with probability at least 1δ1-\delta, dθ^dθ1ϵ\|\widehat{d_{\theta}}-d_{\theta}\|_{1}\leq\epsilon.

Similar to the previous section, the proof of the theorem begins by bounding the estimation error |P𝒮θ^(s|s)P𝒮θ¯(s|s)||\widehat{P_{\mathcal{S}}^{\theta}}(s^{\prime}|s)-\overline{P_{\mathcal{S}}^{\theta}}(s^{\prime}|s)|.

Lemma 14.

Under Assumption 2, for fixed s,ss^{\prime},s and any ϵ1\epsilon\leq 1,

Pr(|P𝒮θ^(s|s)P𝒮θ¯(s|s)|ϵ)4τexp(ϵ2σS2Tτ32)\Pr\left(\left|\widehat{P_{\mathcal{S}}^{\theta}}(s^{\prime}|s)-\overline{P_{\mathcal{S}}^{\theta}}(s^{\prime}|s)\right|\geq\epsilon\right)\leq 4\tau\exp\left(-\frac{\epsilon^{2}\sigma_{S}^{2}\left\lfloor\frac{T}{\tau}\right\rfloor}{32}\right)
Proof.

According to the definition of P𝒮θ^\widehat{P_{\mathcal{S}}^{\theta}}, we have that

{P𝒮θ^(s|s)P𝒮θ¯(s|s)ϵ}\displaystyle\left\{\widehat{P_{\mathcal{S}}^{\theta}}(s^{\prime}|s)-\overline{P_{\mathcal{S}}^{\theta}}(s^{\prime}|s)\geq\epsilon\right\}
{t=0T1(𝟏{st+1=s,st=s}(P𝒮θ¯(s|s)+ϵ)𝟏{st=s})0}{t=0T1𝟏{st=s}=0}\displaystyle\subseteq\left\{\sum_{t=0}^{T-1}\left(\mathbf{1}\{s_{t+1}=s^{\prime},s_{t}=s\}-(\overline{P_{\mathcal{S}}^{\theta}}(s^{\prime}|s)+\epsilon)\mathbf{1}\{s_{t}=s\}\right)\geq 0\right\}\cup\left\{\sum_{t=0}^{T-1}\mathbf{1}\{s_{t}=s\}=0\right\}
m=0τ1{k=0T1mτ(𝟏{skτ+m+1=s,skτ+m=s}(P𝒮θ¯(s|s)+ϵ)𝟏{skτ+m=s})0}\displaystyle\subseteq\mathop{\cup}_{m=0}^{\tau-1}\left\{\sum_{k=0}^{\lfloor\frac{T-1-m}{\tau}\rfloor}\left(\mathbf{1}\{s_{k\tau+m+1}=s^{\prime},s_{k\tau+m}=s\}-(\overline{P_{\mathcal{S}}^{\theta}}(s^{\prime}|s)+\epsilon)\mathbf{1}\{s_{k\tau+m}=s\}\right)\geq 0\right\}
m=0τ1{k=0T1mτ𝟏{skτ+m=s}=0}\displaystyle\qquad\mathop{\cup}_{m=0}^{\tau-1}\left\{\sum_{k=0}^{\lfloor\frac{T-1-m}{\tau}\rfloor}\mathbf{1}\{s_{k\tau+m}=s\}=0\right\}

Let:

Am\displaystyle A_{m} :={k=0T1mτ(𝟏{skτ+m+1=s,skτ+m=s}(P𝒮θ¯(s|s)+ϵ)𝟏{skτ+m=s})0}\displaystyle:=\left\{\sum_{k=0}^{\lfloor\frac{T-1-m}{\tau}\rfloor}\left(\mathbf{1}\{s_{k\tau+m+1}=s^{\prime},s_{k\tau+m}=s\}-(\overline{P_{\mathcal{S}}^{\theta}}(s^{\prime}|s)+\epsilon)\mathbf{1}\{s_{k\tau+m}=s\}\right)\geq 0\right\}
Am\displaystyle A_{m}^{\prime} :={k=0T1mτ𝟏{skτ+m=s}=0}\displaystyle:=\left\{\sum_{k=0}^{\lfloor\frac{T-1-m}{\tau}\rfloor}\mathbf{1}\{s_{k\tau+m}=s\}=0\right\}
Xk,m\displaystyle X_{k,m} :=𝟏{skτ+m+1=s,skτ+m=s}(P𝒮θ¯(s|s)+ϵ)𝟏{skτ+m=s}\displaystyle:=\mathbf{1}\{s_{k\tau+m+1}=s^{\prime},s_{k\tau+m}=s\}-(\overline{P_{\mathcal{S}}^{\theta}}(s^{\prime}|s)+\epsilon)\mathbf{1}\{s_{k\tau+m}=s\}
Xk,m\displaystyle X_{k,m}^{\prime} :=𝟏{skτ+m=s}\displaystyle:=\mathbf{1}\{s_{k\tau+m}=s\}
Yk,m\displaystyle Y_{k,m} :=Xk,m𝔼[Xk,m|(k1)τ+m]\displaystyle:=X_{k,m}-\mathbb{{E}}[X_{k,m}|\mathcal{F}_{(k-1)\tau+m}]
Yk,m\displaystyle Y_{k,m}^{\prime} :=Xk,m𝔼[Xk,m|(k1)τ+m]\displaystyle:=X_{k,m}^{\prime}-\mathbb{{E}}[X_{k,m}^{\prime}|\mathcal{F}_{(k-1)\tau+m}]

Then {Yk,m}k=0T1mτ\{Y_{k,m}\}_{k=0}^{\lfloor\frac{T-1-m}{\tau}\rfloor} is a martingale difference sequence. Because ϵ1\epsilon\leq 1, it is easy to verify that |Xk,m|2,|Xk,m|1|X_{k,m}|\leq 2,|X_{k,m}^{\prime}|\leq 1. We have that:

|Yk,m||Xk,m|+𝔼[|Xk,m||(k1)τ+m]4,|Yk,m||Xk,m|+𝔼[|Xk,m||(k1)τ+m]2.|Y_{k,m}|\leq|X_{k,m}|+\mathbb{{E}}[|X_{k,m}||\mathcal{F}_{(k-1)\tau+m}]\leq 4,~{}~{}|Y_{k,m}^{\prime}|\leq|X_{k,m}^{\prime}|+\mathbb{{E}}[|X_{k,m}^{\prime}||\mathcal{F}_{(k-1)\tau+m}]\leq 2.

Further,

𝔼[Xk,m|(k1)τ+m]=𝔼[𝟏{skτ+m=s}|(k1)τ+m]σS,\mathbb{{E}}[X^{\prime}_{k,m}|\mathcal{F}_{(k-1)\tau+m}]=\mathbb{{E}}[\mathbf{1}\{s_{k\tau+m}=s\}|\mathcal{F}_{(k-1)\tau+m}]\geq\sigma_{S},

and that

𝔼[Xk,m|(k1)τ+m]\displaystyle\mathbb{{E}}[X_{k,m}|\mathcal{F}_{(k-1)\tau+m}]
=𝔼[𝟏{skτ+m+1=s,skτ+m=s}(P𝒮θ¯(s|s)+ϵ)𝟏{skτ+m=s}|(k1)τ+m]\displaystyle=\mathbb{{E}}[\mathbf{1}\{s_{k\tau+m+1}=s^{\prime},s_{k\tau+m}=s\}-(\overline{P_{\mathcal{S}}^{\theta}}(s^{\prime}|s)+\epsilon)\mathbf{1}\{s_{k\tau+m}=s\}|\mathcal{F}_{(k-1)\tau+m}]
=ϵ𝔼[𝟏{skτ+m=s}|(k1)τ+m]ϵσS\displaystyle=-\epsilon\mathbb{{E}}[\mathbf{1}\{s_{k\tau+m}=s\}|\mathcal{F}_{(k-1)\tau+m}]\leq-\epsilon\sigma_{S}

The step from the second line to the third line uses the fact that:

𝔼[𝟏{st+1=s,st=s}|t1]\displaystyle\mathbb{{E}}[\mathbf{1}\{s_{t+1}=s^{\prime},s_{t}=s\}|\mathcal{F}_{t-1}] =P(s|st1,at1)aπθ(a|s)P(s|s,a)\displaystyle=P(s|s_{t-1},a_{t-1})\sum_{a}\pi_{\theta}(a|s)P(s^{\prime}|s,a)
=P(s|st1,at1)P𝒮θ¯(s|s)\displaystyle=P(s|s_{t-1},a_{t-1})\overline{P_{\mathcal{S}}^{\theta}}(s^{\prime}|s)
=𝔼[P𝒮θ¯(s|s)𝟏{st=s}|t1]\displaystyle=\mathbb{{E}}[\overline{P_{\mathcal{S}}^{\theta}}(s^{\prime}|s)\mathbf{1}\{s_{t}=s\}|\mathcal{F}_{t-1}]

and the inequality in the third line is derived directly from Assumption 2.

According to the Azuma–Hoeffding inequality:

Pr(Am)\displaystyle\Pr(A_{m}) =Pr(k=0T1mτXk,m0)\displaystyle=\Pr\left(\sum_{k=0}^{\lfloor\frac{T-1-m}{\tau}\rfloor}X_{k,m}\geq 0\right)
=Pr(k=0T1mτYk,mk=0T1mτ𝔼[Xk,m|(k1)τ+m])\displaystyle=\Pr\left(\sum_{k=0}^{\lfloor\frac{T-1-m}{\tau}\rfloor}Y_{k,m}\geq-\sum_{k=0}^{\lfloor\frac{T-1-m}{\tau}\rfloor}\mathbb{{E}}[X_{k,m}|\mathcal{F}_{(k-1)\tau+m}]\right)
Pr(k=0T1mτYk,mT1m+ττϵσS)\displaystyle\leq\Pr\left(\sum_{k=0}^{\lfloor\frac{T-1-m}{\tau}\rfloor}Y_{k,m}\geq\left\lfloor\frac{T-1-m+\tau}{\tau}\right\rfloor\epsilon\sigma_{S}\right)
exp(ϵ2σS2Tτ32)\displaystyle\leq\exp\left(-\frac{\epsilon^{2}\sigma_{S}^{2}\left\lfloor\frac{T}{\tau}\right\rfloor}{32}\right)

Similarly, from Azuma-Hoeffding inequality,

Pr(Am)\displaystyle\Pr(A_{m}^{\prime}) =Pr(k=0T1mτXk,m=0)\displaystyle=\Pr\left(\sum_{k=0}^{\lfloor\frac{T-1-m}{\tau}\rfloor}X_{k,m}^{\prime}=0\right)
=Pr(k=0T1mτYk,m=k=0T1mτ𝔼[Xk,m|(k1)τ+m])\displaystyle=\Pr\left(\sum_{k=0}^{\lfloor\frac{T-1-m}{\tau}\rfloor}Y_{k,m}^{\prime}=-\sum_{k=0}^{\lfloor\frac{T-1-m}{\tau}\rfloor}\mathbb{{E}}[X_{k,m}^{\prime}|\mathcal{F}_{(k-1)\tau+m}]\right)
Pr(k=0T1mτYk,mT1m+ττσS)\displaystyle\leq\Pr\left(\sum_{k=0}^{\lfloor\frac{T-1-m}{\tau}\rfloor}Y_{k,m}^{\prime}\leq-\left\lfloor\frac{T-1-m+\tau}{\tau}\right\rfloor\sigma_{S}\right)
exp(σS2Tτ8)\displaystyle\leq\exp\left(-\frac{\sigma_{S}^{2}\left\lfloor\frac{T}{\tau}\right\rfloor}{8}\right)

Thus

Pr(P𝒮θ^(s|s)P𝒮θ¯(s|s)ϵ)m=0τ1Pr(Am)+Pr(Am)\displaystyle\Pr\left(\widehat{P_{\mathcal{S}}^{\theta}}(s^{\prime}|s)-\overline{P_{\mathcal{S}}^{\theta}}(s^{\prime}|s)\geq\epsilon\right)\leq\sum_{m=0}^{\tau-1}\Pr\left(A_{m}\right)+\Pr\left(A_{m}^{\prime}\right)
τexp(ϵ2σS2Tτ32)+τexp(σS2Tτ8)2τexp(ϵ2σS2Tτ32)\displaystyle\leq\tau\exp\left(-\frac{\epsilon^{2}\sigma_{S}^{2}\left\lfloor\frac{T}{\tau}\right\rfloor}{32}\right)+\tau\exp\left(-\frac{\sigma_{S}^{2}\left\lfloor\frac{T}{\tau}\right\rfloor}{8}\right)\leq 2\tau\exp\left(-\frac{\epsilon^{2}\sigma_{S}^{2}\left\lfloor\frac{T}{\tau}\right\rfloor}{32}\right)

Similarly

Pr(P𝒮θ^(s|s)P𝒮θ¯(s|s)ϵ)2τexp(ϵ2σS2Tτ32)\displaystyle\Pr\left(\widehat{P_{\mathcal{S}}^{\theta}}(s^{\prime}|s)-\overline{P_{\mathcal{S}}^{\theta}}(s^{\prime}|s)\leq-\epsilon\right)\leq 2\tau\exp\left(-\frac{\epsilon^{2}\sigma_{S}^{2}\left\lfloor\frac{T}{\tau}\right\rfloor}{32}\right)
Pr(|P𝒮θ^(s|s)P𝒮θ¯(s|s)|ϵ)4τexp(ϵ2σS2Tτ32)\displaystyle\Longrightarrow\Pr\left(\left|\widehat{P_{\mathcal{S}}^{\theta}}(s^{\prime}|s)-\overline{P_{\mathcal{S}}^{\theta}}(s^{\prime}|s)\right|\geq\epsilon\right)\leq 4\tau\exp\left(-\frac{\epsilon^{2}\sigma_{S}^{2}\left\lfloor\frac{T}{\tau}\right\rfloor}{32}\right)

which completes the proof. ∎

Corollary 2.
Pr(P𝒮θ^P𝒮θ¯ϵ)\displaystyle\Pr\left(\left\|\widehat{P_{\mathcal{S}}^{\theta}}-\overline{P_{\mathcal{S}}^{\theta}}\right\|_{\infty}\geq\epsilon\right) 4τ|𝒮|2exp(ϵ2σS2Tτ32|𝒮|2)\displaystyle\leq 4\tau|\mathcal{S}|^{2}\exp\left(-\frac{\epsilon^{2}\sigma_{S}^{2}\lfloor\frac{T}{\tau}\rfloor}{32|\mathcal{S}|^{2}}\right) (46)
Proof.
P𝒮θ^P𝒮θ¯\displaystyle\left\|\widehat{P_{\mathcal{S}}^{\theta}}-\overline{P_{\mathcal{S}}^{\theta}}\right\|_{\infty} =maxss|P𝒮θ^(s|s)P𝒮θ¯(s|s)|\displaystyle=\max_{s}\sum_{s^{\prime}}\left|\widehat{P_{\mathcal{S}}^{\theta}}(s^{\prime}|s)-\overline{P_{\mathcal{S}}^{\theta}}(s^{\prime}|s)\right|

Thus,

{P𝒮θ^P𝒮θ¯ϵ}\displaystyle\left\{\left\|\widehat{P_{\mathcal{S}}^{\theta}}-\overline{P_{\mathcal{S}}^{\theta}}\right\|_{\infty}\geq\epsilon\right\} =s{s|P𝒮θ^(s|s)P𝒮θ¯(s|s)|ϵ}\displaystyle=\mathop{\cup}_{s}\left\{\sum_{s^{\prime}}\left|\widehat{P_{\mathcal{S}}^{\theta}}(s^{\prime}|s)-\overline{P_{\mathcal{S}}^{\theta}}(s^{\prime}|s)\right|\geq\epsilon\right\}
ss{|P𝒮θ^(s|s)P𝒮θ¯(s|s)|ϵ|𝒮|}\displaystyle\subseteq\mathop{\cup}_{s}\mathop{\cup}_{s^{\prime}}\left\{\left|\widehat{P_{\mathcal{S}}^{\theta}}(s^{\prime}|s)-\overline{P_{\mathcal{S}}^{\theta}}(s^{\prime}|s)\right|\geq\frac{\epsilon}{|\mathcal{S}|}\right\}

Then according to Lemma 14,

Pr(P𝒮θ^P𝒮θ¯ϵ)\displaystyle\Pr\left(\left\|\widehat{P_{\mathcal{S}}^{\theta}}-\overline{P_{\mathcal{S}}^{\theta}}\right\|_{\infty}\geq\epsilon\right) (s,s)Pr(|P𝒮θ^(s|s)P𝒮θ¯(s|s)|ϵ|𝒮|)\displaystyle\leq\sum_{(s^{\prime},s)}\Pr\left(\left|\widehat{P_{\mathcal{S}}^{\theta}}(s^{\prime}|s)-\overline{P_{\mathcal{S}}^{\theta}}(s^{\prime}|s)\right|\geq\frac{\epsilon}{|\mathcal{S}|}\right)
4τ|𝒮|2exp(ϵ2σS2Tτ32|𝒮|2).\displaystyle\leq 4\tau|\mathcal{S}|^{2}\exp\left(-\frac{\epsilon^{2}\sigma_{S}^{2}\lfloor\frac{T}{\tau}\rfloor}{32|\mathcal{S}|^{2}}\right).

which completes the proof of the corollary. ∎

Proof.

(Proof of Theorem 8)

dθ^dθ1\displaystyle\|\widehat{d_{\theta}}-d_{\theta}\|_{1} =(1γ)((IγP𝒮θ^)1(IγP𝒮θ¯)1)ρ1\displaystyle=(1-\gamma)\|\left(\left(I-\gamma\widehat{P_{\mathcal{S}}^{\theta}}^{\top}\right)^{-1}-\left(I-\gamma\overline{P_{\mathcal{S}}^{\theta}}^{\top}\right)^{-1}\right)\rho\|_{1}
(1γ)(IγP𝒮θ^)1(IγP𝒮θ¯)11ρ1\displaystyle\leq(1-\gamma)\left\|\left(I-\gamma\widehat{P_{\mathcal{S}}^{\theta}}^{\top}\right)^{-1}-\left(I-\gamma\overline{P_{\mathcal{S}}^{\theta}}^{\top}\right)^{-1}\right\|_{1}\|\rho\|_{1}
(1γ)(IγP𝒮θ^)1(IγP𝒮θ¯)1ρ1\displaystyle\leq(1-\gamma)\left\|\left(I-\gamma\widehat{P_{\mathcal{S}}^{\theta}}\right)^{-1}-\left(I-\gamma\overline{P_{\mathcal{S}}^{\theta}}\right)^{-1}\right\|_{\infty}\|\rho\|_{1}
=γ(1γ)(IγP𝒮θ¯)1(P𝒮θ^P𝒮θ¯)(IγP𝒮θ^)1\displaystyle=\gamma(1-\gamma)\left\|\left(I-\gamma\overline{P_{\mathcal{S}}^{\theta}}\right)^{-1}\left(\widehat{P_{\mathcal{S}}^{\theta}}-\overline{P_{\mathcal{S}}^{\theta}}\right)\left(I-\gamma\widehat{P_{\mathcal{S}}^{\theta}}\right)^{-1}\right\|_{\infty}
γ1γP𝒮θ^P𝒮θ¯\displaystyle\leq\frac{\gamma}{1-\gamma}\left\|\widehat{P_{\mathcal{S}}^{\theta}}-\overline{P_{\mathcal{S}}^{\theta}}\right\|_{\infty}

Thus

Pr(dθ^dθ1ϵ)Pr(P𝒮θ^P𝒮θ¯1γγϵ)\displaystyle\Pr\left(\|\widehat{d_{\theta}}-d_{\theta}\|_{1}\geq\epsilon\right)\leq\Pr\left(\left\|\widehat{P_{\mathcal{S}}^{\theta}}-\overline{P_{\mathcal{S}}^{\theta}}\right\|_{\infty}\geq\frac{1-\gamma}{\gamma}\epsilon\right)
4τ|𝒮|2exp((1γ)2ϵ2σS2Tτ32γ2|𝒮|2),\displaystyle\leq 4\tau|\mathcal{S}|^{2}\exp\left(-\frac{(1-\gamma)^{2}\epsilon^{2}\sigma_{S}^{2}\left\lfloor\frac{T}{\tau}\right\rfloor}{32\gamma^{2}|\mathcal{S}|^{2}}\right),

which completes the proof. ∎
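The closed form dθ=(1γ)(IγP𝒮θ)1ρd_{\theta}=(1-\gamma)(I-\gamma\overline{P_{\mathcal{S}}^{\theta}}^{\top})^{-1}\rho used in this proof (and its plug-in counterpart dθ^\widehat{d_{\theta}}) can likewise be evaluated with a single linear solve; a minimal sketch, assuming P_S is given as the |𝒮|×|𝒮||\mathcal{S}|\times|\mathcal{S}| row-stochastic state-transition matrix and rho as the initial state distribution:

```python
import numpy as np

def discounted_visitation(P_S, rho, gamma):
    """d = (1 - gamma) * (I - gamma * P_S^T)^{-1} rho."""
    return (1.0 - gamma) * np.linalg.solve(np.eye(len(rho)) - gamma * P_S.T, rho)
```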

I.3 Proof of Theorem 6

Proof.

Since the stochastic game satisfies (τ,σS)(\tau,\sigma_{S})-sufficient exploration on states, for any θ𝒳α\theta\in\mathcal{X}^{\alpha} it also satisfies (τ,ασSmaxi|𝒜i|)(\tau,\frac{\alpha\sigma_{S}}{\max_{i}|\mathcal{A}_{i}|})-sufficient exploration. Substituting this into Theorem 7, we have that for

TJ32τ(1+α)2|𝒮|3i|𝒜i|maxi|𝒜i|2(1γ)6ϵg2α2σS2log(16τTG|𝒮|2i|𝒜i|δ)+1,T_{J}\geq\frac{32\tau(1+\alpha)^{2}|\mathcal{S}|^{3}\sum_{i}|\mathcal{A}_{i}|\max_{i}|\mathcal{A}_{i}|^{2}}{(1-\gamma)^{6}\epsilon_{g}^{2}\alpha^{2}\sigma_{S}^{2}}\log\left(\frac{16\tau T_{G}|\mathcal{S}|^{2}\sum_{i}|\mathcal{A}_{i}|}{\delta}\right)+1, (47)

with probability at least 1δ2TG1-\frac{\delta}{2T_{G}},

Qθ(k)¯Qθ(k)^(1γ)ϵg(1+α)|𝒮|i|𝒜i|.\|\overline{Q^{\theta^{(k)}}}-\widehat{Q^{\theta^{(k)}}}\|_{\infty}\leq\frac{(1-\gamma)\epsilon_{g}}{(1+\alpha)\sqrt{|\mathcal{S}|\sum_{i}|\mathcal{A}_{i}|}}.

Similarly, applying Theorem 8, we have that with probability at least 1δ2TG1-\frac{\delta}{2T_{G}},

dθ(k)dθ(k)^1(1γ)2ϵgα(1+α)|𝒮|i|𝒜i|.\|d_{\theta^{(k)}}-\widehat{d_{\theta^{(k)}}}\|_{1}\leq\frac{(1-\gamma)^{2}\epsilon_{g}\alpha}{(1+\alpha)\sqrt{|\mathcal{S}|\sum_{i}|\mathcal{A}_{i}|}}.

Since:

|[Φ(θ)^Φ(θ)](s,ai)|\displaystyle\left|\left[\nabla\Phi(\theta)-\widehat{\nabla}\Phi(\theta)\right]_{(s,a_{i})}\right| =|11γdθ(s)Qiθ¯(s,ai)11γdθ^(s)Qiθ^(s,ai)|\displaystyle=\left|\frac{1}{1-\gamma}d_{\theta}(s)\overline{Q_{i}^{\theta}}(s,a_{i})-\frac{1}{1-\gamma}\widehat{d_{\theta}}(s)\widehat{Q_{i}^{\theta}}(s,a_{i})\right|
|11γdθ(s)(Qiθ¯(s,ai)Qiθ^(s,ai))|+|11γQiθ^(s,ai)(dθ(s)dθ^(s))|\displaystyle\leq\left|\frac{1}{1-\gamma}d_{\theta}(s)(\overline{Q_{i}^{\theta}}(s,a_{i})-\widehat{Q_{i}^{\theta}}(s,a_{i}))\right|+\left|\frac{1}{1-\gamma}\widehat{Q_{i}^{\theta}}(s,a_{i})(d_{\theta}(s)-\widehat{d_{\theta}}(s))\right|
11γ|Qiθ¯(s,ai)Qiθ^(s,ai)|+1(1γ)2|dθ(s)dθ^(s)|\displaystyle\leq\frac{1}{1-\gamma}|\overline{Q_{i}^{\theta}}(s,a_{i})-\widehat{Q_{i}^{\theta}}(s,a_{i})|+\frac{1}{(1-\gamma)^{2}}|d_{\theta}(s)-\widehat{d_{\theta}}(s)|
ϵg(1+α)|𝒮|j|𝒜j|+ϵgα(1+α)|𝒮|j|𝒜j|\displaystyle\leq\frac{\epsilon_{g}}{(1+\alpha)\sqrt{|\mathcal{S}|\sum_{j}|\mathcal{A}_{j}|}}+\frac{\epsilon_{g}\alpha}{(1+\alpha)\sqrt{|\mathcal{S}|\sum_{j}|\mathcal{A}_{j}|}}
=ϵg|𝒮|j|𝒜j|\displaystyle=\frac{\epsilon_{g}}{\sqrt{|\mathcal{S}|\sum_{j}|\mathcal{A}_{j}|}}

Thus, with probability at least 1δ1-\delta,

Φ(θ(k))^Φ(θ(k))22\displaystyle\|\nabla\Phi(\theta^{(k)})-\widehat{\nabla}\Phi(\theta^{(k)})\|_{2}^{2} =isai|[Φ(θ(k))^Φ(θ(k))](s,ai)|2ϵg2,0kTG1,\displaystyle=\sum_{i}\sum_{s}\sum_{a_{i}}\left|\left[\nabla\Phi(\theta^{(k)})-\widehat{\nabla}\Phi(\theta^{(k)})\right]_{(s,a_{i})}\right|^{2}\leq\epsilon_{g}^{2},~{}~{}\forall~{}0\leq k\leq T_{G}-1,

which completes the proof. ∎
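Putting the two estimates together, the coordinate formula [^Φ(θ)](s,ai)=11γdθ^(s)Qiθ^(s,ai)\left[\widehat{\nabla}\Phi(\theta)\right]_{(s,a_{i})}=\frac{1}{1-\gamma}\widehat{d_{\theta}}(s)\widehat{Q_{i}^{\theta}}(s,a_{i}) used above can be assembled as in the following sketch (the agent-by-agent stacking order is an assumption made only for illustration):

```python
import numpy as np

def gradient_estimate(d_hat, Q_hat_per_agent, gamma):
    """Stack (1/(1-gamma)) * d_hat[s] * Q_hat_i[s, a_i] over agents i into one vector."""
    parts = [(d_hat[:, None] * Q_i).ravel() / (1.0 - gamma) for Q_i in Q_hat_per_agent]
    return np.concatenate(parts)
```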

Appendix J Proof of Theorem 5

Notations:

We define the following variables that will be useful in the analysis:

G^η(θ):=1η(Proj𝒳(θ+η^Φ(θ))θ)\widehat{G}^{\eta}(\theta):=\frac{1}{\eta}\left(Proj_{\mathcal{X}}(\theta+\eta\widehat{\nabla}\Phi(\theta))-\theta\right)
G^η,α(θ):=1η(Proj𝒳α(θ+η^Φ(θ))θ).\widehat{G}^{\eta,\alpha}(\theta):=\frac{1}{\eta}\left(Proj_{\mathcal{X}^{\alpha}}(\theta+\eta\widehat{\nabla}\Phi(\theta))-\theta\right).
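Since the projection onto 𝒳α\mathcal{X}^{\alpha} decomposes over agents and states, G^η,α(θ)\widehat{G}^{\eta,\alpha}(\theta) can be computed blockwise from a standard simplex projection, assuming (as suggested by the construction in the proof of Lemma 16 below) that Δα\Delta^{\alpha} coincides with the rescaled simplex (1α)Δ+αU(1-\alpha)\Delta+\alpha U. The sketch below further assumes θ\theta and ^Φ(θ)\widehat{\nabla}\Phi(\theta) are stored as lists of per-(agent, state) vectors; this data layout is an illustrative assumption.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection onto the probability simplex (sort-based algorithm)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > (css - 1.0))[0][-1]
    tau = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(v - tau, 0.0)

def project_simplex_alpha(v, alpha):
    """Projection onto Delta^alpha = (1-alpha)*Delta + alpha*Uniform (alpha in [0,1))."""
    u = np.full(len(v), 1.0 / len(v))
    return (1.0 - alpha) * project_simplex((v - alpha * u) / (1.0 - alpha)) + alpha * u

def gradient_mapping_alpha(theta_blocks, grad_blocks, eta, alpha):
    """Blockwise hat{G}^{eta,alpha}(theta) = (Proj_{X^alpha}(theta + eta*grad) - theta)/eta."""
    return [(project_simplex_alpha(th + eta * g, alpha) - th) / eta
            for th, g in zip(theta_blocks, grad_blocks)]
```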

J.1 Optimization Lemmas

Lemma 15.

(Sufficient ascent) Suppose Φ(θ)\Phi(\theta) is β\beta-smooth. Let θ+=Proj𝒳α(θ+η^Φ(θ))\theta^{+}=Proj_{\mathcal{X}^{\alpha}}(\theta+\eta\widehat{\nabla}\Phi(\theta)). Then for η12β\eta\leq\frac{1}{2\beta},

Φ(θ+)Φ(θ)η4G^η,α(θ)2η2Φ(θ)^Φ(θ)2\begin{split}\Phi(\theta^{+})-\Phi(\theta)\geq\frac{\eta}{4}\|\widehat{G}^{\eta,\alpha}(\theta)\|^{2}-\frac{\eta}{2}\left\|\nabla\Phi(\theta)-\widehat{\nabla}\Phi(\theta)\right\|^{2}\end{split}
Proof.

From the smoothness property we have that:

Φ(θ+)Φ(θ)θΦ(θ)(θ+θ)β2θ+θ2\Phi(\theta^{+})-\Phi(\theta)\geq\nabla_{\theta}\Phi(\theta)^{\top}(\theta^{+}-\theta)-\frac{\beta}{2}\|\theta^{+}-\theta\|^{2}

Since θ+=Proj𝒳α(θ+η^Φ(θ))\theta^{+}=Proj_{\mathcal{X}^{\alpha}}(\theta+\eta\widehat{\nabla}\Phi(\theta)), we have that:

(θ+η^Φ(θ)θ+)(θθ+)0,θ𝒳α(\theta+\eta\widehat{\nabla}\Phi(\theta)-\theta^{+})^{\top}(\theta^{\prime}-\theta^{+})\leq 0,~{}~{}\forall~{}\theta^{\prime}\in\mathcal{X}^{\alpha}

Taking θ=θ\theta^{\prime}=\theta, we get:

^Φ(θ)(θ+θ)1ηθ+θ2.\widehat{\nabla}\Phi(\theta)^{\top}(\theta^{+}-\theta)\geq\frac{1}{\eta}\|\theta^{+}-\theta\|^{2}.

Thus:

Φ(θ)(θ+θ)=(Φ(θ)^Φ(θ))(θ+θ)+^Φ(θ)(θ+θ)\displaystyle\nabla\Phi(\theta)^{\top}(\theta^{+}-\theta)=\left(\nabla\Phi(\theta)-\widehat{\nabla}\Phi(\theta)\right)^{\top}(\theta^{+}-\theta)+\widehat{\nabla}\Phi(\theta)^{\top}(\theta^{+}-\theta)
η2Φ(θ)^Φ(θ)212ηθ+θ2+^Φ(θ)(θ+θ)\displaystyle\geq-\frac{\eta}{2}\left\|\nabla\Phi(\theta)-\widehat{\nabla}\Phi(\theta)\right\|^{2}-\frac{1}{2\eta}\left\|\theta^{+}-\theta\right\|^{2}+\widehat{\nabla}\Phi(\theta)^{\top}(\theta^{+}-\theta)
η2Φ(θ)^Φ(θ)212ηθ+θ2+1ηθ+θ2\displaystyle\geq-\frac{\eta}{2}\left\|\nabla\Phi(\theta)-\widehat{\nabla}\Phi(\theta)\right\|^{2}-\frac{1}{2\eta}\left\|\theta^{+}-\theta\right\|^{2}+\frac{1}{\eta}\|\theta^{+}-\theta\|^{2}
=12ηθ+θ2η2Φ(θ)^Φ(θ)2\displaystyle=\frac{1}{2\eta}\|\theta^{+}-\theta\|^{2}-\frac{\eta}{2}\left\|\nabla\Phi(\theta)-\widehat{\nabla}\Phi(\theta)\right\|^{2}

Thus from (36):

Φ(θ+)Φ(θ)\displaystyle\Phi(\theta^{+})-\Phi(\theta) (12ηβ2)θ+θ2η2Φ(θ)^Φ(θ)2\displaystyle\geq\left(\frac{1}{2\eta}-\frac{\beta}{2}\right)\|\theta^{+}-\theta\|^{2}-\frac{\eta}{2}\left\|\nabla\Phi(\theta)-\widehat{\nabla}\Phi(\theta)\right\|^{2}
14ηθ+θ2η2Φ(θ)^Φ(θ)2\displaystyle\geq\frac{1}{4\eta}\|\theta^{+}-\theta\|^{2}-\frac{\eta}{2}\left\|\nabla\Phi(\theta)-\widehat{\nabla}\Phi(\theta)\right\|^{2}
=η4G^η,α(θ)2η2Φ(θ)^Φ(θ)2\displaystyle=\frac{\eta}{4}\|\widehat{G}^{\eta,\alpha}(\theta)\|^{2}-\frac{\eta}{2}\left\|\nabla\Phi(\theta)-\widehat{\nabla}\Phi(\theta)\right\|^{2}

which completes the proof. ∎

Lemma 15 immediately results in the following corollary:

Corollary 3.

(of Lemma 15) In Algorithm 1, suppose ^Φ(θ(k))Φ(θ(k))ϵg\|\widehat{\nabla}\Phi(\theta^{(k)})-\nabla\Phi(\theta^{(k)})\|\leq\epsilon_{g} holds for every 0kTG10\leq k\leq T_{G}-1. Then running Algorithm 1 guarantees that:

1TGk=0TG1G^η,α(θ(k))24(ΦmaxΦmin)ηTG+2ϵg2\frac{1}{T_{G}}\sum_{k=0}^{T_{G}-1}\|\widehat{G}^{\eta,\alpha}(\theta^{(k)})\|^{2}\leq\frac{4(\Phi_{\max}-\Phi_{\min})}{\eta T_{G}}+2\epsilon_{g}^{2}
Proof.

From Lemma 15 we have that:

Φ(θ(k+1))Φ(θ(k))\displaystyle\Phi(\theta^{(k+1)})-\Phi(\theta^{(k)}) η4G^η,α(θ(k))2η2Φ(θ(k))^Φ(θ(k))2\displaystyle\geq\frac{\eta}{4}\|\widehat{G}^{\eta,\alpha}(\theta^{(k)})\|^{2}-\frac{\eta}{2}\left\|\nabla\Phi(\theta^{(k)})-\widehat{\nabla}\Phi(\theta^{(k)})\right\|^{2}
η4G^η,α(θ(k))2η2ϵg2.\displaystyle\geq\frac{\eta}{4}\|\widehat{G}^{\eta,\alpha}(\theta^{(k)})\|^{2}-\frac{\eta}{2}\epsilon_{g}^{2}.

Thus

1TGk=0TG1G^η,α(θ(k))2\displaystyle\frac{1}{T_{G}}\sum_{k=0}^{T_{G}-1}\|\widehat{G}^{\eta,\alpha}(\theta^{(k)})\|^{2} 4(Φ(θ(0))Φ(θ(TG)))ηTG+2ϵg2\displaystyle\leq\frac{4(\Phi(\theta^{(0)})-\Phi(\theta^{(T_{G})}))}{\eta T_{G}}+2\epsilon_{g}^{2}
4(ΦmaxΦmin)ηTG+2ϵg2\displaystyle\leq\frac{4(\Phi_{\max}-\Phi_{\min})}{\eta T_{G}}+2\epsilon_{g}^{2}

Lemma 16.

(First-order stationarity and G^η,α(θ)\|\widehat{G}^{\eta,\alpha}(\theta)\|) Suppose Φ(θ)\Phi(\theta) is β\beta-smooth. Let θ+=Proj𝒳α(θ+η^Φ(θ))\theta^{+}=Proj_{\mathcal{X}^{\alpha}}(\theta+\eta\widehat{\nabla}\Phi(\theta)). Then:

θΦ(θ+)(θθ+)[(1+ηβ)G^η,α(θ)+^Φ(θ)Φ(θ)]θθ+,θ𝒳α.\nabla_{\theta}\Phi(\theta^{+})^{\top}(\theta^{\prime}-\theta^{+})\leq\left[(1+\eta\beta)\|\widehat{G}^{\eta,\alpha}(\theta)\|+\|\widehat{\nabla}\Phi(\theta)-\nabla\Phi(\theta)\|\right]\|\theta^{\prime}-\theta^{+}\|,\quad\forall\theta^{\prime}\in\mathcal{X}^{\alpha}. (48)

Further:

maxθ¯i𝒳iθiΦ(θ+)(θ¯iθi+)2|𝒮|[(1+ηβ)G^η,α(θ)+^Φ(θ)Φ(θ)]+2α1γ\max_{\overline{\theta}_{i}\in\mathcal{X}_{i}}\nabla_{\theta_{i}}\Phi(\theta^{+})^{\top}(\overline{\theta}_{i}-\theta_{i}^{+})\leq 2\sqrt{|\mathcal{S}|}\left[(1+\eta\beta)\|\widehat{G}^{\eta,\alpha}(\theta)\|+\|\widehat{\nabla}\Phi(\theta)-\nabla\Phi(\theta)\|\right]+\frac{2\alpha}{1-\gamma} (49)
Proof.

Since θ+=Proj𝒳α(θ+η^Φ(θ))\theta^{+}=Proj_{\mathcal{X}^{\alpha}}(\theta+\eta\widehat{\nabla}\Phi(\theta)), we have:

(θ+η^Φ(θ)θ+)(θθ+)\displaystyle(\theta+\eta\widehat{\nabla}\Phi(\theta)-\theta^{+})^{\top}(\theta^{\prime}-\theta^{+}) 0θ𝒳α\displaystyle\leq 0~{}~{}\forall~{}\theta^{\prime}\in\mathcal{X}^{\alpha}
η^Φ(θ)(θθ+)\displaystyle\Longrightarrow~{}\eta\widehat{\nabla}\Phi(\theta)^{\top}(\theta^{\prime}-\theta^{+}) (θθ+)(θθ+)\displaystyle\leq(\theta-\theta^{+})^{\top}(\theta^{\prime}-\theta^{+})
ηΦ(θ)(θθ+)\displaystyle\Longrightarrow~{}\eta\nabla\Phi(\theta)^{\top}(\theta^{\prime}-\theta^{+}) (θθ+)(θθ+)+η(Φ(θ)^Φ(θ))(θθ+)\displaystyle\leq(\theta-\theta^{+})^{\top}(\theta^{\prime}-\theta^{+})+\eta(\nabla\Phi(\theta)-\widehat{\nabla}\Phi(\theta))^{\top}(\theta^{\prime}-\theta^{+})
ηΦ(θ+)(θθ+)\displaystyle\Longrightarrow~{}\eta\nabla\Phi(\theta^{+})^{\top}(\theta^{\prime}-\theta^{+}) (θθ+)(θθ+)+η(Φ(θ)^Φ(θ))(θθ+)\displaystyle\leq(\theta-\theta^{+})^{\top}(\theta^{\prime}-\theta^{+})+\eta(\nabla\Phi(\theta)-\widehat{\nabla}\Phi(\theta))^{\top}(\theta^{\prime}-\theta^{+})
+η(Φ(θ+)Φ(θ))(θθ+)\displaystyle\quad+\eta(\nabla\Phi(\theta^{+})-\nabla\Phi(\theta))^{\top}(\theta^{\prime}-\theta^{+})
ηΦ(θ+)(θθ+)\displaystyle\Longrightarrow~{}\eta\nabla\Phi(\theta^{+})^{\top}(\theta^{\prime}-\theta^{+}) (θθ++ηΦ(θ)^Φ(θ)+ηΦ(θ+)Φ(θ))θθ+\displaystyle\leq(\|\theta-\theta^{+}\|+\eta\|\nabla\Phi(\theta)-\widehat{\nabla}\Phi(\theta)\|+\eta\|\nabla\Phi(\theta^{+})-\nabla\Phi(\theta)\|)\|\theta^{\prime}-\theta^{+}\|
(θθ++ηΦ(θ)^Φ(θ)+ηβθ+θ)θθ+\displaystyle\leq(\|\theta-\theta^{+}\|+\eta\|\nabla\Phi(\theta)-\widehat{\nabla}\Phi(\theta)\|+\eta\beta\|\theta^{+}-\theta\|)\|\theta^{\prime}-\theta^{+}\|
=[(1+ηβ)θθ++ηΦ(θ)^Φ(θ)]θθ+\displaystyle=\left[(1+\eta\beta)\|\theta-\theta^{+}\|+\eta\|\nabla\Phi(\theta)-\widehat{\nabla}\Phi(\theta)\|\right]\|\theta^{\prime}-\theta^{+}\|
Φ(θ+)(θθ+)\displaystyle\Longrightarrow~{}\nabla\Phi(\theta^{+})^{\top}(\theta^{\prime}-\theta^{+}) [(1+ηβ)G^η,α(θ)+Φ(θ)^Φ(θ)]θθ+,\displaystyle\leq\left[(1+\eta\beta)\|\widehat{G}^{\eta,\alpha}(\theta)\|+\|\nabla\Phi(\theta)-\widehat{\nabla}\Phi(\theta)\|\right]\|\theta^{\prime}-\theta^{+}\|,

which proves (48). We now prove (49). For any \theta_{i,s}^{\prime}\in\Delta(|\mathcal{A}_{i}|), we know that (1-\alpha)\theta_{i,s}^{\prime}+\alpha U_{|\mathcal{A}_{i}|}\in\Delta^{\alpha}(|\mathcal{A}_{i}|). Let U_{i}:=[\underbrace{U_{|\mathcal{A}_{i}|},\dots,U_{|\mathcal{A}_{i}|}}_{|\mathcal{S}|\text{ times}}]; then for any \theta_{i}^{\prime}\in\mathcal{X}_{i}, (1-\alpha)\theta_{i}^{\prime}+\alpha U_{i}\in\mathcal{X}_{i}^{\alpha}.

Thus:

θiΦ(θ+)(θiθi+)\displaystyle\nabla_{\theta_{i}}\Phi(\theta^{+})^{\top}(\theta_{i}^{\prime}-\theta_{i}^{+}) θiΦ(θ+)((1α)θi+αUiθi+)+θiΦ(θ+)(θi(1α)θiαUi)\displaystyle\leq\nabla_{\theta_{i}}\Phi(\theta^{+})^{\top}((1-\alpha)\theta_{i}^{\prime}+\alpha U_{i}-\theta_{i}^{+})+\nabla_{\theta_{i}}\Phi(\theta^{+})^{\top}(\theta_{i}^{\prime}-(1-\alpha)\theta_{i}^{\prime}-\alpha U_{i})
[(1+ηβ)G^η,α(θ)+^Φ(θ)Φ(θ)](1α)θi+αUiθi+\displaystyle\leq\left[(1+\eta\beta)\|\widehat{G}^{\eta,\alpha}(\theta)\|+\|\widehat{\nabla}\Phi(\theta)-\nabla\Phi(\theta)\|\right]\|(1-\alpha)\theta_{i}^{\prime}+\alpha U_{i}-\theta_{i}^{+}\|
+θiΦ(θ+)(θi(1α)θiαUi)\displaystyle\qquad+\nabla_{\theta_{i}}\Phi(\theta^{+})^{\top}(\theta_{i}^{\prime}-(1-\alpha)\theta_{i}^{\prime}-\alpha U_{i})
2|𝒮|[(1+ηβ)G^η,α(θ)+^Φ(θ)Φ(θ)]+αθiΦ(θ+)(θiUi)\displaystyle\leq 2\sqrt{|\mathcal{S}|}\left[(1+\eta\beta)\|\widehat{G}^{\eta,\alpha}(\theta)\|+\|\widehat{\nabla}\Phi(\theta)-\nabla\Phi(\theta)\|\right]+\alpha\nabla_{\theta_{i}}\Phi(\theta^{+})^{\top}(\theta_{i}^{\prime}-U_{i})

Since

θiΦ(θ+)(θiUi)\displaystyle\nabla_{\theta_{i}}\Phi(\theta^{+})^{\top}(\theta_{i}^{\prime}-U_{i}) =sdθ(s)Qi,sθ¯(θi,sU|𝒜i|)\displaystyle=\sum_{s}d_{\theta}(s)\overline{Q_{i,s}^{\theta}}^{\top}(\theta_{i,s}^{\prime}-U_{|\mathcal{A}_{i}|})
sdθ(s)Qi,sθ¯θi,sU|𝒜i|1\displaystyle\leq\sum_{s}d_{\theta}(s)\|\overline{Q_{i,s}^{\theta}}\|_{\infty}\|\theta_{i,s}^{\prime}-U_{|\mathcal{A}_{i}|}\|_{1}
sdθ(s)21γ21γ,\displaystyle\leq\sum_{s}d_{\theta}(s)\frac{2}{1-\gamma}\leq\frac{2}{1-\gamma},

we have that:

\displaystyle\nabla_{\theta_{i}}\Phi(\theta^{+})^{\top}(\theta_{i}^{\prime}-\theta_{i}^{+}) \displaystyle\leq 2\sqrt{|\mathcal{S}|}\left[(1+\eta\beta)\|\widehat{G}^{\eta,\alpha}(\theta)\|+\|\widehat{\nabla}\Phi(\theta)-\nabla\Phi(\theta)\|\right]+\frac{2\alpha}{1-\gamma}

Taking the maximum over \theta_{i}^{\prime}\in\mathcal{X}_{i} yields (49), which completes the proof. ∎

J.2 Proof of Theorem 5

Proof.

Recall that Φ\Phi is β\beta-smooth with β=2(1γ)3(i=1n|𝒜i|)\beta=\frac{2}{(1-\gamma)^{3}}\left(\sum_{i=1}^{n}|\mathcal{A}_{i}|\right). The step size η\eta in Theorem 5 satisfies η(1γ)34i=1n|𝒜i|=12β\eta\leq\frac{(1-\gamma)^{3}}{4\sum_{i=1}^{n}|\mathcal{A}_{i}|}=\frac{1}{2\beta}.

Recall the gradient domination property:

NE-gapi(θ(k+1))\displaystyle\textup{{NE-gap}}_{i}(\theta^{(k+1)}) =maxθi𝒳iJi(θi,θi(k+1))Ji(θi(k+1),θi(k+1))\displaystyle=\max_{\theta_{i}^{\prime}\in\mathcal{X}_{i}}J_{i}(\theta_{i}^{\prime},\theta_{-i}^{(k+1)})-J_{i}(\theta_{i}^{(k+1)},\theta_{-i}^{(k+1)})
Mmaxθi𝒳i(θiθi(k+1))θiΦ(θ(k+1))\displaystyle\leq M\max_{\theta_{i}^{\prime}\in\mathcal{X}_{i}}(\theta_{i}^{\prime}-\theta_{i}^{(k+1)})^{\top}\nabla_{\theta_{i}}\Phi(\theta^{(k+1)})

Suppose \|\widehat{\nabla}\Phi(\theta^{(k)})-\nabla\Phi(\theta^{(k)})\|_{\infty}\leq\epsilon_{g} for all 0\leq k\leq T_{G}-1. Then, applying Lemma 16 with \theta=\theta^{(k)} (so that \theta^{+}=\theta^{(k+1)}),

NE-gap(θ(k+1))\displaystyle\textup{{NE-gap}}(\theta^{(k+1)}) maxiNE-gapi(θ(k+1))Mmaximaxθi𝒳i(θiθi(k+1))θiΦ(θ(k+1))\displaystyle\leq\max_{i}\textup{{NE-gap}}_{i}(\theta^{(k+1)})\leq M\max_{i}\max_{\theta_{i}^{\prime}\in\mathcal{X}_{i}}(\theta_{i}^{\prime}-\theta_{i}^{(k+1)})^{\top}\nabla_{\theta_{i}}\Phi(\theta^{(k+1)})
2M|𝒮|[(1+ηβ)G^η,α(θ(k))+ϵg]+2αM1γ\displaystyle\leq 2M\sqrt{|\mathcal{S}|}\left[(1+\eta\beta)\|\widehat{G}^{\eta,\alpha}(\theta^{(k)})\|+\epsilon_{g}\right]+\frac{2\alpha M}{1-\gamma}

Thus,

1TGk=0TG1NE-gap(θ(k+1))2\displaystyle\frac{1}{T_{G}}\sum_{k=0}^{T_{G}-1}\textup{{NE-gap}}(\theta^{(k+1)})^{2} 1TGk=0TG13×[4M2|𝒮|(1+ηβ)2G^η,α(θ(k))2+4M2|𝒮|ϵg2+4α2M2(1γ)2]\displaystyle\leq\frac{1}{T_{G}}\sum_{k=0}^{T_{G}-1}3\times\left[4M^{2}|\mathcal{S}|(1+\eta\beta)^{2}\|\widehat{G}^{\eta,\alpha}(\theta^{(k)})\|^{2}+4M^{2}|\mathcal{S}|\epsilon_{g}^{2}+\frac{4\alpha^{2}M^{2}}{(1-\gamma)^{2}}\right]
=12M2|𝒮|ϵg2+12α2M2(1γ)2+12M2|𝒮|(1+ηβ)2(1TGk=0TG1G^η,α(θ(k))2)\displaystyle=12M^{2}|\mathcal{S}|\epsilon_{g}^{2}+\frac{12\alpha^{2}M^{2}}{(1-\gamma)^{2}}+12M^{2}|\mathcal{S}|(1+\eta\beta)^{2}\left(\frac{1}{T_{G}}\sum_{k=0}^{T_{G}-1}\|\widehat{G}^{\eta,\alpha}(\theta^{(k)})\|^{2}\right)

From Corollary 3, we have that

1TGk=0TG1NE-gap(θ(k+1))2\displaystyle\frac{1}{T_{G}}\sum_{k=0}^{T_{G}-1}\textup{{NE-gap}}(\theta^{(k+1)})^{2} 12M2|𝒮|ϵg2+12α2M2(1γ)2+12M2|𝒮|(1+ηβ)2(4(ΦmaxΦmin)ηTG+2ϵg2)\displaystyle\leq 12M^{2}|\mathcal{S}|\epsilon_{g}^{2}+\frac{12\alpha^{2}M^{2}}{(1-\gamma)^{2}}+12M^{2}|\mathcal{S}|(1+\eta\beta)^{2}\left(\frac{4(\Phi_{\max}-\Phi_{\min})}{\eta T_{G}}+2\epsilon_{g}^{2}\right)
66M2|𝒮|ϵg2+12α2M2(1γ)2+108M2|𝒮|(ΦmaxΦmin)ηTG\displaystyle\leq 66M^{2}|\mathcal{S}|\epsilon_{g}^{2}+\frac{12\alpha^{2}M^{2}}{(1-\gamma)^{2}}+\frac{108M^{2}|\mathcal{S}|(\Phi_{\max}-\Phi_{\min})}{\eta T_{G}} (50)

Substituting

α=(1γ)ϵ6M,ϵg=ϵ233M|𝒮| and TG648M2(ΦmaxΦmin)|𝒮|ηϵ2\displaystyle\alpha=\frac{(1-\gamma)\epsilon}{6M},~{}~{}\epsilon_{g}=\frac{\epsilon}{2\sqrt{33}M\sqrt{|\mathcal{S}|}}\textup{ and }T_{G}\geq\frac{648M^{2}(\Phi_{\max}-\Phi_{\min})|\mathcal{S}|}{\eta\epsilon^{2}}

into the above inequality, we obtain:

1TGk=0TG1NE-gap(θ(k+1))2ϵ22+ϵ23+ϵ26=ϵ2\frac{1}{T_{G}}\sum_{k=0}^{T_{G}-1}\textup{{NE-gap}}(\theta^{(k+1)})^{2}\leq\frac{\epsilon^{2}}{2}+\frac{\epsilon^{2}}{3}+\frac{\epsilon^{2}}{6}=\epsilon^{2}
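To verify the arithmetic, note that under these choices the three terms in (50) evaluate as

66M^{2}|\mathcal{S}|\epsilon_{g}^{2}=\frac{66M^{2}|\mathcal{S}|\,\epsilon^{2}}{132M^{2}|\mathcal{S}|}=\frac{\epsilon^{2}}{2},\qquad\frac{12\alpha^{2}M^{2}}{(1-\gamma)^{2}}=\frac{12M^{2}}{(1-\gamma)^{2}}\cdot\frac{(1-\gamma)^{2}\epsilon^{2}}{36M^{2}}=\frac{\epsilon^{2}}{3},\qquad\frac{108M^{2}|\mathcal{S}|(\Phi_{\max}-\Phi_{\min})}{\eta T_{G}}\leq\frac{108}{648}\,\epsilon^{2}=\frac{\epsilon^{2}}{6}.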

Substituting the above values of \alpha and \epsilon_{g} into Theorem 6 gives:

TJ206976τnM4|𝒮|3maxi|𝒜i|3(1γ)8ϵ4σS2log(16τTG|𝒮|2i|𝒜i|δ)+1T_{J}\geq\frac{206976\tau nM^{4}|\mathcal{S}|^{3}\max_{i}|\mathcal{A}_{i}|^{3}}{(1-\gamma)^{8}\epsilon^{4}\sigma_{S}^{2}}\log\left(\frac{16\tau T_{G}|\mathcal{S}|^{2}\sum_{i}|\mathcal{A}_{i}|}{\delta}\right)+1

which completes the proof. ∎

Appendix K Smoothness

Lemma 17.

(Smoothness for Direct Distributed Parameterization) Assume that 0\leq r_{i}(s,a)\leq 1 for all s,a and i=1,2,\dots,n. Then:

g(θ)g(θ)2(1γ)3(i=1n|𝒜i|)θθ,\|g(\theta^{\prime})-g(\theta)\|\leq\frac{2}{(1-\gamma)^{3}}\left(\sum_{i=1}^{n}|\mathcal{A}_{i}|\right)\|\theta^{\prime}-\theta\|, (51)

where g(\theta):=\left(\nabla_{\theta_{1}}J_{1}(\theta),\dots,\nabla_{\theta_{n}}J_{n}(\theta)\right) denotes the stacked partial gradients \{\nabla_{\theta_{i}}J_{i}(\theta)\}_{i=1}^{n}.

The proof of Lemma 17 depends on the following lemma:

Lemma 18.
θiJi(θ)θiJi(θ)2(1γ)3|𝒜i|j=1n|𝒜j|θjθj\|\nabla_{\theta_{i}}J_{i}(\theta^{\prime})-\nabla_{\theta_{i}}J_{i}(\theta)\|\leq\frac{2}{(1-\gamma)^{3}}\sqrt{|\mathcal{A}_{i}|}\sum_{j=1}^{n}\sqrt{|\mathcal{A}_{j}|}\|\theta_{j}^{\prime}-\theta_{j}\| (52)

Lemma 17 is a simple corollary of Lemma 18.

Proof.

(Proof of Lemma 17)

g(θ)g(θ)2\displaystyle\|g(\theta^{\prime})-g(\theta)\|^{2} =i=1nθiJi(θ)θiJi(θ)2\displaystyle=\sum_{i=1}^{n}\|\nabla_{\theta_{i}}J_{i}(\theta^{\prime})-\nabla_{\theta_{i}}J_{i}(\theta)\|^{2}
(2(1γ)3)2i|𝒜i|(j=1n|𝒜j|θjθj)2\displaystyle\leq\left(\frac{2}{(1-\gamma)^{3}}\right)^{2}\sum_{i}|\mathcal{A}_{i}|\left(\sum_{j=1}^{n}\sqrt{|\mathcal{A}_{j}|}\|\theta_{j}^{\prime}-\theta_{j}\|\right)^{2}
(2(1γ)3)2i|𝒜i|(j=1n|𝒜j|)(j=1nθjθj2)\displaystyle\leq\left(\frac{2}{(1-\gamma)^{3}}\right)^{2}\sum_{i}|\mathcal{A}_{i}|\left(\sum_{j=1}^{n}|\mathcal{A}_{j}|\right)\left(\sum_{j=1}^{n}\|\theta_{j}^{\prime}-\theta_{j}\|^{2}\right)
=(2(1γ)3)2(i=1n|𝒜i|)2θθ2,\displaystyle=\left(\frac{2}{(1-\gamma)^{3}}\right)^{2}\left(\sum_{i=1}^{n}|\mathcal{A}_{i}|\right)^{2}\|\theta^{\prime}-\theta\|^{2},

which completes the proof. ∎

Lemma 18 is equivalent to the following lemma:

Lemma 19.
|Ji(θi+αui,θi)Ji(θi+αui,θi)α|α=0|2(1γ)3|𝒜i|j=1n|𝒜j|θjθj,u=1\left|\frac{\partial J_{i}(\theta_{i}^{\prime}+\alpha u_{i},\theta_{-i}^{\prime})-\partial J_{i}(\theta_{i}+\alpha u_{i},\theta_{-i})}{\partial\alpha}\Big{|}_{\alpha=0}\right|\leq\frac{2}{(1-\gamma)^{3}}\sqrt{|\mathcal{A}_{i}|}\sum_{j=1}^{n}\sqrt{|\mathcal{A}_{j}|}\|\theta_{j}^{\prime}-\theta_{j}\|,\quad\forall\|u\|=1 (53)
Proof.

(Lemma 19) Define:

πi,α(ai|s)\displaystyle\pi_{i,\alpha}(a_{i}|s) :=πθi+αui(ai|s)=θs,ai+αuai,s\displaystyle:=\pi_{\theta_{i}+\alpha u_{i}}(a_{i}|s)=\theta_{s,a_{i}}+\alpha u_{a_{i},s}
\displaystyle\pi_{i,\alpha}^{\prime}(a_{i}|s) \displaystyle:=\pi_{\theta_{i}^{\prime}+\alpha u_{i}}(a_{i}|s)=\theta_{s,a_{i}}^{\prime}+\alpha u_{a_{i},s}
πα(a|s)\displaystyle\pi_{\alpha}(a|s) :=πθi+αui(ai|s)πθi(ai|s)\displaystyle:=\pi_{\theta_{i}+\alpha u_{i}}(a_{i}|s)\pi_{\theta_{-i}}(a_{-i}|s)
\displaystyle\pi_{\alpha}^{\prime}(a|s) \displaystyle:=\pi_{\theta_{i}^{\prime}+\alpha u_{i}}(a_{i}|s)\pi_{\theta_{-i}^{\prime}}(a_{-i}|s)
Qiα(s,a)\displaystyle Q_{i}^{\alpha}(s,a) :=Q(θi+αui,θi)(s,a)\displaystyle:=Q_{(\theta_{i}+\alpha u_{i},\theta_{-i})}(s,a)
\displaystyle d_{\alpha}^{\prime}(s) \displaystyle:=d_{(\theta_{i}^{\prime}+\alpha u_{i},\theta_{-i}^{\prime})}(s)

According to the cost difference lemma,

|Ji(θi+αui,θi)Ji(θi+αui,θi)α|α=0|\displaystyle\quad\left|\frac{\partial J_{i}(\theta_{i}^{\prime}+\alpha u_{i},\theta_{-i}^{\prime})-\partial J_{i}(\theta_{i}+\alpha u_{i},\theta_{-i})}{\partial\alpha}\Big{|}_{\alpha=0}\right|
=11γ|s,adα(s)πα(a|s)Aiα(s,a)α|α=0|\displaystyle=\frac{1}{1-\gamma}\left|\frac{\partial\sum_{s,a}d_{\alpha^{\prime}}(s)\pi^{\prime}_{\alpha}(a|s)A_{i}^{\alpha}(s,a)}{\partial\alpha}\Big{|}_{\alpha=0}\right|
=11γ|s,adα(s)(πα(a|s)πα(a|s))Qiα(s,a)α|α=0|\displaystyle=\frac{1}{1-\gamma}\left|\frac{\partial\sum_{s,a}d_{\alpha^{\prime}}(s)\left(\pi^{\prime}_{\alpha}(a|s)-\pi_{\alpha}(a|s)\right)Q_{i}^{\alpha}(s,a)}{\partial\alpha}\Big{|}_{\alpha=0}\right|
11γ(|s,adθ(s)πα(a|s)πα(a|s)α|α=0Qiθ(s,a)|Part A\displaystyle\leq\frac{1}{1-\gamma}\left(\underbrace{\left|\sum_{s,a}d_{\theta}^{\prime}(s)\frac{\partial\pi^{\prime}_{\alpha}(a|s)-\partial\pi_{\alpha}(a|s)}{\partial\alpha}\Big{|}_{\alpha=0}Q_{i}^{\theta}(s,a)\right|}_{\text{Part A}}\right.
+|s,adθ(s)(πθ(a|s)πθ(a|s))Qiα(s,a)α|α=0|Part B\displaystyle+\underbrace{\left|\sum_{s,a}d_{\theta}^{\prime}(s)(\pi^{\prime}_{\theta}(a|s)-\pi_{\theta}(a|s))\frac{\partial Q_{i}^{\alpha}(s,a)}{\partial\alpha}\Big{|}_{\alpha=0}\right|}_{\text{Part B}}
+|s,adα(s)α|α=0(πθ(a|s)πθ(a|s))Qiθ(s,a)|Part C)\displaystyle+\left.\underbrace{\left|\sum_{s,a}\frac{\partial d_{\alpha}^{\prime}(s)}{\partial\alpha}\Big{|}_{\alpha=0}(\pi^{\prime}_{\theta}(a|s)-\pi_{\theta}(a|s))Q_{i}^{\theta}(s,a)\right|}_{\text{Part C}}\right)

Thus:

Part A =|s,adθ(s)πα(a|s)πα(a|s)α|α=0Qiθ(s,a)|\displaystyle=\left|\sum_{s,a}d_{\theta}^{\prime}(s)\frac{\partial\pi^{\prime}_{\alpha}(a|s)-\partial\pi_{\alpha}(a|s)}{\partial\alpha}\Big{|}_{\alpha=0}Q_{i}^{\theta}(s,a)\right|
=|s,adθ(s)uai,s(πθi(ai|s)πθi(ai|s))Qiθ(s,a)|\displaystyle=\left|\sum_{s,a}d_{\theta}^{\prime}(s)u_{a_{i},s}(\pi_{\theta_{-i}^{\prime}}(a_{-i}|s)-\pi_{\theta_{-i}}(a_{-i}|s))Q_{i}^{\theta}(s,a)\right| (54)
11γ|sdθ(s)ai|uai,s|ai|πθi(ai|s)πθi(ai|s)||\displaystyle\leq\frac{1}{1-\gamma}\left|\sum_{s}d_{\theta}^{\prime}(s)\sum_{a_{i}}|u_{a_{i},s}|\sum_{a_{-i}}\left|\pi_{\theta_{-i}^{\prime}}(a_{-i}|s)-\pi_{\theta_{-i}}(a_{-i}|s)\right|\right| (55)
11γ(maxsai|uai,s|)sdθ(s)2dTV(πθi(|s)||πθi(|s))\displaystyle\leq\frac{1}{1-\gamma}\left(\max_{s}\sum_{a_{i}}|u_{a_{i},s}|\right)\sum_{s}d_{\theta}^{\prime}(s)2d_{\text{TV}}(\pi_{\theta_{-i}^{\prime}}(\cdot|s)||\pi_{\theta_{-i}}(\cdot|s)) (56)
11γ(maxsai|uai,s|)sdθ(s)ji2dTV(πθj(|s)||πθj(|s))\displaystyle\leq\frac{1}{1-\gamma}\left(\max_{s}\sum_{a_{i}}|u_{a_{i},s}|\right)\sum_{s}d_{\theta}^{\prime}(s)\sum_{j\neq i}2d_{\text{TV}}(\pi_{\theta_{j}^{\prime}}(\cdot|s)||\pi_{\theta_{j}}(\cdot|s)) (57)
=11γ(maxsai|uai,s|)sdθ(s)jiθj,sθj,s1\displaystyle=\frac{1}{1-\gamma}\left(\max_{s}\sum_{a_{i}}|u_{a_{i},s}|\right)\sum_{s}d_{\theta}^{\prime}(s)\sum_{j\neq i}\|\theta^{\prime}_{j,s}-\theta_{j,s}\|_{1} (58)
11γ|𝒜i|sdθ(s)ji|𝒜j|θj,sθj,s\displaystyle\leq\frac{1}{1-\gamma}\sqrt{|\mathcal{A}_{i}|}\sum_{s}d_{\theta}^{\prime}(s)\sum_{j\neq i}\sqrt{|\mathcal{A}_{j}|}\|\theta_{j,s}^{\prime}-\theta_{j,s}\| (59)
11γ|𝒜i|ji|𝒜j|sdθ(s)2sθj,sθj,s2\displaystyle\leq\frac{1}{1-\gamma}\sqrt{|\mathcal{A}_{i}|}\sum_{j\neq i}\sqrt{|\mathcal{A}_{j}|}\sqrt{\sum_{s}d_{\theta^{\prime}}(s)^{2}}\sqrt{\sum_{s}\|\theta_{j,s}^{\prime}-\theta_{j,s}\|^{2}} (60)
=11γ|𝒜i|ji|𝒜j|sdθ(s)2θjθj\displaystyle=\frac{1}{1-\gamma}\sqrt{|\mathcal{A}_{i}|}\sum_{j\neq i}\sqrt{|\mathcal{A}_{j}|}\sqrt{\sum_{s}d_{\theta^{\prime}}(s)^{2}}\|\theta_{j}^{\prime}-\theta_{j}\|
11γ|𝒜i|ji|𝒜j|θjθj\displaystyle\leq\frac{1}{1-\gamma}\sqrt{|\mathcal{A}_{i}|}\sum_{j\neq i}\sqrt{|\mathcal{A}_{j}|}\|\theta_{j}^{\prime}-\theta_{j}\|
11γ|𝒜i|j=1n|𝒜j|θjθj,\displaystyle\leq\frac{1}{1-\gamma}\sqrt{|\mathcal{A}_{i}|}\sum_{j=1}^{n}\sqrt{|\mathcal{A}_{j}|}\|\theta_{j}^{\prime}-\theta_{j}\|,

where the step from (54) to (55) uses the fact that |Q_{i}^{\theta}(s,a)|\leq\frac{1}{1-\gamma}. The step from (56) to (57) relies on the following property of the total variation distance:

dTV(πθi(|s)||πθi(|s))jidTV(πθj(|s)||πθj(|s))d_{\text{TV}}(\pi_{\theta_{-i}^{\prime}}(\cdot|s)||\pi_{\theta_{-i}}(\cdot|s))\leq\sum_{j\neq i}d_{\text{TV}}(\pi_{\theta_{j}^{\prime}}(\cdot|s)||\pi_{\theta_{j}}(\cdot|s))

The step from (58) to (59) uses:

maxsai|uai,s||𝒜i|,u1\displaystyle\max_{s}\sum_{a_{i}}|u_{a_{i},s}|\leq\sqrt{|\mathcal{A}_{i}|},~{}~{}\|u\|\leq 1
θj,sθj,s1|𝒜j|θj,sθj,s\displaystyle\|\theta^{\prime}_{j,s}-\theta_{j,s}\|_{1}\leq\sqrt{|\mathcal{A}_{j}|}\|\theta_{j,s}^{\prime}-\theta_{j,s}\|

both of which can be verified immediately by applying the Cauchy–Schwarz inequality.

Before bounding Part B, we first define \widetilde{P}(\alpha) as the state-action transition matrix under \pi_{\alpha}:

[P~(α)](s,a)(s,a)=πα(a|s)P(s|s,a)\left[\widetilde{P}(\alpha)\right]_{(s,a)\rightarrow(s^{\prime},a^{\prime})}=\pi_{\alpha}(a^{\prime}|s^{\prime})P(s^{\prime}|s,a)

Then we have that:

[P~(α)α|α=0](s,a)(s,a)=uai,sπθi(ai|s)P(s|s,a)\left[\frac{\partial\widetilde{P}(\alpha)}{\partial\alpha}\Big{|}_{\alpha=0}\right]_{(s,a)\rightarrow(s^{\prime},a^{\prime})}=u_{a_{i}^{\prime},s^{\prime}}\pi_{\theta_{-i}}(a_{-i}^{\prime}|s^{\prime})P(s^{\prime}|s,a)

For an arbitrary vector xx:

[P~(α)α|α=0x](s,a)\displaystyle\left[\frac{\partial\widetilde{P}(\alpha)}{\partial\alpha}\Big{|}_{\alpha=0}x\right]_{(s,a)} =s,auai,sπθi(ai|s)P(s|s,a)xs,a\displaystyle=\sum_{s^{\prime},a^{\prime}}u_{a_{i}^{\prime},s^{\prime}}\pi_{\theta_{-i}}(a_{-i}^{\prime}|s^{\prime})P(s^{\prime}|s,a)x_{s^{\prime},a^{\prime}}
xs,a|uai,s|πθi(ai|s)P(s|s,a)\displaystyle\leq\|x\|_{\infty}\sum_{s^{\prime},a^{\prime}}|u_{a_{i}^{\prime},s^{\prime}}|\pi_{\theta_{-i}}(a_{-i}^{\prime}|s^{\prime})P(s^{\prime}|s,a)
=xsP(s|s,a)ai|uai,s|aiπθi(ai|s)\displaystyle=\|x\|_{\infty}\sum_{s^{\prime}}P(s^{\prime}|s,a)\sum_{a_{i}^{\prime}}|u_{a_{i}^{\prime},s^{\prime}}|\sum_{a_{-i}^{\prime}}\pi_{\theta_{-i}}(a_{-i}^{\prime}|s^{\prime})
xsP(s|s,a)|𝒜i|aiπθi(ai|s)\displaystyle\leq\|x\|_{\infty}\sum_{s^{\prime}}P(s^{\prime}|s,a)\sqrt{|\mathcal{A}_{i}|}\sum_{a_{-i}^{\prime}}\pi_{\theta_{-i}}(a_{-i}^{\prime}|s^{\prime})
|𝒜i|x\displaystyle\leq\sqrt{|\mathcal{A}_{i}|}\|x\|_{\infty}

Thus:

\left\|\frac{\partial\widetilde{P}(\alpha)}{\partial\alpha}\Big{|}_{\alpha=0}x\right\|_{\infty}\leq\sqrt{|\mathcal{A}_{i}|}\|x\|_{\infty}

Similarly, we can define \widetilde{P}(\alpha)^{\prime} as the state-action transition matrix under \pi_{\alpha}^{\prime}, and one can easily check that

\left\|\frac{\partial\widetilde{P}(\alpha)^{\prime}}{\partial\alpha}\Big{|}_{\alpha=0}x\right\|_{\infty}\leq\sqrt{|\mathcal{A}_{i}|}\|x\|_{\infty}

Define:

M(α):=(IγP~(α))1,M(α):=(IγP~(α))1.M(\alpha):=\left(I-\gamma\widetilde{P}(\alpha)\right)^{-1},\quad M(\alpha)^{\prime}:=\left(I-\gamma\widetilde{P}(\alpha)^{\prime}\right)^{-1}.

Because:

M(\alpha)=\left(I-\gamma\widetilde{P}(\alpha)\right)^{-1}=\sum_{n=0}^{\infty}\gamma^{n}\widetilde{P}(\alpha)^{n},

every entry of M(\alpha) is nonnegative and M(\alpha)\mathbf{1}=\frac{1}{1-\gamma}\mathbf{1}, which implies:

M(α)x11γx,\|M(\alpha)x\|_{\infty}\leq\frac{1}{1-\gamma}\|x\|_{\infty},

and similarly

M(α)x11γx.\|M(\alpha)^{\prime}x\|_{\infty}\leq\frac{1}{1-\gamma}\|x\|_{\infty}.

Now we are ready to bound Part B. Because:

Qiα(s,a)\displaystyle Q_{i}^{\alpha}(s,a) =e(s,a)M(α)ri\displaystyle=e_{(s,a)}^{\top}M(\alpha)r_{i}
Qiα(s,a)α\displaystyle\Longrightarrow~{}~{}\frac{\partial Q_{i}^{\alpha}(s,a)}{\partial\alpha} =e(s,a)M(α)αri=γe(s,a)M(α)P~(α)αM(α)ri\displaystyle=e_{(s,a)}^{\top}\frac{\partial M(\alpha)}{\partial\alpha}r_{i}=\gamma e_{(s,a)}^{\top}M(\alpha)\frac{\partial\widetilde{P}(\alpha)}{\partial\alpha}M(\alpha)r_{i}
|Qiα(s,a)α|\displaystyle\Longrightarrow~{}~{}\left|\frac{\partial Q_{i}^{\alpha}(s,a)}{\partial\alpha}\right| γM(α)P~(α)αM(α)ri\displaystyle\leq\gamma\left\|M(\alpha)\frac{\partial\widetilde{P}(\alpha)}{\partial\alpha}M(\alpha)r_{i}\right\|_{\infty}
γ(1γ)2|𝒜i|\displaystyle\leq\frac{\gamma}{(1-\gamma)^{2}}\sqrt{|\mathcal{A}_{i}|}

Thus,

Part B =|s,adθ(s)(πθ(a|s)πθ(a|s))Qiα(s,a)α|α=0|\displaystyle=\left|\sum_{s,a}d_{\theta}^{\prime}(s)(\pi^{\prime}_{\theta}(a|s)-\pi_{\theta}(a|s))\frac{\partial Q_{i}^{\alpha}(s,a)}{\partial\alpha}\Big{|}_{\alpha=0}\right|
s,adθ(s)|πθ(a|s)πθ(a|s)||Qiα(s,a)α|α=0|\displaystyle\leq\sum_{s,a}d_{\theta}^{\prime}(s)\left|\pi^{\prime}_{\theta}(a|s)-\pi_{\theta}(a|s)\right|\left|\frac{\partial Q_{i}^{\alpha}(s,a)}{\partial\alpha}\Big{|}_{\alpha=0}\right|
γ(1γ)2|𝒜i|sdθ(s)2dTV(πθ(|s)||πθ(|s))\displaystyle\leq\frac{\gamma}{(1-\gamma)^{2}}\sqrt{|\mathcal{A}_{i}|}\sum_{s}d_{\theta}^{\prime}(s)2d_{\textup{TV}}(\pi_{\theta^{\prime}}(\cdot|s)||\pi_{\theta}(\cdot|s))
γ(1γ)2|𝒜i|sdθ(s)j2dTV(πθj(|s)||πθj(|s))\displaystyle\leq\frac{\gamma}{(1-\gamma)^{2}}\sqrt{|\mathcal{A}_{i}|}\sum_{s}d_{\theta}^{\prime}(s)\sum_{j}2d_{\textup{TV}}(\pi_{\theta^{\prime}_{j}}(\cdot|s)||\pi_{\theta_{j}}(\cdot|s))
=γ(1γ)2|𝒜i|sdθ(s)jθj,sθj,s1\displaystyle=\frac{\gamma}{(1-\gamma)^{2}}\sqrt{|\mathcal{A}_{i}|}\sum_{s}d_{\theta}^{\prime}(s)\sum_{j}\|\theta_{j,s}^{\prime}-\theta_{j,s}\|_{1}
γ(1γ)2|𝒜i|sdθ(s)j=1n|𝒜j|θj,sθj,s\displaystyle\leq\frac{\gamma}{(1-\gamma)^{2}}\sqrt{|\mathcal{A}_{i}|}\sum_{s}d_{\theta}^{\prime}(s)\sum_{j=1}^{n}\sqrt{|\mathcal{A}_{j}|}\|\theta_{j,s}^{\prime}-\theta_{j,s}\|
γ(1γ)2|𝒜i|j=1n|𝒜j|sdθ(s)2sθj,sθj,s2\displaystyle\leq\frac{\gamma}{(1-\gamma)^{2}}\sqrt{|\mathcal{A}_{i}|}\sum_{j=1}^{n}\sqrt{|\mathcal{A}_{j}|}\sqrt{\sum_{s}d_{\theta}^{\prime}(s)^{2}}\sqrt{\sum_{s}\|\theta_{j,s}^{\prime}-\theta_{j,s}\|^{2}}
γ(1γ)2|𝒜i|j=1n|𝒜j|θjθj\displaystyle\leq\frac{\gamma}{(1-\gamma)^{2}}\sqrt{|\mathcal{A}_{i}|}\sum_{j=1}^{n}\sqrt{|\mathcal{A}_{j}|}\|\theta_{j}^{\prime}-\theta_{j}\|

We now turn to Part C:

dα(s)\displaystyle d_{\alpha}^{\prime}(s) =(1γ)sρ(s)aπα(a|s)e(s,a)M(α)ae(s,a)\displaystyle=(1-\gamma)\sum_{s^{\prime}}\rho(s^{\prime})\sum_{a^{\prime}}\pi_{\alpha}^{\prime}(a^{\prime}|s^{\prime})e_{(s^{\prime},a^{\prime})}^{\top}M(\alpha)^{\prime}\sum_{a^{\prime\prime}}e_{(s,a^{\prime\prime})}
dα(s)α\displaystyle\Longrightarrow~{}~{}\frac{\partial d_{\alpha}^{\prime}(s)}{\partial\alpha} =(1γ)(sρ(s)aπα(a|s)αe(s,a)v1M(α)ae(s,a)\displaystyle=(1-\gamma)\left(\underbrace{\sum_{s^{\prime}}\rho(s^{\prime})\sum_{a^{\prime}}\frac{\partial\pi_{\alpha}^{\prime}(a^{\prime}|s^{\prime})}{\partial\alpha}e_{(s^{\prime},a^{\prime})}^{\top}}_{v_{1}^{\top}}M(\alpha)^{\prime}\sum_{a^{\prime\prime}}e_{(s,a^{\prime\prime})}\right.
+sρ(s)aπα(a|s)e(s,a)v2M(α)αae(s,a))\displaystyle\left.+\underbrace{\sum_{s^{\prime}}\rho(s^{\prime})\sum_{a^{\prime}}\pi_{\alpha}^{\prime}(a^{\prime}|s^{\prime})e_{(s^{\prime},a^{\prime})}^{\top}}_{v_{2}^{\top}}\frac{\partial M(\alpha)^{\prime}}{\partial\alpha}\sum_{a^{\prime\prime}}e_{(s,a^{\prime\prime})}\right)
\displaystyle=(1-\gamma)\left(v_{1}^{\top}M(\alpha)^{\prime}+\gamma v_{2}^{\top}M(\alpha)^{\prime}\frac{\partial\widetilde{P}(\alpha)^{\prime}}{\partial\alpha}M(\alpha)^{\prime}\right)\sum_{a^{\prime\prime}}e_{(s,a^{\prime\prime})}

Note that v1,v2v_{1},v_{2} are constant vectors that are independent of the choice of ss. Additionally:

v11\displaystyle\|v_{1}\|_{1} =sρ(s)aπα(a|s)αe(s,a)1\displaystyle=\left\|\sum_{s}\rho(s)\sum_{a}\frac{\partial\pi_{\alpha}^{\prime}(a|s)}{\partial\alpha}e_{(s,a)}\right\|_{1}
=sρ(s)a|πα(a|s)α|\displaystyle=\sum_{s}\rho(s)\sum_{a}\left|\frac{\partial\pi_{\alpha}^{\prime}(a|s)}{\partial\alpha}\right|
=sρ(s)a|uai,s|πθi(ai|s)\displaystyle=\sum_{s}\rho(s)\sum_{a}\left|u_{a_{i},s}\right|\pi_{\theta_{-i}^{\prime}}(a_{-i}|s)
\displaystyle\leq\sum_{s}\rho(s)\sum_{a_{i}}\left|u_{a_{i},s}\right|\leq\sqrt{|\mathcal{A}_{i}|}
v21\displaystyle\|v_{2}\|_{1} =sρ(s)aπα(a|s)e(s,a)1\displaystyle=\|\sum_{s}\rho(s)\sum_{a}\pi_{\alpha}^{\prime}(a|s)e_{(s,a)}\|_{1}
=sρ(s)aπα(a|s)=1\displaystyle=\sum_{s}\rho(s)\sum_{a}\pi_{\alpha}^{\prime}(a|s)=1

Thus:

Part C =|s,adα(s)α|α=0(πθ(a|s)πθ(a|s))Qiθ(s,a)|\displaystyle=\left|\sum_{s,a}\frac{\partial d_{\alpha}^{\prime}(s)}{\partial\alpha}\Big{|}_{\alpha=0}(\pi^{\prime}_{\theta}(a|s)-\pi_{\theta}(a|s))Q_{i}^{\theta}(s,a)\right|
\displaystyle=(1-\gamma)\left|\left(v_{1}^{\top}M(0)^{\prime}+\gamma v_{2}^{\top}M(0)^{\prime}\frac{\partial\widetilde{P}(\alpha)^{\prime}}{\partial\alpha}\Big{|}_{\alpha=0}M(0)^{\prime}\right)\underbrace{\sum_{s,a}\sum_{a^{\prime}}e_{(s,a^{\prime})}(\pi^{\prime}_{\theta}(a|s)-\pi_{\theta}(a|s))Q_{i}^{\theta}(s,a)}_{v_{3}}\right|
(1γ)(11γv11v3+γ(1γ)2|𝒜i|v21v3)\displaystyle\leq(1-\gamma)\left(\frac{1}{1-\gamma}\|v_{1}\|_{1}\|v_{3}\|_{\infty}+\frac{\gamma}{(1-\gamma)^{2}}\sqrt{|\mathcal{A}_{i}|}\|v_{2}\|_{1}\|v_{3}\|_{\infty}\right)
|𝒜i|1γv3\displaystyle\leq\frac{\sqrt{|\mathcal{A}_{i}|}}{1-\gamma}\|v_{3}\|_{\infty}

Additionally:

|[v3](s0,a0)|\displaystyle\left|[v_{3}]_{(s_{0},a_{0})}\right| =|a(πθ(a|s0)πθ(a|s0))Qiθ(s0,a)|\displaystyle=\left|\sum_{a}(\pi_{\theta^{\prime}}(a|s_{0})-\pi_{\theta}(a|s_{0}))Q_{i}^{\theta}(s_{0},a)\right|
11γa|πθ(a|s0)πθ(a|s0)|\displaystyle\leq\frac{1}{1-\gamma}\sum_{a}|\pi_{\theta^{\prime}}(a|s_{0})-\pi_{\theta}(a|s_{0})|
=11γ2dTV(πθ(|s0)||πθ(|s0))\displaystyle=\frac{1}{1-\gamma}2d_{\textup{TV}}(\pi_{\theta^{\prime}}(\cdot|s_{0})||\pi_{\theta}(\cdot|s_{0}))
11γj=1n2dTV(πθj(|s0)||πθj(|s0))\displaystyle\leq\frac{1}{1-\gamma}\sum_{j=1}^{n}2d_{\textup{TV}}(\pi_{\theta_{j}^{\prime}}(\cdot|s_{0})||\pi_{\theta_{j}}(\cdot|s_{0}))
\displaystyle=\frac{1}{1-\gamma}\sum_{j=1}^{n}\|\theta_{j,s_{0}}^{\prime}-\theta_{j,s_{0}}\|_{1}
\displaystyle\leq\frac{1}{1-\gamma}\sum_{j=1}^{n}\sqrt{|\mathcal{A}_{j}|}\|\theta_{j,s_{0}}^{\prime}-\theta_{j,s_{0}}\|
11γj=1n|𝒜j|θjθj\displaystyle\leq\frac{1}{1-\gamma}\sum_{j=1}^{n}\sqrt{|\mathcal{A}_{j}|}\|\theta_{j}^{\prime}-\theta_{j}\|

Combining the above inequalities we get:

Part C|𝒜i|1γv3|𝒜i|(1γ)2j=1n|𝒜j|θjθj\text{Part C}\leq\frac{\sqrt{|\mathcal{A}_{i}|}}{1-\gamma}\|v_{3}\|_{\infty}\leq\frac{\sqrt{|\mathcal{A}_{i}|}}{(1-\gamma)^{2}}\sum_{j=1}^{n}\sqrt{|\mathcal{A}_{j}|}\|\theta_{j}^{\prime}-\theta_{j}\|

Summing up Parts A–C and using \frac{1}{1-\gamma}+\frac{\gamma}{(1-\gamma)^{2}}+\frac{1}{(1-\gamma)^{2}}=\frac{2}{(1-\gamma)^{2}}, we get:

|Ji(θi+αui,θi)Ji(θi+αui,θi)α|α=0|\displaystyle\left|\frac{\partial J_{i}(\theta_{i}^{\prime}+\alpha u_{i},\theta_{-i}^{\prime})-\partial J_{i}(\theta_{i}+\alpha u_{i},\theta_{-i})}{\partial\alpha}\Big{|}_{\alpha=0}\right| 11γ(Part A+Part B+Part C)\displaystyle\leq\frac{1}{1-\gamma}(\text{Part A}+\text{Part B}+\text{Part C})
2(1γ)3|𝒜i|j=1n|𝒜j|θjθj,\displaystyle\leq\frac{2}{(1-\gamma)^{3}}\sqrt{|\mathcal{A}_{i}|}\sum_{j=1}^{n}\sqrt{|\mathcal{A}_{j}|}\|\theta_{j}^{\prime}-\theta_{j}\|,

which completes the proof. ∎

Appendix L Auxiliary

We now restate and prove the auxiliary lemma recalled in Appendix E.

Proof.

Let y=\theta+g. Without loss of generality, assume that i^{*}=1 and that:

y1>y2y3yn.y_{1}>y_{2}\geq y_{3}\geq\cdots\geq y_{n}.

Using the KKT conditions, one can derive an efficient algorithm for computing Proj_{\mathcal{X}}(y) [73], which consists of the following steps (a short numerical sketch of these steps is given after the proof):

  1. Find \rho:=\max\{1\leq j\leq n:y_{j}+\frac{1}{j}\left(1-\sum_{i=1}^{j}y_{i}\right)>0\};

  2. Set \lambda:=\frac{1}{\rho}\left(1-\sum_{i=1}^{\rho}y_{i}\right);

  3. Set \theta^{\prime}_{i}=\max\{y_{i}+\lambda,0\}.

From the algorithm, we have that:

λ\displaystyle\lambda =1ρ(1i=1ρyi)=1ρ(1i=1ρ(θi+gi))\displaystyle=\frac{1}{\rho}\left(1-\sum_{i=1}^{\rho}y_{i}\right)=\frac{1}{\rho}\left(1-\sum_{i=1}^{\rho}(\theta_{i}+g_{i})\right)
=1ρ(1i=1ρθi)1ρi=1ρgi\displaystyle=\frac{1}{\rho}\left(1-\sum_{i=1}^{\rho}\theta_{i}\right)-\frac{1}{\rho}\sum_{i=1}^{\rho}g_{i}
1ρi=1ρgi.\displaystyle\geq-\frac{1}{\rho}\sum_{i=1}^{\rho}g_{i}.

If ρ2\rho\geq 2,

θ1\displaystyle\theta^{\prime}_{1} =max{y1+λ,0}y1+λθ1+g11ρi=1ρgi\displaystyle=\max\{y_{1}+\lambda,0\}\geq y_{1}+\lambda\geq\theta_{1}+g_{1}-\frac{1}{\rho}\sum_{i=1}^{\rho}g_{i}
θ1+(11ρ)g11ρi=2ρ(g1Δ)=θ1+ρ1ρΔθ1+Δ2.\displaystyle\geq\theta_{1}+(1-\frac{1}{\rho})g_{1}-\frac{1}{\rho}\sum_{i=2}^{\rho}(g_{1}-\Delta)=\theta_{1}+\frac{\rho-1}{\rho}\Delta\geq\theta_{1}+\frac{\Delta}{2}.

If ρ=1\rho=1,

θ1=y1+λ=y1+(1y1)=1.\theta^{\prime}_{1}=y_{1}+\lambda=y_{1}+(1-y_{1})=1.

Thus:

θ1min{1,θ1+Δ2},\theta^{\prime}_{1}\geq\min\{1,\theta_{1}+\frac{\Delta}{2}\},

which completes the proof. ∎
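For reference, the three projection steps above can be implemented in a few lines; the following is a minimal NumPy sketch (the function name is ours, and the routine sorts y internally, whereas the proof assumes y is already sorted without loss of generality).

import numpy as np

def project_simplex(y):
    # Euclidean projection of y onto the probability simplex, following steps 1-3.
    y = np.asarray(y, dtype=float)
    u = np.sort(y)[::-1]                                   # sort in decreasing order
    cumsum = np.cumsum(u)
    js = np.arange(1, y.size + 1)
    rho = np.nonzero(u + (1.0 - cumsum) / js > 0)[0][-1]   # step 1 (0-indexed)
    lam = (1.0 - cumsum[rho]) / (rho + 1.0)                # step 2
    return np.maximum(y + lam, 0.0)                        # step 3

# Quick check: the output is a valid probability vector.
theta_plus = project_simplex(np.array([0.6, 0.3, 0.1]) + np.array([0.5, -0.2, -0.3]))
assert np.isclose(theta_plus.sum(), 1.0) and np.all(theta_plus >= 0)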

Appendix M Numerical Simulation Details

Verification of the fully mixed NE in Game 2

We now verify that joining network 1 with probability \frac{1-3\epsilon}{3(1-2\epsilon)}, i.e.:

πθi(ai=1|s)=13ϵ3(12ϵ),s𝒮,i=1,2,\pi_{\theta_{i}}(a_{i}=1|s)=\frac{1-3\epsilon}{3(1-2\epsilon)},~{}~{}\forall s\in\mathcal{S},~{}~{}i=1,2,

is indeed an NE. First, observe that

Prθ(si,t+1=1)\displaystyle\textup{Pr}^{\theta}(s_{i,t+1}=1) =(13ϵ3(12ϵ))P(si,t+1=1|ai,t=1)+(113ϵ3(12ϵ))P(si,t+1=1|ai,t=2)\displaystyle=\left(\frac{1-3\epsilon}{3(1-2\epsilon)}\right)P(s_{i,t+1}=1|a_{i,t}=1)+\left(1-\frac{1-3\epsilon}{3(1-2\epsilon)}\right)P(s_{i,t+1}=1|a_{i,t}=2)
=(13ϵ3(12ϵ))(1ϵ)+(113ϵ3(12ϵ))ϵ=13.\displaystyle=\left(\frac{1-3\epsilon}{3(1-2\epsilon)}\right)(1-\epsilon)+\left(1-\frac{1-3\epsilon}{3(1-2\epsilon)}\right)\epsilon=\frac{1}{3}.

Thus,

V(s)\displaystyle V(s) =r(s)+t=1𝔼stγtr(st)=r(s)+2γ3(1γ),\displaystyle=r(s)+\sum_{t=1}^{\infty}\mathbb{{E}}_{s_{t}}\gamma^{t}r(s_{t})=r(s)+\frac{2\gamma}{3(1-\gamma)},
Qθi¯(s,ai)\displaystyle\overline{Q^{\theta}_{i}}(s,a_{i}) =r(s)+γs,aiP(s|ai,ai)πθi(ai|s)V(s)\displaystyle=r(s)+\gamma\sum_{s^{\prime},a_{-i}}P(s^{\prime}|a_{i},a_{-i})\pi_{\theta_{-i}}(a_{-i}|s)V(s^{\prime})
\displaystyle=r(s)+\gamma\sum_{s_{i}^{\prime}\in\{1,2\}}\left(P(s_{i}^{\prime}|a_{i})\textup{Pr}^{\theta}(s_{-i}=1)r(s_{i}^{\prime},s_{-i}=1)+P(s_{i}^{\prime}|a_{i})\textup{Pr}^{\theta}(s_{-i}=2)r(s_{i}^{\prime},s_{-i}=2)\right)+\frac{2\gamma^{2}}{3(1-\gamma)}
=r(s)+γP(si=1|ai)(13r(si=1,si=1)+23r(si=1,si=2))\displaystyle=r(s)+\gamma P(s_{i}^{\prime}=1|a_{i})\left(\frac{1}{3}r(s_{i}^{\prime}=1,s_{-i}=1)+\frac{2}{3}r(s_{i}^{\prime}=1,s_{-i}=2)\right)
\displaystyle\quad+\gamma P(s_{i}^{\prime}=2|a_{i})\left(\frac{1}{3}r(s_{i}^{\prime}=2,s_{-i}=1)+\frac{2}{3}r(s_{i}^{\prime}=2,s_{-i}=2)\right)+\frac{2\gamma^{2}}{3(1-\gamma)}
=r(s)+23γ+2γ23(1γ)=r(s)+2γ3(1γ)=V(s),\displaystyle=r(s)+\frac{2}{3}\gamma+\frac{2\gamma^{2}}{3(1-\gamma)}=r(s)+\frac{2\gamma}{3(1-\gamma)}=V(s),

which implies that:

(θiθi)θiJi(θ)=0,θi𝒳i,i=1,2,(\theta_{i}^{\prime}-\theta_{i})^{\top}\nabla_{\theta_{i}}J_{i}(\theta)=0,\quad\forall\theta_{i}^{\prime}\in\mathcal{X}_{i},\quad i=1,2,

i.e., \theta satisfies first-order stationarity. Since d_{\theta}(s)>0 holds for any valid \theta, by Theorem 1, \theta is an NE.
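As a quick numerical sanity check of the transition computation above, the following snippet verifies that the mixed strategy indeed gives \textup{Pr}^{\theta}(s_{i,t+1}=1)=\frac{1}{3}; we use \epsilon=0.1 purely as an illustrative value, since the identity holds for any \epsilon<1/3.

eps = 0.1                                  # illustrative value; P(s'=1|a=1) = 1-eps, P(s'=1|a=2) = eps
p = (1 - 3 * eps) / (3 * (1 - 2 * eps))    # probability of joining network 1
prob_next_is_1 = p * (1 - eps) + (1 - p) * eps
assert abs(prob_next_is_1 - 1.0 / 3.0) < 1e-12   # equals 1/3, matching the calculation above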

Computation of strict NEs in Game 2

The computation of strict NEs is done numerically, using the criterion in Lemma 5. We enumerate all 2^{8} possible deterministic policies and check whether the conditions in Lemma 5 hold. For \epsilon=0.1, \gamma=0.95, and an initial distribution set as:

ρ(s1=i,s2=j)=1/4,i,j{1,2},\rho(s_{1}=i,s_{2}=j)=1/4,~{}~{}i,j\in\{1,2\},

the numerical calculation shows that there exist 13 different strict NEs.
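The enumeration itself is straightforward to set up; below is a schematic Python sketch of this loop. Each agent's deterministic policy assigns one of 2 actions to each of the 4 joint states, giving 2^{8}=256 profiles. The actual check of the Lemma 5 conditions requires the value functions of Game 2 and is not reproduced here, so it is left as a user-supplied callback (is_strict_ne, a hypothetical placeholder).

from itertools import product

STATES = [(s1, s2) for s1 in (1, 2) for s2 in (1, 2)]   # 4 joint states of Game 2
ACTIONS = (1, 2)                                        # join network 1 or network 2

def enumerate_strict_nes(is_strict_ne):
    # Enumerate all 2^8 = 256 deterministic policy profiles (2 agents x 4 states,
    # 2 actions each) and keep those passing the user-supplied Lemma 5 check.
    strict_nes = []
    for choices in product(ACTIONS, repeat=2 * len(STATES)):
        policy = {(i, s): choices[i * len(STATES) + k]   # action of agent i in joint state s
                  for i in range(2) for k, s in enumerate(STATES)}
        if is_strict_ne(policy):
            strict_nes.append(policy)
    return strict_nes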