Gradient play in stochastic games: stationary points, convergence, and sample complexity
Abstract
We study the performance of the gradient play algorithm for stochastic games (SGs), where each agent tries to maximize its own total discounted reward by making decisions independently based on the current state, which is shared among the agents. Policies are directly parameterized by the probability of choosing a certain action at a given state. We show that Nash equilibria (NEs) and first-order stationary policies are equivalent in this setting, and give a local convergence rate around strict NEs. Further, for a subclass of SGs called Markov potential games (which includes the setting with identical rewards as an important special case), we design a sample-based reinforcement learning algorithm and give a non-asymptotic global convergence rate analysis for both exact gradient play and our sample-based learning algorithm. Our result shows that the number of iterations needed to reach an $\epsilon$-NE scales linearly, instead of exponentially, with the number of agents. Local geometry and local stability are also considered: we prove that strict NEs are local maxima of the total potential function and that fully mixed NEs are saddle points.
1 Introduction
Multi-agent systems find applications in a wide range of societal systems, e.g., electric grids, traffic networks, smart buildings, and smart cities. Given the complexity of these systems, multi-agent reinforcement learning (MARL) has gained increasing attention in recent years (e.g. [1, 2]). Among MARL algorithms, policy gradient-type methods are highly popular because of their flexibility and their ability to incorporate structured state and action spaces. However, while many recent works [3, 4, 5, 6, 7] have studied the performance of multi-agent policy gradient algorithms, due to a lack of understanding of the optimization landscape in these multi-agent learning problems, most works can only show convergence to a first-order stationary point. A deeper understanding of the quality of these stationary points is missing even in the simple identical-reward multi-agent RL setting.
In this paper, we investigate this problem from a game-theoretic perspective. We model the multi-agent system as a stochastic game (SG), in which agents take independent stochastic policies and can have different reward functions. The study of SGs dates back to the 1950s [8], with a series of follow-up works on developing NE-seeking algorithms, especially in the RL setting (e.g. [9, 10, 11, 12, 13, 14] and citations therein). While well-known classical algorithms for solving SGs are mostly value-based, such as Nash-Q learning [15], Hyper-Q learning [16], and WoLF-PHC [17], gradient-based algorithms have also gained popularity in recent years due to the advantages mentioned earlier (e.g. [18, 19, 20]).
In this work, we aim to gain a deeper understanding of the structure of first-order stationary points and the dynamical behavior of these gradient-based methods, with a particular focus on answering the following questions: 1) How do the first-order stationary points relate to the NEs of the underlying game? 2) How stable are the individual NEs? 3) How can agents learn from samples in this environment?
These questions have already been widely discussed in other settings, e.g., one-shot (stateless) finite-action games [21, 22, 23, 24, 25, 26, 27, 28, 29, 30], one-shot continuous games [31], zero-sum linear quadratic (LQ) games [32], etc. There are both negative and positive results, depending on the setting. For one-shot continuous games, [31] proved a negative result showing that gradient flow has stationary points (even local maxima) that are not necessarily NEs. On the positive side, [32] designed projected nested-gradient methods that provably converge to NEs in zero-sum LQ games. However, much less is known in the tabular setting of SGs with finite state-action spaces.
Contributions. We consider the gradient play algorithm for infinite-horizon, discounted-reward SGs with independent, directly parameterized agent policies. By generalizing the gradient domination property in [33] to the SG setting, we first establish the equivalence of first-order stationary policies and Nash equilibria (Theorem 1). This result implies that even if agents have an identical reward, the first-order stationary points are only equivalent to Nash equilibria, which are usually non-unique and have different reward values. This is fundamentally different from the centralized learning case [33], where any first-order stationary point is a globally optimal solution.
Then we study the convergence of gradient play for SGs. For general games, it is known that gradient play may fail to converge globally [21, 22, 23, 24]. Thus we first focus on characterizing local properties in the general case. In particular, we characterize the structure of strict NEs and show that gradient play locally converges to strict NEs within a finite number of steps (Theorem 2).
Next we study a special class of SGs called Markov potential games (MPGs) [34, 35, 36], which includes identical-reward multi-agent RL [37, 38, 39, 40, 41, 42] as an important special case. Concurrently, this work and [36] have established the global convergence rate to a NE for gradient play under MPGs (Theorem 3). However, the result does not specify which NE the policies converge to. Since many NEs may have poor global value, a global convergence result alone has limited implications for algorithm performance. This motivates us to study the local geometry around specific types of NEs. We show that strict NEs are local maxima of the total potential function and are therefore stable points under gradient play, and that fully mixed NEs are saddle points and therefore unstable points under gradient play (Theorem 4).
Then, we design a fully decentralized sample-based gradient play algorithm and prove that it can find an $\epsilon$-Nash equilibrium with high probability using a finite number of samples (Theorem 5, where $|\mathcal{S}|$ and $|\mathcal{A}_i|$ denote the sizes of the state space and of agent $i$'s action space, respectively). The key enabler of our algorithm is the existence of an underlying averaged MDP for each agent when the other agents' policies are fixed. Our learning method can be viewed as a model-based policy evaluation method with respect to the agents' averaged MDPs. This averaged-MDP concept could be applied to design many other MARL algorithms, especially policy-evaluation-based methods.
Comparison to other works on NE learning for SGs: There are some recent studies on general SGs with finite state-action spaces. However, either the structure of the SGs or the methods they consider differ from our setting. For example, [43, 44] consider learning coarse correlated equilibria (CCE) rather than NEs for finite-horizon general-sum games; [45] and [46] propose decentralized learning algorithms for weakly acyclic games, which include identical-interest games as a special case, but only asymptotic convergence is established; [47, 48] consider convergence to NEs for two-player zero-sum games. In addition, [3, 6, 7] consider slightly different MARL settings, where agents collaboratively maximize the sum of all agents' rewards with either full or partial state observation; they also require communication between neighboring agents for better global coordination.
For the MPG subclass, [49, 36, 43, 50, 51] study convergence to a NE. [43] designs the Nash-CA (Nash Coordinate Ascent) algorithm, which requires agents to update sequentially and does not belong to the class of gradient-based algorithms. While [50, 51] consider gradient-based algorithms, they study softmax policies, which are different from directly parameterized policies. [52] considers policy gradient with function approximation. [36] is the work most related to this paper: it studies the performance of gradient-based algorithms under direct parameterization, establishes a global convergence rate, and develops sample complexity results for gradient play, but it does not study the local geometry for general SGs. Additionally, the sample-based algorithm considered in [36] is based on Monte-Carlo gradient estimation, which may suffer from high variance in practice and is very different from our algorithm, which estimates the gradient by estimating the "model" of each agent's averaged MDP. Moreover, our concept of "averaged" MDPs could also serve as a useful tool for the design and analysis of other MARL algorithms.
2 Problem setting and preliminaries
We consider a stochastic game (SG, [8]) with $n$ agents, specified by a tuple $(\mathcal{N},\mathcal{S},\{\mathcal{A}_i\}_{i=1}^{n},P,\{r_i\}_{i=1}^{n},\gamma,\rho)$: an agent set $\mathcal{N}=\{1,\dots,n\}$, a finite state space $\mathcal{S}$, a finite action space $\mathcal{A}_i$ for each agent $i$ (with joint action space $\mathcal{A}=\mathcal{A}_1\times\cdots\times\mathcal{A}_n$), a transition model $P$, where $P(s'\mid s,a)$ is the probability of transitioning into state $s'$ upon taking the joint action $a=(a_1,\dots,a_n)$ in state $s$ and $a_i$ is the action of agent $i$, agent $i$'s reward function $r_i:\mathcal{S}\times\mathcal{A}\to\mathbb{R}$, a discount factor $\gamma\in[0,1)$, and an initial state distribution $\rho$ over $\mathcal{S}$.
A stochastic policy $\pi:\mathcal{S}\to\Delta(\mathcal{A})$ (where $\Delta(\mathcal{A})$ is the probability simplex over $\mathcal{A}$) specifies a strategy in which the agents choose their actions jointly based on the current state in a stochastic fashion, i.e. $a_t\sim\pi(\cdot\mid s_t)$. A decentralized stochastic policy is a special subclass of stochastic policies with $\pi=(\pi_1,\dots,\pi_n)$, where $\pi_i:\mathcal{S}\to\Delta(\mathcal{A}_i)$. Under a decentralized stochastic policy, each agent takes its action based on the current state independently of the other agents' action choices, i.e.:
$$\pi(a\mid s)=\prod_{i=1}^{n}\pi_i(a_i\mid s).$$
For notational simplicity, for an index set $I\subseteq\mathcal{N}$ we write $a_I:=(a_i)_{i\in I}$, and similarly for policies and parameters. Further, we use the notation $-i$ to denote the index set $\mathcal{N}\setminus\{i\}$.
We consider direct decentralized policy parameterization, where agent $i$'s policy is parameterized by $\theta_i$:
$$\pi_{\theta_i}(a_i\mid s)=\theta_{i,s,a_i}. \qquad (1)$$
For notational simplicity, we abbreviate $\pi_{\theta_i}$ as $\pi_i$ and the joint policy $\pi_\theta=(\pi_{\theta_1},\dots,\pi_{\theta_n})$ as $\pi$ when the parameterization is clear from context. Here $\theta_i\in\Delta(\mathcal{A}_i)^{|\mathcal{S}|}$, i.e. $\theta_i$ is subject to the constraints $\theta_{i,s,a_i}\ge 0$ and $\sum_{a_i\in\mathcal{A}_i}\theta_{i,s,a_i}=1$ for all $s\in\mathcal{S}$. The global joint policy is given by $\pi_\theta(a\mid s)=\prod_{i=1}^{n}\theta_{i,s,a_i}$. We use $\Theta_i:=\Delta(\mathcal{A}_i)^{|\mathcal{S}|}$ to denote the feasible region of $\theta_i$, and $\Theta:=\Theta_1\times\cdots\times\Theta_n$.
Agent $i$'s value function [53] is defined as the discounted sum of its future rewards starting at state $s$ when executing $\pi$, i.e.
$$V_i^{\pi}(s):=\mathbb{E}\Big[\sum_{t=0}^{\infty}\gamma^{t}\,r_i(s_t,a_t)\,\Big|\,s_0=s\Big],$$
where the expectation is with respect to the random trajectory $\{(s_t,a_t)\}_{t\ge 0}$ with $a_t\sim\pi(\cdot\mid s_t)$ and $s_{t+1}\sim P(\cdot\mid s_t,a_t)$. We denote agent $i$'s total reward starting from the initial state distribution $\rho$ as $V_i^{\pi}(\rho):=\mathbb{E}_{s_0\sim\rho}\big[V_i^{\pi}(s_0)\big]$.
In the game setting, Nash equilibrium is often used to characterize the performance of agents’ policies.
Definition 1.
(Nash equilibrium, cf. [54, 55]) A policy $\pi^{\star}=(\pi_1^{\star},\dots,\pi_n^{\star})$ is called a Nash equilibrium (NE) if for every agent $i$ and every policy $\pi_i$, $V_i^{\pi_i^{\star},\pi_{-i}^{\star}}(\rho)\ge V_i^{\pi_i,\pi_{-i}^{\star}}(\rho)$.
The equilibrium is called a strict NE if the inequality holds strictly for all $i$ and all $\pi_i\ne\pi_i^{\star}$. The equilibrium is called a pure NE if $\pi^{\star}$ corresponds to a deterministic policy, and a mixed NE if it is not pure. Further, the equilibrium is called a fully mixed NE if every entry of $\pi^{\star}$ is strictly positive, i.e. $\pi_i^{\star}(a_i\mid s)>0$ for all $i$, $s$, and $a_i$.
We define the discounted state visitation distribution [53] of a policy $\pi$ given an initial state distribution $\rho$ as:
$$d_\rho^{\pi}(s):=(1-\gamma)\,\mathbb{E}_{s_0\sim\rho}\Big[\sum_{t=0}^{\infty}\gamma^{t}\,\Pr^{\pi}(s_t=s\mid s_0)\Big], \qquad (2)$$
where $\Pr^{\pi}(s_t=s\mid s_0)$ is the probability that the state at time $t$ equals $s$ when executing $\pi$ starting at state $s_0$. Throughout the paper, we make the following assumption on the SGs we study.
Assumption 1.
The stochastic game satisfies $d_\rho^{\pi}(s)>0$ for every state $s\in\mathcal{S}$ and every joint policy $\pi$.
Assumption 1 requires that every state is visited with positive probability, which is a standard assumption for convergence proofs in the RL literature (e.g. [33, 56], [36, 48]). Note that this assumption is easily satisfied if the initial distribution satisfies $\rho(s)>0$ for all $s\in\mathcal{S}$.
Similar to centralized RL [53], define agent $i$'s Q-function and advantage function as
$$Q_i^{\pi}(s,a):=\mathbb{E}\Big[\sum_{t=0}^{\infty}\gamma^{t}\,r_i(s_t,a_t)\,\Big|\,s_0=s,\,a_0=a\Big],\qquad A_i^{\pi}(s,a):=Q_i^{\pi}(s,a)-V_i^{\pi}(s).$$
'Averaged' Markov decision process (MDP): We further define agent $i$'s 'averaged' Q-function and 'averaged' advantage function as:
$$\bar{Q}_i^{\pi}(s,a_i):=\mathbb{E}_{a_{-i}\sim\pi_{-i}(\cdot\mid s)}\big[Q_i^{\pi}(s,a_i,a_{-i})\big],\qquad \bar{A}_i^{\pi}(s,a_i):=\mathbb{E}_{a_{-i}\sim\pi_{-i}(\cdot\mid s)}\big[A_i^{\pi}(s,a_i,a_{-i})\big]. \qquad (3)$$
Similarly, we define agent $i$'s 'averaged' transition probability and 'averaged' reward as:
$$\bar{P}_i^{\pi_{-i}}(s'\mid s,a_i):=\mathbb{E}_{a_{-i}\sim\pi_{-i}(\cdot\mid s)}\big[P(s'\mid s,a_i,a_{-i})\big],\qquad \bar{r}_i^{\pi_{-i}}(s,a_i):=\mathbb{E}_{a_{-i}\sim\pi_{-i}(\cdot\mid s)}\big[r_i(s,a_i,a_{-i})\big].$$
From its definition, the averaged Q-function satisfies the following Bellman equation:
Lemma 1.
The averaged Q-function $\bar{Q}_i^{\pi}$ satisfies:
$$\bar{Q}_i^{\pi}(s,a_i)=\bar{r}_i^{\pi_{-i}}(s,a_i)+\gamma\sum_{s'\in\mathcal{S}}\bar{P}_i^{\pi_{-i}}(s'\mid s,a_i)\,\mathbb{E}_{a_i'\sim\pi_i(\cdot\mid s')}\big[\bar{Q}_i^{\pi}(s',a_i')\big]. \qquad (4)$$
Lemma 1 suggests that the averaged Q-function $\bar{Q}_i^{\pi}$ is indeed the Q-function of the MDP defined on state space $\mathcal{S}$ and action space $\mathcal{A}_i$, with $\bar{r}_i^{\pi_{-i}}$ and $\bar{P}_i^{\pi_{-i}}$ as its stage reward and transition probability, respectively. We define this MDP as the 'averaged' MDP of agent $i$. The notion of an 'averaged' MDP will serve as an important intuition when designing the sample-based algorithm. Note that the 'averaged' MDP is only well-defined when the policies of the other agents are kept fixed. When this is the case, agent $i$ can be treated as an independent learner with respect to its own 'averaged' MDP, and various classical policy evaluation RL algorithms can then be applied. Additionally, we can apply the performance difference lemma [57] to the averaged MDP to derive a corresponding lemma for SGs that is useful throughout the paper.
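To make the 'averaged' MDP concrete, the following sketch builds agent 1's averaged reward and transition kernel from the joint model with agent 2's policy held fixed, and then solves the Bellman equation (4) by a linear solve. This is an illustrative construction for a two-agent game; all names (averaged_mdp, averaged_q, etc.) are ours, not code from the paper.

```python
import numpy as np

def averaged_mdp(P, r1, pi2):
    """Agent 1's 'averaged' MDP when agent 2's policy pi2 is held fixed.
    P:   joint transition kernel, shape (S, A1, A2, S)
    r1:  agent 1's reward,        shape (S, A1, A2)
    pi2: agent 2's policy,        shape (S, A2)
    Returns (P_bar, r_bar) of shapes (S, A1, S) and (S, A1)."""
    P_bar = np.einsum('sabt,sb->sat', P, pi2)   # average agent 2's action out of the transition
    r_bar = np.einsum('sab,sb->sa', r1, pi2)    # ... and out of the reward
    return P_bar, r_bar

def averaged_q(P_bar, r_bar, pi1, gamma):
    """Solve the Bellman equation (4) of the averaged MDP for the averaged Q-function."""
    S, A1, _ = P_bar.shape
    # state-action transition matrix induced by agent 1's own policy pi1
    P_sa = np.einsum('sat,tb->satb', P_bar, pi1).reshape(S * A1, S * A1)
    q = np.linalg.solve(np.eye(S * A1) - gamma * P_sa, r_bar.reshape(-1))
    return q.reshape(S, A1)

# tiny sanity check on a random 2-state, 2-action game
rng = np.random.default_rng(0)
S, A1, A2, gamma = 2, 2, 2, 0.9
P = rng.random((S, A1, A2, S)); P /= P.sum(axis=-1, keepdims=True)
r1 = rng.random((S, A1, A2))
pi1 = np.full((S, A1), 0.5); pi2 = np.full((S, A2), 0.5)
P_bar, r_bar = averaged_mdp(P, r1, pi2)
Q_bar = averaged_q(P_bar, r_bar, pi1, gamma)    # shape (S, A1)
```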
Lemma 2.
(Performance difference lemma for SGs; proof in Appendix B) Let $\pi=(\pi_i,\pi_{-i})$ and $\pi'=(\pi_i',\pi_{-i})$ be two joint policies that differ only in agent $i$'s component. Then
$$V_i^{\pi'}(\rho)-V_i^{\pi}(\rho)=\frac{1}{1-\gamma}\,\mathbb{E}_{s\sim d_\rho^{\pi'}}\,\mathbb{E}_{a_i\sim\pi_i'(\cdot\mid s)}\big[\bar{A}_i^{\pi}(s,a_i)\big].$$
3 Gradient Play for General Stochastic Games
Under direct distributed parameterization, the gradient play algorithm is given by:
$$\theta_i^{(t+1)}=\mathrm{Proj}_{\Theta_i}\Big(\theta_i^{(t)}+\eta\,\nabla_{\theta_i}V_i^{\theta^{(t)}}(\rho)\Big),\qquad i=1,\dots,n, \qquad (5)$$
where $\eta>0$ is the stepsize and $\mathrm{Proj}_{\Theta_i}$ denotes the Euclidean projection onto $\Theta_i$.
Gradient play can be viewed as a ‘better response’ strategy, where agents update their own parameters by gradient ascent with respect to their own rewards.
A first-order stationary policy is defined as follows:
Definition 2.
(First-order stationary policy) A policy $\pi_\theta$ with $\theta\in\Theta$ is called a first-order stationary policy if for every agent $i$ and every $\theta_i'\in\Theta_i$, $(\theta_i'-\theta_i)^{\top}\nabla_{\theta_i}V_i^{\theta}(\rho)\le 0$.
It is not hard to verify that $\pi_\theta$ is a first-order stationary policy if and only if it is a fixed point under gradient play (5). Comparing Definition 1 (of NE) and Definition 2, we know that NEs are first-order stationary policies, but the converse is not obvious: for each agent $i$, first-order stationarity does not directly imply that $\pi_i$ is optimal among all possible $\pi_i'$ given $\pi_{-i}$. However, interestingly, we will show that NEs are in fact equivalent to first-order stationary policies, due to a gradient domination property established below. Before that, we first compute the explicit form of the gradient $\nabla_{\theta_i}V_i^{\theta}(\rho)$.
Policy gradient theorem [58] gives an efficient formula for the gradient:
$$\nabla_{\theta_i}V_i^{\theta}(\rho)=\frac{1}{1-\gamma}\,\mathbb{E}_{s\sim d_\rho^{\theta}}\,\mathbb{E}_{a\sim\pi_{\theta}(\cdot\mid s)}\Big[\nabla_{\theta_i}\log\pi_{\theta_i}(a_i\mid s)\,Q_i^{\theta}(s,a)\Big]. \qquad (6)$$
Applying (6), the gradient can be written explicitly as follows:
$$\frac{\partial V_i^{\theta}(\rho)}{\partial\theta_{i,s,a_i}}=\frac{1}{1-\gamma}\,d_\rho^{\theta}(s)\,\bar{Q}_i^{\theta}(s,a_i). \qquad (7)$$
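Under the direct parameterization, the gradient (7) is just the product of the discounted visitation distribution and the averaged Q-function. The small numerical sketch below continues the previous snippet (reusing its P_bar, Q_bar, pi1, and gamma, with rho a chosen initial distribution); the helpers are our own, not the paper's code.

```python
import numpy as np

def discounted_visitation(P_bar, pi1, rho, gamma):
    """d_rho^pi in (2), via a linear solve instead of summing the geometric series."""
    P_state = np.einsum('sat,sa->st', P_bar, pi1)   # state-to-state kernel under the joint policy
    return (1 - gamma) * np.linalg.solve(np.eye(len(rho)) - gamma * P_state.T, rho)

def gradient_agent1(Q_bar, d_rho, gamma):
    """Entry (s, a1) is the partial derivative of V_1 w.r.t. theta_{1,s,a1}, following (7)."""
    return d_rho[:, None] * Q_bar / (1 - gamma)

rho = np.array([0.5, 0.5])
d_rho = discounted_visitation(P_bar, pi1, rho, gamma)
grad_1 = gradient_agent1(Q_bar, d_rho, gamma)   # one gradient-play step ascends along grad_1
```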
3.1 Gradient domination and the equivalence between NE and first-order stationary policy.
Lemma 4.1 in [33] established gradient domination for centralized tabular MDP under direct parameterization. We can show that a similar property still holds for SGs.
Lemma 4.
(Gradient domination) For the direct distributed parameterization (1), we have that for any agent $i$, any joint policy $\pi=(\pi_i,\pi_{-i})$, and any $\pi_i'$:
$$V_i^{\pi_i',\pi_{-i}}(\rho)-V_i^{\pi}(\rho)\;\le\;\Big\|\frac{d_\rho^{\pi_i',\pi_{-i}}}{d_\rho^{\pi}}\Big\|_{\infty}\,\max_{\bar{\pi}_i\in\Pi_i}(\bar{\pi}_i-\pi_i)^{\top}\nabla_{\pi_i}V_i^{\pi}(\rho), \qquad (8)$$
where $\Pi_i:=\Delta(\mathcal{A}_i)^{|\mathcal{S}|}$ denotes agent $i$'s policy class and the division $\frac{d_\rho^{\pi_i',\pi_{-i}}}{d_\rho^{\pi}}$ is taken entry-wise.
For the single-agent case ($n=1$), (8) recovers the gradient domination result in [33]. However, when there are multiple agents, the property is much weaker because the inequality requires $\pi_{-i}$ to be fixed. When $n=1$, gradient domination rules out the existence of stationary points that are not global optima. For the multi-agent case, the property can no longer guarantee the equivalence between first-order stationarity and global optimality; instead, it links the stationary points to NEs, as shown in the following theorem.
Theorem 1.
Under Assumption 1, first-order stationary policies and NEs are equivalent.
Proof.
The definition of a Nash equilibrium naturally implies first-order stationarity, because for any agent $i$, any $\theta_i'\in\Theta_i$, and any $t\in(0,1]$:
$$V_i^{\theta_i^{\star}+t(\theta_i'-\theta_i^{\star}),\,\theta_{-i}^{\star}}(\rho)-V_i^{\theta^{\star}}(\rho)\le 0.$$
Letting $t\to 0^{+}$ gives the first-order stationarity condition $(\theta_i'-\theta_i^{\star})^{\top}\nabla_{\theta_i}V_i^{\theta^{\star}}(\rho)\le 0$ for all $\theta_i'\in\Theta_i$. Conversely, at a first-order stationary policy the right-hand side of the gradient domination inequality (8) is non-positive, so no agent can improve by unilaterally deviating, i.e., the policy is an NE.
We briefly note here that the equivalence between first-order stationary points and NEs holds for all SGs that satisfy Assumption 1. One implication of the theorem is that, even in the identical-interest case where agents share the same reward, the first-order stationary points of decentralized policies can only be guaranteed to be NEs. Note that NEs are often non-unique and often have different objective values. This is in contrast to the single-agent/centralized case, where any first-order stationary point is globally optimal [33].
3.2 Local convergence for strict NEs
Although the equivalence of NEs and stationary points under gradient play has been established, it is in fact difficult to show that gradient play converges to these stationary points. Even in the simpler static (stateless) game setup, gradient play might fail to converge [21, 22, 23, 24]. One major difficulty is that the game gradient field $\big(\nabla_{\theta_1}V_1^{\theta}(\rho),\dots,\nabla_{\theta_n}V_n^{\theta}(\rho)\big)$ is not a conservative vector field, so its dynamics may display complicated behavior. Thus, as a preliminary study, instead of looking at global convergence, we focus on local convergence and restrict our study to a special subset of NEs: the strict NEs. We begin by giving the following characterization of strict NEs:
Lemma 5.
Given a stochastic game, any strict NE $\pi^{\star}$ is pure, meaning that for each agent $i$ and state $s$, there exists one action $a_{i,s}^{\star}$ such that $\pi_i^{\star}(a_{i,s}^{\star}\mid s)=1$. Additionally, we have $\bar{Q}_i^{\pi^{\star}}(s,a_{i,s}^{\star})>\bar{Q}_i^{\pi^{\star}}(s,a_i)$ for all $a_i\ne a_{i,s}^{\star}$.
Based on this lemma, we define the following value gap for studying the local convergence around a strict NE $\pi^{\star}$:
$$\Delta:=\min_{i\in\mathcal{N}}\ \min_{s\in\mathcal{S}}\ \min_{a_i\ne a_{i,s}^{\star}}\Big(\bar{Q}_i^{\pi^{\star}}(s,a_{i,s}^{\star})-\bar{Q}_i^{\pi^{\star}}(s,a_i)\Big)\;>\;0. \qquad (11)$$
Theorem 2.
(Local finite time convergence around strict NEs) Define a metric on the policy parameters as the maximum over agents and states of the norm of the per-state parameter difference. Suppose $\theta^{\star}$ is a strict Nash equilibrium. Then for any initialization sufficiently close to $\theta^{\star}$ under this metric (within a radius that grows with the value gap $\Delta$ in (11)), running gradient play (5) guarantees convergence to $\theta^{\star}$ within a finite number of steps.
Proofs are deferred to Appendix E.
Remark 1.
Note that the local convergence in Theorem 2 only requires a finite number of steps. The key insight of the proof is that, near $\theta^{\star}$, the gradient always points towards $\theta^{\star}$, and the algorithm projects the gradient update onto the probability simplex; thus, by picking the stepsize arbitrarily large, exact convergence can be achieved in just one step. However, the caveat is that we need to assume that the initial policy is sufficiently close to $\theta^{\star}$. For numerical stability, one should pick reasonable stepsizes to accommodate random initializations. Theorem 2 also shows that the radius of the region of attraction of a strict NE grows with $\Delta$; thus a strict NE with a larger $\Delta$, i.e., a larger value gap between the optimal action and the other actions, has a larger region of attraction. We would like to further remark that Theorem 2 only concerns the local convergence property; hence we can interpret the theorem as follows: if there exists a strict NE, then it is locally asymptotically stable under gradient play. It does not, however, address the existence of strict NEs or global convergence to them.
4 Gradient play for Markov potential games
We have discussed that the main obstacle to the global convergence of gradient play for general SGs is that the gradient field is not conservative. Thus, in this section, we restrict our analysis to a special subclass in which the gradient field is conservative and which therefore enjoys global convergence guarantees. This subclass is generally referred to as Markov potential games (MPGs) in the literature.
Definition 3.
(Markov potential game [36]) A stochastic game is called a Markov potential game if there exists a potential function $\phi:\mathcal{S}\times\mathcal{A}\to\mathbb{R}$ such that for any agent $i$, any $\theta_{-i}$, and any pair of policy parameters $\theta_i,\theta_i'\in\Theta_i$:
$$V_i^{\theta_i,\theta_{-i}}(\rho)-V_i^{\theta_i',\theta_{-i}}(\rho)=\Phi^{\theta_i,\theta_{-i}}(\rho)-\Phi^{\theta_i',\theta_{-i}}(\rho),$$
where $\Phi^{\theta}(\rho)$ is the total potential function induced by $\phi$, defined below.
As shown in the definition, the condition for a MPG is admittedly rather strong and difficult to verify for general SGs. [35, 34] found that continuous MPGs can model applications such as the great fish war [59], the stochastic lake game [60], and medium access control [35]. There are also efforts attempting to identify conditions under which a SG is a MPG, e.g., [35, 36, 61]. In Appendix A, we provide a more detailed discussion on MPGs, including a necessary condition for MPGs, counterexamples of stage-wise potential games that are not MPGs, sufficient conditions for a SG to be a MPG, and application examples of MPGs. Nevertheless, identifying sufficient and necessary conditions and broadening the applications of MPGs are important future directions.
Given a policy $\pi_\theta$, we define the 'total potential function' as $\Phi^{\theta}(\rho):=\mathbb{E}\big[\sum_{t=0}^{\infty}\gamma^{t}\phi(s_t,a_t)\big]$, where the trajectory is generated by $\pi_\theta$ with $s_0\sim\rho$. (Note that our definition of MPG is slightly stronger than the definition in [36], as it requires the total potential function to take this particular form of a discounted sum of the potential function; however, most of our results (Theorem 3 and Theorem 5) still hold under the weaker definition in [36]. Theorem 4 is the only result that relies on the slightly stronger version of the definition of a MPG.) The following proposition guarantees that a MPG has at least one NE and that it is a pure NE (proof in Appendix F). We also define the quantity as , where and . It can be easily verified that for all .
Proposition 1.
For a Markov potential game, there is at least one global maximizer of the total potential function $\Phi^{\theta}(\rho)$ over $\Theta$, and it corresponds to a pure NE.
From the definition of the total potential function we obtain the following relationship
$$\nabla_{\theta_i}V_i^{\theta}(\rho)=\nabla_{\theta_i}\Phi^{\theta}(\rho)\quad\text{for every agent } i. \qquad (12)$$
Thus,
which means that gradient play (5) is equivalent to running projected gradient ascent on the total potential function $\Phi^{\theta}(\rho)$, i.e., $\theta^{(t+1)}=\mathrm{Proj}_{\Theta}\big(\theta^{(t)}+\eta\,\nabla_{\theta}\Phi^{\theta^{(t)}}(\rho)\big)$.
4.1 Global convergence
With the above property, we can establish the global convergence of gradient play to an $\epsilon$-NE for MPGs. For completeness, we include the theorem here. Before that, we define the $\epsilon$-NE.
Definition 4.
($\epsilon$-Nash equilibrium) Define the 'NE-gap' of a policy $\pi_\theta$ as
$$\mathrm{NE\text{-}gap}(\theta):=\max_{i\in\mathcal{N}}\ \max_{\theta_i'\in\Theta_i}\Big(V_i^{\theta_i',\theta_{-i}}(\rho)-V_i^{\theta}(\rho)\Big).$$
A policy $\pi_\theta$ is an $\epsilon$-Nash equilibrium if $\mathrm{NE\text{-}gap}(\theta)\le\epsilon$.
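Numerically, the NE-gap can be evaluated by solving each agent's averaged MDP to optimality while the other agents' policies are frozen and comparing with the value of the current policy. The sketch below assumes each agent's averaged MDP (P_bar, r_bar) has already been formed as in Section 2; the helper names are ours, not the paper's.

```python
import numpy as np

def policy_value(P_bar, r_bar, pi, rho, gamma):
    """V_i^pi(rho) for agent i's averaged MDP (P_bar, r_bar) under its own policy pi."""
    P_state = np.einsum('sat,sa->st', P_bar, pi)
    r_state = np.einsum('sa,sa->s', r_bar, pi)
    V = np.linalg.solve(np.eye(len(rho)) - gamma * P_state, r_state)
    return rho @ V

def best_response_value(P_bar, r_bar, rho, gamma, iters=5000):
    """max over pi_i' of V_i^{pi_i', pi_{-i}}(rho): optimal value of the averaged MDP."""
    V = np.zeros(P_bar.shape[0])
    for _ in range(iters):              # plain value iteration
        V = np.max(r_bar + gamma * P_bar @ V, axis=1)
    return rho @ V

def ne_gap(avg_mdps, policies, rho, gamma):
    """NE-gap(theta) = max_i [best-response value - current value]; avg_mdps[i] = (P_bar_i, r_bar_i)."""
    return max(best_response_value(P, r, rho, gamma) - policy_value(P, r, pi, rho, gamma)
               for (P, r), pi in zip(avg_mdps, policies))
```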
Theorem 3.
With an appropriately chosen stepsize $\eta$, the NE-gap of $\theta^{(t)}$ asymptotically converges to 0 under gradient play (5), i.e., $\lim_{t\to\infty}\mathrm{NE\text{-}gap}(\theta^{(t)})=0$. Further, we have:
(13) |
where the distribution mismatch factor appearing in the bound is well-defined by Assumption 1. (Another way to interpret the result is that the bound on the average NE-gap translates into a constant-probability guarantee for a single iterate: if we pick one $\theta^{(t)}$ uniformly at random from the first $T$ iterates, then with at least constant probability its NE-gap is within a constant factor of the right-hand side of (13); these constants can be traded off against each other.)
This factor is also known as the distribution mismatch coefficient, which characterizes how the state visitation distribution varies with the policy. Given an initial state distribution that has positive measure on every state, it can be bounded in terms of $1/\min_{s}\rho(s)$. The proof structure of Theorem 3 resembles the proof of convergence for single-agent MDPs in [33], which leverages classical nonconvex optimization results [62, 63] and gradient domination to obtain a convergence rate to the global optimum (see Appendix G for proof details). In fact, our result matches this bound when there is only one agent (the exponent of $\frac{1}{1-\gamma}$ looks slightly different because some factors are hidden implicitly in other quantities in our bound).
4.2 Local geometry of NEs
Theorem 3 suggests that gradient play is guaranteed to converge to a NE; however, which exact NE it converges to is not specified by the theorem. The qualities of NEs can vary significantly. For example, consider a simple two-agent identical-interest normal-form game with the reward table given in Table 1. There are three NEs. Two of them are strict NEs, where both agents choose the same action (both choose action 1, or both choose action 2); both of these NEs have reward 1. The third is a fully mixed NE, where both agents choose each action with probability 1/2; this NE only has reward 1/2. This significant quality difference between different types of NEs motivates us to further understand whether gradient play can find NEs of relatively good quality. Since the NE that gradient play converges to depends on the initialization and the local geometry around the NE, as a preliminary study we characterize the local geometry and landscape around strict NEs and fully mixed NEs (stated in the following theorem). More investigation is needed for NEs that are neither strict nor fully mixed.
Table 1 (identical rewards; rows are agent 1's actions, columns are agent 2's actions):
| | action 1 | action 2 |
|---|---|---|
| action 1 | 1 | 0 |
| action 2 | 0 | 1 |
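The reward gap between the two kinds of NEs in Table 1 can be verified with a two-line computation (a stand-alone illustration, not from the paper):

```python
import numpy as np

R = np.array([[1.0, 0.0],
              [0.0, 1.0]])                   # identical reward of Table 1

def expected_reward(p, q):
    """Expected reward when agent 1 plays mixed strategy p and agent 2 plays q."""
    return p @ R @ q

print(expected_reward(np.array([1.0, 0.0]), np.array([1.0, 0.0])))      # strict NE: 1.0
print(expected_reward(np.array([0.5, 0.5]), np.array([0.5, 0.5])))      # fully mixed NE: 0.5
```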
Theorem 4.
For a Markov potential game whose potential function $\phi$ is not a constant function:
• A strict NE $\theta^{\star}$ is equivalent to a strict local maximum of the total potential function $\Phi^{\theta}(\rho)$, i.e., there exists a neighborhood of $\theta^{\star}$ such that for all $\theta\in\Theta$ with $\theta\ne\theta^{\star}$ in this neighborhood we have $\Phi^{\theta}(\rho)<\Phi^{\theta^{\star}}(\rho)$.
• Any fully mixed NE $\theta^{\star}$ is a saddle point of the total potential function $\Phi^{\theta}(\rho)$, i.e., it is a first-order stationary point of $\Phi$, and every neighborhood of $\theta^{\star}$ contains some $\theta\in\Theta$ with $\Phi^{\theta}(\rho)>\Phi^{\theta^{\star}}(\rho)$.
Remark 2.
Theorem 4 implies that strict NEs are asymptotically locally stable under first-order methods such as gradient play; while the fully mixed NEs are unstable under gradient play. Note that the theorem does not claim stability or instability for other types of NEs, e.g., pure NEs or non-fully mixed NEs. Nonetheless, we believe that these preliminary results can serve as a valuable platform towards a better understanding of the geometry of the problem. We remark that the conclusion about strict NEs in Theorem 4 does not hold for settings other than tabular MPG; for instance, for continuous games, one can use quadratic functions to construct simple counterexamples [31]. Also, similar to Remark 1, this theorem focuses on the local geometry of the NEs but does not claim the global existence or convergence of either strict NEs or fully mixed NEs.
4.3 Sample-based learning: algorithm and sample complexity
In this section, we no longer assume access to the exact gradient, but instead need to estimate it via samples. Throughout the section, we make the following additional assumption on MPGs:
Assumption 2.
($\sigma$-sufficient exploration on states) There exist a positive integer $k$ and a constant $\sigma>0$ such that for any joint policy $\pi$ and any initial state-action pair $(s_0,a_0)$, we have
$$\Pr^{\pi}(s_k=s\mid s_0,a_0)\ \ge\ \sigma\quad\text{for all }s\in\mathcal{S}; \qquad (14)$$
i.e., it poses a condition on the mixing of the Markov chain induced by any policy $\pi$: after a sufficiently long time $k$, the probability of being at any given state is at least $\sigma$, regardless of the initial state-action pair.
Note that similar assumptions are common in proving finite-time convergence of RL algorithms (e.g. [7, 64, 65]), where ergodicity of the Markov chain induced by certain policies is generally assumed, which results in every state-action pair being visited with positive probability under the stationary distribution.
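For a given model and policy, Assumption 2 can be checked directly by computing the $k$-step state distribution from every state-action pair; a small sketch (our own helper, with the joint action flattened to a single index):

```python
import numpy as np

def min_k_step_visitation(P, pi, k):
    """Smallest probability, over all initial (s0, a0) and target states s, of being
    in state s after exactly k steps of the chain induced by the joint policy pi.
    P: joint transition kernel of shape (S, A, S); pi: joint policy of shape (S, A)."""
    S = P.shape[0]
    P_state = np.einsum('sat,sa->st', P, pi)                 # one-step state kernel under pi
    k_step = P.reshape(-1, S) @ np.linalg.matrix_power(P_state, k - 1)
    return k_step.min()   # Assumption 2 requires this to be >= sigma for every policy pi
```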
We further introduce the state transition probability under the joint policy $\pi$ as $P_{\pi}(s'\mid s):=\sum_{a\in\mathcal{A}}\pi(a\mid s)\,P(s'\mid s,a)$.
We consider fully decentralized learning, where agent $i$'s observation at time $t$ only includes the state $s_t$, its own action $a_{i,t}$, and its own reward $r_{i,t}$. Such fully decentralized learning is possible because, when $\pi_{-i}$ is fixed, agent $i$ can be treated as an independent learner whose underlying MDP is the 'averaged' MDP described in Section 2. With this key observation, we design a 'model-based' on-policy learning algorithm, where agents perform policy evaluation in the inner loop and gradient ascent in the outer loop. The algorithm is provided in Algorithm 1. Roughly, it consists of three main steps: 1) (Inner loop) Estimate the averaged transition probability and averaged reward using on-policy samples. 2) (Inner loop) Calculate the estimated averaged Q-function and discounted state visitation distribution, and compute the estimated gradient accordingly. 3) (Outer loop) Run projected gradient ascent with the estimated gradients. Before discussing our algorithm in more detail, we highlight that the idea of using the 'averaged' MDP can be used to design other learning methods, including model-free methods, e.g., using temporal difference methods to perform policy evaluation. One caveat is that the 'averaged' MDP is only well-defined when all the other agents use fixed policies. This makes it difficult to extend the two-timescale framework (i.e., with an inner loop and an outer loop) to single-timescale settings, which is an interesting future direction. Further, note that the current algorithm requires full state observation; it remains an intriguing open question to extend it to the case with only partial observability. We would also like to point out that the algorithm initialization still requires extra consensus/coordination among the players to agree on the hyperparameters, etc., which guarantees that agents go through the same equal-length phases to sample trajectories and compute gradient estimates.
Step 1: empirical estimation of $\bar{P}_i$ and $\bar{r}_i$: Given a sequence of observations $(s_t,a_{i,t},r_{i,t})$ generated by the current joint policy, the empirical estimate of $\bar{P}_i$ is given by:
(19) |
Here we separately treat the special case where a state-action pair $(s,a_i)$ is not visited over the whole trajectory, so that the estimate remains well-defined.
Similarly, the estimates of are given by:
(24) | ||||
(27) |
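A minimal frequency-count implementation of this step is sketched below. It is our own reading of the estimator: unvisited $(s,a_i)$ pairs fall back to a uniform next-state distribution and zero reward, and details such as sample spacing may differ from the paper's Algorithm 1.

```python
import numpy as np

def estimate_averaged_mdp(states, actions, rewards, S, A):
    """Empirical estimates of agent i's averaged transition kernel and averaged reward
    from its own on-policy observations (s_t, a_{i,t}, r_{i,t})."""
    counts = np.zeros((S, A, S))
    reward_sum = np.zeros((S, A))
    for t in range(len(states) - 1):
        counts[states[t], actions[t], states[t + 1]] += 1
        reward_sum[states[t], actions[t]] += rewards[t]
    visits = counts.sum(axis=2)                             # visits to each (s, a_i)
    P_hat = np.where(visits[:, :, None] > 0,
                     counts / np.maximum(visits, 1)[:, :, None],
                     1.0 / S)                               # uniform fallback for unvisited pairs
    r_hat = reward_sum / np.maximum(visits, 1)              # zero fallback for unvisited pairs
    return P_hat, r_hat
```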
Step 2: estimation of the averaged Q-function, the visitation distribution, and the gradient: We slightly abuse notation and use $\bar{Q}_i$ and $\bar{r}_i$ to also denote the vectors, indexed by state-action pairs $(s,a_i)$, corresponding to the averaged Q-function and averaged reward function of agent $i$. Similarly, $\hat{\bar{Q}}_i$ and $\hat{\bar{r}}_i$ are used to denote the vectors for the estimated quantities. Define the $|\mathcal{S}||\mathcal{A}_i|\times|\mathcal{S}||\mathcal{A}_i|$ matrix $\bar{P}^{\pi_i}$ with entries $\bar{P}^{\pi_i}[(s,a_i),(s',a_i')]:=\bar{P}_i(s'\mid s,a_i)\,\pi_i(a_i'\mid s')$:
Then from Lemma 1, the vector $\bar{Q}_i$ is given by $\bar{Q}_i=(I-\gamma\bar{P}^{\pi_i})^{-1}\bar{r}_i$.
The estimated averaged Q-function is then given by: (By the Perron–Frobenius theorem, the absolute values of the eigenvalues of $\hat{\bar{P}}^{\pi_i}$ are upper bounded by 1, which guarantees that the matrix $I-\gamma\hat{\bar{P}}^{\pi_i}$ is invertible.)
$$\hat{\bar{Q}}_i=\big(I-\gamma\hat{\bar{P}}^{\pi_i}\big)^{-1}\hat{\bar{r}}_i. \qquad (28)$$
Similarly, from (2), the estimated discounted state visitation distribution $\hat{d}_\rho$ is given by (see Appendix C for the derivation):
$$\hat{d}_\rho=(1-\gamma)\big(I-\gamma\hat{P}_{\pi}^{\top}\big)^{-1}\rho, \qquad (29)$$
where $\hat{P}_{\pi}(s'\mid s):=\sum_{a_i}\pi_i(a_i\mid s)\,\hat{\bar{P}}_i(s'\mid s,a_i)$ is the estimated state transition matrix under $\pi$.
Then accordingly, the estimated gradient is computed as:
$$\widehat{\frac{\partial V_i^{\theta}(\rho)}{\partial\theta_{i,s,a_i}}}=\frac{1}{1-\gamma}\,\hat{d}_{\rho}(s)\,\hat{\bar{Q}}_i(s,a_i). \qquad (30)$$
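The two linear solves in (28)-(29) and the plug-in gradient (30) can be written compactly as follows (a sketch under the plug-in forms shown above; the names are ours):

```python
import numpy as np

def estimate_q_d_grad(P_hat, r_hat, pi_i, rho, gamma):
    """Plug-in estimates of the averaged Q-function (28), the discounted state
    visitation distribution (29), and the gradient (30) for agent i."""
    S, A, _ = P_hat.shape
    # state-action transition matrix of the estimated averaged MDP under agent i's policy
    P_sa = np.einsum('sat,tb->satb', P_hat, pi_i).reshape(S * A, S * A)
    Q_hat = np.linalg.solve(np.eye(S * A) - gamma * P_sa, r_hat.reshape(-1)).reshape(S, A)
    # estimated state-to-state kernel and discounted visitation
    P_state = np.einsum('sat,sa->st', P_hat, pi_i)
    d_hat = (1 - gamma) * np.linalg.solve(np.eye(S) - gamma * P_state.T, rho)
    grad_hat = d_hat[:, None] * Q_hat / (1 - gamma)          # entry (s, a_i), cf. (30)
    return Q_hat, d_hat, grad_hat
```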
Step 3: Projected gradient ascent onto the set of $\epsilon$-greedy policies: Let $\mathrm{Unif}_{\mathcal{A}_i}$ denote the $|\mathcal{A}_i|$-dimensional uniform distribution. Define $\Theta_i^{\epsilon}:=\{(1-\epsilon)\theta_i+\epsilon\,\mathrm{Unif}_{\mathcal{A}_i}:\theta_i\in\Theta_i\}$ (equivalently, the set of $\theta_i\in\Theta_i$ with $\theta_{i,s,a_i}\ge\epsilon/|\mathcal{A}_i|$ for all $s,a_i$). We use $\Theta_i^{\epsilon}$ and $\Theta^{\epsilon}:=\Theta_1^{\epsilon}\times\cdots\times\Theta_n^{\epsilon}$ to denote the sets of $\epsilon$-greedy policies for agent $i$ and for all agents, respectively. After each gradient ascent step, the parameter is further projected onto $\Theta_i^{\epsilon}$, i.e.:
$$\theta_i^{(t+1)}=\mathrm{Proj}_{\Theta_i^{\epsilon}}\Big(\theta_i^{(t)}+\eta\,\widehat{\nabla_{\theta_i}V_i^{\theta^{(t)}}(\rho)}\Big). \qquad (31)$$
The reason for projecting onto $\Theta_i^{\epsilon}$ instead of $\Theta_i$ is to make sure that every action has a positive probability of being selected, which is needed to obtain a relatively accurate estimate of the averaged Q-function. Intuitively, a larger $\epsilon$ introduces a larger additional error in the NE-gap, while a smaller $\epsilon$ requires more samples to estimate the gradient. The choice of $\epsilon$ thus trades off these two effects.
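The projection in (31) reduces to the standard Euclidean projection onto the probability simplex [73] after a change of variables, applied separately to each per-state block; a sketch (our own helpers, assuming $0<\epsilon<1$):

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex (cf. [73])."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    k = np.arange(1, len(v) + 1)
    idx = np.nonzero(u + (1.0 - css) / k > 0)[0][-1]
    tau = (1.0 - css[idx]) / (idx + 1)
    return np.maximum(v + tau, 0.0)

def project_eps_greedy(theta_is, eps):
    """Project one per-state block theta_{i,s} onto the eps-greedy set
    { p in simplex : p_a >= eps/|A_i| }, via p = eps*uniform + (1-eps)*q with q in the simplex."""
    A = len(theta_is)
    uniform = np.full(A, 1.0 / A)
    q = project_simplex((theta_is - eps * uniform) / (1.0 - eps))
    return eps * uniform + (1.0 - eps) * q
```

Because the $\epsilon$-greedy set is an affine scaling of the simplex, projecting the rescaled point and mapping back gives the exact Euclidean projection onto $\Theta_i^{\epsilon}$.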
Theorem 5.
We would like to first compare our result with a related work on the sample complexity of learning MPGs [36]. Interestingly, both sample complexities have the same order of dependence on $\epsilon$. It is an interesting question whether such dependence is fundamental for learning with simultaneously updating agents. However, [36] considers Monte Carlo, model-free gradient estimation, while our algorithm takes a model-based approach, which suffers less from high variance, and the notion of an 'averaged' MDP can potentially be extrapolated to other settings.
Proof Sketch: The proof of Theorem 5 consists of three major steps. The first step is to bound the estimation error of the parameters of the 'averaged' MDP. This step leverages Assumption 1 and the Azuma–Hoeffding inequality to get high-probability bounds on the parameter errors. The second step translates the estimation error of the 'averaged' MDP into a gradient estimation error. The third step treats the gradient estimation step as an oracle that returns biased gradient information, where the bias is the estimation error. The final result is obtained by analyzing the performance of projected gradient ascent with biased gradients. The detailed proofs are provided in Appendix J.
Comparison with centralized learning: The best known sample complexity bound for the single-agent/centralized MDP setting is given in [66]. Compared with (32), the centralized bound scales better with respect to the accuracy $\epsilon$. However, as argued in the previous subsection, the total action space in the centralized bound scales exponentially with the number of agents $n$, while our complexity bound only scales linearly. Here, we briefly state the fundamental difficulties of learning in the SG setting compared with centralized learning, which also explains why our bound scales worse with respect to the other factors. 1) Firstly, the optimization landscape in the SG setting is more complicated. For centralized learning, the gradient domination property is stronger, and accelerated gradient methods (e.g., natural policy gradient or entropy regularization) can speed up the convergence of exact gradient methods [33, 56]. In contrast, for multi-agent settings, due to the more complicated optimization landscape, these methods can no longer improve the dependency on $\epsilon$, which makes the outer-loop complexity larger. 2) Secondly, the behavior of the other agents makes the environment non-stationary, i.e., the averaged Q-function as well as the averaged transition probability of agent $i$ depend on the policies of the other agents. Thus, unlike centralized learning, where the state transition probability matrix can be estimated in an off-policy or even offline manner, i.e., using data samples from different policies, the averaged MDP can only be estimated in an online manner, using samples generated by exactly the same joint policy, which increases the inner-loop sample complexity. 3) Thirdly, the complicated interactions amongst agents necessitate more care during the learning process. Algorithms designed for centralized learning that achieve near-optimal sample complexity are generally Q-learning-type algorithms. However, in SGs, it can be shown that having each agent greedily maximize its own averaged Q-function may lead to non-convergent behavior. Thus, we need to consider algorithms that update in a less aggressive manner, e.g., soft Q-learning or policy gradient (which is considered in this paper).
5 Numerical simulations
Table 2 (rewards for Game 1, the multi-stage prisoner's dilemma; entries are (agent 1's reward, agent 2's reward), rows are agent 1's actions, columns are agent 2's actions):
| | cooperate | betray |
|---|---|---|
| cooperate | (-1,-1) | (-3,0) |
| betray | (0,-3) | (-2,-2) |

Table 3 (identical rewards for Game 2, the coordination game):
| | network 1 | network 2 |
|---|---|---|
| network 1 | 2 | 0 |
| network 2 | 0 | 1 |
This section studies three numerical examples to corroborate our theoretical results. The multi-stage prisoner’s dilemma (Game 1) confirms the local stability results for general-sum SGs; the coordination game (Game 2) considers local stability as well as convergence rate of exact gradient play for MPG; the state-based coordination game (Game 3) tests the performance of the sample-based algorithm proposed in Section 4.3.
5.1 Game 1: multi-stage prisoner’s dilemma
The first example, the multi-stage prisoner's dilemma model [45], studies exact gradient play for general SGs. It is a 2-agent SG with two states and two actions per agent. The reward of each agent is independent of the state and is given by Table 2. The state transition probability is determined by the agents' previous actions:
Here, one action means that the agent chooses to cooperate and the other means betray. The state serves as a noisy indicator, with error rate $\delta$, of whether both agents cooperated or not in the previous stage.
Table 4 (convergence to the cooperative NE in Game 1 for different indicator error rates $\delta$; the last column is the ratio of converging trials, mean ± std):
| $\delta$ | | ratio % |
|---|---|---|
| 0.1 | 433.3 | (47.8 ± 5.1)% |
| 0.05 | 979.3 | (66.3 ± 4.3)% |
| 0.01 | 2498.6 | (77.4 ± 2.8)% |
The single-stage game corresponds to the famous prisoner's dilemma, and it is well known that it has a unique NE in which both agents decide to betray. The dilemma arises from the fact that there exists a joint non-NE strategy under which both players obtain a higher reward than what they get under the NE. However, in the multi-stage case, the introduction of an additional state allows agents to make decisions based on whether they have cooperated before. It turns out that cooperation can be achieved given that the discount factor is close to 1 and that the indicator is accurate enough, i.e., $\delta$ is close to 0. Apart from the fully-betray strategy, where both agents betray regardless of the state, there is another strict NE in which the agents cooperate given that they cooperated in the previous stage, and betray otherwise.
We simulate gradient play for this model and mainly focus on convergence to the cooperative equilibrium. We fix the discount factor $\gamma$. The initial policy is randomly initialized, with the relevant entries uniformly sampled; the initialization implies that at the beginning both agents are willing to cooperate to some extent given that they cooperated in the previous stage. Figure 1 shows a trial converging to the cooperative NE starting from a randomly initialized policy. We then study the size of the region of attraction of the cooperative NE and how it varies with the indicator's error rate $\delta$, as shown in Table 4. The size of the region of attraction can be reflected by the ratio of converging trials among multiple trials with different initial points. Here each ratio is calculated from 100 trials, and the mean and standard deviation (std) are obtained by computing the ratio 10 times using different trials. An empirical estimate of the volume of the region is the convergence ratio times the volume of the uniform sampling area; hence the larger the ratio, the larger the region of attraction. Intuitively speaking, the more accurately the state reflects the cooperation situation of the agents, the less incentive an agent has to betray when observing the cooperative state; that is, the value gap between the equilibrium action and the other action becomes larger, and thus the convergence ratio becomes larger. This intuition matches the simulation results as well as the theoretical guarantee on local convergence around a strict NE in Theorem 2.
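For reference, one possible encoding of Game 1's dynamics consistent with the description above is sketched below (the state/action encoding and the value of δ are our own illustrative choices, not the exact parameters used in the experiments):

```python
import numpy as np

# States: 0 = "indicator says both cooperated", 1 = "indicator says otherwise".
# Actions: 0 = cooperate, 1 = betray.
delta = 0.1                                   # indicator error rate (varied in Table 4)

R = {  # (agent 1 reward, agent 2 reward), matching Table 2
    (0, 0): (-1, -1), (0, 1): (-3, 0),
    (1, 0): (0, -3),  (1, 1): (-2, -2),
}

def next_state(a1, a2, rng):
    """Noisy indicator of whether both agents cooperated in the previous stage."""
    both_cooperated = (a1 == 0 and a2 == 0)
    indicator_correct = rng.random() > delta
    return 0 if both_cooperated == indicator_correct else 1

rng = np.random.default_rng(0)
s = next_state(0, 0, rng)   # both cooperate -> state 0 with probability 1 - delta
```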
5.2 Game 2: coordination game
Our second example is based on the coordination game [67]. It is an identical-reward game, which is a special class of Markov potential games. Consider a 2-agent identical-reward coordination game in which each agent chooses one of two networks, and the state records which network each agent actually ends up in. The state transition probability is given by:
where the noise parameter captures the randomness in reaching a network after taking an action. The reward table is given by Table 3. Here we can view the two actions as two different social networks that the agents can choose to join. The agents are rewarded only if they are in the same network, and network 1 yields a higher reward than network 2. The state records the network that each agent actually occupies after taking its action.
There is at least one fully mixed NE in which both agents join each network with a fixed probability regardless of the current occupancy of the networks, and there are several different strict NEs that can be verified numerically. Figure 2 shows a gradient play trajectory whose initial point lies in a close neighborhood of the mixed NE. As the algorithm progresses, the trajectory in Figure 2 diverges from the mixed NE, indicating that the fully mixed NE is indeed a saddle point. This corroborates our finding in Theorem 4. Figure 3 shows the evolution of the total reward under gradient play for different random initial points. Different initial points converge to different strict NEs, each with a different total reward (some strict NEs with relatively small regions of attraction are omitted in the figure). While the total rewards are different, as shown in Figure 4, the NE-gap of each trajectory (corresponding to the same initial points as in Figure 3) converges to 0. This suggests that the algorithm is indeed able to converge to an NE. Notice that the NE-gaps do not decrease monotonically.
5.3 Game 3: state-based coordination game
Our third numerical example studies the empirical performance of the sample-based learning algorithm, Algorithm 1. Here we consider a generalization of the coordination game (Game 2) in which the two players now try to coordinate on a 2D grid. The two-player state-based coordination game on a grid is defined as follows: the state is the pair of grid cells currently occupied by the two agents, and each agent can choose to stay at its current cell or move left/right/up/down to a neighboring cell. We assume that there is random noise in the transition, so an agent might end up in a cell neighboring its target location with some error probability. The reward is given by:
i.e., the two agents are rewarded only if they are both at the upper-left corner or both at the lower-right corner at the same time.
For the numerical simulation, the results are displayed in Figures 6–8. Figure 6 shows that the total reward increases as the number of iterations increases, and Figure 8 shows that the NE-gap converges to a value close to zero. However, because we project the policy onto the $\epsilon$-greedy set $\Theta^{\epsilon}$, the NE-gap cannot converge to exactly zero. Figure 7 visualizes the discounted state visitation distribution. To make the visualization more intuitive, we look at the marginalized discounted state visitation distribution defined below:
The marginal distribution for the other agent is defined similarly. From Figure 7 we can see that most of the probability mass concentrates on the upper-left and lower-right corner cells, indicating that the two agents are able to coordinate most of the time.
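Computing the marginalized visitation distributions from the joint one is a simple sum over the other agent's coordinate; a sketch (the grid size is an illustrative assumption):

```python
import numpy as np

def marginal_visitations(d_joint, n_cells):
    """Marginalize the joint discounted state visitation distribution, indexed by the
    pair of grid cells (x1, x2), over each agent's coordinate."""
    d = np.asarray(d_joint).reshape(n_cells, n_cells)
    return d.sum(axis=1), d.sum(axis=0)      # marginals for agent 1 and agent 2

# e.g. with a 3x3 grid (n_cells = 9), most of the mass of each marginal would sit on the
# upper-left and lower-right cells when the agents coordinate most of the time.
```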
6 Conclusion and Discussion
This paper studies the optimization landscape and convergence of gradient play for SGs. For general SGs, we establish local convergence for strict NEs. For MPGs, we establish the global convergence with respect to NE gap and the local stability results for strict NEs as well as fully mixed ones. A sample-based NE-learning algorithm with sample complexity guarantee is also proposed under this setting. There are many interesting future directions. Firstly, the current assumption of MPGs is relatively strong compared with the notion of potential games in the one-shot setup, which might restrict its application to broader settings. More effort would be needed to identify other special types of SGs that facilitate efficient learning. It would also be meaningful to investigate real-life applications, such as dynamic congestion and routing. Secondly, other sample-based learning methods, such as actor-critic, natural policy gradient, Gauss-Newton methods, could also be considered, which might improve the sample complexity.
References
- [1] F. Daneshfar and H. Bevrani, “Load–frequency control: a ga-based multi-agent reinforcement learning,” IET generation, transmission & distribution, vol. 4, no. 1, pp. 13–26, 2010.
- [2] S. Shalev-Shwartz, S. Shammah, and A. Shashua, “Safe, multi-agent, reinforcement learning for autonomous driving,” ArXiv, vol. abs/1610.03295, 2016.
- [3] K. Zhang, Z. Yang, H. Liu, T. Zhang, and T. Basar, “Fully decentralized multi-agent reinforcement learning with networked agents,” in International Conference on Machine Learning. PMLR, 2018, pp. 5872–5881.
- [4] H.-T. Wai, Z. Yang, Z. Wang, and M. Hong, “Multi-agent reinforcement learning via double averaging primal-dual optimization,” ser. NIPS’18. Red Hook, NY, USA: Curran Associates Inc., 2018, p. 9672–9683.
- [5] T. Chen, K. Zhang, G. B. Giannakis, and T. Başar, “Communication-efficient policy gradient methods for distributed reinforcement learning,” IEEE Transactions on Control of Network Systems, vol. 9, no. 2, pp. 917–929, 2022.
- [6] Y. Li, Y. Tang, R. Zhang, and N. Li, “Distributed reinforcement learning for decentralized linear quadratic control: A derivative-free policy optimization approach,” 2019.
- [7] G. Qu, A. Wierman, and N. Li, “Scalable reinforcement learning of localized policies for multi-agent networked systems,” in Learning for Dynamics and Control. PMLR, 2020, pp. 256–266.
- [8] L. S. Shapley, “Stochastic games,” Proceedings of the national academy of sciences, vol. 39, no. 10, pp. 1095–1100, 1953.
- [9] M. L. Littman, “Markov games as a framework for multi-agent reinforcement learning,” in Machine learning proceedings 1994. Elsevier, 1994, pp. 157–163.
- [10] M. Bowling and M. Veloso, “An analysis of stochastic game theory for multiagent reinforcement learning,” Carnegie-Mellon Univ Pittsburgh Pa School of Computer Science, Tech. Rep., 2000.
- [11] Y. Shoham, R. Powers, and T. Grenager, “Multi-agent reinforcement learning: a critical survey,” Technical report, Stanford University, Tech. Rep., 2003.
- [12] L. Buşoniu, R. Babuška, and B. De Schutter, “Multi-agent reinforcement learning: An overview,” Innovations in multi-agent systems and applications-1, pp. 183–221, 2010.
- [13] M. Lanctot, V. Zambaldi, A. Gruslys, A. Lazaridou, K. Tuyls, J. Pérolat, D. Silver, and T. Graepel, “A unified game-theoretic approach to multiagent reinforcement learning,” in Proceedings of the 31st International Conference on Neural Information Processing Systems, 2017.
- [14] K. Zhang, Z. Yang, and T. Başar, “Multi-agent reinforcement learning: A selective overview of theories and algorithms,” Handbook of reinforcement learning and control, pp. 321–384, 2021.
- [15] J. Hu and M. P. Wellman, “Nash Q-learning for general-sum stochastic games,” Journal of machine learning research, vol. 4, no. Nov, pp. 1039–1069, 2003.
- [16] G. Tesauro, “Extending Q-learning to general adaptive multi-agent systems,” Advances in neural information processing systems, vol. 16, pp. 871–878, 2003.
- [17] M. Bowling and M. Veloso, “Rational and convergent learning in stochastic games,” in International joint conference on artificial intelligence, vol. 17, no. 1. Citeseer, 2001, pp. 1021–1026.
- [18] S. Abdallah and V. Lesser, “A multiagent reinforcement learning algorithm with non-linear dynamics,” Journal of Artificial Intelligence Research, vol. 33, pp. 521–549, 2008.
- [19] C. Zhang and V. Lesser, “Multi-agent learning with policy prediction,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 24, no. 1, 2010.
- [20] J. Foerster, R. Y. Chen, M. Al-Shedivat, S. Whiteson, P. Abbeel, and I. Mordatch, “Learning with opponent-learning awareness,” in Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems. Richland, SC: International Foundation for Autonomous Agents and Multiagent Systems, 2018.
- [21] L. Shapley, “Some topics in two-person games,” Advances in game theory, vol. 52, pp. 1–29, 1964.
- [22] V. P. Crawford, “Learning behavior and mixed-strategy Nash equilibria,” Journal of Economic Behavior & Organization, vol. 6, no. 1, pp. 69–78, 1985.
- [23] J. S. Jordan, “Three problems in learning mixed-strategy Nash equilibria,” Games and Economic Behavior, vol. 5, no. 3, pp. 368–386, 1993.
- [24] V. Krishna and T. Sjöström, “On the convergence of fictitious play,” Mathematics of Operations Research, vol. 23, no. 2, pp. 479–511, 1998.
- [25] J. S. Shamma and G. Arslan, “Dynamic fictitious play, dynamic gradient play, and distributed convergence to Nash equilibria,” IEEE Transactions on Automatic Control, vol. 50, no. 3, pp. 312–327, 2005.
- [26] E. Kohlberg and J.-F. Mertens, “On the strategic stability of equilibria,” Econometrica: Journal of the Econometric Society, pp. 1003–1037, 1986.
- [27] E. Van Damme, Stability and perfection of Nash equilibria. Springer, 1991, vol. 339.
- [28] D. Foster and H. P. Young, “Regret testing: Learning to play Nash equilibrium without knowing you have an opponent,” Theoretical Economics, vol. 1, no. 3, pp. 341–367, 2006.
- [29] F. Germano and G. Lugosi, “Global Nash convergence of foster and young’s regret testing,” Games and Economic Behavior, vol. 60, no. 1, pp. 135–154, 2007.
- [30] J. R. Marden, H. P. Young, G. Arslan, and J. S. Shamma, “Payoff-based dynamics for multiplayer weakly acyclic games,” SIAM Journal on Control and Optimization, vol. 48, no. 1, pp. 373–396, 2009.
- [31] E. Mazumdar, L. J. Ratliff, and S. S. Sastry, “On gradient-based learning in continuous games,” SIAM Journal on Mathematics of Data Science, vol. 2, no. 1, pp. 103–131, 2020.
- [32] K. Zhang, Z. Yang, and T. Basar, “Policy optimization provably converges to Nash equilibria in zero-sum linear quadratic games,” in Advances in Neural Information Processing Systems, vol. 32, 2019.
- [33] A. Agarwal, S. M. Kakade, J. D. Lee, and G. Mahajan, “On the theory of policy gradient methods: Optimality, approximation, and distribution shift,” 2020.
- [34] D. González-Sánchez and O. Hernández-Lerma, Discrete–time stochastic control and dynamic potential games: the Euler–Equation approach. Springer Science & Business Media, 2013.
- [35] S. V. Macua, J. Zazo, and S. Zazo, “Learning parametric closed-loop policies for Markov potential games,” in International Conference on Learning Representations, 2018.
- [36] S. Leonardos, W. Overman, I. Panageas, and G. Piliouras, “Global convergence of multi-agent policy gradient in Markov potential games,” in International Conference on Learning Representations, 2022.
- [37] M. Tan, “Multi-agent reinforcement learning: Independent vs. cooperative agents,” in Proceedings of the tenth international conference on machine learning, 1993, pp. 330–337.
- [38] C. Claus and C. Boutilier, “The dynamics of reinforcement learning in cooperative multiagent systems,” AAAI/IAAI, vol. 1998, no. 746-752, p. 2, 1998.
- [39] L. Panait and S. Luke, “Cooperative multi-agent learning: The state of the art,” Autonomous agents and multi-agent systems, vol. 11, no. 3, pp. 387–434, 2005.
- [40] L. Matignon, G. J. Laurent, and N. Le Fort-Piat, “Independent reinforcement learners in cooperative markov games: a survey regarding coordination problems,” The Knowledge Engineering Review, vol. 27, no. 1, pp. 1–31, 2012.
- [41] Z. Zhang, Y.-S. Ong, D. Wang, and B. Xue, “A collaborative multiagent reinforcement learning method based on policy gradient potential,” IEEE transactions on cybernetics, vol. 51, no. 2, pp. 1015–1027, 2019.
- [42] A. Oroojlooy and D. Hajinezhad, “A review of cooperative multi-agent deep reinforcement learning,” Applied Intelligence, vol. 53, no. 11, pp. 13 677–13 722, 2023.
- [43] Z. Song, S. Mei, and Y. Bai, “When can we learn general-sum Markov games with a large number of players sample-efficiently?” 2021.
- [44] C. Jin, Q. Liu, Y. Wang, and T. Yu, “V-learning–a simple, efficient, decentralized algorithm for multiagent rl,” arXiv preprint arXiv:2110.14555, 2021.
- [45] G. Arslan and S. Yüksel, “Decentralized Q-learning for stochastic teams and games,” IEEE Transactions on Automatic Control, vol. 62, no. 4, pp. 1545–1558, 2016.
- [46] B. Yongacoglu, G. Arslan, and S. Yüksel, “Decentralized learning for optimality in stochastic dynamic teams and games with local control and global state information,” IEEE Transactions on Automatic Control, vol. 67, no. 10, pp. 5230–5245, 2022.
- [47] C. Daskalakis, D. J. Foster, and N. Golowich, “Independent policy gradient methods for competitive reinforcement learning,” Advances in neural information processing systems, vol. 33, pp. 5527–5540, 2020.
- [48] A. Ozdaglar, M. O. Sayin, and K. Zhang, “Independent learning in stochastic games,” arXiv preprint arXiv:2111.11743, 2021.
- [49] R. Zhang, Z. Ren, and N. Li, “Gradient play in multi-agent Markov stochastic games: Stationary points and local geometry,” in International Symposium on Mathematical Theory of Networks and Systems, 2022.
- [50] R. Zhang, J. Mei, B. Dai, D. Schuurmans, and N. Li, “On the global convergence rates of decentralized softmax gradient play in markov potential games,” Advances in Neural Information Processing Systems, vol. 35, pp. 1923–1935, 2022.
- [51] R. Fox, S. M. Mcaleer, W. Overman, and I. Panageas, “Independent natural policy gradient always converges in Markov potential games,” in International Conference on Artificial Intelligence and Statistics. PMLR, 2022, pp. 4414–4425.
- [52] D. Ding, C.-Y. Wei, K. Zhang, and M. Jovanovic, “Independent policy gradient for large-scale markov potential games: Sharper rates, function approximation, and game-agnostic convergence,” in International Conference on Machine Learning. PMLR, 2022, pp. 5166–5220.
- [53] A. Agarwal, N. Jiang, S. M. Kakade, and W. Sun, “Reinforcement learning: Theory and algorithms,” CS Dept., UW Seattle, Seattle, WA, USA, Tech. Rep, vol. 32, 2019.
- [54] D. Fudenberg and J. Tirole, Game theory. MIT press, 1991.
- [55] M. Maschler, S. Zamir, and E. Solan, Game theory. Cambridge University Press, 2020.
- [56] J. Mei, C. Xiao, C. Szepesvari, and D. Schuurmans, “On the global convergence rates of softmax policy gradient methods,” in Proceedings of the 37th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, vol. 119. PMLR, 13–18 Jul 2020, pp. 6820–6829.
- [57] S. M. Kakade and J. Langford, “Approximately optimal approximate reinforcement learning,” in Machine Learning, Proceedings of the Nineteenth International Conference (ICML 2002), University of New South Wales, Sydney, Australia, July 8-12, 2002, 2002, pp. 267–274.
- [58] R. S. Sutton, D. A. McAllester, S. P. Singh, Y. Mansour et al., “Policy gradient methods for reinforcement learning with function approximation.” in NIPs, vol. 99. Citeseer, 1999, pp. 1057–1063.
- [59] D. Levhari and L. Mirman, “The great fish war: An example using a dynamic Cournot-Nash solution,” Bell Journal of Economics, vol. 11, no. 1, pp. 322–334, 1980.
- [60] W. D. Dechert and S. O’Donnell, “The stochastic lake game: A numerical solution,” Journal of Economic Dynamics and Control, vol. 30, no. 9-10, pp. 1569–1587, 2006.
- [61] D. H. Mguni, Y. Wu, Y. Du, Y. Yang, Z. Wang, M. Li, Y. Wen, J. Jennings, and J. Wang, “Learning in nonzero-sum stochastic games with potentials,” in International Conference on Machine Learning, 2021. [Online]. Available: https://api.semanticscholar.org/CorpusID:232257740
- [62] A. Beck, First-Order Methods in Optimization. Philadelphia, PA: Society for Industrial and Applied Mathematics, 2017. [Online]. Available: https://epubs.siam.org/doi/abs/10.1137/1.9781611974997
- [63] S. Ghadimi and G. Lan, “Accelerated gradient methods for nonconvex nonlinear and stochastic programming,” Mathematical Programming, vol. 156, no. 1-2, pp. 59–99, 2016.
- [64] R. Srikant and L. Ying, “Finite-time error bounds for linear stochastic approximation and TD learning,” in Conference on Learning Theory. PMLR, 2019, pp. 2803–2830.
- [65] G. Li, Y. Wei, Y. Chi, Y. Gu, and Y. Chen, “Sample complexity of asynchronous q-learning: Sharper analysis and variance reduction,” IEEE Transactions on Information Theory, vol. 68, no. 1, pp. 448–473, 2022.
- [66] A. Sidford, M. Wang, X. Wu, L. Yang, and Y. Ye, “Near-optimal time and sample complexities for solving Markov decision processes with a generative model,” in Advances in Neural Information Processing Systems, vol. 31, 2018.
- [67] R. Cooper, Coordination games. Cambridge University Press, 1999.
- [68] D. Monderer and L. S. Shapley, “Potential games,” Games and economic behavior, vol. 14, no. 1, pp. 124–143, 1996.
- [69] N. Bertrand, N. Markey, S. Sadhukhan, and O. Sankur, “Dynamic network congestion games,” arXiv preprint arXiv:2009.13632, 2020.
- [70] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA, USA: A Bradford Book, 2018.
- [71] W. Hoeffding, “Probability inequalities for sums of bounded random variables,” in The collected works of Wassily Hoeffding. Springer, 1994, pp. 409–426.
- [72] K. Azuma, “Weighted sums of certain dependent random variables,” Tohoku Mathematical Journal, Second Series, vol. 19, no. 3, pp. 357–367, 1967.
- [73] W. Wang and M. Á. Carreira-Perpiñán, “Projection onto the probability simplex: An efficient algorithm with a simple proof, and an application,” CoRR, vol. abs/1309.1541, 2013. [Online]. Available: http://arxiv.org/abs/1309.1541
Appendix A More about Markov potential games
This section is dedicated to a more thorough understanding of Markov potential games, which includes necessary or sufficient conditions for MPG and a few (counter)examples.
A.1. A necessary condition and counterexamples
Definition 5.
([68]) Define a path in the parameter space as a finite sequence $\mathcal{C}=(\theta^{(0)},\theta^{(1)},\dots,\theta^{(K)})$, where consecutive points $\theta^{(k)}$ and $\theta^{(k+1)}$ differ in only one agent's component $\theta_{i_k}$. A closed path is a path such that $\theta^{(K)}=\theta^{(0)}$. Define:
$$I(\mathcal{C}):=\sum_{k=0}^{K-1}\Big(V_{i_k}^{\theta^{(k+1)}}(\rho)-V_{i_k}^{\theta^{(k)}}(\rho)\Big).$$
The following theorem is a direct generalization of Theorem 2.8 in [68] to MPG setting:
Lemma 6.
For Markov potential games, $I(\mathcal{C})=0$ for any finite closed path $\mathcal{C}$.
Proof.
The proof is quite straightforward from the definition of MPG
∎
Although the proof of Lemma 6 is straightforward, it serves as a useful tool in proving that a game is not a MPG. For example, applying the theorem we can get that the following conditions are not sufficient conditions for MPG.
Proposition 2.
None of the following conditions on a SG necessarily imply that it is a MPG:
(1) There exists $\phi:\mathcal{S}\times\mathcal{A}\to\mathbb{R}$ such that at each state $s$, $r_i(s,a_i,a_{-i})-r_i(s,a_i',a_{-i})=\phi(s,a_i,a_{-i})-\phi(s,a_i',a_{-i})$ for all $i$, $a_i$, $a_i'$, $a_{-i}$;
(2) There exists a state-independent $\phi:\mathcal{A}\to\mathbb{R}$ such that for every state $s$, $r_i(s,a_i,a_{-i})-r_i(s,a_i',a_{-i})=\phi(a_i,a_{-i})-\phi(a_i',a_{-i})$;
(3) The rewards are independent of the state and admit a potential function, i.e., $r_i(a_i,a_{-i})-r_i(a_i',a_{-i})=\phi(a_i,a_{-i})-\phi(a_i',a_{-i})$.
Proof.
(of Proposition 2) A simple counterexample showing that the conditions in (2) are not sufficient is the multi-stage prisoner's dilemma (Game 1) introduced in the numerical section (Section 5). Since the reward table of the multi-stage prisoner's dilemma is the same as that of the one-shot prisoner's dilemma (which is known to be a potential game), Game 1 satisfies condition (3) in Proposition 2, which implies condition (2), which in turn implies condition (1). In the following, we use Lemma 6 to show that Game 1 is not a MPG. We define the following individual policies:
Let:
and define a path by:
For ease of calculation, we set the parameters and the initial state as in Game 1. In this example, it is not hard to verify that the value changes along this closed path do not sum to zero, i.e., $I(\mathcal{C})\ne 0$. This indicates that although Game 1 satisfies condition (3) (as well as conditions (1) and (2)), it is still not a MPG. ∎
A.2. A sufficient condition
Proposition 2 suggests that MPG is a quite restrictive assumption. Even if the reward table for a SG is the same as the reward table of a one-shot potential game, the SG may still not be a MPG. Nevertheless, we can show that the following condition is sufficient for a stochastic game to be a MPG:
Lemma 7.
A stochastic game is a MPG if condition (1) in Proposition 2 is satisfied and the state transition probability does not depend on the actions, i.e., $P(s'\mid s,a)=P(s'\mid s)$.
Proof.
The assumption that the transition probability does not depend on the actions implies that the discounted state visitation distribution does not depend on $\theta$, and thus we denote it simply as $d_\rho$. Condition (1) implies that $r_i(s,a)-\phi(s,a)$ depends only on $s$ and $a_{-i}$ but not on $a_i$, so we denote this difference as $u_i(s,a_{-i})$, i.e., $r_i(s,a)=\phi(s,a)+u_i(s,a_{-i})$.
The total reward of agent can be written as:
Similarly, total potential function can be written as:
Thus,
which does not depend on parameter , i.e.,
which completes the proof. ∎
A.3. MPG with local states and an application example
From Proposition 2, we see that it is difficult for a SG to be a MPG even if the game is a potential game at each state. Lemma 7 only covers a very special case where the actions do not affect the state, meaning that such a MPG is merely a collection of one-shot potential games. To provide a MPG that goes beyond the identical-interest case and the case in Lemma 7, inspired by the settings in [7] and [35], here we consider a special multi-agent setting where the global state decomposes into local states, $s=(s_1,\dots,s_n)$, with $\mathcal{S}_i$ the local state space of agent $i$. In addition, the transition probability takes the decomposed form $P(s'\mid s,a)=\prod_{i=1}^{n}P_i(s_i'\mid s_i,a_i)$. The rest of the SG setting is the same as the SG in Section 2. Deviating slightly from the main text, we consider localized policies where each agent takes actions based on its own local state,
with the localized direct parameterization:
We use to denote the feasibility region of , and the feasibility region of is denoted as .
Lemma 8.
If there is a function $\phi$ such that for every agent $i$, $r_i(s, a) = \phi(s, a) + u_i(s_{-i}, a_{-i})$, where $u_i$ only depends on $(s_{-i}, a_{-i})$, then this SG is an MPG, i.e., for any parameters, the equation in Definition 3 is satisfied.
The proof is straightforward given the local structure of the MDP and the localized policies. This class of MPGs admits nontrivial multi-agent applications such as medium access control [35], dynamic congestion control [69], etc. Below we present medium access control as an example.
Real application - medium access control. We consider the discretized version of the dynamic medium access control game introduced in [35], where each agent is a user that tries to transmit data via a single shared transmission medium by injecting power into the wireless network. Each user's goal is to maximize its data rate and battery lifespan. If multiple users transmit at the same time, they interfere with each other and decrease one another's data rates. Here user $i$'s state is its own battery level, which starts from its initial battery level, and we use a discharging factor to describe how the battery drains. Its action is the power injected into the network at each time step, bounded above by the maximum allowed power. The state transition is deterministic, describing the discharging of the battery in proportion to the transmission power, and is given by:
The stage reward of user is given by:
where is the random fading channel coefficient for user .
By noticing that , we can apply Lemma 8 to verify that the medium access control problem is indeed an MPG and that the potential function is given as:
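The exact discharging rule and stage-reward expression are not reproduced here, so the sketch below uses plausible stand-ins: a log-SINR data-rate term plus a hypothetical battery bonus with weight `w`, and the corresponding candidate potential. It then checks numerically that $r_i - \phi$ does not change when only user $i$'s own battery level and power change, which is exactly the structure required by Lemma 8. All formulas, constants, and names are illustrative assumptions rather than the paper's exact model.

```python
import numpy as np

rng = np.random.default_rng(2)
n, sigma2, w = 3, 1.0, 0.1            # users, noise power, hypothetical battery-bonus weight
h = rng.uniform(0.5, 1.5, size=n)     # fading channel coefficients h_i

def reward(i, s, a):
    """Assumed stage reward of user i: log data rate under interference plus w * (battery level)."""
    interference = sigma2 + sum(h[j] * a[j] for j in range(n) if j != i)
    return np.log(1.0 + h[i] * a[i] / interference) + w * s[i]

def potential(s, a):
    """Candidate potential: total received-power term plus the sum of battery terms."""
    return np.log(sigma2 + sum(h[j] * a[j] for j in range(n))) + w * sum(s)

s = rng.uniform(1.0, 5.0, size=n)     # battery levels
a = rng.uniform(0.0, 1.0, size=n)     # transmission powers
for i in range(n):
    s2, a2 = s.copy(), a.copy()
    s2[i], a2[i] = rng.uniform(1.0, 5.0), rng.uniform(0.0, 1.0)
    # r_i - phi must be unchanged when only user i's own (s_i, a_i) change (Lemma 8 structure).
    assert abs((reward(i, s, a) - potential(s, a))
               - (reward(i, s2, a2) - potential(s2, a2))) < 1e-12
```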
Appendix B Proof of Lemma 2
Proof.
Appendix C Derivation of (29) (calculation of )
From the definition of :
we have that:
Appendix D Proof of Lemma 3
Proof.
Appendix E Proof of Theorem 2 and Lemma 5
Proof.
Lemma (auxiliary). Let denote the probability simplex of dimension . Suppose and that there exists and such that:
Let
then:
Proof.
(Theorem 2) For a fixed agent and state , the gradient play update rule (5) for the policy is given by:
(33)
where denotes the probability simplex of dimension and is a -dimensional vector whose -th element equals . We will show that this update rule satisfies the conditions of the auxiliary lemma above, which will then allow us to prove that
Letting be defined as in Lemma 5, we have that:
(34)
(35)
We proceed by induction. Suppose that for , we have:
thus
Then we can further conclude that:
Additionally, for , we may conclude:
then by applying the auxiliary lemma to (33) we have:
which completes the proof. ∎
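For concreteness, the following is a minimal sketch consistent with the update rule in (33), assuming an exact per-agent gradient oracle is available; `project_simplex` is one standard sort-and-threshold implementation of the Euclidean projection recalled in Appendix L, and all names and conventions are our own.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex (sort-and-threshold method)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / np.arange(1, len(v) + 1) > 0)[0][-1]
    tau = (1.0 - css[rho]) / (rho + 1)
    return np.maximum(v + tau, 0.0)

def gradient_play_step(theta, grads, eta):
    """One synchronous gradient play step: every agent i and state s takes a projected ascent
    step on its own total reward, theta_i(.|s) <- Proj_simplex(theta_i(.|s) + eta * grad).
    theta[i][s]: policy row of agent i at state s; grads[i][s]: the matching gradient block."""
    return [np.array([project_simplex(theta[i][s] + eta * grads[i][s])
                      for s in range(theta[i].shape[0])])
            for i in range(len(theta))]
```

In the exact-gradient setting considered in Theorem 2, `grads[i][s]` stands for the partial gradient of agent $i$'s total reward with respect to its policy at state $s$.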
Appendix F Proof of Proposition 1
Proof.
First of all, from the definition of NE, the global maximum of the potential function is an NE. We now show that this global maximum is attained by a deterministic policy. From classical results (e.g. [70]) we know that there is an optimal deterministic centralized policy
that maximizes:
We now show that this centralized policy can also be represented by direct distributed policy parameterization. Set as:
then:
Since globally maximizes the discounted sum of the potential function over all centralized policies, a class that includes all policies realizable by direct distributed parameterization, it also globally maximizes the total potential function over all direct distributed parameterizations, which completes the proof. ∎
Appendix G Proof of Theorem 3
G.1 Useful Optimization Lemmas
Lemma 9.
Let be -smooth in , and define the gradient mapping:
The update rule for projected gradient is:
Then:
Proof.
By a standard property of Euclidean projections onto a convex set, we get that
∎
Lemma 10.
(Sufficient ascent) Suppose is -smooth. Let . Then for ,
Proof.
From the smoothness property we have that:
(36)
Since , we have that:
taking , we get:
Thus:
which completes the proof. ∎
Lemma 11.
(Corollary of Lemma 10) For that is smooth and bounded , running projected gradient ascent:
with step size , guarantees that:
Further, we have that:
(37)
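To illustrate how Lemmas 9-11 are used, here is a minimal sketch of projected gradient ascent that records the gradient-mapping norms whose minimum Lemma 11 controls; the step size eta = 1/(2L) is our conservative choice, and the exact step-size condition required by Lemma 10 may differ.

```python
import numpy as np

def projected_gradient_ascent(grad_f, project, x0, L, T):
    """Projected gradient ascent on an L-smooth function over a convex set.
    Records ||G_eta(x)|| = ||(x_plus - x) / eta||, the gradient-mapping norm from Lemma 9."""
    eta = 1.0 / (2.0 * L)              # assumed step size; see Lemma 10 for the exact condition
    x, gmap_norms = np.asarray(x0, dtype=float), []
    for _ in range(T):
        x_plus = project(x + eta * grad_f(x))
        gmap_norms.append(np.linalg.norm((x_plus - x) / eta))
        x = x_plus
    return x, gmap_norms
```

In this sketch, sufficient ascent in the sense of Lemma 10 limits how many iterations can have a large gradient-mapping norm, which is what drives the bound in (37).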
G.2 Proof of Theorem 3
Appendix H Proof of Theorem 4
Proof.
(of the first claim) The proof relies on Lemma 5 in Section 3, so we recommend that readers go through Lemma 5 first. The lemma immediately implies that a strict NE must be deterministic. Let be defined as in Lemma 5.
For any , Taylor expansion suggests that:
Thus for sufficiently small,
this suggests that strict NEs are strict local maxima. We now show that the converse also holds.
Strict local maxima satisfy first-order stationarity by definition, and thus by Theorem 1 they are also NEs; it remains to show that they are strict. We prove this by contradiction: suppose that there exists a local maximum that is a non-strict NE, i.e., there exists such that:
According to (10) and first-order stationarity of :
Since for all , we may conclude:
We denote . According to Lemma 2,
Since , this further implies that
i.e., is nonzero only if . Define , then
Thus let
Since as , this contradicts the assumption that is a strict local maximum. Hence all strict local maxima are strict NEs, which completes the proof. ∎
Proof.
(of the second claim) First, we define the value function, -function, and advantage function corresponding to the potential function .
For an index set , we define the averaged advantage potential function of as:
We choose an index set such that there exists such that:
(38)
and that for any other index set with smaller cardinality:
(39)
Because is not a constant, this guarantees the existence of such an index set . Further, since
and that is fully-mixed, we have that:
(40)
We set , where is a convex combination of :
According to the performance difference lemma (Lemma 2), we have:
According to (40), we know that:
thus
Applying similar procedures recursively and using the fact that:
we get:
Set as:
where are defined in (38). Then:
which completes the proof. ∎
Appendix I Bounding the gradient estimation error of Algorithm 1
The accuracy of the gradient estimate is essential for the sample-based Algorithm 1. In this section, we give a high-probability bound on the estimation error, stated in the following theorem:
Theorem 6.
The proof of the theorem consists of bounding the estimation errors in (I.1) and (I.2). Let us first introduce the definition of 'sufficient exploration', which plays an important role in this section.
In Assumption 2 in the main text, we introduced -sufficient exploration on states. In this section, we introduce a similar notion, -sufficient exploration:
Definition 6.
(-Sufficient Exploration) A stochastic game and a policy are said to satisfy the -sufficient exploration condition if there exist a positive integer and such that for policy and any initial state-action pair , we have
Note that '-sufficient exploration' is a stronger condition than '-sufficient exploration on states'. Additionally, it is not hard to verify that any stochastic game satisfying -sufficient exploration on states also satisfies the -sufficient exploration condition for any .
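Under the natural reading of Definition 6 (every state-action pair is reached at step h with probability at least some positive constant from any initial state-action pair), the condition can be checked on a small tabular instance by taking powers of the state-action transition kernel. The sketch below does this; the function name and array conventions are our own.

```python
import numpy as np

def min_h_step_visitation(P, pi, h):
    """Smallest h-step state-action visitation probability under joint policy pi.
    P[s, a, s']: transition kernel over joint actions; pi[s, a]: joint policy.
    Returns min over initial (s, a) and target (s', a') of
    Pr[(s_h, a_h) = (s', a') | (s_0, a_0) = (s, a)]; the sufficient-exploration condition
    asks for this quantity to be bounded below by a positive constant."""
    S, A, _ = P.shape
    # One-step kernel on state-action pairs: K[(s,a), (s',a')] = P[s, a, s'] * pi[s', a'].
    K = (P.reshape(S * A, S)[:, :, None] * pi[None, :, :]).reshape(S * A, S * A)
    return np.linalg.matrix_power(K, h).min()
```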
I.1 Bounding the estimation error of the averaged-Q function
We first state the main theorem in this subsection:
Theorem 7.
In the following, we introduce some lemmas that play an important role in bounding the estimation error of the averaged-Q function:
Lemma 12.
Assume that the stochastic game with policy satisfies the -sufficient exploration condition (Definition 6). Then, fixing , for ,
Proof.
Lemma 13.
Assume that the stochastic game with policy satisfies the -sufficient exploration condition (Definition 6). Then, fixing , for ,
Proof.
The proof is similar to that of Lemma 12.
Let:
Then is a martingale difference sequence. Because , it is easy to verify that . We have that:
Further,
The step from the second line to the third line of the display uses the fact that:
and the inequality in the third line follows directly from Definition 6.
According to the Azuma-Hoeffding inequality:
As in (43), we have that:
Thus
Similarly
which completes the proof. ∎
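The Azuma-Hoeffding inequality is the main concentration tool in Lemmas 12-14. The toy sketch below (our own example: i.i.d. bounded, zero-mean increments, a special case of a bounded martingale difference sequence) compares the empirical tail probability with the bound $2\exp(-\lambda^2/(2Tc^2))$ at $\lambda = \varepsilon T$.

```python
import numpy as np

rng = np.random.default_rng(0)

def azuma_check(T=100, c=1.0, eps=0.2, trials=20000):
    """Empirical vs. theoretical tail for |X_1 + ... + X_T| >= eps * T with |X_t| <= c."""
    sums = np.abs(rng.uniform(-c, c, size=(trials, T)).sum(axis=1))
    empirical = (sums >= eps * T).mean()
    bound = 2.0 * np.exp(-(eps * T) ** 2 / (2.0 * T * c ** 2))
    return empirical, bound   # the empirical frequency stays far below the (conservative) bound
```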
Corollary 1.
(44)
(45)
Proof.
We are now ready to prove Theorem 7.
I.2 Bounding the estimation error of
We first state our main result:
Theorem 8.
Similar to the previous subsection, the proof of the theorem begins by bounding the estimation error .
Lemma 14.
Under Assumption 2, fixing , for ,
Proof.
According to the definition of , we have that
Let:
Then is a martingale difference sequence. Because , it is easy to verify that . We have that:
Further,
and that
The step from the second line to the third line of the display uses the fact that:
and the inequality in the third line follows directly from Assumption 2.
According to the Azuma-Hoeffding inequality:
Similarly, from the Azuma-Hoeffding inequality,
Thus
Similarly
which completes the proof. ∎
Corollary 2.
(46)
Proof.
Proof.
I.3 Proof of Theorem 6
Proof.
Since the stochastic game satisfies -sufficient exploration on states, for any , it also satisfies -sufficient exploration. Substituting this into Theorem 7, we have that for
(47)
with probability at least ,
Similarly, applying Theorem 8, we have that with probability at least ,
Since:
Thus, with probability
∎
Appendix J Proof of Theorem 5
Notations:
We define the following variables that will be useful in the analysis:
J.1 Optimization Lemmas
Lemma 15.
(Sufficient ascent) Suppose is -smooth. Let . Then for ,
Proof.
From the smoothness property we have that:
Since , we have that:
taking , we get:
Thus:
Thus from (36):
which completes the proof. ∎
Lemma 15 immediately results in the following corollary:
Corollary 3.
Proof.
Lemma 16.
(First-order stationarity and ) Suppose is -smooth. Let . Then:
(48)
Further:
(49)
J.2 Proof of Theorem 5
Proof.
Recall that is -smooth with . The step size in Theorem 5 satisfies .
Recall the gradient domination property:
Appendix K Smoothness
Lemma 17.
(Smoothness for Direct Distributed Parameterization) Assume that , then:
(51)
where .
The proof of Lemma 17 depends on the following lemma:
Lemma 18.
(52)
Proof.
Lemma 18 is equivalent to the following lemma:
Lemma 19.
(53)
Proof.
(Lemma 19) Define:
According to the cost difference lemma,
Thus:
Part A:
(54)
(55)
(56)
(57)
(58)
(59)
(60)
where the step from (54) to (55) follows from the fact that , and the step from (56) to (57) relies on the following property of the total variation distance:
which can be immediately verified by applying the Cauchy-Schwarz inequality.
Before analyzing Part B, we first define as the state-action visitation distribution under :
Then we have that:
For an arbitrary vector :
Thus:
Similarly, we can define as the state-action visitation distribution under , and one can easily check that
Define:
Because:
which implies that every entry of is nonnegative and that ; this implies:
and similarly
Now we are ready to bound Part B. Because:
Thus,
Part B
Now let us look at Part C:
Note that are constant vectors that are independent of the choice of . Additionally:
Thus:
Part C
Additionally:
Combining the above inequalities, we get:
Summing up Parts A-C, we get:
which completes the proof. ∎
Appendix L Auxiliary
We restate the auxiliary lemma from Appendix E.
Proof.
Let . Without loss of generality, assume that and that:
Using the KKT conditions, one can derive an efficient algorithm for solving this projection problem [73], which consists of the following steps (a code sketch is given at the end of this appendix):
1. Find ;
2. Set ;
3. Set .
From the algorithm, we have that:
If ,
If ,
Thus:
which completes the proof. ∎
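For completeness, here is one standard sort-and-threshold implementation of the projection described by steps 1-3 above (in the spirit of [73]); the variable names and the exact form of the threshold are our own, but the output is the Euclidean projection onto the probability simplex.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto {x : x >= 0, sum(x) = 1}."""
    u = np.sort(v)[::-1]                                   # step 1: sort in decreasing order
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / np.arange(1, len(v) + 1) > 0)[0][-1]
    tau = (1.0 - css[rho]) / (rho + 1)                     # step 2: compute the threshold
    return np.maximum(v + tau, 0.0)                        # step 3: shift and clip

# Example: project_simplex(np.array([2.0, 0.0])) returns array([1., 0.]).
```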
Appendix M Numerical Simulation Details
Verification of the fully mixed NE in Game 2
We now verify that joining network 1 with probability , i.e.:
is indeed a NE. First, observe that
Thus,
which implies that:
i.e., satisfies first-order stationarity. Since the above holds for any valid , by Theorem 1, is an NE.
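To complement the analytical verification above, the following sketch (for a generic two-agent tabular SG; the names, array layouts, and the constancy-based test are our own) computes each agent's averaged Q-function exactly and checks first-order stationarity of a fully-mixed candidate policy: assuming every state has positive visitation, the candidate is stationary exactly when each agent's averaged Q-values are constant across its own actions at every state, and by Theorem 1 it is then an NE.

```python
import numpy as np

def averaged_q(theta, R_i, P, gamma, agent):
    """Averaged Q-function Qbar_i(s, a_i) of one agent in a 2-agent tabular SG under joint policy theta.
    theta[i][s, a_i]: direct parameterization; R_i[s, a1, a2]: stage reward; P[s, a1, a2, s']: transitions."""
    S = P.shape[0]
    pi = np.einsum('sa,sb->sab', theta[0], theta[1])
    r = np.einsum('sab,sab->s', pi, R_i)
    Ppi = np.einsum('sab,sabt->st', pi, P)
    V = np.linalg.solve(np.eye(S) - gamma * Ppi, r)          # V_i(s)
    Q = R_i + gamma * np.einsum('sabt,t->sab', P, V)         # Q_i(s, a1, a2)
    if agent == 0:
        return np.einsum('sab,sb->sa', Q, theta[1])          # average out agent 2's action
    return np.einsum('sab,sa->sb', Q, theta[0])              # average out agent 1's action

def is_first_order_stationary(theta, R, P, gamma, tol=1e-8):
    """A fully-mixed policy is first-order stationary iff, for each agent and each state,
    the averaged Q-values are equal across that agent's own actions."""
    for i in (0, 1):
        Qbar = averaged_q(theta, R[i], P, gamma, i)
        if np.max(np.abs(Qbar - Qbar.mean(axis=1, keepdims=True))) > tol:
            return False
    return True
```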