
Unsynchronized Decentralized Q-Learning: Two Timescale Analysis by Persistence

(An earlier version of this work containing preliminary results was presented at the 2023 IEEE Conference on Decision and Control.)

Bora Yongacoglu Department of Electrical and Computer Engineering, University of Toronto    Gürdal Arslan Department of Electrical Engineering, University of Hawaii at Manoa    Serdar Yüksel Department of Mathematics and Statistics, Queen’s University
Abstract

Non-stationarity is a fundamental challenge in multi-agent reinforcement learning (MARL), where agents update their behaviour as they learn. Many theoretical advances in MARL avoid the challenge of non-stationarity by coordinating the policy updates of agents in various ways, including synchronizing times at which agents are allowed to revise their policies. Synchronization enables analysis of many MARL algorithms via multi-timescale methods, but such synchronization is infeasible in many decentralized applications. In this paper, we study an unsynchronized variant of the decentralized Q-learning algorithm, a recent MARL algorithm for stochastic games. We provide sufficient conditions under which the unsynchronized algorithm drives play to equilibrium with high probability. Our solution utilizes constant learning rates in the Q-factor update, which we show to be critical for relaxing the synchronization assumptions of earlier work. Our analysis also applies to unsynchronized generalizations of a number of other algorithms from the regret testing tradition, whose performance is analyzed by multi-timescale methods that study Markov chains obtained via policy update dynamics. This work extends the applicability of the decentralized Q-learning algorithm and its relatives to settings in which parameters are selected in an independent manner, and tames non-stationarity without imposing the coordination assumptions of prior work.

Keywords: Multi-agent reinforcement learning, independent learners, learning in games, stochastic games, decentralized systems

MSC codes: 91A15, 91A26, 60J20, 93A14


1 Introduction

Multi-agent systems are characterized by the coexistence of several autonomous agents acting in a shared environment. In multi-agent reinforcement learning (MARL), agents in the system change their behaviour in response to feedback information received in previous interactions, and the system is non-stationary from any one agent’s perspective. In this non-stationary environment, agents attempt to optimize their performance against a moving target [12]. This non-stationarity has been identified as one of the fundamental technical challenges in MARL [11]. In contrast to the rich literature on single-agent learning theory, the theory of MARL is relatively underdeveloped, due in large part to the inherent challenges of non-stationarity and decentralized information.

This paper considers learning algorithms for stochastic games, a common framework for studying MARL in which the cost-relevant history of the system is summarized by a state variable. We focus on stochastic games in which each agent fully observes the system’s state variable but does not observe the actions of other agents, exacerbating the challenge of non-stationarity.

Early theoretical work on MARL in stochastic games avoided the problem of non-stationarity by studying applications in which joint actions were observed by all agents [22, 21, 13]. There has also been interest in the local action learner setting, where actions are not shared between agents. Several rigorous contributions have recently been made in this setting, including the works of [1, 6, 31, 36, 37], to be discussed shortly.

Of the recent advances in MARL for the local action learner setting, many of the algorithms with strong guarantees have circumvented the challenge of non-stationarity by relying, implicitly or explicitly, on coordination between agents. In particular, several works in the regret testing tradition rely on some form of synchronization, whereby agents agree on the times at which they may revise their behaviour and are constrained to fix their policies during the intervening periods. This synchronized approach allows for the use of two timescale analyses common to many single- and multi-agent reinforcement learning paradigms, in which the behaviour of value function estimates and the behaviour of the joint policy process can be effectively uncoupled and analyzed separately [18, 3, 15].

A variety of regret testing algorithms have been proposed for different classes of games, including [9, 10, 1] and [37]. These algorithms differ in their policy update rules, which are typically chosen to exploit some underlying structure of particular classes of games, from better/best-response dynamics in [1] to arbitrary $\epsilon$-satisficing policy update rules in [37]. Despite such differences, previous regret testing algorithms are united in their reliance on synchronization to facilitate the analysis of learning iterates and to guarantee accuracy of value function estimates. While this is justifiable in some settings, it can be restrictive in others, including applications where parameters are selected in a distributed or decentralized manner, as well as settings in which parameter coordination can be affected by communication noise. As such, it would be desirable to provide MARL algorithms that do not require synchronization but still come with rigorous performance guarantees in the local action learner setting.

This paper studies weakly acyclic stochastic games, where the definition of weak acyclicity is in terms of single-agent best-responding. In this context, our aim is to give guarantees of convergence to stationary deterministic equilibrium policies. We find that, with slight modification, the decentralized Q-learning algorithm of [1] can be made to tolerate unsynchronized policy updates, and that strong coordination assumptions about parameter choice and timing are not necessary. This constitutes a direct generalization of the algorithm and results in [1]. The key modification involves utilizing constant learning rates in the Q-function updates, replacing the decreasing learning rates used in [1] and in all previous regret testing algorithms. Unlike decreasing learning rates, which place relatively large weight on data encountered in the distant past, constant learning rates allow agents to rapidly correct their Q-factor estimates by discarding outdated information. Additionally, we retain the use of random inertia in the policy update, which we show acts as a decentralized stabilizing mechanism.

Although the technical exposition here focuses on weakly acyclic games, the analytic insights in this paper are not limited to this setting. In fact, the results of this paper can be generalized in various directions by studying different classes of games, different policy revision mechanisms, or by selecting different sets of policies as the target for convergence. We now list several other regret testing results that can be generalized to accommodate unsynchronized agents by employing constant learning rates and inertia in the policy update.

Teams and Optimality. Stochastic teams, and their generalization to common interest games, constitute an important special case of weakly acyclic games, with applications to distributed control [36]. In common interest games, one typically seeks team optimal policies, a subset of equilibrium policies that simultaneously achieve minimum cost for each player and that may not exist in more general classes of games. A modification of the decentralized Q-learning algorithm was proposed in [36], which incorporated memory of past performance as well as random search of the players' policy spaces in order to find team optimal policies. This algorithm, however, was developed in the synchronized setting, where the analysis of learning iterates is decoupled from the policy dynamics. It can be further adapted to accommodate unsynchronized policy updates by utilizing constant learning rates.

General games and $\epsilon$-equilibrium. Rather than restricting players to revise policies via inertial best-responding, one may be interested in more general revision mechanisms, to model settings where players are more exploratory in nature. If one imposes a stopping condition, such as refraining from updating when one is already $\epsilon$-best-responding, then the policy dynamics can be studied using the theory of $\epsilon$-satisficing paths presented by [37]. In [37], it was shown that a synchronized variant of decentralized Q-learning, which used random search of the policy space when not $\epsilon$-best-responding, drove policies to $\epsilon$-equilibrium in symmetric stochastic games. With appropriate modifications, a similar result can be obtained without synchronization.

General games and “cumber" sets. In general stochastic games, the set of stationary deterministic equilibrium policies can be empty. Even if this set is not empty, in general, it is not guaranteed that inertial best-response dynamics drives play to equilibrium. The notions of cusber sets (introduced in [14]) and cumber sets ([36]) are useful for studying the dynamics of inertial best-responding in general games. Cusber sets are subsets of policies that are closed under single-agent best-responses, while cumber sets are defined as being closed under multi-agent best-responses. It was previously shown in [36] that variants of decentralized Q-learning drive play to minimal cumber sets even in general games, though play may cycle between several policies within this minimal set. The results of the present paper can be combined with the analysis of [36] to guarantee convergence of unsynchronized decentralized Q-learning to cumber sets in general games.

Contributions: In this paper, we offer a means to relax the synchronization assumptions common in the regret testing paradigm of MARL while retaining desirable convergence guarantees. Our solution combines inertial policy updating with persistent learning via non-decreasing learning rates. We provide mathematical details for a particular set-up involving the unsynchronized variant of the decentralized Q-learning algorithm of [1], a recent MARL algorithm designed for weakly acyclic stochastic games, a class of games that admits teams and potential games as important special cases. We do not impose synchronization of policy revision times, which is a common and restrictive assumption in prior work. We show that persistent learning, when combined with inertial policy updating, provides a mechanism for taming non-stationarity and mitigating the moving target problem of MARL without coordinating agents’ parameter choices ahead of play. In particular, we show that this algorithm drives policies to equilibrium with arbitrarily high probability, under appropriate parameter selection. As we discuss below in Section 3, this appears to be the first formal analysis of an unsynchronized algorithm in the regret testing tradition. We also note that the same analysis can be adapted and applied to a variety of related learning algorithms, which all impose the synchronization condition and are analyzed via multi-timescale methods. Thus, the results of this paper establish that persistent learning can address analytic challenges relating to unsynchronized algorithms in a variety of multi-agent learning environments above and beyond weakly acyclic games.

Notation: For a standard Borel space $A$, we let $\mathcal{P}(A)$ denote the set of probability measures on $A$ with its Borel $\sigma$-algebra. For standard Borel spaces $A$ and $B$, we let $\mathcal{P}(A|B)$ denote the set of transition kernels on $A$ given $B$. For a finite set $A$, we let $\mathbb{R}^{A}$ denote the set of functions from $A$ to $\mathbb{R}$. We let $\text{Unif}(A)$ denote the uniform probability distribution on a compact set $A$. We let $\mathbb{Z}_{\geq 0}$ denote the set of non-negative integers. To denote that a random variable $X$ has distribution $\mu$, we write $X\sim\mu$.

Terminology: We use the terms player and agent interchangeably. When considering a particular player and referring to other players in the same game, these other players may be variously called counterparts or counterplayers.

1.1 Organization

The remainder of the paper is organized as follows. In Section 2, we present the model of stochastic games. Related work is surveyed in Section 3. In Section 4, we present the unsynchronized decentralized Q-learning algorithm and we state our main result, Theorem 4.1, on high probability guarantees for equilibrium in weakly acyclic stochastic games. The proof of Theorem 4.1 is outlined in Section 5. The details of the proof of Theorem 4.1 are presented in Section 6. The results of a simulation study are presented in Section 7, and the final section concludes. A glossary of notation is provided in Appendix A.

2 Model

A stochastic game $\mathcal{G}$ is described by a list:

(1) $\mathcal{G}=\left(\mathcal{N},\mathbb{X},\{\mathbb{A}^{i},c^{i},\gamma^{i}:i\in\mathcal{N}\},P,\nu_{0}\right).$

The components of $\mathcal{G}$ are as follows: $\mathcal{N}=\{1,2,\dots,N\}$ is a finite set of $N$ agents. The set $\mathbb{X}$ is a finite set of system states. For each player $i\in\mathcal{N}$, $\mathbb{A}^{i}$ is $i$'s finite set of actions. We write $\mathbf{A}=\times_{i\in\mathcal{N}}\mathbb{A}^{i}$, and refer to elements of $\mathbf{A}$ as joint actions. For each player $i$, a function $c^{i}:\mathbb{X}\times\mathbf{A}\to\mathbb{R}$ determines player $i$'s stage costs, which are aggregated using a discount factor $\gamma^{i}\in[0,1)$. The initial system state has distribution $\nu_{0}\in\mathcal{P}(\mathbb{X})$, and state transitions are governed by a transition kernel $P\in\mathcal{P}(\mathbb{X}|\mathbb{X}\times\mathbf{A})$.

At time $t\in\mathbb{Z}_{\geq 0}$, the state variable is denoted by $x_{t}$, and each player $i$ selects an action $a^{i}_{t}\in\mathbb{A}^{i}$ according to its policy, to be described shortly. The joint action at time $t$ is denoted $\mathbf{a}_{t}$. Each player $i$ then incurs a cost $c^{i}_{t}:=c^{i}(x_{t},\mathbf{a}_{t})$, and the state variable evolves according to $x_{t+1}\sim P(\cdot|x_{t},\mathbf{a}_{t})$. This process is then repeated at time $t+1$, and so on.
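To make the stage dynamics concrete, the following minimal Python sketch simulates a run of the game. The tabular interfaces (`P`, `costs`, `policies`) are our own illustrative assumptions and are not part of the paper's formalism.

```python
import numpy as np

def run_stage_games(P, costs, policies, x0, T, rng=None):
    """Simulate T stage games of a finite stochastic game.

    Hypothetical interfaces (not from the paper):
      P[x][a_joint]        -> probability vector over next states
      costs[i][x][a_joint] -> stage cost of player i
      policies[i](x)       -> action of player i in state x
    States and actions are integer-indexed; a_joint is a tuple of actions.
    """
    rng = rng or np.random.default_rng()
    x = x0
    trajectory = []
    for t in range(T):
        a_joint = tuple(pi(x) for pi in policies)       # each player acts on x_t only
        stage_costs = [c[x][a_joint] for c in costs]    # c^i_t = c^i(x_t, a_t)
        x_next = rng.choice(len(P), p=P[x][a_joint])    # x_{t+1} ~ P(.|x_t, a_t)
        trajectory.append((x, a_joint, stage_costs))
        x = x_next
    return trajectory
```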

A policy for player $i$ is an action selection rule that, at each time $t\geq 0$, selects player $i$'s action $a^{i}_{t}$ in some (possibly random) manner that depends only on information locally available to player $i$ at time $t$. Using $h^{i}_{t}$ to denote the information that is locally available to player $i$ at time $t$, the following standing assumption specifies the decentralized information structure considered in this paper.

Assumption 1 (Local action learners).

For each $i\in\mathcal{N}$, player $i$'s information variables $\{h^{i}_{t}\}_{t\geq 0}$ are given by $h^{i}_{0}=x_{0}$ and

$h^{i}_{t+1}=\left(h^{i}_{t},a^{i}_{t},c^{i}(x_{t},\mathbf{a}_{t}),x_{t+1}\right),\text{ for }t\geq 0.$

Under this information structure, each player $i$ observes the complete system state without noise, and additionally observes its own actions and its own cost realizations, but it does not observe the actions of other players directly. This local action learner (LAL) information structure is common in the MARL literature, with examples such as [1, 5, 6, 25, 31, 36, 37], and is sometimes called the independent learning paradigm. (This terminology is, however, not uniform, as independent learning has also recently been used to refer to learners that update their policies in a, perhaps myopic, self-interested manner [30]. In response to the changing conventions, we recommend that learners who condition only on their local action history be called local action learners, in analogy to joint action learners.) The LAL paradigm is one of the principal alternatives to the joint action learner (JAL) paradigm, studied previously in [20] and [21] among others. The key difference is that the JAL paradigm assumes that each agent $i$ gets to observe the complete joint action profile $\mathbf{a}_{t}$ after it is played; that is, in the JAL paradigm, one assumes $h^{i}_{t+1}=(h^{i}_{t},\mathbf{a}_{t},c^{i}(x_{t},\mathbf{a}_{t}),x_{t+1})$.

Definition 2.1.

For $i\in\mathcal{N}$ and $t\geq 0$, let $H^{i}_{t}:=\mathbb{X}\times\left(\mathbb{A}^{i}\times\mathbb{R}\times\mathbb{X}\right)^{t}$. A sequence $\pi^{i}=(\pi^{i}_{t})_{t\geq 0}$ is called a policy for player $i$ if $\pi^{i}_{t}\in\mathcal{P}(\mathbb{A}^{i}|H^{i}_{t})$ for each $t\geq 0$.

We let $\Pi^{i}$ denote player $i$'s set of policies. When following a policy $\pi^{i}=(\pi^{i}_{t})_{t\geq 0}$, player $i$ selects actions according to $a^{i}_{t}\sim\pi^{i}_{t}(\cdot|h^{i}_{t})$.

In general, action selection can incorporate randomness, and players may use arbitrarily complicated, history-dependent policies. However, our analysis will focus on stationary (Markov) policies, a subset of policies that randomly select actions in a time invariant manner that conditions only on the currently observed state variable. Such a restriction entails no loss in optimality for a particular player, provided the remaining players also use stationary policies. Focusing on stationary policies is quite natural: we refer the reader to [19] for an excellent elaboration.

Definition 2.2.

For $i\in\mathcal{N}$, a policy $\pi^{i}\in\Pi^{i}$ is called stationary if the following holds for any $t,k\geq 0$ and any $\hat{h}^{i}_{t}\in H^{i}_{t},\bar{h}^{i}_{k}\in H^{i}_{k}$:

$\hat{x}_{t}=\bar{x}_{k}\Rightarrow\pi^{i}_{t}(\cdot|\hat{h}^{i}_{t})=\pi^{i}_{k}(\cdot|\bar{h}^{i}_{k}),$

where $\hat{x}_{t}$ and $\bar{x}_{k}$ denote the final components of $\hat{h}^{i}_{t}$ and $\bar{h}^{i}_{k}$, respectively.

The set of stationary policies for player $i\in\mathcal{N}$ is denoted $\Pi^{i}_{S}$, and we identify $\Pi^{i}_{S}$ with $\mathcal{P}(\mathbb{A}^{i}|\mathbb{X})$, the set of transition kernels on $\mathbb{A}^{i}$ given $\mathbb{X}$. Henceforth, unqualified reference to a policy shall be understood to mean a stationary policy, while non-stationary policies will be explicitly identified as such.

Definition 2.3.

For $i\in\mathcal{N}$ and $\xi>0$, a policy $\pi^{i}\in\Pi^{i}_{S}$ is called $\xi$-soft if $\pi^{i}(a^{i}|x)\geq\xi$ for all $(x,a^{i})\in\mathbb{X}\times\mathbb{A}^{i}$. A policy $\pi^{i}\in\Pi^{i}_{S}$ is called soft if it is $\xi$-soft for some $\xi>0$.

Definition 2.4.

A policy $\pi^{i}\in\Pi^{i}_{S}$ is called deterministic if for each $x\in\mathbb{X}$, there exists $a^{i}\in\mathbb{A}^{i}$ such that $\pi^{i}(a^{i}|x)=1$.

The set of deterministic stationary policies for player $i$ is denoted by $\Pi^{i}_{SD}$ and is identified with the set of functions from $\mathbb{X}$ to $\mathbb{A}^{i}$.

Notation: We let $\bm{\Pi}_{S}:=\times_{i\in\mathcal{N}}\Pi^{i}_{S}$ denote the set of joint policies. To isolate player $i$'s component in a particular joint policy $\bm{\pi}\in\bm{\Pi}_{S}$, we write $\bm{\pi}=(\pi^{i},\bm{\pi}^{-i})$, where $-i$ is used in the agent index to represent all agents other than $i$. Similarly, we write the joint policy set as $\bm{\Pi}_{S}=\Pi_{S}^{i}\times\bm{\Pi}_{S}^{-i}$, and so on.

For any joint policy $\bm{\pi}$ and initial distribution $\nu\in\mathcal{P}(\mathbb{X})$, there is a unique probability measure on the set of state-action trajectories $(\mathbb{X}\times\mathbf{A})^{\infty}$. We denote this measure by ${\rm Pr}^{\bm{\pi}}_{\nu}$, and let $E^{\bm{\pi}}_{\nu}$ denote its expectation. We use this to define player $i$'s value function:

$J^{i}(\bm{\pi},\nu):=E^{\bm{\pi}}_{\nu}\left[\sum_{t=0}^{\infty}(\gamma^{i})^{t}c^{i}(x_{t},\mathbf{a}_{t})\right].$

When $\nu=\delta_{s}$ is the Dirac measure at some state $s\in\mathbb{X}$, we write $J^{i}(\bm{\pi},s)$ instead of $J^{i}(\bm{\pi},\delta_{s})$. For $\bm{\pi}=(\pi^{i},\bm{\pi}^{-i})$, we will also write $J^{i}(\pi^{i},\bm{\pi}^{-i},\nu)$ to isolate the role of $\pi^{i}$.
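Since $J^{i}$ is an infinite discounted sum, in practice it can be approximated by Monte Carlo simulation with a truncated horizon. The sketch below assumes a hypothetical helper `simulate_episode` that returns player $i$'s stage costs under the fixed joint policy $\bm{\pi}$ with $x_{0}\sim\nu$.

```python
import numpy as np

def mc_value_estimate(simulate_episode, gamma_i, num_episodes=1000, horizon=200, seed=0):
    """Monte Carlo estimate of J^i(pi, nu) by truncating the discounted sum.

    `simulate_episode(rng, horizon)` is an assumed helper returning player i's
    stage costs along one trajectory of length `horizon`.  Truncation incurs
    error at most gamma_i**horizon * max|c^i| / (1 - gamma_i).
    """
    rng = np.random.default_rng(seed)
    returns = []
    for _ in range(num_episodes):
        costs = simulate_episode(rng, horizon)
        returns.append(sum((gamma_i ** t) * c for t, c in enumerate(costs)))
    return float(np.mean(returns))
```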

Definition 2.5.

Let $\epsilon\geq 0$ and $i\in\mathcal{N}$. A policy $\pi^{*i}\in\Pi^{i}_{S}$ is called an $\epsilon$-best-response to $\bm{\pi}^{-i}\in\bm{\Pi}^{-i}_{S}$ if, for every $s\in\mathbb{X}$,

$J^{i}(\pi^{*i},\bm{\pi}^{-i},s)\leq\inf_{\tilde{\pi}^{i}\in\Pi^{i}}J^{i}(\tilde{\pi}^{i},\bm{\pi}^{-i},s)+\epsilon.$

The set of $\epsilon$-best-responses to $\bm{\pi}^{-i}$ is denoted ${\rm BR}^{i}_{\epsilon}(\bm{\pi}^{-i})\subseteq\Pi^{i}_{S}$. It is well known that for any $\bm{\pi}^{-i}\in\bm{\Pi}^{-i}_{S}$, player $i$'s set of 0-best-responses ${\rm BR}^{i}_{0}(\bm{\pi}^{-i})$ is non-empty, and the infimum above is in fact attained by some policy in $\Pi^{i}_{SD}$; that is, $\Pi^{i}_{SD}\cap{\rm BR}^{i}_{0}(\bm{\pi}^{-i})\neq\varnothing$.

Definition 2.6.

Let $\epsilon\geq 0$. A joint policy $\bm{\pi}^{*}\in\bm{\Pi}_{S}$ is called an $\epsilon$-equilibrium if $\pi^{*i}\in{\rm BR}^{i}_{\epsilon}(\bm{\pi}^{*-i})$ for all $i\in\mathcal{N}$.

For $\epsilon\geq 0$, we let $\bm{\Pi}^{\epsilon\text{-eq}}_{S}\subseteq\bm{\Pi}_{S}$ denote the set of $\epsilon$-equilibrium policies. It is known that the set $\bm{\Pi}^{0\text{-eq}}_{S}$ is non-empty [7]. We also let $\bm{\Pi}^{\epsilon\text{-eq}}_{SD}\subseteq\bm{\Pi}_{SD}$ denote the subset of stationary deterministic $\epsilon$-equilibrium policies, which may be empty in general.

2.1 Weakly Acyclic Stochastic Games

We now introduce weakly acyclic games, an important subclass of games that will be the main focus of this paper.

Definition 2.7.

A sequence $\{\bm{\pi}_{k}\}_{k\geq 0}$ in $\bm{\Pi}_{SD}$ is called a strict best-response path if, for each $k\geq 0$, there is exactly one player $i\in\mathcal{N}$ with $\pi^{i}_{k+1}\neq\pi^{i}_{k}$, and this player satisfies $\pi^{i}_{k+1}\in{\rm BR}^{i}_{0}(\bm{\pi}^{-i}_{k})$.

Definition 2.8.

The stochastic game $\mathcal{G}$ is weakly acyclic if (i) $\bm{\Pi}^{0\text{-eq}}_{SD}\neq\varnothing$, and (ii) for any $\bm{\pi}_{0}\in\bm{\Pi}_{SD}$, there is a strict best-response path from $\bm{\pi}_{0}$ to some $\bm{\pi}^{*}\in\bm{\Pi}^{0\text{-eq}}_{SD}$.

The multi-state formulation above was stated in [1], though weakly acyclic games had previously been studied in stateless games [38]. An important special case is that of stochastic teams, where $c^{i}=c^{j}$ and $\gamma^{i}=\gamma^{j}$ for any players $i,j\in\mathcal{N}$, and the interests of all agents are perfectly aligned. Markov potential games with finitely many states constitute another special case of weakly acyclic games [28, 16, 39].

2.2 Q-Functions in Stochastic Games

In the stochastic game $\mathcal{G}$, when player $i$'s counterplayers use a stationary policy $\bm{\pi}^{-i}\in\bm{\Pi}^{-i}_{S}$, player $i$ faces an environment that is equivalent to a single-agent MDP. The MDP in question depends on the policy $\bm{\pi}^{-i}$ as well as the game $\mathcal{G}$, and (stationary Markov) optimal policies for this MDP are equivalent to 0-best-responses to $\bm{\pi}^{-i}$ in the game $\mathcal{G}$.

Player $i$'s best-responses to a policy $\bm{\pi}^{-i}\in\bm{\Pi}^{-i}_{S}$ can be characterized using an appropriately defined Q-function, $Q^{*i}_{\bm{\pi}^{-i}}:\mathbb{X}\times\mathbb{A}^{i}\to\mathbb{R}$. (We use the terms Q-function and (state-)action-value function interchangeably.) The function $Q^{*i}_{\bm{\pi}^{-i}}$ can be defined by a fixed point equation of a Bellman operator, but here we give an equivalent definition in terms of the optimal policy of the corresponding MDP:

(2) $Q^{*i}_{\bm{\pi}^{-i}}(x,a^{i}):=E^{\bm{\pi}^{*}}_{\nu}\left[\sum_{t=0}^{\infty}(\gamma^{i})^{t}c^{i}(x_{t},\mathbf{a}_{t})\,\middle|\,x_{0}=x,\,a^{i}_{0}=a^{i}\right],$

for all $(x,a^{i})\in\mathbb{X}\times\mathbb{A}^{i}$, where $\bm{\pi}^{*}=(\pi^{*i},\bm{\pi}^{-i})$ and $\pi^{*i}\in{\rm BR}^{i}_{0}(\bm{\pi}^{-i})\cap\Pi^{i}_{SD}$.

Definition 2.9.

For $Q^{i}:\mathbb{X}\times\mathbb{A}^{i}\to\mathbb{R}$ and $\epsilon\geq 0$, we define

$\widehat{{\rm BR}}^{i}_{\epsilon}(Q^{i}):=\left\{\pi^{*i}\in\Pi^{i}_{SD}:Q^{i}(x,\pi^{*i}(x))\leq\min_{a^{i}\in\mathbb{A}^{i}}Q^{i}(x,a^{i})+\epsilon,\ \forall x\in\mathbb{X}\right\}.$

The set $\widehat{{\rm BR}}^{i}_{\epsilon}(Q^{i})$ is the set of stationary deterministic policies that are $\epsilon$-greedy with respect to $Q^{i}$. The function $Q^{i}$ plays the role of an action-value function, and for $Q^{i}=Q^{*i}_{\bm{\pi}^{-i}}$, we have $\widehat{{\rm BR}}^{i}_{0}(Q^{*i}_{\bm{\pi}^{-i}})={\rm BR}^{i}_{0}(\bm{\pi}^{-i})\cap\Pi^{i}_{SD}$.
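Because $\widehat{{\rm BR}}^{i}_{\epsilon}(Q^{i})$ is the Cartesian product of per-state admissible action sets, it can be enumerated directly from a Q-table. The sketch below assumes a NumPy array layout for $Q^{i}$, which is our own illustrative choice.

```python
import itertools
import numpy as np

def epsilon_greedy_policy_set(Q, eps):
    """Enumerate the deterministic policies that are eps-greedy w.r.t. Q.

    Q is assumed to be an array of shape (num_states, num_actions) holding
    player i's Q-factors; a deterministic policy is a tuple assigning one
    action to each state.  Returns the set BR-hat_eps(Q) as a list of tuples.
    """
    num_states, _ = Q.shape
    # For each state, the actions whose Q-factor is within eps of the minimum.
    admissible = [np.flatnonzero(Q[x] <= Q[x].min() + eps) for x in range(num_states)]
    # BR-hat_eps(Q) is the Cartesian product of the per-state admissible sets.
    return [tuple(int(a) for a in choice) for choice in itertools.product(*admissible)]

# Example: two states, three actions, eps = 0.5.
Q_example = np.array([[1.0, 1.3, 2.0],
                      [0.0, 0.4, 0.6]])
print(epsilon_greedy_policy_set(Q_example, eps=0.5))  # 2 x 2 = 4 policies
```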

When the remaining players follow a stationary policy, player ii can use Q-learning to estimate its action-values, which can then be used to estimate a 0-best-response policy. In the following sections, we observe that the situation is more complicated when the remaining players employ learning algorithms that adapt to feedback and revise their action selection probabilities over time.

3 Related Work

In this paper, we are interested in MARL for stochastic games in the local action learner setting of Assumption 1. In this framework, players do not have knowledge of their cost functions or of the transition kernel $P$, and perhaps do not even know the existence or nature of other players in the system. We focus on the online setting, where players do not have access to a generative model for sampling state transitions or costs, and instead must rely on sequential interactions with the multi-agent system in order to gather data.

At a high level, our objective is to study and design learning algorithms that drive play of the game to stable and individually rational joint policies. Our objectives are closely aligned with the desiderata for MARL algorithms outlined in [4], namely convergence and rationality. In particular, we wish to establish that joint policies converge to some stationary ($\epsilon$-)equilibrium policy.

A number of MARL algorithms have been proposed with the same high-level objectives as ours. These algorithms differ from one another in how and when incoming feedback data is processed, and in how and when the agent's policy is accordingly modified. In this section, we survey some related work that is especially relevant to our chosen setting of MARL in stochastic games with local action learners.

3.1 Q-Learning and its Challenges

In decentralized MARL settings, where agents have minimal information about strategic interactions with other players, one natural and commonly studied approach to algorithm design is to treat each agent's game environment as a separate single-agent RL environment. Indeed, when the remaining players use a stationary policy $\bm{\pi}^{-i}\in\bm{\Pi}^{-i}_{S}$, then player $i$ truly does face a strategic environment equivalent to an MDP, and can approach its learning task with single-agent Q-learning.

Picking some $\widehat{Q}^{i}_{0}\in\mathbb{R}^{\mathbb{X}\times\mathbb{A}^{i}}$ to serve as an initial estimate of its optimal Q-function, player $i$ would iteratively apply the following update rule:

(3) $\widehat{Q}^{i}_{t+1}(x_{t},a^{i}_{t})=(1-\alpha_{t})\widehat{Q}^{i}_{t}(x_{t},a^{i}_{t})+\alpha_{t}\left[c^{i}(x_{t},\mathbf{a}_{t})+\gamma^{i}\min_{a^{i}}\widehat{Q}^{i}_{t}(x_{t+1},a^{i})\right],$

where $\alpha_{t}=\alpha_{t}(x_{t},a^{i}_{t})$ is a random (that is, history-dependent) learning rate that depends on the state-action pair $(x_{t},a^{i}_{t})$, and $\widehat{Q}^{i}_{t+1}(x,a)=\widehat{Q}^{i}_{t}(x,a)$ for all $(x,a)\neq(x_{t},a^{i}_{t})$.

If the learning rate $\alpha_{t}(x_{t},a^{i}_{t})$ is decreased at an appropriate rate in the number of visits to the state-action pair $(x_{t},a^{i}_{t})$, and if every state-action pair is visited infinitely often, then one has $\widehat{Q}^{i}_{t}\to Q^{*i}_{\bm{\pi}^{-i}}$ almost surely [35]. We recall that $Q^{*i}_{\bm{\pi}^{-i}}$ was defined in Section 2 as the optimal Q-function for player $i$ when playing against the stationary policy $\bm{\pi}^{-i}$.
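For concreteness, a single application of update (3) might look as follows in code. The step size $1/n(x,a)$, with $n$ the visit count, is only one of many admissible decreasing schedules, and the array interfaces are our own assumptions.

```python
import numpy as np

def q_learning_update(Q, visits, x, a_i, cost, x_next, gamma_i):
    """One application of update (3) with a visit-count-based learning rate.

    Q and visits are (num_states, num_actions) arrays for player i; the
    step-size schedule 1 / n(x, a) is our own illustrative choice among the
    schedules satisfying the usual stochastic approximation conditions.
    """
    visits[x, a_i] += 1
    alpha_t = 1.0 / visits[x, a_i]                     # decreasing in the visit count
    target = cost + gamma_i * Q[x_next].min()          # c^i_t + gamma^i * min_a Q(x_{t+1}, a)
    Q[x, a_i] = (1.0 - alpha_t) * Q[x, a_i] + alpha_t * target
    return Q, visits
```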

In order to asymptotically respond (nearly) optimally to the policy of its counterparts, player ii must exploit the information carried by its learning iterates. Since one must balance this exploitation with persistent exploration, a standard approach in algorithm design is to periodically revise an agent’s policy—either after every system interaction or less frequently—in a manner that places higher probability on actions with better action values. A widely used device for balancing exploration and exploitation is Boltzmann action selection, wherein

${\rm Pr}\left(a^{i}_{t}=a^{i}\,\middle|\,h^{i}_{t}\right)=\frac{\exp\left(-\widehat{Q}^{i}_{t}(x_{t},a^{i})/\tau_{t}\right)}{\sum_{\bar{a}^{i}\in\mathbb{A}^{i}}\exp\left(-\widehat{Q}^{i}_{t}(x_{t},\bar{a}^{i})/\tau_{t}\right)},\text{ for }t\geq 0,$

where $\tau_{t}>0$ is a temperature parameter that is typically taken to decrease to 0 as $t\to\infty$, and player $i$'s Q-factor estimates $\widehat{Q}^{i}_{t}$ are constructed from the local history $h^{i}_{t}$ observed by player $i$.
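A numerically stable implementation of this action selection rule is sketched below; the function signature is our own, and the max-shift is a standard implementation detail that leaves the distribution unchanged.

```python
import numpy as np

def boltzmann_action(Q_row, tau, rng=None):
    """Sample an action from the Boltzmann distribution over player i's Q-factors.

    Q_row holds Q-hat^i_t(x_t, .) for the current state and tau > 0 is the
    temperature.  Costs are being minimized, hence the minus sign, matching
    the displayed formula.
    """
    rng = rng or np.random.default_rng()
    logits = -np.asarray(Q_row, dtype=float) / tau
    logits -= logits.max()                 # stabilize exp without changing the distribution
    probs = np.exp(logits)
    probs /= probs.sum()
    return int(rng.choice(len(Q_row), p=probs))
```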

Boltzmann action selection is not the only feasible alternative for balancing the trade-off between exploration and exploitation in this context; several other greedy in the limit with infinite exploration (GLIE) mechanisms may also be used to the same effect [33].

In the preceding equations, one observes that ${\rm Pr}(a^{i}_{t}=\cdot\,|h^{i}_{t})$ is simply an alternative representation of the policy actually followed by player $i$. Since player $i$'s Q-factor estimates and other relevant parameters (e.g. the temperature sequence) change with time, any adaptive learning algorithm that exploits learned iterates will be non-stationary. If other players use analogous learning algorithms and exploit their own learned information, then they too will follow non-stationary policies; in symbols, $\bm{\pi}^{-i}\notin\bm{\Pi}^{-i}_{S}$. In this case, the strategic environment defined by the non-stationary policy $\bm{\pi}^{-i}$ is not generally equivalent to an MDP, and there is no guarantee that the Q-factor estimates given by (3) will converge to some meaningful quantity, if they converge at all.

Non-convergence of the Q-iterates obtained by single-agent Q-learning has been proven even in small, stateless games [17], and similar non-convergence had previously been observed empirically in stochastic games [4, 34]. Thus, some modification is needed in order to employ algorithms based on Q-learning in multi-agent settings. Modifications may involve changes to how and when Q-factor updates are performed, or changes to how and when action selection probabilities are updated.

3.2 Algorithms for Local Action Learners

As described in the previous section, myopic algorithms that repeatedly change an agent's policy during the course of interaction can be effective in single-agent settings, when an agent faces a stationary environment. By contrast, when all agents employ such algorithms in a multi-agent system, each agent faces a non-stationary environment and the same convergence guarantees need not apply. In addition to the previously described non-convergence of Q-factors, provable non-convergence of policies under particular learning algorithms has also been observed in various settings [8, 26, 27].

With the understanding that some modification of single-agent RL algorithms is needed to address the challenge of non-stationarity in MARL, several multi-agent algorithms have been proposed that use multiple timescales. The key feature of the multi-timescale approach is that certain iterates—whether policy parameters, Q-factor estimates, or otherwise—are updated slowly relative to other iterates. In effect, this allows one to analyze the rapidly changing iterates as though the slowly changing iterates were held fixed, artificially mimicking stationarity.

Some works have studied MARL algorithms in which a subset of agents update their policies slowly while the remaining agents update their policy parameters quickly. A two-timescale policy gradient algorithm was studied for two-player, zero-sum games in [6], where a performance guarantee was proven for the minimizing player. Other works, including [29], have empirically studied MARL algorithms in which the agents alternate between updating their policies slowly and quickly according to some commonly accepted schedule.

A second type of multi-timescale algorithm updates various policy-relevant parameters quickly relative to value function estimates, which are updated relatively slowly. This approach can be found in [30, 31] and [32]. This can be contrasted with a third type of multi-timescale algorithm, in which policies are updated relatively slowly while value functions are updated relatively quickly. For a recent example of this, see [23].

A further class of algorithms that retains the same essential quality of multiple timescale algorithms is the regret testing approach, outlined in the next subsection.

3.3 Regret Testers

In the MARL algorithms surveyed above, the agents update all relevant parameters after each system interaction, and the multi-timescale analysis is implemented by controlling the magnitude of these every-step updates via step-size constraints. An alternative approach is to update certain iterates after each system interaction while updating other iterates infrequently, with many system interactions between updates. Such is the approach taken by works in the tradition of regret testing, pioneered by Foster and Young in [9], which presented an algorithm for stateless games. This approach was later studied by [10] and [24] among others in the context of stateless repeated games, where impressive convergence guarantees were proven. The strong results of [10] and [24] in stateless games were enabled to a significant extent by the absence of state dynamics, which would otherwise complicate value estimation and policy evaluation in multi-state games.

In [1], the regret testing paradigm of Foster and Young was modified for multi-state stochastic games, where one must account for both the immediate cost of an action and also the cost-to-go, which depends on the evolution of the state variable. The decentralized Q-learning algorithm of [1] instructs agents to agree on an increasing sequence of policy update times, $(t_{k})_{k\geq 0}$, and to fix their policies within the intervals $[t_{k},t_{k+1})$, called the $k^{th}$ exploration phase. (Although the terminology of exploration phases first appeared in [1], the exploration phase approach was pioneered by [9].) In so doing, the joint policy process is fixed over each exploration phase, and within each exploration phase each agent faces a Markov decision problem (MDP). Each agent then runs a standard Q-learning algorithm within the $k^{th}$ exploration phase, and updates its policy in light of its Q-factors at the end of that phase. Since each agent faces an MDP, one can analyze the learning iterates using single-agent learning theory. Provided the exploration phase lengths $T_{k}:=t_{k+1}-t_{k}$ are sufficiently long, the probability of correctly recovering one's Q-function (defined with respect to the policies of the other agents and elaborated in the coming sections) can be made arbitrarily close to 1.

In many of the papers on regret testers, the proofs of convergence to equilibrium share the same core elements, which can be summarized as follows. First, one defines an equilibrium event. (In the fully synchronized regime described above, this is usually the event that the joint policy of the $k^{th}$ exploration phase is an equilibrium.) Second, one uniformly lower bounds the probability of transiting from non-equilibrium to equilibrium in a fixed number of steps. Third, one shows that the probability of remaining at equilibrium, once at equilibrium, can be made arbitrarily large relative to the aforementioned lower bound. Fourth and finally, one obtains a high probability guarantee of equilibrium by analyzing the ratio of entry and exit probabilities for equilibrium policies (see, for instance, [9, Lemma 2]).

In effect, the exploration phase technique decouples learning from adaptation, and allows for separate analysis of learning iterates and policy dynamics. This permits approximation arguments in which the dynamics of the policy process resemble those of an idealized process where players obtain noise-free learning iterates for use in their policy updates. This has led to a series of theoretical contributions in MARL that make use of the exploration phase technique, including [36] and [37].

One natural criticism of the exploration phase technique described above is its reliance on synchronization of policy updating. In the description above, agents agree on the policy update times $\{t_{k}\}_{k\geq 0}$ exactly, and no agent ever updates its policy in the interval $(t_{k},t_{k+1}-1]$. Indeed, the mathematically convenient assumption of synchronization is made in various works in the regret testing tradition, including [9, 10] and [24]. This can be justified in some settings, but it is demanding in decentralized settings where parameters are selected locally by each player, and in distributed settings where parameter choices are communicated over noisy channels and are likely to differ across agents.

To relax the synchronization assumption of regret testers, one allows each player $i$ to select its own exploration phase times $(t^{i}_{k})_{k\geq 0}$. Player $i$ then has the opportunity to revise its policy at each time in $(t^{i}_{k})_{k\geq 0}$. We observe that an update time for player $i$ may fall within an exploration phase of some other player $j$; that is, $t^{j}_{m}<t^{i}_{k}<t^{j}_{m+1}$ for two players $i,j$ and exploration phase indices $k$ and $m$. In this case, the environment faced by player $j$ during its $m^{th}$ exploration phase is non-stationary, having been disrupted (at least) by player $i$'s changing policy.

Intuitively, unsynchronized policy updating may be problematic for regret testers because the action-value estimates of a given player depend on historical data from that player’s most recent exploration phase, which in turn depends on the (time-varying) joint policies used during this period. As such, if other players change their policies during an individual’s exploration phase, the individual receives feedback from different sources, and its learning iterates may not approximate any quantity relevant to the prevailing environment at the time of the individual’s policy update. These changes of policy during an exploration phase constitute potential disruptions of a player’s learning, and analysis of the overall joint policy process is difficult when players do not reliably learn accurate action-values.

In this unsynchronized regime, learning and adaptation are not decoupled, as agents do not face stationary MDPs during each of their exploration phases. In particular, it is not guaranteed that a given player’s Q-factors will converge. Consequently, it is not clear that the second and (especially) the third steps of the standard line of proof described above can be reproduced in the unsynchronized setting.

In [24], a heuristic argument suggested that the use of inertia in policy updating may allow one to relax the assumption of perfect synchronization in regret testing algorithms for stateless repeated games. This argument offers a promising starting point, but also has two potential gaps. The first unresolved aspect pertains to the ratio of entry and exit probabilities. The heuristic argument correctly states that one can use inertia to both (i) lower bound the probability of transiting to equilibrium from out of equilibrium, and (ii) make the probability of remaining at equilibrium arbitrarily close to 1. However, the analysis of the overall joint policy process depends on the ratio of these entry and exit probabilities, which depends on the choice of inertia parameter and may be somewhat challenging to analyze.

A second unresolved aspect has to do with the learning routine and how quickly errors are corrected. Here, the specific learning rate used in each agent’s Q-learning algorithm plays an important role. Existing regret testing algorithms employ a decreasing learning rate, so as to ensure the convergence of learning iterates within a synchronized exploration phase. In the unsynchronized regime, however, agents learn against time-varying environments. If an agent uses a decreasing learning rate, it gives relatively high weight to outdated data that are no longer representative of its current environment. As such, it may take a long, uninterrupted stretch of learning against the prevailing environment to correct the outdated learning. The time required will depend on how much the learning rate parameters have been reduced, and these considerations must be quantified explicitly in the analysis. This point is somewhat subtle, and may pose a challenge if one attempts to use decreasing learning rates in the unsynchronized regime.

In this paper, we formalize the heuristic argument of [24] and show that it is effectively correct, with minor caveats. To rectify the unaddressed aspects mentioned above, we modify the Q-learning protocol to use a constant learning rate and conduct a somewhat involved analysis of the resulting behaviour. We focus on weakly acyclic games and best-response-based policy updating routines, but the analysis from this setting can be used to study other policy update routines and classes of games, including $\epsilon$-satisficing policy update dynamics such as those described in [37].

4 Unsynchronized Decentralized Q-Learning

In this section, we present Algorithm 1, an unsynchronized variant of the decentralized Q-learning algorithm of [1]. Unlike the original decentralized Q-learning algorithm, Algorithm 1 allows the sequence of exploration phase lengths $\{T^{i}_{k}\}_{k\geq 0}$ to vary by agent. Although this change introduces non-stationarity and thus complicates the analysis, the algorithm itself is no more intricate than that of [1].

Algorithm 1: Unsynchronized Decentralized Q-Learning (from the perspective of player $i$)

1  Set Parameters
2      $\{T^{i}_{k}\}_{k\geq 0}$: player $i$'s sequence in $\mathbb{N}$ of learning phase lengths
3      Put $t^{i}_{0}=0$ and $t^{i}_{k+1}=t^{i}_{k}+T^{i}_{k}$ for all $k\geq 0$
4      $\rho^{i}\in(0,1)$: experimentation probability
5      $\lambda^{i}\in(0,1)$: inertia during policy update
6      $\delta^{i}\in(0,\infty)$: tolerance level for suboptimality
7      $\alpha^{i}\in(0,1)$: step-size parameter (also called the learning rate)
8  Initialize $\pi_{0}^{i}\in\Pi^{i}_{SD}$, $\widehat{Q}_{0}^{i}\in\mathbb{R}^{\mathbb{X}\times\mathbb{A}^{i}}$ (arbitrary)
9  for $k\geq 0$ ($k^{th}$ exploration phase for agent $i$) do
10     for $t=t_{k}^{i},t_{k}^{i}+1,\dots,t^{i}_{k+1}-1$ do
11         Observe $x_{t}$
12         Select $a^{i}_{t}=\begin{cases}\pi^{i}_{k}(x_{t}),&\text{w.p. }1-\rho^{i}\\ \tilde{u}^{i}_{t}\sim\text{Unif}(\mathbb{A}^{i}),&\text{w.p. }\rho^{i}\end{cases}$
13         Observe cost $c^{i}_{t}:=c^{i}(x_{t},\mathbf{a}_{t})$ and state $x_{t+1}$
14         Put $\Delta^{i}_{t}=c^{i}_{t}+\gamma^{i}\min_{a^{i}}\widehat{Q}^{i}_{t}(x_{t+1},a^{i})$
15         $\widehat{Q}^{i}_{t+1}(x_{t},a^{i}_{t})=(1-\alpha^{i})\widehat{Q}_{t}^{i}(x_{t},a_{t}^{i})+\alpha^{i}\Delta^{i}_{t}$
16         $\widehat{Q}_{t+1}^{i}(x,a^{i})=\widehat{Q}^{i}_{t}(x,a^{i})$ for all $(x,a^{i})\neq(x_{t},a_{t}^{i})$
17     end for
18     if $\pi^{i}_{k}\in\widehat{{\rm BR}}^{i}_{\delta^{i}}(\widehat{Q}^{i}_{t^{i}_{k+1}})$ then $\pi^{i}_{k+1}\leftarrow\pi^{i}_{k}$
19     else
20         $\pi^{i}_{k+1}\leftarrow\begin{cases}\pi^{i}_{k},&\text{w.p. }\lambda^{i}\\ \tilde{\pi}^{i}_{k}\sim\text{Unif}\left(\widehat{{\rm BR}}^{i}_{\delta^{i}}(\widehat{Q}^{i}_{t^{i}_{k+1}})\right),&\text{w.p. }1-\lambda^{i}\end{cases}$
21 end for
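For illustration, the following self-contained Python sketch implements one player's routine from Algorithm 1 against an opaque environment. The interface `env_step`, which hides the other players' (possibly time-varying) behaviour and the transition kernel, and the tabular representations are our own assumptions.

```python
import numpy as np

def unsynchronized_decentralized_q(env_step, num_states, num_actions, phase_lengths,
                                   rho, lam, delta, alpha, gamma, x0, seed=0):
    """Sketch of Algorithm 1 for one player i; all interfaces are our own assumptions.

    env_step(x, a_i) -> (cost_i, x_next) hides the other players' (possibly
    time-varying) behaviour.  phase_lengths plays the role of {T^i_k}; rho,
    lam, delta, alpha, gamma are player i's experimentation, inertia,
    tolerance, step-size, and discount parameters.
    """
    rng = np.random.default_rng(seed)
    Q = np.zeros((num_states, num_actions))              # Q-hat^i_0 (arbitrary)
    policy = rng.integers(num_actions, size=num_states)  # pi^i_0 (arbitrary deterministic)
    x = x0
    for T_k in phase_lengths:                            # k-th exploration phase
        for _ in range(T_k):
            # Line 12: baseline action w.p. 1 - rho, uniform exploration w.p. rho.
            a = int(policy[x]) if rng.random() >= rho else int(rng.integers(num_actions))
            cost, x_next = env_step(x, a)
            # Lines 14-15: constant step-size Q-factor update.
            Q[x, a] = (1 - alpha) * Q[x, a] + alpha * (cost + gamma * Q[x_next].min())
            x = x_next
        # Per-state actions within delta of optimal; their product is BR-hat^i_delta(Q).
        admissible = [np.flatnonzero(Q[s] <= Q[s].min() + delta) for s in range(num_states)]
        if not all(policy[s] in admissible[s] for s in range(num_states)):
            if rng.random() >= lam:                      # Line 20: switch w.p. 1 - lambda
                # A uniform draw over a Cartesian product factorizes across states.
                policy = np.array([rng.choice(acts) for acts in admissible])
    return policy, Q
```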

4.1 Assumptions

In order to state our main result, Theorem 4.1, we now impose some assumptions on the underlying game 𝒢\mathcal{G}, on the algorithmic parameters, and on the mutual independence of certain randomized subroutines appearing in the algorithm.

4.1.1 Assumption on the transition kernel

Assumption 2.

For any pair $(s,s^{\prime})\in\mathbb{X}\times\mathbb{X}$, there exist $H=H(s,s^{\prime})\in\mathbb{N}$ and a sequence of joint actions $\mathbf{a}^{\prime}_{0},\dots,\mathbf{a}^{\prime}_{H}\in\mathbf{A}$ such that

${\rm Pr}(x_{H+1}=s^{\prime}|x_{0}=s,\mathbf{a}_{0}=\mathbf{a}^{\prime}_{0},\dots,\mathbf{a}_{H}=\mathbf{a}^{\prime}_{H})>0.$

Assumption 2 requires that the state process can be driven from any initial state to any other state in finitely many steps, provided a suitable selection of joint actions is made. This is a rather weak assumption on the underlying transition kernel $P$, and it is quite standard in the theory of MARL (cf. [30, Assumption 4.1, Case iv]).

4.1.2 Assumptions on algorithmic parameters

Our next assumption restricts the hyperparameter selections in Algorithm 1. Let $\bar{\delta}:=\min\left(\mathfrak{A}\setminus\{0\}\right)$, where

$\mathfrak{A}:=\left\{\big|Q^{*i}_{\bm{\pi}^{-i}}(s,a^{i}_{1})-Q^{*i}_{\bm{\pi}^{-i}}(s,a^{i}_{2})\big|:i\in\mathcal{N},\ \bm{\pi}^{-i}\in\bm{\Pi}^{-i}_{SD},\ s\in\mathbb{X},\ a^{i}_{1},a^{i}_{2}\in\mathbb{A}^{i}\right\}.$

The quantity $\bar{\delta}$, defined originally in [1] and recalled above, is the minimum non-zero separation between two optimal Q-factors at the same state, taken over all agents $i\in\mathcal{N}$ and all policies $\bm{\pi}^{-i}\in\bm{\Pi}^{-i}_{SD}$.

For any baseline policy $\bm{\pi}\in\bm{\Pi}_{SD}$ and fixed experimentation parameters $\{\rho^{i}\}_{i\in\mathcal{N}}$, we use the notation $\hat{\bm{\pi}}\in\bm{\Pi}_{S}$ to denote the corresponding behaviour policy, which is stationary but not deterministic. When using $\hat{\pi}^{i}$, agent $i\in\mathcal{N}$ follows $\pi^{i}$ with probability $1-\rho^{i}$ and mixes uniformly over $\mathbb{A}^{i}$ with probability $\rho^{i}$. It was previously shown in [1, Lemma B3] that the optimal Q-factors for these two environments will be close provided $\rho^{i}$ is sufficiently small for all $i\in\mathcal{N}$. In particular, there exists $\bar{\rho}>0$ such that if $\rho^{i}\in(0,\bar{\rho})$ for all $i\in\mathcal{N}$, then

(4) $\|Q^{*j}_{\bm{\pi}^{-j}}-Q^{*j}_{\hat{\bm{\pi}}^{-j}}\|_{\infty}<\frac{\min_{i\in\mathcal{N}}\min\{\delta^{i},\bar{\delta}-\delta^{i}\}}{4},\quad\forall j\in\mathcal{N},\ \bm{\pi}^{-j}\in\bm{\Pi}^{-j}_{SD}.$
Assumption 3.

For all $i\in\mathcal{N}$, $\delta^{i}\in(0,\bar{\delta})$ and $\rho^{i}\in(0,\bar{\rho})$.

In the first part of Assumption 3, we require that $\delta^{i}$, player $i$'s tolerance for suboptimality in its Q-factors, be positive to account for noise in its learning iterates, as exact convergence is unreasonably demanding during a truncated learning phase. On the other hand, the tolerance parameter $\delta^{i}$ is upper bounded so as to exclude truly suboptimal policies from consideration in an estimated best-response set. The second part of Assumption 3 concerns the experimentation parameter $\rho^{i}$. We assume that $\rho^{i}>0$ so that player $i$ can explore all of its actions and accurately learn its Q-factors in the limit. This experimentation parameter must be kept small, however, so that player $i$'s behaviour policy closely approximates its baseline policy. Assumption 3 appears as Assumption 2 in [1].
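To make the behaviour policy concrete, the following small sketch computes the action distribution $\hat{\pi}^{i}(\cdot|x)$ induced by a baseline action and an experimentation probability $\rho^{i}$; the vector representation is our own.

```python
import numpy as np

def behaviour_policy(baseline_action, num_actions, rho):
    """Return the mixed action distribution pi-hat^i(.|x) induced by experimentation.

    baseline_action is pi^i(x) for the current state; the behaviour policy puts
    probability (1 - rho) + rho/|A^i| on it and rho/|A^i| on every other action.
    """
    probs = np.full(num_actions, rho / num_actions)
    probs[baseline_action] += 1.0 - rho
    return probs

# Example: 4 actions, baseline action 2, rho = 0.1.
print(behaviour_policy(2, 4, 0.1))   # [0.025, 0.025, 0.925, 0.025]
```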

Assumption 4.

There exist integers $R,T\in\mathbb{N}$ such that $T\leq T^{i}_{k}\leq RT$ for each player $i$ and exploration phase index $k\geq 0$.

Assumption 4 requires that each agent spends a long time learning during any given exploration phase ($T^{i}_{k}\geq T$ almost surely for any player $i$ and any exploration phase $k$). The second part of Assumption 4, that $T^{i}_{k}\leq RT$ almost surely for all players $i$ and exploration phase indices $k$, is assumed in order to control the relative frequency at which different agents update their policies. This avoids the potentially problematic regime in which some agents update their policies an enormous number of times, repeatedly disrupting the learning of another agent who uses long exploration phases.

When all players use Algorithm 1, we view the resulting sequence of policies $\{\pi^{i}_{k}\}_{k\geq 0}$ as player $i$'s baseline policy process, where $\pi^{i}_{k}$ is player $i$'s baseline policy during $[t^{i}_{k},t^{i}_{k+1})$, player $i$'s $k^{th}$ exploration phase. This policy process $\{\pi^{i}_{k}\}_{k\geq 0}$ is indexed on the coarser timescale of exploration phases. For ease of reference, we also introduce a sequence of baseline policies indexed by the finer timescale of stage games. For $t\geq 0$ with $t\in[t^{i}_{k},t^{i}_{k+1})$, we let $\phi^{i}_{t}:=\pi^{i}_{k}$ denote player $i$'s baseline policy during the stage game at time $t$. The baseline joint policy at stage game $t$ is then denoted $\bm{\phi}_{t}=(\phi^{i}_{t})_{i\in\mathcal{N}}$. Furthermore, we collect the Q-factor step-size parameters $\{\alpha^{i}\}_{i\in\mathcal{N}}$ into $\bm{\alpha}\in(0,1)^{N}$.

4.1.3 Assumptions on algorithmic randomness

Algorithm 1 is a randomized algorithm: in multiple places (such as Lines 12 and 20), a player's decision depends on both the history of play and also on exogenous randomness. In this section, we describe our assumptions on how these random selections are made. In simple terms, we assume that random choices are independent across players and are conditionally independent of the history of the system. The intuition is elaborated below and then formalized in Assumption 5 using the language of primitive random variables.

Consider the action choice in Line 12 of Algorithm 1:

Select $a^{i}_{t}=\begin{cases}\pi^{i}_{k}(x_{t}),&\text{w.p. }1-\rho^{i}\\ \tilde{u}^{i}_{t}\sim\text{Unif}(\mathbb{A}^{i}),&\text{w.p. }\rho^{i}\end{cases}$

The selection of the action $a^{i}_{t}$ can be expressed as a function of the state $x_{t}$ along with the outcome of two random quantities: a uniform random variable, denoted $\tilde{\rho}^{i}_{t}\sim\text{Unif}([0,1])$, and a uniform random action choice $\tilde{u}^{i}_{t}\sim\text{Unif}(\mathbb{A}^{i})$. That is, Line 12 of Algorithm 1 can be equivalently expressed as

(5) $a^{i}_{t}=\begin{cases}\pi^{i}_{k}(x_{t}),&\text{if }\tilde{\rho}^{i}_{t}\geq\rho^{i}\\ \tilde{u}^{i}_{t},&\text{if }\tilde{\rho}^{i}_{t}<\rho^{i}.\end{cases}$

Similarly, the selection of the policy $\pi^{i}_{k+1}$ in Line 20 of Algorithm 1 can be expressed as a function of player $i$'s Q-factor estimates and the outcome of two random quantities:

  • A uniform random variable, denoted $\tilde{\lambda}^{i}_{t}\sim\text{Unif}([0,1])$.

  • A uniform random policy $\tilde{\pi}^{i}_{k}:=\tilde{\pi}^{i}_{k}(B^{i})\sim\text{Unif}(B^{i})$, where $B^{i}=\widehat{{\rm BR}}^{i}_{\delta^{i}}\left(\widehat{Q}^{i}_{t^{i}_{k+1}}\right)$.

That is, Line 20 of Algorithm 1 can be equivalently expressed as

$\pi^{i}_{k+1}=\begin{cases}\pi^{i}_{k},&\text{if }\tilde{\lambda}^{i}_{k}<\lambda^{i}\\ \tilde{\pi}^{i}_{k},&\text{else.}\end{cases}$
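The following sketch illustrates, under our own simplified interfaces, how the action selection in (5) and the policy update in Line 20 can be written as deterministic functions of pre-drawn primitive random variables, which is the viewpoint used in the analysis.

```python
import numpy as np

def draw_primitives(num_actions, policy_subsets, rng):
    """Draw one time-step's primitive random variables (our own illustration).

    policy_subsets is a list of non-empty candidate sets B^{i,1}, ..., B^{i,m_i}
    of deterministic policies; the draws do not depend on the history of play.
    """
    return {
        "rho_tilde": rng.uniform(),                                     # for action randomization
        "u_tilde": int(rng.integers(num_actions)),                      # uniform exploratory action
        "lambda_tilde": rng.uniform(),                                  # for inertial policy update
        "pi_tilde": [B[rng.integers(len(B))] for B in policy_subsets],  # Unif(B^{i,l}) for each l
    }

def select_action(x, baseline_policy, rho, prim):
    """Equation (5): follow the baseline unless the primitive draw falls below rho."""
    return baseline_policy[x] if prim["rho_tilde"] >= rho else prim["u_tilde"]

def update_policy(current_policy, subset_index, lam, prim):
    """Line 20: keep the current policy w.p. lam, else take the pre-drawn uniform sample."""
    return current_policy if prim["lambda_tilde"] < lam else prim["pi_tilde"][subset_index]
```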

In the analysis of the coming sections, it will be convenient to isolate the role of exogenous randomness and separate it from the realization of the Q-factor iterates above. It will therefore be convenient to consider analogs of the random choice $\tilde{\pi}^{i}_{k}(B^{i})$ drawn from arbitrary policy subsets $B^{i}\subseteq\Pi^{i}_{SD}$.

With the preceding intuition in mind, we now introduce several collections of primitive random variables that will be used to state Assumption 5 below. For any player $i\in\mathcal{N}$ and $t\geq 0$, we define the following random variables:

  • $\{W_{t}\}_{t\geq 0}$ is an identically distributed, $[0,1]$-valued stochastic process. For some $f:\mathbb{X}\times\mathbf{A}\times[0,1]\to\mathbb{X}$, state transitions are driven by $\{W_{t}\}_{t\geq 0}$ via $f$:

    ${\rm Pr}(x_{t+1}=s^{\prime}|x_{t}=s,\mathbf{a}_{t}=\mathbf{a})={\rm Pr}\left(W_{t}\in\{w:f(s,\mathbf{a},w)=s^{\prime}\}\right),$

    for any $(s,\mathbf{a},s^{\prime})\in\mathbb{X}\times\mathbf{A}\times\mathbb{X}$ and $t\geq 0$;

  • $\tilde{u}^{i}_{t}\sim\text{Unif}(\mathbb{A}^{i})$;

  • $\tilde{\rho}^{i}_{t}\sim\text{Unif}([0,1])$;

  • $\tilde{\lambda}^{i}_{t}\sim\text{Unif}([0,1])$;

  • For non-empty $B^{i}\subseteq\Pi^{i}_{SD}$, $\tilde{\pi}^{i}_{t}(B^{i})\sim\text{Unif}(B^{i})$.

Assumption 5.

The collection of primitive random variables $\mathcal{V}_{1}\cup\mathcal{V}_{2}$ is mutually independent, where

$\mathcal{V}_{1}:=\bigcup_{i\in\mathcal{N},\,t\geq 0}\left\{W_{t},\tilde{\rho}^{i}_{t},\tilde{u}^{i}_{t},\tilde{\lambda}^{i}_{t}\right\},\qquad\mathcal{V}_{2}:=\bigcup_{i\in\mathcal{N},\,t\geq 0}\left\{\tilde{\pi}^{i}_{t}(B^{i}):B^{i}\subseteq\Pi^{i}_{SD},\ B^{i}\neq\varnothing\right\}.$

Remark: The random variables in $\mathcal{V}_{1}\cup\mathcal{V}_{2}$ are primitive random variables that do not depend on any player's choice of hyperparameters or on the history of play of the game.

4.2 Theorem Statement

Theorem 4.1.

Let $\mathcal{G}$ be a weakly acyclic game and suppose each player $i\in\mathcal{N}$ uses Algorithm 1 to play $\mathcal{G}$. Suppose Assumptions 3, 4, and 5 hold, and let $\epsilon>0$. There exist $\bar{\alpha}_{\epsilon}>0$ and a function $\bar{T}_{\epsilon}:(0,1)^{N}\times\mathbb{N}\to\mathbb{N}$ such that if

$\max_{i\in\mathcal{N}}\alpha^{i}<\bar{\alpha}_{\epsilon},\text{ and }T\geq\bar{T}_{\epsilon}(\bm{\alpha},R),$

then ${\rm Pr}(\bm{\phi}_{t}\in\bm{\Pi}^{0\text{-eq}}_{SD})\geq 1-\epsilon$ for all sufficiently large $t\in\mathbb{N}$.

In words, Theorem 4.1 says the following: if the relative frequency of policy update times of different players is controlled by some ratio $R$, if players use sufficiently small learning rates $\bm{\alpha}$, and if the lower bound $T$ on each player's exploration phase length is sufficiently long relative to both the learning rates $\bm{\alpha}$ as well as the ratio parameter $R$, then play will be driven to equilibrium with high probability.

As we will see, the interdependence of the various parameters ($R$, $\bm{\alpha}$, and $T$) is significant in the analysis to follow.

5 Proof Overview

As we will see in Section 6, the proof of Theorem 4.1 is somewhat intricate. For this reason, we now offer an outline of the proof to facilitate its reading.

5.1 High-Level Overview

The proof of Theorem 4.1 involves three major steps. First, we introduce a sequence of equilibrium events, $\{B_{k}\}_{k\geq 0}$, defined in terms of a suitably defined sequence of time intervals $\{[\tau^{\min}_{k},\tau^{\max}_{k}]\}_{k\geq 0}$, to be elaborated in the sequel. For $k\geq 0$, we put

$B_{k}:=\left\{\bm{\phi}_{t}=\bm{\phi}_{\tau^{\min}_{k}}\in\bm{\Pi}^{0\text{-eq}}_{SD}:t=\tau^{\min}_{k}+1,\dots,\tau^{\max}_{k}\right\}.$

In words, $B_{k}$ is the event that the baseline policy did not change during the interval $[\tau^{\min}_{k},\tau^{\max}_{k}]$ and, moreover, that the baseline policy was an equilibrium during this time.

Second, for a particular integer L<L<\infty (defined in the text preceding Lemma 6.8), we argue via Lemma 6.8 that the probability of remaining at the equilibrium event for LL steps, Pr(Bk+L|Bk){\rm Pr}(B_{k+L}|B_{k}), can be made arbitrarily close to one by suitable selection of algorithm parameters. The proof of Lemma 6.8 uses convergence properties of Q-learning iterates (Lemma 6.3) and requires careful arguments about conditional probabilities: conditioning on the event BkB_{k} carries confounding information about the state-action trajectory from both before and after time τkmin\tau^{\min}_{k}, and precludes a simple analysis that disconnects the future evolution of the Q-learning estimates from the history of the system before time τkmin\tau^{\min}_{k}. Consequently, the analysis is somewhat delicate, and is worded in terms of primitive random variables, which offers a path around the confounding effect of conditioning on BkB_{k}.

Third, we argue in Lemma 6.10 that the probability of driving play to this equilibrium event in LL time steps can be lower bounded by a positive constant. After this is achieved, one can then explicitly lower bound Pr(Bk+mL){\rm Pr}(B_{k+mL}) for suitably large mm, as in [1] and [37].

6 Proof of Theorem 4.1

6.1 Additional Notation and Constructions

We begin by defining several objects that will be needed for the various lemmas to follow. The notation in this section builds on that of Section 4.1.3. For ease of reference, we have included a glossary of notation in Table 2, Table 3, and Table 4 of Appendix A.

Recall that in Section 4.1.3, we motivated the use of primitive random variables for describing the algorithmic randomness of Algorithm 1. The primitive random variables described earlier included {u~ti,ρ~ti}t0\{\tilde{u}^{i}_{t},\tilde{\rho}^{i}_{t}\}_{t\geq 0}, which appeared in randomized action selection, and {λ~ki,π~ki}k0\{\tilde{\lambda}^{i}_{k},\tilde{\pi}^{i}_{k}\}_{k\geq 0}, which appeared in randomized policy updates. We also recall that π~ki\tilde{\pi}^{i}_{k} was previously a random sample from a particular policy subset BiΠSDiB^{i}\subseteq\Pi^{i}_{SD}, which was defined using player ii’s Q-factor iterates. In this section, it will be helpful to further isolate the role of algorithmic randomness, and therefore we will also consider random samples from all non-empty subsets of ΠSDi\Pi^{i}_{SD}. For the latter purpose and for each i𝒩i\in\mathcal{N}, we arbitrarily order non-empty subsets of ΠSDi\Pi^{i}_{SD} as Bi,1,,Bi,miB^{i,1},\dots,B^{i,m_{i}}, where mi=|2ΠSDi{}|m_{i}=|2^{\Pi^{i}_{SD}}\setminus\{\varnothing\}|.

We introduce the following new quantities for each t0t\geq 0:

  • ωti:=(ρ~ti,u~ti,λ~ti,π~ti(Bi,1),,π~ti(Bi,mi)),\omega^{i}_{t}:=(\tilde{\rho}^{i}_{t},\tilde{u}^{i}_{t},\tilde{\lambda}^{i}_{t},\tilde{\pi}^{i}_{t}(B^{i,1}),\dots,\tilde{\pi}^{i}_{t}(B^{i,m_{i}})), for all i𝒩i\in\mathcal{N}. We define 𝔖i\mathfrak{S}^{i} to be the set in which the random quantity ωti\omega^{i}_{t} takes its values. That is,

    𝔖i:=[0,1]×𝔸i×[0,1]×Bi,1××Bi,mi,i𝒩.\mathfrak{S}^{i}:=[0,1]\times\mathbb{A}^{i}\times[0,1]\times B^{i,1}\times\cdots\times B^{i,m_{i}},\quad\forall i\in\mathcal{N}.
  • In plain terms, ωti\omega^{i}_{t} consists of random samples of each of the quantities needed for player ii to execute the randomized subroutines in Algorithm 1 at a given time tt.

  • 𝝎t:=(Wt,ωt1,,ωtN)\bm{\omega}_{t}:=(W_{t},\omega^{1}_{t},\dots,\omega^{N}_{t}), where {Wt}t0\{W_{t}\}_{t\geq 0} is the [0,1][0,1]-valued independent and identically distributed (i.i.d.) noise process driving state transitions that was introduced in §4.1.3.

  • For each tt, 𝝎t\bm{\omega}_{t} is a collection of the randomized quantities described in ωti\omega^{i}_{t} along with the random noise that drives state transitions. Of note, the information contained in 𝝎t\bm{\omega}_{t} is sufficient for recovering the subsequent state xt+1x_{t+1}, the current joint action 𝐚t\mathbf{a}_{t}, and the subsequent joint policy ϕt+1\bm{\phi}_{t+1}.

  • ϖt+:=(𝝎t,𝝎t+1,)\bm{\varpi}_{t+}:=(\bm{\omega}_{t},\bm{\omega}_{t+1},\dots). This quantity describes all random samples required to fully describe the future of the state-action trajectory, and thus of the Q-factor iterates.

  • Q^t:=(Q^t1,,Q^tN)\widehat{\textbf{Q}}_{t}:=(\widehat{Q}^{1}_{t},\dots,\widehat{Q}^{N}_{t}).

  • 𝐡t\mathbf{h}_{t} :=(x0,ϕ0,Q^0,,xt,ϕt,Q^t):=(x_{0},\bm{\phi}_{0},\widehat{\textbf{Q}}_{0},\dots,x_{t},\bm{\phi}_{t},\widehat{\textbf{Q}}_{t}). This quantity describes the states, joint policies, and Q-factors up to and including time tt.

  • 𝐇t\mathbf{H}_{t} :=(𝕏×𝚷SD×𝕏×𝔸1××𝕏×𝔸N)t+1:=\left(\mathbb{X}\times\bm{\Pi}_{SD}\times\mathbb{R}^{\mathbb{X}\times\mathbb{A}^{1}}\times\dots\times\mathbb{R}^{\mathbb{X}\times\mathbb{A}^{N}}\right)^{t+1}.

  • 𝐇t,eq\mathbf{H}_{t,\rm eq} :={𝐡t𝐇t:ϕt𝚷SD0-eq}:=\{\mathbf{h}_{t}\in\mathbf{H}_{t}:\bm{\phi}_{t}\in\bm{\Pi}^{0{\rm\text{-}eq}}_{SD}\}. This set consists of those tt-histories for which the final joint policy ϕt\bm{\phi}_{t} is an equilibrium.

We note that ωti\omega^{i}_{t} is a random quantity taking values in 𝔖i\mathfrak{S}^{i}, while 𝝎t\bm{\omega}_{t} is a random quantity taking values in [0,1]×𝔖1××𝔖N[0,1]\times\mathfrak{S}^{1}\times\cdots\times\mathfrak{S}^{N}. For each t0t\geq 0, a realization of the random quantity ϖt+\bm{\varpi}_{t+} is a sequence with entries in [0,1]×𝔖1××𝔖N[0,1]\times\mathfrak{S}^{1}\times\cdots\times\mathfrak{S}^{N}.

When every player uses Algorithm 1 to play the game and each player’s randomization mechanism is governed by the primitive random variables of §4.1.3, the entire sample path of play (including realizations of the state process {xt}t0\{x_{t}\}_{t\geq 0}, the action processes {ati}t0\{a^{i}_{t}\}_{t\geq 0} for each agent i𝒩i\in\mathcal{N}, the baseline joint policy process {ϕt}t0\{\bm{\phi}_{t}\}_{t\geq 0}, and the Q-factor estimates {Q^ti}t0\{\widehat{Q}^{i}_{t}\}_{t\geq 0}) is a deterministic function of any given prefix history 𝐡s𝐇s\mathbf{h}_{s}\in\mathbf{H}_{s} and its corresponding continuation ϖs+([0,1]×𝔖1××𝔖N)\bm{\varpi}_{s+}\in\left([0,1]\times\mathfrak{S}^{1}\times\cdots\times\mathfrak{S}^{N}\right)^{\infty}, where s0s\in\mathbb{Z}_{\geq 0}. With this in mind, for i𝒩i\in\mathcal{N} and t0t\geq 0, we define mappings 𝒬ti\mathcal{Q}^{i}_{t} and Φt\Phi_{t} such that

(6) Q^ti=𝒬ti(𝐡s,ϖs+), and ϕt=Φt(𝐡s,ϖs+),0st.\widehat{Q}^{i}_{t}=\mathcal{Q}^{i}_{t}(\mathbf{h}_{s},\bm{\varpi}_{s+}),\>\text{ and }\>\bm{\phi}_{t}=\Phi_{t}(\mathbf{h}_{s},\bm{\varpi}_{s+}),\quad\forall 0\leq s\leq t.

Next, for any s,t0s,t\geq 0 and any i𝒩i\in\mathcal{N}, we introduce a function Q¯t+si(𝐡t,ϖt+)\bar{Q}^{i}_{t+s}(\mathbf{h}_{t},\bm{\varpi}_{t+}) that reports the hypothetical Q-factors player ii would have obtained if the baseline policies had been frozen at time tt. That is, the history up to time tt is given by 𝐡t\mathbf{h}_{t}, the primitive random variables from tt onward are given by ϖt+\bm{\varpi}_{t+}, and we obtain the hypothetical (ϖt+\bm{\varpi}_{t+}-measurable) continuation trajectory (x¯t,𝐚¯t,,x¯t+s,𝐚¯t+s)(\bar{x}_{t},\bar{\mathbf{a}}_{t},\dots,\bar{x}_{t+s},\bar{\mathbf{a}}_{t+s}), defined by x¯t:=xt\bar{x}_{t}:=x_{t},

x¯t+m+1=f(x¯t+m,𝐚¯t+m,Wt+m), 0ms,\bar{x}_{t+m+1}=f(\bar{x}_{t+m},\bar{\mathbf{a}}_{t+m},W_{t+m}),\quad\forall\,0\leq m\leq s,

where for each player jj and time t+mtt+m\geq t,

a¯t+mj:={ϕtj(x¯t+m),if ρ~t+mj>ρju~t+mjotherwise.\bar{a}^{j}_{t+m}:=\begin{cases}\phi^{j}_{t}(\bar{x}_{t+m}),&\text{if }\tilde{\rho}^{j}_{t+m}>\rho^{j}\\ \tilde{u}^{j}_{t+m}&\text{otherwise.}\end{cases}

Note that the time index of ϕtj\phi^{j}_{t} is tt rather than t+mt+m, which reflects that, in this hypothetical continuation, the baseline policies are frozen at time tt. Each hypothetical Q-factor estimate Q¯t+si\bar{Q}^{i}_{t+s} is then (analytically) built out of this hypothetical trajectory using the regular Q-learning update with the initial condition prescribed by Q^t\widehat{\textbf{Q}}_{t}.

Remark: Hypothetical Q-factors feature prominently in the analysis of the coming sections, and are a somewhat subtle construction, so we explain their utilization here. To analyze the likelihood of driving play to equilibrium or the likelihood of remaining at equilibrium, we ultimately wish to make probabilistic statements about the realized Q-learning iterates, {Q^t}t0\{\widehat{\textbf{Q}}_{t}\}_{t\geq 0}, conditional on various events, such as the event that players did not switch their baseline policies recently. However, conditioning on such events can be problematic in the analysis, since the conditioning event clearly carries information about the state-action trajectory that may be consequential for the evolution of the Q-factor estimates. Due to this confounding effect, we will avoid making statements about conditional probabilities directly, and we will instead study trajectories of the hypothetical Q-factors. In so doing, we can describe likelihoods of relevant events in terms of the primitive random variables, thereby avoiding the confounding effects of conditioning.
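To make the construction of the hypothetical Q-factors concrete, here is a minimal sketch, under stated assumptions, of how Q̄^i_{t+s} could be computed from a frozen baseline joint policy and a stream of primitive random variables. The update shown is the standard constant step-size Q-learning update with a minimum over next actions (players minimize costs); the exact update of Algorithm 1 is not reproduced here, and all names (hypothetical_q, f, cost) are illustrative.

```python
import numpy as np

def hypothetical_q(Q_t, x_t, phi_t, f, cost, alpha, gamma, rho, noise):
    """Hypothetical Q-factors obtained by freezing the baseline policies at time t.

    Q_t    : dict, Q_t[i] is player i's (n_states x n_actions) Q-factor array at time t
    x_t    : state at time t
    phi_t  : dict, phi_t[j][x] = player j's baseline action in state x (frozen at time t)
    f      : deterministic transition function, x' = f(x, joint_action_dict, W)
    cost   : dict of functions, cost[i](x, joint_action_dict) = player i's stage cost
    alpha, gamma, rho : dicts of step sizes, discount factors, experimentation rates
    noise  : iterable of (W, {j: (rho_tilde, u_tilde)}) pairs, the primitive randomness
    """
    players = list(phi_t.keys())
    Q = {i: Q_t[i].copy() for i in players}
    x = x_t
    for W, per_player in noise:
        # rho^j-soft action selection against the *frozen* baseline policy phi_t.
        joint = {j: (phi_t[j][x] if per_player[j][0] > rho[j] else per_player[j][1])
                 for j in players}
        x_next = f(x, joint, W)
        for i in players:
            a_i = joint[i]
            # Standard constant step-size Q-learning update (costs are minimized).
            target = cost[i](x, joint) + gamma[i] * Q[i][x_next].min()
            Q[i][x, a_i] = (1 - alpha[i]) * Q[i][x, a_i] + alpha[i] * target
        x = x_next
    return Q
```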

6.2 Supporting Results on Q-Learning Iterates

Lemma 6.1.

For some M<M<\infty, the following holds almost surely:

maxi𝒩supt0Q^tiM.\max_{i\in\mathcal{N}}\sup_{t\geq 0}\left\|\widehat{Q}^{i}_{t}\right\|_{\infty}\leq M.

Proof 6.2.

For all i𝒩i\in\mathcal{N}, we have

Q^t+1i\displaystyle\|\widehat{Q}_{t+1}^{i}\|_{\infty} max{(1αi)Q^ti+αi(ci+γiQ^ti),Q^ti}.\displaystyle\leq\max\bigg{\{}(1-\alpha^{i})\|\widehat{Q}_{t}^{i}\|_{\infty}+\alpha^{i}\big{(}\|c^{i}\|_{\infty}+\gamma^{i}\|\widehat{Q}_{t}^{i}\|_{\infty}\big{)},\|\widehat{Q}_{t}^{i}\|_{\infty}\bigg{\}}.

Defining M:=maxi𝒩{Q^0i,ci1γi}M:=\max_{i\in\mathcal{N}}\left\{\left\|\widehat{Q}^{i}_{0}\right\|_{\infty},\frac{\|c^{i}\|_{\infty}}{1-\gamma^{i}}\right\}, we have that if Q^tiM\left\|\widehat{Q}^{i}_{t}\right\|_{\infty}\leq M, then

Q^t+1i\displaystyle\|\widehat{Q}_{t+1}^{i}\|_{\infty} max{(1αi)M+αi(ci+γiM),M}\displaystyle\leq\max\bigg{\{}(1-\alpha^{i})M+\alpha^{i}\big{(}\|c^{i}\|_{\infty}+\gamma^{i}M\big{)},M\bigg{\}}
=max{M+αi(ci(1γi)M0),M}=M.\displaystyle=\max\bigg{\{}M+\alpha^{i}\big{(}\underbrace{\|c^{i}\|_{\infty}-(1-\gamma^{i})M}_{\leq 0}\big{)},M\bigg{\}}=M.

This proves the lemma by induction on tt, since Q^0iM<\|\widehat{Q}_{0}^{i}\|_{\infty}\leq M<\infty for each i𝒩i\in\mathcal{N}.
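As a quick numerical illustration of the invariant used in the proof above (a sketch, not part of the formal argument), the following runs the constant step-size update with arbitrary bounded costs and arbitrary successor states and checks that the iterates never exceed the constant M defined in the proof; all quantities below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n_states, n_actions, alpha, gamma = 4, 3, 0.1, 0.8
Q = rng.normal(scale=5.0, size=(n_states, n_actions))      # arbitrary initialization
c = rng.uniform(0.0, 2.0, size=(n_states, n_actions))      # bounded stage costs
M = max(np.abs(Q).max(), np.abs(c).max() / (1 - gamma))    # the bound of Lemma 6.1

for _ in range(10_000):
    x, a = rng.integers(n_states), rng.integers(n_actions)  # arbitrary update site
    x_next = rng.integers(n_states)                         # arbitrary successor state
    Q[x, a] = (1 - alpha) * Q[x, a] + alpha * (c[x, a] + gamma * Q[x_next].min())
    assert np.abs(Q).max() <= M + 1e-9                      # iterates stay within M
```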

The following lemma says that players can learn their optimal Q-factors accurately when they use sufficiently small step sizes and when their learning is not disrupted by policy updates for a sufficiently large number of stage games. It is worded in terms of primitive random variables and hypothetical continuation Q-factors for the reasons described above.

Lemma 6.3.

For any ξ>0\xi>0, there exists α^ξ>0\hat{\alpha}_{\xi}>0 and function T^ξ:(0,1)N\hat{T}_{\xi}:(0,1)^{N}\to\mathbb{N} such that if (1) αi(0,α^ξ)\alpha^{i}\in(0,\hat{\alpha}_{\xi}) for all i𝒩i\in\mathcal{N}, and (2) T~T^ξ(𝛂)\tilde{T}\geq\hat{T}_{\xi}(\bm{\alpha}), then

Pr(Ωt:t+T~)1ξ,t0,𝐡t𝐇t,{\rm Pr}\left(\Omega_{t:t+\tilde{T}}\right)\geq 1-\xi,\quad\forall\,t\geq 0,\mathbf{h}_{t}\in\mathbf{H}_{t},

where Ωt:t+T~={ϖt+:maxi𝒩Q¯t+T~i(𝐡t,ϖt+)Qϕtii<ξ}.\Omega_{t:t+\tilde{T}}=\big{\{}\bm{\varpi}_{t+}:\max_{i\in\mathcal{N}}\|\bar{Q}_{t+\tilde{T}}^{i}(\mathbf{h}_{t},\bm{\varpi}_{t+})-{Q}_{\bm{\phi}_{t}^{-i}}^{*i}\|_{\infty}<\xi\big{\}}.

Proof 6.4.

Since each player ii’s policy is ρi\rho^{i}-soft, our Assumption 2 implies the persistent excitation assumption of [2, Assumption 1]. The result then follows from Lemma 6.1 and [2, Theorem 3.4], using Markov’s inequality.

In the unsynchronized regime, where each agent runs Q-learning while facing a non-stationary, time-varying environment, Lemma 6.3 is a key tool: it quantifies the amount of time needed to correct outdated information contained in the learning iterates. The choice to employ constant learning rates was made in large part due to this consideration.

In the sequel, we fix α^ξ\hat{\alpha}_{\xi} and T^ξ()\hat{T}_{\xi}(\cdot) with the properties outlined in Lemma 6.3.

6.3 A Supporting Result Controlling Update Frequencies

A core challenge in analyzing unsynchronized multi-agent learning is that policy updates by one player are potential destabilizations of learning for the others. We now recall the sequence of time intervals {[τkmin,τkmax]}k0\{[\tau^{\min}_{k},\tau^{\max}_{k}]\}_{k\geq 0} previously described in Section 5; these intervals will be used to quantify and analyze the effects of such disruptions. In particular, we wish to upper bound the number of potential disruptions to a given player’s learning, which is at most the number of times the other players finish an exploration phase during that player’s current exploration phase. An upper bound of this sort will be needed to quantify the role of inertia in stabilizing the environment, formalizing the heuristic argument of [24].

Definition 6.5.

Let τ0min=τ0max:=0\tau^{\min}_{0}=\tau^{\max}_{0}:=0. For k0k\geq 0, define

τk+1min\displaystyle\tau^{\min}_{k+1} :=inf{tni:tni>τkmax,i𝒩,n0}\displaystyle:=\inf\{t^{i}_{n}:t^{i}_{n}>\tau^{\max}_{k},i\in\mathcal{N},n\geq 0\}
τk+1max\displaystyle\tau^{\max}_{k+1} :=inf{tτk+1min:i𝒩,n s.t. tni[τk+1min,t],\displaystyle:=\inf{\big{\{}}t\geq\tau^{\min}_{k+1}:\forall\,i\in\mathcal{N},\exists\,n\in\mathbb{N}\text{ s.t. }t^{i}_{n}\in[\tau^{\min}_{k+1},t],
 and inf{tn¯i>t:i𝒩,n¯}t+T/N},\displaystyle\qquad\quad\qquad\quad\qquad\text{ and }\inf\{t^{i}_{\bar{n}}>t:i\in\mathcal{N},\bar{n}\in\mathbb{N}\}\geq t+T/N{\big{\}}},

where TT is the constant appearing in Assumption 4.

The intervals [τkmin,τkmax][\tau^{\min}_{k},\tau^{\max}_{k}] represent active phases, during which players may change their policies. The (k+1)th(k+1)^{th} active phase starts at τk+1min\tau^{\min}_{k+1}, which is defined as the first time after τkmax\tau^{\max}_{k} at which some agent has an opportunity to revise its policy. As a consequence of these definitions, no policy updates occur in (τkmax,τk+1min)(\tau^{\max}_{k},\tau^{\min}_{k+1}).

The definition of τk+1max\tau^{\max}_{k+1} is slightly more involved: τk+1max\tau^{\max}_{k+1} is the minimal stage game time tt at which (a) each agent has had an opportunity to switch policies in [τk+1min,t][\tau^{\min}_{k+1},t] and (b) no agent finishes its current exploration phase in the next T/NT/N stage games. The latter point is critical, as it guarantees that each player’s learning is allowed to proceed uninterrupted for at least T/NT/N stages after the final (potential) disruption.
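To illustrate Definition 6.5, here is a small sketch (with hypothetical update schedules; the function name active_phases is ours) that computes the active phases directly from the players' exploration-phase start times by scanning candidate times and checking the two conditions defining τ^max_{k+1}.

```python
def active_phases(update_times, T, N, num_phases):
    """Compute the intervals [tau_min_k, tau_max_k] of Definition 6.5.

    update_times : dict mapping each player to the sorted list of its
                   exploration-phase start times t^i_n (assumed sufficiently long)
    """
    all_times = sorted(t for times in update_times.values() for t in times)
    phases = [(0, 0)]                            # tau_min_0 = tau_max_0 = 0
    for _ in range(num_phases):
        tau_max_prev = phases[-1][1]
        # tau_min_{k+1}: first update time strictly after tau_max_k.
        tau_min = min(t for t in all_times if t > tau_max_prev)
        # tau_max_{k+1}: first time t >= tau_min_{k+1} such that (a) every player
        # has an update time in [tau_min_{k+1}, t], and (b) the next update time
        # after t is at least t + T/N.
        for t in (s for s in all_times if s >= tau_min):
            has_update = all(any(tau_min <= u <= t for u in times)
                             for times in update_times.values())
            next_after = min((u for u in all_times if u > t), default=float("inf"))
            if has_update and next_after >= t + T / N:
                phases.append((tau_min, t))
                break
    return phases[1:]

# Two players, T = 10, N = 2, hypothetical schedules consistent with Assumption 4 (R = 2):
times = {1: [0, 10, 22, 40, 60], 2: [0, 15, 31, 49, 69]}
print(active_phases(times, T=10, N=2, num_phases=2))   # [(10, 15), (22, 31)]
```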

Lemma 6.6.

The sequences {τkmin}k0\{\tau^{\min}_{k}\}_{k\geq 0} and {τkmax}k0\{\tau^{\max}_{k}\}_{k\geq 0} are well-defined (i.e. the infimum defining each term is achieved by a finite integer), and the following holds for any k0k\geq 0:

  • (a) τk+1minτkmax+T/N\tau^{\min}_{k+1}\geq\tau^{\max}_{k}+T/N.

  • (b) For each i𝒩i\in\mathcal{N}, we have n01{tni[τkmin,τkmax]}R+1.\sum_{n\geq 0}\textbf{1}\{t^{i}_{n}\in[\tau^{\min}_{k},\tau^{\max}_{k}]\}\leq R+1.

In words, part (a) of Lemma 6.6 guarantees that any two successive active phases are separated by at least T/NT/N stage games, while part (b) guarantees that each individual player has at most R+1R+1 opportunities to revise its policy during the kthk^{th} active phase.

Proof 6.7.

For some k0k\geq 0, suppose that (a) and (b) hold for all lkl\leq k. That is,

  • (a) τl+1minτlmax+T/N\tau^{\min}_{l+1}\geq\tau^{\max}_{l}+T/N, for all 0lk0\leq l\leq k, and

  • (b) for each i𝒩,lki\in\mathcal{N},l\leq k, we have n01{tni[τlmin,τlmax]}R+1.\sum_{n\geq 0}\textbf{1}\{t^{i}_{n}\in[\tau^{\min}_{l},\tau^{\max}_{l}]\}\leq R+1.

Note that this holds for k=0k=0. We will show the following: (1) τk+1max<\tau^{\max}_{k+1}<\infty; (2) τk+2minτk+1max+T/N\tau^{\min}_{k+2}~{}\geq~{}\tau^{\max}_{k+1}~{}+~{}T/N; (3) For all i𝒩i\in\mathcal{N}, we have n01{tni[τk+1min,τk+1max]}R+1.\sum_{n\geq 0}\textbf{1}\{t^{i}_{n}\in[\tau^{\min}_{k+1},\tau^{\max}_{k+1}]\}\leq R+1.

For each i𝒩i\in\mathcal{N}, let ni0n_{i}\in\mathbb{Z}_{\geq 0} denote the index such that

tni1i<τk+1mintnii.t^{i}_{n_{i}-1}<\tau^{\min}_{k+1}\leq t^{i}_{n_{i}}.

By minimality of τk+1min\tau^{\min}_{k+1}, we have that tni1iτkmaxt^{i}_{n_{i}-1}\leq\tau^{\max}_{k}. By Assumption 4, we have Tni1iRTT^{i}_{n_{i}-1}\leq RT and thus

tnii=tni1i+Tni1iτkmax+RT.t^{i}_{n_{i}}=t^{i}_{n_{i}-1}+T^{i}_{n_{i}-1}\leq\tau^{\max}_{k}+RT.

This implies tniiτk+1minRTT/Nt^{i}_{n_{i}}-\tau^{\min}_{k+1}\leq RT-T/N, since τk+1minτkmax+T/N\tau^{\min}_{k+1}\geq\tau^{\max}_{k}+T/N by hypothesis.

We put t^0:=maxitnii\hat{t}_{0}:=\max_{i}t^{i}_{n_{i}}, and let j(0)𝒩j(0)\in\mathcal{N} denote an agent with tnj(0)j(0)=t^0t^{j(0)}_{n_{j(0)}}=\hat{t}_{0}.

For each l{0,1,,N1}l\in\{0,1,\dots,N-1\}, we define t^l+1\hat{t}_{l+1} as

t^l+1=min{tk¯j>t^l:j𝒩,k¯0}.\hat{t}_{l+1}=\min\left\{t^{j}_{\bar{k}}>\hat{t}_{l}:j\in\mathcal{N},\bar{k}\geq 0\right\}.

Observe that t^0τk+1max\hat{t}_{0}\leq\tau^{\max}_{k+1}. Moreover, if there exists some lN1l\leq N-1 such that t^l+1t^l+T/N\hat{t}_{l+1}\geq\hat{t}_{l}+T/N, then τk+1maxt^l\tau^{\max}_{k+1}\leq\hat{t}_{l}, and thus τk+1max<\tau^{\max}_{k+1}<\infty. We now argue that such ll exists.

Suppose that no such ll exists:

(7) max{t^1t^0,,t^Nt^N1}<T/N\displaystyle\max\{\hat{t}_{1}-\hat{t}_{0},\cdots,\hat{t}_{N}-\hat{t}_{N-1}\}<T/N
\displaystyle\Rightarrow t^Nt^0=l=0N1(t^l+1t^l)<NTN=Tmini,nTni.\displaystyle\hat{t}_{N}-\hat{t}_{0}=\sum_{l=0}^{N-1}\left(\hat{t}_{l+1}-\hat{t}_{l}\right)<N\cdot\frac{T}{N}=T\leq\min_{i,n}T^{i}_{n}.

Since consecutive policy update times of any one agent are separated by at least TT, one concludes that the minima defining each {t^l:1lN}\{\hat{t}_{l}:1\leq l\leq N\} are attained by NN distinct minimizing agents, one of whom is j(0)j(0). But then, for some ll, we have

t^0=tnj(0)j(0), and tnj(0)j(0)+Tnj(0)j(0)=t^lt^N,\hat{t}_{0}=t^{j(0)}_{n_{j(0)}},\text{ and }\>t^{j(0)}_{n_{j(0)}}+T^{j(0)}_{n_{j(0)}}=\hat{t}_{l}\leq\hat{t}_{N},

which implies Tnj(0)j(0)<TT^{j(0)}_{n_{j(0)}}<T, contradicting Assumption 4.

We have thus shown that the set 𝔗\mathfrak{T}\not=\varnothing, where 𝔗\mathfrak{T} is given by

𝔗:={l{0,1,,N1}:t^l+1t^l+T/N}.\mathfrak{T}:=\{l\in\{0,1,\dots,N-1\}:\hat{t}_{l+1}\geq\hat{t}_{l}+T/N\}.

Let l=min𝔗l^{*}=\min\mathfrak{T}, and note that, if l0l^{*}\not=0, then t^l+1<t^l+T/N\hat{t}_{l+1}<\hat{t}_{l}+T/N for all l<ll<l^{*}. It follows that (1) τk+1max=t^l<\tau^{\max}_{k+1}=\hat{t}_{l^{*}}<\infty and (2) τk+2min=t^l+1τk+1max+T/N\tau^{\min}_{k+2}=\hat{t}_{l^{*}+1}\geq\tau^{\max}_{k+1}+T/N.

We conclude by showing that (3) holds. That is,

n01{tni[τk+1min,τk+1max]}R+1,i𝒩.\sum_{n\geq 0}\textbf{1}\{t^{i}_{n}\in[\tau^{\min}_{k+1},\tau^{\max}_{k+1}]\}\leq R+1,\quad\forall i\in\mathcal{N}.

By Assumption 4, it suffices to show that τk+1maxτk+1min<(R+1)T.\tau^{\max}_{k+1}-\tau^{\min}_{k+1}<(R+1)T.

We have already shown t^0τk+1minRTT/N\hat{t}_{0}-\tau^{\min}_{k+1}\leq RT-T/N, which handles the case where l=0l^{*}=0 (in which case τk+1max=t^0\tau^{\max}_{k+1}=\hat{t}_{0}), and we focus on l>0l^{*}>0. Since t^l+1<t^l+T/N\hat{t}_{l+1}<\hat{t}_{l}+T/N for all l<ll<l^{*} and lN1l^{*}\leq N-1, we have τk+1maxτk+1min=t^lτk+1min\tau^{\max}_{k+1}-\tau^{\min}_{k+1}=\hat{t}_{l^{*}}-\tau^{\min}_{k+1} and

t^lτk+1min=(l=0l1[t^l+1t^l])+t^0τk+1minl(T/N)+(RTT/N)<(R+1)T,\displaystyle\hat{t}_{l^{*}}-\tau^{\min}_{k+1}=\left(\sum_{l=0}^{l^{*}-1}\left[\hat{t}_{l+1}-\hat{t}_{l}\right]\right)+\hat{t}_{0}-\tau^{\min}_{k+1}\leq l^{*}(T/N)+(RT-T/N)<(R+1)T,

which concludes the proof.

Note that the proof above also shows that for any active phase [τkmin,τkmax][\tau^{\min}_{k},\tau^{\max}_{k}], there exists some agent that has exactly one opportunity to update its policy in [τkmin,τkmax][\tau^{\min}_{k},\tau^{\max}_{k}].

For k0k\geq 0, we define BkB_{k} to be the event that the baseline policy is a fixed equilibrium policy throughout the kthk^{th} active phase, [τkmin,τkmax][\tau^{\min}_{k},\tau^{\max}_{k}]. That is,

Bk:={ϕτkmin==ϕτkmax𝚷SD0-eq}.B_{k}:=\left\{\bm{\phi}_{\tau^{\min}_{k}}=\cdots=\bm{\phi}_{\tau^{\max}_{k}}\in\bm{\Pi}^{0{\rm\text{-}eq}}_{SD}\right\}.

We define L:=+1L:=\ell^{*}+1, where =max{(𝝅):𝝅𝚷SD}\ell^{*}=\max\{\ell(\bm{\pi}):\bm{\pi}\in\bm{\Pi}_{SD}\} and (𝝅)\ell(\bm{\pi}) denotes the length of a shortest strict best-response path from 𝝅𝚷SD\bm{\pi}\in\bm{\Pi}_{SD} to an equilibrium policy in 𝚷SD0-eq\bm{\Pi}^{0{\rm\text{-}eq}}_{SD}.
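The quantity ℓ(π) is a purely graph-theoretic object: it is the distance from π to the equilibrium set in the directed graph whose edges are single-player strict best-response moves. As an illustration (assuming the game is weakly acyclic, so that every joint policy reaches the equilibrium set, and taking the best-response graph as given), the following sketch computes ℓ(·) and L by breadth-first search; the names path_lengths_and_L and br_moves are ours.

```python
from collections import deque

def path_lengths_and_L(policies, equilibria, br_moves):
    """Compute ell(pi) for each joint policy and L = 1 + max_pi ell(pi).

    policies   : iterable of joint policies (hashable objects)
    equilibria : set of 0-equilibrium joint policies (ell = 0 for these)
    br_moves   : dict mapping each policy to the set of policies reachable from it
                 by a single player's strict best response
    """
    # Build the reverse graph and run breadth-first search from the equilibrium set.
    reverse = {pi: set() for pi in policies}
    for pi, successors in br_moves.items():
        for nxt in successors:
            reverse[nxt].add(pi)
    ell = {pi: 0 for pi in equilibria}
    queue = deque(equilibria)
    while queue:
        nxt = queue.popleft()
        for pi in reverse[nxt]:
            if pi not in ell:            # first visit yields the shortest path length
                ell[pi] = ell[nxt] + 1
                queue.append(pi)
    return ell, 1 + max(ell.values())    # L := ell^* + 1
```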

Lemma 6.8.

Let θ>0\theta>0, and define

ξ:=1(R+1)NLmin{θ,12mini𝒩{δi,δ¯δi}}.\xi:=\frac{1}{(R+1)NL}\min\left\{\theta,\frac{1}{2}\min_{i\in\mathcal{N}}\{\delta^{i},\bar{\delta}-\delta^{i}\}\right\}.

Suppose maxi𝒩αi<α^ξ\max_{i\in\mathcal{N}}\alpha^{i}<\hat{\alpha}_{\xi} and TNT^ξ(𝛂)T\geq N\hat{T}_{\xi}(\bm{\alpha}), where α^ξ\hat{\alpha}_{\xi} and T^ξ()\hat{T}_{\xi}(\cdot) are the objects specified in Lemma 6.3. For any k0k\geq 0 and history 𝐡τkmax𝐇τkmax,eq\mathbf{h}_{\tau^{\max}_{k}}\in\mathbf{H}_{\tau^{\max}_{k},\rm eq}, we have

Pr(Bk+1|𝐡τkmax)1θ/L.{\rm Pr}\left(B_{k+1}|\mathbf{h}_{\tau^{\max}_{k}}\right)\geq 1-\theta/L.

Remark: We note that τkmax0\tau^{\max}_{k}\in\mathbb{Z}_{\geq 0} is some constant, and 𝐡τkmax\mathbf{h}_{\tau^{\max}_{k}} is a τkmax\tau^{\max}_{k}-history,

𝐡τkmax=(x0,ϕ0,Q^0,,xτkmax,ϕτkmax,Q^τkmax),\mathbf{h}_{\tau^{\max}_{k}}=(x_{0},\bm{\phi}_{0},\widehat{\textbf{Q}}_{0},\dots,x_{\tau^{\max}_{k}},\bm{\phi}_{\tau^{\max}_{k}},\widehat{\textbf{Q}}_{\tau^{\max}_{k}}),

with ϕτkmax𝚷SD𝚷0-eq\bm{\phi}_{\tau^{\max}_{k}}\in\bm{\Pi}_{SD}\cap\bm{\Pi}^{0{\rm\text{-}eq}}.

Proof 6.9.

We put t~0:=τk+1min\tilde{t}_{0}:=\tau^{\min}_{k+1} and recursively define

t~l+1:=min{tni>t~l:i𝒩,n0},l0.\tilde{t}_{l+1}:=\min\{t^{i}_{n}>\tilde{t}_{l}:i\in\mathcal{N},n\geq 0\},\quad\forall l\geq 0.

We define m0m\in\mathbb{Z}_{\geq 0} to be the index achieving t~m=τk+1max\tilde{t}_{m}=\tau^{\max}_{k+1}, and note that m=0m=0 is possible when τk+1max=τk+1min\tau^{\max}_{k+1}=\tau^{\min}_{k+1}.

Note that mm counts the number of stage games after τk+1min\tau^{\min}_{k+1} and at or before τk+1max\tau^{\max}_{k+1} at which some player ends an exploration phase. From the proof of Lemma 6.6, we have that

mi𝒩n01{tni[τk+1min,τk+1max]}<(R+1)N.m\leq\sum_{i\in\mathcal{N}}\sum_{n\geq 0}\textbf{1}\{t^{i}_{n}\in[\tau^{\min}_{k+1},\tau^{\max}_{k+1}]\}<(R+1)N.

where the strict inequality obtains since at least one player (player j(0)j(0) in the proof of Lemma 6.6) has exactly one exploration phase ending in [τk+1min,τk+1max][\tau^{\min}_{k+1},\tau^{\max}_{k+1}], while the remaining N1N-1 players have at most R+1R+1.

For each l{0,,m}l\in\{0,\cdots,m\}, let AlA_{l} denote the set of players who have an opportunity to switch policies at time t~l\tilde{t}_{l}. That is,

Al:={i𝒩:n(l) s.t. tn(l)i=t~l}.A_{l}:=\{i\in\mathcal{N}:\exists\;n(l)\in\mathbb{N}\text{ s.t. }t^{i}_{n(l)}=\tilde{t}_{l}\}.

From our choices of ξ,𝛂\xi,\bm{\alpha}, and TNT^ξ(𝛂)T\geq N\hat{T}_{\xi}(\bm{\alpha}) and the fact that t~lτk+1minτkmax+T/N\tilde{t}_{l}\geq\tau^{\min}_{k+1}\geq\tau^{\max}_{k}+T/N for all l{0,1,,m}l\in\{0,1,\dots,m\}, we have, by Lemma 6.3, that

Pr(Ωτkmax:t~l)1ξ,l{0,1,,m},{\rm Pr}\left(\Omega_{\tau^{\max}_{k}:\tilde{t}_{l}}\right)\geq 1-\xi,\quad\forall l\in\{0,1,\dots,m\},

where, for any s,t0s,t\geq 0, we recall that Ωt:t+s\Omega_{t:t+s} is given by

Ωt:t+s={ϖt+:maxi𝒩Q¯t+si(𝐡t,ϖt+)Qϕtii<ξ}.\Omega_{t:t+s}=\left\{\bm{\varpi}_{t+}:\max_{i\in\mathcal{N}}\left\|\bar{Q}^{i}_{t+s}(\mathbf{h}_{t},\bm{\varpi}_{t+})-Q^{*i}_{\bm{\phi}^{-i}_{t}}\right\|_{\infty}<\xi\right\}.

Note that t~0=τk+1min\tilde{t}_{0}=\tau^{\min}_{k+1}, and the minimality defining τk+1min\tau^{\min}_{k+1} implies that no player updates its policy during the interval (τkmax,τk+1min)(\tau^{\max}_{k},\tau^{\min}_{k+1}). It follows that the hypothetical continuation trajectory defining each player ii’s hypothetical Q-factors Q¯t~0i(𝐡τkmax,ϖτkmax+)\bar{Q}^{i}_{\tilde{t}_{0}}(\mathbf{h}_{\tau^{\max}_{k}},\bm{\varpi}_{\tau^{\max}_{k}+}) coincides with the sample trajectory defining Q^t~0\widehat{\textbf{Q}}_{\tilde{t}_{0}}, since action selections and state transitions are decided by ϕτkmax\bm{\phi}_{\tau^{\max}_{k}} and ϖτkmax+\bm{\varpi}_{\tau^{\max}_{k}+}. Thus, by our choice of t~0=τk+1min\tilde{t}_{0}=\tau^{\min}_{k+1},

Q¯t~0i(𝐡τkmax,ϖτkmax+)=Q^t~0i,i𝒩.\bar{Q}^{i}_{\tilde{t}_{0}}(\mathbf{h}_{\tau^{\max}_{k}},\bm{\varpi}_{\tau^{\max}_{k}+})=\widehat{Q}^{i}_{\tilde{t}_{0}},\quad\forall i\in\mathcal{N}.

For ϖτkmax+Ωτkmax:t~0\bm{\varpi}_{\tau^{\max}_{k}+}\in\Omega_{\tau^{\max}_{k}:\tilde{t}_{0}}, each player iA0i\in A_{0} recovers its estimated Q-factors Q^t~0i\widehat{Q}^{i}_{\tilde{t}_{0}} within ξ\xi of QϕτkmaxiiQ^{*i}_{\bm{\phi}^{-i}_{\tau^{\max}_{k}}}, since

Q^t~0i=𝒬t~0i(𝐡τkmax,ϖτkmax+).\widehat{Q}^{i}_{\tilde{t}_{0}}=\mathcal{Q}^{i}_{\tilde{t}_{0}}(\mathbf{h}_{\tau^{\max}_{k}},\bm{\varpi}_{\tau^{\max}_{k}+}).

Since we have hypothesized that ϕτkmax𝚷SD0-eq\bm{\phi}_{\tau^{\max}_{k}}\in\bm{\Pi}^{0{\rm\text{-}eq}}_{SD} and chosen ξ<12min{δi,δ¯δi}\xi<\frac{1}{2}\min\{\delta^{i},\bar{\delta}-\delta^{i}\}, it follows that agent ii does not change its policy at time t~0=τk+1min\tilde{t}_{0}=\tau^{\min}_{k+1} when given the opportunity to revise its policy. In words, the agent appraises its policy as being sufficiently close to optimal with respect to its learned Q-factors, and does not switch its policy. In symbols, that is to say

ϖτkmax+Ωτkmax:t~0maxi𝒩𝒬t~0i(𝐡τkmax,ϖτkmax+)Qϕτkmaxii<ξ,\displaystyle\bm{\varpi}_{\tau^{\max}_{k}+}\in\Omega_{\tau^{\max}_{k}:\tilde{t}_{0}}\Rightarrow\max_{i\in\mathcal{N}}\left\|\mathcal{Q}^{i}_{\tilde{t}_{0}}(\mathbf{h}_{\tau^{\max}_{k}},\bm{\varpi}_{\tau^{\max}_{k}+})-Q^{*i}_{\bm{\phi}^{-i}_{\tau^{\max}_{k}}}\right\|_{\infty}<\xi,

which in turn implies that ϕt~0=Φt~0(𝐡τkmax,ϖτkmax+)=ϕτkmax.\bm{\phi}_{\tilde{t}_{0}}=\Phi_{\tilde{t}_{0}}(\mathbf{h}_{\tau^{\max}_{k}},\bm{\varpi}_{\tau^{\max}_{k}+})=\bm{\phi}_{\tau^{\max}_{k}}.

Repeating this logic, one has that if

ϖτkmax+0lmΩτkmax:t~l\bm{\varpi}_{\tau^{\max}_{k}+}\in\bigcap_{0\leq l\leq m}\Omega_{\tau^{\max}_{k}:\tilde{t}_{l}}

then ϕt~l=Φt~l(𝐡τkmax,ϖτkmax+)=ϕτkmax\bm{\phi}_{\tilde{t}_{l}}=\Phi_{\tilde{t}_{l}}(\mathbf{h}_{\tau^{\max}_{k}},\bm{\varpi}_{\tau^{\max}_{k}+})=\bm{\phi}_{\tau^{\max}_{k}} for each lml\leq m. The probability of this intersection is lower bounded using the union bound and Lemma 6.3:

Pr(0lmΩτkmax:t~l)1(m+1)ξ1θ/L,{\rm Pr}\left(\bigcap_{0\leq l\leq m}\Omega_{\tau^{\max}_{k}:\tilde{t}_{l}}\right)\geq 1-(m+1)\xi\geq 1-\theta/L,

as desired.

Lemma 6.10.

Let θ>0\theta>0, and define

ξ:=1(R+1)NLmin{θ,12mini𝒩{δi,δ¯δi}}.\xi:=\frac{1}{(R+1)NL}\min\left\{\theta,\frac{1}{2}\min_{i\in\mathcal{N}}\{\delta^{i},\bar{\delta}-\delta^{i}\}\right\}.

Suppose maxi𝒩αi<α^ξ\max_{i\in\mathcal{N}}\alpha^{i}<\hat{\alpha}_{\xi} and TNT^ξ(𝛂)T\geq N\hat{T}_{\xi}(\bm{\alpha}), where α^ξ\hat{\alpha}_{\xi} and T^ξ()\hat{T}_{\xi}(\cdot) are the objects specified in Lemma 6.3. For any k0k\geq 0 and history 𝐡τkmax𝐇τkmax𝐇τkmax,eq\mathbf{h}_{\tau^{\max}_{k}}\in\mathbf{H}_{\tau^{\max}_{k}}\setminus\mathbf{H}_{\tau^{\max}_{k},\rm eq}, we have

Pr(Bk+L|𝐡τkmax)pmin(1θ),{\rm Pr}\left(B_{k+L}|\mathbf{h}_{\tau^{\max}_{k}}\right)\geq p_{\min}(1-\theta),

where pmin:=j𝒩min{1λj|ΠSDj|,λj}(R+1)L>0p_{\min}:=\prod_{j\in\mathcal{N}}\min\left\{\frac{1-\lambda^{j}}{|\Pi^{j}_{SD}|},\lambda^{j}\right\}^{(R+1)L}>0 and LL was defined above Lemma 6.8.

Proof 6.11.

Let 𝛑0,,𝛑\bm{\pi}_{0},\dots,\bm{\pi}_{\ell} be a (shortest) strict best-response path from 𝛑0:=ϕτkmax\bm{\pi}_{0}:=\bm{\phi}_{\tau_{k}^{\max}} to 𝛑𝚷SD0-eq\bm{\pi}_{\ell}\in\bm{\Pi}^{0{\rm\text{-}eq}}_{SD}, with {1,,}\ell\in\{1,\dots,\ell^{*}\}, where \ell^{*} was defined along with LL immediately before Lemma 6.8. For each pair of neighbouring joint policies 𝛑l\bm{\pi}_{l} and 𝛑l+1\bm{\pi}_{l+1}, there exists exactly one player i(l)i(l) who switches its policy. That is, πl+1j=πlj\pi^{j}_{l+1}=\pi^{j}_{l} for all ji(l)j\not=i(l).

Consider the event that no agent updates its policy during [τkmax,τk+1max][\tau_{k}^{\max},\tau_{k+1}^{\max}] due to inertia except player i(0)i(0), who updates its policy exactly once to π1i(0)\pi^{i(0)}_{1} and remains inert at all other update opportunities in [τkmax,τk+1max][\tau_{k}^{\max},\tau_{k+1}^{\max}].

By Lemma 6.3 and Lemma 6.6, the conditional probability of this event given 𝐡τkmax\mathbf{h}_{\tau_{k}^{\max}} is lower bounded by

(1θ/L)j𝒩min{1λj|ΠSDj|,λj}R+1.(1-\theta/L)\prod_{j\in\mathcal{N}}\min\left\{\frac{1-\lambda^{j}}{|\Pi_{SD}^{j}|},\lambda^{j}\right\}^{R+1}.

The same lower bound applies to each subsequent transition along 𝛑0,,𝛑\bm{\pi}_{0},\dots,\bm{\pi}_{\ell}, which leads to

Pr(ϕτk+max=𝝅|𝐡τkmax)(1θ/L)j𝒩min{1λj|ΠSDj|,λj}(R+1).\displaystyle{\rm Pr}\left(\bm{\phi}_{\tau^{\max}_{k+\ell}}=\bm{\pi}_{\ell}\middle|\mathbf{h}_{\tau_{k}^{\max}}\right)\geq(1-\theta/L)^{\ell}\prod_{j\in\mathcal{N}}\min\left\{\frac{1-\lambda^{j}}{|\Pi_{SD}^{j}|},\lambda^{j}\right\}^{(R+1)\ell}.

Next, consider the event that no agent updates its policy during [τk+max,τk+Lmax][\tau^{\max}_{k+\ell},\tau^{\max}_{k+L}] due to inertia. The conditional probability of this event given hτk+maxh_{\tau^{\max}_{k+\ell}} is lower bounded by

j𝒩(λj)(R+1)(L).\prod_{j\in\mathcal{N}}(\lambda^{j})^{(R+1)(L-\ell)}.

This results in Pr(Bk+L|𝐡τkmax)(1θ/L)Lpmin,{\rm Pr}(B_{k+L}|\mathbf{h}_{\tau_{k}^{\max}})\geq(1-\theta/L)^{L}p_{\min}, which proves the lemma since (1θ/L)L1θ(1-\theta/L)^{L}\geq 1-\theta.

6.4 Proof of Theorem 4.1

Given ϵ>0\epsilon>0, let θ>0\theta>0 be the unique solution to

(1θ)pminθ+(1θ)pminθ=1ϵ\frac{(1-\theta)p_{\min}}{\theta+(1-\theta)p_{\min}}-\theta=1-\epsilon

and

ξ=1(R+1)NLmin{θ,12mini𝒩{δi,δ¯δi}}.\xi=\frac{1}{(R+1)NL}\min\left\{\theta,\frac{1}{2}\min_{i\in\mathcal{N}}\{\delta^{i},\bar{\delta}-\delta^{i}\}\right\}.

Suppose that

maxi𝒩αi<α¯ϵ:=α^ξ and TT¯ϵ(𝜶,R):=NT^ξ(𝜶).\max_{i\in\mathcal{N}}\alpha^{i}<\bar{\alpha}_{\epsilon}:=\hat{\alpha}_{\xi}\quad\text{and}\quad T\geq\bar{T}_{\epsilon}(\bm{\alpha},R):=N\hat{T}_{\xi}(\bm{\alpha}).

By Lemmas 6.8 and 6.10, for all k0k\geq 0,

(8) Pr(Bk+L|Bk)\displaystyle{\rm Pr}(B_{k+L}|B_{k}) 1θ,\displaystyle\geq 1-\theta,
(9) Pr(Bk+L|Bkc)\displaystyle{\rm Pr}(B_{k+L}|B_{k}^{c}) (1θ)pmin.\displaystyle\geq(1-\theta)p_{\min}.

Let pk:=Pr(Bk)p_{k}:={\rm Pr}(B_{k}) for all k0k\geq 0. By the law of total probability, (8) and (9) yield pk+L(1θ)pk+(1θ)pmin(1pk)p_{k+L}\geq(1-\theta)p_{k}+(1-\theta)p_{\min}(1-p_{k}) for all k0k\geq 0; iterating this bound shows that lim infmpk+mL\liminf_{m\to\infty}p_{k+mL} is at least (1θ)pmin/(θ+(1θ)pmin)(1-\theta)p_{\min}/(\theta+(1-\theta)p_{\min}), the quantity appearing in the definition of θ\theta above. The remaining details lower bounding pk+mLp_{k+mL} for large mm and every k<Lk<L, and converting this into the claimed bound on Pr(ϕt𝚷SD0-eq){\rm Pr}(\bm{\phi}_{t}\in\bm{\Pi}^{0{\rm\text{-}eq}}_{SD}) for all sufficiently large tt, are omitted, as they resemble the proofs of [1] and [37].
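For intuition about how (8) and (9) are combined, the following small sketch (with arbitrary illustrative values of theta and p_min) iterates the resulting lower-bound recursion and compares it against the fixed point appearing in the choice of θ above.

```python
theta, p_min = 0.05, 0.3       # arbitrary illustrative values
p_star = (1 - theta) * p_min / (theta + (1 - theta) * p_min)

p = 0.0                        # worst-case starting value for Pr(B_k)
for m in range(1, 201):
    # Lower bound on Pr(B_{k+mL}) implied by (8), (9), and total probability.
    p = (1 - theta) * p + (1 - theta) * p_min * (1 - p)
    if m in (1, 10, 50, 200):
        print(f"m = {m:3d}: lower bound {p:.4f}   (fixed point {p_star:.4f})")
```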

Remark: In the interest of clarity, Theorem 4.1 was stated and proved for a special case of deterministic exploration phase lengths {Tki:i𝒩,k0}\{T^{i}_{k}:i\in\mathcal{N},k\geq 0\}. The analysis above can be immediately extended to handle random exploration phase lengths, provided these are primitive random variables (as opposed to being history dependent) and provided they satisfy the requisite assumptions appearing in Theorem 4.1 almost surely.

7 Simulations

In this section, we present empirical findings of a simulation study where Algorithm 1 was applied to the stochastic game whose stage game costs and transition probabilities are presented below.

We have chosen to simulate play of a game with two players, 𝒩={1,2}\mathcal{N}=\{1,2\}, state space 𝕏={s0,s1}\mathbb{X}=\{s_{0},s_{1}\}, and two actions per player, with action sets labelled 𝔸1=𝔸2={a0,a1}\mathbb{A}^{1}=\mathbb{A}^{2}=\{a_{0},a_{1}\}. Each player’s discount factor is set to 0.8, and the transition kernel is described as follows:

Pr(xt+1=s0|xt=s0,at1,at2)\displaystyle{\rm Pr}\left(x_{t+1}=s_{0}|x_{t}=s_{0},a^{1}_{t},a^{2}_{t}\right) 0.5,\displaystyle\equiv 0.5,
Pr(xt+1=s0|xt=s1,at1=at2)\displaystyle{\rm Pr}\left(x_{t+1}=s_{0}|x_{t}=s_{1},a^{1}_{t}=a^{2}_{t}\right) =0.25\displaystyle=0.25
Pr(xt+1=s0|xt=s1,at1at2)\displaystyle{\rm Pr}\left(x_{t+1}=s_{0}|x_{t}=s_{1},a^{1}_{t}\not=a^{2}_{t}\right) =0.9\displaystyle=0.9

In words, the state dynamics out of state s0s_{0} do not depend on the choice of joint action, while the dynamics out of state s1s_{1} do. In state s1s_{1}, if players select matching actions, then the state transitions to s0s_{0} with (low) probability 0.25. Otherwise, if players select distinct actions in state s1s_{1}, play transitions to state s0s_{0} with (high) probability 0.9.

The stage game’s symmetric cost structure is summarized by the tables below.

           a_0      a_1
  a_0     0, 0     2, 2
  a_1     2, 2     0, 0

(a) State s0s_{0}: low cost state

           a_0      a_1
  a_0    10, 10   11, 11
  a_1    11, 11   10, 10

(b) State s1s_{1}: high cost state

Figure 1: Stage costs: Player 1 (respectively, 2) picks a row (column), and its stage cost is the 1st (2nd) entry of the chosen cell.

In either state, the stage cost to a given player is minimized by selecting one’s action to match the action choice of the opponent. We note that the stage costs in state s0s_{0} are considerably lower than those of state s1s_{1}. Despite the apparent coordination flavour of the stage game payoffs, the transition probabilities dictate that an optimizing agent should not match its counterpart’s action in state s1s_{1}, and should instead attempt to drive play to the low cost state s0s_{0}.

The challenge that non-stationarity poses to multi-agent learning in this game is as follows: if players use a joint policy under which they mismatch actions in state s0s_{0} or match actions in state s1s_{1}, then either player has an incentive to switch its policy in the relevant state. If both players switch policies in an attempt to respond to the other, then cyclic behaviour ensues: each player responds to an outdated policy, as the other player will have switched its behaviour by the time the first player is ready to update its own policy.

With the cost structure, transition probabilities, and discount factors described above, the joint policies in which players’ actions match in state s0s_{0} and do not match in state s1s_{1} constitute 0-equilibrium policies.
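As a sketch of how the claimed 0-equilibria can be checked numerically (an illustration, not the paper's simulation code), the following sets up the two-state game above and, for each player, value-iterates the MDP induced by freezing the other player's deterministic policy, then verifies that the player's own policy is greedy with respect to the resulting optimal Q-factors; this takes the 0-equilibrium criterion to be optimality of each player's policy against the others' frozen baseline policies.

```python
import numpy as np

# States {0, 1} stand for s0, s1; actions {0, 1} stand for a0, a1; discount 0.8.
GAMMA = 0.8

def p_next_s0(x, a1, a2):
    """Probability that the next state is s0, per the kernel given above."""
    if x == 0:
        return 0.5
    return 0.25 if a1 == a2 else 0.9

def stage_cost(x, a1, a2):
    """Symmetric stage cost to each player (Figure 1): low in s0, high in s1."""
    if x == 0:
        return 0.0 if a1 == a2 else 2.0
    return 10.0 if a1 == a2 else 11.0

def is_best_response(pi_own, pi_other, own_is_player1):
    """Value-iterate the MDP induced by the opponent's frozen policy and check
    whether pi_own is greedy with respect to the resulting optimal Q-factors."""
    Q = np.zeros((2, 2))
    for _ in range(2000):
        Q_new = np.zeros_like(Q)
        for x in range(2):
            for a in range(2):
                a1, a2 = (a, pi_other[x]) if own_is_player1 else (pi_other[x], a)
                p0 = p_next_s0(x, a1, a2)
                continuation = p0 * Q[0].min() + (1 - p0) * Q[1].min()
                Q_new[x, a] = stage_cost(x, a1, a2) + GAMMA * continuation
        Q = Q_new
    return all(Q[x, pi_own[x]] <= Q[x].min() + 1e-8 for x in range(2))

# Candidate equilibrium: actions match in s0 and mismatch in s1.
pi1, pi2 = (0, 0), (0, 1)
print(is_best_response(pi1, pi2, own_is_player1=True),
      is_best_response(pi2, pi1, own_is_player1=False))   # both checks should pass
```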

Parameter Choices

We conducted 500 independent trials of Algorithm 1. Each trial consisted of 10610^{6} stage games. Each player’s exploration phase lengths, TkiT^{i}_{k}, were selected uniformly at random from the interval [T,RT][T,R\cdot T], with TT set to 50005000 and the ratio parameter RR set to R=3R=3.

For our gameplay parameters, we set the stage game action experimentation parameters to be ρ1=ρ2=0.05\rho^{1}=\rho^{2}=0.05. The parameter controlling inertia in the policy update was chosen as λ1=λ2=0.2\lambda^{1}=\lambda^{2}=0.2. The parameter controlling tolerance for sub-optimality in Q-factor assessment was chosen as δ1=δ2=0.5\delta^{1}=\delta^{2}=0.5. The learning rate parameters were selected to be α1=α2=0.08\alpha^{1}=\alpha^{2}=0.08.

Results

The results of our simulations are summarized below in Figure 2 and Table 1. With randomly selected initial policies, we observed that Algorithm 1 rapidly drove play to equilibrium, with the relative frequency of equilibrium stabilizing at over 98% of trials after approximately 40,000 stage games.

Figure 2: Frequency of ϕt𝚷SD0-eq\bm{\phi}_{t}\in\bm{\Pi}^{0{\rm\text{-}eq}}_{SD}, averaged over 500 trials.

Stage game index tt                      0        10,000    20,000    30,000    40,000
Empirical frequency of equilibrium     0.264     0.620     0.728     0.974     0.984

Table 1: Empirical frequency of equilibrium, 1500k=15001{ϕt𝚷SD0-eq}\frac{1}{500}\sum_{k=1}^{500}\textbf{1}\{\bm{\phi}_{t}\in\bm{\Pi}^{0{\rm\text{-}eq}}_{SD}\}, at selected stage game indices.

8 Conclusions

In this paper, we considered an unsynchronized variant of the decentralized Q-learning algorithm of [1]. We formalized an argument presented in [24], which contended that inertia alone would stabilize the learning process and tame non-stationarity. In the process of formalizing the argument of [24], some modifications were needed to handle technical nuances relating to the dynamics of the policy process and the convergence of learning iterates. To accommodate unsynchronized policy updating and non-stationarity in each agent’s learning environment, we have introduced a constant learning rate that can rapidly overcome errors in learning estimates that are artifacts of outdated information. In so doing, we have shown that decentralized Q-learning can still drive policies to equilibrium in weakly acyclic stochastic games without making strong coordination assumptions such as synchronizing the schedules on which players update their policies.

References

  • [1] Gürdal Arslan and Serdar Yüksel. Decentralized Q-learning for stochastic teams and games. IEEE Transactions on Automatic Control, 62(4):1545–1558, 2017.
  • [2] Carolyn L Beck and Rayadurgam Srikant. Error bounds for constant step-size Q-learning. Systems & Control Letters, 61(12):1203–1208, 2012.
  • [3] Vivek S Borkar. Reinforcement learning in Markovian evolutionary games. Advances in Complex Systems, 5(01):55–72, 2002.
  • [4] Michael Bowling and Manuela Veloso. Multiagent learning using a variable learning rate. Artificial Intelligence, 136(2):215–250, 2002.
  • [5] Caroline Claus and Craig Boutilier. The dynamics of reinforcement learning in cooperative multiagent systems. In Proceedings of the Tenth Innovative Applications of Artificial Intelligence Conference, Madison, Wisconsin, pages 746–752, 1998.
  • [6] Constantinos Daskalakis, Dylan J Foster, and Noah Golowich. Independent policy gradient methods for competitive reinforcement learning. Advances in Neural Information Processing Systems, 33:5527–5540, 2020.
  • [7] Arlington M. Fink. Equilibrium in a stochastic nn-person game. Journal of Science of the Hiroshima University, Series AI (Mathematics), 28(1):89–93, 1964.
  • [8] D. P. Foster and H. P. Young. On the nonconvergence of fictitious play in coordination games. Games Econ. Behav., 25:79–96, 1998.
  • [9] Dean Foster and H. Peyton Young. Regret testing: Learning to play Nash equilibrium without knowing you have an opponent. Theoretical Economics, 1:341–367, 2006.
  • [10] Fabrizio Germano and Gabor Lugosi. Global Nash convergence of Foster and Young’s regret testing. Games and Economic Behavior, 60(1):135–154, 2007.
  • [11] Pablo Hernandez-Leal, Michael Kaisers, Tim Baarslag, and Enrique Munoz de Cote. A survey of learning in multiagent environments: Dealing with non-stationarity. arXiv preprint arXiv:1707.09183, 2017.
  • [12] Pablo Hernandez-Leal, Bilal Kartal, and Matthew E Taylor. A survey and critique of multiagent deep reinforcement learning. Autonomous Agents and Multi-Agent Systems, 33(6):750–797, 2019.
  • [13] Junling Hu and Michael P. Wellman. Nash Q-learning for general-sum stochastic games. Journal of Machine Learning Research, 4(Nov):1039–1069, 2003.
  • [14] Jens Josephson and Alexander Matros. Stochastic imitation in finite games. Games and Economic Behavior, 49(2):244–259, 2004.
  • [15] Vijaymohan R Konda and Vivek S Borkar. Actor-critic–type learning algorithms for Markov decision processes. SIAM Journal on Control and Optimization, 38(1):94–123, 1999.
  • [16] Stefanos Leonardos, Will Overman, Ioannis Panageas, and Georgios Piliouras. Global convergence of multi-agent policy gradient in Markov potential games. In International Conference on Learning Representations, 2022.
  • [17] D. Leslie and E. Collins. Individual Q-learning in normal form games. SIAM Journal on Control and Optimization, 44:495–514, 2005.
  • [18] David S Leslie and Edmund J Collins. Convergent multiple-timescales reinforcement learning algorithms in normal form games. The Annals of Applied Probability, 13(4):1231–1251, 2003.
  • [19] Yehuda Levy. Discounted stochastic games with no stationary Nash equilibrium: two examples. Econometrica, 81(5):1973–2007, 2013.
  • [20] Michael L. Littman. Markov games as a framework for multi-agent reinforcement learning. In Machine Learning Proceedings 1994, pages 157–163. Elsevier, 1994.
  • [21] Michael L. Littman. Friend-or-foe Q-learning in general-sum games. In ICML, volume 1, pages 322–328, 2001.
  • [22] Michael L. Littman and Csaba Szepesvári. A generalized reinforcement-learning model: Convergence and applications. In ICML, volume 96, pages 310–318. Citeseer, 1996.
  • [23] Chinmay Maheshwari, Manxi Wu, Druv Pai, and Shankar Sastry. Independent and decentralized learning in Markov potential games. arXiv preprint arXiv:2205.14590, 2022.
  • [24] Jason R. Marden, H. Peyton Young, Gürdal Arslan, and Jeff S. Shamma. Payoff-based dynamics for multiplayer weakly acyclic games. SIAM Journal on Control and Optimization, 48(1):373–396, 2009.
  • [25] Laetitia Matignon, Guillaume J. Laurent, and Nadine Le Fort-Piat. Independent reinforcement learners in cooperative Markov games: a survey regarding coordination problems. Knowledge Engineering Review, 27(1):1–31, 2012.
  • [26] Eric Mazumdar, Lillian J Ratliff, Michael I Jordan, and S Shankar Sastry. Policy-gradient algorithms have no guarantees of convergence in linear quadratic games. arXiv preprint arXiv:1907.03712, 2019.
  • [27] Panayotis Mertikopoulos, Christos Papadimitriou, and Georgios Piliouras. Cycles in adversarial regularized learning. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 2703–2717. SIAM, 2018.
  • [28] David H Mguni, Yutong Wu, Yali Du, Yaodong Yang, Ziyi Wang, Minne Li, Ying Wen, Joel Jennings, and Jun Wang. Learning in nonzero-sum stochastic games with potentials. In International Conference on Machine Learning, pages 7688–7699. PMLR, 2021.
  • [29] Hadi Nekoei, Akilesh Badrinaaraayanan, Amit Sinha, Mohammad Amini, Janarthanan Rajendran, Aditya Mahajan, and Sarath Chandar. Dealing with non-stationarity in decentralized cooperative multi-agent deep reinforcement learning via multi-timescale learning. In Conference on Lifelong Learning Agents, pages 376–398. PMLR, 2023.
  • [30] Asuman Ozdaglar, Muhammed O Sayin, and Kaiqing Zhang. Independent learning in stochastic games. In International Congress of Mathematicians, 2021.
  • [31] Muhammed Sayin, Kaiqing Zhang, David Leslie, Tamer Başar, and Asuman Ozdaglar. Decentralized Q-learning in zero-sum Markov games. Advances in Neural Information Processing Systems, 34:18320–18334, 2021.
  • [32] Muhammed O Sayin and Onur Unlu. Logit-Q learning in Markov games. arXiv preprint arXiv:2205.13266, 2022.
  • [33] Satinder Singh, Tommi Jaakkola, Michael L Littman, and Csaba Szepesvári. Convergence results for single-step on-policy reinforcement-learning algorithms. Machine learning, 38(3):287–308, 2000.
  • [34] Ming Tan. Multi-agent reinforcement learning: Independent vs. cooperative agents. In Proceedings of the Tenth International Conference on Machine Learning, pages 330–337, 1993.
  • [35] J. N. Tsitsiklis. Asynchronous stochastic approximation and Q-learning. Machine Learning, 16(3):185–202, 1994.
  • [36] Bora Yongacoglu, Gürdal Arslan, and Serdar Yüksel. Decentralized learning for optimality in stochastic dynamic teams and games with local control and global state information. IEEE Transactions on Automatic Control, 67(10):5230–5245, 2022.
  • [37] Bora Yongacoglu, Gürdal Arslan, and Serdar Yüksel. Satisficing paths and independent multi-agent reinforcement learning in stochastic games. SIAM Journal on Mathematics of Data Science, to appear.
  • [38] H. Peyton Young. Strategic Learning and its Limits. Oxford University Press, 2004.
  • [39] Runyu Zhang, Zhaolin Ren, and Na Li. Gradient play in multi-agent Markov stochastic games: Stationary points and convergence. arXiv preprint arXiv:2106.00198, 2021.

Appendix A Glossary of Notation

Notation Description Reference(s)
Q𝝅iiQ^{*i}_{\bm{\pi}^{-i}} Q-function when facing 𝝅i𝚷Si\bm{\pi}^{-i}\in\bm{\Pi}^{-i}_{S} (2)
Q^ti\widehat{Q}^{i}_{t} Player ii’s Q-factor estimates at time tt (3), Algorithm 1
Q^t\widehat{\textbf{Q}}_{t} (Q^t1,,Q^tN)\left(\widehat{Q}^{1}_{t},\dots,\widehat{Q}^{N}_{t}\right) p. 6.1
EP Exploration Phase
TkiT^{i}_{k} Length of player ii’s kthk^{th} EP Algorithm 1
tkit^{i}_{k} Start time of player ii’s kthk^{th} EP Algorithm 1
ρi(0,1)\rho^{i}\in(0,1) Player ii’s action experimentation parameter Algorithm 1
λi(0,1)\lambda^{i}\in(0,1) Player ii’s policy inertia parameter Algorithm 1
δi>0\delta^{i}>0 Player ii’s “tolerance for suboptimality" Algorithm 1
αi(0,1)\alpha^{i}\in(0,1) Player ii’s learning rate/step size Algorithm 1
𝜶\bm{\alpha} Tuple of learning rates: (αi)i𝒩(\alpha^{i})_{i\in\mathcal{N}} p. 4
πki\pi^{i}_{k} Player ii’s baseline policy during its kthk^{th} EP Algorithm 1
ϕti\phi^{i}_{t} Player ii’s baseline policy for stage game at time tt p. 4
Table 2: Algorithm-specific Notation
Notation Description Reference(s)
ρ¯\bar{\rho} An upper bound on each player’s experimentation parameters ρi\rho^{i} 3
δ¯\bar{\delta} An upper bound on each player’s δi\delta^{i} 3
π^ki\hat{\pi}^{i}_{k} Player ii’s behaviour policy during kthk^{th} EP: π^ki\hat{\pi}^{i}_{k} selects according to πki\pi^{i}_{k} w.p. (1ρi)(1-\rho^{i}) and selects from Unif(𝔸i)\text{Unif}(\mathbb{A}^{i}) w.p. ρi\rho^{i}. p. 4
TT Lower bound on EP lengths: TTkiT\leq T^{i}_{k} 4
RR Factor bounding EP lengths: TkiRTT^{i}_{k}\leq RT 4
[τkmin,τkmax][\tau^{\min}_{k},\tau^{\max}_{k}] Active phase of policy updating Definition 6.5
BkB_{k} Equilibrium event defined using kthk^{th} active phase [τkmin,τkmax][\tau^{\min}_{k},\tau^{\max}_{k}] p. 6.8
Table 3: Analysis-specific Notation (Part 1)
Notation Description Ref(s)
{Wt}t0\{W_{t}\}_{t\geq 0} An i.i.d. noise process taking values in [0,1][0,1]. p. 4.1.3
f:𝕏×𝐀×[0,1]𝕏f{\!}:{\!}\mathbb{X}{\!}\times{\!}\mathbf{A}{\!}\times[0,1]{\!}\to{\!}\mathbb{X} A function for expressing state transitions as a deterministic function driven by exogenous noise: xt+1=f(xt,𝐚t,Wt)x_{t+1}=f(x_{t},\mathbf{a}_{t},W_{t}) p. 4.1.3
{u~ti}t0\{\tilde{u}^{i}_{t}\}_{t\geq 0} An i.i.d. sequence of actions for player ii, with u~tiUnif(𝔸i)\tilde{u}^{i}_{t}\sim\text{Unif}(\mathbb{A}^{i}) p. 4.1.3
ρ~ti\tilde{\rho}^{i}_{t} An i.i.d. sequence with ρ~tiUnif([0,1])\tilde{\rho}^{i}_{t}\sim\text{Unif}([0,1]) p. 4.1.3
λ~ti\tilde{\lambda}^{i}_{t} An i.i.d. sequence with λ~tiUnif([0,1])\tilde{\lambda}^{i}_{t}\sim\text{Unif}([0,1]) p. 4.1.3
{π~ti(Bi)}\{\tilde{\pi}^{i}_{t}(B^{i})\} An i.i.d. sequence of policies for player ii, with π~ti(Bi)Unif(Bi)\tilde{\pi}^{i}_{t}(B^{i})\sim\text{Unif}(B^{i}), where BiΠSDiB^{i}\subseteq\Pi^{i}_{SD}, p. 4.1.3
ωti\omega^{i}_{t} Random variables for player ii at time tt: ωti=(ρ~ti,u~ti,λ~ti,π~ti(Bi):BiΠSDi)\omega^{i}_{t}=(\tilde{\rho}^{i}_{t},\tilde{u}^{i}_{t},\tilde{\lambda}^{i}_{t},\tilde{\pi}^{i}_{t}(B^{i}):B^{i}\subseteq\Pi^{i}_{SD}) p. 6.1
𝔖i\mathfrak{S}^{i} Codomain of ωti\omega^{i}_{t}: ωti𝔖i\omega^{i}_{t}\in\mathfrak{S}^{i} p. 6.1
𝝎t\bm{\omega}_{t} Collected state noise and player noise for time tt: 𝝎t=(Wt,ωt1,,ωtN)\bm{\omega}_{t}=(W_{t},\omega^{1}_{t},\dots,\omega^{N}_{t}) p. 6.1
ϖt+\bm{\varpi}_{t+} Present and future of the noise processes: ϖt+=(𝝎t,𝝎t+1,)\bm{\varpi}_{t+}=(\bm{\omega}_{t},\bm{\omega}_{t+1},\dots) p. 6.1
𝐡t\mathbf{h}_{t} History of states, policies, and Q-iterates to time tt: (x0,ϕ0,Q^0,,xt,ϕt,Q^t)(x_{0},\bm{\phi}_{0},\widehat{\textbf{Q}}_{0},\dots,x_{t},\bm{\phi}_{t},\widehat{\textbf{Q}}_{t}) p. 6.1
𝐇t\mathbf{H}_{t} Codomain of 𝐡t\mathbf{h}_{t}: 𝐡t𝐇t\mathbf{h}_{t}\in\mathbf{H}_{t} p. 6.1
𝐇t,eq𝐇t\mathbf{H}_{t,{\rm eq}}\subset\mathbf{H}_{t} {𝐡t𝐇t:ϕt𝚷SD0-eq}\{\mathbf{h}_{t}\in\mathbf{H}_{t}:\bm{\phi}_{t}\in\bm{\Pi}^{0{\rm\text{-}eq}}_{SD}\} p. 6.1
𝒬ti\mathcal{Q}^{i}_{t} Mapping that reports Q-iterates at time tt using historical data 𝐡s\mathbf{h}_{s} up to time ss and future random variables from time ss onward in ϖs\bm{\varpi}_{s}: 𝒬ti(𝐡s,ϖs)=Q^ti\mathcal{Q}^{i}_{t}(\mathbf{h}_{s},\bm{\varpi}_{s})=\widehat{Q}^{i}_{t} for all 0st0\leq s\leq t. (6)
Φt\Phi_{t} Mapping that reports baseline policies at time tt using historical data to time ss and future random variables from ss onward : Φt(𝐡s,ϖs)=ϕt\Phi_{t}(\mathbf{h}_{s},\bm{\varpi}_{s})=\bm{\phi}_{t} for all 0st0\leq s\leq t. (6)
u¯t+mi(𝐡t,ϖt)\bar{u}^{i}_{t+m}(\mathbf{h}_{t},\bm{\varpi}_{t}) Action chosen by player ii at time t+mt+m in hypothetical scenario where (i) baseline policies were frozen at ϕt\bm{\phi}_{t} at time tt, and (ii) action selection and state transition noise was described by ϖt\bm{\varpi}_{t} p. 6.2
x¯t+m(𝐡t,ϖt)\bar{x}_{t+m}(\mathbf{h}_{t},\bm{\varpi}_{t}) State at time t+mt+m in hypothetical scenario where (i) and (ii) above hold. p. 6.2
Q¯t+si(𝐡t,ϖt)\bar{Q}^{i}_{t+s}(\mathbf{h}_{t},\bm{\varpi}_{t}) Hypothetical Q-factors that would have been obtained at time t+st+s in hypothetical scenario where (i) and (ii) above hold. p. 6.2
Table 4: Analysis-specific Notation (Part 2)