Convergence and Stability of Coupled Belief–Strategy Learning Dynamics in Continuous Games
Abstract
We propose a learning dynamics to model how strategic agents repeatedly play a continuous game while relying on an information platform to learn an unknown payoff-relevant parameter. In each time step, the platform updates a belief estimate of the parameter based on players’ strategies and realized payoffs using Bayes’ rule. Then, players adopt a generic learning rule to adjust their strategies based on the updated belief. We present results on the convergence of beliefs and strategies and the properties of convergent fixed points of the dynamics. We obtain necessary and sufficient conditions for the existence of globally stable fixed points. We also provide sufficient conditions for the local stability of fixed points. These results provide an approach to analyzing the long-term outcomes that arise from the interplay between Bayesian belief learning and strategy learning in games, and enable us to characterize conditions under which learning leads to a complete information equilibrium.
1 Introduction
Strategic agents often need to engage in repeated interactions with each other while learning an unknown environment that impacts their payoffs. Such a situation arises in online market platforms, where buyers and sellers repeatedly make their transaction decisions while learning the latent market condition that governs the price distribution. The price distribution is updated based on the previous transactions and buyer reviews on platforms such as Amazon, eBay, and Airbnb (Moe and Fader (2004); Acemoglu et al. (2017)). Another situation concerns transportation networks, where travelers make route choice decisions on a day-to-day basis while also learning the underlying state of the network that affects the travel time distribution. This travel time distribution is repeatedly updated based on the delay and flow information provided by navigation apps such as Google Maps or Apple Maps (Zhu et al. (2010); Wu et al. (2021); Meigs et al. (2017)). In both situations, players’ strategic decisions (purchases and sales on online platforms or route choices in transportation networks) influence the learning of the unknown environment (latent market condition or network state), which then impacts the players’ future decisions. Thus, the long-run outcome of strategic interactions among players is governed by the joint evolution of stage-wise decisions made by the players and learning of the unknown environment.
In this article, we introduce and study a learning model to describe this joint evolution in a game-theoretic setting. In our model, a finite number of strategic agents (players) repeatedly play a game with continuous strategy sets. The payoff of each player in the game is a function of the strategy profile with an unknown parameter (or parameter vector). A public information platform (e.g. a market platform or navigation app) serves as an information aggregator that intermittently updates and broadcasts a Bayesian belief estimate of the payoff parameter based on stage-wise game outcomes (i.e. strategies and randomly realized payoffs) to all players. Players then account for the most current belief when updating their strategies.
The players’ strategy update in our learning dynamics is generic so that it can be used to model a broad class of strategy learning rules that players may adopt based on the updated belief and the strategy played in the previous stage. In particular, the strategy update can incorporate two classes of learning rules that naturally arise from utility-maximizing players. One is the belief-based best response dynamics, where, in each stage, players update their strategies using the best response that maximizes their expected utility given the updated belief and their opponents’ strategies. Another class is the belief-based no-regret learning dynamics, where each player updates their strategy using an online mirror ascent algorithm that maximizes the expected utility, and the gradient used in the algorithm is estimated using the updated belief. When the true parameter is known to players, our learning dynamics reduces to the classical strategy learning in games with complete information. For this special case, the belief-based best response dynamics reduces to the best response dynamics with complete information of utility functions, and the belief-based no-regret learning reduces to the no-regret learning dynamics with complete information of the gradient of utility functions.
Past literature has extensively studied various strategy learning rules in games. Learning rules based on best response strategies have been studied in games with both finite and continuous strategy sets when players have complete information of their utility functions. This includes simultaneous and sequential best response dynamics (Milgrom and Roberts (1990); Monderer and Shapley (1996b); Hofbauer and Sorin (2006)), fictitious play (Fudenberg and Kreps (1993); Monderer and Shapley (1996a)), and stochastic fictitious play (Benaim and Hirsch (1999); Hofbauer and Sandholm (2002)). For games with finite strategy sets, various learning rules have been proposed to model how players, who do not have complete information of the payoff matrices, learn an equilibrium based on the realized payoffs in every stage. Examples include log-linear learning (Blume et al. (1993), Marden and Shamma (2012), Alós-Ferrer and Netzer (2010)), regret-based learning (Hart and Mas-Colell (2003), Foster and Young (2006), Marden et al. (2007), Daskalakis et al. (2011); Syrgkanis et al. (2015)), payoff-based learning (Cominetti et al. (2010), Marden et al. (2009)), and replicator dynamics (Sandholm (2010b), Beggs (2005), Hopkins (2002)). On the other hand, in games with continuous strategy sets, learning dynamics typically assume the knowledge of players’ utility functions (e.g. the dynamics based on best responses), or consider that a stochastic estimate of the gradient of utility functions is available (e.g. no-regret learning and gradient-based learning in continuous games; Rosen (1965); Mertikopoulos and Zhou (2019); Bravo et al. (2018)).
We extend the past literature by focusing on coupled belief-strategy dynamics, which naturally describes the learning behavior of players who adjust their strategies while relying on an information platform to learn the unknown parametrized utility functions. The setting of parametrized utility functions is especially relevant to continuous games since players need to learn their utility function on a continuous strategy set. To the best of our knowledge, past literature has not analyzed the long-run outcomes of coupled belief-strategy updates in continuous games. Our focus is to develop an approach to analyze the convergence and stability (both local and global) properties of this learning dynamics, and derive conditions under which a complete information Nash equilibrium naturally arises from the learning dynamics. Our conditions for learning a complete information equilibrium can also be used for studying equilibrium computation algorithms in continuous games; however, such algorithmic aspects of equilibrium computation are outside the scope of our work.
Next, we summarize the main contributions of our paper.
Convergence: We show that the joint evolution of beliefs and strategies converges to a fixed point of our dynamics with probability 1 provided that the learning rule in strategy updates converges given a static belief (i.e. when the belief of parameter is held fixed instead of being repeatedly updated). Furthermore, at any fixed point, the fixed point belief forms a consistent estimate of the payoff distribution given the strategy, and the strategy is an equilibrium of the game corresponding to the convergent belief (Theorem 1).
Naturally, a complete information belief and the associated complete information equilibrium form a fixed point of our learning dynamics. However, the fixed point set may also include other fixed points that do not identify the true parameter. At these fixed points, the belief assigns a non-zero probability to other parameters besides the true parameter, which may lead to an incorrect payoff estimate for strategies that differ from the fixed point. Consequently, the associated fixed point strategy that the learning dynamics converge to may not be a complete information equilibrium.
Our proof of Theorem 1 builds on the martingale property of Bayesian beliefs as well as the properties of strategy learning in games. Firstly, we obtain the convergence of Bayesian beliefs by showing that the beliefs form a martingale. Secondly, to prove the strategy convergence, we construct a series of auxiliary strategy sequences, each of which is identical to the original sequence up to a certain stage, but differs in the “tail” subsequence – the strategies beyond that stage are updated using the fixed point belief. We show that the distance between a so-constructed auxiliary sequence and the original sequence goes to zero as the stage beyond which the two sequences differ tends to infinity. Thus, the convergence property of strategies in our coupled belief-strategy learning dynamics is derived from the convergence of the strategy updates with static belief being the fixed point belief. Finally, using the strategy convergence result, we prove that the players’ payoff distributions asymptotically approach the independent and identically distributed process corresponding to the fixed point strategy. This also allows us to show that the belief concentrates exponentially fast on the subset of parameters with the property that, at the fixed point, each parameter in this set induces the same payoff distribution as the true parameter.
Global and local stability: We show that the joint evolution of beliefs and strategies converges to a fixed point of our dynamics with probability 1 provided that the learning rule in strategy updates converges given a static belief. A fixed point is globally stable if the learning dynamics starting from any initial state converges to that fixed point with probability 1. We find that globally stable fixed points exist if and only if all fixed points have complete information of the unknown parameter. Equivalently, the necessary and sufficient condition for all fixed points being complete information fixed points is that the true parameter is identifiable for any equilibrium strategy profile, i.e. no other parameter induces the same payoff distribution in equilibrium (Proposition 1). In this case, learning is guaranteed to identify the true parameter, and players eventually play a complete information equilibrium.
When the necessary and sufficient condition for global stability is not satisfied, there exists at least one fixed point that does not have complete information of the true parameter. Then, whether learning converges to a complete information fixed point or not depends on the initial state. In this case, we study the local stability property that evaluates how robust a convergent fixed point is under local perturbations of beliefs and strategies. A fixed point is locally stable if states remain close to the fixed point with high probability when learning starts with an initial state close to that fixed point. A locally stable fixed point is robust to local perturbations, and thus is more likely to arise as the long-run outcome of the learning dynamics.
We prove that a fixed point is locally stable if it satisfies two conditions (Theorem 2): (a) The fixed point belief is locally consistent in that it consistently estimates the payoff distribution in a local neighborhood of the fixed point strategy (instead of just at the fixed point); (b) The fixed point strategy has a local neighborhood that is an invariant set for the strategy update. In proving this result, we again exploit the martingale property of beliefs and the properties of strategy updates. We use the martingale upcrossing inequality to show that under condition (a) of local consistency, we can construct a neighborhood of initial belief, which ensures that the repeatedly updated beliefs remain in a small neighborhood of the fixed point belief with high probability. Condition (b) further guarantees that strategy update is robust to local perturbations of strategies and beliefs. Importantly, we find that any complete information fixed point is locally stable, and thus is robust to local perturbations of belief and strategy (Proposition 2).
Our global and local stability analysis contributes to the study of stability of Nash equilibrium under various learning dynamics in games. In particular, previous literature has focused on the stability of evolutionary dynamics in population games (Smith and Price (1973); Taylor and Jonker (1978); Samuelson and Zhang (1992); Matsui (1992); Hofbauer and Sandholm (2009); Sandholm (2010a)), and more recently on adaptive learning dynamics in traffic routing (Dumett and Cominetti (2018)). Our results extend the equilibrium stability analysis in a complete information environment to the stability analysis on both the belief estimates and players’ strategies governed by the coupled belief-strategy dynamics.
Complete learning. Learning is complete if the convergent fixed point strategy is a complete information equilibrium. Complete learning is not always guaranteed because fixed point beliefs may form incorrect estimates of the unobservable game outcomes, i.e. the payoffs of strategies that are not taken. These incorrect estimates may persist since the information of game outcomes used in belief updates is endogenously acquired based on strategies chosen by players.
The phenomenon that endogenous information acquisition leads to incomplete learning is central to a variety of settings that include multi-armed bandit problems (Auer et al. (2002); Cesa-Bianchi and Lugosi (2006); Lattimore and Szepesvári (2020)), endogenous social learning (Banerjee (1992); Gale and Kariv (2003); Golub and Jackson (2010); Acemoglu et al. (2014); Mossel et al. (2015)), and self-confirming equilibrium learning in extensive form games (Fudenberg and Levine (1993); Fudenberg and Kreps (1995)). In settings where the action space of each agent is finite, complete learning can be achieved when the decision maker randomly takes each action with small probability (Auer et al. (2002); Cesa-Bianchi and Lugosi (2006); Hart and Mas-Colell (2000)). In our setting, players have continuous strategy sets. Thus, it is unclear under what conditions exploration is needed, and how random exploration can be used to achieve complete learning.
We focus on identifying conditions under which learning is complete without the need for exploration. We show that learning is complete if (i) the convergent fixed point assigns probability 1 to a single parameter, or (ii) the convergent fixed point belief is locally consistent, and the payoff function of each player is concave in their strategy for any parameter (Proposition 3). In case (i), we know that the single parameter with probability 1 is the true parameter, and the fixed point strategy is a complete information equilibrium. In case (ii), the fixed point belief does not identify the true parameter, but the local consistency condition ensures that the belief forms a consistent estimate of payoffs in a local neighborhood of the fixed point strategy. Thus, each player’s strategy is a local best response to their opponents’ strategies. Additionally, the payoff concavity condition ensures that each player’s local best response is also the true best response strategy. Therefore, in case (ii), the fixed point strategy is a complete information Nash equilibrium even though the fixed point belief does not have complete information.
If the convergent fixed point does not satisfy (i) or (ii), then learning may not be complete, and exploration of strategies other than the fixed point strategy is needed to guarantee complete learning. If payoff functions in the game are concave (which is typically assumed in most applications), then local exploration – randomly taking strategies in a local neighborhood of the fixed point – is sufficient to identify the complete information equilibrium. This is because local exploration can exclude any parameter that does not form a locally consistent payoff estimate, and eventually leads to a locally consistent belief so that condition (ii) for complete learning is satisfied.
The rest of the paper is organized as follows: Section 2 describes the learning model and Section 3 details our main results – convergence and stability analysis, and conditions for complete learning. We present the proofs of our main convergence and stability theorems in Section 4. In Section 5, we discuss the extensions of our results to learning with two timescales, learning in games with finite strategies, and learning with maximum a posteriori or least square estimates. We include the proofs of propositions and technical lemmas in Appendix A.
2 Model of Learning Dynamics in Continuous Games
Our learning dynamics is induced by strategic players in a finite set who repeatedly play a game for an infinite number of stages. The players’ payoffs in game depend on an unknown parameter vector belonging to a finite set . Players have access to an information system (or an aggregator) that repeatedly updates and broadcasts a belief estimate to all players, where denotes the estimated probability of parameter .
In game , the strategy of each player is a finite-dimensional vector in a convex, closed and bounded strategy set . The players’ strategy profile is denoted . The payoff of each player is realized randomly according to a probability distribution. The distribution of players’ payoffs for any strategy profile and any parameter is denoted by the probability density function . We assume that is continuous in for all . Without loss of generality, we write the player ’s payoff for any as the sum of an average payoff function that is continuous in and a zero-mean noise term that can be correlated across players:
(1)
We model our learning dynamics as a discrete-time stochastic process, with state comprising the belief estimate of the unknown parameter and the players’ strategies: In each stage , the information platform broadcasts the current belief estimate ; the players act according to a strategy profile ; and the payoffs are realized according to when the parameter is . The state of the learning dynamics in stage is .
The unknown true parameter is . We suppose that the initial belief does not exclude any possible parameter, i.e. for all , and the initial strategy is feasible. The evolution of states is jointly governed by belief and strategy updates as introduced next.
Belief update. The information platform updates the belief intermittently and infinitely often. The stages at which the belief is updated can be deterministic or random, denoted by the subsequence . In stage , the most recent belief estimate is updated using players’ strategy profiles and payoffs between the stages and according to Bayes’ rule:1

Footnote 1: In many instances of the problem setup, it is sufficient to update the belief only based on aggregate strategies and payoffs , provided that the tuple is a sufficient statistic of in the following sense: one can write , where is the -independent conditional distribution of strategies and payoffs given the aggregate statistics, and is the conditional probability of given for parameter . Then, we can re-write (-update) as a Bayesian update that only relies on . For example, in routing games, the aggregate traffic flow and travel time of each edge in the network are sufficient statistics of the individual travelers’ route choices and travel times. Another example is Cournot games, where the total production level and product price in a market are sufficient statistics of each producer’s production and revenue (see Section 3.4).
(-update)
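To make the belief update concrete, the following sketch implements one application of Bayes’ rule over a finite parameter set. The function names (bayes_update, mean_payoff) and the Gaussian payoff-noise model are illustrative assumptions rather than part of the model specification, which only requires a known payoff density for each parameter.

```python
import numpy as np

def bayes_update(belief, thetas, strategies, payoffs, mean_payoff, noise_std=1.0):
    """One Bayesian belief update over a finite parameter set.

    belief      : array of shape (K,), current probabilities over the K parameters
    thetas      : list of K candidate parameters
    strategies  : strategy profiles played since the last belief update
    payoffs     : realized payoff vectors (same length as strategies)
    mean_payoff : mean_payoff(s, theta) -> expected payoff vector at profile s under theta
    noise_std   : standard deviation of the (assumed) Gaussian payoff noise
    """
    log_post = np.log(np.asarray(belief, dtype=float))
    for s, u in zip(strategies, payoffs):
        for k, theta in enumerate(thetas):
            mean = np.asarray(mean_payoff(s, theta))
            # Gaussian log-likelihood of the realized payoffs under parameter theta
            log_post[k] += -0.5 * np.sum((np.asarray(u) - mean) ** 2) / noise_std ** 2
    post = np.exp(log_post - log_post.max())  # normalize in a numerically stable way
    return post / post.sum()
```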
Strategy update. Players update their strategies in each stage based on the updated belief and the current strategies played by their opponents. Given any and any , we denote the strategy update for each as a set-valued function :
(-update)
In our model, the belief and strategy updates (-update) – (-update) capture the dynamic interplay between Bayesian parameter learning and strategy learning in games. Specifically, since the distribution of depends on the strategy profile in each stage, the sequence of payoffs is not independent and identically distributed when the strategies are repeatedly updated. On the other hand, the strategy updates (-update) depend on the repeatedly updated beliefs .
We note that the belief updates can occur less frequently than the strategy updates since the subsequence of belief update stages satisfies . In Sec. 3 – 4, we present our main results by considering that is finite with probability (w.p.) 1; i.e., both belief and strategy updates follow the same timescale. In Sec. 5, we show that all results analogously hold when w.p. 1; that is, when the belief is updated at a slower timescale compared to the strategy updates.
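The overall structure of the coupled dynamics, including the possibility of updating the belief less frequently than the strategies, can be summarized by a simple driver loop. This is a schematic sketch with hypothetical function names (strategy_update, bayes_update, sample_payoffs), not the paper’s pseudocode; the batching of outcomes between belief updates mirrors the description above.

```python
import numpy as np

def run_learning(mu0, s0, strategy_update, bayes_update, sample_payoffs,
                 n_stages=1000, belief_period=1):
    """Coupled belief-strategy dynamics (schematic driver).

    strategy_update(s, mu)  -> next strategy profile, as in (-update)
    bayes_update(mu, batch) -> next belief, as in (-update), from a batch of (s, u) pairs
    sample_payoffs(s)       -> realized payoff vector under the true parameter
    belief_period           -> the belief is refreshed every `belief_period` stages,
                               which allows a slower timescale than the strategy updates
    """
    mu, s, batch = np.asarray(mu0, dtype=float), np.asarray(s0, dtype=float), []
    for t in range(n_stages):
        u = sample_payoffs(s)              # stage payoffs realized under the true parameter
        batch.append((s.copy(), u))
        if (t + 1) % belief_period == 0:
            mu = bayes_update(mu, batch)   # Bayesian update on the accumulated outcomes
            batch = []
        s = np.asarray(strategy_update(s, mu))  # players adjust strategies given the latest belief
    return mu, s
```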
For any belief , we denote as the game with a static information environment, where each player ’s utility function is the expected utility . Note that if the belief were held constant as for all , then (-update) simply becomes the corresponding strategy update in game . Since our goal is to analyze the joint evolution of both the beliefs and the strategies (instead of the properties of a particular strategy update in static information environment), we make the following two assumptions throughout the paper:
Assumption 1 (Upper-hemicontinuous strategy updates).
For any , is upper hemicontinuous in and .
Assumption 2 (Convergence in static information environment).
For any belief , the equilibrium strategy set of the game is non-empty. Moreover, given any initial strategy, the strategy update (-update) with static belief converges to an equilibrium in game .
Assumption 3.
For any , either (i) or (ii) is satisfied:
(i) The equilibrium set of the game is a singleton set.

(ii) There exists such that any two different equilibria satisfy . Moreover, for any , there exists such that for any that satisfies and any that satisfies , the strategy update is unique.
Assumption 1 ensures that the updated strategy is upper hemicontinuous in the belief and the strategy in the previous stage. Assumption 2 guarantees the convergence of strategies induced by updates under static information environment. Without Assumption 2, strategies may not converge even in the static information environment, and consequently the joint evolution of both the beliefs and the strategies also cannot be guaranteed to converge. Assumption 3 implies that when the game has multiple equilibria, each equilibrium is isolated, and the strategy update has a unique value in the local neighborhood of each equilibrium.
Our results and analysis approach hold for all types of strategy update (-update) that satisfy Assumptions 1 – 3. For the sake of concreteness, we provide two examples of that are natural extensions of widely studied strategy update rules:
1. Belief-based best response dynamics. We define the best response correspondence of each player as the set of strategies that maximize their expected utility given the opponents’ strategies and the updated belief . The strategy is updated following any one of the following dynamics: (Simultaneous-BR) – each player chooses a best response strategy; (Sequential-BR) – players sequentially update their strategies as the best response; (Inertial-BR) – each player updates their strategy as a linear combination of their current strategy and a best response strategy. A code sketch of these updates is given after this list.

(Simultaneous-BR) (Sequential-BR) (Inertial-BR) where the weight of the linear combination in (Inertial-BR) is specified for each player.
2. Belief-based no-regret learning. Given the updated belief and the current strategy profile , the updated strategy is given by:
(No-regret)

where the function is a regularizer that is continuous and strongly convex (the function is strongly convex if there exists such that for all and all ), and is a sequence of each player ’s scores such that , and
That is, the initial score of player is the same as the strategy , and the score in each step is updated using gradient ascent without regard to the feasibility constraint imposed by the strategy set . In particular, the gradient vector is the gradient of the expected utility function based on the updated belief .3

Footnote 3: Our belief-based no-regret learning estimates the gradient of the expected utility function based on the belief and the parametrized utility function . When the belief does not have complete information of the true parameter , the estimated gradient is different from the gradient of the true utility function . As the belief converges, the estimated gradient becomes an unbiased estimate of the true gradient.
Then, the updated strategy is a feasible strategy profile in that maximizes the inner product between the strategy and the updated score vector minus the value of the regularizer. A common example of such a regularizer is . In this case, the updated strategy is the projection of the score vector onto the feasible strategy set; i.e. . Thus, the strategy update follows a projected gradient ascent algorithm based on the updated belief for each player.
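For concreteness, the sketch below implements the inertial best response update and the no-regret update for players with scalar strategies on an interval. The helper names (expected_utility, payoff, grad_payoff), the use of scipy’s bounded scalar optimizer, the inertia weight alpha, and the step size eta are illustrative assumptions; with the Euclidean regularizer the (No-regret) step reduces to clipping the score back onto the interval.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def expected_utility(i, s, belief, thetas, payoff):
    """Expected payoff of player i at profile s under the current belief."""
    return sum(b * payoff(i, s, theta) for b, theta in zip(belief, thetas))

def best_response(i, s, belief, thetas, payoff, bounds):
    """Best response of player i (scalar strategy) to the opponents' strategies in s."""
    def neg_util(x):
        s_trial = np.array(s, dtype=float)
        s_trial[i] = x
        return -expected_utility(i, s_trial, belief, thetas, payoff)
    return minimize_scalar(neg_util, bounds=bounds, method="bounded").x

def inertial_br_step(s, belief, thetas, payoff, bounds, alpha=0.5):
    """(Inertial-BR): each player moves a fraction alpha toward a best response."""
    br = [best_response(i, s, belief, thetas, payoff, bounds) for i in range(len(s))]
    return [(1 - alpha) * si + alpha * bi for si, bi in zip(s, br)]

def no_regret_step(scores, s, belief, thetas, grad_payoff, bounds, eta=0.1):
    """(No-regret) with the Euclidean regularizer: projected gradient ascent on the scores."""
    new_scores, new_strategies = [], []
    for i in range(len(s)):
        # gradient of player i's expected utility under the current belief
        g = sum(b * grad_payoff(i, s, theta) for b, theta in zip(belief, thetas))
        y = scores[i] + eta * g                      # unconstrained gradient ascent on the score
        new_scores.append(y)
        new_strategies.append(float(np.clip(y, bounds[0], bounds[1])))  # projection onto the interval
    return new_scores, new_strategies
```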
We note that Assumption 1 is satisfied by the belief-based best response learning and the belief-based no-regret learning when the utility function is continuously differentiable with respect to for all and all .4 From the extensive literature on learning in games (Fudenberg and Kreps (1993); Sandholm (2010b)), we know that convergence of strategy updates to equilibrium is not guaranteed in arbitrary games even under a static information environment. In Table 1, we provide a list of games in which the strategy updates with the two example learning rules satisfy Assumption 2. Our results, as discussed next, hold for any (-update) satisfying these assumptions.

Footnote 4: Following Berge’s maximum theorem, for any , any and any , is upper-hemicontinuous in and . Therefore, the three types of belief-based best response updates satisfy Assumption 1. Additionally, when the utility function is continuously differentiable with respect to for any and any , the updated strategy given by (No-regret) is also upper-hemicontinuous in and .
Table 1: Games in which the strategy updates satisfy Assumption 2.

| Update rule | Games |
|---|---|
| (Simultaneous-BR) | Two-player Cournot games, and dominance solvable supermodular games (Monderer and Shapley (1996b); Milgrom and Roberts (1990)) |
| (Sequential-BR) | Potential games (Monderer and Shapley (1996b)) |
| (Inertial-BR) | Potential games, convex-concave zero-sum games, dominance solvable games, and supermodular games (Monderer and Shapley (1996a); Hofbauer and Sorin (2006); Milgrom and Roberts (1990)) |
| (No-regret) | Concave potential games, and strictly concave games (Rosen (1965); Mertikopoulos and Zhou (2019); Golowich et al. (2020)) |
3 Main Results
Sec. 3.1 presents the convergence of beliefs and strategies. Sec. 3.2 analyzes the local and global stability properties of fixed points. Sec. 3.3 focuses on conditions for learning to converge to equilibrium with complete information, and Sec. 3.4 illustrates our results through examples. We focus on presenting the results and explaining the intuition in this section. We defer the proofs of Theorems to Sec. 4, and the proofs of Propositions to Appendix A.
3.1 Convergence
To present our convergence result, we need to introduce two definitions.
Definition 1 (Kullback–Leibler (KL)-divergence).
For a strategy profile , the KL divergence between the payoff distribution with parameters and is:
Here means that the distribution is absolutely continuous with respect to , i.e. implies w.p. 1.
Definition 2 (Payoff-equivalent parameters).
A parameter is payoff-equivalent to the true parameter for a strategy if . For a given strategy profile , the set of parameters that are payoff-equivalent to is:
The KL-divergence between any two distributions is non-negative, and is equal to zero if and only if the two distributions are identical. For a given strategy profile , if a parameter is in the payoff-equivalent parameter set , then the payoff distribution is identical for parameters and , i.e. for all . In this case, realized payoffs cannot be used to distinguish and in the belief update (-update) because the belief ratio remains unchanged w.p. 1. Also note that the set can vary with strategy profile , and hence a payoff-equivalent parameter for one strategy profile may not be payoff-equivalent for another strategy profile.
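To illustrate Definitions 1 and 2, the sketch below checks payoff equivalence at a given strategy profile under the assumption of Gaussian payoff noise with a common variance, in which case the KL divergence reduces to a scaled squared distance between the mean payoff vectors. The function names and the Gaussian assumption are illustrative, not part of the model.

```python
import numpy as np

def payoff_equivalent(s, theta_true, theta, mean_payoff, noise_std=1.0, tol=1e-9):
    """Check payoff equivalence at strategy profile s under Gaussian payoff noise.

    With Gaussian noise of common variance, the KL divergence between the payoff
    distributions under theta_true and theta equals ||mean difference||^2 / (2 * variance),
    so the two parameters are payoff-equivalent at s iff the mean payoff vectors coincide.
    """
    diff = np.asarray(mean_payoff(s, theta_true)) - np.asarray(mean_payoff(s, theta))
    kl = 0.5 * np.dot(diff, diff) / noise_std ** 2
    return kl < tol
```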
Theorem 1.
For any initial state , under Assumptions 1 – 3, the sequence of states induced by (-update) and (-update) converges to a fixed point w.p. 1, and satisfies:
(4a)
(4b)
where , and is the set of equilibria corresponding to belief .
Moreover, for any , if , then converges to 0 exponentially fast:
(5)
Otherwise, there exists a positive integer such that for all w.p. 1.
This result demonstrates two facts: Firstly, if a strategy update converges in a static information environment, then this strategy update also converges when beliefs are repeatedly updated. That is, the convergence property of strategy update dynamics in games with a static information environment also holds under the joint evolution of belief learning and strategy learning.
Secondly, (4a) ensures that at any fixed point , the payoff distribution estimated by the fixed point belief is consistent with the true distribution given , i.e.
(6) |
Belief consistency follows from the fact that any parameter that is not payoff equivalent to is excluded from the support set of . In fact, we show in (5) that the belief of such parameter converges to zero exponentially fast with the rate of convergence being the KL-divergence between the corresponding payoff distribution with and the true parameter given strategy .
Moreover, (4b) ensures that is an equilibrium with respect to the estimated payoff distribution. Therefore, at a fixed point, players have no incentive to deviate from , and the realized payoffs can no longer change the belief .
We define the set of all fixed points as . We denote the belief vector with as the complete information belief, and any strategy as a complete information equilibrium. Since , the state is always a fixed point (i.e. ), and has the property that all players have complete information of the true parameter and choose a complete information equilibrium. Therefore, we refer to as a complete information fixed point.
However, the set may contain other fixed points that are not equivalent to the complete information environment (i.e. ). Such belief must assign positive probability to at least one parameter that is payoff-equivalent to given the fixed point strategy profile , but not necessarily at all . We refer to such fixed points as incomplete information fixed points.
Importantly, incomplete information fixed points arise in our learning dynamics due to the fact that realized payoffs are governed by the distribution corresponding to at fixed point, which may not distinguish every other parameter from the true parameter . Consequently, the fixed point strategy may not be a complete information equilibrium. In Sec. 3.3, we analyze the difference between complete information fixed points and incomplete information fixed points. We also provide conditions under which learning leads to complete information fixed points with probability 1.
3.2 Local and Global Stability
We now analyze the global and local stability of fixed points. The definitions of global and local stability are as follows:
Definition 3 (Global stability).
A fixed point belief and the associated equilibrium set are globally stable if for any initial state , the beliefs of the learning dynamics converge to and the strategies converge to with probability 1.
The definition of global stability requires that the convergent fixed point does not depend on the initial state of the learning dynamics. On the other hand, local stability requires that the learning dynamics is robust to small perturbations around and . Such local perturbations of beliefs may arise from random errors in data collection or analysis algorithms, and local perturbations of strategies are likely to occur due to players’ mistakes in choosing strategies or random local exploration.
Definition 4 (Local stability).
A fixed point belief and the associated equilibrium set are locally stable if, for any neighborhood of the fixed point and any probability threshold, there exists a smaller neighborhood such that, whenever the initial state lies in that smaller neighborhood, the states of the learning dynamics remain in the given neighborhood of the fixed point with probability higher than the threshold.
Note that both global and local stability notions are not defined for a single fixed point, but rather for the tuple , i.e. fixed points with an identical belief . This is important when the game has multiple equilibria; i.e., is not a singleton set for some belief . That is, our stability notions do not hinge on the convergence to a particular equilibrium in the fixed point equilibrium set .
The next proposition provides two conditions that are equivalent to the existence of globally stable fixed points, and shows that any globally stable fixed point must have complete information.
Proposition 1.
The following three statements are equivalent:
(a) The set of globally stable fixed points is non-empty.

(b) For any and any , there exists such that .

(c) All fixed points are complete information fixed points, i.e. .
Moreover, any globally stable fixed point must be a complete information fixed point.
In Proposition 1, condition (a) states the existence of a globally stable fixed point; condition (b) states that any belief that does not have complete information cannot be a fixed point belief, because can be distinguished at an associated equilibrium strategy (i.e. (4a) is not satisfied); condition (c) states that all fixed points are complete information fixed points.
Proposition 1 is intuitive: Clearly, (b), which states that any belief that does not have complete information cannot be a fixed point belief, is equivalent to (c), which states that all fixed points have complete information. Additionally, (a) is equivalent to (c) because if the fixed point set contains another fixed point that is not a complete information fixed point, then whether the states of the learning dynamics converge to the complete information fixed point or another fixed point depends on the initial state; hence no fixed point in such a set can be globally stable. On the other hand, when all fixed points are complete information fixed points, states of the learning dynamics must converge to and the associated equilibrium set regardless of the initial state, and thus must be globally stable. Consequently, (a) and (b) must also be equivalent. Given the equivalence between (a) and (c), it directly follows that all globally stable fixed points must be complete information fixed points.
From Proposition 1, we know that if there exist fixed points with incomplete information, then no fixed point is globally stable. In other words, whether learning converges to a complete information fixed point or a fixed point with incomplete information depends on the initial state. How likely is a particular fixed point to arise as the long-run outcome of our learning dynamics? We answer this question by analyzing the local stability property of fixed points. In Theorem 2, we provide sufficient conditions for the local stability of a fixed point, i.e. if learning starts from a state close to a fixed point that satisfies these conditions, then states will remain close to that fixed point with high probability. We further show in Proposition 2 that all complete information fixed points are locally stable.
Before proceeding, for any , we define an -neighborhood of belief as , where is the Euclidean distance. For any , we define the -neighborhood of the equilibrium set as , where is the Euclidean distance between and the equilibrium set (generically, the function defines the Euclidean distance between and any strategy subset ). We introduce the following assumption:
Assumption 4.
For a fixed point belief and the associated equilibrium set , there exist such that the neighborhoods and satisfy:
(A3a) Local consistency: Fixed point belief forms a consistent payoff estimate in the local neighborhood , that is for any .
(A3b) Local invariance: Neighborhood is a locally invariant set of the strategy update, that is, for any and any .
Theorem 2.
Under Assumptions 1 – 3, any fixed point that satisfies Assumption 4 is locally stable.
From Theorem 1, we already know that Assumptions 1 – 3 guarantee the convergence of beliefs and strategies under local perturbations. We now discuss the role of Assumption 4 towards ensuring the local stability of a fixed point. Firstly, the local consistency condition (A3a) ensures that (-update) keeps the beliefs close to . In particular, under this condition, any parameter in the support of remains payoff-equivalent to for any strategy in a local neighborhood of . Then, forms a consistent estimate of players’ payoffs not just at the fixed point strategy , but also when the strategy is locally perturbed around . Therefore, the Bayesian belief update keeps the beliefs of all parameters in close to their respective probabilities in when the strategies are in the local neighborhood, and eventually any parameters not in are excluded by the learning dynamics.
Secondly, the local invariance condition (A3b) guarantees that the strategy sequence resulting from the strategy updates remains within the locally invariant neighborhood of the fixed point equilibrium. Due to the dynamic interplay between the belief updates and strategy updates, we need both local consistency and local invariance conditions to ensure that the strategy sequence in our learning dynamics does not leave the local neighborhood of and the perturbed beliefs remain close to .
Finally, we demonstrate that any complete information fixed point must be locally stable.
Proposition 2.
Any complete information fixed point is locally stable.
This result implies that any complete information fixed point is robust to local perturbations of beliefs and strategies. It directly follows from Theorem 2 since any can be shown to satisfy Assumption 4. To see this, note that the complete information belief satisfies local consistency since it forms a consistent payoff estimate for any feasible strategy (and thus for any local neighborhood of ). Besides, the local invariance condition is satisfied because the set is an invariant set for any belief (and thus for any belief in a local neighborhood of ).
On the other hand, other fixed points are not guaranteed to be locally stable unless there exists a local neighborhood of it that satisfies local consistency and local invariance. In particular, consider the situation where a parameter is only payoff equivalent to for the fixed point strategy but not for strategies in a local neighborhood of . In this case, local consistency is not satisfied, and payoffs generated by any local perturbation of the strategy will allow players to distinguish from with positive probability. Consequently, beliefs and strategies may leave the local neighborhood of with positive probability when it is locally perturbed.
Theorem 2 and Proposition 2 show that complete information fixed points are robust to such local perturbations, but other fixed points may not be. While these two results do not rule out the possibility that learning may still converge to fixed points that do not have complete information, these fixed points are non-robust unless they are locally stable. In the next section, we further analyze conditions under which the learning dynamics converge to a complete information equilibrium.
3.3 Complete Learning
We say that learning is complete if the convergent fixed point strategy profile is a complete information equilibrium (i.e. ). We now answer two questions: When is learning complete? If learning is not complete (i.e. ), can we still find the complete information equilibrium?
Firstly, if the convergent fixed point belief assigns probability 1 to a single parameter (i.e. the support set is a singleton set), then the unique parameter in must be the true parameter . In this case, we know that the fixed point must be a complete information fixed point , and thus learning is complete. Besides, from Proposition 1, we know that learning is guaranteed to converge to a complete information fixed point if and only if the true parameter is identifiable at equilibrium.
Secondly, if the convergent fixed point belief assigns positive probability to more than one parameter (i.e. the support set is not a singleton set), then the fixed point belief does not have complete information of the true parameter. In this case, Proposition 3 below provides a sufficient condition on the fixed point belief and players’ payoff functions to ensure that the fixed point strategy profile is a complete information equilibrium. Here, learning is still complete, although the belief does not identify the true parameter.
Proposition 3.
For a fixed point , if
(i) Local consistency: The fixed point belief forms a consistent payoff estimate in a local neighborhood of , i.e. such that for any ;

(ii) Payoff concavity: The payoff function is concave in for all , all and all ;

then the fixed point strategy profile is a complete information equilibrium.
For any fixed point , since is a best response strategy of , is a local maximizer of the expected payoff function . The local consistency condition (i) in Proposition 3 ensures that the value of the expected payoff function is identical to the one corresponding to the true parameter for any belonging to a small neighborhood of . Therefore, must be a local maximizer of the payoff function with the true parameter . Condition (ii) further provides that payoffs are concave functions of , and thus must also be a global maximizer of . Therefore, (i) and (ii) together ensure that the fixed point strategy is a complete information equilibrium, although may not fully identify the true parameter .6

Footnote 6: We note that for continuous and differentiable utility functions, the local consistency condition is equivalent to requiring that the gradient of the expected utility function based on the fixed point belief is identical to the gradient of the true utility function in a local neighborhood of the fixed point, i.e. for all . In classical no-regret learning dynamics (e.g. Mertikopoulos and Zhou (2019)), the convergence of strategies requires that the payoffs are strictly concave, and the estimate of the gradient of the utility function is unbiased in each stage. In our belief-based no-regret learning (-update) – (No-regret), the estimated gradient computed with the belief in each stage may not be consistent with the gradient of the true utility function. From Proposition 3, we can conclude that in strictly concave games, when players choose belief-based no-regret learning, strategies converge to a complete information equilibrium if the gradient estimate is consistent in a local neighborhood of the fixed point.
To summarize, we can guarantee that learning is complete in two cases: either the convergent belief identifies a single parameter, or conditions (i) and (ii) in Proposition 3 are satisfied. When neither of these cases apply to the learning dynamics, we can conclude that the convergent is definitely not a complete information fixed point, but may or may not be a complete information equilibrium.
Typically, the literature on continuous games assumes that the payoff concavity condition is satisfied.7 In these games, if is not locally consistent, then there must exist at least one parameter such that is not payoff equivalent to in any local neighborhood of . Then, local exploration, that is, randomly perturbing strategies in a local neighborhood of , can distinguish such , and thus leads learning to a new fixed point where is locally consistent. Since both conditions in Proposition 3 are satisfied for the new fixed point, we can conclude that is a complete information Nash equilibrium. Therefore, in games with concave payoff functions, local exploration can find a complete information equilibrium.

Footnote 7: In continuous games, payoff concavity guarantees the existence of a pure equilibrium, and thus is typically assumed (Rosen (1965)).
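A minimal sketch of the local exploration step discussed above, assuming box-shaped strategy sets; the function name and the exploration radius are illustrative assumptions.

```python
import numpy as np

def locally_explore(s_fixed, radius, lower, upper, rng=None):
    """Sample a strategy profile uniformly from a small box around the fixed point strategy.

    Playing such locally perturbed profiles generates payoffs that can exclude any
    parameter whose payoff estimate is inconsistent in a neighborhood of the fixed point.
    """
    rng = np.random.default_rng() if rng is None else rng
    perturbed = np.asarray(s_fixed, dtype=float) + rng.uniform(-radius, radius, size=len(s_fixed))
    return np.clip(perturbed, lower, upper)  # keep the perturbed profile feasible
```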
More generally, when payoff concavity is not satisfied, may not be a complete information equilibrium, and one needs a different approach to identify the true parameter in the support set . Unfortunately, local exploration no longer helps since all parameters in are payoff equivalent to in a local neighborhood of . Instead, one needs to distinguish each pair of by exploring strategies for which these two parameters are not payoff equivalent, and such strategies may not be in a local neighborhood of .
3.4 Examples
We present three examples to further illustrate our results.
Example 1. (Cournot competition) A set of firms produce an identical product and compete in a market. In each stage , firm ’s strategy is their production level . The price of the product is , where is the unknown parameter vector in the price function, and is a random variable with zero mean. The set of parameter vectors is , where and . The true parameter is . The marginal cost of each firm is 0. The payoff of firm in stage is for each .
The information platform does not need to observe the production level or the payoff of each firm. Instead, the platform can update the belief based on the total production , and the price for all . This is due to the fact that for all , where is the probability density function of the price given the total production. The belief update given is as follows:
In this game, the expected payoff function is continuous in and concave in for all . Thus, the best response correspondence is upper-hemicontinuous in and for each , following from Berge’s maximum theorem. As a result, the strategy updates (Sequential-BR) and (Inertial-BR) satisfy Assumption 1. Additionally, since the utility function is continuously differentiable with respect to the strategy profile , the update (No-regret) is also upper-hemicontinuous in and for all , and thus satisfies Assumption 1. Furthermore, for any , the Cournot game has a concave potential function given by:
Therefore, the strategy updates – (Sequential-BR), (Inertial-BR), and (No-regret) – with respect to a static belief converge to a Nash equilibrium in , i.e. (Sequential-BR), (Inertial-BR), and (No-regret) satisfy Assumption 2 (Table 1). Additionally, since for any the equilibrium is unique, Assumption 3 is satisfied.
From Theorem 1, the states of the learning dynamics with strategy updates (Sequential-BR), (Inertial-BR), and (No-regret) converge to a fixed point with probability 1. The complete information fixed point is and . Additionally, and is also a fixed point since . Note that at , the parameter leads to an identical price distribution as , and thus is payoff equivalent. In fact, since any must include in the support set, one can show that is the only strategy profile for which and are payoff-equivalent. Thus, there do not exist any other fixed points apart from and ; i.e. .
Since the complete information fixed point is not the unique fixed point, no fixed point is globally stable (Proposition 1). We know from Proposition 2 that the complete information fixed point , is locally stable. On the other hand, the other fixed point and does not satisfy the local consistency condition since the two parameters and can be distinguished when the strategy is perturbed in any local neighborhood of . Since the utility functions are concave, local exploration around can identify the true parameter, and leads to the complete information equilibrium.
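The following self-contained simulation sketches the learning dynamics of Example 1 with a linear inverse demand, Gaussian price noise, and sequential best responses. The specific numbers (the two candidate parameter vectors, noise level, and horizon) are illustrative and are not the parameter values used in the example.

```python
import numpy as np

rng = np.random.default_rng(0)
thetas = [(10.0, 1.0), (8.0, 0.5)]      # candidate (intercept, slope) pairs; illustrative values
theta_true = thetas[0]
noise_std = 0.5
belief = np.array([0.5, 0.5])
q = np.array([1.0, 1.0])                 # initial production levels of the two firms

def mean_price(total_q, theta):
    a, b = theta
    return a - b * total_q                # linear inverse demand (illustrative assumption)

for t in range(500):
    # realized market price under the true parameter
    price = mean_price(q.sum(), theta_true) + rng.normal(0.0, noise_std)
    # Bayesian belief update using the aggregate statistics (total production, price)
    log_lik = np.array([-0.5 * (price - mean_price(q.sum(), th)) ** 2 / noise_std ** 2
                        for th in thetas])
    belief = belief * np.exp(log_lik - log_lik.max())
    belief = belief / belief.sum()
    # sequential best responses under the expected inverse demand given the current belief
    a_hat = sum(b * th[0] for b, th in zip(belief, thetas))
    b_hat = sum(b * th[1] for b, th in zip(belief, thetas))
    for i in range(len(q)):
        q_others = q.sum() - q[i]
        q[i] = max(0.0, (a_hat - b_hat * q_others) / (2.0 * b_hat))  # argmax of q_i*(a - b*(q_i + q_others))

print(belief, q)
```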
Example 2. (Zero-sum game) Two players repeatedly play a zero-sum game with identical convex and closed strategy sets . For any strategy profile , the payoff of each player is , where
and is the unknown parameter. The true parameter . Belief is updated by an information platform based on the strategy profiles and the realized payoffs as in (-update).
For any , the game is a zero-sum game with expected value function that is continuous in , concave in and convex in . Thus, the strategy updates (Inertial-BR) and (No-regret) satisfy Assumptions 1 and 2 (Table 1). Moreover, for any such that , the equilibrium is unique . For any such that , the equilibrium strategy profile is . Therefore, Assumption 3 is satisfied.
From Theorem 1, the sequence of states converges to a fixed point w.p. 1. The set of complete information fixed points is and . Apart from the complete information fixed points, any and is also a fixed point. This is because for any belief that assigns zero probability on , and is an equilibrium, and the two parameters and are payoff equivalent at .
We can check that conditions (i) and (ii) in Proposition 3 are satisfied by any fixed point . Thus, any fixed point strategy is a complete information equilibrium although is not a complete information belief. Therefore, learning is guaranteed to be complete in this game.
Since the complete information fixed point is not the unique fixed point, no fixed point is globally stable. Moreover, by setting and , we can check that all fixed points in satisfy the two conditions in Assumption 4, and thus are locally stable (Theorem 2).
Example 3. (Investment game) Two players repeatedly play an investment game. In each stage , the strategy is the non-negative level of investment of player . Given the strategy profile , the return of a unit investment is randomly realized according to , where is the unknown parameter that represents the average baseline return and is the noise term. The true parameter is . The stage cost of investment for each player is . Therefore, the payoff of each player is for all . Following a similar argument as in Example 1, the information platform does not need to observe the strategy of each player and their realized payoffs. Instead, the platform can update the belief based on the total investment and the realized unit investment return in each stage .
For any , the expected utility function is continuous in and . Moreover, the game is a supermodular game (i.e. ), and it is also dominance solvable with unique equilibrium strategy profile . Therefore, all three best response dynamics (Simultaneous-BR) – (Inertial-BR) satisfy Assumptions 1 and 2 (Table 1). Since the equilibrium set is a singleton set for all , Assumption 3 is also satisfied. Thus, states converge to a fixed point with probability 1. In this game, since for any , the unique fixed point is the complete information fixed point, i.e. . From Proposition 1, we know that this complete information fixed point is globally stable.
4 Proofs of Main Results
In this section, we present the proofs of our convergence and local stability results (Theorems 1 and 2). Recall that in (-update), the beliefs are updated at a random subsequence of stages . For ease of presentation in our proofs, we introduce an auxiliary belief sequence that is updated in every stage:
(8)
From (-update), we know that
(11)
We can also write the auxiliary belief ratio as follows:
(12)
Since the payoffs and strategies are public information, we show in the following lemma that the auxiliary belief ratio is a martingale sequence. The proof of Lemma 1 is included in Appendix A.
Lemma 1.
For any , the sequence of belief ratios is a martingale sequence.
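The one-step verification behind Lemma 1 is standard; the sketch below uses illustrative notation (rho for the belief ratio, phi for the payoff density, F for the history), which may differ from the paper's own symbols.

```latex
% Conditional on the history \mathcal{F}^t, the realized payoff u^t is drawn from the
% density \phi(\cdot \mid s^t, \theta^\ast), so the likelihood ratio has unit mean:
\[
  \mathbb{E}\!\left[\rho^{t+1}(\theta)\,\middle|\,\mathcal{F}^t\right]
  = \rho^{t}(\theta)\int
      \frac{\phi(u \mid s^t,\theta)}{\phi(u \mid s^t,\theta^\ast)}\,
      \phi(u \mid s^t,\theta^\ast)\,\mathrm{d}u
  = \rho^{t}(\theta)\int \phi(u \mid s^t,\theta)\,\mathrm{d}u
  = \rho^{t}(\theta).
\]
```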
Our proofs of the convergence and stability results build on the martingale property of belief ratio. In particular, we utilize the martingale property to show the convergence of beliefs in Theorem 1. In our proof of local stability in Theorem 2, we rely on the martingale property to compute the upper bound of the probability that beliefs leave a local -neighborhood of , and derive conditions on the initial belief to ensure that this upper bound does not exceed as in the definition of local stability (Definition 4).
4.1 Proof of convergence
We prove Theorem 1 in three steps: Firstly, we prove that the sequence of beliefs converges to a fixed point belief w.p. 1 by applying the martingale convergence theorem (Lemma 2). Secondly, we show that under Assumptions 1 and 2, the strategies in our learning dynamics also converge. This convergent strategy is an equilibrium corresponding to the fixed point belief (Lemma 3). Finally, we prove that the belief of any that is not payoff-equivalent to given must converge to with rate of convergence governed by (5) (Lemma 4). Hence, we can conclude that beliefs and strategies induced by the learning dynamics converge to a fixed point that satisfies (4) w.p. 1.
Lemma 2.
The sequence of beliefs converges w.p. 1.
We next prove the convergence of strategies. For every , we construct an auxiliary strategy sequence that is identical to the original sequence generated by the learning dynamics up to the stage (i.e. for all ), and the remaining strategies are induced by updates associated with the fixed point belief (instead of the repeatedly updated belief sequence ). In particular, for each , is obtained as follows:
(13)
From (13), is constructed in two steps: Firstly, given the original strategy and the fixed point belief , we compute as the strategy updated from with respect to (i.e. ) that is the closest to the original strategy . Secondly, is updated from the auxiliary strategy in stage with the fixed point belief (i.e. ), and is the closest to among all strategies in the set .
The two-step construction ensures that has the smallest Euclidean distance to the original strategy sequence among all strategy sequences that are induced by updates with . In fact, we can show that, as the beliefs converge to (Lemma 2), the distance between the auxiliary strategy sequence and the original strategy sequence also converges to zero as under Assumption 1. Additionally, under Assumption 2, the auxiliary strategy sequence must converge to an equilibrium , and thus the original sequence also converges to .
Lemma 3.
The sequence of strategies converges with probability 1. Moreover, the convergent strategy profile .
Proof of Lemma 3. Consider any sequence of states induced by the learning dynamics (-update) – (-update). For any , we construct the auxiliary strategy sequence from (13). For any , we have:
(14)
We next show by mathematical induction that for any , . To begin with, for , we have
(15)
Since converges to (Lemma 2), is upper hemicontinuous in (Assumption 1), and , we know that . Additionally, since and , . Therefore, .
Now, assume that for some , we need to show that . Similar to (15), we have
Analogous to , since is upper hemicontinuous in , . Additionally, since , is upper hemicontinuous in (Assumption 1), and , we know that . Therefore, we have . By mathematical induction, we conclude that for any , .
Under Assumption 2, we know that given any initial strategy , the sequence of strategies induced by updates (-update) with static belief converges to the equilibrium set . Then, for any and any initial strategy , there exists a finite integer such that for any . Since the strategy set is a closed and bounded set, we take , and we know that is finite.
Next, for any , the original strategy sequence satisfies
where is the -th strategy in the auxiliary sequence . Recall that the auxiliary strategy sequence is such that for , and is constructed with respect to the fixed point belief as in (13) for . Thus, for any , represents a strategy profile resulting from strategy updates with respect to starting from the initial profile . Then, for all . Moreover, since for any , we know that . Hence, there must exist such that for any , . Consequently, we have that for any , , i.e. . The sequence of strategies converges to the equilibrium set .
If is a singleton set, then we can conclude that converges. If contains multiple equilibria, it remains to prove the convergence of the strategy sequence to one of the equilibria. Assume for the sake of contradiction that does not converge. Since , there must exist such that for any , we can find a such that , where is the lower bound of the distance between any pair of equilibria as in Assumption 3. We note that can be viewed as the equilibrium strategy in the set that is the closest to , and may be different for different . In fact, under the assumption that the strategy sequence does not converge, we know that for any , there must exist such that and , where . Otherwise, the strategy sequence would remain in the local neighborhood of one equilibrium after a finite stage, which implies convergence to that equilibrium given that and each equilibrium is isolated. Moreover, for any such pair of stages , since , we must have .
Given Assumptions 1 and 3, we argue that for any , there must exist such that for any that satisfies and any that satisfies , the updated strategy satisfies . This is due to the fact that (i) ( is a fixed point of the strategy update with respect to , and it is an isolated equilibrium in ) (Assumption 3), and (ii) is upper hemicontinuous in both and (Assumption 1) and unique in the local neighborhood of (Assumption 3).
Given that converges to with probability 1 (Lemma 2) and , we know that there exists such that for any , is within distance to the closest equilibrium strategy , and the belief is within distance to . This implies that for any , and , which contradicts the existence of where and . Therefore, we conclude that the sequence of strategies must converge.
Lemma 4.
Lemma 4 is based on Lemmas 2 and 3. Although the realized payoffs are not independently and identically distributed (i.i.d.) due to players’ strategy updates, we can show that, since converges to (Lemma 3), the distribution of converges to that of an i.i.d. process with , the payoff distribution given the fixed point strategy , for each parameter . This allows us to show that the belief eventually excludes parameters that are not in , and also to establish the rate of convergence.
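To illustrate the mechanism behind Lemma 4, the following minimal numerical sketch (in Python) fixes the strategy at a convergent profile and repeatedly applies Bayes’ rule to noisily realized payoffs. The finite parameter set, Gaussian payoff model, and all variable names are hypothetical assumptions for illustration and are not the paper’s notation.

import numpy as np

# Sketch: once strategies settle at a fixed-point profile, realized payoffs are
# (approximately) i.i.d., and Bayesian updating drives the belief of any parameter
# whose payoff distribution differs from the true one at that profile to zero.
rng = np.random.default_rng(0)

thetas = np.array([0.5, 1.0, 1.5])   # hypothetical finite parameter set
theta_true = 1.0                      # true parameter
x_fixed = 2.0                         # strategy profile after convergence (scalar for simplicity)
sigma = 0.5                           # payoff noise standard deviation

def payoff_mean(theta, x):
    # hypothetical payoff model: mean payoff depends on the parameter and the strategy
    return theta * x

belief = np.full(len(thetas), 1.0 / len(thetas))   # uniform prior
for t in range(500):
    y = payoff_mean(theta_true, x_fixed) + sigma * rng.standard_normal()
    # (unnormalized) Gaussian likelihood of the realized payoff under each parameter
    lik = np.exp(-0.5 * ((y - payoff_mean(thetas, x_fixed)) / sigma) ** 2)
    belief = belief * lik
    belief /= belief.sum()            # Bayes' rule

print(np.round(belief, 4))            # mass concentrates on parameters matching the true payoff distribution

In this sketch, only parameters whose payoff distribution at the fixed-point strategy coincides with the true one retain positive belief in the long run, mirroring the statement of Lemma 4.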
Proof of Lemma 4. By iteratively applying the belief update in (8), we can write:
(16)
We define as the probability density function of the history of the realized payoffs conditioned on the history of strategies prior to stage , i.e. . For any , we can rewrite (16) as follows:
(17)
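For intuition, in generic notation (which need not match the paper’s), iterating Bayes’ rule yields the likelihood-ratio form underlying (16)–(17); here $\mu_t$ denotes the belief at stage $t$, $x_k$ the strategy profile, $y_k$ the realized payoff vector, and $\phi(\cdot\mid x_k,\theta)$ the payoff density under parameter $\theta$:
\[
\mu_t(\theta)=\frac{\mu_0(\theta)\prod_{k<t}\phi(y_k\mid x_k,\theta)}{\sum_{\theta'}\mu_0(\theta')\prod_{k<t}\phi(y_k\mid x_k,\theta')},\qquad
\frac{\mu_t(\theta)}{\mu_t(\theta^{*})}=\frac{\mu_0(\theta)}{\mu_0(\theta^{*})}\exp\Big(\sum_{k<t}\log\frac{\phi(y_k\mid x_k,\theta)}{\phi(y_k\mid x_k,\theta^{*})}\Big).
\]
In particular, the belief of $\theta$ vanishes relative to the true parameter exactly when the cumulative log-likelihood ratio diverges to $-\infty$, which is the quantity analyzed in the two cases below.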
For any , if we can show that the ratio converges to 0, then must also converge to 0. We need to consider two cases:
Case 1: .
In this case, the log-likelihood ratio can be written as:
(18)
For any , since is a continuous function of for any realized , is also continuous in for any . In Lemma 3, we proved that converges to . Then, must converge to for all realized payoffs . Therefore, from the law of large numbers, we have:
Note that for any , the expectation of can be written as:
Then, for any , . From (17), we know that for all . Finally, since for all , the true parameter is never excluded from the belief. For any , we have the following:
From (11), we know that .
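The key step in Case 1 is that, once strategies have converged, the stage-wise expected log-likelihood ratio under the true parameter is a negative Kullback–Leibler divergence; in the same generic notation as above, with $\bar x$ denoting the fixed-point strategy profile,
\[
\mathbb{E}_{y\sim\phi(\cdot\mid \bar x,\theta^{*})}\!\left[\log\frac{\phi(y\mid \bar x,\theta)}{\phi(y\mid \bar x,\theta^{*})}\right]
=-\,D_{\mathrm{KL}}\big(\phi(\cdot\mid \bar x,\theta^{*})\,\big\|\,\phi(\cdot\mid \bar x,\theta)\big)\;\le\;0,
\]
with equality if and only if the two payoff distributions coincide at $\bar x$. The law of large numbers then makes the time-averaged log-likelihood ratio converge to this nonpositive limit, which is strictly negative for any parameter that is payoff-distinguishable from the true parameter at the fixed point.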
Case 2: is not absolutely continuous with respect to .
In this case, does not imply with probability 1, i.e. , where is the probability of with respect to the true distribution . Since the distributions and are continuous in , the probability must also be continuous in . Therefore, for any , there exists such that for all .
From Lemma 3, we know that . Hence, we can find a positive number such that for any , , and hence . We then have . Moreover, since the event is independent of the event given any and , and any , we can conclude that based on the second Borel–Cantelli lemma. Hence, . From the Bayesian update (-update), we know that if for some stage , then any belief of updated after stage is 0. Therefore, we can conclude that, with probability 1, there exists a positive number such that for any .
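For completeness, the second Borel–Cantelli lemma used at the end of Case 2 states, in generic notation, that
\[
A_1,A_2,\dots \text{ mutually independent and } \sum_{t\ge 1}\mathbb{P}(A_t)=\infty
\;\Longrightarrow\;
\mathbb{P}\big(A_t \text{ occurs for infinitely many } t\big)=1.
\]
Here the relevant events are realizations of payoffs that have positive probability under the true distribution but zero density under the parameter being excluded; once one such payoff is realized, the Bayesian update assigns that parameter zero belief in every subsequent stage.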
4.2 Proof of local stability
From the local stability definition (Definition 4), for any and any probability , we need to characterize such that if the initial state is in the -neighborhood of the fixed point , then with probability higher than , the convergent state is in the -neighborhood of .
Our proof of Theorem 2 starts with constructing an auxiliary neighborhood of such that if all states remain in the auxiliary neighborhood, then the convergent state is guaranteed to be in the -neighborhood of (Part (iii) in Lemma 5). Here, and , and are chosen according to Assumption 2 to guarantee that the states in the auxiliary neighborhood satisfy the conditions of local consistency and local invariance (Parts (i) – (ii) in Lemma 5).
Lemma 5.
We include the proof of Lemma 5 in Appendix A. Essentially, (i) follows from the fact that in continuous games, the equilibrium set is upper-hemicontinuous in (Lemma 8 in Appendix A). Then, we obtain (ii) from Assumption (A3b) that is an invariant set of the strategy update for and . Furthermore, if beliefs are in for all stages, then the convergent belief must also be in . Based on Assumptions 1 – 2, Theorem 1 provides that the sequence of strategies converges. Since , we know from (i) that the convergent strategy is an equilibrium in the neighborhood . Thus, (iii) holds.
Thanks to Lemma 5 (iii), to prove local stability as in (7), it remains to establish that there exists an initial -neighborhood of the fixed point such that when learning starts from that neighborhood, all states remain in the auxiliary neighborhood with probability higher than . To characterize , we note that remains in the -neighborhood of if for all . In Lemma 6, we characterize the condition on the initial belief under which with probability higher than for all and all (i.e. the set of parameters with zero probability in ). In Lemma 7, we analyze conditions on the initial beliefs of parameters , which together with Lemma 6 allow us to precisely characterize . We also characterize that guarantees that the strategies do not leave . Our characterization of and builds on the martingale property of belief ratios (Lemma 1), and the fact that under Assumption 4, states satisfy local consistency and local invariance in the auxiliary neighborhood (Lemma 5).
Before proceeding, we need to define the following thresholds:
(19a)
(19b)
(19c)
Lemma 6 below shows that if the initial belief is in the neighborhood , then for all in all stages of the learning dynamics with probability higher than . Note that ensures since for all and . Additionally, the threshold is specifically constructed to bound the beliefs of the remaining parameters in , which will be used later in Lemma 7.
Lemma 6.
For any , if the initial belief satisfies
(20a)
(20b) |
then
(21)
In the proof of Lemma 6, we say that the belief completes an upcrossing of the interval if increases from below to above . Note that if the belief of a parameter is initially smaller than but later becomes higher than in some stage , then the belief sequence must have completed at least one upcrossing of before stage . Therefore, for all holds if and only if the number of upcrossings completed by the belief is zero.
Additionally, by bounding the initial belief of parameters as in (20b), we construct another interval such that the number of upcrossings with respect to this interval completed by the auxiliary sequence of belief ratios is no less than the number of upcrossings with respect to interval completed by . Recall that the sequence of belief ratios forms a martingale process (Lemma 1). By applying Doob’s upcrossing inequality, we obtain an upper bound on the expected number of upcrossings completed by the belief ratio corresponding to each parameter , which is also an upper bound on the expected number of upcrossings made by the belief of . Using Markov’s inequality and the upper bound of the expected number of upcrossings, we show that with probability higher than , no belief of any parameter can ever complete a single upcrossing with respect to the interval characterized by (19a) – (19b). Hence, remains lower than the threshold for all and all with probability higher than .
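For reference, the two inequalities invoked in this argument can be stated in generic notation. If $(X_t)_{t\ge 0}$ is a non-negative martingale and $U_T[a,b]$ denotes the number of upcrossings of an interval $[a,b]$ (with $0\le a<b$) completed by stage $T$, then Doob’s upcrossing inequality and Markov’s inequality give
\[
(b-a)\,\mathbb{E}\big[U_T[a,b]\big]\;\le\;\mathbb{E}\big[(X_T-a)^{+}\big]\;\le\;\mathbb{E}[X_T]\;=\;\mathbb{E}[X_0],
\qquad
\mathbb{P}\big(U_T[a,b]\ge 1\big)\;\le\;\mathbb{E}\big[U_T[a,b]\big]\;\le\;\frac{\mathbb{E}[X_0]}{b-a}.
\]
Since the belief ratios form non-negative martingales (Lemma 1), letting $T\to\infty$ and combining such bounds over the finitely many relevant parameters yields the probability guarantee in (21).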
Proof of Lemma 6. First, note that . For any and any , we denote the number of upcrossings of the interval that the belief completes by stage . That is, is the maximum number of intervals with , such that for . Since the beliefs are updated based on randomly realized payoffs as in (8), is also a random variable. For any , if and only if and there exists a stage such that . Equivalently, if and only if and there exists a stage such that . Therefore, if for all , then:
(22)
Next, we define . Since and is in the support set, we have . If satisfies (20a) – (20b), then for all . Additionally, for any stage and any , if , then because . Hence, whenever completes an upcrossing of the interval , must also have completed an upcrossing of the interval . To ensure that the interval is non-empty, we need to verify that , which requires that
(23)
From (19a), since , we can check that (23) is indeed satisfied. Thus, the interval is non-empty.
We denote as the number of upcrossings of the sequence with respect to the interval until stage . Then, for all . Therefore, we can write:
(24)
where the last inequality follows from Markov’s inequality.
From the proof of Lemma 2, we know that the sequence is a martingale. Therefore, we can apply Doob’s upcrossing inequality as follows:
(25)
From (22) – (25), and (19a) – (19b), we have:
where the last inequality is derived using (19a) – (19b). Finally, from (11), we know that .
Furthermore, Lemma 7 characterizes another set of conditions on the initial state to ensure that the beliefs of the remaining parameters satisfy , and the strategy for all , so long as for any parameter . Recall that for all is satisfied with probability higher than under the conditions in Lemma 6.
Lemma 7.
Under Assumption 4, if for all and , then
(28)
Proof of Lemma 7. From Assumption (A3a), we know that if . Hence, for any and any realized payoff . Therefore,
(29)
This implies that , and for all :
Thus, we have
Since , and conditioned on that for all as in (28), we have
Additionally, since and again conditioned on that for all as in (28), we have
Since by (19c), for all , any is a positive number for all . Therefore, we have the following bounds:
(30)
Since
(31)
we can check that for all . To ensure that the right-hand side of (31) is positive, we need for all , and , which is indeed satisfied by (19b).
Also, since
(32)
we have for all . Below, we check that the right-hand side of (32) is positive, and thus the constraint on in (32) is feasible:
From (30), we can conclude that for all . Additionally, conditional on for all as in (28), we have . From (11), we have . From (ii) in Lemma 5, since , we know that the updated strategy .
We now use mathematical induction to prove that the belief of any satisfies for stages . If in stages , for all and for all , then for all . Thus, from (11) and (ii) in Lemma 5, we have and . Therefore, for all .
From Assumption (A3a), we know that for all . Therefore, for any and any , with probability 1. Then, by iteratively applying (29) for , we have for all with probability 1. Analogous to , we can prove that for all . From (11), we also have for all . From the principle of mathematical induction, we conclude that in all stages , for all , and for all . Therefore, we have proved (28).
5 Extensions
In this section, we briefly discuss two extensions of the learning model introduced in Section 2: (1) Learning with two timescales; (2) Learning in games with finite strategies.
(1) Learning with two timescales. Consider the case where strategy updates occur on a faster timescale than the belief updates, i.e. with probability 1. Then, our convergence and stability results still hold as follows:
Proposition 4.
We note that the convergence and stability results in Proposition 4 do not rely on Assumption 1. Since with probability 1, as , the strategies between any two belief updates and are updated with the static belief . Therefore, Assumption 2 directly ensures that strategies converge to an equilibrium strategy profile in without relying on the construction of auxiliary strategies or on Assumption 1. Then, the updated belief forms an accurate payoff estimate given the equilibrium strategy , following Lemma 4. By repeating this process, we can show that the beliefs and strategies converge to a fixed point. Similarly, the local stability result does not rely on Assumption 1, and it holds following Theorem 2. The global stability result is also analogous to that in Proposition 1.
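As a concrete illustration of the two-timescale regime, the following sketch runs many best-response steps with a static belief between consecutive belief updates. The quadratic game, Gaussian payoff noise, and all names are illustrative assumptions, not the paper’s specification; for this particular utility, the best response to the expected utility under the belief coincides with the best response to the posterior-mean parameter.

import numpy as np

rng = np.random.default_rng(1)
thetas = np.array([0.5, 1.0, 1.5]); theta_true = 1.0
sigma = 0.5
belief = np.full(3, 1/3)
x = np.zeros(2)                                  # two players, scalar strategies

def best_response(theta_hat, x):
    # hypothetical utility u_i = theta*x_i - x_i**2 + 0.3*x_i*x_{-i}  =>  BR_i = (theta + 0.3*x_{-i}) / 2
    return (theta_hat + 0.3 * x[::-1]) / 2

for epoch in range(200):                         # slow timescale: belief updates
    theta_hat = belief @ thetas                  # posterior-mean parameter under the current belief
    for _ in range(50):                          # fast timescale: strategy updates with static belief
        x = best_response(theta_hat, x)
    # the platform observes noisy payoffs whose mean depends on the true parameter and strategies
    y = theta_true * x + sigma * rng.standard_normal(2)
    lik = np.array([np.prod(np.exp(-0.5 * ((y - th * x) / sigma) ** 2)) for th in thetas])
    belief = belief * lik; belief /= belief.sum()

print(np.round(belief, 3), np.round(x, 3))

Because the inner loop is a contraction, the strategies are near-equilibrium whenever the belief is updated, which is the property the proposition exploits.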
(2) Learning in games with finite strategy sets. Our results in Section 3 can be extended to learning in games where strategy sets are finite and players can choose mixed strategies. In this game, each player ’s action set (pure strategies) is a finite set , and the action profile (pure strategy profile) is denoted as . Given any parameter and any action profile , the distribution of players’ payoffs is .
We denote player ’s mixed strategy as , where is the probability of choosing the action . The mixed strategy set is bounded and convex. Players’ action profile in each stage , denoted as , is realized from the mixed strategy profile . Analogous to (-update) – (-update), beliefs and strategies in each stage are updated based on the past actions and the realized payoff vectors .
The convergence result in Theorem 1 can be readily extended to games with finite strategy sets: Under Assumptions 1 and 2, the sequence of beliefs and mixed strategies converges to a fixed point , where is a mixed Nash equilibrium associated with the belief , and accurately estimates the payoff distribution for all action profiles that are taken with positive probability in . Additionally, our results on global and local stability properties (Proposition 1 and Theorem 2) also hold analogously.
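The following sketch illustrates the finite-strategy extension for a hypothetical 2x2 game; the payoff model, noise, logit smoothing, and all names are illustrative assumptions rather than the paper’s specification. Actions are sampled from mixed strategies, the belief is updated from the realized payoffs of the sampled action profile, and mixed strategies are then updated by a smoothed best response to expected payoffs under the new belief.

import numpy as np

rng = np.random.default_rng(2)
thetas = [0.0, 1.0]; theta_true = 1.0; sigma = 0.5

def mean_payoff(i, a, theta):
    # hypothetical mean payoff of player i for action profile a under parameter theta
    return theta * (a[i] == a[1 - i]) + 0.2 * a[i]          # coordination bonus scaled by theta

belief = np.array([0.5, 0.5])
strat = [np.array([0.5, 0.5]), np.array([0.5, 0.5])]         # mixed strategies over two actions

for t in range(300):
    a = [rng.choice(2, p=strat[i]) for i in range(2)]         # sample an action profile
    y = [mean_payoff(i, a, theta_true) + sigma * rng.standard_normal() for i in range(2)]
    lik = np.array([np.prod([np.exp(-0.5 * ((y[i] - mean_payoff(i, a, th)) / sigma) ** 2)
                             for i in range(2)]) for th in thetas])
    belief = belief * lik; belief /= belief.sum()              # Bayes' rule on the sampled profile
    for i in range(2):                                         # logit response under the updated belief
        exp_pay = np.zeros(2)
        for ai in range(2):
            for aj in range(2):
                prof = [0, 0]; prof[i] = ai; prof[1 - i] = aj
                exp_pay[ai] += strat[1 - i][aj] * sum(b * mean_payoff(i, prof, th)
                                                      for b, th in zip(belief, thetas))
        strat[i] = np.exp(5 * exp_pay) / np.exp(5 * exp_pay).sum()

print(np.round(belief, 3), [np.round(s, 3) for s in strat])

Because the mixed strategies place positive probability on all action profiles, every parameter is eventually distinguished from the true one, consistent with the discussion of local exploration below.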
Moreover, in games with finite strategy sets, any local neighborhood of a fixed point strategy profile includes mixed strategies with full support on all action profiles, and any parameter can be distinguished from with the realized payoffs. In this case, only complete information fixed points satisfy the local consistency condition (A3a), and thus is locally stable (Proposition 2). Any other fixed point is non-robust to local perturbation of strategies. Hence, our stability analysis implies that complete learning is guaranteed by local exploration in games with finite strategies. Our observation that local exploration is sufficient to identify the complete information equilibrium in games with finite strategies is consistent with the literature on equilibrium learning in normal form games with unknown payoff matrices (Hart and Mas-Colell (2003); Foster and Young (2006); Marden et al. (2007); Daskalakis et al. (2011); Syrgkanis et al. (2015)). Indeed, the algorithms proposed in this literature rely on local exploration – repeatedly sampling off-equilibrium actions with small probability to learn the payoff matrices.
6 Concluding Remarks
In this article, we studied stochastic learning dynamics induced by a set of strategic players who repeatedly play a game with an unknown parameter. We analyzed the convergence of beliefs and strategies induced by the stochastic dynamics, and derived conditions for local and global stability of fixed points. We also provided conditions that guarantee that learning converges to a complete information equilibrium.
A future research question of interest is to analyze the learning dynamics when players seek to efficiently learn the true parameter by choosing off-equilibrium strategies. When there are one or more parameters that are payoff equivalent to the true parameter at a fixed point, complete learning requires players to take strategies that may reduce their individual payoffs in some stages. In our setup, if a player were to choose a non-equilibrium strategy, the information resulting from that player’s realized payoff would be incorporated into the belief update, and the new belief would be known to all players. The question of when utility-maximizing players would choose to engage in such explorative behavior is interesting and worthy of further investigation.
Another promising extension is to study the multi-agent reinforcement learning problem from a Bayesian viewpoint. In such settings, the unknown parameter changes over time according to a Markovian transition process, and players may have imperfect or no knowledge of the underlying transition kernel. The approach presented in this article for studying the dynamic interplay between belief updates and strategy updates is useful for analyzing how players learn belief estimates of payoffs that depend on the latent Markov state, and adaptively adjust their strategies so that they converge to a stationary equilibrium.
Acknowledgement
The authors are grateful to the area editor, the associate editor and the three reviewers for useful suggestions and constructive feedback. We thank participants of the Cornell ORIE Colloquium (2021), Games, Decisions and Networks Seminar (2021), seminar at Simons Institute for the Theory of Computing (2022), and the Spring 2022 Colloquium at the C3.ai Digital Transformation Institute for useful discussions. This research was supported in part by Simons Fellowship, Michael Hammer Fellowship, and AFOSR project Building Attack Resilience into Complex Networks.
References
- Acemoglu et al. [2014] Daron Acemoglu, Kostas Bimpikis, and Asuman Ozdaglar. Dynamics of information exchange in endogenous social networks. Theoretical Economics, 9(1):41–97, 2014.
- Acemoglu et al. [2017] Daron Acemoglu, Ali Makhdoumi, Azarakhsh Malekian, and Asuman Ozdaglar. Fast and slow learning from reviews. Technical report, National Bureau of Economic Research, 2017.
- Alós-Ferrer and Netzer [2010] Carlos Alós-Ferrer and Nick Netzer. The logit-response dynamics. Games and Economic Behavior, 68(2):413–427, 2010.
- Auer et al. [2002] Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2):235–256, 2002.
- Banerjee [1992] Abhijit V Banerjee. A simple model of herd behavior. The Quarterly Journal of Economics, 107(3):797–817, 1992.
- Beggs [2005] Alan W Beggs. On the convergence of reinforcement learning. Journal of Economic Theory, 122(1):1–36, 2005.
- Benaim and Hirsch [1999] Michel Benaim and Morris W Hirsch. Mixed equilibria and dynamical systems arising from fictitious play in perturbed games. Games and Economic Behavior, 29(1-2):36–72, 1999.
- Blume et al. [1993] Lawrence E Blume et al. The statistical mechanics of strategic interaction. Games and Economic Behavior, 5(3):387–424, 1993.
- Bravo et al. [2018] Mario Bravo, David Leslie, and Panayotis Mertikopoulos. Bandit learning in concave n-person games. Advances in Neural Information Processing Systems, 31, 2018.
- Cesa-Bianchi and Lugosi [2006] Nicolo Cesa-Bianchi and Gábor Lugosi. Prediction, learning, and games. Cambridge University Press, 2006.
- Cominetti et al. [2010] Roberto Cominetti, Emerson Melo, and Sylvain Sorin. A payoff-based learning procedure and its application to traffic games. Games and Economic Behavior, 70(1):71–83, 2010.
- Daskalakis et al. [2011] Constantinos Daskalakis, Alan Deckelbaum, and Anthony Kim. Near-optimal no-regret algorithms for zero-sum games. In Proceedings of the Twenty-second Annual ACM-SIAM Symposium on Discrete Algorithms, pages 235–254. SIAM, 2011.
- Dumett and Cominetti [2018] Miguel A Dumett and Roberto Cominetti. On the stability of an adaptive learning dynamics in traffic games. arXiv preprint arXiv:1807.01256, 2018.
- Foster and Young [2006] Dean Foster and Hobart Peyton Young. Regret testing: Learning to play Nash equilibrium without knowing you have an opponent. Theoretical Economics, 1(3):341–367, 2006.
- Fudenberg and Kreps [1993] Drew Fudenberg and David M Kreps. Learning mixed equilibria. Games and Economic Behavior, 5(3):320–367, 1993.
- Fudenberg and Kreps [1995] Drew Fudenberg and David M Kreps. Learning in extensive-form games I. self-confirming equilibria. Games and Economic Behavior, 8(1):20–55, 1995.
- Fudenberg and Levine [1993] Drew Fudenberg and David K Levine. Self-confirming equilibrium. Econometrica: Journal of the Econometric Society, pages 523–545, 1993.
- Fudenberg and Tirole [1991] Drew Fudenberg and Jean Tirole. Game theory. MIT press, 1991.
- Gale and Kariv [2003] Douglas Gale and Shachar Kariv. Bayesian learning in social networks. Games and Economic Behavior, 45(2):329–346, 2003.
- Golowich et al. [2020] Noah Golowich, Sarath Pattathil, and Constantinos Daskalakis. Tight last-iterate convergence rates for no-regret learning in multi-player games. Advances in Neural Information Processing Systems, 33:20766–20778, 2020.
- Golub and Jackson [2010] Benjamin Golub and Matthew O Jackson. Naive learning in social networks and the wisdom of crowds. American Economic Journal: Microeconomics, 2(1):112–49, 2010.
- Hart and Mas-Colell [2000] Sergiu Hart and Andreu Mas-Colell. A simple adaptive procedure leading to correlated equilibrium. Econometrica, 68(5):1127–1150, 2000.
- Hart and Mas-Colell [2003] Sergiu Hart and Andreu Mas-Colell. Regret-based continuous-time dynamics. Games and Economic Behavior, 45(2):375–394, 2003.
- Hofbauer and Sandholm [2002] Josef Hofbauer and William H Sandholm. On the global convergence of stochastic fictitious play. Econometrica, 70(6):2265–2294, 2002.
- Hofbauer and Sandholm [2009] Josef Hofbauer and William H Sandholm. Stable games and their dynamics. Journal of Economic Theory, 144(4):1665–1693, 2009.
- Hofbauer and Sorin [2006] Josef Hofbauer and Sylvain Sorin. Best response dynamics for continuous zero-sum games. Discrete and Continuous Dynamical Systems Series B, 6(1):215, 2006.
- Hopkins [2002] Ed Hopkins. Two competing models of how people learn in games. Econometrica, 70(6):2141–2166, 2002.
- Lattimore and Szepesvári [2020] Tor Lattimore and Csaba Szepesvári. Bandit algorithms. Cambridge University Press, 2020.
- Marden and Shamma [2012] Jason R Marden and Jeff S Shamma. Revisiting log-linear learning: Asynchrony, completeness and payoff-based implementation. Games and Economic Behavior, 75(2):788–808, 2012.
- Marden et al. [2007] Jason R Marden, Gürdal Arslan, and Jeff S Shamma. Regret based dynamics: convergence in weakly acyclic games. In Proceedings of the 6th International Joint Conference on Autonomous Agents and Multiagent Systems, page 42. ACM, 2007.
- Marden et al. [2009] Jason R Marden, H Peyton Young, Gürdal Arslan, and Jeff S Shamma. Payoff-based dynamics for multiplayer weakly acyclic games. SIAM Journal on Control and Optimization, 48(1):373–396, 2009.
- Matsui [1992] Akihiko Matsui. Best response dynamics and socially stable strategies. Journal of Economic Theory, 57(2):343–362, 1992.
- Meigs et al. [2017] Emily Meigs, Francesca Parise, and Asuman Ozdaglar. Learning dynamics in stochastic routing games. In 2017 55th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 259–266. IEEE, 2017.
- Mertikopoulos and Zhou [2019] Panayotis Mertikopoulos and Zhengyuan Zhou. Learning in games with continuous action sets and unknown payoff functions. Mathematical Programming, 173(1):465–507, 2019.
- Milgrom and Roberts [1990] Paul Milgrom and John Roberts. Rationalizability, learning, and equilibrium in games with strategic complementarities. Econometrica: Journal of the Econometric Society, pages 1255–1277, 1990.
- Moe and Fader [2004] Wendy W Moe and Peter S Fader. Dynamic conversion behavior at e-commerce sites. Management Science, 50(3):326–335, 2004.
- Monderer and Shapley [1996a] Dov Monderer and Lloyd S Shapley. Fictitious play property for games with identical interests. Journal of Economic Theory, 68(1):258–265, 1996a.
- Monderer and Shapley [1996b] Dov Monderer and Lloyd S Shapley. Potential games. Games and Economic Behavior, 14(1):124–143, 1996b.
- Mossel et al. [2015] Elchanan Mossel, Allan Sly, and Omer Tamuz. Strategic learning and the topology of social networks. Econometrica, 83(5):1755–1794, 2015.
- Rosen [1965] J Ben Rosen. Existence and uniqueness of equilibrium points for concave n-person games. Econometrica: Journal of the Econometric Society, pages 520–534, 1965.
- Samuelson and Zhang [1992] Larry Samuelson and Jianbo Zhang. Evolutionary stability in asymmetric games. Journal of Economic Theory, 57(2):363–391, 1992.
- Sandholm [2010a] William H Sandholm. Local stability under evolutionary game dynamics. Theoretical Economics, 5(1):27–50, 2010a.
- Sandholm [2010b] William H Sandholm. Population games and evolutionary dynamics. MIT press, 2010b.
- Smith and Price [1973] J Maynard Smith and George R Price. The logic of animal conflict. Nature, 246(5427):15–18, 1973.
- Syrgkanis et al. [2015] Vasilis Syrgkanis, Alekh Agarwal, Haipeng Luo, and Robert E Schapire. Fast convergence of regularized learning in games. Advances in Neural Information Processing Systems, 28, 2015.
- Taylor and Jonker [1978] Peter D Taylor and Leo B Jonker. Evolutionary stable strategies and game dynamics. Mathematical Biosciences, 40(1-2):145–156, 1978.
- Wu et al. [2021] Manxi Wu, Saurabh Amin, and Asuman E Ozdaglar. Value of information in Bayesian routing games. Operations Research, 69(1):148–163, 2021.
- Zhu et al. [2010] Shanjiang Zhu, David Levinson, Henry X Liu, and Kathleen Harder. The traffic and behavioral effects of the I-35W Mississippi River Bridge collapse. Transportation Research Part A: Policy and Practice, 44(10):771–784, 2010.
Appendix A Supplementary Proofs
Proof of Lemma 1. Starting from any initial belief , consider a sequence of strategies and a sequence of realized payoffs before stage . The belief is obtained by repeatedly updating based on and using (8). Then, from (12), we obtain the following:
(33)
where and are derived from the belief and strategy updates given by and . Note that
Hence, for any ,
Since , the sequence is a non-negative martingale for any .
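The computation behind the martingale property can be made explicit in generic notation: conditional on the history $\mathcal{F}_t$, the realized payoff is drawn from the true density $\phi(\cdot\mid x_t,\theta^{*})$, so (assuming the payoff densities share a common support) the one-step likelihood ratio has conditional expectation one:
\[
\mathbb{E}\!\left[\frac{\phi(y_t\mid x_t,\theta)}{\phi(y_t\mid x_t,\theta^{*})}\,\Big|\,\mathcal{F}_t\right]
=\int \frac{\phi(y\mid x_t,\theta)}{\phi(y\mid x_t,\theta^{*})}\,\phi(y\mid x_t,\theta^{*})\,\mathrm{d}y
=\int \phi(y\mid x_t,\theta)\,\mathrm{d}y=1.
\]
Since the belief ratio at stage $t+1$ multiplies the ratio at stage $t$ by exactly this one-step likelihood ratio, its conditional expectation equals its current value, which is the martingale property stated above.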
Proof of Lemma 2. From the martingale convergence theorem and Lemma 1, we know that converges with probability 1. Therefore, converges with probability 1. Since and for all , also converges with probability 1. We can thus conclude that converges with probability 1 for all , and the convergent vector is a belief vector. Moreover, from (11), we know that also converges to with probability 1.
Lemma 8 (Fudenberg and Tirole [1991]).
The equilibrium set is upper-hemicontinuous in the belief if the utility function is continuous in for all and all .
Proof of Lemma 5.
(i) From Lemma 8, we know that such must exist.
(ii) Since , we know from Assumption (A3b) that for any and any .
(iii)
Proof of Proposition 1. We first prove that (a) is equivalent to (c). On the one hand, if , then for any initial state, the learning dynamics converges to a complete information fixed point with belief and strategy in . That is, is globally stable. On the other hand, if there exists another fixed point , then learning that starts with the initial belief (resp. ) and strategy (resp. ) remains at (resp. ) for all stages w.p. 1. Thus, in this case, globally stable fixed points do not exist.
Next, we prove that (b) is equivalent to (c). If is a non-empty set for any and any , then no belief with imperfect information satisfies (4a). That is, only the complete information belief vector can be a fixed point belief. Therefore, all fixed points must be complete information fixed points. On the other hand, assume for the sake of contradiction that there exists a belief such that for an equilibrium strategy . Then , which is not a complete information fixed point, is in the set . Thus, we arrive at a contradiction.
We can conclude that (a) is equivalent to (b) from the equivalence between (a) and (c), and (b) and (c).
Proof of Proposition 2. First, since , satisfies the local consistency condition in (A3a) for all . Second, since is an invariant set for any strategy update and any , we can choose an arbitrary -neighborhood of , and local invariance is satisfied. Therefore, the neighborhood satisfies Assumption 4. Thus, any complete information fixed point must be locally stable.
Proof of Proposition 3. From condition (i) that for any , we have:
(34)
For any , since is a best response to with respect to , must be a local maximizer of . From (34), is a local maximizer of . From condition (ii), since the function is concave in , is also a global maximizer of , and hence is a best response to with complete information of . Since this argument holds for all , is a complete information equilibrium.
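The step from local to global maximality uses the standard fact that, for a concave function on a convex set, every local maximizer is global: if $\bar x$ is a local maximizer of a concave $f$ on a convex set $X$, then for any $x\in X$ and sufficiently small $\lambda\in(0,1]$,
\[
f(\bar x)\;\ge\;f\big(\bar x+\lambda(x-\bar x)\big)\;\ge\;(1-\lambda)f(\bar x)+\lambda f(x)
\quad\Longrightarrow\quad f(\bar x)\ge f(x).
\]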
Proof of Proposition 4. Since with probability 1, as , the strategies between any two belief updates and are updated with the static belief . Under Assumption 2, we know that the sequence of strategies converges to an equilibrium strategy in . From Lemma 2, we know that with probability 1. Since the equilibrium set is upper hemicontinuous in , the sequence of strategies must converge to an equilibrium . We know from Lemma 4 that the belief must form a consistent estimate of the payoff distribution given . The proofs of local and global stability are analogous to those in Theorem 2 and Proposition 1, respectively, and thus are omitted.