
Convergence and Stability of Coupled Belief–Strategy Learning Dynamics in Continuous Games

Manxi Wu, Saurabh Amin, and Asuman Ozdaglar M. Wu ([email protected]) is with the School of Operations Research and Information Engineering at Cornell University. S. Amin ([email protected]) is with Laboratory for Information and Decision Systems at Massachusetts Institute of Technology, and A. Ozdaglar ([email protected]) is with Department of Electrical Engineering and Computer Science at Massachusetts Institute of Technology.
Abstract

We propose a learning dynamics to model how strategic agents repeatedly play a continuous game while relying on an information platform to learn an unknown payoff-relevant parameter. In each time step, the platform updates a belief estimate of the parameter based on players’ strategies and realized payoffs using Bayes’s rule. Then, players adopt a generic learning rule to adjust their strategies based on the updated belief. We present results on the convergence of beliefs and strategies and the properties of convergent fixed points of the dynamics. We obtain sufficient and necessary conditions for the existence of globally stable fixed points. We also provide sufficient conditions for the local stability of fixed points. These results provide an approach to analyzing the long-term outcomes that arise from the interplay between Bayesian belief learning and strategy learning in games, and enable us to characterize conditions under which learning leads to a complete information equilibrium.

1 Introduction

Strategic agents often need to engage in repeated interactions with each other while learning about an unknown environment that impacts their payoffs. Such a situation arises in online market platforms, where buyers and sellers repeatedly make transaction decisions while learning the latent market condition that governs the price distribution. The price distribution is updated based on previous transactions and buyer reviews on platforms such as Amazon, eBay, and Airbnb (Moe and Fader (2004); Acemoglu et al. (2017)). Another situation arises in transportation networks, where travelers make route choice decisions on a day-to-day basis while also learning the underlying state of the network that affects the travel time distribution. This travel time distribution is repeatedly updated based on the delay and flow information provided by navigation apps such as Google Maps or Apple Maps (Zhu et al. (2010); Wu et al. (2021); Meigs et al. (2017)). In both situations, players' strategic decisions (purchases and sales on online platforms, or route choices in transportation networks) influence the learning of the unknown environment (latent market condition or network state), which in turn impacts the players' future decisions. Thus, the long-run outcome of strategic interactions among players is governed by the joint evolution of the stage-wise decisions made by the players and the learning of the unknown environment.

In this article, we introduce and study a learning model to describe this joint evolution in a game-theoretic setting. In our model, a finite number of strategic agents (players) repeatedly play a game with continuous strategy sets. The payoff of each player in the game is a function of the strategy profile with an unknown parameter (or parameter vector). A public information platform (e.g. a market platform or navigation app) serves as an information aggregator that intermittently updates and broadcasts a Bayesian belief estimate of the payoff parameter based on stage-wise game outcomes (i.e. strategies and randomly realized payoffs) to all players. Players then account for the most current belief when updating their strategies.

The players' strategy update in our learning dynamics is generic, so it can model a broad class of strategy learning rules in which players choose their next strategy based on the updated belief and the strategy profile played in the previous stage. In particular, the strategy update can incorporate two classes of learning rules that naturally arise from utility-maximizing players. One is belief-based best response dynamics, where, in each stage, players update their strategies using the best response that maximizes their expected utility given the updated belief and their opponents' strategies. The other is belief-based no-regret learning dynamics, where each player updates their strategy using an online mirror ascent algorithm, with the gradient of the expected utility estimated using the updated belief. When the true parameter is known to players, our learning dynamics reduces to classical strategy learning in games with complete information. In this special case, belief-based best response dynamics reduces to best response dynamics with complete information of the utility functions, and belief-based no-regret learning reduces to no-regret learning with complete information of the gradient of the utility functions.

Past literature has extensively studied various strategy learning rules in games. Learning rules based on best response strategies have been studied in games with both finite and continuous strategy sets when players have complete information of their utility functions. This includes simultaneous and sequential best response dynamics (Milgrom and Roberts (1990); Monderer and Shapley (1996b); Hofbauer and Sorin (2006)), fictitious play (Fudenberg and Kreps (1993); Monderer and Shapley (1996a)), and stochastic fictitious play (Benaim and Hirsch (1999); Hofbauer and Sandholm (2002)). For games with finite strategy sets, various learning rules have been proposed to model how players, who do not have complete information of the payoff matrices, learn an equilibrium based on the realized payoffs in every stage. Examples include log-linear learning (Blume et al. (1993), Marden and Shamma (2012), Alós-Ferrer and Netzer (2010)), regret-based learning (Hart and Mas-Colell (2003), Foster and Young (2006), Marden et al. (2007), Daskalakis et al. (2011); Syrgkanis et al. (2015)), payoff-based learning (Cominetti et al. (2010), Marden et al. (2009)), and replicator dynamics (Sandholm (2010b), Beggs (2005), Hopkins (2002)). On the other hand, in games with continuous strategy sets, learning dynamics typically assume knowledge of players' utility functions (e.g. dynamics based on best responses), or assume that a stochastic estimate of the gradient of the utility functions is available (e.g. no-regret learning and gradient-based learning in continuous games, Rosen (1965); Mertikopoulos and Zhou (2019); Bravo et al. (2018)).

We extend the past literature by focusing on coupled belief–strategy dynamics, which naturally describe the learning behavior of players who adjust their strategies while relying on an information platform to learn unknown, parametrized utility functions. The setting of parametrized utility functions is especially relevant to continuous games, since players need to learn their utility functions over a continuous strategy set. To the best of our knowledge, past literature has not analyzed the long-run outcomes of coupled belief–strategy updates in continuous games. Our focus is to develop an approach for analyzing the convergence and stability (both local and global) properties of this learning dynamics, and to derive conditions under which a complete information Nash equilibrium naturally arises from the learning dynamics. Our conditions for learning a complete information equilibrium can also be used to study equilibrium computation algorithms in continuous games; however, such algorithmic aspects of equilibrium computation are outside the scope of our work.

Next, we summarize the main contributions of our paper.

Convergence: We show that the joint evolution of beliefs and strategies converges to a fixed point of our dynamics with probability 1, provided that the learning rule in the strategy updates converges given a static belief (i.e. when the belief of the parameter is held fixed instead of being repeatedly updated). Furthermore, at any fixed point, the belief forms a consistent estimate of the payoff distribution given the fixed point strategy, and the strategy is an equilibrium of the game corresponding to the convergent belief (Theorem 1).

Naturally, a complete information belief and the associated complete information equilibrium form a fixed point of our learning dynamics. However, the fixed point set may also include other fixed points that do not identify the true parameter. At these fixed points, the belief assigns a non-zero probability to other parameters besides the true parameter, which may lead to an incorrect payoff estimate for strategies that differ from the fixed point. Consequently, the associated fixed point strategy that the learning dynamics converge to may not be a complete information equilibrium.

Our proof of Theorem 1 builds on the martingale property of Bayesian beliefs as well as the properties of strategy learning in games. First, we obtain the convergence of Bayesian beliefs by showing that the beliefs form a martingale. Second, to prove strategy convergence, we construct a series of auxiliary strategy sequences, each of which is identical to the original sequence up to a certain stage but differs in the “tail” subsequence – the strategies beyond that stage are updated using the fixed point belief. We show that the distance between a so-constructed auxiliary sequence and the original sequence goes to zero as the stage beyond which the two sequences differ tends to infinity. Thus, the convergence of strategies in our coupled belief–strategy learning dynamics is derived from the convergence of the strategy updates with the static belief being the fixed point belief. Finally, using the strategy convergence result, we prove that the players' payoffs asymptotically become independent and identically distributed according to the distribution corresponding to the fixed point strategy. This also allows us to show that the belief concentrates exponentially fast on the subset of parameters that, at the fixed point, induce the same payoff distribution as the true parameter.

Global and local stability: A fixed point is globally stable if the learning dynamics starting from any initial state converges to that fixed point with probability 1. We find that globally stable fixed points exist if and only if all fixed points have complete information of the unknown parameter. Equivalently, the necessary and sufficient condition for all fixed points to be complete information fixed points is that the true parameter is identifiable at any equilibrium strategy profile, i.e. no other parameter induces the same payoff distribution in equilibrium (Proposition 1). In this case, learning is guaranteed to identify the true parameter, and players eventually play a complete information equilibrium.

When this necessary and sufficient condition for global stability is not satisfied, there exists at least one fixed point that does not have complete information of the true parameter, and whether learning converges to a complete information fixed point depends on the initial state. In this case, we study the local stability property, which evaluates how robust a convergent fixed point is under local perturbations of beliefs and strategies. A fixed point is locally stable if the states remain close to the fixed point with high probability when learning starts from an initial state close to that fixed point. A locally stable fixed point is robust to local perturbations, and thus is more likely to arise as the long-run outcome of the learning dynamics.

We prove that a fixed point is locally stable if it satisfies two conditions (Theorem 2): (a) The fixed point belief is locally consistent in that it consistently estimates the payoff distribution in a local neighborhood of the fixed point strategy (instead of just at the fixed point); (b) The fixed point strategy has a local neighborhood that is an invariant set for the strategy update. In proving this result, we again exploit the martingale property of beliefs and the properties of strategy updates. We use the martingale upcrossing inequality to show that under condition (a) of local consistency, we can construct a neighborhood of initial belief, which ensures that the repeatedly updated beliefs remain in a small neighborhood of the fixed point belief with high probability. Condition (b) further guarantees that strategy update is robust to local perturbations of strategies and beliefs. Importantly, we find that any complete information fixed point is locally stable, and thus is robust to local perturbations of belief and strategy (Proposition 2).

Our global and local stability analysis contributes to the study of stability of Nash equilibrium under various learning dynamics in games. In particular, previous literature has focused on the stability of evolutionary dynamics in population games (Smith and Price (1973); Taylor and Jonker (1978); Samuelson and Zhang (1992); Matsui (1992); Hofbauer and Sandholm (2009); Sandholm (2010a)), and more recently on adaptive learning dynamics in traffic routing (Dumett and Cominetti (2018)). Our results extend the equilibrium stability analysis in a complete information environment to the stability analysis on both the belief estimates and players’ strategies governed by the coupled belief-strategy dynamics.

Complete learning. Learning is complete if the convergent fixed point strategy is a complete information equilibrium. Complete learning is not always guaranteed because fixed point beliefs may form incorrect estimates of unobserved game outcomes, i.e. the payoffs of strategies that are not taken. These incorrect estimates may persist because the information on game outcomes used in the belief updates is endogenously acquired based on the strategies chosen by the players.

The phenomenon that endogenous information acquisition leads to incomplete learning is central to a variety of settings, including multi-armed bandit problems (Auer et al. (2002); Cesa-Bianchi and Lugosi (2006); Lattimore and Szepesvári (2020)), endogenous social learning (Banerjee (1992); Gale and Kariv (2003); Golub and Jackson (2010); Acemoglu et al. (2014); Mossel et al. (2015)), and self-confirming equilibrium learning in extensive form games (Fudenberg and Levine (1993); Fudenberg and Kreps (1995)). In settings where each agent's action space is finite, complete learning can be achieved when the decision maker randomly takes each action with small probability (Auer et al. (2002); Cesa-Bianchi and Lugosi (2006); Hart and Mas-Colell (2000)). In our setting, players have continuous strategy sets. Thus, it is unclear under what conditions exploration is needed, and how random exploration can be used to achieve complete learning.

We focus on identifying conditions under which learning is complete without the need for exploration. We show that learning is complete if (i) the convergent fixed point belief assigns probability 1 to a single parameter, or (ii) the convergent fixed point belief is locally consistent and the payoff function of each player is concave in their own strategy for every parameter (Proposition 3). In case (i), the single parameter with probability 1 must be the true parameter, and the fixed point strategy is a complete information equilibrium. In case (ii), the fixed point belief does not identify the true parameter, but the local consistency condition ensures that the belief forms a consistent estimate of payoffs in a local neighborhood of the fixed point strategy. Thus, each player's strategy is a local best response to their opponents' strategies. Additionally, the payoff concavity condition ensures that each player's local best response is also a true best response. Therefore, in case (ii), the fixed point strategy is a complete information Nash equilibrium even though the fixed point belief does not have complete information.

If the convergent fixed point satisfies neither (i) nor (ii), then learning may not be complete, and exploration of strategies other than the fixed point strategy is needed to guarantee complete learning. If the payoff functions in the game are concave (as typically assumed in most applications), then local exploration – randomly taking strategies in a local neighborhood of the fixed point – is sufficient to identify the complete information equilibrium. This is because local exploration can exclude any parameter that does not form a locally consistent payoff estimate, and eventually leads to a locally consistent belief so that condition (ii) for complete learning is satisfied.

The rest of the paper is organized as follows: Section 2 describes the learning model and Section 3 details our main results – convergence and stability analysis, and conditions for complete learning. We present the proofs of our main convergence and stability theorems in Section 4. In Section 5, we discuss the extensions of our results to learning with two timescales, learning in games with finite strategies, and learning with maximum a posteriori or least square estimates. We include the proofs of propositions and technical lemmas in Appendix A.

2 Model of Learning Dynamics in Continuous Games

Our learning dynamics is induced by strategic players in a finite set $I$ who repeatedly play a game $G$ for an infinite number of stages. The players' payoffs in game $G$ depend on an unknown parameter vector $s$ belonging to a finite set $S$. Players have access to an information system (or an aggregator) that repeatedly updates and broadcasts a belief estimate $\theta=(\theta(s))_{s\in S}\in\Delta(S)$ to all players, where $\theta(s)$ denotes the estimated probability of parameter $s$.

In game $G$, the strategy of each player $i\in I$ is a finite-dimensional vector $q_i$ in a convex, closed and bounded strategy set $Q_i$. The players' strategy profile is denoted $q=(q_i)_{i\in I}\in Q\stackrel{\Delta}{=}\prod_{i\in I}Q_i$. The payoff of each player is realized randomly according to a probability distribution. The distribution of players' payoffs $y=(y_i)_{i\in I}$ for any strategy profile $q\in Q$ and any parameter $s\in S$ is given by the probability density function $\phi^s(y|q)$. We assume that $\phi^s(y|q)$ is continuous in $q$ for all $s\in S$. Without loss of generality, we write player $i$'s payoff $y_i$ for any $s\in S$ as the sum of an average payoff function $u_i^s(q)$ that is continuous in $q$ and a zero-mean noise term $\epsilon_i^s(q)$ that can be correlated across players:

\[
y_i = u_i^s(q) + \epsilon_i^s(q). \tag{1}
\]

We model our learning dynamics as a discrete-time stochastic process, whose state comprises the belief estimate of the unknown parameter and the players' strategies: in each stage $k\in\mathbb{N}_+$, the information platform broadcasts the current belief estimate $\theta^k$; the players act according to a strategy profile $q^k=(q_i^k)_{i\in I}$; and the payoffs $y^k=(y_i^k)_{i\in I}$ are realized according to $\phi^s(y^k|q^k)$ when the parameter is $s\in S$. The state of the learning dynamics in stage $k$ is $(\theta^k,q^k)\in\Delta(S)\times Q$.

The unknown true parameter is $s^*\in S$. We suppose that the initial belief $\theta^1$ does not exclude any possible parameter, i.e. $\theta^1(s)>0$ for all $s\in S$, and the initial strategy $q^1\in Q$ is feasible. The evolution of the states $(\theta^k,q^k)_{k=1}^{\infty}$ is jointly governed by the belief and strategy updates introduced next.

Belief update. The information platform updates the belief intermittently and infinitely often. The stages at which the belief is updated can be deterministic or random, and are denoted by the subsequence $(k_t)_{t=1}^{\infty}$. In stage $k_{t+1}$, the most recent belief estimate $\theta^{k_t}$ is updated using the players' strategy profiles $(q^k)_{k=k_t}^{k_{t+1}-1}$ and payoffs $(y^k)_{k=k_t}^{k_{t+1}-1}$ between stages $k_t$ and $k_{t+1}$ according to Bayes' rule:¹

\[
\theta^{k_{t+1}}(s) = \frac{\theta^{k_t}(s)\prod_{k=k_t}^{k_{t+1}-1}\phi^{s}(y^k|q^k)}{\sum_{s'\in S}\theta^{k_t}(s')\prod_{k=k_t}^{k_{t+1}-1}\phi^{s'}(y^k|q^k)},\quad\forall s\in S. \tag{$\theta$-update}
\]

¹In many instances of the problem setup, it is sufficient to update the belief based only on aggregate strategies $\tilde{q}^k$ and payoffs $\tilde{y}^k$, provided that the tuple $(\tilde{q}^k,\tilde{y}^k)$ is a sufficient statistic of $(q^k,y^k)$ in the following sense: one can write $\phi^s(y^k|q^k)=\psi(q^k,y^k|\tilde{q}^k,\tilde{y}^k)\,\tilde{\phi}^s(\tilde{y}^k|\tilde{q}^k)$, where $\psi(q^k,y^k|\tilde{q}^k,\tilde{y}^k)$ is the $s$-independent conditional distribution of strategies and payoffs given the aggregate statistics, and $\tilde{\phi}^s(\tilde{y}^k|\tilde{q}^k)$ is the conditional probability of $\tilde{y}^k$ given $\tilde{q}^k$ for parameter $s$. Then, we can re-write ($\theta$-update) as a Bayesian update that relies only on $(\tilde{q}^k,\tilde{y}^k)$:
\[
\theta^{k_{t+1}}(s) = \frac{\theta^{k_t}(s)\prod_{k=k_t}^{k_{t+1}-1}\psi(q^k,y^k|\tilde{q}^k,\tilde{y}^k)\tilde{\phi}^{s}(\tilde{y}^k|\tilde{q}^k)}{\sum_{s'\in S}\theta^{k_t}(s')\prod_{k=k_t}^{k_{t+1}-1}\psi(q^k,y^k|\tilde{q}^k,\tilde{y}^k)\tilde{\phi}^{s'}(\tilde{y}^k|\tilde{q}^k)} = \frac{\theta^{k_t}(s)\prod_{k=k_t}^{k_{t+1}-1}\tilde{\phi}^{s}(\tilde{y}^k|\tilde{q}^k)}{\sum_{s'\in S}\theta^{k_t}(s')\prod_{k=k_t}^{k_{t+1}-1}\tilde{\phi}^{s'}(\tilde{y}^k|\tilde{q}^k)},\quad\forall s\in S.
\]
For example, in routing games, the aggregate traffic flow and travel time of each edge in the network are sufficient statistics of the individual travelers' route choices and travel times. Another example arises in Cournot games, where the total production level and product price in a market are sufficient statistics of each producer's production and revenue (see Section 3.4).
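For concreteness, the following minimal sketch (in Python; the function name and array layout are our own illustrative choices, not part of the model) applies one round of ($\theta$-update) given the likelihoods of the realized payoffs under each candidate parameter.

```python
import numpy as np

def bayes_update(theta, likelihoods):
    """One application of (theta-update).

    theta:        prior probabilities theta^{k_t}(s), one entry per parameter s in S.
    likelihoods:  same length; the entry for s holds prod_{k=k_t}^{k_{t+1}-1} phi^s(y^k | q^k),
                  i.e. the likelihood of the payoffs realized since the last belief update.
    Returns the posterior theta^{k_{t+1}}.
    """
    unnormalized = theta * likelihoods
    return unnormalized / unnormalized.sum()

# Two candidate parameters; the observed payoffs are twice as likely under the first,
# so the posterior shifts toward it.
print(bayes_update(np.array([0.5, 0.5]), np.array([0.02, 0.01])))  # approx. [0.667, 0.333]
```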

Strategy update. Players update their strategies in each stage based on the updated belief and the current strategies played by their opponents. Given any $\theta^{k+1}$ and any $q^k$, we denote the strategy update for each $i\in I$ as a set-valued function $F_i(\theta^{k+1},q^k):\Delta(S)\times Q\rightrightarrows Q_i$:

\[
q_i^{k+1}\in F_i(\theta^{k+1},q^k),\quad\forall i\in I. \tag{$q$-update}
\]

In our model, the belief and strategy updates ($\theta$-update) – ($q$-update) capture the dynamic interplay between Bayesian parameter learning and strategy learning in games. Specifically, since the distribution of $y^k$ depends on the strategy profile $q^k$ in each stage, the sequence of payoffs $(y^k)_{k=1}^{\infty}$ is not independently and identically distributed when the strategies $(q^k)_{k=1}^{\infty}$ are repeatedly updated. On the other hand, the strategy updates ($q$-update) depend on the repeatedly updated beliefs $(\theta^k)_{k=1}^{\infty}$.

We note that the belief updates can occur less frequently than the strategy updates since the subsequence of belief update stages satisfies $k_{t+1}-k_t\geq 1$. In Sections 3 and 4, we present our main results under the assumption that $k_{t+1}-k_t$ is finite with probability (w.p.) 1; i.e., the belief and strategy updates follow the same timescale. In Sec. 5, we show that all results analogously hold when $\lim_{t\to\infty}k_{t+1}-k_t=\infty$ w.p. 1, that is, when the belief is updated on a slower timescale than the strategy updates.

For any belief $\theta$, we denote by $G(\theta)$ the game with a static information environment, in which each player $i$'s utility function is the expected utility $\mathbb{E}_{\theta}[u_i^s(q_i,q_{-i})]=\sum_{s\in S}\theta(s)u_i^s(q_i,q_{-i})$. Note that if the belief $\theta^k$ were held constant at $\theta$ for all $k$, then ($q$-update) would simply become the corresponding strategy update $q_i^{k+1}\in F_i(\theta,q^k)$ in game $G(\theta)$. Since our goal is to analyze the joint evolution of both the beliefs and the strategies (rather than the properties of a particular strategy update in a static information environment), we make the following assumptions throughout the paper:

Assumption 1 (Upper-hemicontinuous strategy updates).

For any $i\in I$, $F_i(\theta,q)$ is upper hemicontinuous in $\theta$ and $q$.

Assumption 2 (Convergence in static information environment).

For any belief $\theta$, the equilibrium strategy set $\mathrm{EQ}(\theta)$ of the game $G(\theta)$ is non-empty. Moreover, given any initial strategy, the strategy update ($q$-update) with static belief $\theta$ converges to an equilibrium of $G(\theta)$.

Assumption 3.

For any $\theta\in\Delta(S)$, either (i) or (ii) is satisfied:

  • (i) $\mathrm{EQ}(\theta)$ is a singleton set.

  • (ii) There exists $\delta>0$ such that any two distinct equilibria $q^{\dagger},q^{\ddagger}\in\mathrm{EQ}(\theta)$ satisfy $\|q^{\dagger}-q^{\ddagger}\|>\delta$. Moreover, for any $q^{\dagger}\in\mathrm{EQ}(\theta)$, there exist $\xi_a,\xi_b>0$ such that for any $q$ satisfying $\|q-q^{\dagger}\|<\xi_a$ and any $\theta'$ satisfying $\|\theta'-\theta\|<\xi_b$, the strategy update $F(\theta',q)=(F_i(\theta',q))_{i\in I}$ is unique.

Assumption 1 ensures that the updated strategy is upper hemicontinuous in the belief and in the strategy of the previous stage. Assumption 2 guarantees the convergence of strategies induced by the updates under a static information environment. Without Assumption 2, strategies may not converge even in the static information environment, and consequently the joint evolution of beliefs and strategies also cannot be guaranteed to converge. Assumption 3 implies that when the game has multiple equilibria, each equilibrium is isolated, and the strategy update is single-valued in a local neighborhood of each equilibrium.

Our results and analysis approach hold for all strategy updates ($q$-update) that satisfy Assumptions 1–3. For concreteness, we provide two examples of $F_i$ that are natural extensions of widely studied strategy update rules:

1. Belief-based best response dynamics. We define the best response correspondence of each player as $\mathrm{BR}_i(\theta^{k+1},q^k_{-i})\stackrel{\Delta}{=}\operatorname{arg\,max}_{q_i\in Q_i}\sum_{s\in S}\theta^{k+1}(s)u_i^s(q_i,q^k_{-i})$, the set of strategies that maximize player $i$'s expected utility given the opponents' strategies $q^k_{-i}$ and the updated belief $\theta^{k+1}$. The strategy is updated following any one of the following dynamics: (Simultaneous-BR) – each player chooses a best response strategy; (Sequential-BR) – players sequentially update their strategies as the best response; (Inertial-BR) – each player updates their strategy as a linear combination of their current strategy and a best response strategy.

\[
F_i(\theta^{k+1},q^k) = \mathrm{BR}_i(\theta^{k+1},q^k_{-i}),\quad\forall i\in I,\quad\forall k. \tag{Simultaneous-BR}
\]
\[
F_i(\theta^{k+1},q^k) = \begin{cases}\mathrm{BR}_i(\theta^{k+1},q^k_{-i}), & \text{if } k \bmod |I| = i,\\ q_i^k, & \text{otherwise,}\end{cases} \tag{Sequential-BR}
\]
\[
F_i(\theta^{k+1},q^k) = (1-\alpha^k)q_i^k+\alpha^k\mathrm{BR}_i(\theta^{k+1},q^k_{-i}),\quad\forall i\in I,\quad\forall k, \tag{Inertial-BR}
\]

where $\alpha^k\in[0,1]$ for each $k$ in (Inertial-BR).

2. Belief-based no-regret learning. Given the updated belief $\theta^{k+1}$ and the current strategy profile $q^k$, the updated strategy $q^{k+1}$ is given by:

\[
q_i^{k+1}=\operatorname{arg\,max}_{q_i\in Q_i}\left\{\left(x_i^{k+1}\right)^T q_i - h_i(q_i)\right\},\quad\forall i\in I,\quad\forall k, \tag{No-regret}
\]

where the function $h_i:Q_i\to\mathbb{R}$ is a regularizer that is continuous and strongly convex in $q_i$,² and $(x_i^k)_{k=1}^{\infty}$ is a sequence of player $i$'s scores such that $x_i^1=q_i^1$ and

\[
x_i^{k+1}=x_i^k+\alpha^k\left(\sum_{s\in S}\theta^{k+1}(s)\frac{\partial u_i^s(q^k)}{\partial q_i^k}\right),\quad\forall i\in I,\quad\forall k.
\]

²The function $h$ is strongly convex if there exists $A>0$ such that $h(\lambda q+(1-\lambda)q')\leq\lambda h(q)+(1-\lambda)h(q')-(1/2)A\lambda(1-\lambda)\|q-q'\|^2$ for all $q,q'\in Q$ and all $\lambda\in[0,1]$.

That is, the initial score of player $i$ equals the initial strategy $q_i^1$, and the score in each step is updated using gradient ascent without regard to the feasibility constraint imposed by the strategy set $Q_i$. In particular, the gradient vector is the gradient of the expected utility function $\partial\mathbb{E}_{\theta^{k+1}}[u_i^s(q^k)]/\partial q_i^k=\sum_{s\in S}\theta^{k+1}(s)\frac{\partial u_i^s(q^k)}{\partial q_i^k}$ based on the updated belief $\theta^{k+1}$.³

³Our belief-based no-regret learning estimates the gradient of the expected utility function $\mathbb{E}_{\theta}[u_i^s(q)]$ based on the belief $\theta$ and the parametrized utility function $u_i^s(q)$. When the belief $\theta$ does not have complete information of the true parameter $s^*$, the estimated gradient differs from the gradient of the true utility function $u_i^{s^*}(q)$. As the belief converges, the estimated gradient becomes an unbiased estimate of the true gradient.

Then, the updated strategy $q^{k+1}$ is the feasible strategy profile in $Q$ that maximizes the inner product between the strategy and the updated score vector $x^{k+1}=(x_i^{k+1})_{i\in I}$ minus the value of the regularizer. A common example of such a regularizer is $h_i(q_i)=(1/2)\|q_i\|^2$. In this case, the updated strategy is the projection of the score vector $x^{k+1}$ onto the feasible strategy set; i.e. $q_i^{k+1}=\operatorname{arg\,min}_{q_i\in Q_i}\|q_i-x_i^{k+1}\|^2$. Thus, the strategy update follows a projected gradient ascent algorithm based on the updated belief for each player; a sketch of this projected-gradient case is given after this list.
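The following minimal sketch (in Python) implements one step of the projected-gradient special case of (No-regret) with $h_i(q_i)=(1/2)\|q_i\|^2$; the assumption that each player's strategy is a scalar in a box $[q_{\min},q_{\max}]$ and the array layout of the gradients are illustrative choices, not part of the model.

```python
import numpy as np

def no_regret_step(x, theta, grads, q_min, q_max, alpha):
    """One (No-regret) step with the Euclidean regularizer h_i(q_i) = (1/2)||q_i||^2,
    for a game where each player's strategy is a scalar constrained to [q_min, q_max].

    x:     current scores x_i^k, one entry per player.
    theta: belief vector over the parameter set S.
    grads: array of shape (|S|, |I|); grads[s, i] = d u_i^s(q^k) / d q_i at the current profile q^k.
    alpha: step size alpha^k.
    Returns the updated scores x^{k+1} and strategies q^{k+1}.
    """
    expected_grad = theta @ grads         # belief-based estimate of each player's gradient
    x_new = x + alpha * expected_grad     # unconstrained gradient-ascent step on the scores
    q_new = np.clip(x_new, q_min, q_max)  # projection onto the box strategy sets Q_i
    return x_new, q_new

# Example: two players, two parameters, gradients +1 under s_1 and -1 under s_2.
x, q = no_regret_step(np.array([0.5, 0.5]), np.array([0.8, 0.2]),
                      np.array([[1.0, 1.0], [-1.0, -1.0]]), 0.0, 1.0, 0.1)
```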

We note that Assumption 1 is satisfied by belief-based best response learning and belief-based no-regret learning when the utility function $u_i^s(q)$ is continuously differentiable with respect to $q$ for all $i\in I$ and all $s\in S$.⁴ From the extensive literature on learning in games (Fudenberg and Kreps (1993); Sandholm (2010b)), we know that convergence of strategy updates to an equilibrium is not guaranteed in arbitrary games even under a static information environment. In Table 1, we provide a list of games in which the two examples of learning rules satisfy Assumption 2. Our results, as discussed next, hold for any ($q$-update) satisfying these assumptions.

⁴By Berge's maximum theorem, for any $\theta\in\Delta(S)$, any $i\in I$ and any $q_{-i}\in Q_{-i}$, $\mathrm{BR}_i(\theta,q_{-i})$ is upper hemicontinuous in $\theta$ and $q_{-i}$. Therefore, the three types of belief-based best response updates satisfy Assumption 1. Additionally, when the utility function $u_i^s(q)$ is continuously differentiable with respect to $q$ for any $s\in S$ and any $i\in I$, the updated strategy given by (No-regret) is also upper hemicontinuous in $\theta$ and $q$.

$F(\theta,q)$ | Game $G(\theta)$
(Simultaneous-BR) | Two-player Cournot games, and dominance solvable supermodular games (Monderer and Shapley (1996b); Milgrom and Roberts (1990))
(Sequential-BR) | Potential games (Monderer and Shapley (1996b))
(Inertial-BR) | Potential games, convex-concave zero-sum games, dominance solvable games, and supermodular games (Monderer and Shapley (1996a); Hofbauer and Sorin (2006); Milgrom and Roberts (1990))
(No-regret) | Concave potential games, and strictly concave games (Rosen (1965); Mertikopoulos and Zhou (2019); Golowich et al. (2020))
Table 1: Examples of games in which belief-based best response learning and belief-based no-regret learning satisfy Assumption 2.

3 Main Results

Sec. 3.1 presents the convergence of beliefs and strategies. Sec. 3.2 analyzes the local and global stability properties of fixed points. Sec. 3.3 focuses on conditions for learning to converge to an equilibrium with complete information, and Sec. 3.4 illustrates our results through examples. We focus on presenting the results and explaining the intuition in this section. We defer the proofs of the theorems to Sec. 4, and the proofs of the propositions to Appendix A.

3.1 Convergence

To present our convergence result, we need to introduce two definitions.

Definition 1 (Kullback–Leibler (KL)-divergence).

For a strategy profile $q\in Q$, the KL divergence between the payoff distributions with parameters $s^*$ and $s\in S$ is:

\[
D_{KL}\left(\phi^{s^*}(y|q)\,||\,\phi^{s}(y|q)\right)\stackrel{\Delta}{=}\begin{cases}\int_{y}\phi^{s^*}(y|q)\log\left(\frac{\phi^{s^*}(y|q)}{\phi^{s}(y|q)}\right)dy, & \text{if }\phi^{s^*}(y|q)\ll\phi^{s}(y|q),\\ \infty, & \text{otherwise.}\end{cases}
\]

Here $\phi^{s^*}(y|q)\ll\phi^{s}(y|q)$ means that the distribution $\phi^{s^*}(y|q)$ is absolutely continuous with respect to $\phi^{s}(y|q)$, i.e. $\phi^{s}(y|q)=0$ implies $\phi^{s^*}(y|q)=0$ w.p. 1.

Definition 2 (Payoff-equivalent parameters).

A parameter $s\in S$ is payoff-equivalent to the true parameter $s^*$ for a strategy profile $q\in Q$ if $D_{KL}\left(\phi^{s^*}(y|q)\,||\,\phi^{s}(y|q)\right)=0$. For a given strategy profile $q\in Q$, the set of parameters that are payoff-equivalent to $s^*$ is:

\[
S^*(q)\stackrel{\Delta}{=}\left\{s\in S \,\middle|\, D_{KL}\left(\phi^{s^*}(y|q)\,||\,\phi^{s}(y|q)\right)=0\right\}.
\]

The KL divergence between any two distributions is non-negative, and equals zero if and only if the two distributions are identical. For a given strategy profile $q\in Q$, if a parameter $s$ is in the payoff-equivalent parameter set $S^*(q)$, then the payoff distribution is identical under parameters $s$ and $s^*$, i.e. $\phi^{s^*}(y|q)=\phi^{s}(y|q)$ for all $y$. In this case, realized payoffs cannot be used to distinguish $s$ from $s^*$ in the belief update ($\theta$-update) because the belief ratio $\frac{\theta^k(s)}{\theta^k(s^*)}$ remains unchanged w.p. 1. Also note that the set $S^*(q)$ can vary with the strategy profile $q$, and hence a parameter that is payoff-equivalent for one strategy profile may not be payoff-equivalent for another.
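As an illustration (the Gaussian form is an assumption made only for this example, not a requirement of the model), suppose each player's payoff noise in (1) is Gaussian and independent across players, $y_i=u_i^s(q)+\epsilon_i$ with $\epsilon_i\sim\mathcal{N}(0,\sigma^2)$. Then the KL divergence in Definition 1 reduces to a scaled squared difference of mean payoffs,
\[
D_{KL}\left(\phi^{s^*}(y|q)\,||\,\phi^{s}(y|q)\right)=\sum_{i\in I}\frac{\left(u_i^{s^*}(q)-u_i^{s}(q)\right)^2}{2\sigma^2},
\]
so $s\in S^*(q)$ exactly when every player's mean payoff under $s$ coincides with that under $s^*$ at the profile $q$.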

Theorem 1.

For any initial state $(\theta^1,q^1)\in\Delta(S)\times Q$, under Assumptions 1–3, the sequence of states $(\theta^k,q^k)_{k=1}^{\infty}$ induced by ($\theta$-update) and ($q$-update) converges to a fixed point $(\bar{\theta},\bar{q})$ w.p. 1, and $(\bar{\theta},\bar{q})$ satisfies:

\[
[\bar{\theta}] \subseteq S^*(\bar{q}), \tag{4a}
\]
\[
\bar{q}\in\mathrm{EQ}(\bar{\theta}), \tag{4b}
\]

where $[\bar{\theta}]\stackrel{\Delta}{=}\{s\in S\,|\,\bar{\theta}(s)>0\}$, and $\mathrm{EQ}(\bar{\theta})$ is the set of equilibria corresponding to belief $\bar{\theta}$.

Moreover, for any $s\in S\setminus S^*(\bar{q})$, if $\phi^{s^*}(y|\bar{q})\ll\phi^{s}(y|\bar{q})$, then $\theta^k(s)$ converges to 0 exponentially fast:

\[
\lim_{k\to\infty}\frac{1}{k}\log\left(\theta^k(s)\right)=-D_{KL}\left(\phi^{s^*}(y|\bar{q})\,||\,\phi^{s}(y|\bar{q})\right),\quad \text{w.p. } 1. \tag{5}
\]

Otherwise, there exists a positive integer $K^*<\infty$ such that $\theta^k(s)=0$ for all $k>K^*$ w.p. 1.

This result demonstrates two facts. First, if a strategy update converges in a static information environment, then it also converges when beliefs are repeatedly updated. That is, the convergence property of strategy update dynamics in games with a static information environment also holds under the joint evolution of belief learning and strategy learning.

Secondly, (4a) ensures that at any fixed point $(\bar{\theta},\bar{q})$, the payoff distribution $\mu(y|\bar{\theta},\bar{q})$ estimated by the fixed point belief $\bar{\theta}$ is consistent with the true distribution given $\bar{q}$, i.e.

\[
\mu(y|\bar{\theta},\bar{q})\stackrel{\Delta}{=}\sum_{s\in S}\bar{\theta}(s)\phi^{s}(y|\bar{q})\stackrel{\text{(4a)}}{=}\sum_{s\in S^*(\bar{q})}\bar{\theta}(s)\phi^{s}(y|\bar{q})=\sum_{s\in S^*(\bar{q})}\bar{\theta}(s)\phi^{s^*}(y|\bar{q})=\phi^{s^*}(y|\bar{q}),\quad\forall y. \tag{6}
\]

Belief consistency follows from the fact that any parameter that is not payoff-equivalent to $s^*$ is excluded from the support set of $\bar{\theta}$. In fact, we show in (5) that the belief of any such parameter $s\in S\setminus S^*(\bar{q})$ converges to zero exponentially fast, with the rate of convergence being the KL divergence between the payoff distributions under $s$ and the true parameter $s^*$ given the strategy $\bar{q}$.
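Continuing the Gaussian-noise illustration above (again, an assumption made only for illustration), the rate in (5) becomes explicit:
\[
\theta^k(s)\approx\exp\left(-k\sum_{i\in I}\frac{\left(u_i^{s^*}(\bar{q})-u_i^{s}(\bar{q})\right)^2}{2\sigma^2}\right)\quad\text{for large }k,
\]
so parameters whose mean payoffs at $\bar{q}$ differ substantially from those under $s^*$ are discarded quickly, while nearly payoff-equivalent parameters may linger in the belief for a long time.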

Moreover, (4b) ensures that $\bar{q}$ is an equilibrium with respect to the estimated payoff distribution. Therefore, at a fixed point, players have no incentive to deviate from $\bar{q}$, and the realized payoffs can no longer change the belief $\bar{\theta}$.

We define the set of all fixed points as $\Omega\stackrel{\Delta}{=}\left\{(\bar{\theta},\bar{q})\,\middle|\,[\bar{\theta}]\subseteq S^*(\bar{q}),~\bar{q}\in\mathrm{EQ}(\bar{\theta})\right\}$. We denote the belief vector $\theta^*$ with $\theta^*(s^*)=1$ as the complete information belief, and any strategy $q^*\in\mathrm{EQ}(\theta^*)$ as a complete information equilibrium. Since $[\theta^*]=\{s^*\}\subseteq S^*(q^*)$, the state $(\theta^*,q^*)$ is always a fixed point (i.e. $(\theta^*,q^*)\in\Omega$), and has the property that all players have complete information of the true parameter $s^*$ and choose a complete information equilibrium. Therefore, we refer to $(\theta^*,q^*)$ as a complete information fixed point.

However, the set $\Omega$ may contain other fixed points $(\bar{\theta},\bar{q})$ that are not equivalent to the complete information environment (i.e. $\bar{\theta}\neq\theta^*$). Such a belief $\bar{\theta}$ must assign positive probability to at least one parameter $s\neq s^*$ that is payoff-equivalent to $s^*$ given the fixed point strategy profile $\bar{q}$, but not necessarily at all $q\in Q$. We refer to such fixed points as incomplete information fixed points.

Importantly, incomplete information fixed points arise in our learning dynamics due to the fact that realized payoffs are governed by the distribution corresponding to $\bar{q}$ at the fixed point, which may not distinguish every other parameter $s$ from the true parameter $s^*$. Consequently, the fixed point strategy $\bar{q}$ may not be a complete information equilibrium. In Sec. 3.3, we analyze the difference between complete information and incomplete information fixed points. We also provide conditions under which learning leads to complete information fixed points with probability 1.

3.2 Local and Global Stability

We now analyze the global and local stability of fixed points. The definitions of global and local stability are as follows:

Definition 3 (Global stability).

A fixed point belief $\bar{\theta}\in\Delta(S)$ and the associated equilibrium set $\mathrm{EQ}(\bar{\theta})$ are globally stable if for any initial state $(\theta^1,q^1)$, the beliefs $(\theta^k)_{k=1}^{\infty}$ of the learning dynamics converge to $\bar{\theta}$ and the strategies $(q^k)_{k=1}^{\infty}$ converge to $\mathrm{EQ}(\bar{\theta})$ with probability 1.

The definition of global stability requires that the convergent fixed point does not depend on the initial state of the learning dynamics. On the other hand, local stability requires that the learning dynamics be robust to small perturbations around $\bar{\theta}$ and $\mathrm{EQ}(\bar{\theta})$. Such local perturbations of beliefs may arise from random errors in data collection or analysis algorithms, and local perturbations of strategies are likely to occur due to players' mistakes in choosing strategies or random local exploration.

Definition 4 (Local stability).

A fixed point belief $\bar{\theta}\in\Delta(S)$ and the associated equilibrium set $\mathrm{EQ}(\bar{\theta})$ are locally stable if for any $\gamma\in(0,1)$ and any $\bar{\epsilon},\bar{\delta}>0$, there exist $\epsilon^1,\delta^1>0$ such that for the learning dynamics ($\theta$-update) and ($q$-update) starting from $\theta^1\in N_{\epsilon^1}(\bar{\theta})$ and $q^1\in N_{\delta^1}(\mathrm{EQ}(\bar{\theta}))$, the following holds:

\[
\lim_{k\to\infty}\mathrm{Pr}\left(\theta^k\in N_{\bar{\epsilon}}(\bar{\theta}),~q^k\in N_{\bar{\delta}}(\mathrm{EQ}(\bar{\theta}))\right)>\gamma. \tag{7}
\]

Note that both the global and local stability notions are defined not for a single fixed point, but rather for the tuple $(\bar{\theta},\mathrm{EQ}(\bar{\theta}))$, i.e. the set of fixed points with an identical belief $\bar{\theta}$. This is important when the game has multiple equilibria, i.e. when $\mathrm{EQ}(\theta)$ is not a singleton set for some belief $\theta\in\Delta(S)$. That is, our stability notions do not hinge on convergence to a particular equilibrium in the fixed point equilibrium set $\mathrm{EQ}(\bar{\theta})$.

The next proposition provides two conditions that are equivalent to the existence of globally stable fixed points, and shows that any globally stable fixed point must have complete information.

Proposition 1.

The following three statements are equivalent:

  (a) The set of globally stable fixed points is non-empty.

  (b) For any $\theta\in\Delta(S)\setminus\{\theta^*\}$ and any $q\in\mathrm{EQ}(\theta)$, there exists $s\in[\theta]$ such that $s\notin S^*(q)$.

  (c) All fixed points are complete information fixed points, i.e. $\Omega=\left\{\left(\theta^*,\mathrm{EQ}(\theta^*)\right)\right\}$.

Moreover, any globally stable fixed point must be a complete information fixed point.

In Proposition 1, condition (a) states the existence of a globally stable fixed point; condition (b) states that any belief that does not have complete information cannot be a fixed point belief, because $s^*$ can be distinguished at an associated equilibrium strategy (i.e. (4a) is not satisfied); condition (c) states that all fixed points are complete information fixed points.

Proposition 1 is intuitive. Clearly, (b) – any belief that does not have complete information cannot be a fixed point belief – is equivalent to (c) – all fixed points have complete information. Additionally, (a) is equivalent to (c): if the fixed point set $\Omega$ contains a fixed point that is not a complete information fixed point, then whether the states of the learning dynamics converge to the complete information fixed point or to that other fixed point depends on the initial state, and hence no fixed point in such a set can be globally stable. On the other hand, when all fixed points are complete information fixed points, the states of the learning dynamics must converge to $\theta^*$ and the associated equilibrium set $\mathrm{EQ}(\theta^*)$ regardless of the initial state, so this fixed point is globally stable. Consequently, (a) and (b) are also equivalent. Given the equivalence between (a) and (c), it directly follows that any globally stable fixed point must be a complete information fixed point.

From Proposition 1, we know that if there exist fixed points with incomplete information, then no fixed point is globally stable. In other words, whether learning converges to a complete information fixed point or a fixed point with incomplete information depends on the initial state. How likely is a particular fixed point to arise as the long-run outcome of our learning dynamics? We answer this question by analyzing the local stability of fixed points. In Theorem 2, we provide sufficient conditions for the local stability of a fixed point, i.e. if learning starts from a state close to a fixed point satisfying these conditions, then the states remain close to that fixed point with high probability. We further show in Proposition 2 that all complete information fixed points are locally stable.

Before proceeding, for any $\epsilon>0$, we define an $\epsilon$-neighborhood of belief $\bar{\theta}$ as $N_{\epsilon}(\bar{\theta})\stackrel{\Delta}{=}\left\{\theta\,\middle|\,\|\theta-\bar{\theta}\|<\epsilon\right\}$, where $\|\cdot\|$ is the Euclidean norm. For any $\delta>0$, we define the $\delta$-neighborhood of the equilibrium set as $N_{\delta}(\mathrm{EQ}(\bar{\theta}))\stackrel{\Delta}{=}\left\{q\in Q\,\middle|\,D\left(q,\mathrm{EQ}(\bar{\theta})\right)<\delta\right\}$, where $D\left(q,\mathrm{EQ}(\bar{\theta})\right)=\min_{q'\in\mathrm{EQ}(\bar{\theta})}\|q-q'\|$ is the Euclidean distance between $q$ and the equilibrium set $\mathrm{EQ}(\bar{\theta})$.⁵ We introduce the following assumption:

⁵Generically, the function $D(q,\hat{Q})=\min_{q'\in\hat{Q}}\|q-q'\|$ defines the Euclidean distance between $q$ and any strategy subset $\hat{Q}$.

Assumption 4.

For a fixed point belief $\bar{\theta}$ and the associated equilibrium set $\mathrm{EQ}(\bar{\theta})$, there exist $\epsilon,\delta>0$ such that the neighborhoods $N_{\epsilon}(\bar{\theta})$ and $N_{\delta}(\mathrm{EQ}(\bar{\theta}))$ satisfy:
(A3a) Local consistency: The fixed point belief $\bar{\theta}$ forms a consistent payoff estimate in the local neighborhood $N_{\delta}(\mathrm{EQ}(\bar{\theta}))$, that is, $[\bar{\theta}]\subseteq S^*(q)$ for any $q\in N_{\delta}(\mathrm{EQ}(\bar{\theta}))$.
(A3b) Local invariance: The neighborhood $N_{\delta}(\mathrm{EQ}(\bar{\theta}))$ is a locally invariant set of the strategy update, that is, $F(\theta,q)\subseteq N_{\delta}(\mathrm{EQ}(\bar{\theta}))$ for any $\theta\in N_{\epsilon}(\bar{\theta})$ and any $q\in N_{\delta}(\mathrm{EQ}(\bar{\theta}))$.

Theorem 2.

Let the learning dynamics ($\theta$-update) and ($q$-update) satisfy Assumptions 1–3. Then, a fixed point belief $\bar{\theta}$ and the associated equilibrium set $\mathrm{EQ}(\bar{\theta})$ are locally stable if Assumption 4 is satisfied.

From Theorem 1, we already know that Assumptions 1–3 guarantee the convergence of beliefs and strategies under local perturbations. We now discuss the role of Assumption 4 in ensuring local stability of a fixed point. Firstly, the local consistency condition (A3a) ensures that ($\theta$-update) keeps the beliefs close to $\bar{\theta}$. In particular, under this condition, any parameter in the support of $\bar{\theta}$ remains payoff-equivalent to $s^*$ for any strategy in a local neighborhood of $\mathrm{EQ}(\bar{\theta})$. Then, $\bar{\theta}$ forms a consistent estimate of players' payoffs not just at the fixed point strategy $\bar{q}$, but also when the strategy is locally perturbed around $\bar{q}$. Therefore, the Bayesian belief update keeps the beliefs of all parameters in $[\bar{\theta}]$ close to their respective probabilities in $\bar{\theta}$ when the strategies are in the local neighborhood, and eventually any parameter not in $[\bar{\theta}]$ is excluded by the learning dynamics.

Secondly, the local invariance condition (A3b) guarantees that the strategy sequence resulting from the strategy updates remains within the locally invariant neighborhood of the fixed point equilibrium set. Due to the dynamic interplay between the belief updates and strategy updates, we need both the local consistency and local invariance conditions to ensure that the strategy sequence of our learning dynamics does not leave the local neighborhood of $\mathrm{EQ}(\bar{\theta})$ and that the perturbed beliefs remain close to $\bar{\theta}$.

Finally, we demonstrate that any complete information fixed point must be locally stable.

Proposition 2.

Any complete information fixed point $(\theta^*,q^*)$ is locally stable.

This result implies that any complete information fixed point is robust to local perturbations of beliefs and strategies. It directly follows from Theorem 2, since any $(\theta^*,q^*)$ can be shown to satisfy Assumption 4. To see this, note that the complete information belief $\theta^*$ satisfies local consistency since it forms a consistent payoff estimate for any feasible strategy $q\in Q$ (and thus, in particular, in any local neighborhood of $q^*$). Moreover, the local invariance condition is satisfied because the set $Q$ is an invariant set of the strategy update for any belief $\theta$ (and thus for any belief in a local neighborhood of $\theta^*$).

On the other hand, other fixed points $(\bar{\theta},\bar{q})$ are not guaranteed to be locally stable unless they have a local neighborhood that satisfies local consistency and local invariance. In particular, consider the situation where a parameter $s\in[\bar{\theta}]$ is payoff-equivalent to $s^*$ only at the fixed point strategy $\bar{q}$ but not at strategies in a local neighborhood of $\bar{q}$. In this case, local consistency is not satisfied, and payoffs generated by any local perturbation of the strategy allow players to distinguish $s$ from $s^*$ with positive probability. Consequently, beliefs and strategies may leave the local neighborhood of $(\bar{\theta},\bar{q})$ with positive probability when the state is locally perturbed.

Theorem 2 and Proposition 2 show that complete information fixed points are robust to such local perturbations, but other fixed points may not be. While these two results do not rule out the possibility that learning may still converge to fixed points that do not have complete information, such fixed points are non-robust unless they are locally stable. In the next section, we further analyze conditions under which the learning dynamics converge to a complete information equilibrium.

3.3 Complete Learning

We say that learning is complete if the convergent fixed point strategy profile $\bar{q}$ is a complete information equilibrium (i.e. $\bar{q}=q^*$). We now answer two questions: When is learning complete? If learning is not complete (i.e. $\bar{q}\neq q^*$), can we still find the complete information equilibrium?

Firstly, if the convergent fixed point belief assigns probability 1 to a single parameter (i.e. the support set $[\bar{\theta}]$ is a singleton), then the unique parameter in $[\bar{\theta}]$ must be the true parameter $s^*$. In this case, the fixed point must be a complete information fixed point $(\theta^*,q^*)$, and thus learning is complete. Moreover, from Proposition 1, we know that learning is guaranteed to converge to a complete information fixed point if and only if the true parameter is identifiable at equilibrium.

Secondly, if the convergent fixed point belief assigns positive probability to more than one parameter (i.e. the support set $[\bar{\theta}]$ is not a singleton), then the fixed point belief does not have complete information of the true parameter. In this case, Proposition 3 below provides a sufficient condition on the fixed point belief and players' payoff functions that ensures the fixed point strategy profile is a complete information equilibrium. Here, learning is still complete, even though the belief does not identify the true parameter.

Proposition 3.

For a fixed point $(\bar{\theta},\bar{q})$, $\bar{q}=q^*$ if

  • (i) Local consistency. The fixed point belief $\bar{\theta}$ forms a consistent payoff estimate in a local neighborhood of $\bar{q}$, i.e. there exists $\xi>0$ such that $[\bar{\theta}]\subseteq S^*(q)$ for any $q\in N_{\xi}(\bar{q})$;

  • (ii) Payoff concavity. The payoff function $u_i^s(q_i,q_{-i})$ is concave in $q_i$ for all $q_{-i}\in Q_{-i}$, all $i\in I$ and all $s\in[\bar{\theta}]$.

For any fixed point $(\bar{\theta},\bar{q})$, since $\bar{q}_i$ is a best response to $\bar{q}_{-i}$, $\bar{q}_i$ is a local maximizer of the expected payoff function $\mathbb{E}_{\bar{\theta}}[u_i^s(q_i,\bar{q}_{-i})]$. The local consistency condition (i) in Proposition 3 ensures that the value of this expected payoff function is identical to that under the true parameter $s^*$ for any $q_i$ in a small neighborhood of $\bar{q}_i$. Therefore, $\bar{q}_i$ must be a local maximizer of the payoff function with the true parameter, $u_i^{s^*}(q_i,\bar{q}_{-i})$. Condition (ii) further ensures that payoffs are concave in $q_i$, and thus $\bar{q}_i$ must also be a global maximizer of $u_i^{s^*}(q_i,\bar{q}_{-i})$. Therefore, (i) and (ii) together ensure that the fixed point strategy $\bar{q}$ is a complete information equilibrium, although $\bar{\theta}$ may not fully identify the true parameter $s^*$.⁶

⁶We note that for continuous and differentiable utility functions, the local consistency condition is equivalent to requiring that the gradient of the expected utility function based on the fixed point belief be identical to the gradient of the true utility function in a local neighborhood of the fixed point, i.e. $\mathbb{E}_{\bar{\theta}}\left[\frac{\partial u_i^s(q)}{\partial q_i}\right]=\frac{\partial u_i^{s^*}(q)}{\partial q_i}$ for all $q\in N_{\xi}(\bar{q})$. In classical no-regret learning dynamics (e.g. Mertikopoulos and Zhou (2019)), the convergence of strategies requires that the payoffs are strictly concave and that the estimate of the gradient of the utility function is unbiased in each stage. In our belief-based no-regret learning ($\theta$-update) – (No-regret), the estimated gradient computed from the belief in each stage may not be consistent with the gradient of the true utility function. From Proposition 3, we can conclude that in strictly concave games, when players adopt belief-based no-regret learning, strategies converge to a complete information equilibrium if the gradient estimate is consistent in a local neighborhood of the fixed point.
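The value-based version of this argument can be written as a short chain of (in)equalities; this is a sketch of the reasoning above under the proposition's conditions, using that $\bar{q}_i$ maximizes the expected utility in $G(\bar{\theta})$:
\[
u_i^{s^*}(\bar{q}_i,\bar{q}_{-i})\;\stackrel{\text{(i)}}{=}\;\mathbb{E}_{\bar{\theta}}\!\left[u_i^{s}(\bar{q}_i,\bar{q}_{-i})\right]\;\geq\;\mathbb{E}_{\bar{\theta}}\!\left[u_i^{s}(q_i,\bar{q}_{-i})\right]\;\stackrel{\text{(i)}}{=}\;u_i^{s^*}(q_i,\bar{q}_{-i}),\qquad\forall q_i\in Q_i \text{ with } \|q_i-\bar{q}_i\|<\xi.
\]
Hence $\bar{q}_i$ is a local maximizer of $u_i^{s^*}(\cdot,\bar{q}_{-i})$; if $u_i^{s^*}$ is concave in $q_i$ (which condition (ii) covers whenever $s^*\in[\bar{\theta}]$), local maximality over the convex set $Q_i$ implies global maximality, so no player can profitably deviate from $\bar{q}$.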

To summarize, we can guarantee that learning is complete in two cases: either the convergent belief identifies a single parameter, or conditions (i) and (ii) in Proposition 3 are satisfied. When neither of these cases apply to the learning dynamics, we can conclude that the convergent (θ¯,q¯)(\bar{\theta},\bar{q}) is definitely not a complete information fixed point, but q¯\bar{q} may or may not be a complete information equilibrium.

Typically, literature on continuous games assumes that the payoff concavity condition is satisfied.777In continuous games, payoff concavity guarantees the existence of pure equilibrium, and thus is typically assumed (Rosen (1965)). In these games, if θ¯\bar{\theta} is not locally consistent, then there must exist at least one parameter s[θ¯]s\in[\bar{\theta}] such that ss is not payoff equivalent to ss^{*} in any local neighborhood of q¯\bar{q}. Then, local exploration, that is randomly perturbing strategies in a local neighborhood of q¯\bar{q}, can distinguish such ss, and thus leads learning to a new fixed point (θ¯,q¯)(\bar{\theta}^{\prime},\bar{q}^{\prime}) where θ¯\bar{\theta}^{\prime} is locally consistent. Since both conditions in Proposition 3 are satisfied for the new fixed point, we can conclude that q¯\bar{q}^{\prime} is a complete information Nash equilibrium. Therefore, in games with concave payoff functions, local exploration can find a complete information equilibrium.

More generally, when payoff concavity is not satisfied, q¯\bar{q}^{\prime} may not be a complete information equilibrium, and one needs a different approach to identify the true parameter in the support set [θ¯][\bar{\theta}^{\prime}]. Unfortunately, local exploration no longer helps since all parameters in [θ¯][\bar{\theta}^{\prime}] are payoff equivalent to ss^{*} in a local neighborhood of q¯\bar{q}^{\prime}. Instead, one needs to distinguish each pair of s,s[θ¯]s,s^{\prime}\in[\bar{\theta}^{\prime}] by exploring strategies for which these two parameters are not payoff equivalent, and such strategies may not lie in a local neighborhood of q¯\bar{q}^{\prime}.

3.4 Examples

We present three examples to further illustrate our results.

Example 1. (Cournot competition) A set of II firms produce an identical product and compete in a market. In each stage kk, firm ii’s strategy is their production level qik[0,3]q_{i}^{k}\in[0,3]. The price of the product is pk=αsβs(iIqik)+ϵsp^{k}=\alpha^{s}-\beta^{s}\left(\sum_{i\in I}q_{i}^{k}\right)+\epsilon^{s}, where s=(αs,βs)s=\left(\alpha^{s},\beta^{s}\right) is the unknown parameter vector in the price function, and ϵ\epsilon is a random variable with zero mean. The set of parameter vectors is S={s1,s2}S=\{s_{1},s_{2}\}, where s1=(2,1)s_{1}=\left(2,1\right) and s2=(4,3)s_{2}=\left(4,3\right). The true parameter is s=s1s^{*}=s_{1}. The marginal cost of each firm is 0. The payoff of firm ii in stage kk is yik=qik(αsβs(iIqik)+ϵs)y^{k}_{i}=q_{i}^{k}\left(\alpha^{s}-\beta^{s}\left(\sum_{i\in I}q_{i}^{k}\right)+\epsilon^{s}\right) for each sSs\in S.

The information platform does not need to observe the production level or the payoff of each firm iIi\in I. Instead, the platform can update the belief θk\theta^{k} based on the total production q~k=iIqik\tilde{q}^{k}=\sum_{i\in I}q_{i}^{k}, and the price pk=yik/qikp^{k}=y^{k}_{i}/q^{k}_{i} for all iIi\in I. This is due to the fact that ϕs(yk|qk)=ϕ~s(pk|q~k)\phi^{s}(y^{k}|q^{k})=\tilde{\phi}^{s}(p^{k}|\tilde{q}^{k}) for all sSs\in S, where ϕ~s(pk|q~k)\tilde{\phi}^{s}(p^{k}|\tilde{q}^{k}) is the probability density function of the price given the total production. The belief update given (q~k,pk)k=1(\tilde{q}^{k},p^{k})_{k=1}^{\infty} is as follows:

θkt+1(s)\displaystyle\theta^{k_{t+1}}(s) =θkt(s)k=ktkt+11ϕ~s(pk|q~k)sSθkt(s)k=ktkt+11ϕ~s(pk|q~k),sS,t=1,2,.\displaystyle=\frac{\theta^{k_{t}}(s)\prod_{k=k_{t}}^{k_{t+1}-1}\tilde{\phi}^{s}(p^{k}|\tilde{q}^{k})}{\sum_{s^{\prime}\in S}\theta^{k_{t}}(s^{\prime})\prod_{k=k_{t}}^{k_{t+1}-1}\tilde{\phi}^{s^{\prime}}(p^{k}|\tilde{q}^{k})},\quad\forall s\in S,\quad\forall t=1,2,\dots.

In this game, the expected payoff function 𝔼θ[uis(qi,qi)]=sSθ(s)qi(αsβs(iIqi))\mathbb{E}_{\theta}[u_{i}^{s}(q_{i},q_{-i})]=\sum_{s\in S}\theta(s)q_{i}\left(\alpha^{s}-\beta^{s}\left(\sum_{i\in I}q_{i}\right)\right) is continuous in θ\theta and concave in qiq_{i} for all iIi\in I. Thus, the best response correspondence BRi(θ,qi)BR_{i}(\theta,q_{-i}) is upper-hemicontinuous in θ\theta and qiq_{-i} for each iIi\in I following from the Berge’s maximum theorem. As a result, the strategy updates (Sequential-BR) and (Inertial-BR) satisfy Assumption 1. Additionally, since the utility function 𝔼θ[uis(qi,qi)]\mathbb{E}_{\theta}[u_{i}^{s}(q_{i},q_{-i})] is continuously differentiable with respect to the strategy profile qq, the update (No-regret) is also upper-hemicontinuous in θ\theta and qq for all iIi\in I, and thus satisfies Assumption 1. Furthermore, for any θ\theta, the Cournot game G(θ)G(\theta) has a concave potential function given by:

Ψθ(q)=𝔼θ[αs](iIqi)𝔼θ[βs](iIqi2)𝔼θ[βs](1i<j|I|qiqj).\displaystyle\Psi_{\theta}(q)=\mathbb{E}_{\theta}[\alpha^{s}]\left(\sum_{i\in I}q_{i}\right)-\mathbb{E}_{\theta}[\beta^{s}]\left(\sum_{i\in I}q_{i}^{2}\right)-\mathbb{E}_{\theta}[\beta^{s}]\left(\sum_{1\leq i<j\leq|I|}q_{i}q_{j}\right).

Therefore, the strategy updates (Sequential-BR), (Inertial-BR), and (No-regret) with respect to a static belief θ\theta converge to a Nash equilibrium in EQ(θ)\mathrm{EQ}(\theta), i.e. (Sequential-BR), (Inertial-BR), and (No-regret) satisfy Assumption 2 (Table 1). Additionally, since for any θ\theta the equilibrium set EQ(θ)\mathrm{EQ}(\theta) is a singleton, Assumption 3 is satisfied.
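As a quick numerical sanity check on this potential function (a sketch that is not part of the original analysis), the snippet below verifies that the partial derivative of Ψ_θ(q) with respect to each q_i coincides with that of the expected payoff 𝔼_θ[u_i^s(q)] at randomly sampled strategy profiles in the two-firm case; the cross term is taken over pairs i < j, and the belief θ = (0.5, 0.5) is an arbitrary choice.

```python
import numpy as np

# Minimal sanity check (illustrative, not from the paper): Psi_theta below uses the
# strict ordering 1 <= i < j <= |I| in the cross term, and its partial derivatives
# match those of the expected payoffs E_theta[u_i^s(q)].
alphas = np.array([2.0, 4.0])    # alpha^{s1}, alpha^{s2}
betas = np.array([1.0, 3.0])     # beta^{s1},  beta^{s2}
theta = np.array([0.5, 0.5])     # an arbitrary belief over S = {s1, s2}
Ea, Eb = theta @ alphas, theta @ betas

def expected_payoff(i, q):
    """E_theta[u_i^s(q)] for the Cournot game with zero marginal cost."""
    return q[i] * (Ea - Eb * q.sum())

def potential(q):
    """Psi_theta(q) with the cross term over pairs i < j only."""
    cross = sum(q[i] * q[j] for i in range(len(q)) for j in range(i + 1, len(q)))
    return Ea * q.sum() - Eb * (q ** 2).sum() - Eb * cross

def num_grad(f, q, i, h=1e-6):
    e = np.zeros_like(q); e[i] = h
    return (f(q + e) - f(q - e)) / (2 * h)

rng = np.random.default_rng(0)
for _ in range(5):
    q = rng.uniform(0.0, 3.0, size=2)
    for i in range(2):
        assert abs(num_grad(lambda x: expected_payoff(i, x), q, i)
                   - num_grad(potential, q, i)) < 1e-4
print("Psi_theta is an exact potential of G(theta) at the sampled points.")
```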

From Theorem 1, the states of the learning dynamics with strategy updates (Sequential-BR), (Inertial-BR), and (No-regret) converge to a fixed point with probability 1. The complete information fixed point is θ=(1,0)\theta^{*}=\left(1,0\right) and q=(2/3,2/3)q^{*}=\left(2/3,2/3\right). Additionally, θ=(0.5,0.5)\theta^{\dagger}=\left(0.5,0.5\right) and q=(0.5,0.5)EQ(θ)q^{\dagger}=\left(0.5,0.5\right)\in\mathrm{EQ}(\theta^{\dagger}) is also a fixed point since [θ]=S(q)={s1,s2}[\theta^{\dagger}]=S^{*}(q^{\dagger})=\{s_{1},s_{2}\}. Note that at qq^{\dagger}, the parameter s2s_{2} leads to the same price distribution as s1s_{1}, and thus is payoff equivalent. In fact, since any θθ\theta\neq\theta^{*} must include s2s_{2} in the support set, one can show that q=(0.5,0.5)q^{\dagger}=\left(0.5,0.5\right) is the only strategy profile for which s1s_{1} and s2s_{2} are payoff-equivalent. Thus, there are no other fixed points apart from (θ,q)\left(\theta^{*},q^{*}\right) and (θ,q)\left(\theta^{\dagger},q^{\dagger}\right); i.e. Ω={(θ,q),(θ,q)}\Omega=\left\{(\theta^{*},q^{*}),(\theta^{\dagger},q^{\dagger})\right\}.

Since the complete information fixed point is not the unique fixed point, no fixed point is globally stable (Proposition 1). We know from Proposition 2 that the complete information fixed point θ=(1,0)\theta^{*}=(1,0), q=(2/3,2/3)q^{*}=\left(2/3,2/3\right) is locally stable. In contrast, the other fixed point θ=(0.5,0.5)\theta^{\dagger}=\left(0.5,0.5\right) and q=(0.5,0.5)q^{\dagger}=\left(0.5,0.5\right) does not satisfy the local consistency condition since the two parameters s1s_{1} and s2s_{2} can be distinguished when the strategy is perturbed in any local neighborhood of qq^{\dagger}. Since the utility functions are concave, local exploration around qq^{\dagger} can identify the true parameter and lead learning to the complete information equilibrium.
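The following minimal simulation sketch illustrates the joint belief–strategy dynamics of Example 1 with two firms. The Gaussian noise with standard deviation 0.5, the alternating (sequential) best-response schedule, and the initial belief and production levels are illustrative assumptions that the example leaves unspecified; under these choices the states approach the complete information fixed point θ* = (1, 0), q* = (2/3, 2/3).

```python
import numpy as np

# A minimal simulation sketch of Example 1 with two firms and an alternating
# (sequential) best-response schedule. The Gaussian noise with sigma = 0.5, the
# initial belief, and the initial production levels are illustrative assumptions.
rng = np.random.default_rng(1)
S = [(2.0, 1.0), (4.0, 3.0)]           # s1 = (alpha, beta), s2
sigma = 0.5
theta = np.array([0.5, 0.5])           # initial belief theta^1
q = np.array([1.0, 1.5])               # initial production levels

def best_response(i, q, theta):
    Ea = sum(t * a for t, (a, b) in zip(theta, S))
    Eb = sum(t * b for t, (a, b) in zip(theta, S))
    return np.clip((Ea - Eb * q[1 - i]) / (2 * Eb), 0.0, 3.0)

def likelihood(p, total, s):
    a, b = S[s]
    return np.exp(-(p - (a - b * total)) ** 2 / (2 * sigma ** 2))

for k in range(300):
    i = k % 2                                            # firms adjust in alternation
    q[i] = best_response(i, q, theta)
    total = q.sum()
    p = S[0][0] - S[0][1] * total + sigma * rng.standard_normal()   # price under s* = s1
    post = theta * np.array([likelihood(p, total, s) for s in range(len(S))])
    theta = post / post.sum()                            # Bayesian update from (total, price)

print("belief   ->", np.round(theta, 3))                 # approaches (1, 0)
print("strategy ->", np.round(q, 3))                     # approaches (2/3, 2/3)
```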

Example 2. (Zero-sum game) Two players i{1,2}i\in\{1,2\} repeatedly play a zero-sum game with identical convex and closed strategy sets Q1=Q2=[0,6]Q_{1}=Q_{2}=[0,6]. For any strategy profile qq, the payoffs of the two players are y1=-y2=vs(q)+ϵsy_{1}=-y_{2}=v^{s}(q)+\epsilon^{s}, where

vs(q)=(max(|q1kq2k|,s)s)22(q1k)2+12(q2k2)2,\displaystyle v^{s}(q)=\left(\max\left(|q^{k}_{1}-q^{k}_{2}|,s\right)-s\right)^{2}-2(q^{k}_{1})^{2}+\frac{1}{2}(q_{2}^{k}-2)^{2},

and sS={1,3,5}s\in S=\{1,3,5\} is the unknown parameter. The true parameter s=3s^{*}=3. Belief is updated by an information platform based on the strategy profiles (qk)k=1(q^{k})_{k=1}^{\infty} and the realized payoffs (y1k,y2k)k=1(y_{1}^{k},y_{2}^{k})_{k=1}^{\infty} as in (θ\theta-update).

For any θ\theta, the game G(θ)G(\theta) is a zero-sum game with expected value function 𝔼θ[vs(q)]\mathbb{E}_{\theta}[v^{s}(q)] that is continuous in θ\theta, concave in q1q_{1} and convex in q2q_{2}. Thus, the strategy updates (Inertial-BR) and (No-regret) satisfy Assumptions 1 and 2 (Table 1). Moreover, for any θ\theta such that θ(1)=0\theta(1)=0, the unique equilibrium is (0,2)(0,2). For any θ\theta such that θ(1)>0\theta(1)>0, the unique equilibrium strategy profile is (0,(2+2θ(1))/(2θ(1)+1))(0,(2+2\theta(1))/(2\theta(1)+1)). Therefore, Assumption 3 is satisfied.
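As a numerical check (not from the paper), the sketch below recovers player 2's best response to q_1 = 0 by a grid search over [0, 6] for a few beliefs with θ(1) > 0, and compares it with the closed-form expression (2+2θ(1))/(2θ(1)+1); the particular beliefs and the grid resolution are arbitrary.

```python
import numpy as np

# Grid-search check (illustrative) of player 2's best response to q1 = 0 in Example 2.
def v(q1, q2, s):
    return (max(abs(q1 - q2), s) - s) ** 2 - 2 * q1 ** 2 + 0.5 * (q2 - 2) ** 2

grid = np.linspace(0.0, 6.0, 6001)
for t1 in (0.1, 0.3, 0.7):                     # theta(1); remaining mass split over {3, 5}
    theta = {1: t1, 3: (1 - t1) / 2, 5: (1 - t1) / 2}
    values = sum(p * np.array([v(0.0, q2, s) for q2 in grid]) for s, p in theta.items())
    q2_grid = grid[np.argmin(values)]          # player 2 minimizes E_theta[v(0, q2)]
    q2_formula = (2 + 2 * t1) / (2 * t1 + 1)
    print(f"theta(1) = {t1:.1f}:  grid argmin = {q2_grid:.3f},  closed form = {q2_formula:.3f}")
```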

From Theorem 1, the sequence of states converges to a fixed point w.p. 1. The complete information fixed point is given by θ=(0,1,0)\theta^{*}=\left(0,1,0\right) with EQ(θ)={(0,2)}\mathrm{EQ}(\theta^{*})=\{(0,2)\}. Apart from the complete information fixed point, any θ{θΔ(S)|θ(1)=0}\theta^{\dagger}\in\{\theta\in\Delta(S)|\theta(1)=0\} together with q=(0,2)q^{\dagger}=(0,2) is also a fixed point. This is because for any belief θ\theta^{\dagger} that assigns zero probability to s=1s=1, q1=0q^{\dagger}_{1}=0 and q2=2q^{\dagger}_{2}=2 is an equilibrium, and the two parameters s=3s=3 and s=5s=5 are payoff equivalent at qq^{\dagger}.

We can check that conditions (i) and (ii) in Proposition 3 are satisfied by any fixed point (θ,q)\left(\theta^{\dagger},q^{\dagger}\right). Thus, any fixed point strategy is a complete information equilibrium although θ\theta^{\dagger} is not a complete information belief. Therefore, learning is guaranteed to be complete in this game.
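The reason is visible from the value function: whenever |q_1 − q_2| < 3, the parameters s = 3 and s = 5 induce identical payoffs, so any belief supported on {3, 5} is locally consistent at q† = (0, 2), while s = 1 is not payoff equivalent there. The short sketch below confirms this numerically; the neighborhood radius 0.5 is an arbitrary illustrative choice.

```python
import numpy as np

# Sketch of condition (i) of Proposition 3 at q_dagger = (0, 2) in Example 2: for all q
# with |q1 - q2| < 3, the parameters s = 3 and s = 5 give identical payoffs, so any
# belief supported on {3, 5} is locally consistent; s = 1 is not payoff equivalent there.
def v(q1, q2, s):
    return (max(abs(q1 - q2), s) - s) ** 2 - 2 * q1 ** 2 + 0.5 * (q2 - 2) ** 2

rng = np.random.default_rng(2)
q_dagger = np.array([0.0, 2.0])
for _ in range(1000):
    q = np.clip(q_dagger + rng.uniform(-0.5, 0.5, size=2), 0.0, 6.0)   # a 0.5-neighborhood
    assert v(*q, 3) == v(*q, 5)                                        # payoff equivalent
print("at (0, 2.4): v^3 =", v(0.0, 2.4, 3), " v^5 =", v(0.0, 2.4, 5),
      " but v^1 =", v(0.0, 2.4, 1))
```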

Since the complete information fixed point is not unique, no fixed point is globally stable. Moreover, by setting ϵ=1/2\epsilon=1/2 and δ=6\delta=6, we can check that all fixed points in Ω\Omega satisfy the two conditions in Assumption 4, and thus are locally stable (Theorem 2).

Example 3. (Investment game) Two players repeatedly play an investment game. In each stage kk, the strategy qik[0,1]q_{i}^{k}\in[0,1] is the non-negative level of investment of player ii. Given the strategy profile qk=(q1k,q2k)q^{k}=\left(q^{k}_{1},q^{k}_{2}\right), the return of a unit investment is randomly realized according to rk=s+q1k+q2k+ϵsr^{k}=s+q^{k}_{1}+q^{k}_{2}+\epsilon^{s}, where sS={0,1,2}s\in S=\{0,1,2\} is the unknown parameter that represents the average baseline return and ϵs\epsilon^{s} is the noise term. The true parameter is s=1s^{*}=1. The stage cost of investment for each player is 3(qik)23\left(q_{i}^{k}\right)^{2}. Therefore, the payoff of each player iIi\in I is yik=qik(s+q1k+q2k+ϵs)3(qik)2=qik(s2qik+qik+ϵs)y^{k}_{i}=q_{i}^{k}(s+q^{k}_{1}+q^{k}_{2}+\epsilon^{s})-3\left(q_{i}^{k}\right)^{2}=q_{i}^{k}(s-2q_{i}^{k}+q^{k}_{-i}+\epsilon^{s}) for all sSs\in S. Following similar argument as in Example 1, the information platform does not need to observe the strategy of each player and their realized payoffs. Instead, the platform can update the belief θk\theta^{k} based on the total investment q~k=q1k+q2k\tilde{q}^{k}=q^{k}_{1}+q^{k}_{2} and the realized unit investment return rkr^{k} in each stage kk.

For any θ\theta, the expected utility function 𝔼θ[uis(q)]\mathbb{E}_{\theta}[u_{i}^{s}(q)] is continuous in θ\theta and qq. Moreover, the game G(θ)G(\theta) is a supermodular game (i.e. 2𝔼θ[uis(q)]/qiqj=1>0\partial^{2}\mathbb{E}_{\theta}[u_{i}^{s}(q)]/\partial q_{i}\partial q_{j}=1>0), and it is also dominance solvable with unique equilibrium strategy profile EQ(θ)={(𝔼θ[s]/3,𝔼θ[s]/3)}\mathrm{EQ}(\theta)=\{(\mathbb{E}_{\theta}[s]/3,\mathbb{E}_{\theta}[s]/3)\}. Therefore, all three best response dynamics (Simultaneous-BR) – (Inertial-BR) satisfy Assumptions 1 and 2 (Table 1). Since the equilibrium set is a singleton set for all θ\theta, Assumption 3 is also satisfied. Thus, states converge to a fixed point with probability 1. In this game, since S(q)={s=1}S^{*}(q)=\{s^{*}=1\} for any qQq\in Q, the unique fixed point is the complete information fixed point, i.e. Ω={(θ,q)=((0,1,0),(1/3,1/3))}\Omega=\{\left(\theta^{*},q^{*}\right)=\left(\left(0,1,0\right),\left(1/3,1/3\right)\right)\}. From Proposition 1, we know that this complete information fixed point is globally stable.
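Since G(θ) is dominance solvable, iterating best responses under any fixed belief converges to the unique equilibrium 𝔼_θ[s]/3. The sketch below (with an arbitrary belief and starting point, not taken from the paper) illustrates this contraction.

```python
import numpy as np

# Iterated simultaneous best responses for Example 3 under a fixed belief theta
# (the belief and the starting point are arbitrary): the best-response map is a
# contraction, so the iterates approach the unique equilibrium E_theta[s] / 3.
S = np.array([0.0, 1.0, 2.0])
theta = np.array([0.2, 0.5, 0.3])
Es = theta @ S

q = np.array([1.0, 0.0])                              # arbitrary initial investments
for _ in range(50):
    q = np.clip((Es + q[::-1]) / 4.0, 0.0, 1.0)       # q_i <- argmax_{q_i} E_theta[u_i^s]
print("iterated best response:", np.round(q, 4), "   closed form:", round(Es / 3.0, 4))
```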

4 Proofs of Main Results

In this section, we present the proofs of our convergence and local stability results (Theorems 1 and 2). Recall that in (θ\theta-update), the beliefs are updated at a random subsequence of stages (kt)t=1\left(k_{t}\right)_{t=1}^{\infty}. For ease of presentation in our proof, we introduce an auxiliary belief sequence (θ~k)k=1\left(\tilde{\theta}^{k}\right)_{k=1}^{\infty} that is updated in every stage:

θ~1=θ1, and θ~k+1(s)=θ~k(s)ϕs(yk|qk)sSθ~k(s)ϕs(yk|qk),sS,k=1,2,\displaystyle\tilde{\theta}^{1}=\theta^{1},\text{ and }~{}\tilde{\theta}^{k+1}(s)=\frac{\tilde{\theta}^{k}(s)\phi^{s}(y^{k}|q^{k})}{\sum_{s^{\prime}\in S}\tilde{\theta}^{k}(s^{\prime})\phi^{s^{\prime}}(y^{k}|q^{k})},\quad\forall s\in S,\quad\forall k=1,2,\dots (8)

From (θ\theta-update), we know that

θk={θ~k,if k=kt,t=1,2,,θk1,otherwise.\displaystyle\theta^{k}=\left\{\begin{array}[]{ll}\tilde{\theta}^{k},&\quad\text{if }k=k_{t},\quad\forall t=1,2,\dots,\\ \theta^{k-1},&\quad\text{otherwise}.\end{array}\right. (11)

We can also write the auxiliary belief ratio as follows:

θ~k(s)θ~k(s)=θ~1(s)θ~1(s)j=1k1ϕs(yj|qj)ϕs(yj|qj)\displaystyle\frac{\tilde{\theta}^{k}(s)}{\tilde{\theta}^{k}(s^{*})}=\frac{\tilde{\theta}^{1}(s)}{\tilde{\theta}^{1}(s^{*})}\cdot\prod_{j=1}^{k-1}\frac{\phi^{s}(y^{j}|q^{j})}{\phi^{s^{*}}(y^{j}|q^{j})} (12)

Since the payoffs and strategies are public information, we show in the following lemma that the auxiliary belief ratio is a martingale sequence. The proof of Lemma 1 is included in Appendix A.

Lemma 1.

For any sSs\in S, the sequence of belief ratios (θ~k(s)θ~k(s))k=1\left(\frac{\tilde{\theta}^{k}(s)}{\tilde{\theta}^{k}(s^{*})}\right)_{k=1}^{\infty} is a martingale sequence.

Our proofs of the convergence and stability results build on the martingale property of the belief ratio. In particular, we utilize the martingale property to show the convergence of beliefs in Theorem 1. In our proof of local stability in Theorem 2, we rely on the martingale property to compute an upper bound on the probability that beliefs leave a local ϵ¯\bar{\epsilon}-neighborhood of θ¯\bar{\theta}, and derive conditions on the initial belief to ensure that this upper bound does not exceed 1γ1-\gamma, as required by the definition of local stability (Definition 4).
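The martingale property rests on the fact that, under the true parameter, the one-stage likelihood ratio has conditional expectation equal to one. The following Monte Carlo sketch illustrates this for a Gaussian payoff density (an assumption; the two price means correspond to Example 1 with the total production held at 1.2 and noise standard deviation 0.5).

```python
import numpy as np

# Monte Carlo sketch of the key step behind Lemma 1: when y is drawn from the true
# density phi^{s*}(.|q), the likelihood ratio phi^{s}(y|q) / phi^{s*}(y|q) has mean one.
# Gaussian densities are an assumption; the means are the Example 1 price means at a
# total production of 1.2, with sigma = 0.5.
rng = np.random.default_rng(3)
sigma = 0.5
mu_star, mu_s = 2.0 - 1.0 * 1.2, 4.0 - 3.0 * 1.2        # 0.8 under s1 = s*, 0.4 under s2
y = mu_star + sigma * rng.standard_normal(2_000_000)    # payoffs drawn under s*

def pdf(y, mu):
    return np.exp(-(y - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

ratio = pdf(y, mu_s) / pdf(y, mu_star)
print("E[ phi^s / phi^{s*} ] ~", round(float(ratio.mean()), 4))   # close to 1
```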

4.1 Proof of convergence

We prove Theorem 1 in three steps: Firstly, we prove that the sequence of beliefs (θk)k=1\left(\theta^{k}\right)_{k=1}^{\infty} converges to a fixed point belief θ¯Δ(S)\bar{\theta}\in\Delta\left(S\right) w.p. 1 by applying the martingale convergence theorem (Lemma 2). Secondly, we show that under Assumptions 1 and 2, the strategies (qk)k=1\left(q^{k}\right)_{k=1}^{\infty} in our learning dynamics also converge. This convergent strategy is an equilibrium corresponding to the fixed point belief θ¯\bar{\theta} (Lemma 3). Finally, we prove that the belief of any sSs\in S that is not payoff-equivalent to ss^{*} given q¯\bar{q} must converge to 0 with rate of convergence governed by (5) (Lemma 4). Hence, we can conclude that beliefs and strategies induced by the learning dynamics converge to a fixed point (θ¯,q¯)\left(\bar{\theta},\bar{q}\right) that satisfies (4) w.p. 1.

Lemma 2.

limkθk=θ¯\lim_{k\to\infty}\theta^{k}=\bar{\theta} w.p. 1, where θ¯Δ(S)\bar{\theta}\in\Delta(S).

Lemma 2 directly follows from Lemma 1. We include the proof in Appendix A.

We next prove the convergence of strategies. For every K=1,2,K=1,2,\dots, we construct an auxiliary strategy sequence (q^Kj)j=1\left(\hat{q}^{j}_{K}\right)_{j=1}^{\infty} such that q^Kj\hat{q}^{j}_{K} is identical to qjq^{j} in the original sequence generated by the learning dynamics up to stage KK (i.e. q^Kj=qj\hat{q}^{j}_{K}=q^{j} for all j=1,,Kj=1,\dots,K), and the remaining strategies (q^Kj)j=K+1\left(\hat{q}^{j}_{K}\right)_{j=K+1}^{\infty} are induced by updates associated with the fixed point belief θ¯\bar{\theta} (instead of the repeatedly updated belief sequence (θk)k=K+1\left(\theta^{k}\right)_{k=K+1}^{\infty}). In particular, for each jKj\geq K, q^Kj+1\hat{q}^{j+1}_{K} is obtained as follows:

q^Kj+1=argminqF(θ¯,q^Kj)qq~j+1,such thatq~j+1=argminq~F(θ¯,qj)q~qj+1.\displaystyle\hat{q}^{j+1}_{K}=\operatorname*{arg\,min}_{q\in F(\bar{\theta},\hat{q}^{j}_{K})}\|q-\tilde{q}^{j+1}\|,\quad\text{such that}\quad\tilde{q}^{j+1}=\operatorname*{arg\,min}_{\tilde{q}\in F(\bar{\theta},q^{j})}\|\tilde{q}-q^{j+1}\|. (13)

From (13), q^Kj+1\hat{q}^{j+1}_{K} is constructed in two steps: Firstly, given the original strategy qjq^{j} and the fixed point belief θ¯\bar{\theta}, we compute q~j+1\tilde{q}^{j+1} as the strategy updated from qjq^{j} with respect to θ¯\bar{\theta} (i.e. q~j+1F(θ¯,qj)\tilde{q}^{j+1}\in F(\bar{\theta},q^{j})) that is the closest to the original strategy qj+1q^{j+1}. Secondly, q^Kj+1\hat{q}^{j+1}_{K} is updated from the auxiliary strategy q^Kj\hat{q}^{j}_{K} in stage jj with the fixed point belief θ¯\bar{\theta} (i.e. q^Kj+1F(θ¯,q^Kj)\hat{q}_{K}^{j+1}\in F(\bar{\theta},\hat{q}_{K}^{j})), and q^Kj+1\hat{q}^{j+1}_{K} is the closest to q~j+1\tilde{q}^{j+1} among all strategies in the set F(θ¯,q^Kj)F(\bar{\theta},\hat{q}^{j}_{K}).

The two-step construction ensures that (q^Kj)j=K+1\left(\hat{q}^{j}_{K}\right)_{j=K+1}^{\infty} has the smallest Euclidean distance to the original strategy sequence among all strategy sequences that are induced by updates with θ¯\bar{\theta}. In fact, we can show that, as the beliefs converge to θ¯\bar{\theta} (Lemma 2), the distance between the auxiliary strategy sequence and the original strategy sequence also converges to zero as KK\to\infty under Assumption 1. Additionally, under Assumption 2, the auxiliary strategy sequence must converge to an equilibrium q¯EQ(θ¯)\bar{q}\in\mathrm{EQ}(\bar{\theta}), and thus the original sequence (qk)k=1\left(q^{k}\right)_{k=1}^{\infty} also converges to q¯\bar{q}.
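The sketch below illustrates this construction in the simplest case where the strategy update is single valued: an inertial best response for Example 1 with step size 0.5 and Gaussian noise with standard deviation 2 (all illustrative assumptions). Since F(θ, ·) is then a function, the auxiliary sequence simply iterates F(θ̄, ·) from q^K onward, and the printed maximal gap between the original and auxiliary trajectories shrinks as K grows, which is the property used below.

```python
import numpy as np

# Sketch of the construction in (13) when the strategy update is single valued: an
# inertial best response for Example 1 with step size eta = 0.5 and Gaussian noise with
# sigma = 2 (illustrative assumptions). F(theta, .) is then a function, so the auxiliary
# sequence simply iterates F(theta_bar, .) from q^K onward.
rng = np.random.default_rng(4)
S, sigma, eta = [(2.0, 1.0), (4.0, 3.0)], 2.0, 0.5

def F(theta, q):
    Ea = theta[0] * S[0][0] + theta[1] * S[1][0]
    Eb = theta[0] * S[0][1] + theta[1] * S[1][1]
    br = np.clip((Ea - Eb * q[::-1]) / (2 * Eb), 0.0, 3.0)
    return (1 - eta) * q + eta * br                       # inertial best response

# one realization of the coupled dynamics
theta, q = np.array([0.5, 0.5]), np.array([1.0, 1.5])
traj_q, traj_theta = [q.copy()], [theta.copy()]
for k in range(400):
    q = F(theta, q)
    p = S[0][0] - S[0][1] * q.sum() + sigma * rng.standard_normal()   # price under s* = s1
    lik = np.array([np.exp(-(p - (a - b * q.sum())) ** 2 / (2 * sigma ** 2)) for a, b in S])
    theta = theta * lik / (theta * lik).sum()
    traj_q.append(q.copy()); traj_theta.append(theta.copy())

theta_bar = traj_theta[-1]                                # proxy for the limit belief
for K in (5, 40, 160):
    q_hat, gap = traj_q[K].copy(), 0.0
    for ell in range(1, 40):
        q_hat = F(theta_bar, q_hat)                       # auxiliary updates with theta_bar
        gap = max(gap, float(np.linalg.norm(traj_q[K + ell] - q_hat)))
    print(f"K = {K:3d}:   max_ell || q^(K+ell) - qhat_K^(K+ell) || = {gap:.4f}")
```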

Lemma 3.

The sequence of strategies (qk)k=1(q^{k})_{k=1}^{\infty} converges with probability 1. Moreover, the convergent strategy profile q¯EQ(θ¯)\bar{q}\in\mathrm{EQ}(\bar{\theta}).

Proof of Lemma 3. Consider any sequence of states (θj,qj)j=1(\theta^{j},q^{j})_{j=1}^{\infty} induced by the learning dynamics (θ\theta-update) – (qq-update). For any KK, we construct the auxiliary strategy sequence (q^Kj)j=1(\hat{q}^{j}_{K})_{j=1}^{\infty} from (13). For any j>Kj>K, we have:

qj+1q~j+1=D(qj+1,F(θ¯,qj)),q~j+1q^Kj+1=D(q~j+1,F(θ¯,q^Kj)).\displaystyle\|q^{j+1}-\tilde{q}^{j+1}\|=D\left(q^{j+1},F(\bar{\theta},q^{j})\right),~{}\|\tilde{q}^{j+1}-\hat{q}^{j+1}_{K}\|=D\left(\tilde{q}^{j+1},F(\bar{\theta},\hat{q}^{j}_{K})\right). (14)

We next show by mathematical induction that for any 1\ell\geq 1, limKqK+q^KK+=0\lim_{K\to\infty}\|q^{K+\ell}-\hat{q}^{K+\ell}_{K}\|=0. To begin with, for =1\ell=1, we have

qK+1q^KK+1qK+1q~K+1+q~K+1q^KK+1\displaystyle\|q^{K+1}-\hat{q}^{K+1}_{K}\|\leq\|q^{K+1}-\tilde{q}^{K+1}\|+\|\tilde{q}^{K+1}-\hat{q}^{K+1}_{K}\|
=(14)\displaystyle\stackrel{{\scriptstyle\eqref{eq:dist}}}{{=}} D(qK+1,F(θ¯,qK))+D(q~K+1,F(θ¯,q^KK)).\displaystyle D\left(q^{K+1},F\left(\bar{\theta},q^{K}\right)\right)+D\left(\tilde{q}^{K+1},F\left(\bar{\theta},\hat{q}^{K}_{K}\right)\right). (15)

Since θk\theta^{k} converges to θ¯\bar{\theta} (Lemma 2), F(θ,q)F(\theta,q) is upper hemicontinuous in θ\theta (Assumption 1), and qK+1F(θK+1,qK)q^{K+1}\in F(\theta^{K+1},q^{K}), we know that limKD(qK+1,F(θ¯,qK))=0\lim_{K\to\infty}D\left(q^{K+1},F\left(\bar{\theta},q^{K}\right)\right)=0. Additionally, since q^KK=qK\hat{q}^{K}_{K}=q^{K} and q~K+1F(θ¯,qK)\tilde{q}^{K+1}\in F(\bar{\theta},q^{K}), D(q~K+1,F(θ¯,q^KK))=0D\left(\tilde{q}^{K+1},F\left(\bar{\theta},\hat{q}^{K}_{K}\right)\right)=0. Therefore, limKqK+1q^K+1=0\lim_{K\to\infty}\|q^{K+1}-\hat{q}^{K+1}\|=0.

Now, assume that limKqK+q^KK+=0\lim_{K\to\infty}\|q^{K+\ell}-\hat{q}_{K}^{K+\ell}\|=0 for some 1\ell\geq 1, we need to show that limKqK++1q^KK++1=0\lim_{K\to\infty}\|q^{K+\ell+1}-\hat{q}_{K}^{K+\ell+1}\|=0. Similar to (15), we have

qK++1q^KK++1D(qK++1,F(θ¯,qK+))+D(q~K++1,F(θ¯,q^KK+)).\displaystyle\|q^{K+\ell+1}-\hat{q}_{K}^{K+\ell+1}\|\leq D\left(q^{K+\ell+1},F\left(\bar{\theta},q^{K+\ell}\right)\right)+D\left(\tilde{q}^{K+\ell+1},F\left(\bar{\theta},\hat{q}_{K}^{K+\ell}\right)\right).

Analogous to =1\ell=1, since F(θ,q)F(\theta,q) is upper hemicontinuous in θ\theta, limKD(qK++1,F(θ¯,qK+))=0\lim_{K\to\infty}D\left(q^{K+\ell+1},F\left(\bar{\theta},q^{K+\ell}\right)\right)=0. Additionally, since limKqK+q^KK+=0\lim_{K\to\infty}\|q^{K+\ell}-\hat{q}^{K+\ell}_{K}\|=0, F(θ,q)F(\theta,q) is upper hemicontinuous in qq (Assumption 1), and q~K++1F(θ¯,qK+)\tilde{q}^{K+\ell+1}\in F\left(\bar{\theta},q^{K+\ell}\right), we know that limKD(q~K++1,F(θ¯,q^KK+))=0\lim_{K\to\infty}D\left(\tilde{q}^{K+\ell+1},F\left(\bar{\theta},\hat{q}_{K}^{K+\ell}\right)\right)=0. Therefore, we have limKqK++1q^KK++1=0\lim_{K\to\infty}\|q^{K+\ell+1}-\hat{q}^{K+\ell+1}_{K}\|=0. By mathematical induction, we conclude that for any 1\ell\geq 1, limKqK+q^KK+=0\lim_{K\to\infty}\|q^{K+\ell}-\hat{q}^{K+\ell}_{K}\|=0.

Under Assumption 2, we know that given any initial strategy q^1Q\hat{q}^{1}\in Q, the sequence of strategies (q^k)k=1(\hat{q}^{k})_{k=1}^{\infty} induced by updates (qq-update) with static belief θ¯\bar{\theta} converges to the equilibrium set EQ(θ¯)\mathrm{EQ}(\bar{\theta}). Then, for any ϵ>0\epsilon>0 and any initial strategy q^1Q\hat{q}^{1}\in Q, there exists a finite integer L(q^1)L(\hat{q}^{1}) such that D(q^k,EQ(θ¯))<ϵ/2D(\hat{q}^{k},\mathrm{EQ}(\bar{\theta}))<\epsilon/2 for any kL(q^1)k\geq L(\hat{q}^{1}). Since the strategy set QQ is a closed and bounded set, we take L=maxq^1QL(q^1)L=\max_{\hat{q}^{1}\in Q}L(\hat{q}^{1}), and we know that LL is finite.

Next, for any kLk\geq L, the original strategy sequence (qk)k=1(q^{k})_{k=1}^{\infty} satisfies

D(qk,EQ(θ¯))qkq^kLk+D(q^kLk,EQ(θ¯)),\displaystyle D(q^{k},\mathrm{EQ}(\bar{\theta}))\leq\|q^{k}-\hat{q}_{k-L}^{k}\|+D(\hat{q}_{k-L}^{k},\mathrm{EQ}(\bar{\theta})),

where q^kLk\hat{q}_{k-L}^{k} is the kk-th strategy in the auxiliary sequence (q^kLj)j=1(\hat{q}_{k-L}^{j})_{j=1}^{\infty}. Recall that the auxiliary strategy sequence (q^kLj)j=1(\hat{q}_{k-L}^{j})_{j=1}^{\infty} is such that q^kLj=qj\hat{q}_{k-L}^{j}=q^{j} for j=1,,kLj=1,\dots,k-L, and q^kLj\hat{q}_{k-L}^{j} is constructed with respect to the fixed point belief θ¯\bar{\theta} as in (13) for j=kL+1,,j=k-L+1,\dots,. Thus, for any kLk\geq L, q^kLk\hat{q}_{k-L}^{k} represents a strategy profile resulting from LL strategy updates with respect to θ¯\bar{\theta} starting from the initial profile qkLq^{k-L}. Then, D(q^kLk,EQ(θ¯))<ϵ/2D(\hat{q}_{k-L}^{k},\mathrm{EQ}(\bar{\theta}))<\epsilon/2 for all kLk\geq L. Moreover, since limkqk+q^kk+=0\lim_{k\to\infty}\|q^{k+\ell}-\hat{q}^{k+\ell}_{k}\|=0 for any \ell, we know that limkqkq^kLk=0\lim_{k\to\infty}\|q^{k}-\hat{q}_{k-L}^{k}\|=0. Hence, there must exist KK such that for any kK+Lk\geq K+L, qkq^kLk<ϵ/2\|q^{k}-\hat{q}_{k-L}^{k}\|<\epsilon/2. Consequently, we have that for any kK+Lk\geq K+L, D(qk,EQ(θ¯))qkq^kLk+D(q^kLk,EQ(θ¯))<ϵD(q^{k},\mathrm{EQ}(\bar{\theta}))\leq\|q^{k}-\hat{q}_{k-L}^{k}\|+D(\hat{q}_{k-L}^{k},\mathrm{EQ}(\bar{\theta}))<\epsilon, i.e. limkD(qk,EQ(θ¯))=0\lim_{k\to\infty}D(q^{k},\mathrm{EQ}(\bar{\theta}))=0. The sequence of strategies (qk)k=1\left(q^{k}\right)_{k=1}^{\infty} converges to the equilibrium set EQ(θ¯)\mathrm{EQ}(\bar{\theta}).

If EQ(θ¯)\mathrm{EQ}(\bar{\theta}) is a singleton set, then we can conclude that (qk)k=1(q^{k})_{k=1}^{\infty} converges. If EQ(θ¯)\mathrm{EQ}(\bar{\theta}) contains multiple equilibria, it remains to prove the convergence of the strategy sequence (qk)k=1(q^{k})_{k=1}^{\infty} to one of the equilibria. Assume for the sake of contradiction that (qk)k=1(q^{k})_{k=1}^{\infty} does not converge. Since limkD(qk,EQ(θ¯))=0\lim_{k\to\infty}D(q^{k},\mathrm{EQ}(\bar{\theta}))=0, there must exist KaK_{a} such that for any k>Kak>K_{a}, we can find a qkEQ(θ¯)q^{k^{\dagger}}\in\mathrm{EQ}(\bar{\theta}) such that qkqk<δ/4\|q^{k}-q^{k\dagger}\|<\delta/4, where δ\delta is the lower bound of the distance between any pair of equilibrium as in Assumption 3. We note that qkq^{k\dagger} can be viewed as the equilibrium strategy in the set EQ(θ¯)\mathrm{EQ}(\bar{\theta}) that is the closest to qkq^{k}, and qkq^{k\dagger} may be different for different k>Kak>K_{a}. In fact, under the assumption that the strategy sequence does not converge, we know that for any KK, there must exist k,k+1>max{K,Ka}k,k+1>\max\{K,K_{a}\} such that qkqk<δ/4\|q^{k}-q^{k\dagger}\|<\delta/4 and qk+1qk+1<δ/4\|q^{k+1}-q^{k+1\dagger}\|<\delta/4, where qkqk+1q^{k\dagger}\neq q^{k+1\dagger}. Otherwise, the strategy sequence (qk)k=1(q^{k})_{k=1}^{\infty} will remain in the local neighborhood of one equilibrium after a finite stage, which implies the convergence to that equilibrium given that limkD(qk,EQ(θ¯))=0\lim_{k\to\infty}D(q^{k},\mathrm{EQ}(\bar{\theta}))=0 and each equilibrium is isolated. Moreover, for any such pair of stages k,k+1k,k+1, since qkqk+1δ\|q^{k\dagger}-q^{k+1\dagger}\|\geq\delta, we must have qk+1qk|qk+1qk+1qkqk+1|3δ/4\|q^{k+1}-q^{k\dagger}\|\geq|\|q^{k+1}-q^{k+1\dagger}\|-\|q^{k\dagger}-q^{k+1\dagger}\||\geq 3\delta/4.

Given Assumptions 1 and 3, we argue that for any qEQ(θ¯)q^{\dagger}\in\mathrm{EQ}(\bar{\theta}), there must exist ξa,ξb>0\xi_{a},\xi_{b}>0 such that for any qq that satisfies qq<ξa\|q-q^{\dagger}\|<\xi_{a} and any θ\theta that satisfies θθ¯<ξb\|\theta-\bar{\theta}\|<\xi_{b}, the updated strategy q=F(θ,q)q^{\prime}=F(\theta,q) satisfies qq<δ/4\|q^{\prime}-q^{\dagger}\|<\delta/4. This is due to the fact that (i) q=F(θ¯,q)q^{\dagger}=F(\bar{\theta},q^{\dagger}) (qq^{\dagger} is a fixed point of the strategy update with respect to θ¯\bar{\theta}, and it is an isolated equilibrium in EQ(θ¯)\mathrm{EQ}(\bar{\theta})) (Assumption 3), and (ii) F(,)F(\cdot,\cdot) is upper hemicontinuous in both θ\theta and qq (Assumption 1) and unique in the local neighborhood of (θ¯,q)(\bar{\theta},q^{\dagger}) (Assumption 3).

Given that θk\theta^{k} converges to θ¯\bar{\theta} with probability 1 (Lemma 2) and limkD(qk,EQ(θ¯))=0\lim_{k\to\infty}D(q^{k},\mathrm{EQ}(\bar{\theta}))=0, we know that there exists Kb>0K_{b}>0 such that for any k>Kbk>K_{b}, qkq^{k} is within ε=min{ξa,δ/4}\varepsilon=\min\{\xi_{a},\delta/4\} distance to the closest equilibrium strategy qkEQ(θ¯)q^{k\dagger}\in\mathrm{EQ}(\bar{\theta}), and the belief θk\theta^{k} is within ξb\xi_{b} distance to θ¯\bar{\theta}. This implies that for any k,k+1>max{K,Ka,Kb}k,k+1>\max\{K,K_{a},K_{b}\}, qkqk<δ/4\|q^{k}-q^{k\dagger}\|<\delta/4 and qk+1qk<δ/4\|q^{k+1}-q^{k\dagger}\|<\delta/4, which is a contradiction to the existence of k,k+1k,k+1 where qkqk<δ/4\|q^{k}-q^{k\dagger}\|<\delta/4 and qk+1qk3δ/4\|q^{k+1}-q^{k\dagger}\|\geq 3\delta/4. Therefore, we have arrived at a contradiction, and we conclude that the sequence of strategies (qk)k=1(q^{k})_{k=1}^{\infty} must converge. \square

Lemma 4.

Any fixed point (θ¯,q¯)\left(\bar{\theta},\bar{q}\right) satisfies (4a). Furthermore, for any sSS(q¯)s\in S\setminus S^{*}(\bar{q}), if ϕs(y|q¯)ϕs(y|q¯)\phi^{s^{*}}(y|\bar{q})\ll\phi^{s}(y|\bar{q}), then θk(s)\theta^{k}(s) satisfies (5). Otherwise, there exists a finite positive integer KK^{*} such that θk(s)=0\theta^{k}(s)=0 for all k>Kk>K^{*} w.p. 1.

Lemma 4 is based on Lemmas 2 and 3. Although the realized payoffs (yk)k=1\left(y^{k}\right)_{k=1}^{\infty} are not independently and identically distributed (i.i.d.) due to players’ strategy updates, we can show that, since qkq^{k} converges to q¯\bar{q} (Lemma 3), the distribution of yky^{k} converges as kk\to\infty to an i.i.d. process with density ϕs(yk|q¯)\phi^{s}(y^{k}|\bar{q}), which is the payoff distribution given the fixed point strategy q¯\bar{q}, for each parameter ss. This allows us to show that the belief eventually excludes parameters that are not in S(q¯)S^{*}(\bar{q}), and also establish the rate of convergence.

Proof of Lemma 4. By iteratively applying the belief update in (8), we can write:

θ~k(s)=θ~1(s)j=1k1ϕs(yj|qj)sSθ~1(s)j=1k1ϕs(yj|qj),sS.\displaystyle\tilde{\theta}^{k}(s)=\frac{\tilde{\theta}^{1}(s)\prod_{j=1}^{k-1}\phi^{s}(y^{j}|q^{j})}{\sum_{s^{\prime}\in S}\tilde{\theta}^{1}(s^{\prime})\prod_{j=1}^{k-1}\phi^{s^{\prime}}(y^{j}|q^{j})},\quad\forall s\in S. (16)

We define Φs(Yk1|Qk1)\Phi^{s}(Y^{k-1}|Q^{k-1}) as the probability density function of the history of the realized payoffs Yk1=(yj)j=1k1Y^{k-1}=\left(y^{j}\right)_{j=1}^{k-1} conditioned on the history of strategies Qk1=(qj)j=1k1Q^{k-1}=\left(q^{j}\right)_{j=1}^{k-1} prior to stage kk, i.e. Φs(Yk1|Qk1)=Δj=1k1ϕs(yj|qj)\Phi^{s}(Y^{k-1}|Q^{k-1})\stackrel{{\scriptstyle\Delta}}{{=}}\prod_{j=1}^{k-1}\phi^{s}(y^{j}|q^{j}). For any sS{s}s\in S\setminus\{s^{*}\}, we can rewrite (16) as follows:

θ~k(s)=θ~1(s)Φs(Yk1|Qk1)sSθ~1(s)Φs(Yk1|Qk1)\displaystyle\tilde{\theta}^{k}(s)=\frac{\tilde{\theta}^{1}(s)\Phi^{s}(Y^{k-1}|Q^{k-1})}{\sum_{s^{\prime}\in S}\tilde{\theta}^{1}(s^{\prime})\Phi^{s^{\prime}}(Y^{k-1}|Q^{k-1})} θ~1(s)Φs(Yk1|Qk1)θ~1(s)Φs(Yk1|Qk1)+θ~1(s)Φs(Yk1|Qk1)\displaystyle\leq\frac{\tilde{\theta}^{1}(s)\Phi^{s}(Y^{k-1}|Q^{k-1})}{\tilde{\theta}^{1}(s)\Phi^{s}(Y^{k-1}|Q^{k-1})+\tilde{\theta}^{1}(s^{*})\Phi^{s^{*}}(Y^{k-1}|Q^{k-1})}
=θ~1(s)Φs(Yk1|Qk1)Φs(Yk1|Qk1)θ~1(s)Φs(Yk1|Qk1)Φs(Yk1|Qk1)+θ~1(s).\displaystyle=\frac{\tilde{\theta}^{1}(s)\frac{\Phi^{s}(Y^{k-1}|Q^{k-1})}{\Phi^{s^{*}}(Y^{k-1}|Q^{k-1})}}{\tilde{\theta}^{1}(s)\frac{\Phi^{s}(Y^{k-1}|Q^{k-1})}{\Phi^{s^{*}}(Y^{k-1}|Q^{k-1})}+\tilde{\theta}^{1}(s^{*})}. (17)

For any sSS(q¯)s\in S\setminus S^{*}(\bar{q}), if we can show that the ratio Φs(Yk1|Qk1)Φs(Yk1|Qk1)\frac{\Phi^{s}(Y^{k-1}|Q^{k-1})}{\Phi^{s^{*}}(Y^{k-1}|Q^{k-1})} converges to 0, then θ~k(s)\tilde{\theta}^{k}(s) must also converge to 0. We need to consider two cases:

Case 1: ϕs(y|q¯)ϕs(y|q¯)\phi^{s^{*}}(y|\bar{q})\ll\phi^{s}(y|\bar{q}).
In this case, the log-likelihood ratio can be written as:

log(Φs(Yk1|Qk1)Φs(Yk1|Qk1))=j=1k1log(ϕs(yj|qj)ϕs(yj|qj)).\displaystyle\log\left(\frac{\Phi^{s}(Y^{k-1}|Q^{k-1})}{\Phi^{s^{*}}(Y^{k-1}|Q^{k-1})}\right)=\sum_{j=1}^{k-1}\log\left(\frac{\phi^{s}(y^{j}|q^{j})}{\phi^{s^{*}}(y^{j}|q^{j})}\right). (18)

For any sSs\in S, since ϕs(yj|qj)\phi^{s}(y^{j}|q^{j}) is a continuous function of qjq^{j} for any realized yjy^{j}, log(ϕs(yj|qj)ϕs(yj|qj))\log\left(\frac{\phi^{s}(y^{j}|q^{j})}{\phi^{s^{*}}(y^{j}|q^{j})}\right) is also continuous in qjq^{j} for any yjy^{j}. In Lemma 3, we proved that (qk)k=1\left(q^{k}\right)_{k=1}^{\infty} converges to q¯\bar{q}. Then, log(ϕs(yj|qj)ϕs(yj|qj))\log\left(\frac{\phi^{s}(y^{j}|q^{j})}{\phi^{s^{*}}(y^{j}|q^{j})}\right) must converge to log(ϕs(yj|q¯)ϕs(yj|q¯))\log\left(\frac{\phi^{s}(y^{j}|\bar{q})}{\phi^{s^{*}}(y^{j}|\bar{q})}\right) as jj\to\infty for every realized payoff yjy^{j}. Therefore, from the law of large numbers, we have:

limk1k1log(Φs(Yk1|Qk1)Φs(Yk1|Qk1))\displaystyle\lim_{k\to\infty}\frac{1}{k-1}\log\left(\frac{\Phi^{s}(Y^{k-1}|Q^{k-1})}{\Phi^{s^{*}}(Y^{k-1}|Q^{k-1})}\right) =limk1k1j=1k1log(ϕs(yj|qj)ϕs(yj|qj))=𝔼[log(ϕs(y|q¯)ϕs(y|q¯))],w.p.1.\displaystyle=\lim_{k\to\infty}\frac{1}{k-1}\sum_{j=1}^{k-1}\log\left(\frac{\phi^{s}(y^{j}|q^{j})}{\phi^{s^{*}}(y^{j}|q^{j})}\right)=\mathbb{E}\left[\log\left(\frac{\phi^{s}(y|\bar{q})}{\phi^{s^{*}}(y|\bar{q})}\right)\right],~{}w.p.~{}1.

Note that for any sSS(q¯)s\in S\setminus S^{*}(\bar{q}), the expectation of log(ϕs(y|q¯)ϕs(y|q¯))\log\left(\frac{\phi^{s}(y|\bar{q})}{\phi^{s^{*}}(y|\bar{q})}\right) can be written as:

𝔼[log(ϕs(y|q¯)ϕs(y|q¯))]=yϕs(y|q¯)log(ϕs(y|q¯)ϕs(y|q¯))dy=DKL(ϕs(y|q¯)||ϕs(y|q¯))<0.\displaystyle\mathbb{E}\left[\log\left(\frac{\phi^{s}(y|\bar{q})}{\phi^{s^{*}}(y|\bar{q})}\right)\right]=\int_{y}\phi^{s^{*}}(y|\bar{q})\cdot\log\left(\frac{\phi^{s}(y|\bar{q})}{\phi^{s^{*}}(y|\bar{q})}\right)dy=-D_{KL}\left(\phi^{s^{*}}(y|\bar{q})||\phi^{s}(y|\bar{q})\right)<0.

Then, for any sSS(q¯)s\in S\setminus S^{*}(\bar{q}), limkΦs(Yk1|Qk1)Φs(Yk1|Qk1)=0\lim_{k\to\infty}\frac{\Phi^{s}(Y^{k-1}|Q^{k-1})}{\Phi^{s^{*}}(Y^{k-1}|Q^{k-1})}=0. From (17), we know that limkθ~k(s)=0\lim_{k\to\infty}\tilde{\theta}^{k}(s)=0 for all sSS(q¯)s\in S\setminus S^{*}(\bar{q}). Finally, since θ~1(s)>0\tilde{\theta}^{1}(s)>0 for all sSs\in S, the true parameter ss^{*} is never excluded from the belief. For any sSS(q¯)s\in S\setminus S^{*}(\bar{q}), we have the following:

limk1klog(θ~k(s))=limk1klog(θ~k(s))+limk1klog(θ~k(s)θ~k(s))\displaystyle\lim_{k\to\infty}\frac{1}{k}\log\left(\tilde{\theta}^{k}(s)\right)=\lim_{k\to\infty}\frac{1}{k}\log\left(\tilde{\theta}^{k}(s^{*})\right)+\lim_{k\to\infty}\frac{1}{k}\log\left(\frac{\tilde{\theta}^{k}(s)}{\tilde{\theta}^{k}(s^{*})}\right)
=limk1klog(θ~k(s)θ~k(s))=limk1klog(θ~1(s)θ1(s))+limk1klog(Φs(Yk1|Qk1)Φs(Yk1|Qk1))\displaystyle=\lim_{k\to\infty}\frac{1}{k}\log\left(\frac{\tilde{\theta}^{k}(s)}{\tilde{\theta}^{k}(s^{*})}\right)=\lim_{k\to\infty}\frac{1}{k}\log\left(\frac{\tilde{\theta}^{1}(s)}{\theta^{1}(s^{*})}\right)+\lim_{k\to\infty}\frac{1}{k}\log\left(\frac{\Phi^{s}(Y^{k-1}|Q^{k-1})}{\Phi^{s^{*}}(Y^{k-1}|Q^{k-1})}\right)
=𝔼[log(ϕs(y|q¯)ϕs(y|q¯))]=DKL(ϕs(y|q¯)||ϕs(y|q¯)),w.p.1.\displaystyle=\mathbb{E}\left[\log\left(\frac{\phi^{s}(y|\bar{q})}{\phi^{s^{*}}(y|\bar{q})}\right)\right]=-D_{KL}\left(\phi^{s^{*}}(y|\bar{q})||\phi^{s}(y|\bar{q})\right),\quad w.p.~{}1.

From (11), we know that limk1klog(θk(s))=DKL(ϕs(y|q¯)||ϕs(y|q¯))\lim_{k\to\infty}\frac{1}{k}\log\left(\theta^{k}(s)\right)=-D_{KL}\left(\phi^{s^{*}}(y|\bar{q})||\phi^{s}(y|\bar{q})\right).
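For a concrete instance of this rate (a sketch, not part of the paper's analysis), consider Example 1 with Gaussian noise of standard deviation 0.5 and the strategy held at q̄ = q* = (2/3, 2/3): the price means under s1 = s* and s2 are 2/3 and 0, so the divergence equals (2/3)²/(2·0.25) ≈ 0.89, and the empirical slope of (1/k) times the log belief ratio matches its negative.

```python
import numpy as np

# Sketch of the decay rate in Lemma 4 for Example 1 with Gaussian noise (sigma = 0.5)
# and the strategy held at q_bar = q* = (2/3, 2/3): the price means are 2/3 under s1 = s*
# and 0 under s2, so D_KL(phi^{s*}(.|q_bar) || phi^{s2}(.|q_bar)) = (2/3)^2 / (2 sigma^2).
rng = np.random.default_rng(5)
sigma, mu_star, mu_2 = 0.5, 2.0 / 3.0, 0.0
dkl = (mu_star - mu_2) ** 2 / (2 * sigma ** 2)

K = 20000
log_num, log_den = np.log(0.5), np.log(0.5)               # uniform prior on {s1, s2}
for _ in range(K):
    y = mu_star + sigma * rng.standard_normal()
    log_num += -(y - mu_2) ** 2 / (2 * sigma ** 2)        # unnormalized log phi^{s2}
    log_den += -(y - mu_star) ** 2 / (2 * sigma ** 2)     # unnormalized log phi^{s*}
    # the Gaussian normalization constants cancel in the difference below

rate = (log_num - log_den) / K                            # ~ (1/k) log theta^k(s2)
print(f"empirical rate {rate:.3f}   vs   -D_KL = {-dkl:.3f}")
```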

Case 2: ϕs(y|q¯)\phi^{s^{*}}(y|\bar{q}) is not absolutely continuous with respect to ϕs(y|q¯)\phi^{s}(y|\bar{q}).
In this case, ϕs(y|q¯)=0\phi^{s}(y|\bar{q})=0 does not imply ϕs(y|q¯)=0\phi^{s^{*}}(y|\bar{q})=0 with probability 1, i.e. Pr(ϕs(y|q¯)=0)>0\mathrm{Pr}\left(\phi^{s}(y|\bar{q})=0\right)>0, where Pr()\mathrm{Pr}\left(\cdot\right) is the probability of yy with respect to the true distribution ϕs(y|q¯)\phi^{s^{*}}(y|\bar{q}). Since the distributions ϕs(y|q)\phi^{s}(y|q) and ϕs(y|q)\phi^{s^{*}}(y|q) are continuous in qq, the probability Pr(ϕs(y|q)=0)\mathrm{Pr}\left(\phi^{s}(y|q)=0\right) must also be continuous in qq. Therefore, for any ϵ(0,Pr(ϕs(y|q¯)=0))\epsilon\in\left(0,\mathrm{Pr}\left(\phi^{s}(y|\bar{q})=0\right)\right), there exists δ>0\delta>0 such that Pr(ϕs(y|q)=0)>ϵ\mathrm{Pr}\left(\phi^{s}(y|q)=0\right)>\epsilon for all q{q|qq¯<δ}q\in\{q|\|q-\bar{q}\|<\delta\}.

From Lemma 3, we know that limkqk=q¯\lim_{k\to\infty}q^{k}=\bar{q}. Hence, we can find a positive number K1>0K_{1}>0 such that for any k>K1k>K_{1}, qkq¯<δ\|q^{k}-\bar{q}\|<\delta, and hence Pr(ϕs(yk|qk)=0)>ϵ\mathrm{Pr}\left(\phi^{s}(y^{k}|q^{k})=0\right)>\epsilon. We then have k=1Pr(ϕs(yk|qk)=0)=\sum_{k=1}^{\infty}\mathrm{Pr}\left(\phi^{s}(y^{k}|q^{k})=0\right)=\infty. Moreover, since the event {ϕs(yk|qk)=0}\{\phi^{s}(y^{k}|q^{k})=0\} is independent from the event {ϕs(yk|qk)=0}\{\phi^{s}(y^{k^{\prime}}|q^{k^{\prime}})=0\} given any qkq^{k} and qkq^{k^{\prime}}, and any k,kk,k^{\prime}, we can conclude that Pr(ϕs(yk|qk)=0,infinitely often)=1\mathrm{Pr}\left(\phi^{s}(y^{k}|q^{k})=0,\text{infinitely often}\right)=1 based on the second Borel-Cantelli lemma. Hence, Pr(ϕs(yk|qk)>0,k)=0\mathrm{Pr}\left(\phi^{s}(y^{k}|q^{k})>0,~{}\forall k\right)=0. From the Bayesian update (θ\theta-update), we know that if ϕs(yk|qk)=0\phi^{s}(y^{k}|q^{k})=0 for some stage kk, then any belief of ss updated after stage kk is 0. Therefore, we can conclude that there exists a positive number K>K1K^{*}>K_{1} with probability 1 such that θk(s)=0\theta^{k}(s)=0 for any k>Kk>K^{*}. \square

4.2 Proof of local stability

From the local stability definition (Definition 4), for any ϵ¯,δ¯>0\bar{\epsilon},\bar{\delta}>0 and any probability γ(0,1)\gamma\in(0,1), we need to characterize ϵ1,δ1>0\epsilon^{1},\delta^{1}>0 such that if the initial state is in (ϵ1,δ1)(\epsilon^{1},\delta^{1})-neighborhood of the fixed point (θ¯,EQ(θ¯))(\bar{\theta},\mathrm{EQ}(\bar{\theta})), then with probability higher than γ\gamma, the convergent state is in (ϵ¯,δ¯)(\bar{\epsilon},\bar{\delta})-neighborhood of (θ¯,EQ(θ¯))(\bar{\theta},\mathrm{EQ}(\bar{\theta})).

Our proof of Theorem 2 starts with constructing an auxiliary (ϵ^,δ)(\hat{\epsilon},\delta)-neighborhood of (θ¯,EQ(θ¯))(\bar{\theta},\mathrm{EQ}(\bar{\theta})) such that if all states remain in the auxiliary neighborhood, then the convergent state is guaranteed to be in the (ϵ¯,δ¯)(\bar{\epsilon},\bar{\delta})-neighborhood of (θ¯,EQ(θ¯))(\bar{\theta},\mathrm{EQ}(\bar{\theta})) (Part (iii) in Lemma 5). Here, ϵ^(0,ϵ)\hat{\epsilon}\in(0,\epsilon), and ϵ\epsilon and δ\delta are chosen according to Assumption 4 to guarantee that states in the auxiliary neighborhood satisfy the conditions of local consistency and local invariance (Parts (i) and (ii) in Lemma 5).

Lemma 5.

Under Assumptions 1 – 4,

  1. (i)

    For any δ¯>0\bar{\delta}>0, ϵ(0,ϵ)\exists\epsilon^{\prime}\in\left(0,\epsilon\right) such that any θNϵ(θ¯)\theta\in N_{\epsilon^{\prime}}(\bar{\theta}) satisfies EQ(θ)Nδ¯(EQ(θ¯))\mathrm{EQ}(\theta)\subseteq N_{\bar{\delta}}(\mathrm{EQ}(\bar{\theta})).

  2. (ii)

    For any ϵ¯>0\bar{\epsilon}>0, F(θ,q)Nδ(EQ(θ¯))F(\theta,q)\subseteq N_{\delta}(\mathrm{EQ}(\bar{\theta})) for all qNδ(EQ(θ¯))q\in N_{\delta}(\mathrm{EQ}(\bar{\theta})) and all θNϵ^(θ¯)\theta\in N_{\hat{\epsilon}}\left(\bar{\theta}\right), where ϵ^=min{ϵ,ϵ,ϵ¯}\hat{\epsilon}=\min\{\epsilon,\epsilon^{\prime},\bar{\epsilon}\}.

  3. (iii)

limkPr(θkNϵ¯(θ¯),qkNδ¯(EQ(θ¯)))Pr(θkNϵ^(θ¯),qkNδ(EQ(θ¯)),k)\lim_{k\to\infty}\mathrm{Pr}\left(\theta^{k}\in N_{\bar{\epsilon}}(\bar{\theta}),~{}q^{k}\in N_{\bar{\delta}}(\mathrm{EQ}(\bar{\theta}))\right)\geq\mathrm{Pr}\left(\theta^{k}\in N_{\hat{\epsilon}}(\bar{\theta}),~{}q^{k}\in N_{\delta}(\mathrm{EQ}(\bar{\theta})),~{}\forall k\right).

We include the proof of Lemma 5 in Appendix A. Essentially, (i) follows from the fact that in continuous games, the equilibrium set EQ(θ)\mathrm{EQ}(\theta) is upper-hemicontinuous in θ\theta (Lemma 8 in Appendix A). Then, we obtain (ii) from Assumption (A3b) that Nδ(EQ(θ¯))N_{\delta}\left(\mathrm{EQ}(\bar{\theta})\right) is an invariant set of the strategy update for θNϵ(θ¯)\theta\in N_{\epsilon}(\bar{\theta}) and ϵ^ϵ\hat{\epsilon}\leq\epsilon. Furthermore, if beliefs are in Nϵ^(θ¯)N_{\hat{\epsilon}}(\bar{\theta}) for all stages, then the convergent belief must also be in Nϵ^(θ¯)Nϵ¯(θ¯)N_{\hat{\epsilon}}(\bar{\theta})\subseteq N_{\bar{\epsilon}}(\bar{\theta}). Based on Assumptions 1 – 2, Theorem 1 provides that the sequence of strategies converges. Since Nϵ^(θ¯)Nϵ(θ¯)N_{\hat{\epsilon}}(\bar{\theta})\subseteq N_{\epsilon^{\prime}}(\bar{\theta}), we know from (i) that the convergent strategy is an equilibrium in the neighborhood Nδ¯(EQ(θ¯))N_{\bar{\delta}}(\mathrm{EQ}(\bar{\theta})). Thus, (iii) holds.

Thanks to Lemma 5 (iii), to prove local stability as in (7), it remains to be established that there exists an initial (ϵ1,δ1)(\epsilon^{1},\delta^{1})-neighborhood of the fixed point such that when learning starts from that neighborhood, all states remain in the auxiliary (ϵ^,δ)(\hat{\epsilon},\delta)- neighborhood with probability higher than γ\gamma. To characterize ϵ1\epsilon^{1}, we note that θk\theta^{k} remains in the ϵ^\hat{\epsilon}-neighborhood of θ¯\bar{\theta} if |θk(s)θ¯(s)|ϵ^|S||\theta^{k}(s)-\bar{\theta}(s)|\leq\frac{\hat{\epsilon}}{|S|} for all sSs\in S. In Lemma 6, we characterize the condition of the initial belief θ1\theta^{1} under which |θk(s)θ¯(s)|ϵ^|S||\theta^{k}(s)-\bar{\theta}(s)|\leq\frac{\hat{\epsilon}}{|S|} with probability higher than γ\gamma for all kk and all sS[θ¯]s\in S\setminus[\bar{\theta}] (i.e. the set of parameters with zero probability in θ¯\bar{\theta}). In Lemma 7, we analyze conditions of initial beliefs of parameters s[θ¯]s\in[\bar{\theta}], which together with Lemma 6 allow us to precisely characterize ϵ1\epsilon^{1}. We also characterize δ1\delta^{1} that guarantees that the strategies do not leave Nδ(EQ(θ¯))N_{\delta}(\mathrm{EQ}(\bar{\theta})). Our characterization of ϵ1\epsilon^{1} and δ1\delta^{1} builds on the martingale property of belief ratios (Lemma 1), and the fact that under Assumption 4, states satisfy local consistency and local invariance in the auxiliary neighborhood (Lemma 5).

Before proceeding, we need to define the following thresholds:

0<ρ1\displaystyle 0<\rho^{1} <mins[θ¯]{(1γ)θ¯(s)ϵ^(1γ+|S[θ¯]|)(|S[θ¯]|+1)|S|+(1γ)ϵ^},\displaystyle<\min_{s\in[\bar{\theta}]}\left\{\frac{(1-\gamma)\bar{\theta}(s)\hat{\epsilon}}{(1-\gamma+|S\setminus[\bar{\theta}]|)(|S\setminus[\bar{\theta}]|+1)|S|+(1-\gamma)\hat{\epsilon}}\right\}, (19a)
ρ2\displaystyle\rho^{2} =Δϵ^(|S[θ¯]|+1)|S|,\displaystyle\stackrel{{\scriptstyle\Delta}}{{=}}\frac{\hat{\epsilon}}{(|S\setminus[\bar{\theta}]|+1)|S|}, (19b)
0<ρ3\displaystyle 0<\rho^{3} <mins[θ¯]{ϵ^|S[θ¯]||S|ρ2θ¯(s)|S||S[θ¯]||S|ρ2,ϵ^|S|ρ2|S[θ¯]|(θ¯(s)+ϵ^|S|),θ¯(s)}.\displaystyle<\min_{s\in[\bar{\theta}]}\left\{\frac{\hat{\epsilon}-|S\setminus[\bar{\theta}]||S|\rho^{2}\bar{\theta}(s)}{|S|-|S\setminus[\bar{\theta}]||S|\rho^{2}},~{}\frac{\hat{\epsilon}}{|S|}-\rho^{2}|S\setminus[\bar{\theta}]|\left(\bar{\theta}(s)+\frac{\hat{\epsilon}}{|S|}\right),~{}\bar{\theta}(s)\right\}. (19c)

Lemma 6 below shows that if the initial belief θ1\theta^{1} is in the neighborhood Nρ1(θ¯)N_{\rho^{1}}(\bar{\theta}), then θk(s)<ρ2\theta^{k}(s)<\rho^{2} for all sS[θ¯]s\in S\setminus[\bar{\theta}] in all stages of the learning dynamics with probability higher than γ\gamma. Note that θk(s)<ρ2\theta^{k}(s)<\rho^{2} ensures |θk(s)θ¯(s)|<ϵ^|S||\theta^{k}(s)-\bar{\theta}(s)|<\frac{\hat{\epsilon}}{|S|} since θ¯(s)=0\bar{\theta}(s)=0 for all sS[θ¯]s\in S\setminus[\bar{\theta}] and ρ2<ϵ^|S|\rho^{2}<\frac{\hat{\epsilon}}{|S|}. Additionally, the threshold ρ2\rho^{2} is specifically constructed to bound the beliefs of the remaining parameters in [θ¯][\bar{\theta}], which will be used later in Lemma 7.
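The sketch below instantiates the thresholds (19a) – (19c) for a hypothetical fixed point (the belief θ̄ = (0.7, 0.3, 0) over three parameters, γ = 0.9, and ϵ̂ = 0.1 are illustrative values, not from the paper) and checks the two facts used in the proof of Lemma 6 below: the auxiliary belief-ratio interval is non-empty, and the implied probability of never completing an upcrossing exceeds γ.

```python
import numpy as np

# Instantiate the thresholds (19a)-(19c) for a hypothetical fixed point: S has three
# parameters, the support is {s1, s2} with theta_bar = (0.7, 0.3, 0), gamma = 0.9 and
# eps_hat = 0.1 (all illustrative values).
gamma, eps_hat = 0.9, 0.1
theta_bar = {"s1": 0.7, "s2": 0.3, "s3": 0.0}
support = [s for s, p in theta_bar.items() if p > 0]
n_S, n_out = len(theta_bar), len(theta_bar) - len(support)      # |S|, |S \ [theta_bar]|

rho2 = eps_hat / ((n_out + 1) * n_S)                                           # (19b)
rho1 = 0.99 * min((1 - gamma) * theta_bar[s] * eps_hat /
                  ((1 - gamma + n_out) * (n_out + 1) * n_S + (1 - gamma) * eps_hat)
                  for s in support)                                             # (19a)
rho3 = 0.99 * min(min((eps_hat - n_out * n_S * rho2 * theta_bar[s]) / (n_S - n_out * n_S * rho2),
                      eps_hat / n_S - rho2 * n_out * (theta_bar[s] + eps_hat / n_S),
                      theta_bar[s])
                  for s in support)                                             # (19c)

theta_bar_sstar = theta_bar["s1"]          # s* belongs to the support; take s1 here
alpha = theta_bar_sstar - rho1
assert rho1 / alpha < rho2                 # the auxiliary interval in the proof is non-empty
no_upcross_bound = 1 - (rho1 / alpha) * n_out / (rho2 - rho1 / alpha)
assert no_upcross_bound > gamma            # the probability bound at the end of the proof
print(f"rho1 = {rho1:.5f}, rho2 = {rho2:.5f}, rho3 = {rho3:.5f}, bound = {no_upcross_bound:.3f}")
```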

Lemma 6.

For any γ(0,1)\gamma\in(0,1), if the initial belief satisfies

θ1(s)<ρ1,sS[θ¯],\displaystyle\theta^{1}(s)<\rho^{1},\quad\forall s\in S\setminus[\bar{\theta}], (20a)
θ¯(s)ρ1<θ1(s)<θ¯(s)+ρ1,s[θ¯],\displaystyle\bar{\theta}(s)-\rho^{1}<\theta^{1}(s)<\bar{\theta}(s)+\rho^{1},\quad\forall s\in[\bar{\theta}], (20b)

then

Pr(θk(s)<ρ2,sS[θ¯],k)>γ.\displaystyle\mathrm{Pr}\left(\theta^{k}(s)<\rho^{2},~{}\forall s\in S\setminus[\bar{\theta}],~{}\forall k\right)>\gamma. (21)

In the proof of Lemma 6, we say that the belief θk(s)\theta^{k}(s) completes an upcrossing of the interval [ρ1,ρ2][\rho^{1},\rho^{2}] if θk(s)\theta^{k}(s) increases from less than ρ1\rho^{1} to higher than ρ2\rho^{2}. Note that if the belief of a parameter sS[θ¯]s\in S\setminus[\bar{\theta}] is initially smaller than ρ1\rho^{1} but later becomes higher than ρ2\rho^{2} in some stage kk, then the belief sequence (θj(s))j=1k\left(\theta^{j}(s)\right)_{j=1}^{k} must have completed at least one upcrossing of [ρ1,ρ2][\rho^{1},\rho^{2}] before stage kk. Therefore, given that θ1(s)<ρ1\theta^{1}(s)<\rho^{1}, the event that θk(s)<ρ2\theta^{k}(s)<\rho^{2} for all kk is equivalent to the event that the belief completes zero upcrossings of this interval.

Additionally, by bounding the initial belief of parameters s[θ¯]s\in[\bar{\theta}] as in (20b), we construct another interval [ρ1/(θ¯(s)ρ1),ρ2]\left[\rho^{1}/\left(\bar{\theta}(s^{*})-\rho^{1}\right),\rho^{2}\right] such that the number of upcrossings with respect to this interval completed by the auxiliary sequence of belief ratios (θ~k(s)θ~k(s))k=1\left(\frac{\tilde{\theta}^{k}(s)}{\tilde{\theta}^{k}(s^{*})}\right)_{k=1}^{\infty} is no less than the number of upcrossings with respect to interval [ρ1,ρ2][\rho^{1},\rho^{2}] completed by (θk(s))k=1\left(\theta^{k}(s)\right)_{k=1}^{\infty}. Recall that the sequence of belief ratios (θ~k(s)θ~k(s))k=1\left(\frac{\tilde{\theta}^{k}(s)}{\tilde{\theta}^{k}(s^{*})}\right)_{k=1}^{\infty} forms a martingale process (Lemma 1). By applying Doob’s upcrossing inequality, we obtain an upper bound on the expected number of upcrossings completed by the belief ratio corresponding to each parameter sS[θ¯]s\in S\setminus[\bar{\theta}], which is also an upper bound on the expected number of upcrossings made by the belief of ss. Using Markov’s inequality and the upper bound of the expected number of upcrossings, we show that with probability higher than γ\gamma, no belief θk(s)\theta^{k}(s) of any parameter sS[θ¯]s\in S\setminus[\bar{\theta}] can ever complete a single upcrossing with respect to the interval [ρ1,ρ2][\rho^{1},\rho^{2}] characterized by (19a) – (19b). Hence, θk(s)\theta^{k}(s) remains lower than the threshold ρ2\rho^{2} for all sS[θ¯]s\in S\setminus[\bar{\theta}] and all kk with probability higher than γ\gamma.
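To illustrate the argument numerically (a Monte Carlo sketch under assumptions not made in the paper), consider the likelihood-ratio martingale of Example 1 with Gaussian noise of standard deviation 0.5 and the strategy held fixed at (0.52, 0.52), where the two price means, 0.96 and 0.88, are close enough that the ratio fluctuates. The empirical expected number of upcrossings of an interval [a, b], starting below a, stays under the Doob bound a/(b − a) that appears in (25) below.

```python
import numpy as np

# Monte Carlo sketch of the upcrossing bound: the likelihood-ratio martingale for
# Example 1 with Gaussian noise (sigma = 0.5) and the strategy held at (0.52, 0.52),
# so that the price means 0.96 (s1 = s*) and 0.88 (s2) are close and the ratio
# fluctuates. The interval [a, b] and the starting value x1 < a are arbitrary choices.
rng = np.random.default_rng(6)
sigma, mu_star, mu_s = 0.5, 0.96, 0.88
a, b, x1 = 0.2, 0.6, 0.15
runs, horizon = 2000, 1500

y = mu_star + sigma * rng.standard_normal((runs, horizon))          # payoffs under s*
steps = ((y - mu_star) ** 2 - (y - mu_s) ** 2) / (2 * sigma ** 2)   # log-likelihood-ratio steps
log_paths = np.log(x1) + np.cumsum(steps, axis=1)

upcross = np.zeros(runs)
below = np.ones(runs, dtype=bool)                 # every path starts below a
for t in range(horizon):
    x = log_paths[:, t]
    new_up = below & (x > np.log(b))              # an upcrossing of [a, b] is completed
    upcross += new_up
    below = np.where(new_up, False, below)
    below = below | (x < np.log(a))
print(f"empirical E[#upcrossings] = {upcross.mean():.3f}   <=   a/(b-a) = {a / (b - a):.3f}")
```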

Proof of Lemma 6. First, note that 0<ρ1<ρ2<ϵ^|S|0<\rho^{1}<\rho^{2}<\frac{\hat{\epsilon}}{|S|}. For any sS[θ¯]s\in S\setminus[\bar{\theta}] and any k>1k>1, we denote Uk(s)U^{k}(s) the number of upcrossings of the interval [ρ1,ρ2][\rho^{1},\rho^{2}] that the belief θ~j(s)\tilde{\theta}^{j}(s) completes by stage kk. That is, Uk(s)U^{k}(s) is the maximum number of intervals ([k¯i,k¯i])i=1Uk(s)\left([\underline{k}_{i},\overline{k}_{i}]\right)_{i=1}^{U^{k}(s)} with 1k¯1<k¯1<k¯2<k¯2<<k¯Uk(s)<k¯Uk(s)k1\leq\underline{k}_{1}<\overline{k}_{1}<\underline{k}_{2}<\overline{k}_{2}<\cdots<\underline{k}_{U^{k}(s)}<\overline{k}_{U^{k}(s)}\leq k, such that θ~k¯i(s)<ρ1<ρ2<θ~k¯i(s)\tilde{\theta}^{\underline{k}_{i}}(s)<\rho^{1}<\rho^{2}<\tilde{\theta}^{\overline{k}_{i}}(s) for i=1,Uk(s)i=1,\dots U^{k}(s). Since the beliefs (θ~j(s))j=1k\left(\tilde{\theta}^{j}(s)\right)_{j=1}^{k} are updated based on randomly realized payoffs (yj)j=1k\left(y^{j}\right)_{j=1}^{k} as in (8), Uk(s)U^{k}(s) is also a random variable. For any k>1k>1, Uk(s)1U^{k}(s)\geq 1 if and only if θ~1(s)<ρ1\tilde{\theta}^{1}(s)<\rho^{1} and there exists a stage jkj\leq k such that θ~j(s)>ρ2\tilde{\theta}^{j}(s)>\rho^{2}. Equivalently, limkUk(s)1\lim_{k\to\infty}U^{k}(s)\geq 1 if and only if θ~1(s)<ρ1\tilde{\theta}^{1}(s)<\rho^{1} and there exists a stage k>1k>1 such that θ~k(s)>ρ2\tilde{\theta}^{k}(s)>\rho^{2}. Therefore, if θ~1(s)<ρ1\tilde{\theta}^{1}(s)<\rho^{1} for all sS[θ¯]s\in S\setminus[\bar{\theta}], then:

Pr(θ~k(s)<ρ2,sS[θ¯],k)=1Pr(sS[θ¯] and k,s.t.θ~k(s)>ρ2)\displaystyle\mathrm{Pr}\left(\tilde{\theta}^{k}(s)<\rho^{2},~{}\forall s\in S\setminus[\bar{\theta}],~{}\forall k\right)=1-\mathrm{Pr}\left(\exists s\in S\setminus[\bar{\theta}]\text{ and }k,~{}s.t.~{}\tilde{\theta}^{k}(s)>\rho^{2}\right)
\displaystyle\geq 1sS[θ¯]Pr(k,s.t.θ~k(s)>ρ2)=1sS[θ¯]limkPr(Uk(s)1).\displaystyle 1-\sum_{s\in S\setminus[\bar{\theta}]}\mathrm{Pr}\left(\exists k,~{}s.t.~{}~{}\tilde{\theta}^{k}(s)>\rho^{2}\right)=1-\sum_{s\in S\setminus[\bar{\theta}]}\lim_{k\to\infty}\mathrm{Pr}\left(U^{k}(s)\geq 1\right). (22)

Next, we define α=Δθ¯(s)ρ1\alpha\stackrel{{\scriptstyle\Delta}}{{=}}\bar{\theta}(s^{*})-\rho^{1}. Since 0<ρ1<mins[θ¯]{θ¯(s)}0<\rho^{1}<\min_{s\in[\bar{\theta}]}\{\bar{\theta}(s)\} and ss^{*} is in the support set, we have α(0,θ¯(s))\alpha\in(0,\bar{\theta}(s^{*})). If θ~1(s)\tilde{\theta}^{1}(s) satisfies (20a) – (20b), then θ~1(s)θ~1(s)<ρ1α\frac{\tilde{\theta}^{1}(s)}{\tilde{\theta}^{1}(s^{*})}<\frac{\rho^{1}}{\alpha} for all sS[θ¯]s\in S\setminus[\bar{\theta}]. Additionally, for any stage kk and any sS[θ¯]s\in S\setminus[\bar{\theta}], if θ~k(s)>ρ2\tilde{\theta}^{k}(s)>\rho^{2}, then θ~k(s)θ~k(s)>ρ2\frac{\tilde{\theta}^{k}(s)}{\tilde{\theta}^{k}(s^{*})}>\rho^{2} because θ~k(s)1\tilde{\theta}^{k}(s^{*})\leq 1. Hence, whenever θ~k(s)\tilde{\theta}^{k}(s) completes an upcrossing of the interval [ρ1,ρ2]\left[\rho^{1},\rho^{2}\right], θ~k(s)θ~k(s)\frac{\tilde{\theta}^{k}(s)}{\tilde{\theta}^{k}(s^{*})} must also have completed an upcrosssing of the interval [ρ1α,ρ2]\left[\frac{\rho^{1}}{\alpha},\rho^{2}\right]. To ensure that the interval [ρ1α,ρ2]\left[\frac{\rho^{1}}{\alpha},\rho^{2}\right] is non-empty, we need to verify that ρ1α=ρ1θ¯(s)ρ1<ρ2\frac{\rho^{1}}{\alpha}=\frac{\rho^{1}}{\bar{\theta}(s^{*})-\rho^{1}}<\rho^{2}, which requires that

ρ1<ρ2θ¯(s)1+ρ2=ϵ^θ¯(s)ϵ^+(|S[θ¯]|+1)|S|.\displaystyle\rho^{1}<\frac{\rho^{2}\bar{\theta}(s^{*})}{1+\rho^{2}}=\frac{\hat{\epsilon}\bar{\theta}(s^{*})}{\hat{\epsilon}+(|S\setminus[\bar{\theta}]|+1)|S|}. (23)

From (19a), since s[θ¯]s^{*}\in[\bar{\theta}], we can check that (23) is indeed satisfied. Thus, the interval [ρ1α,ρ2]\left[\frac{\rho^{1}}{\alpha},\rho^{2}\right] is non-empty.

We denote U^k(s)\hat{U}^{k}(s) as the number of upcrossings of the sequence (θ~j(s)θ~j(s))j=1k\left(\frac{\tilde{\theta}^{j}(s)}{\tilde{\theta}^{j}(s^{*})}\right)_{j=1}^{k} with respect to the interval [ρ1α,ρ2]\left[\frac{\rho^{1}}{\alpha},\rho^{2}\right] until stage kk. Then, Uk(s)U^k(s)U^{k}(s)\leq\hat{U}^{k}(s) for all kk. Therefore, we can write:

Pr(Uk(s)1)Pr(U^k(s)1)𝔼[U^k(s)],\displaystyle\mathrm{Pr}\left(U^{k}(s)\geq 1\right)\leq\mathrm{Pr}\left(\hat{U}^{k}(s)\geq 1\right)\leq\mathbb{E}\left[\hat{U}^{k}(s)\right], (24)

where the last inequality follows from the Markov inequality.

From the proof of Lemma 2, we know that the sequence (θ~k(s)θ~k(s))k=1\left(\frac{\tilde{\theta}^{k}(s)}{\tilde{\theta}^{k}(s^{*})}\right)_{k=1}^{\infty} is a martingale. Therefore, we can apply Doob’s upcrossing inequality as follows:

𝔼[U^k(s)]𝔼[max{ρ1αθ~k(s)θ~k(s),0}]ρ2ρ1αρ1αρ2ρ1α,k.\displaystyle\mathbb{E}\left[\hat{U}^{k}(s)\right]\leq\frac{\mathbb{E}\left[\max\{\frac{\rho^{1}}{\alpha}-\frac{\tilde{\theta}^{k}(s)}{\tilde{\theta}^{k}(s^{*})},0\}\right]}{\rho^{2}-\frac{\rho^{1}}{\alpha}}\leq\frac{\frac{\rho^{1}}{\alpha}}{\rho^{2}-\frac{\rho^{1}}{\alpha}},\quad\forall k. (25)

From (22) – (25), and (19a) – (19b), we have:

Pr(θ~k(s)<ρ2,sS[θ¯],k)1ρ1α|S[θ¯]|ρ2ρ1α=1ρ1θ¯(s)ρ1|S[θ¯]|ρ2ρ1θ¯(s)ρ1>γ,\displaystyle\mathrm{Pr}\left(\tilde{\theta}^{k}(s)<\rho^{2},~{}\forall s\in S\setminus[\bar{\theta}],~{}\forall k\right)\geq 1-\frac{\frac{\rho^{1}}{\alpha}|S\setminus[\bar{\theta}]|}{\rho^{2}-\frac{\rho^{1}}{\alpha}}=1-\frac{\frac{\rho^{1}}{\bar{\theta}(s^{*})-\rho^{1}}|S\setminus[\bar{\theta}]|}{\rho^{2}-\frac{\rho^{1}}{\bar{\theta}(s^{*})-\rho^{1}}}>\gamma,

where the last inequality is derived using (19a) – (19b). Finally, from (11), we know that Pr(θk(s)<ρ2,sS[θ¯],k)Pr(θ~k(s)<ρ2,sS[θ¯],k)>γ\mathrm{Pr}\left(\theta^{k}(s)<\rho^{2},~{}\forall s\in S\setminus[\bar{\theta}],~{}\forall k\right)\geq\mathrm{Pr}\left(\tilde{\theta}^{k}(s)<\rho^{2},~{}\forall s\in S\setminus[\bar{\theta}],~{}\forall k\right)>\gamma. \square

Furthermore, Lemma 7 characterizes another set of conditions on the initial state to ensure that the beliefs of remaining parameters s[θ¯]s\in[\bar{\theta}] satisfy |θk(s)θ¯(s)|<ϵ^|S||\theta^{k}(s)-\bar{\theta}(s)|<\frac{\hat{\epsilon}}{|S|}, and the strategy qkNδ(EQ(θ¯))q^{k}\in N_{\delta}\left(\mathrm{EQ}(\bar{\theta})\right) for all kk so long as θk(s)<ρ2\theta^{k}(s)<\rho^{2} for any parameter sS[θ¯]s\in S\setminus[\bar{\theta}]. Recall that θk(s)<ρ2\theta^{k}(s)<\rho^{2} for all sS[θ¯]s\in S\setminus[\bar{\theta}] is satisfied with probability higher than γ\gamma under the conditions in Lemma 6.

Lemma 7.

Under Assumption 4, if |θ1(s)θ¯(s)|<ρ3|\theta^{1}(s)-\bar{\theta}(s)|<\rho^{3} for all s[θ¯]s\in[\bar{\theta}] and q1Nδ(EQ(θ¯))q^{1}\in N_{\delta}\left(\mathrm{EQ}(\bar{\theta})\right), then

Pr(|θk(s)θ¯(s)|<ϵ^|S|,s[θ¯],kand qkNδ(EQ(θ¯)),k|θk(s)<ρ2,sS[θ¯],k)=1.\displaystyle\mathrm{Pr}\left(\left.\begin{array}[]{l}|\theta^{k}(s)-\bar{\theta}(s)|<\frac{\hat{\epsilon}}{|S|},~{}\forall s\in[\bar{\theta}],~{}\forall k\\ \text{and }q^{k}\in N_{\delta}\left(\mathrm{EQ}(\bar{\theta})\right),~{}\forall k\end{array}\right|\theta^{k}(s)<\rho^{2},~{}\forall s\in S\setminus[\bar{\theta}],~{}\forall k\right)=1. (28)

Proof of Lemma 7. From Assumption (A3a), we know that [θ¯]S(q1)[\bar{\theta}]\subseteq S^{*}(q^{1}) if q1Nδ(EQ(θ¯))q^{1}\in N_{\delta}\left(\mathrm{EQ}(\bar{\theta})\right). Hence, ϕs(y1|q1)=ϕs(y1|q1)\phi^{s}(y^{1}|q^{1})=\phi^{s^{*}}(y^{1}|q^{1}) for any s[θ¯]s\in[\bar{\theta}] and any realized payoff y1y^{1}. Therefore,

θ~2(s)θ~2(s)=θ~1(s)θ~1(s)ϕs(y1|q1)ϕs(y1|q1)=θ~1(s)θ~1(s),w.p.1,s[θ¯].\displaystyle\frac{\tilde{\theta}^{2}(s)}{\tilde{\theta}^{2}(s^{*})}=\frac{\tilde{\theta}^{1}(s)}{\tilde{\theta}^{1}(s^{*})}\frac{\phi^{s}(y^{1}|q^{1})}{\phi^{s^{*}}(y^{1}|q^{1})}=\frac{\tilde{\theta}^{1}(s)}{\tilde{\theta}^{1}(s^{*})},\quad w.p.~{}1,\quad\forall s\in[\bar{\theta}]. (29)

This implies that s[θ¯]θ~2(s)θ~2(s)=s[θ¯]θ~1(s)θ~1(s)\frac{\sum_{s\in[\bar{\theta}]}\tilde{\theta}^{2}(s)}{\tilde{\theta}^{2}(s^{*})}=\frac{\sum_{s\in[\bar{\theta}]}\tilde{\theta}^{1}(s)}{\tilde{\theta}^{1}(s^{*})}, and for all s[θ¯]s\in[\bar{\theta}]:

θ~2(s)s[θ¯]θ~2(s)=θ~2(s)θ~2(s)θ~2(s)s[θ¯]θ~2(s)=θ~1(s)θ~1(s)θ~1(s)s[θ¯]θ~1(s)=θ~1(s)s[θ¯]θ~1(s).\displaystyle\frac{\tilde{\theta}^{2}(s)}{\sum_{s\in[\bar{\theta}]}\tilde{\theta}^{2}(s)}=\frac{\tilde{\theta}^{2}(s)}{\tilde{\theta}^{2}(s^{*})}\frac{\tilde{\theta}^{2}(s^{*})}{\sum_{s\in[\bar{\theta}]}\tilde{\theta}^{2}(s)}=\frac{\tilde{\theta}^{1}(s)}{\tilde{\theta}^{1}(s^{*})}\frac{\tilde{\theta}^{1}(s^{*})}{\sum_{s\in[\bar{\theta}]}\tilde{\theta}^{1}(s)}=\frac{\tilde{\theta}^{1}(s)}{\sum_{s\in[\bar{\theta}]}\tilde{\theta}^{1}(s)}.

Thus, we have

θ~2(s)θ~1(s)=s[θ¯]θ~2(s)s[θ¯]θ~1(s),w.p.1,s[θ¯].\displaystyle\frac{\tilde{\theta}^{2}(s)}{\tilde{\theta}^{1}(s)}=\frac{\sum_{s\in[\bar{\theta}]}\tilde{\theta}^{2}(s)}{\sum_{s\in[\bar{\theta}]}\tilde{\theta}^{1}(s)},\quad w.p.~{}1,\quad\forall s\in[\bar{\theta}].

Since s[θ¯]θ~1(s)1\sum_{s\in[\bar{\theta}]}\tilde{\theta}^{1}(s)\leq 1, and conditional on θ~2(s)<ρ2\tilde{\theta}^{2}(s)<\rho^{2} for all sS[θ¯]s\in S\setminus[\bar{\theta}] as in (28), we have

θ~2(s)θ~1(s)=s[θ¯]θ~2(s)s[θ¯]θ~1(s)s[θ¯]θ~2(s)>1|S[θ¯]|ρ2,s[θ¯].\displaystyle\frac{\tilde{\theta}^{2}(s)}{\tilde{\theta}^{1}(s)}=\frac{\sum_{s\in[\bar{\theta}]}\tilde{\theta}^{2}(s)}{\sum_{s\in[\bar{\theta}]}\tilde{\theta}^{1}(s)}\geq\sum_{s\in[\bar{\theta}]}\tilde{\theta}^{2}(s)>1-|S\setminus[\bar{\theta}]|\rho^{2},\quad\forall s\in[\bar{\theta}].

Additionally, since s[θ¯]θ~2(s)1\sum_{s\in[\bar{\theta}]}\tilde{\theta}^{2}(s)\leq 1 and again conditional on θ~1(s)<ρ2\tilde{\theta}^{1}(s)<\rho^{2} for all sS[θ¯]s\in S\setminus[\bar{\theta}] as in (28), we have

θ~2(s)θ~1(s)=s[θ¯]θ~2(s)s[θ¯]θ~1(s)1s[θ¯]θ~1(s)<11|S[θ¯]|ρ2,s[θ¯].\displaystyle\frac{\tilde{\theta}^{2}(s)}{\tilde{\theta}^{1}(s)}=\frac{\sum_{s\in[\bar{\theta}]}\tilde{\theta}^{2}(s)}{\sum_{s\in[\bar{\theta}]}\tilde{\theta}^{1}(s)}\leq\frac{1}{\sum_{s\in[\bar{\theta}]}\tilde{\theta}^{1}(s)}<\frac{1}{1-|S\setminus[\bar{\theta}]|\rho^{2}},\quad\forall s\in[\bar{\theta}].

Since by (19c), ρ3<θ¯(s)\rho^{3}<\bar{\theta}(s) for all s[θ¯]s\in[\bar{\theta}], any θ~1(s)(θ¯(s)ρ3,θ¯(s)+ρ3)\tilde{\theta}^{1}(s)\in\left(\bar{\theta}(s)-\rho^{3},\bar{\theta}(s)+\rho^{3}\right) is a positive number for all s[θ¯]s\in[\bar{\theta}]. Therefore, we have the following bounds:

(θ¯(s)ρ3)(1|S[θ¯]|ρ2)<θ~2(s)<θ¯(s)+ρ31|S[θ¯]|ρ2,s[θ¯].\displaystyle\left(\bar{\theta}(s)-\rho^{3}\right)\left(1-|S\setminus[\bar{\theta}]|\rho^{2}\right)<\tilde{\theta}^{2}(s)<\frac{\bar{\theta}(s)+\rho^{3}}{1-|S\setminus[\bar{\theta}]|\rho^{2}},\quad\forall s\in[\bar{\theta}]. (30)

Since

ρ3<(19c)ϵ^|S[θ¯]||S|ρ2θ¯(s)|S||S[θ¯]||S|ρ2,s[θ¯],\displaystyle\rho^{3}\stackrel{{\scriptstyle\eqref{eq:rho_three}}}{{<}}\frac{\hat{\epsilon}-|S\setminus[\bar{\theta}]||S|\rho^{2}\bar{\theta}(s)}{|S|-|S\setminus[\bar{\theta}]||S|\rho^{2}},\quad\forall s\in[\bar{\theta}], (31)

we can check that (θ¯(s)ρ3)(1|S[θ¯]|ρ2)>θ¯(s)ϵ^|S|\left(\bar{\theta}(s)-\rho^{3}\right)\left(1-|S\setminus[\bar{\theta}]|\rho^{2}\right)>\bar{\theta}(s)-\frac{\hat{\epsilon}}{|S|} for all s[θ¯]s\in[\bar{\theta}]. To ensure the right-hand-side of (31) is positive, we need to have ρ2<ϵ^|S[θ¯]||S|θ¯(s)\rho^{2}<\frac{\hat{\epsilon}}{|S\setminus[\bar{\theta}]||S|\bar{\theta}(s)} for all s[θ¯]s\in[\bar{\theta}], and ρ2<1/|S[θ¯]|\rho^{2}<1/|S\setminus[\bar{\theta}]|, which is indeed satisfied by (19b).

Also, since

ρ3<(19c)ϵ^|S|ρ2|S[θ¯]|(θ¯(s)+ϵ^|S|),s[θ¯].\displaystyle\rho^{3}\stackrel{{\scriptstyle\eqref{eq:rho_three}}}{{<}}\frac{\hat{\epsilon}}{|S|}-\rho^{2}|S\setminus[\bar{\theta}]|\left(\bar{\theta}(s)+\frac{\hat{\epsilon}}{|S|}\right),\quad\forall s\in[\bar{\theta}]. (32)

we have θ¯(s)+ρ31|S[θ¯]|ρ2<θ¯(s)+ϵ^|S|\frac{\bar{\theta}(s)+\rho^{3}}{1-|S\setminus[\bar{\theta}]|\rho^{2}}<\bar{\theta}(s)+\frac{\hat{\epsilon}}{|S|} for all s[θ¯]s\in[\bar{\theta}]. Below, we check that the right-hand-side of (32) is positive, and thus the constraint of ρ3\rho^{3} in (32) is feasible:

\frac{\hat{\epsilon}}{|S|}-\rho^{2}|S\setminus[\bar{\theta}]|\left(\bar{\theta}(s)+\frac{\hat{\epsilon}}{|S|}\right)\geq\frac{\hat{\epsilon}}{|S|}-\rho^{2}|S\setminus[\bar{\theta}]|\left(1+\frac{\hat{\epsilon}}{|S|}\right)\stackrel{(19b)}{=}\frac{\hat{\epsilon}}{|S|}\frac{1}{|S\setminus[\bar{\theta}]|+1}-\frac{\hat{\epsilon}}{|S|}\frac{\hat{\epsilon}|S\setminus[\bar{\theta}]|}{(|S\setminus[\bar{\theta}]|+1)|S|}=\frac{\hat{\epsilon}}{|S|}\frac{|S|-\hat{\epsilon}|S\setminus[\bar{\theta}]|}{(|S\setminus[\bar{\theta}]|+1)|S|}>0.
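The upper bound stated before (32) follows directly from (32) as well: adding \bar{\theta}(s) to both sides of (32) gives

\bar{\theta}(s)+\rho^{3}<\left(\bar{\theta}(s)+\frac{\hat{\epsilon}}{|S|}\right)\left(1-|S\setminus[\bar{\theta}]|\rho^{2}\right),

and dividing both sides by 1-|S\setminus[\bar{\theta}]|\rho^{2}>0 yields \frac{\bar{\theta}(s)+\rho^{3}}{1-|S\setminus[\bar{\theta}]|\rho^{2}}<\bar{\theta}(s)+\frac{\hat{\epsilon}}{|S|}.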

From (30), we can conclude that |\tilde{\theta}^{2}(s)-\bar{\theta}(s)|<\frac{\hat{\epsilon}}{|S|} for all s\in[\bar{\theta}]. Additionally, conditioning on the event that \tilde{\theta}^{2}(s)<\rho^{2}<\frac{\hat{\epsilon}}{|S|} for all s\in S\setminus[\bar{\theta}] as in (28), we have \tilde{\theta}^{2}\in N_{\hat{\epsilon}}\left(\bar{\theta}\right). From (11), we have \theta^{2}\in N_{\hat{\epsilon}}\left(\bar{\theta}\right). From (ii) in Lemma 5, since q^{1}\in N_{\delta}\left(\mathrm{EQ}(\bar{\theta})\right), we know that the updated strategy q^{2}=F(\theta^{2},q^{1})\in N_{\delta}\left(\mathrm{EQ}(\bar{\theta})\right).

We now use mathematical induction to prove that the belief of any s\in[\bar{\theta}] satisfies |\tilde{\theta}^{k}(s)-\bar{\theta}(s)|<\frac{\hat{\epsilon}}{|S|} for stages k\geq 2. If in stages j=1,\dots,k, |\tilde{\theta}^{j}(s)-\bar{\theta}(s)|<\frac{\hat{\epsilon}}{|S|} for all s\in[\bar{\theta}] and \tilde{\theta}^{j}(s)<\rho^{2}<\frac{\hat{\epsilon}}{|S|} for all s\in S\setminus[\bar{\theta}], then \tilde{\theta}^{j}\in N_{\hat{\epsilon}}\left(\bar{\theta}\right) for all j=1,\dots,k. Thus, from (11) and (ii) in Lemma 5, we have \theta^{j}\in N_{\hat{\epsilon}}\left(\bar{\theta}\right) and F\left(\theta^{j},q^{j-1}\right)\subseteq N_{\delta}\left(\mathrm{EQ}(\bar{\theta})\right). Therefore, q^{j}\in N_{\delta}\left(\mathrm{EQ}(\bar{\theta})\right) for all j=2,\dots,k.

From Assumption (A3a), we know that [\bar{\theta}]\subseteq S^{*}(q^{j}) for all j=1,\dots,k. Therefore, for any s\in[\bar{\theta}] and any j=1,\dots,k, \phi^{s}(y^{j}|q^{j})=\phi^{s^{*}}(y^{j}|q^{j}) with probability 1. Then, by iteratively applying (29) for j=1,2,\dots,k, we have \frac{\tilde{\theta}^{k+1}(s)}{\tilde{\theta}^{1}(s)}=\frac{\sum_{s\in[\bar{\theta}]}\tilde{\theta}^{k+1}(s)}{\sum_{s\in[\bar{\theta}]}\tilde{\theta}^{1}(s)} for all s\in[\bar{\theta}] with probability 1. Analogously to the case k=2, we can prove that |\tilde{\theta}^{k+1}(s)-\bar{\theta}(s)|<\frac{\hat{\epsilon}}{|S|} for all s\in[\bar{\theta}]. From (11), we also have |\theta^{k+1}(s)-\bar{\theta}(s)|<\frac{\hat{\epsilon}}{|S|} for all s\in[\bar{\theta}]. By the principle of mathematical induction, we conclude that in all stages k, |\theta^{k}(s)-\bar{\theta}(s)|<\frac{\hat{\epsilon}}{|S|} for all s\in[\bar{\theta}], and q^{k}\in N_{\delta}\left(\mathrm{EQ}(\bar{\theta})\right) for all k. Therefore, we have proved (28).

\square

Finally, for any \gamma\in(0,1), and any \bar{\epsilon},\bar{\delta}>0, we set \epsilon^{1}=\min\{\rho^{1},\rho^{3}\} given by (19a) – (19c), and \delta^{1}=\delta. We are now ready to prove Theorem 2.

Proof of Theorem 2. If \theta^{1}\in N_{\epsilon^{1}}(\bar{\theta}), then |\theta^{1}(s)-\bar{\theta}(s)|<\epsilon^{1} for all s\in S. Recall from (iii) in Lemma 5 that \lim_{k\to\infty}\mathrm{Pr}\left(\theta^{k}\in N_{\bar{\epsilon}}(\bar{\theta}),~q^{k}\in N_{\bar{\delta}}(\mathrm{EQ}(\bar{\theta}))\right)\geq\mathrm{Pr}\left(\theta^{k}\in N_{\hat{\epsilon}}(\bar{\theta}),~q^{k}\in N_{\delta}\left(\mathrm{EQ}(\bar{\theta})\right),~\forall k\right). Since \rho^{2}<\hat{\epsilon}/|S|, we further have:

\mathrm{Pr}\left(\theta^{k}\in N_{\hat{\epsilon}}(\bar{\theta}),~q^{k}\in N_{\delta}\left(\mathrm{EQ}(\bar{\theta})\right),~\forall k\right)\geq\mathrm{Pr}\left(\begin{array}{l}|\theta^{k}(s)-\bar{\theta}(s)|<\frac{\hat{\epsilon}}{|S|},~\forall s\in[\bar{\theta}],~\forall k,\text{ and}\\ \theta^{k}(s)<\rho^{2},~\forall s\in S\setminus[\bar{\theta}],~\forall k,~q^{k}\in N_{\delta}\left(\mathrm{EQ}(\bar{\theta})\right),~\forall k\end{array}\right)
=\mathrm{Pr}\left(\theta^{k}(s)<\rho^{2},~\forall s\in S\setminus[\bar{\theta}],~\forall k\right)\cdot\mathrm{Pr}\left(\left.\begin{array}{l}|\theta^{k}(s)-\bar{\theta}(s)|<\frac{\hat{\epsilon}}{|S|},~\forall s\in[\bar{\theta}],~\forall k,\\ \text{and }q^{k}\in N_{\delta}\left(\mathrm{EQ}(\bar{\theta})\right),~\forall k\end{array}\right|\theta^{k}(s)<\rho^{2},~\forall s\in S\setminus[\bar{\theta}],~\forall k\right).

For any \theta^{1}\in N_{\epsilon^{1}}(\bar{\theta}) and any q^{1}\in N_{\delta^{1}}\left(\mathrm{EQ}(\bar{\theta})\right), we know from Lemmas 6 and 7 that:

\mathrm{Pr}\left(\theta^{k}(s)<\rho^{2},~\forall s\in S\setminus[\bar{\theta}],~\forall k\right)>\gamma,\text{ and}
\mathrm{Pr}\left(\left.\begin{array}{l}|\theta^{k}(s)-\bar{\theta}(s)|<\frac{\hat{\epsilon}}{|S|},~\forall s\in[\bar{\theta}],~\forall k,\\ \text{and }q^{k}\in N_{\delta}\left(\mathrm{EQ}(\bar{\theta})\right),~\forall k\end{array}\right|\theta^{k}(s)<\rho^{2},~\forall s\in S\setminus[\bar{\theta}],~\forall k\right)=1.

Therefore, for any \theta^{1}\in N_{\epsilon^{1}}(\bar{\theta}) and any q^{1}\in N_{\delta^{1}}\left(\mathrm{EQ}(\bar{\theta})\right), the states of the learning dynamics satisfy \lim_{k\to\infty}\mathrm{Pr}\left(\theta^{k}\in N_{\bar{\epsilon}}(\bar{\theta}),~q^{k}\in N_{\bar{\delta}}\left(\mathrm{EQ}(\bar{\theta})\right)\right)>\gamma. Thus, \left(\bar{\theta},\bar{q}\right) is locally stable. \square

5 Extensions

In this section, we briefly discuss two extensions of the learning model introduced in Section 2: (1) Learning with two timescales; (2) Learning in games with finite strategies.

(1) Learning with two timescales. Consider the case where strategy updates occur on a faster timescale than belief updates, i.e. \lim_{t\to\infty}k_{t+1}-k_{t}=\infty with probability 1. Then, our convergence and stability results still hold as follows:

Proposition 4.

Under Assumption 2, the sequence of states \left(\theta^{k},q^{k}\right)_{k=1}^{\infty} induced by (\theta-update) – (q-update) converges to a fixed point \left(\bar{\theta},\bar{q}\right) with probability 1, and \left(\bar{\theta},\bar{q}\right) satisfies (4a) and (4b).

Additionally, under Assumption 2, a fixed point (\bar{\theta},\bar{q}) is locally stable if Assumption 4 is satisfied.

We note that the convergence and stability results in Proposition 4 do not rely on Assumption 1. Since \lim_{t\to\infty}k_{t+1}-k_{t}=\infty with probability 1, as t\rightarrow\infty, the strategies between any two belief updates k_{t} and k_{t+1} are updated with the static belief \theta^{k_{t}}. Therefore, Assumption 2 directly ensures that strategies converge to an equilibrium strategy profile in \mathrm{EQ}\left(\theta^{k_{t}}\right) without relying on the construction of auxiliary strategies or Assumption 1. Then, the updated belief \theta^{k_{t+1}} forms an accurate payoff estimate given the equilibrium strategy, following Lemma 4. By repeating this process, we can show that the beliefs and strategies converge to a fixed point. Similarly, the local stability result does not rely on Assumption 1, and it holds following Theorem 2. The global stability result is also analogous to that in Proposition 1.

(2) Learning in games with finite strategies. Our results in Section 3 can be extended to learning in games where strategy sets are finite and players can choose mixed strategies. In this game, each player i's action set (pure strategies) is a finite set A_{i}, and the action profile (pure strategy profile) is denoted as a=\left(a_{i}\right)_{i\in I}\in A=\prod_{i\in I}A_{i}. Given any parameter s and any action profile a, the distribution of players' payoff y is \phi^{s}\left(y|a\right).

We denote player i's mixed strategy as q_{i}=\left(q_{i}(a_{i})\right)_{a_{i}\in A_{i}}\in Q_{i}=\Delta\left(A_{i}\right), where q_{i}(a_{i}) is the probability of choosing the action a_{i}. The mixed strategy set Q_{i} is bounded and convex. Players' action profile in each stage k, denoted as a^{k}=\left(a_{i}^{k}\right)_{i\in I}, is realized from the mixed strategy profile q^{k}. Analogous to (\theta-update) – (q-update), beliefs and strategies in each stage k are updated based on the past actions \left(a^{j}\right)_{j=1}^{k} and the realized payoff vectors \left(y^{j}\right)_{j=1}^{k}.
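To illustrate the structure of these coupled updates in the finite-strategy setting, the following minimal Python sketch simulates a hypothetical two-player, two-action example with Gaussian payoff noise. The payoff matrices, the noise model, and the logit (smoothed best-response) rule used as a stand-in for the generic strategy-update rule are all illustrative assumptions, not part of the model specification above.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical example: two players, two actions each, Gaussian payoff noise.
# payoff_mean[s][i][a_1, a_2] is player i's mean payoff under parameter s.
S = [0, 1]
s_star = 0          # true parameter (unknown to the players and the platform)
sigma = 1.0
payoff_mean = {
    0: [np.array([[1.0, 0.0], [0.0, 1.0]]), np.array([[1.0, 0.0], [0.0, 1.0]])],
    1: [np.array([[0.0, 1.0], [1.0, 0.0]]), np.array([[0.0, 1.0], [1.0, 0.0]])],
}

theta = np.array([0.5, 0.5])                          # public belief over S
q = [np.array([0.5, 0.5]), np.array([0.5, 0.5])]      # mixed strategies q_1, q_2

def logit_response(theta, q, i, tau=0.1):
    # Smoothed best response to the opponent's mixed strategy under the current
    # belief; a stand-in for the generic strategy-update rule.
    M = sum(theta[s] * payoff_mean[s][i] for s in S)  # belief-averaged payoff matrix
    exp_payoff = M @ q[1] if i == 0 else q[0] @ M
    w = np.exp(exp_payoff / tau)
    return w / w.sum()

for k in range(500):
    # realized action profile and noisy payoffs under the true parameter
    a = (rng.choice(2, p=q[0]), rng.choice(2, p=q[1]))
    y = [rng.normal(payoff_mean[s_star][i][a], sigma) for i in range(2)]
    # Bayesian belief update from the realized actions and payoffs
    lik = np.array([np.prod([np.exp(-(y[i] - payoff_mean[s][i][a]) ** 2 / (2 * sigma ** 2))
                             for i in range(2)]) for s in S])
    theta = theta * lik / np.sum(theta * lik)
    # strategy update given the updated belief
    q = [logit_response(theta, q, i) for i in range(2)]

print("belief:", theta, "strategies:", q)

Because the logit rule keeps every action profile played with positive probability, the belief in this illustrative example tends to concentrate on the true parameter, which is consistent with the discussion of local exploration below.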

The convergence result in Theorem 1 can be readily extended to games with finite strategy sets: Under Assumptions 1 and 2, the sequence of beliefs and mixed strategies \left(\theta^{k},q^{k}\right)_{k=1}^{\infty} converges to a fixed point (\bar{\theta},\bar{q}), where \bar{q}\in\mathrm{EQ}(\bar{\theta}) is a mixed Nash equilibrium associated with the belief \bar{\theta}, and \bar{\theta} accurately estimates the payoff distribution for all action profiles that are taken with positive probability in \bar{q}. Additionally, our results on global and local stability properties (Proposition 1 and Theorem 2) also hold analogously.

Moreover, in games with finite strategy sets, any local neighborhood of a fixed-point strategy profile includes mixed strategies with full support on all action profiles, and any parameter s\neq s^{*} can then be distinguished from s^{*} through the realized payoffs. In this case, only complete information fixed points satisfy the local consistency condition (A3a), and thus are locally stable (Proposition 2); any other fixed point is non-robust to local perturbations of strategies. Hence, our stability analysis implies that complete learning is guaranteed by local exploration in games with finite strategies. Our observation that local exploration is sufficient to identify the complete information equilibrium in games with finite strategies is consistent with the literature on equilibrium learning in normal form games with unknown payoff matrices (Hart and Mas-Colell (2003); Foster and Young (2006); Marden et al. (2007); Daskalakis et al. (2011); Syrgkanis et al. (2015)). Indeed, the algorithms proposed in this literature rely on local exploration – repeatedly sampling non-equilibrium actions with small probability to learn the payoff matrices.

6 Concluding Remarks

In this article, we studied stochastic learning dynamics induced by a set of strategic players who repeatedly play a game with an unknown parameter. We analyzed the convergence of beliefs and strategies induced by the stochastic dynamics, and derived conditions for local and global stability of fixed points. We also provided conditions that guarantee that learning converges to a complete information equilibrium.

A future research question of interest is to analyze the learning dynamics when players seek to efficiently learn the true parameter by choosing off-equilibrium strategies. When one or more parameters are payoff-equivalent to the true parameter at a fixed point, complete learning requires players to adopt strategies that may reduce their individual payoffs in some stages. In our setup, if a player were to choose a non-equilibrium strategy, the information resulting from that player's realized payoff would be incorporated into the belief update, and the new belief would be known to all players. Under what scenarios utility-maximizing players will choose to engage in such explorative behavior is an interesting question worthy of further investigation.

Another promising extension is to study the multi-agent reinforcement learning problem from a Bayesian viewpoint. In such settings, the unknown parameter changes over time according to a Markovian transition process, and players may have imperfect or no knowledge of the underlying transition kernel. The approach presented in this article for studying the dynamic interplay between belief updates and strategy updates is useful for analyzing how players learn belief estimates of payoffs that depend on the latent Markov state, and adaptively adjust their strategies so that they converge to a stationary equilibrium.

Acknowledgement

The authors are grateful to the area editor, the associate editor and the three reviewers for useful suggestions and constructive feedback. We thank participants of the Cornell ORIE Colloquium (2021), Games, Decisions and Networks Seminar (2021), seminar at Simons Institute for the Theory of Computing (2022), and the Spring 2022 Colloquium at the C3.ai Digital Transformation Institute for useful discussions. This research was supported in part by Simons Fellowship, Michael Hammer Fellowship, and AFOSR project Building Attack Resilience into Complex Networks.

References

  • Acemoglu et al. [2014] Daron Acemoglu, Kostas Bimpikis, and Asuman Ozdaglar. Dynamics of information exchange in endogenous social networks. Theoretical Economics, 9(1):41–97, 2014.
  • Acemoglu et al. [2017] Daron Acemoglu, Ali Makhdoumi, Azarakhsh Malekian, and Asuman Ozdaglar. Fast and slow learning from reviews. Technical report, National Bureau of Economic Research, 2017.
  • Alós-Ferrer and Netzer [2010] Carlos Alós-Ferrer and Nick Netzer. The logit-response dynamics. Games and Economic Behavior, 68(2):413–427, 2010.
  • Auer et al. [2002] Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine learning, 47(2):235–256, 2002.
  • Banerjee [1992] Abhijit V Banerjee. A simple model of herd behavior. The Quarterly Journal of Economics, 107(3):797–817, 1992.
  • Beggs [2005] Alan W Beggs. On the convergence of reinforcement learning. Journal of Economic Theory, 122(1):1–36, 2005.
  • Benaim and Hirsch [1999] Michel Benaim and Morris W Hirsch. Mixed equilibria and dynamical systems arising from fictitious play in perturbed games. Games and Economic Behavior, 29(1-2):36–72, 1999.
  • Blume et al. [1993] Lawrence E Blume et al. The statistical mechanics of strategic interaction. Games and Economic Behavior, 5(3):387–424, 1993.
  • Bravo et al. [2018] Mario Bravo, David Leslie, and Panayotis Mertikopoulos. Bandit learning in concave n-person games. Advances in Neural Information Processing Systems, 31, 2018.
  • Cesa-Bianchi and Lugosi [2006] Nicolo Cesa-Bianchi and Gábor Lugosi. Prediction, learning, and games. Cambridge University Press, 2006.
  • Cominetti et al. [2010] Roberto Cominetti, Emerson Melo, and Sylvain Sorin. A payoff-based learning procedure and its application to traffic games. Games and Economic Behavior, 70(1):71–83, 2010.
  • Daskalakis et al. [2011] Constantinos Daskalakis, Alan Deckelbaum, and Anthony Kim. Near-optimal no-regret algorithms for zero-sum games. In Proceedings of the Twenty-second Annual ACM-SIAM Symposium on Discrete Algorithms, pages 235–254. SIAM, 2011.
  • Dumett and Cominetti [2018] Miguel A Dumett and Roberto Cominetti. On the stability of an adaptive learning dynamics in traffic games. arXiv preprint arXiv:1807.01256, 2018.
  • Foster and Young [2006] Dean Foster and Hobart Peyton Young. Regret testing: Learning to play Nash equilibrium without knowing you have an opponent. Theoretical Economics, 1(3):341–367, 2006.
  • Fudenberg and Kreps [1993] Drew Fudenberg and David M Kreps. Learning mixed equilibria. Games and Economic Behavior, 5(3):320–367, 1993.
  • Fudenberg and Kreps [1995] Drew Fudenberg and David M Kreps. Learning in extensive-form games I. self-confirming equilibria. Games and Economic Behavior, 8(1):20–55, 1995.
  • Fudenberg and Levine [1993] Drew Fudenberg and David K Levine. Self-confirming equilibrium. Econometrica: Journal of the Econometric Society, pages 523–545, 1993.
  • Fudenberg and Tirole [1991] Drew Fudenberg and Jean Tirole. Game theory. MIT press, 1991.
  • Gale and Kariv [2003] Douglas Gale and Shachar Kariv. Bayesian learning in social networks. Games and Economic Behavior, 45(2):329–346, 2003.
  • Golowich et al. [2020] Noah Golowich, Sarath Pattathil, and Constantinos Daskalakis. Tight last-iterate convergence rates for no-regret learning in multi-player games. Advances in neural information processing systems, 33:20766–20778, 2020.
  • Golub and Jackson [2010] Benjamin Golub and Matthew O Jackson. Naive learning in social networks and the wisdom of crowds. American Economic Journal: Microeconomics, 2(1):112–49, 2010.
  • Hart and Mas-Colell [2000] Sergiu Hart and Andreu Mas-Colell. A simple adaptive procedure leading to correlated equilibrium. Econometrica, 68(5):1127–1150, 2000.
  • Hart and Mas-Colell [2003] Sergiu Hart and Andreu Mas-Colell. Regret-based continuous-time dynamics. Games and Economic Behavior, 45(2):375–394, 2003.
  • Hofbauer and Sandholm [2002] Josef Hofbauer and William H Sandholm. On the global convergence of stochastic fictitious play. Econometrica, 70(6):2265–2294, 2002.
  • Hofbauer and Sandholm [2009] Josef Hofbauer and William H Sandholm. Stable games and their dynamics. Journal of Economic Theory, 144(4):1665–1693, 2009.
  • Hofbauer and Sorin [2006] Josef Hofbauer and Sylvain Sorin. Best response dynamics for continuous zero-sum games. Discrete and Continuous Dynamical Systems Series B, 6(1):215, 2006.
  • Hopkins [2002] Ed Hopkins. Two competing models of how people learn in games. Econometrica, 70(6):2141–2166, 2002.
  • Lattimore and Szepesvári [2020] Tor Lattimore and Csaba Szepesvári. Bandit algorithms. Cambridge University Press, 2020.
  • Marden and Shamma [2012] Jason R Marden and Jeff S Shamma. Revisiting log-linear learning: Asynchrony, completeness and payoff-based implementation. Games and Economic Behavior, 75(2):788–808, 2012.
  • Marden et al. [2007] Jason R Marden, Gürdal Arslan, and Jeff S Shamma. Regret based dynamics: convergence in weakly acyclic games. In Proceedings of the 6th International Joint Conference on Autonomous Agents and Multiagent Systems, page 42. ACM, 2007.
  • Marden et al. [2009] Jason R Marden, H Peyton Young, Gürdal Arslan, and Jeff S Shamma. Payoff-based dynamics for multiplayer weakly acyclic games. SIAM Journal on Control and Optimization, 48(1):373–396, 2009.
  • Matsui [1992] Akihiko Matsui. Best response dynamics and socially stable strategies. Journal of Economic Theory, 57(2):343–362, 1992.
  • Meigs et al. [2017] Emily Meigs, Francesca Parise, and Asuman Ozdaglar. Learning dynamics in stochastic routing games. In 2017 55th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 259–266. IEEE, 2017.
  • Mertikopoulos and Zhou [2019] Panayotis Mertikopoulos and Zhengyuan Zhou. Learning in games with continuous action sets and unknown payoff functions. Mathematical Programming, 173(1):465–507, 2019.
  • Milgrom and Roberts [1990] Paul Milgrom and John Roberts. Rationalizability, learning, and equilibrium in games with strategic complementarities. Econometrica: Journal of the Econometric Society, pages 1255–1277, 1990.
  • Moe and Fader [2004] Wendy W Moe and Peter S Fader. Dynamic conversion behavior at e-commerce sites. Management Science, 50(3):326–335, 2004.
  • Monderer and Shapley [1996a] Dov Monderer and Lloyd S Shapley. Fictitious play property for games with identical interests. Journal of Economic Theory, 68(1):258–265, 1996a.
  • Monderer and Shapley [1996b] Dov Monderer and Lloyd S Shapley. Potential games. Games and Economic Behavior, 14(1):124–143, 1996b.
  • Mossel et al. [2015] Elchanan Mossel, Allan Sly, and Omer Tamuz. Strategic learning and the topology of social networks. Econometrica, 83(5):1755–1794, 2015.
  • Rosen [1965] J Ben Rosen. Existence and uniqueness of equilibrium points for concave n-person games. Econometrica: Journal of the Econometric Society, pages 520–534, 1965.
  • Samuelson and Zhang [1992] Larry Samuelson and Jianbo Zhang. Evolutionary stability in asymmetric games. Journal of Economic Theory, 57(2):363–391, 1992.
  • Sandholm [2010a] William H Sandholm. Local stability under evolutionary game dynamics. Theoretical Economics, 5(1):27–50, 2010a.
  • Sandholm [2010b] William H Sandholm. Population games and evolutionary dynamics. MIT press, 2010b.
  • Smith and Price [1973] J Maynard Smith and George R Price. The logic of animal conflict. Nature, 246(5427):15–18, 1973.
  • Syrgkanis et al. [2015] Vasilis Syrgkanis, Alekh Agarwal, Haipeng Luo, and Robert E Schapire. Fast convergence of regularized learning in games. Advances in Neural Information Processing Systems, 28, 2015.
  • Taylor and Jonker [1978] Peter D Taylor and Leo B Jonker. Evolutionary stable strategies and game dynamics. Mathematical Biosciences, 40(1-2):145–156, 1978.
  • Wu et al. [2021] Manxi Wu, Saurabh Amin, and Asuman E Ozdaglar. Value of information in Bayesian routing games. Operations Research, 69(1):148–163, 2021.
  • Zhu et al. [2010] Shanjiang Zhu, David Levinson, Henry X Liu, and Kathleen Harder. The traffic and behavioral effects of the I-35W Mississippi River Bridge collapse. Transportation Research Part A: Policy and Practice, 44(10):771–784, 2010.

Appendix A Supplementary Proofs

Proof of Lemma 1. Starting from any initial belief \tilde{\theta}^{1}, consider a sequence of strategies Q^{k-1}\stackrel{\Delta}{=}\left(q^{j}\right)_{j=1}^{k-1} and a sequence of realized payoffs Y^{k-1}\stackrel{\Delta}{=}\left(y^{j}\right)_{j=1}^{k-1} before stage k. The belief \tilde{\theta}^{k} is obtained by repeatedly updating \tilde{\theta}^{1} based on Q^{k-1} and Y^{k-1} using (8). Then, from (12), we obtain the following:

\mathbb{E}\left[\left.\frac{\tilde{\theta}^{k+1}(s)}{\tilde{\theta}^{k+1}(s^{*})}\right|\tilde{\theta}^{1},Q^{k-1},Y^{k-1}\right]=\frac{\tilde{\theta}^{k}(s)}{\tilde{\theta}^{k}(s^{*})}\cdot\mathbb{E}\left[\frac{\phi^{s}(y^{k}|q^{k})}{\phi^{s^{*}}(y^{k}|q^{k})}\right], \quad (33)

where \tilde{\theta}^{k} and q^{k} are derived from the belief and strategy updates given by Q^{k-1} and Y^{k-1}. Note that

\mathbb{E}\left[\frac{\phi^{s}(y^{k}|q^{k})}{\phi^{s^{*}}(y^{k}|q^{k})}\right]=\int_{y^{k}}\left(\frac{\phi^{s}(y^{k}|q^{k})}{\phi^{s^{*}}(y^{k}|q^{k})}\right)\cdot\phi^{s^{*}}(y^{k}|q^{k})dy^{k}=\int_{y^{k}}\phi^{s}(y^{k}|q^{k})dy^{k}=1.

Hence, for any k=1,2,\dots,

\mathbb{E}\left[\left.\frac{\tilde{\theta}^{k+1}(s)}{\tilde{\theta}^{k+1}(s^{*})}\right|\tilde{\theta}^{1},Q^{k-1},Y^{k-1}\right]=\frac{\tilde{\theta}^{k}(s)}{\tilde{\theta}^{k}(s^{*})},\quad\forall s\in S.

Since \frac{\tilde{\theta}^{k}(s)}{\tilde{\theta}^{k}(s^{*})}\geq 0, the sequence \left(\frac{\tilde{\theta}^{k}(s)}{\tilde{\theta}^{k}(s^{*})}\right)_{k=1}^{\infty} is a non-negative martingale for any s\in S. \square
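As a quick numerical illustration of the unit-expectation identity used above, the following Python snippet checks by Monte Carlo that a likelihood ratio has expectation one under the true density; the Gaussian densities standing in for \phi^{s}(\cdot|q^{k}) and \phi^{s^{*}}(\cdot|q^{k}) are an illustrative assumption.

import numpy as np

rng = np.random.default_rng(0)

def gauss_pdf(x, mu, sigma):
    # Gaussian density, standing in for phi^s(y|q)
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

mu_s, mu_star, sigma = 1.0, 0.0, 1.0
y = rng.normal(mu_star, sigma, size=1_000_000)        # samples from the true density
ratio = gauss_pdf(y, mu_s, sigma) / gauss_pdf(y, mu_star, sigma)
print(ratio.mean())  # close to 1, as in the identity E[phi^s / phi^{s*}] = 1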

Proof of Lemma 2. From the martingale convergence theorem and Lemma 1, we know that \frac{\tilde{\theta}^{k}(s)}{\tilde{\theta}^{k}(s^{*})} converges with probability 1 for every s\in S. Therefore, \frac{\sum_{s\in S}\tilde{\theta}^{k}(s)}{\tilde{\theta}^{k}(s^{*})} converges with probability 1. Since \sum_{s\in S}\tilde{\theta}^{k}(s)=1 and \tilde{\theta}^{k}(s^{*})>0 for all k, \tilde{\theta}^{k}(s^{*}) also converges with probability 1. We can thus conclude that \tilde{\theta}^{k}(s) converges with probability 1 for all s\in S, and the limit vector \bar{\theta} is a belief vector. Moreover, from (11), we know that \theta^{k} also converges to \bar{\theta} with probability 1. \square

Lemma 8 (Fudenberg and Tirole [1991]).

The equilibrium set \mathrm{EQ}(\theta) is upper-hemicontinuous in the belief \theta if the utility function u_{i}^{s}(q) is continuous in q for all i\in I and all s\in S.

Proof of Lemma 5.

  • (i)

    From Lemma 8, we know that such an \epsilon^{\prime} must exist.

  • (ii)

    Since \hat{\epsilon}\leq\epsilon, we know from Assumption (A3b) that F(\theta,q)\subseteq N_{\delta}\left(\mathrm{EQ}(\bar{\theta})\right) for any \theta\in N_{\hat{\epsilon}}\left(\bar{\theta}\right) and any q\in N_{\delta}\left(\mathrm{EQ}(\bar{\theta})\right).

  • (iii)

    Under Assumptions 1 and 2, we know from Theorem 1 that the sequence of beliefs and strategies converges to a fixed point, denoted \left(\theta^{\dagger},q^{\dagger}\right) to distinguish it from the original fixed point (\bar{\theta},\bar{q}). If \theta^{k}\in N_{\hat{\epsilon}}(\bar{\theta}) for all k, then \lim_{k\to\infty}\theta^{k}=\theta^{\dagger}\in N_{\hat{\epsilon}}(\bar{\theta})\subseteq N_{\bar{\epsilon}}(\bar{\theta}). Additionally, from (i) and the fact that \hat{\epsilon}\leq\epsilon^{\prime}, we know that \lim_{k\to\infty}q^{k}=q^{\dagger}\in\mathrm{EQ}(\theta^{\dagger})\subseteq N_{\bar{\delta}}\left(\mathrm{EQ}(\bar{\theta})\right). Therefore,

    \lim_{k\to\infty}\mathrm{Pr}\left(\theta^{k}\in N_{\bar{\epsilon}}(\bar{\theta}),~q^{k}\in N_{\bar{\delta}}(\mathrm{EQ}(\bar{\theta}))\right)\geq\mathrm{Pr}\left(\theta^{k}\in N_{\hat{\epsilon}}(\bar{\theta}),~\forall k\right)\geq\mathrm{Pr}\left(\theta^{k}\in N_{\hat{\epsilon}}(\bar{\theta}),~q^{k}\in N_{\delta}(\mathrm{EQ}(\bar{\theta})),~\forall k\right).

\square

Proof of Proposition 1. We first prove that (a) is equivalent to (c). On one hand, if \Omega=\{\left(\theta^{*},\mathrm{EQ}(\theta^{*})\right)\}, then for any initial state, the learning dynamics converges to a complete information fixed point with belief \theta^{*} and strategy in \mathrm{EQ}(\theta^{*}). That is, \left(\theta^{*},\mathrm{EQ}(\theta^{*})\right) is globally stable. On the other hand, if there exists another fixed point \left(\theta^{\dagger},q^{\dagger}\right)\in\Omega\setminus\{\left(\theta^{*},\mathrm{EQ}(\theta^{*})\right)\}, then learning that starts with the initial belief \theta^{1}=\theta^{\dagger} (resp. \theta^{1}=\theta^{*}) and strategy q^{1}=q^{\dagger} (resp. q^{1}=q^{*}) remains at \left(\theta^{\dagger},q^{\dagger}\right) (resp. \left(\theta^{*},q^{*}\right)) for all stages w.p. 1. Thus, in this case, globally stable fixed points do not exist.

Next, we prove that (b) is equivalent to (c). If [\theta]\setminus S^{*}(q) is a non-empty set for any \theta\in\Delta(S)\setminus\{\theta^{*}\} and any q\in\mathrm{EQ}(\theta), then no belief with imperfect information \theta\in\Delta(S)\setminus\{\theta^{*}\} satisfies (4a). That is, only the complete information belief vector \theta^{*} can be a fixed point belief. Therefore, all fixed points must be complete information fixed points. On the other hand, assume for the sake of contradiction that there exists a belief \theta^{\dagger}\in\Delta(S)\setminus\{\theta^{*}\} such that [\theta^{\dagger}]\subseteq S^{*}(q^{\dagger}) for some equilibrium strategy q^{\dagger}\in\mathrm{EQ}(\theta^{\dagger}). Then \left(\theta^{\dagger},q^{\dagger}\right), which is not a complete information fixed point, is in the set \Omega. Thus, we arrive at a contradiction.

We can conclude that (a) is equivalent to (b) from the equivalence between (a) and (c), and (b) and (c). \square

Proof of Proposition 2. First, since \theta^{*}(s^{*})=1, \theta^{*} satisfies the local consistency condition in (A3a) for all q\in Q. Second, since Q is an invariant set for any strategy update and any \theta\in\Delta(S), we can choose an arbitrary \epsilon-neighborhood of \theta^{*}, and local invariance is satisfied. Therefore, the neighborhood (\Delta(S),Q) satisfies Assumption 4. Thus, any complete information fixed point must be locally stable. \square

Proof of Proposition 3. From condition (i) that [\bar{\theta}]\subseteq S^{*}(q) for any \|q-\bar{q}\|<\xi, we have:

\mathbb{E}_{\bar{\theta}}[u_{i}^{s}(q)]=u_{i}^{s^{*}}(q),\quad\forall i\in I. \quad (34)

For any \bar{q}\in\mathrm{EQ}(\bar{\theta}), since \bar{q}_{i} is a best response to \bar{q}_{-i} with respect to \bar{\theta}, \bar{q}_{i} must be a local maximizer of \mathbb{E}_{\bar{\theta}}[u_{i}^{s}(q_{i},\bar{q}_{-i})]. From (34), \bar{q}_{i} is a local maximizer of u_{i}^{s^{*}}(q_{i},\bar{q}_{-i}). From condition (ii), since the function u_{i}^{s^{*}}(q_{i},\bar{q}_{-i}) is concave in q_{i}, \bar{q}_{i} is also a global maximizer of u_{i}^{s^{*}}(q_{i},\bar{q}_{-i}), and hence is a best response to \bar{q}_{-i} under complete information of s^{*}. Since this argument holds for all i\in I, \bar{q} is a complete information equilibrium. \square

Proof of Proposition 4. Since \lim_{t\to\infty}k_{t+1}-k_{t}=\infty with probability 1, as t\rightarrow\infty, the strategies between any two belief updates k_{t} and k_{t+1} are updated with the static belief \theta^{k_{t}}. Under Assumption 2, we know that the sequence of strategies \left(q^{j}\right)_{j=k_{t}}^{k_{t+1}} converges to an equilibrium strategy in \mathrm{EQ}(\theta^{k_{t}}). From Lemma 2, we know that \lim_{t\to\infty}\theta^{k_{t}}=\bar{\theta} with probability 1. Since the equilibrium set \mathrm{EQ}(\theta) is upper hemicontinuous in \theta, the sequence of strategies must converge to an equilibrium \bar{q}\in\mathrm{EQ}(\bar{\theta}). We know from Lemma 4 that the belief must form a consistent estimate of the payoff distribution given \bar{q}. The proofs of local and global stability are analogous to those in Theorem 2 and Proposition 1, respectively, and thus are omitted. \square