
Informational Design of Dynamic Multi-Agent System

Tao Zhang and Quanyan Zhu
First Draft: April, 2021
This Draft: June 2021
Electrical and Computer Engineering
New York University
Email: {tz636, qz494}@nyu.edu
Abstract

This work considers a novel information design problem and studies how the design of payoff-relevant environmental signals alone can influence the behaviors of intelligent agents. The agents' strategic interactions are captured by a Markov game, in which each agent first selects one external signal from multiple signal sources as additional payoff-relevant information and then takes an action. There is a rational information designer (principal) who possesses one signal source and aims to influence the equilibrium behaviors of the agents by designing the information structure of the signals she sends to the agents. We propose a direct information design approach that incentivizes each agent to select the signal sent by the principal, so that the design process avoids having to predict the agents' strategic selection behaviors. We then introduce the design protocol given a goal of the designer, which we refer to as obedient implementability (OIL), and characterize OIL in a class of obedient sequential Markov perfect equilibria (O-SMPE). A design regime is proposed based on an approach we refer to as fixed-point alignment, which incentivizes the agents to choose the signal sent by the principal, guarantees that the agents' policy profile of taking actions is the policy component of an O-SMPE, and ensures that the principal's goal is achieved. We then formulate the principal's optimal goal selection problem in terms of information design and characterize the optimization problem by minimizing the fixed-point misalignments. The proposed approach can be applied to elicit desired behaviors of multi-agent systems in competitive as well as cooperative settings and can be extended to heterogeneous stochastic games in complete- and incomplete-information environments.

I Introduction

Building rational multi-agent systems is an important research desideratum in Artificial Intelligence. In goal-directed decision-making systems, an agent's action is controlled by its consequence [1]. In a game, the consequence of an agent's action is the outcome of the game, given as the reward of taking that action together with the actions of his opponents, which situates the optimality criterion of each agent's decision making in the game. A rational agent's reward may also depend on payoff-relevant information in addition to the actions. This information may include the situation of the agents in the game, referred to as the state of the world, as well as his knowledge about his opponents' diverging interests and their preferences over the outcomes of the game. Incorporating such payoff-relevant information in his decisions constitutes an essential part of an agent's rationality in the strategic interactions with his opponents. Hence, one may redirect the goal achievement of rational agents in a game by information provision. In economics, this is referred to as information design, which studies how an information designer (she) can influence agents' optimal behaviors in a game to achieve her own objective through the design of the information provided to the game [2].

Often referred to as inverse game theory, mechanism design is a well-developed mathematical theory in economics that provides general principles of how to design the rules of games (e.g., rewarding systems with specifications of actions and outcomes) to influence the agents' strategic interactions and achieve system-wide goals while treating the information as given. Information design, on the other hand, considers circumstances in which the information in the environment is under the control of the system designer and offers a new approach that indirectly elicits agents' behaviors while keeping the game rules fixed [3].

This work considers an infinite-horizon Markov game with a finite number of agents. Each agent's identity is characterized by his type. At each period of time, the agents observe a payoff-relevant global state (state). In addition to the state, each agent observes a batch of signals (signal batch, batch) at each period and then strategically chooses one signal from the batch as additional information to support his decision of taking an action. Each agent's one-period reward (parameterized by his type) is determined by his own action, the actions of his opponents, the global state, and his choice of signal. We refer to this game as a base Markov game (BMG). The transition of the state and the distribution of the signals are referred to as the information structure of the BMG. In a BMG, each agent's behavior includes selecting a signal according to a selection rule and taking an action according to a policy. Here, each agent's selection of signal and his choice of action are coupled, since the selected signal enters the policy that determines the choice of action. If a mechanism designer aims to incentivize the agents to behave in her desired way, she directly modifies the BMG (reversing the game) by designing the game model, including changing the reward function associated with actions and outcomes, while treating the information structure as a given part of the environment. An information designer, however, treats the BMG model as fixed and modifies the information structure to elicit agents' equilibrium behaviors that coincide with her objective.

We study a novel dynamic information design problem in the BMG in which there are multiple sources of signals (signal sources, sources) and each of them sends one signal to each agent. The signals sent by all sources constitute the signal batch observed by each agent at each time. Among these sources, there is one rational information designer (referred to as the principal, she) who controls one signal source and intends to strategically craft the information structure of her signal by choosing a signaling rule so as to indirectly control the equilibrium of the BMG. We consider that the other sources of signals provide additional information to the agents in a non-strategic, take-it-or-leave-it manner. The goal of the principal is to induce the agents to take actions according to an equilibrium policy that is desired by the principal. However, the principal has no ability to directly program the agents' behaviors to force them to take certain actions. Instead, her information design should provide incentives for rational agents to behave in her favor. We study the extent to which the provision of signals alone, by controlling a single signal source, can influence the agents' behavior in a BMG when the agents have the freedom to choose any available signal in the batch. We refer to the BMG with a rational principal in this setting as an augmented Markov game (AMG).

Since the principal's design problem keeps the base game unchanged, our model fits scenarios in which the agents are intrinsically motivated and their internal reward systems translate information from the external environment into internal reward signals [4]. Intrinsically motivated rational agents can be human decision makers with intrinsic psychological preferences or intelligent agents programmed with internal reward systems. The setting of multiple sources of additional information captures circumstances in which the environment is perturbed by noisy information, so that the agents may improperly use redundant and useless information to make decisions that deviate from the system designer's desire. Also, the principal can be an adversary who aims to manipulate the strategic interactions in a multi-agent system through the provision of disinformation, without intruding into each agent's local system to make any physical or digital modifications.

Although the principal's objective in an AMG is to elicit an equilibrium distribution of actions, her design problem has to take into consideration how the agents select signals from their signal batches, because each agent's choice of action is coupled with his selection of signal. In an information design problem, the principal chooses an information structure such that each agent selects a signal using a selection rule and then takes an action according to a policy that matches the principal's goal. The latter requirement is captured by the notion of admissibility. In general, the signals sent by the principal may not be selected by some agents, so that the actions taken by those agents are independent of the principal's (realized) signals. However, even though her signal does not enter an agent's policy to realize an action, the principal may still influence the agent's action, because the information structure of the signal batch is influenced by her choice of signaling rule. The information structure of the signal batch affects the agents' selections of signals. Hence, the agents' behaviors indirectly depend on the principal's signaling rule. We refer to such information design as indirect information design (IID), which requires the principal to accurately predict each agent's strategic selection rule and the policy profiles that might be induced by the signaling rule. We restrict attention to another class of information design, referred to as direct information design (DID). In DID problems, each agent always selects the signal sent by the principal and then takes an action. Thus, the realizations of the principal's signals directly enter the agents' policies to choose actions. In addition to admissibility, another restriction of the principal's DID problem is captured by the notion of obedience, which requires that each agent is incentivized to select the signal from the principal rather than choose one from other signal sources. The key simplification provided by the DID is that the principal's prediction of the agents' strategic selection rules is replaced by a straightforward obedient selection rule that always prefers the principal's signals.

This paper makes three major contributions to the foundations of information design. First, we define a dynamic direct information design problem in an environment where the agents have the freedom to choose any available signal as additional payoff-relevant information. Captured by the notion of obedient implementability, the principal's information design problem is constrained by the obedience condition, which incentivizes the agents to prefer the signals sent by the principal, and the admissibility condition, which ensures that the agents take actions that meet the principal's goal. Our information design problem is distinguished from others in economics that study the commitment of the information designer in a game when there is only a single source of additional information, in static settings (e.g., [5, 3, 6, 7]) as well as in dynamic environments (e.g., [8, 9, 10, 11, 12, 13]), and from settings in which the agents do not make a choice among multiple designers (e.g., [14]).

Second, we propose a new solution concept termed obedient sequential Markov perfect equilibrium (O-SMPE), which allows us to handle undesirable deviations of agents in a principled manner. By bridging the augmented Markov game model with dynamic programming and uncovering the close relationship between the sequential-perfect rationality of the O-SMPE and a pair of fixed points, we characterize obedient implementability and explicitly formulate the information design regime given a goal of the principal. The proposed framework is based on an approach referred to as fixed-point alignment, which selects a signaling rule that matches the first fixed point, arising from the agents' optimal signal selection, to the second fixed point, arising from their optimal action taking. However, the principal cannot achieve just any goal she wishes. We identify the key conditions, referred to as the Markov perfect goal and strong admissibility, that discipline the principal's freedom to influence the agents' equilibrium behaviors.

Third, we formulate the principal's goal selection problem and transform it into a direct information design problem without a predetermined goal. The principal's problem is thus to select a signaling rule that induces agents' equilibrium policy profiles that maximize (optimal information design) or minimize (robust information design) her expected payoff. Our formulation does not assume that the principal knows all possible equilibrium policy profiles that can be induced by each possible signaling rule, from which she could simply select one for her choice of signaling rule. Instead, our framework takes into consideration the role of the signaling rule in ensuring that the agents converge to an equilibrium. A new approach is proposed based on a condition known as fixed-point misalignment minimization, which captures a new optimality criterion for information design without a given goal.

I-A Related Work

We follow a growing line of research on creating incentives for interacting agents to behave in a desired way. The most straightforward way is based on mechanism design approaches that properly provide reward incentives (e.g., contingent payments, penalties, supply of resources) by directly modifying the game itself to change the induced preferences of the agents over actions. Mechanism design approaches have been fruitfully studied in both static [15] and dynamic environments [16, zhang2021incentive, 17]. For example, auctions [18, 19] specify the way in which the agents can place their bids and clarify how the agents pay for the items; in matching markets [20, 21], matching rules match agents on one side of a market to agents on the other side, which directly affects the payoff of each matched individual. In the reinforcement learning literature, reward engineering [22, 23, 24] is similar to mechanism design in that it directly crafts the reward functions of the agents to specify the learning goal.

Our work lies in another direction: information design. Information design studies how to influence the outcomes of decision making by choosing a signal (also referred to as a signal structure, information structure, Blackwell experiment, or data-generating process) whose realizations are observed by the agents [25]. In a seminal paper [7], Kamenica and Gentzkow introduced Bayesian persuasion, in which there is an informed sender and an uninformed receiver. The sender is able to commit to any probability distribution (i.e., information structure) of the signals as a function of the state of the world, which is payoff-relevant to and unobserved by the receiver. Bayesian persuasion can be interpreted as a communication device that is used by the sender to inform the receiver through signals that contain knowledge about the state of the world. Hence, the sender controls what the agent gets to know about the payoff-relevant state. With knowledge of the information structure, the receiver forms a posterior belief about the unobserved state based on the received signal. Hence, the information design of Bayesian persuasion is also referred to as an exercise in belief manipulation. Other works along the lines of Bayesian persuasion include [26, 27, 28, 29]. In [5], Mathevet et al. extend the single-agent Bayesian persuasion of [7] to a multi-agent game and formulate information design as influencing agents' behaviors through inducing distributions over agents' beliefs. In [6], Bergemann and Morris have also considered information design in games. They formulate the Myersonian approach for information design in an incomplete-information environment. At the core of the Myersonian information design is the notion of Bayes correlated equilibrium, which characterizes all possible Bayesian Nash equilibrium outcomes that could be induced by all available information structures. The Myersonian approach avoids the modeling of belief hierarchies [30] and formulates the information design problem as a linear program. Information design has been applied in a variety of areas to study and improve real-world decision making protocols, including stress tests in finance [31, 32], law enforcement and security [33, 34], censorship [35], routing systems [36], and finance and insurance [37, 38, 39]. Kamenica [25] provides a recent survey of the literature on Bayesian persuasion and information design.

This work fundamentally differs from existing works on information design. First, we consider a different environment. Specifically, we consider a setting in which there are multiple sources of signals and each agent chooses one realized signal as additional (payoff-relevant) information at each time. Among these sources of signals, there is an information designer who controls one of the sources and aims to induce equilibrium outcomes of the incomplete-information Markov game by strategically crafting information structures. Second, rather than only taking actions, each agent in our model makes a coupled decision of selecting a realized signal and taking an action. Hence, the characterization of the solution concepts in our work differs from the equilibrium analysis in other works. Third, we also provide an approach with an explicit formulation for relaxing the optimal information design problem.

In this section, we first describe some fundamental concepts of the canonical Markov game model and then define our new game model, called the augmented Markov game, by extending the canonical model.

Conventions. For any measurable set $X$, $\Delta(X)$ is the set of probability measures over $X$. Any function defined on a measurable set is assumed to be measurable. For any distribution $P\in\Delta(X_{1}\times X_{2})$ over two measurable sets $X_{1}$ and $X_{2}$, $\operatorname{\mathtt{marg}}_{X_{1}}P$ is the marginal distribution of $P$ over $X_{1}$ and $\operatorname{\mathtt{supp}}P$ is the support of $P$. The history of $x\in X$ from period $s$ to period $t$ is denoted by $x^{(t)}_{s}$, with $x^{(t)}$ when $s=0$. We use $\Pr(E)$ to denote the probability of an event $E$. For compactness of notation, we only show the elements, but not the sets, over which sums are taken under the summation operator. The notations are summarized in Appendix LABEL:app:list_of_notations.

I-B Normal-Form Game and Canonical Markov Game

This work considers games with a finite set of agents, denoted by $\mathcal{N}\equiv[n]$, $0<n<\infty$. A normal-form (or strategic-form) game is a basic representation of a static game:

Definition 0.1 (Normal-Form Game [40])

A normal-form game is defined by a tuple $G\equiv<\mathcal{N},\mathcal{A},\{r_{i}\}_{i\in\mathcal{N}}>$. Here, $\mathcal{A}$ is a finite set of actions available to each agent, and $r_{i}:\mathcal{A}^{n}\mapsto\mathbb{R}$ is the reward function of agent $i$.

Each agent $i\in\mathcal{N}$ simultaneously chooses an action $a_{i}\in\mathcal{A}$ and receives a reward $r_{i}(a_{i},\bm{a}_{-i})$ when the other agents choose actions $\bm{a}_{-i}\in\mathcal{A}^{n-1}$.

Figure 1: Canonical Markov game.

With reference to Fig. 1, a canonical Markov game generalizes a normal-form game to dynamic settings, and generalizes Markov decision processes to multi-agent interactions. We consider a finite-agent, infinite-horizon Markov game. The game is played in discrete time indexed by $t=0,1,\dots$. Each agent $i$'s identity is captured by a parameter known as his type, denoted by $\theta_{i}$. A canonical Markov game is defined by a tuple $\widehat{M}[\bm{\theta}]\equiv<\mathcal{N},\mathcal{G},\mathcal{A},d_{g},\mathcal{T}_{g},\{\hat{R}_{i}(\cdot|\theta_{i})\}_{i\in\mathcal{N}}>$. Here, $\mathcal{G}$ is a finite set of states. $\mathcal{A}$ is a finite set of actions available to each agent at each period. $d_{g}\in\Delta(\mathcal{G})$ is an initial distribution of the state. $\mathcal{T}_{g}:\mathcal{G}\times\mathcal{A}^{n}\mapsto\Delta(\mathcal{G})$ is the transition function of the state, such that $\mathcal{T}_{g}(\cdot|g_{t},\bm{a}_{t})\in\Delta(\mathcal{G})$ specifies the probability distribution of the next-period state when the current state is $g_{t}$ and the current-period joint action is $\bm{a}_{t}\equiv(a_{i,t})_{i\in\mathcal{N}}$. Each agent $i$'s goal is characterized by his reward function $\hat{R}_{i}(\cdot|\theta_{i}):\mathcal{G}\times\mathcal{A}^{n}\mapsto\mathbb{R}$, parameterized by $\theta_{i}\in\Theta$, which realizes a one-stage reward $\hat{R}_{i}(g_{t},\bm{a}_{t}|\theta_{i})$ for agent $i$ when the state is $g_{t}$ and the joint action is $\bm{a}_{t}$. A canonical Markov game is a complete-information game.

A solution to $\widehat{M}$ is a sequence of policy profiles $\{\bm{\hat{\pi}}_{t}\}_{t\geq 0}$ in which $\bm{\hat{\pi}}_{t}:\mathcal{G}\mapsto\Delta(\mathcal{A}^{n})$, such that $\bm{\hat{\pi}}_{t}(\cdot|g_{t})$ specifies the probability of the joint action of the agents given the current-period state $g_{t}$. A policy profile is stationary if the agents' choices of actions depend only on the current-period payoff-relevant information and are independent of the calendar time; in that case we denote the policy profile by $\bm{\hat{\pi}}:\mathcal{G}\mapsto\Delta(\mathcal{A}^{n})$. The policy profile is a pure strategy if $\bm{\hat{\pi}}:\mathcal{G}\mapsto\mathcal{A}^{n}$ is a deterministic mapping. In a canonical Markov game, $\bm{\hat{\pi}}$ can be either independent (i.e., $\bm{\hat{\pi}}(\bm{a}_{t}|g_{t})=\prod_{i\in\mathcal{N}}\hat{\pi}_{i}(a_{i,t}|g_{t})$) or correlated (i.e., a joint function).

According to the Ionescu-Tulcea theorem (see, e.g., [41]), the initial distribution $d_{g}$ of $g_{0}$, the transition function $\mathcal{T}_{g}$, and the policy profile $\bm{\hat{\pi}}$ together define a unique probability measure $P^{\bm{\hat{\pi}}}$ on $(\mathcal{G}\times\mathcal{A}^{n})^{\infty}$. We denote the expectation with respect to $P^{\bm{\hat{\pi}}}$ by $\mathbb{E}_{\bm{\hat{\pi}}}[\cdot]$. The optimality criterion for each agent's decision making at each period is to maximize his expected payoff. Each agent discounts his future payoffs by a discount factor $0<\gamma<1$. Thus, agent $i$'s period-$t$ infinite-horizon interim expected payoff is defined as: for $g_{t}\in\mathcal{G}$, $\bm{a}_{t}\in\mathcal{A}^{n}$, $\theta_{i}\in\Theta$, $i\in\mathcal{N}$, $t\geq 0$,

\mathtt{Expr}_{i}(g_{t},\bm{a}_{t};\theta_{i}|\bm{\hat{\pi}},\hat{R}_{i})\equiv\mathbb{E}_{\bm{\hat{\pi}}}\Big[\sum_{s=t}^{\infty}\gamma^{s-t}\hat{R}_{i}(\bm{a}_{s},g_{s}|\theta_{i})\Big|g_{t},\bm{a}_{t}\Big]. \qquad(1)
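As a concrete illustration of $\widehat{M}[\bm{\theta}]$ and the payoff criterion (1), the following Python sketch simulates a small canonical Markov game under a stationary independent policy profile and estimates $\mathtt{Expr}_{i}$ by truncated Monte-Carlo rollouts; the toy sizes and all identifiers (e.g., T_g, R_hat, pi_hat) are assumptions of the example rather than objects defined in the paper.

    import numpy as np

    rng = np.random.default_rng(0)

    # A toy canonical Markov game \widehat{M}: 2 agents, 2 states, 2 actions.
    n_agents, n_states, n_actions, gamma = 2, 2, 2, 0.9

    # Transition function T_g[g, a1, a2] is a distribution over next states.
    T_g = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions, n_actions))

    # Reward R_hat[i, g, a1, a2]: one-stage reward of agent i (types are held fixed).
    R_hat = rng.normal(size=(n_agents, n_states, n_actions, n_actions))

    # Independent stationary policy profile: pi_hat[i, g] is a distribution over actions.
    pi_hat = rng.dirichlet(np.ones(n_actions), size=(n_agents, n_states))

    def expr_i(i, g0, a0, horizon=200, n_rollouts=2000):
        """Monte-Carlo estimate of Expr_i(g_t, a_t) in (1), truncated at `horizon`."""
        total = 0.0
        for _ in range(n_rollouts):
            g, a = g0, a0
            ret = 0.0
            for s in range(horizon):
                ret += gamma ** s * R_hat[i, g, a[0], a[1]]
                g = rng.choice(n_states, p=T_g[g, a[0], a[1]])
                a = tuple(rng.choice(n_actions, p=pi_hat[j, g]) for j in range(n_agents))
            total += ret
        return total / n_rollouts

    print(expr_i(i=0, g0=0, a0=(0, 1)))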
Definition 0.2 (MPE)

A policy profile $\bm{\hat{\pi}}^{*}$ constitutes a stationary Markov perfect equilibrium (MPE) if $\bm{\hat{\pi}}^{*}$ is independent and, for all $a_{i,t}\in\mathcal{A}$ with $\hat{\pi}^{*}_{i}(a_{i,t}|g,\theta_{i})>0$, $a^{\prime}_{i,t}\in\mathcal{A}$, $g\in\mathcal{G}$, $t\geq 0$, $i\in\mathcal{N}$,

\mathbb{E}_{\bm{\hat{\pi}}^{*}_{-i}}\Big[\mathtt{Expr}_{i}(g_{t},a_{i,t},\bm{a}_{-i,t};\theta_{i}|\bm{\hat{\pi}}^{*},\hat{R}_{i})\Big]\geq\mathbb{E}_{\bm{\hat{\pi}}^{*}_{-i}}\Big[\mathtt{Expr}_{i}(g_{t},a^{\prime}_{i,t},\bm{a}_{-i,t};\theta_{i}|\bm{\hat{\pi}}^{*},\hat{R}_{i})\Big]. \qquad(2)

Let $\bm{\hat{h}}\equiv(\bm{a}^{(\infty)},g^{(\infty)})\in\bm{\hat{H}}\equiv\mathcal{A}^{n\times\infty}\times\mathcal{G}^{\infty}$ denote any infinite sequence of action-state pairs. Define $\mathtt{ExR}_{i}(\bm{\hat{h}})\equiv\sum_{t=0}^{\infty}\gamma^{t}\hat{R}_{i}(\bm{a}_{t},g_{t}|\theta_{i})$, for any $\bm{\hat{h}}\in\bm{\hat{H}}$, $\theta_{i}$, $i\in\mathcal{N}$. Let $\bm{\hat{h}}(t)$ denote the first $t$ components of $\bm{\hat{h}}$, for all $t\geq 0$.

Lemma 1 ([maskin2001markov])

For any $0<\gamma<1$, the game $\widehat{M}$ is continuous at infinity; i.e.,

\lim\limits_{T\rightarrow\infty}\ \sup\limits_{\substack{i,\bm{\hat{h}},\bm{\hat{h}}^{\prime}:\\ \bm{\hat{h}}(T-1)=\bm{\hat{h}}^{\prime}(T-1)}}\Big|\mathtt{ExR}_{i}(\bm{\hat{h}})-\mathtt{ExR}_{i}(\bm{\hat{h}}^{\prime})\Big|=0.
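To see why, assume the one-stage rewards are uniformly bounded, say $|\hat{R}_{i}(\cdot|\theta_{i})|\leq\bar{R}<\infty$ (a boundedness assumption added here only for this verification). Two histories that agree on their first $T-1$ components can differ only from period $T-1$ onward, so

\sup\limits_{\substack{i,\bm{\hat{h}},\bm{\hat{h}}^{\prime}:\\ \bm{\hat{h}}(T-1)=\bm{\hat{h}}^{\prime}(T-1)}}\Big|\mathtt{ExR}_{i}(\bm{\hat{h}})-\mathtt{ExR}_{i}(\bm{\hat{h}}^{\prime})\Big|\;\leq\;2\bar{R}\sum_{t=T-1}^{\infty}\gamma^{t}\;=\;\frac{2\bar{R}\,\gamma^{T-1}}{1-\gamma}\;\longrightarrow\;0\quad\text{as }T\rightarrow\infty.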

Lemma 1 shows that for a fixed discount factor $0<\gamma<1$, the canonical Markov game is continuous at infinity. Continuity at infinity is essential for the existence of an MPE in infinite-horizon stochastic games. The following proposition shows the existence of an MPE, due to [maskin2001markov].

Proposition 1.1 ([maskin2001markov])

Suppose that $0<\gamma<1$. Then, the game $\widehat{M}$ admits an MPE.

I-C Augmented Markov Game Model

With reference to Fig. 2, we extend the canonical Markov game model in this section by introducing additional payoff-relevant information for the agents beyond the state.

Signals. We refer to the additional information as signals. Let $\Omega$ be a finite set of signals. We consider that at each period $t$ each agent $i$ observes a batch of $m$ (finite) signals (signal batch, batch), for $m>1$, denoted by $W_{i,t}\equiv\{\omega^{j}_{i,t}\}_{j=1}^{m}\subseteq\Omega^{m}$, for all $i\in\mathcal{N}$, $t\geq 0$, in which $\omega^{j}_{i,t}\in\Omega$ denotes a typical signal indexed by $j\in[m]$ sent to agent $i$ at $t$. Upon observing $W_{i,t}$, each agent $i$ selects one signal $\omega_{i,t}$ from the batch $W_{i,t}$, and $\omega_{i,t}$ becomes payoff-relevant to him in addition to the state $g_{t}$.

Figure 2: Augmented Markov game. (a) shows the external activities of the augmented game from agent $i$'s point of view with a principal, and (b) describes agent $i$'s internal decision-making process. Specifically, each agent $i$ receives a state $g_{t}$ and a signal batch $W_{i,t}=\{\omega^{k}_{i,t},\bm{\omega}^{-k}_{i,t}\}$, in which the signal $\omega^{k}_{i,t}$ is chosen by the principal using the signaling rule $\alpha_{i}$ that depends on the state $g_{t}$. Each agent $i$ first uses $\beta_{i}$ to select one signal $\omega_{i,t}$ from $W_{i,t}$. Taking into account the state $g_{t}$ and his selection $\omega_{i,t}$, he then uses $\pi_{i}$ to take an action $a_{i,t}$. The state $g_{t}$ transitions to a next state $g_{t+1}$ given the agents' joint action $\bm{a}_{t}=\{a_{i,t},\bm{a}_{-i,t}\}$.

Dynamics of Information. Let $\omega^{k}_{i,t}$ denote one signal contained in the batch $W_{i,t}$. Given $\omega^{k}_{i,t}$, the signal batch can be represented as $W_{i,t}=\{\omega^{k}_{i,t},W^{-k}_{i,t}\}$, where $W^{-k}_{i,t}\equiv\{\omega^{j}_{i,t}\}_{j=1,j\neq k}^{m}$, for all $i\in\mathcal{N}$, $t\geq 0$. We assume that the probability of $\omega^{k}_{i,t}$ at any $t$ is specified by $\mathcal{P}^{k}_{i,t}\in\Delta(\Omega)$. Let $\mathcal{P}^{-k}_{i,t}\in\Delta(\Omega^{m-1})$ denote the probability of $W^{-k}_{i,t}$. Here, both $\mathcal{P}^{k}_{i,t}$ and $\mathcal{P}^{-k}_{i,t}$, for all $i\in\mathcal{N}$, may depend on the current state $g_{t}$, the joint type $\bm{\theta}$, the calendar time, or the past realizations of states or actions. Although there is additional information in the game, the transition of the state is controlled by the current state $g_{t}$ and the realized actions $\bm{a}_{t}$, and is independent of the selected signals $\omega_{i,t}$, for all $i\in\mathcal{N}$, $t\geq 0$. As in the canonical game $\widehat{M}$, the transition of the state is Markov: given any $g_{t}\in\mathcal{G}$, $\bm{a}_{t}\in\mathcal{A}^{n}$, the transition function $\mathcal{T}_{g}(g_{t+1}|g_{t},\bm{a}_{t})$ specifies the probability of a next state $g_{t+1}\in\mathcal{G}$, for all $t\geq 0$.

Augmented Markov Game. We refer to the Markov game with signals as the augmented Markov game, denoted by $M$: $M\equiv<\mathcal{N},\mathcal{G},\mathcal{A},\Omega,d_{g},\mathcal{T}_{g},\{\mathcal{P}^{k}_{i,t},\mathcal{P}^{-k}_{i}\}_{i\in\mathcal{N},t\geq 0},\{R_{i}(\cdot|\theta_{i})\}_{i\in\mathcal{N}}>$, where $R_{i}(\cdot|\theta_{i}):\mathcal{A}^{n}\times\mathcal{G}\times\Omega\mapsto\mathbb{R}$ is the ($\theta_{i}$-parameterized) reward function of agent $i$, which takes into consideration agent $i$'s selected signal. A key feature of $M$ is that each agent $i$ makes two sequentially-coupled decisions. Specifically, agent $i$ first selects a signal $\omega_{i,t}$ from the batch $W_{i,t}$, given the state $g_{t}$; then, he takes an action $a_{i,t}$, given $g_{t}$ and his selected signal $\omega_{i,t}$. Hence, at a given period $t$, if the state is $g_{t}$, agent $i$ observes the signal batch $W_{i,t}$, the joint action of the other agents is $\bm{a}_{-i,t}$, and agent $i$ first selects a signal $\omega_{i,t}$ from $W_{i,t}$ and then takes an action $a_{i,t}$, then his single-period reward is $R_{i}(\{a_{i,t},\bm{a}_{-i,t}\},g_{t},\omega_{i,t}|\theta_{i})$. We assume that the game model $M$ is a complete-information game.
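To make the new objects in $M$ concrete, the following Python sketch assembles a signal batch $W_{i,t}$ from the principal's source and the $m-1$ exogenous sources and evaluates the signal-dependent one-period reward; the toy sizes and the names P_k, P_minus_k, and R are illustrative assumptions of the example, not notation from the paper.

    import numpy as np

    rng = np.random.default_rng(1)

    n_agents, n_states, n_actions, n_signals, m = 2, 2, 2, 3, 3
    # Principal's source: P_k[i, g] is a distribution over Omega for agent i in state g.
    P_k = rng.dirichlet(np.ones(n_signals), size=(n_agents, n_states))
    # Exogenous sources: a fixed distribution over Omega (cf. Section II).
    P_minus_k = rng.dirichlet(np.ones(n_signals))
    # Signal-dependent reward R[i, g, omega_i, a1, a2].
    R = rng.normal(size=(n_agents, n_states, n_signals, n_actions, n_actions))

    def draw_batch(i, g):
        """Signal batch W_{i,t}: one signal from the principal plus m-1 exogenous signals."""
        omega_k = rng.choice(n_signals, p=P_k[i, g])
        W_minus_k = [rng.choice(n_signals, p=P_minus_k) for _ in range(m - 1)]
        return omega_k, W_minus_k

    def one_period_reward(i, g, selected_signal, joint_action):
        """R_i({a_i, a_{-i}}, g_t, omega_{i,t} | theta_i) for the signal the agent selected."""
        a1, a2 = joint_action
        return R[i, g, selected_signal, a1, a2]

    omega_k, W_minus_k = draw_batch(i=0, g=0)
    print(one_period_reward(0, 0, omega_k, (1, 0)))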

I-C1 Strategies

Each agent $i$ is rational in the sense that he is self-interested and makes his decisions according to his observation $(g_{t},W_{i,t})$ to maximize his expected payoff. We consider that the solution to the game $M$ is a stationary Markov strategy profile $<\bm{\beta},\bm{\pi}>$, where $\bm{\beta}:\mathcal{G}\times\Theta^{n}\mapsto\bm{W}_{t}$ is a selection rule profile and $\bm{\pi}:\mathcal{G}\times\Omega^{n}\times\Theta^{n}\mapsto\Delta(\mathcal{A}^{n})$ is a policy profile. Similar to $\bm{\hat{\pi}}$ in $\widehat{M}$, each of the profiles $\bm{\beta}$ and $\bm{\pi}$ can be either correlated (i.e., a joint function) or independent (i.e., $\omega_{i,t}=\beta_{i}(g_{t},\theta_{i})$, for all $i\in\mathcal{N}$, and $\bm{\pi}(\bm{a}_{t}|g_{t},\bm{\omega}_{t},\bm{\theta})=\prod_{i\in\mathcal{N}}\pi_{i}(a_{i,t}|g_{t},\omega_{i,t},\theta_{i})$, where $\omega_{i,t}=\beta_{i}(g_{t},\theta_{i})$).

Given any observation $(g_{t},W_{i,t})$ and his type $\theta_{i}$, each agent $i$'s selection of the signal and his choice of the action are fundamentally different. Specifically, the payoff-relevant information for the signal selection is $g_{t}$, i.e., $\bm{\beta}(g_{t},\bm{\theta})\in\bm{W}_{t}$, while the payoff-relevant information for the action taking is $(g_{t},\bm{\omega}_{t})$, i.e., $\bm{\pi}(\bm{a}_{t}|g_{t},\bm{\omega}_{t},\bm{\theta})\in\Delta(\mathcal{A}^{n})$, in which $\bm{\omega}_{t}=\bm{\beta}(g_{t},\bm{\theta})$. However, we will write $\omega_{i,t}=\beta_{i}(g_{t},\theta_{i}|W_{i,t})\in W_{i,t}$ to highlight the influence of the signal batch $W_{i,t}$ (and thus its distribution) on each agent $i$'s decision of selecting a signal. Agent $i$ first uses $\beta_{i}$ to select a signal $\omega_{i,t}=\beta_{i}(g_{t},\theta_{i}|W_{i,t})$ and then chooses an action $a_{i,t}$ according to $\pi_{i}(a_{i,t}|g_{t},\omega_{i,t},\theta_{i})$ based on the realized selection $\omega_{i,t}$.
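The coupled two-stage structure of $\beta_{i}$ and $\pi_{i}$ can be sketched in Python as follows: a signal is first selected from the realized batch, and the action is then drawn from the policy conditioned on the state and the selected signal. The tabular policy and the particular (non-obedient) selection rule below are assumptions made only for illustration.

    import numpy as np

    rng = np.random.default_rng(2)

    n_states, n_actions, n_signals = 2, 2, 3
    # pi_i[g, omega_i] is a distribution over actions (the type theta_i is fixed and omitted).
    pi_i = rng.dirichlet(np.ones(n_actions), size=(n_states, n_signals))

    def beta_i(g, batch):
        """An example (non-obedient) selection rule: pick the largest signal index in the batch."""
        return max(batch)

    def agent_step(g, batch):
        """Coupled decision: select omega_i = beta_i(g | W_i), then draw a_i ~ pi_i(. | g, omega_i)."""
        omega_i = beta_i(g, batch)
        a_i = rng.choice(n_actions, p=pi_i[g, omega_i])
        return omega_i, a_i

    print(agent_step(g=1, batch=[0, 2, 1]))

Any other selection rule, including the obedient one studied later, can be substituted for beta_i without changing this interface.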

II Information Design Problem

In this work, we are interested in the setting in which there is one rational information designer, referred to as the principal (she, indexed by $k$), who controls one of the $m$ signal sources. At time $t$, the signal sent to agent $i$ by the principal is denoted by $\omega^{k}_{i,t}$. We assume that $\bm{\mathcal{P}}^{-k}_{t}\equiv\bm{\mathcal{P}}^{-k}=\{\mathcal{P}^{-k}_{i}\}_{i\in\mathcal{N}}$ is fixed and purely exogenous; i.e., $\bm{\mathcal{P}}^{-k}$ is independent of the state, the types, the actions, the calendar time, and the histories of the game. This can be interpreted as the other $m-1$ sources of signals providing additional information to the agents in a non-strategic, take-it-or-leave-it manner.

We consider that the principal is rational in that she possesses a goal specified in terms of the agents' equilibrium actions and strategically designs the information structure of her signal to achieve that goal. Specifically, the principal aims to induce the agents to take actions that coincide with her goal in equilibrium by strategically choosing $\bm{\mathcal{P}}^{k}_{t}\equiv\{\mathcal{P}^{k}_{i,t}\}_{i\in\mathcal{N}}$ given $\Omega$. Since $\bm{\mathcal{P}}^{k}_{t}$ governs the generation of her signal $\bm{\omega}^{k}_{t}=\{\omega^{k}_{i,t}\}_{i\in\mathcal{N}}$, the principal's choice of $\bm{\mathcal{P}}^{k}_{t}$ partially influences $\bm{\mathcal{P}}_{t}\equiv\{\bm{\mathcal{P}}^{k}_{t},\bm{\mathcal{P}}^{-k}\}$ and the realizations of the signal batch $\bm{W}_{t}=\{\bm{\omega}^{k}_{t},\bm{W}^{-k}_{t}\}$. This process is information design:

Definition 1.1 (Information Design)

An information design problem is defined as a tuple $\mathcal{I}\equiv<M[\bm{\theta}],\bm{\pi},\{\mathcal{P}^{k}_{i,t}\}_{i\in\mathcal{N},t\geq 0},\Omega,\bm{\kappa}>$. Here, $M$ is an augmented Markov game model. $\bm{\pi}$ is the agents' policy profile. $<\mathcal{P}^{k}_{i},\Omega>$ is the information structure, where $\mathcal{P}^{k}_{i}\equiv\{\mathcal{P}^{k}_{i,t}\}_{t\geq 0}$ defines a distribution of the signal $\omega^{k}_{i,t}$ privately observed by agent $i$ at $t$. $\bm{\kappa}(\cdot,\bm{\theta}):\mathcal{G}\mapsto\Delta(\mathcal{A}^{n})$ is the principal's goal, i.e., her target equilibrium probability distribution of the agents' joint action conditioned only on the state and the agents' types.

A solution to $\mathcal{I}$ is a stationary signaling rule profile (signaling rule) $\bm{\alpha}:\mathcal{G}\times\bm{\Theta}\mapsto\Delta(\Omega^{n})$ that defines $\bm{\mathcal{P}}^{k}\equiv\{\mathcal{P}^{k}_{i}\}_{i\in\mathcal{N}}$ of the joint signal $\bm{\omega}^{k}_{t}$. The signaling rule $\bm{\alpha}$ is correlated if it is a joint function, i.e., $\bm{\alpha}(\bm{\omega}^{k}_{t}|g,\bm{\theta})\neq\prod_{i\in\mathcal{N}}\alpha_{i}(\omega^{k}_{i,t}|g,\bm{\theta})$, where $\alpha_{i}(\omega^{k}_{i,t}|g,\bm{\theta})\equiv\sum_{\bm{\omega}^{k}_{-i,t}}\bm{\alpha}(\bm{\omega}^{k}_{t}|g,\bm{\theta})$. The rule $\bm{\alpha}$ is independent if the signals the principal sends to the agents are independent of each other, i.e., $\bm{\alpha}(\bm{\omega}^{k}_{t}|g,\bm{\theta})=\prod_{i\in\mathcal{N}}\alpha_{i}(\omega^{k}_{i,t}|g,\bm{\theta})$. Since the agents use Markov strategies, agent $i$'s period-$t$ action $a_{i,t}$ depends on the histories $g^{(t)}$ and $\omega^{k;(t)}_{i}$ (via the selection rule $\beta_{i}$) only through the current-period $g_{t}$ and $\omega^{k}_{i,t}$. Hence, we restrict attention to a Markovian signaling rule $\bm{\alpha}$ that specifies the distribution of the period-$t$ signal depending on the current state $g_{t}$ and the joint type $\bm{\theta}$. We will denote the game $M$ in which the principal uses $\bm{\alpha}$ by $M[\bm{\alpha}|\bm{\theta}]$.
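As a small illustration, for a tabular joint signaling rule with two agents and a fixed joint type, the marginal $\alpha_{i}$ and the independence property can be computed directly; the array layout below is an assumption of the sketch.

    import numpy as np

    rng = np.random.default_rng(3)

    n_states, n_signals = 2, 3
    # Joint signaling rule: alpha[g] is a distribution over (omega_1^k, omega_2^k).
    alpha = rng.dirichlet(np.ones(n_signals * n_signals), size=n_states)
    alpha = alpha.reshape(n_states, n_signals, n_signals)

    def marginal(alpha, g, agent):
        """alpha_i(omega_i^k | g, theta) = sum over the other agent's signals of the joint rule."""
        return alpha[g].sum(axis=1 - agent)

    def is_independent(alpha, g, tol=1e-9):
        """The rule is independent at g if the joint factorizes into its marginals."""
        a1, a2 = marginal(alpha, g, 0), marginal(alpha, g, 1)
        return np.allclose(alpha[g], np.outer(a1, a2), atol=tol)

    print(marginal(alpha, g=0, agent=0), is_independent(alpha, g=0))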

The information design problem is a planning problem. Hence, the design of $\bm{\alpha}$ is independent of any realizations of states. Additionally, the principal does not know in advance all the possible equilibria that could be induced by any of her available signaling rules. Therefore, the principal's information design has to take into account how the signaling rule can induce agents' behaviors that constitute an equilibrium. If the information design is viewed as an extensive-form game between the principal and the agents, the timing is as follows:

  • (i)

    The principal chooses a signaling rule profile $\bm{\alpha}$ for the agents.

  • (ii)

    At the beginning of each period $t$, a state $g_{t}$ is realized and observed by the principal and all agents.

  • (iii)

    The principal sends a joint signal $\bm{\omega}^{k}_{t}$ according to $\bm{\alpha}$. Each agent $i$ receives $W_{i,t}=\{\omega^{k}_{i,t},W^{-k}_{i,t}\}$ and observes $\bm{W}_{-i,t}\equiv\{\omega^{k}_{j,t},W^{-k}_{j,t}\}_{j\neq i}$. Here, $\bm{W}^{-k}_{t}\equiv\{W^{-k}_{i,t},\bm{W}^{-k}_{-i,t}\}$ is generated according to $\big(\mathcal{P}^{-k}_{t}\big)^{n}$.

  • (iv)

    The agents use $\bm{\beta}$ to select signals $\bm{\omega}_{t}$ from $\bm{W}_{t}$.

  • (v)

    Then, the agents use $\bm{\pi}$ to choose their actions from $\mathcal{A}^{n}$ based on $g_{t}$ and $\bm{\omega}_{t}$.

  • (vi)

    Immediate rewards are realized and the state $g_{t}$ transitions to $g_{t+1}$ according to $\mathcal{T}_{g}$ (a simulation sketch of one period of this timing is given after the list).
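As an illustration of steps (ii)-(vi) for a signaling rule fixed in step (i), the following Python sketch simulates one period of the augmented Markov game with two agents; the independent per-agent signaling rule, the placeholder selection rule, and all array names (alpha, P_minus_k, pi, T_g, R) are assumptions of the example rather than constructs defined in the paper.

    import numpy as np

    rng = np.random.default_rng(4)

    n_agents, n_states, n_actions, n_signals, m = 2, 2, 2, 3, 3
    alpha = rng.dirichlet(np.ones(n_signals), size=(n_agents, n_states))   # step (i): independent signaling rule
    P_minus_k = rng.dirichlet(np.ones(n_signals))                          # fixed exogenous sources
    pi = rng.dirichlet(np.ones(n_actions), size=(n_agents, n_states, n_signals))
    T_g = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions, n_actions))
    R = rng.normal(size=(n_agents, n_states, n_signals, n_actions, n_actions))

    def beta(i, g, omega_k_i, W_minus_k_i):
        # Placeholder selection rule; the obedient rule of Section III-B would return omega_k_i.
        return omega_k_i

    def one_period(g):
        # (iii) the principal sends omega^k_t; the exogenous part of each batch is drawn from P^{-k}
        omega_k = [rng.choice(n_signals, p=alpha[i, g]) for i in range(n_agents)]
        W_minus_k = [[rng.choice(n_signals, p=P_minus_k) for _ in range(m - 1)]
                     for _ in range(n_agents)]
        # (iv) agents select signals and (v) take actions
        omega = [beta(i, g, omega_k[i], W_minus_k[i]) for i in range(n_agents)]
        a = [rng.choice(n_actions, p=pi[i, g, omega[i]]) for i in range(n_agents)]
        # (vi) rewards realize and the state transitions according to T_g
        rewards = [R[i, g, omega[i], a[0], a[1]] for i in range(n_agents)]
        g_next = rng.choice(n_states, p=T_g[g, a[0], a[1]])
        return rewards, g_next

    print(one_period(g=0))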

II-A Equilibrium Concepts

In this section, we define a stationary equilibrium concept for the game $M[\bm{\alpha}|\bm{\theta}]$. With a slight abuse of notation, we suppress the notations of $\mathcal{P}^{-k}_{i}$, $W^{-k}_{i,t}$, and $\bm{W}^{-k}_{t}$ and only show $\mathcal{P}^{k}_{i,t}$, $\omega^{k}_{i,t}$ (of $W_{i,t}$), and $\bm{\omega}^{k}_{t}$ (of $\bm{W}_{t}$), for all $i\in\mathcal{N}$, $t\geq 0$, unless otherwise stated. Since we focus on a stationary environment, we also suppress the time indices from the notations, unless otherwise stated.

Similar to the canonical game $\widehat{M}$, the Ionescu-Tulcea theorem (see, e.g., [41]) implies that the initial distribution $d_{g}$ of the state, the transition function $\mathcal{T}_{g}$, the distribution $\bm{\mathcal{P}}^{-k}$, the signaling rule $\bm{\alpha}$, and the strategy profile $<\bm{\beta},\bm{\pi}>$ together define a unique probability measure $P^{\bm{\alpha},\bm{\beta}}_{\bm{\pi}}$ on $(\mathcal{G}\times\Omega^{m\times n}\times\mathcal{A}^{n})^{\infty}$. The expectation with respect to $P^{\bm{\alpha},\bm{\beta}}_{\bm{\pi}}$ is denoted by $\mathbb{E}^{\bm{\beta},\bm{\alpha}}_{\bm{\pi}}[\cdot]$ or $\mathbb{E}^{\bm{\beta},\bm{\alpha}}_{\bm{\pi}}[\cdot|\cdot]$. With a slight abuse of notation, let $\mathcal{T}^{\bm{\beta},\bm{\pi}}_{g^{\prime},g}(\bm{\omega};\bm{\omega}^{k})$ denote the transition probability from state $g$ to state $g^{\prime}$, given that the signal batch is $\bm{\omega}^{k}$ and the agents select $\bm{\omega}=\bm{\beta}(g,\bm{\theta}|\bm{\omega}^{k})$: for any $\bm{\omega}^{k}\in\Omega^{n}$ with $\bm{\alpha}(\bm{\omega}^{k}|g,\bm{\theta})>0$,

\mathcal{T}^{\bm{\beta},\bm{\pi}}_{g^{\prime},g}(\bm{\omega};\bm{\omega}^{k})\equiv\sum_{\bm{a}}\bm{\pi}\big(\bm{a}|g,\bm{\omega},\bm{\theta}\big)\mathcal{T}_{g}(g^{\prime}|g,\bm{a}).

Let $\mathcal{T}^{\bm{\alpha},\bm{\beta},\bm{\pi}}_{g^{\prime},g}=\sum_{\bm{\omega}^{k}}\mathcal{T}^{\bm{\beta},\bm{\pi}}_{g^{\prime},g}(\bm{\omega};\bm{\omega}^{k})\bm{\alpha}(\bm{\omega}^{k}|g,\bm{\theta})$. Given $\bm{\alpha}$, $\bm{\beta}$, $\bm{\pi}$, define the state-signal value function $V^{\bm{\alpha},\bm{\beta},\bm{\pi}}_{i}$ of agent $i$, representing agent $i$'s expected reward originating at some $g,\bm{\omega}^{k}\in\mathcal{G}\times\Omega^{n}$ with $\bm{\alpha}(\bm{\omega}^{k}|g,\bm{\theta})>0$, when the agents select $\bm{\omega}=\bm{\beta}(g,\bm{\theta}|\bm{\omega}^{k})$:

V^{\bm{\alpha},\bm{\beta},\bm{\pi}}_{i}(g,\bm{\omega};\bm{\omega}^{k}|\bm{\theta})\equiv\sum_{t=0}^{\infty}\sum_{g^{\prime}}\gamma^{t}\big(\mathcal{T}^{\bm{\beta},\bm{\pi}}_{g^{\prime},g}(\bm{\omega};\bm{\omega}^{k})\big)^{t}\sum_{\bm{a}^{\prime},\bm{\omega}^{k^{\prime}}}\bm{\pi}(\bm{a}^{\prime}|g^{\prime},\bm{\omega}^{\prime},\bm{\theta})\,\bm{\alpha}(\bm{\omega}^{k^{\prime}}|g^{\prime},\bm{\theta})\,R_{i}(\bm{a}^{\prime},g^{\prime},\omega^{\prime}_{i}|\theta_{i}), \qquad(3)

where $\bm{\omega}^{\prime}=\bm{\beta}(g^{\prime},\bm{\theta}|\bm{\omega}^{k^{\prime}})$.

Define the state value function $J^{\bm{\alpha},\bm{\beta},\bm{\pi}}_{i}$ of agent $i$, which describes his expected reward originating at any state $g\in\mathcal{G}$:

J^{\bm{\alpha},\bm{\beta},\bm{\pi}}_{i}(g|\bm{\theta})\equiv\sum_{t=0}^{\infty}\sum_{g^{\prime}}\gamma^{t}\big(\mathcal{T}^{\bm{\alpha},\bm{\beta},\bm{\pi}}_{g^{\prime},g}\big)^{t}\sum_{\bm{a}^{\prime},\bm{\omega}^{k^{\prime}}}\bm{\pi}(\bm{a}^{\prime}|g^{\prime},\bm{\beta}(g^{\prime},\bm{\theta}|\bm{\omega}^{k^{\prime}}),\bm{\theta})\,\bm{\alpha}(\bm{\omega}^{k^{\prime}}|g^{\prime},\bm{\theta})\,R_{i}(\bm{a}^{\prime},g^{\prime},\beta_{i}(g^{\prime},\theta_{i}|\omega^{k^{\prime}}_{i})|\theta_{i}). \qquad(4)

Define the state-signal-action value function $Q^{\bm{\alpha},\bm{\beta},\bm{\pi}}_{i}$, which represents agent $i$'s expected reward if $(\bm{\omega},\bm{a})\in\Omega^{n}\times\mathcal{A}^{n}$ are played in $(g,\bm{\omega}^{k})\in\mathcal{G}\times\Omega^{n}$:

Q^{\bm{\alpha},\bm{\beta},\bm{\pi}}_{i}(\bm{a},g,\omega_{i};\omega^{k}_{i}|\bm{\theta})\equiv R_{i}(\bm{a},g,\omega_{i}|\theta_{i})+\gamma\sum_{g^{\prime}}\mathcal{T}_{g}(g^{\prime}|g,\bm{a})\Big(\sum_{s=0}^{\infty}\sum_{g^{\prime\prime}}\gamma^{s}\big(\mathcal{T}^{\bm{\alpha},\bm{\beta},\bm{\pi}}_{g^{\prime\prime},g^{\prime}}\big)^{s}\sum_{\bm{a}^{\prime\prime},\bm{\omega}^{k^{\prime\prime}}}\bm{\pi}(\bm{a}^{\prime\prime}|g^{\prime\prime},\bm{\omega}^{\prime\prime},\bm{\theta})\,\bm{\alpha}(\bm{\omega}^{k^{\prime\prime}}|g^{\prime\prime},\bm{\theta})\,R_{i}(\bm{a}^{\prime\prime},g^{\prime\prime},\omega^{\prime\prime}_{i}|\theta_{i})\Big), \qquad(5)

where $\bm{\omega}^{\prime\prime}=\bm{\beta}(g^{\prime\prime},\bm{\theta}|\bm{\omega}^{k^{\prime\prime}})$.
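For tabular $\bm{\pi}$, $\mathcal{T}_{g}$, and $\bm{\alpha}$, the kernels $\mathcal{T}^{\bm{\beta},\bm{\pi}}_{g^{\prime},g}(\bm{\omega};\bm{\omega}^{k})$ and $\mathcal{T}^{\bm{\alpha},\bm{\beta},\bm{\pi}}_{g^{\prime},g}$ that enter (3)-(5) can be computed as below; for brevity the sketch takes two agents and the obedient selection $\bm{\omega}=\bm{\omega}^{k}$, and the tensor layout is an assumption of the example.

    import numpy as np

    rng = np.random.default_rng(5)

    n_states, n_actions, n_signals = 2, 2, 3
    # Correlated policy profile pi[g, w1, w2] -> distribution over joint actions (a1, a2).
    pi = rng.dirichlet(np.ones(n_actions * n_actions),
                       size=(n_states, n_signals, n_signals)).reshape(
                       n_states, n_signals, n_signals, n_actions, n_actions)
    T_g = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions, n_actions))
    # Joint signaling rule alpha[g] -> distribution over (w1^k, w2^k).
    alpha = rng.dirichlet(np.ones(n_signals * n_signals),
                          size=n_states).reshape(n_states, n_signals, n_signals)

    def kernel_given_signals(g, w):
        """T^{beta,pi}_{g',g}(omega; omega^k): average T_g over the joint action drawn from pi."""
        w1, w2 = w
        return np.einsum('ab,abh->h', pi[g, w1, w2], T_g[g])  # distribution over g'

    def kernel_averaged(g):
        """T^{alpha,beta,pi}_{g',g}: additionally average over the principal's joint signal."""
        out = np.zeros(n_states)
        for w1 in range(n_signals):
            for w2 in range(n_signals):
                out += alpha[g, w1, w2] * kernel_given_signals(g, (w1, w2))
        return out

    print(kernel_averaged(g=0))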

We define an equilibrium concept known as sequential Markov perfect equilibrium (SMPE) as follows.

Definition 1.2 (SMPE)

Fix any signaling rule $\bm{\alpha}$. A strategy profile $<\bm{\beta}^{*},\bm{\pi}^{*}>$ constitutes a (stationary) sequential Markov perfect equilibrium (SMPE) if (i) the policy profile is independent (i.e., $\bm{\pi}^{*}(\bm{a}|g,\bm{\omega},\bm{\theta})=\prod_{i\in\mathcal{N}}\pi^{*}_{i}(a_{i}|g,\omega_{i},\theta_{i})$) and (ii) the agents are sequential-perfectly rational (sequential-perfect rationality), i.e., for any $g\in\mathcal{G}$, $\vec{\beta}^{\prime}_{i}\equiv\{\beta^{\prime}_{i,\tau}\}_{\tau\geq 0}$, $i\in\mathcal{N}$,

J^{\bm{\alpha},\bm{\beta}^{*},\bm{\pi}^{*}}_{i}(g|\bm{\theta})\geq J^{\bm{\alpha},\vec{\beta}^{\prime}_{i},\bm{\beta}^{*}_{-i},\bm{\pi}^{*}}_{i}(g|\bm{\theta}), \qquad(6)

and, for any $\omega^{k}_{i}\in\Omega$ with $\alpha_{i}(\omega^{k}_{i}|g,\bm{\theta})>0$, $\omega^{*}_{i}=\beta^{*}_{i}(g,\theta_{i}|\omega^{k}_{i})$, $\vec{\pi}^{\prime}_{i}\equiv\{\pi^{\prime}_{i,\tau}\}_{\tau\geq 0}$,

\mathbb{E}^{\bm{\beta}^{*}}_{\bm{\omega}^{k}_{-i}\sim\bm{\alpha}_{-i}}\Big[V^{\bm{\alpha},\bm{\beta}^{*},\bm{\pi}^{*}}_{i}(g,\omega^{*}_{i},\bm{\omega}^{*}_{-i};\omega^{k}_{i},\bm{\omega}^{k}_{-i}|\bm{\theta})\Big]\geq\mathbb{E}^{\bm{\beta}^{*}}_{\bm{\omega}^{k}_{-i}\sim\bm{\alpha}_{-i}}\Big[V^{\bm{\alpha},\bm{\beta}^{*},\vec{\pi}^{\prime}_{i},\bm{\pi}^{*}_{-i}}_{i}(g,\omega^{*}_{i},\bm{\omega}^{*}_{-i};\omega^{k}_{i},\bm{\omega}^{k}_{-i}|\bm{\theta})\Big]. \qquad(7)

An SMPE extends the stationary Markov perfect equilibrium (see, e.g., [42]) to our augmented Markov game $M[\bm{\alpha}|\bm{\theta}]$. The sequential-perfect rationality describes the coupled sequential best responses of each agent's selection and action given a state and the available signal batch. In words, a strategy profile $<\bm{\beta}^{*},\bm{\pi}^{*}>$ is sequential-perfectly rational if (i) given that the agents choose actions according to the equilibrium policy profile $\bm{\pi}^{*}$, there is no state $g\in\mathcal{G}$ such that once it is reached, the agents strictly prefer to deviate from $\bm{\beta}^{*}$; and (ii) there is no information set $(g,\{\bm{\omega}^{*},\bm{W}^{-k}\})\in\mathcal{G}\times\Omega^{m\times n}$, where $\bm{\omega}^{*}$ is selected by $\bm{\beta}^{*}$, such that once it is reached, the agents strictly prefer to deviate from $\bm{\pi}^{*}$.

The concept of SMPE in Definition 1.2, however, permits arbitrarily complex and possibly nonstationary deviations from the (stationary) equilibrium profile $<\bm{\beta}^{*},\bm{\pi}^{*}>$. The following lemma states that it entails no loss of generality to consider only one-shot deviations from $<\bm{\beta}^{*},\bm{\pi}^{*}>$.

Lemma 2

Let $\vec{\beta}^{\prime}_{i}\circ 0\equiv\{\beta^{\prime}_{i,\tau}\}_{\tau\geq 0}$ and $\vec{\pi}^{\prime}_{i}\circ 0\equiv\{\pi^{\prime}_{i,\tau}\}_{\tau\geq 0}$, respectively, be such that $\beta^{\prime}_{i,\tau}=\beta^{*}_{i}$ and $\pi^{\prime}_{i,\tau}=\pi^{*}_{i}$, for all $\tau\geq 1$, while $\beta^{\prime}_{i,0}$ and $\pi^{\prime}_{i,0}$ are any two arbitrary strategies. A strategy profile $<\bm{\beta}^{*},\bm{\pi}^{*}>$ constitutes a sequential-perfectly rational equilibrium profile of an SMPE if and only if, for any $g_{0}\in\mathcal{G}$, $i\in\mathcal{N}$,

J^{\bm{\beta}^{*},\bm{\pi}^{*},\bm{\alpha}}_{i}(g_{0}|\theta_{i})\geq J^{\vec{\beta}^{\prime}_{i}\circ 0,\bm{\beta}^{*}_{-i},\bm{\pi}^{*},\bm{\alpha}}_{i}(g_{0}|\theta_{i}), \qquad(8)

and, for any $\omega^{k}_{i}\in\Omega$ with $\alpha_{i}(\omega^{k}_{i}|g_{0},\bm{\theta})>0$, $\omega^{*}_{i}=\beta^{*}_{i}(g_{0},\theta_{i}|\omega^{k}_{i})$, $i\in\mathcal{N}$, $\pi^{\prime}_{i,0}$,

V^{\bm{\beta}^{*},\bm{\pi}^{*},\bm{\alpha}}_{i}(g_{0},\omega^{*}_{i,0};\omega^{k}_{i,0}|\theta_{i})\geq V^{\bm{\beta}^{*},\vec{\pi}^{\prime}_{i}\circ 0,\bm{\pi}^{*}_{-i},\bm{\alpha};\mu_{i}}_{i}(g_{0},\omega^{*}_{i,0};\omega^{k}_{i,0}|\theta_{i}). \qquad(9)

A one-shot deviation is a behavior in which agent $i$ deviates from the equilibrium profile $<\bm{\beta}^{*},\bm{\pi}^{*}>$ by selecting a signal using any $\beta^{\prime}_{i,0}$ and taking an action using any $\pi^{\prime}_{i,0}$ at the initial period of any subgame of $M[\bm{\alpha}|\bm{\theta}]$, and then reverting to his equilibrium strategies for the rest of the game. The one-shot deviation property in Lemma 2 allows the principal to restrict attention, without loss of generality, to an equilibrium characterization that considers only the robustness of the information design to one-shot deviations.

III SMPE Implementability

In this section, we formally formulate the principal's information design problem. The optimality criterion of successful information design is captured by the notion of implementability, which is characterized via the equilibrium concept of SMPE.

III-A Implementability

As in a canonical Markov game, each agent $i$'s decision of choosing an action $a_{i}$ takes into account the other agents' decisions of choosing $\bm{a}_{-i}$, because his immediate reward of taking $a_{i}$ directly depends on $\bm{a}_{-i}$. In an augmented Markov game $M[\bm{\alpha}|\bm{\theta}]$, agent $i$'s choices of $\beta_{i}$ and $\pi_{i}$ are coupled because the signal $\omega_{i}$ specified by $\beta_{i}$ has a direct causal effect on $a_{i}$ through $\pi_{i}$. Thus, each agent's immediate reward indirectly depends on the other agents' selected signals through their actions. Hence, the agents' selection of signals is also a part of the strategic interactions in $M[\bm{\alpha}|\bm{\theta}]$. Since $\bm{\mathcal{P}}^{-k}$ is fixed, the principal's choice of $\bm{\alpha}$ controls the dynamics of $\bm{W}_{t}$ given the strategy profiles. Therefore, it is possible for the principal to influence the equilibrium behaviors of the agents in $M[\bm{\alpha}|\bm{\theta}]$ through proper designs of $\bm{\alpha}$.

The principal's information design takes an objective-first approach: she designs the information structure (given $\Omega$) of the signals sent to the agents, toward a desired objective $\bm{\kappa}$, in a strategic setting through the design of $\bm{\alpha}$, where the self-interested agents act rationally by choosing $\bm{\beta}$ and $\bm{\pi}$. Although any realization of the signal depends on the current state, the choice of $\bm{\alpha}$ is independent of the realizations of the states. The key restriction on the principal's $\bm{\alpha}$ is that the agents are elicited to perform equilibrium actions that coincide with the principal's goal, which is referred to as admissibility.

Definition 2.1 (Admissibility)

Fix any $\bm{\kappa}$ and $\bm{\alpha}$. Let $<\bm{\beta}^{*},\bm{\pi}^{*}>$ be any SMPE of the game $M[\bm{\alpha}|\bm{\theta}]$. The policy profile $\bm{\pi}^{*}$ is admissible if, for all $g\in\mathcal{G}$,

\bm{\kappa}(\bm{a}|g,\bm{\theta})=\sum_{\bm{\omega}^{k}}\bm{\pi}^{*}\big(\bm{a}|g,\bm{\beta}^{*}(g,\bm{\theta}|\bm{\omega}^{k}),\bm{\theta}\big)\bm{\alpha}(\bm{\omega}^{k}|g,\bm{\theta}). \qquad(10)

Admissibility imposes a constraint on the signaling rule $\bm{\alpha}$ and the induced policy profile $\bm{\pi}^{*}$ such that the goal $\bm{\kappa}$ is achieved in the sense of (10). We define a strong version of admissibility as follows.

Definition 2.2 (Strong Admissibility)

Fix any $\bm{\kappa}$ and $\bm{\alpha}$. Let $<\bm{\beta}^{*},\bm{\pi}^{*}>$ be any SMPE of the game $M[\bm{\alpha}|\bm{\theta}]$. The policy profile $\bm{\pi}^{*}$ is strongly admissible if, for all $g\in\mathcal{G}$, $\omega^{k}_{i}\in\Omega$, $i\in\mathcal{N}$,

\bm{\kappa}(\bm{a}|g,\bm{\theta})\sum\limits_{\bm{\omega}^{k}_{-i}}\bm{\alpha}(\omega^{k}_{i},\bm{\omega}^{k}_{-i}|g,\bm{\theta})\,\bm{\beta}^{*}(g,\bm{\theta}|\omega^{k}_{i},\bm{\omega}^{k}_{-i})=\sum\limits_{\bm{\omega}^{k}_{-i}}\bm{\pi}^{*}\big(\bm{a}|g,\bm{\beta}^{*}(g,\bm{\theta}|\omega^{k}_{i},\bm{\omega}^{k}_{-i}),\bm{\theta}\big)\bm{\alpha}(\bm{\omega}^{k}|g,\bm{\theta}). \qquad(11)

The strong version additionally constrains $\bm{\alpha}$ and $\bm{\pi}^{*}$ by imposing an individual-level constraint on the signaling rule. It is straightforward to verify that strong admissibility implies admissibility, but not vice versa.
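Numerically, condition (10) is a linear consistency check between $\bm{\kappa}$, $\bm{\pi}^{*}$, and $\bm{\alpha}$. The sketch below performs this check for tabular objects under the obedient selection $\bm{\beta}^{*}(g,\bm{\theta}|\bm{\omega}^{k})=\bm{\omega}^{k}$ (used only to keep the example short); the goal kappa here is constructed to be admissible by definition, and all array layouts are assumptions of the sketch.

    import numpy as np

    rng = np.random.default_rng(6)

    n_states, n_actions, n_signals = 2, 2, 3
    alpha = rng.dirichlet(np.ones(n_signals * n_signals),
                          size=n_states).reshape(n_states, n_signals, n_signals)
    pi = rng.dirichlet(np.ones(n_actions * n_actions),
                       size=(n_states, n_signals, n_signals)).reshape(
                       n_states, n_signals, n_signals, n_actions, n_actions)

    def induced_action_distribution(g):
        """sum_{omega^k} pi(a | g, beta(g|omega^k)) alpha(omega^k | g), with beta obedient."""
        return np.einsum('wv,wvab->ab', alpha[g], pi[g])

    # A goal kappa that is admissible by construction (here: the induced distribution itself).
    kappa = np.stack([induced_action_distribution(g) for g in range(n_states)])

    def is_admissible(kappa, tol=1e-9):
        return all(np.allclose(kappa[g], induced_action_distribution(g), atol=tol)
                   for g in range(n_states))

    print(is_admissible(kappa))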

The success criterion of information design for the game $M[\bm{\alpha}|\bm{\theta}]$ in equilibrium is captured by the notion of implementability.

Definition 2.3 (SMPE Implementability)

Given any $\bm{\kappa}$, we say that the signaling rule $\bm{\alpha}$ is (strongly) implementable in SMPE (SMPE implementability) if $M[\bm{\alpha}|\bm{\theta}]$ has an SMPE $<\bm{\beta}^{*},\bm{\pi}^{*}>$ in which $\bm{\pi}^{*}$ is (strongly) admissible.

SMPE implementability requires that (i) the signaling rule $\bm{\alpha}$ designed by the principal induces an SMPE of $M[\bm{\alpha}|\bm{\theta}]$ and (ii) the principal's goal is achieved (i.e., the equilibrium policy profile is admissible or strongly admissible).

Given any $\bm{\mathcal{P}}^{-k}$, the distribution of the action $\bm{a}$ conditioned on any state $g$ is jointly determined by the agents' $\bm{\beta}$ and the principal's $\bm{\alpha}$. Hence, given $\bm{\mathcal{P}}^{-k}$, the signal $\bm{\omega}^{k}$ sent by the principal using $\bm{\alpha}$ ultimately influences each agent's expected reward. However, this information is transmitted indirectly through the agents' selection rules when the selected signal $\bm{\omega}=\bm{\beta}(g,\bm{\theta}|\{\bm{\omega}^{k},\bm{W}^{-k}\})$ is not equal to $\bm{\omega}^{k}$. We refer to the design of an $\bm{\alpha}$ that induces such selection rules in an SMPE of $M[\bm{\alpha}|\bm{\theta}]$ as indirect information design (IID). We call the game with such an $\bm{\alpha}$ an indirect augmented Markov game, denoted by $M^{-D}[\bm{\alpha}]$.

III-B Direct Information Design

As a designer, the principal takes into consideration how each agent strategically behaves according to the game rules and reacts to his opponents' behaviors as well as to the responses from the environment. In any $M^{-D}[\bm{\alpha}]$, the principal's design of $\bm{\alpha}$ must predict the agents' equilibrium selection rule profile $\bm{\beta}$ and the corresponding equilibrium policy profile $\bm{\pi}$ that might be induced by $\bm{\alpha}$. In contrast to $M^{-D}[\bm{\alpha}]$, the principal may elect to perform a direct information design, in which the principal makes her signals payoff-relevant to each agent's decision of choosing an action by incentivizing each agent to select her signal at each state. We refer to the game with such a signaling rule as a direct augmented Markov game, denoted by $M^{D}[\bm{\alpha}]$. The implementability of the direct information design requires a restriction known as obedience in addition to admissibility.

Definition 2.4 (Obedience)

In any $M^{D}[\bm{\alpha}]$, agent $i$'s selection rule $\beta_{i}$ is dominant-strategy obedient (DS-obedient, DS-obedience) if, for any $(g,\{\omega^{k}_{i},W^{-k}_{i}\})\in\mathcal{G}\times\Omega^{m}$ with $\alpha_{i}(\omega^{k}_{i}|g,\bm{\theta})>0$, $i\in\mathcal{N}$,

\beta_{i}(g_{t},\theta_{i}|\{\omega^{k}_{i,t},W^{-k}_{i,t}\})=\omega^{k}_{i,t}. \qquad(12)

Agent $i$'s selection rule $\beta_{i}$ is Bayesian obedient (Bayesian obedience) if, for any $(g,\{\omega^{k}_{i},W^{-k}_{i}\})\in\mathcal{G}\times\Omega^{m}$ with $\alpha_{i}(\omega^{k}_{i}|g,\bm{\theta})>0$, $i\in\mathcal{N}$,

\beta_{i}(g_{t},\theta_{i}|\{\omega^{k}_{i,t},W^{-k}_{i,t}\})=\omega^{k}_{i,t}, \qquad(13)

when all other agents are obedient, i.e., $\bm{\beta}_{-i}(g,\bm{\theta}_{-i}|\bm{\omega}^{k}_{-i})=\bm{\omega}^{k}_{-i}$.

We will refer to an SMPE with an obedient selection rule profile as a (dominant-strategy or Bayesian) obedient SMPE (O-SMPE).

Definition 2.5 (OIL)

Given any goal $\bm{\kappa}$, the signaling rule $\bm{\alpha}$ is (DS, Bayesian) obedient-implementable in SMPE (OIL) if it induces a (DS, Bayesian) O-SMPE $<\bm{\beta},\bm{\pi}>$ in which $\bm{\beta}$ is (DS, Bayesian) obedient and $\bm{\pi}$ is admissible.

In an $M^{D}[\bm{\alpha}|\bm{\theta}]$, the principal wants $\bm{\omega}^{k}_{t}$ to directly enter the immediate rewards of the agents at each period $t$. OIL guarantees that (i) the agents prefer to select the signal $\bm{\omega}^{k}$ sent by the principal rather than choose any other signals from $\bm{W}^{-k}$, and (ii) the agents take actions specified by the admissible policy rather than other available actions. Hereafter, we denote the obedient selection rules by $\bm{\beta}^{O}$ and $\beta^{O}_{i}$, for all $i\in\mathcal{N}$.

A successful information design depends on the principal having accurate beliefs regarding the agents' decision processes. This includes all the possible indirect selection behaviors of the agents, i.e., all possible $\bm{\beta}\neq\bm{\beta}^{O}$. The point of direct information design is that it allows the principal to avoid analyzing all of the agents' indirect selection behaviors and focus on the obedient $\bm{\beta}^{O}$.
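In code, the obedient selection rule is trivial, which is exactly the simplification direct information design buys: the principal no longer has to model how an agent would rank the signals inside his batch. A minimal sketch:

    def beta_obedient(g, theta_i, omega_k_i, W_minus_k_i):
        """beta_i^O: regardless of the state, the type, or the other signals in the batch,
        select the signal sent by the principal."""
        return omega_k_i

What remains for the principal is to verify that this rule is indeed a best response (obedience) and that the induced policy profile is admissible, which is the content of the OIL characterization below.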

III-C Characterizing OIL

In this section, we characterize OIL and formulate the principal's information design problem given a goal $\bm{\kappa}$.

The following proposition is an analog of Bellman’s Theorem [43].

Proposition 2.1

Given a game $M[\bm{\alpha}|\bm{\theta}]$, for any stationary $<\bm{\beta},\bm{\pi}>$, any $V_{i}:\mathcal{G}\times\Omega\times\Omega\mapsto\mathbb{R}$, any $J_{i}:\mathcal{G}\mapsto\mathbb{R}$, and any $Q_{i}:\mathcal{G}\times\Omega\times\mathcal{A}^{n}\times\Omega\mapsto\mathbb{R}$, we have $V_{i}=V^{\bm{\alpha},\bm{\beta},\bm{\pi}}_{i}$ in (3), $J_{i}=J^{\bm{\alpha},\bm{\beta},\bm{\pi}}_{i}$ in (4), and $Q_{i}=Q^{\bm{\alpha},\bm{\beta},\bm{\pi}}_{i}$ in (5) if and only if the following Bellman recursions are satisfied:

Vi(g,𝝎;𝝎k|𝜽)=𝒂𝝅(𝒂|g,𝝎,𝜽)Qi(𝒂,g,ωi;ωik|𝜽),\displaystyle V_{i}(g,\bm{\omega};\bm{\omega}^{k}|\bm{\theta})=\sum_{\bm{a}}\bm{\pi}(\bm{a}|g,\bm{\omega},\bm{\theta})Q_{i}(\bm{a},g,\omega_{i};\omega^{k}_{i}|\bm{\theta}), (14)
Ji(g|𝜽)=𝝎k𝜶(𝝎k|g,𝜽)Vi(g,𝜷(g,𝜽|𝝎k);𝝎k|𝜽),\displaystyle J_{i}(g|\bm{\theta})=\sum_{\bm{\omega}^{k}}\bm{\alpha}(\bm{\omega}^{k}|g,\bm{\theta})V_{i}(g,\bm{\beta}(g,\bm{\theta}|\bm{\omega}^{k});\bm{\omega}^{k}|\bm{\theta}), (15)
Qi(𝒂,g,ωi;ωik|𝜽)\displaystyle Q_{i}(\bm{a},g,\omega_{i};\omega^{k}_{i}|\bm{\theta}) (16)
=Ri(𝒂,g,ωi|θi)+γg𝒯g(g|g,𝒂)Ji(g|𝜽).\displaystyle=R_{i}(\bm{a},g,\omega_{i}|\theta_{i})+\gamma\sum_{g^{\prime}}\mathcal{T}_{g}(g^{\prime}|g,\bm{a})J_{i}(g^{\prime}|\bm{\theta}).

From Proposition 2.1, we can re-define V^{\bm{\alpha},\bm{\beta},\bm{\pi}}_{i}, J^{\bm{\alpha},\bm{\beta},\bm{\pi}}_{i}, and Q^{\bm{\alpha},\bm{\beta},\bm{\pi}}_{i} given in (3)-(5), respectively, as the unique triple of value functions that satisfies the Bellman recursions (14), (15), and (16).
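For concreteness, the following minimal numerical sketch (in Python) illustrates how the recursions (14)-(16) can be evaluated for a fixed signaling rule, obedient selection, and stationary policy. It collapses the multi-agent structure to a single representative agent with a fixed type; all array names and sizes (nG, nW, nA, and so on) are illustrative assumptions rather than objects of the model.

```python
import numpy as np

rng = np.random.default_rng(0)
nG, nW, nA = 4, 3, 2                             # states g, signals w, actions a
gamma = 0.9

R = rng.normal(size=(nA, nG, nW))                # reward R(a, g, w)
T = rng.dirichlet(np.ones(nG), size=(nG, nA))    # transition T(g' | g, a)
alpha = rng.dirichlet(np.ones(nW), size=nG)      # signaling rule alpha(w | g)
pi = rng.dirichlet(np.ones(nA), size=(nG, nW))   # policy pi(a | g, w)

J = np.zeros(nG)
for _ in range(1000):                            # iterate (14)-(16) to a fixed point
    Q = R + gamma * np.einsum('gak,k->ag', T, J)[..., None]   # (16): Q(a, g, w)
    V = np.einsum('gwa,agw->gw', pi, Q)                       # (14): V(g, w)
    J_new = np.einsum('gw,gw->g', alpha, V)                   # (15): J(g), obedient selection
    if np.max(np.abs(J_new - J)) < 1e-12:
        J = J_new
        break
    J = J_new
print("state values J(g):", np.round(J, 3))
```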

Lemma 3

Fix 𝛂\bm{\alpha}. Let 𝛃O={βiO}i𝒩\bm{\beta}^{O}=\{\beta^{O}_{i}\}_{i\in\mathcal{N}} denote the obedient selection rule profile. The strategy profile <𝛃O,𝛑><\bm{\beta}^{O},\bm{\pi}^{*}> is a (DS, Bayesian) O-SMPE if and only if, for any g𝒢g\in\mathcal{G}, ωikΩ\omega^{k}_{i}\in\Omega with αi(ωik|g,𝛉)>0\alpha_{i}(\omega^{k}_{i}|g,\bm{\theta})>0, i𝒩i\in\mathcal{N},

Vi𝜶,𝜷O,𝝅(g,𝝎k|𝜽)\displaystyle V^{\bm{\alpha},\bm{\beta}^{O},\bm{\pi}^{*}}_{i}(g,\bm{\omega}^{k}|\bm{\theta}) (17)
=maxai𝔼𝒂i𝝅i[Q𝜷O,𝜶;μi(ai,𝒂i,g;ωik|θi)],\displaystyle=\max_{a^{\prime}_{i}}\mathbb{E}_{\bm{a}_{-i}\sim\bm{\pi}^{*}_{-i}}\Big{[}Q^{\bm{\beta}^{O},\bm{\alpha};\mu_{i}}(a^{\prime}_{i},\bm{a}_{-i},g;\omega^{k}_{i}|\theta_{i})\Big{]},
  • (i)

    and 𝜷O\bm{\beta}^{O} is DS-obedient, i.e., if, for any arbitrary selection rule profiles 𝜷^i\bm{\hat{\beta}}_{-i}, any g𝒢g\in\mathcal{G}, i𝒩i\in\mathcal{N},

    Ji𝜶,βiO,𝜷^i,𝝅(g|𝜽)=maxωi𝝎ik𝜶i(𝝎ik|g,𝜽)\displaystyle J^{\bm{\alpha},\beta^{O}_{i},\bm{\hat{\beta}}_{-i},\bm{\pi}^{*}}_{i}(g|\bm{\theta})=\max_{\omega^{\prime}_{i}}\sum_{\bm{\omega}^{k}_{-i}}\bm{\alpha}_{-i}(\bm{\omega}^{k}_{-i}|g,\bm{\theta}) (18)
    ×Vi𝜶,𝜷O,𝝅(g,ωi,𝜷^i(gt,𝜽i|𝝎ik);𝝎k|𝜽);\displaystyle\times V^{\bm{\alpha},\bm{\beta}^{O},\bm{\pi}^{*}}_{i}(g,\omega^{\prime}_{i},\bm{\hat{\beta}}_{-i}(g_{t},\bm{\theta}_{-i}|\bm{\omega}^{k}_{-i});\bm{\omega}^{k}|\bm{\theta});
  • (ii)

    or, 𝜷O\bm{\beta}^{O} is Bayesian-obedient, i.e., if, for any g𝒢g\in\mathcal{G}, i𝒩i\in\mathcal{N},

    Ji𝜶,𝜷,𝝅(g|𝜽)\displaystyle J^{\bm{\alpha},\bm{\beta}^{*},\bm{\pi}^{*}}_{i}(g|\bm{\theta}) (19)
    =maxωi𝝎ik𝜶i(𝝎ik|g,𝜽)Vi𝜶,𝜷O,𝝅(g,ωi,𝝎ik;𝝎k|𝜽).\displaystyle=\max_{\omega^{\prime}_{i}}\sum_{\bm{\omega}^{k}_{-i}}\bm{\alpha}_{-i}(\bm{\omega}^{k}_{-i}|g,\bm{\theta})V^{\bm{\alpha},\bm{\beta}^{O},\bm{\pi}^{*}}_{i}(g,\omega^{\prime}_{i},\bm{\omega}^{k}_{-i};\bm{\omega}^{k}|\bm{\theta}).

In Lemma 3, (17)-(18) (resp. (17) and (19)) constitute a recursive representation of a DS O-SMPE (resp. Bayesian O-SMPE). Jointly, (17)-(18) require that (i) the expected payoff of each agent i is maximized by his equilibrium policy \pi^{*}_{i} in every state and for every signal selected according to any possible equilibrium selection rule \bar{\beta}_{i}, (ii) the expected payoff of each agent i is maximized by the obedient selection rule \beta^{O}_{i} in every state when actions are taken according to any possible corresponding equilibrium policy \bar{\pi}_{i}, and (iii) \bar{\beta}_{i}=\beta^{O}_{i} and \bar{\pi}_{i}=\pi^{*}_{i}, given that his opponents may use arbitrary selection rules but follow the O-SMPE policy profile. A similar interpretation applies to the Bayesian O-SMPE, given that each agent i's opponents follow the Bayesian O-SMPE strategy profile.
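As a complement to Lemma 3, the sketch below (reusing the arrays from the previous sketch, again for one representative agent) checks the Bayesian-obedience condition (19): the expected value of acting on the received signal must weakly dominate acting on any fixed alternative signal. For a randomly drawn signaling rule and policy the check will typically fail; it is meant only to illustrate how the condition is evaluated.

```python
# V2[g, u, k]: value of acting on signal u while k is the payoff-relevant signal drawn
# by the principal; built from the recursions (16) and (14) with J from the sketch above.
Q = R + gamma * np.einsum('gak,k->ag', T, J)[..., None]   # Q(a, g, k) via (16)
V2 = np.einsum('gua,agk->guk', pi, Q)                     # V(g, u; k) via (14)
deviation = np.einsum('gk,guk->gu', alpha, V2)            # E_k[ V(g, u; k) ] for each reported u
obedient = np.einsum('gk,gkk->g', alpha, V2)              # u = k, i.e., obedient selection
print("Bayesian obedient:", bool(np.all(obedient + 1e-9 >= deviation.max(axis=1))))
```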

III-C1 Design Regime: Fixed-Point Alignment

In this section, we restrict attention to Bayesian obedience and formulate an information design regime for a given goal. The following theorem states the existence of a Bayesian O-SMPE.

Theorem 4

Every augmented Markov game M[𝛂|𝛉]M[\bm{\alpha}|\bm{\theta}] admits a stationary Bayesian O-SMPE for any regular 𝛂\bm{\alpha}.

We provide an approach, which we refer to as fixed-point alignment, to formulate the design of a Bayesian OIL signaling rule \bm{\alpha} as a planning problem. First, we define a class of principal's goals which we refer to as Markov perfect goals. Let, for all i\in\mathcal{N},

Ri𝜶(𝒂,g|θi)ωikRi(𝒂,g,ωik|θi)αi(ωik|g,𝜽).\displaystyle R^{\bm{\alpha}}_{i}(\bm{a},g|\theta_{i})\equiv\sum\limits_{\omega_{i}^{k}}R_{i}(\bm{a},g,\omega^{k}_{i}|\theta_{i})\alpha_{i}(\omega^{k}_{i}|g,\bm{\theta}).
Definition 4.1 (Markov Perfect Goal)

We say that the principal's goal \bm{\kappa} is a Markov perfect goal if it is an MPE of the canonical Markov game with R^{\bm{\alpha}}_{i} as the reward function of each agent i; i.e., for all \bm{a}_{t}=\{a_{i,t},\bm{a}_{-i,t}\}\in\mathcal{A}^{n} with \bm{\kappa}(\bm{a}_{t}|g_{t})>0, a^{\prime}_{i,t}\in\mathcal{A}, g_{t}\in\mathcal{G}, t\geq 0, i\in\mathcal{N},

𝔼𝜿i[𝙴𝚡𝚙𝚛i\displaystyle\mathbb{E}_{\bm{\kappa}_{-i}}\Big{[}\mathtt{Expr}_{i} (gt,ai,t,𝒂i,t;θi|𝜿,Ri𝜶)]\displaystyle(g_{t},a_{i,t},\bm{a}_{-i,t};\theta_{i}|\bm{\kappa},R^{\bm{\alpha}}_{i})\Big{]}
𝔼𝜿i[𝙴𝚡𝚙𝚛i(gt,ai,t,𝒂i,t;θi|𝜿,Ri𝜶)],\displaystyle\geq\mathbb{E}_{\bm{\kappa}_{-i}}\Big{[}\mathtt{Expr}_{i}(g_{t},a^{\prime}_{i,t},\bm{a}_{-i,t};\theta_{i}|\bm{\kappa},R^{\bm{\alpha}}_{i})\Big{]},

where \mathtt{Expr}_{i} is defined in (1). Let \mathtt{MPG}[\bm{R}] denote the set of Markov perfect goals given the reward functions \bm{R}\equiv\{R_{i}\}_{i\in\mathcal{N}}.
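The marginalized reward R^{\bm{\alpha}}_{i} defined above, and the discounted value of a candidate goal in the induced canonical Markov game, can be computed as in the following sketch (single representative agent, arrays reused from the earlier sketches); the candidate goal kappa is a random placeholder, and no MPE verification is attempted here.

```python
# R_alpha[a, g] = sum_w R(a, g, w) * alpha(w | g), the marginalized reward above.
R_alpha = np.einsum('agw,gw->ag', R, alpha)
kappa = rng.dirichlet(np.ones(nA), size=nG)      # candidate goal kappa(a | g), placeholder
J_kappa = np.zeros(nG)
for _ in range(500):                             # policy evaluation of kappa in the canonical game
    J_kappa = np.einsum('ga,ag->g', kappa, R_alpha) + \
              gamma * np.einsum('ga,gak,k->g', kappa, T, J_kappa)
print("discounted value of goal kappa:", np.round(J_kappa, 3))
```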

For any state value function JiJ_{i}, any 𝒂𝒜n\bm{a}\in\mathcal{A}^{n}, g𝒢g\in\mathcal{G}, ωi,ωikΩ\omega_{i},\omega^{k}_{i}\in\Omega, we denote Qi(𝒂,g,ωi;ωik|𝜽;Ji)Q_{i}(\bm{a},g,\omega_{i};\omega^{k}_{i}|\bm{\theta};J_{i}) as the state-signal-action value function constructed in terms of JiJ_{i} according to (16). Similarly, we denote Ji(g|𝜽;Vi)J_{i}(g|\bm{\theta};V_{i}) as the state value function constructed in terms of ViV_{i} according to (15). Given any policy profile 𝝅i\bm{\pi}_{-i} of other agents and any state value function JiJ_{i}, we let, for all ai𝒜a_{i}\in\mathcal{A}, g𝒢g\in\mathcal{G}, 𝝎k={ωik,𝝎ik}Ωn\bm{\omega}^{k}=\{\omega^{k}_{i},\bm{\omega}^{k}_{-i}\}\in\Omega^{n}, 𝜽={θi,𝜽i}Θn\bm{\theta}=\{\theta_{i},\bm{\theta}_{-i}\}\in\Theta^{n}, i𝒩i\in\mathcal{N},

Qi𝝅i(ai,g,ωi,𝝎ik;𝝎k|𝜽;Ji)\displaystyle Q^{\bm{\pi}_{-i}}_{i}(a_{i},g,\omega_{i},\bm{\omega}^{k}_{-i};\bm{\omega}^{k}|\bm{\theta};J_{i}) (20)
𝔼𝒂i𝝅i(|g,𝝎ik,𝜽i)[Qi(ai,𝒂i,g,ωi;ωik|𝜽,Ji)].\displaystyle\equiv\mathbb{E}_{\bm{a}_{-i}\sim\bm{\pi}_{-i}(\cdot|g,\bm{\omega}^{k}_{-i},\bm{\theta}_{-i})}\Big{[}Q_{i}(a_{i},\bm{a}_{-i},g,\omega_{i};\omega^{k}_{i}|\bm{\theta},J_{i})\Big{]}.

Similarly, for any state-signal value function ViV_{i}, we denote Qi𝜶(;Vi)Q^{\bm{\alpha}}_{i}(\cdot;V_{i}) as the value function constructed in terms of ViV_{i} according to the Bellman recursions (14)-(16), i.e., for all 𝒂𝒜n\bm{a}\in\mathcal{A}^{n}, g𝒢g\in\mathcal{G}, ωikΩ\omega^{k}_{i}\in\Omega with αi(ωik|g,𝜽)>0\alpha_{i}(\omega^{k}_{i}|g,\bm{\theta})>0, i𝒩i\in\mathcal{N},

Qi𝜶(𝒂,g;ωik|𝜽;Vi)=Ri(𝒂,g,ωik|θi)\displaystyle Q^{\bm{\alpha}}_{i}(\bm{a},g;\omega^{k}_{i}|\bm{\theta};V_{i})=R_{i}(\bm{a},g,\omega^{k}_{i}|\theta_{i})
+γg,𝝎k𝒯g(g|g,𝒂)𝜶(𝝎k|g,𝜽)Vi(g;𝝎k|𝜽).\displaystyle+\gamma\sum_{g^{\prime},\bm{\omega}^{k^{\prime}}}\mathcal{T}_{g}(g^{\prime}|g,\bm{a})\bm{\alpha}(\bm{\omega}^{k^{\prime}}|g^{\prime},\bm{\theta})V_{i}(g^{\prime};\bm{\omega}^{k^{\prime}}|\bm{\theta}).

The Bayesian OIL restricts the design of the signaling rule \bm{\alpha} by three constraints. First, agents are incentivized to be Bayesian obedient in their signal selections. Second, the Bayesian-obedient agents' policy profile constitutes the policy component of a Bayesian O-SMPE. Third, the equilibrium policy profile is admissible. The first two constraints require \bm{\alpha} to elicit a Bayesian O-SMPE from the agents.

Proposition 4.1 provides a general formulation of an optimization problem to find the policy profile of a Bayesian O-SMPE of a game M[𝜶|𝜽]M[\bm{\alpha}|\bm{\theta}].

Proposition 4.1

Fix a signaling rule 𝛂\bm{\alpha}. Let 𝛃O\bm{\beta}^{O} denote the Bayesian obedient selection rule profile. Let 𝐕={Vi}i𝒩\bm{V}^{*}=\{V^{*}_{i}\}_{i\in\mathcal{N}} denote the corresponding state-signal value function of a policy profile 𝛑\bm{\pi}^{*}. The strategy profile <𝛃O,𝛑><\bm{\beta}^{O},\bm{\pi}^{*}> constitutes a Bayesian O-SMPE if and only if <𝛑,𝐕><\bm{\pi}^{*},\bm{V}^{*}> is a solution of the following optimization problem with 𝐙(𝛑,𝐕;𝛂)=0\bm{Z}(\bm{\pi}^{*},\bm{V}^{*};\bm{\alpha})=0:

min𝝅,𝑽𝒁(𝝅,𝑽;𝜶)\displaystyle\min\limits_{\bm{\pi},\bm{V}}\bm{Z}(\bm{\pi},\bm{V};\bm{\alpha}) (𝙾𝚙𝚝\mathtt{Opt})
i,g,𝝎kVi(g;𝝎k|θi)𝔼𝒂𝝅(|𝝎k)[Qi𝜶(𝒂,g;ωik|𝜽;Vi)],\displaystyle\equiv\sum_{i,g,\bm{\omega}^{k}}V_{i}(g;\bm{\omega}^{k}|\theta_{i})-\mathbb{E}_{\bm{a}\sim\bm{\pi}(\cdot|\bm{\omega}^{k})}\Big{[}Q^{\bm{\alpha}}_{i}(\bm{a},g;\omega^{k}_{i}|\bm{\theta};V_{i})\Big{]},

subject to, for all i𝒩i\in\mathcal{N}, g𝒢g\in\mathcal{G}, 𝛚kΩn\bm{\omega}^{k}\in\Omega^{n} with 𝛂(𝛚k|g,𝛉)>0\bm{\alpha}(\bm{\omega}^{k}|g,\bm{\theta})>0, ai𝒜a^{\prime}_{i}\in\mathcal{A}, ωiΩ\omega^{\prime}_{i}\in\Omega,

πi(ai|g,ωi,θi)0,ai𝒜πi(ai|g,ωi,θi)=1,\displaystyle\pi_{i}(a_{i}|g,\omega_{i},\theta_{i})\geq 0,\sum_{a_{i}\in\mathcal{A}}\pi_{i}(a_{i}|g,\omega_{i},\theta_{i})=1, (𝚁𝙶i\mathtt{RG}_{i})
Vi(g;𝝎k|𝜽)𝔼𝒂i𝝅i(|𝝎ik)[Qi𝜶(ai,𝒂i,g;ωik|𝜽;Vi)],\begin{split}&V_{i}(g;\bm{\omega}^{k}|\bm{\theta})\\ &\geq\mathbb{E}_{\bm{a}_{-i}\sim\bm{\pi}_{-i}(\cdot|\bm{\omega}^{k}_{-i})}\Big{[}Q^{\bm{\alpha}}_{i}(a^{\prime}_{i},\bm{a}_{-i},g;\omega^{k}_{i}|\bm{\theta};V_{i})\Big{]},\end{split} (𝙵𝙴i\mathtt{FE}_{i})
Ji𝜶,𝜷O,𝝅(g|𝜽;Vi)\displaystyle J^{\bm{\alpha},\bm{\beta}^{O},\bm{\pi}}_{i}(g|\bm{\theta};V_{i}) (𝙱𝙾𝙱0i\mathtt{BOB}0_{i})
𝝎ik𝜶i(𝝎ik|g,𝜽)Vi(g,ωi,𝝎ik;𝝎k|𝜽).\displaystyle\geq\sum_{\bm{\omega}^{k}_{-i}}\bm{\alpha}_{-i}(\bm{\omega}^{k}_{-i}|g,\bm{\theta})V_{i}(g,\omega^{\prime}_{i},\bm{\omega}^{k}_{-i};\bm{\omega}^{k}|\bm{\theta}).

Proposition 4.1 extends the fundamental formulation of finding a Nash equilibrium of a stochastic game as a nonlinear program (Theorem 3.8.2 of [44]; see also [45, 46]). Here, the condition (\mathtt{RG}_{i}) ensures that each \pi_{i} is a valid policy and rules out the trivial solution \pi_{i}=0 for all i\in\mathcal{N}. The constraints (\mathtt{FE}_{i}) and (\mathtt{BOB}0_{i}) are two necessary conditions for a Bayesian O-SMPE of the game M[\bm{\alpha}|\bm{\theta}], derived from (17) and (19) of Lemma 3. Any feasible solution <\bm{\pi},\bm{V}> with \bm{Z}(\bm{\pi},\bm{V};\bm{\alpha})=0 constitutes a Bayesian O-SMPE (in which admissibility is not constrained). The Bayesian obedient selection rule profile \bm{\beta}^{O} is not itself part of the solution of the optimization problem (\mathtt{Opt})-(\mathtt{BOB}0_{i}); instead, the optimality of Bayesian obedience constrains the optimal solution through (\mathtt{BOB}0_{i}).
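The sketch below illustrates, for one representative agent and the arrays built earlier, how the objective \bm{Z}(\bm{\pi},\bm{V};\bm{\alpha}) of (\mathtt{Opt}) is assembled. The candidate value function V_fe is taken to be the evaluation fixed point of the given policy, so the objective vanishes by construction; the constraints (\mathtt{FE}_{i}) and (\mathtt{BOB}0_{i}) are not checked here.

```python
def Q_alpha(V):
    # Q^alpha(a, g, w) = R(a,g,w) + gamma * sum_{g',w'} T(g'|g,a) alpha(w'|g') V(g',w')
    cont = np.einsum('gak,kw,kw->ga', T, alpha, V)
    return R + gamma * cont.T[:, :, None]

V_fe = np.zeros((nG, nW))
for _ in range(500):                                        # evaluation fixed point of pi
    V_fe = np.einsum('gwa,agw->gw', pi, Q_alpha(V_fe))
Z = np.sum(V_fe - np.einsum('gwa,agw->gw', pi, Q_alpha(V_fe)))
print("Z(pi, V; alpha) =", round(float(Z), 10))             # ~0 at the evaluation fixed point
```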

If we suppress the constraint (\mathtt{BOB}0_{i}), the reduced optimization problem (\mathtt{Opt})-(\mathtt{FE}_{i}) can be interpreted as a search for pairs of decision variables <\pi_{i},V_{i}> that fit a Bellman optimality operator (i.e., satisfy (17)). In other words, the goal of this reduced problem is to find fixed points. However, there is another fixed point, associated with the Bellman optimality operator established by the condition (19) and the Bellman recursions (14)-(16); i.e., supposing the agents are Bayesian obedient, for all g\in\mathcal{G}, i\in\mathcal{N},

Ji(g|𝜽)=maxωi𝝎ik,ai𝜶i(𝝎ik|g,𝜽)πi(ai|g,ωik,θi)\displaystyle J_{i}(g|\bm{\theta})=\max\limits_{\omega^{\prime}_{i}}\sum_{\bm{\omega}^{k}_{-i},a_{i}}\bm{\alpha}_{-i}(\bm{\omega}^{k}_{-i}|g,\bm{\theta})\pi_{i}(a_{i}|g,\omega^{k}_{i},\theta_{i}) (21)
×Qi𝝅i(ai,g;𝝎k|𝜽;Ji).\displaystyle\times Q^{\bm{\pi}_{-i}}_{i}(a_{i},g;\bm{\omega}^{k}|\bm{\theta};J_{i}).

Note that the Bellman optimality equation (21) does not involve V_{i} directly; it is nevertheless constructed from the relationship between J_{i} and V_{i} given in (15).

We propose a design regime for finding a signaling rule 𝜶\bm{\alpha} that aligns two fixed points JiJ^{*}_{i} and ViV^{*}_{i} for each agent ii while each agent’s policy is strongly admissible. Let, for any g𝒢g\in\mathcal{G}, ωikΩ\omega^{k}_{i}\in\Omega with αi(ωik|g,𝜽)>0\alpha_{i}(\omega^{k}_{i}|g,\bm{\theta})>0, ωiΩ\omega_{i}\in\Omega, i𝒩i\in\mathcal{N},

Vi𝜶i(g,ωi;ωik|𝜽;Vi)\displaystyle V^{\bm{\alpha}_{-i}}_{i}(g,\omega_{i};\omega^{k}_{i}|\bm{\theta};V_{i})
𝝎ik𝜶i(𝝎ik|g,𝜽)Vi(g,ωi,𝝎ik;ωik,𝝎ik|𝜽),\displaystyle\equiv\sum\limits_{\bm{\omega}^{k}_{-i}}\bm{\alpha}_{-i}(\bm{\omega}^{k}_{-i}|g,\bm{\theta})V_{i}(g,\omega_{i},\bm{\omega}^{k}_{-i};\omega^{k}_{i},\bm{\omega}^{k}_{-i}|\bm{\theta}),

with Vi𝜶i(g;ωik|𝜽;Vi)=Vi𝜶i(g,ωik;ωik|𝜽;Vi)V^{\bm{\alpha}_{-i}}_{i}(g;\omega^{k}_{i}|\bm{\theta};V_{i})=V^{\bm{\alpha}_{-i}}_{i}(g,\omega^{k}_{i};\omega^{k}_{i}|\bm{\theta};V_{i}). The objective function would be

𝒁FPA(𝜶,𝑱,𝑽;𝜽)\displaystyle\bm{Z}^{\text{FPA}}(\bm{\alpha},\bm{J},\bm{V};\bm{\theta}) (22)
i,g(Ji(g|𝜽)ωikαi(ωik|g,𝜽)Vi𝜶i(g;ωik|𝜽)),\displaystyle\equiv\sum\limits_{i,g}\Big{(}J_{i}(g|\bm{\theta})-\sum\limits_{\omega^{k}_{i}}\alpha_{i}(\omega^{k}_{i}|g,\bm{\theta})V^{\bm{\alpha}_{-i}}_{i}(g;\omega^{k}_{i}|\bm{\theta})\Big{)},

which is to be minimized over all possible \bm{\alpha}, \bm{J}, and \bm{V}. Given \bm{\alpha} and \bm{\kappa}, we denote by \text{AD}[\bm{\alpha},\bm{\kappa}] the set of valid policy profiles that are admissible. For example, under strong admissibility, the set \text{AD}[\bm{\alpha},\bm{\kappa}] is given as

AD[𝜶,𝜿]\displaystyle\text{AD}[\bm{\alpha},\bm{\kappa}]\equiv {𝝅:i𝒩,(RGi),\displaystyle\Big{\{}\bm{\pi}:\forall i\in\mathcal{N},\text{(\ref{eq:regular_policy})}, (23)
𝜿(𝒂|g,𝜽)𝝎~i𝜶(ωi,𝝎~i|g,𝜽)\displaystyle\bm{\kappa}(\bm{a}|g,\bm{\theta})\sum\limits_{\bm{\tilde{\omega}}_{-i}}\bm{\alpha}(\omega_{i},\bm{\tilde{\omega}}_{-i}|g,\bm{\theta})
=𝝎~i𝝅(𝒂|g,ωi,𝝎~i,𝜽)𝜶(ωi,𝝎~i)}.\displaystyle=\sum\limits_{\bm{\tilde{\omega}}_{-i}}\bm{\pi}(\bm{a}|g,\omega_{i},\bm{\tilde{\omega}}_{-i},\bm{\theta})\bm{\alpha}(\omega_{i},\bm{\tilde{\omega}}_{-i})\Big{\}}.
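The following small sketch evaluates, for one representative agent and the arrays built earlier, (i) the strong-admissibility condition defining \text{AD}[\bm{\alpha},\bm{\kappa}] above (the policy must reproduce the goal wherever the signaling rule puts positive probability) and (ii) the alignment objective \bm{Z}^{\text{FPA}} in (22). The quantities are computed on random placeholder data and merely illustrate the computations involved.

```python
# (i) strong admissibility: kappa(a|g) * alpha(w|g) == pi(a|g,w) * alpha(w|g) for all (g,w,a)
strongly_admissible = np.allclose(kappa[:, None, :] * alpha[:, :, None],
                                  pi * alpha[:, :, None])
# (ii) Z^FPA in (22), with the single-agent collapse V^{alpha_{-i}}(g; w) -> V_fe(g, w);
# in this collapsed setting J and V_fe describe the same fixed point, so Z^FPA is ~0 here.
Z_fpa = float(np.sum(J - np.einsum('gw,gw->g', alpha, V_fe)))
print("strongly admissible:", strongly_admissible, "| Z^FPA =", round(Z_fpa, 6))
```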

The Bayesian obedience is constrained by, for all g𝒢g\in\mathcal{G}, ωiΩ\omega_{i}\in\Omega, ωikΩ\omega^{k}_{i}\in\Omega with αi(ωik|g,𝜽)>0\alpha_{i}(\omega^{k}_{i}|g,\bm{\theta})>0, i𝒩i\in\mathcal{N},

Vi𝜶i(g,ωi;ωik|𝜽;Vi)ωik𝜶(ωik|g,𝜽)Vi𝜶i(g;ωik|𝜽;Vi).\displaystyle V^{\bm{\alpha}_{-i}}_{i}(g,\omega_{i};\omega^{k}_{i}|\bm{\theta};V_{i})\leq\sum\limits_{\omega^{k}_{i}}\bm{\alpha}({\omega}^{k}_{i}|g,\bm{\theta})V^{\bm{\alpha}_{-i}}_{i}(g;\omega^{k}_{i}|\bm{\theta};V_{i}). (𝙱𝙾𝙱𝟷i\mathtt{BOB1}_{i})

The feasibility of V_{i} given a \bm{\pi} is captured by the constraint (\mathtt{FE}_{i}). We additionally constrain the feasibility of J_{i} in terms of V_{i} as follows: for all g\in\mathcal{G}, \omega^{k}_{i}\in\Omega with \alpha_{i}(\omega^{k}_{i}|g,\bm{\theta})>0, i\in\mathcal{N},

Ji(g|𝜽)𝝎ik𝜶i(𝝎ik|g,𝜽)Vi(g,ωik,𝝎ik|𝜽).\displaystyle J_{i}(g|\bm{\theta})\geq\sum\limits_{\bm{\omega}^{k}_{-i}}\bm{\alpha}_{-i}(\bm{\omega}^{k}_{-i}|g,\bm{\theta})V_{i}(g,\omega^{k}_{i},\bm{\omega}^{k}_{-i}|\bm{\theta}). (𝙵𝚂i\mathtt{FS}_{i})

Formally, the information design problem based on fixed-point alignment is

min𝜶,𝑱,𝑽𝒁FPA(𝜶,𝑱,𝑽;𝜽)\displaystyle\min\limits_{\bm{\alpha},\bm{J},\bm{V}}\bm{Z}^{\text{FPA}}(\bm{\alpha},\bm{J},\bm{V};\bm{\theta}) (𝙵𝙿𝙰𝚕𝚒𝚐𝚗\mathtt{FPAlign})
s.t., (FEi), (BOB1i), (FSi),i𝒩,\displaystyle\text{ (\ref{eq:nonlinear_program_pi_constraint}), (\ref{eq:constraint_BOB_1}), (\ref{eq:FS_constraint_J})},\forall i\in\mathcal{N},
𝝅AD[𝜶,𝜿],𝒁(𝝅,𝑽;𝜶)=0.\displaystyle\bm{\pi}\in\text{AD}[\bm{\alpha},\bm{\kappa}],\bm{Z}(\bm{\pi},\bm{V};\bm{\alpha})=0.

In (\mathtt{FPAlign}), the constraint \bm{Z}(\bm{\pi},\bm{V};\bm{\alpha})=0 is necessary and sufficient for the feasible \bm{\pi} to be the policy component of a Bayesian O-SMPE.

Theorem 5

Fix a goal \bm{\kappa}. Let <\bm{\alpha}^{*},\bm{J}^{*},\bm{V}^{*}> be a feasible point of (\mathtt{FPAlign}). The signaling rule \bm{\alpha}^{*} is Bayesian OIL with strong admissibility if and only if (i) \bm{Z}^{\text{FPA}}(\bm{\alpha}^{*},\bm{J}^{*},\bm{V}^{*};\bm{\theta})=0, (ii) \bm{\kappa}\in\mathtt{MPG}[\bm{R}], and (iii) strong admissibility holds.

Theorem 5 provides a design regime for a signaling rule that is OIL in Bayesian O-SMPE. Condition (i) specifies the optimality of the solution to (\mathtt{FPAlign}), while conditions (ii) and (iii) discipline the principal's freedom to manipulate the agents' behaviors. Specifically, condition (ii) implies that the principal cannot plan an arbitrary goal that specifies an arbitrary distribution of the agents' actions conditioned on the state and the joint type.

Theorem 5 thus identifies two restrictions on the principal's freedom to set her goal and to determine how the goal is achieved when influencing the agents' behaviors in a Bayesian O-SMPE. Specifically, the goal \bm{\kappa} should be a Markov perfect goal and the induced equilibrium policy profile should be strongly admissible. The following corollary uncovers another restriction on the principal's ability to influence the agents' behaviors in a Bayesian O-SMPE.

Corollary 5.1

Fix a base augmented Markov game MM with 𝐑={Ri}i𝒩\bm{R}=\{R_{i}\}_{i\in\mathcal{N}}. In general, there exists 𝛋𝙼𝙿𝙶[𝐑]\bm{\kappa^{\prime}}\in\mathtt{MPG}[\bm{R}] that can be achieved in an indirect game MD[𝛂|𝛉]M^{-D}[\bm{\alpha^{\prime}}|\bm{\theta}] but not in any direct game MD[𝛂′′|𝛉]M^{D}[\bm{\alpha^{\prime\prime}}|\bm{\theta}].

Corollary 5.1 states that restricting attention to direct information design comes with a loss of generality in the selection of Markov perfect goals.

IV Principal’s Optimal Information Design

So far, we have focused on the setting in which the principal's goal is given. In this section, we introduce the optimality criterion for the principal's goal selection and define the optimal information design problem without a predetermined goal. The principal's one-stage payoff function is u(\cdot;\bm{\theta}):\mathcal{A}^{n}\times\mathcal{G}\mapsto\mathbb{R}, where u(\bm{a},g;\bm{\theta}) gives the principal's immediate payoff when the state is g and the agents of the game M[\bm{\alpha};\bm{\theta}] take the joint action \bm{a}. Recall that the principal's goal \bm{\kappa} is the probability distribution of the agents' equilibrium joint action conditioned only on the global state, given the agents' types. Hence, the information structure that matters for the principal's goal selection problem is <\mathcal{G},\mathcal{T}_{g},d_{g}>. By the Ionescu-Tulcea theorem, the information structure <\mathcal{G},\mathcal{T}_{g},d_{g}> and any goal \bm{\kappa} uniquely define a probability measure on (\mathcal{G}\times\mathcal{A}^{n})^{\infty}. We denote the corresponding expectation by \mathbb{E}^{\bm{\kappa}}[\cdot]. Hence, the principal's problem is to choose a goal that maximizes her expected payoff (\gamma-discounted, the same as the agents'), i.e.,

C(𝜿)𝔼𝜿[t=0γtu(𝒂t,gt,𝜽)].\displaystyle C(\bm{\kappa})\equiv\mathbb{E}^{\bm{\kappa}}\Big{[}\sum_{t=0}^{\infty}\gamma^{t}u(\bm{a}_{t},g_{t},\bm{\theta})\Big{]}. (24)

However, the principal can neither force the agents to take actions nor directly program the agents' actions according to the \bm{\kappa}^{*} that maximizes C(\bm{\kappa}); instead, she uses information design to elicit actions that coincide with \bm{\kappa}^{*} in the sense of strong admissibility. Hence, the principal's optimal goal selection problem is the constrained optimization problem:

max𝜿𝙼𝙿𝙶[𝑹]C(𝜿)𝔼𝜿[t=0γtu(𝒂t,gt,𝜽)]\displaystyle\max\limits_{\bm{\kappa}\in\mathtt{MPG}[\bm{R}]}C(\bm{\kappa})\equiv\mathbb{E}^{\bm{\kappa}}\Big{[}\sum_{t=0}^{\infty}\gamma^{t}u(\bm{a}_{t},g_{t},\bm{\theta})\Big{]} (25)
s.t. <𝜶,𝑱,𝑽> is a solution of (FPAlign).\displaystyle<\bm{\alpha}^{*},\bm{J}^{*},\bm{V}^{*}>\text{ is a solution of (\ref{eq:FPAlign})}.

In (25), the feasibility of \bm{\kappa} is captured by two conditions: (i) it is a Markov perfect goal, and (ii) it disciplines strong admissibility in (\mathtt{FPAlign}).

The optimal goal selection problem (25) can be reformulated as a problem of selecting \bm{\alpha} and \bm{\pi}. Specifically, by (strong) admissibility, the objective function C(\bm{\kappa}) can be represented in terms of \bm{\alpha} and \bm{\pi} as follows:

CO(𝜶,𝝅)𝔼[t=0𝒂t\displaystyle C^{O}(\bm{\alpha},\bm{\pi})\equiv\mathbb{E}\Big{[}\sum^{\infty}_{t=0}\sum\limits_{\bm{a}_{t}} (26)
γtu(𝒂t,gt;𝜽)𝝎tk𝝅(𝒂t|gt,𝝎tk,𝜽)𝜶(𝝎tk|gt,𝜽)],\displaystyle\gamma^{t}u(\bm{a}_{t},g_{t};\bm{\theta})\sum_{\bm{\omega}^{k}_{t}}\bm{\pi}(\bm{a}_{t}|g_{t},\bm{\omega}^{k}_{t},\bm{\theta})\bm{\alpha}(\bm{\omega}^{k}_{t}|g_{t},\bm{\theta})\Big{]},

where the expectation \mathbb{E} is with respect to the probability measure over the state dynamics. Let \bm{\Pi}[\bm{\alpha},\bm{J},\bm{V}] denote the set of valid policy profiles associated with the value function \bm{V}, given \bm{\alpha} and \bm{J}; i.e.,

𝚷[𝜶,\displaystyle\bm{\Pi}[\bm{\alpha}, 𝑱,𝑽]{𝝅:(RGi),Vi(g;𝝎k|𝜽)\displaystyle\bm{J},\bm{V}]\equiv\Big{\{}\bm{\pi}:\text{(\ref{eq:regular_policy})},V_{i}(g;\bm{\omega}^{k}|\bm{\theta})
=𝔼πi[Qi𝝅i(ai,g,ωi,𝝎ik;𝝎k|𝜽;Ji)],i𝒩},\displaystyle=\mathbb{E}_{\pi_{i}}\Big{[}Q^{\bm{\pi}_{-i}}_{i}(a_{i},g,\omega_{i},\bm{\omega}^{k}_{-i};\bm{\omega}^{k}|\bm{\theta};J_{i})\Big{]},\forall i\in\mathcal{N}\Big{\}},

where Qi𝝅iQ^{\bm{\pi}_{-i}}_{i} is defined in (20). Hence, the principal’s problem (25) can be reformulated as follows:

max𝜶max𝝅𝚷[𝜶,𝑱,𝑽]CO(𝜶,𝝅)\displaystyle\max\limits_{\bm{\alpha}}\max\limits_{\bm{\pi}\in\bm{\Pi}[\bm{\alpha}^{*},\bm{J}^{*},\bm{V}^{*}]}C^{O}(\bm{\alpha},\bm{\pi}) (OptInfo)
s.t. <𝜶,𝑱,𝑽>𝚊𝚛𝚐𝚖𝚒𝚗𝜶,𝑱,𝑽𝒁FPA(𝜶,𝑱,𝑽;𝜽)\displaystyle<\bm{\alpha},\bm{J}^{*},\bm{V}^{*}>\in\operatorname*{\mathtt{argmin}}\limits_{\bm{\alpha},\bm{J},\bm{V}}\bm{Z}^{\text{FPA}}(\bm{\alpha},\bm{J},\bm{V};\bm{\theta})
s.t.  (FEi), (BOB1i), (FSi),i𝒩.\displaystyle\text{s.t. }\text{ (\ref{eq:nonlinear_program_pi_constraint}), (\ref{eq:constraint_BOB_1}), (\ref{eq:FS_constraint_J})},\forall i\in\mathcal{N}.

Technically, the problem (OptInfo) is to select (i) an equilibrium policy profile \bm{\pi}^{*} that is strongly admissible and is an MPE, and (ii) a signaling rule \bm{\alpha}^{*} that induces this policy profile, such that the principal's expected payoff C^{O} is maximized at (\bm{\alpha}^{*},\bm{\pi}^{*}).
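A rough Monte Carlo sketch of the principal's objective C^{O}(\bm{\alpha},\bm{\pi}) in (26) is given below for one representative agent, reusing the earlier arrays; the principal's one-stage payoff u, the uniform initial state distribution, and the rollout lengths are placeholders introduced purely for illustration.

```python
u = rng.normal(size=(nA, nG))                      # placeholder principal payoff u(a, g)

def C_O(alpha, pi, n_rollouts=200, horizon=60):
    # Estimate (26): simulate the state dynamics and weight payoffs by the induced
    # action kernel sum_w pi(a|g,w) alpha(w|g).
    act_dist = np.einsum('gwa,gw->ga', pi, alpha)
    total = 0.0
    for _ in range(n_rollouts):
        g = rng.integers(nG)                       # uniform initial state (assumption)
        disc = 1.0
        for _ in range(horizon):
            a = rng.choice(nA, p=act_dist[g])
            total += disc * u[a, g]
            g = rng.choice(nG, p=T[g, a])
            disc *= gamma
    return total / n_rollouts

print("C^O(alpha, pi) ~", round(C_O(alpha, pi), 3))
```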

The problem (OptInfo) is, however, based on the assumption that the agents' equilibrium behavior is always principal-preferred. We could also consider a principal who solves her problem in a robust manner: she chooses the signaling rule but maximizes her expected payoff in the worst induced equilibrium; i.e., the robust information design problem:

max𝜶min𝝅𝚷[𝜶,𝑱,𝑽]CO(𝜶,𝝅)\displaystyle\max\limits_{\bm{\alpha}}\min\limits_{\bm{\pi}\in\bm{\Pi}[\bm{\alpha}^{*},\bm{J}^{*},\bm{V}^{*}]}C^{O}(\bm{\alpha},\bm{\pi}) (Robust)
s.t. <𝜶,𝑱,𝑽>𝚊𝚛𝚐𝚖𝚒𝚗𝜶,𝑱,𝑽𝒁FPA(𝜶,𝑱,𝑽;𝜽)\displaystyle<\bm{\alpha},\bm{J}^{*},\bm{V}^{*}>\in\operatorname*{\mathtt{argmin}}\limits_{\bm{\alpha},\bm{J},\bm{V}}\bm{Z}^{\text{FPA}}(\bm{\alpha},\bm{J},\bm{V};\bm{\theta})
s.t.  (FEi), (BOB1i), (FSi),i𝒩.\displaystyle\text{s.t. }\text{ (\ref{eq:nonlinear_program_pi_constraint}), (\ref{eq:constraint_BOB_1}), (\ref{eq:FS_constraint_J})},\forall i\in\mathcal{N}.

IV-A Fixed-Point Misalignment Minimization

In this section, we provide an alternative formulation of information design by introducing the notion of fixed-point misalignment (FP misalignment).

Define, for any g𝒢g\in\mathcal{G}, i𝒩i\in\mathcal{N}, 𝜶i\bm{\alpha}_{-i}, 𝝅i\bm{\pi}_{-i}, JiJ_{i}, ViV_{i},

i𝜶i(Ji,Vi;g,ωik,𝜽)Ji(g|𝜽)Vi𝜶i(g;ωik|𝜽;Vi),\mathcal{E}^{\bm{\alpha}_{-i}}_{i}(J_{i},V_{i};g,\omega^{k}_{i},\bm{\theta})\equiv J_{i}(g|\bm{\theta})-V^{\bm{\alpha}_{-i}}_{i}(g;\omega^{k}_{i}|\bm{\theta};V_{i}),
i𝝅i(Ji,Vi;g,𝝎k,𝜽)Vi(g;𝝎k|𝜽)Qi𝝅i(ai,g;𝝎k|𝜽;Ji).\mathcal{E}^{\bm{\pi}_{-i}}_{i}(J_{i},V_{i};g,\bm{\omega}^{k},\bm{\theta})\equiv V_{i}(g;\bm{\omega}^{k}|\bm{\theta})-Q^{\bm{\pi}_{-i}}_{i}(a_{i},g;\bm{\omega}^{k}|\bm{\theta};J_{i}).

Then, we define the notion of fixed-point misalignment as follows:

\delta^{\bm{\pi}_{-i}}_{i}(J_{i},V_{i};\pi_{i}|g,\omega^{k}_{i},a_{i},\bm{\theta})\equiv\pi_{i}(a_{i}|g,\omega^{k}_{i},\theta_{i})\,\mathcal{E}^{\bm{\pi}_{-i}}_{i}(J_{i},V_{i};g,\bm{\omega}^{k},\bm{\theta}),
\delta^{\bm{\alpha}_{-i}}_{i}(J_{i},V_{i};\alpha_{i}|g,\omega^{k}_{i},\bm{\theta})\equiv\alpha_{i}(\omega^{k}_{i}|g,\bm{\theta})\,\mathcal{E}^{\bm{\alpha}_{-i}}_{i}(J_{i},V_{i};g,\omega^{k}_{i},\bm{\theta}).
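The two misalignments can be evaluated directly once candidate value functions are available, as in the sketch below (single representative agent, arrays reused from the earlier sketches); at a Bayesian O-SMPE both quantities would have to vanish, as stated in Proposition 5.1 below.

```python
Q_J = R + gamma * np.einsum('gak,k->ag', T, J)[..., None]   # Q(a, g, w; J) via (16)
gap_pi = V_fe[:, :, None] - np.transpose(Q_J, (1, 2, 0))    # V(g,w) - Q(a,g,w), per action a
delta_pi = pi * gap_pi                                      # policy-weighted gap, cf. (FPM1)
delta_alpha = alpha * (J[:, None] - V_fe)                   # signal-weighted gap, cf. (FPM2)
print("max |delta^pi|    =", round(float(np.abs(delta_pi).max()), 4))
print("max |delta^alpha| =", round(float(np.abs(delta_alpha).max()), 4))
```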
Proposition 5.1

Fix a Markov perfect goal 𝛋\bm{\kappa}. A strategy profile <𝛃O,𝛑><\bm{\beta}^{O},\bm{\pi}^{*}> where 𝛃O\bm{\beta}^{O} is Bayesian obedient is a Bayesian O-SMPE if and only if there exists a profile <𝛂,𝐉,𝐕><\bm{\alpha},\bm{J},\bm{V}> that satisfies (𝙵𝙴i\mathtt{FE}_{i}), (𝙱𝙾𝙱𝟷i\mathtt{BOB1}_{i}), and (𝙵𝚂i\mathtt{FS}_{i}), given 𝛑\bm{\pi}^{*}, g𝒢g\in\mathcal{G}, i𝒩i\in\mathcal{N}, such that, for all ωikΩ\omega^{k}_{i}\in\Omega, ai𝒜a_{i}\in\mathcal{A}, i𝒩i\in\mathcal{N},

δi𝝅i(Ji,Vi;πi|g,ωik,ai,𝜽)=0,\displaystyle\delta^{\bm{\pi}_{-i}}_{i}(J_{i},V_{i};\pi_{i}|g,\omega^{k}_{i},a_{i},\bm{\theta})=0, (𝙵𝙿𝙼𝟷i\mathtt{FPM1}_{i})
δi𝜶i(Ji,Vi;αi|g,ωik,𝜽)=0.\displaystyle\delta^{\bm{\alpha}_{-i}}_{i}(J_{i},V_{i};\alpha_{i}|g,\omega^{k}_{i},\bm{\theta})=0. (𝙵𝙿𝙼𝟸i\mathtt{FPM2}_{i})

We then reformulate (\mathtt{FPAlign}) in terms of FP misalignment minimization based on Proposition 5.1. By the definitions of \mathcal{E}^{\bm{\alpha}_{-i}}_{i} and \mathcal{E}^{\bm{\pi}_{-i}}_{i}, the objective functions \bm{Z} and \bm{Z}^{FPA} can be represented in terms of \delta^{\bm{\pi}_{-i}}_{i} and \delta^{\bm{\alpha}_{-i}}_{i}, respectively, as follows (denoted by \bm{\hat{Z}} and \bm{\hat{Z}}^{FPA}):

𝒁^(𝝅,𝑽;𝜶)=i,g,ωikaiδi𝝅i(Ji,Vi;πi|g,ωik,ai,𝜽),\displaystyle\bm{\hat{Z}}(\bm{\pi},\bm{V};\bm{\alpha})=\sum\limits_{i,g,\omega^{k}_{i}}\sum\limits_{a_{i}}\delta^{\bm{\pi}_{-i}}_{i}(J_{i},V_{i};\pi_{i}|g,\omega^{k}_{i},a_{i},\bm{\theta}), (27)
𝒁^FPA(𝜶,𝑱,𝑽;𝜽)=i,gωikδi𝜶i(Ji,Vi;αi|g,ωik,𝜽).\displaystyle\bm{\hat{Z}}^{FPA}(\bm{\alpha},\bm{J},\bm{V};\bm{\theta})=\sum\limits_{i,g}\sum\limits_{\omega^{k}_{i}}\delta^{\bm{\alpha}_{-i}}_{i}(J_{i},V_{i};\alpha_{i}|g,\omega^{k}_{i},\bm{\theta}). (28)
Corollary 5.2

Given any 𝛋𝙼𝙿𝙶[𝐑]\bm{\kappa}\in\mathtt{MPG}[\bm{R}], the problem (𝙵𝙿𝙰𝚕𝚒𝚐𝚗\mathtt{FPAlign}) is equivalent to the following:

min𝜶,𝑱,𝑽\displaystyle\min\limits_{\bm{\alpha},\bm{J},\bm{V}} 𝒁^FPA(𝜶,𝑱,𝑽;𝜽)\displaystyle\bm{\hat{Z}}^{FPA}(\bm{\alpha},\bm{J},\bm{V};\bm{\theta}) (𝙵𝙿𝙼𝚒𝚜\mathtt{FPMis})
s.t. (BOB1i) ,(FPM1i) ,(FPM2i), i𝒩\displaystyle\text{ (\ref{eq:constraint_BOB_1}) ,(\ref{eq:misalignment_1}) ,(\ref{eq:misalignment_2}), }\forall i\in\mathcal{N}
𝝅AD[𝜶,𝜿].\displaystyle\bm{\pi}\in\text{AD}[\bm{\alpha},\bm{\kappa}].

Define a set:

𝓓{𝜶,𝝅:i𝒩,\displaystyle\bm{\mathcal{D}}\equiv\Big{\{}\bm{\alpha},\bm{\pi}:\forall i\in\mathcal{N}, (RGi),\displaystyle\text{(\ref{eq:regular_policy})},
<αi,Ji,Vi>\displaystyle<\alpha_{i},J_{i},V_{i}>\in 𝚊𝚛𝚐𝚖𝚒𝚗δi𝜶i(Ji,Vi;αi|g,ωik,𝜽)\displaystyle\operatorname*{\mathtt{argmin}}\delta^{\bm{\alpha}_{-i}}_{i}(J_{i},V_{i};\alpha_{i}|g,\omega^{k}_{i},\bm{\theta})
s.t.  (BOB1i) ,(FPM1i) ,(FPM2i).}\displaystyle\text{s.t. }\text{ (\ref{eq:constraint_BOB_1}) ,(\ref{eq:misalignment_1}) ,(\ref{eq:misalignment_2})}.\Big{\}}

Given 𝓓\bm{\mathcal{D}}, we define a set of OIL signaling rules as 𝓢{𝜶:(𝜶,𝝅)𝓓}\bm{\mathcal{S}}\equiv\Big{\{}\bm{\alpha}:\forall(\bm{\alpha},\bm{\pi})\in\bm{\mathcal{D}}\Big{\}} and a set of policy profiles given any signaling rule 𝜶\bm{\alpha} as 𝚷[𝜶]{𝝅:(𝜶,𝝅)𝓓}\bm{\Pi}[\bm{\alpha}]\equiv\Big{\{}\bm{\pi}:\forall(\bm{\alpha},\bm{\pi})\in\bm{\mathcal{D}}\Big{\}}.

Corollary 5.3

The principal’s robust information design is to solve the following problem

max𝜶𝓢min𝝅𝚷[𝜶]CO(\displaystyle\max\limits_{\bm{\alpha}\in\bm{\mathcal{S}}}\min\limits_{\bm{\pi}\in\bm{\Pi}[\bm{\alpha}]}C^{O}( 𝜶,𝝅).\displaystyle\bm{\alpha},\bm{\pi}). (29)
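The max-min structure of (29) can be illustrated by a brute-force search over small candidate sets, as in the following sketch; the candidate signaling rules and policies are random placeholders rather than the sets \bm{\mathcal{S}} and \bm{\Pi}[\bm{\alpha}] produced by (\mathtt{FPMis}), and the objective C_O is the Monte Carlo estimate from the earlier sketch.

```python
cand_alphas = [rng.dirichlet(np.ones(nW), size=nG) for _ in range(5)]       # placeholder S
cand_pis = [rng.dirichlet(np.ones(nA), size=(nG, nW)) for _ in range(5)]    # placeholder Pi[alpha]
robust_values = [min(C_O(a, p) for p in cand_pis) for a in cand_alphas]     # worst policy per alpha
best = int(np.argmax(robust_values))                                        # maximize over alpha
print("best robust value over candidates:", round(robust_values[best], 3))
```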

V Conclusion

This work is the first to propose an information design principle for dynamic games in which each agent makes coupled decisions of selecting a signal and taking an action at each period of time. We have formally defined a novel information design problem in both the indirect and the direct settings. The notion of obedient implementability has been introduced to capture the optimality of the direct information design problem in a new equilibrium concept, the obedient sequential Markov perfect equilibrium (O-SMPE). By characterizing obedient implementability (OIL) in Bayesian O-SMPE, we have proposed an approach to determining the information structure, which we refer to as fixed-point alignment: it aligns the two fixed points arising at the signal-selection stage and the action-taking stage, respectively. We have uncovered the restrictions that discipline the principal's freedom to influence the agents' behaviors in a Bayesian O-SMPE. Specifically, the principal's goal should be a Markov perfect goal and the equilibrium policy profile should be strongly admissible. Additionally, restricting attention to direct information design entails a loss of generality in the selection of Markov perfect goals for OIL in Bayesian O-SMPE. Finally, we have formulated the principal's goal selection problem in terms of optimal and robust information design, replacing admissibility with the optimality or the robustness of the agents' equilibrium policy profile with respect to the principal's expected payoff.

References

  • [1] A. Dickinson, “Actions and habits: the development of behavioural autonomy,” Philosophical Transactions of the Royal Society of London. B, Biological Sciences, vol. 308, no. 1135, pp. 67–78, 1985.
  • [2] D. Bergemann and S. Morris, “Information design: A unified perspective,” Journal of Economic Literature, vol. 57, no. 1, pp. 44–95, 2019.
  • [3] I. Taneva, “Information design,” American Economic Journal: Microeconomics, vol. 11, no. 4, pp. 151–85, 2019.
  • [4] N. Chentanez, A. G. Barto, and S. P. Singh, “Intrinsically motivated reinforcement learning,” in Advances in neural information processing systems, 2005, pp. 1281–1288.
  • [5] L. Mathevet, J. Perego, and I. Taneva, “On information design in games,” Journal of Political Economy, vol. 128, no. 4, pp. 1370–1404, 2020.
  • [6] D. Bergemann and S. Morris, “Bayes correlated equilibrium and the comparison of information structures in games,” Theoretical Economics, vol. 11, no. 2, pp. 487–522, 2016.
  • [7] E. Kamenica and M. Gentzkow, “Bayesian persuasion,” American Economic Review, vol. 101, no. 6, pp. 2590–2615, 2011.
  • [8] J. Ely, A. Frankel, and E. Kamenica, “Suspense and surprise,” Journal of Political Economy, vol. 123, no. 1, pp. 215–260, 2015.
  • [9] J. Passadore and J. P. Xandri, “Robust conditional predictions in dynamic games: An application to sovereign debt,” Job Market Paper, 2015.
  • [10] L. Doval and J. C. Ely, “Sequential information design,” Econometrica, vol. 88, no. 6, pp. 2575–2608, 2020.
  • [11] J. C. Ely, “Beeps,” American Economic Review, vol. 107, no. 1, pp. 31–53, 2017.
  • [12] J. C. Ely and M. Szydlowski, “Moving the goalposts,” Journal of Political Economy, vol. 128, no. 2, pp. 468–506, 2020.
  • [13] M. Makris and L. Renou, “Information design in multi-stage games,” working paper, Tech. Rep., 2018.
  • [14] F. Koessler, M. Laclau, and T. Tomala, “Interactive information design,” HEC Paris Research Paper No. ECO/SCD-2018-1260, 2018.
  • [15] R. B. Myerson, “Optimal auction design,” Mathematics of operations research, vol. 6, no. 1, pp. 58–73, 1981.
  • [16] A. Pavan, I. Segal, and J. Toikka, “Dynamic mechanism design: A myersonian approach,” Econometrica, vol. 82, no. 2, pp. 601–653, 2014.
  • [17] T. Zhang and Q. Zhu, “On the differential private data market: Endogenous evolution, dynamic pricing, and incentive compatibility,” 2021.
  • [18] P. Milgrom and P. R. Milgrom, Putting auction theory to work.   Cambridge University Press, 2004.
  • [19] S. Bhat, S. Jain, S. Gujar, and Y. Narahari, “An optimal bidimensional multi-armed bandit auction for multi-unit procurement,” Annals of Mathematics and Artificial Intelligence, vol. 85, no. 1, pp. 1–19, 2019.
  • [20] T. Sönmez and M. U. Ünver, “Matching, allocation, and exchange of discrete resources,” in Handbook of social Economics.   Elsevier, 2011, vol. 1, pp. 781–852.
  • [21] T. Zhang and Q. Zhu, “Optimal two-sided market mechanism design for large-scale data sharing and trading in massive iot networks,” arXiv preprint arXiv:1912.06229, 2019.
  • [22] D. Dewey, “Reinforcement learning and the reward engineering principle,” in 2014 AAAI Spring Symposium Series, 2014.
  • [23] R. Nagpal, A. U. Krishnan, and H. Yu, “Reward engineering for object pick and place training,” arXiv preprint arXiv:2001.03792, 2020.
  • [24] D. Hadfield-Menell, S. Milli, P. Abbeel, S. J. Russell, and A. Dragan, “Inverse reward design,” in Advances in neural information processing systems, 2017, pp. 6765–6774.
  • [25] E. Kamenica, “Bayesian persuasion and information design,” Annual Review of Economics, vol. 11, pp. 249–272, 2019.
  • [26] I. Brocas and J. D. Carrillo, “Influence through ignorance,” The RAND Journal of Economics, vol. 38, no. 4, pp. 931–947, 2007.
  • [27] L. Rayo and I. Segal, “Optimal information disclosure,” Journal of political Economy, vol. 118, no. 5, pp. 949–987, 2010.
  • [28] I. Arieli and Y. Babichenko, “Private bayesian persuasion,” Journal of Economic Theory, vol. 182, pp. 185–217, 2019.
  • [29] M. Castiglioni, A. Celli, A. Marchesi, and N. Gatti, “Online bayesian persuasion,” Advances in Neural Information Processing Systems, vol. 33, 2020.
  • [30] J.-F. Mertens and S. Zamir, “Formulation of bayesian analysis for games with incomplete information,” International Journal of Game Theory, vol. 14, no. 1, pp. 1–29, 1985.
  • [31] I. Goldstein and Y. Leitner, “Stress tests and information disclosure,” Journal of Economic Theory, vol. 177, pp. 34–69, 2018.
  • [32] N. Inostroza and A. Pavan, “Persuasion in global games with application to stress testing,” 2018.
  • [33] P. Hernández and Z. Neeman, “How bayesian persuasion can help reduce illegal parking and other socially undesirable behavior,” Preprint, 2018.
  • [34] Z. Rabinovich, A. X. Jiang, M. Jain, and H. Xu, “Information disclosure as a means to security,” in Proceedings of the 2015 International Conference on Autonomous Agents and Multiagent Systems.   Citeseer, 2015, pp. 645–653.
  • [35] S. Gehlbach and K. Sonin, “Government control of the media,” Journal of public Economics, vol. 118, pp. 163–171, 2014.
  • [36] S. Das, E. Kamenica, and R. Mirka, “Reducing congestion through information design,” in 2017 55th annual allerton conference on communication, control, and computing (allerton).   IEEE, 2017, pp. 1279–1284.
  • [37] D. Duffie, P. Dworczak, and H. Zhu, “Benchmarks in search markets,” The Journal of Finance, vol. 72, no. 5, pp. 1983–2044, 2017.
  • [38] M. Szydlowski, “Optimal financing and disclosure,” Management Science, vol. 67, no. 1, pp. 436–454, 2021.
  • [39] D. Garcia and M. Tsur, “Information design in competitive insurance markets,” Journal of Economic Theory, vol. 191, p. 105160, 2021.
  • [40] B. D. Ziebart, J. A. Bagnell, and A. K. Dey, “Maximum causal entropy correlated equilibria for markov games.” in AAMAS.   Citeseer, 2011, pp. 207–214.
  • [41] O. Hernández-Lerma and J. B. Lasserre, Discrete-time Markov control processes: basic optimality criteria.   Springer Science & Business Media, 2012, vol. 30.
  • [42] W. He and Y. Sun, “Stationary markov perfect equilibria in discounted stochastic games,” Journal of Economic Theory, vol. 169, pp. 35–61, 2017.
  • [43] R. Bellman, “Dynamic programming,” Science, vol. 153, no. 3731, pp. 34–37, 1966.
  • [44] J. Filar and K. Vrieze, “Competitive markov decision processes-theory, algorithms, and applications,” 1997.
  • [45] H. Prasad and S. Bhatnagar, “General-sum stochastic games: Verifiability conditions for nash equilibria,” Automatica, vol. 48, no. 11, pp. 2923–2930, 2012.
  • [46] H. Prasad, P. LA, and S. Bhatnagar, “Two-timescale algorithms for learning nash equilibria in general-sum stochastic games,” in Proceedings of the 2015 International Conference on Autonomous Agents and Multiagent Systems, 2015, pp. 1371–1379.