
Informational Design of Dynamic Multi-Agent System

Tao Zhang and Quanyan Zhu
First Draft: April, 2021
This Draft: June 2021
Electrical and Computer Engineering
New York University
Email: {tz636, qz494}@nyu.edu
Abstract

This work considers a novel information design problem and studies how the design of payoff-relevant environmental signals alone can influence the behaviors of intelligent agents. The agents' strategic interactions are captured by a Markov game, in which each agent first selects one external signal from multiple signal sources as additional payoff-relevant information and then takes an action. There is a rational information designer (principal) who possesses one signal source and aims to influence the equilibrium behaviors of the agents by designing the information structure of the signals she sends to the agents. We propose a direct information design approach that incentivizes each agent to select the signal sent by the principal, so that the design process avoids having to predict the agents' strategic selection behaviors. We then introduce the design protocol given a goal of the designer, which we refer to as obedient implementability (OIL), and characterize OIL in a class of obedient sequential Markov perfect equilibria (O-SMPE). A design regime is proposed based on an approach we refer to as fixed-point alignment, which incentivizes the agents to choose the signal sent by the principal, guarantees that the agents' policy profile of taking actions is the policy component of an O-SMPE, and ensures that the principal's goal is achieved. We then formulate the principal's optimal goal selection problem in terms of information design and characterize the optimization problem by minimizing the fixed-point misalignments. The proposed approach can be applied to elicit desired behaviors of multi-agent systems in competitive as well as cooperative settings and can be extended to heterogeneous stochastic games in complete- and incomplete-information environments.

I Introduction

Building rational multi-agent systems is an important research desideratum in Artificial Intelligence. In goal-directed decision-making systems, an agent's action is controlled by its consequence [1]. In a game, the consequence of an agent's action is the outcome of the game, given as the reward of taking that action together with the actions of his opponents, which situates the optimality criterion of each agent's decision making in the game. A rational agent's reward may also depend on payoff-relevant information in addition to the actions. This information may include the situation of the agents in the game, referred to as the state of the world, as well as his knowledge about his opponents' diverging interests and their preferences over the outcomes of the game. Incorporating such payoff-relevant information in his decisions constitutes an essential part of an agent's rationality in the strategic interactions with his opponents. Hence, one may redirect the goal achievement of rational agents in a game by information provision. In economics, this is referred to as information design, which studies how an information designer (she) can influence agents' optimal behaviors in a game to achieve her own objective through the design of the information provided to the game [2].

Often referred to as inverse game theory, mechanism design is a well-developed mathematical theory in economics that provides general principles of how to design the rules of games (e.g., rewarding systems with specifications of actions and outcomes) to influence the agents' strategic interactions and achieve system-wide goals while treating the information as given. Information design, on the other hand, considers circumstances in which the information in the environment is under the control of the system designer and offers a new approach that indirectly elicits agents' behaviors while keeping the game rules fixed [3].

This work considers an infinite-horizon Markov game with a finite number of agents. Each agent's identity is characterized by his type. At each period of time, the agents observe a payoff-relevant global state (state). In addition to the state, each agent observes a batch of signals (signal batch, batch) at each period and then strategically chooses one signal from the batch as additional information to support his decision of taking an action. Each agent's one-period reward (parameterized by his type) is determined by his own action, the actions of his opponents, the global state, and his choice of signal. We refer to this game as a base Markov game (BMG). The transition of the state and the distribution of the signals are referred to as the information structure of the BMG. In a BMG, each agent's behavior includes selecting a signal according to a selection rule and taking an action according to a policy. Here, each agent's selection of signal and his choice of action are coupled, since the selected signal enters the policy that determines the choice of action. If a mechanism designer aims to incentivize the agents to behave in her desired way, she directly modifies the BMG (reversing the game) by designing the game model, including changing the reward function associated with actions and outcomes, while treating the information structure as a given part of the environment. An information designer, however, treats the BMG model as fixed and modifies the information structure to elicit agents' equilibrium behaviors that coincide with her objective.

We study a novel dynamic information design problem in the BMG in which there are multiple sources of signals (signal sources, sources) and each of them sends one signal to each agent. The signals sent by all sources constitute the signal batch observed by each agent at each time. Among these sources, there is one rational information designer (referred to as the principal, she) who controls one signal source and intends to strategically craft the information structure of her signal by choosing a signaling rule so as to indirectly control the equilibrium of the BMG. We consider that the other sources of signals provide additional information to the agents in a non-strategic, take-it-or-leave-it manner. The goal of the principal is to induce the agents to take actions according to an equilibrium policy that is desired by the principal. However, the principal has no ability to directly program the agents' behaviors to force them to take certain actions. Instead, her information design should provide incentives for rational agents to behave in her favor. We study the extent to which the provision of signals alone, by controlling a single signal source, can influence the agents' behavior in a BMG when the agents have the freedom to choose any available signal in the batch. We refer to the BMG with a rational principal in this setting as an augmented Markov game (AMG).

Since the principal's design problem keeps the base game unchanged, our model fits scenarios in which the agents are intrinsically motivated and their internal reward systems translate information from the external environment into internal reward signals [4]. Intrinsically motivated rational agents can be human decision makers with intrinsic psychological preferences or intelligent agents programmed with internal reward systems. The setting of multiple sources of additional information captures circumstances in which the environment is perturbed by noisy information, so that the agents may improperly use redundant and useless information to make decisions that deviate from the system designer's desire. Also, the principal can be an adversary who aims to manipulate the strategic interactions in a multi-agent system through the provision of disinformation, without intruding into each agent's local system to make any physical or digital modifications.

Although the principal's objective in an AMG is to elicit an equilibrium distribution of actions, her design problem has to take into consideration how the agents select signals from their signal batches, because each agent's choice of action is coupled with his selection of signal. In an information design problem, the principal chooses an information structure such that each agent selects a signal using a selection rule and then takes an action according to a policy that matches the principal's goal. The latter requirement is captured by the notion of admissibility. In general, the signals sent by the principal may not be selected by some agents, so that the actions taken by those agents are independent of the principal's (realized) signals. However, even though her signal does not enter an agent's policy to realize an action, the principal may still influence the agent's action, because the information structure of the signal batch is influenced by her choice of signaling rule. The information structure of the signal batch affects the agents' selections of signals. Hence, the agents' behaviors indirectly depend on the principal's signaling rule. We refer to such information design as indirect information design (IID), which requires the principal to accurately predict each agent's strategic selection rule and the policy profiles that might be induced by the signaling rule. We restrict attention to another class of information design, referred to as direct information design (DID). In DID problems, each agent always selects the signal sent by the principal and then takes an action. Thus, the realizations of the principal's signals directly enter the agents' policies to choose actions. In addition to admissibility, another restriction of the principal's DID problem is captured by the notion of obedience, which requires that each agent is incentivized to select the signal from the principal rather than choose one from other signal sources. The key simplification provided by the DID is that the principal's prediction of the agents' strategic selection rules is replaced by a straightforward obedient selection rule that always prefers the principal's signals.

This paper makes three major contributions to the foundations of information design. First, we define a dynamic direct information design problem in an environment where the agents have the freedom to choose any available signal as additional payoff-relevant information. Captured by the notion of obedient implementability, the principal's information design problem is constrained by the obedience condition, which incentivizes the agents to prefer the signals sent by the principal, and the admissibility condition, which ensures that the agents take actions that meet the principal's goal. Our information design problem is distinguished from others in economics that study the commitment of the information designer in a game when there is only a single source of additional information, in static settings (e.g., [5, 3, 6, 7]) as well as in dynamic environments (e.g., [8, 9, 10, 11, 12, 13]), and from settings in which the agents do not make a choice among multiple designers (e.g., [14]).

Second, we propose a new solution concept termed obedient sequential Markov perfect equilibrium (O-SMPE), which allows us to handle undesirable deviations of agents in a principled manner. By bridging the augmented Markov game model with dynamic programming and uncovering the close relationship between the sequential-perfect rationality of the O-SMPE and a pair of fixed points, we characterize obedient implementability and explicitly formulate the information design regime given a goal of the principal. The proposed framework is based on an approach referred to as fixed-point alignment, which selects a signaling rule that matches the first fixed point, arising from the agents' optimal signal selection, to the second fixed point, arising from their optimal action taking. However, the principal cannot achieve just any goal she wishes. We identify the key conditions, referred to as the Markov perfect goal and strong admissibility, that discipline the principal's freedom to influence the agents' equilibrium behaviors.

Third, we formulate the principal's goal selection problem and transform it into a direct information design problem without a predetermined goal. The principal's problem is thus to select a signaling rule that induces agents' equilibrium policy profiles that maximize (optimal information design) or minimize (robust information design) her expected payoff. Our formulation does not assume that the principal knows all possible equilibrium policy profiles that can be induced by each possible signaling rule, from which she could simply select one for her choice of signaling rule. Instead, our framework takes into consideration the role of the signaling rule in ensuring that the agents converge to an equilibrium. A new approach is proposed based on a condition known as fixed-point misalignment minimization, which captures a new optimality criterion for information design without a given goal.

I-A Related Work

We follow a growing line of research on creating incentives for interacting agents to behave in a desired way. The most straightforward way is based on mechanism design approaches that properly provide reward incentives (e.g., contingent payments, penalties, supply of resources) by directly modifying the game itself to change the induced preferences of the agents over actions. Mechanism design approaches have been fruitfully studied in both static [15] and dynamic environments [16, zhang2021incentive, 17]. For example, auctions [18, 19] specify the way in which the agents can place their bids and clarify how the agents pay for the items; in matching markets [20, 21], matching rules match agents on one side of a market to agents on the other side, which directly affects the payoff of each matched individual. In the reinforcement learning literature, reward engineering [22, 23, 24] is similar to mechanism design in that it directly crafts the reward functions of the agents to specify the learning goal.

Our work lies in another direction: information design. Information design studies how to influence the outcomes of decision making by choosing a signal (also referred to as a signal structure, information structure, Blackwell experiment, or data-generating process) whose realizations are observed by the agents [25]. In a seminal paper [7], Kamenica and Gentzkow introduced Bayesian persuasion, in which there is an informed sender and an uninformed receiver. The sender is able to commit to any probability distribution (i.e., information structure) of the signals as a function of the state of the world, which is payoff-relevant to and unobserved by the receiver. Bayesian persuasion can be interpreted as a communication device that is used by the sender to inform the receiver through signals that contain knowledge about the state of the world. Hence, the sender controls what the agent gets to know about the payoff-relevant state. With knowledge of the information structure, the receiver forms a posterior belief about the unobserved state based on the received signal. Hence, the information design of Bayesian persuasion is also referred to as an exercise in belief manipulation. Other works along the lines of Bayesian persuasion include [26, 27, 28, 29]. In [5], Mathevet et al. extend the single-agent Bayesian persuasion of [7] to a multi-agent game and formulate information design as influencing agents' behaviors through inducing distributions over agents' beliefs. In [6], Bergemann and Morris have also considered information design in games. They formulate the Myersonian approach for information design in an incomplete-information environment. At the core of the Myersonian information design is the notion of Bayes correlated equilibrium, which characterizes all possible Bayesian Nash equilibrium outcomes that could be induced by all available information structures. The Myersonian approach avoids the modeling of belief hierarchies [30] and formulates the information design problem as a linear program. Information design has been applied in a variety of areas to study and improve real-world decision making protocols, including stress tests in finance [31, 32], law enforcement and security [33, 34], censorship [35], routing systems [36], and finance and insurance [37, 38, 39]. Kamenica [25] provides a recent survey of the literature on Bayesian persuasion and information design.

This work fundamentally differs from existing works on information design. First, we consider a different environment. Specifically, we consider a setting in which there are multiple sources of signals and each agent chooses one realized signal as additional (payoff-relevant) information at each time. Among these sources of signals, there is an information designer who controls one of the sources and aims to induce equilibrium outcomes of the incomplete-information Markov game by strategically crafting information structures. Second, rather than only taking actions, each agent in our model makes a coupled decision of selecting a realized signal and taking an action. Hence, the characterization of the solution concepts in our work differs from the equilibrium analysis in other works. Third, we also provide an approach with an explicit formulation for relaxing the optimal information design problem.

In this section, we first describe some fundamental concepts of the canonical Markov game model and then define our new game model, called the augmented Markov game, by extending the canonical model.

Conventions. For any measurable set $X$, $\Delta(X)$ is the set of probability measures over $X$. Any function defined on a measurable set is assumed to be measurable. For any distribution $P\in\Delta(X_{1}\times X_{2})$ over two measurable sets $X_{1}$ and $X_{2}$, $\operatorname{\mathtt{marg}}_{X_{1}}P$ is the marginal distribution of $P$ over $X_{1}$ and $\operatorname{\mathtt{supp}}P$ is the support of $P$. The history of $x\in X$ from period $s$ to period $t$ is denoted by $x^{(t)}_{s}$, with $x^{(t)}$ when $s=0$. We use $\Pr(E)$ to denote the probability of an event $E$. For compactness of notation, we only show the elements, but not the sets, over which sums are taken under the summation operator. The notations are summarized in Appendix LABEL:app:list_of_notations.

I-B Normal-Form Game and Canonical Markov Game

This work considers games with a finite set of agents, denoted by $\mathcal{N}\equiv[n]$, $0<n<\infty$. A normal-form (or strategic-form) game is a basic representation of a static game:

Definition 0.1 (Normal-Form Game [40])

A normal-form game is defined by a tuple $G\equiv<\mathcal{N},\mathcal{A},\{r_{i}\}_{i\in\mathcal{N}}>$. Here, $\mathcal{A}$ is a finite set of actions available to each agent, and $r_{i}:\mathcal{A}^{n}\mapsto\mathbb{R}$ is the reward function of agent $i$.

Each agent $i\in\mathcal{N}$ simultaneously chooses an action $a_{i}\in\mathcal{A}$ and receives a reward $r_{i}(a_{i},\bm{a}_{-i})$ when the other agents choose actions $\bm{a}_{-i}\in\mathcal{A}^{n-1}$.

Figure 1: Canonical Markov game.

With reference to Fig. 1, a canonical Markov game generalizes a normal-form game to dynamic settings, and generalizes Markov decision processes to multi-agent interactions. We consider a finite-agent, infinite-horizon Markov game. The game is played in discrete time indexed by $t=0,1,\dots$. Each agent $i$'s identity is captured by a parameter known as his type, denoted by $\theta_{i}$. A canonical Markov game is defined by a tuple $\widehat{M}[\bm{\theta}]\equiv<\mathcal{N},\mathcal{G},\mathcal{A},d_{g},\mathcal{T}_{g},\{\hat{R}_{i}(\cdot|\theta_{i})\}_{i\in\mathcal{N}}>$. Here, $\mathcal{G}$ is a finite set of states. $\mathcal{A}$ is a finite set of actions available to each agent at each period. $d_{g}\in\Delta(\mathcal{G})$ is an initial distribution of the state. $\mathcal{T}_{g}:\mathcal{G}\times\mathcal{A}^{n}\mapsto\Delta(\mathcal{G})$ is the transition function of the state, such that $\mathcal{T}_{g}(\cdot|g_{t},\bm{a}_{t})\in\Delta(\mathcal{G})$ specifies the probability distribution of the next-period state when the current state is $g_{t}$ and the current-period joint action is $\bm{a}_{t}\equiv(a_{i,t})_{i\in\mathcal{N}}$. Each agent $i$'s goal is characterized by his reward function $\hat{R}_{i}(\cdot|\theta_{i}):\mathcal{G}\times\mathcal{A}^{n}\mapsto\mathbb{R}$, parameterized by $\theta_{i}\in\Theta$, which realizes a one-stage reward $\hat{R}_{i}(g_{t},\bm{a}_{t}|\theta_{i})$ for agent $i$ when the state is $g_{t}$ and the joint action is $\bm{a}_{t}$. A canonical Markov game is a complete-information game.

A solution to $\widehat{M}$ is a sequence of policy profiles $\{\bm{\hat{\pi}}_{t}\}_{t\geq 0}$ in which $\bm{\hat{\pi}}_{t}:\mathcal{G}\mapsto\Delta(\mathcal{A}^{n})$, such that $\bm{\hat{\pi}}_{t}(\cdot|g_{t})$ specifies the probability of the joint action of the agents given the current-period state $g_{t}$. A policy profile is stationary if the agents' choices of actions depend only on the current-period payoff-relevant information and are independent of the calendar time; in that case we denote the policy profile by $\bm{\hat{\pi}}:\mathcal{G}\mapsto\Delta(\mathcal{A}^{n})$. The policy profile is a pure strategy if $\bm{\hat{\pi}}:\mathcal{G}\mapsto\mathcal{A}^{n}$ is a deterministic mapping. In a canonical Markov game, $\bm{\hat{\pi}}$ can be either independent (i.e., $\bm{\hat{\pi}}(\bm{a}_{t}|g_{t})=\prod_{i\in\mathcal{N}}\hat{\pi}_{i}(a_{i,t}|g_{t})$) or correlated (i.e., a joint function).

According to the Ionescu-Tulcea theorem (see, e.g., [41]), the initial distribution $d_{g}$ of $g_{0}$, the transition function $\mathcal{T}_{g}$, and the policy profile $\bm{\hat{\pi}}$ together define a unique probability measure $P^{\bm{\hat{\pi}}}$ on $(\mathcal{G}\times\mathcal{A}^{n})^{\infty}$. We denote the expectation with respect to $P^{\bm{\hat{\pi}}}$ by $\mathbb{E}_{\bm{\hat{\pi}}}[\cdot]$. The optimality criterion for each agent's decision making at each period is to maximize his expected payoff. Each agent discounts his future payoffs by a discount factor $0<\gamma<1$. Thus, agent $i$'s period-$t$ infinite-horizon interim expected payoff is defined as: for $g_{t}\in\mathcal{G}$, $\bm{a}_{t}\in\mathcal{A}^{n}$, $\theta_{i}\in\Theta$, $i\in\mathcal{N}$, $t\geq 0$,

\mathtt{Expr}_{i}(g_{t},\bm{a}_{t};\theta_{i}|\bm{\hat{\pi}},\hat{R}_{i})\equiv\mathbb{E}_{\bm{\hat{\pi}}}\Big[\sum_{s=t}^{\infty}\gamma^{s-t}\hat{R}_{i}(\bm{a}_{s},g_{s}|\theta_{i})\Big|g_{t},\bm{a}_{t}\Big]. \qquad(1)
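As a concrete illustration of $\widehat{M}[\bm{\theta}]$ and the payoff criterion (1), the following Python sketch simulates a small canonical Markov game under a stationary independent policy profile and estimates $\mathtt{Expr}_{i}$ by truncated Monte-Carlo rollouts; the toy sizes and all identifiers (e.g., T_g, R_hat, pi_hat) are assumptions of the example rather than objects defined in the paper.

    import numpy as np

    rng = np.random.default_rng(0)

    # A toy canonical Markov game \widehat{M}: 2 agents, 2 states, 2 actions.
    n_agents, n_states, n_actions, gamma = 2, 2, 2, 0.9

    # Transition function T_g[g, a1, a2] is a distribution over next states.
    T_g = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions, n_actions))

    # Reward R_hat[i, g, a1, a2]: one-stage reward of agent i (types are held fixed).
    R_hat = rng.normal(size=(n_agents, n_states, n_actions, n_actions))

    # Independent stationary policy profile: pi_hat[i, g] is a distribution over actions.
    pi_hat = rng.dirichlet(np.ones(n_actions), size=(n_agents, n_states))

    def expr_i(i, g0, a0, horizon=200, n_rollouts=2000):
        """Monte-Carlo estimate of Expr_i(g_t, a_t) in (1), truncated at `horizon`."""
        total = 0.0
        for _ in range(n_rollouts):
            g, a = g0, a0
            ret = 0.0
            for s in range(horizon):
                ret += gamma ** s * R_hat[i, g, a[0], a[1]]
                g = rng.choice(n_states, p=T_g[g, a[0], a[1]])
                a = tuple(rng.choice(n_actions, p=pi_hat[j, g]) for j in range(n_agents))
            total += ret
        return total / n_rollouts

    print(expr_i(i=0, g0=0, a0=(0, 1)))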
Definition 0.2 (MPE)

A policy profile $\bm{\hat{\pi}}^{*}$ constitutes a stationary Markov perfect equilibrium (MPE) if $\bm{\hat{\pi}}^{*}$ is independent and, for all $a_{i,t}\in\mathcal{A}$ with $\hat{\pi}^{*}_{i}(a_{i,t}|g,\theta_{i})>0$, $a^{\prime}_{i,t}\in\mathcal{A}$, $g\in\mathcal{G}$, $t\geq 0$, $i\in\mathcal{N}$,

\mathbb{E}_{\bm{\hat{\pi}}^{*}_{-i}}\Big[\mathtt{Expr}_{i}(g_{t},a_{i,t},\bm{a}_{-i,t};\theta_{i}|\bm{\hat{\pi}}^{*},\hat{R}_{i})\Big]\geq\mathbb{E}_{\bm{\hat{\pi}}^{*}_{-i}}\Big[\mathtt{Expr}_{i}(g_{t},a^{\prime}_{i,t},\bm{a}_{-i,t};\theta_{i}|\bm{\hat{\pi}}^{*},\hat{R}_{i})\Big]. \qquad(2)

Let $\bm{\hat{h}}\equiv(\bm{a}^{(\infty)},g^{(\infty)})\in\bm{\hat{H}}\equiv\mathcal{A}^{n\times\infty}\times\mathcal{G}^{\infty}$ denote any infinite sequence of action-state pairs. Define $\mathtt{ExR}_{i}(\bm{\hat{h}})\equiv\sum_{t=0}^{\infty}\gamma^{t}\hat{R}_{i}(\bm{a}_{t},g_{t}|\theta_{i})$, for any $\bm{\hat{h}}\in\bm{\hat{H}}$, $\theta_{i}$, $i\in\mathcal{N}$. Let $\bm{\hat{h}}(t)$ denote the first $t$ components of $\bm{\hat{h}}$, for all $t\geq 0$.

Lemma 1 ([maskin2001markov])

For any $0<\gamma<1$, the game $\widehat{M}$ is continuous at infinity; i.e.,

\lim\limits_{T\rightarrow\infty}\ \sup\limits_{\substack{i,\bm{\hat{h}},\bm{\hat{h}}^{\prime}:\\ \bm{\hat{h}}(T-1)=\bm{\hat{h}}^{\prime}(T-1)}}\Big|\mathtt{ExR}_{i}(\bm{\hat{h}})-\mathtt{ExR}_{i}(\bm{\hat{h}}^{\prime})\Big|=0.
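To see why, assume the one-stage rewards are uniformly bounded, say $|\hat{R}_{i}(\cdot|\theta_{i})|\leq\bar{R}<\infty$ (a boundedness assumption added here only for this verification). Two histories that agree on their first $T-1$ components can differ only from period $T-1$ onward, so

\sup\limits_{\substack{i,\bm{\hat{h}},\bm{\hat{h}}^{\prime}:\\ \bm{\hat{h}}(T-1)=\bm{\hat{h}}^{\prime}(T-1)}}\Big|\mathtt{ExR}_{i}(\bm{\hat{h}})-\mathtt{ExR}_{i}(\bm{\hat{h}}^{\prime})\Big|\;\leq\;2\bar{R}\sum_{t=T-1}^{\infty}\gamma^{t}\;=\;\frac{2\bar{R}\,\gamma^{T-1}}{1-\gamma}\;\longrightarrow\;0\quad\text{as }T\rightarrow\infty.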

Lemma 1 shows that for a fixed discount factor $0<\gamma<1$, the canonical Markov game is continuous at infinity. Continuity at infinity is essential for the existence of an MPE in infinite-horizon stochastic games. The following proposition shows the existence of an MPE, due to [maskin2001markov].

Proposition 1.1 ([maskin2001markov])

Suppose that $0<\gamma<1$. Then, the game $\widehat{M}$ admits an MPE.

I-C Augmented Markov Game Model

With reference to Fig. 2, we extend the canonical Markov game model in this section by introducing additional payoff-relevant information for the agents beyond the state.

Signals. We refer to the additional information as signals. Let $\Omega$ be a finite set of signals. We consider that at each period $t$ each agent $i$ observes a batch of $m$ (finite) signals (signal batch, batch), for $m>1$, denoted by $W_{i,t}\equiv\{\omega^{j}_{i,t}\}_{j=1}^{m}\subseteq\Omega^{m}$, for all $i\in\mathcal{N}$, $t\geq 0$, in which $\omega^{j}_{i,t}\in\Omega$ denotes a typical signal indexed by $j\in[m]$ sent to agent $i$ at $t$. Upon observing $W_{i,t}$, each agent $i$ selects one signal $\omega_{i,t}$ from the batch $W_{i,t}$, and $\omega_{i,t}$ becomes payoff-relevant to him in addition to the state $g_{t}$.

Figure 2: Augmented Markov game. (a) shows the external activities of the augmented game from agent $i$'s point of view with a principal, and (b) describes agent $i$'s internal decision-making process. Specifically, each agent $i$ receives a state $g_{t}$ and a signal batch $W_{i,t}=\{\omega^{k}_{i,t},\bm{\omega}^{-k}_{i,t}\}$, in which the signal $\omega^{k}_{i,t}$ is chosen by the principal using the signaling rule $\alpha_{i}$ that depends on the state $g_{t}$. Each agent $i$ first uses $\beta_{i}$ to select one signal $\omega_{i,t}$ from $W_{i,t}$. Taking into account the state $g_{t}$ and his selection $\omega_{i,t}$, he then uses $\pi_{i}$ to take an action $a_{i,t}$. The state $g_{t}$ transitions to a next state $g_{t+1}$ given the agents' joint action $\bm{a}_{t}=\{a_{i,t},\bm{a}_{-i,t}\}$.

Dynamics of Information. Let $\omega^{k}_{i,t}$ denote one signal contained in the batch $W_{i,t}$. Given $\omega^{k}_{i,t}$, the signal batch can be represented as $W_{i,t}=\{\omega^{k}_{i,t},W^{-k}_{i,t}\}$, where $W^{-k}_{i,t}\equiv\{\omega^{j}_{i,t}\}_{j=1,j\neq k}^{m}$, for all $i\in\mathcal{N}$, $t\geq 0$. We assume that the probability of $\omega^{k}_{i,t}$ at any $t$ is specified by $\mathcal{P}^{k}_{i,t}\in\Delta(\Omega)$. Let $\mathcal{P}^{-k}_{i,t}\in\Delta(\Omega^{m-1})$ denote the probability of $W^{-k}_{i,t}$. Here, both $\mathcal{P}^{k}_{i,t}$ and $\mathcal{P}^{-k}_{i,t}$, for all $i\in\mathcal{N}$, may depend on the current state $g_{t}$, the joint type $\bm{\theta}$, the calendar time, or the past realizations of states or actions. Although there is additional information in the game, the transition of the state is controlled by the current state $g_{t}$ and the realized actions $\bm{a}_{t}$, and is independent of the selected signals $\omega_{i,t}$, for all $i\in\mathcal{N}$, $t\geq 0$. As in the canonical game $\widehat{M}$, the transition of the state is Markov: given any $g_{t}\in\mathcal{G}$, $\bm{a}_{t}\in\mathcal{A}^{n}$, the transition function $\mathcal{T}_{g}(g_{t+1}|g_{t},\bm{a}_{t})$ specifies the probability of a next state $g_{t+1}\in\mathcal{G}$, for all $t\geq 0$.

Augmented Markov Game. We refer to the Markov game with signals as the augmented Markov game, denoted by $M$: $M\equiv<\mathcal{N},\mathcal{G},\mathcal{A},\Omega,d_{g},\mathcal{T}_{g},\{\mathcal{P}^{k}_{i,t},\mathcal{P}^{-k}_{i}\}_{i\in\mathcal{N},t\geq 0},\{R_{i}(\cdot|\theta_{i})\}_{i\in\mathcal{N}}>$, where $R_{i}(\cdot|\theta_{i}):\mathcal{A}^{n}\times\mathcal{G}\times\Omega\mapsto\mathbb{R}$ is the ($\theta_{i}$-parameterized) reward function of agent $i$, which takes into consideration agent $i$'s selected signal. A key feature of $M$ is that each agent $i$ makes two sequentially-coupled decisions. Specifically, agent $i$ first selects a signal $\omega_{i,t}$ from the batch $W_{i,t}$, given the state $g_{t}$; then, he takes an action $a_{i,t}$, given $g_{t}$ and his selected signal $\omega_{i,t}$. Hence, at a given period $t$, if the state is $g_{t}$, agent $i$ observes the signal batch $W_{i,t}$, the joint action of the other agents is $\bm{a}_{-i,t}$, and agent $i$ first selects a signal $\omega_{i,t}$ from $W_{i,t}$ and then takes an action $a_{i,t}$, then his single-period reward is $R_{i}(\{a_{i,t},\bm{a}_{-i,t}\},g_{t},\omega_{i,t}|\theta_{i})$. We assume that the game model $M$ is a complete-information game.
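To make the new objects in $M$ concrete, the following Python sketch assembles a signal batch $W_{i,t}$ from the principal's source and the $m-1$ exogenous sources and evaluates the signal-dependent one-period reward; the toy sizes and the names P_k, P_minus_k, and R are illustrative assumptions of the example, not notation from the paper.

    import numpy as np

    rng = np.random.default_rng(1)

    n_agents, n_states, n_actions, n_signals, m = 2, 2, 2, 3, 3
    # Principal's source: P_k[i, g] is a distribution over Omega for agent i in state g.
    P_k = rng.dirichlet(np.ones(n_signals), size=(n_agents, n_states))
    # Exogenous sources: a fixed distribution over Omega (cf. Section II).
    P_minus_k = rng.dirichlet(np.ones(n_signals))
    # Signal-dependent reward R[i, g, omega_i, a1, a2].
    R = rng.normal(size=(n_agents, n_states, n_signals, n_actions, n_actions))

    def draw_batch(i, g):
        """Signal batch W_{i,t}: one signal from the principal plus m-1 exogenous signals."""
        omega_k = rng.choice(n_signals, p=P_k[i, g])
        W_minus_k = [rng.choice(n_signals, p=P_minus_k) for _ in range(m - 1)]
        return omega_k, W_minus_k

    def one_period_reward(i, g, selected_signal, joint_action):
        """R_i({a_i, a_{-i}}, g_t, omega_{i,t} | theta_i) for the signal the agent selected."""
        a1, a2 = joint_action
        return R[i, g, selected_signal, a1, a2]

    omega_k, W_minus_k = draw_batch(i=0, g=0)
    print(one_period_reward(0, 0, omega_k, (1, 0)))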

I-C1 Strategies

Each agent $i$ is rational in the sense that he is self-interested and makes his decisions according to his observation $(g_{t},W_{i,t})$ to maximize his expected payoff. We consider that the solution to the game $M$ is a stationary Markov strategy profile $<\bm{\beta},\bm{\pi}>$, where $\bm{\beta}:\mathcal{G}\times\Theta^{n}\mapsto\bm{W}_{t}$ is a selection rule profile and $\bm{\pi}:\mathcal{G}\times\Omega^{n}\times\Theta^{n}\mapsto\Delta(\mathcal{A}^{n})$ is a policy profile. Similar to $\bm{\hat{\pi}}$ in $\widehat{M}$, each of the profiles $\bm{\beta}$ and $\bm{\pi}$ can be either correlated (i.e., a joint function) or independent (i.e., $\omega_{i,t}=\beta_{i}(g_{t},\theta_{i})$, for all $i\in\mathcal{N}$, and $\bm{\pi}(\bm{a}_{t}|g_{t},\bm{\omega}_{t},\bm{\theta})=\prod_{i\in\mathcal{N}}\pi_{i}(a_{i,t}|g_{t},\omega_{i,t},\theta_{i})$, where $\omega_{i,t}=\beta_{i}(g_{t},\theta_{i})$).

Given any observation $(g_{t},W_{i,t})$ and his type $\theta_{i}$, each agent $i$'s selection of the signal and his choice of the action are fundamentally different. Specifically, the payoff-relevant information for the signal selection is $g_{t}$, i.e., $\bm{\beta}(g_{t},\bm{\theta})\in\bm{W}_{t}$, while the payoff-relevant information for the action taking is $(g_{t},\bm{\omega}_{t})$, i.e., $\bm{\pi}(\bm{a}_{t}|g_{t},\bm{\omega}_{t},\bm{\theta})\in\Delta(\mathcal{A}^{n})$, in which $\bm{\omega}_{t}=\bm{\beta}(g_{t},\bm{\theta})$. However, we will write $\omega_{i,t}=\beta_{i}(g_{t},\theta_{i}|W_{i,t})\in W_{i,t}$ to highlight the influence of the signal batch $W_{i,t}$ (and thus its distribution) on each agent $i$'s decision of selecting a signal. Agent $i$ first uses $\beta_{i}$ to select a signal $\omega_{i,t}=\beta_{i}(g_{t},\theta_{i}|W_{i,t})$ and then chooses an action $a_{i,t}$ according to $\pi_{i}(a_{i,t}|g_{t},\omega_{i,t},\theta_{i})$ based on the realized selection $\omega_{i,t}$.
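The coupled two-stage structure of $\beta_{i}$ and $\pi_{i}$ can be sketched in Python as follows: a signal is first selected from the realized batch, and the action is then drawn from the policy conditioned on the state and the selected signal. The tabular policy and the particular (non-obedient) selection rule below are assumptions made only for illustration.

    import numpy as np

    rng = np.random.default_rng(2)

    n_states, n_actions, n_signals = 2, 2, 3
    # pi_i[g, omega_i] is a distribution over actions (the type theta_i is fixed and omitted).
    pi_i = rng.dirichlet(np.ones(n_actions), size=(n_states, n_signals))

    def beta_i(g, batch):
        """An example (non-obedient) selection rule: pick the largest signal index in the batch."""
        return max(batch)

    def agent_step(g, batch):
        """Coupled decision: select omega_i = beta_i(g | W_i), then draw a_i ~ pi_i(. | g, omega_i)."""
        omega_i = beta_i(g, batch)
        a_i = rng.choice(n_actions, p=pi_i[g, omega_i])
        return omega_i, a_i

    print(agent_step(g=1, batch=[0, 2, 1]))

Any other selection rule, including the obedient one studied later, can be substituted for beta_i without changing this interface.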

II Information Design Problem

In this work, we are interested in the setting in which there is one rational information designer, referred to as the principal (she, indexed by $k$), who controls one of the $m$ signal sources. At time $t$, the signal sent to agent $i$ by the principal is denoted by $\omega^{k}_{i,t}$. We assume that $\bm{\mathcal{P}}^{-k}_{t}\equiv\bm{\mathcal{P}}^{-k}=\{\mathcal{P}^{-k}_{i}\}_{i\in\mathcal{N}}$ is fixed and purely exogenous; i.e., $\bm{\mathcal{P}}^{-k}$ is independent of the state, the types, the actions, the calendar time, and the histories of the game. This can be interpreted as the other $m-1$ sources of signals providing additional information to the agents in a non-strategic, take-it-or-leave-it manner.

We consider that the principal is rational in that she possesses a goal specified in terms of the agents' equilibrium actions and strategically designs the information structure of her signal to achieve that goal. Specifically, the principal aims to induce the agents to take actions that coincide with her goal in equilibrium by strategically choosing $\bm{\mathcal{P}}^{k}_{t}\equiv\{\mathcal{P}^{k}_{i,t}\}_{i\in\mathcal{N}}$ given $\Omega$. Since $\bm{\mathcal{P}}^{k}_{t}$ governs the generation of her signal $\bm{\omega}^{k}_{t}=\{\omega^{k}_{i,t}\}_{i\in\mathcal{N}}$, the principal's choice of $\bm{\mathcal{P}}^{k}_{t}$ partially influences $\bm{\mathcal{P}}_{t}\equiv\{\bm{\mathcal{P}}^{k}_{t},\bm{\mathcal{P}}^{-k}\}$ and the realizations of the signal batch $\bm{W}_{t}=\{\bm{\omega}^{k}_{t},\bm{W}^{-k}_{t}\}$. This process is information design:

Definition 1.1 (Information Design)

An information design problem is defined as a tuple $\mathcal{I}\equiv<M[\bm{\theta}],\bm{\pi},\{\mathcal{P}^{k}_{i,t}\}_{i\in\mathcal{N},t\geq 0},\Omega,\bm{\kappa}>$. Here, $M$ is an augmented Markov game model. $\bm{\pi}$ is the agents' policy profile. $<\mathcal{P}^{k}_{i},\Omega>$ is the information structure, where $\mathcal{P}^{k}_{i}\equiv\{\mathcal{P}^{k}_{i,t}\}_{t\geq 0}$ defines a distribution of the signal $\omega^{k}_{i,t}$ privately observed by agent $i$ at $t$. $\bm{\kappa}(\cdot,\bm{\theta}):\mathcal{G}\mapsto\Delta(\mathcal{A}^{n})$ is the principal's goal, i.e., her target equilibrium probability distribution of the agents' joint action conditioned only on the state and the agents' types.

A solution to $\mathcal{I}$ is a stationary signaling rule profile (signaling rule) $\bm{\alpha}:\mathcal{G}\times\bm{\Theta}\mapsto\Delta(\Omega^{n})$ that defines $\bm{\mathcal{P}}^{k}\equiv\{\mathcal{P}^{k}_{i}\}_{i\in\mathcal{N}}$ of the joint signal $\bm{\omega}^{k}_{t}$. The signaling rule $\bm{\alpha}$ is correlated if it is a joint function, i.e., $\bm{\alpha}(\bm{\omega}^{k}_{t}|g,\bm{\theta})\neq\prod_{i\in\mathcal{N}}\alpha_{i}(\omega^{k}_{i,t}|g,\bm{\theta})$, where $\alpha_{i}(\omega^{k}_{i,t}|g,\bm{\theta})\equiv\sum_{\bm{\omega}^{k}_{-i,t}}\bm{\alpha}(\bm{\omega}^{k}_{t}|g,\bm{\theta})$. The rule $\bm{\alpha}$ is independent if the signals the principal sends to the agents are independent of each other, i.e., $\bm{\alpha}(\bm{\omega}^{k}_{t}|g,\bm{\theta})=\prod_{i\in\mathcal{N}}\alpha_{i}(\omega^{k}_{i,t}|g,\bm{\theta})$. Since the agents use Markov strategies, agent $i$'s period-$t$ action $a_{i,t}$ depends on the histories $g^{(t)}$ and $\omega^{k;(t)}_{i}$ (via the selection rule $\beta_{i}$) only through the current-period $g_{t}$ and $\omega^{k}_{i,t}$. Hence, we restrict attention to a Markovian signaling rule $\bm{\alpha}$ that specifies the distribution of the period-$t$ signal depending on the current state $g_{t}$ and the joint type $\bm{\theta}$. We will denote the game $M$ in which the principal uses $\bm{\alpha}$ by $M[\bm{\alpha}|\bm{\theta}]$.
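As a small illustration, for a tabular joint signaling rule with two agents and a fixed joint type, the marginal $\alpha_{i}$ and the independence property can be computed directly; the array layout below is an assumption of the sketch.

    import numpy as np

    rng = np.random.default_rng(3)

    n_states, n_signals = 2, 3
    # Joint signaling rule: alpha[g] is a distribution over (omega_1^k, omega_2^k).
    alpha = rng.dirichlet(np.ones(n_signals * n_signals), size=n_states)
    alpha = alpha.reshape(n_states, n_signals, n_signals)

    def marginal(alpha, g, agent):
        """alpha_i(omega_i^k | g, theta) = sum over the other agent's signals of the joint rule."""
        return alpha[g].sum(axis=1 - agent)

    def is_independent(alpha, g, tol=1e-9):
        """The rule is independent at g if the joint factorizes into its marginals."""
        a1, a2 = marginal(alpha, g, 0), marginal(alpha, g, 1)
        return np.allclose(alpha[g], np.outer(a1, a2), atol=tol)

    print(marginal(alpha, g=0, agent=0), is_independent(alpha, g=0))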

The information design problem is a planning problem. Hence, the design of $\bm{\alpha}$ is independent of any realizations of states. Additionally, the principal does not know in advance all the possible equilibria that could be induced by any of her available signaling rules. Therefore, the principal's information design has to take into account how the signaling rule can induce agents' behaviors that constitute an equilibrium. If the information design is viewed as an extensive-form game between the principal and the agents, the timing is as follows:

  • (i)

    The principal chooses a signaling rule profile $\bm{\alpha}$ for the agents.

  • (ii)

    At the beginning of each period $t$, a state $g_{t}$ is realized and observed by the principal and all agents.

  • (iii)

    The principal sends a joint signal $\bm{\omega}^{k}_{t}$ according to $\bm{\alpha}$. Each agent $i$ receives $W_{i,t}=\{\omega^{k}_{i,t},W^{-k}_{i,t}\}$ and observes $\bm{W}_{-i,t}\equiv\{\omega^{k}_{j,t},W^{-k}_{j,t}\}_{j\neq i}$. Here, $\bm{W}^{-k}_{t}\equiv\{W^{-k}_{i,t},\bm{W}^{-k}_{-i,t}\}$ is generated according to $\big(\mathcal{P}^{-k}_{t}\big)^{n}$.

  • (iv)

    The agents use $\bm{\beta}$ to select signals $\bm{\omega}_{t}$ from $\bm{W}_{t}$.

  • (v)

    Then, the agents use $\bm{\pi}$ to choose their actions from $\mathcal{A}^{n}$ based on $g_{t}$ and $\bm{\omega}_{t}$.

  • (vi)

    Immediate rewards are realized and the state $g_{t}$ transitions to $g_{t+1}$ according to $\mathcal{T}_{g}$ (a simulation sketch of one period of this timing is given after the list).
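As an illustration of steps (ii)-(vi) for a signaling rule fixed in step (i), the following Python sketch simulates one period of the augmented Markov game with two agents; the independent per-agent signaling rule, the placeholder selection rule, and all array names (alpha, P_minus_k, pi, T_g, R) are assumptions of the example rather than constructs defined in the paper.

    import numpy as np

    rng = np.random.default_rng(4)

    n_agents, n_states, n_actions, n_signals, m = 2, 2, 2, 3, 3
    alpha = rng.dirichlet(np.ones(n_signals), size=(n_agents, n_states))   # step (i): independent signaling rule
    P_minus_k = rng.dirichlet(np.ones(n_signals))                          # fixed exogenous sources
    pi = rng.dirichlet(np.ones(n_actions), size=(n_agents, n_states, n_signals))
    T_g = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions, n_actions))
    R = rng.normal(size=(n_agents, n_states, n_signals, n_actions, n_actions))

    def beta(i, g, omega_k_i, W_minus_k_i):
        # Placeholder selection rule; the obedient rule of Section III-B would return omega_k_i.
        return omega_k_i

    def one_period(g):
        # (iii) the principal sends omega^k_t; the exogenous part of each batch is drawn from P^{-k}
        omega_k = [rng.choice(n_signals, p=alpha[i, g]) for i in range(n_agents)]
        W_minus_k = [[rng.choice(n_signals, p=P_minus_k) for _ in range(m - 1)]
                     for _ in range(n_agents)]
        # (iv) agents select signals and (v) take actions
        omega = [beta(i, g, omega_k[i], W_minus_k[i]) for i in range(n_agents)]
        a = [rng.choice(n_actions, p=pi[i, g, omega[i]]) for i in range(n_agents)]
        # (vi) rewards realize and the state transitions according to T_g
        rewards = [R[i, g, omega[i], a[0], a[1]] for i in range(n_agents)]
        g_next = rng.choice(n_states, p=T_g[g, a[0], a[1]])
        return rewards, g_next

    print(one_period(g=0))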

II-A Equilibrium Concepts

In this section, we define a stationary equilibrium concept for the game $M[\bm{\alpha}|\bm{\theta}]$. With a slight abuse of notation, we suppress the notations of $\mathcal{P}^{-k}_{i}$, $W^{-k}_{i,t}$, and $\bm{W}^{-k}_{t}$ and only show $\mathcal{P}^{k}_{i,t}$, $\omega^{k}_{i,t}$ (of $W_{i,t}$), and $\bm{\omega}^{k}_{t}$ (of $\bm{W}_{t}$), for all $i\in\mathcal{N}$, $t\geq 0$, unless otherwise stated. Since we focus on a stationary environment, we also suppress the time indices from the notations, unless otherwise stated.

Similar to the canonical game $\widehat{M}$, the Ionescu-Tulcea theorem (see, e.g., [41]) implies that the initial distribution $d_{g}$ of the state, the transition function $\mathcal{T}_{g}$, the distribution $\bm{\mathcal{P}}^{-k}$, the signaling rule $\bm{\alpha}$, and the strategy profile $<\bm{\beta},\bm{\pi}>$ together define a unique probability measure $P^{\bm{\alpha},\bm{\beta}}_{\bm{\pi}}$ on $(\mathcal{G}\times\Omega^{m\times n}\times\mathcal{A}^{n})^{\infty}$. The expectation with respect to $P^{\bm{\alpha},\bm{\beta}}_{\bm{\pi}}$ is denoted by $\mathbb{E}^{\bm{\beta},\bm{\alpha}}_{\bm{\pi}}[\cdot]$ or $\mathbb{E}^{\bm{\beta},\bm{\alpha}}_{\bm{\pi}}[\cdot|\cdot]$. With a slight abuse of notation, let $\mathcal{T}^{\bm{\beta},\bm{\pi}}_{g^{\prime},g}(\bm{\omega};\bm{\omega}^{k})$ denote the transition probability from state $g$ to state $g^{\prime}$, given that the signal batch is $\bm{\omega}^{k}$ and the agents select $\bm{\omega}=\bm{\beta}(g,\bm{\theta}|\bm{\omega}^{k})$: for any $\bm{\omega}^{k}\in\Omega^{n}$ with $\bm{\alpha}(\bm{\omega}^{k}|g,\bm{\theta})>0$,

\mathcal{T}^{\bm{\beta},\bm{\pi}}_{g^{\prime},g}(\bm{\omega};\bm{\omega}^{k})\equiv\sum_{\bm{a}}\bm{\pi}\big(\bm{a}|g,\bm{\omega},\bm{\theta}\big)\mathcal{T}_{g}(g^{\prime}|g,\bm{a}).

Let $\mathcal{T}^{\bm{\alpha},\bm{\beta},\bm{\pi}}_{g^{\prime},g}=\sum_{\bm{\omega}^{k}}\mathcal{T}^{\bm{\beta},\bm{\pi}}_{g^{\prime},g}(\bm{\omega};\bm{\omega}^{k})\bm{\alpha}(\bm{\omega}^{k}|g,\bm{\theta})$. Given $\bm{\alpha}$, $\bm{\beta}$, $\bm{\pi}$, define the state-signal value function $V^{\bm{\alpha},\bm{\beta},\bm{\pi}}_{i}$ of agent $i$, representing agent $i$'s expected reward originating at some $g,\bm{\omega}^{k}\in\mathcal{G}\times\Omega^{n}$ with $\bm{\alpha}(\bm{\omega}^{k}|g,\bm{\theta})>0$, when the agents select $\bm{\omega}=\bm{\beta}(g,\bm{\theta}|\bm{\omega}^{k})$:

V^{\bm{\alpha},\bm{\beta},\bm{\pi}}_{i}(g,\bm{\omega};\bm{\omega}^{k}|\bm{\theta})\equiv\sum_{t=0}^{\infty}\sum_{g^{\prime}}\gamma^{t}\big(\mathcal{T}^{\bm{\beta},\bm{\pi}}_{g^{\prime},g}(\bm{\omega};\bm{\omega}^{k})\big)^{t}\sum_{\bm{a}^{\prime},\bm{\omega}^{k^{\prime}}}\bm{\pi}(\bm{a}^{\prime}|g^{\prime},\bm{\omega}^{\prime},\bm{\theta})\,\bm{\alpha}(\bm{\omega}^{k^{\prime}}|g^{\prime},\bm{\theta})\,R_{i}(\bm{a}^{\prime},g^{\prime},\omega^{\prime}_{i}|\theta_{i}), \qquad(3)

where $\bm{\omega}^{\prime}=\bm{\beta}(g^{\prime},\bm{\theta}|\bm{\omega}^{k^{\prime}})$.

Define the state value function $J^{\bm{\alpha},\bm{\beta},\bm{\pi}}_{i}$ of agent $i$, which describes his expected reward originating at any state $g\in\mathcal{G}$:

J^{\bm{\alpha},\bm{\beta},\bm{\pi}}_{i}(g|\bm{\theta})\equiv\sum_{t=0}^{\infty}\sum_{g^{\prime}}\gamma^{t}\big(\mathcal{T}^{\bm{\alpha},\bm{\beta},\bm{\pi}}_{g^{\prime},g}\big)^{t}\sum_{\bm{a}^{\prime},\bm{\omega}^{k^{\prime}}}\bm{\pi}(\bm{a}^{\prime}|g^{\prime},\bm{\beta}(g^{\prime},\bm{\theta}|\bm{\omega}^{k^{\prime}}),\bm{\theta})\,\bm{\alpha}(\bm{\omega}^{k^{\prime}}|g^{\prime},\bm{\theta})\,R_{i}(\bm{a}^{\prime},g^{\prime},\beta_{i}(g^{\prime},\theta_{i}|\omega^{k^{\prime}}_{i})|\theta_{i}). \qquad(4)

Define the state-signal-action value function $Q^{\bm{\alpha},\bm{\beta},\bm{\pi}}_{i}$, which represents agent $i$'s expected reward if $(\bm{\omega},\bm{a})\in\Omega^{n}\times\mathcal{A}^{n}$ are played in $(g,\bm{\omega}^{k})\in\mathcal{G}\times\Omega^{n}$:

Q^{\bm{\alpha},\bm{\beta},\bm{\pi}}_{i}(\bm{a},g,\omega_{i};\omega^{k}_{i}|\bm{\theta})\equiv R_{i}(\bm{a},g,\omega_{i}|\theta_{i})+\gamma\sum_{g^{\prime}}\mathcal{T}_{g}(g^{\prime}|g,\bm{a})\Big(\sum_{s=0}^{\infty}\sum_{g^{\prime\prime}}\gamma^{s}\big(\mathcal{T}^{\bm{\alpha},\bm{\beta},\bm{\pi}}_{g^{\prime\prime},g^{\prime}}\big)^{s}\sum_{\bm{a}^{\prime\prime},\bm{\omega}^{k^{\prime\prime}}}\bm{\pi}(\bm{a}^{\prime\prime}|g^{\prime\prime},\bm{\omega}^{\prime\prime},\bm{\theta})\,\bm{\alpha}(\bm{\omega}^{k^{\prime\prime}}|g^{\prime\prime},\bm{\theta})\,R_{i}(\bm{a}^{\prime\prime},g^{\prime\prime},\omega^{\prime\prime}_{i}|\theta_{i})\Big), \qquad(5)

where $\bm{\omega}^{\prime\prime}=\bm{\beta}(g^{\prime\prime},\bm{\theta}|\bm{\omega}^{k^{\prime\prime}})$.
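For tabular $\bm{\pi}$, $\mathcal{T}_{g}$, and $\bm{\alpha}$, the kernels $\mathcal{T}^{\bm{\beta},\bm{\pi}}_{g^{\prime},g}(\bm{\omega};\bm{\omega}^{k})$ and $\mathcal{T}^{\bm{\alpha},\bm{\beta},\bm{\pi}}_{g^{\prime},g}$ that enter (3)-(5) can be computed as below; for brevity the sketch takes two agents and the obedient selection $\bm{\omega}=\bm{\omega}^{k}$, and the tensor layout is an assumption of the example.

    import numpy as np

    rng = np.random.default_rng(5)

    n_states, n_actions, n_signals = 2, 2, 3
    # Correlated policy profile pi[g, w1, w2] -> distribution over joint actions (a1, a2).
    pi = rng.dirichlet(np.ones(n_actions * n_actions),
                       size=(n_states, n_signals, n_signals)).reshape(
                       n_states, n_signals, n_signals, n_actions, n_actions)
    T_g = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions, n_actions))
    # Joint signaling rule alpha[g] -> distribution over (w1^k, w2^k).
    alpha = rng.dirichlet(np.ones(n_signals * n_signals),
                          size=n_states).reshape(n_states, n_signals, n_signals)

    def kernel_given_signals(g, w):
        """T^{beta,pi}_{g',g}(omega; omega^k): average T_g over the joint action drawn from pi."""
        w1, w2 = w
        return np.einsum('ab,abh->h', pi[g, w1, w2], T_g[g])  # distribution over g'

    def kernel_averaged(g):
        """T^{alpha,beta,pi}_{g',g}: additionally average over the principal's joint signal."""
        out = np.zeros(n_states)
        for w1 in range(n_signals):
            for w2 in range(n_signals):
                out += alpha[g, w1, w2] * kernel_given_signals(g, (w1, w2))
        return out

    print(kernel_averaged(g=0))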

We define an equilibrium concept known as sequential Markov perfect equilibrium (SMPE) as follows.

Definition 1.2 (SMPE)

Fix any signaling rule $\bm{\alpha}$. A strategy profile $<\bm{\beta}^{*},\bm{\pi}^{*}>$ constitutes a (stationary) sequential Markov perfect equilibrium (SMPE) if (i) the policy profile is independent (i.e., $\bm{\pi}^{*}(\bm{a}|g,\bm{\omega},\bm{\theta})=\prod_{i\in\mathcal{N}}\pi^{*}_{i}(a_{i}|g,\omega_{i},\theta_{i})$) and (ii) the agents are sequential-perfectly rational (sequential-perfect rationality), i.e., for any $g\in\mathcal{G}$, $\vec{\beta}^{\prime}_{i}\equiv\{\beta^{\prime}_{i,\tau}\}_{\tau\geq 0}$, $i\in\mathcal{N}$,

J^{\bm{\alpha},\bm{\beta}^{*},\bm{\pi}^{*}}_{i}(g|\bm{\theta})\geq J^{\bm{\alpha},\vec{\beta}^{\prime}_{i},\bm{\beta}^{*}_{-i},\bm{\pi}^{*}}_{i}(g|\bm{\theta}), \qquad(6)

and, for any $\omega^{k}_{i}\in\Omega$ with $\alpha_{i}(\omega^{k}_{i}|g,\bm{\theta})>0$, $\omega^{*}_{i}=\beta^{*}_{i}(g,\theta_{i}|\omega^{k}_{i})$, $\vec{\pi}^{\prime}_{i}\equiv\{\pi^{\prime}_{i,\tau}\}_{\tau\geq 0}$,

\mathbb{E}^{\bm{\beta}^{*}}_{\bm{\omega}^{k}_{-i}\sim\bm{\alpha}_{-i}}\Big[V^{\bm{\alpha},\bm{\beta}^{*},\bm{\pi}^{*}}_{i}(g,\omega^{*}_{i},\bm{\omega}^{*}_{-i};\omega^{k}_{i},\bm{\omega}^{k}_{-i}|\bm{\theta})\Big]\geq\mathbb{E}^{\bm{\beta}^{*}}_{\bm{\omega}^{k}_{-i}\sim\bm{\alpha}_{-i}}\Big[V^{\bm{\alpha},\bm{\beta}^{*},\vec{\pi}^{\prime}_{i},\bm{\pi}^{*}_{-i}}_{i}(g,\omega^{*}_{i},\bm{\omega}^{*}_{-i};\omega^{k}_{i},\bm{\omega}^{k}_{-i}|\bm{\theta})\Big]. \qquad(7)

An SMPE extends the stationary Markov perfect equilibrium (see, e.g., [42]) to our augmented Markov game $M[\bm{\alpha}|\bm{\theta}]$. The sequential-perfect rationality describes the coupled sequential best responses of each agent's selection and action given a state and the available signal batch. In words, a strategy profile $<\bm{\beta}^{*},\bm{\pi}^{*}>$ is sequential-perfectly rational if (i) given that the agents choose actions according to the equilibrium policy profile $\bm{\pi}^{*}$, there is no state $g\in\mathcal{G}$ such that once it is reached, the agents strictly prefer to deviate from $\bm{\beta}^{*}$; and (ii) there is no information set $(g,\{\bm{\omega}^{*},\bm{W}^{-k}\})\in\mathcal{G}\times\Omega^{m\times n}$, where $\bm{\omega}^{*}$ is selected by $\bm{\beta}^{*}$, such that once it is reached, the agents strictly prefer to deviate from $\bm{\pi}^{*}$.

The concept of SMPE in Definition 1.2, however, permits arbitrarily complex and possibly nonstationary deviations from the (stationary) equilibrium profile $<\bm{\beta}^{*},\bm{\pi}^{*}>$. The following lemma states that it entails no loss of generality to consider only one-shot deviations from $<\bm{\beta}^{*},\bm{\pi}^{*}>$.

Lemma 2

Let $\vec{\beta}^{\prime}_{i}\circ 0\equiv\{\beta^{\prime}_{i,\tau}\}_{\tau\geq 0}$ and $\vec{\pi}^{\prime}_{i}\circ 0\equiv\{\pi^{\prime}_{i,\tau}\}_{\tau\geq 0}$, respectively, be such that $\beta^{\prime}_{i,\tau}=\beta^{*}_{i}$ and $\pi^{\prime}_{i,\tau}=\pi^{*}_{i}$, for all $\tau\geq 1$, while $\beta^{\prime}_{i,0}$ and $\pi^{\prime}_{i,0}$ are any two arbitrary strategies. A strategy profile $<\bm{\beta}^{*},\bm{\pi}^{*}>$ constitutes a sequential-perfectly rational equilibrium profile of an SMPE if and only if, for any $g_{0}\in\mathcal{G}$, $i\in\mathcal{N}$,

J^{\bm{\beta}^{*},\bm{\pi}^{*},\bm{\alpha}}_{i}(g_{0}|\theta_{i})\geq J^{\vec{\beta}^{\prime}_{i}\circ 0,\bm{\beta}^{*}_{-i},\bm{\pi}^{*},\bm{\alpha}}_{i}(g_{0}|\theta_{i}), \qquad(8)

and, for any $\omega^{k}_{i}\in\Omega$ with $\alpha_{i}(\omega^{k}_{i}|g_{0},\bm{\theta})>0$, $\omega^{*}_{i}=\beta^{*}_{i}(g_{0},\theta_{i}|\omega^{k}_{i})$, $i\in\mathcal{N}$, $\pi^{\prime}_{i,0}$,

V^{\bm{\beta}^{*},\bm{\pi}^{*},\bm{\alpha}}_{i}(g_{0},\omega^{*}_{i,0};\omega^{k}_{i,0}|\theta_{i})\geq V^{\bm{\beta}^{*},\vec{\pi}^{\prime}_{i}\circ 0,\bm{\pi}^{*}_{-i},\bm{\alpha};\mu_{i}}_{i}(g_{0},\omega^{*}_{i,0};\omega^{k}_{i,0}|\theta_{i}). \qquad(9)

A one-shot deviation is a behavior in which agent $i$ deviates from the equilibrium profile $<\bm{\beta}^{*},\bm{\pi}^{*}>$ by selecting a signal using any $\beta^{\prime}_{i,0}$ and taking an action using any $\pi^{\prime}_{i,0}$ at the initial period of any subgame of $M[\bm{\alpha}|\bm{\theta}]$, and then reverting to his equilibrium strategies for the rest of the game. The one-shot deviation property in Lemma 2 allows the principal to restrict attention, without loss of generality, to an equilibrium characterization that considers only the robustness of the information design to one-shot deviations.

III SMPE Implementability

In this section, we formally formulate the principal's information design problem. The optimality criterion of successful information design is captured by the notion of implementability, which is characterized via the equilibrium concept of SMPE.

III-A Implementability

As in a canonical Markov game, each agent $i$'s decision of choosing an action $a_{i}$ takes into account the other agents' decisions of choosing $\bm{a}_{-i}$, because his immediate reward of taking $a_{i}$ directly depends on $\bm{a}_{-i}$. In an augmented Markov game $M[\bm{\alpha}|\bm{\theta}]$, agent $i$'s choices of $\beta_{i}$ and $\pi_{i}$ are coupled because the signal $\omega_{i}$ specified by $\beta_{i}$ has a direct causal effect on $a_{i}$ through $\pi_{i}$. Thus, each agent's immediate reward indirectly depends on the other agents' selected signals through their actions. Hence, the agents' selection of signals is also a part of the strategic interactions in $M[\bm{\alpha}|\bm{\theta}]$. Since $\bm{\mathcal{P}}^{-k}$ is fixed, the principal's choice of $\bm{\alpha}$ controls the dynamics of $\bm{W}_{t}$ given the strategy profiles. Therefore, it is possible for the principal to influence the equilibrium behaviors of the agents in $M[\bm{\alpha}|\bm{\theta}]$ through proper designs of $\bm{\alpha}$.

The principal's information design takes an objective-first approach: she designs the information structure (given $\Omega$) of the signals sent to the agents, toward a desired objective $\bm{\kappa}$, in a strategic setting through the design of $\bm{\alpha}$, where the self-interested agents act rationally by choosing $\bm{\beta}$ and $\bm{\pi}$. Although any realization of the signal depends on the current state, the choice of $\bm{\alpha}$ is independent of the realizations of the states. The key restriction on the principal's $\bm{\alpha}$ is that the agents are elicited to perform equilibrium actions that coincide with the principal's goal, which is referred to as admissibility.

Definition 2.1 (Admissibility)

Fix any $\bm{\kappa}$ and $\bm{\alpha}$. Let $<\bm{\beta}^{*},\bm{\pi}^{*}>$ be any SMPE of the game $M[\bm{\alpha}|\bm{\theta}]$. The policy profile $\bm{\pi}^{*}$ is admissible if, for all $g\in\mathcal{G}$,

\bm{\kappa}(\bm{a}|g,\bm{\theta})=\sum_{\bm{\omega}^{k}}\bm{\pi}^{*}\big(\bm{a}|g,\bm{\beta}^{*}(g,\bm{\theta}|\bm{\omega}^{k}),\bm{\theta}\big)\bm{\alpha}(\bm{\omega}^{k}|g,\bm{\theta}). \qquad(10)

Admissibility imposes a constraint on the signaling rule $\bm{\alpha}$ and the induced policy profile $\bm{\pi}^{*}$ such that the goal $\bm{\kappa}$ is achieved in the sense of (10). We define a strong version of admissibility as follows.

Definition 2.2 (Strong Admissibility)

Fix any $\bm{\kappa}$ and $\bm{\alpha}$. Let $<\bm{\beta}^{*},\bm{\pi}^{*}>$ be any SMPE of the game $M[\bm{\alpha}|\bm{\theta}]$. The policy profile $\bm{\pi}^{*}$ is strongly admissible if, for all $g\in\mathcal{G}$, $\omega^{k}_{i}\in\Omega$, $i\in\mathcal{N}$,

\bm{\kappa}(\bm{a}|g,\bm{\theta})\sum\limits_{\bm{\omega}^{k}_{-i}}\bm{\alpha}(\omega^{k}_{i},\bm{\omega}^{k}_{-i}|g,\bm{\theta})\,\bm{\beta}^{*}(g,\bm{\theta}|\omega^{k}_{i},\bm{\omega}^{k}_{-i})=\sum\limits_{\bm{\omega}^{k}_{-i}}\bm{\pi}^{*}\big(\bm{a}|g,\bm{\beta}^{*}(g,\bm{\theta}|\omega^{k}_{i},\bm{\omega}^{k}_{-i}),\bm{\theta}\big)\bm{\alpha}(\bm{\omega}^{k}|g,\bm{\theta}). \qquad(11)

The strong version additionally constrains $\bm{\alpha}$ and $\bm{\pi}^{*}$ by imposing an individual-level constraint on the signaling rule. It is straightforward to verify that strong admissibility implies admissibility, but not vice versa.
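Numerically, condition (10) is a linear consistency check between $\bm{\kappa}$, $\bm{\pi}^{*}$, and $\bm{\alpha}$. The sketch below performs this check for tabular objects under the obedient selection $\bm{\beta}^{*}(g,\bm{\theta}|\bm{\omega}^{k})=\bm{\omega}^{k}$ (used only to keep the example short); the goal kappa here is constructed to be admissible by definition, and all array layouts are assumptions of the sketch.

    import numpy as np

    rng = np.random.default_rng(6)

    n_states, n_actions, n_signals = 2, 2, 3
    alpha = rng.dirichlet(np.ones(n_signals * n_signals),
                          size=n_states).reshape(n_states, n_signals, n_signals)
    pi = rng.dirichlet(np.ones(n_actions * n_actions),
                       size=(n_states, n_signals, n_signals)).reshape(
                       n_states, n_signals, n_signals, n_actions, n_actions)

    def induced_action_distribution(g):
        """sum_{omega^k} pi(a | g, beta(g|omega^k)) alpha(omega^k | g), with beta obedient."""
        return np.einsum('wv,wvab->ab', alpha[g], pi[g])

    # A goal kappa that is admissible by construction (here: the induced distribution itself).
    kappa = np.stack([induced_action_distribution(g) for g in range(n_states)])

    def is_admissible(kappa, tol=1e-9):
        return all(np.allclose(kappa[g], induced_action_distribution(g), atol=tol)
                   for g in range(n_states))

    print(is_admissible(kappa))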

The success criterion of information design for the game $M[\bm{\alpha}|\bm{\theta}]$ in equilibrium is captured by the notion of implementability.

Definition 2.3 (SMPE Implementability)

Given any $\bm{\kappa}$, we say that the signaling rule $\bm{\alpha}$ is (strongly) implementable in SMPE (SMPE implementability) if $M[\bm{\alpha}|\bm{\theta}]$ has an SMPE $<\bm{\beta}^{*},\bm{\pi}^{*}>$ in which $\bm{\pi}^{*}$ is (strongly) admissible.

SMPE implementability requires that (i) the signaling rule $\bm{\alpha}$ designed by the principal induces an SMPE of $M[\bm{\alpha}|\bm{\theta}]$ and (ii) the principal's goal is achieved (i.e., the equilibrium policy profile is admissible or strongly admissible).

Given any $\bm{\mathcal{P}}^{-k}$, the distribution of the action $\bm{a}$ conditioned on any state $g$ is jointly determined by the agents' $\bm{\beta}$ and the principal's $\bm{\alpha}$. Hence, given $\bm{\mathcal{P}}^{-k}$, the signal $\bm{\omega}^{k}$ sent by the principal using $\bm{\alpha}$ ultimately influences each agent's expected reward. However, this information is transmitted indirectly through the agents' selection rules when the selected signal $\bm{\omega}=\bm{\beta}(g,\bm{\theta}|\{\bm{\omega}^{k},\bm{W}^{-k}\})$ is not equal to $\bm{\omega}^{k}$. We refer to the design of an $\bm{\alpha}$ that induces such selection rules in an SMPE of $M[\bm{\alpha}|\bm{\theta}]$ as indirect information design (IID). We call the game with such an $\bm{\alpha}$ an indirect augmented Markov game, denoted by $M^{-D}[\bm{\alpha}]$.

III-B Direct Information Design

As a designer, the principal takes into consideration how each agent strategically behaves according to the game rules and reacts to his opponents' behaviors as well as to the responses from the environment. In any $M^{-D}[\bm{\alpha}]$, the principal's design of $\bm{\alpha}$ must predict the agents' equilibrium selection rule profile $\bm{\beta}$ and the corresponding equilibrium policy profile $\bm{\pi}$ that might be induced by $\bm{\alpha}$. In contrast to $M^{-D}[\bm{\alpha}]$, the principal may elect to perform a direct information design, in which the principal makes her signals payoff-relevant to each agent's decision of choosing an action by incentivizing each agent to select her signal at each state. We refer to the game with such a signaling rule as a direct augmented Markov game, denoted by $M^{D}[\bm{\alpha}]$. The implementability of the direct information design requires a restriction known as obedience in addition to admissibility.

Definition 2.4 (Obedience)

In any $M^{D}[\bm{\alpha}]$, agent $i$'s selection rule $\beta_{i}$ is dominant-strategy obedient (DS-obedient, DS-obedience) if, for any $(g,\{\omega^{k}_{i},W^{-k}_{i}\})\in\mathcal{G}\times\Omega^{m}$ with $\alpha_{i}(\omega^{k}_{i}|g,\bm{\theta})>0$, $i\in\mathcal{N}$,

\beta_{i}(g_{t},\theta_{i}|\{\omega^{k}_{i,t},W^{-k}_{i,t}\})=\omega^{k}_{i,t}. \qquad(12)

Agent $i$'s selection rule $\beta_{i}$ is Bayesian obedient (Bayesian obedience) if, for any $(g,\{\omega^{k}_{i},W^{-k}_{i}\})\in\mathcal{G}\times\Omega^{m}$ with $\alpha_{i}(\omega^{k}_{i}|g,\bm{\theta})>0$, $i\in\mathcal{N}$,

\beta_{i}(g_{t},\theta_{i}|\{\omega^{k}_{i,t},W^{-k}_{i,t}\})=\omega^{k}_{i,t}, \qquad(13)

when all other agents are obedient, i.e., $\bm{\beta}_{-i}(g,\bm{\theta}_{-i}|\bm{\omega}^{k}_{-i})=\bm{\omega}^{k}_{-i}$.

We will refer to an SMPE with an obedient selection rule profile as a (dominant-strategy or Bayesian) obedient SMPE (O-SMPE).

Definition 2.5 (OIL)

Given any goal $\bm{\kappa}$, the signaling rule $\bm{\alpha}$ is (DS, Bayesian) obedient-implementable in SMPE (OIL) if it induces a (DS, Bayesian) O-SMPE $<\bm{\beta},\bm{\pi}>$ in which $\bm{\beta}$ is (DS, Bayesian) obedient and $\bm{\pi}$ is admissible.

In an $M^{D}[\bm{\alpha}|\bm{\theta}]$, the principal wants $\bm{\omega}^{k}_{t}$ to directly enter the immediate rewards of the agents at each period $t$. OIL guarantees that (i) the agents prefer to select the signal $\bm{\omega}^{k}$ sent by the principal rather than choose any other signals from $\bm{W}^{-k}$, and (ii) the agents take actions specified by the admissible policy rather than other available actions. Hereafter, we denote the obedient selection rules by $\bm{\beta}^{O}$ and $\beta^{O}_{i}$, for all $i\in\mathcal{N}$.

A successful information design depends on the principal having accurate beliefs regarding the agents' decision processes. This includes all the possible indirect selection behaviors of the agents, i.e., all possible $\bm{\beta}\neq\bm{\beta}^{O}$. The point of direct information design is that it allows the principal to avoid analyzing all of the agents' indirect selection behaviors and focus on the obedient $\bm{\beta}^{O}$.
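In code, the obedient selection rule is trivial, which is exactly the simplification direct information design buys: the principal no longer has to model how an agent would rank the signals inside his batch. A minimal sketch:

    def beta_obedient(g, theta_i, omega_k_i, W_minus_k_i):
        """beta_i^O: regardless of the state, the type, or the other signals in the batch,
        select the signal sent by the principal."""
        return omega_k_i

What remains for the principal is to verify that this rule is indeed a best response (obedience) and that the induced policy profile is admissible, which is the content of the OIL characterization below.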

III-C Characterizing OIL

In this section, we characterize OIL and formulate the principal's information design problem given a goal $\bm{\kappa}$.

The following proposition is an analog of Bellman’s Theorem [43].

Proposition 2.1

Given a game $M[\bm{\alpha}|\bm{\theta}]$, for any stationary $<\bm{\beta},\bm{\pi}>$, any $V_{i}:\mathcal{G}\times\Omega\times\Omega\mapsto\mathbb{R}$, any $J_{i}:\mathcal{G}\mapsto\mathbb{R}$, and any $Q_{i}:\mathcal{G}\times\Omega\times\mathcal{A}^{n}\times\Omega\mapsto\mathbb{R}$, we have $V_{i}=V^{\bm{\alpha},\bm{\beta},\bm{\pi}}_{i}$ in (3), $J_{i}=J^{\bm{\alpha},\bm{\beta},\bm{\pi}}_{i}$ in (4), and $Q_{i}=Q^{\bm{\alpha},\bm{\beta},\bm{\pi}}_{i}$ in (5) if and only if the following Bellman recursions are satisfied:

Vi(g,𝝎;𝝎k|𝜽)=𝒂𝝅(𝒂|g,𝝎,𝜽)Qi(𝒂,g,ωi;ωik|𝜽),\displaystyle V_{i}(g,\bm{\omega};\bm{\omega}^{k}|\bm{\theta})=\sum_{\bm{a}}\bm{\pi}(\bm{a}|g,\bm{\omega},\bm{\theta})Q_{i}(\bm{a},g,\omega_{i};\omega^{k}_{i}|\bm{\theta}), (14)
Ji(g|𝜽)=𝝎k𝜶(𝝎k|g,𝜽)Vi(g,𝜷(g,𝜽|𝝎k);𝝎k|𝜽),\displaystyle J_{i}(g|\bm{\theta})=\sum_{\bm{\omega}^{k}}\bm{\alpha}(\bm{\omega}^{k}|g,\bm{\theta})V_{i}(g,\bm{\beta}(g,\bm{\theta}|\bm{\omega}^{k});\bm{\omega}^{k}|\bm{\theta}), (15)
Qi(𝒂,g,ωi;ωik|𝜽)\displaystyle Q_{i}(\bm{a},g,\omega_{i};\omega^{k}_{i}|\bm{\theta}) (16)
=Ri(𝒂,g,ωi|θi)+γg𝒯g(g|g,𝒂)Ji(g|𝜽).\displaystyle=R_{i}(\bm{a},g,\omega_{i}|\theta_{i})+\gamma\sum_{g^{\prime}}\mathcal{T}_{g}(g^{\prime}|g,\bm{a})J_{i}(g^{\prime}|\bm{\theta}).

From Proposition 2.1, we can re-define V^{\bm{\alpha},\bm{\beta},\bm{\pi}}_{i}, J^{\bm{\alpha},\bm{\beta},\bm{\pi}}_{i}, and Q^{\bm{\alpha},\bm{\beta},\bm{\pi}}_{i} given in (3)-(5), respectively, as the unique triple of value functions that satisfies the Bellman recursions (14), (15), and (16).
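For concreteness, the following minimal numerical sketch (in Python) illustrates how the recursions (14)-(16) can be evaluated for a fixed signaling rule, obedient selection, and stationary policy. It collapses the multi-agent structure to a single representative agent with a fixed type; all array names and sizes (nG, nW, nA, and so on) are illustrative assumptions rather than objects of the model.

```python
import numpy as np

rng = np.random.default_rng(0)
nG, nW, nA = 4, 3, 2                             # states g, signals w, actions a
gamma = 0.9

R = rng.normal(size=(nA, nG, nW))                # reward R(a, g, w)
T = rng.dirichlet(np.ones(nG), size=(nG, nA))    # transition T(g' | g, a)
alpha = rng.dirichlet(np.ones(nW), size=nG)      # signaling rule alpha(w | g)
pi = rng.dirichlet(np.ones(nA), size=(nG, nW))   # policy pi(a | g, w)

J = np.zeros(nG)
for _ in range(1000):                            # iterate (14)-(16) to a fixed point
    Q = R + gamma * np.einsum('gak,k->ag', T, J)[..., None]   # (16): Q(a, g, w)
    V = np.einsum('gwa,agw->gw', pi, Q)                       # (14): V(g, w)
    J_new = np.einsum('gw,gw->g', alpha, V)                   # (15): J(g), obedient selection
    if np.max(np.abs(J_new - J)) < 1e-12:
        J = J_new
        break
    J = J_new
print("state values J(g):", np.round(J, 3))
```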

Lemma 3

Fix 𝛂\bm{\alpha}. Let 𝛃O={βiO}i𝒩\bm{\beta}^{O}=\{\beta^{O}_{i}\}_{i\in\mathcal{N}} denote the obedient selection rule profile. The strategy profile <𝛃O,𝛑><\bm{\beta}^{O},\bm{\pi}^{*}> is a (DS, Bayesian) O-SMPE if and only if, for any g𝒢g\in\mathcal{G}, ωikΩ\omega^{k}_{i}\in\Omega with αi(ωik|g,𝛉)>0\alpha_{i}(\omega^{k}_{i}|g,\bm{\theta})>0, i𝒩i\in\mathcal{N},

Vi𝜶,𝜷O,𝝅(g,𝝎k|𝜽)\displaystyle V^{\bm{\alpha},\bm{\beta}^{O},\bm{\pi}^{*}}_{i}(g,\bm{\omega}^{k}|\bm{\theta}) (17)
=maxai𝔼𝒂i𝝅i[Q𝜷O,𝜶;μi(ai,𝒂i,g;ωik|θi)],\displaystyle=\max_{a^{\prime}_{i}}\mathbb{E}_{\bm{a}_{-i}\sim\bm{\pi}^{*}_{-i}}\Big{[}Q^{\bm{\beta}^{O},\bm{\alpha};\mu_{i}}(a^{\prime}_{i},\bm{a}_{-i},g;\omega^{k}_{i}|\theta_{i})\Big{]},
  • (i)

    and 𝜷O\bm{\beta}^{O} is DS-obedient, i.e., if, for any arbitrary selection rule profiles 𝜷^i\bm{\hat{\beta}}_{-i}, any g𝒢g\in\mathcal{G}, i𝒩i\in\mathcal{N},

    Ji𝜶,βiO,𝜷^i,𝝅(g|𝜽)=maxωi𝝎ik𝜶i(𝝎ik|g,𝜽)\displaystyle J^{\bm{\alpha},\beta^{O}_{i},\bm{\hat{\beta}}_{-i},\bm{\pi}^{*}}_{i}(g|\bm{\theta})=\max_{\omega^{\prime}_{i}}\sum_{\bm{\omega}^{k}_{-i}}\bm{\alpha}_{-i}(\bm{\omega}^{k}_{-i}|g,\bm{\theta}) (18)
    ×Vi𝜶,𝜷O,𝝅(g,ωi,𝜷^i(gt,𝜽i|𝝎ik);𝝎k|𝜽);\displaystyle\times V^{\bm{\alpha},\bm{\beta}^{O},\bm{\pi}^{*}}_{i}(g,\omega^{\prime}_{i},\bm{\hat{\beta}}_{-i}(g_{t},\bm{\theta}_{-i}|\bm{\omega}^{k}_{-i});\bm{\omega}^{k}|\bm{\theta});
  • (ii)

    or, 𝜷O\bm{\beta}^{O} is Bayesian-obedient, i.e., if, for any g𝒢g\in\mathcal{G}, i𝒩i\in\mathcal{N},

    Ji𝜶,𝜷,𝝅(g|𝜽)\displaystyle J^{\bm{\alpha},\bm{\beta}^{*},\bm{\pi}^{*}}_{i}(g|\bm{\theta}) (19)
    =maxωi𝝎ik𝜶i(𝝎ik|g,𝜽)Vi𝜶,𝜷O,𝝅(g,ωi,𝝎ik;𝝎k|𝜽).\displaystyle=\max_{\omega^{\prime}_{i}}\sum_{\bm{\omega}^{k}_{-i}}\bm{\alpha}_{-i}(\bm{\omega}^{k}_{-i}|g,\bm{\theta})V^{\bm{\alpha},\bm{\beta}^{O},\bm{\pi}^{*}}_{i}(g,\omega^{\prime}_{i},\bm{\omega}^{k}_{-i};\bm{\omega}^{k}|\bm{\theta}).

In Lemma 3, (17)-(18) (resp. (17) and (19)) constitute a recursive representation of a DS O-SMPE (resp. Bayesian O-SMPE). Jointly, (17)-(18) require that (i) the expected payoff of each agent i is maximized by his equilibrium policy \pi^{*}_{i} in every state and for every signal selected according to any possible equilibrium selection rule \bar{\beta}_{i}, (ii) the expected payoff of each agent i is maximized by the obedient selection rule \beta^{O}_{i} in every state when actions are taken according to any possible corresponding equilibrium policy \bar{\pi}_{i}, and (iii) \bar{\beta}_{i}=\beta^{O}_{i} and \bar{\pi}_{i}=\pi^{*}_{i}, given that his opponents may use arbitrary selection rules but follow the O-SMPE policy profile. A similar interpretation applies to the Bayesian O-SMPE, given that each agent i's opponents follow the Bayesian O-SMPE strategy profile.
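As a complement to Lemma 3, the sketch below (reusing the arrays from the previous sketch, again for one representative agent) checks the Bayesian-obedience condition (19): the expected value of acting on the received signal must weakly dominate acting on any fixed alternative signal. For a randomly drawn signaling rule and policy the check will typically fail; it is meant only to illustrate how the condition is evaluated.

```python
# V2[g, u, k]: value of acting on signal u while k is the payoff-relevant signal drawn
# by the principal; built from the recursions (16) and (14) with J from the sketch above.
Q = R + gamma * np.einsum('gak,k->ag', T, J)[..., None]   # Q(a, g, k) via (16)
V2 = np.einsum('gua,agk->guk', pi, Q)                     # V(g, u; k) via (14)
deviation = np.einsum('gk,guk->gu', alpha, V2)            # E_k[ V(g, u; k) ] for each reported u
obedient = np.einsum('gk,gkk->g', alpha, V2)              # u = k, i.e., obedient selection
print("Bayesian obedient:", bool(np.all(obedient + 1e-9 >= deviation.max(axis=1))))
```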

III-C1 Design Regime: Fixed-Point Alignment

In this section, we restrict attention to Bayesian obedience and formulate an information design regime for a given goal. The following theorem states the existence of a Bayesian O-SMPE.

Theorem 4

Every augmented Markov game M[𝛂|𝛉]M[\bm{\alpha}|\bm{\theta}] admits a stationary Bayesian O-SMPE for any regular 𝛂\bm{\alpha}.

We provide an approach, which we refer to as fixed-point alignment, to formulate the design of a Bayesian OIL signaling rule \bm{\alpha} as a planning problem. First, we define a class of principal's goals which we refer to as Markov perfect goals. Let, for all i\in\mathcal{N},

Ri𝜶(𝒂,g|θi)ωikRi(𝒂,g,ωik|θi)αi(ωik|g,𝜽).\displaystyle R^{\bm{\alpha}}_{i}(\bm{a},g|\theta_{i})\equiv\sum\limits_{\omega_{i}^{k}}R_{i}(\bm{a},g,\omega^{k}_{i}|\theta_{i})\alpha_{i}(\omega^{k}_{i}|g,\bm{\theta}).
Definition 4.1 (Markov Perfect Goal)

We say that the principal's goal \bm{\kappa} is a Markov perfect goal if it is an MPE of the canonical Markov game with R^{\bm{\alpha}}_{i} as the reward function of each agent i; i.e., for all \bm{a}_{t}=\{a_{i,t},\bm{a}_{-i,t}\}\in\mathcal{A}^{n} with \bm{\kappa}(\bm{a}_{t}|g_{t})>0, a^{\prime}_{i,t}\in\mathcal{A}, g_{t}\in\mathcal{G}, t\geq 0, i\in\mathcal{N},

𝔼𝜿i[𝙴𝚡𝚙𝚛i\displaystyle\mathbb{E}_{\bm{\kappa}_{-i}}\Big{[}\mathtt{Expr}_{i} (gt,ai,t,𝒂i,t;θi|𝜿,Ri𝜶)]\displaystyle(g_{t},a_{i,t},\bm{a}_{-i,t};\theta_{i}|\bm{\kappa},R^{\bm{\alpha}}_{i})\Big{]}
𝔼𝜿i[𝙴𝚡𝚙𝚛i(gt,ai,t,𝒂i,t;θi|𝜿,Ri𝜶)],\displaystyle\geq\mathbb{E}_{\bm{\kappa}_{-i}}\Big{[}\mathtt{Expr}_{i}(g_{t},a^{\prime}_{i,t},\bm{a}_{-i,t};\theta_{i}|\bm{\kappa},R^{\bm{\alpha}}_{i})\Big{]},

where \mathtt{Expr}_{i} is defined in (1). Let \mathtt{MPG}[\bm{R}] denote the set of Markov perfect goals given the reward functions \bm{R}\equiv\{R_{i}\}_{i\in\mathcal{N}}.
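The marginalized reward R^{\bm{\alpha}}_{i} defined above, and the discounted value of a candidate goal in the induced canonical Markov game, can be computed as in the following sketch (single representative agent, arrays reused from the earlier sketches); the candidate goal kappa is a random placeholder, and no MPE verification is attempted here.

```python
# R_alpha[a, g] = sum_w R(a, g, w) * alpha(w | g), the marginalized reward above.
R_alpha = np.einsum('agw,gw->ag', R, alpha)
kappa = rng.dirichlet(np.ones(nA), size=nG)      # candidate goal kappa(a | g), placeholder
J_kappa = np.zeros(nG)
for _ in range(500):                             # policy evaluation of kappa in the canonical game
    J_kappa = np.einsum('ga,ag->g', kappa, R_alpha) + \
              gamma * np.einsum('ga,gak,k->g', kappa, T, J_kappa)
print("discounted value of goal kappa:", np.round(J_kappa, 3))
```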

For any state value function JiJ_{i}, any 𝒂𝒜n\bm{a}\in\mathcal{A}^{n}, g𝒢g\in\mathcal{G}, ωi,ωikΩ\omega_{i},\omega^{k}_{i}\in\Omega, we denote Qi(𝒂,g,ωi;ωik|𝜽;Ji)Q_{i}(\bm{a},g,\omega_{i};\omega^{k}_{i}|\bm{\theta};J_{i}) as the state-signal-action value function constructed in terms of JiJ_{i} according to (16). Similarly, we denote Ji(g|𝜽;Vi)J_{i}(g|\bm{\theta};V_{i}) as the state value function constructed in terms of ViV_{i} according to (15). Given any policy profile 𝝅i\bm{\pi}_{-i} of other agents and any state value function JiJ_{i}, we let, for all ai𝒜a_{i}\in\mathcal{A}, g𝒢g\in\mathcal{G}, 𝝎k={ωik,𝝎ik}Ωn\bm{\omega}^{k}=\{\omega^{k}_{i},\bm{\omega}^{k}_{-i}\}\in\Omega^{n}, 𝜽={θi,𝜽i}Θn\bm{\theta}=\{\theta_{i},\bm{\theta}_{-i}\}\in\Theta^{n}, i𝒩i\in\mathcal{N},

Qi𝝅i(ai,g,ωi,𝝎ik;𝝎k|𝜽;Ji)\displaystyle Q^{\bm{\pi}_{-i}}_{i}(a_{i},g,\omega_{i},\bm{\omega}^{k}_{-i};\bm{\omega}^{k}|\bm{\theta};J_{i}) (20)
𝔼𝒂i𝝅i(|g,𝝎ik,𝜽i)[Qi(ai,𝒂i,g,ωi;ωik|𝜽,Ji)].\displaystyle\equiv\mathbb{E}_{\bm{a}_{-i}\sim\bm{\pi}_{-i}(\cdot|g,\bm{\omega}^{k}_{-i},\bm{\theta}_{-i})}\Big{[}Q_{i}(a_{i},\bm{a}_{-i},g,\omega_{i};\omega^{k}_{i}|\bm{\theta},J_{i})\Big{]}.

Similarly, for any state-signal value function ViV_{i}, we denote Qi𝜶(;Vi)Q^{\bm{\alpha}}_{i}(\cdot;V_{i}) as the value function constructed in terms of ViV_{i} according to the Bellman recursions (14)-(16), i.e., for all 𝒂𝒜n\bm{a}\in\mathcal{A}^{n}, g𝒢g\in\mathcal{G}, ωikΩ\omega^{k}_{i}\in\Omega with αi(ωik|g,𝜽)>0\alpha_{i}(\omega^{k}_{i}|g,\bm{\theta})>0, i𝒩i\in\mathcal{N},

Qi𝜶(𝒂,g;ωik|𝜽;Vi)=Ri(𝒂,g,ωik|θi)\displaystyle Q^{\bm{\alpha}}_{i}(\bm{a},g;\omega^{k}_{i}|\bm{\theta};V_{i})=R_{i}(\bm{a},g,\omega^{k}_{i}|\theta_{i})
+γg,𝝎k𝒯g(g|g,𝒂)𝜶(𝝎k|g,𝜽)Vi(g;𝝎k|𝜽).\displaystyle+\gamma\sum_{g^{\prime},\bm{\omega}^{k^{\prime}}}\mathcal{T}_{g}(g^{\prime}|g,\bm{a})\bm{\alpha}(\bm{\omega}^{k^{\prime}}|g^{\prime},\bm{\theta})V_{i}(g^{\prime};\bm{\omega}^{k^{\prime}}|\bm{\theta}).

The Bayesian OIL restricts the design of the signaling rule \bm{\alpha} by three constraints. First, agents are incentivized to be Bayesian obedient in their signal selections. Second, the Bayesian-obedient agents' policy profile constitutes the policy component of a Bayesian O-SMPE. Third, the equilibrium policy profile is admissible. The first two constraints require \bm{\alpha} to elicit a Bayesian O-SMPE from the agents.

Proposition 4.1 provides a general formulation of an optimization problem to find the policy profile of a Bayesian O-SMPE of a game M[𝜶|𝜽]M[\bm{\alpha}|\bm{\theta}].

Proposition 4.1

Fix a signaling rule 𝛂\bm{\alpha}. Let 𝛃O\bm{\beta}^{O} denote the Bayesian obedient selection rule profile. Let 𝐕={Vi}i𝒩\bm{V}^{*}=\{V^{*}_{i}\}_{i\in\mathcal{N}} denote the corresponding state-signal value function of a policy profile 𝛑\bm{\pi}^{*}. The strategy profile <𝛃O,𝛑><\bm{\beta}^{O},\bm{\pi}^{*}> constitutes a Bayesian O-SMPE if and only if <𝛑,𝐕><\bm{\pi}^{*},\bm{V}^{*}> is a solution of the following optimization problem with 𝐙(𝛑,𝐕;𝛂)=0\bm{Z}(\bm{\pi}^{*},\bm{V}^{*};\bm{\alpha})=0:

min𝝅,𝑽𝒁(𝝅,𝑽;𝜶)\displaystyle\min\limits_{\bm{\pi},\bm{V}}\bm{Z}(\bm{\pi},\bm{V};\bm{\alpha}) (𝙾𝚙𝚝\mathtt{Opt})
i,g,𝝎kVi(g;𝝎k|θi)𝔼𝒂𝝅(|𝝎k)[Qi𝜶(𝒂,g;ωik|𝜽;Vi)],\displaystyle\equiv\sum_{i,g,\bm{\omega}^{k}}V_{i}(g;\bm{\omega}^{k}|\theta_{i})-\mathbb{E}_{\bm{a}\sim\bm{\pi}(\cdot|\bm{\omega}^{k})}\Big{[}Q^{\bm{\alpha}}_{i}(\bm{a},g;\omega^{k}_{i}|\bm{\theta};V_{i})\Big{]},

subject to, for all i𝒩i\in\mathcal{N}, g𝒢g\in\mathcal{G}, 𝛚kΩn\bm{\omega}^{k}\in\Omega^{n} with 𝛂(𝛚k|g,𝛉)>0\bm{\alpha}(\bm{\omega}^{k}|g,\bm{\theta})>0, ai𝒜a^{\prime}_{i}\in\mathcal{A}, ωiΩ\omega^{\prime}_{i}\in\Omega,

πi(ai|g,ωi,θi)0,ai𝒜πi(ai|g,ωi,θi)=1,\displaystyle\pi_{i}(a_{i}|g,\omega_{i},\theta_{i})\geq 0,\sum_{a_{i}\in\mathcal{A}}\pi_{i}(a_{i}|g,\omega_{i},\theta_{i})=1, (𝚁𝙶i\mathtt{RG}_{i})
Vi(g;𝝎k|𝜽)𝔼𝒂i𝝅i(|𝝎ik)[Qi𝜶(ai,𝒂i,g;ωik|𝜽;Vi)],\begin{split}&V_{i}(g;\bm{\omega}^{k}|\bm{\theta})\\ &\geq\mathbb{E}_{\bm{a}_{-i}\sim\bm{\pi}_{-i}(\cdot|\bm{\omega}^{k}_{-i})}\Big{[}Q^{\bm{\alpha}}_{i}(a^{\prime}_{i},\bm{a}_{-i},g;\omega^{k}_{i}|\bm{\theta};V_{i})\Big{]},\end{split} (𝙵𝙴i\mathtt{FE}_{i})
Ji𝜶,𝜷O,𝝅(g|𝜽;Vi)\displaystyle J^{\bm{\alpha},\bm{\beta}^{O},\bm{\pi}}_{i}(g|\bm{\theta};V_{i}) (𝙱𝙾𝙱0i\mathtt{BOB}0_{i})
𝝎ik𝜶i(𝝎ik|g,𝜽)Vi(g,ωi,𝝎ik;𝝎k|𝜽).\displaystyle\geq\sum_{\bm{\omega}^{k}_{-i}}\bm{\alpha}_{-i}(\bm{\omega}^{k}_{-i}|g,\bm{\theta})V_{i}(g,\omega^{\prime}_{i},\bm{\omega}^{k}_{-i};\bm{\omega}^{k}|\bm{\theta}).

Proposition 4.1 extends the fundamental formulation of finding a Nash equilibrium of a stochastic game as a nonlinear program (Theorem 3.8.2 of [44]; see also [45, 46]). Here, the condition (\mathtt{RG}_{i}) ensures that each \pi_{i} is a valid policy and rules out the trivial solution \pi_{i}=0 for all i\in\mathcal{N}. The constraints (\mathtt{FE}_{i}) and (\mathtt{BOB}0_{i}) are two necessary conditions for a Bayesian O-SMPE of the game M[\bm{\alpha}|\bm{\theta}], derived from (17) and (19) of Lemma 3. Any feasible solution <\bm{\pi},\bm{V}> with \bm{Z}(\bm{\pi},\bm{V};\bm{\alpha})=0 constitutes a Bayesian O-SMPE (in which admissibility is not constrained). The Bayesian obedient selection rule profile \bm{\beta}^{O} is not itself part of the solution of the optimization problem (\mathtt{Opt})-(\mathtt{BOB}0_{i}); instead, the optimality of Bayesian obedience constrains the optimal solution through (\mathtt{BOB}0_{i}).
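The sketch below illustrates, for one representative agent and the arrays built earlier, how the objective \bm{Z}(\bm{\pi},\bm{V};\bm{\alpha}) of (\mathtt{Opt}) is assembled. The candidate value function V_fe is taken to be the evaluation fixed point of the given policy, so the objective vanishes by construction; the constraints (\mathtt{FE}_{i}) and (\mathtt{BOB}0_{i}) are not checked here.

```python
def Q_alpha(V):
    # Q^alpha(a, g, w) = R(a,g,w) + gamma * sum_{g',w'} T(g'|g,a) alpha(w'|g') V(g',w')
    cont = np.einsum('gak,kw,kw->ga', T, alpha, V)
    return R + gamma * cont.T[:, :, None]

V_fe = np.zeros((nG, nW))
for _ in range(500):                                        # evaluation fixed point of pi
    V_fe = np.einsum('gwa,agw->gw', pi, Q_alpha(V_fe))
Z = np.sum(V_fe - np.einsum('gwa,agw->gw', pi, Q_alpha(V_fe)))
print("Z(pi, V; alpha) =", round(float(Z), 10))             # ~0 at the evaluation fixed point
```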

If we suppress the constraint (\mathtt{BOB}0_{i}), the reduced optimization problem (\mathtt{Opt})-(\mathtt{FE}_{i}) can be interpreted as a search for pairs of decision variables <\pi_{i},V_{i}> that fit a Bellman optimality operator (i.e., satisfy (17)). In other words, the goal of this reduced problem is to find fixed points. However, there is another fixed point, associated with the Bellman optimality operator established by the condition (19) and the Bellman recursions (14)-(16); i.e., supposing the agents are Bayesian obedient, for all g\in\mathcal{G}, i\in\mathcal{N},

Ji(g|𝜽)=maxωi𝝎ik,ai𝜶i(𝝎ik|g,𝜽)πi(ai|g,ωik,θi)\displaystyle J_{i}(g|\bm{\theta})=\max\limits_{\omega^{\prime}_{i}}\sum_{\bm{\omega}^{k}_{-i},a_{i}}\bm{\alpha}_{-i}(\bm{\omega}^{k}_{-i}|g,\bm{\theta})\pi_{i}(a_{i}|g,\omega^{k}_{i},\theta_{i}) (21)
×Qi𝝅i(ai,g;𝝎k|𝜽;Ji).\displaystyle\times Q^{\bm{\pi}_{-i}}_{i}(a_{i},g;\bm{\omega}^{k}|\bm{\theta};J_{i}).

Note that the Bellman optimality equation (21) does not involve V_{i} directly; it is nevertheless constructed from the relationship between J_{i} and V_{i} given in (15).

We propose a design regime for finding a signaling rule 𝜶\bm{\alpha} that aligns two fixed points JiJ^{*}_{i} and ViV^{*}_{i} for each agent ii while each agent’s policy is strongly admissible. Let, for any g𝒢g\in\mathcal{G}, ωikΩ\omega^{k}_{i}\in\Omega with αi(ωik|g,𝜽)>0\alpha_{i}(\omega^{k}_{i}|g,\bm{\theta})>0, ωiΩ\omega_{i}\in\Omega, i𝒩i\in\mathcal{N},

Vi𝜶i(g,ωi;ωik|𝜽;Vi)\displaystyle V^{\bm{\alpha}_{-i}}_{i}(g,\omega_{i};\omega^{k}_{i}|\bm{\theta};V_{i})
𝝎ik𝜶i(𝝎ik|g,𝜽)Vi(g,ωi,𝝎ik;ωik,𝝎ik|𝜽),\displaystyle\equiv\sum\limits_{\bm{\omega}^{k}_{-i}}\bm{\alpha}_{-i}(\bm{\omega}^{k}_{-i}|g,\bm{\theta})V_{i}(g,\omega_{i},\bm{\omega}^{k}_{-i};\omega^{k}_{i},\bm{\omega}^{k}_{-i}|\bm{\theta}),

with Vi𝜶i(g;ωik|𝜽;Vi)=Vi𝜶i(g,ωik;ωik|𝜽;Vi)V^{\bm{\alpha}_{-i}}_{i}(g;\omega^{k}_{i}|\bm{\theta};V_{i})=V^{\bm{\alpha}_{-i}}_{i}(g,\omega^{k}_{i};\omega^{k}_{i}|\bm{\theta};V_{i}). The objective function would be

𝒁FPA(𝜶,𝑱,𝑽;𝜽)\displaystyle\bm{Z}^{\text{FPA}}(\bm{\alpha},\bm{J},\bm{V};\bm{\theta}) (22)
i,g(Ji(g|𝜽)ωikαi(ωik|g,𝜽)Vi𝜶i(g;ωik|𝜽)),\displaystyle\equiv\sum\limits_{i,g}\Big{(}J_{i}(g|\bm{\theta})-\sum\limits_{\omega^{k}_{i}}\alpha_{i}(\omega^{k}_{i}|g,\bm{\theta})V^{\bm{\alpha}_{-i}}_{i}(g;\omega^{k}_{i}|\bm{\theta})\Big{)},

which is to be minimized over all possible \bm{\alpha}, \bm{J}, and \bm{V}. Given \bm{\alpha} and \bm{\kappa}, we denote by \text{AD}[\bm{\alpha},\bm{\kappa}] the set of valid policy profiles that are admissible. For example, under strong admissibility, the set \text{AD}[\bm{\alpha},\bm{\kappa}] is given as

AD[𝜶,𝜿]\displaystyle\text{AD}[\bm{\alpha},\bm{\kappa}]\equiv {𝝅:i𝒩,(RGi),\displaystyle\Big{\{}\bm{\pi}:\forall i\in\mathcal{N},\text{(\ref{eq:regular_policy})}, (23)
𝜿(𝒂|g,𝜽)𝝎~i𝜶(ωi,𝝎~i|g,𝜽)\displaystyle\bm{\kappa}(\bm{a}|g,\bm{\theta})\sum\limits_{\bm{\tilde{\omega}}_{-i}}\bm{\alpha}(\omega_{i},\bm{\tilde{\omega}}_{-i}|g,\bm{\theta})
=𝝎~i𝝅(𝒂|g,ωi,𝝎~i,𝜽)𝜶(ωi,𝝎~i)}.\displaystyle=\sum\limits_{\bm{\tilde{\omega}}_{-i}}\bm{\pi}(\bm{a}|g,\omega_{i},\bm{\tilde{\omega}}_{-i},\bm{\theta})\bm{\alpha}(\omega_{i},\bm{\tilde{\omega}}_{-i})\Big{\}}.
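The following small sketch evaluates, for one representative agent and the arrays built earlier, (i) the strong-admissibility condition defining \text{AD}[\bm{\alpha},\bm{\kappa}] above (the policy must reproduce the goal wherever the signaling rule puts positive probability) and (ii) the alignment objective \bm{Z}^{\text{FPA}} in (22). The quantities are computed on random placeholder data and merely illustrate the computations involved.

```python
# (i) strong admissibility: kappa(a|g) * alpha(w|g) == pi(a|g,w) * alpha(w|g) for all (g,w,a)
strongly_admissible = np.allclose(kappa[:, None, :] * alpha[:, :, None],
                                  pi * alpha[:, :, None])
# (ii) Z^FPA in (22), with the single-agent collapse V^{alpha_{-i}}(g; w) -> V_fe(g, w);
# in this collapsed setting J and V_fe describe the same fixed point, so Z^FPA is ~0 here.
Z_fpa = float(np.sum(J - np.einsum('gw,gw->g', alpha, V_fe)))
print("strongly admissible:", strongly_admissible, "| Z^FPA =", round(Z_fpa, 6))
```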

The Bayesian obedience is constrained by, for all g𝒢g\in\mathcal{G}, ωiΩ\omega_{i}\in\Omega, ωikΩ\omega^{k}_{i}\in\Omega with αi(ωik|g,𝜽)>0\alpha_{i}(\omega^{k}_{i}|g,\bm{\theta})>0, i𝒩i\in\mathcal{N},

Vi𝜶i(g,ωi;ωik|𝜽;Vi)ωik𝜶(ωik|g,𝜽)Vi𝜶i(g;ωik|𝜽;Vi).\displaystyle V^{\bm{\alpha}_{-i}}_{i}(g,\omega_{i};\omega^{k}_{i}|\bm{\theta};V_{i})\leq\sum\limits_{\omega^{k}_{i}}\bm{\alpha}({\omega}^{k}_{i}|g,\bm{\theta})V^{\bm{\alpha}_{-i}}_{i}(g;\omega^{k}_{i}|\bm{\theta};V_{i}). (𝙱𝙾𝙱𝟷i\mathtt{BOB1}_{i})

The feasibility of V_{i} given a \bm{\pi} is captured by the constraint (\mathtt{FE}_{i}). We additionally constrain the feasibility of J_{i} in terms of V_{i} as follows: for all g\in\mathcal{G}, \omega^{k}_{i}\in\Omega with \alpha_{i}(\omega^{k}_{i}|g,\bm{\theta})>0, i\in\mathcal{N},

Ji(g|𝜽)𝝎ik𝜶i(𝝎ik|g,𝜽)Vi(g,ωik,𝝎ik|𝜽).\displaystyle J_{i}(g|\bm{\theta})\geq\sum\limits_{\bm{\omega}^{k}_{-i}}\bm{\alpha}_{-i}(\bm{\omega}^{k}_{-i}|g,\bm{\theta})V_{i}(g,\omega^{k}_{i},\bm{\omega}^{k}_{-i}|\bm{\theta}). (𝙵𝚂i\mathtt{FS}_{i})

Formally, the information design problem based on fixed-point alignment is

min𝜶,𝑱,𝑽𝒁FPA(𝜶,𝑱,𝑽;𝜽)\displaystyle\min\limits_{\bm{\alpha},\bm{J},\bm{V}}\bm{Z}^{\text{FPA}}(\bm{\alpha},\bm{J},\bm{V};\bm{\theta}) (𝙵𝙿𝙰𝚕𝚒𝚐𝚗\mathtt{FPAlign})
s.t., (FEi), (BOB1i), (FSi),i𝒩,\displaystyle\text{ (\ref{eq:nonlinear_program_pi_constraint}), (\ref{eq:constraint_BOB_1}), (\ref{eq:FS_constraint_J})},\forall i\in\mathcal{N},
𝝅AD[𝜶,𝜿],𝒁(𝝅,𝑽;𝜶)=0.\displaystyle\bm{\pi}\in\text{AD}[\bm{\alpha},\bm{\kappa}],\bm{Z}(\bm{\pi},\bm{V};\bm{\alpha})=0.

In (\mathtt{FPAlign}), the constraint \bm{Z}(\bm{\pi},\bm{V};\bm{\alpha})=0 is necessary and sufficient for the feasible \bm{\pi} to be the policy component of a Bayesian O-SMPE.

Theorem 5

Fix a goal \bm{\kappa}. Let <\bm{\alpha}^{*},\bm{J}^{*},\bm{V}^{*}> be a feasible point of (\mathtt{FPAlign}). The signaling rule \bm{\alpha}^{*} is Bayesian OIL with strong admissibility if and only if (i) \bm{Z}^{\text{FPA}}(\bm{\alpha}^{*},\bm{J}^{*},\bm{V}^{*};\bm{\theta})=0, (ii) \bm{\kappa}\in\mathtt{MPG}[\bm{R}], and (iii) strong admissibility holds.

Theorem 5 provides a design regime for a signaling rule that is OIL in Bayesian O-SMPE. Condition (i) specifies the optimality of the solution to (\mathtt{FPAlign}), while conditions (ii) and (iii) discipline the principal's freedom to manipulate the agents' behaviors. Specifically, condition (ii) implies that the principal cannot plan an arbitrary goal that specifies an arbitrary distribution of the agents' actions conditioned on the state and the joint type.

Theorem 5 thus identifies two restrictions on the principal's freedom to set her goal and to determine how the goal is achieved when influencing the agents' behaviors in a Bayesian O-SMPE. Specifically, the goal \bm{\kappa} should be a Markov perfect goal and the induced equilibrium policy profile should be strongly admissible. The following corollary uncovers another restriction on the principal's ability to influence the agents' behaviors in a Bayesian O-SMPE.

Corollary 5.1

Fix a base augmented Markov game MM with 𝐑={Ri}i𝒩\bm{R}=\{R_{i}\}_{i\in\mathcal{N}}. In general, there exists 𝛋𝙼𝙿𝙶[𝐑]\bm{\kappa^{\prime}}\in\mathtt{MPG}[\bm{R}] that can be achieved in an indirect game MD[𝛂|𝛉]M^{-D}[\bm{\alpha^{\prime}}|\bm{\theta}] but not in any direct game MD[𝛂′′|𝛉]M^{D}[\bm{\alpha^{\prime\prime}}|\bm{\theta}].

Corollary 5.1 states that restricting attention to direct information design comes with a loss of generality in the selection of Markov perfect goals.

IV Principal’s Optimal Information Design

So far, we have focused on the setting in which the principal's goal is given. In this section, we introduce the optimality criterion for the principal's goal selection and define the optimal information design problem without a predetermined goal. The principal's one-stage payoff function is u(\cdot;\bm{\theta}):\mathcal{A}^{n}\times\mathcal{G}\mapsto\mathbb{R}, where u(\bm{a},g;\bm{\theta}) gives the principal's immediate payoff when the state is g and the agents of the game M[\bm{\alpha};\bm{\theta}] take the joint action \bm{a}. Recall that the principal's goal \bm{\kappa} is the probability distribution of the agents' equilibrium joint action conditioned only on the global state, given the agents' types. Hence, the information structure that matters for the principal's goal selection problem is <\mathcal{G},\mathcal{T}_{g},d_{g}>. By the Ionescu-Tulcea theorem, the information structure <\mathcal{G},\mathcal{T}_{g},d_{g}> and any goal \bm{\kappa} uniquely define a probability measure on (\mathcal{G}\times\mathcal{A}^{n})^{\infty}. We denote the corresponding expectation by \mathbb{E}^{\bm{\kappa}}[\cdot]. Hence, the principal's problem is to choose a goal that maximizes her expected payoff (\gamma-discounted, the same as the agents'), i.e.,

C(𝜿)𝔼𝜿[t=0γtu(𝒂t,gt,𝜽)].\displaystyle C(\bm{\kappa})\equiv\mathbb{E}^{\bm{\kappa}}\Big{[}\sum_{t=0}^{\infty}\gamma^{t}u(\bm{a}_{t},g_{t},\bm{\theta})\Big{]}. (24)

However, the principal can neither force the agents to take actions nor directly program the agents' actions according to the \bm{\kappa}^{*} that maximizes C(\bm{\kappa}); instead, she uses information design to elicit actions that coincide with \bm{\kappa}^{*} in the sense of strong admissibility. Hence, the principal's optimal goal selection problem is the constrained optimization problem:

max𝜿𝙼𝙿𝙶[𝑹]C(𝜿)𝔼𝜿[t=0γtu(𝒂t,gt,𝜽)]\displaystyle\max\limits_{\bm{\kappa}\in\mathtt{MPG}[\bm{R}]}C(\bm{\kappa})\equiv\mathbb{E}^{\bm{\kappa}}\Big{[}\sum_{t=0}^{\infty}\gamma^{t}u(\bm{a}_{t},g_{t},\bm{\theta})\Big{]} (25)
s.t. <𝜶,𝑱,𝑽> is a solution of (FPAlign).\displaystyle<\bm{\alpha}^{*},\bm{J}^{*},\bm{V}^{*}>\text{ is a solution of (\ref{eq:FPAlign})}.

In (25), the feasibility of \bm{\kappa} is captured by two conditions: (i) it is a Markov perfect goal, and (ii) it disciplines strong admissibility in (\mathtt{FPAlign}).

The optimal goal selection problem (25) can be reformulated as a problem of selecting \bm{\alpha} and \bm{\pi}. Specifically, by (strong) admissibility, the objective function C(\bm{\kappa}) can be represented in terms of \bm{\alpha} and \bm{\pi} as follows:

CO(𝜶,𝝅)𝔼[t=0𝒂t\displaystyle C^{O}(\bm{\alpha},\bm{\pi})\equiv\mathbb{E}\Big{[}\sum^{\infty}_{t=0}\sum\limits_{\bm{a}_{t}} (26)
γtu(𝒂t,gt;𝜽)𝝎tk𝝅(𝒂t|gt,𝝎tk,𝜽)𝜶(𝝎tk|gt,𝜽)],\displaystyle\gamma^{t}u(\bm{a}_{t},g_{t};\bm{\theta})\sum_{\bm{\omega}^{k}_{t}}\bm{\pi}(\bm{a}_{t}|g_{t},\bm{\omega}^{k}_{t},\bm{\theta})\bm{\alpha}(\bm{\omega}^{k}_{t}|g_{t},\bm{\theta})\Big{]},

where the expectation \mathbb{E} is with respect to the probability measure over the state dynamics. Let \bm{\Pi}[\bm{\alpha},\bm{J},\bm{V}] denote the set of valid policy profiles associated with the value function \bm{V}, given \bm{\alpha} and \bm{J}; i.e.,

𝚷[𝜶,\displaystyle\bm{\Pi}[\bm{\alpha}, 𝑱,𝑽]{𝝅:(RGi),Vi(g;𝝎k|𝜽)\displaystyle\bm{J},\bm{V}]\equiv\Big{\{}\bm{\pi}:\text{(\ref{eq:regular_policy})},V_{i}(g;\bm{\omega}^{k}|\bm{\theta})
=𝔼πi[Qi𝝅i(ai,g,ωi,𝝎ik;𝝎k|𝜽;Ji)],i𝒩},\displaystyle=\mathbb{E}_{\pi_{i}}\Big{[}Q^{\bm{\pi}_{-i}}_{i}(a_{i},g,\omega_{i},\bm{\omega}^{k}_{-i};\bm{\omega}^{k}|\bm{\theta};J_{i})\Big{]},\forall i\in\mathcal{N}\Big{\}},

where Qi𝝅iQ^{\bm{\pi}_{-i}}_{i} is defined in (20). Hence, the principal’s problem (25) can be reformulated as follows:

max𝜶max𝝅𝚷[𝜶,𝑱,𝑽]CO(𝜶,𝝅)\displaystyle\max\limits_{\bm{\alpha}}\max\limits_{\bm{\pi}\in\bm{\Pi}[\bm{\alpha}^{*},\bm{J}^{*},\bm{V}^{*}]}C^{O}(\bm{\alpha},\bm{\pi}) (OptInfo)
s.t. <𝜶,𝑱,𝑽>𝚊𝚛𝚐𝚖𝚒𝚗𝜶,𝑱,𝑽𝒁FPA(𝜶,𝑱,𝑽;𝜽)\displaystyle<\bm{\alpha},\bm{J}^{*},\bm{V}^{*}>\in\operatorname*{\mathtt{argmin}}\limits_{\bm{\alpha},\bm{J},\bm{V}}\bm{Z}^{\text{FPA}}(\bm{\alpha},\bm{J},\bm{V};\bm{\theta})
s.t.  (FEi), (BOB1i), (FSi),i𝒩.\displaystyle\text{s.t. }\text{ (\ref{eq:nonlinear_program_pi_constraint}), (\ref{eq:constraint_BOB_1}), (\ref{eq:FS_constraint_J})},\forall i\in\mathcal{N}.

Technically, the problem (OptInfo) is to select (i) an equilibrium policy profile \bm{\pi}^{*} that is strongly admissible and is an MPE, and (ii) a signaling rule \bm{\alpha}^{*} that induces this policy profile, such that the principal's expected payoff C^{O} is maximized at (\bm{\alpha}^{*},\bm{\pi}^{*}).
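A rough Monte Carlo sketch of the principal's objective C^{O}(\bm{\alpha},\bm{\pi}) in (26) is given below for one representative agent, reusing the earlier arrays; the principal's one-stage payoff u, the uniform initial state distribution, and the rollout lengths are placeholders introduced purely for illustration.

```python
u = rng.normal(size=(nA, nG))                      # placeholder principal payoff u(a, g)

def C_O(alpha, pi, n_rollouts=200, horizon=60):
    # Estimate (26): simulate the state dynamics and weight payoffs by the induced
    # action kernel sum_w pi(a|g,w) alpha(w|g).
    act_dist = np.einsum('gwa,gw->ga', pi, alpha)
    total = 0.0
    for _ in range(n_rollouts):
        g = rng.integers(nG)                       # uniform initial state (assumption)
        disc = 1.0
        for _ in range(horizon):
            a = rng.choice(nA, p=act_dist[g])
            total += disc * u[a, g]
            g = rng.choice(nG, p=T[g, a])
            disc *= gamma
    return total / n_rollouts

print("C^O(alpha, pi) ~", round(C_O(alpha, pi), 3))
```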

The problem (OptInfo) is, however, based on the assumption that the agents' equilibrium behavior is always principal-preferred. We could also consider a principal who solves her problem in a robust manner: she chooses the signaling rule but maximizes her expected payoff in the worst induced equilibrium; i.e., the robust information design problem:

max𝜶min𝝅𝚷[𝜶,𝑱,𝑽]CO(𝜶,𝝅)\displaystyle\max\limits_{\bm{\alpha}}\min\limits_{\bm{\pi}\in\bm{\Pi}[\bm{\alpha}^{*},\bm{J}^{*},\bm{V}^{*}]}C^{O}(\bm{\alpha},\bm{\pi}) (Robust)
s.t. <𝜶,𝑱,𝑽>𝚊𝚛𝚐𝚖𝚒𝚗𝜶,𝑱,𝑽𝒁FPA(𝜶,𝑱,𝑽;𝜽)\displaystyle<\bm{\alpha},\bm{J}^{*},\bm{V}^{*}>\in\operatorname*{\mathtt{argmin}}\limits_{\bm{\alpha},\bm{J},\bm{V}}\bm{Z}^{\text{FPA}}(\bm{\alpha},\bm{J},\bm{V};\bm{\theta})
s.t.  (FEi), (BOB1i), (FSi),i𝒩.\displaystyle\text{s.t. }\text{ (\ref{eq:nonlinear_program_pi_constraint}), (\ref{eq:constraint_BOB_1}), (\ref{eq:FS_constraint_J})},\forall i\in\mathcal{N}.

IV-A Fixed-Point Misalignment Minimization

In this section, we provide an alternative formulation of information design by introducing the notion of fixed-point misalignment (FP misalignment).

Define, for any g𝒢g\in\mathcal{G}, i𝒩i\in\mathcal{N}, 𝜶i\bm{\alpha}_{-i}, 𝝅i\bm{\pi}_{-i}, JiJ_{i}, ViV_{i},

i𝜶i(Ji,Vi;g,ωik,𝜽)Ji(g|𝜽)Vi𝜶i(g;ωik|𝜽;Vi),\mathcal{E}^{\bm{\alpha}_{-i}}_{i}(J_{i},V_{i};g,\omega^{k}_{i},\bm{\theta})\equiv J_{i}(g|\bm{\theta})-V^{\bm{\alpha}_{-i}}_{i}(g;\omega^{k}_{i}|\bm{\theta};V_{i}),
i𝝅i(Ji,Vi;g,𝝎k,𝜽)Vi(g;𝝎k|𝜽)Qi𝝅i(ai,g;𝝎k|𝜽;Ji).\mathcal{E}^{\bm{\pi}_{-i}}_{i}(J_{i},V_{i};g,\bm{\omega}^{k},\bm{\theta})\equiv V_{i}(g;\bm{\omega}^{k}|\bm{\theta})-Q^{\bm{\pi}_{-i}}_{i}(a_{i},g;\bm{\omega}^{k}|\bm{\theta};J_{i}).

Then, we define the notion of fixed-point misalignment as follows:

\delta^{\bm{\pi}_{-i}}_{i}(J_{i},V_{i};\pi_{i}|g,\omega^{k}_{i},a_{i},\bm{\theta})\equiv\pi_{i}(a_{i}|g,\omega^{k}_{i},\theta_{i})\,\mathcal{E}^{\bm{\pi}_{-i}}_{i}(J_{i},V_{i};g,\bm{\omega}^{k},\bm{\theta}),
\delta^{\bm{\alpha}_{-i}}_{i}(J_{i},V_{i};\alpha_{i}|g,\omega^{k}_{i},\bm{\theta})\equiv\alpha_{i}(\omega^{k}_{i}|g,\bm{\theta})\,\mathcal{E}^{\bm{\alpha}_{-i}}_{i}(J_{i},V_{i};g,\omega^{k}_{i},\bm{\theta}).
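The two misalignments can be evaluated directly once candidate value functions are available, as in the sketch below (single representative agent, arrays reused from the earlier sketches); at a Bayesian O-SMPE both quantities would have to vanish, as stated in Proposition 5.1 below.

```python
Q_J = R + gamma * np.einsum('gak,k->ag', T, J)[..., None]   # Q(a, g, w; J) via (16)
gap_pi = V_fe[:, :, None] - np.transpose(Q_J, (1, 2, 0))    # V(g,w) - Q(a,g,w), per action a
delta_pi = pi * gap_pi                                      # policy-weighted gap, cf. (FPM1)
delta_alpha = alpha * (J[:, None] - V_fe)                   # signal-weighted gap, cf. (FPM2)
print("max |delta^pi|    =", round(float(np.abs(delta_pi).max()), 4))
print("max |delta^alpha| =", round(float(np.abs(delta_alpha).max()), 4))
```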
Proposition 5.1

Fix a Markov perfect goal 𝛋\bm{\kappa}. A strategy profile <𝛃O,𝛑><\bm{\beta}^{O},\bm{\pi}^{*}> where 𝛃O\bm{\beta}^{O} is Bayesian obedient is a Bayesian O-SMPE if and only if there exists a profile <𝛂,𝐉,𝐕><\bm{\alpha},\bm{J},\bm{V}> that satisfies (𝙵𝙴i\mathtt{FE}_{i}), (𝙱𝙾𝙱𝟷i\mathtt{BOB1}_{i}), and (𝙵𝚂i\mathtt{FS}_{i}), given 𝛑\bm{\pi}^{*}, g𝒢g\in\mathcal{G}, i𝒩i\in\mathcal{N}, such that, for all ωikΩ\omega^{k}_{i}\in\Omega, ai𝒜a_{i}\in\mathcal{A}, i𝒩i\in\mathcal{N},

δi𝝅i(Ji,Vi;πi|g,ωik,ai,𝜽)=0,\displaystyle\delta^{\bm{\pi}_{-i}}_{i}(J_{i},V_{i};\pi_{i}|g,\omega^{k}_{i},a_{i},\bm{\theta})=0, (𝙵𝙿𝙼𝟷i\mathtt{FPM1}_{i})
δi𝜶i(Ji,Vi;αi|g,ωik,𝜽)=0.\displaystyle\delta^{\bm{\alpha}_{-i}}_{i}(J_{i},V_{i};\alpha_{i}|g,\omega^{k}_{i},\bm{\theta})=0. (𝙵𝙿𝙼𝟸i\mathtt{FPM2}_{i})

We then reformulate (\mathtt{FPAlign}) in terms of FP misalignment minimization based on Proposition 5.1. By the definitions of \mathcal{E}^{\bm{\alpha}_{-i}}_{i} and \mathcal{E}^{\bm{\pi}_{-i}}_{i}, the objective functions \bm{Z} and \bm{Z}^{FPA} can be represented in terms of \delta^{\bm{\pi}_{-i}}_{i} and \delta^{\bm{\alpha}_{-i}}_{i}, respectively, as follows (denoted by \bm{\hat{Z}} and \bm{\hat{Z}}^{FPA}):

𝒁^(𝝅,𝑽;𝜶)=i,g,ωikaiδi𝝅i(Ji,Vi;πi|g,ωik,ai,𝜽),\displaystyle\bm{\hat{Z}}(\bm{\pi},\bm{V};\bm{\alpha})=\sum\limits_{i,g,\omega^{k}_{i}}\sum\limits_{a_{i}}\delta^{\bm{\pi}_{-i}}_{i}(J_{i},V_{i};\pi_{i}|g,\omega^{k}_{i},a_{i},\bm{\theta}), (27)
𝒁^FPA(𝜶,𝑱,𝑽;𝜽)=i,gωikδi𝜶i(Ji,Vi;αi|g,ωik,𝜽).\displaystyle\bm{\hat{Z}}^{FPA}(\bm{\alpha},\bm{J},\bm{V};\bm{\theta})=\sum\limits_{i,g}\sum\limits_{\omega^{k}_{i}}\delta^{\bm{\alpha}_{-i}}_{i}(J_{i},V_{i};\alpha_{i}|g,\omega^{k}_{i},\bm{\theta}). (28)
Corollary 5.2

Given any 𝛋𝙼𝙿𝙶[𝐑]\bm{\kappa}\in\mathtt{MPG}[\bm{R}], the problem (𝙵𝙿𝙰𝚕𝚒𝚐𝚗\mathtt{FPAlign}) is equivalent to the following:

min𝜶,𝑱,𝑽\displaystyle\min\limits_{\bm{\alpha},\bm{J},\bm{V}} 𝒁^FPA(𝜶,𝑱,𝑽;𝜽)\displaystyle\bm{\hat{Z}}^{FPA}(\bm{\alpha},\bm{J},\bm{V};\bm{\theta}) (𝙵𝙿𝙼𝚒𝚜\mathtt{FPMis})
s.t. (BOB1i) ,(FPM1i) ,(FPM2i), i𝒩\displaystyle\text{ (\ref{eq:constraint_BOB_1}) ,(\ref{eq:misalignment_1}) ,(\ref{eq:misalignment_2}), }\forall i\in\mathcal{N}
𝝅AD[𝜶,𝜿].\displaystyle\bm{\pi}\in\text{AD}[\bm{\alpha},\bm{\kappa}].

Define a set:

𝓓{𝜶,𝝅:i𝒩,\displaystyle\bm{\mathcal{D}}\equiv\Big{\{}\bm{\alpha},\bm{\pi}:\forall i\in\mathcal{N}, (RGi),\displaystyle\text{(\ref{eq:regular_policy})},
<αi,Ji,Vi>\displaystyle<\alpha_{i},J_{i},V_{i}>\in 𝚊𝚛𝚐𝚖𝚒𝚗δi𝜶i(Ji,Vi;αi|g,ωik,𝜽)\displaystyle\operatorname*{\mathtt{argmin}}\delta^{\bm{\alpha}_{-i}}_{i}(J_{i},V_{i};\alpha_{i}|g,\omega^{k}_{i},\bm{\theta})
s.t.  (BOB1i) ,(FPM1i) ,(FPM2i).}\displaystyle\text{s.t. }\text{ (\ref{eq:constraint_BOB_1}) ,(\ref{eq:misalignment_1}) ,(\ref{eq:misalignment_2})}.\Big{\}}

Given 𝓓\bm{\mathcal{D}}, we define a set of OIL signaling rules as 𝓢{𝜶:(𝜶,𝝅)𝓓}\bm{\mathcal{S}}\equiv\Big{\{}\bm{\alpha}:\forall(\bm{\alpha},\bm{\pi})\in\bm{\mathcal{D}}\Big{\}} and a set of policy profiles given any signaling rule 𝜶\bm{\alpha} as 𝚷[𝜶]{𝝅:(𝜶,𝝅)𝓓}\bm{\Pi}[\bm{\alpha}]\equiv\Big{\{}\bm{\pi}:\forall(\bm{\alpha},\bm{\pi})\in\bm{\mathcal{D}}\Big{\}}.

Corollary 5.3

The principal’s robust information design is to solve the following problem

max𝜶𝓢min𝝅𝚷[𝜶]CO(\displaystyle\max\limits_{\bm{\alpha}\in\bm{\mathcal{S}}}\min\limits_{\bm{\pi}\in\bm{\Pi}[\bm{\alpha}]}C^{O}( 𝜶,𝝅).\displaystyle\bm{\alpha},\bm{\pi}). (29)
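The max-min structure of (29) can be illustrated by a brute-force search over small candidate sets, as in the following sketch; the candidate signaling rules and policies are random placeholders rather than the sets \bm{\mathcal{S}} and \bm{\Pi}[\bm{\alpha}] produced by (\mathtt{FPMis}), and the objective C_O is the Monte Carlo estimate from the earlier sketch.

```python
cand_alphas = [rng.dirichlet(np.ones(nW), size=nG) for _ in range(5)]       # placeholder S
cand_pis = [rng.dirichlet(np.ones(nA), size=(nG, nW)) for _ in range(5)]    # placeholder Pi[alpha]
robust_values = [min(C_O(a, p) for p in cand_pis) for a in cand_alphas]     # worst policy per alpha
best = int(np.argmax(robust_values))                                        # maximize over alpha
print("best robust value over candidates:", round(robust_values[best], 3))
```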

V Conclusion

This work is the first to propose an information design principle for dynamic games in which each agent makes coupled decisions of selecting a signal and taking an action at each period of time. We have formally defined a novel information design problem in both the indirect and the direct settings. The notion of obedient implementability has been introduced to capture the optimality of the direct information design problem in a new equilibrium concept, the obedient sequential Markov perfect equilibrium (O-SMPE). By characterizing obedient implementability (OIL) in Bayesian O-SMPE, we have proposed an approach to determining the information structure, which we refer to as fixed-point alignment: it aligns the two fixed points arising at the signal-selection stage and the action-taking stage, respectively. We have uncovered the restrictions that discipline the principal's freedom to influence the agents' behaviors in a Bayesian O-SMPE. Specifically, the principal's goal should be a Markov perfect goal and the equilibrium policy profile should be strongly admissible. Additionally, restricting attention to direct information design entails a loss of generality in the selection of Markov perfect goals for OIL in Bayesian O-SMPE. Finally, we have formulated the principal's goal selection problem in terms of optimal and robust information design, replacing admissibility with the optimality or the robustness of the agents' equilibrium policy profile with respect to the principal's expected payoff.

References

  • [1] A. Dickinson, “Actions and habits: the development of behavioural autonomy,” Philosophical Transactions of the Royal Society of London. B, Biological Sciences, vol. 308, no. 1135, pp. 67–78, 1985.
  • [2] D. Bergemann and S. Morris, “Information design: A unified perspective,” Journal of Economic Literature, vol. 57, no. 1, pp. 44–95, 2019.
  • [3] I. Taneva, “Information design,” American Economic Journal: Microeconomics, vol. 11, no. 4, pp. 151–85, 2019.
  • [4] N. Chentanez, A. G. Barto, and S. P. Singh, “Intrinsically motivated reinforcement learning,” in Advances in neural information processing systems, 2005, pp. 1281–1288.
  • [5] L. Mathevet, J. Perego, and I. Taneva, “On information design in games,” Journal of Political Economy, vol. 128, no. 4, pp. 1370–1404, 2020.
  • [6] D. Bergemann and S. Morris, “Bayes correlated equilibrium and the comparison of information structures in games,” Theoretical Economics, vol. 11, no. 2, pp. 487–522, 2016.
  • [7] E. Kamenica and M. Gentzkow, “Bayesian persuasion,” American Economic Review, vol. 101, no. 6, pp. 2590–2615, 2011.
  • [8] J. Ely, A. Frankel, and E. Kamenica, “Suspense and surprise,” Journal of Political Economy, vol. 123, no. 1, pp. 215–260, 2015.
  • [9] J. Passadore and J. P. Xandri, “Robust conditional predictions in dynamic games: An application to sovereign debt,” Job Market Paper, 2015.
  • [10] L. Doval and J. C. Ely, “Sequential information design,” Econometrica, vol. 88, no. 6, pp. 2575–2608, 2020.
  • [11] J. C. Ely, “Beeps,” American Economic Review, vol. 107, no. 1, pp. 31–53, 2017.
  • [12] J. C. Ely and M. Szydlowski, “Moving the goalposts,” Journal of Political Economy, vol. 128, no. 2, pp. 468–506, 2020.
  • [13] M. Makris and L. Renou, “Information design in multi-stage games,” working paper, Tech. Rep., 2018.
  • [14] F. Koessler, M. Laclau, and T. Tomala, “Interactive information design,” HEC Paris Research Paper No. ECO/SCD-2018-1260, 2018.
  • [15] R. B. Myerson, “Optimal auction design,” Mathematics of operations research, vol. 6, no. 1, pp. 58–73, 1981.
  • [16] A. Pavan, I. Segal, and J. Toikka, “Dynamic mechanism design: A myersonian approach,” Econometrica, vol. 82, no. 2, pp. 601–653, 2014.
  • [17] T. Zhang and Q. Zhu, “On the differential private data market: Endogenous evolution, dynamic pricing, and incentive compatibility,” 2021.
  • [18] P. Milgrom and P. R. Milgrom, Putting auction theory to work.   Cambridge University Press, 2004.
  • [19] S. Bhat, S. Jain, S. Gujar, and Y. Narahari, “An optimal bidimensional multi-armed bandit auction for multi-unit procurement,” Annals of Mathematics and Artificial Intelligence, vol. 85, no. 1, pp. 1–19, 2019.
  • [20] T. Sönmez and M. U. Ünver, “Matching, allocation, and exchange of discrete resources,” in Handbook of social Economics.   Elsevier, 2011, vol. 1, pp. 781–852.
  • [21] T. Zhang and Q. Zhu, “Optimal two-sided market mechanism design for large-scale data sharing and trading in massive iot networks,” arXiv preprint arXiv:1912.06229, 2019.
  • [22] D. Dewey, “Reinforcement learning and the reward engineering principle,” in 2014 AAAI Spring Symposium Series, 2014.
  • [23] R. Nagpal, A. U. Krishnan, and H. Yu, “Reward engineering for object pick and place training,” arXiv preprint arXiv:2001.03792, 2020.
  • [24] D. Hadfield-Menell, S. Milli, P. Abbeel, S. J. Russell, and A. Dragan, “Inverse reward design,” in Advances in neural information processing systems, 2017, pp. 6765–6774.
  • [25] E. Kamenica, “Bayesian persuasion and information design,” Annual Review of Economics, vol. 11, pp. 249–272, 2019.
  • [26] I. Brocas and J. D. Carrillo, “Influence through ignorance,” The RAND Journal of Economics, vol. 38, no. 4, pp. 931–947, 2007.
  • [27] L. Rayo and I. Segal, “Optimal information disclosure,” Journal of political Economy, vol. 118, no. 5, pp. 949–987, 2010.
  • [28] I. Arieli and Y. Babichenko, “Private bayesian persuasion,” Journal of Economic Theory, vol. 182, pp. 185–217, 2019.
  • [29] M. Castiglioni, A. Celli, A. Marchesi, and N. Gatti, “Online bayesian persuasion,” Advances in Neural Information Processing Systems, vol. 33, 2020.
  • [30] J.-F. Mertens and S. Zamir, “Formulation of bayesian analysis for games with incomplete information,” International Journal of Game Theory, vol. 14, no. 1, pp. 1–29, 1985.
  • [31] I. Goldstein and Y. Leitner, “Stress tests and information disclosure,” Journal of Economic Theory, vol. 177, pp. 34–69, 2018.
  • [32] N. Inostroza and A. Pavan, “Persuasion in global games with application to stress testing,” 2018.
  • [33] P. Hernández and Z. Neeman, “How bayesian persuasion can help reduce illegal parking and other socially undesirable behavior,” Preprint, 2018.
  • [34] Z. Rabinovich, A. X. Jiang, M. Jain, and H. Xu, “Information disclosure as a means to security,” in Proceedings of the 2015 International Conference on Autonomous Agents and Multiagent Systems.   Citeseer, 2015, pp. 645–653.
  • [35] S. Gehlbach and K. Sonin, “Government control of the media,” Journal of public Economics, vol. 118, pp. 163–171, 2014.
  • [36] S. Das, E. Kamenica, and R. Mirka, “Reducing congestion through information design,” in 2017 55th annual allerton conference on communication, control, and computing (allerton).   IEEE, 2017, pp. 1279–1284.
  • [37] D. Duffie, P. Dworczak, and H. Zhu, “Benchmarks in search markets,” The Journal of Finance, vol. 72, no. 5, pp. 1983–2044, 2017.
  • [38] M. Szydlowski, “Optimal financing and disclosure,” Management Science, vol. 67, no. 1, pp. 436–454, 2021.
  • [39] D. Garcia and M. Tsur, “Information design in competitive insurance markets,” Journal of Economic Theory, vol. 191, p. 105160, 2021.
  • [40] B. D. Ziebart, J. A. Bagnell, and A. K. Dey, “Maximum causal entropy correlated equilibria for markov games.” in AAMAS.   Citeseer, 2011, pp. 207–214.
  • [41] O. Hernández-Lerma and J. B. Lasserre, Discrete-time Markov control processes: basic optimality criteria.   Springer Science & Business Media, 2012, vol. 30.
  • [42] W. He and Y. Sun, “Stationary markov perfect equilibria in discounted stochastic games,” Journal of Economic Theory, vol. 169, pp. 35–61, 2017.
  • [43] R. Bellman, “Dynamic programming,” Science, vol. 153, no. 3731, pp. 34–37, 1966.
  • [44] J. Filar and K. Vrieze, “Competitive markov decision processes-theory, algorithms, and applications,” 1997.
  • [45] H. Prasad and S. Bhatnagar, “General-sum stochastic games: Verifiability conditions for nash equilibria,” Automatica, vol. 48, no. 11, pp. 2923–2930, 2012.
  • [46] H. Prasad, P. LA, and S. Bhatnagar, “Two-timescale algorithms for learning nash equilibria in general-sum stochastic games,” in Proceedings of the 2015 International Conference on Autonomous Agents and Multiagent Systems, 2015, pp. 1371–1379.