
Sequential Information Design: Markov Persuasion Process and Its Efficient Reinforcement Learning

Jibang Wu  Zixuan Zhang  Zhe Feng  Zhaoran Wang
Zhuoran Yang  Michael I. Jordan  Haifeng Xu
University of Virginia. Email: [email protected]. University of Science and Technology of China. Email: [email protected]. Google. Email: [email protected]. Northwestern University. Email: [email protected]. Yale University. Email: [email protected]. UC Berkeley. Email: [email protected]. University of Virginia. Email: [email protected].
Abstract

In today’s economy, it has become important for Internet platforms to consider the sequential information design problem in order to align their long-term interests with the incentives of gig service providers (e.g., drivers, hosts). This paper proposes a novel model of sequential information design, namely the Markov persuasion process (MPP). Specifically, in an MPP, a sender, with an informational advantage, seeks to persuade a stream of myopic receivers to take actions that maximize the sender’s cumulative utilities in a finite-horizon Markovian environment with varying prior and utility functions. Planning in MPPs thus faces the unique challenge of finding a signaling policy that is simultaneously persuasive to the myopic receivers and induces the optimal long-term cumulative utilities of the sender. Nevertheless, at the population level where the model is known, it turns out that we can efficiently determine the optimal (resp. $\epsilon$-optimal) policy with finite (resp. infinite) states and outcomes, through a modified formulation of the Bellman equation that additionally takes persuasiveness into consideration.

Our main technical contribution is to study the MPP under the online reinforcement learning (RL) setting, where the goal is to learn the optimal signaling policy by interacting with the underlying MPP, without knowledge of the sender’s utility functions, prior distributions, or the Markov transition kernels. For such a problem, we design a provably efficient no-regret learning algorithm, the Optimism-Pessimism Principle for Persuasion Process (OP4), which features a novel combination of both the optimism and pessimism principles. In particular, we obtain optimistic estimates of the value functions to encourage exploration under the unknown environment. Meanwhile, we additionally robustify the signaling policy with respect to the uncertainty of the prior estimation to prevent the receivers’ detrimental equilibrium behavior. Our algorithm enjoys sample efficiency by achieving a sublinear $\sqrt{T}$-regret upper bound. Furthermore, both our algorithm and theory can be applied to MPPs with large spaces of outcomes and states via function approximation, and we showcase such a success under the linear setting.

1 Introduction

Most sequential decision models assume that there is a sole agent who possesses and processes all relevant (online or offline) information and takes an action accordingly. However, the economic literature on information design [26, 8] highlights the importance of considering information asymmetry in decision making, where the decision maker and information possessor may be two parties having different interests and goals. For example, a ride-sharing platform holds historical and real-time data on active riders and driver types in different locations, based on which they have developed centralized combinatorial optimization algorithms and reinforcement learning algorithms for vehicle repositioning, routing and order matching to optimize their operational efficiency and profit [33, 42, 34, 43]. But the de facto decision makers are the drivers. Moreover, as increasingly many drivers are freelancers instead of employees, the platform cannot expect to give mandatory orders to them. On the other hand, if the platform shares no information on rider demand, most drivers will not be able to efficiently find profitable trips. Therefore, it is not only realistic but also necessary to consider an information design problem that aligns the interests of the two parties in sequential decision making processes of this kind.

Given the large data sets being collected by corporations and governments, with avowed goals that relate data analysis to social welfare, it is timely to pursue formal treatments of sequential information design, to understand how to strategically inform the (sequential) decision makers (e.g., users, clients or citizens) impacted by centralized data analysis. In particular, we wish to understand the resulting equilibrium outcomes of both parties. As a concrete example, consider an online shopping platform which may make use of learning tools such as reinforcement learning or online convex optimization to manage inventory and ensure profitability [20, 36]. The platform cannot single-handedly manage its inventory, instead it requires information design (a.k.a., Bayesian persuasion) in its interactions with its suppliers and consumers. On the supply side, it could strategically reveal aspects of consumer sentiment (e.g., rough number of visits, search) to the suppliers in order to guide their sales expectation and negotiate for lower unit prices. On the demand side, it could tactically control displayed product information (e.g., last five remaining, editor’s choice) so as to influence consumers’ perception of products and consequently their purchase decisions. Similar situations can be anticipated for a recommendation platform. On the one hand, it should recommend most relevant items to its users for click-through and engagement. On the other hand, its recommendations are subject to misalignments with long-term objectives such as profits (e.g., from paid promotion), social impact (e.g., to prevent misinformation and filter bubbles) or development of a creator ecosystem [49, 53, 37].

1.1 Our Results and Contributions

To provide a formal foundation for the study of sequential information design, we introduce the Markov persuasion process (MPP), where a sender, with an informational advantage, seeks to persuade a stream of myopic receivers to take actions that maximize the sender’s cumulative utility in a finite-horizon Markovian environment with varying prior and utility functions. We need to address a key challenge regarding the planning problem in MPPs; specifically, how to find signaling policies that are persuasive and at the same time optimized for the sender’s long-term objective. Moreover, in the face of uncertainty about both the environment and the receivers, there is a dilemma: the optimal policy based on an estimated prior is not necessarily persuasive and thus cannot induce the desired trajectory, whereas a full-information-revelation policy is always persuasive but usually leads to suboptimal cumulative utility. So a reinforcement learning algorithm for MPPs has to ensure optimality under the premise of robust persuasiveness. This makes our algorithm design non-trivial and the regret analysis highly challenging.

We show how to surmount these analysis and design challenges, and present a no-regret learning algorithm, which we refer to as Optimism-Pessimism Principle for Persuasion Process (OP4), that provably achieves an $\widetilde{O}\big(\sqrt{d_{\phi}^{2}d_{\psi}^{3}H^{4}T}\big)$ regret with high probability, where $d_{\phi},d_{\psi}$ are dimensions of the feature spaces, $H$ is the horizon length in each episode, $T$ is the number of episodes, and $\widetilde{O}(\cdot)$ hides logarithmic factors as well as problem-dependent parameters. To establish this result, in Section 3.3 we start by constructing a modified formulation of the Bellman equation that can efficiently determine the optimal (resp. $\epsilon$-optimal) policy with finite (resp. infinite) states and outcomes. Section 4.2 then considers the learning problem, in particular the design of OP4, which adopts the optimism principle in utility estimation to incentivize exploration and the pessimism principle in prior estimation to prevent a detrimental equilibrium for the receiver. In Sections 4.3 and 4.4, we showcase OP4 in tabular MPPs and the contextual Bayesian persuasion problem, respectively, both of which are practical special cases of MPPs. In Section 5, we then generalize these positive results to MPPs with large outcome and state spaces via linear function approximation and generalized linear models.

In summary, our contributions are threefold. At the conceptual level, we identify the need for sequential information design in real-world problems and accordingly formulate a novel model, the MPP, to capture the misaligned incentives between the (sequential) decision makers and information possessors. At the methodological level, our key insight is a new algorithmic principle: optimism to encourage exploration and pessimism to induce robust equilibrium behavior. Finally, at the technical level, we develop a novel regret decomposition tailored to this combination of optimism and pessimism in the design of online learning algorithms. The fact that the combined optimism-pessimism concept can still lead to $O(\sqrt{T})$ regret for strategic setups was not clear before our new regret decomposition lemma. We expect this design principle and our proof techniques to be useful for other strategic learning problems.

1.2 Related Work

Our work is built on the foundations of information design and reinforcement learning. We refer readers to Sections 2.1 and 2.2 for background and formal introductions. Here we focus on the technical and modeling comparisons with related work on dynamic Bayesian persuasion and efficient reinforcement learning.

Dynamic Bayesian persuasion. Starting from the seminal work of Kamenica and Gentzkow [26], the study of Bayesian persuasion looks at the design problem of influencing an uninformed decision maker through strategic information revelation. Many variants of this model have been studied, with applications in security, advertising, finance, etc. [44, 54, 21, 6]. More recently, several dynamic Bayesian persuasion frameworks have been proposed to model the long-term interest of the sender. Many papers [16, 45, 17, 29] consider the setting where the sender observes the evolving states of a Markov chain and seeks to influence the receiver’s belief of the state through signaling, thereby persuading him to take certain actions. In contrast to our setting, at each round the receiver’s action has no influence on the evolution of the Markov process, and thus he can only maximize his utility based on his belief of the current state, given all the historical signals received from the sender. In Ely [16] and Farhadi and Teneketzis [17], the Markov chain has two states (one is absorbing): the receiver is interested in detecting the jump to the absorbing state, whereas the sender seeks to prolong the time to detection of such a jump. Renault et al. [45] show that a greedy disclosure policy that ignores its influence on future utility can be optimal in Markov chains with special utility functions. Lehrer and Shaiderman [29] characterize optimal strategies under different discount factors as well as the optimal values the sender could achieve. Closer to our model is that of Gan et al. [19]: we both assume a Markov environment with state transitions influenced by the receiver’s action, as well as a separate persuasion state drawn from a prior independent of the receiver’s action. However, Gan et al. [19] focus on the planning problem for the infinite-horizon MDP, solving for the sender’s optimal signaling policy when the environment is known, in cases where the receiver is myopic or far-sighted. In particular, it is shown to be NP-hard to approximate an optimal policy against a far-sighted receiver, which also justifies our interest in the myopic receiver. Another closely related work [61] studies the learning problem in a repeated persuasion setting (without Markov state transitions) between a stream of myopic receivers and a sender without initial knowledge of the prior. It introduces the notion of regret as well as the robustness principle for this learning problem, which we adopt and generalize to our model.

Bayesian Incentive-Compatible Bandit Exploration. Our work is also loosely related to a seminal result by Mansour et al. [35], who model the misaligned incentives between a system (i.e., sender) and a stream of myopic agents (i.e., receivers). Mansour et al. [35] show that, using information asymmetry, the system can create intrinsic incentives for agents to follow its recommendations. In this problem, the sender’s objective is limited to the social welfare, i.e., the cumulative utility of all agents, whereas we make no assumption on the sender’s utility function and thus her long-term objective. Besides, our model is designed to capture more general situations where each receiver could have different priors and utility functions, and the environment might be Markovian with dynamics influenced by the receivers’ actions.

Efficient Reinforcement Learning. Reinforcement learning has seen successful applications in various domains, such as robotics, finance and dialogue systems [27, 58, 30]. Along with this empirical success, there has been a growing quest to establish provably efficient RL methods. Classical sample-efficiency results focus on tabular environments with small, finite state spaces [2, 39, 4, 12, 47, 24, 46]. Notably, through the design principle known as optimism in the face of uncertainty [28], an RL algorithm provably incurs an $\Omega(\sqrt{|{\mathcal{S}}||\mathcal{A}|T})$ regret under the tabular setting, where ${\mathcal{S}}$ and $\mathcal{A}$ are the state and action spaces respectively [24, 4]. More recently, there have been advances in RL with function approximation, especially the linear case. Jin et al. [25] proposed an efficient algorithm for a setting where the transition kernel and the utility function are both linear functions with respect to a feature mapping $\phi:{\mathcal{S}}\times\mathcal{A}\to\mathbb{R}^{d}$. A similar assumption has been studied for different settings and has led to sample-efficiency results [55, 14, 38, 57, 22]. Moreover, other general function approximations have been studied in parallel, including generalized linear function approximation [52], linear mixture MDPs based on a ternary feature mapping [3, 60, 9, 59], kernel approximation [56], as well as models based on the low Bellman rank assumption [23, 11]. We make use of these function approximation techniques to model our conditional prior, and we show how to integrate the persuasion structure into these efficient reinforcement learning frameworks, thereby obtaining sample-efficient results for large-scale MPPs.

2 Preliminaries

This section provides some necessary background in information design and Markov decision processes, as preparation for our model of Markov persuasion processes presented in the next section.

2.1 Basics of Information Design

Classic information design [26] considers the persuasion problem between a single sender (she) and receiver (he). The receiver is the only actor, and looks to take an action $a\in\mathcal{A}$ which results in receiver utility $u(\omega,a)$ and sender utility $v(\omega,a)$. Here $\omega\in\Omega$ is the realized outcome of certain environment uncertainty, which is drawn from a prior distribution $\mu\in\Delta(\Omega)$, and $\mathcal{A}$ is a finite set of available actions for the receiver. While $u,v:\Omega\times\mathcal{A}\to[0,1]$ and the prior distribution $\mu$ are all common knowledge, the sender possesses an informational advantage and can privately observe the realized outcome $\omega$. The persuasion problem studies how the sender can selectively reveal her private information about $\omega$ to influence the receiver’s decisions and ultimately maximize her own expected utility $v$.

To model the sender’s strategic revelation of information, it is standard to use a signaling scheme, which essentially specifies the conditional distribution of a random variable (namely the signal), given the outcome $\omega$. Before the realization of the outcome, the sender commits to such a signaling scheme. Given the realized outcome, the sender samples a signal from the conditional distribution according to the signaling scheme and reveals it to the receiver. Upon receiving this signal, the receiver infers a posterior belief about the outcome via Bayes’ theorem (based on the correlation between the signal and the outcome $\omega$ as promised by the signaling scheme) and then chooses an action $a$ that maximizes his expected utility.

A standard revelation-principle-style argument shows that it is without loss of generality to focus on direct and persuasive signaling schemes [26]. A scheme is direct if each signal corresponds to an action recommendation to the receiver, and is persuasive if the recommended action indeed maximizes the receiver’s a posteriori expected utility. More formally, in a direct signaling scheme $\pi=(\pi(a|\omega):\omega\in\Omega,a\in\mathcal{A})$, $\pi(a|\omega)$ denotes the probability of recommending action $a$ given realized outcome $\omega$. Upon receiving an action recommendation $a$, the receiver computes a posterior belief for $\omega$: $\Pr(\omega|a)=\frac{\mu(\omega)\pi(a|\omega)}{\sum_{\omega^{\prime}}\mu(\omega^{\prime})\pi(a|\omega^{\prime})}$. Thus, the action recommendation $a$ is persuasive if and only if $a$ maximizes the expected utility w.r.t. the posterior belief about $\omega$; i.e., $\sum_{\omega}\Pr(\omega|a)\cdot u(\omega,a)\geq\sum_{\omega}\Pr(\omega|a)\cdot u(\omega,a^{\prime})$ for any $a^{\prime}\in\mathcal{A}$. Equivalently, we define persuasiveness as

$$\text{Persuasiveness: }\quad\sum_{\omega\in\Omega}\mu(\omega)\pi(a|\omega)\cdot[u(\omega,a)-u(\omega,a^{\prime})]\geq 0,\quad\forall a,a^{\prime}\in\mathcal{A}.$$

Let $\mathcal{P}=\{\pi:\pi(\cdot|\omega)\in\Delta(\mathcal{A})\text{ for each }\omega\in\Omega\}$ denote the set of all signaling schemes. To emphasize that the definition of persuasiveness depends on the prior $\mu$, we denote the set of persuasive schemes on prior $\mu$ by

$$\operatorname{Pers}(\mu)\coloneqq\left\{\pi\in\mathcal{P}:\sum_{\omega\in\Omega}\mu(\omega)\pi(a|\omega)\left[u(\omega,a)-u\left(\omega,a^{\prime}\right)\right]\geq 0,\quad\forall a,a^{\prime}\in\mathcal{A}\right\}.$$

Given a persuasive signaling scheme $\pi\in\operatorname{Pers}(\mu)$, it is in the receiver’s best interest to take the recommended action, and thus the sender’s expected utility becomes $V(\mu,\pi)\coloneqq\sum_{\omega\in\Omega}\sum_{a\in\mathcal{A}}\mu(\omega)\pi(a|\omega)v(\omega,a).$

Therefore, given full knowledge of the persuasion instance, the sender can solve for an optimal persuasive signaling scheme that maximizes her expected utility through the following linear program (LP), which searches for a persuasive signaling scheme that maximizes $V(\mu,\pi)$ (see, e.g., [15] for details):

$$\text{Persuasion as an LP:}\qquad\operatorname{OPT}\left(\mu\right)\coloneqq\max_{\pi\in\operatorname{Pers}(\mu)}\quad V(\mu,\pi).$$
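Since both the objective and the persuasiveness constraints are linear in the variables $\pi(a|\omega)$, this LP can be handed to any off-the-shelf solver. Below is a minimal sketch for the finite-outcome case using `scipy.optimize.linprog`; the function name and array layout are our own illustrative choices, not notation from the paper.

```python
import numpy as np
from scipy.optimize import linprog

def optimal_signaling_scheme(mu, u, v):
    """Solve the single-shot persuasion LP: max_{pi in Pers(mu)} V(mu, pi).

    mu: prior over outcomes, shape (n_w,)
    u:  receiver utility, shape (n_w, n_a)
    v:  sender utility,   shape (n_w, n_a)
    Returns (pi, value) with pi[w, a] = pi(a | omega = w).
    """
    n_w, n_a = u.shape
    # Decision variables x[w * n_a + a] = pi(a | omega = w).
    # Objective: maximize sum_{w,a} mu[w] pi(a|w) v[w,a]  (linprog minimizes, so negate).
    c = -(mu[:, None] * v).reshape(-1)

    # Persuasiveness: for all a != a':  sum_w mu[w] pi(a|w) (u[w,a] - u[w,a']) >= 0.
    A_ub, b_ub = [], []
    for a in range(n_a):
        for a_prime in range(n_a):
            if a_prime == a:
                continue
            row = np.zeros(n_w * n_a)
            row[a::n_a] = -mu * (u[:, a] - u[:, a_prime])   # flipped sign for "<= 0" form
            A_ub.append(row)
            b_ub.append(0.0)

    # Each pi(.|w) must be a probability distribution over actions.
    A_eq = np.zeros((n_w, n_w * n_a))
    for w in range(n_w):
        A_eq[w, w * n_a:(w + 1) * n_a] = 1.0

    res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  A_eq=A_eq, b_eq=np.ones(n_w), bounds=(0, 1), method="highs")
    return res.x.reshape(n_w, n_a), -res.fun
```

The decision variables are the $|\Omega|\,|\mathcal{A}|$ entries of $\pi$, with $|\mathcal{A}|(|\mathcal{A}|-1)$ persuasiveness constraints and $|\Omega|$ simplex constraints, matching the LP described above.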

2.2 Basics of Reinforcement Learning and Markov Decision Processes

The Markov decision process (MDP) [41, 48] is a classic mathematical framework for the sequential decision making problem. In this work, we focus on the model of episodic MDPs. Specifically, at the beginning of the episode, the environment has an initial state $s_{1}$ (possibly picked by an adversary). Then, at each step $h\geq 1$, the agent takes some action $a_{h}\in\mathcal{A}$ to interact with the environment at state $s_{h}\in{\mathcal{S}}$. The state $s_{h}$ obeys a Markov property and thus captures all relevant information in the history $\{s_{i}\}_{i<h}$. Accordingly, the agent receives the utility $v_{h}(s_{h},a_{h})\in[0,1]$ and the system evolves to the state of the next step $s_{h+1}\sim P_{h}(\cdot|s_{h},a_{h})$. Such a process terminates after $h=H$, where $H$ is also known as the horizon length. Here, $\mathcal{A}$ is a finite set of available actions for the agent, and ${\mathcal{S}}$ is the (possibly infinite) set of MDP states. The utility function $v_{h}:{\mathcal{S}}\times\mathcal{A}\to[0,1]$ and transition kernel $P_{h}:{\mathcal{S}}\times\mathcal{A}\to\Delta({\mathcal{S}})$ may vary at each step. A policy of the agent $\pi_{h}:{\mathcal{S}}\to\Delta(\mathcal{A})$ characterizes her decision making process at step $h$: after observing the state $s$, the agent takes action $a$ with probability $\pi_{h}(a|s)$.

In an episodic MDP with $H$ steps, under policy $\bm{\pi}=\{\pi_{h}\}_{h\in[H]}$, we define the value function as the expected value of cumulative utilities starting from an arbitrary state,

$$V_{h}^{\pi}(s)\coloneqq\mathbb{E}_{P,\bm{\pi}}\bigg[\sum_{h^{\prime}=h}^{H}v_{h^{\prime}}(s_{h^{\prime}},a_{h^{\prime}})\bigg|s_{h}=s\bigg],\quad\forall s\in{\mathcal{S}},h\in[H].$$

Here $\mathbb{E}_{P,\bm{\pi}}$ means that the expectation is taken with respect to the trajectory $\{s_{h},a_{h}\}_{h\in[H]}$, which is generated by policy $\bm{\pi}$ under the transition model $P=\{P_{h}\}_{h\in[H]}$. Similarly, we define the action-value function as the expected value of cumulative utilities starting from an arbitrary state-action pair,

$$Q_{h}^{\pi}(s,a)\coloneqq v_{h}(s,a)+\mathbb{E}_{P,\bm{\pi}}\bigg[\sum_{h^{\prime}=h+1}^{H}v_{h^{\prime}}(s_{h^{\prime}},a_{h^{\prime}})\bigg|s_{h}=s,a_{h}=a\bigg],\quad\forall s\in{\mathcal{S}},a\in\mathcal{A},h\in[H].$$

The optimal policy is defined as $\bm{\pi}^{*}\coloneqq\arg\max_{\bm{\pi}}V_{1}^{\bm{\pi}}(s_{1})$, which maximizes the (expected) cumulative utility. Since the agent’s action affects both the immediate utility and the next state, which in turn influences future utility, careful planning is required to maximize total utility. Notably, $\bm{\pi}^{*}$ can be solved by dynamic programming based on the Bellman equation [7]. Specifically, with $V_{H+1}^{*}(s)=0$ and for each $h$ from $H$ to $1$, iteratively update $Q_{h}^{*}(s,a)=v_{h}(s,a)+\mathbb{E}_{s^{\prime}\sim P_{h}(\cdot|s,a)}V_{h+1}^{*}(s^{\prime})$ and $V_{h}^{*}(s)=\max_{a\in\mathcal{A}}Q_{h}^{*}(s,a)$, and determine the optimal policy $\bm{\pi}^{*}$ as the greedy policy with respect to $\{Q_{h}^{*}\}_{h\in[H]}$.
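For finite $\mathcal{S}$ and $\mathcal{A}$, this backward recursion is only a few lines of code. The sketch below (array shapes and names are our own) implements exactly the update above.

```python
import numpy as np

def finite_horizon_value_iteration(P, v):
    """Backward induction for an episodic MDP via the Bellman optimality equation.

    P: transitions, shape (H, S, A, S), P[h, s, a, s'] = P_h(s' | s, a)
    v: utilities,   shape (H, S, A)
    Returns (Q, V, pi) with the greedy optimal policy pi[h, s].
    """
    H, S, A, _ = P.shape
    Q = np.zeros((H, S, A))
    V = np.zeros((H + 1, S))             # V[H] = 0 is the terminal condition
    pi = np.zeros((H, S), dtype=int)
    for h in range(H - 1, -1, -1):
        Q[h] = v[h] + P[h] @ V[h + 1]    # Q_h(s,a) = v_h(s,a) + E_{s'}[V_{h+1}(s')]
        V[h] = Q[h].max(axis=1)
        pi[h] = Q[h].argmax(axis=1)      # greedy policy w.r.t. Q_h
    return Q, V[:H], pi
```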

In online reinforcement learning, the agent has no prior knowledge of the environment, namely $\{v_{h},P_{h}\}_{h\in[H]}$, but aims to learn the optimal policy by interacting with the environment for $T$ episodes. For each $t\in[T]$, at the beginning of the $t$-th episode, after observing the initial state $s_{1}^{t}$, the agent chooses a policy $\bm{\pi}^{t}$ based on the observations before the $t$-th episode. The discrepancy between $V^{\bm{\pi}^{t}}_{1}(s_{1}^{t})$ and $V^{*}_{1}(s_{1}^{t})$ serves as the suboptimality of the agent at the $t$-th episode. The performance of the online learning algorithm is measured by the expected regret, $\operatorname{Reg}(T)\coloneqq\sum_{t=1}^{T}[V^{*}_{1}(s_{1}^{t})-V^{\bm{\pi}^{t}}_{1}(s_{1}^{t})].$

3 Markov Persuasion Processes

This section introduces the Markov Persuasion Process (MPP), a novel model for sequential information design in Markovian environments. It notably captures the motivating yet intricate real-world problems in Section 1. Furthermore, our MPP model is readily applicable to generalized settings with large state spaces by incorporating function approximation techniques.

3.1 A Model of Markov Persuasion Processes (MPPs)

We start by abstracting the sequential information design problem instances in Section 1 into MPPs. Taking as an example a recommendation platform for ad keywords, we view the platform as the sender and the advertisers as the receivers. The advertisers decide the actions $a\in\mathcal{A}$, such as whether to accept the recommended keyword. To better reflect the nature of reality, we model two types of information for MPPs, outcome and state. We use the notion of outcome $\omega\in\Omega$ to characterize the sender’s private information when facing each receiver, such as the features of searchers for some keyword. The outcome follows a prior distribution, such as the general demographics of Internet users on the platform. The platform can thus leverage such fine-grained knowledge of keyword features, matched with the specific ad features of each advertiser, to persuade the advertisers to accept a recommendation of keywords. Meanwhile, we use the notion of state $s\in{\mathcal{S}}$ to characterize the Markovian state of the environment, e.g., the availability of ad keyword slots. The state is affected by the receiver’s action, as the availability changes after some keywords get bought. (Similarly, we can view the online shopping platform as the sender who persuades a stream of receivers (suppliers, consumers) to take a certain action, whether to take an offer or make a purchase. In this case, the sender can privately observe outcomes such as the consumer sentiment on some random products based on the search and click logs, whereas the states are product reviews, sales or shipping times, commonly known to the public and affected by the actions of both the supply and demand sides. In the case of ride-sharing, the outcome represents the fine-grained knowledge of currently active rider types that is privately known to the platform and generally stochastic in accordance with some user demographics, whereas the state captures the general driver supply or rider demand at locations, which is affected by the drivers’ decisions.) Naturally, both the sender’s and the receiver’s utilities are determined by the receiver’s action $a$ jointly with the state of the environment $s$ and the realized outcome $\omega$, i.e., $u,v:{\mathcal{S}}\times\Omega\times\mathcal{A}\to[0,1]$. Meanwhile, as these applications could serve thousands or millions of receivers every day, to reduce the complexity of our model we assume each receiver appears only once and thus myopically maximizes his utility at that particular step, whereas the sender is a system planner who aims to maximize her long-term accumulated expected utility.

More specifically, an MPP is built on top of a standard episodic MDP with state space ${\mathcal{S}}$, action space $\mathcal{A}$, and transition kernel $P$. In this paper, we restrict our attention to finite-horizon (i.e., episodic) MPPs with $H$ steps denoted by $[H]=\{1,\cdots,H\}$, and leave the study of infinite-horizon MPPs as an interesting future direction. At a high level, there are two major differences between MPPs and MDPs. First, in an MPP, the planner cannot directly take an action but instead can leverage its information advantage and “persuade” a receiver to take a desired action $a_{h}$ at each step $h\in[H]$. Second, in an MPP, the state transition is affected not only by the current action $a_{h}$ and state $s_{h}$, but also by the realized outcome $\omega_{h}$ of Nature’s probability distribution. Specifically, the state transition kernel at step $h$ is denoted as $P_{h}(s_{h+1}|s_{h},\omega_{h},a_{h})$. To capture the sender’s persuasion of a receiver to take actions at step $h$, we assume that a fresh receiver arrives at time $h$ with a prior $\mu_{h}$ over the outcome $\omega_{h}$. The planner, who is the sender here, can observe the realized outcome $\omega_{h}$ and would like to strategically reveal information about $\omega_{h}$ in order to persuade the receiver to take a certain action $a_{h}$.

Differing from classical single-shot information design, the immediate utility functions $u_{h},v_{h}$ for the receiver and sender vary not only at each step $h$ but also additionally depend on the commonly observed state $s_{h}$ of the environment. We assume the receiver has full knowledge of his utility $u_{h}$ and prior $\mu_{h}$ at each step $h$, and takes the recommended action $a_{h}$ if and only if $a_{h}$ maximizes his expected utility under the posterior for $\omega_{h}$. (This assumption is not essential but is made for technical rigor. Even if receivers have limited knowledge or computational power to accurately determine the utility-maximizing actions, the sender has sufficient ethical or legal reasons to comply with the persuasiveness constraints in practice, and the receivers would only take the recommendation if the platform has a good reputation, i.e., is persuasive with high probability.)

Formally, an MPP with a horizon length $H$ proceeds as follows at each step $h\in[H]$:

1. A fresh receiver with prior distribution $\mu_{h}\in\Delta(\Omega)$ and utility $u_{h}:{\mathcal{S}}\times\Omega\times\mathcal{A}\to[0,1]$ arrives.
2. The sender commits to a persuasive signaling policy $\pi_{h}:{\mathcal{S}}\to\mathcal{P}$, which is public knowledge.
3. After observing the realized state $s_{h}$ and outcome $\omega_{h}$, the sender accordingly recommends the receiver to take an action $a_{h}\sim\pi_{h}(\cdot|s_{h},\omega_{h})$.
4. Given the recommended action $a_{h}$, the receiver takes an action $a^{\prime}_{h}$, receives utility $u_{h}(s_{h},\omega_{h},a^{\prime}_{h})$ and then leaves the system. Meanwhile, the sender receives utility $v_{h}(s_{h},\omega_{h},a^{\prime}_{h})$.
5. The next state $s_{h+1}\sim P_{h}(\cdot|s_{h},\omega_{h},a^{\prime}_{h})$ is generated according to $P_{h}:{\mathcal{S}}\times\Omega\times\mathcal{A}\to\Delta({\mathcal{S}})$, the state transition kernel at the $h$-th step.
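For the tabular case, the five-step protocol above can be simulated directly. The sketch below is our own illustration (names, array shapes, and the tie-breaking rule are assumptions, not the paper's code); it makes explicit that the receiver only follows the recommendation when the induced posterior makes it a best response.

```python
import numpy as np

def run_mpp_episode(rng, s1, priors, pis, u, v, P):
    """Simulate one H-step MPP episode (steps 1-5 above) in the tabular case.

    priors[h]: (W,) prior mu_h;  pis[h]: (S, W, A) policy pi_h(a | s, omega)
    u[h], v[h]: (S, W, A) receiver / sender utilities;  P[h]: (S, W, A, S) transitions.
    """
    H = len(pis)
    s, sender_total = s1, 0.0
    for h in range(H):
        w = rng.choice(len(priors[h]), p=priors[h])          # Nature draws omega_h
        a_rec = rng.choice(pis[h].shape[2], p=pis[h][s, w])  # sender recommends a_h
        # Myopic receiver: posterior over omega given the recommendation, then a best
        # response (ties broken in favor of the recommendation, as the model assumes).
        post = priors[h] * pis[h][s, :, a_rec]
        expected_u = post @ u[h][s]
        a = a_rec if expected_u[a_rec] >= expected_u.max() - 1e-12 else int(np.argmax(expected_u))
        sender_total += v[h][s, w, a]
        s = rng.choice(P[h].shape[3], p=P[h][s, w, a])       # next state s_{h+1}
    return sender_total
```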

Here we coin the notion of a signaling policy $\pi_{h}$ as a mapping from a state to a signaling scheme at the $h$-th step. It captures a possibly multi-step procedure in which the sender commits to a signaling scheme after observing the realized state and then samples a signal after observing the realized outcome. For notational convenience, we denote by $\pi(a|s,\omega)$ the probability of recommending action $a$ given state $s$ and realized outcome $\omega$. We can also generalize the notion of persuasiveness from classic information design to MPPs. Specifically, we define $\operatorname{Pers}(\mu,u)$ as the persuasive set that contains all signaling policies that are persuasive to the receiver with utility $u$ and prior $\mu$ for every possible state $s\in{\mathcal{S}}$:

$$\operatorname{Pers}(\mu,u)\coloneqq\bigg\{\pi:{\mathcal{S}}\to\mathcal{P}:\int_{\omega\in\Omega}\mu(\omega)\pi(a|s,\omega)\big[u(s,\omega,a)-u(s,\omega,a^{\prime})\big]{\mathrm{d}}\omega\geq 0,\quad\forall a,a^{\prime}\in\mathcal{A},s\in{\mathcal{S}}\bigg\}.$$

Recall that $\mathcal{P}$ consists of all mappings from $\Omega$ to $\Delta(\mathcal{A})$. As such, the sender’s persuasive signaling policy $\pi_{h}\in\operatorname{Pers}(\mu_{h},u_{h})$ is essentially a stochastic policy as defined in standard MDPs ($\pi_{h}$ maps a state $s_{h}$ to a stochastic action $a_{h}$), except that here the probability of suggesting action $a_{h}$ by policy $\pi_{h}$ depends additionally on the realized outcome $\omega_{h}$ that is only known to the sender.

We say $\bm{\pi}\coloneqq\{\pi_{h}\}_{h\in[H]}$ is a feasible policy of the MPP if $\pi_{h}\in\operatorname{Pers}(\mu_{h},u_{h})$ for all $h\in[H]$, because the state transition trajectory would otherwise be infeasible if the receiver is not guaranteed to take the recommended action, i.e., $a^{\prime}_{h}\neq a_{h}$. We denote the set of all feasible policies as $\mathcal{P}^{H}\coloneqq\prod_{h\in[H]}\operatorname{Pers}(\mu_{h},u_{h})$.

3.2 MPPs: the Generalized Version with Contexts and Linear Parameterization

To provide a broadly useful modeling concept, we also study a generalized setting of the Markov Persuasion Process with contextual prior and a possibly large space of states, outcomes and contexts.

Contextual Prior.

At the beginning of each episode, a sequence of contexts $C=\{c_{h}\in\mathcal{C}\}_{h\in[H]}$ is realized by Nature and becomes public knowledge. We allow the prior $\mu_{h}$ to be influenced by the context $c_{h}$ at each step $h$, and thus denote it by $\mu_{h}(\cdot|c_{h})$. Specifically, the contextual information is able to model uncertainty such as the varying demographics of the active user group affected by events (e.g., scheduled concerts or sports games in ride-sharing) at different times of the day. (In the case of the online shopping platform, the prior of consumer interests may be affected by different holidays or seasons at different times of the year.) Here we allow the sequence of contexts to be adversarially generated.

Linear Parameterization.

We also relax the state, context and outcome spaces ${\mathcal{S}},\mathcal{C},\Omega$ to be continuous, and additionally assume that the transition kernels and utility functions are linear, and that the conditional priors of outcomes are generalized linear models (GLMs) of the context at each step. More formally, for each step $h\in[H]$, our linearity condition assumes:

  • The sender’s utility is $v_{h}\coloneqq v_{h}^{*}(s_{h},\omega_{h},a_{h})=\psi(s_{h},\omega_{h},a_{h})^{\top}\gamma_{h}^{*}$, where (1) $\psi(\cdot,\cdot,\cdot)\in\mathbb{R}^{d_{\psi}}$ is a known feature vector; (2) $\gamma_{h}^{*}\in\mathbb{R}^{d_{\psi}}$ is the unknown linear parameter at step $h$.

  • The next state $s_{h+1}$ is drawn from the distribution $P_{M,h}(\cdot|s_{h},\omega_{h},a_{h})=\psi(s_{h},\omega_{h},a_{h})^{\top}M_{h}(\cdot)$, where $M_{h}=(M_{h}^{(1)},M_{h}^{(2)},\dots,M_{h}^{(d_{\psi})})$ is a vector of $d_{\psi}$ unknown measures over ${\mathcal{S}}$ at step $h$.

  • The outcome $\omega_{h}\in\mathbb{R}$ is subject to a generalized linear model (GLM), which models a wider range of hypothesis function classes. (We note that the GLM is a strict generalization of the linear model assumption that we make for the transition kernel $P$. We could extend the distribution of $P$ to a GLM using techniques similar to those in Wang et al. [52], but we omit such an extension for simplicity, since it is not the primary focus of our work.) Given the context $c_{h}$, there exists a link function $f:\mathbb{R}\to\mathbb{R}$ such that $\omega_{h}=f(\phi(c_{h})^{\top}\theta^{*}_{h})+z_{h}$, where $\phi(\cdot)\in\mathbb{R}^{d_{\phi}}$ is a feature vector and $\theta_{h}^{*}\in\mathbb{R}^{d_{\phi}}$ is an unknown parameter. The noises $\{z_{h}\}_{h\in[H]}$ are independent $\sigma$-sub-Gaussian variables with zero mean. We denote the prior of $\omega_{h}$ with parameter $\theta$ at context $c$ as $\mu_{\theta}(\cdot|c)$.

Without loss of generality, we assume that there exist $\Phi,\Psi$ such that $\|\phi(c)\|\leq\Phi$ and $\|\psi(s,\omega,a)\|\leq\Psi$ for all $s\in{\mathcal{S}}$, $\omega\in\Omega$, $c\in\mathcal{C}$ and $a\in\mathcal{A}$. (For simplicity of notation, we omit the subscript of the norm whenever it is an $L_{2}$ norm in this paper.) We also assume that $\|\theta_{h}^{*}\|\leq L_{\theta}$, $\|\gamma_{h}^{*}\|\leq L_{\gamma}$, $\|M_{h}^{*}\|\leq L_{M}$, $|\mathcal{A}|\geq 2$, $|\Omega|\geq 2$. Such regularity conditions are common in the RL literature.
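To make the parameterization concrete, the following toy instantiation draws one step of such a linear MPP with a finite state set. Everything here is a synthetic assumption of ours (the feature maps, the tanh link, and all parameter values are illustrative, not the paper's construction).

```python
import numpy as np

rng = np.random.default_rng(0)
d_psi, d_phi, S, A = 4, 3, 5, 2          # illustrative sizes

def psi(s, w, a):
    """Hypothetical bounded, nonnegative feature map psi(s, omega, a) in R^{d_psi}."""
    x = np.array([1.0, s / S, (np.tanh(w) + 1) / 2, a / A])
    return x / np.linalg.norm(x)

def phi(c):
    """Hypothetical bounded context feature map phi(c) in R^{d_phi}."""
    x = np.array([1.0, np.cos(c), np.sin(c)])
    return x / np.linalg.norm(x)

theta_star = rng.uniform(-1, 1, d_phi)           # GLM prior parameter (unknown to the learner)
gamma_star = rng.uniform(0, 0.5, d_psi)          # sender utility parameter
M_star = rng.dirichlet(np.ones(S), size=d_psi)   # d_psi measures over S (rows sum to one)

def step_model(s, c, a, sigma=0.1):
    """One step of this toy linear MPP: GLM outcome, linear utility, linear transition."""
    w = np.tanh(phi(c) @ theta_star) + sigma * rng.standard_normal()  # omega = f(phi(c)^T theta*) + z
    feat = psi(s, w, a)
    v = feat @ gamma_star                         # v_h = psi^T gamma*, lies in [0, 1] by construction
    p_next = feat @ M_star                        # unnormalized psi^T M_h(.)
    s_next = rng.choice(S, p=p_next / p_next.sum())  # renormalized for this toy parameterization
    return w, v, s_next
```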

3.3 Optimal Signaling Policy in MPPs

In order to maximize the sender’s utility, we study the optimal policy in MPPs, in analogy to that of standard MDPs. We start by considering the value of any feasible policy $\bm{\pi}$. For each step $h\in[H]$, we define the value function for the sender $V_{h}^{\pi}:{\mathcal{S}}\to\mathbb{R}$ as the expected value of cumulative utilities under $\bm{\pi}$ when starting from an arbitrary state at the $h$-th step. That is, for any $s\in{\mathcal{S}},h\in[H]$, we define

$$V_{h}^{\pi}(s)\coloneqq\mathbb{E}_{P,\mu,\pi}\bigg[\sum_{h^{\prime}=h}^{H}v_{h^{\prime}}\big(s_{h^{\prime}},\omega_{h^{\prime}},a_{h^{\prime}}\big)\bigg|s_{h}=s\bigg],$$

where the expectation $\mathbb{E}_{P,\mu,\pi}$ is taken with respect to the randomness of the trajectory (i.e., the randomness of the state transition), the realized outcomes and the stochasticity of $\bm{\pi}$. Accordingly, we define the $Q$-function (action-value function) $Q_{h}^{\pi}:{\mathcal{S}}\times\Omega\times\mathcal{A}\to\mathbb{R}$, which gives the expected value of cumulative utilities when starting from an arbitrary state-outcome-action tuple at the $h$-th step and following the signaling policy $\bm{\pi}$, that is,

$$Q_{h}^{\pi}(s,\omega,a)\coloneqq v_{h}(s,\omega,a)+\mathbb{E}_{P,\mu,\pi}\bigg[\sum_{h^{\prime}=h+1}^{H}v_{h^{\prime}}\big(s_{h^{\prime}},\omega_{h^{\prime}},a_{h^{\prime}}\big)\bigg|s_{h}=s,\omega_{h}=\omega,a_{h}=a\bigg].$$

By definition, $Q_{h}^{\pi}(\cdot,\cdot,\cdot),V_{h}^{\pi}(\cdot)\in[0,H-h+1]$, since $v_{h}(\cdot,\cdot,\cdot)\in[0,1]$. To simplify notation, for any $Q$-function $Q_{h}$ and any distributions $\mu_{h}$ and $\pi_{h}$ over $\Omega$ and $\mathcal{A}$, we additionally denote

$$\big\langle Q_{h},\mu_{h}\otimes\pi_{h}\big\rangle_{\Omega\times\mathcal{A}}(s)\coloneqq\mathbb{E}_{\omega\sim\mu_{h},a\sim\pi_{h}(\cdot|s,\omega)}\left[Q_{h}(s,\omega,a)\right].$$

Using this notation, the Bellman equation associated with signaling policy $\bm{\pi}$ becomes

$$Q_{h}^{\pi}(s,\omega,a)=(v_{h}+P_{h}V_{h+1}^{\pi})(s,\omega,a),\quad V_{h}^{\pi}(s)=\big\langle Q_{h}^{\pi},\mu_{h}\otimes\pi_{h}\big\rangle_{\Omega\times\mathcal{A}}(s),\quad V_{H+1}^{\pi}(s)=0,\qquad(3.1)$$

which holds for all $s\in\mathcal{S},\omega\in\Omega,a\in\mathcal{A}$. Similarly, the Bellman optimality equation is

$$Q_{h}^{*}(s,\omega,a)=(v_{h}+P_{h}V_{h+1}^{*})(s,\omega,a),\quad V_{h}^{*}(s)=\max_{\pi_{h}^{\prime}\in\operatorname{Pers}(\mu_{h},u_{h})}\big\langle Q_{h}^{*},\mu_{h}\otimes\pi_{h}^{\prime}\big\rangle_{\Omega\times\mathcal{A}}(s),\quad V_{H+1}^{*}(s)=0.\qquad(3.2)$$

We remark that the above equations implicitly assume the context $C=\{c_{h}\}_{h\in[H]}$ (and thus the priors) is determined in advance. To emphasize the values’ dependence on the context, which will be useful for the analysis of the later learning algorithms, we extend the notation to $V^{\pi}_{h}(s;C),Q_{h}^{\pi}(s,\omega,a;C)$ to specify under which sequence of contexts $C$ (and thus which conditional prior $\mu$) the value (resp. $Q$) function is evaluated.

A Note on Computational Efficiency.

We note that the above Bellman optimality equation in (3.2) also implies an efficient dynamic program to compute the optimal policy $\bm{\pi}^{*}$ in the basic tabular model of MPP in Subsection 3.1, i.e., when $s\in\mathcal{S},\omega\in\Omega,a\in\mathcal{A}$ are all discrete. This is because the maximization problem in equation (3.2) can be solved efficiently by a linear program. The generalized MPP of Subsection 3.2 poses some computational challenges due to infinitely many outcomes and states. Fortunately, it is already known that planning in infinite-state MDPs with linear function approximation can also be done efficiently [25]. Following a similar analysis, we can determine $Q_{h}^{*}(\cdot,\cdot,\cdot)$ through a linear function of $q_{h}^{*}\in\mathbb{R}^{d_{\psi}}$ with the observed feature $\psi(\cdot,\cdot,\cdot)$. Hence, the dominating operation is to compute $\max_{\pi_{h}\in\operatorname{Pers}(\mu_{h},u_{h})}\langle Q_{h}^{*},\mu_{h}\otimes\pi_{h}\rangle_{\Omega\times\mathcal{A}}(s)$ at each step. Viewing $Q_{h}^{*}$ as the sender’s utility function, such an optimization is exactly the problem of optimal information design with infinitely many outcomes but finitely many actions, which has been studied in previous work [15]. It turns out that there is an efficient algorithm that can signal on the fly for any given outcome $\omega$ and obtains an $\epsilon$-optimal persuasive signaling scheme in $\operatorname{poly}(1/\epsilon)$ time. Therefore, in our later studies of learning, we will take these algorithms as given and simply assume that we can compute the optimal signaling scheme efficiently at any given state $s$. One caveat is that our regret guarantee will additionally lose an additive $\epsilon$ factor at each step due to the availability of only an $\epsilon$-optimal algorithm, but this loss becomes negligible when we set $\epsilon=O(1/(TH))$ by using a $\operatorname{poly}(TH)$ time algorithm.
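In the tabular case, the dynamic program is simply backward induction with one persuasion LP per state and step. The sketch below is our own illustrative code; it reuses the `optimal_signaling_scheme` helper sketched in Section 2.1 and passes $Q_{h}^{*}(s,\cdot,\cdot)$ in place of the sender's one-step utility.

```python
import numpy as np

def plan_tabular_mpp(mu, u, v, P):
    """Backward induction over the Bellman optimality equation (3.2), tabular MPP.

    mu[h]: (W,) prior; u[h], v[h]: (S, W, A) utilities; P[h]: (S, W, A, S) transitions.
    Assumes the optimal_signaling_scheme LP helper from the Section 2.1 sketch is in scope.
    """
    H = len(mu)
    S, W, A = u[0].shape
    V_next = np.zeros(S)                       # V_{H+1}^* = 0
    policies = [None] * H
    for h in range(H - 1, -1, -1):
        Q = v[h] + P[h] @ V_next               # Q_h^*(s,w,a) = v_h + E_{s' ~ P_h}[V_{h+1}^*(s')]
        V = np.zeros(S)
        pi_h = np.zeros((S, W, A))
        for s in range(S):
            # V_h^*(s) = max over persuasive schemes of <Q_h^*, mu_h x pi_h>(s): one LP per state
            pi_h[s], V[s] = optimal_signaling_scheme(mu[h], u[h][s], Q[s])
        policies[h], V_next = pi_h, V
    return policies, V_next                    # V_next now holds V_1^*(.)
```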

4 Reinforcement Learning in MPPs and the Optimism-Pessimism Principle

In this section, we study online reinforcement learning (RL) for learning the optimal signaling policy in an MPP. Here the learner only knows the utility functions of the receivers (the receivers’ utility is known to the sender because the pricing rules are usually transparent, and some are even set by the platform; for example, a ride-sharing platform usually sets per-hour or per-mile payment rules for the drivers) and has no prior knowledge about the prior distribution, the sender’s utility function, or the transition kernel. While the computation of the optimal policy in MPPs in Section 3.3 may appear analogous to that of a standard MDP, as we will see, the corresponding RL problem turns out to be significantly different, partially due to the presence of the stream of receivers, whose decisions are self-interested and not under the learner’s control. This makes learning challenging because, if the receivers’ incentives are not carefully addressed, they may take actions that are extremely undesirable to the learner. This concern leads to the integration of the pessimism principle into our learning algorithm design. Specifically, our learner will be optimistic in its estimation of the $Q$-function, similar to many other RL algorithms, in order to encourage exploration. But more interestingly, it will be pessimistic about the uncertainty in the estimation of the prior distributions in order to prepare for detrimental equilibrium behavior. These dual considerations lead to an interesting optimism-pessimism principle (OPP) for learning MPPs under the online setting. From a technical point of view, our main contribution is to prove how this mixture of the optimism and pessimism principles can still lead to no-regret algorithms, and this proof crucially hinges on a robustness property of the MPP model which we develop and carefully apply in the regret analysis. To the best of our knowledge, this is the first time that OPP is employed to learn the optimal information design in an online fashion. We prove that it can not only satisfy the incentive constraints but also guarantee efficiency in terms of both sample complexity and computational complexity.

In order to convey our key design ideas before diving into the intricate technicalities, this section singles out two representative special cases of the online sequential information design problem. In a nutshell, we present a learning algorithm, OP4, that combines the principles of optimism and pessimism such that the sender can learn to persuade without initially knowing her own utility or the prior distribution of outcomes. In the tabular MPP, we illustrate the unique challenges of learning to persuade arising from the environment state dynamically evolving according to a Markov process. Through contextual Bayesian persuasion, we showcase the techniques necessary for learning to persuade with infinitely many states (i.e., contexts) and outcomes. We omit most proofs in this section to focus on the high-level ideas, because the proof for the general setting presented in Section 5 suffices to imply all results for the two special cases here.

4.1 Learning Optimal Policies in MPPs: Setups and Benchmarks

We consider the episodic reinforcement learning problem in finite-horizon MPPs. Different from the full-knowledge setting in Section 3.3, the transition kernel, the sender’s utility function and the outcome prior at each step of the episode, $\{P_{h},v_{h},\mu_{h}\}_{h\in[H]}$, are all unknown. The sender has to learn the optimal signaling policy by interacting with the environment as well as a stream of receivers over $T$ episodes. For each $t\in[T]=\{1,\cdots,T\}$, at the beginning of the $t$-th episode, given the data $\{(c_{h}^{\tau},s_{h}^{\tau},\omega_{h}^{\tau},a_{h}^{\tau},v_{h}^{\tau})\}_{h\in[H],\tau\in[t-1]}$, the adversary picks the context sequence $\{c_{h}^{t}\}_{h\in[H]}$ as well as the initial state $s_{1}^{t}$, and the agent accordingly chooses a signaling policy $\bm{\pi}^{t}=\{\pi_{h}^{t}\}_{h\in[H]}$. Here $v_{h}^{\tau}$ is the utility collected by the sender at step $h$ of episode $\tau$.

Regret

To evaluate the online learning performance, given the ground-truth outcome priors $\bm{\mu}^{*}=\{\mu_{h}^{*}\}_{h\in[H]}$, we define the sender’s total (expected) regret over all $T$ episodes as

$$\operatorname{Reg}(T,\bm{\mu}^{*})\coloneqq\sum_{t=1}^{T}\left[V^{*}_{1}(s_{1}^{t};C^{t})-V^{\bm{\pi}^{t}}_{1}(s_{1}^{t};C^{t})\right].\qquad(4.1)$$

Note that $\bm{\pi}^{t}$ may not always be feasible under $\bm{\mu}^{*}$ but only persuasive with high probability, so the corresponding regret under $\bm{\pi}^{t}$ should also be understood in the high-probability sense.

It turns out that in certain degenerate cases it is impossible to achieve a sublinear regret. For example, if the set of possible posterior outcome distributions that induce some $a\in\mathcal{A}$ as the optimal receiver action has zero measure, then such a posterior within a zero-measure set can never be exactly induced by a signaling scheme without precise knowledge of the prior. Thus, the regret could be $\Omega(T)$ if the receiver cannot be persuaded to play such an action $a$. Therefore, to guarantee no regret, it is necessary to introduce a certain regularity assumption on the MPP instance. Towards that end, we shall assume that the receivers’ utility $u$ and prior $\mu$ at any step of the MPP instance always satisfy the minor assumption of $(p_{0},D)$-regularity defined below.

Regularity Conditions

An instance satisfies $(p_{0},D)$-regularity if, for any feasible state $s\in{\mathcal{S}}$ and context $c\in\mathcal{C}$, we have

$$\mathbb{P}_{\omega\sim\mu(\cdot|c)}\left[\omega\in\mathcal{W}_{s,a}(D)\right]\geq p_{0},\quad\forall a\in\mathcal{A},$$

where $\mu$ is the ground-truth prior of outcomes and $\mathcal{W}_{s,a}(D)\triangleq\{\omega:u(s,\omega,a)-u(s,\omega,a^{\prime})\geq D,\forall a^{\prime}\in\mathcal{A}\setminus\{a\}\}$ is the set of outcomes $\omega$ for which the action $a$ is optimal for the receiver by a margin of at least $D$ at state $s$. In other words, an instance is $(p_{0},D)$-regular if every action $a$ has probability at least $p_{0}$, under the randomness of the outcome, of being strictly better than all other actions by at least $D$. This regularity condition is analogous to a regularity condition of Zu et al. [61] but generalizes to the infinitely many outcomes we consider here.
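For a finite-outcome instance, this condition is straightforward to verify numerically. The helper below is our own illustrative check, not part of the paper's algorithms.

```python
import numpy as np

def check_regularity(mu, u, p0, D):
    """Check (p0, D)-regularity of a tabular instance at every state.

    mu: (W,) prior;  u: (S, W, A) receiver utility.
    Returns True iff, for every state s and action a, the outcomes where a beats
    every other action by at least D carry prior mass at least p0.
    """
    S, W, A = u.shape
    for s in range(S):
        for a in range(A):
            margin = u[s, :, a, None] - np.delete(u[s], a, axis=1)  # (W, A-1) utility gaps
            mass = mu[(margin >= D).all(axis=1)].sum()              # prior mass of W_{s,a}(D)
            if mass < p0:
                return False
    return True
```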

4.2 Algorithm: Optimism-Pessimism Principle for Persuasion Process (OP4)

The learning task in MPPs involves two intertwined challenges: (1) How to persuade the receiver to take desired actions under an unknown $\mu_{h}$? (2) Which action to persuade the receiver to take in order to explore the underlying environment? For the first challenge, due to having only finite data, it is impossible to perfectly recover $\mu_{h}$. We can only hope to construct an approximately accurate estimator of $\mu_{h}$. To guard against potentially detrimental equilibrium behavior of the receivers due to the prior estimation error, we propose to adopt the pessimism principle. Specifically, before each episode, we conduct uncertainty quantification for the estimator of the prior distributions, which enables us to construct a confidence region containing the true prior with high probability. Then we propose to find the signaling policy within a pessimistic candidate set: signaling policies that are simultaneously persuasive with respect to all prior distributions in the confidence region. When the confidence region is valid, such a pessimism principle ensures that the executed signaling policy is always persuasive with respect to the true prior. Furthermore, to address the second challenge, we adopt the celebrated principle of optimism in the face of uncertainty [28], which has played a key role in the online RL literature. The main idea of this principle is that the uncertainty of the $Q$-function estimates essentially reflects our uncertainty about the underlying model. By adding the uncertainty as a bonus function, we encourage actions with high uncertainty to be recommended, and thus taken by the receiver when persuasiveness is satisfied. We then fuse the two principles into the OP4 algorithm in Algorithm 1.

Algorithm 1 OP4 Overview
1:  for episode $t=1,\dots,T$ do
2:     Receive the initial state $s_{1}^{t}$ and context $C^{t}=\{c_{h}^{t}\}_{h=1}^{H}$.
3:     For each step $h\in[H]$, estimate the prior $\mu_{h}^{t}$ along with the confidence region $\mu_{\mathcal{B}_{h}^{t}}$, and construct an optimistic $Q$-function $Q_{h}^{t}$ iteratively with the value function $V_{h}^{t}$.
4:     for step $h=1,\ldots,H$ do
5:        Choose the robust signaling scheme $\pi_{h}^{t}\in\arg\max_{\pi_{h}\in\operatorname{Pers}(\mu_{\mathcal{B}_{h}^{t}},u_{h})}\big\langle Q_{h}^{t},\mu_{h}^{t}\otimes\pi_{h}\big\rangle_{\Omega\times\mathcal{A}}(s_{h}^{t};C^{t})$.
6:        Observe the state $s_{h}^{t}$ and outcome $\omega_{h}^{t}$, and accordingly recommend action $a_{h}^{t}\sim\pi^{t}_{h}(\cdot|s_{h}^{t},\omega_{h}^{t})$ to the receiver.
7:     end for
8:  end for

Pessimism to Induce Robust Equilibrium Behavior

From the data in past episodes, the sender can estimate the mean of the prior as well as obtain a confidence region through concentration inequalities. Given this partial knowledge of the prior distribution, the sender needs to design a signaling scheme that works in the face of any possible prior in the confidence region, in order to ensure the receiver will take the recommended action with high probability. Specifically, we let ${\mathrm{B}}_{\Sigma}(\theta,\beta)\coloneqq\{\theta^{\prime}:\|\theta^{\prime}-\theta\|_{\Sigma}\leq\beta\}$ denote the closed ball in the $\left\|\cdot\right\|_{\Sigma}$ norm of radius $\beta>0$ centered at $\theta\in\mathbb{R}^{d_{\theta}}$. For any set $\mathcal{B}\subseteq\mathbb{R}^{d_{\theta}}$, we let $\operatorname{Pers}(\mu_{\mathcal{B}},u)$ denote the set of signaling policies that are simultaneously persuasive under all weight vectors $\theta\in\mathcal{B}$: $\operatorname{Pers}(\mu_{\mathcal{B}},u)\coloneqq\bigcap_{\theta\in\mathcal{B}}\operatorname{Pers}(\mu_{\theta},u)$. For any non-empty set $\mathcal{B}$, the set $\operatorname{Pers}(\mu_{\mathcal{B}},u)$ is convex, since it is an intersection of the convex sets $\operatorname{Pers}(\mu_{\theta},u)$, and is non-empty, since it must contain the full-information signaling scheme. We note that since $\operatorname{Pers}(\mu_{\mathcal{B}},u)$ is a convex set, we can solve a linear optimization over the policies in $\operatorname{Pers}(\mu_{\mathcal{B}},u)$ in polynomial time (see, e.g., [61]).
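As a concrete illustration of this robustification in the finite-outcome case, the sketch below (our own code, with the confidence region replaced by a finite set of candidate priors, e.g. its extreme points; this finite approximation is our simplification, not the paper's exact construction) maximizes the optimistic objective over schemes that are persuasive under every candidate prior.

```python
import numpy as np
from scipy.optimize import linprog

def robust_signaling_scheme(mu_hat, candidate_priors, u_s, Q_s):
    """Sketch of step 5 of OP4 at a single state (tabular case).

    mu_hat: (W,) estimated prior;  candidate_priors: non-empty list of (W,) priors;
    u_s: (W, A) receiver utility at state s;  Q_s: (W, A) optimistic Q-values at state s.
    Maximizes <Q, mu_hat x pi>(s) over pi persuasive under every candidate prior.
    """
    W, A = u_s.shape
    c = -(mu_hat[:, None] * Q_s).reshape(-1)             # maximize sum mu_hat pi Q
    A_ub, b_ub = [], []
    for mu in candidate_priors:                           # persuasiveness under each prior
        for a in range(A):
            for a_prime in range(A):
                if a_prime == a:
                    continue
                row = np.zeros(W * A)
                row[a::A] = -mu * (u_s[:, a] - u_s[:, a_prime])
                A_ub.append(row)
                b_ub.append(0.0)
    A_eq = np.kron(np.eye(W), np.ones(A))                 # each pi(.|w) sums to one
    res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  A_eq=A_eq, b_eq=np.ones(W), bounds=(0, 1), method="highs")
    return res.x.reshape(W, A), -res.fun
```

The program is always feasible because full information revelation (each row of $\pi$ concentrated on the receiver's best action for that outcome) satisfies every persuasiveness constraint.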

Optimism to Encourage Exploration

In order to balance exploration and exploitation, we adopt the principle of optimism in the face of uncertainty in the value iteration algorithm based on the Bellman equation, following a line of work in online RL such as $Q$-learning with UCB exploration [24], UCBVI [4], and LSVI-UCB [25] (also see [50, 56, 51] and the references therein). The additional UCB bonus on the $Q$-value encourages exploration and has been shown to be a provably efficient online method to improve policies in MDPs. Moreover, this method not only works for the simple tabular setting, but also generalizes to settings with infinite state spaces by exploiting linearity of the $Q$-function and a regularized least-squares program to determine the optimal estimate of the $Q$-value. In fact, within our framework, we could obtain efficient learning results in the infinite-state-space setting through other optimism-based online RL methods and general function approximators, such as linear mixture MDPs [3, 60, 9, 59], kernel approximation [56], or bilinear classes [13].

To provide a concrete picture of the learning process, we instantiate the OP4 algorithm in two special cases and showcase our key ideas and techniques before delving into the more involved analysis of the generalized MPP setting. Nevertheless, we remark that whether the problem instance is tabular or in the form of linear or generalized linear approximations is not essential and not the focus of our study. OP4 itself only relies on two things, i.e., the uncertainty quantification for the $Q$-function and for the prior estimation. So even the model-free RL framework can be replaced by model-based RL, as we can just construct confidence regions for the transition models.

4.3 Warm-up I: Reinforcement Learning in the Tabular MPP

We first consider MPPs in the tabular setting with finite states and outcomes, as described in Section 3.1. In this case, the prior on outcomes at each step degenerates to an unknown but fixed discrete distribution independent of the context. As linear parameterization is not required for a discrete probability distribution, the algorithm can simply update the empirical estimate of $\mu_{h}^{t}$ through counting. Similarly, the transition kernel $P_{h}^{*}$ is estimated through the occurrence of observed samples, and we use this estimated transition to compute the $Q$-function $\widehat{Q}_{h}^{t}$ from the observed utility and the estimated value function at the next step, according to the Bellman equation. To be specific, for each $s\in{\mathcal{S}},\omega\in\Omega,a\in\mathcal{A}$, $\mu_{h}^{t}$ and $\widehat{Q}_{h}^{t}$ are estimated through the following equations:

$$\mu_{h}^{t}(\omega)\leftarrow\frac{\lambda/|\Omega|+N_{t,h}(\omega)}{\lambda+t-1},$$
$$\widehat{Q}_{h}^{t}(s,\omega,a)\leftarrow\frac{1}{\lambda+N_{t,h}(s,\omega,a)}\sum_{\tau\in[t-1]}\big\{\mathbb{I}(s_{h}^{\tau}=s,\omega_{h}^{\tau}=\omega,a_{h}^{\tau}=a)\big[v_{h}^{\tau}+V_{h+1}^{t}(s_{h+1}^{\tau})\big]\big\},$$

where $N_{t,h}(\omega)=\sum_{\tau\in[t-1]}\mathbb{I}(\omega_{h}^{\tau}=\omega)$ and $N_{t,h}(s,\omega,a)=\sum_{\tau\in[t-1]}\mathbb{I}(s_{h}^{\tau}=s,\omega_{h}^{\tau}=\omega,a_{h}^{\tau}=a)$ respectively count the effective number of samples in which the sender has observed the outcome $\omega$, or the combination $(s,\omega,a)$, and $\lambda>0$ is a constant for regularization.

In our learning algorithm, we determine the radius of the confidence region $\mathcal{B}_{h}^{t}$ for $\mu_{h}^{t}$ according to the confidence bound $\epsilon_{h}^{t}=O(\sqrt{\log(HT)/t})$. Moreover, we add a UCB bonus term of the form $\rho/\sqrt{N_{t,h}(s,\omega,a)}$ to $\widehat{Q}_{h}^{t}$ to obtain the optimistic $Q$-function $Q_{h}^{t}$. Then, the algorithm selects a signaling scheme that is robustly persuasive and maximizes an optimistic estimate of the $Q$-function with respect to the current prior estimate $\mu_{h}^{t}$. Finally, it makes an action recommendation $a_{h}^{t}$ using this signaling scheme, given the state and outcome realization $\{s_{h}^{t},\omega_{h}^{t}\}$.
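A minimal sketch of these counting-based estimates at a fixed step $h$ is given below; the constants in the confidence radius and the bonus, and the function interface, are illustrative choices of ours rather than the exact quantities used in the analysis.

```python
import numpy as np

def tabular_estimates(history, V_next, n_W, n_S, n_A, H, lam=1.0, rho=1.0):
    """Counting-based OP4 estimators at a fixed step h before episode t (sketch only).

    history: past tuples (s, w, a, v, s_next) observed at this step in episodes 1..t-1.
    V_next:  (n_S,) optimistic value estimate for step h+1.
    Returns the prior estimate mu_t, its confidence radius eps_t, and the optimistic Q_t.
    """
    t = len(history) + 1
    N_w = np.zeros(n_W)
    N_swa = np.zeros((n_S, n_W, n_A))
    Q_sum = np.zeros((n_S, n_W, n_A))
    for (s, w, a, v, s_next) in history:
        N_w[w] += 1.0
        N_swa[s, w, a] += 1.0
        Q_sum[s, w, a] += v + V_next[s_next]
    mu_t = (lam / n_W + N_w) / (lam + t - 1)                # regularized empirical prior
    eps_t = np.sqrt(np.log(max(H * t, 2)) / max(t - 1, 1))  # O(sqrt(log(HT)/t)) radius (illustrative)
    Q_hat = Q_sum / (lam + N_swa)
    bonus = rho / np.sqrt(np.maximum(N_swa, 1.0))           # UCB bonus to encourage exploration
    Q_t = np.minimum(Q_hat + bonus, float(H))               # optimistic Q, clipped at H
    return mu_t, eps_t, Q_t
```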

Theorem 4.1.

Let $\epsilon_{h}^{t}=\widetilde{O}(\sqrt{1/t})$ and $\rho=\widetilde{O}(|S|\cdot|\Omega|\cdot|A|H)$. Then under $(p_{0},D)$-regularity, with probability at least $1-3H^{-1}T^{-1}$, OP4 has regret of order $\widetilde{O}\big(|C|(|S|\cdot|\Omega|\cdot|A|)^{3/2}\cdot H^{2}\sqrt{T}/(p_{0}D)\big)$ in tabular MPPs.

To bound the regret of OP4, we have to consider the regret arising from its different procedures. A formal decomposition of the regret is given in Lemma 6.1. Separately, we upper bound the errors incurred from estimating the QQ-function (Lemma 6.2), from the randomness in the realized outcome, action and next state (Lemma A.5), as well as from estimating the prior of the outcome and choosing a signaling scheme that is robustly persuasive for a subset of priors (Lemmas 6.3 and 6.4). As the two warm-up models are special cases of the general MPP, the proofs of the above properties follow from those for the general MPP setting in Section 6, and are thus omitted here.

4.4 Warm-up II: Reinforcement Learning in Contextual Bayesian Persuasion

We now move to another special case with H=1H=1, such that the MPP problem reduces to a contextual-bandit-like problem, where transitions no longer exist. Given a context cc and a persuasive signaling policy π\pi, the value function is simply the sender’s expected utility for any s𝒮s\in{\mathcal{S}},

Vπ(s;c)ωaAμ(ω|c)π(a|s,ω)v(s,ω,a)dω.V^{\pi}(s;c)\coloneqq\int_{\omega}\sum_{a\in A}\mu(\omega|c)\pi(a|s,\omega)v(s,\omega,a){\mathrm{d}}\omega.

The sender’s optimal expected utility is defined as V(s;c)maxπPers(μ(|c),u)Vπ(s;c).V^{*}(s;c)\coloneqq\max_{\pi\in\operatorname{Pers}(\mu(\cdot|c),u)}V^{\pi}(s;c).
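For finite outcome and action sets, the inner maximization over Pers(μ(|c),u)\operatorname{Pers}(\mu(\cdot|c),u) is a linear program in the variables π(a|s,ω)\pi(a|s,\omega). The sketch below is our own illustration (the function name and array layout are hypothetical, and the state argument is suppressed); it encodes the obedience constraints that make the recommended action a best response for the receiver and solves the LP with scipy.optimize.linprog.

```python
import numpy as np
from scipy.optimize import linprog

def optimal_persuasive_scheme(mu, u, v):
    """Solve max_{pi in Pers(mu, u)} sum_{w,a} mu(w) pi(a|w) v(w, a) as an LP.
    mu: (W,) prior over outcomes; u, v: (W, A) receiver / sender utilities.
    Returns pi of shape (W, A) whose rows sum to one."""
    W, A = u.shape
    n = W * A                                   # variables pi(a|w), flattened
    idx = lambda w, a: w * A + a

    # linprog minimizes, so negate the sender's expected utility.
    c = np.zeros(n)
    for w in range(W):
        for a in range(A):
            c[idx(w, a)] = -mu[w] * v[w, a]

    # Obedience (persuasiveness): for each recommended a and deviation a2,
    # sum_w mu(w) pi(a|w) [u(w,a) - u(w,a2)] >= 0, written as A_ub x <= 0.
    rows = []
    for a in range(A):
        for a2 in range(A):
            if a2 == a:
                continue
            row = np.zeros(n)
            for w in range(W):
                row[idx(w, a)] = -mu[w] * (u[w, a] - u[w, a2])
            rows.append(row)

    # Each outcome maps to a distribution over recommendations.
    A_eq = np.zeros((W, n))
    for w in range(W):
        for a in range(A):
            A_eq[w, idx(w, a)] = 1.0
    b_eq = np.ones(W)

    A_ub = np.array(rows) if rows else None
    b_ub = np.zeros(len(rows)) if rows else None
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=(0.0, 1.0))
    return res.x.reshape(W, A)
```

OP4 uses the same LP structure, but with an optimistic QQ-estimate in place of vv and with the persuasiveness constraints imposed for every prior in the confidence region rather than for a single μ\mu.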

Meanwhile, we consider the general setting where the outcome ω\omega is a continuous random variable governed by a generalized linear model. To be specific, the prior μ\mu conditioned on the context cc has mean f(ϕ(c)θ)f(\phi(c)^{\top}\theta^{*}). For the prior μ\mu and link function ff, we assume smoothness of the prior and bounded derivatives of the link function:

Assumption 4.2.

There exists a constant Lμ>0L_{\mu}>0 such that for any parameter θ1,θ2\theta_{1},\theta_{2}, we have μθ1(|c)μθ2(|c)1Lμf(ϕ(c)θ1)f(ϕ(c)θ2)\big{\|}\mu_{\theta_{1}}(\cdot|c)-\mu_{\theta_{2}}(\cdot|c)\big{\|}_{1}\leq L_{\mu}\big{\|}f\big{(}\phi(c)^{\top}\theta_{1}\big{)}-f\big{(}\phi(c)^{\top}\theta_{2}\big{)}\big{\|} for any given context cc.

Assumption 4.3.

The link function ff is either monotonically increasing or decreasing. Moreover, there exist absolute constants 0<κ<K<0<\kappa<K<\infty and 0<M<0<M<\infty such that κ|f(z)|K\kappa\leq|f^{\prime}(z)|\leq K and |f′′(z)|M|f^{\prime\prime}(z)|\leq M for all |z|ΦLθ|z|\leq\Phi L_{\theta}.

It is natural to assume a Lipschitz property of the distribution in Assumption 4.2. For instance, Gaussian distributions and uniform distributions satisfy this property. Assumption 4.3 is standard in the literature [18, 52, 32]. Two example link functions are the identity map f(z)=zf(z)=z and the logistic map f(z)=1/(1+ez)f(z)=1/(1+e^{-z}) with bounded zz. It is easy to verify that both maps satisfy this assumption.
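As a quick numerical sanity check (our own illustration, not part of the analysis), the constants κ,K,M\kappa,K,M in Assumption 4.3 for the logistic map can be estimated on a bounded interval; the values of Φ\Phi and LθL_{\theta} below are placeholders.

```python
import numpy as np

# Numerically estimate kappa, K, M for the logistic link f(z) = 1/(1 + e^{-z})
# on |z| <= Phi * L_theta (placeholder bounds on features and parameters).
Phi, L_theta = 1.0, 1.0
z = np.linspace(-Phi * L_theta, Phi * L_theta, 10001)
f = 1.0 / (1.0 + np.exp(-z))
f1 = f * (1.0 - f)                 # f'(z) for the logistic map
f2 = f1 * (1.0 - 2.0 * f)          # f''(z) for the logistic map
kappa, K, M = f1.min(), f1.max(), np.abs(f2).max()
print(kappa, K, M)                 # roughly 0.20, 0.25, 0.09 when Phi * L_theta = 1
```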

Unlike the tabular setting, we are now unable to use counting-based estimators to keep track of distributions over the possibly infinite sets of states and outcomes. Instead, we resort to function approximation techniques and estimate the linear parameters θ\theta^{*} and γ\gamma^{*}. In each episode, OP4 updates the estimates and confidence regions of θt\theta^{t} and γt\gamma^{t}, with which it determines the outcome prior under pessimism and the sender's utility under optimism. To be specific, the update of θt\theta^{t} solves a constrained least-squares problem and the update of γt\gamma^{t} solves a regularized one:

θt\displaystyle\theta^{t} argminθLθτ[t1][ωτf(ϕ(cτ)θ)]2,\displaystyle\leftarrow\arg\min_{\|\theta\|\leq L_{\theta}}\sum_{\tau\in[t-1]}\big{[}\omega^{\tau}-f(\phi(c^{\tau})^{\top}\theta)\big{]}^{2},
γt\displaystyle\gamma^{t} argminγdψτ[t1][vτψ(sτ,ωτ,aτ)γ]2+λγ2.\displaystyle\leftarrow\arg\min_{\gamma\in\mathbb{R}^{d_{\psi}}}\sum_{\tau\in[t-1]}\big{[}v^{\tau}-\psi(s^{\tau},\omega^{\tau},a^{\tau})^{\top}\gamma\big{]}^{2}+\lambda\|\gamma\|^{2}.

We then estimate the prior by setting μt(|c)\mu^{t}(\cdot|c) to the distribution of f(ϕ(c)θt)+zf\big{(}\phi(c)^{\top}\theta^{t}\big{)}+z and estimate the sender's utility by setting vt(,,)=ψ(,,)γtv^{t}(\cdot,\cdot,\cdot)=\psi(\cdot,\cdot,\cdot)^{\top}\gamma^{t}. On the one hand, to encourage exploration, OP4 adds a UCB bonus term of the form ρψ(,,)(Γt)1\rho\left\|\psi(\cdot,\cdot,\cdot)\right\|_{(\Gamma^{t})^{-1}} to the QQ-function, where Γt=λIdψ+τ[t1]ψ(sτ,ωτ,aτ)ψ(sτ,ωτ,aτ)\Gamma^{t}=\lambda I_{d_{\psi}}+\sum_{\tau\in[t-1]}\psi(s^{\tau},\omega^{\tau},a^{\tau})\psi(s^{\tau},\omega^{\tau},a^{\tau})^{\top} is the Gram matrix of the regularized least-squares problem and ρ\rho is a scalar bonus multiplier. This is a common technique for linear bandits. On the other hand, OP4 determines the confidence region of θt\theta^{t} with radius β\beta, and ensures that the chosen signaling scheme is robustly persuasive for any possible (worst-case) prior induced by linear parameters θ\theta in this region. Combining optimism and pessimism, OP4 picks, among the robustly persuasive schemes, the one that maximizes the sender's optimistic utility.
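To make the updates concrete, the following is a minimal numpy sketch of the regularized least-squares step for γt\gamma^{t}, the Gram matrix Γt\Gamma^{t}, the bonus ρψ(Γt)1\rho\|\psi\|_{(\Gamma^{t})^{-1}}, and a simple constrained GLM fit for θt\theta^{t}. The function names and the use of projected gradient descent for the constrained fit are our own illustrative choices, and the sketch assumes a scalar outcome.

```python
import numpy as np

def ridge_utility_fit(Psi, v, lam):
    """gamma^t = argmin_gamma sum_tau (v^tau - psi_tau^T gamma)^2 + lam ||gamma||^2.
    Psi stacks psi(s^tau, w^tau, a^tau) row-wise; v stacks observed utilities."""
    d = Psi.shape[1]
    Gamma = lam * np.eye(d) + Psi.T @ Psi        # Gram matrix Gamma^t
    gamma = np.linalg.solve(Gamma, Psi.T @ v)    # closed-form ridge solution
    return gamma, Gamma

def ucb_bonus(psi, Gamma, rho):
    """Exploration bonus rho * ||psi||_{Gamma^{-1}} added to the estimated Q-value."""
    return rho * np.sqrt(psi @ np.linalg.solve(Gamma, psi))

def glm_prior_fit(Phi_feats, omega, f, f_prime, L_theta, n_steps=200, lr=0.05):
    """Sketch of the constrained fit for theta^t: projected gradient descent on
    the squared residuals (omega^tau - f(phi(c^tau)^T theta))^2."""
    theta = np.zeros(Phi_feats.shape[1])
    for _ in range(n_steps):
        z = Phi_feats @ theta
        grad = -2.0 * Phi_feats.T @ ((omega - f(z)) * f_prime(z))
        theta -= lr * grad
        norm = np.linalg.norm(theta)
        if norm > L_theta:                        # project back onto ||theta|| <= L_theta
            theta *= L_theta / norm
    return theta
```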

Theorem 4.4.

Under (p0,D)(p_{0},D)-regularity and Assumptions 4.2 and 4.3, there exist absolute constants C1,C2>0C_{1},C_{2}>0 such that, if we set λ=max{1,Ψ2}\lambda=\max\{1,\Psi^{2}\}, β=C1(1+κ1K+M+dϕσ2log(T))\beta=C_{1}(1+\kappa^{-1}\sqrt{K+M+d_{\phi}\sigma^{2}\log(T)}), and ρ=C2dψlog(4dψΨ2T3)\rho=C_{2}d_{\psi}\sqrt{\log(4d_{\psi}\Psi^{2}T^{3})}, then with probability at least 13T11-3T^{-1}, OP4 has regret of order O~(dϕdψ3T/(p0D))\widetilde{O}\big{(}d_{\phi}\sqrt{d_{\psi}^{3}}\sqrt{T}/(p_{0}D)\big{)} in contextual Bayesian persuasion problems.

Since we estimate the prior by computing an estimator θt\theta^{t}, we evaluate the persuasiveness of OP4 through the probability that θ\theta^{*} lies in the confidence region centered at θt\theta^{t} with radius β=O(dϕlog(T))\beta=O(\sqrt{d_{\phi}\log(T)}) in the weighted norm. Due to the smoothness of the prior and the assumption on the link function, the error of the estimated prior is bounded by the product of β\beta and the weighted norm of the feature vector ϕ(ct)(Σt)1=O(1/t)\|\phi(c^{t})\|_{(\Sigma^{t})^{-1}}=O(1/\sqrt{t}), which matches the rate of ϵt\epsilon^{t} in the tabular MPP case. Moreover, compared to Li et al. [31], we do not require any regularity condition on Σt\Sigma^{t}, since we add the constant matrix Φ2I\Phi^{2}I to the Gram matrix Σt\Sigma^{t}, which ensures that Σt\Sigma^{t} is always lower bounded by Φ2I\Phi^{2}I with Φ2>0\Phi^{2}>0. The proofs of persuasiveness and sublinear regret for the contextual bandit setting can be viewed as a direct reduction from the MPP case with horizon H=1H=1. We decompose the regret in the same way as in Lemma 6.1 for MPPs and then upper bound each term.

5 No-Regret Learning in the General Markov Persuasion Process

In this section, we present the full version of the OP4 algorithm for MPPs and show that it is persuasive with high probability and meanwhile achieves average regret O~(dϕdψ3/2H2T/(p0D))\widetilde{O}\big{(}d_{\phi}\cdot d_{\psi}^{3/2}H^{2}\sqrt{T}/(p_{0}D)\big{)}.

In the general MPP setting with linear utility and transition, a crucial property is that the QQ-functions under any signaling policy are always linear in the feature map ψ\psi (akin to linear MDPs [25]). Therefore, when designing learning algorithms, it suffices to focus on linear QQ-functions. In our OP4 algorithm, we iteratively fit the optimal QQ-function, which is parameterized by qhq_{h}^{*} as ψ(,,)qh\psi(\cdot,\cdot,\cdot)^{\top}q_{h}^{*} at each step h[H]h\in[H]. OP4 learns the QQ-functions of MPPs and the prior of persuasion states simultaneously. It operates similarly to the procedures for tabular MPPs and contextual Bayesian persuasion. At the tt-th episode, given the historical data {(chτ,shτ,ωhτ,ahτ,vhτ)}h[H],τ[t1]\{(c_{h}^{\tau},s_{h}^{\tau},\omega_{h}^{\tau},a_{h}^{\tau},v_{h}^{\tau})\}_{h\in[H],\tau\in[t-1]}, we can estimate the unknown vectors θh,qh,h[H]\theta_{h}^{*},q_{h}^{*},\forall h\in[H] by solving the following constrained or regularized least-squares problems:

θht\displaystyle\theta_{h}^{t} argminθhLθτ[t1][ωhτf(ϕ(chτ)θh)]2,\displaystyle\leftarrow\mathop{\mathrm{argmin}}_{\|\theta_{h}\|\leq L_{\theta}}\sum_{\tau\in[t-1]}\big{[}\omega_{h}^{\tau}-f(\phi(c_{h}^{\tau})^{\top}\theta_{h})\big{]}^{2},
qht\displaystyle q_{h}^{t} argminqdψτ[t1][vhτ+Vh+1t(sh+1τ;Ct)ψ(shτ,ωhτ,ahτ)q]2+λq2.\displaystyle\leftarrow\mathop{\mathrm{argmin}}_{q\in\mathbb{R}^{d_{\psi}}}\sum_{\tau\in[t-1]}\big{[}v_{h}^{\tau}+V_{h+1}^{t}(s_{h+1}^{\tau};C^{t})-\psi(s_{h}^{\tau},\omega_{h}^{\tau},a_{h}^{\tau})^{\top}q\big{]}^{2}+\lambda\|q\|^{2}.

Additionally, Vh+1tV_{h+1}^{t} is the estimated value function with the observed context CtC^{t} at episode tt, which we describe formally later. This estimator is used to replace the unknown transition PhP_{h} and distribution νh\nu_{h} in equation (3.2). Moreover, we update the estimates of the outcome prior μht\mu_{h}^{t} and the QQ-function QhtQ_{h}^{t} accordingly. Here OP4 adds a UCB bonus to QhtQ_{h}^{t} to encourage exploration. The formal description is given in Algorithm 2.
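The following is a minimal sketch (ours, not the paper's implementation) of one backward step of this procedure: ridge-fit qhtq_{h}^{t} on the targets vhτ+Vh+1t(sh+1τ)v_{h}^{\tau}+V_{h+1}^{t}(s_{h+1}^{\tau}), then form the optimistic QQ-value with the bonus ρψ(Γht)1\rho\|\psi\|_{(\Gamma_{h}^{t})^{-1}} and clip at HH, mirroring lines 10-12 of Algorithm 2; the function name fit_q_step is hypothetical, and the robustly persuasive maximization that defines VhtV_{h}^{t} is left abstract.

```python
import numpy as np

def fit_q_step(Psi_h, v_h, V_next, lam, rho, H):
    """One backward step at stage h.
    Psi_h: (n, d_psi) features psi(s, w, a) from past episodes at step h;
    v_h:   (n,) observed sender utilities;
    V_next:(n,) estimated values V_{h+1}^t at the observed next states."""
    d = Psi_h.shape[1]
    Gamma = lam * np.eye(d) + Psi_h.T @ Psi_h     # Gamma_h^t
    iota = Psi_h.T @ (v_h + V_next)               # iota_h^t
    q = np.linalg.solve(Gamma, iota)              # q_h^t = (Gamma_h^t)^{-1} iota_h^t

    def Q(psi):
        """Optimistic, clipped Q_h^t evaluated at the feature vector psi(s, w, a)."""
        bonus = rho * np.sqrt(psi @ np.linalg.solve(Gamma, psi))
        return float(min(psi @ q + bonus, H))     # min{., H} as in Algorithm 2
    return Q
```

Under pessimism, VhtV_{h}^{t} is then obtained by maximizing this optimistic QQ-value over the signaling schemes that are persuasive for every prior in ht\mathcal{B}_{h}^{t}, as in line 12 of Algorithm 2.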

Likewise, the MPP setting inherits the regularity conditions and Assumptions 4.2 and 4.3 from the previous sections. Combining the insights from both the tabular MPPs and contextual Bayesian persuasion, we can show that OP4 is persuasive and guarantees sublinear regret with high probability for general MPPs.

Theorem 5.1.

Under (p0,D)(p_{0},D)-regularity and Assumptions 4.2 and 4.3, there exist absolute constants C1,C2>0C_{1},C_{2}>0 such that, if we set λ=max{1,Ψ2}\lambda=\max\{1,\Psi^{2}\}, β=C1(1+κ1K+M+dϕσ2log(HT))\beta=C_{1}(1+\kappa^{-1}\sqrt{K+M+d_{\phi}\sigma^{2}\log(HT)}), and ρ=C2dψHlog(4dψΨ2H2T3)\rho=C_{2}d_{\psi}H\sqrt{\log(4d_{\psi}\Psi^{2}H^{2}T^{3})}, then with probability at least 13H1T11-3H^{-1}T^{-1}, OP4 has regret of order O~(dϕdψ3/2H2T/(p0D))\widetilde{O}\big{(}d_{\phi}d_{\psi}^{3/2}H^{2}\sqrt{T}/(p_{0}D)\big{)}.

Recall that the novelty of OP4 is that we adopt pessimism and optimism to induce robust equilibrium behavior and encourage exploration simultaneously. Specifically, pessimism tackles the uncertainty in the prior estimation by selecting a signaling policy that is persuasive w.r.t. all the priors in the confidence region, while optimism in QQ-function estimation encourages exploration. To evaluate the regret of OP4, we provide a novel regret decomposition, which is tailored to this pessimism and optimism combination. Each term represents different aspects of regret loss incurred by either estimation or randomness.

6 Proof Sketch and Technical Highlights

In this section, we present the proof sketch for Theorem 5.1. We first decompose the regret into several terms tailored to MPPs and briefly describe how to bound each term. Then we highlight our technical contribution of using the regularity condition to bound the loss in the sender's utility from choosing a signaling scheme that is persuasive for a whole subset of nearby priors.

6.1 Proof of Theorem 5.1

In order to prove the sublinear regret of OP4, we construct a novel regret decomposition tailored to MPPs. Our proof starts by decomposing the regret into several terms, each of which reflects a regret loss either from estimation or from the randomness of trajectories. We then bound each term and sum the bounds to obtain the regret upper bound for OP4. For simplicity of presentation, denote V~ht(;C)=Qht,μhπhtΩ×𝒜(;C)\widetilde{V}_{h}^{t}(\cdot;C)=\big{\langle}Q_{h}^{t},\mu_{h}^{*}\otimes\pi_{h}^{t}\big{\rangle}_{\Omega\times\mathcal{A}}(\cdot;C) as the expectation of QhtQ_{h}^{t} with respect to the ground-truth prior μh\mu_{h}^{*} and the signaling scheme πht\pi_{h}^{t} at the hh-th step. Then we can define the temporal-difference (TD) error as

δht(s,ω,a)=(vht+PhVh+1tQht)(s,ω,a;Ct).\delta_{h}^{t}(s,\omega,a)=(v_{h}^{t}+P_{h}V_{h+1}^{t}-Q_{h}^{t})(s,\omega,a;C^{t}). (6.1)

Here δht\delta_{h}^{t} is a function on 𝒮×Ω×𝒜{\mathcal{S}}\times\Omega\times\mathcal{A} for all h[H]h\in[H] and t[T]t\in[T]. Intuitively, {δht}h[H]\{\delta_{h}^{t}\}_{h\in[H]} quantifies how far the QQ-functions {Qht}h[H]\{Q_{h}^{t}\}_{h\in[H]} are from satisfying the Bellman optimality equation in equation (3.2). Moreover, define ζt,h1\zeta_{t,h}^{1} and ζt,h2\zeta_{t,h}^{2} for the trajectory {cht,sht,ωht,aht}h[H]\{c_{h}^{t},s_{h}^{t},\omega_{h}^{t},a_{h}^{t}\}_{h\in[H]} generated by Algorithm 2 at the tt-th episode as follows

ζt,h1=(V~htVhπt)(sht;Ct)(QhtQhπt)(sht,ωht,aht;Ct),ζt,h2=Ph(Vh+1tVh+1)(sht,ωht,aht;Ct)(Vh+1tVh+1)(sh+1t;Ct).\begin{split}\zeta_{t,h}^{1}&=(\widetilde{V}_{h}^{t}-V_{h}^{\pi^{t}})(s_{h}^{t};C^{t})-(Q_{h}^{t}-Q_{h}^{\pi^{t}})(s_{h}^{t},\omega_{h}^{t},a_{h}^{t};C^{t}),\\ \zeta_{t,h}^{2}&=P_{h}(V_{h+1}^{t}-V_{h+1}^{*})(s_{h}^{t},\omega_{h}^{t},a_{h}^{t};C^{t})-(V_{h+1}^{t}-V_{h+1}^{*})(s_{h+1}^{t};C^{t}).\end{split} (6.2)

By definition, ζt,h1\zeta_{t,h}^{1} captures the randomness of realizing the outcome ωhtμh(|cht)\omega_{h}^{t}\sim\mu_{h}^{*}(\cdot|c_{h}^{t}) and of signaling the action ahtπht(sht,ωht,)a_{h}^{t}\sim\pi_{h}^{t}(s_{h}^{t},\omega_{h}^{t},\cdot), while ζt,h2\zeta_{t,h}^{2} captures the randomness of drawing the next state sh+1ts_{h+1}^{t} from Ph(|sht,ωht,aht)P_{h}(\cdot|s_{h}^{t},\omega_{h}^{t},a_{h}^{t}). With the notation above, we can decompose the regret into the parts displayed below, which facilitates establishing the upper bound on the regret.

Lemma 6.1 (Regret Decomposition).

With the notation defined in equations (6.1) and (6.2), we can write the regret as:

Reg(T,μ)=t[T]h[H]{𝔼μh,πh[δht(sh,ωh,aht)|s1=s1t]δht(sht,ωht,aht)}(i)+t[T]h[H](ζt,h1+ζt,h2)(ii)+t[T]h[H]𝔼μh,πh[Qht,μhπhμhtπhtΩ×𝒜(sh;Ct)|s1=s1t](iii)+t[T]h[H]Qht,(μhtμh)πhtΩ×𝒜(sht,Ct)(iv).\begin{split}\operatorname{Reg}(T,\mu^{*})=&\underbrace{\sum_{t\in[T]}\sum_{h\in[H]}\big{\{}\mathbb{E}_{\mu_{h}^{*},\pi_{h}^{*}}[\delta_{h}^{t}(s_{h},\omega_{h},a_{h}^{t})|s_{1}=s_{1}^{t}]-\delta_{h}^{t}(s_{h}^{t},\omega_{h}^{t},a_{h}^{t})\big{\}}}_{\displaystyle\mathrm{(i)}}+\underbrace{\sum_{t\in[T]}\sum_{h\in[H]}(\zeta_{t,h}^{1}+\zeta_{t,h}^{2})}_{\displaystyle\mathrm{(ii)}}\\ &~{}~{}~{}~{}+\underbrace{\sum_{t\in[T]}\sum_{h\in[H]}\mathbb{E}_{\mu_{h}^{*},\pi_{h}^{*}}\big{[}\big{\langle}Q_{h}^{t},\mu_{h}^{*}\otimes\pi_{h}^{*}-\mu_{h}^{t}\otimes\pi_{h}^{t}\big{\rangle}_{\Omega\times\mathcal{A}}(s_{h};C^{t})|s_{1}=s_{1}^{t}\big{]}}_{\displaystyle\mathrm{(iii)}}\\ &~{}~{}~{}~{}+\underbrace{\sum_{t\in[T]}\sum_{h\in[H]}\big{\langle}Q_{h}^{t},(\mu_{h}^{t}-\mu_{h}^{*})\otimes\pi_{h}^{t}\big{\rangle}_{\Omega\times\mathcal{A}}(s_{h}^{t},C^{t})}_{\displaystyle\mathrm{(iv)}}.\end{split} (6.3)

In this novel regret decomposition, term (i)\displaystyle\mathrm{(i)} reflects the optimism in OP4. Provably, δht\delta_{h}^{t} in term (i)\displaystyle\mathrm{(i)} is always non-positive due to the optimistic QQ-value estimation, which simplifies this term. Term (iii)\displaystyle\mathrm{(iii)} corresponds to the pessimism in OP4 for inducing robust equilibria. It evaluates the regret loss incurred by choosing a robustly persuasive signaling policy. Since the signaling policy has to be persuasive to ensure that receivers always take the recommended actions, we cannot simply choose a greedy policy for a fixed prior estimate. Instead, we first apply optimism to construct the optimistic QQ-value estimate and then apply pessimism to select a signaling policy that is robustly persuasive for all the priors in the confidence region. Therefore, we design the regret decomposition, especially term (iii)\displaystyle\mathrm{(iii)}, in this form to reflect the optimism-pessimism principle in OP4. Notice that this decomposition does not depend on the specific form of function approximation used by the algorithm, since neither the estimates of the prior and the QQ-function nor the chosen signaling policy affects this formula. Therefore, it applies generally to algorithms for MPPs.

Unlike the regret decomposition in [56], Lemma 6.1 also captures the randomness of realizing the outcome. Since we have to estimate the prior of the outcome and choose a robustly persuasive policy in MPPs, we add terms (iii)\displaystyle\mathrm{(iii)} and (iv)\displaystyle\mathrm{(iv)} to account for the additional regret loss.

The rigorous arguments turn out to be technical, and thus we defer the proofs of most lemmas to the appendix while presenting all the key ideas and conclusions in the following. For term (i)\displaystyle\mathrm{(i)} in equation (6.3), although we do not observe the trajectories under the prior μ\mu^{*} and signaling policy π\pi^{*}, we can upper bound both δht\delta_{h}^{t} and δht-\delta_{h}^{t}. The following lemma states this result.

Lemma 6.2 (Optimism).

There exists an absolute constant c>0c>0 such that, for any fixed δ(0,1)\delta\in(0,1), if we set λ=max{1,Ψ2}\lambda=\max\{1,\Psi^{2}\} and ρ=cdψHι\rho=cd_{\psi}H\sqrt{\iota} in Algorithm 2 with ι=log(2dψΨ2T/δ)\iota=\log(2d_{\psi}\Psi^{2}T/\delta), then with probability at least 1δ/21-\delta/2, we have

2ρψ(s,ω,a)(Γht)1δht(s,ω,a)0.-2\rho\|\psi(s,\omega,a)\|_{(\Gamma_{h}^{t})^{-1}}\leq\delta_{h}^{t}(s,\omega,a)\leq 0.

for all s𝒮,ωΩ,a𝒜,h[H]s\in{\mathcal{S}},\omega\in\Omega,a\in\mathcal{A},h\in[H] and t[T]t\in[T].

Term (ii)\displaystyle\mathrm{(ii)} in equation (6.3) can be bounded by Lemma 5.3 from [56] using martingale techniques and the Azuma-Hoeffding inequality [5]. We state the upper bound for term (ii)\displaystyle\mathrm{(ii)} in Lemma A.5. Moreover, term (iii)\displaystyle\mathrm{(iii)} in equation (6.3) evaluates the regret loss caused by estimating the prior and choosing a robustly persuasive signaling policy. Here, we apply the robustness gap Gap\operatorname{Gap} defined later to bound this term.

Lemma 6.3 (Bounding Term (iii)\displaystyle\mathrm{(iii)}).

On the event of {θhht}\{\theta_{h}^{*}\in\mathcal{B}_{h}^{t}\}, under Assumption 4.2 and 4.3, we have

t[T]h[H]𝔼μh,πh[Qht,μhπh\displaystyle\sum_{t\in[T]}\sum_{h\in[H]}\mathbb{E}_{\mu_{h}^{*},\pi_{h}^{*}}\big{[}\big{\langle}Q_{h}^{t},\mu_{h}^{*}\otimes\pi_{h}^{*} μhtπhtΩ×𝒜(sh;Ct)|s1=s1t]\displaystyle-\mu_{h}^{t}\otimes\pi_{h}^{t}\big{\rangle}_{\Omega\times\mathcal{A}}(s_{h};C^{t})|s_{1}=s_{1}^{t}\big{]}
(3HLμKp0D+HLμK2)βh[H]t[T]ϕ(cht)(Σht)1.\displaystyle\leq\bigg{(}\frac{3HL_{\mu}K}{p_{0}D}+\frac{HL_{\mu}K}{2}\bigg{)}\beta\sum_{h\in[H]}\sum_{t\in[T]}\|\phi(c_{h}^{t})\|_{(\Sigma_{h}^{t})^{-1}}.

It remains to bound term (iv)\displaystyle\mathrm{(iv)} in equation (6.3). This bound follows from Hölder's inequality and the properties of the prior.

Lemma 6.4 (Bounding Term (iv)\displaystyle\mathrm{(iv)}).

On the event of {θhht}\{\theta_{h}^{*}\in\mathcal{B}_{h}^{t}\}, under Assumption 4.2 and 4.3, we have

t[T]h[H]Qht,(μhtμh)πhtΩ×𝒜(sht;Ct)HLμKβh[H]t[T]ϕ(cht)(Σht)1.\sum_{t\in[T]}\sum_{h\in[H]}\big{\langle}Q_{h}^{t},(\mu_{h}^{t}-\mu_{h}^{*})\otimes\pi_{h}^{t}\big{\rangle}_{\Omega\times\mathcal{A}}(s_{h}^{t};C^{t})\leq HL_{\mu}K\beta\sum_{h\in[H]}\sum_{t\in[T]}\|\phi(c_{h}^{t})\|_{(\Sigma_{h}^{t})^{-1}}.

Now we are ready to prove our main result, Theorem 5.1. By the decomposition in Lemma 6.1 and the preceding lemmas, setting β=C(1+κ1K+M+dϕσ2log(HT))\beta=C(1+\kappa^{-1}\sqrt{K+M+d_{\phi}\sigma^{2}\log(HT)}), we obtain the following upper bound on the regret:

Reg(T,μ)\displaystyle\operatorname{Reg}(T,\mu^{*})\leq 42TH3log(2HT)\displaystyle 4\sqrt{2TH^{3}\log(2HT)}
+t[T]h[H][2ρψ(sht,ωht,aht)(Γht)1+(3HLμKp0D+3HLμK2)βϕ(cht)(Σht)1],\displaystyle~{}~{}~{}~{}+\sum_{t\in[T]}\sum_{h\in[H]}\bigg{[}2\rho\big{\|}\psi(s_{h}^{t},\omega_{h}^{t},a_{h}^{t})\|_{(\Gamma_{h}^{t})^{-1}}+\bigg{(}\frac{3HL_{\mu}K}{p_{0}D}+\frac{3HL_{\mu}K}{2}\bigg{)}\beta\|\phi(c_{h}^{t})\|_{(\Sigma_{h}^{t})^{-1}}\bigg{]},

By Lemma 6.8 and appropriate choices of δ\delta in the previous lemmas, the above inequality holds with probability at least 13H1T11-3H^{-1}T^{-1}.

By Lemma A.6, we have

Reg(T,μ)\displaystyle\operatorname{Reg}(T,\mu^{*}) 22TH3log(2HT)+2ρH2dψTlog(1+TΨ2/(λdψ))\displaystyle\leq 2\sqrt{2TH^{3}\log(2HT)}+2\rho H\sqrt{2d_{\psi}T\log\big{(}1+T\Psi^{2}/(\lambda d_{\psi})\big{)}}
+βH2LμK(3p0D+32)2dϕTlog(1+T/(dϕ)).\displaystyle\qquad\qquad+\beta H^{2}L_{\mu}K\bigg{(}\frac{3}{p_{0}D}+\frac{3}{2}\bigg{)}\sqrt{2d_{\phi}T\log\big{(}1+T/(d_{\phi})\big{)}}.

Since β\beta is in O~(dϕ)\widetilde{O}(\sqrt{d_{\phi}}) and ρ\rho is in O~(dψH)\widetilde{O}(d_{\psi}H), we can conclude that the regret of Algorithm 2 is O~(dϕdψ3/2H2T/(p0D))\widetilde{O}\big{(}d_{\phi}d_{\psi}^{3/2}H^{2}\sqrt{T}/(p_{0}D)\big{)}.

6.2 Inducing Robust Equilibria via Pessimism

One necessary prerequisite is that the signaling policy produced by OP4 has to be persuasive, so that receivers indeed take the recommended actions. However, the optimal signaling policy that is persuasive for the estimated prior is in general not persuasive for the true prior, even if the estimate is quite close to it. To ensure persuasiveness under the prior estimation error, we adopt the pessimism principle and select a signaling policy that is robustly persuasive for all the priors in the confidence region. We then quantify the extra utility loss incurred by this pessimism. In this subsection, we start by showing that there exists a robust signaling scheme that suffers only O(ϵ)O(\epsilon) utility loss compared to the optimal expected utility of a persuasion algorithm designed with precise knowledge of the prior. Formally, in the basic MPP, given any fixed QQ-function Q(,,)Q(\cdot,\cdot,\cdot), we define the robustness gap for a state s𝒮s\in{\mathcal{S}} and any prior μΔ(Ω)\mu\in\mathcal{B}\subseteq\Delta(\Omega) as

Gap(s,μ,;Q)maxπPers(μ,u)Q,μπΩ×𝒜(s)maxπPers(,u)Q,μπΩ×𝒜(s).\operatorname{Gap}\big{(}s,\mu,\mathcal{B};Q\big{)}\triangleq\max_{\pi\in\operatorname{Pers}(\mu,u)}\big{\langle}Q,\mu\otimes\pi\big{\rangle}_{\Omega\times\mathcal{A}}(s)-\max_{\pi\in\operatorname{Pers}(\mathcal{B},u)}\big{\langle}Q,\mu\otimes\pi\big{\rangle}_{\Omega\times\mathcal{A}}(s). (6.4)

We let B(μ,ϵ)={μΔ(Ω):μμ1ϵ}{\mathrm{B}}(\mu,\epsilon)=\{\mu^{\prime}\in\Delta(\Omega):\left\|\mu-\mu^{\prime}\right\|_{1}\leq\epsilon\} be the 1\ell_{1}-norm ball centered at the prior distribution μ\mu with radius ϵ\epsilon.

Lemma 6.5 (Pessimism).

Under (p0,D)(p_{0},D)-regularity, for all ϵ>0\epsilon>0, given a QQ-function QQ, for any state s𝒮s\in{\mathcal{S}}, we have

Gap(s,μ,B(μ,ϵ);Q)Hϵp0D.\operatorname{Gap}\big{(}s,\mu,{\mathrm{B}}(\mu,\epsilon);Q\big{)}\leq\frac{H\epsilon}{p_{0}D}.

The proof is given in Appendix A.2. This result extends Proposition 1 in [61]. Notice that the upper bound of Gap(;)\operatorname{Gap}(\cdot;\cdot) does not depend on the value of QQ, which is important for our analysis. Given a signaling algorithm, at each episode t[T]t\in[T] and each step h[H]h\in[H], we obtain an estimate of the QQ-function in explicit form, which plays the role of the fixed QQ-function in equation (6.4). Using Gap(;)\operatorname{Gap}(\cdot;\cdot), we can bound the sender's expected utility loss from choosing a signaling scheme that is persuasive for all priors in a subset. Moreover, if we let the priors depend on the context and add the generalized linear assumption on the priors to the preceding lemma, we can bound Gap(;)\operatorname{Gap}(\cdot;\cdot) by the difference in the linear parameter θ\theta.
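Concretely, the proof of Lemma 6.5 in Appendix A.2 constructs the robust scheme as the mixture π=(1δ)π+δπ0\pi^{\prime}=(1-\delta)\pi^{*}+\delta\pi^{0} with δ=ϵ/(p0D)\delta=\epsilon/(p_{0}D), where π0\pi^{0} fully reveals the outcome and recommends the receiver's best response. The sketch below is our own illustration for finite outcomes and actions; π\pi^{*} would come, for instance, from the persuasion LP discussed in Section 4.4, and the function name robustify is hypothetical.

```python
import numpy as np

def robustify(pi_star, u, eps, p0, D):
    """Mix the optimal persuasive scheme with full revelation:
    pi' = (1 - delta) * pi_star + delta * pi0 with delta = eps / (p0 * D),
    which Lemma 6.5 shows is persuasive for every prior within l1-distance eps."""
    W, A = pi_star.shape
    delta = min(1.0, eps / (p0 * D))
    pi0 = np.zeros((W, A))
    pi0[np.arange(W), u.argmax(axis=1)] = 1.0   # recommend the receiver's best action
    return (1.0 - delta) * pi_star + delta * pi0
```

In the learning algorithm, ϵ\epsilon is instantiated by the prior-estimation error, so the lemma directly translates estimation accuracy into a bound on the utility loss of pessimism.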

Corollary 6.6.

Under (p0,D)(p_{0},D)-regularity and Assumption 4.2 and 4.3, given a QQ-function QQ and context cc, for any state s𝒮s\in{\mathcal{S}}, prior μθ(|c)\mu_{\theta}(\cdot|c) and confidence region ={μθ(|c):θBΣ(θ,ϵ)}\mathcal{B}=\{\mu_{\theta^{\prime}}(\cdot|c):\theta^{\prime}\in{\mathrm{B}}_{\Sigma}(\theta,\epsilon)\}, we have Gap(s,μθ(|c),;Q)HLμKϕ(c)Σ1ϵ/(p0D).\operatorname{Gap}(s,\mu_{\theta}(\cdot|c),\mathcal{B};Q)\leq HL_{\mu}K\|\phi(c)\|_{\Sigma^{-1}}\epsilon/(p_{0}D).

In MPPs, we have to estimate the prior of the outcome since we cannot observe the ground-truth prior. However, the estimate may not satisfy the regularity conditions, which are required of the prior in the proof of Lemma 6.5. To address this problem, we give another upper bound on the robustness gap for the prior estimate in Lemma A.1. In addition, to handle the regret loss incurred by estimating the prior, we bound in Lemma A.2 the difference in QQ-values when choosing the respective persuasive schemes for different priors.

We now prove that the above pessimism design guarantees persuasiveness w.r.t. the true prior with high probability. It suffices to show that the estimate θht\theta_{h}^{t} is close enough to the true parameter θh\theta_{h}^{*} that the confidence region ht\mathcal{B}_{h}^{t} centered at θht\theta_{h}^{t} given in Algorithm 2 contains θh\theta_{h}^{*}. If so, the signaling scheme chosen to be persuasive for the whole set μht\mu_{\mathcal{B}_{h}^{t}} is also persuasive for μh\mu_{h}^{*}, where μ{μθ:θ}\mu_{\mathcal{B}}\coloneqq\{\mu_{\theta^{\prime}}:\theta^{\prime}\in\mathcal{B}\} denotes the set of priors determined by the parameters θ\theta^{\prime}\in\mathcal{B}.

Lemma 6.7.

There exists a constant C>0C>0 such that, for β=C(1+κ1K+M+dϕσ2log(HT))\beta=C(1+\kappa^{-1}\sqrt{K+M+d_{\phi}\sigma^{2}\log(HT)}), the OP4 algorithm is persuasive with probability at least 1H1T11-H^{-1}T^{-1}, i.e.,

θ(h[H]{θht[T]ht})H1T1.\mathbb{P}_{\theta^{*}}\bigg{(}\bigcup_{h\in[H]}\big{\{}\theta_{h}^{*}\notin\cap_{t\in[T]}\mathcal{B}_{h}^{t}\big{\}}\bigg{)}\leq H^{-1}T^{-1}.
Proof.

We first bound the probability of being non-persuasive. For any θh\theta_{h}^{*} with θhLθ\|\theta_{h}^{*}\|\leq L_{\theta}, applying the union bound over h[H]h\in[H] and then over t[T]t\in[T], we have

Pθ(h[H]{θht[T]ht})\displaystyle P_{\theta^{*}}\bigg{(}\bigcup_{h\in[H]}\big{\{}\theta_{h}^{*}\notin\cap_{t\in[T]}\mathcal{B}_{h}^{t}\big{\}}\bigg{)} h[H]Pθh(θht[T]ht)\displaystyle\leq\sum_{h\in[H]}P_{\theta_{h}^{*}}\big{(}\theta_{h}^{*}\notin\cap_{t\in[T]}\mathcal{B}_{h}^{t}\big{)}
t[T]h[H]Pθh(θhtθhΣht>β).\displaystyle\leq\sum_{t\in[T]}\sum_{h\in[H]}P_{\theta_{h}^{*}}\big{(}\|\theta_{h}^{t}-\theta_{h}^{*}\|_{\Sigma_{h}^{t}}>\beta\big{)}.

The following lemma gives the belief of confidence region for the linear parameter θh\theta_{h}^{*}. The proof can be directly derived from Lemma 6 in Wang et al. [52].

Lemma 6.8 (Belief of Confidence Region).

For any t[T]t\in[T] and h[H]h\in[H], there exists a constant C>0C>0, such that for β=C(1+κ1K+M+dϕσ2log(1/δ))\beta=C(1+\kappa^{-1}\sqrt{K+M+d_{\phi}\sigma^{2}\log(1/\delta)}), given δ(0,1)\delta\in(0,1), with probability at least 1δ1-\delta, we have θhtθhΣhtβ.\|\theta_{h}^{t}-\theta_{h}^{*}\|_{\Sigma_{h}^{t}}\leq\beta.

By Lemma 6.8 with δ=H2T2\delta=H^{-2}T^{-2}, we have θ(θhtθhΣht>β)H2T2\mathbb{P}_{\theta^{*}}(\|\theta_{h}^{t}-\theta_{h}^{*}\|_{\Sigma_{h}^{t}}>\beta)\leq H^{-2}T^{-2} for each t[T]t\in[T] and h[H]h\in[H]. Summing these failure probabilities over t[T]t\in[T] and h[H]h\in[H] yields θ(h[H]{θht[T]ht})H1T1\mathbb{P}_{\theta^{*}}\big{(}\bigcup_{h\in[H]}\{\theta_{h}^{*}\notin\cap_{t\in[T]}\mathcal{B}_{h}^{t}\}\big{)}\leq H^{-1}T^{-1}. ∎

7 Conclusion

We have presented a novel model, the MPP, which for the first time captures both the misaligned incentives of uninformed decision makers and the long-term objective of an information possessor. We then provide a reinforcement learning algorithm, OP4, that is provably efficient in terms of both computational complexity and sample complexity under mild assumptions. We remark that while we showcase this algorithm in particular problem instances with linear approximation or GLMs, the framework of OP4 does not rely on the form of function approximation, as long as we can quantify the uncertainty of the prior estimation and the QQ-function (or transition model). In addition, we expect this optimism-pessimism design principle and its corresponding proof techniques to be generally useful for other strategic learning problems with misaligned incentives.

Besides extending our techniques to other design problems, we point out several open problems that arise from our work. First, while it is natural in many cases for the sender to have knowledge of the receivers' utility functions (see Footnote 6), we hope to also study the problem without initially knowing the receivers' utility. A similar problem has been studied in Stackelberg games [40, 10], yet without measuring performance in terms of the cumulative utility of the sender (leader). Second, another interesting direction is to study the setting of Markov Bayesian persuasion with one sender and one receiver, both aiming to maximize their own long-term cumulative utilities when the environment involves Markovian transitions.

References

  • Agarwal et al. [2019] Agarwal, A., Jiang, N., Kakade, S. M. and Sun, W. (2019). Reinforcement learning: Theory and algorithms. CS Dept., UW Seattle, Seattle, WA, USA, Tech. Rep.
  • Auer et al. [2008] Auer, P., Jaksch, T. and Ortner, R. (2008). Near-optimal regret bounds for reinforcement learning. Advances in neural information processing systems, 21.
  • Ayoub et al. [2020] Ayoub, A., Jia, Z., Szepesvari, C., Wang, M. and Yang, L. (2020). Model-based reinforcement learning with value-targeted regression. In International Conference on Machine Learning. PMLR.
  • Azar et al. [2017] Azar, M. G., Osband, I. and Munos, R. (2017). Minimax regret bounds for reinforcement learning. In International Conference on Machine Learning. PMLR.
  • Azuma [1967] Azuma, K. (1967). Weighted sums of certain dependent random variables. Tohoku Mathematical Journal, Second Series, 19 357–367.
  • Badanidiyuru et al. [2018] Badanidiyuru, A., Bhawalkar, K. and Xu, H. (2018). Targeting and signaling in ad auctions. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms. SIAM.
  • Bellman [1957] Bellman, R. (1957). A markovian decision process. Indiana Univ. Math. J., 6 679–684.
  • Bergemann and Morris [2019] Bergemann, D. and Morris, S. (2019). Information design: A unified perspective. Journal of Economic Literature, 57 44–95.
  • Cai et al. [2020] Cai, Q., Yang, Z., Jin, C. and Wang, Z. (2020). Provably efficient exploration in policy optimization. In International Conference on Machine Learning. PMLR.
  • Conitzer and Sandholm [2006] Conitzer, V. and Sandholm, T. (2006). Computing the optimal strategy to commit to. In Proceedings of the 7th ACM conference on Electronic commerce.
  • Dann et al. [2018] Dann, C., Jiang, N., Krishnamurthy, A., Agarwal, A., Langford, J. and Schapire, R. E. (2018). On oracle-efficient pac rl with rich observations. Advances in neural information processing systems, 31.
  • Dann et al. [2017] Dann, C., Lattimore, T. and Brunskill, E. (2017). Unifying pac and regret: Uniform pac bounds for episodic reinforcement learning. Advances in Neural Information Processing Systems, 30.
  • Du et al. [2021] Du, S., Kakade, S., Lee, J., Lovett, S., Mahajan, G., Sun, W. and Wang, R. (2021). Bilinear classes: A structural framework for provable generalization in rl. In International Conference on Machine Learning. PMLR.
  • Du et al. [2019] Du, S. S., Kakade, S. M., Wang, R. and Yang, L. F. (2019). Is a good representation sufficient for sample efficient reinforcement learning? arXiv preprint arXiv:1910.03016.
  • Dughmi and Xu [2019] Dughmi, S. and Xu, H. (2019). Algorithmic bayesian persuasion. SIAM Journal on Computing STOC16–68.
  • Ely [2017] Ely, J. C. (2017). Beeps. American Economic Review, 107 31–53.
  • Farhadi and Teneketzis [2021] Farhadi, F. and Teneketzis, D. (2021). Dynamic information design: a simple problem on optimal sequential information disclosure. Dynamic Games and Applications 1–42.
  • Filippi et al. [2010] Filippi, S., Cappe, O., Garivier, A. and Szepesvári, C. (2010). Parametric bandits: The generalized linear case. Advances in Neural Information Processing Systems, 23.
  • Gan et al. [2021] Gan, J., Majumdar, R., Radanovic, G. and Singla, A. (2021). Bayesian persuasion in sequential decision-making. arXiv preprint arXiv:2106.05137.
  • Giannoccaro and Pontrandolfo [2002] Giannoccaro, I. and Pontrandolfo, P. (2002). Inventory management in supply chains: a reinforcement learning approach. International Journal of Production Economics, 78 153–161.
  • Goldstein and Leitner [2018] Goldstein, I. and Leitner, Y. (2018). Stress tests and information disclosure. Journal of Economic Theory, 177 34–69.
  • He et al. [2021] He, J., Zhou, D. and Gu, Q. (2021). Logarithmic regret for reinforcement learning with linear function approximation. In International Conference on Machine Learning. PMLR.
  • Jiang et al. [2017] Jiang, N., Krishnamurthy, A., Agarwal, A., Langford, J. and Schapire, R. E. (2017). Contextual decision processes with low bellman rank are pac-learnable. In International Conference on Machine Learning. PMLR.
  • Jin et al. [2018] Jin, C., Allen-Zhu, Z., Bubeck, S. and Jordan, M. I. (2018). Is q-learning provably efficient? Advances in neural information processing systems, 31.
  • Jin et al. [2020] Jin, C., Yang, Z., Wang, Z. and Jordan, M. I. (2020). Provably efficient reinforcement learning with linear function approximation. In Conference on Learning Theory. PMLR.
  • Kamenica and Gentzkow [2011] Kamenica, E. and Gentzkow, M. (2011). Bayesian persuasion. American Economic Review, 101 2590–2615.
  • Kober et al. [2013] Kober, J., Bagnell, J. A. and Peters, J. (2013). Reinforcement learning in robotics: A survey. The International Journal of Robotics Research, 32 1238–1274.
  • Lattimore and Szepesvári [2020] Lattimore, T. and Szepesvári, C. (2020). Bandit algorithms. Cambridge University Press.
  • Lehrer and Shaiderman [2021] Lehrer, E. and Shaiderman, D. (2021). Markovian persuasion. arXiv preprint arXiv:2111.14365.
  • Li et al. [2016] Li, J., Monroe, W., Ritter, A., Galley, M., Gao, J. and Jurafsky, D. (2016). Deep reinforcement learning for dialogue generation. arXiv preprint arXiv:1606.01541.
  • Li et al. [2017a] Li, L., Lu, Y. and Zhou, D. (2017a). Provable optimal algorithms for generalized linear contextual bandits. CoRR, abs/1703.00048.
    http://arxiv.org/abs/1703.00048
  • Li et al. [2017b] Li, L., Lu, Y. and Zhou, D. (2017b). Provably optimal algorithms for generalized linear contextual bandits. In International Conference on Machine Learning. PMLR.
  • Li et al. [2019] Li, M., Qin, Z., Jiao, Y., Yang, Y., Wang, J., Wang, C., Wu, G. and Ye, J. (2019). Efficient ridesharing order dispatching with mean field multi-agent reinforcement learning. In The world wide web conference.
  • Liang et al. [2021] Liang, E., Wen, K., Lam, W. H., Sumalee, A. and Zhong, R. (2021). An integrated reinforcement learning and centralized programming approach for online taxi dispatching. IEEE Transactions on Neural Networks and Learning Systems.
  • Mansour et al. [2021] Mansour, Y., Slivkins, A., Syrgkanis, V. and Wu, Z. S. (2021). Bayesian exploration: Incentivizing exploration in bayesian games. Operations Research.
  • Meisheri et al. [2020] Meisheri, H., Baniwal, V., Sultana, N. N., Khadilkar, H. and Ravindran, B. (2020). Using reinforcement learning for a large variable-dimensional inventory management problem. In Adaptive Learning Agents Workshop at AAMAS.
  • Milano et al. [2020] Milano, S., Taddeo, M. and Floridi, L. (2020). Recommender systems and their ethical challenges. Ai & Society, 35 957–967.
  • Neu and Pike-Burke [2020] Neu, G. and Pike-Burke, C. (2020). A unifying view of optimism in episodic reinforcement learning. Advances in Neural Information Processing Systems, 33 1392–1403.
  • Osband et al. [2016] Osband, I., Van Roy, B. and Wen, Z. (2016). Generalization and exploration via randomized value functions. In International Conference on Machine Learning. PMLR.
  • Peng et al. [2019] Peng, B., Shen, W., Tang, P. and Zuo, S. (2019). Learning optimal strategies to commit to. In Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33.
  • Puterman [2014] Puterman, M. L. (2014). Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons.
  • Qin et al. [2020] Qin, Z., Tang, X., Jiao, Y., Zhang, F., Xu, Z., Zhu, H. and Ye, J. (2020). Ride-hailing order dispatching at didi via reinforcement learning. INFORMS Journal on Applied Analytics, 50 272–286.
  • Qin et al. [2021] Qin, Z. T., Zhu, H. and Ye, J. (2021). Reinforcement learning for ridesharing: A survey. In 2021 IEEE International Intelligent Transportation Systems Conference (ITSC). IEEE.
  • Rabinovich et al. [2015] Rabinovich, Z., Jiang, A. X., Jain, M. and Xu, H. (2015). Information disclosure as a means to security. In Proceedings of the 2015 International Conference on Autonomous Agents and Multiagent Systems. Citeseer.
  • Renault et al. [2017] Renault, J., Solan, E. and Vieille, N. (2017). Optimal dynamic information provision. Games and Economic Behavior, 104 329–349.
  • Russo [2019] Russo, D. (2019). Worst-case regret bounds for exploration via randomized value functions. Advances in Neural Information Processing Systems, 32.
  • Strehl et al. [2006] Strehl, A. L., Li, L., Wiewiora, E., Langford, J. and Littman, M. L. (2006). Pac model-free reinforcement learning. In Proceedings of the 23rd international conference on Machine learning.
  • Sutton and Barto [2018] Sutton, R. S. and Barto, A. G. (2018). Reinforcement learning: An introduction. MIT press.
  • Tang and Winoto [2016] Tang, T. Y. and Winoto, P. (2016). I should not recommend it to you even if you will like it: the ethics of recommender systems. New Review of Hypermedia and Multimedia, 22 111–138.
  • Wang et al. [2020] Wang, R., Salakhutdinov, R. and Yang, L. F. (2020). Provably efficient reinforcement learning with general value function approximation.
  • Wang et al. [2021] Wang, T., Zhou, D. and Gu, Q. (2021). Provably efficient reinforcement learning with linear function approximation under adaptivity constraints. Advances in Neural Information Processing Systems, 34.
  • Wang et al. [2019] Wang, Y., Wang, R., Du, S. S. and Krishnamurthy, A. (2019). Optimism in reinforcement learning with generalized linear function approximation. arXiv preprint arXiv:1912.04136.
  • Xiao et al. [2019] Xiao, W., Zhao, H., Pan, H., Song, Y., Zheng, V. W. and Yang, Q. (2019). Beyond personalization: Social content recommendation for creator equality and consumer satisfaction. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining.
  • Xu et al. [2015] Xu, H., Rabinovich, Z., Dughmi, S. and Tambe, M. (2015). Exploring information asymmetry in two-stage security games. In Twenty-Ninth AAAI Conference on Artificial Intelligence.
  • Yang and Wang [2019] Yang, L. and Wang, M. (2019). Sample-optimal parametric q-learning using linearly additive features. In International Conference on Machine Learning. PMLR.
  • Yang et al. [2020] Yang, Z., Jin, C., Wang, Z., Wang, M. and Jordan, M. I. (2020). Bridging exploration and general function approximation in reinforcement learning: Provably efficient kernel and neural value iterations. CoRR, abs/2011.04622.
    https://arxiv.org/abs/2011.04622
  • Zanette et al. [2020] Zanette, A., Lazaric, A., Kochenderfer, M. and Brunskill, E. (2020). Learning near optimal policies with low inherent bellman error. In International Conference on Machine Learning. PMLR.
  • Zheng et al. [2020] Zheng, S., Trott, A., Srinivasa, S., Naik, N., Gruesbeck, M., Parkes, D. C. and Socher, R. (2020). The ai economist: Improving equality and productivity with ai-driven tax policies. arXiv preprint arXiv:2004.13332.
  • Zhou et al. [2021a] Zhou, D., Gu, Q. and Szepesvari, C. (2021a). Nearly minimax optimal reinforcement learning for linear mixture markov decision processes. In Conference on Learning Theory. PMLR.
  • Zhou et al. [2021b] Zhou, D., He, J. and Gu, Q. (2021b). Provably efficient reinforcement learning for discounted mdps with feature mapping. In International Conference on Machine Learning. PMLR.
  • Zu et al. [2021] Zu, Y., Iyer, K. and Xu, H. (2021). Learning to persuade on the fly: Robustness against ignorance. EC ’21, Association for Computing Machinery, New York, NY, USA.
    https://doi.org/10.1145/3465456.3467593

Appendix A Omitted Proofs and Descriptions

A.1 Formal Description of the OP4

The formal description of the OP4 for MPPs is stated as follows:

Algorithm 2 The OP4 for MPPs
1:  Input: Number of Episodes TT, Number of Step HH
2:  Parameters: β>0\beta>0, ρ>0\rho>0, λ+\lambda\in\mathbb{R}^{+}.
3:  Output: aht𝒜a_{h}^{t}\in\mathcal{A} for each h[H],t[T]h\in[H],t\in[T]
4:  for episode t=1Tt=1\dots T do
5:     Receive the initial state s1ts_{1}^{t} and context Ct=(c1t,,cHt)C^{t}=(c_{1}^{t},\ldots,c_{H}^{t}).
6:     for step h=H,,1h=H,\ldots,1 do
7:        Compute the constrained least square problem
θhtargminθhLθτ[t1][ωhτf(ϕ(chτ)θh)]2.\theta_{h}^{t}\leftarrow\mathop{\mathrm{argmin}}_{\|\theta_{h}\|\leq L_{\theta}}\sum_{\tau\in[t-1]}\big{[}\omega_{h}^{\tau}-f(\phi(c_{h}^{\tau})^{\top}\theta_{h})\big{]}^{2}.
8:        Calculate Σht=Φ2Idϕ+τ[t1]ϕ(chτ)ϕ(chτ)\Sigma_{h}^{t}=\Phi^{2}I_{d_{\phi}}+\sum_{\tau\in[t-1]}\phi(c_{h}^{\tau})\phi(c_{h}^{\tau})^{\top}. Update htBΣht(θht,β)\mathcal{B}_{h}^{t}\leftarrow{\mathrm{B}}_{\Sigma_{h}^{t}}(\theta_{h}^{t},\beta).
9:        Set μht(|c)\mu_{h}^{t}(\cdot|c) to the distribution of f(ϕ(c)θht)+zhf\big{(}\phi(c)^{\top}\theta_{h}^{t}\big{)}+z_{h}.
10:        Calculate
Γht\displaystyle\Gamma_{h}^{t} =λIdψ+τ[t1]ψ(shτ,ωhτ,ahτ)ψ(shτ,ωhτ,ahτ),\displaystyle=\lambda I_{d_{\psi}}+\sum_{\tau\in[t-1]}\psi(s_{h}^{\tau},\omega_{h}^{\tau},a_{h}^{\tau})\psi(s_{h}^{\tau},\omega_{h}^{\tau},a_{h}^{\tau})^{\top},
ιht\displaystyle\iota_{h}^{t} =τ[t1]ψ(shτ,ωhτ,ahτ)[vhτ+Vh+1t(sh+1τ;Ct)]\displaystyle=\sum_{\tau\in[t-1]}\psi(s_{h}^{\tau},\omega_{h}^{\tau},a_{h}^{\tau})[v_{h}^{\tau}+V_{h+1}^{t}(s_{h+1}^{\tau};C^{t})]
11:        Update qht(Γht)1ιhtq_{h}^{t}\leftarrow(\Gamma_{h}^{t})^{-1}\iota_{h}^{t}.
12:        Set {Qht(,,;Ct)min{ψ(,,)qht+ρψ(,,)(Γht)1,H},Vht(;Ct)maxπhPers(μht,uh)Qht,μhtπhΩ×𝒜(;Ct).\begin{cases}Q_{h}^{t}(\cdot,\cdot,\cdot;C^{t})\leftarrow\min\{\psi(\cdot,\cdot,\cdot)^{\top}q_{h}^{t}+\rho\|\psi(\cdot,\cdot,\cdot)\|_{(\Gamma_{h}^{t})^{-1}},H\},&\\ V_{h}^{t}(\cdot;C^{t})\leftarrow\max_{\pi_{h}\in\operatorname{Pers}(\mu_{\mathcal{B}_{h}^{t}},u_{h})}\big{\langle}Q_{h}^{t},\mu_{h}^{t}\otimes\pi_{h}\big{\rangle}_{\Omega\times\mathcal{A}}(\cdot;C^{t}).&\end{cases}
13:     end for
14:     for step h=1,,Hh=1,\ldots,H do
15:        Choose πhtargmaxπhPers(μht,uh)Qht,μhtπhΩ×𝒜(sht;Ct)\pi_{h}^{t}\in\arg\max_{\pi_{h}\in\operatorname{Pers}(\mu_{\mathcal{B}_{h}^{t}},u_{h})}\big{\langle}Q_{h}^{t},\mu_{h}^{t}\otimes\pi_{h}\big{\rangle}_{\Omega\times\mathcal{A}}(s_{h}^{t};C^{t}).
16:     end for
17:     Execute πt\pi^{t} to sample a trajectory {(sht,ωht,aht,vht)}h[H]\{(s_{h}^{t},\omega_{h}^{t},a_{h}^{t},v_{h}^{t})\}_{h\in[H]}.
18:  end for

A.2 Proof of Lemma 6.5

Proof.

We prove the lemma with an explicit construction of a signaling scheme that is robustly persuasive for any prior in B(μ,ϵ){\mathrm{B}}(\mu,\epsilon) and achieves expected utility of at least maxπPers(μ,u)Q,μπΩ×𝒜(s)Hϵ/(p0D)\max_{\pi\in\operatorname{Pers}(\mu,u)}\big{\langle}Q,\mu\otimes\pi\big{\rangle}_{\Omega\times\mathcal{A}}(s)-H\epsilon/(p_{0}D). To simplify the notation, we omit ss in uu, QQ and 𝒲\mathcal{W}.

Let π=argmaxπPers(μ,u)Q,μπΩ×𝒜\pi^{*}=\arg\max_{\pi\in\operatorname{Pers}(\mu,u)}\big{\langle}Q,\mu\otimes\pi\big{\rangle}_{\Omega\times\mathcal{A}} be a direct scheme without loss of generality [26]. For each a𝒜a\in\mathcal{A}, let μa()μ()π(a|)\mu_{a}(\cdot)\coloneqq\mu(\cdot)\odot\pi^{*}(a|\cdot) denote the (unnormalized) posterior of the outcome given that action aa is recommended by π\pi^{*}, where \odot denotes the Hadamard product. (In this proof, we work directly with the posterior without normalization, i.e., the kernel, to simplify our notation and derivations, because ωΩμa(ω)[u(ω,a)u(ω,a)]0ωΩμa(ω)ωΩμa(ω)[u(ω,a)u(ω,a)]0\int_{\omega\in\Omega}\mu_{a}(\omega)\left[u(\omega,a)-u(\omega,a^{\prime})\right]\geq 0\iff\int_{\omega\in\Omega}\frac{\mu_{a}(\omega)}{\int_{\omega\in\Omega}\mu_{a}(\omega)}\left[u(\omega,a)-u(\omega,a^{\prime})\right]\geq 0.) The prior can thus be decomposed as μ()=a𝒜μa()\mu(\cdot)=\sum_{a\in\mathcal{A}}\mu_{a}(\cdot). Since π\pi^{*} is persuasive, we know ωΩμa(ω)[u(ω,a)u(ω,a)]0,a𝒜.\int_{\omega\in\Omega}\mu_{a}(\omega)\left[u(\omega,a)-u(\omega,a^{\prime})\right]\geq 0,\forall a^{\prime}\in\mathcal{A}.

Let π0\pi^{0} be the fully revealing signaling scheme that always recommends (signals) the action that maximizes the receivers' utility at the realized outcome. For each a𝒜a\in\mathcal{A}, let ηa()μ()π0(a|)\eta_{a}(\cdot)\coloneqq\mu(\cdot)\odot\pi^{0}(a|\cdot) denote the posterior of the outcome given that action aa is recommended by π0\pi^{0}, so the prior can be decomposed as μ()=a𝒜ηa()\mu(\cdot)=\sum_{a\in\mathcal{A}}\eta_{a}(\cdot). By the regularity condition, we have

ωΩηa(ω)[u(ω,a)u(ω,a)]ω𝒲a(D)ηa(ω)[u(ω,a)u(ω,a)]p0D,a𝒜.\int_{\omega\in\Omega}\eta_{a}(\omega)\left[u(\omega,a)-u(\omega,a^{\prime})\right]\geq\int_{\omega\in\mathcal{W}_{a}(D)}\eta_{a}(\omega)\left[u(\omega,a)-u(\omega,a^{\prime})\right]\geq p_{0}D,\quad\forall a^{\prime}\in\mathcal{A}.

We now show that the signaling scheme π=(1δ)π+δπ0\pi^{\prime}=(1-\delta)\pi^{*}+\delta\pi^{0} with δ=ϵp0D\delta=\frac{\epsilon}{p_{0}D} is persuasive for any prior μ~B(μ,ϵ)\widetilde{\mu}\in{\mathrm{B}}(\mu,\epsilon). One simple way to interpret this "compound" signaling scheme is to follow π\pi^{*} with probability (1δ)(1-\delta) and follow π0\pi^{0} with probability δ\delta. Let μa\mu^{\prime}_{a} and μ~a\widetilde{\mu}_{a} be the outcome posteriors of π\pi^{\prime} recommending action aa under the true prior μ\mu and the perturbed prior μ~\widetilde{\mu}, respectively, so that μa()=μ()π(a|)=(1δ)μa()+δηa()\mu^{\prime}_{a}(\cdot)=\mu(\cdot)\odot\pi^{\prime}(a|\cdot)=(1-\delta)\mu_{a}(\cdot)+\delta\eta_{a}(\cdot) and μ~a()=μ~()π(a|)\widetilde{\mu}_{a}(\cdot)=\widetilde{\mu}(\cdot)\odot\pi^{\prime}(a|\cdot). By the definition of persuasiveness, we need to show that for any recommended action (signal from π\pi^{\prime}) a𝒜a\in\mathcal{A}, the action aa maximizes the receiver's utility under the posterior μ~a\widetilde{\mu}_{a}. This follows from the decomposition below,

ωΩμ~a[u(ω,a)u(ω,a)]\displaystyle\int_{\omega\in\Omega}\widetilde{\mu}_{a}\cdot\left[u(\omega,a)-u(\omega,a^{\prime})\right]
\displaystyle\geq ωΩμa[u(ω,a)u(ω,a)]μ~aμa1\displaystyle\int_{\omega\in\Omega}\mu^{\prime}_{a}\cdot\left[u(\omega,a)-u(\omega,a^{\prime})\right]-\left\|\widetilde{\mu}_{a}-\mu^{\prime}_{a}\right\|_{1}
=\displaystyle= ωΩ[(1δ)μa(ω)+δηa(ω)][u(ω,a)u(ω,a)]μ~aμa1\displaystyle\int_{\omega\in\Omega}\left[(1-\delta)\mu_{a}(\omega)+\delta\eta_{a}(\omega)\right]\cdot\left[u(\omega,a)-u(\omega,a^{\prime})\right]-\left\|\widetilde{\mu}_{a}-\mu^{\prime}_{a}\right\|_{1}
=\displaystyle= ωΩ(1δ)μa(ω)[u(ω,a)u(ω,a)]+ωΩδηa(ω)[u(ω,a)u(ω,a)]μ~aμa1\displaystyle\int_{\omega\in\Omega}(1-\delta)\mu_{a}(\omega)\left[u(\omega,a)-u(\omega,a^{\prime})\right]+\int_{\omega\in\Omega}\delta\eta_{a}(\omega)\left[u(\omega,a)-u(\omega,a^{\prime})\right]-\left\|\widetilde{\mu}_{a}-\mu^{\prime}_{a}\right\|_{1}
\displaystyle\geq δp0Dμ~aμa1\delta p_{0}D-\left\|\widetilde{\mu}_{a}-\mu^{\prime}_{a}\right\|_{1}
=\displaystyle= ϵμ~aμa10.\epsilon-\left\|\widetilde{\mu}_{a}-\mu^{\prime}_{a}\right\|_{1}\geq 0.

The first inequality is by the fact that u(ω,a)[0,1]u(\omega,a)\in[0,1] for any ω,a\omega,a and thus ωΩ(μ~aμa)[u(ω,a)u(ω,a)]μ~aμa1\int_{\omega\in\Omega}(\widetilde{\mu}_{a}-\mu^{\prime}_{a})\cdot\left[u(\omega,a)-u(\omega,a^{\prime})\right]\leq\left\|\widetilde{\mu}_{a}-\mu^{\prime}_{a}\right\|_{1}. The first equality is from the definition μa=(1δ)μa(ω)+δηa(ω)\mu^{\prime}_{a}=(1-\delta)\mu_{a}(\omega)+\delta\eta_{a}(\omega), and the second equality splits the integral by linearity. The second inequality uses the persuasiveness of π\pi^{*} for the first integral and the regularity bound on ηa\eta_{a}, induced by the fully revealing scheme π0\pi^{0}, for the second. The last step is by the fact that μ~aμa1=(μ~μ)π(a|)1μ~μ1ϵ\left\|\widetilde{\mu}_{a}-\mu^{\prime}_{a}\right\|_{1}=\left\|(\widetilde{\mu}-\mu)\odot\pi^{\prime}(a|\cdot)\right\|_{1}\leq\left\|\widetilde{\mu}-\mu\right\|_{1}\leq\epsilon, since π(a|)1\left\|\pi^{\prime}(a|\cdot)\right\|_{\infty}\leq 1, together with δp0D=ϵ\delta p_{0}D=\epsilon.

It remains to show the expected utility under signaling scheme π\pi^{\prime} is at least Q,μπΩ×𝒜Hϵ/(p0D)\big{\langle}Q,\mu\otimes\pi^{*}\big{\rangle}_{\Omega\times\mathcal{A}}-H\epsilon/(p_{0}D). This is due to the following inequalities,

Q,μπΩ×𝒜Q,μπΩ×𝒜\displaystyle\big{\langle}Q,\mu\otimes\pi^{\prime}\big{\rangle}_{\Omega\times\mathcal{A}}-\big{\langle}Q,\mu\otimes\pi^{*}\big{\rangle}_{\Omega\times\mathcal{A}} =ωΩ,a𝒜μ(ω)[π(a|ω)π(a|ω)]Q(ω,a)\displaystyle=\int_{\omega\in\Omega,a\in\mathcal{A}}\mu(\omega)\left[\pi^{\prime}(a|\omega)-\pi^{*}(a|\omega)\right]Q(\omega,a)
=ωΩ,a𝒜μ(ω)[δπ0(a|ω)δπ(a|ω)]Q(ω,a)\displaystyle=\int_{\omega\in\Omega,a\in\mathcal{A}}\mu(\omega)\left[\delta\pi^{0}(a|\omega)-\delta\pi^{*}(a|\omega)\right]Q(\omega,a)
δωΩ,a𝒜μ(ω)π(a|ω)Q(ω,a)\displaystyle\geq-\delta\int_{\omega\in\Omega,a\in\mathcal{A}}\mu(\omega)\pi^{*}(a|\omega)Q(\omega,a)
Hδ=Hϵp0D.\displaystyle\geq-H\delta=-\frac{H\epsilon}{p_{0}D}.

The first and second equalities use the definition of π\pi^{\prime} and linearity. The third step drops the nonnegative term involving π0\pi^{0}, and the last inequality uses the fact that 𝔼[Q(ω,a)][0,H]\mathbb{E}[Q(\omega,a)]\in[0,H]. ∎

A.3 Properties for the Robustness Gap

We bounded the robustness gap Gap\operatorname{Gap} for the ground-truth prior in Lemma 6.5. For the prior estimate μht\mu_{h}^{t} given in Algorithm 2, which may not satisfy the regularity condition, we have a corresponding bound on the robustness gap.

Lemma A.1.

For any h[H],t[T]h\in[H],t\in[T] and s𝒮s\in{\mathcal{S}}, on the event of {θhht}\{\theta_{h}^{*}\in\mathcal{B}_{h}^{t}\}, we have

Gap(s,μht,B(μht,ϵht);Qht)2Hϵhtp0D.\operatorname{Gap}(s,\mu_{h}^{t},{\mathrm{B}}(\mu_{h}^{t},\epsilon_{h}^{t});Q_{h}^{t})\leq\frac{2H\epsilon_{h}^{t}}{p_{0}D}.
Proof.

For any fixed action a𝒜a\in\mathcal{A}, on the given event, we have

ωμht()[ω𝒲s,a(D)]\displaystyle\mathbb{P}_{\omega\sim\mu_{h}^{t}(\cdot)}[\omega\in\mathcal{W}_{s,a}(D)] =ωΩμht(ω)𝕀(ω𝒲s,a(D))dω\displaystyle=\int_{\omega\in\Omega}\mu_{h}^{t}(\omega)\mathbb{I}(\omega\in\mathcal{W}_{s,a}(D)){\mathrm{d}}\omega
=ωΩμh(ω)𝕀(ω𝒲s,a(D))dω+ωΩ[μht(ω)μh(ω)]𝕀(ω𝒲s,a(D))dω\displaystyle=\int_{\omega\in\Omega}\mu_{h}^{*}(\omega)\mathbb{I}(\omega\in\mathcal{W}_{s,a}(D)){\mathrm{d}}\omega+\int_{\omega\in\Omega}[\mu_{h}^{t}(\omega)-\mu_{h}^{*}(\omega)]\mathbb{I}(\omega\in\mathcal{W}_{s,a}(D)){\mathrm{d}}\omega
ωΩμh(ω)𝕀(ω𝒲s,a(D))dωμhtμh1\displaystyle\geq\int_{\omega\in\Omega}\mu_{h}^{*}(\omega)\mathbb{I}(\omega\in\mathcal{W}_{s,a}(D)){\mathrm{d}}\omega-\|\mu_{h}^{t}-\mu_{h}^{*}\|_{1}
p0ϵht,\displaystyle\geq p_{0}-\epsilon_{h}^{t},

where 𝕀\mathbb{I} is the indicator function. The last inequality uses the regularity condition for the true prior μh\mu_{h}^{*} together with μhtμh1ϵht\|\mu_{h}^{t}-\mu_{h}^{*}\|_{1}\leq\epsilon_{h}^{t} on the given event. Hence, for ϵhtp0/2\epsilon_{h}^{t}\leq p_{0}/2, we have ωμht()[ω𝒲s,a(D)]p0/2\mathbb{P}_{\omega\sim\mu_{h}^{t}(\cdot)}[\omega\in\mathcal{W}_{s,a}(D)]\geq p_{0}/2. Then, applying Lemma 6.5 with p0/2p_{0}/2 in place of p0p_{0}, we arrive at

Gap(s,μht,B(μht,ϵht);Qht)2Hϵhtp0D.\operatorname{Gap}(s,\mu_{h}^{t},{\mathrm{B}}(\mu_{h}^{t},\epsilon_{h}^{t});Q_{h}^{t})\leq\frac{2H\epsilon_{h}^{t}}{p_{0}D}.

For ϵht>p0/2\epsilon_{h}^{t}>p_{0}/2, the bound holds trivially since 2Hϵht/(p0D)>H2H\epsilon_{h}^{t}/(p_{0}D)>H. ∎

The robustness gap Gap\operatorname{Gap} defined in equation (6.4) measures the loss in value functions for being robustly persuasive for a subset of priors. In the following lemma, we show that we can also use Gap\operatorname{Gap} to bound the difference in expected optimal QQ-functions between different priors.

Lemma A.2.

Denote 1,2B(μ1,μ1μ21)\mathcal{B}_{1,2}\coloneqq{\mathrm{B}}\big{(}\mu_{1},\|\mu_{1}-\mu_{2}\|_{1}\big{)} for any fixed state s𝒮s\in{\mathcal{S}} and μ1,μ2Δ(Ω)\mu_{1},\mu_{2}\in\Delta(\Omega). Then given a known QQ-function Q(,,)Q(\cdot,\cdot,\cdot), we have

maxπ1Pers(μ1,u)Q,μ1π1Ω×𝒜(s)maxπ2Pers(μ2,u)Q,μ2π2Ω×𝒜(s)Gap(s,μ1,1,2;Q)+H2μ1μ21.\max_{\pi_{1}\in\operatorname{Pers}(\mu_{1},u)}\big{\langle}Q,\mu_{1}\otimes\pi_{1}\big{\rangle}_{\Omega\times\mathcal{A}}(s)-\max_{\pi_{2}\in\operatorname{Pers}(\mu_{2},u)}\big{\langle}Q,\mu_{2}\otimes\pi_{2}\big{\rangle}_{\Omega\times\mathcal{A}}(s)\leq\operatorname{Gap}(s,\mu_{1},\mathcal{B}_{1,2};Q)+\frac{H}{2}\|\mu_{1}-\mu_{2}\|_{1}.
Proof.

Fix μ1,μ2Δ(Ω)\mu_{1},\mu_{2}\in\Delta(\Omega) and respectively choose the optimal signaling schemes

πi=argmaxπiPers(μi,u)Q,μiπiΩ×𝒜(s),i=1,2.\pi_{i}=\mathop{\mathrm{argmax}}_{\pi_{i}\in\operatorname{Pers}(\mu_{i},u)}\big{\langle}Q,\mu_{i}\otimes\pi_{i}\big{\rangle}_{\Omega\times\mathcal{A}}(s),~{}i=1,2.

Then, among all the signaling schemes that are persuasive for all priors in 1,2\mathcal{B}_{1,2}, let π3\pi_{3} be the one that maximizes Q,μ1πΩ×𝒜(s)\big{\langle}Q,\mu_{1}\otimes\pi\big{\rangle}_{\Omega\times\mathcal{A}}(s). Since π3\pi_{3} is persuasive for μ2\mu_{2}, we know Q,μ2π2Ω×𝒜(s)Q,μ2π3Ω×𝒜(s)\big{\langle}Q,\mu_{2}\otimes\pi_{2}\big{\rangle}_{\Omega\times\mathcal{A}}(s)\geq\big{\langle}Q,\mu_{2}\otimes\pi_{3}\big{\rangle}_{\Omega\times\mathcal{A}}(s) by definition. Therefore, we have

Q,μ1π1μ2π2Ω×𝒜(s)\displaystyle\big{\langle}Q,\mu_{1}\otimes\pi_{1}-\mu_{2}\otimes\pi_{2}\big{\rangle}_{\Omega\times\mathcal{A}}(s) Q,μ1π1μ2π3Ω×𝒜(s)\displaystyle\leq\big{\langle}Q,\mu_{1}\otimes\pi_{1}-\mu_{2}\otimes\pi_{3}\big{\rangle}_{\Omega\times\mathcal{A}}(s)
Q,μ1π1μ1π3Ω×𝒜(s)+Q,μ1π3μ2π3Ω×𝒜(s)\displaystyle\leq\big{\langle}Q,\mu_{1}\otimes\pi_{1}-\mu_{1}\otimes\pi_{3}\big{\rangle}_{\Omega\times\mathcal{A}}(s)+\big{\langle}Q,\mu_{1}\otimes\pi_{3}-\mu_{2}\otimes\pi_{3}\big{\rangle}_{\Omega\times\mathcal{A}}(s)
Gap(s,μ1,1,2;Q)+H2μ1μ21.\displaystyle\leq\operatorname{Gap}(s,\mu_{1},\mathcal{B}_{1,2};Q)+\frac{H}{2}\|\mu_{1}-\mu_{2}\|_{1}.

The last inequality uses the definition of Gap\operatorname{Gap} and Lemma A.3. ∎

Lemma A.3.

Given a QQ-function Q(,,)[0,H]Q(\cdot,\cdot,\cdot)\in[0,H], for any fixed state s𝒮s\in{\mathcal{S}}, μ1,μ2Δ(Ω)\mu_{1},\mu_{2}\in\Delta(\Omega) and any signaling scheme π\pi, we have

|Q,μ1πΩ×𝒜(s)Q,μ2πΩ×𝒜(s)|H2μ1μ21.\big{|}\big{\langle}Q,\mu_{1}\otimes\pi\big{\rangle}_{\Omega\times\mathcal{A}}(s)-\big{\langle}Q,\mu_{2}\otimes\pi\big{\rangle}_{\Omega\times\mathcal{A}}(s)\big{|}\leq\frac{H}{2}\|\mu_{1}-\mu_{2}\|_{1}.
Proof.

Fix μ1(),μ2()Δ(Ω)\mu_{1}(\cdot),\mu_{2}(\cdot)\in\Delta(\Omega). For any xx\in\mathbb{R}, we have

|Q,μ1πμ2πΩ×𝒜(s)|\displaystyle\big{|}\big{\langle}Q,\mu_{1}\otimes\pi-\mu_{2}\otimes\pi\big{\rangle}_{\Omega\times\mathcal{A}}(s)\big{|} =|ωΩ[μ1(ω)μ2(ω)][a𝒜π(a|s,ω)Q(s,ω,a)dax]dω|\displaystyle=\bigg{|}\int_{\omega\in\Omega}[\mu_{1}(\omega)-\mu_{2}(\omega)]\bigg{[}\int_{a\in\mathcal{A}}\pi(a|s,\omega)Q(s,\omega,a){\mathrm{d}}a-x\bigg{]}{\mathrm{d}}\omega\bigg{|}
μ1μ21supωΩ|a𝒜π(a|s,ω)Q(s,ω,a)dax|,\displaystyle\leq\|\mu_{1}-\mu_{2}\|_{1}\cdot\sup_{\omega\in\Omega}\bigg{|}\int_{a\in\mathcal{A}}\pi(a|s,\omega)Q(s,\omega,a){\mathrm{d}}a-x\bigg{|},

where the last inequality is derived from Hölder's inequality, and the equality holds since ωΩ[μ1(ω)μ2(ω)]xdω=0\int_{\omega\in\Omega}[\mu_{1}(\omega)-\mu_{2}(\omega)]\,x\,{\mathrm{d}}\omega=0 for any xx\in\mathbb{R}. With the QQ-function taking values in [0,H][0,H], we can set x=H/2x=H/2 so that the supremum is at most H/2H/2, which yields the claimed bound. ∎

A.4 Proof of Lemma 6.1

Proof.

Before presenting the proof, we first define two operators 𝕁h\mathbb{J}_{h}^{*} and 𝕁ht\mathbb{J}_{h}^{t}:

(𝕁hf)(s;C)=f,μhπhΩ×𝒜(s;C),(𝕁htf)(s;C)=f,μhtπhtΩ×𝒜(s;C),(\mathbb{J}_{h}^{*}f)(s;C)=\langle f,\mu_{h}^{*}\otimes\pi_{h}^{*}\rangle_{\Omega\times\mathcal{A}}(s;C),~{}~{}~{}~{}(\mathbb{J}_{h}^{t}f)(s;C)=\langle f,\mu_{h}^{t}\otimes\pi_{h}^{t}\rangle_{\Omega\times\mathcal{A}}(s;C), (A.1)

for any h[H],t[T]h\in[H],t\in[T] and any function f(,,;C):𝒮×Ω×𝒜f(\cdot,\cdot,\cdot;C):{\mathcal{S}}\times\Omega\times\mathcal{A}\to\mathbb{R} under the context CC. Moreover, for any h[H],t[T]h\in[H],t\in[T] and any state s𝒮s\in{\mathcal{S}}, we define

ξht(s;C)=(𝕁hQht)(s;C)(𝕁htQht)(s;C)=Qht,μhπhμhtπhtΩ×𝒜(s;C).\xi_{h}^{t}(s;C)=(\mathbb{J}_{h}^{*}Q_{h}^{t})(s;C)-(\mathbb{J}_{h}^{t}Q_{h}^{t})(s;C)=\langle Q_{h}^{t},\mu_{h}^{*}\otimes\pi_{h}^{*}-\mu_{h}^{t}\otimes\pi_{h}^{t}\rangle_{\Omega\times\mathcal{A}}(s;C). (A.2)

After introducing these notations, we decompose the instantaneous regret at the tt-th episode into two terms,

V1(s1t;Ct)V1πt(s1t;Ct)=V1(s1t;Ct)V1t(s1t;Ct)𝕡1+V1t(s1t;Ct)V1πt(s1t;Ct)𝕡2.V_{1}^{*}(s_{1}^{t};C^{t})-V_{1}^{\pi^{t}}(s_{1}^{t};C^{t})=\underbrace{V_{1}^{*}(s_{1}^{t};C^{t})-V_{1}^{t}(s_{1}^{t};C^{t})}_{\mathbb{p}_{1}}+\underbrace{V_{1}^{t}(s_{1}^{t};C^{t})-V_{1}^{\pi^{t}}(s_{1}^{t};C^{t})}_{\mathbb{p}_{2}}. (A.3)

Then we consider these two terms separately. By the definition of value functions in (3.1) and the operator 𝕁h\mathbb{J}_{h}^{*} in (A.1), we have Vh=𝕁hQhV_{h}^{*}=\mathbb{J}_{h}^{*}Q_{h}^{*}. By the construction of Algorithm 2, we have Vht=𝕁htQhtV_{h}^{t}=\mathbb{J}_{h}^{t}Q_{h}^{t} similarly. Thus, for the first term 𝕡1\mathbb{p}_{1} defined in equation (A.3), using ξht\xi_{h}^{t} defined in (A.2), for any h[H],t[T]h\in[H],t\in[T], we have

\begin{align*}
V_{h}^{*}-V_{h}^{t}&=\mathbb{J}_{h}^{*}Q_{h}^{*}-\mathbb{J}_{h}^{t}Q_{h}^{t}=(\mathbb{J}_{h}^{*}Q_{h}^{*}-\mathbb{J}_{h}^{*}Q_{h}^{t})+(\mathbb{J}_{h}^{*}Q_{h}^{t}-\mathbb{J}_{h}^{t}Q_{h}^{t})\\
&=\mathbb{J}_{h}^{*}(Q_{h}^{*}-Q_{h}^{t})+\xi_{h}^{t}.
\end{align*}

Next, by the definition of the temporal-difference error $\delta_{h}^{t}$ in (6.1) and the Bellman optimality equation in (3.2), we have

\[
Q_{h}^{*}-Q_{h}^{t}=(v_{h}+P_{h}V_{h+1}^{*})-(v_{h}+P_{h}V_{h+1}^{t}-\delta_{h}^{t})=P_{h}(V_{h+1}^{*}-V_{h+1}^{t})+\delta_{h}^{t}.
\]

Hence we get

\[
V_{h}^{*}-V_{h}^{t}=\mathbb{J}_{h}^{*}P_{h}(V_{h+1}^{*}-V_{h+1}^{t})+\mathbb{J}_{h}^{*}\delta_{h}^{t}+\xi_{h}^{t}.
\]

Then, by recursively applying the above formula, we have

\[
V_{1}^{*}-V_{1}^{t}=\bigg(\prod_{h\in[H]}\mathbb{J}_{h}^{*}P_{h}\bigg)(V_{H+1}^{*}-V_{H+1}^{t})+\sum_{h\in[H]}\bigg(\prod_{i\in[h]}\mathbb{J}_{i}^{*}P_{i}\bigg)\mathbb{J}_{h}^{*}\delta_{h}^{t}+\sum_{h\in[H]}\bigg(\prod_{i\in[h]}\mathbb{J}_{i}^{*}P_{i}\bigg)\xi_{h}^{t}.
\]

By the definition of $\xi_{h}^{t}$ in equation (A.2) and $\zeta_{t,h}^{3}$ in equation (6.2), we get

\[
\sum_{h\in[H]}\bigg(\prod_{i\in[h]}\mathbb{J}_{i}^{*}P_{i}\bigg)\xi_{h}^{t}(s_{h};C^{t})=\sum_{h\in[H]}\mathbb{E}_{\mu^{*},\pi^{*}}\big[\langle Q_{h}^{t},\mu_{h}^{*}\otimes\pi_{h}^{*}-\mu_{h}^{t}\otimes\pi_{h}^{t}\rangle_{\Omega\times\mathcal{A}}(s_{h};C^{t})\,\big|\,s_{1}=s_{1}^{t}\big].
\]

Notice that $V_{H+1}^{*}=V_{H+1}^{t}=0$. Therefore, for any episode $t\in[T]$, we have

\begin{align*}
V_{1}^{*}(s_{1}^{t};C^{t})-V_{1}^{t}(s_{1}^{t};C^{t})=&\sum_{h\in[H]}\mathbb{E}_{\mu^{*},\pi^{*}}\big[\langle Q_{h}^{t},\mu_{h}^{*}\otimes\pi_{h}^{*}-\mu_{h}^{t}\otimes\pi_{h}^{t}\rangle_{\Omega\times\mathcal{A}}(s_{h};C^{t})\,\big|\,s_{1}=s_{1}^{t}\big]\\
&+\sum_{h\in[H]}\mathbb{E}_{\mu^{*},\pi^{*}}\big[\delta_{h}^{t}(s_{h},\omega_{h},a_{h})\,\big|\,s_{1}=s_{1}^{t}\big].
\end{align*}

We now bound the second term $\mathbb{p}_{2}$ in equation (A.3). By the definition of the temporal-difference error $\delta_{h}^{t}$ in (6.1), for any $h\in[H]$, $t\in[T]$, we note that

\begin{align*}
\delta_{h}^{t}(s_{h}^{t},\omega_{h}^{t},a_{h}^{t})&=(v_{h}^{t}+P_{h}V_{h+1}^{t}-Q_{h}^{t})(s_{h}^{t},\omega_{h}^{t},a_{h}^{t};C^{t})\\
&=(v_{h}^{t}+P_{h}V_{h+1}^{t}-Q_{h}^{\pi^{t}})(s_{h}^{t},\omega_{h}^{t},a_{h}^{t};C^{t})+(Q_{h}^{\pi^{t}}-Q_{h}^{t})(s_{h}^{t},\omega_{h}^{t},a_{h}^{t};C^{t})\\
&=(P_{h}V_{h+1}^{t}-P_{h}V_{h+1}^{\pi^{t}})(s_{h}^{t},\omega_{h}^{t},a_{h}^{t})+(Q_{h}^{\pi^{t}}-Q_{h}^{t})(s_{h}^{t},\omega_{h}^{t},a_{h}^{t}),
\end{align*}

where the last equality follows from the Bellman equation (3.1). Furthermore, using $\zeta_{t,h}^{1}$ and $\zeta_{t,h}^{2}$ defined in (6.2), we have

\begin{align*}
V_{h}^{t}(s_{h}^{t};C^{t})-V_{h}^{\pi^{t}}(s_{h}^{t};C^{t})={}&(V_{h}^{t}-V_{h}^{\pi^{t}})(s_{h}^{t};C^{t})-\delta_{h}^{t}(s_{h}^{t},\omega_{h}^{t},a_{h}^{t})+(Q_{h}^{\pi^{t}}-Q_{h}^{t})(s_{h}^{t},\omega_{h}^{t},a_{h}^{t};C^{t})\\
&+(P_{h}V_{h+1}^{t}-P_{h}V_{h+1}^{\pi^{t}})(s_{h}^{t},\omega_{h}^{t},a_{h}^{t};C^{t})\\
={}&(V_{h}^{t}-\widetilde{V}_{h}^{t})(s_{h}^{t};C^{t})-\delta_{h}^{t}(s_{h}^{t},\omega_{h}^{t},a_{h}^{t})+(\widetilde{V}_{h}^{t}-V_{h}^{\pi^{t}})(s_{h}^{t};C^{t})+(Q_{h}^{\pi^{t}}-Q_{h}^{t})(s_{h}^{t},\omega_{h}^{t},a_{h}^{t};C^{t})\\
&+\big(P_{h}(V_{h+1}^{t}-V_{h+1}^{\pi^{t}})\big)(s_{h}^{t},\omega_{h}^{t},a_{h}^{t};C^{t})-(V_{h+1}^{t}-V_{h+1}^{\pi^{t}})(s_{h+1}^{t};C^{t})+(V_{h+1}^{t}-V_{h+1}^{\pi^{t}})(s_{h+1}^{t};C^{t})\\
={}&\big[V_{h+1}^{t}(s_{h+1}^{t};C^{t})-V_{h+1}^{\pi^{t}}(s_{h+1}^{t};C^{t})\big]+\big[V_{h}^{t}(s_{h}^{t};C^{t})-\widetilde{V}_{h}^{t}(s_{h}^{t};C^{t})\big]-\delta_{h}^{t}(s_{h}^{t},\omega_{h}^{t},a_{h}^{t})+\zeta_{t,h}^{1}+\zeta_{t,h}^{2}.
\end{align*}

Applying the above equation recursively, we get that

\begin{align*}
V_{1}^{t}(s_{1}^{t};C^{t})-V_{1}^{\pi^{t}}(s_{1}^{t};C^{t})={}&V_{H+1}^{t}(s_{H+1}^{t};C^{t})-V_{H+1}^{\pi^{t}}(s_{H+1}^{t};C^{t})+\sum_{h\in[H]}\big[V_{h}^{t}(s_{h}^{t};C^{t})-\widetilde{V}_{h}^{t}(s_{h}^{t};C^{t})\big]\\
&-\sum_{h\in[H]}\delta_{h}^{t}(s_{h}^{t},\omega_{h}^{t},a_{h}^{t})+\sum_{h\in[H]}(\zeta_{t,h}^{1}+\zeta_{t,h}^{2}).
\end{align*}

Again, by the Bellman equation (3.1), we have

\[
V_{h}^{t}(s_{h}^{t};C^{t})-\widetilde{V}_{h}^{t}(s_{h}^{t};C^{t})=\big\langle Q_{h}^{t},(\mu_{h}^{t}-\mu_{h}^{*})\otimes\pi_{h}^{t}\big\rangle_{\Omega\times\mathcal{A}}(s_{h}^{t};C^{t}).
\]

Then we use $V_{H+1}^{t}=V_{H+1}^{\pi^{t}}=0$ to simplify the decomposition to the following form:

\begin{align*}
V_{1}^{t}(s_{1}^{t};C^{t})-V_{1}^{\pi^{t}}(s_{1}^{t};C^{t})={}&\sum_{h\in[H]}\big\langle Q_{h}^{t},(\mu_{h}^{t}-\mu_{h}^{*})\otimes\pi_{h}^{t}\big\rangle_{\Omega\times\mathcal{A}}(s_{h}^{t};C^{t})\\
&-\sum_{h\in[H]}\delta_{h}^{t}(s_{h}^{t},\omega_{h}^{t},a_{h}^{t})+\sum_{h\in[H]}(\zeta_{t,h}^{1}+\zeta_{t,h}^{2}).
\end{align*}

Therefore, combining the decompositions of $\mathbb{p}_{1}$ and $\mathbb{p}_{2}$ and summing over the episodes $t\in[T]$, we obtain

\begin{align*}
\operatorname{Reg}(T,\mu^{*})={}&\sum_{t\in[T]}\big[V_{1}^{*}(s_{1}^{t};C^{t})-V_{1}^{\pi^{t}}(s_{1}^{t};C^{t})\big]\\
={}&\sum_{t\in[T]}\sum_{h\in[H]}\big\{\mathbb{E}_{\mu_{h}^{*},\pi_{h}^{*}}[\delta_{h}^{t}(s_{h},\omega_{h},a_{h})\,|\,s_{1}=s_{1}^{t}]-\delta_{h}^{t}(s_{h}^{t},\omega_{h}^{t},a_{h}^{t})\big\}+\sum_{t\in[T]}\sum_{h\in[H]}(\zeta_{t,h}^{1}+\zeta_{t,h}^{2})\\
&+\sum_{t\in[T]}\sum_{h\in[H]}\mathbb{E}_{\mu_{h}^{*},\pi_{h}^{*}}\big[\big\langle Q_{h}^{t},\mu_{h}^{*}\otimes\pi_{h}^{*}-\mu_{h}^{t}\otimes\pi_{h}^{t}\big\rangle_{\Omega\times\mathcal{A}}(s_{h};C^{t})\,\big|\,s_{1}=s_{1}^{t}\big]\\
&+\sum_{t\in[T]}\sum_{h\in[H]}\big\langle Q_{h}^{t},(\mu_{h}^{t}-\mu_{h}^{*})\otimes\pi_{h}^{t}\big\rangle_{\Omega\times\mathcal{A}}(s_{h}^{t};C^{t}).
\end{align*}

Therefore, we conclude the proof of the lemma. ∎

A.5 Proof of Lemma 6.2

Proof.

In the following lemma, we first bound the difference between the $Q$-function maintained in Algorithm 2 (without the bonus) and the true $Q$-function of any policy $\pi$ by their expected difference at the next step, plus an error term. This error term can be upper bounded by our bonus with high probability. The lemma can be derived from Lemma B.4 in [25] with slight revisions.

Lemma A.4.

Set $\lambda=\max\{1,\Psi^{2}\}$. There exists an absolute constant $c_{\rho}$ such that for $\rho=c_{\rho}d_{\psi}H\sqrt{\iota}$, where $\iota=\log(2d_{\psi}\Psi^{2}T/\delta)$, and for any fixed policy $\pi$, with probability at least $1-\delta/2$, we have for all $s\in{\mathcal{S}}$, $\omega\in\Omega$, $a\in\mathcal{A}$, $h\in[H]$, $t\in[T]$,

\[
\psi(s,\omega,a)^{\top}q_{h}^{t}-Q_{h}^{\pi}(s,\omega,a)=P_{h}(V_{h+1}^{t}-V_{h+1}^{\pi})(s,\omega,a)+\triangle_{h}^{t}(s,\omega,a),
\]

for some $\triangle_{h}^{t}(s,\omega,a)$ that satisfies $|\triangle_{h}^{t}(s,\omega,a)|\leq\rho\|\psi(s,\omega,a)\|_{(\Gamma_{h}^{t})^{-1}}$.

Now we are ready to prove Lemma 6.2. By the definition of $\delta_{h}^{t}$ in (6.1), we have $\delta_{h}^{t}=(v_{h}+P_{h}V_{h+1}^{t}-Q_{h}^{t})=(P_{h}V_{h+1}^{t}-P_{h}V_{h+1}^{\pi^{t}})+(Q_{h}^{\pi^{t}}-Q_{h}^{t})$. Therefore, by the construction of $Q_{h}^{t}$ in Algorithm 2, we obtain that

\begin{align*}
\delta_{h}^{t}(s,\omega,a)&\geq(P_{h}V_{h+1}^{t}-P_{h}V_{h+1}^{\pi^{t}})(s,\omega,a)+Q_{h}^{\pi^{t}}(s,\omega,a)-\big(\psi(s,\omega,a)^{\top}q_{h}^{t}+\rho\|\psi(s,\omega,a)\|_{(\Gamma_{h}^{t})^{-1}}\big)\\
&=-\triangle_{h}^{t}(s,\omega,a)-\rho\|\psi(s,\omega,a)\|_{(\Gamma_{h}^{t})^{-1}}\geq-2\rho\|\psi(s,\omega,a)\|_{(\Gamma_{h}^{t})^{-1}},
\end{align*}

which concludes the proof. ∎

A.6 Proof of Lemma 6.3

Proof.

Denote the optimal signaling schemes corresponding to the true prior $\mu_{h}^{*}$ and the estimated prior $\mu_{h}^{t}$, respectively, as

\[
\pi_{h}^{\prime}=\mathop{\mathrm{argmax}}_{\pi_{h}\in\operatorname{Pers}(\mu_{h}^{*})}\big\langle Q_{h}^{t},\mu_{h}^{*}\otimes\pi_{h}\big\rangle_{\Omega\times\mathcal{A}}(\cdot;C^{t})\qquad\text{and}\qquad\pi_{h}^{\prime\prime}=\mathop{\mathrm{argmax}}_{\pi_{h}\in\operatorname{Pers}(\mu_{h}^{t})}\big\langle Q_{h}^{t},\mu_{h}^{t}\otimes\pi_{h}\big\rangle_{\Omega\times\mathcal{A}}(\cdot;C^{t}),
\]

where the $Q$-function $Q_{h}^{t}$ is given by Algorithm 2. Notably, $\pi_{h}^{\prime}$ is different from the truly optimal signaling scheme $\pi_{h}^{*}$, since $\pi_{h}^{\prime}$ is computed based on the approximate $Q$-function $Q_{h}^{t}$. By definition, we can decompose the difference as follows:

\begin{align*}
\big\langle Q_{h}^{t},\mu_{h}^{*}\otimes\pi_{h}^{*}-\mu_{h}^{t}\otimes\pi_{h}^{t}\big\rangle_{\Omega\times\mathcal{A}}(s_{h};C^{t})={}&\big\langle Q_{h}^{t},\mu_{h}^{*}\otimes\pi_{h}^{*}-\mu_{h}^{*}\otimes\pi_{h}^{\prime}\big\rangle_{\Omega\times\mathcal{A}}(s_{h};C^{t})\tag{A.4}\\
&+\big\langle Q_{h}^{t},\mu_{h}^{*}\otimes\pi_{h}^{\prime}-\mu_{h}^{t}\otimes\pi_{h}^{\prime\prime}\big\rangle_{\Omega\times\mathcal{A}}(s_{h};C^{t})\tag{A.5}\\
&+\big\langle Q_{h}^{t},\mu_{h}^{t}\otimes\pi_{h}^{\prime\prime}-\mu_{h}^{t}\otimes\pi_{h}^{t}\big\rangle_{\Omega\times\mathcal{A}}(s_{h};C^{t}).\tag{A.6}
\end{align*}

Equation (A.4) is always non-positive, since $\pi_{h}^{*}\in\operatorname{Pers}(\mu_{h}^{*})$ while $\pi_{h}^{\prime}$ maximizes $\big\langle Q_{h}^{t},\mu_{h}^{*}\otimes\pi_{h}\big\rangle_{\Omega\times\mathcal{A}}$ over $\operatorname{Pers}(\mu_{h}^{*})$. Applying Lemma A.2 to equation (A.5), we get

\begin{align*}
\big\langle Q_{h}^{t},\mu_{h}^{*}\otimes\pi_{h}^{\prime}-\mu_{h}^{t}\otimes\pi_{h}^{\prime\prime}\big\rangle_{\Omega\times\mathcal{A}}(s_{h};C^{t})\leq{}&\operatorname{Gap}\Big(s_{h},\mu_{h}^{*}(\cdot|c_{h}^{t}),{\mathrm{B}}_{1}\big(\mu_{h}^{*}(\cdot|c_{h}^{t}),\big\|\mu_{h}^{*}(\cdot|c_{h}^{t})-\mu_{h}^{t}(\cdot|c_{h}^{t})\big\|_{1}\big);Q_{h}^{t}\Big)\\
&+\frac{H}{2}\big\|\mu_{h}^{*}(\cdot|c_{h}^{t})-\mu_{h}^{t}(\cdot|c_{h}^{t})\big\|_{1}.
\end{align*}

According to Corollary 6.6, we can bound the right-hand side above in terms of the norm of the feature vector and the radius of the confidence region for $\theta_{h}$:

\[
\big\langle Q_{h}^{t},\mu_{h}^{*}\otimes\pi_{h}^{\prime}-\mu_{h}^{t}\otimes\pi_{h}^{\prime\prime}\big\rangle_{\Omega\times\mathcal{A}}(s_{h};C^{t})\leq\Big(\frac{HL_{\mu}K}{p_{0}D}+\frac{HL_{\mu}K}{2}\Big)\beta\|\phi(c_{h}^{t})\|_{(\Sigma_{h}^{t})^{-1}}.
\]

We also note that equation (A.6) is equal to $\operatorname{Gap}\big(s_{h},\mu_{h}^{t}(\cdot|c_{h}^{t}),\mu_{\mathcal{B}_{h}^{t}}(\cdot|c_{h}^{t});Q_{h}^{t}\big)$. By Lemma A.1, on the event $\{\theta_{h}^{*}\in\mathcal{B}_{h}^{t}\}$, we have

\[
\operatorname{Gap}\big(s_{h},\mu_{h}^{t}(\cdot|c_{h}^{t}),\mu_{\mathcal{B}_{h}^{t}}(\cdot|c_{h}^{t});Q_{h}^{t}\big)\leq\frac{2HL_{\mu}K}{p_{0}D}\beta\|\phi(c_{h}^{t})\|_{(\Sigma_{h}^{t})^{-1}}.
\]

Therefore, on the given event, combining the bounds on (A.4), (A.5) and (A.6), we have

\[
\mathbb{E}_{\mu_{h}^{*},\pi_{h}^{*}}\big[\big\langle Q_{h}^{t},\mu_{h}^{*}\otimes\pi_{h}^{*}-\mu_{h}^{t}\otimes\pi_{h}^{t}\big\rangle_{\Omega\times\mathcal{A}}(s_{h};C^{t})\,\big|\,s_{1}=s_{1}^{t}\big]\leq\Big(\frac{3HL_{\mu}K}{p_{0}D}+\frac{HL_{\mu}K}{2}\Big)\beta\|\phi(c_{h}^{t})\|_{(\Sigma_{h}^{t})^{-1}}.
\]
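Explicitly, the constant in the display above is obtained by adding the three bounds just derived for the terms (A.4), (A.5) and (A.6):
\[
0+\Big(\frac{HL_{\mu}K}{p_{0}D}+\frac{HL_{\mu}K}{2}\Big)\beta\|\phi(c_{h}^{t})\|_{(\Sigma_{h}^{t})^{-1}}+\frac{2HL_{\mu}K}{p_{0}D}\beta\|\phi(c_{h}^{t})\|_{(\Sigma_{h}^{t})^{-1}}=\Big(\frac{3HL_{\mu}K}{p_{0}D}+\frac{HL_{\mu}K}{2}\Big)\beta\|\phi(c_{h}^{t})\|_{(\Sigma_{h}^{t})^{-1}}.
\]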

Summing over $t\in[T]$ and $h\in[H]$, we get

\begin{align*}
\sum_{t\in[T]}\sum_{h\in[H]}\mathbb{E}_{\mu_{h}^{*},\pi_{h}^{*}}\big[\big\langle Q_{h}^{t},\mu_{h}^{*}\otimes\pi_{h}^{*}&-\mu_{h}^{t}\otimes\pi_{h}^{t}\big\rangle_{\Omega\times\mathcal{A}}(s_{h};C^{t})\,\big|\,s_{1}=s_{1}^{t}\big]\\
&\leq\Big(\frac{3HL_{\mu}K}{p_{0}D}+\frac{HL_{\mu}K}{2}\Big)\beta\sum_{t\in[T]}\sum_{h\in[H]}\|\phi(c_{h}^{t})\|_{(\Sigma_{h}^{t})^{-1}}.
\end{align*}

Therefore, we conclude the proof of Lemma 6.3. ∎

A.7 Proof of Lemma 6.4

Proof.

By definition, we can rewrite the difference in Lemma 6.4 as

\begin{align*}
\big\langle Q_{h}^{t},(\mu_{h}^{t}-\mu_{h}^{*})\otimes\pi_{h}^{t}\big\rangle_{\Omega\times\mathcal{A}}(s_{h}^{t};C^{t})&=\int_{\Omega\times\mathcal{A}}\big[\mu_{h}^{t}(\omega|c_{h}^{t})-\mu_{h}^{*}(\omega|c_{h}^{t})\big]\pi_{h}^{t}(s_{h}^{t},\omega,a)Q_{h}^{t}(s_{h}^{t},\omega,a)\,\mathrm{d}a\,\mathrm{d}\omega\\
&=\int_{\Omega}\big[\mu_{h}^{t}(\omega|c_{h}^{t})-\mu_{h}^{*}(\omega|c_{h}^{t})\big]\int_{\mathcal{A}}\pi_{h}^{t}(s_{h}^{t},\omega,a)Q_{h}^{t}(s_{h}^{t},\omega,a)\,\mathrm{d}a\,\mathrm{d}\omega.
\end{align*}

By Hölder's inequality, we have

\[
\Big|\big\langle Q_{h}^{t},(\mu_{h}^{t}-\mu_{h}^{*})\otimes\pi_{h}^{t}\big\rangle_{\Omega\times\mathcal{A}}(s_{h}^{t};C^{t})\Big|\leq\big\|\mu_{h}^{t}(\cdot|c_{h}^{t})-\mu_{h}^{*}(\cdot|c_{h}^{t})\big\|_{1}\sup_{\omega\in\Omega}\bigg|\int_{\mathcal{A}}\pi_{h}^{t}(s_{h}^{t},\omega,a)Q_{h}^{t}(s_{h}^{t},\omega,a)\,\mathrm{d}a\bigg|.
\]

Since $Q_{h}^{t}\leq H$ for any $h\in[H]$ and $t\in[T]$, the inequality can be simplified to

\[
\Big|\big\langle Q_{h}^{t},(\mu_{h}^{t}-\mu_{h}^{*})\otimes\pi_{h}^{t}\big\rangle_{\Omega\times\mathcal{A}}(s_{h}^{t};C^{t})\Big|\leq H\big\|\mu_{h}^{t}(\cdot|c_{h}^{t})-\mu_{h}^{*}(\cdot|c_{h}^{t})\big\|_{1}.
\]

Under the assumptions on the prior and the link function, on the given event, we obtain that

\[
\sum_{t\in[T]}\sum_{h\in[H]}\big\langle Q_{h}^{t},(\mu_{h}^{t}-\mu_{h}^{*})\otimes\pi_{h}^{t}\big\rangle_{\Omega\times\mathcal{A}}(s_{h}^{t};C^{t})\leq HL_{\mu}K\beta\sum_{h\in[H]}\sum_{t\in[T]}\|\phi(c_{h}^{t})\|_{(\Sigma_{h}^{t})^{-1}}.
\]

Therefore, we conclude the proof of Lemma 6.4. ∎

A.8 Proof of Corollary 6.6

Proof.

According to Assumption 4.2 on the prior, we can show that for any $\mu_{\theta^{\prime}}(\cdot|c)\in\mathcal{B}$,

\[
\left\|\mu_{\theta}(\cdot|c)-\mu_{\theta^{\prime}}(\cdot|c)\right\|_{1}\leq L_{\mu}\big\|f(\phi(c)^{\top}\theta)-f(\phi(c)^{\top}\theta^{\prime})\big\|.
\]

Moreover, by Assumption 4.3 on the link function $f(\cdot)$, we have

\[
\left\|\mu_{\theta}(\cdot|c)-\mu_{\theta^{\prime}}(\cdot|c)\right\|_{1}\leq L_{\mu}K\big\|\phi(c)^{\top}(\theta-\theta^{\prime})\big\|\leq L_{\mu}K\left\|\phi(c)\right\|_{\Sigma^{-1}}\epsilon,
\]

where the last inequality follows from the Cauchy–Schwarz inequality and the radius of the confidence region. Therefore, $\mathcal{B}\subseteq{\mathrm{B}}(\mu_{\theta}(\cdot|c),L_{\mu}K\|\phi(c)\|_{\Sigma^{-1}}\epsilon)$, and by Lemma 6.5, we can conclude the result. ∎

A.9 Auxiliary Lemmas

This section presents several auxiliary lemmas and their proofs.

Lemma A.5 (Martingale Bound; [9]).

For $\zeta_{t,h}^{1}$ and $\zeta_{t,h}^{2}$ defined in (6.2) and for any fixed $\delta\in(0,1)$, with probability at least $1-\delta/2$, we have

\[
\sum_{t\in[T]}\sum_{h\in[H]}(\zeta_{t,h}^{1}+\zeta_{t,h}^{2})\leq\sqrt{16TH^{3}\log(4/\delta)}.
\]
Proof.

See [9] for a detailed proof. ∎

Lemma A.6.

Suppose that $\phi_{1},\phi_{2},\ldots,\phi_{T}\in\mathbb{R}^{d_{\phi}\times d}$ and that there exists a constant $\Phi>0$ such that $\|\phi_{i}\|\leq\Phi$ for all $1\leq i\leq T$. Let $\Sigma_{t}=\lambda I_{d_{\phi}}+\sum_{i\in[t-1]}\phi_{i}\phi_{i}^{\prime}$ for some $\lambda\geq\Phi^{2}$. Then,

\[
\sum_{t\in[T]}\|\phi_{t}\|_{(\Sigma_{t})^{-1}}\leq\sqrt{2d_{\phi}T\log\big(1+T\Phi^{2}/(\lambda d_{\phi})\big)}.
\]
Proof.

First, we apply the Cauchy–Schwarz inequality:

\[
\sum_{t\in[T]}\|\phi_{t}\|_{(\Sigma_{t})^{-1}}\leq\sqrt{T\sum_{t\in[T]}\|\phi_{t}\|^{2}_{(\Sigma_{t})^{-1}}}.
\]

Since $\|\phi_{t}\|_{(\Sigma_{t})^{-1}}=\sqrt{\phi_{t}^{\top}(\Sigma_{t})^{-1}\phi_{t}}\leq\sqrt{\lambda^{-1}\phi_{t}^{\top}\phi_{t}}\leq\Phi/\sqrt{\lambda}\leq 1$, we can use Lemma A.7 to bound the sum of squares:

\begin{align*}
\sum_{t\in[T]}\|\phi_{t}\|_{(\Sigma_{t})^{-1}}&\leq\sqrt{2T\log\big(\det(\Sigma_{T})\det(\Sigma_{1})^{-1}\big)}\\
&\leq\sqrt{2d_{\phi}T\log\big(1+T\Phi^{2}/(\lambda d_{\phi})\big)}.
\end{align*}

The last inequality is derived from Lemma A.8. ∎
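As an informal numerical sanity check (not part of the proof), the following short Python snippet verifies the final bound of Lemma A.6 for randomly generated vector-valued features; the dimension $d_{\phi}$, horizon $T$, norm cap $\Phi$, and the Gaussian feature distribution are arbitrary illustrative choices.

```python
import numpy as np

# Sanity check of Lemma A.6:
#   sum_t ||phi_t||_{Sigma_t^{-1}} <= sqrt(2 d_phi T log(1 + T Phi^2 / (lambda d_phi))).
# All parameters below are illustrative choices, not quantities from the paper.
rng = np.random.default_rng(0)
d_phi, T, Phi = 5, 2000, 1.0
lam = max(1.0, Phi ** 2)                     # lambda >= Phi^2, as required by the lemma

Sigma = lam * np.eye(d_phi)                  # Sigma_1 = lambda * I
total = 0.0
for _ in range(T):
    phi = rng.normal(size=d_phi)
    norm = np.linalg.norm(phi)
    if norm > Phi:                           # enforce ||phi_t|| <= Phi
        phi *= Phi / norm
    total += np.sqrt(phi @ np.linalg.solve(Sigma, phi))  # ||phi_t||_{(Sigma_t)^{-1}}
    Sigma += np.outer(phi, phi)              # Sigma_{t+1} = Sigma_t + phi_t phi_t^T

bound = np.sqrt(2 * d_phi * T * np.log(1 + T * Phi ** 2 / (lam * d_phi)))
print(f"lhs = {total:.1f}, rhs = {bound:.1f}, lhs <= rhs: {total <= bound}")
```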

Lemma A.7 (Sum of Potential Function; [1]).

For any sequence $\{\phi_{t}\}_{t\in[T]}$, let $\Sigma_{t}=\lambda I_{d_{\phi}}+\sum_{i\in[t-1]}\phi_{i}\phi_{i}^{\prime}$ for some $\lambda>0$. Then we have

\[
\sum_{t\in[T]}\min\{\|\phi_{t}\|_{(\Sigma_{t})^{-1}}^{2},1\}\leq 2\log\big(\det(\Sigma_{T})\det(\Sigma_{1})^{-1}\big).
\]
Proof.

See [1] for a detailed proof. ∎

Lemma A.8 (Determinant-Trace Inequality).

Suppose that $\phi_{1},\phi_{2},\ldots,\phi_{T}\in\mathbb{R}^{d_{\phi}\times d}$ and that there exists a constant $\Phi>0$ such that $\|\phi_{i}\|\leq\Phi$ for all $1\leq i\leq T$. Let $\Sigma_{t}=\lambda I_{d_{\phi}}+\sum_{i\in[t-1]}\phi_{i}\phi_{i}^{\prime}$ for some $\lambda\geq 0$. Then,

\[
\det(\Sigma_{t})\leq\big(\lambda+t\Phi^{2}/d_{\phi}\big)^{d_{\phi}}.
\]
Proof.

Let $\lambda_{1},\lambda_{2},\ldots,\lambda_{d_{\phi}}$ be the eigenvalues of $\Sigma_{t}$. Since $\Sigma_{t}$ is positive semidefinite, its eigenvalues are nonnegative. Also, note that $\det(\Sigma_{t})=\prod_{s=1}^{d_{\phi}}\lambda_{s}$ and $\operatorname{tr}(\Sigma_{t})=\sum_{s=1}^{d_{\phi}}\lambda_{s}$. By the inequality of arithmetic and geometric means,

\[
\det(\Sigma_{t})\leq\big(\operatorname{tr}(\Sigma_{t})/d_{\phi}\big)^{d_{\phi}}.
\]

It remains to upper bound the trace:

\[
\operatorname{tr}(\Sigma_{t})=\operatorname{tr}(\lambda I_{d_{\phi}})+\sum_{i=1}^{t-1}\operatorname{tr}(\phi_{i}\phi_{i}^{\prime})=d_{\phi}\lambda+\sum_{i=1}^{t-1}\|\phi_{i}\|^{2}\leq d_{\phi}\lambda+t\Phi^{2},
\]

and the lemma follows. ∎
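For completeness, here is a similarly informal numerical check of the determinant-trace inequality in Lemma A.8 (again with arbitrary, illustrative parameters and vector-valued features; none of the specific values come from the paper):

```python
import numpy as np

# Check det(Sigma_t) <= (lambda + t * Phi^2 / d_phi)^{d_phi} along a random feature sequence.
# Parameters are illustrative only.
rng = np.random.default_rng(1)
d_phi, T, Phi, lam = 4, 500, 2.0, 0.5

Sigma = lam * np.eye(d_phi)
holds = True
for t in range(1, T + 1):
    # Sigma_t uses phi_1, ..., phi_{t-1}, so check before adding phi_t.
    holds &= np.linalg.det(Sigma) <= (lam + t * Phi ** 2 / d_phi) ** d_phi
    phi = rng.uniform(-1.0, 1.0, size=d_phi)
    phi *= Phi / np.linalg.norm(phi)         # scale so that ||phi_t|| = Phi
    Sigma += np.outer(phi, phi)

print("determinant-trace inequality holds for all t:", bool(holds))
```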