
Sequential Information Design: Markov Persuasion Process and Its Efficient Reinforcement Learning

Jibang Wu  Zixuan Zhang  Zhe Feng  Zhaoran Wang
Zhuoran Yang  Michael I. Jordan  Haifeng Xu
University of Virginia. Email: [email protected]. University of Science and Technology of China. Email: [email protected]. Google. Email: [email protected]. Northwestern University. Email: [email protected]. Yale University. Email: [email protected]. UC Berkeley. Email: [email protected]. University of Virginia. Email: [email protected].
Abstract

In today’s economy, it has become important for Internet platforms to consider the sequential information design problem in order to align their long-term interests with the incentives of gig service providers (e.g., drivers, hosts). This paper proposes a novel model of sequential information design, namely the Markov persuasion process (MPP). Specifically, in an MPP, a sender, with an informational advantage, seeks to persuade a stream of myopic receivers to take actions that maximize the sender’s cumulative utilities in a finite-horizon Markovian environment with varying prior and utility functions. Planning in MPPs thus faces the unique challenge of finding a signaling policy that is simultaneously persuasive to the myopic receivers and induces the optimal long-term cumulative utilities of the sender. Nevertheless, at the population level where the model is known, it turns out that we can efficiently determine the optimal (resp. $\epsilon$-optimal) policy with finite (resp. infinite) states and outcomes, through a modified formulation of the Bellman equation that additionally takes persuasiveness into consideration.

Our main technical contribution is to study the MPP under the online reinforcement learning (RL) setting, where the goal is to learn the optimal signaling policy by interacting with the underlying MPP, without knowledge of the sender’s utility functions, prior distributions, or the Markov transition kernels. For such a problem, we design a provably efficient no-regret learning algorithm, the Optimism-Pessimism Principle for Persuasion Process (OP4), which features a novel combination of both the optimism and pessimism principles. In particular, we obtain optimistic estimates of the value functions to encourage exploration under the unknown environment. Meanwhile, we additionally robustify the signaling policy with respect to the uncertainty of the prior estimation to prevent the receivers’ detrimental equilibrium behavior. Our algorithm enjoys sample efficiency by achieving a sublinear $\sqrt{T}$-regret upper bound. Furthermore, both our algorithm and theory can be applied to MPPs with large spaces of outcomes and states via function approximation, and we showcase such a success under the linear setting.

1 Introduction

Most sequential decision models assume that there is a sole agent who possesses and processes all relevant (online or offline) information and takes an action accordingly. However, the economic literature on information design [26, 8] highlights the importance of considering information asymmetry in decision making, where the decision maker and information possessor may be two parties having different interests and goals. For example, a ride-sharing platform holds historical and real-time data on active riders and driver types in different locations, based on which they have developed centralized combinatorial optimization algorithms and reinforcement learning algorithms for vehicle repositioning, routing and order matching to optimize their operational efficiency and profit [33, 42, 34, 43]. But the de facto decision makers are the drivers. Moreover, as increasingly many drivers are freelancers instead of employees, the platform cannot expect to give mandatory orders to them. On the other hand, if the platform shares no information on rider demand, most drivers will not be able to efficiently find profitable trips. Therefore, it is not only realistic but also necessary to consider an information design problem that aligns the interests of the two parties in sequential decision making processes of this kind.

Given the large data sets being collected by corporations and governments, with avowed goals that relate data analysis to social welfare, it is timely to pursue formal treatments of sequential information design, to understand how to strategically inform the (sequential) decision makers (e.g., users, clients or citizens) impacted by centralized data analysis. In particular, we wish to understand the resulting equilibrium outcomes of both parties. As a concrete example, consider an online shopping platform which may make use of learning tools such as reinforcement learning or online convex optimization to manage inventory and ensure profitability [20, 36]. The platform cannot single-handedly manage its inventory, instead it requires information design (a.k.a., Bayesian persuasion) in its interactions with its suppliers and consumers. On the supply side, it could strategically reveal aspects of consumer sentiment (e.g., rough number of visits, search) to the suppliers in order to guide their sales expectation and negotiate for lower unit prices. On the demand side, it could tactically control displayed product information (e.g., last five remaining, editor’s choice) so as to influence consumers’ perception of products and consequently their purchase decisions. Similar situations can be anticipated for a recommendation platform. On the one hand, it should recommend most relevant items to its users for click-through and engagement. On the other hand, its recommendations are subject to misalignments with long-term objectives such as profits (e.g., from paid promotion), social impact (e.g., to prevent misinformation and filter bubbles) or development of a creator ecosystem [49, 53, 37].

1.1 Our Results and Contributions

To provide a formal foundation for the study of sequential information design, we introduce the Markov persuasion process (MPP), where a sender, with an informational advantage, seeks to persuade a stream of myopic receivers to take actions that maximize the sender’s cumulative utility in a finite-horizon Markovian environment with varying prior and utility functions. We need to address a key challenge regarding the planning problem in MPPs; specifically, how to find signaling policies that are persuasive and at the same time optimized for the sender’s long-term objective. Moreover, in the face of uncertainty about both the environment and the receivers, there is a dilemma: the optimal policy based on an estimated prior is not necessarily persuasive and thus cannot induce the desired trajectory, whereas a full-information-revelation policy is always persuasive but usually leads to suboptimal cumulative utility. So a reinforcement learning algorithm for MPPs has to ensure optimality under the premise of robust persuasiveness. This makes our algorithm design non-trivial and the regret analysis highly challenging.

We show how to surmount these analysis and design challenges, and present a no-regret learning algorithm, which we refer to as Optimism-Pessimism Principle for Persuasion Process (OP4), that provably achieves an $\widetilde{O}\big(\sqrt{d_{\phi}^{2}d_{\psi}^{3}H^{4}T}\big)$ regret with high probability, where $d_{\phi},d_{\psi}$ are dimensions of the feature spaces, $H$ is the horizon length in each episode, $T$ is the number of episodes, and $\widetilde{O}(\cdot)$ hides logarithmic factors as well as problem-dependent parameters. To establish this result, in Section 3.3 we start by constructing a modified formulation of the Bellman equation that can efficiently determine the optimal (resp. $\epsilon$-optimal) policy with finite (resp. infinite) states and outcomes. Section 4.2 then considers the learning problem, in particular the design of OP4, which adopts the optimism principle in utility estimation to incentivize exploration and the pessimism principle in prior estimation to prevent a detrimental equilibrium for the receiver. In Sections 4.3 and 4.4, we showcase OP4 in tabular MPPs and the contextual Bayesian persuasion problem, respectively, both of which are practical special cases of MPPs. In Section 5, we then generalize these positive results to MPPs with large outcome and state spaces via linear function approximation and generalized linear models.

In summary, our contributions are threefold. At the conceptual level, we identify the need for sequential information design in real-world problems and accordingly formulate a novel model, the MPP, to capture the misaligned incentives between the (sequential) decision makers and information possessors. At the methodological level, our key insight is a new algorithmic principle: optimism to encourage exploration and pessimism to induce robust equilibrium behavior. Finally, at the technical level, we develop a novel regret decomposition tailored to this combination of optimism and pessimism in the design of online learning algorithms. The fact that the combined optimism-pessimism concept can still lead to $O(\sqrt{T})$ regret for strategic setups was not clear before our new regret decomposition lemma. We expect this design principle and our proof techniques to be useful for other strategic learning problems.

1.2 Related Work

Our work is built on the foundations of information design and reinforcement learning. We refer readers to Sections 2.1 and 2.2 for background and formal introductions. Here we focus on the technical and modeling comparisons with related work on dynamic Bayesian persuasion and efficient reinforcement learning.

Dynamic Bayesian persuasion. Starting from the seminal work of Kamenica and Gentzkow [26], the study of Bayesian persuasion looks at the design problem of influencing an uninformed decision maker through strategic information revelation. Many variants of this model have been studied, with applications in security, advertising, finance, etc. [44, 54, 21, 6]. More recently, several dynamic Bayesian persuasion frameworks have been proposed to model the long-term interest of the sender. Many papers [16, 45, 17, 29] consider the setting where the sender observes the evolving states of a Markov chain and seeks to influence the receiver’s belief of the state through signaling, thereby persuading him to take certain actions. In contrast to our setting, at each round the receiver’s action has no influence on the evolution of the Markov process, and thus he can only maximize his utility based on his belief of the current state, given all the historical signals received from the sender. In Ely [16] and Farhadi and Teneketzis [17], the Markov chain has two states (one is absorbing): the receiver is interested in detecting the jump to the absorbing state, whereas the sender seeks to prolong the time to detection of such a jump. Renault et al. [45] show that a greedy disclosure policy that ignores its influence on future utility can be optimal in Markov chains with special utility functions. Lehrer and Shaiderman [29] characterize optimal strategies under different discount factors as well as the optimal values the sender could achieve. Closer to our model is that of Gan et al. [19]: we both assume a Markov environment with state transitions influenced by the receiver’s action, as well as a separate persuasion state drawn from a prior independent of the receiver’s action. However, Gan et al. [19] focus on the planning problem for the infinite-horizon MDP, solving for the sender’s optimal signaling policy when the environment is known, in cases where the receiver is myopic or far-sighted. In particular, it is shown to be NP-hard to approximate an optimal policy against a far-sighted receiver, which also justifies our interest in the myopic receiver. Another closely related work [61] studies the learning problem in a repeated persuasion setting (without Markov state transitions) between a stream of myopic receivers and a sender without initial knowledge of the prior. It introduces the notion of regret as well as the robustness principle for this learning problem, which we adopt and generalize to our model.

Bayesian Incentive-Compatible Bandit Exploration. Our work is also loosely related to a seminal result by Mansour et al. [35], who model the misaligned incentives between a system (i.e., sender) and a stream of myopic agents (i.e., receivers). Mansour et al. [35] show that, using information asymmetry, the system can create intrinsic incentives for agents to follow its recommendations. In this problem, the sender’s objective is limited to the social welfare, i.e., the cumulative utility of all agents, whereas we make no assumption on the sender’s utility function and thus her long-term objective. Besides, our model is designed to capture more general situations where each receiver could have different priors and utility functions, and the environment might be Markovian with dynamics influenced by the receivers’ actions.

Efficient Reinforcement Learning. Reinforcement learning has seen successful applications in various domains, such as robotics, finance and dialogue systems [27, 58, 30]. Along with this empirical success, there has been a growing quest to establish provably efficient RL methods. Classical sample-efficiency results focus on tabular environments with small, finite state spaces [2, 39, 4, 12, 47, 24, 46]. Notably, through the design principle known as optimism in the face of uncertainty [28], an RL algorithm provably incurs an $\Omega(\sqrt{|{\mathcal{S}}||\mathcal{A}|T})$ regret under the tabular setting, where ${\mathcal{S}}$ and $\mathcal{A}$ are the state and action spaces respectively [24, 4]. More recently, there have been advances in RL with function approximation, especially the linear case. Jin et al. [25] proposed an efficient algorithm for a setting where the transition kernel and the utility function are both linear functions with respect to a feature mapping $\phi:{\mathcal{S}}\times\mathcal{A}\to\mathbb{R}^{d}$. A similar assumption has been studied for different settings and has led to sample-efficiency results [55, 14, 38, 57, 22]. Moreover, other general function approximations have been studied in parallel, including generalized linear function approximation [52], linear mixture MDPs based on a ternary feature mapping [3, 60, 9, 59], kernel approximation [56], as well as models based on the low Bellman rank assumption [23, 11]. We make use of these function approximation techniques to model our conditional prior, and we show how to integrate the persuasion structure into these efficient reinforcement learning frameworks, thereby obtaining sample-efficient results for large-scale MPPs.

2 Preliminaries

This section provides some necessary background in information design and Markov decision processes, as preparation for our model of Markov persuasion processes presented in the next section.

2.1 Basics of Information Design

Classic information design [26] considers the persuasion problem between a single sender (she) and receiver (he). The receiver is the only actor, and looks to take an action $a\in\mathcal{A}$ which results in receiver utility $u(\omega,a)$ and sender utility $v(\omega,a)$. Here $\omega\in\Omega$ is the realized outcome of certain environment uncertainty, which is drawn from a prior distribution $\mu\in\Delta(\Omega)$, and $\mathcal{A}$ is a finite set of available actions for the receiver. While $u,v:\Omega\times\mathcal{A}\to[0,1]$ and the prior distribution $\mu$ are all common knowledge, the sender possesses an informational advantage and can privately observe the realized outcome $\omega$. The persuasion problem studies how the sender can selectively reveal her private information about $\omega$ to influence the receiver’s decisions and ultimately maximize her own expected utility $v$.

To model the sender’s strategic revelation of information, it is standard to use a signaling scheme, which essentially specifies the conditional distribution of a random variable (namely the signal), given the outcome $\omega$. Before the realization of the outcome, the sender commits to such a signaling scheme. Given the realized outcome, the sender samples a signal from the conditional distribution according to the signaling scheme and reveals it to the receiver. Upon receiving this signal, the receiver infers a posterior belief about the outcome via Bayes’ theorem (based on the correlation between the signal and the outcome $\omega$ as promised by the signaling scheme) and then chooses an action $a$ that maximizes his expected utility.

A standard revelation-principle-style argument shows that it is without loss of generality to focus on direct and persuasive signaling schemes [26]. A scheme is direct if each signal corresponds to an action recommendation to the receiver, and is persuasive if the recommended action indeed maximizes the receiver’s a posteriori expected utility. More formally, in a direct signaling scheme $\pi=(\pi(a|\omega):\omega\in\Omega,a\in\mathcal{A})$, $\pi(a|\omega)$ denotes the probability of recommending action $a$ given realized outcome $\omega$. Upon receiving an action recommendation $a$, the receiver computes a posterior belief for $\omega$: $\Pr(\omega|a)=\frac{\mu(\omega)\pi(a|\omega)}{\sum_{\omega^{\prime}}\mu(\omega^{\prime})\pi(a|\omega^{\prime})}$. Thus, the action recommendation $a$ is persuasive if and only if $a$ maximizes the expected utility w.r.t. the posterior belief about $\omega$; i.e., $\sum_{\omega}\Pr(\omega|a)\cdot u(\omega,a)\geq\sum_{\omega}\Pr(\omega|a)\cdot u(\omega,a^{\prime})$ for any $a^{\prime}\in\mathcal{A}$. Equivalently, we define persuasiveness as

$$\text{Persuasiveness: }\quad\sum_{\omega\in\Omega}\mu(\omega)\pi(a|\omega)\cdot[u(\omega,a)-u(\omega,a^{\prime})]\geq 0,\quad\forall a,a^{\prime}\in\mathcal{A}.$$

Let $\mathcal{P}=\{\pi:\pi(\cdot|\omega)\in\Delta(\mathcal{A})\text{ for each }\omega\in\Omega\}$ denote the set of all signaling schemes. To emphasize that the definition of persuasiveness depends on the prior $\mu$, we denote the set of persuasive schemes on prior $\mu$ by

$$\operatorname{Pers}(\mu)\coloneqq\left\{\pi\in\mathcal{P}:\sum_{\omega\in\Omega}\mu(\omega)\pi(a|\omega)\left[u(\omega,a)-u\left(\omega,a^{\prime}\right)\right]\geq 0,\quad\forall a,a^{\prime}\in\mathcal{A}\right\}.$$

Given a persuasive signaling scheme $\pi\in\operatorname{Pers}(\mu)$, it is in the receiver’s best interest to take the recommended action, and thus the sender’s expected utility becomes $V(\mu,\pi)\coloneqq\sum_{\omega\in\Omega}\sum_{a\in\mathcal{A}}\mu(\omega)\pi(a|\omega)v(\omega,a).$

Therefore, given full knowledge of the persuasion instance, the sender can solve for an optimal persuasive signaling scheme that maximizes her expected utility through the following linear program (LP), which searches for a persuasive signaling scheme that maximizes $V(\mu,\pi)$ (see, e.g., [15] for details):

$$\text{Persuasion as an LP:}\qquad\operatorname{OPT}\left(\mu\right)\coloneqq\max_{\pi\in\operatorname{Pers}(\mu)}\quad V(\mu,\pi).$$
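Since both the objective and the persuasiveness constraints are linear in the variables $\pi(a|\omega)$, this LP can be handed to any off-the-shelf solver. Below is a minimal sketch for the finite-outcome case using `scipy.optimize.linprog`; the function name and array layout are our own illustrative choices, not notation from the paper.

```python
import numpy as np
from scipy.optimize import linprog

def optimal_signaling_scheme(mu, u, v):
    """Solve the single-shot persuasion LP: max_{pi in Pers(mu)} V(mu, pi).

    mu: prior over outcomes, shape (n_w,)
    u:  receiver utility, shape (n_w, n_a)
    v:  sender utility,   shape (n_w, n_a)
    Returns (pi, value) with pi[w, a] = pi(a | omega = w).
    """
    n_w, n_a = u.shape
    # Decision variables x[w * n_a + a] = pi(a | omega = w).
    # Objective: maximize sum_{w,a} mu[w] pi(a|w) v[w,a]  (linprog minimizes, so negate).
    c = -(mu[:, None] * v).reshape(-1)

    # Persuasiveness: for all a != a':  sum_w mu[w] pi(a|w) (u[w,a] - u[w,a']) >= 0.
    A_ub, b_ub = [], []
    for a in range(n_a):
        for a_prime in range(n_a):
            if a_prime == a:
                continue
            row = np.zeros(n_w * n_a)
            row[a::n_a] = -mu * (u[:, a] - u[:, a_prime])   # flipped sign for "<= 0" form
            A_ub.append(row)
            b_ub.append(0.0)

    # Each pi(.|w) must be a probability distribution over actions.
    A_eq = np.zeros((n_w, n_w * n_a))
    for w in range(n_w):
        A_eq[w, w * n_a:(w + 1) * n_a] = 1.0

    res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  A_eq=A_eq, b_eq=np.ones(n_w), bounds=(0, 1), method="highs")
    return res.x.reshape(n_w, n_a), -res.fun
```

The decision variables are the $|\Omega|\,|\mathcal{A}|$ entries of $\pi$, with $|\mathcal{A}|(|\mathcal{A}|-1)$ persuasiveness constraints and $|\Omega|$ simplex constraints, matching the LP described above.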

2.2 Basics of Reinforcement Learning and Markov Decision Processes

The Markov decision process (MDP) [41, 48] is a classic mathematical framework for the sequential decision making problem. In this work, we focus on the model of episodic MDPs. Specifically, at the beginning of the episode, the environment has an initial state $s_{1}$ (possibly picked by an adversary). Then, at each step $h\geq 1$, the agent takes some action $a_{h}\in\mathcal{A}$ to interact with the environment at state $s_{h}\in{\mathcal{S}}$. The state $s_{h}$ obeys a Markov property and thus captures all relevant information in the history $\{s_{i}\}_{i<h}$. Accordingly, the agent receives the utility $v_{h}(s_{h},a_{h})\in[0,1]$ and the system evolves to the state of the next step $s_{h+1}\sim P_{h}(\cdot|s_{h},a_{h})$. Such a process terminates after $h=H$, where $H$ is also known as the horizon length. Here, $\mathcal{A}$ is a finite set of available actions for the agent, and ${\mathcal{S}}$ is the (possibly infinite) set of MDP states. The utility function $v_{h}:{\mathcal{S}}\times\mathcal{A}\to[0,1]$ and transition kernel $P_{h}:{\mathcal{S}}\times\mathcal{A}\to\Delta({\mathcal{S}})$ may vary at each step. A policy of the agent $\pi_{h}:{\mathcal{S}}\to\Delta(\mathcal{A})$ characterizes her decision making process at step $h$: after observing the state $s$, the agent takes action $a$ with probability $\pi_{h}(a|s)$.

In an episodic MDP with $H$ steps, under policy $\bm{\pi}=\{\pi_{h}\}_{h\in[H]}$, we define the value function as the expected value of cumulative utilities starting from an arbitrary state,

$$V_{h}^{\pi}(s)\coloneqq\mathbb{E}_{P,\bm{\pi}}\bigg[\sum_{h^{\prime}=h}^{H}v_{h^{\prime}}(s_{h^{\prime}},a_{h^{\prime}})\bigg|s_{h}=s\bigg],\quad\forall s\in{\mathcal{S}},h\in[H].$$

Here $\mathbb{E}_{P,\bm{\pi}}$ means that the expectation is taken with respect to the trajectory $\{s_{h},a_{h}\}_{h\in[H]}$, which is generated by policy $\bm{\pi}$ under the transition model $P=\{P_{h}\}_{h\in[H]}$. Similarly, we define the action-value function as the expected value of cumulative utilities starting from an arbitrary state-action pair,

$$Q_{h}^{\pi}(s,a)\coloneqq v_{h}(s,a)+\mathbb{E}_{P,\bm{\pi}}\bigg[\sum_{h^{\prime}=h+1}^{H}v_{h^{\prime}}(s_{h^{\prime}},a_{h^{\prime}})\bigg|s_{h}=s,a_{h}=a\bigg],\quad\forall s\in{\mathcal{S}},a\in\mathcal{A},h\in[H].$$

The optimal policy is defined as $\bm{\pi}^{*}\coloneqq\arg\max_{\bm{\pi}}V_{1}^{\bm{\pi}}(s_{1})$, which maximizes the (expected) cumulative utility. Since the agent’s action affects both the immediate utility and the next state, which in turn influences future utility, careful planning is required to maximize total utility. Notably, $\bm{\pi}^{*}$ can be solved by dynamic programming based on the Bellman equation [7]. Specifically, with $V_{H+1}^{*}(s)=0$ and for each $h$ from $H$ to $1$, iteratively update $Q_{h}^{*}(s,a)=v_{h}(s,a)+\mathbb{E}_{s^{\prime}\sim P_{h}(\cdot|s,a)}V_{h+1}^{*}(s^{\prime})$ and $V_{h}^{*}(s)=\max_{a\in\mathcal{A}}Q_{h}^{*}(s,a)$, and determine the optimal policy $\bm{\pi}^{*}$ as the greedy policy with respect to $\{Q_{h}^{*}\}_{h\in[H]}$.
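For finite $\mathcal{S}$ and $\mathcal{A}$, this backward recursion is only a few lines of code. The sketch below (array shapes and names are our own) implements exactly the update above.

```python
import numpy as np

def finite_horizon_value_iteration(P, v):
    """Backward induction for an episodic MDP via the Bellman optimality equation.

    P: transitions, shape (H, S, A, S), P[h, s, a, s'] = P_h(s' | s, a)
    v: utilities,   shape (H, S, A)
    Returns (Q, V, pi) with the greedy optimal policy pi[h, s].
    """
    H, S, A, _ = P.shape
    Q = np.zeros((H, S, A))
    V = np.zeros((H + 1, S))             # V[H] = 0 is the terminal condition
    pi = np.zeros((H, S), dtype=int)
    for h in range(H - 1, -1, -1):
        Q[h] = v[h] + P[h] @ V[h + 1]    # Q_h(s,a) = v_h(s,a) + E_{s'}[V_{h+1}(s')]
        V[h] = Q[h].max(axis=1)
        pi[h] = Q[h].argmax(axis=1)      # greedy policy w.r.t. Q_h
    return Q, V[:H], pi
```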

In online reinforcement learning, the agent has no prior knowledge of the environment, namely $\{v_{h},P_{h}\}_{h\in[H]}$, but aims to learn the optimal policy by interacting with the environment for $T$ episodes. For each $t\in[T]$, at the beginning of the $t$-th episode, after observing the initial state $s_{1}^{t}$, the agent chooses a policy $\bm{\pi}^{t}$ based on the observations before the $t$-th episode. The discrepancy between $V^{\bm{\pi}^{t}}_{1}(s_{1}^{t})$ and $V^{*}_{1}(s_{1}^{t})$ serves as the suboptimality of the agent at the $t$-th episode. The performance of the online learning algorithm is measured by the expected regret, $\operatorname{Reg}(T)\coloneqq\sum_{t=1}^{T}[V^{*}_{1}(s_{1}^{t})-V^{\bm{\pi}^{t}}_{1}(s_{1}^{t})].$

3 Markov Persuasion Processes

This section introduces the Markov Persuasion Process (MPP), a novel model for sequential information design in Markovian environments. It notably captures the motivating yet intricate real-world problems in Section 1. Furthermore, our MPP model is readily applicable to generalized settings with large state spaces by incorporating function approximation techniques.

3.1 A Model of Markov Persuasion Processes (MPPs)

We start by abstracting the sequential information design problem instances in Section 1 into MPPs. Taking as an example a recommendation platform for ad keywords, we view the platform as the sender and the advertisers as the receivers. The advertisers decide the actions $a\in\mathcal{A}$, such as whether to accept the recommended keyword. To better reflect the nature of reality, we model two types of information for MPPs, outcome and state. We use the notion of outcome $\omega\in\Omega$ to characterize the sender’s private information when facing each receiver, such as the features of searchers for some keyword. The outcome follows a prior distribution, such as the general demographics of Internet users on the platform. The platform can thus leverage such fine-grained knowledge of keyword features, matched with the specific ad features of each advertiser, to persuade the advertisers to accept a recommendation of keywords. Meanwhile, we use the notion of state $s\in{\mathcal{S}}$ to characterize the Markovian state of the environment, e.g., the availability of ad keyword slots. The state is affected by the receiver’s action, as the availability changes after some keywords get bought. (Similarly, we can view the online shopping platform as the sender who persuades a stream of receivers (suppliers, consumers) to take a certain action, whether to take an offer or make a purchase. In this case, the sender can privately observe outcomes such as the consumer sentiment on some random products based on the search and click logs, whereas the states are product reviews, sales or shipping times, commonly known to the public and affected by the actions of both the supply and demand sides. In the case of ride-sharing, the outcome represents the fine-grained knowledge of currently active rider types that is privately known to the platform and generally stochastic in accordance with some user demographics, whereas the state captures the general driver supply or rider demand at locations, which is affected by the drivers’ decisions.) Naturally, both the sender’s and the receiver’s utilities are determined by the receiver’s action $a$ jointly with the state of the environment $s$ and the realized outcome $\omega$, i.e., $u,v:{\mathcal{S}}\times\Omega\times\mathcal{A}\to[0,1]$. Meanwhile, as these applications could serve thousands or millions of receivers every day, to reduce the complexity of our model we assume each receiver appears only once and thus myopically maximizes his utility at that particular step, whereas the sender is a system planner who aims to maximize her long-term accumulated expected utility.

More specifically, an MPP is built on top of a standard episodic MDP with state space ${\mathcal{S}}$, action space $\mathcal{A}$, and transition kernel $P$. In this paper, we restrict our attention to finite-horizon (i.e., episodic) MPPs with $H$ steps denoted by $[H]=\{1,\cdots,H\}$, and leave the study of infinite-horizon MPPs as an interesting future direction. At a high level, there are two major differences between MPPs and MDPs. First, in an MPP, the planner cannot directly take an action but instead can leverage its information advantage and “persuade” a receiver to take a desired action $a_{h}$ at each step $h\in[H]$. Second, in an MPP, the state transition is affected not only by the current action $a_{h}$ and state $s_{h}$, but also by the realized outcome $\omega_{h}$ of Nature’s probability distribution. Specifically, the state transition kernel at step $h$ is denoted as $P_{h}(s_{h+1}|s_{h},\omega_{h},a_{h})$. To capture the sender’s persuasion of a receiver to take actions at step $h$, we assume that a fresh receiver arrives at time $h$ with a prior $\mu_{h}$ over the outcome $\omega_{h}$. The planner, who is the sender here, can observe the realized outcome $\omega_{h}$ and would like to strategically reveal information about $\omega_{h}$ in order to persuade the receiver to take a certain action $a_{h}$.

Differing from classical single-shot information design, the immediate utility functions $u_{h},v_{h}$ for the receiver and sender vary not only at each step $h$ but also additionally depend on the commonly observed state $s_{h}$ of the environment. We assume the receiver has full knowledge of his utility $u_{h}$ and prior $\mu_{h}$ at each step $h$, and takes the recommended action $a_{h}$ if and only if $a_{h}$ maximizes his expected utility under the posterior for $\omega_{h}$. (This assumption is not essential but is made for technical rigor. Even if receivers have limited knowledge or computational power to accurately determine the utility-maximizing actions, the sender has sufficient ethical or legal reasons to comply with the persuasiveness constraints in practice, and the receivers would only take the recommendation if the platform has a good reputation, i.e., is persuasive with high probability.)

Formally, an MPP with a horizon length $H$ proceeds as follows at each step $h\in[H]$:

1. A fresh receiver with prior distribution $\mu_{h}\in\Delta(\Omega)$ and utility $u_{h}:{\mathcal{S}}\times\Omega\times\mathcal{A}\to[0,1]$ arrives.
2. The sender commits to a persuasive signaling policy $\pi_{h}:{\mathcal{S}}\to\mathcal{P}$, which is public knowledge.
3. After observing the realized state $s_{h}$ and outcome $\omega_{h}$, the sender accordingly recommends the receiver to take an action $a_{h}\sim\pi_{h}(\cdot|s_{h},\omega_{h})$.
4. Given the recommended action $a_{h}$, the receiver takes an action $a^{\prime}_{h}$, receives utility $u_{h}(s_{h},\omega_{h},a^{\prime}_{h})$ and then leaves the system. Meanwhile, the sender receives utility $v_{h}(s_{h},\omega_{h},a^{\prime}_{h})$.
5. The next state $s_{h+1}\sim P_{h}(\cdot|s_{h},\omega_{h},a^{\prime}_{h})$ is generated according to $P_{h}:{\mathcal{S}}\times\Omega\times\mathcal{A}\to\Delta({\mathcal{S}})$, the state transition kernel at the $h$-th step.
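For the tabular case, the five-step protocol above can be simulated directly. The sketch below is our own illustration (names, array shapes, and the tie-breaking rule are assumptions, not the paper's code); it makes explicit that the receiver only follows the recommendation when the induced posterior makes it a best response.

```python
import numpy as np

def run_mpp_episode(rng, s1, priors, pis, u, v, P):
    """Simulate one H-step MPP episode (steps 1-5 above) in the tabular case.

    priors[h]: (W,) prior mu_h;  pis[h]: (S, W, A) policy pi_h(a | s, omega)
    u[h], v[h]: (S, W, A) receiver / sender utilities;  P[h]: (S, W, A, S) transitions.
    """
    H = len(pis)
    s, sender_total = s1, 0.0
    for h in range(H):
        w = rng.choice(len(priors[h]), p=priors[h])          # Nature draws omega_h
        a_rec = rng.choice(pis[h].shape[2], p=pis[h][s, w])  # sender recommends a_h
        # Myopic receiver: posterior over omega given the recommendation, then a best
        # response (ties broken in favor of the recommendation, as the model assumes).
        post = priors[h] * pis[h][s, :, a_rec]
        expected_u = post @ u[h][s]
        a = a_rec if expected_u[a_rec] >= expected_u.max() - 1e-12 else int(np.argmax(expected_u))
        sender_total += v[h][s, w, a]
        s = rng.choice(P[h].shape[3], p=P[h][s, w, a])       # next state s_{h+1}
    return sender_total
```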

Here we coin the notion of a signaling policy $\pi_{h}$ as a mapping from a state to a signaling scheme at the $h$-th step. It captures a possibly multi-step procedure in which the sender commits to a signaling scheme after observing the realized state and then samples a signal after observing the realized outcome. For notational convenience, we denote by $\pi(a|s,\omega)$ the probability of recommending action $a$ given state $s$ and realized outcome $\omega$. We can also generalize the notion of persuasiveness from classic information design to MPPs. Specifically, we define $\operatorname{Pers}(\mu,u)$ as the persuasive set that contains all signaling policies that are persuasive to the receiver with utility $u$ and prior $\mu$ for every possible state $s\in{\mathcal{S}}$:

$$\operatorname{Pers}(\mu,u)\coloneqq\bigg\{\pi:{\mathcal{S}}\to\mathcal{P}:\int_{\omega\in\Omega}\mu(\omega)\pi(a|s,\omega)\big[u(s,\omega,a)-u(s,\omega,a^{\prime})\big]{\mathrm{d}}\omega\geq 0,\quad\forall a,a^{\prime}\in\mathcal{A},s\in{\mathcal{S}}\bigg\}.$$

Recall that $\mathcal{P}$ consists of all mappings from $\Omega$ to $\Delta(\mathcal{A})$. As such, the sender’s persuasive signaling policy $\pi_{h}\in\operatorname{Pers}(\mu_{h},u_{h})$ is essentially a stochastic policy as defined in standard MDPs ($\pi_{h}$ maps a state $s_{h}$ to a stochastic action $a_{h}$), except that here the probability of suggesting action $a_{h}$ by policy $\pi_{h}$ depends additionally on the realized outcome $\omega_{h}$ that is only known to the sender.

We say $\bm{\pi}\coloneqq\{\pi_{h}\}_{h\in[H]}$ is a feasible policy of the MPP if $\pi_{h}\in\operatorname{Pers}(\mu_{h},u_{h})$ for all $h\in[H]$, because the state transition trajectory would otherwise be infeasible if the receiver is not guaranteed to take the recommended action, i.e., $a^{\prime}_{h}\neq a_{h}$. We denote the set of all feasible policies as $\mathcal{P}^{H}\coloneqq\prod_{h\in[H]}\operatorname{Pers}(\mu_{h},u_{h})$.

3.2 MPPs: the Generalized Version with Contexts and Linear Parameterization

To provide a broadly useful modeling concept, we also study a generalized setting of the Markov Persuasion Process with contextual prior and a possibly large space of states, outcomes and contexts.

Contextual Prior.

At the beginning of each episode, a sequence of contexts $C=\{c_{h}\in\mathcal{C}\}_{h\in[H]}$ is realized by Nature and becomes public knowledge. We allow the prior $\mu_{h}$ to be influenced by the context $c_{h}$ at each step $h$, and thus denote it by $\mu_{h}(\cdot|c_{h})$. Specifically, the contextual information is able to model uncertainty such as the varying demographics of the active user group affected by events (e.g., scheduled concerts or sports games in ride-sharing) at different times of the day. (In the case of the online shopping platform, the prior of consumer interests may be affected by different holidays or seasons at different times of the year.) Here we allow the sequence of contexts to be adversarially generated.

Linear Parameterization.

We also relax the state, context and outcome spaces ${\mathcal{S}},\mathcal{C},\Omega$ to be continuous, and additionally assume that the transition kernels and utility functions are linear, and that the conditional priors of outcomes are generalized linear models (GLMs) of the context at each step. More formally, for each step $h\in[H]$, our linearity condition assumes:

  • The sender’s utility is $v_{h}\coloneqq v_{h}^{*}(s_{h},\omega_{h},a_{h})=\psi(s_{h},\omega_{h},a_{h})^{\top}\gamma_{h}^{*}$, where (1) $\psi(\cdot,\cdot,\cdot)\in\mathbb{R}^{d_{\psi}}$ is a known feature vector; (2) $\gamma_{h}^{*}\in\mathbb{R}^{d_{\psi}}$ is the unknown linear parameter at step $h$.

  • The next state $s_{h+1}$ is drawn from the distribution $P_{M,h}(\cdot|s_{h},\omega_{h},a_{h})=\psi(s_{h},\omega_{h},a_{h})^{\top}M_{h}(\cdot)$, where $M_{h}=(M_{h}^{(1)},M_{h}^{(2)},\dots,M_{h}^{(d_{\psi})})$ is a vector of $d_{\psi}$ unknown measures over ${\mathcal{S}}$ at step $h$.

  • The outcome $\omega_{h}\in\mathbb{R}$ is subject to a generalized linear model (GLM), which models a wider range of hypothesis function classes. (We note that the GLM is a strict generalization of the linear model assumption that we make for the transition kernel $P$. We could extend the distribution of $P$ to a GLM using techniques similar to those in Wang et al. [52], but we omit such an extension for simplicity, since it is not the primary focus of our work.) Given the context $c_{h}$, there exists a link function $f:\mathbb{R}\to\mathbb{R}$ such that $\omega_{h}=f(\phi(c_{h})^{\top}\theta^{*}_{h})+z_{h}$, where $\phi(\cdot)\in\mathbb{R}^{d_{\phi}}$ is a feature vector and $\theta_{h}^{*}\in\mathbb{R}^{d_{\phi}}$ is an unknown parameter. The noises $\{z_{h}\}_{h\in[H]}$ are independent $\sigma$-sub-Gaussian variables with zero mean. We denote the prior of $\omega_{h}$ with parameter $\theta$ at context $c$ as $\mu_{\theta}(\cdot|c)$.

Without loss of generality, we assume that there exist $\Phi,\Psi$ such that $\|\phi(c)\|\leq\Phi$ and $\|\psi(s,\omega,a)\|\leq\Psi$ for all $s\in{\mathcal{S}}$, $\omega\in\Omega$, $c\in\mathcal{C}$ and $a\in\mathcal{A}$. (For simplicity of notation, we omit the subscript of the norm whenever it is an $L_{2}$ norm in this paper.) We also assume that $\|\theta_{h}^{*}\|\leq L_{\theta}$, $\|\gamma_{h}^{*}\|\leq L_{\gamma}$, $\|M_{h}^{*}\|\leq L_{M}$, $|\mathcal{A}|\geq 2$, $|\Omega|\geq 2$. Such regularity conditions are common in the RL literature.
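To make the parameterization concrete, the following toy instantiation draws one step of such a linear MPP with a finite state set. Everything here is a synthetic assumption of ours (the feature maps, the tanh link, and all parameter values are illustrative, not the paper's construction).

```python
import numpy as np

rng = np.random.default_rng(0)
d_psi, d_phi, S, A = 4, 3, 5, 2          # illustrative sizes

def psi(s, w, a):
    """Hypothetical bounded, nonnegative feature map psi(s, omega, a) in R^{d_psi}."""
    x = np.array([1.0, s / S, (np.tanh(w) + 1) / 2, a / A])
    return x / np.linalg.norm(x)

def phi(c):
    """Hypothetical bounded context feature map phi(c) in R^{d_phi}."""
    x = np.array([1.0, np.cos(c), np.sin(c)])
    return x / np.linalg.norm(x)

theta_star = rng.uniform(-1, 1, d_phi)           # GLM prior parameter (unknown to the learner)
gamma_star = rng.uniform(0, 0.5, d_psi)          # sender utility parameter
M_star = rng.dirichlet(np.ones(S), size=d_psi)   # d_psi measures over S (rows sum to one)

def step_model(s, c, a, sigma=0.1):
    """One step of this toy linear MPP: GLM outcome, linear utility, linear transition."""
    w = np.tanh(phi(c) @ theta_star) + sigma * rng.standard_normal()  # omega = f(phi(c)^T theta*) + z
    feat = psi(s, w, a)
    v = feat @ gamma_star                         # v_h = psi^T gamma*, lies in [0, 1] by construction
    p_next = feat @ M_star                        # unnormalized psi^T M_h(.)
    s_next = rng.choice(S, p=p_next / p_next.sum())  # renormalized for this toy parameterization
    return w, v, s_next
```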

3.3 Optimal Signaling Policy in MPPs

In order to maximize the sender’s utility, we study the optimal policy in MPPs, in analogy to that of standard MDPs. We start by considering the value of any feasible policy $\bm{\pi}$. For each step $h\in[H]$, we define the value function for the sender $V_{h}^{\pi}:{\mathcal{S}}\to\mathbb{R}$ as the expected value of cumulative utilities under $\bm{\pi}$ when starting from an arbitrary state at the $h$-th step. That is, for any $s\in{\mathcal{S}},h\in[H]$, we define

$$V_{h}^{\pi}(s)\coloneqq\mathbb{E}_{P,\mu,\pi}\bigg[\sum_{h^{\prime}=h}^{H}v_{h^{\prime}}\big(s_{h^{\prime}},\omega_{h^{\prime}},a_{h^{\prime}}\big)\bigg|s_{h}=s\bigg],$$

where the expectation $\mathbb{E}_{P,\mu,\pi}$ is taken with respect to the randomness of the trajectory (i.e., the randomness of the state transition), the realized outcomes and the stochasticity of $\bm{\pi}$. Accordingly, we define the $Q$-function (action-value function) $Q_{h}^{\pi}:{\mathcal{S}}\times\Omega\times\mathcal{A}\to\mathbb{R}$, which gives the expected value of cumulative utilities when starting from an arbitrary state-outcome-action tuple at the $h$-th step and following the signaling policy $\bm{\pi}$, that is,

$$Q_{h}^{\pi}(s,\omega,a)\coloneqq v_{h}(s,\omega,a)+\mathbb{E}_{P,\mu,\pi}\bigg[\sum_{h^{\prime}=h+1}^{H}v_{h^{\prime}}\big(s_{h^{\prime}},\omega_{h^{\prime}},a_{h^{\prime}}\big)\bigg|s_{h}=s,\omega_{h}=\omega,a_{h}=a\bigg].$$

By definition, $Q_{h}^{\pi}(\cdot,\cdot,\cdot),V_{h}^{\pi}(\cdot)\in[0,H-h+1]$, since $v_{h}(\cdot,\cdot,\cdot)\in[0,1]$. To simplify notation, for any $Q$-function $Q_{h}$ and any distributions $\mu_{h}$ and $\pi_{h}$ over $\Omega$ and $\mathcal{A}$, we additionally denote

$$\big\langle Q_{h},\mu_{h}\otimes\pi_{h}\big\rangle_{\Omega\times\mathcal{A}}(s)\coloneqq\mathbb{E}_{\omega\sim\mu_{h},a\sim\pi_{h}(\cdot|s,\omega)}\left[Q_{h}(s,\omega,a)\right].$$

Using this notation, the Bellman equation associated with signaling policy $\bm{\pi}$ becomes

$$Q_{h}^{\pi}(s,\omega,a)=(v_{h}+P_{h}V_{h+1}^{\pi})(s,\omega,a),\quad V_{h}^{\pi}(s)=\big\langle Q_{h}^{\pi},\mu_{h}\otimes\pi_{h}\big\rangle_{\Omega\times\mathcal{A}}(s),\quad V_{H+1}^{\pi}(s)=0,\qquad(3.1)$$

which holds for all $s\in\mathcal{S},\omega\in\Omega,a\in\mathcal{A}$. Similarly, the Bellman optimality equation is

$$Q_{h}^{*}(s,\omega,a)=(v_{h}+P_{h}V_{h+1}^{*})(s,\omega,a),\quad V_{h}^{*}(s)=\max_{\pi_{h}^{\prime}\in\operatorname{Pers}(\mu_{h},u_{h})}\big\langle Q_{h}^{*},\mu_{h}\otimes\pi_{h}^{\prime}\big\rangle_{\Omega\times\mathcal{A}}(s),\quad V_{H+1}^{*}(s)=0.\qquad(3.2)$$

We remark that the above equations implicitly assume the context $C=\{c_{h}\}_{h\in[H]}$ (and thus the priors) is determined in advance. To emphasize the values’ dependence on the context, which will be useful for the analysis of the later learning algorithms, we extend the notation to $V^{\pi}_{h}(s;C),Q_{h}^{\pi}(s,\omega,a;C)$ to specify under which sequence of contexts $C$ (and thus which conditional prior $\mu$) the value (resp. $Q$) function is evaluated.

A Note on Computational Efficiency.

We note that the above Bellman optimality equation in (3.2) also implies an efficient dynamic program to compute the optimal policy $\bm{\pi}^{*}$ in the basic tabular model of MPP in Subsection 3.1, i.e., when $s\in\mathcal{S},\omega\in\Omega,a\in\mathcal{A}$ are all discrete. This is because the maximization problem in equation (3.2) can be solved efficiently by a linear program. The generalized MPP of Subsection 3.2 poses some computational challenges due to infinitely many outcomes and states. Fortunately, it is already known that planning in infinite-state MDPs with linear function approximation can also be done efficiently [25]. Following a similar analysis, we can determine $Q_{h}^{*}(\cdot,\cdot,\cdot)$ through a linear function of $q_{h}^{*}\in\mathbb{R}^{d_{\psi}}$ with the observed feature $\psi(\cdot,\cdot,\cdot)$. Hence, the dominating operation is to compute $\max_{\pi_{h}\in\operatorname{Pers}(\mu_{h},u_{h})}\langle Q_{h}^{*},\mu_{h}\otimes\pi_{h}\rangle_{\Omega\times\mathcal{A}}(s)$ at each step. Viewing $Q_{h}^{*}$ as the sender’s utility function, such an optimization is exactly the problem of optimal information design with infinitely many outcomes but finitely many actions, which has been studied in previous work [15]. It turns out that there is an efficient algorithm that can signal on the fly for any given outcome $\omega$ and obtains an $\epsilon$-optimal persuasive signaling scheme in $\operatorname{poly}(1/\epsilon)$ time. Therefore, in our later studies of learning, we will take these algorithms as given and simply assume that we can compute the optimal signaling scheme efficiently at any given state $s$. One caveat is that our regret guarantee will additionally lose an additive $\epsilon$ factor at each step due to the availability of only an $\epsilon$-optimal algorithm, but this loss becomes negligible when we set $\epsilon=O(1/(TH))$ by using a $\operatorname{poly}(TH)$ time algorithm.
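In the tabular case, the dynamic program is simply backward induction with one persuasion LP per state and step. The sketch below is our own illustrative code; it reuses the `optimal_signaling_scheme` helper sketched in Section 2.1 and passes $Q_{h}^{*}(s,\cdot,\cdot)$ in place of the sender's one-step utility.

```python
import numpy as np

def plan_tabular_mpp(mu, u, v, P):
    """Backward induction over the Bellman optimality equation (3.2), tabular MPP.

    mu[h]: (W,) prior; u[h], v[h]: (S, W, A) utilities; P[h]: (S, W, A, S) transitions.
    Assumes the optimal_signaling_scheme LP helper from the Section 2.1 sketch is in scope.
    """
    H = len(mu)
    S, W, A = u[0].shape
    V_next = np.zeros(S)                       # V_{H+1}^* = 0
    policies = [None] * H
    for h in range(H - 1, -1, -1):
        Q = v[h] + P[h] @ V_next               # Q_h^*(s,w,a) = v_h + E_{s' ~ P_h}[V_{h+1}^*(s')]
        V = np.zeros(S)
        pi_h = np.zeros((S, W, A))
        for s in range(S):
            # V_h^*(s) = max over persuasive schemes of <Q_h^*, mu_h x pi_h>(s): one LP per state
            pi_h[s], V[s] = optimal_signaling_scheme(mu[h], u[h][s], Q[s])
        policies[h], V_next = pi_h, V
    return policies, V_next                    # V_next now holds V_1^*(.)
```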

4 Reinforcement Learning in MPPs and the Optimism-Pessimism Principle

In this section, we study online reinforcement learning (RL) for learning the optimal signaling policy in an MPP. Here the learner only knows the utility functions of the receivers (the receivers’ utility is known to the sender because the pricing rules are usually transparent, and some are even set by the platform; for example, a ride-sharing platform usually sets per-hour or per-mile payment rules for the drivers) and has no prior knowledge about the prior distribution, the sender’s utility function, or the transition kernel. While the computation of the optimal policy in MPPs in Section 3.3 may appear analogous to that of a standard MDP, as we will see, the corresponding RL problem turns out to be significantly different, partially due to the presence of the stream of receivers, whose decisions are self-interested and not under the learner’s control. This makes learning challenging because, if the receivers’ incentives are not carefully addressed, they may take actions that are extremely undesirable to the learner. This concern leads to the integration of the pessimism principle into our learning algorithm design. Specifically, our learner will be optimistic in its estimation of the $Q$-function, similar to many other RL algorithms, in order to encourage exploration. But more interestingly, it will be pessimistic about the uncertainty in the estimation of the prior distributions in order to prepare for detrimental equilibrium behavior. These dual considerations lead to an interesting optimism-pessimism principle (OPP) for learning MPPs under the online setting. From a technical point of view, our main contribution is to prove how this mixture of the optimism and pessimism principles can still lead to no-regret algorithms, and this proof crucially hinges on a robustness property of the MPP model which we develop and carefully apply in the regret analysis. To the best of our knowledge, this is the first time that OPP is employed to learn the optimal information design in an online fashion. We prove that it can not only satisfy the incentive constraints but also guarantee efficiency in terms of both sample complexity and computational complexity.

In order to convey our key design ideas before diving into the intricate technicalities, this section singles out two representative special cases of the online sequential information design problem. In a nutshell, we present a learning algorithm, OP4, that combines the principles of optimism and pessimism such that the sender can learn to persuade without initially knowing her own utility or the prior distribution of outcomes. In the tabular MPP, we illustrate the unique challenges of learning to persuade arising from the environment state dynamically evolving according to a Markov process. Through contextual Bayesian persuasion, we showcase the techniques necessary for learning to persuade with infinitely many states (i.e., contexts) and outcomes. We omit most proofs in this section to focus on the high-level ideas, because the proof for the general setting presented in Section 5 suffices to imply all results for the two special cases here.

4.1 Learning Optimal Policies in MPPs: Setups and Benchmarks

We consider the episodic reinforcement learning problem in finite-horizon MPPs. Different from the full-knowledge setting in Section 3.3, the transition kernel, the sender’s utility function and the outcome prior at each step of the episode, $\{P_{h},v_{h},\mu_{h}\}_{h\in[H]}$, are all unknown. The sender has to learn the optimal signaling policy by interacting with the environment as well as a stream of receivers over $T$ episodes. For each $t\in[T]=\{1,\cdots,T\}$, at the beginning of the $t$-th episode, given the data $\{(c_{h}^{\tau},s_{h}^{\tau},\omega_{h}^{\tau},a_{h}^{\tau},v_{h}^{\tau})\}_{h\in[H],\tau\in[t-1]}$, the adversary picks the context sequence $\{c_{h}^{t}\}_{h\in[H]}$ as well as the initial state $s_{1}^{t}$, and the agent accordingly chooses a signaling policy $\bm{\pi}^{t}=\{\pi_{h}^{t}\}_{h\in[H]}$. Here $v_{h}^{\tau}$ is the utility collected by the sender at step $h$ of episode $\tau$.

Regret

To evaluate the online learning performance, given the ground-truth outcome priors $\bm{\mu}^{*}=\{\mu_{h}^{*}\}_{h\in[H]}$, we define the sender’s total (expected) regret over all $T$ episodes as

$$\operatorname{Reg}(T,\bm{\mu}^{*})\coloneqq\sum_{t=1}^{T}\left[V^{*}_{1}(s_{1}^{t};C^{t})-V^{\bm{\pi}^{t}}_{1}(s_{1}^{t};C^{t})\right].\qquad(4.1)$$

Note that $\bm{\pi}^{t}$ may not always be feasible under $\bm{\mu}^{*}$ but only persuasive with high probability, so the corresponding regret under $\bm{\pi}^{t}$ should also be understood in the high-probability sense.

It turns out that in certain degenerate cases it is impossible to achieve a sublinear regret. For example, if the set of possible posterior outcome distributions that induce some $a\in\mathcal{A}$ as the optimal receiver action has zero measure, then such a posterior within a zero-measure set can never be exactly induced by a signaling scheme without precise knowledge of the prior. Thus, the regret could be $\Omega(T)$ if the receiver cannot be persuaded to play such an action $a$. Therefore, to guarantee no regret, it is necessary to introduce a certain regularity assumption on the MPP instance. Towards that end, we shall assume that the receivers’ utility $u$ and prior $\mu$ at any step of the MPP instance always satisfy the minor assumption of $(p_{0},D)$-regularity defined below.

Regularity Conditions

An instance satisfies $(p_{0},D)$-regularity if, for any feasible state $s\in{\mathcal{S}}$ and context $c\in\mathcal{C}$, we have

$$\mathbb{P}_{\omega\sim\mu(\cdot|c)}\left[\omega\in\mathcal{W}_{s,a}(D)\right]\geq p_{0},\quad\forall a\in\mathcal{A},$$

where $\mu$ is the ground-truth prior of outcomes and $\mathcal{W}_{s,a}(D)\triangleq\{\omega:u(s,\omega,a)-u(s,\omega,a^{\prime})\geq D,\forall a^{\prime}\in\mathcal{A}\setminus\{a\}\}$ is the set of outcomes $\omega$ for which the action $a$ is optimal for the receiver by a margin of at least $D$ at state $s$. In other words, an instance is $(p_{0},D)$-regular if every action $a$ has probability at least $p_{0}$, under the randomness of the outcome, of being strictly better than all other actions by at least $D$. This regularity condition is analogous to a regularity condition of Zu et al. [61] but generalizes to the infinitely many outcomes we consider here.
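For a finite-outcome instance, this condition is straightforward to verify numerically. The helper below is our own illustrative check, not part of the paper's algorithms.

```python
import numpy as np

def check_regularity(mu, u, p0, D):
    """Check (p0, D)-regularity of a tabular instance at every state.

    mu: (W,) prior;  u: (S, W, A) receiver utility.
    Returns True iff, for every state s and action a, the outcomes where a beats
    every other action by at least D carry prior mass at least p0.
    """
    S, W, A = u.shape
    for s in range(S):
        for a in range(A):
            margin = u[s, :, a, None] - np.delete(u[s], a, axis=1)  # (W, A-1) utility gaps
            mass = mu[(margin >= D).all(axis=1)].sum()              # prior mass of W_{s,a}(D)
            if mass < p0:
                return False
    return True
```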

4.2 Algorithm: Optimism-Pessimism Principle for Persuasion Process (OP4)

The learning task in MPPs involves two intertwined challenges: (1) How to persuade the receiver to take desired actions under an unknown $\mu_{h}$? (2) Which action to persuade the receiver to take in order to explore the underlying environment? For the first challenge, due to having only finite data, it is impossible to perfectly recover $\mu_{h}$. We can only hope to construct an approximately accurate estimator of $\mu_{h}$. To guard against potentially detrimental equilibrium behavior of the receivers due to the prior estimation error, we propose to adopt the pessimism principle. Specifically, before each episode, we conduct uncertainty quantification for the estimator of the prior distributions, which enables us to construct a confidence region containing the true prior with high probability. Then we propose to find the signaling policy within a pessimistic candidate set: signaling policies that are simultaneously persuasive with respect to all prior distributions in the confidence region. When the confidence region is valid, such a pessimism principle ensures that the executed signaling policy is always persuasive with respect to the true prior. Furthermore, to address the second challenge, we adopt the celebrated principle of optimism in the face of uncertainty [28], which has played a key role in the online RL literature. The main idea of this principle is that the uncertainty of the $Q$-function estimates essentially reflects our uncertainty about the underlying model. By adding the uncertainty as a bonus function, we encourage actions with high uncertainty to be recommended, and thus taken by the receiver when persuasiveness is satisfied. We then fuse the two principles into the OP4 algorithm in Algorithm 1.

Algorithm 1 OP4 Overview
1:  for episode $t=1,\dots,T$ do
2:     Receive the initial state $s_{1}^{t}$ and context $C^{t}=\{c_{h}^{t}\}_{h=1}^{H}$.
3:     For each step $h\in[H]$, estimate the prior $\mu_{h}^{t}$ along with the confidence region $\mu_{\mathcal{B}_{h}^{t}}$, and construct an optimistic $Q$-function $Q_{h}^{t}$ iteratively with the value function $V_{h}^{t}$.
4:     for step $h=1,\ldots,H$ do
5:        Choose the robust signaling scheme $\pi_{h}^{t}\in\arg\max_{\pi_{h}\in\operatorname{Pers}(\mu_{\mathcal{B}_{h}^{t}},u_{h})}\big\langle Q_{h}^{t},\mu_{h}^{t}\otimes\pi_{h}\big\rangle_{\Omega\times\mathcal{A}}(s_{h}^{t};C^{t})$.
6:        Observe the state $s_{h}^{t}$ and outcome $\omega_{h}^{t}$, and accordingly recommend action $a_{h}^{t}\sim\pi^{t}_{h}(\cdot|s_{h}^{t},\omega_{h}^{t})$ to the receiver.
7:     end for
8:  end for

Pessimism to Induce Robust Equilibrium Behavior

From the data in past episodes, the sender can estimate the mean of the prior as well as obtain a confidence region through concentration inequalities. Given this partial knowledge of the prior distribution, the sender needs to design a signaling scheme that works in the face of any possible prior in the confidence region, in order to ensure the receiver will take the recommended action with high probability. Specifically, we let ${\mathrm{B}}_{\Sigma}(\theta,\beta)\coloneqq\{\theta^{\prime}:\|\theta^{\prime}-\theta\|_{\Sigma}\leq\beta\}$ denote the closed ball in the $\left\|\cdot\right\|_{\Sigma}$ norm of radius $\beta>0$ centered at $\theta\in\mathbb{R}^{d_{\theta}}$. For any set $\mathcal{B}\subseteq\mathbb{R}^{d_{\theta}}$, we let $\operatorname{Pers}(\mu_{\mathcal{B}},u)$ denote the set of signaling policies that are simultaneously persuasive under all weight vectors $\theta\in\mathcal{B}$: $\operatorname{Pers}(\mu_{\mathcal{B}},u)\coloneqq\bigcap_{\theta\in\mathcal{B}}\operatorname{Pers}(\mu_{\theta},u)$. For any non-empty set $\mathcal{B}$, the set $\operatorname{Pers}(\mu_{\mathcal{B}},u)$ is convex, since it is an intersection of the convex sets $\operatorname{Pers}(\mu_{\theta},u)$, and is non-empty, since it must contain the full-information signaling scheme. We note that since $\operatorname{Pers}(\mu_{\mathcal{B}},u)$ is a convex set, we can solve a linear optimization over the policies in $\operatorname{Pers}(\mu_{\mathcal{B}},u)$ in polynomial time (see, e.g., [61]).
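As a concrete illustration of this robustification in the finite-outcome case, the sketch below (our own code, with the confidence region replaced by a finite set of candidate priors, e.g. its extreme points; this finite approximation is our simplification, not the paper's exact construction) maximizes the optimistic objective over schemes that are persuasive under every candidate prior.

```python
import numpy as np
from scipy.optimize import linprog

def robust_signaling_scheme(mu_hat, candidate_priors, u_s, Q_s):
    """Sketch of step 5 of OP4 at a single state (tabular case).

    mu_hat: (W,) estimated prior;  candidate_priors: non-empty list of (W,) priors;
    u_s: (W, A) receiver utility at state s;  Q_s: (W, A) optimistic Q-values at state s.
    Maximizes <Q, mu_hat x pi>(s) over pi persuasive under every candidate prior.
    """
    W, A = u_s.shape
    c = -(mu_hat[:, None] * Q_s).reshape(-1)             # maximize sum mu_hat pi Q
    A_ub, b_ub = [], []
    for mu in candidate_priors:                           # persuasiveness under each prior
        for a in range(A):
            for a_prime in range(A):
                if a_prime == a:
                    continue
                row = np.zeros(W * A)
                row[a::A] = -mu * (u_s[:, a] - u_s[:, a_prime])
                A_ub.append(row)
                b_ub.append(0.0)
    A_eq = np.kron(np.eye(W), np.ones(A))                 # each pi(.|w) sums to one
    res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  A_eq=A_eq, b_eq=np.ones(W), bounds=(0, 1), method="highs")
    return res.x.reshape(W, A), -res.fun
```

The program is always feasible because full information revelation (each row of $\pi$ concentrated on the receiver's best action for that outcome) satisfies every persuasiveness constraint.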

Optimism to Encourage Exploration

In order to balance exploration and exploitation, we adopt the principle of optimism in the face of uncertainty in the value iteration algorithm based on the Bellman equation, following a line of work in online RL such as $Q$-learning with UCB exploration [24], UCBVI [4], and LSVI-UCB [25] (also see [50, 56, 51] and the references therein). The additional UCB bonus on the $Q$-value encourages exploration and has been shown to be a provably efficient online method to improve policies in MDPs. Moreover, this method not only works for the simple tabular setting, but also generalizes to settings with infinite state spaces by exploiting linearity of the $Q$-function and a regularized least-squares program to determine the optimal estimate of the $Q$-value. In fact, within our framework, we could obtain efficient learning results in the infinite-state-space setting through other optimism-based online RL methods and general function approximators, such as linear mixture MDPs [3, 60, 9, 59], kernel approximation [56], or bilinear classes [13].

To provide a concrete picture of the learning process, we instantiate the OP4 algorithm in two special cases and showcase our key ideas and techniques before delving into the more involved analysis of the generalized MPP setting. Nevertheless, we remark that whether the problem instance is tabular or in the form of linear or generalized linear approximations is not essential and not the focus of our study. OP4 itself only relies on two things, i.e., the uncertainty quantification for the $Q$-function and for the prior estimation. So even the model-free RL framework can be replaced by model-based RL, as we can just construct confidence regions for the transition models.

4.3 Warm-up I: Reinforcement Learning in the Tabular MPP

We first consider MPPs in the tabular setting with finite states and outcomes, as described in Section 3.1. In this case, the prior on outcomes at each step degenerates to an unknown but fixed discrete distribution independent of the context. As linear parameterization is not required for a discrete probability distribution, the algorithm can simply update the empirical estimate of $\mu_{h}^{t}$ through counting. Similarly, the transition kernel $P_{h}^{*}$ is estimated through the occurrence of observed samples, and we use this estimated transition to compute the $Q$-function $\widehat{Q}_{h}^{t}$ from the observed utility and the estimated value function at the next step, according to the Bellman equation. To be specific, for each $s\in{\mathcal{S}},\omega\in\Omega,a\in\mathcal{A}$, $\mu_{h}^{t}$ and $\widehat{Q}_{h}^{t}$ are estimated through the following equations:

$$\mu_{h}^{t}(\omega)\leftarrow\frac{\lambda/|\Omega|+N_{t,h}(\omega)}{\lambda+t-1},$$
$$\widehat{Q}_{h}^{t}(s,\omega,a)\leftarrow\frac{1}{\lambda+N_{t,h}(s,\omega,a)}\sum_{\tau\in[t-1]}\big\{\mathbb{I}(s_{h}^{\tau}=s,\omega_{h}^{\tau}=\omega,a_{h}^{\tau}=a)\big[v_{h}^{\tau}+V_{h+1}^{t}(s_{h+1}^{\tau})\big]\big\},$$

where $N_{t,h}(\omega)=\sum_{\tau\in[t-1]}\mathbb{I}(\omega_{h}^{\tau}=\omega)$ and $N_{t,h}(s,\omega,a)=\sum_{\tau\in[t-1]}\mathbb{I}(s_{h}^{\tau}=s,\omega_{h}^{\tau}=\omega,a_{h}^{\tau}=a)$ respectively count the effective number of samples in which the sender has observed the outcome $\omega$, or the combination $(s,\omega,a)$, and $\lambda>0$ is a constant for regularization.

In our learning algorithm, we determine the radius of the confidence region $\mathcal{B}_{h}^{t}$ for $\mu_{h}^{t}$ according to the confidence bound $\epsilon_{h}^{t}=O(\sqrt{\log(HT)/t})$. Moreover, we add a UCB bonus term of the form $\rho/\sqrt{N_{t,h}(s,\omega,a)}$ to $\widehat{Q}_{h}^{t}$ to obtain the optimistic $Q$-function $Q_{h}^{t}$. Then, the algorithm selects a signaling scheme that is robustly persuasive and maximizes an optimistic estimate of the $Q$-function with respect to the current prior estimate $\mu_{h}^{t}$. Finally, it makes an action recommendation $a_{h}^{t}$ using this signaling scheme, given the state and outcome realization $\{s_{h}^{t},\omega_{h}^{t}\}$.
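A minimal sketch of these counting-based estimates at a fixed step $h$ is given below; the constants in the confidence radius and the bonus, and the function interface, are illustrative choices of ours rather than the exact quantities used in the analysis.

```python
import numpy as np

def tabular_estimates(history, V_next, n_W, n_S, n_A, H, lam=1.0, rho=1.0):
    """Counting-based OP4 estimators at a fixed step h before episode t (sketch only).

    history: past tuples (s, w, a, v, s_next) observed at this step in episodes 1..t-1.
    V_next:  (n_S,) optimistic value estimate for step h+1.
    Returns the prior estimate mu_t, its confidence radius eps_t, and the optimistic Q_t.
    """
    t = len(history) + 1
    N_w = np.zeros(n_W)
    N_swa = np.zeros((n_S, n_W, n_A))
    Q_sum = np.zeros((n_S, n_W, n_A))
    for (s, w, a, v, s_next) in history:
        N_w[w] += 1.0
        N_swa[s, w, a] += 1.0
        Q_sum[s, w, a] += v + V_next[s_next]
    mu_t = (lam / n_W + N_w) / (lam + t - 1)                # regularized empirical prior
    eps_t = np.sqrt(np.log(max(H * t, 2)) / max(t - 1, 1))  # O(sqrt(log(HT)/t)) radius (illustrative)
    Q_hat = Q_sum / (lam + N_swa)
    bonus = rho / np.sqrt(np.maximum(N_swa, 1.0))           # UCB bonus to encourage exploration
    Q_t = np.minimum(Q_hat + bonus, float(H))               # optimistic Q, clipped at H
    return mu_t, eps_t, Q_t
```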

Theorem 4.1.

Let $\epsilon_{h}^{t}=\widetilde{O}(\sqrt{1/t})$ and $\rho=\widetilde{O}(|S|\cdot|\Omega|\cdot|A|H)$. Then under $(p_{0},D)$-regularity, with probability at least $1-3H^{-1}T^{-1}$, OP4 has regret of order $\widetilde{O}\big(|C|(|S|\cdot|\Omega|\cdot|A|)^{3/2}\cdot H^{2}\sqrt{T}/(p_{0}D)\big)$ in tabular MPPs.

To bound the regret of OP4, we have to consider the regret arising from its different procedures. A formal decomposition of the regret is given in Lemma 6.1. Separately, we upper bound the errors incurred from estimating the QQ-function (Lemma 6.2), from the randomness in the realized outcome, action and next state (Lemma A.5), as well as from estimating the prior of the outcome and choosing a signaling scheme that is robustly persuasive for a subset of priors (Lemmas 6.3 and 6.4). As the two warm-up models are special cases of the general MPP, the proofs of the above properties follow from those for the general MPP setting in Section 6, and are thus omitted here.

4.4 Warm-up II: Reinforcement Learning in Contextual Bayesian Persuasion

We now move to another special case with H=1H=1, such that the MPP problem reduces to a contextual-bandit-like problem, where transitions no longer exist. Given a context cc and a persuasive signaling policy π\pi, the value function is simply the sender’s expected utility for any s𝒮s\in{\mathcal{S}},

Vπ(s;c)ωaAμ(ω|c)π(a|s,ω)v(s,ω,a)dω.V^{\pi}(s;c)\coloneqq\int_{\omega}\sum_{a\in A}\mu(\omega|c)\pi(a|s,\omega)v(s,\omega,a){\mathrm{d}}\omega.

The sender’s optimal expected utility is defined as V(s;c)maxπPers(μ(|c),u)Vπ(s;c).V^{*}(s;c)\coloneqq\max_{\pi\in\operatorname{Pers}(\mu(\cdot|c),u)}V^{\pi}(s;c).
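For finite outcome and action sets, the inner maximization over Pers(μ(|c),u)\operatorname{Pers}(\mu(\cdot|c),u) is a linear program in the variables π(a|s,ω)\pi(a|s,\omega). The sketch below is our own illustration (the function name and array layout are hypothetical, and the state argument is suppressed); it encodes the obedience constraints that make the recommended action a best response for the receiver and solves the LP with scipy.optimize.linprog.

```python
import numpy as np
from scipy.optimize import linprog

def optimal_persuasive_scheme(mu, u, v):
    """Solve max_{pi in Pers(mu, u)} sum_{w,a} mu(w) pi(a|w) v(w, a) as an LP.
    mu: (W,) prior over outcomes; u, v: (W, A) receiver / sender utilities.
    Returns pi of shape (W, A) whose rows sum to one."""
    W, A = u.shape
    n = W * A                                   # variables pi(a|w), flattened
    idx = lambda w, a: w * A + a

    # linprog minimizes, so negate the sender's expected utility.
    c = np.zeros(n)
    for w in range(W):
        for a in range(A):
            c[idx(w, a)] = -mu[w] * v[w, a]

    # Obedience (persuasiveness): for each recommended a and deviation a2,
    # sum_w mu(w) pi(a|w) [u(w,a) - u(w,a2)] >= 0, written as A_ub x <= 0.
    rows = []
    for a in range(A):
        for a2 in range(A):
            if a2 == a:
                continue
            row = np.zeros(n)
            for w in range(W):
                row[idx(w, a)] = -mu[w] * (u[w, a] - u[w, a2])
            rows.append(row)

    # Each outcome maps to a distribution over recommendations.
    A_eq = np.zeros((W, n))
    for w in range(W):
        for a in range(A):
            A_eq[w, idx(w, a)] = 1.0
    b_eq = np.ones(W)

    A_ub = np.array(rows) if rows else None
    b_ub = np.zeros(len(rows)) if rows else None
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=(0.0, 1.0))
    return res.x.reshape(W, A)
```

OP4 uses the same LP structure, but with an optimistic QQ-estimate in place of vv and with the persuasiveness constraints imposed for every prior in the confidence region rather than for a single μ\mu.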

Meanwhile, we consider the general setting where the outcome ω\omega is a continuous random variable governed by a generalized linear model. To be specific, the prior μ\mu conditioned on the context cc has mean f(ϕ(c)θ)f(\phi(c)^{\top}\theta^{*}). For the prior μ\mu and link function ff, we assume smoothness of the prior and bounded derivatives of the link function:

Assumption 4.2.

There exists a constant Lμ>0L_{\mu}>0 such that for any parameter θ1,θ2\theta_{1},\theta_{2}, we have μθ1(|c)μθ2(|c)1Lμf(ϕ(c)θ1)f(ϕ(c)θ2)\big{\|}\mu_{\theta_{1}}(\cdot|c)-\mu_{\theta_{2}}(\cdot|c)\big{\|}_{1}\leq L_{\mu}\big{\|}f\big{(}\phi(c)^{\top}\theta_{1}\big{)}-f\big{(}\phi(c)^{\top}\theta_{2}\big{)}\big{\|} for any given context cc.

Assumption 4.3.

The link function ff is either monotonically increasing or decreasing. Moreover, there exist absolute constants 0<κ<K<0<\kappa<K<\infty and 0<M<0<M<\infty such that κ|f(z)|K\kappa\leq|f^{\prime}(z)|\leq K and |f′′(z)|M|f^{\prime\prime}(z)|\leq M for all |z|ΦLθ|z|\leq\Phi L_{\theta}.

It is natural to assume a Lipschitz property of the distribution in Assumption 4.2. For instance, Gaussian distributions and uniform distributions satisfy this property. Assumption 4.3 is standard in the literature [18, 52, 32]. Two example link functions are the identity map f(z)=zf(z)=z and the logistic map f(z)=1/(1+ez)f(z)=1/(1+e^{-z}) with bounded zz. It is easy to verify that both maps satisfy this assumption.
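As a quick numerical sanity check (our own illustration, not part of the analysis), the constants κ,K,M\kappa,K,M in Assumption 4.3 for the logistic map can be estimated on a bounded interval; the values of Φ\Phi and LθL_{\theta} below are placeholders.

```python
import numpy as np

# Numerically estimate kappa, K, M for the logistic link f(z) = 1/(1 + e^{-z})
# on |z| <= Phi * L_theta (placeholder bounds on features and parameters).
Phi, L_theta = 1.0, 1.0
z = np.linspace(-Phi * L_theta, Phi * L_theta, 10001)
f = 1.0 / (1.0 + np.exp(-z))
f1 = f * (1.0 - f)                 # f'(z) for the logistic map
f2 = f1 * (1.0 - 2.0 * f)          # f''(z) for the logistic map
kappa, K, M = f1.min(), f1.max(), np.abs(f2).max()
print(kappa, K, M)                 # roughly 0.20, 0.25, 0.09 when Phi * L_theta = 1
```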

Unlike the tabular setting, we are now unable to use counting-based estimators to keep track of distributions over the possibly infinite sets of states and outcomes. Instead, we resort to function approximation techniques and estimate the linear parameters θ\theta^{*} and γ\gamma^{*}. In each episode, OP4 updates the estimates and confidence regions of θt\theta^{t} and γt\gamma^{t}, with which it determines the outcome prior under pessimism and the sender's utility under optimism. To be specific, the update of θt\theta^{t} solves a constrained least-squares problem and the update of γt\gamma^{t} solves a regularized one:

θt\displaystyle\theta^{t} argminθLθτ[t1][ωτf(ϕ(cτ)θ)]2,\displaystyle\leftarrow\arg\min_{\|\theta\|\leq L_{\theta}}\sum_{\tau\in[t-1]}\big{[}\omega^{\tau}-f(\phi(c^{\tau})^{\top}\theta)\big{]}^{2},
γt\displaystyle\gamma^{t} argminγdψτ[t1][vτψ(sτ,ωτ,aτ)γ]2+λγ2.\displaystyle\leftarrow\arg\min_{\gamma\in\mathbb{R}^{d_{\psi}}}\sum_{\tau\in[t-1]}\big{[}v^{\tau}-\psi(s^{\tau},\omega^{\tau},a^{\tau})^{\top}\gamma\big{]}^{2}+\lambda\|\gamma\|^{2}.

We then estimate the prior by setting μt(|c)\mu^{t}(\cdot|c) to the distribution of f(ϕ(c)θt)+zf\big{(}\phi(c)^{\top}\theta^{t}\big{)}+z and estimate the sender's utility by setting vt(,,)=ψ(,,)γtv^{t}(\cdot,\cdot,\cdot)=\psi(\cdot,\cdot,\cdot)^{\top}\gamma^{t}. On the one hand, to encourage exploration, OP4 adds a UCB bonus term of the form ρψ(,,)(Γt)1\rho\left\|\psi(\cdot,\cdot,\cdot)\right\|_{(\Gamma^{t})^{-1}} to the QQ-function, where Γt=λIdψ+τ[t1]ψ(sτ,ωτ,aτ)ψ(sτ,ωτ,aτ)\Gamma^{t}=\lambda I_{d_{\psi}}+\sum_{\tau\in[t-1]}\psi(s^{\tau},\omega^{\tau},a^{\tau})\psi(s^{\tau},\omega^{\tau},a^{\tau})^{\top} is the Gram matrix of the regularized least-squares problem and ρ\rho is a scalar bonus multiplier. This is a common technique for linear bandits. On the other hand, OP4 determines the confidence region of θt\theta^{t} with radius β\beta, and ensures that the chosen signaling scheme is robustly persuasive for any possible (worst-case) prior induced by linear parameters θ\theta in this region. Combining optimism and pessimism, OP4 picks, among the robustly persuasive schemes, the one that maximizes the sender's optimistic utility.
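To make the updates concrete, the following is a minimal numpy sketch of the regularized least-squares step for γt\gamma^{t}, the Gram matrix Γt\Gamma^{t}, the bonus ρψ(Γt)1\rho\|\psi\|_{(\Gamma^{t})^{-1}}, and a simple constrained GLM fit for θt\theta^{t}. The function names and the use of projected gradient descent for the constrained fit are our own illustrative choices, and the sketch assumes a scalar outcome.

```python
import numpy as np

def ridge_utility_fit(Psi, v, lam):
    """gamma^t = argmin_gamma sum_tau (v^tau - psi_tau^T gamma)^2 + lam ||gamma||^2.
    Psi stacks psi(s^tau, w^tau, a^tau) row-wise; v stacks observed utilities."""
    d = Psi.shape[1]
    Gamma = lam * np.eye(d) + Psi.T @ Psi        # Gram matrix Gamma^t
    gamma = np.linalg.solve(Gamma, Psi.T @ v)    # closed-form ridge solution
    return gamma, Gamma

def ucb_bonus(psi, Gamma, rho):
    """Exploration bonus rho * ||psi||_{Gamma^{-1}} added to the estimated Q-value."""
    return rho * np.sqrt(psi @ np.linalg.solve(Gamma, psi))

def glm_prior_fit(Phi_feats, omega, f, f_prime, L_theta, n_steps=200, lr=0.05):
    """Sketch of the constrained fit for theta^t: projected gradient descent on
    the squared residuals (omega^tau - f(phi(c^tau)^T theta))^2."""
    theta = np.zeros(Phi_feats.shape[1])
    for _ in range(n_steps):
        z = Phi_feats @ theta
        grad = -2.0 * Phi_feats.T @ ((omega - f(z)) * f_prime(z))
        theta -= lr * grad
        norm = np.linalg.norm(theta)
        if norm > L_theta:                        # project back onto ||theta|| <= L_theta
            theta *= L_theta / norm
    return theta
```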

Theorem 4.4.

Under (p0,D)(p_{0},D)-regularity and Assumptions 4.2 and 4.3, there exist absolute constants C1,C2>0C_{1},C_{2}>0 such that, if we set λ=max{1,Ψ2}\lambda=\max\{1,\Psi^{2}\}, β=C1(1+κ1K+M+dϕσ2log(T))\beta=C_{1}(1+\kappa^{-1}\sqrt{K+M+d_{\phi}\sigma^{2}\log(T)}), and ρ=C2dψlog(4dψΨ2T3)\rho=C_{2}d_{\psi}\sqrt{\log(4d_{\psi}\Psi^{2}T^{3})}, then with probability at least 13T11-3T^{-1}, OP4 has regret of order O~(dϕdψ3T/(p0D))\widetilde{O}\big{(}d_{\phi}\sqrt{d_{\psi}^{3}}\sqrt{T}/(p_{0}D)\big{)} in contextual Bayesian persuasion problems.

Since we estimate the prior by computing an estimator θt\theta^{t}, we evaluate the persuasiveness of OP4 through the probability that θ\theta^{*} lies in the confidence region centered at θt\theta^{t} with radius β=O(dϕlog(T))\beta=O(\sqrt{d_{\phi}\log(T)}) in the weighted norm. Due to the smoothness of the prior and the assumption on the link function, the error of the estimated prior is bounded by the product of β\beta and the weighted norm of the feature vector ϕ(ct)(Σt)1=O(1/t)\|\phi(c^{t})\|_{(\Sigma^{t})^{-1}}=O(1/\sqrt{t}), which matches the rate of ϵt\epsilon^{t} in the tabular MPP case. Moreover, compared to Li et al. [31], we do not require any regularity condition on Σt\Sigma^{t}, since we add the constant matrix Φ2I\Phi^{2}I to the Gram matrix Σt\Sigma^{t}, which ensures that Σt\Sigma^{t} is always lower bounded by Φ2I\Phi^{2}I with Φ2>0\Phi^{2}>0. The proofs of persuasiveness and sublinear regret for the contextual bandit setting can be viewed as a direct reduction from the MPP case with horizon H=1H=1. We decompose the regret in the same way as in Lemma 6.1 for MPPs and then upper bound each term.

5 No-Regret Learning in the General Markov Persuasion Process

In this section, we present the full version of the OP4 algorithm for MPPs and show that it is persuasive with high probability and meanwhile achieves average regret O~(dϕdψ3/2H2T/(p0D))\widetilde{O}\big{(}d_{\phi}\cdot d_{\psi}^{3/2}H^{2}\sqrt{T}/(p_{0}D)\big{)}.

In the general MPP setting with linear utility and transition, a crucial property is that the QQ-functions under any signaling policy are always linear in the feature map ψ\psi (akin to linear MDPs [25]). Therefore, when designing learning algorithms, it suffices to focus on linear QQ-functions. In our OP4 algorithm, we iteratively fit the optimal QQ-function, which is parameterized by qhq_{h}^{*} as ψ(,,)qh\psi(\cdot,\cdot,\cdot)^{\top}q_{h}^{*} at each step h[H]h\in[H]. OP4 learns the QQ-functions of MPPs and the prior of persuasion states simultaneously. It operates similarly to the procedures for tabular MPPs and contextual Bayesian persuasion. At the tt-th episode, given the historical data {(chτ,shτ,ωhτ,ahτ,vhτ)}h[H],τ[t1]\{(c_{h}^{\tau},s_{h}^{\tau},\omega_{h}^{\tau},a_{h}^{\tau},v_{h}^{\tau})\}_{h\in[H],\tau\in[t-1]}, we can estimate the unknown vectors θh,qh,h[H]\theta_{h}^{*},q_{h}^{*},\forall h\in[H] by solving the following constrained or regularized least-squares problems:

θht\displaystyle\theta_{h}^{t} argminθhLθτ[t1][ωhτf(ϕ(chτ)θh)]2,\displaystyle\leftarrow\mathop{\mathrm{argmin}}_{\|\theta_{h}\|\leq L_{\theta}}\sum_{\tau\in[t-1]}\big{[}\omega_{h}^{\tau}-f(\phi(c_{h}^{\tau})^{\top}\theta_{h})\big{]}^{2},
qht\displaystyle q_{h}^{t} argminqdψτ[t1][vhτ+Vh+1t(sh+1τ;Ct)ψ(shτ,ωhτ,ahτ)q]2+λq2.\displaystyle\leftarrow\mathop{\mathrm{argmin}}_{q\in\mathbb{R}^{d_{\psi}}}\sum_{\tau\in[t-1]}\big{[}v_{h}^{\tau}+V_{h+1}^{t}(s_{h+1}^{\tau};C^{t})-\psi(s_{h}^{\tau},\omega_{h}^{\tau},a_{h}^{\tau})^{\top}q\big{]}^{2}+\lambda\|q\|^{2}.

Additionally, Vh+1tV_{h+1}^{t} is the estimated value function with the observed context CtC^{t} at episode tt, which we describe formally later. This estimator is used to replace the unknown transition PhP_{h} and distribution νh\nu_{h} in equation (3.2). Moreover, we update the estimates of the outcome prior μht\mu_{h}^{t} and the QQ-function QhtQ_{h}^{t} accordingly. Here OP4 adds a UCB bonus to QhtQ_{h}^{t} to encourage exploration. The formal description is given in Algorithm 2.
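The following is a minimal sketch (ours, not the paper's implementation) of one backward step of this procedure: ridge-fit qhtq_{h}^{t} on the targets vhτ+Vh+1t(sh+1τ)v_{h}^{\tau}+V_{h+1}^{t}(s_{h+1}^{\tau}), then form the optimistic QQ-value with the bonus ρψ(Γht)1\rho\|\psi\|_{(\Gamma_{h}^{t})^{-1}} and clip at HH, mirroring lines 10-12 of Algorithm 2; the function name fit_q_step is hypothetical, and the robustly persuasive maximization that defines VhtV_{h}^{t} is left abstract.

```python
import numpy as np

def fit_q_step(Psi_h, v_h, V_next, lam, rho, H):
    """One backward step at stage h.
    Psi_h: (n, d_psi) features psi(s, w, a) from past episodes at step h;
    v_h:   (n,) observed sender utilities;
    V_next:(n,) estimated values V_{h+1}^t at the observed next states."""
    d = Psi_h.shape[1]
    Gamma = lam * np.eye(d) + Psi_h.T @ Psi_h     # Gamma_h^t
    iota = Psi_h.T @ (v_h + V_next)               # iota_h^t
    q = np.linalg.solve(Gamma, iota)              # q_h^t = (Gamma_h^t)^{-1} iota_h^t

    def Q(psi):
        """Optimistic, clipped Q_h^t evaluated at the feature vector psi(s, w, a)."""
        bonus = rho * np.sqrt(psi @ np.linalg.solve(Gamma, psi))
        return float(min(psi @ q + bonus, H))     # min{., H} as in Algorithm 2
    return Q
```

Under pessimism, VhtV_{h}^{t} is then obtained by maximizing this optimistic QQ-value over the signaling schemes that are persuasive for every prior in ht\mathcal{B}_{h}^{t}, as in line 12 of Algorithm 2.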

Likewise, the MPP setting inherits the regularity conditions and Assumptions 4.2 and 4.3 from the previous sections. Combining the insights from both the tabular MPPs and contextual Bayesian persuasion, we can show that OP4 is persuasive and guarantees sublinear regret with high probability for general MPPs.

Theorem 5.1.

Under (p0,D)(p_{0},D)-regularity and Assumptions 4.2 and 4.3, there exist absolute constants C1,C2>0C_{1},C_{2}>0 such that, if we set λ=max{1,Ψ2}\lambda=\max\{1,\Psi^{2}\}, β=C1(1+κ1K+M+dϕσ2log(HT))\beta=C_{1}(1+\kappa^{-1}\sqrt{K+M+d_{\phi}\sigma^{2}\log(HT)}), and ρ=C2dψHlog(4dψΨ2H2T3)\rho=C_{2}d_{\psi}H\sqrt{\log(4d_{\psi}\Psi^{2}H^{2}T^{3})}, then with probability at least 13H1T11-3H^{-1}T^{-1}, OP4 has regret of order O~(dϕdψ3/2H2T/(p0D))\widetilde{O}\big{(}d_{\phi}d_{\psi}^{3/2}H^{2}\sqrt{T}/(p_{0}D)\big{)}.

Recall that the novelty of OP4 is that we adopt pessimism and optimism to induce robust equilibrium behavior and encourage exploration simultaneously. Specifically, pessimism tackles the uncertainty in the prior estimation by selecting a signaling policy that is persuasive w.r.t. all the priors in the confidence region, while optimism in QQ-function estimation encourages exploration. To evaluate the regret of OP4, we provide a novel regret decomposition, which is tailored to this pessimism and optimism combination. Each term represents different aspects of regret loss incurred by either estimation or randomness.

6 Proof Sketch and Technical Highlights

In this section, we present the proof sketch for Theorem 5.1. We first decompose the regret into several terms tailored to MPPs and briefly describe how to bound each term. Then we highlight our technical contribution of using the regularity condition to bound the loss in the sender's utility from choosing a signaling scheme that is persuasive for a whole subset of nearby priors.

6.1 Proof of Theorem 5.1

In order to prove the sublinear regret of OP4, we construct a novel regret decomposition tailored to MPPs. Our proof starts by decomposing the regret into several terms, each of which reflects a regret loss either from estimation or from the randomness of trajectories. We then bound each term and sum the bounds to obtain the regret upper bound for OP4. For simplicity of presentation, denote V~ht(;C)=Qht,μhπhtΩ×𝒜(;C)\widetilde{V}_{h}^{t}(\cdot;C)=\big{\langle}Q_{h}^{t},\mu_{h}^{*}\otimes\pi_{h}^{t}\big{\rangle}_{\Omega\times\mathcal{A}}(\cdot;C) as the expectation of QhtQ_{h}^{t} with respect to the ground-truth prior μh\mu_{h}^{*} and the signaling scheme πht\pi_{h}^{t} at the hh-th step. Then we can define the temporal-difference (TD) error as

δht(s,ω,a)=(vht+PhVh+1tQht)(s,ω,a;Ct).\delta_{h}^{t}(s,\omega,a)=(v_{h}^{t}+P_{h}V_{h+1}^{t}-Q_{h}^{t})(s,\omega,a;C^{t}). (6.1)

Here δht\delta_{h}^{t} is a function on 𝒮×Ω×𝒜{\mathcal{S}}\times\Omega\times\mathcal{A} for all h[H]h\in[H] and t[T]t\in[T]. Intuitively, {δht}h[H]\{\delta_{h}^{t}\}_{h\in[H]} quantifies how far the QQ-functions {Qht}h[H]\{Q_{h}^{t}\}_{h\in[H]} are from satisfying the Bellman optimality equation in equation (3.2). Moreover, define ζt,h1\zeta_{t,h}^{1} and ζt,h2\zeta_{t,h}^{2} for the trajectory {cht,sht,ωht,aht}h[H]\{c_{h}^{t},s_{h}^{t},\omega_{h}^{t},a_{h}^{t}\}_{h\in[H]} generated by Algorithm 2 at the tt-th episode as follows

ζt,h1=(V~htVhπt)(sht;Ct)(QhtQhπt)(sht,ωht,aht;Ct),ζt,h2=Ph(Vh+1tVh+1)(sht,ωht,aht;Ct)(Vh+1tVh+1)(sh+1t;Ct).\begin{split}\zeta_{t,h}^{1}&=(\widetilde{V}_{h}^{t}-V_{h}^{\pi^{t}})(s_{h}^{t};C^{t})-(Q_{h}^{t}-Q_{h}^{\pi^{t}})(s_{h}^{t},\omega_{h}^{t},a_{h}^{t};C^{t}),\\ \zeta_{t,h}^{2}&=P_{h}(V_{h+1}^{t}-V_{h+1}^{*})(s_{h}^{t},\omega_{h}^{t},a_{h}^{t};C^{t})-(V_{h+1}^{t}-V_{h+1}^{*})(s_{h+1}^{t};C^{t}).\end{split} (6.2)

By definition, ζt,h1\zeta_{t,h}^{1} captures the randomness of realizing the outcome ωhtμh(|cht)\omega_{h}^{t}\sim\mu_{h}^{*}(\cdot|c_{h}^{t}) and of signaling the action ahtπht(sht,ωht,)a_{h}^{t}\sim\pi_{h}^{t}(s_{h}^{t},\omega_{h}^{t},\cdot), while ζt,h2\zeta_{t,h}^{2} captures the randomness of drawing the next state sh+1ts_{h+1}^{t} from Ph(|sht,ωht,aht)P_{h}(\cdot|s_{h}^{t},\omega_{h}^{t},a_{h}^{t}). With the notation above, we can decompose the regret into the parts displayed below, which facilitates establishing the upper bound on the regret.

Lemma 6.1 (Regret Decomposition).

With the notation defined in equations (6.1) and (6.2), we can write the regret as:

Reg(T,μ)=t[T]h[H]{𝔼μh,πh[δht(sh,ωh,aht)|s1=s1t]δht(sht,ωht,aht)}(i)+t[T]h[H](ζt,h1+ζt,h2)(ii)+t[T]h[H]𝔼μh,πh[Qht,μhπhμhtπhtΩ×𝒜(sh;Ct)|s1=s1t](iii)+t[T]h[H]Qht,(μhtμh)πhtΩ×𝒜(sht,Ct)(iv).\begin{split}\operatorname{Reg}(T,\mu^{*})=&\underbrace{\sum_{t\in[T]}\sum_{h\in[H]}\big{\{}\mathbb{E}_{\mu_{h}^{*},\pi_{h}^{*}}[\delta_{h}^{t}(s_{h},\omega_{h},a_{h}^{t})|s_{1}=s_{1}^{t}]-\delta_{h}^{t}(s_{h}^{t},\omega_{h}^{t},a_{h}^{t})\big{\}}}_{\displaystyle\mathrm{(i)}}+\underbrace{\sum_{t\in[T]}\sum_{h\in[H]}(\zeta_{t,h}^{1}+\zeta_{t,h}^{2})}_{\displaystyle\mathrm{(ii)}}\\ &~{}~{}~{}~{}+\underbrace{\sum_{t\in[T]}\sum_{h\in[H]}\mathbb{E}_{\mu_{h}^{*},\pi_{h}^{*}}\big{[}\big{\langle}Q_{h}^{t},\mu_{h}^{*}\otimes\pi_{h}^{*}-\mu_{h}^{t}\otimes\pi_{h}^{t}\big{\rangle}_{\Omega\times\mathcal{A}}(s_{h};C^{t})|s_{1}=s_{1}^{t}\big{]}}_{\displaystyle\mathrm{(iii)}}\\ &~{}~{}~{}~{}+\underbrace{\sum_{t\in[T]}\sum_{h\in[H]}\big{\langle}Q_{h}^{t},(\mu_{h}^{t}-\mu_{h}^{*})\otimes\pi_{h}^{t}\big{\rangle}_{\Omega\times\mathcal{A}}(s_{h}^{t},C^{t})}_{\displaystyle\mathrm{(iv)}}.\end{split} (6.3)

In this novel regret decomposition, term (i)\displaystyle\mathrm{(i)} reflects the optimism in OP4. Provably, δht\delta_{h}^{t} in term (i)\displaystyle\mathrm{(i)} is always non-positive due to the optimistic QQ-value estimation, which simplifies this term. Term (iii)\displaystyle\mathrm{(iii)} corresponds to the pessimism in OP4 for inducing robust equilibria. It evaluates the regret loss incurred by choosing a robustly persuasive signaling policy. Since the signaling policy has to be persuasive to ensure that receivers always take the recommended actions, we cannot simply choose a greedy policy for a fixed prior estimate. Instead, we first apply optimism to construct the optimistic QQ-value estimate and then apply pessimism to select a signaling policy that is robustly persuasive for all the priors in the confidence region. Therefore, we design the regret decomposition, especially term (iii)\displaystyle\mathrm{(iii)}, in this form to reflect the optimism-pessimism principle in OP4. Notice that this decomposition does not depend on the specific form of function approximation used by the algorithm, since neither the estimates of the prior and the QQ-function nor the chosen signaling policy affects this formula. Therefore, it applies generally to algorithms for MPPs.

Unlike the regret decomposition in [56], Lemma 6.1 also captures the randomness of realizing the outcome. Since we have to estimate the prior of the outcome and choose a robustly persuasive policy in MPPs, we add terms (iii)\displaystyle\mathrm{(iii)} and (iv)\displaystyle\mathrm{(iv)} to account for the additional regret loss.

The rigorous arguments turn out to be technical, and thus we defer the proofs of most lemmas to the appendix while presenting all the key ideas and conclusions in the following. For term (i)\displaystyle\mathrm{(i)} in equation (6.3), although we do not observe the trajectories under the prior μ\mu^{*} and signaling policy π\pi^{*}, we can upper bound both δht\delta_{h}^{t} and δht-\delta_{h}^{t}. The following lemma states this result.

Lemma 6.2 (Optimism).

There exists an absolute constant c>0c>0 such that, for any fixed δ(0,1)\delta\in(0,1), if we set λ=max{1,Ψ2}\lambda=\max\{1,\Psi^{2}\} and ρ=cdψHι\rho=cd_{\psi}H\sqrt{\iota} in Algorithm 2 with ι=log(2dψΨ2T/δ)\iota=\log(2d_{\psi}\Psi^{2}T/\delta), then with probability at least 1δ/21-\delta/2, we have

2ρψ(s,ω,a)(Γht)1δht(s,ω,a)0.-2\rho\|\psi(s,\omega,a)\|_{(\Gamma_{h}^{t})^{-1}}\leq\delta_{h}^{t}(s,\omega,a)\leq 0.

for all s𝒮,ωΩ,a𝒜,h[H]s\in{\mathcal{S}},\omega\in\Omega,a\in\mathcal{A},h\in[H] and t[T]t\in[T].

Term (ii)\displaystyle\mathrm{(ii)} in equation (6.3) can be bounded by Lemma 5.3 from [56] using martingale techniques and the Azuma-Hoeffding inequality [5]. We state the upper bound for term (ii)\displaystyle\mathrm{(ii)} in Lemma A.5. Moreover, term (iii)\displaystyle\mathrm{(iii)} in equation (6.3) evaluates the regret loss caused by estimating the prior and choosing a robustly persuasive signaling policy. Here, we apply the robustness gap Gap\operatorname{Gap} defined later to bound this term.

Lemma 6.3 (Bounding Term (iii)\displaystyle\mathrm{(iii)}).

On the event of {θhht}\{\theta_{h}^{*}\in\mathcal{B}_{h}^{t}\}, under Assumption 4.2 and 4.3, we have

t[T]h[H]𝔼μh,πh[Qht,μhπh\displaystyle\sum_{t\in[T]}\sum_{h\in[H]}\mathbb{E}_{\mu_{h}^{*},\pi_{h}^{*}}\big{[}\big{\langle}Q_{h}^{t},\mu_{h}^{*}\otimes\pi_{h}^{*} μhtπhtΩ×𝒜(sh;Ct)|s1=s1t]\displaystyle-\mu_{h}^{t}\otimes\pi_{h}^{t}\big{\rangle}_{\Omega\times\mathcal{A}}(s_{h};C^{t})|s_{1}=s_{1}^{t}\big{]}
(3HLμKp0D+HLμK2)βh[H]t[T]ϕ(cht)(Σht)1.\displaystyle\leq\bigg{(}\frac{3HL_{\mu}K}{p_{0}D}+\frac{HL_{\mu}K}{2}\bigg{)}\beta\sum_{h\in[H]}\sum_{t\in[T]}\|\phi(c_{h}^{t})\|_{(\Sigma_{h}^{t})^{-1}}.

It remains to bound term (iv)\displaystyle\mathrm{(iv)} in equation (6.3). This bound follows from Hölder's inequality and the properties of the prior.

Lemma 6.4 (Bounding Term (iv)\displaystyle\mathrm{(iv)}).

On the event of {θhht}\{\theta_{h}^{*}\in\mathcal{B}_{h}^{t}\}, under Assumption 4.2 and 4.3, we have

t[T]h[H]Qht,(μhtμh)πhtΩ×𝒜(sht;Ct)HLμKβh[H]t[T]ϕ(cht)(Σht)1.\sum_{t\in[T]}\sum_{h\in[H]}\big{\langle}Q_{h}^{t},(\mu_{h}^{t}-\mu_{h}^{*})\otimes\pi_{h}^{t}\big{\rangle}_{\Omega\times\mathcal{A}}(s_{h}^{t};C^{t})\leq HL_{\mu}K\beta\sum_{h\in[H]}\sum_{t\in[T]}\|\phi(c_{h}^{t})\|_{(\Sigma_{h}^{t})^{-1}}.

Now we are ready to prove our main result, Theorem 5.1. By the decomposition in Lemma 6.1 and the preceding lemmas, setting β=C(1+κ1K+M+dϕσ2log(HT))\beta=C(1+\kappa^{-1}\sqrt{K+M+d_{\phi}\sigma^{2}\log(HT)}), we obtain the following upper bound on the regret:

Reg(T,μ)\displaystyle\operatorname{Reg}(T,\mu^{*})\leq 42TH3log(2HT)\displaystyle 4\sqrt{2TH^{3}\log(2HT)}
+t[T]h[H][2ρψ(sht,ωht,aht)(Γht)1+(3HLμKp0D+3HLμK2)βϕ(cht)(Σht)1],\displaystyle~{}~{}~{}~{}+\sum_{t\in[T]}\sum_{h\in[H]}\bigg{[}2\rho\big{\|}\psi(s_{h}^{t},\omega_{h}^{t},a_{h}^{t})\|_{(\Gamma_{h}^{t})^{-1}}+\bigg{(}\frac{3HL_{\mu}K}{p_{0}D}+\frac{3HL_{\mu}K}{2}\bigg{)}\beta\|\phi(c_{h}^{t})\|_{(\Sigma_{h}^{t})^{-1}}\bigg{]},

By Lemma 6.8 and appropriate choices of δ\delta in the previous lemmas, the above inequality holds with probability at least 13H1T11-3H^{-1}T^{-1}.

By Lemma A.6, we have

Reg(T,μ)\displaystyle\operatorname{Reg}(T,\mu^{*}) 22TH3log(2HT)+2ρH2dψTlog(1+TΨ2/(λdψ))\displaystyle\leq 2\sqrt{2TH^{3}\log(2HT)}+2\rho H\sqrt{2d_{\psi}T\log\big{(}1+T\Psi^{2}/(\lambda d_{\psi})\big{)}}
+βH2LμK(3p0D+32)2dϕTlog(1+T/(dϕ)).\displaystyle\qquad\qquad+\beta H^{2}L_{\mu}K\bigg{(}\frac{3}{p_{0}D}+\frac{3}{2}\bigg{)}\sqrt{2d_{\phi}T\log\big{(}1+T/(d_{\phi})\big{)}}.

Since β\beta is in O~(dϕ)\widetilde{O}(\sqrt{d_{\phi}}) and ρ\rho is in O~(dψH)\widetilde{O}(d_{\psi}H), we can conclude that the regret of Algorithm 2 is O~(dϕdψ3/2H2T/(p0D))\widetilde{O}\big{(}d_{\phi}d_{\psi}^{3/2}H^{2}\sqrt{T}/(p_{0}D)\big{)}.

6.2 Inducing Robust Equilibria via Pessimism

One necessary prerequisite is that the signaling policy produced by OP4 has to be persuasive, so that receivers indeed take the recommended actions. However, the optimal signaling policy that is persuasive for the estimated prior is in general not persuasive for the true prior, even if the estimate is quite close to it. To ensure persuasiveness under the prior estimation error, we adopt the pessimism principle and select a signaling policy that is robustly persuasive for all the priors in the confidence region. We then quantify the extra utility loss incurred by this pessimism. In this subsection, we start by showing that there exists a robust signaling scheme that suffers only O(ϵ)O(\epsilon) utility loss compared to the optimal expected utility of a persuasion algorithm designed with precise knowledge of the prior. Formally, in the basic MPP, given any fixed QQ-function Q(,,)Q(\cdot,\cdot,\cdot), we define the robustness gap for a state s𝒮s\in{\mathcal{S}} and any prior μΔ(Ω)\mu\in\mathcal{B}\subseteq\Delta(\Omega) as

Gap(s,μ,;Q)maxπPers(μ,u)Q,μπΩ×𝒜(s)maxπPers(,u)Q,μπΩ×𝒜(s).\operatorname{Gap}\big{(}s,\mu,\mathcal{B};Q\big{)}\triangleq\max_{\pi\in\operatorname{Pers}(\mu,u)}\big{\langle}Q,\mu\otimes\pi\big{\rangle}_{\Omega\times\mathcal{A}}(s)-\max_{\pi\in\operatorname{Pers}(\mathcal{B},u)}\big{\langle}Q,\mu\otimes\pi\big{\rangle}_{\Omega\times\mathcal{A}}(s). (6.4)

We let B(μ,ϵ)={μΔ(Ω):μμ1ϵ}{\mathrm{B}}(\mu,\epsilon)=\{\mu^{\prime}\in\Delta(\Omega):\left\|\mu-\mu^{\prime}\right\|_{1}\leq\epsilon\} be the 1\ell_{1}-norm ball centered at the prior distribution μ\mu with radius ϵ\epsilon.

Lemma 6.5 (Pessimism).

Under (p0,D)(p_{0},D)-regularity, for all ϵ>0\epsilon>0, given a QQ-function QQ, for any state s𝒮s\in{\mathcal{S}}, we have

Gap(s,μ,B(μ,ϵ);Q)Hϵp0D.\operatorname{Gap}\big{(}s,\mu,{\mathrm{B}}(\mu,\epsilon);Q\big{)}\leq\frac{H\epsilon}{p_{0}D}.

The proof is given in Appendix A.2. This result extends Proposition 1 in [61]. Notice that the upper bound of Gap(;)\operatorname{Gap}(\cdot;\cdot) does not depend on the value of QQ, which is important for our analysis. Given a signaling algorithm, at each episode t[T]t\in[T] and each step h[H]h\in[H], we obtain an estimate of the QQ-function in explicit form, which plays the role of the fixed QQ-function in equation (6.4). Using Gap(;)\operatorname{Gap}(\cdot;\cdot), we can bound the sender's expected utility loss from choosing a signaling scheme that is persuasive for all priors in a subset. Moreover, if we let the priors depend on the context and add the generalized linear assumption on the priors to the preceding lemma, we can bound Gap(;)\operatorname{Gap}(\cdot;\cdot) by the difference in the linear parameter θ\theta.
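Concretely, the proof of Lemma 6.5 in Appendix A.2 constructs the robust scheme as the mixture π=(1δ)π+δπ0\pi^{\prime}=(1-\delta)\pi^{*}+\delta\pi^{0} with δ=ϵ/(p0D)\delta=\epsilon/(p_{0}D), where π0\pi^{0} fully reveals the outcome and recommends the receiver's best response. The sketch below is our own illustration for finite outcomes and actions; π\pi^{*} would come, for instance, from the persuasion LP discussed in Section 4.4, and the function name robustify is hypothetical.

```python
import numpy as np

def robustify(pi_star, u, eps, p0, D):
    """Mix the optimal persuasive scheme with full revelation:
    pi' = (1 - delta) * pi_star + delta * pi0 with delta = eps / (p0 * D),
    which Lemma 6.5 shows is persuasive for every prior within l1-distance eps."""
    W, A = pi_star.shape
    delta = min(1.0, eps / (p0 * D))
    pi0 = np.zeros((W, A))
    pi0[np.arange(W), u.argmax(axis=1)] = 1.0   # recommend the receiver's best action
    return (1.0 - delta) * pi_star + delta * pi0
```

In the learning algorithm, ϵ\epsilon is instantiated by the prior-estimation error, so the lemma directly translates estimation accuracy into a bound on the utility loss of pessimism.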

Corollary 6.6.

Under (p0,D)(p_{0},D)-regularity and Assumption 4.2 and 4.3, given a QQ-function QQ and context cc, for any state s𝒮s\in{\mathcal{S}}, prior μθ(|c)\mu_{\theta}(\cdot|c) and confidence region ={μθ(|c):θBΣ(θ,ϵ)}\mathcal{B}=\{\mu_{\theta^{\prime}}(\cdot|c):\theta^{\prime}\in{\mathrm{B}}_{\Sigma}(\theta,\epsilon)\}, we have Gap(s,μθ(|c),;Q)HLμKϕ(c)Σ1ϵ/(p0D).\operatorname{Gap}(s,\mu_{\theta}(\cdot|c),\mathcal{B};Q)\leq HL_{\mu}K\|\phi(c)\|_{\Sigma^{-1}}\epsilon/(p_{0}D).

In MPPs, we have to estimate the prior of the outcome since we cannot observe the ground-truth prior. However, the estimate may not satisfy the regularity conditions, which are required of the prior in the proof of Lemma 6.5. To address this problem, we give another upper bound on the robustness gap for the prior estimate in Lemma A.1. In addition, to handle the regret loss incurred by estimating the prior, we bound in Lemma A.2 the difference in QQ-values when choosing the respective persuasive schemes for different priors.

We now prove that the above pessimism design guarantees persuasiveness w.r.t. the true prior with high probability. It suffices to show that the estimate θht\theta_{h}^{t} is close enough to the true parameter θh\theta_{h}^{*} that the confidence region ht\mathcal{B}_{h}^{t} centered at θht\theta_{h}^{t} given in Algorithm 2 contains θh\theta_{h}^{*}. If so, the signaling scheme chosen to be persuasive for the whole set μht\mu_{\mathcal{B}_{h}^{t}} is also persuasive for μh\mu_{h}^{*}, where μ{μθ:θ}\mu_{\mathcal{B}}\coloneqq\{\mu_{\theta^{\prime}}:\theta^{\prime}\in\mathcal{B}\} denotes the set of priors determined by the parameters θ\theta^{\prime}\in\mathcal{B}.

Lemma 6.7.

There exists a constant C>0C>0 such that, for β=C(1+κ1K+M+dϕσ2log(HT))\beta=C(1+\kappa^{-1}\sqrt{K+M+d_{\phi}\sigma^{2}\log(HT)}), the OP4 algorithm is persuasive with probability at least 1H1T11-H^{-1}T^{-1}, i.e.,

θ(h[H]{θht[T]ht})H1T1.\mathbb{P}_{\theta^{*}}\bigg{(}\bigcup_{h\in[H]}\big{\{}\theta_{h}^{*}\notin\cap_{t\in[T]}\mathcal{B}_{h}^{t}\big{\}}\bigg{)}\leq H^{-1}T^{-1}.
Proof.

We first bound the probability of being non-persuasive. For any θh\theta_{h}^{*} with θhLθ\|\theta_{h}^{*}\|\leq L_{\theta}, applying the union bound over h[H]h\in[H] and then over t[T]t\in[T], we have

Pθ(h[H]{θht[T]ht})\displaystyle P_{\theta^{*}}\bigg{(}\bigcup_{h\in[H]}\big{\{}\theta_{h}^{*}\notin\cap_{t\in[T]}\mathcal{B}_{h}^{t}\big{\}}\bigg{)} h[H]Pθh(θht[T]ht)\displaystyle\leq\sum_{h\in[H]}P_{\theta_{h}^{*}}\big{(}\theta_{h}^{*}\notin\cap_{t\in[T]}\mathcal{B}_{h}^{t}\big{)}
t[T]h[H]Pθh(θhtθhΣht>β).\displaystyle\leq\sum_{t\in[T]}\sum_{h\in[H]}P_{\theta_{h}^{*}}\big{(}\|\theta_{h}^{t}-\theta_{h}^{*}\|_{\Sigma_{h}^{t}}>\beta\big{)}.

The following lemma gives the belief of confidence region for the linear parameter θh\theta_{h}^{*}. The proof can be directly derived from Lemma 6 in Wang et al. [52].

Lemma 6.8 (Belief of Confidence Region).

For any t[T]t\in[T] and h[H]h\in[H], there exists a constant C>0C>0, such that for β=C(1+κ1K+M+dϕσ2log(1/δ))\beta=C(1+\kappa^{-1}\sqrt{K+M+d_{\phi}\sigma^{2}\log(1/\delta)}), given δ(0,1)\delta\in(0,1), with probability at least 1δ1-\delta, we have θhtθhΣhtβ.\|\theta_{h}^{t}-\theta_{h}^{*}\|_{\Sigma_{h}^{t}}\leq\beta.

By Lemma 6.8 with δ=H2T2\delta=H^{-2}T^{-2}, we have θ(θhtθhΣht>β)H2T2\mathbb{P}_{\theta^{*}}(\|\theta_{h}^{t}-\theta_{h}^{*}\|_{\Sigma_{h}^{t}}>\beta)\leq H^{-2}T^{-2} for each t[T]t\in[T] and h[H]h\in[H]. Summing these failure probabilities over t[T]t\in[T] and h[H]h\in[H] yields θ(h[H]{θht[T]ht})H1T1\mathbb{P}_{\theta^{*}}\big{(}\bigcup_{h\in[H]}\{\theta_{h}^{*}\notin\cap_{t\in[T]}\mathcal{B}_{h}^{t}\}\big{)}\leq H^{-1}T^{-1}. ∎

7 Conclusion

We have presented a novel model, the MPP, which for the first time captures both the misaligned incentives of uninformed decision makers and the long-term objective of an information possessor. We then provide a reinforcement learning algorithm, OP4, that is provably efficient in terms of both computational complexity and sample complexity under mild assumptions. We remark that while we showcase this algorithm in particular problem instances with linear approximation or GLMs, the framework of OP4 does not rely on the form of function approximation, as long as we can quantify the uncertainty of the prior estimation and the QQ-function (or transition model). In addition, we expect this optimism-pessimism design principle and its corresponding proof techniques to be generally useful for other strategic learning problems with misaligned incentives.

Besides extending our techniques to other design problems, we point out several open problems that arise from our work. First, while it is natural in many cases for the sender to have knowledge of the receivers' utility functions (see Footnote 6), we hope to also study the problem without initially knowing the receivers' utility. A similar problem has been studied in Stackelberg games [40, 10], yet without measuring performance in terms of the cumulative utility of the sender (leader). Second, another interesting direction is to study the setting of Markov Bayesian persuasion with one sender and one receiver, both aiming to maximize their own long-term cumulative utilities when the environment involves Markovian transitions.

References

  • Agarwal et al. [2019] Agarwal, A., Jiang, N., Kakade, S. M. and Sun, W. (2019). Reinforcement learning: Theory and algorithms. CS Dept., UW Seattle, Seattle, WA, USA, Tech. Rep.
  • Auer et al. [2008] Auer, P., Jaksch, T. and Ortner, R. (2008). Near-optimal regret bounds for reinforcement learning. Advances in neural information processing systems, 21.
  • Ayoub et al. [2020] Ayoub, A., Jia, Z., Szepesvari, C., Wang, M. and Yang, L. (2020). Model-based reinforcement learning with value-targeted regression. In International Conference on Machine Learning. PMLR.
  • Azar et al. [2017] Azar, M. G., Osband, I. and Munos, R. (2017). Minimax regret bounds for reinforcement learning. In International Conference on Machine Learning. PMLR.
  • Azuma [1967] Azuma, K. (1967). Weighted sums of certain dependent random variables. Tohoku Mathematical Journal, Second Series, 19 357–367.
  • Badanidiyuru et al. [2018] Badanidiyuru, A., Bhawalkar, K. and Xu, H. (2018). Targeting and signaling in ad auctions. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms. SIAM.
  • Bellman [1957] Bellman, R. (1957). A markovian decision process. Indiana Univ. Math. J., 6 679–684.
  • Bergemann and Morris [2019] Bergemann, D. and Morris, S. (2019). Information design: A unified perspective. Journal of Economic Literature, 57 44–95.
  • Cai et al. [2020] Cai, Q., Yang, Z., Jin, C. and Wang, Z. (2020). Provably efficient exploration in policy optimization. In International Conference on Machine Learning. PMLR.
  • Conitzer and Sandholm [2006] Conitzer, V. and Sandholm, T. (2006). Computing the optimal strategy to commit to. In Proceedings of the 7th ACM conference on Electronic commerce.
  • Dann et al. [2018] Dann, C., Jiang, N., Krishnamurthy, A., Agarwal, A., Langford, J. and Schapire, R. E. (2018). On oracle-efficient pac rl with rich observations. Advances in neural information processing systems, 31.
  • Dann et al. [2017] Dann, C., Lattimore, T. and Brunskill, E. (2017). Unifying pac and regret: Uniform pac bounds for episodic reinforcement learning. Advances in Neural Information Processing Systems, 30.
  • Du et al. [2021] Du, S., Kakade, S., Lee, J., Lovett, S., Mahajan, G., Sun, W. and Wang, R. (2021). Bilinear classes: A structural framework for provable generalization in rl. In International Conference on Machine Learning. PMLR.
  • Du et al. [2019] Du, S. S., Kakade, S. M., Wang, R. and Yang, L. F. (2019). Is a good representation sufficient for sample efficient reinforcement learning? arXiv preprint arXiv:1910.03016.
  • Dughmi and Xu [2019] Dughmi, S. and Xu, H. (2019). Algorithmic bayesian persuasion. SIAM Journal on Computing STOC16–68.
  • Ely [2017] Ely, J. C. (2017). Beeps. American Economic Review, 107 31–53.
  • Farhadi and Teneketzis [2021] Farhadi, F. and Teneketzis, D. (2021). Dynamic information design: a simple problem on optimal sequential information disclosure. Dynamic Games and Applications 1–42.
  • Filippi et al. [2010] Filippi, S., Cappe, O., Garivier, A. and Szepesvári, C. (2010). Parametric bandits: The generalized linear case. Advances in Neural Information Processing Systems, 23.
  • Gan et al. [2021] Gan, J., Majumdar, R., Radanovic, G. and Singla, A. (2021). Bayesian persuasion in sequential decision-making. arXiv preprint arXiv:2106.05137.
  • Giannoccaro and Pontrandolfo [2002] Giannoccaro, I. and Pontrandolfo, P. (2002). Inventory management in supply chains: a reinforcement learning approach. International Journal of Production Economics, 78 153–161.
  • Goldstein and Leitner [2018] Goldstein, I. and Leitner, Y. (2018). Stress tests and information disclosure. Journal of Economic Theory, 177 34–69.
  • He et al. [2021] He, J., Zhou, D. and Gu, Q. (2021). Logarithmic regret for reinforcement learning with linear function approximation. In International Conference on Machine Learning. PMLR.
  • Jiang et al. [2017] Jiang, N., Krishnamurthy, A., Agarwal, A., Langford, J. and Schapire, R. E. (2017). Contextual decision processes with low bellman rank are pac-learnable. In International Conference on Machine Learning. PMLR.
  • Jin et al. [2018] Jin, C., Allen-Zhu, Z., Bubeck, S. and Jordan, M. I. (2018). Is q-learning provably efficient? Advances in neural information processing systems, 31.
  • Jin et al. [2020] Jin, C., Yang, Z., Wang, Z. and Jordan, M. I. (2020). Provably efficient reinforcement learning with linear function approximation. In Conference on Learning Theory. PMLR.
  • Kamenica and Gentzkow [2011] Kamenica, E. and Gentzkow, M. (2011). Bayesian persuasion. American Economic Review, 101 2590–2615.
  • Kober et al. [2013] Kober, J., Bagnell, J. A. and Peters, J. (2013). Reinforcement learning in robotics: A survey. The International Journal of Robotics Research, 32 1238–1274.
  • Lattimore and Szepesvári [2020] Lattimore, T. and Szepesvári, C. (2020). Bandit algorithms. Cambridge University Press.
  • Lehrer and Shaiderman [2021] Lehrer, E. and Shaiderman, D. (2021). Markovian persuasion. arXiv preprint arXiv:2111.14365.
  • Li et al. [2016] Li, J., Monroe, W., Ritter, A., Galley, M., Gao, J. and Jurafsky, D. (2016). Deep reinforcement learning for dialogue generation. arXiv preprint arXiv:1606.01541.
  • Li et al. [2017a] Li, L., Lu, Y. and Zhou, D. (2017a). Provable optimal algorithms for generalized linear contextual bandits. CoRR, abs/1703.00048.
    http://arxiv.org/abs/1703.00048
  • Li et al. [2017b] Li, L., Lu, Y. and Zhou, D. (2017b). Provably optimal algorithms for generalized linear contextual bandits. In International Conference on Machine Learning. PMLR.
  • Li et al. [2019] Li, M., Qin, Z., Jiao, Y., Yang, Y., Wang, J., Wang, C., Wu, G. and Ye, J. (2019). Efficient ridesharing order dispatching with mean field multi-agent reinforcement learning. In The world wide web conference.
  • Liang et al. [2021] Liang, E., Wen, K., Lam, W. H., Sumalee, A. and Zhong, R. (2021). An integrated reinforcement learning and centralized programming approach for online taxi dispatching. IEEE Transactions on Neural Networks and Learning Systems.
  • Mansour et al. [2021] Mansour, Y., Slivkins, A., Syrgkanis, V. and Wu, Z. S. (2021). Bayesian exploration: Incentivizing exploration in bayesian games. Operations Research.
  • Meisheri et al. [2020] Meisheri, H., Baniwal, V., Sultana, N. N., Khadilkar, H. and Ravindran, B. (2020). Using reinforcement learning for a large variable-dimensional inventory management problem. In Adaptive Learning Agents Workshop at AAMAS.
  • Milano et al. [2020] Milano, S., Taddeo, M. and Floridi, L. (2020). Recommender systems and their ethical challenges. Ai & Society, 35 957–967.
  • Neu and Pike-Burke [2020] Neu, G. and Pike-Burke, C. (2020). A unifying view of optimism in episodic reinforcement learning. Advances in Neural Information Processing Systems, 33 1392–1403.
  • Osband et al. [2016] Osband, I., Van Roy, B. and Wen, Z. (2016). Generalization and exploration via randomized value functions. In International Conference on Machine Learning. PMLR.
  • Peng et al. [2019] Peng, B., Shen, W., Tang, P. and Zuo, S. (2019). Learning optimal strategies to commit to. In Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33.
  • Puterman [2014] Puterman, M. L. (2014). Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons.
  • Qin et al. [2020] Qin, Z., Tang, X., Jiao, Y., Zhang, F., Xu, Z., Zhu, H. and Ye, J. (2020). Ride-hailing order dispatching at didi via reinforcement learning. INFORMS Journal on Applied Analytics, 50 272–286.
  • Qin et al. [2021] Qin, Z. T., Zhu, H. and Ye, J. (2021). Reinforcement learning for ridesharing: A survey. In 2021 IEEE International Intelligent Transportation Systems Conference (ITSC). IEEE.
  • Rabinovich et al. [2015] Rabinovich, Z., Jiang, A. X., Jain, M. and Xu, H. (2015). Information disclosure as a means to security. In Proceedings of the 2015 International Conference on Autonomous Agents and Multiagent Systems. Citeseer.
  • Renault et al. [2017] Renault, J., Solan, E. and Vieille, N. (2017). Optimal dynamic information provision. Games and Economic Behavior, 104 329–349.
  • Russo [2019] Russo, D. (2019). Worst-case regret bounds for exploration via randomized value functions. Advances in Neural Information Processing Systems, 32.
  • Strehl et al. [2006] Strehl, A. L., Li, L., Wiewiora, E., Langford, J. and Littman, M. L. (2006). Pac model-free reinforcement learning. In Proceedings of the 23rd international conference on Machine learning.
  • Sutton and Barto [2018] Sutton, R. S. and Barto, A. G. (2018). Reinforcement learning: An introduction. MIT press.
  • Tang and Winoto [2016] Tang, T. Y. and Winoto, P. (2016). I should not recommend it to you even if you will like it: the ethics of recommender systems. New Review of Hypermedia and Multimedia, 22 111–138.
  • Wang et al. [2020] Wang, R., Salakhutdinov, R. and Yang, L. F. (2020). Provably efficient reinforcement learning with general value function approximation.
  • Wang et al. [2021] Wang, T., Zhou, D. and Gu, Q. (2021). Provably efficient reinforcement learning with linear function approximation under adaptivity constraints. Advances in Neural Information Processing Systems, 34.
  • Wang et al. [2019] Wang, Y., Wang, R., Du, S. S. and Krishnamurthy, A. (2019). Optimism in reinforcement learning with generalized linear function approximation. arXiv preprint arXiv:1912.04136.
  • Xiao et al. [2019] Xiao, W., Zhao, H., Pan, H., Song, Y., Zheng, V. W. and Yang, Q. (2019). Beyond personalization: Social content recommendation for creator equality and consumer satisfaction. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining.
  • Xu et al. [2015] Xu, H., Rabinovich, Z., Dughmi, S. and Tambe, M. (2015). Exploring information asymmetry in two-stage security games. In Twenty-Ninth AAAI Conference on Artificial Intelligence.
  • Yang and Wang [2019] Yang, L. and Wang, M. (2019). Sample-optimal parametric q-learning using linearly additive features. In International Conference on Machine Learning. PMLR.
  • Yang et al. [2020] Yang, Z., Jin, C., Wang, Z., Wang, M. and Jordan, M. I. (2020). Bridging exploration and general function approximation in reinforcement learning: Provably efficient kernel and neural value iterations. CoRR, abs/2011.04622.
    https://arxiv.org/abs/2011.04622
  • Zanette et al. [2020] Zanette, A., Lazaric, A., Kochenderfer, M. and Brunskill, E. (2020). Learning near optimal policies with low inherent bellman error. In International Conference on Machine Learning. PMLR.
  • Zheng et al. [2020] Zheng, S., Trott, A., Srinivasa, S., Naik, N., Gruesbeck, M., Parkes, D. C. and Socher, R. (2020). The ai economist: Improving equality and productivity with ai-driven tax policies. arXiv preprint arXiv:2004.13332.
  • Zhou et al. [2021a] Zhou, D., Gu, Q. and Szepesvari, C. (2021a). Nearly minimax optimal reinforcement learning for linear mixture markov decision processes. In Conference on Learning Theory. PMLR.
  • Zhou et al. [2021b] Zhou, D., He, J. and Gu, Q. (2021b). Provably efficient reinforcement learning for discounted mdps with feature mapping. In International Conference on Machine Learning. PMLR.
  • Zu et al. [2021] Zu, Y., Iyer, K. and Xu, H. (2021). Learning to persuade on the fly: Robustness against ignorance. EC ’21, Association for Computing Machinery, New York, NY, USA.
    https://doi.org/10.1145/3465456.3467593

Appendix A Omitted Proofs and Descriptions

A.1 Formal Description of the OP4

The formal description of the OP4 for MPPs is stated as follows:

Algorithm 2 The OP4 for MPPs
1:  Input: Number of Episodes TT, Number of Step HH
2:  Parameters: β>0\beta>0, ρ>0\rho>0, λ+\lambda\in\mathbb{R}^{+}.
3:  Output: aht𝒜a_{h}^{t}\in\mathcal{A} for each h[H],t[T]h\in[H],t\in[T]
4:  for episode t=1Tt=1\dots T do
5:     Receive the initial state s1ts_{1}^{t} and context Ct=(c1t,,cHt)C^{t}=(c_{1}^{t},\ldots,c_{H}^{t}).
6:     for step h=H,,1h=H,\ldots,1 do
7:        Compute the constrained least square problem
θhtargminθhLθτ[t1][ωhτf(ϕ(chτ)θh)]2.\theta_{h}^{t}\leftarrow\mathop{\mathrm{argmin}}_{\|\theta_{h}\|\leq L_{\theta}}\sum_{\tau\in[t-1]}\big{[}\omega_{h}^{\tau}-f(\phi(c_{h}^{\tau})^{\top}\theta_{h})\big{]}^{2}.
8:        Calculate Σht=Φ2Idϕ+τ[t1]ϕ(chτ)ϕ(chτ)\Sigma_{h}^{t}=\Phi^{2}I_{d_{\phi}}+\sum_{\tau\in[t-1]}\phi(c_{h}^{\tau})\phi(c_{h}^{\tau})^{\top}. Update htBΣht(θht,β)\mathcal{B}_{h}^{t}\leftarrow{\mathrm{B}}_{\Sigma_{h}^{t}}(\theta_{h}^{t},\beta).
9:        Set μht(|c)\mu_{h}^{t}(\cdot|c) to the distribution of f(ϕ(c)θht)+zhf\big{(}\phi(c)^{\top}\theta_{h}^{t}\big{)}+z_{h}.
10:        Calculate
Γht\displaystyle\Gamma_{h}^{t} =λIdψ+τ[t1]ψ(shτ,ωhτ,ahτ)ψ(shτ,ωhτ,ahτ),\displaystyle=\lambda I_{d_{\psi}}+\sum_{\tau\in[t-1]}\psi(s_{h}^{\tau},\omega_{h}^{\tau},a_{h}^{\tau})\psi(s_{h}^{\tau},\omega_{h}^{\tau},a_{h}^{\tau})^{\top},
ιht\displaystyle\iota_{h}^{t} =τ[t1]ψ(shτ,ωhτ,ahτ)[vhτ+Vh+1t(sh+1τ;Ct)]\displaystyle=\sum_{\tau\in[t-1]}\psi(s_{h}^{\tau},\omega_{h}^{\tau},a_{h}^{\tau})[v_{h}^{\tau}+V_{h+1}^{t}(s_{h+1}^{\tau};C^{t})]
11:        Update qht(Γht)1ιhtq_{h}^{t}\leftarrow(\Gamma_{h}^{t})^{-1}\iota_{h}^{t}.
12:        Set {Qht(,,;Ct)min{ψ(,,)qht+ρψ(,,)(Γht)1,H},Vht(;Ct)maxπhPers(μht,uh)Qht,μhtπhΩ×𝒜(;Ct).\begin{cases}Q_{h}^{t}(\cdot,\cdot,\cdot;C^{t})\leftarrow\min\{\psi(\cdot,\cdot,\cdot)^{\top}q_{h}^{t}+\rho\|\psi(\cdot,\cdot,\cdot)\|_{(\Gamma_{h}^{t})^{-1}},H\},&\\ V_{h}^{t}(\cdot;C^{t})\leftarrow\max_{\pi_{h}\in\operatorname{Pers}(\mu_{\mathcal{B}_{h}^{t}},u_{h})}\big{\langle}Q_{h}^{t},\mu_{h}^{t}\otimes\pi_{h}\big{\rangle}_{\Omega\times\mathcal{A}}(\cdot;C^{t}).&\end{cases}
13:     end for
14:     for step h=1,,Hh=1,\ldots,H do
15:        Choose πhtargmaxπhPers(μht,uh)Qht,μhtπhΩ×𝒜(sht;Ct)\pi_{h}^{t}\in\arg\max_{\pi_{h}\in\operatorname{Pers}(\mu_{\mathcal{B}_{h}^{t}},u_{h})}\big{\langle}Q_{h}^{t},\mu_{h}^{t}\otimes\pi_{h}\big{\rangle}_{\Omega\times\mathcal{A}}(s_{h}^{t};C^{t}).
16:     end for
17:     Execute πt\pi^{t} to sample a trajectory {(sht,ωht,aht,vht)}h[H]\{(s_{h}^{t},\omega_{h}^{t},a_{h}^{t},v_{h}^{t})\}_{h\in[H]}.
18:  end for

A.2 Proof of Lemma 6.5

Proof.

We prove the lemma with an explicit construction of a signaling scheme that is robustly persuasive for any prior in B(μ,ϵ){\mathrm{B}}(\mu,\epsilon) and achieves expected utility of at least maxπPers(μ,u)Q,μπΩ×𝒜(s)Hϵ/(p0D)\max_{\pi\in\operatorname{Pers}(\mu,u)}\big{\langle}Q,\mu\otimes\pi\big{\rangle}_{\Omega\times\mathcal{A}}(s)-H\epsilon/(p_{0}D). To simplify the notation, we omit ss in uu, QQ and 𝒲\mathcal{W}.

Let π=argmaxπPers(μ,u)Q,μπΩ×𝒜\pi^{*}=\arg\max_{\pi\in\operatorname{Pers}(\mu,u)}\big{\langle}Q,\mu\otimes\pi\big{\rangle}_{\Omega\times\mathcal{A}} be a direct scheme without loss of generality [26]. For each a𝒜a\in\mathcal{A}, let μa()μ()π(a|)\mu_{a}(\cdot)\coloneqq\mu(\cdot)\odot\pi^{*}(a|\cdot) denote the (unnormalized) posterior of the outcome given that action aa is recommended by π\pi^{*}, where \odot denotes the Hadamard product. (In this proof, we work directly with the posterior without normalization, i.e., the kernel, to simplify our notation and derivations, because ωΩμa(ω)[u(ω,a)u(ω,a)]0ωΩμa(ω)ωΩμa(ω)[u(ω,a)u(ω,a)]0\int_{\omega\in\Omega}\mu_{a}(\omega)\left[u(\omega,a)-u(\omega,a^{\prime})\right]\geq 0\iff\int_{\omega\in\Omega}\frac{\mu_{a}(\omega)}{\int_{\omega\in\Omega}\mu_{a}(\omega)}\left[u(\omega,a)-u(\omega,a^{\prime})\right]\geq 0.) The prior can thus be decomposed as μ()=a𝒜μa()\mu(\cdot)=\sum_{a\in\mathcal{A}}\mu_{a}(\cdot). Since π\pi^{*} is persuasive, we know ωΩμa(ω)[u(ω,a)u(ω,a)]0,a𝒜.\int_{\omega\in\Omega}\mu_{a}(\omega)\left[u(\omega,a)-u(\omega,a^{\prime})\right]\geq 0,\forall a^{\prime}\in\mathcal{A}.

Let π0\pi^{0} be the fully revealing signaling scheme that always recommends (signals) the action that maximizes the receivers' utility at the realized outcome. For each a𝒜a\in\mathcal{A}, let ηa()μ()π0(a|)\eta_{a}(\cdot)\coloneqq\mu(\cdot)\odot\pi^{0}(a|\cdot) denote the posterior of the outcome given that action aa is recommended by π0\pi^{0}, so the prior can be decomposed as μ()=a𝒜ηa()\mu(\cdot)=\sum_{a\in\mathcal{A}}\eta_{a}(\cdot). By the regularity condition, we have

ωΩηa(ω)[u(ω,a)u(ω,a)]ω𝒲a(D)ηa(ω)[u(ω,a)u(ω,a)]p0D,a𝒜.\int_{\omega\in\Omega}\eta_{a}(\omega)\left[u(\omega,a)-u(\omega,a^{\prime})\right]\geq\int_{\omega\in\mathcal{W}_{a}(D)}\eta_{a}(\omega)\left[u(\omega,a)-u(\omega,a^{\prime})\right]\geq p_{0}D,\quad\forall a^{\prime}\in\mathcal{A}.

We now show that the signaling scheme π=(1δ)π+δπ0\pi^{\prime}=(1-\delta)\pi^{*}+\delta\pi^{0} with δ=ϵp0D\delta=\frac{\epsilon}{p_{0}D} is persuasive for any prior μ~B(μ,ϵ)\widetilde{\mu}\in{\mathrm{B}}(\mu,\epsilon). One simple way to interpret this "compound" signaling scheme is to follow π\pi^{*} with probability (1δ)(1-\delta) and follow π0\pi^{0} with probability δ\delta. Let μa\mu^{\prime}_{a} and μ~a\widetilde{\mu}_{a} be the outcome posteriors of π\pi^{\prime} recommending action aa under the true prior μ\mu and the perturbed prior μ~\widetilde{\mu}, respectively, so that μa()=μ()π(a|)=(1δ)μa()+δηa()\mu^{\prime}_{a}(\cdot)=\mu(\cdot)\odot\pi^{\prime}(a|\cdot)=(1-\delta)\mu_{a}(\cdot)+\delta\eta_{a}(\cdot) and μ~a()=μ~()π(a|)\widetilde{\mu}_{a}(\cdot)=\widetilde{\mu}(\cdot)\odot\pi^{\prime}(a|\cdot). By the definition of persuasiveness, we need to show that for any recommended action (signal from π\pi^{\prime}) a𝒜a\in\mathcal{A}, the action aa maximizes the receiver's utility under the posterior μ~a\widetilde{\mu}_{a}. This follows from the decomposition below,

ωΩμ~a[u(ω,a)u(ω,a)]\displaystyle\int_{\omega\in\Omega}\widetilde{\mu}_{a}\cdot\left[u(\omega,a)-u(\omega,a^{\prime})\right]
\displaystyle\geq ωΩμa[u(ω,a)u(ω,a)]μ~aμa1\displaystyle\int_{\omega\in\Omega}\mu^{\prime}_{a}\cdot\left[u(\omega,a)-u(\omega,a^{\prime})\right]-\left\|\widetilde{\mu}_{a}-\mu^{\prime}_{a}\right\|_{1}
=\displaystyle= ωΩ[(1δ)μa(ω)+δηa(ω)][u(ω,a)u(ω,a)]μ~aμa1\displaystyle\int_{\omega\in\Omega}\left[(1-\delta)\mu_{a}(\omega)+\delta\eta_{a}(\omega)\right]\cdot\left[u(\omega,a)-u(\omega,a^{\prime})\right]-\left\|\widetilde{\mu}_{a}-\mu^{\prime}_{a}\right\|_{1}
=\displaystyle= ωΩ(1δ)μa(ω)[u(ω,a)u(ω,a)]+ωΩδηa(ω)[u(ω,a)u(ω,a)]μ~aμa1\displaystyle\int_{\omega\in\Omega}(1-\delta)\mu_{a}(\omega)\left[u(\omega,a)-u(\omega,a^{\prime})\right]+\int_{\omega\in\Omega}\delta\eta_{a}(\omega)\left[u(\omega,a)-u(\omega,a^{\prime})\right]-\left\|\widetilde{\mu}_{a}-\mu^{\prime}_{a}\right\|_{1}
\displaystyle\geq δp0Dμ~aμa1\delta p_{0}D-\left\|\widetilde{\mu}_{a}-\mu^{\prime}_{a}\right\|_{1}
=\displaystyle= ϵμ~aμa10.\epsilon-\left\|\widetilde{\mu}_{a}-\mu^{\prime}_{a}\right\|_{1}\geq 0.

The first inequality is by the fact that u(ω,a)[0,1]u(\omega,a)\in[0,1] for any ω,a\omega,a and thus ωΩ(μ~aμa)[u(ω,a)u(ω,a)]μ~aμa1\int_{\omega\in\Omega}(\widetilde{\mu}_{a}-\mu^{\prime}_{a})\cdot\left[u(\omega,a)-u(\omega,a^{\prime})\right]\leq\left\|\widetilde{\mu}_{a}-\mu^{\prime}_{a}\right\|_{1}. The first equality is from the definition μa=(1δ)μa(ω)+δηa(ω)\mu^{\prime}_{a}=(1-\delta)\mu_{a}(\omega)+\delta\eta_{a}(\omega), and the second equality splits the integral by linearity. The second inequality uses the persuasiveness of π\pi^{*} for the first integral and the regularity bound on ηa\eta_{a}, induced by the fully revealing scheme π0\pi^{0}, for the second. The last step is by the fact that μ~aμa1=(μ~μ)π(a|)1μ~μ1ϵ\left\|\widetilde{\mu}_{a}-\mu^{\prime}_{a}\right\|_{1}=\left\|(\widetilde{\mu}-\mu)\odot\pi^{\prime}(a|\cdot)\right\|_{1}\leq\left\|\widetilde{\mu}-\mu\right\|_{1}\leq\epsilon, since π(a|)1\left\|\pi^{\prime}(a|\cdot)\right\|_{\infty}\leq 1, together with δp0D=ϵ\delta p_{0}D=\epsilon.

It remains to show the expected utility under signaling scheme π\pi^{\prime} is at least Q,μπΩ×𝒜Hϵ/(p0D)\big{\langle}Q,\mu\otimes\pi^{*}\big{\rangle}_{\Omega\times\mathcal{A}}-H\epsilon/(p_{0}D). This is due to the following inequalities,

Q,μπΩ×𝒜Q,μπΩ×𝒜\displaystyle\big{\langle}Q,\mu\otimes\pi^{\prime}\big{\rangle}_{\Omega\times\mathcal{A}}-\big{\langle}Q,\mu\otimes\pi^{*}\big{\rangle}_{\Omega\times\mathcal{A}} =ωΩ,a𝒜μ(ω)[π(a|ω)π(a|ω)]Q(ω,a)\displaystyle=\int_{\omega\in\Omega,a\in\mathcal{A}}\mu(\omega)\left[\pi^{\prime}(a|\omega)-\pi^{*}(a|\omega)\right]Q(\omega,a)
=ωΩ,a𝒜μ(ω)[δπ0(a|ω)δπ(a|ω)]Q(ω,a)\displaystyle=\int_{\omega\in\Omega,a\in\mathcal{A}}\mu(\omega)\left[\delta\pi^{0}(a|\omega)-\delta\pi^{*}(a|\omega)\right]Q(\omega,a)
δωΩ,a𝒜μ(ω)π(a|ω)Q(ω,a)\displaystyle\geq-\delta\int_{\omega\in\Omega,a\in\mathcal{A}}\mu(\omega)\pi^{*}(a|\omega)Q(\omega,a)
Hδ=Hϵp0D.\displaystyle\geq-H\delta=-\frac{H\epsilon}{p_{0}D}.

The first and second equalities use the definition of π\pi^{\prime} and linearity. The third step drops the nonnegative term involving π0\pi^{0}, and the last inequality uses the fact that 𝔼[Q(ω,a)][0,H]\mathbb{E}[Q(\omega,a)]\in[0,H]. ∎

A.3 Properties for the Robustness Gap

We bounded the robustness gap Gap\operatorname{Gap} for the ground-truth prior in Lemma 6.5. For the prior estimate μht\mu_{h}^{t} given in Algorithm 2, which may not satisfy the regularity condition, we have a corresponding bound on the robustness gap.

Lemma A.1.

For any h[H],t[T]h\in[H],t\in[T] and s𝒮s\in{\mathcal{S}}, on the event of {θhht}\{\theta_{h}^{*}\in\mathcal{B}_{h}^{t}\}, we have

Gap(s,μht,B(μht,ϵht);Qht)2Hϵhtp0D.\operatorname{Gap}(s,\mu_{h}^{t},{\mathrm{B}}(\mu_{h}^{t},\epsilon_{h}^{t});Q_{h}^{t})\leq\frac{2H\epsilon_{h}^{t}}{p_{0}D}.
Proof.

For any fixed action a𝒜a\in\mathcal{A}, on the given event, we have

ωμht()[ω𝒲s,a(D)]\displaystyle\mathbb{P}_{\omega\sim\mu_{h}^{t}(\cdot)}[\omega\in\mathcal{W}_{s,a}(D)] =ωΩμht(ω)𝕀(ω𝒲s,a(D))dω\displaystyle=\int_{\omega\in\Omega}\mu_{h}^{t}(\omega)\mathbb{I}(\omega\in\mathcal{W}_{s,a}(D)){\mathrm{d}}\omega
=ωΩμh(ω)𝕀(ω𝒲s,a(D))dω+ωΩ[μht(ω)μh(ω)]𝕀(ω𝒲s,a(D))dω\displaystyle=\int_{\omega\in\Omega}\mu_{h}^{*}(\omega)\mathbb{I}(\omega\in\mathcal{W}_{s,a}(D)){\mathrm{d}}\omega+\int_{\omega\in\Omega}[\mu_{h}^{t}(\omega)-\mu_{h}^{*}(\omega)]\mathbb{I}(\omega\in\mathcal{W}_{s,a}(D)){\mathrm{d}}\omega
ωΩμh(ω)𝕀(ω𝒲s,a(D))dωμhtμh1\displaystyle\geq\int_{\omega\in\Omega}\mu_{h}^{*}(\omega)\mathbb{I}(\omega\in\mathcal{W}_{s,a}(D)){\mathrm{d}}\omega-\|\mu_{h}^{t}-\mu_{h}^{*}\|_{1}
p0ϵht,\displaystyle\geq p_{0}-\epsilon_{h}^{t},

where 𝕀\mathbb{I} is the indicator function. The last inequality uses the regularity condition for the true prior μh\mu_{h}^{*} together with μhtμh1ϵht\|\mu_{h}^{t}-\mu_{h}^{*}\|_{1}\leq\epsilon_{h}^{t} on the given event. Hence, for ϵhtp0/2\epsilon_{h}^{t}\leq p_{0}/2, we have ωμht()[ω𝒲s,a(D)]p0/2\mathbb{P}_{\omega\sim\mu_{h}^{t}(\cdot)}[\omega\in\mathcal{W}_{s,a}(D)]\geq p_{0}/2. Then, applying Lemma 6.5 with p0/2p_{0}/2 in place of p0p_{0}, we arrive at

Gap(s,μht,B(μht,ϵht);Qht)2Hϵhtp0D.\operatorname{Gap}(s,\mu_{h}^{t},{\mathrm{B}}(\mu_{h}^{t},\epsilon_{h}^{t});Q_{h}^{t})\leq\frac{2H\epsilon_{h}^{t}}{p_{0}D}.

For ϵht>p0/2\epsilon_{h}^{t}>p_{0}/2, the bound holds trivially since 2Hϵht/(p0D)>H2H\epsilon_{h}^{t}/(p_{0}D)>H. ∎

The robustness gap Gap\operatorname{Gap} defined in equation (6.4) measures the loss in value functions for being robustly persuasive for a subset of priors. In the following lemma, we show that we can also use Gap\operatorname{Gap} to bound the difference in expected optimal QQ-functions between different priors.

Lemma A.2.

Denote 1,2B(μ1,μ1μ21)\mathcal{B}_{1,2}\coloneqq{\mathrm{B}}\big{(}\mu_{1},\|\mu_{1}-\mu_{2}\|_{1}\big{)} for any fixed state s𝒮s\in{\mathcal{S}} and μ1,μ2Δ(Ω)\mu_{1},\mu_{2}\in\Delta(\Omega). Then given a known QQ-function Q(,,)Q(\cdot,\cdot,\cdot), we have

maxπ1Pers(μ1,u)Q,μ1π1Ω×𝒜(s)maxπ2Pers(μ2,u)Q,μ2π2Ω×𝒜(s)Gap(s,μ1,1,2;Q)+H2μ1μ21.\max_{\pi_{1}\in\operatorname{Pers}(\mu_{1},u)}\big{\langle}Q,\mu_{1}\otimes\pi_{1}\big{\rangle}_{\Omega\times\mathcal{A}}(s)-\max_{\pi_{2}\in\operatorname{Pers}(\mu_{2},u)}\big{\langle}Q,\mu_{2}\otimes\pi_{2}\big{\rangle}_{\Omega\times\mathcal{A}}(s)\leq\operatorname{Gap}(s,\mu_{1},\mathcal{B}_{1,2};Q)+\frac{H}{2}\|\mu_{1}-\mu_{2}\|_{1}.
Proof.

Fix μ1,μ2Δ(Ω)\mu_{1},\mu_{2}\in\Delta(\Omega) and respectively choose the optimal signaling schemes

πi=argmaxπiPers(μi,u)Q,μiπiΩ×𝒜(s),i=1,2.\pi_{i}=\mathop{\mathrm{argmax}}_{\pi_{i}\in\operatorname{Pers}(\mu_{i},u)}\big{\langle}Q,\mu_{i}\otimes\pi_{i}\big{\rangle}_{\Omega\times\mathcal{A}}(s),~{}i=1,2.

Then, among all the signaling schemes that are persuasive for all priors in 1,2\mathcal{B}_{1,2}, let π3\pi_{3} be the one that maximizes Q,μ1πΩ×𝒜(s)\big{\langle}Q,\mu_{1}\otimes\pi\big{\rangle}_{\Omega\times\mathcal{A}}(s). Since π3\pi_{3} is persuasive for μ2\mu_{2}, we know Q,μ2π2Ω×𝒜(s)Q,μ2π3Ω×𝒜(s)\big{\langle}Q,\mu_{2}\otimes\pi_{2}\big{\rangle}_{\Omega\times\mathcal{A}}(s)\geq\big{\langle}Q,\mu_{2}\otimes\pi_{3}\big{\rangle}_{\Omega\times\mathcal{A}}(s) by definition. Therefore, we have

Q,μ1π1μ2π2Ω×𝒜(s)\displaystyle\big{\langle}Q,\mu_{1}\otimes\pi_{1}-\mu_{2}\otimes\pi_{2}\big{\rangle}_{\Omega\times\mathcal{A}}(s) Q,μ1π1μ2π3Ω×𝒜(s)\displaystyle\leq\big{\langle}Q,\mu_{1}\otimes\pi_{1}-\mu_{2}\otimes\pi_{3}\big{\rangle}_{\Omega\times\mathcal{A}}(s)
Q,μ1π1μ1π3Ω×𝒜(s)+Q,μ1π3μ2π3Ω×𝒜(s)\displaystyle\leq\big{\langle}Q,\mu_{1}\otimes\pi_{1}-\mu_{1}\otimes\pi_{3}\big{\rangle}_{\Omega\times\mathcal{A}}(s)+\big{\langle}Q,\mu_{1}\otimes\pi_{3}-\mu_{2}\otimes\pi_{3}\big{\rangle}_{\Omega\times\mathcal{A}}(s)
Gap(s,μ1,1,2;Q)+H2μ1μ21.\displaystyle\leq\operatorname{Gap}(s,\mu_{1},\mathcal{B}_{1,2};Q)+\frac{H}{2}\|\mu_{1}-\mu_{2}\|_{1}.

The last inequality uses the definition of Gap\operatorname{Gap} and Lemma A.3. ∎

Lemma A.3.

Given a QQ-function Q(,,)[0,H]Q(\cdot,\cdot,\cdot)\in[0,H], for any fixed state s𝒮s\in{\mathcal{S}}, μ1,μ2Δ(Ω)\mu_{1},\mu_{2}\in\Delta(\Omega) and any signaling scheme π\pi, we have

|Q,μ1πΩ×𝒜(s)Q,μ2πΩ×𝒜(s)|H2μ1μ21.\big{|}\big{\langle}Q,\mu_{1}\otimes\pi\big{\rangle}_{\Omega\times\mathcal{A}}(s)-\big{\langle}Q,\mu_{2}\otimes\pi\big{\rangle}_{\Omega\times\mathcal{A}}(s)\big{|}\leq\frac{H}{2}\|\mu_{1}-\mu_{2}\|_{1}.
Proof.

Fix μ1(),μ2()Δ(Ω)\mu_{1}(\cdot),\mu_{2}(\cdot)\in\Delta(\Omega). For any xx\in\mathbb{R}, we have

|Q,μ1πμ2πΩ×𝒜(s)|\displaystyle\big{|}\big{\langle}Q,\mu_{1}\otimes\pi-\mu_{2}\otimes\pi\big{\rangle}_{\Omega\times\mathcal{A}}(s)\big{|} =|ωΩ[μ1(ω)μ2(ω)][a𝒜π(a|s,ω)Q(s,ω,a)dax]dω|\displaystyle=\bigg{|}\int_{\omega\in\Omega}[\mu_{1}(\omega)-\mu_{2}(\omega)]\bigg{[}\int_{a\in\mathcal{A}}\pi(a|s,\omega)Q(s,\omega,a){\mathrm{d}}a-x\bigg{]}{\mathrm{d}}\omega\bigg{|}
μ1μ21supωΩ|a𝒜π(a|s,ω)Q(s,ω,a)dax|,\displaystyle\leq\|\mu_{1}-\mu_{2}\|_{1}\cdot\sup_{\omega\in\Omega}\bigg{|}\int_{a\in\mathcal{A}}\pi(a|s,\omega)Q(s,\omega,a){\mathrm{d}}a-x\bigg{|},

where the last inequality is derived from Hölder's inequality, and the equality holds since ωΩ[μ1(ω)μ2(ω)]xdω=0\int_{\omega\in\Omega}[\mu_{1}(\omega)-\mu_{2}(\omega)]\,x\,{\mathrm{d}}\omega=0 for any xx\in\mathbb{R}. With the QQ-function taking values in [0,H][0,H], we can set x=H/2x=H/2 so that the supremum is at most H/2H/2, which yields the claimed bound. ∎

A.4 Proof of Lemma 6.1

Proof.

Before presenting the proof, we first define two operators 𝕁h\mathbb{J}_{h}^{*} and 𝕁ht\mathbb{J}_{h}^{t}:

(𝕁hf)(s;C)=f,μhπhΩ×𝒜(s;C),(𝕁htf)(s;C)=f,μhtπhtΩ×𝒜(s;C),(\mathbb{J}_{h}^{*}f)(s;C)=\langle f,\mu_{h}^{*}\otimes\pi_{h}^{*}\rangle_{\Omega\times\mathcal{A}}(s;C),~{}~{}~{}~{}(\mathbb{J}_{h}^{t}f)(s;C)=\langle f,\mu_{h}^{t}\otimes\pi_{h}^{t}\rangle_{\Omega\times\mathcal{A}}(s;C), (A.1)

for any h[H],t[T]h\in[H],t\in[T] and any function f(,,;C):𝒮×Ω×𝒜f(\cdot,\cdot,\cdot;C):{\mathcal{S}}\times\Omega\times\mathcal{A}\to\mathbb{R} under the context CC. Moreover, for any h[H],t[T]h\in[H],t\in[T] and any state s𝒮s\in{\mathcal{S}}, we define

ξht(s;C)=(𝕁hQht)(s;C)(𝕁htQht)(s;C)=Qht,μhπhμhtπhtΩ×𝒜(s;C).\xi_{h}^{t}(s;C)=(\mathbb{J}_{h}^{*}Q_{h}^{t})(s;C)-(\mathbb{J}_{h}^{t}Q_{h}^{t})(s;C)=\langle Q_{h}^{t},\mu_{h}^{*}\otimes\pi_{h}^{*}-\mu_{h}^{t}\otimes\pi_{h}^{t}\rangle_{\Omega\times\mathcal{A}}(s;C). (A.2)

After introducing these notations, we decompose the instantaneous regret at the tt-th episode into two terms,

V1(s1t;Ct)V1πt(s1t;Ct)=V1(s1t;Ct)V1t(s1t;Ct)𝕡1+V1t(s1t;Ct)V1πt(s1t;Ct)𝕡2.V_{1}^{*}(s_{1}^{t};C^{t})-V_{1}^{\pi^{t}}(s_{1}^{t};C^{t})=\underbrace{V_{1}^{*}(s_{1}^{t};C^{t})-V_{1}^{t}(s_{1}^{t};C^{t})}_{\mathbb{p}_{1}}+\underbrace{V_{1}^{t}(s_{1}^{t};C^{t})-V_{1}^{\pi^{t}}(s_{1}^{t};C^{t})}_{\mathbb{p}_{2}}. (A.3)

Then we consider these two terms separately. By the definition of value functions in (3.1) and the operator 𝕁h\mathbb{J}_{h}^{*} in (A.1), we have Vh=𝕁hQhV_{h}^{*}=\mathbb{J}_{h}^{*}Q_{h}^{*}. By the construction of Algorithm 2, we have Vht=𝕁htQhtV_{h}^{t}=\mathbb{J}_{h}^{t}Q_{h}^{t} similarly. Thus, for the first term 𝕡1\mathbb{p}_{1} defined in equation (A.3), using ξht\xi_{h}^{t} defined in (A.2), for any h[H],t[T]h\in[H],t\in[T], we have

\begin{align*}
V_{h}^{*}-V_{h}^{t}&=\mathbb{J}_{h}^{*}Q_{h}^{*}-\mathbb{J}_{h}^{t}Q_{h}^{t}=(\mathbb{J}_{h}^{*}Q_{h}^{*}-\mathbb{J}_{h}^{*}Q_{h}^{t})+(\mathbb{J}_{h}^{*}Q_{h}^{t}-\mathbb{J}_{h}^{t}Q_{h}^{t})\\
&=\mathbb{J}_{h}^{*}(Q_{h}^{*}-Q_{h}^{t})+\xi_{h}^{t}.
\end{align*}

Next, by the definition of the temporal-difference error $\delta_{h}^{t}$ in (6.1) and the Bellman optimality equation in (3.2), we have

\[
Q_{h}^{*}-Q_{h}^{t}=(v_{h}+P_{h}V_{h+1}^{*})-(v_{h}+P_{h}V_{h+1}^{t}-\delta_{h}^{t})=P_{h}(V_{h+1}^{*}-V_{h+1}^{t})+\delta_{h}^{t}.
\]

Hence we get

\[
V_{h}^{*}-V_{h}^{t}=\mathbb{J}_{h}^{*}P_{h}(V_{h+1}^{*}-V_{h+1}^{t})+\mathbb{J}_{h}^{*}\delta_{h}^{t}+\xi_{h}^{t}.
\]

Then, by recursively applying the above formula, we have

\[
V_{1}^{*}-V_{1}^{t}=\bigg(\prod_{h\in[H]}\mathbb{J}_{h}^{*}P_{h}\bigg)(V_{H+1}^{*}-V_{H+1}^{t})+\sum_{h\in[H]}\bigg(\prod_{i\in[h]}\mathbb{J}_{i}^{*}P_{i}\bigg)\mathbb{J}_{h}^{*}\delta_{h}^{t}+\sum_{h\in[H]}\bigg(\prod_{i\in[h]}\mathbb{J}_{i}^{*}P_{i}\bigg)\xi_{h}^{t}.
\]

By the definition of $\xi_{h}^{t}$ in equation (A.2) and $\zeta_{t,h}^{3}$ in equation (6.2), we get

\[
\sum_{h\in[H]}\bigg(\prod_{i\in[h]}\mathbb{J}_{i}^{*}P_{i}\bigg)\xi_{h}^{t}(s_{h};C^{t})=\sum_{h\in[H]}\mathbb{E}_{\mu^{*},\pi^{*}}\big[\langle Q_{h}^{t},\mu_{h}^{*}\otimes\pi_{h}^{*}-\mu_{h}^{t}\otimes\pi_{h}^{t}\rangle_{\Omega\times\mathcal{A}}(s_{h};C^{t})\,\big|\,s_{1}=s_{1}^{t}\big].
\]

Notice that $V_{H+1}^{*}=V_{H+1}^{t}=0$. Therefore, for any episode $t\in[T]$, we have

\begin{align*}
V_{1}^{*}(s_{1}^{t};C^{t})-V_{1}^{t}(s_{1}^{t};C^{t})=&\sum_{h\in[H]}\mathbb{E}_{\mu^{*},\pi^{*}}\big[\langle Q_{h}^{t},\mu_{h}^{*}\otimes\pi_{h}^{*}-\mu_{h}^{t}\otimes\pi_{h}^{t}\rangle_{\Omega\times\mathcal{A}}(s_{h};C^{t})\,\big|\,s_{1}=s_{1}^{t}\big]\\
&+\sum_{h\in[H]}\mathbb{E}_{\mu^{*},\pi^{*}}\big[\delta_{h}^{t}(s_{h},\omega_{h},a_{h})\,\big|\,s_{1}=s_{1}^{t}\big].
\end{align*}

We now bound the second term $\mathbb{p}_{2}$ in equation (A.3). By the definition of the temporal-difference error $\delta_{h}^{t}$ in (6.1), for any $h\in[H]$, $t\in[T]$, we note that

\begin{align*}
\delta_{h}^{t}(s_{h}^{t},\omega_{h}^{t},a_{h}^{t})&=(v_{h}^{t}+P_{h}V_{h+1}^{t}-Q_{h}^{t})(s_{h}^{t},\omega_{h}^{t},a_{h}^{t};C^{t})\\
&=(v_{h}^{t}+P_{h}V_{h+1}^{t}-Q_{h}^{\pi^{t}})(s_{h}^{t},\omega_{h}^{t},a_{h}^{t};C^{t})+(Q_{h}^{\pi^{t}}-Q_{h}^{t})(s_{h}^{t},\omega_{h}^{t},a_{h}^{t};C^{t})\\
&=(P_{h}V_{h+1}^{t}-P_{h}V_{h+1}^{\pi^{t}})(s_{h}^{t},\omega_{h}^{t},a_{h}^{t})+(Q_{h}^{\pi^{t}}-Q_{h}^{t})(s_{h}^{t},\omega_{h}^{t},a_{h}^{t}),
\end{align*}

where the last equality follows from the Bellman equation (3.1). Furthermore, using $\zeta_{t,h}^{1}$ and $\zeta_{t,h}^{2}$ defined in (6.2), we have

\begin{align*}
V_{h}^{t}(s_{h}^{t};C^{t})-V_{h}^{\pi^{t}}(s_{h}^{t};C^{t})={}&(V_{h}^{t}-V_{h}^{\pi^{t}})(s_{h}^{t};C^{t})-\delta_{h}^{t}(s_{h}^{t},\omega_{h}^{t},a_{h}^{t})+(Q_{h}^{\pi^{t}}-Q_{h}^{t})(s_{h}^{t},\omega_{h}^{t},a_{h}^{t};C^{t})\\
&+(P_{h}V_{h+1}^{t}-P_{h}V_{h+1}^{\pi^{t}})(s_{h}^{t},\omega_{h}^{t},a_{h}^{t};C^{t})\\
={}&(V_{h}^{t}-\widetilde{V}_{h}^{t})(s_{h}^{t};C^{t})-\delta_{h}^{t}(s_{h}^{t},\omega_{h}^{t},a_{h}^{t})+(\widetilde{V}_{h}^{t}-V_{h}^{\pi^{t}})(s_{h}^{t};C^{t})+(Q_{h}^{\pi^{t}}-Q_{h}^{t})(s_{h}^{t},\omega_{h}^{t},a_{h}^{t};C^{t})\\
&+\big(P_{h}(V_{h+1}^{t}-V_{h+1}^{\pi^{t}})\big)(s_{h}^{t},\omega_{h}^{t},a_{h}^{t};C^{t})-(V_{h+1}^{t}-V_{h+1}^{\pi^{t}})(s_{h+1}^{t};C^{t})+(V_{h+1}^{t}-V_{h+1}^{\pi^{t}})(s_{h+1}^{t};C^{t})\\
={}&\big[V_{h+1}^{t}(s_{h+1}^{t};C^{t})-V_{h+1}^{\pi^{t}}(s_{h+1}^{t};C^{t})\big]+\big[V_{h}^{t}(s_{h}^{t};C^{t})-\widetilde{V}_{h}^{t}(s_{h}^{t};C^{t})\big]-\delta_{h}^{t}(s_{h}^{t},\omega_{h}^{t},a_{h}^{t})+\zeta_{t,h}^{1}+\zeta_{t,h}^{2}.
\end{align*}

Applying the above equation recursively, we get that

\begin{align*}
V_{1}^{t}(s_{1}^{t};C^{t})-V_{1}^{\pi^{t}}(s_{1}^{t};C^{t})={}&V_{H+1}^{t}(s_{H+1}^{t};C^{t})-V_{H+1}^{\pi^{t}}(s_{H+1}^{t};C^{t})+\sum_{h\in[H]}\big[V_{h}^{t}(s_{h}^{t};C^{t})-\widetilde{V}_{h}^{t}(s_{h}^{t};C^{t})\big]\\
&-\sum_{h\in[H]}\delta_{h}^{t}(s_{h}^{t},\omega_{h}^{t},a_{h}^{t})+\sum_{h\in[H]}(\zeta_{t,h}^{1}+\zeta_{t,h}^{2}).
\end{align*}

Again, by the Bellman equation (3.1), we have

\[
V_{h}^{t}(s_{h}^{t};C^{t})-\widetilde{V}_{h}^{t}(s_{h}^{t};C^{t})=\big\langle Q_{h}^{t},(\mu_{h}^{t}-\mu_{h}^{*})\otimes\pi_{h}^{t}\big\rangle_{\Omega\times\mathcal{A}}(s_{h}^{t};C^{t}).
\]

Then we use $V_{H+1}^{t}=V_{H+1}^{\pi^{t}}=0$ to simplify the decomposition to the following form:

\begin{align*}
V_{1}^{t}(s_{1}^{t};C^{t})-V_{1}^{\pi^{t}}(s_{1}^{t};C^{t})={}&\sum_{h\in[H]}\big\langle Q_{h}^{t},(\mu_{h}^{t}-\mu_{h}^{*})\otimes\pi_{h}^{t}\big\rangle_{\Omega\times\mathcal{A}}(s_{h}^{t};C^{t})\\
&-\sum_{h\in[H]}\delta_{h}^{t}(s_{h}^{t},\omega_{h}^{t},a_{h}^{t})+\sum_{h\in[H]}(\zeta_{t,h}^{1}+\zeta_{t,h}^{2}).
\end{align*}

Therefore, combining the decompositions of $\mathbb{p}_{1}$ and $\mathbb{p}_{2}$ and summing over the episodes $t\in[T]$, we obtain

\begin{align*}
\operatorname{Reg}(T,\mu^{*})={}&\sum_{t\in[T]}\big[V_{1}^{*}(s_{1}^{t};C^{t})-V_{1}^{\pi^{t}}(s_{1}^{t};C^{t})\big]\\
={}&\sum_{t\in[T]}\sum_{h\in[H]}\big\{\mathbb{E}_{\mu_{h}^{*},\pi_{h}^{*}}[\delta_{h}^{t}(s_{h},\omega_{h},a_{h})\,|\,s_{1}=s_{1}^{t}]-\delta_{h}^{t}(s_{h}^{t},\omega_{h}^{t},a_{h}^{t})\big\}+\sum_{t\in[T]}\sum_{h\in[H]}(\zeta_{t,h}^{1}+\zeta_{t,h}^{2})\\
&+\sum_{t\in[T]}\sum_{h\in[H]}\mathbb{E}_{\mu_{h}^{*},\pi_{h}^{*}}\big[\big\langle Q_{h}^{t},\mu_{h}^{*}\otimes\pi_{h}^{*}-\mu_{h}^{t}\otimes\pi_{h}^{t}\big\rangle_{\Omega\times\mathcal{A}}(s_{h};C^{t})\,\big|\,s_{1}=s_{1}^{t}\big]\\
&+\sum_{t\in[T]}\sum_{h\in[H]}\big\langle Q_{h}^{t},(\mu_{h}^{t}-\mu_{h}^{*})\otimes\pi_{h}^{t}\big\rangle_{\Omega\times\mathcal{A}}(s_{h}^{t};C^{t}).
\end{align*}

Therefore, we conclude the proof of the lemma. ∎

A.5 Proof of Lemma 6.2

Proof.

In the following lemma, we first bound the difference between the $Q$-function maintained in Algorithm 2 (without the bonus) and the true $Q$-function of any policy $\pi$ by their expected difference at the next step, plus an error term. This error term can be upper bounded by our bonus with high probability. The lemma can be derived from Lemma B.4 in [25] with slight revisions.

Lemma A.4.

Set $\lambda=\max\{1,\Psi^{2}\}$. There exists an absolute constant $c_{\rho}$ such that for $\rho=c_{\rho}d_{\psi}H\sqrt{\iota}$, where $\iota=\log(2d_{\psi}\Psi^{2}T/\delta)$, and for any fixed policy $\pi$, with probability at least $1-\delta/2$, we have for all $s\in{\mathcal{S}}$, $\omega\in\Omega$, $a\in\mathcal{A}$, $h\in[H]$, $t\in[T]$,

\[
\psi(s,\omega,a)^{\top}q_{h}^{t}-Q_{h}^{\pi}(s,\omega,a)=P_{h}(V_{h+1}^{t}-V_{h+1}^{\pi})(s,\omega,a)+\triangle_{h}^{t}(s,\omega,a),
\]

for some $\triangle_{h}^{t}(s,\omega,a)$ that satisfies $|\triangle_{h}^{t}(s,\omega,a)|\leq\rho\|\psi(s,\omega,a)\|_{(\Gamma_{h}^{t})^{-1}}$.

Now we are ready to prove Lemma 6.2. By the definition of $\delta_{h}^{t}$ in (6.1), we have $\delta_{h}^{t}=(v_{h}+P_{h}V_{h+1}^{t}-Q_{h}^{t})=(P_{h}V_{h+1}^{t}-P_{h}V_{h+1}^{\pi^{t}})+(Q_{h}^{\pi^{t}}-Q_{h}^{t})$. Therefore, by the construction of $Q_{h}^{t}$ in Algorithm 2, we obtain that

\begin{align*}
\delta_{h}^{t}(s,\omega,a)&\geq(P_{h}V_{h+1}^{t}-P_{h}V_{h+1}^{\pi^{t}})(s,\omega,a)+Q_{h}^{\pi^{t}}(s,\omega,a)-\big(\psi(s,\omega,a)^{\top}q_{h}^{t}+\rho\|\psi(s,\omega,a)\|_{(\Gamma_{h}^{t})^{-1}}\big)\\
&=-\triangle_{h}^{t}(s,\omega,a)-\rho\|\psi(s,\omega,a)\|_{(\Gamma_{h}^{t})^{-1}}\geq-2\rho\|\psi(s,\omega,a)\|_{(\Gamma_{h}^{t})^{-1}},
\end{align*}

which concludes the proof. ∎

A.6 Proof of Lemma 6.3

Proof.

Denote the optimal signaling schemes corresponding to the true prior $\mu_{h}^{*}$ and the estimated prior $\mu_{h}^{t}$, respectively, as

\[
\pi_{h}^{\prime}=\mathop{\mathrm{argmax}}_{\pi_{h}\in\operatorname{Pers}(\mu_{h}^{*})}\big\langle Q_{h}^{t},\mu_{h}^{*}\otimes\pi_{h}\big\rangle_{\Omega\times\mathcal{A}}(\cdot;C^{t})\qquad\text{and}\qquad\pi_{h}^{\prime\prime}=\mathop{\mathrm{argmax}}_{\pi_{h}\in\operatorname{Pers}(\mu_{h}^{t})}\big\langle Q_{h}^{t},\mu_{h}^{t}\otimes\pi_{h}\big\rangle_{\Omega\times\mathcal{A}}(\cdot;C^{t}),
\]

where the $Q$-function $Q_{h}^{t}$ is given by Algorithm 2. Notably, $\pi_{h}^{\prime}$ is different from the truly optimal signaling scheme $\pi_{h}^{*}$, since $\pi_{h}^{\prime}$ is computed based on the approximate $Q$-function $Q_{h}^{t}$. By definition, we can decompose the difference as follows:

\begin{align*}
\big\langle Q_{h}^{t},\mu_{h}^{*}\otimes\pi_{h}^{*}-\mu_{h}^{t}\otimes\pi_{h}^{t}\big\rangle_{\Omega\times\mathcal{A}}(s_{h};C^{t})={}&\big\langle Q_{h}^{t},\mu_{h}^{*}\otimes\pi_{h}^{*}-\mu_{h}^{*}\otimes\pi_{h}^{\prime}\big\rangle_{\Omega\times\mathcal{A}}(s_{h};C^{t})\tag{A.4}\\
&+\big\langle Q_{h}^{t},\mu_{h}^{*}\otimes\pi_{h}^{\prime}-\mu_{h}^{t}\otimes\pi_{h}^{\prime\prime}\big\rangle_{\Omega\times\mathcal{A}}(s_{h};C^{t})\tag{A.5}\\
&+\big\langle Q_{h}^{t},\mu_{h}^{t}\otimes\pi_{h}^{\prime\prime}-\mu_{h}^{t}\otimes\pi_{h}^{t}\big\rangle_{\Omega\times\mathcal{A}}(s_{h};C^{t}).\tag{A.6}
\end{align*}

Equation (A.4) is always non-positive, since $\pi_{h}^{*}\in\operatorname{Pers}(\mu_{h}^{*})$ while $\pi_{h}^{\prime}$ maximizes $\big\langle Q_{h}^{t},\mu_{h}^{*}\otimes\pi_{h}\big\rangle_{\Omega\times\mathcal{A}}$ over $\operatorname{Pers}(\mu_{h}^{*})$. Applying Lemma A.2 to equation (A.5), we get

\begin{align*}
\big\langle Q_{h}^{t},\mu_{h}^{*}\otimes\pi_{h}^{\prime}-\mu_{h}^{t}\otimes\pi_{h}^{\prime\prime}\big\rangle_{\Omega\times\mathcal{A}}(s_{h};C^{t})\leq{}&\operatorname{Gap}\Big(s_{h},\mu_{h}^{*}(\cdot|c_{h}^{t}),{\mathrm{B}}_{1}\big(\mu_{h}^{*}(\cdot|c_{h}^{t}),\big\|\mu_{h}^{*}(\cdot|c_{h}^{t})-\mu_{h}^{t}(\cdot|c_{h}^{t})\big\|_{1}\big);Q_{h}^{t}\Big)\\
&+\frac{H}{2}\big\|\mu_{h}^{*}(\cdot|c_{h}^{t})-\mu_{h}^{t}(\cdot|c_{h}^{t})\big\|_{1}.
\end{align*}

According to Corollary 6.6, we can bound the right-hand side above in terms of the norm of the feature vector and the radius of the confidence region for $\theta_{h}$:

\[
\big\langle Q_{h}^{t},\mu_{h}^{*}\otimes\pi_{h}^{\prime}-\mu_{h}^{t}\otimes\pi_{h}^{\prime\prime}\big\rangle_{\Omega\times\mathcal{A}}(s_{h};C^{t})\leq\Big(\frac{HL_{\mu}K}{p_{0}D}+\frac{HL_{\mu}K}{2}\Big)\beta\|\phi(c_{h}^{t})\|_{(\Sigma_{h}^{t})^{-1}}.
\]

We also note that equation (A.6) is equal to $\operatorname{Gap}\big(s_{h},\mu_{h}^{t}(\cdot|c_{h}^{t}),\mu_{\mathcal{B}_{h}^{t}}(\cdot|c_{h}^{t});Q_{h}^{t}\big)$. By Lemma A.1, on the event $\{\theta_{h}^{*}\in\mathcal{B}_{h}^{t}\}$, we have

\[
\operatorname{Gap}\big(s_{h},\mu_{h}^{t}(\cdot|c_{h}^{t}),\mu_{\mathcal{B}_{h}^{t}}(\cdot|c_{h}^{t});Q_{h}^{t}\big)\leq\frac{2HL_{\mu}K}{p_{0}D}\beta\|\phi(c_{h}^{t})\|_{(\Sigma_{h}^{t})^{-1}}.
\]

Therefore, on the given event, combining the bounds on (A.4), (A.5) and (A.6), we have

\[
\mathbb{E}_{\mu_{h}^{*},\pi_{h}^{*}}\big[\big\langle Q_{h}^{t},\mu_{h}^{*}\otimes\pi_{h}^{*}-\mu_{h}^{t}\otimes\pi_{h}^{t}\big\rangle_{\Omega\times\mathcal{A}}(s_{h};C^{t})\,\big|\,s_{1}=s_{1}^{t}\big]\leq\Big(\frac{3HL_{\mu}K}{p_{0}D}+\frac{HL_{\mu}K}{2}\Big)\beta\|\phi(c_{h}^{t})\|_{(\Sigma_{h}^{t})^{-1}}.
\]
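Explicitly, the constant in the display above is obtained by adding the three bounds just derived for the terms (A.4), (A.5) and (A.6):
\[
0+\Big(\frac{HL_{\mu}K}{p_{0}D}+\frac{HL_{\mu}K}{2}\Big)\beta\|\phi(c_{h}^{t})\|_{(\Sigma_{h}^{t})^{-1}}+\frac{2HL_{\mu}K}{p_{0}D}\beta\|\phi(c_{h}^{t})\|_{(\Sigma_{h}^{t})^{-1}}=\Big(\frac{3HL_{\mu}K}{p_{0}D}+\frac{HL_{\mu}K}{2}\Big)\beta\|\phi(c_{h}^{t})\|_{(\Sigma_{h}^{t})^{-1}}.
\]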

Summing over $t\in[T]$ and $h\in[H]$, we get

\begin{align*}
\sum_{t\in[T]}\sum_{h\in[H]}\mathbb{E}_{\mu_{h}^{*},\pi_{h}^{*}}\big[\big\langle Q_{h}^{t},\mu_{h}^{*}\otimes\pi_{h}^{*}&-\mu_{h}^{t}\otimes\pi_{h}^{t}\big\rangle_{\Omega\times\mathcal{A}}(s_{h};C^{t})\,\big|\,s_{1}=s_{1}^{t}\big]\\
&\leq\Big(\frac{3HL_{\mu}K}{p_{0}D}+\frac{HL_{\mu}K}{2}\Big)\beta\sum_{t\in[T]}\sum_{h\in[H]}\|\phi(c_{h}^{t})\|_{(\Sigma_{h}^{t})^{-1}}.
\end{align*}

Therefore, we conclude the proof of Lemma 6.3. ∎

A.7 Proof of Lemma 6.4

Proof.

By definition, we can rewrite the difference in Lemma 6.4 as

\begin{align*}
\big\langle Q_{h}^{t},(\mu_{h}^{t}-\mu_{h}^{*})\otimes\pi_{h}^{t}\big\rangle_{\Omega\times\mathcal{A}}(s_{h}^{t};C^{t})&=\int_{\Omega\times\mathcal{A}}\big[\mu_{h}^{t}(\omega|c_{h}^{t})-\mu_{h}^{*}(\omega|c_{h}^{t})\big]\pi_{h}^{t}(s_{h}^{t},\omega,a)Q_{h}^{t}(s_{h}^{t},\omega,a)\,\mathrm{d}a\,\mathrm{d}\omega\\
&=\int_{\Omega}\big[\mu_{h}^{t}(\omega|c_{h}^{t})-\mu_{h}^{*}(\omega|c_{h}^{t})\big]\int_{\mathcal{A}}\pi_{h}^{t}(s_{h}^{t},\omega,a)Q_{h}^{t}(s_{h}^{t},\omega,a)\,\mathrm{d}a\,\mathrm{d}\omega.
\end{align*}

By Hölder's inequality, we have

\[
\Big|\big\langle Q_{h}^{t},(\mu_{h}^{t}-\mu_{h}^{*})\otimes\pi_{h}^{t}\big\rangle_{\Omega\times\mathcal{A}}(s_{h}^{t};C^{t})\Big|\leq\big\|\mu_{h}^{t}(\cdot|c_{h}^{t})-\mu_{h}^{*}(\cdot|c_{h}^{t})\big\|_{1}\sup_{\omega\in\Omega}\bigg|\int_{\mathcal{A}}\pi_{h}^{t}(s_{h}^{t},\omega,a)Q_{h}^{t}(s_{h}^{t},\omega,a)\,\mathrm{d}a\bigg|.
\]

Since $Q_{h}^{t}\leq H$ for any $h\in[H]$ and $t\in[T]$, the inequality can be simplified to

\[
\Big|\big\langle Q_{h}^{t},(\mu_{h}^{t}-\mu_{h}^{*})\otimes\pi_{h}^{t}\big\rangle_{\Omega\times\mathcal{A}}(s_{h}^{t};C^{t})\Big|\leq H\big\|\mu_{h}^{t}(\cdot|c_{h}^{t})-\mu_{h}^{*}(\cdot|c_{h}^{t})\big\|_{1}.
\]

Under the assumptions on the prior and the link function, on the given event, we obtain that

\[
\sum_{t\in[T]}\sum_{h\in[H]}\big\langle Q_{h}^{t},(\mu_{h}^{t}-\mu_{h}^{*})\otimes\pi_{h}^{t}\big\rangle_{\Omega\times\mathcal{A}}(s_{h}^{t};C^{t})\leq HL_{\mu}K\beta\sum_{h\in[H]}\sum_{t\in[T]}\|\phi(c_{h}^{t})\|_{(\Sigma_{h}^{t})^{-1}}.
\]

Therefore, we conclude the proof of Lemma 6.4. ∎

A.8 Proof of Corollary 6.6

Proof.

According to Assumption 4.2 on the prior, we can show that for any $\mu_{\theta^{\prime}}(\cdot|c)\in\mathcal{B}$,

\[
\left\|\mu_{\theta}(\cdot|c)-\mu_{\theta^{\prime}}(\cdot|c)\right\|_{1}\leq L_{\mu}\big\|f(\phi(c)^{\top}\theta)-f(\phi(c)^{\top}\theta^{\prime})\big\|.
\]

Moreover, by Assumption 4.3 on the link function $f(\cdot)$, we have

\[
\left\|\mu_{\theta}(\cdot|c)-\mu_{\theta^{\prime}}(\cdot|c)\right\|_{1}\leq L_{\mu}K\big\|\phi(c)^{\top}(\theta-\theta^{\prime})\big\|\leq L_{\mu}K\left\|\phi(c)\right\|_{\Sigma^{-1}}\epsilon,
\]

where the last inequality follows from the Cauchy–Schwarz inequality and the radius of the confidence region. Therefore, $\mathcal{B}\subseteq{\mathrm{B}}(\mu_{\theta}(\cdot|c),L_{\mu}K\|\phi(c)\|_{\Sigma^{-1}}\epsilon)$, and by Lemma 6.5, we can conclude the result. ∎

A.9 Auxiliary Lemmas

This section presents several auxiliary lemmas and their proofs.

Lemma A.5 (Martingale Bound; [9]).

For $\zeta_{t,h}^{1}$ and $\zeta_{t,h}^{2}$ defined in (6.2) and for any fixed $\delta\in(0,1)$, with probability at least $1-\delta/2$, we have

\[
\sum_{t\in[T]}\sum_{h\in[H]}(\zeta_{t,h}^{1}+\zeta_{t,h}^{2})\leq\sqrt{16TH^{3}\log(4/\delta)}.
\]
Proof.

See [9] for a detailed proof. ∎

Lemma A.6.

Suppose that $\phi_{1},\phi_{2},\ldots,\phi_{T}\in\mathbb{R}^{d_{\phi}\times d}$ and that there exists a constant $\Phi>0$ such that $\|\phi_{i}\|\leq\Phi$ for all $1\leq i\leq T$. Let $\Sigma_{t}=\lambda I_{d_{\phi}}+\sum_{i\in[t-1]}\phi_{i}\phi_{i}^{\prime}$ for some $\lambda\geq\Phi^{2}$. Then,

\[
\sum_{t\in[T]}\|\phi_{t}\|_{(\Sigma_{t})^{-1}}\leq\sqrt{2d_{\phi}T\log\big(1+T\Phi^{2}/(\lambda d_{\phi})\big)}.
\]
Proof.

First, we apply the Cauchy–Schwarz inequality:

\[
\sum_{t\in[T]}\|\phi_{t}\|_{(\Sigma_{t})^{-1}}\leq\sqrt{T\sum_{t\in[T]}\|\phi_{t}\|^{2}_{(\Sigma_{t})^{-1}}}.
\]

Since $\|\phi_{t}\|_{(\Sigma_{t})^{-1}}=\sqrt{\phi_{t}^{\top}(\Sigma_{t})^{-1}\phi_{t}}\leq\sqrt{\lambda^{-1}\phi_{t}^{\top}\phi_{t}}\leq\Phi/\sqrt{\lambda}\leq 1$, we can use Lemma A.7 to bound the sum of squares:

\begin{align*}
\sum_{t\in[T]}\|\phi_{t}\|_{(\Sigma_{t})^{-1}}&\leq\sqrt{2T\log\big(\det(\Sigma_{T})\det(\Sigma_{1})^{-1}\big)}\\
&\leq\sqrt{2d_{\phi}T\log\big(1+T\Phi^{2}/(\lambda d_{\phi})\big)}.
\end{align*}

The last inequality is derived from Lemma A.8. ∎
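As an informal numerical sanity check (not part of the proof), the following short Python snippet verifies the final bound of Lemma A.6 for randomly generated vector-valued features; the dimension $d_{\phi}$, horizon $T$, norm cap $\Phi$, and the Gaussian feature distribution are arbitrary illustrative choices.

```python
import numpy as np

# Sanity check of Lemma A.6:
#   sum_t ||phi_t||_{Sigma_t^{-1}} <= sqrt(2 d_phi T log(1 + T Phi^2 / (lambda d_phi))).
# All parameters below are illustrative choices, not quantities from the paper.
rng = np.random.default_rng(0)
d_phi, T, Phi = 5, 2000, 1.0
lam = max(1.0, Phi ** 2)                     # lambda >= Phi^2, as required by the lemma

Sigma = lam * np.eye(d_phi)                  # Sigma_1 = lambda * I
total = 0.0
for _ in range(T):
    phi = rng.normal(size=d_phi)
    norm = np.linalg.norm(phi)
    if norm > Phi:                           # enforce ||phi_t|| <= Phi
        phi *= Phi / norm
    total += np.sqrt(phi @ np.linalg.solve(Sigma, phi))  # ||phi_t||_{(Sigma_t)^{-1}}
    Sigma += np.outer(phi, phi)              # Sigma_{t+1} = Sigma_t + phi_t phi_t^T

bound = np.sqrt(2 * d_phi * T * np.log(1 + T * Phi ** 2 / (lam * d_phi)))
print(f"lhs = {total:.1f}, rhs = {bound:.1f}, lhs <= rhs: {total <= bound}")
```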

Lemma A.7 (Sum of Potential Function; [1]).

For any sequence $\{\phi_{t}\}_{t\in[T]}$, let $\Sigma_{t}=\lambda I_{d_{\phi}}+\sum_{i\in[t-1]}\phi_{i}\phi_{i}^{\prime}$ for some $\lambda>0$. Then we have

\[
\sum_{t\in[T]}\min\{\|\phi_{t}\|_{(\Sigma_{t})^{-1}}^{2},1\}\leq 2\log\big(\det(\Sigma_{T})\det(\Sigma_{1})^{-1}\big).
\]
Proof.

See [1] for a detailed proof. ∎

Lemma A.8 (Determinant-Trace Inequality).

Suppose that $\phi_{1},\phi_{2},\ldots,\phi_{T}\in\mathbb{R}^{d_{\phi}\times d}$ and that there exists a constant $\Phi>0$ such that $\|\phi_{i}\|\leq\Phi$ for all $1\leq i\leq T$. Let $\Sigma_{t}=\lambda I_{d_{\phi}}+\sum_{i\in[t-1]}\phi_{i}\phi_{i}^{\prime}$ for some $\lambda\geq 0$. Then,

\[
\det(\Sigma_{t})\leq\big(\lambda+t\Phi^{2}/d_{\phi}\big)^{d_{\phi}}.
\]
Proof.

Let $\lambda_{1},\lambda_{2},\ldots,\lambda_{d_{\phi}}$ be the eigenvalues of $\Sigma_{t}$. Since $\Sigma_{t}$ is positive semidefinite, its eigenvalues are nonnegative. Also, note that $\det(\Sigma_{t})=\prod_{s=1}^{d_{\phi}}\lambda_{s}$ and $\operatorname{tr}(\Sigma_{t})=\sum_{s=1}^{d_{\phi}}\lambda_{s}$. By the inequality of arithmetic and geometric means,

\[
\det(\Sigma_{t})\leq\big(\operatorname{tr}(\Sigma_{t})/d_{\phi}\big)^{d_{\phi}}.
\]

It remains to upper bound the trace:

\[
\operatorname{tr}(\Sigma_{t})=\operatorname{tr}(\lambda I_{d_{\phi}})+\sum_{i=1}^{t-1}\operatorname{tr}(\phi_{i}\phi_{i}^{\prime})=d_{\phi}\lambda+\sum_{i=1}^{t-1}\|\phi_{i}\|^{2}\leq d_{\phi}\lambda+t\Phi^{2},
\]

and the lemma follows. ∎
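For completeness, here is a similarly informal numerical check of the determinant-trace inequality in Lemma A.8 (again with arbitrary, illustrative parameters and vector-valued features; none of the specific values come from the paper):

```python
import numpy as np

# Check det(Sigma_t) <= (lambda + t * Phi^2 / d_phi)^{d_phi} along a random feature sequence.
# Parameters are illustrative only.
rng = np.random.default_rng(1)
d_phi, T, Phi, lam = 4, 500, 2.0, 0.5

Sigma = lam * np.eye(d_phi)
holds = True
for t in range(1, T + 1):
    # Sigma_t uses phi_1, ..., phi_{t-1}, so check before adding phi_t.
    holds &= np.linalg.det(Sigma) <= (lam + t * Phi ** 2 / d_phi) ** d_phi
    phi = rng.uniform(-1.0, 1.0, size=d_phi)
    phi *= Phi / np.linalg.norm(phi)         # scale so that ||phi_t|| = Phi
    Sigma += np.outer(phi, phi)

print("determinant-trace inequality holds for all t:", bool(holds))
```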