
Modeling Attrition in Recommender Systems with Departing Bandits

   Omer Ben-Porat*, Lee Cohen*, Liu Leqi*, Zachary C. Lipton, Yishay Mansour (*equal contribution)
Abstract

Traditionally, when recommender systems are formalized as multi-armed bandits, the policy of the recommender system influences the rewards accrued, but not the length of interaction. However, in real-world systems, dissatisfied users may depart (and never come back). In this work, we propose a novel multi-armed bandit setup that captures such policy-dependent horizons. Our setup consists of a finite set of user types, and multiple arms with Bernoulli payoffs. Each (user type, arm) tuple corresponds to an (unknown) reward probability. Each user’s type is initially unknown and can only be inferred through their response to recommendations. Moreover, if a user is dissatisfied with their recommendation, they might depart the system. We first address the case where all users share the same type, demonstrating that a recent UCB-based algorithm is optimal. We then move forward to the more challenging case, where users are divided among two types. While naive approaches cannot handle this setting, we provide an efficient learning algorithm that achieves $\tilde{O}(\sqrt{T})$ regret, where $T$ is the number of users.

1 Introduction

At the heart of online services spanning such diverse industries as media consumption, dating, financial products, and more, recommendation systems (RSs) drive personalized experiences by making curation decisions informed by each user’s past history of interactions. While in practice these systems employ diverse statistical heuristics, much of our theoretical understanding of them comes via stylized formulations within the multi-armed bandits (MABs) framework. While MABs abstract away many aspects of real-world systems, they allow us to extract crisp insights by formalizing fundamental tradeoffs, such as the one between exploration and exploitation that all RSs must face (Joseph et al. 2016; Liu and Ho 2018; Patil et al. 2020; Ron, Ben-Porat, and Shalit 2021). As applied to RSs, exploitation consists of continuing to recommend items (or categories of items) that have been observed to yield high rewards in the past, while exploration consists of recommending items (or categories of items) about which the RS is uncertain but that could potentially yield even higher rewards.

In traditional formalizations of RSs as MABs, the recommender’s decisions affect only the rewards obtained. However, real-life recommenders face a dynamic that potentially alters the exploration-exploitation tradeoff: Dissatisfied users have the option to depart the system, never to return. Thus, recommendations in the service of exploration not only impact instantaneous rewards but also risk driving away users and therefore can influence long-term cumulative rewards by shortening trajectories of interactions.

In this work, we propose departing bandits, which augment conventional MABs by incorporating these policy-dependent horizons. To motivate our setup, we consider the following example: an RS for recommending blog articles must choose, at each time, between two categories of articles, e.g., economics and sports. Upon a user’s arrival, the RS recommends articles sequentially. After each recommendation, the user decides whether to “click” the article and continue to the next recommendation, or to “not click”, in which case she may leave the system. Crucially, the user interacts with the system for a random number of rounds. The user’s departure probability depends on her satisfaction with the recommended item, which in turn depends on her unknown type. A user’s type encodes her preferences (hence the probability of clicking) over the two topics (economics and sports).

When model parameters are given, in contrast to traditional MABs where the optimal policy is to play the best fixed arm, departing bandits require more careful analysis to derive an optimal planning strategy. Such planning is a local problem, in the sense that it is solved for each user. Since the user type is never known explicitly (the recommender must update its beliefs over the user types after each interaction), finding an optimal recommendation policy requires solving a specific partially observable MDP (POMDP) where the user type constitutes the (unobserved) state (more details in Section 5.1). When the model parameters are unknown, we deal with a learning problem that is global, in the sense that the recommender (learner) is learning for a stream of users instead of a particular user.

We begin with a formal definition of departing bandits in Section 2, and demonstrate that any fixed-arm policy is prone to suffer linear regret. In Section 3, we establish the UCB-based learning framework used in later sections. We instantiate this framework with a single user type in Section 4, where we show that it achieves $\tilde{O}(\sqrt{T})$ regret, with $T$ being the number of users. We then move to the more challenging case with two user types and two recommendation categories in Section 5. To analyze the planning problem, we effectively reduce the search space for the optimal policy by using a closed form of the expected return of any recommender policy. These results suggest an algorithm that achieves $\tilde{O}(\sqrt{T})$ regret in this setting. Finally, we show an efficient optimal planning algorithm for multiple user types and two recommendation categories (Appendix A) and describe a scheme to construct semi-synthetic problem instances for this setting using real-world datasets (Appendix B).

1.1 Related Work

MABs have been studied extensively by the online learning community (Cesa-Bianchi and Lugosi 2006; Bubeck, Cesa-Bianchi et al. 2012). The contextual bandit literature augments the MAB setup with context-dependent rewards (Abbasi-Yadkori, Pál, and Szepesvári 2011; Slivkins 2019; Mahadik et al. 2020; Korda, Szörényi, and Li 2016; Lattimore and Szepesvári 2020). In contextual bandits, the learner observes a context before they make a decision, and the reward depends on the context. Another line of related work considers the dynamics that emerge when users act strategically (Kremer, Mansour, and Perry 2014; Mansour, Slivkins, and Syrgkanis 2015; Cohen and Mansour 2019; Bahar, Smorodinsky, and Tennenholtz 2016; Bahar et al. 2020). In that line of work, users arriving at the system receive a recommendation but act strategically: They can follow the recommendation or choose a different action. This modeling motivates the development of incentive-compatible mechanisms as solutions. In our work, however, the users are modeled in a stochastic (but not strategic) manner. Users may leave the system if they are dissatisfied with recommendations, and this departure follows a fixed (but possibly unknown) stochastic model.

The departing bandits problem has two important features: policy-dependent horizons, and multiple user types that can be interpreted as unknown states. Existing MAB works (Azar, Lazaric, and Brunskill 2013; Cao et al. 2020) have addressed these phenomena separately, but we know of no work that integrates the two in a single framework. In particular, while Azar, Lazaric, and Brunskill (2013) study a setting with multiple user types, they focus on a fixed horizon. Additionally, while Cao et al. (2020) deal with departure probabilities and policy-dependent interaction times for a single user type, they do not consider the possibility of multiple underlying user types.

The planning part of our problem falls under the framework of using Markov decision processes to model recommender-user dynamics (Shani, Heckerman, and Brafman 2005). Specifically, our problem involves partially observable user states, which have also appeared in many recent bandit variants (Pike-Burke and Grünewälder 2019; Leqi et al. 2020). Unlike these prior works that focus on interactions with a single user, departing bandits consider a stream of users, each of whom has an (unknown) type selected from a finite set of user types.

More broadly, our RS learning problem falls under the domain of reinforcement learning (RL). Existing RL literature that considers departing users in RSs includes Zhao et al. (2020b); Lu and Yang (2016); Zhao et al. (2020a). While Zhao et al. (2020b) handle users of a single type that depart the RS within a bounded number of interactions, our work deals with multiple user types. In contrast to Zhao et al. (2020a), we consider an online setting and provide regret guarantees that do not require a bounded horizon. Finally, Lu and Yang (2016) use POMDPs to model user departure and focus on approximating the value function. They conduct an experimental analysis on historical data, while we devise an online learning algorithm with theoretical guarantees.

2 Departing Bandits

We propose a new online problem, called departing bandits, where the goal is to find the optimal recommendation algorithm for users of (unknown) types, and where the length of the interactions depends on the algorithm itself. Formally, the departing bandits problem is defined by a tuple $\langle[M],[K],\mathbf{q},\mathbf{P},\mathbf{\Lambda}\rangle$, where $M$ is the number of user types, $K$ is the number of categories, $\mathbf{q}\in[0,1]^{M}$ specifies a prior distribution over types, and $\mathbf{P}\in(0,1)^{K\times M}$ and $\mathbf{\Lambda}\in(0,1)^{K\times M}$ are the click-probability and the departure-probability matrices, respectively. (We denote by $[n]$ the set $\{1,\dots,n\}$.)

There are $T$ users who arrive sequentially at the RS. At every episode, a new user $t\in[T]$ arrives with a type $type(t)$. We let $\mathbf{q}$ denote the prior distribution over user types, i.e., $type(t)\sim\mathbf{q}$. Each user of type $x$ clicks on a recommended category $a$ with probability $\mathbf{P}_{a,x}$; in other words, each click follows a Bernoulli distribution with parameter $\mathbf{P}_{a,x}$. Whenever the user clicks, she stays for another iteration; when the user does not click (no-click), she departs with probability $\mathbf{\Lambda}_{a,x}$ (and stays with probability $1-\mathbf{\Lambda}_{a,x}$). Each user $t$ interacts with the RS (the learner) until she departs.

We proceed to describe the user-RS interaction protocol. In every iteration $j$ of user $t$, the learner recommends a category $a\in[K]$ to user $t$. The user clicks on it with probability $\mathbf{P}_{a,type(t)}$. If the user clicks, the learner receives a reward of $r_{t,j}(a)=1$. (We formalize the reward, as is standard in the online learning literature, from the perspective of the learner; defining the reward from the user perspective, e.g., as the number of clicks she gives or the number of articles she reads, induces the same model.) If the user does not click, the learner receives no reward (i.e., $r_{t,j}(a)=0$), and user $t$ departs with probability $\mathbf{\Lambda}_{a,type(t)}$. We assume that the learner knows the value of a constant $\epsilon>0$ such that $\max_{a,x}\mathbf{P}_{a,x}\leq 1-\epsilon$ (i.e., $\epsilon$ does not depend on $T$). When user $t$ departs, she does not interact with the learner anymore (and the learner moves on to the next user $t+1$). For convenience, the departing bandits protocol is summarized in Algorithm 1.

Input: number of types $M$, number of categories $K$, and number of users (episodes) $T$
Hidden Parameters: types prior $\mathbf{q}$, click-probability $\mathbf{P}$, and departure-probability $\mathbf{\Lambda}$

1:  for episode $t\leftarrow 1,\dots,T$ do
2:     a new user with type $type(t)\sim\mathbf{q}$ arrives
3:     $j\leftarrow 1$, $depart\leftarrow false$
4:     while $depart$ is $false$ do
5:        the learner picks a category $a\in[K]$
6:        with probability $\mathbf{P}_{a,type(t)}$, user $t$ clicks on $a$ and $r_{t,j}(a)\leftarrow 1$; otherwise, $r_{t,j}(a)\leftarrow 0$
7:        if $r_{t,j}(a)=0$ then
8:           with probability $\mathbf{\Lambda}_{a,type(t)}$: $depart\leftarrow true$ and user $t$ departs
9:        the learner observes $r_{t,j}(a)$ and $depart$
10:        if $depart$ is $false$ then
11:           $j\leftarrow j+1$
Algorithm 1 The Departing Bandits Protocol
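To make the protocol concrete, here is a minimal Python sketch that simulates one episode of Algorithm 1. It is an illustrative implementation, not part of the formal model: the function name `run_episode`, the array layout of $\mathbf{P}$ and $\mathbf{\Lambda}$ (categories and types indexed from 0), and the `policy(j, history)` interface are assumptions made for this sketch.

```python
import numpy as np

def run_episode(policy, P, Lam, q, rng):
    """Simulate one user (episode) of the departing-bandits protocol (Algorithm 1).

    P, Lam : K x M click- and departure-probability matrices.
    q      : prior distribution over the M user types.
    policy : callable (iteration index, interaction history) -> category in {0, ..., K-1}.
    Returns the realized return of the episode, i.e., the total number of clicks.
    """
    M = len(q)
    user_type = rng.choice(M, p=q)            # type(t) ~ q, hidden from the learner
    history, total_reward, j = [], 0, 0
    while True:
        a = policy(j, history)                # the learner picks a category
        click = rng.random() < P[a, user_type]
        history.append((a, int(click)))       # the learner observes r_{t,j}(a)
        if click:
            total_reward += 1                 # r_{t,j}(a) = 1 and the user stays
        elif rng.random() < Lam[a, user_type]:
            break                             # no-click followed by departure
        j += 1
    return total_reward
```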

Having described the protocol, we move on to the goals of the learner. Without loss of generality, we assume that the online learner’s recommendations are made based on a policy $\pi$, which is a mapping from the history of previous interactions (with that user) to recommendation categories.

For each user (episode) $t\in[T]$, the learner selects a policy $\pi_{t}$ that recommends category $\pi_{t,j}\in[K]$ at every iteration $j\in[N^{\pi_{t}}(t)]$, where $N^{\pi_{t}}(t)$ denotes the episode length (i.e., the total number of iterations policy $\pi_{t}$ interacts with user $t$ until she departs). (We limit the discussion to deterministic policies; this is without loss of generality, see Subsection 5.1 for further details.) The return of a policy $\pi$, denoted by $V^{\pi}$, is the cumulative reward the learner obtains when executing $\pi$ until the user departs. Put differently, the return of $\pi$ from user $t$ is the random variable $V^{\pi}=\sum_{j=1}^{N^{\pi}(t)}r_{t,j}(\pi_{t,j})$.

We denote by $\pi^{*}$ an optimal policy, namely a policy that maximizes the expected return, $\pi^{*}=\operatorname*{arg\,max}_{\pi}\mathbb{E}[V^{\pi}]$. Similarly, we denote by $V^{*}$ the optimal return, i.e., $V^{*}=V^{\pi^{*}}$.

We highlight two algorithmic tasks. The first is the planning task, in which the goal is to find an optimal policy $\pi^{*}$, given $\mathbf{P},\mathbf{\Lambda},\mathbf{q}$. The second is the online learning task. We consider settings where the learner knows the number of categories $K$, the number of types $M$, and the number of users $T$, but has no prior knowledge regarding $\mathbf{P},\mathbf{\Lambda}$ or $\mathbf{q}$. In the online learning task, the value of the learner’s algorithm is the sum of the returns obtained from all the users, namely

$$\sum_{t=1}^{T}V^{\pi_{t}}=\sum_{t=1}^{T}\sum_{j=1}^{N^{\pi_{t}}(t)}r_{t,j}(\pi_{t,j}).$$

The performance of the learner is compared to that of the best policy, as formalized by the regret over $T$ episodes,

$$R_{T}=T\cdot\mathbb{E}[V^{\pi^{*}}]-\sum_{t=1}^{T}V^{\pi_{t}}. \qquad (1)$$

The learner’s goal is to minimize the expected regret $\mathbb{E}[R_{T}]$.

2.1 Example

Type $x$   Type $y$
Category $1$: $\mathbf{P}_{1,x}=0.5$   $\mathbf{P}_{1,y}=0.28$
Category $2$: $\mathbf{P}_{2,x}=0.4$   $\mathbf{P}_{2,y}=0.39$
Prior: $\mathbf{q}_{x}=0.4$   $\mathbf{q}_{y}=0.6$
Table 1: The departing bandits instance in Section 2.1.

The motivation for the following example is two-fold. First, to get the reader acquainted with our notations; and second, to show why fixed-arm policies are inferior in our setting.

Consider a problem instance with two user types ($M=2$), which we call $x$ and $y$ for convenience. There are two categories ($K=2$), and given no-click the departure is deterministic, i.e., $\mathbf{\Lambda}_{a,\tau}=1$ for every category $a\in[K]$ and type $\tau\in[M]$. That is, every user leaves immediately if she does not click. Furthermore, let the click-probability matrix $\mathbf{P}$ and the user type prior distribution $\mathbf{q}$ be as in Table 1.

Looking at $\mathbf{P}$ and $\mathbf{q}$, we see that Category 1 is better for type $x$, while Category 2 is better for type $y$. Notice that without any additional information, a user is more likely to be of type $y$. Given the prior distribution, recommending Category $1$ in the first round yields an expected reward of $\mathbf{q}_{x}\mathbf{P}_{1,x}+\mathbf{q}_{y}\mathbf{P}_{1,y}=0.368$. Similarly, recommending Category $2$ in the first round results in an expected reward of $0.394$. Consequently, if we recommend myopically, i.e., without considering the user type, always recommending Category $2$ is better than always recommending Category $1$.

Let $\pi^{a}$ denote the fixed-arm policy that always selects a single category $a$. Using the tools we derive in Section 5, in particular Theorem 5.3, we can compute the expected returns of $\pi^{1}$ and $\pi^{2}$, namely $\mathbb{E}[V^{\pi^{1}}]$ and $\mathbb{E}[V^{\pi^{2}}]$. Additionally, using results from Section 5.2, we can show that the optimal policy for the planning task, $\pi^{*}$, recommends Category $2$ until iteration $7$, and then recommends Category $1$ for the rest of the iterations until the user departs.

Using simple calculations, we see that $\mathbb{E}[V^{\pi^{*}}]-\mathbb{E}[V^{\pi^{1}}]>0.0169$ and $\mathbb{E}[V^{\pi^{*}}]-\mathbb{E}[V^{\pi^{2}}]>1.22\times 10^{-5}$; hence, the expected return of the optimal policy is greater than the returns of both fixed-arm policies by a constant. As a result, if the learner only uses fixed-arm policies ($\pi^{a}$ for every $a\in[K]$), she suffers linear expected regret, i.e., $\mathbb{E}[R_{T}]=T\cdot\mathbb{E}[V^{\pi^{*}}]-\sum_{t=1}^{T}\mathbb{E}[V^{\pi^{a}}]=\Omega(T)$.
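To verify the numbers in this example, one can truncate the closed-form series of Theorem 5.3 (stated in Section 5.1). The sketch below is a sanity check under that formula; the truncation horizon of 500 iterations is an arbitrary choice that makes the geometric tail negligible.

```python
# Instance of Table 1: click probabilities P[(category, type)] and prior q over types.
P = {(1, 'x'): 0.5, (1, 'y'): 0.28, (2, 'x'): 0.4, (2, 'y'): 0.39}
q = {'x': 0.4, 'y': 0.6}

def expected_return(switch_after, horizon=500):
    """Expected return of the policy that recommends Category 2 for the first
    `switch_after` iterations and Category 1 afterwards (truncated Theorem 5.3)."""
    total = 0.0
    for i in range(1, horizon + 1):
        m2 = min(i, switch_after)        # recommendations of Category 2 up to iteration i
        m1 = i - m2                      # recommendations of Category 1 up to iteration i
        total += q['x'] * P[(1, 'x')]**m1 * P[(2, 'x')]**m2 \
               + q['y'] * P[(1, 'y')]**m1 * P[(2, 'y')]**m2
    return total

v1 = expected_return(0)                  # fixed-arm policy pi^1
v2 = expected_return(500)                # fixed-arm policy pi^2 (never switches)
v_star = expected_return(7)              # pi^*: Category 2 until iteration 7, then Category 1
print(v1, v2, v_star)                    # approx. 0.6333, 0.6503, and 0.6503 + 1.22e-5
```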

3 UCB Policy for Sub-exponential Returns

In this section, we introduce the learning framework used in the paper and provide a general regret guarantee for it.

In standard MAB problems, at each $t\in[T]$ the learner picks a single arm and receives a single sub-Gaussian reward. In contrast, in departing bandits, at each $t\in[T]$ the learner receives a return $V^{\pi}$, which is the cumulative reward of that policy. The return $V^{\pi}$ depends on the policy $\pi$ not only through the rewards obtained at each iteration but also through the total number of iterations (trajectory length). Such returns are not necessarily sub-Gaussian. Consequently, we cannot use standard MAB algorithms, as they usually rely on concentration bounds for sub-Gaussian rewards. Furthermore, as we have shown in Section 2.1, in departing bandits fixed-arm policies can suffer linear regret (in terms of the number of users), which suggests considering a more expressive set of policies. This in turn yields another disadvantage of using MAB algorithms for departing bandits, as their regret is linear in the number of arms (categories) $K$.

As we show later in Sections 4 and 5, for some natural instances of the departing bandits problem, the return from each user is sub-exponential (Definition 3.1). Algorithm 2, which we propose below, receives a set of policies $\Pi$ as input, along with other parameters that we describe shortly. The algorithm is a restatement of the UCB-Hybrid algorithm of Jia, Shi, and Shen (2021), with two modifications: (1) the input includes a set of policies rather than a set of actions/categories, and accordingly, the confidence bound updates are based on return samples (denoted by $\hat{V}^{\pi}$) rather than reward samples; (2) there are two global parameters ($\tilde{\tau}$ and $\eta$) instead of two local parameters per action. If the return from each policy in $\Pi$ is sub-exponential, Algorithm 2 not only handles sub-exponential returns, but also comes with the following guarantee: its expected value is close to the value of the best policy in $\Pi$.

3.1 Sub-exponential Returns

For convenience, we state here the definition of sub-exponential random variables (Eldar and Kutyniok 2012).

Definition 3.1.

We say that a random variable $X$ is sub-exponential with parameters $(\tau^{2},b)$ if for every $\gamma$ such that $|\gamma|<1/b$,

$$\mathbb{E}\left[\exp\left(\gamma(X-\mathbb{E}[X])\right)\right]\leq\exp\left(\frac{\gamma^{2}\tau^{2}}{2}\right).$$

In addition, for every $(\tau^{2},b)$-sub-exponential random variable, there exist constants $C_{1},C_{2}>0$ such that the above is equivalent to each of the following properties:

  1. Tails: $\forall v\geq 0\colon\;\Pr[|X|>v]\leq\exp\left(1-\frac{v}{C_{1}}\right)$.

  2. Moments: $\forall p\geq 1\colon\;\left(\mathbb{E}[|X|^{p}]\right)^{1/p}\leq C_{2}p$.

Let $\Pi$ be a set of policies with the following property: there exist $\tilde{\tau},\eta$ such that the return of every policy $\pi\in\Pi$ is $(\tau^{2},b)$-sub-exponential with $\tilde{\tau}\geq\tau$ and $\eta\geq\frac{b^{2}}{\tau^{2}}$. Algorithm 2 below receives as input such a set of policies $\Pi$ with the associated parameters $\tilde{\tau}$ and $\eta$. Similarly to the UCB algorithm, it maintains an upper confidence bound $U$ for each policy, and balances between exploration and exploitation. Theorem 3.2 below shows that Algorithm 2 always attains a value close to that of the best policy in $\Pi$, up to an additive factor of $\tilde{O}\left(\sqrt{|\Pi|T}+|\Pi|\right)$. The theorem follows directly from Theorem 3 of Jia, Shi, and Shen (2021) by treating policies as arms and returns as rewards.

Theorem 3.2.

Let $\Pi$ be a set of policies with the associated parameters $\tilde{\tau},\eta$. Let $\pi_{1},\dots,\pi_{T}$ be the policies Algorithm 2 selects. It holds that

$$\mathbb{E}\left[\max_{\pi\in\Pi}T\cdot V^{\pi}-\sum_{t=1}^{T}V^{\pi_{t}}\right]=O\left(\sqrt{|\Pi|T\log T}+|\Pi|\log T\right).$$

There are two challenges in leveraging Theorem 3.2. The first challenge is crucial: notice that Theorem 3.2 does not imply that Algorithm 2 has low regret; its only guarantee is w.r.t. the policies in $\Pi$ received as input. As the number of policies is infinite, our success depends on our ability to characterize a “good” set of policies $\Pi$. The second challenge is technical: even if we find such a $\Pi$, we still need to characterize the associated $\tilde{\tau}$ and $\eta$. This is precisely what we do in Sections 4 and 5.

1:  Input: set of policies $\Pi$, number of users $T$, $\tilde{\tau}$, $\eta$
2:  Initialize: $\forall\pi\in\Pi\colon U_{0}(\pi)\leftarrow\infty$, $n(\pi)\leftarrow 0$
3:  for user $t\leftarrow 1,\dots,T$ do
4:     Execute $\pi_{t}$ such that $\pi_{t}\in\operatorname*{arg\,max}_{\pi\in\Pi}U_{t-1}(\pi)$ and receive return $\hat{V}^{\pi_{t}}[n(\pi_{t})]\leftarrow\sum_{j=1}^{N^{\pi_{t}}(t)}r_{t,j}(\pi_{t,j})$
5:     $n(\pi_{t})\leftarrow n(\pi_{t})+1$
6:     if $n(\pi_{t})<8\eta\ln T$ then
7:        Update $U_{t}(\pi_{t})=\frac{\sum_{i=1}^{n(\pi_{t})}\hat{V}^{\pi_{t}}[i]}{n(\pi_{t})}+\frac{8\sqrt{\eta}\cdot\tilde{\tau}\ln T}{n(\pi_{t})}$
8:     else
9:        Update $U_{t}(\pi_{t})=\frac{\sum_{i=1}^{n(\pi_{t})}\hat{V}^{\pi_{t}}[i]}{n(\pi_{t})}+\sqrt{\frac{8\tilde{\tau}^{2}\ln T}{n(\pi_{t})}}$
Algorithm 2 UCB-based algorithm with hybrid radii: UCB-Hybrid (Jia, Shi, and Shen 2021)
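The following Python sketch mirrors the pseudocode above. It is a hedged illustration rather than the authors' implementation: the environment is abstracted as a `sample_return(policy, rng)` callback (e.g., the protocol simulator from Section 2), and policies are referenced by their index in the input list.

```python
import math

def ucb_hybrid(policies, T, tau_tilde, eta, sample_return, rng):
    """UCB-Hybrid over a finite policy set, following Algorithm 2.

    sample_return(policy, rng) runs `policy` on a fresh user and returns the
    realized return (an empirical sample of V^pi).
    """
    n = [0] * len(policies)                    # number of times each policy was executed
    sums = [0.0] * len(policies)               # sums of observed returns per policy
    U = [math.inf] * len(policies)             # upper confidence bounds U_t(pi)
    total_value = 0.0
    for t in range(1, T + 1):
        i = max(range(len(policies)), key=lambda k: U[k])   # argmax_{pi} U_{t-1}(pi)
        v = sample_return(policies[i], rng)
        total_value += v
        n[i] += 1
        sums[i] += v
        mean = sums[i] / n[i]
        if n[i] < 8 * eta * math.log(T):
            U[i] = mean + 8 * math.sqrt(eta) * tau_tilde * math.log(T) / n[i]   # linear radius
        else:
            U[i] = mean + math.sqrt(8 * tau_tilde**2 * math.log(T) / n[i])      # sqrt radius
    return total_value
```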

4 Single User Type

In this section, we focus on the special case of a single user type, i.e., $M=1$. For notational convenience, since we only discuss single-type users, we associate each category $a\in[K]$ with its two unique parameters $\mathbf{P}_{a}:=\mathbf{P}_{a,1}$ and $\mathbf{\Lambda}_{a}:=\mathbf{\Lambda}_{a,1}$, and refer to them as scalars rather than vectors. In addition, we use the notation $N_{a}$ for the random variable representing the number of iterations until a random user departs when recommended by $\pi^{a}$, the fixed-arm policy that recommends category $a$ in each iteration.

To derive a regret bound for single-type users, we use two main lemmas: Lemma 4.1, which shows that the optimal policy is a fixed-arm policy, and Lemma 4.3, which shows that returns of fixed-arm policies are sub-exponential and calculates their corresponding parameters. These lemmas allow us to use Algorithm 2 with a policy set $\Pi$ that contains all the fixed-arm policies, and derive an $\tilde{O}(\sqrt{T})$ regret bound. For brevity, we relegate all the proofs to the Appendix.

To show that there exists a category $a^{*}\in[K]$ for which $\pi^{a^{*}}$ is optimal, we rely on the assumption that all the users have the same type (hence we drop the type subscript), and as a result the reward of each category $a\in[K]$ has an expectation that depends on a single parameter, namely $\mathbb{E}[r(a)]=\mathbf{P}_{a}$. Such a category $a^{*}\in[K]$ does not necessarily have the maximal click-probability nor the minimal departure-probability, but rather an optimal combination of the two (in a way, this is similar to the knapsack problem, where we want to maximize the reward while having as little weight as possible). We formalize this in the following lemma.

Lemma 4.1.

A policy $\pi^{a^{*}}$ is optimal if

$$a^{*}\in\operatorname*{arg\,max}_{a\in[K]}\frac{\mathbf{P}_{a}}{\mathbf{\Lambda}_{a}(1-\mathbf{P}_{a})}.$$
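For concreteness, the planning rule of Lemma 4.1 is a one-line computation once $\mathbf{P}$ and $\mathbf{\Lambda}$ are known; the sketch below assumes they are given as length-$K$ NumPy arrays.

```python
import numpy as np

def best_fixed_category(P, Lam):
    """Optimal fixed-arm category for a single user type (Lemma 4.1):
    maximize P_a / (Lambda_a * (1 - P_a)) over categories a."""
    return int(np.argmax(P / (Lam * (1.0 - P))))

# Example: category 0 clicks more often, but category 1 retains non-clickers better.
# P = np.array([0.6, 0.5]); Lam = np.array([0.9, 0.3])
# scores are [1.67, 3.33], so best_fixed_category(P, Lam) == 1
```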

As a consequence of this lemma, the planning problem for single-type users is trivial: the solution is the fixed-arm policy $\pi^{a^{*}}$ given in the lemma. However, without access to the model parameters, identifying $\pi^{a^{*}}$ requires learning. We proceed with a simple observation regarding the random number of iterations obtained by executing a fixed-arm policy. This observation will later help us show that the return of any fixed-arm policy is sub-exponential.

Observation 4.2.

For every $a\in[K]$ such that $\mathbf{\Lambda}_{a}>0$, the random variable $N_{a}$ follows a geometric distribution with success probability $\mathbf{\Lambda}_{a}(1-\mathbf{P}_{a})\in(0,1-\epsilon]$.

Using Observation 4.2 and previously known results (stated as Lemma D.3 in the appendix), we show that $N_{a}$ is sub-exponential for all $a\in[K]$. Notice that return realizations are always upper bounded by the trajectory length; this implies that returns are also sub-exponential. However, to use the regret bound of Algorithm 2, we need information regarding the parameters $(\tau^{2}_{a},b_{a})$ for every policy $\pi^{a}$. We provide this information in Lemma 4.3 below.

Lemma 4.3.

For each category $a\in[K]$, the centred random variable $V^{\pi^{a}}-\mathbb{E}[V^{\pi^{a}}]$ is sub-exponential with parameters $(\tau_{a}^{2},b_{a})$, such that

$$\tau_{a}=b_{a}=-\frac{8e}{\ln(1-\mathbf{\Lambda}_{a}(1-\mathbf{P}_{a}))}.$$
Proof sketch.

We rely on the equivalence between the sub-exponentiality of a random variable and the bounds on its moments (Property 2 in Definition 3.1). We bound the expectation of the return $V^{\pi^{a}}$, and use Minkowski's and Jensen's inequalities to show in Lemma D.2 that $\left(\mathbb{E}[|V^{\pi^{a}}-\mathbb{E}[V^{\pi^{a}}]|^{p}]\right)^{1/p}$ is upper bounded by $-4/\ln(1-\mathbf{\Lambda}_{a}(1-\mathbf{P}_{a}))$ for every $a\in[K]$ and $p\geq 1$. Finally, we apply a normalization trick and bound the Taylor series of $\mathbb{E}[\exp(\gamma(V^{\pi^{a}}-\mathbb{E}[V^{\pi^{a}}]))]$ to obtain the result. ∎

An immediate consequence of Lemma 4.3 is that the parameters $\tilde{\tau}=8e/\ln(\frac{1}{1-\epsilon})$ and $\eta=1$ are valid upper bounds for $\tau_{a}$ and $b_{a}^{2}/\tau_{a}^{2}$ for each $a\in[K]$ (i.e., $\forall a\in[K]\colon\tilde{\tau}\geq\tau_{a}$ and $\eta\geq b_{a}^{2}/\tau_{a}^{2}$). We can now derive a regret bound using Algorithm 2 and Theorem 3.2.

Theorem 4.4.

For single-type users ($M=1$), running Algorithm 2 with $\Pi=\{\pi^{a}\colon a\in[K]\}$, $\tilde{\tau}=\frac{8e}{\ln(\frac{1}{1-\epsilon})}$, and $\eta=1$ achieves an expected regret of at most

$$\mathbb{E}[R_{T}]=O\left(\sqrt{KT\log T}+K\log T\right).$$
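Putting the single-type pieces together, a possible (assumed) wiring of the earlier sketches is shown below: the $K$ fixed-arm policies are fed to the UCB-Hybrid sketch with the parameters of Theorem 4.4. The specific instance values are illustrative.

```python
import math
import numpy as np

rng = np.random.default_rng(0)
K, T, epsilon = 3, 5000, 0.3                     # max click probability is 1 - epsilon
P = np.array([[0.6], [0.5], [0.4]])              # K x 1 click probabilities (single type)
Lam = np.array([[0.9], [0.3], [0.5]])            # K x 1 departure probabilities
q = np.array([1.0])                              # a single user type

fixed_arm_policies = [lambda j, history, a=a: a for a in range(K)]   # pi^a for each a
tau_tilde = 8 * math.e / math.log(1.0 / (1.0 - epsilon))             # as in Theorem 4.4
eta = 1.0

value = ucb_hybrid(fixed_arm_policies, T, tau_tilde, eta,
                   lambda pi, rng: run_episode(pi, P, Lam, q, rng), rng)
```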

5 Two User Types and Two Categories

In this section, we consider the case of two user types ($M=2$), two categories ($K=2$), and departure probabilities $\mathbf{\Lambda}_{a,\tau}=1$ for every category $a\in[K]$ and type $\tau\in[M]$. Even in this relatively simplified setting, where users leave after the first “no-click”, planning is essential. To see this, notice that the event of a user clicking on a certain category provides additional information about the user, which can be used to tailor better recommendations; hence, algorithms that do not take this into account may suffer linear regret. In fact, this is not just a matter of the learning algorithm at hand, but rather a failure of all fixed-arm policies; there are instances where all fixed-arm policies yield high regret w.r.t. the baseline defined in Equation (1). Indeed, this is what the example in Section 2.1 showcases. This observation suggests that studying the optimal planning problem is vital.

In Section 5.1, we introduce the partially observable MDP formulation of departing bandits along with the notion of a belief-category walk. We use this notion to provide a closed-form formula for a policy's expected return, which we use extensively later on. Next, in Section 5.2 we characterize the optimal policy, and show that we can compute it in constant time by relying on the closed-form formula. This is striking, as computing optimal POMDP policies is generally computationally intractable since, e.g., the space of policies grows exponentially with the horizon. Conceptually, we show that there exists an optimal policy that depends on a belief threshold: it recommends one category until the posterior belief of one type, which is monotonically increasing, crosses the threshold, and then it recommends the other category. Finally, in Section 5.3 we leverage all the previously obtained results to derive a small set of threshold policies of size $O(\ln T)$ with corresponding sub-exponential parameters. Due to Theorem 3.2, this result implies an $\tilde{O}(\sqrt{T})$ regret.

5.1 Efficient Planning

To recap, we aim to find the optimal policy when the click-probability matrix and the prior over user types are known. Namely, given an instance in the form of $\langle\mathbf{P},\mathbf{q}\rangle$, our goal is to efficiently find the optimal policy.

For planning purposes, the problem can be modeled by an episodic POMDP $\langle S,[K],O,\text{Tr},\mathbf{P},\Omega,\mathbf{q}\rangle$. The set of states $S=[M]\cup\{\perp\}$ comprises all types $[M]$, along with a designated absorbing state $\perp$ indicating that the user departed (and the episode terminated). $[K]$ is the set of actions (categories). $O=\{stay,depart\}$ is the set of possible observations. The transition and observation functions, $\text{Tr}\colon S\times[K]\rightarrow S$ and $\Omega\colon S\times[K]\rightarrow O$ respectively, satisfy $\text{Tr}(\perp|i,a)=\Omega(depart|i,a)=1-\mathbf{P}_{i,a}$ and $\text{Tr}(i|i,a)=\Omega(stay|i,a)=\mathbf{P}_{i,a}$ for every type $i\in[M]$ and action $a\in[K]$. Finally, $\mathbf{P}$ is the expected reward matrix, and $\mathbf{q}$ is the initial state distribution over the $M$ types.

When there are two user types and two categories, the click-probability matrix is given by Table 2, where we note that the prior on the types satisfies $\mathbf{q}_{y}=1-\mathbf{q}_{x}$ and can thus be represented by the single parameter $\mathbf{q}_{x}$.

Remark 5.1.

Without loss of generality, we assume that $\mathbf{P}_{1,x}\geq\mathbf{P}_{2,x},\mathbf{P}_{1,y},\mathbf{P}_{2,y}$, since one could always permute the matrix to obtain such a structure.

Since the return and the number of iterations of a policy are independent of the user index, we drop the subscript $t$ in the rest of this subsection.

Type $x$   Type $y$
Category $1$: $\mathbf{P}_{1,x}$   $\mathbf{P}_{1,y}$
Category $2$: $\mathbf{P}_{2,x}$   $\mathbf{P}_{2,y}$
Prior: $\mathbf{q}_{x}$   $\mathbf{q}_{y}=1-\mathbf{q}_{x}$
Table 2: Click probabilities for two user types and two categories.

As is well known in the POMDP literature (Kaelbling, Littman, and Cassandra 1998), the optimal policy $\pi^{*}$ and its expected return are functions of belief states that represent the probability of each state at each time. In our setting, the states are the user types. We denote by $b_{j}$ the belief that the state is (type) $x$ at iteration $j$; similarly, $1-b_{j}$ is the belief that the state is (type) $y$ at iteration $j$. Needless to say, once the state $\perp$ is reached, the belief over the type states $[M]$ is irrelevant, as users do not come back; we neglect this case as our analysis does not make use of it.

We now describe how to compute the belief. At iteration $j=1$, the belief state is set to $b_{1}=\mathbb{P}(state=x)=\mathbf{q}_{x}$. At iteration $j>1$, upon receiving a positive reward $r_{j}=1$, the belief is updated from $b_{j-1}\in[0,1]$ to

$$b_{j}(b_{j-1},a,1)=\frac{b_{j-1}\cdot\mathbf{P}_{a,x}}{b_{j-1}\cdot\mathbf{P}_{a,x}+\mathbf{P}_{a,y}(1-b_{j-1})}, \qquad (2)$$

where we note that in the event of no-click, the current user departs the system, i.e., we move to the absorbing state $\perp$. For any policy $\pi\colon[0,1]\to\{1,2\}$ that maps a belief to a category, its expected return satisfies the Bellman equation,

$$\mathbb{E}[V^{\pi}(b)]=\left(b\,\mathbf{P}_{\pi(b),x}+(1-b)\,\mathbf{P}_{\pi(b),y}\right)\cdot\left(1+\mathbb{E}[V^{\pi}(b^{\prime}(b,\pi(b),1))]\right).$$

To better characterize the expected return, we introduce the following notion of belief-category walk.

Definition 5.2 (Belief-category walk).

Let $\pi\colon[0,1]\rightarrow\{1,2\}$ be any policy. The sequence

$$b_{1},\;a_{1}=\pi(b_{1}),\;b_{2},\;a_{2}=\pi(b_{2}),\;\dots$$

is called the belief-category walk. Namely, it is the induced walk of belief updates and categories chosen by $\pi$, given that all the rewards are positive ($r_{j}=1$ for every $j\in\mathbb{N}$).

Notice that every policy induces a single, well-defined, deterministic belief-category walk (recall that we assume the departure probabilities satisfy $\mathbf{\Lambda}_{a,\tau}=1$ for every $a\in[K],\tau\in[M]$). Moreover, given any policy $\pi$, the trajectory of every user recommended by $\pi$ is fully characterized by the belief-category walk clipped at $b_{N^{\pi}(t)},a_{N^{\pi}(t)}$.

In what follows, we derive a closed-form expression for the expected return as a function of $b$, the categories chosen by the policy, and the click-probability matrix.

Theorem 5.3.

For every policy $\pi$ and initial belief $b\in[0,1]$, the expected return is given by

$$\mathbb{E}[V^{\pi}(b)]=\sum_{i=1}^{\infty}b\cdot\mathbf{P}_{1,x}^{m_{1,i}}\cdot\mathbf{P}_{2,x}^{m_{2,i}}+(1-b)\cdot\mathbf{P}_{1,y}^{m_{1,i}}\cdot\mathbf{P}_{2,y}^{m_{2,i}},$$

where $m_{1,i}:=|\{a_{j}=1,\,j\leq i\}|$ and $m_{2,i}:=|\{a_{j}=2,\,j\leq i\}|$ are calculated based on the belief-category walk $b_{1},a_{1},b_{2},a_{2},\dots$ induced by $\pi$.
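The series translates directly into a truncated numerical evaluation, which is how the example of Section 2.1 was checked above. The sketch below is an illustrative evaluator; the truncation horizon is an assumption (the terms decay geometrically since all click probabilities are bounded away from 1).

```python
def expected_return_closed_form(category_at, P, b, horizon=500):
    """Truncated evaluation of the closed form in Theorem 5.3.

    category_at : function i -> a_i in {1, 2}, the category recommended at
                  iteration i along the belief-category walk induced by the policy.
    P           : dict of click probabilities keyed by (category, type), types 'x' and 'y'.
    b           : initial belief that the user is of type x.
    """
    total, m1, m2 = 0.0, 0, 0
    for i in range(1, horizon + 1):
        if category_at(i) == 1:
            m1 += 1                      # m_{1,i}: times Category 1 was recommended so far
        else:
            m2 += 1                      # m_{2,i}: times Category 2 was recommended so far
        total += b * P[(1, 'x')]**m1 * P[(2, 'x')]**m2 \
               + (1 - b) * P[(1, 'y')]**m1 * P[(2, 'y')]**m2
    return total
```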

5.2 Characterizing the Optimal Policy

Using Theorem 5.3, we show that the planning problem can be solved in $O(1)$ time. To arrive at this conclusion, we perform a case analysis over the following three structures of the click-probability matrix $\mathbf{P}$:

  • Dominant Row, where $\mathbf{P}_{1,y}\geq\mathbf{P}_{2,y}$;

  • Dominant Column, where $\mathbf{P}_{2,x}\geq\mathbf{P}_{2,y}>\mathbf{P}_{1,y}$;

  • Dominant Diagonal, where $\mathbf{P}_{1,x}\geq\mathbf{P}_{2,y}>\mathbf{P}_{1,y},\mathbf{P}_{2,x}$.

Crucially, any matrix $\mathbf{P}$ takes exactly one of the three structures. Further, since $\mathbf{P}$ is known in the planning problem, identifying the structure at hand takes $O(1)$ time. Using this structure partition, we characterize the optimal policy.

Dominant Row

We start by considering the simplest structure, in which Category $1$ is preferred by both user types: since $\mathbf{P}_{1,y}\geq\mathbf{P}_{2,y}$ and $\mathbf{P}_{1,x}\geq\mathbf{P}_{2,x},\mathbf{P}_{1,y},\mathbf{P}_{2,y}$ (Remark 5.1), there exists a dominant row, namely Category $1$.

Lemma 5.4.

For any instance such that $\mathbf{P}$ has a dominant row $a$, the fixed-arm policy $\pi^{a}$ is an optimal policy.

As expected, if Category $1$ is dominant, then the policy that always recommends Category $1$ is optimal.

Dominant Column

In the second structure, there is no dominant row, and the column of type $x$ is dominant, i.e., $\mathbf{P}_{1,x}\geq\mathbf{P}_{2,x}\geq\mathbf{P}_{2,y}>\mathbf{P}_{1,y}$. In this case, which is also the one described in the example of Section 2.1, it is unclear what the optimal policy should be, since neither category dominates the other.

Surprisingly, we show that the optimal policy can have only one form: recommend Category $2$ for some number of iterations (possibly zero) and then always recommend Category $1$. To identify when to switch from Category $2$ to Category $1$, one only needs to compare four expected returns.

Theorem 5.5.

For any instance such that $\mathbf{P}$ has a dominant column, one of the following four policies is optimal:

$$\pi^{1},\quad\pi^{2},\quad\pi^{2:\lfloor N^{*}\rfloor},\quad\pi^{2:\lceil N^{*}\rceil},$$

where $N^{*}=N^{*}(\mathbf{P},\mathbf{q})$ is a constant, and $\pi^{2:\lfloor N^{*}\rfloor}$ (resp. $\pi^{2:\lceil N^{*}\rceil}$) stands for recommending Category $2$ until iteration $\lfloor N^{*}\rfloor$ (resp. $\lceil N^{*}\rceil$) and then switching to Category $1$.

The intuition behind the theorem is as follows. If the prior tends towards type $y$, we might start by recommending Category $2$ (which users of type $y$ are more likely to click on). But after several iterations, and as long as the user stays, the posterior belief $b$ increases since $\mathbf{P}_{2,x}>\mathbf{P}_{2,y}$ (recall Equation (2)). Consequently, since type $x$ becomes more probable, and since $\mathbf{P}_{1,x}\geq\mathbf{P}_{2,x}$, the optimal policy recommends the best category for this type, i.e., Category $1$. For the exact expression of $N^{*}$, we refer the reader to Appendix E.3.

Using Theorem 5.3, we can compute the expected return of each of the four policies in $O(1)$, showing that we can find the optimal policy in $O(1)$ when $\mathbf{P}$ has a dominant column.

Dominant Diagonal

In the last structure, there is no dominant row (i.e., $\mathbf{P}_{2,y}>\mathbf{P}_{1,y}$) nor a dominant column (i.e., $\mathbf{P}_{2,y}>\mathbf{P}_{2,x}$). At first glance, this case appears more complex than the previous two, since no category and no type dominates the other. However, we show that the optimal policy is either always recommending Category $1$ or always recommending Category $2$. Theorem 5.6 summarizes this result.

Theorem 5.6.

For any instance such that $\mathbf{P}$ has a dominant diagonal, either $\pi^{1}$ or $\pi^{2}$ is optimal.

With the full characterization of the optimal policy derived in this section (for all three structures), we have shown that the optimal policy can be computed in $O(1)$.

5.3 Learning: UCB-based regret bound

In this section, we move from the planning task to the learning task. Building on the results of the previous sections, we know that there exists an optimal threshold policy, i.e., a policy whose belief-category walk has a finite prefix of one category and an infinite suffix of the other category. However, there can still be infinitely many such policies. To address this problem, we first show how to reduce the search space, up to a negligible additive factor, to a set of $|\Pi|=O(\ln T)$ policies. Then, we derive the parameters $\tilde{\tau}$ and $\eta$ required for Algorithm 2. As an immediate consequence, we get a sublinear-regret algorithm for this setting. We begin with defining threshold policies.

Definition 5.7 (Threshold Policy).

A policy $\pi$ is called an $(a,h)$-threshold policy if there exists a number $h\in\mathbb{N}\cup\{0\}$ such that in $\pi$'s belief-category walk,

  • $\pi$ recommends category $a$ in iterations $j\leq h$, and

  • $\pi$ recommends category $a^{\prime}$ in iterations $j>h$,

for $a,a^{\prime}\in\{1,2\}$ and $a\neq a^{\prime}$.

For instance, the policy $\pi^{1}$ that always recommends Category 1 is the $(2,0)$-threshold policy, as it recommends Category 2 until iteration zero (i.e., never) and Category 1 from then on. Furthermore, the policy $\pi^{2:\lfloor N^{*}\rfloor}$ introduced in Theorem 5.5 is the $(2,\lfloor N^{*}\rfloor)$-threshold policy.

Next, recall that the chance of departure in every iteration is at least $\epsilon$, since we assume $\max_{a,\tau}\mathbf{P}_{a,\tau}\leq 1-\epsilon$. Consequently, the probability that a user stays beyond $H$ iterations decreases exponentially with $H$. We could use high-probability arguments to claim that it suffices to focus on the first $H$ iterations, but without further insights this would yield $\Omega(2^{H})$ candidates for the optimal policy. Instead, we exploit our insights about threshold policies.

Let $\Pi_{H}$ be the set of all $(a,h)$-threshold policies for $a\in\{1,2\}$ and $h\in[H]\cup\{0\}$. Clearly, $|\Pi_{H}|=2H+2$. Lemma 5.8 shows that the return obtained by the best policy in $\Pi_{H}$ is worse than that of the optimal policy $\pi^{*}$ by at most a negligible factor.
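Constructing $\Pi_{H}$ is mechanical; the sketch below enumerates the $2H+2$ threshold policies, each represented (as an assumption of this sketch) by a function mapping the 1-indexed iteration $j$ to the recommended category, compatible with the Theorem 5.3 evaluator above.

```python
import math

def build_threshold_policies(T, c=1.0):
    """All (a, h)-threshold policies for a in {1, 2} and h in {0, ..., H},
    with H = Theta(ln T); the constant c in H = ceil(c * ln T) is an assumption."""
    H = max(1, math.ceil(c * math.log(T)))

    def make(a, h):
        other = 2 if a == 1 else 1
        # Recommend category a for iterations j <= h, the other category afterwards.
        return lambda j: a if j <= h else other

    return [make(a, h) for a in (1, 2) for h in range(H + 1)]

# len(build_threshold_policies(T)) == 2 * (H + 1) == 2H + 2, matching |Pi_H|.
```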

Lemma 5.8.

For every $H\in\mathbb{N}$, it holds that

$$\mathbb{E}\left[V^{\pi^{*}}-\max_{\pi\in\Pi_{H}}V^{\pi}\right]\leq\frac{1}{2^{O(H)}}.$$

Before we describe how to apply Algorithm 2, we need to show that the returns of all the policies in $\Pi_{H}$ are sub-exponential. In Lemma 5.9, we show that $V^{\pi}$ is $(\tau^{2},b)$-sub-exponential for every threshold policy $\pi\in\Pi_{H}$, and provide bounds for both $\tau$ and $b^{2}/\tau^{2}$.

Lemma 5.9.

Let $\tilde{\tau}=\frac{8e}{\ln(\frac{1}{1-\epsilon})}$ and $\eta=1$. For every threshold policy $\pi\in\Pi_{H}$, the centred random variable $V^{\pi}-\mathbb{E}[V^{\pi}]$ is $(\tau^{2},b)$-sub-exponential with $(\tau^{2},b)$ satisfying $\tilde{\tau}\geq\tau$ and $\eta\geq b^{2}/\tau^{2}$.

We are ready to wrap up our solution for the learning task of this section. Let $H=\Theta(\ln T)$, let $\Pi_{H}$ be the set of threshold policies characterized above, and let $\tilde{\tau}$ and $\eta$ be the constants defined in Lemma 5.9.

Theorem 5.10.

Applying Algorithm 2 with $\Pi_{H},T,\tilde{\tau},\eta$ to the class of two-type, two-category instances considered in this section always yields an expected regret of

$$\mathbb{E}[R_{T}]\leq O(\sqrt{T}\ln T).$$
Proof.

It holds that

$$\begin{aligned}
\mathbb{E}[R_{T}]&=\mathbb{E}\left[TV^{\pi^{*}}-\sum_{t=1}^{T}V^{\pi_{t}}\right]\\
&=\mathbb{E}\left[TV^{\pi^{*}}-\max_{\pi\in\Pi_{H}}TV^{\pi}\right]+\mathbb{E}\left[\max_{\pi\in\Pi_{H}}TV^{\pi}-\sum_{t=1}^{T}V^{\pi_{t}}\right]\\
&\overset{(*)}{\leq}\frac{T}{2^{O(H)}}+O\left(\sqrt{HT\log T}+H\log T\right)=O(\sqrt{T}\ln T),
\end{aligned}$$

where $(*)$ follows from Theorem 3.2 and Lemma 5.8. Finally, setting $H=\Theta(\ln T)$ yields the desired result. ∎

6 Conclusions and Discussion

This paper introduces a MAB model in which the recommender system influences both the rewards accrued and the length of the interaction. We dealt with two classes of problems: a single user type with general departure probabilities (Section 4), and two user types with two categories where each user departs after her first no-click (Section 5). For each problem class, we started by analyzing the planning task, then characterized a small set of candidates for the optimal policy, and then applied Algorithm 2 to achieve sublinear regret.

In the appendix, we also consider a third class of problems: two categories and multiple user types ($M\geq 2$), where users depart after their first no-click. We use the closed-form expected return derived in Theorem 5.3 to show how to use dynamic programming to find approximately optimal planning policies. We formulate the problem of finding an optimal policy for a finite horizon $H$ in a recursive manner. In particular, we show how to find a $1/2^{O(H)}$ additive approximation in $O(H^{2})$ run-time. Unfortunately, this approach cannot assist us in the learning task: dynamic programming relies on skipping sub-optimal solutions to sub-problems (shorter horizons in our case), but this happens on the fly; thus, we cannot define a priori a small set of candidates as Algorithm 2 requires. More broadly, we could use this dynamic programming approach for more than two categories, namely for $K\geq 2$, but then the run-time becomes $O(H^{K})$.

There are several interesting future directions. The first is achieving low regret for the setup in Section 5 with $K\geq 2$. We suspect that this class of problems could enjoy a solution similar to ours, where candidates for optimal policies mix only two categories. The second is achieving low regret for the setup in Section 5 with uncertain departure (i.e., $\mathbf{\Lambda}\neq 1$). Our approach fails in such a case since belief-category walks are no longer deterministic; consequently, the closed-form formula is much more complex and optimal planning becomes more intricate. These two challenges are left open for future work.

Acknowledgement

LL is generously supported by an Open Philanthropy AI Fellowship. LC is supported by Ariane de Rothschild Women Doctoral Program. ZL thanks the Block Center for Technology and Society; Amazon AI; PwC USA via the Digital Transformation and Innovation Center; and the NSF: Fair AI Award IIS2040929 for supporting ACMI lab’s research on the responsible use of machine learning. This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation program (grant agreement No. 882396), by the Israel Science Foundation (grant number 993/17), Tel Aviv University Center for AI and Data Science (TAD), and the Yandex Initiative for Machine Learning at Tel Aviv University.

References

  • Abbasi-Yadkori, Pál, and Szepesvári (2011) Abbasi-Yadkori, Y.; Pál, D.; and Szepesvári, C. 2011. Improved algorithms for linear stochastic bandits. Advances in neural information processing systems, 24: 2312–2320.
  • Azar, Lazaric, and Brunskill (2013) Azar, M. G.; Lazaric, A.; and Brunskill, E. 2013. Sequential transfer in multi-armed bandit with finite set of models. In Proceedings of the 26th International Conference on Neural Information Processing Systems-Volume 2, 2220–2228.
  • Bahar et al. (2020) Bahar, G.; Ben-Porat, O.; Leyton-Brown, K.; and Tennenholtz, M. 2020. Fiduciary bandits. In International Conference on Machine Learning, 518–527. PMLR.
  • Bahar, Smorodinsky, and Tennenholtz (2016) Bahar, G.; Smorodinsky, R.; and Tennenholtz, M. 2016. Economic Recommendation Systems: One Page Abstract. In Proceedings of the 2016 ACM Conference on Economics and Computation, EC ’16, 757–757. New York, NY, USA: ACM. ISBN 978-1-4503-3936-0.
  • Bubeck, Cesa-Bianchi et al. (2012) Bubeck, S.; Cesa-Bianchi, N.; et al. 2012. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends® in Machine Learning, 5(1): 1–122.
  • Cao et al. (2020) Cao, J.; Sun, W.; Shen, Z.-J. M.; and Ettl, M. 2020. Fatigue-Aware Bandits for Dependent Click Models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, 3341–3348.
  • Cesa-Bianchi and Lugosi (2006) Cesa-Bianchi, N.; and Lugosi, G. 2006. Prediction, learning, and games. Cambridge Univ Press.
  • Cohen and Mansour (2019) Cohen, L.; and Mansour, Y. 2019. Optimal Algorithm for Bayesian Incentive-Compatible. In ACM Conf. on Economics and Computation (EC).
  • Eldar and Kutyniok (2012) Eldar, Y.; and Kutyniok, G. 2012. Compressed Sensing: Theory and Applications. ISBN 978-1107005587.
  • Harper and Konstan (2015) Harper, F. M.; and Konstan, J. A. 2015. The MovieLens Datasets: History and Context. ACM Trans. Interact. Intell. Syst.
  • Hillar and Wibisono (2018) Hillar, C.; and Wibisono, A. 2018. Maximum entropy distributions on graphs. arXiv:1301.3321.
  • Jia, Shi, and Shen (2021) Jia, H.; Shi, C.; and Shen, S. 2021. Multi-armed Bandit with Sub-exponential Rewards. Operations Research Letters.
  • Joseph et al. (2016) Joseph, M.; Kearns, M.; Morgenstern, J.; and Roth, A. 2016. Fairness in learning: Classic and contextual bandits. arXiv preprint arXiv:1605.07139.
  • Kaelbling, Littman, and Cassandra (1998) Kaelbling, L. P.; Littman, M. L.; and Cassandra, A. R. 1998. Planning and acting in partially observable stochastic domains. Artificial intelligence, 101(1-2): 99–134.
  • Korda, Szörényi, and Li (2016) Korda, N.; Szörényi, B.; and Li, S. 2016. Distributed Clustering of Linear Bandits in Peer to Peer Networks. In Proceedings of the 33rd International Conference on International Conference on Machine Learning.
  • Kremer, Mansour, and Perry (2014) Kremer, I.; Mansour, Y.; and Perry, M. 2014. Implementing the wisdom of the crowd. Journal of Political Economy, 122: 988–1012.
  • Lattimore and Szepesvári (2020) Lattimore, T.; and Szepesvári, C. 2020. Bandit algorithms. Cambridge University Press.
  • Leqi et al. (2020) Leqi, L.; Kilinc-Karzan, F.; Lipton, Z. C.; and Montgomery, A. L. 2020. Rebounding Bandits for Modeling Satiation Effects. arXiv preprint arXiv:2011.06741.
  • Liu and Ho (2018) Liu, Y.; and Ho, C.-J. 2018. Incentivizing high quality user contributions: New arm generation in bandit learning. In Thirty-Second AAAI Conference on Artificial Intelligence.
  • Lu and Yang (2016) Lu, Z.; and Yang, Q. 2016. Partially Observable Markov Decision Process for Recommender Systems. CoRR, abs/1608.07793.
  • Mahadik et al. (2020) Mahadik, K.; Wu, Q.; Li, S.; and Sabne, A. 2020. Fast Distributed Bandits for Online Recommendation Systems.
  • Mansour, Slivkins, and Syrgkanis (2015) Mansour, Y.; Slivkins, A.; and Syrgkanis, V. 2015. Bayesian Incentive-Compatible Bandit Exploration. In ACM Conf. on Economics and Computation (EC).
  • Patil et al. (2020) Patil, V.; Ghalme, G.; Nair, V.; and Narahari, Y. 2020. Achieving fairness in the stochastic multi-armed bandit problem. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, 5379–5386.
  • Pike-Burke and Grünewälder (2019) Pike-Burke, C.; and Grünewälder, S. 2019. Recovering bandits. arXiv preprint arXiv:1910.14354.
  • Ron, Ben-Porat, and Shalit (2021) Ron, T.; Ben-Porat, O.; and Shalit, U. 2021. Corporate Social Responsibility via Multi-Armed Bandits. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, 26–40.
  • Shani, Heckerman, and Brafman (2005) Shani, G.; Heckerman, D.; and Brafman, R. I. 2005. An MDP-Based Recommender System. Journal of Machine Learning Research, 6(43): 1265–1295.
  • Slivkins (2019) Slivkins, A. 2019. Introduction to multi-armed bandits. arXiv preprint arXiv:1904.07272.
  • Zhao et al. (2020a) Zhao, X.; Zheng, X.; Yang, X.; Liu, X.; and Tang, J. 2020a. Jointly Learning to Recommend and Advertise. In KDD ’20: The 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining.
  • Zhao et al. (2020b) Zhao, Y.; Zhou, Y.; Ou, M.; Xu, H.; and Li, N. 2020b. Maximizing Cumulative User Engagement in Sequential Recommendation: An Online Optimization Perspective. In Gupta, R.; Liu, Y.; Tang, J.; and Prakash, B. A., eds., KDD ’20: The 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining.

Appendix A Extension: Planning Beyond Two User Types

In this section, we treat the planning task with two categories ($K=2$) but potentially many types (i.e., $M\geq 2$). For convenience, we formalize the results in this section in terms of $M=2$, but they readily extend to the more general $2\times M$ case. We derive an almost-optimal planning policy via dynamic programming, and then explain why it cannot be used for learning as in the previous section.

For reasons that will become apparent later on, we define $V_{H}^{\pi}$ as the return of a policy $\pi$ up to the $H$-th iteration. Using Theorem 5.3, we have that

$$\mathbb{E}[V_{H}^{\pi}(b)]=\sum_{i=1}^{H}b\cdot\mathbf{P}_{1,x}^{m_{1,i}}\cdot\mathbf{P}_{2,x}^{m_{2,i}}+(1-b)\cdot\mathbf{P}_{1,y}^{m_{1,i}}\cdot\mathbf{P}_{2,y}^{m_{2,i}},$$

where $m_{1,i}:=|\{a_{j}=1,\,j\leq i\}|$ and $m_{2,i}:=|\{a_{j}=2,\,j\leq i\}|$ are calculated based on the belief-category walk $b_{1},a_{1},b_{2},a_{2},\dots$ induced by $\pi$. Further, let $\tilde{\pi}^{*}$ denote the policy maximizing $V_{H}$.

Notice that there is a bijection from $H$-iteration policies to sequences $(m_{1,i},m_{2,i})_{i=1}^{H}$; hence, we can find $\tilde{\pi}^{*}$ by maximizing the right-hand side of the above equation over $(m_{1,i},m_{2,i})_{i=1}^{H}$. Formally, we want to solve the integer linear program (ILP),

$$\begin{array}{ll}
\text{maximize} & \displaystyle\sum_{i=1}^{H}b\cdot\mathbf{P}_{1,x}^{m_{1,i}}\cdot\mathbf{P}_{2,x}^{m_{2,i}}+(1-b)\cdot\mathbf{P}_{1,y}^{m_{1,i}}\cdot\mathbf{P}_{2,y}^{m_{2,i}}\\
\text{subject to} & m_{a,i}=\displaystyle\sum_{l=1}^{i}z_{a,l}\ \text{for }a\in\{1,2\},\,i\in[H],\\
& z_{a,i}\in\{0,1\}\ \text{for }a\in\{1,2\},\,i\in[H],\\
& z_{1,i}+z_{2,i}=1\ \text{for }i\in[H].
\end{array}\qquad(3)$$

Although this problem involves integer programming, we can solve it via dynamic programming in $O(H^{2})$ run-time. Notice that the optimization is over the binary variables $(z_{1,i},z_{2,i})_{i=1}^{H}$. Let $Z^{H}$ be the set of feasible solutions of the ILP, and let $Z^{h}$ denote the set of prefixes of length $h\leq H$ of elements of $Z^{H}$.

For any $h\in[H]$ and $\mathbf{z}\in Z^{h}$, define

$$D^{h}(\mathbf{z})\overset{\text{def}}{=}\sum_{i=1}^{h}b\cdot\mathbf{P}_{1,x}^{m_{1,i}}\cdot\mathbf{P}_{2,x}^{m_{2,i}}+(1-b)\cdot\mathbf{P}_{1,y}^{m_{1,i}}\cdot\mathbf{P}_{2,y}^{m_{2,i}},$$

where $m_{a,i}=\sum_{l=1}^{i}z_{a,l}$ for $a\in\{1,2\}$, $i\in[h]$, as in the ILP.

Consequently, solving the ILP is equivalent to maximizing $D^{H}$ over the domain $Z^{H}$.

Next, for any $h\in[H]$ and two integers $c_{1},c_{2}$ such that $c_{1}+c_{2}=h$, define

$$\tilde{D}^{h}(c_{1},c_{2})\overset{\text{def}}{=}\max_{\substack{\mathbf{z}\in Z^{h},\\ m_{1,h}(\mathbf{z})=c_{1},\\ m_{2,h}(\mathbf{z})=c_{2}}}D^{h}(\mathbf{z}). \qquad (4)$$

Under this construction, $\max_{c_{1},c_{2}}\tilde{D}^{H}(c_{1},c_{2})$ over $c_{1},c_{2}$ such that $c_{1}+c_{2}=H$ is precisely the value of the ILP.

Reformulating Equation (4) for $h>1$,

$$\tilde{D}^{h}(c_{1},c_{2})=\max_{\substack{z_{1},z_{2}\in\{0,1\}\\ z_{1}+z_{2}=1}}\left\{\tilde{D}^{h-1}(c_{1}-z_{1},c_{2}-z_{2})+\alpha(c_{1},c_{2})\right\},$$

where $\alpha(m_{1},m_{2})\overset{\text{def}}{=}b\cdot\mathbf{P}_{1,x}^{m_{1}}\cdot\mathbf{P}_{2,x}^{m_{2}}+(1-b)\cdot\mathbf{P}_{1,y}^{m_{1}}\cdot\mathbf{P}_{2,y}^{m_{2}}$. For every $h$, there are only $h+1$ possible inputs to $\tilde{D}^{h}$: all the ways of splitting $h$ into non-negative integers $c_{1}$ and $c_{2}$. Therefore, having computed $\tilde{D}^{h-1}$ for all of its feasible inputs, we can compute $\tilde{D}^{h}$ for all of its inputs in $O(h)$. Consequently, computing $\max_{c_{1},c_{2}}\tilde{D}^{H}(c_{1},c_{2})$, which is precisely the value of the ILP in (3), takes $O(H^{2})$ run-time. Moreover, the policy $\tilde{\pi}^{*}$ can be recovered using backtracking. We remark that an argument similar to Lemma 5.8 implies that $\mathbb{E}[V^{\pi^{*}}-V^{\tilde{\pi}^{*}}]\leq\frac{1}{2^{O(H)}}$; hence, $\tilde{\pi}^{*}$ is almost optimal.
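A minimal sketch of this dynamic program is given below. Names follow the text ($\tilde D^h$, $\alpha$); the table representation and the returned argmax split are implementation assumptions, and the backtracking step that recovers $\tilde{\pi}^{*}$ is omitted for brevity.

```python
def plan_dp(P, b, H):
    """Dynamic program of Appendix A for two categories.

    Returns max_{c1 + c2 = H} D~^H(c1, c2) and the maximizing split (c1, c2),
    where D~^h is tabulated over all splits of h, giving O(H^2) run-time.
    """
    def alpha(m1, m2):
        # Per-iteration term: b * P_{1,x}^{m1} P_{2,x}^{m2} + (1-b) * P_{1,y}^{m1} P_{2,y}^{m2}
        return b * P[(1, 'x')]**m1 * P[(2, 'x')]**m2 \
             + (1 - b) * P[(1, 'y')]**m1 * P[(2, 'y')]**m2

    D = {0: 0.0}                                  # D[c1] = D~^h(c1, h - c1); here h = 0
    for h in range(1, H + 1):
        new_D = {}
        for c1 in range(h + 1):
            c2 = h - c1
            candidates = []
            if c1 >= 1:
                candidates.append(D[c1 - 1])      # last recommendation was Category 1
            if c2 >= 1:
                candidates.append(D[c1])          # last recommendation was Category 2
            new_D[c1] = max(candidates) + alpha(c1, c2)
        D = new_D
    best_c1 = max(D, key=D.get)
    return D[best_c1], (best_c1, H - best_c1)
```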

To conclude this section, we remark that this approach also extends to K>2 categories. Naively, for a finite horizon H, there are K^{H} possible policies. The dynamic programming procedure explained above reduces the search to O(H^{K}) run-time. This run-time, exponential in the number of categories but polynomial in the horizon, is feasible when the number of categories is small.
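For concreteness, the sketch below implements the O(H^{2}) dynamic program described above for two categories. It is an illustration only: the function name and the probability arguments (hypothetical values of \mathbf{P} and of the prior belief b) are ours rather than the paper's.

```python
def plan_two_categories(P1x, P2x, P1y, P2y, b, H):
    """Return (optimal value, category sequence) for the H-iteration problem."""

    def alpha(m1, m2):
        # Expected reward of an iteration reached after m1 recommendations of
        # category 1 and m2 of category 2 (the summand of the ILP objective).
        return b * (P1x ** m1) * (P2x ** m2) + (1 - b) * (P1y ** m1) * (P2y ** m2)

    # D[c1] holds \tilde{D}^h(c1, h - c1); parent[h-1][c1] stores the last
    # recommended category, which lets us backtrack the optimal policy.
    D = {0: 0.0}
    parent = []
    for h in range(1, H + 1):
        new_D, par = {}, {}
        for c1 in range(0, h + 1):
            c2 = h - c1
            best, arg = float("-inf"), None
            if c1 >= 1:                  # last recommendation was category 1
                cand = D.get(c1 - 1, float("-inf")) + alpha(c1, c2)
                if cand > best:
                    best, arg = cand, 1
            if c2 >= 1:                  # last recommendation was category 2
                cand = D.get(c1, float("-inf")) + alpha(c1, c2)
                if cand > best:
                    best, arg = cand, 2
            new_D[c1], par[c1] = best, arg
        D = new_D
        parent.append(par)

    c1 = max(D, key=D.get)               # arg max over c1 + c2 = H
    value, seq = D[c1], []
    for h in range(H, 0, -1):             # backtrack the recommendation sequence
        a = parent[h - 1][c1]
        seq.append(a)
        if a == 1:
            c1 -= 1
    return value, list(reversed(seq))

# Example usage (hypothetical instance):
# val, seq = plan_two_categories(0.9, 0.4, 0.6, 0.3, b=0.5, H=10)
```

Each level h reuses only the h values computed at level h-1, which is what yields the quadratic overall run-time.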

Appendix B Experimental Evaluation

For general real-world datasets, we propose a scheme for constructing semi-synthetic problem instances with many arms and many user types, using rating datasets with multiple ratings per user. We exemplify our scheme on the MovieLens dataset (Harper and Konstan 2015). As a pre-processing step, we set movie genres to be the categories of interest, select a subset A of categories of size k=|A| (e.g., sci-fi, drama, and comedy), and select the number of user types, m. We then remove any user who has not provided a rating for at least one movie from each category a\in A. When running the algorithm, we randomly draw users from the data; given a recommended category a, we suggest a random movie of that category which the drawn user has rated, and set their click probability to 1-r, where r\in[0,1] is their normalized rating of the suggested movie.
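As an illustration, a minimal sketch of this pre-processing and user simulation is given below. It assumes the standard MovieLens ratings.csv/movies.csv layout (userId, movieId, rating, genres); the category subset, file paths, helper names, and the convention of tagging each movie by its first matching genre are placeholders of ours.

```python
import random
import pandas as pd

CATEGORIES = ["Sci-Fi", "Drama", "Comedy"]   # the subset A, with |A| = k

ratings = pd.read_csv("ratings.csv")
movies = pd.read_csv("movies.csv")
merged = ratings.merge(movies, on="movieId")

def first_category(genres):
    # Tag a movie by the first genre in A that it belongs to (simplifying convention).
    for c in CATEGORIES:
        if c in genres.split("|"):
            return c
    return None

merged["category"] = merged["genres"].apply(first_category)
merged = merged.dropna(subset=["category"])

# Remove users that have not rated at least one movie in every category of A.
per_user = merged.groupby("userId")["category"].nunique()
valid_users = per_user[per_user == len(CATEGORIES)].index
merged = merged[merged["userId"].isin(valid_users)]

def simulate_response(user_id, category, rng=random):
    """Draw a rated movie of the recommended category and return whether the user clicks."""
    pool = merged[(merged["userId"] == user_id) & (merged["category"] == category)]
    r = pool.sample(1)["rating"].iloc[0] / 5.0   # normalized rating in [0, 1]
    return rng.random() < 1.0 - r                # click probability 1 - r, as described above
```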

Appendix C UCB Policy for Sub-exponential Returns

An important tool for analyzing sub-exponential random variables is Bernstein's inequality, a concentration inequality for such variables (see, e.g., Jia, Shi, and Shen (2021)). Since it is a major component of the regret analysis of Algorithm 2, we state it here for convenience.

Lemma C.1.

(Bernstein’s Inequality) Let a random variable XX be sub-exponential with parameters (τ2,b)(\tau^{2},b). Then for every v0v\geq 0:

Pr[|X𝔼[X]|v]{2exp(v22τ2)vτ2b2exp(v2b)else.\Pr[|X-\mathbb{E}[X]|\geq v]\leq\begin{cases}2\exp(-\frac{v^{2}}{2\tau^{2}})&v\leq\frac{\tau^{2}}{b}\\ 2\exp(-\frac{v}{2b})&else\end{cases}.
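For reference, the bound of Lemma C.1 can be evaluated directly; the helper below is only an illustration with assumed sub-exponential parameters (\tau^{2},b).

```python
import math

def bernstein_tail_bound(v, tau2, b):
    """Upper bound on Pr[|X - E[X]| >= v] for a (tau2, b)-sub-exponential X (Lemma C.1)."""
    if v <= tau2 / b:
        return 2.0 * math.exp(-v ** 2 / (2.0 * tau2))
    return 2.0 * math.exp(-v / (2.0 * b))

# e.g., bernstein_tail_bound(v=3.0, tau2=1.0, b=0.5)
```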

Appendix D Single User Type: Proofs from Section 4

To simplify the proofs, we use the following notation: For a fixed-arm policy πa\pi^{a}, we use VjπaV_{j}^{\pi^{a}} to denote its return from iteration jj until the user departs. Namely,

V_{j}^{\pi^{a}}=\sum_{i=j}^{N^{\pi^{a}}}r_{i}(a).

Throughout this section, we will use the following Observation.

Observation D.1.

For every policy π\pi and iteration jj,

𝔼[Vjπ]=𝐏πj(1+𝔼[Vj+1π])+(1𝚲πj)(1𝐏πj)𝔼[Vj+1π]=𝔼[Vj+1π](1𝚲πj(1𝐏πj))+𝐏πj.\mathbb{E}[V^{\pi}_{j}]=\mathbf{P}_{\pi_{j}}(1+\mathbb{E}[V^{\pi}_{j+1}])+(1-\mathbf{\Lambda}_{\pi_{j}})(1-\mathbf{P}_{\pi_{j}})\mathbb{E}[V^{\pi}_{j+1}]=\mathbb{E}[V^{\pi}_{j+1}](1-\mathbf{\Lambda}_{\pi_{j}}(1-\mathbf{P}_{\pi_{j}}))+\mathbf{P}_{\pi_{j}}.

See 4.1

Proof.

First, recall that every POMDP has an optimal Markovian policy which is deterministic (we refer the reader to Section 5.1 for a full formulation of the problem as a POMDP). Having independent rewards and a single state implies that there exists \mu^{*}\in\mathbb{R} such that \mathbb{E}[V^{*}_{j}]=\mu^{*} for every j\in\mathbb{N} (similarly to standard MAB problems, there exists a fixed-arm policy which is optimal).
Assume by contradiction that the optimal policy \pi^{a^{*}} satisfies

aargmaxa[k]𝐏a𝚲a(1𝐏a).a^{*}\notin\operatorname*{arg\,max}_{a\in[k]}\frac{\mathbf{P}_{a}}{\mathbf{\Lambda}_{a}(1-\mathbf{P}_{a})}.

Now, notice that

𝔼[Vπa]=𝔼[V1πa]=𝔼[V2πa](1𝚲a(1𝐏a))+𝐏a\mathbb{E}[V^{\pi^{a^{\prime}}}]=\mathbb{E}[V^{\pi^{a^{\prime}}}_{1}]=\mathbb{E}[V^{\pi^{a^{\prime}}}_{2}](1-\mathbf{\Lambda}_{a^{\prime}}(1-\mathbf{P}_{a^{\prime}}))+\mathbf{P}_{a^{\prime}}

Solving the recurrence relation and summing the geometric series we get

𝔼[Vπa]=𝐏aj=0(1𝚲a(1𝐏a))j=𝐏a𝚲a(1𝐏a).\mathbb{E}[V^{\pi^{a^{\prime}}}]=\mathbf{P}_{a^{\prime}}\sum_{j=0}^{\infty}(1-\mathbf{\Lambda}_{a^{\prime}}(1-\mathbf{P}_{a^{\prime}}))^{j}=\frac{\mathbf{P}_{a^{\prime}}}{\mathbf{\Lambda}_{a^{\prime}}(1-\mathbf{P}_{a^{\prime}})}.

Finally,

aargmaxa[k]𝐏a𝚲a(1𝐏a),a^{*}\notin\operatorname*{arg\,max}_{a\in[k]}\frac{\mathbf{P}_{a}}{\mathbf{\Lambda}_{a}(1-\mathbf{P}_{a})},

implies that any fixed-arm policy \pi^{a^{\prime}} with

aargmaxa[k]𝐏a𝚲a(1𝐏a)a^{\prime}\in\operatorname*{arg\,max}_{a\in[k]}\frac{\mathbf{P}_{a}}{\mathbf{\Lambda}_{a}(1-\mathbf{P}_{a})}

satisfies \mathbb{E}[V^{\pi^{a^{\prime}}}]>\mathbb{E}[V^{\pi^{a^{*}}}], contradicting the optimality of \pi^{a^{*}}. ∎
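To make the closed form above concrete, the short Monte Carlo sketch below (not part of the paper) simulates a fixed-arm policy with assumed click probability P_a and departure probability \Lambda_a and compares the empirical return with P_a/(\Lambda_a(1-P_a)).

```python
import random

def simulate_return(P, Lam, rng=random):
    """One user's return under a fixed-arm policy: count clicks until departure."""
    clicks = 0
    while True:
        if rng.random() < P:          # click: reward 1, user stays
            clicks += 1
        elif rng.random() < Lam:      # no click: user departs with probability Lambda
            return clicks

P_a, Lam_a = 0.7, 0.5                  # assumed (hypothetical) parameters
n = 200_000
empirical = sum(simulate_return(P_a, Lam_a) for _ in range(n)) / n
closed_form = P_a / (Lam_a * (1 - P_a))
# empirical and closed_form should agree up to Monte Carlo error (about 4.67 here).
```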

Lemma D.2.

For each a[k]a\in[k], the centered random return Vπa𝔼[Vπa]V^{\pi^{a}}-\mathbb{E}[V^{\pi^{a}}] is sub-exponential with parameter C2=4/ln(1𝚲a(1𝐏a))C_{2}=-4/\ln(1-\mathbf{\Lambda}_{a}(1-\mathbf{P}_{a})).

In order to show that the returns of fixed-arm policies are sub-exponential random variables, we first show that the number of iterations a user recommended by a fixed-arm policy stays in the system is also sub-exponential. For this purpose, we state a lemma implying that every geometric random variable is sub-exponential. The proof of the next lemma appears, e.g., in Hillar and Wibisono (2018) (Lemma 4.3).

Lemma D.3.

Let XX be a geometric random variable with parameter r(0,1)r\in(0,1), so that:

Pr[X=x]=(1r)x1r,x.\Pr[X=x]=(1-r)^{x-1}\,r,\quad x\in\mathbb{N}.

Then XX satisfies Property (2) from Definition  3.1. Namely, XX is sub-exponential with parameter C2=2/ln(1r)C_{2}=-2/\ln(1-r). Formally,

p0:(𝔼[|X|p])1/p2ln(1r)p.\forall p\geq 0\mathrel{\mathop{\mathchar 58\relax}}(\mathbb{E}[|X|^{p}])^{1/p}\leq-\frac{2}{\ln(1-r)}p.
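The moment bound of Lemma D.3 is easy to check numerically; the snippet below (an illustration with an assumed parameter r) truncates the defining series and verifies the bound for a few values of p.

```python
import math

def geometric_moment(r, p, tail=10_000):
    """(E[X^p])^(1/p), computed by truncating the series sum_x x^p (1-r)^(x-1) r."""
    m = sum((x ** p) * ((1 - r) ** (x - 1)) * r for x in range(1, tail))
    return m ** (1.0 / p)

r = 0.3
C2 = -2.0 / math.log(1 - r)
for p in range(1, 11):
    assert geometric_moment(r, p) <= C2 * p   # the bound of Lemma D.3
```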

The lemma above and Observation 4.2 allow us to deduce, in the first part of the following corollary, that the variables N_{a} are sub-exponential (the case in which \mathbf{\Lambda}_{a}=0 follows immediately from the definition). The second part of the corollary follows directly from the equivalence between Properties (2) and (1) in Definition 3.1.

Corollary D.4.

For each a[K]a\in[K], the number of iterations a user recommended by πa\pi^{a} stays within the system, NaN_{a}, is sub-exponential with parameter C2a=2/ln(1𝚲a(1𝐏a))C_{2}^{a}=-2/\ln(1-\mathbf{\Lambda}_{a}(1-\mathbf{P}_{a})). In addition, there exist constants C1a>0C^{a}_{1}>0 for every a[K]a\in[K] such that

a[K],v0:Pr[|Na|>v]exp(1vC1a).\forall a\in[K],\;v\geq 0\mathrel{\mathop{\mathchar 58\relax}}\Pr[|N_{a}|>v]\leq\exp({1-\frac{v}{C_{1}^{a}}}).

The next Proposition D.5 is used for the proof of Lemma D.2.

Proposition D.5.

For every a[K]a\in[K],

|𝔼[Vπa]|2ln(1𝚲a(1𝐏a))|\mathbb{E}[V^{\pi^{a}}]|\leq\frac{-2}{\ln(1-\mathbf{\Lambda}_{a}(1-\mathbf{P}_{a}))}
Proof.

First, notice that

(1𝚲a(1𝐏a))ln(1𝚲a(1𝐏a))>(1𝚲a(1𝐏a))𝚲a(1𝐏a)1𝚲a(1𝐏a)=𝚲a(1𝐏a)>2𝚲a(1𝐏a),(1-\mathbf{\Lambda}_{a}(1-\mathbf{P}_{a}))\ln(1-\mathbf{\Lambda}_{a}(1-\mathbf{P}_{a}))>(1-\mathbf{\Lambda}_{a}(1-\mathbf{P}_{a}))\frac{-\mathbf{\Lambda}_{a}(1-\mathbf{P}_{a})}{1-\mathbf{\Lambda}_{a}(1-\mathbf{P}_{a})}=-\mathbf{\Lambda}_{a}(1-\mathbf{P}_{a})>-2\mathbf{\Lambda}_{a}(1-\mathbf{P}_{a}),

where the first inequality is due to x1+xln(1+x)\frac{x}{1+x}\leq\ln(1+x) for every x1x\geq-1. Rearranging,

1𝚲a(1𝐏a)𝚲a(1𝐏a)<2ln(1𝚲a(1𝐏a)).\frac{1-\mathbf{\Lambda}_{a}(1-\mathbf{P}_{a})}{\mathbf{\Lambda}_{a}(1-\mathbf{P}_{a})}<\frac{-2}{\ln(1-\mathbf{\Lambda}_{a}(1-\mathbf{P}_{a}))}. (5)

For each user, the realization of V^{\pi^{a}} is less than or equal to the realization of N_{a}-1 for the same user (as users provide negative feedback in their last iteration); hence,

|𝔼[Vπa]|=𝔼[Vπa]𝔼[Na]1=1𝚲a(1𝐏a)1=1𝚲a(1𝐏a)𝚲a(1𝐏a)<2ln(1𝚲a(1𝐏a)).|\mathbb{E}[V^{\pi^{a}}]|=\mathbb{E}[V^{\pi^{a}}]\leq\mathbb{E}[N_{a}]-1=\frac{1}{\mathbf{\Lambda}_{a}(1-\mathbf{P}_{a})}-1=\frac{1-\mathbf{\Lambda}_{a}(1-\mathbf{P}_{a})}{\mathbf{\Lambda}_{a}(1-\mathbf{P}_{a})}<\frac{-2}{\ln(1-\mathbf{\Lambda}_{a}(1-\mathbf{P}_{a}))}.

We proceed by showing that the returns of fixed-arm policies satisfy Property (1) from Definition 3.1. See D.2

Proof.

We use Property (1) from Definition  3.1 to derive that VπaV^{\pi^{a}} is also sub-exponential. This is true since the tails of VπaV^{\pi^{a}} satisfy that for all v0v\geq 0,

Pr[|Vπa|>v]Pr[|Na|>v+1]Pr[|Na|>v](1)exp(1vC1),\Pr[|V^{\pi^{a}}|>v]\leq\Pr[|N_{a}|>v+1]\leq\Pr[|N_{a}|>v]\leq_{(\ref{it:tails})}\exp({1-\frac{v}{C_{1}}}),

where the first inequality follows since |Na|>v+1|N_{a}|>v+1 is a necessary condition for |Vπa|>v|V^{\pi^{a}}|>v, and the last inequality follows from Corollary D.4. Along with Definition  3.1, we conclude that

𝔼[|Vπa|p]1/p2/ln(1𝚲a(1𝐏a))p.\mathbb{E}[|V^{\pi^{a}}|^{p}]^{1/p}\leq-2/\ln(1-\mathbf{\Lambda}_{a}(1-\mathbf{P}_{a}))p. (6)

Now, applying Minkowski’s inequality and then Jensen’s inequality (as f(z)=zp,g(z)=|z|f(z)=z^{p},g(z)=|z| are convex for every p1p\geq 1) we get

(𝔼[|Vπa𝔼[Vπa]|p])1/p𝔼[|Vπa|p]1/p+𝔼[𝔼[|Vπa|]p]1/p𝔼[|Vπa|p]1/p+|𝔼[Vπa]|.(\mathbb{E}[|V^{\pi^{a}}-\mathbb{E}[V^{\pi^{a}}]|^{p}])^{1/p}\leq\mathbb{E}[|V^{\pi^{a}}|^{p}]^{1/p}+\mathbb{E}[\mathbb{E}[|V^{\pi^{a}}|]^{p}]^{1/p}\leq\mathbb{E}[|V^{\pi^{a}}|^{p}]^{1/p}+|\mathbb{E}[V^{\pi^{a}}]|.

Using Proposition D.5 and Inequality (6), we get

\mathbb{E}[|V^{\pi^{a}}|^{p}]^{1/p}+|\mathbb{E}[V^{\pi^{a}}]|\leq\frac{-2p}{\ln(1-\mathbf{\Lambda}_{a}(1-\mathbf{P}_{a}))}+\frac{-2}{\ln(1-\mathbf{\Lambda}_{a}(1-\mathbf{P}_{a}))}\leq\frac{-4p}{\ln(1-\mathbf{\Lambda}_{a}(1-\mathbf{P}_{a}))}\quad\text{for every }p\geq 1.

Hence Vπa𝔼[Vπa]V^{\pi^{a}}-\mathbb{E}[V^{\pi^{a}}] is sub-exponential with parameter C2=4/ln(1𝚲a(1𝐏a))C_{2}=-4/\ln(1-\mathbf{\Lambda}_{a}(1-\mathbf{P}_{a})). ∎

See 4.3

Proof.

Throughout this proof, we will use the sub-exponential norm, ||||ψ1||\cdot||_{\psi_{1}}, which is defined as

Zψ1=supp1(𝔼[|Z|p])1/pp.||Z||_{\psi_{1}}=\sup_{p\geq 1}\frac{(\mathbb{E}[|Z|^{p}])^{1/p}}{p}.

Let

X=Vπa𝔼[Vπa]Vπa𝔼[Vπa]ψ1,y=γVπa𝔼[Vπa]ψ1.X=\frac{V^{\pi^{a}}-\mathbb{E}[V^{\pi^{a}}]}{||V^{\pi^{a}}-\mathbb{E}[V^{\pi^{a}}]||_{\psi_{1}}},\quad y=\gamma\cdot||V^{\pi^{a}}-\mathbb{E}[V^{\pi^{a}}]||_{\psi_{1}}.

We have that

Xψ1=Vπa𝔼[Vπa]Vπa𝔼[Vπa]ψ1ψ1=1.||X||_{\psi_{1}}=||\frac{V^{\pi^{a}}-\mathbb{E}[V^{\pi^{a}}]}{||V^{\pi^{a}}-\mathbb{E}[V^{\pi^{a}}]||_{\psi_{1}}}||_{\psi_{1}}=1. (7)

Let γ\gamma be such that |γ|<1/ba=ln(1𝚲a(1𝐏a))8e|\gamma|<1/b_{a}=-\frac{\ln(1-\mathbf{\Lambda}_{a}(1-\mathbf{P}_{a}))}{8e}. From Lemma D.2 we conclude that

|\gamma|=\big{|}\frac{y}{||V^{\pi^{a}}-\mathbb{E}[V^{\pi^{a}}]||_{\psi_{1}}}\big{|}\leq-\frac{\ln(1-\mathbf{\Lambda}_{a}(1-\mathbf{P}_{a}))}{8e}\leq\frac{1}{2e}\cdot\frac{1}{||V^{\pi^{a}}-\mathbb{E}[V^{\pi^{a}}]||_{\psi_{1}}};

hence, |y|<12e|y|<\frac{1}{2e}.
Summing the geometric series, we get

p=2(e|y|)p=e2y21e|y|<2e2y2\sum_{p=2}^{\infty}(e|y|)^{p}=\frac{e^{2}y^{2}}{1-e|y|}<2e^{2}y^{2} (8)

In addition, notice that yX=γ(Vπa𝔼[Vπa])yX=\gamma(V^{\pi^{a}}-\mathbb{E}[V^{\pi^{a}}]).
Next, from the Taylor series of exp()\exp(\cdot) we have

\mathbb{E}[\exp(\gamma(V^{\pi^{a}}-\mathbb{E}[V^{\pi^{a}}]))]=\mathbb{E}[\exp(yX)]=1+y\mathbb{E}[X]+\sum_{p=2}^{\infty}\frac{y^{p}\mathbb{E}[X^{p}]}{p!}.

Combining the fact that \mathbb{E}[X]=0 and (7) with the above,

1+y\mathbb{E}[X]+\sum_{p=2}^{\infty}\frac{y^{p}\mathbb{E}[X^{p}]}{p!}\leq 1+\sum_{p=2}^{\infty}\frac{|y|^{p}p^{p}}{p!}.

By applying p!(pe)pp!\geq(\frac{p}{e})^{p} and (8), we get

1+\sum_{p=2}^{\infty}\frac{|y|^{p}p^{p}}{p!}\leq 1+\sum_{p=2}^{\infty}(e|y|)^{p}\leq 1+2e^{2}y^{2}\leq\exp(2e^{2}y^{2})=\exp(2e^{2}(\gamma\cdot||V^{\pi^{a}}-\mathbb{E}[V^{\pi^{a}}]||_{\psi_{1}})^{2}),

where the last inequality is due to 1+xex1+x\leq e^{x}.

Note that ||V^{\pi^{a}}-\mathbb{E}[V^{\pi^{a}}]||_{\psi_{1}}\leq-\frac{4}{\ln(1-\mathbf{\Lambda}_{a}(1-\mathbf{P}_{a}))}. Ultimately,

𝔼[exp(γ(Vπa𝔼[Vπa]))]exp(2e2γ2(4ln(1𝚲a(1𝐏a)))2)=exp(12γ2(8eln(1𝚲a(1𝐏a)))2).\mathbb{E}[\exp(\gamma(V^{\pi^{a}}-\mathbb{E}[V^{\pi^{a}}]))]\leq\exp\big{(}2e^{2}\gamma^{2}(-\frac{4}{\ln(1-\mathbf{\Lambda}_{a}(1-\mathbf{P}_{a}))})^{2}\big{)}=\exp\big{(}\frac{1}{2}\gamma^{2}(-\frac{8e}{\ln(1-\mathbf{\Lambda}_{a}(1-\mathbf{P}_{a}))})^{2}\big{)}.

This concludes the proof of the lemma. ∎

Appendix E Two User Types and Two Categories: Proofs from Section 5

E.1 Planning when K=2K=2

See 5.3

Proof.

Let \beta^{\pi}_{i}(b):=b\cdot\mathbf{P}_{1,x}^{m_{1,i}}\cdot\mathbf{P}_{2,x}^{m_{2,i}}+(1-b)\mathbf{P}_{1,y}^{m_{1,i}}\cdot\mathbf{P}_{2,y}^{m_{2,i}}. We will prove that for every policy \pi and every belief b, we have \mathbb{E}[V_{H}^{\pi}(b)]=\sum_{i=1}^{H}\beta^{\pi}_{i}(b), by a backward induction over H.

For the base case, consider H=1H=1. We have that

\mathbb{E}[V_{1}^{\pi}(b_{1})]=b_{1}\cdot\mathbf{P}_{a_{1},x}+(1-b_{1})\mathbf{P}_{a_{1},y}=b_{1}\cdot\mathbf{P}_{1,x}^{m_{1,1}}\cdot\mathbf{P}_{2,x}^{m_{2,1}}+(1-b_{1})\mathbf{P}_{1,y}^{m_{1,1}}\cdot\mathbf{P}_{2,y}^{m_{2,1}}=\beta^{\pi}_{1}(b_{1})

as ma,1=𝕀[a1=a]m_{a,1}=\mathbb{I}[a_{1}=a].

For the inductive step, assume that 𝔼[VH1π(b)]=i=1H1βiπ(b)\mathbb{E}[V_{H-1}^{\pi}({b})]=\sum_{i=1}^{H-1}\beta^{\pi}_{i}({b}) for every b[0,1]{b}\in[0,1]. We need to show that 𝔼[VHπ(b)]=i=1Hβiπ(b)\mathbb{E}[V_{H}^{\pi}(b)]=\sum_{i=1}^{H}\beta^{\pi}_{i}(b) for every b[0,1]b\in[0,1].

Indeed,

𝔼[VHπ(b1)]\displaystyle\mathbb{E}[V_{H}^{\pi}(b_{1})] =β1π(b1)(1+𝔼[VH1π(b(b1,a1,liked))])\displaystyle=\beta^{\pi}_{1}(b_{1})(1+\mathbb{E}[V_{H-1}^{\pi}(b^{\prime}(b_{1},a_{1},liked))])
=β1π(b1)(1+𝔼[VH1π(b2)])\displaystyle=\beta^{\pi}_{1}(b_{1})(1+\mathbb{E}[V_{H-1}^{\pi}(b_{2})])
=β1π(b1)(1+i=2H1βiπ(b2))\displaystyle=\beta^{\pi}_{1}(b_{1})(1+\sum_{i=2}^{H-1}\beta^{\pi}_{i}(b_{2}))
=i=1Hβiπ(b1),\displaystyle=\sum_{i=1}^{H}\beta^{\pi}_{i}(b_{1}),

where the second-to-last equality is due to the induction hypothesis and the assumption that \pi is a deterministic stationary policy. The proof is completed by noting that \mathbb{E}[V^{\pi}(b)]=\lim_{H\to\infty}\mathbb{E}[V_{H}^{\pi}(b)]=\lim_{H\to\infty}\sum_{i=1}^{H}\beta^{\pi}_{i}(b)=\sum_{i=1}^{\infty}\beta^{\pi}_{i}(b), since the series has positive summands and a finite sum. ∎

E.2 Dominant Row (DR)

See 5.4

Proof.

We will show that for every iteration j, no matter what the previous topic recommendations were, selecting topic 1 rather than topic 2 can only increase the value.

Let \pi be a stationary policy that recommends topic 2 in iteration j+1, i.e., \pi(b_{j+1})=2. Changing it into a policy \pi^{j} that is equivalent to \pi in all iterations except iteration j+1, in which it recommends topic 1, can only improve the value.

Since 𝐏1,x,𝐏2,x,𝐏1,y,𝐏2,y0\mathbf{P}_{1,x},\mathbf{P}_{2,x},\mathbf{P}_{1,y},\mathbf{P}_{2,y}\geq 0, 𝐏1,x𝐏2,x0\mathbf{P}_{1,x}-\mathbf{P}_{2,x}\geq 0, b,1b0b,1-b\geq 0 and this structure satisfies 𝐏2,y𝐏1,y0\mathbf{P}_{2,y}-\mathbf{P}_{1,y}\leq 0, we get that for every m¯1,j,m¯2,j,n1,j,n2,j\bar{m}_{1,j},\bar{m}_{2,j},n_{1,j},n_{2,j}\in\mathbb{N} and for every bb,

b𝐏1,xm¯1,j+n1,j𝐏2,xm¯2,j+n2,j(𝐏1,x𝐏2,x)(1b)𝐏1,ym¯1,j+n1,j𝐏2,ym¯2,j+n2,j(𝐏2,y𝐏1,y);b\cdot\mathbf{P}_{1,x}^{\bar{m}_{1,j}+n_{1,j}}\cdot\mathbf{P}_{2,x}^{\bar{m}_{2,j}+n_{2,j}}(\mathbf{P}_{1,x}-\mathbf{P}_{2,x})\geq(1-b)\mathbf{P}_{1,y}^{\bar{m}_{1,j}+n_{1,j}}\cdot\mathbf{P}_{2,y}^{\bar{m}_{2,j}+n_{2,j}}(\mathbf{P}_{2,y}-\mathbf{P}_{1,y});

thus,

b\cdot\mathbf{P}_{1,x}^{\bar{m}_{1,j}+1+n_{1,j}}\cdot\mathbf{P}_{2,x}^{\bar{m}_{2,j}+n_{2,j}}+(1-b)\mathbf{P}_{1,y}^{\bar{m}_{1,j}+1+n_{1,j}}\cdot\mathbf{P}_{2,y}^{\bar{m}_{2,j}+n_{2,j}}\geq
b𝐏1,xm¯1,j+n1,j𝐏2,xm¯2,j+1+n2,j+(1b)𝐏1,ym¯1,j+n1,j𝐏2,ym¯2,j+1+n2,j.b\cdot\mathbf{P}_{1,x}^{\bar{m}_{1,j}+n_{1,j}}\cdot\mathbf{P}_{2,x}^{\bar{m}_{2,j}+1+n_{2,j}}+(1-b)\mathbf{P}_{1,y}^{\bar{m}_{1,j}+n_{1,j}}\cdot\mathbf{P}_{2,y}^{\bar{m}_{2,j}+1+n_{2,j}}.

Hence, for every time step j+1, choosing topic 1 instead of topic 2 increases each summand b\cdot\mathbf{P}_{1,x}^{m_{1,i}}\cdot\mathbf{P}_{2,x}^{m_{2,i}}+(1-b)\mathbf{P}_{1,y}^{m_{1,i}}\cdot\mathbf{P}_{2,y}^{m_{2,i}} with m_{1,i}=\bar{m}_{1,j}+n_{1,j}\geq\bar{m}_{1,j} and m_{2,i}=\bar{m}_{2,j}+n_{2,j}\geq\bar{m}_{2,j}. We deduce that

𝔼[Vπj(b)]𝔼[Vπ(b)].\mathbb{E}[V^{\pi^{j}}(b)]\geq\mathbb{E}[V^{\pi}(b)].

E.3 Dominant Column (DC)

Before proving the main theorem (Theorem 5.5), we prove an auxiliary lemma and derive a corollary from it.

Lemma E.1.

For \mathbf{P} with a DC structure, if a policy \pi is optimal, then it recommends topic 1 in every iteration j^{\prime}\geq j+1, for any j such that

i=j+1𝐏1,xm1,i𝐏2,xm2,i>i=j+11bb𝐏2,y𝐏1,y𝐏1,x𝐏2,x𝐏2,x𝐏2,y𝐏1,ym1,i𝐏2,ym2,i.\displaystyle\sum_{i=j+1}^{\infty}\;\mathbf{P}_{1,x}^{m_{1,i}}\mathbf{P}_{2,x}^{m_{2,i}}>\quad\sum_{i=j+1}^{\infty}\frac{1-b}{b}\cdot\frac{\mathbf{P}_{2,y}-\mathbf{P}_{1,y}}{\mathbf{P}_{1,x}-\mathbf{P}_{2,x}}\cdot\frac{\mathbf{P}_{2,x}}{\mathbf{P}_{2,y}}\mathbf{P}_{1,y}^{m_{1,i}}\mathbf{P}_{2,y}^{m_{2,i}}. (9)
Proof.

First, assume by contradiction that there exists an optimal policy π\pi that recommends topic 22 in iteration j+1j+1 such that (9) holds.

Let \pi^{j} be the policy that is equivalent to \pi but recommends topic 1 instead of topic 2 in iteration j+1. Since \pi and \pi^{j} recommend the same topics up to iteration j, the optimality of \pi implies

𝔼[Vπj(b)]𝔼[Vπ(b)]=𝔼[Vj+1πj(b)]𝔼[Vj+1π(b)]0.\mathbb{E}[V^{\pi^{j}}(b)]-\mathbb{E}[V^{\pi}(b)]=\mathbb{E}[V_{j+1}^{\pi^{j}}(b)]-\mathbb{E}[V_{j+1}^{\pi}(b)]\leq 0.

Expanding the above equation,

i=j+1b𝐏1,xm1,iπ+1𝐏2,xm2,iπ1+(1b)𝐏1,ym1,iπ+1𝐏2,ym2,iπ1(i=j+1b𝐏1,xm1,iπ𝐏2,xm2,iπ+(1b)𝐏1,ym1,iπ𝐏2,ym2,iπ)0\sum_{i=j+1}^{\infty}b\cdot\mathbf{P}_{1,x}^{m_{1,i}^{\pi}+1}\cdot\mathbf{P}_{2,x}^{m_{2,i}^{\pi}-1}+(1-b)\mathbf{P}_{1,y}^{m_{1,i}^{\pi}+1}\cdot\mathbf{P}_{2,y}^{m_{2,i}^{\pi}-1}-\big{(}\sum_{i=j+1}^{\infty}b\cdot\mathbf{P}_{1,x}^{m_{1,i}^{\pi}}\cdot\mathbf{P}_{2,x}^{m_{2,i}^{\pi}}+(1-b)\mathbf{P}_{1,y}^{m_{1,i}^{\pi}}\cdot\mathbf{P}_{2,y}^{m_{2,i}^{\pi}}\big{)}\leq 0
i=j+1b𝐏1,xm1,iπ𝐏2,xm2,iπ(𝐏1,x𝐏2,x1)i=j+1(1b)𝐏1,ym1,iπ𝐏2,ym2,iπ(1𝐏1,y𝐏2,y)\sum_{i=j+1}^{\infty}b\cdot\mathbf{P}_{1,x}^{m_{1,i}^{\pi}}\cdot\mathbf{P}_{2,x}^{m_{2,i}^{\pi}}(\frac{\mathbf{P}_{1,x}}{\mathbf{P}_{2,x}}-1)\leq\sum_{i=j+1}^{\infty}(1-b)\mathbf{P}_{1,y}^{m_{1,i}^{\pi}}\cdot\mathbf{P}_{2,y}^{m_{2,i}^{\pi}}(1-\frac{\mathbf{P}_{1,y}}{\mathbf{P}_{2,y}})
b(𝐏1,x𝐏2,x)𝐏2,xi=j+1𝐏1,xm1,iπ𝐏2,xm2,iπ(1b)(𝐏2,y𝐏1,y)𝐏2,yi=j+1𝐏1,ym1,iπ𝐏2,ym2,iπ\frac{b(\mathbf{P}_{1,x}-\mathbf{P}_{2,x})}{\mathbf{P}_{2,x}}\sum_{i=j+1}^{\infty}\cdot\mathbf{P}_{1,x}^{m_{1,i}^{\pi}}\cdot\mathbf{P}_{2,x}^{m_{2,i}^{\pi}}\leq\frac{(1-b)(\mathbf{P}_{2,y}-\mathbf{P}_{1,y})}{\mathbf{P}_{2,y}}\sum_{i=j+1}^{\infty}\mathbf{P}_{1,y}^{m_{1,i}^{\pi}}\cdot\mathbf{P}_{2,y}^{m_{2,i}^{\pi}}
i=j+1𝐏1,xm1,iπ𝐏2,xm2,iπ1bb𝐏2,x𝐏2,y𝐏2,y𝐏1,y𝐏1,x𝐏2,xi=j+1𝐏1,ym1,iπ𝐏2,ym2,iπ,\sum_{i=j+1}^{\infty}\mathbf{P}_{1,x}^{m_{1,i}^{\pi}}\cdot\mathbf{P}_{2,x}^{m_{2,i}^{\pi}}\leq\frac{1-b}{b}\cdot\frac{\mathbf{P}_{2,x}}{\mathbf{P}_{2,y}}\cdot\frac{\mathbf{P}_{2,y}-\mathbf{P}_{1,y}}{\mathbf{P}_{1,x}-\mathbf{P}_{2,x}}\sum_{i=j+1}^{\infty}\mathbf{P}_{1,y}^{m_{1,i}^{\pi}}\cdot\mathbf{P}_{2,y}^{m_{2,i}^{\pi}},

which is a contradiction to (9).

For the second part of the lemma, assume that condition (9) holds for some iteration j+1j+1\in\mathbb{N} and some optimal policy π\pi; hence, π(b,m1,jπ,m2,jπ)=1\pi(b,m^{\pi}_{1,j},m^{\pi}_{2,j})=1 and we have m1,j+1π=m1,jπ+1m^{\pi}_{1,j+1}=m^{\pi}_{1,j}+1 and m2,j+1π=m2,jπm^{\pi}_{2,j+1}=m^{\pi}_{2,j}. Exploiting this fact, we have that

i=j+2𝐏1,xm1,iπ𝐏2,xm2,iπ=i=j+1𝐏1,xm1,iπ+1𝐏2,xm2,iπ=𝐏1,xi=j+1𝐏1,xm1,iπ𝐏2,xm2,iπ>(9),\sum_{i=j+2}^{\infty}\;\mathbf{P}_{1,x}^{m^{\pi}_{1,i}}\mathbf{P}_{2,x}^{m^{\pi}_{2,i}}=\sum_{i=j+1}^{\infty}\;\mathbf{P}_{1,x}^{m^{\pi}_{1,i}+1}\mathbf{P}_{2,x}^{m^{\pi}_{2,i}}=\mathbf{P}_{1,x}\sum_{i=j+1}^{\infty}\;\mathbf{P}_{1,x}^{m^{\pi}_{1,i}}\mathbf{P}_{2,x}^{m^{\pi}_{2,i}}>(\ref{eq:suffCondForOptimal12}),

implying

𝐏1,xi=j+11bb𝐏2,y𝐏1,y𝐏1,x𝐏2,x𝐏2,x𝐏2,y𝐏1,ym1,iπ𝐏2,ym2,iπ\displaystyle\mathbf{P}_{1,x}\sum_{i=j+1}^{\infty}\frac{1-b}{b}\cdot\frac{\mathbf{P}_{2,y}-\mathbf{P}_{1,y}}{\mathbf{P}_{1,x}-\mathbf{P}_{2,x}}\cdot\frac{\mathbf{P}_{2,x}}{\mathbf{P}_{2,y}}\mathbf{P}_{1,y}^{m^{\pi}_{1,i}}\mathbf{P}_{2,y}^{m^{\pi}_{2,i}}
>\displaystyle> (𝐏1,x𝐏1,y)𝐏1,yi=j+11bb𝐏2,y𝐏1,y𝐏1,x𝐏2,x𝐏2,x𝐏2,y𝐏1,ym1,iπ𝐏2,ym2,iπ\displaystyle(\mathbf{P}_{1,x}\geq\mathbf{P}_{1,y})\mathbf{P}_{1,y}\sum_{i=j+1}^{\infty}\frac{1-b}{b}\cdot\frac{\mathbf{P}_{2,y}-\mathbf{P}_{1,y}}{\mathbf{P}_{1,x}-\mathbf{P}_{2,x}}\cdot\frac{\mathbf{P}_{2,x}}{\mathbf{P}_{2,y}}\mathbf{P}_{1,y}^{m^{\pi}_{1,i}}\mathbf{P}_{2,y}^{m^{\pi}_{2,i}}
=\displaystyle= i=j+11bb𝐏2,y𝐏1,y𝐏1,x𝐏2,x𝐏2,x𝐏2,y𝐏1,ym1,iπ+1𝐏2,ym2,iπ\displaystyle\sum_{i=j+1}^{\infty}\frac{1-b}{b}\cdot\frac{\mathbf{P}_{2,y}-\mathbf{P}_{1,y}}{\mathbf{P}_{1,x}-\mathbf{P}_{2,x}}\cdot\frac{\mathbf{P}_{2,x}}{\mathbf{P}_{2,y}}\mathbf{P}_{1,y}^{m^{\pi}_{1,i}+1}\mathbf{P}_{2,y}^{m^{\pi}_{2,i}}
=\displaystyle= i=j+21bb𝐏2,y𝐏1,y𝐏1,x𝐏2,x𝐏2,x𝐏2,y𝐏1,ym1,iπ𝐏2,ym2,iπ.\displaystyle\sum_{i=j+2}^{\infty}\frac{1-b}{b}\cdot\frac{\mathbf{P}_{2,y}-\mathbf{P}_{1,y}}{\mathbf{P}_{1,x}-\mathbf{P}_{2,x}}\cdot\frac{\mathbf{P}_{2,x}}{\mathbf{P}_{2,y}}\mathbf{P}_{1,y}^{m^{\pi}_{1,i}}\mathbf{P}_{2,y}^{m^{\pi}_{2,i}}.

An immediate consequence of Lemma E.1 is the following corollary.

Corollary E.2.

For any DC-structured \mathbf{P} and every belief b\in[0,1], the optimal policy \pi first recommends topic 2 for at most N times, where N is the minimal integer satisfying

\sum_{i=1}^{N}\;\mathbf{P}_{2,x}^{m^{\pi}_{2,i}}>\frac{1-b}{b}\cdot\frac{\mathbf{P}_{2,y}-\mathbf{P}_{1,y}}{\mathbf{P}_{1,x}-\mathbf{P}_{2,x}}\cdot\frac{\mathbf{P}_{2,x}}{\mathbf{P}_{2,y}}\sum_{i=1}^{N}\mathbf{P}_{2,y}^{m^{\pi}_{2,i}},

and then recommends topic 1 permanently. Such a finite N exists since \mathbf{P}_{2,x}>\mathbf{P}_{2,y}.

See 5.5

Proof.

Due to Theorem 5.3 and Corollary E.2, we can write the expected value of a policy as a function of  NN when 𝐏\mathbf{P} has a DC structure:

𝔼[VπN(b)]\displaystyle\mathbb{E}[V^{\pi_{N}}(b)] =i=1b𝐏1,xm1,i𝐏2,xm2,i+(1b)𝐏1,ym1,i𝐏2,ym2,i\displaystyle=\sum_{i=1}^{\infty}b\cdot\mathbf{P}_{1,x}^{m_{1,i}}\cdot\mathbf{P}_{2,x}^{m_{2,i}}+(1-b)\mathbf{P}_{1,y}^{m_{1,i}}\cdot\mathbf{P}_{2,y}^{m_{2,i}}
=i=1Nb𝐏2,xi+(1b)𝐏2,yi+i=N+1b𝐏2,xN𝐏1,xiN+(1b)𝐏2,yN𝐏1,yiN\displaystyle=\sum_{i=1}^{N}b\cdot\mathbf{P}_{2,x}^{i}+(1-b)\mathbf{P}_{2,y}^{i}+\sum_{i=N+1}^{\infty}b\cdot\mathbf{P}_{2,x}^{N}\cdot\mathbf{P}_{1,x}^{i-N}+(1-b)\mathbf{P}_{2,y}^{N}\cdot\mathbf{P}_{1,y}^{i-N}
=b𝐏2,x(𝐏2,xN1)𝐏2,x1+(1b)𝐏2,y(𝐏2,yN1)𝐏2,y1+b𝐏2,xNi=1𝐏1,xi+(1b)𝐏2,yNi=1𝐏1,yi\displaystyle=b\cdot\frac{\mathbf{P}_{2,x}(\mathbf{P}_{2,x}^{N}-1)}{\mathbf{P}_{2,x}-1}+(1-b)\cdot\frac{\mathbf{P}_{2,y}(\mathbf{P}_{2,y}^{N}-1)}{\mathbf{P}_{2,y}-1}+b\cdot\mathbf{P}_{2,x}^{N}\cdot\sum_{i=1}^{\infty}\mathbf{P}_{1,x}^{i}+(1-b)\mathbf{P}_{2,y}^{N}\cdot\sum_{i=1}^{\infty}\mathbf{P}_{1,y}^{i}
=b𝐏2,x(𝐏2,xN1)𝐏2,x1+(1b)𝐏2,y(𝐏2,yN1)𝐏2,y1+b𝐏2,xN𝐏1,x1𝐏1,x+(1b)𝐏2,yN𝐏1,y1𝐏1,y\displaystyle=b\cdot\frac{\mathbf{P}_{2,x}(\mathbf{P}_{2,x}^{N}-1)}{\mathbf{P}_{2,x}-1}+(1-b)\cdot\frac{\mathbf{P}_{2,y}(\mathbf{P}_{2,y}^{N}-1)}{\mathbf{P}_{2,y}-1}+b\cdot\mathbf{P}_{2,x}^{N}\cdot\frac{\mathbf{P}_{1,x}}{1-\mathbf{P}_{1,x}}+(1-b)\mathbf{P}_{2,y}^{N}\frac{\mathbf{P}_{1,y}}{1-\mathbf{P}_{1,y}}
=𝐏2,xNb(𝐏2,x𝐏2,x1+𝐏1,x1𝐏1,x)+𝐏2,yN(1b)(𝐏2,y𝐏2,y1+𝐏1,y1𝐏1,y)+b𝐏2,x1𝐏2,x+(1b)𝐏2,y1𝐏2,y.\displaystyle=\mathbf{P}_{2,x}^{N}\cdot b\big{(}\frac{\mathbf{P}_{2,x}}{\mathbf{P}_{2,x}-1}+\frac{\mathbf{P}_{1,x}}{1-\mathbf{P}_{1,x}}\big{)}+\mathbf{P}_{2,y}^{N}(1-b)\big{(}\frac{\mathbf{P}_{2,y}}{\mathbf{P}_{2,y}-1}+\frac{\mathbf{P}_{1,y}}{1-\mathbf{P}_{1,y}}\big{)}+\frac{b\mathbf{P}_{2,x}}{1-\mathbf{P}_{2,x}}+\frac{(1-b)\mathbf{P}_{2,y}}{1-\mathbf{P}_{2,y}}. (10)

Equation (10) can be cast as c_{1}\cdot\mathbf{P}_{2,x}^{N}+c_{2}\cdot\mathbf{P}_{2,y}^{N}+c_{3}(\mathbf{P}_{2,x},\mathbf{P}_{2,y}) for positive c_{1}, negative c_{2} and positive c_{3}. Let f:\mathbb{R}\to\mathbb{R} be the continuous function such that f(N)=c_{1}\cdot\mathbf{P}_{2,x}^{N}+c_{2}\cdot\mathbf{P}_{2,y}^{N}+c_{3}(\mathbf{P}_{2,x},\mathbf{P}_{2,y}).

We take the derivative w.r.t. N to find the critical point of f:

ddNf=c1ln𝐏2,x𝐏2,xN+c2ln𝐏2,y𝐏2,yN=0,\frac{d}{dN}f=c_{1}\cdot\ln\mathbf{P}_{2,x}\cdot\mathbf{P}_{2,x}^{N}+c_{2}\ln\mathbf{P}_{2,y}\cdot\mathbf{P}_{2,y}^{N}=0,

which implies that the critical point of f is

N~=ln(c2ln𝐏2,yc1ln𝐏2,x)ln(𝐏2,x𝐏2,y).\tilde{N}=\frac{\ln\big{(}-\frac{c_{2}\ln\mathbf{P}_{2,y}}{c_{1}\ln\mathbf{P}_{2,x}}\big{)}}{\ln\big{(}\frac{\mathbf{P}_{2,x}}{\mathbf{P}_{2,y}}\big{)}}.

Next, set N^{*}\overset{\text{def}}{=}\max\{0,\tilde{N}\}. Since f has a single critical point and f(N)=\mathbb{E}[V^{\pi_{N}}(b)] for every N\in\mathbb{N}, to determine the optimal policy one only needs to compare the value \mathbb{E}[V^{\pi_{N}}(b)] at the boundary points (N=0, N=\infty) and at the closest integers to the critical point (N=\lfloor N^{*}\rfloor, N=\lceil N^{*}\rceil). ∎
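For illustration, the following sketch computes the candidate switching points in the DC case. It assumes an instance with the sign pattern claimed above (c_{1}>0, c_{2}<0); the function name and its inputs are placeholders of ours.

```python
import math

def dc_optimal_switch(P1x, P2x, P1y, P2y, b):
    """Optimal N for pi_N under a DC structure, via the critical point of f."""
    # Coefficients of f(N) = c1 * P2x**N + c2 * P2y**N + c3 (Equation (10));
    # the DC case analyzed above has c1 > 0 and c2 < 0.
    c1 = b * (P2x / (P2x - 1) + P1x / (1 - P1x))
    c2 = (1 - b) * (P2y / (P2y - 1) + P1y / (1 - P1y))
    c3 = b * P2x / (1 - P2x) + (1 - b) * P2y / (1 - P2y)

    def f(N):
        return c1 * P2x ** N + c2 * P2y ** N + c3

    N_tilde = math.log(-(c2 * math.log(P2y)) / (c1 * math.log(P2x))) / math.log(P2x / P2y)
    N_star = max(0.0, N_tilde)

    # Compare the integers closest to the critical point with the boundary points.
    candidates = {0, math.floor(N_star), math.ceil(N_star)}
    best_N = max(candidates, key=f)
    if c3 > f(best_N):          # value of "N = infinity": never switch to topic 1
        return math.inf, c3
    return best_N, f(best_N)

# e.g., dc_optimal_switch(P1x=0.9, P2x=0.6, P1y=0.2, P2y=0.5, b=0.5)
```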

E.4 Dominant Diagonal (SD)

See 5.6

Proof.

We prove the following claim by a backward induction over the number of iterations remaining: for every k=H-1,\dots,1, it holds that for every policy \pi and belief b,

𝔼[Vkπ(b)]max{𝔼[Vkπ1(b)],𝔼[Vkπ2(b)]}.\mathbb{E}[V_{k}^{\pi}(b)]\leq\max\{\mathbb{E}[V_{k}^{\pi^{1}}(b)],\mathbb{E}[V_{k}^{\pi^{2}}(b)]\}.

First, we notice that when k=H-1, the only possible policies are \pi^{1} and \pi^{2}. For k=H-2, we prove the statement by contradiction. Other than following \pi^{1} or \pi^{2}, there are only two ways to select topics when k=H-2:

π1:H=(π1:H2,1H1,2H)andπ1:H′′=(π1:H2,2H1,1H).\displaystyle\pi^{\prime}_{1\mathrel{\mathop{\mathchar 58\relax}}H}=(\pi_{1\mathrel{\mathop{\mathchar 58\relax}}H-2},\underbrace{1}_{{H-1}},\underbrace{2}_{{H}})\quad\text{and}\quad\pi_{1\mathrel{\mathop{\mathchar 58\relax}}H}^{\prime\prime}=(\pi_{1\mathrel{\mathop{\mathchar 58\relax}}H-2},\underbrace{2}_{H-1},\underbrace{1}_{{H}}).

Let m_{1} and m_{2} denote the number of times \pi has recommended topics 1 and 2, respectively, up to time H-2, inclusive. Assume that the policy \pi^{\prime} is optimal. In particular, it holds that \mathbb{E}[V^{\pi^{1}}_{k}]\leq\mathbb{E}[V^{\pi^{\prime}}_{k}] and \mathbb{E}[V^{\pi^{2}}_{k}]\leq\mathbb{E}[V^{\pi^{\prime}}_{k}]. We get

b𝐏1,xm1𝐏2,xm2𝐏1,x(𝐏1,x𝐏2,x)𝐏1,ym1𝐏2,ym2(1b)𝐏1,y(𝐏2,y𝐏1,y),\displaystyle b\mathbf{P}_{1,x}^{m_{1}}\mathbf{P}_{2,x}^{m_{2}}\mathbf{P}_{1,x}(\mathbf{P}_{1,x}-\mathbf{P}_{2,x})\leq\mathbf{P}_{1,y}^{m_{1}}\mathbf{P}_{2,y}^{m_{2}}(1-b)\mathbf{P}_{1,y}(\mathbf{P}_{2,y}-\mathbf{P}_{1,y}), (11)

and

𝐏1,ym1𝐏2,ym2(1b)(𝐏2,y𝐏1,y)(1+𝐏2,y)b𝐏1,xm1𝐏2,xm2(𝐏1,x𝐏2,x)(1+𝐏2,x).\displaystyle\mathbf{P}_{1,y}^{m_{1}}\mathbf{P}_{2,y}^{m_{2}}(1-b)(\mathbf{P}_{2,y}-\mathbf{P}_{1,y})(1+\mathbf{P}_{2,y})\leq b\mathbf{P}_{1,x}^{m_{1}}\mathbf{P}_{2,x}^{m_{2}}(\mathbf{P}_{1,x}-\mathbf{P}_{2,x})(1+\mathbf{P}_{2,x}). (12)

As both sides of (11) and (12) are positive, the product of their left-hand sides is at most the product of their right-hand sides, i.e.,

b𝐏1,xm1𝐏2,xm2𝐏1,x(𝐏1,x𝐏2,x)𝐏1,ym1𝐏2,ym2(1b)(𝐏2,y𝐏1,y)(1+𝐏2,y)\displaystyle\;b\mathbf{P}_{1,x}^{m_{1}}\mathbf{P}_{2,x}^{m_{2}}\mathbf{P}_{1,x}(\mathbf{P}_{1,x}-\mathbf{P}_{2,x})\mathbf{P}_{1,y}^{m_{1}}\mathbf{P}_{2,y}^{m_{2}}(1-b)(\mathbf{P}_{2,y}-\mathbf{P}_{1,y})(1+\mathbf{P}_{2,y})
\displaystyle\leq 𝐏1,ym1𝐏2,ym2(1b)𝐏1,y(𝐏2,y𝐏1,y)b𝐏1,xm1𝐏2,xm2(𝐏1,x𝐏2,x)(1+𝐏2,x)\displaystyle\;\mathbf{P}_{1,y}^{m_{1}}\mathbf{P}_{2,y}^{m_{2}}(1-b)\mathbf{P}_{1,y}(\mathbf{P}_{2,y}-\mathbf{P}_{1,y})b\mathbf{P}_{1,x}^{m_{1}}\mathbf{P}_{2,x}^{m_{2}}(\mathbf{P}_{1,x}-\mathbf{P}_{2,x})(1+\mathbf{P}_{2,x})

Dividing both sides by b𝐏1,xm1𝐏2,xm2(𝐏1,x𝐏2,x)𝐏1,ym1𝐏2,ym2(1b)(𝐏2,y𝐏1,y)>0b\mathbf{P}_{1,x}^{m_{1}}\mathbf{P}_{2,x}^{m_{2}}(\mathbf{P}_{1,x}-\mathbf{P}_{2,x})\mathbf{P}_{1,y}^{m_{1}}\mathbf{P}_{2,y}^{m_{2}}(1-b)(\mathbf{P}_{2,y}-\mathbf{P}_{1,y})>0, we obtain

𝐏1,x(1+𝐏2,y)𝐏1,y(1+𝐏2,x),\mathbf{P}_{1,x}(1+\mathbf{P}_{2,y})\leq\mathbf{P}_{1,y}(1+\mathbf{P}_{2,x}),

which is a contradiction as 𝐏1,x>𝐏1,y\mathbf{P}_{1,x}>\mathbf{P}_{1,y} and 1+𝐏2,y>1+𝐏2,x1+\mathbf{P}_{2,y}>1+\mathbf{P}_{2,x}.

Now, assume that the policy π′′\pi^{\prime\prime} is optimal. In particular, it holds that 𝔼[Vkπ1]𝔼[Vkπ′′]\mathbb{E}[V^{\pi^{1}}_{k}]\leq\mathbb{E}[V^{\pi^{\prime\prime}}_{k}] and 𝔼[Vkπ2]𝔼[Vkπ′′]\mathbb{E}[V^{\pi^{2}}_{k}]\leq\mathbb{E}[V^{\pi^{\prime\prime}}_{k}]. We get

𝐏1,xm1𝐏2,xm2b(𝐏1,x𝐏2,x)(1+𝐏1,x)𝐏1,ym1𝐏2,ym2(1b)(1+𝐏1,y)(𝐏2,y𝐏1,y),\displaystyle\mathbf{P}_{1,x}^{m_{1}}\mathbf{P}_{2,x}^{m_{2}}b(\mathbf{P}_{1,x}-\mathbf{P}_{2,x})(1+\mathbf{P}_{1,x})\leq\mathbf{P}_{1,y}^{m_{1}}\mathbf{P}_{2,y}^{m_{2}}(1-b)(1+\mathbf{P}_{1,y})(\mathbf{P}_{2,y}-\mathbf{P}_{1,y}), (13)

and

𝐏1,ym1𝐏2,ym2(1b)𝐏2,y(𝐏2,y𝐏1,y)𝐏1,xm1𝐏2,xm2b𝐏2,x(𝐏1,x𝐏2,x).\displaystyle\mathbf{P}_{1,y}^{m_{1}}\mathbf{P}_{2,y}^{m_{2}}(1-b)\mathbf{P}_{2,y}(\mathbf{P}_{2,y}-\mathbf{P}_{1,y})\leq\mathbf{P}_{1,x}^{m_{1}}\mathbf{P}_{2,x}^{m_{2}}b\mathbf{P}_{2,x}(\mathbf{P}_{1,x}-\mathbf{P}_{2,x}). (14)

As both sides of (13) and (14) are positive, the product of their left-hand sides is at most the product of their right-hand sides,

𝐏1,xm1𝐏2,xm2b(𝐏1,x𝐏2,x)(1+𝐏1,x)𝐏1,ym1𝐏2,ym2(1b)𝐏2,y(𝐏2,y𝐏1,y)\displaystyle\;\mathbf{P}_{1,x}^{m_{1}}\mathbf{P}_{2,x}^{m_{2}}b(\mathbf{P}_{1,x}-\mathbf{P}_{2,x})(1+\mathbf{P}_{1,x})\mathbf{P}_{1,y}^{m_{1}}\mathbf{P}_{2,y}^{m_{2}}(1-b)\mathbf{P}_{2,y}(\mathbf{P}_{2,y}-\mathbf{P}_{1,y})
\displaystyle\leq 𝐏1,ym1𝐏2,ym2(1b)(1+𝐏1,y)(𝐏2,y𝐏1,y)𝐏1,xm1𝐏2,xm2b𝐏2,x(𝐏1,x𝐏2,x).\displaystyle\;\mathbf{P}_{1,y}^{m_{1}}\mathbf{P}_{2,y}^{m_{2}}(1-b)(1+\mathbf{P}_{1,y})(\mathbf{P}_{2,y}-\mathbf{P}_{1,y})\mathbf{P}_{1,x}^{m_{1}}\mathbf{P}_{2,x}^{m_{2}}b\mathbf{P}_{2,x}(\mathbf{P}_{1,x}-\mathbf{P}_{2,x}).

Dividing both sides by 𝐏1,xm1𝐏2,xm2b(𝐏1,x𝐏2,x)𝐏1,ym1𝐏2,ym2(1b)(𝐏2,y𝐏1,y)>0\mathbf{P}_{1,x}^{m_{1}}\mathbf{P}_{2,x}^{m_{2}}b(\mathbf{P}_{1,x}-\mathbf{P}_{2,x})\mathbf{P}_{1,y}^{m_{1}}\mathbf{P}_{2,y}^{m_{2}}(1-b)(\mathbf{P}_{2,y}-\mathbf{P}_{1,y})>0, we obtain

𝐏2,y(1+𝐏1,x)𝐏2,x(1+𝐏1,y),\mathbf{P}_{2,y}(1+\mathbf{P}_{1,x})\leq\mathbf{P}_{2,x}(1+\mathbf{P}_{1,y}),

which is again a contradiction, as \mathbf{P}_{2,x}<\mathbf{P}_{2,y} and 1+\mathbf{P}_{1,y}<1+\mathbf{P}_{1,x}.

For H\geq 3, we prove the statement by contradiction. Suppose not, i.e., the optimal policy \pi switches the recommended topic at least once. Let t denote the time step at which \pi switches for the last time. We first consider the case where \pi has switched from topic 2 to topic 1 at time t. More specifically, we have

π1:H=(π1:t2,2πt1,1πt,1,,1πt+1:H1,1πH).\displaystyle\pi_{1\mathrel{\mathop{\mathchar 58\relax}}H}=(\pi_{1\mathrel{\mathop{\mathchar 58\relax}}t-2},\underbrace{2}_{\pi_{t-1}},\underbrace{1}_{\pi_{t}},\underbrace{1,\ldots,1}_{\pi_{t+1\mathrel{\mathop{\mathchar 58\relax}}H-1}},\underbrace{1}_{\pi_{H}}).

Consider another policy \tilde{\pi} (that behaves the same as \pi except at time step t) defined as

π~1:H=(π1:t2,2πt1,2πt,1,,1πt+1:H1,1πH).\displaystyle\tilde{\pi}_{1\mathrel{\mathop{\mathchar 58\relax}}H}=(\pi_{1\mathrel{\mathop{\mathchar 58\relax}}t-2},\underbrace{2}_{\pi_{t-1}},\underbrace{2}_{\pi_{t}},\underbrace{1,\ldots,1}_{\pi_{t+1\mathrel{\mathop{\mathchar 58\relax}}H-1}},\underbrace{1}_{\pi_{H}}).

Let m_{1} and m_{2} denote the number of times \pi has recommended topics 1 and 2, respectively, up to (and including) time t-1. Since \pi is optimal, the difference between the value of \pi and that of \tilde{\pi} is non-negative, i.e.,

𝔼[VHπ]𝔼[VHπ~]=i=1Ht+1b𝐏1,xm1+i1𝐏2,xm2+1(𝐏1,x𝐏2,x)+(1b)𝐏1,ym1+i1𝐏2,ym2+1(𝐏1,y𝐏2,y)0,\displaystyle\mathbb{E}[V^{{\pi}}_{H}]-\mathbb{E}[V^{\tilde{\pi}}_{H}]=\sum_{i=1}^{H-t+1}b\mathbf{P}_{1,x}^{m_{1}+i-1}\mathbf{P}_{2,x}^{m_{2}+1}(\mathbf{P}_{1,x}-\mathbf{P}_{2,x})+(1-b)\mathbf{P}_{1,y}^{m_{1}+i-1}\mathbf{P}_{2,y}^{m_{2}+1}(\mathbf{P}_{1,y}-\mathbf{P}_{2,y})\geq 0, (15)

where the difference is induced by the discrepancy of the two policies from time step tt to HH. Consider another policy π\pi^{\prime} (that behaves the same as π\pi except at time step HH) defined as

π1:H=(π1:t2,2πt1,1πt,1,,1πt+1:H1,2πH).\displaystyle\pi^{\prime}_{1\mathrel{\mathop{\mathchar 58\relax}}H}=(\pi_{1\mathrel{\mathop{\mathchar 58\relax}}t-2},\underbrace{2}_{\pi_{t-1}},\underbrace{1}_{\pi_{t}},\underbrace{1,\ldots,1}_{\pi_{t+1\mathrel{\mathop{\mathchar 58\relax}}H-1}},\underbrace{2}_{\pi_{H}}).

Since \pi is optimal, the difference between the value of \pi and that of \pi^{\prime} is non-negative, i.e.,

𝔼[VHπ]>𝔼[VHπ]b𝐏1,xm1+Ht𝐏2,xm2(𝐏1,x𝐏2,x)>(1b)𝐏1,ym1+Ht𝐏2,ym2(𝐏2,y𝐏1,y),\mathbb{E}[V^{{\pi}}_{H}]>\mathbb{E}[V^{\pi^{\prime}}_{H}]\Rightarrow b\mathbf{P}_{1,x}^{m_{1}+H-t}\mathbf{P}_{2,x}^{m_{2}}(\mathbf{P}_{1,x}-\mathbf{P}_{2,x})>(1-b)\mathbf{P}_{1,y}^{m_{1}+H-t}\mathbf{P}_{2,y}^{m_{2}}(\mathbf{P}_{2,y}-\mathbf{P}_{1,y}),

where the difference is induced by the discrepancy of the two policies from time step HH. Multiplying both sides by 𝐏1,y>0\mathbf{P}_{1,y}>0, we get

𝐏1,yb𝐏1,xm1+Ht𝐏2,xm2(𝐏1,x𝐏2,x)>(1b)𝐏1,ym1+Ht+1𝐏2,ym2(𝐏2,y𝐏1,y).\mathbf{P}_{1,y}b\mathbf{P}_{1,x}^{m_{1}+H-t}\mathbf{P}_{2,x}^{m_{2}}(\mathbf{P}_{1,x}-\mathbf{P}_{2,x})>(1-b)\mathbf{P}_{1,y}^{m_{1}+H-t+1}\mathbf{P}_{2,y}^{m_{2}}(\mathbf{P}_{2,y}-\mathbf{P}_{1,y}).

Using 𝐏1,x𝐏1,y>1\frac{\mathbf{P}_{1,x}}{\mathbf{P}_{1,y}}>1, and 𝐏1,yb𝐏1,xm1+Ht𝐏2,xm2(𝐏1,x𝐏2,x)>0\mathbf{P}_{1,y}b\mathbf{P}_{1,x}^{m_{1}+H-t}\mathbf{P}_{2,x}^{m_{2}}(\mathbf{P}_{1,x}-\mathbf{P}_{2,x})>0,

b𝐏1,xm1+Ht+1𝐏2,xm2(𝐏1,x𝐏2,x)>(1b)𝐏1,ym1+Ht+1𝐏2,ym2(𝐏2,y𝐏1,y);b\mathbf{P}_{1,x}^{m_{1}+H-t+1}\mathbf{P}_{2,x}^{m_{2}}(\mathbf{P}_{1,x}-\mathbf{P}_{2,x})>(1-b)\mathbf{P}_{1,y}^{m_{1}+H-t+1}\mathbf{P}_{2,y}^{m_{2}}(\mathbf{P}_{2,y}-\mathbf{P}_{1,y});

hence,

b𝐏1,xm1+Ht+1𝐏2,xm2(𝐏1,x𝐏2,x)+(1b)𝐏1,ym1+Ht+1𝐏2,ym2(𝐏1,y𝐏2,y)0.\displaystyle b\mathbf{P}_{1,x}^{m_{1}+H-t+1}\mathbf{P}_{2,x}^{m_{2}}(\mathbf{P}_{1,x}-\mathbf{P}_{2,x})+(1-b)\mathbf{P}_{1,y}^{m_{1}+H-t+1}\mathbf{P}_{2,y}^{m_{2}}(\mathbf{P}_{1,y}-\mathbf{P}_{2,y})\geq 0. (16)

Next, we construct a new policy \pi^{\text{new}} that outperforms \pi. We let \pi^{\text{new}} be the policy defined as follows:

π1:Hnew=(π1:t2,1πt1,1πt,1,,1πt+1:H1,1πH).\displaystyle{\pi}^{\text{new}}_{1\mathrel{\mathop{\mathchar 58\relax}}H}=(\pi_{1\mathrel{\mathop{\mathchar 58\relax}}t-2},\underbrace{1}_{\pi_{t-1}},\underbrace{1}_{\pi_{t}},\underbrace{1,\ldots,1}_{\pi_{t+1\mathrel{\mathop{\mathchar 58\relax}}H-1}},\underbrace{1}_{\pi_{H}}).

The value difference between πnew\pi^{\text{new}} and π\pi (caused by the discrepancy of the two policies from time t1t-1 to HH) is

𝔼[VHπnew]𝔼[VHπ]\displaystyle\mathbb{E}[V^{{\pi}^{\text{new}}}_{H}]-\mathbb{E}[V^{\pi}_{H}] =i=1Ht+1b𝐏1,xm1+i1𝐏2,xm2(𝐏1,x𝐏2,x)+(1b)𝐏1,ym1+i1𝐏2,ym2(𝐏1,y𝐏2,y)\displaystyle=\sum_{i=1}^{H-t+1}b\mathbf{P}_{1,x}^{m_{1}+i-1}\mathbf{P}_{2,x}^{m_{2}}(\mathbf{P}_{1,x}-\mathbf{P}_{2,x})+(1-b)\mathbf{P}_{1,y}^{m_{1}+i-1}\mathbf{P}_{2,y}^{m_{2}}(\mathbf{P}_{1,y}-\mathbf{P}_{2,y})
+b𝐏1,xm1+Ht+1𝐏2,xm2(𝐏1,x𝐏2,x)+(1b)𝐏1,ym1+Ht+1𝐏2,ym2(𝐏1,y𝐏2,y)\displaystyle+b\mathbf{P}_{1,x}^{m_{1}+H-t+1}\mathbf{P}_{2,x}^{m_{2}}(\mathbf{P}_{1,x}-\mathbf{P}_{2,x})+(1-b)\mathbf{P}_{1,y}^{m_{1}+H-t+1}\mathbf{P}_{2,y}^{m_{2}}(\mathbf{P}_{1,y}-\mathbf{P}_{2,y})
>i=1Ht+1b𝐏1,xm1+i1𝐏2,xm2+1(𝐏1,x𝐏2,x)+(1b)𝐏1,ym1+i1𝐏2,ym2+1(𝐏1,y𝐏2,y)\displaystyle>\sum_{i=1}^{H-t+1}b\mathbf{P}_{1,x}^{m_{1}+i-1}\mathbf{P}_{2,x}^{m_{2}+1}(\mathbf{P}_{1,x}-\mathbf{P}_{2,x})+(1-b)\mathbf{P}_{1,y}^{m_{1}+i-1}\mathbf{P}_{2,y}^{m_{2}+1}(\mathbf{P}_{1,y}-\mathbf{P}_{2,y})
0,\displaystyle\geq 0,

where the first inequality is true because \mathbf{P}_{2,x}<\mathbf{P}_{2,y}, \mathbf{P}_{1,x}-\mathbf{P}_{2,x}>0 and \mathbf{P}_{1,y}-\mathbf{P}_{2,y}<0; therefore, for every 1\leq i\leq H-t+1,

b𝐏1,xm1+i1𝐏2,xm2(𝐏1,x𝐏2,x)(1𝐏2,x)>0>(1b)𝐏1,ym1+i1𝐏2,ym2(𝐏2,y𝐏1,y)(𝐏2,y1)b\mathbf{P}_{1,x}^{m_{1}+i-1}\mathbf{P}_{2,x}^{m_{2}}(\mathbf{P}_{1,x}-\mathbf{P}_{2,x})(1-\mathbf{P}_{2,x})>0>(1-b)\mathbf{P}_{1,y}^{m_{1}+i-1}\mathbf{P}_{2,y}^{m_{2}}(\mathbf{P}_{2,y}-\mathbf{P}_{1,y})(\mathbf{P}_{2,y}-1)

along with (16). The second inequality follows from (15). Thus, we have found another policy \pi^{\text{new}}_{1:H} that differs from \pi and achieves a higher value, which is a contradiction.

Next, we consider the case where \pi has switched from topic 1 to topic 2 at time t, i.e.,

π1:H=(π1:t2,1πt1,2πt,2,,2πt+1:H1,2πH).\displaystyle\pi_{1\mathrel{\mathop{\mathchar 58\relax}}H}=(\pi_{1\mathrel{\mathop{\mathchar 58\relax}}t-2},\underbrace{1}_{\pi_{t-1}},\underbrace{2}_{\pi_{t}},\underbrace{2,\ldots,2}_{\pi_{t+1\mathrel{\mathop{\mathchar 58\relax}}H-1}},\underbrace{2}_{\pi_{H}}).

Consider another policy π~\tilde{\pi} (that behaves the same as π\pi except at time step tt) defined as

π~1:H=(π1:t2,1πt1,1πt,2,,2πt+1:H1,2πH).\displaystyle\tilde{\pi}_{1\mathrel{\mathop{\mathchar 58\relax}}H}=(\pi_{1\mathrel{\mathop{\mathchar 58\relax}}t-2},\underbrace{1}_{\pi_{t-1}},\underbrace{1}_{\pi_{t}},\underbrace{2,\ldots,2}_{\pi_{t+1\mathrel{\mathop{\mathchar 58\relax}}H-1}},\underbrace{2}_{\pi_{H}}).

Since \pi is optimal, the difference between the value of \pi and that of \tilde{\pi} is non-negative, i.e.,

𝔼[VHπ]𝔼[VHπ~]=i=1Ht+1b𝐏1,xm1+1𝐏2,xm2+i1(𝐏2,x𝐏1,x)+(1b)𝐏1,ym1+1𝐏2,ym2+i1(𝐏2,y𝐏1,y)0,\displaystyle\mathbb{E}[V^{{\pi}}_{H}]-\mathbb{E}[V^{\tilde{\pi}}_{H}]=\sum_{i=1}^{H-t+1}b\mathbf{P}_{1,x}^{m_{1}+1}\mathbf{P}_{2,x}^{m_{2}+i-1}(\mathbf{P}_{2,x}-\mathbf{P}_{1,x})+(1-b)\mathbf{P}_{1,y}^{m_{1}+1}\mathbf{P}_{2,y}^{m_{2}+i-1}(\mathbf{P}_{2,y}-\mathbf{P}_{1,y})\geq 0, (17)

where the difference follows from the discrepancy between the two policies from time step tt to HH.

Consider another policy π\pi^{\prime} (that behaves the same as π\pi except at time step HH) defined as

π1:H=(π1:t2,1πt1,2πt,2,,2πt+1:H1,1πH).\displaystyle\pi^{\prime}_{1\mathrel{\mathop{\mathchar 58\relax}}H}=(\pi_{1\mathrel{\mathop{\mathchar 58\relax}}t-2},\underbrace{1}_{\pi_{t-1}},\underbrace{2}_{\pi_{t}},\underbrace{2,\ldots,2}_{\pi_{t+1\mathrel{\mathop{\mathchar 58\relax}}H-1}},\underbrace{1}_{\pi_{H}}).

Since \pi is optimal, the difference between the value of \pi and that of \pi^{\prime} is non-negative, i.e.,

𝔼[VHπ]>𝔼[VHπ](1b)𝐏1,ym1𝐏2,ym2+Ht(𝐏2,y𝐏1,y)b𝐏1,xm1𝐏2,xm2+Ht(𝐏1,x𝐏2,x),\mathbb{E}[V^{{\pi}}_{H}]>\mathbb{E}[V^{\pi^{\prime}}_{H}]\Rightarrow(1-b)\mathbf{P}_{1,y}^{m_{1}}\mathbf{P}_{2,y}^{m_{2}+H-t}(\mathbf{P}_{2,y}-\mathbf{P}_{1,y})\geq b\mathbf{P}_{1,x}^{m_{1}}\mathbf{P}_{2,x}^{m_{2}+H-t}(\mathbf{P}_{1,x}-\mathbf{P}_{2,x}),

where the difference is induced by the discrepancy of the two policies from time step HH. Multiplying both sides by 𝐏2,x>0\mathbf{P}_{2,x}>0,

𝐏2,x(1b)𝐏1,ym1𝐏2,ym2+Ht(𝐏2,y𝐏1,y)b𝐏1,xm1𝐏2,xm2+Ht+1(𝐏1,x𝐏2,x).\mathbf{P}_{2,x}(1-b)\mathbf{P}_{1,y}^{m_{1}}\mathbf{P}_{2,y}^{m_{2}+H-t}(\mathbf{P}_{2,y}-\mathbf{P}_{1,y})\geq b\mathbf{P}_{1,x}^{m_{1}}\mathbf{P}_{2,x}^{m_{2}+H-t+1}(\mathbf{P}_{1,x}-\mathbf{P}_{2,x}).

Using 𝐏2,x(1b)𝐏1,ym1𝐏2,ym2+Ht(𝐏2,y𝐏1,y)>0\mathbf{P}_{2,x}(1-b)\mathbf{P}_{1,y}^{m_{1}}\mathbf{P}_{2,y}^{m_{2}+H-t}(\mathbf{P}_{2,y}-\mathbf{P}_{1,y})>0 and 𝐏2,y𝐏2,x1\frac{\mathbf{P}_{2,y}}{\mathbf{P}_{2,x}}\geq 1, we get

(1b)𝐏1,ym1𝐏2,ym2+Ht+1(𝐏2,y𝐏1,y)b𝐏1,xm1𝐏2,xm2+Ht+1(𝐏1,x𝐏2,x);(1-b)\mathbf{P}_{1,y}^{m_{1}}\mathbf{P}_{2,y}^{m_{2}+H-t+1}(\mathbf{P}_{2,y}-\mathbf{P}_{1,y})\geq b\mathbf{P}_{1,x}^{m_{1}}\mathbf{P}_{2,x}^{m_{2}+H-t+1}(\mathbf{P}_{1,x}-\mathbf{P}_{2,x});

hence,

b𝐏1,xm1𝐏2,xm2+Ht+1(𝐏2,x𝐏1,x)+(1b)𝐏1,ym1𝐏2,ym2+Ht+1(𝐏2,y𝐏1,y)0.\displaystyle b\mathbf{P}_{1,x}^{m_{1}}\mathbf{P}_{2,x}^{m_{2}+H-t+1}(\mathbf{P}_{2,x}-\mathbf{P}_{1,x})+(1-b)\mathbf{P}_{1,y}^{m_{1}}\mathbf{P}_{2,y}^{m_{2}+H-t+1}(\mathbf{P}_{2,y}-\mathbf{P}_{1,y})\geq 0. (18)

Again, we construct a new policy \pi^{\text{new}} that outperforms \pi. We let \pi^{\text{new}} be the policy defined as follows:

π1:Hnew=(π1:t2,2πt1,2πt,2,,2πt+1:H1,1πH).\displaystyle{\pi}^{\text{new}}_{1\mathrel{\mathop{\mathchar 58\relax}}H}=(\pi_{1\mathrel{\mathop{\mathchar 58\relax}}t-2},\underbrace{2}_{\pi_{t-1}},\underbrace{2}_{\pi_{t}},\underbrace{2,\ldots,2}_{\pi_{t+1\mathrel{\mathop{\mathchar 58\relax}}H-1}},\underbrace{1}_{\pi_{H}}).

Now, the value difference between πnew\pi^{\text{new}} and π\pi (caused by the discrepancy of the two policies from time t1t-1 to HH) is

𝔼[VHπnew]𝔼[VHπ]\displaystyle\mathbb{E}[V^{{\pi}^{\text{new}}}_{H}]-\mathbb{E}[V^{\pi}_{H}] =i=1Ht+1(b𝐏1,xm1𝐏2,xm2+i1(𝐏2,x𝐏1,x)+(1b)𝐏1,ym1𝐏2,ym2+i1(𝐏2,y𝐏1,y))\displaystyle=\sum_{i=1}^{H-t+1}\big{(}b\mathbf{P}_{1,x}^{m_{1}}\mathbf{P}_{2,x}^{m_{2}+i-1}(\mathbf{P}_{2,x}-\mathbf{P}_{1,x})+(1-b)\mathbf{P}_{1,y}^{m_{1}}\mathbf{P}_{2,y}^{m_{2}+i-1}(\mathbf{P}_{2,y}-\mathbf{P}_{1,y})\big{)}
\displaystyle+b\mathbf{P}_{1,x}^{m_{1}}\mathbf{P}_{2,x}^{m_{2}+H-t+1}(\mathbf{P}_{2,x}-\mathbf{P}_{1,x})+(1-b)\mathbf{P}_{1,y}^{m_{1}}\mathbf{P}_{2,y}^{m_{2}+H-t+1}(\mathbf{P}_{2,y}-\mathbf{P}_{1,y})
>i=1Ht+1b𝐏1,xm1+1𝐏2,xm2+i1(𝐏2,x𝐏1,x)+(1b)𝐏1,ym1+1𝐏2,ym2+i1(𝐏2,y𝐏1,y)\displaystyle>\sum_{i=1}^{H-t+1}b\mathbf{P}_{1,x}^{m_{1}+1}\mathbf{P}_{2,x}^{m_{2}+i-1}(\mathbf{P}_{2,x}-\mathbf{P}_{1,x})+(1-b)\mathbf{P}_{1,y}^{m_{1}+1}\mathbf{P}_{2,y}^{m_{2}+i-1}(\mathbf{P}_{2,y}-\mathbf{P}_{1,y})
0,\displaystyle\geq 0,

where the first inequality is true because \mathbf{P}_{1,y}<\mathbf{P}_{1,x}, \mathbf{P}_{2,x}-\mathbf{P}_{1,x}<0 and \mathbf{P}_{2,y}-\mathbf{P}_{1,y}>0, together with (18), and the second inequality follows from (17). Similarly, we have found another policy \pi^{\text{new}}_{1:H} that differs from \pi and achieves a higher value, which is a contradiction.

We have covered all cases, so the inductive argument holds. This concludes the proof of the theorem. ∎

E.5 UCB-based regret bound

See 5.9

Proof.

Let γ\gamma be such that

|γ|<ln(1ϵ)8emina{1,2},i{x,y}{ln(1𝚲a,i(1𝐏a,i))8e}=mina{1,2},i{x,y}{ln(𝐏a,i)8e}.|\gamma|<-\frac{\ln(1-\epsilon)}{8e}\leq\min_{a\in\{1,2\},i\in\{x,y\}}\{-\frac{\ln(1-\mathbf{\Lambda}_{a,i}(1-\mathbf{P}_{a,i}))}{8e}\}=\min_{a\in\{1,2\},i\in\{x,y\}}\{-\frac{\ln(\mathbf{P}_{a,i})}{8e}\}.

Next, we have that

\mathbb{E}[\exp(\gamma(V^{\pi}-\mathbb{E}[V^{\pi}]))]\leq\sum_{a\in\{1,2\}}\mathbb{E}\big{[}\exp(\gamma(V^{\pi^{a}}-\mathbb{E}[V^{\pi^{a}}]))\,\big{|}\,type(t)\in\operatorname*{arg\,max}_{i\in\{x,y\}}\mathbf{P}_{a,i}\big{]}\cdot\Pr\big{[}type(t)\in\operatorname*{arg\,max}_{i\in\{x,y\}}\mathbf{P}_{a,i}\big{]}\leq
maxa{1,2}{𝔼[exp(γ(V¯πa𝔼[V¯πa]))]}\max_{a\in\{1,2\}}\{\mathbb{E}[\exp({\gamma(\bar{V}^{\pi^{a}}-\mathbb{E}[\bar{V}^{\pi^{a}}])})]\}

where \bar{V}^{\pi^{a}} is the return for the instance \langle[1],[2],\mathbf{q},\bar{\mathbf{P}},\bar{\mathbf{\Lambda}}\rangle in which, for every a\in\{1,2\}, \bar{\mathbf{P}}_{a,1}=\max_{i\in\{x,y\}}\mathbf{P}_{a,i} and \bar{\mathbf{\Lambda}}_{a,1}=1.

Finally, from Lemma 4.3 we get

maxa{1,2}{𝔼[exp(γ(V¯πa𝔼[V¯πa]))]}maxa{1,2}exp((8eln(𝐏¯a,1))2γ22)=maxa{1,2},i{x,y}exp((8eln(𝐏a,i))2γ22).\max_{a\in\{1,2\}}\{\mathbb{E}[\exp({\gamma(\bar{V}^{\pi^{a}}-\mathbb{E}[\bar{V}^{\pi^{a}}])})]\}\leq\max_{a\in\{1,2\}}\exp((-\frac{8e}{\ln(\bar{\mathbf{P}}_{a,1})})^{2}\frac{\gamma^{2}}{2})=\max_{a\in\{1,2\},i\in\{x,y\}}\exp((-\frac{8e}{\ln(\mathbf{P}_{a,i})})^{2}\frac{\gamma^{2}}{2}).

Choosing

τ=b=maxa{1,2},i{x,y}8eln(𝐏a,i)\tau=b=\max_{a\in\{1,2\},i\in\{x,y\}}-\frac{8e}{\ln(\mathbf{P}_{a,i})}

completes the proof as

maxa{1,2},i{x,y}8eln(𝐏a,i)8eln(1ϵ)=τ~andτ2b2=1=η.\max_{a\in\{1,2\},i\in\{x,y\}}-\frac{8e}{\ln(\mathbf{P}_{a,i})}\leq-\frac{8e}{\ln(1-\epsilon)}=\tilde{\tau}\quad\text{and}\quad\frac{\tau^{2}}{b^{2}}=1=\eta.

See 5.8

Proof.

Recall that V^{\pi}=\sum_{j=1}^{N^{\pi}}r_{j}(\pi_{j}), where we drop the dependence on the user index for readability. Formulated differently, for any H\in\mathbb{N} it holds that

Vπ=j=1H𝕀jNπrj(πj)+j=H+1𝕀jNπrj(πj).V^{\pi}=\sum_{j=1}^{H}\mathbb{I}_{j\leq N^{\pi}}\cdot r_{j}(\pi_{j})+\sum_{j=H+1}^{\infty}\mathbb{I}_{j\leq N^{\pi}}\cdot r_{j}(\pi_{j}).

Using the same representation for VπV^{\pi^{\prime}} and taking expectation, we get that

𝔼[VπVπ]\displaystyle\mathbb{E}\left[V^{\pi}-V^{\pi^{\prime}}\right] 𝔼[j=1H𝕀jNπrj(πj)j=1H𝕀jNπrj(πj)]+𝔼[j=H+1𝕀jNπrj(πj)]\displaystyle\leq\mathbb{E}\left[\sum_{j=1}^{H}\mathbb{I}_{j\leq N^{\pi}}\cdot r_{j}(\pi_{j})-\sum_{j=1}^{H}\mathbb{I}_{j\leq N^{\pi^{\prime}}}\cdot r_{j}(\pi_{j}^{\prime})\right]+\mathbb{E}\left[\sum_{j=H+1}^{\infty}\mathbb{I}_{j\leq N^{\pi}}\cdot r_{j}(\pi_{j})\right]
0+𝔼[j=H+1𝕀jNπrj(πj)]=j=H+1Pr(jNπ)rj(πj)\displaystyle\leq 0+\mathbb{E}\left[\sum_{j=H+1}^{\infty}\mathbb{I}_{j\leq N^{\pi}}\cdot r_{j}(\pi_{j})\right]=\sum_{j=H+1}^{\infty}\Pr\left(j\leq N^{\pi}\right)r_{j}(\pi_{j})
j=H+1(1ϵ)j(1ϵ)=(1ϵ)H+2j=0(1ϵ)j\displaystyle\leq\sum_{j=H+1}^{\infty}(1-\epsilon)^{j}(1-\epsilon)=(1-\epsilon)^{H+2}\sum_{j=0}^{\infty}(1-\epsilon)^{j}
(1ϵ)H1ϵeϵHϵ=12O(H).\displaystyle\leq(1-\epsilon)^{H}\frac{1}{\epsilon}\leq\frac{e^{-\epsilon H}}{\epsilon}=\frac{1}{2^{O(H)}}.