Modeling Attrition in Recommender Systems with Departing Bandits
Abstract
Traditionally, when recommender systems are formalized as multi-armed bandits, the policy of the recommender system influences the rewards accrued, but not the length of interaction. However, in real-world systems, dissatisfied users may depart (and never come back). In this work, we propose a novel multi-armed bandit setup that captures such policy-dependent horizons. Our setup consists of a finite set of user types, and multiple arms with Bernoulli payoffs. Each (user type, arm) tuple corresponds to an (unknown) reward probability. Each user’s type is initially unknown and can only be inferred through their response to recommendations. Moreover, if a user is dissatisfied with their recommendation, they might depart the system. We first address the case where all users share the same type, demonstrating that a recent UCB-based algorithm is optimal. We then move forward to the more challenging case, where users are divided among two types. While naive approaches cannot handle this setting, we provide an efficient learning algorithm that achieves sublinear regret in the number of users.
1 Introduction
At the heart of online services spanning such diverse industries as media consumption, dating, financial products, and more, recommendation systems (RSs) drive personalized experiences by making curation decisions informed by each user’s past history of interactions. While in practice, these systems employ diverse statistical heuristics, much of our theoretical understanding of them comes via stylized formulations within the multi-armed bandits (MABs) framework. While MABs abstract away from many aspects of real-world systems, they allow us to extract crisp insights by formalizing fundamental tradeoffs, such as that between exploration and exploitation, which all RSs must face (Joseph et al. 2016; Liu and Ho 2018; Patil et al. 2020; Ron, Ben-Porat, and Shalit 2021). As applies to RSs, exploitation consists of continuing to recommend items (or categories of items) that have been observed to yield high rewards in the past, while exploration consists of recommending items (or categories of items) about which the RS is uncertain but that could potentially yield even higher rewards.
In traditional formalizations of RSs as MABs, the recommender’s decisions affect only the rewards obtained. However, real-life recommenders face a dynamic that potentially alters the exploration-exploitation tradeoff: Dissatisfied users have the option to depart the system, never to return. Thus, recommendations in the service of exploration not only impact instantaneous rewards but also risk driving away users and therefore can influence long-term cumulative rewards by shortening trajectories of interactions.
In this work, we propose departing bandits which augment conventional MABs by incorporating these policy-dependent horizons. To motivate our setup, we consider the following example: An RS for recommending blog articles must choose at each time among two categories of articles, e.g., economics and sports. Upon a user’s arrival, the RS recommends articles sequentially. After each recommendation, the user decides whether to “click” the article and continue to the next recommendation, or to “not click” and may leave the system. Crucially, the user interacts with the system for a random number of rounds. The user’s departure probability depends on their satisfaction from the recommended item, which in turn depends on the user’s unknown type. A user’s type encodes their preferences (hence the probability of clicking) on the two topics (economics and sports).
When model parameters are given, in contrast to traditional MABs where the optimal policy is to play the best fixed arm, departing bandits require more careful analysis to derive an optimal planning strategy. Such planning is a local problem, in the sense that it is solved for each user. Since the user type is never known explicitly (the recommender must update its beliefs over the user types after each interaction), finding an optimal recommendation policy requires solving a specific partially observable MDP (POMDP) where the user type constitutes the (unobserved) state (more details in Section 5.1). When the model parameters are unknown, we deal with a learning problem that is global, in the sense that the recommender (learner) is learning for a stream of users instead of a particular user.
We begin with a formal definition of departing bandits in Section 2, and demonstrate that any fixed-arm policy is prone to suffer linear regret. In Section 3, we establish the UCB-based learning framework used in later sections. We instantiate this framework with a single user type in Section 4, where we show that it achieves sublinear regret in the number of users. We then move to the more challenging case with two user types and two recommendation categories in Section 5. To analyze the planning problem, we effectively reduce the search space for the optimal policy by using a closed-form expression for the expected return of any recommender policy. These results suggest an algorithm that achieves sublinear regret in this setting. Finally, we show an efficient optimal planning algorithm for multiple user types and two recommendation categories (Appendix A) and describe a scheme to construct semi-synthetic problem instances for this setting using real-world datasets (Appendix B).
1.1 Related Work
MABs have been studied extensively by the online learning community (Cesa-Bianchi and Lugosi 2006; Bubeck, Cesa-Bianchi et al. 2012). The contextual bandit literature augments the MAB setup with context-dependent rewards (Abbasi-Yadkori, Pál, and Szepesvári 2011; Slivkins 2019; Mahadik et al. 2020; Korda, Szörényi, and Li 2016; Lattimore and Szepesvári 2020). In contextual bandits, the learner observes a context before they make a decision, and the reward depends on the context. Another line of related work considers the dynamics that emerge when users act strategically (Kremer, Mansour, and Perry 2014; Mansour, Slivkins, and Syrgkanis 2015; Cohen and Mansour 2019; Bahar, Smorodinsky, and Tennenholtz 2016; Bahar et al. 2020). In that line of work, users arriving at the system receive a recommendation but act strategically: They can follow the recommendation or choose a different action. This modeling motivates the development of incentive-compatible mechanisms as solutions. In our work, however, the users are modeled in a stochastic (but not strategic) manner. Users may leave the system if they are dissatisfied with recommendations, and this departure follows a fixed (but possibly unknown) stochastic model.
The departing bandits problem has two important features: Policy-dependent horizons, and multiple user types that can be interpreted as unknown states. Existing MAB works (Azar, Lazaric, and Brunskill 2013; Cao et al. 2020) have addressed these phenomena separately, but we know of no work that integrates the two in a single framework. In particular, while Azar, Lazaric, and Brunskill (2013) study the setting with multiple user types, they focus on a fixed-horizon setting. Additionally, while Cao et al. (2020) deal with departure probabilities and policy-dependent interaction times for a single user type, they do not consider the possibility of multiple underlying user types.
The planning part of our problem falls under the framework of using Markov Decision Processes for modeling recommender-user dynamics (Shani, Heckerman, and Brafman 2005). Specifically, our problem involves partially observable user states, which have also appeared in many recent bandit variants (Pike-Burke and Grünewälder 2019; Leqi et al. 2020). Unlike these prior works, which focus on interactions with a single user, departing bandits consider a stream of users, each of whom has an (unknown) type drawn from a finite set of user types.
More broadly, our RS learning problem falls under the domain of reinforcement learning (RL). Existing RL literature that considers departing users in RSs includes Zhao et al. (2020b), Lu and Yang (2016), and Zhao et al. (2020a). While Zhao et al. (2020b) handle users of a single type that depart the RS within a bounded number of interactions, our work deals with multiple user types. In contrast to Zhao et al. (2020a), we consider an online setting and provide regret guarantees that do not require a bounded horizon. Finally, Lu and Yang (2016) use POMDPs to model user departure and focus on approximating the value function. They conduct an experimental analysis on historical data, while we devise an online learning algorithm with theoretical guarantees.
2 Departing Bandits
We propose a new online problem, called departing bandits, where the goal is to find the optimal recommendation algorithm for users of (unknown) types, and where the length of the interactions depends on the algorithm itself. Formally, the departing bandits problem is defined by a tuple , where is the number of user types, is the number of categories, specifies a prior distribution over types, and and are the click-probability and the departure-probability matrices, respectively. (We denote by the set .)
There are users who arrive sequentially at the RS. At every episode, a new user arrives with a type . We let denote the prior distribution over the user types, i.e., . Each user of type clicks on a recommended category with probability . In other words, each click follows a Bernoulli distribution with parameter . Whenever the user clicks, she stays for another iteration, and when the user does not click (no-click), she departs with probability (and stays with probability ). Each user interacts with the RS (the learner) until she departs.
We proceed to describe the user-RS interaction protocol. In every iteration of user , the learner recommends a category to user . The user clicks on it with probability . If the user clicks, the learner receives a reward of 1. (We formalize the reward, as is standard in the online learning literature, from the perspective of the learner; defining the reward from the user's perspective, e.g., as the number of clicks she gives or the number of articles she reads, induces the same model.) If the user does not click, the learner receives no reward (i.e., ), and user departs with probability . We assume that the learner knows the value of a constant such that (i.e., does not depend on ). When user departs, she does not interact with the learner anymore (and the learner moves on to the next user ). For convenience, the departing bandits protocol is summarized in Algorithm 1.
Input: number of types , number of categories , and number of users (episodes)
Hidden Parameters: types prior , click-probability , and departure-probability
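To make the protocol concrete, here is a minimal Python sketch that simulates a single episode. The names are assumptions rather than the paper's notation: `P[x][a]` and `Q[x][a]` denote the click and departure probabilities of type `x` on category `a`, and `prior` is the distribution over types. It illustrates the protocol of Algorithm 1, not the authors' implementation.

```python
import random

def run_episode(policy, P, Q, prior, rng=random):
    """Simulate one user (episode) of the departing-bandits protocol.

    policy: callable mapping the history of (category, clicked) pairs to a category.
    P[x][a]: click probability of a type-x user on category a (assumed name).
    Q[x][a]: probability that a type-x user departs after not clicking on a (assumed name).
    prior:   probabilities over user types.
    Returns the learner's return (total number of clicks) from this user.
    """
    x = rng.choices(range(len(prior)), weights=prior)[0]   # hidden user type
    history, total_reward = [], 0
    while True:
        a = policy(history)                   # learner recommends a category
        clicked = rng.random() < P[x][a]      # Bernoulli click
        history.append((a, clicked))
        if clicked:
            total_reward += 1                 # reward 1 per click; the user stays
        elif rng.random() < Q[x][a]:          # no-click: the user departs w.p. Q[x][a]
            return total_reward               # episode ends

# Example: one user type, two categories, departure on any no-click.
if __name__ == "__main__":
    print(run_episode(lambda h: 0, P=[[0.6, 0.3]], Q=[[1.0, 1.0]], prior=[1.0]))
```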
Having described the protocol, we move on to the goals of the learner. Without loss of generality, we assume that the online learner’s recommendations are made based on a policy , which is a mapping from the history of previous interactions (with that user) to recommendation categories.
For each user (episode) , the learner selects a policy that recommends category at every iteration , where denotes the episode length (i.e., the total number of iterations policy interacts with user until she departs). (We limit the discussion to deterministic policies; this is w.l.o.g., see Subsection 5.1 for further details.) The return of a policy , denoted by , is the cumulative reward the learner obtains when executing the policy until the user departs. Put differently, the return of from user is the random variable .
We denote by an optimal policy, namely a policy that maximizes the expected return, . Similarly, we denote by the optimal return, i.e., .
We highlight two algorithmic tasks. The first is the planning task, in which the goal is to find an optimal policy , given . The second is the online learning task. We consider settings where the learner knows the number of categories, , the number of types, , and the number of users, , but has no prior knowledge regarding or . In the online learning task, the value of the learner’s algorithm is the sum of the returns obtained from all the users, namely
The performance of the learner is compared to that of the best policy, formally defined by the regret for episodes,
(1)
The learner’s goal is to minimize the expected regret .
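For concreteness, one way to write the regret of Equation (1), under assumed notation ($T$ users, $V^{\pi}_t$ the return obtained from user $t$ under policy $\pi$, $\pi^\star$ an optimal policy, and $\pi_t$ the policy the learner selects for user $t$), is the following sketch; the paper's exact display may differ.

```latex
R_T \;=\; \sum_{t=1}^{T} V^{\pi^\star}_t \;-\; \sum_{t=1}^{T} V^{\pi_t}_t ,
\qquad \text{and the learner seeks to minimize } \mathbb{E}[R_T].
```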
2.1 Example
| | Type | Type |
|---|---|---|
| Category | | |
| Category | | |
| Prior | | |
The motivation for the following example is two-fold. First, to get the reader acquainted with our notations; and second, to show why fixed-arm policies are inferior in our setting.
Consider a problem instance with two user types (), which we call and for convenience. There are two categories (), and given no-click the departure is deterministic, i.e., for every category and type . That is, every user leaves immediately if she does not click. Furthermore, let the click-probability matrix and the user type prior distribution be as in Table 1.
Looking at and , we see that Category 1 is better for Type , while Category 2 is better for Type . Notice that without any additional information, a user is more likely to be of Type . Given the prior distribution, recommending Category in the first round yields an expected reward of . Similarly, recommending Category in the first round results in an expected reward of . Consequently, if we recommend myopically, i.e., without considering the user type, always recommending Category is better than always recommending Category 1.
Let denote the fixed-arm policy that always selects a single category . Using the tools we derive in Section 5 and in particular Theorem 5.3, we can compute the expected returns of and , and . Additionally, using results from Section 5.2, we can show that the optimal policy for the planning task, , recommends Category until iteration , and then recommends Category for the rest of the iterations until the user departs.
Using simple calculations, we see that and ; hence, the expected return of the optimal policy is greater than the returns of both fixed-arm policies by a constant. As a result, if the learner only uses fixed-arm policies ( for every ), she suffers linear expected regret, i.e.,
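To see the gap concretely, the sketch below evaluates fixed-arm and switch-once policies exactly on an illustrative instance in which every no-click causes departure. The numbers and the layout `P[x][a]` (click probability of type `x` on category `a`) are placeholders of my choosing, not the entries of Table 1; with departure probability one, a type's expected return is the sum over t of the product of its click probabilities along the first t recommendations.

```python
# Illustrative 2-type, 2-category instance in which every no-click causes departure.
P = [[0.8, 0.30],    # type 1: prefers category 0
     [0.9, 0.95]]    # type 2: clicks more on both categories
prior = [0.9, 0.1]   # most users are of type 1

def expected_return(prefix, tail, horizon=2000):
    """Expected clicks of the policy that plays the categories in `prefix` and then
    `tail` forever; the user must click at every step to survive, so each type
    contributes sum_t prod_{s<=t} p_s (truncated at `horizon`)."""
    total = 0.0
    for x, weight in enumerate(prior):
        survive, value = 1.0, 0.0
        for t in range(horizon):
            a = prefix[t] if t < len(prefix) else tail
            survive *= P[x][a]
            value += survive
        total += weight * value
    return total

print("always category 0     :", expected_return([], 0))
print("always category 1     :", expected_return([], 1))
print("category 0 x15, then 1:", expected_return([0] * 15, 1))   # beats both fixed arms
```

On this instance the switch-once policy strictly outperforms both fixed-arm policies, which is exactly the failure mode of fixed-arm learners described above.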
3 UCB Policy for Sub-exponential Returns
In this section, we introduce the learning framework used in the paper and provide a general regret guarantee for it.
In standard MAB problems, at each round the learner picks a single arm and receives a single sub-Gaussian reward. In contrast, in departing bandits, at each episode the learner receives a return , which is the cumulative reward of that policy. The return depends on the policy not only through the obtained rewards at each iteration but also through the total number of iterations (trajectory length). Such returns are not necessarily sub-Gaussian. Consequently, we cannot use standard MAB algorithms as they usually rely on concentration bounds for sub-Gaussian rewards. Furthermore, as we have shown in Section 2.1, in departing bandits fixed-arm policies can suffer linear regret (in terms of the number of users), which suggests considering a more expressive set of policies. This in turn yields another disadvantage for using MAB algorithms for departing bandits, as their regret is linear in the number of arms (categories) .
As we show later in Sections 4 and 5, for some natural instances of the departing bandits problem, the return from each user is sub-exponential (Definition 3.1). Algorithm 2, which we propose below, receives a set of policies as input, along with other parameters that we describe shortly. The algorithm is a restatement of the UCB-Hybrid Algorithm from Jia, Shi, and Shen (2021), with two modifications: (1) The input includes a set of policies rather than a set of actions/categories, and accordingly, the confidence bound updates are based on return samples (denoted by ) rather than reward samples. (2) There are two global parameters ( and ) instead of two local parameters per action. If the return from each policy in is sub-exponential, Algorithm 2 not only handles sub-exponential returns, but also comes with the following guarantee: Its expected value is close to the value of the best policy in .
3.1 Sub-exponential Returns
For convenience, we state here the definition of sub-exponential random variables (Eldar and Kutyniok 2012).
Definition 3.1.
We say that a random variable is sub-exponential with parameters if for every such that ,
In addition, for every -sub-exponential random variable, there exist constants such that the above is equivalent to each of the following properties:

1. Tails: .
2. Moments: .
Let be a set of policies with the following property: There exist such that the return of every policy is ()-sub-exponential with and . Algorithm 2 below receives as input a set of policies with the associated parameters and . Similarly to the UCB algorithm, it maintains an upper confidence bound for each policy and balances between exploration and exploitation. Theorem 3.2 below shows that Algorithm 2 always obtains a value close to that of the best policy in , up to an additive factor of . The theorem follows directly from Theorem from Jia, Shi, and Shen (2021) by treating policies as arms and returns as rewards.
Theorem 3.2.
Let be a set of policies with the associated parameters . Let be the policies Algorithm 2 selects. It holds that
There are two challenges in leveraging Theorem 3.2. The first challenge is crucial: Notice that Theorem 3.2 does not imply that Algorithm 2 has a low regret; its only guarantee is w.r.t. the policies in received as input. As the number of policies is infinite, our success will depend on our ability to characterize a “good” set of policies . The second challenge is technical: Even if we find such , we still need to characterize the associated and . This is precisely what we do in Sections 4 and 5.
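Since the confidence radii of Algorithm 2 are inherited from Jia, Shi, and Shen (2021), the sketch below shows only the generic shape of such a rule: a UCB over a finite policy set whose radius is the larger of a sub-Gaussian-style term and a heavier-tailed linear term, as Bernstein-type bounds for (nu, alpha)-sub-exponential variables suggest. The constants, the round-robin initialization, and the names are assumptions, not the paper's algorithm.

```python
import math
from collections import defaultdict

def ucb_over_policies(policies, nu, alpha, num_users, run_episode_fn):
    """Schematic UCB over a finite policy set with sub-exponential returns.

    policies:       list of (hashable) policy objects accepted by run_episode_fn.
    nu, alpha:      assumed sub-exponential parameters shared by all policies.
    run_episode_fn: callable(policy) -> realized return of one user (one episode).
    """
    counts, means = defaultdict(int), defaultdict(float)

    def radius(n, t):
        # Bernstein-style radius: sub-Gaussian term for small deviations,
        # linear (heavy-tail) term for large ones.  Constants are illustrative.
        log_t = math.log(max(t, 2))
        return max(math.sqrt(2 * nu ** 2 * log_t / n), 2 * alpha * log_t / n)

    chosen = []
    for t in range(1, num_users + 1):
        if t <= len(policies):                       # play each policy once first
            pi = policies[t - 1]
        else:
            pi = max(policies, key=lambda p: means[p] + radius(counts[p], t))
        ret = run_episode_fn(pi)                     # observe the return of user t
        counts[pi] += 1
        means[pi] += (ret - means[pi]) / counts[pi]  # running mean of the returns
        chosen.append(pi)
    return chosen
```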
4 Single User Type
In this section, we focus on the special case of a single user type, i.e., . For notational convenience, since we only discuss single-type users, we associate each category with its two unique parameters and refer to them as scalars rather than vectors. In addition, we use the notation for the random variable representing the number of iterations until a random user departs after being recommended by , the fixed-arm policy that recommends category in each iteration.
To derive a regret bound for single-type users, we use two main lemmas: Lemma 4.1, which shows that the optimal policy is a fixed-arm policy, and Lemma 4.3, which shows that returns of fixed-arm policies are sub-exponential and calculates their corresponding parameters. These lemmas allow us to use Algorithm 2 with a policy set that contains all the fixed-arm policies, and derive a regret bound. For brevity, we relegate all the proofs to the Appendix.
To show that there exists a category for which is optimal, we rely on the assumption that all the users have the same type (hence we drop the type subscripts ), and as a result the rewards of each category have an expectation that depends on a single parameter, namely . Such a category does not necessarily have the maximal click-probability nor the minimal departure-probability, but rather an optimal combination of the two (in a way, this is similar to the knapsack problem, where we want to maximize the reward while having as little weight as possible). We formalize it in the following lemma.
Lemma 4.1.
A policy is optimal if
As a consequence of this lemma, the planning problem for single-type users is trivial—the solution is a fixed-arm policy given in the lemma. However, without access to the model parameters, identifying requires learning. We proceed with a simple observation regarding the random number of iterations obtained by executing a fixed-arm policy. The observation would later help us show that the return of any fixed-arm policy is sub-exponential.
Observation 4.2.
For every and every , the random variable follows a geometric distribution with success probability parameter .
Using Observation 4.2 and previously known results (stated as Lemma D.3 in the appendix), we show that is sub-exponential for all . Notice that return realizations are always upper bounded by the trajectory length; this implies that returns are also sub-exponential. However, to use the regret bound of Algorithm 2, we need information regarding the parameters for every policy . We provide this information in the following Lemma 4.3.
Lemma 4.3.
For each category , the centered random variable is sub-exponential with parameters , such that
Proof sketch.
We rely on the equivalence between the subexponentiality of a random variable and the bounds on its moments (Property 2 in Definition 3.1). We bound the expectation of the return , and use Minkowski’s and Jensen’s inequalities to show in Lemma D.2 that is upper bounded by for every and . Finally, we apply a normalization trick and bound the Taylor series of to obtain the result. ∎
An immediate consequence of Lemma 4.3 is that the parameters and are valid upper bounds for and for each (i.e., and ). We can now derive a regret bound using Algorithm 2 and Theorem 3.2.
Theorem 4.4.
For single-type users (), running Algorithm 2 with and achieves an expected regret of at most
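For intuition on the single-type planning step: under the model of Section 2, the value V of the fixed-arm policy for a category with click probability p and departure probability q satisfies V = p(1 + V) + (1 - p)(1 - q)V, hence V = p / ((1 - p) q). This is my derivation of one way to make the "optimal combination" of click and departure probabilities mentioned before Lemma 4.1 precise; the sketch below (assumed variable names) simply picks the maximizing category.

```python
def best_fixed_category(p, q):
    """Single user type: return the category maximizing the fixed-arm value
    V_a = p[a] / ((1 - p[a]) * q[a]), obtained from V = p(1+V) + (1-p)(1-q)V.
    Assumes p[a] < 1 and q[a] > 0 for every category a."""
    values = [pa / ((1.0 - pa) * qa) for pa, qa in zip(p, q)]
    best = max(range(len(p)), key=values.__getitem__)
    return best, values[best]

# Category 1 has a lower click rate but a much lower departure rate, and wins:
# 0.6/(0.4*0.9) = 1.67  vs  0.5/(0.5*0.2) = 5.0
print(best_fixed_category(p=[0.6, 0.5], q=[0.9, 0.2]))
```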
5 Two User Types and Two Categories
In this section, we consider cases with two user types (), two categories () and departure-probability for every category and type . Even in this relatively simplified setting, where users leave after the first “no-click”, planning is essential. To see this, notice that the event of a user clicking on a certain category provides additional information about the user, which can be used to tailor better recommendations; hence, algorithms that do not take this into account may suffer a linear regret. In fact, this is not just a matter of the learning algorithm at hand, but rather a failure of all fixed-arm policies; there are instances where all fixed-arm policies yield high regret w.r.t. the baseline defined in Equation (1). Indeed, this is what the example in Section 2.1 showcases. Such an observation suggests that studying the optimal planning problem is vital.
In Section 5.1, we introduce the partially observable MDP formulation of departing bandits along with the notion of a belief-category walk. We use this notion to provide a closed-form formula for policies’ expected return, which we use extensively later on. Next, in Section 5.2 we characterize the optimal policy, and show that we can compute it in constant time relying on the closed-form formula. This is striking, as generally computing optimal POMDP policies is computationally intractable since, e.g., the space of policies grows exponentially with the horizon. Conceptually, we show that there exists an optimal policy that depends on a belief threshold: It recommends one category until the posterior belief of one type, which is monotonically increasing, crosses the threshold, and then it recommends the other category. Finally, in Section 5.3 we leverage all the previously obtained results to derive a small set of threshold policies of size with corresponding sub-exponential parameters. Due to Theorem 3.2, this result implies a sublinear regret bound.
5.1 Efficient Planning
To recap, we aim to find the optimal policy when the click-probability matrix and the prior over user types are known. Namely, given an instance in the form of , our goal is to efficiently find the optimal policy.
For planning purposes, the problem can be modeled by an episodic POMDP, . The state space comprises all the types , along with a designated absorbing state indicating that the user has departed (and the episode has terminated). is the set of actions (categories), and is the set of possible observations. The transition and observation functions, and (respectively), satisfy and for every type and action . Finally, is the expected reward matrix, and is the initial state distribution over the types.
When there are two user types and two categories, the click-probability matrix is given by Table 2, where we note that the prior over the two types sums to one and can therefore be represented by a single parameter .
Remark 5.1.
Without loss of generality, we assume that since one could always permute the matrix to obtain such a structure.
Since the return and number of iterations for the same policy are independent of the user index, we drop the subscript in the rest of this subsection and use .
| | Type | Type |
|---|---|---|
| Category | | |
| Category | | |
| Prior | | |
As is well-known in the POMDP literature (Kaelbling, Littman, and Cassandra 1998), the optimal policy and its expected return are functions of belief states that represent the probability of the state at each time. In our setting, the states are the user types. We denote by the belief that the state is (type) at iteration . Similarly, is the belief that the state is (type) at iteration . Needless to say, once the state is reached, the belief over the type states is irrelevant, as users do not come back. Nevertheless, we neglect this case as our analysis does not make use of it.
We now describe how to compute the belief. At iteration , the belief state is set to be . At iteration , upon receiving a positive reward , the belief is updated from to
(2)
where we note that in the event of no-click, the current user departs the system, i.e., we move to the absorbing state . For any policy that maps a belief to a category, its expected return satisfies the Bellman equation,
To better characterize the expected return, we introduce the following notion of belief-category walk.
Definition 5.2 (Belief-category walk).
Let be any policy. The sequence
is called the belief-category walk. Namely, it is the induced walk of belief updates and categories chosen by , given all the rewards are positive ( for every ).
Notice that every policy induces a single, well-defined and deterministic belief-category walk (recall that we assume departure-probabilities satisfy for every ). Moreover, given any policy , the trajectory of every user recommended by is fully characterized by the belief-category walk clipped at .
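A small sketch of these quantities, under assumed notation: `b` is the belief that the user is of type 2, `P[x][a]` the click probability of type `x` on category `a`, and every no-click causes departure. The update below is a Bayes update of the kind Equation (2) describes, and evaluating a policy along the resulting deterministic belief-category walk, clipped at a finite horizon, gives a numerical stand-in for the closed form of Theorem 5.3.

```python
def belief_update(b, a, P):
    """Posterior probability of type 2 after observing a click on category a,
    starting from prior belief b (Bayes rule)."""
    lik1, lik2 = P[0][a], P[1][a]
    return b * lik2 / (b * lik2 + (1.0 - b) * lik1)

def walk_value(policy, b0, P, horizon=500):
    """Expected return of `policy` (a map from belief to category) when every
    no-click causes departure, computed along the deterministic belief-category walk."""
    b, survive, value = b0, 1.0, 0.0
    for _ in range(horizon):
        a = policy(b)
        p_click = (1.0 - b) * P[0][a] + b * P[1][a]   # click prob. under the current belief
        survive *= p_click                            # prob. that the walk reaches this click
        value += survive
        b = belief_update(b, a, P)                    # condition on the click having happened
    return value

# Belief-threshold rule: category 0 while the belief of type 2 is below one half.
P = [[0.8, 0.3], [0.9, 0.95]]                         # illustrative click probabilities
print(walk_value(lambda b: 0 if b < 0.5 else 1, b0=0.1, P=P))
```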
In what follows, we derive a closed-form expression for the expected return as a function of , the categories chosen by the policy, and the click-probability matrix.
Theorem 5.3.
For every policy and an initial belief , the expected return is given by
where and are calculated based on the belief-category walk induced by .
5.2 Characterizing the Optimal Policy
Using Theorem 5.3, we show that the planning problem can be solved in . To arrive at this conclusion, we perform a case analysis over the following three structures of the click-probability matrix :
- Dominant Row, where ;
- Dominant Column, where ;
- Dominant Diagonal, where .
Crucially, any matrix takes exactly one of the three structures. Further, since is known in the planning problem, identifying the structure at hand takes time. Using this structure partition, we characterize the optimal policy.
Dominant Row
We start by considering the simplest structure, in which the Category is preferred by both types of users: Since and (Remark 5.1), there exists a dominant row, i.e., Category .
Lemma 5.4.
For any instance such that has a dominant row , the fixed policy is an optimal policy.
As expected, if Category is dominant then the policy that always recommends Category is optimal.
Dominant Column
In the second structure, we consider the case where there is no dominant row and the column of type is dominant, i.e., . In such a case, which is also the one described in the example in Section 2.1, it is unclear what the optimal policy would be since none of the categories dominates the other.
Surprisingly, we show that the optimal policy can be of only one form: Recommend Category for some time steps (possibly zero) and then always recommend Category . To identify when to switch from Category to Category , one only needs to compare four expected returns.
Theorem 5.5.
For any instance such that has a dominant column, one of the following four policies is optimal:
where is a constant, and () stands for recommending Category until iteration () and then switching to Category .
The intuition behind the theorem is as follows. If the prior tends towards type , we might start with recommending Category (which users of type are more likely to click on). But after several iterations, and as long as the user stays, the posterior belief increases since (recall Equation (2)). Consequently, since type becomes more probable, and since , the optimal policy recommends the best category for this type, i.e., Category . For the exact expression of , we refer the reader to Appendix E.3.
Using Theorem 5.3, we can compute the expected return for each of the four policies in , showing that we can find the optimal policy when has a dominant column in .
Dominant Diagonal
In the last structure, we consider the case where there is no dominant row (i.e., ) nor a dominant column (i.e., ). At first glance, this case is more complex than the previous two, since none of the categories and none of the types dominates the other one. However, we uncover that the optimal policy can be either always recommending Category or always recommending Category . Theorem 5.6 summarizes this result.
Theorem 5.6.
For any instance such that has a dominant diagonal, either or is optimal.
With the full characterization of the optimal policy derived in this section (for all three structures), we have shown that the optimal policy can be computed in .
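A planner for the known-parameter 2x2 case can therefore classify the matrix and compare a handful of candidates. The sketch below is a brute-force stand-in for that procedure: it evaluates the candidate families suggested by Lemma 5.4, Theorem 5.5, and Theorem 5.6 by explicit summation (assumed layout `P[x][a]`, rows indexed by type, and departure on every no-click), instead of the paper's constant-time comparison of closed-form values.

```python
def seq_value(P, prior, prefix, tail, horizon=2000):
    """Expected return of the policy whose belief-category walk plays `prefix`
    and then `tail` forever (departure probability one on a no-click)."""
    total = 0.0
    for x, w in enumerate(prior):
        survive, value = 1.0, 0.0
        for t in range(horizon):
            a = prefix[t] if t < len(prefix) else tail
            survive *= P[x][a]
            value += survive
        total += w * value
    return total

def classify(P):
    """P[x][a] = click probability of type x on category a (assumed layout)."""
    for a in (0, 1):
        if P[0][a] >= P[0][1 - a] and P[1][a] >= P[1][1 - a]:
            return "dominant row"          # one category is best for both types
    for x in (0, 1):
        if P[x][0] >= P[1 - x][0] and P[x][1] >= P[1 - x][1]:
            return "dominant column"       # one type clicks more on both categories
    return "dominant diagonal"

def plan(P, prior, max_switch=200):
    # Fixed-arm candidates cover the dominant-row and dominant-diagonal cases;
    # the dominant-column case adds switch-once candidates, searched by brute force.
    candidates = [(seq_value(P, prior, [], a), f"always {a}") for a in (0, 1)]
    if classify(P) == "dominant column":
        for first in (0, 1):
            for n in range(1, max_switch):
                candidates.append((seq_value(P, prior, [first] * n, 1 - first),
                                   f"{first} x{n}, then {1 - first}"))
    return max(candidates)

print(plan([[0.8, 0.3], [0.9, 0.95]], [0.9, 0.1]))
```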
5.3 Learning: UCB-based regret bound
In this section, we move from the planning task to the learning one. Building on the results of previous sections, we know that there must exist a threshold policy—a policy whose belief-category walk has a finite prefix of one category, and an infinite suffix with the other category—which is optimal. However, there can still be infinitely many such policies. To address this problem, we first show how to reduce the search space for approximately optimal policies with negligible additive factor to a set of policies. Then, we derive the parameters and required for Algorithm 2. As an immediate consequence, we get a sublinear regret algorithm for this setting. We begin with defining threshold policies.
Definition 5.7 (Threshold Policy).
A policy is called an -threshold policy if there exists a number in ’s belief-category walk such that

- recommends category in iterations , and
- recommends category in iterations ,
for and .
For instance, the policy that always recommends Category 1 is the -threshold policy, as it recommends Category 2 until the zeroth iteration (i.e., never recommends Category 2) and then Category 1 eternally. Furthermore, the policy introduced in Theorem 5.5 is the -threshold policy.
Next, recall that the chance of departure in every iteration is greater than or equal to , since we assume . Consequently, the probability that a user will stay beyond iterations is exponentially decreasing with . We could use high-probability arguments to claim that it suffices to focus on the first iterations, but without further insights this would yield candidates for the optimal policy. Instead, we exploit our insights about threshold policies.
Let be the set of all -threshold policies for and . Clearly, . Lemma 5.8 shows that the return obtained by the best policy in is worse than that of the optimal policy by at most a negligible factor.
Lemma 5.8.
For every , it holds that
Before we describe how to apply Algorithm 2, we need to show that returns of all the policies in are sub-exponential. In Lemma 5.9, we show that is -sub-exponential for every threshold policy , and provide bounds for both and .
Lemma 5.9.
Let and . For every threshold policy , the centered random variable is -sub-exponential with satisfying and .
We are ready to wrap up our solution for the learning task proposed in this section. Let , be the set of threshold policies characterized before, and let and be constants as defined in Lemma 5.9.
Theorem 5.10.
Applying Algorithm 2 with on the class of two-type, two-category instances considered in this section always yields an expected regret of
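Schematically, the learner enumerates the finite set of threshold policies and hands it to the UCB routine sketched in Section 3. The iteration-indexed representation, the horizon cap `n_max`, and the commented-out parameter choices below are assumptions; the sub-exponential parameters are meant to come from Lemma 5.9.

```python
def make_threshold_policy(first, switch_at):
    """Recommend `first` for the first `switch_at` iterations of a user's session
    and the other category afterwards (an iteration-indexed switch-once policy)."""
    def policy(history):
        return first if len(history) < switch_at else 1 - first
    return policy

def threshold_policy_set(n_max):
    """All switch-once policies with switching point at most n_max, in both orders
    (switch_at = 0 recovers the two fixed-arm policies)."""
    return [make_threshold_policy(first, n)
            for first in (0, 1) for n in range(n_max + 1)]

# Plugging into the earlier sketches (hypothetical parameter values):
# policies = threshold_policy_set(n_max=60)
# chosen = ucb_over_policies(policies, nu=..., alpha=..., num_users=10_000,
#                            run_episode_fn=lambda pi: run_episode(pi, P, Q, prior))
```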
6 Conclusions and Discussion
This paper introduces a MAB model in which the recommender system influences both the rewards accrued and the length of interaction. We dealt with two classes of problems: A single user type with general departure probabilities (Section 4), and two user types with two categories, where each user departs after her first no-click (Section 5). For each problem class, we started by analyzing the planning task, then characterized a small set of candidates for the optimal policy, and then applied Algorithm 2 to achieve sublinear regret.
In the appendix, we also consider a third class of problems: Two categories and multiple user types (), where each user departs upon her first no-click. We use the closed-form expected return derived in Theorem 5.3 to show how to use dynamic programming to find approximately optimal planning policies. We formulate the problem of finding an optimal policy for a finite horizon in a recursive manner. Particularly, we show how to find an additive approximation in run-time of . Unfortunately, this approach cannot assist us in the learning task. Dynamic programming relies on skipping sub-optimal solutions to sub-problems (shorter horizons in our case), but this happens on the fly; thus, we cannot a-priori define a small set of candidates as Algorithm 2 requires. More broadly, we could use this dynamic programming approach for more than two categories, namely for , but then the run-time becomes .
There are several interesting future directions. First, achieving low regret for the setup in Section 5 with . We suspect that this class of problems could enjoy a solution similar to ours, where candidates for optimal policies mix only two categories. Second, achieving low regret for the setup in Section 5 with uncertain departure (i.e., ). Our approach fails in such a case since we cannot use belief-category walks; these are no longer deterministic. Consequently, the closed-form formula is much more complex and optimal planning becomes more intricate. These two challenges are left open for future work.
Acknowledgement
LL is generously supported by an Open Philanthropy AI Fellowship. LC is supported by Ariane de Rothschild Women Doctoral Program. ZL thanks the Block Center for Technology and Society; Amazon AI; PwC USA via the Digital Transformation and Innovation Center; and the NSF: Fair AI Award IIS2040929 for supporting ACMI lab’s research on the responsible use of machine learning. This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation program (grant agreement No. 882396), by the Israel Science Foundation (grant number 993/17), Tel Aviv University Center for AI and Data Science (TAD), and the Yandex Initiative for Machine Learning at Tel Aviv University.
References
- Abbasi-Yadkori, Pál, and Szepesvári (2011) Abbasi-Yadkori, Y.; Pál, D.; and Szepesvári, C. 2011. Improved algorithms for linear stochastic bandits. Advances in neural information processing systems, 24: 2312–2320.
- Azar, Lazaric, and Brunskill (2013) Azar, M. G.; Lazaric, A.; and Brunskill, E. 2013. Sequential transfer in multi-armed bandit with finite set of models. In Proceedings of the 26th International Conference on Neural Information Processing Systems-Volume 2, 2220–2228.
- Bahar et al. (2020) Bahar, G.; Ben-Porat, O.; Leyton-Brown, K.; and Tennenholtz, M. 2020. Fiduciary bandits. In International Conference on Machine Learning, 518–527. PMLR.
- Bahar, Smorodinsky, and Tennenholtz (2016) Bahar, G.; Smorodinsky, R.; and Tennenholtz, M. 2016. Economic Recommendation Systems: One Page Abstract. In Proceedings of the 2016 ACM Conference on Economics and Computation, EC ’16, 757–757. New York, NY, USA: ACM. ISBN 978-1-4503-3936-0.
- Bubeck, Cesa-Bianchi et al. (2012) Bubeck, S.; Cesa-Bianchi, N.; et al. 2012. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends® in Machine Learning, 5(1): 1–122.
- Cao et al. (2020) Cao, J.; Sun, W.; Shen, Z.-J. M.; and Ettl, M. 2020. Fatigue-Aware Bandits for Dependent Click Models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, 3341–3348.
- Cesa-Bianchi and Lugosi (2006) Cesa-Bianchi, N.; and Lugosi, G. 2006. Prediction, learning, and games. Cambridge Univ Press.
- Cohen and Mansour (2019) Cohen, L.; and Mansour, Y. 2019. Optimal Algorithm for Bayesian Incentive-Compatible. In ACM Conf. on Economics and Computation (EC).
- Eldar and Kutyniok (2012) Eldar, Y.; and Kutyniok, G. 2012. Compressed Sensing: Theory and Applications. ISBN 978-1107005587.
- Harper and Konstan (2015) Harper, F. M.; and Konstan, J. A. 2015. The MovieLens Datasets: History and Context. ACM Trans. Interact. Intell. Syst.
- Hillar and Wibisono (2018) Hillar, C.; and Wibisono, A. 2018. Maximum entropy distributions on graphs. arXiv:1301.3321.
- Jia, Shi, and Shen (2021) Jia, H.; Shi, C.; and Shen, S. 2021. Multi-armed Bandit with Sub-exponential Rewards. Operations Research Letters.
- Joseph et al. (2016) Joseph, M.; Kearns, M.; Morgenstern, J.; and Roth, A. 2016. Fairness in learning: Classic and contextual bandits. arXiv preprint arXiv:1605.07139.
- Kaelbling, Littman, and Cassandra (1998) Kaelbling, L. P.; Littman, M. L.; and Cassandra, A. R. 1998. Planning and acting in partially observable stochastic domains. Artificial intelligence, 101(1-2): 99–134.
- Korda, Szörényi, and Li (2016) Korda, N.; Szörényi, B.; and Li, S. 2016. Distributed Clustering of Linear Bandits in Peer to Peer Networks. In Proceedings of the 33rd International Conference on International Conference on Machine Learning.
- Kremer, Mansour, and Perry (2014) Kremer, I.; Mansour, Y.; and Perry, M. 2014. Implementing the wisdom of the crowd. Journal of Political Economy, 122: 988–1012.
- Lattimore and Szepesvári (2020) Lattimore, T.; and Szepesvári, C. 2020. Bandit algorithms. Cambridge University Press.
- Leqi et al. (2020) Leqi, L.; Kilinc-Karzan, F.; Lipton, Z. C.; and Montgomery, A. L. 2020. Rebounding Bandits for Modeling Satiation Effects. arXiv preprint arXiv:2011.06741.
- Liu and Ho (2018) Liu, Y.; and Ho, C.-J. 2018. Incentivizing high quality user contributions: New arm generation in bandit learning. In Thirty-Second AAAI Conference on Artificial Intelligence.
- Lu and Yang (2016) Lu, Z.; and Yang, Q. 2016. Partially Observable Markov Decision Process for Recommender Systems. CoRR, abs/1608.07793.
- Mahadik et al. (2020) Mahadik, K.; Wu, Q.; Li, S.; and Sabne, A. 2020. Fast Distributed Bandits for Online Recommendation Systems.
- Mansour, Slivkins, and Syrgkanis (2015) Mansour, Y.; Slivkins, A.; and Syrgkanis, V. 2015. Bayesian Incentive-Compatible Bandit Exploration. In ACM Conf. on Economics and Computation (EC).
- Patil et al. (2020) Patil, V.; Ghalme, G.; Nair, V.; and Narahari, Y. 2020. Achieving fairness in the stochastic multi-armed bandit problem. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, 5379–5386.
- Pike-Burke and Grünewälder (2019) Pike-Burke, C.; and Grünewälder, S. 2019. Recovering bandits. arXiv preprint arXiv:1910.14354.
- Ron, Ben-Porat, and Shalit (2021) Ron, T.; Ben-Porat, O.; and Shalit, U. 2021. Corporate Social Responsibility via Multi-Armed Bandits. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, 26–40.
- Shani, Heckerman, and Brafman (2005) Shani, G.; Heckerman, D.; and Brafman, R. I. 2005. An MDP-Based Recommender System. Journal of Machine Learning Research, 6(43): 1265–1295.
- Slivkins (2019) Slivkins, A. 2019. Introduction to multi-armed bandits. arXiv preprint arXiv:1904.07272.
- Zhao et al. (2020a) Zhao, X.; Zheng, X.; Yang, X.; Liu, X.; and Tang, J. 2020a. Jointly Learning to Recommend and Advertise. In KDD ’20: The 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining.
- Zhao et al. (2020b) Zhao, Y.; Zhou, Y.; Ou, M.; Xu, H.; and Li, N. 2020b. Maximizing Cumulative User Engagement in Sequential Recommendation: An Online Optimization Perspective. In Gupta, R.; Liu, Y.; Tang, J.; and Prakash, B. A., eds., KDD ’20: The 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining.
Appendix A Extension: Planning Beyond Two User Types
In this section, we treat the planning task with two categories () but potentially many types (i.e., ). For convenience, we formalize the results in this section in terms of , but the results are readily extendable for the more general case. We derive an almost-optimal planning policy via dynamic programming, and then explain why it cannot be used for learning as we did in the previous section.
For reasons that will become apparent later on, we define by as the return of a policy until the ’s iteration. Using Theorem 5.3, we have that
where and are calculated based on the belief-category walk induced by . Further, let denote the policy maximizing .
Notice that there is a bijection from iterations policies to ; hence, we can find by finding the arg max of the expression on the right-hand-side of the above equation, in terms of . Formally, we want to solve the integer linear programming (ILP),
(3)
Although this problem involves integer programming, we can solve it using dynamic programming in runtime. Notice that the optimization is over a subset of binary variables . Let be the set of feasible solutions of the ILP, and similarly let denote the set of prefixes of length of .
For any and , define
where as in the ILP.
Consequently, solving the ILP is equivalent to maximizing over the domain .
Next, for any and two integers such that , define
(4)
Under this construction, over such that is precisely the value of the ILP.
Reformulating Equation (4) for ,
where . For every , there are only possible values can take: All the ways of dividing into non-negative integers and ; therefore, having computed for all feasible inputs, we can compute in . Consequently, computing , which is precisely the value of the ILP in (3), takes run-time. Moreover, the policy can be found using backtracking. We remark that an argument similar to Lemma 5.8 implies that ; hence, is almost optimal.
To finalize this section, we remark that this approach could also work for categories. Naively, for a finite horizon , there are possible policies. The dynamic programming procedure explained above makes the search operate in run-time of . The run-time, exponential in the number of categories but polynomial in the horizon, is feasible when the number of categories is small.
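A sketch of the dynamic program described above, reconstructed under the assumption that every no-click causes departure: with two categories, the expected contribution of step t depends on the walk only through the number j of category-0 recommendations among the first t steps, so a table indexed by (t, j) suffices. The bookkeeping and names are mine; the paper additionally recovers the optimal walk by backtracking.

```python
def dp_plan(P, prior, N):
    """Value of the best deterministic category walk of length N, for two categories
    and any number of user types, when every no-click causes departure.

    Per-step contribution: g(t, j) = sum_x prior[x] * P[x][0]**j * P[x][1]**(t - j),
    where j is the number of category-0 plays among the first t recommendations.
    """
    NEG = float("-inf")

    def g(t, j):
        return sum(w * (P[x][0] ** j) * (P[x][1] ** (t - j))
                   for x, w in enumerate(prior))

    # W[j] = best cumulative value over length-t walks with j plays of category 0.
    W = [g(1, 0), g(1, 1)]                           # t = 1
    for t in range(2, N + 1):
        new = [NEG] * (t + 1)
        for j in range(t + 1):
            best_prefix = max(W[j - 1] if j >= 1 else NEG,
                              W[j] if j <= t - 1 else NEG)
            new[j] = best_prefix + g(t, j)
        W = new
    return max(W)

# Example with three user types (illustrative numbers).
print(dp_plan(P=[[0.9, 0.2], [0.6, 0.7], [0.3, 0.95]], prior=[0.5, 0.3, 0.2], N=50))
```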
Appendix B Experimental Evaluation
For general real-world datasets, we propose a scheme to construct semi-synthetic problem instances with many arms and many user types, using rating datasets with multiple ratings per user. We exemplify our scheme on the MovieLens dataset (Harper and Konstan 2015). As a pre-processing step, we set movie genres to be the categories of interest, select a subset of categories of size (e.g., sci-fi, drama, and comedy), select the number of user types, , and remove any user who has not provided a rating for at least one movie from each category . When running the algorithm, we randomly draw users from the data and, given a recommended category , suggest them a random movie that they have rated, setting their click probability to , where is their normalized rating of the suggested movie.
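A minimal pandas sketch of this construction. The file and column names follow the common MovieLens distribution format (`ratings.csv` with `userId`, `movieId`, `rating`; `movies.csv` with a pipe-separated `genres` column), but these names, the chosen genres, and the rating-to-click-probability map are assumptions to be adapted to the dataset at hand; grouping users into types is omitted.

```python
import pandas as pd

GENRES = ["Sci-Fi", "Drama", "Comedy"]          # the selected categories (assumed choice)

ratings = pd.read_csv("ratings.csv")            # userId, movieId, rating (assumed columns)
movies = pd.read_csv("movies.csv")              # movieId, genres (pipe-separated)

# Keep one (primary) genre per movie and restrict to the selected categories.
movies["genre"] = movies["genres"].str.split("|").str[0]
df = ratings.merge(movies[["movieId", "genre"]], on="movieId")
df = df[df["genre"].isin(GENRES)]

# Remove users who have not rated at least one movie in every selected category.
genres_per_user = df.groupby("userId")["genre"].nunique()
df = df[df["userId"].isin(genres_per_user[genres_per_user == len(GENRES)].index)]

# Normalized rating -> click probability when a movie of the recommended genre is shown.
df["click_prob"] = df["rating"] / df["rating"].max()

def sample_click_prob(user_id, genre):
    """Pick a random movie of `genre` that this user has rated; return its click probability."""
    pool = df[(df["userId"] == user_id) & (df["genre"] == genre)]
    return pool.sample(1)["click_prob"].iloc[0]
```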
Appendix C UCB Policy for Sub-exponential Returns
An important tool for analyzing sub-exponential random variables is Bernstein’s Inequality, a concentration inequality for sub-exponential random variables (see, e.g., Jia, Shi, and Shen (2021)). As it is a major component of the regret analysis for Algorithm 2, we state it here for convenience.
Lemma C.1.
(Bernstein’s Inequality) Let a random variable be sub-exponential with parameters . Then for every :
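One standard form of the bound, under the convention that a $(\nu,\alpha)$-sub-exponential variable $X$ satisfies $\mathbb{E}\, e^{\lambda(X-\mathbb{E}X)} \le e^{\nu^2\lambda^2/2}$ for all $|\lambda| \le 1/\alpha$, reads as follows; the constants in the paper's statement may differ.

```latex
\Pr\bigl( |X - \mathbb{E}X| \ge \varepsilon \bigr) \;\le\;
\begin{cases}
  2\exp\!\bigl( -\varepsilon^{2} / (2\nu^{2}) \bigr), & 0 \le \varepsilon \le \nu^{2}/\alpha, \\
  2\exp\!\bigl( -\varepsilon / (2\alpha) \bigr),       & \varepsilon > \nu^{2}/\alpha.
\end{cases}
```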
Appendix D Single User Type: Proofs from Section 4
To simplify the proofs, we use the following notation: For a fixed-arm policy , we use to denote its return from iteration until the user departs. Namely,
Throughout this section, we will use the following Observation.
Observation D.1.
For every policy and iteration ,
See 4.1
Proof.
First, recall that every POMDP has an optimal Markovian policy which is deterministic (we refer the reader to Section 5.1 for the full formulation of the problem as a POMDP). Having independent rewards and a single state implies that there exists such that for every (similarly to standard MAB problems, there exists a fixed-arm policy which is optimal).
Assume by contradiction that the optimal policy holds
Now, notice that
Solving the recurrence relation and summing the geometric series we get
Finally,
yields that any fixed-arm policy such that
holds , a contradiction to the optimality of . ∎
Lemma D.2.
For each , the centered random return is sub-exponential with parameter .
In order to show that returns of fixed-arm policies are sub-exponential random variables, we first show that the number of iterations of users recommended by fixed-arm policies is also sub-exponential. For this purpose, we state here a lemma that implies that every geometric r.v. is a sub-exponential r.v. The proof of the next lemma appears, e.g., in Hillar and Wibisono (2018) (Lemma ).
Lemma D.3.
The lemma above and Observation 4.2 allow us to deduce that the variables are sub-exponential in the first part of the following corollary (the case in which follows immediately from the definition). The second part of the corollary follows directly from the equivalence between Properties (2) and (1) in Definition 3.1.
Corollary D.4.
For each , the number of iterations a user recommended by stays within the system, , is sub-exponential with parameter . In addition, there exist constants for every such that
Proposition D.5.
For every ,
Proof.
First, notice that
where the first inequality is due to for every . Rearranging,
(5)
For each user, the realization of is less than or equal to the realization of for the same user (as users provide negative feedback in their last iteration); hence,
∎
We proceed by showing that returns of fixed-arm policies satisfy Property (1) from Definition 3.1. See D.2
Proof.
We use Property (1) from Definition 3.1 to derive that is also sub-exponential. This is true since the tails of satisfy that for all ,
where the first inequality follows since is a necessary condition for , and the last inequality follows from Corollary D.4. Along with Definition 3.1, we conclude that
(6)
Now, applying Minkowski’s inequality and then Jensen’s inequality (as are convex for every ) we get
Using Proposition D.5 and Inequality (6), we get
Hence is sub-exponential with parameter . ∎
See 4.3
Proof.
Throughout this proof, we will use the sub-exponential norm, , which is defined as
Let
We have that
(7)
Let be such that . From Lemma D.2 we conclude that
hence, .
Summing the geometric series, we get
(8)
In addition, notice that .
Next, from the Taylor series of we have
Combining the fact that and (7) with the above,
By applying and (8), we get
where the last inequality is due to .
Note that . Ultimately,
This concludes the proof of the lemma. ∎
Appendix E Two User Types and Two Categories: Proofs from Section 5
E.1 Planning when
See 5.3
Proof.
Let . We will prove that for every policy and every belief , we have that by a backward induction over .
For the base case, consider . We have that
as .
For the inductive step, assume that for every . We need to show that for every .
Indeed,
where the second-to-last equality is due to the induction hypothesis and the assumption that is a deterministic stationary policy. The proof is completed by noting that , since the sum is finite and has positive summands. ∎
E.2 Dominant Row (DR)
See 5.4
Proof.
We will show that for every iteration , no matter what the previous topic recommendations were, selecting topic rather than topic can only increase the value.
Let be a stationary policy such that . Changing it into a policy that is equivalent to for all iterations but iteration in which it recommends topic can only improve the value.
Since , , and this structure satisfies , we get that for every and for every ,
thus,
Hence for every time step , choosing topic instead of topic leads to an increased value of each summation element such that and . We deduce that
∎
E.3 Dominant Column (DC)
Before proving the main theorem (Theorem 5.5), we prove two auxiliary lemmas.
Lemma E.1.
For with a DC structure, if a policy is optimal then it recommends topic for all iterations such that
(9)
Proof.
First, assume by contradiction that there exists an optimal policy that recommends topic in iteration such that (9) holds.
Let be the policy that is equivalent to but recommends topic instead of topic in iteration . Since and recommend the same topic until iteration , along with the optimality of , we have
Expanding the above equation,
which is a contradiction to (9).
For the second part of the lemma, assume that condition (9) holds for some iteration and some optimal policy ; hence, and we have and . Exploiting this fact, we have that
implying
∎
An immediate consequence of Lemma E.1 is the following corollary.
Corollary E.2.
For any DC-structured and every belief , the optimal policy first recommends topic for at most
times, and then recommends topic permanently. In addition, since .
See 5.5
Proof.
Due to Theorem 5.3 and Corollary E.2, we can write the expected value of a policy as a function of when has a DC structure:
(10)
Equation (E.3) could be cast as for positive , negative and positive . Let be the continuous function such that .
We take the derivative w.r.t. to find the saddle point of :
which suggests the saddle point of is
Next, set . Since has a single saddle point and for every it holds that , to determine the optimal policy, one only needs to compare the value at the boundary points and at the closest integers to the saddle point . ∎
E.4 Dominant Diagonal (SD)
See 5.6
Proof.
We prove the following claim by a backward induction over the number of iterations remaining: For every it holds that for every policy and belief ,
First, we notice that when , the only possible policies are and . For , we prove the statement by contradiction. There are only two ways to select topics when :
Let and denote the number of times has played topic and till time , inclusive. Assume that the policy is optimal. In particular, it holds that and . We get
(11)
and
(12)
As both sides of (11) and (12) are positive, the product of their left-hand sides is smaller than the product of their right-hand sides, i.e.,
Dividing both sides by , we obtain
which is a contradiction as and .
Now, assume that the policy is optimal. In particular, it holds that and . We get
(13)
and
(14)
As both sides of (13) and (14) are positive, the product of their left-hand sides is smaller than the product of their right-hand sides,
Dividing both sides by , we obtain
which is again, a contradiction as and .
For , we prove the statement by contradiction. Suppose not, i.e., the optimal policy switches the recommended topic at least once. Let denote the time step where switches for the last time. We first consider the case where has switched from topic to topic at time . More specifically, we have
Consider another policy (that behaves the same as except at time step ) defined as
Let and denote the number of times has recommended topic and up to (and including) time . Since is optimal, the difference between the values of and is non-negative, i.e.,
(15)
where the difference is induced by the discrepancy of the two policies from time step to . Consider another policy (that behaves the same as except at time step ) defined as
Since is optimal, the difference between the values of and is non-negative, i.e.,
where the difference is induced by the discrepancy of the two policies from time step . Multiplying both sides by , we get
Using , and ,
hence,
(16)
Next, we construct a new policy that outperforms . We let be the policy defined below
The value difference between and (caused by the discrepancy of the two policies from time to ) is
where the first inequality is true because , and , therefore for every
along with (16). The second inequality follows from (15). Thus, we have successfully found another policy that differs from and achieves a higher value, which is a contradiction.
Next, we consider the case where has switched from topic to topic at time , i.e.,
Consider another policy (that behaves the same as except at time step ) defined as
Since is optimal, the difference between the values of and is non-negative, i.e.,
(17)
where the difference follows from the discrepancy between the two policies from time step to .
Consider another policy (that behaves the same as except at time step ) defined as
Since is optimal, the difference between the values of and is non-negative, i.e.,
where the difference is induced by the discrepancy of the two policies from time step . Multiplying both sides by ,
Using and , we get
hence,
(18)
Again, we will construct a new policy that outperforms . We let be the policy defined below
Now, the value difference between and (caused by the discrepancy of the two policies from time to ) is
where the first inequality is true because , and and (18), and the second from (17). Similarly, we have successfully found another policy that differs from and achieves a higher value, which is a contradiction.
We have covered all cases, so the inductive argument holds. This concludes the proof of the theorem. ∎
E.5 UCB-based regret bound
See 5.9
Proof.
Let be such that
Next, we have that
where is the return for the instance such that for every : and .
See 5.8
Proof.
Recall that , where we drop the dependence on the user index for readability. Put differently, for any it holds that
Using the same representation for and taking expectation, we get that
∎