
Recommending with Recommendations

Naveen Durvasula
University of California, Berkeley
Department of EECS
Berkeley, CA 94704
[email protected]
Franklyn Wang
Harvard University
Department of Mathematics
Cambridge, MA 02138
[email protected]
Scott Duke Kominers
Harvard Business School
Boston, MA 02138
[email protected]
The first two authors contributed equally
Abstract

Recommendation systems are a key modern application of machine learning, but they have the downside that they often draw upon sensitive user information in making their predictions. We show how to address this deficiency by basing a service’s recommendation engine upon recommendations from other existing services, which contain no sensitive information by nature. Specifically, we introduce a contextual multi-armed bandit recommendation framework where the agent has access to recommendations for other services. In our setting, the user’s (potentially sensitive) information belongs to a high-dimensional latent space, and the ideal recommendations for the source and target tasks (which are non-sensitive) are given by unknown linear transformations of the user information. So long as the tasks rely on similar segments of the user information, we can decompose the target recommendation problem into systematic components that can be derived from the source recommendations, and idiosyncratic components that are user-specific and cannot be derived from the source, but have significantly lower dimensionality. We propose an explore-then-refine approach to learning and utilizing this decomposition; then using ideas from perturbation theory and statistical concentration of measure, we prove our algorithm achieves regret comparable to a strong skyline that has full knowledge of the source and target transformations. We also consider a generalization of our algorithm to a model with many simultaneous targets and no source. Our methods obtain superior empirical results on synthetic benchmarks.

1 Introduction

Recommendation engines enable access to a range of products by helping consumers efficiently explore options for everything from restaurants to movies. But these systems typically rely on fine-grained user data—which both necessitates users’ sharing such data, and makes it difficult for platforms without existing data sources to enter the space. In this paper, we propose a new approach to recommendation system design that partially mitigates both of these challenges: basing recommendations on existing recommendations developed by other services. We call the new service’s recommendation task the target, and call the prior recommendations we rely on the source. Source recommendations are always available to the user, and are sometimes even public. We show that as long as the source and target recommendation tasks are sufficiently related, users can enable services to make high-quality recommendations for them without giving away any new personal data. Thus our method offers a middle ground between consumers’ current options of either giving platforms personal data directly or receiving low-quality recommendations: under our system, consumers can receive high-quality recommendations while only sharing existing recommendations based on data that other services already possess.

For intuition, consider the following example which we use throughout the paper for illustration: There is an existing grocery delivery service, which has become effective at recommending grocery items to its users. A new takeout delivery service is seeking to recommend restaurants to the same set of users. Users’ preferences over groceries and restaurant food rely on many of the same underlying user characteristics: for example, a user who loves pasta will buy it at grocery stores and patronize Italian restaurants; meanwhile, a user who is lactose intolerant will prefer not to buy milk products and prefer to avoid ice cream parlors. Thus, high-quality grocery recommendations provide valuable data upon which we could base restaurant recommendations.

From the user perspective, a restaurant recommendation system based on existing grocery recommendations would be able to provide high-quality restaurant suggestions immediately, rather than needing to learn the user’s preferences over time. And yet the user would not need to share any data with the restaurant recommendation system other than the grocery recommendations themselves—which by nature contain (weakly) less private user information than the data they were derived from. Crucially, the user’s food preferences might theoretically be driven by a health condition like lactose intolerance, but they could also just be a matter of taste (not liking milk). In this case, the grocery delivery platform’s recommendations reflect that preference information in a way that is useful for extrapolating restaurant recommendations, without directly revealing the potentially sensitive underlying source of the user’s preferences.

And while these food recommendation instances are a bit niche, there are many settings in which recommendations built on personal data may be available, even if we do not have access to the personal data itself. Search engines, for example, utilize user data to serve advertisements, and at least two notable projects—the Ad Observatory project [1] and the Citizen Browser [2]—have automatically logged social media advertisements that were served to users. Such advertisements could be used as contextual input for a variety of recommendation problems.

In our framework, we consider an agent that aims to make recommendations for a collection of users over a collection of target tasks. Each task takes the form of a standard stochastic linear bandit problem, where the reward parameter for user $u$ is given by a task-specific linear transformation of $u$’s high-dimensional personal data vector. We think of these transformations as converting the personal data into some lower-dimensional features that are relevant for the task at hand. These transformations and the personal data vectors are unknown to the agent. However, the agent is given access to the optimal arms (and the corresponding final rewards) for each of the users on a source task. No other “white-box” information about the source task (e.g., the reward history for the agent that solved the task) is known. That is, only the recommendations themselves—which are served to (and thus may be captured by) the user—in addition to the final reward (which may also be captured by asking the user to rate the recommendation) are accessible to the agent.

We describe algorithms that allow the agent to effectively make use of this information to reduce regret, so long as the target task transformations make use of segments of the high-dimensional personal data similar to those used by the source task. Notably, such a recommendation system functions without the user providing any personal information to the agent directly—and moreover, to the extent that the source task abstracts from sensitive data, all of the target task recommendations do as well.

Formally, we set up two related contextual bandit problems for the recommendation problem. When we have one target and many users, we can decompose each user’s reward parameter into idiosyncratic components, which are unique to each user, and systematic components, which are shared among all users. If the source and target tasks utilize similar segments of the users’ personal data, the idiosyncratic component is low-dimensional. As a result, if $0\leq\kappa\leq 1$ denotes the fraction of information shared between the subspaces, our regret bounds improve upon the LinUCB baseline [3] by a factor of $1-\kappa$.

In our second setting, we consider the general problem of having many targets and many users, but no sources. We show that if we perform what is essentially an alternating minimization algorithm—solving for a few tasks at first, using those solutions to solve for all the users, and then using those to solve for all the tasks—we obtain strong experimental performance. We call our collective methods Rec$^2$, for recommending with recommendations.

In both analyses, we make a crucial user growth assumption, which states that in several initial phases we have beta groups in which only a subset of the tasks and a subset of the users appear. This assumption enables us to estimate systematic parameters accurately, as there are fewer idiosyncratic parameters to handle. It also reflects how a web platform grows.

Section 2 introduces our formal model and compares it to prior work. In Section 3, we describe our approach for inferring the relationship between the source and target tasks when we have source recommendations for multiple users. We then develop and analyze an algorithm that uses this approach to reduce regret. In Section 4, we generalize our algorithm from Section 3 to our second setting where we have multiple target tasks. In Section 5, we demonstrate our methods in synthetic experiments. Finally, Section 6 concludes.

2 Preliminaries

2.1 Framework

In our setup, we aim to solve $T$ target tasks for $U$ users. We are also given the optimal arms (and corresponding expected reward) for each user on a source task.

Formally, we associate to each user $u\in[U]$ a vector $\bm{\theta}^{u}\in\mathbb{R}^{n}$ denoting all of the user’s data, some of which is potentially sensitive. We think of this space as being extremely high-dimensional—i.e., $n\gg 1$—and encapsulating all of the salient attributes about the user. We associate to the source task a matrix $A\in\mathbb{R}^{a\times n}$ and similarly associate to each target task $t\in[T]$ a matrix $B_{t}\in\mathbb{R}^{b\times n}$. The parameters $a,b\leq n$ denote the dimensionality of the source and target tasks, respectively. (The assumption that all target tasks have the same dimensionality is made for expositional simplicity and can be relaxed.) We assume that the task matrices $A$ and $B_{t}$ do not contain any redundant information, and thus have full row rank. These task matrices transform a user’s personal data into the reward parameter for the task. For example, if a takeout delivery service (given by target task $t$) aims to make recommendations for some user $u$, the task matrix $B_{t}$ transforms $u$’s personal data into the space of possible restaurants in such a way that the vector $B_{t}\bm{\theta}^{u}$ essentially denotes $u$’s ideal restaurant choice.

[Figure 1 diagram: a user’s data $\bm{\theta}^{u}\in\mathbb{R}^{n}$ passes through a low rank model in $\mathbb{R}^{r}$ (e.g., food preferences) to produce $A\bm{\theta}^{u}\in\operatorname{Col}(A)$ (e.g., grocery recommendations) and $B\bm{\theta}^{u}\in\operatorname{Col}(B)$ (e.g., restaurant recommendations).]
Figure 1: Visualizing source and target tasks. In our model, users’ personal data $\bm{\theta}^{u}$ lie in a latent space $\mathbb{R}^{n}$. For a given source matrix $A$ and target matrix $B$, the image $A\bm{\theta}^{u}\in\mathbb{R}^{a}$ gives the reward parameter for the source task, and $B\bm{\theta}^{u}\in\mathbb{R}^{b}$ gives the reward parameter for the target task. Thus, $\operatorname{Row}(A)\subseteq\mathbb{R}^{n}$ and $\operatorname{Row}(B)\subseteq\mathbb{R}^{n}$ can be thought of as the corresponding partitions of the full personal data that are used by the source and target tasks respectively. If the source and target tasks make use of similar segments of the users’ personal data—that is, if $\operatorname{Row}(A)\cap\operatorname{Row}(B)$ has high dimension—then we may construct a low rank model (boxed) to represent both the source and target tasks. In this case, having knowledge of $A\bm{\theta}^{u}$ may help in recovering $B\bm{\theta}^{u}$.

We now consider a multi-armed bandit problem with finite horizon $H$. At each time $h\in[H]$, the agent observes a context consisting of a user-task pair $(u_{h},t_{h})\in[U]\times[T]$, and subsequently chooses an arm $\mathbf{x}$ from a finite and time-dependent decision set $\mathcal{A}_{h}$. We then observe the reward

X_{h}=\left\langle\mathbf{x},B_{t_{h}}\bm{\theta}^{u_{h}}\right\rangle+\eta_{h},\qquad(1)

where $\{\eta_{h}\}_{h=1}^{H}$ are i.i.d., mean-zero, $1$-subgaussian random variables. We assume that the decision sets are well-conditioned, in the sense that $\left\lVert B_{t}\bm{\theta}^{u}\right\rVert\leq 1$ for all $t,u$. We also assume $\max_{h\in[H],\mathbf{x}\in\mathcal{A}_{h}}\left\lVert\mathbf{x}\right\rVert_{2}\leq 1$, so each arm has $2$-norm at most $1$, and that $\left\lVert\bm{\theta}^{u}\right\rVert\leq 1$ for all $u$.

Crucially, in our setup, the agent is not given access to any of the personal data vectors $\bm{\theta}^{u}$, or to any of the source/task matrices $A$ or $B_{t}$. The agent is, however, given access to each of the optimal arms $\bm{\alpha}^{u}$ and corresponding rewards $\beta^{u}$ for an agent that solved the source task. These quantities can be used to approximate the source reward parameter $A\bm{\theta}^{u}$ under reasonable assumptions on the source agent’s arm space. Namely, we assume that the source agent’s arm space is a net of some ball of radius $S$, whence we may write $\bm{\alpha}^{u}\approx S\frac{A\bm{\theta}^{u}}{\left\lVert A\bm{\theta}^{u}\right\rVert_{2}}$. The corresponding expected reward, as per (1), is then approximately given by $\beta^{u}\approx S\left\lVert A\bm{\theta}^{u}\right\rVert_{2}$. We may then approximate $A\bm{\theta}^{u}\approx\frac{\beta^{u}\bm{\alpha}^{u}}{\left\lVert\bm{\alpha}^{u}\right\rVert_{2}^{2}}$; in the remainder of this paper we assume, for simplicity, that this approximation is exact.
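To make this concrete, here is a minimal numpy sketch (with hypothetical dimensions and a simulated source task, since the true $A$ and $\bm{\theta}^{u}$ are never seen by the agent) of recovering $A\bm{\theta}^{u}$ from the observed optimal source arm and its reward:

```python
import numpy as np

rng = np.random.default_rng(0)
a, n, S = 20, 1000, 1.0                    # hypothetical dimensions and source arm radius

A = rng.normal(size=(a, n)) / np.sqrt(n)   # unknown source matrix (simulated here)
theta = rng.normal(size=n)
theta /= np.linalg.norm(theta)             # unknown user data vector

src_param = A @ theta                                 # A theta^u (never seen by the agent)
alpha = S * src_param / np.linalg.norm(src_param)     # optimal source arm alpha^u
beta = S * np.linalg.norm(src_param)                  # its expected reward beta^u

# The agent observes only (alpha, beta) and reconstructs A theta^u:
src_param_hat = beta * alpha / np.linalg.norm(alpha) ** 2
assert np.allclose(src_param_hat, src_param)
```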

As usual, the objective of the agent is to maximize the expected cumulative reward $\sum_{h=1}^{H}\mathbb{E}\left[X_{h}\right]$, or equivalently to minimize the expected regret

R_{H}:=\mathbb{E}\left[\sum_{t\in[T]}\sum_{u\in[U]}\sum_{h:(u_{h},t_{h})=(u,t)}\max_{\mathbf{x}\in\mathcal{A}_{h}}\left\langle\mathbf{x},B_{t}\bm{\theta}^{u}\right\rangle-\sum_{h=1}^{H}X_{h}\right].\qquad(2)

Our central contribution in this paper is demonstrating how knowledge of the optimal arms on the source task for each of the users may be used to significantly reduce regret relative to the naive approach of solving each of the $U\cdot T$ contexts separately. Using the popular LinUCB algorithm [4], we obtain the following baseline regret bound. (All proofs are in Appendix A.)

Theorem 2.1 (LinUCB Bound).

A LinUCB agent that naively solves each of the $U\cdot T$ tasks separately has regret bounded by

R_{H}\leq O\left(b\sqrt{HTU}\log\left(\frac{H}{TU}\right)\right).

An interpretation of Theorem 2.1 is that, on average, each user receives an arm with reward at most $R_{H}/H=b\sqrt{\frac{TU}{H}}\log\left(\frac{H}{TU}\right)$ worse than the best, meaning that the problem is harder when there are more tasks, more users, and higher dimension, but becomes easier with time.
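For reference, here is a minimal sketch of this naive baseline: one standard ridge-regression LinUCB learner kept separately for each (user, task) context. The class, the confidence width `alpha`, and the regularizer `lam` are illustrative choices, not the exact implementation used in our experiments.

```python
import numpy as np

class LinUCB:
    """Minimal LinUCB learner for a single (user, task) context."""
    def __init__(self, dim, lam=1.0, alpha=1.0):
        self.V = lam * np.eye(dim)   # regularized Gram matrix of pulled arms
        self.b = np.zeros(dim)       # running sum of reward-weighted arms
        self.alpha = alpha           # width of the confidence bonus

    def choose(self, arms):          # arms: (num_arms, dim) array
        V_inv = np.linalg.inv(self.V)
        theta_hat = V_inv @ self.b
        ucb = arms @ theta_hat + self.alpha * np.sqrt(
            np.einsum('ij,jk,ik->i', arms, V_inv, arms))
        return int(np.argmax(ucb))

    def update(self, arm, reward):
        self.V += np.outer(arm, arm)
        self.b += reward * arm

# Baseline: a dictionary of independent learners, keyed by (user, task).
learners = {}
def baseline_step(user, task, arms, reward_fn, dim):
    agent = learners.setdefault((user, task), LinUCB(dim))
    i = agent.choose(arms)
    r = reward_fn(arms[i])
    agent.update(arms[i], r)
    return r
```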

2.2 Related Work

Recommendation Systems

Recommendation systems—which recommend items to users, given data on users, items, and/or their pairwise interactions—have been one of the most impactful applications of modern machine learning [5]. Moreover, the study of recommendation systems has led to a number of foundational methodological breakthroughs [6, 7].

Contextual Bandits

Our framework is an example of a setting with contextual bandits [8], i.e., one in which bandits are allowed to draw upon information beyond the arms themselves—the context—to help make their decisions. Moreover, our framework follows that of a linear bandit. However, our framework differs from the standard model of contextual linear bandits [9, Equation 19.1], which represents the rewards as $X_{h}=\langle\psi(c,\mathbf{x}),\bm{\theta}^{\ast}\rangle$, where $\psi$ is a known feature map and $\bm{\theta}^{\ast}$ is an unknown reward parameter. While our question can be fit into that standard framework by using

\psi((u,t),\mathbf{x})=\begin{bmatrix}0&\ldots&\underbrace{\mathbf{x}^{\top}}_{(u,t)^{\text{th}}\text{ vector}}&\ldots&0\end{bmatrix}^{\top}\quad\text{and}\quad\bm{\theta}^{\ast}=\begin{bmatrix}(B_{1}\bm{\theta}^{1})^{\top}&\ldots&(B_{t}\bm{\theta}^{u})^{\top}&\ldots&(B_{T}\bm{\theta}^{U})^{\top}\end{bmatrix}^{\top},

such a setup does not allow us to easily incorporate our information about $A\bm{\theta}^{u}$ into the problem, as it merely views $\bm{\theta}^{\ast}$ as an unrolling of a vector. In particular, the direct bounds obtained through the standard contextual bandits formulation (without further assumptions) are weak relative to ours, as $\bm{\theta}^{\ast}$ has dimension $bTU$.
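For concreteness, a small sketch of this unrolling under one hypothetical block-ordering convention; it makes plain that $\bm{\theta}^{\ast}$ has $bTU$ coordinates:

```python
import numpy as np

def psi(u, t, x, U, T):
    """Embed arm x into the (u, t)-th block of a length b*T*U feature vector."""
    b = len(x)
    out = np.zeros(b * T * U)
    block = (u * T + t) * b          # one possible block-ordering convention
    out[block:block + b] = x
    return out

# theta_star is the matching unrolling of all the B_t theta^u blocks, so that
# <psi(u, t, x, U, T), theta_star> == <x, B_t theta^u>.
```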

Explore and Refine Approaches

Our algorithm in Section 3 is methodologically similar to the algorithm of [10], which also explores to obtain a subspace containing the solution, and imposes a soft penalty to make the estimated reward parameter close to that subspace. Indeed, our approach draws upon their LowOFUL algorithm. However, the problem we seek to solve is quite different from that of [10]—we have either multiple tasks or side information in each of our models.

Cross-Domain Recommendations

Cross-domain recommendation (see [11] for a survey) is an area in which data from other recommendation domains is used to inform recommendations in a new domain. Here we highlight the works in cross-domain recommendation that we find particularly relevant.

[12] makes a low-rank assumption on the item-user-domain tensor, which is similar to ours but differs in that our assumption places no constraints on the number of items, and hence no rank constraints along that axis. Furthermore, that work recommends items, whereas our work chooses from a set of arms—the featurizations in our model are provided ahead of time, whereas in [12] they are learnt.

[13] uses a similar cross-domain recommendation idea, but relies on explicit relationships between items, as hinted by the title (e.g., aficionados of the book "The Devil Wears Prada" should also enjoy the movie "The Devil Wears Prada"), which our work does not depend on. Our work makes purely low-rank assumptions, making it more general than [13].

Meta-Learning

In addition, our setup resembles the standard meta-learning [14, 15] setup, as we have several interrelated tasks (namely, each (service, user) pair can be viewed as a separate task). In [16], the authors address a meta-learning task by assuming that the task matrices come from a simple distribution, which is then learnt. One way to think of that work is that it corresponds to ours in the case where $U=1$. As such, it cannot share parameters across users, and thus must make distributional assumptions on the tasks to get nontrivial guarantees.

Transfer Learning and Domain Adaptation

Our setup also resembles that of transfer learning, as we use parameters from one learned task directly in another task, as well as domain adaptation, because we use similar models across two different settings. Transfer learning has commonly been used in the bandit setting [17, 18, 19, 20].

Most relevant to our work, the model we present in Section 3 is similar to the setting of [21], in which the target features are a linear transformation away from the source features, and the bandit has access to historical trajectories of the source task. One way to interpret this in our model is that the work assumes knowledge of the full set of features $\bm{\theta}$ (as opposed to merely a projection), and thus strives to solve for $B$. It then solves for $B$ by applying linear regression to align the contexts, obtaining a nontrivial regret guarantee. However, the result is not directly comparable, because that model contains substantially more information than ours: namely, access to the full $\bm{\theta}$ instead of simply a view $A\bm{\theta}$, as well as past trajectories on a bandit problem.

The paper [22] uses domain adaptation to approach cross-domain recommendation and shows that it can achieve superior performance to certain baselines. However, it is not clear how to implement this in a bandit-based framework.

User Similarity for Recommendation

In works like [23, 24, 25, 26], the authors address recommendation problems by using information about similar users, such as users connected in a social network; they assume that people who are connected on social media have similar parameter vectors. However, this assumption may potentially compromise privacy, as it presumes access to a user’s contacts. By using many tasks (these works effectively have $T=1$) and many users, we can obtain similar results without using social data.

3 One Target, One Source, Many Users

In this section, we consider the case where we have $T=1$ target task. We show that source recommendations can be used to generate target recommendations in two phases. We first show how an agent with full knowledge of the source and target matrices can reduce regret by using the source recommendations to reduce the dimensionality of the target recommendation problem. We then show that an agent without knowledge of the source and target matrices can still reduce regret under a realistic assumption on the ordering of the contexts, by learning the underlying relationship between the source and target matrices from observed data. One quantity that relates to the success of our approach is the common information, which we define as

\kappa:=\frac{\dim\left(\operatorname{Row}(A)\cap\operatorname{Row}(B)\right)}{b}.\qquad(3)

This quantity can be thought of as the fraction of information about the underlying reward parameter $B\bm{\theta}$ that can possibly be explained given the source information $A\bm{\theta}$. Referring back to Figure 1, we can see that if $\kappa\approx 1$, then the target task uses segments of the user’s information that are almost all contained in the segments used by the source task – a situation that arises often in practice. Stated differently, this implies that the source and target matrices $A$ and $B$ have a low rank structure. To see this, observe that the rank of the block matrix $r:=\operatorname{rank}\left(\begin{bmatrix}A\\ B\end{bmatrix}\right)$ is given by

r=a+b-\dim(\operatorname{Row}(A)\cap\operatorname{Row}(B))=a+(1-\kappa)b.

In our running example, the matrix AA represents a grocery recommendation task and the matrix BB represents a restaurant recommendation task. Our algorithm finds the minimal set of information needed to predict both tasks, and then predicts the user’s restaurant preferences by looking at all possibilities that are consistent with the user’s grocery preferences.

Skyline Regret Bound

We now show how an agent that was given the source and target matrices $A$ and $B$ may use this structure to reduce regret. By the above, we may write the compact singular value decomposition of the stacked matrix as $\begin{bmatrix}A\\ B\end{bmatrix}=\mathcal{U}\Sigma\mathcal{V}^{T}$, where $\mathcal{U}\in\mathbb{R}^{(a+b)\times r}$, $\mathcal{V}\in\mathbb{R}^{n\times r}$, and $\Sigma\in\mathbb{R}^{r\times r}$. Projecting onto the principal components, we can create a low-dimensional generative model

\begin{bmatrix}A\bm{\theta}^{u}\\ B\bm{\theta}^{u}\end{bmatrix}=\mathcal{U}\left(\Sigma\mathcal{V}^{T}\bm{\theta}^{u}\right):=\mathcal{U}\bm{\phi}^{u},

where $\bm{\phi}^{u}\in\mathbb{R}^{r}$. Letting $\pi_{A}\in\mathbb{R}^{a\times(a+b)}$ denote the orthogonal projector onto the first $a$ components and $\pi_{B}\in\mathbb{R}^{b\times(a+b)}$ denote the orthogonal projector onto the last $b$ components, we have that $\pi_{A}\mathcal{U}\bm{\phi}^{u}=A\bm{\theta}^{u}$, whence we may write $\bm{\phi}^{u}\in\left(\pi_{A}\mathcal{U}\right)^{+}A\bm{\theta}^{u}+\operatorname{Null}(\pi_{A}\mathcal{U})$. Here, $\left(\pi_{A}\mathcal{U}\right)^{+}$ denotes the Moore-Penrose pseudo-inverse. Substituting in, we may write

B\bm{\theta}^{u}=\pi_{B}\mathcal{U}\bm{\phi}^{u}\in\pi_{B}\mathcal{U}\left(\pi_{A}\mathcal{U}\right)^{+}A\bm{\theta}^{u}+\pi_{B}\mathcal{U}\operatorname{Null}(\pi_{A}\mathcal{U}).\qquad(4)

We call $D_{T}:=\pi_{B}\mathcal{U}\left(\pi_{A}\mathcal{U}\right)^{+}\in\mathbb{R}^{b\times a}$ the transformer, as it transforms the known reward parameter from the source domain into the target domain. From the above, we have that the residual $B\bm{\theta}^{u}-D_{T}A\bm{\theta}^{u}$ lies in $\pi_{B}\mathcal{U}\operatorname{Null}(\pi_{A}\mathcal{U})$, which is a space of dimension $(1-\kappa)b$. Choosing an orthogonal basis for this space, it follows that we may write

B\bm{\theta}^{u}=\underbrace{D_{T}A\bm{\theta}^{u}}_{\text{Systematic Component}}+\underbrace{D_{G}\bm{\psi}^{u}}_{\text{Idiosyncratic Component}},\qquad(5)

where $D_{G}\in\mathbb{R}^{b\times(1-\kappa)b}$ has orthogonal columns given by the basis elements, and $\bm{\psi}^{u}\in\mathbb{R}^{(1-\kappa)b}$ gives the corresponding weights. We call $D_{G}$ the generator, as it generates the second term in the above expression from a lower dimensional vector $\bm{\psi}^{u}$. Together, $D_{T}$ and $D_{G}$ give a useful decomposition of the target reward parameter into a systematic component that can be inferred from the source reward parameter, and an idiosyncratic component that must be learned independently for each user. An agent with full knowledge of the source and task matrices can compute $D_{G}$ and $D_{T}$ as given above, and thus need only solve for each of the $\bm{\psi}^{u}$, which lie in a $(1-\kappa)b$-dimensional subspace. As a result, it attains a lower skyline regret bound.
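The following numpy sketch illustrates the skyline decomposition (5) on synthetic $A$ and $B$ generated from a shared low-rank latent space; all dimensions and the generation scheme are hypothetical, chosen only so that the stacked matrix has the stated rank $r=a+(1-\kappa)b$:

```python
import numpy as np

rng = np.random.default_rng(1)
a, b, n, kappa = 4, 4, 50, 0.75
d = a + int(round((1 - kappa) * b))          # rank of the stacked matrix [A; B]
L = rng.normal(size=(n, d))                  # shared low-rank latent directions
A = rng.normal(size=(a, d)) @ L.T            # source matrix (rank a)
B = rng.normal(size=(b, d)) @ L.T            # target matrix (rank b)

# Compact SVD of the stacked matrix [A; B] = U Sigma V^T.
U_full, s, _ = np.linalg.svd(np.vstack([A, B]), full_matrices=False)
r = int((s > 1e-8 * s[0]).sum())
U = U_full[:, :r]
U_A, U_B = U[:a], U[a:]                      # pi_A U and pi_B U

D_T = U_B @ np.linalg.pinv(U_A)              # transformer
_, sA, VAt = np.linalg.svd(U_A)
null_A = VAt[(sA > 1e-8 * sA[0]).sum():].T   # basis of Null(pi_A U)
D_G, _ = np.linalg.qr(U_B @ null_A)          # generator: orthonormal basis of pi_B U Null(pi_A U)

theta = rng.normal(size=n)
theta /= np.linalg.norm(theta)
target = B @ theta
systematic = D_T @ (A @ theta)
psi_u = D_G.T @ (target - systematic)        # idiosyncratic weights
assert np.allclose(systematic + D_G @ psi_u, target)
```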

Theorem 3.1 (Skyline Regret).

An agent with full knowledge of the source and target matrices $A$ and $B$ may attain regret

R_{H}\leq O\left((1-\kappa)b\sqrt{UH}\log\left(\frac{H}{U}\right)\right).

Comparing to Theorem 2.1 (with $T=1$), we see that we obtain an identical bound, but with a factor of $1-\kappa$ in front. Thus, as alluded to earlier, the improvement depends on how large $\kappa$ is. If $\kappa$ is close to one, then we obtain a significant improvement, corresponding to being able to use a substantial amount of information from the source recommendations. This result only applies when the agent is given the source and target matrices.

Learning from Observed Data

We now return to our original setup, where the source and target matrices are unknown. We show how the transformer $D_{T}$ and the generator $D_{G}$ may be learned from observed data under a realistic assumption on the ordering of the contexts. We assume that some subset of the users $U_{0}$ belongs to a beta group, and are the only users requesting recommendations for the first $H_{0}$ iterations. After the first $H_{0}$ iterations, all users are allowed to request recommendations for the target task. This assumption makes learning the transformer and generator easier, as it reduces the number of idiosyncratic parameters we must fully learn in order to understand the relationship between the two tasks.

Our approach is briefly summarized as follows. In the first $H_{0}$ iterations, we use some exploration policy that independently pulls arms for each user. In practice, this policy could be adaptive, like the baseline LinUCB algorithm. To keep our analysis tractable, we instead assume that the exploration policy is oblivious. That is, we assume that arms pulled in the first $H_{0}$ iterations do not depend on any observed rewards, and that the arms are randomly drawn from $\frac{1}{\sqrt{b}}\mathcal{N}(0,I_{b})$. We further assume that the same oblivious policy is used across users. This is essentially equivalent to using different oblivious strategies for different users, as we have no prior information on the reward parameters. In our experiments, we show that this obliviousness assumption does not appear necessary to achieve good performance. We find that adaptive policies (namely LinUCB) are just as effective, if not more so, at learning the model relating the two tasks, while achieving performance equivalent to the baseline during the exploration phase.

Immediately after iteration $H_{0}$, we run least-squares for each user on the arm-reward history we have observed thus far to obtain estimates $\widehat{B\bm{\theta}^{u}}$ for each user $u$. Observe from Equation 4 that the transformer $D_{T}$ is determined by the left singular vectors $\mathcal{U}$. Since $\operatorname{Null}(\pi_{A}\mathcal{U})$ is also determined by $\mathcal{U}$, the generator $D_{G}$ may likewise be computed given $\mathcal{U}$. We compute the singular value decomposition

\begin{bmatrix}A\bm{\theta}^{1}&\ldots&A\bm{\theta}^{U_{0}}\\ \widehat{B\bm{\theta}^{1}}&\ldots&\widehat{B\bm{\theta}^{U_{0}}}\end{bmatrix}=\begin{bmatrix}\widehat{\mathcal{U}}&\widetilde{\mathcal{U}}\end{bmatrix}\begin{bmatrix}\Sigma_{r}&0\\ 0&\Sigma_{\kappa b}\end{bmatrix}\begin{bmatrix}\widehat{\mathcal{V}}^{T}\\ \widetilde{\mathcal{V}}^{T}\end{bmatrix};\qquad(6)

we then use $\widehat{\mathcal{U}}$, which corresponds to the first $r$ columns of the decomposition, as a proxy for $\mathcal{U}$.
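A sketch of this estimation step, under the assumptions above: per-user least squares on the exploration history yields $\widehat{B\bm{\theta}^{u}}$, and the truncated SVD of the stacked data matrix yields the proxy $\widehat{\mathcal{U}}$ (from which $\widehat{D}_{T}$ and $\widehat{D}_{G}$ can then be computed exactly as in the skyline construction above). The ridge term is an illustrative stabilizer:

```python
import numpy as np

def estimate_U_hat(src_params, arm_hist, reward_hist, r, ridge=1e-6):
    """Estimate the shared left singular subspace U from beta-group data.

    src_params:  (a, U0) matrix whose columns are A theta^u (assumed known exactly)
    arm_hist:    dict u -> (H_u, b) array of arms pulled during exploration
    reward_hist: dict u -> (H_u,)  array of observed rewards
    r:           assumed rank of the stacked matrix [A; B]
    """
    a, U0 = src_params.shape
    b = next(iter(arm_hist.values())).shape[1]
    B_theta_hat = np.zeros((b, U0))
    for u in range(U0):
        X, y = arm_hist[u], reward_hist[u]
        # Per-user least squares (ridge-regularized for stability).
        B_theta_hat[:, u] = np.linalg.solve(X.T @ X + ridge * np.eye(b), X.T @ y)
    stacked = np.vstack([src_params, B_theta_hat])     # Equation (6) data matrix
    U_full, _, _ = np.linalg.svd(stacked, full_matrices=False)
    return U_full[:, :r]                               # proxy for U
```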

Note that in the skyline analysis, we showed that $B\bm{\theta}^{u}$ lies in the subspace $S_{u}:=\left\{\pi_{B}\mathcal{U}\bm{\phi}\mid\pi_{A}\mathcal{U}\bm{\phi}=A\bm{\theta}^{u},\ \bm{\phi}\in\mathbb{R}^{r}\right\}$.

By substituting $\widehat{\mathcal{U}}$ for $\mathcal{U}$, and subsequently computing $\widehat{D}_{T}$ and $\widehat{D}_{G}$ as in Equation 4, we obtain the approximation to $S_{u}$ given by

\widehat{S}_{u}:=\left\{\pi_{B}\widehat{\mathcal{U}}\bm{\phi}\mid\pi_{A}\widehat{\mathcal{U}}\bm{\phi}=A\bm{\theta}^{u},\ \bm{\phi}\in\mathbb{R}^{r}\right\}\subset\operatorname{span}\left(\widehat{D}_{T}A\bm{\theta}^{u},\widehat{D}_{G}\right)=:E_{u}.

Although $\widehat{S}_{u}$ is an affine space, it is a subset of the linear space $E_{u}$. If $B\bm{\theta}^{u}$ is close to the space $E_{u}$ (i.e., if our approximation is close), then in the coordinates of some orthogonal basis $W:=\begin{bmatrix}W_{\parallel}&W_{\perp}\end{bmatrix}$ for $\mathbb{R}^{b}$, where $W_{\parallel}$ is a basis for $E_{u}$, $B\bm{\theta}^{u}$ is approximately sparse (i.e., the components of $B\bm{\theta}^{u}$ that lie outside of $E_{u}$ have small magnitude). We invoke an algorithm known as LowOFUL [10], which works in precisely this setting, to select arms for all iterations $h\in[H_{0}+1,H]$. We give a full description of our algorithm in Algorithm 1. More concretely, the regret for our approach may be decomposed as

R_{H}=\mathbb{E}\left[\sum_{u\in[U]}\sum_{h\leq H_{0}:u_{h}=u}\left(\max_{\mathbf{x}\in\mathcal{A}_{h}}\left\langle\mathbf{x},B\bm{\theta}^{u}\right\rangle-X_{h}\right)\right]+\sum_{u\in[U]}\mathbb{E}\left[\sum_{h>H_{0}:u_{h}=u}\left(\max_{\mathbf{x}\in W^{T}\mathcal{A}_{h}}\left\langle\mathbf{x},W^{T}B\bm{\theta}^{u}\right\rangle-X_{h}\right)\right]=:\mathrm{Explore}_{H_{0}}+\sum_{u\in[U]}\mathrm{LowOFUL}^{u}_{H_{u}},

where $H_{u}$ is the number of times user $u$ requests a recommendation. Our approach achieves regret comparable to the skyline agent that has knowledge of the source and target matrices. Each term $\mathrm{LowOFUL}^{u}_{H_{u}}$ is bounded in terms of the magnitude of the “almost sparse” components $L^{u}:=\left\lVert(W^{T}B\bm{\theta}^{u})_{(1-\kappa)b+1:b}\right\rVert_{2}$. With methods from perturbation theory, we show:

Lemma 3.2.

With probability $1-\delta$, we have that for all $u$,

L^{u}\leq\sqrt{2}\left(1+\left(\sigma_{\mathrm{min}}(\pi_{A}\mathcal{U})-\sqrt{2}Z\right)^{-1}\right)Z,

where $q=\frac{\sqrt{bU_{0}(b+\log(2/\delta))}}{\sqrt{H_{0}/U_{0}}-C(\sqrt{b}+\sqrt{\log(4/\delta)})}\approx\frac{bU_{0}}{\sqrt{H_{0}}-C\sqrt{U_{0}b}}$; $Z=\frac{q}{\sigma_{\mathrm{min}}\left(M\right)-q}$; $M=\begin{bmatrix}A\bm{\theta}^{1}&\ldots&A\bm{\theta}^{U_{0}}\\ B\bm{\theta}^{1}&\ldots&B\bm{\theta}^{U_{0}}\end{bmatrix}$; $C$ is some absolute constant; and we assume that $\sigma_{\mathrm{min}}\left(M\right)>q$.

We now apply the original LowOFUL [10, Corollary 1] regret bound to show:

Theorem 3.3.

With an oblivious exploration policy that draws identical arms for each user from $\frac{1}{\sqrt{b}}\mathcal{N}(0,I_{b})$, and assuming that each user appears equally many times, with probability at least $1-\left(U\delta^{\prime}+\delta\right)$ we have

\sum_{u\in[U]}\mathrm{LowOFUL}^{u}_{H_{u}}\leq O\biggl(\underbrace{(1-\kappa)b\sqrt{UH}\log\left(\frac{H}{U}\right)\sqrt{\log(1/\delta^{\prime})}}_{\text{Skyline Regret (adjusted by $\delta^{\prime}$)}}+\underbrace{H\left(1+\left(\sigma_{\mathrm{min}}(\pi_{A}\mathcal{U})-\sqrt{2}Z\right)^{-1}\right)Z\log\left(\frac{H}{U}\right)}_{\text{Approximation Error from Using $\widehat{S}_{u}$}}\biggr).

Observe that the first term in this bound essentially matches the skyline bound (Theorem 3.1), with regret scaled by $1-\kappa$. We pick up an additional penalty term, given by the second summand, that comes from the approximation error in $\mathcal{U}$. Looking more closely at the term $Z$, our bound predicts that we incur more regret when the number of users $U_{0}$ in the beta group increases, as we have to learn more idiosyncratic components (justifying the user growth assumption), and less regret as the time $H_{0}$ we have to learn the idiosyncratic components increases. Finally, when $H_{0}=cH$ for some constant $c$ and $U\rightarrow\infty$ while holding $U_{0}$ fixed, the second term is $o(1)$ times the first term.

4 No Sources, Many Targets, Many Users

The previous section showed that we can obtain learning guarantees from a source. In this section, we address the case where we have no sources but many targets – the reward at time $h$ is given by $X_{h}=\left\langle\mathbf{x},B_{t_{h}}\bm{\theta}^{u_{h}}\right\rangle+\eta_{h}$, where $t_{h}$ and $u_{h}$ index the task and user respectively. As before, we compare to a LinUCB baseline (Theorem 2.1) that uses one bandit for each of the $U\cdot T$ contexts. In the previous section, we found that source tasks with low rank $r$ (or equivalently high common information $\kappa$) were effective in producing recommendations for the target task. We similarly assume in this setting that there exist user-specific vectors $\bm{\phi}^{u}\in\mathbb{R}^{r}$ of dimension $r\ll n$, along with matrices $\mathcal{U}_{1},\dots,\mathcal{U}_{T}$, such that $B_{t}\bm{\theta}^{u}=\mathcal{U}_{t}\bm{\phi}^{u}$ for all $u$ and $t$.

In our running example, the matrices $B_{t}$ represent many different food delivery apps, perhaps focused on different food sources. Our low-rank assumption posits that there is substantial redundancy between the different tasks (in the example: user-specific food preferences $\bm{\phi}^{u}$ are consistent across apps). One benefit is that after many users have been solved for, solving new tasks is very easy, because we have data on a large set of users.

Skyline Regret Bound

In the skyline, we assume full access to the task matrices – we show that in this case, we obtain a reduction to $U$ separate dimension-$r$ bandit problems, one for each user.

Lemma 4.1 (Skyline Bound).

With access to all task matrices, we obtain a regret of

R_{H}\leq O\left(r\sqrt{UH}\log\left(\frac{H}{U}\right)\right).

An Algorithm based on User and Task Growth

Our algorithm in the previous section required the existence of a beta group of users to learn the relationship between the source and target. In this multitask setting, we additionally require a beta group of tasks. We first apply an exploration strategy as in Section 3 to make estimates $\widehat{B_{t}\bm{\theta}^{u}}$ for a small number of users $U_{0}$ and tasks $T_{0}$. We subsequently compute a singular value decomposition, as in Equation 6, to produce estimates $\widehat{\mathcal{U}_{t}}$ for $t\in[T_{0}]$. Next, we let all $U$ users request recommendations on the $T_{0}$ tasks, and use the $\widehat{\mathcal{U}_{t}}$ to make estimates of the low rank latent parameters $\widehat{\bm{\phi}^{u}}$ for all $u\in[U]$. We accomplish this by running least-squares on the arm-reward history, noting that for any arm $\mathbf{x}$,

\left\langle\mathbf{x},B_{t}\bm{\theta}^{u}\right\rangle=\left\langle\mathbf{x},\mathcal{U}_{t}\bm{\phi}^{u}\right\rangle=\left\langle\left(\mathcal{U}_{t}\right)^{T}\mathbf{x},\bm{\phi}^{u}\right\rangle\approx\left\langle\left(\widehat{\mathcal{U}_{t}}\right)^{T}\mathbf{x},\bm{\phi}^{u}\right\rangle.

In the last phase, we let all $U$ users request recommendations on all $T$ tasks, and again use these approximations, in conjunction with the reward history, to learn $\widehat{\mathcal{U}_{t}}$ for all $t\in[T]$ by least squares:

\left\langle\mathbf{x},B_{t}\bm{\theta}^{u}\right\rangle=\left\langle\mathbf{x},\mathcal{U}_{t}\bm{\phi}^{u}\right\rangle=\left\langle\mathbf{x}\left(\bm{\phi}^{u}\right)^{\top},\operatorname{vec}\left(\mathcal{U}_{t}\right)\right\rangle\approx\left\langle\mathbf{x}\left(\widehat{\bm{\phi}^{u}}\right)^{\top},\operatorname{vec}\left(\mathcal{U}_{t}\right)\right\rangle.

Observe that the sample complexity of learning the $\widehat{\mathcal{U}_{t}}$ is greatly reduced, as we make use of all estimates $\widehat{\bm{\phi}^{u}}$ to estimate each $\widehat{\mathcal{U}_{t}}$. That is, no matter which user $u_{h}$ appears for a given task context $t_{h}$, we are able to collect a sample that helps in approximating $\widehat{\mathcal{U}_{t_{h}}}$, since we have access to a quality estimate of the low rank latent parameter $\widehat{\bm{\phi}^{u_{h}}}$ for the user $u_{h}$. Although we do not give a theoretical analysis of this algorithm (for barriers described in Appendix D), we show in synthetic experiments that our approach offers a clear advantage over the baseline approach of solving each of the $U\cdot T$ contexts separately. We give a full description of the algorithm in Algorithm 2.
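A sketch of the two least-squares updates used in the later phases of Algorithm 2, following the identities above; the data structures (dictionaries of per-user and per-task histories) and the ridge regularizer are illustrative:

```python
import numpy as np

def fit_user_params(histories, U_hat, r, ridge=1e-6):
    """Second phase: estimate each phi^u given the estimated beta-task matrices.

    histories: dict u -> list of (t, arm, reward) tuples for tasks t with a known U_hat[t]
    """
    phi_hat = {}
    for u, recs in histories.items():
        # <x, U_t phi^u> = <U_t^T x, phi^u>: one regression row per observation.
        F = np.stack([U_hat[t].T @ x for t, x, _ in recs])
        y = np.array([rwd for _, _, rwd in recs])
        phi_hat[u] = np.linalg.solve(F.T @ F + ridge * np.eye(r), F.T @ y)
    return phi_hat

def fit_task_matrices(histories, phi_hat, b, r, ridge=1e-6):
    """Last phase: re-estimate every U_t by least squares over all users' data.

    histories: dict t -> list of (u, arm, reward) tuples observed for task t
    phi_hat:   dict u -> (r,) estimated low-rank user parameter
    """
    U_hat = {}
    for t, recs in histories.items():
        # <x, U_t phi^u> = <x (phi^u)^T, vec(U_t)>: regress on the flattened outer product.
        F = np.stack([np.outer(x, phi_hat[u]).ravel() for u, x, _ in recs])
        y = np.array([rwd for _, _, rwd in recs])
        v = np.linalg.solve(F.T @ F + ridge * np.eye(b * r), F.T @ y)
        U_hat[t] = v.reshape(b, r)
    return U_hat
```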

5 Simulation Experiments

In this section, we present several simulation experiments to benchmark our approach. (Ideally, we would also perform an experimental evaluation on real-world data. To accomplish this, however, we would need to obtain data on user behavior for the same collection of users over multiple services. The closest matches to this requirement are datasets such as Douban [27] that show the action-reward history for a real cross-domain recommendation service. Unfortunately, although the Douban dataset does record user behavior over multiple services, the only actions that we know rewards for are those that were executed by the service. That is, each user only gave feedback for items recommended to them by the service, which in practice is a very small fraction of the total action space. Thus, this dataset cannot be used to assess our algorithms, since the recommendations our algorithms would make would not coincide with those that received feedback in Douban.) We tested Rec$^2$’s regret, using a LinUCB implementation following that of [28]. (Each run of our algorithms takes at most five seconds on an 8-core 2.3GHz Intel Core i9 processor.) We provide the full experimental setup in Appendix C, only highlighting the most important details here.

Experiment with One Source, One Target, Many Users

In this experiment, we choose $\kappa=0.9$, $a=20$, $b=20$, $U_{0}=25$, $U=500$, $|\mathcal{A}|=40$, $H_{0}=2{,}000$, and $H=8{,}000$. We generate $A$ and $B$ by letting the $\bm{\phi}^{u}$ be random $d$-dimensional vectors, where $d=a+b(1-\kappa)$, and letting $A$ and $B$ be random $a\times d$ and $b\times d$ matrices. As we can see, the results of our Rec$^2$ approaches greatly improve upon the baselines. Notably, while we only have bounds on the excess regret after exploration for the oblivious exploration policy, it appears that similar bounds hold for both the oblivious exploration policy and the LinUCB exploration policy.
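A sketch of this synthetic generation, with the normalization of the $\bm{\phi}^{u}$ added for illustration; the assert checks the rank identity $r=a+(1-\kappa)b$ from Section 3:

```python
import numpy as np

def make_tasks(a, b, kappa, num_users, seed=0):
    """Generate source/target matrices and users with common information kappa.

    With d = a + (1 - kappa) * b and generic A in R^{a x d}, B in R^{b x d},
    dim(Row(A) ∩ Row(B)) = a + b - d = kappa * b, matching Equation (3).
    """
    rng = np.random.default_rng(seed)
    d = a + int(round((1 - kappa) * b))
    A = rng.normal(size=(a, d))
    B = rng.normal(size=(b, d))
    phi = rng.normal(size=(num_users, d))
    phi /= np.linalg.norm(phi, axis=1, keepdims=True)   # normalize users (illustrative)
    return A, B, phi    # source params A @ phi[u], target params B @ phi[u]

A, B, phi = make_tasks(a=20, b=20, kappa=0.9, num_users=500)
rank_stacked = np.linalg.matrix_rank(np.vstack([A, B]))
assert rank_stacked == A.shape[1]    # r = a + (1 - kappa) * b
```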

Experiment with No Sources, Many Targets and Many Users

In this experiment, we choose $b=3$, $T=30$, $U=50$, $r=6$, $|\mathcal{A}|=40$, $T_{0}=3$, and $U_{0}=5$. We choose random vectors $\bm{\phi}^{u}\in\mathbb{R}^{r}$, and each task matrix $B_{i}$ is a $b\times r$ matrix. Actions are random vectors of length $b$. We run Algorithm 2 for $H_{0}=1{,}000$ timesteps in the first phase, $H_{1}-H_{0}=3{,}000$ timesteps in the second phase, and $H-H_{1}=9{,}000$ timesteps in the third phase, and see that the results of Rec$^2$ strongly outperform the baseline in the third phase.

(a) Results for one source, one target.
(b) Results for no sources, many targets.
Figure 2: Average regret, with 95% confidence intervals shown. Note that in (a) the average regret in the post-exploration phase (after $H_{0}=2{,}000$) for the combination of oblivious exploration and LowOFUL bends sharply downwards, as predicted by our theorem; the LinUCB exploration policy appears to incur equal or even less regret after the exploration phase without sacrificing any regret during the exploration phase. In (b), while the Rec$^2$ approach does not obtain lower regret in the first and second phases, its regret becomes sharply lower in the third phase.

6 Conclusion

We introduced a contextual multi-armed bandit recommendation framework under which existing recommendations serve as input for new recommendation tasks. We show how to decompose the target problem into systematic and idiosyncratic components. The systematic component can be extrapolated from the source recommendations; the idiosyncratic component must be learned directly but has significantly lower dimensionality than the original task. The decomposition itself can be learned and utilized by an algorithm that executes a variant of LinUCB in a low-dimensional subspace, achieving regret comparable to a strong skyline bound. Our approach extends to a setting with simultaneous targets but no source, and the methods perform well in simulation.

We believe that recommending with recommendations should make it possible to reduce recommendation systems’ reliance on sensitive user data. Indeed, there are already large corpora of recommendations (in the form of advertisements) produced by search engines and social media platforms that have access to tremendous amounts of user information; these span a wide range of domains and thus could be used as source input in many different target tasks. (Ironically, in this sense, advertisements effectively act as a privacy filter – they can actually serve as a fairly complete profile of a person, while abstracting from much of the underlying sensitive data.) Moreover, individual platforms such as grocery delivery services are developing expertise in recommending products within specific domains; these could naturally be extrapolated to target tasks with some degree of overlap, such as the restaurant recommendation service in our running example.

Operationalizing the approach we provide here just requires a way for users to share the recommendations from one service with another. There are already public data sets for this in advertising domains, but in the future it could be accomplished more robustly either through a third-party tool that users can use to scrape and locally store their recommendations, or through direct API integrations with existing recommendation engines. But in any event, once a given user’s existing recommendations are available, they can be loaded into a new service (with the user’s consent), which can then instantaneously learn the systematic component and provide valuable recommendations using the approach we described here.

In addition to the overall privacy benefits of basing recommendations on existing recommendation sources rather than the underlying user data, our method lowers barriers to entry by creating a way for new services to bootstrap off of other platforms’ recommendations. Moreover, this approach to some extent empowers users by giving them a way to derive additional value from the recommendations that platforms have already generated from their data.

References

  • [1] NYU. Ad Observatory by NYU Tandon School of Engineering, 2018.
  • [2] The Markup. Citizen Browser - The Markup, 2021.
  • [3] Wei Chu, Lihong Li, Lev Reyzin, and Robert Schapire. Contextual bandits with linear payoff functions. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 208–214. JMLR Workshop and Conference Proceedings, 2011.
  • [4] Lihong Li, Wei Chu, John Langford, and Robert E Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th international conference on World wide web, pages 661–670, 2010.
  • [5] Jesús Bobadilla, Fernando Ortega, Antonio Hernando, and Abraham Gutiérrez. Recommender systems survey. Knowledge-based systems, 46:109–132, 2013.
  • [6] Emmanuel J Candès and Benjamin Recht. Exact matrix completion via convex optimization. Foundations of Computational mathematics, 9(6):717–772, 2009.
  • [7] Yehuda Koren, Robert Bell, and Chris Volinsky. Matrix factorization techniques for recommender systems. Computer, 42(8):30–37, 2009.
  • [8] Naoki Abe, Alan W Biermann, and Philip M Long. Reinforcement learning with immediate rewards and linear hypotheses. Algorithmica, 37(4):263–293, 2003.
  • [9] Tor Lattimore and Csaba Szepesvári. Bandit algorithms. Cambridge University Press, 2020.
  • [10] Kwang-Sung Jun, Rebecca Willett, Stephen Wright, and Robert Nowak. Bilinear bandits with low-rank structure. In International Conference on Machine Learning, pages 3163–3172. PMLR, 2019.
  • [11] Feng Zhu, Yan Wang, Chaochao Chen, Jun Zhou, Longfei Li, and Guanfeng Liu. Cross-domain recommendation: Challenges, progress, and prospects. arXiv preprint arXiv:2103.01696, 2021.
  • [12] Liang Hu, Jian Cao, Guandong Xu, Longbing Cao, Zhiping Gu, and Can Zhu. Personalized recommendation via cross-domain triadic factorization. In Proceedings of the 22nd international conference on World Wide Web, pages 595–606, 2013.
  • [13] Pinata Winoto and Tiffany Tang. If you like the devil wears prada the book, will you also enjoy the devil wears prada the movie? a study of cross-domain recommendations. New Generation Computing, 26(3):209–225, 2008.
  • [14] Jürgen Schmidhuber. Evolutionary principles in self-referential learning, or on learning how to learn: the meta-meta-… hook. PhD thesis, Technische Universität München, 1987.
  • [15] Y Bengio, S Bengio, and J Cloutier. Learning a synaptic learning rule. In IJCNN-91-Seattle International Joint Conference on Neural Networks, volume 2, pages 969–vol. IEEE, 1991.
  • [16] Leonardo Cella, Alessandro Lazaric, and Massimiliano Pontil. Meta-learning with stochastic linear bandits. In International Conference on Machine Learning, pages 1360–1370. PMLR, 2020.
  • [17] Mohammad Gheshlaghi Azar, Alessandro Lazaric, and Emma Brunskill. Sequential transfer in multi-armed bandit with finite set of models. In NIPS, 2013.
  • [18] Daniele Calandriello, Alessandro Lazaric, and Marcello Restelli. Sparse multi-task reinforcement learning. In NIPS-Advances in Neural Information Processing Systems 26, 2014.
  • [19] Junzhe Zhang and Elias Bareinboim. Transfer learning in multi-armed bandit: a causal approach. In Proceedings of the 16th Conference on Autonomous Agents and MultiAgent Systems, pages 1778–1780, 2017.
  • [20] Aniket Anand Deshmukh, Ürün Dogan, and Clayton Scott. Multi-task learning for contextual bandits. In NIPS, 2017.
  • [21] Bo Liu, Ying Wei, Yu Zhang, Zhixian Yan, and Qiang Yang. Transferable contextual bandit for cross-domain recommendation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.
  • [22] Heishiro Kanagawa, Hayato Kobayashi, Nobuyuki Shimizu, Yukihiro Tagami, and Taiji Suzuki. Cross-domain recommendation via deep domain adaptation. In European Conference on Information Retrieval, pages 20–29. Springer, 2019.
  • [23] Nicolò Cesa-Bianchi, Claudio Gentile, and Giovanni Zappella. A gang of bandits. Advances in Neural Information Processing Systems, 26, 2013.
  • [24] Kaige Yang, Laura Toni, and Xiaowen Dong. Laplacian-regularized graph bandits: Algorithms and theoretical analysis. In International Conference on Artificial Intelligence and Statistics, pages 3133–3143. PMLR, 2020.
  • [25] Marta Soare, Ouais Alsharif, Alessandro Lazaric, and Joelle Pineau. Multi-task linear bandits. In NIPS2014 Workshop on Transfer and Multi-task Learning: Theory meets Practice, 2014.
  • [26] Claudio Gentile, Shuai Li, and Giovanni Zappella. Online clustering of bandits. In International Conference on Machine Learning, pages 757–765. PMLR, 2014.
  • [27] Feng Zhu, Yan Wang, Chaochao Chen, Guanfeng Liu, Mehmet Orgun, and Jia Wu. A deep framework for cross-domain and cross-system recommendations. arXiv preprint arXiv:2009.06215, 2020.
  • [28] Huazheng Wang, Qingyun Wu, and Hongning Wang. Factorization bandits for interactive recommendation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 31, 2017.
  • [29] T Tony Cai, Anru Zhang, et al. Rate-optimal perturbation bounds for singular subspaces with applications to high-dimensional statistics. The Annals of Statistics, 46(1):60–89, 2018.
  • [30] Roman Vershynin. High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge university press, 2018.

Supplementary Material

Appendix A Proofs Omitted from the Main Text

A.1 Proof of Theorem 2.1

Theorem (LinUCB Bound).

A LinUCB agent that naively solves each of the $U\cdot T$ tasks separately has regret bounded by

R_{H}\leq O\left(b\sqrt{HTU}\log\left(\frac{H}{TU}\right)\right).

We first restate the LinUCB regret bound.

Lemma A.1 ([9, Corollary 19.3]).

Suppose that a LinUCB agent aims to solve a stochastic linear bandit problem with reward parameter $\bm{\theta}$. If the agent’s decision sets $\mathcal{A}_{h}$ for $h\in[H]$ satisfy

\max_{h\in[H]}\max_{\mathbf{x},\mathbf{y}\in\mathcal{A}_{h}}\left\langle\mathbf{x}-\mathbf{y},\bm{\theta}\right\rangle\leq 1,\qquad\qquad\max_{h\in[H]}\max_{\mathbf{x}\in\mathcal{A}_{h}}\left\lVert\mathbf{x}\right\rVert_{2}\leq 1,

then the regret at time $h$ is bounded by

R_{h}\leq O\left(b\sqrt{h}\log\left(h\right)\right).

In our setup, there are $U\cdot T$ contexts, corresponding to each user-task pair. We define $R_{h}^{u,t}$ to be the regret of a LinUCB agent that runs for $h$ iterations with the reward parameter $B_{t}\bm{\theta}^{u}$. We further define

H_{t}^{u}=\sum_{h=1}^{H}\mathbbm{1}\left[(u_{h},t_{h})=(u,t)\right]

to be the number of times that the agent encountered the user-task pair $(u,t)$. Observe that by our assumptions on the decision set in Section 2.1, the assumptions of Lemma A.1 are met. As the LinUCB algorithm is anytime (i.e., it does not require advance knowledge of the horizon), we can decompose the regret as

R_{H}=\mathbb{E}\left[\sum_{u\in[U],t\in[T]}R_{H_{t}^{u}}^{u,t}\right]\leq\mathbb{E}\left[b\sum_{u\in[U],t\in[T]}O\left(\sqrt{H_{t}^{u}}\log\left(H_{t}^{u}\right)\right)\right]\leq\sum_{u\in[U],t\in[T]}O\left(b\sqrt{\frac{H}{UT}}\log\left(\frac{H}{UT}\right)\right)=O\left(b\sqrt{UTH}\log\left(\frac{H}{UT}\right)\right),

where the second inequality follows by Jensen’s inequality: the quantity is maximized when all contexts are observed an equal number of times.

A.2 Proof of Theorem 3.1

Theorem (Skyline Regret).

An agent with full knowledge of the source and target matrices $A$ and $B$ may attain regret

R_{H}\leq O\left((1-\kappa)b\sqrt{UH}\log\left(\frac{H}{U}\right)\right).

To reduce regret given the decomposition (5), we consider a modification of the standard LinUCB algorithm. Observe that we may rewrite the reward (as given in (1)) for a given action $\mathbf{x}\in\mathcal{A}_{h}$ as

X_{h}=\left\langle\mathbf{x},B_{t_{h}}\bm{\theta}^{u_{h}}\right\rangle+\eta_{h}=\left\langle\mathbf{x},D_{T}A\bm{\theta}^{u_{h}}\right\rangle+\left\langle D_{G}^{T}\mathbf{x},\bm{\psi}^{u_{h}}\right\rangle+\eta_{h}.\qquad(7)

The generator transforms the action to a lower-dimensional space: $D_{G}^{T}\mathbf{x}\in\mathbb{R}^{(1-\kappa)b}$. To attain low regret in this “affine” variant of the standard stochastic linear bandit problem, we compute confidence sets for $\bm{\psi}^{u}$ in the style of LinUCB using the transformed actions $D_{G}^{T}\mathbf{x}$ and translated rewards $X_{h}-\left\langle\mathbf{x},D_{T}A\bm{\theta}^{u_{h}}\right\rangle$. We then select the action

\mathbf{x}_{h}:=\operatorname*{argmax}_{\mathbf{x}\in\mathcal{A}_{h}}\left\{\left\langle\mathbf{x},D_{T}A\bm{\theta}^{u_{h}}\right\rangle+\operatorname{UCB}_{h}\left(D_{G}^{T}\mathbf{x}\right)\right\},\qquad(8)

where $\operatorname{UCB}_{h}\left(D_{G}^{T}\mathbf{x}\right)\geq\left\langle D_{G}^{T}\mathbf{x},\bm{\psi}\right\rangle$ (with high probability) denotes the upper confidence bound for the action at time $h$. We show that this agent achieves lower regret.
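A minimal sketch of the selection rule (8), assuming a ridge-regression confidence set for $\bm{\psi}^{u}$ maintained in the $(1-\kappa)b$-dimensional space; the statistics `V`, `b_vec`, and the width `alpha` are illustrative. After pulling the chosen arm, the learner would update these statistics with the transformed action $D_{G}^{T}\mathbf{x}$ and the translated reward $X_{h}-\langle\mathbf{x},D_{T}A\bm{\theta}^{u}\rangle$.

```python
import numpy as np

def affine_linucb_choose(arms, D_T, D_G, src_param, V, b_vec, alpha=1.0):
    """Select an arm via rule (8): known systematic part plus UCB on the idiosyncratic part.

    arms: (num_arms, b) candidate arms; src_param: A theta^u for this user;
    V, b_vec: ridge-regression statistics for psi^u in the (1 - kappa) * b-dim space.
    """
    V_inv = np.linalg.inv(V)
    psi_hat = V_inv @ b_vec
    z = arms @ D_G                          # rows are the transformed actions D_G^T x
    systematic = arms @ (D_T @ src_param)   # <x, D_T A theta^u>, known exactly
    ucb = z @ psi_hat + alpha * np.sqrt(np.einsum('ij,jk,ik->i', z, V_inv, z))
    return int(np.argmax(systematic + ucb))
```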

We first prove the regret bound in the case in which we only have one user, with reward parameter $B\bm{\theta}=D_{T}A\bm{\theta}+D_{G}\bm{\psi}$. Observe that as $D_{G}^{T}$ has orthonormal rows, for any action $\mathbf{x}\in\mathcal{A}_{h}$,

\left\lVert D_{G}^{T}\mathbf{x}\right\rVert_{2}\leq\left\lVert\mathbf{x}\right\rVert_{2}\leq 1.

We then make use of the following lemma.

Lemma A.2 ([9, Theorems 19.2 and 20.5]).

Let $r_{h}:=\left\langle B\bm{\theta},\mathbf{x}_{h}^{*}-\mathbf{x}_{h}\right\rangle$ denote the instantaneous regret in round $h\in[H]$. If $r_{h}\leq\left\langle\tilde{\bm{\psi}}_{h}-\bm{\psi},D_{G}^{T}\mathbf{x}_{h}\right\rangle$, where $\operatorname{UCB}_{h}(D_{G}^{T}\mathbf{x}_{h})=\left\langle\tilde{\bm{\psi}}_{h},D_{G}^{T}\mathbf{x}_{h}\right\rangle$, then

R_{H}\leq O\left((1-\kappa)b\sqrt{H}\log\left(H\right)\right)

with failure probability $\delta=\frac{1}{H}$.

We first show that $r_{h}$ is indeed bounded above by $\left\langle\tilde{\boldsymbol{\psi}}_{h}-\boldsymbol{\psi},D_{G}^{T}\mathbf{x}_{h}\right\rangle$. To see this, we first note that by our choice of $\mathbf{x}_{h}$,

\begin{align*}
\left\langle B\boldsymbol{\theta},\mathbf{x}_{h}^{*}\right\rangle
&=\left\langle D_{T}A\boldsymbol{\theta},\mathbf{x}_{h}^{*}\right\rangle+\left\langle D_{G}^{T}\mathbf{x}_{h}^{*},\boldsymbol{\psi}\right\rangle\\
&\leq\left\langle D_{T}A\boldsymbol{\theta},\mathbf{x}_{h}^{*}\right\rangle+\operatorname{UCB}_{h}\left(D_{G}^{T}\mathbf{x}_{h}^{*}\right)\\
&\leq\left\langle D_{T}A\boldsymbol{\theta},\mathbf{x}_{h}\right\rangle+\operatorname{UCB}_{h}\left(D_{G}^{T}\mathbf{x}_{h}\right)\\
&=\left\langle D_{T}A\boldsymbol{\theta},\mathbf{x}_{h}\right\rangle+\left\langle\tilde{\boldsymbol{\psi}}_{h},D_{G}^{T}\mathbf{x}_{h}\right\rangle,
\end{align*}

where the third line follows from our definition (8). We may then compute

\begin{align*}
r_{h}&=\left\langle B\boldsymbol{\theta},\mathbf{x}_{h}^{*}-\mathbf{x}_{h}\right\rangle\\
&=\left\langle B\boldsymbol{\theta},\mathbf{x}_{h}^{*}\right\rangle-\left\langle B\boldsymbol{\theta},\mathbf{x}_{h}\right\rangle\\
&\leq\left\langle D_{T}A\boldsymbol{\theta},\mathbf{x}_{h}\right\rangle+\left\langle\tilde{\boldsymbol{\psi}}_{h},D_{G}^{T}\mathbf{x}_{h}\right\rangle-\left(\left\langle D_{T}A\boldsymbol{\theta},\mathbf{x}_{h}\right\rangle+\left\langle\boldsymbol{\psi},D_{G}^{T}\mathbf{x}_{h}\right\rangle\right)\\
&=\left\langle\tilde{\boldsymbol{\psi}}_{h}-\boldsymbol{\psi},D_{G}^{T}\mathbf{x}_{h}\right\rangle.
\end{align*}

Applying Lemma A.2, we obtain a single-user regret bound

R_{H}\leq O\left((1-\kappa)b\sqrt{H}\log\left(H\right)\right).

The desired regret bound then follows by applying Jensen’s inequality, as in the proof of Theorem 2.1.

A.3 Proof of Lemma 3.2

Lemma.

With probability $1-\delta$, we have that for all $u$,

L^{u}\leq\sqrt{2}\left(1+\left(\sigma_{\mathrm{min}}(\pi_{A}\mathcal{U})-\sqrt{2}Z\right)^{-1}\right)Z,

where $q=\frac{\sqrt{bU_{0}(b+\log(2/\delta))}}{\sqrt{H_{0}/U_{0}}-C(\sqrt{b}+\sqrt{\log(4/\delta)})}\approx\frac{bU_{0}}{\sqrt{H_{0}}-C\sqrt{U_{0}b}}$; $Z=\frac{q}{\sigma_{\mathrm{min}}\left(M\right)-q}$; $M=\begin{bmatrix}A\boldsymbol{\theta}^{1}&\ldots&A\boldsymbol{\theta}^{U_{0}}\\ B\boldsymbol{\theta}^{1}&\ldots&B\boldsymbol{\theta}^{U_{0}}\end{bmatrix}$; $C$ is some absolute constant; and we assume that $\sigma_{\mathrm{min}}\left(M\right)>q$.

This proof has two parts: we first show that there exists a point in $\widehat{S}_{u}$ that is close to $B\boldsymbol{\theta}^{u}$, and then show that this implies the result.

A crucial step in the proof involves the sin-theta matrix between two matrices $V$ and $\hat{V}$ with orthonormal columns. Let $\sigma_{1}\geq\sigma_{2}\geq\cdots\geq\sigma_{r}\geq 0$ denote the singular values of $V^{\top}\hat{V}$ (the cosines of the principal angles between the column spaces). We then write $\sin(V,\hat{V}):=\operatorname{diag}\left(\sin\cos^{-1}(\sigma_{1}),\sin\cos^{-1}(\sigma_{2}),\ldots,\sin\cos^{-1}(\sigma_{r})\right)$.
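As an illustrative numerical check (not part of the proof), the spectral norm of the sin-theta matrix can be computed directly from this definition; the helper below assumes both inputs already have orthonormal columns.

import numpy as np

def sin_theta_norm(V, V_hat):
    """Spectral norm of sin(V, V_hat) for two matrices with orthonormal columns."""
    cosines = np.linalg.svd(V.T @ V_hat, compute_uv=False)   # cosines of the principal angles
    cosines = np.clip(cosines, 0.0, 1.0)
    return float(np.sqrt(1.0 - cosines.min() ** 2))          # largest sine of a principal angle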

Lemma A.3.

With probability $1-\delta$, there exists $\gamma_{u}\in\widehat{S}_{u}$ for all $u\in[U]$ so that

\left\lVert\gamma_{u}-B\boldsymbol{\theta}^{u}\right\rVert<\sqrt{2}\left(1+\left(\sigma_{\mathrm{min}}(\pi_{A}\mathcal{U})-\sqrt{2}Z\right)^{-1}\right)Z.
Proof.

Note that we have

\left\lVert\mathcal{U}-\widehat{\mathcal{U}}O\right\rVert\leq\sqrt{2}\left\lVert\sin(\mathcal{U},\widehat{\mathcal{U}})\right\rVert

for a certain orthogonal matrix $O$ (by the first equation under the third statement of [29, Lemma 1]). Then note that if $\mathcal{U}_{B}\boldsymbol{\theta}^{u}\in S_{u}$, we have

\pi_{B}\widehat{\mathcal{U}}\left(O\boldsymbol{\theta}^{u}+(\pi_{A}\widehat{\mathcal{U}})^{+}\left((\pi_{A}\mathcal{U}-\pi_{A}\widehat{\mathcal{U}}O)\boldsymbol{\theta}^{u}\right)\right)\in S_{u}^{\prime},

which we set to be $\gamma_{u}$.

Now observe that

\begin{align*}
&\left\lVert\pi_{B}\widehat{\mathcal{U}}\left(O\boldsymbol{\theta}^{u}+(\pi_{A}\widehat{\mathcal{U}})^{+}\left((\pi_{A}\mathcal{U}-\pi_{A}\widehat{\mathcal{U}}O)\boldsymbol{\theta}^{u}\right)\right)-\pi_{B}\mathcal{U}\boldsymbol{\theta}^{u}\right\rVert\\
&\qquad\leq\left\lVert(\pi_{B}\widehat{\mathcal{U}}O-\pi_{B}\mathcal{U})\boldsymbol{\theta}^{u}\right\rVert+\left\lVert(\pi_{A}\widehat{\mathcal{U}})^{+}\left((\pi_{A}\mathcal{U}-\pi_{A}\widehat{\mathcal{U}}O)\boldsymbol{\theta}^{u}\right)\right\rVert\\
&\qquad\leq\sqrt{2}\left\lVert\sin(\mathcal{U},\widehat{\mathcal{U}})\right\rVert\left(\left\lVert\boldsymbol{\theta}^{u}\right\rVert+\left\lVert(\pi_{A}\widehat{\mathcal{U}})^{+}\right\rVert\left\lVert\boldsymbol{\theta}^{u}\right\rVert\right).
\end{align*}

The next result shows that $M$ is close to $\widehat{M}$, which, as we show later, implies that $\mathcal{U}$ is close to $\widehat{\mathcal{U}}$.

Lemma A.4.

With probability $1-\delta$, we have

\left\lVert\widehat{M}-M\right\rVert\leq q.
Proof.

Since the matrix formed by stacking the arms together is fixed across users, we have that

E:=M-\widehat{M}=(D^{\top}D)^{-1}D^{\top}\eta=F\eta,

where $\eta$ is an $(H_{0}/U_{0})\times U_{0}$ matrix of i.i.d. $1$-subgaussian random variables. We seek to bound the value of $\left\lVert E\right\rVert$.

Claim.

With probability $1-\delta$, we have

\left\lVert E\right\rVert\leq C\left\lVert F\right\rVert\sqrt{U_{0}(b+\log(1/\delta))},

where $C$ is an absolute constant.

Proof.

Let $\mathcal{N}$ be an $\epsilon$-net of the unit ball with radius $\epsilon=1/2$ and size $5^{b}$, which exists by [30, Corollary 4.2.13]. Then note that $2\sup_{x\in\mathcal{N}}\left\lVert E^{\top}x\right\rVert\geq\left\lVert E^{\top}\right\rVert$ by [30, Lemma 4.4.1]. Letting $\ell=C\sqrt{b+\log(1/\delta)}$, note that

\left\lVert E^{\top}x\right\rVert=\left\lVert\eta^{\top}F^{\top}x\right\rVert=\left\lVert\eta^{\top}v\right\rVert\geq C\ell\sqrt{U_{0}}\left\lVert v\right\rVert=C\ell\sqrt{U_{0}}\cdot\left\lVert F^{\top}x\right\rVert,

with probability at most $2e^{-\ell^{2}}=5^{-b}\delta$ (writing $v:=F^{\top}x$), where $C$ is an absolute constant, by properties of subgaussian random variables. Thus, by a union bound, with probability at least $1-\delta$ we have

\left\lVert E\right\rVert/2=\left\lVert E^{\top}\right\rVert/2\leq\sup_{x\in\mathcal{N}}\left\lVert E^{\top}x\right\rVert\leq\sup_{x\in\mathcal{N}}C\ell\sqrt{U_{0}}\left\lVert F^{\top}x\right\rVert\leq C\ell\sqrt{U_{0}}\left\lVert F\right\rVert.\qed
Claim.

With probability $1-\delta$, we have

\left\lVert F\right\rVert\leq\frac{1}{\sqrt{H_{0}/bU_{0}}-C(1+\sqrt{\log(2/\delta)/b})}.
Proof.

We can see that

\left\lVert F\right\rVert=\sqrt{\left\lVert FF^{\top}\right\rVert}=\lambda_{\mathrm{min}}(D^{\top}D)^{-1/2}=1/\sigma_{\mathrm{min}}(D).

By [30, Theorem 4.6.1], since $D$ is $1/\sqrt{b}$ times an $(H_{0}/U_{0})\times b$ matrix with i.i.d. standard Gaussian entries, we have

\sigma_{\mathrm{min}}(D)\geq\sqrt{1/b}\left(\sqrt{H_{0}/U_{0}}-C(\sqrt{b}+\sqrt{\log(2/\delta)})\right)=\sqrt{H_{0}/bU_{0}}-C(1+\sqrt{\log(2/\delta)/b})

with probability $1-\delta$. ∎
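As a quick numerical sanity check of the identity $\lVert F\rVert=1/\sigma_{\mathrm{min}}(D)$ used above (illustrative only, with arbitrary shapes and seed):

import numpy as np

rng = np.random.default_rng(0)
n, b = 200, 20
D = rng.standard_normal((n, b)) / np.sqrt(b)        # tall design matrix, as in the exploration phase
F = np.linalg.inv(D.T @ D) @ D.T                    # F = (D^T D)^{-1} D^T
assert np.isclose(np.linalg.norm(F, 2),             # spectral norm of F ...
                  1.0 / np.linalg.svd(D, compute_uv=False).min())   # ... equals 1/sigma_min(D)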

By the second claim, we have that

\left\lVert F\right\rVert\leq\frac{1}{\sqrt{H_{0}/bU_{0}}-C(1+\sqrt{\log(4/\delta)/b})}

with probability at least $1-\delta/2$. By the first claim, we have that with probability at least $1-\delta/2$,

\left\lVert E\right\rVert\leq C\left\lVert F\right\rVert\sqrt{U_{0}(b+\log(1/\delta))}.

Thus, by the union bound, both inequalities hold with probability at least $1-\delta$, in which case we have

\left\lVert E\right\rVert\leq C\left\lVert F\right\rVert\sqrt{U_{0}(b+\log(1/\delta))}\leq\frac{\sqrt{U_{0}(b+\log(1/\delta))}}{\sqrt{H_{0}/bU_{0}}-C(1+\sqrt{\log(4/\delta)/b})},

as desired. ∎

Suppose now that $\left\lVert\widehat{M}-M\right\rVert\leq q$. We show that the inequality

L^{u}\leq\sqrt{2}\left(1+\left(\sigma_{\mathrm{min}}(\pi_{A}\mathcal{U})-\sqrt{2}Z\right)^{-1}\right)Z

follows, which suffices for the main claim.

By the remark under subsection 2.3 in [29], we have that

\left\lVert\sin(\mathcal{U},\widehat{\mathcal{U}})\right\rVert\leq\frac{q}{\sigma_{r}(\widehat{M})}.

Note that $\sigma_{r}(M)\leq\sigma_{r}(\widehat{M})+\left\lVert E\right\rVert$ by Weyl's inequality on singular values, so

\sigma_{r}(\widehat{M})\geq\sigma_{r}(M)-\left\lVert E\right\rVert\geq\sigma_{r}(M)-q,

so we have

\left\lVert\sin(\mathcal{U},\widehat{\mathcal{U}})\right\rVert\leq Z.

Then note that

\left\lVert(\pi_{A}\widehat{\mathcal{U}})^{+}\right\rVert=(\sigma_{\mathrm{min}}(\pi_{A}\widehat{\mathcal{U}}))^{-1}.

Recall that by the first equation under the third statement of [29, Lemma 1], there exists an orthogonal square matrix $O$ so that

\left\lVert\mathcal{U}-\widehat{\mathcal{U}}O\right\rVert\leq\sqrt{2}\left\lVert\sin(\mathcal{U},\widehat{\mathcal{U}})\right\rVert.

Thus,

\begin{align*}
\sigma_{\mathrm{min}}(\pi_{A}\widehat{\mathcal{U}})
&=\sigma_{\mathrm{min}}(\pi_{A}\widehat{\mathcal{U}}O)\\
&\geq\sigma_{\mathrm{min}}(\pi_{A}\mathcal{U})-\left\lVert\pi_{A}\mathcal{U}-\pi_{A}\widehat{\mathcal{U}}O\right\rVert\\
&\geq\sigma_{\mathrm{min}}(\pi_{A}\mathcal{U})-\left\lVert\mathcal{U}-\widehat{\mathcal{U}}O\right\rVert\\
&\geq\sigma_{\mathrm{min}}(\pi_{A}\mathcal{U})-\sqrt{2}\left\lVert\sin(\mathcal{U},\widehat{\mathcal{U}})\right\rVert\\
&\geq\sigma_{\mathrm{min}}(\pi_{A}\mathcal{U})-\sqrt{2}Z,
\end{align*}

where the second line follows from Weyl’s Inequality.

Finally, we have

\left\lVert(\pi_{A}\widehat{\mathcal{U}})^{+}\right\rVert\leq\left(\sigma_{\mathrm{min}}(\pi_{A}\mathcal{U})-\sqrt{2}Z\right)^{-1}

and we may now conclude. ∎

By Lemma A.3, with probability $1-\delta$ there exists $\gamma_{u}\in\widehat{S}_{u}$ for all $u\in[U]$ so that

\left\lVert\gamma_{u}-B\boldsymbol{\theta}^{u}\right\rVert\leq\sqrt{2}\left(1+\left(\sigma_{\mathrm{min}}(\pi_{A}\mathcal{U})-\sqrt{2}Z\right)^{-1}\right)Z.

It remains to show that $L^{u}\leq\left\lVert\gamma_{u}-B\boldsymbol{\theta}^{u}\right\rVert$ for all $u$. Note that $\gamma_{u}$ is an element of $E_{u}$, so it suffices to show that $L^{u}$ equals the distance between $B\boldsymbol{\theta}^{u}$ and the closest element of $E_{u}$, which is $W_{\parallel}W_{\parallel}^{\top}B\boldsymbol{\theta}^{u}$. This distance is given by

\left\lVert B\boldsymbol{\theta}^{u}-W_{\parallel}W_{\parallel}^{\top}B\boldsymbol{\theta}^{u}\right\rVert_{2}=\left\lVert W_{\perp}W_{\perp}^{\top}B\boldsymbol{\theta}^{u}\right\rVert_{2}=\left\lVert W_{\perp}^{\top}B\boldsymbol{\theta}^{u}\right\rVert_{2}=\left\lVert(W^{\top}B\boldsymbol{\theta}^{u})_{(1-\kappa)b+2:b}\right\rVert_{2},

and we may conclude.
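To make the final step concrete, the quantity $L^{u}$ can be computed from an orthogonal basis $W=[W_{\parallel}\;W_{\perp}]$ and an estimate of $B\boldsymbol{\theta}^{u}$ as the norm of the component orthogonal to $W_{\parallel}$; a minimal NumPy sketch (with hypothetical array names, and assuming W_par has orthonormal columns) follows.

import numpy as np

def residual_norm(W_par, B_theta_u):
    """Distance from B theta^u to its projection onto span(W_par),
    i.e. ||(I - W_par W_par^T) B theta^u||_2."""
    return float(np.linalg.norm(B_theta_u - W_par @ (W_par.T @ B_theta_u)))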

A.4 Proof of Theorem 3.3

Theorem.

With an oblivious exploration policy that draws identical arms for each user from $\frac{1}{\sqrt{b}}\mathcal{N}(0,I_{b})$, assuming that each user appears equally many times, with probability at least $1-\left(U\delta^{\prime}+\delta\right)$ we have

\begin{align*}
\sum_{u\in[U]}\mathrm{LowOFUL}^{u}_{H_{u}}
&\leq O\biggl((1-\kappa)b\sqrt{2\log(1/\delta^{\prime})}\sqrt{UH}\log\left(\frac{H}{U}\right)\\
&\qquad\qquad+H\sqrt{2}\left(1+\left(\sigma_{\mathrm{min}}(\pi_{A}\mathcal{U})-\sqrt{2}Z\right)^{-1}\right)Z\log\left(\frac{H}{U}\right)\biggr).
\end{align*}

We first state the LowOFUL algorithm as Algorithm 3. We use a generalization of LowOFUL in which the action sets are allowed to vary with time obliviously; the argument in the original proof is unchanged.

Note that we can break up the total regret into two terms, which are

R_{H}^{(1)}=\mathbb{E}\left[\sum_{u\in[U]}\sum_{h\leq H_{0}:u_{h}=u}\left(\max_{\mathbf{x}\in\mathcal{A}_{h}}\left\langle\mathbf{x},B\boldsymbol{\theta}^{u}\right\rangle-X_{h}\right)\right]=\mathrm{Explore}_{H_{0}}

and

\begin{align*}
R_{H}^{(2)}=\mathbb{E}\left[\sum_{u\in[U]}\sum_{h>H_{0}:u_{h}=u}\left(\max_{\mathbf{x}\in\mathcal{A}_{h}}\left\langle\mathbf{x},B\boldsymbol{\theta}^{u}\right\rangle-X_{h}\right)\right]
&=\mathbb{E}\left[\sum_{u\in[U]}\sum_{h>H_{0}:u_{h}=u}\left(\max_{\mathbf{x}\in W^{\top}\mathcal{A}_{h}}\left\langle\mathbf{x},W^{\top}B\boldsymbol{\theta}^{u}\right\rangle-X_{h}\right)\right]\\
&=\sum_{u\in[U]}\mathbb{E}\left[\sum_{h>H_{0}:u_{h}=u}\left(\max_{\mathbf{x}\in W^{\top}\mathcal{A}_{h}}\left\langle\mathbf{x},W^{\top}B\boldsymbol{\theta}^{u}\right\rangle-X_{h}\right)\right].
\end{align*}

We give our bound in terms of $L^{u}$, and then substitute the bound from Lemma 3.2 to conclude. Note that each user gets $(H-H_{0})/U$ timesteps in the second phase, which is at most $H/U$.

From [10, Appendix D(c)] (and using their notation), we have that (for $\lambda=1$)

R_{H/U}\leq 2\sqrt{\beta_{H/U}}\sqrt{\log\frac{|V_{H/U}|}{|\Lambda|}}\sqrt{H/U}.

Now, we can bound $\log\frac{|V_{H/U}|}{|\Lambda|}\leq 2((1-\kappa)b+1)\log(1+H/U)$, and hence

\sqrt{\log\frac{|V_{H/U}|}{|\Lambda|}}\leq\sqrt{2((1-\kappa)b+1)\log(1+H/U)}.

By our choice of $W$, combined with Lemma 3.2, we now choose $\beta_{H/U}$ so that

\begin{align*}
\sqrt{\beta_{H/U}}&=\sqrt{\log\frac{|V_{H/U}|}{|\Lambda|}+2\log\left(\frac{1}{\delta^{\prime}}\right)}+1+\sqrt{\lambda_{\perp}}L^{u}\\
&=O\left(\sqrt{2(1-\kappa)b\log(1+H/U)}+\sqrt{2\log\frac{1}{\delta^{\prime}}}+L^{u}\sqrt{\frac{H/U}{(1-\kappa)b}\log(1+H/U)}\right),
\end{align*}

where we have taken $\lambda_{\perp}=\frac{H}{(1-\kappa)b+1}\log(1+H/U)$.

Finally, by the main result of [10], the regret for a particular user is, with probability at least $1-\delta^{\prime}$,

\begin{align*}
&O\left(\sqrt{2(1-\kappa)b\log(1+H/U)}+\sqrt{2\log\frac{1}{\delta^{\prime}}}+L^{u}\sqrt{\frac{H/U}{(1-\kappa)b}\log(1+H/U)}\right)\sqrt{2((1-\kappa)b+1)\log(1+H)}\sqrt{H/U}\\
&\qquad=O\left(2(1-\kappa)b\log(1+H/U)\sqrt{H/U}\sqrt{2\log(1/\delta^{\prime})}+(H/U)L^{u}\log(1+H/U)\right).
\end{align*}

By a union bound, with probability at least $1-U\delta^{\prime}-\delta$, the bound on $L^{u}$ from Lemma 3.2 holds and all of the per-user regret bounds above hold simultaneously, so the overall per-user regret is given by the above expression. Multiplying the per-user bound by $U$ now yields the desired result.

A.5 Proof of Lemma 4.1

Lemma.

With access to all task matrices, we obtain a regret of

R_{H}\leq O\left(r\sqrt{UH}\log\left(\frac{H}{U}\right)\right).

Once given access to the task matrices $B_{1},\dots,B_{T}$, the problem reduces to a contextual bandit problem with just $U$ contexts instead of $U\cdot T$ contexts. Using the task matrices, we first compute the $\mathcal{U}_{t}$ via the singular value decomposition

\begin{bmatrix}\mathcal{U}_{1}&\left(\mathcal{U}_{1}\right)_{\bot}\\ \vdots&\vdots\\ \mathcal{U}_{T}&\left(\mathcal{U}_{T}\right)_{\bot}\end{bmatrix}\begin{bmatrix}\Sigma_{r}&0\\ 0&0\end{bmatrix}\begin{bmatrix}\mathcal{V}_{1}&\left(\mathcal{V}_{1}\right)_{\bot}\\ \vdots&\vdots\\ \mathcal{V}_{T}&\left(\mathcal{V}_{T}\right)_{\bot}\end{bmatrix}=:\begin{bmatrix}B_{1}\\ \vdots\\ B_{T}\end{bmatrix}

By the definition of $r$, we may write $B_{t}\boldsymbol{\theta}^{u}=\mathcal{U}_{t}\boldsymbol{\phi}^{u}$ for all $(u,t)\in[U]\times[T]$, where $\boldsymbol{\phi}^{u}\in\mathbb{R}^{r}$. Noting that

\left\langle\mathbf{x},B_{t}\boldsymbol{\theta}^{u}\right\rangle=\left\langle\mathbf{x},\mathcal{U}_{t}\boldsymbol{\phi}^{u}\right\rangle=\left\langle\mathcal{U}_{t}^{\top}\mathbf{x},\boldsymbol{\phi}^{u}\right\rangle,

we consider a LinUCB agent that runs one bandit per user, transforming the action space by $\mathcal{A}_{h}\mapsto\mathcal{U}_{t_{h}}^{\top}\mathcal{A}_{h}$, where $t_{h}$ denotes the task context at time $h$. From here, we obtain the regret bound by the argument in the proof of Theorem 2.1, with the number of contexts $U\cdot T$ replaced by $U$ and the dimensionality $b$ replaced by $r$. This yields the desired bound

R_{H}\leq O\left(r\sqrt{UH}\log\left(\frac{H}{U}\right)\right).
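For illustration (not the authors' code), extracting the per-task blocks $\mathcal{U}_{t}$ from the stacked task matrices and transforming a decision set can be sketched as follows; B_list and r are assumed inputs, and all task matrices are assumed to have the same number of rows b.

import numpy as np

def task_subspaces(B_list, r):
    """Stack the task matrices, take their SVD, and return the per-task row
    blocks U_t of the leading r left singular vectors."""
    B_stack = np.vstack(B_list)                          # (T*b) x d stacked task matrices
    U = np.linalg.svd(B_stack, full_matrices=False)[0][:, :r]
    b = B_list[0].shape[0]
    return [U[t * b:(t + 1) * b, :] for t in range(len(B_list))]

# For task t_h, run LinUCB on the transformed decision set A_h @ U_t, so that the
# unknown parameter is the r-dimensional phi^u rather than the b-dimensional B_t theta^u.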

Appendix B Pseudocode for Algorithms

B.1 Algorithm for One Source, One Target, Many Users

Input: An exploration strategy $\mathtt{Explore}:(u,h)\mapsto\mathbf{x}\in\mathcal{A}_{h}$ and the LowOFUL algorithm $\mathtt{LowOFUL}:(\mathcal{A}_{h},L,h)\mapsto\mathbf{a}\in\mathcal{A}_{h}$, where $L$ bounds the almost-sparse components
Output: Arms to pull $\mathbf{x}_{h}\in\mathcal{A}_{h}$ for $h\in[H]$
/* Use the exploration strategy for the first $H_{0}$ iterations on the $U_{0}$ beta group users. */
for $h\in[H_{0}]$ do
       Observe context $u_{h}\in[U_{0}]$;
       Pull arm: $\mathbf{x}_{h}:=\mathtt{Explore}(u_{h},h)$
end for
/* Learn the transformer and generator from observed data (see the NumPy sketch after Algorithm 1) */
for $u\in[U_{0}]$ do
       Estimate $\widehat{B\boldsymbol{\theta}^{u}}$ from the arm-reward history;
end for
$\begin{bmatrix}\widehat{\mathcal{U}}&\widetilde{\mathcal{U}}\end{bmatrix}\begin{bmatrix}\Sigma_{r}&0\\ 0&\Sigma_{\kappa b}\end{bmatrix}\begin{bmatrix}\widehat{\mathcal{V}^{T}}\\ \widetilde{\mathcal{V}^{T}}\end{bmatrix}:=\begin{bmatrix}A\boldsymbol{\theta}^{1}&\ldots&A\boldsymbol{\theta}^{U_{0}}\\ \widehat{B\boldsymbol{\theta}^{1}}&\ldots&\widehat{B\boldsymbol{\theta}^{U_{0}}}\end{bmatrix}$ ;
  // Compute the SVD
$\widehat{D_{T}}:=\pi_{B}\widehat{\mathcal{U}}\left(\pi_{A}\widehat{\mathcal{U}}\right)^{+}$;
  // Approximate the transformer
$\widehat{D_{G}}:=\mathtt{OrthogonalBasis}\left(\pi_{B}\widehat{\mathcal{U}}\operatorname{Null}\left(\pi_{A}\widehat{\mathcal{U}}\right)\right)$;
  // Approximate the generator
/* Compute the almost low-dimensional bases */
for $u\in[U_{0}]$ do
       $W^{u}:=\begin{bmatrix}W^{u}_{\parallel}&W^{u}_{\perp}\end{bmatrix}:=\mathtt{OrthogonalBasis}\left(\operatorname{span}\left(\widehat{D_{T}}A\boldsymbol{\theta}^{u},\widehat{D_{G}}\right)\right)$;
       Compute the upper bound on $L^{u}$ as in Lemma 3.2;
end for
/* Apply $\mathtt{LowOFUL}$ with one bandit per context using the learned bases */
for $h\in[H_{0}+1,H]$ do
       Observe context $u_{h}\in[U]$ and decision set $\mathcal{A}_{h}$;
       $\mathbf{a}_{h}:=\mathtt{LowOFUL}\left(\left(W^{u_{h}}\right)^{T}\mathcal{A}_{h},L^{u_{h}},h\right)$;
        // Run $\mathtt{LowOFUL}$ in the learned basis
       Pull arm: $\mathbf{x}_{h}:=W^{u_{h}}\mathbf{a}_{h}$
end for
Algorithm 1 One Target, Many Users
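The transformer/generator estimation step of Algorithm 1 might be implemented as sketched below. This is a rough NumPy sketch under stated assumptions, not the authors' implementation: the function name, the numerical tolerance, and the use of the pseudoinverse and QR factorization are illustrative choices.

import numpy as np

def learn_transformer_generator(A_theta, B_theta_hat, r, tol=1e-10):
    """Estimate the transformer and generator from the beta group.
    A_theta: a x U0 matrix of source parameters; B_theta_hat: b x U0 estimates."""
    a = A_theta.shape[0]
    M_hat = np.vstack([A_theta, B_theta_hat])            # (a + b) x U0 stacked matrix
    U_hat = np.linalg.svd(M_hat, full_matrices=False)[0][:, :r]
    UA, UB = U_hat[:a, :], U_hat[a:, :]                  # pi_A U_hat and pi_B U_hat
    D_T = UB @ np.linalg.pinv(UA)                        # transformer estimate pi_B U_hat (pi_A U_hat)^+
    # Null space of pi_A U_hat: right singular vectors with (numerically) zero singular value.
    _, s, Vt = np.linalg.svd(UA, full_matrices=True)
    rank = int(np.sum(s > tol))
    N = Vt[rank:, :].T                                   # basis of Null(pi_A U_hat)
    D_G, _ = np.linalg.qr(UB @ N)                        # orthonormal basis of pi_B U_hat Null(pi_A U_hat)
    return D_T, D_G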

B.2 Algorithm for No Source, Many Targets, Many Users

Input: An exploration strategy (e.g. LinUCB) $\mathtt{Explore}:(u,h)\mapsto\mathbf{x}\in\mathcal{A}_{h}$ and the LinUCB algorithm $\mathtt{LinUCB}:(\mathcal{A}_{h},h)\mapsto\mathbf{a}\in\mathcal{A}_{h}$
Output: Arms to pull $\mathbf{x}_{h}\in\mathcal{A}_{h}$ for $h\in[H]$
/* Phase 1: Use the exploration strategy for the first $H_{0}$ iterations on the $U_{0}$ beta group users and $T_{0}$ beta tasks. */
for $h\in[H_{0}]$ do
       Observe context $u_{h}\in[U_{0}]$;
       Pull arm: $\mathbf{x}_{h}:=\mathtt{Explore}(u_{h},h)$
end for
/* Learn the matrices $\{\widehat{\mathcal{U}_{t}}\}_{t\in[T_{0}]}$ */
for $(u,t)\in[U_{0}]\times[T_{0}]$ do
       Estimate $\widehat{B_{t}\boldsymbol{\theta}^{u}}$ from the arm-reward history;
end for
$\begin{bmatrix}\widehat{\mathcal{U}_{1}}&\widetilde{\mathcal{U}_{1}}\\ \vdots&\vdots\\ \widehat{\mathcal{U}_{T_{0}}}&\widetilde{\mathcal{U}_{T_{0}}}\end{bmatrix}\begin{bmatrix}\Sigma_{r}&0\\ 0&\Sigma_{bT_{0}-r}\end{bmatrix}\begin{bmatrix}\widehat{\mathcal{V}_{1}}&\widetilde{\mathcal{V}_{1}}\\ \vdots&\vdots\\ \widehat{\mathcal{V}_{T_{0}}}&\widetilde{\mathcal{V}_{T_{0}}}\end{bmatrix}:=\begin{bmatrix}\widehat{B_{1}\boldsymbol{\theta}^{1}}&\ldots&\widehat{B_{1}\boldsymbol{\theta}^{U_{0}}}\\ \vdots&&\vdots\\ \widehat{B_{T_{0}}\boldsymbol{\theta}^{1}}&\ldots&\widehat{B_{T_{0}}\boldsymbol{\theta}^{U_{0}}}\end{bmatrix}$ ;
  // Compute the SVD
/* Phase 2: All users arrive, but only request recommendations for the $T_{0}$ beta tasks */
for $h\in[H_{0}+1,H_{1}]$ do
       Observe context $u_{h}\in[U]$;
       Pull arm: $\mathbf{x}_{h}:=\mathtt{Explore}(u_{h},h)$
end for
/* Use the arm-reward history and the $\{\widehat{\mathcal{U}_{t}}\}_{t\in[T_{0}]}$ to estimate $\{\widehat{\boldsymbol{\phi}^{u}}\}_{u\in[U]}$ (see the NumPy sketch after Algorithm 2) */
for $u\in[U]$ do
       Estimate $\widehat{\boldsymbol{\phi}^{u}}$ with least squares using the matrices $\{\widehat{\mathcal{U}_{t}}\}_{t\in[T_{0}]}$, as described in Section 4.
end for
/* Phase 3: Use our estimates and LinUCB to perform well when all users arrive and request recommendations for any task */
for $h\in[H_{1}+1,H]$ do
       Observe context $u_{h}\in[U]$ and decision set $\mathcal{A}_{h}$;
       /* Use the estimate $\widehat{\boldsymbol{\phi}^{u_{h}}}$ to run $\mathtt{LinUCB}$ where the reward parameter is given by $\operatorname{vec}(\mathcal{U}_{t})$ */
       $\operatorname{vec}\left(\mathbf{A}_{h}\right):=\mathtt{LinUCB}\left(\mathcal{A}_{h}\left(\widehat{\boldsymbol{\phi}^{u_{h}}}\right)^{\top},h-H_{1}\right)$;
       Pull arm: $\mathbf{x}_{h}:=\frac{\mathbf{A}_{h}\widehat{\boldsymbol{\phi}^{u_{h}}}}{\lVert\widehat{\boldsymbol{\phi}^{u_{h}}}\rVert^{2}}$
end for
Algorithm 2 No Sources, Many Targets, Many Users
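The least-squares step in Phase 2 of Algorithm 2 (estimating $\widehat{\boldsymbol{\phi}^{u}}$ from the learned $\widehat{\mathcal{U}_{t}}$ and a user's arm-reward history on the beta tasks) might look as follows in NumPy; the variable names and data layout are illustrative assumptions.

import numpy as np

def estimate_phi(U_hats, actions, rewards, tasks):
    """Least-squares estimate of phi^u from observations X_h ~ <x_h, U_{t_h} phi^u>.
    U_hats: list of (b x r) estimated task subspaces indexed by task id;
    actions: (n x b) arms pulled for this user; rewards: (n,); tasks: (n,) task ids."""
    # Each observation contributes one row x_h^T U_{t_h} of the regression design.
    Z = np.vstack([actions[i] @ U_hats[tasks[i]] for i in range(len(rewards))])
    phi_hat, *_ = np.linalg.lstsq(Z, rewards, rcond=None)
    return phi_hat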

B.3 The LowOFUL Algorithm

For completeness, we overview the LowOFUL Algorithm [10] here.

Input: $\mathcal{A}_{h}$, $L$, $h$, a regularization constant $\lambda$, and a failure probability $\delta$
Output: Arm to pull $\mathbf{a}_{h}\in\mathcal{A}_{h}$
$\lambda_{\bot}:=\frac{h}{r\log\left(1+\frac{h}{\lambda}\right)}$;
$\Lambda:=\operatorname{diag}\bigl(\underbrace{\lambda,\dots,\lambda}_{\text{first $r$ entries}},\lambda_{\bot},\dots,\lambda_{\bot}\bigr)$ ;
  // Diagonal matrix for regularization
$\mathbf{A}:=$ the actions taken up to time $h-1$, one per row;
$\mathbf{y}:=$ the rewards received up to time $h-1$, one per entry;
$\widetilde{\boldsymbol{\theta}}:=\operatorname*{argmin}_{\boldsymbol{\theta}}\frac{1}{2}\left\lVert\mathbf{A}\boldsymbol{\theta}-\mathbf{y}\right\rVert^{2}+\frac{1}{2}\left\lVert\boldsymbol{\theta}\right\rVert_{\Lambda}^{2}$;
  // Estimate of the reward parameter
$V_{h-1}:=\Lambda+\mathbf{A}^{T}\mathbf{A}$;
$\beta_{h-1}:=\left(\sqrt{\log\frac{|V_{h-1}|}{|\Lambda|\delta^{2}}}+\sqrt{\lambda}+L\sqrt{\lambda_{\bot}}\right)^{2}$;
  // Squared radius of the confidence ellipsoid
$c_{h-1}:=\left\{\boldsymbol{\theta}\mid\left\lVert\boldsymbol{\theta}-\widetilde{\boldsymbol{\theta}}\right\rVert_{V_{h-1}}\leq\sqrt{\beta_{h-1}}\right\}$;
  // The $1-\delta$ confidence set
$\mathbf{a}_{h}:=\operatorname*{argmax}_{\mathbf{a}\in\mathcal{A}_{h}}\max_{\boldsymbol{\theta}\in c_{h-1}}\left\langle\boldsymbol{\theta},\mathbf{a}\right\rangle$;
Pull arm: $\mathbf{a}_{h}$
Algorithm 3 LowOFUL
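A minimal NumPy sketch of one LowOFUL step is given below, assuming the arms are already expressed in the learned basis and that the action history A_hist, reward history y, split index r, bound L, and failure probability delta are supplied; the maximization over the confidence ellipsoid is carried out in closed form per arm. The names are illustrative, not those of any particular library.

import numpy as np

def lowoful_step(arms, A_hist, y, r, L, lam, delta, h):
    """One step of LowOFUL: ridge regression with coordinate-dependent
    regularization, then optimistic arm selection. `arms` has one candidate per row."""
    d = arms.shape[1]
    lam_perp = h / (r * np.log(1 + h / lam))
    Lam = np.diag(np.concatenate([np.full(r, lam), np.full(d - r, lam_perp)]))
    V = Lam + A_hist.T @ A_hist
    V_inv = np.linalg.inv(V)
    theta_tilde = V_inv @ (A_hist.T @ y)                 # minimizer of 1/2||A theta - y||^2 + 1/2||theta||_Lambda^2
    _, logdet_V = np.linalg.slogdet(V)
    logdet_Lam = np.sum(np.log(np.diag(Lam)))
    beta = (np.sqrt(logdet_V - logdet_Lam - 2 * np.log(delta))
            + np.sqrt(lam) + L * np.sqrt(lam_perp)) ** 2
    # Optimistic value of each arm: <theta_tilde, a> + sqrt(beta) * ||a||_{V^{-1}}.
    widths = np.sqrt(np.einsum("ij,jk,ik->i", arms, V_inv, arms))
    ucb = arms @ theta_tilde + np.sqrt(beta) * widths
    return arms[np.argmax(ucb)]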

Appendix C Full Experimental Details

Actions are isotropic Gaussian random vectors of dimension $a$ and $b$, respectively.

Lemma C.1.

The run-time of the first algorithm is in total

\underbrace{H_{0}(Kb^{2}+b^{3})}_{\text{Phase 1}}+\underbrace{(H-H_{0})(K((1-\kappa)b)^{2}+((1-\kappa)b)^{3})}_{\text{Phase 2}}+\underbrace{(a+b)^{2}U_{0}+U_{0}^{3}}_{\text{SVD}}

and the run-time of the second algorithm is

\underbrace{H_{0}(Kb^{2}+b^{3})}_{\text{Phase 1}}+\underbrace{(H_{1}-H_{0})(Kn^{2}+n^{3})}_{\text{Phase 2}}+\underbrace{(H-H_{1})(K(bn)^{2}+(bn)^{3})}_{\text{Phase 3}}+\underbrace{(T_{0}b)^{2}U_{0}+U_{0}^{3}}_{\text{SVD}}.

Experiments with One Source, One Target, Many Users

In this experiment, we choose $\kappa=0.9$, $a=20$, $b=20$, $U_{0}=25$, $U=500$, with a choice between $40$ actions. We also have $H_{0}=2{,}000$ and $H=8{,}000$. We generate $A$ and $B$ by letting $\boldsymbol{\phi}^{u}$ be unit isotropic dimension-$d$ Gaussians, where $d=a+b(1-\kappa)$, and letting $A$ and $B$ be $1/\sqrt{d}$ times arbitrary $a\times d$ and $b\times d$ elementwise Gaussian matrices. In addition, we let $\mathcal{A}$ be comprised of $|\mathcal{A}|$ dimension-$a$ isotropic Gaussian vectors, and use Gaussian noise for $\eta$.
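A rough sketch of this synthetic setup is given below; the random seed, noise scale, and exact normalization of the latent vectors are illustrative choices where the text leaves them unspecified, and the function names are hypothetical.

import numpy as np

rng = np.random.default_rng(0)
kappa, a, b = 0.9, 20, 20
U0, U, n_actions = 25, 500, 40
H0, H = 2_000, 8_000
d = a + round(b * (1 - kappa))                       # latent dimension d = a + b(1 - kappa)

theta = rng.standard_normal((U, d))                  # latent user vectors
A = rng.standard_normal((a, d)) / np.sqrt(d)         # source task matrix
B = rng.standard_normal((b, d)) / np.sqrt(d)         # target task matrix

def decision_set():
    """A fresh set of isotropic Gaussian arms for the target task."""
    return rng.standard_normal((n_actions, b))

def reward(x, u, noise_std=1.0):
    """Noisy linear reward <x, B theta^u> + eta."""
    return x @ (B @ theta[u]) + noise_std * rng.standard_normal()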

The only remaining choice is the order in which users arrive. The beta users first arrive in the cyclic order $1,2,\ldots,U_{0}$, so that each appears $H_{0}/U_{0}$ times during the exploration phase; thereafter, users arrive in random order.

In LowOFUL, we take $\lambda=1$ and $\lambda_{\perp}=\frac{H/U}{((1-\kappa)b+1)\log(1+H/U)}$, again following [10].

Experiments with No Sources, Many Targets and Many Users

In this experiment, we choose $b=3$, $T=30$, $U=50$, $r=6$, $|\mathcal{A}|=40$ actions, and $T_{0}=3$ tasks and $U_{0}=5$ users in the reduced phase, where the matrices $B_{i}$ and vectors $\boldsymbol{\phi}^{u}$ are all given by isotropic standard Gaussian matrices and vectors. Moreover, the $\boldsymbol{\theta}^{u}\in\mathbb{R}^{d}$ are each standard isotropic Gaussian, and each task matrix is $1/\sqrt{b}$ times an arbitrary $b\times r$ elementwise uniform matrix. We run Algorithm 2 for $H_{0}=1{,}000$ actions in the first phase, $H_{1}-H_{0}=3{,}000$ steps in the second phase, and $H-H_{1}=9{,}000$ actions in the third phase.

Appendix D Difficulties in the Analysis of Algorithm 2

The intuition and motivation behind Algorithm 2 (the proposed algorithm in Section 4) mirror those of Algorithm 1 (our solution to the setup in Section 3). However, the analysis becomes intractable due to a dependence structure that arises in this setting and is not present in the setting of Section 3. Whereas in Section 3 we had access to the true optimal arms on the source task for members of the beta group, here we only have access to confidence sets for the true optimal arms on the other target tasks. These confidence sets are stochastically determined by the algorithm in a way that depends on the action history. This, in and of itself, is not a problem so long as one can make $\ell_{2}$ distance guarantees on the diameter of the confidence set. However, the axes of these confidence sets are not of equal length (i.e. the confidence sets are anisotropic), and the lengths themselves depend on the action history. Thus, although the algorithm is well motivated by the analysis of Section 3, and the fundamental ideas remain the same, we believe the analysis of the second case to be intractable. We were, however, able to demonstrate the success of the method experimentally.