
Multi-facet Contextual Bandits: A Neural Network Perspective

Yikun Ban University of Illinois at Urbana-Champaign [email protected] Jingrui He University of Illinois at Urbana-Champaign [email protected]  and  Curtiss B. Cook Mayo Clinic Arizona [email protected]
(2021)
Abstract.

The contextual multi-armed bandit has been shown to be an effective tool in recommender systems. In this paper, we study a novel problem of multi-facet bandits involving a group of bandits, each characterizing the users' needs from one unique aspect. In each round, for the given user, we need to select one arm from each bandit, such that the combination of all arms maximizes the final reward. This problem can find immediate applications in E-commerce, healthcare, etc. To address this problem, we propose a novel algorithm, named MuFasa, which utilizes an assembled neural network to jointly learn the underlying reward functions of multiple bandits. It estimates an Upper Confidence Bound (UCB) linked with the expected reward to balance between exploitation and exploration. Under mild assumptions, we provide the regret analysis of MuFasa. It can achieve the near-optimal $\widetilde{\mathcal{O}}((K+1)\sqrt{T})$ regret bound, where $K$ is the number of bandits and $T$ is the number of played rounds. Furthermore, we conduct extensive experiments to show that MuFasa outperforms strong baselines on real-world data sets.

Contextual Bandits; Neural Network; Regret Analysis
ccs: Information systems → Personalization; Information systems → Display advertising; Theory of computation → Online learning algorithms. journalyear: 2021; copyright: acmcopyright; booktitle: Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD '21), August 14–18, 2021, Virtual Event, Singapore; price: 15.00; doi: 10.1145/3447548.3467299; isbn: 978-1-4503-8332-5/21/08

1. introduction

Personalized recommendation is ubiquitous in web applications. Conventional approaches that rely on sufficient historical records, e.g., collaborative filtering (Su and Khoshgoftaar, 2009; Zhou et al., 2020b), have proven successful both theoretically and empirically. However, with the cold-start problem and the rapid change of recommendation content, these methods might render sub-optimal performance (Li et al., 2010; Gentile et al., 2014). To solve the dilemma between the exploitation of historical data and the exploration of new information, the Multi-Armed Bandit (MAB) (Langford and Zhang, 2008; Auer et al., 2002a; Abbasi-Yadkori et al., 2011; Auer et al., 2002b; Ban and He, 2020) turns out to be an effective tool, which has been adapted to personalized recommendation (Li et al., 2010; Gentile et al., 2014), online advertising (Wu et al., 2016), clinical trials (Durand et al., 2018; Bastani and Bayati, 2020), etc.

In the conventional contextual bandit problem setting (Li et al., 2010), i.e., a single MAB, the learner is presented with a set of arms in each round, where each arm is represented by a feature vector. Then the learner needs to select and play one arm to receive the corresponding reward, which is drawn from an unknown distribution with an unknown mean. To achieve the goal of maximizing the accumulated rewards, the learner needs to consider the arms with the best historical feedback as well as the new arms for potential gains. The single MAB problem has been well studied in various settings. With respect to the reward function, one research direction (Li et al., 2010; Abbasi-Yadkori et al., 2011; Dimakopoulou et al., 2019; Gentile et al., 2014; Li et al., 2016a) assumes that the expected reward is linear with respect to the arm's feature vector. However, in many real applications, this assumption fails to hold. Thus many existing works turn to focus on nonlinear or nonparametric bandits (Srinivas et al., 2009; Bubeck et al., 2011) with mild assumptions such as the Lipschitz continuity property (Bubeck et al., 2011) or embedding in a Reproducing Kernel Hilbert Space (Valko et al., 2013; Deshmukh et al., 2017). Furthermore, the single MAB problem has been extended to best arm identification (Auer et al., 2002a; Audibert and Bubeck, 2010), outlier arm identification (Gentile et al., 2017; Ban and He, 2020), Top-K arm problems (Buccapatnam et al., 2013), and so on.

In this paper, we define and study a novel problem of multi-facet contextual bandits. In this problem, the users' needs are characterized from multiple aspects, each associated with one bandit. Consider a task consisting of $K$ bandits, where each bandit presents a set of arms separately and the learner needs to choose and play one arm from each bandit. Therefore, a total of $K$ arms are played in one round. In accordance with the standard bandit problem, the learner can observe a reward after playing one arm from a bandit, which we call a "sub-reward", and thus $K$ sub-rewards are received in total. In addition, a reward that is a function of these $K$ sub-rewards, called the "final reward", is observed to represent the overall feedback with respect to the $K$ selected arms. Note that the functions governing the final reward and the $K$ sub-rewards are allowed to be either linear or non-linear. The goal of the learner in the multi-facet bandit problem is to maximize the final rewards over all the played rounds.

This problem finds many applications in real-world settings. For instance, in recommender systems, instead of recommending a single item, an E-commerce company may launch a promotion campaign that sells collections of multiple types of products such as snacks, toiletries, and beverages. Each type of item can be formulated as a multi-armed bandit and the learner aims to select the best combination of snack, toiletry, and beverage. As a result, the final reward is the review of this combined recommendation, while a sub-reward is the review for a particular product. This problem also exists in healthcare. For a diabetes patient, the doctor usually provides a comprehensive recommendation including medication, daily diet, and exercise, where each type has several options. Here, the final reward can be set as the change of key biomarkers for diabetes (e.g., HbA1c) and a sub-reward can be the direct impact of each type of recommendation (e.g., blood pressure change for a medicine).

A major challenge of the proposed multi-facet bandit problem is the partial availability of sub-rewards, as not every sub-reward is easy to observe. For example, regarding the combined recommendation of E-commerce, the user may rate the combination but not individual items; regarding the comprehensive recommendation for a diabetes patient, some sub-rewards can be difficult to measure (e.g., the impact of low-calorie diets on the patient’s overall health conditions). Therefore, in our work, we allow only a subset of all sub-rewards to be observed in each round, which increases the flexibility of our proposed framework.

To address these challenges, we aim to learn the mapping from the $K$ selected arms (one from each bandit) to the final reward, incorporating two crucial factors: (1) collaborative relations exist among these bandits, as they characterize aspects of the same user; (2) the bandits contribute to the task with various weights, because some aspects (bandits) are decisive while others may not be. Hence, we propose a novel algorithm, MuFasa, to learn the $K$ bandits jointly. It utilizes an assembled neural network to learn the final reward function combined with the $K$ bandits. Although neural networks have been adapted to the bandit problem (Riquelme et al., 2018; Zahavy and Mannor, 2019; Zhou et al., 2020a), they are designed for a single bandit with one selected arm and one reward in each round. To balance the exploitation and exploration of arm sets, we provide a comprehensive upper confidence bound based on the assembled network, linking the predicted reward with the expected reward. When the sub-rewards are partially available, we introduce a new approach to leverage them to train the bandits jointly. Furthermore, we carry out a theoretical analysis of MuFasa and prove a near-optimal regret bound under mild assumptions. Our major contributions can be summarized as follows:

(1) Problem. We introduce the problem of multi-facet contextual bandits to characterize the users’ needs from multiple aspects, which can find immediate applications in E-commerce, healthcare, etc.

(2) Algorithm. We propose a novel algorithm, MuFasa, which exploits the final reward and up to $K$ sub-rewards to train the assembled neural network and explores potential arm sets with a UCB-based strategy.

(3) Theoretical analysis. Under mild assumptions, we provide the upper confidence bounds for a single neural network and for the assembled neural network. Then, we prove that MuFasa can achieve the $\widetilde{\mathcal{O}}((K+1)\sqrt{T})$ regret bound, which is near-optimal compared to a single contextual bandit.

(4) Empirical performance. We conduct extensive experiments to show the effectiveness of MuFasa, which outperforms strong baselines on real-world data sets even with partial sub-rewards.

2. related work

Multi-armed bandit. The multi-armed bandit was first introduced by (Thompson, 1933) and then further studied by many works that succeeded in both theory and practice, such as $\epsilon$-greedy (Langford and Zhang, 2008), Thompson sampling (Agrawal and Goyal, 2013), and the upper confidence bound (Auer et al., 2002a). In contrast with traditional bandits (Auer et al., 2002a; Ban and He, 2020), the contextual bandit (Li et al., 2010; Abbasi-Yadkori et al., 2011; Wu et al., 2016) has better representation capacity, as each arm is represented by a context vector instead of a scalar to infer the reward. Among them, linear contextual bandits are extensively studied and many of them use the UCB strategy, achieving a $\widetilde{\mathcal{O}}(\sqrt{T})$ regret bound (Abbasi-Yadkori et al., 2011; Gentile et al., 2014; Ban and He, 2021). To further generalize the reward function, many works use a nonlinear regression model drawn from a reproducing kernel Hilbert space to learn the mapping from contexts to rewards, such as the kernel-based methods (Deshmukh et al., 2017; Valko et al., 2013).

Neural bandits. The authors of (Allesiardo et al., 2014) use a neural network to model an arm and then apply an $\epsilon$-greedy strategy to select an arm. In contrast, MuFasa utilizes a UCB-based strategy working on $K$ bandits instead of one set of arms. In addition, Thompson sampling has been combined with deep neural networks (Lipton et al., 2018; Azizzadenesheli et al., 2018; Riquelme et al., 2018; Zahavy and Mannor, 2019). For instance, (Riquelme et al., 2018; Zahavy and Mannor, 2019) regard the last layer of the neural network as the embedding of contexts and then apply Thompson sampling to play an arm in each round. NeuUCB (Zhou et al., 2020a) first uses a UCB-based approach built on a fully-connected neural network, but it only applies to a single bandit with one set of arms. On the contrary, MuFasa constructs an assembled neural network to learn $K$ bandits jointly. Deep neural networks for multi-view learning have been well studied (Zhou et al., 2015, 2020c; Fu et al., 2020; Zheng et al., 2021; Jing et al., 2021) to extract useful information from multiple sources, which inspires one of the core ideas of MuFasa.

Other variant bandit settings. In the non-contextual setting, a number of works (Chen et al., 2013, 2016; Liu et al., 2011) study playing $K$ arms at the same time in a single bandit, but these approaches have limited representation power in recommender systems. The most similar setting is the contextual combinatorial MAB problem (Qin et al., 2014; Li et al., 2016b), where the learner tends to choose the optimal subset of arms under certain constraints such as a size-$K$ constraint. One key difference is that all the arms there come from the same single bandit, where only one reward function exists. On the contrary, in multi-facet bandits, the selected $K$ arms come from $K$ different bandits with $K$ different reward functions, and the sub-rewards are allowed to be only partially available. There is another line of work (Gentile et al., 2014; Li et al., 2016a; Ban and He, 2021) on bandit clustering, where a bandit is constructed for each user and the dependency among users is leveraged to improve recommendation performance. However, these works still play one arm in each round and require the reward function to be linear.

3. Problem Definition

In this section, we formulate the problem of multi-facet bandits with a total of $K$ bandits, where the learner aims to select the optimal set of $K$ arms in each round, in order to maximize the accumulated final rewards.

Suppose there are $T$ rounds altogether. In each round $t\in[T]$ ($[T]=\{1,\dots,T\}$), the learner is faced with $K$ bandits, and each bandit $k\in[K]$ has a set of arms $\mathbf{X}^{k}_{t}=\{\mathbf{x}^{k}_{t,1},\dots,\mathbf{x}^{k}_{t,n_{k}}\}$, where $|\mathbf{X}^{k}_{t}|=n_{k}$ is the number of arms in this bandit. In bandit $k$, each arm $\mathbf{x}^{k}_{t,i}\in\mathbf{X}^{k}_{t}$ is represented by a $d_{k}$-dimensional feature vector and we assume $\|\mathbf{x}^{k}_{t,i}\|_{2}\leq 1$. Subsequently, in each round $t$, the learner observes $K$ arm sets $\{\mathbf{X}^{k}_{t}\}_{k=1}^{K}$ and thus a total of $\sum_{k=1}^{K}n_{k}$ arms. As only one arm can be played within each bandit, the learner needs to select and play $K$ arms, denoted as $\mathbf{X}_{t}=\{\mathbf{x}^{1}_{t},\dots,\mathbf{x}^{k}_{t},\dots,\mathbf{x}^{K}_{t}\}$, in which $\mathbf{x}_{t}^{k}\in\mathbf{X}_{t}$ represents the selected arm from $\mathbf{X}^{k}_{t}$.

Once the selected arm $\mathbf{x}^{k}_{t}$ is played for bandit $k$, a sub-reward $r^{k}_{t}$ will be received to represent the feedback of this play for bandit $k$ separately. The sub-reward is assumed to be governed by an unknown reward function:

$$r^{k}_{t}(\mathbf{x}^{k}_{t})=h_{k}(\mathbf{x}^{k}_{t}),$$

where $h_{k}$ can be either a linear (Li et al., 2010; Abbasi-Yadkori et al., 2011) or non-linear reward function (Deshmukh et al., 2017; Valko et al., 2013). As a result, in each round $t$, the learner needs to play $K$ arms in $\mathbf{X}_{t}$ and then receives $K$ sub-rewards denoted by $\mathbf{r}_{t}=\{r^{1}_{t},\dots,r^{k}_{t},\dots,r^{K}_{t}\}$.

As the $K$ bandits characterize the users' needs from various aspects, after playing the $K$ arms in each round $t$, a final reward $R_{t}$ will be received to represent the overall feedback of the group of $K$ bandits. The final reward $R_{t}$ is considered to be governed by an unknown function with respect to $\mathbf{r}_{t}$:

$$R_{t}(\mathbf{r}_{t})=H\left(\left(h_{1}(\mathbf{x}_{t}^{1}),\dots,h_{k}(\mathbf{x}_{t}^{k}),\dots,h_{K}(\mathbf{x}_{t}^{K})\right)\right)+\epsilon_{t},$$

where $\epsilon_{t}$ is noise drawn from a Gaussian distribution with zero mean. In our analysis, we make the following assumptions regarding $h_{k}$ and $H(\text{vec}(\mathbf{r}_{t}))$:

(1) If $\mathbf{x}_{t}^{k}=\mathbf{0}$, then $h_{k}(\mathbf{x}_{t}^{k})=0$; if $\text{vec}(\mathbf{r}_{t})=(0,\dots,0)$, then $H(\text{vec}(\mathbf{r}_{t}))=0$.

(2) $\bar{C}$-Lipschitz continuity. $H(\text{vec}(\mathbf{r}_{t}))$ is assumed to be $\bar{C}$-Lipschitz continuous with respect to $\mathbf{r}_{t}$. Formally, there exists a constant $\bar{C}>0$ such that

$$|H(\text{vec}(\mathbf{r}_{t}))-H(\text{vec}(\mathbf{r}_{t}^{\prime}))|\leq\bar{C}\sqrt{\sum_{k\in[K]}\big(r_{t}^{k}-{r_{t}^{k}}^{\prime}\big)^{2}}.$$

Both assumptions are mild. For (1), if the input is zero, then the reward should also be zero. For (2), Lipschitz continuity applies to many real-world applications. For the convenience of presentation, given any set of $K$ selected arms $\mathbf{X}_{t}$, we denote the expectation of $R_{t}$ by:

(1) $\mathcal{H}(\mathbf{X}_{t})=\mathbb{E}[R_{t}|\mathbf{X}_{t}]=H\left(\left(h_{1}(\mathbf{x}_{t}^{1}),\dots,h_{k}(\mathbf{x}_{t}^{k}),\dots,h_{K}(\mathbf{x}_{t}^{K})\right)\right).$

Recall that in multi-facet bandits, the learner aims to select the optimal $K$ arms with the maximal final reward $R_{t}^{\ast}$ in each round. First, we need to identify all possible combinations of $K$ arms, denoted by

(2) $\mathbf{S}_{t}=\{(\mathbf{x}^{1}_{t},\dots,\mathbf{x}^{k}_{t},\dots,\mathbf{x}^{K}_{t})\ |\ \mathbf{x}^{k}_{t}\in\mathbf{X}^{k}_{t},k\in[K]\},$

where $|\mathbf{S}_{t}|=\prod_{k=1}^{K}n_{k}$ because bandit $k$ has $n_{k}$ arms for each $k\in[K]$. Thus, the regret of the multi-facet bandit problem is defined as

$$\textbf{Reg}=\mathbb{E}\Big[\sum_{t=1}^{T}(R_{t}^{\ast}-R_{t})\Big]=\sum_{t=1}^{T}\left(\mathcal{H}(\mathbf{X}_{t}^{\ast})-\mathcal{H}(\mathbf{X}_{t})\right),$$

where $\mathbf{X}_{t}^{\ast}=\arg\max_{\mathbf{X}_{t}\in\mathbf{S}_{t}}\mathcal{H}(\mathbf{X}_{t})$. Therefore, our goal is to design a bandit algorithm that selects $K$ arms in every round in order to minimize the regret. We use the standard $\mathcal{O}$ notation to hide constants and $\widetilde{\mathcal{O}}$ to hide logarithmic factors.
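To make the interaction protocol above concrete, the following minimal Python sketch simulates one run of the multi-facet bandit protocol; the sub-reward functions $h_k$, the combination function $H$, and the uniformly random policy are illustrative placeholders (hypothetical names), not the algorithm proposed in this paper.

```python
import numpy as np

def simulate(T=100, K=3, n_arms=5, d=4, seed=0):
    rng = np.random.default_rng(seed)
    # Hypothetical sub-reward functions h_k and combination function H,
    # used only to illustrate the interaction protocol of Section 3.
    thetas = [rng.normal(size=d) for _ in range(K)]
    h = lambda k, x: float(np.tanh(thetas[k] @ x))   # unknown sub-reward function h_k
    H = lambda r: float(np.sum(r))                   # unknown final-reward function H (additive here)
    regret = 0.0
    for t in range(T):
        # Each bandit k presents an arm set X_t^k of n_arms feature vectors with ||x||_2 <= 1.
        arm_sets = [rng.normal(size=(n_arms, d)) for _ in range(K)]
        arm_sets = [X / np.linalg.norm(X, axis=1, keepdims=True) for X in arm_sets]
        # A placeholder policy picks one arm per bandit; MuFasa would use F + UCB here.
        chosen = [X[rng.integers(n_arms)] for X in arm_sets]
        sub_rewards = [h(k, x) for k, x in enumerate(chosen)]
        R_t = H(sub_rewards) + rng.normal(scale=0.01)          # noisy final reward
        # Oracle-best combination (valid because this illustrative H is additive).
        best = sum(max(h(k, x) for x in X) for k, X in enumerate(arm_sets))
        regret += best - H(sub_rewards)
    return regret

print(simulate())
```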

Availability of sub-rewards. In this framework, the final reward $R_{t}$ is required to be known, while the sub-rewards $\mathbf{r}_{t}$ are allowed to be only partially available, because the feedback of some bandits cannot be directly measured or is simply not available in a real problem. This increases the flexibility of our proposed framework.

More specifically, in each round $t$, the learner ideally receives $K+1$ rewards: $K$ sub-rewards $\{r^{1}_{t},\dots,r^{K}_{t}\}$ and a final reward $R_{t}$. As the final reward is the integral feedback of the entire group of bandits and reflects how the bandits affect each other, $R_{t}$ is required to be known. However, the $K$ sub-rewards are allowed to be partially available, because not every sub-reward is easy to obtain or can be measured accurately.

This is a new challenge in the multi-facet bandit problem. Thus, to learn \mathcal{H}, the designed bandit algorithm is required to handle the partial availability of sub-rewards.

4. Proposed Algorithm

In this section, we introduce the proposed algorithm, MuFasa. The presentation of MuFasa is divided into three parts. First, we present the neural network model used in MuFasa; second, we detail how to collect training samples to train the model in each round; finally, we describe the UCB-based arm selection criterion and summarize the workflow of MuFasa.

4.1. Neural network model

To learn the reward function $\mathcal{H}$, we use $K+1$ fully-connected neural networks to learn the $K$ bandits jointly: a neural network $f_{k}$ is built for each bandit $k\in[K]$ to learn its reward function $h_{k}$, and a shared neural network $F$ is constructed to learn the mapping from the $K$ networks $(f_{1},\dots,f_{K})$ to the final reward $R_{t}$.

First, in round $t$, for each bandit $k\in[K]$, given any context vector $\mathbf{x}^{k}_{t}\in\mathbb{R}^{d_{k}}$, we use an $L_{1}$-layer fully-connected network, denoted by $f_{k}$, to learn $h_{k}$:

$$f_{k}(\mathbf{x}^{k}_{t};\bm{\theta}^{k})=\sqrt{m_{1}}\,\mathbf{W}_{L_{1}}\sigma(\mathbf{W}_{L_{1}-1}\sigma(\dots\sigma(\mathbf{W}_{1}\mathbf{x}^{k}_{t}))),$$

where $\sigma(x)$ is the rectified linear unit (ReLU) activation function. Without loss of generality, we assume each layer has the same width $m_{1}$ for the sake of analysis. Therefore, $\bm{\theta}^{k}=\big(\text{vec}(\mathbf{W}_{L_{1}})^{\intercal},\dots,\text{vec}(\mathbf{W}_{1})^{\intercal}\big)^{\intercal}\in\mathbb{R}^{P_{1}}$, where $\mathbf{W}_{1}\in\mathbb{R}^{m_{1}\times d_{k}}$, $\mathbf{W}_{i}\in\mathbb{R}^{m_{1}\times m_{1}}$ for all $i\in\{2,\dots,L_{1}-1\}$, and $\mathbf{W}_{L_{1}}\in\mathbb{R}^{\widehat{m}\times m_{1}}$. Note that $f_{k}(\mathbf{x}^{k}_{t};\bm{\theta}^{k})\in\mathbb{R}^{\widehat{m}}$, where $\widehat{m}$ is a tunable parameter that connects $f_{k}$ with the following network $F$. Denote the gradient $\triangledown_{\bm{\theta}^{k}}f_{k}(\mathbf{x}^{k}_{t};\bm{\theta}^{k})$ by $g(\mathbf{x}_{t}^{k};\bm{\theta}^{k})$.

Next, to learn the final reward function $H$, we use an $L_{2}$-layer fully-connected network, denoted by $F$, to combine the outputs of the above $K$ neural networks:

$$F\left(\mathbf{f}_{t};\bm{\theta}^{\Sigma}\right)=\sqrt{m_{2}}\,\mathbf{W}_{L_{2}}\sigma(\dots\sigma(\mathbf{W}_{1}\mathbf{f}_{t})),$$

where $\mathbf{f}_{t}=\left(f_{1}(\mathbf{x}_{t}^{1};\bm{\theta}^{1})^{\intercal},\dots,f_{K}(\mathbf{x}_{t}^{K};\bm{\theta}^{K})^{\intercal}\right)^{\intercal}\in\mathbb{R}^{\widehat{m}K}$. Again, we assume that each layer has the same width $m_{2}$. Therefore, $\bm{\theta}^{\Sigma}=(\text{vec}(\mathbf{W}_{L_{2}})^{\intercal},\dots,\text{vec}(\mathbf{W}_{1})^{\intercal})^{\intercal}\in\mathbb{R}^{P_{2}}$, where $\mathbf{W}_{1}\in\mathbb{R}^{m_{2}\times\widehat{m}K}$, $\mathbf{W}_{i}\in\mathbb{R}^{m_{2}\times m_{2}}$ for all $i\in\{2,\dots,L_{2}-1\}$, and $\mathbf{W}_{L_{2}}\in\mathbb{R}^{1\times m_{2}}$. Denote the gradient $\triangledown_{\bm{\theta}^{\Sigma}}F\left(\mathbf{f}_{t};\bm{\theta}^{\Sigma}\right)$ by $G(\mathbf{f}_{t};\bm{\theta}^{\Sigma})$.

Therefore, for the convenience of presentation, the whole assembled neural network, which learns $\mathcal{H}$ (Eq. (1)), can be represented by $\mathcal{F}$; given the $K$ selected arms $\mathbf{X}_{t}$:

$$\mathcal{F}(\mathbf{X}_{t};\bm{\theta})=\left(F(\cdot;\bm{\theta}^{\Sigma})\circ\left(f_{1}(\cdot;\bm{\theta}^{1}),\dots,f_{K}(\cdot;\bm{\theta}^{K})\right)\right)(\mathbf{X}_{t}),$$

where $\bm{\theta}=(\bm{\theta}^{\Sigma},\bm{\theta}^{1},\dots,\bm{\theta}^{K})$.
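As a concrete illustration of this assembled architecture, the following PyTorch sketch composes $K$ two-layer sub-networks $f_k$ with a shared network $F$; the depths, widths, and the $\sqrt{m}$ scaling are simplified, and this is a sketch rather than the authors' released implementation.

```python
import torch
import torch.nn as nn

class SubNet(nn.Module):
    """One sub-network f_k: R^{d_k} -> R^{m_hat} (here L_1 = 2 layers with ReLU)."""
    def __init__(self, d_k, m1=100, m_hat=100):
        super().__init__()
        self.hidden = nn.Linear(d_k, m1, bias=False)
        self.out = nn.Linear(m1, m_hat, bias=False)
        self.scale = m1 ** 0.5
    def forward(self, x):
        return self.scale * self.out(torch.relu(self.hidden(x)))

class AssembledNet(nn.Module):
    """F composed with (f_1, ..., f_K): takes one arm per bandit, outputs a scalar."""
    def __init__(self, dims, m1=100, m_hat=100, m2=100):
        super().__init__()
        self.subnets = nn.ModuleList(SubNet(d, m1, m_hat) for d in dims)
        self.hidden = nn.Linear(m_hat * len(dims), m2, bias=False)
        self.out = nn.Linear(m2, 1, bias=False)
        self.scale = m2 ** 0.5
    def forward(self, xs):
        # xs: list of K tensors, xs[k] has shape (d_k,)
        f_t = torch.cat([f(x) for f, x in zip(self.subnets, xs)])
        return self.scale * self.out(torch.relu(self.hidden(f_t)))

# Example: K = 2 bandits with 10- and 6-dimensional arms.
model = AssembledNet(dims=[10, 6])
score = model([torch.randn(10), torch.randn(6)])
```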

Initialization. $\bm{\theta}$ is initialized by randomly generating each parameter from a Gaussian distribution. More specifically, for $\bm{\theta}^{k}$, $k\in[K]$, $\mathbf{W}_{l}$ is set to $\begin{pmatrix}\mathbf{w}&\mathbf{0}\\ \mathbf{0}&\mathbf{w}\end{pmatrix}$ for any $l\in[L_{1}]$, where each entry of $\mathbf{w}$ is drawn from $N(0,4/m_{1})$. For $\bm{\theta}^{\Sigma}$, $\mathbf{W}_{l}$ is set to $\begin{pmatrix}\mathbf{w}&\mathbf{0}\\ \mathbf{0}&\mathbf{w}\end{pmatrix}$ for any $l\in[L_{2}-1]$, where each entry of $\mathbf{w}$ is drawn from $N(0,4/m_{2})$; $\mathbf{W}_{L_{2}}$ is set to $(\mathbf{w}^{\intercal},-\mathbf{w}^{\intercal})$, where each entry of $\mathbf{w}$ is drawn from $N(0,2/m_{2})$.
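A minimal sketch of this block-structured initialization, for a single hidden weight matrix and for the last layer of $F$, is given below (assuming an even width so that the matrix splits into two identical diagonal blocks).

```python
import torch

def init_hidden_weight(width_out, width_in, m):
    """Block-diagonal init: W = [[w, 0], [0, w]] with entries of w ~ N(0, 4/m)."""
    w = torch.randn(width_out // 2, width_in // 2) * (4.0 / m) ** 0.5
    top = torch.cat([w, torch.zeros_like(w)], dim=1)
    bottom = torch.cat([torch.zeros_like(w), w], dim=1)
    return torch.cat([top, bottom], dim=0)

def init_last_weight(width_in, m):
    """Last layer of F: W_{L_2} = (w^T, -w^T) with entries of w ~ N(0, 2/m)."""
    w = torch.randn(1, width_in // 2) * (2.0 / m) ** 0.5
    return torch.cat([w, -w], dim=1)
```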

Algorithm 1 MuFasa
1: Input: $\mathcal{F}$, $T$, $K$, $\delta$, $\eta$, $J$
2: Initialize $\bm{\theta}_{0}=(\bm{\theta}^{\Sigma}_{0},\bm{\theta}^{1}_{0},\dots,\bm{\theta}^{K}_{0})$
3: for each $t\in[T]$ do
4:     for each bandit $k\in[K]$ do
5:         Observe context vectors $\mathbf{X}_{t}^{k}=\{\mathbf{x}_{t,1}^{k},\dots,\mathbf{x}_{t,n_{k}}^{k}\}$
6:     Collect $\mathbf{S}_{t}$ (Eq. (2))
7:     Choose $K$ arms $\mathbf{X}_{t}$ by
        $\mathbf{X}_{t}=\arg\max_{\mathbf{X}_{t}^{\prime}\in\mathbf{S}_{t}}\big(\mathcal{F}(\mathbf{X}_{t}^{\prime};\bm{\theta}_{t-1})+\text{UCB}(\mathbf{X}_{t}^{\prime})\big)$   (Theorem 5.3)
8:     Play $\mathbf{X}_{t}$ and observe rewards $R_{t}$ and $\mathbf{r}_{t}$
9:     if $|\mathbf{r}_{t}|=K$ then   ## sub-rewards are all available
10:        $\bm{\theta}_{t}$ = GradientDescent$_{All}$($\mathcal{F}$, $\{\mathbf{X}_{i}\}_{i=1}^{t}$, $\{R_{i}\}_{i=1}^{t}$, $\{\mathbf{r}_{i}\}_{i=1}^{t}$, $J$, $\eta$)
11:    else   ## sub-rewards are partially available
12:        Collect $\{\Omega_{i}\}_{i=1}^{t}$ (Eq. (4))
13:        $\bm{\theta}_{t}$ = GradientDescent$_{Partial}$($\mathcal{F}$, $\{\Omega_{i}\}_{i=1}^{t}$, $J$, $\eta$)
14:    Update $\text{UCB}(\mathbf{X}_{t}^{\prime})$
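To connect the pieces, a compact Python sketch of the main loop of Algorithm 1 follows; `select_arms`, `gradient_descent_all`, and `gradient_descent_partial` are hypothetical helper names standing for Eq. (5), Algorithm 2, and Algorithm 3 respectively, and `env` is assumed to expose the arm sets and rewards.

```python
def mufasa_loop(model, env, T, K, ucb, select_arms,
                gradient_descent_all, gradient_descent_partial):
    """High-level driver for Algorithm 1 (a sketch, not the released implementation)."""
    history = []                                  # stores (X_t, R_t, r_t) for retraining
    for t in range(T):
        arm_sets = env.observe_arm_sets()         # K arm sets {X_t^k}
        X_t = select_arms(model, arm_sets, ucb)   # line 7: argmax of F + UCB over S_t
        R_t, r_t = env.play(X_t)                  # final reward and (possibly partial) sub-rewards
        history.append((X_t, R_t, r_t))
        if len(r_t) == K:                         # all sub-rewards observed
            model = gradient_descent_all(model, history)
        else:                                     # partial sub-rewards: train on the Omega sets
            model = gradient_descent_partial(model, history)
    return model
```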

4.2. Training process

With only the final reward $R_{t}$ and the $K$ selected arms $\mathbf{X}_{t}$, training the neural network model $\mathcal{F}$ amounts to the following minimization problem:

(3) $\min_{\bm{\theta}}\mathcal{L}(\bm{\theta})=\sum_{t=1}^{T}\left(\mathcal{F}(\mathbf{X}_{t};\bm{\theta})-R_{t}\right)^{2}/2+m_{2}\lambda\|\bm{\theta}-\bm{\theta}_{0}\|_{2}^{2}/2,$

where $\mathcal{L}(\bm{\theta})$ is essentially an $l_{2}$-regularized square loss function and $\bm{\theta}_{0}$ denotes the randomly initialized network parameters. However, once sub-rewards are available, we should use different methods to train $\mathcal{F}$ in order to leverage more of the available information. Next, we elaborate our training methods using gradient descent.

Collection of training samples. Depending on the availability of sub-rewards, we apply different strategies to update $\bm{\theta}$ in each round. When the sub-rewards are all available, the learner receives one final reward and $K$ sub-rewards. We apply the straightforward strategy of training each part of $\mathcal{F}$ separately, based on the corresponding input and the ground-truth rewards in each round; the details are given in Algorithm 2, where $\widetilde{m}$ in $\mathcal{F}$ should be set to $1$.

Algorithm 2 GradientDescent$_{All}$
1: Input: $\mathcal{F}$, $\{\mathbf{X}_{i}\}_{i=1}^{t}$, $\{R_{i}\}_{i=1}^{t}$, $\{\mathbf{r}_{i}\}_{i=1}^{t}$, $J$, $\eta$
2: Output: $\bm{\theta}_{t}$
3: for each $k\in[K]$ do
4:     Define $\mathcal{L}(\bm{\theta}^{k})=\sum_{i=1}^{t}\left(f_{k}(\mathbf{x}_{i}^{k};\bm{\theta}^{k})-r_{i}^{k}\right)^{2}/2+m_{1}\lambda\|\bm{\theta}^{k}-\bm{\theta}_{0}^{k}\|_{2}^{2}/2$
5:     for each $j\in[J]$ do
6:         $\bm{\theta}^{k}_{j}=\bm{\theta}^{k}_{j-1}-\eta\,\triangledown\mathcal{L}(\bm{\theta}^{k}_{j-1})$
7: Define $\mathcal{L}(\bm{\theta}^{\Sigma})=\sum_{i=1}^{t}\left(F(\text{vec}(\mathbf{r}_{i});\bm{\theta}^{\Sigma})-R_{i}\right)^{2}/2+m_{2}\lambda\|\bm{\theta}^{\Sigma}-\bm{\theta}_{0}^{\Sigma}\|_{2}^{2}/2$
8: for each $j\in[J]$ do
9:     $\bm{\theta}^{\Sigma}_{j}=\bm{\theta}^{\Sigma}_{j-1}-\eta\,\triangledown\mathcal{L}(\bm{\theta}^{\Sigma}_{j-1})$
10: return $(\bm{\theta}^{\Sigma}_{J},\bm{\theta}^{1}_{J},\dots,\bm{\theta}^{K}_{J})$
Algorithm 3 GradientDescent$_{Partial}$
1: Input: $\mathcal{F}$, $\{\Omega_{i}\}_{i=1}^{t}$, $J$, $\eta$
2: Output: $\bm{\theta}_{t}$
3: Define $\mathcal{L}(\bm{\theta})=\sum_{i=1}^{t}\sum_{(\mathbf{X},R)\in\Omega_{i}}\left(\mathcal{F}(\mathbf{X};\bm{\theta})-R\right)^{2}/2+m_{2}\lambda\|\bm{\theta}-\bm{\theta}_{0}\|_{2}^{2}/2$
4: for each $j\in[J]$ do
5:     $\bm{\theta}_{j}=\bm{\theta}_{j-1}-\eta\,\triangledown\mathcal{L}(\bm{\theta}_{j-1})$
6: return $\bm{\theta}_{J}$
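For concreteness, a PyTorch sketch of this $J$-step gradient-descent update on the $l_2$-regularized square loss (the Algorithm 3 variant, reusing `AssembledNet` from the earlier sketch) might look as follows; the hyper-parameters are illustrative.

```python
import copy
import torch

def gradient_descent_partial(model, samples, J=100, eta=1e-2, lam=1.0, m2=100):
    """One call of GradientDescent_Partial: J plain gradient steps on the loss of Algorithm 3.

    samples: list of (xs, R) pairs, where xs is a list of K arm tensors
             (zero vectors for bandits without observed sub-rewards) and
             R is the (possibly surrogate) final reward.
    """
    theta0 = copy.deepcopy(model.state_dict())        # anchor theta_0 for the l2 term
    params = [p for p in model.parameters() if p.requires_grad]
    for _ in range(J):
        loss = torch.zeros(())
        for xs, R in samples:
            loss = loss + 0.5 * (model(xs).squeeze() - R) ** 2
        reg = sum(((p - theta0[n]) ** 2).sum()
                  for n, p in model.named_parameters())
        loss = loss + 0.5 * m2 * lam * reg
        grads = torch.autograd.grad(loss, params)
        with torch.no_grad():
            for p, g in zip(params, grads):
                p -= eta * g                          # theta_j = theta_{j-1} - eta * grad
    return model
```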

However, when the sub-rewards are only partially available, the above method is no longer valid because the bandits without available sub-rewards cannot be trained. Therefore, to learn the $K$ bandits jointly, we propose the following training approach, focusing on the empirical performance.

As the final reward is always available in each round, we collect the first training sample $(\mathbf{X}_{t},R_{t})$. Then, suppose there are $\mathcal{K}$ available sub-rewards in $\mathbf{r}_{t}$, $\mathcal{K}<K$. For each available sub-reward $r_{t}^{k}\in\mathbf{r}_{t}$ and the corresponding context vector $\mathbf{x}^{k}_{t}$, we construct the following pair:

$$\widetilde{\mathbf{X}}_{t,k}=\{\mathbf{0},\dots,\mathbf{x}^{k}_{t},\dots,\mathbf{0}\}\ \ \text{and}\ \ \widetilde{\mathbf{r}}_{t,k}=\{0,\dots,r_{t}^{k},\dots,0\}.$$

We regard $\widetilde{\mathbf{X}}_{t,k}$ as a new input of $\mathcal{F}$. Now, we need to determine the corresponding ground-truth final reward $\mathcal{H}(\widetilde{\mathbf{X}}_{t,k})=H(\text{vec}(\widetilde{\mathbf{r}}_{t,k}))$.

Unfortunately, $H(\text{vec}(\widetilde{\mathbf{r}}_{t,k}))$ is unknown. Inspired by the UCB strategy, we determine $H(\text{vec}(\widetilde{\mathbf{r}}_{t,k}))$ by its upper bound. Based on Lemma 4.1, we have $H(\text{vec}(\widetilde{\mathbf{r}}_{t,k}))\leq\bar{C}r_{t}^{k}$. Therefore, we set $H(\text{vec}(\widetilde{\mathbf{r}}_{t,k}))$ as:

$$H(\text{vec}(\widetilde{\mathbf{r}}_{t,k}))=\bar{C}r_{t}^{k},$$

because it reflects the maximal potential gain for bandit $k$. Then, in round $t$, we can collect $\mathcal{K}$ additional sample pairs:

$$\{(\widetilde{\mathbf{X}}_{t,k},\bar{C}r_{t}^{k})\}_{k\in[\mathcal{K}]},$$

where $[\mathcal{K}]$ denotes the set of bandits with available sub-rewards.

Accordingly, in each round $t$, we can collect up to $\mathcal{K}+1$ samples for training $\mathcal{F}$, denoted by $\Omega_{t}$:

(4) $\Omega_{t}=\{(\widetilde{\mathbf{X}}_{t,k},\bar{C}r_{t}^{k})\}_{k\in[\mathcal{K}]}\cup\{(\mathbf{X}_{t},R_{t})\}.$

Therefore, in each round, we train $\mathcal{F}$ as a whole, based on $\{\Omega_{i}\}_{i=1}^{t}$, as described in Algorithm 3.
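The construction of $\Omega_t$ can be sketched as follows; `build_omega_t` is a hypothetical helper, and the container format simply has to match whatever training routine consumes it.

```python
import torch

def build_omega_t(X_t, R_t, sub_rewards, dims, C_bar=1.0):
    """Build the training set Omega_t of Eq. (4) for one round.

    X_t:         list of K arm tensors actually played.
    R_t:         observed final reward.
    sub_rewards: dict {k: r_t^k} for the bandits whose sub-reward is observed.
    dims:        list of arm dimensions d_k (used for the zero placeholders).
    """
    omega_t = [(X_t, float(R_t))]                      # the always-available pair (X_t, R_t)
    for k, r_k in sub_rewards.items():
        # X~_{t,k}: keep arm k, zero out every other bandit's input.
        x_tilde = [X_t[j] if j == k else torch.zeros(dims[j]) for j in range(len(dims))]
        omega_t.append((x_tilde, C_bar * float(r_k)))  # surrogate target C_bar * r_t^k
    return omega_t
```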

Lemma 4.1.

Let $\mathbf{0}=(0,\dots,0)$ with $|\mathbf{0}|=K$. Given $\widetilde{\mathbf{X}}_{t,k}$ and $\widetilde{\mathbf{r}}_{t,k}$, we have $H(\text{vec}(\widetilde{\mathbf{r}}_{t,k}))\leq\bar{C}r_{t}^{k}$.

Proof of Lemma 4.1.

As $H$ is $\bar{C}$-Lipschitz continuous and $H(\mathbf{0})=0$ by assumption (1), we have

$$|H(\text{vec}(\widetilde{\mathbf{r}}_{t,k}))|=|H(\text{vec}(\widetilde{\mathbf{r}}_{t,k}))-H(\mathbf{0})|\leq\bar{C}\sqrt{\sum_{r\in\widetilde{\mathbf{r}}_{t,k}}(r-0)^{2}}=\bar{C}r_{t}^{k}.$$

4.3. Upper confidence bound

In this subsection, we present the arm selection criterion based on the upper confidence bound provided in Section 5 and then summarize the high-level idea of MuFasa.

In each round $t$, given an arm combination $\mathbf{X}_{t}$, the confidence bound of $\mathcal{F}$ with respect to $\mathcal{H}$ is defined as:

$$\mathbb{P}\left(|\mathcal{F}(\mathbf{X}_{t};\bm{\theta}_{t})-\mathcal{H}(\mathbf{X}_{t})|>\text{UCB}(\mathbf{X}_{t})\right)\leq\delta,$$

where $\text{UCB}(\mathbf{X}_{t})$ is defined in Theorem 5.3 and $\delta$ is usually a small constant. Then, in each round, given all possible arm combinations $\mathbf{S}_{t}$, the selected $K$ arms $\mathbf{X}_{t}$ are determined by:

(5) $\mathbf{X}_{t}=\arg\max_{\mathbf{X}_{t}^{\prime}\in\mathbf{S}_{t}}\left(\mathcal{F}(\mathbf{X}_{t}^{\prime};\bm{\theta}_{t-1})+\text{UCB}(\mathbf{X}_{t}^{\prime})\right).$

With this selection criterion, the workflow of MuFasa is depicted in Algorithm 1.
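A sketch of this selection step (Eq. (5)) over the enumerated combinations $\mathbf{S}_t$ could look like the following; `ucb` stands for any estimate of $\text{UCB}(\mathbf{X}^{\prime})$, such as the gradient-based bonus sketched after Lemma 5.2, and is passed in as a placeholder.

```python
import itertools
import torch

def select_arms(model, arm_sets, ucb):
    """Enumerate S_t = X_t^1 x ... x X_t^K and pick argmax of F(X'; theta) + UCB(X')."""
    best_score, best_combo = -float("inf"), None
    for combo in itertools.product(*arm_sets):        # one arm per bandit
        xs = [torch.as_tensor(x, dtype=torch.float32) for x in combo]
        with torch.no_grad():
            score = model(xs).item() + ucb(xs)
        if score > best_score:
            best_score, best_combo = score, combo
    return best_combo
```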

5. Regret Analysis

In this section, we provide the upper confidence bound and regret analysis of MuFasa when the sub-rewards are all available.

Before presenting Theorem 5.3, let us first focus on an $L$-layer fully-connected neural network $f(\mathbf{x}_{t};\bm{\theta})$ learning a ground-truth function $h(\mathbf{x}_{t})$, where $\mathbf{x}_{t}\in\mathbb{R}^{d}$. The parameters of $f$ are $\mathbf{W}_{1}\in\mathbb{R}^{m\times d}$, $\mathbf{W}_{i}\in\mathbb{R}^{m\times m}$ for all $i\in\{2,\dots,L-1\}$, and $\mathbf{W}_{L}\in\mathbb{R}^{1\times m}$. Given the context vectors $\{\mathbf{x}_{t}\}_{t=1}^{T}$ and the corresponding rewards $\{r_{t}\}_{t=1}^{T}$, we conduct gradient descent with the loss $\mathcal{L}$ to train $f$.

Built upon the Neural Tangent Kernel (NTK) (Jacot et al., 2018; Cao and Gu, 2019), $h(\mathbf{x}_{t})$ can be represented by a linear function with respect to the gradient $g(\mathbf{x}_{t};\bm{\theta}_{0})$, as introduced in (Zhou et al., 2020a) and stated in the following lemma.

Lemma 5.1 (Lemma 5.1 in (Zhou et al., 2020a)).

There exists a positive constant $C$ such that, with probability at least $1-\delta$, if $m\geq CT^{4}L^{6}\log(T^{2}L/\delta)/\lambda^{4}$, then for any $\mathbf{x}_{t}\in\{\mathbf{x}_{t}\}_{t=1}^{T}$ there exists a $\bm{\theta}^{\ast}$ such that

$$h(\mathbf{x}_{t})=\langle g(\mathbf{x}_{t};\bm{\theta}_{0}),\bm{\theta}^{\ast}\rangle.$$

Then, with the above linear representation of $h(\mathbf{x}_{t})$, we provide the following upper confidence bound with regard to $f$.

Lemma 5.2.

Given a set of context vectors $\{\mathbf{x}_{t}\}_{t=1}^{T}$ and the corresponding rewards $\{r_{t}\}_{t=1}^{T}$ with $\mathbb{E}(r_{t})=h(\mathbf{x}_{t})$ for any $\mathbf{x}_{t}\in\{\mathbf{x}_{i}\}_{i=1}^{T}$, let $f(\mathbf{x}_{t};\bm{\theta})$ be the $L$-layer fully-connected neural network with width $m$, learning rate $\eta$, and $\bm{\theta}\in\mathbb{R}^{P}$. Assume $\|\bm{\theta}^{\ast}\|_{2}\leq S/\sqrt{m}$. Then, there exist positive constants $C_{1},C_{2}$ such that if

$$m\geq\max\left\{\mathcal{O}\left(T^{7}\lambda^{-7}L^{21}(\log m)^{3}\right),\ \mathcal{O}\left(\lambda^{-1/2}L^{-3/2}(\log(TL^{2}/\delta))^{3/2}\right)\right\},$$
$$\eta=\mathcal{O}(TmL+m\lambda)^{-1},\qquad J\geq\widetilde{\mathcal{O}}(TL/\lambda),$$

then, with probability at least $1-\delta$, for any $\mathbf{x}_{t}\in\{\mathbf{x}_{t}\}_{t=1}^{T}$, we have the following upper confidence bound:

$$\left|h(\mathbf{x}_{t})-f(\mathbf{x}_{t};\bm{\theta}_{t})\right|\leq\gamma_{1}\|g(\mathbf{x}_{t};\bm{\theta}_{t})/\sqrt{m}\|_{\mathbf{A}_{t}^{-1}}+\gamma_{2}\|g(\mathbf{x}_{t};\bm{\theta}_{0})/\sqrt{m}\|_{\mathbf{A}_{t}^{\prime-1}}+\gamma_{1}\gamma_{3}+\gamma_{4},$$
where
$$\gamma_{1}(m,L)=(\lambda+t\mathcal{O}(L))\cdot\left((1-\eta m\lambda)^{J/2}\sqrt{t/\lambda}\right)+1,$$
$$\gamma_{2}(m,L,\delta)=\sqrt{\log\left(\frac{\det(\mathbf{A}_{t}^{\prime})}{\det(\lambda\mathbf{I})}\right)-2\log\delta}+\lambda^{1/2}S,$$
$$\gamma_{3}(m,L)=C_{2}m^{-1/6}\sqrt{\log m}\,t^{1/6}\lambda^{-7/6}L^{7/2},$$
$$\gamma_{4}(m,L)=C_{1}m^{-1/6}\sqrt{\log m}\,t^{2/3}\lambda^{-2/3}L^{3},$$
$$\mathbf{A}_{t}=\lambda\mathbf{I}+\sum_{i=1}^{t}g(\mathbf{x}_{i};\bm{\theta}_{t})g(\mathbf{x}_{i};\bm{\theta}_{t})^{\intercal}/m,\qquad\mathbf{A}_{t}^{\prime}=\lambda\mathbf{I}+\sum_{i=1}^{t}g(\mathbf{x}_{i};\bm{\theta}_{0})g(\mathbf{x}_{i};\bm{\theta}_{0})^{\intercal}/m.$$
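The data-dependent terms of this bound are gradient-norm terms of the form $\|g(\mathbf{x};\bm{\theta})/\sqrt{m}\|_{\mathbf{A}_{t}^{-1}}$. The following sketch shows how such a term can be computed and the matrix $\mathbf{A}_{t}$ maintained with autograd, assuming the flattened gradient fits in memory; the multiplicative constants $\gamma_{i}$ are omitted.

```python
import torch

def flat_grad(model, xs):
    """Flattened gradient g(x; theta) of the scalar network output w.r.t. all parameters."""
    out = model(xs).squeeze()
    grads = torch.autograd.grad(out, [p for p in model.parameters()])
    return torch.cat([g.reshape(-1) for g in grads])

class ConfidenceBound:
    """Maintains A_t = lambda*I + sum_i g_i g_i^T / m and evaluates ||g/sqrt(m)||_{A_t^{-1}}."""
    def __init__(self, num_params, lam=1.0, m=100):
        self.A = lam * torch.eye(num_params)
        self.m = m
    def update(self, g):
        self.A += torch.outer(g, g) / self.m          # rank-one update with a new gradient
        return self
    def bonus(self, g):
        v = g / self.m ** 0.5
        return torch.sqrt(v @ torch.linalg.solve(self.A, v)).item()
```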

Now we are ready to provide an extended upper confidence bound for the proposed neural network model \mathcal{F}.

Theorem 5.3.

Given the selected contexts $\{\mathbf{X}_{t}\}_{t=1}^{T}$, the final rewards $\{R_{t}\}_{t=1}^{T}$, and all sub-rewards $\{\mathbf{r}_{t}\}_{t=1}^{T}$, let $\mathcal{F}$ be the neural network model in MuFasa. In each round $t$, under the conditions of Lemma 5.2 and supposing $\widetilde{m}=1$, with probability at least $1-\delta$, for any $t\in[T]$, we have the following upper confidence bound:

$$|\mathcal{F}(\mathbf{X}_{t};\bm{\theta}_{t})-\mathcal{H}(\mathbf{X}_{t})|\leq\bar{C}\sum_{k=1}^{K}\mathcal{B}^{k}+\mathcal{B}^{F}=\text{UCB}(\mathbf{X}_{t}),$$
where
$$\mathcal{B}^{k}=\gamma_{1}\|g_{k}(\mathbf{x}_{t}^{k};\bm{\theta}_{t}^{k})/\sqrt{m_{1}}\|_{(\mathbf{A}^{k}_{t})^{-1}}+\gamma_{2}\Big(\tfrac{\delta}{K+1}\Big)\|g_{k}(\mathbf{x}_{t}^{k};\bm{\theta}_{0}^{k})/\sqrt{m_{1}}\|_{(\mathbf{A}^{k^{\prime}}_{t})^{-1}}+\gamma_{1}\gamma_{3}+\gamma_{4},$$
$$\mathcal{B}^{F}=\gamma_{1}\|G(\mathbf{f}_{t};\bm{\theta}_{t}^{\Sigma})/\sqrt{m_{2}}\|_{(\mathbf{A}^{F}_{t})^{-1}}+\gamma_{2}\Big(\tfrac{\delta}{K+1}\Big)\|G(\mathbf{f}_{t};\bm{\theta}_{0}^{\Sigma})/\sqrt{m_{2}}\|_{(\mathbf{A}^{F^{\prime}}_{t})^{-1}}+\gamma_{1}\gamma_{3}+\gamma_{4},$$
$$\mathbf{A}^{k}_{t}=\lambda\mathbf{I}+\sum_{i=1}^{t}g_{k}(\mathbf{x}_{i}^{k};\bm{\theta}_{t}^{k})g_{k}(\mathbf{x}_{i}^{k};\bm{\theta}_{t}^{k})^{\intercal}/m_{1},\qquad\mathbf{A}^{k^{\prime}}_{t}=\lambda\mathbf{I}+\sum_{i=1}^{t}g_{k}(\mathbf{x}_{i}^{k};\bm{\theta}_{0}^{k})g_{k}(\mathbf{x}_{i}^{k};\bm{\theta}_{0}^{k})^{\intercal}/m_{1},$$
$$\mathbf{A}^{F}_{t}=\lambda\mathbf{I}+\sum_{i=1}^{t}G(\mathbf{f}_{i};\bm{\theta}_{t}^{\Sigma})G(\mathbf{f}_{i};\bm{\theta}_{t}^{\Sigma})^{\intercal}/m_{2},\qquad\mathbf{A}^{F^{\prime}}_{t}=\lambda\mathbf{I}+\sum_{i=1}^{t}G(\mathbf{f}_{i};\bm{\theta}_{0}^{\Sigma})G(\mathbf{f}_{i};\bm{\theta}_{0}^{\Sigma})^{\intercal}/m_{2}.$$

With the above UCB, we provide the following regret bound of MuFasa.

Theorem 5.4.

Given the number of rounds $T$, suppose that the final reward and all the sub-rewards are available, and let $\mathcal{F}$ be the neural network model of MuFasa, satisfying the conditions in Theorem 5.3. Then, assuming $\bar{C}=1$, $m_{1}=m_{2}=m$, $L_{1}=L_{2}=L$, and thus $P_{1}=P_{2}=P$, with probability at least $1-\delta$, the regret of MuFasa is upper bounded by:

$$\textbf{Reg}\leq(\bar{C}K+1)\sqrt{T}\cdot 2\sqrt{\widetilde{P}\log(1+T/\lambda)+1/\lambda+1}\cdot\bigg(\sqrt{(\widetilde{P}-2)\log\Big(\frac{(\lambda+T)(1+K)}{\lambda\delta}\Big)+1/\lambda}+\lambda^{1/2}S+2\bigg)+2(\bar{C}K+1),$$

where $\widetilde{P}$ is the effective dimension defined in the Appendix (Definition 8.3).

Proof of Theorem 5.4. First, the regret of one round $t$ is:

$$\begin{aligned}
\text{Reg}_{t}&=\mathcal{H}(\mathbf{X}_{t}^{\ast})-\mathcal{H}(\mathbf{X}_{t})\\
&\leq|\mathcal{H}(\mathbf{X}_{t}^{\ast})-\mathcal{F}(\mathbf{X}_{t}^{\ast})|+\mathcal{F}(\mathbf{X}_{t}^{\ast})-\mathcal{H}(\mathbf{X}_{t})\\
&\leq\text{UCB}(\mathbf{X}_{t}^{\ast})+\mathcal{F}(\mathbf{X}_{t}^{\ast})-\mathcal{H}(\mathbf{X}_{t})\\
&\leq\text{UCB}(\mathbf{X}_{t})+\mathcal{F}(\mathbf{X}_{t})-\mathcal{H}(\mathbf{X}_{t})\leq 2\,\text{UCB}(\mathbf{X}_{t}),
\end{aligned}$$

where the third inequality is due to the selection criterion of MuFasa, which guarantees $\mathcal{F}(\mathbf{X}_{t}^{\ast})+\text{UCB}(\mathbf{X}_{t}^{\ast})\leq\mathcal{F}(\mathbf{X}_{t})+\text{UCB}(\mathbf{X}_{t})$. Thus, we have

$$\textbf{Reg}=\sum_{t=1}^{T}\text{Reg}_{t}\leq 2\sum_{t=1}^{T}\text{UCB}(\mathbf{X}_{t})\leq 2\sum_{t=1}^{T}\left(\bar{C}\sum_{k=1}^{K}\mathcal{B}^{k}+\mathcal{B}^{F}\right).$$

First, for any $k\in[K]$, we bound

$$\sum_{t=1}^{T}\mathcal{B}^{k}\leq\underbrace{\gamma_{1}\sum_{t=1}^{T}\|g(\mathbf{x}_{t};\bm{\theta}_{t})/\sqrt{m}\|_{\mathbf{A}_{t}^{-1}}}_{\mathbf{I}_{1}}+\underbrace{\gamma_{2}\sum_{t=1}^{T}\|g(\mathbf{x}_{t};\bm{\theta}_{0})/\sqrt{m}\|_{\mathbf{A}_{t}^{\prime-1}}}_{\mathbf{I}_{2}}+T\gamma_{1}\gamma_{3}+T\gamma_{4}.$$

By Lemma 11 in (Abbasi-Yadkori et al., 2011), we have

$$\begin{aligned}
\mathbf{I}_{1}&\leq\gamma_{1}\sqrt{T\left(\sum_{t=1}^{T}\|g(\mathbf{x}_{t};\bm{\theta}_{t})/\sqrt{m}\|^{2}_{\mathbf{A}_{t}^{-1}}\right)}\leq\gamma_{1}\sqrt{T\log\frac{\det(\mathbf{A}_{T})}{\det(\lambda\mathbf{I})}}\\
&\leq\gamma_{1}\sqrt{T\left(\log\frac{\det(\mathbf{A}_{T}^{\prime})}{\det(\lambda\mathbf{I})}+\Big|\log\frac{\det(\mathbf{A}_{T})}{\det(\lambda\mathbf{I})}-\log\frac{\det(\mathbf{A}_{T}^{\prime})}{\det(\lambda\mathbf{I})}\Big|\right)}\\
&\leq\gamma_{1}\sqrt{T\left(\widetilde{P}\log(1+T/\lambda)+1/\lambda+1\right)},
\end{aligned}$$

where the last inequality is based on Lemma 8.4 and the choice of $m$. Then, applying Lemma 11 in (Abbasi-Yadkori et al., 2011) and Lemma 8.4 again, we have

$$\begin{aligned}
\mathbf{I}_{2}&\leq\gamma_{2}\sqrt{T\log\frac{\det(\mathbf{A}_{T}^{\prime})}{\det(\lambda\mathbf{I})}}\\
&\leq\left(\sqrt{(\widetilde{P}-2)\log\left(\frac{(\lambda+T)(1+K)}{\lambda\delta}\right)+1/\lambda}+\lambda^{1/2}S\right)\cdot\sqrt{T\left(\widetilde{P}\log(1+T/\lambda)+1/\lambda\right)}.
\end{aligned}$$

By the choice of $J$, $\gamma_{1}\leq 2$. Then, as $m$ is sufficiently large, we have

$$T\gamma_{1}\gamma_{3}\leq 1,\qquad T\gamma_{4}\leq 1.$$

Then, because $m_{1}=m_{2}$, $L_{1}=L_{2}$, and $\bar{C}=1$, we have

$$\textbf{Reg}\leq 2(\bar{C}K+1)\sum_{t=1}^{T}\mathcal{B}^{k}.$$

Putting everything together proves the claim.

$\widetilde{P}$ is the effective dimension defined by the eigenvalues of the NTK (Definition 8.3 in the Appendix). The effective dimension was first introduced by (Valko et al., 2013) to analyze kernelized contextual bandits, and was later extended to analyze kernel-based Q-learning (Yang and Wang, 2020) and neural-network-based bandits (Zhou et al., 2020a). $\widetilde{P}$ can be much smaller than the real dimension $P$, which alleviates the predicament when $P$ is extremely large.

Theorem 5.4 provides the $\widetilde{\mathcal{O}}\left((K+1)\sqrt{T}\right)$ regret bound for MuFasa, which is near-optimal compared with a single bandit ($\widetilde{\mathcal{O}}(\sqrt{T})$), either linear (Abbasi-Yadkori et al., 2011) or non-linear (Valko et al., 2013; Zhou et al., 2020a). With different widths $m_{1},m_{2}$ and Lipschitz constant $\bar{C}$, the regret bound of MuFasa becomes $\widetilde{\mathcal{O}}((\bar{C}K+1)\sqrt{T})$.

6. experiments

To evaluate the empirical performance of MuFasa, in this section, we design two different multi-facet bandit problems on three real-world data sets. The experiments are divided into two parts to evaluate the effects of final rewards and the availability of sub-rewards. The code has been released at https://github.com/banyikun/KDD2021_MuFasa.

Recommendation: Yelp (https://www.yelp.com/dataset). Yelp is a data set released in the Yelp data set challenge, which consists of 4.7 million rating entries for $1.57\times 10^{5}$ restaurants by $1.18$ million users. In addition to the features of restaurants, this data set also provides the attributes of each user and the list of his/her friends. On this data set, we evaluate MuFasa on personalized recommendation, where the learner needs to simultaneously recommend a restaurant and a friend (user) to a served user. Naturally, this problem can be formulated as $2$ bandits, in which one set of arms $\mathbf{X}_{t}^{1}$ represents the candidate restaurants and the other set of arms $\mathbf{X}_{t}^{2}$ forms the pool of friends for recommendation. We apply LocallyLinearEmbedding (Roweis and Saul, 2000) to train a $10$-dimensional feature vector $\mathbf{x}_{t,i}^{1}$ for each restaurant and a $6$-dimensional feature vector $\mathbf{x}_{t,j}^{2}$ for each user. Then, for the restaurant, we define the reward according to the rating star: the reward $r_{t}^{1}$ is $1$ if the number of rating stars is more than 3 (out of 5); otherwise, the reward $r_{t}^{1}$ is $0$. For friends, the reward $r_{t}^{2}=1$ if the recommended friend is in fact included in the friend list of the served user; otherwise $r_{t}^{2}=0$. To build the arm sets, we extract the rating entries and friend lists of the top-10 users with the most ratings. In each round $t$, we build the arm sets $\mathbf{X}_{t}^{1}$ and $\mathbf{X}_{t}^{2}$ by picking one restaurant/friend with reward $1$ and then randomly picking the other $9$ restaurants/friends with reward $0$. Thus $|\mathbf{X}_{t}^{1}|=|\mathbf{X}_{t}^{2}|=10$.
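A sketch of this per-round arm-set construction (one reward-1 arm plus nine random reward-0 arms per bandit) is shown below; `positives` and `negatives` are hypothetical pools of pre-computed feature vectors.

```python
import numpy as np

def build_arm_set(positives, negatives, n_arms=10, rng=np.random.default_rng()):
    """Return (arm_set, index_of_positive): one reward-1 arm and n_arms-1 reward-0 arms."""
    pos = positives[rng.integers(len(positives))]
    negs = [negatives[i] for i in rng.choice(len(negatives), size=n_arms - 1, replace=False)]
    arms = [pos] + negs
    order = rng.permutation(n_arms)                  # shuffle so the positive is not always first
    return [arms[i] for i in order], int(np.where(order == 0)[0][0])
```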

Classification: Mnist (LeCun et al., 1998) + NotMnist. These are two well-known 10-class classification data sets. The evaluation of contextual bandits has been adapted to the classification problem (Zhou et al., 2020a; Deshmukh et al., 2017; Valko et al., 2013). Therefore, we utilize these two similar classification data sets to construct $2$ bandits, where each 10-class classification task is converted into a $10$-armed contextual bandit. Consider a sample image $\mathbf{x}\in\mathbb{R}^{d}$ that we intend to classify among $10$ classes. Under the contextual bandit setting, $\mathbf{x}$ is transformed into $10$ arms: $\mathbf{x}_{1}=(\mathbf{x},\mathbf{0},\dots,\mathbf{0}),\ \mathbf{x}_{2}=(\mathbf{0},\mathbf{x},\dots,\mathbf{0}),\ \dots,\ \mathbf{x}_{10}=(\mathbf{0},\mathbf{0},\dots,\mathbf{x})\in\mathbb{R}^{10d}$, matching the $10$ classes. In consequence, the reward is $1$ if the learner plays the arm that matches the real class of $\mathbf{x}$; otherwise, the reward is $0$. In this way, we construct two contextual bandits for these two data sets, denoted by $(\mathbf{X}_{t}^{1},r_{t}^{1})$ and $(\mathbf{X}_{t}^{2},r_{t}^{2})$. Then, in each round, the arm pools satisfy $|\mathbf{X}_{t}^{1}|=|\mathbf{X}_{t}^{2}|=10$.
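This classification-to-bandit conversion is a block one-hot construction; a minimal sketch for a flattened sample $\mathbf{x}$ of dimension $d$ is given below.

```python
import numpy as np

def to_arms(x, num_classes=10):
    """Map a sample x in R^d to num_classes arms in R^{num_classes * d}."""
    d = x.shape[0]
    arms = np.zeros((num_classes, num_classes * d))
    for c in range(num_classes):
        arms[c, c * d:(c + 1) * d] = x            # arm c places x in the c-th block
    return arms

# Playing arm c yields reward 1 iff c equals the true label of x.
arms = to_arms(np.random.rand(784))               # e.g., a flattened 28x28 Mnist image
```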

To evaluate the effect of different final reward functions, given the sub-rewards $\mathbf{r}_{t}=\{r_{t}^{1},r_{t}^{2}\}$, we design the following two final reward functions:

(6) $H_{1}(\text{vec}(\mathbf{r}_{t}))=r_{t}^{1}+r_{t}^{2};\ \ H_{2}(\text{vec}(\mathbf{r}_{t}))=2r_{t}^{1}+r_{t}^{2}.$

$H_{1}$ describes a task where each bandit contributes equally, whereas $H_{2}$ represents tasks where the bandits have different importance.

As the problem setting is new, there are no existing algorithms that can be directly adapted to this problem. Therefore, we construct baselines by extending bandit algorithms that work on a single bandit, as follows:

(1) (K-)LinUCB. LinUCB (Li et al., 2010) is a linear contextual bandit algorithm in which the reward function is assumed to be the dot product of the arm feature vector and an unknown user parameter; the UCB strategy is then applied to select an arm in each round. To adapt it to the multi-facet bandit problem, we duplicate LinUCB for the $K$ bandits. For example, on the Yelp data set, we use two LinUCB instances to recommend restaurants and friends, respectively.

(2) (K-)KerUCB. KerUCB (Valko et al., 2013) makes use of a predefined kernel matrix to learn the reward function and then builds a UCB for exploration. We replicate $K$ KerUCB instances to adapt it to this problem.

(3) (K-)NeuUCB. NeuUCB (Zhou et al., 2020a) uses a fully-connected neural network to learn one reward function with the UCB strategy. Similarly, we duplicate it to $K$ bandits.

Configurations. For MuFasa, each sub-network $f_{k}(\mathbf{x}_{t}^{k};\bm{\theta}^{k})$ is set as a two-layer network: $f_{k}(\mathbf{x}_{t}^{k};\bm{\theta}^{k})=\sqrt{m_{1}}\mathbf{W}_{2}\sigma(\mathbf{W}_{1}\mathbf{x}_{t}^{k})$, where $\mathbf{W}_{1}\in\mathbb{R}^{m_{1}\times d_{k}}$, $\mathbf{W}_{2}\in\mathbb{R}^{\widetilde{m}\times m_{1}}$, and $m_{1}=\widetilde{m}=100$. The shared layers are $F(\mathbf{f}_{t};\bm{\theta}^{\Sigma})=\sqrt{m_{2}}\mathbf{W}_{2}\sigma(\mathbf{W}_{1}\mathbf{f}_{t})$, where $\mathbf{W}_{1}\in\mathbb{R}^{m_{2}\times 2\widetilde{m}}$, $\mathbf{W}_{2}\in\mathbb{R}^{1\times m_{2}}$, and $m_{2}=100$. For $H_{1}$, $\bar{C}$ is set to $1$; for $H_{2}$, it is set to $2$. To learn the $K$ bandits jointly, in the experiments, we evaluate the performance of Algorithm 1 combined with Algorithm 3. For K-NeuUCB, each NeuUCB is set as a $4$-layer fully-connected network with the same width $m=100$ for a fair comparison. The learning rate $\eta$ is set to $0.01$ and the upper bound of the ground-truth parameter is $S=1$ for these two methods. To accelerate the training process, we update the parameters of the neural networks every $50$ rounds. For KerUCB, we use the radial basis function (RBF) kernel and stop adding contexts to KerUCB after 1000 rounds, following the same setting for Gaussian Processes in (Riquelme et al., 2018; Zhou et al., 2020a). For all the methods, the confidence level is $\delta=0.1$ and the regularization parameter is $\lambda=1$. All experiments are run $5$ times and the average results are reported.

6.1. Result 1: All sub-rewards with different $H$

Figure 1. Regret comparison on Yelp with the $H_{1}$ final reward function.
Figure 2. Regret comparison on Yelp with the $H_{2}$ final reward function.

With the final reward function $H_{1}$ (Eq. (6)), Figure 1 and Figure 3 report the regret of all methods on the Yelp and Mnist+NotMnist data sets, where the top two sub-figures show the regret of each bandit and the bottom sub-figure shows the regret of the whole task (2 bandits). These figures show that MuFasa achieves the best performance (the smallest regret), because it utilizes a comprehensive upper confidence bound built on the assembled neural network to select two arms jointly in each round. This indicates the good performance of MuFasa on personalized recommendation and classification. Among the baselines, NeuUCB achieves the best performance, thanks to the representation power of neural networks. However, it chooses each arm separately, neglecting the collaborative relations among the $K$ bandits. KerUCB shows the limitation of simple kernels such as the radial basis function compared to neural networks. LinUCB fails to handle either task, as it assumes a linear reward function and thus usually cannot learn the complicated reward functions arising in practice.

With the final reward function $H_{2}$ (Eq. (6)), Figure 2 and Figure 4 depict the regret comparison on the Yelp and Mnist+NotMnist data sets. The final reward function $H_{2}$ indicates that bandit 1 carries more weight than bandit 2 in the task. Therefore, to minimize the regret, the algorithm should give priority to bandit 1 when making decisions. By design, the neural network $\mathcal{F}$ of MuFasa can learn this relation among the bandits. For example, on the Mnist and NotMnist data sets, consider two candidate arm sets $\{x_{t,i_{1}}^{1},x_{t,i_{2}}^{2}\}$ and $\{x_{t,j_{1}}^{1},x_{t,j_{2}}^{2}\}$. The first arm set receives reward $1$ on Mnist and reward $0$ on NotMnist; in contrast, the second arm set receives reward $0$ on Mnist and reward $1$ on NotMnist. Because the two bandits have different weights, the two arm sets have different final rewards, i.e., $R_{t}^{1}=2$ and $R_{t}^{2}=1$, respectively. To maximize the final reward, the learner should select the first arm set instead of the second. As MuFasa can learn the weights of the bandits, it gives more weight to the first bandit and thus selects the first arm set. On the contrary, all the baselines treat each bandit equally and thus would choose between these two arm sets randomly. Therefore, under the setting of $H_{2}$, this advantage allows MuFasa to further decrease the regret on both the Yelp and Mnist+NotMnist data sets. For instance, on the Mnist+NotMnist data sets, MuFasa with $H_{2}$ decreases the regret by $20\%$ over NeuUCB, while MuFasa with $H_{1}$ decreases it by $17.8\%$.

Figure 3. Regret comparison on Mnist+NotMnist with $H_{1}$.
Figure 4. Regret comparison on Mnist+NotMnist with $H_{2}$.

6.2. Result 2: Partial sub-rewards

As the sub-rewards are not always available in many cases, in this subsection, we evaluate MuFasa with partially available sub-rewards on the Yelp and Mnist+NotMnist data sets. We build two additional variants of MuFasa: (1) MuFasa (One sub-reward) is provided with the final reward and only the sub-reward of the first bandit; (2) MuFasa (No sub-reward) does not receive any sub-rewards, only the final reward. Here, we use $H_{1}$ as the final reward function.

Figure 5 and Figure 6 show the regret comparison with these two variants of MuFasa, where MuFasa exhibits robustness with respect to the lack of sub-rewards. Indeed, the sub-rewards provide more information to learn from, yet MuFasa (One sub-reward) still outperforms all the baselines, because the final reward enables MuFasa to learn all the bandits jointly and the available sub-reward strengthens its capacity to learn the exclusive features of each bandit. In contrast, all the baselines treat each bandit separately. Without any sub-rewards, MuFasa still achieves acceptable performance. On the Yelp data set, the regret of MuFasa (No sub-reward) is still lower than that of the best baseline, NeuUCB, despite lacking considerable information. On the Mnist+NotMnist data set, although MuFasa (No sub-reward) does not outperform the baselines, its performance is still close to NeuUCB. Therefore, as long as the final reward is provided, MuFasa can tackle the multi-facet problem effectively. Moreover, MuFasa can leverage available sub-rewards to improve performance.

Figure 5. Regret comparison on Yelp with different reward availability.
Figure 6. Regret comparison on Mnist+NotMnist with different reward availability.

7. Conclusion

In this paper, we define and study the novel problem of multi-facet contextual bandits, motivated by real applications such as comprehensive personalized recommendation and healthcare. We propose a new bandit algorithm, MuFasa, which utilizes neural networks to learn the reward functions of multiple bandits jointly and explores new information via a comprehensive upper confidence bound. Moreover, we prove that MuFasa can achieve the $\widetilde{\mathcal{O}}((K+1)\sqrt{T})$ regret bound under mild assumptions. Finally, we conduct extensive experiments to show the effectiveness of MuFasa on personalized recommendation and classification tasks, as well as its robustness when sub-rewards are lacking.

Acknowledgement

This work is supported by National Science Foundation under Award No. IIS-1947203 and IIS-2002540, and the U.S. Department of Homeland Security under Grant Award Number 17STQAC00001-03-03. The views and conclusions are those of the authors and should not be interpreted as representing the official policies of the funding agencies or the government.

References

  • Abbasi-Yadkori et al. (2011) Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. 2011. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems. 2312–2320.
  • Agrawal and Goyal (2013) Shipra Agrawal and Navin Goyal. 2013. Thompson sampling for contextual bandits with linear payoffs. In International Conference on Machine Learning. PMLR, 127–135.
  • Allen-Zhu et al. (2019) Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. 2019. A convergence theory for deep learning via over-parameterization. In International Conference on Machine Learning. PMLR, 242–252.
  • Allesiardo et al. (2014) Robin Allesiardo, Raphaël Féraud, and Djallel Bouneffouf. 2014. A neural networks committee for the contextual bandit problem. In International Conference on Neural Information Processing. Springer, 374–381.
  • Arora et al. (2019) Sanjeev Arora, Simon S Du, Wei Hu, Zhiyuan Li, Russ R Salakhutdinov, and Ruosong Wang. 2019. On exact computation with an infinitely wide neural net. In Advances in Neural Information Processing Systems. 8141–8150.
  • Audibert and Bubeck (2010) Jean-Yves Audibert and Sébastien Bubeck. 2010. Best arm identification in multi-armed bandits. In Conference on Learning Theory (COLT). 41–53.
  • Auer et al. (2002a) Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. 2002a. Finite-time analysis of the multiarmed bandit problem. Machine learning 47, 2-3 (2002), 235–256.
  • Auer et al. (2002b) Peter Auer, Nicolo Cesa-Bianchi, Yoav Freund, and Robert E Schapire. 2002b. The nonstochastic multiarmed bandit problem. SIAM journal on computing 32, 1 (2002), 48–77.
  • Azizzadenesheli et al. (2018) Kamyar Azizzadenesheli, Emma Brunskill, and Animashree Anandkumar. 2018. Efficient exploration through bayesian deep q-networks. In 2018 Information Theory and Applications Workshop (ITA). IEEE, 1–9.
  • Ban and He (2020) Yikun Ban and Jingrui He. 2020. Generic Outlier Detection in Multi-Armed Bandit. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 913–923.
  • Ban and He (2021) Yikun Ban and Jingrui He. 2021. Local Clustering in Contextual Multi-Armed Bandits. arXiv preprint arXiv:2103.00063 (2021).
  • Bastani and Bayati (2020) Hamsa Bastani and Mohsen Bayati. 2020. Online decision making with high-dimensional covariates. Operations Research 68, 1 (2020), 276–294.
  • Bubeck et al. (2011) Sébastien Bubeck, Rémi Munos, Gilles Stoltz, and Csaba Szepesvári. 2011. X-Armed Bandits. Journal of Machine Learning Research 12, 5 (2011).
  • Buccapatnam et al. (2013) Swapna Buccapatnam, Atilla Eryilmaz, and Ness B Shroff. 2013. Multi-armed bandits in the presence of side observations in social networks. In 52nd IEEE Conference on Decision and Control. IEEE, 7309–7314.
  • Cao and Gu (2019) Yuan Cao and Quanquan Gu. 2019. Generalization bounds of stochastic gradient descent for wide and deep neural networks. In Advances in Neural Information Processing Systems. 10836–10846.
  • Chen et al. (2013) Wei Chen, Yajun Wang, and Yang Yuan. 2013. Combinatorial multi-armed bandit: General framework and applications. In International Conference on Machine Learning. PMLR, 151–159.
  • Chen et al. (2016) Wei Chen, Yajun Wang, Yang Yuan, and Qinshi Wang. 2016. Combinatorial multi-armed bandit and its extension to probabilistically triggered arms. The Journal of Machine Learning Research 17, 1 (2016), 1746–1778.
  • Deshmukh et al. (2017) Aniket Anand Deshmukh, Urun Dogan, and Clay Scott. 2017. Multi-task learning for contextual bandits. In Advances in neural information processing systems. 4848–4856.
  • Dimakopoulou et al. (2019) Maria Dimakopoulou, Zhengyuan Zhou, Susan Athey, and Guido Imbens. 2019. Balanced linear contextual bandits. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 3445–3453.
  • Durand et al. (2018) Audrey Durand, Charis Achilleos, Demetris Iacovides, Katerina Strati, Georgios D Mitsis, and Joelle Pineau. 2018. Contextual bandits for adapting treatment in a mouse model of de novo carcinogenesis. In Machine Learning for Healthcare Conference. 67–82.
  • Fu et al. (2020) Dongqi Fu, Zhe Xu, Bo Li, Hanghang Tong, and Jingrui He. 2020. A View-Adversarial Framework for Multi-View Network Embedding. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management. 2025–2028.
  • Gentile et al. (2017) Claudio Gentile, Shuai Li, Purushottam Kar, Alexandros Karatzoglou, Giovanni Zappella, and Evans Etrue. 2017. On context-dependent clustering of bandits. In Proceedings of the 34th International Conference on Machine Learning-Volume 70. JMLR. org, 1253–1262.
  • Gentile et al. (2014) Claudio Gentile, Shuai Li, and Giovanni Zappella. 2014. Online clustering of bandits. In International Conference on Machine Learning. 757–765.
  • Jacot et al. (2018) Arthur Jacot, Franck Gabriel, and Clément Hongler. 2018. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in neural information processing systems. 8571–8580.
  • Jing et al. (2021) Baoyu Jing, Chanyoung Park, and Hanghang Tong. 2021. HDMI: High-order Deep Multiplex Infomax. arXiv preprint arXiv:2102.07810 (2021).
  • Langford and Zhang (2008) John Langford and Tong Zhang. 2008. The epoch-greedy algorithm for multi-armed bandits with side information. In Advances in neural information processing systems. 817–824.
  • LeCun et al. (1998) Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. 1998. Gradient-based learning applied to document recognition. Proc. IEEE 86, 11 (1998), 2278–2324.
  • Li et al. (2010) Lihong Li, Wei Chu, John Langford, and Robert E Schapire. 2010. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th international conference on World wide web. 661–670.
  • Li et al. (2016a) Shuai Li, Alexandros Karatzoglou, and Claudio Gentile. 2016a. Collaborative filtering bandits. In Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval. 539–548.
  • Li et al. (2016b) Shuai Li, Baoxiang Wang, Shengyu Zhang, and Wei Chen. 2016b. Contextual combinatorial cascading bandits. In International conference on machine learning. PMLR, 1245–1253.
  • Lipton et al. (2018) Zachary Lipton, Xiujun Li, Jianfeng Gao, Lihong Li, Faisal Ahmed, and Li Deng. 2018. Bbq-networks: Efficient exploration in deep reinforcement learning for task-oriented dialogue systems. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 32.
  • Liu et al. (2011) Haoyang Liu, Keqin Liu, and Qing Zhao. 2011. Logarithmic weak regret of non-bayesian restless multi-armed bandit. In 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 1968–1971.
  • Qin et al. (2014) Lijing Qin, Shouyuan Chen, and Xiaoyan Zhu. 2014. Contextual combinatorial bandit and its application on diversified online recommendation. In Proceedings of the 2014 SIAM International Conference on Data Mining. SIAM, 461–469.
  • Riquelme et al. (2018) Carlos Riquelme, George Tucker, and Jasper Snoek. 2018. Deep bayesian bandits showdown: An empirical comparison of bayesian deep networks for thompson sampling. arXiv preprint arXiv:1802.09127 (2018).
  • Roweis and Saul (2000) Sam T Roweis and Lawrence K Saul. 2000. Nonlinear dimensionality reduction by locally linear embedding. science 290, 5500 (2000), 2323–2326.
  • Srinivas et al. (2009) Niranjan Srinivas, Andreas Krause, Sham M Kakade, and Matthias Seeger. 2009. Gaussian process optimization in the bandit setting: No regret and experimental design. arXiv preprint arXiv:0912.3995 (2009).
  • Su and Khoshgoftaar (2009) Xiaoyuan Su and Taghi M Khoshgoftaar. 2009. A survey of collaborative filtering techniques. Advances in artificial intelligence 2009 (2009).
  • Thompson (1933) William R Thompson. 1933. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 25, 3/4 (1933), 285–294.
  • Valko et al. (2013) Michal Valko, Nathaniel Korda, Rémi Munos, Ilias Flaounas, and Nelo Cristianini. 2013. Finite-time analysis of kernelised contextual bandits. arXiv preprint arXiv:1309.6869 (2013).
  • Wu et al. (2016) Qingyun Wu, Huazheng Wang, Quanquan Gu, and Hongning Wang. 2016. Contextual bandits in a collaborative environment. In Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval. 529–538.
  • Yang and Wang (2020) Lin Yang and Mengdi Wang. 2020. Reinforcement learning in feature space: Matrix bandit, kernels, and regret bound. In International Conference on Machine Learning. PMLR, 10746–10756.
  • Zahavy and Mannor (2019) Tom Zahavy and Shie Mannor. 2019. Deep neural linear bandits: Overcoming catastrophic forgetting through likelihood matching. arXiv preprint arXiv:1901.08612 (2019).
  • Zheng et al. (2021) Lecheng Zheng, Yu Cheng, Hongxia Yang, Nan Cao, and Jingrui He. 2021. Deep Co-Attention Network for Multi-View Subspace Learning. arXiv preprint arXiv:2102.07751 (2021).
  • Zhou et al. (2015) Dawei Zhou, Jingrui He, K Selçuk Candan, and Hasan Davulcu. 2015. MUVIR: Multi-View Rare Category Detection.. In IJCAI. Citeseer, 4098–4104.
  • Zhou et al. (2020a) Dongruo Zhou, Lihong Li, and Quanquan Gu. 2020a. Neural contextual bandits with UCB-based exploration. In International Conference on Machine Learning. PMLR, 11492–11502.
  • Zhou et al. (2020c) Dawei Zhou, Lecheng Zheng, Yada Zhu, Jianbo Li, and Jingrui He. 2020c. Domain adaptive multi-modality neural attention network for financial forecasting. In Proceedings of The Web Conference 2020. 2230–2240.
  • Zhou et al. (2020b) Yao Zhou, Jianpeng Xu, Jun Wu, Zeinab Taghavi Nasrabadi, Evren Korpeoglu, Kannan Achan, and Jingrui He. 2020b. GAN-based Recommendation with Positive-Unlabeled Sampling. arXiv preprint arXiv:2012.06901 (2020).

8. Appendix

Definition 8.1.

Given the context vectors $\{\mathbf{x}_t\}_{t=1}^{T}$ and the rewards $\{r_t\}_{t=1}^{T}$, we define the estimation $\bm{\theta}^{\prime}$ via ridge regression:

\displaystyle\mathbf{A}_{t}^{\prime}=\lambda\mathbf{I}+\sum_{i=1}^{t}g(\mathbf{x}_{i};\bm{\theta}_{0})g(\mathbf{x}_{i};\bm{\theta}_{0})^{\intercal}/m
\displaystyle\mathbf{b}_{t}^{\prime}=\sum_{i=1}^{t}r_{i}g(\mathbf{x}_{i};\bm{\theta}_{0})/\sqrt{m}
\displaystyle\bm{\theta}^{\prime}=\mathbf{A}_{t}^{\prime-1}\mathbf{b}_{t}^{\prime}
\displaystyle\mathbf{A}_{t}=\lambda\mathbf{I}+\sum_{i=1}^{t}g(\mathbf{x}_{i};\bm{\theta}_{t})g(\mathbf{x}_{i};\bm{\theta}_{t})^{\intercal}/m
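To make the construction concrete, the following minimal NumPy sketch transcribes $\mathbf{A}_t^{\prime}$, $\mathbf{b}_t^{\prime}$ and $\bm{\theta}^{\prime}$, assuming the gradients $g(\mathbf{x}_i;\bm{\theta}_0)$ have already been flattened into vectors of length $P$; the helper name and argument conventions are illustrative choices, not part of the paper.

```python
import numpy as np

def ridge_estimate(grads0, rewards, m, lam=1.0):
    """Sketch of the auxiliary ridge-regression estimator in Definition 8.1.

    grads0:  list of flattened gradients g(x_i; theta_0), each of length P
    rewards: list of scalar rewards r_i
    m:       network width; lam: regularization parameter lambda
    """
    G = np.stack(grads0)                  # shape (t, P)
    r = np.asarray(rewards, dtype=float)  # shape (t,)
    P = G.shape[1]
    A = lam * np.eye(P) + G.T @ G / m     # A'_t = lambda*I + sum_i g_i g_i^T / m
    b = G.T @ r / np.sqrt(m)              # b'_t = sum_i r_i g_i / sqrt(m)
    theta_prime = np.linalg.solve(A, b)   # theta' = (A'_t)^{-1} b'_t
    return theta_prime, A
```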
Definition 8.2 (NTK (Jacot et al., 2018; Cao and Gu, 2019)).

Let 𝒩\mathcal{N} denote the normal distribution. Define

𝐌i,j0=𝚺i,j0=𝐱i,𝐱j,𝐍i,jl=(𝚺i,il𝚺i,jl𝚺j,il𝚺j,jl)\displaystyle\mathbf{M}_{i,j}^{0}=\bm{\Sigma}^{0}_{i,j}=\langle\mathbf{x}_{i},\mathbf{x}_{j}\rangle,\ \ \mathbf{N}^{l}_{i,j}=\begin{pmatrix}\bm{\Sigma}^{l}_{i,i}&\bm{\Sigma}^{l}_{i,j}\\ \bm{\Sigma}^{l}_{j,i}&\bm{\Sigma}^{l}_{j,j}\end{pmatrix}
𝚺i,jl=2𝔼a,b𝒩(𝟎,𝐍i,jl1)[σ(a)σ(b)]\displaystyle\bm{\Sigma}^{l}_{i,j}=2\mathbb{E}_{a,b\sim\mathcal{N}(\mathbf{0},\mathbf{N}_{i,j}^{l-1})}[\sigma(a)\sigma(b)]
𝐌i,jl=2𝐌i,jl1𝔼a,b𝒩(𝟎,𝐍i,jl1)[σ(a)σ(b)]+𝚺i,jl.\displaystyle\mathbf{M}_{i,j}^{l}=2\mathbf{M}_{i,j}^{l-1}\mathbb{E}_{a,b\sim\mathcal{N}(\mathbf{0},\mathbf{N}_{i,j}^{l-1})}[\sigma^{\prime}(a)\sigma^{\prime}(b)]+\bm{\Sigma}^{l}_{i,j}.

Then, given the contexts {𝐱t}t=1T\{\mathbf{x}_{t}\}_{t=1}^{T}, the Neural Tangent Kernel is defined as 𝐌=(𝐌L+𝚺L)/2\mathbf{M}=(\mathbf{M}^{L}+\bm{\Sigma}^{L})/2.
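As a sanity check on this recursion, the sketch below estimates the NTK matrix for ReLU activations by Monte-Carlo sampling of the bivariate Gaussian expectations; this is only an illustrative transcription of Definition 8.2 (closed-form arc-cosine expressions could replace the sampling), and the function name is ours.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def ntk_matrix(X, L, n_samples=100_000, seed=0):
    """Monte-Carlo sketch of the NTK recursion in Definition 8.2.

    X: (T, d) array of context vectors; L: number of layers.
    The expectations over N(0, N^{l-1}_{i,j}) are approximated by sampling,
    so the returned matrix is only an estimate of M = (M^L + Sigma^L) / 2.
    """
    rng = np.random.default_rng(seed)
    T = X.shape[0]
    Sigma = X @ X.T          # Sigma^0_{i,j} = <x_i, x_j>
    M = Sigma.copy()         # M^0 = Sigma^0
    for _ in range(L):
        Sigma_new = np.empty_like(Sigma, dtype=float)
        M_new = np.empty_like(M, dtype=float)
        for i in range(T):
            for j in range(T):
                cov = np.array([[Sigma[i, i], Sigma[i, j]],
                                [Sigma[j, i], Sigma[j, j]]])
                ab = rng.multivariate_normal(np.zeros(2), cov,
                                             size=n_samples,
                                             check_valid="ignore")
                a, b = ab[:, 0], ab[:, 1]
                Sigma_new[i, j] = 2.0 * np.mean(relu(a) * relu(b))
                M_new[i, j] = (2.0 * M[i, j] * np.mean((a > 0) & (b > 0))
                               + Sigma_new[i, j])
        Sigma, M = Sigma_new, M_new
    return (M + Sigma) / 2.0
```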

Definition 8.3 (Effective Dimension (Zhou et al., 2020a)).

Given the contexts {𝐱t}t=1T\{\mathbf{x}_{t}\}_{t=1}^{T}, the effective dimension P~\widetilde{P} is defined as

P~=logdet(𝐈+𝐌/λ)log(1+T/λ).\widetilde{P}=\frac{\log\text{det}(\mathbf{I}+\mathbf{M}/\lambda)}{\log(1+T/\lambda)}.
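Given an (estimated) NTK matrix $\mathbf{M}$, the effective dimension is a direct transcription of this formula; the sketch below uses a log-determinant for numerical stability and is illustrative only.

```python
import numpy as np

def effective_dimension(M, lam=1.0):
    """Effective dimension P~ from Definition 8.3 for a T x T NTK matrix M."""
    T = M.shape[0]
    _, logdet = np.linalg.slogdet(np.eye(T) + M / lam)
    return logdet / np.log(1.0 + T / lam)
```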

Proof of Lemma 5.2. Given a set of context vectors $\{\mathbf{x}_t\}_{t=1}^{T}$ with the ground-truth function $h$ and a fully-connected neural network $f$, we have

|h(𝐱t)f(𝐱t;𝜽t)|\displaystyle\left|h(\mathbf{x}_{t})-f(\mathbf{x}_{t};\bm{\theta}_{t})\right|
\displaystyle\leq |h(𝐱t)g(𝐱t;𝜽0),𝜽/m|+|f(𝐱t;𝜽t)g(𝐱t;𝜽0),𝜽/m|\displaystyle\left|h(\mathbf{x}_{t})-\langle g(\mathbf{x}_{t};\bm{\theta}_{0}),\bm{\theta}^{\prime}/\sqrt{m}\rangle\right|+\left|f(\mathbf{x}_{t};\bm{\theta}_{t})-\langle g(\mathbf{x}_{t};\bm{\theta}_{0}),\bm{\theta}^{\prime}/\sqrt{m}\rangle\right|

where $\bm{\theta}^{\prime}$ is the ridge-regression estimation from Definition 8.1. Then, based on Lemma 5.1, there exists $\bm{\theta}^{\ast}\in\mathbf{R}^{P}$ such that $h(\mathbf{x}_{t})=\left\langle g(\mathbf{x}_{t};\bm{\theta}_{0}),\bm{\theta}^{\ast}\right\rangle$. Thus, we have

\displaystyle\ \ \left|h(\mathbf{x}_{t})-\langle g(\mathbf{x}_{t};\bm{\theta}_{0}),\bm{\theta}^{\prime}/\sqrt{m}\rangle\right|
\displaystyle=\left|\left\langle g(\mathbf{x}_{t};\bm{\theta}_{0})/\sqrt{m},\sqrt{m}\bm{\theta}^{\ast}\right\rangle-\left\langle g(\mathbf{x}_{t};\bm{\theta}_{0})/\sqrt{m},\bm{\theta}^{\prime}\right\rangle\right|
\displaystyle\leq\left(\sqrt{\log\left(\frac{\text{det}(\mathbf{A}_{t}^{\prime})}{\text{det}(\lambda\mathbf{I})}\right)-2\log\delta}+\lambda^{1/2}S\right)\|g(\mathbf{x}_{t};\bm{\theta}_{0})/\sqrt{m}\|_{\mathbf{A}_{t}^{\prime-1}}

where the final inequality is based on Theorem 2 in (Abbasi-Yadkori et al., 2011) and holds with probability at least $1-\delta$, for any $t\in[T]$.

Second, we need to bound

|f(𝐱t;𝜽t)g(𝐱t;𝜽0),𝜽/m|\displaystyle\left|f(\mathbf{x}_{t};\bm{\theta}_{t})-\langle g(\mathbf{x}_{t};\bm{\theta}_{0}),\bm{\theta}^{\prime}/\sqrt{m}\rangle\right|
\displaystyle\leq |f(𝐱t;𝜽t)g(𝐱t;𝜽0),𝜽t𝜽0|\displaystyle\left|f(\mathbf{x}_{t};\bm{\theta}_{t})-\langle g(\mathbf{x}_{t};\bm{\theta}_{0}),\bm{\theta}_{t}-\bm{\theta}_{0}\rangle\right|
+|g(𝐱t;𝜽0),𝜽t𝜽0g(𝐱t;𝜽0),𝜽/m|\displaystyle+\left|\langle g(\mathbf{x}_{t};\bm{\theta}_{0}),\bm{\theta}_{t}-\bm{\theta}_{0}\rangle-\langle g(\mathbf{x}_{t};\bm{\theta}_{0}),\bm{\theta}^{\prime}/\sqrt{m}\rangle\right|

To bound the above inequality, we first bound

|f(𝐱t;𝜽t)g(𝐱t;𝜽0),𝜽t𝜽0|\displaystyle\left|f(\mathbf{x}_{t};\bm{\theta}_{t})-\langle g(\mathbf{x}_{t};\bm{\theta}_{0}),\bm{\theta}_{t}-\bm{\theta}_{0}\rangle\right|
=\displaystyle= |f(𝐱t;𝜽t)f(𝐱t;𝜽0)g(𝐱t;𝜽0),𝜽t𝜽0|\displaystyle\left|f(\mathbf{x}_{t};\bm{\theta}_{t})-f(\mathbf{x}_{t};\bm{\theta}_{0})-\langle g(\mathbf{x}_{t};\bm{\theta}_{0}),\bm{\theta}_{t}-\bm{\theta}_{0}\rangle\right|
\displaystyle\leq C2τ4/3L3mlogmC2m1/6logmt2/3λ2/3L3.\displaystyle C_{2}\tau^{4/3}L^{3}\sqrt{m\log m}\leq C_{2}m^{-1/6}\sqrt{\log m}t^{2/3}\lambda^{-2/3}L^{3}.

where $f(\mathbf{x}_{t};\bm{\theta}_{0})=0$ due to the random initialization of $\bm{\theta}_{0}$. The first inequality follows from Lemma 8.5. According to Lemma 8.7, we have $\|\bm{\theta}_{t}-\bm{\theta}_{0}\|_{2}\leq 2\sqrt{\frac{t}{m\lambda}}$. Replacing $\tau$ with $2\sqrt{\frac{t}{m\lambda}}$ then yields the second inequality.

Next, we need to bound

|g(𝐱t;𝜽0),𝜽t𝜽0g(𝐱t;𝜽0),𝜽/m|\displaystyle|\langle g(\mathbf{x}_{t};\bm{\theta}_{0}),\bm{\theta}_{t}-\bm{\theta}_{0}\rangle-\langle g(\mathbf{x}_{t};\bm{\theta}_{0}),\bm{\theta}^{\prime}/\sqrt{m}\rangle|
=\displaystyle= |g(𝐱t;𝜽0)/m,m(𝜽t𝜽0𝜽/m)|\displaystyle|\langle g(\mathbf{x}_{t};\bm{\theta}_{0})/\sqrt{m},\sqrt{m}(\bm{\theta}_{t}-\bm{\theta}_{0}-\bm{\theta}^{\prime}/\sqrt{m})\rangle|
\displaystyle\leq g(𝐱t;𝜽0)/m𝐀t1m𝜽t𝜽0𝜽/m𝐀t\displaystyle\|g(\mathbf{x}_{t};\bm{\theta}_{0})/\sqrt{m}\|_{\mathbf{A}_{t}^{-1}}\cdot\sqrt{m}\|\bm{\theta}_{t}-\bm{\theta}_{0}-\bm{\theta}^{\prime}/\sqrt{m}\|_{\mathbf{A}_{t}}
\displaystyle\leq g(𝐱t;𝜽0)/m𝐀t1m𝐀t2𝜽t𝜽0𝜽/m2.\displaystyle\|g(\mathbf{x}_{t};\bm{\theta}_{0})/\sqrt{m}\|_{\mathbf{A}_{t}^{-1}}\cdot\sqrt{m}\|{\mathbf{A}_{t}}\|_{2}\cdot\|\bm{\theta}_{t}-\bm{\theta}_{0}-\bm{\theta}^{\prime}/\sqrt{m}\|_{2}.

By Lemma 8.6 and Lemma 8.7, we have

m𝐀t2𝜽t𝜽0𝜽/m2m(λ+t𝒪(L))\displaystyle\sqrt{m}\|{\mathbf{A}_{t}}\|_{2}\cdot\|\bm{\theta}_{t}-\bm{\theta}_{0}-\bm{\theta}^{\prime}/\sqrt{m}\|_{2}\leq\sqrt{m}(\lambda+t\mathcal{O}(L))
((1ηmλ)J/2t/(mλ)+C4m2/3logmL7/2t5/3λ5/3(1+t/λ))\displaystyle\cdot\left((1-\eta m\lambda)^{J/2}\sqrt{t/(m\lambda)}+C_{4}m^{-2/3}\sqrt{\log m}L^{7/2}t^{5/3}\lambda^{-5/3}(1+\sqrt{t/\lambda})\right)
(λ+t𝒪(L))\displaystyle\leq(\lambda+t\mathcal{O}(L))
((1ηmλ)J/2t/λ+C4m1/6logmL7/2t5/3λ5/3(1+t/λ))\displaystyle\cdot((1-\eta m\lambda)^{J/2}\sqrt{t/\lambda}+C_{4}m^{-1/6}\sqrt{\log m}L^{7/2}t^{5/3}\lambda^{-5/3}(1+\sqrt{t/\lambda}))
(λ+t𝒪(L))((1ηmλ)J/2t/λ)+1\displaystyle\leq(\lambda+t\mathcal{O}(L))\cdot((1-\eta m\lambda)^{J/2}\sqrt{t/\lambda})+1

where the last inequality is because mm is sufficiently large. Therefore, we have

|f(𝐱t;𝜽t)g(𝐱t;𝜽0),𝜽/m|\displaystyle\left|f(\mathbf{x}_{t};\bm{\theta}_{t})-\langle g(\mathbf{x}_{t};\bm{\theta}_{0}),\bm{\theta}^{\prime}/\sqrt{m}\rangle\right|
\displaystyle\leq ((λ+t𝒪(L))((1ηmλ)J/2t/λ)+1)g(𝐱t;𝜽0)/m𝐀t1\displaystyle\left((\lambda+t\mathcal{O}(L))\cdot((1-\eta m\lambda)^{J/2}\sqrt{t/\lambda})+1\right)\|g(\mathbf{x}_{t};\bm{\theta}_{0})/\sqrt{m}\|_{\mathbf{A}_{t}^{-1}}
+C2m1/6logmt2/3λ2/3L3\displaystyle+C_{2}m^{-1/6}\sqrt{\log m}t^{2/3}\lambda^{-2/3}L^{3}

And we have

g(𝐱t;𝜽0)/m𝐀t1\displaystyle\|g(\mathbf{x}_{t};\bm{\theta}_{0})/\sqrt{m}\|_{\mathbf{A}_{t}^{-1}}
=\displaystyle= g(𝐱t;𝜽t)+g(𝐱t;𝜽0)g(𝐱t;𝜽t)𝐀t1/m\displaystyle\|g(\mathbf{x}_{t};\bm{\theta}_{t})+g(\mathbf{x}_{t};\bm{\theta}_{0})-g(\mathbf{x}_{t};\bm{\theta}_{t})\|_{\mathbf{A}_{t}^{-1}}/\sqrt{m}
\displaystyle\leq g(𝐱t;𝜽t)/m𝐀t1+𝐀t12g(𝐱t;𝜽0)g(𝐱t;𝜽t)2/m\displaystyle\|g(\mathbf{x}_{t};\bm{\theta}_{t})/\sqrt{m}\|_{\mathbf{A}_{t}^{-1}}+\|\mathbf{A}_{t}^{-1}\|_{2}\|g(\mathbf{x}_{t};\bm{\theta}_{0})-g(\mathbf{x}_{t};\bm{\theta}_{t})\|_{2}/\sqrt{m}
\displaystyle\leq g(𝐱t;𝜽t)/m𝐀t1+λ1m1/6logmt1/6λ1/6L7/2\displaystyle\|g(\mathbf{x}_{t};\bm{\theta}_{t})/\sqrt{m}\|_{\mathbf{A}_{t}^{-1}}+\lambda^{-1}m^{-1/6}\sqrt{\log m}t^{1/6}\lambda^{-1/6}L^{7/2}

where the last inequality follows from Lemma 8.8 combined with Lemma 8.7 and the fact that $\|\mathbf{A}_{t}\|_{2}\geq\|\lambda\mathbf{I}\|_{2}$.

Finally, putting everything together, we have

|h(𝐱t)f(𝐱t;𝜽t)|\displaystyle\left|h(\mathbf{x}_{t})-f(\mathbf{x}_{t};\bm{\theta}_{t})\right| γ1g(𝐱t;𝜽t)/m𝐀t1+γ2g(𝐱t;𝜽0)/m𝐀t1\displaystyle\leq\gamma_{1}\|g(\mathbf{x}_{t};\bm{\theta}_{t})/\sqrt{m}\|_{\mathbf{A}_{t}^{-1}}+\gamma_{2}\|g(\mathbf{x}_{t};\bm{\theta}_{0})/\sqrt{m}\|_{\mathbf{A}_{t}^{{}^{\prime}-1}}
+γ1γ3+γ4.\displaystyle+\gamma_{1}\gamma_{3}+\gamma_{4}.

Proof of Theorem 5.3. First, considering an individual bandit $k\in[K]$ with the set of context vectors $\{\mathbf{x}_t^k\}_{t=1}^{T}$ and the set of sub-rewards $\{r_t^k\}_{t=1}^{T}$, we can build an upper confidence bound of $f_k(\mathbf{x}_t^k;\bm{\theta}_t^k)$ with respect to $h_k(\mathbf{x}_t^k)$ based on Lemma 5.2. Denoting this UCB by $\mathcal{B}(\mathbf{x}_t^k,m,L_1,\delta^{\prime},t)$, with probability at least $1-\delta^{\prime}$, for any $t\in[T]$ we have

|fk(𝐱tk;𝜽tk)hk(𝐱tk)|(𝐱tk,m,L1,δ,t)=k.|f_{k}(\mathbf{x}_{t}^{k};\bm{\theta}^{k}_{t})-h_{k}(\mathbf{x}_{t}^{k})|\leq\mathcal{B}(\mathbf{x}_{t}^{k},m,L_{1},\delta^{\prime},t)=\mathcal{B}^{k}.

Next, applying the union bound over the $K+1$ networks, we set $\delta^{\prime}=\delta/(K+1)$ in each round $t$.

Then, we need to bound

|H(vec(𝐫t))H(𝐟t)|\displaystyle\left|H\left(\text{vec}(\mathbf{r}_{t})\right)-H\left(\mathbf{f}_{t}\right)\right|
\displaystyle\leq C¯k=1K|fk(𝐱tk;𝜽tk)hk(𝐱tk)|2\displaystyle\bar{C}\sqrt{\sum_{k=1}^{K}|f_{k}(\mathbf{x}_{t}^{k};\bm{\theta}^{k}_{t})-h_{k}(\mathbf{x}_{t}^{k})|^{2}}
\displaystyle\leq C¯k=1K(k)2C¯k=1Kk\displaystyle\bar{C}\sqrt{\sum_{k=1}^{K}(\mathcal{B}^{k})^{2}}\leq\bar{C}\sum_{k=1}^{K}\mathcal{B}^{k}

where the first inequality holds because $H$ is a $\bar{C}$-Lipschitz continuous function. Therefore, we have

|(𝐗t)(𝐗t)|=|F(𝐟t;𝜽Σ)H(vec(𝐫t))|\displaystyle|\mathcal{F}(\mathbf{X}_{t})-\mathcal{H}(\mathbf{X}_{t})|=|F(\mathbf{f}_{t};\bm{\theta}^{\Sigma})-H(\text{vec}(\mathbf{r}_{t}))|
\displaystyle\leq |F(𝐟t;𝜽Σ)H(𝐟t)|+|H(𝐟t)H(vec(𝐫t))|C¯k=1Kk+F.\displaystyle\left|F(\mathbf{f}_{t};\bm{\theta}^{\Sigma})-H\left(\mathbf{f}_{t}\right)\right|+\left|H\left(\mathbf{f}_{t}\right)-H(\text{vec}(\mathbf{r}_{t}))\right|\leq\bar{C}\sum_{k=1}^{K}\mathcal{B}^{k}+\mathcal{B}^{F}.

This completes the proof of the claim.
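The last display shows how the overall confidence width is assembled from the per-bandit widths; a hypothetical helper (names and argument conventions are ours) would simply compute $\bar{C}\sum_{k}\mathcal{B}^{k}+\mathcal{B}^{F}$:

```python
def comprehensive_ucb(per_bandit_widths, final_width, C_bar):
    """Combine the per-bandit UCB widths B^k with the width B^F of the
    final network F, as in the proof of Theorem 5.3:
    overall width = C_bar * sum_k B^k + B^F."""
    return C_bar * sum(per_bandit_widths) + final_width
```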

Lemma 8.4.

With probability at least 1δ1-\delta^{\prime}, we have

(1)𝐀t2,𝐀t2\displaystyle(1)\|\mathbf{A}_{t}\|_{2},\|\mathbf{A}_{t}^{\prime}\|_{2} λ+t𝒪(L)\displaystyle\leq\lambda+t\mathcal{O}(L)
(2)logdet(𝐀t)det(λ𝐈)\displaystyle(2)\log\frac{\text{det}(\mathbf{A}_{t}^{\prime})}{\text{det}(\lambda\mathbf{I})} P~log(1+T/λ)+1/λ\displaystyle\leq\widetilde{P}\log(1+T/\lambda)+1/\lambda
(3)|logdet(𝐀t)det(λ𝐈)logdet(𝐀t)det(λ𝐈)|\displaystyle(3)|\log\frac{\text{det}(\mathbf{A}_{t})}{\text{det}(\lambda\mathbf{I})}-\log\frac{\text{det}(\mathbf{A}_{t}^{\prime})}{\text{det}(\lambda\mathbf{I})}| 𝒪(m1/6logmL4t5/3λ1/6).\displaystyle\leq\mathcal{O}(m^{-1/6}\sqrt{\log m}L^{4}t^{5/3}\lambda^{-1/6}).

where (3) follows from Lemma B.3 in (Zhou et al., 2020a).

Proof of Lemma 8.4. For (1), based on Lemma 8.6, for any $\mathbf{x}_t\in\{\mathbf{x}_i\}_{i=1}^{T}$ we have $\|g(\mathbf{x}_t;\bm{\theta}_0)\|_{F}\leq\mathcal{O}(\sqrt{mL})$, and the same bound holds for $g(\mathbf{x}_t;\bm{\theta}_t)$. Then, for the first item:

𝐀t2=λ𝐈+i=1tg(𝐱i;𝜽t)g(𝐱i;𝜽t)/m2\displaystyle\|\mathbf{A}_{t}\|_{2}=\|\lambda\mathbf{I}+\sum_{i=1}^{t}g(\mathbf{x}_{i};\bm{\theta}_{t})g(\mathbf{x}_{i};\bm{\theta}_{t})^{\intercal}/m\|_{2}
λ𝐈2+i=1tg(𝐱i;𝜽t)g(𝐱i;𝜽t)/m2\displaystyle\leq\|\lambda\mathbf{I}\|_{2}+\|\sum_{i=1}^{t}g(\mathbf{x}_{i};\bm{\theta}_{t})g(\mathbf{x}_{i};\bm{\theta}_{t})^{\intercal}/m\|_{2}
λ+i=1tg(𝐱i;𝜽t)22/mλ+i=1tg(𝐱i;𝜽t)F2/m\displaystyle\leq\lambda+\sum_{i=1}^{t}\|g(\mathbf{x}_{i};\bm{\theta}_{t})\|_{2}^{2}/m\leq\lambda+\sum_{i=1}^{t}\|g(\mathbf{x}_{i};\bm{\theta}_{t})\|_{F}^{2}/m
λ+t𝒪(L).\displaystyle\leq\lambda+t\mathcal{O}(L).

The same argument applies to $\|\mathbf{A}_{t}^{\prime}\|_{2}$. For (2), we have

\displaystyle\log\frac{\text{det}(\mathbf{A}_{t}^{\prime})}{\text{det}(\lambda\mathbf{I})}=\log\text{det}\Big(\mathbf{I}+\sum_{t=1}^{T}g(\mathbf{x}_{t};\bm{\theta}_{0})g(\mathbf{x}_{t};\bm{\theta}_{0})^{\intercal}/(m\lambda)\Big)
\displaystyle=\log\text{det}(\mathbf{I}+\mathbf{G}\mathbf{G}^{\intercal}/\lambda)

where 𝐆=(g(𝐱1;𝜽0),,g(𝐱T;𝜽0))/m\mathbf{G}=(g(\mathbf{x}_{1};\bm{\theta}_{0}),\dots,g(\mathbf{x}_{T};\bm{\theta}_{0}))/\sqrt{m}.

According to Theorem 3.1 in (Arora et al., 2019), when $m=\Omega(\frac{L^{6}\log(L/\delta)}{\epsilon^{4}})$, with probability at least $1-\delta$, for any $\mathbf{x}_i,\mathbf{x}_j\in\{\mathbf{x}_t\}_{t=1}^{T}$, we have

|g(𝐱i;𝜽0)g(𝐱j;𝜽0)/m𝐌i,j|ϵ.|g(\mathbf{x}_{i};\bm{\theta}_{0})^{\intercal}g(\mathbf{x}_{j};\bm{\theta}_{0})/m-\mathbf{M}_{i,j}|\leq\epsilon.

Therefore, we have

𝐆𝐆𝐌F\displaystyle\|\mathbf{G}\mathbf{G}^{\intercal}-\mathbf{M}\|_{F} =i=1Tj=1T|g(𝐱i;𝜽0)g(𝐱j;𝜽0)/m𝐌i,j|2\displaystyle=\sqrt{\sum_{i=1}^{T}\sum_{j=1}^{T}|g(\mathbf{x}_{i};\bm{\theta}_{0})^{\intercal}g(\mathbf{x}_{j};\bm{\theta}_{0})/m-\mathbf{M}_{i,j}|^{2}}
Tϵ.\displaystyle\leq T\epsilon.

Then, we have

\displaystyle\log\text{det}(\mathbf{I}+\mathbf{G}\mathbf{G}^{\intercal}/\lambda)
\displaystyle=\log\text{det}(\mathbf{I}+\mathbf{M}/\lambda+(\mathbf{G}\mathbf{G}^{\intercal}-\mathbf{M})/\lambda)
\displaystyle\leq\log\text{det}(\mathbf{I}+\mathbf{M}/\lambda)+\langle(\mathbf{I}+\mathbf{M}/\lambda)^{-1},(\mathbf{G}\mathbf{G}^{\intercal}-\mathbf{M})/\lambda\rangle
\displaystyle\leq\log\text{det}(\mathbf{I}+\mathbf{M}/\lambda)+\|(\mathbf{I}+\mathbf{M}/\lambda)^{-1}\|_{F}\|\mathbf{G}\mathbf{G}^{\intercal}-\mathbf{M}\|_{F}/\lambda
\displaystyle\leq\log\text{det}(\mathbf{I}+\mathbf{M}/\lambda)+\sqrt{T}\|\mathbf{G}\mathbf{G}^{\intercal}-\mathbf{M}\|_{F}/\lambda
\displaystyle\leq\log\text{det}(\mathbf{I}+\mathbf{M}/\lambda)+\lambda^{-1}
\displaystyle=\widetilde{P}\log(1+T/\lambda)+\lambda^{-1}.

The first inequality is due to the concavity of $\log\text{det}(\cdot)$; the third inequality is due to $\|(\mathbf{I}+\mathbf{M}/\lambda)^{-1}\|_{F}\leq\|\mathbf{I}^{-1}\|_{F}\leq\sqrt{T}$; the last inequality is because of the choice of $m$; and the last equality follows from Definition 8.3.

Lemma 8.5 (Lemma 4.1 in (Cao and Gu, 2019)).

There exist constants $\{\bar{C}_i\}_{i=1}^{3}\geq 0$ such that for any $\delta>0$, if $\tau$ satisfies

τC¯2L6[logm]3/2,\tau\leq\bar{C}_{2}L^{-6}[\log m]^{-3/2},

then with probability at least 1δ1-\delta, for all 𝛉1,𝛉2\bm{\theta}^{1},\bm{\theta}^{2} satisfying 𝛉1𝛉0τ,𝛉2𝛉0τ\|\bm{\theta}^{1}-\bm{\theta}_{0}\|\leq\tau,\|\bm{\theta}^{2}-\bm{\theta}_{0}\|\leq\tau and for any 𝐱t{𝐱t}t=1T\mathbf{x}_{t}\in\{\mathbf{x}_{t}\}_{t=1}^{T}, we have

|f(\mathbf{x}_{t};\bm{\theta}^{1})-f(\mathbf{x}_{t};\bm{\theta}^{2})-\langle g(\mathbf{x}_{t};\bm{\theta}^{2}),\bm{\theta}^{1}-\bm{\theta}^{2}\rangle|\leq\bar{C}_{3}\tau^{4/3}L^{3}\sqrt{m\log m}.
Lemma 8.6 (Lemma B.3 in (Cao and Gu, 2019)).

There exist constants {Ci}i=12\{C_{i}\}_{i=1}^{2} such that for any δ>0\delta>0, if τ\tau satisfies that

τC1L6(logm)3/2,\tau\leq C_{1}L^{-6}(\log m)^{-3/2},

then, with probability at least $1-\delta$, for any $\bm{\theta}$ satisfying $\|\bm{\theta}-\bm{\theta}_{0}\|\leq\tau$ and any $\mathbf{x}_t\in\{\mathbf{x}_t\}_{t=1}^{T}$, we have $\|g(\mathbf{x}_t;\bm{\theta})\|_{2}\leq C_{2}\sqrt{mL}$.

Lemma 8.7 (Lemma B.2 in (Zhou et al., 2020a)).

For the $L$-layer fully-connected network $f$, there exist constants $\{C_i\}_{i=1}^{5}\geq 0$ such that for any $\delta>0$, if for all $t\in[T]$, $\eta$ and $m$ satisfy

2t/(mλ)C1m3/2L3/2[log(TL2/δ)]3/2,\displaystyle 2\sqrt{t/(m\lambda)}\geq C_{1}m^{-3/2}L^{-3/2}[\log(TL^{2}/\delta)]^{3/2},
2t/(mλ)C2min{L6[logm]3/2,(m(λη)2L6t1(logm)1)3/8},\displaystyle 2\sqrt{t/(m\lambda)}\leq C_{2}\min\{L^{-6}[\log m]^{-3/2},(m(\lambda\eta)^{2}L^{-6}t^{-1}(\log m)^{-1})^{3/8}\},
ηC3(mλ+tmL)1,\displaystyle\eta\leq C_{3}(m\lambda+tmL)^{-1},
m1/6C4logmL7/2t7/6λ7/6(1+t/λ),\displaystyle m^{1/6}\geq C_{4}\sqrt{\log m}L^{7/2}t^{7/6}\lambda^{-7/6}(1+\sqrt{t/\lambda}),

then, with probability at least $1-\delta$, we have

𝜽t𝜽02t/(mλ)\displaystyle\|\bm{\theta}_{t}-\bm{\theta}_{0}\|\leq 2\sqrt{t/(m\lambda)}
𝜽t𝜽0𝜽\displaystyle\|\bm{\theta}_{t}-\bm{\theta}_{0}-\bm{\theta}^{\prime}\|\leq
(1ηmλ)J/2t/(mλ)+C5m2/3logmL7/2t5/3λ5/3(1+t/λ).\displaystyle(1-\eta m\lambda)^{J/2}\sqrt{t/(m\lambda)}+C_{5}m^{-2/3}\sqrt{\log m}L^{7/2}t^{5/3}\lambda^{-5/3}(1+\sqrt{t/\lambda}).
Lemma 8.8 (Theorem 5 in (Allen-Zhu et al., 2019)).

With probability at least 1δ1-\delta, there exist constants C1,C2C_{1},C_{2} such that if τC1L9/2log3m\tau\leq C_{1}L^{-9/2}\log^{-3}m, for 𝛉t𝛉02τ\|\bm{\theta}_{t}-\bm{\theta}_{0}\|_{2}\leq\tau, we have

g(𝐱t;𝜽t)g(𝐱t;𝜽0)2C2logmτ1/3L3g(𝐱t;𝜽0)2.\|g(\mathbf{x}_{t};\bm{\theta}_{t})-g(\mathbf{x}_{t};\bm{\theta}_{0})\|_{2}\leq C_{2}\sqrt{\log m}\tau^{1/3}L^{3}\|g(\mathbf{x}_{t};\bm{\theta}_{0})\|_{2}.

8.1. Empirical Setting of UCB

The UCB is a key component of MuFasa and largely determines its empirical performance. $UCB(\mathbf{X}_{t})$ (Lemma 5.2) can be formulated as

UCB(𝐗t)=(1λ)g(𝐱;𝜽t)𝐀t1+λg(𝐱;𝜽0)𝐀t1UCB(\mathbf{X}_{t})=(1-\lambda)\|g(\mathbf{x};\bm{\theta}_{t})\|_{\mathbf{A}_{t}^{-1}}+\lambda\|g(\mathbf{x};\bm{\theta}_{0})\|_{\mathbf{A}_{t}^{{}^{\prime}-1}}

where $g(\mathbf{x};\bm{\theta}_{0})$ is the gradient at initialization, $g(\mathbf{x};\bm{\theta}_{t})$ is the gradient after $k$ iterations of gradient descent, and $\lambda$ is a tunable parameter that trades off between them. Intuitively, $g(\mathbf{x};\bm{\theta}_{0})$ carries more bias because the weights of the neural network $f$ are randomly initialized, which contributes a larger exploration portion to the decision made in each round. In contrast, $g(\mathbf{x};\bm{\theta}_{t})$ carries more variance, as the weights should be closer to the optimum after gradient descent.

In the setting where the set of arms is fixed, given an arm $\mathbf{x}_i$, let $m_i$ be the number of rounds in which $\mathbf{x}_i$ has been played so far. When $m_i$ is small, the learner should explore $\mathbf{x}_i$ more ($\lambda$ is expected to be large); when $m_i$ is large, the learner does not need much additional exploration on it. Therefore, $\lambda$ can be defined as a decreasing function of $m_i$, such as $\frac{1}{\sqrt{m_i+1}}$ or $\frac{1}{\log(m_i+1)}$. In the setting without this condition, we can set $\lambda$ as a decreasing function of the number of rounds $t$, such as $\frac{1}{\sqrt{t+1}}$ or $\frac{1}{\log(t+1)}$, as sketched below.
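A minimal sketch of this empirical UCB, assuming flattened gradients and precomputed inverses of $\mathbf{A}_t$ and $\mathbf{A}_t^{\prime}$; the decay schedule $\lambda=1/\sqrt{m_i+1}$ is one of the options mentioned above, and all names here are illustrative.

```python
import numpy as np

def empirical_ucb(g_t, g_0, A_t_inv, A0_t_inv, n_pulls):
    """Empirical UCB(X_t) = (1 - lam) * ||g(x; theta_t)||_{A_t^{-1}}
                           + lam * ||g(x; theta_0)||_{A'_t^{-1}},
    with lam decaying as the arm (or round) count n_pulls grows."""
    lam = 1.0 / np.sqrt(n_pulls + 1)         # more exploration early on
    width_t = np.sqrt(g_t @ A_t_inv @ g_t)   # ||g(x; theta_t)||_{A_t^{-1}}
    width_0 = np.sqrt(g_0 @ A0_t_inv @ g_0)  # ||g(x; theta_0)||_{A'_t^{-1}}
    return (1.0 - lam) * width_t + lam * width_0
```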

8.2. Future Work

This paper introduces a novel bandit problem, multi-facet contextual bandits. Based on the empirical experiments and theoretical analysis, at least three factors play important roles in multi-facet bandits: (1) as all bandits serve the same user, mutual influence does exist among them, so how to extract and harness this mutual influence is one direction for future work; (2) the weights of the bandits usually determine the final reward, since each bandit has a different impact on the serving user; for example, in the diabetes case, the bandit for selecting medications should carry a higher weight than the bandit for recommending diabetes-related articles; (3) sub-rewards contain crucial information, and thus exploiting sub-rewards matters in this problem.

To solve the multi-facet bandit problem, we propose the algorithm MuFasa. However, this paper has three implicit limitations: (1) to leverage the sub-rewards, we assume the $\bar{C}$-Lipschitz continuity of the final reward function, while $\bar{C}$ is usually unknown; (2) in the selection criterion of MuFasa, the algorithm needs to select the combination with the highest score from the set of all possible combinations $\mathbf{S}_t$ (Eq. (2) and Eq. (5)), yet the size of $\mathbf{S}_t$ grows exponentially with respect to the number of bandits $K$ (see the sketch below); (3) MuFasa only uses fully-connected networks to exploit rewards, while many more advanced neural architectures exist, such as CNNs, residual networks, self-attention, etc.
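To make limitation (2) concrete, a naive implementation of the selection step enumerates the Cartesian product of the per-bandit arm sets; the helper below is a hypothetical sketch in which `score` stands for the UCB-augmented estimate of the final reward.

```python
import itertools

def select_combination(arm_sets, score):
    """Exhaustive selection over S_t as in Eq. (2)/(5): enumerate every
    combination (one arm per bandit) and keep the highest-scoring one.
    With n arms per bandit this loop runs n^K times, which is the
    exponential blow-up discussed above."""
    return max(itertools.product(*arm_sets), key=score)
```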