
Regularized OFU: an Efficient UCB Estimator for Non-linear Contextual Bandit

Yichi Zhou, Shihong Song, Huishuai Zhang, Jun Zhu, Wei Chen, Tie-Yan Liu
Microsoft Research, Asia. {yiczho, huishuai.zhang, wche, tyliu}@microsoft.com
Tsinghua University. [email protected], [email protected]
Abstract

Balancing exploration and exploitation (EE) is a fundamental problem in contextual bandit. One powerful principle for the EE trade-off is Optimism in Face of Uncertainty (OFU), in which the agent takes the action according to an upper confidence bound (UCB) of the reward. OFU has achieved (near-)optimal regret bounds for linear/kernel contextual bandits. However, it is in general unknown how to derive efficient and effective EE trade-off methods for non-linear complex tasks, such as contextual bandit with a deep neural network as the reward function. In this paper, we propose a novel OFU algorithm named regularized OFU (ROFU). In ROFU, we measure the uncertainty of the reward by a differentiable function and compute the upper confidence bound by solving a regularized optimization problem. We prove that, for multi-armed bandit, kernel contextual bandit and neural tangent kernel bandit, ROFU achieves (near-)optimal regret bounds with certain uncertainty measures, which theoretically justifies its effectiveness on EE trade-off. Importantly, ROFU admits a very efficient implementation with a gradient-based optimizer, which easily extends to general deep neural network models beyond the neural tangent kernel, in sharp contrast with previous OFU methods. The empirical evaluation demonstrates that ROFU works extremely well for contextual bandits under various settings.

1 Introduction

Contextual bandit langford2007epoch; bubeck2012regret is a basic sequential decision-making problem which is extensively studied and widely applied in machine learning. At each time step of a contextual bandit, an agent is presented with a context and chooses an action according to the context. The agent then receives a reward conditioned on the context and the selected action. The goal of the agent is to maximize its cumulative reward. One key challenge in contextual bandit is the exploration-exploitation (EE) trade-off. In order to maximize cumulative reward, the agent exploits the collected data to take actions with high estimated reward, while it also explores under-explored areas to gain knowledge.

Optimism in Face of Uncertainty (OFU) is a powerful principle for the EE trade-off. When facing uncertainty, OFU first optimistically guesses how good each action could be and then takes the action with the highest guess. In philosophy, it exploits the collected knowledge by estimating the expected reward and enforces exploration of areas with large uncertainty via optimism. OFU algorithms can be divided into two categories, i.e., the confidence bonus method auer2002finite and the confidence set method jaksch2010near. In the confidence bonus method, the uncertainty of each action is measured according to its historical frequency, and the resulting bonus is added to its mean value to form the estimated reward; in the confidence set method, the reward function is constrained to a subset with high confidence and the maximum value over the confidence set is taken as the estimated value. These OFU algorithms have been proved to achieve (near-)optimal regret bounds on tabular and linear problems, including multi-armed bandit (MAB) auer2002finite, linear contextual bandit (LCB) chu2011contextual and kernel contextual bandit (KCB) valko2013finite, and have also been successfully applied in practice.

However, the representation power of tabular and linear models often becomes insufficient for many real-world contextual bandit problems, where the scale of the underlying problem is huge or even infinite and the reward mechanism is complex. Therefore, in practice, non-linear deep neural networks (DNNs) are used to approximate the reward functions. While DNN models fit the training data well, it remains challenging to quantify their uncertainty on new data. Therefore, existing OFU methods cannot be efficiently applied to non-linear contextual bandits. For the confidence set method, it is difficult to compute the confidence set of a non-linear deep neural network and hence hard to obtain the optimal action. For the confidence bonus method, it is unclear how to design the bonus for non-linear reward functions. Overall, estimating the uncertainty of a reward function approximated by a non-linear model, e.g., a DNN, becomes a major obstacle when applying OFU algorithms to real-world applications.

There are several works zhou2020neural; filippi2010parametric; zahavy2019deep; gu2021batched that attempt to generalize OFU to non-linear contextual bandits. One is NeuralUCB zhou2020neural, which takes the gradient at each observation as a random feature and constructs the upper confidence bound accordingly. It has been shown that NeuralUCB achieves sublinear regret on the neural tangent kernel contextual bandit (NTKCB). However, at each time step, NeuralUCB requires computing the inverse of a $p\times p$ matrix, where $p$ is the number of model parameters. As the time complexity of matrix inversion is $O(p^{2.373})$ davie2013improved, the computational cost is prohibitive for large-scale DNN models. Therefore, it remains largely under-explored how to develop OFU algorithms that achieve a good EE trade-off with reasonable computational cost for non-linear contextual bandits.

Motivated by these challenges, in this paper we propose a novel and general OFU method, called Regularized OFU (ROFU). In ROFU, we measure the uncertainty of the reward function by a differentiable regularizer, and then calculate the optimistic estimate by maximizing the reward function with the regularizer penalizing its uncertainty. We prove that ROFU enjoys near-optimal regret on many contextual bandit problems with relatively simple reward mechanisms, including MAB, LCB, KCB and NTKCB. Importantly, in contrast to existing OFU algorithms, ROFU admits an efficient implementation with a gradient-based optimizer, since the optimization problem in ROFU is unconstrained and differentiable, with a time complexity linear in $p$. Moreover, our algorithm easily extends to general deep neural network models.

We summarize our contributions as follows:

  • We show that ROFU achieves (near-)optimal regret on contextual bandits with simple reward functions, including MAB, LCB, KCB and NTK bandit. These results theoretically justify the efficiency of ROFU on EE trade-off.

  • We propose an algorithm which can efficiently approximate the upper confidence bound with a standard gradient-based optimizer.

  • We empirically evaluate ROFU on complex contextual bandits with reward functions beyond linear, kernel and NTK. We show that ROFU also provides efficient UCB estimation for popular DNN architectures including MLP, CNN and ResNet. Moreover, our algorithm enjoys a smaller regret than strong baselines on real-world non-linear contextual bandit problems introduced by deep_bayesian_bandit_showdown .

2 Preliminary

In this work, we consider contextual bandit with general reward functions.

Definition 1 (Contextual bandit).

Contextual bandit is a sequential decision-making problem where the agent has a set of actions $A$. At each time step $t$, the agent first observes a context $x_t$, then selects an action $a_t\in A$ based on the context. After taking the action, the agent receives a (random) reward $r_t$ with $\mathbb{E}r_t := f_{\theta^*}(x_t,a_t)$, where $f_{\theta^*}(x_t,a_t)$ is a function with unknown groundtruth parameters $\theta^*$. The agent aims to maximize its expected cumulative reward $\sum_{t\leq T}\mathbb{E}f_{\theta^*}(x_t,a_t)$, which is equivalent to minimizing the regret $R_T=\sum_{t\leq T}\max_a\big(f_{\theta^*}(x_t,a)-f_{\theta^*}(x_t,a_t)\big)$.
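To make the interaction protocol in Definition 1 concrete, the following minimal Python sketch simulates the loop with a hypothetical linear reward function and a uniformly random policy; the feature map, noise level and dimensions are illustrative assumptions, not part of the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_actions, T = 10, 5, 1000
theta_star = rng.normal(size=d)                # unknown groundtruth parameters theta*
phi = lambda x, a: np.roll(x, a)               # hypothetical feature map phi(x, a)

regret = 0.0
for t in range(T):
    x_t = rng.normal(size=d)                   # observe context x_t
    means = np.array([phi(x_t, a) @ theta_star for a in range(n_actions)])
    a_t = rng.integers(n_actions)              # placeholder policy: uniform exploration
    r_t = means[a_t] + rng.normal(scale=0.1)   # noisy reward with E[r_t] = f(x_t, a_t)
    regret += means.max() - means[a_t]         # per-step regret max_a f(x_t, a) - f(x_t, a_t)

print(f"cumulative regret of the uniform policy after {T} rounds: {regret:.1f}")
```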

The most studied contextual bandit is the linear contextual bandit, where the reward is generated from a linear function, i.e., $f_{\theta^*}(x,a) := \phi(x,a)^{\top}\theta^*$, where $\phi(x,a)$ is a known feature map. For more complex problems, we have to use non-linear models, such as DNNs, to represent the reward function.

In practice, given a dataset $D := \{(x_t,a_t,r_t)\}_{t\leq|D|}$, we can estimate $\theta^*$ by minimizing a loss function, i.e., $\bar{\theta} := \operatorname*{arg\,min}_{\theta}\mathcal{L}(\theta;D)$. In this work, we focus on the mean squared loss and its variants,

\text{MSE}(\theta;D) := \frac{1}{|D|}\sum_{t\leq|D|}(f_{\theta}(x_{t},a_{t})-r_{t})^{2}.   (1)

This is because MSE is a natural choice for regression tasks. Generally speaking, OFU algorithms trade off exploration and exploitation with an optimistic estimate of the reward for each action, and then select $a_t = \operatorname*{arg\,max}_a \text{OFU}(x_t,a)$. In the literature, OFU algorithms can be divided into two categories. The first one optimistically estimates the reward by adding a confidence bonus to $f_{\bar{\theta}}$, i.e.,

\text{OFU}^{B}(x,a) := f_{\bar{\theta}}(x,a) + \text{bonus}(x,a).   (2)

We call this OFU method Bonus-OFU. In simple cases, the bonus can be constructed using concentration bounds. For example, at time step $\tau$, in UCB1 for MAB auer2002finite, the bonus is $\sqrt{2\log\tau/n(a)}$, where $n(a)$ is the number of pulls on action $a$; in LinUCB for LCB chu2011contextual, the bonus is $\sqrt{\phi(x,a)^{\top}\big(\sum_{t<\tau}\phi(x_{t},a_{t})\phi(x_{t},a_{t})^{\top}+\bm{I}\big)^{-1}\phi(x,a)}$. Both UCB1 and LinUCB are proved to have (near-)optimal regret bounds.
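As a concrete illustration of these bonuses, the following NumPy sketch computes the LinUCB quantity $\sqrt{\phi(x,a)^{\top}(\sum_{t<\tau}\phi(x_t,a_t)\phi(x_t,a_t)^{\top}+\bm{I})^{-1}\phi(x,a)}$ and the UCB1 bonus on synthetic placeholder data; the features, counts and dimensions are assumptions made only for this example.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8
past_features = rng.normal(size=(50, d))   # phi(x_t, a_t) for the past rounds t < tau
phi_xa = rng.normal(size=d)                # phi(x, a) of the candidate action

# Z = sum_{t < tau} phi_t phi_t^T + I, as inside the LinUCB bonus
Z = past_features.T @ past_features + np.eye(d)
linucb_bonus = np.sqrt(phi_xa @ np.linalg.solve(Z, phi_xa))

# UCB1 bonus for an arm pulled n(a) times, at time step tau
tau, n_a = 100, 7
ucb1_bonus = np.sqrt(2 * np.log(tau) / n_a)

print(f"LinUCB bonus: {linucb_bonus:.3f}, UCB1 bonus: {ucb1_bonus:.3f}")
```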

The second kind of OFU algorithms optimistically estimates the reward by the best possible value over a confidence set of the reward functions, i.e.,

\text{OFU}^{S}(x,a) := \max_{\theta\in\Theta_{\delta}}f_{\theta}(x,a),   (3)

where $\Theta_{\delta} := \{\theta : \mathbb{P}(D|\theta)>\delta\}$ and $\mathbb{P}(D|\theta)$ is the likelihood of $D$ given $\theta$. We call this OFU method Set-OFU.

For convenience, we suppose $\theta$ is a $p$-dimensional vector. Both Bonus-OFU and Set-OFU can effectively trade off EE; however, they require a lot of computational resources when $f_{\theta}$ is a complex neural network: firstly, existing Bonus-OFU algorithms, such as NeuralUCB zhou2020neural, need to calculate the inverse of a $p\times p$ matrix, which is expensive especially for large DNN models; secondly, it is also hard to efficiently compute a confidence set over DNN parameters, and even harder to maximize the value over it to compute $\text{OFU}^{S}$. This computational cost significantly limits the application of OFU methods to complex tasks.

In the next section, we will propose a new OFU method, named Regularized OFU, which achieves a near-optimal regret bound on simple tasks, and more importantly, admits a very efficient implementation with a gradient-based optimizer.

3 Method

Different from Bonus-OFU and Set-OFU, we measure the uncertainty of the reward with a (differentiable) function, and compute the optimistic reward estimate under the regularization of this uncertainty measure. We call the resulting reward estimate regularized OFU (ROFU) and take actions accordingly. In round $t$, given a context-action pair $(x_t,a)$, ROFU is defined as follows and the algorithm is presented in Alg. 1:

\hat{\theta}(x_{t},a) := \operatorname*{arg\,max}_{\theta} f_{\theta}(x_{t},a) - \eta(x_{t},a,D_{t-1})\,\mathcal{R}(\theta;D_{t-1}),   (4)
\text{OFU}^{R}(x_{t},a) := f_{\theta_{t-1}}(x_{t},a) + g\big(f_{\hat{\theta}(x_{t},a)}(x_{t},a) - f_{\theta_{t-1}}(x_{t},a)\big),   (5)
Algorithm 1 Regularized Optimism in Face of Uncertainty
1:  Input: A reward function f_{\theta^*} with unknown \theta^*, number of rounds T, a loss function \mathcal{L}(\theta;D), a regularizer \mathcal{R}(\theta;D), a monotonically increasing function g: \mathbb{R}\rightarrow\mathbb{R} with g(0)=0, and the coefficient function \eta(x,a;D).
2:  D_0 := \emptyset.
3:  for t = 1, …, T do
4:     Observe x_t.
5:     \forall a\in A, compute \hat{\theta}(x_t,a) and \text{OFU}^R(x_t,a) according to Eq. (4) and (5).
6:     Take a_t = \operatorname*{arg\,max}_{a\in A}\text{OFU}^R(x_t,a) and receive reward r_t.
7:     Let D_t := D_{t-1}\cup\{(x_t,a_t,r_t)\}.
8:     Obtain \theta_t by minimizing \mathcal{L}(\theta;D_t).
9:  end for

where $D_{t-1}$ is the set of data collected in the first $t-1$ rounds, $\mathcal{R}$ is a regularizer, $\eta$ is the weight of the regularizer, $g$ is a monotonically increasing function with $g(0)=0$, and $\theta_{t-1}$ is trained by minimizing some objective function, i.e., $\theta_{t-1} := \operatorname*{arg\,min}_{\theta}\mathcal{L}(\theta;D_{t-1})$. We simply take the action $a_{\text{ROFU}} := \operatorname*{arg\,max}_{a\in A}\text{OFU}^{R}(x_t,a)$ to interact with the environment.

In Eq. (4), we first solve an unconstrained optimization problem to compute $\hat{\theta}(x_t,a)$. Then, in Eq. (5), we add a bonus to $f_{\theta_{t-1}}(x_t,a)$ according to the difference between $f_{\hat{\theta}(x_t,a)}(x_t,a)$ and $f_{\theta_{t-1}}(x_t,a)$. From Eq. (4) and (5), we can see that, while ROFU still adds a bonus to $f_{\theta_{t-1}}(x,a)$ as Bonus-OFU does, it does not need to explicitly construct the upper confidence bound via concentration inequalities. Instead, we construct the upper confidence bound by solving a regularized optimization problem, which is more general. In comparison to Set-OFU, our optimization problem is unconstrained, so Eq. (4) can be efficiently solved by standard gradient-based optimizers. In Sec. 3.2, we propose an algorithm which efficiently approximates $\text{OFU}^{R}$ with a few steps of gradient descent.

An important question is whether selecting the action with maximum $\text{OFU}^{R}$ achieves a good EE trade-off, or whether $\text{OFU}^{R}(x_t,a)$ is a reasonable upper confidence bound over $f_{\theta_{t-1}}(x_t,a)$. To provide more insight, consider the case where $\mathcal{L}=\mathcal{R}$. It is easy to see that $\forall(x,a),\ f_{\hat{\theta}(x,a)}(x,a)\geq f_{\theta_{t-1}}(x,a)$. Thus, since $g$ is monotonically increasing with $g(0)=0$, $\text{OFU}^{R}(x,a)$ is an upper bound of $f_{\theta_{t-1}}(x,a)$. Intuitively, if we can find a parameter $\hat{\theta}$ that increases the value of $f_{\theta}(x,a)$ without significantly increasing $\mathcal{R}(\theta;D)$, then the uncertainty about the reward of $(x,a)$ is large, and Eq. (5) consequently assigns such an action a large bonus. Moreover, $g$ and $\eta$ jointly control the scale of the confidence bonus: the difference between $f_{\hat{\theta}(x,a)}(x,a)$ and $f_{\theta_{t-1}}(x,a)$ increases as $\eta$ decreases, and $g$ controls how much this difference contributes to the final bonus. In Section 3.1, we show that with proper $g,\eta,\mathcal{L}$ and $\mathcal{R}$, we can achieve near-optimal regret bounds on contextual bandits when the reward function is simple.

Remark 1.

It is noteworthy that the choices of $\mathcal{R},\mathcal{L},\eta$ and $g$ are crucial to the performance of the EE trade-off. For example, if we select $\mathcal{R} := \mathbb{P}(D|\theta)$, which is used in Set-OFU, then $f_{\hat{\theta}(x,a)}(x,a)=+\infty$ for all $(x,a)$, and our method would sample actions uniformly.

3.1 Regret analysis

The regret bounds for ROFU are developed by drawing connections between ROFU and existing near-optimal algorithms, so the regret of ROFU in these cases is also near-optimal. More specifically, we show that for MAB, LCB and KCB, the exact solution of Eq. (5) is equivalent to the upper confidence bound of UCB1, LinUCB and KernelUCB, respectively. For NTKCB, $\text{OFU}^{R}$ has similar properties to the upper confidence bound in NeuralUCB. We summarize these results in Table 1.

                        MAB                         Linear/Kernel-CB                            NTK CB
\mathcal{R}             |D|\,\text{MSE}(\theta;D)   \|\theta\|^{2}+|D|\,\text{MSE}(\theta;D)    See Eq. (8)
\mathcal{L}             \text{MSE}(\theta;D)        \text{MSE}(\theta;D)                        See Eq. (7)
\eta                    \frac{1}{16\log|D|}         1/2                                         1/(2\gamma^{2})
g(w)                    \sqrt{w}                    \sqrt{w}                                    \sqrt{w}
Equivalent/Related to   UCB1                        Lin/KernelUCB                               NeuralUCB
Regret                  O(\log T)                   \tilde{O}(\sqrt{T})                         \tilde{O}(\sqrt{T})
Table 1: Theoretical results for ROFU on contextual bandit tasks with simple reward functions.

Equivalence to UCB1, LinUCB and KernelUCB

Specifically, we use MSE variants for both $\mathcal{L}$ and $\mathcal{R}$ and $g(w)=\sqrt{w}$ henceforth. After fixing $\mathcal{L}$, $\mathcal{R}$ and $g$, we can tune $\eta$ to make the solution of Eq. (5) equal to the upper confidence bound in UCB1, LinUCB or KernelUCB. Recall $\text{MSE}(\theta;D) := \frac{1}{|D|}\sum_{t\leq|D|}(f_{\theta}(x_t,a_t)-r_t)^2$. The following theorem shows the equivalence between ROFU and LinUCB.

Theorem 1.

With $\mathcal{L}(\theta;D) := \text{MSE}(\theta;D)$, $\mathcal{R}(\theta;D) := \|\theta\|^{2}+|D|\,\text{MSE}(\theta;D)$, $\eta(x,a,D) := 1/2$ and $g(w)=\sqrt{w}$, $\text{OFU}^{R}$ is identical to the upper confidence bound used in LinUCB.

Proof.

Recall that in LCB, $f_{\theta}(x,a) := \phi(x,a)^{\top}\theta$. Setting the derivative of Eq. (4) to zero, we have $\phi(x,a)-2\eta(x,a,D)\big((\bm{I}+\sum_{(x_i,a_i,r_i)\in D}\phi(x_i,a_i)\phi(x_i,a_i)^{\top})\theta-\sum_{(x_i,a_i,r_i)\in D}\phi(x_i,a_i)r_i\big)=0$. With elementary algebra, we can see that $\hat{\theta}(x,a)=\theta_{t-1}+\frac{1}{2\eta(x,a,D)}\big(\bm{I}+\sum_{(x_i,a_i,r_i)\in D}\phi(x_i,a_i)\phi(x_i,a_i)^{\top}\big)^{-1}\phi(x,a)$.

Let $\bm{Z} := \bm{I}+\sum_{(x_i,a_i,r_i)\in D}\phi(x_i,a_i)\phi(x_i,a_i)^{\top}$. Given $g(w)=w^{1/2}$, inserting $\hat{\theta}(x,a)$ and $\eta$ into Eq. (5), we have $\text{OFU}^{R}(x,a)=\phi(x,a)^{\top}\theta_{t-1}+\big(\phi(x,a)^{\top}\bm{Z}^{-1}\phi(x,a)\big)^{\frac{1}{2}}$, which is the upper confidence bound used in LinUCB. This finishes the proof. ∎
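The closed forms in this proof can be checked numerically. The minimal NumPy sketch below uses synthetic data and takes $\theta_{t-1}$ to be the regularized least-squares fit, as in the derivation above; it is an illustrative sanity check, not part of the paper.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 6
Phi = rng.normal(size=(30, d))               # past features phi(x_i, a_i)
r = rng.normal(size=30)                      # past rewards r_i
phi_xa = rng.normal(size=d)                  # feature of the candidate pair (x, a)

Z = np.eye(d) + Phi.T @ Phi
theta_prev = np.linalg.solve(Z, Phi.T @ r)   # regularized least-squares fit (theta_{t-1})

# theta_hat maximizes phi^T theta - eta * (||theta||^2 + |D| * MSE) with eta = 1/2;
# setting the gradient to zero gives the closed form from the proof
eta = 0.5
theta_hat = theta_prev + (1.0 / (2 * eta)) * np.linalg.solve(Z, phi_xa)

rofu = phi_xa @ theta_prev + np.sqrt(phi_xa @ theta_hat - phi_xa @ theta_prev)
linucb = phi_xa @ theta_prev + np.sqrt(phi_xa @ np.linalg.solve(Z, phi_xa))
print(np.isclose(rofu, linucb))              # True: the two upper confidence bounds coincide
```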

The derivation in Thm 1 can be directly used to show that $\text{OFU}^{R}$ is also identical to the upper confidence bound used in KernelUCB, and the equivalence between ROFU and UCB1 can be proved by similar techniques. We postpone these derivations to the Appendix.

Remark 2.

From the derivation in Thm 1, it is easy to see that for $g(w)=w^{b}$ with $b>0$, we can also tune $\eta$ to make ROFU identical to LinUCB. However, for $b\neq 1/2$, $\eta$ is a complex function depending on $x,a$ and $D$. Thus, we only consider $g(w)=w^{1/2}$.

Connection to NeuralUCB and sublinear regret for NTK contextual bandit.

The NeuralUCB zhou2020neural uses a neural network to learn the reward function and constructs a UCB by using the DNN-based random feature mappings. For learning the reward function, NeuralUCB exploits the recent progress of optimizing a DNN in the Neural Tangent Kernel (NTK) regime allen2018convergence ; du2019gradient ; zou2019improved , which shows that gradient descent can find global minima of a loss for a finite set of training samples if the network is sufficiently wide and the learning rate is sufficiently small.

Intuitively, NeuralUCB takes the gradient $\nabla_{\theta}f_{\theta}(x_t,a)$ as a random feature to represent the observation $(x_t,a)$. Then, it constructs a UCB as in a linear bandit, i.e., $\text{NeuralUCB}(x_t,a)=f_{\theta_{t-1}}(x_t,a)+\gamma\sqrt{\nabla_{\theta}f_{\theta_{t-1}}(x_t,a)^{\top}\tilde{\bm{Z}}_{t-1}^{-1}\nabla_{\theta}f_{\theta_{t-1}}(x_t,a)/m}$, where $\gamma$ is a predefined constant, $m$ is the network width and $\tilde{\bm{Z}}_{t-1}$ is a matrix built from the contextual information. NeuralUCB achieves $\tilde{O}(\sqrt{T})$ regret.
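For intuition, the sketch below computes a NeuralUCB-style bonus for a tiny PyTorch network, using the flattened gradient as the random feature; the width, $\lambda$, $\gamma$ and the synthetic inputs are placeholder assumptions rather than the settings of zhou2020neural.

```python
import torch

torch.manual_seed(0)
m, d = 16, 10                                   # network width m, input dimension d
net = torch.nn.Sequential(torch.nn.Linear(d, m), torch.nn.ReLU(), torch.nn.Linear(m, 1))
p_dim = sum(p.numel() for p in net.parameters())

def grad_feature(x):
    """Flattened gradient of f_theta at x, used as the random feature."""
    net.zero_grad()
    net(x).backward()
    return torch.cat([p.grad.reshape(-1) for p in net.parameters()])

lam, gamma = 1.0, 1.0                           # placeholder lambda and gamma
Z_tilde = lam * torch.eye(p_dim)
for _ in range(20):                             # accumulate outer products over past (x_tau, a_tau)
    g = grad_feature(torch.randn(d))
    Z_tilde += torch.outer(g, g) / m

x_new = torch.randn(d)                          # candidate observation (x_t, a)
g_new = grad_feature(x_new)
bonus = gamma * torch.sqrt(g_new @ torch.linalg.solve(Z_tilde, g_new) / m)
print(net(x_new).item() + bonus.item())         # NeuralUCB-style upper confidence value
```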

We next show that ROFU can also achieve a similar regret in the NTK regime by carefully choosing the regularizer $\mathcal{R}$. To avoid excessive technical exposition of the NTK regime, we adopt the same assumptions and the same model setup as those in zhou2020neural. We bound the difference between ROFU and NeuralUCB in the NTK regime, and then show that ROFU achieves a regret bound similar to that of NeuralUCB.

At the start, we assume that the network takes the following form

f_{\theta}(\bm{x}) = \sqrt{m}\,\bm{W}_{L}\sigma(\bm{W}_{L-1}\sigma(\cdots\sigma(\bm{W}_{1}\bm{x}))),   (6)

where $\bm{x}$ is composed of $(x,a)$, $\sigma(\cdot)$ is the point-wise activation function, $L$ is the network depth and $m$ is the network width. Furthermore, $\bm{W}_1\in\mathbb{R}^{m\times d}$, $\bm{W}_l\in\mathbb{R}^{m\times m}$ for $l\in\{2,\dots,L-1\}$, $\bm{W}_L\in\mathbb{R}^{1\times m}$, and $\theta$ collects all these matrices into a long vector of dimension $p$.

To see the connection with NeuralUCB, in Algorithm 1 we choose the loss as

\mathcal{L}(\theta;D) := \sum_{(x,a,r)\in D}(f_{\theta}(x,a)-r)^{2} + m\lambda\|\theta-\theta_{0}\|^{2},   (7)

where $\theta_0$ is the network parameter at initialization and $m$ is the network width. Here, the MSE loss is penalized by the distance between $\theta$ and its initialization $\theta_0$ because the analysis technique for NTK bandits requires that the network parameters stay within a neighborhood of the initialization zhou2020neural. We write the penalty coefficient as $m\lambda$ for proof convenience and for the comparison with NeuralUCB. We further choose the regularizer as

\mathcal{R}(\theta;D_{t-1}) := \frac{1}{\eta}\big(f_{\theta}(x_{t},a)-\tilde{f}_{\theta}(x_{t},a;\theta_{t-1})\big) + \tilde{\mathcal{R}}(\theta,D_{t-1};\theta_{t-1}),   (8)

where $\theta_{t-1}$ is the last iterate that minimizes the loss on $D_{t-1}$. Moreover, $\tilde{f}_{\theta}(x_t,a;\theta_{t-1})=f_{\theta_{t-1}}(x_t,a)+\langle\nabla_{\theta}f_{\theta_{t-1}}(x_t,a),\theta-\theta_{t-1}\rangle$ and $\tilde{\mathcal{R}}(\theta,D_{t-1};\theta_{t-1}) := \sum_{(x,a,r)\in D_{t-1}}\big(f_{\theta_{t-1}}(x,a)+\langle\nabla_{\theta}f_{\theta_{t-1}}(x,a),\theta-\theta_{t-1}\rangle-r\big)^{2}+m\lambda\|\theta-\theta_{0}\|^{2}$ are the first-order Taylor expansions of $f_{\theta}(x_t,a)$ and $\mathcal{L}(\theta;D_{t-1})$ at $\theta_{t-1}$, respectively. Then, equivalently, $\hat{\theta}(x_t,a)$ is calculated by

\hat{\theta}(x_{t},a) := \operatorname*{arg\,max}_{\theta}\ \tilde{f}_{\theta}(x_{t},a;\theta_{t-1}) - \eta\,\tilde{\mathcal{R}}(\theta,D_{t-1};\theta_{t-1}),   (9)

where $\tilde{f}_{\theta}(x_t,a;\theta_{t-1})$ and $\tilde{\mathcal{R}}(\theta,D_{t-1};\theta_{t-1})$ are defined as above.

We assume that the network is trained in the NTK regime with learning rate $\beta$ (see details in Algorithm 3 of Appendix B). Then, choosing $g(w)=\sqrt{w}$ in Eq. (5) and $\eta=1/(2\gamma^{2})$, where $\gamma$ is the same factor as the one used in NeuralUCB, we have the following regret bound for ROFU.

Theorem 2.

For convenience, let $L$ denote the network depth, $T$ the time horizon, $A$ the action set and $\lambda_0$ the minimal eigenvalue of the neural tangent kernel matrix at initialization. If the loss and regularizer are chosen as in Eq. (7) and Eq. (8), the network width satisfies $m\geq\text{poly}(T,|A|,L,\lambda^{-1},\lambda_0^{-1})$ and the learning rate satisfies $\beta\leq C(mTL+m\lambda)^{-1}$, then with high probability the regret of ROFU satisfies $R_T\leq\tilde{O}(\sqrt{T})$, where $\tilde{O}(\cdot)$ hides the $\log T$ terms, the dependence on the problem dimension and other factors irrelevant to $T$.

The idea of the proof is to connect the formula of ROFU with that of NeuralUCB and then bound their difference such that the order of the regret bound is not affected. We leave the full proof to Appendix B. This justifies that ROFU can also achieve a near-optimal regret bound in the NTK regime. Nonetheless, in sharp contrast with NeuralUCB, where computing the inverse of a large matrix $\bm{Z}_{t-1}$ is necessary at each step, ROFU admits a very efficient implementation, as shown below.

3.2 An efficient algorithm to compute $\text{OFU}^{R}$

In order to apply ROFU, we have to efficiently solve or approximate Eq. (4). In the cases of MAB and KCB, we can calculate closed-form solutions for Eq. (4). When $f_{\theta}$ is a deep neural network, it is time-consuming to solve Eq. (4) exactly. Naturally, one can use gradient descent methods to approximately solve Eq. (4). However, it is still not manageable to optimize from scratch for every $(x_t,a)$. We therefore propose to start the optimization from $\theta_{t-1}$ for each $(x_t,a)$.

More specifically, when $f_{\theta}$ is a neural network and $\mathcal{R}$ is differentiable, we can optimize the parameters $\theta$ with gradient ascent. We approximately solve Eq. (4) by executing a few steps of gradient ascent starting from $\theta_{t-1}$. That is, $\hat{\theta}_{j+1}=\hat{\theta}_{j}+\kappa\nabla_{\theta}\big(f_{\hat{\theta}_{j}}(x_t,a)-\eta\mathcal{R}(\hat{\theta}_{j};D)\big)$ with $\hat{\theta}_0 := \theta_{t-1}$, where $\kappa$ is the step size. The above implementation is summarized in Alg. 2.

Alg. 2 essentially performs a local search around $\theta_{t-1}$. Indeed, starting from $\theta_{t-1}$ brings an extra benefit for the optimization, because intuitively $\theta_{t-1}$ often gets closer to $\hat{\theta}(\cdot,a)$ as $t$ grows. For example, in MAB, $\theta_{t-1}(\cdot,a)=\bar{\mu}_{t-1}(a)$, where $\bar{\mu}_{t-1}(a)$ is the empirical mean of action $a$, and $\hat{\theta}_{t-1}(\cdot,a)=\bar{\mu}_{t-1}(a)+1/n(a)$, where $n(a)$ is the number of pulls on $a$. Thus, $\theta_t$ is close to $\hat{\theta}$ when $n(a)$ is large.

It is easy to see that the time complexity of Alg. 2 is $O(TM|A|p)$. In Sec. 5, we show that the regret of ROFU is low in practice even with a very small $M$.

Remark 3.

For MAB, linear/kernel bandits and the NTK bandit, $-(f_{\theta}(x_t,a)-\mathcal{R}(\theta;D))$ is strongly convex and smooth. Assuming that it is $\alpha$-strongly convex and $\zeta$-smooth, by Thm 2.1 in needell2014stochastic, we have $\mathbb{E}\|\hat{\theta}_M(x_t,a)-\hat{\theta}(x_t,a)\|^2\leq(1-2\kappa\alpha(1-\kappa\zeta))^{M}\|\theta_{t-1}-\hat{\theta}(x_t,a)\|^2+\frac{\kappa\nu^2}{\alpha(1-\kappa\zeta)}$, where $\mathbb{E}$ is over the randomness of SGD, $\kappa$ is the learning rate and $\nu^2$ is the expected squared gradient norm at $\hat{\theta}(x_t,a)$.

We note that the convexity and smoothness for the NTK bandit are due to the choice of $\mathcal{R}(\theta;D)$ in Eq. (8). Based on Remark 3, we can expect Alg. 2 to converge fast in these cases, especially since it can leverage the property that $\|\theta_{t-1}-\hat{\theta}(x_t,a)\|$ gets smaller as $t$ gets larger. When $f_{\theta}$ is general and complex, there is no theoretical guarantee on the convergence rate, but in practice stochastic gradient descent with a good initialization usually converges fast. We note that NeuralUCB can have a fast implementation by approximating $\bm{Z}$ with its diagonal elements, which, however, can significantly compromise the regret in practice. Please see Sec. 5.2 for empirical evaluations.

Algorithm 2 An efficient implementation to estimate \text{OFU}^{R}
1:  Input: Dataset D_{t-1}, \theta_{t-1}=\operatorname*{arg\,min}_{\theta}\mathcal{L}(\theta;D_{t-1}), a context-action pair (x_t,a), learning rate \kappa and number of training steps M.
2:  Observe context x_t.
3:  Set \hat{\theta}_0(x_t,a) := \theta_{t-1}.
4:  for j=1,\dots,M do
5:     \hat{\theta}_j(x_t,a) := \hat{\theta}_{j-1}(x_t,a)+\kappa\tilde{\nabla}\big(f_{\hat{\theta}_{j-1}(x_t,a)}(x_t,a)-\eta\mathcal{R}(\hat{\theta}_{j-1}(x_t,a);D_{t-1})\big), where \tilde{\nabla} is an estimator of the gradient.
6:  end for
7:  return \text{OFU}^{R}(x_t,a) := f_{\theta_{t-1}}(x_t,a)+\sqrt{\max\big(0, f_{\hat{\theta}_M(x_t,a)}(x_t,a)-f_{\theta_{t-1}}(x_t,a)\big)}.
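A minimal PyTorch sketch of Alg. 2 is given below, instantiated with the choices used in our experiments ($\eta=1$, $\mathcal{R}(\theta;D)=|D|\,\text{MSE}(\theta;D)$, $g(w)=\sqrt{w}$); the network architecture, step size and synthetic data are placeholder assumptions, and a copy of $\theta_{t-1}$ is updated so the deployed model is left untouched.

```python
import copy
import torch

def rofu_ucb(net, xa, data, eta=1.0, kappa=1e-2, M=5):
    """Estimate OFU^R(x_t, a) with M gradient-ascent steps starting from theta_{t-1} (Alg. 2).

    net:  model f_theta already trained to minimize MSE on `data` (plays the role of theta_{t-1})
    xa:   feature tensor encoding the context-action pair (x_t, a)
    data: tuple (X, R) of past context-action features and observed rewards
    """
    X, R = data
    with torch.no_grad():
        f_prev = net(xa).item()                    # f_{theta_{t-1}}(x_t, a)

    net_hat = copy.deepcopy(net)                   # start the local search from theta_{t-1}
    opt = torch.optim.SGD(net_hat.parameters(), lr=kappa)
    for _ in range(M):
        opt.zero_grad()
        reg = len(R) * torch.mean((net_hat(X).squeeze(-1) - R) ** 2)   # R(theta; D) = |D| * MSE
        objective = net_hat(xa) - eta * reg        # maximize f_theta(x_t, a) - eta * R(theta; D)
        (-objective).backward()                    # gradient ascent via SGD on the negation
        opt.step()

    with torch.no_grad():
        f_hat = net_hat(xa).item()
    return f_prev + max(0.0, f_hat - f_prev) ** 0.5    # line 7 of Alg. 2, g(w) = sqrt(w)

# toy usage with a placeholder 2-layer network and synthetic data
torch.manual_seed(0)
net = torch.nn.Sequential(torch.nn.Linear(8, 32), torch.nn.ReLU(), torch.nn.Linear(32, 1))
X, R = torch.randn(100, 8), torch.randn(100)
print(rofu_ucb(net, torch.randn(8), (X, R)))
```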

4 Related work

The contextual bandit problem has been extensively studied and applied in machine learning. Besides OFU, the most used and studied contextual bandit algorithms can mainly be divided into two categories: $\epsilon$-greedy sutton2018reinforcement and Thompson sampling thompson1933likelihood.

$\epsilon$-greedy greedily selects the action $a=\operatorname*{arg\,max}_{a'}f_{\theta_{t-1}}(x,a')$ with probability $1-\epsilon$ and selects a random action to explore with probability $\epsilon$. While sub-optimal in terms of EE trade-off, $\epsilon$-greedy is simple, general and computationally efficient. Thus, $\epsilon$-greedy is the most widely used algorithm in complex tasks.

The basic idea of Thompson sampling thompson1933likelihood is to maintain a posterior over parameters and make decisions according to samples from the posterior. However, exact Thompson sampling can be applied only when the posterior is accessible; otherwise one has to resort to approximate inference methods, which can be computationally inefficient and may significantly hurt the regret deep_bayesian_bandit_showdown; phan2019thompson.

There are several works extending OFU and Thompson sampling to the case of non-linear contextual bandits. NeuralUCB zhou2020neural and Neural-TS zhang2021neural extend OFU and TS to NTK bandits, and they provide regret bounds of order $\tilde{O}(\sqrt{T})$. Our analysis for the NTK bandit is largely inspired by their work. However, as mentioned in Sec. 2, both NeuralUCB and NeuralTS need to calculate the inverse of a matrix whose dimension equals the number of parameters in the DNN, which is computationally inefficient for large DNN models. Moreover, the neural-linear model is studied in zahavy2019deep; deep_bayesian_bandit_showdown, where the output of the second-to-last layer of a DNN is taken as the feature map. These algorithms sometimes work well, but have no theoretical guarantee on regret.

5 Experiment

We now empirically evaluate ROFU. For simplicity, in all our experiments we set $\eta=1$, $g(w)=\sqrt{w}$, $\mathcal{L}(\theta;D)=\text{MSE}(\theta;D)$ and $\mathcal{R}(\theta;D)=|D|\,\text{MSE}(\theta;D)$. That is, $\theta_{t-1}$ is trained to minimize $\text{MSE}(\theta;D)$ with a standard optimizer (we train $\theta_{t-1}$ with stochastic gradient descent starting from $\theta_{t-2}$ in all the experiments), and $\hat{\theta}(x,a)$ is trained to maximize $f_{\theta}(x,a)-|D|\,\text{MSE}(\theta;D)$ using Alg. 2.

Figure 1: Ablation study on MLP and ResNet bandits. Notation: $M=1/5/10$ means that we run $1/5/10$ gradient descent updates in Alg. 2.

5.1 Analysis on MLP and ResNet bandits

As discussed in Sec. 1, a contextual bandit algorithm should trade off EE efficiently when the reward is generated by a complex function, while keeping the computational cost low. From Alg. 2, we can see that the time complexity of ROFU is determined by the number of training steps $M$. We therefore evaluate the regret of ROFU when $M$ is small.

To evaluate the performance of ROFU in Alg. 2 on complex tasks, we consider two contextual bandits with a DNN as the simulator, i.e., $r(x_t,a)$ is generated by a DNN model. We consider two popular DNN architectures to generate rewards: a 2-layer MLP and a 20-layer ResNet with CNN blocks and Batch Normalization as in he2016deep. We summarize other information about the two tested bandits in Table 2.

Bandit   Layers   Context Dim   # Arms   NN Parameters        Context Distribution   Noise
MLP      2        10            10       Random               Gaussian               N(0, 0.05)
ResNet   20       3×32×32       10       Trained on Cifar10   Uniform                N(0, 0.5)

Table 2: Basic information about MLP and ResNet bandits.

We use DNNs of larger size for training in Alg. 2. More specifically, for the MLP bandit, $f_{\theta}$ is chosen as a 3-layer MLP, and for the ResNet20 bandit, $f_{\theta}$ is chosen as a ResNet32. Each experiment is repeated 16 times. We present the regret and confidence bonus in Fig. 1. From Fig. 1, we can see that (1) ROFU achieves a small regret on both tasks with a considerably small $M$, even for a very large DNN model; and (2) the confidence bonus monotonically increases with $M$, and for each $M$ the confidence bonus converges to 0 as expected. Moreover, while the regret seems sensitive to the value of $M$, the regrets of ROFU with $M=5,10$ are much smaller than those of the $M=1$ case and of $\epsilon$-greedy.

Figure 2: Evaluations on non-linear contextual bandits.

5.2 Performance comparison on real-world datasets

To evaluate ROFU against powerful baselines, we conduct experiments on contextual bandits created from real-world datasets, following the setting in deep_bayesian_bandit_showdown. For example, suppose that $D := \{(x_t,c_t)\}_{t\leq T}$ is a $K$-classification dataset, where $x_t$ is the feature vector and $c_t\in[K]$ is the label. We create a contextual bandit problem as follows: at time step $t\leq T$, the agent observes the context $x_t$ and then takes an action $a_t$. The agent receives a high reward if it successfully predicts the label of $x_t$. A non-classification dataset can be turned into a contextual bandit in a similar way. For the details of these bandits, please refer to deep_bayesian_bandit_showdown.
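For concreteness, a minimal sketch of this conversion is shown below; it uses a synthetic classification dataset and a uniformly random agent as placeholders for the real datasets and algorithms evaluated in this section.

```python
import numpy as np

rng = np.random.default_rng(0)
K, T, d = 5, 1000, 20
X = rng.normal(size=(T, d))        # feature vectors x_t
c = rng.integers(K, size=T)        # class labels c_t in [K]

def bandit_reward(t, action):
    """Reward 1 if the chosen arm equals the label of x_t, else 0."""
    return float(action == c[t])

# interaction loop with a placeholder uniformly random agent
total_reward = sum(bandit_reward(t, rng.integers(K)) for t in range(T))
print(f"cumulative reward of a random agent: {total_reward:.0f} (optimal: {T})")
```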

For baselines, we consider NeuralUCB zhou2020neural and the Thompson sampling variants from deep_bayesian_bandit_showdown. It is noteworthy that we only evaluate the algorithms in deep_bayesian_bandit_showdown with relatively small regrets, and we directly run the code provided by the authors. For ROFU, we fix $M=5$ in all experiments; for the other hyper-parameters, please see Appendix C. We repeat each algorithm 16 times on each dataset.

We report the regret $R_T=\mathbb{E}\sum_{t\leq T}r(x_t,a'_t)-\mathbb{E}\sum_{t\leq T}r(x_t,a_t)$, where $a'_t=\operatorname*{arg\,max}_{a\in A}f_{\theta'}(x_t,a)$ and $\theta'$ is a parameter trained on the full dataset. For hyper-parameter tuning and more details, please see Appendix C. The results are presented in Fig. 2 and Table 3. We found that the regret of NeuralUCB is occasionally linear; this might be because NeuralUCB uses a diagonal matrix to approximate $\bm{Z}$ for acceleration. Moreover, we can see that ROFU significantly outperforms these baselines in terms of regret.

              Mean          Census       Jester       Adult        Covertype    Statlog        Financial    Mushroom
Dropout       1.75 ± 0.80   1.51 ± 0.10  1.34 ± 0.14  1.00 ± 0.09  1.14 ± 0.13  1.54 ± 0.87    3.50 ± 0.60  2.21 ± 0.42
Bootstrap     2.23 ± 1.00   2.51 ± 0.16  1.72 ± 0.11  1.43 ± 0.10  1.93 ± 0.13  1.43 ± 1.57    4.52 ± 2.29  2.04 ± 0.48
ParamNoise    2.30 ± 1.12   2.28 ± 0.23  1.59 ± 0.14  1.37 ± 0.10  1.80 ± 0.20  3.88 ± 6.40    4.07 ± 1.76  1.06 ± 0.32
NeuralLinear  1.82 ± 0.69   3.24 ± 0.47  1.70 ± 0.13  1.46 ± 0.12  1.84 ± 0.19  1.25 ± 0.11    2.25 ± 0.35  1.00 ± 0.38
Greedy        2.47 ± 1.12   2.76 ± 1.24  1.65 ± 0.10  1.56 ± 0.11  2.27 ± 0.23  3.08 ± 4.91    4.74 ± 2.31  1.20 ± 0.41
NeuralUCB     9.76 ± 14.02  1.72 ± 0.12  1.47 ± 0.08  1.18 ± 0.05  1.86 ± 0.16  41.42 ± 69.51  2.74 ± 0.50  17.29 ± 7.45
ROFU (ours)   1.05 ± 0.09   1.00 ± 0.09  1.00 ± 0.20  1.17 ± 0.06  1.00 ± 0.14  1.00 ± 0.24    1.00 ± 0.60  1.22 ± 0.37

Table 3: The final regret of each algorithm. On each dataset, the regrets are normalized by that of the algorithm with the smallest regret, so the best algorithm has value 1.00; the Mean column averages the normalized regrets over the datasets.

6 Conclusion and future work

In this work, we propose an OFU variant, called ROFU, which can be applied to non-linear contextual bandits. We show that the regret of ROFU is (near-)optimal on various contextual bandit models. Moreover, we propose an efficient algorithm to approximately compute the upper confidence bound. Thus, ROFU is efficient in both computation and EE trade-off, which is empirically verified by our experimental results.

EE trade-off is a fundamental problem that lies at the heart of sequential decision making. However, the huge computational cost of (near-)optimal EE trade-off algorithms significantly limits their application, especially in complex domains. We hope our method can inspire more algorithms that efficiently trade off EE in sequential decision-making tasks beyond contextual bandit, such as deep reinforcement learning.

References

  • (1) Z. Allen-Zhu, Y. Li, and Z. Song. A convergence theory for deep learning via over-parameterization. arXiv preprint arXiv:1811.03962, 2018.
  • (2) Z. Allen-Zhu, Y. Li, and Z. Song. On the convergence rate of training recurrent neural networks. In Advances in Neural Information Processing Systems, 2019.
  • (3) P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 2002.
  • (4) S. Bubeck and N. Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. arXiv preprint arXiv:1204.5721, 2012.
  • (5) Y. Cao and Q. Gu. A generalization theory of gradient descent for learning over-parameterized deep ReLU networks. arXiv preprint arXiv:1902.01384, 2019.
  • (6) W. Chu, L. Li, L. Reyzin, and R. Schapire. Contextual bandits with linear payoff functions. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, 2011.
  • (7) A. M. Davie and A. J. Stothers. Improved bound for complexity of matrix multiplication. Proceedings. Section A, Mathematics-The Royal Society of Edinburgh, 143(2):351, 2013.
  • (8) S. S. Du, X. Zhai, B. Poczos, and A. Singh. Gradient descent provably optimizes over-parameterized neural networks. In International Conference on Learning Representations (ICLR), 2019.
  • (9) S. Filippi, O. Cappe, A. Garivier, and C. Szepesvári. Parametric bandits: The generalized linear case. In NIPS, volume 23, pages 586–594, 2010.
  • (10) Q. Gu, A. Karbasi, K. Khosravi, V. Mirrokni, and D. Zhou. Batched neural bandits. arXiv preprint arXiv:2102.13028, 2021.
  • (11) K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
  • (12) T. Jaksch, R. Ortner, and P. Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 2010.
  • (13) D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • (14) J. Langford and T. Zhang. The epoch-greedy algorithm for contextual multi-armed bandits. Advances in neural information processing systems, 20(1):96–1, 2007.
  • (15) D. Needell, R. Ward, and N. Srebro. Stochastic gradient descent, weighted sampling, and the randomized kaczmarz algorithm. Advances in neural information processing systems, 27:1017–1025, 2014.
  • (16) M. Phan, Y. Abbasi-Yadkori, and J. Domke. Thompson sampling with approximate inference. arXiv preprint arXiv:1908.04970, 2019.
  • (17) C. Riquelme, G. Tucker, and J. Snoek. Deep Bayesian bandits showdown: An empirical comparison of Bayesian deep networks for Thompson sampling. In International Conference on Learning Representations, 2018.
  • (18) R. S. Sutton and A. G. Barto. Reinforcement learning: An introduction. MIT press, 2018.
  • (19) W. R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 1933.
  • (20) M. Valko, N. Korda, R. Munos, I. Flaounas, and N. Cristianini. Finite-time analysis of kernelised contextual bandits. arXiv preprint arXiv:1309.6869, 2013.
  • (21) T. Zahavy and S. Mannor. Deep neural linear bandits: Overcoming catastrophic forgetting through likelihood matching. arXiv preprint arXiv:1901.08612, 2019.
  • (22) W. Zhang, D. Zhou, L. Li, and Q. Gu. Neural thompson sampling. In International Conference on Learning Representations, 2021.
  • (23) D. Zhou, L. Li, and Q. Gu. Neural contextual bandits with ucb-based exploration. In International Conference on Machine Learning, 2020.
  • (24) D. Zou and Q. Gu. An improved analysis of training over-parameterized deep neural networks. In Advances in Neural Information Processing Systems, 2019.

Appendix A Analysis on multi-armed bandit

The following theorem presents the equivalence between ROFU and UCB1 auer2002finite .

Theorem 3.

Let $n_a$ denote the number of pulls on action $a$ and $\bar{\theta}_a$ denote the empirical mean of rewards on action $a$. With $\mathcal{L}(\theta;D)=\text{MSE}(\theta;D)$, $\mathcal{R}(\theta;D)=|D|\,\text{MSE}(\theta;D)$, $\eta(a)=\frac{1}{2n_a(8\log|D|/n_a)^{1/2b}}$ and $g(w)=w^{b}$, we have $\text{OFU}^{R}(a)=\bar{\theta}_a+\sqrt{8\log t/n_a}$, which is the bound used in UCB1.

Proof.

Setting the derivative of Eq. (4) to zero, we have $\hat{\theta}(a)=\bar{\theta}_a+(8\log t/n_a)^{1/2b}$; in particular, for $b=1/2$ this gives $\hat{\theta}(a)=\bar{\theta}_a+8\log t/n_a$. Inserting this into Eq. (5) with $g(w)=w^{b}$ yields $\text{OFU}^{R}(a)=\bar{\theta}_a+\sqrt{8\log t/n_a}$, which finishes the proof. ∎
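This closed form admits a short numerical check; the sketch below uses synthetic pull counts and rewards (the values of $t$, $n_a$ and $b$ are arbitrary placeholders).

```python
import numpy as np

t, n_a, b = 200, 12, 0.5                      # round index, pulls on arm a, exponent b
rewards_a = np.random.default_rng(3).normal(size=n_a)
theta_bar = rewards_a.mean()                  # empirical mean of arm a

eta = 1.0 / (2 * n_a * (8 * np.log(t) / n_a) ** (1 / (2 * b)))
# closed-form maximizer of theta_a - eta * sum_i (theta_a - r_i)^2
theta_hat = theta_bar + 1.0 / (2 * eta * n_a)
ofu_r = theta_bar + (theta_hat - theta_bar) ** b   # Eq. (5) with g(w) = w^b

ucb1_style = theta_bar + np.sqrt(8 * np.log(t) / n_a)
print(np.isclose(ofu_r, ucb1_style))          # True
```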

Appendix B Analysis on neural tangent kernel bandit

We now restate and prove Theorem 2.

The training step in Algorithm 1 (line 8) is accomplished by running gradient descent for $J$ iterations with learning rate $\beta$ to minimize the loss, as given by Algorithm 3. In the following analysis, for ease of derivation, we assume that $\theta^{(J)}$ satisfies the first-order optimality condition of minimizing $\mathcal{L}(\theta;D)$, and $\theta_0$ is the randomly initialized starting point.

Algorithm 3 TrainNN(D,\beta,\theta_0)
1:  Define \mathcal{L}(\theta;D)=\sum_{(x,a,r)\in D}(f(x,a;\theta)-r)^{2}+m\lambda\|\theta-\theta_{0}\|^{2}.
2:  Let \theta^{(0)}\leftarrow\theta_{0}.
3:  for j = 0, …, J-1 do
4:     \theta^{(j+1)}=\theta^{(j)}-\beta\nabla\mathcal{L}(\theta^{(j)}).
5:  end for
6:  return \theta^{(J)}.

By choosing the regularizer $\mathcal{R}(\theta;D_{t-1})$ as in Eq. (8), the procedure of searching for $\hat{\theta}(x_t,a)$ (Eq. (4)) is equivalent to

\hat{\theta}(x_{t},a) = \operatorname*{arg\,max}_{\theta}\ \tilde{f}_{\theta}(x_{t},a;\theta_{t-1}) - \eta\,\tilde{\mathcal{R}}(\theta,D_{t-1};\theta_{t-1}),   (10)

where

\tilde{f}_{\theta}(x_{t},a;\theta_{t-1}) = f_{\theta_{t-1}}(x_{t},a) + \langle\nabla_{\theta}f_{\theta_{t-1}}(x_{t},a),\theta-\theta_{t-1}\rangle,
\tilde{\mathcal{R}}(\theta,D_{t-1};\theta_{t-1}) := \sum_{(x,a,r)\in D_{t-1}}\big(f_{\theta_{t-1}}(x,a)+\langle\nabla_{\theta}f_{\theta_{t-1}}(x,a),\theta-\theta_{t-1}\rangle-r\big)^{2}+m\lambda\|\theta-\theta_{0}\|^{2}

are the first-order Taylor expansions of $f_{\theta}(x_t,a)$ and $\mathcal{L}(\theta;D_{t-1})$ at $\theta_{t-1}$, respectively. We note that the special choice of regularizer $\mathcal{R}(\theta;D_{t-1})$ in Eq. (8) is made to derive a simple form of $\hat{\theta}(x_t,a)$.

In the following analysis, we first connect ROFU with NeuralUCB via an informal argument. Then we bound the approximations in the informal argument and show that they do not affect the order of the sublinear regret bound. We use $h_{\theta_{t-1}}(x_t,a)$ to denote $\nabla_{\theta}f_{\theta}(x_t,a)|_{\theta=\theta_{t-1}}$.

An intuitive procedure.

With the above choices, we can see that the procedure Eq. (10) is equivalent to

\hat{\theta}(x_{t},a) = \operatorname*{arg\,max}_{\theta}\ \langle h_{\theta_{t-1}}(x_{t},a),\theta-\theta_{t-1}\rangle - \eta\|\theta-\theta_{t-1}\|_{m{\bm{Z}}_{t-1}}^{2},   (11)

where

{\bm{Z}}_{t-1} := \lambda{\bm{I}} + \frac{1}{m}\sum_{(x,a,r)\in D_{t-1}}h_{\theta_{t-1}}(x,a)h_{\theta_{t-1}}(x,a)^{\top},   (12)

by assuming that $\theta_{t-1}$ satisfies the first-order optimality condition of minimizing $\mathcal{L}(\theta;D_{t-1})$.

Solving the problem Eq. (11), we obtain

\hat{\theta}(x_{t},a) = \theta_{t-1} + \frac{1}{2\eta m}{\bm{Z}}_{t-1}^{-1}h_{\theta_{t-1}}(x_{t},a).

Hence, if the function $g(\cdot)$ in Eq. (5) is given by $g(x)=\sqrt{x}$, then $\text{OFU}^{R}$ becomes

\text{OFU}^{R}(x_{t},a) = f_{\theta_{t-1}}(x_{t},a) + \sqrt{f_{\hat{\theta}(x_{t},a)}(x_{t},a) - f_{\theta_{t-1}}(x_{t},a)}
\approx f_{\theta_{t-1}}(x_{t},a) + \frac{1}{\sqrt{2\eta}}\sqrt{h_{\theta_{t-1}}(x_{t},a)^{\top}{\bm{Z}}_{t-1}^{-1}h_{\theta_{t-1}}(x_{t},a)/m},   (13)

where the approximation comes from replacing $f_{\hat{\theta}(x_t,a)}(x_t,a)$ with its first-order Taylor approximation at $\theta_{t-1}$.

Similarly to the formula in (13), one uses

U_{t,a} = f_{\theta_{t-1}}(x_{t},a) + \gamma_{t-1}\sqrt{h_{\theta_{t-1}}(x_{t},a)^{\top}\tilde{{\bm{Z}}}_{t-1}^{-1}h_{\theta_{t-1}}(x_{t},a)/m}   (14)

as the upper confidence bound in NeuralUCB (zhou2020neural, Algorithm 1). One difference is that zhou2020neural use

\tilde{{\bm{Z}}}_{t-1} := \lambda{\bm{I}} + \frac{1}{m}\sum_{\tau=1}^{t-1}h_{\theta_{\tau-1}}(x_{\tau},a_{\tau})h_{\theta_{\tau-1}}(x_{\tau},a_{\tau})^{\top}.   (15)

We next bound the differences between ROFU and NeuralUCB, and then show that the differences do not affect the regret bound at the order of $\sqrt{T}$.

Bounding the approximations in the above derivation

We first introduce some existing bounds on how well the function value is approximated by its first-order Taylor expansion in the NTK regime.

Lemma 1 (Lemma 4.1 & Theorem B.3, cao2019generalization).

There exist constants $c_1,c_2,c_3,c_4>0$ such that for any $\delta\in(0,1)$, if $\omega$ satisfies $c_1(mL)^{-3/2}\left(\log(T|A|L^2/\delta)\right)^{3/2}\leq\omega\leq c_2L^{-6}(\log m)^{-3/2}$, then with probability at least $1-\delta$, for all $\theta,\tilde{\theta}$ with $\|\tilde{\theta}-\theta_0\|_2\leq\omega$, $\|\theta-\theta_0\|_2\leq\omega$, and for all $x\in\{x_t\}_{t\in[T]}$ and all $a\in A$, we have

\left|f_{\tilde{\theta}}(x,a)-f_{\theta}(x,a)-\langle h_{\theta}(x,a),\tilde{\theta}-\theta\rangle\right| \leq c_{3}\omega^{4/3}L^{3}\sqrt{m\log m},   (16)
\|h_{\theta}(x,a)\|_{F} \leq c_{4}\sqrt{mL}.   (17)

We also introduce a bound on the difference between the gradient at $\theta$ and the gradient at the initial point $\theta_0$ when $\theta$ is not far from $\theta_0$, as required and satisfied in the NTK regime.

Lemma 2 (Theorem 5, allen2019convergence ).

There exist constants $c_1,c_2,c_3>0$ such that for any $\delta\in(0,1)$, if $\omega$ satisfies $c_1(mL)^{-3/2}\max\{\log^{-3/2}m,\log^{3/2}(T|A|/\delta)\}\leq\omega\leq c_2L^{-9/2}(\log m)^{-3}$, then with probability at least $1-\delta$, for all $\theta$ with $\|\theta-\theta_0\|_2\leq\omega$, and for all $x\in\{x_t\}_{t\in[T]}$ and all $a\in A$, we have

\|h_{\theta}(x,a)-h_{\theta_{0}}(x,a)\|_{2} \leq c_{3}\sqrt{\log m}\,\omega^{1/3}L^{3}\|h_{\theta_{0}}(x,a)\|_{2}.   (18)

With these existing bounds, we next show that $\bm{Z}_t$ (Eq. (12)) enjoys the same property as $\tilde{\bm{Z}}_t$ (Eq. (15)), as follows. We introduce another sequence $\bar{\bm{Z}}_t=\lambda\bm{I}+\frac{1}{m}\sum_{\tau=1}^{t}h_{\theta_0}(x_\tau,a_\tau)h_{\theta_0}(x_\tau,a_\tau)^{\top}$. In (zhou2020neural, Lemma B.3), it is shown that $\tilde{\bm{Z}}_t$ is close to $\bar{\bm{Z}}_t$. Similarly, we have the following lemma on the distance $\|\bm{Z}_t-\bar{\bm{Z}}_t\|_F$.

Lemma 3.

With probability $1-\delta$, where $\delta\in(0,1)$, for any $t\in[T]$, we have

\|{\bm{Z}}_{t}\|_{2} \leq \lambda + c_{1}tL, \qquad \|{\bm{Z}}_{t}-\bar{{\bm{Z}}}_{t}\|_{F} \leq c_{2}m^{-1/6}\sqrt{\log m}\,L^{4}t^{7/6}\lambda^{-1/6},   (19)

if the network width $m$ satisfies

c_{3}m^{-3/2}L^{-3/2}[\log(T|A|L^{2}/\delta)]^{3/2} \leq 2\sqrt{t/(m\lambda)} \leq c_{4}L^{-6}[\log m]^{-3/2}, \quad \forall t\in[T],   (20)

with some constants $c_1,c_2,c_3,c_4>0$.

Proof.

The proof can be adapted from that of Lemma B.3 in zhou2020neural, using the fact that $\|h_{\theta_{t-1}}(x,a)-h_{\theta_0}(x,a)\|_2$ is bounded by Lemma 2. ∎

We note that Lemma 3 plays the same role as Lemma B.3 in zhou2020neural. Hence, with Lemma 3 we can establish a result similar to Lemma 5.2 in zhou2020neural: with high probability, the optimal $\theta^*$ lies in the sequence of confidence sets, i.e., ellipsoids defined with the new matrix $\bm{Z}_t$.

Next we show that the first-order Taylor approximation in Eq. (11) does not affect the achievable regret bound either. By the proof of Lemma 5.3 in zhou2020neural , in order to prove a similar result, we only need to bound

|\text{OFU}^{R}(x_{t},a) - \tilde{U}_{t,a}|,   (21)

where $\tilde{U}_{t,a} := \langle h_{\theta_{t-1}}(x_t,a),\theta_{t-1}-\theta_0\rangle+\gamma_{t-1}\sqrt{h_{\theta_{t-1}}(x_t,a)^{\top}\bm{Z}_{t-1}^{-1}h_{\theta_{t-1}}(x_t,a)/m}$ and $\gamma_{t-1}$ is given by Algorithm 1 in zhou2020neural. The derivation is as follows,

|\text{OFU}^{R}(x_{t},a)-\tilde{U}_{t,a}|
\leq |f_{\theta_{t-1}}(x_{t},a)-\langle h_{\theta_{t-1}}(x_{t},a),\theta_{t-1}-\theta_{0}\rangle|
\quad+\left|\sqrt{f_{\hat{\theta}(x_{t},a)}(x_{t},a)-f_{\theta_{t-1}}(x_{t},a)}-\gamma_{t-1}\sqrt{h_{\theta_{t-1}}(x_{t},a)^{\top}{\bm{Z}}_{t-1}^{-1}h_{\theta_{t-1}}(x_{t},a)}\right|
=\left|f_{\hat{\theta}(x_{t},a)}(x_{t},a)-f_{\theta_{0}}(x_{t},a)f_{\theta_{t-1}}(x_{t},a)-\langle h_{\theta_{t-1}}(x_{t},a),\hat{\theta}(x_{t},a)-\theta_{t-1}\rangle\right|
\quad+\sqrt{\left|f_{\hat{\theta}(x_{t},a)}(x_{t},a)-f_{\theta_{t-1}}(x_{t},a)-\gamma_{t-1}^{2}\,h_{\theta_{t-1}}(x_{t},a)^{\top}{\bm{Z}}_{t-1}^{-1}h_{\theta_{t-1}}(x_{t},a)\right|}
\leq c_{3}m^{-1/6}\sqrt{\log m}\,t^{2/3}\lambda^{-2/3}L^{3}+\sqrt{c_{3}m^{-1/6}\sqrt{\log m}\,t^{2/3}\lambda^{-2/3}L^{3}},   (22)

where $c_3$ is a constant and the last inequality is due to Lemma 1 and the choice of $\eta$ such that $1/(2\eta)=\gamma_{t-1}^2$. Hence we have shown that $\text{OFU}^{R}(x_t,a)$ shares the same property as $U_{t,a}$ in Algorithm 1 of zhou2020neural. The rest of the proof follows a similar procedure.

Appendix C Experimental details

Hyperparameters: We tune the hyper-parameters of ROFU on Statlog and directly apply them to the other datasets except Mushroom, because the reward scale of Mushroom is much larger than that of the other datasets. For the baselines, we directly use the best reported hyper-parameters.

For convenience, we suppose the context $x_t$ is sampled from some distribution $\mathcal{P}$ for all $t\leq T$. We define a virtual dataset $\{(x'_t)\}_{T}$, $x'_t\sim\mathcal{P}$, as well as a parameter trained by minimizing the MSE on the virtual dataset, $\theta'=\text{MinimizeMSE}\big(\sum_{t\leq T}\sum_{a\in A}(f_{\theta}(x'_t,a)-r(x'_t,a))^2\big)$, where MinimizeMSE is a gradient-based optimizer, e.g., SGD or Adam kingma2014adam.

To isolate the regret caused by neural network training and generalization error, we decompose the regret into two parts:

\mathcal{R}_{T} = \underbrace{\sum_{t\leq T}\max_{a}r(x_{t},a)-\mathbb{E}\sum_{t\leq T}r(x_{t},a^{\prime}_{t})}_{\text{I}} + \underbrace{\mathbb{E}\sum_{t\leq T}r(x_{t},a^{\prime}_{t})-\mathbb{E}\sum_{t\leq T}r(x_{t},a_{t})}_{\text{II}}   (23)

where $a'_t=\operatorname*{arg\,max}_{a}f_{\theta'}(x_t,a)$ and $a_t$ is the action selected by the agent.

It is noteworthy that, in practice, $\mathbb{E}\sum_{t}r(x_t,a'_t)$ is usually an upper bound of $\mathbb{E}\sum_{t\leq T}r(x_t,a_t)$, as $\theta'$ is trained on $T\times|A|$ data points while $\bar{\theta}$ and $\hat{\theta}$ are trained on no more than $T$ data points. Regret I is caused by MinimizeMSE and the generalization power of the function $f_{\theta}$, and is therefore independent of the algorithm used for EE trade-off. Thus, we only report regret II in Fig. 2, to give a clearer evaluation of ROFU's ability to trade off EE. We also present the full regret $R_T$ in Fig. 3. The running times are in Table 4.

                   ROFU   Neural-UCB   Dropout   ParamNoise   Greedy   Bootstrap   Neural Linear
Running time (s)   5259   1589         525       306          771      729         3612

Table 4: Running time of each algorithm for 20000 rounds. All experiments are run on an Nvidia Tesla P100 GPU. Our algorithm is slower than NeuralUCB because NeuralUCB uses the diagonal approximation, which lacks a theoretical guarantee on the regret.
Figure 3: Evaluations on non-linear contextual bandits (full regret $R_T$).