
Regularized OFU: an Efficient UCB Estimator for Non-linear Contextual Bandit

Yichi Zhou, Shihong Song, Huishuai Zhang, Jun Zhu, Wei Chen, Tie-Yan Liu
Microsoft Research, Asia. {yiczho, huishuai.zhang, wche, tyliu}@microsoft.com
Tsinghua University. [email protected], [email protected]
Abstract

Balancing exploration and exploitation (EE) is a fundamental problem in contextual bandit. One powerful principle for the EE trade-off is Optimism in Face of Uncertainty (OFU), in which the agent takes the action according to an upper confidence bound (UCB) of the reward. OFU has achieved (near-)optimal regret bounds for linear/kernel contextual bandits. However, it is in general unknown how to derive efficient and effective EE trade-off methods for non-linear complex tasks, such as contextual bandit with a deep neural network as the reward function. In this paper, we propose a novel OFU algorithm named regularized OFU (ROFU). In ROFU, we measure the uncertainty of the reward by a differentiable function and compute the upper confidence bound by solving a regularized optimization problem. We prove that, for multi-armed bandit, kernel contextual bandit and neural tangent kernel bandit, ROFU achieves (near-)optimal regret bounds with certain uncertainty measures, which theoretically justifies its effectiveness on EE trade-off. Importantly, ROFU admits a very efficient implementation with a gradient-based optimizer, which easily extends to general deep neural network models beyond the neural tangent kernel, in sharp contrast with previous OFU methods. The empirical evaluation demonstrates that ROFU works extremely well for contextual bandits under various settings.

1 Introduction

Contextual bandit langford2007epoch; bubeck2012regret is a basic sequential decision-making problem which is extensively studied and widely applied in machine learning. At each time step of a contextual bandit, an agent is presented with a context and chooses an action according to the context. The agent then receives a reward conditioned on the context and the selected action. The goal of the agent is to maximize its cumulative reward. One key challenge in contextual bandit is the exploration-exploitation (EE) trade-off. In order to maximize cumulative reward, the agent exploits the collected data to take actions with high estimated reward, while it also explores under-explored areas to gain knowledge.

Optimism in Face of Uncertainty (OFU) is a powerful principle for the EE trade-off. When facing uncertainty, OFU first optimistically guesses how good each action could be and then takes the action with the highest guess. In philosophy, it exploits the collected knowledge by estimating the expected reward and enforces exploration of areas with large uncertainty via optimism. OFU algorithms can be divided into two categories, i.e., the confidence bonus method auer2002finite and the confidence set method jaksch2010near. In the confidence bonus method, the uncertainty of each action is measured according to its historical frequency, and the resulting bonus is added to its mean value to form the estimated reward; in the confidence set method, the reward function is constrained to a subset with high confidence and the maximum value over the confidence set is taken as the estimated value. These OFU algorithms have been proved to achieve (near-)optimal regret bounds on tabular and linear problems, including multi-armed bandit (MAB) auer2002finite, linear contextual bandit (LCB) chu2011contextual and kernel contextual bandit (KCB) valko2013finite, and have also been successfully applied in practice.

However, the representation power of tabular and linear models often becomes insufficient for many real-world contextual bandit problems, where the scale of the underlying problem is huge or even infinite and the reward mechanism is complex. Therefore, in practice, non-linear deep neural networks (DNNs) are used to approximate the reward functions. While DNN models fit the training data well, it remains challenging to quantify their uncertainty on new data. Therefore, existing OFU methods cannot be efficiently applied to non-linear contextual bandits. For the confidence set method, it is difficult to compute the confidence set of a non-linear deep neural network and hence hard to obtain the optimal action. For the confidence bonus method, it is unclear how to design the bonus for non-linear reward functions. Overall, estimating the uncertainty of a reward function approximated by a non-linear model, e.g., a DNN, becomes a major obstacle when applying OFU algorithms to real-world applications.

There are several works zhou2020neural; filippi2010parametric; zahavy2019deep; gu2021batched that attempt to generalize OFU to non-linear contextual bandits. One is NeuralUCB zhou2020neural, which takes the gradient at each observation as a random feature and constructs the upper confidence bound accordingly. It has been shown that NeuralUCB achieves sublinear regret on the neural tangent kernel contextual bandit (NTKCB). However, at each time step, NeuralUCB requires computing the inverse of a $p\times p$ matrix, where $p$ is the number of model parameters. As the time complexity of matrix inversion is $O(p^{2.373})$ davie2013improved, the computational cost is prohibitive for large-scale DNN models. Therefore, it remains largely under-explored how to develop OFU algorithms that achieve a good EE trade-off with reasonable computational cost for non-linear contextual bandits.

Motivated by these challenges, in this paper we propose a novel and general OFU method, called Regularized OFU (ROFU). In ROFU, we measure the uncertainty of the reward function by a differentiable regularizer, and then calculate the optimistic estimate by maximizing the reward function with the regularizer penalizing its uncertainty. We prove that ROFU enjoys near-optimal regret on many contextual bandit problems with relatively simple reward mechanisms, including MAB, LCB, KCB and NTKCB. Importantly, in contrast to existing OFU algorithms, ROFU admits an efficient implementation with a gradient-based optimizer, since the optimization problem in ROFU is unconstrained and differentiable, with a time complexity linear in $p$. Moreover, our algorithm easily extends to general deep neural network models.

We summarize our contributions as follows:

  • We show that ROFU achieves (near-)optimal regret on contextual bandits with simple reward functions, including MAB, LCB, KCB and NTK bandit. These results theoretically justify the efficiency of ROFU on EE trade-off.

  • We propose an algorithm which can efficiently approximate the upper confidence bound with a standard gradient-based optimizer.

  • We empirically evaluate ROFU on complex contextual bandits with reward functions beyond linear, kernel and NTK. We show that ROFU also provides efficient UCB estimation for popular DNN architectures including MLP, CNN and ResNet. Moreover, our algorithm enjoys a smaller regret than strong baselines on real-world non-linear contextual bandit problems introduced by deep_bayesian_bandit_showdown .

2 Preliminary

In this work, we consider contextual bandit with general reward functions.

Definition 1 (Contextual bandit).

Contextual bandit is a sequential decision-making problem where the agent has a set of actions $A$. At each time step $t$, the agent first observes a context $x_t$, then selects an action $a_t\in A$ based on the context. After taking the action, the agent receives a (random) reward $r_t$ with $\mathbb{E}r_t := f_{\theta^*}(x_t,a_t)$, where $f_{\theta^*}(x_t,a_t)$ is a function with unknown groundtruth parameters $\theta^*$. The agent aims to maximize its expected cumulative reward $\sum_{t\leq T}\mathbb{E}f_{\theta^*}(x_t,a_t)$, which is equivalent to minimizing the regret $R_T=\sum_{t\leq T}\max_a\big(f_{\theta^*}(x_t,a)-f_{\theta^*}(x_t,a_t)\big)$.
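To make the interaction protocol in Definition 1 concrete, the following minimal Python sketch simulates the loop with a hypothetical linear reward function and a uniformly random policy; the feature map, noise level and dimensions are illustrative assumptions, not part of the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_actions, T = 10, 5, 1000
theta_star = rng.normal(size=d)                # unknown groundtruth parameters theta*
phi = lambda x, a: np.roll(x, a)               # hypothetical feature map phi(x, a)

regret = 0.0
for t in range(T):
    x_t = rng.normal(size=d)                   # observe context x_t
    means = np.array([phi(x_t, a) @ theta_star for a in range(n_actions)])
    a_t = rng.integers(n_actions)              # placeholder policy: uniform exploration
    r_t = means[a_t] + rng.normal(scale=0.1)   # noisy reward with E[r_t] = f(x_t, a_t)
    regret += means.max() - means[a_t]         # per-step regret max_a f(x_t, a) - f(x_t, a_t)

print(f"cumulative regret of the uniform policy after {T} rounds: {regret:.1f}")
```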

The most studied contextual bandit is the linear contextual bandit, where the reward is generated from a linear function, i.e., $f_{\theta^*}(x,a) := \phi(x,a)^{\top}\theta^*$, where $\phi(x,a)$ is a known feature map. For more complex problems, we have to use non-linear models, such as DNNs, to represent the reward function.

In practice, given a dataset $D := \{(x_t,a_t,r_t)\}_{t\leq|D|}$, we can estimate $\theta^*$ by minimizing a loss function, i.e., $\bar{\theta} := \operatorname*{arg\,min}_{\theta}\mathcal{L}(\theta;D)$. In this work, we focus on the mean squared loss and its variants,

\text{MSE}(\theta;D) := \frac{1}{|D|}\sum_{t\leq|D|}(f_{\theta}(x_{t},a_{t})-r_{t})^{2}.   (1)

This is because MSE is a natural choice for regression tasks. Generally speaking, OFU algorithms trade off exploration and exploitation with an optimistic estimate of the reward for each action, and then select $a_t = \operatorname*{arg\,max}_a \text{OFU}(x_t,a)$. In the literature, OFU algorithms can be divided into two categories. The first one optimistically estimates the reward by adding a confidence bonus to $f_{\bar{\theta}}$, i.e.,

\text{OFU}^{B}(x,a) := f_{\bar{\theta}}(x,a) + \text{bonus}(x,a).   (2)

We call this OFU method Bonus-OFU. In simple cases, the bonus can be constructed using concentration bounds. For example, at time step $\tau$, in UCB1 for MAB auer2002finite, the bonus is $\sqrt{2\log\tau/n(a)}$, where $n(a)$ is the number of pulls on action $a$; in LinUCB for LCB chu2011contextual, the bonus is $\sqrt{\phi(x,a)^{\top}\big(\sum_{t<\tau}\phi(x_{t},a_{t})\phi(x_{t},a_{t})^{\top}+\bm{I}\big)^{-1}\phi(x,a)}$. Both UCB1 and LinUCB are proved to have (near-)optimal regret bounds.
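As a concrete illustration of these bonuses, the following NumPy sketch computes the LinUCB quantity $\sqrt{\phi(x,a)^{\top}(\sum_{t<\tau}\phi(x_t,a_t)\phi(x_t,a_t)^{\top}+\bm{I})^{-1}\phi(x,a)}$ and the UCB1 bonus on synthetic placeholder data; the features, counts and dimensions are assumptions made only for this example.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8
past_features = rng.normal(size=(50, d))   # phi(x_t, a_t) for the past rounds t < tau
phi_xa = rng.normal(size=d)                # phi(x, a) of the candidate action

# Z = sum_{t < tau} phi_t phi_t^T + I, as inside the LinUCB bonus
Z = past_features.T @ past_features + np.eye(d)
linucb_bonus = np.sqrt(phi_xa @ np.linalg.solve(Z, phi_xa))

# UCB1 bonus for an arm pulled n(a) times, at time step tau
tau, n_a = 100, 7
ucb1_bonus = np.sqrt(2 * np.log(tau) / n_a)

print(f"LinUCB bonus: {linucb_bonus:.3f}, UCB1 bonus: {ucb1_bonus:.3f}")
```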

The second kind of OFU algorithms optimistically estimates the reward by the best possible value over a confidence set of the reward functions, i.e.,

\text{OFU}^{S}(x,a) := \max_{\theta\in\Theta_{\delta}}f_{\theta}(x,a),   (3)

where $\Theta_{\delta} := \{\theta : \mathbb{P}(D|\theta)>\delta\}$ and $\mathbb{P}(D|\theta)$ is the likelihood of $D$ given $\theta$. We call this OFU method Set-OFU.

For convenience, we suppose $\theta$ is a $p$-dimensional vector. Both Bonus-OFU and Set-OFU can effectively trade off EE; however, they require a lot of computational resources when $f_{\theta}$ is a complex neural network: firstly, existing Bonus-OFU algorithms, such as NeuralUCB zhou2020neural, need to calculate the inverse of a $p\times p$ matrix, which is expensive especially for large DNN models; secondly, it is also hard to efficiently compute a confidence set over DNN parameters, and even harder to maximize the value over it to compute $\text{OFU}^{S}$. This computational cost significantly limits the application of OFU methods to complex tasks.

In the next section, we will propose a new OFU method, named Regularized OFU, which achieves a near-optimal regret bound on simple tasks, and more importantly, admits a very efficient implementation with a gradient-based optimizer.

3 Method

Different from Bonus-OFU and Set-OFU, we measure the uncertainty of the reward with a (differentiable) function, and compute the optimistic reward estimate under the regularization of this uncertainty measure. We call the resulting reward estimate regularized OFU (ROFU) and take actions accordingly. In round $t$, given a context-action pair $(x_t,a)$, ROFU is defined as follows and the algorithm is presented in Alg. 1:

\hat{\theta}(x_{t},a) := \operatorname*{arg\,max}_{\theta} f_{\theta}(x_{t},a) - \eta(x_{t},a,D_{t-1})\,\mathcal{R}(\theta;D_{t-1}),   (4)
\text{OFU}^{R}(x_{t},a) := f_{\theta_{t-1}}(x_{t},a) + g\big(f_{\hat{\theta}(x_{t},a)}(x_{t},a) - f_{\theta_{t-1}}(x_{t},a)\big),   (5)
Algorithm 1 Regularized Optimism in Face of Uncertainty
1:  Input: A reward function f_{\theta^*} with unknown \theta^*, number of rounds T, a loss function \mathcal{L}(\theta;D), a regularizer \mathcal{R}(\theta;D), a monotonically increasing function g: \mathbb{R}\rightarrow\mathbb{R} with g(0)=0, and the coefficient function \eta(x,a;D).
2:  D_0 := \emptyset.
3:  for t = 1, …, T do
4:     Observe x_t.
5:     \forall a\in A, compute \hat{\theta}(x_t,a) and \text{OFU}^R(x_t,a) according to Eq. (4) and (5).
6:     Take a_t = \operatorname*{arg\,max}_{a\in A}\text{OFU}^R(x_t,a) and receive reward r_t.
7:     Let D_t := D_{t-1}\cup\{(x_t,a_t,r_t)\}.
8:     Obtain \theta_t by minimizing \mathcal{L}(\theta;D_t).
9:  end for

where $D_{t-1}$ is the set of data collected in the first $t-1$ rounds, $\mathcal{R}$ is a regularizer, $\eta$ is the weight of the regularizer, $g$ is a monotonically increasing function with $g(0)=0$, and $\theta_{t-1}$ is trained by minimizing some objective function, i.e., $\theta_{t-1} := \operatorname*{arg\,min}_{\theta}\mathcal{L}(\theta;D_{t-1})$. We simply take the action $a_{\text{ROFU}} := \operatorname*{arg\,max}_{a\in A}\text{OFU}^{R}(x_t,a)$ to interact with the environment.

In Eq. (4), we first solve an unconstrained optimization problem to compute $\hat{\theta}(x_t,a)$. Then, in Eq. (5), we add a bonus to $f_{\theta_{t-1}}(x_t,a)$ according to the difference between $f_{\hat{\theta}(x_t,a)}(x_t,a)$ and $f_{\theta_{t-1}}(x_t,a)$. From Eq. (4) and (5), we can see that, while ROFU still adds a bonus to $f_{\theta_{t-1}}(x,a)$ as Bonus-OFU does, it does not need to explicitly construct the upper confidence bound via concentration inequalities. Instead, we construct the upper confidence bound by solving a regularized optimization problem, which is more general. In comparison to Set-OFU, our optimization problem is unconstrained, so Eq. (4) can be efficiently solved by standard gradient-based optimizers. In Sec. 3.2, we propose an algorithm which efficiently approximates $\text{OFU}^{R}$ with a few steps of gradient descent.

An important question is whether selecting the action with maximum $\text{OFU}^{R}$ achieves a good EE trade-off, or whether $\text{OFU}^{R}(x_t,a)$ is a reasonable upper confidence bound over $f_{\theta_{t-1}}(x_t,a)$. To provide more insight, consider the case where $\mathcal{L}=\mathcal{R}$. It is easy to see that $\forall(x,a),\ f_{\hat{\theta}(x,a)}(x,a)\geq f_{\theta_{t-1}}(x,a)$. Thus, since $g$ is monotonically increasing with $g(0)=0$, $\text{OFU}^{R}(x,a)$ is an upper bound of $f_{\theta_{t-1}}(x,a)$. Intuitively, if we can find a parameter $\hat{\theta}$ that increases the value of $f_{\theta}(x,a)$ without significantly increasing $\mathcal{R}(\theta;D)$, then the uncertainty about the reward of $(x,a)$ is large, and Eq. (5) consequently assigns such an action a large bonus. Moreover, $g$ and $\eta$ jointly control the scale of the confidence bonus: the difference between $f_{\hat{\theta}(x,a)}(x,a)$ and $f_{\theta_{t-1}}(x,a)$ increases as $\eta$ decreases, and $g$ controls how much this difference contributes to the final bonus. In Section 3.1, we show that with proper $g,\eta,\mathcal{L}$ and $\mathcal{R}$, we can achieve near-optimal regret bounds on contextual bandits when the reward function is simple.

Remark 1.

It is noteworthy that the choices of $\mathcal{R},\mathcal{L},\eta$ and $g$ are crucial to the performance of the EE trade-off. For example, if we select $\mathcal{R} := \mathbb{P}(D|\theta)$, which is used in Set-OFU, then $f_{\hat{\theta}(x,a)}(x,a)=+\infty$ for all $(x,a)$, and our method would sample actions uniformly.

3.1 Regret analysis

The regret bounds for ROFU are developed by drawing connections between ROFU and existing near-optimal algorithms, so the regret of ROFU in these cases is also near-optimal. More specifically, we show that for MAB, LCB and KCB, the exact solution of Eq. (5) is equivalent to the upper confidence bound of UCB1, LinUCB and KernelUCB, respectively. For NTKCB, $\text{OFU}^{R}$ has similar properties to the upper confidence bound in NeuralUCB. We summarize these results in Table 1.

                        MAB                         Linear/Kernel-CB                            NTK CB
\mathcal{R}             |D|\,\text{MSE}(\theta;D)   \|\theta\|^{2}+|D|\,\text{MSE}(\theta;D)    See Eq. (8)
\mathcal{L}             \text{MSE}(\theta;D)        \text{MSE}(\theta;D)                        See Eq. (7)
\eta                    \frac{1}{16\log|D|}         1/2                                         1/(2\gamma^{2})
g(w)                    \sqrt{w}                    \sqrt{w}                                    \sqrt{w}
Equivalent/Related to   UCB1                        Lin/KernelUCB                               NeuralUCB
Regret                  O(\log T)                   \tilde{O}(\sqrt{T})                         \tilde{O}(\sqrt{T})
Table 1: Theoretical results for ROFU on contextual bandit tasks with simple reward functions.

Equivalence to UCB1, LinUCB and KernelUCB

Specifically, we use MSE variants for both $\mathcal{L}$ and $\mathcal{R}$ and $g(w)=\sqrt{w}$ henceforth. After fixing $\mathcal{L}$, $\mathcal{R}$ and $g$, we can tune $\eta$ to make the solution of Eq. (5) equal to the upper confidence bound in UCB1, LinUCB or KernelUCB. Recall $\text{MSE}(\theta;D) := \frac{1}{|D|}\sum_{t\leq|D|}(f_{\theta}(x_t,a_t)-r_t)^2$. The following theorem shows the equivalence between ROFU and LinUCB.

Theorem 1.

With $\mathcal{L}(\theta;D) := \text{MSE}(\theta;D)$, $\mathcal{R}(\theta;D) := \|\theta\|^{2}+|D|\,\text{MSE}(\theta;D)$, $\eta(x,a,D) := 1/2$ and $g(w)=\sqrt{w}$, $\text{OFU}^{R}$ is identical to the upper confidence bound used in LinUCB.

Proof.

Recall that in LCB, $f_{\theta}(x,a) := \phi(x,a)^{\top}\theta$. Setting the derivative of Eq. (4) to zero, we have $\phi(x,a)-2\eta(x,a,D)\big((\bm{I}+\sum_{(x_i,a_i,r_i)\in D}\phi(x_i,a_i)\phi(x_i,a_i)^{\top})\theta-\sum_{(x_i,a_i,r_i)\in D}\phi(x_i,a_i)r_i\big)=0$. With elementary algebra, we can see that $\hat{\theta}(x,a)=\theta_{t-1}+\frac{1}{2\eta(x,a,D)}\big(\bm{I}+\sum_{(x_i,a_i,r_i)\in D}\phi(x_i,a_i)\phi(x_i,a_i)^{\top}\big)^{-1}\phi(x,a)$.

Let $\bm{Z} := \bm{I}+\sum_{(x_i,a_i,r_i)\in D}\phi(x_i,a_i)\phi(x_i,a_i)^{\top}$. Given $g(w)=w^{1/2}$, inserting $\hat{\theta}(x,a)$ and $\eta$ into Eq. (5), we have $\text{OFU}^{R}(x,a)=\phi(x,a)^{\top}\theta_{t-1}+\big(\phi(x,a)^{\top}\bm{Z}^{-1}\phi(x,a)\big)^{\frac{1}{2}}$, which is the upper confidence bound used in LinUCB. This finishes the proof. ∎
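The closed forms in this proof can be checked numerically. The minimal NumPy sketch below uses synthetic data and takes $\theta_{t-1}$ to be the regularized least-squares fit, as in the derivation above; it is an illustrative sanity check, not part of the paper.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 6
Phi = rng.normal(size=(30, d))               # past features phi(x_i, a_i)
r = rng.normal(size=30)                      # past rewards r_i
phi_xa = rng.normal(size=d)                  # feature of the candidate pair (x, a)

Z = np.eye(d) + Phi.T @ Phi
theta_prev = np.linalg.solve(Z, Phi.T @ r)   # regularized least-squares fit (theta_{t-1})

# theta_hat maximizes phi^T theta - eta * (||theta||^2 + |D| * MSE) with eta = 1/2;
# setting the gradient to zero gives the closed form from the proof
eta = 0.5
theta_hat = theta_prev + (1.0 / (2 * eta)) * np.linalg.solve(Z, phi_xa)

rofu = phi_xa @ theta_prev + np.sqrt(phi_xa @ theta_hat - phi_xa @ theta_prev)
linucb = phi_xa @ theta_prev + np.sqrt(phi_xa @ np.linalg.solve(Z, phi_xa))
print(np.isclose(rofu, linucb))              # True: the two upper confidence bounds coincide
```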

The derivation in Thm 1 can be directly used to show that $\text{OFU}^{R}$ is also identical to the upper confidence bound used in KernelUCB, and the equivalence between ROFU and UCB1 can be proved by similar techniques. We postpone these derivations to the Appendix.

Remark 2.

From the derivation in Thm 1, it is easy to see that for $g(w)=w^{b}$ with $b>0$, we can also tune $\eta$ to make ROFU identical to LinUCB. However, for $b\neq 1/2$, $\eta$ is a complex function depending on $x,a$ and $D$. Thus, we only consider $g(w)=w^{1/2}$.

Connection to NeuralUCB and sublinear regret for NTK contextual bandit.

The NeuralUCB zhou2020neural uses a neural network to learn the reward function and constructs a UCB by using the DNN-based random feature mappings. For learning the reward function, NeuralUCB exploits the recent progress of optimizing a DNN in the Neural Tangent Kernel (NTK) regime allen2018convergence ; du2019gradient ; zou2019improved , which shows that gradient descent can find global minima of a loss for a finite set of training samples if the network is sufficiently wide and the learning rate is sufficiently small.

Intuitively, NeuralUCB takes the gradient $\nabla_{\theta}f_{\theta}(x_t,a)$ as a random feature to represent the observation $(x_t,a)$. Then, it constructs a UCB as in a linear bandit, i.e., $\text{NeuralUCB}(x_t,a)=f_{\theta_{t-1}}(x_t,a)+\gamma\sqrt{\nabla_{\theta}f_{\theta_{t-1}}(x_t,a)^{\top}\tilde{\bm{Z}}_{t-1}^{-1}\nabla_{\theta}f_{\theta_{t-1}}(x_t,a)/m}$, where $\gamma$ is a predefined constant, $m$ is the network width and $\tilde{\bm{Z}}_{t-1}$ is a matrix built from the contextual information. NeuralUCB achieves $\tilde{O}(\sqrt{T})$ regret.
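For intuition, the sketch below computes a NeuralUCB-style bonus for a tiny PyTorch network, using the flattened gradient as the random feature; the width, $\lambda$, $\gamma$ and the synthetic inputs are placeholder assumptions rather than the settings of zhou2020neural.

```python
import torch

torch.manual_seed(0)
m, d = 16, 10                                   # network width m, input dimension d
net = torch.nn.Sequential(torch.nn.Linear(d, m), torch.nn.ReLU(), torch.nn.Linear(m, 1))
p_dim = sum(p.numel() for p in net.parameters())

def grad_feature(x):
    """Flattened gradient of f_theta at x, used as the random feature."""
    net.zero_grad()
    net(x).backward()
    return torch.cat([p.grad.reshape(-1) for p in net.parameters()])

lam, gamma = 1.0, 1.0                           # placeholder lambda and gamma
Z_tilde = lam * torch.eye(p_dim)
for _ in range(20):                             # accumulate outer products over past (x_tau, a_tau)
    g = grad_feature(torch.randn(d))
    Z_tilde += torch.outer(g, g) / m

x_new = torch.randn(d)                          # candidate observation (x_t, a)
g_new = grad_feature(x_new)
bonus = gamma * torch.sqrt(g_new @ torch.linalg.solve(Z_tilde, g_new) / m)
print(net(x_new).item() + bonus.item())         # NeuralUCB-style upper confidence value
```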

We next show that ROFU can also achieve a similar regret in the NTK regime by carefully choosing the regularizer $\mathcal{R}$. To avoid excessive technical exposition of the NTK regime, we adopt the same assumptions and the same model setup as those in zhou2020neural. We bound the difference between ROFU and NeuralUCB in the NTK regime, and then show that ROFU achieves a regret bound similar to that of NeuralUCB.

At the start, we assume that the network takes the following form

f_{\theta}(\bm{x}) = \sqrt{m}\,\bm{W}_{L}\sigma(\bm{W}_{L-1}\sigma(\cdots\sigma(\bm{W}_{1}\bm{x}))),   (6)

where $\bm{x}$ is composed of $(x,a)$, $\sigma(\cdot)$ is the point-wise activation function, $L$ is the network depth and $m$ is the network width. Furthermore, $\bm{W}_1\in\mathbb{R}^{m\times d}$, $\bm{W}_l\in\mathbb{R}^{m\times m}$ for $l\in\{2,\dots,L-1\}$, $\bm{W}_L\in\mathbb{R}^{1\times m}$, and $\theta$ collects all these matrices into a long vector of dimension $p$.

To see the connection with NeuralUCB, in Algorithm 1 we choose the loss as

\mathcal{L}(\theta;D) := \sum_{(x,a,r)\in D}(f_{\theta}(x,a)-r)^{2} + m\lambda\|\theta-\theta_{0}\|^{2},   (7)

where $\theta_0$ is the network parameter at initialization and $m$ is the network width. Here, the MSE loss is penalized by the distance between $\theta$ and its initialization $\theta_0$ because the analysis technique for NTK bandits requires that the network parameters stay within a neighborhood of the initialization zhou2020neural. We write the penalty coefficient as $m\lambda$ for proof convenience and for the comparison with NeuralUCB. We further choose the regularizer as

\mathcal{R}(\theta;D_{t-1}) := \frac{1}{\eta}\big(f_{\theta}(x_{t},a)-\tilde{f}_{\theta}(x_{t},a;\theta_{t-1})\big) + \tilde{\mathcal{R}}(\theta,D_{t-1};\theta_{t-1}),   (8)

where $\theta_{t-1}$ is the last iterate that minimizes the loss on $D_{t-1}$. Moreover, $\tilde{f}_{\theta}(x_t,a;\theta_{t-1})=f_{\theta_{t-1}}(x_t,a)+\langle\nabla_{\theta}f_{\theta_{t-1}}(x_t,a),\theta-\theta_{t-1}\rangle$ and $\tilde{\mathcal{R}}(\theta,D_{t-1};\theta_{t-1}) := \sum_{(x,a,r)\in D_{t-1}}\big(f_{\theta_{t-1}}(x,a)+\langle\nabla_{\theta}f_{\theta_{t-1}}(x,a),\theta-\theta_{t-1}\rangle-r\big)^{2}+m\lambda\|\theta-\theta_{0}\|^{2}$ are the first-order Taylor expansions of $f_{\theta}(x_t,a)$ and $\mathcal{L}(\theta;D_{t-1})$ at $\theta_{t-1}$, respectively. Then, equivalently, $\hat{\theta}(x_t,a)$ is calculated by

\hat{\theta}(x_{t},a) := \operatorname*{arg\,max}_{\theta}\ \tilde{f}_{\theta}(x_{t},a;\theta_{t-1}) - \eta\,\tilde{\mathcal{R}}(\theta,D_{t-1};\theta_{t-1}),   (9)

where $\tilde{f}_{\theta}(x_t,a;\theta_{t-1})$ and $\tilde{\mathcal{R}}(\theta,D_{t-1};\theta_{t-1})$ are defined as above.

We assume that the network is trained in the NTK regime with learning rate $\beta$ (see details in Algorithm 3 of Appendix B). Then, choosing $g(w)=\sqrt{w}$ in Eq. (5) and $\eta=1/(2\gamma^{2})$, where $\gamma$ is the same factor as the one used in NeuralUCB, we have the following regret bound for ROFU.

Theorem 2.

For convenience, let $L$ denote the network depth, $T$ the time horizon, $A$ the action set and $\lambda_0$ the minimal eigenvalue of the neural tangent kernel matrix at initialization. If the loss and regularizer are chosen as in Eq. (7) and Eq. (8), the network width satisfies $m\geq\text{poly}(T,|A|,L,\lambda^{-1},\lambda_0^{-1})$ and the learning rate satisfies $\beta\leq C(mTL+m\lambda)^{-1}$, then with high probability the regret of ROFU satisfies $R_T\leq\tilde{O}(\sqrt{T})$, where $\tilde{O}(\cdot)$ hides the $\log T$ terms, the dependence on the problem dimension and other factors irrelevant to $T$.

The idea of the proof is to connect the formula of ROFU with that of NeuralUCB and then bound their difference such that the order of the regret bound is not affected. We leave the full proof to Appendix B. This justifies that ROFU can also achieve a near-optimal regret bound in the NTK regime. Nonetheless, in sharp contrast with NeuralUCB, where computing the inverse of a large matrix $\bm{Z}_{t-1}$ is necessary at each step, ROFU admits a very efficient implementation, as shown below.

3.2 An efficient algorithm to compute $\text{OFU}^{R}$

In order to apply ROFU, we have to efficiently solve or approximate Eq. (4). In the cases of MAB and KCB, we can calculate closed-form solutions for Eq. (4). When $f_{\theta}$ is a deep neural network, it is time-consuming to solve Eq. (4) exactly. Naturally, one can use gradient descent methods to approximately solve Eq. (4). However, it is still not manageable to optimize from scratch for every $(x_t,a)$. We therefore propose to start the optimization from $\theta_{t-1}$ for each $(x_t,a)$.

More specifically, when $f_{\theta}$ is a neural network and $\mathcal{R}$ is differentiable, we can optimize the parameters $\theta$ with gradient ascent. We approximately solve Eq. (4) by executing a few steps of gradient ascent starting from $\theta_{t-1}$. That is, $\hat{\theta}_{j+1}=\hat{\theta}_{j}+\kappa\nabla_{\theta}\big(f_{\hat{\theta}_{j}}(x_t,a)-\eta\mathcal{R}(\hat{\theta}_{j};D)\big)$ with $\hat{\theta}_0 := \theta_{t-1}$, where $\kappa$ is the step size. The above implementation is summarized in Alg. 2.

Alg. 2 essentially performs a local search around $\theta_{t-1}$. Indeed, starting from $\theta_{t-1}$ brings an extra benefit for the optimization, because intuitively $\theta_{t-1}$ often gets closer to $\hat{\theta}(\cdot,a)$ as $t$ grows. For example, in MAB, $\theta_{t-1}(\cdot,a)=\bar{\mu}_{t-1}(a)$, where $\bar{\mu}_{t-1}(a)$ is the empirical mean of action $a$, and $\hat{\theta}_{t-1}(\cdot,a)=\bar{\mu}_{t-1}(a)+1/n(a)$, where $n(a)$ is the number of pulls on $a$. Thus, $\theta_t$ is close to $\hat{\theta}$ when $n(a)$ is large.

It is easy to see that the time complexity of Alg. 2 is $O(TM|A|p)$. In Sec. 5, we show that the regret of ROFU is low in practice even with a very small $M$.

Remark 3.

For MAB, linear/kernel bandits and the NTK bandit, $-(f_{\theta}(x_t,a)-\mathcal{R}(\theta;D))$ is strongly convex and smooth. Assuming that it is $\alpha$-strongly convex and $\zeta$-smooth, by Thm 2.1 in needell2014stochastic, we have $\mathbb{E}\|\hat{\theta}_M(x_t,a)-\hat{\theta}(x_t,a)\|^2\leq(1-2\kappa\alpha(1-\kappa\zeta))^{M}\|\theta_{t-1}-\hat{\theta}(x_t,a)\|^2+\frac{\kappa\nu^2}{\alpha(1-\kappa\zeta)}$, where $\mathbb{E}$ is over the randomness of SGD, $\kappa$ is the learning rate and $\nu^2$ is the expected squared gradient norm at $\hat{\theta}(x_t,a)$.

We note that the convexity and smoothness for the NTK bandit are due to the choice of $\mathcal{R}(\theta;D)$ in Eq. (8). Based on Remark 3, we can expect Alg. 2 to converge fast in these cases, especially since it can leverage the property that $\|\theta_{t-1}-\hat{\theta}(x_t,a)\|$ gets smaller as $t$ gets larger. When $f_{\theta}$ is general and complex, there is no theoretical guarantee on the convergence rate, but in practice stochastic gradient descent with a good initialization usually converges fast. We note that NeuralUCB can have a fast implementation by approximating $\bm{Z}$ with its diagonal elements, which, however, can significantly compromise the regret in practice. Please see Sec. 5.2 for empirical evaluations.

Algorithm 2 An efficient implementation to estimate \text{OFU}^{R}
1:  Input: Dataset D_{t-1}, \theta_{t-1}=\operatorname*{arg\,min}_{\theta}\mathcal{L}(\theta;D_{t-1}), a context-action pair (x_t,a), learning rate \kappa and number of training steps M.
2:  Observe context x_t.
3:  Set \hat{\theta}_0(x_t,a) := \theta_{t-1}.
4:  for j=1,\dots,M do
5:     \hat{\theta}_j(x_t,a) := \hat{\theta}_{j-1}(x_t,a)+\kappa\tilde{\nabla}\big(f_{\hat{\theta}_{j-1}(x_t,a)}(x_t,a)-\eta\mathcal{R}(\hat{\theta}_{j-1}(x_t,a);D_{t-1})\big), where \tilde{\nabla} is an estimator of the gradient.
6:  end for
7:  return \text{OFU}^{R}(x_t,a) := f_{\theta_{t-1}}(x_t,a)+\sqrt{\max\big(0, f_{\hat{\theta}_M(x_t,a)}(x_t,a)-f_{\theta_{t-1}}(x_t,a)\big)}.
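A minimal PyTorch sketch of Alg. 2 is given below, instantiated with the choices used in our experiments ($\eta=1$, $\mathcal{R}(\theta;D)=|D|\,\text{MSE}(\theta;D)$, $g(w)=\sqrt{w}$); the network architecture, step size and synthetic data are placeholder assumptions, and a copy of $\theta_{t-1}$ is updated so the deployed model is left untouched.

```python
import copy
import torch

def rofu_ucb(net, xa, data, eta=1.0, kappa=1e-2, M=5):
    """Estimate OFU^R(x_t, a) with M gradient-ascent steps starting from theta_{t-1} (Alg. 2).

    net:  model f_theta already trained to minimize MSE on `data` (plays the role of theta_{t-1})
    xa:   feature tensor encoding the context-action pair (x_t, a)
    data: tuple (X, R) of past context-action features and observed rewards
    """
    X, R = data
    with torch.no_grad():
        f_prev = net(xa).item()                    # f_{theta_{t-1}}(x_t, a)

    net_hat = copy.deepcopy(net)                   # start the local search from theta_{t-1}
    opt = torch.optim.SGD(net_hat.parameters(), lr=kappa)
    for _ in range(M):
        opt.zero_grad()
        reg = len(R) * torch.mean((net_hat(X).squeeze(-1) - R) ** 2)   # R(theta; D) = |D| * MSE
        objective = net_hat(xa) - eta * reg        # maximize f_theta(x_t, a) - eta * R(theta; D)
        (-objective).backward()                    # gradient ascent via SGD on the negation
        opt.step()

    with torch.no_grad():
        f_hat = net_hat(xa).item()
    return f_prev + max(0.0, f_hat - f_prev) ** 0.5    # line 7 of Alg. 2, g(w) = sqrt(w)

# toy usage with a placeholder 2-layer network and synthetic data
torch.manual_seed(0)
net = torch.nn.Sequential(torch.nn.Linear(8, 32), torch.nn.ReLU(), torch.nn.Linear(32, 1))
X, R = torch.randn(100, 8), torch.randn(100)
print(rofu_ucb(net, torch.randn(8), (X, R)))
```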

4 Related work

The contextual bandit problem has been extensively studied and applied in machine learning. Besides OFU, the most used and studied contextual bandit algorithms can mainly be divided into two categories: $\epsilon$-greedy sutton2018reinforcement and Thompson sampling thompson1933likelihood.

$\epsilon$-greedy greedily selects the action $a=\operatorname*{arg\,max}_{a'}f_{\theta_{t-1}}(x,a')$ with probability $1-\epsilon$ and selects a random action to explore with probability $\epsilon$. While sub-optimal in terms of EE trade-off, $\epsilon$-greedy is simple, general and computationally efficient. Thus, $\epsilon$-greedy is the most widely used algorithm in complex tasks.

The basic idea of Thompson sampling thompson1933likelihood is to maintain a posterior over parameters and make decisions according to samples from the posterior. However, exact Thompson sampling can be applied only when the posterior is accessible; otherwise one has to resort to approximate inference methods, which can be computationally inefficient and may significantly hurt the regret deep_bayesian_bandit_showdown; phan2019thompson.

There are several works extending OFU and Thompson sampling to the case of non-linear contextual bandits. NeuralUCB zhou2020neural and Neural-TS zhang2021neural extend OFU and TS to NTK bandits, and they provide regret bounds of order $\tilde{O}(\sqrt{T})$. Our analysis for the NTK bandit is largely inspired by their work. However, as mentioned in Sec. 2, both NeuralUCB and NeuralTS need to calculate the inverse of a matrix whose dimension equals the number of parameters in the DNN, which is computationally inefficient for large DNN models. Moreover, the neural-linear model is studied in zahavy2019deep; deep_bayesian_bandit_showdown, where the output of the second-to-last layer of a DNN is taken as the feature map. These algorithms sometimes work well, but have no theoretical guarantee on regret.

5 Experiment

We now empirically evaluate ROFU. For simplicity, in all our experiments we set $\eta=1$, $g(w)=\sqrt{w}$, $\mathcal{L}(\theta;D)=\text{MSE}(\theta;D)$ and $\mathcal{R}(\theta;D)=|D|\,\text{MSE}(\theta;D)$. That is, $\theta_{t-1}$ is trained to minimize $\text{MSE}(\theta;D)$ with a standard optimizer (we train $\theta_{t-1}$ with stochastic gradient descent starting from $\theta_{t-2}$ in all the experiments), and $\hat{\theta}(x,a)$ is trained to maximize $f_{\theta}(x,a)-|D|\,\text{MSE}(\theta;D)$ using Alg. 2.

Figure 1: Ablation study on MLP and ResNet bandits. Notation: $M=1/5/10$ means that we run $1/5/10$ gradient descent updates in Alg. 2.

5.1 Analysis on MLP and ResNet bandits

As discussed in Sec. 1, a contextual bandit algorithm should trade off EE efficiently when the reward is generated by a complex function, while keeping the computational cost low. From Alg. 2, we can see that the time complexity of ROFU is determined by the number of training steps $M$. We therefore evaluate the regret of ROFU when $M$ is small.

To evaluate the performance of ROFU in Alg. 2 on complex tasks, we consider two contextual bandits with a DNN as the simulator, i.e., $r(x_t,a)$ is generated by a DNN model. We consider two popular DNN architectures to generate rewards: a 2-layer MLP and a 20-layer ResNet with CNN blocks and Batch Normalization as in he2016deep. We summarize other information about the two tested bandits in Table 2.

Bandit   Layers   Context Dim   # Arms   NN Parameters        Context Distribution   Noise
MLP      2        10            10       Random               Gaussian               N(0, 0.05)
ResNet   20       3×32×32       10       Trained on Cifar10   Uniform                N(0, 0.5)

Table 2: Basic information about MLP and ResNet bandits.

We use DNNs of larger size for training in Alg. 2. More specifically, for the MLP bandit, $f_{\theta}$ is chosen as a 3-layer MLP, and for the ResNet20 bandit, $f_{\theta}$ is chosen as a ResNet32. Each experiment is repeated 16 times. We present the regret and confidence bonus in Fig. 1. From Fig. 1, we can see that (1) ROFU achieves a small regret on both tasks with a considerably small $M$, even for a very large DNN model; and (2) the confidence bonus monotonically increases with $M$, and for each $M$ the confidence bonus converges to 0 as expected. Moreover, while the regret seems sensitive to the value of $M$, the regrets of ROFU with $M=5,10$ are much smaller than those of the $M=1$ case and of $\epsilon$-greedy.

Figure 2: Evaluations on non-linear contextual bandits.

5.2 Performance comparison on real-world datasets

To evaluate ROFU against powerful baselines, we conduct experiments on contextual bandits created from real-world datasets, following the setting in deep_bayesian_bandit_showdown. For example, suppose that $D := \{(x_t,c_t)\}_{t\leq T}$ is a $K$-classification dataset, where $x_t$ is the feature vector and $c_t\in[K]$ is the label. We create a contextual bandit problem as follows: at time step $t\leq T$, the agent observes the context $x_t$ and then takes an action $a_t$. The agent receives a high reward if it successfully predicts the label of $x_t$. A non-classification dataset can be turned into a contextual bandit in a similar way. For the details of these bandits, please refer to deep_bayesian_bandit_showdown.
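For concreteness, a minimal sketch of this conversion is shown below; it uses a synthetic classification dataset and a uniformly random agent as placeholders for the real datasets and algorithms evaluated in this section.

```python
import numpy as np

rng = np.random.default_rng(0)
K, T, d = 5, 1000, 20
X = rng.normal(size=(T, d))        # feature vectors x_t
c = rng.integers(K, size=T)        # class labels c_t in [K]

def bandit_reward(t, action):
    """Reward 1 if the chosen arm equals the label of x_t, else 0."""
    return float(action == c[t])

# interaction loop with a placeholder uniformly random agent
total_reward = sum(bandit_reward(t, rng.integers(K)) for t in range(T))
print(f"cumulative reward of a random agent: {total_reward:.0f} (optimal: {T})")
```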

For baselines, we consider NeuralUCB zhou2020neural and the Thompson sampling variants from deep_bayesian_bandit_showdown. It is noteworthy that we only evaluate the algorithms in deep_bayesian_bandit_showdown with relatively small regrets, and we directly run the code provided by the authors. For ROFU, we fix $M=5$ in all experiments; for the other hyper-parameters, please see Appendix C. We repeat each algorithm 16 times on each dataset.

We report the regret $R_T=\mathbb{E}\sum_{t\leq T}r(x_t,a'_t)-\mathbb{E}\sum_{t\leq T}r(x_t,a_t)$, where $a'_t=\operatorname*{arg\,max}_{a\in A}f_{\theta'}(x_t,a)$ and $\theta'$ is a parameter trained on the full dataset. For hyper-parameter tuning and more details, please see Appendix C. The results are presented in Fig. 2 and Table 3. We found that the regret of NeuralUCB is occasionally linear; this might be because NeuralUCB uses a diagonal matrix to approximate $\bm{Z}$ for acceleration. Moreover, we can see that ROFU significantly outperforms these baselines in terms of regret.

              Mean          Census       Jester       Adult        Covertype    Statlog        Financial    Mushroom
Dropout       1.75 ± 0.80   1.51 ± 0.10  1.34 ± 0.14  1.00 ± 0.09  1.14 ± 0.13  1.54 ± 0.87    3.50 ± 0.60  2.21 ± 0.42
Bootstrap     2.23 ± 1.00   2.51 ± 0.16  1.72 ± 0.11  1.43 ± 0.10  1.93 ± 0.13  1.43 ± 1.57    4.52 ± 2.29  2.04 ± 0.48
ParamNoise    2.30 ± 1.12   2.28 ± 0.23  1.59 ± 0.14  1.37 ± 0.10  1.80 ± 0.20  3.88 ± 6.40    4.07 ± 1.76  1.06 ± 0.32
NeuralLinear  1.82 ± 0.69   3.24 ± 0.47  1.70 ± 0.13  1.46 ± 0.12  1.84 ± 0.19  1.25 ± 0.11    2.25 ± 0.35  1.00 ± 0.38
Greedy        2.47 ± 1.12   2.76 ± 1.24  1.65 ± 0.10  1.56 ± 0.11  2.27 ± 0.23  3.08 ± 4.91    4.74 ± 2.31  1.20 ± 0.41
NeuralUCB     9.76 ± 14.02  1.72 ± 0.12  1.47 ± 0.08  1.18 ± 0.05  1.86 ± 0.16  41.42 ± 69.51  2.74 ± 0.50  17.29 ± 7.45
ROFU (ours)   1.05 ± 0.09   1.00 ± 0.09  1.00 ± 0.20  1.17 ± 0.06  1.00 ± 0.14  1.00 ± 0.24    1.00 ± 0.60  1.22 ± 0.37

Table 3: The final regret of each algorithm. On each dataset, the regrets are normalized by that of the algorithm with the smallest regret, so the best algorithm has value 1.00; the Mean column averages the normalized regrets over the datasets.

6 Conclusion and future work

In this work, we propose an OFU variant, called ROFU, which can be applied to non-linear contextual bandits. We show that the regret of ROFU is (near-)optimal on various contextual bandit models. Moreover, we propose an efficient algorithm to approximately compute the upper confidence bound. Thus, ROFU is efficient in both computation and EE trade-off, which is empirically verified by our experimental results.

EE trade-off is a fundamental problem that lies at the heart of sequential decision making. However, the huge computational cost of (near-)optimal EE trade-off algorithms significantly limits their application, especially in complex domains. We hope our method can inspire more algorithms that efficiently trade off EE in sequential decision-making tasks beyond contextual bandit, such as deep reinforcement learning.

References

  • (1) Z. Allen-Zhu, Y. Li, and Z. Song. A convergence theory for deep learning via over-parameterization. arXiv preprint arXiv:1811.03962, 2018.
  • (2) Z. Allen-Zhu, Y. Li, and Z. Song. On the convergence rate of training recurrent neural networks. In Advances in Neural Information Processing Systems, 2019.
  • (3) P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 2002.
  • (4) S. Bubeck and N. Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. arXiv preprint arXiv:1204.5721, 2012.
  • (5) Y. Cao and Q. Gu. A generalization theory of gradient descent for learning over-parameterized deep ReLU networks. arXiv preprint arXiv:1902.01384, 2019.
  • (6) W. Chu, L. Li, L. Reyzin, and R. Schapire. Contextual bandits with linear payoff functions. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, 2011.
  • (7) A. M. Davie and A. J. Stothers. Improved bound for complexity of matrix multiplication. Proceedings. Section A, Mathematics-The Royal Society of Edinburgh, 143(2):351, 2013.
  • (8) S. S. Du, X. Zhai, B. Poczos, and A. Singh. Gradient descent provably optimizes over-parameterized neural networks. In International Conference on Learning Representations (ICLR), 2019.
  • (9) S. Filippi, O. Cappe, A. Garivier, and C. Szepesvári. Parametric bandits: The generalized linear case. In NIPS, volume 23, pages 586–594, 2010.
  • (10) Q. Gu, A. Karbasi, K. Khosravi, V. Mirrokni, and D. Zhou. Batched neural bandits. arXiv preprint arXiv:2102.13028, 2021.
  • (11) K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
  • (12) T. Jaksch, R. Ortner, and P. Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 2010.
  • (13) D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • (14) J. Langford and T. Zhang. The epoch-greedy algorithm for contextual multi-armed bandits. Advances in neural information processing systems, 20(1):96–1, 2007.
  • (15) D. Needell, R. Ward, and N. Srebro. Stochastic gradient descent, weighted sampling, and the randomized kaczmarz algorithm. Advances in neural information processing systems, 27:1017–1025, 2014.
  • (16) M. Phan, Y. Abbasi-Yadkori, and J. Domke. Thompson sampling with approximate inference. arXiv preprint arXiv:1908.04970, 2019.
  • (17) C. Riquelme, G. Tucker, and J. Snoek. Deep Bayesian bandits showdown: An empirical comparison of Bayesian deep networks for Thompson sampling. In International Conference on Learning Representations, 2018.
  • (18) R. S. Sutton and A. G. Barto. Reinforcement learning: An introduction. MIT press, 2018.
  • (19) W. R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 1933.
  • (20) M. Valko, N. Korda, R. Munos, I. Flaounas, and N. Cristianini. Finite-time analysis of kernelised contextual bandits. arXiv preprint arXiv:1309.6869, 2013.
  • (21) T. Zahavy and S. Mannor. Deep neural linear bandits: Overcoming catastrophic forgetting through likelihood matching. arXiv preprint arXiv:1901.08612, 2019.
  • (22) W. Zhang, D. Zhou, L. Li, and Q. Gu. Neural thompson sampling. In International Conference on Learning Representations, 2021.
  • (23) D. Zhou, L. Li, and Q. Gu. Neural contextual bandits with ucb-based exploration. In International Conference on Machine Learning, 2020.
  • (24) D. Zou and Q. Gu. An improved analysis of training over-parameterized deep neural networks. In Advances in Neural Information Processing Systems, 2019.

Appendix A Analysis on multi-armed bandit

The following theorem presents the equivalence between ROFU and UCB1 auer2002finite .

Theorem 3.

Let $n_a$ denote the number of pulls on action $a$ and $\bar{\theta}_a$ denote the empirical mean of rewards on action $a$. With $\mathcal{L}(\theta;D)=\text{MSE}(\theta;D)$, $\mathcal{R}(\theta;D)=|D|\,\text{MSE}(\theta;D)$, $\eta(a)=\frac{1}{2n_a(8\log|D|/n_a)^{1/2b}}$ and $g(w)=w^{b}$, we have $\text{OFU}^{R}(a)=\bar{\theta}_a+\sqrt{8\log t/n_a}$, which is the bound used in UCB1.

Proof.

Setting the derivative of Eq. (4) to zero, we have $\hat{\theta}(a)=\bar{\theta}_a+(8\log t/n_a)^{1/2b}$; in particular, for $b=1/2$ this gives $\hat{\theta}(a)=\bar{\theta}_a+8\log t/n_a$. Inserting this into Eq. (5) with $g(w)=w^{b}$ yields $\text{OFU}^{R}(a)=\bar{\theta}_a+\sqrt{8\log t/n_a}$, which finishes the proof. ∎
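This closed form admits a short numerical check; the sketch below uses synthetic pull counts and rewards (the values of $t$, $n_a$ and $b$ are arbitrary placeholders).

```python
import numpy as np

t, n_a, b = 200, 12, 0.5                      # round index, pulls on arm a, exponent b
rewards_a = np.random.default_rng(3).normal(size=n_a)
theta_bar = rewards_a.mean()                  # empirical mean of arm a

eta = 1.0 / (2 * n_a * (8 * np.log(t) / n_a) ** (1 / (2 * b)))
# closed-form maximizer of theta_a - eta * sum_i (theta_a - r_i)^2
theta_hat = theta_bar + 1.0 / (2 * eta * n_a)
ofu_r = theta_bar + (theta_hat - theta_bar) ** b   # Eq. (5) with g(w) = w^b

ucb1_style = theta_bar + np.sqrt(8 * np.log(t) / n_a)
print(np.isclose(ofu_r, ucb1_style))          # True
```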

Appendix B Analysis on neural tangent kernel bandit

We now restate and prove Theorem 2.

The training step in Algorithm 1 (line 8) is accomplished by running gradient descent for $J$ iterations with learning rate $\beta$ to minimize the loss, as given by Algorithm 3. In the following analysis, for ease of derivation, we assume that $\theta^{(J)}$ satisfies the first-order optimality condition of minimizing $\mathcal{L}(\theta;D)$, and $\theta_0$ is the randomly initialized starting point.

Algorithm 3 TrainNN(D,\beta,\theta_0)
1:  Define \mathcal{L}(\theta;D)=\sum_{(x,a,r)\in D}(f(x,a;\theta)-r)^{2}+m\lambda\|\theta-\theta_{0}\|^{2}.
2:  Let \theta^{(0)}\leftarrow\theta_{0}.
3:  for j = 0, …, J-1 do
4:     \theta^{(j+1)}=\theta^{(j)}-\beta\nabla\mathcal{L}(\theta^{(j)}).
5:  end for
6:  return \theta^{(J)}.

By choosing the regularizer $\mathcal{R}(\theta;D_{t-1})$ as in Eq. (8), the procedure of searching for $\hat{\theta}(x_t,a)$ (Eq. (4)) is equivalent to

\hat{\theta}(x_{t},a) = \operatorname*{arg\,max}_{\theta}\ \tilde{f}_{\theta}(x_{t},a;\theta_{t-1}) - \eta\,\tilde{\mathcal{R}}(\theta,D_{t-1};\theta_{t-1}),   (10)

where

\tilde{f}_{\theta}(x_{t},a;\theta_{t-1}) = f_{\theta_{t-1}}(x_{t},a) + \langle\nabla_{\theta}f_{\theta_{t-1}}(x_{t},a),\theta-\theta_{t-1}\rangle,
\tilde{\mathcal{R}}(\theta,D_{t-1};\theta_{t-1}) := \sum_{(x,a,r)\in D_{t-1}}\big(f_{\theta_{t-1}}(x,a)+\langle\nabla_{\theta}f_{\theta_{t-1}}(x,a),\theta-\theta_{t-1}\rangle-r\big)^{2}+m\lambda\|\theta-\theta_{0}\|^{2}

are the first-order Taylor expansions of $f_{\theta}(x_t,a)$ and $\mathcal{L}(\theta;D_{t-1})$ at $\theta_{t-1}$, respectively. We note that the special choice of regularizer $\mathcal{R}(\theta;D_{t-1})$ in Eq. (8) is made to derive a simple form of $\hat{\theta}(x_t,a)$.

In the following analysis, we first connect ROFU with NeuralUCB via an informal argument. Then we bound the approximations in the informal argument and show that they do not affect the order of the sublinear regret bound. We use $h_{\theta_{t-1}}(x_t,a)$ to denote $\nabla_{\theta}f_{\theta}(x_t,a)|_{\theta=\theta_{t-1}}$.

An intuitive procedure.

With the above choices, we can see that the procedure Eq. (10) is equivalent to

\hat{\theta}(x_{t},a) = \operatorname*{arg\,max}_{\theta}\ \langle h_{\theta_{t-1}}(x_{t},a),\theta-\theta_{t-1}\rangle - \eta\|\theta-\theta_{t-1}\|_{m{\bm{Z}}_{t-1}}^{2},   (11)

where

{\bm{Z}}_{t-1} := \lambda{\bm{I}} + \frac{1}{m}\sum_{(x,a,r)\in D_{t-1}}h_{\theta_{t-1}}(x,a)h_{\theta_{t-1}}(x,a)^{\top},   (12)

by assuming that $\theta_{t-1}$ satisfies the first-order optimality condition of minimizing $\mathcal{L}(\theta;D_{t-1})$.

Solving the problem Eq. (11), we obtain

\hat{\theta}(x_{t},a) = \theta_{t-1} + \frac{1}{2\eta m}{\bm{Z}}_{t-1}^{-1}h_{\theta_{t-1}}(x_{t},a).

Hence, if the function $g(\cdot)$ in Eq. (5) is given by $g(x)=\sqrt{x}$, then $\text{OFU}^{R}$ becomes

\text{OFU}^{R}(x_{t},a) = f_{\theta_{t-1}}(x_{t},a) + \sqrt{f_{\hat{\theta}(x_{t},a)}(x_{t},a) - f_{\theta_{t-1}}(x_{t},a)}
\approx f_{\theta_{t-1}}(x_{t},a) + \frac{1}{\sqrt{2\eta}}\sqrt{h_{\theta_{t-1}}(x_{t},a)^{\top}{\bm{Z}}_{t-1}^{-1}h_{\theta_{t-1}}(x_{t},a)/m},   (13)

where the approximation comes from replacing $f_{\hat{\theta}(x_t,a)}(x_t,a)$ with its first-order Taylor approximation at $\theta_{t-1}$.

Similarly to the formula in (13), one uses

U_{t,a} = f_{\theta_{t-1}}(x_{t},a) + \gamma_{t-1}\sqrt{h_{\theta_{t-1}}(x_{t},a)^{\top}\tilde{{\bm{Z}}}_{t-1}^{-1}h_{\theta_{t-1}}(x_{t},a)/m}   (14)

as the upper confidence bound in NeuralUCB (zhou2020neural, Algorithm 1). One difference is that zhou2020neural use

\tilde{{\bm{Z}}}_{t-1} := \lambda{\bm{I}} + \frac{1}{m}\sum_{\tau=1}^{t-1}h_{\theta_{\tau-1}}(x_{\tau},a_{\tau})h_{\theta_{\tau-1}}(x_{\tau},a_{\tau})^{\top}.   (15)

We next bound the differences between ROFU and NeuralUCB, and then show that the differences do not affect the regret bound at the order of $\sqrt{T}$.

Bounding the approximations in the above derivation

We first introduce some existing bounds on how well the function value is approximated by its first-order Taylor expansion in the NTK regime.

Lemma 1 (Lemma 4.1 & Theorem B.3, cao2019generalization).

There exist constants $c_1,c_2,c_3,c_4>0$ such that for any $\delta\in(0,1)$, if $\omega$ satisfies $c_1(mL)^{-3/2}\left(\log(T|A|L^2/\delta)\right)^{3/2}\leq\omega\leq c_2L^{-6}(\log m)^{-3/2}$, then with probability at least $1-\delta$, for all $\theta,\tilde{\theta}$ with $\|\tilde{\theta}-\theta_0\|_2\leq\omega$, $\|\theta-\theta_0\|_2\leq\omega$, and for all $x\in\{x_t\}_{t\in[T]}$ and all $a\in A$, we have

\left|f_{\tilde{\theta}}(x,a)-f_{\theta}(x,a)-\langle h_{\theta}(x,a),\tilde{\theta}-\theta\rangle\right| \leq c_{3}\omega^{4/3}L^{3}\sqrt{m\log m},   (16)
\|h_{\theta}(x,a)\|_{F} \leq c_{4}\sqrt{mL}.   (17)

We also introduce a bound on the difference between the gradient at $\theta$ and the gradient at the initial point $\theta_0$ when $\theta$ is not far from $\theta_0$, as required and satisfied in the NTK regime.

Lemma 2 (Theorem 5, allen2019convergence ).

There exist constants $c_1,c_2,c_3>0$ such that for any $\delta\in(0,1)$, if $\omega$ satisfies $c_1(mL)^{-3/2}\max\{\log^{-3/2}m,\log^{3/2}(T|A|/\delta)\}\leq\omega\leq c_2L^{-9/2}(\log m)^{-3}$, then with probability at least $1-\delta$, for all $\theta$ with $\|\theta-\theta_0\|_2\leq\omega$, and for all $x\in\{x_t\}_{t\in[T]}$ and all $a\in A$, we have

\|h_{\theta}(x,a)-h_{\theta_{0}}(x,a)\|_{2} \leq c_{3}\sqrt{\log m}\,\omega^{1/3}L^{3}\|h_{\theta_{0}}(x,a)\|_{2}.   (18)

With these existing bounds, we next show that $\bm{Z}_t$ (Eq. (12)) enjoys the same property as $\tilde{\bm{Z}}_t$ (Eq. (15)), as follows. We introduce another sequence $\bar{\bm{Z}}_t=\lambda\bm{I}+\frac{1}{m}\sum_{\tau=1}^{t}h_{\theta_0}(x_\tau,a_\tau)h_{\theta_0}(x_\tau,a_\tau)^{\top}$. In (zhou2020neural, Lemma B.3), it is shown that $\tilde{\bm{Z}}_t$ is close to $\bar{\bm{Z}}_t$. Similarly, we have the following lemma on the distance $\|\bm{Z}_t-\bar{\bm{Z}}_t\|_F$.

Lemma 3.

With probability $1-\delta$, where $\delta\in(0,1)$, for any $t\in[T]$, we have

\|{\bm{Z}}_{t}\|_{2} \leq \lambda + c_{1}tL, \qquad \|{\bm{Z}}_{t}-\bar{{\bm{Z}}}_{t}\|_{F} \leq c_{2}m^{-1/6}\sqrt{\log m}\,L^{4}t^{7/6}\lambda^{-1/6},   (19)

if the network width $m$ satisfies

c_{3}m^{-3/2}L^{-3/2}[\log(T|A|L^{2}/\delta)]^{3/2} \leq 2\sqrt{t/(m\lambda)} \leq c_{4}L^{-6}[\log m]^{-3/2}, \quad \forall t\in[T],   (20)

with some constants $c_1,c_2,c_3,c_4>0$.

Proof.

The proof can be adapted from that of Lemma B.3 in zhou2020neural, using the fact that $\|h_{\theta_{t-1}}(x,a)-h_{\theta_0}(x,a)\|_2$ is bounded by Lemma 2. ∎

We note that Lemma 3 plays the same role as Lemma B.3 in zhou2020neural. Hence, with Lemma 3 we can establish a result similar to Lemma 5.2 in zhou2020neural: with high probability, the optimal $\theta^*$ lies in the sequence of confidence sets, i.e., ellipsoids defined with the new matrix $\bm{Z}_t$.

Next we show that the first-order Taylor approximation in Eq. (11) does not affect the achievable regret bound either. By the proof of Lemma 5.3 in zhou2020neural , in order to prove a similar result, we only need to bound

|\text{OFU}^{R}(x_{t},a) - \tilde{U}_{t,a}|,   (21)

where $\tilde{U}_{t,a} := \langle h_{\theta_{t-1}}(x_t,a),\theta_{t-1}-\theta_0\rangle+\gamma_{t-1}\sqrt{h_{\theta_{t-1}}(x_t,a)^{\top}\bm{Z}_{t-1}^{-1}h_{\theta_{t-1}}(x_t,a)/m}$ and $\gamma_{t-1}$ is given by Algorithm 1 in zhou2020neural. The derivation is as follows,

|\text{OFU}^{R}(x_{t},a)-\tilde{U}_{t,a}|
\leq |f_{\theta_{t-1}}(x_{t},a)-\langle h_{\theta_{t-1}}(x_{t},a),\theta_{t-1}-\theta_{0}\rangle|
\quad+\left|\sqrt{f_{\hat{\theta}(x_{t},a)}(x_{t},a)-f_{\theta_{t-1}}(x_{t},a)}-\gamma_{t-1}\sqrt{h_{\theta_{t-1}}(x_{t},a)^{\top}{\bm{Z}}_{t-1}^{-1}h_{\theta_{t-1}}(x_{t},a)}\right|
=\left|f_{\hat{\theta}(x_{t},a)}(x_{t},a)-f_{\theta_{0}}(x_{t},a)f_{\theta_{t-1}}(x_{t},a)-\langle h_{\theta_{t-1}}(x_{t},a),\hat{\theta}(x_{t},a)-\theta_{t-1}\rangle\right|
\quad+\sqrt{\left|f_{\hat{\theta}(x_{t},a)}(x_{t},a)-f_{\theta_{t-1}}(x_{t},a)-\gamma_{t-1}^{2}\,h_{\theta_{t-1}}(x_{t},a)^{\top}{\bm{Z}}_{t-1}^{-1}h_{\theta_{t-1}}(x_{t},a)\right|}
\leq c_{3}m^{-1/6}\sqrt{\log m}\,t^{2/3}\lambda^{-2/3}L^{3}+\sqrt{c_{3}m^{-1/6}\sqrt{\log m}\,t^{2/3}\lambda^{-2/3}L^{3}},   (22)

where $c_3$ is a constant and the last inequality is due to Lemma 1 and the choice of $\eta$ such that $1/(2\eta)=\gamma_{t-1}^2$. Hence we have shown that $\text{OFU}^{R}(x_t,a)$ shares the same property as $U_{t,a}$ in Algorithm 1 of zhou2020neural. The rest of the proof follows a similar procedure.

Appendix C Experimental details

Hyperparameters: We tune the hyper-parameters of ROFU on Statlog and directly apply them to the other datasets except Mushroom, because the reward scale of Mushroom is much larger than that of the other datasets. For the baselines, we directly use the best reported hyper-parameters.

For convenience, we suppose the context $x_t$ is sampled from some distribution $\mathcal{P}$ for all $t\leq T$. We define a virtual dataset $\{(x'_t)\}_{T}$, $x'_t\sim\mathcal{P}$, as well as a parameter trained by minimizing the MSE on the virtual dataset, $\theta'=\text{MinimizeMSE}\big(\sum_{t\leq T}\sum_{a\in A}(f_{\theta}(x'_t,a)-r(x'_t,a))^2\big)$, where MinimizeMSE is a gradient-based optimizer, e.g., SGD or Adam kingma2014adam.

To isolate the regret caused by neural network training and generalization error, we decompose the regret into two parts:

\mathcal{R}_{T} = \underbrace{\sum_{t\leq T}\max_{a}r(x_{t},a)-\mathbb{E}\sum_{t\leq T}r(x_{t},a^{\prime}_{t})}_{\text{I}} + \underbrace{\mathbb{E}\sum_{t\leq T}r(x_{t},a^{\prime}_{t})-\mathbb{E}\sum_{t\leq T}r(x_{t},a_{t})}_{\text{II}}   (23)

where $a'_t=\operatorname*{arg\,max}_{a}f_{\theta'}(x_t,a)$ and $a_t$ is the action selected by the agent.

It is noteworthy that, in practice, $\mathbb{E}\sum_{t}r(x_t,a'_t)$ is usually an upper bound of $\mathbb{E}\sum_{t\leq T}r(x_t,a_t)$, as $\theta'$ is trained on $T\times|A|$ data points while $\bar{\theta}$ and $\hat{\theta}$ are trained on no more than $T$ data points. Regret I is caused by MinimizeMSE and the generalization power of the function $f_{\theta}$, and is therefore independent of the algorithm used for EE trade-off. Thus, we only report regret II in Fig. 2, to give a clearer evaluation of ROFU's ability to trade off EE. We also present the full regret $R_T$ in Fig. 3. The running times are in Table 4.

                   ROFU   Neural-UCB   Dropout   ParamNoise   Greedy   Bootstrap   Neural Linear
Running time (s)   5259   1589         525       306          771      729         3612

Table 4: Running time of each algorithm for 20000 rounds. All experiments are run on an Nvidia Tesla P100 GPU. Our algorithm is slower than NeuralUCB because NeuralUCB uses the diagonal approximation, which lacks a theoretical guarantee on the regret.
Figure 3: Evaluations on non-linear contextual bandits (full regret $R_T$).