EE-Net: Exploitation-Exploration Neural Networks in Contextual Bandits
Abstract
In this paper, we propose a novel neural exploration strategy in contextual bandits, EE-Net, distinct from the standard UCB-based and TS-based approaches. Contextual multi-armed bandits have been studied for decades with various applications. To solve the exploitation-exploration tradeoff in bandits, there are three main techniques: epsilon-greedy, Thompson Sampling (TS), and Upper Confidence Bound (UCB). In recent literature, linear contextual bandits have adopted ridge regression to estimate the reward function and combine it with TS or UCB strategies for exploration. However, this line of work explicitly assumes the reward is a linear function of arm vectors, which may not be true in real-world datasets. To overcome this challenge, a series of neural bandit algorithms have been proposed, where a neural network is used to learn the underlying reward function and TS or UCB are adapted for exploration. Instead of calculating a large-deviation based statistical bound for exploration like previous methods, we propose "EE-Net", a novel neural-based exploration strategy. In addition to using a neural network (Exploitation network) to learn the reward function, EE-Net uses another neural network (Exploration network) to adaptively learn potential gains compared to the currently estimated reward for exploration. Then, a decision-maker is constructed to combine the outputs from the Exploitation and Exploration networks. We prove that EE-Net achieves a regret upper bound of $\mathcal{O}(\sqrt{T\log T})$ and show that EE-Net outperforms existing linear and neural contextual bandit baselines on real-world datasets.
1 Introduction
The stochastic contextual multi-armed bandit (MAB) (Lattimore and Szepesvári, 2020) has been studied for decades in the machine learning community to solve sequential decision making, with applications in online advertising (Li et al., 2010), personalized recommendation (Wu et al., 2016; Ban and He, 2021b), etc. In the standard contextual bandit setting, a set of arms is presented to a learner in each round, where each arm is represented by a context vector. The learner then selects and plays one arm according to some strategy and receives a reward. The goal is to maximize the cumulative rewards over $T$ rounds.
MAB algorithms offer principled approaches to address the trade-off between Exploitation and Exploration (EE): the data collected in past rounds should be exploited to obtain good rewards, while under-explored arms also need to be explored in the hope of even better rewards. The most widely used approaches for the EE trade-off can be classified into three main techniques: Epsilon-greedy (Langford and Zhang, 2008), Thompson Sampling (TS) (Thompson, 1933), and Upper Confidence Bound (UCB) (Auer, 2002; Ban and He, 2020).
Linear bandits (Li et al., 2010; Dani et al., 2008; Abbasi-Yadkori et al., 2011), where the reward is assumed to be a linear function of the arm vectors, have been well studied and have succeeded both empirically and theoretically. Given an arm, ridge regression is usually adopted to estimate its reward based on the data collected in past rounds. UCB-based algorithms (Li et al., 2010; Chu et al., 2011; Wu et al., 2016; Ban and He, 2021b) calculate an upper bound for the confidence ellipsoid of the estimated reward and choose the arm according to the sum of the estimated reward and the UCB. TS-based algorithms (Agrawal and Goyal, 2013; Abeille and Lazaric, 2017) formulate each arm as a posterior distribution whose mean is the estimated reward and choose the arm with the maximal sampled reward. However, the linear assumption on the reward may not hold in real-world applications (Valko et al., 2013).
To learn non-linear reward functions, recent works have utilized deep neural networks to learn the underlying reward functions, thanks to their powerful representation ability. Considering the past selected arms and received rewards as training samples, a neural network $f_1$ is built for exploitation. Zhou et al. (2020) computes a gradient-based upper confidence bound with respect to $f_1$ and uses the UCB strategy to select arms. Zhang et al. (2021) formulates each arm as a normal distribution where the mean is $f_1$ and the deviation is calculated based on the gradient of $f_1$, and then uses the TS strategy to choose arms. Both Zhou et al. (2020) and Zhang et al. (2021) achieve a near-optimal regret bound of $\tilde{\mathcal{O}}(\tilde{d}\sqrt{T})$, where $\tilde{d}$ is the effective dimension.

In this paper, we propose a novel neural exploration strategy named "EE-Net". Similar to other neural bandits, EE-Net has an exploitation network $f_1$ to estimate the reward for each arm. The crucial difference from existing works is that EE-Net has an exploration network $f_2$ to predict the potential gain of each arm compared to the current reward estimate. The input of the exploration network is the gradient of $f_1$ and its ground truth is the residual between the received reward and the estimated reward from $f_1$. This strategy is inspired by recent advances in neural UCB strategies (Zhou et al., 2020; Ban et al., 2021). Finally, a decision-maker $f_3$ is constructed to select arms. $f_3$ has two modes: linear and nonlinear. In the linear mode, $f_3$ is a linear combination of $f_1$ and $f_2$, inspired by the UCB strategy. In the nonlinear mode, $f_3$ is formulated as a neural network with input $(f_1, f_2)$ whose goal is to learn the probability of each arm being the optimal arm. Figure 1 depicts the workflow of EE-Net and its advantages for exploration over UCB-based or TS-based methods (see more details in Appendix D). The contributions of this paper are summarized as follows:
1. We propose a novel neural exploration strategy, EE-Net, where a separate neural network is assigned to learn the potential gain compared to the current reward estimate.
2. Under standard assumptions of over-parameterized neural networks, we prove that EE-Net can achieve the regret upper bound of $\mathcal{O}(\sqrt{T\log T})$, which improves by a multiplicative factor of $\sqrt{\log T}$ and is independent of either the input or the effective dimension, compared to existing state-of-the-art neural bandit algorithms.
3. We conduct extensive experiments on four real-world datasets, showing that EE-Net outperforms baselines including linear and neural versions of $\epsilon$-greedy, TS, and UCB.
Next, we discuss the problem definition in Sec. 3, elaborate on the proposed EE-Net in Sec. 4, and present our theoretical analysis in Sec. 5. In the end, we provide the empirical evaluation (Sec. 6) and the conclusion.
2 Related Work
Constrained Contextual bandits. The most common constraint placed on the reward function is the linear assumption, usually handled by ridge regression (Li et al., 2010; Abbasi-Yadkori et al., 2011; Valko et al., 2013; Dani et al., 2008). The linear UCB-based bandit algorithms (Abbasi-Yadkori et al., 2011; Li et al., 2016) and linear Thompson Sampling (Agrawal and Goyal, 2013; Abeille and Lazaric, 2017) achieve successful performance and the near-optimal regret bound of $\tilde{\mathcal{O}}(d\sqrt{T})$. To break the linear assumption, Filippi et al. (2010) generalizes the reward function to a composition of linear and non-linear functions and adopts a UCB-based algorithm to deal with it; Bubeck et al. (2011) imposes the Lipschitz property on the reward metric space and constructs a hierarchical optimistic optimization to make selections; Valko et al. (2013) embeds the reward function into a Reproducing Kernel Hilbert Space and proposes kernelized TS/UCB bandit algorithms.
Neural Bandits. To learn non-linear reward functions, deep neural networks have been adapted to bandits in various ways. Riquelme et al. (2018); Lu and Van Roy (2017) build an L-layer DNN to learn the arm embeddings and apply Thompson Sampling on the last layer for exploration. Zhou et al. (2020) first introduces a provable neural-based contextual bandit algorithm with a UCB exploration strategy, and Zhang et al. (2021) then extends the neural network to the Thompson Sampling framework. Their regret analysis is built on recent advances in the convergence theory of over-parameterized neural networks (Du et al., 2019; Allen-Zhu et al., 2019) and utilizes the Neural Tangent Kernel (Jacot et al., 2018; Arora et al., 2019) to construct connections with linear contextual bandits (Abbasi-Yadkori et al., 2011). Ban and He (2021a) further adopts convolutional neural networks with UCB exploration for visual-aware applications. Xu et al. (2020) performs UCB-based exploration on the last layer of the neural network to reduce the computational cost brought by the gradient-based UCB. Different from the above existing works, EE-Net keeps the powerful representation ability of neural networks to learn the reward function and is the first to assign another neural network to determine exploration.
3 Problem definition
We consider the standard contextual multi-armed bandit with a known number of rounds $T$ (Zhou et al., 2020; Zhang et al., 2021). In each round $t \in [T]$, the learner is presented with $n$ arms, $\mathbf{X}_t = \{\mathbf{x}_{t,1}, \dots, \mathbf{x}_{t,n}\}$, in which each arm is represented by a feature vector $\mathbf{x}_{t,i} \in \mathbb{R}^d$ for each $i \in [n]$. After playing one arm $\mathbf{x}_{t,i}$, its reward $r_{t,i}$ is assumed to be generated by the function:

$$r_{t,i} = h(\mathbf{x}_{t,i}) + \eta_{t,i}, \qquad (3.1)$$

where the unknown reward function $h$ can be either linear or non-linear and the noise $\eta_{t,i}$ is drawn from a certain distribution with expectation $\mathbb{E}[\eta_{t,i}] = 0$. Following many existing works (Zhou et al., 2020; Ban et al., 2021; Zhang et al., 2021), we consider bounded rewards, $r_{t,i} \in [0, 1]$. For brevity, we denote the arm selected in round $t$ by $\mathbf{x}_t$ and the reward received in round $t$ by $r_t$. The pseudo regret of $T$ rounds is defined as:

$$\mathbf{R}_T = \mathbb{E}\Big[\sum_{t=1}^{T} \big(r_t^{\ast} - r_t\big)\Big], \qquad (3.2)$$

where $r_t^{\ast} = \max_{i \in [n]} h(\mathbf{x}_{t,i})$ is the maximal expected reward in round $t$. The goal is to minimize $\mathbf{R}_T$ by a certain selection strategy.

Notation. We denote by $[k]$ the sequence $\{1, 2, \dots, k\}$. We use $\|\mathbf{v}\|_2$ to denote the Euclidean norm of a vector $\mathbf{v}$, and $\|\mathbf{W}\|_2$ and $\|\mathbf{W}\|_F$ to denote the spectral and Frobenius norms of a matrix $\mathbf{W}$. We use $\langle \cdot, \cdot \rangle$ to denote the standard inner product between two vectors or two matrices. We may use $\nabla_{\theta^1} f_1(\mathbf{x}_{t,i}; \theta^1)$ or simply $\nabla f_1$ to represent the gradient for brevity. We use $\mathcal{H}_t$ to represent the data collected up to round $t$.
4 Proposed Method: EE-Net
EE-Net is composed of three components. The first component is the exploitation network, $f_1(\cdot; \theta^1)$, which focuses on learning the unknown reward function based on the data collected in past rounds. The second component is the exploration network, $f_2(\cdot; \theta^2)$, which focuses on characterizing the level of exploration needed for each arm in the present round. The third component is the decision-maker, $f_3(\cdot; \theta^3)$, which focuses on suitably combining the outputs of the exploitation and exploration networks, leading to the arm selection.
1) Exploitation Net. The exploitation net $f_1$ is a neural network which learns the mapping from arms to rewards. In round $t$, we denote the network by $f_1(\cdot; \theta^1_{t-1})$, where the superscript of $\theta^1$ is the index of the network and the subscript $t-1$ represents the round in which the parameters of $f_1$ finished their last update. Given an arm $\mathbf{x}_{t,i}$, $f_1(\mathbf{x}_{t,i}; \theta^1_{t-1})$ is considered the "exploitation score" of $\mathbf{x}_{t,i}$. After playing the arm $\mathbf{x}_t$ chosen by some criterion, we receive a reward $r_t$. Therefore, we can conduct gradient descent to update $\theta^1$ based on the collected training samples $\{(\mathbf{x}_\tau, r_\tau)\}_{\tau=1}^{t}$ and denote the updated parameters by $\theta^1_t$.
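As a concrete illustration, the sketch below implements a minimal exploitation network and its per-round gradient-descent update in PyTorch. It is a simplified assumption of the setup (layer sizes, optimizer, learning rate, and iteration count are placeholders), not the authors' released implementation.

```python
import torch
import torch.nn as nn

class ExploitationNet(nn.Module):
    """f1: maps an arm vector to a scalar exploitation score."""
    def __init__(self, dim, width=100):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, width), nn.ReLU(), nn.Linear(width, 1))

    def forward(self, x):
        return self.net(x)

def update_f1(f1, arms, rewards, lr=1e-3, iters=100):
    """Gradient descent on the squared error over the collected (arm, reward) pairs."""
    opt = torch.optim.SGD(f1.parameters(), lr=lr)
    for _ in range(iters):
        opt.zero_grad()
        loss = ((f1(arms).squeeze(-1) - rewards) ** 2).mean()
        loss.backward()
        opt.step()
    return f1
```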
Table 1: The working structure of EE-Net.

Input | Network | Label
---|---|---
$\mathbf{x}_{t,i}$ | $f_1(\mathbf{x}_{t,i}; \theta^1)$ (Exploitation) | $r_{t,i}$
$\nabla_{\theta^1} f_1(\mathbf{x}_{t,i}; \theta^1)$ | $f_2(\cdot; \theta^2)$ (Exploration) | $r_{t,i} - f_1(\mathbf{x}_{t,i}; \theta^1)$
$\big(f_1(\cdot), f_2(\cdot)\big)$ | $f_3(\cdot; \theta^3)$ (Decision-maker with non-linear function) | $p_{t,i}$
2) Exploration Net. Our exploration strategy is inspired by existing UCB-based neural bandits (Zhou et al., 2020; Ban et al., 2021). Based on Lemma 5.2 in (Ban et al., 2021), given an arm $\mathbf{x}_{t,i}$, with probability at least $1 - \delta$, we have the following UCB form:

$$\big|h(\mathbf{x}_{t,i}) - f_1(\mathbf{x}_{t,i}; \theta^1_{t-1})\big| \leq \mathrm{UCB}\big(\nabla_{\theta^1} f_1(\mathbf{x}_{t,i}; \theta^1_{t-1})\big), \qquad (4.1)$$

where $h$ is defined in Eq. (3.1) and $\mathrm{UCB}(\cdot)$ is an upper confidence bound represented by a function of the gradient $\nabla_{\theta^1} f_1$ (see more details and discussions in Appendix D). Then we have the following definition.
Definition 4.1.
In round $t$, given an arm $\mathbf{x}_{t,i}$, we define $h(\mathbf{x}_{t,i}) - f_1(\mathbf{x}_{t,i}; \theta^1_{t-1})$ as the "expected potential gain" for $\mathbf{x}_{t,i}$ and $r_{t,i} - f_1(\mathbf{x}_{t,i}; \theta^1_{t-1})$ as the "potential gain" for $\mathbf{x}_{t,i}$.
Let $y_{t,i} = r_{t,i} - f_1(\mathbf{x}_{t,i}; \theta^1_{t-1})$ denote the potential gain of arm $\mathbf{x}_{t,i}$ in round $t$. When $y_{t,i} > 0$, the arm has a positive potential gain compared to the estimated reward $f_1(\mathbf{x}_{t,i}; \theta^1_{t-1})$. A large positive $y_{t,i}$ makes the arm more suitable for exploration, whereas a small (or negative) $y_{t,i}$ makes it less suitable. Recall that traditional approaches such as UCB intend to estimate such a potential gain with standard large-deviation tools, e.g., Markov's inequality, Hoeffding bounds, etc.
Instead of calculating a large-deviation based statistical bound for $y_{t,i}$, we use a neural network $f_2(\cdot; \theta^2)$ to represent it, where the input is $\nabla_{\theta^1} f_1(\mathbf{x}_{t,i}; \theta^1_{t-1})$ and the ground truth is $r_{t,i} - f_1(\mathbf{x}_{t,i}; \theta^1_{t-1})$. Adopting the gradient as input is also motivated by the fact that it incorporates two aspects of information: the features of the arm and the discriminative information of $f_1$.
Moreover, in the upper bound of NeuralUCB and the variance of NeuralTS, there is a recursive term, the gradient outer-product matrix built from the past gradients up to round $t-1$, which incorporates the relevant historical information. In EE-Net, by contrast, the dependence on past gradients is carried by the parameters $\theta^2_{t-1}$ of the exploration network, because $\theta^2$ has been trained by gradient descent on those past gradients. Therefore, $f_2(\cdot; \theta^2_{t-1})$ plays a role similar to the recursive term in NeuralUCB/TS, but EE-Net does not (need to) assume a specific functional form over past gradients and is also more memory-efficient.
To sum up, in round $t$, we consider $f_2\big(\nabla_{\theta^1} f_1(\mathbf{x}_{t,i}; \theta^1_{t-1}); \theta^2_{t-1}\big)$ as the "exploration score" of $\mathbf{x}_{t,i}$, because it indicates the potential gain of $\mathbf{x}_{t,i}$ compared to our current exploitation score $f_1(\mathbf{x}_{t,i}; \theta^1_{t-1})$. Therefore, after receiving the reward $r_t$, we can use gradient descent to update $\theta^2$ based on the collected training samples $\big\{\big(\nabla_{\theta^1} f_1(\mathbf{x}_\tau; \theta^1_{\tau-1}),\ r_\tau - f_1(\mathbf{x}_\tau; \theta^1_{\tau-1})\big)\big\}_{\tau=1}^{t}$. We also provide two other heuristic forms for $f_2$'s ground-truth label, $|r_\tau - f_1(\mathbf{x}_\tau; \theta^1_{\tau-1})|$ and $\mathrm{ReLU}\big(r_\tau - f_1(\mathbf{x}_\tau; \theta^1_{\tau-1})\big)$, and compare them in an ablation study in Appendix B.
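To make the data flow of the exploration net concrete, the following sketch (our own illustration; helper names are assumptions) builds one training sample for $f_2$: the input is the flattened, normalized gradient of $f_1$ at an arm, and the label is the residual $r_t - f_1(\mathbf{x}_t; \theta^1_{t-1})$.

```python
import torch

def exploration_sample(f1, x_t, r_t):
    """Build one (input, label) pair for the exploration network f2.

    Input: normalized gradient of f1's scalar output w.r.t. its parameters at x_t.
    Label: residual between the observed reward and f1's estimate (may be negative).
    """
    f1.zero_grad()
    score = f1(x_t.unsqueeze(0)).squeeze()                    # exploitation score f1(x_t)
    score.backward()                                          # populate parameter gradients
    grad = torch.cat([p.grad.flatten() for p in f1.parameters()])
    grad = grad / (grad.norm() + 1e-12)                       # normalized, as in Algorithm 1
    return grad.detach(), torch.tensor(r_t - score.item())
```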
3) Decision-maker. In round $t$, given an arm $\mathbf{x}_{t,i}$, with the computed exploitation score $f_1(\mathbf{x}_{t,i}; \theta^1_{t-1})$ and exploration score $f_2(\cdot; \theta^2_{t-1})$, we use a function $f_3(\cdot; \theta^3)$ to trade off between exploitation and exploration and compute the final score for $\mathbf{x}_{t,i}$. The selection criterion is defined as
$$\mathbf{x}_t = \arg\max_{\mathbf{x}_{t,i},\, i \in [n]} f_3\big(f_1(\mathbf{x}_{t,i}; \theta^1_{t-1}),\ f_2(\cdot; \theta^2_{t-1})\big).$$
Note that $f_3$ can be either a linear or a non-linear function. We provide the following two forms.
(1) Linear function. $f_3$ can be formulated as a linear function with respect to $f_1$ and $f_2$:
$$f_3 = w_1 f_1 + w_2 f_2,$$
where $w_1, w_2$ are two weights preset by the learner. When $w_1 = w_2 = 1$, $f_3$ can be thought of as a UCB-type policy, where the estimated reward and the potential gain are simply added together. We report its empirical performance in the ablation study (Appendix B).
(2) Non-linear function. $f_3$ can also be formulated as a neural network to learn the mapping from $(f_1, f_2)$ to the optimal arm. We transform the bandit problem into a binary classification problem. Given an arm $\mathbf{x}_{t,i}$, we define $p_{t,i}$ as the probability of $\mathbf{x}_{t,i}$ being the optimal arm in round $t$. For brevity, we denote by $p_t$ the probability of the selected arm $\mathbf{x}_t$ being the optimal arm in round $t$. According to different reward distributions, we have the following approaches to determine $p_t$.
1. Binary reward. Suppose $r_t$ is a binary variable over $\{0, 1\}$; it is straightforward to set $p_t = 1$ if $r_t = 1$, and $p_t = 0$ otherwise.
2. Continuous reward. Suppose $r_t$ is a continuous variable over the range $[0, 1]$; we provide two ways to determine $p_t$. (1) $p_t$ can be directly set as $r_t$. (2) The learner can set a threshold $\gamma$; then $p_t = 1$ if $r_t > \gamma$, and $p_t = 0$ otherwise.
Therefore, with the training samples $\big\{\big((f_1(\mathbf{x}_\tau; \theta^1_{\tau-1}), f_2(\cdot; \theta^2_{\tau-1})),\ p_\tau\big)\big\}_{\tau=1}^{t}$ collected up to round $t$, we can conduct gradient descent to update the parameters $\theta^3$ of $f_3$, as sketched below.
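The sketch below illustrates one possible non-linear decision maker and the construction of the label $p_t$ from the observed reward (widths, the sigmoid output, and the threshold are assumptions for illustration).

```python
import torch
import torch.nn as nn

class DecisionMaker(nn.Module):
    """f3: maps the pair (exploitation score, exploration score) to P(arm is optimal)."""
    def __init__(self, width=20):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2, width), nn.ReLU(),
                                 nn.Linear(width, 1), nn.Sigmoid())

    def forward(self, scores):            # scores: (batch, 2) = (f1 score, f2 score)
        return self.net(scores).squeeze(-1)

def label_from_reward(r_t, binary=True, threshold=0.5):
    """p_t for the selected arm: the binary reward itself, or a thresholded continuous reward."""
    if binary:
        return float(r_t)
    return 1.0 if r_t > threshold else 0.0
```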
Table 1 details the working structure of EE-Net. Algorithm 1 depicts the workflow of EE-Net, where the input of $f_2$ is normalized, i.e., $\nabla_{\theta^1} f_1 / \|\nabla_{\theta^1} f_1\|_2$. Algorithm 1 uses gradient descent (GD) to update EE-Net, where drawing the parameters uniformly from their stored historical values is for the sake of analysis. One can easily extend EE-Net to stochastic GD to update the parameters incrementally.
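Putting the pieces together, the loop below is a condensed sketch of one round of the workflow described by Algorithm 1, reusing the helper sketches above (`exploration_sample`, `DecisionMaker`, `label_from_reward`); the batched updates and the uniform draw over historical parameters used in the analysis are omitted, and every name here is an assumption rather than the released code.

```python
import torch

def ee_net_round(f1, f2, f3, arms, play, buffers):
    """One EE-Net round: score every arm, select via f3, observe the reward,
    and store training samples for the three networks."""
    scored = []
    for x in arms:
        grad, _ = exploration_sample(f1, x, r_t=0.0)       # gradient input; label unused here
        s1 = f1(x.unsqueeze(0)).squeeze().item()           # exploitation score
        s2 = f2(grad.unsqueeze(0)).squeeze().item()        # exploration score
        s3 = f3(torch.tensor([[s1, s2]])).item()           # final score from the decision maker
        scored.append((s3, s1, s2, grad))
    i_t = max(range(len(arms)), key=lambda i: scored[i][0])
    r_t = play(arms[i_t])                                  # play the arm, observe the reward
    _, s1, s2, grad = scored[i_t]
    buffers["f1"].append((arms[i_t], r_t))                 # (arm, reward)
    buffers["f2"].append((grad, r_t - s1))                 # (gradient, potential gain)
    buffers["f3"].append(([s1, s2], label_from_reward(r_t)))   # (scores, p_t)
    return i_t, r_t
```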
Remark 4.1 (Network structure).
The three networks can have different structures for different applications. For example, in vision tasks, $f_1$ can be set up with convolutional layers (LeCun et al., 1995). For the exploration network $f_2$, the input $\nabla_{\theta^1} f_1$ may have an exploding dimension when the exploitation network becomes wide and deep, which may cause a huge computational cost for $f_2$. To address this challenge, we can apply dimensionality reduction techniques to obtain low-dimensional representations of $\nabla_{\theta^1} f_1$. In the experiments, we use the method of Roweis and Saul (2000) to acquire a low-dimensional vector of $\nabla_{\theta^1} f_1$ and achieve the best performance among all baselines. ∎
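For instance, the locally linear embedding of Roweis and Saul (2000) is available in scikit-learn; the sketch below reduces a collection of flattened $f_1$-gradients to a low-dimensional input for $f_2$. The gradient dimension, sample count, and target dimension are placeholders, since the exact values used in the experiments are not restated here.

```python
import numpy as np
from sklearn.manifold import LocallyLinearEmbedding

grads = np.random.randn(200, 5000)            # placeholder: one flattened f1-gradient per sample
lle = LocallyLinearEmbedding(n_components=10, n_neighbors=12)
low_dim = lle.fit_transform(grads)            # (200, 10): reduced inputs for the exploration net f2
new_grad = np.random.randn(1, 5000)
low_dim_new = lle.transform(new_grad)         # embed the gradient of a newly presented arm
```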
Remark 4.2 (Exploration direction).
EE-Net has the ability to determine the exploration direction. Given an arm $\mathbf{x}_{t,i}$, when the estimate $f_1(\mathbf{x}_{t,i})$ is lower than the expected reward $h(\mathbf{x}_{t,i})$, the learner should make "upward" exploration, i.e., increase the chance of $\mathbf{x}_{t,i}$ being explored; when $f_1(\mathbf{x}_{t,i})$ is higher than $h(\mathbf{x}_{t,i})$, the learner should make "downward" exploration, i.e., decrease the chance of $\mathbf{x}_{t,i}$ being explored. EE-Net uses the neural network $f_2$ to learn $h(\mathbf{x}_{t,i}) - f_1(\mathbf{x}_{t,i})$ (which can take positive and negative values) and thus can determine the exploration direction. In contrast, NeuralUCB always makes "upward" exploration, and NeuralTS randomly chooses between "upward" and "downward" exploration (see the selection criteria in Table 2 and more details in Appendix D). ∎
Remark 4.3 (Space complexity).
NeuralUCB and NeuralTS have to maintain the gradient outer-product matrix (e.g., $\sum_{\tau} \nabla f_1(\mathbf{x}_\tau) \nabla f_1(\mathbf{x}_\tau)^\top$) and, for a network with $p$ parameters, require a space complexity of $\mathcal{O}(p^2)$ to store it. On the contrary, EE-Net does not maintain such a matrix and only regards $\nabla f_1$ as the input of $f_2$. Thus, EE-Net reduces the space complexity from $\mathcal{O}(p^2)$ to $\mathcal{O}(p)$. ∎
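A quick back-of-the-envelope comparison illustrates the gap (illustrative numbers only; in practice NeuralUCB/TS implementations often fall back to a diagonal approximation, as noted in Appendix A):

```python
p = 784 * 100 + 100 + 100 * 1 + 1      # parameters of a 2-layer, 100-width network on 784-dim input
bytes_matrix = p * p * 8               # float64 gradient outer-product matrix (NeuralUCB/TS)
bytes_vector = p * 8                   # a single gradient vector, the input of f2 (EE-Net)
print(f"p = {p}: matrix needs {bytes_matrix / 1e9:.1f} GB, vector needs {bytes_vector / 1e6:.2f} MB")
```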
5 Regret Analysis
In this section, we provide the regret analysis of EE-Net when $f_3$ is set as the linear function $f_3 = f_1 + f_2$, which can be thought of as a UCB-type trade-off between exploitation and exploration. For the sake of simplicity, we conduct the regret analysis on some unknown but fixed data distribution $\mathcal{D}$. In each round $t$, the $n$ arms are drawn i.i.d. from $\mathcal{D}$. This is a standard distribution assumption for over-parameterized neural networks (Cao and Gu, 2019). Then, for the analysis, we make the following assumption, which is a standard input assumption in neural bandits and over-parameterized neural networks (Zhou et al., 2020; Allen-Zhu et al., 2019).
Assumption 5.1 ($\rho$-Separability).
For any $t \in [T]$ and $i \in [n]$, $\|\mathbf{x}_{t,i}\|_2 = 1$ and $r_{t,i} \in [0, 1]$. Then, for every pair $\mathbf{x}_{t,i}$ and $\mathbf{x}_{t',i'}$ with $t, t' \in [T]$, $i, i' \in [n]$, and $(t, i) \neq (t', i')$, we have $\|\mathbf{x}_{t,i} - \mathbf{x}_{t',i'}\|_2 > \rho$; moreover, suppose there exists an operator $\Phi$ such that the corresponding inputs of the exploration network, $\Phi\big(\nabla_{\theta^1} f_1(\mathbf{x}_{t,i}; \theta^1)\big)$, are also unit-norm and $\rho$-separable.
For example, such an operator can be built from a simple transformation of the input, e.g., $\Phi(\mathbf{x}) = (\mathbf{x}^\top/\sqrt{2},\ \mathbf{x}^\top/\sqrt{2})^\top$, as is commonly used in neural bandits (Zhou et al., 2020). The analysis will focus on over-parameterized neural networks (Jacot et al., 2018; Du et al., 2019; Allen-Zhu et al., 2019). Given an input $\mathbf{x} \in \mathbb{R}^d$, without loss of generality, we define the fully-connected network with depth $L$ and width $m$:
$$f(\mathbf{x}; \theta) := \mathbf{W}_L \,\sigma\big(\mathbf{W}_{L-1}\, \sigma\big(\cdots \sigma(\mathbf{W}_1 \mathbf{x})\big)\big), \qquad (5.1)$$

where $\sigma$ is the ReLU activation function, $\mathbf{W}_1 \in \mathbb{R}^{m \times d}$, $\mathbf{W}_l \in \mathbb{R}^{m \times m}$ for $2 \leq l \leq L-1$, $\mathbf{W}_L \in \mathbb{R}^{1 \times m}$, and $\theta = \big[\mathrm{vec}(\mathbf{W}_1)^\top, \dots, \mathrm{vec}(\mathbf{W}_L)^\top\big]^\top$.
Initialization. For any $l \in [L-1]$, each entry of $\mathbf{W}_l$ is drawn from the normal distribution $N(0, 2/m)$, and each entry of $\mathbf{W}_L$ is drawn from the normal distribution $N(0, 1/m)$. Note that EE-Net has at most three networks, $f_1$, $f_2$, and $f_3$. We define them following the definition of $f$ in Eq. (5.1) for brevity, although they may have different depths or widths. Then, we have the following theorem for EE-Net. Recall that $\eta_1, \eta_2$ are the learning rates for $f_1, f_2$; $K_1$ is the number of iterations of gradient descent for $f_1$ in each round; and $K_2$ is the number of iterations for $f_2$.
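The sketch below instantiates such a network with the stated Gaussian initialization (hidden-layer entries from $N(0, 2/m)$, last-layer entries from $N(0, 1/m)$); biases are omitted to match the formulation in Eq. (5.1), and the width and depth are placeholders.

```python
import torch.nn as nn

def make_wide_relu_net(d, m=2048, L=3):
    """Depth-L, width-m fully-connected ReLU network f(x; theta) without biases,
    initialized as in the analysis: hidden entries ~ N(0, 2/m), last layer ~ N(0, 1/m)."""
    layers, sizes = [], [d] + [m] * (L - 1) + [1]
    for l in range(L):
        lin = nn.Linear(sizes[l], sizes[l + 1], bias=False)
        std = (1.0 / m) ** 0.5 if l == L - 1 else (2.0 / m) ** 0.5
        nn.init.normal_(lin.weight, mean=0.0, std=std)
        layers.append(lin)
        if l < L - 1:
            layers.append(nn.ReLU())
    return nn.Sequential(*layers)
```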
Theorem 1.
Comparison with existing works. Under similar assumptions on over-parameterized neural networks, the regret bounds of NeuralUCB (Zhou et al., 2020) and NeuralTS (Zhang et al., 2021) are both of order $\tilde{\mathcal{O}}\big(\tilde{d}\sqrt{T}\big)$,
where the effective dimension $\tilde{d}$ is defined through the neural tangent kernel (NTK) matrix $\mathbf{H}$ (Jacot et al., 2018; Arora et al., 2019) and a regularization parameter $\lambda$. Similarly, in linear contextual bandits, Abbasi-Yadkori et al. (2011) achieve $\tilde{\mathcal{O}}(d\sqrt{T})$ and Li et al. (2017) achieve $\tilde{\mathcal{O}}(\sqrt{dT})$.
Remark 5.1.
Compared to NeuralUCB/TS, EE-Net roughly improves the regret by a multiplicative factor of $\sqrt{\log T}$ and removes the dependence on the effective dimension, because our proof of EE-Net is directly built on recent advances in the convergence theory (Allen-Zhu et al., 2019) and generalization bounds (Cao and Gu, 2019) of over-parameterized neural networks. Instead, the analysis of NeuralUCB/TS contains three parts of approximation error, obtained by bounding the distances between the expected reward and ridge regression, between ridge regression and the NTK, and between the NTK and the network function. ∎
Remark 5.2.
The regret bound of EE-Net does not depend on the effective dimension $\tilde{d}$ or the input dimension $d$. $\tilde{d}$ or $d$ may cause a significant error term when the determinant of $\mathbf{H}$ is extremely large or the input dimension is huge. ∎
The proof of Theorem 1 is in Appendix C and is mainly based on the following generalization bound, which results from an online-to-batch conversion combined with convergence guarantees for deep-learning optimization.
Lemma 5.1.
For any $\delta \in (0, 1)$, suppose that $m$, $\eta_1$, and $\eta_2$ satisfy the conditions in Eq. (5.2). In round $t$, let
$$\mathbf{x}_t = \arg\max_{\mathbf{x}_{t,i},\, i \in [n]} \big(f_1(\mathbf{x}_{t,i}; \theta^1_{t-1}) + f_2(\cdot; \theta^2_{t-1})\big)$$
and let $r_t$ be the corresponding reward, given $\mathbf{x}_{t,i} \sim \mathcal{D}$. Then, with probability at least $1 - \delta$ over the randomness of the initialization, it holds that
(5.4)
where the expectation is also taken over the parameters $(\theta^1_{t-1}, \theta^2_{t-1})$, which are uniformly drawn from the stored historical parameters.
Remark 5.3.
Lemma 5.1 provides a generalization bound with a fixed $\mathcal{O}(\sqrt{1/t})$ rate for the exploitation-exploration networks, in contrast with the relative bound w.r.t. the Neural Tangent Random Feature (NTRF) benchmark (Cao and Gu, 2019). We achieve this by working in the regression rather than the classification setting and by utilizing the convergence guarantees for the square loss (Allen-Zhu et al., 2019). Note that the bound in Lemma 5.1 holds for bounded (possibly random) rewards instead of a fixed function as in the conventional classification setting.
6 Experiments
In this section, we evaluate EE-Net on four real-world datasets, comparing it with strong state-of-the-art baselines. We first present the setup of the experiments, then show the regret comparison, and finally report the ablation study. The code is available at https://github.com/banyikun/EE-Net-ICLR-2022.
We use four real-world datasets: Mnist, Yelp, Movielens, and Disin, the details and settings of which are attached in Appendix A.

Baselines. To comprehensively evaluate EE-Net, we choose three neural-based bandit algorithms, one linear bandit algorithm, and one kernelized bandit algorithm.
1. LinUCB (Li et al., 2010) explicitly assumes the reward is a linear function of the arm vector and an unknown user parameter, and then applies ridge regression and an upper confidence bound to determine the selected arm.
2. KernelUCB (Valko et al., 2013) adopts a predefined kernel matrix on the reward space combined with a UCB-based exploration strategy.
3. Neural-Epsilon adapts the epsilon-greedy exploration strategy to the exploitation network $f_1$, i.e., with probability $1-\epsilon$ the arm is selected by $f_1$, and with probability $\epsilon$ the arm is chosen randomly.
4. NeuralUCB (Zhou et al., 2020) uses the exploitation network $f_1$ to learn the reward function together with a UCB-based exploration strategy.
5. NeuralTS (Zhang et al., 2021) adopts the exploitation network $f_1$ to learn the reward function together with a Thompson Sampling exploration strategy.
Note that we do not report the results of LinTS and KernelTS in the experiments because of the limited space in the figures, but LinTS and KernelTS have been significantly outperformed by NeuralTS (Zhang et al., 2021).
Setup for EE-Net. For a fair comparison, for all the neural-based methods including EE-Net, the exploitation network $f_1$ is a 2-layer fully-connected network of width 100. For the exploration network $f_2$, we use a 2-layer fully-connected network of width 100 as well. For the decision maker $f_3$, after comprehensively evaluating both linear and nonlinear functions, we found that the most effective approach is to combine them, which we call the "hybrid decision maker". In detail, for the early rounds of each run, $f_3$ is set as the linear function $f_1 + f_2$, and afterwards $f_3$ is set as a neural network with two 20-width fully-connected layers. This setting is used because the linear decision maker maintains stable performance in each run (robustness), while the non-linear decision maker can further improve the performance (see details in Appendix B). The hybrid decision maker combines these two advantages. The configurations of all methods are given in Appendix A.
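A minimal sketch of the hybrid decision maker described above (the switching round is written as a parameter because it is a tuning choice):

```python
import torch

def hybrid_score(t, s1, s2, f3_net, switch_round=500):
    """Linear trade-off f1 + f2 in the early rounds, learned non-linear f3 afterwards."""
    if t < switch_round:
        return s1 + s2                                    # UCB-style additive score
    return f3_net(torch.tensor([[s1, s2]])).item()        # learned combination of the two scores
```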

Results. Figures 2 and 3 show the regret comparison on the four datasets. EE-Net consistently outperforms all baselines across all datasets. For LinUCB and KernelUCB, the simple linear reward function or the predefined kernel cannot properly capture the ground-truth reward functions of real-world datasets. In particular, on the Mnist and Disin datasets, the correlations between rewards and arm feature vectors are not linear or given by simple mappings. Thus, LinUCB and KernelUCB barely exploit the past collected data samples and fail to select the correct arms. Among the neural-based bandit algorithms, the exploration probability of Neural-Epsilon is fixed and hard to adjust, so it usually struggles to explore effectively. To explore, NeuralUCB statistically calculates a gradient-based upper confidence bound and NeuralTS draws each arm's predicted reward from a normal distribution whose standard deviation is computed from the gradient. However, the confidence bound or standard deviation they calculate only covers the worst case and thus may not represent the actual potential of each arm, and they cannot make "upward" and "downward" exploration properly. Instead, EE-Net uses a neural network to learn each arm's potential, exploiting the network's representation power, and therefore outperforms these two state-of-the-art bandit algorithms. Note that NeuralUCB/TS needs additional parameters to tune the UCB/TS exploration for different scenarios, while EE-Net only needs to set up a neural network $f_2$ and learns the exploration automatically.
Ablation Study. In Appendix B, we conduct an ablation study regarding the label function of $f_2$ and the different settings of $f_3$.
7 Conclusion
In this paper, we propose a novel exploration strategy, EE-Net. In addition to a neural network that exploits the data collected in past rounds, EE-Net has another neural network that learns the potential gain compared to the current estimate for exploration. Then, a decision maker is built to further trade off between exploitation and exploration when making selections. We demonstrate that EE-Net outperforms NeuralUCB and NeuralTS both theoretically and empirically, making it a new state-of-the-art exploration policy.
Acknowledgements: We are grateful to Shiliang Zuo and Yunzhe Qi for the valuable discussions in the revisions of EE-Net. This research work is supported by National Science Foundation under Awards No.Β IIS-1947203, IIS-2002540, IIS-2137468, IIS-1908104, OAC-1934634, and DBI-2021898, and a grant from C3.ai. The views and conclusions are those of the authors and should not be interpreted as representing the official policies of the funding agencies or the government.
References
- Abbasi-Yadkori etΒ al. [2011] Y.Β Abbasi-Yadkori, D.Β PΓ‘l, and C.Β SzepesvΓ‘ri. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pages 2312β2320, 2011.
- Abeille and Lazaric [2017] M.Β Abeille and A.Β Lazaric. Linear thompson sampling revisited. In Artificial Intelligence and Statistics, pages 176β184. PMLR, 2017.
- Agrawal and Goyal [2013] S.Β Agrawal and N.Β Goyal. Thompson sampling for contextual bandits with linear payoffs. In International Conference on Machine Learning, pages 127β135. PMLR, 2013.
- Ahmed etΒ al. [2018] H.Β Ahmed, I.Β Traore, and S.Β Saad. Detecting opinion spams and fake news using text classification. Security and Privacy, 1(1):e9, 2018.
- Allen-Zhu etΒ al. [2019] Z.Β Allen-Zhu, Y.Β Li, and Z.Β Song. A convergence theory for deep learning via over-parameterization. In International Conference on Machine Learning, pages 242β252. PMLR, 2019.
- Arora etΒ al. [2019] S.Β Arora, S.Β S. Du, W.Β Hu, Z.Β Li, R.Β R. Salakhutdinov, and R.Β Wang. On exact computation with an infinitely wide neural net. In Advances in Neural Information Processing Systems, pages 8141β8150, 2019.
- Auer [2002] P.Β Auer. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3(Nov):397β422, 2002.
- Ban and He [2020] Y.Β Ban and J.Β He. Generic outlier detection in multi-armed bandit. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 913β923, 2020.
- Ban and He [2021a] Y.Β Ban and J.Β He. Convolutional neural bandit: Provable algorithm for visual-aware advertising. arXiv preprint arXiv:2107.07438, 2021a.
- Ban and He [2021b] Y.Β Ban and J.Β He. Local clustering in contextual multi-armed bandits. In Proceedings of the Web Conference 2021, pages 2335β2346, 2021b.
- Ban etΒ al. [2021] Y.Β Ban, J.Β He, and C.Β B. Cook. Multi-facet contextual bandits: A neural network perspective. In The 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Virtual Event, Singapore, August 14-18, 2021, pages 35β45, 2021.
- Bubeck etΒ al. [2011] S.Β Bubeck, R.Β Munos, G.Β Stoltz, and C.Β SzepesvΓ‘ri. X-armed bandits. Journal of Machine Learning Research, 12(5), 2011.
- Cao and Gu [2019] Y.Β Cao and Q.Β Gu. Generalization bounds of stochastic gradient descent for wide and deep neural networks. Advances in Neural Information Processing Systems, 32:10836β10846, 2019.
- Cesa-Bianchi etΒ al. [2001] N.Β Cesa-Bianchi, A.Β Conconi, and C.Β Gentile. On the generalization ability of on-line learning algorithms. Advances in neural information processing systems, 14, 2001.
- Chlebus [2009] E.Β Chlebus. An approximate formula for a partial sum of the divergent p-series. Applied Mathematics Letters, 22(5):732β737, 2009.
- Chu etΒ al. [2011] W.Β Chu, L.Β Li, L.Β Reyzin, and R.Β Schapire. Contextual bandits with linear payoff functions. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 208β214, 2011.
- Dani etΒ al. [2008] V.Β Dani, T.Β P. Hayes, and S.Β M. Kakade. Stochastic linear optimization under bandit feedback. 2008.
- Du etΒ al. [2019] S.Β Du, J.Β Lee, H.Β Li, L.Β Wang, and X.Β Zhai. Gradient descent finds global minima of deep neural networks. In International Conference on Machine Learning, pages 1675β1685. PMLR, 2019.
- Filippi etΒ al. [2010] S.Β Filippi, O.Β Cappe, A.Β Garivier, and C.Β SzepesvΓ‘ri. Parametric bandits: The generalized linear case. In Advances in Neural Information Processing Systems, pages 586β594, 2010.
- Fu and He [2021] D.Β Fu and J.Β He. SDG: A simplified and dynamic graph neural network. In SIGIR β21: The 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual Event, Canada, July 11-15, 2021, pages 2273β2277. ACM, 2021.
- Harper and Konstan [2015] F.Β M. Harper and J.Β A. Konstan. The movielens datasets: History and context. Acm transactions on interactive intelligent systems (tiis), 5(4):1β19, 2015.
- Jacot etΒ al. [2018] A.Β Jacot, F.Β Gabriel, and C.Β Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in neural information processing systems, pages 8571β8580, 2018.
- Langford and Zhang [2008] J.Β Langford and T.Β Zhang. The epoch-greedy algorithm for multi-armed bandits with side information. In Advances in neural information processing systems, pages 817β824, 2008.
- Lattimore and SzepesvΓ‘ri [2020] T.Β Lattimore and C.Β SzepesvΓ‘ri. Bandit algorithms. Cambridge University Press, 2020.
- LeCun etΒ al. [1995] Y.Β LeCun, Y.Β Bengio, etΒ al. Convolutional networks for images, speech, and time series. The handbook of brain theory and neural networks, 3361(10):1995, 1995.
- LeCun etΒ al. [1998] Y.Β LeCun, L.Β Bottou, Y.Β Bengio, and P.Β Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278β2324, 1998.
- Li etΒ al. [2010] L.Β Li, W.Β Chu, J.Β Langford, and R.Β E. Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th international conference on World wide web, pages 661β670, 2010.
- Li etΒ al. [2017] L.Β Li, Y.Β Lu, and D.Β Zhou. Provably optimal algorithms for generalized linear contextual bandits. In International Conference on Machine Learning, pages 2071β2080. PMLR, 2017.
- Li etΒ al. [2016] S.Β Li, A.Β Karatzoglou, and C.Β Gentile. Collaborative filtering bandits. In Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval, pages 539β548, 2016.
- Lu and VanΒ Roy [2017] X.Β Lu and B.Β VanΒ Roy. Ensemble sampling. arXiv preprint arXiv:1705.07347, 2017.
- Riquelme etΒ al. [2018] C.Β Riquelme, G.Β Tucker, and J.Β Snoek. Deep bayesian bandits showdown: An empirical comparison of bayesian deep networks for thompson sampling. arXiv preprint arXiv:1802.09127, 2018.
- Roweis and Saul [2000] S.Β T. Roweis and L.Β K. Saul. Nonlinear dimensionality reduction by locally linear embedding. science, 290(5500):2323β2326, 2000.
- Thompson [1933] W.Β R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285β294, 1933.
- Valko etΒ al. [2013] M.Β Valko, N.Β Korda, R.Β Munos, I.Β Flaounas, and N.Β Cristianini. Finite-time analysis of kernelised contextual bandits. arXiv preprint arXiv:1309.6869, 2013.
- Wu etΒ al. [2016] Q.Β Wu, H.Β Wang, Q.Β Gu, and H.Β Wang. Contextual bandits in a collaborative environment. In Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval, pages 529β538, 2016.
- Xu etΒ al. [2020] P.Β Xu, Z.Β Wen, H.Β Zhao, and Q.Β Gu. Neural contextual bandits with deep representation and shallow exploration. arXiv preprint arXiv:2012.01780, 2020.
- Zhang etΒ al. [2021] W.Β Zhang, D.Β Zhou, L.Β Li, and Q.Β Gu. Neural thompson sampling. In International Conference on Learning Representations, 2021.
- Zhou etΒ al. [2020] D.Β Zhou, L.Β Li, and Q.Β Gu. Neural contextual bandits with ucb-based exploration. In International Conference on Machine Learning, pages 11492β11502. PMLR, 2020.
Appendix A Datasets and Setup
MNIST dataset. MNIST is a well-known image dataset [LeCun et al., 1998] for the 10-class classification problem. Following the evaluation settings of existing works [Valko et al., 2013, Zhou et al., 2020, Zhang et al., 2021], we transform this classification problem into a bandit problem. Given an image $\mathbf{x}$, we aim to classify it into one of the 10 classes. In each round, the image $\mathbf{x}$ is transformed into 10 arms presented to the learner, represented by the vectors $\mathbf{x}_1 = (\mathbf{x}, \mathbf{0}, \dots, \mathbf{0}), \mathbf{x}_2 = (\mathbf{0}, \mathbf{x}, \dots, \mathbf{0}), \dots, \mathbf{x}_{10} = (\mathbf{0}, \dots, \mathbf{0}, \mathbf{x})$ in sequence. The reward is 1 if the index of the selected arm matches the index of $\mathbf{x}$'s ground-truth class; otherwise, the reward is 0.
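One common way to implement this transformation (a sketch consistent with the description above; the released code may differ in details) places the flattened image into the k-th block of a 10·d-dimensional zero vector, producing one arm per candidate class:

```python
import numpy as np

def image_to_arms(image, n_classes=10):
    """Turn one flattened image of size d into n_classes arm vectors of size n_classes*d:
    arm k carries the image in block k and zeros elsewhere."""
    d = image.size
    arms = np.zeros((n_classes, n_classes * d))
    for k in range(n_classes):
        arms[k, k * d:(k + 1) * d] = image
    return arms

def reward(selected_k, true_label):
    """1 if the chosen arm's index matches the ground-truth class, 0 otherwise."""
    return 1.0 if selected_k == true_label else 0.0
```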
Yelp (https://www.yelp.com/dataset) and Movielens [Harper and Konstan, 2015] datasets. Yelp is a dataset released in the Yelp dataset challenge, which consists of 4.7 million rating entries for restaurants from millions of users. MovieLens is a dataset consisting of millions of ratings between users and movies. We build the rating matrix by choosing the top users and top restaurants (movies) and use singular-value decomposition (SVD) to extract a low-dimensional feature vector for each user and each restaurant (movie). In these two datasets, the bandit algorithm is used to identify the restaurants (movies) with bad ratings. We generate the reward from the number of stars the restaurant (movie) received from the user: in each rating record, if the user scores a restaurant (movie) with less than 2 stars (out of 5), its reward is 1; otherwise, its reward is 0. In each round, we build the arm pool as follows: we randomly choose one arm with reward 1 and randomly pick the other restaurants (movies) with reward 0; the representation of each arm is then the concatenation of the corresponding user feature vector and restaurant (movie) feature vector.
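A sketch of this feature construction (matrix sizes and the factor dimension are placeholders): factorize the user-item rating matrix with a truncated SVD and concatenate the user and item factors to form an arm vector.

```python
import numpy as np

rng = np.random.default_rng(0)
ratings = rng.integers(0, 6, size=(2000, 1000)).astype(float)   # placeholder user-item star matrix

U, S, Vt = np.linalg.svd(ratings, full_matrices=False)           # truncated below to rank k
k = 10
user_feat = U[:, :k] * np.sqrt(S[:k])        # (n_users, k) user feature vectors
item_feat = Vt[:k, :].T * np.sqrt(S[:k])     # (n_items, k) restaurant/movie feature vectors

def arm_vector(u, i):
    """Arm = concatenation of the user and the restaurant/movie feature vectors."""
    return np.concatenate([user_feat[u], item_feat[i]])
```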
Disin [Ahmed et al., 2018] dataset. Disin is a fake news dataset on Kaggle (https://www.kaggle.com/clmentbisaillon/fake-and-real-news-dataset) including 12600 fake news articles and 12600 truthful news articles, where each article is represented by its text. To transform the text into vectors, we use the approach of [Fu and He, 2021] to represent each article by a 300-dimensional vector. Similarly, we form a 10-arm pool in each round, where 9 real news articles and 1 fake news article are randomly selected. If the fake news article is selected, the reward is 1; otherwise, the reward is 0.
Configurations. For LinUCB, following [Li et al., 2010], we do a grid search for the exploration constant that tunes the scale of the UCB. For KernelUCB [Valko et al., 2013], we use the radial basis function kernel and stop adding contexts after 1000 rounds, following [Valko et al., 2013, Zhou et al., 2020], and we grid-search its regularization parameter and exploration parameter. For NeuralUCB and NeuralTS, following the settings of [Zhou et al., 2020, Zhang et al., 2021], we use the same exploitation network $f_1$ and conduct a grid search over the exploration parameter and the regularization parameter. For NeuralEpsilon, we use the same neural network and do a grid search over the exploration probability $\epsilon$. For the neural bandits NeuralUCB/TS, following their settings, since storing and computing the whole gradient outer-product matrix is computationally expensive, we use a diagonal matrix approximation. For all neural networks, we conduct a grid search over the learning rate. For all grid-searched parameters, we choose the best value for the comparison and report the averaged results over all runs for each method.
Creating exploration samples for $f_2$. When the selected arm is not optimal in a round, the optimal arm must exist among the remaining arms, and thus exploration should be encouraged for the remaining arms. Based on this fact, we create additional samples for the exploration network $f_2$ in practice. For example, in the setting of binary (0 or 1) rewards, if the received reward of the selected arm $\mathbf{x}_t$ is 0, we add one new training sample for $f_2$ for each remaining arm of the round, whose label is a small positive constant. This measure further improves the performance of EE-Net in our experiments.
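A sketch of this heuristic in the binary-reward setting, reusing the `exploration_sample` helper from Section 4; the pseudo-label constant `c` is an assumption, as the text only states that it is small.

```python
def add_exploration_samples(buffers, f1, arms, selected_idx, r_t, c=0.1):
    """If the selected arm earned reward 0, the optimal arm is among the rest,
    so give every non-selected arm a small positive exploration pseudo-label."""
    if r_t == 0:
        for i, x in enumerate(arms):
            if i == selected_idx:
                continue
            grad, _ = exploration_sample(f1, x, r_t=0.0)   # gradient input for f2
            buffers["f2"].append((grad, c))                # pseudo potential gain c
```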
Appendix B Ablation Study


In this section, we conduct an ablation study regarding the label function for the exploration network $f_2$ and the setting of the decision maker $f_3$ on two representative datasets, Movielens and Mnist.
Label function. In this paper, we use $r_t - f_1(\mathbf{x}_t; \theta^1)$, the potential gain of an arm, as the label of $f_2$. Moreover, we provide two other intuitive forms: $|r_t - f_1(\mathbf{x}_t; \theta^1)|$ and $\mathrm{ReLU}(r_t - f_1(\mathbf{x}_t; \theta^1))$. Figure 4 shows the regret with different label functions, where "EE-Net" denotes our method with the default label $r_t - f_1$, "EE-Net-abs" represents the one with $|r_t - f_1|$, and "EE-Net-ReLU" the one with $\mathrm{ReLU}(r_t - f_1)$. On the Movielens and Mnist datasets, EE-Net slightly outperforms EE-Net-abs and EE-Net-ReLU. In fact, $r_t - f_1$ can effectively represent both positive and negative potential gains, so that $f_2$ tends to score arms with positive gain higher and arms with negative gain lower. However, $|r_t - f_1|$ treats positive and negative potential gains evenly, weakening the discriminative ability, and $\mathrm{ReLU}(r_t - f_1)$ can recognize the positive gain but neglects the differences among negative gains. Therefore, $r_t - f_1$ is usually the most effective label empirically.
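The three candidate labels compared here can be written compactly (a trivial sketch):

```python
def f2_label(r_t, s1, variant="signed"):
    """Candidate ground-truth labels for the exploration network f2, given reward r_t
    and exploitation score s1 = f1(x_t; theta^1)."""
    residual = r_t - s1                      # potential gain, may be negative
    if variant == "signed":                  # default EE-Net label
        return residual
    if variant == "abs":                     # EE-Net-abs
        return abs(residual)
    return max(residual, 0.0)                # EE-Net-ReLU
```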
Setting of $f_3$. $f_3$ can be set as either a linear or a non-linear function. In the experiments, we test the simple linear function $f_3 = f_1 + f_2$, denoted by "EE-Net-Lin", and a non-linear function represented by a 2-layer 20-width fully-connected neural network, denoted by "EE-Net-NoLin". For the default hybrid setting, denoted by "EE-Net", $f_3 = f_1 + f_2$ in the early rounds and is the neural network afterwards. Figure 5 reports the regret of these three modes. EE-Net achieves the best performance with a small standard deviation. In contrast, EE-Net-NoLin obtains the worst performance and the largest standard deviation. Notice, however, that EE-Net-NoLin can achieve the best performance in certain runs (the green shaded region) but is erratic: in the beginning phase, without enough training samples, EE-Net-NoLin strongly relies on the quality of the collected samples. With appropriate training samples, gradient descent can lead to the global optimum, but with misleading training samples it can deviate from the global optimum, so EE-Net-NoLin shows very unstable performance. In contrast, EE-Net-Lin is inspired by the UCB strategy, i.e., exploitation plus exploration, and exhibits stable performance. To combine their advantages, we propose the hybrid approach, EE-Net, which achieves the best performance with strong stability.
Appendix C Proof of Theorem 1
In this section, we provide the proof of Theorem 1 and related lemmas.
Proof.
For brevity, for the selected arm in round , let be its expected reward and be the optimal arm in round . Let
Note that , for each . Then, the expected regret of round is given by
(C.1) | ||||
where is because and and introduces the additional parameters which will be suitably chosen.
Because, for each , , applying Lemma C.1 and Corollary C.1, with probability at least over the randomness of initialization, for , we have
(C.2) |
where
(C.3) |
and we apply the union bound over round to make Lemma C.1 and Corollary C.1 hold for each round .
For , based on Lemma C.2, with probability at least , we have
(C.4) |
To sum up, with probability at least , we have
(C.5) |
Then expected regret of rounds is computed by
(C.6) | ||||
where is because [Chlebus, 2009] and the bound of is due to the choice of , i.e., since and , can be chosen so that .
Then, when , we have
(C.7) |
As the choice of , we have . Therefore, we have
(C.8) |
The proof is completed. β
Lemma C.1.
Proof.
In this proof, we consider the data collected up to round $t$ as the training dataset and then obtain a generalization bound for it, inspired by Cao and Gu [2019].
For convenience, we use , noting that the same analysis holds for each . Consider the exploration network , applying Lemma C.3. With probability at least , for any , we have
(C.10) |
Similarly, applying Lemma C.3 again, with probability at least , for any , we have
(C.11) |
Because for any , , with Eq. (C.10) and (C.11), applying union bound, with probability at least over the random initialization, we have
(C.12) |
Noting that (C.12) is for for a specific . By union bound, (C.12) holds with probability at least . For brevity, let represent .
Recall that, for each , and are the parameters training on according to Algorithm 1. In round , let given . Let be the corresponding reward. Let be shadow samples from the same distribution and let , with being the corresponding reward. Then, we define
(C.13) | ||||
Then, as , based on the definition of we have
(C.14) | ||||
where denotes the -algebra generated by the history .
Moreover, we have
Since is a martingale difference sequence, inspired by Lemma 1 in [Cesa-Bianchi etΒ al., 2001], applying the Hoeffding-Azuma inequality, with probability at least , we have
(C.15) | ||||
According to Algorithm 1, is uniformly drawn from . Thus, with probability , we have
(C.16) | ||||
For , according to Lemma C.6, for any satisfying , with probability , we have
(C.17) | ||||
For , according to Lemma C.4 (1), there exists satisfying , with probability , such that
(C.18) | ||||
where follows by a direct application of Lemma C.4 (1) by defining the loss .
Combining Eq.(C.16), Eq.(C.17) and Eq.(C.18), with probability we have
(C.19) | ||||
where the expectation over that is uniformly drawn from .
Then, applying union bound to and rescaling the complete the proof. β
Corollary C.1.
For any , suppose satisfy the conditions in Eq. (5.2) and . For any , let
and is the corresponding reward, given . Then, with probability at least over the random of the initialization, there exist , and , such that
(C.20) | ||||
where the expectation is also taken over that are uniformly drawn from .
Proof.
This a direct corollary of Lemma C.1, given the optimal historical pairs . For brevity, let represent .
Suppose that, for each , and are the parameters trained on according to Algorithm 1. Note that these pairs are unknown to the algorithm we run, and the parameters are not estimated. However, for the analysis, it is sufficient to show that there exist such parameters so that the conditional expectation of the error can be bounded.
In round , let given . Let be the corresponding reward. Let be shadow samples from the same distribution and let , with being the corresponding reward. Then, we define
(C.21) | ||||
Then, as , we have
(C.22) | ||||
where denotes the -algebra generated by the history .
Therefore, is a martingale difference sequence. Similarly, applying the Hoeffding-Azuma inequality to , with probability , we have
(C.23) | ||||
For , according to Lemma C.6, for any satisfying , with probability , we have
(C.24) | ||||
For , according to Lemma C.4 (1), there exists satisfying , with probability , such that
(C.25) | ||||
where follows by a direct application of Lemma C.4 (1) by defining the loss . Combining above inequalities, with probability we have
(C.26) | ||||
Then, applying union bound to and rescaling the complete the proof.
β
Lemma C.2.
Given , suppose satisfy the conditions in Eq. (5.2). Then, with probability at least , in each round , for any , we have
(C.27) | ||||
(C.28) | ||||
(C.29) | ||||
Proof.
According to Lemma C.4 (2), . Thus, we have .
First, based on Triangle inequality, for any , we have
(C.30) | ||||
where the last inequality is because of Lemma C.4 (3) and Lemma C.7.
Applying Lemma C.5 (1), for any and , we have
(C.31) | ||||
Similarly, we can use the same way to prove the lemmas for . β
Lemma C.3.
Proof.
Considering an inequality , we have . Let be randomly initialized. Then applying Lemma C.5 (1), for any and , we have
(C.32) | ||||
where: is based on the Lemma C.4 (3); is an application of CauchyβSchwarz inequality; is according to Lemma C.4 (2) and (3) in which can be considered as one step gradient descent; is due to Lemma C.4 (2).
Then, the proof is completed. β
Lemma C.4.
Given a constant , suppose satisfies the conditions in Eq. (5.2), the learning rate , the number of iterations . Then, with probability at least , starting from random initialization ,
-
(1)
(Theorem 1 in [Allen-Zhu etΒ al., 2019]) In round , given the collected data , the loss function is defined as: . Then, there exists satisfying , such that in iterations;
-
(2)
(Theorem 1 in [Allen-Zhu etΒ al., 2019]) For any , it holds uniformly that ;
-
(3)
Following the initialization, given , it holds that
where represents the parameters of after iterations of gradient descent in round .
Proof.
Note that the output dimension in [Allen-Zhu etΒ al., 2019] is removed because the output of network function in this paper always is a scalar. For (1) and (2), the only different setting from [Allen-Zhu etΒ al., 2019] is that the initialization of last layer in this paper while in [Allen-Zhu etΒ al., 2019]. Because and here, the upper bound in [Allen-Zhu etΒ al., 2019] still holds for : with probability at least . Therefore, (1) and (2) still hold for the initialization of this paper.
For (3), based on Lemma 7.1 in Allen-Zhu etΒ al. [2019], we have . Denote by the ReLU function. For any ,
where the inequality is according to Lemma 7.2 in Allen-Zhu etΒ al. [2019]. Therefore, we have . β
Lemma C.5 (Lemma 4.1, [Cao and Gu, 2019]).
For any , if satisfies
then, with probability at least over randomness of , for any , and satisfying and , it holds uniformly that
(C.33) |
Lemma C.6.
For any , suppose
Then, with probability at least , setting for algorithm 1, for any satisfying , it holds that
Proof.
This is a direct application of Lemma 4.3 in [Cao and Gu, 2019] by setting , and , where is some small enough absolute constant. We set . Based on Lemma C.4 (2), for any , we have
Then, according to Lemma 4.3 in [Cao and Gu, 2019], then, for any satisfying , there exist a small enough absolute constant , such that
(C.34) |
Then, replacing completes the proof. β
Lemma C.7 (Theorem 5, Allen-Zhu etΒ al. [2019]).
For any , if satisfies that
(C.35) |
then, with probability at least , for all , we have
(C.36) |
Appendix D Motivation of Exploration Network
Table 2: Selection criteria of the compared methods.

Methods | Selection Criterion
---|---
Neural Epsilon-greedy | With probability $1 - \epsilon$, select $\mathbf{x}_t = \arg\max_{i \in [n]} f_1(\mathbf{x}_{t,i}; \theta^1)$; otherwise, select an arm randomly.
NeuralTS [Zhang et al., 2021] | For each $\mathbf{x}_{t,i}$, draw $\hat{r}_{t,i}$ from $N\big(f_1(\mathbf{x}_{t,i}; \theta^1), \sigma_{t,i}^2\big)$. Then, select $\mathbf{x}_t = \arg\max_{i \in [n]} \hat{r}_{t,i}$.
NeuralUCB [Zhou et al., 2020] | Select $\mathbf{x}_t = \arg\max_{i \in [n]} \big(f_1(\mathbf{x}_{t,i}; \theta^1) + \mathrm{UCB}_{t,i}\big)$.
EE-Net (Our approach) | For each $\mathbf{x}_{t,i}$, compute $f_1(\mathbf{x}_{t,i}; \theta^1)$ and $f_2(\nabla_{\theta^1} f_1; \theta^2)$ (Exploration Net). Then select $\mathbf{x}_t = \arg\max_{i \in [n]} f_3\big(f_1, f_2; \theta^3\big)$.
In this section, we list one gradient-based UCB from existing works [Ban et al., 2021, Zhou et al., 2020], which motivates our design of the exploration network $f_2$. Throughout this section, let $\nabla f_1$ denote $\nabla_{\theta^1} f_1(\mathbf{x}_{t,i}; \theta^1)$ for brevity.
Lemma D.1.
(Lemma 5.2 in [Ban et al., 2021]). Given a set of context vectors and the corresponding rewards, let $f_1$ be the $L$-layer fully-connected neural network with width $m$, learning rate $\eta$, and number of gradient-descent iterations $K$. Then, there exist positive constants such that, if $m$, $\eta$, and $K$ are chosen appropriately,
then, with probability at least $1 - \delta$, for any $\mathbf{x}_{t,i}$, we have the following upper confidence bound:
$$\big|h(\mathbf{x}_{t,i}) - f_1(\mathbf{x}_{t,i}; \theta^1_t)\big| \leq \mathrm{UCB}\big(\nabla_{\theta^1} f_1(\mathbf{x}_{t,i}; \theta^1_t)\big), \qquad (D.1)$$
where the explicit form of $\mathrm{UCB}(\cdot)$, a function of the gradient and of the gradient at initialization, is given in [Ban et al., 2021].
Note that the gradient at initialization can be treated as a constant. Therefore, the above UCB can be represented as a function of the gradient of the exploitation network $f_1$, i.e., $\mathrm{UCB}\big(\nabla_{\theta^1} f_1(\mathbf{x}_{t,i}; \theta^1_t)\big)$.
EE-Net has a smaller approximation error. Given an arm $\mathbf{x}_{t,i}$, let $f_1(\mathbf{x}_{t,i}; \theta^1)$ be the estimated reward and $h(\mathbf{x}_{t,i})$ be the expected reward. The exploration network $f_2$ in EE-Net learns $h(\mathbf{x}_{t,i}) - f_1(\mathbf{x}_{t,i}; \theta^1)$, i.e., the residual between the expected reward and the estimated reward, which is the ultimate goal of exploration. There are advantages to using a network to learn this residual in EE-Net, compared to giving a statistical upper bound for it as in NeuralUCB, [Ban et al., 2021], and NeuralTS (in NeuralTS, the variance can be thought of as the upper bound). For EE-Net, the approximation error for the residual is caused only by the generalization error of the neural network (Lemma B.1 in the manuscript). In contrast, for NeuralUCB, [Ban et al., 2021], and NeuralTS, the approximation error includes three parts. The first part is caused by ridge regression. The second part is caused by the distance between ridge regression and the Neural Tangent Kernel (NTK). The third part is caused by the distance between the NTK and the network function. Because they use the upper bound to make selections, these errors inherently exist in their algorithms. By reducing the three parts of approximation error to only the neural-network convergence error, EE-Net achieves a tighter regret bound (improving by roughly a factor of $\sqrt{\log T}$).
Methods | "Upward" Exploration | "Downward" Exploration
---|---|---
NeuralUCB | Always | Never
NeuralTS | Randomly | Randomly
EE-Net | Adaptively (when $f_2 > 0$) | Adaptively (when $f_2 < 0$)

EE-Net has the ability to determine the exploration direction. The two types of exploration are depicted in Figure 6. When the estimated reward is larger than the expected reward, i.e., $f_1(\mathbf{x}_{t,i}) > h(\mathbf{x}_{t,i})$, we need to do "downward exploration", i.e., lower the exploration score of $\mathbf{x}_{t,i}$ to reduce its chance of being explored; when $f_1(\mathbf{x}_{t,i}) < h(\mathbf{x}_{t,i})$, we should do "upward exploration", i.e., raise the exploration score of $\mathbf{x}_{t,i}$ to increase its chance of being explored. For EE-Net, $f_2$ learns $h(\mathbf{x}_{t,i}) - f_1(\mathbf{x}_{t,i})$. When $h(\mathbf{x}_{t,i}) - f_1(\mathbf{x}_{t,i}) > 0$, $f_2$ will also tend to be positive, producing upward exploration; when $h(\mathbf{x}_{t,i}) - f_1(\mathbf{x}_{t,i}) < 0$, $f_2$ will tend to be negative, producing downward exploration. In contrast, NeuralUCB always chooses upward exploration, i.e., it adds a confidence bound that is always positive; in particular, when $f_1(\mathbf{x}_{t,i}) > h(\mathbf{x}_{t,i})$, NeuralUCB further amplifies the mistake. NeuralTS randomly chooses upward or downward exploration in all cases, because it draws a sampled reward from a normal distribution whose mean is $f_1(\mathbf{x}_{t,i})$ and whose variance is the upper bound.