
EE-Net: Exploitation-Exploration Neural Networks in Contextual Bandits

Yikun Ban, Yuchen Yan, Arindam Banerjee, Jingrui He
University of Illinois at Urbana-Champaign
{yikunb2, yucheny5, arindamb, jingrui}@illinois.edu
Abstract

In this paper, we propose a novel neural exploration strategy for contextual bandits, EE-Net, distinct from the standard UCB-based and TS-based approaches. Contextual multi-armed bandits have been studied for decades with various applications. To solve the exploitation-exploration tradeoff in bandits, there are three main techniques: epsilon-greedy, Thompson Sampling (TS), and Upper Confidence Bound (UCB). In recent literature, linear contextual bandits have adopted ridge regression to estimate the reward function and combined it with TS or UCB strategies for exploration. However, this line of work explicitly assumes that the reward is a linear function of arm vectors, which may not hold in real-world datasets. To overcome this challenge, a series of neural bandit algorithms have been proposed, where a neural network is used to learn the underlying reward function and TS or UCB is adapted for exploration. Instead of calculating a large-deviation based statistical bound for exploration as in previous methods, we propose "EE-Net", a novel neural-based exploration strategy. In addition to using a neural network (Exploitation network) to learn the reward function, EE-Net uses another neural network (Exploration network) to adaptively learn the potential gain compared to the currently estimated reward. Then, a decision-maker is constructed to combine the outputs of the Exploitation and Exploration networks. We prove that EE-Net can achieve $\mathcal{O}(\sqrt{T\log T})$ regret and show that EE-Net outperforms existing linear and neural contextual bandit baselines on real-world datasets.

1 Introduction

The stochastic contextual multi-armed bandit (MAB) (Lattimore and Szepesvári, 2020) has been studied for decades in the machine learning community to solve sequential decision making, with applications in online advertising (Li et al., 2010), personalized recommendation (Wu et al., 2016; Ban and He, 2021b), etc. In the standard contextual bandit setting, a set of $n$ arms are presented to a learner in each round, where each arm is represented by a context vector. Following a certain strategy, the learner then selects and plays one arm and receives a reward. The goal is to maximize the cumulative reward over $T$ rounds.

MAB algorithms have principled approaches to address the trade-off between Exploitation and Exploration (EE): the data collected in past rounds should be exploited to obtain good rewards, while under-explored arms need to be explored in the hope of obtaining even better rewards. The most widely-used approaches for the EE trade-off can be classified into three main techniques: Epsilon-greedy (Langford and Zhang, 2008), Thompson Sampling (TS) (Thompson, 1933), and Upper Confidence Bound (UCB) (Auer, 2002; Ban and He, 2020).

Linear bandits (Li et al., 2010; Dani et al., 2008; Abbasi-Yadkori et al., 2011), where the reward is assumed to be a linear function of the arm vectors, have been well studied and have succeeded both empirically and theoretically. Given an arm, ridge regression is usually adopted to estimate its reward based on the data collected in past rounds. UCB-based algorithms (Li et al., 2010; Chu et al., 2011; Wu et al., 2016; Ban and He, 2021b) calculate an upper bound for the confidence ellipsoid of the estimated reward and choose the arm according to the sum of the estimated reward and the UCB. TS-based algorithms (Agrawal and Goyal, 2013; Abeille and Lazaric, 2017) formulate each arm as a posterior distribution whose mean is the estimated reward and choose the arm with the maximal sampled reward. However, the linear assumption on the reward may not hold in real-world applications (Valko et al., 2013).

To learn non-linear reward functions, recent works have utilized deep neural networks, thanks to their powerful representation ability. Treating the past selected arms and received rewards as training samples, a neural network $f_1$ is built for exploitation. Zhou et al. (2020) compute a gradient-based upper confidence bound with respect to $f_1$ and use the UCB strategy to select arms. Zhang et al. (2021) formulate each arm as a normal distribution whose mean is $f_1$ and whose deviation is calculated from the gradient of $f_1$, and then use the TS strategy to choose arms. Both Zhou et al. (2020) and Zhang et al. (2021) achieve a near-optimal regret bound of $\mathcal{O}(\sqrt{T}\log T)$.

Figure 1: Left: structure of EE-Net. Right: Case 1, "upward" exploration should be made when the learner underestimates the reward; Case 2, "downward" exploration should be made when the learner overestimates the reward. EE-Net can adaptively explore according to these cases. In contrast, a UCB-based strategy always makes upward exploration, and a TS-based strategy randomly chooses upward or downward exploration.

In this paper, we propose a novel neural exploration strategy, named "EE-Net". Like other neural bandits, EE-Net has an exploitation network $f_1$ to estimate the reward of each arm. The crucial difference from existing works is that EE-Net has an exploration network $f_2$ to predict the potential gain of each arm compared to the current reward estimate. The input to the exploration network is the gradient of $f_1$, and the ground truth is the residual between the true received reward and the reward estimated by $f_1$. This strategy is inspired by recent advances in neural UCB strategies (Zhou et al., 2020; Ban et al., 2021). Finally, a decision-maker $f_3$ is constructed to select arms. $f_3$ has two modes: linear and nonlinear. In the linear mode, $f_3$ is a linear combination of $f_1$ and $f_2$, inspired by the UCB strategy. In the nonlinear mode, $f_3$ is formulated as a neural network with input $(f_1, f_2)$, whose goal is to learn the probability of each arm being the optimal arm. Figure 1 depicts the workflow of EE-Net and its advantages for exploration compared to UCB- or TS-based methods (see more details in Appendix D). The contributions of this paper can be summarized as follows:

  1. We propose a novel neural exploration strategy, EE-Net, where a separate neural network is assigned to learn the potential gain compared to the current reward estimate.

  2. Under standard assumptions of over-parameterized neural networks, we prove that EE-Net can achieve a regret upper bound of $\mathcal{O}(\sqrt{T\log T})$, which improves over existing state-of-the-art neural bandit algorithms by a multiplicative factor of $\sqrt{\log T}$ and is independent of both the input dimension and the effective dimension.

  3. We conduct extensive experiments on four real-world datasets, showing that EE-Net outperforms baselines including linear and neural versions of $\epsilon$-greedy, TS, and UCB.

Next, we discuss the problem definition in Sec. 3, elaborate on the proposed EE-Net in Sec. 4, and present our theoretical analysis in Sec. 5. Finally, we provide the empirical evaluation (Sec. 6) and the conclusion.

2 Related Work

Constrained Contextual Bandits. The most common constraint placed on the reward function is the linear assumption, usually handled by ridge regression (Li et al., 2010; Abbasi-Yadkori et al., 2011; Valko et al., 2013; Dani et al., 2008). Linear UCB-based bandit algorithms (Abbasi-Yadkori et al., 2011; Li et al., 2016) and linear Thompson Sampling (Agrawal and Goyal, 2013; Abeille and Lazaric, 2017) achieve strong performance and a near-optimal regret bound of $\tilde{\mathcal{O}}(\sqrt{T})$. To break the linear assumption, Filippi et al. (2010) generalize the reward function to a composition of linear and non-linear functions and adopt a UCB-based algorithm to deal with it; Bubeck et al. (2011) impose a Lipschitz property on the reward metric space and construct a hierarchical optimistic optimization to make selections; Valko et al. (2013) embed the reward function into a reproducing kernel Hilbert space and propose kernelized TS/UCB bandit algorithms.

Neural Bandits. To learn non-linear reward functions, deep neural networks have been adapted to bandits in various forms. Riquelme et al. (2018) and Lu and Van Roy (2017) build an $L$-layer DNN to learn arm embeddings and apply Thompson Sampling on the last layer for exploration. Zhou et al. (2020) first introduce a provable neural contextual bandit algorithm with a UCB exploration strategy, and Zhang et al. (2021) then extend the neural network to the Thompson Sampling framework. Their regret analysis is built on recent advances in the convergence theory of over-parameterized neural networks (Du et al., 2019; Allen-Zhu et al., 2019) and utilizes the Neural Tangent Kernel (Jacot et al., 2018; Arora et al., 2019) to construct connections with linear contextual bandits (Abbasi-Yadkori et al., 2011). Ban and He (2021a) further adopt convolutional neural networks with UCB exploration for visual-aware applications. Xu et al. (2020) perform UCB-based exploration on the last layer of the neural network to reduce the computational cost brought by gradient-based UCB. Different from the above works, EE-Net keeps the powerful representation ability of neural networks to learn the reward function and is the first to assign another neural network to determine exploration.

3 Problem Definition

We consider the standard contextual multi-armed bandit with a known number of rounds $T$ (Zhou et al., 2020; Zhang et al., 2021). In each round $t \in [T]$, where $[T] = [1, 2, \dots, T]$, the learner is presented with $n$ arms, $\mathbf{X}_t = \{\mathbf{x}_{t,1}, \dots, \mathbf{x}_{t,n}\}$, in which each arm is represented by a feature vector $\mathbf{x}_{t,i} \in \mathbb{R}^d$ for each $i \in [n]$. After playing one arm $\mathbf{x}_{t,i}$, its reward $r_{t,i}$ is assumed to be generated by the function:

$r_{t,i} = h(\mathbf{x}_{t,i}) + \eta_{t,i}$,   (3.1)

where the unknown reward function $h(\mathbf{x}_{t,i})$ can be either linear or non-linear and the noise $\eta_{t,i}$ is drawn from some distribution with expectation $\mathbb{E}[\eta_{t,i}] = 0$. Following many existing works (Zhou et al., 2020; Ban et al., 2021; Zhang et al., 2021), we consider bounded rewards, $r_{t,i} \in [a, b]$. For brevity, we denote the arm selected in round $t$ by $\mathbf{x}_t$ and the reward received in round $t$ by $r_t$. The pseudo regret over $T$ rounds is defined as:

$\mathbf{R}_T = \mathbb{E}\left[\sum_{t=1}^{T} (r_t^{\ast} - r_t)\right]$,   (3.2)

where $\mathbb{E}[r_t^{\ast} \mid \mathbf{X}_t] = \max_{i \in [n]} h(\mathbf{x}_{t,i})$ is the maximal expected reward in round $t$. The goal is to minimize $\mathbf{R}_T$ by a certain selection strategy.

Notation. We denote by $\{\mathbf{x}_i\}_{i=1}^{t}$ the sequence $(\mathbf{x}_1, \dots, \mathbf{x}_t)$. We use $\|v\|_2$ to denote the Euclidean norm of a vector $v$, and $\|\mathbf{W}\|_2$ and $\|\mathbf{W}\|_F$ to denote the spectral and Frobenius norms of a matrix $\mathbf{W}$. We use $\langle \cdot, \cdot \rangle$ to denote the standard inner product between two vectors or two matrices. We may write $\nabla_{\boldsymbol{\theta}^1_t} f_1(\mathbf{x}_{t,i})$ or $\nabla_{\boldsymbol{\theta}^1_t} f_1$ for the gradient $\nabla_{\boldsymbol{\theta}^1_t} f_1(\mathbf{x}_{t,i}; \boldsymbol{\theta}^1_t)$ for brevity. We use $\{\mathbf{x}_\tau, r_\tau\}_{\tau=1}^{t}$ to represent the data collected up to round $t$.

4 Proposed Method: EE-Net

EE-Net is composed of three components. The first component is the exploitation network, $f_1(\cdot; \boldsymbol{\theta}^1)$, which focuses on learning the unknown reward function $h$ based on the data collected in past rounds. The second component is the exploration network, $f_2(\cdot; \boldsymbol{\theta}^2)$, which focuses on characterizing the level of exploration needed for each arm in the present round. The third component is the decision-maker, $f_3$, which focuses on suitably combining the outputs of the exploitation and exploration networks, leading to the arm selection.

1) Exploitation Net. The exploitation net $f_1$ is a neural network which learns the mapping from arms to rewards. In round $t$, we denote the network by $f_1(\cdot; \boldsymbol{\theta}^1_{t-1})$, where the superscript of $\boldsymbol{\theta}^1_{t-1}$ is the index of the network and the subscript indicates the round in which the parameters of $f_1$ were last updated. Given an arm $\mathbf{x}_{t,i}$, $i \in [n]$, $f_1(\mathbf{x}_{t,i}; \boldsymbol{\theta}^1_{t-1})$ is considered the "exploitation score" for $\mathbf{x}_{t,i}$. After playing the arm $\mathbf{x}_t$ selected by some criterion, we receive a reward $r_t$. Therefore, we can conduct gradient descent to update $\boldsymbol{\theta}^1$ based on the collected training samples $\{\mathbf{x}_\tau, r_\tau\}_{\tau=1}^{t}$ and denote the updated parameters by $\boldsymbol{\theta}^1_t$.
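To make this concrete, below is a minimal PyTorch sketch of an exploitation network and its squared-loss gradient-descent update on the collected samples, matching the loss $\mathcal{L}_1$ in Algorithm 1; the class name `ExploitNet`, the width of 100, and the plain SGD optimizer are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class ExploitNet(nn.Module):
    """Exploitation network f1: maps an arm vector to an estimated reward."""
    def __init__(self, dim, width=100):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, width), nn.ReLU(), nn.Linear(width, 1))

    def forward(self, x):
        return self.net(x)

def update_f1(f1, arms, rewards, lr=1e-3, K1=50):
    """Run K1 steps of gradient descent on the squared loss over the past samples.

    `arms` is a (t, dim) tensor of played arms, `rewards` a (t,) tensor of rewards.
    """
    opt = torch.optim.SGD(f1.parameters(), lr=lr)
    for _ in range(K1):
        opt.zero_grad()
        loss = 0.5 * ((f1(arms).squeeze(-1) - rewards) ** 2).sum()
        loss.backward()
        opt.step()
    return f1
```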

Table 1: Structure of EE-Net (round $t$).
Input | Network | Label
$\{\mathbf{x}_\tau\}_{\tau=1}^{t}$ | $f_1(\cdot; \boldsymbol{\theta}^1)$ (Exploitation) | $\{r_\tau\}_{\tau=1}^{t}$
$\{\nabla_{\boldsymbol{\theta}^1_{\tau-1}} f_1(\mathbf{x}_\tau; \boldsymbol{\theta}^1_{\tau-1})\}_{\tau=1}^{t}$ | $f_2(\cdot; \boldsymbol{\theta}^2)$ (Exploration) | $\{r_\tau - f_1(\mathbf{x}_\tau; \boldsymbol{\theta}^1_{\tau-1})\}_{\tau=1}^{t}$
$\{(f_1(\mathbf{x}_\tau; \boldsymbol{\theta}^1_{\tau-1}), f_2(\nabla_{\boldsymbol{\theta}^1_{\tau-1}} f_1; \boldsymbol{\theta}^2_{\tau-1}))\}_{\tau=1}^{t}$ | $f_3(\cdot; \boldsymbol{\theta}^3)$ (Decision-maker, non-linear mode) | $\{p_\tau\}_{\tau=1}^{t}$

2) Exploration Net. Our exploration strategy is inspired by existing UCB-based neural bandits (Zhou et al., 2020; Ban et al., 2021). Based on Lemma 5.2 of Ban et al. (2021), given an arm $\mathbf{x}_{t,i}$, with probability at least $1 - \delta$ we have the following UCB form:

$|h(\mathbf{x}_{t,i}) - f_1(\mathbf{x}_{t,i}; \boldsymbol{\theta}^1_{t-1})| \leq \Psi(\nabla_{\boldsymbol{\theta}^1_{t-1}} f_1(\mathbf{x}_{t,i}; \boldsymbol{\theta}^1_{t-1}))$,   (4.1)

where $h$ is defined in Eq. (3.1) and $\Psi$ is an upper confidence bound represented by a function of the gradient $\nabla_{\boldsymbol{\theta}^1_{t-1}} f_1$ (see more details and discussion in Appendix D). Then we have the following definition.

Definition 4.1.

In round $t$, given an arm $\mathbf{x}_{t,i}$, we define $h(\mathbf{x}_{t,i}) - f_1(\mathbf{x}_{t,i}; \boldsymbol{\theta}^1_{t-1})$ as the "expected potential gain" for $\mathbf{x}_{t,i}$ and $r_{t,i} - f_1(\mathbf{x}_{t,i}; \boldsymbol{\theta}^1_{t-1})$ as the "potential gain" for $\mathbf{x}_{t,i}$.

Let $y_{t,i} = r_{t,i} - f_1(\mathbf{x}_{t,i}; \boldsymbol{\theta}^1_{t-1})$. When $y_{t,i} > 0$, the arm $\mathbf{x}_{t,i}$ has a positive potential gain compared to the estimated reward $f_1(\mathbf{x}_{t,i}; \boldsymbol{\theta}^1_{t-1})$. A large positive $y_{t,i}$ makes the arm more suitable for exploration, whereas a small (or negative) $y_{t,i}$ makes the arm less suitable for exploration. Recall that traditional approaches such as UCB estimate such a potential gain $y_{t,i}$ with large-deviation bounds built from standard tools, e.g., the Markov inequality or Hoeffding bounds.

Instead of calculating a large-deviation based statistical bound for $y_{t,i}$, we use a neural network $f_2(\cdot; \boldsymbol{\theta}^2)$ to represent $\Psi$, where the input is $\nabla_{\boldsymbol{\theta}^1_{t-1}} f_1(\mathbf{x}_{t,i})$ and the ground truth is $r_{t,i} - f_1(\mathbf{x}_{t,i}; \boldsymbol{\theta}^1_{t-1})$. We adopt the gradient $\nabla_{\boldsymbol{\theta}^1_{t-1}} f_1(\mathbf{x}_{t,i})$ as the input also because it incorporates two aspects of information: the features of the arm and the discriminative information of $f_1$.

Moreover, in the upper bound of NeuralUCB and the variance of NeuralTS, there is a recursive term $\mathbf{A}_{t-1} = \mathbf{I} + \sum_{\tau=1}^{t-1} \nabla_{\boldsymbol{\theta}^1_{\tau-1}} f_1(\mathbf{x}_\tau) \nabla_{\boldsymbol{\theta}^1_{\tau-1}} f_1(\mathbf{x}_\tau)^{\top}$, which is a function of the past gradients up to round $t-1$ and incorporates the relevant historical information. In EE-Net, the corresponding recursive term that depends on past gradients is $\boldsymbol{\theta}^2_{t-1}$ of the exploration network $f_2$, because we have conducted gradient descent for $\boldsymbol{\theta}^2_{t-1}$ based on $\{\nabla_{\boldsymbol{\theta}^1_{\tau-1}} f_1(\mathbf{x}_\tau)\}_{\tau=1}^{t-1}$. Therefore, $\boldsymbol{\theta}^2_{t-1}$ plays a role similar to $\mathbf{A}_{t-1}$ in NeuralUCB/TS, but EE-Net does not (need to) assume a specific functional form of the past gradients and is also more memory-efficient.

To sum up, in round $t$, we consider $f_2(\nabla_{\boldsymbol{\theta}^1_{t-1}} f_1(\mathbf{x}_{t,i}); \boldsymbol{\theta}^2_{t-1})$ the "exploration score" of $\mathbf{x}_{t,i}$, because it indicates the potential gain of $\mathbf{x}_{t,i}$ compared to the current exploitation score $f_1(\mathbf{x}_{t,i}; \boldsymbol{\theta}^1_{t-1})$. Therefore, after receiving the reward $r_t$, we can use gradient descent to update $\boldsymbol{\theta}^2$ based on the collected training samples $\{\nabla_{\boldsymbol{\theta}^1_{\tau-1}} f_1(\mathbf{x}_\tau),\, r_\tau - f_1(\mathbf{x}_\tau; \boldsymbol{\theta}^1_{\tau-1})\}_{\tau=1}^{t}$. We also provide two other heuristic forms for $f_2$'s ground-truth label: $|r_{t,i} - f_1(\mathbf{x}_{t,i}; \boldsymbol{\theta}^1_{t-1})|$ and $\text{ReLU}(r_{t,i} - f_1(\mathbf{x}_{t,i}; \boldsymbol{\theta}^1_{t-1}))$. We compare them in an ablation study in Appendix B.
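The following sketch shows how one training pair for $f_2$ might be constructed, assuming a PyTorch exploitation network as in the earlier sketch: the input is the flattened (optionally normalized) gradient of $f_1$ at the arm, and the label is the residual $r - f_1$. The function names and the normalization hook `phi` are illustrative assumptions.

```python
import torch

def grad_feature(f1, x, phi=None):
    """Flatten the gradient of f1(x) w.r.t. f1's parameters; this is the input to f2."""
    out = f1(x.unsqueeze(0)).squeeze()
    grads = torch.autograd.grad(out, f1.parameters())
    g = torch.cat([gr.reshape(-1) for gr in grads]).detach()
    return phi(g) if phi is not None else g

def exploration_sample(f1, x, r):
    """Build one (input, label) pair for f2: the label is the residual r - f1(x)."""
    with torch.no_grad():
        est = f1(x.unsqueeze(0)).squeeze().item()
    return grad_feature(f1, x), r - est
```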

3) Decision-maker. In round $t$, given an arm $\mathbf{x}_{t,i}$, $i \in [n]$, with the computed exploitation score $f_1(\mathbf{x}_{t,i}; \boldsymbol{\theta}^1_{t-1})$ and exploration score $f_2(\nabla_{\boldsymbol{\theta}^1_{t-1}} f_1; \boldsymbol{\theta}^2_{t-1})$, we use a function $f_3(f_1, f_2; \boldsymbol{\theta}^3)$ to trade off between exploitation and exploration and compute the final score for $\mathbf{x}_{t,i}$. The selection criterion is defined as

$\mathbf{x}_t = \arg\max_{\mathbf{x}_{t,i}, i \in [n]} f_3\left(f_1(\mathbf{x}_{t,i}; \boldsymbol{\theta}^1_{t-1}), f_2\left(\nabla_{\boldsymbol{\theta}^1_{t-1}} f_1(\mathbf{x}_{t,i}); \boldsymbol{\theta}^2_{t-1}\right); \boldsymbol{\theta}^3_{t-1}\right).$

Note that $f_3$ can be either a linear or a non-linear function. We provide the following two forms.

Algorithm 1 EE-Net
1: Input: $f_1, f_2, f_3$, $T$ (number of rounds), $\eta_1$ (learning rate for $f_1$), $\eta_2$ (learning rate for $f_2$), $\eta_3$ (learning rate for $f_3$), $K_1$ (number of iterations for $f_1$), $K_2$ (number of iterations for $f_2$), $K_3$ (number of iterations for $f_3$), $\phi$ (normalization operator)
2: Initialize $\boldsymbol{\theta}^1_0, \boldsymbol{\theta}^2_0, \boldsymbol{\theta}^3_0$; $\widehat{\boldsymbol{\theta}}^1_0 = \boldsymbol{\theta}^1_0, \widehat{\boldsymbol{\theta}}^2_0 = \boldsymbol{\theta}^2_0, \widehat{\boldsymbol{\theta}}^3_0 = \boldsymbol{\theta}^3_0$
3: for $t = 1, 2, \dots, T$ do
4:   Observe $n$ arms $\{\mathbf{x}_{t,1}, \dots, \mathbf{x}_{t,n}\}$
5:   for each $i \in [n]$ do
6:     Compute $f_1(\mathbf{x}_{t,i}; \boldsymbol{\theta}^1_{t-1})$, $f_2(\phi(\nabla_{\boldsymbol{\theta}^1_{t-1}} f_1(\mathbf{x}_{t,i})); \boldsymbol{\theta}^2_{t-1})$, $f_3((f_1, f_2); \boldsymbol{\theta}^3_{t-1})$
7:   end for
8:   $\mathbf{x}_t = \arg\max_{\mathbf{x}_{t,i}, i \in [n]} f_3\left(f_1(\mathbf{x}_{t,i}; \boldsymbol{\theta}^1_{t-1}), f_2(\phi(\nabla_{\boldsymbol{\theta}^1_{t-1}} f_1(\mathbf{x}_{t,i})); \boldsymbol{\theta}^2_{t-1}); \boldsymbol{\theta}^3_{t-1}\right)$
9:   Play $\mathbf{x}_t$ and observe reward $r_t$
10:  $\boldsymbol{\theta}^1_t, \boldsymbol{\theta}^2_t, \boldsymbol{\theta}^3_t$ = GradientDescent($\boldsymbol{\theta}_0$, $\{\mathbf{x}_\tau\}_{\tau=1}^{t}, \{r_\tau\}_{\tau=1}^{t}$)
11: end for
12:
13: procedure GradientDescent($\boldsymbol{\theta}_0$, $\{\mathbf{x}_\tau\}_{\tau=1}^{t}, \{r_\tau\}_{\tau=1}^{t}$)
14:   $\mathcal{L}_1 = \frac{1}{2} \sum_{\tau=1}^{t} \left(f_1(\mathbf{x}_\tau; \boldsymbol{\theta}^1) - r_\tau\right)^2$
15:   $\boldsymbol{\theta}^{1,(0)} = \boldsymbol{\theta}^1_0$
16:   for $k \in \{1, \dots, K_1\}$ do
17:     $\boldsymbol{\theta}^{1,(k)} = \boldsymbol{\theta}^{1,(k-1)} - \eta_1 \nabla_{\boldsymbol{\theta}^{1,(k-1)}} \mathcal{L}_1$
18:   end for
19:   $\widehat{\boldsymbol{\theta}}^1_t = \boldsymbol{\theta}^{1,(K_1)}$
20:   $\mathcal{L}_2 = \frac{1}{2} \sum_{\tau=1}^{t} \left(f_2(\phi(\nabla_{\boldsymbol{\theta}^1_{\tau-1}} f_1(\mathbf{x}_\tau)); \boldsymbol{\theta}^2) - (r_\tau - f_1(\mathbf{x}_\tau; \boldsymbol{\theta}^1_{\tau-1}))\right)^2$
21:   $\boldsymbol{\theta}^{2,(0)} = \boldsymbol{\theta}^2_0$
22:   for $k \in \{1, \dots, K_2\}$ do
23:     $\boldsymbol{\theta}^{2,(k)} = \boldsymbol{\theta}^{2,(k-1)} - \eta_2 \nabla_{\boldsymbol{\theta}^{2,(k-1)}} \mathcal{L}_2$
24:   end for
25:   $\widehat{\boldsymbol{\theta}}^2_t = \boldsymbol{\theta}^{2,(K_2)}$
26:   Determine label $p_t$
27:   $\mathcal{L}_3 = -\frac{1}{t} \sum_{\tau=1}^{t} \left[p_\tau \log f_3((f_1, f_2); \boldsymbol{\theta}^3) + (1 - p_\tau) \log\left(1 - f_3((f_1, f_2); \boldsymbol{\theta}^3)\right)\right]$
28:   $\boldsymbol{\theta}^{3,(0)} = \boldsymbol{\theta}^3_0$
29:   for $k \in \{1, \dots, K_3\}$ do
30:     $\boldsymbol{\theta}^{3,(k)} = \boldsymbol{\theta}^{3,(k-1)} - \eta_3 \nabla_{\boldsymbol{\theta}^{3,(k-1)}} \mathcal{L}_3$
31:   end for
32:   $\widehat{\boldsymbol{\theta}}^3_t = \boldsymbol{\theta}^{3,(K_3)}$
33:   Randomly choose $(\boldsymbol{\theta}^1_t, \boldsymbol{\theta}^2_t)$ uniformly from $\{(\widehat{\boldsymbol{\theta}}^1_0, \widehat{\boldsymbol{\theta}}^2_0), (\widehat{\boldsymbol{\theta}}^1_1, \widehat{\boldsymbol{\theta}}^2_1), \dots, (\widehat{\boldsymbol{\theta}}^1_t, \widehat{\boldsymbol{\theta}}^2_t)\}$
34:   Randomly choose $\boldsymbol{\theta}^3_t$ uniformly from $\{\widehat{\boldsymbol{\theta}}^3_0, \widehat{\boldsymbol{\theta}}^3_1, \dots, \widehat{\boldsymbol{\theta}}^3_t\}$
35:   Return $\boldsymbol{\theta}^1_t, \boldsymbol{\theta}^2_t, \boldsymbol{\theta}^3_t$
36: end procedure

(1) Linear function. $f_3$ can be formulated as a linear function of $f_1$ and $f_2$:

$f_3(f_1, f_2; \boldsymbol{\theta}^3) = w_1 f_1(\mathbf{x}_{t,i}; \boldsymbol{\theta}^1) + w_2 f_2(\nabla_{\boldsymbol{\theta}^1} f_1; \boldsymbol{\theta}^2)$

where $w_1$ and $w_2$ are two weights preset by the learner. When $w_1 = w_2 = 1$, $f_3$ can be thought of as a UCB-type policy, where the estimated reward $f_1$ and the potential gain $f_2$ are simply added together. We report its empirical performance in the ablation study (Appendix B).

(2) Non-linear function. $f_3$ can also be formulated as a neural network that learns the mapping from $(f_1, f_2)$ to the optimal arm. We transform the bandit problem into a binary classification problem. Given an arm $\mathbf{x}_{t,i}$, we define $p_{t,i}$ as the probability of $\mathbf{x}_{t,i}$ being the optimal arm in round $t$. For brevity, we denote by $p_t$ the probability of the selected arm $\mathbf{x}_t$ being the optimal arm in round $t$. Depending on the reward distribution, we have different approaches to determine $p_t$.

  1. Binary reward. $\forall t \in [T]$, suppose $r_t$ is a binary variable over $\{a, b\}$ $(a < b)$; it is straightforward to set $p_t = 1.0$ if $r_t = b$ and $p_t = 0.0$ otherwise.

  2. Continuous reward. $\forall t \in [T]$, suppose $r_t$ is a continuous variable over the range $[a, b]$; we provide two ways to determine $p_t$. (1) $p_t$ can be directly set as $\frac{r_t - a}{b - a}$. (2) The learner can set a threshold $\theta$ $(a < \theta < b)$. Then $p_t = 1.0$ if $r_t > \theta$ and $p_t = 0.0$ otherwise.

Therefore, with the collected training samples $\left\{\left(f_1(\mathbf{x}_\tau; \boldsymbol{\theta}^1_{\tau-1}), f_2(\nabla_{\boldsymbol{\theta}^1_{\tau-1}} f_1; \boldsymbol{\theta}^2_{\tau-1})\right), p_\tau\right\}_{\tau=1}^{t}$ in round $t$, we can conduct gradient descent to update the parameters of $f_3(\cdot; \boldsymbol{\theta}^3)$.
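As a concrete illustration of the label $p_t$ described above, the sketch below covers the binary and continuous cases; the default range $[0, 1]$ and the helper names are assumptions for illustration.

```python
def label_binary(r, b=1.0):
    """p_t for a binary reward over {a, b}: 1.0 iff the larger value b was received."""
    return 1.0 if r == b else 0.0

def label_continuous(r, a=0.0, b=1.0, threshold=None):
    """p_t for a continuous reward in [a, b]: rescale to [0, 1], or threshold if given."""
    if threshold is not None:
        return 1.0 if r > threshold else 0.0
    return (r - a) / (b - a)
```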

Table 1 details the working structure of EE-Net. Algorithm 1 depicts the workflow of EE-Net, where the input of $f_2$ is normalized, i.e., $\phi(\nabla_{\boldsymbol{\theta}^1_{t-1}} f_1(\mathbf{x}_{t,i}))$. Algorithm 1 provides a gradient descent (GD) version of EE-Net, where drawing $(\boldsymbol{\theta}^1_t, \boldsymbol{\theta}^2_t)$ uniformly from the stored historical parameters is for the sake of analysis. One can easily extend EE-Net to stochastic GD to update the parameters incrementally.
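For illustration, the sketch below implements one selection step of Algorithm 1 under the linear decision-maker $f_3 = w_1 f_1 + w_2 f_2$ (the nonlinear mode would instead feed the pair $(f_1, f_2)$ into a small network); the function name and the per-arm loop are assumptions, not the released code.

```python
import torch

def select_arm(f1, f2, arms, phi, w1=1.0, w2=1.0):
    """One EE-Net selection round with the linear decision-maker f3 = w1*f1 + w2*f2.

    `arms` is a list of d-dimensional tensors; `phi` is the normalization operator.
    """
    scores = []
    for x in arms:
        exploit = f1(x.unsqueeze(0)).squeeze()                       # exploitation score
        grads = torch.autograd.grad(exploit, f1.parameters())
        g = torch.cat([gr.reshape(-1) for gr in grads]).detach()     # gradient input for f2
        explore = f2(phi(g).unsqueeze(0)).squeeze()                  # exploration score
        scores.append(w1 * exploit.item() + w2 * explore.item())
    return int(torch.tensor(scores).argmax())                        # index of the chosen arm
```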

Remark 4.1 (Network structure).

The networks $f_1, f_2, f_3$ can have different structures for different applications. For example, in vision tasks, $f_1$ can be set up as convolutional layers (LeCun et al., 1995). For the exploration network $f_2$, the input $\nabla_{\boldsymbol{\theta}^1} f_1$ may have an exploding number of dimensions when the exploitation network $f_1$ becomes wide and deep, which may cause a huge computational cost for $f_2$. To address this challenge, we can apply dimensionality reduction techniques to obtain low-dimensional vectors from $\nabla_{\boldsymbol{\theta}^1} f_1$. In the experiments, we use locally linear embedding (Roweis and Saul, 2000) to acquire a 10-dimensional vector for $\nabla_{\boldsymbol{\theta}^1} f_1$ and achieve the best performance among all baselines.∎
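As one possible instantiation of this reduction step (the remark does not prescribe an exact procedure), the sketch below applies scikit-learn's locally linear embedding to the gradients collected so far; the neighbor count and the random placeholder data are assumptions.

```python
import numpy as np
from sklearn.manifold import LocallyLinearEmbedding

# `grads` stands in for the flattened gradients of f1 collected so far (one row per arm);
# random data is used here purely as a placeholder for illustration.
grads = np.random.randn(200, 5000)

lle = LocallyLinearEmbedding(n_components=10, n_neighbors=12)
low_dim_inputs = lle.fit_transform(grads)   # 10-dimensional inputs for the exploration net f2
```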

Remark 4.2 (Exploration direction).

EE-Net can determine the exploration direction. Given an arm $\mathbf{x}_{t,i}$, when the estimate $f_1(\mathbf{x}_{t,i})$ is lower than the expected reward $h(\mathbf{x}_{t,i})$, the learner should make "upward" exploration, i.e., increase the chance of $\mathbf{x}_{t,i}$ being explored; when $f_1(\mathbf{x}_{t,i})$ is higher than $h(\mathbf{x}_{t,i})$, the learner should make "downward" exploration, i.e., decrease the chance of $\mathbf{x}_{t,i}$ being explored. EE-Net uses the neural network $f_2$ to learn $h(\mathbf{x}_{t,i}) - f_1(\mathbf{x}_{t,i})$ (which can be positive or negative) and thus can determine the exploration direction. In contrast, NeuralUCB always makes "upward" exploration and NeuralTS randomly chooses between "upward" and "downward" exploration (see selection criteria in Table 2 and more details in Appendix D). ∎

Remark 4.3 (Space complexity).

NeuralUCB and NeuralTS have to maintain the gradient outer-product matrix (e.g., $\mathbf{A}_t = \sum_{\tau=1}^{t} \nabla_{\boldsymbol{\theta}^1} f_1(\mathbf{x}_\tau; \boldsymbol{\theta}^1_\tau) \nabla_{\boldsymbol{\theta}^1} f_1(\mathbf{x}_\tau; \boldsymbol{\theta}^1_\tau)^{\top} \in \mathbb{R}^{p \times p}$) and, for $\boldsymbol{\theta}^1 \in \mathbb{R}^p$, have a space complexity of $\mathcal{O}(p^2)$ to store the outer product. In contrast, EE-Net does not maintain this matrix and only uses $\nabla_{\boldsymbol{\theta}^1} f_1$ as the input of $f_2$. Thus, EE-Net reduces the space complexity from $\mathcal{O}(p^2)$ to $\mathcal{O}(p)$. ∎

5 Regret Analysis

In this section, we provide the regret analysis of EE-Net when $f_3$ is set as the linear function $f_3 = f_1 + f_2$, which can be thought of as a UCB-type trade-off between exploitation and exploration. For simplicity, we conduct the regret analysis under some unknown but fixed data distribution $\mathcal{D}$. In each round $t$, $n$ samples $\{(\mathbf{x}_{t,1}, r_{t,1}), (\mathbf{x}_{t,2}, r_{t,2}), \dots, (\mathbf{x}_{t,n}, r_{t,n})\}$ are drawn i.i.d. from $\mathcal{D}$. This is a standard distribution assumption in over-parameterized neural networks (Cao and Gu, 2019). Then, for the analysis, we have the following assumption, which is a standard input assumption in neural bandits and over-parameterized neural networks (Zhou et al., 2020; Allen-Zhu et al., 2019).

Assumption 5.1 ($\rho$-Separability).

For any $t \in [T]$, $i \in [n]$, $\|\mathbf{x}_{t,i}\|_2 = 1$ and $r_{t,i} \in [0, 1]$. For every pair $\mathbf{x}_{t,i}, \mathbf{x}_{t',i'}$ with $t' \in [T]$, $i' \in [n]$, and $(t, i) \neq (t', i')$, $\|\mathbf{x}_{t,i} - \mathbf{x}_{t',i'}\|_2 > \rho$. Moreover, suppose there exists an operator $\phi$ such that $\|\phi(\cdot)\|_2 = 1$ and $\|\phi(\nabla_{\boldsymbol{\theta}^1} f_1(\mathbf{x}_{t,i})) - \phi(\nabla_{\boldsymbol{\theta}^1} f_1(\mathbf{x}_{t',i'}))\|_2 \geq \rho$.

For example, the operator can be designed as $\phi(\nabla_{\boldsymbol{\theta}^1} f_1(\mathbf{x}_{t,i})) = \left(\frac{\nabla_{\boldsymbol{\theta}^1} f_1(\mathbf{x}_{t,i})}{\sqrt{2}\|\nabla_{\boldsymbol{\theta}^1} f_1(\mathbf{x}_{t,i})\|_2}, \frac{\mathbf{x}_{t,i}}{\sqrt{2}}\right)$. The analysis focuses on over-parameterized neural networks (Jacot et al., 2018; Du et al., 2019; Allen-Zhu et al., 2019). Given an input $\mathbf{x} \in \mathbb{R}^d$, without loss of generality, we define the fully-connected network $f$ with depth $L \geq 2$ and width $m$:

$f(\mathbf{x}; \boldsymbol{\theta}) = \mathbf{W}_L \sigma(\mathbf{W}_{L-1} \sigma(\mathbf{W}_{L-2} \dots \sigma(\mathbf{W}_1 \mathbf{x})))$,   (5.1)

where $\sigma$ is the ReLU activation function, $\mathbf{W}_1 \in \mathbb{R}^{m \times d}$, $\mathbf{W}_l \in \mathbb{R}^{m \times m}$ for $2 \leq l \leq L-1$, $\mathbf{W}_L \in \mathbb{R}^{1 \times m}$, and $\boldsymbol{\theta} = [\text{vec}(\mathbf{W}_1)^{\intercal}, \text{vec}(\mathbf{W}_2)^{\intercal}, \dots, \text{vec}(\mathbf{W}_L)^{\intercal}]^{\intercal}$.

Initialization. For any $l \in [L-1]$, each entry of $\mathbf{W}_l$ is drawn from the normal distribution $\mathcal{N}(0, \frac{2}{m})$, and each entry of $\mathbf{W}_L$ is drawn from the normal distribution $\mathcal{N}(0, \frac{1}{m})$. Note that EE-Net has at most three networks $f_1, f_2, f_3$. For brevity, we define them following the definition of $f$, although they may have different depths or widths. Then, we have the following theorem for EE-Net. Recall that $\eta_1, \eta_2$ are the learning rates for $f_1, f_2$; $K_1$ is the number of gradient descent iterations for $f_1$ in each round; and $K_2$ is the number of iterations for $f_2$.
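A minimal PyTorch sketch of the network in Eq. (5.1) with this initialization is given below; the bias-free linear layers and the default sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class FullyConnected(nn.Module):
    """Width-m, depth-L ReLU network of Eq. (5.1) with the stated Gaussian initialization."""
    def __init__(self, d, m=100, L=2):
        super().__init__()
        dims = [d] + [m] * (L - 1)
        self.hidden = nn.ModuleList(nn.Linear(dims[i], dims[i + 1], bias=False)
                                    for i in range(L - 1))
        self.out = nn.Linear(m, 1, bias=False)
        for layer in self.hidden:                                     # W_l ~ N(0, 2/m)
            nn.init.normal_(layer.weight, std=(2.0 / m) ** 0.5)
        nn.init.normal_(self.out.weight, std=(1.0 / m) ** 0.5)        # W_L ~ N(0, 1/m)

    def forward(self, x):
        for layer in self.hidden:
            x = torch.relu(layer(x))
        return self.out(x)
```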

Theorem 1.

Let $f_1, f_2$ follow the setting of $f$ (Eq. (5.1)) with the same width $m$ and depth $L$. Let $\mathcal{L}_1, \mathcal{L}_2$ be the loss functions defined in Algorithm 1. Set $f_3$ as $f_3 = f_1 + f_2$. For any $\delta \in (0, 1)$, $\epsilon \in (0, \mathcal{O}(\frac{1}{T})]$, $\rho \in (0, \mathcal{O}(\frac{1}{L})]$, suppose

$m \geq \widetilde{\Omega}\left(\text{poly}(T, n, L, \rho^{-1}) \cdot \log(1/\delta) \cdot e^{\sqrt{\log(Tn/\delta)}}\right)$,   (5.2)
$\eta_1 = \eta_2 = \min\left(\Theta\left(\frac{T^5}{\sqrt{2}\delta^2 m}\right), \Theta\left(\frac{\rho}{\text{poly}(T, n, L) \cdot m}\right)\right)$,
$K_1 = K_2 = \Theta\left(\frac{\text{poly}(T, n, L)}{\rho \delta^2} \cdot \log(\epsilon^{-1})\right)$.

Then, with probability at least $1 - \delta$ over the initialization, the pseudo regret of EE-Net in $T$ rounds satisfies

$\mathbf{R}_T \leq \mathcal{O}(1) + (2\sqrt{T} - 1) \cdot 3\sqrt{2}\,\mathcal{O}(L) + \mathcal{O}\left((2\sqrt{T} - 1)\sqrt{2\log\frac{\mathcal{O}(Tn)}{\delta}}\right)$.   (5.3)

Comparison with existing works. Under similar assumptions on over-parameterized neural networks, the regret bounds of NeuralUCB (Zhou et al., 2020) and NeuralTS (Zhang et al., 2021) are both

$\mathbf{R}_T \leq \mathcal{O}\left(\sqrt{\tilde{d}T\log T}\right) \cdot \mathcal{O}\left(\sqrt{\tilde{d}\log T}\right), \quad \text{and} \quad \tilde{d} = \frac{\log\det(\mathbf{I} + \mathbf{H}/\lambda)}{\log(1 + Tn/\lambda)}$,

where $\mathbf{H}$ is the neural tangent kernel (NTK) matrix (Jacot et al., 2018; Arora et al., 2019) and $\lambda$ is a regularization parameter. Similarly, in linear contextual bandits, Abbasi-Yadkori et al. (2011) achieve $\mathcal{O}(d\sqrt{T}\log T)$ and Li et al. (2017) achieve $\mathcal{O}(\sqrt{dT}\log T)$.

Remark 5.1.

Compared to NeuralUCB/TS, EE-Net roughly improves the regret by a multiplicative factor of $\sqrt{\log T}$, because our proof for EE-Net is directly built on recent advances in the convergence theory (Allen-Zhu et al., 2019) and generalization bounds (Cao and Gu, 2019) of over-parameterized neural networks. In contrast, the analysis of NeuralUCB/TS contains three sources of approximation error, arising from the distances between the expected reward and ridge regression, between ridge regression and the NTK, and between the NTK and the network function. ∎

Remark 5.2.

The regret bound of EE-Net does not depend on the effective dimension $\tilde{d}$ or the input dimension $d$. $\tilde{d}$ or $d$ may cause significant error when the determinant of $\mathbf{H}$ is extremely large or $d > T$. ∎

The proof of Theorem 1 is in Appendix C and is mainly based on the following generalization bound, which results from an online-to-batch conversion combined with convergence guarantees for deep learning optimization.

Lemma 5.1.

For any $\delta \in (0, 1)$, $\epsilon \in (0, 1)$, $\rho \in (0, \mathcal{O}(\frac{1}{L}))$, suppose $m, \eta_1, \eta_2, K_1, K_2$ satisfy the conditions in Eq. (5.2) and $(\mathbf{x}_{\tau,i}, r_{\tau,i}) \sim \mathcal{D}$, $\forall \tau \in [t], i \in [n]$. Let

$\mathbf{x}_t = \arg\max_{\mathbf{x}_{t,i}, i \in [n]} \left[f_2\left(\phi(\nabla_{\boldsymbol{\theta}^1_{t-1}} f_1(\mathbf{x}_{t,i}; \boldsymbol{\theta}^1_{t-1})); \boldsymbol{\theta}^2_{t-1}\right) + f_1(\mathbf{x}_{t,i}; \boldsymbol{\theta}^1_{t-1})\right]$,

and let $r_t$ be the corresponding reward, given $(\mathbf{x}_{t,i}, r_{t,i})$, $i \in [n]$. Then, with probability at least $1 - \delta$ over the randomness of the initialization, it holds that

$\underset{(\mathbf{x}_{t,i}, r_{t,i}), i \in [n]}{\mathbb{E}} \left[\left|f_2\left(\phi(\nabla_{\boldsymbol{\theta}^1_{t-1}} f_1(\mathbf{x}_{t,i}; \boldsymbol{\theta}^1_{t-1})); \boldsymbol{\theta}^2_{t-1}\right) - \left(r_t - f_1(\mathbf{x}_t; \boldsymbol{\theta}^1_{t-1})\right)\right| \,\Big|\, \{\mathbf{x}_\tau, r_\tau\}_{\tau=1}^{t-1}\right] \leq \sqrt{\frac{2\epsilon}{t}} + \mathcal{O}\left(\frac{3L}{\sqrt{2t}}\right) + (1 + 2\xi)\sqrt{\frac{2\log(\mathcal{O}(tn/\delta))}{t}}$,   (5.4)

where the expectation is also taken over $(\boldsymbol{\theta}^1_{t-1}, \boldsymbol{\theta}^2_{t-1})$, which are uniformly drawn from $\{(\widehat{\boldsymbol{\theta}}^1_\tau, \widehat{\boldsymbol{\theta}}^2_\tau)\}_{\tau \in [t-1]}$.

Remark 5.3.

Lemma 5.1 provides a fixed $\tilde{\mathcal{O}}(\frac{1}{\sqrt{t}})$-rate generalization bound for the exploitation-exploration networks $f_1, f_2$, in contrast with the relative bound w.r.t. the Neural Tangent Random Feature (NTRF) benchmark (Cao and Gu, 2019). We achieve this by working in the regression rather than the classification setting and utilizing the convergence guarantees for the square loss (Allen-Zhu et al., 2019). Note that the bound in Lemma 5.1 holds for bounded (possibly random) rewards $r \in [0, 1]$ instead of a fixed function as in the conventional classification setting.

6 Experiments

In this section, we evaluate EE-Net on four real-world datasets, comparing it with strong state-of-the-art baselines. We first present the experimental setup, then show the regret comparison, and finally report an ablation study. Code is available at https://github.com/banyikun/EE-Net-ICLR-2022.

We use four real-world datasets: Mnist, Yelp, Movielens, and Disin, the details and settings of which are attached in Appendix A.

Figure 2: Regret comparison on Movielens and Yelp (mean of 10 runs with standard deviation shown as shadow). With the same exploitation network $f_1$, EE-Net outperforms all baselines.

Baselines. To comprehensively evaluate EE-Net, we choose three neural bandit algorithms, one linear bandit algorithm, and one kernelized bandit algorithm.

  1. LinUCB (Li et al., 2010) explicitly assumes the reward is a linear function of the arm vector and an unknown user parameter, and then applies ridge regression and an upper confidence bound to determine the selected arm.

  2. KernelUCB (Valko et al., 2013) adopts a predefined kernel matrix on the reward space combined with a UCB-based exploration strategy.

  3. Neural-Epsilon adapts the epsilon-greedy exploration strategy to the exploitation network $f_1$. I.e., with probability $1 - \epsilon$, the arm is selected by $\mathbf{x}_t = \arg\max_{i \in [n]} f_1(\mathbf{x}_{t,i}; \boldsymbol{\theta}^1)$, and with probability $\epsilon$, the arm is chosen randomly.

  4. NeuralUCB (Zhou et al., 2020) uses the exploitation network $f_1$ to learn the reward function, combined with a UCB-based exploration strategy.

  5. NeuralTS (Zhang et al., 2021) adopts the exploitation network $f_1$ to learn the reward function, combined with a Thompson Sampling exploration strategy.

Note that we do not report the results of LinTS and KernelTS in the experiments because of the limited space in the figures; moreover, LinTS and KernelTS have been significantly outperformed by NeuralTS (Zhang et al., 2021).

Setup for EE-Net. For a fair comparison, for all neural-based methods including EE-Net, the exploitation network $f_1$ is a 2-layer fully-connected network of width 100. For the exploration network $f_2$, we also use a 2-layer fully-connected network of width 100. For the decision-maker $f_3$, after comprehensively evaluating both linear and nonlinear functions, we found that the most effective approach is combining them, which we call the "hybrid decision-maker". In detail, for rounds $t \leq 500$, $f_3$ is set as $f_3 = f_2 + f_1$, and for $t > 500$, $f_3$ is set as a neural network with two 20-width fully-connected layers. We set $f_3$ in this way because the linear decision-maker maintains stable performance in each run (robustness), while the non-linear decision-maker can further improve the performance (see details in Appendix B); the hybrid decision-maker combines these two advantages. The configurations of all methods are given in Appendix A.
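A small sketch of this hybrid decision-maker is given below, assuming `f3_net` is a PyTorch network taking the pair $(f_1, f_2)$ as input; the switch at round 500 follows the description above, while the function name is an assumption.

```python
import torch

def decision_score(t, f1_score, f2_score, f3_net, switch_round=500):
    """Hybrid decision-maker: f3 = f1 + f2 for early rounds, a learned network afterwards."""
    if t <= switch_round:
        return f1_score + f2_score
    inp = torch.tensor([[f1_score, f2_score]], dtype=torch.float32)
    return f3_net(inp).item()
```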

Figure 3: Regret comparison on Mnist and Disin (mean of 10 runs with standard deviation shown as shadow). With the same exploitation network $f_1$, EE-Net outperforms all baselines.

Results. Figures 2 and 3 show the regret comparison on the four datasets. EE-Net consistently outperforms all baselines across all datasets. For LinUCB and KernelUCB, a simple linear reward function or a predefined kernel cannot properly capture the ground-truth reward functions in real-world datasets. In particular, on the Mnist and Disin datasets, the correlations between rewards and arm feature vectors are neither linear nor simple mappings. Thus, LinUCB and KernelUCB barely exploit the past collected data samples and fail to select correct arms. Among the neural bandit algorithms, the exploration probability of Neural-Epsilon is fixed and difficult to adjust, so it is usually hard for it to explore effectively. To explore, NeuralUCB statistically calculates a gradient-based upper confidence bound, and NeuralTS draws each arm's predicted reward from a normal distribution whose standard deviation is computed from the gradient. However, the confidence bound or standard deviation they calculate only considers the worst case, may not represent the actual potential of each arm, and cannot make "upward" and "downward" exploration properly. Instead, EE-Net uses a neural network $f_2$ to learn each arm's potential, exploiting the powerful representation ability of neural networks. Therefore, EE-Net outperforms these two state-of-the-art bandit algorithms. Note that NeuralUCB/TS needs two parameters to tune UCB/TS according to different scenarios, while EE-Net only needs to set up a neural network and learns the exploration automatically.

Ablation Study. In Appendix B, we conduct an ablation study regarding the label function $y$ of $f_2$ and different settings of $f_3$.

7 Conclusion

In this paper, we propose a novel exploration strategy, EE-Net. In addition to a neural network that exploits the data collected in past rounds, EE-Net has another neural network that learns the potential gain compared to the current estimate for exploration. Then, a decision-maker is built on top of them to further trade off between exploitation and exploration. We demonstrate that EE-Net outperforms NeuralUCB and NeuralTS both theoretically and empirically, becoming the new state-of-the-art exploration policy.

Acknowledgements: We are grateful to Shiliang Zuo and Yunzhe Qi for the valuable discussions in the revisions of EE-Net. This research work is supported by National Science Foundation under Awards No. IIS-1947203, IIS-2002540, IIS-2137468, IIS-1908104, OAC-1934634, and DBI-2021898, and a grant from C3.ai. The views and conclusions are those of the authors and should not be interpreted as representing the official policies of the funding agencies or the government.

References

  • Abbasi-Yadkori et al. [2011] Y. Abbasi-Yadkori, D. Pál, and C. Szepesvári. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pages 2312–2320, 2011.
  • Abeille and Lazaric [2017] M. Abeille and A. Lazaric. Linear thompson sampling revisited. In Artificial Intelligence and Statistics, pages 176–184. PMLR, 2017.
  • Agrawal and Goyal [2013] S. Agrawal and N. Goyal. Thompson sampling for contextual bandits with linear payoffs. In International Conference on Machine Learning, pages 127–135. PMLR, 2013.
  • Ahmed et al. [2018] H. Ahmed, I. Traore, and S. Saad. Detecting opinion spams and fake news using text classification. Security and Privacy, 1(1):e9, 2018.
  • Allen-Zhu et al. [2019] Z. Allen-Zhu, Y. Li, and Z. Song. A convergence theory for deep learning via over-parameterization. In International Conference on Machine Learning, pages 242–252. PMLR, 2019.
  • Arora et al. [2019] S. Arora, S. S. Du, W. Hu, Z. Li, R. R. Salakhutdinov, and R. Wang. On exact computation with an infinitely wide neural net. In Advances in Neural Information Processing Systems, pages 8141–8150, 2019.
  • Auer [2002] P. Auer. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3(Nov):397–422, 2002.
  • Ban and He [2020] Y. Ban and J. He. Generic outlier detection in multi-armed bandit. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 913–923, 2020.
  • Ban and He [2021a] Y. Ban and J. He. Convolutional neural bandit: Provable algorithm for visual-aware advertising. arXiv preprint arXiv:2107.07438, 2021a.
  • Ban and He [2021b] Y. Ban and J. He. Local clustering in contextual multi-armed bandits. In Proceedings of the Web Conference 2021, pages 2335–2346, 2021b.
  • Ban et al. [2021] Y. Ban, J. He, and C. B. Cook. Multi-facet contextual bandits: A neural network perspective. In The 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Virtual Event, Singapore, August 14-18, 2021, pages 35–45, 2021.
  • Bubeck et al. [2011] S. Bubeck, R. Munos, G. Stoltz, and C. Szepesvári. X-armed bandits. Journal of Machine Learning Research, 12(5), 2011.
  • Cao and Gu [2019] Y. Cao and Q. Gu. Generalization bounds of stochastic gradient descent for wide and deep neural networks. Advances in Neural Information Processing Systems, 32:10836–10846, 2019.
  • Cesa-Bianchi et al. [2001] N. Cesa-Bianchi, A. Conconi, and C. Gentile. On the generalization ability of on-line learning algorithms. Advances in Neural Information Processing Systems, 14, 2001.
  • Chlebus [2009] E. Chlebus. An approximate formula for a partial sum of the divergent p-series. Applied Mathematics Letters, 22(5):732–737, 2009.
  • Chu et al. [2011] W. Chu, L. Li, L. Reyzin, and R. Schapire. Contextual bandits with linear payoff functions. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 208–214, 2011.
  • Dani et al. [2008] V. Dani, T. P. Hayes, and S. M. Kakade. Stochastic linear optimization under bandit feedback. 2008.
  • Du et al. [2019] S. Du, J. Lee, H. Li, L. Wang, and X. Zhai. Gradient descent finds global minima of deep neural networks. In International Conference on Machine Learning, pages 1675–1685. PMLR, 2019.
  • Filippi et al. [2010] S. Filippi, O. Cappe, A. Garivier, and C. Szepesvári. Parametric bandits: The generalized linear case. In Advances in Neural Information Processing Systems, pages 586–594, 2010.
  • Fu and He [2021] D. Fu and J. He. SDG: A simplified and dynamic graph neural network. In SIGIR '21: The 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual Event, Canada, July 11-15, 2021, pages 2273–2277. ACM, 2021.
  • Harper and Konstan [2015] F. M. Harper and J. A. Konstan. The movielens datasets: History and context. ACM Transactions on Interactive Intelligent Systems (TiiS), 5(4):1–19, 2015.
  • Jacot et al. [2018] A. Jacot, F. Gabriel, and C. Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in Neural Information Processing Systems, pages 8571–8580, 2018.
  • Langford and Zhang [2008] J. Langford and T. Zhang. The epoch-greedy algorithm for multi-armed bandits with side information. In Advances in Neural Information Processing Systems, pages 817–824, 2008.
  • Lattimore and Szepesvári [2020] T. Lattimore and C. Szepesvári. Bandit Algorithms. Cambridge University Press, 2020.
  • LeCun et al. [1995] Y. LeCun, Y. Bengio, et al. Convolutional networks for images, speech, and time series. The Handbook of Brain Theory and Neural Networks, 3361(10):1995, 1995.
  • LeCun et al. [1998] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • Li et al. [2010] L. Li, W. Chu, J. Langford, and R. E. Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, pages 661–670, 2010.
  • Li et al. [2017] L. Li, Y. Lu, and D. Zhou. Provably optimal algorithms for generalized linear contextual bandits. In International Conference on Machine Learning, pages 2071–2080. PMLR, 2017.
  • Li et al. [2016] S. Li, A. Karatzoglou, and C. Gentile. Collaborative filtering bandits. In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 539–548, 2016.
  • Lu and Van Roy [2017] X. Lu and B. Van Roy. Ensemble sampling. arXiv preprint arXiv:1705.07347, 2017.
  • Riquelme et al. [2018] C. Riquelme, G. Tucker, and J. Snoek. Deep bayesian bandits showdown: An empirical comparison of bayesian deep networks for thompson sampling. arXiv preprint arXiv:1802.09127, 2018.
  • Roweis and Saul [2000] S. T. Roweis and L. K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323–2326, 2000.
  • Thompson [1933] W. R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.
  • Valko et al. [2013] M. Valko, N. Korda, R. Munos, I. Flaounas, and N. Cristianini. Finite-time analysis of kernelised contextual bandits. arXiv preprint arXiv:1309.6869, 2013.
  • Wu et al. [2016] Q. Wu, H. Wang, Q. Gu, and H. Wang. Contextual bandits in a collaborative environment. In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 529–538, 2016.
  • Xu et al. [2020] P. Xu, Z. Wen, H. Zhao, and Q. Gu. Neural contextual bandits with deep representation and shallow exploration. arXiv preprint arXiv:2012.01780, 2020.
  • Zhang et al. [2021] W. Zhang, D. Zhou, L. Li, and Q. Gu. Neural thompson sampling. In International Conference on Learning Representations, 2021.
  • Zhou et al. [2020] D. Zhou, L. Li, and Q. Gu. Neural contextual bandits with ucb-based exploration. In International Conference on Machine Learning, pages 11492–11502. PMLR, 2020.

Appendix A Datasets and Setup

MNIST dataset. MNIST is a well-known image dataset [LeCun et al., 1998] for the 10-class classification problem. Following the evaluation setting of existing works [Valko et al., 2013, Zhou et al., 2020, Zhang et al., 2021], we transform this classification problem into a bandit problem. Consider an image $\mathbf{x} \in \mathbb{R}^d$ that we aim to classify into one of 10 classes. In each round, the image $\mathbf{x}$ is transformed into 10 arms presented to the learner, represented by the vectors $\mathbf{x}_1 = (\mathbf{x}, \mathbf{0}, \dots, \mathbf{0}), \mathbf{x}_2 = (\mathbf{0}, \mathbf{x}, \dots, \mathbf{0}), \dots, \mathbf{x}_{10} = (\mathbf{0}, \mathbf{0}, \dots, \mathbf{x}) \in \mathbb{R}^{10d}$. The reward is defined as 1 if the index of the selected arm matches the index of $\mathbf{x}$'s ground-truth class; otherwise, the reward is 0.
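A short sketch of this arm construction is given below; the function names are illustrative.

```python
import numpy as np

def mnist_arms(x, num_classes=10):
    """Build the 10 context vectors x_1 = (x, 0, ..., 0), ..., x_10 = (0, ..., 0, x)."""
    d = x.shape[0]
    arms = np.zeros((num_classes, num_classes * d))
    for k in range(num_classes):
        arms[k, k * d:(k + 1) * d] = x
    return arms

def mnist_reward(chosen_index, true_class):
    """Reward is 1 if the selected arm's index matches the ground-truth class, else 0."""
    return 1.0 if chosen_index == true_class else 0.0
```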

Yelp (https://www.yelp.com/dataset) and Movielens [Harper and Konstan, 2015] datasets. Yelp is a dataset released in the Yelp dataset challenge, which consists of 4.7 million rating entries for $1.57 \times 10^5$ restaurants by 1.18 million users. MovieLens is a dataset consisting of 25 million ratings between $1.6 \times 10^5$ users and $6 \times 10^4$ movies. We build the rating matrix by choosing the top 2000 users and top 10000 restaurants (movies) and use singular-value decomposition (SVD) to extract a 10-dimensional feature vector for each user and each restaurant (movie). In these two datasets, the goal of the bandit algorithm is to choose the restaurants (movies) with bad ratings. We generate the reward from the stars given to the restaurant (movie) by the users. In each rating record, if the user scores a restaurant (movie) less than 2 stars (out of 5 stars), its reward is 1; otherwise, its reward is 0. In each round, we form 10 arms as follows: we randomly choose one restaurant (movie) with reward 1 and randomly pick 9 others with reward 0; the representation of each arm is then the concatenation of the corresponding user feature vector and restaurant (movie) feature vector.

Disin [Ahmed et al., 2018] dataset. Disin is a fake news dataset on Kaggle (https://www.kaggle.com/clmentbisaillon/fake-and-real-news-dataset) including 12600 fake news articles and 12600 truthful news articles, where each article is represented by its text. To transform the text into vectors, we use the approach of [Fu and He, 2021] to represent each article by a 300-dimensional vector. Similarly, we form a 10-arm pool in each round, where 9 real news articles and 1 fake news article are randomly selected. If the fake news article is selected, the reward is 1; otherwise, the reward is 0.

Configurations. For LinUCB, following [Li et al., 2010], we do a grid search for the exploration constant $\alpha$ over $(0.01, 0.1, 1)$, which tunes the scale of the UCB. For KernelUCB [Valko et al., 2013], we use the radial basis function kernel and stop adding contexts after 1000 rounds, following [Valko et al., 2013, Zhou et al., 2020]. For the regularization parameter $\lambda$ and exploration parameter $\nu$ in KernelUCB, we grid-search $\lambda$ over $(0.1, 1, 10)$ and $\nu$ over $(0.01, 0.1, 1)$. For NeuralUCB and NeuralTS, following the settings of [Zhou et al., 2020, Zhang et al., 2021], we use the exploitation network $f_1$ and grid-search the exploration parameter $\nu$ over $(0.001, 0.01, 0.1, 1)$ and the regularization parameter $\lambda$ over $(0.01, 0.1, 1)$. For NeuralEpsilon, we use the same neural network $f_1$ and grid-search the exploration probability $\epsilon$ over $(0.01, 0.1, 0.2)$. For NeuralUCB/TS, following their settings, since storing and computing the whole gradient matrix is computationally expensive, we use a diagonal matrix approximation. For all neural networks, we grid-search the learning rate over $(0.01, 0.001, 0.0005, 0.0001)$. For all grid-searched parameters, we choose the best value for the comparison and report the averaged results of 10 runs for all methods.

Create exploration samples for $f_{2}$. When the selected arm is not optimal in a round, the optimal arm must be among the remaining arms, so the exploration effort should be directed toward the remaining arms. Based on this fact, we create additional training samples for the exploration network $f_{2}$ in practice. For example, in the setting of binary rewards (0 or 1), if the received reward is $r_{t}=0$ after selecting $\mathbf{x}_{t}$, we add new training samples $(\mathbf{x}_{t,i},c_{r})$ to $f_{2}$ for each $i\in[n]$ with $\mathbf{x}_{t,i}\neq\mathbf{x}_{t}$, where $c_{r}\in(0,1)$ is usually a small constant. This measure further improves the performance of EE-Net in our experiments.
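A minimal sketch of this heuristic in the binary-reward setting (the buffer layout and the value of $c_r$ are illustrative; in EE-Net the stored input for $f_2$ is the gradient-based representation of the arm):

```python
def add_exploration_samples(arms, selected_idx, received_reward, f2_buffer, c_r=0.1):
    """If the selected arm got reward 0, any of the remaining arms could be optimal,
    so append each of them to f2's training buffer with a small positive label c_r."""
    if received_reward == 0:
        for i, arm in enumerate(arms):
            if i != selected_idx:
                f2_buffer.append((arm, c_r))
```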

Appendix B Ablation Study

Figure 4: Ablation study on label function yy for f2f_{2}. EE-Net denotes y1=rβˆ’f1y_{1}=r-f_{1}, EE-Net-abs denotes y2=|rβˆ’f1|y_{2}=|r-f_{1}|, and EE-Net-ReLU denotes y3=ReLU​(rβˆ’f1)y_{3}=\text{ReLU}(r-f_{1}). EE-Net shows the best performance on these two datasets.
Figure 5: Ablation study on the decision maker $f_{3}$. EE-Net-Lin denotes $f_{3}=f_{1}+f_{2}$, EE-Net-NoLin denotes the non-linear variant where $f_{3}$ is a neural network (2 layers, width 20), and EE-Net denotes the hybrid variant where $f_{3}=f_{1}+f_{2}$ if $t\leq 500$ and $f_{3}$ is the neural network if $t>500$. EE-Net shows the most stable and best performance.

In this section, we conduct an ablation study on the label function $y$ for the exploration network $f_{2}$ and on the setting of the decision maker $f_{3}$, using two representative datasets, MovieLens and MNIST.

Label function $y$. In this paper, we use $y_{1}=r-f_{1}$, the potential gain of an arm, as the label for $f_{2}$. Moreover, we evaluate two other intuitive forms, $y_{2}=|r-f_{1}|$ and $y_{3}=\text{ReLU}(r-f_{1})$. Figure 4 shows the regret with different $y$, where "EE-Net" denotes our method with the default $y_{1}$, "EE-Net-abs" the one with $y_{2}$, and "EE-Net-ReLU" the one with $y_{3}$. On the MovieLens and MNIST datasets, EE-Net slightly outperforms EE-Net-abs and EE-Net-ReLU. Indeed, $y_{1}$ represents both positive and negative potential gains, so $f_{2}$ tends to score arms with positive gain higher and arms with negative gain lower. In contrast, $y_{2}$ treats positive and negative potential gains identically, which weakens the discriminative ability, and $y_{3}$ captures positive gains while discarding differences among negative gains. Therefore, $y_{1}$ is usually the most effective choice empirically. The three candidate labels are summarized in the sketch below.
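A small sketch of the three labels compared in Figure 4 (the function name is our own):

```python
import numpy as np

def exploration_label(r: float, f1_pred: float, mode: str = "signed") -> float:
    """Candidate labels for the exploration network f2."""
    residual = r - f1_pred
    if mode == "signed":            # y1 = r - f1 (EE-Net default)
        return residual
    if mode == "abs":               # y2 = |r - f1| (EE-Net-abs)
        return float(np.abs(residual))
    if mode == "relu":              # y3 = ReLU(r - f1) (EE-Net-ReLU)
        return float(np.maximum(residual, 0.0))
    raise ValueError(f"unknown mode: {mode}")
```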

Setting of $f_{3}$. $f_{3}$ can be set as either a linear or a non-linear function. In the experiments, we test the simple linear function $f_{3}=f_{1}+f_{2}$, denoted by "EE-Net-Lin", and a non-linear function represented by a 2-layer, 20-width fully-connected neural network, denoted by "EE-Net-NoLin". For the default hybrid setting, denoted by "EE-Net", $f_{3}=f_{1}+f_{2}$ when the round $t\leq 500$; otherwise, $f_{3}$ is the neural network. Figure 5 reports the regret of these three modes. EE-Net achieves the best performance with a small standard deviation. In contrast, EE-Net-NoLin obtains the worst performance and the largest standard deviation. Notice, however, that EE-Net-NoLin can achieve the best performance in certain runs (the green shaded area) but is erratic: in the early phase, without enough training samples, EE-Net-NoLin relies strongly on the quality of the collected samples. With appropriate training samples, gradient descent can lead $f_{3}$ toward the global optimum; with misleading training samples, gradient descent can drive $f_{3}$ away from it. Therefore, EE-Net-NoLin shows very unstable performance. In contrast, EE-Net-Lin is inspired by the UCB strategy, i.e., exploitation plus exploration, and exhibits stable performance. To combine their advantages, we propose the hybrid approach, EE-Net, which achieves the best performance with strong stability (a minimal sketch of this hybrid decision maker is given below).
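A minimal PyTorch-style sketch of the hybrid decision maker (the class name is ours; the 2-layer width-20 MLP and the round-500 switch follow the description above, everything else is an illustrative assumption):

```python
import torch
import torch.nn as nn

class HybridDecisionMaker(nn.Module):
    """Hybrid f3: additive f1 + f2 for the first `switch_round` rounds,
    then a small MLP over the pair (f1, f2)."""
    def __init__(self, switch_round: int = 500, width: int = 20):
        super().__init__()
        self.switch_round = switch_round
        self.mlp = nn.Sequential(nn.Linear(2, width), nn.ReLU(), nn.Linear(width, 1))

    def forward(self, f1_score: torch.Tensor, f2_score: torch.Tensor, t: int) -> torch.Tensor:
        if t <= self.switch_round:                        # linear (EE-Net-Lin) phase
            return f1_score + f2_score
        pair = torch.stack([f1_score, f2_score], dim=-1)  # non-linear (EE-Net-NoLin) phase
        return self.mlp(pair).squeeze(-1)
```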

Appendix C Proof of Theorem 1

In this section, we provide the proof of Theorem 1 and related lemmas.

Proof.

For brevity, for the selected arm 𝐱t\mathbf{x}_{t} in round tt, let h​(𝐱t)h(\mathbf{x}_{t}) be its expected reward and 𝐱tβˆ—=arg⁑max𝐱t,i,i∈[n]⁑h​(𝐱t,i)\mathbf{x}_{t}^{\ast}=\arg\max_{\mathbf{x}_{t,i},i\in[n]}h(\mathbf{x}_{t,i}) be the optimal arm in round tt. Let f3​(𝐱;𝜽tβˆ’1)=f2​(ϕ​(β–½πœ½tβˆ’11​f1​(𝐱;𝜽tβˆ’11));𝜽tβˆ’12)+f1​(𝐱;𝜽tβˆ’11).f_{3}(\mathbf{x};\boldsymbol{\theta}_{t-1})=f_{2}\left(\phi(\triangledown_{\boldsymbol{\theta}^{1}_{t-1}}f_{1}(\mathbf{x};\boldsymbol{\theta}^{1}_{t-1}));\boldsymbol{\theta}_{t-1}^{2}\right)+f_{1}(\mathbf{x};\boldsymbol{\theta}_{t-1}^{1}).

Note that (𝐱t,i,rt,i)βˆΌπ’Ÿ(\mathbf{x}_{t,i},r_{t,i})\sim\mathcal{D}, for each i∈[n]i\in[n]. Then, the expected regret of round tt is given by

Rt\displaystyle R_{t} =𝔼𝐱t,i,i∈[n]​[h​(𝐱tβˆ—)βˆ’h​(𝐱t)]\displaystyle=\mathbb{E}_{\mathbf{x}_{t,i},i\in[n]}[h(\mathbf{x}_{t}^{\ast})-h(\mathbf{x}_{t})] (C.1)
=𝔼𝐱t,i,i∈[n]​[h​(𝐱tβˆ—)βˆ’f3​(𝐱t)+f3​(𝐱t)βˆ’h​(𝐱t)]\displaystyle=\mathbb{E}_{\mathbf{x}_{t,i},i\in[n]}[h(\mathbf{x}_{t}^{\ast})-f_{3}(\mathbf{x}_{t})+f_{3}(\mathbf{x}_{t})-h(\mathbf{x}_{t})]
\displaystyle\leq\mathbb{E}_{\mathbf{x}_{t,i},i\in[n]}[\underbrace{h(\mathbf{x}_{t}^{\ast})-f_{3}(\mathbf{x}_{t}^{\ast})}_{I_{1}}+f_{3}(\mathbf{x}_{t})-h(\mathbf{x}_{t})]
=𝔼𝐱t,i,i∈[n]​[h​(𝐱tβˆ—)βˆ’f3​(𝐱tβˆ—;𝜽tβˆ’1)+f3​(𝐱t;𝜽tβˆ’1)βˆ’h​(𝐱t)]\displaystyle=\mathbb{E}_{\mathbf{x}_{t,i},i\in[n]}[h(\mathbf{x}_{t}^{\ast})-f_{3}(\mathbf{x}_{t}^{\ast};\boldsymbol{\theta}_{t-1})+f_{3}(\mathbf{x}_{t};\boldsymbol{\theta}_{t-1})-h(\mathbf{x}_{t})]
=(a)​𝔼𝐱t,i,i∈[n]​[h​(𝐱tβˆ—)βˆ’f3​(𝐱tβˆ—;𝜽tβˆ’1βˆ—)+f3​(𝐱tβˆ—;𝜽tβˆ’1βˆ—)βˆ’f3​(𝐱tβˆ—;𝜽tβˆ’1)+f3​(𝐱t;𝜽tβˆ’1)βˆ’h​(𝐱t)]\displaystyle\overset{(a)}{=}\mathbb{E}_{\mathbf{x}_{t,i},i\in[n]}[h(\mathbf{x}_{t}^{\ast})-f_{3}(\mathbf{x}_{t}^{\ast};\boldsymbol{\theta}_{t-1}^{\ast})+f_{3}(\mathbf{x}_{t}^{\ast};\boldsymbol{\theta}_{t-1}^{\ast})-f_{3}(\mathbf{x}_{t}^{\ast};\boldsymbol{\theta}_{t-1})+f_{3}(\mathbf{x}_{t};\boldsymbol{\theta}_{t-1})-h(\mathbf{x}_{t})]
\displaystyle=\mathbb{E}_{\mathbf{x}_{t,i},i\in[n]}[h(\mathbf{x}_{t}^{\ast})-f_{3}(\mathbf{x}_{t}^{\ast};\boldsymbol{\theta}_{t-1}^{\ast})]+\mathbb{E}_{\mathbf{x}_{t,i},i\in[n]}[f_{3}(\mathbf{x}_{t}^{\ast};\boldsymbol{\theta}_{t-1}^{\ast})-f_{3}(\mathbf{x}_{t}^{\ast};\boldsymbol{\theta}_{t-1})]
+𝔼𝐱t,i,i∈[n]​[f3​(𝐱t;𝜽tβˆ’1)βˆ’h​(𝐱t)]\displaystyle\ \ \ \ +\mathbb{E}_{\mathbf{x}_{t,i},i\in[n]}[f_{3}(\mathbf{x}_{t};\boldsymbol{\theta}_{t-1})-h(\mathbf{x}_{t})]
≀𝔼(𝐱t,i,rt,i),i∈[n]​[|f2​(ϕ​(β–½πœ½tβˆ’11,βˆ—β€‹f1​(𝐱tβˆ—;𝜽tβˆ’11,βˆ—));𝜽tβˆ’12,βˆ—)βˆ’(rtβˆ—βˆ’f1​(𝐱tβˆ—;𝜽tβˆ’11,βˆ—))|]⏟I2\displaystyle\leq\underbrace{\underset{(\mathbf{x}_{t,i},r_{t,i}),i\in[n]}{\mathbb{E}}\left[\left|f_{2}\left(\phi(\triangledown_{\boldsymbol{\theta}^{1,\ast}_{t-1}}f_{1}(\mathbf{x}_{t}^{\ast};\boldsymbol{\theta}^{1,\ast}_{t-1}));\boldsymbol{\theta}_{t-1}^{2,\ast}\right)-\left(r_{t}^{\ast}-f_{1}(\mathbf{x}_{t}^{\ast};\boldsymbol{\theta}_{t-1}^{1,\ast})\right)\right|\right]}_{I_{2}}
+𝔼𝐱t,i,i∈[n]​[|f2​(ϕ​(β–½πœ½tβˆ’11,βˆ—β€‹f1​(𝐱tβˆ—;𝜽tβˆ’11,βˆ—));𝜽tβˆ’12,βˆ—)βˆ’f2​(ϕ​(β–½πœ½tβˆ’11​f1​(𝐱tβˆ—;𝜽tβˆ’11));𝜽tβˆ’12)|]⏟I3\displaystyle\ \ \ +\underbrace{\mathbb{E}_{\mathbf{x}_{t,i},i\in[n]}\left[\left|f_{2}\left(\phi(\triangledown_{\boldsymbol{\theta}^{1,\ast}_{t-1}}f_{1}(\mathbf{x}_{t}^{\ast};\boldsymbol{\theta}^{1,\ast}_{t-1}));\boldsymbol{\theta}_{t-1}^{2,\ast}\right)-f_{2}\left(\phi(\triangledown_{\boldsymbol{\theta}^{1}_{t-1}}f_{1}(\mathbf{x}_{t}^{\ast};\boldsymbol{\theta}^{1}_{t-1}));\boldsymbol{\theta}_{t-1}^{2}\right)\right|\right]}_{I_{3}}
+𝔼𝐱t,i,i∈[n]​[|f1​(𝐱tβˆ—;𝜽tβˆ’11,βˆ—)βˆ’f1​(𝐱tβˆ—;𝜽tβˆ’11)|]⏟I4\displaystyle\ \ \ +\underbrace{\mathbb{E}_{\mathbf{x}_{t,i},i\in[n]}\left[\left|f_{1}(\mathbf{x}_{t}^{\ast};\boldsymbol{\theta}_{t-1}^{1,\ast})-f_{1}(\mathbf{x}_{t}^{\ast};\boldsymbol{\theta}_{t-1}^{1})\right|\right]}_{I_{4}}
+𝔼(𝐱t,i,rt,i),i∈[n]​[|f2​(ϕ​(β–½πœ½tβˆ’11​f1​(𝐱t;𝜽tβˆ’11));𝜽tβˆ’12)βˆ’(rtβˆ’f1​(𝐱t;𝜽tβˆ’11))|]⏟I5\displaystyle\ \ \ +\underbrace{\underset{(\mathbf{x}_{t,i},r_{t,i}),i\in[n]}{\mathbb{E}}\left[\left|f_{2}\left(\phi(\triangledown_{\boldsymbol{\theta}^{1}_{t-1}}f_{1}(\mathbf{x}_{t};\boldsymbol{\theta}^{1}_{t-1}));\boldsymbol{\theta}_{t-1}^{2}\right)-\left(r_{t}-f_{1}(\mathbf{x}_{t};\boldsymbol{\theta}_{t-1}^{1})\right)\right|\right]}_{I_{5}}

where $I_{1}$ holds because $f_{3}(\mathbf{x}_{t})=\max_{i\in[n]}f_{3}(\mathbf{x}_{t,i})$ implies $f_{3}(\mathbf{x}_{t})-f_{3}(\mathbf{x}_{t}^{\ast})\geq 0$, and $(a)$ introduces the auxiliary parameters $\boldsymbol{\theta}_{t-1}^{\ast}=(\boldsymbol{\theta}_{t-1}^{1,\ast},\boldsymbol{\theta}_{t-1}^{2,\ast})$, which will be suitably chosen.

Because $(\mathbf{x}_{t,i},r_{t,i})\sim\mathcal{D}$ for each $i\in[n]$, applying Lemma C.1 and Corollary C.1, with probability at least $1-\delta$ over the randomness of the initialization, for $I_{2}$ and $I_{5}$ we have

I2,I5≀2​ϡt+π’ͺ​(3​L2​t)+(1+2​ξ)​2​log⁑(π’ͺ​(t​n)/Ξ΄)t,I_{2},I_{5}\leq\sqrt{\frac{2\epsilon}{t}}+\mathcal{O}\left(\frac{3L}{\sqrt{2t}}\right)+(1+2\xi)\sqrt{\frac{2\log(\mathcal{O}(tn)/\delta)}{t}}, (C.2)

where

ΞΎ=π’ͺ​(1)+π’ͺ​(t3​n​L​log⁑mρ​m)+π’ͺ​(t4​n​L2​log11/6⁑mρ4/3​m1/6)\xi=\mathcal{O}(1)+\mathcal{O}\left(\frac{t^{3}nL\log m}{\rho\sqrt{m}}\right)+\mathcal{O}\left(\frac{t^{4}nL^{2}\log^{11/6}m}{\rho^{4/3}m^{1/6}}\right) (C.3)

and we apply a union bound over the rounds $\tau\in[t]$ so that Lemma C.1 and Corollary C.1 hold for every round $\tau\in[t]$.

For I3,I4I_{3},I_{4}, based on Lemma C.2, with probability at least 1βˆ’Ξ΄1-\delta, we have

I3,I4≀(1+π’ͺ​(t​L3​log5/6⁑mρ1/3​m1/6))​π’ͺ​(L​t3ρ​m​log⁑m)+π’ͺ​(t4​L2​log11/6⁑mρ4/3​m1/6):=ΞΎ1.I_{3},I_{4}\leq\left(1+\mathcal{O}\left(\frac{tL^{3}\log^{5/6}m}{\rho^{1/3}m^{1/6}}\right)\right)\mathcal{O}\left(\frac{Lt^{3}}{\rho\sqrt{m}}\log m\right)+\mathcal{O}\left(\frac{t^{4}L^{2}\log^{11/6}m}{\rho^{4/3}m^{1/6}}\right):=\xi_{1}. (C.4)

To sum up, with probability at least 1βˆ’Ξ΄1-\delta, we have

Rt≀2​(2​ϡt+π’ͺ​(3​L2​t)+(1+2​ξ)​2​log⁑(π’ͺ​(t​n)/Ξ΄)t+ΞΎ1).R_{t}\leq 2\left(\sqrt{\frac{2\epsilon}{t}}+\mathcal{O}\left(\frac{3L}{\sqrt{2t}}\right)+(1+2\xi)\sqrt{\frac{2\log(\mathcal{O}(tn)/\delta)}{t}}+\xi_{1}\right). (C.5)

Then the expected regret over $T$ rounds is

𝐑T\displaystyle\mathbf{R}_{T} =βˆ‘t=1TRt\displaystyle=\sum_{t=1}^{T}R_{t} (C.6)
≀2β€‹βˆ‘t=1T(2​ϡt+π’ͺ​(3​L2​t)+(1+2​ξ)​2​log⁑(π’ͺ​(t​n)/Ξ΄)t+ΞΎ1)\displaystyle\leq 2\sum_{t=1}^{T}\left(\sqrt{\frac{2\epsilon}{t}}+\mathcal{O}\left(\frac{3L}{\sqrt{2t}}\right)+(1+2\xi)\sqrt{\frac{2\log(\mathcal{O}(tn)/\delta)}{t}}+\xi_{1}\right)
≀(2​Tβˆ’1)​2​2​ϡ+(2​Tβˆ’1)​3​2​π’ͺ​(L)+2​(1+2​ξ)​(2​Tβˆ’1)​2​log⁑(π’ͺ​(T​n)/Ξ΄)⏟I2+π’ͺ​(1)\displaystyle\leq\underbrace{(2\sqrt{T}-1)2\sqrt{2\epsilon}+(2\sqrt{T}-1)3\sqrt{2}\mathcal{O}(L)+2(1+2\xi)(2\sqrt{T}-1)\sqrt{2\log(\mathcal{O}(Tn)/\delta)}}_{I_{2}}+\mathcal{O}(1)
=(2​Tβˆ’1)​(2​2​ϡ+3​2​π’ͺ​(L))+2​(1+2​ξ)​(2​Tβˆ’1)​2​log⁑(π’ͺ​(T​n)/Ξ΄)+π’ͺ​(1)\displaystyle=(2\sqrt{T}-1)(2\sqrt{2\epsilon}+3\sqrt{2}\mathcal{O}(L))+2(1+2\xi)(2\sqrt{T}-1)\sqrt{2\log(\mathcal{O}(Tn)/\delta)}+\mathcal{O}(1)

where $I_{2}$ is because $\sum_{t=1}^{T}\frac{1}{\sqrt{t}}\leq\int_{1}^{T}\frac{1}{\sqrt{t}}\,dt+1=2\sqrt{T}-1$ [Chlebus, 2009], and the bound on $\xi_{1}$ is due to the choice of $m$: since $\xi_{1}=\widetilde{\mathcal{O}}(1/m^{1/6})$ and $m\geq\widetilde{\Omega}(\text{poly}(T))$, $m$ can be chosen so that $T\xi_{1}=\widetilde{\mathcal{O}}(T/m^{1/6})\leq\mathcal{O}(1)$.

Then, when ϡ≀1/T\epsilon\leq 1/T, we have

𝐑T≀π’ͺ​(1)+(2​Tβˆ’1)​3​2​π’ͺ​(L)+2​(1+2​ξ)​(2​Tβˆ’1)​2​log⁑(π’ͺ​(T​n)/Ξ΄).\mathbf{R}_{T}\leq\mathcal{O}(1)+(2\sqrt{T}-1)3\sqrt{2}\mathcal{O}(L)+2(1+2\xi)(2\sqrt{T}-1)\sqrt{2\log(\mathcal{O}(Tn)/\delta)}. (C.7)

By the choice of $m$, we have $\xi\leq\mathcal{O}(1)$. Therefore, we have

𝐑T≀π’ͺ​(1)+(2​Tβˆ’1)​3​2​π’ͺ​(L)+π’ͺ​((2​Tβˆ’1)​2​log⁑(π’ͺ​(T​n)/Ξ΄)).\mathbf{R}_{T}\leq\mathcal{O}(1)+(2\sqrt{T}-1)3\sqrt{2}\mathcal{O}(L)+\mathcal{O}\left((2\sqrt{T}-1)\sqrt{2\log(\mathcal{O}(Tn)/\delta)}\right). (C.8)

The proof is completed. ∎

Lemma C.1.

[Lemma 5.1 restated] For any Ξ΄,ϡ∈(0,1),ρ∈(0,π’ͺ​(1L))\delta,\epsilon\in(0,1),\rho\in(0,\mathcal{O}(\frac{1}{L})), suppose m,Ξ·1,Ξ·2,K1,K2m,\eta_{1},\eta_{2},K_{1},K_{2} satisfy the conditions in Eq. (5.2) and (𝐱τ,i,rΟ„,i)βˆΌπ’Ÿ,βˆ€Ο„βˆˆ[t],i∈[n](\mathbf{x}_{\tau,i},r_{\tau,i})\sim\mathcal{D},\forall\tau\in[t],i\in[n]. Let

𝐱t=arg⁑max𝐱t,i,i∈[n]⁑[f2​(ϕ​(β–½πœ½tβˆ’11​f1​(𝐱t,i;𝜽tβˆ’11));𝜽tβˆ’12)+f1​(𝐱t,i;𝜽tβˆ’11)],\mathbf{x}_{t}=\arg\max_{\mathbf{x}_{t,i},i\in[n]}\left[f_{2}\left(\phi(\triangledown_{\boldsymbol{\theta}^{1}_{t-1}}f_{1}(\mathbf{x}_{t,i};\boldsymbol{\theta}^{1}_{t-1}));\boldsymbol{\theta}_{t-1}^{2}\right)+f_{1}(\mathbf{x}_{t,i};\boldsymbol{\theta}_{t-1}^{1})\right],

and $r_{t}$ is the corresponding reward, given $(\mathbf{x}_{t,i},r_{t,i}),i\in[n]$. Then, with probability at least $1-\delta$ over the randomness of the initialization, it holds that

𝔼(𝐱t,i,rt,i),i∈[n]\displaystyle\underset{(\mathbf{x}_{t,i},r_{t,i}),i\in[n]}{\mathbb{E}} [|f2​(ϕ​(β–½πœ½tβˆ’11​f1​(𝐱t;𝜽tβˆ’11));𝜽tβˆ’12)βˆ’(rtβˆ’f1​(𝐱t;𝜽tβˆ’11))|∣{𝐱τ,rΟ„}Ο„=1tβˆ’1]\displaystyle\left[\left|f_{2}\left(\phi(\triangledown_{\boldsymbol{\theta}^{1}_{t-1}}f_{1}(\mathbf{x}_{t};\boldsymbol{\theta}^{1}_{t-1}));\boldsymbol{\theta}_{t-1}^{2}\right)-\left(r_{t}-f_{1}(\mathbf{x}_{t};\boldsymbol{\theta}_{t-1}^{1})\right)\right|\mid\{\mathbf{x}_{\tau},r_{\tau}\}_{\tau=1}^{t-1}\right] (C.9)
≀2​ϡt+π’ͺ​(3​L2​t)+(1+2​ξ)​2​log⁑(π’ͺ​(t​n/Ξ΄))t,\displaystyle\qquad\qquad\leq\sqrt{\frac{2\epsilon}{t}}+\mathcal{O}\left(\frac{3L}{\sqrt{2t}}\right)+(1+2\xi)\sqrt{\frac{2\log(\mathcal{O}(tn/\delta))}{t}},

where the expectation is also taken over (𝛉tβˆ’11,𝛉tβˆ’12)(\boldsymbol{\theta}_{t-1}^{1},\boldsymbol{\theta}_{t-1}^{2}) that are uniformly drawn from (𝛉^Ο„1,𝛉^Ο„2),Ο„βˆˆ[tβˆ’1](\widehat{\boldsymbol{\theta}}_{\tau}^{1},\widehat{\boldsymbol{\theta}}_{\tau}^{2}),\tau\in[t-1].

Proof.

In this proof, we treat the data collected up to round $t-1$, $\{\mathbf{x}_{\tau},r_{\tau}\}_{\tau=1}^{t-1}$, as the training dataset and derive a generalization bound for it, inspired by Cao and Gu [2019].

For convenience, we write $\mathbf{x}:=\mathbf{x}_{t,i}$ and $r:=r_{t,i}$, noting that the same analysis holds for each $i\in[n]$. Consider the exploration network $f_{2}$ and apply Lemma C.3: with probability at least $1-\delta$, for any $\tau\in[t]$, we have

|f2​(ϕ​(β–½πœ½^Ο„1​f1​(𝐱;𝜽^Ο„1));𝜽^Ο„2)|≀ξ.\left|f_{2}\left(\phi(\triangledown_{\widehat{\boldsymbol{\theta}}_{\tau}^{1}}f_{1}(\mathbf{x};\widehat{\boldsymbol{\theta}}_{\tau}^{1}));\widehat{\boldsymbol{\theta}}_{\tau}^{2}\right)\right|\leq\xi. (C.10)

Similarly, applying Lemma C.3 again, with probability at least $1-\delta$, for any $\tau\in[t]$, we have

|f1​(𝐱;𝜽^Ο„1)|≀ξ|f_{1}(\mathbf{x};\widehat{\boldsymbol{\theta}}_{\tau}^{1})|\leq\xi (C.11)

Because $|r|\leq 1$ for any $r\sim\mathcal{D}_{r}$, combining Eq. (C.10) and (C.11) and applying a union bound, with probability at least $1-2\delta$ over the random initialization, we have

|f2​(ϕ​(β–½πœ½^Ο„1​f1​(𝐱;𝜽^Ο„1));𝜽^Ο„2)βˆ’(rβˆ’f1​(𝐱;𝜽^Ο„1))|≀1+2​ξ.\left|f_{2}\left(\phi(\triangledown_{\widehat{\boldsymbol{\theta}}_{\tau}^{1}}f_{1}(\mathbf{x};\widehat{\boldsymbol{\theta}}_{\tau}^{1}));\widehat{\boldsymbol{\theta}}_{\tau}^{2}\right)-(r-f_{1}(\mathbf{x};\widehat{\boldsymbol{\theta}}_{\tau}^{1}))\right|\leq 1+2\xi. (C.12)

Note that (C.12) is stated for $\mathbf{x}=\mathbf{x}_{\tau,i}$ with a specific $\tau\in[t],i\in[n]$. By a union bound, (C.12) holds for all $\tau\in[t],i\in[n]$ with probability at least $1-nt\delta$. For brevity, let $f_{2}(\mathbf{x};\widehat{\boldsymbol{\theta}}_{\tau}^{2})$ denote $f_{2}\left(\phi(\triangledown_{\widehat{\boldsymbol{\theta}}^{1}_{\tau}}f_{1}(\mathbf{x};\widehat{\boldsymbol{\theta}}_{\tau}^{1}));\widehat{\boldsymbol{\theta}}_{\tau}^{2}\right)$.

Recall that, for each $\tau\in[t-1]$, $\widehat{\boldsymbol{\theta}}_{\tau}^{1}$ and $\widehat{\boldsymbol{\theta}}_{\tau}^{2}$ are the parameters trained on $\{\mathbf{x}_{\tau^{\prime}},r_{\tau^{\prime}}\}_{\tau^{\prime}=1}^{\tau}$ according to Algorithm 1. In round $\tau\in[t]$, let $\mathbf{x}_{\tau}=\arg\max_{\mathbf{x}_{\tau,i},i\in[n]}[f_{1}(\mathbf{x}_{\tau,i};\boldsymbol{\theta}_{\tau-1}^{1})+f_{2}(\mathbf{x}_{\tau,i};\boldsymbol{\theta}_{\tau-1}^{2})]$, given $(\mathbf{x}_{\tau,i},r_{\tau,i})\sim\mathcal{D},i\in[n]$, and let $r_{\tau}$ be the corresponding reward. Let $(\mathbf{x}^{\prime}_{\tau,i},r^{\prime}_{\tau,i})\sim\mathcal{D},i\in[n]$ be shadow samples from the same distribution, let $\mathbf{x}^{\prime}_{\tau}=\arg\max_{\mathbf{x}^{\prime}_{\tau,i},i\in[n]}[f_{1}(\mathbf{x}^{\prime}_{\tau,i};\boldsymbol{\theta}_{\tau-1}^{1})+f_{2}(\mathbf{x}^{\prime}_{\tau,i};\boldsymbol{\theta}_{\tau-1}^{2})]$, and let $r_{\tau}^{\prime}$ be the corresponding reward. Then, we define

VΟ„\displaystyle V_{\tau} :=𝔼(𝐱τ,iβ€²,rΟ„,iβ€²),i∈[n]​[|f2​(𝐱τ′;𝜽^Ο„βˆ’12)βˆ’(rΟ„β€²βˆ’f1​(𝐱τ′;𝜽^Ο„βˆ’11))|]\displaystyle:=\underset{(\mathbf{x}^{\prime}_{\tau,i},r^{\prime}_{\tau,i}),i\in[n]}{\mathbb{E}}\left[\left|f_{2}(\mathbf{x}^{\prime}_{\tau};\widehat{\boldsymbol{\theta}}_{\tau-1}^{2})-\left(r^{\prime}_{\tau}-f_{1}(\mathbf{x}^{\prime}_{\tau};\widehat{\boldsymbol{\theta}}_{\tau-1}^{1})\right)\right|\right] (C.13)
βˆ’|f2​(𝐱τ;𝜽^Ο„βˆ’12)βˆ’(rΟ„βˆ’f1​(𝐱τ;𝜽^Ο„βˆ’11))|.\displaystyle\qquad\qquad-\left|f_{2}(\mathbf{x}_{\tau};\widehat{\boldsymbol{\theta}}_{\tau-1}^{2})-\left(r_{\tau}-f_{1}(\mathbf{x}_{\tau};\widehat{\boldsymbol{\theta}}_{\tau-1}^{1})\right)\right|~{}.

Then, as (𝐱τ,i,rΟ„,i)βˆΌπ’Ÿ,i∈[n](\mathbf{x}_{\tau,i},r_{\tau,i})\sim\mathcal{D},i\in[n], based on the definition of (𝐱τ,rΟ„),(\mathbf{x}_{\tau},r_{\tau}), we have

𝔼​[VΟ„|π…Ο„βˆ’1]\displaystyle\mathbb{E}[V_{\tau}|\mathbf{F}_{\tau-1}] =𝔼(𝐱τ,iβ€²,rΟ„,iβ€²),i∈[n]​[|f2​(𝐱τ′;𝜽^Ο„βˆ’12)βˆ’(rΟ„β€²βˆ’f1​(𝐱τ′;𝜽^Ο„βˆ’11))|βˆ£π…Ο„βˆ’1]\displaystyle=\underset{(\mathbf{x}^{\prime}_{\tau,i},r^{\prime}_{\tau,i}),i\in[n]}{\mathbb{E}}\left[\left|f_{2}\left(\mathbf{x}^{\prime}_{\tau};\widehat{\boldsymbol{\theta}}_{\tau-1}^{2}\right)-\left(r^{\prime}_{\tau}-f_{1}(\mathbf{x}^{\prime}_{\tau};\widehat{\boldsymbol{\theta}}_{\tau-1}^{1})\right)\right|\mid\mathbf{F}_{\tau-1}\right] (C.14)
βˆ’π”Ό(𝐱τ,i,rΟ„,i),i∈[n]​[|f2​(𝐱τ;𝜽^Ο„βˆ’12)βˆ’(rΟ„βˆ’f1​(𝐱τ;𝜽^Ο„βˆ’11))||π…Ο„βˆ’1]\displaystyle\quad-\underset{(\mathbf{x}_{\tau,i},r_{\tau,i}),i\in[n]}{\mathbb{E}}\left[\left|f_{2}\left(\mathbf{x}_{\tau};\widehat{\boldsymbol{\theta}}_{\tau-1}^{2}\right)-\left(r_{\tau}-f_{1}(\mathbf{x}_{\tau};\widehat{\boldsymbol{\theta}}_{\tau-1}^{1})\right)\right||\mathbf{F}_{\tau-1}\right]
=0,\displaystyle=0~{},

where π…Ο„βˆ’1\mathbf{F}_{\tau-1} denotes the Οƒ\sigma-algebra generated by the history β„‹Ο„βˆ’1={𝐱τ′,rΟ„β€²}Ο„β€²=1Ο„βˆ’1{\cal H}_{\tau-1}=\{\mathbf{x}_{\tau^{\prime}},r_{\tau^{\prime}}\}_{\tau^{\prime}=1}^{\tau-1}.

Moreover, we have

1tβ€‹βˆ‘Ο„=1tVΟ„\displaystyle\frac{1}{t}\sum_{\tau=1}^{t}V_{\tau} =1tβ€‹βˆ‘Ο„=1t𝔼(𝐱τ,iβ€²,rΟ„,iβ€²),i∈[n]​[|f2​(𝐱τ′;𝜽^Ο„βˆ’12)βˆ’(rΟ„β€²βˆ’f1​(𝐱τ′;𝜽^Ο„βˆ’11))|]\displaystyle=\frac{1}{t}\sum_{\tau=1}^{t}\underset{(\mathbf{x}^{\prime}_{\tau,i},r^{\prime}_{\tau,i}),i\in[n]}{\mathbb{E}}\left[\left|f_{2}(\mathbf{x}^{\prime}_{\tau};\widehat{\boldsymbol{\theta}}_{\tau-1}^{2})-\left(r^{\prime}_{\tau}-f_{1}(\mathbf{x}^{\prime}_{\tau};\widehat{\boldsymbol{\theta}}_{\tau-1}^{1})\right)\right|\right]
βˆ’1tβ€‹βˆ‘Ο„=1t|f2​(𝐱τ;𝜽^Ο„βˆ’12)βˆ’(rΟ„βˆ’f1​(𝐱τ;𝜽^Ο„βˆ’11))|\displaystyle\quad-\frac{1}{t}\sum_{\tau=1}^{t}\left|f_{2}\left(\mathbf{x}_{\tau};\widehat{\boldsymbol{\theta}}_{\tau-1}^{2}\right)-\left(r_{\tau}-f_{1}(\mathbf{x}_{\tau};\widehat{\boldsymbol{\theta}}_{\tau-1}^{1})\right)\right|

Since $\{V_{\tau}\}_{\tau=1}^{t}$ is a martingale difference sequence, applying the Hoeffding–Azuma inequality (inspired by Lemma 1 in [Cesa-Bianchi et al., 2001]), with probability at least $1-3\delta$, we have

\displaystyle\mathbb{P}\left[\frac{1}{t}\sum_{\tau=1}^{t}V_{\tau}-\underbrace{\frac{1}{t}\sum_{\tau=1}^{t}\mathbb{E}[V_{\tau}|\mathbf{F}_{\tau-1}]}_{I_{1}}>\underbrace{(1+2\xi)}_{I_{2}}\sqrt{\frac{2\log(1/\delta)}{t}}\right]\leq\delta \qquad (C.15)
⇒ℙ​[1tβ€‹βˆ‘Ο„=1tVΟ„>(1+2​ξ)​2​log⁑(1/Ξ΄)t]\displaystyle\Rightarrow\qquad\qquad\mathbb{P}\left[\frac{1}{t}\sum_{\tau=1}^{t}V_{\tau}>(1+2\xi)\sqrt{\frac{2\log(1/\delta)}{t}}\right] ≀δ,\displaystyle\leq\delta~{},

where I1=0I_{1}=0 according to (C.14) and I2I_{2} is because of (C.12).

According to Algorithm 1, (𝜽tβˆ’11,𝜽tβˆ’12)(\boldsymbol{\theta}_{t-1}^{1},\boldsymbol{\theta}_{t-1}^{2}) is uniformly drawn from {(𝜽^Ο„1,𝜽^Ο„2)}Ο„=0tβˆ’1\{(\widehat{\boldsymbol{\theta}}_{\tau}^{1},\widehat{\boldsymbol{\theta}}_{\tau}^{2})\}_{\tau=0}^{t-1}. Thus, with probability 1βˆ’3​δ1-3\delta, we have

𝔼(𝐱t,iβ€²,rt,iβ€²),i∈[n]​𝔼(𝜽tβˆ’11,𝜽tβˆ’12)​[|f2​(𝐱tβ€²;𝜽tβˆ’12)βˆ’(rtβ€²βˆ’f1​(𝐱tβ€²;𝜽tβˆ’11))|]\displaystyle\underset{(\mathbf{x}^{\prime}_{t,i},r^{\prime}_{t,i}),i\in[n]}{\mathbb{E}}~{}\underset{(\boldsymbol{\theta}_{t-1}^{1},\boldsymbol{\theta}_{t-1}^{2})}{\mathbb{E}}\left[\left|f_{2}\left(\mathbf{x}^{\prime}_{t};\boldsymbol{\theta}_{t-1}^{2}\right)-\left(r^{\prime}_{t}-f_{1}(\mathbf{x}^{\prime}_{t};\boldsymbol{\theta}_{t-1}^{1})\right)\right|\right] (C.16)
=1tβ€‹βˆ‘Ο„=1t𝔼(𝐱τ,iβ€²,rΟ„,iβ€²),i∈[n]​[|f2​(𝐱τ′;𝜽^Ο„βˆ’12)βˆ’(rΟ„β€²βˆ’f1​(𝐱τ′;𝜽^Ο„βˆ’11))|]\displaystyle=\frac{1}{t}\sum_{\tau=1}^{t}\mathbb{E}_{(\mathbf{x}^{\prime}_{\tau,i},r^{\prime}_{\tau,i}),i\in[n]}\left[\left|f_{2}\left(\mathbf{x}^{\prime}_{\tau};\widehat{\boldsymbol{\theta}}_{\tau-1}^{2}\right)-\left(r^{\prime}_{\tau}-f_{1}(\mathbf{x}^{\prime}_{\tau};\widehat{\boldsymbol{\theta}}_{\tau-1}^{1})\right)\right|\right]
≀1tβ€‹βˆ‘Ο„=1t|f2​(𝐱τ;𝜽^Ο„βˆ’12)βˆ’(rΟ„βˆ’f1​(𝐱τ;𝜽^Ο„βˆ’11))|⏟I3+(1+2​ξ)​2​log⁑(1/Ξ΄)t.\displaystyle\leq\underbrace{\frac{1}{t}\sum_{\tau=1}^{t}\left|f_{2}\left(\mathbf{x}_{\tau};\widehat{\boldsymbol{\theta}}_{\tau-1}^{2}\right)-\left(r_{\tau}-f_{1}(\mathbf{x}_{\tau};\widehat{\boldsymbol{\theta}}_{\tau-1}^{1})\right)\right|}_{I_{3}}+(1+2\xi)\sqrt{\frac{2\log(1/\delta)}{t}}~{}.

For I3I_{3}, according to Lemma C.6, for any 𝜽~2\widetilde{\boldsymbol{\theta}}^{2} satisfying β€–πœ½~2βˆ’πœ½02β€–2≀π’ͺ​(t3ρ​m​log⁑m)\|\widetilde{\boldsymbol{\theta}}^{2}-\boldsymbol{\theta}^{2}_{0}\|_{2}\leq\mathcal{O}(\frac{t^{3}}{\rho\sqrt{m}}\log m) , with probability 1βˆ’Ξ΄1-\delta, we have

1tβ€‹βˆ‘Ο„=1t|f2​(𝐱τ;𝜽^Ο„βˆ’12)βˆ’(rΟ„βˆ’f1​(𝐱τ;𝜽^Ο„βˆ’11))|\displaystyle\frac{1}{t}\sum_{\tau=1}^{t}\left|f_{2}(\mathbf{x}_{\tau};\widehat{\boldsymbol{\theta}}_{\tau-1}^{2})-\left(r_{\tau}-f_{1}(\mathbf{x}_{\tau};\widehat{\boldsymbol{\theta}}_{\tau-1}^{1})\right)\right| (C.17)
≀1tβ€‹βˆ‘Ο„=1t|f2​(𝐱τ;𝜽~2)βˆ’(rΟ„βˆ’f1​(𝐱τ;𝜽^Ο„βˆ’11))|⏟I4+π’ͺ​(3​L2​t).\displaystyle\leq\underbrace{\frac{1}{t}\sum_{\tau=1}^{t}\left|f_{2}(\mathbf{x}_{\tau};\widetilde{\boldsymbol{\theta}}^{2})-\left(r_{\tau}-f_{1}(\mathbf{x}_{\tau};\widehat{\boldsymbol{\theta}}_{\tau-1}^{1})\right)\right|}_{I_{4}}+\mathcal{O}\left(\frac{3L}{\sqrt{2t}}\right)~{}.

For $I_{4}$, according to Lemma C.4 (1), there exists $\widetilde{\boldsymbol{\theta}}^{2}$ satisfying $\|\widetilde{\boldsymbol{\theta}}^{2}-\boldsymbol{\theta}^{2}_{0}\|_{2}\leq\mathcal{O}(\frac{t^{3}}{\rho\sqrt{m}}\log m)$, with probability $1-\delta$, such that

1tβ€‹βˆ‘Ο„=1t|f2​(𝐱τ;𝜽~2)βˆ’(rΟ„βˆ’f1​(𝐱τ;𝜽^Ο„βˆ’11))|\displaystyle\frac{1}{t}\sum_{\tau=1}^{t}\left|f_{2}(\mathbf{x}_{\tau};\widetilde{\boldsymbol{\theta}}^{2})-\left(r_{\tau}-f_{1}(\mathbf{x}_{\tau};\widehat{\boldsymbol{\theta}}_{\tau-1}^{1})\right)\right| (C.18)
≀1t​tβ€‹βˆ‘Ο„=1t(f2​(𝐱τ;𝜽~2)βˆ’(rΟ„βˆ’f1​(𝐱τ;𝜽^Ο„βˆ’11)))2⏟I5\displaystyle\leq\frac{1}{t}\sqrt{t}\sqrt{\underbrace{\sum_{\tau=1}^{t}\left(f_{2}(\mathbf{x}_{\tau};\widetilde{\boldsymbol{\theta}}^{2})-\left(r_{\tau}-f_{1}(\mathbf{x}_{\tau};\widehat{\boldsymbol{\theta}}_{\tau-1}^{1})\right)\right)^{2}}_{I_{5}}}
≀1t​2β€‹Ο΅βŸI5,\displaystyle\leq\frac{1}{\sqrt{t}}\sqrt{\underbrace{2\epsilon}_{I_{5}}}~{},

where I5I_{5} follows by a direct application of Lemma C.4 (1) by defining the loss ℒ​(𝜽~2)=12β€‹βˆ‘Ο„=1t(f2​(𝐱τ;𝜽~2)βˆ’(rΟ„βˆ’f1​(𝐱τ;𝜽^Ο„βˆ’11)))2≀ϡ\mathcal{L}(\widetilde{\boldsymbol{\theta}}^{2})=\frac{1}{2}\sum_{\tau=1}^{t}\left(f_{2}(\mathbf{x}_{\tau};\widetilde{\boldsymbol{\theta}}^{2})-\left(r_{\tau}-f_{1}(\mathbf{x}_{\tau};\widehat{\boldsymbol{\theta}}_{\tau-1}^{1})\right)\right)^{2}\leq\epsilon.

Combining Eq. (C.16), Eq. (C.17), and Eq. (C.18), with probability at least $1-5\delta$, we have

𝔼(𝐱t,i,rt,i),i∈[n]​[|f2​(𝐱t;𝜽tβˆ’12)βˆ’(rtβˆ’f1​(𝐱t;𝜽tβˆ’11))||{𝐱τ,rΟ„}Ο„=1tβˆ’1]\displaystyle\underset{(\mathbf{x}_{t,i},r_{t,i}),i\in[n]}{\mathbb{E}}\left[\left|f_{2}\left(\mathbf{x}_{t};\boldsymbol{\theta}_{t-1}^{2}\right)-\left(r_{t}-f_{1}(\mathbf{x}_{t};\boldsymbol{\theta}_{t-1}^{1})\right)\right||\{\mathbf{x}_{\tau},r_{\tau}\}_{\tau=1}^{t-1}\right] (C.19)
≀2​ϡt+π’ͺ​(3​L2​t)+(1+2​ξ)​2​log⁑(1/Ξ΄)t.\displaystyle\qquad\leq\sqrt{\frac{2\epsilon}{t}}+\mathcal{O}\left(\frac{3L}{\sqrt{2t}}\right)+(1+2\xi)\sqrt{\frac{2\log(1/\delta)}{t}}~{}.

where the expectation is also taken over $(\boldsymbol{\theta}_{t-1}^{1},\boldsymbol{\theta}_{t-1}^{2})$, which are uniformly drawn from $\{(\widehat{\boldsymbol{\theta}}_{\tau}^{1},\widehat{\boldsymbol{\theta}}_{\tau}^{2})\}_{\tau=0}^{t-1}$.

Then, applying a union bound over $t$ and $n$ and rescaling $\delta$ completes the proof. ∎

Corollary C.1.

For any Ξ΄,ϡ∈(0,1),ρ∈(0,π’ͺ​(1L))\delta,\epsilon\in(0,1),\rho\in(0,\mathcal{O}(\frac{1}{L})), suppose m,Ξ·1,Ξ·2,K1,K2m,\eta_{1},\eta_{2},K_{1},K_{2} satisfy the conditions in Eq. (5.2) and (𝐱τ,i,rΟ„,i)βˆΌπ’Ÿ,βˆ€Ο„βˆˆ[t],i∈[n](\mathbf{x}_{\tau,i},r_{\tau,i})\sim\mathcal{D},\forall\tau\in[t],i\in[n]. For any Ο„βˆˆ[t]\tau\in[t], let

π±Ο„βˆ—=arg⁑max𝐱τ,i,i∈[n]⁑[h​(𝐱τ,i)],\mathbf{x}_{\tau}^{\ast}=\arg\max_{\mathbf{x}_{\tau,i},i\in[n]}\left[h(\mathbf{x}_{\tau,i})\right],

and $r_{\tau}^{\ast}$ is the corresponding reward, given $(\mathbf{x}_{\tau,i},r_{\tau,i}),i\in[n]$. Then, with probability at least $1-\delta$ over the randomness of the initialization, there exist $\boldsymbol{\theta}^{1,\ast}_{t-1},\boldsymbol{\theta}^{2,\ast}_{t-1}$ with $\|\boldsymbol{\theta}^{1,\ast}_{t-1}-\boldsymbol{\theta}^{1}_{0}\|_{2}\leq\mathcal{O}(\frac{t^{3}}{\rho\sqrt{m}}\log m)$ and $\|\boldsymbol{\theta}^{2,\ast}_{t-1}-\boldsymbol{\theta}^{2}_{0}\|_{2}\leq\mathcal{O}(\frac{t^{3}}{\rho\sqrt{m}}\log m)$, such that

𝔼(𝐱t,i,rt,i),i∈[n]\displaystyle\underset{(\mathbf{x}_{t,i},r_{t,i}),i\in[n]}{\mathbb{E}} [|f2​(ϕ​(β–½πœ½tβˆ’11,βˆ—β€‹f1​(𝐱tβˆ—;𝜽tβˆ’11,βˆ—));𝜽tβˆ’12,βˆ—)βˆ’(rtβˆ—βˆ’f1​(𝐱tβˆ—;𝜽tβˆ’11,βˆ—))|∣{π±Ο„βˆ—,rΟ„βˆ—}Ο„=1tβˆ’1]\displaystyle\left[\left|f_{2}\left(\phi(\triangledown_{\boldsymbol{\theta}^{1,\ast}_{t-1}}f_{1}(\mathbf{x}_{t}^{\ast};\boldsymbol{\theta}^{1,\ast}_{t-1}));\boldsymbol{\theta}_{t-1}^{2,\ast}\right)-\left(r_{t}^{\ast}-f_{1}(\mathbf{x}_{t}^{\ast};\boldsymbol{\theta}_{t-1}^{1,\ast})\right)\right|\mid\{\mathbf{x}_{\tau}^{\ast},r_{\tau}^{\ast}\}_{\tau=1}^{t-1}\right] (C.20)
≀2​ϡt+π’ͺ​(3​L2​t)+(1+2​ξ)​2​log⁑(π’ͺ​(t​n/Ξ΄))t,\displaystyle\qquad\qquad\leq\sqrt{\frac{2\epsilon}{t}}+\mathcal{O}\left(\frac{3L}{\sqrt{2t}}\right)+(1+2\xi)\sqrt{\frac{2\log(\mathcal{O}(tn/\delta))}{t}},

where the expectation is also taken over (𝛉tβˆ’11,βˆ—,𝛉tβˆ’12,βˆ—)(\boldsymbol{\theta}_{t-1}^{1,\ast},\boldsymbol{\theta}_{t-1}^{2,\ast}) that are uniformly drawn from (𝛉^Ο„1,βˆ—,𝛉^Ο„2,βˆ—),Ο„βˆˆ[tβˆ’1](\widehat{\boldsymbol{\theta}}_{\tau}^{1,\ast},\widehat{\boldsymbol{\theta}}_{\tau}^{2,\ast}),\tau\in[t-1].

Proof.

This is a direct corollary of Lemma C.1, given the optimal historical pairs $\{\mathbf{x}_{\tau}^{\ast},r_{\tau}^{\ast}\}_{\tau=1}^{t-1}$. For brevity, let $f_{2}(\mathbf{x};\widehat{\boldsymbol{\theta}}_{\tau}^{2,\ast})$ denote $f_{2}\left(\phi(\triangledown_{\widehat{\boldsymbol{\theta}}^{1,\ast}_{\tau}}f_{1}(\mathbf{x};\widehat{\boldsymbol{\theta}}_{\tau}^{1,\ast}));\widehat{\boldsymbol{\theta}}_{\tau}^{2,\ast}\right)$.

Suppose that, for each $\tau\in[t-1]$, $\widehat{\boldsymbol{\theta}}_{\tau}^{1,\ast}$ and $\widehat{\boldsymbol{\theta}}_{\tau}^{2,\ast}$ are the parameters obtained by training on $\{\mathbf{x}_{\tau^{\prime}}^{\ast},r_{\tau^{\prime}}^{\ast}\}_{\tau^{\prime}=1}^{\tau}$ according to Algorithm 1. Note that these pairs are unknown to the algorithm we actually run, and the parameters $(\widehat{\boldsymbol{\theta}}_{\tau}^{1,\ast},\widehat{\boldsymbol{\theta}}_{\tau}^{2,\ast})$ are never estimated. For the analysis, however, it suffices that such parameters exist so that the conditional expectation of the error can be bounded.

In round $\tau\in[t]$, let $\mathbf{x}_{\tau}^{\ast}=\arg\max_{\mathbf{x}_{\tau,i},i\in[n]}[h(\mathbf{x}_{\tau,i})]$, given $(\mathbf{x}_{\tau,i},r_{\tau,i})\sim\mathcal{D},i\in[n]$, and let $r^{\ast}_{\tau}$ be the corresponding reward. Let $(\mathbf{x}^{\prime}_{\tau,i},r^{\prime}_{\tau,i})\sim\mathcal{D},i\in[n]$ be shadow samples from the same distribution, let $\mathbf{x}^{\prime\ast}_{\tau}=\arg\max_{\mathbf{x}^{\prime}_{\tau,i},i\in[n]}h(\mathbf{x}^{\prime}_{\tau,i})$, and let $r^{\prime\ast}_{\tau}$ be the corresponding reward. Then, we define

\displaystyle V_{\tau} :=\underset{(\mathbf{x}^{\prime}_{\tau,i},r^{\prime}_{\tau,i}),i\in[n]}{\mathbb{E}}\left[\left|f_{2}(\mathbf{x}^{\prime\ast}_{\tau};\widehat{\boldsymbol{\theta}}_{\tau-1}^{2,\ast})-\left(r^{\prime\ast}_{\tau}-f_{1}(\mathbf{x}^{\prime\ast}_{\tau};\widehat{\boldsymbol{\theta}}_{\tau-1}^{1,\ast})\right)\right|\right] (C.21)
βˆ’|f2​(π±Ο„βˆ—;𝜽^Ο„βˆ’12,βˆ—)βˆ’(rΟ„βˆ—βˆ’f1​(π±Ο„βˆ—;𝜽^Ο„βˆ’11,βˆ—))|.\displaystyle\qquad\qquad-\left|f_{2}(\mathbf{x}^{\ast}_{\tau};\widehat{\boldsymbol{\theta}}_{\tau-1}^{2,\ast})-\left(r^{\ast}_{\tau}-f_{1}(\mathbf{x}^{\ast}_{\tau};\widehat{\boldsymbol{\theta}}_{\tau-1}^{1,\ast})\right)\right|~{}.

Then, as (𝐱τ,i,rΟ„,i)βˆΌπ’Ÿ,i∈[n](\mathbf{x}_{\tau,i},r_{\tau,i})\sim\mathcal{D},i\in[n], we have

𝔼​[VΟ„|π…Ο„βˆ’1]\displaystyle\mathbb{E}[V_{\tau}|\mathbf{F}_{\tau-1}] =𝔼(𝐱τ,iβ€²,rΟ„,iβ€²),i∈[n]​[|f2​(π±Ο„β€²β£βˆ—;𝜽^Ο„βˆ’12,βˆ—)βˆ’(rΟ„β€²β£βˆ—βˆ’f1​(π±Ο„β€²β£βˆ—;𝜽^Ο„βˆ’11,βˆ—))|βˆ£π…Ο„βˆ’1]\displaystyle=\underset{(\mathbf{x}^{\prime}_{\tau,i},r^{\prime}_{\tau,i}),i\in[n]}{\mathbb{E}}\left[\left|f_{2}\left(\mathbf{x}^{\prime\ast}_{\tau};\widehat{\boldsymbol{\theta}}_{\tau-1}^{2,\ast}\right)-\left(r^{\prime\ast}_{\tau}-f_{1}(\mathbf{x}^{\prime\ast}_{\tau};\widehat{\boldsymbol{\theta}}_{\tau-1}^{1,\ast})\right)\right|\mid\mathbf{F}_{\tau-1}\right] (C.22)
βˆ’π”Ό(𝐱τ,i,rΟ„,i),i∈[n]​[|f2​(π±Ο„βˆ—;𝜽^Ο„βˆ’12,βˆ—)βˆ’(rΟ„βˆ—βˆ’f1​(π±Ο„βˆ—;𝜽^Ο„βˆ’11,βˆ—))|βˆ£π…Ο„βˆ’1]\displaystyle\quad-\underset{(\mathbf{x}_{\tau,i},r_{\tau,i}),i\in[n]}{\mathbb{E}}\left[\left|f_{2}\left(\mathbf{x}_{\tau}^{\ast};\widehat{\boldsymbol{\theta}}_{\tau-1}^{2,\ast}\right)-\left(r_{\tau}^{\ast}-f_{1}(\mathbf{x}_{\tau}^{\ast};\widehat{\boldsymbol{\theta}}_{\tau-1}^{1,\ast})\right)\right|\mid\mathbf{F}_{\tau-1}\right]
=0,\displaystyle=0~{},

where π…Ο„βˆ’1\mathbf{F}_{\tau-1} denotes the Οƒ\sigma-algebra generated by the history {π±Ο„β€²βˆ—,rΟ„β€²βˆ—}Ο„β€²=1Ο„βˆ’1\{\mathbf{x}_{\tau^{\prime}}^{\ast},r_{\tau^{\prime}}^{\ast}\}_{\tau^{\prime}=1}^{\tau-1}.

Therefore, {VΟ„}Ο„=1t\{V_{\tau}\}_{\tau=1}^{t} is a martingale difference sequence. Similarly, applying the Hoeffding-Azuma inequality to VΟ„V_{\tau}, with probability 1βˆ’3​δ1-3\delta, we have

𝔼(𝐱t,iβ€²,rt,iβ€²),i∈[n]​𝔼(𝜽tβˆ’11,βˆ—,𝜽tβˆ’12,βˆ—)​[|f2​(𝐱tβ€²β£βˆ—;𝜽tβˆ’12,βˆ—)βˆ’(rtβ€²β£βˆ—βˆ’f1​(𝐱tβ€²β£βˆ—;𝜽tβˆ’11,βˆ—))|]\displaystyle\underset{(\mathbf{x}^{\prime}_{t,i},r^{\prime}_{t,i}),i\in[n]}{\mathbb{E}}~{}\underset{(\boldsymbol{\theta}_{t-1}^{1,\ast},\boldsymbol{\theta}_{t-1}^{2,\ast})}{\mathbb{E}}\left[\left|f_{2}\left(\mathbf{x}^{\prime\ast}_{t};\boldsymbol{\theta}_{t-1}^{2,\ast}\right)-\left(r^{\prime\ast}_{t}-f_{1}(\mathbf{x}^{\prime\ast}_{t};\boldsymbol{\theta}_{t-1}^{1,\ast})\right)\right|\right] (C.23)
=1tβ€‹βˆ‘Ο„=1t𝔼(𝐱τ,iβ€²,rΟ„,iβ€²),i∈[n]​[|f2​(π±Ο„β€²β£βˆ—;𝜽^Ο„βˆ’12,βˆ—)βˆ’(rΟ„β€²β£βˆ—βˆ’f1​(π±Ο„β€²β£βˆ—;𝜽^Ο„βˆ’11,βˆ—))|]\displaystyle=\frac{1}{t}\sum_{\tau=1}^{t}\mathbb{E}_{(\mathbf{x}^{\prime}_{\tau,i},r^{\prime}_{\tau,i}),i\in[n]}\left[\left|f_{2}\left(\mathbf{x}^{\prime\ast}_{\tau};\widehat{\boldsymbol{\theta}}_{\tau-1}^{2,\ast}\right)-\left(r^{\prime\ast}_{\tau}-f_{1}(\mathbf{x}^{\prime\ast}_{\tau};\widehat{\boldsymbol{\theta}}_{\tau-1}^{1,\ast})\right)\right|\right]
≀1tβ€‹βˆ‘Ο„=1t|f2​(π±Ο„βˆ—;𝜽^Ο„βˆ’12,βˆ—)βˆ’(rΟ„βˆ—βˆ’f1​(π±Ο„βˆ—;𝜽^Ο„βˆ’11,βˆ—))|⏟I3+(1+2​ξ)​2​log⁑(1/Ξ΄)t.\displaystyle\leq\underbrace{\frac{1}{t}\sum_{\tau=1}^{t}\left|f_{2}\left(\mathbf{x}_{\tau}^{\ast};\widehat{\boldsymbol{\theta}}_{\tau-1}^{2,\ast}\right)-\left(r_{\tau}^{\ast}-f_{1}(\mathbf{x}_{\tau}^{\ast};\widehat{\boldsymbol{\theta}}_{\tau-1}^{1,\ast})\right)\right|}_{I_{3}}+(1+2\xi)\sqrt{\frac{2\log(1/\delta)}{t}}~{}.

For I3I_{3}, according to Lemma C.6, for any 𝜽~2,βˆ—\widetilde{\boldsymbol{\theta}}^{2,\ast} satisfying β€–πœ½~2,βˆ—βˆ’πœ½02β€–2≀π’ͺ​(t3ρ​m​log⁑m)\|\widetilde{\boldsymbol{\theta}}^{2,\ast}-\boldsymbol{\theta}^{2}_{0}\|_{2}\leq\mathcal{O}(\frac{t^{3}}{\rho\sqrt{m}}\log m) , with probability 1βˆ’Ξ΄1-\delta, we have

1tβ€‹βˆ‘Ο„=1t|f2​(π±Ο„βˆ—;𝜽^Ο„βˆ’12,βˆ—)βˆ’(rΟ„βˆ—βˆ’f1​(π±Ο„βˆ—;𝜽^Ο„βˆ’11,βˆ—))|\displaystyle\frac{1}{t}\sum_{\tau=1}^{t}\left|f_{2}(\mathbf{x}_{\tau}^{\ast};\widehat{\boldsymbol{\theta}}_{\tau-1}^{2,\ast})-\left(r_{\tau}^{\ast}-f_{1}(\mathbf{x}_{\tau}^{\ast};\widehat{\boldsymbol{\theta}}_{\tau-1}^{1,\ast})\right)\right| (C.24)
≀1tβ€‹βˆ‘Ο„=1t|f2​(π±Ο„βˆ—;𝜽~2,βˆ—)βˆ’(rΟ„βˆ—βˆ’f1​(π±Ο„βˆ—;𝜽^Ο„βˆ’11,βˆ—))|⏟I4+π’ͺ​(3​L2​t).\displaystyle\leq\underbrace{\frac{1}{t}\sum_{\tau=1}^{t}\left|f_{2}(\mathbf{x}_{\tau}^{\ast};\widetilde{\boldsymbol{\theta}}^{2,\ast})-\left(r_{\tau}^{\ast}-f_{1}(\mathbf{x}_{\tau}^{\ast};\widehat{\boldsymbol{\theta}}_{\tau-1}^{1,\ast})\right)\right|}_{I_{4}}+\mathcal{O}\left(\frac{3L}{\sqrt{2t}}\right)~{}.

For $I_{4}$, according to Lemma C.4 (1), there exists $\widetilde{\boldsymbol{\theta}}^{2,\ast}$ satisfying $\|\widetilde{\boldsymbol{\theta}}^{2,\ast}-\boldsymbol{\theta}^{2}_{0}\|_{2}\leq\mathcal{O}(\frac{t^{3}}{\rho\sqrt{m}}\log m)$, with probability $1-\delta$, such that

1tβ€‹βˆ‘Ο„=1t|f2​(π±Ο„βˆ—;𝜽~2,βˆ—)βˆ’(rΟ„βˆ—βˆ’f1​(π±Ο„βˆ—;𝜽^Ο„βˆ’11,βˆ—))|\displaystyle\frac{1}{t}\sum_{\tau=1}^{t}\left|f_{2}(\mathbf{x}_{\tau}^{\ast};\widetilde{\boldsymbol{\theta}}^{2,\ast})-\left(r_{\tau}^{\ast}-f_{1}(\mathbf{x}_{\tau}^{\ast};\widehat{\boldsymbol{\theta}}_{\tau-1}^{1,\ast})\right)\right| (C.25)
≀1t​tβ€‹βˆ‘Ο„=1t(f2​(π±Ο„βˆ—;𝜽~2,βˆ—)βˆ’(rΟ„βˆ—βˆ’f1​(π±Ο„βˆ—;𝜽^Ο„βˆ’11,βˆ—)))2⏟I5\displaystyle\leq\frac{1}{t}\sqrt{t}\sqrt{\underbrace{\sum_{\tau=1}^{t}\left(f_{2}(\mathbf{x}_{\tau}^{\ast};\widetilde{\boldsymbol{\theta}}^{2,\ast})-\left(r_{\tau}^{\ast}-f_{1}(\mathbf{x}_{\tau}^{\ast};\widehat{\boldsymbol{\theta}}_{\tau-1}^{1,\ast})\right)\right)^{2}}_{I_{5}}}
≀1t​2β€‹Ο΅βŸI5,\displaystyle\leq\frac{1}{\sqrt{t}}\sqrt{\underbrace{2\epsilon}_{I_{5}}}~{},

where $I_{5}$ follows from a direct application of Lemma C.4 (1) by defining the loss $\mathcal{L}(\widetilde{\boldsymbol{\theta}}^{2,\ast})=\frac{1}{2}\sum_{\tau=1}^{t}\left(f_{2}(\mathbf{x}_{\tau}^{\ast};\widetilde{\boldsymbol{\theta}}^{2,\ast})-\left(r_{\tau}^{\ast}-f_{1}(\mathbf{x}_{\tau}^{\ast};\widehat{\boldsymbol{\theta}}_{\tau-1}^{1,\ast})\right)\right)^{2}\leq\epsilon$. Combining the above inequalities, with probability at least $1-5\delta$, we have

𝔼(𝐱t,i,rt,i),i∈[n]​[|f2​(𝐱tβˆ—;𝜽tβˆ’12,βˆ—)βˆ’(rtβˆ—βˆ’f1​(𝐱tβˆ—;𝜽tβˆ’11,βˆ—))||{π±Ο„βˆ—,rΟ„βˆ—}Ο„=1tβˆ’1]\displaystyle\underset{(\mathbf{x}_{t,i},r_{t,i}),i\in[n]}{\mathbb{E}}\left[\left|f_{2}\left(\mathbf{x}_{t}^{\ast};\boldsymbol{\theta}_{t-1}^{2,\ast}\right)-\left(r_{t}^{\ast}-f_{1}(\mathbf{x}_{t}^{\ast};\boldsymbol{\theta}_{t-1}^{1,\ast})\right)\right||\{\mathbf{x}_{\tau}^{\ast},r_{\tau}^{\ast}\}_{\tau=1}^{t-1}\right] (C.26)
≀2​ϡt+π’ͺ​(3​L2​t)+(1+2​ξ)​2​log⁑(1/Ξ΄)t.\displaystyle\qquad\leq\sqrt{\frac{2\epsilon}{t}}+\mathcal{O}\left(\frac{3L}{\sqrt{2t}}\right)+(1+2\xi)\sqrt{\frac{2\log(1/\delta)}{t}}~{}.

Then, applying a union bound over $t$ and $n$ and rescaling $\delta$ completes the proof.

∎

Lemma C.2.

Given Ξ΄,ϡ∈(0,1),ρ∈(0,π’ͺ​(1L))\delta,\epsilon\in(0,1),\rho\in(0,\mathcal{O}(\frac{1}{L})), suppose m,Ξ·1,Ξ·2,K1,K2m,\eta_{1},\eta_{2},K_{1},K_{2} satisfy the conditions in Eq. (5.2). Then, with probability at least 1βˆ’Ξ΄1-\delta, in each round t∈[T]t\in[T], for any ‖𝐱‖2=1\|\mathbf{x}\|_{2}=1, we have

(1)\displaystyle(1) |f1​(𝐱;𝜽tβˆ’11,βˆ—)βˆ’f1​(𝐱;𝜽tβˆ’11)|\displaystyle|f_{1}(\mathbf{x};\boldsymbol{\theta}^{1,\ast}_{t-1})-f_{1}(\mathbf{x};\boldsymbol{\theta}^{1}_{t-1})| (C.27)
≀\displaystyle\leq (1+π’ͺ​(t​L3​log5/6⁑mρ1/3​m1/6))​π’ͺ​(L​t3ρ​m​log⁑m)+π’ͺ​(t4​L2​log11/6⁑mρ4/3​m1/6);\displaystyle\left(1+\mathcal{O}\left(\frac{tL^{3}\log^{5/6}m}{\rho^{1/3}m^{1/6}}\right)\right)\mathcal{O}\left(\frac{Lt^{3}}{\rho\sqrt{m}}\log m\right)+\mathcal{O}\left(\frac{t^{4}L^{2}\log^{11/6}m}{\rho^{4/3}m^{1/6}}\right);
(2)\displaystyle(2) |f2​(ϕ​(β–½πœ½tβˆ’11,βˆ—β€‹f1​(𝐱;𝜽tβˆ’11,βˆ—));𝜽tβˆ’12,βˆ—)βˆ’f2​(ϕ​(β–½πœ½tβˆ’11​f1​(𝐱;𝜽tβˆ’11));𝜽tβˆ’12)|\displaystyle\left|f_{2}\left(\phi(\triangledown_{\boldsymbol{\theta}^{1,\ast}_{t-1}}f_{1}(\mathbf{x};\boldsymbol{\theta}^{1,\ast}_{t-1}));\boldsymbol{\theta}_{t-1}^{2,\ast}\right)-f_{2}\left(\phi(\triangledown_{\boldsymbol{\theta}^{1}_{t-1}}f_{1}(\mathbf{x};\boldsymbol{\theta}^{1}_{t-1}));\boldsymbol{\theta}_{t-1}^{2}\right)\right| (C.28)
≀\displaystyle\leq (1+π’ͺ​(t​L3​log5/6⁑mρ1/3​m1/6))​π’ͺ​(L​t3ρ​m​log⁑m)+π’ͺ​(t4​L2​log11/6⁑mρ4/3​m1/6);\displaystyle\left(1+\mathcal{O}\left(\frac{tL^{3}\log^{5/6}m}{\rho^{1/3}m^{1/6}}\right)\right)\mathcal{O}\left(\frac{Lt^{3}}{\rho\sqrt{m}}\log m\right)+\mathcal{O}\left(\frac{t^{4}L^{2}\log^{11/6}m}{\rho^{4/3}m^{1/6}}\right);
(3)\displaystyle(3) β€–β–½πœ½tβˆ’11​f1​(𝐱;𝜽tβˆ’11)β€–2,β€–β–½πœ½tβˆ’12​f2​(ϕ​(β–½πœ½tβˆ’11​f1​(𝐱;𝜽tβˆ’11));𝜽tβˆ’12)β€–2\displaystyle\|\triangledown_{\boldsymbol{\theta}^{1}_{t-1}}f_{1}(\mathbf{x};\boldsymbol{\theta}^{1}_{t-1})\|_{2},\|\triangledown_{\boldsymbol{\theta}^{2}_{t-1}}f_{2}\left(\phi(\triangledown_{\boldsymbol{\theta}^{1}_{t-1}}f_{1}(\mathbf{x};\boldsymbol{\theta}^{1}_{t-1}));\boldsymbol{\theta}_{t-1}^{2}\right)\|_{2} (C.29)
≀\displaystyle\leq (1+π’ͺ​(t​L3​log5/6⁑mρ1/3​m1/6))​π’ͺ​(L).\displaystyle\left(1+\mathcal{O}\left(\frac{tL^{3}\log^{5/6}m}{\rho^{1/3}m^{1/6}}\right)\right)\mathcal{O}(L)~{}.
Proof.

According to Lemma C.4 (2), β€–πœ½^Ο„βˆ’11βˆ’πœ½01β€–2≀π’ͺ​(t3ρ​m​log⁑m),βˆ€Ο„βˆˆ[t]\|\widehat{\boldsymbol{\theta}}_{\tau-1}^{1}-\boldsymbol{\theta}^{1}_{0}\|_{2}\leq\mathcal{O}(\frac{t^{3}}{\rho\sqrt{m}}\log m),\forall\tau\in[t]. Thus, we have β€–πœ½tβˆ’11βˆ’πœ½01β€–2≀π’ͺ​(t3ρ​m​log⁑m)\|\boldsymbol{\theta}^{1}_{t-1}-\boldsymbol{\theta}^{1}_{0}\|_{2}\leq\mathcal{O}(\frac{t^{3}}{\rho\sqrt{m}}\log m).

First, by the triangle inequality, for any $\|\mathbf{x}\|_{2}=1$, we have

β€–β–½πœ½tβˆ’11​f1​(𝐱;𝜽tβˆ’11)β€–2\displaystyle\|\triangledown_{\boldsymbol{\theta}^{1}_{t-1}}f_{1}(\mathbf{x};\boldsymbol{\theta}^{1}_{t-1})\|_{2} β‰€β€–β–½πœ½01​f1​(𝐱;𝜽01)β€–2+β€–β–½πœ½tβˆ’11​f1​(𝐱;𝜽tβˆ’11)βˆ’β–½πœ½01​f1​(𝐱i;𝜽01)β€–2\displaystyle\leq\|\triangledown_{\boldsymbol{\theta}^{1}_{0}}f_{1}(\mathbf{x};\boldsymbol{\theta}^{1}_{0})\|_{2}+\|\triangledown_{\boldsymbol{\theta}^{1}_{t-1}}f_{1}(\mathbf{x};\boldsymbol{\theta}^{1}_{t-1})-\triangledown_{\boldsymbol{\theta}^{1}_{0}}f_{1}(\mathbf{x}_{i};\boldsymbol{\theta}^{1}_{0})\|_{2} (C.30)
≀(1+π’ͺ​(t​L3​log5/6⁑mρ1/3​m1/6))​π’ͺ​(L)\displaystyle\leq\left(1+\mathcal{O}\left(\frac{tL^{3}\log^{5/6}m}{\rho^{1/3}m^{1/6}}\right)\right)\mathcal{O}(L)

where the last inequality is because of Lemma C.4 (3) and Lemma C.7.

Applying Lemma C.5 (1), for any π±βˆΌπ’Ÿ,‖𝐱‖2=1\mathbf{x}\sim\mathcal{D},\|\mathbf{x}\|_{2}=1 and β€–πœ½tβˆ’11,βˆ—βˆ’πœ½tβˆ’11‖≀π’ͺ​(t3ρ​m​log⁑m)=w\|\boldsymbol{\theta}^{1,\ast}_{t-1}-\boldsymbol{\theta}^{1}_{t-1}\|\leq\mathcal{O}(\frac{t^{3}}{\rho\sqrt{m}}\log m)=w, we have

|f1​(𝐱;𝜽tβˆ’11,βˆ—)βˆ’f1​(𝐱;𝜽tβˆ’11)|\displaystyle|f_{1}(\mathbf{x};\boldsymbol{\theta}^{1,\ast}_{t-1})-f_{1}(\mathbf{x};\boldsymbol{\theta}^{1}_{t-1})| (C.31)
≀\displaystyle\leq |βŸ¨β–½πœ½tβˆ’11​f1​(𝐱i;𝜽tβˆ’11),𝜽tβˆ’11,βˆ—βˆ’πœ½tβˆ’11⟩|+π’ͺ​(L2​m​log⁑(m))β€‹β€–πœ½tβˆ’11,βˆ—βˆ’πœ½tβˆ’11β€–2​w1/3\displaystyle|\langle\triangledown_{\boldsymbol{\theta}^{1}_{t-1}}f_{1}(\mathbf{x}_{i};\boldsymbol{\theta}^{1}_{t-1}),\boldsymbol{\theta}^{1,\ast}_{t-1}-\boldsymbol{\theta}^{1}_{t-1}\rangle|+\mathcal{O}(L^{2}\sqrt{m\log(m)})\|\boldsymbol{\theta}^{1,\ast}_{t-1}-\boldsymbol{\theta}^{1}_{t-1}\|_{2}w^{1/3}
≀\displaystyle\leq β€–β–½πœ½tβˆ’11​f1​(𝐱i;𝜽tβˆ’11)β€–2β€‹β€–πœ½tβˆ’11,βˆ—βˆ’πœ½tβˆ’11β€–2+π’ͺ​(L2​m​log⁑(m))β€‹β€–πœ½tβˆ’11,βˆ—βˆ’πœ½tβˆ’11β€–2​w1/3\displaystyle\|\triangledown_{\boldsymbol{\theta}^{1}_{t-1}}f_{1}(\mathbf{x}_{i};\boldsymbol{\theta}^{1}_{t-1})\|_{2}\|\boldsymbol{\theta}^{1,\ast}_{t-1}-\boldsymbol{\theta}^{1}_{t-1}\|_{2}+\mathcal{O}(L^{2}\sqrt{m\log(m)})\|\boldsymbol{\theta}^{1,\ast}_{t-1}-\boldsymbol{\theta}^{1}_{t-1}\|_{2}w^{1/3}
≀\displaystyle\leq (1+π’ͺ​(t​L3​log5/6⁑mρ1/3​m1/6))​π’ͺ​(L​t3ρ​m​log⁑m)+π’ͺ​(t4​L2​log11/6⁑mρ4/3​m1/6)\displaystyle\left(1+\mathcal{O}\left(\frac{tL^{3}\log^{5/6}m}{\rho^{1/3}m^{1/6}}\right)\right)\mathcal{O}(\frac{Lt^{3}}{\rho\sqrt{m}}\log m)+\mathcal{O}\left(\frac{t^{4}L^{2}\log^{11/6}m}{\rho^{4/3}m^{1/6}}\right)

The same argument yields the corresponding bounds for $f_{2}$. ∎

Lemma C.3.

Let $f(\cdot;\widehat{\boldsymbol{\theta}}_{t})$ be $f_{1}$ or $f_{2}$ trained by the stochastic gradient descent in Algorithm 1. Suppose $m,\eta_{1},\eta_{2}$ satisfy the conditions in Eq. (5.2). With probability at least $1-\delta$, for any $\mathbf{x}$ with $\|\mathbf{x}\|_{2}=1$ and any $t\in[T]$, it holds that

|f​(𝐱;𝜽^t)|≀π’ͺ​(1)+π’ͺ​(t3​n​L​log⁑mρ​m)+π’ͺ​(t4​n​L2​log11/6⁑mρ4/3​m1/6).|f(\mathbf{x};\widehat{\boldsymbol{\theta}}_{t})|\leq\mathcal{O}(1)+\mathcal{O}\left(\frac{t^{3}nL\log m}{\rho\sqrt{m}}\right)+\mathcal{O}\left(\frac{t^{4}nL^{2}\log^{11/6}m}{\rho^{4/3}m^{1/6}}\right).
Proof.

From the elementary fact that $|a-b|\leq c$ implies $|a|\leq|b|+c$, and letting $\boldsymbol{\theta}_{0}$ be the random initialization, applying Lemma C.5 (1), for any $\mathbf{x}\sim\mathcal{D}$ with $\|\mathbf{x}\|_{2}=1$ and $\|\widehat{\boldsymbol{\theta}}_{t}-\boldsymbol{\theta}_{0}\|\leq w$, we have

|f​(𝐱;𝜽^t)|\displaystyle|f(\mathbf{x};\widehat{\boldsymbol{\theta}}_{t})| ≀|f​(𝐱;𝜽0)|+|βŸ¨β–½πœ½0​f​(𝐱i;𝜽0),𝜽^tβˆ’πœ½0⟩|+π’ͺ​(L2​m​log⁑(m))β€‹β€–πœ½^tβˆ’πœ½0β€–2​w1/3\displaystyle\leq|f(\mathbf{x};\boldsymbol{\theta}_{0})|+|\langle\triangledown_{\boldsymbol{\theta}_{0}}f(\mathbf{x}_{i};\boldsymbol{\theta}_{0}),\widehat{\boldsymbol{\theta}}_{t}-\boldsymbol{\theta}_{0}\rangle|+\mathcal{O}(L^{2}\sqrt{m\log(m)})\|\widehat{\boldsymbol{\theta}}_{t}-\boldsymbol{\theta}_{0}\|_{2}w^{1/3} (C.32)
≀π’ͺ​(1)⏟I0+β€–β–½πœ½0​f​(𝐱i;𝜽0)β€–2β€‹β€–πœ½^tβˆ’πœ½0β€–2⏟I1+π’ͺ​(L2​m​log⁑(m))β€‹β€–πœ½^tβˆ’πœ½0β€–2​w1/3\displaystyle\leq\underbrace{\mathcal{O}(1)}_{I_{0}}+\underbrace{\|\triangledown_{\boldsymbol{\theta}_{0}}f(\mathbf{x}_{i};\boldsymbol{\theta}_{0})\|_{2}\|\widehat{\boldsymbol{\theta}}_{t}-\boldsymbol{\theta}_{0}\|_{2}}_{I_{1}}+\mathcal{O}(L^{2}\sqrt{m\log(m)})\|\widehat{\boldsymbol{\theta}}_{t}-\boldsymbol{\theta}_{0}\|_{2}w^{1/3}
≀π’ͺ​(1)+π’ͺ​(L)β‹…π’ͺ​(t3ρ​m​log⁑m)⏟I2+π’ͺ​(L2​m​log⁑(m))β‹…π’ͺ​(t3ρ​m​log⁑m)4/3⏟I3\displaystyle\leq\mathcal{O}(1)+\underbrace{\mathcal{O}(L)\cdot\mathcal{O}\left(\frac{t^{3}}{\rho\sqrt{m}}\log m\right)}_{I_{2}}+\underbrace{\mathcal{O}\left(L^{2}\sqrt{m\log(m)}\right)\cdot\mathcal{O}\left(\frac{t^{3}}{\rho\sqrt{m}}\log m\right)^{4/3}}_{I_{3}}
=π’ͺ​(1)+π’ͺ​(t3​L​log⁑mρ​m)+π’ͺ​(t4​L2​log11/6⁑mρ4/3​m1/6)\displaystyle=\mathcal{O}(1)+\mathcal{O}\left(\frac{t^{3}L\log m}{\rho\sqrt{m}}\right)+\mathcal{O}\left(\frac{t^{4}L^{2}\log^{11/6}m}{\rho^{4/3}m^{1/6}}\right)

where $I_{0}$ follows from Lemma C.4 (3); $I_{1}$ is an application of the Cauchy–Schwarz inequality; $I_{2}$ follows from Lemma C.4 (2) and (3), where $\widehat{\boldsymbol{\theta}}_{t}$ can be regarded as the result of one step of gradient descent; and $I_{3}$ is due to Lemma C.4 (2).

Then, the proof is completed. ∎

Lemma C.4.

Given a constant 0<Ο΅<10<\epsilon<1, suppose mm satisfies the conditions in Eq. (5.2), the learning rate Ξ·=Ω​(ρpoly​(t,n,L)​m)\eta=\Omega(\frac{\rho}{\text{poly}(t,n,L)m}), the number of iterations K=Ω​(poly​(t,n,L)ρ2β‹…logβ‘Ο΅βˆ’1)K=\Omega(\frac{\text{poly}(t,n,L)}{\rho^{2}}\cdot\log\epsilon^{-1}). Then, with probability at least 1βˆ’Ξ΄1-\delta, starting from random initialization 𝛉0\boldsymbol{\theta}_{0},

(1) (Theorem 1 in [Allen-Zhu et al., 2019]) In round $t\in[T]$, given the collected data $\{\mathbf{x}_{\tau},r_{\tau}\}_{\tau=1}^{t}$, the loss function is defined as $\mathcal{L}(\boldsymbol{\theta})=\frac{1}{2}\sum_{\tau=1}^{t}\left(f(\mathbf{x}_{\tau};\boldsymbol{\theta})-r_{\tau}\right)^{2}$. Then, there exists $\widetilde{\boldsymbol{\theta}}$ satisfying $\|\widetilde{\boldsymbol{\theta}}-\boldsymbol{\theta}_{0}\|_{2}\leq\mathcal{O}\left(\frac{t^{3}}{\rho\sqrt{m}}\log m\right)$ such that $\mathcal{L}(\widetilde{\boldsymbol{\theta}})\leq\epsilon$ after $K=\Omega(\frac{\text{poly}(t,n,L)}{\rho^{2}}\cdot\log\epsilon^{-1})$ iterations;

(2) (Theorem 1 in [Allen-Zhu et al., 2019]) For any $k\in[K]$, it holds uniformly that $\|\boldsymbol{\theta}^{(k)}_{t}-\boldsymbol{\theta}_{0}\|_{2}\leq\mathcal{O}\left(\frac{t^{3}}{\rho\sqrt{m}}\log m\right)$;

(3) At random initialization, given $\|\mathbf{x}\|_{2}=1$, it holds that $\|\triangledown_{\boldsymbol{\theta}_{0}}f(\mathbf{x};\boldsymbol{\theta}_{0})\|_{2}\leq\mathcal{O}(L)$ and $|f(\mathbf{x};\boldsymbol{\theta}_{0})|\leq\mathcal{O}(1)$,

where 𝛉t(k)\boldsymbol{\theta}^{(k)}_{t} represents the parameters of ff after k∈[K]k\in[K] iterations of gradient descent in round tt.

Proof.

Note that the output dimension $d$ in [Allen-Zhu et al., 2019] is removed because the output of the network function in this paper is always a scalar. For (1) and (2), the only setting different from [Allen-Zhu et al., 2019] is the initialization of the last layer: $\mathbf{W}_{L}\sim\mathcal{N}(0,\frac{2}{m})$ in this paper versus $\mathbf{W}_{L}\sim\mathcal{N}(0,\frac{1}{d})$ in [Allen-Zhu et al., 2019]. Because $d=1$ and $m>d$ here, the upper bound in [Allen-Zhu et al., 2019] still holds for $\mathbf{W}_{L}$: with probability at least $1-\exp(-\Omega(m/L))$, $\|\mathbf{W}_{L}\|_{F}\leq\sqrt{m/d}$. Therefore, (1) and (2) still hold for the initialization used in this paper.

For (3), based on Lemma 7.1 in Allen-Zhu etΒ al. [2019], we have |f​(𝐱;𝜽0)|≀π’ͺ​(1)|f(\mathbf{x};\boldsymbol{\theta}_{0})|\leq\mathcal{O}(1). Denote by DD the ReLU function. For any l∈[L]l\in[L],

β€–β–½Wl​f​(𝐱;𝜽0)β€–F≀‖𝐖L​D​𝐖Lβˆ’1​⋯​D​𝐖l+1β€–Fβ‹…β€–D​𝐖l+1​⋯​𝐱‖F≀π’ͺ​(L)\|\triangledown_{W_{l}}f(\mathbf{x};\boldsymbol{\theta}_{0})\|_{F}\leq\|\mathbf{W}_{L}D\mathbf{W}_{L-1}\cdots D\mathbf{W}_{l+1}\|_{F}\cdot\|D\mathbf{W}_{l+1}\cdots\mathbf{x}\|_{F}\leq\mathcal{O}(\sqrt{L})

where the inequality is according to Lemma 7.2 in Allen-Zhu etΒ al. [2019]. Therefore, we have β€–β–½πœ½0​f​(𝐱;𝜽0)β€–2≀π’ͺ​(L)\|\triangledown_{\boldsymbol{\theta}_{0}}f(\mathbf{x};\boldsymbol{\theta}_{0})\|_{2}\leq\mathcal{O}(L). ∎

Lemma C.5 (Lemma 4.1, [Cao and Gu, 2019]).

For any δ∈(0,1)\delta\in(0,1), if ww satisfies

π’ͺ​(mβˆ’3/2​Lβˆ’3/2​[log⁑(t​n​L2/Ξ΄)]3/2)≀w≀π’ͺ​(Lβˆ’6​[log⁑m]βˆ’3/2),\mathcal{O}(m^{-3/2}L^{-3/2}[\log(tnL^{2}/\delta)]^{3/2})\leq w\leq\mathcal{O}(L^{-6}[\log m]^{-3/2}),

then, with probability at least 1βˆ’Ξ΄1-\delta over randomness of 𝛉0\boldsymbol{\theta}_{0}, for any t∈[T],‖𝐱‖2=1t\in[T],\|\mathbf{x}\|_{2}=1, and 𝛉,𝛉′\boldsymbol{\theta},\boldsymbol{\theta}^{\prime} satisfying β€–π›‰βˆ’π›‰0β€–2≀w\|\boldsymbol{\theta}-\boldsymbol{\theta}_{0}\|_{2}\leq w and β€–π›‰β€²βˆ’π›‰0β€–2≀w\|\boldsymbol{\theta}^{\prime}-\boldsymbol{\theta}_{0}\|_{2}\leq w , it holds uniformly that

|f​(𝐱i;𝜽)βˆ’f​(𝐱i;πœ½β€²)βˆ’βŸ¨β–½πœ½β€²β€‹f​(𝐱i;πœ½β€²),πœ½βˆ’πœ½β€²βŸ©|≀π’ͺ​(w1/3​L2​m​log⁑(m))β€‹β€–πœ½βˆ’πœ½β€²β€–2.|f(\mathbf{x}_{i};\boldsymbol{\theta})-f(\mathbf{x}_{i};\boldsymbol{\theta}^{\prime})-\langle\triangledown_{\boldsymbol{\theta}^{\prime}}f(\mathbf{x}_{i};\boldsymbol{\theta}^{\prime}),\boldsymbol{\theta}-\boldsymbol{\theta}^{\prime}\rangle|\leq\mathcal{O}(w^{1/3}L^{2}\sqrt{m\log(m)})\|\boldsymbol{\theta}-\boldsymbol{\theta}^{\prime}\|_{2}. (C.33)
Lemma C.6.

For any Ξ΄>0\delta>0, suppose

m>π’ͺ~​(poly​(T,n,Οβˆ’1,L,log⁑(1/Ξ΄)β‹…elog⁑1/Ξ΄)).m>\tilde{\mathcal{O}}\left(\text{poly}(T,n,\rho^{-1},L,\log(1/\delta)\cdot e^{\sqrt{\log 1/\delta}})\right).

Then, with probability at least $1-\delta$, setting $\eta_{2}=\Theta(\frac{t^{5}}{\delta^{2}\sqrt{2}m})$ for Algorithm 1, for any $\widetilde{\boldsymbol{\theta}}^{2}$ satisfying $\|\widetilde{\boldsymbol{\theta}}^{2}-\boldsymbol{\theta}^{2}_{0}\|_{2}\leq\mathcal{O}(\frac{t^{3}}{\rho\sqrt{m}}\log m)$, it holds that

βˆ‘Ο„=1t|f2​(ϕ​(β–½πœ½^Ο„βˆ’11​f1​(𝐱τ;𝜽^Ο„βˆ’11));𝜽^Ο„βˆ’12)βˆ’(rΟ„βˆ’f1​(𝐱τ;𝜽^Ο„βˆ’11))|\displaystyle\sum_{\tau=1}^{t}\left|f_{2}\left(\phi(\triangledown_{\widehat{\boldsymbol{\theta}}_{\tau-1}^{1}}f_{1}(\mathbf{x}_{\tau};\widehat{\boldsymbol{\theta}}_{\tau-1}^{1}));\widehat{\boldsymbol{\theta}}_{\tau-1}^{2}\right)-\left(r_{\tau}-f_{1}(\mathbf{x}_{\tau};\widehat{\boldsymbol{\theta}}_{\tau-1}^{1})\right)\right|
β‰€βˆ‘Ο„=1t|f2​(ϕ​(β–½πœ½^Ο„βˆ’11​f1​(𝐱τ;𝜽^Ο„βˆ’11));𝜽~2)βˆ’(rΟ„βˆ’f1​(𝐱τ;𝜽^Ο„βˆ’11))|+π’ͺ​(3​L​t2)\displaystyle\leq\sum_{\tau=1}^{t}\left|f_{2}\left(\phi(\triangledown_{\widehat{\boldsymbol{\theta}}_{\tau-1}^{1}}f_{1}(\mathbf{x}_{\tau};\widehat{\boldsymbol{\theta}}_{\tau-1}^{1}));\widetilde{\boldsymbol{\theta}}^{2}\right)-\left(r_{\tau}-f_{1}(\mathbf{x}_{\tau};\widehat{\boldsymbol{\theta}}_{\tau-1}^{1})\right)\right|+\mathcal{O}\left(\frac{3L\sqrt{t}}{\sqrt{2}}\right)
Proof.

This is a direct application of Lemma 4.3 in [Cao and Gu, 2019] by setting R=t3ρ​log⁑m,Ο΅=L​R2​ν​tR=\frac{t^{3}}{\rho}\log m,\epsilon=\frac{LR}{\sqrt{2\nu t}}, and Ξ½=ν′​R2\nu=\nu^{\prime}R^{2}, where Ξ½β€²\nu^{\prime} is some small enough absolute constant. We set Lτ​(𝜽^Ο„βˆ’12)=|f2​(β–½πœ½^Ο„βˆ’11​f1;𝜽^Ο„βˆ’12)βˆ’(rΟ„βˆ’f1​(𝐱τ;𝜽^Ο„βˆ’11))|L_{\tau}(\widehat{\boldsymbol{\theta}}_{\tau-1}^{2})=\left|f_{2}(\triangledown_{\widehat{\boldsymbol{\theta}}_{\tau-1}^{1}}f_{1};\widehat{\boldsymbol{\theta}}_{\tau-1}^{2})-\left(r_{\tau}-f_{1}(\mathbf{x}_{\tau};\widehat{\boldsymbol{\theta}}_{\tau-1}^{1})\right)\right|. Based on Lemma C.4 (2), for any Ο„βˆˆ[t]\tau\in[t], we have

β€–πœ½^Ο„2βˆ’πœ½^Ο„βˆ’12β€–2β‰€β€–πœ½^Ο„2βˆ’πœ½0β€–2+β€–πœ½0βˆ’πœ½^Ο„βˆ’12β€–2≀π’ͺ​(t3ρ​m​log⁑m).\|\widehat{\boldsymbol{\theta}}_{\tau}^{2}-\widehat{\boldsymbol{\theta}}_{\tau-1}^{2}\|_{2}\leq\|\widehat{\boldsymbol{\theta}}_{\tau}^{2}-\boldsymbol{\theta}_{0}\|_{2}+\|\boldsymbol{\theta}_{0}-\widehat{\boldsymbol{\theta}}_{\tau-1}^{2}\|_{2}\leq\mathcal{O}(\frac{t^{3}}{\rho\sqrt{m}}\log m).

Then, according to Lemma 4.3 in [Cao and Gu, 2019], for any $\widetilde{\boldsymbol{\theta}}^{2}$ satisfying $\|\widetilde{\boldsymbol{\theta}}^{2}-\boldsymbol{\theta}^{2}_{0}\|_{2}\leq\mathcal{O}(\frac{t^{3}}{\rho\sqrt{m}}\log m)$, there exists a small enough absolute constant $\nu^{\prime}$ such that

βˆ‘Ο„=1tLτ​(𝜽^Ο„βˆ’12)β‰€βˆ‘Ο„=1tLτ​(𝜽~2)+3​t​ϡ.\sum_{\tau=1}^{t}L_{\tau}(\widehat{\boldsymbol{\theta}}_{\tau-1}^{2})\leq\sum_{\tau=1}^{t}L_{\tau}(\widetilde{\boldsymbol{\theta}}^{2})+3t\epsilon. (C.34)

Then, substituting the value of $\epsilon$ completes the proof. ∎

Lemma C.7 (Theorem 5, Allen-Zhu etΒ al. [2019]).

For any δ∈(0,1)\delta\in(0,1), if ww satisfies that

π’ͺ​(mβˆ’3/2​Lβˆ’3/2​max⁑{logβˆ’3/2⁑m,log3/2⁑(T​n/Ξ΄)})≀w≀π’ͺ​(Lβˆ’9/2​logβˆ’3⁑m),\mathcal{O}(m^{-3/2}L^{-3/2}\max\{\log^{-3/2}m,\log^{3/2}(Tn/\delta)\})\leq w\leq\mathcal{O}(L^{-9/2}\log^{-3}m), (C.35)

then, with probability at least 1βˆ’Ξ΄1-\delta, for all β€–π›‰βˆ’π›‰0β€–2≀w\|\boldsymbol{\theta}-\boldsymbol{\theta}_{0}\|_{2}\leq w, we have

β€–β–½πœ½β€‹f​(𝐱;𝜽)βˆ’β–½πœ½0​f​(𝐱;𝜽0)β€–2≀π’ͺ​(log⁑m​w1/3​L3)β€‹β€–β–½πœ½0​f​(𝐱;𝜽0)β€–2.\|\triangledown_{\boldsymbol{\theta}}f(\mathbf{x};\boldsymbol{\theta})-\triangledown_{\boldsymbol{\theta}_{0}}f(\mathbf{x};\boldsymbol{\theta}_{0})\|_{2}\leq\mathcal{O}(\sqrt{\log m}w^{1/3}L^{3})\|\triangledown_{\boldsymbol{\theta}_{0}}f(\mathbf{x};\boldsymbol{\theta}_{0})\|_{2}. (C.36)

Appendix D Motivation of Exploration Network

Table 2: Selection Criterion Comparison (𝐱t\mathbf{x}_{t}: selected arm in round tt).
Methods Selection Criterion
Neural Epsilon-greedy With probability 1βˆ’Ο΅1-\epsilon, 𝐱t=arg⁑max𝐱t,i,i∈[n]⁑f1​(𝐱t,i;𝜽1)\mathbf{x}_{t}=\arg\max_{\mathbf{x}_{t,i},i\in[n]}f_{1}(\mathbf{x}_{t,i};\boldsymbol{\theta}^{1}); Otherwise, select 𝐱t\mathbf{x}_{t} randomly.
NeuralTS [Zhang etΒ al., 2021] For 𝐱t,i,βˆ€i∈[n]\mathbf{x}_{t,i},\forall i\in[n], draw r^t,i\hat{r}_{t,i} from 𝒩​(f1​(𝐱t,i;𝜽1),Οƒt,i2)\mathcal{N}(f_{1}(\mathbf{x}_{t,i};\boldsymbol{\theta}^{1}),{\sigma_{t,i}}^{2}). Then, select 𝐱t,i^\mathbf{x}_{t,\hat{i}}, i^=arg⁑maxi∈[n]⁑r^t,i.\hat{i}=\arg\max_{i\in[n]}\hat{r}_{t,i}.
NeuralUCB [Zhou etΒ al., 2020] 𝐱t=arg⁑max𝐱t,i,i∈[n]⁑(f1​(𝐱t,i;𝜽1)+UCBt,i).\mathbf{x}_{t}=\arg\max_{\mathbf{x}_{t,i},i\in[n]}\left(f_{1}(\mathbf{x}_{t,i};\boldsymbol{\theta}^{1})+\text{UCB}_{t,i}\right).
EE-Net (Our approach) $\forall i\in[n]$, compute $f_{1}(\mathbf{x}_{t,i};\boldsymbol{\theta}^{1})$ (Exploitation Net) and $f_{2}\left(\triangledown_{\boldsymbol{\theta}^{1}}f_{1}(\mathbf{x}_{t,i};\boldsymbol{\theta}^{1});\boldsymbol{\theta}^{2}\right)$ (Exploration Net). Then $\mathbf{x}_{t}=\arg\max_{\mathbf{x}_{t,i},i\in[n]}f_{3}(f_{1},f_{2};\boldsymbol{\theta}^{3})$.

In this section, we list one gradient-based UCB from existing works [Ban etΒ al., 2021, Zhou etΒ al., 2020], which motivates our design of exploration network f2f_{2}. Let g​(𝐱t;𝜽t)=β–½πœ½t​f​(𝐱t;𝜽t)g(\mathbf{x}_{t};\boldsymbol{\theta}_{t})=\triangledown_{\boldsymbol{\theta}_{t}}f(\mathbf{x}_{t};\boldsymbol{\theta}_{t}).

Lemma D.1.

(Lemma 5.2 in [Ban et al., 2021]). Given a set of context vectors $\{\mathbf{x}_{t}\}_{t=1}^{T}$ and the corresponding rewards $\{r_{t}\}_{t=1}^{T}$ with $\mathbb{E}(r_{t})=h(\mathbf{x}_{t})$ for any $\mathbf{x}_{t}\in\{\mathbf{x}_{t}\}_{t=1}^{T}$, let $f(\mathbf{x}_{t};\boldsymbol{\theta})$ be the $L$-layer fully-connected neural network with width $m$, learning rate $\eta$, and $K$ iterations of gradient descent. Then, there exist positive constants $C_{1},C_{2},S$ such that if

m\geq\text{poly}\big(T,n,L,\log(1/\delta)\cdot d\cdot e^{\sqrt{\log 1/\delta}}\big),\qquad \eta=\mathcal{O}(TmL+m\lambda)^{-1},\qquad K\geq\widetilde{\mathcal{O}}(TL/\lambda),

then, with probability at least $1-\delta$, for any $\mathbf{x}_{t}\in\{\mathbf{x}_{t}\}_{t=1}^{T}$, we have the following upper confidence bound:

\left|h(\mathbf{x}_{t})-f(\mathbf{x}_{t};\boldsymbol{\theta}_{t})\right|\leq\gamma_{1}\|g(\mathbf{x}_{t};\boldsymbol{\theta}_{t})/\sqrt{m}\|_{\mathbf{A}_{t}^{-1}}+\gamma_{2}+\gamma_{1}\gamma_{3}+\gamma_{4}, \qquad \text{(D.1)}

where

\gamma_{1}(m,L)=\big(\lambda+t\mathcal{O}(L)\big)\cdot\big((1-\eta m\lambda)^{K/2}\sqrt{t/\lambda}\big)+1
\gamma_{2}(m,L,\delta)=\|g(\mathbf{x}_{t};\boldsymbol{\theta}_{0})/\sqrt{m}\|_{\mathbf{A}_{t}^{\prime-1}}\cdot\left(\sqrt{\log\Big(\frac{\det(\mathbf{A}_{t}^{\prime})}{\det(\lambda\mathbf{I})}\Big)-2\log\delta}+\lambda^{1/2}S\right)
\gamma_{3}(m,L)=C_{2}m^{-1/6}\sqrt{\log m}\,t^{1/6}\lambda^{-7/6}L^{7/2},\qquad \gamma_{4}(m,L)=C_{1}m^{-1/6}\sqrt{\log m}\,t^{2/3}\lambda^{-2/3}L^{3}
\mathbf{A}_{t}=\lambda\mathbf{I}+\sum_{i=1}^{t}g(\mathbf{x}_{i};\boldsymbol{\theta}_{i})g(\mathbf{x}_{i};\boldsymbol{\theta}_{i})^{\intercal}/m,\qquad \mathbf{A}_{t}^{\prime}=\lambda\mathbf{I}+\sum_{i=1}^{t}g(\mathbf{x}_{i};\boldsymbol{\theta}_{0})g(\mathbf{x}_{i};\boldsymbol{\theta}_{0})^{\intercal}/m.

Note that $g(\mathbf{x}_{t};\boldsymbol{\theta}_{0})$ is the gradient at initialization, which is fixed (a constant) once the network is initialized. Therefore, the above UCB can be written in the following form for the exploitation network $f_{1}$: $|h(\mathbf{x}_{t,i})-f_{1}(\mathbf{x}_{t,i};\boldsymbol{\theta}^{1}_{t})|\leq\Psi\big(g(\mathbf{x}_{t,i};\boldsymbol{\theta}^{1}_{t})\big)$, i.e., the confidence width is a function $\Psi$ of the gradient of $f_{1}$.
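For concreteness, the sketch below computes the dominant data-dependent term of this bonus, $\|g(\mathbf{x}_{t};\boldsymbol{\theta}_{t})/\sqrt{m}\|_{\mathbf{A}_{t}^{-1}}$, following the definition of $\mathbf{A}_{t}$ above. The gradients are random placeholders (assumed values); only the linear-algebra recipe follows the lemma.

```python
import numpy as np

rng = np.random.default_rng(1)

p, m, lam, t = 64, 1024, 1.0, 50   # #params, width, ridge parameter, rounds (assumed)

# Placeholder gradients g(x_i; theta_i) for past rounds and for the current arm.
past_grads = rng.normal(size=(t, p))
g_now = rng.normal(size=p)

# A_t = lambda * I + sum_i g g^T / m, as defined in the lemma.
A = lam * np.eye(p) + past_grads.T @ past_grads / m

# Gradient-based bonus: || g / sqrt(m) ||_{A_t^{-1}} = sqrt(g^T A^{-1} g / m).
bonus = np.sqrt((g_now @ np.linalg.solve(A, g_now)) / m)
print("gradient-based UCB bonus:", bonus)
```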

EE-Net has smaller approximation error. Given an arm $x$, let $f_{1}(x)$ be the estimated reward and $h(x)$ be the expected reward. The exploration network $f_{2}$ in EE-Net learns $h(x)-f_{1}(x)$, i.e., the residual between the expected reward and the estimated reward, which is the ultimate goal of exploration. Using a network $f_{2}$ to learn $h(x)-f_{1}(x)$ has advantages over deriving a statistical upper bound for it, as done in NeuralUCB, [Ban et al., 2021], and NeuralTS (in NeuralTS, the variance $\nu$ can be thought of as the upper bound). For EE-Net, the approximation error for $h(x)-f_{1}(x)$ is caused only by the generalization error of the neural network (Lemma B.1 in the manuscript). In contrast, for NeuralUCB, [Ban et al., 2021], and NeuralTS, the approximation error for $h(x)-f_{1}(x)$ consists of three parts: the first is caused by ridge regression; the second by the distance between ridge regression and the Neural Tangent Kernel (NTK); and the third by the distance between the NTK and the network function. Because these methods use the upper bound to make selections, the three errors inherently exist in their algorithms. By reducing the three parts of the approximation error to only the neural network convergence error, EE-Net achieves a tighter regret bound (improving by roughly $\sqrt{\log T}$).
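A minimal sketch of this idea is given below, assuming small stand-in networks for $f_{1}$ and $f_{2}$ and omitting any dimension reduction of the gradient that an implementation might apply: $f_{2}$ takes the flattened gradient of $f_{1}$ as input and is regressed on the observed residual $r-f_{1}(x)$, so its approximation error is only the network's generalization error rather than a stacked ridge-regression/NTK bound.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d, m = 16, 64   # input dimension and hidden width (assumed values)

f1 = nn.Sequential(nn.Linear(d, m), nn.ReLU(), nn.Linear(m, 1))   # exploitation net
p = sum(w.numel() for w in f1.parameters())
f2 = nn.Sequential(nn.Linear(p, m), nn.ReLU(), nn.Linear(m, 1))   # exploration net
opt2 = torch.optim.SGD(f2.parameters(), lr=1e-2)

def grad_input(x):
    """Flattened gradient of f1 at x, used as the (detached) input of f2."""
    f1.zero_grad()
    f1(x).sum().backward()
    return torch.cat([w.grad.flatten() for w in f1.parameters()]).detach()

# One training step of f2 on an observed (arm, reward) pair.
x = torch.randn(1, d)
r = torch.tensor([[0.7]])                 # observed reward (placeholder)
residual = (r - f1(x)).detach()           # target: h(x) - f1(x), approximated by r - f1(x)
loss = (f2(grad_input(x).unsqueeze(0)) - residual).pow(2).mean()
opt2.zero_grad(); loss.backward(); opt2.step()
print("f2 training loss:", loss.item())
```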

Table 3: Exploration Direction Comparison.
Methods "Upward" Exploration "Downward" Exploration
NeuralUCB √\surd Γ—\times
NeuralTS Randomly Randomly
EE-Net √\surd √\surd
Figure 6: Two types of exploration: upward exploration and downward exploration. $f_{1}$ is the exploitation network (estimated reward) and $h$ is the expected reward.

EE-Net has the ability to determine the exploration direction. The two types of exploration are illustrated in Figure 6. When the estimated reward is larger than the expected reward, i.e., $h(x)-f_{1}(x)<0$, we need "downward exploration", i.e., lowering the exploration score of $x$ to reduce its chance of being selected; when $h(x)-f_{1}(x)>0$, we should perform "upward exploration", i.e., raising the exploration score of $x$ to increase its chance of being selected. For EE-Net, $f_{2}$ learns $h(x)-f_{1}(x)$: when $h(x)-f_{1}(x)>0$, $f_{2}(x)$ will be positive, producing upward exploration; when $h(x)-f_{1}(x)<0$, $f_{2}(x)$ will be negative, producing downward exploration. In contrast, NeuralUCB always performs upward exploration, i.e., it scores arms by $f_{1}(x)+UCB(x)$ where $UCB(x)$ is always positive; in particular, when $h(x)-f_{1}(x)<0$, NeuralUCB further amplifies the over-estimate. NeuralTS randomly chooses upward or downward exploration in all cases, because it draws a sampled reward from a normal distribution whose mean is $f_{1}(x)$ and whose variance $\nu$ is the upper bound.
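As a toy numerical illustration with assumed values: suppose arm 1 is over-estimated and arm 2 is under-estimated. A learned residual corrects both directions, whereas an always-positive bonus amplifies the over-estimate.

```python
# Toy illustration (assumed numbers): arm 1 is over-estimated, arm 2 is under-estimated.
h   = [0.3, 0.5]                          # true expected rewards
f1  = [0.6, 0.4]                          # exploitation estimates
f2  = [h[i] - f1[i] for i in range(2)]    # ideal exploration scores: the residuals
ucb = [0.2, 0.2]                          # an always-positive bonus

ee_scores  = [f1[i] + f2[i] for i in range(2)]   # -> [0.3, 0.5]: picks arm 2 (correct)
ucb_scores = [f1[i] + ucb[i] for i in range(2)]  # -> [0.8, 0.6]: still picks arm 1
print(ee_scores, ucb_scores)
```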