This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

Neural Bandit with Arm Group Graph

Yunzhe Qi University of Illinois at Urbana-Champaign [email protected] Yikun Ban University of Illinois at Urbana-Champaign [email protected]  and  Jingrui He University of Illinois at Urbana-Champaign [email protected]
(2022)
Abstract.

Contextual bandits aim to identify among a set of arms the optimal one with the highest reward based on their contextual information. Motivated by the fact that the arms usually exhibit group behaviors and the mutual impacts exist among groups, we introduce a new model, Arm Group Graph (AGG), where the nodes represent the groups of arms and the weighted edges formulate the correlations among groups. To leverage the rich information in AGG, we propose a bandit algorithm, AGG-UCB, where the neural networks are designed to estimate rewards, and we propose to utilize graph neural networks (GNN) to learn the representations of arm groups with correlations. To solve the exploitation-exploration dilemma in bandits, we derive a new upper confidence bound (UCB) built on neural networks (exploitation) for exploration. Furthermore, we prove that AGG-UCB can achieve a near-optimal regret bound with over-parameterized neural networks, and provide the convergence analysis of GNN with fully-connected layers which may be of independent interest. In the end, we conduct extensive experiments against state-of-the-art baselines on multiple public data sets, showing the effectiveness of the proposed algorithm.

Contextual Bandits; Online Learning; Graph Neural Networks
conference: KDD ’22; August 14–18, 2021; Virtual Eventccs: Information systems Personalizationccs: Theory of computation Online learning algorithmsjournalyear: 2022copyright: acmcopyrightconference: Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining; August 14–18, 2022; Washington, DC, USAbooktitle: Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD ’22), August 14–18, 2022, Washington, DC, USAprice: 15.00doi: 10.1145/3534678.3539312isbn: 978-1-4503-9385-0/22/08

1. Introduction

Contextual bandits are a specific type of multi-armed bandit (MAB) problem where the learner has access to the contextual information (contexts) related to arms at each round, and the learner is required to make recommendations based on past contexts and received rewards. A variety of models and algorithms have been proposed and successfully applied on real-world problems, such as online content and advertising recommendation (Li et al., 2010; Wu et al., 2016), clinical trials (Durand et al., 2018; Villar et al., 2015) and virtual support agents (Sajeev et al., 2021).

In this paper, we focus on exploiting the accessible arm information to improve the performance of bandit algorithms. Among different types of contextual bandit algorithms, upper confidence bound (UCB) algorithms have been proposed to balance between exploitation and exploration (Auer et al., 2002; Chu et al., 2011; Valko et al., 2013). For conventional UCB algorithms, they are either under the ”pooling setting” (Chu et al., 2011) where one single UCB model is applied for all candidate arms, or the ”disjoint setting” (Li et al., 2010) where each arm is given its own estimator without the collaboration across different arms. Both settings have their limitations: applying only one single model may lead to unanticipated estimation error when some arms exhibit distinct behaviors (Wu et al., 2019b, 2016); on the other hand, assigning each arm its own estimator neglects the mutual impacts among arms and usually suffers from limited user feedback (Gentile et al., 2017; Ban and He, 2021b).

To deal with this challenge, adaptively assigning UCB models to arms based on their group information can be an ideal strategy, i.e., each group of arms has one estimator to represent its behavior. This modeling strategy is linked to ”arm groups” existing in real-world applications. For example, regarding the online movie recommendation scenario, the movies (arms) with the same genre can be assigned to one (arm) group. Another scenario is the drug development, where given a new cancer treatment and a patient pool, we need to select the best patient on whom the treatment is most effective. Here, the patients are the arms, and they can be naturally grouped by their non-numerical attributes, such as the cancer types. Such group information is easily accessible, and can significantly improve the performance of bandit algorithms. Although some works  (Li et al., 2016; Deshmukh et al., 2017) have been proposed to leverage the arm correlations, they can suffer from two common limitations. First, they rely on the assumption of parametric (linear / kernel-based) reward functions, which may not hold in real-world applications (Zhou et al., 2020). Second, they both neglect the correlations among arm groups. We emphasize that the correlations among arm groups also play indispensable roles in many decision-making scenarios. For instance, in online movie recommendation, with each genre being a group of movies, the users who like ”adventure movies” may also appreciate ”action movies”. Regarding drug development, since the alternation of some genes may lead to multiple kinds of tumors (Shao et al., 2019), different types of cancer can also be correlated to some extent.

To address these limitations, we first introduce a novel model, AGG (Arm Group Graph), to formulate non-linear reward assumptions and arm groups with correlations. In this model, as arm attributes are easily accessible (e.g., movie’s genres and patient’s cancer types), the arms with the same attribute are assigned into one group, and represented as one node in the graph. The weighted edge between two nodes represents the correlation between these two groups. In this paper, we assume the arms from the same group are drawn from one unknown distribution. This also provides us with an opportunity to model the correlation of two arm groups by modeling the statistical distance between their associated distributions. Meantime, the unknown non-parametric reward mapping function can be either linear or non-linear.

Then, with the arm group graph, we propose the AGG-UCB framework for contextual bandits. It applies graph neural networks (GNNs) to learn the representations of arm groups with correlations, and neural networks to estimate the reward functions (exploitation). In particular, with the collaboration across arm groups, each arm will be assigned with the group-aware arm representation learned by GNN, which will be fed into a fully-connected (FC) network for the estimation of arm rewards. To deal with the exploitation-exploration dilemma, we also derive a new upper confidence bound based on network-gradients for exploration. By leveraging the arm group information and modeling arm group correlations, our proposed framework provides a novel arm selection strategy for dealing with the aforementioned challenges and limitations. Our main contributions can be summarized as follows:

  • First, motivated by real-world applications, we introduce a new graph-based model in contextual bandits to leverage the available group information of arms and exploit the potential correlations among arm groups.

  • Second, we propose a novel UCB-based neural framework called AGG-UCB  for the graph-based model. To exploit the relationship of arm groups, AGG-UCB  estimates the arm group graph with received contexts on the fly, and utilizes GNN to learn group-aware arm representations.

  • Third, we prove that AGG-UCB can achieve a near-optimal regret bound in the over-parameterized neural works, and provide convergence analysis of GNN with fully-connected layers, which may be of independent interest.

  • Finally, we conduct experiments on publicly available real data sets, and demonstrate that our framework outperforms state-of-the-art techniques. Additional studies are conducted to understand the properties of the proposed framework.

The rest of this paper is organized as following. In Section 2, we briefly discuss related works. Section 3 introduces the new problem settings, and details of our proposed framework AGG-UCB  will be presented in Section 4. Then, we provide theoretical analysis for AGG-UCB in Section 5. After presenting experimental results in Section 6, we finally conclude the paper in Section 7. Due to the page limit, readers may refer to our arXiv version of the paper for the supplementary contents (https://arxiv.org/abs/2206.03644).

2. Related Works

In this section, we briefly review the related work on contextual bandits. Lin-UCB (Chu et al., 2011) first formulates the reward estimation through a linear regression with the received context and builds a confidence bound accordingly. Kernel-UCB (Valko et al., 2013) further extends the reward mapping to the Reproducing Kernel Hilbert Space (RKHS) for the reward and confidence bound estimation under non-linear settings. Besides, there are algorithms under the non-linear settings. Similarly, CGP-UCB (Krause and Ong, 2011) models the reward function through a Gaussian process. GCN-UCB (Upadhyay et al., 2020) applies the GNN model to learn each context an embedding for the linear regression. Then, Neural-UCB (Zhou et al., 2020) proposes to apply FC neural network for reward estimations and derive a confidence bound with the network gradient, which is proved to be a success, and similar ideas has been applied to some other models (Ban et al., 2021; Zhang et al., 2020; Ban and He, 2021a). (Ban et al., 2022b) assigns another FC neural network to learn the confidence ellipsoid for exploration. Yet, as these works consider no collaboration among estimators, they may suffer from the performance bottleneck in the introduction.

To collaborate with different estimators for contextual bandits, various approaches are proposed from different perspectives. User clustering algorithms (Gentile et al., 2014; Li et al., 2019; Ban and He, 2021b; Ban et al., 2022a) try to cluster user with alike preferences into user groups for information sharing while COFIBA (Li et al., 2016) additionally models the relationship of arms. Then, KMTL-UCB (Deshmukh et al., 2017) extends Kernel-UCB to multi-task learning settings for a refined reward and confidence bound estimation. However, these works may encounter with performance bottlenecks as they incline to make additional assumptions on the reward mapping by applying parametric models and neglect the available arm group information.

GNNs (Kipf and Welling, 2016; Wu et al., 2019a; Klicpera et al., 2018; Fu and He, 2021b) are a kind of neural models that deal with tasks on graphs, such as community detection (You et al., 2019), recommender systems (Wei et al., 2020) and modeling protein interactions (Fu and He, 2021a). GNNs can learn from the topological graph structure information and the input graph signal simultaneously, which enables AGG-UCB  to cooperate with different arm groups by sharing information over the arm group neighborhood.

3. Problem Definition and Notation

We suppose a fixed pool 𝒞={1,,Nc}\mathcal{C}=\{1,\dots,N_{c}\} for arm groups with the number of arm groups being |𝒞|=Nc\lvert\mathcal{C}\rvert=N_{c}, and assume each arm group c𝒞c\in\mathcal{C} is associated with an unknown fixed context distribution 𝒟c\mathcal{D}_{c}. At each time step tt, we will receive a subset of groups 𝒞t𝒞\mathcal{C}_{t}\subseteq\mathcal{C}. For each group c𝒞tc\in\mathcal{C}_{t}, we will have the set of sampled arms 𝒳c,t={𝒙c,t(1),𝒙c,t(nc,t)}\mathcal{X}_{c,t}=\{\boldsymbol{x}^{(1)}_{c,t},\cdots\boldsymbol{x}^{(n_{c,t})}_{c,t}\} with the size of |𝒳c,t|=nc,t\lvert\mathcal{X}_{c,t}\rvert=n_{c,t}. Then, i[nc,t]={1,,nc,t}\forall i\in[n_{c,t}]=\{1,\dots,n_{c,t}\}, we suppose 𝒙c,t(i)𝒟c\boldsymbol{x}^{(i)}_{c,t}\sim\mathcal{D}_{c} with the dimensionality 𝒙c,t(i)dx\boldsymbol{x}^{(i)}_{c,t}\in\mathbb{R}^{d_{x}}. Therefore, in the tt-th round, we receive

(1) {𝒳c,t|c𝒞t}and𝒳c,t={𝒙c,t(1),𝒙c,t(nc,t)},c𝒞t.\{\mathcal{X}_{c,t}|c\in\mathcal{C}_{t}\}\ \text{and}\ \mathcal{X}_{c,t}=\{\boldsymbol{x}^{(1)}_{c,t},\cdots\boldsymbol{x}^{(n_{c,t})}_{c,t}\},\forall c\in\mathcal{C}_{t}.

With 𝑾Nc×Nc\boldsymbol{W}^{*}\in\mathbb{R}^{N_{c}\times N_{c}} being the unknown affinity matrix encoding the true arm group correlations, the true reward rc,t(i)r_{c,t}^{(i)} for arm 𝒙c,t(i)\boldsymbol{x}^{(i)}_{c,t} is defined as

(2) rc,t(i)=h(𝑾,𝒙c,t(i))+ϵc,t(i)\begin{split}r_{c,t}^{(i)}=h(\boldsymbol{W}^{*},~{}~{}\boldsymbol{x}^{(i)}_{c,t})+\epsilon^{(i)}_{c,t}\end{split}

where h()h(\cdot) represents the unknown reward mapping function, and ϵ\epsilon is the zero-mean Gaussian noise. For brevity, let 𝒙t\boldsymbol{x}_{t} be the arm we select in round tt and rtr_{t} be the corresponding received reward.

Our goal is recommending arm 𝒙t\boldsymbol{x}_{t} (with reward rtr_{t}) at each time step tt to minimize the cumulative pseudo-regret R(T)=t=1T𝔼[(rtrt)]R(T)=\sum_{t=1}^{T}\mathbb{E}\left[(r_{t}^{*}-r_{t})\right] where 𝔼[rt]=max(c𝒞t,i[nc,t])[h(𝑾,𝒙c,t(i))]\mathbb{E}[r_{t}^{*}]=\max_{(c\in\mathcal{C}_{t},i\in[n_{c,t}])}\big{[}h(\boldsymbol{W}^{*},\boldsymbol{x}^{(i)}_{c,t})\big{]}. At each time step tt, the overall set of received contexts is defined as 𝒳t={𝒙c,t(i)}c𝒞t,i[nc,t]\mathcal{X}_{t}=\{\boldsymbol{x}^{(i)}_{c,t}\}_{c\in\mathcal{C}_{t},i\in[n_{c,t}]}. Note that one arm is possibly associated with multiple arm groups, such as a movie with multiple genres. In other words, for some c,c𝒞tc,c^{\prime}\in\mathcal{C}_{t}, we may have 𝒳c,t𝒳c,t\mathcal{X}_{c,t}\cap\mathcal{X}_{c^{\prime},t}\neq\emptyset.

In order to model the arm group correlations, we maintain an undirected graph 𝒢t=(V,E,Wt)\mathcal{G}_{t}=(V,E,W_{t}) at each time step tt, where each arm group from 𝒞\mathcal{C} is mapped to a corresponding node in node set VV. Then, E={e(ci,cj)}ci,cj𝒞E=\{e(c_{i},c_{j})\}_{c_{i},c_{j}\in\mathcal{C}} is the set of edges, and WtW_{t} represents the set of edge weights. Note that by definition, 𝒢t\mathcal{G}_{t} will stay as a fully-connected graph, and the estimated arm group correlations are modeled by the edge weights connecting pairs of nodes. For a node vVv\in V, we denote the augmented kk-hop neighborhood 𝒩~k(v)=𝒩k(v){v}\widetilde{\mathcal{N}}_{k}(v)=\mathcal{N}_{k}(v)\cup\{v\} as the union node set of its kk-hop neighborhood 𝒩k(v)\mathcal{N}_{k}(v) and node vv itself. For the arm group graph 𝒢t\mathcal{G}_{t}, we denote 𝑨tNc×Nc\boldsymbol{A}_{t}\in\mathbb{R}^{N_{c}\times N_{c}} as the adjacency matrix (with added self-loops) of the given arm group graph and 𝑫tNc×Nc\boldsymbol{D}_{t}\in\mathbb{R}^{N_{c}\times N_{c}} as its degree matrix. For the notation consistency, we will apply a true arm group graph 𝒢\mathcal{G}^{*} instead of 𝑾\boldsymbol{W}^{*} in Eq. 2 to represent the true arm group correlation.

4. Proposed AGG-UCB Framework

In this section, we start with an overview to introduce our proposed AGG-UCB framework. Then, we will show our estimation method of arm group graph before mentioning the related group-aware arm embedding. Afterwards, the two components of our proposed framework, namely the aggregation module and the reward-estimation module, will be presented.

4.1. Overview of AGG-UCB Framework

In Algorithm 1, we present the pseudo-code of proposed AGG-UCB framework.

1 Input: Number of rounds TT, exploration parameter γ\gamma, regularization parameter λ\lambda, network width mm, network depth LL, neighborhood size kk.
2 Output: Arm recommendation 𝒙t\boldsymbol{x}_{t} for each time step tt.
3 Initialization: Initialize the arm group graph as a connected graph 𝒢1=(V,E,W1)\mathcal{G}_{1}=(V,E,W_{1}). Initialize gradient matrix 𝒁𝟎=λ𝑰\boldsymbol{Z_{0}}=\lambda\boldsymbol{I}. Initialize parameter 𝚯𝟎\boldsymbol{\Theta_{0}} for the model f(𝒢,X;𝚯𝟎)f(\mathcal{G},X;\boldsymbol{\Theta_{0}}).
4 for t=1,2,,Tt=1,2,...,T do
5       Receive a set of arm contexts 𝒳t={𝒙c,t(i)}c𝒞t,i[nc,t]\mathcal{X}_{t}=\{\boldsymbol{x}^{(i)}_{c,t}\}_{c\in\mathcal{C}_{t},i\in[n_{c,t}]}.
6       Embed the arm set 𝒳t\mathcal{X}_{t} into 𝒳~t\widetilde{\mathcal{X}}_{t} w.r.t. Eq.4.
7       for each embedded arm 𝐗~c,t(i)𝒳~t\widetilde{\boldsymbol{X}}^{(i)}_{c,t}\in\widetilde{\mathcal{X}}_{t} do
8             Obtain the point estimate r^c,t(i)=f(𝒢t,𝑿~c,t(i);𝚯𝒕𝟏)\widehat{r}^{(i)}_{c,t}=f(\mathcal{G}_{t},\widetilde{\boldsymbol{X}}^{(i)}_{c,t};\boldsymbol{\Theta_{t-1}}).
9             Obtain network gradient 𝒈c,t(i)Θf(𝒢t,𝑿~c,t(i);𝚯𝒕𝟏)\boldsymbol{g}^{(i)}_{c,t}\xleftarrow{}\nabla_{\Theta}f(\mathcal{G}_{t},\widetilde{\boldsymbol{X}}^{(i)}_{c,t};\boldsymbol{\Theta_{t-1}}).
10             Calculate confidence bound as i^c,t(i)=𝒈c,t(i)𝒁t1𝒈c,t(i)/m\widehat{i}^{(i)}_{c,t}=\sqrt{\boldsymbol{g}_{c,t}^{(i)\intercal}\boldsymbol{Z}_{t-1}\boldsymbol{g}^{(i)}_{c,t}/m}.
11            
12       end for
13      Recommend 𝑿~t=argmax𝑿~c,t(i)𝒳~t(r^c,t(i)+γi^c,t(i))\widetilde{\boldsymbol{X}}_{t}=\operatorname*{argmax}_{\widetilde{\boldsymbol{X}}^{(i)}_{c,t}\in\widetilde{\mathcal{X}}_{t}}(\widehat{r}^{(i)}_{c,t}+\gamma\cdot\widehat{i}^{(i)}_{c,t}) with the received reward represented as rtr_{t}.
14       Calculate arm group distances w.r.t. Eq.3, and update the arm group graph 𝒢t\mathcal{G}_{t} to 𝒢t+1\mathcal{G}_{t+1}.
15       Update the model parameter 𝚯𝒕𝟏\boldsymbol{\Theta_{t-1}} to 𝚯𝒕\boldsymbol{\Theta_{t}} according to Algorithm 2.
16       Retrieve the 𝑿~t\widetilde{\boldsymbol{X}}_{t}’s gradient vector 𝒈t\boldsymbol{g}_{t}, and update gradient matrix 𝒁t=𝒁t1+𝒈t𝒈t\boldsymbol{Z}_{t}=\boldsymbol{Z}_{t-1}+\boldsymbol{g}_{t}\cdot\boldsymbol{g}_{t}^{\intercal}.
17      
18 end for
ALGORITHM 1 AGG-UCB

At each time step t[T]t\in[T], AGG-UCB would receive a set of input arm contexts 𝒳t={𝒙c,t(i)}c𝒞t,i[nc,t]\mathcal{X}_{t}=\{\boldsymbol{x}^{(i)}_{c,t}\}_{c\in\mathcal{C}_{t},i\in[n_{c,t}]} (line 5). Then, we embed the arm set 𝒳t\mathcal{X}_{t} to 𝒳~t\widetilde{\mathcal{X}}_{t} based on Eq.4 from Subsection 4.3 (line 6). For each embedded arm 𝑿~𝒳~t\widetilde{\boldsymbol{X}}\in\widetilde{\mathcal{X}}_{t}, its estimated reward r^\widehat{r} and confidence bound i^\widehat{i} would be calculated (line 8-10) with the model f()f(\cdot) in Subsection 4.4. After recommending the best arm 𝑿~t\widetilde{\boldsymbol{X}}_{t} (line 12) and receiving its true reward rtr_{t}, we update the current arm group graph 𝒢t\mathcal{G}_{t} based on Subsection 4.2 (line 13). Then, the model parameters 𝚯𝒕𝟏\boldsymbol{\Theta_{t-1}} will be trained based on Algorithm 2 (line 14), and we incrementally update the gradient matrix to 𝒁t=𝒁t1+𝒈t𝒈t\boldsymbol{Z}_{t}=\boldsymbol{Z}_{t-1}+\boldsymbol{g}_{t}\cdot\boldsymbol{g}_{t}^{\intercal} with the gradient vector 𝒈t\boldsymbol{g}_{t} of model f()f(\cdot) given the selected arm 𝑿~t\widetilde{\boldsymbol{X}}_{t} (line 15).

1 Input: Initial parameter 𝚯𝟎\boldsymbol{\Theta_{0}}, step size η\eta, training steps JJ, network width mm. Updated arm group graph 𝒢t+1\mathcal{G}_{t+1}. Selected embedded contexts {𝑿~τ}τ=1t\{\widetilde{\boldsymbol{X}}_{\tau}\}_{\tau=1}^{t}.
2 Output: Updated model parameter 𝚯𝒕\boldsymbol{\Theta_{t}}.
3 𝚯t0𝚯0\boldsymbol{\Theta}_{t}^{0}\xleftarrow{}\boldsymbol{\Theta}_{0}.
4 Let (𝚯)=12τ=1t|f(𝒢t+1,𝑿~τ;𝚯)rτ|2\mathcal{L}(\boldsymbol{\Theta})=\frac{1}{2}\sum_{\tau=1}^{t}\lvert f(\mathcal{G}_{t+1},\widetilde{\boldsymbol{X}}_{\tau};\boldsymbol{\Theta})-r_{\tau}\rvert^{2}
5 for j=1,2,,Jj=1,2,\dots,J do
6       𝚯tj=𝚯tj1η𝚯(𝚯tj1)\boldsymbol{\Theta}_{t}^{j}=\boldsymbol{\Theta}_{t}^{j-1}-\eta\cdot\nabla_{\boldsymbol{\Theta}}\mathcal{L}(\boldsymbol{\Theta}_{t}^{j-1})
7      
8 end for
9Return new parameter 𝚯tJ\boldsymbol{\Theta}_{t}^{J}.
ALGORITHM 2 Model Training

The steps from Algorithm 2 demonstrate our training process for AGG-UCB  parameters. With the updated arm group graph 𝒢t+1\mathcal{G}_{t+1} and the past embedded arm contexts {𝑿~τ}τ=1t\{\widetilde{\boldsymbol{X}}_{\tau}\}_{\tau=1}^{t} until current time step tt, we define the loss function as the straightforward quadratic loss function (line 4). Finally, we run gradient descent (GD) for JJ steps to derive the new model parameters Θt\Theta_{t} (lines 5-7) based on the initial parameters 𝚯0\boldsymbol{\Theta}_{0} (initialized in Subsection 4.5). Next, we proceed to introduce the detail of framework components.

4.2. Arm Group Graph Estimation

Recall that at time step tt, we model the similar arms into an arm group graph 𝒢t=(V,E,Wt)\mathcal{G}_{t}=(V,E,W_{t}) where the nodes VV are corresponding to the arm groups from 𝒞\mathcal{C} and edges weights WtW_{t} formulate the correlations among arm groups. Given two nodes c,c𝒞\forall c,c^{\prime}\in\mathcal{C}, to measure the similarity between them, inspired by the kernel mean embedding in the multi-task learning settings (Blanchard et al., 2011; Deshmukh et al., 2017), we define edge weight between cc and cc^{\prime} as:

w(c,c)=exp(𝔼𝒙𝒟c[ϕk𝒢(𝒙)]𝔼𝒙𝒟c[ϕk𝒢(𝒙)]2/σs)\begin{split}w^{*}(c,c^{\prime})=\exp(-\lVert\mathbb{E}_{\boldsymbol{x}\sim\mathcal{D}_{c}}\big{[}\phi_{k_{\mathcal{G}}}(\boldsymbol{x})\big{]}-\mathbb{E}_{\boldsymbol{x}^{\prime}\sim\mathcal{D}_{c^{\prime}}}\big{[}\phi_{k_{\mathcal{G}}}(\boldsymbol{x}^{\prime})\big{]}\rVert^{2}/\sigma_{s})\end{split}

where ϕk𝒢()\phi_{k_{\mathcal{G}}}(\cdot) is the induced feature mapping of a given kernel k𝒢k_{\mathcal{G}}, e.g., a radial basis function (RBF) kernel. Unfortunately, c𝒞\forall c\in\mathcal{C}, 𝒟c\mathcal{D}_{c} is unknown. Therefore, we update the edge weight based on the empirical estimation of arm group correlations. Here, let 𝒳ct={𝒙c,τ(i)}τ[t],i[nc,τ]\mathcal{X}_{c}^{t}=\{\boldsymbol{x}^{(i)}_{c,\tau}\}_{\tau\in[t],i\in[n_{c,\tau}]} represent the set of all arm contexts from group c𝒞c\in\mathcal{C} up to time step tt. We define the arm similarity measurement between arms c,c𝒞c,c^{\prime}\in\mathcal{C} through a Gaussian-like kernel as

(3) wt(c,c)=exp(Ψt(𝒟c)Ψt(𝒟c)2/σs)\begin{split}w_{t}(c,c^{\prime})=\exp(-\lVert\Psi_{t}(\mathcal{D}_{c})-\Psi_{t}(\mathcal{D}_{c^{\prime}})\rVert^{2}/\sigma_{s})\end{split}

where Ψt(𝒟c)=1|𝒳ct|x𝒳ctk𝒢(,x)\Psi_{t}(\mathcal{D}_{c})=\frac{1}{\lvert\mathcal{X}_{c}^{t}\rvert}\sum_{x\in\mathcal{X}_{c}^{t}}k_{\mathcal{G}}(\cdot,x) denotes the kernel mean estimation of 𝒟c\mathcal{D}_{c} with a given kernel k𝒢k_{\mathcal{G}}; and σs\sigma_{s} refers to the bandwidth. Then, at time step tt and c,c𝒞\forall c,c^{\prime}\in\mathcal{C}, we update the corresponding weight of edge e(c,c)e(c,c^{\prime}) in the weight set WtW_{t} with wt(c,c)w_{t}(c,c^{\prime}).

4.3. Group-Aware Arm Embedding

To conduct the aggregation operations of GNN, we reconstruct a matrix for each arm context vector. Recall that for an arm group c𝒞c\in\mathcal{C}, if c𝒞tc\in\mathcal{C}_{t}, we receive the contexts xc,t(i)dx,i[nc,t]x^{(i)}_{c,t}\in\mathbb{R}^{d_{x}},i\in[n_{c,t}] at time step tt. Then, the reconstructed matrix for 𝒙c,t(i)\boldsymbol{x}^{(i)}_{c,t} is defined as

(4) 𝑿~c,t(i)=((𝒙c,t(i))𝟎𝟎𝟎(𝒙c,t(i))𝟎𝟎𝟎(𝒙c,t(i)))Nc×dx~\widetilde{\boldsymbol{X}}^{(i)}_{c,t}=\left(\begin{array}[]{cccc}(\boldsymbol{x}^{(i)}_{c,t})^{\intercal}&\boldsymbol{0}&\cdots&\boldsymbol{0}\\ \boldsymbol{0}&(\boldsymbol{x}^{(i)}_{c,t})^{\intercal}&\cdots&\boldsymbol{0}\\ \vdots&&\ddots&\vdots\\ \boldsymbol{0}&\boldsymbol{0}&\cdots&(\boldsymbol{x}^{(i)}_{c,t})^{\intercal}\\ \end{array}\right)\in\mathbb{R}^{N_{c}\times d_{\widetilde{x}}}

where dx~=dxNcd_{\widetilde{x}}=d_{x}\cdot N_{c} is the column dimension of 𝑿~c,t(i)\widetilde{\boldsymbol{X}}^{(i)}_{c,t}. Here, for the cc^{\prime}-th row in matrix 𝑿~c,t(i)\widetilde{\boldsymbol{X}}^{(i)}_{c,t}, the ((c1)dx+1)((c^{\prime}-1)\cdot d_{x}+1)-th to the (cdx)(c^{\prime}\cdot d_{x})-th entries are the transposed original arm context (𝒙c,t(i))(\boldsymbol{x}^{(i)}_{c,t})^{\intercal}, while the other entries are zeros. Receiving a set of arm contexts 𝒳t\mathcal{X}_{t}, we derive the corresponding embedded arm set as 𝒳~t={𝑿~c,t(i)}c𝒞t,i[nc,t]\widetilde{\mathcal{X}}_{t}=\{\widetilde{\boldsymbol{X}}^{(i)}_{c,t}\}_{c\in\mathcal{C}_{t},i\in[n_{c,t}]}.

4.3.1. Aggregation of arm group representations

To leverage the estimated arm group graph for downstream reward estimations, we propose to aggregate over the arm group neighborhood for a more comprehensive arm representation through the GNN-based module, named as group-aware arm representation. It has been proven that the local averaging operation on the graph neighborhood can be deemed as applying the low-pass filter on the corresponding node features (Hoang et al., 2021; Wu et al., 2019a), which would give locally smooth node features within the same neighborhood. Inspired by the SGC model (Wu et al., 2019a), we propose to aggregate over the kk-hop arm group neighborhood 𝒩~k()\widetilde{\mathcal{N}}_{k}(\cdot) for incorporating arm group correlations to obtain the aggregated group-aware embedding for an embedded arm 𝑿~c,t(i)\widetilde{\boldsymbol{X}}^{(i)}_{c,t}, denoted by

(5) 𝑯gnn=1mσ(𝑺tk𝑿~c,t(i)𝚯gnn)Nc×m\begin{split}\boldsymbol{H}_{gnn}=\sqrt{\frac{1}{m}}\cdot\sigma(\boldsymbol{S}_{t}^{k}\cdot\widetilde{\boldsymbol{X}}^{(i)}_{c,t}\boldsymbol{\Theta}_{gnn})\in\mathbb{R}^{N_{c}\times m}\end{split}

where 𝑺t=𝑫t12𝑨t𝑫t12\boldsymbol{S}_{t}=\boldsymbol{D}_{t}^{-\frac{1}{2}}\boldsymbol{A}_{t}\boldsymbol{D}_{t}^{-\frac{1}{2}} is the symmetrically normalized adjacency matrix, and we have

𝚯gnn=(𝚯gnn1dx×m𝚯gnncdx×m𝚯gnnNcdx×m)dx~×m.\boldsymbol{\Theta}_{gnn}=\left(\begin{array}[]{c}\boldsymbol{\Theta}_{gnn}^{1}\in\mathbb{R}^{d_{x}\times m}\\ \vdots\\ \boldsymbol{\Theta}_{gnn}^{c^{\prime}}\in\mathbb{R}^{d_{x}\times m}\\ \vdots\\ \boldsymbol{\Theta}_{gnn}^{N_{c}}\in\mathbb{R}^{d_{x}\times m}\\ \end{array}\right)\in\mathbb{R}^{d_{\widetilde{x}}\times m}.

being the trainable weight matrix with width mm. Here, σ()\sigma(\cdot) denotes the non-linear activation function, which is added after the aggregation operation to alleviate potential concerns when the contexts are not linearly separable (Hoang et al., 2021). Note that the cc^{\prime}-th row of (𝑿~c,t(i)𝚯gnn)(\widetilde{\boldsymbol{X}}^{(i)}_{c,t}\cdot\boldsymbol{\Theta}_{gnn}), denoted by [𝑿~c,t(i)𝚯gnn]c,:[\widetilde{\boldsymbol{X}}^{(i)}_{c,t}\cdot\boldsymbol{\Theta}_{gnn}]_{c^{\prime},:}, is the hidden representation of arm 𝒙\boldsymbol{x} in terms of cc^{\prime}-th arm group in 𝒞\mathcal{C}. Then, these hidden representations will then be aggregated over 𝒩~k(c),c𝒞\widetilde{\mathcal{N}}_{k}(c),c\in\mathcal{C} by multiplying with 𝑺tk\boldsymbol{S}_{t}^{k} to derive the aggregated arm representation for 𝒙\boldsymbol{x}, i.e., 𝑯gnn(x)\boldsymbol{H}_{gnn}(x).

4.3.2. Incorporating initial embedded contexts

Moreover, solely aggregating information from neighbors through the GNN-based models can lead to ”over-smoothing” problems (Xu et al., 2018, 2021). Aggregating from the node neighborhood will end up with identical representations for all the nodes if they form an isolated complete sub-graph, which may not correctly reflect the relationship among these nodes in real-world applications. Therefore, we propose to apply skip-connections to address this potential problem by combining the initial contexts with the aggregated hidden features. Similar ideas have been applied to boost the performance of neural models. For instance, JK-Net (Xu et al., 2018) and GraphSAGE (Hamilton et al., 2017) concatenate hidden features from different levels of node neighborhoods; and ResNet (He et al., 2016) adopts additive residual connections.

Putting these two parts together and setting d=m+dx~d^{\prime}=m~{}+~{}d_{\widetilde{x}}, we then have 𝑯0Nc×d\boldsymbol{H}_{0}\in\mathbb{R}^{N_{c}\times d^{\prime}} as the output group-aware arm representation for 𝑿~c,t(i)\widetilde{\boldsymbol{X}}^{(i)}_{c,t}, represented by

(6) 𝑯0=fgnn(𝒢t,𝑿~c,t(i);𝚯gnn)=[σ(𝑺tk𝑿~c,t(i)𝚯gnn);𝑿~c,t(i)]\begin{split}\boldsymbol{H}_{0}=f_{gnn}(\mathcal{G}_{t},\widetilde{\boldsymbol{X}}^{(i)}_{c,t};\boldsymbol{\Theta}_{gnn})=[\sigma(\boldsymbol{S}_{t}^{k}\cdot\widetilde{\boldsymbol{X}}^{(i)}_{c,t}\boldsymbol{\Theta}_{gnn});\widetilde{\boldsymbol{X}}^{(i)}_{c,t}]\end{split}

where [;][\cdot~{};~{}\cdot] refers to the column-wise concatenation of matrices.

4.4. Reward Estimation Module

In this subsection, we estimate the rewards with a FC network of LL layers and width mm, based on group-aware arm representation 𝑯0\boldsymbol{H}_{0}.

4.4.1. Reward and confidence bound estimation

Here, let 𝚯fc={𝚯l}l[L]\boldsymbol{\Theta}_{fc}=\{\boldsymbol{\Theta}_{l}\}_{l\in[L]} be the set of trainable weight matrices of a fully-connected network, where the specifications are: 𝚯1d×m\boldsymbol{\Theta}_{1}\in\mathbb{R}^{d^{\prime}\times m}, 𝚯Lm\boldsymbol{\Theta}_{L}\in\mathbb{R}^{m} and 𝚯lm×m,l{2,,L1}\boldsymbol{\Theta}_{l}\in\mathbb{R}^{m\times m},\forall l\in\{2,\dots,L-1\}. Then, given the group-aware representation 𝑯0\boldsymbol{H}_{0}, we have the reward estimation module as follows

(7) 𝑯l=1mσ(𝑯l1𝚯l),l{1,,L1},𝒓^all=ffc(𝑯0;𝚯fc)=1m𝑯L1𝚯L\begin{split}&\boldsymbol{H}_{l}=\sqrt{\frac{1}{m}}\cdot\sigma(\boldsymbol{H}_{l-1}\cdot\boldsymbol{\Theta}_{l}),~{}~{}~{}~{}~{}l\in\{1,\dots,L-1\},\\ &\widehat{\boldsymbol{r}}_{all}=f_{fc}(\boldsymbol{H}_{0};\boldsymbol{\Theta}_{fc})=\sqrt{\frac{1}{m}}\cdot\boldsymbol{H}_{L-1}\cdot\boldsymbol{\Theta}_{L}\end{split}

where 𝒓^allNc\widehat{\boldsymbol{r}}_{all}\in\mathbb{R}^{N_{c}} represents the point-estimation vector for the received contexts embedding 𝑯0\boldsymbol{H}_{0} with respect to all the arms groups. Given that the arm 𝒙~c,t(i)\widetilde{\boldsymbol{x}}^{(i)}_{c,t} belonging to cc-th group, we will then have the reward estimation 𝒓^c,t(i)=[𝒓^all]c\widehat{\boldsymbol{r}}^{(i)}_{c,t}=[\widehat{\boldsymbol{r}}_{all}]_{c}\in\mathbb{R} for the embedded context matrix 𝑿~c,t(i)\widetilde{\boldsymbol{X}}^{(i)}_{c,t}, which is the cc-th element of 𝒓^all\widehat{\boldsymbol{r}}_{all}.

Finally, combining the aggregation module with the reward estimation module, given arm group graph 𝒢t\mathcal{G}_{t} at time step tt, the reward estimation for the embedded arm 𝑿~c,t(i)\widetilde{\boldsymbol{X}}^{(i)}_{c,t} (i.e., the reward estimation given its arm group) can be represented as

r^c,t(i)=f(𝒢t,𝑿~c,t(i);𝚯)=[(ffc(;𝚯fc)fgnn(;𝚯gnn))(𝒢t,𝑿~c,t(i))]c.\begin{split}\widehat{r}^{(i)}_{c,t}=f(\mathcal{G}_{t},\widetilde{\boldsymbol{X}}^{(i)}_{c,t}&;\boldsymbol{\Theta})=\Bigg{[}\bigg{(}f_{fc}(\cdot;\boldsymbol{\Theta}_{fc})\circ f_{gnn}(\cdot;\boldsymbol{\Theta}_{gnn})\bigg{)}(\mathcal{G}_{t},\widetilde{\boldsymbol{X}}^{(i)}_{c,t})\Bigg{]}_{c}.\end{split}

Setting p=(2Ncd)m+(L1)m2+mp=(2N_{c}\cdot d)\cdot m+(L-1)\cdot m^{2}+m, we have 𝚯p\boldsymbol{\Theta}\in\mathbb{R}^{p} being the set of all the parameters from these two modules.

4.4.2. Arm pulling mechanism

We obtain confidence bounds for the point estimation with the network gradients as i^=𝒈𝒁t1𝒈/m\widehat{i}=\sqrt{\boldsymbol{g}^{\intercal}\cdot\boldsymbol{Z}_{t-1}\cdot\boldsymbol{g}/m} where 𝒈=𝚯f(𝒢t,𝑿~c,t(i);𝚯)p\boldsymbol{g}=\nabla_{\boldsymbol{\Theta}}f(\mathcal{G}_{t},\widetilde{\boldsymbol{X}}^{(i)}_{c,t};\boldsymbol{\Theta})\in\mathbb{R}^{p} is the gradient vector, and 𝒁t1=𝑰+τ=1t1𝒈τ𝒈τ\boldsymbol{Z}_{t-1}=\boldsymbol{I}+\sum_{\tau=1}^{t-1}\boldsymbol{g}_{\tau}\cdot\boldsymbol{g}_{\tau}^{\intercal} with 𝒈τ\boldsymbol{g}_{\tau} being the gradient vector of the embedded arm which is selected at step τ{1,,t1}\tau\in\{1,\dots,t-1\}. After obtaining the reward and confidence bound estimations for all embedded arm in set 𝒳~t\widetilde{\mathcal{X}}_{t}, we choose the best arm as 𝑿~t=argmax𝑿~c,t(i)𝒳~t(r^c,t(i)+γi^c,t(i))\widetilde{\boldsymbol{X}}_{t}=\operatorname*{argmax}_{\widetilde{\boldsymbol{X}}^{(i)}_{c,t}\in\widetilde{\mathcal{X}}_{t}}(\widehat{r}^{(i)}_{c,t}+\gamma\cdot\widehat{i}^{(i)}_{c,t}) where γ\gamma is the exploration parameter, and the theoretical upper confidence bound will be given in Section 5. Note that based on our problem definition (Section 3), one arm may associate with multiple arm groups. Here, we will separately estimate rewards and confidence bounds of each arm group it belongs to, and consider them as different arms for selection.

4.5. Model Initialization

For the aggregation module weight matrix 𝚯gnn\boldsymbol{\Theta}_{gnn}, each of its entries is sampled from the Gaussian distribution N(0,1)N(0,1). Similarly, the parameters from the first L1L-1 reward estimation module layers ([𝚯1,,𝚯L1][\boldsymbol{\Theta}_{1},\dots,\boldsymbol{\Theta}_{L-1}]) are also sampled from N(0,1)N(0,1). For the final (LL-th) layer, its weight matrix 𝚯L\boldsymbol{\Theta}_{L} is initialized by drawing the entry values from the Gaussian distribution N(0,1/m)N(0,1/m).

5. Theoretical Analysis

In this section, we provide the theoretical analysis for our proposed framework. For the sake of analysis, at each time step tt, we assume each arm group c𝒞c\in\mathcal{C} would receive one arm 𝒙c,t\boldsymbol{x}_{c,t}, which makes |𝒳1t|==|𝒳Nct|=t\lvert\mathcal{X}_{1}^{t}\rvert=\dots=\lvert\mathcal{X}_{N_{c}}^{t}\rvert=t. We also apply the adjacency matrix 𝑨t\boldsymbol{A}_{t} instead of 𝑺t\boldsymbol{S}_{t} for aggregation, and set its elements [𝑨t]ij=1tNcτ=1tϕk𝒢(xci,τ)ϕk𝒢(xcj,τ)[\boldsymbol{A}_{t}]_{ij}=\frac{1}{t\cdot N_{c}}\sum_{\tau=1}^{t}\phi_{k_{\mathcal{G}}}(x_{c_{i},\tau})^{\intercal}\phi_{k_{\mathcal{G}}}(x_{c_{j},\tau}) for arm group similarity between group ci,cj𝒞c_{i},c_{j}\in\mathcal{C}. Here, ϕk𝒢()\phi_{k_{\mathcal{G}}}(\cdot) is the kernel mapping given an RBF kernel k𝒢k_{\mathcal{G}}. With 𝒢\mathcal{G}^{*} being the unknown true arm group graph, its adjacency matrix elements are [𝑨]ij=1Nc𝔼xi𝒟ci,xj𝒟cj(ϕk𝒢(xi)ϕk𝒢(xj))[\boldsymbol{A}^{*}]_{ij}=\frac{1}{N_{c}}\mathbb{E}_{x_{i}\sim\mathcal{D}_{c_{i}},x_{j}\sim\mathcal{D}_{c_{j}}}(\phi_{k_{\mathcal{G}}}(x_{i})^{\intercal}\phi_{k_{\mathcal{G}}}(x_{j})). Note that the norm of adjacency matrices 𝑨2,𝑨t21\lVert\boldsymbol{A}^{*}\rVert_{2},\lVert\boldsymbol{A}_{t}\rVert_{2}\leq 1 since ϕk𝒢(x),ϕk𝒢(x)1\langle\phi_{k_{\mathcal{G}}}(x),\phi_{k_{\mathcal{G}}}(x^{\prime})\rangle\leq 1 for any x,xdxx,x^{\prime}\in\mathbb{R}^{d_{x}}, which makes it feasible to aggregate over kk-hop neighborhood without the explosion of eigenvalues. Before presenting the main results, we first introduce the following background.

Lemma 5.1 ((Ban and He, 2021a; Zhou et al., 2020)).

For any t[T]t\in[T], given arm 𝐱dx\boldsymbol{x}\in\mathbb{R}^{d_{x}} satisfying 𝐱2=1\lVert\boldsymbol{x}\rVert_{2}=1 and its embedded context matrix 𝐗~\widetilde{\boldsymbol{X}}, there exists 𝚯t1p\boldsymbol{\Theta}_{t-1}^{*}\in\mathbb{R}^{p} at time step tt, and a constant S>0S>0, such that

(8) h(𝒢,𝑿~)=g(𝒢,𝑿~;𝚯t1),𝚯t1𝚯0\begin{split}h(\mathcal{G}^{*},\widetilde{\boldsymbol{X}})=\langle g(\mathcal{G}^{*},\widetilde{\boldsymbol{X}};\boldsymbol{\Theta}_{t-1}),\boldsymbol{\Theta}_{t-1}^{*}-\boldsymbol{\Theta}_{0}\rangle\end{split}

where 𝚯t1𝚯02S/m\lVert\boldsymbol{\Theta}_{t-1}^{*}-\boldsymbol{\Theta}_{0}\rVert_{2}\leq S/\sqrt{m}, t[T]\forall t\in[T], and 𝒢\mathcal{G}^{*} stands for the unknown true underlying arm group graph.

Note that with sufficient network width mm, we will have 𝚯t1𝚯0,t[T]\boldsymbol{\Theta}_{t-1}^{*}\approx\boldsymbol{\Theta}_{0}^{*},\forall t\in[T], and we will include more details in the full version of the paper. Following the analogous ideas from previous works (Zhou et al., 2020; Ban et al., 2021), this lemma formulates the expected reward as a linear function parameterized by the difference between randomly initialized network parameter 𝚯0\boldsymbol{\Theta}_{0} and the parameter 𝚯t1\boldsymbol{\Theta}_{t-1}^{*}, which lies in the confidence set with the high probability (Abbasi-Yadkori et al., 2011). Then, regarding the activation function σ()\sigma(\cdot), we have the following assumption on its continuity and smoothness.

Assumption 5.2 (ζ\zeta-Lipschitz continuity and Smoothness (Du et al., 2019b; Ban and He, 2021a)).

For non-linear activation function σ()\sigma(\cdot), there exists a positive constant ζ>0\zeta>0, such that 𝐱,𝐱\forall\boldsymbol{x},\boldsymbol{x}^{\prime}\in\mathbb{R}, we have

|σ(x)σ(x)|ζxx,|σ(x)σ(x)|ζxx\begin{split}&\lvert\sigma(x)-\sigma(x^{\prime})\rvert\leq\zeta\cdot\lVert x-x^{\prime}\rVert,\quad\lvert\sigma^{\prime}(x)-\sigma^{\prime}(x^{\prime})\rvert\leq\zeta\cdot\lVert x-x^{\prime}\rVert\end{split}

with σ()\sigma^{\prime}(\cdot) being the derivative of activation function σ()\sigma(\cdot).

Note that Assumption 5.2 is mild and applicable on many activation functions, such as Sigmoid. Then, we proceed to bound the regret for a single time step tt.

5.1. Upper Confidence Bound

Recall that at time step tt, given an embedded arm matrix 𝑿~\widetilde{\boldsymbol{X}}, the output of our proposed framework is r^=f(𝒢t,𝑿~;𝚯t1)\widehat{r}=f(\mathcal{G}_{t},\widetilde{\boldsymbol{X}};\boldsymbol{\Theta}_{t-1}) with 𝒢t\mathcal{G}_{t}, 𝚯t1\boldsymbol{\Theta}_{t-1} as the estimated arm group graph and trained parameters respectively. The true function h(𝒢,𝑿~)h(\mathcal{G}^{*},\widetilde{\boldsymbol{X}}) is given in Lemma 5.1. Supposing there exists the true arm group graph 𝒢\mathcal{G}^{*}, the confidence bound for a single round tt will be

(9) CBt(𝑿~)=|f(𝒢t,𝑿~;𝚯t1)h(𝒢,𝑿~)||f(𝒢t,𝑿~;𝚯t1)h(𝒢t,𝑿~)|R1+|h(𝒢t,𝑿~)h(𝒢,𝑿~)|R2\begin{split}&\textsf{CB}_{t}(\widetilde{\boldsymbol{X}})=\lvert f(\mathcal{G}_{t},\widetilde{\boldsymbol{X}};\boldsymbol{\Theta}_{t-1})-h(\mathcal{G}^{*},\widetilde{\boldsymbol{X}})\rvert\\ &\leq\underbrace{\lvert f(\mathcal{G}_{t},\widetilde{\boldsymbol{X}};\boldsymbol{\Theta}_{t-1})-h(\mathcal{G}_{t},\widetilde{\boldsymbol{X}})\rvert}_{R_{1}}+\underbrace{\lvert h(\mathcal{G}_{t},\widetilde{\boldsymbol{X}})-h(\mathcal{G}^{*},\widetilde{\boldsymbol{X}})\rvert}_{R_{2}}\end{split}

where R1R_{1} denotes the error induced by network parameter estimations, and R2R_{2} refers to the error from arm group graph estimations. We will then proceed to bound them separately.

5.1.1. Bounding network parameter error R1R_{1}

For simplicity, the 𝒢t\mathcal{G}_{t} notation is omitted for this subsection. To bridge the network parameters after GD with those at random initialization, we define the gradient-based regression estimator 𝚯^t=𝒁t1𝒃t\widehat{\boldsymbol{\Theta}}_{t}=\boldsymbol{Z}_{t}^{-1}\boldsymbol{b}_{t} where 𝒁t=λ𝑰+1mτ=1tg(𝑿~τ;𝚯τ)g(𝑿~τ;𝚯τ),𝒃t=τ=1trτg(𝑿~τ;𝚯τ)/m.\boldsymbol{Z}_{t}=\lambda\boldsymbol{I}+\frac{1}{m}\sum_{\tau=1}^{t}g(\widetilde{\boldsymbol{X}}_{\tau};\boldsymbol{\Theta}_{\tau})\cdot g(\widetilde{\boldsymbol{X}}_{\tau};\boldsymbol{\Theta}_{\tau})^{\intercal},\boldsymbol{b}_{t}=\sum_{\tau=1}^{t}r_{\tau}\cdot g(\widetilde{\boldsymbol{X}}_{\tau};\boldsymbol{\Theta}_{\tau})/\sqrt{m}. Then, we derive the bound for R1R_{1} with the following lemma.

Lemma 5.3.

Assume there are constants βF>0,1<β1,β2,β3,β4<2,βL=max{β1,β2,β3,β4}\beta_{F}>0,~{}~{}1<\beta_{1},\beta_{2},\beta_{3},\beta_{4}<2,~{}~{}\beta_{L}=\max\{\beta_{1},\beta_{2},\beta_{3},\beta_{4}\}, and

βh=max{ζβ1,ζβ2+ζ2β1β2,ζβL+1,(ζβ4)L2(ζβ2+ζ2β1β2)}.\begin{split}&\beta_{h}=\max\{\zeta\beta_{1},~{}~{}~{}\zeta\beta_{2}+\zeta^{2}\beta_{1}\beta_{2},~{}~{}~{}\zeta\beta_{L}+1,~{}~{}~{}(\zeta\beta_{4})^{L-2}(\zeta\beta_{2}+\zeta^{2}\beta_{1}\beta_{2})\}.\end{split}

With a constant δ(0,1)\delta\in(0,1), and LL as the layer number for the FC network, let the network width mPoly(t,L,1βF,1λ,(ζβL)L,log(1δ))m\geq\text{Poly}\big{(}t,L,\frac{1}{\beta_{F}},\frac{1}{\lambda},(\zeta\beta_{L})^{L},\log(\frac{1}{\delta})\big{)}, and learning rate η𝒪((tLβh2(2ζβL)2L)1)\eta\leq\mathcal{O}\big{(}(t\cdot L\beta_{h}^{2}(2\zeta\beta_{L})^{2L})^{-1}\big{)}. Denoting the terms

Υ=22tβF(βh+Λ)(βL+1)LζL,Λ=ζΥβhm(2ζβL)L12ζβL1I~1=t(Lβ32(βLζ)2L+m)+Λt(9L+m1),I~2=λL+1Υ/m,\begin{split}&\Upsilon=\frac{2\sqrt{2}t}{\beta_{F}}(\beta_{h}+\Lambda)(\beta_{L}+1)^{L}\zeta^{L},\quad\Lambda=\frac{\zeta\Upsilon\beta_{h}}{m}\cdot\frac{(2\zeta\beta_{L})^{L}-1}{2\zeta\beta_{L}-1}\\ &\widetilde{I}_{1}=\sqrt{t\cdot(L\cdot\beta_{3}^{2}\cdot(\beta_{L}\zeta)^{2L}+m)}+\Lambda\cdot\sqrt{t\cdot(9L+m^{-1})},\\ &\widetilde{I}_{2}=\lambda\sqrt{L+1}\cdot\Upsilon/\sqrt{m},\end{split}

at time step t~{}t, given the received contexts and rewards, with probability at least 1δ1-\delta and the embedded context 𝐗~\widetilde{\boldsymbol{X}}, we have

|h(𝑿~)f(𝑿~;𝚯t1)|B1g(𝑿~;𝚯t1)/m𝒁t11+B2+B3\begin{split}&\lvert h(\widetilde{\boldsymbol{X}})-f(\widetilde{\boldsymbol{X}};\boldsymbol{\Theta}_{t-1})\rvert\leq B_{1}\lVert g(\widetilde{\boldsymbol{X}};\boldsymbol{\Theta}_{t-1})/\sqrt{m}\rVert_{\boldsymbol{Z}_{t-1}^{-1}}+B_{2}+B_{3}\end{split}

with the terms

B1=log(det(𝒁t1)det(𝝀𝑰))2log(δ)+λ12S,B2=(I~1tB3+mI~2mλ+tmλ)(Λ9L+m1+m1βhLβ32(βLζ)2L+m)B3=m0.5(β3(Λ+βh)+LΥ(Λ+βh)(Λ/βh+1)).\begin{split}&B_{1}=\sqrt{\log(\frac{\det(\boldsymbol{Z}_{t-1})}{\det(\boldsymbol{\lambda I})})-2\log(\delta)}+\lambda^{\frac{1}{2}}S,\\ &B_{2}=\big{(}\frac{\widetilde{I}_{1}\cdot\sqrt{t}B_{3}+m\cdot\widetilde{I}_{2}}{m\lambda}+\sqrt{\frac{t}{m\lambda}}\big{)}\\ &\qquad\cdot\big{(}\Lambda\cdot\sqrt{9L+m^{-1}}+m^{-1}\beta_{h}\cdot\sqrt{L\cdot\beta_{3}^{2}\cdot(\beta_{L}\zeta)^{2L}+m}\big{)}\\ &B_{3}=m^{-0.5}\big{(}\beta_{3}(\Lambda+\beta_{h})+L\cdot\Upsilon\cdot(\Lambda+\beta_{h})(\Lambda/\beta_{h}+1)\big{)}.\end{split}

Proof. Given the embedded context 𝑿~\widetilde{\boldsymbol{X}}, and following the statement in Lemma 5.1, we have

|h(𝑿~)f(𝑿~;𝚯t1)||g(𝑿~;𝚯t1)/m,m(𝚯t1𝚯0)g(𝑿~;𝚯t1)/m,𝚯^t1|+|g(𝑿~;𝚯t1)/m,𝚯^t1f(𝑿~;𝚯t1)|=R3+R4.\begin{split}&\lvert h(\widetilde{\boldsymbol{X}})-f(\widetilde{\boldsymbol{X}};\boldsymbol{\Theta}_{t-1})\rvert\\ &\quad\leq\lvert\langle g(\widetilde{\boldsymbol{X}};\boldsymbol{\Theta}_{t-1})/\sqrt{m},\sqrt{m}(\boldsymbol{\Theta}_{t-1}^{*}-\boldsymbol{\Theta}_{0})\rangle-\langle g(\widetilde{\boldsymbol{X}};\boldsymbol{\Theta}_{t-1})/\sqrt{m},\widehat{\boldsymbol{\Theta}}_{t-1}\rangle\rvert\\ &\quad\qquad+\lvert\langle g(\widetilde{\boldsymbol{X}};\boldsymbol{\Theta}_{t-1})/\sqrt{m},\widehat{\boldsymbol{\Theta}}_{t-1}\rangle-f(\widetilde{\boldsymbol{X}};\boldsymbol{\Theta}_{t-1})\rvert=R_{3}+R_{4}.\end{split}

With Theorem 2 from (Abbasi-Yadkori et al., 2011), we have R3B1g(𝑿~;𝚯t1)/m𝒁t11R_{3}\leq B_{1}\lVert g(\widetilde{\boldsymbol{X}};\boldsymbol{\Theta}_{t-1})/\sqrt{m}\rVert_{\boldsymbol{Z}_{t-1}^{-1}}. Then, for R4R_{4}, we have |f(𝑿~;𝚯t1)g(𝑿~;𝚯t1/m),𝚯^t1|\lvert f(\widetilde{\boldsymbol{X}};\boldsymbol{\Theta}_{t-1})-\langle g(\widetilde{\boldsymbol{X}};\boldsymbol{\Theta}_{t-1}/\sqrt{m}),\widehat{\boldsymbol{\Theta}}_{t-1}\rangle\rvert

R5+R6=|f(𝑿~;𝚯t1)g(𝑿~;𝚯t1),𝚯t1𝚯0|+|g(𝑿~;𝚯t1),𝚯t1𝚯0𝚯^t1/m|\begin{split}\leq R_{5}+R_{6}&=\lvert f(\widetilde{\boldsymbol{X}};\boldsymbol{\Theta}_{t-1})-\langle g(\widetilde{\boldsymbol{X}};\boldsymbol{\Theta}_{t-1}),\boldsymbol{\Theta}_{t-1}-\boldsymbol{\Theta}_{0}\rangle\rvert\\ &\qquad+\lvert\langle g(\widetilde{\boldsymbol{X}};\boldsymbol{\Theta}_{t-1}),\boldsymbol{\Theta}_{t-1}-\boldsymbol{\Theta}_{0}-\widehat{\boldsymbol{\Theta}}_{t-1}/\sqrt{m}\rangle\rvert\end{split}

where R5R_{5} can be bounded by B3B_{3} with Lemma A.6. Then, with conclusions from Lemma B.3 and Lemma A.5, we have

R6𝚯t1𝚯0𝚯^t1/m2g(𝑿~;𝚯t1)2B2=((I~1tB3+mI~2)/(mλ)+t/(mλ))(Λ9L+m1+m1βhLβ32(βLζ)2L+m),\begin{split}R_{6}&\leq\lVert\boldsymbol{\Theta}_{t-1}-\boldsymbol{\Theta}_{0}-\widehat{\boldsymbol{\Theta}}_{t-1}/\sqrt{m}\rVert_{2}\cdot\lVert g(\widetilde{\boldsymbol{X}};\boldsymbol{\Theta}_{t-1})\rVert_{2}\\ &\leq B_{2}=\big{(}(\widetilde{I}_{1}\cdot\sqrt{t}B_{3}+m\cdot\widetilde{I}_{2})/(m\lambda)+\sqrt{t/(m\lambda)}\big{)}\\ &\quad\cdot\big{(}\Lambda\cdot\sqrt{9L+m^{-1}}+m^{-1}\beta_{h}\cdot\sqrt{L\cdot\beta_{3}^{2}\cdot(\beta_{L}\zeta)^{2L}+m}\big{)},\end{split}

which completes the proof. \blacksquare

5.1.2. Bounding graph estimation error R2R_{2}

Regarding the regret term R2R_{2} and for the aggregation module, we have

𝑯gnn=1mσ(𝑨tk𝑿~𝚯gnn)Nc×m\begin{split}\boldsymbol{H}_{gnn}=\sqrt{\frac{1}{m}}\cdot\sigma(\boldsymbol{A}_{t}^{k}\cdot\widetilde{\boldsymbol{X}}\boldsymbol{\Theta}_{gnn})\in\mathbb{R}^{N_{c}\times m}\end{split}

as the output where 𝚯gnn\boldsymbol{\Theta}_{gnn} refers to the trainable weight matrix. Then, we use the following lemma to bound R2R_{2}.

Lemma 5.4.

At this time step t+1t+1, given any two arm groups ci,cj𝒞c_{i},c_{j}\in\mathcal{C} and their sampled arm contexts 𝒳cit={𝐱ci,τ}τ=1t\mathcal{X}_{c_{i}}^{t}=\{\boldsymbol{x}_{c_{i},\tau}\}_{\tau=1}^{t}, 𝒳cjt={𝐱cj,τ}τ=1t\mathcal{X}_{c_{j}}^{t}=\{\boldsymbol{x}_{c_{j},\tau}\}_{\tau=1}^{t}, with the notation from Lemma 5.3 and the probability at least 1δ1-\delta, we have

𝑨𝑨tmax1Nc12tlog(Nc2Ncδ)\begin{split}\lVert\boldsymbol{A}^{*}-\boldsymbol{A}_{t}\rVert_{\max}\leq\frac{1}{N_{c}}\cdot\sqrt{\frac{1}{2t}\log(\frac{N_{c}^{2}-N_{c}}{\delta})}\end{split}

where max\lVert\cdot\rVert_{\max} refers to the greatest entry of a matrix. Then, we will have R2B41/tR_{2}\leq B_{4}\sqrt{1/t} with

B4=SLkm(βh+Λ)(ζβL+Υζm)O(L)12log(Nc2Ncδ),\begin{split}B_{4}=\frac{S\sqrt{L}k}{\sqrt{m}}(\beta_{h}+\Lambda)(\zeta\beta_{L}+\frac{\Upsilon\zeta}{m})^{O(L)}\sqrt{\frac{1}{2}\log(\frac{N_{c}^{2}-N_{c}}{\delta})},\end{split}

and Nc=|𝒞|N_{c}=\lvert\mathcal{C}\rvert is the number of arm groups.

Proof. Recall that for ci,cj𝒞c_{i},c_{j}\in\mathcal{C}, the element of matrix [𝑨]ij=1Nc𝔼xi𝒟ci,xj𝒟cj(ϕk𝒢(xi)ϕk𝒢(xj)),i,j[Nc][\boldsymbol{A}^{*}]_{ij}=\\ \frac{1}{N_{c}}\mathbb{E}_{x_{i}\sim\mathcal{D}_{c_{i}},x_{j}\sim\mathcal{D}_{c_{j}}}(\phi_{k_{\mathcal{G}}}(x_{i})^{\intercal}\phi_{k_{\mathcal{G}}}(x_{j})),\forall i,j\in[N_{c}], and [𝑨t]ij=1tNcτ=1tϕk𝒢(xci,τ)ϕk𝒢(xcj,τ)[\boldsymbol{A}_{t}]_{ij}=\frac{1}{t\cdot N_{c}}\sum_{\tau=1}^{t}\phi_{k_{\mathcal{G}}}(x_{c_{i},\tau})^{\intercal}\phi_{k_{\mathcal{G}}}(x_{c_{j},\tau}). Here, suppose a distribution 𝒟ij\mathcal{D}_{ij} where 𝔼[𝒟ij]=1Nc𝔼xi𝒟ci,xj𝒟cj(ϕk𝒢(xi)ϕk𝒢(xj))\mathbb{E}[\mathcal{D}_{ij}]=\frac{1}{N_{c}}\mathbb{E}_{x_{i}\sim\mathcal{D}_{c_{i}},x_{j}\sim\mathcal{D}_{c_{j}}}(\phi_{k_{\mathcal{G}}}(x_{i})^{\intercal}\phi_{k_{\mathcal{G}}}(x_{j})). Given NcN_{c} arm groups, we have Nc(Nc1)/2N_{c}(N_{c}-1)/2 different group pairs. For group pair ci,cj𝒞c_{i},c_{j}\in\mathcal{C}, each ϕk𝒢(xci,τ)ϕk𝒢(xcj,τ),τ[t]\phi_{k_{\mathcal{G}}}(x_{c_{i},\tau})^{\intercal}\phi_{k_{\mathcal{G}}}(x_{c_{j},\tau}),\tau\in[t] is a sample drawn from 𝒟ij\mathcal{D}_{ij}, and the element distance |[𝑨t]ij[𝑨]ij|\lvert[\boldsymbol{A}_{t}]_{ij}-[\boldsymbol{A}^{*}]_{ij}\rvert can be regarded as the difference between the mean value of samples and the expectation. Applying the Hoeffding’s inequality and the union bound would complete the proof. As 2nmax\lVert\cdot\rVert_{2}\leq n\lVert\cdot\rVert_{\max} for an n×nn\times n square matrix, we have the bound for matrix differences.

Then, consider the power of adjacency matrix 𝑨k\boldsymbol{A}^{k} (for graph 𝒢\mathcal{G}) as input and fix 𝑿~\widetilde{\boldsymbol{X}}. Analogous to the idea that the activation function with the Lipschitz continuity and smoothness property will lead to Lipschitz neural networks (Allen-Zhu et al., 2019), applying Assumption 5.2 and with Lemma A.2, Lemma A.3, we simply have the gradient g(𝒢,𝑿~;𝚯t1)g(\mathcal{G},\widetilde{\boldsymbol{X}};\boldsymbol{\Theta}_{t-1}) being Lipschitz continuous w.r.t. the input graph as

R2g(𝒢,𝑿~;𝚯t1)g(𝒢t,𝑿~;𝚯t1)2𝚯t1𝚯02SLm(βh+Λ)(ζβL+Υζm)O(L)|(𝑨t)k2(𝑨)k2|(i)SLkm(βh+Λ)(ζβL+Υζm)O(L)𝑨t𝑨2\begin{split}R_{2}&\leq\lVert g(\mathcal{G}^{*},\widetilde{\boldsymbol{X}};\boldsymbol{\Theta}_{t-1})-g(\mathcal{G}_{t},\widetilde{\boldsymbol{X}};\boldsymbol{\Theta}_{t-1})\rVert_{2}\cdot\lVert\boldsymbol{\Theta}_{t-1}^{*}-\boldsymbol{\Theta}_{0}\rVert_{2}\\ &\leq\frac{S\sqrt{L}}{\sqrt{m}}(\beta_{h}+\Lambda)(\zeta\beta_{L}+\frac{\Upsilon\zeta}{\sqrt{m}})^{O(L)}\cdot\lvert\lVert(\boldsymbol{A}_{t})^{k}\rVert_{2}-\lVert\boldsymbol{(A^{*}})^{k}\rVert_{2}\rvert\\ &\underset{(i)}{\leq}\frac{S\sqrt{L}k}{\sqrt{m}}(\beta_{h}+\Lambda)(\zeta\beta_{L}+\frac{\Upsilon\zeta}{\sqrt{m}})^{O(L)}\cdot\lVert\boldsymbol{A}_{t}-\boldsymbol{A^{*}}\rVert_{2}\end{split}

where (i)(i) is because 𝑨t,𝑨\boldsymbol{A}_{t},\boldsymbol{A}^{*} are symmetric and bounded polynomial functions are Lipschitz continuous. Combining the two parts will lead to the conclusion. \blacksquare

5.1.3. Combining R2R_{2} with R1R_{1}

At time step tt, with the notation and conclusions from Lemma 5.3 and Lemma 5.4, re-scaling the constant δ\delta, we have the confidence bound given embedded arm 𝑿~\widetilde{\boldsymbol{X}} as

(10) CBt(𝑿~)B1g(𝒢t,𝑿~;𝚯t1)/m𝒁t11+B2+B3+B41t.\begin{split}\textsf{CB}_{t}(\widetilde{\boldsymbol{X}})\leq B_{1}\lVert g(\mathcal{G}_{t},\widetilde{\boldsymbol{X}};\boldsymbol{\Theta}_{t-1})/\sqrt{m}\rVert_{\boldsymbol{Z}_{t-1}^{-1}}+B_{2}+B_{3}+B_{4}\sqrt{\frac{1}{t}}.\end{split}

5.2. Regret Bound

With the UCB shown in Eq. 10, we provide the following regret upper bound R(T)R(T), for a total of TT time steps.

Theorem 5.5.

Given the received contexts and rewards, with the notation from Lemma 5.3, Lemma 5.4, and probability at least 1δ1-\delta, if m,ηm,\eta satisfy conditions in Lemma 5.3, we will have the regret

R(T)2(2B4T+2B4)+22d~Tlog(1+T/λ)+2T(λS+12log(δ/2)+(d~log(1+T/λ)))\begin{split}&R(T)\leq 2\cdot(2B_{4}\sqrt{T}+2-B_{4})+2\sqrt{2\widetilde{d}T\log(1+T/\lambda)+2T}\\ &\qquad\cdot\big{(}\sqrt{\lambda}S+\sqrt{1-2\log(\delta/2)+(\widetilde{d}\log(1+T/\lambda))}\big{)}\end{split}

where the effective dimension d~=logdet(𝐈+𝐆(0)/λ)log(1+T/λ)\widetilde{d}=\frac{\log\det(\boldsymbol{I}+\boldsymbol{G}(0)/\lambda)}{\log(1+T/\lambda)} with
𝐆(0)=𝐆0𝐆0\boldsymbol{G}(0)=\boldsymbol{G}_{0}\boldsymbol{G}_{0}^{\intercal} and 𝐆0=(g(𝐗~1;𝚯0),,g(𝐗~t;𝚯0))\boldsymbol{G}_{0}=\big{(}g(\widetilde{\boldsymbol{X}}_{1};\boldsymbol{\Theta}_{0})^{\intercal},\dots,g(\widetilde{\boldsymbol{X}}_{t};\boldsymbol{\Theta}_{0})^{\intercal}\big{)}.

Proof. By definition, we have the regret RtR_{t} for time step tt as

Rt=h(𝒢,𝑿~t)h(𝒢,𝑿~t)CBt(𝑿~t)+f(𝒢t,𝑿~t;𝚯t1)h(𝒢,𝑿~t)CBt(𝑿~t)+f(𝒢t,𝑿~t;𝚯t1)h(𝒢,𝑿~t)2CBt(𝑿~t)\begin{split}R_{t}&=h(\mathcal{G}^{*},\widetilde{\boldsymbol{X}}_{t}^{*})-h(\mathcal{G}^{*},\widetilde{\boldsymbol{X}}_{t})\\ &\leq\textsf{CB}_{t}(\widetilde{\boldsymbol{X}}_{t}^{*})+f(\mathcal{G}_{t},\widetilde{\boldsymbol{X}}_{t}^{*};\boldsymbol{\Theta}_{t-1})-h(\mathcal{G}^{*},\widetilde{\boldsymbol{X}}_{t})\\ &\leq\textsf{CB}_{t}(\widetilde{\boldsymbol{X}}_{t})+f(\mathcal{G}_{t},\widetilde{\boldsymbol{X}}_{t};\boldsymbol{\Theta}_{t-1})-h(\mathcal{G}^{*},\widetilde{\boldsymbol{X}}_{t})\leq 2\cdot\textsf{CB}_{t}(\widetilde{\boldsymbol{X}}_{t})\end{split}

where the second inequality is due to our arm pulling mechanism. Then, based on Lemma 5.4, Lemma 5.3, and Eq. 10, we have R(T)=R(T)=

t=1TRt2t=1T(B1g(𝒢t,𝑿~;𝚯t1)/m𝒁t11+B2+B3+B41t)2(2B4T+2B4)+2t=1T(B1g(𝒢t,𝑿~;𝚯t1)/m𝒁t11)\begin{split}\sum_{t=1}^{T}&R_{t}\leq 2\sum_{t=1}^{T}\bigg{(}B_{1}\lVert g(\mathcal{G}_{t},\widetilde{\boldsymbol{X}};\boldsymbol{\Theta}_{t-1})/\sqrt{m}\rVert_{\boldsymbol{Z}_{t-1}^{-1}}+B_{2}+B_{3}+B_{4}\sqrt{\frac{1}{t}}\bigg{)}\\ &\leq 2\cdot(2B_{4}\sqrt{T}+2-B_{4})+2\sum_{t=1}^{T}(B_{1}\lVert g(\mathcal{G}_{t},\widetilde{\boldsymbol{X}};\boldsymbol{\Theta}_{t-1})/\sqrt{m}\rVert_{\boldsymbol{Z}_{t-1}^{-1}})\end{split}

with the choice of mm for bounding the summation of B2,B3B_{2},B_{3}, and the bound of i=1T[ti/2]\sum_{i=1}^{T}[t^{-i/2}] in (Chlebus, 2009). Then, with Lemma 11 from (Abbasi-Yadkori et al., 2011),

t=1T(B1g(𝒢t,𝑿~;𝚯t1)/m𝒁t11)B1Tt=1Tg(𝒢t,𝑿~;𝚯t1)/m𝒁t112TB12log(det(𝒁T)det(λ𝑰))(i)2d~Tlog(1+T/λ)+2T(λS+12log(δ/2)+(d~log(1+T/λ)))\begin{split}&\sum_{t=1}^{T}(B_{1}\lVert g(\mathcal{G}_{t},\widetilde{\boldsymbol{X}};\boldsymbol{\Theta}_{t-1})/\sqrt{m}\rVert_{\boldsymbol{Z}_{t-1}^{-1}})\\ &\leq B_{1}\sqrt{T\sum_{t=1}^{T}\lVert g(\mathcal{G}_{t},\widetilde{\boldsymbol{X}};\boldsymbol{\Theta}_{t-1})/\sqrt{m}\rVert_{\boldsymbol{Z}_{t-1}^{-1}}^{2}}\leq\sqrt{T}B_{1}\sqrt{2\log(\frac{\det(\boldsymbol{Z}_{T})}{\det(\lambda\boldsymbol{I})})}\\ &\underset{(i)}{\leq}\sqrt{2\widetilde{d}T\log(1+T/\lambda)+2T}\big{(}\sqrt{\lambda}S+\sqrt{1-2\log(\delta/2)+(\widetilde{d}\log(1+T/\lambda))}\big{)}\end{split}

where (i)(i) is based on Lemma 6.3 in (Ban and He, 2021a) and Lemma 5.4 in (Zhou et al., 2020). \blacksquare

Here, the effective dimension d~\widetilde{d} measures the vanishing speed of 𝑮(0)\boldsymbol{G}(0)’s eigenvalues, and it is analogous to that of existing works on neural contextual bandits algorithms (Ban and He, 2021a; Zhou et al., 2020; Ban et al., 2021). As d~\widetilde{d} is smaller than the dimension of the gradient matrix 𝑮(0)\boldsymbol{G}(0), it is applied to prevent the dimension explosion. Our result matches the state-of-the-art regret complexity (Zhou et al., 2020; Zhang et al., 2020; Ban and He, 2021a) under the worst-case scenario.

5.3. Model Convergence after GD

For model convergence, we first give an assumption of the gradient matrix after jj iterations of GD. First, we define 𝑮(j)(𝚯L1)=(g(𝑿~1;𝚯L1(j)),,g(𝑿~T;𝚯L1(j)))(g(𝑿~1;𝚯L1(j)),,g(𝑿~T;𝚯L1(j)))\boldsymbol{G}^{(j)}(\boldsymbol{\Theta}_{L-1})=\big{(}g(\widetilde{\boldsymbol{X}}_{1};\boldsymbol{\Theta}_{L-1}^{(j)}),\dots,g(\widetilde{\boldsymbol{X}}_{T};\boldsymbol{\Theta}_{L-1}^{(j)})\big{)}^{\intercal}\big{(}g(\widetilde{\boldsymbol{X}}_{1};\boldsymbol{\Theta}_{L-1}^{(j)}),\dots,g(\widetilde{\boldsymbol{X}}_{T};\boldsymbol{\Theta}_{L-1}^{(j)})\big{)}
where g(𝑿~;𝚯L1)g(\widetilde{\boldsymbol{X}};\boldsymbol{\Theta}_{L-1}) is the gradient vector w.r.t. 𝚯L1\boldsymbol{\Theta}_{L-1}.

Assumption 5.6.

With width mPoly(T,L,1βF,1λ,(ζβL)L,log(1δ))m\geq\mbox{Poly}(T,L,\frac{1}{\beta_{F}},\frac{1}{\lambda},(\zeta\beta_{L})^{L},\log(\frac{1}{\delta})) and for j[J]j\in[J], we have the minimal eigenvalue of 𝐆(j)\boldsymbol{G}^{(j)} as

λmin(𝑮(j)(𝚯L1))λ0/2\begin{split}\lambda_{\min}(\boldsymbol{G}^{(j)}(\boldsymbol{\Theta}_{L-1}))\geq\lambda_{0}/2\end{split}

where λ0\lambda_{0} is the minimal eigenvalue of the neural tangent kernel (NTK) (Jacot et al., 2018) matrix induced by AGG-UCB.

Note that Assumption 5.6 is mild and has been proved for various neural architectures in (Du et al., 2019b). The NTK for AGG-UCB  can be derived following a comparable approach as in (Du et al., 2019a; Jacot et al., 2018). Then, we apply the following lemma and theorem to prove the convergence of AGG-UCB. The proof of Lemma 5.7 is given in the appendix.

Lemma 5.7.

After TT time steps, assume the network are trained with the JJ-iterations GD on the past contexts and rewards. Then, with βF>0\beta_{F}>0 and βFη<1\beta_{F}\cdot\eta<1, for any j[J]j\in[J]:

𝑭T(j)𝑭T(j+1)2214ηβF𝑭T(j)𝒀T22\begin{split}\lVert\boldsymbol{F}^{(j)}_{T}-\boldsymbol{F}^{(j+1)}_{T}\rVert_{2}^{2}\leq\frac{1}{4}\eta\beta_{F}\cdot\lVert\boldsymbol{F}^{(j)}_{T}-\boldsymbol{Y}_{T}\rVert_{2}^{2}\end{split}

with network width mm defined in Lemma 5.3.

The Lemma 5.7 shows that we are able to bound the difference in network outputs after one step of GD. Then, we proceed to prove the convergence with the theorem below.

Theorem 5.8.

After TT time steps, assume the model with width mm defined in Lemma 5.3 is trained with the JJ-iterations GD on the contexts {𝐗~τ}τ=1T\{\widetilde{\boldsymbol{X}}_{\tau}\}_{\tau=1}^{T} and rewards {rτ}τ=1T\{r_{\tau}\}_{\tau=1}^{T}. With probability at least 1δ1-\delta, a constant βF\beta_{F} such that βFη<1\beta_{F}\cdot\eta<1, set the network width mPoly(T,L,1βF,1λ,(ζβL)L,log(1δ))m\geq\mbox{Poly}(T,L,\frac{1}{\beta_{F}},\frac{1}{\lambda},(\zeta\beta_{L})^{L},\log(\frac{1}{\delta})) and the learning rate η𝒪(T1L1βh2(2ζβL)2L)\eta\leq\mathcal{O}(T^{-1}L^{-1}\beta_{h}^{-2}(2\zeta\beta_{L})^{-2L}). Then, for any j[J]j\in[J], we have

𝑭T(j)𝒀T22(1βFη)j𝑭T(0)𝒀T22\begin{split}\lVert\boldsymbol{F}^{(j)}_{T}-\boldsymbol{Y}_{T}\rVert_{2}^{2}\leq(1-\beta_{F}\cdot\eta)^{j}\cdot\lVert\boldsymbol{F}^{(0)}_{T}-\boldsymbol{Y}_{T}\rVert_{2}^{2}\end{split}

where the vector 𝐅(j)=[f(𝒢T,𝐗~τ;𝚯(j))]τ=1T\boldsymbol{F}^{(j)}=[f(\mathcal{G}_{T},\widetilde{\boldsymbol{X}}_{\tau};\boldsymbol{\Theta}^{(j)})]_{\tau=1}^{T}, and 𝐘T=[rτ]τ=1T\boldsymbol{Y}_{T}=[r_{\tau}]_{\tau=1}^{T}.

Proof. Following an approach analogous to (Du et al., 2019b), we apply and induction based method for the proof. The hypothesis is that 𝑭T(j)𝒀T22(1βFη)j𝑭T(0)𝒀T22,j[J]\lVert\boldsymbol{F}^{(j)}_{T}-\boldsymbol{Y}_{T}\rVert_{2}^{2}\leq(1-\beta_{F}\cdot\eta)^{j}\cdot\lVert\boldsymbol{F}^{(0)}_{T}-\boldsymbol{Y}_{T}\rVert_{2}^{2},j\in[J]. With a similar procedure in Condition A.1 of (Du et al., 2019b), we have

𝑭T(j+1)𝒀T22𝑭T(j)𝒀T222η𝑭T(j)𝒀T𝑮(j)22(𝒀T𝑭T(j))𝑽(j)+𝑭T(j+1)𝑭T(j)22\begin{split}&\lVert\boldsymbol{F}^{(j+1)}_{T}-\boldsymbol{Y}_{T}\rVert_{2}^{2}\leq\lVert\boldsymbol{F}^{(j)}_{T}-\boldsymbol{Y}_{T}\rVert_{2}^{2}-2\eta\lVert\boldsymbol{F}^{(j)}_{T}-\boldsymbol{Y}_{T}\rVert_{\boldsymbol{G}^{(j)}}^{2}\\ &\qquad-2(\boldsymbol{Y}_{T}-\boldsymbol{F}^{(j)}_{T})^{\intercal}\boldsymbol{V}^{(j)}+\lVert\boldsymbol{F}^{(j+1)}_{T}-\boldsymbol{F}^{(j)}_{T}\rVert_{2}^{2}\end{split}

with 𝑽(j)=(𝑽(j)(𝑿~1),,𝑽(j)(𝑿~T))\boldsymbol{V}^{(j)}=(\boldsymbol{V}^{(j)}(\widetilde{\boldsymbol{X}}_{1}),\dots,\boldsymbol{V}^{(j)}(\widetilde{\boldsymbol{X}}_{T}))^{\intercal}. For 𝚯{𝚯gnn,,𝚯L1}\boldsymbol{\Theta}^{\prime}\in\{\boldsymbol{\Theta}_{gnn},\dots,\boldsymbol{\Theta}_{L-1}\},

|𝑽(j)(𝑿~)|=ηmax0sη[𝚯(𝚯(j))Ff(𝚯(j))f(𝚯(j),s)F]\begin{split}&\lvert\boldsymbol{V}^{(j)}(\widetilde{\boldsymbol{X}})\rvert=\eta\max_{0\leq s\leq\eta}\bigg{[}\sum_{\boldsymbol{\Theta}^{\prime}}\lVert\nabla\mathcal{L}({\boldsymbol{\Theta}^{\prime}}^{(j)})\rVert_{F}\lVert\nabla f({\boldsymbol{\Theta}^{\prime}}^{(j)})-\nabla f({\boldsymbol{\Theta}^{\prime}}^{(j)},s)\rVert_{F}\bigg{]}\end{split}

where f(𝚯(j),s)=f(𝚯(j)s(𝚯(j)))\nabla f({\boldsymbol{\Theta}^{\prime}}^{(j)},s)=\nabla f\big{(}{\boldsymbol{\Theta}^{\prime}}^{(j)}-s\cdot\nabla\mathcal{L}({\boldsymbol{\Theta}^{\prime}}^{(j)})\big{)}. The notation 𝒢,𝑿~\mathcal{G},\widetilde{\boldsymbol{X}} is omitted for simplicity. Then, based on the conclusions from Lemma C.1, Lemma 5.7 and Assumption 5.6, we can have

𝑭T(j+1)𝒀T22(1ηλ0)𝑭T(j)𝒀T222(𝒀T𝑭T(j))𝑽(j)+𝑭T(j+1)𝑭T(j)22(1ηλ02)𝑭T(j)𝒀T22\begin{split}&\lVert\boldsymbol{F}^{(j+1)}_{T}-\boldsymbol{Y}_{T}\rVert_{2}^{2}\leq(1-\eta\lambda_{0})\lVert\boldsymbol{F}^{(j)}_{T}-\boldsymbol{Y}_{T}\rVert_{2}^{2}-2(\boldsymbol{Y}_{T}-\boldsymbol{F}^{(j)}_{T})^{\intercal}\boldsymbol{V}^{(j)}\\ &\qquad+\lVert\boldsymbol{F}^{(j+1)}_{T}-\boldsymbol{F}^{(j)}_{T}\rVert_{2}^{2}\leq(1-\frac{\eta\lambda_{0}}{2})\lVert\boldsymbol{F}^{(j)}_{T}-\boldsymbol{Y}_{T}\rVert_{2}^{2}\end{split}

by setting βF=λ0/2\beta_{F}=\lambda_{0}/2. \blacksquare

This theorem shows that with sufficiently large mm and proper η\eta, the GD will converge to the global minimum at a linear rate, which is essential for proving the regret bound.

6. Experiments

In this section, we demonstrate the effectiveness of our proposed framework by comparing its performances with state-of-the-art baselines through experiments on four real data sets. As linear algorithms have been outperformed in previous works (Zhou et al., 2020; Zhang et al., 2020; Deshmukh et al., 2017), we will not include these linear methods in the experiments below. Our six baseline algorithms are:

  • KMTL-UCB (Deshmukh et al., 2017) estimates the ”task similarities” with received contextual information. The estimations are based on a variant of kernel ridge regression.

  • Kernel-Ind is Kernel-UCB (Valko et al., 2013) under the ”disjoint setting” (Li et al., 2010) where it learns individual estimators for each arm group.

  • Kernel-Pool represents Kernel-UCB under the ”pooling setting” where it applies a single estimator for all arm groups.

  • Neural-TS stands for Neural Thompson Sampling (Zhang et al., 2020) with group-aware embedding, which enables it to leverage the group information. It applies a neural network for exploitation and Thompson sampling strategy for exploration.

  • Neural-Pool is for Neural-UCB (Zhou et al., 2020) with a single neural network to evaluate the reward, and calculate the upper confidence bounds with the network gradients.

  • Neural-Ind represents Neural-UCB with group-aware embedding for utilizing the group information.

Note that COFIBA (Li et al., 2016) is naturally Kernel-Ind (with linear kernel) given the arm group information and one single user to serve, so we do not include it in our benchmarks. To find the best exploration parameter, we perform grid searches over the range {101,102,103}\{10^{-1},10^{-2},10^{-3}\} for all algorithms. Similarly, the learning rate for neural algorithms are chosen from {102,103,104}\{10^{-2},10^{-3},10^{-4}\}. For Neural-UCB, Neural-TS and our reward estimation module, we apply a two-layer FC network with m=500m=500. RBF kernels are applied for KMTL-UCB and Kernel-UCB as well as our graph estimation module. Kernel-Pool and Neural-Pool will not fit into the multi-class classification setting, as we only receive one arm (context) at each time step without the arm group information.

Refer to caption
Figure 1. Cumulative regrets for recommendation data sets.

6.1. Real Data Sets

Here, we compare our proposed model with baseline algorithms on four real data sets with different specifications.

MovieLens and Yelp data sets. The first real data set is the ”MovieLens 20M rating data set” (grouplens.org/datasets/movielens/20m/) . To obtain the user features, we first choose 100 movies and 4000 users with most reviews to form the user-movie matrix where the entries are user ratings, and the user features 𝒗ud\boldsymbol{v}_{u}\in\mathbb{R}^{d} are obtained through singular value decomposition (SVD) where the dimension d=20d=20. Then, since the genome-scores of user-specified tags are provided for each movie, we select 20 tags with the highest variance to construct the movie features 𝒗id\boldsymbol{v}_{i}\in\mathbb{R}^{d} with their scores on these tags. Then, these movies are allocated into 19 groups based on their genres (|𝒞|=19\lvert\mathcal{C}\rvert=19). Receiving a user utu_{t} at each time step tt, we follow the idea of Generalized Matrix Factorization (GMF) (He et al., 2017; Zhou et al., 2021b, a) to encode user information into the contexts as 𝒙~c,t(i)=[𝒗ut𝒗i]d,c𝒞t,i[nc,t]\widetilde{\boldsymbol{x}}^{(i)}_{c,t}=[\boldsymbol{v}_{u_{t}}\odot\boldsymbol{v}_{i}]\in\mathbb{R}^{d},c\in\mathcal{C}_{t},i\in[n_{c,t}], and let |𝒳t|=20\lvert\mathcal{X}_{t}\rvert=20. Finally, we concatenate a constant 0.01 to each 𝒙~c,t(i)\widetilde{\boldsymbol{x}}^{(i)}_{c,t} to obtain 𝒙c,t(i)dx\boldsymbol{x}^{(i)}_{c,t}\in\mathbb{R}^{d_{x}}, which makes dx=21d_{x}=21, before normalizing 𝒙c,t(i)\boldsymbol{x}^{(i)}_{c,t}. Rewards rc,t(i)r_{c,t}^{(i)} are user ratings normalized into range [0, 1].

Then, for the Yelp data set (https://www.yelp.com/dataset), we choose 4000 users with most reviews and restaurants from 20 different categories as arms (|𝒞|=20)(\lvert\mathcal{C}\rvert=20). Both user features and arm features are obtained through SVD with the dimension d=20d=20. Analogous to the MovieLens data set, we follow the GMF based approach and the fore-mentioned constant concatenation to get the arm context 𝒙c,t(i)\boldsymbol{x}^{(i)}_{c,t} (dx=21,|𝒳t|=20d_{x}=21,\lvert\mathcal{X}_{t}\rvert=20) to encode the user information, and the rewards are the normalized user ratings.

MNIST data set with augmented classes (MNIST-Aug). MNIST is a well-known classification data set with 10 original classes where each sample is labeled as a digit from 0 to 9. Here, we further divide the samples from each class into 5 sub-divisions through KK-means clustering, which gives us a total of 50 augmented sub-classes (i.e., arm groups) for the whole data set. Given a sample 𝒙t\boldsymbol{x}_{t}, the reward would be rt=1r_{t}=1 if the learner accurately predicts its sub-class; or the learner will receive the partial reward rt=0.5r_{t}=0.5 when it chooses the wrong sub-class, but this sub-class and the correct one belong to the same digit (original class). Otherwise, the reward rt=0r_{t}=0.

XRMB data set. XRMB data set (Wang et al., 2015) is a multi-view classification data set with 40 different labels. Here, we only apply samples from the first 38 classes as there are insufficient samples for the last two classes. The arm contexts 𝒙t\boldsymbol{x}_{t} are the first-view features of the samples. Then, learner will receive a reward of rt=1r_{t}=1 when they predict the right label, and rt=0r_{t}=0 otherwise.

6.2. Experimental Results

Refer to caption
Figure 2. Cumulative regrets for classification data sets.

Figure 1 shows the cumulative regret results on the two real recommendation data sets where our proposed AGG-UCB  outperforms all strong baselines. In particular, we can find that algorithms with group-aware arm embedding tend to perform better than those without the arm group information (Kernel-Pool, Neural-Pool). This confirms the necessity of exploiting arm group information. Nevertheless, these baselines fed with group-aware are outperformed by AGG-UCB, which implies the advantages of of our new graph-based model. Meantime, it can be observed that neural algorithms (AGG-UCB, Neural-Ind, Neural-TS) generally perform better compared with other baselines due to the representation power of neural networks. Note that since the user features and arm features of the Yelp data set are directly extracted with SVD, the reward estimation on the Yelp data set is comparably easy compared with others data sets. Therefore, the performances of benchmarks do not differ dramatically with AGG-UCB. In opposite, MovieLens data set with true arm features tends to be a more challenging task where a more complex mapping from arms to their rewards can be involved. This can be reason for AGG-UCB’s superiority over the competitors.

Then, Figure 2 shows the cumulative regret results on the two classification data sets where our AGG-UCB  achieves the best performance compared with other baselines. In particular, since sub-classes from each digit are highly correlated in the MNIST-Aug data set, our proposed AGG-UCB  tends to perform significantly better due to its ability of leveraging arm group correlations compared with other neural methods. Thus, these two aspects verify our claim that associating the neural models with arm group relationship modeling can lead to better performance.

6.3. Parameter Study

Refer to caption
Figure 3. Cumulative regrets on MovieLens and MNIST-Aug data sets with different neighborhood parameter kk.

In this section, we conduct our parameter study for the neighborhood parameter kk on the MovieLens data set and MNIST-Aug data set with augmented labels, and the results are presented in Figure 3. For the MovieLens data set, we can observe that setting k=1k=1 would give the best result. Although increasing kk can enable the aggregation module to propagate the hidden representations for multiple hops, it can potentially fail to focus on local arm group neighbors with high correlations, which is comparable to the aforementioned ”over-smoothing” problem. In addition, since the arm group graph of MovieLens data set only has 19 nodes, k=1k=1 would be enough. Meantime, setting k=1k=1 also achieves the best performance on the MNIST data set. The reason can be that the 11-hop neighborhood of each sub-class can already include all the other sub-classes from the same digit with heavy edge weights within the neighborhood for arm group collaboration. Therefore, unless setting kk to considerably large values, the AGG-UCB can maintain robust performances, which reduces the workload for hyperparameter tuning.

7. Conclusion

In this paper, motivated by real applications where the arm group information is available, we propose a new graph-based model to characterize the relationship among arm groups. Base on this model, we propose a novel UCB-based algorithm named AGG-UCB, which uses GNN to exploit the arm group relationship and share the information across similar arm groups. Compared with existing methods, AGG-UCB  provides a new way of collaborating multiple neural contextual bandit estimators for obtaining the rewards. In addition to the theoretical analysis of AGG-UCB, we empirically demonstrate its superiority on real data sets in comparison with state-of-the-art baselines.

Acknowledgements.
This work is supported by National Science Foundation under Award No. IIS-1947203, IIS-2117902, IIS-2137468, and IIS-2002540. The views and conclusions are those of the authors and should not be interpreted as representing the official policies of the funding agencies or the government.

References

  • (1)
  • Abbasi-Yadkori et al. (2011) Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. 2011. Improved algorithms for linear stochastic bandits. NeurIPS 24 (2011), 2312–2320.
  • Allen-Zhu et al. (2019) Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. 2019. A convergence theory for deep learning via over-parameterization. In ICML. PMLR, 242–252.
  • Auer et al. (2002) Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. 2002. Finite-time analysis of the multiarmed bandit problem. Machine learning 47, 2-3 (2002), 235–256.
  • Ban and He (2021a) Yikun Ban and Jingrui He. 2021a. Convolutional neural bandit: Provable algorithm for visual-aware advertising. arXiv preprint arXiv:2107.07438 (2021).
  • Ban and He (2021b) Yikun Ban and Jingrui He. 2021b. Local clustering in contextual multi-armed bandits. In Proceedings of the Web Conference 2021. 2335–2346.
  • Ban et al. (2021) Yikun Ban, Jingrui He, and Curtiss B Cook. 2021. Multi-facet Contextual Bandits: A Neural Network Perspective. arXiv preprint arXiv:2106.03039 (2021).
  • Ban et al. (2022a) Yikun Ban, Yunzhe Qi, Tianxin Wei, and Jingrui He. 2022a. Neural Collaborative Filtering Bandits via Meta Learning. ArXiv abs/2201.13395 (2022).
  • Ban et al. (2022b) Yikun Ban, Yuchen Yan, Arindam Banerjee, and Jingrui He. 2022b. EE-Net: Exploitation-Exploration Neural Networks in Contextual Bandits. In ICLR.
  • Blanchard et al. (2011) Gilles Blanchard, Gyemin Lee, and Clayton Scott. 2011. Generalizing from several related classification tasks to a new unlabeled sample. In NeurIPS. 2178–2186.
  • Chlebus (2009) Edward Chlebus. 2009. An approximate formula for a partial sum of the divergent p-series. Applied Mathematics Letters 22, 5 (2009), 732–737.
  • Chu et al. (2011) Wei Chu, Lihong Li, Lev Reyzin, and Robert Schapire. 2011. Contextual bandits with linear payoff functions. In AISTATS. 208–214.
  • Deshmukh et al. (2017) Aniket Anand Deshmukh, Urun Dogan, and Clay Scott. 2017. Multi-task learning for contextual bandits. In NeurIPS. 4848–4856.
  • Du et al. (2019b) Simon Du, Jason Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai. 2019b. Gradient descent finds global minima of deep neural networks. In ICML. PMLR, 1675–1685.
  • Du et al. (2019a) Simon S Du, Kangcheng Hou, Barnabás Póczos, Ruslan Salakhutdinov, Ruosong Wang, and Keyulu Xu. 2019a. Graph neural tangent kernel: Fusing graph neural networks with graph kernels. arXiv preprint arXiv:1905.13192 (2019).
  • Durand et al. (2018) Audrey Durand, Charis Achilleos, Demetris Iacovides, Katerina Strati, Georgios D Mitsis, and Joelle Pineau. 2018. Contextual bandits for adapting treatment in a mouse model of de novo carcinogenesis. In Machine learning for healthcare conference. PMLR, 67–82.
  • Fu and He (2021a) Dongqi Fu and Jingrui He. 2021a. DPPIN: A biological repository of dynamic protein-protein interaction network data. arXiv preprint arXiv:2107.02168 (2021).
  • Fu and He (2021b) Dongqi Fu and Jingrui He. 2021b. SDG: A Simplified and Dynamic Graph Neural Network. In SIGIR ’21. 2273–2277.
  • Gentile et al. (2017) Claudio Gentile, Shuai Li, Purushottam Kar, Alexandros Karatzoglou, Giovanni Zappella, and Evans Etrue. 2017. On context-dependent clustering of bandits. In ICML. 1253–1262.
  • Gentile et al. (2014) Claudio Gentile, Shuai Li, and Giovanni Zappella. 2014. Online clustering of bandits. In ICML. 757–765.
  • Hamilton et al. (2017) William L Hamilton, Rex Ying, and Jure Leskovec. 2017. Inductive representation learning on large graphs. arXiv preprint arXiv:1706.02216 (2017).
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In CVPR. 770–778.
  • He et al. (2017) Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. 2017. Neural collaborative filtering. In WWW. 173–182.
  • Hoang et al. (2021) NT Hoang, Takanori Maehara, and Tsuyoshi Murata. 2021. Revisiting Graph Neural Networks: Graph Filtering Perspective. In 2020 25th International Conference on Pattern Recognition (ICPR). IEEE, 8376–8383.
  • Jacot et al. (2018) Arthur Jacot, Franck Gabriel, and Clément Hongler. 2018. Neural tangent kernel: Convergence and generalization in neural networks. arXiv preprint arXiv:1806.07572 (2018).
  • Kipf and Welling (2016) Thomas N Kipf and Max Welling. 2016. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907 (2016).
  • Klicpera et al. (2018) Johannes Klicpera, Aleksandar Bojchevski, and Stephan Günnemann. 2018. Predict then propagate: Graph neural networks meet personalized pagerank. arXiv preprint arXiv:1810.05997 (2018).
  • Krause and Ong (2011) Andreas Krause and Cheng Soon Ong. 2011. Contextual Gaussian Process Bandit Optimization.. In NeurIPS. 2447–2455.
  • Li et al. (2010) Lihong Li, Wei Chu, John Langford, and Robert E Schapire. 2010. A contextual-bandit approach to personalized news article recommendation. In WWW. 661–670.
  • Li et al. (2019) Shuai Li, Wei Chen, Shuai Li, and Kwong-Sak Leung. 2019. Improved Algorithm on Online Clustering of Bandits. In IJCAI. 2923–2929.
  • Li et al. (2016) Shuai Li, Alexandros Karatzoglou, and Claudio Gentile. 2016. Collaborative filtering bandits. In SIGIR. 539–548.
  • Sajeev et al. (2021) Sandra Sajeev, Jade Huang, Nikos Karampatziakis, Matthew Hall, Sebastian Kochman, and Weizhu Chen. 2021. Contextual Bandit Applications in a Customer Support Bot. In KDD ’21. 3522–3530.
  • Shao et al. (2019) Xin Shao, Ning Lv, Jie Liao, Jinbo Long, Rui Xue, Ni Ai, Donghang Xu, and Xiaohui Fan. 2019. Copy number variation is highly correlated with differential gene expression: a pan-cancer study. BMC medical genetics 20, 1 (2019), 1–14.
  • Upadhyay et al. (2020) Sohini Upadhyay, Mikhail Yurochkin, Mayank Agarwal, Yasaman Khazaeni, et al. 2020. Online Semi-Supervised Learning with Bandit Feedback. arXiv preprint arXiv:2010.12574 (2020).
  • Valko et al. (2013) Michal Valko, Nathaniel Korda, Rémi Munos, Ilias Flaounas, and Nelo Cristianini. 2013. Finite-time analysis of kernelised contextual bandits. arXiv preprint arXiv:1309.6869 (2013).
  • Vershynin (2010) Roman Vershynin. 2010. Introduction to the non-asymptotic analysis of random matrices. arXiv preprint arXiv:1011.3027 (2010).
  • Villar et al. (2015) Sofía S Villar, Jack Bowden, and James Wason. 2015. Multi-armed bandit models for the optimal design of clinical trials: benefits and challenges. Statistical science: a review journal of the Institute of Mathematical Statistics 30, 2 (2015), 199.
  • Wang et al. (2015) Weiran Wang, Raman Arora, Karen Livescu, and Jeff A Bilmes. 2015. Unsupervised learning of acoustic features via deep canonical correlation analysis. In 2015 IEEE ICASSP. IEEE.
  • Wei et al. (2020) Tianxin Wei, Ziwei Wu, Ruirui Li, Ziniu Hu, Fuli Feng, Xiangnan He, Yizhou Sun, and Wei Wang. 2020. Fast adaptation for cold-start collaborative filtering with meta-learning. In ICDM. IEEE, 661–670.
  • Wu et al. (2019a) Felix Wu, Amauri Souza, Tianyi Zhang, Christopher Fifty, Tao Yu, and Kilian Weinberger. 2019a. Simplifying graph convolutional networks. In ICML. PMLR, 6861–6871.
  • Wu et al. (2016) Qingyun Wu, Huazheng Wang, Quanquan Gu, and Hongning Wang. 2016. Contextual bandits in a collaborative environment. In SIGIR. 529–538.
  • Wu et al. (2019b) Qingyun Wu, Huazheng Wang, Yanen Li, and Hongning Wang. 2019b. Dynamic Ensemble of Contextual Bandits to Satisfy Users’ Changing Interests. In WWW. 2080–2090.
  • Xu et al. (2018) Keyulu Xu, Chengtao Li, Yonglong Tian, Tomohiro Sonobe, Ken-ichi Kawarabayashi, and Stefanie Jegelka. 2018. Representation learning on graphs with jumping knowledge networks. In ICML. PMLR, 5453–5462.
  • Xu et al. (2021) Keyulu Xu, Mozhi Zhang, Stefanie Jegelka, and Kenji Kawaguchi. 2021. Optimization of graph neural networks: Implicit acceleration by skip connections and more depth. In ICML. PMLR, 11592–11602.
  • You et al. (2019) Jiaxuan You, Rex Ying, and Jure Leskovec. 2019. Position-aware graph neural networks. In ICML. PMLR, 7134–7143.
  • Zhang et al. (2020) Weitong Zhang, Dongruo Zhou, Lihong Li, and Quanquan Gu. 2020. Neural thompson sampling. arXiv preprint arXiv:2010.00827 (2020).
  • Zhou et al. (2020) Dongruo Zhou, Lihong Li, and Quanquan Gu. 2020. Neural Contextual Bandits with UCB-based Exploration. arXiv:1911.04462 [cs.LG]
  • Zhou et al. (2021a) Yao Zhou, Haonan Wang, Jingrui He, and Haixun Wang. 2021a. From Intrinsic to Counterfactual: On the Explainability of Contextualized Recommender Systems. ArXiv (2021). arXiv:2110.14844
  • Zhou et al. (2021b) Yao Zhou, Jianpeng Xu, Jun Wu, Zeinab Taghavi Nasrabadi, Evren Körpeoglu, Kannan Achan, and Jingrui He. 2021b. PURE: Positive-Unlabeled Recommendation with Generative Adversarial Network. In KDD ’21. 2409–2419.

Appendix A Lemmas for Intermediate Variables and Weight Matrices

Due to page limit, we will give the proof sketch for lemmas at the end of each corresponding appendix section. Recall that each input context xc,t(i),i[nc,t]x^{(i)}_{c,t},i\in[n_{c,t}] is embedded to 𝑿~c,t(i)\widetilde{\boldsymbol{X}}^{(i)}_{c,t} (represented by 𝑿~\widetilde{\boldsymbol{X}} for brevity). Supposing 𝑿~\widetilde{\boldsymbol{X}} belongs to the arm group cc, denote 𝒉𝑨=[𝑨tk𝑿~]c\boldsymbol{h}_{\boldsymbol{A}}=[\boldsymbol{A}_{t}^{k}\widetilde{\boldsymbol{X}}]_{c} as the corresponding row in matrix 𝑨tk𝑿~\boldsymbol{A}_{t}^{k}\widetilde{\boldsymbol{X}} based on index of group cc in 𝒞\mathcal{C} (if group cc is the cc^{\prime}-th group in 𝒞\mathcal{C}, then 𝒉𝑨\boldsymbol{h}_{\boldsymbol{A}} is the cc^{\prime}-th row in 𝑨tk𝑿~\boldsymbol{A}_{t}^{k}\widetilde{\boldsymbol{X}}). Similarly, we have 𝒉gnn=[𝑯gnn]c\boldsymbol{h}_{gnn}=[\boldsymbol{H}_{gnn}]_{c} and 𝒉l=[𝑯l]c\boldsymbol{h}_{l}=[\boldsymbol{H}_{l}]_{c} respectively. Given received contexts {𝑿~τ}τ=1T\{\widetilde{\boldsymbol{X}}_{\tau}\}_{\tau=1}^{T} and rewards {rτ}τ=1T\{r_{\tau}\}_{\tau=1}^{T}, the gradient w.r.t. weight matrix 𝚯l,l{1,,L1}\boldsymbol{\Theta}_{l},\forall l\in\{1,\dots,L-1\} will be

(Θ)𝚯l=mLl+12τ=1T|f(𝑿~τ;𝚯)rτ|2(𝒉l1𝚯L(q=l+1L1𝚪q𝚯q)𝚪l)\begin{split}\frac{\partial~{}\mathcal{L}(\Theta)}{\partial~{}\boldsymbol{\Theta}_{l}}=m^{-\frac{L-l+1}{2}}\sum_{\tau=1}^{T}\lvert f(\widetilde{\boldsymbol{X}}_{\tau};\boldsymbol{\Theta})-r_{\tau}\rvert^{2}\bigg{(}\boldsymbol{h}_{l-1}\boldsymbol{\Theta}_{L}^{\intercal}\big{(}\prod_{q=l+1}^{L-1}\boldsymbol{\Gamma}_{q}\boldsymbol{\Theta}_{q}^{\intercal}\big{)}\boldsymbol{\Gamma}_{l}\bigg{)}\end{split}

where 𝚪q=diag([σ(𝒉q1𝚯q)])\boldsymbol{\Gamma}_{q}=diag([\sigma^{\prime}(\boldsymbol{h}_{q-1}\boldsymbol{\Theta}_{q})]) is the diagonal matrix whose entries are the elements from σ(𝒉𝒒𝟏𝚯q)\sigma^{\prime}(\boldsymbol{\boldsymbol{h}_{q-1}\Theta}_{q}). The coefficient 12\frac{1}{2} of the cost function is omitted for simplicity. Then, for 𝚯gnn\boldsymbol{\Theta}_{gnn}, we have

(Θ)𝚯gnn=mL+12τ=1T|f(𝑿~τ;𝚯)rτ|2(𝒉𝑨𝚯L(q=2L1𝚪q𝚯q)𝚪1𝚯1𝑸𝚪gnn)\begin{split}\frac{\partial~{}\mathcal{L}(\Theta)}{\partial~{}\boldsymbol{\Theta}_{gnn}}=m^{-\frac{L+1}{2}}\sum_{\tau=1}^{T}\lvert f(\widetilde{\boldsymbol{X}}_{\tau};\boldsymbol{\Theta})-r_{\tau}\rvert^{2}\bigg{(}\boldsymbol{h}_{\boldsymbol{A}}\boldsymbol{\Theta}_{L}^{\intercal}\big{(}\prod_{q=2}^{L-1}\boldsymbol{\Gamma}_{q}\boldsymbol{\Theta}_{q}^{\intercal}\big{)}\boldsymbol{\Gamma}_{1}\boldsymbol{\Theta}_{1}^{\intercal}\boldsymbol{Q}\boldsymbol{\Gamma}_{gnn}\bigg{)}\end{split}

where 𝚪gnn=diag([σ(𝒉𝑨𝚯gnn)])\boldsymbol{\Gamma}_{gnn}=diag([\sigma^{\prime}(\boldsymbol{h}_{\boldsymbol{A}}\boldsymbol{\Theta}_{gnn})]). 𝑸=\boldsymbol{Q}= (𝑰m×m𝟎(dm)×m)d×m\left(\begin{array}[]{c}\boldsymbol{I}\in\mathbb{R}^{m\times m}\\ \hline\cr\boldsymbol{0}\in\mathbb{R}^{(d^{\prime}-m)\times m}\end{array}\right)\in\mathbb{R}^{d^{\prime}\times m}. Given the same 𝒢t\mathcal{G}_{t}, we provide lemmas to bound the R1R_{1} term of Eq. 9. For brevity, the subscript τ[T]\tau\in[T] and notation 𝒢t\mathcal{G}_{t} are omitted below by default.

Lemma A.1.

Given the randomly initialized parameters 𝚯(0)={𝚯gnn(0),𝚯1(0),𝚯2(0),,𝚯L(0)}\boldsymbol{\Theta}^{(0)}=\{\boldsymbol{\Theta}_{gnn}^{(0)},\boldsymbol{\Theta}_{1}^{(0)},\boldsymbol{\Theta}_{2}^{(0)},\dots,\boldsymbol{\Theta}_{L}^{(0)}\}, with the probability at least 1O(TL)eΩ(m)1-O(TL)\cdot e^{-\Omega(m)} and constants 1<β1,β2,β3,β4<21<\beta_{1},\beta_{2},\beta_{3},\beta_{4}<2, we have

𝚯gnn(0)2β1m,𝚯1(0)2β2m,𝚯L(0)2β3,𝒉gnn(0)2ζβ1,𝒉1(0)2ζβ2+ζ2β1β2,|f(𝑿~;𝚯(0))|ζβ3(ζβ4)L2(ζβ2+ζ2β1β2)/m,𝚯l(0)2β4m,𝒉l(0)2(ζβ4)l1(ζβ2+ζ2β1β2),l{2,,L1}.\begin{split}&\lVert\boldsymbol{\Theta}_{gnn}^{(0)}\rVert_{2}\leq\beta_{1}\sqrt{m},~{}\lVert\boldsymbol{\Theta}_{1}^{(0)}\rVert_{2}\leq\beta_{2}\sqrt{m},~{}\lVert\boldsymbol{\Theta}_{L}^{(0)}\rVert_{2}\leq\beta_{3},\\ &\lVert\boldsymbol{h}_{gnn}^{(0)}\rVert_{2}\leq\zeta\cdot\beta_{1},\quad\lVert\boldsymbol{h}_{1}^{(0)}\rVert_{2}\leq\zeta\cdot\beta_{2}+\zeta^{2}\cdot\beta_{1}\beta_{2},\\ &\lvert f(\widetilde{\boldsymbol{X}};\boldsymbol{\Theta}^{(0)})\rvert\leq\zeta\cdot\beta_{3}\cdot(\zeta\cdot\beta_{4})^{L-2}(\zeta\cdot\beta_{2}+\zeta^{2}\cdot\beta_{1}\beta_{2})/\sqrt{m},\\ &\lVert\boldsymbol{\Theta}_{l}^{(0)}\rVert_{2}\leq\beta_{4}\sqrt{m},~{}~{}\lVert\boldsymbol{h}_{l}^{(0)}\rVert_{2}\leq(\zeta\cdot\beta_{4})^{l-1}(\zeta\cdot\beta_{2}+\zeta^{2}\cdot\beta_{1}\beta_{2}),~{}~{}\\ &\forall l\in\{2,\dots,L-1\}.\end{split}

Proof. Based on the properties of random Gaussian matrices (Vershynin, 2010; Du et al., 2019b; Ban and He, 2021a), with the probability of at least 1e(β1dx~/m1)2m2=1eΩ(m)1-e^{-\frac{(\beta_{1}-\sqrt{d_{\widetilde{x}}/m}-1)^{2}\cdot m}{2}}=1-e^{-\Omega(m)}, we have

𝚯gnn(0)2β1m\begin{split}\lVert\boldsymbol{\Theta}_{gnn}^{(0)}\rVert_{2}\leq\beta_{1}\sqrt{m}\end{split}

where β1dx~/m+1\beta_{1}\geq\sqrt{d_{\widetilde{x}}/m}+1 with m>dx~m>d_{\widetilde{x}}. Applying the analogous approach for the other randomly initialized matrices would give similar bounds. Regarding the nature of 𝑨\boldsymbol{A}, we can easily have 𝒉𝑨21\lVert\boldsymbol{h}_{\boldsymbol{A}}\rVert_{2}\leq 1. Then,

𝒉gnn(0)2=m12σ(𝒉𝑨𝚯gnn)2ζm12𝒉𝑨2𝚯gnn2ζβ1\begin{split}\lVert\boldsymbol{h}_{gnn}^{(0)}\rVert_{2}=m^{-\frac{1}{2}}\lVert\sigma(\boldsymbol{h}_{\boldsymbol{A}}\cdot\boldsymbol{\Theta}_{gnn})\rVert_{2}\leq\zeta m^{-\frac{1}{2}}\cdot\lVert\boldsymbol{h}_{\boldsymbol{A}}\rVert_{2}\lVert\boldsymbol{\Theta}_{gnn}\rVert_{2}\leq\zeta\cdot\beta_{1}\end{split}

due to the assumed ζ\zeta-Lipschitz continuity. Denoting the concatenated input for reward estimation module as 𝒙=[𝒉gnn(0);𝑿~]c1×(dx~+m)\boldsymbol{x}^{\prime}=[\boldsymbol{h}_{gnn}^{(0)};\widetilde{\boldsymbol{X}}]_{c}\in\mathbb{R}^{1\times(d_{\widetilde{x}}+m)}, we can easily derive that 𝒙2ζβ1+1\lVert\boldsymbol{x}^{\prime}\rVert_{2}\leq\zeta\cdot\beta_{1}+1. Thus,

𝒉1(0)2=m12σ(𝒙𝚯1)2ζm12𝒙2𝚯12ζβ2(ζβ1+1)=ζβ2+ζ2β1β2.\begin{split}\lVert\boldsymbol{h}_{1}^{(0)}\rVert_{2}&=m^{-\frac{1}{2}}\lVert\sigma(\boldsymbol{x}^{\prime}\cdot\boldsymbol{\Theta}_{1})\rVert_{2}\leq\zeta m^{-\frac{1}{2}}\cdot\lVert\boldsymbol{x}^{\prime}\rVert_{2}\lVert\boldsymbol{\Theta}_{1}\rVert_{2}\\ &\leq\zeta\cdot\beta_{2}(\zeta\cdot\beta_{1}+1)=\zeta\cdot\beta_{2}+\zeta^{2}\cdot\beta_{1}\beta_{2}.\end{split}

Following the same procedure recursively for other intermediate outputs and applying the union bound would complete the proof.

Lemma A.2.

After TT time steps, run GD for JJ-iterations on the network with the received contexts and rewards. Suppose 𝐡l(j)𝐡l(0)2Λ(j),j[J]\lVert\boldsymbol{h}_{l}^{(j)}-\boldsymbol{h}_{l}^{(0)}\rVert_{2}\leq\Lambda^{(j)},\forall j\in[J]. With the probability of at least 1O(TL)eΩ(m)1-O(TL)\cdot e^{-\Omega(m)} and 𝚯{𝚯gnn,𝚯1,,𝚯L}\boldsymbol{\Theta}\in\{\boldsymbol{\Theta}_{gnn},\boldsymbol{\Theta}_{1},\dots,\boldsymbol{\Theta}_{L}\}, we have

𝚯(j)𝚯(0)FΥ/m\begin{split}&\lVert\boldsymbol{\Theta}^{(j)}-\boldsymbol{\Theta}^{(0)}\rVert_{F}\leq\Upsilon/\sqrt{m}\end{split}

where Υ=22tβF(βh+Λ(j))(βL+1)LζL.\Upsilon=\frac{2\sqrt{2}t}{\beta_{F}}(\beta_{h}+\Lambda^{(j)})(\beta_{L}+1)^{L}\zeta^{L}.

Proof. We prove this Lemma following an induction-based procedure (Du et al., 2019b; Ban and He, 2021a). The hypothesis is 𝚯(0)𝚯(j)FΥ/m,𝚯{𝚯gnn,𝚯1,,𝚯L}\lVert\boldsymbol{\Theta}^{(0)}-\boldsymbol{\Theta}^{(j)}\rVert_{F}\leq\Upsilon/\sqrt{m},\forall\boldsymbol{\Theta}\in\{\boldsymbol{\Theta}_{gnn},\boldsymbol{\Theta}_{1},\dots,\boldsymbol{\Theta}_{L}\}, and let βL=max{β1,β2,β3,β4}\beta_{L}=\max\{\beta_{1},\beta_{2},\beta_{3},\beta_{4}\}. According to Algorithm 2, we have for the j+1j+1-th iteration and l[L]l\in[L],

𝚯l(j+1)𝚯l(j)F=mLl+12ητ=1T|f(𝑿~τ;𝚯(j))rτ|2𝒉l1(j)(𝚯L(j))(q=l+1L1𝚪q(j)(𝚯q(j)))𝚪l(j)FmLl+12ηt𝑭t(j)𝒀t2𝒉l1(j)𝚯L(j)Fq=l+1L1𝚯q(j)2q=lL1𝚪q(j)2mLl+12ηt𝑭t(j)𝒀t2𝒉l1(j)2𝚯L(j)2q=l+1L1𝚯q(j)2q=lL1𝚪q(j)2\begin{split}&\lVert\boldsymbol{\Theta}_{l}^{(j+1)}-\boldsymbol{\Theta}_{l}^{(j)}\rVert_{F}=m^{-\frac{L-l+1}{2}}\eta\cdot\lVert\sum_{\tau=1}^{T}\lvert f(\widetilde{\boldsymbol{X}}_{\tau};\boldsymbol{\Theta}^{(j)})-r_{\tau}\rvert^{2}\cdot\boldsymbol{h}_{l-1}^{(j)}(\boldsymbol{\Theta}_{L}^{(j)})^{\intercal}\\ &\qquad\cdot\big{(}\prod_{q=l+1}^{L-1}\boldsymbol{\Gamma}_{q}^{(j)}\cdot(\boldsymbol{\Theta}_{q}^{(j)})^{\intercal}\big{)}\cdot\boldsymbol{\Gamma}_{l}^{(j)}\rVert_{F}\\ &\leq m^{-\frac{L-l+1}{2}}\eta\sqrt{t}\cdot\lVert\boldsymbol{F}^{(j)}_{t}-\boldsymbol{Y}_{t}\rVert_{2}\lVert\boldsymbol{h}_{l-1}^{(j)}\boldsymbol{\Theta}_{L}^{(j)}\rVert_{F}\prod_{q=l+1}^{L-1}\lVert\boldsymbol{\Theta}_{q}^{(j)}\rVert_{2}\prod_{q=l}^{L-1}\lVert\boldsymbol{\Gamma}_{q}^{(j)}\rVert_{2}\\ &\leq m^{-\frac{L-l+1}{2}}\eta\sqrt{t}\cdot\lVert\boldsymbol{F}^{(j)}_{t}-\boldsymbol{Y}_{t}\rVert_{2}\lVert\boldsymbol{h}_{l-1}^{(j)}\rVert_{2}\lVert\boldsymbol{\Theta}_{L}^{(j)}\rVert_{2}\prod_{q=l+1}^{L-1}\lVert\boldsymbol{\Theta}_{q}^{(j)}\rVert_{2}\prod_{q=l}^{L-1}\lVert\boldsymbol{\Gamma}_{q}^{(j)}\rVert_{2}\end{split}

by Cauchy inequality. For 𝚯q(j)2\lVert\boldsymbol{\Theta}_{q}^{(j)}\rVert_{2}, we have

q=l+1L1𝚯q(j)2q=l+1L1(𝚯q(0)2+𝚯q(j)𝚯q(0)2)(βLm+Υ/m)Ll1;\begin{split}\prod_{q=l+1}^{L-1}\lVert\boldsymbol{\Theta}_{q}^{(j)}\rVert_{2}&\leq\prod_{q=l+1}^{L-1}\bigg{(}\lVert\boldsymbol{\Theta}_{q}^{(0)}\rVert_{2}+\lVert\boldsymbol{\Theta}_{q}^{(j)}-\boldsymbol{\Theta}_{q}^{(0)}\rVert_{2}\bigg{)}\\ &\qquad\leq(\beta_{L}\sqrt{m}+\Upsilon/\sqrt{m})^{L-l-1};\end{split}

while for 𝚪q(j)2\lVert\boldsymbol{\Gamma}_{q}^{(j)}\rVert_{2}, we have q=lL1𝚪q(j)2ζLl\prod_{q=l}^{L-1}\lVert\boldsymbol{\Gamma}_{q}^{(j)}\rVert_{2}\leq\zeta^{L-l}. Combining all the results above and based on Lemma 5.8, it means that for l[L]l\in[L],

𝚯l(j+1)𝚯l(j)FmLl+12ηt(1βFη)j/2𝑭t(0)𝒀t2𝒉l1(j)2𝚯L(j)2(βLm+Υ/m)Ll1ζLl+1m12(1βFη)j/2ηt𝑭t(0)𝒀t2((βh+Λ(j))(βL+Υ/m)LlζLl\begin{split}&\lVert\boldsymbol{\Theta}_{l}^{(j+1)}-\boldsymbol{\Theta}_{l}^{(j)}\rVert_{F}\leq m^{-\frac{L-l+1}{2}}\eta\sqrt{t}\cdot(1-\beta_{F}\cdot\eta)^{j/2}\cdot\lVert\boldsymbol{F}^{(0)}_{t}-\boldsymbol{Y}_{t}\rVert_{2}\\ &\qquad\cdot\lVert\boldsymbol{h}_{l-1}^{(j)}\rVert_{2}\cdot\lVert\boldsymbol{\Theta}_{L}^{(j)}\rVert_{2}\cdot(\beta_{L}\sqrt{m}+\Upsilon/\sqrt{m})^{L-l-1}\cdot\zeta^{L-l+1}\\ &\leq m^{-\frac{1}{2}}(1-\beta_{F}\eta)^{j/2}\eta\sqrt{t}\lVert\boldsymbol{F}^{(0)}_{t}-\boldsymbol{Y}_{t}\rVert_{2}((\beta_{h}+\Lambda^{(j)})(\beta_{L}+\Upsilon/m)^{L-l}\zeta^{L-l}\end{split}

where the last inequality is due to Lemma A.3. Then, since we have 𝚯l(j+1)𝚯l(j)F𝚯l(j+1)𝚯l(j)F+𝚯l(j)𝚯l(0)F\lVert\boldsymbol{\Theta}_{l}^{(j+1)}-\boldsymbol{\Theta}_{l}^{(j)}\rVert_{F}\leq\lVert\boldsymbol{\Theta}_{l}^{(j+1)}-\boldsymbol{\Theta}_{l}^{(j)}\rVert_{F}+\lVert\boldsymbol{\Theta}_{l}^{(j)}-\boldsymbol{\Theta}_{l}^{(0)}\rVert_{F}, it leads to

𝚯l(j+1)𝚯l(0)F2tβFm𝑭t(0)𝒀t2(βh+Λ(j))(βL+Υ/m)LlζLl.\begin{split}&\lVert\boldsymbol{\Theta}_{l}^{(j+1)}-\boldsymbol{\Theta}_{l}^{(0)}\rVert_{F}\leq\frac{2\sqrt{t}}{\beta_{F}\sqrt{m}}\lVert\boldsymbol{F}^{(0)}_{t}-\boldsymbol{Y}_{t}\rVert_{2}(\beta_{h}+\Lambda^{(j)})(\beta_{L}+\Upsilon/m)^{L-l}\zeta^{L-l}.\end{split}

For the last layer 𝚯L\boldsymbol{\Theta}_{L}, the conclusion can be verified through a similar procedure. Analogously, for 𝚯gnn\boldsymbol{\Theta}_{gnn}, we have

𝚯gnn(j+1)𝚯gnn(j)F=mL+12ητ=1T|f(𝑿~τ;𝚯)rτ|2(𝒉𝑨𝚯L(q=2L1𝚪q𝚯q)𝚪1𝚯1)𝑸𝚪gnnFmL+12ηt(1βFη)j/2𝑭t(0)𝒀t2𝒉𝑨2q=1L𝚯q2ζL𝑸2tm12(1βFη)j/2η𝑭t(0)𝒀t2ζL(βL+Υ/m)L,\begin{split}&\lVert\boldsymbol{\Theta}_{gnn}^{(j+1)}-\boldsymbol{\Theta}_{gnn}^{(j)}\rVert_{F}\\ &=m^{-\frac{L+1}{2}}\eta\lVert\sum_{\tau=1}^{T}\lvert f(\widetilde{\boldsymbol{X}}_{\tau};\boldsymbol{\Theta})-r_{\tau}\rvert^{2}\cdot\bigg{(}\boldsymbol{h}_{\boldsymbol{A}}\cdot\boldsymbol{\Theta}_{L}^{\intercal}\cdot\big{(}\prod_{q=2}^{L-1}\boldsymbol{\Gamma}_{q}\cdot\boldsymbol{\Theta}_{q}^{\intercal}\big{)}\cdot\boldsymbol{\Gamma}_{1}\boldsymbol{\Theta}_{1}^{\intercal}\bigg{)}\\ &\qquad\qquad\cdot\boldsymbol{Q}\cdot\boldsymbol{\Gamma}_{gnn}\rVert_{F}\\ &\leq m^{-\frac{L+1}{2}}\eta\sqrt{t}(1-\beta_{F}\cdot\eta)^{j/2}\lVert\boldsymbol{F}^{(0)}_{t}-\boldsymbol{Y}_{t}\rVert_{2}\lVert\boldsymbol{h}_{\boldsymbol{A}}\rVert_{2}\lVert\prod_{q=1}^{L}\boldsymbol{\Theta}_{q}\rVert_{2}\zeta^{L}\lVert\boldsymbol{Q}\rVert_{2}\\ &\leq\sqrt{t}m^{-\frac{1}{2}}(1-\beta_{F}\cdot\eta)^{j/2}\eta\cdot\lVert\boldsymbol{F}^{(0)}_{t}-\boldsymbol{Y}_{t}\rVert_{2}\cdot\zeta^{L}\cdot(\beta_{L}+\Upsilon/m)^{L},\end{split}

which leads to

𝚯gnn(j+1)𝚯gnn(0)F2βFtm12𝑭t(0)𝒀t2ζL(βL+Υ/m)L.\begin{split}&\lVert\boldsymbol{\Theta}_{gnn}^{(j+1)}-\boldsymbol{\Theta}_{gnn}^{(0)}\rVert_{F}\leq\frac{2}{\beta_{F}}\sqrt{t}m^{-\frac{1}{2}}\cdot\lVert\boldsymbol{F}^{(0)}_{t}-\boldsymbol{Y}_{t}\rVert_{2}\cdot\zeta^{L}\cdot(\beta_{L}+\Upsilon/m)^{L}.\end{split}

Since 𝑭t(0)𝒀t22t\lVert\boldsymbol{F}^{(0)}_{t}-\boldsymbol{Y}_{t}\rVert_{2}\leq\sqrt{2t} (Lemma 5.8) and Υ/m1\Upsilon/m\leq 1 with sufficiently large mm, combining all the results above would give the conclusion.

Lemma A.3.

After TT time steps, with the probability of at least 1O(TL)eΩ(m)1-O(TL)\cdot e^{-\Omega(m)} and running GD of JJ-iterations on the contexts and rewards, we have βh=max{ζβ1,ζβ2+ζ2β1β2,(ζβ4)L2(ζβ2+ζ2β1β2)}\beta^{\prime}_{h}=\max\{\zeta\cdot\beta_{1},\zeta\cdot\beta_{2}+\zeta^{2}\cdot\beta_{1}\beta_{2},(\zeta\cdot\beta_{4})^{L-2}(\zeta\cdot\beta_{2}+\zeta^{2}\cdot\beta_{1}\beta_{2})\} and βh=max{ζβL+1,βh}\beta_{h}=\max\{\zeta\cdot\beta_{L}+1,\beta^{\prime}_{h}\}. With 𝐡{𝐡gnn,𝐡1,,𝐡L1}\boldsymbol{h}\in\{\boldsymbol{h}_{gnn},\boldsymbol{h}_{1},\dots,\boldsymbol{h}_{L-1}\}, we have

𝒉(j)𝒉(0)2ζΥmβh(2ζβL)L12ζβL1=Λ(j),𝒉(j)2βh+Λ(j)\begin{split}&\lVert\boldsymbol{h}^{(j)}-\boldsymbol{h}^{(0)}\rVert_{2}\leq\frac{\zeta\Upsilon}{m}\cdot\beta_{h}\cdot\frac{(2\zeta\beta_{L})^{L}-1}{2\zeta\beta_{L}-1}=\Lambda^{(j)},~{}\lVert\boldsymbol{h}^{(j)}\rVert_{2}\leq\beta_{h}+\Lambda^{(j)}\end{split}

Proof. Similar to the proof of Lemma A.2, we adopt an induction-based approach. For l[L1]l\in[L-1], we have

𝒉l(j)𝒉l(0)2=1mσ(𝒉l1(j)𝚯l(j))σ(𝒉l1(0)𝚯l(0))21mζ(𝒉l1(j)𝚯l(j)𝒉l1(0)𝚯l(j)2+𝒉l1(0)𝚯l(j)𝒉l1(0)𝚯l(0)2)1mζ(𝚯l(0)2+𝚯l(j)𝚯l(0)F)𝒉l1(j)𝒉l1(0)2+1mζ𝒉l1(0)2𝚯l(j)𝚯l(0)F1mζ(βLm+Υ/m)𝒉l1(j)𝒉l1(0)2+ζβhΥ/mζ(βL+Υ/m)ζΥmΛl1(j)+ζβhΥ/mζΥm(βh+2ζβLΛl1(j))=ζΥmΛl(j)\begin{split}&\lVert\boldsymbol{h}_{l}^{(j)}-\boldsymbol{h}_{l}^{(0)}\rVert_{2}=\sqrt{\frac{1}{m}}\lVert\sigma(\boldsymbol{h}_{l-1}^{(j)}\cdot\boldsymbol{\Theta}_{l}^{(j)})-\sigma(\boldsymbol{h}_{l-1}^{(0)}\cdot\boldsymbol{\Theta}_{l}^{(0)})\rVert_{2}\\ &\leq\sqrt{\frac{1}{m}}\zeta\cdot\big{(}\lVert\boldsymbol{h}_{l-1}^{(j)}\cdot\boldsymbol{\Theta}_{l}^{(j)}-\boldsymbol{h}_{l-1}^{(0)}\cdot\boldsymbol{\Theta}_{l}^{(j)}\rVert_{2}~{}+\lVert\boldsymbol{h}_{l-1}^{(0)}\cdot\boldsymbol{\Theta}_{l}^{(j)}-\boldsymbol{h}_{l-1}^{(0)}\cdot\boldsymbol{\Theta}_{l}^{(0)}\rVert_{2}\big{)}\\ &\leq\sqrt{\frac{1}{m}}\zeta\cdot(\lVert\boldsymbol{\Theta}_{l}^{(0)}\rVert_{2}+\lVert\boldsymbol{\Theta}_{l}^{(j)}-\boldsymbol{\Theta}_{l}^{(0)}\rVert_{F})\cdot\lVert\boldsymbol{h}_{l-1}^{(j)}-\boldsymbol{h}_{l-1}^{(0)}\rVert_{2}~{}+\\ &\qquad\sqrt{\frac{1}{m}}\zeta\cdot\lVert\boldsymbol{h}_{l-1}^{(0)}\rVert_{2}\cdot\lVert\boldsymbol{\Theta}_{l}^{(j)}-\boldsymbol{\Theta}_{l}^{(0)}\rVert_{F}\\ &\leq\sqrt{\frac{1}{m}}\zeta\cdot(\beta_{L}\sqrt{m}+\Upsilon/\sqrt{m})\cdot\lVert\boldsymbol{h}_{l-1}^{(j)}-\boldsymbol{h}_{l-1}^{(0)}\rVert_{2}~{}+\zeta\cdot\beta^{\prime}_{h}\cdot\Upsilon/m\\ &\leq\zeta\cdot(\beta_{L}+\Upsilon/m)\cdot\zeta\frac{\Upsilon}{m}\cdot\Lambda_{l-1}^{(j)}+\zeta\cdot\beta^{\prime}_{h}\cdot\Upsilon/m\\ &\leq\zeta\frac{\Upsilon}{m}\cdot(\beta_{h}+2\zeta\beta_{L}\cdot\Lambda_{l-1}^{(j)})=\zeta\frac{\Upsilon}{m}\cdot\Lambda_{l}^{(j)}\end{split}

where the last two inequalities are derived by applying Lemma A.2 and the hypothesis. For the aggregation module output 𝒉gnn(0)\boldsymbol{h}_{gnn}^{(0)},

𝒉gnn(j)𝒉gnn(0)2=1mσ(𝚯gnn(j)𝒉S)σ(𝚯gnn(0)𝒉S)2ζm𝚯gnn(j)𝚯gnn(0)F𝒉S2ζΥmβh.\begin{split}&\lVert\boldsymbol{h}_{gnn}^{(j)}-\boldsymbol{h}_{gnn}^{(0)}\rVert_{2}=\sqrt{\frac{1}{m}}\lVert\sigma(\boldsymbol{\Theta}_{gnn}^{(j)}\cdot\boldsymbol{h}_{S})-\sigma(\boldsymbol{\Theta}_{gnn}^{(0)}\cdot\boldsymbol{h}_{S})\rVert_{2}\\ &\leq\frac{\zeta}{\sqrt{m}}\lVert\boldsymbol{\Theta}_{gnn}^{(j)}-\boldsymbol{\Theta}_{gnn}^{(0)}\rVert_{F}\cdot\lVert\boldsymbol{h}_{S}\rVert_{2}\leq\frac{\zeta\Upsilon}{m}\beta_{h}.\end{split}

Then, for the first layer l=1l=1, we have

𝒉1(j)𝒉1(0)2=1mσ(𝒙𝚯1(j))σ(𝒙𝚯1(0))2ζm𝚯gnn(j)𝚯1(0)F𝒙2ζΥm(ζβL+1)ζΥmβh.\begin{split}&\lVert\boldsymbol{h}_{1}^{(j)}-\boldsymbol{h}_{1}^{(0)}\rVert_{2}=\sqrt{\frac{1}{m}}\lVert\sigma(\boldsymbol{x}^{\prime}\cdot\boldsymbol{\Theta}_{1}^{(j)})-\sigma(\boldsymbol{x}^{\prime}\cdot\boldsymbol{\Theta}_{1}^{(0)})\rVert_{2}\\ &\leq\frac{\zeta}{\sqrt{m}}\lVert\boldsymbol{\Theta}_{gnn}^{(j)}-\boldsymbol{\Theta}_{1}^{(0)}\rVert_{F}\cdot\lVert\boldsymbol{x}^{\prime}\rVert_{2}\leq\frac{\zeta\Upsilon}{m}\cdot(\zeta\cdot\beta_{L}+1)\leq\frac{\zeta\Upsilon}{m}\cdot\beta_{h}.\end{split}

Combining all the results, for 𝒉{𝒉gnn,𝒉1,,𝒉L1}\boldsymbol{h}\in\{\boldsymbol{h}_{gnn},\boldsymbol{h}_{1},\dots,\boldsymbol{h}_{L-1}\}, it has

𝒉(j)𝒉(0)2ζΥmβh(2ζβL)L12ζβL1=Λ(j),\begin{split}&\lVert\boldsymbol{h}^{(j)}-\boldsymbol{h}^{(0)}\rVert_{2}\leq\frac{\zeta\Upsilon}{m}\cdot\beta_{h}\cdot\frac{(2\zeta\beta_{L})^{L}-1}{2\zeta\beta_{L}-1}=\Lambda^{(j)},\end{split}

which completes the proof.

Lemma A.4.

With initialized network parameters 𝚯\boldsymbol{\Theta} and the probability of at least 1O(TL)eΩ(m)1-O(TL)\cdot e^{-\Omega(m)}, we have

𝚯f(𝑿~;𝚯(0))Fβhβ3(βLζ)L/m,𝚯Lf(𝑿~;𝚯(0))Fβh/m,\begin{split}&\lVert\nabla_{\boldsymbol{\Theta}}f(\widetilde{\boldsymbol{X}};\boldsymbol{\Theta}^{(0)})\rVert_{F}\leq\beta_{h}\beta_{3}\cdot(\beta_{L}\zeta)^{L}/m,\lVert\nabla_{\boldsymbol{\Theta}_{L}}f(\widetilde{\boldsymbol{X}};\boldsymbol{\Theta}^{(0)})\rVert_{F}\leq\beta_{h}/\sqrt{m},\end{split}

and the norm of gradient difference

𝚯f(𝑿~;𝚯(0))𝚯f(𝑿~;𝚯(j))F3Λ(j),𝚯Lf(𝑿~;𝚯(0))𝚯Lf(𝑿~;𝚯(j))FΛ(j)/m.\begin{split}&\lVert\nabla_{\boldsymbol{\Theta}}f(\widetilde{\boldsymbol{X}};\boldsymbol{\Theta}^{(0)})-\nabla_{\boldsymbol{\Theta}}f(\widetilde{\boldsymbol{X}};\boldsymbol{\Theta}^{(j)})\rVert_{F}\leq 3\cdot\Lambda^{(j)},\\ &\lVert\nabla_{\boldsymbol{\Theta}_{L}}f(\widetilde{\boldsymbol{X}};\boldsymbol{\Theta}^{(0)})-\nabla_{\boldsymbol{\Theta}_{L}}f(\widetilde{\boldsymbol{X}};\boldsymbol{\Theta}^{(j)})\rVert_{F}\leq\Lambda^{(j)}/\sqrt{m}.\end{split}

with 𝚯{𝚯gnn,𝚯1,,𝚯L1}\boldsymbol{\Theta}\in\{\boldsymbol{\Theta}_{gnn},\boldsymbol{\Theta}_{1},\dots,\boldsymbol{\Theta}_{L-1}\}.

Proof. First, for l[L1]l\in[L-1], we have

𝚯lf(𝑿~;𝚯(0))F=mLl+12𝒉l1(0)(𝚯L(0)(q=l+1L1𝚪q𝚯q(0))𝚪l)FmLl+12𝒉l1(0)𝚯L(0)Fq=lL1𝚪q2q=l+1L1𝚯q(0)2mLl+12𝒉l1(0)2𝚯L(0)2q=lL1𝚪q2q=l+1L1𝚯q(0)2mLl+12βhβ3ζLl(βLm)Ll1βhβ3(βLζ)L/m.\begin{split}&\lVert\nabla_{\boldsymbol{\Theta}_{l}}f(\widetilde{\boldsymbol{X}};\boldsymbol{\Theta}^{(0)})\rVert_{F}=m^{-\frac{L-l+1}{2}}\lVert\boldsymbol{h}_{l-1}^{(0)}\big{(}\boldsymbol{\Theta}_{L}^{(0)}\big{(}\prod_{q=l+1}^{L-1}\boldsymbol{\Gamma}_{q}\cdot\boldsymbol{\Theta}_{q}^{(0)}\big{)}\cdot\boldsymbol{\Gamma}_{l}\big{)}\rVert_{F}\\ &\leq m^{-\frac{L-l+1}{2}}\cdot\lVert\boldsymbol{h}_{l-1}^{(0)}\boldsymbol{\Theta}_{L}^{(0)}\rVert_{F}\cdot\lVert\prod_{q=l}^{L-1}\boldsymbol{\Gamma}_{q}\rVert_{2}\cdot\lVert\prod_{q=l+1}^{L-1}\boldsymbol{\Theta}_{q}^{(0)}\rVert_{2}\\ &\leq m^{-\frac{L-l+1}{2}}\cdot\lVert\boldsymbol{h}_{l-1}^{(0)}\rVert_{2}\cdot\lVert\boldsymbol{\Theta}_{L}^{(0)}\rVert_{2}\cdot\lVert\prod_{q=l}^{L-1}\boldsymbol{\Gamma}_{q}\rVert_{2}\cdot\lVert\prod_{q=l+1}^{L-1}\boldsymbol{\Theta}_{q}^{(0)}\rVert_{2}\\ &\leq m^{-\frac{L-l+1}{2}}\cdot\beta_{h}\beta_{3}\cdot\zeta^{L-l}\cdot(\beta_{L}\sqrt{m})^{L-l-1}\leq\beta_{h}\beta_{3}\cdot(\beta_{L}\zeta)^{L}/m.\end{split}

For 𝚯gnn\boldsymbol{\Theta}_{gnn}, we can also derive similar results. For 𝚯L\boldsymbol{\Theta}_{L},

𝚯Lf(𝑿~;𝚯(0))F=m0.5𝒉L1(0)2βh/m\begin{split}&\lVert\nabla_{\boldsymbol{\Theta}_{L}}f(\widetilde{\boldsymbol{X}};\boldsymbol{\Theta}^{(0)})\rVert_{F}=m^{-0.5}\cdot\lVert\boldsymbol{h}_{L-1}^{(0)}\rVert_{2}\leq\beta_{h}/\sqrt{m}\end{split}

Then, with l(j)=mLl+12(𝚯L(j))(q=l+1L1𝚪q(𝚯q(j)))𝚪l\nabla_{l}^{(j)}=m^{-\frac{L-l+1}{2}}\cdot(\boldsymbol{\Theta}_{L}^{(j)})^{\intercal}\cdot\big{(}\prod_{q=l+1}^{L-1}\boldsymbol{\Gamma}_{q}\cdot(\boldsymbol{\Theta}_{q}^{(j)})^{\intercal}\big{)}\cdot\boldsymbol{\Gamma}_{l}, we have the norm of gradient difference

𝚯lf(𝑿~;𝚯(j))𝚯lf(𝑿~;𝚯(0))F=𝒉l1(0)l(0)𝒉l1(j)l(j)F𝒉l1(0)l(0)𝒉l1(j)l(0)F+𝒉l1(j)l(0)𝒉l1(j)l(j)F𝒉l1(j)Fl(0)l(j)F+l(0)F𝒉l1(0)𝒉l1(j)F(βh+Λ(j))l(0)l(j)F+Λ(j)l(0)F.\begin{split}&\lVert\nabla_{\boldsymbol{\Theta}_{l}}f(\widetilde{\boldsymbol{X}};\boldsymbol{\Theta}^{(j)})-\nabla_{\boldsymbol{\Theta}_{l}}f(\widetilde{\boldsymbol{X}};\boldsymbol{\Theta}^{(0)})\rVert_{F}=\lVert\boldsymbol{h}_{l-1}^{(0)}\cdot\nabla_{l}^{(0)}-\boldsymbol{h}_{l-1}^{(j)}\cdot\nabla_{l}^{(j)}\rVert_{F}\\ &\leq\lVert\boldsymbol{h}_{l-1}^{(0)}\cdot\nabla_{l}^{(0)}-\boldsymbol{h}_{l-1}^{(j)}\cdot\nabla_{l}^{(0)}\rVert_{F}+\lVert\boldsymbol{h}_{l-1}^{(j)}\cdot\nabla_{l}^{(0)}-\boldsymbol{h}_{l-1}^{(j)}\cdot\nabla_{l}^{(j)}\rVert_{F}\\ &\leq\lVert\boldsymbol{h}_{l-1}^{(j)}\rVert_{F}\cdot\lVert\nabla_{l}^{(0)}-\nabla_{l}^{(j)}\rVert_{F}+\lVert\nabla_{l}^{(0)}\rVert_{F}\cdot\lVert\boldsymbol{h}_{l-1}^{(0)}-\boldsymbol{h}_{l-1}^{(j)}\rVert_{F}\\ &\leq\big{(}\beta_{h}+\Lambda^{(j)}\big{)}\cdot\lVert\nabla_{l}^{(0)}-\nabla_{l}^{(j)}\rVert_{F}+\Lambda^{(j)}\cdot\lVert\nabla_{l}^{(0)}\rVert_{F}.\end{split}

Here, for the difference of \nabla, we have

l(0)l(j)F=mLl+12𝚯L(0)(q=l+1L1𝚪q𝚯q(0))𝚪l𝚯L(j)(q=l+1L1𝚪q𝚯q(j))𝚪lF=m12l+1(0)𝚪l𝚯l+1(0)l+1(j)𝚪l𝚯l+1(j)Fζm(l+1(0)𝚯l+1(0)l+1(0)𝚯l+1(j)F+l+1(0)𝚯l+1(j)l+1(j)𝚯l+1(j)F)ζm(l+1(0)F𝚯l+1(0)𝚯l+1(j)F+𝚯l+1(j)Fl+1(0)l+1(j)F).\begin{split}&\lVert\nabla_{l}^{(0)}-\nabla_{l}^{(j)}\rVert_{F}\\ &=m^{-\frac{L-l+1}{2}}\lVert\boldsymbol{\Theta}_{L}^{(0)}\cdot\big{(}\prod_{q=l+1}^{L-1}\boldsymbol{\Gamma}_{q}\cdot\boldsymbol{\Theta}_{q}^{(0)}\big{)}\boldsymbol{\Gamma}_{l}-\boldsymbol{\Theta}_{L}^{(j)}\big{(}\prod_{q=l+1}^{L-1}\boldsymbol{\Gamma}_{q}\boldsymbol{\Theta}_{q}^{(j)}\big{)}\boldsymbol{\Gamma}_{l}\rVert_{F}\\ &=m^{-\frac{1}{2}}\cdot\lVert\nabla_{l+1}^{(0)}\boldsymbol{\Gamma}_{l}\cdot\boldsymbol{\Theta}_{l+1}^{(0)}-\nabla_{l+1}^{(j)}\boldsymbol{\Gamma}_{l}\cdot\boldsymbol{\Theta}_{l+1}^{(j)}\rVert_{F}\\ &\leq\frac{\zeta}{\sqrt{m}}\cdot(\lVert\nabla_{l+1}^{(0)}\cdot\boldsymbol{\Theta}_{l+1}^{(0)}-\nabla_{l+1}^{(0)}\cdot\boldsymbol{\Theta}_{l+1}^{(j)}\rVert_{F}+\lVert\nabla_{l+1}^{(0)}\cdot\boldsymbol{\Theta}_{l+1}^{(j)}-\nabla_{l+1}^{(j)}\cdot\boldsymbol{\Theta}_{l+1}^{(j)}\rVert_{F})\\ &\leq\frac{\zeta}{\sqrt{m}}\cdot(\lVert\nabla_{l+1}^{(0)}\rVert_{F}\lVert\cdot\boldsymbol{\Theta}_{l+1}^{(0)}-\boldsymbol{\Theta}_{l+1}^{(j)}\rVert_{F}+\lVert\boldsymbol{\Theta}_{l+1}^{(j)}\rVert_{F}\lVert\nabla_{l+1}^{(0)}-\nabla_{l+1}^{(j)}\rVert_{F}).\end{split}

To continue the proof, we need to bound the term l(0)F\lVert\nabla_{l}^{(0)}\rVert_{F} as

l(0)F=m0.5𝚪l𝚯l+1(0)l+1(0)FζβLl+1(0)F.\begin{split}\lVert\nabla_{l}^{(0)}\rVert_{F}=m^{-0.5}\lVert\boldsymbol{\Gamma}_{l}\boldsymbol{\Theta}_{l+1}^{(0)}\cdot\nabla_{l+1}^{(0)}\rVert_{F}\leq\zeta\beta_{L}\cdot\lVert\nabla_{l+1}^{(0)}\rVert_{F}.\end{split}

Since for l=L1l=L-1 we have

L1(0)Fζβ3m,\begin{split}\lVert\nabla_{L-1}^{(0)}\rVert_{F}\leq\frac{\zeta\beta_{3}}{m},\end{split}

we can derive

l(0)Fβ3m(ζβL)L1\begin{split}\lVert\nabla_{l}^{(0)}\rVert_{F}\leq\frac{\beta_{3}}{m}\cdot(\zeta\cdot\beta_{L})^{L}\leq 1\end{split}

with sufficiently large mm, and this bound also applies to gnn(0)F\lVert\nabla_{gnn}^{(0)}\rVert_{F}. For

L(0)L(j)F=m0.5𝒉L1(0)𝒉L1(j)FΛ(j)/m\begin{split}&\lVert\nabla_{L}^{(0)}-\nabla_{L}^{(j)}\rVert_{F}=m^{-0.5}\lVert\boldsymbol{h}_{L-1}^{(0)}-\boldsymbol{h}_{L-1}^{(j)}\rVert_{F}\leq\Lambda^{(j)}/\sqrt{m}\end{split}

Therefore, we have

l(0)l(j)FζΥm+ζ(βL+Υ/m)l+1(0)l+1(j)F.\begin{split}&\lVert\nabla_{l}^{(0)}-\nabla_{l}^{(j)}\rVert_{F}\leq\frac{\zeta\Upsilon}{m}+\zeta\cdot(\beta_{L}+\Upsilon/m)\lVert\nabla_{l+1}^{(0)}-\nabla_{l+1}^{(j)}\rVert_{F}.\end{split}

By following a similar approach as in Lemma A.3, we will have

l(0)l(j)FζΥm(2ζβL)L12ζβL1=Λ(j)βh.\begin{split}&\lVert\nabla_{l}^{(0)}-\nabla_{l}^{(j)}\rVert_{F}\leq\frac{\zeta\Upsilon}{m}\cdot\frac{(2\zeta\beta_{L})^{L}-1}{2\zeta\beta_{L}-1}=\frac{\Lambda^{(j)}}{\beta_{h}}.\end{split}

Therefore, we will have

𝚯lf(𝑿~;𝚯(j))𝚯lf(𝑿~;𝚯(0))F(βh+Λ(j))Λ(j)βh+Λ(j)Λ(j)βh(2βh+1)=Λ(j)(2+1βh)3Λ(j)\begin{split}\lVert\nabla_{\boldsymbol{\Theta}_{l}}f(\widetilde{\boldsymbol{X}};\boldsymbol{\Theta}^{(j)})&-\nabla_{\boldsymbol{\Theta}_{l}}f(\widetilde{\boldsymbol{X}};\boldsymbol{\Theta}^{(0)})\rVert_{F}\leq\big{(}\beta_{h}+\Lambda^{(j)}\big{)}\cdot\frac{\Lambda^{(j)}}{\beta_{h}}+\Lambda^{(j)}\\ &\leq\frac{\Lambda^{(j)}}{\beta_{h}}\cdot(2\beta_{h}+1)=\Lambda^{(j)}\cdot(2+\frac{1}{\beta_{h}})\leq 3\cdot\Lambda^{(j)}\end{split}

with sufficiently large mm. This bound can also be derived for 𝚯gnnf(𝑿~;𝚯(0))2\lVert\nabla_{\boldsymbol{\Theta}_{gnn}}f(\widetilde{\boldsymbol{X}};\boldsymbol{\Theta}^{(0)})\rVert_{2} with a similar procedure. For LL-th layer, we have

𝚯Lf(𝑿~;𝚯(j))𝚯Lf(𝑿~;𝚯(0))Fm0.5𝒉L1(0)𝒉L1(j)FΛ(j)/m,\begin{split}\lVert\nabla_{\boldsymbol{\Theta}_{L}}f(\widetilde{\boldsymbol{X}};\boldsymbol{\Theta}^{(j)})&-\nabla_{\boldsymbol{\Theta}_{L}}f(\widetilde{\boldsymbol{X}};\boldsymbol{\Theta}^{(0)})\rVert_{F}\leq m^{-0.5}\cdot\lVert\boldsymbol{h}_{L-1}^{(0)}-\boldsymbol{h}_{L-1}^{(j)}\rVert_{F}\\ &\leq\Lambda^{(j)}/\sqrt{m},\end{split}

which completes the proof.

Lemma A.5.

With the probability of at least 1O(TL)eΩ(m)1-O(TL)\cdot e^{-\Omega(m)}, we have the gradient for all the network as

g(𝑿~;𝚯(0))2m1βhLβ32(βLζ)2L+m,g(𝑿~;𝚯(j))2Λ(j)9L+m1+m1βhLβ32(βLζ)2L+mg(𝑿~;𝚯(0))g(𝑿~;𝚯(j))2Λ(j)9L+m1.\begin{split}&\lVert g(\widetilde{\boldsymbol{X}};\boldsymbol{\Theta}^{(0)})\rVert_{2}\leq m^{-1}\beta_{h}\cdot\sqrt{L\cdot\beta_{3}^{2}\cdot(\beta_{L}\zeta)^{2L}+m},\\ &\lVert g(\widetilde{\boldsymbol{X}};\boldsymbol{\Theta}^{(j)})\rVert_{2}\leq\Lambda^{(j)}\cdot\sqrt{9L+m^{-1}}+m^{-1}\beta_{h}\cdot\sqrt{L\cdot\beta_{3}^{2}\cdot(\beta_{L}\zeta)^{2L}+m}\\ &\lVert g(\widetilde{\boldsymbol{X}};\boldsymbol{\Theta}^{(0)})-g(\widetilde{\boldsymbol{X}};\boldsymbol{\Theta}^{(j)})\rVert_{2}\leq\Lambda^{(j)}\cdot\sqrt{9L+m^{-1}}.\end{split}

Proof. First, for the gradient before GD, we have

g(𝑿~;𝚯(0))2=𝚯gnnf(𝑿~;𝚯(0))22+l=1L𝚯lf(𝑿~;𝚯(0))22m1βhLβ32(βLζ)2L+m.\begin{split}&\lVert g(\widetilde{\boldsymbol{X}};\boldsymbol{\Theta}^{(0)})\rVert_{2}=\sqrt{\lVert\nabla_{\boldsymbol{\Theta}_{gnn}}f(\widetilde{\boldsymbol{X}};\boldsymbol{\Theta}^{(0)})\rVert_{2}^{2}+\sum_{l=1}^{L}\lVert\nabla_{\boldsymbol{\Theta}_{l}}f(\widetilde{\boldsymbol{X}};\boldsymbol{\Theta}^{(0)})\rVert_{2}^{2}}\\ &\leq m^{-1}\beta_{h}\cdot\sqrt{L\cdot\beta_{3}^{2}\cdot(\beta_{L}\zeta)^{2L}+m}.\end{split}

Then, for the norm of gradients, 𝚯{𝚯gnn,𝚯1,,𝚯L1}\boldsymbol{\Theta}\in\{\boldsymbol{\Theta}_{gnn},\boldsymbol{\Theta}_{1},\dots,\boldsymbol{\Theta}_{L-1}\}, we have

g(𝑿~;𝚯(0))g(𝑿~;𝚯(j))2=𝚯𝚯f(𝑿~;𝚯(0))𝚯f(𝑿~;𝚯(j))229L(Λ(j))2+(Λ(j))2/m=Λ(j)9L+m1.\begin{split}&\lVert g(\widetilde{\boldsymbol{X}};\boldsymbol{\Theta}^{(0)})-g(\widetilde{\boldsymbol{X}};\boldsymbol{\Theta}^{(j)})\rVert_{2}\\ &=\sqrt{\sum_{\boldsymbol{\Theta}}\lVert\nabla_{\boldsymbol{\Theta}}f(\widetilde{\boldsymbol{X}};\boldsymbol{\Theta}^{(0)})-\nabla_{\boldsymbol{\Theta}}f(\widetilde{\boldsymbol{X}};\boldsymbol{\Theta}^{(j)})\rVert_{2}^{2}}\\ &\leq\sqrt{9L\cdot(\Lambda^{(j)})^{2}+(\Lambda^{(j)})^{2}/m}=\Lambda^{(j)}\cdot\sqrt{9L+m^{-1}}.\end{split}

Then, for the network gradient after GD, we have

g(𝑿~;𝚯(j))2g(𝑿~;𝚯(0))g(𝑿~;𝚯(j))2+g(𝑿~;𝚯(0))2Λ(j)9L+m1+m1βhLβ32(βLζ)2L+m\begin{split}&\lVert g(\widetilde{\boldsymbol{X}};\boldsymbol{\Theta}^{(j)})\rVert_{2}\leq\lVert g(\widetilde{\boldsymbol{X}};\boldsymbol{\Theta}^{(0)})-g(\widetilde{\boldsymbol{X}};\boldsymbol{\Theta}^{(j)})\rVert_{2}+\lVert g(\widetilde{\boldsymbol{X}};\boldsymbol{\Theta}^{(0)})\rVert_{2}\\ &\leq\Lambda^{(j)}\cdot\sqrt{9L+m^{-1}}+m^{-1}\beta_{h}\cdot\sqrt{L\cdot\beta_{3}^{2}\cdot(\beta_{L}\zeta)^{2L}+m}\end{split}
Lemma A.6.

With the probability of at least 1O(TL)eΩ(m)1-O(TL)\cdot e^{-\Omega(m)}, for the initialized parameter 𝚯(0)\boldsymbol{\Theta}^{(0)}, we have

|f(𝑿~;𝚯(j))g(𝑿~;𝚯(0)),𝚯(j)𝚯(0)|m0.5(Λ(j)(1+β3)+β3βh+LβhΥ),\begin{split}&\lvert f(\widetilde{\boldsymbol{X}};\boldsymbol{\Theta}^{(j)})-\langle g(\widetilde{\boldsymbol{X}};\boldsymbol{\Theta}^{(0)}),\boldsymbol{\Theta}^{(j)}-\boldsymbol{\Theta}^{(0)}\rangle\rvert\\ &\qquad\qquad\leq m^{-0.5}\cdot\big{(}\Lambda^{(j)}(1+\beta_{3})+\beta_{3}\beta_{h}+L\cdot\beta_{h}\Upsilon\big{)},\end{split}

and for the network parameter after GD, 𝚯(j)\boldsymbol{\Theta}^{(j)}, we have

|f(𝑿~;𝚯(j))g(𝑿~;𝚯(j)),𝚯(j)𝚯(0)|B3=m0.5(β3(Λ(j)+βh)+LΥ(Λ(j)+βh)(Λ(j)/βh+1)).\begin{split}\lvert f&(\widetilde{\boldsymbol{X}};\boldsymbol{\Theta}^{(j)})-\langle g(\widetilde{\boldsymbol{X}};\boldsymbol{\Theta}^{(j)}),\boldsymbol{\Theta}^{(j)}-\boldsymbol{\Theta}^{(0)}\rangle\rvert\leq B_{3}\\ &=m^{-0.5}\big{(}\beta_{3}(\Lambda^{(j)}+\beta_{h})+L\cdot\Upsilon\cdot(\Lambda^{(j)}+\beta_{h})(\Lambda^{(j)}/\beta_{h}+1)\big{)}.\end{split}

Proof. For the sake of enumeration, we let 𝚯0=𝚯gnn,0=gnn,𝒉0=𝒉gnn\boldsymbol{\Theta}_{0}=\boldsymbol{\Theta}_{gnn},\nabla_{0}=\nabla_{gnn},\boldsymbol{h}_{0}=\boldsymbol{h}_{gnn} and 𝒉1=𝒉S\boldsymbol{h}_{-1}=\boldsymbol{h}_{S}. Then, we can derive

|f(𝑿~;𝚯(j))g(𝑿~;𝚯(0)),𝚯(j)𝚯(0)|=|1m𝒉L1(j),𝚯L(j)1m𝒉L1(0),𝚯L(0)𝚯L(j)l=0L1(𝒉l1(0))(𝚯l(0)𝚯l(j))l(0))|m0.5𝒉L(j)𝒉L(0)2𝚯L(j)2+m0.5𝒉L1(0)2𝚯L(0)2+l=0L1𝒉l1(0)2𝚯l(0)𝚯l(j)Fl(0)Fm0.5Λ(j)(Υ/m+β3)+m0.5β3βh+LβhΥmm0.5(Λ(j)(1+β3)+β3βh+LβhΥ).\begin{split}&\lvert f(\widetilde{\boldsymbol{X}};\boldsymbol{\Theta}^{(j)})-\langle g(\widetilde{\boldsymbol{X}};\boldsymbol{\Theta}^{(0)}),\boldsymbol{\Theta}^{(j)}-\boldsymbol{\Theta}^{(0)}\rangle\rvert=\lvert\frac{1}{\sqrt{m}}\langle\boldsymbol{h}_{L-1}^{(j)},\boldsymbol{\Theta}_{L}^{(j)}\rangle\\ &\qquad-\frac{1}{\sqrt{m}}\langle\boldsymbol{h}_{L-1}^{(0)},\boldsymbol{\Theta}_{L}^{(0)}-\boldsymbol{\Theta}_{L}^{(j)}\rangle-\sum_{l=0}^{L-1}(\boldsymbol{h}_{l-1}^{(0)})^{\intercal}(\boldsymbol{\Theta}_{l}^{(0)}-\boldsymbol{\Theta}_{l}^{(j)})\nabla_{l}^{(0)})\rvert\\ &\leq m^{-0.5}\lVert\boldsymbol{h}_{L}^{(j)}-\boldsymbol{h}_{L}^{(0)}\rVert_{2}\lVert\boldsymbol{\Theta}_{L}^{(j)}\rVert_{2}+m^{-0.5}\lVert\boldsymbol{h}_{L-1}^{(0)}\rVert_{2}\lVert\boldsymbol{\Theta}_{L}^{(0)}\rVert_{2}\\ &\qquad\qquad+\sum_{l=0}^{L-1}\lVert\boldsymbol{h}_{l-1}^{(0)}\rVert_{2}\lVert\boldsymbol{\Theta}_{l}^{(0)}-\boldsymbol{\Theta}_{l}^{(j)}\rVert_{F}\lVert\nabla_{l}^{(0)}\rVert_{F}\\ &\leq m^{-0.5}\Lambda^{(j)}(\Upsilon/\sqrt{m}+\beta_{3})+m^{-0.5}\beta_{3}\beta_{h}+L\cdot\beta_{h}\frac{\Upsilon}{\sqrt{m}}\\ &\leq m^{-0.5}\cdot\big{(}\Lambda^{(j)}(1+\beta_{3})+\beta_{3}\beta_{h}+L\cdot\beta_{h}\Upsilon\big{)}.\end{split}

On the other hand, for network parameter after GD, we can have

|f(𝑿~;𝚯(j))g(𝑿~;𝚯(j)),𝚯(j)𝚯(0)|=|1m𝒉L1(j),𝚯L(j)1m𝒉L1(j),𝚯L(j)𝚯L(0)l=0L1(𝒉l1(j))(𝚯l(0)𝚯l(j))l(j))||m0.5𝒉L1(j),𝚯L(0)l=0L1(𝒉l1(j))(𝚯l(0)𝚯l(j))l(j)|m0.5𝒉L1(j)2𝚯L(0)2+l=0L1𝒉l1(j)2𝚯l(0)𝚯l(j)Fl(j)Fm0.5β3(Λ(j)+βh)+L(Λ(j)+βh)(Υ/m)(Λ(j)/βh+1)m0.5(β3(Λ(j)+βh)+LΥ(Λ(j)+βh)(Λ(j)/βh+1)).\begin{split}&\lvert f(\widetilde{\boldsymbol{X}};\boldsymbol{\Theta}^{(j)})-\langle g(\widetilde{\boldsymbol{X}};\boldsymbol{\Theta}^{(j)}),\boldsymbol{\Theta}^{(j)}-\boldsymbol{\Theta}^{(0)}\rangle\rvert=\lvert\frac{1}{\sqrt{m}}\langle\boldsymbol{h}_{L-1}^{(j)},\boldsymbol{\Theta}_{L}^{(j)}\rangle\\ &\qquad-\frac{1}{\sqrt{m}}\langle\boldsymbol{h}_{L-1}^{(j)},\boldsymbol{\Theta}_{L}^{(j)}-\boldsymbol{\Theta}_{L}^{(0)}\rangle-\sum_{l=0}^{L-1}(\boldsymbol{h}_{l-1}^{(j)})^{\intercal}(\boldsymbol{\Theta}_{l}^{(0)}-\boldsymbol{\Theta}_{l}^{(j)})\nabla_{l}^{(j)})\rvert\\ &\leq\lvert m^{-0.5}\langle\boldsymbol{h}_{L-1}^{(j)},\boldsymbol{\Theta}_{L}^{(0)}\rangle-\sum_{l=0}^{L-1}(\boldsymbol{h}_{l-1}^{(j)})^{\intercal}(\boldsymbol{\Theta}_{l}^{(0)}-\boldsymbol{\Theta}_{l}^{(j)})\nabla_{l}^{(j)}\rvert\\ &\leq m^{-0.5}\lVert\boldsymbol{h}_{L-1}^{(j)}\rVert_{2}\lVert\boldsymbol{\Theta}_{L}^{(0)}\rVert_{2}+\sum_{l=0}^{L-1}\lVert\boldsymbol{h}_{l-1}^{(j)}\rVert_{2}\lVert\boldsymbol{\Theta}_{l}^{(0)}-\boldsymbol{\Theta}_{l}^{(j)}\rVert_{F}\lVert\nabla_{l}^{(j)}\rVert_{F}\\ &\leq m^{-0.5}\beta_{3}(\Lambda^{(j)}+\beta_{h})+L\cdot(\Lambda^{(j)}+\beta_{h})(\Upsilon/\sqrt{m})(\Lambda^{(j)}/\beta_{h}+1)\\ &\leq m^{-0.5}\big{(}\beta_{3}(\Lambda^{(j)}+\beta_{h})+L\cdot\Upsilon\cdot(\Lambda^{(j)}+\beta_{h})(\Lambda^{(j)}/\beta_{h}+1)\big{)}.\end{split}

This completes the proof.

Proof sketch for Lemmas A.1-A.6. First we derive the conclusions in Lemma A.1 with the property of Gaussian matrices. Then, Lemmas A.2 and A.3 are proved through the induction after breaking the target into norms of individual terms (variables, weight matrices) and applying Lemma A.1. Finally, for Lemmas A.4-A.6, we also decompose targets into norms of individual terms. Then, applying Lemmas A.1-A.3 the to bound these terms (at random initialization / after GD) would give the result. \blacksquare

Appendix B Lemmas for Gradient Matrices

Inspired by (Zhou et al., 2020; Ban and He, 2021a) and with sufficiently large network width mm, the trained network parameter can be related to ridge regression estimator where the context is embedded by network gradients. With the received contexts and rewards up to time step tt, we have the estimated parameter 𝚯^\widehat{\boldsymbol{\Theta}} as 𝚯^0=(𝒁0)1𝒃0\widehat{\boldsymbol{\Theta}}_{0}=(\boldsymbol{Z}_{0})^{-1}\cdot\boldsymbol{b}_{0} where 𝒁0=λ𝑰+1mτ=1tg(𝑿~τ;𝚯0)g(𝑿~τ;𝚯0),𝒃0=1mτ=1trτg(𝑿~τ;𝚯0).\boldsymbol{Z}_{0}=\lambda\boldsymbol{I}+\frac{1}{m}\sum_{\tau=1}^{t}g(\widetilde{\boldsymbol{X}}_{\tau};\boldsymbol{\Theta}_{0})g(\widetilde{\boldsymbol{X}}_{\tau};\boldsymbol{\Theta}_{0})^{\intercal},\boldsymbol{b}_{0}=\frac{1}{\sqrt{m}}\sum_{\tau=1}^{t}r_{\tau}\cdot g(\widetilde{\boldsymbol{X}}_{\tau};\boldsymbol{\Theta}_{0}). We also define the gradient matrix w.r.t. the network parameters as

𝑮(j)=(g(𝑿~1;𝚯(j)),,g(𝑿~t;𝚯(j)))𝒇(j)=(f(𝑿~1;𝚯(j)),,f(𝑿~t;𝚯(j))),𝒓=(r1,,rt)𝚯(j+1)=𝚯(j)η((𝑮(j))(𝒇(j)𝒓)).\begin{split}&\boldsymbol{G}^{(j)}=\big{(}g(\widetilde{\boldsymbol{X}}_{1};\boldsymbol{\Theta}^{(j)}),\dots,g(\widetilde{\boldsymbol{X}}_{t};\boldsymbol{\Theta}^{(j)})\big{)}\\ &\boldsymbol{f}^{(j)}=\big{(}f(\widetilde{\boldsymbol{X}}_{1};\boldsymbol{\Theta}^{(j)}),\dots,f(\widetilde{\boldsymbol{X}}_{t};\boldsymbol{\Theta}^{(j)})\big{)},\quad\boldsymbol{r}=\big{(}r_{1},\dots,r_{t}\big{)}\\ &\boldsymbol{\Theta}^{(j+1)}=\boldsymbol{\Theta}^{(j)}-\eta\cdot\big{(}(\boldsymbol{G}^{(j)})^{\intercal}(\boldsymbol{f}^{(j)}-\boldsymbol{r})\big{)}.\end{split}

where the tt notation is omitted by default. Then, we use the following Lemma to bound the above matrices.

Lemma B.1.

After jj iterations, with the probability of at least 1O(L)eΩ(m)1-O(L)\cdot e^{-\Omega(m)}, we have

𝑮(0)FG1=m1βht(Lβ32(βLζ)2L+m),𝑮(0)𝑮(j)FΛ(j)t(9L+m1),𝑮(j)FI~1=t(Lβ32(βLζ)2L+m)+Λ(j)t(9L+m1)𝒇(j)(𝑮(j))(𝚯^(j)𝚯^(0))2tB3=tm0.5(β3(Λ(j)+βh)+LΥ(Λ(j)+βh)(Λ(j)/βh+1))\begin{split}&\lVert\boldsymbol{G}^{(0)}\rVert_{F}\leq G_{1}=m^{-1}\beta_{h}\cdot\sqrt{t\cdot(L\cdot\beta_{3}^{2}\cdot(\beta_{L}\zeta)^{2L}+m)},\\ &\lVert\boldsymbol{G}^{(0)}-\boldsymbol{G}^{(j)}\rVert_{F}\leq\Lambda^{(j)}\cdot\sqrt{t\cdot(9L+m^{-1})},\\ &\lVert\boldsymbol{G}^{(j)}\rVert_{F}\leq\widetilde{I}_{1}=\sqrt{t\cdot(L\cdot\beta_{3}^{2}\cdot(\beta_{L}\zeta)^{2L}+m)}+\Lambda^{(j)}\sqrt{t\cdot(9L+m^{-1})}\\ &\lVert\boldsymbol{f}^{(j)}-(\boldsymbol{G}^{(j)})^{\intercal}(\widehat{\boldsymbol{\Theta}}^{(j)}-\widehat{\boldsymbol{\Theta}}^{(0)})\rVert_{2}\leq\sqrt{t}\cdot B_{3}\\ &~{}~{}=\sqrt{t}\cdot m^{-0.5}\big{(}\beta_{3}(\Lambda^{(j)}+\beta_{h})+L\cdot\Upsilon\cdot(\Lambda^{(j)}+\beta_{h})(\Lambda^{(j)}/\beta_{h}+1)\big{)}\end{split}

Proof. For the gradient matrix after random initialization, we have

𝑮(0)F=τ=1tg(𝑿~τ;𝚯(0))22m1βhtLβ32(βLζ)2L+m\begin{split}&\lVert\boldsymbol{G}^{(0)}\rVert_{F}=\sqrt{\sum_{\tau=1}^{t}\lVert g(\widetilde{\boldsymbol{X}}_{\tau};\boldsymbol{\Theta}^{(0)})\rVert_{2}^{2}}\leq m^{-1}\beta_{h}\cdot\sqrt{t\cdot L\cdot\beta_{3}^{2}\cdot(\beta_{L}\zeta)^{2L}+m}\end{split}

with the conclusion from Lemma A.5. Then,

𝑮(0)𝑮(j)F=τ=1tg(𝑿~τ;𝚯(0))g(𝑿~τ;𝚯(j))22Λ(j)t(9L+m1).\begin{split}&\lVert\boldsymbol{G}^{(0)}-\boldsymbol{G}^{(j)}\rVert_{F}=\sqrt{\sum_{\tau=1}^{t}\lVert g(\widetilde{\boldsymbol{X}}_{\tau};\boldsymbol{\Theta}^{(0)})-g(\widetilde{\boldsymbol{X}}_{\tau};\boldsymbol{\Theta}^{(j)})\rVert_{2}^{2}}\\ &\qquad\qquad\leq\Lambda^{(j)}\cdot\sqrt{t\cdot(9L+m^{-1})}.\end{split}

For the third inequality in this Lemma, we have

𝒇(j)(𝑮(j))(𝚯^(j)𝚯^(0))2=τ=1t|f(𝑿~τ;𝚯(j))g(𝑿~τ;𝚯(j)),𝚯(j)𝚯(0))|2tm0.5(β3(Λ(j)+βh)+LΥ(Λ(j)+βh)(Λ(j)/βh+1))\begin{split}&\lVert\boldsymbol{f}^{(j)}-(\boldsymbol{G}^{(j)})^{\intercal}(\widehat{\boldsymbol{\Theta}}^{(j)}-\widehat{\boldsymbol{\Theta}}^{(0)})\rVert_{2}\\ &=\sqrt{\sum_{\tau=1}^{t}\lvert f(\widetilde{\boldsymbol{X}}_{\tau};\boldsymbol{\Theta}^{(j)})-\langle g(\widetilde{\boldsymbol{X}}_{\tau};\boldsymbol{\Theta}^{(j)}),\boldsymbol{\Theta}^{(j)}-\boldsymbol{\Theta}^{(0)}\rangle)\rvert^{2}}\\ &\leq\sqrt{t}\cdot m^{-0.5}\big{(}\beta_{3}(\Lambda^{(j)}+\beta_{h})+L\cdot\Upsilon\cdot(\Lambda^{(j)}+\beta_{h})(\Lambda^{(j)}/\beta_{h}+1)\big{)}\end{split}

based on Lemma A.6.

Analogous to (Ban and He, 2021a; Zhou et al., 2020), we define another auxiliary sequence to bound the parameter difference. With 𝚯~(0)=𝚯(0)\widetilde{\boldsymbol{\Theta}}^{(0)}=\boldsymbol{\Theta}^{(0)}, we have 𝚯~(j+1)=𝚯~(j)η(𝑮(j)((𝑮(j))(𝚯~(j)𝚯~(0))𝒓)+mλ(𝚯~(j)𝚯~(0)))\widetilde{\boldsymbol{\Theta}}^{(j+1)}=\\ \widetilde{\boldsymbol{\Theta}}^{(j)}-\eta\cdot\bigg{(}\boldsymbol{G}^{(j)}\big{(}(\boldsymbol{G}^{(j)})^{\intercal}(\widetilde{\boldsymbol{\Theta}}^{(j)}-\widetilde{\boldsymbol{\Theta}}^{(0)})-\boldsymbol{r}\big{)}+m\lambda(\widetilde{\boldsymbol{\Theta}}^{(j)}-\widetilde{\boldsymbol{\Theta}}^{(0)})\bigg{)}.

Lemma B.2.

After jj iterations, with the probability of at least 1O(L)eΩ(m)1-O(L)\cdot e^{-\Omega(m)}, we have

𝚯~(j)𝚯(0)𝚯^t/m2t/(mλ)\begin{split}&\lVert\widetilde{\boldsymbol{\Theta}}^{(j)}-\boldsymbol{\Theta}^{(0)}-\widehat{\boldsymbol{\Theta}}_{t}/\sqrt{m}\rVert_{2}\leq\sqrt{t/(m\lambda)}\end{split}

Proof. The proof is analogous to Lemma 10.2 in (Ban and He, 2021a) and Lemma C.4 in (Zhou et al., 2020). Switching 𝑮0\boldsymbol{G}_{0} to 𝑮j\boldsymbol{G}_{j} would give the result. \blacksquare

Then, we can have the following lemma to bridge the difference between the regression estimator 𝚯^\widehat{\boldsymbol{\Theta}} and the network parameter 𝚯\boldsymbol{\Theta}.

Lemma B.3.

At this time step tt, with the notation defined in Lemma 5.3 and the probability at least 1O(L)eΩ(m)1-O(L)\cdot e^{-\Omega(m)}, we will have

𝚯t𝚯0𝚯^t/m2(I~1tB3+mI~2)/(mλ)+t/(mλ)\begin{split}&\lVert\boldsymbol{\Theta}_{t}-\boldsymbol{\Theta}_{0}-\widehat{\boldsymbol{\Theta}}_{t}/\sqrt{m}\rVert_{2}\leq(\widetilde{I}_{1}\cdot\sqrt{t}B_{3}+m\cdot\widetilde{I}_{2})/(m\lambda)+\sqrt{t/(m\lambda)}\end{split}

with proper m,ηm,\eta as in Lemma 5.3

Proof. With an analogous approach from Lemma 6.2 in (Ban and He, 2021a), we can have

𝚯~(j+1)𝚯(j+1)2η𝑮(j)2𝒇(j)(𝑮(j))(𝚯(j)𝚯(0))2+ηmλ𝚯(0)𝚯(j)2+𝑰η(mλ𝑰+𝑮(j)(𝑮(j)))2𝚯(j)𝚯~(j)2=I1+I2+I3.\begin{split}&\lVert\widetilde{\boldsymbol{\Theta}}^{(j+1)}-\boldsymbol{\Theta}^{(j+1)}\rVert_{2}\\ &\leq\eta\lVert\boldsymbol{G}^{(j)}\rVert_{2}\lVert\boldsymbol{f}^{(j)}-(\boldsymbol{G}^{(j)})^{\intercal}(\boldsymbol{\Theta}^{(j)}-\boldsymbol{\Theta}^{(0)})\rVert_{2}+\eta m\lambda\lVert\boldsymbol{\Theta}^{(0)}-\boldsymbol{\Theta}^{(j)}\rVert_{2}\\ &+\lVert\boldsymbol{I}-\eta\cdot(m\lambda\boldsymbol{I}+\boldsymbol{G}^{(j)}(\boldsymbol{G}^{(j)})^{\intercal})\rVert_{2}\lVert\boldsymbol{\Theta}^{(j)}-\widetilde{\boldsymbol{\Theta}}^{(j)}\rVert_{2}=I_{1}+I_{2}+I_{3}.\end{split}

With Lemma B.1, we can bound them as

I1ηI~1tB3I2ηmλi=0L𝚯l(0)𝚯l(j)F2ηmI~2=ηmλL+1Υ/m.\begin{split}&I_{1}\leq\eta\cdot\widetilde{I}_{1}\cdot\sqrt{t}B_{3}\\ &I_{2}\leq\eta m\lambda\sqrt{\sum_{i=0}^{L}\lVert\boldsymbol{\Theta}_{l}^{(0)}-\boldsymbol{\Theta}_{l}^{(j)}\rVert_{F}^{2}}\leq\eta m\cdot\widetilde{I}_{2}=\eta m\lambda\sqrt{L+1}\cdot\Upsilon/\sqrt{m}.\end{split}

based on the conclusion from Lemma A.2. For I3I_{3}, we have

η(mλ𝑰+𝑮(0)(𝑮(0)))η𝑰(mλ+(m1βht(Lβ32(βLζ)2L+m))2)𝑰\begin{split}&\eta\cdot(m\lambda\boldsymbol{I}+\boldsymbol{G}^{(0)}(\boldsymbol{G}^{(0)})^{\intercal})\preceq\eta\cdot\boldsymbol{I}\\ &\qquad\qquad\cdot\big{(}m\lambda+(m^{-1}\beta_{h}\cdot\sqrt{t\cdot(L\cdot\beta_{3}^{2}\cdot(\beta_{L}\zeta)^{2L}+m)})^{2}\big{)}\preceq\boldsymbol{I}\end{split}

with proper choice of mm and η\eta. It leads to

𝚯~(j+1)𝚯(j+1)2(1ηmλ)𝚯~(j)𝚯(j)2+I~1tB3+ηmI~2\begin{split}&\lVert\widetilde{\boldsymbol{\Theta}}^{(j+1)}-\boldsymbol{\Theta}^{(j+1)}\rVert_{2}\leq(1-\eta m\lambda)\lVert\widetilde{\boldsymbol{\Theta}}^{(j)}-\boldsymbol{\Theta}^{(j)}\rVert_{2}+\widetilde{I}_{1}\cdot\sqrt{t}B_{3}+\eta m\cdot\widetilde{I}_{2}\end{split}

which by induction and 𝚯~(0)=𝚯(0)\widetilde{\boldsymbol{\Theta}}^{(0)}=\boldsymbol{\Theta}^{(0)}, we have

𝚯~(j)𝚯(j)2(I~1tB3+mI~2)/(mλ).\begin{split}&\lVert\widetilde{\boldsymbol{\Theta}}^{(j)}-\boldsymbol{\Theta}^{(j)}\rVert_{2}\leq(\widetilde{I}_{1}\cdot\sqrt{t}B_{3}+m\cdot\widetilde{I}_{2})/(m\lambda).\end{split}

Finally,

𝚯t𝚯0𝚯^t/m2𝚯~(j)𝚯(j)2+𝚯~t𝚯0𝚯^0/m2(I~1tB3+mI~2)/(mλ)+t/(mλ),\begin{split}&\lVert\boldsymbol{\Theta}_{t}-\boldsymbol{\Theta}_{0}-\widehat{\boldsymbol{\Theta}}_{t}/\sqrt{m}\rVert_{2}\leq\lVert\widetilde{\boldsymbol{\Theta}}^{(j)}-\boldsymbol{\Theta}^{(j)}\rVert_{2}+\lVert\widetilde{\boldsymbol{\Theta}}_{t}-\boldsymbol{\Theta}_{0}-\widehat{\boldsymbol{\Theta}}_{0}/\sqrt{m}\rVert_{2}\\ &\leq(\widetilde{I}_{1}\cdot\sqrt{t}B_{3}+m\cdot\widetilde{I}_{2})/(m\lambda)+\sqrt{t/(m\lambda)},\end{split}

which completes the proof.

Lemma B.4.

At this time step tt, with the probability at least 1O(L)eΩ(m)1-O(L)\cdot e^{-\Omega(m)}, we will have

𝒁t2λ+t(L+1)m(Λ(j)9L+m1+m1βhLβ32(βLζ)2L+m)2,𝑮t𝑮t𝑮0𝑮0F2tm1(Λ(j)9L+m1)(Λ(j)9L+m1+m1βhLβ32(βLζ)2L+m)=BG/m\begin{split}&\lVert\boldsymbol{Z}_{t}\rVert_{2}\leq\lambda+\\ &\quad\frac{t(L+1)}{m}\big{(}\Lambda^{(j)}\cdot\sqrt{9L+m^{-1}}+m^{-1}\beta_{h}\cdot\sqrt{L\cdot\beta_{3}^{2}\cdot(\beta_{L}\zeta)^{2L}+m}\big{)}^{2},\\ &\lVert\boldsymbol{G}_{t}^{\intercal}\boldsymbol{G}_{t}-\boldsymbol{G}_{0}^{\intercal}\boldsymbol{G}_{0}\rVert_{F}\leq 2t\cdot m^{-1}(\Lambda^{(j)}\cdot\sqrt{9L+m^{-1}})\\ &\quad\cdot\big{(}\Lambda^{(j)}\sqrt{9L+m^{-1}}+m^{-1}\beta_{h}\cdot\sqrt{L\cdot\beta_{3}^{2}\cdot(\beta_{L}\zeta)^{2L}+m}\big{)}=B_{G}/m\end{split}

with proper m,ηm,\eta as in Lemma 5.3.

Proof. For the gradient matrix of ridge regression, we have

𝒁t2λ+m1τ=1tg(𝑿~τ;𝚯t)22λ+t(L+1)m(Λ(j)9L+m1+m1βhLβ32(βLζ)2L+m)2\begin{split}&\lVert\boldsymbol{Z}_{t}\rVert_{2}\leq\lambda+m^{-1}\sum_{\tau=1}^{t}\lVert g(\widetilde{\boldsymbol{X}}_{\tau};\boldsymbol{\Theta}_{t})\rVert_{2}^{2}\leq\lambda+\\ &\quad\frac{t(L+1)}{m}\big{(}\Lambda^{(j)}\cdot\sqrt{9L+m^{-1}}+m^{-1}\beta_{h}\cdot\sqrt{L\cdot\beta_{3}^{2}\cdot(\beta_{L}\zeta)^{2L}+m}\big{)}^{2}\end{split}

with the results from Lemma A.5. Then,

𝑮t𝑮t𝑮0𝑮0Fm1i,j=1tg(𝑿~i;𝚯t)+g(𝑿~j;𝚯0)22+g(𝑿~i;𝚯t)g(𝑿~j;𝚯0)222tm1(Λ(j)9L+m1+m1βhLβ32(βLζ)2L+m)(Λ(j)9L+m1)=BG/m.\begin{split}&\lVert\boldsymbol{G}_{t}^{\intercal}\boldsymbol{G}_{t}-\boldsymbol{G}_{0}^{\intercal}\boldsymbol{G}_{0}\rVert_{F}\leq m^{-1}\cdot\\ &\sqrt{\sum_{i,j=1}^{t}\lVert g(\widetilde{\boldsymbol{X}}_{i};\boldsymbol{\Theta}_{t})+g(\widetilde{\boldsymbol{X}}_{j};\boldsymbol{\Theta}_{0})\rVert_{2}^{2}+\lVert g(\widetilde{\boldsymbol{X}}_{i};\boldsymbol{\Theta}_{t})-g(\widetilde{\boldsymbol{X}}_{j};\boldsymbol{\Theta}_{0})\rVert_{2}^{2}}\\ &\leq 2t\cdot m^{-1}\cdot\big{(}\Lambda^{(j)}\sqrt{9L+m^{-1}}+m^{-1}\beta_{h}\cdot\sqrt{L\cdot\beta_{3}^{2}\cdot(\beta_{L}\zeta)^{2L}+m}\big{)}\\ &\qquad(\Lambda^{(j)}\cdot\sqrt{9L+m^{-1}})=B_{G}/m.\end{split}

The proof is then completed.

Proof sketch for Lemmas B.1-B.4. Analogous to lemmas in Section A, Lemma B.1 is proved by Lemmas A.5, A.6 by breaking the target into the product of norms. The proof of Lemma B.2 is analogous to Lemma 10.2 in (Ban and He, 2021a) and Lemma C.4 in (Zhou et al., 2020), then replacing 𝑮0\boldsymbol{G}_{0} with 𝑮j\boldsymbol{G}_{j} would give the result. Then, based on Lemma B.2 results, Lemma B.3 will be proved with after bounding 𝚯~(j+1)𝚯(j+1)2\lVert\widetilde{\boldsymbol{\Theta}}^{(j+1)}-\boldsymbol{\Theta}^{(j+1)}\rVert_{2} by induction. Finally, Lemma B.4 is proved by decomposing the norm into sum of individual terms, and bounding these terms with bounds on gradients in Lemma A.5. \blacksquare

Appendix C Lemmas for Model Convergence

Lemma C.1.

After TT time steps, assume the model with width mm defined in Lemma 5.3 are trained with the JJ-iterations GD on the past contexts and rewards. Then, there exists a constant βF\beta_{F}, such that βFη<1\beta_{F}\cdot\eta<1, for any j[J]j\in[J]:

𝑽(j)214ηβF𝑭T(j)𝒀T2\begin{split}\lVert\boldsymbol{V}^{(j)}\rVert_{2}\leq\frac{1}{4}\eta\beta_{F}\cdot\lVert\boldsymbol{F}^{(j)}_{T}-\boldsymbol{Y}_{T}\rVert_{2}\end{split}

where 𝐅(j)=[f(𝒢T,𝐗~τ;𝚯(j))]τ=1T\boldsymbol{F}^{(j)}=[f(\mathcal{G}_{T},\widetilde{\boldsymbol{X}}_{\tau};\boldsymbol{\Theta}^{(j)})]_{\tau=1}^{T}, and 𝐘T=[rτ]τ=1T\boldsymbol{Y}_{T}=[r_{\tau}]_{\tau=1}^{T}.

Proof. We prove this lemma following an analogous approach as Lemma B.6 in (Du et al., 2019b). Given 𝑿~\widetilde{\boldsymbol{X}}, we denote (𝚯(j))=(𝚯(j))𝚯\nabla\mathcal{L}({\boldsymbol{\Theta}}^{(j)})=\frac{\partial~{}\mathcal{L}(\boldsymbol{\Theta}^{(j)})}{\partial~{}\boldsymbol{\boldsymbol{\Theta}}}, and f(𝚯(j))=f(𝒢T,𝑿~;𝚯(j))𝚯\nabla f({\boldsymbol{\Theta}}^{(j)})=\frac{\partial~{}f(\mathcal{G}_{T},\widetilde{\boldsymbol{X}};\boldsymbol{\Theta}^{(j)})}{\partial~{}\boldsymbol{\Theta}}, where 𝚯{𝚯gnn,𝚯1,,𝚯L}\boldsymbol{\Theta}\in\{\boldsymbol{\Theta}_{gnn},\boldsymbol{\Theta}_{1},\dots,\boldsymbol{\Theta}_{L}\}. By the definition of 𝑽(j)\lVert\boldsymbol{V}^{(j)}\rVert, we have its element |𝑽(j)(𝑿~)|\lvert\boldsymbol{V}^{(j)}(\widetilde{\boldsymbol{X}})\rvert

ηmax0sη[Θ(Θ(j))Ff(Θ(j))f(Θ(j),s)F].\begin{split}&\leq\eta\cdot\max_{0\leq s\leq\eta}\bigg{[}\sum_{\Theta}\lVert\nabla\mathcal{L}({\Theta}^{(j)})\rVert_{F}\lVert\nabla f({\Theta}^{(j)})-\nabla f({\Theta}^{(j)},s)\rVert_{F}\bigg{]}.\end{split}

With the notation and conclusion from Lemma A.2, we have

(Θ(j))Fm122T𝑭T(j)𝒀T2ζL(2βL)Lβh\begin{split}&\lVert\nabla\mathcal{L}({\Theta}^{(j)})\rVert_{F}\leq m^{-\frac{1}{2}}2\sqrt{T}\lVert\boldsymbol{F}^{(j)}_{T}-\boldsymbol{Y}_{T}\rVert_{2}\cdot\zeta^{L}\cdot(2\beta_{L})^{L}\beta_{h}\end{split}

Meantime, f(Θl(j))f(Θl(j),s)F=mLl+12𝒉l1(j)(𝚯L(j))(q=l+1L1𝚪q(j)(𝚯q(j)))𝚪l(j)𝒉l1(j),s(𝚯L(j),s)(q=l+1L1𝚪q(j),s(𝚯q(j),s))𝚪l(j),sF.\lVert\nabla f({\Theta}_{l}^{(j)})-\nabla f({\Theta}_{l}^{(j)},s)\rVert_{F}=m^{-\frac{L-l+1}{2}}\\ \lVert\boldsymbol{h}^{(j)}_{l-1}(\boldsymbol{\Theta}^{(j)}_{L})^{\intercal}\big{(}\prod_{q=l+1}^{L-1}\boldsymbol{\Gamma}^{(j)}_{q}(\boldsymbol{\Theta}^{(j)}_{q})^{\intercal}\big{)}\cdot\boldsymbol{\Gamma}^{(j)}_{l}-\boldsymbol{h}^{(j),s}_{l-1}(\boldsymbol{\Theta}^{(j),s}_{L})^{\intercal}\\ \big{(}\prod_{q=l+1}^{L-1}\boldsymbol{\Gamma}^{(j),s}_{q}\cdot(\boldsymbol{\Theta}^{(j),s}_{q})^{\intercal}\big{)}\cdot\boldsymbol{\Gamma}^{(j),s}_{l}\rVert_{F}. A similar form can also be derived for 𝚯gnn\boldsymbol{\Theta}_{gnn}.

With Υ/m1\Upsilon/\sqrt{m}\leq 1 and Λ(j)βh\Lambda^{(j)}\leq\beta_{h} and a similar procedure as in Lemma A.3 and Lemma A.2, we have

𝚯(j+1)𝚯(j)FηΥ(j)m,𝚯(j)F2βLm𝒉(j+1)𝒉(j)2η2ζβhm(2ζβL)LΥ(j),𝒉(j)22βh,𝚪(j+1)𝚪(j)F2ηζ2βh(2ζβL)LΥ(j),𝚪(j)2ζ\begin{split}&\lVert\boldsymbol{\Theta}^{(j+1)}-\boldsymbol{\Theta}^{(j)}\rVert_{F}\leq\eta\frac{\Upsilon^{(j)}}{\sqrt{m}},\quad\lVert\boldsymbol{\Theta}^{(j)}\rVert_{F}\leq 2\beta_{L}\sqrt{m}\\ &\lVert\boldsymbol{h}^{(j+1)}-\boldsymbol{h}^{(j)}\rVert_{2}\leq\eta\frac{2\zeta\beta_{h}}{\sqrt{m}}(2\zeta\beta_{L})^{L}\Upsilon^{(j)},\quad\lVert\boldsymbol{h}^{(j)}\rVert_{2}\leq 2\beta_{h},\\ &\lVert\boldsymbol{\Gamma}^{(j+1)}-\boldsymbol{\Gamma}^{(j)}\rVert_{F}\leq 2\eta\zeta^{2}\beta_{h}(2\zeta\beta_{L})^{L}\Upsilon^{(j)},\quad\lVert\boldsymbol{\Gamma}^{(j)}\rVert_{2}\leq\zeta\end{split}

With Lemma G.1 from (Du et al., 2019b), for 𝚯{𝚯gnn,𝚯1,,𝚯L}\boldsymbol{\Theta}\in\{\boldsymbol{\Theta}_{gnn},\boldsymbol{\Theta}_{1},\dots,\boldsymbol{\Theta}_{L}\},

f(𝚯(j))f(𝚯(j),s)F4ζmηΥ(j)βhL(2ζβL)2L.\begin{split}\lVert\nabla f({\boldsymbol{\Theta}}^{(j)})-\nabla f({\boldsymbol{\Theta}}^{(j)},s)\rVert_{F}\leq\frac{4\zeta}{\sqrt{m}}\eta\Upsilon^{(j)}\beta_{h}L(2\zeta\beta_{L})^{2L}.\end{split}

Combining with (Θ(j))F\lVert\nabla\mathcal{L}({\Theta^{\prime}}^{(j)})\rVert_{F}, we have

|𝑽(j)(𝑿~)|η24Tm(L+2)2βh3𝑭T(j)𝒀T22(2ζβL)4L.\begin{split}\lvert\boldsymbol{V}^{(j)}(\widetilde{\boldsymbol{X}})\rvert\leq\eta^{2}\frac{4T}{m}(L+2)^{2}\beta_{h}^{3}\cdot\lVert\boldsymbol{F}^{(j)}_{T}-\boldsymbol{Y}_{T}\rVert_{2}^{2}(2\zeta\beta_{L})^{4L}.\end{split}

Since this inequality holds for an arbitrary 𝑿~{𝑿~τ}τ[T]\widetilde{\boldsymbol{X}}\in\{\widetilde{\boldsymbol{X}}_{\tau}\}_{\tau\in[T]} and 𝑭T(0)𝒀T2=𝒪(T)\lVert\boldsymbol{F}^{(0)}_{T}-\boldsymbol{Y}_{T}\rVert_{2}=\mathcal{O}(\sqrt{T}), given network width mm, we finally have

𝑽(j)14ηβF𝑭T(j)𝒀T22.\begin{split}\lVert\boldsymbol{V}^{(j)}\rVert\leq\frac{1}{4}\eta\beta_{F}\lVert\boldsymbol{F}^{(j)}_{T}-\boldsymbol{Y}_{T}\rVert_{2}^{2}.\end{split}

with the choice of learning rate η𝒪(T1L1βh2(2ζβL)2L)\eta\leq\mathcal{O}(T^{-1}L^{-1}\beta_{h}^{-2}(2\zeta\beta_{L})^{-2L}). \blacksquare

Proof of Lemma 5.7. We prove this lemma following an analogous approach as Lemma B.7 in (Du et al., 2019b). By the model definition and substituting Υ(j)/m\Upsilon^{(j)}/\sqrt{m} with m122T𝑭T(j)𝒀T2ζL(2βL)Lβhm^{-\frac{1}{2}}2\sqrt{T}\lVert\boldsymbol{F}^{(j)}_{T}-\boldsymbol{Y}_{T}\rVert_{2}\cdot\zeta^{L}(2\beta_{L})^{L}\beta_{h} as the upper bound based on Lemma A.2, with Λ(j)βh\Lambda^{(j)}\leq\beta_{h}, we have

𝑭T(j)𝑭T(j+1)22=1mτ=1T((𝒉L1,τ(j+1))𝚯L(j+1)(𝒉L1,τ(j))𝚯L(j))22m(𝚯L(j+1)𝚯L(j)22τ=1T𝒉L1,τ(j+1)22+𝚯L(j)22τ=1T𝒉L1,τ(j+1)𝒉L1,τ(j)22)2m(Tmη2(2βh)4𝑭T(j)𝒀T22+T(2β3)2(η2ζβhm(2ζβL)LΥ(j))2)14ηβF𝑭T(j)𝒀T22\begin{split}&\lVert\boldsymbol{F}^{(j)}_{T}-\boldsymbol{F}^{(j+1)}_{T}\rVert_{2}^{2}=\frac{1}{m}\sum_{\tau=1}^{T}\big{(}(\boldsymbol{h}_{L-1,\tau}^{(j+1)})^{\intercal}\boldsymbol{\Theta}_{L}^{(j+1)}-(\boldsymbol{h}_{L-1,\tau}^{(j)})^{\intercal}\boldsymbol{\Theta}_{L}^{(j)}\big{)}^{2}\\ &\leq\frac{2}{m}\bigg{(}\lVert\boldsymbol{\Theta}_{L}^{(j+1)}-\boldsymbol{\Theta}_{L}^{(j)}\rVert_{2}^{2}\sum_{\tau=1}^{T}\lVert\boldsymbol{h}_{L-1,\tau}^{(j+1)}\rVert_{2}^{2}+\lVert\boldsymbol{\Theta}_{L}^{(j)}\rVert_{2}^{2}\sum_{\tau=1}^{T}\lVert\boldsymbol{h}_{L-1,\tau}^{(j+1)}-\boldsymbol{h}_{L-1,\tau}^{(j)}\rVert_{2}^{2}\bigg{)}\\ &\leq\frac{2}{m}\bigg{(}\frac{T}{m}\eta^{2}(2\beta_{h})^{4}\lVert\boldsymbol{F}^{(j)}_{T}-\boldsymbol{Y}_{T}\rVert_{2}^{2}+T(2\beta_{3})^{2}(\eta\frac{2\zeta\beta_{h}}{\sqrt{m}}(2\zeta\beta_{L})^{L}\Upsilon^{(j)})^{2}\bigg{)}\\ &\leq\frac{1}{4}\eta\beta_{F}\lVert\boldsymbol{F}^{(j)}_{T}-\boldsymbol{Y}_{T}\rVert_{2}^{2}\end{split}

where the last inequality is due to sufficiently large mm and the choice of learning rate η\eta. \blacksquare