
Graph Feedback Bandits with Similar Arms

Han Qi, Guo Fei, Li Zhu
School of Software Engineering, Xi'an Jiaotong University, Xi'an, Shaanxi, China
Abstract

In this paper, we study the stochastic multi-armed bandit problem with graph feedback. Motivated by clinical trials and recommendation problems, we assume that two arms are connected if and only if they are similar (i.e., their means are close enough). We establish a regret lower bound for this novel feedback structure and introduce two UCB-based algorithms: D-UCB, with a problem-independent regret upper bound, and C-UCB, with a problem-dependent upper bound. Leveraging the similarity structure, we also consider the scenario where the number of arms increases over time. Practical applications of this scenario include Q&A platforms (Reddit, Stack Overflow, Quora) and product reviews on Amazon and Flipkart: answers (or product reviews) continually appear on the website, and the goal is to display the best ones at the top. When the means of the arms are independently generated from some distribution, we provide regret upper bounds for both algorithms and discuss the sub-linearity of the bounds in relation to the distribution of means. Finally, we conduct experiments to validate the theoretical results.

1 Introduction

The multi-armed bandit is a classical reinforcement learning problem. At each time step, the learner needs to select an arm, which yields a reward drawn from a probability distribution unknown to the learner. The learner's goal is to maximize the cumulative reward over a period of time steps. This problem has attracted attention from the online learning community because of its effective balance between exploration (trying out as many arms as possible) and exploitation (utilizing the arm with the best current performance). Applications of multi-armed bandits can be found in online sequential decision problems, such as online recommendation systems [Li et al., 2011], online advertisement campaigns [Schwartz et al., 2017] and clinical trials [Villar et al., 2015, Aziz et al., 2021].

In the standard multi-armed bandit, the learner can only observe the reward of the chosen arm. Existing research [Mannor and Shamir, 2011, Caron et al., 2012, Hu et al., 2020, Lykouris et al., 2020] considers the bandit problem with side observations, wherein the learner can observe information about arms other than the selected one. This observation structure can be encoded as a graph in which each node represents an arm, and node $i$ is linked to node $j$ if selecting $i$ provides information on the reward of $j$.

We study a new feedback model: if any two arms are $\epsilon$-similar, i.e., the absolute value of the difference between their means does not exceed $\epsilon$, an edge forms between them. This means that after observing the reward of one arm, the decision-maker simultaneously learns about the rewards of the arms similar to it. If $\epsilon=0$, this feedback model reduces to the standard multi-armed bandit problem. If $\epsilon$ is greater than the maximum expected reward, this feedback model becomes the full information bandit problem.

As a motivating example, consider the recommendation problem in Spotify and Apple Music. After the recommender suggests a song to a user, it can observe not only whether the user liked or saved the song (reward), but can also infer that the user is likely to like or save another song that is very similar. Similarity may be based on factors such as the artist, songwriter, genre, and more. As another motivating example, consider the clinical trial problem mentioned above. Each arm represents a different medical treatment plan, and these plans may share similarities such as dosage, formulation, etc. When a researcher selects a plan, they not only observe the reward of that treatment plan but also learn the effects of other plans similar to the selected one. The treatment effect (reward) can be either some summary information or a relative effect, such as positive or negative. Similar examples also appear in molecular simulations in chemistry [Pérez et al., 2020].

In this paper, we consider two bandit models: the standard graph feedback bandit problem and the bandit problem with an increasing number of arms. The latter is a more challenging setting than the standard multi-armed bandit. Relevant applications of this scenario encompass Q&A platforms such as Reddit, Stack Overflow, and Quora, as well as product reviews on websites like Amazon and Flipkart. Answers or product reviews continuously populate these platforms, which means that the number of arms increases over time. The goal is to display the best answers or product reviews at the top. This problem has been previously studied and referred to as "ballooning multi-armed bandits" by Ghalme et al. [2021]. However, they require that the optimal arm be more likely to arrive in the early rounds. Our contributions are as follows:

  1.

    We propose a new feedback model, where an edge is formed when the difference between the means of two arms is less than some constant $\epsilon$. We first analyze the underlying graph $G$ of this feedback model and establish that the domination number $\gamma(G)$ is equal to the minimum size of an independent dominating set $i(G)$, while the independence number $\alpha(G)$ is not greater than twice the domination number $\gamma(G)$, i.e., $\gamma(G)=i(G)\geq\alpha(G)/2$. This result is helpful to the design and analysis of the following algorithms.

  2.

    In this feedback setting, we first establish a problem-dependent regret lower bound related to $\gamma(G)$. Then, we introduce two algorithms tailored for this specific feedback structure: Double-UCB (D-UCB), which applies the UCB rule twice in sequence, and Conservative-UCB (C-UCB), which employs a more conservative strategy during exploration. Regret upper bounds are provided for both algorithms, with D-UCB obtaining a problem-independent bound, while C-UCB achieves a problem-dependent regret bound. Additionally, we analyze the regret bound of UCB-N [Lykouris et al., 2020] applied to our proposed setting and prove that its regret upper bound shares the same order as D-UCB's.

  3.

    We extend D-UCB and C-UCB to the scenario where the number of arms increases over time. Our algorithm does not require the optimal arm to arrive early but assumes that the means of each arrived arm are independently sampled from some distribution, which is more realistic. We provide the regret upper bounds for D-UCB and C-UCB, along with a simple regret lower bound for D-UCB. The lower bound of D-UCB indicates that it achieves sublinear regret only when the means are drawn from a normal-like distribution, while it incurs linear regret for means drawn from a uniform distribution. In contrast, C-UCB can achieve a problem-dependent sublinear regret upper bound regardless of the means distribution.

2 Related Works

Bandits with side observations were first introduced by Mannor and Shamir [2011] for adversarial settings. They proposed two algorithms: ExpBan, a hybrid algorithm combining expert and bandit algorithms based on a clique decomposition of the side observation graph, and ELP, an extension of the well-known EXP3 algorithm [Auer et al., 2002b]. Caron et al. [2012] and Hu et al. [2020] considered stochastic bandits with side observations and proposed the UCB-N, UCB-NE, and TS-N algorithms. The regret upper bounds they obtain are of the form $\sum_{c\in\mathcal{C}}\frac{\max_{i\in c}\Delta_{i}\ln(T)}{(\min_{i\in c}\Delta_{i})^{2}}$, where $\mathcal{C}$ is a clique covering of the side observation graph.

There have been some works that employ techniques beyond clique partition. Buccapatnam et al. [2014, 2018] proposed the algorithm UCB-LP, which combines a version of arm elimination [Even-Dar et al., 2006] suggested by Auer and Ortner [2010] with linear programming to incorporate the graph structure. This algorithm has a regret guarantee of $\sum_{i\in D}\frac{\ln(T)}{\Delta_{i}}+K^{2}$, where $D$ is a particularly selected dominating set and $K$ is the number of arms. Cohen et al. [2016] used a method based on elimination and proved a regret upper bound of $\tilde{O}(\sqrt{\alpha T})$, where $\alpha$ is the independence number of the underlying graph. Lykouris et al. [2020] utilized a hierarchical approach inspired by elimination to analyze the feedback graph, demonstrating that UCB-N and TS-N have regret bounds of order $\tilde{O}(\sum_{i\in I}\frac{1}{\Delta_{i}})$, where $I$ is an independent set of the graph. There is also some work that considers the case where the feedback graph is a random graph [Alon et al., 2017, Ghari and Shen, 2022, Esposito et al., 2022].

Currently, there is limited research considering scenarios where the number of arms can change. Chakrabarti et al. [2008] were the first to explore this dynamic setting. Their model assumes that each arm has a lifetime budget, after which it automatically disappears and is replaced by a new arm. Since the algorithm needs to continuously explore newly available arms in this setting, they only provided an upper bound on the mean regret per time step. Ghalme et al. [2021] considered the "ballooning multi-armed bandits" setting, where the number of arms increases but arms do not disappear. They show that the regret grows linearly without any distributional assumptions on the arrival of the arms' means. Under the assumption that the optimal arm arrives early with high probability, their proposed algorithm BL-MOSS can achieve sublinear regret. In this paper, we also consider the ballooning setting but without any assumption on the optimal arm's arrival pattern. We use the feedback graph model mentioned above and assume that the means of the arrived arms are independently sampled from some distribution.

Clustering bandits [Gentile et al., 2014, Li et al., 2016, Yang et al., 2022, Wang et al., 2024] are also relevant to our work. Typically, these models assume that a set of arms (or items) can be classified into several unknown groups, and within each group the observations associated with each arm follow the same distribution with the same mean. However, we do not employ a predefined concept of clustering groups. Instead, we establish connections between arms by forming an edge only when the difference of their means is less than a threshold $\epsilon$, thereby creating a graph feedback structure.

3 Problem Formulation

3.1 Graph Feedback with Similar Arms

We consider a stochastic $K$-armed bandit problem with an undirected feedback graph and time horizon $T$ ($K\leq T$). At each round $t$, the learner selects arm $i_{t}$ and obtains a reward $X_{t}(i_{t})$. Without loss of generality, we assume that the rewards are bounded in $[0,1]$ or $\frac{1}{2}$-subGaussian (this is simply to provide a uniform description of both bounded and subGaussian rewards; our results can be easily extended to other subGaussian distributions). The expectation of $X_{t}(i)$ is denoted as $\mu(i)=\mathbb{E}[X_{t}(i)]$. Graph $G=(V,E)$ denotes the underlying graph that captures all the feedback relationships over the arm set $V$. An edge $i\leftrightarrow j$ in $E$ means that $i$ and $j$ are $\epsilon$-similar, i.e.,

|\mu(i)-\mu(j)|<\epsilon,

where $\epsilon$ is some constant greater than 0. The learner gets a side observation of arm $i$ when pulling arm $j$, and vice versa. Let $N_{i}$ denote the observation set of arm $i$, consisting of $i$ and its neighbors in $G$. Let $k_{t}(i)$ and $O_{t}(i)$ denote the number of pulls and observations of arm $i$ up to time $t$, respectively. We assume that each node in graph $G$ has a self-loop, i.e., the learner always observes the reward of the pulled arm.

Let $i^{*}$ denote the optimal arm, i.e., $\mu(i^{*})=\max_{i\in\{1,...,K\}}\mu(i)$. The gap between the expectations of the optimal arm and a suboptimal arm is denoted as $\Delta_{i}=\mu(i^{*})-\mu(i)$. A policy $\pi$ selects arm $i_{t}$ to play at time step $t$ based on the history of plays and rewards. The performance of the policy $\pi$ is measured by

R_{T}^{\pi}=\mathbb{E}\left[\sum_{t=1}^{T}\mu(i^{*})-\mu(i_{t})\right]. (1)

3.2 Ballooning Bandit Setting

This setting is the same as graph feedback with similar arms above, except that the number of arms increases over time. Let $K(t)$ denote the set of available arms at round $t$. We consider the challenging case in which a new arm arrives at every round, so the total number of arms is $|K(T)|=T$. We assume that the means of the arrived arms are independently sampled from some distribution $\mathcal{P}$. Let $a_{t}$ denote the arm that arrives at round $t$; then

\mu(a_{1}),\mu(a_{2}),...,\mu(a_{T})\stackrel{i.i.d.}{\sim}\mathcal{P}.

The newly arrived arm may be connected to previous arms, depending on whether their means satisfy the $\epsilon$-similarity. In this setting, the optimal arm may vary over time. Let $i_{t}^{*}$ denote the optimal arm at round $t$, i.e., $\mu(i_{t}^{*})=\max_{i\in K(t)}\mu(i)$. The regret is given by

R_{T}^{\pi}(\mathcal{P})=\mathbb{E}\left[\sum_{t=1}^{T}\mu(i_{t}^{*})-\mu(i_{t})\right]. (2)

4 Stationary Environments

In this section, we consider graph feedback with similar arms in stationary environments, i.e., the number of arms remains constant. We first analyze the structure of the feedback graph. Then, we provide a problem-dependent regret lower bound. Following that, we introduce the D-UCB and C-UCB algorithms and provide their regret upper bounds.

Dominating and Independent Dominating Sets. A dominating set $S$ in a graph $G$ is a set of vertices such that every vertex not in $S$ is adjacent to a vertex in $S$. The domination number of $G$, denoted as $\gamma(G)$, is the smallest size of a dominating set.

An independent set contains vertices that are not adjacent to each other. An independent dominating set in $G$ is a set that both dominates and is independent. The independent domination number of $G$, denoted as $i(G)$, is the smallest size of such a set. The independence number of $G$, denoted as $\alpha(G)$, is the largest size of an independent set in $G$. For a general graph $G$, it follows immediately that $\gamma(G)\leq i(G)\leq\alpha(G)$.

Proposition 4.1.

Let $G$ denote the feedback graph in the similar arms setting. Then $\gamma(G)=i(G)\geq\frac{\alpha(G)}{2}$.

Proof sketch. The first equality is obtained by proving that $G$ is a claw-free graph and using Lemma A.2. The second inequality is obtained by a double counting argument. The details are in the appendix.

Proposition 4.1 shows that $\gamma(G)\leq\alpha(G)\leq 2\gamma(G)$. Once we obtain bounds based on the independence number, we simultaneously obtain bounds based on the domination number. Therefore, we can obtain regret bounds based on the minimum dominating set without using the feedback graph to explicitly target exploration. This cannot be guaranteed in the standard graph feedback bandit problem [Lykouris et al., 2020].
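As a sanity check (not part of the paper's algorithms), the following brute-force sketch builds the $\epsilon$-similarity graph from randomly drawn means and verifies $\gamma(G)=i(G)$ and $\alpha(G)\leq 2\gamma(G)$ on small instances; all names and parameter values are illustrative.

import itertools
import random

def similarity_graph(means, eps):
    # adjacency sets of the epsilon-similarity graph (self-loops omitted)
    K = len(means)
    adj = [set() for _ in range(K)]
    for i, j in itertools.combinations(range(K), 2):
        if abs(means[i] - means[j]) < eps:
            adj[i].add(j)
            adj[j].add(i)
    return adj

def is_independent(S, adj):
    return all(j not in adj[i] for i in S for j in S if i != j)

def is_dominating(S, adj, K):
    covered = set(S) | {j for i in S for j in adj[i]}
    return len(covered) == K

def graph_numbers(adj):
    # enumerate subsets by increasing size, so the first hit is a minimum
    K = len(adj)
    gamma = i_num = None
    alpha = 0
    for size in range(1, K + 1):
        for S in itertools.combinations(range(K), size):
            dom = is_dominating(S, adj, K)
            ind = is_independent(S, adj)
            if dom and gamma is None:
                gamma = size
            if dom and ind and i_num is None:
                i_num = size
            if ind:
                alpha = size
    return gamma, i_num, alpha

random.seed(0)
for _ in range(100):
    means = [random.random() for _ in range(8)]
    g, i_n, a = graph_numbers(similarity_graph(means, eps=0.15))
    assert g == i_n and a <= 2 * g   # Proposition 4.1
print("Proposition 4.1 holds on all sampled instances")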

4.1 Lower Bounds

Before presenting our algorithm, we first investigate the regret lower bound of this problem. Without loss of generality, we assume the reward distribution is $\frac{1}{2}$-subGaussian.

Let $\Delta_{min}=\mu(i^{*})-\max_{j\neq i^{*}}\mu(j)$ and $\Delta_{max}=\mu(i^{*})-\min_{j}\mu(j)$. We assume that $\Delta_{min}<\epsilon$ in our analysis. In other words, we do not consider the easily distinguishable scenario where the optimal arm and the suboptimal arms are clearly separable. If $\Delta_{min}\geq\epsilon$, our analysis method is also applicable, but the terms related to $\Delta_{min}$ vanish in the expressions of both the lower and upper bounds. Caron et al. [2012] provided a lower bound of $\Omega(\log(T))$; we present a more refined lower bound specifically for our similarity feedback.

Theorem 4.2.

If a policy $\pi$ is uniformly good (for more details, see [Lai et al., 1985]), then for any problem instance it holds that

\liminf_{T\to\infty}\frac{R_{T}^{\pi}}{\log(T)}\geq\frac{2}{\Delta_{min}}+\frac{C_{1}}{\epsilon}, (3)

where $C_{1}=2\log\big(\frac{\Delta_{max}+\epsilon}{\Delta_{max}-(\gamma(G)-2)\epsilon}\big)$.

Proof.

Let $\mathcal{S}$ denote an independent dominating set that includes $i^{*}$. From Proposition 4.1, $|\mathcal{S}|\geq\gamma(G)$. The second-best arm is denoted as $i^{\prime}$, and $\mathcal{D}=\mathcal{S}\cup\{i^{\prime}\}$. Since $\Delta_{min}<\epsilon$, $i^{\prime}\notin\mathcal{S}$.

Let us construct another policy $\pi^{\prime}$ for another problem instance on $\mathcal{D}$ without side observations. If $\pi$ selects arm $i_{t}$ at round $t$, $\pi^{\prime}$ selects an arm as follows: if $i_{t}\in\mathcal{D}$, $\pi^{\prime}$ selects arm $i_{t}$ too; if $i_{t}\notin\mathcal{D}$, $\pi^{\prime}$ selects the arm of $N_{i_{t}}\cap\mathcal{D}$ with the largest mean. Since $N_{i_{t}}\cap\mathcal{D}\neq\emptyset$, policy $\pi^{\prime}$ is well-defined. It is clear that $R_{T}^{\pi}>R_{T}^{\pi^{\prime}}$. By the classical result of [Lai et al., 1985],

\liminf_{T\to\infty}\frac{R_{T}^{\pi^{\prime}}}{\log(T)}\geq\sum_{i\in\mathcal{D}}\frac{2}{\Delta_{i}}=\frac{2}{\Delta_{min}}+\sum_{i\in\mathcal{S}\backslash\{i^{*}\}}\frac{2}{\Delta_{i}}. (4)

For every $i\in\mathcal{S}\backslash\{i^{*}\}$, $\Delta_{i}\in[n_{i}\epsilon,(n_{i}+1)\epsilon)$, where $n_{i}$ is some positive integer smaller than $\frac{\Delta_{max}}{\epsilon}$. Since $\mathcal{S}$ is an independent set, for all $i,j\in\mathcal{S}\backslash\{i^{*}\}$ with $i\neq j$ we have $n_{i}\neq n_{j}$. Therefore, we can complete this proof by

\sum_{i\in\mathcal{S}\backslash\{i^{*}\}}\frac{2}{\Delta_{i}}\geq\sum_{j=0}^{|\mathcal{S}|-2}\frac{2}{\Delta_{max}-j\epsilon}\geq 2\int_{-1}^{|\mathcal{S}|-2}\frac{1}{\Delta_{max}-\epsilon x}dx\geq\frac{2\log\big(\frac{\Delta_{max}+\epsilon}{\Delta_{max}-(\gamma(G)-2)\epsilon}\big)}{\epsilon}.

Consider two simple cases. (1) $\gamma(G)=1$: the feedback graph $G$ is a complete graph or a graph with independence number at most $2$. Then $C_{1}=0$, and the lower bound holds strictly when $G$ is not a complete graph. (2) $\gamma(G)=\frac{\Delta_{max}}{\epsilon}$: then $C_{1}=2\log(\frac{1}{2}\gamma(G)+\frac{1}{2})$. In this case, the terms in the lower bound involving $\epsilon$ have the same order $O(\log(\gamma(G)))$ as the corresponding terms in the upper bounds of the algorithms proposed below.
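For completeness, case (2) follows by substituting $\Delta_{max}=\gamma(G)\epsilon$ into the definition of $C_{1}$:

C_{1}=2\log\Big(\frac{\gamma(G)\epsilon+\epsilon}{\gamma(G)\epsilon-(\gamma(G)-2)\epsilon}\Big)=2\log\Big(\frac{(\gamma(G)+1)\epsilon}{2\epsilon}\Big)=2\log\Big(\frac{1}{2}\gamma(G)+\frac{1}{2}\Big).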

4.2 Double-UCB

This particular feedback structure inspires us to first distinguish the arms within an independent set. This is a straightforward task because the means of any two arms in an independent set differ by at least $\epsilon$. Subsequently, we learn from the arm with the maximum confidence bound in the independent set and its neighborhood, which may include the optimal arm. Our algorithm interleaves these two processes.

1:  Input: Horizon $T$, $\delta\in(0,1)$
2:  Initialize $\mathcal{I}=\emptyset$; $\forall i$, $k(i)=0$, $O(i)=0$
3:  for $t=1$ to $T$ do
4:     repeat
5:        Select an arm $i_{t}$ that has not been observed
6:        $\mathcal{I}=\mathcal{I}\cup\{i_{t}\}$
7:        $\forall i\in N_{i_{t}}$, update $k_{t}(i),O_{t}(i),\bar{\mu}_{t}(i)$
8:        $t=t+1$
9:     until all arms have been observed at least once
10:     $j_{t}=\arg\max_{j\in\mathcal{I}}\bar{\mu}(j)+\sqrt{\frac{\log(\sqrt{2}T/\delta)}{O(j)}}$
11:     Select arm $i_{t}=\arg\max_{i\in N_{j_{t}}}\bar{\mu}(i)+\sqrt{\frac{\log(\sqrt{2}T/\delta)}{O(i)}}$
12:     $\forall i\in N_{i_{t}}$, update $k_{t}(i),O_{t}(i),\bar{\mu}_{t}(i)$
13:  end for
Algorithm 1 D-UCB

Algorithm 1 shows the pseudocode of our method. Steps 4-9 identify an independent set $\mathcal{I}$ in $G$ and play each arm in the independent set once. This process does not require knowledge of the complete graph structure and requires at most $\alpha(G)$ rounds. Step 10 calculates the arm $j_{t}$ with the maximum upper confidence bound in the independent set. After a finite number of rounds, the optimal arm is likely to fall within $N_{j_{t}}$. Step 11 uses the UCB rule again to select an arm in $N_{j_{t}}$. Since this algorithm employs UCB rules twice for arm selection, it is named Double-UCB.
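The following is a minimal, self-contained sketch of Double-UCB for a fixed arm set (not the authors' released implementation). It assumes Gaussian rewards with standard deviation $1/2$ and lets the environment build the feedback graph from the hidden means; the function and variable names are illustrative.

import numpy as np

def d_ucb(means, eps, T, seed=0):
    # means: true (hidden) arm means; the learner only sees rewards and the graph
    rng = np.random.default_rng(seed)
    means = np.asarray(means, dtype=float)
    K, delta = len(means), 1.0 / T
    N = [[j for j in range(K) if abs(means[i] - means[j]) < eps] for i in range(K)]
    obs_count = np.zeros(K)   # O(i): number of observations of arm i
    obs_sum = np.zeros(K)     # running sum of observed rewards of arm i
    mean_hat = lambda i: obs_sum[i] / max(obs_count[i], 1)
    rad = lambda i: np.sqrt(np.log(np.sqrt(2) * T / delta) / max(obs_count[i], 1))
    regret, ind_set, t = 0.0, [], 0

    def pull(i_t):
        nonlocal regret
        regret += means.max() - means[i_t]
        for j in N[i_t]:                       # side observations of all similar arms
            obs_count[j] += 1
            obs_sum[j] += rng.normal(means[j], 0.5)

    # Steps 4-9: build an independent set online by playing unobserved arms
    while t < T and (obs_count == 0).any():
        i_t = int(np.argmax(obs_count == 0))   # any arm not yet observed
        ind_set.append(i_t)
        pull(i_t); t += 1
    # Steps 10-12: UCB over the independent set, then UCB inside N_{j_t}
    while t < T:
        j_t = max(ind_set, key=lambda j: mean_hat(j) + rad(j))
        i_t = max(N[j_t], key=lambda i: mean_hat(i) + rad(i))
        pull(i_t); t += 1
    return regret

print(d_ucb(means=np.linspace(0, 1, 20), eps=0.1, T=5000))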

4.2.1 Regret Analysis of Double-UCB

We use $\mathcal{I}(N_{i})$ to denote the set of all independent dominating sets of the graph formed by $N_{i}$. Let

\mathcal{I}(i^{*})=\bigcup_{i\in N_{i^{*}}}\mathcal{I}(N_{i}).

Note that the elements of $\mathcal{I}(i^{*})$ are independent sets rather than individual nodes, and for every $I\in\mathcal{I}(i^{*})$, $|I|\leq 2$.

Theorem 4.3.

Assume that the reward distribution is $\frac{1}{2}$-subGaussian or bounded in $[0,1]$ and set $\delta=\frac{1}{T}$. The regret of Double-UCB after $T$ rounds is upper bounded by

R_{T}^{\pi}\leq 32(\log(\sqrt{2}T))^{2}\max_{I\in\mathcal{I}(i^{*})}\sum_{i\in I\backslash\{i^{*}\}}\frac{1}{\Delta_{i}}+C_{2}\frac{\log(\sqrt{2}T)}{\epsilon}+\Delta_{max}+4\epsilon+1, (5)

where $C_{2}=8(\log(2\gamma(G))+\frac{\pi^{2}}{3})$.

Recall that we assume $\Delta_{min}<\epsilon$; then $|N_{i}|>1$ for every $i\in N_{i^{*}}$, and the index set $I\backslash\{i^{*}\}$ in the summation of the first term is non-empty. If $\Delta_{min}\geq\epsilon$, then $\mathcal{I}(i^{*})=\{i^{*}\}$ and the first term vanishes.

Proof sketch. The regret can be analyzed in two parts. The first part arises from rounds in which the selected $j_{t}$ has a neighborhood that does not contain the optimal arm. The algorithm can easily distinguish the suboptimal arms in such $N_{j_{t}}$ from the optimal arm, and the regret caused by this part is $O(\frac{\log(T)}{\epsilon})$. The second part of the regret comes from selecting a suboptimal arm in $N_{j_{t}}$ when $i^{*}\in N_{j_{t}}$. This part can be viewed as applying the UCB rule on a graph with independence number less than 2.

Since $|I|\leq 2$ for every $I\in\mathcal{I}(i^{*})$, we have

\max_{I\in\mathcal{I}(i^{*})}\sum_{i\in I\backslash\{i^{*}\}}\frac{1}{\Delta_{i}}\leq\frac{2}{\Delta_{min}}.

Therefore, the regret upper bound can be written as

R_{T}^{\pi}\leq\frac{64(\log(\sqrt{2}T))^{2}}{\Delta_{min}}+C_{2}\frac{\log(\sqrt{2}T)}{\epsilon}+\Delta_{max}+4\epsilon+1. (6)

Our upper bound incurs an extra logarithmic factor compared to the lower bound in Theorem 4.2. This additional logarithm appears to be essential if one uses a gap-based analysis similar to [Cohen et al., 2016]; this issue has been discussed in [Lykouris et al., 2020].

From Theorem 4.3, we have the following gap-free upper bound.

Corollary 4.4.

The regret of Double-UCB is bounded by $16\sqrt{T}\log(\sqrt{2}T)+C_{2}\frac{\log(\sqrt{2}T)}{\epsilon}+\Delta_{max}+4\epsilon+1$.

4.3 Conservative UCB

Double-UCB is a very natural algorithm for the similar arms setting. For this particular feedback structure, we propose Conservative UCB, which simply modifies the arm selection rule in Step 11 of Algorithm 1 to $i_{t}=\arg\max_{i\in N_{j_{t}}}\bar{\mu}(i)-\sqrt{\frac{\log(\sqrt{2}T/\delta)}{O(i)}}$. This index function is conservative when exploring arms in $N_{j_{t}}$: it does not immediately try each arm but selects those that have been observed a sufficient number of times. Algorithm 2 shows the pseudocode of C-UCB.

1:  Input: Horizon $T$, $\delta\in(0,1)$
2:  Initialize $\mathcal{I}=\emptyset$; $\forall i$, $k(i)=0$, $O(i)=0$
3:  for $t=1$ to $T$ do
4:     Steps 4-10 in D-UCB
5:     Select arm $i_{t}=\arg\max_{i\in N_{j_{t}}}\bar{\mu}(i)-\sqrt{\frac{\log(\sqrt{2}T/\delta)}{O(i)}}$
6:     $\forall i\in N_{i_{t}}$, update $k_{t}(i),O_{t}(i),\bar{\mu}_{t}(i)$
7:  end for
Algorithm 2 C-UCB
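The only difference from D-UCB is the sign of the confidence radius in the inner selection step. A small illustrative sketch of the two index functions (the names and parameter values are ours, not from the paper's code):

import numpy as np

def ucb_index(mean_hat, n_obs, T, delta):
    # optimistic index used in Step 11 of Algorithm 1 (D-UCB)
    return mean_hat + np.sqrt(np.log(np.sqrt(2) * T / delta) / n_obs)

def lcb_index(mean_hat, n_obs, T, delta):
    # conservative index used in Step 5 of Algorithm 2 (C-UCB): same radius, subtracted
    return mean_hat - np.sqrt(np.log(np.sqrt(2) * T / delta) / n_obs)

# A barely-observed arm gets a large optimistic index but a small conservative one,
# so C-UCB keeps exploiting well-observed arms instead of trying the newcomer at once.
print(ucb_index(0.9, 3, 10_000, 1e-4), lcb_index(0.9, 3, 10_000, 1e-4), lcb_index(0.7, 500, 10_000, 1e-4))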

4.3.1 Regret Analysis of Conservative UCB

Let $G_{2\epsilon}$ denote the subgraph induced by the arms $\{i\in V:\mu(i^{*})-\mu(i)<2\epsilon\}$. The set of all independent dominating sets of $G_{2\epsilon}$ is denoted as $\mathcal{I}(G_{2\epsilon})$; $\mathcal{I}(G_{\epsilon})$ is defined analogously. Define $\Delta_{2\epsilon}^{min}=\min_{i,j\in G_{2\epsilon}}\{|\mu(i)-\mu(j)|\}$.

We divide the regret into two parts. The first part is the regret caused by choosing arms in $N_{j_{t}}$ when $i^{*}\notin N_{j_{t}}$; the analysis for this part follows the same approach as for D-UCB.

The second part is the regret of choosing suboptimal arms in $N_{j_{t}}$ when $i^{*}\in N_{j_{t}}$. It can be proven that if the optimal arm has been observed more than $\frac{4\log(\sqrt{2}T/\delta)}{(\Delta_{2\epsilon}^{min})^{2}}$ times, the algorithm selects the optimal arm with high probability. Intuitively, for any arm $i\in N_{j_{t}}$, $i\neq i^{*}$, the following events hold with high probability:

\bar{\mu}(i^{*})-\sqrt{\frac{\log(\sqrt{2}T/\delta)}{O(i^{*})}}>\mu(i^{*})-\Delta(i)=\mu(i), (7)

and

\bar{\mu}(i)-\sqrt{\frac{\log(\sqrt{2}T/\delta)}{O(i)}}<\mu(i). (8)

Since the optimal arm satisfies Equation 7 and the suboptimal arms satisfy Equation 8 with high probability, the suboptimal arms are unlikely to be selected.

The key to the problem lies in ensuring that the optimal arm is observed $\frac{4\log(\sqrt{2}T/\delta)}{(\Delta_{2\epsilon}^{min})^{2}}$ times. In the graph formed by $N_{j_{t}}$, all arms are connected to $j_{t}$, so as long as the horizon is sufficiently long, arm $j_{t}$ will inevitably be observed more than $\frac{4\log(\sqrt{2}T/\delta)}{(\Delta_{2\epsilon}^{min})^{2}}$ times; after that, the arms with means less than $\mu(j_{t})$ are ignored (by an argument similar to Equation 7 and Equation 8). Choosing any arm whose mean lies in $(\mu(j_{t}),\mu(i^{*}))$ always yields an observation of the optimal arm, since such an arm is $\epsilon$-similar to $i^{*}$, so the optimal arm eventually accumulates $\frac{4\log(\sqrt{2}T/\delta)}{(\Delta_{2\epsilon}^{min})^{2}}$ observations. Therefore, we have the following theorem:
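As a short supporting calculation (a sketch based on Lemma A.1, not taken verbatim from the paper's proof), the stated observation count indeed makes Equation 7 hold: with probability at least $1-\frac{\delta^{2}}{T^{2}}$,

\bar{\mu}(i^{*})\geq\mu(i^{*})-\sqrt{\frac{\log(\sqrt{2}T/\delta)}{O(i^{*})}},

and if $O(i^{*})\geq\frac{4\log(\sqrt{2}T/\delta)}{(\Delta_{2\epsilon}^{min})^{2}}\geq\frac{4\log(\sqrt{2}T/\delta)}{\Delta(i)^{2}}$ (here $\Delta(i)\geq\Delta_{2\epsilon}^{min}$ because both $i$ and $i^{*}$ lie in $G_{2\epsilon}$), the confidence radius is at most $\frac{\Delta(i)}{2}$, so

\bar{\mu}(i^{*})-\sqrt{\frac{\log(\sqrt{2}T/\delta)}{O(i^{*})}}\geq\mu(i^{*})-2\cdot\frac{\Delta(i)}{2}=\mu(i).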

Theorem 4.5.

Under the same conditions as Theorem 4.3, the regret of C-UCB is upper bounded by

R_{T}^{\pi}\leq\frac{32\epsilon\log(\sqrt{2}T)}{(\Delta_{2\epsilon}^{min})^{2}}+C_{2}\frac{\log(\sqrt{2}T)}{\epsilon}+\Delta_{max}+2\epsilon. (9)

The regret upper bound of C-UCB decreases by a logarithmic factor compared to D-UCB, but it carries an additional problem-dependent term involving $\Delta_{2\epsilon}^{min}$. This upper bound still does not match the lower bound in Theorem 4.2, as $\Delta_{2\epsilon}^{min}\leq\Delta_{min}$. Ignoring the terms independent of $T$, the regret upper bound of C-UCB matches the $\Omega(\log(T))$ lower bound.

4.4 UCB-N

UCB-N has been analyzed in the standard graph feedback model [Caron et al., 2012, Hu et al., 2020, Lykouris et al., 2020]. One may wonder whether UCB-N can achieve similar regret upper bounds. In fact, if UCB-N uses the same upper confidence function as ours, UCB-N has a similar regret upper bound to D-UCB. We have the following theorem:

Theorem 4.6.

Under the same conditions as Theorem 4.3, the regret of UCB-N is upper bounded by

R_{T}^{\pi}\leq\frac{32(\log(\sqrt{2}T))^{2}}{\Delta_{min}}+C_{3}\frac{\log(\sqrt{2}T)}{\epsilon}+\Delta_{max}+2\epsilon+1, (10)

where $C_{3}=8(\log(2\gamma(G))+\frac{\pi^{2}}{6})$.

Remark 4.7.

The gap-free upper bound of UCB-N is also $O(\sqrt{T}\log(T))$. Due to the similarity assumption, the regret upper bound of UCB-N improves over that of [Lykouris et al., 2020]. D-UCB and C-UCB are specifically designed for similarity feedback structures and may fail in the standard graph feedback setting: there, the optimal arm may be connected to an arm with a very small mean, so the neighborhood of the selected $j_{t}$ may not contain the optimal arm. However, under the ballooning setting, UCB-N cannot achieve sublinear regret, whereas D-UCB and C-UCB can be naturally applied and achieve sublinear regret under certain conditions.

5 Ballooning Environments

This section considers the setting where the number of arms increases over time. This problem is highly challenging, and prior research relied on strong assumptions to achieve sublinear regret. The graph feedback structure we propose is particularly effective in this setting. Intuitively, if an arrived arm has a mean very close to arms that have already been distinguished, the algorithm does not need to distinguish it further. As a result, the number of arrived arms that truly need to be distinguished may be significantly smaller than $T$, making it easier to obtain a sublinear regret bound.

5.1 Double-UCB for Ballooning Setting

1:  Input: Horizon $T$, $\delta\in(0,1)$
2:  Initialize $\mathcal{I}=\emptyset$; $\forall i$, $k(i)=0$, $O(i)=0$
3:  for $t=1$ to $T$ do
4:     Arm $a_{t}$ arrives
5:     Feedback graph $G$ is updated
6:     if $a_{t}\notin N_{\mathcal{I}}$ then
7:        $\mathcal{I}=\mathcal{I}\cup\{a_{t}\}$
8:     end if
9:     $j_{t}=\arg\max_{j\in\mathcal{I}}\bar{\mu}(j)+\sqrt{\frac{\log(\sqrt{2}T/\delta)}{O(j)}}$
10:     Pull arm $i_{t}=\arg\max_{i\in N_{j_{t}}}\bar{\mu}(i)+\sqrt{\frac{\log(\sqrt{2}T/\delta)}{O(i)}}$
11:     $\forall i\in N_{i_{t}}$, update $k_{t}(i),O_{t}(i),\bar{\mu}_{t}(i)$
12:  end for
Algorithm 3 Double-UCB for Ballooning Setting

Algorithm 3 shows the pseudocode of our method Double-UCB-BL. For any set $\mathcal{S}$, let $N_{\mathcal{S}}$ denote the set of arms linked to $\mathcal{S}$, i.e., $N_{\mathcal{S}}=\bigcup_{i\in\mathcal{S}}N_{i}$. Upon the arrival of each arm, we first check whether it is in $N_{\mathcal{I}}$; if it is not, we add it to $\mathcal{I}$ to form a new independent set. The independent set $\mathcal{I}$ is thus constructed online as arms arrive, while the other parts remain entirely consistent with Double-UCB.
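The step that differs from D-UCB is the online maintenance of $\mathcal{I}$ (Steps 4-8 of Algorithm 3). A minimal sketch from the environment's perspective, where adjacency is decided from the (hidden) means; the function names and the toy parameters are illustrative:

import numpy as np

def update_independent_set(ind_set, means_so_far, new_mean, eps):
    # add the newly arrived arm to I only if it is not covered by N_I,
    # i.e., not eps-similar to any arm already in I
    covered = any(abs(means_so_far[i] - new_mean) < eps for i in ind_set)
    if not covered:
        ind_set.append(len(means_so_far))   # index of the newly arrived arm
    means_so_far.append(new_mean)

# toy run: arms arrive with i.i.d. N(0,1) means
rng = np.random.default_rng(0)
ind_set, means = [], []
for t in range(10_000):
    update_independent_set(ind_set, means, rng.normal(), eps=0.1)
print(len(ind_set))   # stays far below T; Lemma A.3 bounds it with high probability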

5.1.1 Regret Analysis

Let $\mathcal{I}_{t}$ denote the independent set at time $t$ and let $\alpha_{t}^{*}\in\mathcal{I}_{t}$ denote the arm whose neighborhood contains the optimal arm $i_{t}^{*}$. Define $\mathcal{A}=\{a_{t}:t\in[T],a_{t}\in N_{\alpha_{t}^{*}}\}$. To simplify the problem, we only consider problem instances that satisfy the following assumption:

Assumption 5.1.

For all $i\neq j$, $\Delta_{min}^{T}\leq|\mu(i)-\mu(j)|\leq\Delta_{max}^{T}$, where $\Delta_{min}^{T}$ and $\Delta_{max}^{T}$ are some constants.

The first challenge in the ballooning setting is the potential existence of many arms with means very close to the optimal arm; that is, the set $\mathcal{A}$ may be very large. We first define a quantity that is easy to analyze and serves as an upper bound on the expected number of arms falling into $N_{\alpha_{t}^{*}}$. Define

M=\mathbb{E}\Big[\sum_{t=1}^{T}\mathbbm{1}\{|\mu(a_{t})-\mu(i_{t}^{*})|<2\epsilon\}\Big]=\sum_{t=1}^{T}\mathbb{P}(|\mu(a_{t})-\mu(i_{t}^{*})|<2\epsilon). (11)

It is easy to verify that $\{a_{t}\in N_{\alpha_{t}^{*}}\}\subseteq\{|\mu(a_{t})-\mu(i_{t}^{*})|<2\epsilon\}$, so $\mathbb{E}[|\mathcal{A}|]\leq M$.

The second challenge is that the regret is likely related to the independence number, yet under the ballooning setting the graph's independence number is a random variable. Denote the independence number by $\alpha(G_{T}^{\mathcal{P}})$. We address this issue by providing a high-probability upper bound on the independence number. Let $X,Y\stackrel{i.i.d.}{\sim}\mathcal{P}$; then

p=\mathbb{P}(|X-Y|\leq\epsilon)=\int_{-\epsilon}^{\epsilon}f_{X-Y}(z)dz, (12)

where $f_{X-Y}$ is the probability density function of $X-Y$. Let $b=\frac{1}{1-p}$. Lemma A.3 gives a high-probability upper bound on $\alpha(G_{T}^{\mathcal{P}})$ in terms of $b$.

Now, we can give the following upper bound of Double-UCB.

Theorem 5.2.

Assume that the reward distribution is $\frac{1}{2}$-subGaussian or bounded in $[0,1]$ and set $\delta=\frac{1}{T}$. The regret of Double-UCB after $T$ rounds is upper bounded by

R_{T}^{\pi}\leq 40\max\{\log_{b}T,1\}\Delta_{max}^{T}\frac{\log(\sqrt{2}T)}{\epsilon^{2}}+2\Delta_{max}^{T}+4\sqrt{6TM\log(\sqrt{2}T)}+2\epsilon+2T\epsilon e^{-M}. (13)

If $\mathcal{P}$ is a Gaussian distribution, we have the following corollary:

Corollary 5.3.

If $\mathcal{P}$ is the Gaussian distribution $\mathcal{N}(0,1)$, we have $M=O(\log(T)e^{2\epsilon\sqrt{2\log(T)}})$ and $M=\Omega(\log(T)\sqrt{\log(T)})$. The asymptotic regret upper bound is of order

O\Big(\log(T)\sqrt{Te^{2\epsilon\sqrt{2\log(T)}}}\Big).

The order of $e^{\sqrt{2\log(T)}}$ is smaller than any power of $T$. For example, if $T>e^{n}$ for a positive integer $n$, we have $e^{\sqrt{2\log(T)}}\leq T^{\sqrt{2/n}}$.
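A quick numerical check of this remark (our own illustration): taking $n=8$, the inequality $e^{\sqrt{2\log(T)}}\leq T^{1/2}$ should hold for all $T>e^{8}$.

import math

for T in [10**4, 10**6, 10**8, 10**10]:   # all larger than e^8
    lhs = math.exp(math.sqrt(2 * math.log(T)))
    rhs = math.sqrt(T)                     # T^{sqrt(2/8)} = T^{1/2}
    print(f"T={T:.0e}  e^sqrt(2 ln T)={lhs:8.1f}  sqrt(T)={rhs:10.1f}  ok={lhs <= rhs}")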

Lower Bounds of Double-UCB. Define $\mathcal{B}=\{a_{t}:t\in[T],\frac{\epsilon}{2}<\mu(i_{t}^{*})-\mu(a_{t})<\epsilon\}$; then $\mathcal{B}\subset\mathcal{A}$. Let

B=\mathbb{E}\Big[\sum_{t=1}^{T}\mathbbm{1}\{\frac{\epsilon}{2}<\mu(i_{t}^{*})-\mu(a_{t})<\epsilon\}\Big].

We have $\mathbb{E}[|\mathcal{B}|]=B$. If arm $a_{t}$ arrives and falls into the set $\mathcal{B}$ at round $t$, then the arm will be selected at least once, unless the algorithm selects $j_{t}\neq\alpha_{t}^{*}$ in Step 9. Note that if an arm in $\mathcal{B}$ is selected at some round, the resulting regret is at least $\frac{\epsilon}{2}$. Once we estimate the size of $\mathcal{B}$ and the number of rounds with $j_{t}\neq\alpha_{t}^{*}$, a simple regret lower bound can be obtained. We have the following lower bound:

Theorem 5.4.

Assume that the reward distribution is $\frac{1}{2}$-subGaussian or bounded in $[0,1]$ and set $\delta=\frac{1}{T}$. The regret of Double-UCB is lower bounded by

R_{T}^{\pi}\geq\frac{B\epsilon}{4}(1-e^{-B/8})-20\max\{\log_{b}T,1\}\frac{\log(\sqrt{2}T)}{\epsilon}-\epsilon. (14)

If $\mathcal{P}$ is the uniform distribution $U(0,1)$, we can calculate that $B\geq\frac{(1-\epsilon)\epsilon}{2}T$. If $\mathcal{P}$ is the half-triangle distribution with probability density function $f(x)=2(1-x)\mathbbm{1}\{0<x<1\}$, we can calculate that $B\geq\frac{3\epsilon^{2}(1-\epsilon)^{2}}{4}T$. Therefore, the regret lower bounds are of linear order in both cases.

1:  Input: Horizon $T$, $\delta\in(0,1)$
2:  Initialize $\mathcal{I}=\emptyset$; $\forall i$, $k(i)=0$, $O(i)=0$
3:  for $t=1$ to $T$ do
4:     Steps 4-9 in Double-UCB-BL
5:     Pull arm $i_{t}=\arg\max_{i\in N_{j_{t}}}\bar{\mu}(i)-\sqrt{\frac{\log(\sqrt{2}T/\delta)}{O(i)}}$
6:     $\forall i\in N_{i_{t}}$, update $k_{t}(i),O_{t}(i),\bar{\mu}_{t}(i)$
7:  end for
Algorithm 4 C-UCB for Ballooning Setting

5.2 C-UCB for Ballooning Setting

The failure on the common uniform distribution limits D-UCB's applicability. The fundamental reason is the aggressive exploration strategy of the UCB rule, which tries to explore every arm that enters the set $\mathcal{A}$ as much as possible. In this section, we apply C-UCB to the ballooning setting; it imposes no requirements on the distribution $\mathcal{P}$.

Algorithm 4 shows the pseudocode of Conservative UCB (C-UCB) for this setting. The algorithm is almost identical to D-UCB, the only change being that the rule for selecting an arm is $\arg\max_{i\in N_{j_{t}}}\bar{\mu}(i)-\sqrt{\frac{\log(\sqrt{2}T/\delta)}{O(i)}}$. This change avoids exploring every arm that enters $N_{j_{t}}$: an arm is chosen only if it has been observed a sufficient number of times and its lower confidence bound is the largest. We have the following regret upper bound for C-UCB:

Theorem 5.5.

Assume that the reward distribution is $\frac{1}{2}$-subGaussian or bounded in $[0,1]$ and set $\delta=\frac{1}{T}$. The regret of C-UCB is upper bounded by

R_{T}^{\pi}\leq 40\max\{\log_{b}T,1\}\Delta_{max}^{T}\frac{\log(\sqrt{2}T)}{\epsilon^{2}}+2\Delta_{max}^{T}+\frac{96\epsilon(\log(eT))^{2}}{(\Delta_{min}^{T})^{2}}+4\epsilon. (15)

In the ballooning setting the arms arrive one by one and the optimal arm may change over time; therefore, the regret upper bound depends on $\Delta_{min}^{T}$ rather than $\Delta_{2\epsilon}^{min}$. Compared to Theorem 5.2, the upper bound for C-UCB does not involve $M$, making it independent of the distribution $\mathcal{P}$. Under Assumption 5.1, C-UCB achieves a regret upper bound of $O((\log(T))^{2})$.

Figure 1: "UCB-N ($\epsilon=0.1$)": graph feedback with the similarity structure. "UCB-N-Standard ($\epsilon=0.1$)": graph feedback without the similarity structure, but with a graph whose independence number is roughly the same as in the former setting. Settings with $T=10^{5}$, $K=10^{4}$. (a) $\epsilon=0.1,0.2$ for Gaussian rewards. (b) $\epsilon=0.05,0.1$ for Bernoulli rewards.

6 Experiments

6.1 Stationary Settings

We first compare the performance of UCB-N under standard graph feedback and under graph feedback with similar arms (our code is available at https://github.com/qh1874/GraphBandits_SimilarArms). The purpose of this experiment is to show that the similarity structure improves the performance of the original UCB algorithm. To ensure fairness, the problem instances we use in both cases have roughly the same independence number. In the standard graph feedback case, we use a random graph, generating edges with the probability calculated by Equation 12; the graph generated in this way has roughly the same independence number as the graph in the $\epsilon$-similarity setting. In particular, if $\mathcal{P}$ is the Gaussian distribution $\mathcal{N}(0,1)$, then

p=2\Phi\big(\tfrac{\epsilon}{\sqrt{2}}\big)-1.

If $\mathcal{P}$ is the uniform distribution $U(0,1)$,

p=1-(1-\epsilon)^{2}.
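A sketch of how the edge probability and the matched comparison graph can be generated (our own illustration of the setup described above; the helper names are ours, and this is not the released experiment script):

import numpy as np
from scipy.stats import norm

def edge_prob(dist, eps):
    # p = P(|X - Y| <= eps) for X, Y i.i.d. from the mean distribution (Equation 12)
    if dist == "gaussian":          # X - Y ~ N(0, 2)
        return 2 * norm.cdf(eps / np.sqrt(2)) - 1
    if dist == "uniform":           # X, Y ~ U(0, 1)
        return 1 - (1 - eps) ** 2
    raise ValueError(dist)

def matched_random_graph(K, p, rng):
    # Erdos-Renyi-style comparison graph with edge probability p and self-loops,
    # used as the "standard graph feedback" baseline
    upper = np.triu(rng.random((K, K)) < p, 1)
    return upper | upper.T | np.eye(K, dtype=bool)

rng = np.random.default_rng(0)
print(edge_prob("gaussian", 0.1), edge_prob("uniform", 0.1))
A = matched_random_graph(1000, edge_prob("gaussian", 0.1), rng)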

For each value of $\epsilon$, we generate 50 different problem instances, and the expected regret is averaged over the 50 instances. The 95% confidence interval is shown as a semi-transparent region in the figures. Figure 1 shows the performance of UCB-N under Gaussian rewards. It can be observed that the regret of UCB-N in our setting is smaller than under standard graph feedback, thanks to the similarity structure. Additionally, the regret decreases as $\epsilon$ increases, consistent with the theoretical results.

Figure 2: Settings with $T=10^{6}$, $K=10^{5}$, $\epsilon=0.01$. Bernoulli rewards (a), Gaussian rewards (b).

We then compare the performance of the UCB-N, D-UCB, and C-UCB algorithms. Figure 2 shows the performance of the three algorithms with Gaussian and Bernoulli rewards. Although D-UCB and UCB-N have similar regret bounds, the experimental performance of D-UCB and C-UCB is better than that of UCB-N. This may be because D-UCB and C-UCB learn directly on an independent set, effectively leveraging the graph structure of similar arms.

6.2 Ballooning Settings

Figure 3: Ballooning setting. (a) Gaussian arms with $\mathcal{P}$ as $\mathcal{N}(0,1)$ and $\epsilon=0.1,0.2$. (b) Bernoulli arms with $\mathcal{P}$ as $U(0,1)$ and $\epsilon=0.05,0.1$. (c) Bernoulli arms with $\mathcal{P}$ being the half-triangle distribution and $\epsilon=0.05,0.1$.

UCB-N is not suitable for the ballooning setting since it would select each arrived arm at least once. The BL-MOSS algorithm [Ghalme et al., 2021] is specifically designed for the ballooning setting. However, this algorithm assumes that the optimal arm is more likely to appear in the early rounds and requires prior knowledge of the parameter $\lambda$ characterizing this likelihood, which is not consistent with our setting. We therefore only compare D-UCB and C-UCB under different distributions $\mathcal{P}$.

Figure 3 shows the experimental results for the ballooning setting. When $\mathcal{P}$ follows a standard normal distribution, D-UCB and C-UCB exhibit similar performance. However, when $\mathcal{P}$ is the uniform distribution $U(0,1)$ or the half-triangle distribution with distribution function $1-(1-x)^{2}$, D-UCB fails to achieve sublinear regret, while C-UCB still performs well. These results are consistent across different values of $\epsilon$.

7 Conclusion

In this paper, we have introduced a new graph feedback bandit model with similar arms. For this model, we proposed two UCB-based algorithms (D-UCB and C-UCB) and provided regret upper bounds. We then extended these two algorithms to the ballooning setting, in which C-UCB applies more broadly than D-UCB: D-UCB only achieves sublinear regret when the mean distribution is Gaussian, while C-UCB achieves problem-dependent sublinear regret regardless of the mean distribution.

References

  • Abramowitz et al. [1988] Milton Abramowitz, Irene A Stegun, and Robert H Romer. Handbook of mathematical functions with formulas, graphs, and mathematical tables, 1988.
  • Allan and Laskar [1978] Robert B Allan and Renu Laskar. On domination and independent domination numbers of a graph. Discrete mathematics, 23(2):73–76, 1978.
  • Alon et al. [2017] Noga Alon, Nicolo Cesa-Bianchi, Claudio Gentile, Shie Mannor, Yishay Mansour, and Ohad Shamir. Nonstochastic multi-armed bandits with graph-structured feedback. SIAM Journal on Computing, 46(6):1785–1826, 2017.
  • Auer and Ortner [2010] Peter Auer and Ronald Ortner. Ucb revisited: Improved regret bounds for the stochastic multi-armed bandit problem. Periodica Mathematica Hungarica, 61(1-2):55–65, 2010.
  • Auer et al. [2002a] Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine learning, 47:235–256, 2002a.
  • Auer et al. [2002b] Peter Auer, Nicolo Cesa-Bianchi, Yoav Freund, and Robert E Schapire. The nonstochastic multiarmed bandit problem. SIAM journal on computing, 32(1):48–77, 2002b.
  • Aziz et al. [2021] Maryam Aziz, Emilie Kaufmann, and Marie-Karelle Riviere. On multi-armed bandit designs for dose-finding clinical trials. The Journal of Machine Learning Research, 22(1):686–723, 2021.
  • Buccapatnam et al. [2014] Swapna Buccapatnam, Atilla Eryilmaz, and Ness B Shroff. Stochastic bandits with side observations on networks. In The 2014 ACM international conference on Measurement and modeling of computer systems, pages 289–300, 2014.
  • Buccapatnam et al. [2018] Swapna Buccapatnam, Fang Liu, Atilla Eryilmaz, and Ness B Shroff. Reward maximization under uncertainty: Leveraging side-observations on networks. Journal of Machine Learning Research, 18(216):1–34, 2018.
  • Caron et al. [2012] Stéphane Caron, Branislav Kveton, Marc Lelarge, and Smriti Bhagat. Leveraging side observations in stochastic bandits. arXiv preprint arXiv:1210.4839, 2012.
  • Chakrabarti et al. [2008] Deepayan Chakrabarti, Ravi Kumar, Filip Radlinski, and Eli Upfal. Mortal multi-armed bandits. Advances in neural information processing systems, 21, 2008.
  • Cohen et al. [2016] Alon Cohen, Tamir Hazan, and Tomer Koren. Online learning with feedback graphs without the graphs. In International Conference on Machine Learning, pages 811–819. PMLR, 2016.
  • Esposito et al. [2022] Emmanuel Esposito, Federico Fusco, Dirk van der Hoeven, and Nicolò Cesa-Bianchi. Learning on the edge: Online learning with stochastic feedback graphs. Advances in Neural Information Processing Systems, 35:34776–34788, 2022.
  • Even-Dar et al. [2006] Eyal Even-Dar, Shie Mannor, Yishay Mansour, and Sridhar Mahadevan. Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. Journal of machine learning research, 7(6), 2006.
  • Gentile et al. [2014] Claudio Gentile, Shuai Li, and Giovanni Zappella. Online clustering of bandits. In International conference on machine learning, pages 757–765. PMLR, 2014.
  • Ghalme et al. [2021] Ganesh Ghalme, Swapnil Dhamal, Shweta Jain, Sujit Gujar, and Y Narahari. Ballooning multi-armed bandits. Artificial Intelligence, 296:103485, 2021.
  • Ghari and Shen [2022] Pouya M Ghari and Yanning Shen. Online learning with probabilistic feedback. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4183–4187. IEEE, 2022.
  • Hu et al. [2020] Bingshan Hu, Nishant A Mehta, and Jianping Pan. Problem-dependent regret bounds for online learning with feedback graphs. In Uncertainty in Artificial Intelligence, pages 852–861. PMLR, 2020.
  • Lai et al. [1985] Tze Leung Lai, Herbert Robbins, et al. Asymptotically efficient adaptive allocation rules. Advances in applied mathematics, 6(1):4–22, 1985.
  • Li et al. [2011] Lihong Li, Wei Chu, John Langford, and Xuanhui Wang. Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms. In Proceedings of the fourth ACM international conference on Web search and data mining, pages 297–306, 2011.
  • Li et al. [2016] Shuai Li et al. The art of clustering bandits. 2016.
  • Lykouris et al. [2020] Thodoris Lykouris, Eva Tardos, and Drishti Wali. Feedback graph regret bounds for thompson sampling and ucb. In Algorithmic Learning Theory, pages 592–614. PMLR, 2020.
  • Mannor and Shamir [2011] Shie Mannor and Ohad Shamir. From bandits to experts: On the value of side-observations. Advances in Neural Information Processing Systems, 24, 2011.
  • Pérez et al. [2020] Adrià Pérez, Pablo Herrera-Nieto, Stefan Doerr, and Gianni De Fabritiis. Adaptivebandit: a multi-armed bandit framework for adaptive sampling in molecular simulations. Journal of chemical theory and computation, 16(7):4685–4693, 2020.
  • Schwartz et al. [2017] Eric M Schwartz, Eric T Bradlow, and Peter S Fader. Customer acquisition via display advertising using multi-armed bandit experiments. Marketing Science, 36(4):500–522, 2017.
  • Villar et al. [2015] Sofía S Villar, Jack Bowden, and James Wason. Multi-armed bandit models for the optimal design of clinical trials: benefits and challenges. Statistical science: a review journal of the Institute of Mathematical Statistics, 30(2):199, 2015.
  • Wang et al. [2024] Zhiyong Wang, Jize Xie, Xutong Liu, Shuai Li, and John Lui. Online clustering of bandits with misspecified user models. Advances in Neural Information Processing Systems, 36, 2024.
  • Yang et al. [2022] Junwen Yang, Zixin Zhong, and Vincent YF Tan. Optimal clustering with bandit feedback. arXiv preprint arXiv:2202.04294, 2022.

Proofs of Theorems and Corollaries

In this appendix, we provide detailed proofs of the theorems and corollaries in the main text. Specifically, we provide detailed proofs of Proposition 4.1, Theorem 4.3, Theorem 4.6, Theorem 5.2, Corollary 5.3 and Theorem 5.5. The proof of Theorem 4.5 can be easily obtained from the analysis of Theorem 5.5, and the proof of Theorem 5.4 is similar to that of Theorem 5.2; we omit their proofs. Beforehand, we give some well-known results to simplify the proofs.

Appendix A Facts and Lemmas

Lemma A.1.

Assume that $X_{i}$ are independent random variables with $\mu=\mathbb{E}[X_{i}]$ and $\bar{\mu}=\frac{1}{n}\sum_{i=1}^{n}X_{i}$. If $X_{i}$ is bounded in $[0,1]$ or $\frac{1}{2}$-subGaussian, then for any $\delta\in(0,1)$, with probability at least $1-\frac{\delta^{2}}{T^{2}}$,

|\bar{\mu}-\mu|\leq\sqrt{\frac{\log(\sqrt{2}T/\delta)}{n}}.
Lemma A.2.

Allan and Laskar [1978] If $G$ is a claw-free graph, then $\gamma(G)=i(G)$.

Lemma A.3.

Assume that $\mu(a_{t})\stackrel{i.i.d.}{\sim}\mathcal{P}$. Let $G_{T}^{\mathcal{P}}$ denote the graph constructed from $\mu(a_{1}),\mu(a_{2}),...,\mu(a_{T})$, and let $\alpha(G_{T}^{\mathcal{P}})$ be the independence number of $G_{T}^{\mathcal{P}}$. Then

\mathbb{P}(\alpha(G_{T}^{\mathcal{P}})\geq 5\max\{\log_{b}T,1\})\leq\frac{1}{T^{5}}, (16)

where $b$ is some constant related to $\mathcal{P}$.

Proof.

Let $X,Y\stackrel{i.i.d.}{\sim}\mathcal{P}$; then

\mathbb{P}(|X-Y|\leq\epsilon)=\int_{-\epsilon}^{\epsilon}f_{X-Y}(z)dz=p,

where $f_{X-Y}$ is the probability density function of $X-Y$. This means that in $G_{T}^{\mathcal{P}}$ the probability of any two nodes being connected by an edge is $p$; hence, $G_{T}^{\mathcal{P}}$ is a random graph.

Let $Z_{k}$ be the number of independent sets of order $k$. Let $b=\frac{1}{1-p}$ and $k=\lceil 5\log_{b}T\rceil$. Then

\mathbb{P}(\alpha(G_{T}^{\mathcal{P}})\geq 5\log_{b}T)\leq\mathbb{P}(Z_{k}\geq 1)\leq\mathbb{E}[Z_{k}]=\binom{T}{k}(1-p)^{\binom{k}{2}}\stackrel{(a)}{\leq}\Big(\frac{Te}{k\sqrt{1-p}}(1-p)^{k/2}\Big)^{k}\leq\Big(\frac{e\sqrt{b}}{k}\Big)^{k}\Big(\frac{1}{T^{1.5}}\Big)^{k}, (17)

where $(a)$ uses the fact that $\binom{T}{k}\leq\big(\frac{Te}{k}\big)^{k}$.

If $b<T$, then $k\geq 5>e$. We have

\mathbb{P}(\alpha(G_{T}^{\mathcal{P}})\geq 5\log_{b}T)\leq\Big(\frac{e\sqrt{b}}{k}\Big)^{k}\Big(\frac{1}{T^{1.5}}\Big)^{k}=\Big(\frac{e\sqrt{b}}{k\sqrt{T}}\Big)^{k}\Big(\frac{1}{T}\Big)^{k}\leq\frac{1}{T^{5}}.

If $b\geq T$, i.e., $1-p\leq\frac{1}{T}$, we have

\mathbb{P}(\alpha(G_{T}^{\mathcal{P}})\geq 5)\leq\mathbb{P}(Z_{5}\geq 1)\leq\binom{T}{5}(1-p)^{\binom{5}{2}}\leq\frac{1}{T^{5}}.

Therefore,

\mathbb{P}(\alpha(G_{T}^{\mathcal{P}})\geq 5\max\{\log_{b}T,1\})\leq\frac{1}{T^{5}}.

Lemma A.4 (Chernoff Bounds).

Let $X_{i}$ be independent Bernoulli random variables. Let $X$ denote their sum and let $\mu=\mathbb{E}[X]$ denote the sum's expected value. Then

\mathbb{P}(X\geq(1+\delta)\mu)\leq e^{-\frac{\delta^{2}\mu}{2+\delta}},\quad\delta\geq 0,

\mathbb{P}(X\leq(1-\delta)\mu)\leq e^{-\frac{\delta^{2}\mu}{2}},\quad 0<\delta<1.
Lemma A.5.

Abramowitz et al. [1988] For a Gaussian random variable $X$ with mean $\mu$ and variance $\sigma^{2}$ and any $a>0$,

\frac{1}{\sqrt{2\pi}}\frac{a}{1+a^{2}}e^{-\frac{a^{2}}{2}}\leq\mathbb{P}(X-\mu>a\sigma)\leq\frac{1}{a+\sqrt{a^{2}+4}}e^{-\frac{a^{2}}{2}}.

Appendix B Proofs of Proposition 4.1

(1) We first prove $\gamma(G)=i(G)$. A claw-free graph is a graph that does not have a claw as an induced subgraph, i.e., it contains no induced subgraph isomorphic to $K_{1,3}$. By Lemma A.2, we just need to prove that $G$ is claw-free.

Assume $G$ has a claw, i.e., there exist nodes $a,b,c,d$ such that $a$ is connected to $b,c,d$ while $b,c,d$ are mutually non-adjacent. Consider nodes $b$ and $c$: one of them must have a mean greater than $\mu(a)$ and the other a mean smaller than $\mu(a)$; otherwise the difference between the means of $b$ and $c$ would be smaller than $\epsilon$ and an edge would form between them. Since $d$ is connected to $a$, its mean lies within $\epsilon$ of $\mu(a)$, which forces an edge between $d$ and $b$ or between $d$ and $c$. This is a contradiction. Therefore, $G$ is claw-free.

(2) $\alpha(G)\leq 2i(G)$. Let $I^{*}$ be a maximum independent set and $I$ a minimum independent dominating set. Note that any vertex of $I$ is adjacent (counting the vertex itself as part of its neighborhood) to at most two vertices in $I^{*}$, and that each vertex of $I^{*}$ is adjacent to at least one vertex of $I$. So by a double counting argument, counting for each vertex of $I^{*}$ one adjacent vertex in $I$, each vertex of $I$ is counted at most twice, and hence $|I^{*}|\leq 2|I|$.

Appendix C Proofs of Theorem 4.3

Let $\mathcal{I}$ denote the independent set obtained after running Steps 4-9 in Algorithm 1. The obtained $\mathcal{I}$ may vary between runs. We first fix $\mathcal{I}$ for the analysis and then take the supremum of the results over $\mathcal{I}$, obtaining an upper bound independent of $\mathcal{I}$.

Let $\alpha^{*}\in\mathcal{I}$ denote the arm whose neighborhood contains the optimal arm, i.e., $i^{*}\in N_{\alpha^{*}}$, and write $\mathcal{I}=\{\alpha_{1},\alpha_{2},...,\alpha^{*},...,\alpha_{|\mathcal{I}|}\}$. The regret can be divided into two parts: one part comes from selecting arms $i\notin N_{\alpha^{*}}$ and the other from selecting arms $i\in N_{\alpha^{*}}$:

\sum_{t=1}^{T}\sum_{i\in V}\Delta_{i}\mathbbm{1}\{i_{t}=i\}=\sum_{t=1}^{T}\sum_{i\notin N_{\alpha^{*}}}\Delta_{i}\mathbbm{1}\{i_{t}=i\}+\sum_{t=1}^{T}\sum_{i\in N_{\alpha^{*}}}\Delta_{i}\mathbbm{1}\{i_{t}=i\} (18)

We first focus on the expected regret incurred by the first part. Let $\Delta_{\alpha_{j}}^{\prime}=\mu(\alpha^{*})-\mu(\alpha_{j})$, and let $j_{t}\in\mathcal{I}$ denote the arm from the independent set that is linked to the selected arm $i_{t}$ (Step 10 in Algorithm 1).

\sum_{t=1}^{T}\sum_{i\notin N_{\alpha^{*}}}\Delta_{i}\mathbbm{1}\{i_{t}=i\}=\sum_{j=1}^{|\mathcal{I}|}\sum_{t=1}^{T}\sum_{i\in N_{\alpha_{j}}}\Delta_{i}\mathbbm{1}\{i_{t}=i\}\leq\sum_{j=1}^{|\mathcal{I}|}(\Delta_{\alpha_{j}}^{\prime}+2\epsilon)\sum_{t=1}^{T}\mathbbm{1}\{j_{t}=\alpha_{j}\}. (19)

The last inequality uses the following two facts:

\Delta_{i}=\mu(i^{*})-\mu(i)=\mu(i^{*})-\mu(\alpha^{*})+\mu(\alpha^{*})-\mu(\alpha_{j})+\mu(\alpha_{j})-\mu(i)\leq\Delta_{\alpha_{j}}^{\prime}+2\epsilon,

and

\sum_{t=1}^{T}\sum_{i\in N_{\alpha_{j}}}\mathbbm{1}\{i_{t}=i\}=\sum_{t=1}^{T}\mathbbm{1}\{j_{t}=\alpha_{j}\}.

Recall that $O_{t}(i)$ denotes the number of observations of arm $i$ up to time $t$. Let $c_{t}(i)=\sqrt{\frac{2\log(T^{2}/\delta)}{O_{t}(i)}}$, let $\bar{X}_{s}(i)$ denote the average reward of arm $i$ after it has been observed $s$ times, and let $c_{s}(i)=\sqrt{\frac{2\log(T^{2}/\delta)}{s}}$. For any $\alpha_{j}\in\mathcal{I}$,

\sum_{t=1}^{T}\mathbbm{1}\{j_{t}=\alpha_{j}\}\leq\ell_{\alpha_{j}}+\sum_{t=1}^{T}\mathbbm{1}\{j_{t}=\alpha_{j},O_{t}(\alpha_{j})\geq\ell_{\alpha_{j}}\} (20)
\leq\ell_{\alpha_{j}}+\sum_{t=1}^{T}\mathbbm{1}\{\bar{\mu}_{t}(\alpha_{j})+c_{t}(\alpha_{j})\geq\bar{\mu}_{t}(\alpha^{*})+c_{t}(\alpha^{*}),O_{t}(\alpha_{j})\geq\ell_{\alpha_{j}}\}
\leq\ell_{\alpha_{j}}+\sum_{t=1}^{T}\mathbbm{1}\{\max_{\ell_{\alpha_{j}}\leq s_{j}\leq t}\bar{X}_{s_{j}}(\alpha_{j})+c_{s_{j}}(\alpha_{j})\geq\min_{1\leq s\leq t}\bar{X}_{s}(\alpha^{*})+c_{s}(\alpha^{*})\}
\leq\ell_{\alpha_{j}}+\sum_{t=1}^{T}\sum_{s=1}^{t}\sum_{s_{j}=\ell_{\alpha_{j}}}^{t}\mathbbm{1}\{\bar{X}_{s_{j}}(\alpha_{j})+c_{s_{j}}(\alpha_{j})\geq\bar{X}_{s}(\alpha^{*})+c_{s}(\alpha^{*})\}

Following the same argument as in Auer et al. [2002a] and choosing $\ell_{\alpha_{j}}=\frac{4\log(\sqrt{2}T/\delta)}{(\Delta_{\alpha_{j}}^{\prime})^{2}}$, we have

(X¯sj(αj)+csj(αj)X¯s(α)+cs(α))(X¯s(α)μ(α)cs(α))+(X¯sj(αj)μ(αj)+csj(αj))\mathbb{P}(\bar{X}_{s_{j}}(\alpha_{j})+c_{s_{j}}(\alpha_{j})\geq\bar{X}_{s}(\alpha^{*})+c_{s}(\alpha^{*}))\leq\mathbb{P}(\bar{X}_{s}(\alpha^{*})\leq\mu(\alpha^{*})-c_{s}(\alpha^{*}))+\mathbb{P}(\bar{X}_{s_{j}}(\alpha_{j})\geq\mu(\alpha_{j})+c_{s_{j}}(\alpha_{j})).

From Lemma A.1,

(X¯s(α)μ(α)cs(α))δ22T2\mathbb{P}(\bar{X}_{s}(\alpha^{*})\leq\mu(\alpha^{*})-c_{s}(\alpha^{*}))\leq\frac{\delta^{2}}{2T^{2}}.

The same bound holds for the second probability. Hence,

t=1T(jt=αj)4log(2T/δ)(Δαj)2+Tδ2.\sum_{t=1}^{T}\mathbb{P}(j_{t}=\alpha_{j})\leq\frac{4\log(\sqrt{2}T/\delta)}{(\Delta_{\alpha_{j}}^{\prime})^{2}}+T\delta^{2}.

Plugging this into Equation 19, we get

t=1TiNαΔi(it=i)\displaystyle\sum_{t=1}^{T}\sum_{i\notin N_{\alpha^{*}}}\Delta_{i}\mathbb{P}(i_{t}=i) j=1||((Δαj+2ϵ)4log(2T/δ)(Δαj)2+ΔmaxTδ2)\displaystyle\leq\sum_{j=1}^{|\mathcal{I}|}\Bigg{(}\frac{(\Delta_{\alpha_{j}}^{\prime}+2\epsilon)4\log(\sqrt{2}T/\delta)}{(\Delta_{\alpha_{j}}^{\prime})^{2}}+\Delta_{max}T\delta^{2}\Bigg{)} (21)
=j=1||(1Δαj+2ϵ(Δαj)2)4log(2T/δ)+j=1||ΔmaxTδ2\displaystyle=\sum_{j=1}^{|\mathcal{I}|}(\frac{1}{\Delta_{\alpha_{j}}^{\prime}}+\frac{2\epsilon}{(\Delta_{\alpha_{j}}^{\prime})^{2}})4\log(\sqrt{2}T/\delta)+\sum_{j=1}^{|\mathcal{I}|}\Delta_{max}T\delta^{2}
4log(2T/δ)ϵ(log(α(G))+π23)+α(G)ΔmaxTδ2.\displaystyle\leq\frac{4\log(\sqrt{2}T/\delta)}{\epsilon}(\log(\alpha(G))+\frac{\pi^{2}}{3})+\alpha(G)\Delta_{max}T\delta^{2}.

The last inequality uses that the arms in \mathcal{I} are pairwise more than ϵ\epsilon apart, so that, after sorting the gaps in increasing order, the jj-th smallest gap satisfies Δαjjϵ\Delta_{\alpha_{j}}^{\prime}\geq j\epsilon; the first sum is then bounded by a harmonic sum of order log(α(G))/ϵ\log(\alpha(G))/\epsilon, and the second sum is at most 2ϵj1/(jϵ)2π2/(3ϵ)2\epsilon\sum_{j}1/(j\epsilon)^{2}\leq\pi^{2}/(3\epsilon).

Now we focus on the second part in Equation 18.

For any iNαi\in N_{\alpha^{*}}, we have Δi2ϵ\Delta_{i}\leq 2\epsilon. This means the gap between these suboptimal arms and the optimal arm is bounded. Therefore, this part can be seen as running UCB-N [Lykouris et al., 2020] on the subgraph formed by NαN_{\alpha^{*}}, and we can directly use their results after adjusting some constant factors. Following Theorem 6 in Lykouris et al. [2020], this part has a regret upper bound of

16log(2T/δ)log(T)maxI(Nα)iI\{i}1Δi+2ϵTδ2+1+2ϵ.16\cdot\log(\sqrt{2}T/\delta)\log(T)\max_{I\in\mathcal{I}(N_{\alpha^{*}})}\sum_{i\in I\backslash\{i^{*}\}}\frac{1}{\Delta_{i}}+2\epsilon T\delta^{2}+1+2\epsilon. (22)

Let δ=1T\delta=\frac{1}{T}. Combining Equation 21 and Equation 22 and using Proposition 4.1 that α(G)2γ(G)\alpha(G)\leq 2\gamma(G), we have

RTπ\displaystyle R_{T}^{\pi} 4log(2T/δ)ϵ(log(α(G))+π23)+16log(2T/δ)log(T)maxI(Nα)iI\{i}1Δi+Tδ2(α(G)Δmax+2ϵ)+1+2ϵ\displaystyle\leq\frac{4\log(\sqrt{2}T/\delta)}{\epsilon}(\log(\alpha(G))+\frac{\pi^{2}}{3})+16\cdot\log(\sqrt{2}T/\delta)\log(T)\max_{I\in\mathcal{I}(N_{\alpha^{*}})}\sum_{i\in I\backslash\{i^{*}\}}\frac{1}{\Delta_{i}}+T\delta^{2}(\alpha(G)\Delta_{max}+2\epsilon)+1+2\epsilon (23)
8log(2T)ϵ(log(2γ(G))+π23)+32log(2T)log(T)maxI(i)iI\{i}1Δi+Δmax+4ϵ+1.\displaystyle\leq\frac{8\log(\sqrt{2}T)}{\epsilon}(\log(2\gamma(G))+\frac{\pi^{2}}{3})+32\cdot\log(\sqrt{2}T)\log(T)\max_{I\in\mathcal{I}(i^{*})}\sum_{i\in I\backslash\{i^{*}\}}\frac{1}{\Delta_{i}}+\Delta_{max}+4\epsilon+1.
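
The reduction to UCB-N used above relies only on the fact that pulling an arm also reveals the rewards of all ϵ\epsilon-similar arms. The following Python sketch is a minimal simulation of a plain UCB index with such side observations; it is an illustrative stand-in rather than Algorithm 1 itself (the means, ϵ\epsilon, horizon, and noise model are arbitrary choices), and is meant only to make the observation counts Ot(i)O_{t}(i) used in this proof concrete.

import math
import random

def ucb_with_similarity_feedback(mus, eps, T, seed=0):
    rng = random.Random(seed)
    K = len(mus)
    obs, sums = [0] * K, [0.0] * K
    regret, best = 0.0, max(mus)
    for t in range(1, T + 1):
        # UCB index; arms never observed get infinite priority.
        idx = [float("inf") if obs[i] == 0
               else sums[i] / obs[i] + math.sqrt(2 * math.log(t) / obs[i])
               for i in range(K)]
        i_t = max(range(K), key=lambda i: idx[i])
        regret += best - mus[i_t]
        # Graph feedback: a reward is revealed for every arm similar to the pulled one.
        for j in range(K):
            if abs(mus[j] - mus[i_t]) <= eps:
                obs[j] += 1
                sums[j] += rng.gauss(mus[j], 1.0)
    return regret

print(ucb_with_similarity_feedback([0.1 * k for k in range(10)], eps=0.15, T=5000))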

Appendix D Proofs of Theorem 4.6

We only need to consider Δi\Delta_{i} falling in the intervals

[0,ϵ),[ϵ,2ϵ),,[kϵ,(k+1)ϵ),[0,\epsilon),[\epsilon,2\epsilon),...,[k\epsilon,(k+1)\epsilon),...

The regret for Δi\Delta_{i} in [ϵ,2ϵ),,[kϵ,(k+1)ϵ),[\epsilon,2\epsilon),...,[k\epsilon,(k+1)\epsilon),… can be bounded by the same method as in the proof of Theorem 4.3; the regret of this part is of order O(log(2T)ϵ)O(\frac{\log(\sqrt{2}T)}{\epsilon}).

Recall that GϵG_{\epsilon} denotes the subgraph induced by the arms {iV:μ(i)μ(i)<ϵ}\{i\in V:\mu(i^{*})-\mu(i)<\epsilon\}. The set of all independent dominating sets of GϵG_{\epsilon} is denoted by (Gϵ)\mathcal{I}(G_{\epsilon}). The regret for Δi\Delta_{i} in [0,ϵ)[0,\epsilon) can be bounded as

32(log(2T))2maxI(Gϵ)iI\{i}1Δi.32(\log(\sqrt{2}T))^{2}\max_{I\in\mathcal{I}(G_{\epsilon})}\sum_{i\in I\backslash\{i^{*}\}}\frac{1}{\Delta_{i}}.

Due to the similarity assumption, the means of all arms in GϵG_{\epsilon} are within ϵ\epsilon of each other, so GϵG_{\epsilon} is a complete graph and every independent set of GϵG_{\epsilon} consists of a single vertex. Therefore,

maxI(Gϵ)iI\{i}1Δi1Δmin.\max_{I\in\mathcal{I}(G_{\epsilon})}\sum_{i\in I\backslash\{i^{*}\}}\frac{1}{\Delta_{i}}\leq\frac{1}{\Delta_{min}}.

Appendix E Proofs of Theorem 5.2

Recall that t\mathcal{I}_{t} denotes the independent set at time tt and αtt\alpha_{t}^{*}\in\mathcal{I}_{t} denotes the arm whose neighborhood includes the optimal arm iti_{t}^{*}. We have |T|α(GT𝒫)|\mathcal{I}_{T}|\leq\alpha(G_{T}^{\mathcal{P}}). Let \mathcal{F} denote the sequence of arm means of length TT generated from 𝒫\mathcal{P}; thus \mathcal{F} is a random variable.

Since the optimal arm may change over time, the gap Δi\Delta_{i} is time-varying; we denote the new gap by Δt(i)\Delta_{t}(i). Therefore, the analysis of Theorem 4.3 is no longer directly applicable here. The regret can again be divided into two parts:

𝔼[t=1TiK(t)Δt(i)𝟙{it=i}]=𝔼[t=1TiNαtΔt(i)𝟙{it=i}](A)+𝔼[t=1TiNαtΔi𝟙{it=i}](B)\displaystyle\mathbb{E}\Bigg{[}\sum_{t=1}^{T}\sum_{i\in K(t)}\Delta_{t}(i)\mathbbm{1}\{i_{t}=i\}\Bigg{]}=\underbrace{\mathbb{E}\Bigg{[}\sum_{t=1}^{T}\sum_{i\notin N_{\alpha_{t}^{*}}}\Delta_{t}(i)\mathbbm{1}\{i_{t}=i\}\Bigg{]}}_{(A)}+\underbrace{\mathbb{E}\Bigg{[}\sum_{t=1}^{T}\sum_{i\in N_{\alpha_{t}^{*}}}\Delta_{i}\mathbbm{1}\{i_{t}=i\}\Bigg{]}}_{(B)} (24)

We focus on (A)(A) first.

(A)\displaystyle(A) =𝔼[𝔼[t=1TiNαtΔt(i)𝟙{it=i}|]]\displaystyle=\mathbb{E}_{\mathcal{F}}\Bigg{[}\mathbb{E}\Bigg{[}\sum_{t=1}^{T}\sum_{i\notin N_{\alpha_{t}^{*}}}\Delta_{t}(i)\mathbbm{1}\{i_{t}=i\}\Big{|}\mathcal{F}\Bigg{]}\Bigg{]} (25)
𝔼[𝔼[t=1TΔt(it)𝟙{jtαt}|]]\displaystyle\leq\mathbb{E}_{\mathcal{F}}\Bigg{[}\mathbb{E}\Bigg{[}\sum_{t=1}^{T}\Delta_{t}(i_{t})\mathbbm{1}\{j_{t}\neq\alpha_{t}^{*}\}\Big{|}\mathcal{F}\Bigg{]}\Bigg{]}
𝔼[𝔼[t=1TΔmaxT𝟙{jtαt}|]]\displaystyle\leq\mathbb{E}_{\mathcal{F}}\Bigg{[}\mathbb{E}\Bigg{[}\sum_{t=1}^{T}\Delta_{max}^{T}\mathbbm{1}\{j_{t}\neq\alpha_{t}^{*}\}\Big{|}\mathcal{F}\Bigg{]}\Bigg{]}

Given a fixed \mathcal{F}, T\mathcal{I}_{T} is deterministic. Since the gap between the optimal and suboptimal arms may vary over time, we define

Δαj′′=mint[T]{μ(αt)μ(αj):αjT and μ(αt)μ(αj)>0}\Delta_{\alpha_{j}}^{\prime\prime}=\min_{t\in[T]}\{\mu(\alpha_{t}^{*})-\mu(\alpha_{j}):\alpha_{j}\in\mathcal{I}_{T}\text{ and }\mu(\alpha_{t}^{*})-\mu(\alpha_{j})>0\}

which is the minimum gap incurred when αjT\alpha_{j}\in\mathcal{I}_{T} is not the arm whose neighborhood includes the optimal arm. Then Δαj′′ϵ\Delta_{\alpha_{j}}^{\prime\prime}\geq\epsilon.

Following the proof of Theorem 4.3, for any αjT\alpha_{j}\in\mathcal{I}_{T} with αjαt\alpha_{j}\neq\alpha_{t}^{*}, the probability that the algorithm selects it is less than δ2\delta^{2} once it has been selected 4log(2T/δ)ϵ2\frac{4\log(\sqrt{2}T/\delta)}{\epsilon^{2}} times. Therefore, the inner expectation of Equation 25 is bounded as

|T|ΔmaxT(4log(2T/δ)ϵ2+Tδ2)|\mathcal{I}_{T}|\Delta_{max}^{T}(\frac{4\log(\sqrt{2}T/\delta)}{\epsilon^{2}}+T\delta^{2}) (26)

The inner expectation of Equation 25 also has the trivial bound TΔmaxTT\Delta_{max}^{T}.

Plugging into (A)(A) and using Lemma A.3, we get

(A)\displaystyle(A) 𝔼[min{|T|ΔmaxT(4log(2T/δ)ϵ2+Tδ2),TΔmaxT}]\displaystyle\leq\mathbb{E}_{\mathcal{F}}\Bigg{[}\min\{|\mathcal{I}_{T}|\Delta_{max}^{T}(\frac{4\log(\sqrt{2}T/\delta)}{\epsilon^{2}}+T\delta^{2}),T\Delta_{max}^{T}\}\Bigg{]} (27)
k=1T(α(GT𝒫)=k)min{kΔmaxT(4log(2T/δ)ϵ2+Tδ2),TΔmaxT}\displaystyle\leq\sum_{k=1}^{T}\mathbb{P}(\alpha(G_{T}^{\mathcal{P}})=k)\min\{k\Delta_{max}^{T}(\frac{4\log(\sqrt{2}T/\delta)}{\epsilon^{2}}+T\delta^{2}),T\Delta_{max}^{T}\}
(α(GT𝒫)5max{logbT,1})5max{logbT,1}ΔmaxT(4log(2T/δ)ϵ2+Tδ2)+(α(GT𝒫)>5max{logbT,1})TΔmaxT\displaystyle\leq\mathbb{P}(\alpha(G_{T}^{\mathcal{P}})\leq 5\max\{\log_{b}T,1\})5\max\{\log_{b}T,1\}\Delta_{max}^{T}(\frac{4\log(\sqrt{2}T/\delta)}{\epsilon^{2}}+T\delta^{2})+\mathbb{P}(\alpha(G_{T}^{\mathcal{P}})>5\max\{\log_{b}T,1\})T\Delta_{max}^{T}
5max{logbT,1}ΔmaxT(4log(2T/δ)ϵ2+Tδ2)+ΔmaxT.\displaystyle\leq 5\max\{\log_{b}T,1\}\Delta_{max}^{T}(\frac{4\log(\sqrt{2}T/\delta)}{\epsilon^{2}}+T\delta^{2})+\Delta_{max}^{T}.

Now we consider part (B)(B).

(B)\displaystyle(B) =𝔼[𝔼[t=1TiNαtΔi𝟙{it=i}|]]\displaystyle=\mathbb{E}_{\mathcal{F}}\Bigg{[}\mathbb{E}\Bigg{[}\sum_{t=1}^{T}\sum_{i\in N_{\alpha_{t}^{*}}}\Delta_{i}\mathbbm{1}\{i_{t}=i\}\Bigg{|}\mathcal{F}\Bigg{]}\Bigg{]} (28)

Recall that 𝒜={at:t[T],atNαt}\mathcal{A}=\{a_{t}:t\in[T],a_{t}\in N_{\alpha_{t}^{*}}\},M=t=1T{|μ(at)μ(it)|2ϵ}M=\sum_{t=1}^{T}\mathbb{P}\{|\mu(a_{t})-\mu(i_{t}^{*})|\leq 2\epsilon\}. Then

|𝒜|M=t=1T𝟙{|μ(at)μ(it)|2ϵ}.|\mathcal{A}|\leq M^{\prime}=\sum_{t=1}^{T}\mathbbm{1}\{|\mu(a_{t})-\mu(i_{t}^{*})|\leq 2\epsilon\}.

Since the means μ(at)𝒫\mu(a_{t})\sim\mathcal{P} are independent of each other, the events 𝟙{|μ(at)μ(it)|2ϵ}\mathbbm{1}\{|\mu(a_{t})-\mu(i_{t}^{*})|\leq 2\epsilon\} are also mutually independent. Using Lemma A.4,

(M3M)eM.\mathbb{P}(M^{\prime}\geq 3M)\leq e^{-M}. (29)
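
The inequality above is a multiplicative Chernoff-type bound (Lemma A.4) for a sum of independent Bernoulli indicators whose means sum to MM. The following Python sketch is a quick Monte Carlo check with illustrative success probabilities; it plays no role in the analysis.

import math
import random

random.seed(1)
probs = [0.5, 0.3, 0.2, 0.4, 0.1]      # illustrative success probabilities
M = sum(probs)                          # expected value of M'
trials = 20000
hits = sum(
    sum(1 for p in probs if random.random() < p) >= 3 * M
    for _ in range(trials)
)
print(hits / trials, "<=", math.exp(-M))   # empirical tail frequency vs. the e^{-M} bound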

Given a fixed instance \mathcal{F}, we divide the rounds into L+1L^{\prime}+1 parts, where LL^{\prime} is the number of times the optimal arm changes (t0=1,tL+1=T)(t_{0}=1,t_{L^{\prime}+1}=T):

[1,t1],(t1,t2],,(tL,T].[1,t_{1}],(t_{1},t_{2}],...,(t_{L^{\prime}},T].

This partition satisfies that for all t(tj,tj+1]t\in(t_{j},t_{j+1}], iti_{t}^{*} is unchanged. Then αt\alpha_{t}^{*} is also unchanged, and we write αt=αj\alpha_{t}^{*}=\alpha_{j} for t(tj,tj+1]t\in(t_{j},t_{j+1}].

Let us focus on the interval (tj,tj+1](t_{j},t_{j+1}]; the analysis for the other intervals is similar.

The best case is that all arms in NαjN_{\alpha_{j}} arrive at the beginning. In this case, the regret for this part is equivalent to the regret of running the UCB algorithm on the subgraph formed by NαjN_{\alpha_{j}} for tj+1tjt_{j+1}-t_{j} rounds. The independence number of the subgraph formed by NαjN_{\alpha_{j}} is at most 22, which leads to a regret upper bound independent of the number of arriving arms. However, we are primarily concerned with the worst case, in which the algorithm cannot benefit from the graph feedback at all. That is, the algorithm spends O(log(T)(Δ1)2)O(\frac{\log(T)}{(\Delta_{1})^{2}}) rounds distinguishing the optimal arm from the first arriving suboptimal arm 11; after this, the second suboptimal arm 22 arrives, and another O(log(T)(Δ2)2)O(\frac{\log(T)}{(\Delta_{2})^{2}}) rounds are spent distinguishing the optimal arm from it, and so on.

Let VjV_{j} denote the set of arms that fall into NαjN_{\alpha_{j}} during the rounds (tj,tj+1](t_{j},t_{j+1}]. If iVji\in V_{j} has not yet arrived at round tt, then (it=i)=0\mathbb{P}(i_{t}=i)=0.

Following the same argument as in the proof of Theorem 4.3, the inner expectation in Equation 28 can be bounded as

j=0LiVjt=tjtj+1Δi(it=i,jt=αj)\displaystyle\sum_{j=0}^{L^{\prime}}\sum_{i\in V_{j}}\sum_{t=t_{j}}^{t_{j+1}}\Delta_{i}\mathbb{P}(i_{t}=i,j_{t}=\alpha_{j}) =j=0LiVj,Δi<Δt=tjtj+1Δi(it=i,jt=αj)+j=0LiVj,ΔiΔt=tjtj+1Δi(it=i,jt=αj)\displaystyle=\sum_{j=0}^{L^{\prime}}\sum_{i\in V_{j},\Delta_{i}<\Delta}\sum_{t=t_{j}}^{t_{j+1}}\Delta_{i}\mathbb{P}(i_{t}=i,j_{t}=\alpha_{j})+\sum_{j=0}^{L^{\prime}}\sum_{i\in V_{j},\Delta_{i}\geq\Delta}\sum_{t=t_{j}}^{t_{j+1}}\Delta_{i}\mathbb{P}(i_{t}=i,j_{t}=\alpha_{j}) (30)
j=0L(tj+1tj)Δ+j=0LiVj,ΔiΔ(Δi(tj+1tj)δ2+4log(2T/δ)Δi)\displaystyle\leq\sum_{j=0}^{L^{\prime}}(t_{j+1}-t_{j})\Delta+\sum_{j=0}^{L^{\prime}}\sum_{i\in V_{j},\Delta_{i}\geq\Delta}\Bigg{(}\Delta_{i}(t_{j+1}-t_{j})\delta^{2}+\frac{4\log(\sqrt{2}T/\delta)}{\Delta_{i}}\Bigg{)}
TΔ+j=0L(4|Vj|log(2T/δ)Δ+M2ϵ(tj+1tj)δ2)\displaystyle\leq T\Delta+\sum_{j=0}^{L^{\prime}}(\frac{4|V_{j}|\log(\sqrt{2}T/\delta)}{\Delta}+M^{\prime}2\epsilon(t_{j+1}-t_{j})\delta^{2})
(a)TΔ+4Mlog(2T/δ)Δ+2TMϵδ2\displaystyle\stackrel{{\scriptstyle(a)}}{{\leq}}T\Delta+\frac{4M^{\prime}\log(\sqrt{2}T/\delta)}{\Delta}+2TM^{\prime}\epsilon\delta^{2}
(b)4TMlog(2T/δ)+2ϵ,\displaystyle\stackrel{{\scriptstyle(b)}}{{\leq}}4\sqrt{TM^{\prime}\log(\sqrt{2}T/\delta)}+2\epsilon,

where (a)(a) comes from the fact that j|Vj|MT\sum_{j}|V_{j}|\leq M^{\prime}\leq T and δ=1T\delta=\frac{1}{T}, and (b)(b) follows from choosing Δ=4Mlog(2T/δ)T\Delta=\sqrt{\frac{4M^{\prime}\log(\sqrt{2}T/\delta)}{T}}, which balances the first two terms.

Similar to the approach of bounding (A)(A), the inner expectation in Equation 28 also has the trivial bound 2Tϵ2T\epsilon.

Substituting into Equation 28, we get

(B)\displaystyle(B) 𝔼[min{4TMlog(2T/δ)+2ϵ,2Tϵ}]\displaystyle\leq\mathbb{E}_{\mathcal{F}}\Bigg{[}\min\Big{\{}4\sqrt{TM^{\prime}\log(\sqrt{2}T/\delta)}+2\epsilon,2T\epsilon\Big{\}}\Bigg{]} (31)
(M3M)(43TMlog(2T/δ)+2ϵ)+2Tϵ(M>3M)\displaystyle\leq\mathbb{P}(M^{\prime}\leq 3M)(4\sqrt{3TM\log(\sqrt{2}T/\delta)}+2\epsilon)+2T\epsilon\mathbb{P}(M^{\prime}>3M)
43TMlog(2T/δ)+2ϵ+2TϵeM.\displaystyle\leq 4\sqrt{3TM\log(\sqrt{2}T/\delta)}+2\epsilon+2T\epsilon e^{-M}.

From Equation 27 and Equation 31, we get the total regret

RTπ5max{logbT,1}ΔmaxT(4log(2T/δ)ϵ2+Tδ2)+ΔmaxT+43TMlog(2T/δ)+2ϵ+2TϵeM.R_{T}^{\pi}\leq 5\max\{\log_{b}T,1\}\Delta_{max}^{T}(\frac{4\log(\sqrt{2}T/\delta)}{\epsilon^{2}}+T\delta^{2})+\Delta_{max}^{T}+4\sqrt{3TM\log(\sqrt{2}T/\delta)}+2\epsilon+2T\epsilon e^{-M}.

Appendix F Proofs of Corollary 5.3

First, we derive an expression for MM that holds for a general distribution 𝒫\mathcal{P}. Let X1,X2,,XTX_{1},X_{2},...,X_{T} be independent random variables drawn from 𝒫\mathcal{P}, and let

M=t=1T(|Xtmaxi=1,,tXi|<2ϵ).M=\sum_{t=1}^{T}\mathbb{P}(|X_{t}-\max_{i=1,...,t}X_{i}|<2\epsilon).

Denote F(x)=(X<x),Mt=maxitXiF(x)=\mathbb{P}(X<x),M_{t}=\max_{i\leq t}X_{i}. Then

(|XtMt|<2ϵ|Mt=x)=F(x+2ϵ)F(x2ϵ)\mathbb{P}(|X_{t}-M_{t}|<2\epsilon|M_{t}=x)=F(x+2\epsilon)-F(x-2\epsilon)

and

(|XtMt|<2ϵ)=t𝒟(F(x))t1(F(x+2ϵ)F(x2ϵ))F(x)𝑑x,\mathbb{P}(|X_{t}-M_{t}|<2\epsilon)=t\int_{\mathcal{D}}(F(x))^{t-1}(F(x+2\epsilon)-F(x-2\epsilon))F^{\prime}(x)dx,

where 𝒟\mathcal{D} is the support of 𝒫\mathcal{P}. Note that

r=1Rrxr1=ddx1xR+11x=1(R+1)xR+RxR+1(1x)2.\sum_{r=1}^{R}rx^{r-1}=\frac{d}{dx}\frac{1-x^{R+1}}{1-x}=\frac{1-(R+1)x^{R}+Rx^{R+1}}{(1-x)^{2}}.

Hence,

M=𝒟1(T+1)(F(x))T+T(F(x))T+1(1F(x))2(F(x+2ϵ)F(x2ϵ))F(x)𝑑x.M=\int_{\mathcal{D}}\frac{1-(T+1)(F(x))^{T}+T(F(x))^{T+1}}{(1-F(x))^{2}}(F(x+2\epsilon)-F(x-2\epsilon))F^{\prime}(x)dx. (32)
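
Before specializing to the Gaussian case, Equation 32 can be sanity-checked numerically. The following Python sketch compares a Monte Carlo estimate of MM (each new draw is compared with the running maximum of the previous draws, which is the convention implicit in the integral) with a Riemann-sum evaluation of Equation 32 for 𝒩(0,1)\mathcal{N}(0,1); the horizon and ϵ\epsilon are illustrative.

import math
import random

def Phi(x):   # standard normal CDF
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def phi(x):   # standard normal density
    return math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)

T, eps = 500, 0.1

# Monte Carlo estimate of M from its definition.
random.seed(2)
runs, acc = 2000, 0.0
for _ in range(runs):
    m, cnt = random.gauss(0.0, 1.0), 0
    for _ in range(T):
        y = random.gauss(0.0, 1.0)
        cnt += (abs(y - m) < 2 * eps)
        m = max(m, y)
    acc += cnt
M_mc = acc / runs

# Riemann sum of Equation 32 on a truncated grid [-6, 6].
n, M_int = 4000, 0.0
dx = 12.0 / n
for k in range(n + 1):
    x = -6.0 + k * dx
    F = Phi(x)
    bracket = (1.0 - (T + 1) * F ** T + T * F ** (T + 1)) / (1.0 - F) ** 2
    M_int += bracket * (Phi(x + 2 * eps) - Phi(x - 2 * eps)) * phi(x) * dx

print(M_mc, M_int)   # the two estimates should be close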

We estimate upper and lower bounds on MM separately. For the Gaussian distribution 𝒩(0,1)\mathcal{N}(0,1), F(x)=Φ(x),F(x)=ϕ(x).F(x)=\Phi(x),F^{\prime}(x)=\phi(x). We first prove M=O(log(T)e2ϵ2log(T)).M=O(\log(T)e^{2\epsilon\sqrt{2\log(T)}}).

Denote the integrand in Equation 32 by H(x)H(x) and let m=2log(T)+ϵm=\sqrt{2\log(T)}+\epsilon. Then

M=mH(x)𝑑x+m+H(x)𝑑xM=\int_{-\infty}^{m}H(x)dx+\int_{m}^{+\infty}H(x)dx (33)

First, we have

x,1(T+1)(F(x))T+T(F(x))T+1(1F(x))2=t=1Tt(F(x))t1T(T+1)2T2.\forall x\in\mathbb{R},\frac{1-(T+1)(F(x))^{T}+T(F(x))^{T+1}}{(1-F(x))^{2}}=\sum_{t=1}^{T}t(F(x))^{t-1}\leq\frac{T(T+1)}{2}\leq T^{2}.
Φ(x+2ϵ)Φ(x2ϵ)4ϵϕ(x2ϵ).\Phi(x+2\epsilon)-\Phi(x-2\epsilon)\leq 4\epsilon\phi(x-2\epsilon).

And

(F(x+2ϵ)F(x2ϵ))F(x)4ϵϕ(x2ϵ)ϕ(x)2ϵeϵ2πe(xϵ)2\displaystyle(F(x+2\epsilon)-F(x-2\epsilon))F^{\prime}(x)\leq 4\epsilon\phi(x-2\epsilon)\phi(x)\leq\frac{2\epsilon e^{-\epsilon^{2}}}{\pi}e^{-(x-\epsilon)^{2}} (34)

Then,

m+H(x)𝑑x\displaystyle\int_{m}^{+\infty}H(x)dx T22ϵeϵ2πme(xϵ)2𝑑x\displaystyle\leq T^{2}\frac{2\epsilon e^{-\epsilon^{2}}}{\pi}\int_{m}^{\infty}e^{-(x-\epsilon)^{2}}dx (35)
=T22ϵeϵ2π(1Φ(2(mϵ)))\displaystyle=T^{2}\frac{2\epsilon e^{-\epsilon^{2}}}{\sqrt{\pi}}(1-\Phi(\sqrt{2}(m-\epsilon)))
(a)T22ϵeϵ2π12(mϵ)+2(mϵ)2+4e(mϵ)2\displaystyle\stackrel{{\scriptstyle(a)}}{{\leq}}T^{2}\frac{2\epsilon e^{-\epsilon^{2}}}{\sqrt{\pi}}\frac{1}{\sqrt{2}(m-\epsilon)+\sqrt{2(m-\epsilon)^{2}+4}}e^{-(m-\epsilon)^{2}}
2ϵeϵ2π,\displaystyle\leq\frac{2\epsilon e^{-\epsilon^{2}}}{\sqrt{\pi}},

where (a)(a) uses Lemma A.5.

Now we bound the first term in Equation 33.

mH(x)𝑑x=1H(x)𝑑x+1mH(x)𝑑xΦ(1)(1Φ(1))2+1mH(x)𝑑x.\displaystyle\int_{-\infty}^{m}H(x)dx=\int_{-\infty}^{1}H(x)dx+\int_{1}^{m}H(x)dx\leq\frac{\Phi(1)}{(1-\Phi(1))^{2}}+\int_{1}^{m}H(x)dx. (36)

We only need to bound the integral within (1,m)(1,m).

1mH(x)𝑑x\displaystyle\int_{1}^{m}H(x)dx 1m(Φ(x+2ϵ)Φ(x2ϵ))ϕ(x)(1Φ(x))2𝑑x\displaystyle\leq\int_{1}^{m}\frac{(\Phi(x+2\epsilon)-\Phi(x-2\epsilon))\phi(x)}{(1-\Phi(x))^{2}}dx (37)
2ϵeϵ2π1me(xϵ)2(1Φ(x))2𝑑x\displaystyle\leq\frac{2\epsilon e^{-\epsilon^{2}}}{\pi}\int_{1}^{m}\frac{e^{-(x-\epsilon)^{2}}}{(1-\Phi(x))^{2}}dx
(b)4ϵeϵ21m(1x+x)2ex2e(xϵ)2𝑑x\displaystyle\stackrel{{\scriptstyle(b)}}{{\leq}}4\epsilon e^{-\epsilon^{2}}\int_{1}^{m}(\frac{1}{x}+x)^{2}e^{x^{2}}e^{-(x-\epsilon)^{2}}dx
=4ϵe2ϵ21m(1x+x)2e2xϵ𝑑x\displaystyle=4\epsilon e^{-2\epsilon^{2}}\int_{1}^{m}(\frac{1}{x}+x)^{2}e^{2x\epsilon}dx
4m2e2mϵe2ϵ2\displaystyle\leq 4m^{2}e^{2m\epsilon}e^{-2\epsilon^{2}}
8log(T)e2ϵ2log(T),\displaystyle\leq 8\log(T)e^{2\epsilon\sqrt{2\log(T)}},

where (b)(b) uses Lemma A.5.

Therefore,

M2ϵeϵ2π+Φ(1)(1Φ(1))2+8log(T)e2ϵ2log(T).M\leq\frac{2\epsilon e^{-\epsilon^{2}}}{\sqrt{\pi}}+\frac{\Phi(1)}{(1-\Phi(1))^{2}}+8\log(T)e^{2\epsilon\sqrt{2\log(T)}}.

Then we derive a lower bound for MM.

Since ϕ(t)\phi(t) is a convex function on [1,+)[1,+\infty), we have

ϕ(t)ϕ(a+b2)+ϕ(a+b2)(ta+b2),t[a,b],a1.\phi(t)\geq\phi(\frac{a+b}{2})+\phi^{\prime}(\frac{a+b}{2})(t-\frac{a+b}{2}),\forall t\in[a,b],a\geq 1.

Then when x2ϵ1x-2\epsilon\geq 1,

Φ(x+2ϵ)Φ(x2ϵ)=x2ϵx+2ϵϕ(t)𝑑t4ϵϕ(x).\Phi(x+2\epsilon)-\Phi(x-2\epsilon)=\int_{x-2\epsilon}^{x+2\epsilon}\phi(t)dt\geq 4\epsilon\phi(x).

Hence,

(Φ(x+2ϵ)Φ(x2ϵ))ϕ(x)4ϵ(ϕ(x))2.(\Phi(x+2\epsilon)-\Phi(x-2\epsilon))\phi(x)\geq 4\epsilon(\phi(x))^{2}.

Substituting into Equation 32,

M\displaystyle M 1+2ϵlog(T)1(T+1)(Φ(x))T+T(Φ(x))T+1(1Φ(x))24ϵ(ϕ(x))2𝑑x\displaystyle\geq\int_{1+2\epsilon}^{\sqrt{\log(T)}}\frac{1-(T+1)(\Phi(x))^{T}+T(\Phi(x))^{T+1}}{(1-\Phi(x))^{2}}4\epsilon(\phi(x))^{2}dx (38)
8ϵπ1+2ϵlog(T)(x2+1)(1(T+1)(112πx1+x2ex22)T+T(Φ(x))T+1)𝑑x\displaystyle\geq\frac{8\epsilon}{\pi}\int_{1+2\epsilon}^{\sqrt{\log(T)}}(x^{2}+1)(1-(T+1)\Big{(}1-\frac{1}{\sqrt{2\pi}}\frac{x}{1+x^{2}}e^{-\frac{x^{2}}{2}}\Big{)}^{T}+T(\Phi(x))^{T+1})dx
8ϵπ1+2ϵlog(T)(x2+1)(1(T+1)(112πx1+x2ex22)T)𝑑x.\displaystyle\geq\frac{8\epsilon}{\pi}\int_{1+2\epsilon}^{\sqrt{\log(T)}}(x^{2}+1)(1-(T+1)\Big{(}1-\frac{1}{\sqrt{2\pi}}\frac{x}{1+x^{2}}e^{-\frac{x^{2}}{2}}\Big{)}^{T})dx.

The function h(x)=112πx1+x2ex22h(x)=1-\frac{1}{\sqrt{2\pi}}\frac{x}{1+x^{2}}e^{-\frac{x^{2}}{2}} is increasing on the interval (1,+)(1,+\infty). We have

(112πx1+x2ex22)T\displaystyle(1-\frac{1}{\sqrt{2\pi}}\frac{x}{1+x^{2}}e^{-\frac{x^{2}}{2}}\Big{)}^{T} (112πlog(T)1+log(T)1T)T\displaystyle\leq(1-\frac{1}{\sqrt{2\pi}}\frac{\sqrt{\log(T)}}{1+\log(T)}\frac{1}{\sqrt{T}}\Big{)}^{T} (39)
eT4πlog(T)\displaystyle\leq e^{-\sqrt{\frac{T}{4\pi\log(T)}}}

Observe that for large TT ( Te11T\geq e^{11}), eT4πlog(T)1T2e^{-\sqrt{\frac{T}{4\pi\log(T)}}}\leq\frac{1}{T^{2}}. Therefore, for any Te11T\geq e^{11},

M8ϵπ1+2ϵlog(T)(x2+1)(1T+1T2)𝑑x8ϵπ1+2ϵlog(T)x2𝑑x.M\geq\frac{8\epsilon}{\pi}\int_{1+2\epsilon}^{\sqrt{\log(T)}}(x^{2}+1)(1-\frac{T+1}{T^{2}})dx\geq\frac{8\epsilon}{\pi}\int_{1+2\epsilon}^{\sqrt{\log(T)}}x^{2}dx.

We have

M8ϵ3πlog(T)log(T)8ϵ(1+2ϵ)33π.M\geq\frac{8\epsilon}{3\pi}\log(T)\sqrt{\log(T)}-\frac{8\epsilon(1+2\epsilon)^{3}}{3\pi}.

Appendix G Proofs of Theorem 5.5

Similar to the proofs of Theorem 5.2, the regret can also be divided into two parts:

𝔼[t=1TiK(t)Δt(i)𝟙{it=i}]=𝔼[t=1TiNαtΔt(i)𝟙{it=i}](A)+𝔼[t=1TiNαtΔi𝟙{it=i}](B)\displaystyle\mathbb{E}\Bigg{[}\sum_{t=1}^{T}\sum_{i\in K(t)}\Delta_{t}(i)\mathbbm{1}\{i_{t}=i\}\Bigg{]}=\underbrace{\mathbb{E}\Bigg{[}\sum_{t=1}^{T}\sum_{i\notin N_{\alpha_{t}^{*}}}\Delta_{t}(i)\mathbbm{1}\{i_{t}=i\}\Bigg{]}}_{(A)}+\underbrace{\mathbb{E}\Bigg{[}\sum_{t=1}^{T}\sum_{i\in N_{\alpha_{t}^{*}}}\Delta_{i}\mathbbm{1}\{i_{t}=i\}\Bigg{]}}_{(B)} (40)

Part (A)(A) is handled exactly as in the corresponding part of the proof of Theorem 5.2. We only focus on part (B)(B).

Let

L=t=1T𝟙{μ(at)>μ(it)}L^{\prime}=\sum_{t=1}^{T}\mathbbm{1}\{\mu(a_{t})>\mu(i_{t}^{*})\}

denote the number of times the optimal arm changes. Then

𝔼[L]=L=t=1T{μ(at)>μ(it)}.\mathbb{E}[L^{\prime}]=L=\sum_{t=1}^{T}\mathbb{P}\{\mu(a_{t})>\mu(i_{t}^{*})\}.

Using Lemma A.4, we can get

(L3L)eL.\mathbb{P}(L^{\prime}\geq 3L)\leq e^{-L}. (41)

Since the means μ(at)𝒫\mu(a_{t})\sim\mathcal{P} are independent and identically distributed, each of the first tt means is equally likely to be the largest, i.e., (μ(at)>μ(it))=1t.\mathbb{P}(\mu(a_{t})>\mu(i_{t}^{*}))=\frac{1}{t}. Alternatively, this result can be obtained from Equation 32. We have

L=t=1T1t=log(T)+c+o(1),L=\sum_{t=1}^{T}\frac{1}{t}=\log(T)+c+o(1), (42)

where c0.577c\approx 0.577 is the Euler constant. Then log(T)Llog(T)+1\log(T)\leq L\leq\log(T)+1.
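
The expected number of times the running maximum of an i.i.d. sequence is beaten equals the harmonic number, which is the classical record-counting fact used above. The following Python sketch checks this empirically for Gaussian draws; the horizon and the number of runs are illustrative.

import math
import random

random.seed(3)
T, runs = 1000, 4000
total = 0
for _ in range(runs):
    m, records = -float("inf"), 0
    for _ in range(T):
        x = random.gauss(0.0, 1.0)
        if x > m:        # a new record, i.e., the optimal arm changes
            records += 1
            m = x
    total += records
H_T = sum(1.0 / t for t in range(1, T + 1))
print(total / runs, H_T, math.log(T) + 0.5772)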

Given a fixed instance \mathcal{F}, we again divide the rounds into L+1L^{\prime}+1 parts (t0=1,tL+1=T)(t_{0}=1,t_{L^{\prime}+1}=T):

[1,t1],(t1,t2],,(tL,T].[1,t_{1}],(t_{1},t_{2}],...,(t_{L^{\prime}},T].

This partition satisfies that for all t(tj,tj+1]t\in(t_{j},t_{j+1}], iti_{t}^{*} is unchanged. Then αt\alpha_{t}^{*} is also unchanged, and we write αt=αj\alpha_{t}^{*}=\alpha_{j} for t(tj,tj+1]t\in(t_{j},t_{j+1}].

Let us focus on the interval (tj,tj+1](t_{j},t_{j+1}]; the analysis for the other intervals is similar. All arms falling into NαjN_{\alpha_{j}} during the rounds (tj,tj+1](t_{j},t_{j+1}] are denoted by VjV_{j}. The arms in VjV_{j} can be divided into two parts: E1={iVj,μ(i)<μ(αj)}E_{1}=\{i\in V_{j},\mu(i)<\mu(\alpha_{j})\} and E2={iVj,μ(i)μ(αj)}E_{2}=\{i\in V_{j},\mu(i)\geq\mu(\alpha_{j})\}. If iVji\in V_{j} has not yet arrived at round tt, then 𝟙{it=i}=0\mathbbm{1}\{i_{t}=i\}=0. Then we have

iVjt=tjtj+1𝟙{it=i}=iE1t=tjtj+1𝟙{it=i}(C)+iE2t=tjtj+1𝟙{it=i}(D)\sum_{i\in V_{j}}\sum_{t=t_{j}}^{t_{j+1}}\mathbbm{1}\{i_{t}=i\}=\underbrace{\sum_{i\in E_{1}}\sum_{t=t_{j}}^{t_{j+1}}\mathbbm{1}\{i_{t}=i\}}_{(C)}+\underbrace{\sum_{i\in E_{2}}\sum_{t=t_{j}}^{t_{j+1}}\mathbbm{1}\{i_{t}=i\}}_{(D)} (43)

From the way our algorithm constructs independent sets, all arms in VjV_{j} are connected to αt=αj\alpha_{t}^{*}=\alpha_{j}, i.e., their means are within ϵ\epsilon of μ(αj)\mu(\alpha_{j}). Hence, the means of the arms within E1E_{1} (and likewise within E2E_{2}) are within ϵ\epsilon of each other, so both E1E_{1} and E2E_{2} form cliques.

Note that selecting any arm in E1E_{1} will result in the observation of αj\alpha_{j}. We have

(C)=iE1t=tjtj+1𝟙{it=i}\displaystyle(C)=\sum_{i\in E_{1}}\sum_{t=t_{j}}^{t_{j+1}}\mathbbm{1}\{i_{t}=i\} αj+iE1t=tjtj+1𝟙{it=i,Ot(αj)>αj}\displaystyle\leq\ell_{\alpha_{j}}+\sum_{i\in E_{1}}\sum_{t=t_{j}}^{t_{j+1}}\mathbbm{1}\{i_{t}=i,O_{t}(\alpha_{j})>\ell_{\alpha_{j}}\} (44)
αj+iE1t=tjtj+1𝟙{μ¯t(i)ct(i)>μ¯t(αj)ct(αj),Ot(αj)αj}\displaystyle\leq\ell_{\alpha_{j}}+\sum_{i\in E_{1}}\sum_{t=t_{j}}^{t_{j+1}}\mathbbm{1}\{\bar{\mu}_{t}(i)-c_{t}(i)>\bar{\mu}_{t}(\alpha_{j})-c_{t}(\alpha_{j}),O_{t}(\alpha_{j})\geq\ell_{\alpha_{j}}\}
αj+iE1t=tjtj+1𝟙{max1sitX¯si(i)csi(i)>minαjstX¯s(αj)cs(αj)}\displaystyle\leq\ell_{\alpha_{j}}+\sum_{i\in E_{1}}\sum_{t=t_{j}}^{t_{j+1}}\mathbbm{1}\{\max_{1\leq s_{i}\leq t}\bar{X}_{s_{i}}(i)-c_{s_{i}}(i)>\min_{\ell_{\alpha_{j}}\leq s\leq t}\bar{X}_{s}(\alpha_{j})-c_{s}(\alpha_{j})\}
αj+iE1t=tjtj+1si=1ts=αjt𝟙{X¯si(i)csi(i)>X¯s(αj)cs(αj)}\displaystyle\leq\ell_{\alpha_{j}}+\sum_{i\in E_{1}}\sum_{t=t_{j}}^{t_{j+1}}\sum_{s_{i}=1}^{t}\sum_{s=\ell_{\alpha_{j}}}^{t}\mathbbm{1}\{\bar{X}_{s_{i}}(i)-c_{s_{i}}(i)>\bar{X}_{s}(\alpha_{j})-c_{s}(\alpha_{j})\}

Let αj=4log(2T/δ)(ΔminT)2\ell_{\alpha_{j}}=\frac{4\log(\sqrt{2}T/\delta)}{(\Delta_{min}^{T})^{2}}. If sαjs\geq\ell_{\alpha_{j}}, the event {μ(αj)μ(i)2cs(αj)}\{\mu(\alpha_{j})-\mu(i)\leq 2c_{s}(\alpha_{j})\} never occurs. Then

{X¯si(i)csi(i)>X¯s(αj)cs(αj)}{X¯si(i)>μ(i)+csi(i)}{X¯s(αj)<μ(αj)cs(αj)}.\{\bar{X}_{s_{i}}(i)-c_{s_{i}}(i)>\bar{X}_{s}(\alpha_{j})-c_{s}(\alpha_{j})\}\subset\{\bar{X}_{s_{i}}(i)>\mu(i)+c_{s_{i}}(i)\}\bigcup\{\bar{X}_{s}(\alpha_{j})<\mu(\alpha_{j})-c_{s}(\alpha_{j})\}.

From Lemma A.1, we have

(X¯si(i)csi(i)>X¯s(αj)cs(αj))δ2T2.\mathbb{P}(\bar{X}_{s_{i}}(i)-c_{s_{i}}(i)>\bar{X}_{s}(\alpha_{j})-c_{s}(\alpha_{j}))\leq\frac{\delta^{2}}{T^{2}}.

The regret incurred by E1E_{1} in (tj,tj+1](t_{j},t_{j+1}] is at most

8ϵlog(2T/δ)(ΔminT)2+2ϵ(tj+1tj)|E1|δ2.\frac{8\epsilon\log(\sqrt{2}T/\delta)}{(\Delta_{min}^{T})^{2}}+2\epsilon(t_{j+1}-t_{j})|E_{1}|\delta^{2}.

Using the same method, we can get the regret incurred by E2E_{2} in (tj,tj+1](t_{j},t_{j+1}] is bounded as

8ϵlog(2T/δ)(ΔminT)2+2ϵ(tj+1tj)|E2|δ2.\frac{8\epsilon\log(\sqrt{2}T/\delta)}{(\Delta_{min}^{T})^{2}}+2\epsilon(t_{j+1}-t_{j})|E_{2}|\delta^{2}.

Therefore, choosing δ=1T\delta=\frac{1}{T}, (B)(B) with the fixed \mathcal{F} is bounded as

16Lϵlog(2T/δ)(ΔminT)2+2ϵ.\frac{16L^{\prime}\epsilon\log(\sqrt{2}T/\delta)}{(\Delta_{min}^{T})^{2}}+2\epsilon. (45)

From Equation 41 and Equation 42, we have

(B)48Lϵlog(2T/δ)(ΔminT)2+2ϵ+2ϵTeL48Lϵlog(2T/δ)(ΔminT)2+4ϵ(B)\leq\frac{48L\epsilon\log(\sqrt{2}T/\delta)}{(\Delta_{min}^{T})^{2}}+2\epsilon+2\epsilon Te^{-L}\leq\frac{48L\epsilon\log(\sqrt{2}T/\delta)}{(\Delta_{min}^{T})^{2}}+4\epsilon

Therefore, we get the total regret

RTπ5max{logbT,1}ΔmaxT(4log(2T/δ)ϵ2+Tδ2)+ΔmaxT+48Lϵlog(2T/δ)(ΔminT)2+4ϵ.R_{T}^{\pi}\leq 5\max\{\log_{b}T,1\}\Delta_{max}^{T}(\frac{4\log(\sqrt{2}T/\delta)}{\epsilon^{2}}+T\delta^{2})+\Delta_{max}^{T}+\frac{48L\epsilon\log(\sqrt{2}T/\delta)}{(\Delta_{min}^{T})^{2}}+4\epsilon.