Graph Feedback Bandits with Similar Arms
Abstract
In this paper, we study the stochastic multi-armed bandit problem with graph feedback. Motivated by applications in clinical trials and recommendation, we assume that two arms are connected if and only if they are similar (i.e., their means are close enough). We establish a regret lower bound for this novel feedback structure and introduce two UCB-based algorithms: D-UCB, with a problem-independent regret upper bound, and C-UCB, with a problem-dependent upper bound. Leveraging the similarity structure, we also consider the scenario where the number of arms increases over time. Practical applications related to this scenario include Q&A platforms (Reddit, Stack Overflow, Quora) and product reviews on Amazon and Flipkart: answers (or product reviews) continually appear on the website, and the goal is to display the best ones at the top. When the means of the arms are independently generated from some distribution, we provide regret upper bounds for both algorithms and discuss the sub-linearity of the bounds in relation to the distribution of means. Finally, we conduct experiments to validate the theoretical results.
1 Introduction
The multi-armed bandit is a classical reinforcement learning problem. At each time step, the learner selects an arm, which yields a reward drawn from a probability distribution unknown to the learner. The learner's goal is to maximize the cumulative reward over a horizon of time steps. This problem has attracted attention from the online learning community because of its effective balance between exploration (trying out as many arms as possible) and exploitation (utilizing the arm with the best current performance). Applications of multi-armed bandits arise in online sequential decision problems such as online recommendation systems [Li et al., 2011], online advertising campaigns [Schwartz et al., 2017] and clinical trials [Villar et al., 2015, Aziz et al., 2021].
In the standard multi-armed bandit, the learner can only observe the reward of the chosen arm. Existing research [Mannor and Shamir, 2011, Caron et al., 2012, Hu et al., 2020, Lykouris et al., 2020] considers the bandit problem with side observations, wherein the learner can observe information about arms other than the selected one. This observation structure can be encoded as a graph in which each node represents an arm, and node i is linked to node j if selecting i provides information on the reward of j.
We study a new feedback model: if any two arms are ε-similar, i.e., the absolute value of the difference between the means of the two arms does not exceed ε, an edge is formed between them. This means that after observing the reward of one arm, the decision-maker simultaneously learns the rewards of the arms similar to it. If ε = 0, this feedback model reduces to the standard multi-armed bandit problem. If ε is greater than the maximum expected reward, this feedback model becomes the full-information bandit problem.
As a motivating example, consider the recommendation problem in Spotify and Apple Music. After a recommender suggests a song to a user, it can observe not only whether the user liked or saved the song (reward), but also infer that the user is likely to like or save other, very similar songs. Similarity may be based on factors such as the artist, songwriter, genre, and more. As another motivating example, consider the clinical trial problem mentioned above. Each arm represents a different medical treatment plan, and these plans may share similarities such as dosage, formulation, etc. When a researcher selects a plan, they not only observe the reward of that treatment plan but also learn the effects of others similar to the selected one. The treatment effect (reward) can be either some summary information or a relative effect, such as positive or negative. Similar examples also appear in molecular simulations in chemistry [Pérez et al., 2020].
In this paper, we consider two bandit models: the standard graph feedback bandit problem and the bandit problem with an increasing number of arms. The latter is a more challenging setting than the standard multi-armed bandit. Relevant applications for this scenario encompass Q&A platforms such as Reddit, Stack Overflow, and Quora, as well as product reviews on websites like Amazon and Flipkart. Answers or product reviews continuously populate these platforms, which means that the number of arms increases over time. The goal is to display the best answers or product reviews at the top. This problem has been previously studied and referred to as "ballooning multi-armed bandits" by Ghalme et al. [2021]. However, they require that the optimal arm be more likely to arrive in the early rounds. Our contributions are as follows:
1. We propose a new feedback model in which an edge is formed between two arms whenever the difference between their means is less than some constant ε. We first analyze the underlying graph of this feedback model and establish that the domination number equals the independent domination number, while the independence number is not greater than twice the domination number, i.e., γ(G) = i(G) and α(G) ≤ 2γ(G). This result is helpful for the design and analysis of the following algorithms.
2. In this feedback setting, we first establish a problem-dependent regret lower bound. Then, we introduce two algorithms tailored for this specific feedback structure: Double-UCB (D-UCB), which applies the UCB rule twice in sequence, and Conservative-UCB (C-UCB), which employs a more conservative strategy during exploration. Regret upper bounds are provided for both algorithms, with D-UCB obtaining a problem-independent bound and C-UCB achieving a problem-dependent regret bound. Additionally, we analyze the regret of UCB-N [Lykouris et al., 2020] applied to our proposed setting and prove that its regret upper bound is of the same order as that of D-UCB.
3. We extend D-UCB and C-UCB to the scenario where the number of arms increases over time. Our algorithm does not require the optimal arm to arrive early; instead, it assumes that the mean of each arriving arm is independently sampled from some distribution, which is more realistic. We provide regret upper bounds for D-UCB and C-UCB, along with a simple regret lower bound for D-UCB. The lower bound for D-UCB indicates that it achieves sublinear regret only when the means are drawn from a normal-like distribution, while it incurs linear regret when the means are drawn from a uniform distribution. In contrast, C-UCB achieves a problem-dependent sublinear regret upper bound regardless of the mean distribution.
2 Related Works
Bandits with side observations were first introduced by Mannor and Shamir [2011] in the adversarial setting. They proposed two algorithms: ExpBan, a hybrid algorithm combining expert and bandit algorithms based on a clique decomposition of the side-observation graph, and ELP, an extension of the well-known EXP3 algorithm [Auer et al., 2002b]. Caron et al. [2012] and Hu et al. [2020] considered stochastic bandits with side observations and proposed the UCB-N, UCB-NE, and TS-N algorithms. The regret upper bounds they obtain are expressed in terms of a clique cover of the side-observation graph.
There have been some works that employ techniques beyond clique partition. Buccapatnam et al. [2014, 2018] proposed an algorithm named UCB-LP, which combines a version of arm elimination [Even-Dar et al., 2006] suggested by Auer and Ortner [2010] with linear programming to incorporate the graph structure. This algorithm has a regret guarantee that depends on a particularly selected dominating set and on the number of arms. Cohen et al. [2016] used a method based on elimination and provided a regret upper bound in terms of the independence number of the underlying graph. Lykouris et al. [2020] utilized a hierarchical approach inspired by elimination to analyze the feedback graph, demonstrating that UCB-N and TS-N have regret bounds whose leading terms sum over an independent set of the graph. There is also some work that considers the case where the feedback graph is a random graph [Alon et al., 2017, Ghari and Shen, 2022, Esposito et al., 2022].
Currently, there is limited research considering scenarios where the number of arms can change. Chakrabarti et al. [2008] were the first to explore this dynamic setting. Their model assumes that each arm has a lifetime budget, after which it automatically disappears and is replaced by a new arm. Since an algorithm needs to continuously explore newly available arms in this setting, they only provided an upper bound on the mean regret per time step. Ghalme et al. [2021] considered the "ballooning multi-armed bandits" setting, where the number of arms increases but arms do not disappear. They showed that the regret grows linearly without any distributional assumptions on the arrival of the arms' means. With the assumption that the optimal arm arrives early with high probability, their proposed algorithm BL-MOSS can achieve sublinear regret. In this paper, we also consider the ballooning setting but without any assumption on the optimal arm's arrival pattern. We use the feedback graph model mentioned above and assume that the mean of each arriving arm is independently sampled from some distribution.
Clustering bandits [Gentile et al., 2014, Li et al., 2016, Yang et al., 2022, Wang et al., 2024] are also relevant to our work. Typically, these models assume that a set of arms (or items) can be classified into several unknown groups, and that within each group the observations associated with each of the arms follow the same distribution with the same mean. However, we do not employ a predefined notion of clustering groups. Instead, we establish connections between arms by forming an edge only when the difference between their means is less than a threshold ε, thereby creating a graph feedback structure.
3 Problem Formulation
3.1 Graph Feedback with Similar Arms
We consider a stochastic N-armed bandit problem with an undirected feedback graph G = (V, E) and time horizon T. At each round t, the learner selects an arm I_t ∈ V and obtains a reward. Without loss of generality, we assume that the rewards are bounded in [0, 1] or are 1-subGaussian (this is simply to provide a uniform description of bounded and subGaussian rewards; our results extend easily to other subGaussian distributions). The expected reward of arm i is denoted by μ_i. The graph G captures all the feedback relationships over the arm set V: an edge (i, j) in E means that arms i and j are ε-similar, i.e., |μ_i − μ_j| ≤ ε, where ε is some constant greater than 0. The learner gets a side observation of arm j when pulling arm i whenever (i, j) ∈ E, and vice versa. Let N_i denote the observation set of arm i, consisting of i and its neighbors in G. Let T_i(t) and O_i(t) denote the number of pulls and the number of observations of arm i up to time t, respectively. We assume that each node in G has a self-loop, i.e., the learner always observes the reward of the pulled arm.
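As a small illustration of this observation structure, the following snippet builds the observation sets from a made-up vector of means and a made-up threshold ε (both are our own illustrative values, not taken from the paper):

```python
import numpy as np

# Hypothetical means of five arms and a hypothetical similarity threshold.
means = np.array([0.10, 0.18, 0.40, 0.52, 0.90])
eps = 0.1

# Observation set of each arm: the arm itself plus every arm whose mean is
# within eps of it (the self-loop is automatic since |mu_i - mu_i| = 0 <= eps).
obs_sets = {i: set(np.flatnonzero(np.abs(means - means[i]) <= eps).tolist())
            for i in range(len(means))}
print(obs_sets)  # arms 0 and 1 observe each other; arms 2, 3, 4 only observe themselves
```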
Let μ* denote the expected reward of the optimal arm, i.e., μ* = max_i μ_i. The gap between the expectation of the optimal arm and a suboptimal arm i is denoted by Δ_i = μ* − μ_i. A policy π selects which arm to play at each time step based on the history of plays and rewards. The performance of the policy is measured by the expected regret
R_T = E[ Σ_{t=1}^{T} (μ* − μ_{I_t}) ].   (1)
3.2 Ballooning bandit setting
This setting is the same as the graph feedback with similar arms described above, except that the number of arms increases over time. Let A_t denote the set of available arms at round t. We consider the demanding case in which a new arm arrives at every round, so the total number of arms is T. We assume that the mean of each arriving arm is independently sampled from some distribution F; let a_t denote the arm that arrives at round t.
The newly arrived arm may be connected to previous arms, depending on whether their means satisfy the ε-similarity condition. In this setting, the optimal arm may vary over time. Let μ*_t denote the expected reward of the optimal arm at round t, i.e., μ*_t = max_{i ∈ A_t} μ_i. The regret is given by
R_T = E[ Σ_{t=1}^{T} (μ*_t − μ_{I_t}) ].   (2)
4 Stationary Environments
In this section, we consider graph feedback with similar arms in stationary environments, i.e., the number of arms remains constant. We first analyze the structure of the feedback graph. Then, we provide a problem-dependent regret lower bound. Following that, we introduce the D-UCB and C-UCB algorithms and provide their regret upper bounds.
Dominating and Independent Dominating Sets. A dominating set of a graph G = (V, E) is a set D ⊆ V of vertices such that every vertex not in D is adjacent to a vertex in D. The domination number of G, denoted by γ(G), is the smallest size of a dominating set.
An independent set contains vertices that are not adjacent to each other. An independent dominating set of G is a set that both dominates G and is independent. The independent domination number of G, denoted by i(G), is the smallest size of such a set. The independence number of G, denoted by α(G), is the largest size of an independent set in G. For a general graph G, it follows immediately that γ(G) ≤ i(G) ≤ α(G).
Proposition 4.1.
Let G denote the feedback graph in the similar-arms setting. Then γ(G) = i(G) and α(G) ≤ 2γ(G).
Proof sketch. The first equality is obtained by proving that G is a claw-free graph and applying Lemma A.2. The second inequality is obtained by a double counting argument. The details are in the appendix.
Proposition 4.1 shows that α(G) ≤ 2γ(G) = 2i(G). Once we obtain bounds based on the independence number, we simultaneously obtain bounds based on the domination number. Therefore, we can obtain regret bounds based on the minimum dominating set without using the feedback graph to explicitly target exploration. This cannot be guaranteed in the standard graph feedback bandit problem [Lykouris et al., 2020].
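To make Proposition 4.1 concrete, the following brute-force check compares the three graph quantities on a small randomly generated instance (a sketch for illustration only; the instance size, threshold, and helper names are ours):

```python
from itertools import combinations
import numpy as np

def check_proposition(means, eps):
    """Enumerate all vertex subsets of the similarity graph built from `means`
    and verify that the domination number equals the independent domination
    number and that the independence number is at most twice as large."""
    K = len(means)
    adj = [{j for j in range(K) if abs(means[i] - means[j]) <= eps}
           for i in range(K)]                      # closed neighborhoods
    dominates = lambda S: all(any(v in adj[u] for u in S) for v in range(K))
    independent = lambda S: all(v not in adj[u] for u, v in combinations(S, 2))
    subsets = [set(c) for r in range(1, K + 1) for c in combinations(range(K), r)]
    gamma = min(len(S) for S in subsets if dominates(S))
    i_num = min(len(S) for S in subsets if dominates(S) and independent(S))
    alpha = max(len(S) for S in subsets if independent(S))
    assert gamma == i_num and alpha <= 2 * gamma
    return gamma, i_num, alpha

rng = np.random.default_rng(1)
print(check_proposition(rng.uniform(size=7), eps=0.15))
```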
4.1 Lower Bounds
Before presenting our algorithms, we first investigate the regret lower bound of this problem. Without loss of generality, we assume the reward distribution is 1-subGaussian.
We assume in our analysis that at least one suboptimal arm is ε-similar to the optimal arm. In other words, we do not consider the easily distinguishable scenario where the optimal arm and the suboptimal arms are clearly separable. If this is not the case, our analysis method is still applicable, but the terms related to the arms similar to the optimal arm vanish in the expressions of both the lower and upper bounds. Caron et al. [2012] provided a lower bound for bandits with side observations, and we present a more refined lower bound specifically for our similarity feedback.
Theorem 4.2.
If a policy is uniformly good (for more details, see [Lai et al., 1985]), then for any problem instance it holds that
(3)
where .
Proof.
Let denote an independent dominating set that includes the optimal arm. From Proposition 4.1, . The second-best arm is denoted as . Since , .
Let us construct another policy for another problem instance without side observations. If selects arm at round , selects the arm as follows: if , selects arm too. If , will select the arm of with the largest mean. Since , policy is well-defined. It is clear that . By the classical result of [Lai et al., 1985],
(4)
. is some positive integer smaller than . Since is an independent set, , we have . Therefore, we can complete this proof by
∎
Consider two simple cases. (1) The feedback graph is a complete graph or a graph with small independence number; in this case, the lower bound holds strictly when the graph is not complete. (2) In the general case, the gap-dependent terms in the lower bound have the same order as the corresponding terms in the upper bounds of the algorithms proposed below.
4.2 Double-UCB
This particular feedback structure inspires us to first distinguish the arms within an independent set. This is a straightforward task because the means of any two arms in the independent set differ by more than ε. Subsequently, we learn from the arm with the maximum upper confidence bound in the independent set and its neighborhood, which is likely to include the optimal arm. Our algorithm interleaves these two processes.
Algorithm 1 shows the pseudocode of our method. Steps 4-9 identify an independent set of G and play each arm in it once. This process does not require knowledge of the complete graph structure and lasts at most as many rounds as the size of the independent set. Step 10 computes the arm with the maximum upper confidence bound in the independent set. After a finite number of rounds, the optimal arm is likely to fall within the neighborhood of this arm. Step 11 uses the UCB rule again to select an arm from this neighborhood. Since the algorithm employs the UCB rule twice for arm selection, it is named Double-UCB.
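The sketch below illustrates the two interleaved selection rules (a minimal sketch under our own assumptions: unit-variance Gaussian rewards, a confidence radius of sqrt(2 log t / O_i), and a simulator that derives the feedback graph from the true means; the learner itself never reads `means` directly):

```python
import numpy as np

def double_ucb(means, eps, T, seed=0):
    """Minimal sketch of Double-UCB on the similar-arms feedback graph."""
    rng = np.random.default_rng(seed)
    K = len(means)
    # Feedback graph: arm j is observed whenever a neighbor (|mu_i - mu_j| <= eps) is pulled.
    adj = [np.flatnonzero(np.abs(means - means[i]) <= eps) for i in range(K)]
    obs_count = np.zeros(K)   # O_i(t): number of observations of arm i
    obs_sum = np.zeros(K)     # running sum of observed rewards of arm i
    ind_set = []              # independent set built online (Steps 4-9)
    regret = 0.0

    def ucb(i, t):
        if obs_count[i] == 0:
            return np.inf
        return obs_sum[i] / obs_count[i] + np.sqrt(2 * np.log(t + 1) / obs_count[i])

    for t in range(T):
        # Steps 4-9: while some arm is not covered by the independent set,
        # play one such arm and add it to the set.
        uncovered = [i for i in range(K) if all(i not in adj[j] for j in ind_set)]
        if uncovered:
            arm = uncovered[0]
            ind_set.append(arm)
        else:
            # Step 10: max-UCB arm within the independent set.
            i_hat = max(ind_set, key=lambda i: ucb(i, t))
            # Step 11: UCB again, restricted to the neighborhood of that arm.
            arm = max(adj[i_hat], key=lambda i: ucb(i, t))
        # Pull `arm`; side observations reveal the rewards of all its neighbors.
        for j in adj[arm]:
            obs_count[j] += 1
            obs_sum[j] += means[j] + rng.normal()
        regret += means.max() - means[arm]
    return regret
```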
4.2.1 Regret Analysis of Double-UCB
We use to denote the set of all independent dominating sets of graph formed by . Let
Note that the elements in this collection are independent sets rather than individual nodes.
Theorem 4.3.
Assume that the reward distribution is 1-subGaussian or bounded in [0, 1]. With an appropriate choice of the confidence parameter, the regret of Double-UCB after T rounds is upper bounded by
(5)
where
Recall that we assume , then . The index set in the summation of the first term is non-empty. If , the first term vanishes.
Proof sketch. The regret can be analyzed in two parts. The first part arises from Step 9 selecting an arm whose neighborhood does not contain the optimal arm. The algorithm can easily distinguish the suboptimal arms in the independent set from the optimal arm; the regret caused by this part corresponds to the first term of the bound. The second part of the regret comes from selecting a suboptimal arm in the neighborhood of the chosen arm. This part can be viewed as applying the UCB rule on a graph with independence number at most 2.
Since , , we have
Therefore, the regret upper bound can be denoted as
(6)
Our upper bound has an extra logarithmic factor compared to the lower bound in Theorem 4.2. This additional logarithm appears to be essential if one uses a gap-based analysis similar to [Cohen et al., 2016]; this issue has been discussed in [Lykouris et al., 2020].
From Theorem 4.3, we have the following gap-free upper bound
Corollary 4.4.
The regret of Double-UCB is bounded by
4.3 Conservative UCB
Double-UCB is a very natural algorithm for the similar-arms setting. For this particular feedback structure, we also propose Conservative UCB (C-UCB), which simply modifies Step 10 of Algorithm 1 to use a more conservative index. This index is conservative when exploring the arms in the neighborhood of the chosen independent-set arm: it does not immediately try each arm, but only selects arms that have already been observed a sufficient number of times. Algorithm 2 shows the pseudocode of C-UCB.
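One way to read this conservative rule against the statistics used in the D-UCB sketch above (a hedged sketch: the observation threshold `min_obs` and the exact lower-confidence form are our own stand-ins, not the literal index of Algorithm 2):

```python
import numpy as np

def conservative_index(i, t, obs_count, obs_sum, min_obs):
    """Conservative selection index: an arm is eligible only after it has been
    observed at least `min_obs` times; among eligible arms, the one with the
    largest lower confidence bound is selected (cf. the description above)."""
    if obs_count[i] < min_obs:
        return -np.inf
    return obs_sum[i] / obs_count[i] - np.sqrt(2 * np.log(t + 1) / obs_count[i])
```

Replacing the optimistic index with this rule means a freshly connected arm is never pulled merely because it is under-explored; it must first accumulate side observations through its neighbors.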
4.3.1 Regret Analysis of Conservative UCB
Let denote the subgraph with arms . The set of all independent dominating sets of graph is denoted as . We can also define in this way. Define .
We divide the regret into two parts. The first part is the regret caused by choosing arm in , and the analysis for this part follows the same approach as D-UCB.
The second part is the regret of choosing suboptimal arms in the neighborhood of the chosen independent-set arm. It can be proven that once the optimal arm has been observed sufficiently many times, the algorithm selects the optimal arm with high probability. Intuitively, for any arm, the following events hold with high probability:
(7)
and
(8)
Since the optimal arm satisfies Equation 7 and the suboptimal arms satisfy Equation 8 with high probability, the suboptimal arms are unlikely to be selected.
The key to the problem lies in ensuring that the optimal arm is observed sufficiently many times. In the subgraph under consideration, all arms are connected to the central arm of the neighborhood, so as long as the time horizon is sufficiently long, this arm will inevitably be observed enough times, and then the arms whose means are too small will be ignored (similar to Equation 7 and Equation 8). Choosing any of the remaining arms, whose means are close to that of the optimal arm, always yields an observation of the optimal arm, so the optimal arm can indeed be observed sufficiently many times. Therefore, we have the following theorem:
Theorem 4.5.
Under the same conditions as Theorem 4.3, the regret of C-UCB is upper bounded by
(9)
The regret upper bound of C-UCB improves on that of D-UCB by a logarithmic factor, but it contains an additional problem-dependent term. This upper bound still does not match the lower bound in Theorem 4.2. If we ignore the terms independent of T, the regret upper bound of C-UCB matches the lower bound.
4.4 UCB-N
UCB-N has been analyzed in the standard graph feedback model [Caron et al., 2012, Hu et al., 2020, Lykouris et al., 2020]. One may wonder whether UCB-N can achieve similar regret upper bounds in our setting. In fact, if UCB-N uses the same upper confidence function as ours, it has a regret upper bound similar to that of D-UCB. We have the following theorem:
Theorem 4.6.
Remark 4.7.
The gap-free upper bound of UCB-N is of the same order as that in Corollary 4.4. Due to the similarity assumption, the regret upper bound of UCB-N improves on that of [Lykouris et al., 2020]. D-UCB and C-UCB are specifically designed for similarity feedback structures and may fail in the standard graph feedback setting: the optimal arm may be connected to an arm with a very small mean, so the neighborhood of the arm selected in Step 9 may not necessarily include the optimal arm. However, under the ballooning setting, UCB-N cannot achieve sublinear regret, whereas D-UCB and C-UCB apply naturally and achieve sublinear regret under certain conditions.
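For reference, a minimal sketch of one round of the UCB-N baseline under the same observation model as the D-UCB sketch above (the confidence radius is again our own assumption):

```python
import numpy as np

def ucb_n_step(t, obs_count, obs_sum, adj, means, rng):
    """One round of UCB-N: pull the single arm with the largest UCB over all
    arms, then update the pulled arm and all of its neighbors.  No independent
    set is maintained, so every newly arrived arm starts with an infinite index
    and must eventually be pulled -- the reason UCB-N cannot give sublinear
    regret in the ballooning setting."""
    with np.errstate(divide="ignore", invalid="ignore"):
        index = np.where(obs_count > 0,
                         obs_sum / obs_count + np.sqrt(2 * np.log(t + 1) / obs_count),
                         np.inf)
    arm = int(np.argmax(index))
    for j in adj[arm]:
        obs_count[j] += 1
        obs_sum[j] += means[j] + rng.normal()
    return arm
```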
5 Ballooning Environments
This section considers the setting where the number of arms increases over time. This problem is highly challenging, and prior research has relied on strong assumptions to achieve sublinear regret. The graph feedback structure we propose is particularly effective for this setting. Intuitively, if an arriving arm has a mean very close to arms that have already been distinguished, the algorithm does not need to distinguish it further. This can make the number of truly effective arriving arms significantly smaller than T, making it easier to obtain a sublinear regret bound.
5.1 Double-UCB for Ballooning Setting
Algorithm 3 shows the pseudocode of our method, Double-UCB-BL. For any set of arms S, let N(S) denote the set of arms linked to S, i.e., the union of the observation sets of the arms in S. Upon the arrival of each arm, the algorithm first checks whether it lies in the neighborhood of the current independent set; if it does not, the arm is added to the independent set. The independent set is thus constructed online as arms arrive, while the other parts remain entirely consistent with Double-UCB.
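A minimal sketch of the online independent-set maintenance in Double-UCB-BL (the ε-similarity test below is written against the true means only to keep the sketch self-contained; in the algorithm the test is simply whether the new arm is linked to the current independent set in the observed feedback graph):

```python
def add_arriving_arm(ind_set, means, new_mean, eps):
    """On arrival of a new arm, record its mean and add it to the independent
    set only if it is not eps-similar to any arm already in the set, i.e., it
    lies outside the neighborhood of the current independent set."""
    new_arm = len(means)
    means.append(new_mean)
    if all(abs(new_mean - means[i]) > eps for i in ind_set):
        ind_set.append(new_arm)
    return new_arm
```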
5.1.1 Regret Analysis
Let denote the independent set at time and denote the arm that includes the optimal arm . Define . To simplify the problem, we only consider problem instances that satisfy the following assumption:
Assumption 5.1.
, is some constant.
The first challenge in the ballooning setting is the potential existence of many arms with means very close to the optimal arm; that is, the set of such arms may be very large. We first define a quantity that is easy to analyze as an upper bound on the expected number of arms falling into this set. Define
(11)
It’s easy to verify that . We have
The second challenge is that our regret is closely related to the independence number, yet under the ballooning setting the graph's independence number is a random variable; denote this independence number by α_T. We attempt to address this issue by providing a high-probability upper bound on the independence number. Let p denote the probability that the means of two independently drawn arms are ε-similar; then
p = ∫ ( F(x + ε) − F(x − ε) ) f(x) dx,   (12)
where f is the probability density function of F. Lemma A.3 provides a high-probability upper bound on α_T in terms of p.
Now, we can give the following upper bound of Double-UCB.
Theorem 5.2.
Assume that the reward distribution is 1-subGaussian or bounded in [0, 1]. With an appropriate choice of the confidence parameter, the regret of Double-UCB after T rounds is upper bounded by
(13)
If F is a Gaussian distribution, we have the following corollary:
Corollary 5.3.
If is the Gaussian distribution , we have and . The asymptotic regret upper bound is of order
The order of is smaller than any power of . For example, if , is a positive integer, we have .
Lower Bounds of Double-UCB Define . Then we have . Let
We have . If an arm arrives and falls into the set at round t, then it will be selected at least once, unless the algorithm selects a different arm in Step 9. Note that if an arm in the set is selected at some round, the resulting regret is at least . Once we estimate the size of the set and the number of rounds in which arms fall into it, a simple regret lower bound can be obtained. We have the following lower bound:
Theorem 5.4.
Assume that the reward distribution is 1-subGaussian or bounded in [0, 1]. Under the same parameter choice as above, the regret of Double-UCB is lower bounded by
(14)
If F is the uniform distribution U[0, 1], the relevant quantity can be calculated explicitly, and the same holds for the half-triangle distribution. In both cases, the resulting regret lower bounds are of linear order.
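The contrast between these distributions can be seen with a quick simulation of a rough proxy for the quantity driving the lower bound: the number of arrivals whose means land in a band of width ε just below the running maximum (the band (ε, 2ε] and this counting proxy are our own simplification, not the paper's exact set):

```python
import numpy as np

def near_optimal_band_count(sample, eps):
    """Count arrivals whose mean falls within (eps, 2*eps] of the best mean
    seen so far; as discussed above, such arms are typically pulled at least
    once by D-UCB, each pull costing regret on the order of eps."""
    best, count = -np.inf, 0
    for mu in sample:
        best = max(best, mu)
        if eps < best - mu <= 2 * eps:
            count += 1
    return count

rng = np.random.default_rng(0)
T, eps = 100_000, 0.05
print(near_optimal_band_count(rng.uniform(size=T), eps))      # grows linearly with T
print(near_optimal_band_count(rng.standard_normal(T), eps))   # grows far more slowly
```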
5.2 C-UCB for Ballooning Setting
The failure on the common uniform distribution limits the applicability of D-UCB. The fundamental reason is the aggressive exploration strategy of the UCB rule, which tries to explore every arm that enters the candidate set as much as possible. In this section, we apply C-UCB to the ballooning setting; this algorithm imposes no requirements on the distribution F.
Algorithm 4 shows the pseudocode of Conservative UCB (C-UCB) for the ballooning setting. This algorithm is almost identical to D-UCB, with the only change being the rule for selecting an arm: an arm is chosen only if it has been observed a sufficient number of times and its confidence lower bound is the largest. This modification avoids exploring every arm that enters the candidate set. We have the following regret upper bound for C-UCB:
Theorem 5.5.
Assume that the reward distribution is 1-subGaussian or bounded in [0, 1]. With an appropriate choice of the confidence parameter, the regret of C-UCB is upper bounded by
(15)
The arms arrive one by one in the ballooning setting, and the optimal arm may change over time; therefore, the regret upper bound depends on the time-varying gaps rather than on fixed ones. Compared to Theorem 5.2, the upper bound for C-UCB does not involve the distribution-dependent quantities, making it independent of the distribution F. Under Assumption 5.1, it achieves a problem-dependent sublinear regret upper bound.
6 Experiments
6.1 Stationary Settings
We first compare the performance of UCB-N under standard graph feedback and under graph feedback with similar arms (our code is available at https://github.com/qh1874/GraphBandits_SimilarArms). The purpose of this experiment is to show that the similarity structure improves the performance of the original UCB algorithm. To ensure fairness, the problem instances we use in both cases have roughly the same independence number. In the standard graph feedback case, we also use a random graph, generating edges with the probability p given by Equation 12; the graph generated in this way has roughly the same independence number as the graph in the ε-similarity setting. In particular, if F is the Gaussian distribution N(0, 1), then p = 2Φ(ε/√2) − 1. If F is the uniform distribution U[0, 1], then p = 2ε − ε².
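As a quick sanity check of these edge probabilities (closed forms implied by Equation 12; the function names are ours):

```python
from math import erf

def edge_prob_gaussian(eps):
    # P(|X - Y| <= eps) for X, Y i.i.d. N(0, 1): X - Y ~ N(0, 2),
    # so the probability is 2*Phi(eps / sqrt(2)) - 1 = erf(eps / 2).
    return erf(eps / 2.0)

def edge_prob_uniform(eps):
    # P(|X - Y| <= eps) for X, Y i.i.d. U[0, 1], valid for 0 <= eps <= 1.
    return 2 * eps - eps ** 2
```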
For each value of ε, we generate several problem instances, and the expected regret is averaged over these instances. The 95% confidence interval is shown as a semi-transparent region in the figure. Figure 1 shows the performance of UCB-N under Gaussian rewards. It can be observed that the regret of UCB-N in our setting is smaller than under standard graph feedback, thanks to the similarity structure. Additionally, the regret decreases as ε increases, consistent with the theoretical results.
We then compare the performance of the UCB-N, D-UCB, and C-UCB algorithms. Figure 2 shows the performance of the three algorithms with Gaussian and Bernoulli rewards. Although D-UCB and UCB-N have similar regret bounds, the empirical performance of D-UCB and C-UCB is better than that of UCB-N. This may be because D-UCB and C-UCB learn directly on an independent set, effectively leveraging the graph structure of similar arms.
6.2 Ballooning Settings
UCB-N is not suitable for the ballooning setting since it would select each arriving arm at least once. The BL-MOSS algorithm [Ghalme et al., 2021] is specifically designed for the ballooning setting; however, it assumes that the optimal arm is more likely to appear in the early rounds and requires prior knowledge of a parameter characterizing this likelihood, which is not consistent with our setting. We therefore only compare D-UCB and C-UCB under different distributions F.
Figure 3 shows the experimental results for the ballooning setting. When F follows a standard normal distribution, D-UCB and C-UCB exhibit similar performance. However, when F is a uniform distribution or a half-triangle distribution, D-UCB fails to achieve sublinear regret, while C-UCB still performs well. These results are consistent across different values of ε.
7 Conclusion
In this paper, we have introduced a new graph feedback bandit model with similar arms. For this model, we proposed two UCB-based algorithms (D-UCB and C-UCB) and provided regret upper bounds. We then extended these two algorithms to the ballooning setting, in which C-UCB is more broadly applicable than D-UCB: D-UCB can only achieve sublinear regret when the mean distribution is Gaussian-like, while C-UCB can achieve problem-dependent sublinear regret regardless of the mean distribution.
References
- Abramowitz et al. [1988] Milton Abramowitz, Irene A Stegun, and Robert H Romer. Handbook of mathematical functions with formulas, graphs, and mathematical tables, 1988.
- Allan and Laskar [1978] Robert B Allan and Renu Laskar. On domination and independent domination numbers of a graph. Discrete mathematics, 23(2):73–76, 1978.
- Alon et al. [2017] Noga Alon, Nicolo Cesa-Bianchi, Claudio Gentile, Shie Mannor, Yishay Mansour, and Ohad Shamir. Nonstochastic multi-armed bandits with graph-structured feedback. SIAM Journal on Computing, 46(6):1785–1826, 2017.
- Auer and Ortner [2010] Peter Auer and Ronald Ortner. Ucb revisited: Improved regret bounds for the stochastic multi-armed bandit problem. Periodica Mathematica Hungarica, 61(1-2):55–65, 2010.
- Auer et al. [2002a] Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine learning, 47:235–256, 2002a.
- Auer et al. [2002b] Peter Auer, Nicolo Cesa-Bianchi, Yoav Freund, and Robert E Schapire. The nonstochastic multiarmed bandit problem. SIAM journal on computing, 32(1):48–77, 2002b.
- Aziz et al. [2021] Maryam Aziz, Emilie Kaufmann, and Marie-Karelle Riviere. On multi-armed bandit designs for dose-finding clinical trials. The Journal of Machine Learning Research, 22(1):686–723, 2021.
- Buccapatnam et al. [2014] Swapna Buccapatnam, Atilla Eryilmaz, and Ness B Shroff. Stochastic bandits with side observations on networks. In The 2014 ACM international conference on Measurement and modeling of computer systems, pages 289–300, 2014.
- Buccapatnam et al. [2018] Swapna Buccapatnam, Fang Liu, Atilla Eryilmaz, and Ness B Shroff. Reward maximization under uncertainty: Leveraging side-observations on networks. Journal of Machine Learning Research, 18(216):1–34, 2018.
- Caron et al. [2012] Stéphane Caron, Branislav Kveton, Marc Lelarge, and Smriti Bhagat. Leveraging side observations in stochastic bandits. arXiv preprint arXiv:1210.4839, 2012.
- Chakrabarti et al. [2008] Deepayan Chakrabarti, Ravi Kumar, Filip Radlinski, and Eli Upfal. Mortal multi-armed bandits. Advances in neural information processing systems, 21, 2008.
- Cohen et al. [2016] Alon Cohen, Tamir Hazan, and Tomer Koren. Online learning with feedback graphs without the graphs. In International Conference on Machine Learning, pages 811–819. PMLR, 2016.
- Esposito et al. [2022] Emmanuel Esposito, Federico Fusco, Dirk van der Hoeven, and Nicolò Cesa-Bianchi. Learning on the edge: Online learning with stochastic feedback graphs. Advances in Neural Information Processing Systems, 35:34776–34788, 2022.
- Even-Dar et al. [2006] Eyal Even-Dar, Shie Mannor, Yishay Mansour, and Sridhar Mahadevan. Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. Journal of machine learning research, 7(6), 2006.
- Gentile et al. [2014] Claudio Gentile, Shuai Li, and Giovanni Zappella. Online clustering of bandits. In International conference on machine learning, pages 757–765. PMLR, 2014.
- Ghalme et al. [2021] Ganesh Ghalme, Swapnil Dhamal, Shweta Jain, Sujit Gujar, and Y Narahari. Ballooning multi-armed bandits. Artificial Intelligence, 296:103485, 2021.
- Ghari and Shen [2022] Pouya M Ghari and Yanning Shen. Online learning with probabilistic feedback. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4183–4187. IEEE, 2022.
- Hu et al. [2020] Bingshan Hu, Nishant A Mehta, and Jianping Pan. Problem-dependent regret bounds for online learning with feedback graphs. In Uncertainty in Artificial Intelligence, pages 852–861. PMLR, 2020.
- Lai et al. [1985] Tze Leung Lai, Herbert Robbins, et al. Asymptotically efficient adaptive allocation rules. Advances in applied mathematics, 6(1):4–22, 1985.
- Li et al. [2011] Lihong Li, Wei Chu, John Langford, and Xuanhui Wang. Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms. In Proceedings of the fourth ACM international conference on Web search and data mining, pages 297–306, 2011.
- Li et al. [2016] Shuai Li et al. The art of clustering bandits. 2016.
- Lykouris et al. [2020] Thodoris Lykouris, Eva Tardos, and Drishti Wali. Feedback graph regret bounds for thompson sampling and ucb. In Algorithmic Learning Theory, pages 592–614. PMLR, 2020.
- Mannor and Shamir [2011] Shie Mannor and Ohad Shamir. From bandits to experts: On the value of side-observations. Advances in Neural Information Processing Systems, 24, 2011.
- Pérez et al. [2020] Adrià Pérez, Pablo Herrera-Nieto, Stefan Doerr, and Gianni De Fabritiis. Adaptivebandit: a multi-armed bandit framework for adaptive sampling in molecular simulations. Journal of chemical theory and computation, 16(7):4685–4693, 2020.
- Schwartz et al. [2017] Eric M Schwartz, Eric T Bradlow, and Peter S Fader. Customer acquisition via display advertising using multi-armed bandit experiments. Marketing Science, 36(4):500–522, 2017.
- Villar et al. [2015] Sofía S Villar, Jack Bowden, and James Wason. Multi-armed bandit models for the optimal design of clinical trials: benefits and challenges. Statistical science: a review journal of the Institute of Mathematical Statistics, 30(2):199, 2015.
- Wang et al. [2024] Zhiyong Wang, Jize Xie, Xutong Liu, Shuai Li, and John Lui. Online clustering of bandits with misspecified user models. Advances in Neural Information Processing Systems, 36, 2024.
- Yang et al. [2022] Junwen Yang, Zixin Zhong, and Vincent YF Tan. Optimal clustering with bandit feedback. arXiv preprint arXiv:2202.04294, 2022.
Proofs of Theorems and Corollaries
In this appendix, we provide the detailed proofs of the theorems and corollaries in the main text. Specifically, we provide detailed proofs of Proposition 4.1, Theorem 4.3, Theorem 4.6, Theorem 5.2, Corollary 5.3 and Theorem 5.5. The proof of Theorem 4.5 can be easily obtained from the analysis of Theorem 5.5, and the proof of Theorem 5.4 is similar to that of Theorem 5.2; we omit their proofs. Beforehand, we give some well-known results to simplify the proofs.
Appendix A Facts and Lemmas
Lemma A.1.
Assume that are independent random variables. . If is bounded in or -subGaussian, for any , with probability at least ,
Lemma A.2.
Allan and Laskar [1978] If G is a claw-free graph, then γ(G) = i(G).
Lemma A.3.
Assume that . Let denote the graph constructed by , and let denote the independence number of . Then
(16)
where is some constant related to .
Proof.
Let , then
where is the probability density function of . This means that in , the probability of any two nodes being connected by an edge is . Hence, is a random graph.
Let be the number of independent sets of order . Let ,
(17)
where uses the fact that .
If , . We have,
If ,i.e., , we have
Therefore,
∎
Lemma A.4 (Chernoff Bounds).
Let be independent Bernoulli random variables. Let denote their sum and let denote the sum's expected value.
Lemma A.5.
Abramowitz et al. [1988] For a Gaussian distributed random variable with mean and variance , for any ,
Appendix B Proofs of Proposition 4.1
(1) We first prove γ(G) = i(G). A claw-free graph is a graph that contains no induced subgraph isomorphic to the claw K_{1,3}. By Lemma A.2, we just need to prove that G is claw-free.
Assume that G has a claw, i.e., there exist nodes v, a, b, c such that v is connected to a, b, and c, while a, b, and c are mutually unconnected. Consider the nodes a, b, and c. Since each of them is connected to v, their means all lie within ε of the mean of v. Hence at least two of them, say a and b, have means on the same side of the mean of v, so the difference between the means of a and b is at most ε and an edge forms between them. This contradicts the assumption that a, b, and c are mutually unconnected. Therefore, G is claw-free.
(2) α(G) ≤ 2γ(G). Let A be a maximum independent set and let D be a minimum independent dominating set. Note that any vertex of D is adjacent (counting the vertex itself in its closed neighborhood) to at most two vertices of A, and that each vertex of A is adjacent to at least one vertex of D. So by a double counting argument, assigning to each vertex of A one adjacent vertex of D, we count each vertex of D at most twice, which gives |A| ≤ 2|D|, i.e., α(G) ≤ 2i(G) = 2γ(G).
Appendix C Proofs of Theorem 4.3
Let denote the independent set obtained after running Step 4-8 in Algorithm 1. The obtained may vary with each run. We first fix for analysis and then take the supremum of the results with respect to , obtaining an upper bound independent of .
Let denote the arm that includes the optimal arm, i.e., . Let . The regret can be divided into two parts: one part comes from the selection of arms and the other part from the selection of arms :
(18)
We first focus on the expected regret incurred by the first part. Let , denote the arm linked to the selected arm (Step 9 in Algorithm 1).
(19)
The last inequality uses the following two facts:
and
Recall that denotes the number of observations of arm up to time . Let denote the average reward of arm after it has been observed times, . For any ,
(20)
Following the same argument as in Auer et al. [2002a], choosing , we have
From Lemma A.1,
Hence,
Plugging this into Equation 18, we get
(21)
Now we focus on the second part in Equation 18.
For any , we have . This means the gap between suboptimal and optimal arms is bounded. Therefore, this part can be seen as using UCB-N [Lykouris et al., 2020] on the graph formed by . We can directly use their results by adjusting some constant factors. Following Theorem 6 in Lykouris et al. [2020], this part has a regret upper bound of
(22)
Let . Combining Equation 21 and Equation 22 and using Proposition 4.1 that , we have
(23)
Appendix D Proofs of Theorem 4.6
We just need to discuss in intervals
The regret for in … can be bounded by the same method used in the proof of Theorem 4.3. We can calculate that the regret of this part has the same form as .
Recall that denote the subgraph with arms . The set of all independent dominating sets of graph is denoted as . The regret for in can be bounded as
Due to the similarity assumption, is a complete graph. Therefore,
Appendix E Proofs of Theorem 5.2
Recall that denotes the independent set at time and denotes the arm that includes the optimal arm . We have . Let denote the sequences generated from with length , thus is a random variable.
Since the optimal arm may change over time, this leads to a time-varying . We denote the new gap as . Therefore, the analysis method in Theorem 4.3 is no longer applicable here. The regret can also be divided into two parts:
(24)
We focus on first.
(25)
Given a fixed , is deterministic. Since the gap between optimal and suboptimal arms may vary over time, we define
denote the minimum gap when is not the optimal selection that include the optimal arm. Then .
Following the proofs of Theorem 4.3, for any , the probability of the algorithm selecting it will be less than after it has been selected times. Therefore, the inner expectation of Equation 25 is bounded as
(26)
The inner expectation of Equation 25 also has a naive bound .
Plugging into and using Lemma A.3, we get
(27)
Now we consider the part .
(28)
Recall that Then
Since are independent of each other, the events are also mutually independent. Using Lemma A.4,
(29)
Given a fixed instance , we divide the rounds into parts:
This partition satisfies is stationary. The is also stationary, , let .
Let’s focus on the intervals , the analysis for other intervals is similar.
The best case is that all arms in arrive at the beginning. In this case, the regret for this part is equivalent to the regret of using the UCB algorithm on the subgraph formed by for rounds. The independence number of the subgraph formed by is , which leads to a regret upper bound independent of the number of arms arriving. However, we are primarily concerned with the worst case. The worst case is that the algorithm cannot benefit from the graph feedback at all. That is, the algorithm spends rounds distinguishing the optimal arm from the first arriving suboptimal arm . After this process, the second suboptimal arm arrives, and again rounds are spent distinguishing the optimal arm from this arm…
Let denote the arms that fall into at rounds . If has not arrived by round , .
Following the same argument as the proofs of Theorem 4.3, the inner expectation in Equation 28 can be bounded as
(30)
where comes from the fact that and , follows from .
Similar to the approach for bounding , the inner expectation in Equation 28 also has a naive bound .
Substituting into Equation 28, we get
(31)
From Equation 27 and Equation 31, we get the total regret
Appendix F Proofs of Corollary 5.3
First, we calculate an that is independent of the distribution . Given as independent random variables from . Let
Denote . Then
and
where is the support set of . Since
We get
(32)
We need to estimate the upper and lower bounds of separately. For the Gaussian distribution , we first prove
Denote the integrand function as . Let ,
(33)
First, we have
Now we calculate the second term.
(36)
We only need to bound the integral within .
(37)
where uses Lemma A.5.
Therefore,
Then we derive a lower bound for .
Since is a convex function in , we have
Then when ,
Hence,
Substituting into Equation 32,
(38)
The function is increasing on the interval . We have
(39)
Observe that for large ( ), . Therefore, for any ,
We have
Appendix G Proofs of Theorem 5.5
Similar to the proofs of Theorem 5.2, the regret can also be divided into two parts:
(40)
Part is exactly the same as the analysis of the corresponding part of Theorem 5.2. We focus only on part .
Since are independent of each other, each is equally likely to be the largest one. Alternatively, we can obtain this result through Equation 32. We have
(42)
where is the Euler constant. Then .
Given a fixed instance , we also divide the rounds into parts:
This partition satisfies is stationary. The is also stationary, , let .
Let us focus on the intervals ; the analysis for the other intervals is similar. All arms falling into at rounds are denoted by . The arms in can be divided into two parts: and . If has not arrived by round , . Then we have
(43)
From the way our algorithm constructs independent sets, it can be inferred that all arms in are connected to , and the distances are all less than . Hence, both and form a clique.
Note that selecting any arm in will result in the observation of . We have
(44)
Let . If , the event never occurs. Then
From Lemma A.1, we have
The regret incurred by in is at most
Using the same method, we can get the regret incurred by in is bounded as
Therefore, choosing , with the fixed is bounded as
(45)
From Equation 41 and Equation 42, we have
Therefore, we get the total regret