
Towards Fundamental Limits of Multi-armed Bandits with Random Walk Feedback

Tianyu Wang   Lin F. Yang   Zizhuo Wang
Abstract

In this paper, we consider a new Multi-Armed Bandit (MAB) problem where arms are nodes in an unknown and possibly changing graph, and the agent (i) initiates random walks over the graph by pulling arms, (ii) observes the random walk trajectories, and (iii) receives rewards equal to the lengths of the walks. We provide a comprehensive understanding of this problem by studying both the stochastic and the adversarial setting. We show that this problem is not easier than a standard MAB in an information theoretical sense, although additional information is available through random walk trajectories. Behaviors of bandit algorithms on this problem are also studied.

1 Introduction

Multi-Armed Bandit (MAB) problems simultaneously call for exploitation of good options and exploration of the decision space. Algorithms for this problem find applications in various domains, from medical trials (Robbins, 1952) to online advertisement (Li et al., 2010). Many authors have studied bandit problems from different perspectives.

In this paper, we study a new bandit learning problem where the feedback is a random walk over the arms. That is, each time an arm/node $i$ is played, one observes a random walk over the arms/nodes from $i$ to an absorbing node, and the reward/loss is the length of this random walk. In this learning setting, we want to carefully select the nodes at which to initialize the random walks, so that the hitting time to the absorbing node is maximized/minimized. This learning protocol captures important problems in computational social networks. In particular, it encapsulates an online learning version of the influence maximization problem (Kempe et al., 2003); see, e.g., Arora et al. (2017) for a survey on influence maximization. The goal of influence maximization is to find a node such that a diffusion starting from that node propagates through the network and influences as many nodes as possible. Our problem provides an online learning formulation of influence maximization, where the diffusion over the graph is modeled as a random walk over the graph. As a concrete example, the word-of-mouth rating of a movie on social networks can be captured by the influence maximization model (Arora et al., 2017).

More formally, we consider the following model. The environment is modeled by a graph $G=(V,E)$, where $V$ consists of transient nodes $[K]:=\{1,2,\dots,K\}$ and an absorbing node $*$. Each edge $ij$ ($i\in[K]$, $j\in[K]\cup\{*\}$) encodes two quantities: the transition probability from $i$ to $j$ and the distance from $i$ to $j$. For $t=1,2,\dots,T$, we pick a node to start a random walk, and observe the random walk trajectory from the selected node to the absorbing node. For each random walk, we use its hitting time (to the absorbing node $*$) to model how long-lasting it is. With this formulation, we can define a bandit learning problem for the question. Each time, the agent picks a node in $G$ to start a random walk, observes the trajectory, and receives the hitting time of the random walk as reward. In this setting, the performance of learning algorithms is typically measured by regret, which is the difference between the rewards of an optimal node and the rewards of the nodes played by the algorithm. Unlike in standard multi-armed bandit problems, the feedback consists of random walk trajectories and thus reveals information not only about the node played, but also about the environment (transitions/distances among nodes).

Interestingly, the extra information from the random walks does not trivialize the learning problem in a mini-max sense. Intuitively, the sample/event space describes how much information the feedback can carry. If we execute a policy $\pi$ for $T$ epochs on a problem instance, the sample space is $\left(\cup_{h=1}^{\infty}\mathcal{B}^{h}\right)^{T}$, where $\mathcal{B}$ is the space of all outcomes that a single step on a trajectory can generate. For example, if every edge length lies in $[0,1]$, then $\mathcal{B}=[0,1]\times[K]$, since a single step on a trajectory can land at any node, and the corresponding edge length may be any number in $[0,1]$. In this case the sample space of a single epoch is $\cup_{h=1}^{\infty}\mathcal{B}^{h}$, where the union up to infinity captures the fact that the trajectory can be arbitrarily long. Indeed, the set $\left(\cup_{h=1}^{\infty}\left([0,1]\times[K]\right)^{h}\right)^{T}$ contains much richer information than that of standard MAB problems, which is $[0,1]^{T}$ for $T$ rounds of pulls. This richer sample space suggests that each feedback carries much more information, and thus our problem could potentially be strictly easier than a standard MAB. However, the information-theoretic lower bounds for our problem are of order $\widetilde{\Omega}(\sqrt{T})$. In other words, we prove that even though each trajectory carries much more information than a reward sample (and has a chance of revealing all information about the environment when the trajectory is long), no algorithm can beat the $\widetilde{\Omega}(\sqrt{T})$ bound in a mini-max sense.

In summary, the contributions of this paper are as follows.

  1. We propose a new online learning problem that is compatible with important computational social network problems, including influence maximization, as already discussed in the introduction.

  2. We prove lower bounds for this problem in the random walk trajectory sample space. Our results show that, although random walk trajectories carry much more information than reward samples, the additional information does not simplify the problem: no algorithm can beat an $\widetilde{\Omega}(\sqrt{T})$ lower bound in a mini-max sense. These information-theoretic findings for the newly proposed problem are discussed in Section 3.

  3. We propose algorithms for bandit problems with random walk feedback, and show that the performance of our algorithms improves over that of standard MAB algorithms.

1.1 Related Works

The study of bandit problems dates back to at least Thompson (1933), and such problems have been studied extensively in the literature. One of the most popular approaches to the stochastic bandit problem is the family of Upper Confidence Bound (UCB) algorithms (Robbins, 1952; Lai and Robbins, 1985; Auer, 2002). Various extensions of UCB algorithms have been studied (Srinivas et al., 2010; Abbasi-Yadkori et al., 2011; Agrawal and Goyal, 2012; Bubeck and Slivkins, 2012; Seldin and Slivkins, 2014). Specifically, some works use KL-divergence to construct the confidence bound (Lai and Robbins, 1985; Garivier and Cappé, 2011; Maillard et al., 2011), or include variance estimates within the confidence bound (Audibert et al., 2009; Auer and Ortner, 2010). UCB is also used in the contextual learning setting (e.g., Li et al., 2010; Krause and Ong, 2011; Slivkins, 2014). The UCB algorithm and its variations are also used for other feedback settings, including the stochastic combinatorial bandit problem (Chen et al., 2013, 2016). Parallel to the stochastic setting, studies on the adversarial bandit problem form another line of literature. Since the randomized weighted majority algorithm (Littlestone and Warmuth, 1994), exponential weighting has remained a leading strategy for adversarial bandits (Auer et al., 1995; Cesa-Bianchi et al., 1997; Auer et al., 2002). Many efforts have been made to improve and extend exponential weights algorithms. For example, Kocák et al. (2014) target implicit variance reduction, and Mannor and Shamir (2011) and Alon et al. (2013) study a partially observable setting. Despite the large body of literature, no previous work has, to the best of our knowledge, explicitly focused on problems where the feedback is a random walk.

For both stochastic and adversarial bandits, lower bounds in different scenarios have been derived, starting from the $\mathcal{O}(\log T)$ asymptotic lower bounds for consistent policies (Lai and Robbins, 1985). Worst-case bounds of order $\mathcal{O}(\sqrt{T})$ have also been derived (Auer et al., 1995) for the stochastic setting. In addition to the classic stochastic setting, lower bounds in other stochastic (or stochastic-like) settings have also been considered, including PAC-learning complexity (Mannor and Tsitsiklis, 2004), best arm identification complexity (Kaufmann et al., 2016; Chen et al., 2017), and lower bounds in continuous spaces (Kleinberg et al., 2008). Lower bound arguments for adversarial bandits can often be converted to lower bound arguments for stochastic bandits (Auer et al., 1995). Yet the above mentioned works do not cover the lower bounds for our setting.

The Stochastic Shortest Path problem (with adversarial edge lengths) (e.g., Bertsekas and Tsitsiklis, 1991; Neu et al., 2012; Rosenberg and Mansour, 2020) and online MDP problems (e.g., Even-Dar et al., 2009; Gergely Neu et al., 2010; Dick et al., 2014; Jin et al., 2019) are related to our problem. However, these settings are fundamentally different because of the sample space generated by the possible trajectories. In all previously studied settings, a control is available at each step, and a regret is immediately incurred; in this regard, an arbitrarily long uncontrolled trajectory is impossible in all previous studies. In other words, every trajectory in previous works is effectively of length one, since a control is imposed, and a regret is incurred, every time the state changes.

2 Problem Setting

The learning process repeats for $T$ epochs and the learning environment is described by graphs $G_1,G_2,\dots,G_T$ for epochs $t=1,2,\dots,T$. The graph $G_t$ is defined on $K$ transient nodes $[K]=\{1,2,\dots,K\}$ and one absorbing node denoted by $*$. We will use $V=[K]$ to denote the set of transient nodes, and use $\widetilde{V}:=[K]\cup\{*\}$ to denote the transient nodes together with the absorbing node. On this node set $\widetilde{V}$, graph $G_t$ encodes transition probabilities and edge lengths: $G_t := \left(\{m_{ij}\}_{i\in V,\,j\in\widetilde{V}},\, \{l_{ij}^{(t)}\}_{i\in V,\,j\in\widetilde{V}}\right)$, where $m_{ij}$ is the probability of transiting from $i$ to $j$ and $l_{ij}^{(t)}\in[0,1]$ is the length of the edge from $i$ to $j$ (at epoch $t$). We gather the transition probabilities among transient nodes into a transition matrix $M=[m_{ij}]_{i,j\in[K]}$. We make the following assumption about $M$.

Assumption 1.

The transition matrix $M=[m_{ij}]_{i,j\in[K]}$ among transient nodes is primitive (i.e., there exists a positive integer $k$ such that all entries of $M^k$ are positive). In addition, there is an absolute constant $\rho$ such that $\|M\|_{\infty}\leq\rho<1$, where $\|M\|_{\infty}=\max_{i\in[K]}\sum_{j\in[K]}|m_{ij}|$ is the maximum absolute row sum.

In Assumption 1, the primitivity assumption ensures that we can get to any transient node $v$ from any other transient node $u$. The infinity norm of $M$ being strictly less than $1$ means that, from any starting node, the random walk eventually reaches the absorbing node with probability 1. This describes the absorptiveness of the environment. The infinity norm assumption can be replaced by other notions of matrix norms. Unless otherwise stated, we assume that $\rho$ is an absolute constant independent of $K$ and $T$.
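As an illustration of Assumption 1, the following minimal sketch (our own numerical check in Python; the example matrix is our own choice) verifies that a given transition matrix is primitive and that $\|M\|_\infty\leq\rho<1$.

```python
import numpy as np

def check_assumption_1(M, k_max=None):
    """Numerically check that M is primitive and that ||M||_inf < 1."""
    K = M.shape[0]
    k_max = k_max if k_max is not None else (K - 1) ** 2 + 1  # Wielandt's bound for primitivity
    rho = np.abs(M).sum(axis=1).max()       # ||M||_inf: maximum absolute row sum
    power = np.eye(K)
    primitive = False
    for _ in range(k_max):
        power = power @ M
        if (power > 0).all():               # some power of M has all entries strictly positive
            primitive = True
            break
    return primitive, rho

M = np.array([[0.5, 0.125],
              [0.125, 0.5]])                # the two-node example used later in Section 3
print(check_assumption_1(M))                # (True, 0.625): primitive and ||M||_inf < 1
```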

Playing node $j$ at epoch $t$ generates a random walk trajectory $\mathcal{P}_{t,j}:=\big(X_{t,0}^{(j)}, L_{t,1}^{(j)}, X_{t,1}^{(j)}, L_{t,2}^{(j)}, X_{t,2}^{(j)}, \dots, L_{t,H_{t,j}}^{(j)}, X_{t,H_{t,j}}^{(j)}\big)$, where $X_{t,0}^{(j)}=j$ is the starting node, $X_{t,H_{t,j}}^{(j)}=*$ is the absorbing node, $X_{t,i}^{(j)}$ is the $i$-th node in the random walk trajectory, $L_{t,i}^{(j)}$ is the edge length from $X_{t,i-1}^{(j)}$ to $X_{t,i}^{(j)}$, and $H_{t,j}$ is the number of edges in trajectory $\mathcal{P}_{t,j}$. For simplicity, we write $X_{t,i}^{(j)}$ (resp. $L_{t,i}^{(j)}$) as $X_{t,i}$ (resp. $L_{t,i}$) when it is clear from context.

For the random trajectory $\mathcal{P}_{t,j}:=\big(X_{t,0}, L_{t,1}, X_{t,1}, L_{t,2}, X_{t,2},\dots, L_{t,H_{t,j}}, X_{t,H_{t,j}}\big)$, the length of the trajectory (or hitting time of node $j$ at epoch $t$) is defined as $\mathcal{L}\left(\mathcal{P}_{t,j}\right):=\sum_{i=1}^{H_{t,j}}L_{t,i}$. Here we use the edge lengths to represent the reward of the trajectory. In practice, the edge lengths may have real-world meanings. For example, the outgoing edge from a node may represent the utility (e.g., profit) of visiting this node. At epoch $t$, the agent selects a node $J_t\in[K]$ to initiate a random walk, and observes the trajectory $\mathcal{P}_{t,J_t}$. In the stochastic setting, the environment does not change across epochs. Thus for any fixed node $v\in[K]$, the random trajectories $\mathcal{P}_{1,v},\mathcal{P}_{2,v},\mathcal{P}_{3,v},\dots$ are independent and identically distributed.
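To make the feedback model concrete, here is a minimal simulation sketch of one epoch (our own illustration; the encoding of the absorbing node as index $K$ and the helper name are our choices). It draws one trajectory from a transition matrix and edge lengths, and returns the visited nodes together with the hitting time $\mathcal{L}(\mathcal{P}_{t,j})$.

```python
import numpy as np

def sample_trajectory(start, M, lengths, rng):
    """Simulate one random walk from `start` until absorption.

    M:        K x K transition matrix among transient nodes; the leftover row mass
              1 - sum_j M[i, j] is the probability of jumping from i to the absorbing node.
    lengths:  K x (K + 1) array of edge lengths; column K is the edge into the absorbing node.
    Returns the visited nodes (absorbing node encoded as K) and the hitting time."""
    K = M.shape[0]
    path, hitting_time, current = [start], 0.0, start
    while current != K:                                        # K encodes the absorbing node *
        probs = np.append(M[current], 1.0 - M[current].sum())  # append the absorption probability
        nxt = int(rng.choice(K + 1, p=probs))
        hitting_time += lengths[current, nxt]
        path.append(nxt)
        current = nxt
    return path, hitting_time

# Example: two transient nodes with unit edge lengths.
rng = np.random.default_rng(0)
M = np.array([[0.5, 0.125], [0.125, 0.5]])
path, L = sample_trajectory(0, M, np.ones((2, 3)), rng)
print(path, L)          # e.g. [0, 0, 1, 2] with hitting time 3.0
```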

3 Information Theoretical Properties

Consider the case where the graphs $G_t$ do not change across epochs. To solve this problem, one can estimate the expected hitting times $\mu_j:=\mathbb{E}\left[\mathcal{L}(\mathcal{P}_{t,j})\right]$ for all nodes $j\in[K]$ (and maintain confidence intervals for the estimates). As one would expect, the random walk trajectory reveals more information than a sample of the reward. Naturally, this allows us to reduce this problem to a standard (stochastic) MAB problem.

3.1 Reduction to Standard MAB

Recall that $\mathcal{P}_{t,J_t}=\left(X_{t,0},L_{t,1},X_{t,1},\dots,L_{t,H_{t,J_t}},X_{t,H_{t,J_t}}\right)$ is the trajectory at epoch $t$. For a node $v$ and the trajectories $\mathcal{P}_{1,J_1},\mathcal{P}_{2,J_2},\mathcal{P}_{3,J_3},\dots$, let $k_{v,i}$ be the index (epoch) of the $i$-th trajectory that covers node $v$. Let $Y_{v,k_{v,i}}$ be the sum of edge lengths between the first occurrence of $v$ and the absorbing node $*$ in trajectory $k_{v,i}$. One has the following proposition due to the Markov property.

Proposition 1.

In the stochastic setting, for any node $v\in[K]$, any $t,i\in\mathbb{N}_+$, and any $r\in\mathbb{R}$,

$$\mathbb{P}\left(Y_{v,k_{v,i}}=r\right)=\mathbb{P}\left(\mathcal{L}\left(\mathcal{P}_{t,v}\right)=r\right). \qquad (1)$$
Proof.

In a trajectory $\mathcal{P}_{t,J_t}=\left(X_{t,0},L_{t,1},X_{t,1},\dots,L_{t,H_{t,J_t}},X_{t,H_{t,J_t}}\right)$, conditioning on $X_{t,i}=j$ (with no future information revealed), the randomness generated by $L_{t,i+1},X_{t,i+1},L_{t,i+2},X_{t,i+2},\dots$ is identical to the randomness generated by $L_{t,1},X_{t,1},L_{t,2},X_{t,2},\dots$ conditioning on $X_{t,0}=j$. Note that even if a trajectory can visit a node multiple times, only one hitting time sample can be used per trajectory. This is because extracting multiple samples would break Markovianity, by revealing that the random walk will visit the same node again. ∎

For a node $v\in[K]$, we define $N_t(v):=1\vee\sum_{s<t}\mathbb{I}_{[J_s=v]}$ and $N_t^+(v):=1\vee\sum_{s<t}\mathbb{I}_{\left[v\in\mathcal{P}_{s,J_s}\right]}$, where $a\vee b=\max\{a,b\}$. In words, $N_t(v)$ is the number of times node $v$ has been played, and $N_t^+(v)$ is the number of times node $v$ has been covered by a trajectory.
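The extraction of hitting-time samples described above can be sketched as follows (our own illustration; the helper name is hypothetical). From one observed trajectory it returns, for every covered node $v$, the single sample $Y_v$ measured from the first occurrence of $v$, which can then be used to update the counters $N_t^+(\cdot)$.

```python
def extract_hitting_time_samples(path, step_lengths):
    """From one trajectory (list of nodes ending at the absorbing node) and the lengths of
    its edges, return {v: Y_v}: the sum of edge lengths from the FIRST occurrence of each
    covered node v to the absorbing node (one sample per node, as in Proposition 1)."""
    remaining = [0.0] * len(path)                  # remaining[i] = distance from position i to *
    for i in range(len(path) - 2, -1, -1):
        remaining[i] = remaining[i + 1] + step_lengths[i]
    samples = {}
    for i, v in enumerate(path[:-1]):              # skip the absorbing node itself
        if v not in samples:                       # keep only the first occurrence
            samples[v] = remaining[i]
    return samples

# Usage with the earlier simulation sketch: node 2 encodes the absorbing node here.
print(extract_hitting_time_samples([0, 0, 1, 2], [1.0, 1.0, 1.0]))   # {0: 3.0, 1: 1.0}
```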

By Proposition 1, information about the node rewards (hitting times to the absorbing node) accumulates faster than in standard MAB problems. Formally, it holds that $N_t^+(v)\geq N_t(v)$. Thus solving this problem is not hard: one can extract the hitting time estimates and apply a standard algorithm (e.g., UCB) based on these estimates. However, some information is lost when we extract hitting time samples, since trajectories also carry additional information (e.g., about transition probabilities and graph structure) beyond the hitting time samples we extract. Thus the intriguing question to ask is:

  • Do we give up too much information by only extracting hitting time samples from trajectories? (Q)

We show that, although the random walk trajectories reveal more information than reward samples alone, no algorithm can beat an $\Omega\left(\sqrt{T}\right)$ lower bound in a worst-case sense. This provides an answer to (Q).

Figure 1: Problem instances constructed for Theorem 1. The edge labels denote the edge transition probabilities in $\mathfrak{J}$ / $\mathfrak{J}'$. The top dark node denotes the absorbing node $*$. Note that node 1 is connected to node 2 with a constant probability ($\frac{1}{8}$).
Theorem 1.

For any given $T$ and any policy $\pi$, there exists a problem instance $\mathfrak{J}$ satisfying Assumption 1 such that: (1) the probability of visiting any node from any other node is larger than an absolute constant; (2) for any $\epsilon\in\left(0,\frac{1}{4}\right)$, the $T$-step regret of $\pi$ on instance $\mathfrak{J}$ is lower bounded by $\left(\frac{32}{15}\epsilon+O\left(\epsilon^{2}\right)\right)T\exp\left(-\frac{112}{9}T\left(\epsilon^{2}+O\left(\epsilon^{3}\right)\right)\right)$. In particular, setting $\epsilon=\frac{1}{4}T^{-1/2}$ shows that there exists a problem instance such that the regret of any algorithm on this instance is lower bounded by $\Omega\left(\sqrt{T}\right)$.

To prove Theorem 1, we construct problem instances such that the optimal nodes are almost indistinguishable. In the construction, we ensure that the nodes visit each other with a constant probability. If instead the nodes visited each other with an arbitrarily small probability, the nodes would be essentially disconnected and the problem would be too similar to a standard MAB problem. We use the instances illustrated in Fig. 1 for the proof. As shown in Fig. 1, node 1 and node 2 are connected with probability $\frac{1}{8}$. This construction ensures that the two nodes visit each other with a constant probability, and prevents our construction from collapsing to a standard MAB problem, in which the non-absorbing nodes are disconnected.

Proof of Theorem 1.

We construct two "symmetric" problem instances $\mathfrak{J}$ and $\mathfrak{J}'$, both on two transient nodes $\{1,2\}$ and one absorbing node $*$. All edges in both instances have length 1. We use $M=[m_{ij}]$ (resp. $M'=[m'_{ij}]$) to denote the transition probabilities among transient nodes in instance $\mathfrak{J}$ (resp. $\mathfrak{J}'$). We construct instances $\mathfrak{J}$ and $\mathfrak{J}'$ so that $M=\begin{bmatrix}\frac{1}{2}&\frac{1}{8}\\ \frac{1}{8}&\frac{1}{2}+\epsilon\end{bmatrix}$ and $M'=\begin{bmatrix}\frac{1}{2}+\epsilon&\frac{1}{8}\\ \frac{1}{8}&\frac{1}{2}\end{bmatrix}$, as shown in Figure 1. We use $Z_v$ (resp. $Z'_v$) to denote the random variable $\mathcal{L}\left(\mathcal{P}_{t,v}\right)$ in problem instance $\mathfrak{J}$ (resp. $\mathfrak{J}'$). Note that the (expected) hitting time of node 1 in $\mathfrak{J}$ can be expressed as $\mathbb{E}[Z_1]=m_{11}(\mathbb{E}[Z_1]+1)+m_{21}(\mathbb{E}[Z_2]+1)+(1-m_{11}-m_{21})=m_{11}\mathbb{E}[Z_1]+m_{21}\mathbb{E}[Z_2]+1$. Thus in matrix form we have

$$\left[\mathbb{E}\left[Z_1\right],\mathbb{E}\left[Z_2\right]\right] = \left[\mathbb{E}\left[Z_1\right],\mathbb{E}\left[Z_2\right]\right]M+\mathbf{1}^{\top}, \qquad \left[\mathbb{E}\left[Z'_1\right],\mathbb{E}\left[Z'_2\right]\right] = \left[\mathbb{E}\left[Z'_1\right],\mathbb{E}\left[Z'_2\right]\right]M'+\mathbf{1}^{\top},$$

where $\mathbf{1}$ is the all-one vector. Solving the above equations gives, for both instances $\mathfrak{J}$ and $\mathfrak{J}'$, that the optimality gap $\Delta$ is

$$\Delta:=\left|\mathbb{E}\left[Z_1\right]-\mathbb{E}\left[Z_2\right]\right|=\frac{64\epsilon}{15}+O\left(\epsilon^2\right). \qquad (2)$$

Let $\pi$ be any fixed algorithm and let $T$ be any fixed time horizon. We use $\mathbb{P}_{\mathfrak{J},\pi}$ (resp. $\mathbb{P}_{\mathfrak{J}',\pi}$) to denote the probability measure of running $\pi$ on instance $\mathfrak{J}$ (resp. $\mathfrak{J}'$) for $T$ epochs.

Since the event $\{J_t=1\}$ ($t\leq T$) is measurable under both $\mathbb{P}_{\mathfrak{J},\pi}$ and $\mathbb{P}_{\mathfrak{J}',\pi}$, we have

$$\mathbb{P}_{\mathfrak{J},\pi}(J_t\neq 1)+\mathbb{P}_{\mathfrak{J}',\pi}(J_t=1) = 1-\mathbb{P}_{\mathfrak{J},\pi}(J_t=1)+\mathbb{P}_{\mathfrak{J}',\pi}(J_t=1) \overset{(i)}{\geq} 1-\|\mathbb{P}_{\mathfrak{J},\pi}-\mathbb{P}_{\mathfrak{J}',\pi}\|_{TV} \overset{(ii)}{\geq} 1-\sqrt{1-\exp\left(-D_{kl}(\mathbb{P}_{\mathfrak{J},\pi}\|\mathbb{P}_{\mathfrak{J}',\pi})\right)} \overset{(iii)}{\geq} \frac{1}{2}\exp\left(-D_{kl}(\mathbb{P}_{\mathfrak{J},\pi}\|\mathbb{P}_{\mathfrak{J}',\pi})\right), \qquad (3)$$

where $(i)$ uses the definition of total variation, $(ii)$ uses the Bretagnolle-Huber inequality, and $(iii)$ uses that $1-\sqrt{1-x}\geq\frac{1}{2}x$ for all $x\in[0,1]$.

Let $\mathbb{Q}_i$ (resp. $\mathbb{Q}'_i$) be the probability measure generated by playing node $i$ in instance $\mathfrak{J}$ (resp. $\mathfrak{J}'$). We can then decompose $\mathbb{P}_{\mathfrak{J},\pi}$ by

$$\mathbb{P}_{\mathfrak{J},\pi} = \mathbb{Q}_{J_1}\mathbb{P}\left(J_1|\pi\right)\mathbb{Q}_{J_2}\mathbb{P}(J_2|\pi,J_1)\cdots\mathbb{Q}_{J_T}\mathbb{P}(J_T|\pi,J_1,J_2,\dots,J_{T-1}),$$
$$\mathbb{P}_{\mathfrak{J}',\pi} = \mathbb{Q}'_{J_1}\mathbb{P}\left(J_1|\pi\right)\mathbb{Q}'_{J_2}\mathbb{P}(J_2|\pi,J_1)\cdots\mathbb{Q}'_{J_T}\mathbb{P}(J_T|\pi,J_1,J_2,\dots,J_{T-1}).$$

By the chain rule for KL-divergence, we have

$$D_{kl}(\mathbb{P}_{\mathfrak{J},\pi}\|\mathbb{P}_{\mathfrak{J}',\pi}) = \sum_{J_1\in\{1,2\}}\mathbb{P}(J_1|\pi)D_{kl}(\mathbb{Q}_{J_1}\|\mathbb{Q}'_{J_1}) + \sum_{t=2}^{T}\sum_{J_t\in\{1,2\}}\mathbb{P}(J_t|\pi,J_1,\cdots,J_{t-1})D_{kl}(\mathbb{Q}_{J_t}\|\mathbb{Q}'_{J_t}). \qquad (4)$$

Since the policy must pick either node 1 or node 2, from (4) we have

$$D_{kl}(\mathbb{P}_{\mathfrak{J},\pi}\|\mathbb{P}_{\mathfrak{J}',\pi})\leq\sum_{t=1}^{T}\sum_{i=1}^{2}D_{kl}(\mathbb{Q}_i\|\mathbb{Q}'_i), \qquad (5)$$

which allows us to remove the dependence on the policy $\pi$.

Next we study the distributions $\mathbb{Q}_i$. With edge lengths fixed, the sample space of this distribution is $\cup_{h=1}^{\infty}\{1,2\}^h$, since the trajectory can be arbitrarily long and each node on the trajectory can be either element of $\{1,2\}$. To describe the distributions $\mathbb{Q}_i$ and $\mathbb{Q}'_i$, we use random variables $X_0,X_1,X_2,X_3,\dots\in\{1,2,*\}$ (with $X_0=i$), where $X_k$ is the $k$-th node in the trajectory generated by playing $i$.

By the Markov property we have, for $i,j\in\{1,2\}$,

$$\mathbb{Q}_i\left(X_{k+1},X_{k+2},\dots\,|\,X_k=j\right)=\mathbb{Q}_j \quad\text{and}\quad \mathbb{Q}'_i\left(X_{k+1},X_{k+2},\dots\,|\,X_k=j\right)=\mathbb{Q}'_j, \qquad \forall k\in\mathbb{N}_+. \qquad (6)$$

We can decompose $\mathbb{Q}_i$ as $\mathbb{Q}_i=\mathbb{Q}_i(X_1)\,\mathbb{Q}_i(X_2,X_3,\dots|X_1)$. Thus, by the chain rule of KL-divergence, we have

$$D_{kl}(\mathbb{Q}_1\|\mathbb{Q}'_1) = D_{kl}(\mathbb{Q}_1(X_1)\|\mathbb{Q}'_1(X_1)) + \mathbb{Q}_1(X_1=1)D_{kl}\big(\mathbb{Q}_1(X_2,\cdots|X_1=1)\,\big\|\,\mathbb{Q}'_1(X_2,\cdots|X_1=1)\big) + \mathbb{Q}_1(X_1=2)D_{kl}\big(\mathbb{Q}_1(X_2,\cdots|X_1=2)\,\big\|\,\mathbb{Q}'_1(X_2,\cdots|X_1=2)\big)$$
$$= D_{kl}(\mathbb{Q}_1(X_1)\|\mathbb{Q}'_1(X_1)) + \mathbb{Q}_1(X_1=1)D_{kl}(\mathbb{Q}_1\|\mathbb{Q}'_1) + \mathbb{Q}_1(X_1=2)D_{kl}(\mathbb{Q}_2\|\mathbb{Q}'_2), \qquad (7)$$

where the last line uses (6). A similar argument gives

$$D_{kl}(\mathbb{Q}_2\|\mathbb{Q}'_2) = D_{kl}(\mathbb{Q}_2(X_1)\|\mathbb{Q}'_2(X_1)) + \mathbb{Q}_2(X_1=1)D_{kl}(\mathbb{Q}_1\|\mathbb{Q}'_1) + \mathbb{Q}_2(X_1=2)D_{kl}(\mathbb{Q}_2\|\mathbb{Q}'_2). \qquad (8)$$

Since $D_{kl}(\mathbb{Q}_i(X_1)\|\mathbb{Q}'_i(X_1))=\frac{7}{3}\epsilon^2+O\left(\epsilon^3\right)$ for $i\in\{1,2\}$, the above (Eqs. 7 and 8) gives

$$D_{kl}(\mathbb{Q}_i\|\mathbb{Q}'_i) = \frac{56\epsilon^2}{9}+O\left(\epsilon^3\right). \qquad (9)$$

Combining (5) and (9) gives

$$D_{kl}\left(\mathbb{P}_{\mathfrak{J},\pi}\|\mathbb{P}_{\mathfrak{J}',\pi}\right)\leq\frac{112}{9}T\left(\epsilon^2+O\left(\epsilon^3\right)\right). \qquad (10)$$

Let $\mathrm{Reg}(T)$ (resp. $\mathrm{Reg}'(T)$) be the $T$-epoch regret in instance $\mathfrak{J}$ (resp. $\mathfrak{J}'$). Recall that, by our construction, node 1 is suboptimal in instance $\mathfrak{J}$ and optimal in instance $\mathfrak{J}'$. Since the optimality gaps in $\mathfrak{J}$ and $\mathfrak{J}'$ are the same (Eq. 2), we have

$$\mathrm{Reg}(T)+\mathrm{Reg}'(T) \geq \Delta\sum_{t=1}^{T}\Big(\mathbb{P}_{\mathfrak{J},\pi}\left(J_t\neq 1\right)+\mathbb{P}_{\mathfrak{J}',\pi}\left(J_t=1\right)\Big)$$
$$\geq \frac{1}{2}\Delta T\exp\left(-D_{kl}(\mathbb{P}_{\mathfrak{J},\pi}\|\mathbb{P}_{\mathfrak{J}',\pi})\right) \qquad \text{(by Eq. 3)}$$
$$\geq \frac{1}{2}\Delta T\exp\left(-\frac{112}{9}T\left(\epsilon^2+O\left(\epsilon^3\right)\right)\right) \qquad \text{(by Eq. 10)}$$
$$\geq \left(\frac{32}{15}\epsilon+O\left(\epsilon^2\right)\right)T\exp\left(-\frac{112}{9}T\left(\epsilon^2+O\left(\epsilon^3\right)\right)\right). \qquad \text{(by Eq. 2)} \quad ∎$$
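As a sanity check of the closed forms used above, the following sketch (our own numerical verification) solves the hitting-time equations and the linear KL recursion (7)-(8) for a small $\epsilon$ and compares the results with $\frac{64\epsilon}{15}$ and $\frac{56\epsilon^2}{9}$.

```python
import numpy as np

eps = 1e-3
M  = np.array([[0.5, 0.125], [0.125, 0.5 + eps]])   # instance J
Mp = np.array([[0.5 + eps, 0.125], [0.125, 0.5]])   # instance J'

# Hitting times solve z^T = z^T M + 1^T, i.e. z = (I - M^T)^{-1} 1.
z = np.linalg.solve(np.eye(2) - M.T, np.ones(2))
print(abs(z[0] - z[1]), 64 * eps / 15)              # optimality gap Delta vs. 64*eps/15

def one_step_kl(row, row_p):
    """KL divergence of the one-step transition distributions (absorption included)."""
    p = np.append(row, 1.0 - row.sum())
    q = np.append(row_p, 1.0 - row_p.sum())
    return float(np.sum(p * np.log(p / q)))

# The recursion (7)-(8) is linear: d = a + M d, so d = (I - M)^{-1} a.
a = np.array([one_step_kl(M[0], Mp[0]), one_step_kl(M[1], Mp[1])])
d = np.linalg.solve(np.eye(2) - M, a)
print(d, 56 * eps**2 / 9)                           # D_kl(Q_i || Q_i') vs. 56*eps^2/9
```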

Theorem 1 covers the two-node case. A similar result for the $K$-node case is given in Theorem 2.

Theorem 2.

Let Assumption 1 hold. Given any number of epochs $T$ and any policy $\pi$, there exist $K$ problem instances $\mathfrak{J}_1,\cdots,\mathfrak{J}_K$ such that (1) all problem instances have $K$ nodes, (2) all transient nodes are connected with the same probability and the probability of hitting the absorbing node from any transient node is a constant independent of $T$ and $K$, and (3) for any policy $\pi$,

$$\max_{k\in[K]}\mathbb{E}_{\mathfrak{J}_k,\pi}\left[\mathrm{Reg}(T)\right]\geq\frac{1}{8\sqrt{2}}\sqrt{KT},$$

where $\mathbb{E}_{\mathfrak{J}_k,\pi}$ is the expectation with respect to the distribution generated by instance $\mathfrak{J}_k$ and policy $\pi$.

The proof of Theorem 2 follows a recipe similar to that of Theorem 1, and the details are deferred to the Appendix.

4 Algorithms for Multi-Armed Bandits with Random Walk Feedback

Perhaps the two most well-known algorithms for bandit problems are the UCB algorithm and the EXP3 algorithm. In this section, we study the behavior of these two algorithms on bandit problems with random walk feedback.

4.1 UCB Algorithm for the Stochastic Setting

As discussed previously, the problem with random walk feedback can be reduced to a standard MAB problem, and the UCB algorithm solves it as one would expect. We present the UCB algorithm here and provide a regret analysis for it. Recall that the regret is defined as

$$\mathrm{Reg}(T)=\max_{i\in[K]}\sum_{t=1}^{T}\mu_i-\sum_{t=1}^{T}\mu_{J_t}, \qquad (11)$$

where $J_t$ is the node played by the algorithm at epoch $t$, and $\mu_j=\mathbb{E}\left[\mathcal{L}\left(\mathcal{P}_{t,j}\right)\right]$ is the expected hitting time of node $j$.

For a transient node $v\in[K]$ and $n$ trajectories (at epochs $k_{v,1},k_{v,2},\dots,k_{v,n}$) that cover node $v$, the hitting time estimator of $v$ is computed as $\widetilde{Z}_{v,n}:=\frac{1}{n}\sum_{i=1}^{n}Y_{v,k_{v,i}}$. Since $v\in\mathcal{P}_{t,v}$, $Y_{v,k_{v,i}}$ is an identical copy of the hitting time $\mathcal{L}\left(\mathcal{P}_{t,v}\right)$ (Proposition 1).

We also need confidence intervals for our estimators. Given $N_t^+(v)$ trajectories covering $v$, the confidence term (at epoch $t$) is $\widetilde{C}_{N_t^+(v),t}:=\sqrt{\frac{8\xi_t\log t}{N_t^+(v)}}$, where $\xi_t=\max\left\{1+\frac{\rho}{(1-\rho)^2},\frac{\log(1-\rho)}{\log\rho}+\frac{5\log t}{\log 1/\rho}\right\}$. Here $\xi_t$ serves as a truncation parameter, since the reward distribution is not sub-Gaussian. Alternatively, one can use robust estimators for bandits with heavy-tailed rewards for this task (Bubeck et al., 2013).

At each time $t$, we play a node $J_t$ such that

$$J_t\in\arg\max_{v}\left\{\widetilde{Z}_{v,N_t^+(v)}+\widetilde{C}_{N_t^+(v),t}\right\}.$$

This strategy is described in Algorithm 1.

Algorithm 1
1: Input: A set of nodes $[K]$. Parameters: a constant $\rho$ that bounds $\|M\|_{\infty}$ (Assumption 1).
2: Warm up: Play each node once to initialize. Observe the trajectories.
3: For any $v\in[K]$, define the decision index $I_{v,N_t^+(v),t}=\widetilde{Z}_{v,N_t^+(v)}+\widetilde{C}_{N_t^+(v),t}$.
4: for $t=1,2,3,\dots$ do
5:   Select $J_t$ such that $J_t\in\arg\max_{v\in V}I_{v,N_t^+(v),t}$, with ties broken arbitrarily.
6:   Observe the trajectory $\mathcal{P}_{t,J_t}:=\big(X_{t,0},L_{t,1},X_{t,1},L_{t,2},X_{t,2},\cdots,L_{t,H_{t,J_t}},X_{t,H_{t,J_t}}\big)$. Update $N_t^+(\cdot)$ and the UCB indices for all $v\in[K]$.
7: end for

Similar to the UCB algorithm for standard MAB problems, Algorithm 1 obtains $\widetilde{\mathcal{O}}(\sqrt{T})$ mini-max (gap-independent) regret and $\widetilde{\mathcal{O}}(1)$ asymptotic (gap-dependent) regret. Such results can be derived from the observations discussed in Section 3.1.
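For concreteness, a compact sketch of Algorithm 1 is given below (our own rendering; the `play` callback, which returns the hitting-time samples extracted from one observed trajectory as in the earlier extraction sketch, is a hypothetical interface standing in for Lines 5-6 of the pseudocode).

```python
import numpy as np

def ucb_random_walk(K, T, rho, play):
    """Sketch of Algorithm 1. `play(v)` plays node v, observes one trajectory, and
    returns {node: Y_node}: one hitting-time sample per node covered by the trajectory."""
    sums = np.zeros(K)      # running sums of hitting-time samples
    counts = np.zeros(K)    # N^+(v): number of trajectories covering v
    def record(samples):
        for node, y in samples.items():
            sums[node] += y
            counts[node] += 1
    for v in range(K):      # warm-up: play each node once
        record(play(v))
    for t in range(K + 1, T + 1):
        # truncation parameter xi_t and confidence term C-tilde from the text
        xi = max(1 + rho / (1 - rho) ** 2,
                 np.log(1 - rho) / np.log(rho) + 5 * np.log(t) / np.log(1 / rho))
        n = np.maximum(counts, 1)
        index = sums / n + np.sqrt(8 * xi * np.log(t) / n)   # decision index I_{v, N^+, t}
        record(play(int(np.argmax(index))))                  # play the maximizer, observe
    return sums / np.maximum(counts, 1)                      # final hitting-time estimates
```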

4.2 EXP3 Algorithm for Adversarially Chosen Edge Lengths

In this section, we consider the case in which the network structure $G_t$ changes over time, and study a version of this problem in which the adversary alters edge lengths across epochs: in each epoch, the adversary can arbitrarily pick edge lengths $l_{ij}^{(t)}$ from $[0,1]$. In this case, the performance is measured by the regret against playing any fixed node $i\in[K]$: $\mathrm{Reg}^{\mathrm{adv}}_i(T)=\sum_{t=1}^{T}l_{t,i}-\sum_{t=1}^{T}l_{t,J_t}$, where $J_t$ is the node played in epoch $t$ and $l_{t,j}=\mathbb{E}\left[\mathcal{L}\left(\mathcal{P}_{t,j}\right)\right]$. Since $\mathcal{L}\left(\mathcal{P}_{t,j}\right)$ concentrates around $l_{t,j}$, a high probability bound on $\mathrm{Reg}^{\mathrm{adv}}_i(T)$ naturally provides a high probability bound on $\sum_{t=1}^{T}\mathcal{L}\left(\mathcal{P}_{t,i}\right)-\sum_{t=1}^{T}\mathcal{L}\left(\mathcal{P}_{t,J_t}\right)$.

We define a notion of centrality that will aid our discussions.

Definition 1.

Let $X_0,X_1,X_2,\dots,X_\tau=*$ be the nodes on a random trajectory. Under Assumption 1, we define, for node $v\in[K]$, $\alpha_v:=\min_{u\in[K],\,u\neq v}\mathbb{P}\left(v\in\{X_1,X_2,\dots,X_\tau\}\,|\,X_0=u\right)$ to be the hitting centrality of node $v$. We also define $\alpha:=\min_v\alpha_v$.

For a node with positive hitting centrality, information about it is revealed even when it is not played. This quantity will show up in the regret bound in some cases, as we discuss later.
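Hitting centralities can be estimated by simulation; the sketch below (our own illustration, reusing the `sample_trajectory` helper from the sketch in Section 2) approximates $\alpha_v$ by Monte Carlo.

```python
import numpy as np

def hitting_centrality(v, M, lengths, n_samples=5000, rng=None):
    """Monte-Carlo estimate of alpha_v = min_{u != v} P(v is visited | X_0 = u)."""
    rng = rng if rng is not None else np.random.default_rng(0)
    K = M.shape[0]
    estimates = []
    for u in range(K):
        if u == v:
            continue
        hits = 0
        for _ in range(n_samples):
            path, _ = sample_trajectory(u, M, lengths, rng)   # helper from the Section 2 sketch
            hits += int(v in path[1:])                        # v visited at some step 1, ..., tau
        estimates.append(hits / n_samples)
    return min(estimates)
```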

We will use a version of the exponential weights algorithm to solve this adversarial problem, and a high probability guarantee is provided using a new concentration lemma (Lemma 1). As background, exponential weights algorithms maintain a probability distribution over the choices, which gives higher weights to historically more rewarding nodes. In each epoch, a node is sampled from this distribution, and the observed information is recorded. To formally describe the strategy, we now define some notation. We first extract a sample of $\mathcal{L}\left(\mathcal{P}_{t,j}\right)$ from the trajectory $\mathcal{P}_{t,J_t}$, where $J_t$ is the node played in epoch $t$.

Given $\mathcal{P}_{t,J_t}=\left(X_{t,0},L_{t,1},X_{t,1},\dots,L_{t,H_{t,J_t}},X_{t,H_{t,J_t}}\right)$, we define, for $v\in[K]$,

$$Y_v(\mathcal{P}_{t,J_t})=\max_{i:\,0\leq i<H_{t,J_t}}\;\mathbb{I}_{\left[X_{t,i}=v\right]}\cdot\mathcal{L}_i(\mathcal{P}_{t,J_t}), \qquad (12)$$

where $\mathcal{L}_i(\mathcal{P}_{t,J_t}):=\sum_{k=i+1}^{H_{t,J_t}}L_{t,k}$ is the remaining distance after the $i$-th node of the trajectory. In words, $Y_v(\mathcal{P}_{t,J_t})$ is the distance (sum of edge lengths) from the first occurrence of $v$ to the absorbing node.

By the principle of Proposition 1, if node $v$ is covered by trajectory $\mathcal{P}_{t,J_t}$, then $Y_v(\mathcal{P}_{t,J_t})$ is a sample of $\mathcal{L}\left(\mathcal{P}_{t,v}\right)$. We define, for the trajectory $\mathcal{P}_{t,J_t}=\left(X_{t,0},L_{t,1},X_{t,1},L_{t,2},\dots,L_{t,H_{t,J_t}},X_{t,H_{t,J_t}}\right)$, $Z_{t,v}:=Y_v(\mathcal{P}_{t,J_t})$ for all $v\in[K]$, where $Y_v(\mathcal{P}_{t,J_t})$ is defined above.

Define $\mathbb{I}_{t,ij}:=\mathbb{I}_{\left[i\in\mathcal{P}_{t,J_t}\text{ and }j\in\mathcal{P}_{t,J_t},\,Y_i(\mathcal{P}_{t,J_t})>Y_j(\mathcal{P}_{t,J_t})\right]}$. This indicator random variable is $1$ if and only if $i$ and $j$ both show up in $\mathcal{P}_{t,J_t}$ and the first occurrence of $j$ is after the first occurrence of $i$. We then define $\widehat{q}_{t,ij}:=\frac{\sum_{s=1}^{t-1}\mathbb{I}_{s,ij}}{N_t^+(i)}$, which estimates how likely $j$ is to be visited by a trajectory that has reached $i$. In other words, $\widehat{q}_{t,ij}$ is an estimator of $q_{ij}:=\mathbb{P}\left(j\in\mathcal{P}_{t,i}\right)$, the probability that $j$ is visited by a trajectory starting from $i$. Using $\widehat{q}_{t,ij}$ and the sample $Z_{t,i}$, we define an estimator for $\frac{l_{t,i}-B}{B}$ as

$$\widehat{Z}_{t,i} := \frac{\frac{Z_{t,i}-B}{B}\mathbb{I}_{\left[i\in\mathcal{P}_{t,J_t}\right]}+\beta}{p_{ti}+\sum_{j\neq i}\widehat{q}_{t,ji}p_{tj}}, \qquad \forall i\in[K], \qquad (13)$$

where $B,\beta$ are algorithm parameters ($\beta\leq\alpha$, and $B$ is specified later). Here $\beta$ serves as a parameter for implicit exploration (Neu, 2015). We estimate $\frac{l_{t,i}-B}{B}$ instead of $l_{t,i}$; this shift guarantees that the estimator $\widehat{Z}_{t,i}$ lies in $[-1,0]$ with high probability, and such shifting is a common trick for variance reduction in EXP3-type algorithms (e.g., Lattimore and Szepesvári, 2020). In addition, a small bias is introduced via $\beta$. With the estimators $\widehat{Z}_{t,i}$, we define $\widehat{S}_{t,i}=\sum_{s=1}^{t-1}\widehat{Z}_{s,i}$. By convention, we set $\widehat{S}_{0,i}=0$ for all $i\in[K]$. The probability of playing $i$ in epoch $t$ is defined as

$$p_{ti}:=\begin{cases}\frac{1}{K}, & \text{if } t=1,\\[4pt] \frac{\exp\left(\eta\widehat{S}_{t-1,i}\right)}{\sum_{j=1}^{K}\exp\left(\eta\widehat{S}_{t-1,j}\right)}, & \text{if } t\geq 2,\end{cases} \qquad (14)$$

where $\eta$ is the learning rate.

Against any fixed arm $j\in[K]$, following the sampling rule (14) guarantees an $\widetilde{\mathcal{O}}\left(\sqrt{T}\right)$ regret bound. We summarize our strategy in Algorithm 2, and state its performance guarantee in Theorem 3.

Algorithm 2
1: Input: A set of nodes $[K]$, transition matrix $M$, total number of epochs $T$, probability parameter $\epsilon\in\left(0,\frac{1}{(1-\rho)KT}\right)$. Algorithm parameters: $B=\frac{\log\frac{(1-\rho)\epsilon}{KT}}{\log\rho}$, $\eta=\frac{1}{\sqrt{\kappa T}}$, $\beta=\frac{1}{\sqrt{\kappa T}}$.
2: for $t=1,2,3,\dots,T$ do
3:   Randomly play $J_t\in[K]$ such that $\mathbb{P}\left(J_t=i\right)=p_{ti}$ for all $i\in[K]$, where $p_{ti}$ is defined in (14).
4:   Observe the trajectory $\mathcal{P}_{t,J_t}$. Update the estimates $\widehat{Z}_{t,j}$ according to (13).
5: end for
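A minimal sketch of Algorithm 2 in the notation of (12)-(14) follows (our own rendering; the `play` callback, which returns the per-node samples $Y_v$ of Eq. 12 for one observed trajectory, is a hypothetical interface).

```python
import numpy as np

def exp3_random_walk(K, T, B, eta, beta, play, rng):
    """Sketch of Algorithm 2. `play(J)` plays node J, observes one trajectory, and
    returns {node: Y_node} (Eq. 12) for every node covered by the trajectory."""
    S_hat = np.zeros(K)             # cumulative estimates \hat S_{t,i}
    cover_counts = np.ones(K)       # N^+(i), initialized at 1 (the `1 v ...` convention)
    pair_counts = np.zeros((K, K))  # counts of the indicators I_{t,ij}
    for t in range(1, T + 1):
        w = np.exp(eta * (S_hat - S_hat.max()))        # exponential weights, stabilized
        p = w / w.sum()                                # sampling distribution (14)
        J = int(rng.choice(K, p=p))
        samples = play(J)
        q_hat = pair_counts / cover_counts[:, None]    # \hat q_{t,ij}
        for i in range(K):
            z = (samples[i] - B) / B if i in samples else 0.0
            denom = p[i] + sum(q_hat[j, i] * p[j] for j in range(K) if j != i)
            S_hat[i] += (z + beta) / denom             # estimator (13)
        for i in samples:                              # update statistics used by \hat q
            cover_counts[i] += 1
            for j in samples:
                if j != i and samples[i] > samples[j]: # first occurrence of j comes after i
                    pair_counts[i, j] += 1
    return S_hat
```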

A high probability performance guarantee of Algorithm 2 is given in Theorem 3 below.

Theorem 3.

Let $\kappa:=1+\sum_j\frac{1-\sqrt{\alpha_j}}{1+\sqrt{\alpha_j}}$, where $\alpha_j$ is the hitting centrality of node $j$. Fix any $i\in[K]$. If the time horizon $T$ and the algorithm parameters satisfy $\epsilon\leq\frac{1}{T}$ and $B=\frac{\log\frac{(1-\rho)\epsilon}{KT}}{\log\rho}$, then with probability exceeding $1-\widetilde{\mathcal{O}}(\epsilon)$,

$$\mathrm{Reg}^{\mathrm{adv}}_i(T)\leq B\frac{\log(T/\epsilon^2)}{\beta}+B\kappa\beta T+\widetilde{\mathcal{O}}\left(B\sqrt{\beta\kappa T}\right)+\frac{B\log K}{\eta}+B\eta\kappa T+\widetilde{\mathcal{O}}\left(B\eta\sqrt{(1+\beta\kappa)T}\right).$$

If we set $\eta=\beta=\Theta\left(\frac{1}{\sqrt{\kappa T}}\right)$, Algorithm 2 achieves $\mathrm{Reg}^{\mathrm{adv}}_i(T)\leq\widetilde{\mathcal{O}}\left(\sqrt{\kappa T}\right)$. In Figure 2, we provide a plot of $f(x)=\frac{1-\sqrt{x}}{1+\sqrt{x}}$ for $x\in[0,1]$. This figure illustrates how the regret scales with $\kappa$, and shows that a multiplicative factor is saved. An empirical comparison between Algorithm 2 and the standard EXP3 algorithm (which does not use the estimator in Eq. 12) is shown in Section 5.

Figure 2: A plot of the function $f(x)=\frac{1-\sqrt{x}}{1+\sqrt{x}}$, $x\in[0,1]$.

4.2.1 Analysis of Algorithm 2

In this section we present a proof of Theorem 3. The proof follows the general recipe for the analysis of EXP3 algorithms, which is in turn a special case of the Follow-The-Regularized-Leader (FTRL) / Mirror Descent framework. Below we include some non-standard intermediate steps due to the special feedback structure of the problem; more details are deferred to the Appendix.

By the exponential weights argument (Littlestone and Warmuth, 1994; Auer et al., 2002), it holds that, under the event $\mathcal{E}_T(B):=\{Z_{t,j}\leq B\text{ for all }t=1,2,\cdots,T\text{ and }j\in[K]\}$,

$$\sum_{t=1}^{T}\widehat{Z}_{t,i}-\sum_{t=1}^{T}\sum_{j}p_{tj}\widehat{Z}_{t,j} \leq \frac{\log K}{\eta}+\eta\sum_{t=1}^{T}\sum_{j}\frac{p_{tj}}{\widetilde{p}_{tj}^{2}}\mathbb{I}_{\left[j\in\mathcal{P}_{t,J_t}\right]}. \qquad (15)$$

To link the regret $\sum_{t=1}^{T}l_{t,i}-\sum_{t=1}^{T}l_{t,J_t}$ to $\sum_{t=1}^{T}\widehat{Z}_{t,i}-\sum_{t=1}^{T}\sum_{j}p_{tj}\widehat{Z}_{t,j}$, and to bound $\sum_{t=1}^{T}\sum_{j}\frac{p_{tj}}{\widetilde{p}_{tj}^{2}}\mathbb{I}_{\left[j\in\mathcal{P}_{t,J_t}\right]}$, we use the results in Lemmas 1 and 2.

Lemma 1.

For any $\epsilon\in(0,1)$ and $T\in\mathbb{N}$ such that $\epsilon\leq\frac{1}{T}$ and $T\geq 10$, it holds that

$$\mathbb{P}\left(\sum_{t}\frac{l_{t,i}-B}{B}-\sum_{t}\widehat{Z}_{t,i}\leq\frac{\log\left(T/\epsilon^2\right)}{\beta}\right)\geq 1-\widetilde{\mathcal{O}}\left(\epsilon\right).$$

Lemma 1 can be viewed as a one-sided, regularized version of Hoeffding's inequality, with regularization parameter $\beta$. Its proof uses Markov's inequality and can be found in the Appendix.

Lemma 2.

With probability exceeding $1-\mathcal{O}(\epsilon)$, we have

$$(i)\quad \sum_{t}\sum_{j}\frac{p_{tj}}{\widehat{p}_{tj}^{2}}\mathbb{I}_{\left[j\in\mathcal{P}_{t,J_t}\right]}\leq\kappa T+\widetilde{\mathcal{O}}\left(\sqrt{T\log(1/\epsilon)}\right),$$
$$(ii)\quad \sum_{t}\sum_{j}p_{tj}\widehat{Z}_{t,j}-\sum_{t}\frac{l_{t,J_t}-B}{B}\leq\kappa\beta T+\mathcal{O}\left(\sqrt{\left(1+\beta\kappa\right)T\log(1/\epsilon)}\right).$$
Proof of Lemma 2.

Firstly, it holds with high probability that $\widehat{p}_{tj}=\widetilde{p}_{tj}+\mathcal{O}\left(\frac{1}{t}\right)$ (Lemma 5 in the Appendix). Therefore $\frac{p_{tj}}{\widehat{p}_{tj}}=\frac{p_{tj}}{\widetilde{p}_{tj}+\mathcal{O}\left(\frac{1}{t}\right)}=\frac{p_{tj}}{\widetilde{p}_{tj}}+\mathcal{O}\left(\frac{1}{t}\right)$.

Let $\mathbb{E}_t$ denote the expectation conditioned on all randomness right before the $t$-th epoch. Since $p_{tj}$ and $\widetilde{p}_{tj}$ are known before the $t$-th epoch, we have $\mathbb{E}_t\left[\frac{p_{tj}}{\widehat{p}_{tj}^{2}}\mathbb{I}_{\left[j\in\mathcal{P}_{t,J_t}\right]}\right]=\frac{p_{tj}}{\widehat{p}_{tj}^{2}}\mathbb{E}_t\left[\mathbb{I}_{\left[j\in\mathcal{P}_{t,J_t}\right]}\right]=\frac{p_{tj}}{\left(\widetilde{p}_{tj}+\mathcal{O}\left(\frac{1}{t}\right)\right)^{2}}\widetilde{p}_{tj}=\frac{p_{tj}}{\widetilde{p}_{tj}}+\mathcal{O}\left(\frac{1}{t}\right)$. Thus the random variables

$$\left\{\frac{p_{tj}}{\widehat{p}_{tj}^{2}}\mathbb{I}_{\left[j\in\mathcal{P}_{t,J_t}\right]}-\frac{p_{tj}}{\widetilde{p}_{tj}}+\mathcal{O}\left(\frac{1}{t}\right)\right\}_{t}$$

form a martingale difference sequence for any $j$.

Applying Azuma's inequality to this martingale difference sequence gives, with probability exceeding $1-\mathcal{O}(\epsilon)$,

$$\sum_{t}\sum_{j}\frac{p_{tj}}{\widehat{p}_{tj}^{2}}\mathbb{I}_{\left[j\in\mathcal{P}_{t,J_t}\right]} \leq \sum_{t}\sum_{j}\frac{p_{tj}}{\widetilde{p}_{tj}}+\widetilde{\mathcal{O}}\left(\sqrt{T\log(1/\epsilon)}\right)+\sum_{t}\sum_{j}\mathcal{O}\left(\frac{1}{t}\right) \leq \sum_{t}\sum_{j}\frac{p_{tj}}{\widetilde{p}_{tj}}+\widetilde{\mathcal{O}}\left(\sqrt{T\log(1/\epsilon)}\right). \qquad (16)$$

Since $\frac{p_{tj}}{\widetilde{p}_{tj}}\leq p_{tj}+\frac{1-\sqrt{\alpha_j}}{1+\sqrt{\alpha_j}}$ (Proposition 2 in the Appendix), we have

$$\sum_{t}\sum_{j}\frac{p_{tj}}{\widetilde{p}_{tj}}\leq\sum_{t}\sum_{j}\left(p_{tj}+\frac{1-\sqrt{\alpha_j}}{1+\sqrt{\alpha_j}}\right)\leq\kappa T. \qquad (17)$$

Combining (16) and (17) finishes the proof of item $(i)$.

For the second inequality in the lemma statement, we first note that $Z_{t,j}\leq B$ with high probability for all $t$ and $j$. Thus we have

$$\sum_{j}p_{tj}\widehat{Z}_{t,j} = \sum_{j}p_{tj}\frac{\frac{Z_{t,j}-B}{B}\mathbb{I}_{\left[j\in\mathcal{P}_{t,J_t}\right]}+\beta}{\widehat{p}_{tj}} \overset{①}{=} \sum_{j}p_{tj}\frac{\frac{Z_{t,j}-B}{B}\mathbb{I}_{\left[j\in\mathcal{P}_{t,J_t}\right]}}{\widetilde{p}_{tj}}+\sum_{j}\frac{\beta p_{tj}}{\widetilde{p}_{tj}}+\widetilde{\mathcal{O}}\left(\frac{1}{t}\right) \leq \sum_{j}p_{tj}\frac{\frac{Z_{t,j}-B}{B}\mathbb{I}_{\left[j\in\mathcal{P}_{t,J_t}\right]}}{\widetilde{p}_{tj}}+\beta\kappa+\widetilde{\mathcal{O}}\left(\frac{1}{t}\right),$$

where ① uses a Taylor expansion (to replace $\widehat{p}_{tj}$ with $\widetilde{p}_{tj}$), and the last step uses $\frac{p_{tj}}{\widetilde{p}_{tj}}\leq p_{tj}+\frac{1-\sqrt{\alpha_j}}{1+\sqrt{\alpha_j}}$ (Proposition 2 in the Appendix) and that $Z_{t,j}$ is smaller than $B$ with high probability. Also, $l_{t,j}$ is smaller than $B$ for all $t,j$. Since $\mathbb{E}_t\left[\sum_j p_{tj}\frac{\frac{Z_{t,j}-B}{B}\mathbb{I}_{\left[j\in\mathcal{P}_{t,J_t}\right]}}{\widetilde{p}_{tj}}\right]=\mathbb{E}_t\left[\frac{l_{t,J_t}-B}{B}\right]$, we know that $\left\{\sum_j p_{tj}\widehat{Z}_{t,j}-\frac{l_{t,J_t}-B}{B}-\beta\kappa-\widetilde{\mathcal{O}}\left(\frac{1}{t}\right)\right\}_t$ is a super-martingale difference sequence. We can now apply Azuma's inequality to this super-martingale difference sequence and get

$$\sum_{t}\sum_{j}p_{tj}\widehat{Z}_{t,j}-\sum_{t}\frac{l_{t,J_t}-B}{B}\leq\kappa\beta T+\widetilde{\mathcal{O}}\left(\sqrt{\beta\kappa T}\right), \qquad (18)$$

with high probability. ∎

Proof of Theorem 3.

By the above results, we have

$$\sum_{t=1}^{T}l_{t,i}-\sum_{t=1}^{T}l_{t,J_t} = B\left(\sum_{t}\frac{l_{t,i}-B}{B}-\sum_{t}\frac{l_{t,J_t}-B}{B}\right)$$
$$\leq B\left(\sum_{t}\frac{l_{t,i}-B}{B}-\sum_{t}\widehat{Z}_{t,i}\right) + B\left(\sum_{t}\sum_{j}p_{tj}\widehat{Z}_{t,j}-\sum_{t}\frac{l_{t,J_t}-B}{B}\right) + B\left(\sum_{t}\widehat{Z}_{t,i}-\sum_{t}\sum_{j}p_{tj}\widehat{Z}_{t,j}\right)$$
$$\overset{(i)}{\leq} B\frac{\log(T/\epsilon^2)}{\beta}+B\kappa\beta T+\widetilde{\mathcal{O}}\left(B\sqrt{\beta\kappa T}\right)+\frac{B\log K}{\eta}+B\eta\sum_{t}\sum_{j}\frac{p_{tj}}{\widetilde{p}_{tj}^{2}}\mathbb{I}_{\left[j\in\mathcal{P}_{t,J_t}\right]}$$
$$\overset{(ii)}{\leq} B\frac{\log(T/\epsilon^2)}{\beta}+B\kappa\beta T+\widetilde{\mathcal{O}}\left(B\sqrt{\beta\kappa T}\right)+\frac{B\log K}{\eta}+B\eta\kappa T+\widetilde{\mathcal{O}}\left(B\eta\sqrt{(1+\beta\kappa)T}\right),$$

where (i)(i) uses Lemma 1, Lemma 2, and (15), and (ii)(ii) uses Lemma 2. ∎

Figure 3: The network structure for the experiments. The dark node at the center is the absorbing node $*$, and the nodes labelled 0-8 are transient nodes. Nodes without edges connecting them visit each other with zero probability.

5 Experiments

We deploy our algorithms on a problem with 9 transient nodes. The results for Algorithm 2 are in Figure 4, and the results for Algorithm 1 are deferred to the Appendix. The evaluation of Algorithm 2 is performed on a problem instance with the transition matrix specified in (19):

$$m_{ij}=\begin{cases}0.3, & \text{if } i=j,\\ 0.1, & \text{if } i=j\pm 1 \pmod 9,\\ 0, & \text{otherwise.}\end{cases} \qquad (19)$$

The edge lengths are sampled from Gaussian distributions and truncated to $[0,1]$. Specifically, for all $t=1,2,\cdots,T$,
$$l_{ij}^{(t)}=\begin{cases}\mathrm{clip}_{[0,1]}\left(W_t+0.5\right), & \text{if } i=0 \text{ and } j=*,\\ \mathrm{clip}_{[0,1]}\left(W_t\right), & \text{if } i\neq 0 \text{ and } j=*,\\ 1, & \text{otherwise,}\end{cases}$$
where $\mathrm{clip}_{[0,1]}(z)$ clips a number $z$ to $[0,1]$, $W_t\overset{\text{i.i.d.}}{\sim}\mathcal{N}\left(0.5,0.1\right)$, and $*$ denotes the absorbing node.
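For reference, this environment can be generated as follows (our own sketch; node indices 0-8 are transient, column index 9 stands for the absorbing node, and we read the second parameter of $\mathcal{N}(0.5,0.1)$ as the standard deviation).

```python
import numpy as np

K = 9
rng = np.random.default_rng(0)

# Transition matrix (19): self-loop 0.3, ring neighbours 0.1; the remaining 0.5 goes to *.
M = np.zeros((K, K))
for i in range(K):
    M[i, i] = 0.3
    M[i, (i + 1) % K] = 0.1
    M[i, (i - 1) % K] = 0.1

def edge_lengths(t):
    """Edge lengths for epoch t; only edges into the absorbing node (column K) are random."""
    W = rng.normal(0.5, 0.1)               # W_t; we read 0.1 as the standard deviation
    L = np.ones((K, K + 1))                # L[i, j]: length of edge i -> j; column K is node *
    L[0, K] = np.clip(W + 0.5, 0.0, 1.0)   # node 0 -> *
    L[1:, K] = np.clip(W, 0.0, 1.0)        # nodes 1..8 -> *
    return L
```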

Figure 4: Experimental results for Algorithm 2, compared with the standard EXP3 algorithm. Each solid line is an average of 10 runs. The shaded areas indicate one standard deviation below and above the average.

As shown in Figure 4, the EXP3-type algorithm performs better when it uses the trajectory information (i.e., Algorithm 2), which is consistent with the guarantee provided in Theorem 3. For the implementation and a fair comparison, the estimator for Algorithm 2 is $\widehat{Z}_{t,i}=\frac{Z_{t,i}\mathbb{I}_{\left[i\in\mathcal{P}_{t,J_t}\right]}}{p_{ti}+\sum_{j\neq i}\widehat{q}_{t,ji}p_{tj}}$, and the estimator for EXP3 is $\widehat{Z}_{t,i}=\frac{Z_{t,i}\mathbb{I}_{\left[i=J_t\right]}}{p_{ti}}$. The learning rate for both algorithms is set to $0.001$.

6 Conclusion

In this paper, we study the bandit problem where the feedback consists of random walk trajectories. This problem is motivated by influence maximization in computational social science. We show that, although substantially more information can be extracted from the random walk trajectories, this problem is not significantly easier than its standard MAB counterpart in a mini-max sense. The behaviors of UCB-type and EXP3-type algorithms are also studied.

References

  • Abbasi-Yadkori et al., (2011) Abbasi-Yadkori, Y., Pál, D., and Szepesvári, C. (2011). Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pages 2312–2320.
  • Agrawal and Goyal, (2012) Agrawal, S. and Goyal, N. (2012). Analysis of Thompson sampling for the multi-armed bandit problem. In Conference on Learning Theory, pages 39–1.
  • Alon et al., (2013) Alon, N., Cesa-Bianchi, N., Gentile, C., and Mansour, Y. (2013). From bandits to experts: A tale of domination and independence. In Advances in Neural Information Processing Systems, pages 1610–1618.
  • Arora et al., (2017) Arora, A., Galhotra, S., and Ranu, S. (2017). Debunking the myths of influence maximization: An in-depth benchmarking study. In Proceedings of the 2017 ACM international conference on management of data, pages 651–666.
  • Audibert et al., (2009) Audibert, J.-Y., Munos, R., and Szepesvári, C. (2009). Exploration–exploitation tradeoff using variance estimates in multi-armed bandits. Theoretical Computer Science, 410(19):1876–1902.
  • Auer, (2002) Auer, P. (2002). Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3(Nov):397–422.
  • Auer et al., (1995) Auer, P., Cesa-Bianchi, N., Freund, Y., and Schapire, R. E. (1995). Gambling in a rigged casino: The adversarial multi-armed bandit problem. In Proceedings of IEEE 36th Annual Foundations of Computer Science, pages 322–331. IEEE.
  • Auer et al., (2002) Auer, P., Cesa-Bianchi, N., Freund, Y., and Schapire, R. E. (2002). The nonstochastic multiarmed bandit problem. SIAM journal on computing, 32(1):48–77.
  • Auer and Ortner, (2010) Auer, P. and Ortner, R. (2010). UCB revisited: Improved regret bounds for the stochastic multi-armed bandit problem. Periodica Mathematica Hungarica, 61(1-2):55–65.
  • Bertsekas and Tsitsiklis, (1991) Bertsekas, D. P. and Tsitsiklis, J. N. (1991). An analysis of stochastic shortest path problems. Mathematics of Operations Research, 16(3):580–595.
  • Bubeck et al., (2013) Bubeck, S., Cesa-Bianchi, N., and Lugosi, G. (2013). Bandits with heavy tail. IEEE Transactions on Information Theory, 59(11):7711–7717.
  • Bubeck and Slivkins, (2012) Bubeck, S. and Slivkins, A. (2012). The best of both worlds: stochastic and adversarial bandits. In Conference on Learning Theory, pages 42–1.
  • Cesa-Bianchi et al., (1997) Cesa-Bianchi, N., Freund, Y., Haussler, D., Helmbold, D. P., Schapire, R. E., and Warmuth, M. K. (1997). How to use expert advice. Journal of the ACM (JACM), 44(3):427–485.
  • Chen et al., (2017) Chen, L., Li, J., and Qiao, M. (2017). Nearly instance optimal sample complexity bounds for top-k arm selection. In Artificial Intelligence and Statistics, pages 101–110.
  • Chen et al., (2016) Chen, W., Hu, W., Li, F., Li, J., Liu, Y., and Lu, P. (2016). Combinatorial multi-armed bandit with general reward functions. In Advances in Neural Information Processing Systems, pages 1651–1659.
  • Chen et al., (2013) Chen, W., Wang, Y., and Yuan, Y. (2013). Combinatorial multi-armed bandit: General framework and applications. In International Conference on Machine Learning, pages 151–159. PMLR.
  • Dick et al., (2014) Dick, T., Gyorgy, A., and Szepesvari, C. (2014). Online learning in markov decision processes with changing cost sequences. In International Conference on Machine Learning, pages 512–520. PMLR.
  • Even-Dar et al., (2009) Even-Dar, E., Kakade, S. M., and Mansour, Y. (2009). Online markov decision processes. Mathematics of Operations Research, 34(3):726–736.
  • Garivier and Cappé, (2011) Garivier, A. and Cappé, O. (2011). The KL–UCB algorithm for bounded stochastic bandits and beyond. In Conference on Learning Theory, pages 359–376.
  • Gergely Neu et al., (2010) Gergely Neu, A. G., Szepesvári, C., and Antos, A. (2010). Online markov decision processes under bandit feedback. In Proceedings of the Twenty-Fourth Annual Conference on Neural Information Processing Systems.
  • Jin et al., (2019) Jin, C., Jin, T., Luo, H., Sra, S., and Yu, T. (2019). Learning adversarial mdps with bandit feedback and unknown transition. arXiv preprint arXiv:1912.01192.
  • Kaufmann et al., (2016) Kaufmann, E., Cappé, O., and Garivier, A. (2016). On the complexity of best-arm identification in multi-armed bandit models. The Journal of Machine Learning Research, 17(1):1–42.
  • Kempe et al., (2003) Kempe, D., Kleinberg, J., and Tardos, É. (2003). Maximizing the spread of influence through a social network. In Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 137–146.
  • Kleinberg et al., (2008) Kleinberg, R., Slivkins, A., and Upfal, E. (2008). Multi-armed bandits in metric spaces. In ACM Symposium on Theory of Computing, pages 681–690. ACM.
  • Kocák et al., (2014) Kocák, T., Neu, G., Valko, M., and Munos, R. (2014). Efficient learning by implicit exploration in bandit problems with side observations. In Advances in Neural Information Processing Systems, pages 613–621.
  • Krause and Ong, (2011) Krause, A. and Ong, C. S. (2011). Contextual Gaussian process bandit optimization. In Advances in Neural Information Processing Systems, pages 2447–2455.
  • Lai and Robbins, (1985) Lai, T. L. and Robbins, H. (1985). Asymptotically efficient adaptive allocation rules. Advances in applied mathematics, 6(1):4–22.
  • Lattimore and Szepesvári, (2020) Lattimore, T. and Szepesvári, C. (2020). Bandit Algorithms. Cambridge University Press.
  • Li et al., (2010) Li, L., Chu, W., Langford, J., and Schapire, R. E. (2010). A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th international conference on World wide web, pages 661–670.
  • Littlestone and Warmuth, (1994) Littlestone, N. and Warmuth, M. K. (1994). The weighted majority algorithm. Information and computation, 108(2):212–261.
  • Maillard et al., (2011) Maillard, O.-A., Munos, R., and Stoltz, G. (2011). A finite-time analysis of multi-armed bandits problems with Kullback-Leibler divergences. In Conference On Learning Theory, pages 497–514.
  • Mannor and Shamir, (2011) Mannor, S. and Shamir, O. (2011). From bandits to experts: On the value of side-observations. In Advances in Neural Information Processing Systems, pages 684–692.
  • Mannor and Tsitsiklis, (2004) Mannor, S. and Tsitsiklis, J. N. (2004). The sample complexity of exploration in the multi-armed bandit problem. Journal of Machine Learning Research, 5(Jun):623–648.
  • Neu, (2015) Neu, G. (2015). Explore no more: Improved high-probability regret bounds for non-stochastic bandits. In Advances in Neural Information Processing Systems, pages 3168–3176.
  • Neu et al., (2012) Neu, G., Gyorgy, A., and Szepesvári, C. (2012). The adversarial stochastic shortest path problem with unknown transition probabilities. In Artificial Intelligence and Statistics, pages 805–813. PMLR.
  • Robbins, (1952) Robbins, H. (1952). Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 58(5):527–535.
  • Rosenberg and Mansour, (2020) Rosenberg, A. and Mansour, Y. (2020). Adversarial stochastic shortest path. arXiv preprint arXiv:2006.11561.
  • Seldin and Slivkins, (2014) Seldin, Y. and Slivkins, A. (2014). One practical algorithm for both stochastic and adversarial bandits. In International Conference on Machine Learning, pages 1287–1295.
  • Slivkins, (2014) Slivkins, A. (2014). Contextual bandits with similarity information. The Journal of Machine Learning Research, 15(1):2533–2568.
  • Srinivas et al., (2010) Srinivas, N., Krause, A., Kakade, S., and Seeger, M. (2010). Gaussian process optimization in the bandit setting: No regret and experimental design. In International Conference on Machine Learning.
  • Thompson, (1933) Thompson, W. R. (1933). On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294.

Appendix A Proof Details of Theorem 2

Construct K+1K+1 instances specified by graphs: 𝔍0=(Gt0)t=1T,𝔍1=(Gt1)t=1T,𝔍2=(Gt2)t=1T,,𝔍K=(GtK)t=1T\mathfrak{J}_{0}=\left(G_{t}^{0}\right)_{t=1}^{T},\mathfrak{J}_{1}=\left(G_{t}^{1}\right)_{t=1}^{T},\mathfrak{J}_{2}=\left(G_{t}^{2}\right)_{t=1}^{T},\cdots,\mathfrak{J}_{K}=\left(G_{t}^{K}\right)_{t=1}^{T}. For any t,kt,k, let all transition probabilities equal pp. For any t,kt,k, let the graph GtkG_{t}^{k} have edge lengths ({li(t)}i[K],{lij(t)}i,j[K])\left(\{l_{i*}^{(t)}\}_{i\in[K]},\{l_{ij}^{(t)}\}_{i,j\in[K]}\right) chosen randomly. In 𝔍0\mathfrak{J}_{0}, all edge lengths are independently sampled from Bernoulli(12)\text{Bernoulli}\left(\frac{1}{2}\right). In 𝔍k\mathfrak{J}_{k} (k1k\geq 1), lk(t)l_{k*}^{(t)} (for all t[T]t\in[T]) are sampled from Bernoulli(12+ϵ1Kp)\text{Bernoulli}\left(\frac{1}{2}+\frac{\epsilon}{1-Kp}\right), and all other edge lengths are sampled from Bernoulli(12)\text{Bernoulli}\left(\frac{1}{2}\right). The proof is presented in three steps: Step 1 computes the KL-divergence using arguments similar to those discussed in the main text; Steps 2 and 3 use standard arguments for lower bound proofs.
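
For concreteness, the following short Python sketch (illustrative only; the function names and parameter values are not from the paper) simulates one epoch of random walk feedback in the instances constructed above: from every transient node the walk moves to each transient node with probability p and is absorbed with probability 1-Kp, and only the length of the edge from node k to the absorbing node has its Bernoulli mean shifted in instance \mathfrak{J}_{k}.

```python
import random

def sample_epoch(start, K, p, eps, k_star=None, rng=random):
    """Simulate one trajectory in the lower-bound construction.

    From every transient node the walk moves to each transient node with
    probability p and is absorbed with probability 1 - K * p.  Every traversed
    edge has a Bernoulli(1/2) length, except the edge from node k_star to the
    absorbing node, whose mean is 1/2 + eps / (1 - K * p) (instance J_k).
    Returns the total length of the walk, i.e., the reward of this epoch.
    """
    assert 0 < K * p < 1
    total, node = 0.0, start
    while True:
        u = rng.random()
        nxt = int(u / p) if u < K * p else None   # None encodes the absorbing node *
        mean = 0.5 + eps / (1.0 - K * p) if (nxt is None and node == k_star) else 0.5
        total += 1.0 if rng.random() < mean else 0.0
        if nxt is None:
            return total
        node = nxt

# Playing node k in instance J_k should yield an expected reward larger (by the
# gap computed exactly in Step 2 below) than playing any other node.
K, p, eps, n = 3, 1.0 / 6, 0.05, 200000
gap = (sum(sample_epoch(0, K, p, eps, k_star=0) for _ in range(n)) -
       sum(sample_epoch(1, K, p, eps, k_star=0) for _ in range(n))) / n
print(gap)   # close to eps, up to Monte Carlo error
```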

Step 1: compute the KL-divergence between 𝔍0\mathfrak{J}_{0} and 𝔍k\mathfrak{J}_{k}. Let t,j(k)\mathbb{Q}_{t,j}^{(k)} denote the distribution of the observed trajectory when node jj is played at epoch tt in instance 𝔍k\mathfrak{J}_{k}. By the chain rule, we have, for any k=1,,Kk=1,\cdots,K,

DKL(𝔍0,π𝔍k,π)=t=1Tj[K]𝔍0,π(Jt=j)DKL(t,j(0)t,j(k)).\displaystyle D_{KL}\left(\mathbb{P}_{\mathfrak{J}_{0},\pi}\|\mathbb{P}_{\mathfrak{J}_{k},\pi}\right)=\sum_{t=1}^{T}\sum_{j\in[K]}\mathbb{P}_{\mathfrak{J}_{0},\pi}\left(J_{t}=j\right)D_{KL}\left(\mathbb{Q}_{t,j}^{(0)}\|\mathbb{Q}_{t,j}^{(k)}\right). (20)

Let X0,L1,X1,L2,X_{0},L_{1},X_{1},L_{2},\cdots be the nodes and edge lengths of each step in the trajectory after playing a node. The sample space of t,j(k)\mathbb{Q}_{t,j}^{(k)} is spanned by X0,L1,X1,L2,X_{0},L_{1},X_{1},L_{2},\cdots.

By the Markov property, we have, for all i,j[K]i,j\in[K] and s+s\in\mathbb{N}_{+},

t,i(k)(Ls+1,Xs+1,Ls+2,Xs+2,|Xs=j)=t,j(k),k=0,1,2,,K.\displaystyle\mathbb{Q}_{t,i}^{(k)}\left(L_{s+1},X_{s+1},L_{s+2},X_{s+2},\dots\rvert X_{s}=j\right)=\mathbb{Q}_{t,j}^{(k)},\quad\forall k=0,1,2,\cdots,K.

Thus, applying the chain rule to t,i(k)\mathbb{Q}_{t,i}^{(k)}, we have

DKL(t,i(0)t,i(k))\displaystyle D_{KL}\left(\mathbb{Q}_{t,i}^{(0)}\|\mathbb{Q}_{t,i}^{(k)}\right)
=\displaystyle= DKL(t,i(0)(X1,L1)t,i(k)(X1,L1))\displaystyle D_{KL}\left(\mathbb{Q}_{t,i}^{(0)}(X_{1},L_{1})\|\mathbb{Q}_{t,i}^{(k)}(X_{1},L_{1})\right)
+x[K]l{0,1}(X1=x,L1=l)\displaystyle+\sum_{x\in[K]}\sum_{l\in\{0,1\}}\mathbb{P}\left(X_{1}=x,L_{1}=l\right)
DKL(t,i(0)(X2,L2,|X1=x,L1=l)t,i(k)(X2,L2,|X1=x,L1=l)).\displaystyle\cdot D_{KL}\left(\mathbb{Q}_{t,i}^{(0)}(X_{2},L_{2},...\rvert X_{1}=x,L_{1}=l)\|\mathbb{Q}_{t,i}^{(k)}(X_{2},L_{2},...\rvert X_{1}=x,L_{1}=l)\right). (21)

Recall that the edge lengths are selected independently of other randomness. Thus we have

t,i(k)(X2,L2,|X1=x,L1=l)=t,i(k)(X2,L2,|X1=x),k=0,1,2,,K.\displaystyle\mathbb{Q}_{t,i}^{(k)}(X_{2},L_{2},...\rvert X_{1}=x,L_{1}=l)=\mathbb{Q}_{t,i}^{(k)}(X_{2},L_{2},...\rvert X_{1}=x),\quad k=0,1,2,\cdots,K.

Thus (21) simplifies to

DKL(t,i(0)t,i(k))\displaystyle\;D_{KL}\left(\mathbb{Q}_{t,i}^{(0)}\|\mathbb{Q}_{t,i}^{(k)}\right)
=\displaystyle= DKL(t,i(0)(X1,L1)t,i(k)(X1,L1))+j[K]mjiDKL(t,j(0)t,j(k)),\displaystyle\;D_{KL}\left(\mathbb{Q}_{t,i}^{(0)}(X_{1},L_{1})\|\mathbb{Q}_{t,i}^{(k)}(X_{1},L_{1})\right)+\sum_{j\in[K]}m_{ji}D_{KL}\left(\mathbb{Q}_{t,j}^{(0)}\|\mathbb{Q}_{t,j}^{(k)}\right), (22)

where mjim_{ji} (mji=pm_{ji}=p) is the probability that the walk moves from ii to jj in one step.

We have, for k1k\geq 1,

DKL(t,i(0)(X1,L1)t,i(k)(X1,L1))\displaystyle\;D_{KL}\left(\mathbb{Q}_{t,i}^{(0)}(X_{1},L_{1})\|\mathbb{Q}_{t,i}^{(k)}(X_{1},L_{1})\right)
=\displaystyle= {1Kp2log(1Kp)/2(1Kp)(12+ϵ1Kp)+1Kp2log(1Kp)/2(1Kp)(12ϵ1Kp),if i=k,0,otherwise.\displaystyle\;\begin{cases}\frac{1-Kp}{2}\log\frac{(1-Kp)/2}{(1-Kp)\left(\frac{1}{2}+\frac{\epsilon}{1-Kp}\right)}+\frac{1-Kp}{2}\log\frac{(1-Kp)/2}{(1-Kp)\left(\frac{1}{2}-\frac{\epsilon}{1-Kp}\right)},&\text{if }i=k,\\ 0,&\text{otherwise}.\end{cases}
\displaystyle\leq {ϵ21Kp,if i=k,0,otherwise.\displaystyle\;\begin{cases}\frac{\epsilon^{2}}{1-Kp},&\text{if }i=k,\\ 0,&\text{otherwise}.\end{cases} (23)

where the last line uses that xx22log(1+x)xx-\frac{x^{2}}{2}\leq\log(1+x)\leq x for all x[0,1]x\in[0,1].

Combining (22) and (23) gives

DKL(t,i(0)t,i(k))\displaystyle D_{KL}\left(\mathbb{Q}_{t,i}^{(0)}\|\mathbb{Q}_{t,i}^{(k)}\right)\leq {ϵ21Kp+pϵ2(1Kp)2,if i=k,pϵ2(1Kp)2,otherwise.\displaystyle\;\begin{cases}\frac{\epsilon^{2}}{1-Kp}+\frac{p\epsilon^{2}}{(1-Kp)^{2}},&\text{if }i=k,\\ \frac{p\epsilon^{2}}{(1-Kp)^{2}},&\text{otherwise}.\end{cases} (24)

Plugging the above results into (20), we get

DKL(𝔍0,π𝔍k,π)=\displaystyle D_{KL}\left(\mathbb{P}_{\mathfrak{J}_{0},\pi}\|\mathbb{P}_{\mathfrak{J}_{k},\pi}\right)= t=1Tj[K]𝔍0,π(Jt=j)DKL(t,j(0)t,j(k))\displaystyle\;\sum_{t=1}^{T}\sum_{j\in[K]}\mathbb{P}_{\mathfrak{J}_{0},\pi}\left(J_{t}=j\right)D_{KL}\left(\mathbb{Q}_{t,j}^{(0)}\|\mathbb{Q}_{t,j}^{(k)}\right)
=\displaystyle= j𝔼𝔍0,π[Nj]DKL(t,j(0)t,j(k))\displaystyle\;\sum_{j}\mathbb{E}_{\mathfrak{J}_{0},\pi}\left[N_{j}\right]D_{KL}\left(\mathbb{Q}_{t,j}^{(0)}\|\mathbb{Q}_{t,j}^{(k)}\right)
\displaystyle\leq ϵ2𝔼𝔍0,π[Nk]1Kp+ϵ2pT(1Kp)2,\displaystyle\;\frac{\epsilon^{2}\mathbb{E}_{\mathfrak{J}_{0},\pi}\left[N_{k}\right]}{1-Kp}+\frac{\epsilon^{2}pT}{(1-Kp)^{2}},

where NjN_{j} is the number of times jj is played.

Step 2: compute the optimality gap between nodes.

Let HkH_{k} be the vector of (expected) hitting times in instance 𝔍k\mathfrak{J}_{k}. The vector HkH_{k} (k1)(k\geq 1) satisfies

Hk=MHk+(1Kp)(12𝟏+ϵ1Kp𝒆k),\displaystyle H_{k}=MH_{k}+(1-Kp)\left(\frac{1}{2}\bm{1}+\frac{\epsilon}{1-Kp}\bm{e}_{k}\right),

where MM is the transition matrix among transient nodes (all entries of MM are pp), 𝟏\bm{1} is the all-ones vector, and 𝒆k\bm{e}_{k} is the kk-th canonical basis vector.

Solving the above equation gives, for k1k\geq 1,

Hk=(1Kp)(I+M1Kp)(12𝟏+ϵ1Kp𝒆k).\displaystyle H_{k}=(1-Kp)\left(I+\frac{M}{1-Kp}\right)\left(\frac{1}{2}\bm{1}+\frac{\epsilon}{1-Kp}\bm{e}_{k}\right).

Thus the optimality gap, which is the difference between the hitting time of node kk in 𝔍k\mathfrak{J}_{k} and the hitting time of any other node in 𝔍k\mathfrak{J}_{k} (k1)(k\geq 1), is Δ:=ϵ\Delta:=\epsilon.
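
A quick numerical sanity check of the hitting-time identity and of this optimality gap (a minimal sketch; the parameter values below are illustrative):

```python
import numpy as np

K, p, eps, k = 4, 0.1, 0.02, 1                  # illustrative values with K * p < 1
M = p * np.ones((K, K))                          # transition matrix among transient nodes
b = (1 - K * p) * (0.5 * np.ones(K) + (eps / (1 - K * p)) * np.eye(K)[k])

H_direct = np.linalg.solve(np.eye(K) - M, b)     # solves H = M H + b directly
H_closed = (np.eye(K) + M / (1 - K * p)) @ b     # closed form used in the text

print(np.allclose(H_direct, H_closed))           # True: (I - M)^{-1} = I + M / (1 - Kp) here
print(H_direct[k] - H_direct[(k + 1) % K])       # optimality gap; equals eps
```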

Step 3: apply Yao’s principle and Pinsker’s inequality to finish the proof.

By Pinsker’s inequality, we have j,k,\forall j,k,

|𝔍0,π(Jt=j)𝔍k,π(Jt=j)|\displaystyle\rvert\mathbb{P}_{\mathfrak{J}_{0},\pi}(J_{t}=j)-\mathbb{P}_{\mathfrak{J}_{k},\pi}(J_{t}=j)\rvert\leq 2DKL(𝔍0,π𝔍k,π).\displaystyle\;\sqrt{2D_{KL}\left(\mathbb{P}_{\mathfrak{J}_{0},\pi}\|\mathbb{P}_{\mathfrak{J}_{k},\pi}\right)}. (25)

Thus, for the regret against node kk in instance 𝔍k\mathfrak{J}_{k}, we have

1Kk=1K𝔼𝔍k,π[Regkadv(T)]\displaystyle\;\frac{1}{K}\sum_{k=1}^{K}\mathbb{E}_{\mathfrak{J}_{k},\pi}\left[\mathrm{Reg}^{\mathrm{adv}}_{k}(T)\right]
=\displaystyle= 1Kk=1Kt𝔼𝔍k,π[Yk,t]𝔼𝔍k,π[YJt,t]\displaystyle\;\frac{1}{K}\sum_{k=1}^{K}\sum_{t}\mathbb{E}_{\mathfrak{J}_{k},\pi}\left[Y_{k,t}\right]-\mathbb{E}_{\mathfrak{J}_{k},\pi}\left[Y_{J_{t},t}\right]
=\displaystyle= ϵKk=1Kt𝔍k,π(Jtk)\displaystyle\;\frac{\epsilon}{K}\sum_{k=1}^{K}\sum_{t}\mathbb{P}_{\mathfrak{J}_{k},\pi}(J_{t}\neq k) (by Wald’s identity)
=\displaystyle= ϵTϵKk=1Kt𝔍k,π(Jt=k)\displaystyle\;\epsilon T-\frac{\epsilon}{K}\sum_{k=1}^{K}\sum_{t}\mathbb{P}_{\mathfrak{J}_{k},\pi}\left(J_{t}=k\right)
\displaystyle\geq ϵTϵKk=1Kt𝔍0,π(Jt=k)ϵKk=1Kt2DKL(𝒫𝔍0,π𝒫𝔍k,π)\displaystyle\;\epsilon T-\frac{\epsilon}{K}\sum_{k=1}^{K}\sum_{t}\mathbb{P}_{\mathfrak{J}_{0},\pi}(J_{t}=k)-\frac{\epsilon}{K}\sum_{k=1}^{K}\sum_{t}\sqrt{2D_{KL}\left(\mathcal{P}_{\mathfrak{J}_{0},\pi}\|\mathcal{P}_{\mathfrak{J}_{k},\pi}\right)} (by Eq. 25)
\displaystyle\geq (K1)ϵTKϵ2T1Kk=1K(2𝔼𝔍0,π[Nk]1Kp+2pT(1Kp)2)\displaystyle\;\frac{(K-1)\epsilon T}{K}-\epsilon^{2}T\sqrt{\frac{1}{K}\sum_{k=1}^{K}\left(\frac{2\mathbb{E}_{\mathfrak{J}_{0},\pi}\left[N_{k}\right]}{1-Kp}+\frac{2pT}{(1-Kp)^{2}}\right)} (by Jensen’s inequality)
=\displaystyle= (K1)ϵTKϵ2T2TK(1Kp)+2pT(1Kp)2.\displaystyle\;\frac{(K-1)\epsilon T}{K}-\epsilon^{2}T\sqrt{\frac{2T}{K(1-Kp)}+\frac{2pT}{(1-Kp)^{2}}}. (26)

Now we set p=12Kp=\frac{1}{2K} and ϵ=142K1KKT\epsilon=\frac{1}{4\sqrt{2}}\frac{K-1}{{K}}\sqrt{\frac{K}{T}} and (26) gives

1Kk=1K𝔼𝔍k,π[Regkadv(T)]182KT\displaystyle\frac{1}{K}\sum_{k=1}^{K}\mathbb{E}_{\mathfrak{J}_{k},\pi}\left[\mathrm{Reg}^{\mathrm{adv}}_{k}(T)\right]\geq\frac{1}{8\sqrt{2}}\sqrt{KT}

Thus we have

maxk𝔼𝔍k,π[Regkadv(T)]\displaystyle\max_{k}\mathbb{E}_{\mathfrak{J}_{k},\pi}\left[\mathrm{Reg}^{\mathrm{adv}}_{k}(T)\right]\geq 1Kk=1K𝔼𝔍k,π[Regkadv(T)]\displaystyle\;\frac{1}{K}\sum_{k=1}^{K}\mathbb{E}_{\mathfrak{J}_{k},\pi}\left[\mathrm{Reg}^{\mathrm{adv}}_{k}(T)\right]
\displaystyle\geq 182KT.\displaystyle\;\frac{1}{8\sqrt{2}}\sqrt{KT}.

Appendix B Proof Details for Section 4

We use the following notation for simplicity.

1. We write p~tj=ptj+ijqijpti\widetilde{p}_{tj}=p_{tj}+\sum_{i\neq j}q_{ij}p_{ti}, and p^tj=ptj+ijq^t,ijpti\widehat{p}_{tj}=p_{tj}+\sum_{i\neq j}\widehat{q}_{t,ij}p_{ti}.

2. We use t\mathcal{F}_{t} to denote the σ\sigma-algebra generated by all randomness up to the end of epoch tt, and t,i\mathcal{F}_{t,i} to denote the σ\sigma-algebra generated by all randomness up to the first occurrence of node ii in epoch tt (or the end of epoch tt if ii is not visited in epoch tt). We use 𝔼t\mathbb{E}_{t} to denote the expectation conditioned on t\mathcal{F}_{t}, i.e., 𝔼t[]=𝔼[|t]\mathbb{E}_{t}\left[\cdot\right]=\mathbb{E}\left[\cdot\rvert\mathcal{F}_{t}\right].

3. Unless otherwise stated, we use t\sum_{t} and j\sum_{j} as shorthand for t=1T\sum_{t=1}^{T} and j[K]\sum_{j\in[K]}, respectively.

In Appendix B.1, we state some preparatory properties needed for proving Lemma 1. Proposition 2 is also included in this part. In Appendix B.2, we provide a proof of Lemma 1.

B.1 Additional Properties

As Proposition 1 suggests, the number of times a node is visited accumulates linearly with the number of epochs tt. We state this observation as Lemma 3 below.

Lemma 3.

For v\in V, any t, and any \lambda>0, it holds that \mathbb{P}\Big{(}{N}_{t}^{+}(v)-{N}_{t}(v)-\alpha_{v}\left(t-{N}_{t}(v)\right)\leq-\lambda\Big{)}\leq e^{-\frac{\lambda^{2}}{2t}}.

Proof.

Recall 𝒫t,Jt=(Xt,0,Lt,1,Xt,1,Lt,2,Xt,2,,Lt,Ht,Jt,Xt,Ht,Jt)\mathcal{P}_{t,J_{t}}=\left(X_{t,0},L_{t,1},X_{t,1},L_{t,2},X_{t,2},\dots,L_{t,H_{t,J_{t}}},X_{t,H_{t,J_{t}}}\right) is the trajectory for epoch tt and Xt,0X_{t,0} is the node played at epoch tt. For a fixed node vVv\in V, consider the random variables {𝕀[v𝒫t,Jt{Xt,0}]}t\left\{\mathbb{I}_{\left[v\in\mathcal{P}_{t,J_{t}}\setminus\{X_{t,0}\}\right]}\right\}_{t}, where each indicator takes value 1 when vv is covered by the path 𝒫t,Jt\mathcal{P}_{t,J_{t}} but is not the node played at epoch tt. From this definition, we have

k=1t𝕀[v𝒫k,Jk{Xk,0}]=Nt+(v)Nt(v).\displaystyle\sum_{k=1}^{t}\mathbb{I}_{\left[v\in\mathcal{P}_{k,J_{k}}\setminus\{X_{k,0}\}\right]}={N}_{t}^{+}(v)-{N}_{t}(v).

From definition of αv\alpha_{v}, we have

\mathbb{E}\left[{N}_{t}^{+}(v)-{N}_{t}(v)\right]=\mathbb{E}\left[\sum_{k=1}^{t}\mathbb{I}_{\left[v\in\mathcal{P}_{k,J_{k}}\setminus\{X_{k,0}\}\right]}\right]\geq\alpha_{v}\left(t-\mathbb{E}\left[{N}_{t}(v)\right]\right).

Thus, by a one-sided version of Azuma's inequality, we have, for any λ>0\lambda>0,

\mathbb{P}\Big{(}{N}_{t}^{+}(v)-{N}_{t}(v)-\alpha_{v}\left(t-{N}_{t}(v)\right)\leq-\lambda\Big{)}\leq\exp\left(-\frac{\lambda^{2}}{2t}\right). (27)
∎
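
The linear accumulation asserted by Lemma 3 can be illustrated with a small Monte Carlo sketch (illustrative only; it assumes a toy instance with all transition probabilities equal to p and a uniformly random played node, and the counters below are stand-ins for {N}_{t}(v) and {N}_{t}^{+}(v)):

```python
import random

def visited_nodes(start, K, p, rng):
    """Return the set of transient nodes covered by a walk started at `start`."""
    covered, node = {start}, start
    while True:
        u = rng.random()
        if u >= K * p:              # absorbed
            return covered
        node = int(u / p)
        covered.add(node)

rng = random.Random(0)
K, p, v, T = 5, 0.15, 0, 2000
N_plays, N_visits = 0, 0            # stand-ins for N_t(v) and N_t^+(v)
for t in range(1, T + 1):
    start = rng.randrange(K)        # the played node (uniform here, for illustration)
    path = visited_nodes(start, K, p, rng)
    N_plays += (start == v)
    N_visits += (v in path)
# alpha_v (defined in the main text) lower-bounds the probability that v is
# covered by an epoch in which it is not played; hence N_visits - N_plays
# grows, in expectation, at rate at least alpha_v per epoch in which v is not played.
print(N_visits - N_plays, T - N_plays)
```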

Lemma 4.

For any t,i,jt,i,j, it holds that 𝕍[q^t,ij]=𝒪~(1t).\mathbb{V}\left[\widehat{q}_{t,ij}\right]=\widetilde{\mathcal{O}}\left(\frac{1}{t}\right).

Proof.

For the variance, we have

𝕍[q^t,ij]=\displaystyle\mathbb{V}\left[\widehat{q}_{t,ij}\right]= m=1t𝕍[q^t,ij|Nt+(i)=m](Nt+(i)=m)\displaystyle\;\sum_{m=1}^{t}\mathbb{V}\left[\widehat{q}_{t,ij}\bigg{\rvert}N_{t}^{+}(i)=m\right]\mathbb{P}\left(N_{t}^{+}(i)=m\right)
\displaystyle\leq m=1t1m(Nt+(i)=m)=𝔼[1Nt+(i)].\displaystyle\;\sum_{m=1}^{t}\frac{1}{m}\mathbb{P}\left(N_{t}^{+}(i)=m\right)=\mathbb{E}\left[\frac{1}{N_{t}^{+}(i)}\right].

By Lemma 3 and a union bound, we know, for any δ(0,1)\delta\in(0,1),

\mathbb{P}\Big{(}{N}_{t}^{+}(i)\geq\alpha t-\sqrt{2t\log(2TK/\delta)},\quad\forall i\in[K],\;t\in[T]\Big{)}\geq 1-\delta.

Thus it holds that

𝔼[1Nt+(i)]=\displaystyle\mathbb{E}\left[\frac{1}{N_{t}^{+}(i)}\right]= 𝔼[1Nt+(i)|Nt+(i)αt2tlog(TK/δ)]\displaystyle\;\mathbb{E}\left[\frac{1}{N_{t}^{+}(i)}\big{\rvert}{N_{t}^{+}(i)}\geq\alpha t-\sqrt{2t\log(TK/\delta)}\right]
(Nt+(i)αt2tlog(TK/δ))\displaystyle\quad\cdot\mathbb{P}\left(N_{t}^{+}(i)\geq\alpha t-\sqrt{2t\log(TK/\delta)}\right)
+𝔼[1Nt+(i)|Nt+(i)<αt2tlog(TK/δ)]\displaystyle+\mathbb{E}\left[\frac{1}{N_{t}^{+}(i)}\big{\rvert}{N_{t}^{+}(i)}<\alpha t-\sqrt{2t\log(TK/\delta)}\right]
(Nt+(i)<αt2tlog(TK/δ))\displaystyle\quad\cdot\mathbb{P}\left(N_{t}^{+}(i)<\alpha t-\sqrt{2t\log(TK/\delta)}\right)
\displaystyle\leq 1max{1,αt2tlog(TK/δ)}+δ.\displaystyle\;\frac{1}{\max\left\{1,\alpha t-\sqrt{2t\log(TK/\delta)}\right\}}+\delta.

Setting δ=1t\delta=\frac{1}{t} concludes the proof. ∎
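
The 1/t decay in Lemma 4 can also be seen numerically. The sketch below is illustrative only: it assumes \widehat{q}_{t,ij} is the fraction, among epochs in which node i was visited, of epochs whose first departure from i went to j (a stand-in for the estimator in the main text), and it uses a toy instance with all transition probabilities equal to p.

```python
import random

def one_run(K, p, i, j, T, rng):
    """Track the estimate of q_{ij} (first departure from i goes to j) over T epochs."""
    hits, count, estimates = 0, 0, []
    for _ in range(T):
        node, seen_i = rng.randrange(K), False
        while True:
            u = rng.random()
            nxt = int(u / p) if u < K * p else None   # None encodes absorption
            if node == i and not seen_i:
                seen_i = True
                hits += 1
                count += (nxt == j)
            if nxt is None:
                break
            node = nxt
        estimates.append(count / max(hits, 1))
    return estimates

K, p, i, j, T, R = 4, 0.15, 0, 1, 400, 300        # illustrative values
rng = random.Random(1)
runs = [one_run(K, p, i, j, T, rng) for _ in range(R)]
for t in (50, 100, 200, 400):
    vals = [run[t - 1] for run in runs]
    mean = sum(vals) / R
    var = sum((x - mean) ** 2 for x in vals) / R
    print(t, round(t * var, 3))                   # t * Var stays roughly constant
```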

Next, we consider a high probability event and approximate q^t,ij\widehat{q}_{t,ij} under this event.

Lemma 5.

For any ϵ(0,1)\epsilon\in(0,1), let

T:={\displaystyle\mathcal{E}_{T}^{\prime}:=\Bigg{\{} |q^t,ijqij|2𝕍[q^t,ij]log(T/ϵ)Nt1+(i)+log(T/ϵ)3Nt1+(i),\displaystyle\left\rvert\widehat{q}_{t,ij}-q_{ij}\right\rvert\leq\sqrt{\frac{2\mathbb{V}\left[\widehat{q}_{t,ij}\right]\log(T/\epsilon)}{N_{t-1}^{+}(i)}}+\frac{\log(T/\epsilon)}{3N_{t-1}^{+}(i)},
\displaystyle\quad\;N_{t}^{+}(i)\geq\alpha t-\sqrt{t\log(TK/\epsilon)},\;\forall i,j\in[K],\;\forall t\in[T]\Bigg{\}}.

It holds that (T)12ϵT\mathbb{P}\left(\mathcal{E}_{T}^{\prime}\right)\geq 1-\frac{2\epsilon}{T} and under T\mathcal{E}_{T}^{\prime}, for all i,j[K]i,j\in[K] and t[T]t\in[T],

q^t,ij=qij±𝒪(log(TK/ϵ)t), and p^ti=p~ti±𝒪(log(TK/ϵ)t).\displaystyle\widehat{q}_{t,ij}=q_{ij}\pm\mathcal{O}\left({\frac{\log(TK/\epsilon)}{t}}\right),\text{ and }\widehat{p}_{ti}=\widetilde{p}_{ti}\pm\mathcal{O}\left({\frac{\log(TK/\epsilon)}{t}}\right). (28)
Proof.

By Bennett’s inequality, it holds that

(|q^t,ijqij|2𝕍[q^t,ij]log(KT/ϵ)Nt1+(i)+log(KT/ϵ)3Nt1+(i))ϵ\displaystyle\mathbb{P}\left(\left\rvert\widehat{q}_{t,ij}-q_{ij}\right\rvert\geq\sqrt{\frac{2\mathbb{V}\left[\widehat{q}_{t,ij}\right]\log(KT/\epsilon)}{N_{t-1}^{+}(i)}}+\frac{\log(KT/\epsilon)}{3N_{t-1}^{+}(i)}\right)\leq\epsilon (29)

By Lemma 3 and a union bound, we know (T)12ϵ\mathbb{P}\left(\mathcal{E}_{T}^{\prime}\right)\geq 1-2\epsilon. By Lemma 4, we know that 𝕍[q^t,ij]=𝒪~(1t)\mathbb{V}\left[\widehat{q}_{t,ij}\right]=\widetilde{\mathcal{O}}\left(\frac{1}{t}\right). Thus, under event T\mathcal{E}_{T}^{\prime}, it holds that

q^t,ij=qij±𝒪~(log(T/ϵ)t),\displaystyle\widehat{q}_{t,ij}=q_{ij}\pm\widetilde{\mathcal{O}}\left(\frac{\log\left(T/\epsilon\right)}{t}\right),

and thus

p^ti=p~ti±𝒪~(log(T/ϵ)t).\displaystyle\widehat{p}_{ti}=\widetilde{p}_{ti}\pm\widetilde{\mathcal{O}}\left(\frac{\log\left(T/\epsilon\right)}{t}\right).

Lemma 6.

For any BB, let T(B):={Zt,jB for all t=1,2,,T, and j[K]}.\mathcal{E}_{T}(B):=\left\{{Z}_{t,j}\leq B\text{ for all }t=1,2,\cdots,T,\text{ and }j\in[K]\right\}. For any ϵ(0,1)\epsilon\in(0,1) and B=log(1ρ)ϵKTlogρB=\frac{\log\frac{(1-\rho)\epsilon}{KT}}{\log\rho}, it holds that

(T(B))1ϵ,\displaystyle\mathbb{P}\left(\mathcal{E}_{T}\left(B\right)\right)\geq 1-\epsilon,
𝔼[Zt,i|not T(B)]KB+K(1ρ)2,\displaystyle\mathbb{E}\left[{Z}_{t,i}\rvert\text{not }\mathcal{E}_{T}(B)\right]\leq KB+\frac{K}{(1-\rho)^{2}},
𝔼[Zt,i2|not T(B)]KB2+2K(1ρ)3.\displaystyle\mathbb{E}\left[{Z}_{t,i}^{2}\rvert\text{not }\mathcal{E}_{T}(B)\right]\leq KB^{2}+\frac{2K}{(1-\rho)^{3}}.
Proof.

Since all edge lengths are at most 1, we have, for any integer BB,

(Zt,i>B)\displaystyle\mathbb{P}\left(Z_{t,i}>B\right) ({random walk starting from i does not terminate in B steps})\displaystyle\leq\mathbb{P}\left(\{\text{random walk starting from $i$ does not terminate in $B$ steps}\}\right)
=l=B({random walk starting from i terminates at step l})\displaystyle=\sum_{l=B}^{\infty}\mathbb{P}\left(\{\text{random walk starting from $i$ terminates at step $l$}\}\right)
=l=Bj=1K[Ml]ijl=B|M|ll=BρlρB1ρ.\displaystyle=\sum_{l=B}^{\infty}\sum_{j=1}^{K}\left[M^{l}\right]_{ij}\leq\sum_{l=B}^{\infty}\rvert M\rvert_{\infty}^{l}\leq\sum_{l=B}^{\infty}\rho^{l}\leq\frac{\rho^{B}}{1-\rho}.

Thus with probability at least 1ρB1ρ1-\frac{\rho^{B}}{1-\rho}, we have Zt,iBZ_{t,i}\leq B. We define

T(B):={Zt,jB for all t=1,2,,T, and j[K]}.\displaystyle\mathcal{E}_{T}(B):=\left\{{Z}_{t,j}\leq B\text{ for all }t=1,2,\cdots,T,\text{ and }j\in[K]\right\}.

By a union bound, (T(B))1KTρB1ρ\mathbb{P}\left(\mathcal{E}_{T}(B)\right)\geq 1-\frac{KT\rho^{B}}{1-\rho}. Now we can set B=log(1ρ)ϵKTlogρB=\frac{\log\frac{(1-\rho)\epsilon}{KT}}{\log\rho} so that (T(B))1ϵ\mathbb{P}\left(\mathcal{E}_{T}(B)\right)\geq 1-\epsilon.

The random variables Zt,iZ_{t,i} also have the following memorylessness-type property:

𝔼[Zt,i|not T(B)]\displaystyle\;\mathbb{E}\left[Z_{t,i}\rvert\text{not }\mathcal{E}_{T}(B)\right]
\displaystyle\leq 𝔼[Zt,i|Zt,i>B]l=B+1l({𝒫t,i terminates at step l}{Zt,i>B})(Zt,i>B)\displaystyle\mathbb{E}\left[Z_{t,i}\rvert Z_{t,i}>B\right]\leq\sum_{l=B+1}^{\infty}l\frac{\mathbb{P}\left(\{\text{$\mathcal{P}_{t,i}$ terminates at step $l$}\}\cap\{Z_{t,i}>B\}\right)}{\mathbb{P}\left(Z_{t,i}>B\right)}
=\displaystyle= l=B+1lj({𝒫t,i terminates at step l}{the (B+1)-th step is at j})j({the (B+1)-th step is at j})\displaystyle\;\sum_{l=B+1}^{\infty}l\frac{\sum_{j}\mathbb{P}\left(\{\text{$\mathcal{P}_{t,i}$ terminates at step $l$}\}\cap\{\text{the $(B+1)$-th step is at $j$}\}\right)}{\sum_{j}\mathbb{P}\left(\{\text{the $(B+1)$-th step is at $j$}\}\right)}
\displaystyle\leq l=B+1lj({𝒫t,i terminates at step l}|{the (B+1)-th step is at j})\displaystyle\;\sum_{l=B+1}^{\infty}l\sum_{j}\mathbb{P}\left(\{\text{$\mathcal{P}_{t,i}$ terminates at step $l$}\}\big{\rvert}\{\text{the $(B+1)$-th step is at $j$}\}\right)
=\displaystyle= jl=1(l+B)({𝒫t,j terminates at step l})\displaystyle\;\sum_{j}\sum_{l=1}^{\infty}(l+B)\mathbb{P}\left(\{\text{$\mathcal{P}_{t,j}$ terminates at step $l$}\}\right)
=\displaystyle= j𝔼[Zt,j+B]KB+j𝔼[Zt,j]\displaystyle\;\sum_{j}\mathbb{E}\left[Z_{t,j}+B\right]\leq KB+\sum_{j}\mathbb{E}\left[Z_{t,j}\right]

where we use the Markov property in the second-to-last line.

Since 𝔼[Zt,j]𝒪(1(1ρ)2)\mathbb{E}\left[Z_{t,j}\right]\leq\mathcal{O}\left(\frac{1}{\left(1-\rho\right)^{2}}\right), we insert this into the above equation to get

𝔼[Zt,i|not T(B)]𝒪(KB+K(1ρ)2).\displaystyle\mathbb{E}\left[Z_{t,i}\rvert\text{not }\mathcal{E}_{T}(B)\right]\leq\mathcal{O}\left(KB+\frac{K}{(1-\rho)^{2}}\right).

Similarly, we have

𝔼[Zt,i2|not T(B)]𝒪(KB2+K(1ρ)3).\displaystyle\mathbb{E}\left[Z_{t,i}^{2}\rvert\text{not }\mathcal{E}_{T}(B)\right]\leq\mathcal{O}\left(KB^{2}+\frac{K}{(1-\rho)^{3}}\right).
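
As a numerical illustration of the choice of B in Lemma 6 (parameter values are illustrative), note that B=\frac{\log\frac{(1-\rho)\epsilon}{KT}}{\log\rho} makes the union bound \frac{KT\rho^{B}}{1-\rho} equal to \epsilon:

```python
import math

rho, K, T, eps = 0.8, 10, 10000, 0.01             # illustrative values
B = math.log((1 - rho) * eps / (K * T)) / math.log(rho)
print(B)                                          # truncation level for the walk lengths
print(K * T * rho ** B / (1 - rho))               # equals eps up to floating point error
```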

Remark 1.

By Lemmas 5 and 6, we know p^ti=p~ti+𝒪~(1t)\widehat{p}_{ti}=\widetilde{p}_{ti}+\widetilde{\mathcal{O}}\left(\frac{1}{t}\right) and Zt,iBZ_{t,i}\leq B (i[K],t[T]\forall i\in[K],t\in[T]) hold with high probability. Let Σ\Sigma be the whole sample space generated by all possibilities of TT epochs of play. From now on, we restrict our attention to the event \Sigma^{\prime}=\mathcal{E}_{T}(B)\cap\mathcal{E}_{T}^{\prime}\subseteq\Sigma, and work under this event unless otherwise noted.

Proposition 2.

Fix any a(0,1]a\in(0,1]. We have

\displaystyle\frac{x}{x+(1-x)a}\leq x+\frac{1-\sqrt{a}}{1+\sqrt{a}},\qquad\forall x\in(0,1). (30)
Proof.

It suffices to show that, for any a(0,1]a\in(0,1], the function fa(x):=xx+(1x)axf_{a}(x):=\frac{x}{x+(1-x)a}-x is upper bounded by 1a1+a\frac{1-\sqrt{a}}{1+\sqrt{a}}. This can be shown via a quick first-order test: the maximum of faf_{a} is attained at xmax=a1+ax_{\max}=\frac{\sqrt{a}}{1+\sqrt{a}}, and fa(xmax)=1a1+af_{a}(x_{\max})=\frac{1-\sqrt{a}}{1+\sqrt{a}}. ∎
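
A quick grid check of Proposition 2 (illustrative only):

```python
import numpy as np

x = np.linspace(1e-6, 1 - 1e-6, 100001)
for a in (0.1, 0.3, 0.5, 0.9):
    f = x / (x + (1 - x) * a) - x                      # f_a(x) from the proof
    x_star = np.sqrt(a) / (1 + np.sqrt(a))             # claimed maximizer
    print(round(x[np.argmax(f)], 4), round(x_star, 4),
          round(f.max(), 6), round((1 - np.sqrt(a)) / (1 + np.sqrt(a)), 6))
```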

B.2 Proof of Lemma 1

By the law of total expectation, we have

𝔼t1[Zt,i𝕀[i𝒫t,Jt]]=\displaystyle\mathbb{E}_{t-1}\left[Z_{t,i}\mathbb{I}_{\left[i\in\mathcal{P}_{t,J_{t}}\right]}\right]= 𝔼t1[𝔼[Zt,i𝕀[i𝒫t,Jt]|t,i]]\displaystyle\;\mathbb{E}_{t-1}\left[\mathbb{E}\left[Z_{t,i}\mathbb{I}_{\left[i\in\mathcal{P}_{t,J_{t}}\right]}\big{\rvert}\mathcal{F}_{t,i}\right]\right]
=\displaystyle= 𝔼t1[lt,i𝕀[i𝒫t,Jt]]=lt,ip~ti.\displaystyle\;\mathbb{E}_{t-1}\left[l_{t,i}\mathbb{I}_{\left[i\in\mathcal{P}_{t,J_{t}}\right]}\right]=l_{t,i}\widetilde{p}_{ti}.

Also, we have

𝔼t1[(Zt,iBB)2]1\displaystyle\mathbb{E}_{t-1}\left[\left(\frac{Z_{t,i}-B}{B}\right)^{2}\right]\leq 1 (31)

By Lemma 5, and since p~tj\widetilde{p}_{tj} is bounded away from zero, we know

1p^tj=1p~tj±𝒪~(log(T/ϵ)t),t[T],j[K]\displaystyle\frac{1}{\widehat{p}_{tj}}=\frac{1}{\widetilde{p}_{tj}}\pm\widetilde{\mathcal{O}}\left(\frac{\log\left(T/\epsilon\right)}{t}\right),\quad\forall t\in[T],j\in[K] (32)

We can use (32) to get

𝔼t1[exp(βB(lt,iB)β(Zt,iBB𝕀[i𝒫t,Jt]+βpti+jiq^t,jiptj))]\displaystyle\mathbb{E}_{t-1}\left[\exp\left(\frac{\beta}{B}\left(l_{t,i}-B\right)-\beta\left(\frac{\frac{{Z}_{t,i}-B}{B}\mathbb{I}_{\left[i\in\mathcal{P}_{t,J_{t}}\right]}+\beta}{p_{ti}+\sum_{j\neq i}\widehat{q}_{t,ji}p_{tj}}\right)\right)\right]
=\displaystyle{=} 𝔼t1[exp(βB(lt,iB)β(Zt,iBB𝕀[i𝒫t,Jt]+βp~ti)+𝒪~(log(T/ϵ)t))]\displaystyle\mathbb{E}_{t-1}\left[\exp\left(\frac{\beta}{B}\left(l_{t,i}-B\right)-\beta\left(\frac{\frac{{Z}_{t,i}-B}{B}\mathbb{I}_{\left[i\in\mathcal{P}_{t,J_{t}}\right]}+\beta}{\widetilde{p}_{ti}}\right)+\widetilde{\mathcal{O}}\left(\frac{\log\left(T/\epsilon\right)}{t}\right)\right)\right]
=\displaystyle{=} exp(β2p~ti+𝒪~(log(T/ϵ)t))\displaystyle\exp\left(-\frac{\beta^{2}}{\widetilde{p}_{ti}}+\widetilde{\mathcal{O}}\left(\frac{\log\left(T/\epsilon\right)}{t}\right)\right)\cdot
𝔼t1[exp(βB(lt,iB)β(Zt,iBB𝕀[i𝒫t,Jt]p~ti))].\displaystyle\qquad\qquad\mathbb{E}_{t-1}\left[\exp\left(\frac{\beta}{B}\left(l_{t,i}-B\right)-\beta\left(\frac{\frac{{Z}_{t,i}-B}{B}\mathbb{I}_{\left[i\in\mathcal{P}_{t,J_{t}}\right]}}{\widetilde{p}_{ti}}\right)\right)\right]. (33)

Since exp(x)1+x+x2\exp(x)\leq 1+x+x^{2} for x1x\leq 1, we have

𝔼t1[exp(βB(lt,iB)β(Zt,iBB𝕀[i𝒫t,Jt]p~ti))]\displaystyle\mathbb{E}_{t-1}\left[\exp\left(\frac{\beta}{B}\left(l_{t,i}-B\right)-\beta\left(\frac{\frac{{Z}_{t,i}-B}{B}\mathbb{I}_{\left[i\in\mathcal{P}_{t,J_{t}}\right]}}{\widetilde{p}_{ti}}\right)\right)\right]
\displaystyle\leq 1+𝔼t1[βB(lt,iB)β(Zt,iBB𝕀[i𝒫t,Jt]p~ti)]\displaystyle 1+\mathbb{E}_{t-1}\left[\frac{\beta}{B}\left(l_{t,i}-B\right)-\beta\left(\frac{\frac{{Z}_{t,i}-B}{B}\mathbb{I}_{\left[i\in\mathcal{P}_{t,J_{t}}\right]}}{\widetilde{p}_{ti}}\right)\right]
+𝔼t1[(βB(lt,iB)β(Zt,iBB𝕀[i𝒫t,Jt]p~ti))2].\displaystyle\qquad+\mathbb{E}_{t-1}\left[\left(\frac{\beta}{B}\left(l_{t,i}-B\right)-\beta\left(\frac{\frac{{Z}_{t,i}-B}{B}\mathbb{I}_{\left[i\in\mathcal{P}_{t,J_{t}}\right]}}{\widetilde{p}_{ti}}\right)\right)^{2}\right]. (34)

Since \mathbb{E}_{t-1}\left[\mathbb{I}_{\left[i\in\mathcal{P}_{t,J_{t}}\right]}\right]=\widetilde{p}_{ti}, \mathbb{E}_{t-1}\left[\left({Z_{t,i}-B}\right)\mathbb{I}_{\left[i\in\mathcal{P}_{t,J_{t}}\right]}\right]=\left(l_{t,i}-B\right)\widetilde{p}_{ti}, and \mathbb{E}_{t-1}\left[\left(\frac{Z_{t,i}-B}{B}\right)^{2}\mathbb{I}_{\left[i\in\mathcal{P}_{t,J_{t}}\right]}\right]\leq\widetilde{p}_{ti}, we have

1+𝔼t1[βB(lt,iB)β(Zt,iBB𝕀[i𝒫t,Jt]p~ti)]\displaystyle 1+\mathbb{E}_{t-1}\left[\frac{\beta}{B}\left(l_{t,i}-B\right)-\beta\left(\frac{\frac{{Z}_{t,i}-B}{B}\mathbb{I}_{\left[i\in\mathcal{P}_{t,J_{t}}\right]}}{\widetilde{p}_{ti}}\right)\right]
+𝔼t1[(βB(lt,iB)β(Zt,iBB𝕀[i𝒫t,Jt]p~ti))2]\displaystyle\qquad\qquad+\mathbb{E}_{t-1}\left[\left(\frac{\beta}{B}\left(l_{t,i}-B\right)-\beta\left(\frac{\frac{{Z}_{t,i}-B}{B}\mathbb{I}_{\left[i\in\mathcal{P}_{t,J_{t}}\right]}}{\widetilde{p}_{ti}}\right)\right)^{2}\right]
=\displaystyle= 1β2B2(lt,iB)2+β2p~ti1+β2p~tiexp(β2p~ti),\displaystyle 1-\frac{\beta^{2}}{B^{2}}\left(l_{t,i}-B\right)^{2}+\frac{\beta^{2}}{\widetilde{p}_{ti}}\leq 1+\frac{\beta^{2}}{\widetilde{p}_{ti}}\leq\exp\left(\frac{\beta^{2}}{\widetilde{p}_{ti}}\right), (35)

where the last inequality uses 1+xexp(x)1+x\leq\exp(x).

We can now combine (33), (34) and (35) and get

𝔼t1[exp(βB(lt,iB)β(Zt,iBB𝕀[i𝒫t,Jt]+βpti+jiq^t,jiptj))]\displaystyle\mathbb{E}_{t-1}\left[\exp\left(\frac{\beta}{B}\left(l_{t,i}-B\right)-\beta\left(\frac{\frac{{Z}_{t,i}-B}{B}\mathbb{I}_{\left[i\in\mathcal{P}_{t,J_{t}}\right]}+\beta}{p_{ti}+\sum_{j\neq i}\widehat{q}_{t,ji}p_{tj}}\right)\right)\right]
\displaystyle\leq exp(𝒪~(log(T/ϵ)t))\displaystyle\exp\left(\widetilde{\mathcal{O}}\left(\frac{\log\left(T/\epsilon\right)}{t}\right)\right)

Let X=βBt=1T(lt,iB)βt=1T(Zt,iBB𝕀[i𝒫t,Jt]+βpti+jiq^t,jiptj)X=\frac{\beta}{B}\sum_{t=1}^{T}\left(l_{t,i}-B\right)-\beta\sum_{t=1}^{T}\left(\frac{\frac{{Z}_{t,i}-B}{B}\mathbb{I}_{\left[i\in\mathcal{P}_{t,J_{t}}\right]}+\beta}{p_{ti}+\sum_{j\neq i}\widehat{q}_{t,ji}p_{tj}}\right) for simplicity. Note that X=β(t=1Tlt,iBBt=1TZ^t,i)X=\beta\left(\sum_{t=1}^{T}\frac{l_{t,i}-B}{B}-\sum_{t=1}^{T}\widehat{Z}_{t,i}\right).

Combining the per-epoch bound above with the tower property of conditional expectation, we get

𝔼[eX]\displaystyle\mathbb{E}\left[e^{X}\right]\leq t=1Texp(𝒪~(log(T/ϵ)t))=𝒪~(T/ϵ).\displaystyle\prod_{t=1}^{T}\exp\left(\widetilde{\mathcal{O}}\left(\frac{\log\left(T/\epsilon\right)}{t}\right)\right)=\widetilde{\mathcal{O}}\left(T/\epsilon\right).

By Markov's inequality and the above results, we have

(Xβlog(T/ϵ2)β)𝒪~(ϵ),\displaystyle\mathbb{P}\left(\frac{X}{\beta}\geq\frac{\log\left(T/\epsilon^{2}\right)}{\beta}\right)\leq\widetilde{\mathcal{O}}\left(\epsilon\right),

which concludes the proof.
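
As an illustrative sanity check of the identity \mathbb{E}_{t-1}\left[Z_{t,i}\mathbb{I}_{\left[i\in\mathcal{P}_{t,J_{t}}\right]}\right]=l_{t,i}\widetilde{p}_{ti} used at the beginning of this proof, the following sketch compares a Monte Carlo estimate of the left-hand side with an exact computation of the right-hand side on a toy instance (all transition probabilities equal to p, all edge lengths equal to 1, and the played node drawn uniformly at random; these choices are assumptions of the sketch, not of the paper).

```python
import random
import numpy as np

rng = random.Random(3)
K, p, i = 4, 0.15, 2                                   # illustrative values with K * p < 1
M = p * np.ones((K, K))                                # transitions among transient nodes

# l_i: expected number of steps to absorption when starting from node i.
l = np.linalg.solve(np.eye(K) - M, np.ones(K))

# h_s: probability that node i is ever visited when starting from s (h_i = 1).
others = [s for s in range(K) if s != i]
h = np.ones(K)
h[others] = np.linalg.solve(np.eye(K - 1) - M[np.ix_(others, others)], M[others, i])
p_tilde = h.mean()                                     # played node is uniform in this sketch

def suffix_length(start):
    """Steps from the first visit to node i until absorption (0 if i is never visited)."""
    node, steps, seen = start, 0, False
    while True:
        seen = seen or (node == i)
        u = rng.random()
        steps += seen
        if u >= K * p:
            return steps
        node = int(u / p)

n = 200000
lhs = sum(suffix_length(rng.randrange(K)) for _ in range(n)) / n
print(lhs, l[i] * p_tilde)                             # the two numbers should be close
```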

Appendix C Additional Experimental Results

Algorithm 1 is empirically studied here. In Figure 5, we plot both the regret and the estimation error of Algorithm 1. As a consequence of Lemma 3, information about all nodes accumulates linearly, and the regret stops increasing after a certain point. Note that this does not contradict Theorem 1 or Theorem 2: the figure shows the regret and the estimation error on a given instance, whereas the lower bound theorems assert the existence of some instance (not necessarily the given one) on which no algorithm can beat an Ω(T)\Omega(\sqrt{T}) lower bound. Since such an instance must satisfy Assumption 1, this shows that bandit problems with random walk feedback are not easier than their standard counterparts. The right subfigure in Figure 5 shows that the estimation error quickly drops to zero, which indicates that the flat regret curve is a consequence of learning, not of luck.

Figure 5: Experimental results for Algorithm 1. Left: the regret of Algorithm 1 versus the number of epochs. Right: the estimation error of the hitting time estimators versus the number of epochs. The solid lines in both figures are averaged over 10 runs.