Towards Fundamental Limits of Multi-armed Bandits with Random Walk Feedback

Tianyu Wang Lin F. Yang Zizhuo Wang

Abstract

In this paper, we consider a new Multi-Armed Bandit (MAB) problem where arms are nodes in an unknown and possibly changing graph, and the agent (i) initiates random walks over the graph by pulling arms, (ii) observes the random walk trajectories, and (iii) receives rewards equal to the lengths of the walks. We provide a comprehensive understanding of this problem by studying both the stochastic and the adversarial setting. We show that this problem is not easier than a standard MAB in an information theoretical sense, although additional information is available through random walk trajectories. Behaviors of bandit algorithms on this problem are also studied.

1 Introduction

Multi-Armed Bandit (MAB) problems simultaneously call for exploitation of good options and exploration of the decision space. Algorithms for this problem find applications in various domains, from medical trials (Robbins,, 1952) to online advertisement (Li et al.,, 2010). Many authors have studied bandit problems from different perspectives.

In this paper, we study a new bandit learning problem where the feedback is depicted by a random walk over the arms. That is, each time an arm/node $i$ is played, one observes a random walk over the arms/nodes from $i$ to an absorbing node, and the reward/loss is the length of this random walk. In this learning setting, we want to carefully select which nodes to initialize the random walks, so that the hitting time to the absorbing node is maximized/minimized. This learning protocol captures important problems in computational social networks. In particular, our learning protocol encapsulates an online learning version of the influence maximization problem (Kempe et al.,, 2003). See e.g., (Arora et al.,, 2017) for a survey on influence maximization. The goal of influence maximization is to find a node so that a diffusion starting from that node can propagate through the network and can influence as many nodes as possible. Our problem provides an online learning formulation of the influence maximization problem, where the diffusion over the graph is modeled as a random walk over the graph. As a concrete example, the word-of-mouth rating of a movie on social networks can be captured by the influence maximization model (Arora et al.,, 2017).

More formally, we consider the following model. The environment is modeled by a graph $G=\left(V,E\right)$ , where $V$ consists of transient nodes $[K]:=\{1,2,\dots,K\}$ and an absorbing node $*$ . Each edge $ij$ ( $i\in[K],j\in[K]\cup\{*\}$ ) can encode two quantities, a transition probability from $i$ to $j$ and the distance from $i$ to $j$ . For $t=1,2,\dots,T$ , we pick a node to start a random walk, and observe the random walk trajectory from the selected node to the absorbing node. For each random walk, we use its hitting time (to the absorbing node $*$ ) to model how long-lasting it is. With this formulation, we can define a bandit learning problem for the question. Each time, the agent picks a node in $G$ to start a random walk, observes the trajectory, and receives the hitting time of the random walk as reward. In this setting, the performance of learning algorithms is typically measured by regret, which is the difference between the rewards of an optimal node and the rewards of the nodes played by the algorithm. Unlike standard multi-armed bandit problems, the feedback is random walk trajectories and thus reveals information not only about the node played, but the environment (transitions/distances among nodes) as well.

Interestingly, the extra information from the random walk does not trivialize the learning in a mini-max sense. Intuitively, the sample/event space describes how much information the feedback can carry. If we execute a policy $\pi$ for $T$ epochs on a problem instance, the sample space is then $\left(\cup_{h=1}^{\infty}\mathcal{B}^{h}\right)^{T}$ , where $\mathcal{B}$ is the space of all outcomes that a single step on a trajectory can generate. For example, if all edge length can be sampled from $[0,1]$ , then $\mathcal{B}=[0,1]\times[K]$ , since a single step on a trajectory might be any node, and the corresponding edge length may be any number from $[0,1]$ . In this case the sample space of a simple epoch is $\cup_{h=1}^{\infty}\mathcal{B}^{h}$ , where the union up to infinity captures the fact that the trajectory can be arbitrarily long. Indeed, this set $\left(\cup_{h=1}^{\infty}\left([0,1]\times[K]\right)^{h}\right)^{T}$ contains much richer information than that of standard MAB problems, which is $[0,1]^{T}$ for $T$ rounds of pulls. This richer sample space means that: each feedback carries much more information and thus our problem can be strictly easier than a standard MAB. However, the information theoretical lower bounds for our problem are of order $\widetilde{\Omega}\left(\sqrt{T}\right)$ . In other words, we prove that even though each trajectory carries much more information than a reward sample (and has a chance of revealing all information about the environment when the trajectory is long), no algorithm can beat the bound $\widetilde{\Omega}\left(\sqrt{T}\right)$ in a mini-max sense.

In summary, the contributions of our paper is as follows.

1.

We propose a new online learning problem that is compatible with important computational social network problem including influence maximization, as already discussed in the introduction.
2.

We prove lower bounds for this problem in a random walk trajectories sample space. Our results show that, although random walk trajectories carry much more information than reward samples, the additional information does not simplify the problem. More specifically, no algorithm can beat an $\widetilde{\Omega}\left(\sqrt{T}\right)$ lower bound in a mini-max sense. These information theoretical findings of the newly proposed problem are discussed in Section 3.
3.

We propose algorithms for the bandit problems with random walk feedback, and show that the performance of our algorithms improves over that of the standard MAB algorithms.

1.1 Related Works

Bandit problems date its history back to at least Thompson, (1933), and have been studied extensively in the literature. One of the the most popular approaches to the stochastic bandit problem is the Upper Confidence Bound (UCB) algorithms (Robbins,, 1952; Lai and Robbins,, 1985; Auer,, 2002). Various extensions of UCB algorithms have been studied (Srinivas et al.,, 2010; Abbasi-Yadkori et al.,, 2011; Agrawal and Goyal,, 2012; Bubeck and Slivkins,, 2012; Seldin and Slivkins,, 2014). Specifically, some works use KL-divergence to construct the confidence bound (Lai and Robbins,, 1985; Garivier and Cappé,, 2011; Maillard et al.,, 2011), or include variance estimates within the confidence bound (Audibert et al.,, 2009; Auer and Ortner,, 2010). UCB is also used in the contextual learning setting (e.g., Li et al.,, 2010; Krause and Ong,, 2011; Slivkins,, 2014). The UCB algorithm and its variations are also used for other feedback settings, including the stochastic combinatorial bandit problem (Chen et al.,, 2013, 2016). Parallel to the stochastic setting, studies on the adversarial bandit problem form another line of literature. Since randomized weighted majorities (Littlestone and Warmuth,, 1994), exponential weights remains a top strategy for adversarial bandits (Auer et al.,, 1995; Cesa-Bianchi et al.,, 1997; Auer et al.,, 2002). Many efforts have been made to improve/extend exponential weights algorithms. For example, Kocák et al., (2014) target at implicit variance reduction. Mannor and Shamir, (2011); Alon et al., (2013) study a partially observable setting. Despite the large body of literature, no previous work has, to the best of our knowledge, explicitly focused on problems where the feedback is a random walk.

For both stochastic bandits and adversarial bandits, lower bounds in different scenarios have been derived, since the $\mathcal{O}(\log T)$ asymptotic lower bounds for consistent policies (Lai and Robbins,, 1985). Worst case bound of order ${\mathcal{O}}(\sqrt{T})$ have also been derived (Auer et al.,, 1995) for the stochastic setting. In addition to the classic stochastic setting, lower bounds in other stochastic (or stochastic-like) settings have also been considered, including PAC-learning complexity (Mannor and Tsitsiklis,, 2004), best arm identification complexity (Kaufmann et al.,, 2016; Chen et al.,, 2017), and lower bounds in continuous spaces (Kleinberg et al.,, 2008). Lower bound problems for adversarial bandits may be converted to lower bound problems for stochastic bandits (Auer et al.,, 1995) in many cases. Yet the above mentioned works do not cover the lower bounds for our settings.

The Stochastic Shortest Path (with adversarial edge length) (e.g., Bertsekas and Tsitsiklis,, 1991; Neu et al.,, 2012; Rosenberg and Mansour,, 2020) and the online MDP problems (e.g., Even-Dar et al.,, 2009; Gergely Neu et al.,, 2010; Dick et al.,, 2014; Jin et al.,, 2019) are related to our problem. However, these settings are fundamentally different because of the sample space generated by the possible trajectories. In all previously studied settings, a control is available at each step, and a regret is immediately incurred. In this regard, the infinite-length free trajectory scenario is impossible in all previous studies. In other words, every trajectory in previous works is effectively of length one, since a control is imposed, and a regret is incurred every time the state changes.

2 Problem Setting

The learning process repeats for $T$ epochs and the learning environment is described by graphs $G_{1},G_{2},\dots,G_{T}$ for epochs $t=1,2,\dots,T$ . The graph $G_{t}$ is defined on $K$ transient nodes $[K]=\{1,2,\dots,K\}$ and one absorbing node denoted by $*$ . We will use $V=[K]$ to denote the set of transient nodes, and use $\widetilde{V}:=[K]\cup\{*\}$ to denote the transient nodes together with the absorbing node. On this node set $\widetilde{V}$ , graph $G_{t}$ encodes transition probabilities and edge lengths: $G_{t}:=\left(\{m_{ij}\}_{i\in V,j\in\widetilde{V}},\{l_{ij}^{(t)}\}_{i\in V,j\in\widetilde{V}}\right)$ , where $m_{ij}$ is the probability of transiting from $i$ to $j$ and $l_{ij}^{(t)}\in[0,1]$ is the length from $i$ to $j$ (at epoch $t$ ). We gather the transition probabilities among transient nodes to form a transition matrix $M=[m_{ij}]_{i,j\in[K]}$ . We make the following assumption about $M$ .

Assumption 1.

The transition matrix $M=[m_{ij}]_{i,j\in[K]}$ among transient nodes is primitive. ¹¹1A matrix $M$ is primitive if there exists a positive integer $k$ , such that all entries in $M^{k}$ is positive. In addition, there is an absolute constant $\rho$ , such that $\|M\|_{\infty}\leq\rho<1$ , where $\|M\|_{\infty}=\max_{i\in[K]}\sum_{j\in[K]}\rvert m_{ij}\rvert$ is the maximum absolute row sum.

In Assumption 1, the primitivity assumption ensures that we can get to any transient node $v$ from any other node state $u$ . The infinite norm of $M$ being strictly less than $1$ means that the random walk will transit to the absorbing node starting from any node (eventually with probability 1). This describes the absorptiveness of the environment. This infinite norm assumption can be replaced by other notions of matrix norms. Unless otherwise stated, we assume that $\rho$ is an absolute constant independent of $K$ and $T$ .

Playing node $j$ at epoch $t$ generates a random walk trajectory $\mathcal{P}_{t,j}:=\big{(}X_{t,0}^{(j)},$ $L_{t,1}^{(j)},$ $X_{t,1}^{(j)},$ $L_{t,2}^{(j)},$ $X_{t,2}^{(j)},$ $\dots,$ $L_{t,H_{t,j}}^{(j)},$ $X_{t,H_{t,j}}^{(j)}\big{)}$ , where $X_{t,0}^{(j)}=j$ is the starting nodes, $X_{t,H_{t,j}}^{(j)}=*$ is the absorbing node, $X_{t,i}^{(j)}$ is the $i$ -th node in the random walk trajectory, $L_{t,i}^{(j)}$ is the edge length from $X_{t,i-1}^{(j)}$ to $X_{t,i}^{(j)}$ , and $H_{t,j}$ is the number of edges in trajectory $\mathcal{P}_{t,j}$ . For simplicity, we write $X_{t,i}^{(j)}$ (resp. $L_{t,i}^{(j)}$ ) as $X_{t,i}$ (resp. $L_{t,i}$ ) when it is clear from context.

For the random trajectory $\mathcal{P}_{t,j}:=\big{(}X_{t,0},$ $L_{t,1},X_{t,1},$ $L_{t,2},X_{t,2},\dots,L_{t,H_{t,j}},X_{t,H_{t,j}}\big{)}$ , the length of the trajectory (or hitting time of node $j$ at epoch $t$ ) is defined as $\mathcal{L}\left(\mathcal{P}_{t,j}\right):=\sum_{i=1}^{H_{t,j}}L_{t,i}.$ Here we use the edge length to represent the reward of the trajectory. In practice, the edge lengths may have real-world meanings. For example, the out-going edge from a node may represent utility (e.g., profit) of visiting this node. At epoch $t$ , the agent selects a node $J_{t}\in[K]$ to initiate a random walk, and observe trajectory $\mathcal{P}_{t,J_{t}}$ . In stochastic environments, the environment does not change across epochs. Thus for any fixed node $v\in[K]$ , the random trajectories $\mathcal{P}_{1,v},\mathcal{P}_{2,v},\mathcal{P}_{3,v},\dots$ are independently identically distributed.

3 Information Theoretical Properties

Consider the case where the graphs $G_{t}$ do not change across epochs. To solve this problem, one can estimate the expected hitting times $\mu_{j}:=\mathbb{E}\left[\mathcal{L}(\mathcal{P}_{t,j})\right]$ for all nodes $j\in[K]$ (and maintain a confidence interval of the estimations). As one can expect, the random walk trajectory reveals more information than a sample of reward. Naturally, this allows us to reduce this problem to a standard (stochastic) MAB problem.

3.1 Reduction to Standard MAB

Recall $\mathcal{P}_{t,J_{t}}=\left(X_{t,0},L_{t,1},X_{t,1},\dots,L_{t,H_{t,J_{t}}},X_{t,H_{t,J_{t}}}\right)$ is the trajectory at epoch $t$ . For a node $v$ and the trajectories $\mathcal{P}_{1,J_{1}},\mathcal{P}_{2,J_{2}},\mathcal{P}_{3,J_{3}},\dots$ , let $k_{v,i}$ be the index (epoch) of the $i$ -th trajectory that covers node $v$ . Let $Y_{v,k_{v,i}}$ be the sum of edge lengths between the first occurrence of $v$ and the absorbing node $*$ in trajectory $k_{v,i}$ . One has the following proposition due to Markov property.

Proposition 1.

In the stochastic setting, for any nodes $v\in[K]$ , we have, for $\forall t,i\in\mathbb{N}_{+},\forall r\in\mathbb{R}$

\displaystyle\mathbb{P}\left(Y_{v,k_{v,i}}=r\right)=\mathbb{P}\left(\mathcal{L}\left(\mathcal{P}_{t,v}\right)=r\right).

(1)

Proof.

In a trajectory $\mathcal{P}_{t,J_{t}}=\left(X_{t,0},L_{t,1},X_{t,1},\dots,L_{t,H_{t,J_{t}}},X_{t,H_{t,J_{t}}}\right)$ , conditioning on $X_{t,i}=j$ being known (and no future information is revealed), the randomness generated by $L_{t,i+1},X_{t,i+1},L_{t,i+2},X_{t,i+2},\dots$ is identical to the randomness generated by $L_{t,1},X_{t,1},L_{t,2},X_{t,2},\dots$ conditioning on $X_{t,0}=j$ being fixed. Note that even if each trajectory can visit a node multiple times, only one hitting time sample can be used. This is because extracting multiple sample would break Markovianity, by revealing that the random walk will visit a same node again. ∎

For a node $v\in[K]$ , we define $N_{t}(v):=1\vee\sum_{s<t}\mathbb{I}_{[J_{s}=v]}$ , and ${N}_{t}^{+}(v):=1\vee\sum_{s<t}\mathbb{I}_{\left[v\in\mathcal{P}_{s,J_{s}}\right]}.$ where $a\vee b=\max\{a,b\}$ . In words, ${N}_{t}(v)$ is the number of times node $v$ is played, and ${N}_{t}^{+}(v)$ is the number of times node $v$ is covered by a trajectory.

By Proposition 1, the information about the node rewards (hitting time to absorbing node) accumulates faster than standard MAB problems. Formally, it holds that $N_{t}^{+}(v)\geq N_{t}(v)$ . Thus solving this problem is not hard: one can extract the hitting time estimates and apply a standard algorithm (e.g., UCB) based on the estimates. However, some information is lost when we extract hitting time samples, since trajectories also carry additional information (e.g, about graph transition and graph structure) but we only extract hitting time samples. Thus the intriguing question to ask is:

•

Do we give up too much information by only extracting hitting time samples from trajectories? (Q)

We show that, although more information in addition to reward sample are available from the random walk trajectories, no algorithm can beat an $\Omega\left(\sqrt{T}\right)$ lower bound in a worst case sense. This provides an answer to (Q).

Figure 1: Problem instances constructed for Theorem 1. The edge labels denote edge transition probabilities in

\mathfrak{J}

\mathfrak{J}^{\prime}

. The top dark node denotes the absorbing node

*

. Note that node 1 is connected to node 2 with a constant probability.

Theorem 1.

For any given $T$ and any policy $\pi$ , there exists a problem instance $\mathfrak{J}$ satisfying Assumption 1 such that: (1) The probability of visiting any node from different node is larger than an absolute constant; (2) For any $\epsilon\in\left(0,\frac{1}{4}\right)$ , the $T$ step regret of $\pi$ on instance $\mathfrak{J}$ is lower bounded by $\left(\frac{32}{15}\epsilon+O\left(\epsilon^{2}\right)\right)T\exp\left(-\frac{112}{9}T\left(\epsilon^{2}+O\left(\epsilon^{3}\right)\right)\right).$ In particular, setting $\epsilon=\frac{1}{4}T^{-1/2}$ gives that there exists a problem instance such that the regret of any algorithm on this instance is lower bounded by $\Omega\left(\sqrt{T}\right)$ .

To prove Theorem 1, we construct problem instances such that the optimal nodes are almost indistinguishable. In the construction, we ensure that the nodes visits each other with a constant probability. If instead, the nodes visit each other with an arbitrarily small probability, the nodes are basically disconnected and the problem is too similar to a standard MAB problem. We use the instances illustrated in Fig. 1 for the proof. As shown in Fig. 1, node 1 and node 2 are connected with probability $\frac{1}{8}$ . This construction ensures that the two nodes visits each other with a constant probability. This prevents our construction from collapsing to a standard MAB problem, where the non-absorbing nodes are disconnected.

Proof of Theorem 1..

We construct two “symmetric” problem instances $\mathfrak{J}$ and $\mathfrak{J}^{\prime}$ both on two transient nodes $\{1,2\}$ and one absorbing node $*$ . All edges in both instances are of length 1. We use ${M}=[m_{ij}]$ (resp. ${M}^{\prime}=[m_{ij}^{\prime}]$ ) to denote the transition probabilities among transient nodes in instance $\mathfrak{J}$ (resp. $\mathfrak{J}^{\prime}$ ). We construct instances $\mathfrak{J}$ and $\mathfrak{J}^{\prime}$ so that $M=\begin{bmatrix}\frac{1}{2}&\frac{1}{8}\\ \frac{1}{8}&\frac{1}{2}+\epsilon\end{bmatrix}$ and $M^{\prime}=\begin{bmatrix}\frac{1}{2}+\epsilon&\frac{1}{8}\\ \frac{1}{8}&\frac{1}{2}\end{bmatrix}$ , as shown in Figure 1. We use $Z_{v}$ (resp. $Z_{v}^{\prime}$ ) to denote the random variable $\mathcal{L}\left(\mathcal{P}_{t,v}\right)$ in problem instance $\mathfrak{J}$ (resp. $\mathfrak{J}^{\prime}$ ). Note that the (expected) hitting time of node 1 in $\mathfrak{J}$ can be expressed as $\mathbb{E}[Z_{1}]=m_{11}(\mathbb{E}[Z_{1}]+1)+m_{21}(\mathbb{E}[Z_{2}]+1)+(1-m_{11}-m_{21})=m_{11}\mathbb{E}[Z_{1}]+m_{21}\mathbb{E}[Z_{2}]+1$ . Thus in matrix form we have

	$\displaystyle\left[\mathbb{E}\left[Z_{1}\right],\mathbb{E}\left[Z_{2}\right]\right]=$	$\displaystyle\;\left[\mathbb{E}\left[Z_{1}\right],\mathbb{E}\left[Z_{2}\right]\right]M+\mathbf{1}^{\top},$
	$\displaystyle\left[\mathbb{E}\left[Z_{1}^{\prime}\right],\mathbb{E}\left[Z_{2}^{\prime}\right]\right]=$	$\displaystyle\;\left[\mathbb{E}\left[Z_{1}^{\prime}\right],\mathbb{E}\left[Z_{2}^{\prime}\right]\right]M^{\prime}+\mathbf{1}^{\top},$

where $\mathbf{1}$ is the all-one vector. Solving the above equations gives, for both instances $\mathfrak{J}$ and $\mathfrak{J}^{\prime}$ , the optimality gap $\Delta$ is

\displaystyle\Delta:=\left\rvert\mathbb{E}\left[Z_{1}\right]-\mathbb{E}\left[Z_{2}\right]\right\rvert=\frac{64\epsilon}{15}+O\left(\epsilon^{2}\right).

(2)

Let $\pi$ be any fixed algorithm and let $T$ be any fixed time horizon, we use $\mathbb{P}_{\mathfrak{J},\pi}$ (resp. $\mathbb{P}_{\mathfrak{J}^{\prime},\pi}$ ) to denote the probability measure of running $\pi$ on instance $\mathfrak{J}$ (resp. $\mathfrak{J}^{\prime}$ ) for $T$ epochs.

Since the event $\{J_{t}=1\}$ ( $t\leq T$ ) is measurable by both $\mathbb{P}_{\mathfrak{J},\pi}$ and $\mathbb{P}_{\mathfrak{J}^{\prime},\pi}$ , we have

$\displaystyle\mathbb{P}_{\mathfrak{J},\pi}(J_{t}\neq 1)+\mathbb{P}_{\mathfrak{J}^{\prime},\pi}(J_{t}=1)=$	$\displaystyle\;1-\mathbb{P}_{\mathfrak{J},\pi}(J_{t}=1)+\mathbb{P}_{\mathfrak{J}^{\prime},\pi}(J_{t}=1)$
$\displaystyle\overset{(i)}{\geq}$	$\displaystyle\;1-\\|\mathbb{P}_{\mathfrak{J},\pi}-\mathbb{P}_{\mathfrak{J}^{\prime},\pi}\\|_{TV}$
$\displaystyle\overset{(ii)}{\geq}$	$\displaystyle\;1-\sqrt{1-\exp\left(D_{kl}(\mathbb{P}_{\mathfrak{J},\pi}\\|\mathbb{P}_{\mathfrak{J}^{\prime},\pi})\right)}$
$\displaystyle\overset{(iii)}{\geq}$	$\displaystyle\;\frac{1}{2}\exp\left(-D_{kl}(\mathbb{P}_{\mathfrak{J},\pi}\\|\mathbb{P}_{\mathfrak{J}^{\prime},\pi})\right),$	(3)

where $(i)$ uses the definition of total variation, $(ii)$ uses the Bretagnolle-Huber inequality, and $(iii)$ uses that $1-\sqrt{1-x}\geq\frac{1}{2}x$ for all $x\geq 0$ .

Let $\mathbb{Q}_{i}$ (resp. $\mathbb{Q}_{i}^{\prime}$ ) be the probability measure generated by playing node $i$ in instance $\mathfrak{J}$ (resp. $\mathfrak{J}^{\prime}$ ). We can then decompose $\mathbb{P}_{\mathfrak{J},\pi}$ by

	$\displaystyle\mathbb{P}_{\mathfrak{J},\pi}$	$\displaystyle=\mathbb{Q}_{J_{1}}\mathbb{P}\left(J_{1}\rvert\pi\right)\mathbb{Q}_{J_{2}}\mathbb{P}(J_{2}\rvert\pi,J_{1})\cdots\mathbb{Q}_{J_{T}}\mathbb{P}(J_{T}\rvert\pi,J_{1},J_{2},\dots,J_{t-1}),$
	$\displaystyle\mathbb{P}_{\mathfrak{J}^{\prime},\pi}$	$\displaystyle=\mathbb{Q}_{J_{1}}^{\prime}\mathbb{P}\left(J_{1}\rvert\pi\right)\mathbb{Q}_{J_{2}}^{\prime}\mathbb{P}(J_{2}\rvert\pi,J_{1})\cdots\mathbb{Q}_{J_{T}}^{\prime}\mathbb{P}(J_{T}\rvert\pi,J_{1},J_{2},\dots,J_{t-1}).$

By chain rule for KL-divergence, we have

	$\displaystyle D_{kl}(\mathbb{P}_{\mathfrak{J},\pi}\\|\mathbb{P}_{\mathfrak{J}^{\prime},\pi})$
$\displaystyle=$	$\displaystyle\sum_{J_{1}\in\{1,2\}}\mathbb{P}(J_{1}\rvert\pi)D_{kl}(\mathbb{Q}_{J_{1}}\\|\mathbb{Q}_{J_{1}}^{\prime})$
	$\displaystyle+\sum_{t=2}^{T}\sum_{J_{t}\in\{1,2\}}\mathbb{P}(J_{t}\rvert\pi,J_{1},\cdots,J_{t-1})D_{kl}(\mathbb{Q}_{J_{t}}\\|\mathbb{Q}_{J_{t}}^{\prime}).$	(4)

Since the policy must pick one of node 1 and node 2, from (4) we have

\displaystyle D_{kl}(\mathbb{P}_{\mathfrak{J},\pi}\|\mathbb{P}_{\mathfrak{J}^{\prime},\pi})\leq\sum_{t=1}^{T}\sum_{i=1}^{2}D_{kl}(\mathbb{Q}_{i}\|\mathbb{Q}_{i}^{\prime}),

(5)

which allows us to remove dependence on policy $\pi$ .

Next we study the distributions $\mathbb{Q}_{i}$ . With edge lengths fixed, the sample space of this distribution is $\cup_{h=1}^{\infty}\{1,2\}^{h}$ , since length of the trajectory can be arbitrarily long, and each node on the trajectory can be either of $\{1,2\}$ . To describe the distribution $\mathbb{Q}_{i}$ and $\mathbb{Q}_{i}^{\prime}$ , we use random variables $X_{0},X_{1},X_{2},X_{3},\dots\in\{1,2,*\}$ (with $X_{0}=i$ ), where $X_{k}$ is the $k$ -th node in the trajectory generated by playing $i$ .

By Markov property we have, for $i,j\in\{1,2\}$ ,

	$\displaystyle\mathbb{Q}_{i}\left(X_{k+1},X_{k+2},\dots\rvert X_{k}=j\right)$	$\displaystyle=\mathbb{Q}_{j}\quad\text{and}$		(6)
	$\displaystyle\mathbb{Q}_{i}^{\prime}\left(X_{k+1},X_{k+2},\dots\rvert X_{k}=j\right)$	$\displaystyle=\mathbb{Q}_{j}^{\prime},\quad\forall k\in\mathbb{N}_{+}.$

Since we can decompose $\mathbb{Q}_{i}$ by $\mathbb{Q}_{i}=\mathbb{Q}_{i}(X_{1})\mathbb{Q}_{i}(X_{2},X_{3},\dots,\rvert X_{1})$ . Thus by chain rule of KL-divergence, we have

	$\displaystyle\;D_{kl}(\mathbb{Q}_{1}\\|\mathbb{Q}_{1}^{\prime})$
$\displaystyle=$	$\displaystyle\;D_{kl}(\mathbb{Q}_{1}(X_{1})\\|\mathbb{Q}_{1}^{\prime}(X_{1}))$
	$\displaystyle+\mathbb{Q}_{1}(X_{1}=1)D_{kl}(\mathbb{Q}_{1}(X_{2},\cdots\rvert X_{1}=1)\\|\mathbb{Q}_{1}^{\prime}(X_{2},\cdots\rvert X_{1}=1))$
	$\displaystyle+\mathbb{Q}_{1}(X_{1}=2)D_{kl}(\mathbb{Q}_{1}(X_{2},\cdots\rvert X_{1}=2)\\|\mathbb{Q}_{1}^{\prime}(X_{2},\cdots\rvert X_{1}=2))$
$\displaystyle=$	$\displaystyle\;D_{kl}(\mathbb{Q}_{1}(X_{1})\\|\mathbb{Q}_{1}^{\prime}(X_{1}))$
	$\displaystyle+\mathbb{Q}_{1}(X_{1}=1)D_{kl}(\mathbb{Q}_{1}\\|\mathbb{Q}_{1}^{\prime})+\mathbb{Q}_{1}(X_{1}=2)D_{kl}(\mathbb{Q}_{2}\\|\mathbb{Q}_{2}^{\prime}),$	(7)

where the last line uses (6). A similar argument gives

	$\displaystyle\;D_{kl}(\mathbb{Q}_{2}\\|\mathbb{Q}_{2}^{\prime})$
$\displaystyle=$	$\displaystyle\;D_{kl}(\mathbb{Q}_{2}(X_{1})\\|\mathbb{Q}_{2}^{\prime}(X_{1}))$
	$\displaystyle+\mathbb{Q}_{2}(X_{1}=1)D_{kl}(\mathbb{Q}_{1}\\|\mathbb{Q}_{1}^{\prime})+\mathbb{Q}_{2}(X_{1}=2)D_{kl}(\mathbb{Q}_{2}\\|\mathbb{Q}_{2}^{\prime}),$	(8)

Since $D_{kl}(\mathbb{Q}_{i}(X_{1})\|\mathbb{Q}_{i}^{\prime}(X_{1}))=\frac{7}{3}\epsilon^{2}+O\left(\epsilon^{3}\right)$ for $i\in\{1,2\}$ , the above (Eq. 7 and Eq. 8) gives

\displaystyle D_{kl}(\mathbb{Q}_{i}\|\mathbb{Q}_{i}^{\prime})

\displaystyle=\frac{56\epsilon^{2}}{9}+O\left(\epsilon^{3}\right).

(9)

Combining (5) and (9) gives

\displaystyle D_{kl}\left(\mathbb{P}_{\mathfrak{J},\pi}\|\mathbb{P}_{\mathfrak{J}^{\prime},\pi}\right)\leq\frac{112}{9}T\left(\epsilon^{2}+O\left(\epsilon^{3}\right)\right).

(10)

Let $\mathrm{Reg}(T)$ (resp. $\mathrm{Reg}^{\prime}(T)$ ) be the $T$ epoch regret in instance $\mathfrak{J}$ (resp. $\mathfrak{J}^{\prime}$ ). Recall, by our construction, node 1 is suboptimal in instance $\mathfrak{J}$ and node 1 is optimal in instance $\mathfrak{J}^{\prime}$ . Since the optimality gaps in $\mathfrak{J}$ and $\mathfrak{J}^{\prime}$ are the same (Eq. 2), we have,

	$\displaystyle\;\mathrm{Reg}(T)+\mathrm{Reg}^{\prime}(T)$
$\displaystyle\geq$	$\displaystyle\;\Delta\sum_{t=1}^{T}\Big{(}\mathbb{P}_{\mathfrak{J},\pi}\left(J_{t}\neq 1\right)+\mathbb{P}_{\mathfrak{J}^{\prime},\pi}\left(J_{t}=1\right)\Big{)}$
$\displaystyle\geq$	$\displaystyle\;\frac{1}{2}\Delta T\exp\left(-D_{kl}(\mathbb{P}_{\mathfrak{J},\pi}\\|\mathbb{P}_{\mathfrak{J}^{\prime},\pi})\right)$	(by Eq. 3)
$\displaystyle\geq$	$\displaystyle\;\frac{1}{2}\Delta T\exp\left(-\frac{112}{9}T\left(\epsilon^{2}+O\left(\epsilon^{3}\right)\right)\right)$	(by Eq. 10)
$\displaystyle\geq$	$\displaystyle\;\left(\frac{32}{15}\epsilon+O\left(\epsilon^{2}\right)\right)T\exp\left(-\frac{112}{9}T\left(\epsilon^{2}+O\left(\epsilon^{3}\right)\right)\right).$	(by Eq. 2)

∎

In Theorem 1, the two-nodes case has been covered. A similar result for the $K$ -node case is in Theorem 2.

Theorem 2.

Let Assumption 1 be true. Given any number of epochs $T$ and policy $\pi$ , there exists $K$ problem instances $\mathfrak{J}_{1},\cdots,\mathfrak{J}_{K}$ , such that (1) all problems instances have $K$ nodes, (2) all transient node are connected with the same probability and the probability of hitting the absorbing node from any transient node is a constant independent of $T$ and $K$ , and (3) for any policy $\pi$ ,

\displaystyle\max_{k\in[K]}\mathbb{E}_{\mathfrak{J}_{k},\pi}\left[\mathrm{Reg}(T)\right]\geq\frac{1}{8\sqrt{2}}\sqrt{KT},

where $\mathbb{E}_{\mathfrak{J}_{k},\pi}$ is the expectation with respect to the distribution generated by instance $\mathfrak{J}_{k}$ and policy $\pi$ .

The proof of Theorem 2 follows a recipe similar to that of Theorem 1, and the details are deferred to the Appendix.

4 Algorithms for Multi-Armed Bandits with Random Walk Feedback

Perhaps the two most well-known algorithms for bandit problems are the UCB algorithm and the EXP3 algorithm. In this section, we study the behavior of these two algorithms on bandit problems with random walk feedback.

4.1 UCB Algorithm for the Stochastic Setting

As discussed previously, the problem with random walk feedback can be reduced to standard MAB problems, and the UCB algorithm solves this problem as one would expect. We present here the UCB algorithm and provide regret analysis for it. Recall the regret is defined as

\displaystyle\mathrm{Reg}(T)=\max_{i\in[K]}\sum_{t=1}^{T}\mu_{i}-\sum_{t=1}^{T}\mu_{{}_{J_{t}}},

(11)

where $J_{t}$ is the node played by the algorithm (at epoch $t$ ), and $\mu_{{}_{j}}=\mathbb{E}\left[\mathcal{L}\left(\mathcal{P}_{t,j}\right)\right]$ is the expected hitting time of node $j$ .

For a transient node $v\in[K]$ and $n$ trajectories (at epochs $k_{v,1},k_{v,2},\dots,k_{v,n}$ ) that cover node $v$ , the hitting time estimator of $v$ is computed as $\widetilde{Z}_{v,n}:=\frac{1}{n}\sum_{i=1}^{n}Y_{v,k_{v,i}}.$ Since $v\in\mathcal{P}_{t,v}$ , $Y_{v,k_{v,i}}$ is an identical copy of the hitting time $\mathcal{L}\left(\mathcal{P}_{t,v}\right)$ (Proposition 1).

We also need confidence intervals for our estimators. Given $N_{t}^{+}(v)$ trajectories covering $v$ , the confidence terms (at epoch $t$ ) are $\widetilde{C}_{N_{t}^{+}(v),t}:=\sqrt{\frac{8\xi_{t}\log t}{{N_{t}^{+}(v)}}},$ where $\xi_{t}=\max\left\{1+\frac{\rho}{(1-\rho)^{2}},\frac{\log(1-\rho)}{\log\rho}+\frac{5\log t}{\log 1/\rho}\right\}$ . Here $\xi_{t}$ serves as a truncation parameter since the reward distribution is not sub-Gaussian. Alternatively, one can use robust estimators for bandits with heavy-tail reward for this task Bubeck et al., (2013).

At each time $t$ , we play a node $J_{t}$ so that

\displaystyle J_{t}\in\arg\max_{v}\widetilde{Z}_{v,N_{t}^{+}(v)}+\widetilde{C}_{N_{t}^{+}(v),t}.

This strategy is described in Algorithm 1.

Algorithm 1

1:Input: A set of nodes

[K]

. Parameters: a constant

\rho

that bounds the spectral radius of

M

2:Warm up: Play each node once to initialize. Observe trajectories.

3:For any

v\in[K]

, define the decision index

I_{v,N_{t}^{+}(v),t}=\widetilde{Z}_{v,N_{t}^{+}(v)}+\widetilde{C}_{N_{t}^{+}(v),t}.

4:for

t=1,2,3,\dots

5: Select

J_{t}

, such that

J_{t}\in\arg\max_{v\in V}I_{v,N_{t}^{+}(v),t},

with ties broken arbitrarily.

6: Observe the trajectory

\mathcal{P}_{t,v_{t}}:=\big{\{}X_{t,0},L_{t,1},X_{t,1},L_{t,2},X_{t,2},\cdots,L_{t,H_{t,v_{t}}}X_{t,H_{t,J_{t}}}\big{\}}

. Update

{N}_{t}^{+}(\cdot)

and UCBs for all

v\in[K]

7:end for

Similar to the UCB algorithm for standard MAB problems, Algorithm 1 obtains $\widetilde{\mathcal{O}}(\sqrt{T})$ mini-max regret (gap-independent regret) and $\widetilde{\mathcal{O}}(1)$ asymptotic regret (gap-dependent regret). Such results can be derived from the observations discussed in Section 3.1.

4.2 EXP3 Algorithm for Adversarially Chosen Edge Lengths

In this section, we consider the case in which the network structure $G_{t}$ changes over time, and study a version of this problem in which the adversary alters edge length across epochs: In each epoch, the adversary can arbitrarily pick edge lengths $l_{ij}^{(t)}$ from $[0,1]$ . In this case, the performance is measured by the regret against playing any fixed node $i\in[K]$ : $\mathrm{Reg}^{\mathrm{adv}}_{i}(T)=\sum_{t=1}^{T}l_{t,i}-\sum_{t=1}^{T}l_{t,J_{t}},$ where $J_{t}$ is the node played in epoch $t$ , and $l_{t,j}=\mathbb{E}\left[\mathcal{L}\left(\mathcal{P}_{t,i}\right)\right]$ . Since $\mathcal{L}\left(\mathcal{P}_{t,j}\right)$ concentrates around $l_{t,j}$ , a high probability bound on $\mathrm{Reg}^{\mathrm{adv}}_{i}(T)$ naturally provides a high probability bound on $\sum_{t=1}^{T}\mathcal{L}\left(\mathcal{P}_{t,i}\right)-\sum_{t=1}^{T}\mathcal{L}\left(\mathcal{P}_{t,J_{t}}\right)$ .

We define a notion of centrality that will aid our discussions.

Definition 1.

Let $X_{0},X_{1},X_{2},\dots,X_{\tau}=*$ be nodes on a random trajectory. Under Assumption 1, we define, for node $v\in[K]$ , $\alpha_{v}:=\min_{u\in[K],\;u\neq v}\mathbb{P}\left(v\in\{X_{1},X_{2},\dots,X_{\tau}\}\rvert X_{0}=u\right)$ to be the hitting centrality of node $v$ . We also define $\alpha=\min_{v}\alpha_{v}$

For a node with positive the hitting centrality, information about it is revealed even if it is not played. This quantity will show up in the regret bound in some cases, as we will discuss later.

We will use a version of the exponential weight algorithm to solve this adversarial problem. Also, a high probability guarantee is provided using a new concentration lemma (Lemma 1). As background, exponential weights algorithms maintain a probability distribution over the choices. This probability distribution gives higher weights to historically more rewarding nodes. In each epoch, a node is sampled from this probability distribution, and information is recorded down. To formally describe the strategy, we define some notations now. We first extract a sample of $\mathcal{L}\left(\mathcal{P}_{t,j}\right)$ from the trajectory $\mathcal{P}_{t,J_{t}}$ , where $J_{t}$ is the node played in epoch $t$ .

Given $\mathcal{P}_{t,J_{t}}=\left(X_{t,0},L_{t,1},X_{t,1},\dots,L_{t,H_{t,J_{t}}},X_{t,H_{t,J_{t}}}\right)$ , we define, for $v\in[K]$ ,

\displaystyle Y_{v}(\mathcal{P}_{t,J_{t}})=\max_{i:0\leq i<H_{t,J_{t}}}\;\mathbb{I}_{\left[X_{t,i}=v\right]}\cdot\mathcal{L}_{i}(\mathcal{P}_{t,J_{t}}),

(12)

where $\mathcal{L}_{i}(\mathcal{P}_{t,J_{t}}):=\sum_{k=i+1}^{H_{t,J_{t}}}L_{t,k}$ . In words, $\mathcal{L}_{i}(\mathcal{P}_{t,J_{t}})$ is the distance (sum of edge lengths) from the first occurrence of $v$ to the absorbing node.

By the principle of Proposition 1, if node $i$ is covered by trajectory $\mathcal{P}_{t,J_{t}}$ , $Y_{v}(\mathcal{P}_{t,J_{t}})$ is a sample of $\mathcal{L}\left(\mathcal{P}_{t,i}\right)$ . We define, for the trajectory $\mathcal{P}_{t,J_{t}}=\{X_{t,0},L_{t,1},X_{t,1},L_{t,2},\dots,L_{t,H_{t,J_{t}}},X_{t,H_{t,J_{t}}}\}$ , $Z_{t,v}:=Y_{v}(\mathcal{P}_{t,J_{t}}),\quad\forall v\in[K],$ where $Y_{v}(\mathcal{P}_{t,J_{t}})$ is defined above.

Define $\mathbb{I}_{t,ij}:=\mathbb{I}_{\left[i\in\mathcal{P}_{t,J_{t}}\text{ and }j\in\mathcal{P}_{t,J_{t}},\,Y_{i}(\mathcal{P}_{t,J_{t}})>Y_{j}(\mathcal{P}_{t,J_{t}})\right]}$ . This indicator random variable is $1$ iff $i$ and $j$ both show up in $\mathcal{P}_{t,J_{t}}$ and the first occurrence of $j$ is after the first occurence of $i$ . We then define $\widehat{q}_{t,ij}:=\frac{\sum_{s=1}^{t-1}\mathbb{I}_{s,ij}}{N_{t}^{+}(i)},$ which is an estimator of how likely $j$ is visited via a trajectory starting at $i$ . In other words, $\widehat{q}_{t,ij}$ is an estimator of $q_{ij}:=\mathbb{P}\left(j\in\mathcal{P}_{t,i}\right)$ , which is the probability of $j$ being visited by a trajectory from $i$ . Using $\widehat{q}_{t,ij}$ and sample ${Z}_{t,i}$ , we define an estimator for $\frac{l_{t,j}-B}{B}$ as

\displaystyle\widehat{Z}_{t,i}

\displaystyle:=\frac{\frac{{Z}_{t,i}-B}{B}\mathbb{I}_{\left[i\in\mathcal{P}_{t,J_{t}}\right]}+\beta}{p_{ti}+\sum_{j\neq i}\widehat{q}_{t,ji}p_{tj}},\quad\forall i\in[K],

(13)

where $B,\beta$ are algorithm parameters ( $\beta\leq\alpha$ and $B$ to be specified later). Here $\beta$ serves as a parameter for implicit exploration Neu, (2015). We use estimate for $\frac{l_{t,j}-B}{B}$ instead of an estimate for $l_{t,j}$ . This shift can guarantee that the estimator $\widehat{Z}_{t,i}$ is between $[-1,0]$ with high probability. Such shifting in estimator has been used as a common trick for variance reduction for the EXP3 algorithms (e.g., Lattimore and Szepesvári,, 2020). In addition, a small bias is introduced via $\beta$ . With the estimators $\widehat{Z}_{t,i}$ , we define $\widehat{S}_{t,j}=\sum_{s=1}^{t-1}\widehat{Z}_{s,i}.$ By convention, we set $\widehat{S}_{0,i}=0$ for all $i\in[K]$ . The probability of playing $i$ in epoch $t$ is defined as

\displaystyle p_{ti}:=\begin{cases}\frac{1}{K},\quad\text{ if }t=1,\\ \frac{\exp\left(\eta\widehat{S}_{t-1,i}\right)}{\sum_{j=1}^{K}\exp\left(\eta\widehat{S}_{t-1,j}\right)},\quad\text{ if }t\geq 2,\end{cases}

(14)

where $\eta$ is the learning rate.

Against any arm $j\in[K]$ , following the sampling rule (14) can guarantee an $\widetilde{\mathcal{O}}\left(\sqrt{T}\right)$ regret bound. We now summarize our strategy in Algorithm 2, and state the performance guarantee in Theorem 3.

Algorithm 2

1:Input: A set of nodes

[K]

, transition matrix

M

, total number of epochs

T

, probability parameter

\epsilon\in\left(0,\frac{1}{(1-\rho)KT}\right)

. Algorithm parameters:

B=\frac{\log\frac{(1-\rho)\epsilon}{KT}}{\log\rho}

\eta=\frac{1}{\sqrt{\kappa T}}

\beta=\frac{1}{\sqrt{\kappa T}}

2:for

t=1,2,3,\dots,T

3: Randomly play

J_{t}\in[K]

, such that

\mathbb{P}\left(J_{t}=i\right)=p_{ti},\forall i\in[K],

where

p_{ti}

is defined in (14).

4: Observe the trajectory

\mathcal{P}_{t,J_{t}}

. Update estimates

\widehat{Z}_{t,j}

according to (13).

5:end for

A high probability performance guarantee of Algorithm 2 is below in Theorem 3.

Theorem 3.

Let $\kappa:=1+\sum_{j}\frac{1-\sqrt{\alpha_{j}}}{1+\sqrt{\alpha_{j}}}$ , where $\alpha_{j}$ is the hitting centrality of $j$ . Fix any $i\in[K]$ . If the time horizon $T$ and algorithm parameters satisfies $\epsilon\leq\frac{1}{T}$ , $B=\frac{\log\frac{(1-\rho)\epsilon}{KT}}{\log\rho}$ , then with probability exceeding $1-\widetilde{\mathcal{O}}(\epsilon)$ ,

	$\displaystyle\mathrm{Reg}^{\mathrm{adv}}_{i}(T)\leq$	$\displaystyle\;B\frac{\log(T/\epsilon^{2})}{\beta}+B\kappa\beta T+\widetilde{\mathcal{O}}\left(B\sqrt{\beta\kappa T}\right)$
		$\displaystyle+\frac{B\log K}{\eta}+B\eta\kappa T+\widetilde{\mathcal{O}}\left(B\eta\sqrt{(1+\beta\kappa)T}\right).$

If we set $\eta=\beta=\Theta\left(\frac{1}{\sqrt{\kappa T}}\right)$ , Algorithm 2 achieves $\mathrm{Reg}^{\mathrm{adv}}_{i}(T)\leq\widetilde{\mathcal{O}}\left(\sqrt{\kappa T}\right)$ . In Figure 2, we provide a plot of $f(x)=\frac{1-\sqrt{x}}{1+\sqrt{x}}$ with $x\in[0,1]$ . This figure illustrates how the regret scales with $\kappa$ , and shows that a multiplicative factor is saved. An empirical comparison between Algorithm 2 and the standard EXP3 algorithm (without using the estimator in Eq. 12) is shown in Section 5.

Refer to caption — Figure 2: A plot of function $f(x)=\frac{1-\sqrt{x}}{1+\sqrt{x}}$ , $x\in[0,1]$ .

4.2.1 Analysis of Algorithm 2

In this section we present a proof for Theorem 3. In general, the proof follows the general recipe for the analysis of EXP3 algorithms, which is in turn a special case of the Follow-The-Regularized-Leader (FTRL) or the Mirror Descent framework. Below we include some non-standard intermediate steps due to the special feedback structure of the problem studied. More details are defered to the Appendix.

By the exponential weights argument (Littlestone and Warmuth,, 1994; Auer et al.,, 2002), it holds that, under event $\mathcal{E}_{T}(B):=\{Z_{t,j}\leq B\text{ for all }t=1,2,\cdots,T,\text{ and }j\in[K]\}$ ,

\displaystyle\sum_{t=1}^{T}\widehat{Z}_{t,i}-\sum_{t=1}^{T}\sum_{j}p_{tj}\widehat{Z}_{t,j}\leq

\displaystyle\frac{\log K}{\eta}+\eta\sum_{t=1}^{T}\sum_{j}\frac{p_{tj}}{\widetilde{p}_{tj}^{2}}\mathbb{I}_{\left[j\in\mathcal{P}_{t,J_{t}}\right]}.

(15)

To link the regret ( $\sum_{t=1}^{T}l_{t,i}-\sum_{t=1}^{T}l_{t,J_{t}}$ ) to $\sum_{t=1}^{T}\widehat{Z}_{t,i}-\sum_{t=1}^{T}\sum_{j}p_{tj}\widehat{Z}_{t,j}$ and to bound $\sum_{t=1}^{T}\sum_{j}\frac{p_{tj}}{\widetilde{p}_{tj}^{2}}\mathbb{I}_{\left[j\in\mathcal{P}_{t,J_{t}}\right]}$ , we use the results in Lemmas 1 and 2.

Lemma 1.

For any $\epsilon\in(0,1)$ and $T\in\mathbb{N}$ , such that $\epsilon\leq\frac{1}{T}$ and $T\geq 10$ , it holds that

\displaystyle\mathbb{P}\left(\sum_{t}\frac{l_{t,i}-B}{B}-\sum_{t}\widehat{Z}_{t,i}\leq\frac{\log\left(T/\epsilon^{2}\right)}{\beta}\right)\geq 1-\widetilde{\mathcal{O}}\left(\epsilon\right).

Lemma 1 can be viewed as a one-side regularized version of the Hoeffding’s inequality, with a regularization parameter $\beta$ . Its proof uses the Markov inequality and can be found in the Appendix.

Lemma 2.

With probability exceeding $1-\mathcal{O}(\epsilon)$ , we have

	$\displaystyle(i)$	$\displaystyle\quad\sum_{t}\sum_{j}\frac{p_{tj}}{\widehat{p}_{tj}^{2}}\mathbb{I}_{\left[j\in\mathcal{P}_{t,J_{t}}\right]}\leq\kappa T+\widetilde{\mathcal{O}}\left(\sqrt{T\log(1/\epsilon)}\right),$
	$\displaystyle(ii)$	$\displaystyle\quad\sum_{t}\sum_{j}p_{tj}\widehat{Z}_{t,j}-\sum_{t}\frac{l_{t,J_{t}}-B}{B}\leq\kappa\beta T+\mathcal{O}\left(\sqrt{\left(1+\beta\kappa\right)T\log(1/\epsilon)}\right).$

Proof of Lemma 2.

Firstly, it holds with high probability that $\widehat{p}_{tj}=\widetilde{p}_{tj}+\mathcal{O}\left(\frac{1}{t}\right)$ (Lemma 5 in the Appendix). Therefore $\frac{p_{tj}}{\widehat{p}_{tj}}=\frac{p_{tj}}{\widetilde{p}_{tj}+\mathcal{O}\left(\frac{1}{t}\right)}=\frac{p_{tj}}{\widetilde{p}_{tj}}+\mathcal{O}\left(\frac{1}{t}\right)$ .

Let $\mathbb{E}_{t}$ denote the expectation conditioning on all randomness right before the $t$ -th epoch. Since $p_{t,j}$ and $\widetilde{p}_{tj}$ are known before the $t$ -th epoch, we have $\mathbb{E}_{t}\left[\frac{p_{tj}}{\widehat{p}_{tj}^{2}}\mathbb{I}_{\left[j\in\mathcal{P}_{t,J_{t}}\right]}\right]=\frac{p_{tj}}{\widehat{p}_{tj}^{2}}\mathbb{E}_{t}\left[\mathbb{I}_{\left[j\in\mathcal{P}_{t,J_{t}}\right]}\right]=\frac{p_{tj}}{\left(\widetilde{p}_{tj}+\mathcal{O}\left(\frac{1}{t}\right)\right)^{2}}\widetilde{p}_{tj}=\frac{p_{tj}}{\widetilde{p}_{tj}}+\mathcal{O}\left(\frac{1}{t}\right)$ . Thus the random variables

\displaystyle\left\{\frac{p_{tj}}{\widehat{p}_{tj}^{2}}\mathbb{I}_{\left[j\in\mathcal{P}_{t,J_{t}}\right]}-\frac{p_{tj}}{\widetilde{p}_{tj}}+\mathcal{O}\left(\frac{1}{t}\right)\right\}_{t}

form a martingale difference sequence for any $j$ .

Applying the Azuma’s inequality to this martingale difference sequence gives, with porbability exceeding $1-\mathcal{O}(\epsilon)$ ,

	$\displaystyle\sum_{t}\sum_{j}\frac{p_{tj}}{\widehat{p}_{tj}^{2}}\mathbb{I}_{\left[j\in\mathcal{P}_{t,J_{t}}\right]}$	$\displaystyle\leq\sum_{t}\sum_{j}\frac{p_{tj}}{\widetilde{p}_{tj}}+\widetilde{\mathcal{O}}\left(\sqrt{T\log(1/\epsilon)}\right)+\sum_{t}\sum_{j}\mathcal{O}\left(\frac{1}{t}\right)$
		$\displaystyle\leq\sum_{t}\sum_{j}\frac{p_{tj}}{\widetilde{p}_{tj}}+\widetilde{\mathcal{O}}\left(\sqrt{T\log(1/\epsilon)}\right).$		(16)

Since $\frac{p_{tj}}{\widetilde{p}_{tj}}\leq p_{tj}+\frac{1-\sqrt{\alpha_{j}}}{1+\sqrt{\alpha_{j}}}$ (Proposition 2 in Appendix), we have

\displaystyle\sum_{t}\sum_{j}\frac{p_{tj}}{\widetilde{p}_{tj}}\leq\sum_{t}\sum_{j}\left(p_{tj}+\frac{1-\sqrt{\alpha_{j}}}{1+\sqrt{\alpha_{j}}}\right)\leq\kappa T.

(17)

Combining (16) and (17) finishes the proof for item $(i)$ .

For the second inequality in the lemma statement, we first note that $Z_{t,j}\leq B$ with high probability for all $t$ and $j$ . Thus we have

	$\displaystyle\sum_{j}p_{tj}\widehat{Z}_{t,j}=$	$\displaystyle\;\sum_{j}p_{tj}\frac{\frac{Z_{t,j}-B}{B}\mathbb{I}_{\left[j\in\mathcal{P}_{t,J_{t}}\right]}+\beta}{\widehat{p}_{tj}}$
	$\displaystyle\overset{①}{=}$	$\displaystyle\;\sum_{j}p_{tj}\frac{\frac{Z_{t,j}-B}{B}\mathbb{I}_{\left[j\in\mathcal{P}_{t,J_{t}}\right]}}{\widetilde{p}_{tj}}+\sum_{j}\frac{\beta p_{tj}}{\widetilde{p}_{tj}}+\widetilde{\mathcal{O}}\left(\frac{1}{t}\right)$
	$\displaystyle\leq$	$\displaystyle\;\beta\kappa+\widetilde{\mathcal{O}}\left(\frac{1}{t}\right),$

where ① uses a Taylor expansion (to replace $\widehat{p}_{tj}$ with $\widetilde{p}_{tj}$ ), and the last line uses $\frac{p_{tj}}{\widetilde{p}_{tj}}\leq p_{tj}+\frac{1-\sqrt{\alpha_{j}}}{1+\sqrt{\alpha_{j}}}$ (Proposition 2 in the Appendix) and that $Z_{t,j}$ is smaller than $B$ with high probability. Also, $l_{t,j}$ is smaller than $B$ for all $t,j$ . Since $\mathbb{E}_{t}\left[\sum_{j}p_{tj}\frac{\frac{Z_{t,j}-B}{B}\mathbb{I}_{\left[j\in\mathcal{P}_{t,J_{t}}\right]}}{\widetilde{p}_{tj}}\right]=\mathbb{E}_{t}\left[\frac{l_{t,J_{t}}-B}{B}\right]$ , we know $\left\{\left(\sum_{j}p_{tj}\widehat{Z}_{t,j}-\frac{l_{t,J_{t}}-B}{B}+\beta\kappa+\widetilde{\mathcal{O}}\left(\frac{1}{t}\right)\right)\right\}_{t}$ is a super-martingale difference sequence. We can now apply the Azuma’s inequality to this super-martingale sequence and get

\displaystyle\sum_{t}\sum_{j}p_{tj}\widehat{Z}_{t,j}-\sum_{t}\frac{l_{t,J_{t}}-B}{B}\leq\kappa\beta T+\widetilde{\mathcal{O}}\left(\sqrt{\beta\kappa T}\right),

(18)

with high probability. ∎

Proof of Theorem 3.

By the above results, we have

	$\displaystyle\sum_{t=1}^{T}l_{t,i}-\sum_{t=1}^{T}l_{t,J_{t}}=$	$\displaystyle\;B\left(\sum_{t}\frac{l_{t,i}-B}{B}-\sum_{t}\frac{l_{t,J_{t}}-B}{B}\right)$
	$\displaystyle\leq$	$\displaystyle\;B\left(\sum_{t}\frac{l_{t,i}-B}{B}-\sum_{t}\widehat{Z}_{t,i}\right)$
		$\displaystyle+B\left(\sum_{t}\sum_{j}p_{tj}\widehat{Z}_{t,j}-\sum_{t}\frac{l_{t,J_{t}}-B}{B}\right)$
		$\displaystyle+B\left(\widehat{Z}_{t,i}-\sum_{t}\sum_{j}p_{tj}\widehat{Z}_{t,j}\right)$
	$\displaystyle\overset{(i)}{\leq}$	$\displaystyle\;B\frac{\log(T/\epsilon^{2})}{\beta}+B\kappa\beta T+\widetilde{\mathcal{O}}\left(B\sqrt{\beta\kappa T}\right)$
		$\displaystyle+\frac{B\log K}{\eta}+B\eta\sum_{t}\sum_{j}\frac{p_{tj}}{\widetilde{p}_{tj}^{2}}\mathbb{I}_{\left[j\in\mathcal{P}_{t,J_{t}}\right]}$
	$\displaystyle\overset{(ii)}{\leq}$	$\displaystyle\;B\frac{\log(T/\epsilon^{2})}{\beta}+B\kappa\beta T+\widetilde{\mathcal{O}}\left(B\sqrt{\beta\kappa T}\right)$
		$\displaystyle+\frac{B\log K}{\eta}+B\eta\kappa T+\widetilde{\mathcal{O}}\left(B\eta\sqrt{(1+\beta\kappa)T}\right),$

where $(i)$ uses Lemma 1, Lemma 2, and (15), and $(ii)$ uses Lemma 2. ∎

Figure 3: The network structure for experiments. The dark node at the center is the absorbing node

*

, and nodes labelled with numbers are transient nodes. Nodes without edges connecting them visits each other with zero probability.

5 Experiments

We deploy our algorithms on a problem with 9 transient nodes. The results for Algorithm 2 is in Figure 4, and the results for Algorithm 1 is deferred to the Appendix. The evaluation of Algorithm 2 is performed on a problem instance with the transition matrix specified in (19).

\displaystyle m_{ij}

\displaystyle=\begin{cases}0.3,\text{ if }i=j,\\ 0.1,\text{ if }i=j\pm 1\mod 9,\\ 0,\text{ otherwise.}\end{cases}

(19)

The edge lengths are sampled from Gaussian distributions and truncated to between 0 and 1. Specifically, for all $t=1,2,\cdots,T$ , $l_{ij}^{(t)}=\begin{cases}\text{clip}_{[0,1]}\left(W_{t}+0.5\right),\text{ if }i=0\text{ and }j=*\\ \text{clip}_{[0,1]}\left(W_{t}\right),\text{ if }i\neq 0\text{ and }j=*\\ 1,\text{ otherwise}\end{cases},$ where $\text{clip}_{[0,1]}(z)$ takes a number $z$ and clips it to $[0,1]$ , $W_{t}\overset{i.i.d.}{\sim}\mathcal{N}\left(0.5,0.1\right)$ , and $*$ denotes the absorbing node.

As shown in Figure 4, EXP3 algorithm performs better when using the, which is consistent with the guarantee provided in Theorem 3. For implementation purpose and a fair comparison, the estimator for Algorithm 2 is $\widehat{Z}_{t,i}=\frac{{Z}_{t,i}\mathbb{I}_{\left[i\in\mathcal{P}_{t,J_{t}}\right]}}{p_{ti}+\sum_{j\neq i}\widehat{q}_{t,ji}p_{tj}}$ , and the estimators for EXP3 uses $\widehat{Z}_{t,i}=\frac{{Z}_{t,i}\mathbb{I}_{\left[i=J_{t}\right]}}{p_{ti}}$ . The learning rate for both algorithms is set to $0.001$ .

6 Conclusion

In this paper, we study the bandit problem where the feedback is random walk trajectories. This problem is motivated by influence maximization in computational social science. We show that, despite substantially more information can be extracted from the random walk trajectories, such problems are not significantly easier than its standard MAB counterpart in a mini-max sense. Behaviors of UCB and EXP3 are studied.

References

Abbasi-Yadkori et al., (2011) Abbasi-Yadkori, Y., Pál, D., and Szepesvári, C. (2011). Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pages 2312–2320.
Agrawal and Goyal, (2012) Agrawal, S. and Goyal, N. (2012). Analysis of Thompson sampling for the multi-armed bandit problem. In Conference on Learning Theory, pages 39–1.
Alon et al., (2013) Alon, N., Cesa-Bianchi, N., Gentile, C., and Mansour, Y. (2013). From bandits to experts: A tale of domination and independence. In Advances in Neural Information Processing Systems, pages 1610–1618.
Arora et al., (2017) Arora, A., Galhotra, S., and Ranu, S. (2017). Debunking the myths of influence maximization: An in-depth benchmarking study. In Proceedings of the 2017 ACM international conference on management of data, pages 651–666.
Audibert et al., (2009) Audibert, J.-Y., Munos, R., and Szepesvári, C. (2009). Exploration–exploitation tradeoff using variance estimates in multi-armed bandits. Theoretical Computer Science, 410(19):1876–1902.
Auer, (2002) Auer, P. (2002). Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3(Nov):397–422.
Auer et al., (1995) Auer, P., Cesa-Bianchi, N., Freund, Y., and Schapire, R. E. (1995). Gambling in a rigged casino: The adversarial multi-armed bandit problem. In Proceedings of IEEE 36th Annual Foundations of Computer Science, pages 322–331. IEEE.
Auer et al., (2002) Auer, P., Cesa-Bianchi, N., Freund, Y., and Schapire, R. E. (2002). The nonstochastic multiarmed bandit problem. SIAM journal on computing, 32(1):48–77.
Auer and Ortner, (2010) Auer, P. and Ortner, R. (2010). UCB revisited: Improved regret bounds for the stochastic multi-armed bandit problem. Periodica Mathematica Hungarica, 61(1-2):55–65.
Bertsekas and Tsitsiklis, (1991) Bertsekas, D. P. and Tsitsiklis, J. N. (1991). An analysis of stochastic shortest path problems. Mathematics of Operations Research, 16(3):580–595.
Bubeck et al., (2013) Bubeck, S., Cesa-Bianchi, N., and Lugosi, G. (2013). Bandits with heavy tail. IEEE Transactions on Information Theory, 59(11):7711–7717.
Bubeck and Slivkins, (2012) Bubeck, S. and Slivkins, A. (2012). The best of both worlds: stochastic and adversarial bandits. In Conference on Learning Theory, pages 42–1.
Cesa-Bianchi et al., (1997) Cesa-Bianchi, N., Freund, Y., Haussler, D., Helmbold, D. P., Schapire, R. E., and Warmuth, M. K. (1997). How to use expert advice. Journal of the ACM (JACM), 44(3):427–485.
Chen et al., (2017) Chen, L., Li, J., and Qiao, M. (2017). Nearly instance optimal sample complexity bounds for top-k arm selection. In Artificial Intelligence and Statistics, pages 101–110.
Chen et al., (2016) Chen, W., Hu, W., Li, F., Li, J., Liu, Y., and Lu, P. (2016). Combinatorial multi-armed bandit with general reward functions. In Advances in Neural Information Processing Systems, pages 1651–1659.
Chen et al., (2013) Chen, W., Wang, Y., and Yuan, Y. (2013). Combinatorial multi-armed bandit: General framework and applications. In International Conference on Machine Learning, pages 151–159. PMLR.
Dick et al., (2014) Dick, T., Gyorgy, A., and Szepesvari, C. (2014). Online learning in markov decision processes with changing cost sequences. In International Conference on Machine Learning, pages 512–520. PMLR.
Even-Dar et al., (2009) Even-Dar, E., Kakade, S. M., and Mansour, Y. (2009). Online markov decision processes. Mathematics of Operations Research, 34(3):726–736.
Garivier and Cappé, (2011) Garivier, A. and Cappé, O. (2011). The KL–UCB algorithm for bounded stochastic bandits and beyond. In Conference on Learning Theory, pages 359–376.
Gergely Neu et al., (2010) Gergely Neu, A. G., Szepesvári, C., and Antos, A. (2010). Online markov decision processes under bandit feedback. In Proceedings of the Twenty-Fourth Annual Conference on Neural Information Processing Systems.
Jin et al., (2019) Jin, C., Jin, T., Luo, H., Sra, S., and Yu, T. (2019). Learning adversarial mdps with bandit feedback and unknown transition. arXiv preprint arXiv:1912.01192.
Kaufmann et al., (2016) Kaufmann, E., Cappé, O., and Garivier, A. (2016). On the complexity of best-arm identification in multi-armed bandit models. The Journal of Machine Learning Research, 17(1):1–42.
Kempe et al., (2003) Kempe, D., Kleinberg, J., and Tardos, É. (2003). Maximizing the spread of influence through a social network. In Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 137–146.
Kleinberg et al., (2008) Kleinberg, R., Slivkins, A., and Upfal, E. (2008). Multi-armed bandits in metric spaces. In ACM Symposium on Theory of Computing, pages 681–690. ACM.
Kocák et al., (2014) Kocák, T., Neu, G., Valko, M., and Munos, R. (2014). Efficient learning by implicit exploration in bandit problems with side observations. In Advances in Neural Information Processing Systems, pages 613–621.
Krause and Ong, (2011) Krause, A. and Ong, C. S. (2011). Contextual Gaussian process bandit optimization. In Advances in Neural Information Processing Systems, pages 2447–2455.
Lai and Robbins, (1985) Lai, T. L. and Robbins, H. (1985). Asymptotically efficient adaptive allocation rules. Advances in applied mathematics, 6(1):4–22.
Lattimore and Szepesvári, (2020) Lattimore, T. and Szepesvári, C. (2020). Bandit Algorithms. Cambridge University Press.
Li et al., (2010) Li, L., Chu, W., Langford, J., and Schapire, R. E. (2010). A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th international conference on World wide web, pages 661–670.
Littlestone and Warmuth, (1994) Littlestone, N. and Warmuth, M. K. (1994). The weighted majority algorithm. Information and computation, 108(2):212–261.
Maillard et al., (2011) Maillard, O.-A., Munos, R., and Stoltz, G. (2011). A finite-time analysis of multi-armed bandits problems with Kullback-Leibler divergences. In Conference On Learning Theory, pages 497–514.
Mannor and Shamir, (2011) Mannor, S. and Shamir, O. (2011). From bandits to experts: On the value of side-observations. In Advances in Neural Information Processing Systems, pages 684–692.
Mannor and Tsitsiklis, (2004) Mannor, S. and Tsitsiklis, J. N. (2004). The sample complexity of exploration in the multi-armed bandit problem. Journal of Machine Learning Research, 5(Jun):623–648.
Neu, (2015) Neu, G. (2015). Explore no more: Improved high-probability regret bounds for non-stochastic bandits. In Advances in Neural Information Processing Systems, pages 3168–3176.
Neu et al., (2012) Neu, G., Gyorgy, A., and Szepesvári, C. (2012). The adversarial stochastic shortest path problem with unknown transition probabilities. In Artificial Intelligence and Statistics, pages 805–813. PMLR.
Robbins, (1952) Robbins, H. (1952). Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 58(5):527–535.
Rosenberg and Mansour, (2020) Rosenberg, A. and Mansour, Y. (2020). Adversarial stochastic shortest path. arXiv preprint arXiv:2006.11561.
Seldin and Slivkins, (2014) Seldin, Y. and Slivkins, A. (2014). One practical algorithm for both stochastic and adversarial bandits. In International Conference on Machine Learning, pages 1287–1295.
Slivkins, (2014) Slivkins, A. (2014). Contextual bandits with similarity information. The Journal of Machine Learning Research, 15(1):2533–2568.
Srinivas et al., (2010) Srinivas, N., Krause, A., Kakade, S., and Seeger, M. (2010). Gaussian process optimization in the bandit setting: No regret and experimental design. In International Conference on Machine Learning.
Thompson, (1933) Thompson, W. R. (1933). On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294.

Appendix A Proof Details of Theorems 2

Construct $K+1$ instances specified by graphs: $\mathfrak{J}_{0}=\left(G_{t}^{0}\right)_{t=1}^{T},\mathfrak{J}_{1}=\left(G_{t}^{1}\right)_{t=1}^{T},\mathfrak{J}_{2}=\left(G_{t}^{2}\right)_{t=1}^{T},\cdots,\mathfrak{J}_{K}=\left(G_{t}^{K}\right)_{t=1}^{T}$ . For any $t,k$ , let all transition probabilities equal $p$ . For any $t,k$ , let the graph $G_{t}^{k}$ have edge lengths $\left(\{l_{i*}^{(t)}\}_{i\in[K]},\{l_{ij}^{(t)}\}_{i,j\in[K]}\right)$ chosen randomly. In $\mathfrak{J}_{0}$ , all edge lengths are independently sampled from $\text{Bernoulli}\left(\frac{1}{2}\right)$ . In $\mathfrak{J}_{k}$ ( $k\geq 1$ ), $l_{k*}^{(t)}$ (for all $t\in[T]$ , $k\geq 1$ ) are sampled from $\text{Bernoulli}\left(\frac{1}{2}+\frac{\epsilon}{1-Kp}\right)$ , and all other edge lengths are sampled from $\text{Bernoulli}\left(\frac{1}{2}\right)$ . The proof is presented in three steps: Step 1 computes the KL-divergence using similar arguments discussed in the main text. Steps 2 & 3 use standard argument for general lower bound proofs.

Step 1: compute the KL-divergence between $\mathfrak{J}_{0}$ and $\mathfrak{J}_{k}$ . Let $\mathbb{Q}_{t,j}^{(k)}$ span the probability of playing $j$ at $t$ in instance $\mathfrak{J}_{k}$ . By chain rule, we have, for any $k=2,\cdots,K$ ,

\displaystyle D_{KL}\left(\mathbb{P}_{\mathfrak{J}_{0},\pi}\|\mathbb{P}_{\mathfrak{J}_{k},\pi}\right)=\sum_{t=1}^{T}\sum_{j\in[K]}\mathbb{P}_{\mathfrak{J}_{0},\pi}\left(J_{t}=j\right)D_{KL}\left(\mathbb{Q}_{t,j}^{(0)}\|\mathbb{Q}_{t,j}^{(k)}\right).

(20)

Let $X_{0},L_{1},X_{1},L_{2},\cdots$ be the nodes and edge length of each step in the trajectory after playing a node. The sample space of $\mathbb{Q}_{t,j}^{(k)}$ is spanned by $X_{0},L_{1},X_{1},L_{2},\cdots$ .

By Markov property, we have, for all $i,j\in[K]$ and $s\in\mathbb{N}_{+}$ ,

\displaystyle\mathbb{Q}_{t,i}^{(k)}\left(L_{s+1},X_{s+1},L_{s+2},X_{s+2},\dots\rvert X_{s}=j\right)=\mathbb{Q}_{t,j}^{(k)},\quad\forall k=0,1,2,\cdots,K.

Thus by applying the chain rule to $\mathbb{Q}_{t,j}^{(k)}$ , we have

	$\displaystyle D_{KL}\left(\mathbb{Q}_{t,i}^{(0)}\\|\mathbb{Q}_{t,i}^{(k)}\right)$
$\displaystyle=$	$\displaystyle D_{KL}\left(\mathbb{Q}_{t,i}^{(0)}(X_{1},L_{1})\\|\mathbb{Q}_{t,i}^{(k)}(X_{1},L_{1})\right)$
	$\displaystyle+\sum_{x\in[K]}\sum_{l\in\{0,1\}}\mathbb{P}\left(X_{1}=x,L_{1}=l\right)$
	$\displaystyle\cdot D_{KL}\left(\mathbb{Q}_{t,i}^{(0)}(X_{2},L_{2},...\rvert X_{1}=x,L_{1}=l)\\|\mathbb{Q}_{t,i}^{(k)}(X_{2},L_{2},...\rvert X_{1}=x,L_{1}=l)\right).$	(21)

Recall that the edge lengths are selected independent of other randomness. Thus we have

\displaystyle\mathbb{Q}_{t,i}^{(k)}(X_{2},L_{2},...\rvert X_{1}=x,L_{1}=l)=\mathbb{Q}_{t,i}^{(k)}(X_{2},L_{2},...\rvert X_{1}=x),\quad k=0,1,2,\cdots,K.

Thus (21) simplifies to

		$\displaystyle\;D_{KL}\left(\mathbb{Q}_{t,i}^{(0)}\\|\mathbb{Q}_{t,i}^{(k)}\right)$
	$\displaystyle=$	$\displaystyle\;D_{KL}\left(\mathbb{Q}_{t,i}^{(0)}(X_{1},L_{1})\\|\mathbb{Q}_{t,i}^{(k)}(X_{1},L_{1})\right)+\sum_{j\in[K]}m_{ji}D_{KL}\left(\mathbb{Q}_{t,j}^{(0)}\\|\mathbb{Q}_{t,j}^{(k)}\right),$		(22)

where $m_{ji}$ ( $m_{ji}=p$ ) is the probability of visiting $j$ from $i$ .

We have, for $k\geq 1$ ,

	$\displaystyle\;D_{KL}\left(\mathbb{Q}_{t,i}^{(0)}(X_{1},L_{1})\\|\mathbb{Q}_{t,i}^{(k)}(X_{1},L_{1})\right)$
$\displaystyle=$	$\displaystyle\;\begin{cases}\frac{1-Kp}{2}\log\frac{(1-Kp)/2}{(1-Kp)\left(\frac{1}{2}+\frac{\epsilon}{1-Kp}\right)}+\frac{1-Kp}{2}\log\frac{(1-Kp)/2}{(1-Kp)\left(\frac{1}{2}-\frac{\epsilon}{1-Kp}\right)},&\text{if }i=k,\\ 0,&\text{otherwise}.\end{cases}$
$\displaystyle\leq$	$\displaystyle\;\begin{cases}\frac{\epsilon^{2}}{1-Kp},&\text{if }i=k,\\ 0,&\text{otherwise}.\end{cases}$	(23)

where the last line uses that $x-\frac{x^{2}}{2}\leq\log(1+x)\leq x$ for all $x\in[0,1]$ .

Combining (22) and (23) gives

\displaystyle D_{KL}\left(\mathbb{Q}_{t,i}^{(0)}\|\mathbb{Q}_{t,i}^{(k)}\right)\leq

\displaystyle\;\begin{cases}\frac{\epsilon^{2}}{1-Kp}+\frac{p\epsilon^{2}}{(1-Kp)^{2}},&\text{if }i=k,\\ \frac{p\epsilon^{2}}{(1-Kp)^{2}},&\text{otherwise}.\end{cases}

(24)

Plugging the above results into (20) and we get

	$\displaystyle D_{KL}\left(\mathbb{P}_{\mathfrak{J}_{0},\pi}\\|\mathbb{P}_{\mathfrak{J}_{k},\pi}\right)=$	$\displaystyle\;\sum_{t=1}^{T}\sum_{j\in[K]}\mathbb{P}_{\mathfrak{J}_{0},\pi}\left(J_{t}=j\right)D_{KL}\left(\mathbb{Q}_{t,j}^{(0)}\\|\mathbb{Q}_{t,j}^{(k)}\right)$
	$\displaystyle=$	$\displaystyle\;\sum_{j}\mathbb{E}_{\mathfrak{J}_{0},\pi}\left[N_{j}\right]D_{KL}\left(\mathbb{Q}_{t,j}^{(0)}\\|\mathbb{Q}_{t,j}^{(k)}\right)$
	$\displaystyle\leq$	$\displaystyle\;\frac{\epsilon^{2}\mathbb{E}_{\mathfrak{J}_{0},\pi}\left[N_{k}\right]}{1-Kp}+\frac{\epsilon^{2}pT}{(1-Kp)^{2}},$

where $N_{j}$ is the number of times $j$ is played.

Step 2: compute the optimality gap between nodes.

Let $H_{k}$ be the vector of (expected) hitting times in instance $\mathfrak{J}_{k}$ . The vector $H_{k}$ $(k\geq 2)$ satisfies

\displaystyle H_{k}=MH_{k}+(1-Kp)\left(\frac{1}{2}\bm{1}+\frac{\epsilon}{1-Kp}\bm{e}_{k}\right),

where $M$ is the transition matrix among transient nodes (all entries of $M$ are $p$ ), $\bm{1}$ is the all $\bm{1}$ vector and $\bm{e}_{k}$ is the $k$ -th canonical basis vector.

Solving the above equation gives, for $k\geq 2$

\displaystyle H_{k}=(1-Kp)\left(I+\frac{M}{1-Kp}\right)\left(\frac{1}{2}\bm{1}+\frac{\epsilon}{1-Kp}\bm{e}_{k}\right).

Thus the optimality gap, which is the difference between the hitting time of node $k$ in $\mathfrak{J}_{k}$ and the hitting time of other nodes in $\mathfrak{J}_{k}$ $(k\geq 1)$ , is $\Delta:=\epsilon$ .

Step 3: apply Yao’s principle and Pinsker’s inequality to finish the proof.

By Pinsker’s inequality, we have $\forall j,k,$

\displaystyle\rvert\mathbb{P}_{\mathfrak{J}_{0},\pi}(J_{t}=j)-\mathbb{P}_{\mathfrak{J}_{k},\pi}(J_{t}=j)\rvert\leq

\displaystyle\;\sqrt{2D_{KL}\left(\mathbb{P}_{\mathfrak{J}_{0},\pi}\|\mathbb{P}_{\mathfrak{J}_{k},\pi}\right)}.

(25)

Thus for the regret against $k$ is instance $\mathfrak{J}_{k}$ , we have

	$\displaystyle\;\frac{1}{K}\sum_{k=1}^{K}\mathbb{E}_{\mathfrak{J}_{k},\pi}\left[\mathrm{Reg}^{\mathrm{adv}}_{k}(T)\right]$
$\displaystyle=$	$\displaystyle\;\frac{1}{K}\sum_{k=1}^{K}\sum_{t}\mathbb{E}_{\mathfrak{J}_{k},\pi}\left[Y_{k,t}\right]-\mathbb{E}_{\mathfrak{J}_{k},\pi}\left[Y_{J_{t},t}\right]$
$\displaystyle=$	$\displaystyle\;\frac{\epsilon}{K}\sum_{k=1}^{K}\sum_{t}\mathbb{P}_{\mathfrak{J}_{k},\pi}(J_{t}\neq k)$	(by the Wald’s indentity)
$\displaystyle=$	$\displaystyle\;\epsilon T-\frac{\epsilon}{K}\sum_{k=1}^{K}\sum_{t}\mathbb{P}_{\mathfrak{J}_{k},\pi}\left(J_{t}=k\right)$
$\displaystyle\geq$	$\displaystyle\;\epsilon T-\frac{\epsilon}{K}\sum_{k=1}^{K}\sum_{t}\mathbb{P}_{\mathfrak{J}_{0},\pi}(J_{t}=k)-\frac{\epsilon}{K}\sum_{k=1}^{K}\sum_{t}\sqrt{2D_{KL}\left(\mathcal{P}_{\mathfrak{J}_{0},\pi}\\|\mathcal{P}_{\mathfrak{J}_{k},\pi}\right)}$	(by Eq. 25)
$\displaystyle\geq$	$\displaystyle\;\frac{(K-1)\epsilon T}{K}-\epsilon^{2}T\sqrt{\frac{1}{K}\sum_{k=1}^{K}\left(\frac{2\mathbb{E}_{\mathfrak{J}_{0},\pi}\left[N_{k}\right]}{1-Kp}+\frac{2pT}{(1-Kp)^{2}}\right)}$	(by Jensen’s inequality)
$\displaystyle=$	$\displaystyle\;\frac{(K-1)\epsilon T}{K}-\epsilon^{2}T\sqrt{\frac{2T}{K(1-Kp)}+\frac{2pT}{(1-Kp)^{2}}}.$	(26)

Now we set $p=\frac{1}{2K}$ and $\epsilon=\frac{1}{4\sqrt{2}}\frac{K-1}{{K}}\sqrt{\frac{K}{T}}$ and (26) gives

\displaystyle\frac{1}{K}\sum_{k=1}^{K}\mathbb{E}_{\mathfrak{J}_{k},\pi}\left[\mathrm{Reg}^{\mathrm{adv}}_{k}(T)\right]\geq\frac{1}{8\sqrt{2}}\sqrt{KT}

Thus we have

	$\displaystyle\max_{k}\mathbb{E}_{\mathfrak{J}_{k},\pi}\left[\mathrm{Reg}^{\mathrm{adv}}_{k}(T)\right]\geq$	$\displaystyle\;\frac{1}{K}\sum_{k=1}^{K}\mathbb{E}_{\mathfrak{J}_{k},\pi}\left[\mathrm{Reg}^{\mathrm{adv}}_{k}(T)\right]$
	$\displaystyle\geq$	$\displaystyle\;\frac{1}{8\sqrt{2}}\sqrt{KT}.$

Appendix B Proof Details for Section 4

We use the following notations for simplicity. 1. We write $\widetilde{p}_{tj}=p_{tj}+\sum_{i\neq j}q_{ij}p_{ti}$ , and $\widehat{p}_{tj}=p_{tj}+\sum_{i\neq j}\widehat{q}_{t,ij}p_{ti}$ . 2. We use $\mathcal{F}_{t}$ to denote the $\sigma$ -algebra generated by all randomness up to the end of epoch $t$ . We use $\mathcal{F}_{t,i}$ to denote the $\sigma$ -algebra of all randomness up to the first occurrence of node $i$ in epoch $t$ (or end of epoch $t$ if $i$ is not visited in epoch $t$ ). We use $\mathbb{E}_{t}$ to denote the expectation conditioning on $\mathcal{F}_{t}$ , i.e., $\mathbb{E}_{t}\left[\cdot\right]=\mathbb{E}\left[\cdot\rvert\mathcal{F}_{t}\right]$ . 3. Unless otherwise stated, we use $\sum_{t}$ and $\sum_{j}$ as shorthand for $\sum_{t=1}^{T}$ and $\sum_{j\in[K]}$ , respectively.

In Appendix B.1, we state some preparation properties needed for proving Lemma 1. Proposition 2 is also included in this part. In Appendix B.2, we provide a proof for Lemma 1.

B.1 Additional Properties

As Proposition 1 suggests, number of times a node is visited linearly accumulates with number of epochs $t$ . We state this observation below in Lemma 3.

Lemma 3.

For $v\in V$ and $t$ , it holds that $\mathbb{P}\Big{(}{N}_{t}^{+}(v)-{N}_{t}(v)-\alpha_{v}\left(t-{N}_{t}(v)\right)\geq-\lambda\Big{)}\leq e^{-\frac{\lambda^{2}}{2t}}.$

Proof.

Recall $\mathcal{P}_{t,J_{t}}=\left(X_{t,0},L_{t,1},X_{t,1},L_{t,2},X_{t,2},\dots,L_{t,H_{t,J_{t}}},X_{t,H_{t,J_{t}}}\right)$ is the trajectory for epoch $t$ and $X_{i,0}$ is the node played at epoch $i$ . For a fixed node $v\in V$ , consider the random variables $\left\{\mathbb{I}_{\left[v\in\mathcal{P}_{t,J_{t}}\setminus\{X_{t,0}\}\right]}\right\}_{t}$ , which is the indicator that takes value 1 when $v$ is covered in path $\mathcal{P}_{t,J_{t}}$ but is not played at $t$ . From this definition, we have

\displaystyle\sum_{k=1}^{t}\mathbb{I}_{\left[v\in\mathcal{P}_{k,J_{k}}\setminus\{X_{k,0}\}\right]}={N}_{t}^{+}(v)-{N}_{t}(v).

From definition of $\alpha_{v}$ , we have

\displaystyle\mathbb{E}\left[{N}_{t}^{+}(v)-{N}_{t}(v)\right]=\mathbb{E}\left[\sum_{k=1}^{t}\mathbb{I}_{\left[v\in\mathcal{P}_{i}\setminus\{X_{k,0}\}\right]}\right]\geq\alpha_{v}\left(t-\mathbb{E}\left[{N}_{t}(v)\right]\right).

Thus by one-sided Azuma’s inequality, we have for any $\lambda>0$ ,

\displaystyle\mathbb{P}\Big{(}{N}_{t}^{+}(v)-{N}_{t}(v)-\alpha_{v}\left(t-{N}_{t}(v)\right)\geq-\lambda\Big{)}\leq\exp\left(-\frac{\lambda^{2}}{2t}\right).

(27)

∎

Lemma 4.

For any $t,i,j$ , it holds that $\mathbb{V}\left[\widehat{q}_{t,ij}\right]=\widetilde{\mathcal{O}}\left(\frac{1}{t}\right).$

Proof.

For the variance, we have

	$\displaystyle\mathbb{V}\left[\widehat{q}_{t,ij}\right]=$	$\displaystyle\;\sum_{m=1}^{t}\mathbb{V}\left[\widehat{q}_{t,ij}\bigg{\rvert}N_{t}^{+}(i)=m\right]\mathbb{P}\left(N_{t}^{+}(i)=m\right)$
	$\displaystyle\leq$	$\displaystyle\;\sum_{m=1}^{t}\frac{1}{m}\mathbb{P}\left(N_{t}^{+}(i)=m\right)=\mathbb{E}\left[\frac{1}{N_{t}^{+}(i)}\right].$

By Lemma 3 and a union bound, we know, for any $\delta\in(0,1)$ ,

\displaystyle\mathbb{P}\Big{(}{N}_{t}^{+}(i)\geq\alpha t-\sqrt{2t\log(2TK/\delta)},\quad\forall i\in[K],\;t\in[T]\Big{)}\leq{\delta}.

Thus it holds that

	$\displaystyle\mathbb{E}\left[\frac{1}{N_{t}^{+}(i)}\right]=$	$\displaystyle\;\mathbb{E}\left[\frac{1}{N_{t}^{+}(i)}\big{\rvert}{N_{t}^{+}(i)}\geq\alpha t-\sqrt{2t\log(TK/\delta)}\right]$
		$\displaystyle\quad\cdot\mathbb{P}\left(N_{t}^{+}(i)\geq\alpha t-\sqrt{2t\log(TK/\delta)}\right)$
		$\displaystyle+\mathbb{E}\left[\frac{1}{N_{t}^{+}(i)}\big{\rvert}{N_{t}^{+}(i)}<\alpha t-\sqrt{2t\log(TK/\delta)}\right]$
		$\displaystyle\quad\cdot\mathbb{P}\left(N_{t}^{+}(i)<\alpha t-\sqrt{2t\log(TK/\delta)}\right)$
	$\displaystyle\leq$	$\displaystyle\;\frac{1}{\max\left\{1,\alpha t-\sqrt{2t\log(TK/\delta)}\right\}}+\delta.$

Setting $\delta=\frac{1}{t}$ concludes the proof. ∎

Next, we consider a high probability event and approximate $\widehat{q}_{t,ij}$ under this event.

Lemma 5.

For any $\epsilon\in(0,1)$ , let

	$\displaystyle\mathcal{E}_{T}^{\prime}:=\Bigg{\{}$	$\displaystyle\left\rvert\widehat{q}_{t,ij}-q_{ij}\right\rvert\leq\sqrt{\frac{2\mathbb{V}\left[\widehat{q}_{t,ij}\right]\log(T/\epsilon)}{N_{t-1}^{+}(i)}}+\frac{\log(T/\epsilon)}{3N_{t-1}^{+}(i)},$
		$\displaystyle\quad\;N_{t}^{+}(i)\geq\alpha t-\sqrt{t\log(TK/\epsilon)},\;\forall i,j\in[K]\forall t\in[T]\Bigg{\}}.$

It holds that $\mathbb{P}\left(\mathcal{E}_{t}^{\prime}\right)\geq 1-\frac{2\epsilon}{T}$ and under $\mathcal{E}_{T}^{\prime}$ , for all $i,j\in[K]$ and $t\in[T]$ ,

\displaystyle\widehat{q}_{t,ij}=q_{ij}\pm\mathcal{O}\left({\frac{\log(TK/\epsilon)}{t}}\right),\text{ and }\widehat{p}_{ti}=\widetilde{p}_{ti}\pm\mathcal{O}\left({\frac{\log(TK/\epsilon)}{t}}\right).

(28)

Proof.

By Bennett’s inequality, it holds that

\displaystyle\mathbb{P}\left(\left\rvert\widehat{q}_{t,ij}-q_{ij}\right\rvert\geq\sqrt{\frac{2\mathbb{V}\left[\widehat{q}_{t,ij}\right]\log(KT/\epsilon)}{N_{t-1}^{+}(i)}}+\frac{\log(KT/\epsilon)}{3N_{t-1}^{+}(i)}\right)\leq\epsilon

(29)

By Lemma 3 and a union bound, we known $\mathbb{P}\left(\mathcal{E}_{T}^{\prime}\right)\geq 1-2\epsilon$ . By Lemma 4, we know that $\mathbb{V}\left[\widehat{q}_{t,ij}\right]=\widetilde{\mathcal{O}}\left(\frac{1}{t}\right)$ . Thus, under event $\mathcal{E}_{T}^{\prime}$ , it holds that

\displaystyle\widehat{q}_{t,ij}=q_{ij}\pm\widetilde{\mathcal{O}}\left(\frac{\log\left(T/\epsilon\right)}{t}\right),

and thus

\displaystyle\widehat{p}_{ti}=\widetilde{p}_{ti}\pm\widetilde{\mathcal{O}}\left(\frac{\log\left(T/\epsilon\right)}{t}\right).

∎

Lemma 6.

For any $B$ , let $\mathcal{E}_{T}(B):=\left\{{Z}_{t,j}\leq B\text{ for all }t=1,2,\cdots,T,\text{ and }j\in[K]\right\}.$ For any $\epsilon\in(0,1)$ and $B=\frac{\log\frac{(1-\rho)\epsilon}{KT}}{\log\rho}$ , it holds that

	$\displaystyle\mathbb{P}\left(\mathcal{E}_{T}\left(B\right)\right)\geq 1-\epsilon,$
	$\displaystyle\mathbb{E}\left[{Z}_{t,i}\rvert\text{not }\mathcal{E}_{T}(B)\right]\leq KB+\frac{K}{(1-\rho)^{2}},$
	$\displaystyle\mathbb{E}\left[{Z}_{t,i}^{2}\rvert\text{not }\mathcal{E}_{T}(B)\right]\leq KB^{2}+\frac{2K}{(1-\rho)^{3}}.$

Proof.

Since all edge lengths are smaller than 1, we have, for any integer $B$ ,

	$\displaystyle\mathbb{P}\left(Z_{t,i}>B\right)$	$\displaystyle\leq\mathbb{P}\left(\{\text{random walk starting from $i$ does not terminate in $B$ steps}\}\right)$
		$\displaystyle=\sum_{l=B}^{\infty}\mathbb{P}\left(\{\text{random walk starting from $i$ terminates at step $l$}\}\right)$
		$\displaystyle=\sum_{l=B}^{\infty}\sum_{j=1}^{K}\left[M^{l}\right]_{ij}\leq\sum_{l=B}^{\infty}\rvert M\rvert_{\infty}^{l}\leq\sum_{l=B}^{\infty}\rho^{l}\leq\frac{\rho^{B}}{1-\rho}.$

Thus with probability at least $1-\frac{\rho^{B}}{1-\rho}$ , we have $Z_{t,i}\leq B$ . We define

\displaystyle\mathcal{E}_{T}(B):=\left\{{Z}_{t,j}\leq B\text{ for all }t=1,2,\cdots,T,\text{ and }j\in[K]\right\}.

By a union bound, $\mathbb{P}\left(\mathcal{E}_{T}(B)\right)\geq 1-\frac{KT\rho^{B}}{1-\rho}$ . Now we can set $B=\frac{\log\frac{(1-\rho)\epsilon}{KT}}{\log\rho}$ so that $\mathbb{P}\left(\mathcal{E}_{T}(B)\right)\geq 1-\epsilon$ .

The random variables $Z_{t,i}$ also has the memorylessness-type property:

		$\displaystyle\;\mathbb{E}\left[Z_{t,i}\rvert\text{not }\mathcal{E}_{T}(B)\right]$
	$\displaystyle\leq$	$\displaystyle\mathbb{E}\left[Z_{t,i}\rvert Z_{t,i}>B\right]\leq\sum_{l=B+1}^{\infty}l\frac{\mathbb{P}\left(\{\text{$\mathcal{P}_{t,i}$ terminates at step $l$}\}\cap\{Z_{t,i}>B\}\right)}{\mathbb{P}\left(Z_{t,i}>B\right)}$
	$\displaystyle=$	$\displaystyle\;\sum_{l=B+1}^{\infty}l\frac{\sum_{j}\mathbb{P}\left(\{\text{$\mathcal{P}_{t,i}$ terminates at step $l$}\}\cap\{\text{the $(B+1)$-th step is at $j$}\}\right)}{\sum_{j}\mathbb{P}\left(\{\text{the $(B+1)$-th step is at $j$}\}\right)}$
	$\displaystyle\leq$	$\displaystyle\;\sum_{l=B+1}^{\infty}l\sum_{j}\mathbb{P}\left(\{\text{$\mathcal{P}_{t,i}$ terminates at step $l$}\}\big{\rvert}\{\text{the $(B+1)$-th step is at $j$}\}\right)$
	$\displaystyle=$	$\displaystyle\;\sum_{j}\sum_{l=1}^{\infty}(l+B)\mathbb{P}\left(\{\text{$\mathcal{P}_{t,j}$ terminates at step $l$}\}\right)$
	$\displaystyle=$	$\displaystyle\;\sum_{j}\mathbb{E}\left[Z_{t,j}+B\right]\leq KB+\sum_{j}\mathbb{E}\left[Z_{t,j}\right]$

where we use Markov property on the second last line.

Since $\mathbb{E}\left[Z_{t,j}\right]\leq\mathcal{O}\left(\frac{1}{\left(1-\rho\right)^{2}}\right)$ , we insert this into the above equation to get

\displaystyle\mathbb{E}\left[Z_{t,i}\rvert\text{not }\mathcal{E}_{T}(B)\right]\leq\mathcal{O}\left(KB+\frac{K}{(1-\rho)^{2}}\right).

Similarly, we have

\displaystyle\mathbb{E}\left[Z_{t,i}^{2}\rvert\text{not }\mathcal{E}_{T}(B)\right]\leq\mathcal{O}\left(KB^{2}+\frac{K}{(1-\rho)^{3}}\right).

∎

Remark 1.

By Lemmas 5 and 6, we know $\widehat{p}_{ti}=\widetilde{p}_{ti}+\widetilde{\mathcal{O}}\left(\frac{1}{t}\right)$ and $Z_{t,i}\leq B$ ( $\forall i\in[K],t\in[T]$ ) hold with high probability. Let $\Sigma$ be the whole event space spanned by all possibilities of $T$ rounds of plays. From now on, we will restrict our attention to the event space $\Sigma^{\prime}=\{e\in\Sigma:e\cap\mathcal{E}_{T}(B)\cap\mathcal{E}_{T}^{\prime}\}$ , and work in this event space unless otherwise noted.

Proposition 2.

Fix any $a\in(0,1]$ . We have

\displaystyle\frac{x}{x+(1-a)x}\leq x+\frac{1-\sqrt{a}}{1+\sqrt{a}},\qquad\forall x\in(0,1).

(30)

Proof.

If suffices to show, for any $a\in(0,1]$ , the function $f_{a}(x):=\frac{x}{x+(1-x)a}-x$ is upper bounded by $\frac{1-\sqrt{a}}{1+\sqrt{a}}$ . This can be shown via a quick first-order test. At $x_{\max}=\frac{\sqrt{a}}{1+\sqrt{a}}$ , the maximum of $f_{a}$ is achieved, and $f_{a}(x_{\max})=\frac{1-\sqrt{a}}{1+\sqrt{a}}$ . ∎

B.2 Proof of Lemma 1

By law of total expectation, we have

	$\displaystyle\mathbb{E}_{t-1}\left[Z_{t,i}\mathbb{I}_{\left[i\in\mathcal{P}_{t,J_{t}}\right]}\right]=$	$\displaystyle\;\mathbb{E}_{t-1}\left[\mathbb{E}\left[Z_{t,i}\mathbb{I}_{\left[i\in\mathcal{P}_{t,J_{t}}\right]}\big{\rvert}\mathcal{F}_{t,i}\right]\right]$
	$\displaystyle=$	$\displaystyle\;\mathbb{E}_{t-1}\left[l_{t,i}\mathbb{I}_{\left[i\in\mathcal{P}_{t,J_{t}}\right]}\right]=l_{t,i}\widetilde{p}_{ti}.$

Also, we have

\displaystyle\mathbb{E}_{t-1}\left[\left(\frac{Z_{t,i}-B}{B}\right)^{2}\right]\leq 1

(31)

By Lemma 5, we know

\displaystyle\frac{1}{\widehat{p}_{tj}}=\frac{1}{\widetilde{p}_{tj}}\pm\widetilde{\mathcal{O}}\left(\frac{\log\left(T/\epsilon\right)}{t}\right),\quad\forall t\in[T],j\in[K]

(32)

We can use (32) to get

	$\displaystyle\mathbb{E}_{t-1}\left[\exp\left(\frac{\beta}{B}\left(l_{t,i}-B\right)-\beta\left(\frac{\frac{{Z}_{t,i}-B}{B}\mathbb{I}_{\left[i\in\mathcal{P}_{t,J_{t}}\right]}+\beta}{p_{ti}+\sum_{j\neq i}\widehat{q}_{t,ji}p_{tj}}\right)\right)\right]$
$\displaystyle{=}$	$\displaystyle\mathbb{E}_{t-1}\left[\exp\left(\frac{\beta}{B}\left(l_{t,i}-B\right)-\beta\left(\frac{\frac{{Z}_{t,i}-B}{B}\mathbb{I}_{\left[i\in\mathcal{P}_{t,J_{t}}\right]}+\beta}{\widetilde{p}_{ti}}\right)+\widetilde{\mathcal{O}}\left(\frac{\log\left(T/\epsilon\right)}{t}\right)\right)\right]$
$\displaystyle{=}$	$\displaystyle\exp\left(-\frac{\beta^{2}}{\widetilde{p}_{ti}}+\widetilde{\mathcal{O}}\left(\frac{\log\left(T/\epsilon\right)}{t}\right)\right)\cdot$
	$\displaystyle\qquad\qquad\mathbb{E}_{t-1}\left[\exp\left(\frac{\beta}{B}\left(l_{t,i}-B\right)-\beta\left(\frac{\frac{{Z}_{t,i}-B}{B}\mathbb{I}_{\left[i\in\mathcal{P}_{t,J_{t}}\right]}}{\widetilde{p}_{ti}}\right)\right)\right].$	(33)

Since $\exp(x)\leq 1+x+x^{2}$ for $x\leq 1$ , we have

	$\displaystyle\mathbb{E}_{t-1}\left[\exp\left(\frac{\beta}{B}\left(l_{t,i}-B\right)-\beta\left(\frac{\frac{{Z}_{t,i}-B}{B}\mathbb{I}_{\left[i\in\mathcal{P}_{t,J_{t}}\right]}}{\widetilde{p}_{ti}}\right)\right)\right]$
$\displaystyle\leq$	$\displaystyle 1+\mathbb{E}_{t-1}\left[\frac{\beta}{B}\left(l_{t,i}-B\right)-\beta\left(\frac{\frac{{Z}_{t,i}-B}{B}\mathbb{I}_{\left[i\in\mathcal{P}_{t,J_{t}}\right]}}{\widetilde{p}_{ti}}\right)\right]$
	$\displaystyle\qquad+\mathbb{E}_{t-1}\left[\left(\frac{\beta}{B}\left(l_{t,i}-B\right)-\beta\left(\frac{\frac{{Z}_{t,i}-B}{B}\mathbb{I}_{\left[i\in\mathcal{P}_{t,J_{t}}\right]}}{\widetilde{p}_{ti}}\right)\right)^{2}\right].$	(34)

Since $\mathbb{E}_{t-1}\left[\mathbb{I}_{\left[i\in\mathcal{P}_{t,J_{t}}\right]}\right]=\widetilde{p}_{tj}$ , $\mathbb{E}_{t-1}\left[\left({Z_{t,i}-B}\right)\mathbb{I}_{\left[i\in\mathcal{P}_{t,J_{t}}\right]}\right]=\left(l_{t,j}-B\right)\widetilde{p}_{tj}$ and $\mathbb{E}_{t-1}\left[\left(\frac{Z_{t,i}-B}{B}\right)^{2}\mathbb{I}_{\left[i\in\mathcal{P}_{t,J_{t}}\right]}\right]\leq\widetilde{p}_{tj}$ , we have

	$\displaystyle 1+\mathbb{E}_{t-1}\left[\frac{\beta}{B}\left(l_{t,i}-B\right)-\beta\left(\frac{\frac{{Z}_{t,i}-B}{B}\mathbb{I}_{\left[i\in\mathcal{P}_{t,J_{t}}\right]}}{\widetilde{p}_{ti}}\right)\right]$
	$\displaystyle\qquad\qquad+\mathbb{E}_{t-1}\left[\left(\frac{\beta}{B}\left(l_{t,i}-B\right)-\beta\left(\frac{\frac{{Z}_{t,i}-B}{B}\mathbb{I}_{\left[i\in\mathcal{P}_{t,J_{t}}\right]}}{\widetilde{p}_{ti}}\right)\right)^{2}\right]$
$\displaystyle=$	$\displaystyle 1-\frac{\beta^{2}}{B^{2}}\left(l_{t,i}-B\right)^{2}+\frac{\beta^{2}}{\widetilde{p}_{ti}}\leq 1+\frac{\beta^{2}}{\widetilde{p}_{ti}}\leq\exp\left(\frac{\beta^{2}}{\widetilde{p}_{ti}}\right),$	(35)

where the last inequality uses $1+x\leq\exp(x)$ .

We can now combine (33), (34) and (35) and get

		$\displaystyle\mathbb{E}_{t-1}\left[\exp\left(\frac{\beta}{B}\left(l_{t,i}-B\right)-\beta\left(\frac{\frac{{Z}_{t,i}-B}{B}\mathbb{I}_{\left[i\in\mathcal{P}_{t,J_{t}}\right]}+\beta}{p_{ti}+\sum_{j\neq i}\widehat{q}_{t,ji}p_{tj}}\right)\right)\right]$
	$\displaystyle\leq$	$\displaystyle\exp\left(\widetilde{\mathcal{O}}\left(\frac{\log\left(T/\epsilon\right)}{t}\right)\right)$

Let $X=\frac{\beta}{B}\sum_{t=1}^{T}\left(l_{t,i}-B\right)-\beta\sum_{t=1}^{T}\left(\frac{\frac{{Z}_{t,i}-B}{B}\mathbb{I}_{\left[i\in\mathcal{P}_{t,J_{t}}\right]}+\beta}{p_{ti}+\sum_{j\neq i}\widehat{q}_{t,ji}p_{tj}}\right)$ for simplicity. Note that $X=\beta\left(\sum_{t=1}^{T}\frac{l_{t,i}-B}{B}-\sum_{t=1}^{T}\widehat{Z}_{t,i}\right)$ .

We combine (33), (34) and (35) and get

\displaystyle\mathbb{E}\left[e^{X}\right]\leq

\displaystyle\prod_{t=1}^{T}\exp\left(\widetilde{\mathcal{O}}\left(\frac{\log\left(T/\epsilon\right)}{t}\right)\right)=\widetilde{\mathcal{O}}\left(T/\epsilon\right).

By Markov inequality and the above results, we have

\displaystyle\mathbb{P}\left(\frac{X}{\beta}\geq\frac{\log\left(T/\epsilon^{2}\right)}{\beta}\right)\leq\widetilde{\mathcal{O}}\left(\epsilon\right),

which concludes the proof.

Appendix C Additional Experimental Results

Algorithm 1 is empirically studied here. In Figure 5, we plot both the regret and error of estimators of Algorithm 1. As a consequence of Lemma 3, information about all node accumulates linearly, and the regret will stopped increasing after a certain point. Note that this does not contradict Theorem 1 or Theorem 2. The reason is that this figure shows the regret and error of estimation for a given instance, whereas the lower bound theorems assert the existence of some instance (not the given instance) on which no algorithm can beat an $\Omega(\sqrt{T})$ lower bound. Since this some instance must satisfy Assumption 1, we show that bandit problems with random walk feedback is not easier than their standard counterpart. The right subfigure in Figure 5 shows that the estimation errors quickly drops to zero, which shows that the flat regret curve is a consequence of learning, not a consequence of luck.

	$\displaystyle D_{kl}(\mathbb{P}_{\mathfrak{J},\pi}\\|\mathbb{P}_{\mathfrak{J}^{\prime},\pi})$
$\displaystyle=$	$\displaystyle\sum_{J_{1}\in\{1,2\}}\mathbb{P}(J_{1}\rvert\pi)D_{kl}(\mathbb{Q}_{J_{1}}\\|\mathbb{Q}_{J_{1}}^{\prime})$
	$\displaystyle+\sum_{t=2}^{T}\sum_{J_{t}\in\{1,2\}}\mathbb{P}(J_{t}\rvert\pi,J_{1},\cdots,J_{t-1})D_{kl}(\mathbb{Q}_{J_{t}}\\|\mathbb{Q}_{J_{t}}^{\prime}).$	(4)

	$\displaystyle\;D_{kl}(\mathbb{Q}_{1}\\|\mathbb{Q}_{1}^{\prime})$
$\displaystyle=$	$\displaystyle\;D_{kl}(\mathbb{Q}_{1}(X_{1})\\|\mathbb{Q}_{1}^{\prime}(X_{1}))$
	$\displaystyle+\mathbb{Q}_{1}(X_{1}=1)D_{kl}(\mathbb{Q}_{1}(X_{2},\cdots\rvert X_{1}=1)\\|\mathbb{Q}_{1}^{\prime}(X_{2},\cdots\rvert X_{1}=1))$
	$\displaystyle+\mathbb{Q}_{1}(X_{1}=2)D_{kl}(\mathbb{Q}_{1}(X_{2},\cdots\rvert X_{1}=2)\\|\mathbb{Q}_{1}^{\prime}(X_{2},\cdots\rvert X_{1}=2))$
$\displaystyle=$	$\displaystyle\;D_{kl}(\mathbb{Q}_{1}(X_{1})\\|\mathbb{Q}_{1}^{\prime}(X_{1}))$
	$\displaystyle+\mathbb{Q}_{1}(X_{1}=1)D_{kl}(\mathbb{Q}_{1}\\|\mathbb{Q}_{1}^{\prime})+\mathbb{Q}_{1}(X_{1}=2)D_{kl}(\mathbb{Q}_{2}\\|\mathbb{Q}_{2}^{\prime}),$	(7)

	$\displaystyle\;D_{kl}(\mathbb{Q}_{2}\\|\mathbb{Q}_{2}^{\prime})$
$\displaystyle=$	$\displaystyle\;D_{kl}(\mathbb{Q}_{2}(X_{1})\\|\mathbb{Q}_{2}^{\prime}(X_{1}))$
	$\displaystyle+\mathbb{Q}_{2}(X_{1}=1)D_{kl}(\mathbb{Q}_{1}\\|\mathbb{Q}_{1}^{\prime})+\mathbb{Q}_{2}(X_{1}=2)D_{kl}(\mathbb{Q}_{2}\\|\mathbb{Q}_{2}^{\prime}),$	(8)