Towards Fundamental Limits of Multi-armed Bandits with Random Walk Feedback
Abstract
In this paper, we consider a new Multi-Armed Bandit (MAB) problem where arms are nodes in an unknown and possibly changing graph, and the agent (i) initiates random walks over the graph by pulling arms, (ii) observes the random walk trajectories, and (iii) receives rewards equal to the lengths of the walks. We provide a comprehensive understanding of this problem by studying both the stochastic and the adversarial setting. We show that this problem is not easier than a standard MAB in an information theoretical sense, although additional information is available through random walk trajectories. Behaviors of bandit algorithms on this problem are also studied.
1 Introduction
Multi-Armed Bandit (MAB) problems simultaneously call for exploitation of good options and exploration of the decision space. Algorithms for this problem find applications in various domains, from medical trials (Robbins, 1952) to online advertisement (Li et al., 2010). Many authors have studied bandit problems from different perspectives.
In this paper, we study a new bandit learning problem where the feedback is a random walk over the arms. That is, each time an arm/node is played, one observes a random walk over the arms/nodes from the played node to an absorbing node, and the reward/loss is the length of this random walk. In this learning setting, we want to carefully select the nodes at which to initialize random walks, so that the hitting time to the absorbing node is maximized/minimized. This learning protocol captures important problems in computational social networks. In particular, it encapsulates an online learning version of the influence maximization problem (Kempe et al., 2003); see, e.g., Arora et al. (2017) for a survey on influence maximization. The goal of influence maximization is to find a node from which a diffusion can propagate through the network and influence as many nodes as possible. Our problem provides an online learning formulation of influence maximization, where the diffusion is modeled as a random walk over the graph. As a concrete example, the word-of-mouth rating of a movie on social networks can be captured by the influence maximization model (Arora et al., 2017).
More formally, we consider the following model. The environment is modeled by a graph whose node set consists of transient nodes and one absorbing node. Each edge encodes two quantities: the probability of transiting along it and its length. In each epoch, we pick a transient node to start a random walk, and observe the random walk trajectory from the selected node to the absorbing node. For each random walk, we use its hitting time to the absorbing node to model how long-lasting it is. With this formulation, we can define a bandit learning problem: in each epoch, the agent picks a transient node to start a random walk, observes the trajectory, and receives the hitting time of the random walk as reward. In this setting, the performance of learning algorithms is typically measured by regret, which is the difference between the rewards of an optimal node and the rewards of the nodes played by the algorithm. Unlike in standard multi-armed bandit problems, the feedback consists of random walk trajectories and thus reveals information not only about the node played, but also about the environment (transitions/distances among nodes).
Interestingly, the extra information carried by the random walks does not trivialize the learning problem in a minimax sense. Intuitively, the sample/event space describes how much information the feedback can carry. If we execute a policy for a number of epochs on a problem instance, the sample space is a product, over epochs, of the space of all outcomes that a single trajectory can generate: a single step of a trajectory may visit any node and traverse an edge of any admissible length, and the trajectory can be arbitrarily long. This sample space contains much richer information than that of a standard MAB problem, whose sample space consists only of the observed rewards. One might therefore expect that each feedback carries so much more information that our problem is strictly easier than a standard MAB. However, we prove information theoretical lower bounds for our problem of the same order as those for standard MABs. In other words, even though each trajectory carries much more information than a reward sample (and has a chance of revealing all information about the environment when the trajectory is long), no algorithm can beat the standard minimax rate.
In summary, the contributions of our paper are as follows.
1. We propose a new online learning problem that is compatible with important computational social network problems, including influence maximization, as discussed above.

2. We prove lower bounds for this problem over the sample space of random walk trajectories. Our results show that, although random walk trajectories carry much more information than reward samples, the additional information does not make the problem easier in a minimax sense: no algorithm can beat a worst-case lower bound of the same order as that of standard MABs. These information theoretical findings are discussed in Section 3.

3. We propose algorithms for bandit problems with random walk feedback, and show that their performance improves over that of standard MAB algorithms.
1.1 Related Work
Bandit problems date back to at least Thompson (1933), and have been studied extensively in the literature. One of the most popular approaches to the stochastic bandit problem is the family of Upper Confidence Bound (UCB) algorithms (Robbins, 1952; Lai and Robbins, 1985; Auer, 2002). Various extensions of UCB algorithms have been studied (Srinivas et al., 2010; Abbasi-Yadkori et al., 2011; Agrawal and Goyal, 2012; Bubeck and Slivkins, 2012; Seldin and Slivkins, 2014). Specifically, some works use KL-divergences to construct the confidence bound (Lai and Robbins, 1985; Garivier and Cappé, 2011; Maillard et al., 2011), or include variance estimates within the confidence bound (Audibert et al., 2009; Auer and Ortner, 2010). UCB is also used in the contextual learning setting (e.g., Li et al., 2010; Krause and Ong, 2011; Slivkins, 2014), and in other feedback settings, including the stochastic combinatorial bandit problem (Chen et al., 2013, 2016). Parallel to the stochastic setting, studies of the adversarial bandit problem form another line of literature. Since the randomized weighted majority algorithm (Littlestone and Warmuth, 1994), exponential weighting has remained a leading strategy for adversarial bandits (Auer et al., 1995; Cesa-Bianchi et al., 1997; Auer et al., 2002). Many efforts have been made to improve and extend exponential weights algorithms. For example, Kocák et al. (2014) target variance reduction via implicit exploration, and Mannor and Shamir (2011) and Alon et al. (2013) study partially observable settings. Despite this large body of literature, to the best of our knowledge no previous work has explicitly focused on problems where the feedback is a random walk.
For both stochastic and adversarial bandits, lower bounds in different scenarios have been derived, starting with the asymptotic lower bounds for consistent policies of Lai and Robbins (1985). Worst-case lower bounds have also been derived for the stochastic setting (Auer et al., 1995). In addition to the classic stochastic setting, lower bounds in other stochastic (or stochastic-like) settings have also been considered, including PAC-learning complexity (Mannor and Tsitsiklis, 2004), best arm identification complexity (Kaufmann et al., 2016; Chen et al., 2017), and lower bounds in continuous spaces (Kleinberg et al., 2008). Lower bound problems for adversarial bandits can in many cases be converted to lower bound problems for stochastic bandits (Auer et al., 1995). Yet none of the above works covers lower bounds for our setting.
The Stochastic Shortest Path problem (with adversarial edge lengths) (e.g., Bertsekas and Tsitsiklis, 1991; Neu et al., 2012; Rosenberg and Mansour, 2020) and online MDP problems (e.g., Even-Dar et al., 2009; Gergely Neu et al., 2010; Dick et al., 2014; Jin et al., 2019) are related to our problem. However, these settings are fundamentally different because of the sample space generated by the possible trajectories. In all previously studied settings, a control is applied at each step and a regret is immediately incurred; a trajectory that unrolls freely, possibly to arbitrary length, cannot occur in those settings. In other words, every trajectory in previous works is effectively of length one, since a control is imposed and a regret is incurred every time the state changes.
2 Problem Setting
The learning process repeats for a number of epochs, and the learning environment in each epoch is described by a graph. Each graph is defined on the same node set, consisting of transient nodes and one absorbing node. We distinguish between the set of transient nodes and the full node set that also contains the absorbing node. On this node set, the graph of each epoch encodes transition probabilities and edge lengths: for every ordered pair of nodes, it specifies the probability of transiting from the first node to the second and the length of the corresponding edge (at that epoch). We gather the transition probabilities among transient nodes into a transition matrix, about which we make the following assumption.
Assumption 1.
The transition matrix among transient nodes is primitive (a matrix is primitive if some positive integer power of it has all entries positive). In addition, there is an absolute constant strictly smaller than 1 that upper bounds its infinity norm, i.e., the maximum absolute row sum.
In Assumption 1, primitivity ensures that the random walk can reach any transient node from any other transient node. The infinity norm of the transition matrix being strictly smaller than 1 means that, starting from any node, the random walk eventually transits to the absorbing node with probability 1. This describes the absorptiveness of the environment. The infinity norm assumption can be replaced by assumptions on other matrix norms. Unless otherwise stated, we assume the bound in Assumption 1 is an absolute constant independent of the number of nodes and the number of epochs.
Playing a node at a given epoch generates a random walk trajectory: it starts at the played node, ends at the absorbing node, and records each visited node along with the length of each traversed edge; the number of edges in the trajectory is its (random) number of steps. For simplicity, we drop epoch indices when they are clear from context.
For a random trajectory, its length (equivalently, the hitting time of the played node to the absorbing node at that epoch) is defined as the sum of the edge lengths along the trajectory. We use edge lengths to represent the reward of the trajectory; in practice, edge lengths may have real-world meanings, e.g., the outgoing edge from a node may represent the utility (such as profit) of visiting that node. At each epoch, the agent selects a node to initiate a random walk, and observes the resulting trajectory. In stochastic environments, the environment does not change across epochs; thus, for any fixed starting node, the random trajectories are independently and identically distributed.
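To make the feedback model concrete, the following minimal Python sketch simulates one epoch under assumed data structures (a row-substochastic transition matrix among transient nodes, per-node absorption probabilities, and a matrix of edge lengths); the names and values here are illustrative, not taken from the paper.

```python
import numpy as np

def sample_trajectory(start, P, absorb_prob, lengths, rng):
    """Simulate one random walk from `start` until absorption.

    P[i, j]        : probability of moving from transient node i to transient node j
    absorb_prob[i] : probability of moving from node i directly to the absorbing node
                     (so that P[i].sum() + absorb_prob[i] == 1)
    lengths[i, j]  : edge length from i to j; the last column is the edge to absorption
    Returns the visited transient nodes and the traversed edge lengths;
    the hitting time (the reward) is the sum of the returned edge lengths.
    """
    visited, traversed = [start], []
    current = start
    while True:
        if rng.random() < absorb_prob[current]:      # the walk is absorbed
            traversed.append(lengths[current, -1])
            return visited, traversed
        nxt = int(rng.choice(len(P), p=P[current] / P[current].sum()))
        traversed.append(lengths[current, nxt])
        visited.append(nxt)
        current = nxt

rng = np.random.default_rng(0)
P = np.array([[0.3, 0.4], [0.4, 0.3]])   # transitions among two transient nodes
absorb = 1.0 - P.sum(axis=1)             # remaining probability mass goes to absorption
L = np.ones((2, 3))                      # unit edge lengths; last column is the absorbing edge
nodes, edges = sample_trajectory(0, P, absorb, L, rng)
print(nodes, sum(edges))                 # trajectory and its hitting time
```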
3 Information Theoretical Properties
Consider the case where the graphs do not change across epochs. To solve this problem, one can estimate the expected hitting times of all nodes (and maintain confidence intervals for the estimates). As one would expect, a random walk trajectory reveals more information than a single reward sample. Naturally, this allows us to reduce the problem to a standard (stochastic) MAB problem.
3.1 Reduction to Standard MAB
Recall that each epoch produces one trajectory. For a node, consider the trajectories that cover it, i.e., that visit it at least once, ordered by epoch. From each such trajectory, extract the sample given by the sum of edge lengths between the first occurrence of the node and the absorbing node. One has the following proposition, due to the Markov property.
Proposition 1.
In the stochastic setting, for any transient node, each sample extracted from a trajectory covering that node has the same distribution as the hitting time of that node.
Proof.
In a trajectory, conditioning on the first occurrence of a node being known (and revealing no future information), the randomness of the remainder of the trajectory is identical to the randomness of a fresh random walk started at that node. Note that even if a trajectory visits a node multiple times, only one hitting time sample can be used per trajectory: extracting multiple samples would break Markovianity, since it would reveal that the random walk visits the same node again. ∎
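As an illustration of this one-sample-per-trajectory rule, the sketch below (assuming a trajectory is given as the list of visited transient nodes and the list of traversed edge lengths; the function name is ours) extracts at most one hitting time sample for a given node.

```python
def extract_hitting_sample(nodes, edge_lengths, v):
    """Return one hitting-time sample for node v from a single trajectory.

    nodes        : visited transient nodes, in order (absorbing node implicit at the end)
    edge_lengths : lengths of the traversed edges; len(edge_lengths) == len(nodes)
    Only the suffix after the FIRST occurrence of v is used, so that the sample
    is distributed as the hitting time of v (Proposition 1); returns None if v
    is not covered by this trajectory.
    """
    if v not in nodes:
        return None
    first = nodes.index(v)
    return sum(edge_lengths[first:])

# Example: trajectory 0 -> 1 -> 0 -> absorbing node, unit edge lengths
print(extract_hitting_sample([0, 1, 0], [1.0, 1.0, 1.0], 1))  # 2.0
```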
For each node, we track two counts: the number of times the node is played, and the number of times it is covered by a trajectory.
By Proposition 1, information about node rewards (hitting times to the absorbing node) accumulates faster than in standard MAB problems: the number of trajectories covering a node is at least the number of times the node is played. Thus, solving this problem is not hard: one can extract hitting time estimates and apply a standard algorithm (e.g., UCB) on top of them. However, some information is lost when we only extract hitting time samples, since trajectories also carry additional information (e.g., about transition probabilities and graph structure). The intriguing question to ask is:
• Do we give up too much information by only extracting hitting time samples from trajectories? (Q)
We show that, although more information than reward samples is available from the random walk trajectories, no algorithm can beat a worst-case lower bound of the same order as that of standard MABs. This provides an answer to (Q).
Theorem 1.
For any given time horizon and any policy, there exists a problem instance satisfying Assumption 1 such that: (1) the probability of visiting any node from any other node is larger than an absolute constant; (2) for any value of the gap parameter, the regret of the policy on the instance is lower bounded by a corresponding gap-dependent quantity. In particular, choosing the gap parameter appropriately yields a problem instance on which the regret of any algorithm is lower bounded by a quantity of the same order as the minimax lower bound for standard MABs.
To prove Theorem 1, we construct problem instances whose optimal nodes are almost indistinguishable. In the construction, we ensure that the nodes visit each other with a constant probability. If, instead, the nodes visited each other with an arbitrarily small probability, they would be essentially disconnected and the problem would be too similar to a standard MAB problem. We use the instances illustrated in Fig. 1 for the proof. As shown in Fig. 1, node 1 and node 2 are connected with constant probability. This prevents our construction from collapsing to a standard MAB problem, in which the non-absorbing nodes are disconnected.
Proof of Theorem 1.
We construct two “symmetric” problem instances, each on two transient nodes and one absorbing node. All edges in both instances have length 1. The two instances differ only in their transition probabilities among the transient nodes: the roles of node 1 and node 2 are swapped between the two instances, as shown in Figure 1. The expected hitting time of each node can be expressed recursively as one step plus the transition-weighted average of the expected hitting times of the nodes it can move to. Writing this relation for both transient nodes gives a linear system in the two expected hitting times, with the transition matrix among transient nodes as coefficient matrix and the all-one vector as constant term. Solving this system yields the optimality gap between the two transient nodes, which is the same in both instances.
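In our own notation (not necessarily the paper's), writing \(h\) for the vector of the two expected hitting times, \(P\) for the transition matrix among the transient nodes, and \(\mathbf{1}\) for the all-one vector, the recursion for unit edge lengths reads
\[
h \;=\; \mathbf{1} + P h \qquad\Longrightarrow\qquad h \;=\; (I - P)^{-1}\mathbf{1},
\]
where the inverse exists because Assumption 1 forces the spectral radius of \(P\) to be strictly smaller than 1.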
Fix any algorithm and any time horizon. We use two probability measures to denote the laws of running the algorithm on the first and on the second instance, respectively, for the given number of epochs.
Since the relevant event is measurable under both measures, we can lower bound the sum of the regrets on the two instances via the definition of the total variation distance and the Bretagnolle-Huber inequality.
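For reference, the Bretagnolle-Huber inequality in generic notation (\(P\) and \(Q\) are probability measures on a common space and \(A\) is any event) states
\[
P(A) + Q(A^{c}) \;\ge\; \frac{1}{2}\exp\bigl(-\mathrm{KL}(P \,\|\, Q)\bigr).
\]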
Let two further measures denote the laws of the trajectory generated by playing a given node in the first and in the second instance, respectively. We can then decompose the KL-divergence between the two epoch-level measures.
By the chain rule for KL-divergence, the divergence between the two interaction laws decomposes into a sum, over epochs, of the expected divergences between the trajectory distributions of the node played at each epoch. Since the policy must pick either node 1 or node 2 at every epoch, each per-epoch term can be bounded by the larger of the two nodes' trajectory divergences, which removes the dependence on the policy.
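In generic notation of our own (\(T\) epochs, \(x_t\) the node played at epoch \(t\), \(P_i^{x}\) the trajectory law of node \(x\) under instance \(i\), and \(\mathbb{P}_i\) the law of the whole interaction under instance \(i\)), this chain-rule step is the standard divergence decomposition:
\[
\mathrm{KL}\bigl(\mathbb{P}_1 \,\|\, \mathbb{P}_2\bigr)
\;=\; \sum_{t=1}^{T} \mathbb{E}_{1}\Bigl[\mathrm{KL}\bigl(P_1^{x_t} \,\|\, P_2^{x_t}\bigr)\Bigr]
\;\le\; T\,\max_{x \in \{1,2\}} \mathrm{KL}\bigl(P_1^{x} \,\|\, P_2^{x}\bigr).
\]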
Next we study the trajectory distributions. With the edge lengths fixed, the sample space of a trajectory is the set of finite sequences over the two transient nodes, since the trajectory can be arbitrarily long and each node on it can be either of the two. To describe the two trajectory distributions, we use the sequence of random variables recording the node visited at each step of the trajectory generated by playing a given node.
By the Markov property, the conditional distribution of each step of the trajectory given the past depends only on the current node, and the per-step conditional distributions of the two instances differ only through the swapped transition probabilities. Decomposing the trajectory distribution step by step and applying the chain rule of KL-divergence, the divergence between the two trajectory distributions of a node becomes a sum of per-step divergences, each weighted by the probability that the trajectory has not yet been absorbed at that step. A similar argument applies to the other node. Combining these bounds with the decomposition above yields the claimed lower bound. ∎
Theorem 1 covers the two-node case. A similar result for the general multi-node case is given in Theorem 2.
Theorem 2.
Let Assumption 1 hold. For any given number of epochs and any policy, there exist problem instances such that (1) all problem instances have the same number of nodes, (2) all transient nodes are connected with the same probability, and the probability of hitting the absorbing node from any transient node is a constant independent of the number of nodes and the number of epochs, and (3) the expected regret of the policy on at least one of these instances is lower bounded by a quantity of the same order as the minimax lower bound for standard MABs, where the expectation is with respect to the distribution generated by the instance and the policy.
4 Algorithms for Multi-Armed Bandits with Random Walk Feedback
Perhaps the two most well-known algorithms for bandit problems are the UCB algorithm and the EXP3 algorithm. In this section, we study the behavior of these two algorithms on bandit problems with random walk feedback.
4.1 UCB Algorithm for the Stochastic Setting
As discussed previously, the problem with random walk feedback can be reduced to a standard MAB problem, and the UCB algorithm solves it as one would expect. We present the UCB algorithm here and provide a regret analysis for it. Recall that the regret is defined as the cumulative difference, over epochs, between the largest expected hitting time and the expected hitting time of the node played by the algorithm at each epoch.
For a transient node, the hitting time estimator is computed from the samples extracted from the trajectories that cover it. By Proposition 1, each extracted sample is an identically distributed copy of the node's hitting time.
We also need confidence intervals for these estimators. Given the trajectories covering a node, we attach a confidence term at each epoch; the confidence term involves a truncation parameter, which is needed because the reward distribution is not sub-Gaussian. Alternatively, one can use robust estimators for bandits with heavy-tailed rewards for this task (Bubeck et al., 2013).
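The sketch below shows one way such a UCB rule could be implemented on top of trajectory feedback; the bonus form, the truncation level, and the environment interface env(node, rng) (for example, a closure around the sample_trajectory sketch in Section 2) are placeholders of ours, not the constants of Algorithm 1.

```python
import math
import numpy as np

def ucb_with_trajectory_feedback(env, num_nodes, T, trunc=50.0, rng=None):
    """UCB on hitting-time samples extracted from random walk trajectories.

    env(node, rng) must return (visited_nodes, edge_lengths) for one trajectory.
    Every node covered by a trajectory contributes one sample (Proposition 1),
    so counts can grow faster than the number of plays of a node.
    The truncation level and the bonus constants are illustrative only.
    """
    rng = rng or np.random.default_rng(0)
    sums = np.zeros(num_nodes)    # sums of (truncated) hitting-time samples
    counts = np.zeros(num_nodes)  # number of trajectories covering each node
    for t in range(1, T + 1):
        bonus = np.where(counts > 0,
                         trunc * np.sqrt(2.0 * math.log(t + 1) / np.maximum(counts, 1.0)),
                         np.inf)
        means = np.divide(sums, counts, out=np.zeros(num_nodes), where=counts > 0)
        node = int(np.argmax(means + bonus))   # optimistic choice (reward = hitting time)
        visited, lengths = env(node, rng)
        for v in set(visited):                 # one sample per covered node
            first = visited.index(v)
            sums[v] += min(sum(lengths[first:]), trunc)
            counts[v] += 1
    return np.divide(sums, counts, out=np.zeros(num_nodes), where=counts > 0)
```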
4.2 EXP3 Algorithm for Adversarially Chosen Edge Lengths
In this section, we consider the case in which the network structure changes over time, and study a version of the problem in which an adversary alters the edge lengths across epochs: in each epoch, the adversary can arbitrarily pick edge lengths in [0, 1]. In this case, performance is measured by the regret against playing any fixed node, namely the difference between the cumulative reward of that fixed node and the cumulative reward of the nodes played by the algorithm. Since the realized hitting times concentrate around their expectations, a high probability bound on the expected-reward version of this regret naturally provides a high probability bound on the realized regret.
We define a notion of centrality that will aid our discussion.
Definition 1.
Consider the nodes appearing on a random trajectory. Under Assumption 1, we define, for each node, its hitting centrality, which measures the probability that the node appears on a random trajectory initiated at other nodes. We also define an aggregate of these hitting centralities over all nodes, which enters our regret bound.
For a node with positive hitting centrality, information about it is revealed even if it is not played. This quantity will show up in the regret bound in some cases, as we discuss later.
We use a version of the exponential weights algorithm to solve this adversarial problem, and provide a high probability guarantee using a new concentration lemma (Lemma 1). As background, exponential weights algorithms maintain a probability distribution over the choices that gives higher weight to historically more rewarding nodes. In each epoch, a node is sampled from this distribution, and the observed information is recorded. To formally describe the strategy, we now introduce some notation. From each trajectory, we extract a sample for every node covered by it, as defined below.
Given a trajectory, we define, for every node covered by it, a sample equal to the distance (sum of edge lengths) from the first occurrence of that node to the absorbing node.
By the principle of Proposition 1, if a node is covered by the trajectory of the current epoch, this quantity is a sample of the node's hitting time at that epoch.
For each ordered pair of nodes, define an indicator random variable that equals one if and only if both nodes appear in the trajectory and the first occurrence of the second node comes after the first occurrence of the first. Aggregating these indicators yields an estimator of how likely a node is to be visited via a trajectory starting at another node, i.e., an estimator of the corresponding visit probability. Using these visit-probability estimators and the samples defined above, we construct a loss estimator for every node, governed by two algorithm parameters to be specified later; one of them serves as an implicit exploration parameter (Neu, 2015). We estimate a shifted version of the reward rather than the reward itself: this shift guarantees that the estimator stays within a bounded range with high probability, and such shifting is a common variance reduction trick for EXP3-type algorithms (e.g., Lattimore and Szepesvári, 2020). In addition, a small bias is introduced through the implicit exploration parameter. Summing the loss estimators over past epochs (initialized at zero by convention), the probability of playing a node at an epoch is obtained by exponential weighting of these cumulative estimates with a learning rate.
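The sketch below illustrates an exponential-weights strategy of this flavor: one loss sample per covered node, an empirical estimate of each node's observation probability, and implicit exploration in the denominator (Neu, 2015). The constants, the loss scaling, and the coverage estimator are illustrative placeholders of ours, not the exact choices of Algorithm 2.

```python
import numpy as np

def exp3_ix_with_trajectories(env, num_nodes, T, eta=0.05, gamma=0.05, rng=None):
    """Exponential weights with implicit exploration on trajectory feedback.

    env(node, rng) returns (visited_nodes, edge_lengths) for one trajectory.
    Losses are (truncated) hitting times rescaled to [0, 1]. Every covered node
    yields one sample; the sample is divided by an estimate of its observation
    probability plus gamma (implicit exploration). All constants are placeholders.
    """
    rng = rng or np.random.default_rng(0)
    cum_loss_hat = np.zeros(num_nodes)
    plays = np.ones(num_nodes)                  # times each node was played (smoothed)
    covers = np.ones((num_nodes, num_nodes))    # covers[u, v]: plays of u whose walk hit v
    scale = 50.0                                # assumed truncation level for hitting times
    for _ in range(T):
        logits = -eta * cum_loss_hat
        p = np.exp(logits - logits.max())
        p /= p.sum()                            # exponential-weights sampling distribution
        node = int(rng.choice(num_nodes, p=p))
        visited, lengths = env(node, rng)
        plays[node] += 1
        q = p @ (covers / plays[:, None])       # q[v]: estimated prob. of observing node v
        for v in set(visited):                  # one loss sample per covered node
            covers[node, v] += 1
            first = visited.index(v)
            loss = min(sum(lengths[first:]), scale) / scale
            cum_loss_hat[v] += loss / (q[v] + gamma)
    return p
```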
Against any fixed arm, following this sampling rule guarantees a high-probability regret bound. We summarize our strategy in Algorithm 2, and state the performance guarantee in Theorem 3.
Theorem 3.
Let an aggregate quantity be defined from the hitting centralities of the individual nodes. Fix any confidence level. If the time horizon is large enough and the algorithm parameters are set appropriately as functions of the horizon and of the hitting centralities, then, with probability exceeding the prescribed level, the regret of Algorithm 2 against any fixed node is bounded by a sublinear quantity that improves with the hitting centralities.
With the parameters set as in Theorem 3, Algorithm 2 achieves a high-probability regret guarantee that improves on the standard EXP3 guarantee. Figure 2 illustrates how the regret bound scales with the problem parameters, and shows that a multiplicative factor is saved. An empirical comparison between Algorithm 2 and the standard EXP3 algorithm (which does not use the trajectory-based samples) is given in Section 5.
[Figure 2: scaling of the regret bound of Algorithm 2.]
4.2.1 Analysis of Algorithm 2
In this section we present a proof of Theorem 3. In general, the proof follows the standard recipe for analyzing EXP3-type algorithms, which is in turn a special case of the Follow-The-Regularized-Leader (FTRL) or Mirror Descent framework. Below we include the non-standard intermediate steps arising from the special feedback structure of the problem studied. Further details are deferred to the Appendix.
By the exponential weights argument (Littlestone and Warmuth, 1994; Auer et al., 2002), under a suitable high probability event, the cumulative estimated loss of the algorithm exceeds that of any fixed node by at most a logarithmic term divided by the learning rate, plus the learning rate times a sum of second-order terms in the estimators.
Lemma 1.
For any admissible choice of the confidence parameter and of the weighting coefficients, with probability at least the prescribed level, the weighted sum of the estimation errors is bounded by a term logarithmic in the inverse failure probability.
Lemma 1 can be viewed as a one-sided, regularized version of Hoeffding's inequality, with a regularization parameter. Its proof uses the Markov inequality and can be found in the Appendix.
Lemma 2.
With probability exceeding the prescribed level, the cumulative loss estimates are close to the corresponding cumulative true losses, in both directions, up to lower-order terms.
Proof of Lemma 2.
First, it holds with high probability that the visit-probability estimators are bounded away from zero (Lemma 5 in the Appendix); therefore, the importance weights of the loss estimators are controlled.
Let the conditional expectation be taken with respect to all randomness right before the current epoch. Since the sampling probabilities and the visit-probability estimates are determined before the epoch starts, the suitably centered loss estimates have zero conditional mean. Thus, for any fixed node, they form a martingale difference sequence. Applying Azuma's inequality to this martingale difference sequence gives, with probability exceeding the prescribed level, a deviation bound on their sum.
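For reference, the one-sided Azuma-Hoeffding inequality used here, in generic notation: if \(X_1,\dots,X_T\) is a martingale difference sequence with \(|X_t| \le c_t\) almost surely, then for any \(\epsilon > 0\),
\[
\Pr\Bigl(\sum_{t=1}^{T} X_t \ge \epsilon\Bigr) \;\le\; \exp\Bigl(-\frac{\epsilon^{2}}{2\sum_{t=1}^{T} c_t^{2}}\Bigr).
\]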
Combining this deviation bound with Proposition 2 in the Appendix gives the first inequality of the lemma.
For the second inequality in the lemma statement, we first note that, with high probability, the loss estimators are bounded for all epochs and nodes. We can therefore control the exponential-type terms with a Taylor expansion; the resulting bound uses Proposition 2 in the Appendix together with the high probability boundedness of the relevant quantities. Since the resulting increments have non-positive conditional expectations, they form a super-martingale difference sequence. Applying Azuma's inequality to this super-martingale sequence then gives the second inequality with high probability. ∎
Proof of Theorem 3. The theorem follows by combining the exponential weights bound above with Lemma 1 and Lemma 2; the remaining details are deferred to the Appendix.
5 Experiments
We deploy our algorithms on a problem with 9 transient nodes. The results for Algorithm 2 are in Figure 4, and the results for Algorithm 1 are deferred to the Appendix. The evaluation of Algorithm 2 is performed on a problem instance with the transition matrix specified in (19).
(19)
The edge lengths are sampled from Gaussian distributions and truncated to lie between 0 and 1. Specifically, every edge length, including the lengths of edges into the absorbing node, is drawn from a Gaussian and then clipped to the interval [0, 1].
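A minimal sketch of this edge-length generation (the mean and standard deviation below are placeholders of ours; the paper's exact values accompany the matrix in (19)):

```python
import numpy as np

def sample_edge_lengths(num_transient, rng, mu=0.5, sigma=0.2):
    """Draw one epoch of edge lengths: Gaussian, clipped to [0, 1].

    Index num_transient (the last row/column) stands for the absorbing node.
    The mean and standard deviation here are illustrative placeholders.
    """
    n = num_transient + 1  # transient nodes plus the absorbing node
    raw = rng.normal(loc=mu, scale=sigma, size=(n, n))
    return np.clip(raw, 0.0, 1.0)

rng = np.random.default_rng(0)
lengths = sample_edge_lengths(9, rng)
print(lengths.shape)  # (10, 10)
```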
[Figure 4: regret of Algorithm 2 compared with the standard EXP3 algorithm.]
As shown in Figure 4, the EXP3-type algorithm performs better when it uses the trajectory-based samples, which is consistent with the guarantee provided in Theorem 3. For implementation purposes and a fair comparison, Algorithm 2 uses the trajectory-based estimator, while the baseline EXP3 uses only the reward of the played node. The learning rate of both algorithms is set to the same value.
6 Conclusion
In this paper, we study the bandit problem where the feedback consists of random walk trajectories. This problem is motivated by influence maximization in computational social science. We show that, even though substantially more information can be extracted from the random walk trajectories, such problems are not significantly easier than their standard MAB counterparts in a minimax sense. The behaviors of UCB- and EXP3-type algorithms are also studied.
References
- Abbasi-Yadkori et al., (2011) Abbasi-Yadkori, Y., Pál, D., and Szepesvári, C. (2011). Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pages 2312–2320.
- Agrawal and Goyal, (2012) Agrawal, S. and Goyal, N. (2012). Analysis of Thompson sampling for the multi-armed bandit problem. In Conference on Learning Theory, pages 39–1.
- Alon et al., (2013) Alon, N., Cesa-Bianchi, N., Gentile, C., and Mansour, Y. (2013). From bandits to experts: A tale of domination and independence. In Advances in Neural Information Processing Systems, pages 1610–1618.
- Arora et al., (2017) Arora, A., Galhotra, S., and Ranu, S. (2017). Debunking the myths of influence maximization: An in-depth benchmarking study. In Proceedings of the 2017 ACM international conference on management of data, pages 651–666.
- Audibert et al., (2009) Audibert, J.-Y., Munos, R., and Szepesvári, C. (2009). Exploration–exploitation tradeoff using variance estimates in multi-armed bandits. Theoretical Computer Science, 410(19):1876–1902.
- Auer, (2002) Auer, P. (2002). Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3(Nov):397–422.
- Auer et al., (1995) Auer, P., Cesa-Bianchi, N., Freund, Y., and Schapire, R. E. (1995). Gambling in a rigged casino: The adversarial multi-armed bandit problem. In Proceedings of IEEE 36th Annual Foundations of Computer Science, pages 322–331. IEEE.
- Auer et al., (2002) Auer, P., Cesa-Bianchi, N., Freund, Y., and Schapire, R. E. (2002). The nonstochastic multiarmed bandit problem. SIAM journal on computing, 32(1):48–77.
- Auer and Ortner, (2010) Auer, P. and Ortner, R. (2010). UCB revisited: Improved regret bounds for the stochastic multi-armed bandit problem. Periodica Mathematica Hungarica, 61(1-2):55–65.
- Bertsekas and Tsitsiklis, (1991) Bertsekas, D. P. and Tsitsiklis, J. N. (1991). An analysis of stochastic shortest path problems. Mathematics of Operations Research, 16(3):580–595.
- Bubeck et al., (2013) Bubeck, S., Cesa-Bianchi, N., and Lugosi, G. (2013). Bandits with heavy tail. IEEE Transactions on Information Theory, 59(11):7711–7717.
- Bubeck and Slivkins, (2012) Bubeck, S. and Slivkins, A. (2012). The best of both worlds: stochastic and adversarial bandits. In Conference on Learning Theory, pages 42–1.
- Cesa-Bianchi et al., (1997) Cesa-Bianchi, N., Freund, Y., Haussler, D., Helmbold, D. P., Schapire, R. E., and Warmuth, M. K. (1997). How to use expert advice. Journal of the ACM (JACM), 44(3):427–485.
- Chen et al., (2017) Chen, L., Li, J., and Qiao, M. (2017). Nearly instance optimal sample complexity bounds for top-k arm selection. In Artificial Intelligence and Statistics, pages 101–110.
- Chen et al., (2016) Chen, W., Hu, W., Li, F., Li, J., Liu, Y., and Lu, P. (2016). Combinatorial multi-armed bandit with general reward functions. In Advances in Neural Information Processing Systems, pages 1651–1659.
- Chen et al., (2013) Chen, W., Wang, Y., and Yuan, Y. (2013). Combinatorial multi-armed bandit: General framework and applications. In International Conference on Machine Learning, pages 151–159. PMLR.
- Dick et al., (2014) Dick, T., Gyorgy, A., and Szepesvari, C. (2014). Online learning in markov decision processes with changing cost sequences. In International Conference on Machine Learning, pages 512–520. PMLR.
- Even-Dar et al., (2009) Even-Dar, E., Kakade, S. M., and Mansour, Y. (2009). Online markov decision processes. Mathematics of Operations Research, 34(3):726–736.
- Garivier and Cappé, (2011) Garivier, A. and Cappé, O. (2011). The KL–UCB algorithm for bounded stochastic bandits and beyond. In Conference on Learning Theory, pages 359–376.
- Gergely Neu et al., (2010) Gergely Neu, A. G., Szepesvári, C., and Antos, A. (2010). Online markov decision processes under bandit feedback. In Proceedings of the Twenty-Fourth Annual Conference on Neural Information Processing Systems.
- Jin et al., (2019) Jin, C., Jin, T., Luo, H., Sra, S., and Yu, T. (2019). Learning adversarial mdps with bandit feedback and unknown transition. arXiv preprint arXiv:1912.01192.
- Kaufmann et al., (2016) Kaufmann, E., Cappé, O., and Garivier, A. (2016). On the complexity of best-arm identification in multi-armed bandit models. The Journal of Machine Learning Research, 17(1):1–42.
- Kempe et al., (2003) Kempe, D., Kleinberg, J., and Tardos, É. (2003). Maximizing the spread of influence through a social network. In Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 137–146.
- Kleinberg et al., (2008) Kleinberg, R., Slivkins, A., and Upfal, E. (2008). Multi-armed bandits in metric spaces. In ACM Symposium on Theory of Computing, pages 681–690. ACM.
- Kocák et al., (2014) Kocák, T., Neu, G., Valko, M., and Munos, R. (2014). Efficient learning by implicit exploration in bandit problems with side observations. In Advances in Neural Information Processing Systems, pages 613–621.
- Krause and Ong, (2011) Krause, A. and Ong, C. S. (2011). Contextual Gaussian process bandit optimization. In Advances in Neural Information Processing Systems, pages 2447–2455.
- Lai and Robbins, (1985) Lai, T. L. and Robbins, H. (1985). Asymptotically efficient adaptive allocation rules. Advances in applied mathematics, 6(1):4–22.
- Lattimore and Szepesvári, (2020) Lattimore, T. and Szepesvári, C. (2020). Bandit Algorithms. Cambridge University Press.
- Li et al., (2010) Li, L., Chu, W., Langford, J., and Schapire, R. E. (2010). A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th international conference on World wide web, pages 661–670.
- Littlestone and Warmuth, (1994) Littlestone, N. and Warmuth, M. K. (1994). The weighted majority algorithm. Information and computation, 108(2):212–261.
- Maillard et al., (2011) Maillard, O.-A., Munos, R., and Stoltz, G. (2011). A finite-time analysis of multi-armed bandits problems with Kullback-Leibler divergences. In Conference On Learning Theory, pages 497–514.
- Mannor and Shamir, (2011) Mannor, S. and Shamir, O. (2011). From bandits to experts: On the value of side-observations. In Advances in Neural Information Processing Systems, pages 684–692.
- Mannor and Tsitsiklis, (2004) Mannor, S. and Tsitsiklis, J. N. (2004). The sample complexity of exploration in the multi-armed bandit problem. Journal of Machine Learning Research, 5(Jun):623–648.
- Neu, (2015) Neu, G. (2015). Explore no more: Improved high-probability regret bounds for non-stochastic bandits. In Advances in Neural Information Processing Systems, pages 3168–3176.
- Neu et al., (2012) Neu, G., Gyorgy, A., and Szepesvári, C. (2012). The adversarial stochastic shortest path problem with unknown transition probabilities. In Artificial Intelligence and Statistics, pages 805–813. PMLR.
- Robbins, (1952) Robbins, H. (1952). Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 58(5):527–535.
- Rosenberg and Mansour, (2020) Rosenberg, A. and Mansour, Y. (2020). Adversarial stochastic shortest path. arXiv preprint arXiv:2006.11561.
- Seldin and Slivkins, (2014) Seldin, Y. and Slivkins, A. (2014). One practical algorithm for both stochastic and adversarial bandits. In International Conference on Machine Learning, pages 1287–1295.
- Slivkins, (2014) Slivkins, A. (2014). Contextual bandits with similarity information. The Journal of Machine Learning Research, 15(1):2533–2568.
- Srinivas et al., (2010) Srinivas, N., Krause, A., Kakade, S., and Seeger, M. (2010). Gaussian process optimization in the bandit setting: No regret and experimental design. In International Conference on Machine Learning.
- Thompson, (1933) Thompson, W. R. (1933). On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294.
Appendix A Proof Details for Theorem 2
We construct instances specified by graphs with identical transition structure: all transition probabilities between transient nodes are equal. The instances differ only in their edge lengths, which are chosen randomly: in the base instance, all edge lengths are sampled independently from a common distribution; in the instance favoring a particular node, the lengths of that node's outgoing edges are sampled from a slightly shifted distribution, while all other edge lengths are sampled from the common distribution. The proof is presented in three steps: Step 1 computes the KL-divergence using arguments similar to those in the main text; Steps 2 and 3 use standard arguments for lower bound proofs.
Step 1: compute the KL-divergence between the base instance and the instance favoring a given node. By the chain rule, the KL-divergence between the two interaction laws decomposes over epochs into expected per-epoch divergences between the trajectory distributions of the node played at each epoch.
Consider the nodes and edge lengths revealed at each step of the trajectory after a node is played; these per-step observations span the sample space of a trajectory.
By the Markov property, the conditional distribution of each step given the past depends only on the current node. Applying the chain rule step by step therefore decomposes the trajectory-level divergence into a sum of per-step divergences.
Recall that the edge lengths are drawn independently of the other randomness, so each per-step divergence splits into a transition part and an edge-length part; the transition parts vanish because all instances share the same transition probabilities, and only the edge lengths of the favored node's outgoing edges contribute. A direct computation then bounds the per-step edge-length divergence in terms of the size of the distributional shift.
Step 2: compute the optimality gap between nodes. Let the vector of expected hitting times in each instance be defined as before. It satisfies a linear system whose coefficient matrix is the transition matrix among transient nodes (all of whose entries are equal by construction), whose constant term involves the all-one vector, and which, in the instance favoring a given node, contains an extra term supported on that node's canonical basis vector. Solving this system shows that the optimality gap, i.e., the difference between the hitting time of the favored node and the hitting times of the other nodes in the favoring instance, is of the order of the distributional shift.
Step 3: apply Yao's principle and Pinsker's inequality to finish the proof. By Pinsker's inequality, the total variation distance between the two interaction laws is at most the square root of half their KL-divergence.
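For reference, Pinsker's inequality in generic notation (\(P\) and \(Q\) probability measures on a common space):
\[
\mathrm{TV}(P, Q) \;\le\; \sqrt{\tfrac{1}{2}\,\mathrm{KL}(P \,\|\, Q)}.
\]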
Thus, for the regret against the favored node in its own instance, we obtain a lower bound by combining Wald's identity, the Pinsker bound above, and Jensen's inequality.
Choosing the distributional shift appropriately, as a function of the number of epochs and the number of nodes, then yields the claimed lower bound, which completes the proof.
Appendix B Proof Details for Section 4
We use the following notational conventions for simplicity. (1) We use shorthand for the per-epoch quantities defined in Section 4. (2) We use one sigma-algebra for all randomness up to the end of each epoch, and another for all randomness up to the first occurrence of a given node within an epoch (or the end of the epoch if the node is not visited); conditional expectations are taken with respect to these sigma-algebras. (3) Unless otherwise stated, we use the corresponding shorthand notation throughout.
In Appendix B.1, we state some preparatory properties needed for proving Lemma 1; Proposition 2 is also included in that part. In Appendix B.2, we provide a proof of Lemma 1.
B.1 Additional Properties
As Proposition 1 suggests, the number of times a node is visited accumulates linearly with the number of epochs. We state this observation in Lemma 3 below.
Lemma 3.
For every node and every number of epochs, with high probability the number of trajectories covering the node is at least a constant fraction of the number of epochs, up to a deviation term.
Proof.
Recall that each epoch produces a trajectory and a played node. For a fixed node, consider the indicator random variables that take value 1 when the node is covered by the epoch's trajectory without being the node played at that epoch. By definition, the number of covering trajectories is at least the number of plays of the node plus the sum of these indicators, and each indicator has conditional mean bounded below in terms of the hitting centrality. Thus, by the one-sided Azuma inequality, the coverage count is, with high probability, at least a constant fraction of the number of epochs minus a deviation term.
∎
Lemma 4.
For any node, the conditional variance of its loss estimator is controlled, with high probability, by a quantity inversely proportional to the number of trajectories covering the node.
Proof.
For the variance, we bound the conditional second moment of the estimator using the importance weights. By Lemma 3 and a union bound, the coverage counts of all nodes are, with high probability, at least a constant fraction of the number of epochs. Combining the two bounds and choosing the deviation parameter appropriately concludes the proof. ∎
Next, we define a high probability event and approximate the relevant quantities under this event.
Lemma 5.
For any confidence level, define the event that the coverage counts of all nodes behave as prescribed by Lemma 3. This event holds with high probability, and under it, the visit-probability estimators are accurate for all nodes and epochs.
Proof.
Lemma 6.
For any confidence level, there is a high probability truncation event under which the conditional expectation of each trajectory-based sample equals the corresponding expected hitting time, up to a small truncation error.
Proof.
Since all edge lengths are smaller than 1, the probability that a trajectory exceeds any given length decays geometrically in that length. Thus, with high probability, every trajectory is shorter than a logarithmic threshold; we define the truncation event accordingly, and by a union bound over epochs we can choose the threshold so that the event holds with the prescribed probability. The samples also enjoy a memorylessness-type property: by the Markov property, conditioning on the first occurrence of a node, the remainder of the trajectory is distributed as a fresh walk started at that node. Inserting the truncation event into the conditional expectation gives the first claim, and a similar argument gives the second.
∎
Remark 1.
Proposition 2.
Fix any admissible parameter value. The exponential-type function appearing in our bounds is uniformly bounded by an absolute constant.
Proof.
It suffices to show that, for any argument in the relevant range, the function in question is upper bounded by the stated constant. This follows from a quick first-order test: the function attains its maximum at a single stationary point, and its value there equals the stated bound. ∎
B.2 Proof of Lemma 1
By the law of total expectation, the conditional expectation of each loss estimator can be computed by first conditioning on the randomness up to the first occurrence of the corresponding node within the epoch, which yields an explicit expression for this conditional expectation.
By Lemma 5, under the high probability event the visit-probability estimators are bounded away from zero, which lets us upper bound the loss estimators accordingly.
Using a standard elementary inequality for the exponential function on the relevant range, together with the bounds above and the constraints on the algorithm parameters, we bound the conditional moment generating function of the regularized estimation errors.
Abbreviating the regularized deviations for simplicity, and applying the Markov inequality to the exponentiated sum together with the above results, we obtain the claimed bound, which concludes the proof.
Appendix C Additional Experimental Results
Algorithm 1 is empirically studied here. In Figure 5, we plot both the regret and the estimation error of Algorithm 1. As a consequence of Lemma 3, information about all nodes accumulates linearly, so the regret stops increasing after a certain point. Note that this does not contradict Theorem 1 or Theorem 2: the figure shows the regret and estimation error on one given instance, whereas the lower bound theorems assert the existence of some instance (not this given one) on which no algorithm can beat the lower bound. Since that instance must satisfy Assumption 1, this shows that bandit problems with random walk feedback are not easier than their standard counterparts. The right subfigure of Figure 5 shows that the estimation errors quickly drop to zero, indicating that the flat regret curve is a consequence of learning, not of luck.
[Figure 5: regret (left) and estimation error (right) of Algorithm 1.]