Lipschitz Bandits with Batched Feedback
Abstract
In this paper, we study Lipschitz bandit problems with batched feedback, where the expected reward is Lipschitz and the reward observations are communicated to the player in batches. We introduce a novel landscape-aware algorithm, called Batched Lipschitz Narrowing (BLiN), that optimally solves this problem. Specifically, we show that for a $T$-step problem with Lipschitz reward of zooming dimension $d_z$, our algorithm achieves the theoretically optimal (up to logarithmic factors) regret rate $\widetilde{\mathcal{O}}\left(T^{\frac{d_z+1}{d_z+2}}\right)$ using only $\mathcal{O}(\log\log T)$ batches. We also provide complexity analysis for this problem. Our theoretical lower bound implies that $\Omega(\log\log T)$ batches are necessary for any algorithm to achieve the optimal regret. Thus, BLiN achieves the optimal regret rate (up to logarithmic factors) using minimal communication.
1 Introduction
Multi-Armed Bandit (MAB) algorithms aim to exploit the good options while exploring the decision space. These algorithms and methodologies find successful applications in artificial intelligence and reinforcement learning (e.g., [40]). While the classic MAB setting assumes that the rewards are immediately observed after each arm pull, real-world data often arrive in different patterns. For example, observations from clinical trials are often collected in a batched fashion [37]. Another example is online advertising, where strategies are tested on multiple customers at the same time [10]. In such cases, any observation-dependent decision-making should comply with this data-arrival pattern, including MAB algorithms.
In this paper, we study the Lipschitz bandit problem with batched feedback – a MAB problem where the expected reward is Lipschitz and the reward observations are communicated to the player in batches. In such settings, rewards are communicated only at the end of each batch, and the algorithm can only make decisions based on information up to the previous batch. Existing Lipschitz bandit algorithms heavily rely on timely access to the reward samples, since the partition of the arm space may change at any time. Therefore, they cannot solve the batched feedback setting. To address this difficulty, we present a novel adaptive algorithm for Lipschitz bandits with communication constraints, named Batched Lipschitz Narrowing (BLiN). BLiN learns the landscape of the reward by adaptively narrowing the arm set, so that regions of high reward are played more frequently. Also, BLiN determines the data collection procedure adaptively, so that only very few rounds of data communication are needed.
The above BLiN procedure achieves the optimal regret rate $\widetilde{\mathcal{O}}\left(T^{\frac{d_z+1}{d_z+2}}\right)$, where $d_z$ is the zooming dimension [28, 15], and can be implemented in a clean and simple form. In addition to achieving the optimal regret rate, BLiN also improves the state-of-the-art results in the following senses:
- BLiN’s time complexity is optimal: if the arithmetic operations and sampling are of complexity $\mathcal{O}(1)$, then the time complexity of BLiN is $\mathcal{O}(T)$, which improves the best known time complexity for Lipschitz bandit problems [15].
- The space complexity of BLiN also improves the best known result. This is because we do not need to store information about cubes in previous batches. The detailed time and space complexity analysis of BLiN is in Remark 1.
In Table 1 we provide a comparison of BLiN and state-of-the-art Lipschitz bandit algorithms in terms of regret bound, communication bound, time complexity and space complexity.
algorithm | regret | communication | time complexity | space complexity |
---|---|---|---|---|
Zooming [28] | $\widetilde{\mathcal{O}}\left(T^{\frac{d_z+1}{d_z+2}}\right)$ | $T$ | | |
HOO [15] | $\widetilde{\mathcal{O}}\left(T^{\frac{d_z+1}{d_z+2}}\right)$ | $T$ | | |
A-BLiN (our work) | $\widetilde{\mathcal{O}}\left(T^{\frac{d_z+1}{d_z+2}}\right)$ | $\mathcal{O}(\log\log T)$ | $\mathcal{O}(T)$ | |
1.1 Settings & Preliminaries
For a Lipschitz bandit problem (with communication constraints), the arm set is a compact doubling metric space $(\mathcal{A}, d_{\mathcal{A}})$. The expected reward $\mu : \mathcal{A} \to \mathbb{R}$ is $1$-Lipschitz with respect to the metric $d_{\mathcal{A}}$, that is, $|\mu(x_1) - \mu(x_2)| \le d_{\mathcal{A}}(x_1, x_2)$ for any $x_1, x_2 \in \mathcal{A}$.
At time $t$, the learning agent pulls an arm $x_t \in \mathcal{A}$ that yields a reward sample $y_t = \mu(x_t) + \epsilon_t$, where $\epsilon_t$ is a mean-zero independent sub-Gaussian noise. Without loss of generality, we assume that $\epsilon_t \sim \mathcal{N}(0,1)$, since generalizations to other sub-Gaussian noises are straightforward.
Similar to most bandit learning problems, the agent seeks to minimize regret in the batched feedback environment. The regret is defined as $R_T = \sum_{t=1}^{T}\left(\mu^* - \mu(x_t)\right)$, where $\mu^*$ denotes $\sup_{x \in \mathcal{A}} \mu(x)$. For simplicity, we define $\Delta_x = \mu^* - \mu(x)$ (called the optimality gap of $x$) for all $x \in \mathcal{A}$.
1.1.1 Doubling Metric Spaces and the Metric Space $\left([0,1]^d, \|\cdot\|_\infty\right)$
By Assouad’s embedding theorem [6], the (compact) doubling metric space $(\mathcal{A}, d_{\mathcal{A}})$ can be embedded into a Euclidean space with some distortion of the metric; see [46] for more discussion in a machine learning context. Due to the existence of such embeddings, the metric space $\left([0,1]^d, \|\cdot\|_\infty\right)$, where metric balls are hypercubes, is sufficient for the purpose of our paper. For the rest of the paper, we will use hypercubes in the algorithm design for simplicity, while our algorithmic idea generalizes to other doubling metric spaces.
1.1.2 Zooming Number and Zooming Dimension
An important concept for bandit problems in metric spaces is the zooming number and the zooming dimension [28, 14, 42], which we discuss now. We start with the definition of packing numbers.
Definition 1.
Let $(\mathcal{A}, d_{\mathcal{A}})$ be a metric space and let $S \subseteq \mathcal{A}$. The $r$-packing number of $S$ is the size of the largest packing of $S$ with disjoint open balls (with respect to $d_{\mathcal{A}}$) of radius $r$.
Then we define the zooming number and the zooming dimension.
Definition 2.
For a problem instance with arm set $\mathcal{A}$ and expected reward $\mu$, we let $S_r$ denote the set of $r$-optimal arms, that is, $S_r = \{x \in \mathcal{A} : \Delta_x \le r\}$. We define the $r$-zooming number $N_r$ as the $r$-packing number of $S_r$. The zooming dimension is then defined as
$$d_z = \min\left\{d \ge 0 : \exists\, c > 0 \text{ such that } N_r \le c\, r^{-d} \text{ for all } r \in (0, 1]\right\}.$$
Moreover, we define the zooming constant $C_z$ as the smallest constant $c$ for which the above inequality holds with $d = d_z$, that is, $C_z = \min\left\{c > 0 : N_r \le c\, r^{-d_z} \text{ for all } r \in (0,1]\right\}$.
The zooming dimension can be significantly smaller than the ambient dimension $d$, and can even be zero. For a simple example, consider a problem with ambient dimension $d$ and expected reward function $\mu(x) = 1 - \|x\|_2$ for $x \in [0,1]^d$. Then any $x$ with $\Delta_x \le r$ satisfies $\|x\|_2 \le r$, so $S_r$ is contained in a single ball of radius $r$ and $N_r = \mathcal{O}(1)$. Therefore, for this problem the zooming dimension equals $0$, with the zooming constant being an absolute constant.
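To make these definitions concrete, the following sketch numerically estimates a packing-based zooming number by greedy packing. It is a rough sanity check, not from the paper: the reward functions, the grid size, and the packing convention (disjoint sup-norm balls of radius $r$) are illustrative assumptions.

```python
import numpy as np

def greedy_packing_number(points, r):
    """Greedily count centers whose pairwise sup-norm distance is at least 2r,
    so that the open radius-r sup-norm balls around them are disjoint."""
    centers = []
    for p in points:
        if all(np.max(np.abs(p - c)) >= 2 * r for c in centers):
            centers.append(p)
    return len(centers)

def zooming_number_estimate(mu, d, r, n_grid=50_000, seed=0):
    """Estimate N_r: a packing number (at scale r) of S_r = {x : mu* - mu(x) <= r}."""
    rng = np.random.default_rng(seed)
    xs = rng.random((n_grid, d))              # random points in [0, 1]^d
    mu_star = 1.0                             # maximum of the toy rewards below
    near_optimal = xs[mu_star - mu(xs) <= r]  # points falling in S_r
    return greedy_packing_number(near_optimal, r)

# Toy reward with a single peak at the origin: mu(x) = 1 - ||x||_2.
peaked = lambda x: 1.0 - np.linalg.norm(x, axis=-1)
# For contrast, a flat reward makes S_r the whole cube, so N_r grows like r^{-d}.
flat = lambda x: np.ones(x.shape[:-1])

for r in [0.4, 0.2, 0.1, 0.05]:
    print(f"r={r}:  peaked N_r ~ {zooming_number_estimate(peaked, 2, r)}"
          f"   flat N_r ~ {zooming_number_estimate(flat, 2, r)}")
```

For the peaked reward the estimate stays bounded as $r$ shrinks (zooming dimension $0$), while for the flat reward it grows like $r^{-2}$ (zooming dimension equal to the ambient dimension).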
1.2 Batched feedback pattern and our results
In the batched feedback setting, for a $T$-step game, the player determines a grid $\mathcal{T} = \{t_0, t_1, \dots, t_B\}$ adaptively, where $0 = t_0 < t_1 < \cdots < t_B = T$. During the game, reward observations are communicated to the player only at the grid points $t_1, \dots, t_B$. As a consequence, for any time $t$ in the $j$-th batch, that is, $t_{j-1} < t \le t_j$, the reward cannot be observed until time $t_j$, and the decision made at time $t$ depends only on rewards up to time $t_{j-1}$. The determination of the grid $\mathcal{T}$ is adaptive in the sense that the player chooses each grid point $t_j$ based on the operations and observations up to the previous grid point $t_{j-1}$.
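A minimal sketch (not from the paper) of this feedback pattern is given below; the class and method names (`BatchedEnvironment`, `pull`, `end_batch`) are illustrative. The environment generates a reward at every pull but withholds all observations until the player closes the current batch.

```python
import numpy as np

class BatchedEnvironment:
    """Reward samples are generated at every pull but revealed only when the
    player closes the current batch, i.e., at a grid point chosen by the player."""

    def __init__(self, mu, seed=None):
        self.mu = mu                          # expected reward function
        self.rng = np.random.default_rng(seed)
        self._buffer = []                     # observations withheld within the current batch

    def pull(self, arm):
        reward = self.mu(arm) + self.rng.normal()   # standard normal noise
        self._buffer.append((arm, reward))          # nothing is returned to the player yet

    def end_batch(self):
        """Called at a grid point: all buffered observations are communicated at once."""
        observations, self._buffer = self._buffer, []
        return observations
```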
In this work, we present the BLiN algorithm to solve Lipschitz bandits under batched feedback. During the learning procedure, BLiN detects and eliminates the ‘bad area’ of the arm set in batches and partitions the remaining area according to an appropriate edge-length sequence. Our first theoretical upper bound is that with the simple Doubling Edge-length Sequence $r_m = 2^{-m}$, BLiN achieves the optimal regret rate using $\mathcal{O}(\log T)$ batches.
Theorem 1.
With high probability, the $T$-step total regret $R_T$ of BLiN with Doubling Edge-length Sequence (D-BLiN) satisfies
$$R_T \le \widetilde{\mathcal{O}}\left(T^{\frac{d_z+1}{d_z+2}}\right),$$
where $d_z$ is the zooming dimension of the problem instance. In addition, D-BLiN needs no more than $\mathcal{O}(\log T)$ rounds of communications to achieve this regret rate. Here and henceforth, $\widetilde{\mathcal{O}}(\cdot)$ omits constants and logarithmic factors.
While D-BLiN is efficient for batched Lipschitz bandits, its communication complexity is not optimal. We then propose a new edge-length sequence, which we call the Appropriate Combined Edge-length Sequence (ACE Sequence), to improve the algorithm. The idea behind this sequence is that by appropriately combining some batches, the algorithm can achieve a better communication bound without incurring increased regret. As we shall see, BLiN with the ACE Sequence (A-BLiN) achieves the regret rate $\widetilde{\mathcal{O}}\left(T^{\frac{d_z+1}{d_z+2}}\right)$ with only $\mathcal{O}(\log\log T)$ batches.
Theorem 2.
With high probability, the $T$-step total regret $R_T$ of A-BLiN satisfies
$$R_T \le \widetilde{\mathcal{O}}\left(T^{\frac{d_z+1}{d_z+2}}\right),$$
where $d_z$ is the zooming dimension of the problem instance. In addition, Algorithm 1 needs no more than $\mathcal{O}(\log\log T)$ rounds of communications to achieve this regret rate.
As a comparison, seminal works [28, 42, 15] show that the optimal regret bound for Lipschitz bandits without communication constraints, where the reward observations are immediately observable after each arm pull, is $\widetilde{\mathcal{O}}\left(T^{\frac{d_z+1}{d_z+2}}\right)$. Therefore, A-BLiN achieves the optimal regret rate for Lipschitz bandits using very few batches.
Furthermore, we provide a theoretical lower bound for Lipschitz bandits with batched feedback.
Theorem 3.
Consider Lipschitz bandit problems with time horizon $T$, ambient dimension $d$ and zooming dimension $d_z$. If $B$ rounds of communications are allowed, then for any policy $\pi$, there exists a problem instance with zooming dimension $d_z$ such that
In the lower bound analysis, we use a “linear-decaying extension” technique to construct problem instances with zooming dimension $d_z < d$. To the best of our knowledge, our construction provides the first minimax lower bound for Lipschitz bandits where the zooming dimension $d_z$ is explicitly different from the ambient dimension $d$. As a result of Theorem 3, we can derive the minimum number of rounds of communications needed to achieve the optimal regret bound for the Lipschitz bandit problem, which is stated in Corollary 1. The proof of Corollary 1 is deferred to Appendix A.
Corollary 1.
For Lipschitz bandit problems with ambient dimension $d$, zooming dimension $d_z$ and time horizon $T$, any algorithm needs $\Omega(\log\log T)$ rounds of communications to achieve the optimal regret rate $\widetilde{\mathcal{O}}\left(T^{\frac{d_z+1}{d_z+2}}\right)$.
Consequently, the BLiN algorithm is optimal in terms of both regret and communication.
1.3 Related Works
The history of the Multi-Armed Bandit (MAB) problem dates back to Thompson [45]. Solvers for this problem include the UCB algorithms [30, 4, 7], the arm elimination method [20, 35, 39], the $\epsilon$-greedy strategy [7, 43], and the exponential weights and mirror descent framework [8].
Recently, with the prevalence of distributed computing and large-scale field experiments, the batched feedback setting has captured attention (e.g., [17]). Perchet et al. [36] mainly consider batched bandits with two arms, and prove a matching lower bound for static grids. This was then generalized by Gao et al. [22] to finite-armed bandit problems; the authors design an elimination method for the finite-armed bandit problem and prove matching lower bounds for both static and adaptive grids. Soon afterwards, Zhang et al. [49] study inference for batched bandits. Esfandiari et al. [19] study batched linear bandits and batched adversarial bandits. Han et al. [24] and Ruan et al. [38] provide solutions for batched contextual linear bandits. Li and Scarlett [31] study batched Gaussian process bandits. Batched dueling bandits have also been studied by Agarwal et al. [2]. Parallel to the regret control regime, best arm identification with a limited number of batches was studied in [1] and [25]. Top-$k$ arm identification in the collaborative learning framework is also closely related to the batched setting, where the goal is to minimize the number of iterations (or communication steps) between agents. In this setting, tight bounds have been obtained in the recent works [44, 26]. Yet the problem of Lipschitz bandits with communication constraints remains unsolved.
The Lipschitz bandit problem is important in its own right. It was introduced as “continuum-armed bandits” [3], where the arm space is a compact interval. Along this line, bandits that are Lipschitz (or Hölder) continuous have been studied. For this problem, Kleinberg [27] proves an $\Omega(T^{2/3})$ lower bound and introduces a matching algorithm. Under extra conditions on top of Lipschitzness, a regret rate of $\widetilde{\mathcal{O}}(\sqrt{T})$ was achieved [9, 18]. For general (doubling) metric spaces, the Zooming bandit algorithm [28] and the Hierarchical Optimistic Optimization (HOO) algorithm [15] were developed. In more recent years, some attention has been focused on Lipschitz bandit problems with certain extra structures. To name a few, Bubeck et al. [16] study Lipschitz bandits for differentiable rewards, which enables algorithms to run without explicitly knowing the Lipschitz constants. Wang et al. [47] study discretization-based Lipschitz bandit algorithms from a Gaussian process perspective. Magureanu et al. [33] derive a new concentration inequality and study discrete Lipschitz bandits. The idea of robust mean estimators [11, 5, 13] was applied to the Lipschitz bandit problem to cope with heavy-tailed rewards, leading to the development of a near-optimal algorithm for Lipschitz bandits with heavy-tailed rewards [32]. Wanigasekara and Yu [48] study Lipschitz bandits where clustering is used to infer the underlying metric. Contextual Lipschitz bandits have been studied by Slivkins [42]. Contextual bandits with continuous actions have also been studied by Krishnamurthy et al. [29] and Majzoubi et al. [34] through a smoothness approach. Yet all of the existing works on Lipschitz bandits assume that the reward sample is immediately observed after each arm pull, and none of them solve the Lipschitz bandit problem with communication constraints.
This paper is organized as follows. In Section 2, we introduce the BLiN algorithm and give a visual illustration of the algorithm procedure. In Section 3, we prove that BLiN with the ACE Sequence achieves the optimal regret rate using only $\mathcal{O}(\log\log T)$ rounds of communications. Section 4 provides information-theoretic lower bounds for Lipschitz bandits with communication constraints, which show that BLiN is optimal in terms of both regret and rounds of communications. Experimental results are presented in Section 5.
2 Algorithm
With communication constraints, the agent’s knowledge about the environment does not accumulate within each batch. This characteristic of the problem suggests a ‘uniform’ type algorithm – we shall treat each step within the same batch equally. Following this intuition, in each batch, we uniformly play the remaining arms, and then eliminate arms of low reward after the observations are communicated. Next we describe the uniform play rule and the arm elimination rule.
Uniform Play Rule: At the beginning of each batch $m$, a collection $\mathcal{A}_m$ of subsets of the arm space is constructed. This collection consists of standard cubes, and all cubes in $\mathcal{A}_m$ have the same edge length $r_m$. We will detail the construction of $\mathcal{A}_m$ when we describe the arm elimination rule. We refer to cubes in $\mathcal{A}_m$ as active cubes of batch $m$.
During batch $m$, each cube $C$ in $\mathcal{A}_m$ is played
$$n_m = \left\lceil \frac{16 \log T}{r_m^2} \right\rceil \qquad (1)$$
times, where $T$ is the total time horizon. More specifically, within each $C \in \mathcal{A}_m$, arms $x_{C,1}, \dots, x_{C,n_m}$ are played (one can choose these arms arbitrarily as long as $x_{C,i} \in C$ for all $i$). The reward samples corresponding to $C$ will be collected at the end of this batch.
Arm Elimination Rule: At the end of batch $m$, information from the arm pulls is collected, and we estimate the reward of each cube $C \in \mathcal{A}_m$ by the average payoff $\widehat{\mu}_m(C) = \frac{1}{n_m}\sum_{i=1}^{n_m} y_{C,i}$. Cubes of low estimated reward are then eliminated, according to the following rule: a cube $C$ is eliminated if $\widehat{\mu}_m^{\max} - \widehat{\mu}_m(C)$ exceeds a threshold of order $r_m$, where $\widehat{\mu}_m^{\max} = \max_{C' \in \mathcal{A}_m} \widehat{\mu}_m(C')$. After the necessary removal of “bad cubes”, each cube in $\mathcal{A}_m$ that survives the elimination is equally partitioned into subcubes of edge length $r_{m+1}$, where $(r_m)_{m \ge 1}$ is a predetermined sequence to be specified soon. These subcubes (of edge length $r_{m+1}$) are collected to construct $\mathcal{A}_{m+1}$, and the learning process moves on to the next batch. Appropriate rounding may be required to ensure that the ratio $r_m / r_{m+1}$ is an integer. See Remark 2 for more details.
The learning process is summarized in Algorithm 1.
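A minimal sketch of this loop is given below, reusing the `BatchedEnvironment` sketch from Section 1.2. The sampling count $n_m \approx 16\log T / r_m^2$, the elimination threshold $4 r_m$, and the doubling edge-length sequence are stated here as assumptions for illustration; the paper's Algorithm 1 (including the ACE Sequence and the clean-up batch) is the reference.

```python
import itertools
import numpy as np

def d_blin(env, T, d=2):
    """Sketch of BLiN with the doubling edge-length sequence r_m = 2^{-m} on [0, 1]^d."""
    spent, r = 0, 1.0
    cubes = [tuple(0.0 for _ in range(d))]                 # cubes identified by their lower corners
    while spent < T:
        n_m = int(np.ceil(16 * np.log(T) / r ** 2))        # assumed sampling rule, ~ log T / r_m^2
        n_m = min(n_m, max((T - spent) // len(cubes), 1))  # do not run far past the budget
        arms = {c: tuple(x + r / 2 for x in c) for c in cubes}  # play the center of each cube
        for c in cubes:                                    # uniform play rule
            for _ in range(n_m):
                env.pull(arms[c])
        spent += n_m * len(cubes)
        observations = env.end_batch()                     # one round of communication
        est = {c: np.mean([y for a, y in observations if a == arms[c]]) for c in cubes}
        best = max(est.values())
        survivors = [c for c in cubes if best - est[c] <= 4 * r]  # assumed threshold of order r_m
        r /= 2                                             # doubling edge-length sequence
        cubes = [tuple(x + b * r for x, b in zip(c, bits))        # split survivors into 2^d subcubes
                 for c in survivors for bits in itertools.product((0, 1), repeat=d)]
    return cubes
```

For instance, `d_blin(BatchedEnvironment(lambda x: 1.0 - np.hypot(*x)), T=10_000)` runs the sketch on the toy single-peak reward of Section 1.1.2; in a full BLiN run the last batch is instead a clean-up batch that spends the remaining budget on the surviving cubes.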
Remark 1 (Time and space complexity).
The time complexity of our algorithm is $\mathcal{O}(T)$, which is better than the state of the art [15]. This is because the running time of a batch is of the order of the number of samples in that batch, and the sample counts across batches sum to at most $T$; hence the time complexity of BLiN is $\mathcal{O}(T)$. Besides, the space complexity of our algorithm also improves the best known space complexity. This is because we do not need to store information about cubes in previous batches. The space complexity analysis is deferred to Appendix B.
The following theorem gives regret and communication upper bounds for BLiN with the Doubling Edge-length Sequence (see Appendix C for a proof). Note that this result implies Theorem 1.
Theorem 4.
With high probability, the $T$-step total regret $R_T$ of BLiN with Doubling Edge-length Sequence (D-BLiN) satisfies
$$R_T \le \widetilde{\mathcal{O}}\left(T^{\frac{d_z+1}{d_z+2}}\right),$$
where $d_z$ is the zooming dimension of the problem instance. In addition, D-BLiN needs no more than $\mathcal{O}(\log T)$ rounds of communications to achieve this regret rate.
Although D-BLiN efficiently solves batched Lipschitz bandits, its simple partition strategy leads to suboptimal communication complexity. Now we show that by appropriately combining some batches, BLiN achieves the optimal communication bound without incurring additional regret. Specifically, we introduce the following edge-length sequence, which we call the ACE Sequence. When using the ACE Sequence, the regret of each batch is of order $\widetilde{\mathcal{O}}\left(T^{\frac{d_z+1}{d_z+2}}\right)$. Thus, the length of any batch cannot be increased without affecting the optimal regret rate. This implies that the ACE Sequence is optimal in terms of both regret and communication. See the proof of Theorem 5 for more details.
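The following back-of-the-envelope accounting sketches this point. It assumes the sampling rule $n_m \approx \frac{16\log T}{r_m^2}$ and the packing bound used in Section 3, so the constants are illustrative rather than the paper's exact ones. The regret $R_m$ of batch $m$ satisfies

\[
R_m \;\le\; \underbrace{|\mathcal{A}_m|}_{\#\text{ active cubes}} \cdot \underbrace{n_m}_{\text{pulls per cube}} \cdot \underbrace{\mathcal{O}(r_{m-1})}_{\text{gap per pull}}
\;\lesssim\; N_{r_{m-1}} \left(\frac{r_{m-1}}{r_m}\right)^{d} \cdot \frac{\log T}{r_m^{2}} \cdot r_{m-1},
\]

and the ACE Sequence is chosen so that each such per-batch term is of order $\widetilde{\mathcal{O}}\big(T^{\frac{d_z+1}{d_z+2}}\big)$; shrinking $r_m$ further relative to $r_{m-1}$ (i.e., lengthening the batch) would push this term above the optimal order.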
Definition 3.
For a problem with ambient dimension , zooming dimension and time horizon , we denote and for any , where . Then the Appropriate Combined Edge-length (ACE) Sequence is defined by for any .
Theorem 5 states that BLiN with the ACE Sequence (A-BLiN) obtains an improved communication complexity, and thus proves Theorem 2.
Theorem 5. With high probability, the $T$-step total regret of A-BLiN satisfies
$$R_T \le \widetilde{\mathcal{O}}\left(T^{\frac{d_z+1}{d_z+2}}\right), \qquad (2)$$
where $d_z$ is the zooming dimension of the problem instance. In addition, A-BLiN needs no more than $\mathcal{O}(\log\log T)$ rounds of communications to achieve this regret rate.
The partition and elimination process of a real A-BLiN run is shown in Figure 1. In the $m$-th subgraph, the white cubes are those remaining after the $m$-th batch. Note that the optimal arm is not eliminated during the game. More details of this experiment are in Section 5.
(Figure 1: partition and elimination process of an A-BLiN run, one subgraph per batch.)
The definition of the ACE Sequence relies on the zooming dimension $d_z$. If $d_z$ is not known ahead of time, we recommend two ways to proceed. 1) The player can apply BLiN with the Doubling Edge-length Sequence (D-BLiN). Theorem 4 shows that D-BLiN achieves the optimal regret rate using $\mathcal{O}(\log T)$ batches. Although this number of batches is not optimal, it is good enough for total time horizons that are not too large, as shown in the experimental results in Appendix G.2. 2) If an upper bound on the zooming dimension is known, that is, $d_z \le d_u$, then the player can apply A-BLiN by using $d_u$ to define the ACE Sequence. Theorem 5 yields that A-BLiN with $d_u$ achieves regret rate $\widetilde{\mathcal{O}}\left(T^{\frac{d_u+1}{d_u+2}}\right)$ using $\mathcal{O}(\log\log T)$ batches.
3 Regret Analysis of A-BLiN
In this section, we provide a regret analysis for A-BLiN. The highlight of the finding is that $\mathcal{O}(\log\log T)$ batches are sufficient to achieve the optimal regret rate $\widetilde{\mathcal{O}}\left(T^{\frac{d_z+1}{d_z+2}}\right)$, as summarized in Theorem 5.
To prove Theorem 5, we first show that the estimator is concentrated around the true expected reward (Lemma 1), and that the optimal arm survives all eliminations with high probability (Lemma 2). In the following analysis, we let $B$ denote the total number of batches of the BLiN run.
Lemma 1.
Define
It holds that .
Proof.
Fix a cube . Recall the average payoff of cube is defined as
We also have
Since the average noise obeys a normal distribution, Hoeffding's inequality gives
On the other hand, by Lipschitzness of , it is obvious that
Consequently, we have
Lemma 2.
Under event (defined in Lemma 1), the optimal arm is not eliminated after the first batches.
Proof.
We use $C_m^*$ to denote the cube in $\mathcal{A}_m$ that contains the optimal arm $x^*$. Here we prove that $C_m^*$ is not eliminated in round $m$.
Under event , for any cube and , we have
where the second inequality follows from (1). Then from the elimination rule, is not eliminated. ∎
Based on these results, we show the cubes that survive elimination are of high reward.
Lemma 3.
Under event (defined in Lemma 1), for any , any and any , satisfies
(3) |
Proof.
For , (3) holds directly from the Lipschitzness of . For , let be the cube in such that . From Lemma 2, this cube is well-defined under . For any cube and , it is obvious that is also in the parent of (the cube in the previous round that contains ), which is denoted by . Thus for any , it holds that
where the inequality uses Lemma 1.
Equality gives that
It is obvious that . Moreover, since the cube is not eliminated, from the elimination rule we have
Hence, we conclude that . ∎
We are now ready to prove Theorem 5.
Proof of Theorem 5.
Let denote regret of the -th batch. Fixing any positive number , the total regret can be divided into two parts: . In the following, we bound these two parts separately and then determine to obtain the upper bound of the total regret. Moreover, we show A-BLiN uses only rounds of communications to achieve the optimal regret.
Recall that is set of the active cubes in the -th batch. According to Lemma 3, for any , we have . Let be set of cubes not eliminated in batch . Each cube in is a -ball with radius , and is a subset of . Therefore, forms a -packing of , and the definition of zooming dimension yields that
(4) |
By definition, , so
(5) |
The total regret of the -th batch is
(6) | ||||
where (i) follows from (5), (ii) follows from (4), and (iii) follows from the definition of ACE Sequence.
Define . Since , calculation shows that . Thus for any , we have . Hence,
(7) |
The inequality (7) holds even if the -th batch does not exist (where we let ) or is not completed. Thus we obtain the first upper bound . Lemma 3 implies that any arm played after the first batches satisfies , so the total regret after batches is bounded by
Therefore, the total regret satisfies
This inequality holds for any positive . Then by choosing , we have and
The above analysis implies that we can achieve the optimal regret rate by letting the for-loop run times and finishing the remaining rounds in the Cleanup step. In other words, rounds of communications are sufficient for A-BLiN to achieve the regret bound (2). ∎
Remark 2.
The quantity $r_m / r_{m+1}$ in line 9 of Algorithm 1 may not be an integer for some $m$. Thus, in practice we define a rounded ACE Sequence obtained by rounding the edge lengths appropriately. The total regret can then be divided into three parts. By a similar argument to the proof of Theorem 5, we can bound these three parts separately, and conclude that BLiN with the rounded ACE Sequence achieves the optimal regret bound by using only $\mathcal{O}(\log\log T)$ rounds of communications. The rounded ACE Sequence is further investigated in [21, 23].
For the rounded ACE Sequence, the ratio of consecutive edge lengths is an integer, so the partition in line 9 of Algorithm 1 is well-defined. In Theorem 6, we show that BLiN with the rounded ACE Sequence can also achieve the optimal regret bound by using $\mathcal{O}(\log\log T)$ batches. If two consecutive rounded edge lengths coincide, then we skip the corresponding batch. It is easy to verify that the following analysis is still valid in this case.
Theorem 6.
Proof.
Firstly, we fix a positive number and consider the first batches. As summarized in Remark 2, we bound the regret caused by these first batches through two different arguments.
For , , we have and , and thus
(8) |
Let be set of cubes not eliminated in round . Similar argument to Theorem 5 shows that . The total regret of round is
where the sixth line follows from (8), and the seventh line follows from (7). Summing over gives that
(9) |
For , , we have and , and thus . Lemma 3 shows that any cube in is a subset of , so we have . Therefore, the total regret of round is
Since , summing over gives that
(10) |
The definition of the rounded ACE Sequence shows that
so we have
(11) |
Similar argument to Theorem 5 shows that the total regret after batches is upper bounded by . Since
we further have
(12) |
Combining (9), (11) and (12), we conclude that
The analysis in Theorem 6 implies that we can achieve the optimal regret rate by letting the for-loop of Algorithm 1 run times and finishing the remaining rounds in the Cleanup step. In other words, rounds of communications are sufficient for BLiN to achieve the optimal regret. ∎
4 Lower Bounds
In this section, we present lower bounds for Lipschitz bandits with batched feedback, which in turn give communication lower bounds for all Lipschitz bandit algorithms. Our lower bounds depend on the number of rounds of communications $B$. When $B$ is sufficiently large, our results match the upper bound $\widetilde{\mathcal{O}}\left(T^{\frac{d_z+1}{d_z+2}}\right)$ for the vanilla Lipschitz bandit problem. More importantly, this dependency on $B$ gives the minimal number of rounds of communications needed to achieve the optimal regret bound for all Lipschitz bandit algorithms, which is summarized in Corollary 1. Since this lower bound matches the upper bound presented in Theorem 5, BLiN optimally solves Lipschitz bandits with minimal communication.
Similar to most lower bound proofs, we need to construct problem instances that are difficult to differentiate. What is different here is that we need to carefully integrate the batched feedback pattern [36] with Lipschitz payoff rewards [42, 32]. To capture the adaptivity in grid determination, we construct “static reference communication grids” to remove the stochasticity in grid selection [1, 22]. Moreover, to prove the lower bounds for general $d_z \le d$, we apply a “linear-decaying extension” technique to transfer instances from a $d_z$-dimensional subspace to the $d$-dimensional whole space.
The lower bound analysis is organized as follows. In Section 4.1, we present lower bounds for the full-dimensional case, that is, $d_z = d$. In Section 4.1.1, we consider the static grid case, where the grid is predetermined. This static grid case provides intuition for the adaptive and more general case. In Section 4.1.2, we provide the lower bound for general adaptive grids. Finally, in Section 4.2, we apply the “linear-decaying extension” technique to prove lower bounds for the case where $d_z < d$.
4.1 Lower Bounds for the Full-Dimension Case
In this section, we let the zooming dimension $d_z$ equal the ambient dimension $d$. The aim of considering this case first is to simplify the construction and highlight the technique for handling the batched feedback setting.
4.1.1 The Static Grid Case
We first provide the lower bound for the case where the grid is static and determined before the game.
The expected reward functions of the hard instances are constructed as follows: we choose some ‘positions’ and ‘heights’, such that the expected reward function attains a local maximum of the specified ‘height’ at the specified ‘position’. We will use the word ‘peak’ to refer to these local maxima. The following theorem presents the lower bound for the static grid case.
Theorem 7.
Consider Lipschitz bandit problems with time horizon $T$ and ambient dimension $d$ such that the grid of reward communication is static and determined before the game. If $B$ rounds of communications are allowed, then for any policy $\pi$, there exists a problem instance such that
To prove Theorem 7, we first show that for any there exists an instance such that . Fixing , we let and . Then we construct a set of problem instances , such that the gap between the highest peak and the second highest peak is about for every instance in .
Based on this construction, we prove that no algorithm can distinguish instances in from one another in the first batches, so the worst-case regret is at least , which gives the inequality we need. For the first batch , we can easily construct a set of instances where the worst-case regret is at least , since no information is available during this time. Thus, there exists a problem instance such that
Since , the inequality in Theorem 7 follows.
Proof of Theorem 7.
Fixing an index , we first show that there exists an instance such that . We construct a set of problem instances that is difficult to distinguish. Let and . We define and in this way to: 1. ensure that we can find a set of arms such that for any ; 2. maximize while ensuring , and thus the hard instances with maximum reward can still confuse a learner who only make observations. Then we consider a set of problem instances . The expected reward for is defined as
(13) |
For , the expected reward for is defined as
(14) |
Let (the ball with center and radius ). It is easy to verify the following properties of construction (13) and (14):
-
1.
For any , for any ;
-
2.
For any , , for any ;
-
3.
For any , under , pulling an arm that is not in incurs a regret at least .
For all arm pulls in all problem instances, a Gaussian noise sampled from $\mathcal{N}(0,1)$ is added to the observed reward. This noise corruption is independent of all other randomness.
The lower bound of expected regret relies on the following lemma.
Lemma 4.
For any policy , there exists a problem instance such that
Proof.
Let denote the choices of policy at time , and denote the reward. Additionally, for , we define as the distribution of sequence under instance and policy . It holds that
(15) |
where denotes the regret incurred by policy at time .
From our construction, it is easy to see that for any , so we can construct a test such that implies . Then from Lemma 12,
To avoid notational clutter, for any (), define
Now we calculate . From the chain rule of KL-Divergence, we have
(16) | ||||
(17) | ||||
(18) | ||||
(19) | ||||
(20) |
where (16) uses chain rule for KL-divergence and the conditional independence of the reward, (17) removes dependence on in the first term by another use of chain rule and the fact that the distribution of is fully determined by the policy and the distribution of , (18) uses that the rewards are corrupted by a standard normal noise, and (19) uses the first two properties of the construction.
By omitting terms with in the above summation, we have
The above analysis can be applied for any . For the first batch , we can easily construct a set of instances where the worst-case regret is at least , since no information is available during this time. Thus, there exists a problem instance such that
Since , we further have
where . The above two inequalities imply that
which finishes the proof. ∎
4.1.2 Removing the Static Grid Assumption
So far we have derived the lower bound for the static grid case. Yet there is a gap between the static and the adaptive case. We will close this gap in the following Theorem.
Theorem 8.
Consider Lipschitz bandit problems with time horizon $T$ and ambient dimension $d$ such that the grid of reward communication is adaptively determined by the player. If $B$ rounds of communications are allowed, then for any policy $\pi$, there exists a problem instance such that
To prove Theorem 8, we consider a reference static grid , where for . We set the reference grid in this way because it is the solution to the following optimization problem
Then we construct a series of ‘worlds’, denoted by . Each world is a set of problem instances, and each problem instance in world is defined by peak location set and basic height , where the sets and quantities for are presented in the proof below.
Based on these constructions, we first prove that for any adaptive grid and policy, there exists an index such that the event happens with sufficiently high probability in world . Then similar to Theorem 7, we prove that in world there exists a set of problem instances that is difficult to differentiate in the first batches. In addition, event implies that , so the worst-case regret is at least , which gives the lower bound we need.
Proof of Theorem 8.
Firstly, we define , where , and define . From the definition, we have
(24) |
For , we can find sets of arms such that (a) for any , and (b) .
Then we present the construction of worlds . For , we let , and the expected reward of is defined as
(25) |
and , otherwise. For , we let . The expected reward of is defined as
(26) |
and , otherwise. Roughly speaking, our constructions satisfy two properties: for each and ,
-
1.
is close to ;
-
2.
under , pulling an arm that is far from incurs a regret at least .
The formal version of these two properties are presented in the proof of Lemma 5 and Lemma 6.
As mentioned above, based on these constructions, we first show that for any adaptive grid , there exists an index such that is sufficiently large in world . More formally, for each , and event , we define the quantities for and , where denotes the probability of the event under instance and policy . For these quantities, we have the following lemma.
Lemma 5.
For any adaptive grid and policy , it holds that
Proof.
For and , we define , which is the ball centered as with radius . It is easy to verify the following properties of our construction (25) and (26):
-
1.
for any ;
-
2.
, for any .
Let denote the choices of policy at time , and denote the reward. For , we define (resp. ) as the distribution of sequence under instance (resp. ) and policy . Since event can be completely described by the observations up to time ( is an event in the -algebra where and are defined on), we can use the definition of total variation to get
For the total variation, we apply Lemma 11 to get
An argument similar to (21) yields that
where denotes the number of pulls which is in before the batch containing . Combining the above two inequalities gives
(27) | ||||
(28) | ||||
(29) | ||||
where (27) uses Jensen’s inequality, (28) uses the fact that , and (29) uses (24).
Plugging the above results implies that
Since , it holds that
Lemma 5 implies that there exists some such that . Then similar to Theorem 7, we show that the worst-case regret of the policy in world gives the lower bound we need.
Lemma 6.
For adaptive grid and policy , if index satisfies , then there exists a problem instance such that
Proof.
Here we proceed with the case where . The case for can be proved analogously.
For any , we construct a set of problem instances . For , the expected reward of is defined as
where is defined in (25). For , we let .
We define , and our construction has the following properties:
-
1.
For any , for any ;
-
2.
For any , for any ;
-
3.
For any , under , pulling an arm that is not in incurs a regret at least .
Let denote the choices of policy at time , and denote the reward. For , we define as the distribution of sequence under instance and policy . From similar argument in (15), it holds that
(30) |
From our construction, it is easy to see that for any , so we can construct a test such that implies . By Lemma 12 with a star graph on with center , we have
(31) |
(32) | ||||
(33) | ||||
(34) |
where (32) follows from data processing inequality of total variation and the equation , (33) restricts the integration to event , and (34) holds because the observations at time are the same as those at time under event .
For the term , it holds that
(35) | ||||
(36) |
where (35) uses the inequality , and (36) holds because and can be determined by the observations up to .
Finally, combining the above two lemmas, we arrive at the lower bound in Theorem 8. ∎
4.2 Communication Lower bound for Lipschitz Bandits with Batched Feedback
Based on the constructions for the full-dimension case, we are now ready to present the theoretical lower bound for batched Lipschitz bandits with $d_z < d$. For ease of understanding, we start with the result for the static grid case. In this section we provide lower bounds for integer-valued $d_z$; in general, the zooming dimension is not necessarily an integer.
Theorem 9.
Consider Lipschitz bandit problems with time horizon $T$, ambient dimension $d$ and zooming dimension $d_z$ such that the grid of reward communication is static and determined before the game. If $B$ rounds of communications are allowed, then for any policy $\pi$, there exists a problem instance with zooming dimension $d_z$ such that
The proof technique of Theorem 9 is similar to that of Theorem 7, with the main difference being the construction of the hard instances. To build instances satisfying the statement of Theorem 9, we first construct reward functions in a $d_z$-dimensional subspace according to the argument in Theorem 7, and then use a “linear-decaying extension” technique to transfer them to the $d$-dimensional whole space. Below, we present the detailed construction of the hard instances.
To begin with, we introduce some settings and notations. For a given , we consider the whole space and being the induced metric of . can be represented by a Cartesian product . Therefore, each element in can be represented as , the concatenation of and . Additionally, we use to denote the -dimensional subspace , where is the zero vector in .
For any fixed , let and . We can find a set of arms such that for any . As stated above, we first construct a set of expected reward functions on , which are the same as (13) and (14) in the proof of Theorem 7. We define as
(39) |
For , is defined as
(40) |
Based on , we define a set of problem instances . For each , the expected reward for is defined as
(41) |
where is the concatenation of and . Note that is well-defined since .
For all arm pulls in all problem instances, a Gaussian noise sampled from $\mathcal{N}(0,1)$ is added to the observed reward. This noise corruption is independent of all other randomness.
Now we show that for each instance constructed above, the expected reward is $1$-Lipschitz and the zooming dimension equals $d_z$. Firstly, for any two arms, the construction (41) yields
Therefore, the expected reward is $1$-Lipschitz. Secondly, for any near-optimal arm, (41) yields that it lies close to the $d_z$-dimensional subspace. As a consequence, the zooming dimension equals $d_z$.
After presenting the new constructions, we show that similar argument to the full-dimension case gives the lower bound we need. The remaining proof of Theorem 9 is deferred to Appendix E.
Finally, we combine all techniques in above analysis to obtain the lower bound for general and adaptive grid, and thus prove Theorem 3.
Theorem 10.
Consider Lipschitz bandit problems with time horizon $T$, ambient dimension $d$ and zooming dimension $d_z$ such that the grid of reward communication is adaptively determined by the player. If $B$ rounds of communications are allowed, then for any policy $\pi$, there exists a problem instance with zooming dimension $d_z$ such that
5 Experiments
In this section, we present numerical studies of A-BLiN. In the experiments, we use a two-dimensional arm space and an expected reward function with a unique optimal arm. The landscape of the reward and the resulting partition are shown in Figure 2a. As can be seen, the partition is finer in the area closer to the optimal arm.
(Figure 2: (a) reward landscape and resulting partition; (b) accumulated regret of A-BLiN, with batches indicated by background colors.)
We report the accumulated regret in Figure 2b. The regret curve is sublinear, which agrees with the regret bound (2). Besides, different background colors in Figure 2b represent different batches; for the total time horizon, A-BLiN only needs a few rounds of communications. We also present the regret curve of the zooming algorithm [28] for comparison. Different from the zooming algorithm, the regret curve of A-BLiN is approximately piecewise linear, because the strategy of BLiN does not change within each batch. Results of more repeated experiments, as well as experimental results of D-BLiN, are in Appendix G. Our code is available at https://github.com/FengYasong-fifol/Batched-Lipschitz-Narrowing.
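A minimal matplotlib sketch of how a figure like 2b can be produced from a finished run is given below; the arrays `cumulative_regret` and `batch_ends` are assumed to have been recorded by the experiment code and are not outputs of the paper's released scripts.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_regret_with_batches(cumulative_regret, batch_ends):
    """Plot cumulative regret and shade each batch with an alternating background color."""
    t = np.arange(1, len(cumulative_regret) + 1)
    fig, ax = plt.subplots()
    ax.plot(t, cumulative_regret, label="A-BLiN")
    start = 0
    for i, end in enumerate(batch_ends):
        ax.axvspan(start, end, alpha=0.15, color=f"C{i % 2}")  # alternate colors per batch
        start = end
    ax.set_xlabel("round $t$")
    ax.set_ylabel("cumulative regret")
    ax.legend()
    return fig

# Example with synthetic data: a sublinear regret curve and a few batch boundaries.
T = 10_000
fake_regret = np.cumsum(1.0 / np.sqrt(np.arange(1, T + 1)))
plot_regret_with_batches(fake_regret, batch_ends=[150, 900, 4000, T])
plt.show()
```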
6 Conclusion
In this paper, we study Lipschitz bandits with communication constraints, and propose the BLiN algorithm as a solution. We prove that BLiN only needs $\mathcal{O}(\log\log T)$ rounds of communications to achieve the optimal regret rate of the best previous Lipschitz bandit algorithms [28, 14], which need $T$ batches. This improvement in the number of batches significantly saves data communication costs. We also provide complexity analysis for this problem. We show that $\Omega(\log\log T)$ rounds of communications are necessary for any algorithm to optimally solve Lipschitz bandit problems. Hence, the BLiN algorithm is optimal.
References
- [1] Arpit Agarwal, Shivani Agarwal, Sepehr Assadi, and Sanjeev Khanna. Learning with limited rounds of adaptivity: coin tossing, multi-armed bandits, and ranking from pairwise comparisons. In Conference on Learning Theory, pages 39–75. PMLR, 2017.
- [2] Arpit Agarwal, Rohan Ghuge, and Viswanath Nagarajan. Batched dueling bandits. In International Conference on Machine Learning, pages 89–110. PMLR, 2022.
- [3] Rajeev Agrawal. The continuum-armed bandit problem. SIAM Journal on Control and Optimization, 33(6):1926–1951, 1995.
- [4] Rajeev Agrawal. Sample mean based index policies by regret for the multi-armed bandit problem. Advances in Applied Probability, 27(4):1054–1078, 1995.
- [5] Noga Alon, Yossi Matias, and Mario Szegedy. The space complexity of approximating the frequency moments. Journal of Computer and System Sciences, 58(1):137–147, 1999.
- [6] Patrice Assouad. Plongements lipschitziens dans $\mathbb{R}^n$. Bulletin de la Société Mathématique de France, 111:429–448, 1983.
- [7] Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine learning, 47(2):235–256, 2002.
- [8] Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. The nonstochastic multiarmed bandit problem. SIAM journal on computing, 32(1):48–77, 2002.
- [9] Peter Auer, Ronald Ortner, and Csaba Szepesvári. Improved rates for the stochastic continuum-armed bandit problem. In Conference on Computational Learning Theory, pages 454–468. Springer, 2007.
- [10] Dimitris Bertsimas and Adam J Mersereau. A learning approach for interactive marketing to a customer segment. Operations Research, 55(6):1120–1135, 2007.
- [11] Peter J. Bickel. On some robust estimates of location. The Annals of Mathematical Statistics, pages 847–858, 1965.
- [12] Jean Bretagnolle and Catherine Huber. Estimation des densités: risque minimax. Séminaire de probabilités de Strasbourg, 12:342–363, 1978.
- [13] Sébastien Bubeck, Nicolo Cesa-Bianchi, and Gábor Lugosi. Bandits with heavy tail. IEEE Transactions on Information Theory, 59(11):7711–7717, 2013.
- [14] Sébastien Bubeck, Rémi Munos, Gilles Stoltz, and Csaba Szepesvári. Online optimization in $\mathcal{X}$-armed bandits. Advances in Neural Information Processing Systems, 22:201–208, 2009.
- [15] Sébastien Bubeck, Rémi Munos, Gilles Stoltz, and Csaba Szepesvári. $\mathcal{X}$-armed bandits. Journal of Machine Learning Research, 12(5):1655–1695, 2011.
- [16] Sébastien Bubeck, Gilles Stoltz, and Jia Yuan Yu. Lipschitz bandits without the Lipschitz constant. In International Conference on Algorithmic Learning Theory, pages 144–158. Springer, 2011.
- [17] Nicolo Cesa-Bianchi, Ofer Dekel, and Ohad Shamir. Online learning with switching costs and other adaptive adversaries. Advances in Neural Information Processing Systems, 26:1160–1168, 2013.
- [18] Eric W. Cope. Regret and convergence bounds for a class of continuum-armed bandit problems. IEEE Transactions on Automatic Control, 54(6):1243–1253, 2009.
- [19] Hossein Esfandiari, Amin Karbasi, Abbas Mehrabian, and Vahab Mirrokni. Regret bounds for batched bandits. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 7340–7348, 2021.
- [20] Eyal Even-Dar, Shie Mannor, Yishay Mansour, and Sridhar Mahadevan. Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. Journal of machine learning research, 7(6):1079–1105, 2006.
- [21] Yasong Feng, Weijian Luo, Yimin Huang, and Tianyu Wang. A lipschitz bandits approach for continuous hyperparameter optimization. arXiv preprint arXiv:2302.01539, 2023.
- [22] Zijun Gao, Yanjun Han, Zhimei Ren, and Zhengqing Zhou. Batched multi-armed bandits problem. Advances in Neural Information Processing Systems, 32:503–513, 2019.
- [23] Chuying Han, Yasong Feng, and Tianyu Wang. From random search to bandit learning in metric measure spaces. arXiv preprint arXiv:2305.11509, 2023.
- [24] Yanjun Han, Zhengqing Zhou, Zhengyuan Zhou, Jose Blanchet, Peter W Glynn, and Yinyu Ye. Sequential batch learning in finite-action linear contextual bandits. arXiv preprint arXiv:2004.06321, 2020.
- [25] Kwang-Sung Jun, Kevin Jamieson, Robert Nowak, and Xiaojin Zhu. Top arm identification in multi-armed bandits with batch arm pulls. In Artificial Intelligence and Statistics, pages 139–148. PMLR, 2016.
- [26] Nikolai Karpov, Qin Zhang, and Yuan Zhou. Collaborative top distribution identifications with limited interaction. In 2020 IEEE 61st Annual Symposium on Foundations of Computer Science (FOCS), pages 160–171. IEEE, 2020.
- [27] Robert Kleinberg. Nearly tight bounds for the continuum-armed bandit problem. Advances in Neural Information Processing Systems, 18:697–704, 2005.
- [28] Robert Kleinberg, Aleksandrs Slivkins, and Eli Upfal. Multi-armed bandits in metric spaces. In Proceedings of the fortieth annual ACM symposium on Theory of computing, pages 681–690, 2008.
- [29] Akshay Krishnamurthy, John Langford, Aleksandrs Slivkins, and Chicheng Zhang. Contextual bandits with continuous actions: Smoothing, zooming, and adapting. The Journal of Machine Learning Research, 21(1):5402–5446, 2020.
- [30] Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, 1985.
- [31] Zihan Li and Jonathan Scarlett. Gaussian process bandit optimization with few batches. In International Conference on Artificial Intelligence and Statistics, pages 92–107. PMLR, 2022.
- [32] Shiyin Lu, Guanghui Wang, Yao Hu, and Lijun Zhang. Optimal algorithms for Lipschitz bandits with heavy-tailed rewards. In International Conference on Machine Learning, pages 4154–4163, 2019.
- [33] Stefan Magureanu, Richard Combes, and Alexandre Proutiere. Lipschitz bandits: Regret lower bound and optimal algorithms. In Conference on Learning Theory, pages 975–999. PMLR, 2014.
- [34] Maryam Majzoubi, Chicheng Zhang, Rajan Chari, Akshay Krishnamurthy, John Langford, and Aleksandrs Slivkins. Efficient contextual bandits with continuous actions. Advances in Neural Information Processing Systems, 33:349–360, 2020.
- [35] Vianney Perchet and Philippe Rigollet. The multi-armed bandit problem with covariates. The Annals of Statistics, 41(2):693–721, 2013.
- [36] Vianney Perchet, Philippe Rigollet, Sylvain Chassang, and Erik Snowberg. Batched bandit problems. The Annals of Statistics, 44(2):660–681, 2016.
- [37] Stuart J. Pocock. Group sequential methods in the design and analysis of clinical trials. Biometrika, 64(2):191–199, 1977.
- [38] Yufei Ruan, Jiaqi Yang, and Yuan Zhou. Linear bandits with limited adaptivity and learning distributional optimal design. In Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing, pages 74–87, 2021.
- [39] Sudeep Salgia, Sattar Vakili, and Qing Zhao. A domain-shrinking based bayesian optimization algorithm with order-optimal regret performance. In Advances in Neural Information Processing Systems, volume 34, pages 28836–28847, 2021.
- [40] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis. Mastering the game of go with deep neural networks and tree search. nature, 529(7587):484–489, 2016.
- [41] Sean Sinclair, Tianyu Wang, Gauri Jain, Siddhartha Banerjee, and Christina Yu. Adaptive discretization for model-based reinforcement learning. Advances in Neural Information Processing Systems, 33:3858–3871, 2020.
- [42] Aleksandrs Slivkins. Contextual bandits with similarity information. Journal of Machine Learning Research, 15(1):2533–2568, 2014.
- [43] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT press, 2018.
- [44] Chao Tao, Qin Zhang, and Yuan Zhou. Collaborative learning with limited interaction: tight bounds for distributed exploration in multi-armed bandits. In 2019 IEEE 60th Annual Symposium on Foundations of Computer Science (FOCS), pages 126–146. IEEE, 2019.
- [45] William R Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.
- [46] Tianyu Wang and Cynthia Rudin. Bandits for BMO functions. In International Conference on Machine Learning, pages 9996–10006. PMLR, 2020.
- [47] Tianyu Wang, Weicheng Ye, Dawei Geng, and Cynthia Rudin. Towards practical lipschitz bandits. In Proceedings of the 2020 ACM-IMS on Foundations of Data Science Conference, FODS ’20, page 129–138, New York, NY, USA, 2020. Association for Computing Machinery.
- [48] Nirandika Wanigasekara and Christina Yu. Nonparametric contextual bandits in an unknown metric space. In Advances in Neural Information Processing Systems, volume 32, pages 14684–14694, 2019.
- [49] Kelly Zhang, Lucas Janson, and Susan Murphy. Inference for batched bandits. Advances in Neural Information Processing Systems, 33:9818–9829, 2020.
Appendix A Proof of Corollary 1
Corollary 1. For Lipschitz bandit problems with ambient dimension $d$, zooming dimension $d_z$ and time horizon $T$, any algorithm needs $\Omega(\log\log T)$ rounds of communications to achieve the optimal regret rate $\widetilde{\mathcal{O}}\left(T^{\frac{d_z+1}{d_z+2}}\right)$.
Proof.
From Theorem 3, the expected regret is lower bounded by
Here we seek the minimum number of rounds $B$ such that
(42) |
for some constant .
Calculation shows that
(43) |
Substituting (43) to (42) and taking log on both sides yield that
and
Taking log on both sides again yields that
(44) |
We use to denote the minimum such that inequality (44) holds. Calculation shows that (44) holds for
so we have . Then since the RHS of (44) decreases with , we have
Therefore, $\Omega(\log\log T)$ rounds of communications are necessary for any algorithm to optimally solve Lipschitz bandit problems. ∎
Appendix B Space Complexity Analysis of A-BLiN
Let the A-BLiN run contain $B$ batches: $B-1$ batches in the for-loop and a clean-up batch. We use $|\mathcal{A}_m|$ to denote the number of cubes in batch $m$. Then it is easy to see that the space complexity is linear in $\max_m |\mathcal{A}_m|$. In the following, we bound $|\mathcal{A}_m|$ for each $m$ to obtain the space complexity of A-BLiN.
Appendix C Proof of Theorem 4
Theorem 4. With high probability, the $T$-step total regret $R_T$ of BLiN with Doubling Edge-length Sequence (D-BLiN) satisfies
$$R_T \le \widetilde{\mathcal{O}}\left(T^{\frac{d_z+1}{d_z+2}}\right), \qquad (45)$$
where $d_z$ is the zooming dimension of the problem instance. In addition, D-BLiN needs no more than $\mathcal{O}(\log T)$ rounds of communications to achieve this regret rate.
Proof.
Since for Doubling Edge-length Sequence, Lemma 3 implies that every cube is a subset of . Thus from the definition of zooming number, we have
(46) |
Fix any positive number . Also by Lemma 3, we know that any arm played after batch incurs a regret bounded by , since the cubes played after batch have edge length no larger than . Then the total regret occurs after the first batch is bounded by .
Thus the regret can be bounded by
(47) |
where the first term bounds the regret in the first batches of D-BLiN, and the second term bounds the regret after the first batches. If the algorithm stops at batch , we define for any and inequality (47) still holds.
By Lemma 3, we have for all . We can thus bound (47) by
(48) | ||||
(49) | ||||
where (48) uses (46), and (49) uses equality . Since and , we have
This inequality holds for any positive . By choosing , we have
The above analysis implies that we can achieve the optimal regret rate by letting the for-loop run times and finishing the remaining rounds in the Cleanup step. In other words, rounds of communications are sufficient for D-BLiN to achieve the regret bound (45). ∎
Appendix D Regret Upper Bound of BLiN with a Fixed Number of Batches
This section studies the case where a hard upper bound on the number of batches is imposed. In such cases, we simply apply BLiN by executing the for-loop the allowed number of times, and then run the clean-up step. Since some batches may be skipped in the for-loop, the total number of batches is upper bounded by the imposed limit. The regret in such cases is described by a quantity called the effective number of batches, which counts the number of batches in which a finer partitioning of the arm space occurs. Further, in Lemma 7, we show an upper bound on the effective number of batches.
Firstly, we introduce some notations. In the proof of Theorem 6, we bound the regret of the odd batches ( and ), the even batches ( and ) and the clean-up batch (the -th batch) separately. For convenience, we denote
where is the regret of the -th batch. Additionally, we let , and thus is the last even batch before the -th batch. As is mentioned in the paper (the paragraph before Theorem 6), when using BLiN with rounded ACE Sequence, if there exists such that , then we skip the -th batch. We use to denote the number of effective batches, that is, the batches which are not skipped.
By omitting the equality and using the definition of rounded ACE Sequence, the arguments in the proof of Theorem 6 can be directly applied to the case of fixed . Specifically, inequality (9) yields that
inequality (10) yields that
and inequality (12) yields that
Thus, the -step total regret is bounded by
(50) |
Furthermore, because of the existence of the rounding step, the actual number of batches of BLiN with rounded ACE Sequence has the following upper bound.
Lemma 7.
Let be the total time horizon. When applying BLiN with rounded ACE Sequence and any number of batches , the effective number of batches is upper bounded by
Proof.
The rounded ACE sequence is defined as for and for , where and . As is explained in Remark 2, if there exists integer and such that , then the rounding step yields , so the -th and the -th batches are skipped. We note that the sequence is increasing and . It is easy to verify that if inequality
(51) |
is satisfied for some , then there will be at most effective batches after the -th batch. As a consequence, for any , the number of effective batches is upper bounded by .
Appendix E Proof of Theorem 9
Theorem 9. Consider Lipschitz bandit problems with time horizon $T$, ambient dimension $d$ and zooming dimension $d_z$ such that the grid of reward communication is static and determined before the game. If $B$ rounds of communications are allowed, then for any policy $\pi$, there exists a problem instance with zooming dimension $d_z$ such that
Proof.
Fixing and , we show that is -Lipschitz and the zooming dimension equals to .
Firstly, for any and , we have
Therefore, is -Lipschitz.
Secondly, for any , (41) yields that . Therefore, we have , and the zooming dimension equals to .
Then we show that an argument similar to Theorem 7 yields the lower bound we need. For any , we prove that no algorithm can distinguish instances in from one another in the first batches, so the worst-case regret is at least , which equals to . For the first batch , we can easily construct a set of instances where the worst-case regret is at least , since no information is available during this time. Thus, there exists a problem instance such that
Since , the inequality in Theorem 9 follows.
Recall that each is in . For convenience, we write . Let , where denotes the -dimensional ball with center and radius . It is easy to verify the following properties of construction (39),(40) and (41):
-
1.
For any , for any ;
-
2.
For any , , for any ;
-
3.
For any , under , pulling an arm that is not in incurs a regret at least .
The lower bound of expected regret relies on the following lemma.
Lemma 8.
For any policy , there exists a problem instance such that
Proof.
Let denote the choices of policy at time , and denote the reward. Additionally, for , we define as the distribution of sequence under instance and policy . It holds that
(52) |
where denotes the regret incurred by policy at time .
From our construction, it is easy to see that for any , so we can construct a test such that implies . Then from Lemma 12,
To avoid notational clutter, for any (), define
Now we calculate . From the chain rule of KL-Divergence, we have
(53) | ||||
(54) | ||||
(55) | ||||
(56) | ||||
(57) |
where (53) uses chain rule for KL-divergence and the conditional independence of the reward, (54) removes dependence on in the first term by another use of chain rule and the fact that the distribution of is fully determined by the policy and the distribution of , (55) uses that the rewards are corrupted by a standard normal noise, and (56) uses the first two properties of the construction.
Since , the expected regret of policy satisfies
on an instance .
By omitting terms with in the above summation, we have
The above analysis can be applied for any . For the first batch , we can easily construct a set of instances where the worst-case regret is at least , since no information is available during this time. Thus, there exists a problem instance such that
Since , we further have
where . Combining the above two inequalities, we conclude that
Appendix F Proof of Theorem 10
Theorem 10. Consider Lipschitz bandit problems with time horizon $T$, ambient dimension $d$ and zooming dimension $d_z$ such that the grid of reward communication is adaptively determined by the player. If $B$ rounds of communications are allowed, then for any policy $\pi$, there exists a problem instance with zooming dimension $d_z$ such that
Proof.
The main argument in the proof of Theorem 10 is similar to that of Theorem 8. To construct hard instances in the -dimensional space, we use the ‘linear-decaying extension’ technique, which is the same as the proof of Theorem 9.
To prove Theorem 10, we consider a reference static grid , where for . Then we construct a series of ‘worlds’, denoted by . Each world is a set of problem instances, and each problem instance in world is defined by peak location set and basic height , where the sets and quantities for are presented in the proof below. Based on these constructions, we first prove that for any adaptive grid and policy, there exists an index such that the event happens with sufficiently high probability in world . Then similar to Theorem 9, we prove that in world there exists a set of problem instances that is difficult to differentiate in the first batches. In addition, event implies that , so the worst-case regret is at least , which gives the lower bound we need.
Firstly, we define and , where . From the definition, we have
(61) |
For , we can find sets of arms such that (a) for any , and (b) .
Then we present the construction of worlds . For , we let . We first construct a set of expected reward functions on . For each , we define as
(62) |
Based on , the expected reward of is defined as
(63) |
where is the concatenation of and . For , we let . We first define a function on as
(64) |
Then the expected reward of is defined as
(65) |
Roughly speaking, our constructions satisfy two properties: for each and ,
-
1.
is close to ;
-
2.
under , pulling an arm that is far from incurs a regret at least .
The formal version of these two properties are presented in the proof of Lemma 9 and Lemma 10.
Based on these constructions, we first show that for any adaptive grid , there exists an index such that is sufficiently large in world . More formally, for each , and event , we define the quantities for and , where denotes the probability of the event under instance and policy . For these quantities, we have the following lemma.
Lemma 9.
For any adaptive grid and policy , it holds that
Proof.
For and , we write . We define , where denotes the -dimensional ball with center and radius . It is easy to verify the following properties of our construction (62), (63), (64) and (65):
-
1.
for any ;
-
2.
, for any .
Let denote the choices of policy at time , and denote the reward. For , we define (resp. ) as the distribution of sequence under instance (resp. ) and policy . Since event can be completely described by the observations up to time ( is an event in the -algebra where and are defined on), we can use the definition of total variation to get
For the total variation, we apply Lemma 11 to get
An argument similar to (58) yields that
where denotes the number of pulls which is in before the batch containing . Combining the above two inequalities gives
(66) | ||||
(67) | ||||
(68) | ||||
where (66) uses Jensen’s inequality, (67) uses the fact that , and (68) uses (61).
Plugging the above results implies that
Since , it holds that
Lemma 9 implies that there exists some such that . Then we show that the worst-case regret in world gives the lower bound we need.
Lemma 10.
For adaptive grid and policy , if index satisfies , then there exists a problem instance with zooming dimension such that
Proof.
Here we proceed with the case where . The case for can be proved analogously.
For any , we construct a set of problem instances . For , we first define a function on as
where is defined in (62). Then the expected reward of is defined as
(69) |
where is the concatenation of and . For , we let .
We first show that each is -Lipschitz and the zooming dimension equals to .
Firstly, for any and , we have
Therefore, is -Lipschitz.
Secondly, for any , (63) and (69) yield that and . Therefore, we have , and the zooming dimension equals to .
Then we show that an argument similar to Lemma 6 yields the lower bound we need. For each , we write . We define , and our construction has the following properties:
-
1.
For any , for any ;
-
2.
For any , for any ;
-
3.
For any , under , pulling an arm that is not in incurs a regret at least .
Let denote the choices of policy at time , and denote the reward. For , we define as the distribution of sequence under instance and policy . From similar argument in (52), it holds that
(70) |
From our construction, it is easy to see that for any , so we can construct a test such that implies . By Lemma 12 with a star graph on with center , we have
(71) |
(72) | ||||
(73) | ||||
(74) |
where (72) follows from data processing inequality of total variation and the equation , (73) restricts the integration to event , and (74) holds because the observations at time are the same as those at time under event .
For the term , it holds that
(75) | ||||
(76) |
where (75) uses the inequality , and (76) holds because and can be determined by the observations up to .
Finally, combining the above two lemmas, we arrive at the lower bound in Theorem 10. ∎
Appendix G Additional Experimental results
G.1 Repeated experiments of A-BLiN
We present results of A-BLiN under several random seeds in Figure 3, where the figure legends and labels are the same as those in Figure 2b. These results stably agree with the plot in the paper. The curve of the zooming algorithm in Figures 2b and 3 is the average over repeated experiments. The reason we did not present an averaged regret curve of A-BLiN in Figure 2b is that we want to show the batch pattern of a single A-BLiN run in the figure. Averaging across different runs breaks the batch pattern. As an example, one stochastic run may end the first batch after 100 observations, while another may end the third batch after 110 observations.
(Figure 3: regret curves of A-BLiN under different random seeds.)
G.2 Experimental results of D-BLiN
We run D-BLiN to solve the same problem as in Section 5. The partition and elimination process of this experiment is presented in Figure 4, which shows that the optimal arm is not eliminated during the game and that only a few rounds of communications are needed for the given time horizon. Moreover, we present the resulting partition and the accumulated regret in Figure 5.
(Figure 4: partition and elimination process of a D-BLiN run. Figure 5: resulting partition and accumulated regret of D-BLiN.)
Appendix H Auxiliary Lemmas
Lemma 11 (Bretagnolle-Huber Inequality[12]).
Let $P$ and $Q$ be any probability measures on the same probability space. It holds that
$$\mathrm{TV}(P, Q) \le 1 - \frac{1}{2}\exp\left(-\mathrm{KL}(P, Q)\right).$$
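In the proofs above, this inequality is typically combined with the elementary bound $P(A) + Q(A^c) \ge 1 - \mathrm{TV}(P, Q)$ for any event $A$. This gives the testing form (a standard consequence, stated here for convenience and not part of the original lemma statement):

\[
P(A) + Q(A^{c}) \;\ge\; 1 - \mathrm{TV}(P, Q) \;\ge\; \frac{1}{2}\exp\big(-\mathrm{KL}(P, Q)\big),
\]

so two problem instances that induce a small KL divergence on the observations cannot both be identified correctly with probability much better than a constant.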
Lemma 12 ([22]).
Let be probability measures over a common probability space , and be any measurable function (i.e., test). Then for any tree with vertex set and edge set , we have
-
1.
-
2.