Lipschitz Bandits with Batched Feedback
Abstract
In this paper, we study Lipschitz bandit problems with batched feedback, where the expected reward is Lipschitz and the reward observations are communicated to the player in batches. We introduce a novel landscape-aware algorithm, called Batched Lipschitz Narrowing (BLiN), that optimally solves this problem. Specifically, we show that for a $T$-step problem with Lipschitz reward of zooming dimension $d_z$, our algorithm achieves the theoretically optimal (up to logarithmic factors) regret rate $\widetilde{\mathcal{O}}\left(T^{\frac{d_z+1}{d_z+2}}\right)$ using only $\mathcal{O}(\log\log T)$ batches. We also provide complexity analysis for this problem. Our theoretical lower bound implies that $\Omega(\log\log T)$ batches are necessary for any algorithm to achieve the optimal regret. Thus, BLiN achieves the optimal regret rate (up to logarithmic factors) using minimal communication.
1 Introduction
Multi-Armed Bandit (MAB) algorithms aim to exploit the good options while exploring the decision space. These algorithms and methodologies find successful applications in artificial intelligence and reinforcement learning (e.g., [40]). While the classic MAB setting assumes that the rewards are immediately observed after each arm pull, real-world data often arrive in different patterns. For example, observations from clinical trials are often collected in a batched fashion [37]. Another example is online advertising, where strategies are tested on multiple customers at the same time [10]. In such cases, any observation-dependent decision-making should comply with this data-arrival pattern, including MAB algorithms.
In this paper, we study the Lipschitz bandit problem with batched feedback – a MAB problem where the expected reward is Lipschitz and the reward observations are communicated to the player in batches. In such settings, rewards are communicated only at the end of each batch, and the algorithm can only make decisions based on information up to the previous batch. Existing Lipschitz bandit algorithms heavily rely on timely access to the reward samples, since the partition of the arm space may change at any time. Therefore, they cannot solve the batched feedback setting. To address this difficulty, we present a novel adaptive algorithm for Lipschitz bandits with communication constraints, named Batched Lipschitz Narrowing (BLiN). BLiN learns the landscape of the reward by adaptively narrowing the arm set, so that regions of high reward are played more frequently. Also, BLiN determines the data collection procedure adaptively, so that only very few rounds of data communication are needed.
The above BLiN procedure achieves the optimal regret rate $\widetilde{\mathcal{O}}\left(T^{\frac{d_z+1}{d_z+2}}\right)$, where $d_z$ is the zooming dimension [28, 15], and can be implemented in a clean and simple form. In addition to achieving the optimal regret rate, BLiN also improves the state-of-the-art results in the following senses:
- BLiN’s time complexity is optimal: if the arithmetic operations and sampling are of complexity $\mathcal{O}(1)$, then the time complexity of BLiN is $\mathcal{O}(T)$, which improves the best known time complexity for Lipschitz bandit problems [15].
- The space complexity of BLiN also improves the best known result. This is because we do not need to store information about cubes in previous batches. The detailed time and space complexity analysis of BLiN is in Remark 1.
In Table 1 we provide a comparison of BLiN and state-of-the-art Lipschitz bandit algorithms in terms of regret bound, communication bound, time complexity and space complexity.
algorithm | regret | communication | time complexity | space complexity |
---|---|---|---|---|
Zooming [28] | $\widetilde{\mathcal{O}}\left(T^{\frac{d_z+1}{d_z+2}}\right)$ | $T$ | | |
HOO [15] | $\widetilde{\mathcal{O}}\left(T^{\frac{d_z+1}{d_z+2}}\right)$ | $T$ | | |
A-BLiN (our work) | $\widetilde{\mathcal{O}}\left(T^{\frac{d_z+1}{d_z+2}}\right)$ | $\mathcal{O}(\log\log T)$ | $\mathcal{O}(T)$ | |
1.1 Settings & Preliminaries
For a Lipschitz bandit problem (with communication constraints), the arm set is a compact doubling metric space $(\mathcal{A}, d_{\mathcal{A}})$. The expected reward $\mu : \mathcal{A} \to \mathbb{R}$ is $1$-Lipschitz with respect to the metric $d_{\mathcal{A}}$, that is, $|\mu(x_1) - \mu(x_2)| \le d_{\mathcal{A}}(x_1, x_2)$ for any $x_1, x_2 \in \mathcal{A}$.
At time $t$, the learning agent pulls an arm $x_t \in \mathcal{A}$ that yields a reward sample $y_t = \mu(x_t) + \epsilon_t$, where $\epsilon_t$ is a mean-zero independent sub-Gaussian noise. Without loss of generality, we assume that $\epsilon_t \sim \mathcal{N}(0,1)$, since generalizations to other sub-Gaussian noises are straightforward.
Similar to most bandit learning problems, the agent seeks to minimize regret in the batched feedback environment. The regret is defined as $R_T = \sum_{t=1}^{T}\left(\mu^* - \mu(x_t)\right)$, where $\mu^*$ denotes $\sup_{x \in \mathcal{A}} \mu(x)$. For simplicity, we define $\Delta_x = \mu^* - \mu(x)$ (called the optimality gap of $x$) for all $x \in \mathcal{A}$.
1.1.1 Doubling Metric Spaces and the Metric Space $\left([0,1]^d, \|\cdot\|_\infty\right)$
By Assouad’s embedding theorem [6], the (compact) doubling metric space $(\mathcal{A}, d_{\mathcal{A}})$ can be embedded into a Euclidean space with some distortion of the metric; see [46] for more discussion in a machine learning context. Due to the existence of such embeddings, the metric space $\left([0,1]^d, \|\cdot\|_\infty\right)$, where metric balls are hypercubes, is sufficient for the purpose of our paper. For the rest of the paper, we will use hypercubes in the algorithm design for simplicity, while our algorithmic idea generalizes to other doubling metric spaces.
1.1.2 Zooming Number and Zooming Dimension
An important concept for bandit problems in metric spaces is the zooming number and the zooming dimension [28, 14, 42], which we discuss now. We start with the definition of packing numbers.
Definition 1.
Let $(\mathcal{A}, d_{\mathcal{A}})$ be a metric space and let $S \subseteq \mathcal{A}$. The $r$-packing number of $S$ is the size of the largest packing of $S$ with disjoint open balls (with respect to $d_{\mathcal{A}}$) of radius $r$.
Then we define the zooming number and the zooming dimension.
Definition 2.
For a problem instance with arm set $\mathcal{A}$ and expected reward $\mu$, we let $S_r$ denote the set of $r$-optimal arms, that is, $S_r = \{x \in \mathcal{A} : \Delta_x \le r\}$. We define the $r$-zooming number $N_r$ as the $r$-packing number of $S_r$. The zooming dimension is then defined as
$$d_z = \min\left\{d \ge 0 : \exists\, c > 0 \text{ such that } N_r \le c\, r^{-d} \text{ for all } r \in (0, 1]\right\}.$$
Moreover, we define the zooming constant $C_z$ as the smallest constant $c$ for which the above inequality holds with $d = d_z$, that is, $C_z = \min\left\{c > 0 : N_r \le c\, r^{-d_z} \text{ for all } r \in (0,1]\right\}$.
The zooming dimension can be significantly smaller than the ambient dimension $d$, and can even be zero. For a simple example, consider a problem with ambient dimension $d$ and expected reward function $\mu(x) = 1 - \|x\|_2$ for $x \in [0,1]^d$. Then any $x$ with $\Delta_x \le r$ satisfies $\|x\|_2 \le r$, so $S_r$ is contained in a single ball of radius $r$ and $N_r = \mathcal{O}(1)$. Therefore, for this problem the zooming dimension equals $0$, with the zooming constant being an absolute constant.
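To make these definitions concrete, the following sketch numerically estimates a packing-based zooming number by greedy packing. It is a rough sanity check, not from the paper: the reward functions, the grid size, and the packing convention (disjoint sup-norm balls of radius $r$) are illustrative assumptions.

```python
import numpy as np

def greedy_packing_number(points, r):
    """Greedily count centers whose pairwise sup-norm distance is at least 2r,
    so that the open radius-r sup-norm balls around them are disjoint."""
    centers = []
    for p in points:
        if all(np.max(np.abs(p - c)) >= 2 * r for c in centers):
            centers.append(p)
    return len(centers)

def zooming_number_estimate(mu, d, r, n_grid=50_000, seed=0):
    """Estimate N_r: a packing number (at scale r) of S_r = {x : mu* - mu(x) <= r}."""
    rng = np.random.default_rng(seed)
    xs = rng.random((n_grid, d))              # random points in [0, 1]^d
    mu_star = 1.0                             # maximum of the toy rewards below
    near_optimal = xs[mu_star - mu(xs) <= r]  # points falling in S_r
    return greedy_packing_number(near_optimal, r)

# Toy reward with a single peak at the origin: mu(x) = 1 - ||x||_2.
peaked = lambda x: 1.0 - np.linalg.norm(x, axis=-1)
# For contrast, a flat reward makes S_r the whole cube, so N_r grows like r^{-d}.
flat = lambda x: np.ones(x.shape[:-1])

for r in [0.4, 0.2, 0.1, 0.05]:
    print(f"r={r}:  peaked N_r ~ {zooming_number_estimate(peaked, 2, r)}"
          f"   flat N_r ~ {zooming_number_estimate(flat, 2, r)}")
```

For the peaked reward the estimate stays bounded as $r$ shrinks (zooming dimension $0$), while for the flat reward it grows like $r^{-2}$ (zooming dimension equal to the ambient dimension).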
1.2 Batched feedback pattern and our results
In the batched feedback setting, for a $T$-step game, the player determines a grid $\mathcal{T} = \{t_0, t_1, \dots, t_B\}$ adaptively, where $0 = t_0 < t_1 < \cdots < t_B = T$. During the game, reward observations are communicated to the player only at the grid points $t_1, \dots, t_B$. As a consequence, for any time $t$ in the $j$-th batch, that is, $t_{j-1} < t \le t_j$, the reward cannot be observed until time $t_j$, and the decision made at time $t$ depends only on rewards up to time $t_{j-1}$. The determination of the grid $\mathcal{T}$ is adaptive in the sense that the player chooses each grid point $t_j$ based on the operations and observations up to the previous grid point $t_{j-1}$.
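A minimal sketch (not from the paper) of this feedback pattern is given below; the class and method names (`BatchedEnvironment`, `pull`, `end_batch`) are illustrative. The environment generates a reward at every pull but withholds all observations until the player closes the current batch.

```python
import numpy as np

class BatchedEnvironment:
    """Reward samples are generated at every pull but revealed only when the
    player closes the current batch, i.e., at a grid point chosen by the player."""

    def __init__(self, mu, seed=None):
        self.mu = mu                          # expected reward function
        self.rng = np.random.default_rng(seed)
        self._buffer = []                     # observations withheld within the current batch

    def pull(self, arm):
        reward = self.mu(arm) + self.rng.normal()   # standard normal noise
        self._buffer.append((arm, reward))          # nothing is returned to the player yet

    def end_batch(self):
        """Called at a grid point: all buffered observations are communicated at once."""
        observations, self._buffer = self._buffer, []
        return observations
```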
In this work, we present the BLiN algorithm to solve Lipschitz bandits under batched feedback. During the learning procedure, BLiN detects and eliminates the ‘bad area’ of the arm set in batches and partitions the remaining area according to an appropriate edge-length sequence. Our first theoretical upper bound is that with the simple Doubling Edge-length Sequence $r_m = 2^{-m}$, BLiN achieves the optimal regret rate using $\mathcal{O}(\log T)$ batches.
Theorem 1.
With high probability, the $T$-step total regret $R_T$ of BLiN with Doubling Edge-length Sequence (D-BLiN) satisfies
$$R_T \le \widetilde{\mathcal{O}}\left(T^{\frac{d_z+1}{d_z+2}}\right),$$
where $d_z$ is the zooming dimension of the problem instance. In addition, D-BLiN needs no more than $\mathcal{O}(\log T)$ rounds of communications to achieve this regret rate. Here and henceforth, $\widetilde{\mathcal{O}}(\cdot)$ omits constants and logarithmic factors.
While D-BLiN is efficient for batched Lipschitz bandits, its communication complexity is not optimal. We then propose a new edge-length sequence, which we call the Appropriate Combined Edge-length Sequence (ACE Sequence), to improve the algorithm. The idea behind this sequence is that by appropriately combining some batches, the algorithm can achieve a better communication bound without incurring increased regret. As we shall see, BLiN with the ACE Sequence (A-BLiN) achieves the regret rate $\widetilde{\mathcal{O}}\left(T^{\frac{d_z+1}{d_z+2}}\right)$ with only $\mathcal{O}(\log\log T)$ batches.
Theorem 2.
With high probability, the $T$-step total regret $R_T$ of A-BLiN satisfies
$$R_T \le \widetilde{\mathcal{O}}\left(T^{\frac{d_z+1}{d_z+2}}\right),$$
where $d_z$ is the zooming dimension of the problem instance. In addition, Algorithm 1 needs no more than $\mathcal{O}(\log\log T)$ rounds of communications to achieve this regret rate.
As a comparison, seminal works [28, 42, 15] show that the optimal regret bound for Lipschitz bandits without communication constraints, where the reward observations are immediately observable after each arm pull, is $\widetilde{\mathcal{O}}\left(T^{\frac{d_z+1}{d_z+2}}\right)$. Therefore, A-BLiN achieves the optimal regret rate for Lipschitz bandits using very few batches.
Furthermore, we provide a theoretical lower bound for Lipschitz bandits with batched feedback.
Theorem 3.
Consider Lipschitz bandit problems with time horizon $T$, ambient dimension $d$ and zooming dimension $d_z$. If $B$ rounds of communications are allowed, then for any policy $\pi$, there exists a problem instance with zooming dimension $d_z$ such that
In the lower bound analysis, we use a “linear-decaying extension” technique to construct problem instances with zooming dimension $d_z < d$. To the best of our knowledge, our construction provides the first minimax lower bound for Lipschitz bandits where the zooming dimension $d_z$ is explicitly different from the ambient dimension $d$. As a result of Theorem 3, we can derive the minimum number of rounds of communications needed to achieve the optimal regret bound for the Lipschitz bandit problem, which is stated in Corollary 1. The proof of Corollary 1 is deferred to Appendix A.
Corollary 1.
For Lipschitz bandit problems with ambient dimension $d$, zooming dimension $d_z$ and time horizon $T$, any algorithm needs $\Omega(\log\log T)$ rounds of communications to achieve the optimal regret rate $\widetilde{\mathcal{O}}\left(T^{\frac{d_z+1}{d_z+2}}\right)$.
Consequently, the BLiN algorithm is optimal in terms of both regret and communication.
1.3 Related Works
The history of the Multi-Armed Bandit (MAB) problem dates back to Thompson [45]. Solvers for this problem include the UCB algorithms [30, 4, 7], the arm elimination method [20, 35, 39], the $\epsilon$-greedy strategy [7, 43], and the exponential weights and mirror descent framework [8].
Recently, with the prevalence of distributed computing and large-scale field experiments, the batched feedback setting has captured attention (e.g., [17]). Perchet et al. [36] mainly consider batched bandits with two arms, and prove a matching lower bound for static grids. This was then generalized by Gao et al. [22] to finite-armed bandit problems; the authors design an elimination method for the finite-armed bandit problem and prove matching lower bounds for both static and adaptive grids. Soon afterwards, Zhang et al. [49] study inference for batched bandits. Esfandiari et al. [19] study batched linear bandits and batched adversarial bandits. Han et al. [24] and Ruan et al. [38] provide solutions for batched contextual linear bandits. Li and Scarlett [31] study batched Gaussian process bandits. Batched dueling bandits have also been studied by Agarwal et al. [2]. Parallel to the regret control regime, best arm identification with a limited number of batches was studied in [1] and [25]. Top-$k$ arm identification in the collaborative learning framework is also closely related to the batched setting, where the goal is to minimize the number of iterations (or communication steps) between agents. In this setting, tight bounds have been obtained in the recent works [44, 26]. Yet the problem of Lipschitz bandits with communication constraints remains unsolved.
The Lipschitz bandit problem is important in its own right. It was introduced as “continuum-armed bandits” [3], where the arm space is a compact interval. Along this line, bandits that are Lipschitz (or Hölder) continuous have been studied. For this problem, Kleinberg [27] proves an $\Omega(T^{2/3})$ lower bound and introduces a matching algorithm. Under extra conditions on top of Lipschitzness, a regret rate of $\widetilde{\mathcal{O}}(\sqrt{T})$ was achieved [9, 18]. For general (doubling) metric spaces, the Zooming bandit algorithm [28] and the Hierarchical Optimistic Optimization (HOO) algorithm [15] were developed. In more recent years, some attention has been focused on Lipschitz bandit problems with certain extra structures. To name a few, Bubeck et al. [16] study Lipschitz bandits for differentiable rewards, which enables algorithms to run without explicitly knowing the Lipschitz constants. Wang et al. [47] study discretization-based Lipschitz bandit algorithms from a Gaussian process perspective. Magureanu et al. [33] derive a new concentration inequality and study discrete Lipschitz bandits. The idea of robust mean estimators [11, 5, 13] was applied to the Lipschitz bandit problem to cope with heavy-tailed rewards, leading to the development of a near-optimal algorithm for Lipschitz bandits with heavy-tailed rewards [32]. Wanigasekara and Yu [48] study Lipschitz bandits where clustering is used to infer the underlying metric. Contextual Lipschitz bandits have been studied by Slivkins [42]. Contextual bandits with continuous actions have also been studied by Krishnamurthy et al. [29] and Majzoubi et al. [34] through a smoothness approach. Yet all of the existing works on Lipschitz bandits assume that the reward sample is immediately observed after each arm pull, and none of them solve the Lipschitz bandit problem with communication constraints.
This paper is organized as follows. In Section 2, we introduce the BLiN algorithm and give a visual illustration of the algorithm procedure. In Section 3, we prove that BLiN with the ACE Sequence achieves the optimal regret rate using only $\mathcal{O}(\log\log T)$ rounds of communications. Section 4 provides information-theoretic lower bounds for Lipschitz bandits with communication constraints, which show that BLiN is optimal in terms of both regret and rounds of communications. Experimental results are presented in Section 5.
2 Algorithm
With communication constraints, the agent’s knowledge about the environment does not accumulate within each batch. This characteristic of the problem suggests a ‘uniform’ type algorithm – we shall treat each step within the same batch equally. Following this intuition, in each batch, we uniformly play the remaining arms, and then eliminate arms of low reward after the observations are communicated. Next we describe the uniform play rule and the arm elimination rule.
Uniform Play Rule: At the beginning of each batch $m$, a collection $\mathcal{A}_m$ of subsets of the arm space is constructed. This collection consists of standard cubes, and all cubes in $\mathcal{A}_m$ have the same edge length $r_m$. We will detail the construction of $\mathcal{A}_m$ when we describe the arm elimination rule. We refer to cubes in $\mathcal{A}_m$ as active cubes of batch $m$.
During batch $m$, each cube $C$ in $\mathcal{A}_m$ is played
$$n_m = \left\lceil \frac{16 \log T}{r_m^2} \right\rceil \qquad (1)$$
times, where $T$ is the total time horizon. More specifically, within each $C \in \mathcal{A}_m$, arms $x_{C,1}, \dots, x_{C,n_m}$ are played (one can choose these arms arbitrarily as long as $x_{C,i} \in C$ for all $i$). The reward samples corresponding to $C$ will be collected at the end of this batch.
Arm Elimination Rule: At the end of batch $m$, information from the arm pulls is collected, and we estimate the reward of each cube $C \in \mathcal{A}_m$ by the average payoff $\widehat{\mu}_m(C) = \frac{1}{n_m}\sum_{i=1}^{n_m} y_{C,i}$. Cubes of low estimated reward are then eliminated, according to the following rule: a cube $C$ is eliminated if $\widehat{\mu}_m^{\max} - \widehat{\mu}_m(C)$ exceeds a threshold of order $r_m$, where $\widehat{\mu}_m^{\max} = \max_{C' \in \mathcal{A}_m} \widehat{\mu}_m(C')$. After the necessary removal of “bad cubes”, each cube in $\mathcal{A}_m$ that survives the elimination is equally partitioned into subcubes of edge length $r_{m+1}$, where $(r_m)_{m \ge 1}$ is a predetermined sequence to be specified soon. These subcubes (of edge length $r_{m+1}$) are collected to construct $\mathcal{A}_{m+1}$, and the learning process moves on to the next batch. Appropriate rounding may be required to ensure that the ratio $r_m / r_{m+1}$ is an integer. See Remark 2 for more details.
The learning process is summarized in Algorithm 1.
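A minimal sketch of this loop is given below, reusing the `BatchedEnvironment` sketch from Section 1.2. The sampling count $n_m \approx 16\log T / r_m^2$, the elimination threshold $4 r_m$, and the doubling edge-length sequence are stated here as assumptions for illustration; the paper's Algorithm 1 (including the ACE Sequence and the clean-up batch) is the reference.

```python
import itertools
import numpy as np

def d_blin(env, T, d=2):
    """Sketch of BLiN with the doubling edge-length sequence r_m = 2^{-m} on [0, 1]^d."""
    spent, r = 0, 1.0
    cubes = [tuple(0.0 for _ in range(d))]                 # cubes identified by their lower corners
    while spent < T:
        n_m = int(np.ceil(16 * np.log(T) / r ** 2))        # assumed sampling rule, ~ log T / r_m^2
        n_m = min(n_m, max((T - spent) // len(cubes), 1))  # do not run far past the budget
        arms = {c: tuple(x + r / 2 for x in c) for c in cubes}  # play the center of each cube
        for c in cubes:                                    # uniform play rule
            for _ in range(n_m):
                env.pull(arms[c])
        spent += n_m * len(cubes)
        observations = env.end_batch()                     # one round of communication
        est = {c: np.mean([y for a, y in observations if a == arms[c]]) for c in cubes}
        best = max(est.values())
        survivors = [c for c in cubes if best - est[c] <= 4 * r]  # assumed threshold of order r_m
        r /= 2                                             # doubling edge-length sequence
        cubes = [tuple(x + b * r for x, b in zip(c, bits))        # split survivors into 2^d subcubes
                 for c in survivors for bits in itertools.product((0, 1), repeat=d)]
    return cubes
```

For instance, `d_blin(BatchedEnvironment(lambda x: 1.0 - np.hypot(*x)), T=10_000)` runs the sketch on the toy single-peak reward of Section 1.1.2; in a full BLiN run the last batch is instead a clean-up batch that spends the remaining budget on the surviving cubes.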
Remark 1 (Time and space complexity).
The time complexity of our algorithm is $\mathcal{O}(T)$, which is better than the state of the art [15]. This is because the running time of a batch is of the order of the number of samples in that batch, and the sample counts across batches sum to at most $T$; hence the time complexity of BLiN is $\mathcal{O}(T)$. Besides, the space complexity of our algorithm also improves the best known space complexity. This is because we do not need to store information about cubes in previous batches. The space complexity analysis is deferred to Appendix B.
The following theorem gives regret and communication upper bounds for BLiN with the Doubling Edge-length Sequence (see Appendix C for a proof). Note that this result implies Theorem 1.
Theorem 4.
With high probability, the $T$-step total regret $R_T$ of BLiN with Doubling Edge-length Sequence (D-BLiN) satisfies
$$R_T \le \widetilde{\mathcal{O}}\left(T^{\frac{d_z+1}{d_z+2}}\right),$$
where $d_z$ is the zooming dimension of the problem instance. In addition, D-BLiN needs no more than $\mathcal{O}(\log T)$ rounds of communications to achieve this regret rate.
Although D-BLiN efficiently solves batched Lipschitz bandits, its simple partition strategy leads to suboptimal communication complexity. Now we show that by appropriately combining some batches, BLiN achieves the optimal communication bound without incurring additional regret. Specifically, we introduce the following edge-length sequence, which we call the ACE Sequence. When using the ACE Sequence, the regret of each batch is of order $\widetilde{\mathcal{O}}\left(T^{\frac{d_z+1}{d_z+2}}\right)$. Thus, the length of any batch cannot be increased without affecting the optimal regret rate. This implies that the ACE Sequence is optimal in terms of both regret and communication. See the proof of Theorem 5 for more details.
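The following back-of-the-envelope accounting sketches this point. It assumes the sampling rule $n_m \approx \frac{16\log T}{r_m^2}$ and the packing bound used in Section 3, so the constants are illustrative rather than the paper's exact ones. The regret $R_m$ of batch $m$ satisfies

\[
R_m \;\le\; \underbrace{|\mathcal{A}_m|}_{\#\text{ active cubes}} \cdot \underbrace{n_m}_{\text{pulls per cube}} \cdot \underbrace{\mathcal{O}(r_{m-1})}_{\text{gap per pull}}
\;\lesssim\; N_{r_{m-1}} \left(\frac{r_{m-1}}{r_m}\right)^{d} \cdot \frac{\log T}{r_m^{2}} \cdot r_{m-1},
\]

and the ACE Sequence is chosen so that each such per-batch term is of order $\widetilde{\mathcal{O}}\big(T^{\frac{d_z+1}{d_z+2}}\big)$; shrinking $r_m$ further relative to $r_{m-1}$ (i.e., lengthening the batch) would push this term above the optimal order.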
Definition 3.
For a problem with ambient dimension , zooming dimension and time horizon , we denote and for any , where . Then the Appropriate Combined Edge-length (ACE) Sequence is defined by for any .
Theorem 5 states that BLiN with the ACE Sequence (A-BLiN) obtains an improved communication complexity, and thus proves Theorem 2.
Theorem 5. With high probability, the $T$-step total regret of A-BLiN satisfies
$$R_T \le \widetilde{\mathcal{O}}\left(T^{\frac{d_z+1}{d_z+2}}\right), \qquad (2)$$
where $d_z$ is the zooming dimension of the problem instance. In addition, A-BLiN needs no more than $\mathcal{O}(\log\log T)$ rounds of communications to achieve this regret rate.
The partition and elimination process of a real A-BLiN run is shown in Figure 1. In the $m$-th subgraph, the white cubes are those remaining after the $m$-th batch. Note that the optimal arm is not eliminated during the game. More details of this experiment are in Section 5.
(Figure 1: partition and elimination process of an A-BLiN run, one subgraph per batch.)
The definition of the ACE Sequence relies on the zooming dimension $d_z$. If $d_z$ is not known ahead of time, we recommend two ways to proceed. 1) The player can apply BLiN with the Doubling Edge-length Sequence (D-BLiN). Theorem 4 shows that D-BLiN achieves the optimal regret rate using $\mathcal{O}(\log T)$ batches. Although this number of batches is not optimal, it is good enough for total time horizons that are not too large, as shown in the experimental results in Appendix G.2. 2) If an upper bound on the zooming dimension is known, that is, $d_z \le d_u$, then the player can apply A-BLiN by using $d_u$ to define the ACE Sequence. Theorem 5 yields that A-BLiN with $d_u$ achieves regret rate $\widetilde{\mathcal{O}}\left(T^{\frac{d_u+1}{d_u+2}}\right)$ using $\mathcal{O}(\log\log T)$ batches.
3 Regret Analysis of A-BLiN
In this section, we provide a regret analysis for A-BLiN. The highlight of the finding is that $\mathcal{O}(\log\log T)$ batches are sufficient to achieve the optimal regret rate $\widetilde{\mathcal{O}}\left(T^{\frac{d_z+1}{d_z+2}}\right)$, as summarized in Theorem 5.
To prove Theorem 5, we first show that the estimator is concentrated around the true expected reward (Lemma 1), and that the optimal arm survives all eliminations with high probability (Lemma 2). In the following analysis, we let $B$ denote the total number of batches of the BLiN run.
Lemma 1.
Define
It holds that .
Proof.
Fix a cube . Recall the average payoff of cube is defined as
We also have
Since the average noise obeys a normal distribution, Hoeffding's inequality gives
On the other hand, by Lipschitzness of , it is obvious that
Consequently, we have
Lemma 2.
Under event (defined in Lemma 1), the optimal arm is not eliminated after the first batches.
Proof.
We use $C_m^*$ to denote the cube in $\mathcal{A}_m$ that contains the optimal arm $x^*$. Here we prove that $C_m^*$ is not eliminated in round $m$.
Under event , for any cube and , we have
where the second inequality follows from (1). Then from the elimination rule, is not eliminated. ∎
Based on these results, we show the cubes that survive elimination are of high reward.
Lemma 3.
Under event (defined in Lemma 1), for any , any and any , satisfies
(3) |
Proof.
For , (3) holds directly from the Lipschitzness of . For , let be the cube in such that . From Lemma 2, this cube is well-defined under . For any cube and , it is obvious that is also in the parent of (the cube in the previous round that contains ), which is denoted by . Thus for any , it holds that
where the inequality uses Lemma 1.
Equality gives that
It is obvious that . Moreover, since the cube is not eliminated, from the elimination rule we have
Hence, we conclude that . ∎
We are now ready to prove Theorem 5.
Proof of Theorem 5.
Let denote regret of the -th batch. Fixing any positive number , the total regret can be divided into two parts: . In the following, we bound these two parts separately and then determine to obtain the upper bound of the total regret. Moreover, we show A-BLiN uses only rounds of communications to achieve the optimal regret.
Recall that is set of the active cubes in the -th batch. According to Lemma 3, for any , we have . Let be set of cubes not eliminated in batch . Each cube in is a -ball with radius , and is a subset of . Therefore, forms a -packing of , and the definition of zooming dimension yields that
(4) |
By definition, , so
(5) |
The total regret of the -th batch is
(6) | ||||
where (i) follows from (5), (ii) follows from (4), and (iii) follows from the definition of ACE Sequence.
Define . Since , calculation shows that . Thus for any , we have . Hence,
(7) |
The inequality (7) holds even if the -th batch does not exist (where we let ) or is not completed. Thus we obtain the first upper bound . Lemma 3 implies that any arm played after the first batches satisfies , so the total regret after batches is bounded by
Therefore, the total regret satisfies
This inequality holds for any positive . Then by choosing , we have and
The above analysis implies that we can achieve the optimal regret rate by letting the for-loop run times and finishing the remaining rounds in the Cleanup step. In other words, rounds of communications are sufficient for A-BLiN to achieve the regret bound (2). ∎
Remark 2.
The quantity $r_m / r_{m+1}$ in line 9 of Algorithm 1 may not be an integer for some $m$. Thus, in practice we define a rounded ACE Sequence obtained by rounding the edge lengths appropriately. The total regret can then be divided into three parts. By a similar argument to the proof of Theorem 5, we can bound these three parts separately, and conclude that BLiN with the rounded ACE Sequence achieves the optimal regret bound by using only $\mathcal{O}(\log\log T)$ rounds of communications. The rounded ACE Sequence is further investigated in [21, 23].
For the rounded ACE Sequence, the ratio of consecutive edge lengths is an integer, so the partition in line 9 of Algorithm 1 is well-defined. In Theorem 6, we show that BLiN with the rounded ACE Sequence can also achieve the optimal regret bound by using $\mathcal{O}(\log\log T)$ batches. If two consecutive rounded edge lengths coincide, then we skip the corresponding batch. It is easy to verify that the following analysis is still valid in this case.
Theorem 6.
Proof.
Firstly, we fix a positive number and consider the first batches. As summarized in Remark 2, we bound the regret caused by these first batches through two different arguments.
For , , we have and , and thus
(8) |
Let be set of cubes not eliminated in round . Similar argument to Theorem 5 shows that . The total regret of round is
where the sixth line follows from (8), and the seventh line follows from (7). Summing over gives that
(9) |
For , , we have and , and thus . Lemma 3 shows that any cube in is a subset of , so we have . Therefore, the total regret of round is
Since , summing over gives that
(10) |
The definition of the rounded ACE Sequence shows that
so we have
(11) |
Similar argument to Theorem 5 shows that the total regret after batches is upper bounded by . Since
we further have
(12) |
Combining (9), (11) and (12), we conclude that
The analysis in Theorem 6 implies that we can achieve the optimal regret rate by letting the for-loop of Algorithm 1 run times and finishing the remaining rounds in the Cleanup step. In other words, rounds of communications are sufficient for BLiN to achieve the optimal regret. ∎
4 Lower Bounds
In this section, we present lower bounds for Lipschitz bandits with batched feedback, which in turn give communication lower bounds for all Lipschitz bandit algorithms. Our lower bounds depend on the number of rounds of communications $B$. When $B$ is sufficiently large, our results match the upper bound $\widetilde{\mathcal{O}}\left(T^{\frac{d_z+1}{d_z+2}}\right)$ for the vanilla Lipschitz bandit problem. More importantly, this dependency on $B$ gives the minimal number of rounds of communications needed to achieve the optimal regret bound for all Lipschitz bandit algorithms, which is summarized in Corollary 1. Since this lower bound matches the upper bound presented in Theorem 5, BLiN optimally solves Lipschitz bandits with minimal communication.
Similar to most lower bound proofs, we need to construct problem instances that are difficult to differentiate. What is different here is that we need to carefully integrate the batched feedback pattern [36] with Lipschitz payoff rewards [42, 32]. To capture the adaptivity in grid determination, we construct “static reference communication grids” to remove the stochasticity in grid selection [1, 22]. Moreover, to prove the lower bounds for general $d_z \le d$, we apply a “linear-decaying extension” technique to transfer instances from a $d_z$-dimensional subspace to the $d$-dimensional whole space.
The lower bound analysis is organized as follows. In Section 4.1, we present lower bounds for the full-dimensional case, that is, $d_z = d$. In Section 4.1.1, we consider the static grid case, where the grid is predetermined. This static grid case provides intuition for the adaptive and more general case. In Section 4.1.2, we provide the lower bound for general adaptive grids. Finally, in Section 4.2, we apply the “linear-decaying extension” technique to prove lower bounds for the case where $d_z < d$.
4.1 Lower Bounds for the Full-Dimension Case
In this section, we let the zooming dimension $d_z$ equal the ambient dimension $d$. The aim of considering this case first is to simplify the construction and highlight the technique for handling the batched feedback setting.
4.1.1 The Static Grid Case
We first provide the lower bound for the case where the grid is static and determined before the game.
The expected reward functions of the hard instances are constructed as follows: we choose some ‘positions’ and ‘heights’, such that the expected reward function attains a local maximum of the specified ‘height’ at the specified ‘position’. We will use the word ‘peak’ to refer to these local maxima. The following theorem presents the lower bound for the static grid case.
Theorem 7.
Consider Lipschitz bandit problems with time horizon $T$ and ambient dimension $d$ such that the grid of reward communication is static and determined before the game. If $B$ rounds of communications are allowed, then for any policy $\pi$, there exists a problem instance such that
To prove Theorem 7, we first show that for any there exists an instance such that . Fixing , we let and . Then we construct a set of problem instances , such that the gap between the highest peak and the second highest peak is about for every instance in .
Based on this construction, we prove that no algorithm can distinguish instances in from one another in the first batches, so the worst-case regret is at least , which gives the inequality we need. For the first batch , we can easily construct a set of instances where the worst-case regret is at least , since no information is available during this time. Thus, there exists a problem instance such that
Since , the inequality in Theorem 7 follows.
Proof of Theorem 7.
Fixing an index , we first show that there exists an instance such that . We construct a set of problem instances that is difficult to distinguish. Let and . We define and in this way to: 1. ensure that we can find a set of arms such that for any ; 2. maximize while ensuring , and thus the hard instances with maximum reward can still confuse a learner who only make observations. Then we consider a set of problem instances . The expected reward for is defined as
(13) |
For , the expected reward for is defined as
(14) |
Let (the ball with center and radius ). It is easy to verify the following properties of construction (13) and (14):
-
1.
For any , for any ;
-
2.
For any , , for any ;
-
3.
For any , under , pulling an arm that is not in incurs a regret at least .
For all arm pulls in all problem instances, a Gaussian noise sampled from $\mathcal{N}(0,1)$ is added to the observed reward. This noise corruption is independent of all other randomness.
The lower bound of expected regret relies on the following lemma.
Lemma 4.
For any policy , there exists a problem instance such that
Proof.
Let denote the choices of policy at time , and denote the reward. Additionally, for , we define as the distribution of sequence under instance and policy . It holds that
(15) |
where denotes the regret incurred by policy at time .
From our construction, it is easy to see that for any , so we can construct a test such that implies . Then from Lemma 12,
To avoid notational clutter, for any (), define
Now we calculate . From the chain rule of KL-Divergence, we have
(16) | ||||
(17) | ||||
(18) | ||||
(19) | ||||
(20) |
where (16) uses chain rule for KL-divergence and the conditional independence of the reward, (17) removes dependence on in the first term by another use of chain rule and the fact that the distribution of is fully determined by the policy and the distribution of , (18) uses that the rewards are corrupted by a standard normal noise, and (19) uses the first two properties of the construction.
By omitting terms with in the above summation, we have
The above analysis can be applied for any . For the first batch , we can easily construct a set of instances where the worst-case regret is at least , since no information is available during this time. Thus, there exists a problem instance such that
Since , we further have
where . The above two inequalities imply that
which finishes the proof. ∎
4.1.2 Removing the Static Grid Assumption
So far we have derived the lower bound for the static grid case. Yet there is a gap between the static and the adaptive case. We will close this gap in the following Theorem.
Theorem 8.
Consider Lipschitz bandit problems with time horizon $T$ and ambient dimension $d$ such that the grid of reward communication is adaptively determined by the player. If $B$ rounds of communications are allowed, then for any policy $\pi$, there exists a problem instance such that
To prove Theorem 8, we consider a reference static grid , where for . We set the reference grid in this way because it is the solution to the following optimization problem
Then we construct a series of ‘worlds’, denoted by . Each world is a set of problem instances, and each problem instance in world is defined by peak location set and basic height , where the sets and quantities for are presented in the proof below.
Based on these constructions, we first prove that for any adaptive grid and policy, there exists an index such that the event happens with sufficiently high probability in world . Then similar to Theorem 7, we prove that in world there exists a set of problem instances that is difficult to differentiate in the first batches. In addition, event implies that , so the worst-case regret is at least , which gives the lower bound we need.
Proof of Theorem 8.
Firstly, we define , where , and define . From the definition, we have
(24) |
For , we can find sets of arms such that (a) for any , and (b) .
Then we present the construction of worlds . For , we let , and the expected reward of is defined as
(25) |
and , otherwise. For , we let . The expected reward of is defined as
(26) |
and , otherwise. Roughly speaking, our constructions satisfy two properties: for each and ,
-
1.
is close to ;
-
2.
under , pulling an arm that is far from incurs a regret at least .
The formal version of these two properties are presented in the proof of Lemma 5 and Lemma 6.
As mentioned above, based on these constructions, we first show that for any adaptive grid , there exists an index such that is sufficiently large in world . More formally, for each , and event , we define the quantities for and , where denotes the probability of the event under instance and policy . For these quantities, we have the following lemma.
Lemma 5.
For any adaptive grid and policy , it holds that
Proof.
For and , we define , which is the ball centered as with radius . It is easy to verify the following properties of our construction (25) and (26):
-
1.
for any ;
-
2.
, for any .
Let denote the choices of policy at time , and denote the reward. For , we define (resp. ) as the distribution of sequence under instance (resp. ) and policy . Since event can be completely described by the observations up to time ( is an event in the -algebra where and are defined on), we can use the definition of total variation to get
For the total variation, we apply Lemma 11 to get
An argument similar to (21) yields that
where denotes the number of pulls which is in before the batch containing . Combining the above two inequalities gives
(27) | ||||
(28) | ||||
(29) | ||||
where (27) uses Jensen’s inequality, (28) uses the fact that , and (29) uses (24).
Plugging the above results implies that
Since , it holds that
Lemma 5 implies that there exists some such that . Then similar to Theorem 7, we show that the worst-case regret of the policy in world gives the lower bound we need.
Lemma 6.
For adaptive grid and policy , if index satisfies , then there exists a problem instance such that
Proof.
Here we proceed with the case where . The case for can be proved analogously.
For any , we construct a set of problem instances . For , the expected reward of is defined as
where is defined in (25). For , we let .
We define , and our construction has the following properties:
-
1.
For any , for any ;
-
2.
For any , for any ;
-
3.
For any , under , pulling an arm that is not in incurs a regret at least .
Let denote the choices of policy at time , and denote the reward. For , we define as the distribution of sequence under instance and policy . From similar argument in (15), it holds that
(30) |
From our construction, it is easy to see that for any , so we can construct a test such that implies . By Lemma 12 with a star graph on with center , we have
(31) |
(32) | ||||
(33) | ||||
(34) |
where (32) follows from data processing inequality of total variation and the equation , (33) restricts the integration to event , and (34) holds because the observations at time are the same as those at time under event .
For the term , it holds that
(35) | ||||
(36) |
where (35) uses the inequality , and (36) holds because and can be determined by the observations up to .
Finally, combining the above two lemmas, we arrive at the lower bound in Theorem 8. ∎
4.2 Communication Lower bound for Lipschitz Bandits with Batched Feedback
Based on the constructions for the full-dimension case, we are now ready to present the theoretical lower bound for batched Lipschitz bandits with $d_z < d$. For ease of understanding, we start with the result for the static grid case. In this section we provide lower bounds for integer-valued $d_z$; in general, the zooming dimension is not necessarily an integer.
Theorem 9.
Consider Lipschitz bandit problems with time horizon $T$, ambient dimension $d$ and zooming dimension $d_z$ such that the grid of reward communication is static and determined before the game. If $B$ rounds of communications are allowed, then for any policy $\pi$, there exists a problem instance with zooming dimension $d_z$ such that
The proof technique of Theorem 9 is similar to that of Theorem 7, with the main difference being the construction of the hard instances. To build instances satisfying the statement of Theorem 9, we first construct reward functions in a $d_z$-dimensional subspace according to the argument in Theorem 7, and then use a “linear-decaying extension” technique to transfer them to the $d$-dimensional whole space. Below, we present the detailed construction of the hard instances.
To begin with, we introduce some settings and notations. For a given , we consider the whole space and being the induced metric of . can be represented by a Cartesian product . Therefore, each element in can be represented as , the concatenation of and . Additionally, we use to denote the -dimensional subspace , where is the zero vector in .
For any fixed , let and . We can find a set of arms such that for any . As stated above, we first construct a set of expected reward functions on , which are the same as (13) and (14) in the proof of Theorem 7. We define as
(39) |
For , is defined as
(40) |
Based on , we define a set of problem instances . For each , the expected reward for is defined as
(41) |
where is the concatenation of and . Note that is well-defined since .
For all arm pulls in all problem instances, a Gaussian noise sampled from $\mathcal{N}(0,1)$ is added to the observed reward. This noise corruption is independent of all other randomness.
Now we show that for each instance constructed above, the expected reward is $1$-Lipschitz and the zooming dimension equals $d_z$. Firstly, for any two arms, the construction (41) yields
Therefore, the expected reward is $1$-Lipschitz. Secondly, for any near-optimal arm, (41) yields that it lies close to the $d_z$-dimensional subspace. As a consequence, the zooming dimension equals $d_z$.
After presenting the new constructions, we show that similar argument to the full-dimension case gives the lower bound we need. The remaining proof of Theorem 9 is deferred to Appendix E.
Finally, we combine all techniques in above analysis to obtain the lower bound for general and adaptive grid, and thus prove Theorem 3.
Theorem 10.
Consider Lipschitz bandit problems with time horizon $T$, ambient dimension $d$ and zooming dimension $d_z$ such that the grid of reward communication is adaptively determined by the player. If $B$ rounds of communications are allowed, then for any policy $\pi$, there exists a problem instance with zooming dimension $d_z$ such that
5 Experiments
In this section, we present numerical studies of A-BLiN. In the experiments, we use a two-dimensional arm space and an expected reward function with a unique optimal arm. The landscape of the reward and the resulting partition are shown in Figure 2a. As can be seen, the partition is finer in the area closer to the optimal arm.
(Figure 2: (a) reward landscape and resulting partition; (b) accumulated regret of A-BLiN, with batches indicated by background colors.)
We report the accumulated regret in Figure 2b. The regret curve is sublinear, which agrees with the regret bound (2). Besides, different background colors in Figure 2b represent different batches; for the total time horizon, A-BLiN only needs a few rounds of communications. We also present the regret curve of the zooming algorithm [28] for comparison. Different from the zooming algorithm, the regret curve of A-BLiN is approximately piecewise linear, because the strategy of BLiN does not change within each batch. Results of more repeated experiments, as well as experimental results of D-BLiN, are in Appendix G. Our code is available at https://github.com/FengYasong-fifol/Batched-Lipschitz-Narrowing.
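A minimal matplotlib sketch of how a figure like 2b can be produced from a finished run is given below; the arrays `cumulative_regret` and `batch_ends` are assumed to have been recorded by the experiment code and are not outputs of the paper's released scripts.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_regret_with_batches(cumulative_regret, batch_ends):
    """Plot cumulative regret and shade each batch with an alternating background color."""
    t = np.arange(1, len(cumulative_regret) + 1)
    fig, ax = plt.subplots()
    ax.plot(t, cumulative_regret, label="A-BLiN")
    start = 0
    for i, end in enumerate(batch_ends):
        ax.axvspan(start, end, alpha=0.15, color=f"C{i % 2}")  # alternate colors per batch
        start = end
    ax.set_xlabel("round $t$")
    ax.set_ylabel("cumulative regret")
    ax.legend()
    return fig

# Example with synthetic data: a sublinear regret curve and a few batch boundaries.
T = 10_000
fake_regret = np.cumsum(1.0 / np.sqrt(np.arange(1, T + 1)))
plot_regret_with_batches(fake_regret, batch_ends=[150, 900, 4000, T])
plt.show()
```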
6 Conclusion
In this paper, we study Lipschitz bandits with communication constraints, and propose the BLiN algorithm as a solution. We prove that BLiN only needs $\mathcal{O}(\log\log T)$ rounds of communications to achieve the optimal regret rate of the best previous Lipschitz bandit algorithms [28, 14], which need $T$ batches. This improvement in the number of batches significantly saves data communication costs. We also provide complexity analysis for this problem. We show that $\Omega(\log\log T)$ rounds of communications are necessary for any algorithm to optimally solve Lipschitz bandit problems. Hence, the BLiN algorithm is optimal.
References
- [1] Arpit Agarwal, Shivani Agarwal, Sepehr Assadi, and Sanjeev Khanna. Learning with limited rounds of adaptivity: coin tossing, multi-armed bandits, and ranking from pairwise comparisons. In Conference on Learning Theory, pages 39–75. PMLR, 2017.
- [2] Arpit Agarwal, Rohan Ghuge, and Viswanath Nagarajan. Batched dueling bandits. In International Conference on Machine Learning, pages 89–110. PMLR, 2022.
- [3] Rajeev Agrawal. The continuum-armed bandit problem. SIAM Journal on Control and Optimization, 33(6):1926–1951, 1995.
- [4] Rajeev Agrawal. Sample mean based index policies by regret for the multi-armed bandit problem. Advances in Applied Probability, 27(4):1054–1078, 1995.
- [5] Noga Alon, Yossi Matias, and Mario Szegedy. The space complexity of approximating the frequency moments. Journal of Computer and System Sciences, 58(1):137–147, 1999.
- [6] Patrice Assouad. Plongements lipschitziens dans $\mathbb{R}^n$. Bulletin de la Société Mathématique de France, 111:429–448, 1983.
- [7] Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine learning, 47(2):235–256, 2002.
- [8] Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. The nonstochastic multiarmed bandit problem. SIAM journal on computing, 32(1):48–77, 2002.
- [9] Peter Auer, Ronald Ortner, and Csaba Szepesvári. Improved rates for the stochastic continuum-armed bandit problem. In Conference on Computational Learning Theory, pages 454–468. Springer, 2007.
- [10] Dimitris Bertsimas and Adam J Mersereau. A learning approach for interactive marketing to a customer segment. Operations Research, 55(6):1120–1135, 2007.
- [11] Peter J. Bickel. On some robust estimates of location. The Annals of Mathematical Statistics, pages 847–858, 1965.
- [12] Jean Bretagnolle and Catherine Huber. Estimation des densités: risque minimax. Séminaire de probabilités de Strasbourg, 12:342–363, 1978.
- [13] Sébastien Bubeck, Nicolo Cesa-Bianchi, and Gábor Lugosi. Bandits with heavy tail. IEEE Transactions on Information Theory, 59(11):7711–7717, 2013.
- [14] Sébastien Bubeck, Rémi Munos, Gilles Stoltz, and Csaba Szepesvári. Online optimization in $\mathcal{X}$-armed bandits. Advances in Neural Information Processing Systems, 22:201–208, 2009.
- [15] Sébastien Bubeck, Rémi Munos, Gilles Stoltz, and Csaba Szepesvári. $\mathcal{X}$-armed bandits. Journal of Machine Learning Research, 12(5):1655–1695, 2011.
- [16] Sébastien Bubeck, Gilles Stoltz, and Jia Yuan Yu. Lipschitz bandits without the Lipschitz constant. In International Conference on Algorithmic Learning Theory, pages 144–158. Springer, 2011.
- [17] Nicolo Cesa-Bianchi, Ofer Dekel, and Ohad Shamir. Online learning with switching costs and other adaptive adversaries. Advances in Neural Information Processing Systems, 26:1160–1168, 2013.
- [18] Eric W. Cope. Regret and convergence bounds for a class of continuum-armed bandit problems. IEEE Transactions on Automatic Control, 54(6):1243–1253, 2009.
- [19] Hossein Esfandiari, Amin Karbasi, Abbas Mehrabian, and Vahab Mirrokni. Regret bounds for batched bandits. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 7340–7348, 2021.
- [20] Eyal Even-Dar, Shie Mannor, Yishay Mansour, and Sridhar Mahadevan. Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. Journal of machine learning research, 7(6):1079–1105, 2006.
- [21] Yasong Feng, Weijian Luo, Yimin Huang, and Tianyu Wang. A lipschitz bandits approach for continuous hyperparameter optimization. arXiv preprint arXiv:2302.01539, 2023.
- [22] Zijun Gao, Yanjun Han, Zhimei Ren, and Zhengqing Zhou. Batched multi-armed bandits problem. Advances in Neural Information Processing Systems, 32:503–513, 2019.
- [23] Chuying Han, Yasong Feng, and Tianyu Wang. From random search to bandit learning in metric measure spaces. arXiv preprint arXiv:2305.11509, 2023.
- [24] Yanjun Han, Zhengqing Zhou, Zhengyuan Zhou, Jose Blanchet, Peter W Glynn, and Yinyu Ye. Sequential batch learning in finite-action linear contextual bandits. arXiv preprint arXiv:2004.06321, 2020.
- [25] Kwang-Sung Jun, Kevin Jamieson, Robert Nowak, and Xiaojin Zhu. Top arm identification in multi-armed bandits with batch arm pulls. In Artificial Intelligence and Statistics, pages 139–148. PMLR, 2016.
- [26] Nikolai Karpov, Qin Zhang, and Yuan Zhou. Collaborative top distribution identifications with limited interaction. In 2020 IEEE 61st Annual Symposium on Foundations of Computer Science (FOCS), pages 160–171. IEEE, 2020.
- [27] Robert Kleinberg. Nearly tight bounds for the continuum-armed bandit problem. Advances in Neural Information Processing Systems, 18:697–704, 2005.
- [28] Robert Kleinberg, Aleksandrs Slivkins, and Eli Upfal. Multi-armed bandits in metric spaces. In Proceedings of the fortieth annual ACM symposium on Theory of computing, pages 681–690, 2008.
- [29] Akshay Krishnamurthy, John Langford, Aleksandrs Slivkins, and Chicheng Zhang. Contextual bandits with continuous actions: Smoothing, zooming, and adapting. The Journal of Machine Learning Research, 21(1):5402–5446, 2020.
- [30] Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, 1985.
- [31] Zihan Li and Jonathan Scarlett. Gaussian process bandit optimization with few batches. In International Conference on Artificial Intelligence and Statistics, pages 92–107. PMLR, 2022.
- [32] Shiyin Lu, Guanghui Wang, Yao Hu, and Lijun Zhang. Optimal algorithms for Lipschitz bandits with heavy-tailed rewards. In International Conference on Machine Learning, pages 4154–4163, 2019.
- [33] Stefan Magureanu, Richard Combes, and Alexandre Proutiere. Lipschitz bandits: Regret lower bound and optimal algorithms. In Conference on Learning Theory, pages 975–999. PMLR, 2014.
- [34] Maryam Majzoubi, Chicheng Zhang, Rajan Chari, Akshay Krishnamurthy, John Langford, and Aleksandrs Slivkins. Efficient contextual bandits with continuous actions. Advances in Neural Information Processing Systems, 33:349–360, 2020.
- [35] Vianney Perchet and Philippe Rigollet. The multi-armed bandit problem with covariates. The Annals of Statistics, 41(2):693–721, 2013.
- [36] Vianney Perchet, Philippe Rigollet, Sylvain Chassang, and Erik Snowberg. Batched bandit problems. The Annals of Statistics, 44(2):660–681, 2016.
- [37] Stuart J. Pocock. Group sequential methods in the design and analysis of clinical trials. Biometrika, 64(2):191–199, 1977.
- [38] Yufei Ruan, Jiaqi Yang, and Yuan Zhou. Linear bandits with limited adaptivity and learning distributional optimal design. In Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing, pages 74–87, 2021.
- [39] Sudeep Salgia, Sattar Vakili, and Qing Zhao. A domain-shrinking based bayesian optimization algorithm with order-optimal regret performance. In Advances in Neural Information Processing Systems, volume 34, pages 28836–28847, 2021.
- [40] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis. Mastering the game of go with deep neural networks and tree search. nature, 529(7587):484–489, 2016.
- [41] Sean Sinclair, Tianyu Wang, Gauri Jain, Siddhartha Banerjee, and Christina Yu. Adaptive discretization for model-based reinforcement learning. Advances in Neural Information Processing Systems, 33:3858–3871, 2020.
- [42] Aleksandrs Slivkins. Contextual bandits with similarity information. Journal of Machine Learning Research, 15(1):2533–2568, 2014.
- [43] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT press, 2018.
- [44] Chao Tao, Qin Zhang, and Yuan Zhou. Collaborative learning with limited interaction: tight bounds for distributed exploration in multi-armed bandits. In 2019 IEEE 60th Annual Symposium on Foundations of Computer Science (FOCS), pages 126–146. IEEE, 2019.
- [45] William R Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.
- [46] Tianyu Wang and Cynthia Rudin. Bandits for BMO functions. In International Conference on Machine Learning, pages 9996–10006. PMLR, 2020.
- [47] Tianyu Wang, Weicheng Ye, Dawei Geng, and Cynthia Rudin. Towards practical lipschitz bandits. In Proceedings of the 2020 ACM-IMS on Foundations of Data Science Conference, FODS ’20, page 129–138, New York, NY, USA, 2020. Association for Computing Machinery.
- [48] Nirandika Wanigasekara and Christina Yu. Nonparametric contextual bandits in an unknown metric space. In Advances in Neural Information Processing Systems, volume 32, pages 14684–14694, 2019.
- [49] Kelly Zhang, Lucas Janson, and Susan Murphy. Inference for batched bandits. Advances in Neural Information Processing Systems, 33:9818–9829, 2020.
Appendix A Proof of Corollary 1
Corollary 1. For Lipschitz bandit problems with ambient dimension $d$, zooming dimension $d_z$ and time horizon $T$, any algorithm needs $\Omega(\log\log T)$ rounds of communications to achieve the optimal regret rate $\widetilde{\mathcal{O}}\left(T^{\frac{d_z+1}{d_z+2}}\right)$.
Proof.
From Theorem 3, the expected regret is lower bounded by
Here we seek the minimum number of rounds $B$ such that
(42) |
for some constant .
Calculation shows that
(43) |
Substituting (43) to (42) and taking log on both sides yield that
and
Taking log on both sides again yields that
(44) |
We use to denote the minimum such that inequality (44) holds. Calculation shows that (44) holds for
so we have . Then since the RHS of (44) decreases with , we have
Therefore, $\Omega(\log\log T)$ rounds of communications are necessary for any algorithm to optimally solve Lipschitz bandit problems. ∎
Appendix B Space Complexity Analysis of A-BLiN
Let the A-BLiN run contain $B$ batches: $B-1$ batches in the for-loop and a clean-up batch. We use $|\mathcal{A}_m|$ to denote the number of cubes in batch $m$. Then it is easy to see that the space complexity is linear in $\max_m |\mathcal{A}_m|$. In the following, we bound $|\mathcal{A}_m|$ for each $m$ to obtain the space complexity of A-BLiN.
Appendix C Proof of Theorem 4
Theorem 4. With high probability, the $T$-step total regret $R_T$ of BLiN with Doubling Edge-length Sequence (D-BLiN) satisfies
$$R_T \le \widetilde{\mathcal{O}}\left(T^{\frac{d_z+1}{d_z+2}}\right), \qquad (45)$$
where $d_z$ is the zooming dimension of the problem instance. In addition, D-BLiN needs no more than $\mathcal{O}(\log T)$ rounds of communications to achieve this regret rate.
Proof.
Since for Doubling Edge-length Sequence, Lemma 3 implies that every cube is a subset of . Thus from the definition of zooming number, we have
(46) |
Fix any positive number . Also by Lemma 3, we know that any arm played after batch incurs a regret bounded by , since the cubes played after batch have edge length no larger than . Then the total regret occurs after the first batch is bounded by .
Thus the regret can be bounded by
(47) |
where the first term bounds the regret in the first batches of D-BLiN, and the second term bounds the regret after the first batches. If the algorithm stops at batch , we define for any and inequality (47) still holds.
By Lemma 3, we have for all . We can thus bound (47) by
(48) | ||||
(49) | ||||
where (48) uses (46), and (49) uses equality . Since and , we have
This inequality holds for any positive . By choosing , we have
The above analysis implies that we can achieve the optimal regret rate by letting the for-loop run times and finishing the remaining rounds in the Cleanup step. In other words, rounds of communications are sufficient for D-BLiN to achieve the regret bound (45). ∎
Appendix D Regret Upper Bound of BLiN with a Fixed Number of Batches
This section studies the case where a hard upper bound on the number of batches is imposed. In such cases, we simply apply BLiN by executing the for-loop the allowed number of times, and then run the clean-up step. Since some batches may be skipped in the for-loop, the total number of batches is upper bounded by the imposed limit. The regret in such cases is described by a quantity called the effective number of batches, which counts the number of batches in which a finer partitioning of the arm space occurs. Further, in Lemma 7, we show an upper bound on the effective number of batches.
Firstly, we introduce some notations. In the proof of Theorem 6, we bound the regret of the odd batches ( and ), the even batches ( and ) and the clean-up batch (the -th batch) separately. For convenience, we denote
where is the regret of the -th batch. Additionally, we let , and thus is the last even batch before the -th batch. As is mentioned in the paper (the paragraph before Theorem 6), when using BLiN with rounded ACE Sequence, if there exists such that , then we skip the -th batch. We use to denote the number of effective batches, that is, the batches which are not skipped.
By omitting the equality and using the definition of rounded ACE Sequence, the arguments in the proof of Theorem 6 can be directly applied to the case of fixed . Specifically, inequality (9) yields that
inequality (10) yields that
and inequality (12) yields that
Thus, the -step total regret is bounded by
(50) |
Furthermore, because of the existence of the rounding step, the actual number of batches of BLiN with rounded ACE Sequence has the following upper bound.
Lemma 7.
Let be the total time horizon. When applying BLiN with rounded ACE Sequence and any number of batches , the effective number of batches is upper bounded by
Proof.
The rounded ACE sequence is defined as for and for , where and . As is explained in Remark 2, if there exists integer and such that , then the rounding step yields , so the -th and the -th batches are skipped. We note that the sequence is increasing and . It is easy to verify that if inequality
(51) |
is satisfied for some , then there will be at most effective batches after the -th batch. As a consequence, for any , the number of effective batches is upper bounded by .
Appendix E Proof of Theorem 9
Theorem 9. Consider Lipschitz bandit problems with time horizon $T$, ambient dimension $d$ and zooming dimension $d_z$ such that the grid of reward communication is static and determined before the game. If $B$ rounds of communications are allowed, then for any policy $\pi$, there exists a problem instance with zooming dimension $d_z$ such that
Proof.
Fixing and , we show that is -Lipschitz and the zooming dimension equals to .
Firstly, for any and , we have
Therefore, is -Lipschitz.
Secondly, for any , (41) yields that . Therefore, we have , and the zooming dimension equals to .
Then we show that an argument similar to Theorem 7 yields the lower bound we need. For any , we prove that no algorithm can distinguish instances in from one another in the first batches, so the worst-case regret is at least , which equals to . For the first batch , we can easily construct a set of instances where the worst-case regret is at least , since no information is available during this time. Thus, there exists a problem instance such that
Since , the inequality in Theorem 9 follows.
Recall that each is in . For convenience, we write . Let , where denotes the -dimensional ball with center and radius . It is easy to verify the following properties of construction (39),(40) and (41):
-
1.
For any , for any ;
-
2.
For any , , for any ;
-
3.
For any , under , pulling an arm that is not in incurs a regret at least .
The lower bound of expected regret relies on the following lemma.
Lemma 8.
For any policy , there exists a problem instance such that
Proof.
Let denote the choices of policy at time , and denote the reward. Additionally, for , we define as the distribution of sequence under instance and policy . It holds that
(52) |
where denotes the regret incurred by policy at time .
From our construction, it is easy to see that for any , so we can construct a test such that implies . Then from Lemma 12,
To avoid notational clutter, for any (), define
Now we calculate . From the chain rule of KL-Divergence, we have
(53) | ||||
(54) | ||||
(55) | ||||
(56) | ||||
(57) |
where (53) uses chain rule for KL-divergence and the conditional independence of the reward, (54) removes dependence on in the first term by another use of chain rule and the fact that the distribution of is fully determined by the policy and the distribution of , (55) uses that the rewards are corrupted by a standard normal noise, and (56) uses the first two properties of the construction.
Since , the expected regret of policy satisfies
on an instance .
By omitting terms with in the above summation, we have
The above analysis can be applied for any . For the first batch , we can easily construct a set of instances where the worst-case regret is at least , since no information is available during this time. Thus, there exists a problem instance such that
Since , we further have
where . Combining the above two inequalities, we conclude that
Appendix F Proof of Theorem 10
Theorem 10. Consider Lipschitz bandit problems with time horizon $T$, ambient dimension $d$ and zooming dimension $d_z$ such that the grid of reward communication is adaptively determined by the player. If $B$ rounds of communications are allowed, then for any policy $\pi$, there exists a problem instance with zooming dimension $d_z$ such that
Proof.
The main argument in the proof of Theorem 10 is similar to that of Theorem 8. To construct hard instances in the -dimensional space, we use the ‘linear-decaying extension’ technique, which is the same as the proof of Theorem 9.
To prove Theorem 10, we consider a reference static grid , where for . Then we construct a series of ‘worlds’, denoted by . Each world is a set of problem instances, and each problem instance in world is defined by peak location set and basic height , where the sets and quantities for are presented in the proof below. Based on these constructions, we first prove that for any adaptive grid and policy, there exists an index such that the event happens with sufficiently high probability in world . Then similar to Theorem 9, we prove that in world there exists a set of problem instances that is difficult to differentiate in the first batches. In addition, event implies that , so the worst-case regret is at least , which gives the lower bound we need.
Firstly, we define and , where . From the definition, we have
(61) |
For , we can find sets of arms such that (a) for any , and (b) .
Then we present the construction of worlds . For , we let . We first construct a set of expected reward functions on . For each , we define as
(62) |
Based on , the expected reward of is defined as
(63) |
where is the concatenation of and . For , we let . We first define a function on as
(64) |
Then the expected reward of is defined as
(65) |
Roughly speaking, our constructions satisfy two properties: for each and ,
-
1.
is close to ;
-
2.
under , pulling an arm that is far from incurs a regret at least .
The formal version of these two properties are presented in the proof of Lemma 9 and Lemma 10.
Based on these constructions, we first show that for any adaptive grid , there exists an index such that is sufficiently large in world . More formally, for each , and event , we define the quantities for and , where denotes the probability of the event under instance and policy . For these quantities, we have the following lemma.
Lemma 9.
For any adaptive grid and policy , it holds that
Proof.
For and , we write . We define , where denotes the -dimensional ball with center and radius . It is easy to verify the following properties of our construction (62), (63), (64) and (65):
-
1.
for any ;
-
2.
, for any .
Let denote the choices of policy at time , and denote the reward. For , we define (resp. ) as the distribution of sequence under instance (resp. ) and policy . Since event can be completely described by the observations up to time ( is an event in the -algebra where and are defined on), we can use the definition of total variation to get
For the total variation, we apply Lemma 11 to get
An argument similar to (58) yields that
where denotes the number of pulls which is in before the batch containing . Combining the above two inequalities gives
(66) | ||||
(67) | ||||
(68) | ||||
where (66) uses Jensen’s inequality, (67) uses the fact that , and (68) uses (61).
Plugging the above results implies that
Since , it holds that
Lemma 9 implies that there exists some such that . Then we show that the worst-case regret in world gives the lower bound we need.
Lemma 10.
For adaptive grid and policy , if index satisfies , then there exists a problem instance with zooming dimension such that
Proof.
Here we proceed with the case where . The case for can be proved analogously.
For any , we construct a set of problem instances . For , we first define a function on as
where is defined in (62). Then the expected reward of is defined as
(69) |
where is the concatenation of and . For , we let .
We first show that each is -Lipschitz and the zooming dimension equals to .
Firstly, for any and , we have
Therefore, is -Lipschitz.
Secondly, for any , (63) and (69) yield that and . Therefore, we have , and the zooming dimension equals to .
Then we show that an argument similar to Lemma 6 yields the lower bound we need. For each , we write . We define , and our construction has the following properties:
-
1.
For any , for any ;
-
2.
For any , for any ;
-
3.
For any , under , pulling an arm that is not in incurs a regret at least .
Let denote the choices of policy at time , and denote the reward. For , we define as the distribution of sequence under instance and policy . From similar argument in (52), it holds that
(70) |
From our construction, it is easy to see that for any , so we can construct a test such that implies . By Lemma 12 with a star graph on with center , we have
(71) |
(72) | ||||
(73) | ||||
(74) |
where (72) follows from data processing inequality of total variation and the equation , (73) restricts the integration to event , and (74) holds because the observations at time are the same as those at time under event .
For the term , it holds that
(75) | ||||
(76) |
where (75) uses the inequality , and (76) holds because and can be determined by the observations up to .
Finally, combining the above two lemmas, we arrive at the lower bound in Theorem 10. ∎
Appendix G Additional Experimental results
G.1 Repeated experiments of A-BLiN
We present results of A-BLiN under several random seeds in Figure 3, where the figure legends and labels are the same as those in Figure 2b. These results stably agree with the plot in the paper. The curve of the zooming algorithm in Figures 2b and 3 is the average over repeated experiments. The reason we did not present an averaged regret curve of A-BLiN in Figure 2b is that we want to show the batch pattern of a single A-BLiN run in the figure. Averaging across different runs breaks the batch pattern. As an example, one stochastic run may end the first batch after 100 observations, while another may end the third batch after 110 observations.
(Figure 3: regret curves of A-BLiN under different random seeds.)
G.2 Experimental results of D-BLiN
We run D-BLiN to solve the same problem as in Section 5. The partition and elimination process of this experiment is presented in Figure 4, which shows that the optimal arm is not eliminated during the game and that only a few rounds of communications are needed for the given time horizon. Moreover, we present the resulting partition and the accumulated regret in Figure 5.
(Figure 4: partition and elimination process of a D-BLiN run. Figure 5: resulting partition and accumulated regret of D-BLiN.)
Appendix H Auxiliary Lemmas
Lemma 11 (Bretagnolle-Huber Inequality[12]).
Let $P$ and $Q$ be any probability measures on the same probability space. It holds that
$$\mathrm{TV}(P, Q) \le 1 - \frac{1}{2}\exp\left(-\mathrm{KL}(P, Q)\right).$$
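In the proofs above, this inequality is typically combined with the elementary bound $P(A) + Q(A^c) \ge 1 - \mathrm{TV}(P, Q)$ for any event $A$. This gives the testing form (a standard consequence, stated here for convenience and not part of the original lemma statement):

\[
P(A) + Q(A^{c}) \;\ge\; 1 - \mathrm{TV}(P, Q) \;\ge\; \frac{1}{2}\exp\big(-\mathrm{KL}(P, Q)\big),
\]

so two problem instances that induce a small KL divergence on the observations cannot both be identified correctly with probability much better than a constant.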
Lemma 12 ([22]).
Let be probability measures over a common probability space , and be any measurable function (i.e., test). Then for any tree with vertex set and edge set , we have
-
1.
-
2.