
Lipschitz Bandits with Batched Feedback

Yasong Feng Shanghai Center for Mathematical Sciences, Fudan University; email: [email protected].    Zengfeng Huang School of Data Science, Fudan University; email: [email protected].    Tianyu Wang Shanghai Center for Mathematical Sciences, Fudan University; email: [email protected].
Abstract

In this paper, we study Lipschitz bandit problems with batched feedback, where the expected reward is Lipschitz and the reward observations are communicated to the player in batches. We introduce a novel landscape-aware algorithm, called Batched Lipschitz Narrowing (BLiN), that optimally solves this problem. Specifically, we show that for a $T$-step problem with Lipschitz reward of zooming dimension $d_z$, our algorithm achieves the theoretically optimal (up to logarithmic factors) regret rate $\widetilde{\mathcal{O}}\left(T^{\frac{d_z+1}{d_z+2}}\right)$ using only $\mathcal{O}(\log\log T)$ batches. We also provide a complexity analysis for this problem. Our theoretical lower bound implies that $\Omega(\log\log T)$ batches are necessary for any algorithm to achieve the optimal regret. Thus, BLiN achieves the optimal regret rate (up to logarithmic factors) using minimal communication.

1 Introduction

Multi-Armed Bandit (MAB) algorithms aim to exploit good options while exploring the decision space. These algorithms and methodologies find successful applications in artificial intelligence and reinforcement learning (e.g., [40]). While the classic MAB setting assumes that the rewards are immediately observed after each arm pull, real-world data often arrive in different patterns. For example, observations from clinical trials are often collected in a batched fashion [37]. Another example is online advertising, where strategies are tested on multiple customers at the same time [10]. In such cases, any observation-dependent decision-making should comply with this data-arrival pattern, including MAB algorithms.

In this paper, we study the Lipschitz bandit problem with batched feedback: a MAB problem where the expected reward is Lipschitz and the reward observations are communicated to the player in batches. In such settings, rewards are communicated only at the end of the batches, and the algorithm can only make decisions based on information up to the previous batch. Existing Lipschitz bandit algorithms heavily rely on timely access to the reward samples, since the partition of the arm space may change at any time. Therefore, they cannot solve the batched feedback setting. To address this difficulty, we present a novel adaptive algorithm for Lipschitz bandits with communication constraints, named Batched Lipschitz Narrowing (BLiN). BLiN learns the landscape of the reward by adaptively narrowing the arm set, so that regions of high reward are played more frequently. Also, BLiN determines the data collection procedure adaptively, so that only very few data communications are needed.

The above BLiN procedure achieves the optimal regret rate $\widetilde{\mathcal{O}}\left(T^{\frac{d_z+1}{d_z+2}}\right)$ ($d_z$ is the zooming dimension [28, 15]), and can be implemented in a clean and simple form. In addition to achieving the optimal regret rate, BLiN also improves on the state-of-the-art results in the following senses:

  • BLiN’s communication complexity is optimal. BLiN needs only $\mathcal{O}(\log\log T)$ rounds of communications to achieve the optimal regret rate (Theorem 2), and no algorithm can achieve this rate with fewer than $\Omega(\log\log T)$ rounds of communications (Corollary 1).

  • BLiN’s time complexity is optimal: if the arithmetic operations and sampling are of complexity $\mathcal{O}(1)$, then the time complexity of BLiN is $\mathcal{O}(T)$, which improves the best known time complexity $\mathcal{O}(T\log T)$ for Lipschitz bandit problems [15].

  • The space complexity of BLiN is $\mathcal{O}\left(T^{\frac{d_z+1}{d_z+2}}(\log T)^{-\frac{d_z+1}{d_z+2}}\right)$, which also improves the best known result. This is because we do not need to store information about cubes from previous batches. The detailed time and space complexity analysis of BLiN is in Remark 1.

In Table 1 we provide a comparison of BLiN and state-of-the-art Lipschitz bandit algorithms in terms of regret bound, communication bound, time complexity and space complexity.

Table 1: Comparison with State-of-the-art Lipschitz Bandit Algorithms
Algorithm | Regret | Communication | Time complexity | Space complexity
Zooming [28] | $\widetilde{\mathcal{O}}\left(T^{\frac{d_z+1}{d_z+2}}\right)$ | $T$ | $\mathcal{O}(T^2)$ | $\mathcal{O}(T)$
HOO [15] | $\widetilde{\mathcal{O}}\left(T^{\frac{d_z+1}{d_z+2}}\right)$ | $T$ | $\mathcal{O}(T\log T)$ | $\mathcal{O}(T)$
A-BLiN (our work) | $\widetilde{\mathcal{O}}\left(T^{\frac{d_z+1}{d_z+2}}\right)$ | $\mathcal{O}(\log\log T)$ | $\mathcal{O}(T)$ | $\mathcal{O}\left(T^{\frac{d_z+1}{d_z+2}}(\log T)^{-\frac{d_z+1}{d_z+2}}\right)$

1.1 Settings & Preliminaries

For a Lipschitz bandit problem (with communication constraints), the arm set is a compact doubling metric space $(\mathcal{A}, d_{\mathcal{A}})$. The expected reward $\mu: \mathcal{A} \rightarrow \mathbb{R}$ is $1$-Lipschitz with respect to the metric $d_{\mathcal{A}}$, that is, $|\mu(x_1)-\mu(x_2)| \leq d_{\mathcal{A}}(x_1,x_2)$ for any $x_1, x_2 \in \mathcal{A}$.

At time $t \leq T$, the learning agent pulls an arm $x_t \in \mathcal{A}$ that yields a reward sample $y_t = \mu(x_t) + \epsilon_t$, where $\epsilon_t$ is a mean-zero independent sub-Gaussian noise. Without loss of generality, we assume that $\epsilon_t \sim \mathcal{N}(0,1)$, since generalizations to other sub-Gaussian noises are straightforward.

Similar to most bandit learning problems, the agent seeks to minimize regret in the batched feedback environment. The regret is defined as $R(T) = \sum_{t=1}^{T}\left(\mu^* - \mu(x_t)\right)$, where $\mu^*$ denotes $\max_{x\in\mathcal{A}}\mu(x)$. For simplicity, we define $\Delta_x = \mu^* - \mu(x)$ (called the optimality gap of $x$) for all $x \in \mathcal{A}$.

1.1.1 Doubling Metric Spaces and the $([0,1]^d, \|\cdot\|_\infty)$ Metric Space

By Assouad’s embedding theorem [6], the (compact) doubling metric space $(\mathcal{A}, d_{\mathcal{A}})$ can be embedded into a Euclidean space with some distortion of the metric; see [46] for more discussion in a machine learning context. Due to the existence of such an embedding, the metric space $([0,1]^d, \|\cdot\|_\infty)$, where metric balls are hypercubes, is sufficient for the purpose of our paper. For the rest of the paper, we use hypercubes in the algorithm design for simplicity, while our algorithmic idea generalizes to other doubling metric spaces.

1.1.2 Zooming Number and Zooming Dimension

Important concepts for bandit problems in metric spaces are the zooming number and the zooming dimension [28, 14, 42], which we discuss now. We start with the definition of packing numbers.

Definition 1.

Let $(\mathcal{A}, d_{\mathcal{A}})$ be a metric space. The $r$-packing number $\mathcal{N}(\mathcal{S}, r)$ of $\mathcal{S} \subset \mathcal{A}$ is the size of the largest packing of $\mathcal{S}$ with disjoint $d_{\mathcal{A}}$-open balls of radius $r$.

Then we define the zooming number and the zooming dimension.

Definition 2.

For a problem instance with arm set $\mathcal{A}$ and expected reward $\mu$, we let $S(r)$ denote the set of $r$-optimal arms, that is, $S(r) = \{x \in \mathcal{A} : \Delta_x = \mu^* - \mu(x) \leq r\}$. We define the $r$-zooming number $N_r$ as $N_r = \mathcal{N}\left(S(16r), \frac{r}{2}\right)$. The zooming dimension is then defined as

d_{z} := \min\left\{d\geq 0: \exists a>0,\; N_{r}\leq ar^{-d},\; \forall 0<r<1\right\}.

Moreover, we define the zooming constant $C_z$ as

C_{z} = \min\left\{a>0:\; N_{r}\leq ar^{-d_{z}},\; \forall 0<r<1\right\}.

The zooming dimension $d_z$ can be significantly smaller than the ambient dimension $d$, and can even be zero. For a simple example, consider a problem with ambient dimension $d=1$ and expected reward function $\mu(x)=x$ for $0\leq x\leq 1$. Then for any $r=2^{-i}$ with $i\geq 4$, we have $S(16r)=[1-16r,1]$, which is packed by exactly $16$ disjoint intervals of length $r$ (i.e., $\|\cdot\|_\infty$-balls of radius $\frac{r}{2}$), so $N_r=16$. Therefore, for this problem the zooming dimension equals $0$, with zooming constant $C_z=16$.
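As a small numerical companion to this example, the following Python lines evaluate $N_r$ for $\mu(x)=x$ by counting how many disjoint intervals of length $r$ fit inside $S(16r)$; we read the packing as using balls contained in $S(16r)$, which matches the value $N_r=16$ above. This snippet is purely illustrative and not part of the paper's algorithms.

```python
def zooming_number_1d(r):
    """N_r for mu(x) = x on [0, 1]: pack S(16r) = [1 - 16r, 1] with disjoint
    intervals of length r (sup-norm balls of radius r/2) contained in S(16r)."""
    lo, hi = max(0.0, 1.0 - 16 * r), 1.0
    return int((hi - lo) // r)

for i in range(4, 8):
    print(2 ** (-i), zooming_number_1d(2 ** (-i)))   # prints N_r = 16 for each r = 2^{-i}
```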

1.2 Batched feedback pattern and our results

In the batched feedback setting, for a $T$-step game, the player determines a grid $\mathcal{T}=\{t_0,\cdots,t_B\}$ adaptively, where $0=t_0<t_1<\cdots<t_B=T$ and $B\ll T$. During the game, reward observations are communicated to the player only at the grid points $t_1,\cdots,t_B$. As a consequence, for any time $t$ in the $j$-th batch, that is, $t_{j-1}<t\leq t_j$, the reward $y_t$ cannot be observed until time $t_j$, and the decision made at time $t$ depends only on rewards up to time $t_{j-1}$. The determination of the grid $\mathcal{T}$ is adaptive in the sense that the player chooses each grid point $t_j\in\mathcal{T}$ based on the operations and observations up to the previous point $t_{j-1}$.

In this work, we present the BLiN algorithm to solve Lipschitz bandits under batched feedback. During the learning procedure, BLiN detects and eliminates the ‘bad area’ of the arm set in batches, and partitions the remaining area according to an appropriate edge-length sequence. Our first theoretical upper bound states that with the simple Doubling Edge-length Sequence $r_m=2^{-m+1}$, BLiN achieves the optimal regret rate $\widetilde{\mathcal{O}}\left(T^{\frac{d_z+1}{d_z+2}}\right)$ using $\mathcal{O}(\log T)$ batches.

Theorem 1.

With probability exceeding $1-\frac{2}{T^6}$, the $T$-step total regret $R(T)$ of BLiN with Doubling Edge-length Sequence (D-BLiN) satisfies

R(T) \lesssim T^{\frac{d_z+1}{d_z+2}} \cdot (\log T)^{\frac{1}{d_z+2}},

where $d_z$ is the zooming dimension of the problem instance. In addition, D-BLiN needs no more than $\mathcal{O}(\log T)$ rounds of communications to achieve this regret rate. Here and henceforth, $\lesssim$ only omits constants.

While D-BLiN is efficient for batched Lipschitz bandits, its communication complexity is not optimal. We therefore propose a new edge-length sequence, which we call the Appropriate Combined Edge-length Sequence (ACE Sequence), to improve the algorithm. The idea behind this sequence is that by appropriately combining some batches, the algorithm can achieve a better communication bound without incurring increased regret. As we shall see, BLiN with the ACE Sequence (A-BLiN) achieves regret rate $\widetilde{\mathcal{O}}\left(T^{\frac{d_z+1}{d_z+2}}\right)$ with only $\mathcal{O}(\log\log T)$ batches.

Theorem 2.

With probability exceeding $1-\frac{2}{T^6}$, the $T$-step total regret $R(T)$ of A-BLiN satisfies

R(T) \lesssim T^{\frac{d_z+1}{d_z+2}} \cdot (\log T)^{\frac{1}{d_z+2}} \cdot \log\log T,

where $d_z$ is the zooming dimension of the problem instance. In addition, Algorithm 1 needs no more than $\mathcal{O}(\log\log T)$ rounds of communications to achieve this regret rate.

As a comparison, seminal works [28, 42, 15] show that the optimal regret bound for Lipschitz bandits without communication constraints, where the reward observations are immediately available after each arm pull, is $R(T) \lesssim T^{\frac{d_z+1}{d_z+2}} \cdot (\log T)^{\frac{1}{d_z+2}}$. Therefore, A-BLiN achieves the optimal regret rate for Lipschitz bandits using very few batches.

Furthermore, we provide a theoretical lower bound for Lipschitz bandits with batched feedback.

Theorem 3.

Consider Lipschitz bandit problems with time horizon $T$, ambient dimension $d$ and zooming dimension $d_z\leq d$. If $B$ rounds of communications are allowed, then for any policy $\pi$, there exists a problem instance with zooming dimension $d_z$ such that

\mathbb{E}\left[R_T(\pi)\right] \geq \frac{1}{512B^2} T^{\frac{1-\frac{1}{d_z+2}}{1-\left(\frac{1}{d_z+2}\right)^{B}}}.

In the lower bound analysis, we use a “linear-decaying extension” technique to construct problem instances with zooming dimension $d_z$. To the best of our knowledge, our construction provides the first minimax lower bound for Lipschitz bandits where the zooming dimension $d_z$ is explicitly different from the ambient dimension $d$. As a result of Theorem 3, we can derive the minimum number of rounds of communications needed to achieve the optimal regret bound for the Lipschitz bandit problem, which is stated in Corollary 1. The proof of Corollary 1 is deferred to Appendix A.

Corollary 1.

For Lipschitz bandit problems with ambient dimension $d$, zooming dimension $d_z\leq d$ and time horizon $T$, any algorithm needs $\Omega(\log\log T)$ rounds of communications to achieve the optimal regret rate $\mathcal{O}\left(T^{\frac{d_z+1}{d_z+2}}\right)$.

Consequently, the BLiN algorithm is optimal in terms of both regret and communication.

1.3 Related Works

The history of the Multi-Armed Bandit (MAB) problem dates back to Thompson [45]. Solvers for this problem include the UCB algorithms [30, 4, 7], the arm elimination method [20, 35, 39], the $\epsilon$-greedy strategy [7, 43], and the exponential weights and mirror descent framework [8].

Recently, with the prevalence of distributed computing and large-scale field experiments, the setting of batched feedback has captured attention (e.g., [17]). Perchet et al. [36] mainly consider batched bandits with two arms, and prove a matching lower bound for static grids. This was then generalized by Gao et al. [22] to finite-armed bandit problems: the authors designed an elimination method for the finite-armed bandit problem and proved matching lower bounds for both static and adaptive grids. Soon afterwards, Zhang et al. [49] studied inference for batched bandits. Esfandiari et al. [19] studied batched linear bandits and batched adversarial bandits. Han et al. [24] and Ruan et al. [38] provide solutions for batched contextual linear bandits. Li and Scarlett [31] studied batched Gaussian process bandits. Batched dueling bandits have also been studied by Agarwal et al. [2]. Parallel to the regret control regime, best arm identification with a limited number of batches was studied in [1] and [25]. Top-$k$ arm identification in the collaborative learning framework is also closely related to the batched setting, where the goal is to minimize the number of iterations (or communication steps) between agents. In this setting, tight bounds have been obtained in the recent works [44, 26]. Yet the problem of Lipschitz bandits with communication constraints remains unsolved.

The Lipschitz bandit problem is important in its own right. It was introduced as “continuum-armed bandits” [3], where the arm space is a compact interval. Along this line, bandits that are Lipschitz (or Hölder) continuous have been studied. For this problem, Kleinberg [27] proves an $\Omega(T^{2/3})$ lower bound and introduces a matching algorithm. Under extra conditions on top of Lipschitzness, a regret rate of $\widetilde{\mathcal{O}}(T^{1/2})$ was achieved [9, 18]. For general (doubling) metric spaces, the Zooming bandit algorithm [28] and the Hierarchical Optimistic Optimization (HOO) algorithm [15] were developed. In more recent years, some attention has turned to Lipschitz bandit problems with extra structures. To name a few, Bubeck et al. [16] study Lipschitz bandits for differentiable rewards, which enables algorithms to run without explicitly knowing the Lipschitz constants. Wang et al. [47] studied discretization-based Lipschitz bandit algorithms from a Gaussian process perspective. Magureanu et al. [33] derive a new concentration inequality and study discrete Lipschitz bandits. The idea of robust mean estimators [11, 5, 13] was applied to the Lipschitz bandit problem to cope with heavy-tailed rewards, leading to a near-optimal algorithm for Lipschitz bandits with heavy-tailed rewards [32]. Wanigasekara and Yu [48] studied Lipschitz bandits where clustering is used to infer the underlying metric. Contextual Lipschitz bandits have been studied by Slivkins [42]. Contextual bandits with continuous actions have also been studied by Krishnamurthy et al. [29] and Majzoubi et al. [34] through a smoothness approach. Yet all of the existing works on Lipschitz bandits assume that the reward sample is immediately observed after each arm pull, and none of them solve the Lipschitz bandit problem with communication constraints.

This paper is organized as follows. In Section 2, we introduce the BLiN algorithm and give a visual illustration of the algorithm procedure. In Section 3, we prove that BLiN with the ACE Sequence achieves the optimal regret rate using only $\mathcal{O}(\log\log T)$ rounds of communications. Section 4 provides information-theoretic lower bounds for Lipschitz bandits with communication constraints, which show that BLiN is optimal in terms of both regret and rounds of communications. Experimental results are presented in Section 5.

2 Algorithm

With communication constraints, the agent’s knowledge about the environment does not accumulate within each batch. This characteristic of the problem suggests a ‘uniform’ type of algorithm: we shall treat each step within the same batch equally. Following this intuition, in each batch, we uniformly play the remaining arms, and then eliminate arms of low reward after the observations are communicated. Next we describe the uniform play rule and the arm elimination rule.

Uniform Play Rule: At the beginning of each batch $m$, a collection of subsets of the arm space $\mathcal{A}_m=\{C_{m,1},C_{m,2},\cdots,C_{m,|\mathcal{A}_m|}\}$ is constructed. This collection of subsets $\mathcal{A}_m$ consists of standard cubes, and all cubes in $\mathcal{A}_m$ have the same edge length $r_m$. We will detail the construction of $\mathcal{A}_m$ when we describe the arm elimination rule. We refer to the cubes in $\mathcal{A}_m$ as the active cubes of batch $m$.

During batch $m$, each cube in $\mathcal{A}_m$ is played

n_m \triangleq \frac{16\log T}{r_m^2} (1)

times, where $T$ is the total time horizon. More specifically, within each $C\in\mathcal{A}_m$, arms $x_{C,1},x_{C,2},\cdots,x_{C,n_m}\in C$ are played. (One can play $x_{C,1},x_{C,2},\cdots,x_{C,n_m}$ arbitrarily as long as $x_{C,i}\in C$ for all $i$.) The reward samples $\{y_{C,1},y_{C,2},\cdots,y_{C,n_m}\}_{C\in\mathcal{A}_m}$ corresponding to $\{x_{C,1},x_{C,2},\cdots,x_{C,n_m}\}_{C\in\mathcal{A}_m}$ are collected at the end of this batch.

Arm Elimination Rule: At the end of batch $m$, information from the arm pulls is collected, and we estimate the reward of each $C\in\mathcal{A}_m$ by $\widehat{\mu}_m(C)=\frac{1}{n_m}\sum_{i=1}^{n_m}y_{C,i}$. Cubes of low estimated reward are then eliminated, according to the following rule: a cube $C\in\mathcal{A}_m$ is eliminated if $\widehat{\mu}_m^{\max}-\widehat{\mu}_m(C)\geq 4r_m$, where $\widehat{\mu}_m^{\max}:=\max_{C\in\mathcal{A}_m}\widehat{\mu}_m(C)$. After the necessary removal of “bad cubes”, each cube in $\mathcal{A}_m$ that survives the elimination is equally partitioned into $\left(\frac{r_m}{r_{m+1}}\right)^d$ subcubes of edge length $r_{m+1}$, where $\{r_m\}_m$ is a predetermined sequence to be specified soon. These cubes (of edge length $r_{m+1}$) are collected to construct $\mathcal{A}_{m+1}$, and the learning process moves on to the next batch. Appropriate rounding may be required to ensure the ratio $\frac{r_m}{r_{m+1}}$ is an integer. See Remark 2 for more details.

The learning process is summarized in Algorithm 1.

Algorithm 1 Batched Lipschitz Narrowing (BLiN)
1:  Input. Arm set $\mathcal{A}=[0,1]^d$; time horizon $T$.
2:  Initialization. Number of batches $B$; edge-length sequence $\{r_m\}_{m=1}^{B+1}$; the first grid point $t_0=0$; equally partition $\mathcal{A}$ into subcubes of edge length $r_1$ and define $\mathcal{A}_1$ as the collection of these subcubes.
3:  Compute $n_m=\frac{16\log T}{r_m^2}$ for $m=1,\cdots,B+1$.
4:  for $m=1,2,\cdots,B$ do
5:     For each cube $C\in\mathcal{A}_m$, play arms $x_{C,1},\cdots,x_{C,n_m}$ from $C$.
6:     Collect the rewards of all pulls up to $t_m$. Compute the average payoff $\widehat{\mu}_m(C)=\frac{\sum_{i=1}^{n_m}y_{C,i}}{n_m}$ for each cube $C\in\mathcal{A}_m$. Find $\widehat{\mu}_m^{\max}=\max_{C\in\mathcal{A}_m}\widehat{\mu}_m(C)$.
7:     For each cube $C\in\mathcal{A}_m$, eliminate $C$ if $\widehat{\mu}_m^{\max}-\widehat{\mu}_m(C)>4r_m$. Let $\mathcal{A}_m^+$ be the set of cubes not eliminated.
8:     Compute $t_{m+1}=t_m+(r_m/r_{m+1})^d\cdot|\mathcal{A}_m^+|\cdot n_{m+1}$. If $t_{m+1}\geq T$ or $m=B$ then break.
9:     Equally partition each cube in $\mathcal{A}_m^+$ into $(r_m/r_{m+1})^d$ subcubes and define $\mathcal{A}_{m+1}$ as the collection of these subcubes. /*See Remark 2 for more details on cases where $(r_m/r_{m+1})^d$ is not an integer.*/
10:  end for
11:  Cleanup: Arbitrarily play the remaining arms until all $T$ steps are used.
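To make the procedure concrete, here is a minimal Python sketch of Algorithm 1 on $\mathcal{A}=[0,1]^d$. The function `reward_fn`, the uniform sampling of arms inside each cube, and the early return in place of an explicit Cleanup step are illustrative choices of ours, not prescribed by the algorithm.

```python
import numpy as np

def blin(reward_fn, T, d=1, edge_lengths=None, rng=None):
    """Minimal sketch of BLiN on A = [0, 1]^d with axis-aligned cubes.

    reward_fn    : expected reward, maps an arm (array of shape (d,)) to a float.
    edge_lengths : the sequence {r_m}; defaults to the doubling sequence r_m = 2^{-m+1}.
    """
    rng = np.random.default_rng() if rng is None else rng
    if edge_lengths is None:
        edge_lengths = [2.0 ** (-(m - 1)) for m in range(1, 40)]
    cubes = [np.zeros(d)]                      # each cube is stored by its lower-left corner
    r_prev, t, arms_played = 1.0, 0, []
    for r in edge_lengths:
        # partition every surviving cube of edge r_prev into (r_prev / r)^d subcubes of edge r
        k = int(round(r_prev / r))
        cubes = [c + r * np.asarray(idx) for c in cubes for idx in np.ndindex(*(k,) * d)]
        r_prev = r
        n_m = int(np.ceil(16 * np.log(T) / r ** 2))    # plays per cube, Eq. (1)
        means = []
        for c in cubes:
            samples = []
            for _ in range(n_m):
                if t >= T:                             # budget exhausted: stop (no Cleanup here)
                    return arms_played
                x = c + rng.uniform(0.0, r, size=d)    # any arm inside the cube is allowed
                samples.append(reward_fn(x) + rng.normal())   # N(0, 1) reward noise
                arms_played.append(x)
                t += 1
            means.append(float(np.mean(samples)))
        # elimination rule: drop cubes whose estimate trails the best by at least 4 r_m
        best = max(means)
        cubes = [c for c, m_hat in zip(cubes, means) if best - m_hat < 4 * r]
    return arms_played

# usage sketch: a 1-D instance whose optimal arm is x = 0.8
plays = blin(lambda x: 1.0 - abs(float(x[0]) - 0.8), T=20000, d=1)
```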
Remark 1 (Time and space complexity).

The time complexity of our algorithm is $\mathcal{O}(T)$, which is better than the state of the art $\mathcal{O}(T\log T)$ in [15]. This is because the running time of batch $j$ is of order $\mathcal{O}(l_j)$, where $l_j=t_j-t_{j-1}$ is the number of samples in batch $j$. Since $\sum_j l_j=T$, the time complexity of BLiN is $\mathcal{O}(T)$. Besides, the space complexity of our algorithm is $\mathcal{O}\left(T^{\frac{d_z+1}{d_z+2}}(\log T)^{-\frac{d_z+1}{d_z+2}}\right)$, which also improves the best known space complexity. This is because we do not need to store information about cubes from previous batches. The space complexity analysis is deferred to Appendix B.

The following theorem gives regret and communication upper bounds for BLiN with the Doubling Edge-length Sequence $r_m=2^{-m+1}$ (see Appendix C for a proof). Note that this result implies Theorem 1.

Theorem 4.

With probability exceeding $1-\frac{2}{T^6}$, the $T$-step total regret $R(T)$ of BLiN with Doubling Edge-length Sequence (D-BLiN) satisfies

R(T) \leq (512C_z+16)\cdot T^{\frac{d_z+1}{d_z+2}}(\log T)^{\frac{1}{d_z+2}},

where $d_z$ is the zooming dimension of the problem instance. In addition, D-BLiN needs no more than $\frac{\log T-\log\log T}{d_z+2}+2$ rounds of communications to achieve this regret rate.

Although D-BLiN efficiently solves batched Lipschitz bandits, its simple partition strategy leads to suboptimal communication complexity. We now show that by appropriately combining some batches, BLiN achieves the optimal communication bound without incurring additional regret. Specifically, we introduce the following edge-length sequence, which we call the ACE Sequence. When using the ACE Sequence, the regret of each batch is of order $\widetilde{\mathcal{O}}\left(T^{\frac{d_z+1}{d_z+2}}\right)$. Thus, the length of any batch cannot be increased without affecting the optimal regret rate. This implies that the ACE Sequence is optimal in terms of both regret and communication. See the proof of Theorem 5 for more details.

Definition 3.

For a problem with ambient dimension $d$, zooming dimension $d_z$ and time horizon $T$, we denote $c_1=\frac{d_z+1}{(d+2)(d_z+2)}\log\frac{T}{\log T}$ and $c_{i+1}=\eta c_i$ for any $i\geq 1$, where $\eta=\frac{d+1-d_z}{d+2}$. Then the Appropriate Combined Edge-length (ACE) Sequence $\{r_m\}$ is defined by $r_m=2^{-\sum_{i=1}^{m}c_i}$ for any $m\geq 1$.
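As a concrete reference, here is a short Python sketch that generates the ACE Sequence of Definition 3. We read the logarithms as base-2 logarithms, which is consistent with $2^{c_1(d+2)}=(T/\log T)^{\frac{d_z+1}{d_z+2}}$ as used in the regret analysis; this reading, and the base of the inner logarithm, are assumptions of this sketch.

```python
import math

def ace_sequence(T, d, d_z, num_batches):
    """ACE Sequence of Definition 3: c_{i+1} = eta * c_i and r_m = 2^{-(c_1 + ... + c_m)}."""
    eta = (d + 1 - d_z) / (d + 2)
    c1 = (d_z + 1) / ((d + 2) * (d_z + 2)) * math.log2(T / math.log2(T))
    c = [c1 * eta ** i for i in range(num_batches)]
    r = [2.0 ** (-sum(c[: m + 1])) for m in range(num_batches)]
    return c, r

c, r = ace_sequence(T=80000, d=2, d_z=0, num_batches=4)   # the setting of Figure 1
```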

Theorem 5 states that BLiN with the ACE Sequence (A-BLiN) obtains an improved communication complexity, and thus proves Theorem 2.

Theorem 5.

With probability exceeding $1-\frac{2}{T^6}$, the $T$-step total regret $R(T)$ of Algorithm 1 satisfies

R(T) \leq \left(\frac{128C_z}{\log\frac{d+2}{d+1-d_z}}\cdot\log\log T+16\right)\cdot T^{\frac{d_z+1}{d_z+2}}(\log T)^{\frac{1}{d_z+2}}, (2)

where $d_z$ is the zooming dimension of the problem instance. In addition, Algorithm 1 with the ACE Sequence needs no more than $\frac{\log\log T}{\log\frac{d+2}{d+1-d_z}}+1$ rounds of communications to achieve this regret rate.

The partition and elimination process of a real A-BLiN run is shown in Figure 1. In the $i$-th subfigure, the white cubes are those remaining after the $(i-1)$-th batch. In this experiment, we set $\mathcal{A}=[0,1]^2$, and the optimal arm is $x^*=(0.8,0.7)$. Note that $x^*$ is not eliminated during the game. More details of this experiment are in Section 5.

Figure 1: Partition and elimination process of A-BLiN (subfigures (a)-(d)). The $i$-th subfigure shows the pattern before the $i$-th batch, which is from a real A-BLiN run on the reward function defined in Section 5. Dark gray cubes are those eliminated in the most recent batch, while the light gray ones are those eliminated in earlier batches. For the total time horizon $T=80000$, A-BLiN needs $4$ rounds of communications. For this experiment, $r_1=\frac{1}{4}$, $r_2=\frac{1}{8}$, $r_3=\frac{1}{16}$, $r_4=\frac{1}{32}$, which is the ACE Sequence (rounded as in Remark 2) for $d=2$ and $d_z=0$.

The definition of the ACE Sequence relies on the zooming dimension $d_z$. If $d_z$ is not known ahead of time, we recommend two ways to proceed. 1) The player can apply BLiN with the Doubling Edge-length Sequence $r_m=2^{-m+1}$ (D-BLiN). Theorem 4 shows that D-BLiN achieves the optimal regret rate $\widetilde{\mathcal{O}}\left(T^{\frac{d_z+1}{d_z+2}}\right)$ using $\mathcal{O}(\log T)$ batches. Although the rate $\mathcal{O}(\log T)$ is not optimal, it is good enough for a total time horizon $T$ that is not too large, as shown in the experimental results in Appendix G.2. 2) If an upper bound $d_u$ on the zooming dimension is known, that is, $d_z\leq d_u\leq d$, then the player can apply A-BLiN by using $d_u$ to define the ACE Sequence. Theorem 5 yields that A-BLiN with $d_u$ achieves regret rate $\widetilde{\mathcal{O}}\left(T^{\frac{d_u+1}{d_u+2}}\right)$ using $\mathcal{O}(\log\log T)$ batches.

3 Regret Analysis of A-BLiN

In this section, we provide the regret analysis for A-BLiN. The highlight of the finding is that $\mathcal{O}(\log\log T)$ batches are sufficient to achieve the optimal regret rate of $\widetilde{\mathcal{O}}\left(T^{\frac{d_z+1}{d_z+2}}\right)$, as summarized in Theorem 5.

To prove Theorem 5, we first show that the estimator $\widehat{\mu}$ concentrates around the true expected reward $\mu$ (Lemma 1), and that the optimal arm survives all eliminations with high probability (Lemma 2). In the following analysis, we let $B_{\mathrm{stop}}$ be the total number of batches of the BLiN run.

Lemma 1.

Define

\mathcal{E}:=\Bigg\{|\mu(x)-\widehat{\mu}_m(C)|\leq r_m+\sqrt{\frac{16\log T}{n_m}},\;\forall 1\leq m\leq B_{\mathrm{stop}}-1,\;\forall C\in\mathcal{A}_m,\;\forall x\in C\Bigg\}.

It holds that $\mathbb{P}(\mathcal{E})\geq 1-2T^{-6}$.

Proof.

Fix a cube $C\in\mathcal{A}_m$. Recall that the average payoff of cube $C\in\mathcal{A}_m$ is defined as

\widehat{\mu}_m(C)=\frac{\sum_{i=1}^{n_m}y_{C,i}}{n_m}.

We also have

\mathbb{E}\left[\widehat{\mu}_m(C)\right]=\frac{\sum_{i=1}^{n_m}\mu(x_{C,i})}{n_m}.

Since $\widehat{\mu}_m(C)-\mathbb{E}\left[\widehat{\mu}_m(C)\right]$ follows the normal distribution $\mathcal{N}\left(0,\frac{1}{n_m}\right)$, Hoeffding's inequality gives

\mathbb{P}\left(\left|\widehat{\mu}_m(C)-\mathbb{E}\left[\widehat{\mu}_m(C)\right]\right|\geq\sqrt{\frac{16\log T}{n_m}}\right)\leq\frac{2}{T^8}.

On the other hand, by the Lipschitzness of $\mu$, we have

\left|\mathbb{E}\left[\widehat{\mu}_m(C)\right]-\mu(x)\right|\leq r_m,\quad\forall x\in C.

Consequently,

\mathbb{P}\left(\sup_{x\in C}|\mu(x)-\widehat{\mu}_m(C)|\leq r_m+\sqrt{\frac{16\log T}{n_m}}\right)\geq 1-\frac{2}{T^8}.

For $1\leq m\leq B_{\mathrm{stop}}-1$, the $m$-th batch is finished, so any cube $C\in\mathcal{A}_m$ is played at least once, and thus $|\mathcal{A}_m|\leq T$. From here, by an argument similar to Lemma F.1 in [41] and Lemma 1 in [32], taking a union bound over $C\in\mathcal{A}_m$ and $1\leq m\leq B_{\mathrm{stop}}-1$ finishes the proof. ∎

Lemma 2.

Under event $\mathcal{E}$ (defined in Lemma 1), the optimal arm $x^*=\arg\max_x\mu(x)$ is not eliminated after the first $B_{\mathrm{stop}}-1$ batches.

Proof.

We use $C_m^*$ to denote the cube containing $x^*$ in $\mathcal{A}_m$. Here we prove that $C_m^*$ is not eliminated in batch $m$.

Under event $\mathcal{E}$, for any cube $C\in\mathcal{A}_m$ and $x\in C$, we have

\widehat{\mu}_m(C)-\widehat{\mu}_m(C^*_m)\leq\mu(x)+\sqrt{\frac{16\log T}{n_m}}+r_m-\mu(x^*)+\sqrt{\frac{16\log T}{n_m}}+r_m\leq 4r_m,

where the second inequality follows from (1) and $\mu(x)\leq\mu(x^*)$. Then from the elimination rule, $C_m^*$ is not eliminated. ∎

Based on these results, we show that the cubes surviving elimination contain only arms of high reward.

Lemma 3.

Under event $\mathcal{E}$ (defined in Lemma 1), for any $1\leq m\leq B_{\mathrm{stop}}$, any $C\in\mathcal{A}_m$ and any $x\in C$, $\Delta_x$ satisfies

\Delta_x\leq 8r_{m-1}. (3)
Proof.

For $m=1$, (3) holds directly from the Lipschitzness of $\mu$. For $m>1$, let $C_{m-1}^*$ be the cube in $\mathcal{A}_{m-1}$ such that $x^*\in C_{m-1}^*$. From Lemma 2, this cube $C_{m-1}^*$ is well-defined under $\mathcal{E}$. For any cube $C\in\mathcal{A}_m$ and $x\in C$, clearly $x$ also lies in the parent of $C$ (the cube in the previous batch that contains $C$), which is denoted by $C_{par}$. Thus for any $x\in C$, it holds that

\Delta_x=\mu^*-\mu(x)\leq\widehat{\mu}_{m-1}(C^*_{m-1})+\sqrt{\frac{16\log T}{n_{m-1}}}+r_{m-1}-\widehat{\mu}_{m-1}(C_{par})+\sqrt{\frac{16\log T}{n_{m-1}}}+r_{m-1},

where the inequality uses Lemma 1.

The equality $\sqrt{\frac{16\log T}{n_{m-1}}}=r_{m-1}$ (from (1)) gives that

\Delta_x\leq\widehat{\mu}_{m-1}(C^*_{m-1})-\widehat{\mu}_{m-1}(C_{par})+4r_{m-1}.

Clearly $\widehat{\mu}_{m-1}(C^*_{m-1})\leq\widehat{\mu}_{m-1}^{\max}$. Moreover, since the cube $C_{par}$ is not eliminated, the elimination rule gives

\widehat{\mu}_{m-1}^{\max}-\widehat{\mu}_{m-1}(C_{par})\leq 4r_{m-1}.

Hence, we conclude that $\Delta_x\leq 8r_{m-1}$. ∎

We are now ready to prove Theorem 5.

Proof of Theorem 5.

Let $R_m$ denote the regret of the $m$-th batch. Fixing any positive number $B$, the total regret $R(T)$ can be divided into two parts: $R(T)=\sum_{m\leq B}R_m+\sum_{m>B}R_m$. In the following, we bound these two parts separately and then choose $B$ to obtain the upper bound on the total regret. Moreover, we show that A-BLiN uses only $\mathcal{O}(\log\log T)$ rounds of communications to achieve the optimal regret.

Recall that $\mathcal{A}_m$ is the set of active cubes in the $m$-th batch. According to Lemma 3, for any $x\in\cup_{C\in\mathcal{A}_m}C$, we have $\Delta_x\leq 8r_{m-1}$. Let $\mathcal{A}_m^+$ be the set of cubes not eliminated in batch $m$. Each cube in $\mathcal{A}_{m-1}^+$ is a $\|\cdot\|_\infty$-ball with radius $\frac{r_{m-1}}{2}$, and is a subset of $S(8r_{m-1})$. Therefore, $\mathcal{A}_{m-1}^+$ forms a $\left(\frac{r_{m-1}}{2}\right)$-packing of $S(8r_{m-1})$, and the definition of the zooming dimension yields that

|\mathcal{A}_{m-1}^+|\leq N_{r_{m-1}}\leq C_z r_{m-1}^{-d_z}. (4)

By definition, $r_m=r_{m-1}2^{-c_m}$, so

|\mathcal{A}_m|=2^{dc_m}|\mathcal{A}_{m-1}^+|. (5)

The total regret of the $m$-th batch is

R_m=\sum_{C\in\mathcal{A}_m}\sum_{i=1}^{n_m}\Delta_{x_{C,i}}\leq|\mathcal{A}_m|\cdot\frac{16\log T}{r_m^2}\cdot 8r_{m-1} (6)
\overset{(i)}{=} 2^{dc_m}|\mathcal{A}_{m-1}^+|\cdot\frac{16\log T}{r_m^2}\cdot 8r_{m-1}
\overset{(ii)}{\leq} 2^{dc_m}\cdot C_z r_{m-1}^{-d_z+1}\cdot\frac{128\log T}{r_m^2}
\overset{(iii)}{=} 2^{(\sum_{i=1}^{m-1}c_i)(d_z+1)+c_m(d+2)}\cdot 128C_z\log T,

where (i) follows from (5), (ii) follows from (4), and (iii) follows from the definition of the ACE Sequence.

Define $C_m=(\sum_{i=1}^{m-1}c_i)(d_z+1)+c_m(d+2)$. Since $c_m=c_{m-1}\cdot\frac{d+1-d_z}{d+2}$, a direct calculation shows that $C_m=(\sum_{i=1}^{m-2}c_i)(d_z+1)+c_{m-1}(d+2)+c_{m-1}(d_z+1-d-2)+c_m(d+2)=C_{m-1}$. Thus for any $m$, we have $C_m=C_1=c_1(d+2)$. Hence,

R_m\leq 2^{c_1(d+2)}\cdot 128C_z\log T=T^{\frac{d_z+1}{d_z+2}}\cdot 128C_z(\log T)^{\frac{1}{d_z+2}}. (7)

The inequality (7) holds even if the $m$-th batch does not exist (where we let $R_m=0$) or is not completed. Thus we obtain the first upper bound $\sum_{m\leq B}R_m\leq T^{\frac{d_z+1}{d_z+2}}\cdot 128C_z\cdot B(\log T)^{\frac{1}{d_z+2}}$. Lemma 3 implies that any arm $x$ played after the first $B$ batches satisfies $\Delta_x\leq 8r_B$, so the total regret after $B$ batches is bounded by

\sum_{m>B}R_m\leq 8r_B\cdot T=8T\cdot 2^{-\sum_{i=1}^{B}c_i}=8T\cdot 2^{-c_1\left(\frac{1-\eta^B}{1-\eta}\right)}
=8T^{\frac{d_z+1}{d_z+2}}(\log T)^{\frac{1}{d_z+2}}\cdot\left(T/\log T\right)^{\frac{\eta^B}{d_z+2}}\leq 8T^{\frac{d_z+1}{d_z+2}}(\log T)^{\frac{1}{d_z+2}}\cdot T^{\frac{\eta^B}{d_z+2}}.

Therefore, the total regret $R(T)$ satisfies

R(T)=\sum_{m\leq B}R_m+\sum_{m>B}R_m\leq\left(128C_z\cdot B+8T^{\frac{\eta^B}{d_z+2}}\right)\cdot T^{\frac{d_z+1}{d_z+2}}(\log T)^{\frac{1}{d_z+2}}.

This inequality holds for any positive $B$. Then by choosing $B^*=\frac{\log\log T-\log(d_z+2)}{\log\frac{1}{\eta}}=\frac{\log\log T-\log(d_z+2)}{\log\frac{d+2}{d+1-d_z}}$, we have $\frac{\eta^{B^*}}{d_z+2}=\frac{1}{\log T}$ and

R(T)\leq\left(\frac{128C_z\log\log T}{\log\frac{d+2}{d+1-d_z}}+16\right)\cdot T^{\frac{d_z+1}{d_z+2}}(\log T)^{\frac{1}{d_z+2}}.

The above analysis implies that we can achieve the optimal regret rate $\widetilde{\mathcal{O}}\left(T^{\frac{d_z+1}{d_z+2}}\right)$ by letting the for-loop run $B^*$ times and finishing the remaining rounds in the Cleanup step. In other words, $B^*+1$ rounds of communications are sufficient for A-BLiN to achieve the regret bound (2). ∎

Remark 2.

The quantity $\frac{r_m}{r_{m+1}}$ in line 9 of Algorithm 1 may not be an integer for some $m$. Thus, in practice we denote $\alpha_n=\lfloor\sum_{i=1}^{n}c_i\rfloor$, $\beta_n=\lceil\sum_{i=1}^{n}c_i\rceil$, and define the rounded ACE Sequence $\{\widetilde{r}_m\}_{m\in\mathbb{N}}$ by $\widetilde{r}_m=2^{-\alpha_k}$ for $m=2k-1$ and $\widetilde{r}_m=2^{-\beta_k}$ for $m=2k$. Then the total regret can be divided as $R(T)=\sum_{1\leq k\leq B^*}R_{2k-1}+\sum_{1\leq k\leq B^*}R_{2k}+\sum_{m>2B^*}R_m$. For the first part we have $\widetilde{r}_{2k-2}\leq r_{k-1}$ and $\widetilde{r}_{2k-1}\geq r_k$, while for the second part we have $\frac{\widetilde{r}_{2k-1}}{\widetilde{r}_{2k}}=2$. Therefore, by an argument similar to the proof of Theorem 5, we can bound these three parts separately, and conclude that BLiN with the rounded ACE Sequence achieves the optimal regret bound $\widetilde{\mathcal{O}}(T^{\frac{d_z+1}{d_z+2}})$ using only $\mathcal{O}(\log\log T)$ rounds of communications. The rounded ACE Sequence is further investigated in [21, 23].

For the rounded ACE Sequence, the quantity $\frac{\widetilde{r}_m}{\widetilde{r}_{m+1}}$ is an integer for any $m$, so the partition in Line 9 of Algorithm 1 is well-defined. In Theorem 6, we show that BLiN with the rounded ACE Sequence can also achieve the optimal regret bound using $\mathcal{O}(\log\log T)$ batches. For any $m$, if there exists $m'<m$ such that $\widetilde{r}_m\geq\widetilde{r}_{m'}$ (including the case where $\frac{\widetilde{r}_{2k-1}}{\widetilde{r}_{2k}}=1$), then we skip the $m$-th batch. It is easy to verify that the following analysis remains valid in this case.
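Continuing the earlier sketch, the rounded ACE Sequence of Remark 2 can be generated as follows; the duplicate-skipping rule mirrors the convention stated above, and the base-2 reading of the logarithms is the same assumption as before.

```python
import math

def rounded_ace_sequence(T, d, d_z, num_batches):
    """Rounded ACE Sequence of Remark 2: alternate floor/ceil of the exponent sums."""
    eta = (d + 1 - d_z) / (d + 2)
    c1 = (d_z + 1) / ((d + 2) * (d_z + 2)) * math.log2(T / math.log2(T))
    c = [c1 * eta ** i for i in range(num_batches)]
    r_tilde = []
    for k in range(1, num_batches + 1):
        s = sum(c[:k])
        r_tilde += [2.0 ** (-math.floor(s)), 2.0 ** (-math.ceil(s))]
    # skip any batch whose edge length is not strictly smaller than an earlier one
    out = []
    for r in r_tilde:
        if not out or r < out[-1]:
            out.append(r)
    return out

print(rounded_ace_sequence(T=80000, d=2, d_z=0, num_batches=4))
```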

Theorem 6.

With probability exceeding $1-\frac{2}{T^6}$, the $T$-step total regret $R(T)$ of Algorithm 1 with the rounded ACE Sequence satisfies

R(T)\leq\left(\frac{128C_z\log\log T}{\log\frac{d+2}{d+1-d_z}}+512C_z+16\right)T^{\frac{d_z+1}{d_z+2}}(\log T)^{\frac{1}{d_z+2}},

where $d_z$ is the zooming dimension of the problem instance. In addition, Algorithm 1 needs no more than $\frac{2\log\log T}{\log\frac{d+2}{d+1-d_z}}+1$ rounds of communications to achieve this regret rate.

Proof.

The proof of Theorem 6 is similar to that of Theorem 5.

First, we fix the positive number $B^*=\frac{\log\log\frac{T}{\log T}-\log(d_z+2)}{\log\frac{d+2}{d+1-d_z}}$ and consider the first $2B^*$ batches. As summarized in Remark 2, we bound the regret incurred by the first $2B^*$ batches through two different arguments.

For $m=2k-1$, $1\leq k\leq B^*$, we have $\widetilde{r}_m=2^{-\alpha_k}$ and $\widetilde{r}_{m-1}=2^{-\beta_{k-1}}$, and thus

\widetilde{r}_m\geq r_k\quad\text{and}\quad\widetilde{r}_{m-1}\leq r_{k-1}. (8)

Let $\mathcal{A}_m^+$ be the set of cubes not eliminated in batch $m$. An argument similar to Theorem 5 shows that $|\mathcal{A}_{m-1}^+|\leq C_z\widetilde{r}_{m-1}^{-d_z}$. The total regret of batch $m$ is

R_m=\sum_{C\in\mathcal{A}_m}\sum_{i=1}^{n_m}\Delta_{x_{C,i}}
\leq|\mathcal{A}_m|\cdot\frac{16\log T}{\widetilde{r}_m^2}\cdot 8\widetilde{r}_{m-1}
=\left(\frac{\widetilde{r}_{m-1}}{\widetilde{r}_m}\right)^d|\mathcal{A}_{m-1}^+|\cdot\frac{16\log T}{\widetilde{r}_m^2}\cdot 8\widetilde{r}_{m-1}
\leq\left(\frac{\widetilde{r}_{m-1}}{\widetilde{r}_m}\right)^d\cdot C_z\widetilde{r}_{m-1}^{-d_z}\cdot\frac{16\log T}{\widetilde{r}_m^2}\cdot 8\widetilde{r}_{m-1}
\leq\widetilde{r}_{m-1}^{d+1-d_z}\cdot\widetilde{r}_m^{-d-2}\cdot 128C_z\log T
\leq r_{k-1}^{d+1-d_z}\cdot r_k^{-d-2}\cdot 128C_z\log T
=T^{\frac{d_z+1}{d_z+2}}\cdot 128C_z\cdot(\log T)^{\frac{1}{d_z+2}},

where the sixth line follows from (8), and the seventh line follows from (7). Summing over $k$ gives that

\sum_{k=1}^{B^*}R_{2k-1}\leq T^{\frac{d_z+1}{d_z+2}}\cdot 128C_z\cdot B^*\cdot(\log T)^{\frac{1}{d_z+2}}\leq\frac{128C_z\log\log T}{\log\frac{d+2}{d+1-d_z}}\cdot T^{\frac{d_z+1}{d_z+2}}(\log T)^{\frac{1}{d_z+2}}. (9)

For $m=2k$, $1\leq k\leq B^*$, we have $\widetilde{r}_m=2^{-\beta_k}$ and $\widetilde{r}_{m-1}=2^{-\alpha_k}$, and thus $\widetilde{r}_m=\frac{1}{2}\widetilde{r}_{m-1}$. Lemma 3 shows that any cube in $\mathcal{A}_m$ is a subset of $S(16\widetilde{r}_m)$, so we have $|\mathcal{A}_m|\leq N_{\widetilde{r}_m}\leq C_z\widetilde{r}_m^{-d_z}$. Therefore, the total regret of batch $m$ is

R_m=\sum_{C\in\mathcal{A}_m}\sum_{i=1}^{n_m}\Delta_{x_{C,i}}\leq|\mathcal{A}_m|\cdot\frac{16\log T}{\widetilde{r}_m^2}\cdot 16\widetilde{r}_m\leq C_z\widetilde{r}_m^{-d_z}\cdot\frac{16\log T}{\widetilde{r}_m^2}\cdot 16\widetilde{r}_m=\widetilde{r}_m^{-d_z-1}\cdot 256C_z\log T.

Since $\frac{\widetilde{r}_{2k-2}}{\widetilde{r}_{2k}}=\frac{\widetilde{r}_{2k-2}}{\widetilde{r}_{2k-1}}\cdot\frac{\widetilde{r}_{2k-1}}{\widetilde{r}_{2k}}\geq 2$, summing over $k$ gives that

\sum_{k=1}^{B^*}R_{2k}\leq\widetilde{r}_{2B^*}^{-d_z-1}\cdot 512C_z\log T. (10)

The definition of the rounded ACE Sequence shows that

\widetilde{r}_{2B^*}=2^{-\lceil\sum_{i=1}^{B^*}c_i\rceil}=2^{-\left\lceil c_1\left(\frac{1-\eta^{B^*}}{1-\eta}\right)\right\rceil}=2^{-\left\lceil\frac{\log\frac{T}{\log T}}{d_z+2}-1\right\rceil}\geq\left(\frac{T}{\log T}\right)^{-\frac{1}{d_z+2}},

so we have

\sum_{k=1}^{B^*}R_{2k}\leq\left(\left(\frac{T}{\log T}\right)^{-\frac{1}{d_z+2}}\right)^{-d_z-1}\cdot 512C_z\log T=T^{\frac{d_z+1}{d_z+2}}\cdot 512C_z(\log T)^{\frac{1}{d_z+2}}. (11)

An argument similar to Theorem 5 shows that the total regret after the first $2B^*$ batches is upper bounded by $8\widetilde{r}_{2B^*}T$. Since

\widetilde{r}_{2B^*}=2^{-\left\lceil\frac{\log\frac{T}{\log T}}{d_z+2}-1\right\rceil}\leq 2^{-\frac{\log\frac{T}{\log T}}{d_z+2}+1}\leq 2\left(\frac{T}{\log T}\right)^{-\frac{1}{d_z+2}},

we further have

\sum_{m>2B^*}R_m\leq 8\widetilde{r}_{2B^*}T\leq 16\cdot T^{\frac{d_z+1}{d_z+2}}(\log T)^{\frac{1}{d_z+2}}. (12)

Combining (9), (11) and (12), we conclude that

R(T)\leq\left(\frac{128C_z\log\log T}{\log\frac{d+2}{d+1-d_z}}+512C_z+16\right)\cdot T^{\frac{d_z+1}{d_z+2}}(\log T)^{\frac{1}{d_z+2}}.

The analysis in Theorem 6 implies that we can achieve the optimal regret rate $\widetilde{\mathcal{O}}\left(T^{\frac{d_z+1}{d_z+2}}\right)$ by letting the for-loop of Algorithm 1 run $2B^*$ times and finishing the remaining rounds in the Cleanup step. In other words, $2B^*+1$ rounds of communications are sufficient for BLiN to achieve the optimal regret. ∎

Remark 3.

The proof of Theorem 6 implies a regret upper bound for BLiN with a fixed number of batches $B$. See Appendix D for the detailed analysis.

4 Lower Bounds

In this section, we present lower bounds for Lipschitz bandits with batched feedback, which in turn give communication lower bounds for all Lipschitz bandit algorithms. Our lower bounds depend on the number of rounds of communications $B$. When $B$ is sufficiently large, our results match the upper bound for the vanilla Lipschitz bandit problem, $\widetilde{\mathcal{O}}\left(T^{\frac{d_z+1}{d_z+2}}\right)$. More importantly, this dependency on $B$ gives the minimal number of rounds of communications needed to achieve the optimal regret bound for all Lipschitz bandit algorithms, which is summarized in Corollary 1. Since this lower bound matches the upper bound presented in Theorem 5, BLiN optimally solves Lipschitz bandits with minimal communication.

Similar to most lower bound proofs, we need to construct problem instances that are difficult to distinguish. What is different here is that we need to carefully integrate the batched feedback pattern [36] with Lipschitz rewards [42, 32]. To capture the adaptivity in grid determination, we construct “static reference communication grids” to remove the stochasticity in grid selection [1, 22]. Moreover, to prove the lower bounds for general $d_z\leq d$, we apply a “linear-decaying extension” technique to transfer instances from the $d_z$-dimensional subspace to the $d$-dimensional whole space.

The lower bound analysis is organized as follows. In Section 4.1, we present lower bounds for the full-dimensional case, that is, $d=d_z$. In Section 4.1.1, we consider the static grid case, where the grid is predetermined. This static grid case will provide intuition for the adaptive and more general case. In Section 4.1.2, we provide the lower bound for general adaptive grids. Finally, in Section 4.2, we apply the “linear-decaying extension” technique to prove lower bounds for the case $d_z\leq d$.

4.1 Lower Bounds for the Full-Dimension Case

In this section, we let the zooming dimension $d_z$ equal the ambient dimension $d$. The aim of considering this case first is to simplify the construction and highlight the technique for dealing with the batched feedback setting.

4.1.1 The Static Grid Case

We first provide the lower bound for the case where the grid is static and determined before the game.

The expected reward functions of the hard instances are constructed as follows: we choose some ‘positions’ and ‘heights’, such that the expected reward function attains a local maximum of the specified ‘height’ at the specified ‘position’. We will use the word ‘peak’ to refer to these local maxima. The following theorem presents the lower bound for the static grid case.

Theorem 7.

Consider Lipschitz bandit problems with time horizon $T$ and ambient dimension $d$ such that the grid of reward communication $\mathcal{T}$ is static and determined before the game. If $B$ rounds of communications are allowed, then for any policy $\pi$, there exists a problem instance such that

\mathbb{E}\left[R_T(\pi)\right]\geq\frac{1}{32e^{\frac{1}{16}}}\cdot T^{\frac{1-\frac{1}{d+2}}{1-\left(\frac{1}{d+2}\right)^{B}}}.

To prove Theorem 7, we first show that for any $k>1$ there exists an instance such that $\mathbb{E}[R_T(\pi)]\geq\frac{t_k}{t_{k-1}^{\frac{1}{d+2}}}$. Fixing $k>1$, we let $r_k=\frac{1}{t_{k-1}^{\frac{1}{d+2}}}$ and $M_k:=t_{k-1}r_k^2=\frac{1}{r_k^d}$. Then we construct a set of problem instances $\mathcal{I}_k=\{I_{k,1},\cdots,I_{k,M_k}\}$, such that the gap between the highest peak and the second highest peak is about $r_k$ for every instance in $\mathcal{I}_k$.

Based on this construction, we prove that no algorithm can distinguish the instances in $\mathcal{I}_k$ from one another in the first $(k-1)$ batches, so the worst-case regret is at least $r_kt_k$, which gives the inequality we need. For the first batch $(0,t_1]$, we can easily construct a set of instances where the worst-case regret is at least $t_1$, since no information is available during this time. Thus, there exists a problem instance such that

\mathbb{E}[R_T(\pi)]\gtrsim\max\left\{t_1,\frac{t_2}{t_1^{\frac{1}{d+2}}},\cdots,\frac{t_B}{t_{B-1}^{\frac{1}{d+2}}}\right\}.

Since $0<t_1<\cdots<t_B=T$, this maximum is minimized when all of its terms are equal, and the inequality in Theorem 7 follows.

Proof of Theorem 7.

Fixing an index $k>1$, we first show that there exists an instance such that $\mathbb{E}[R_T(\pi)]\geq\frac{t_k}{t_{k-1}^{\frac{1}{d+2}}}$. We construct a set of problem instances that are difficult to distinguish. Let $r_k=\frac{1}{t_{k-1}^{\frac{1}{d+2}}}$ and $M_k:=t_{k-1}r_k^2=\frac{1}{r_k^d}$. We define $r_k$ and $M_k$ in this way to: 1. ensure that we can find a set of arms $\mathcal{U}_k=\{u_{k,1},\cdots,u_{k,M_k}\}$ such that $d_{\mathcal{A}}(u_{k,i},u_{k,j})\geq r_k$ for any $i\neq j$; 2. maximize $r_k$ while ensuring $r_k\leq\sqrt{M_k/t_{k-1}}$, so that the hard instances with maximum reward $r_k$ can still confuse a learner who only makes $t_{k-1}$ observations. Then we consider a set of problem instances $\mathcal{I}_k=\{I_{k,1},\cdots,I_{k,M_k}\}$. The expected reward for $I_{k,1}$ is defined as

\mu_{k,1}(x)=\begin{cases}\frac{3}{4}r_k,&\text{if }x=u_{k,1},\\ \frac{5}{8}r_k,&\text{if }x=u_{k,j},\;j\neq 1,\\ \max\left\{\frac{r_k}{2},\max_{u\in\mathcal{U}_k}\left\{\mu_{k,1}(u)-d_{\mathcal{A}}(x,u)\right\}\right\},&\text{if }x\in\mathcal{A}\setminus\mathcal{U}_k.\end{cases} (13)

For $2\leq i\leq M_k$, the expected reward for $I_{k,i}$ is defined as

\mu_{k,i}(x)=\begin{cases}\frac{3}{4}r_k,&\text{if }x=u_{k,1},\\ \frac{7}{8}r_k,&\text{if }x=u_{k,i},\\ \frac{5}{8}r_k,&\text{if }x=u_{k,j},\;j\neq 1\text{ and }j\neq i,\\ \max\left\{\frac{r_k}{2},\max_{u\in\mathcal{U}_k}\left\{\mu_{k,i}(u)-d_{\mathcal{A}}(x,u)\right\}\right\},&\text{if }x\in\mathcal{A}\setminus\mathcal{U}_k.\end{cases} (14)

Let $S_{k,i}=\mathbb{B}(u_{k,i},\frac{3}{8}r_k)$ (the ball with center $u_{k,i}$ and radius $\frac{3}{8}r_k$). It is easy to verify the following properties of the construction (13) and (14):

  1. For any $2\leq i\leq M_k$, $\mu_{k,i}(x)=\mu_{k,1}(x)$ for any $x\in\mathcal{A}\setminus S_{k,i}$;

  2. For any $2\leq i\leq M_k$, $\mu_{k,1}(x)\leq\mu_{k,i}(x)\leq\mu_{k,1}(x)+\frac{r_k}{4}$ for any $x\in S_{k,i}$;

  3. For any $1\leq i\leq M_k$, under $I_{k,i}$, pulling an arm that is not in $S_{k,i}$ incurs a regret of at least $\frac{r_k}{8}$.

For all arm pulls in all problem instances, a Gaussian noise sampled from $\mathcal{N}(0,1)$ is added to the observed reward. This noise corruption is independent of all other randomness.
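To help parse the construction (13)-(14), the following Python snippet evaluates $\mu_{k,i}$ at an arm $x$ for given peak locations $\mathcal{U}_k$ under the sup-norm metric. The vectorized maximum over peaks recovers the piecewise definition; the function and the example peaks are only an illustrative rendering of the construction, not code from the paper.

```python
import numpy as np

def hard_instance_reward(x, peaks, r_k, i):
    """Expected reward mu_{k,i}(x) of the lower-bound instances (13)-(14).

    peaks : array of shape (M_k, d), the arm set U_k (pairwise sup-norm distance >= r_k).
    i     : instance index; i = 1 gives mu_{k,1}, i >= 2 raises the peak at u_{k,i} to 7/8 r_k.
    """
    peaks = np.asarray(peaks, dtype=float)
    heights = np.full(len(peaks), 5.0 / 8.0 * r_k)   # default peak height 5/8 r_k
    heights[0] = 3.0 / 4.0 * r_k                     # u_{k,1} has height 3/4 r_k
    if i >= 2:
        heights[i - 1] = 7.0 / 8.0 * r_k             # u_{k,i} is the unique best arm
    # Lipschitz extension: largest cone below the peaks, floored at r_k / 2
    dists = np.max(np.abs(peaks - np.asarray(x, dtype=float)), axis=1)
    return max(r_k / 2.0, float(np.max(heights - dists)))

# usage sketch on [0, 1]^2 with r_k = 0.25 and four peaks at pairwise distance >= r_k
U = [[0.1, 0.1], [0.4, 0.1], [0.7, 0.1], [0.1, 0.6]]
print(hard_instance_reward([0.4, 0.1], U, 0.25, i=2))   # 7/8 * 0.25 at the best peak
```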

The lower bound of expected regret relies on the following lemma.

Lemma 4.

For any policy $\pi$, there exists a problem instance $I\in\mathcal{I}_k$ such that

\mathbb{E}\left[R_T(\pi)\right]\geq\frac{r_k}{32}\cdot\sum_{j=1}^{B}\left(t_j-t_{j-1}\right)\exp\left\{-\frac{t_{j-1}r_k^2}{32(M_k-1)}\right\}.
Proof.

Let $x_t$ denote the choice of policy $\pi$ at time $t$, and $y_t$ denote the corresponding reward. Additionally, for $t_{j-1}<t\leq t_j$, we define $\mathbb{P}_{k,i}^t$ as the distribution of the sequence $(x_1,y_1,\cdots,x_{t_{j-1}},y_{t_{j-1}})$ under instance $I_{k,i}$ and policy $\pi$. It holds that

\sup_{I\in\mathcal{I}_k}\mathbb{E}R_T(\pi)\geq\frac{1}{M_k}\sum_{i=1}^{M_k}\mathbb{E}_{\mathbb{P}_{k,i}}\left[R_T(\pi)\right]\geq\frac{1}{M_k}\sum_{i=1}^{M_k}\sum_{t=1}^{T}\mathbb{E}_{\mathbb{P}_{k,i}^t}\left[R^t(\pi)\right]\geq\frac{r_k}{8}\sum_{t=1}^{T}\frac{1}{M_k}\sum_{i=1}^{M_k}\mathbb{P}_{k,i}^t(x_t\notin S_{k,i}), (15)

where $R^t(\pi)$ denotes the regret incurred by policy $\pi$ at time $t$.

From our construction, it is easy to see that $S_{k,i}\cap S_{k,j}=\varnothing$ for any $i\neq j$, so we can construct a test $\Psi$ such that $x_t\in S_{k,i}$ implies $\Psi=i$. Then from Lemma 12,

\frac{1}{M_k}\sum_{i=1}^{M_k}\mathbb{P}_{k,i}^t(x_t\notin S_{k,i})\geq\frac{1}{M_k}\sum_{i=1}^{M_k}\mathbb{P}_{k,i}^t(\Psi\neq i)\geq\frac{1}{2M_k}\sum_{i=2}^{M_k}\exp\left\{-D_{KL}\left(\mathbb{P}_{k,1}^t\|\mathbb{P}_{k,i}^t\right)\right\}.

To avoid notational clutter, for any $s,s'$ ($s\geq s'$), define

(\mathbf{x},\mathbf{y})_{:s}^{:s'}=(x_1,y_1,\cdots,x_{s'},y_{s'},\cdots,x_s).

Now we calculate DKL(k,1tk,it)D_{KL}\left(\mathbb{P}_{k,1}^{t}\|\mathbb{P}_{k,i}^{t}\right). From the chain rule of KL-Divergence, we have

DKL(k,1tk,it)\displaystyle\;D_{KL}\left(\mathbb{P}_{k,1}^{t}\|\mathbb{P}_{k,i}^{t}\right)
=\displaystyle= DKL(k,1t((𝐱,𝐲):tj1:tj1)k,it((𝐱,𝐲):tj1:tj1))\displaystyle\;D_{KL}\left(\mathbb{P}_{k,1}^{t}\left((\mathbf{x},\mathbf{y})_{:t_{j-1}}^{:t_{j-1}}\right)\|\mathbb{P}_{k,i}^{t}\left((\mathbf{x},\mathbf{y})_{:t_{j-1}}^{:t_{j-1}}\right)\right)
=\displaystyle= DKL(k,1t((𝐱,𝐲):tj1:tj11)k,it((𝐱,𝐲):tj1:tj11))+𝔼k,1(DKL(k,1t(ytj1|xtj1)k,it(ytj1|xtj1)))\displaystyle\;D_{KL}\left(\mathbb{P}_{k,1}^{t}\left((\mathbf{x},\mathbf{y})_{:t_{j-1}}^{:t_{j-1}-1}\right)\|\mathbb{P}_{k,i}^{t}\left((\mathbf{x},\mathbf{y})_{:t_{j-1}}^{:t_{j-1}-1}\right)\right)+\mathbb{E}_{\mathbb{P}_{k,1}}\left(D_{KL}\left(\mathbb{P}_{k,1}^{t}(y_{t_{j-1}}|x_{t_{j-1}})\|\mathbb{P}_{k,i}^{t}(y_{t_{j-1}}|x_{t_{j-1}})\right)\right) (16)
=\displaystyle= DKL(k,1t((𝐱,𝐲):tj11:tj11)k,it((𝐱,𝐲):tj11:tj11))+𝔼k,1(DKL(k,1t(ytj1|xtj1)k,it(ytj1|xtj1)))\displaystyle\;D_{KL}\left(\mathbb{P}_{k,1}^{t}\left((\mathbf{x},\mathbf{y})_{:t_{j-1}-1}^{:t_{j-1}-1}\right)\|\mathbb{P}_{k,i}^{t}\left((\mathbf{x},\mathbf{y})_{:t_{j-1}-1}^{:t_{j-1}-1}\right)\right)+\mathbb{E}_{\mathbb{P}_{k,1}}\left(D_{KL}\left(\mathbb{P}_{k,1}^{t}(y_{t_{j-1}}|x_{t_{j-1}})\|\mathbb{P}_{k,i}^{t}(y_{t_{j-1}}|x_{t_{j-1}})\right)\right) (17)
\displaystyle\leq DKL(k,1t((𝐱,𝐲):tj11:tj11)k,it((𝐱,𝐲):tj11:tj11))+𝔼k,1(DKL(N(μk,1(xtj1),1)N(μk,i(xtj1),1)))\displaystyle\;D_{KL}\left(\mathbb{P}_{k,1}^{t}\left((\mathbf{x},\mathbf{y})_{:t_{j-1}-1}^{:t_{j-1}-1}\right)\|\mathbb{P}_{k,i}^{t}\left((\mathbf{x},\mathbf{y})_{:t_{j-1}-1}^{:t_{j-1}-1}\right)\right)+\mathbb{E}_{\mathbb{P}_{k,1}}\left(D_{KL}\left(N(\mu_{k,1}(x_{t_{j-1}}),1)\|N(\mu_{k,i}(x_{t_{j-1}}),1)\right)\right) (18)
=\displaystyle= DKL(k,1t((𝐱,𝐲):tj11:tj11)k,it((𝐱,𝐲):tj11:tj11))+𝔼k,1(12(μk,1(xtj1)μk,i(xtj1))2)\displaystyle\;D_{KL}\left(\mathbb{P}_{k,1}^{t}\left((\mathbf{x},\mathbf{y})_{:t_{j-1}-1}^{:t_{j-1}-1}\right)\|\mathbb{P}_{k,i}^{t}\left((\mathbf{x},\mathbf{y})_{:t_{j-1}-1}^{:t_{j-1}-1}\right)\right)+\mathbb{E}_{\mathbb{P}_{k,1}}\left(\frac{1}{2}\left(\mu_{k,1}(x_{t_{j-1}})-\mu_{k,i}(x_{t_{j-1}})\right)^{2}\right)
\displaystyle\leq DKL(k,1t((𝐱,𝐲):tj11:tj11)k,it((𝐱,𝐲):tj11:tj11))+𝔼k,1(𝟏{xtj1Sk,i}12(rk4)2)\displaystyle\;D_{KL}\left(\mathbb{P}_{k,1}^{t}\left((\mathbf{x},\mathbf{y})_{:t_{j-1}-1}^{:t_{j-1}-1}\right)\|\mathbb{P}_{k,i}^{t}\left((\mathbf{x},\mathbf{y})_{:t_{j-1}-1}^{:t_{j-1}-1}\right)\right)+\mathbb{E}_{\mathbb{P}_{k,1}}\left(\bm{1}_{\{x_{t_{j-1}}\in S_{k,i}\}}\cdot\frac{1}{2}\left(\frac{r_{k}}{4}\right)^{2}\right) (19)
=\displaystyle= DKL(k,1t((𝐱,𝐲):tj11:tj11)k,it((𝐱,𝐲):tj11:tj11))+rk232k,1(xtj1Sk,i),\displaystyle\;D_{KL}\left(\mathbb{P}_{k,1}^{t}\left((\mathbf{x},\mathbf{y})_{:t_{j-1}-1}^{:t_{j-1}-1}\right)\|\mathbb{P}_{k,i}^{t}\left((\mathbf{x},\mathbf{y})_{:t_{j-1}-1}^{:t_{j-1}-1}\right)\right)+\frac{r_{k}^{2}}{32}\cdot\mathbb{P}_{k,1}\left(x_{t_{j-1}}\in S_{k,i}\right), (20)

where (16) uses the chain rule for KL-divergence and the conditional independence of the reward, (17) removes the dependence on xtj1x_{t_{j-1}} in the first term by another use of the chain rule and the fact that the distribution of xtj1x_{t_{j-1}} is fully determined by the policy and the distribution of (𝐱,𝐲):tj11:tj11(\mathbf{x},\mathbf{y})_{:t_{j-1}-1}^{:t_{j-1}-1}, (18) uses the fact that the rewards are corrupted by standard normal noise, and (19) uses the first two properties of the construction.

Applying the recursion (20) repeatedly over the time steps stj1s\leq t_{j-1}, we conclude that

DKL(k,1tk,it)rk232stj1k,1(xsSk,i)=rk232𝔼k,1τi,\displaystyle D_{KL}\left(\mathbb{P}_{k,1}^{t}\|\mathbb{P}_{k,i}^{t}\right)\leq\frac{r_{k}^{2}}{32}\sum_{s\leq t_{j-1}}\mathbb{P}_{k,1}\left(x_{s}\in S_{k,i}\right)=\frac{r_{k}^{2}}{32}\mathbb{E}_{\mathbb{P}_{k,1}}\tau_{i}, (21)

where τi\tau_{i} denotes the number of pulls of arms in Sk,iS_{k,i} before the batch containing tt. Then for all t(tj1,tj]t\in(t_{j-1},t_{j}], we have

1Mki=1Mkk,it(xtSk,i)\displaystyle\frac{1}{M_{k}}\sum_{i=1}^{M_{k}}\mathbb{P}_{k,i}^{t}(x_{t}\notin S_{k,i})\geq 12Mki=2Mkexp{rk232𝔼k,1τi}\displaystyle\;\frac{1}{2M_{k}}\sum_{i=2}^{M_{k}}\exp\left\{-\frac{r_{k}^{2}}{32}\mathbb{E}_{\mathbb{P}_{k,1}}\tau_{i}\right\}
\displaystyle\geq Mk12Mkexp{rk232(Mk1)i=2Mk𝔼k,1τi}\displaystyle\;\frac{M_{k}-1}{2M_{k}}\exp\left\{-\frac{r_{k}^{2}}{32(M_{k}-1)}\sum_{i=2}^{M_{k}}\mathbb{E}_{\mathbb{P}_{k,1}}\tau_{i}\right\} (22)
\displaystyle\geq 14exp{rk2tj132(Mk1)},\displaystyle\;\frac{1}{4}\exp\left\{-\frac{r_{k}^{2}t_{j-1}}{32(M_{k}-1)}\right\}, (23)

where (22) uses Jensen’s inequality, and (23) uses the fact that i=2Mkτitj1\sum_{i=2}^{M_{k}}\tau_{i}\leq t_{j-1}. Finally, we substitute (23) into (15) to finish the proof. ∎

Since Mk=tk1rk2M_{k}=t_{k-1}r_{k}^{2}, the expected regret of policy π\pi satisfies

𝔼[RT(π)]\displaystyle\mathbb{E}\left[R_{T}(\pi)\right] rk32j=1B(tjtj1)exp{tj1rk232(Mk1)}\displaystyle\geq\frac{r_{k}}{32}\cdot\sum_{j=1}^{B}\left(t_{j}-t_{j-1}\right)\exp\left\{-\frac{t_{j-1}r_{k}^{2}}{32(M_{k}-1)}\right\}
rk32j=1B(tjtj1)exp{tj1rk216Mk}\displaystyle\geq\frac{r_{k}}{32}\cdot\sum_{j=1}^{B}\left(t_{j}-t_{j-1}\right)\exp\left\{-\frac{t_{j-1}r_{k}^{2}}{16M_{k}}\right\}
rk32j=1B(tjtj1)exp{tj116tk1}\displaystyle\geq\frac{r_{k}}{32}\cdot\sum_{j=1}^{B}\left(t_{j}-t_{j-1}\right)\exp\left\{-\frac{t_{j-1}}{16t_{k-1}}\right\}

on instance II defined in Lemma 4.

By omitting terms with j>kj>k in the above summation, we have

𝔼[RT(π)]\displaystyle\mathbb{E}[R_{T}(\pi)] rk32j=1B(tjtj1)exp{tj116tk1}\displaystyle\geq\frac{r_{k}}{32}\cdot\sum_{j=1}^{B}\left(t_{j}-t_{j-1}\right)\exp\left\{-\frac{t_{j-1}}{16t_{k-1}}\right\}
rk32j=1k(tjtj1)exp{116}\displaystyle\geq\frac{r_{k}}{32}\cdot\sum_{j=1}^{k}\left(t_{j}-t_{j-1}\right)\exp\left\{-\frac{1}{16}\right\}
=132e116rktk=132e116tktk11d+2.\displaystyle=\frac{1}{32e^{\frac{1}{16}}}r_{k}t_{k}=\frac{1}{32e^{\frac{1}{16}}}\cdot\frac{t_{k}}{t_{k-1}^{\frac{1}{d+2}}}.

The above analysis applies to any k>1k>1. For the first batch (0,t1](0,t_{1}], we can easily construct a set of instances where the worst-case regret is at least t1t_{1}, since no information is available during this time. Thus, there exists a problem instance such that

𝔼[RT(π)]132e116max{t1,t2t11d+2,,tBtB11d+2}.\displaystyle\mathbb{E}[R_{T}(\pi)]\geq\frac{1}{32e^{\frac{1}{16}}}\max\left\{t_{1},\frac{t_{2}}{t_{1}^{\frac{1}{d+2}}},\cdots,\frac{t_{B}}{t_{B-1}^{\frac{1}{d+2}}}\right\}.

Since 0<t1<<tB=T0<t_{1}<\cdots<t_{B}=T, we further have

max{t1,t2t1ε,,tBtB1ε}(t1εB1(t2t1ε)εB2(tB1tB2ε)εtBtB1ε)1i=0B1εi=T1ε1εB,\displaystyle\max\left\{t_{1},\frac{t_{2}}{t_{1}^{\varepsilon}},\cdots,\frac{t_{B}}{t_{B-1}^{\varepsilon}}\right\}\geq\left(t_{1}^{\varepsilon^{B-1}}\cdot\left(\frac{t_{2}}{t_{1}^{\varepsilon}}\right)^{\varepsilon^{B-2}}\cdots\left(\frac{t_{B-1}}{t_{B-2}^{\varepsilon}}\right)^{\varepsilon}\cdot\frac{t_{B}}{t_{B-1}^{\varepsilon}}\right)^{\frac{1}{\sum_{i=0}^{B-1}\varepsilon^{i}}}=T^{\frac{1-\varepsilon}{1-\varepsilon^{B}}},

where ε=1d+2\varepsilon=\frac{1}{d+2}. The above two inequalities imply that

𝔼[RT(π)]132e116T11d+21(1d+2)B,\displaystyle\mathbb{E}\left[R_{T}(\pi)\right]\geq\frac{1}{32e^{\frac{1}{16}}}\cdot T^{\frac{1-\frac{1}{d+2}}{1-\left(\frac{1}{d+2}\right)^{B}}},

which finishes the proof. ∎
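As a quick numerical sanity check (not part of the proof) of the max–geometric-mean step above, the following Python snippet compares the grid objective with T^{(1-ε)/(1-ε^B)} over random static grids; the values of T, B and d below are illustrative choices.

import random

def grid_objective(grid, eps):
    # max{ t_1, t_2 / t_1^eps, ..., t_B / t_{B-1}^eps } for a grid t_1, ..., t_B = T
    vals = [grid[0]] + [grid[j] / grid[j - 1] ** eps for j in range(1, len(grid))]
    return max(vals)

T, B, d = 10_000, 4, 2            # illustrative values
eps = 1.0 / (d + 2)
lower = T ** ((1 - eps) / (1 - eps ** B))

for _ in range(1000):
    grid = sorted(random.uniform(1, T) for _ in range(B - 1)) + [T]
    assert grid_objective(grid, eps) >= lower - 1e-9
print("every random grid satisfies the bound; lower bound =", lower)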

4.1.2 Removing the Static Grid Assumption

So far we have derived the lower bound for the static grid case. Yet there is a gap between the static and the adaptive cases. We close this gap in the following theorem.

Theorem 8.

Consider Lipschitz bandit problems with time horizon TT and ambient dimension dd such that the grid of reward communication 𝒯\mathcal{T} is adaptively determined by the player. If BB rounds of communications are allowed, then for any policy π\pi, there exists a problem instance such that

𝔼[RT(π)]1256B2T11d+21(1d+2)B.\displaystyle\mathbb{E}\left[R_{T}(\pi)\right]\geq\frac{1}{256B^{2}}T^{\frac{1-\frac{1}{d+2}}{1-\left(\frac{1}{d+2}\right)^{B}}}.

To prove Theorem 8, we consider a reference static grid 𝒯r={T0,T1,,TB}\mathcal{T}_{r}=\{T_{0},T_{1},\cdots,T_{B}\}, where Tj=T1εj1εBT_{j}=T^{\frac{1-\varepsilon^{j}}{1-\varepsilon^{B}}} for ε=1d+2\varepsilon=\frac{1}{d+2}. We set the reference grid in this way because it is the solution to the following optimization problem

max1T1TB=Tmin{T1,T2T1ε,,TBTB1ε}.\max_{1\leq T_{1}\leq\cdots\leq T_{B}=T}\min\left\{T_{1},\frac{T_{2}}{T_{1}^{\varepsilon}},\cdots,\frac{T_{B}}{T_{B-1}^{\varepsilon}}\right\}.
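For intuition, the reference grid equalizes all BB terms of the objective: Tj/Tj1ε=T1ε1εBT_{j}/T_{j-1}^{\varepsilon}=T^{\frac{1-\varepsilon}{1-\varepsilon^{B}}} for every jj. The following small Python computation (with illustrative values of T, B and d) verifies this balance.

T, B, d = 10_000, 4, 2                  # illustrative values
eps = 1.0 / (d + 2)

ref = [T ** ((1 - eps ** j) / (1 - eps ** B)) for j in range(B + 1)]   # T_0, ..., T_B
ratios = [ref[j] / ref[j - 1] ** eps for j in range(1, B + 1)]         # T_j / T_{j-1}^eps
print(ref)       # increasing grid with T_0 = 1 and T_B = T
print(ratios)    # all equal to T^{(1-eps)/(1-eps^B)}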

Then we construct a series of ‘worlds’, denoted by 1,,B\mathcal{I}_{1},\cdots,\mathcal{I}_{B}. Each world is a set of problem instances, and each problem instance in world j\mathcal{I}_{j} is defined by peak location set 𝒰j\mathcal{U}_{j} and basic height rjr_{j}, where the sets 𝒰j\mathcal{U}_{j} and quantities rjr_{j} for 1jB1\leq j\leq B are presented in the proof below.

Based on these constructions, we first prove that for any adaptive grid and policy, there exists an index jj such that the event Aj={tj1<Tj1,tjTj}A_{j}=\{t_{j-1}<T_{j-1},\;t_{j}\geq T_{j}\} happens with sufficiently high probability in world j\mathcal{I}_{j}. Then similar to Theorem 7, we prove that in world j\mathcal{I}_{j} there exists a set of problem instances that is difficult to differentiate in the first j1j-1 batches. In addition, event AjA_{j} implies that tjTjt_{j}\geq T_{j}, so the worst-case regret is at least rjTjr_{j}T_{j}, which gives the lower bound we need.

Proof of Theorem 8.

Firstly, we define rj=1Tj1εBr_{j}=\frac{1}{T_{j-1}^{\varepsilon}B}, where ε=1d+2\varepsilon=\frac{1}{d+2}, and define Mj=1rjdM_{j}=\frac{1}{r_{j}^{d}}. From the definition, we have

Tj1rj2=1rjdBd+21rjdB2=MjB2.\displaystyle T_{j-1}r_{j}^{2}=\frac{1}{r_{j}^{d}B^{d+2}}\leq\frac{1}{r_{j}^{d}B^{2}}=\frac{M_{j}}{B^{2}}. (24)

For 1jB1\leq j\leq B, we can find sets of arms 𝒰j={uj,1,,uj,Mj}\mathcal{U}_{j}=\{u_{j,1},\cdots,u_{j,M_{j}}\} such that (a) d𝒜(uj,m,uj,n)rjd_{\mathcal{A}}(u_{j,m},u_{j,n})\geq r_{j} for any mnm\neq n, and (b) u1,M1==uB,MBu_{1,M_{1}}=\cdots=u_{B,M_{B}}.

Then we present the construction of worlds 1,,B\mathcal{I}_{1},\cdots,\mathcal{I}_{B}. For 1jB11\leq j\leq B-1, we let j={Ij,k}k=1Mj1\mathcal{I}_{j}=\{I_{j,k}\}_{k=1}^{M_{j}-1}, and the expected reward of Ij,kI_{j,k} is defined as

μj,k(x)={r12+rj16+rB16,x=uj,k,r12+rB16,x=uj,Mj,\displaystyle\mu_{j,k}(x)=\left\{\begin{aligned} &\frac{r_{1}}{2}+\frac{r_{j}}{16}+\frac{r_{B}}{16},&x=u_{j,k},\\ &\frac{r_{1}}{2}+\frac{r_{B}}{16},&x=u_{j,M_{j}},\end{aligned}\right. (25)

and μj,k(x)=max{r12,maxu𝒰j{μj,k(u)d𝒜(x,u)}}\mu_{j,k}(x)=\max\left\{\frac{r_{1}}{2},\max_{u\in\mathcal{U}_{j}}\left\{\mu_{j,k}(u)-d_{\mathcal{A}}(x,u)\right\}\right\}, otherwise. For j=Bj=B, we let B={IB}\mathcal{I}_{B}=\{I_{B}\}. The expected reward of IBI_{B} is defined as

μB(uB,MB)=r12+rB16\displaystyle\mu_{B}(u_{B,M_{B}})=\frac{r_{1}}{2}+\frac{r_{B}}{16} (26)

and μB(x)=max{r12,μB(uB,MB)d𝒜(x,uB,MB)}\mu_{B}(x)=\max\left\{\frac{r_{1}}{2},\mu_{B}(u_{B,M_{B}})-d_{\mathcal{A}}(x,u_{B,M_{B}})\right\}, otherwise. Roughly speaking, our constructions satisfy two properties: for each jBj\neq B and 1kMj11\leq k\leq M_{j}-1,

  1. 1.

    μj,k\mu_{j,k} is close to μB\mu_{B};

  2. 2.

    under Ij,kI_{j,k}, pulling an arm that is far from uj,ku_{j,k} incurs a regret at least rj16\frac{r_{j}}{16}.

The formal versions of these two properties are presented in the proofs of Lemma 5 and Lemma 6.

As mentioned above, based on these constructions, we first show that for any adaptive grid 𝒯={t0,,tB}\mathcal{T}=\{t_{0},\cdots,t_{B}\}, there exists an index jj such that (tj1,tj](t_{j-1},t_{j}] is sufficiently large in world j\mathcal{I}_{j}. More formally, for each j[B]j\in[B], and event Aj={tj1<Tj1,tjTj}A_{j}=\{t_{j-1}<T_{j-1},\;t_{j}\geq T_{j}\}, we define the quantities pj:=1Mj1k=1Mj1j,k(Aj)p_{j}:=\frac{1}{M_{j}-1}\sum_{k=1}^{M_{j}-1}\mathbb{P}_{j,k}(A_{j}) for jB1j\leq B-1 and pB:=B(AB)p_{B}:=\mathbb{P}_{B}(A_{B}), where j,k(Aj)\mathbb{P}_{j,k}(A_{j}) denotes the probability of the event AjA_{j} under instance Ij,kI_{j,k} and policy π\pi. For these quantities, we have the following lemma.

Lemma 5.

For any adaptive grid 𝒯\mathcal{T} and policy π\pi, it holds that j=1Bpj78.\sum_{j=1}^{B}p_{j}\geq\frac{7}{8}.

Proof.

For 1jB11\leq j\leq B-1 and 1kMj11\leq k\leq M_{j}-1, we define Sj,k=𝔹(uj,k,38rj)S_{j,k}=\mathbb{B}(u_{j,k},\frac{3}{8}r_{j}), which is the ball centered at uj,ku_{j,k} with radius 38rj\frac{3}{8}r_{j}. It is easy to verify the following properties of our constructions (25) and (26):

  1. 1.

    μj,k(x)=μB(x)\mu_{j,k}(x)=\mu_{B}(x) for any xSj,kx\notin S_{j,k};

  2. 2.

    μB(x)μj,k(x)μB(x)+rj8\mu_{B}(x)\leq\mu_{j,k}(x)\leq\mu_{B}(x)+\frac{r_{j}}{8}, for any xSj,kx\in S_{j,k}.

Let xtx_{t} denote the choice of policy π\pi at time tt, and yty_{t} denote the reward. For tj1<ttjt_{j-1}<t\leq t_{j}, we define j,kt\mathbb{P}_{j,k}^{t} (resp. Bt\mathbb{P}_{B}^{t}) as the distribution of the sequence (x1,y1,,xtj1,ytj1)\left(x_{1},y_{1},\cdots,x_{t_{j-1}},y_{t_{j-1}}\right) under instance Ij,kI_{j,k} (resp. IBI_{B}) and policy π\pi. Since event AjA_{j} can be completely described by the observations up to time Tj1T_{j-1} (AjA_{j} is an event in the σ\sigma-algebra on which j,kTj1\mathbb{P}_{j,k}^{T_{j-1}} and BTj1\mathbb{P}_{B}^{T_{j-1}} are defined), we can use the definition of total variation to get

|B(Aj)j,k(Aj)|=|BTj1(Aj)j,kTj1(Aj)|TV(BTj1,j,kTj1).\displaystyle|\mathbb{P}_{B}(A_{j})-\mathbb{P}_{j,k}(A_{j})|=\;|\mathbb{P}_{B}^{T_{j-1}}(A_{j})-\mathbb{P}_{j,k}^{T_{j-1}}(A_{j})|\leq\;TV\left(\mathbb{P}_{B}^{T_{j-1}},\mathbb{P}_{j,k}^{T_{j-1}}\right).

For the total variation, we apply Lemma 11 to get

1Mj1k=1Mj1TV(BTj1,j,kTj1)1Mj1k=1Mj11exp(DKL(BTj1j,kTj1)).\displaystyle\;\frac{1}{M_{j}-1}\sum_{k=1}^{M_{j}-1}TV\left(\mathbb{P}_{B}^{T_{j-1}},\mathbb{P}_{j,k}^{T_{j-1}}\right)\leq\;\frac{1}{M_{j}-1}\sum_{k=1}^{M_{j}-1}\sqrt{1-\exp\left(-D_{KL}\left(\mathbb{P}_{B}^{T_{j-1}}\|\mathbb{P}_{j,k}^{T_{j-1}}\right)\right)}.

An argument similar to (21) yields that

DKL(BTj1j,kTj1)rj2128𝔼Bτk,\displaystyle D_{KL}\left(\mathbb{P}_{B}^{T_{j-1}}\|\mathbb{P}_{j,k}^{T_{j-1}}\right)\leq\frac{r_{j}^{2}}{128}\mathbb{E}_{\mathbb{P}_{B}}\tau_{k},

where τk\tau_{k} denotes the number of pulls in Sj,kS_{j,k} before the batch containing Tj1T_{j-1}. Combining the above two inequalities gives

1Mj1k=1Mj1TV(BTj1,j,kTj1)\displaystyle\frac{1}{M_{j}-1}\sum_{k=1}^{M_{j}-1}TV\left(\mathbb{P}_{B}^{T_{j-1}},\mathbb{P}_{j,k}^{T_{j-1}}\right)\leq 1Mj1k=1Mj11exp(rj2128𝔼Bτk)\displaystyle\;\frac{1}{M_{j}-1}\sum_{k=1}^{M_{j}-1}\sqrt{1-\exp\left(-\frac{r_{j}^{2}}{128}\mathbb{E}_{\mathbb{P}_{B}}\tau_{k}\right)}
\displaystyle\leq 1exp(rj2128(Mj1)𝔼B[k=1Mj1τk])\displaystyle\;\sqrt{1-\exp\left(-\frac{r_{j}^{2}}{128(M_{j}-1)}\mathbb{E}_{\mathbb{P}_{B}}\left[\sum_{k=1}^{M_{j}-1}\tau_{k}\right]\right)} (27)
\displaystyle\leq 1exp(rj2Tj1128(Mj1))\displaystyle\;\sqrt{1-\exp\left(-\frac{r_{j}^{2}T_{j-1}}{128(M_{j}-1)}\right)} (28)
\displaystyle\leq 1exp(164B2)\displaystyle\;\sqrt{1-\exp\left(-\frac{1}{64B^{2}}\right)} (29)
\displaystyle\leq 18B,\displaystyle\;\frac{1}{8B},

where (27) uses Jensen’s inequality, (28) uses the fact that k=1Mj1τkTj1\sum_{k=1}^{M_{j}-1}\tau_{k}\leq T_{j-1}, and (29) uses (24). The last step follows from the elementary inequality 1exx1-e^{-x}\leq x.

Combining the above results, we obtain

|B(Aj)pj|1Mj1k=1Mj1|B(Aj)j,k(Aj)|18B.\displaystyle|\mathbb{P}_{B}(A_{j})-p_{j}|\leq\frac{1}{M_{j}-1}\sum_{k=1}^{M_{j}-1}|\mathbb{P}_{B}(A_{j})-\mathbb{P}_{j,k}(A_{j})|\leq\frac{1}{8B}.

Since j=1BB(Aj)B(j=1BAj)=1\sum_{j=1}^{B}\mathbb{P}_{B}\left(A_{j}\right)\geq\mathbb{P}_{B}\left(\cup_{j=1}^{B}A_{j}\right)=1, it holds that

j=1BpjB(AB)+j=1B1(B(Aj)18B)78.\sum_{j=1}^{B}p_{j}\geq\mathbb{P}_{B}(A_{B})+\sum_{j=1}^{B-1}\left(\mathbb{P}_{B}(A_{j})-\frac{1}{8B}\right)\geq\frac{7}{8}.\qed

Lemma 5 implies that there exists some jj such that pj78Bp_{j}\geq\frac{7}{8B}. Then, similarly to Theorem 7, we show that the worst-case regret of the policy in world j\mathcal{I}_{j} gives the lower bound we need.

Lemma 6.

For adaptive grid 𝒯\mathcal{T} and policy π\pi, if index jj satisfies pj78Bp_{j}\geq\frac{7}{8B}, then there exists a problem instance II such that

𝔼[RT(π)]1256B2T11d+21(1d+2)B.\displaystyle\mathbb{E}\left[R_{T}(\pi)\right]\geq\frac{1}{256B^{2}}T^{\frac{1-\frac{1}{d+2}}{1-\left(\frac{1}{d+2}\right)^{B}}}.
Proof.

Here we proceed with the case where jB1j\leq B-1. The case for j=Bj=B can be proved analogously.

For any 1kMj11\leq k\leq M_{j}-1, we construct a set of problem instances j,k=(Ij,k,l)1lMj\mathcal{I}_{j,k}=\left(I_{j,k,l}\right)_{1\leq l\leq M_{j}}. For lkl\neq k, the expected reward of Ij,k,lI_{j,k,l} is defined as

μj,k,l(x)={μj,k(x)+3rj16, if x=uj,l,μj,k(x), if x𝒰j and xuj,l,max{r12,maxu𝒰j{μj,k,l(u)d𝒜(x,u)}}, otherwise.\displaystyle\mu_{j,k,l}(x)=\begin{cases}\mu_{j,k}(x)+\frac{3r_{j}}{16},\text{ if }x=u_{j,l},\\ \mu_{j,k}(x),\text{ if }x\in\mathcal{U}_{j}\text{ and }x\neq u_{j,l},\\ \max\left\{\frac{r_{1}}{2},\max_{u\in\mathcal{U}_{j}}\left\{\mu_{j,k,l}(u)-d_{\mathcal{A}}(x,u)\right\}\right\},\text{ otherwise.}\end{cases}

where μj,k\mu_{j,k} is defined in (25). For l=kl=k, we let μj,k,k=μj,k\mu_{j,k,k}=\mu_{j,k}.

We define Cj,k=𝔹(uj,k,rj4)C_{j,k}=\mathbb{B}\left(u_{j,k},\frac{r_{j}}{4}\right), and our construction j,k\mathcal{I}_{j,k} has the following properties:

  1. 1.

    For any lkl\neq k, μj,k,l(x)=μj,k,k(x)\mu_{j,k,l}(x)=\mu_{j,k,k}(x) for any xCj,lx\notin C_{j,l};

  2. 2.

    For any lkl\neq k, μj,k,k(x)μj,k,l(x)μj,k,k(x)+3rj16\mu_{j,k,k}(x)\leq\mu_{j,k,l}(x)\leq\mu_{j,k,k}(x)+\frac{3r_{j}}{16} for any xCj,lx\in C_{j,l};

  3. 3.

    For any 1lMj1\leq l\leq M_{j}, under Ij,k,lI_{j,k,l}, pulling an arm that is not in Cj,lC_{j,l} incurs a regret at least rj16\frac{r_{j}}{16}.

Let xtx_{t} denote the choice of policy π\pi at time tt, and yty_{t} denote the reward. For tj1<ttjt_{j-1}<t\leq t_{j}, we define j,k,lt\mathbb{P}_{j,k,l}^{t} as the distribution of the sequence (x1,y1,,xtj1,ytj1)\left(x_{1},y_{1},\cdots,x_{t_{j-1}},y_{t_{j-1}}\right) under instance Ij,k,lI_{j,k,l} and policy π\pi. By an argument similar to (15), it holds that

supIj,k𝔼[RT(π)]rj16t=1T1Mjl=1Mjj,k,lt(xtCj,l).\displaystyle\sup_{I\in\mathcal{I}_{j,k}}\mathbb{E}\left[R_{T}(\pi)\right]\geq\frac{r_{j}}{16}\sum_{t=1}^{T}\frac{1}{M_{j}}\sum_{l=1}^{M_{j}}\mathbb{P}_{j,k,l}^{t}(x_{t}\notin C_{j,l}). (30)

From our construction, it is easy to see that Cj,k1Cj,k2=C_{j,k_{1}}\cap C_{j,k_{2}}=\varnothing for any k1k2k_{1}\neq k_{2}, so we can construct a test Ψ\Psi such that xtCj,kx_{t}\in C_{j,k} implies Ψ=k\Psi=k. By Lemma 12 with a star graph on [K][K] with center kk, we have

1Mjl=1Mjj,k,lt(xtCj,l)1Mjlkmin{dj,k,kt,dj,k,lt}.\displaystyle\frac{1}{M_{j}}\sum_{l=1}^{M_{j}}\mathbb{P}_{j,k,l}^{t}(x_{t}\notin C_{j,l})\geq\frac{1}{M_{j}}\sum_{l\neq k}\int\min\left\{d\mathbb{P}_{j,k,k}^{t},d\mathbb{P}_{j,k,l}^{t}\right\}. (31)

Combining (30) and (31) gives

supIj,k𝔼[RT(π)]\displaystyle\sup_{I\in\mathcal{I}_{j,k}}\mathbb{E}\left[R_{T}(\pi)\right]\geq rj16t=1T1Mjlkmin{dj,k,kt,dj,k,lt}\displaystyle\;\frac{r_{j}}{16}\sum_{t=1}^{T}\frac{1}{M_{j}}\sum_{l\neq k}\int\min\left\{d\mathbb{P}_{j,k,k}^{t},d\mathbb{P}_{j,k,l}^{t}\right\}
\displaystyle\geq rj16t=1Tj1Mjlkmin{dj,k,kt,dj,k,lt}\displaystyle\;\frac{r_{j}}{16}\sum_{t=1}^{T_{j}}\frac{1}{M_{j}}\sum_{l\neq k}\int\min\left\{d\mathbb{P}_{j,k,k}^{t},d\mathbb{P}_{j,k,l}^{t}\right\}
\displaystyle\geq rjTj161Mjlkmin{dj,k,kTj,dj,k,lTj}\displaystyle\;\frac{r_{j}T_{j}}{16}\cdot\frac{1}{M_{j}}\sum_{l\neq k}\int\min\left\{d\mathbb{P}_{j,k,k}^{T_{j}},d\mathbb{P}_{j,k,l}^{T_{j}}\right\} (32)
\displaystyle\geq rjTj161MjlkAjmin{dj,k,kTj,dj,k,lTj}\displaystyle\;\frac{r_{j}T_{j}}{16}\cdot\frac{1}{M_{j}}\sum_{l\neq k}\int_{A_{j}}\min\left\{d\mathbb{P}_{j,k,k}^{T_{j}},d\mathbb{P}_{j,k,l}^{T_{j}}\right\} (33)
\displaystyle\geq rjTj161MjlkAjmin{dj,k,kTj1,dj,k,lTj1},\displaystyle\;\frac{r_{j}T_{j}}{16}\cdot\frac{1}{M_{j}}\sum_{l\neq k}\int_{A_{j}}\min\left\{d\mathbb{P}_{j,k,k}^{T_{j-1}},d\mathbb{P}_{j,k,l}^{T_{j-1}}\right\}, (34)

where (32) follows from the data processing inequality for total variation and the identity min{dP,dQ}=1TV(P,Q)\int\min\left\{dP,dQ\right\}=1-TV(P,Q), (33) restricts the integration to the event AjA_{j}, and (34) holds because the observations at time TjT_{j} are the same as those at time Tj1T_{j-1} under event AjA_{j}.

For the term Ajmin{dj,k,kTj1,dj,k,lTj1}\int_{A_{j}}\min\left\{d\mathbb{P}_{j,k,k}^{T_{j-1}},d\mathbb{P}_{j,k,l}^{T_{j-1}}\right\}, it holds that

Ajmin{dj,k,kTj1,dj,k,lTj1}=\displaystyle\int_{A_{j}}\min\left\{d\mathbb{P}_{j,k,k}^{T_{j-1}},d\mathbb{P}_{j,k,l}^{T_{j-1}}\right\}= Ajdj,k,kTj1+dj,k,lTj1|dj,k,kTj1dj,k,lTj1|2\displaystyle\;\int_{A_{j}}\frac{d\mathbb{P}_{j,k,k}^{T_{j-1}}+d\mathbb{P}_{j,k,l}^{T_{j-1}}-\left|d\mathbb{P}_{j,k,k}^{T_{j-1}}-d\mathbb{P}_{j,k,l}^{T_{j-1}}\right|}{2}
=\displaystyle= j,k,kTj1(Aj)+j,k,lTj1(Aj)212Aj|dj,k,kTj1dj,k,lTj1|\displaystyle\;\frac{\mathbb{P}_{j,k,k}^{T_{j-1}}(A_{j})+\mathbb{P}_{j,k,l}^{T_{j-1}}(A_{j})}{2}-\frac{1}{2}\int_{A_{j}}\left|d\mathbb{P}_{j,k,k}^{T_{j-1}}-d\mathbb{P}_{j,k,l}^{T_{j-1}}\right|
\displaystyle\geq (j,k,kTj1(Aj)12TV(j,k,kTj1,j,k,lTj1))TV(j,k,kTj1,j,k,lTj1)\displaystyle\;\left(\mathbb{P}_{j,k,k}^{T_{j-1}}(A_{j})-\frac{1}{2}TV\left(\mathbb{P}_{j,k,k}^{T_{j-1}},\mathbb{P}_{j,k,l}^{T_{j-1}}\right)\right)-TV\left(\mathbb{P}_{j,k,k}^{T_{j-1}},\mathbb{P}_{j,k,l}^{T_{j-1}}\right) (35)
=\displaystyle= j,k(Aj)32TV(j,k,kTj1,j,k,lTj1),\displaystyle\;\mathbb{P}_{j,k}(A_{j})-\frac{3}{2}TV\left(\mathbb{P}_{j,k,k}^{T_{j-1}},\mathbb{P}_{j,k,l}^{T_{j-1}}\right), (36)

where (35) uses the inequality |(A)(A)|TV(,)|\mathbb{P}(A)-\mathbb{Q}(A)|\leq TV(\mathbb{P},\mathbb{Q}), and (36) holds because Ij,k=Ij,k,kI_{j,k}=I_{j,k,k} and AjA_{j} can be determined by the observations up to Tj1T_{j-1}.

We use an argument similar to (21) to get

DKL(j,k,kTj1j,k,lTj1)12(3rj16)2𝔼j,kτlrj232𝔼j,kτl,\displaystyle D_{KL}\left(\mathbb{P}_{j,k,k}^{T_{j-1}}\|\mathbb{P}_{j,k,l}^{T_{j-1}}\right)\leq\frac{1}{2}\cdot\left(\frac{3r_{j}}{16}\right)^{2}\mathbb{E}_{\mathbb{P}_{j,k}}\tau_{l}\leq\frac{r_{j}^{2}}{32}\mathbb{E}_{\mathbb{P}_{j,k}}\tau_{l},

where τl\tau_{l} denotes the number of pulls in Cj,lC_{j,l} before the batch containing Tj1T_{j-1}. Then from Lemma 11, we have

1MjlkTV(j,k,kTj1,j,k,lTj1)\displaystyle\frac{1}{M_{j}}\sum_{l\neq k}TV\left(\mathbb{P}_{j,k,k}^{T_{j-1}},\mathbb{P}_{j,k,l}^{T_{j-1}}\right)\leq 1Mjlk1exp(DKL(j,k,kTj1j,k,lTj1))\displaystyle\;\frac{1}{M_{j}}\sum_{l\neq k}\sqrt{1-\exp\left(-D_{KL}\left(\mathbb{P}_{j,k,k}^{T_{j-1}}\|\mathbb{P}_{j,k,l}^{T_{j-1}}\right)\right)}
\displaystyle\leq 1Mjlk1exp(rj232𝔼j,kτl)\displaystyle\;\frac{1}{M_{j}}\sum_{l\neq k}\sqrt{1-\exp\left(-\frac{r_{j}^{2}}{32}\mathbb{E}_{\mathbb{P}_{j,k}}\tau_{l}\right)}
\displaystyle\leq Mj1Mj1exp(rj232(Mj1)lk𝔼j,kτl)\displaystyle\;\frac{M_{j}-1}{M_{j}}\sqrt{1-\exp\left(-\frac{r_{j}^{2}}{32(M_{j}-1)}\sum_{l\neq k}\mathbb{E}_{\mathbb{P}_{j,k}}\tau_{l}\right)}
\displaystyle\leq Mj1Mj1exp(rj2Tj132(Mj1))\displaystyle\;\frac{M_{j}-1}{M_{j}}\sqrt{1-\exp\left(-\frac{r_{j}^{2}T_{j-1}}{32(M_{j}-1)}\right)}
\displaystyle\leq Mj1Mj1exp(Mj32(Mj1)B2)\displaystyle\;\frac{M_{j}-1}{M_{j}}\sqrt{1-\exp\left(-\frac{M_{j}}{32(M_{j}-1)B^{2}}\right)} (37)
\displaystyle\leq Mj1MjMj32(Mj1)B2\displaystyle\;\frac{M_{j}-1}{M_{j}}\sqrt{\frac{M_{j}}{32(M_{j}-1)B^{2}}}
\displaystyle\leq 14B,\displaystyle\;\frac{1}{4B}, (38)

where (37) uses (24), and the last two steps use the elementary inequality 1exx1-e^{-x}\leq x together with Mj1Mj1\frac{M_{j}-1}{M_{j}}\leq 1.

Combining (34), (36) and (38) yields that

supIj,k𝔼[RT(π)]116rjTj(j,k(Aj)238B)116BT1ε1εB(j,k(Aj)238B),\displaystyle\sup_{I\in\mathcal{I}_{j,k}}\mathbb{E}\left[R_{T}(\pi)\right]\geq\frac{1}{16}r_{j}T_{j}\left(\frac{\mathbb{P}_{j,k}(A_{j})}{2}-\frac{3}{8B}\right)\geq\frac{1}{16B}T^{\frac{1-\varepsilon}{1-\varepsilon^{B}}}\left(\frac{\mathbb{P}_{j,k}(A_{j})}{2}-\frac{3}{8B}\right),

where ε=1d+2\varepsilon=\frac{1}{d+2}. This inequality holds for any kMj1k\leq M_{j}-1. Averaging over kk yields

supIkMj1j,k𝔼[RT(π)]\displaystyle\sup_{I\in\cup_{k\leq M_{j}-1}\mathcal{I}_{j,k}}\mathbb{E}\left[R_{T}(\pi)\right]\geq 116BT1ε1εB(12(Mj1)k=1Mj1j,k(Aj)38B)\displaystyle\;\frac{1}{16B}T^{\frac{1-\varepsilon}{1-\varepsilon^{B}}}\left(\frac{1}{2(M_{j}-1)}\sum_{k=1}^{M_{j}-1}\mathbb{P}_{j,k}(A_{j})-\frac{3}{8B}\right)
\displaystyle\geq 116BT1ε1εB(716B38B)\displaystyle\;\frac{1}{16B}T^{\frac{1-\varepsilon}{1-\varepsilon^{B}}}\left(\frac{7}{16B}-\frac{3}{8B}\right)
\displaystyle\geq 1256B2T1ε1εB,\displaystyle\;\frac{1}{256B^{2}}T^{\frac{1-\varepsilon}{1-\varepsilon^{B}}},

where the second inequality follows from pj78Bp_{j}\geq\frac{7}{8B}. This completes the proof of Lemma 6. ∎

Finally, combining the above two lemmas, we arrive at the lower bound in Theorem 8. ∎

4.2 Communication Lower Bound for Lipschitz Bandits with Batched Feedback

Based on the constructions for the full-dimension case, we are now ready to present the theoretical lower bound for batched Lipschitz bandits with dzdd_{z}\leq d. For ease of exposition, we start with the result for the static grid case. In this section we provide lower bounds for integer-valued dzd_{z}; in general, the zooming dimension is not necessarily an integer.

Theorem 9.

Consider Lipschitz bandit problems with time horizon TT, ambient dimension dd and zooming dimension dzdd_{z}\leq d such that the grid of reward communication 𝒯\mathcal{T} is static and determined before the game. If BB rounds of communications are allowed, then for any policy π\pi, there exists a problem instance with zooming dimension dzd_{z} such that

𝔼[RT(π)]164e116T11dz+21(1dz+2)B.\mathbb{E}[R_{T}(\pi)]\geq\frac{1}{64e^{\frac{1}{16}}}\cdot T^{\frac{1-\frac{1}{d_{z}+2}}{1-\left(\frac{1}{d_{z}+2}\right)^{B}}}.

The proof technique of Theorem 9 is similar to that of Theorem 7, with the main difference being the construction of hard instances. To build instances satisfying the statement in Theorem 9, we first construct reward functions in a dzd_{z}-dimensional subspace according to the argument in Theorem 7, and then use a “linear-decaying extension” technique to extend them to the whole dd-dimensional space. Below, we present the detailed construction of the hard instances.

To begin with, we introduce some settings and notation. For a given dd, we consider the whole space 𝒜=[0,1]d\mathcal{A}=[0,1]^{d} with d𝒜d_{\mathcal{A}} being the metric induced by \|\cdot\|_{\infty}. 𝒜\mathcal{A} can be represented by the Cartesian product [0,1]d=[0,1]dz×[0,1]ddz[0,1]^{d}=[0,1]^{d_{z}}\times[0,1]^{d-d_{z}}. Therefore, each element in 𝒜\mathcal{A} can be represented as (α,β)(\alpha,\beta), the concatenation of α[0,1]dz\alpha\in[0,1]^{d_{z}} and β[0,1]ddz\beta\in[0,1]^{d-d_{z}}. Additionally, we use 𝒜dz\mathcal{A}_{d_{z}} to denote the dzd_{z}-dimensional subspace {(α,0ddz)|α[0,1]dz}\{(\alpha,0_{d-d_{z}})|\alpha\in[0,1]^{d_{z}}\}, where 0ddz0_{d-d_{z}} is the zero vector in [0,1]ddz[0,1]^{d-d_{z}}.

For any fixed kk, let rk=1tk11dz+2r_{k}=\frac{1}{t_{k-1}^{\frac{1}{d_{z}+2}}} and Mk=tk1rk2=1rkdzM_{k}=t_{k-1}r_{k}^{2}=\frac{1}{r_{k}^{d_{z}}}. We can find a set of arms 𝒰k={uk,1,,uk,Mk}𝒜dz\mathcal{U}_{k}=\{u_{k,1},\cdots,u_{k,M_{k}}\}\subset\mathcal{A}_{d_{z}} such that d𝒜(uk,i,uk,j)rkd_{\mathcal{A}}(u_{k,i},u_{k,j})\geq r_{k} for any iji\neq j. As stated above, we first construct a set of expected reward functions {νk,1,,νk,Mk}\{\nu_{k,1},\cdots,\nu_{k,M_{k}}\} on 𝒜dz\mathcal{A}_{d_{z}}, which are the same as (13) and (14) in the proof of Theorem 7. We define νk,1\nu_{k,1} as

νk,1(z)={34rk,ifz=uk,1,58rk,ifz=uk,j,j1,max{rk2,maxu𝒰k{νk,1(u)d𝒜(z,u)}},ifz𝒜dz𝒰k.\displaystyle\nu_{k,1}(z)=\begin{cases}\frac{3}{4}r_{k},\;\text{if}\;z=u_{k,1},\\ \frac{5}{8}r_{k},\;\text{if}\;z=u_{k,j},\;j\neq 1,\\ \max\left\{\frac{r_{k}}{2},\max_{u\in\mathcal{U}_{k}}\left\{\nu_{k,1}(u)-d_{\mathcal{A}}(z,u)\right\}\right\},\\ \quad\text{if}\;z\in\mathcal{A}_{d_{z}}\setminus\mathcal{U}_{k}.\end{cases} (39)

For 2iMk2\leq i\leq M_{k}, νk,i\nu_{k,i} is defined as

νk,i(z)={34rk,ifz=uk,1,78rk,ifz=uk,i,58rk,ifz=uk,j,j1andji,max{rk2,maxu𝒰k{νk,i(u)d𝒜(z,u)}},ifz𝒜dz𝒰k.\displaystyle\nu_{k,i}(z)=\begin{cases}\frac{3}{4}r_{k},\;\text{if}\;z=u_{k,1},\\ \frac{7}{8}r_{k},\;\text{if}\;z=u_{k,i},\\ \frac{5}{8}r_{k},\;\text{if}\;z=u_{k,j},\;j\neq 1\;\text{and}\;j\neq i,\\ \max\left\{\frac{r_{k}}{2},\max_{u\in\mathcal{U}_{k}}\left\{\nu_{k,i}(u)-d_{\mathcal{A}}(z,u)\right\}\right\},\\ \quad\text{if}\;z\in\mathcal{A}_{d_{z}}\setminus\mathcal{U}_{k}.\end{cases} (40)

Based on {νk,i}i=1Mk\{\nu_{k,i}\}_{i=1}^{M_{k}}, we define a set of problem instances k={Ik,1,,Ik,Mk}\mathcal{I}_{k}=\{I_{k,1},\cdots,I_{k,M_{k}}\}. For each 1iMk1\leq i\leq M_{k}, the expected reward for Ik,iI_{k,i} is defined as

μk,i((α,β))=12min{νk,i((α,0ddz)),78rkβ},\displaystyle\mu_{k,i}((\alpha,\beta))=\frac{1}{2}\cdot\min\left\{\nu_{k,i}((\alpha,0_{d-d_{z}})),\frac{7}{8}r_{k}-\|\beta\|_{\infty}\right\}, (41)

where (α,β)(\alpha,\beta) is the concatenation of α[0,1]dz\alpha\in[0,1]^{d_{z}} and β[0,1]ddz\beta\in[0,1]^{d-d_{z}}. Note that νk,i((α,0ddz))\nu_{k,i}((\alpha,0_{d-d_{z}})) is well-defined since (α,0ddz)𝒜dz(\alpha,0_{d-d_{z}})\in\mathcal{A}_{d_{z}}.

For all arm pulls in all problem instances, a Gaussian noise sampled from 𝒩(0,1)\mathcal{N}(0,1) is added to the observed reward. This noise corruption is independent of all other randomness.
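To make the construction concrete, the following Python sketch implements the needle functions (39)–(40) on the subspace and the linear-decaying extension (41). The peak locations, the value of r_k, and the dimensions below are illustrative choices, not the exact packing used in the proof.

import numpy as np

d, d_z, r_k = 4, 2, 0.25
# illustrative peak locations in the d_z-dimensional subspace, pairwise l_inf distance >= r_k
U_hat = np.array([[0.1, 0.1], [0.6, 0.1], [0.1, 0.6], [0.6, 0.6]])

def nu(alpha, i):
    # needle function on A_{d_z}: i = 0 plays the role of nu_{k,1} in (39),
    # and i >= 1 plays the role of nu_{k,i+1} in (40); the function decays
    # 1-Lipschitz (in l_inf) from the peaks down to the floor r_k / 2
    heights = np.full(len(U_hat), 5 / 8 * r_k)
    heights[0] = 3 / 4 * r_k
    if i != 0:
        heights[i] = 7 / 8 * r_k
    dists = np.max(np.abs(U_hat - np.asarray(alpha)), axis=1)   # l_inf distances to the peaks
    return max(r_k / 2, float(np.max(heights - dists)))

def mu(x, i):
    # linear-decaying extension (41): x = (alpha, beta) with alpha in [0,1]^{d_z}
    alpha, beta = np.asarray(x[:d_z]), np.asarray(x[d_z:])
    return 0.5 * min(nu(alpha, i), 7 / 8 * r_k - float(np.max(np.abs(beta))))

print(mu([0.6, 0.1, 0.0, 0.05], i=1))   # an arm slightly off the subspace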

Now we show that for each 1kB1\leq k\leq B and 1iMk1\leq i\leq M_{k}, μk,i\mu_{k,i} is 11-Lipschitz and the zooming dimension equals dzd_{z}. Firstly, for any (α1,β1)(\alpha_{1},\beta_{1}) and (α2,β2)(\alpha_{2},\beta_{2}), we have

μk,i((α1,β1))μk,i((α2,β2))\displaystyle\mu_{k,i}((\alpha_{1},\beta_{1}))-\mu_{k,i}((\alpha_{2},\beta_{2}))\leq μk,i((α1,β1))μk,i((α1,β2))+μk,i((α1,β2))μk,i((α2,β2))\displaystyle\mu_{k,i}((\alpha_{1},\beta_{1}))-\mu_{k,i}((\alpha_{1},\beta_{2}))+\mu_{k,i}((\alpha_{1},\beta_{2}))-\mu_{k,i}((\alpha_{2},\beta_{2}))
\displaystyle\leq 12β1β2+12(νk,i((α1,0ddz))νk,i((α2,0ddz)))\displaystyle\frac{1}{2}\|\beta_{1}-\beta_{2}\|_{\infty}+\frac{1}{2}\big{(}\nu_{k,i}((\alpha_{1},0_{d-d_{z}}))-\nu_{k,i}((\alpha_{2},0_{d-d_{z}}))\big{)}
\displaystyle\leq 12(β1β2+α1α2)\displaystyle\frac{1}{2}(\|\beta_{1}-\beta_{2}\|_{\infty}+\|\alpha_{1}-\alpha_{2}\|_{\infty})
\displaystyle\leq (α1α2,β1β2).\displaystyle\|(\alpha_{1}-\alpha_{2},\beta_{1}-\beta_{2})\|_{\infty}.

Therefore, μk,i\mu_{k,i} is 11-Lipschitz. Secondly, for any rrkr\geq r_{k}, (41) yields that S(16r)=[0,1]dz×[0,32r]ddzS(16r)=[0,1]^{d_{z}}\times[0,32r]^{d-d_{z}}. As a consequence, we have Nr=32ddzrdzN_{r}=32^{d-d_{z}}r^{-d_{z}}, and the zooming dimension equals dzd_{z}.

After presenting the new constructions, we show that an argument similar to the full-dimension case gives the lower bound we need. The remaining proof of Theorem 9 is deferred to Appendix E.

Finally, we combine all techniques in the above analysis to obtain the lower bound for general dzd_{z} and adaptive grids, and thus prove Theorem 3.

Theorem 10.

Consider Lipschitz bandit problems with time horizon TT, ambient dimension dd and zooming dimension dzdd_{z}\leq d such that the grid of reward communication 𝒯\mathcal{T} is adaptively determined by the player. If BB rounds of communications are allowed, then for any policy π\pi, there exists a problem instance with zooming dimension dzd_{z} such that

𝔼[RT(π)]1512B2T11dz+21(1dz+2)B.\displaystyle\mathbb{E}\left[R_{T}(\pi)\right]\geq\frac{1}{512B^{2}}T^{\frac{1-\frac{1}{d_{z}+2}}{1-\left(\frac{1}{d_{z}+2}\right)^{B}}}.

The proof of Theorem 10 is deferred to Appendix F.

5 Experiments

In this section, we present numerical studies of A-BLiN. In the experiments, we use the arm space 𝒜=[0,1]2\mathcal{A}=[0,1]^{2} and the expected reward function μ(x)=112xx12310xx22\mu(x)=1-\frac{1}{2}\|x-x_{1}\|_{2}-\frac{3}{10}\|x-x_{2}\|_{2}, where x1=(0.8, 0.7)x_{1}=(0.8,\;0.7) and x2=(0.1, 0.1)x_{2}=(0.1,\;0.1). The landscape of μ\mu and the resulting partition are shown in Figure 2a. As can be seen, the partition is finer in the area closer to the optimal arm x=(0.8, 0.7)x^{*}=(0.8,\;0.7).

Figure 2: Resulting partition and regret of A-BLiN. (a) Partition: the resulting partition produced by A-BLiN; the background color denotes the true value of the expected reward μ\mu, and blue means high values. The figure shows that the partition is finer for larger values of μ\mu. (b) Regret: accumulated regret of A-BLiN and the zooming algorithm [28]; different background colors represent different batches of A-BLiN. For the total time horizon T=80000T=80000, A-BLiN needs 44 rounds of communications.

We let the time horizon T=80000T=80000, and report the accumulated regret in Figure 2b. The regret curve is sublinear, which agrees with the regret bound (2). In addition, different background colors in Figure 2b represent different batches. For the total time horizon T=80000T=80000, A-BLiN needs only 44 rounds of communications. We also present the regret curve of the zooming algorithm [28] for comparison. Unlike the zooming algorithm, the regret curve of A-BLiN is approximately piecewise linear, because the strategy of BLiN does not change within each batch. Results of more repeated experiments, as well as experimental results of D-BLiN, are in Appendix G. Our code is available at https://github.com/FengYasong-fifol/Batched-Lipschitz-Narrowing.
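For reference, the following is a minimal sketch of the experimental setup described above (the full implementation is in the linked repository); the noise scale in the sketch is an assumption for illustration only.

import numpy as np

x1 = np.array([0.8, 0.7])
x2 = np.array([0.1, 0.1])

def mu(x):
    # expected reward on A = [0,1]^2 used in the experiment
    x = np.asarray(x, dtype=float)
    return 1.0 - 0.5 * np.linalg.norm(x - x1) - 0.3 * np.linalg.norm(x - x2)

rng = np.random.default_rng(0)

def pull(x, noise_std=0.1):
    # one noisy reward observation at arm x (noise model assumed for illustration)
    return mu(x) + rng.normal(0.0, noise_std)

print(mu([0.8, 0.7]))    # value at the optimal arm x* = (0.8, 0.7)
print(pull([0.5, 0.5]))  # a noisy observation at an arbitrary arm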

6 Conclusion

In this paper, we study Lipschitz bandits with communication constraints, and propose the BLiN algorithm as a solution. We prove that BLiN only needs 𝒪(loglogT)\mathcal{O}\left(\log\log T\right) rounds of communications to achieve the optimal regret rate of the best previous Lipschitz bandit algorithms [28, 14], which need TT batches. This improvement in the number of batches significantly reduces data communication costs. We also provide a complexity analysis for this problem. We show that Ω(loglogT)\Omega(\log\log T) rounds of communications are necessary for any algorithm to optimally solve Lipschitz bandit problems. Hence, the BLiN algorithm is optimal.

References

  • [1] Arpit Agarwal, Shivani Agarwal, Sepehr Assadi, and Sanjeev Khanna. Learning with limited rounds of adaptivity: coin tossing, multi-armed bandits, and ranking from pairwise comparisons. In Conference on Learning Theory, pages 39–75. PMLR, 2017.
  • [2] Arpit Agarwal, Rohan Ghuge, and Viswanath Nagarajan. Batched dueling bandits. In International Conference on Machine Learning, pages 89–110. PMLR, 2022.
  • [3] Rajeev Agrawal. The continuum-armed bandit problem. SIAM Journal on Control and Optimization, 33(6):1926–1951, 1995.
  • [4] Rajeev Agrawal. Sample mean based index policies with 𝒪(logn)\mathcal{O}(\log n) regret for the multi-armed bandit problem. Advances in Applied Probability, 27(4):1054–1078, 1995.
  • [5] Noga Alon, Yossi Matias, and Mario Szegedy. The space complexity of approximating the frequency moments. Journal of Computer and System Sciences, 58(1):137–147, 1999.
  • [6] Patrice Assouad. Plongements Lipschitziens dans n\mathbb{R}^{n}. Bulletin de la Société Mathématique de France, 111:429–448, 1983.
  • [7] Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine learning, 47(2):235–256, 2002.
  • [8] Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. The nonstochastic multiarmed bandit problem. SIAM journal on computing, 32(1):48–77, 2002.
  • [9] Peter Auer, Ronald Ortner, and Csaba Szepesvári. Improved rates for the stochastic continuum-armed bandit problem. In Conference on Computational Learning Theory, pages 454–468. Springer, 2007.
  • [10] Dimitris Bertsimas and Adam J Mersereau. A learning approach for interactive marketing to a customer segment. Operations Research, 55(6):1120–1135, 2007.
  • [11] Peter J. Bickel. On some robust estimates of location. The Annals of Mathematical Statistics, pages 847–858, 1965.
  • [12] Jean Bretagnolle and Catherine Huber. Estimation des densités: risque minimax. Séminaire de probabilités de Strasbourg, 12:342–363, 1978.
  • [13] Sébastien Bubeck, Nicolo Cesa-Bianchi, and Gábor Lugosi. Bandits with heavy tail. IEEE Transactions on Information Theory, 59(11):7711–7717, 2013.
  • [14] Sébastien Bubeck, Rémi Munos, Gilles Stoltz, and Csaba Szepesvári. Online optimization in 𝒳\mathcal{X}-armed bandits. Advances in Neural Information Processing Systems, 22:201–208, 2009.
  • [15] Sébastien Bubeck, Rémi Munos, Gilles Stoltz, and Csaba Szepesvári. 𝒳\mathcal{X}-armed bandits. Journal of Machine Learning Research, 12(5):1655–1695, 2011.
  • [16] Sébastien Bubeck, Gilles Stoltz, and Jia Yuan Yu. Lipschitz bandits without the Lipschitz constant. In International Conference on Algorithmic Learning Theory, pages 144–158. Springer, 2011.
  • [17] Nicolo Cesa-Bianchi, Ofer Dekel, and Ohad Shamir. Online learning with switching costs and other adaptive adversaries. Advances in Neural Information Processing Systems, 26:1160–1168, 2013.
  • [18] Eric W. Cope. Regret and convergence bounds for a class of continuum-armed bandit problems. IEEE Transactions on Automatic Control, 54(6):1243–1253, 2009.
  • [19] Hossein Esfandiari, Amin Karbasi, Abbas Mehrabian, and Vahab Mirrokni. Regret bounds for batched bandits. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 7340–7348, 2021.
  • [20] Eyal Even-Dar, Shie Mannor, Yishay Mansour, and Sridhar Mahadevan. Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. Journal of machine learning research, 7(6):1079–1105, 2006.
  • [21] Yasong Feng, Weijian Luo, Yimin Huang, and Tianyu Wang. A lipschitz bandits approach for continuous hyperparameter optimization. arXiv preprint arXiv:2302.01539, 2023.
  • [22] Zijun Gao, Yanjun Han, Zhimei Ren, and Zhengqing Zhou. Batched multi-armed bandits problem. Advances in Neural Information Processing Systems, 32:503–513, 2019.
  • [23] Chuying Han, Yasong Feng, and Tianyu Wang. From random search to bandit learning in metric measure spaces. arXiv preprint arXiv:2305.11509, 2023.
  • [24] Yanjun Han, Zhengqing Zhou, Zhengyuan Zhou, Jose Blanchet, Peter W Glynn, and Yinyu Ye. Sequential batch learning in finite-action linear contextual bandits. arXiv preprint arXiv:2004.06321, 2020.
  • [25] Kwang-Sung Jun, Kevin Jamieson, Robert Nowak, and Xiaojin Zhu. Top arm identification in multi-armed bandits with batch arm pulls. In Artificial Intelligence and Statistics, pages 139–148. PMLR, 2016.
  • [26] Nikolai Karpov, Qin Zhang, and Yuan Zhou. Collaborative top distribution identifications with limited interaction. In 2020 IEEE 61st Annual Symposium on Foundations of Computer Science (FOCS), pages 160–171. IEEE, 2020.
  • [27] Robert Kleinberg. Nearly tight bounds for the continuum-armed bandit problem. Advances in Neural Information Processing Systems, 18:697–704, 2005.
  • [28] Robert Kleinberg, Aleksandrs Slivkins, and Eli Upfal. Multi-armed bandits in metric spaces. In Proceedings of the fortieth annual ACM symposium on Theory of computing, pages 681–690, 2008.
  • [29] Akshay Krishnamurthy, John Langford, Aleksandrs Slivkins, and Chicheng Zhang. Contextual bandits with continuous actions: Smoothing, zooming, and adapting. The Journal of Machine Learning Research, 21(1):5402–5446, 2020.
  • [30] Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, 1985.
  • [31] Zihan Li and Jonathan Scarlett. Gaussian process bandit optimization with few batches. In International Conference on Artificial Intelligence and Statistics, pages 92–107. PMLR, 2022.
  • [32] Shiyin Lu, Guanghui Wang, Yao Hu, and Lijun Zhang. Optimal algorithms for Lipschitz bandits with heavy-tailed rewards. In International Conference on Machine Learning, pages 4154–4163, 2019.
  • [33] Stefan Magureanu, Richard Combes, and Alexandre Proutiere. Lipschitz bandits: Regret lower bound and optimal algorithms. In Conference on Learning Theory, pages 975–999. PMLR, 2014.
  • [34] Maryam Majzoubi, Chicheng Zhang, Rajan Chari, Akshay Krishnamurthy, John Langford, and Aleksandrs Slivkins. Efficient contextual bandits with continuous actions. Advances in Neural Information Processing Systems, 33:349–360, 2020.
  • [35] Vianney Perchet and Philippe Rigollet. The multi-armed bandit problem with covariates. The Annals of Statistics, 41(2):693–721, 2013.
  • [36] Vianney Perchet, Philippe Rigollet, Sylvain Chassang, and Erik Snowberg. Batched bandit problems. The Annals of Statistics, 44(2):660–681, 2016.
  • [37] Stuart J. Pocock. Group sequential methods in the design and analysis of clinical trials. Biometrika, 64(2):191–199, 1977.
  • [38] Yufei Ruan, Jiaqi Yang, and Yuan Zhou. Linear bandits with limited adaptivity and learning distributional optimal design. In Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing, pages 74–87, 2021.
  • [39] Sudeep Salgia, Sattar Vakili, and Qing Zhao. A domain-shrinking based bayesian optimization algorithm with order-optimal regret performance. In Advances in Neural Information Processing Systems, volume 34, pages 28836–28847, 2021.
  • [40] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis. Mastering the game of go with deep neural networks and tree search. nature, 529(7587):484–489, 2016.
  • [41] Sean Sinclair, Tianyu Wang, Gauri Jain, Siddhartha Banerjee, and Christina Yu. Adaptive discretization for model-based reinforcement learning. Advances in Neural Information Processing Systems, 33:3858–3871, 2020.
  • [42] Aleksandrs Slivkins. Contextual bandits with similarity information. Journal of Machine Learning Research, 15(1):2533–2568, 2014.
  • [43] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT press, 2018.
  • [44] Chao Tao, Qin Zhang, and Yuan Zhou. Collaborative learning with limited interaction: tight bounds for distributed exploration in multi-armed bandits. In 2019 IEEE 60th Annual Symposium on Foundations of Computer Science (FOCS), pages 126–146. IEEE, 2019.
  • [45] William R Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.
  • [46] Tianyu Wang and Cynthia Rudin. Bandits for BMO functions. In International Conference on Machine Learning, pages 9996–10006. PMLR, 2020.
  • [47] Tianyu Wang, Weicheng Ye, Dawei Geng, and Cynthia Rudin. Towards practical lipschitz bandits. In Proceedings of the 2020 ACM-IMS on Foundations of Data Science Conference, FODS ’20, page 129–138, New York, NY, USA, 2020. Association for Computing Machinery.
  • [48] Nirandika Wanigasekara and Christina Yu. Nonparametric contextual bandits in an unknown metric space. In Advances in Neural Information Processing Systems, volume 32, pages 14684–14694, 2019.
  • [49] Kelly Zhang, Lucas Janson, and Susan Murphy. Inference for batched bandits. Advances in Neural Information Processing Systems, 33:9818–9829, 2020.

Appendix A Proof of Corollary 1

Corollary 1. For Lipschitz bandit problems with ambient dimension dd, zooming dimension dzdd_{z}\leq d and time horizon TT, any algorithm needs Ω(loglogT)\Omega(\log\log T) rounds of communications to achieve the optimal regret rate 𝒪(Tdz+1dz+2)\mathcal{O}\left(T^{\frac{d_{z}+1}{d_{z}+2}}\right).

Proof.

From Theorem 3, the expected regret is lower bounded by

𝔼[RT(π)]1512B2T11dz+21(1dz+2)B.\displaystyle\mathbb{E}\left[R_{T}(\pi)\right]\geq\frac{1}{512B^{2}}T^{\frac{1-\frac{1}{d_{z}+2}}{1-\left(\frac{1}{d_{z}+2}\right)^{B}}}.

Here we seek the minimum BB such that

1512B2T11dz+21(1dz+2)BTdz+1dz+2C\displaystyle\frac{\frac{1}{512B^{2}}T^{\frac{1-\frac{1}{d_{z}+2}}{1-\left(\frac{1}{d_{z}+2}\right)^{B}}}}{T^{\frac{d_{z}+1}{d_{z}+2}}}\leq C (42)

for some constant CC.

Calculation shows that

1512B2T11dz+21(1dz+2)BTdz+1dz+2=1512B2(Tdz+1dz+2)1(dz+2)B1.\displaystyle\frac{\frac{1}{512B^{2}}T^{\frac{1-\frac{1}{d_{z}+2}}{1-\left(\frac{1}{d_{z}+2}\right)^{B}}}}{T^{\frac{d_{z}+1}{d_{z}+2}}}=\frac{1}{512B^{2}}\left(T^{\frac{d_{z}+1}{d_{z}+2}}\right)^{\frac{1}{(d_{z}+2)^{B}-1}}. (43)

Substituting (43) into (42) and taking logarithms on both sides yields

dz+1dz+2logT(dz+2)B1log(512B2C)\displaystyle\frac{d_{z}+1}{d_{z}+2}\cdot\frac{\log T}{(d_{z}+2)^{B}-1}\leq\log(512B^{2}C)

and

(dz+2)Bdz+1(dz+2)log(512B2C)logT+1.\displaystyle(d_{z}+2)^{B}\geq\frac{d_{z}+1}{(d_{z}+2)\log(512B^{2}C)}\cdot\log T+1.

Taking log on both sides again yields that

Blog[(dz+1(dz+2)log(512B2C))logT+1]log(dz+2).B\geq\frac{\log\left[\left(\frac{d_{z}+1}{(d_{z}+2)\log(512B^{2}C)}\right)\log T+1\right]}{\log(d_{z}+2)}. (44)

We use BminB_{\min} to denote the minimum BB such that inequality (44) holds. Calculation shows that (44) holds for

B=Blog[(dz+1(dz+2)log(512C))logT+1]log(dz+2),B=B_{*}\triangleq\frac{\log\left[\left(\frac{d_{z}+1}{(d_{z}+2)\log(512C)}\right)\log T+1\right]}{\log(d_{z}+2)},

so we have BminBB_{\min}\leq B_{*}. Then since the RHS of (44) decreases with BB, we have

Bminlog[(dz+1(dz+2)log(512Bmin2C))logT+1]log(dz+2)log[(dz+1(dz+2)log(512B2C))logT+1]log(dz+2).\displaystyle B_{\min}\geq\frac{\log\left[\left(\frac{d_{z}+1}{(d_{z}+2)\log(512B_{\min}^{2}C)}\right)\log T+1\right]}{\log(d_{z}+2)}\geq\frac{\log\left[\left(\frac{d_{z}+1}{(d_{z}+2)\log(512B_{*}^{2}C)}\right)\log T+1\right]}{\log(d_{z}+2)}.

Therefore, Ω(loglogT)\Omega(\log\log T) rounds of communications are necessary for any algorithm to optimally solve Lipschitz bandit problems. ∎
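To illustrate the growth rate, the following Python snippet computes the smallest BB satisfying (44) for a few horizons; the values of d_z and C below are example choices.

import math

def B_min(T, d_z, C=1.0):
    # smallest B with B >= log[(d_z+1) / ((d_z+2) log(512 B^2 C)) * log T + 1] / log(d_z+2);
    # the right-hand side decreases in B, so a linear scan suffices
    B = 1
    while B < math.log((d_z + 1) / ((d_z + 2) * math.log(512 * B * B * C)) * math.log(T) + 1) \
            / math.log(d_z + 2):
        B += 1
    return B

for T in [10 ** 4, 10 ** 8, 10 ** 16, 10 ** 32]:
    print(T, B_min(T, d_z=2))   # grows very slowly, like log log T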

Appendix B Space Complexity Analysis of A-BLiN

Suppose that the A-BLiN run contains B+1B+1 batches: BB batches in the for-loop and a clean-up batch. We use γm\gamma_{m} to denote the number of cubes in batch mm, that is, γm=|𝒜m|\gamma_{m}=|\mathcal{A}_{m}|, for 1mB1\leq m\leq B. Then it is easy to see that the space complexity is linear in max1mBγm\max_{1\leq m\leq B}\gamma_{m}. In the following, we bound γm\gamma_{m} for each mm to obtain the space complexity of A-BLiN.

Equations (6) and (7) yield that

γm16logTrm28rm1Tdz+1dz+2128Cz(logT)1dz+2,\displaystyle\gamma_{m}\cdot\frac{16\log T}{r_{m}^{2}}\cdot 8r_{m-1}\leq T^{\frac{d_{z}+1}{d_{z}+2}}\cdot 128C_{z}(\log T)^{\frac{1}{d_{z}+2}},

and thus

γmCzTdz+1dz+2(logT)dz+1dz+2rm2rm1.\gamma_{m}\leq C_{z}T^{\frac{d_{z}+1}{d_{z}+2}}(\log T)^{-\frac{d_{z}+1}{d_{z}+2}}\cdot\frac{r_{m}^{2}}{r_{m-1}}.

Since rm1r_{m}\leq 1 and rmrm1r_{m}\leq r_{m-1}, we further have

γmCzTdz+1dz+2(logT)dz+1dz+2.\displaystyle\gamma_{m}\leq C_{z}T^{\frac{d_{z}+1}{d_{z}+2}}(\log T)^{-\frac{d_{z}+1}{d_{z}+2}}.

The above inequality is satisfied for each 1mB1\leq m\leq B. Consequently, the space complexity of A-BLiN is upper bounded by max1mBγm=𝒪(Tdz+1dz+2(logT)dz+1dz+2)\max_{1\leq m\leq B}\gamma_{m}=\mathcal{O}\left(T^{\frac{d_{z}+1}{d_{z}+2}}(\log T)^{-\frac{d_{z}+1}{d_{z}+2}}\right).

Appendix C Proof of Theorem 4

Theorem 4. With probability exceeding 12T61-\frac{2}{T^{6}}, the TT-step total regret R(T)R(T) of BLiN with Doubling Edge-length Sequence (D-BLiN) satisfies

R(T)(512Cz+16)Tdz+1dz+2(logT)1dz+2,R(T)\leq(512C_{z}+16)\cdot T^{\frac{d_{z}+1}{d_{z}+2}}(\log T)^{\frac{1}{d_{z}+2}}, (45)

where dzd_{z} is the zooming dimension of the problem instance. In addition, D-BLiN needs no more than logTloglogTdz+2+2\frac{\log T-\log\log T}{d_{z}+2}+2 rounds of communications to achieve this regret rate.

Proof.

Since rm=rm12r_{m}=\frac{r_{m-1}}{2} for the Doubling Edge-length Sequence, Lemma 3 implies that every cube C𝒜mC\in\mathcal{A}_{m} is a subset of S(16rm)S(16r_{m}). Thus, from the definition of the zooming number, we have

|𝒜m|Nrm.\displaystyle|\mathcal{A}_{m}|\leq N_{r_{m}}. (46)

Fix any positive integer BB. Also by Lemma 3, we know that any arm played after batch BB incurs a regret bounded by 16rB16r_{B}, since the cubes played after batch BB have edge length no larger than rBr_{B}. Then the total regret incurred after the first BB batches is bounded by 16rBT16r_{B}T.

Thus the regret R(T)R(T) can be bounded by

R(T)\displaystyle R(T) m=1BC𝒜mi=1nmΔxC,i+16rBT,\displaystyle\leq\sum_{m=1}^{B}\sum_{C\in\mathcal{A}_{m}}\sum_{i=1}^{n_{m}}\Delta_{x_{C,i}}+16r_{B}T, (47)

where the first term bounds the regret in the first BB batches of D-BLiN, and the second term bounds the regret after the first BB batches. If the algorithm stops at batch B~<B\widetilde{B}<B, we define 𝒜m=\mathcal{A}_{m}=\varnothing for any B~<mB\widetilde{B}<m\leq B and inequality (47) still holds.

By Lemma 3, we have ΔC,i16rm\Delta_{C,i}\leq 16r_{m} for all C𝒜mC\in\mathcal{A}_{m}. We can thus bound (47) by

R(T)\displaystyle R(T)\leq m=1B|𝒜m|nm16rm+16rBT\displaystyle\sum_{m=1}^{B}|\mathcal{A}_{m}|\cdot n_{m}\cdot 16r_{m}+16r_{B}T
\displaystyle\leq m=1BNrmnm16rm+16rBT,\displaystyle\sum_{m=1}^{B}N_{r_{m}}\cdot n_{m}\cdot 16r_{m}+16r_{B}T, (48)
\displaystyle\leq m=1BNrm16logTrm216rm+16rBT,\displaystyle\sum_{m=1}^{B}N_{r_{m}}\cdot\frac{16\log T}{r_{m}^{2}}\cdot 16r_{m}+16r_{B}T, (49)
=\displaystyle= m=1BNrm256logTrm+16rBT,\displaystyle\sum_{m=1}^{B}N_{r_{m}}\cdot\frac{256\log T}{r_{m}}+16r_{B}T,

where (48) uses (46), and (49) uses the equality nm=16logTrm2n_{m}=\frac{16\log T}{r_{m}^{2}}. Since rm=2m+1r_{m}=2^{-m+1} and NrmCzrmdzCz2(m1)dzN_{r_{m}}\leq C_{z}r_{m}^{-d_{z}}\leq C_{z}\cdot 2^{(m-1)d_{z}}, we have

R(T)256Czm=1B2(m1)dzlogT2(m1)+162B+1T.\displaystyle R(T)\leq 256C_{z}\sum_{m=1}^{B}\frac{2^{(m-1)d_{z}}\log T}{2^{-(m-1)}}+16\cdot 2^{-{B}+1}T.

This inequality holds for any positive BB. By choosing B=1+logTlogTdz+2B^{*}=1+\frac{\log\frac{T}{\log T}}{d_{z}+2}, we have

R(T)\displaystyle R(T)\leq 512Cz2(B1)(dz+1)logT+16T2B+1(512Cz+16)Tdz+1dz+2(logT)1dz+2.\displaystyle 512C_{z}\cdot 2^{\left(B^{*}-1\right)\left(d_{z}+1\right)}\log T+16\cdot T\cdot 2^{-B^{*}+1}\leq(512C_{z}+16)\cdot T^{\frac{d_{z}+1}{d_{z}+2}}(\log T)^{\frac{1}{d_{z}+2}}.

The above analysis implies that we can achieve the optimal regret rate 𝒪~(Tdz+1dz+2)\widetilde{\mathcal{O}}\left(T^{\frac{d_{z}+1}{d_{z}+2}}\right) by letting the for-loop run BB^{*} times and finishing the remaining rounds in the Cleanup step. In other words, B+1B^{*}+1 rounds of communications are sufficient for D-BLiN to achieve the regret bound (45). ∎
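For concreteness, the following Python snippet computes the D-BLiN schedule implied by the proof (edge lengths r_m = 2^{-m+1}, per-cube sample counts n_m = 16 log T / r_m^2, and the batch count B*+1) together with the bound (45). The values of T, d_z and C_z are example choices, and the logarithm in the choice of B* is taken base 2 here, which is one reading of the proof.

import math

def dblin_schedule(T, d_z, C_z=1.0):
    # number of for-loop batches chosen in the proof (base-2 reading of B*), plus clean-up
    B_star = 1 + math.log2(T / math.log(T)) / (d_z + 2)
    batches = math.ceil(B_star) + 1
    edge_lengths = [2.0 ** (-(m - 1)) for m in range(1, math.ceil(B_star) + 1)]      # r_m
    samples_per_cube = [math.ceil(16 * math.log(T) / r ** 2) for r in edge_lengths]  # n_m
    # bound (45); the constant factors dominate at small T
    regret_bound = (512 * C_z + 16) * T ** ((d_z + 1) / (d_z + 2)) * math.log(T) ** (1 / (d_z + 2))
    return batches, edge_lengths, samples_per_cube, regret_bound

print(dblin_schedule(T=80_000, d_z=2))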

Appendix D Regret Upper Bound of BLiN with a Fixed Number of Batches

This section studies the case where a hard upper bound BB on the number of batches is imposed. In this case, we simply apply BLiN by executing the for-loop B1B-1 times, and then run the clean-up step. Since some batches may be skipped in the for-loop, the total number of batches is upper bounded by BB. The regret in this case is described by a quantity called the effective number of batches (written Eff(B)\text{Eff}(B)). The quantity Eff(B)\text{Eff}(B) counts the number of batches in which finer partitioning of the arm space occurs. In Lemma 7 below, we show that Eff(B)=𝒪(loglogT)\text{Eff}(B)=\mathcal{O}(\log\log T).

Firstly, we introduce some notations. In the proof of Theorem 6, we bound the regret of the odd batches (m=2k1m=2k-1 and m<Bm<B), the even batches (m=2km=2k and m<Bm<B) and the clean-up batch (the BB-th batch) separately. For convenience, we denote

Rodd=m=2k1,m<BRmandReven=m=2k,m<BRm,R_{\mathrm{odd}}=\sum_{m=2k-1,\;m<B}R_{m}\quad\text{and}\quad R_{\mathrm{even}}=\sum_{m=2k,\;m<B}R_{m},

where RmR_{m} is the regret of the mm-th batch. Additionally, we let Be=B12B_{e}=\left\lfloor\frac{B-1}{2}\right\rfloor, and thus 2Be2B_{e} is the last even batch before the BB-th batch. As is mentioned in the paper (the paragraph before Theorem 6), when using BLiN with rounded ACE Sequence, if there exists k<mk<m such that r~mr~k\widetilde{r}_{m}\geq\widetilde{r}_{k}, then we skip the mm-th batch. We use Eff(B)\text{Eff}(B) to denote the number of effective batches, that is, the batches which are not skipped.

Dropping the equality B=loglogTlogTlog(dz+2)logd+2d+1dzB^{*}=\frac{\log\log\frac{T}{\log T}-\log(d_{z}+2)}{\log\frac{d+2}{d+1-d_{z}}} and using the definition of the rounded ACE Sequence, the arguments in the proof of Theorem 6 apply directly to the case of fixed BB. Specifically, inequality (9) yields that

RoddEff(B)Tdz+1dz+2(logT)1dz+2;R_{\text{odd}}\lesssim\text{Eff}(B)\cdot T^{\frac{d_{z}+1}{d_{z}+2}}\cdot(\log T)^{\frac{1}{d_{z}+2}};

inequality (10) yields that

Revenr~2Bedz1logT(TlogT)dz+1dz+2(d+1dzd+2)B12(TlogT)dz+1dz+2logTTdz+1dz+2(logT)1dz+2;\displaystyle R_{\text{even}}\lesssim\widetilde{r}_{2B_{e}}^{-d_{z}-1}\cdot\log T\lesssim\left(\frac{T}{\log T}\right)^{-\frac{d_{z}+1}{d_{z}+2}\left(\frac{d+1-d_{z}}{d+2}\right)^{\left\lfloor\frac{B-1}{2}\right\rfloor}}\cdot\left(\frac{T}{\log T}\right)^{\frac{d_{z}+1}{d_{z}+2}}\cdot\log T\leq T^{\frac{d_{z}+1}{d_{z}+2}}\cdot(\log T)^{\frac{1}{d_{z}+2}};

and inequality (12) yields that

RBr~2BeTT1dz+2(d+1dzd+2)B12Tdz+1dz+2(logT)1dz+2.\displaystyle R_{B}\lesssim\widetilde{r}_{2B_{e}}\cdot T\lesssim T^{\frac{1}{d_{z}+2}\left(\frac{d+1-d_{z}}{d+2}\right)^{\left\lfloor\frac{B-1}{2}\right\rfloor}}\cdot T^{\frac{d_{z}+1}{d_{z}+2}}\cdot(\log T)^{\frac{1}{d_{z}+2}}.

Thus, the TT-step total regret is bounded by

R(T)=Rodd+Reven+RB(Eff(B)+T1dz+2(d+1dzd+2)B12)Tdz+1dz+2(logT)1dz+2.\displaystyle R(T)=R_{\text{odd}}+R_{\text{even}}+R_{B}\lesssim\left(\text{Eff}(B)+T^{\frac{1}{d_{z}+2}\left(\frac{d+1-d_{z}}{d+2}\right)^{\left\lfloor\frac{B-1}{2}\right\rfloor}}\right)\cdot T^{\frac{d_{z}+1}{d_{z}+2}}\cdot(\log T)^{\frac{1}{d_{z}+2}}. (50)

Furthermore, because of the rounding step, the effective number of batches of BLiN with the rounded ACE Sequence has the following upper bound.

Lemma 7.

Let TT be the total time horizon. When applying BLiN with rounded ACE Sequence {r~m}\{\widetilde{r}_{m}\} and any number of batches BB, the effective number of batches is upper bounded by

Eff(B)=𝒪(loglogT).\mathrm{Eff}(B)=\mathcal{O}(\log\log T).
Proof.

The rounded ACE sequence {r~m}m\{\widetilde{r}_{m}\}_{m\in\mathbb{N}} is defined as r~m=2αk\widetilde{r}_{m}=2^{-\alpha_{k}} for m=2k1m=2k-1 and r~m=2βk\widetilde{r}_{m}=2^{-\beta_{k}} for m=2km=2k, where αk=c11ηk1η\alpha_{k}=\lfloor c_{1}\cdot\frac{1-\eta^{k}}{1-\eta}\rfloor and βk=c11ηk1η\beta_{k}=\lceil c_{1}\cdot\frac{1-\eta^{k}}{1-\eta}\rceil. As explained in Remark 2, if there exist integers MM and kk such that M<c11ηk1η<c11ηk+11η<M+1M<c_{1}\cdot\frac{1-\eta^{k}}{1-\eta}<c_{1}\cdot\frac{1-\eta^{k+1}}{1-\eta}<M+1, then the rounding step yields r~2k1=r~2k+1>r~2k=r~2k+2\widetilde{r}_{2k-1}=\widetilde{r}_{2k+1}>\widetilde{r}_{2k}=\widetilde{r}_{2k+2}, so the (2k+1)(2k+1)-th and the (2k+2)(2k+2)-th batches are skipped. We note that the sequence {c11ηk1η}\{c_{1}\cdot\frac{1-\eta^{k}}{1-\eta}\} is increasing and limkc11ηk1η=c111η\lim_{k\to\infty}c_{1}\cdot\frac{1-\eta^{k}}{1-\eta}=c_{1}\cdot\frac{1}{1-\eta}. It is easy to verify that if inequality

c111ηc11ηk01η<1c_{1}\cdot\frac{1}{1-\eta}-c_{1}\cdot\frac{1-\eta^{k_{0}}}{1-\eta}<1 (51)

is satisfied for some k0k_{0}, then there will be at most 22 effective batches after the 2k02k_{0}-th batch. As a consequence, for any BB, the number of effective batches is upper bounded by 2k0+22k_{0}+2.

By choosing k0=loglogTlogd+2d+1dzk_{0}=\frac{\log\log T}{\log\frac{d+2}{d+1-d_{z}}}, we have

k0>log(1dz+2logTlogT)logd+2d+1dz=log1ηc1logη,k_{0}>\frac{\log\left(\frac{1}{d_{z}+2}\log\frac{T}{\log T}\right)}{\log\frac{d+2}{d+1-d_{z}}}=\frac{\log\frac{1-\eta}{c_{1}}}{\log\eta},

and (51) is satisfied. Therefore, we conclude that Eff(B)2loglogTlogd+2d+1dz+2\text{Eff}(B)\leq\frac{2\log\log T}{\log\frac{d+2}{d+1-d_{z}}}+2. ∎
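A small Python sketch of the rounded ACE Sequence and of the count of effective batches is given below; the parameters c_1 and η come from the main text, and the numerical values used here are illustrative assumptions only.

import math

def rounded_ace(c1, eta, B):
    # rounded ACE edge lengths (r~_1, ..., r~_{2B}):
    # r~_{2k-1} = 2^{-floor(c1 (1-eta^k)/(1-eta))}, r~_{2k} = 2^{-ceil(c1 (1-eta^k)/(1-eta))}
    seq = []
    for k in range(1, B + 1):
        s = c1 * (1 - eta ** k) / (1 - eta)
        seq += [2.0 ** (-math.floor(s)), 2.0 ** (-math.ceil(s))]
    return seq

def effective_batches(seq):
    # batch m is skipped if some earlier edge length r~_k satisfies r~_m >= r~_k,
    # so a batch is effective exactly when its edge length is a new strict minimum
    eff, best = 0, float("inf")
    for r in seq:
        if r < best:
            eff += 1
            best = r
    return eff

seq = rounded_ace(c1=2.0, eta=0.6, B=50)     # illustrative parameter values
print(effective_batches(seq))                # stays small even for large B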

Combining (50), Lemma 7 and the inequality B12B21\left\lfloor\frac{B-1}{2}\right\rfloor\geq\frac{B}{2}-1, we conclude that the TT-step regret of BLiN with rounded ACE Sequence and BB batches is upper bounded by

R(T)T1dz+2(d+1dzd+2)B21Tdz+1dz+2(logT)1dz+2.R(T)\lesssim T^{\frac{1}{d_{z}+2}\left(\frac{d+1-d_{z}}{d+2}\right)^{\frac{B}{2}-1}}\cdot T^{\frac{d_{z}+1}{d_{z}+2}}\cdot(\log T)^{\frac{1}{d_{z}+2}}.

This upper bound is slightly larger than our lower bound in Theorem 3, and the two match when B=Θ(loglogT)B=\Theta(\log\log T).

Appendix E Proof of Theorem 9

Theorem 9. Consider Lipschitz bandit problems with time horizon TT, ambient dimension dd and zooming dimension dzdd_{z}\leq d such that the grid of reward communication 𝒯\mathcal{T} is static and determined before the game. If BB rounds of communications are allowed, then for any policy π\pi, there exists a problem instance with zooming dimension dzd_{z} such that

𝔼[RT(π)]164e116T11dz+21(1dz+2)B.\mathbb{E}[R_{T}(\pi)]\geq\frac{1}{64e^{\frac{1}{16}}}\cdot T^{\frac{1-\frac{1}{d_{z}+2}}{1-\left(\frac{1}{d_{z}+2}\right)^{B}}}.
Proof.

Fix kk and 1iMk1\leq i\leq M_{k}. We show that μk,i\mu_{k,i} is 11-Lipschitz and the zooming dimension equals dzd_{z}.

Firstly, for any (α1,β1)(\alpha_{1},\beta_{1}) and (α2,β2)(\alpha_{2},\beta_{2}), we have

μk,i((α1,β1))μk,i((α2,β2))\displaystyle\mu_{k,i}((\alpha_{1},\beta_{1}))-\mu_{k,i}((\alpha_{2},\beta_{2}))\leq μk,i((α1,β1))μk,i((α1,β2))+μk,i((α1,β2))μk,i((α2,β2))\displaystyle\mu_{k,i}((\alpha_{1},\beta_{1}))-\mu_{k,i}((\alpha_{1},\beta_{2}))+\mu_{k,i}((\alpha_{1},\beta_{2}))-\mu_{k,i}((\alpha_{2},\beta_{2}))
\displaystyle\leq 12β1β2+12(νk,i((α1,0ddz))νk,i((α2,0ddz)))\displaystyle\frac{1}{2}\|\beta_{1}-\beta_{2}\|_{\infty}+\frac{1}{2}\big{(}\nu_{k,i}((\alpha_{1},0_{d-d_{z}}))-\nu_{k,i}((\alpha_{2},0_{d-d_{z}}))\big{)}
\displaystyle\leq 12(β1β2+α1α2)\displaystyle\frac{1}{2}(\|\beta_{1}-\beta_{2}\|_{\infty}+\|\alpha_{1}-\alpha_{2}\|_{\infty})
\displaystyle\leq (α1α2,β1β2).\displaystyle\|(\alpha_{1}-\alpha_{2},\beta_{1}-\beta_{2})\|_{\infty}.

Therefore, μk,i\mu_{k,i} is 11-Lipschitz.

Secondly, for any rrkr\geq r_{k}, (41) yields that S(16r)=[0,1]dz×[0,32r]ddzS(16r)=[0,1]^{d_{z}}\times[0,32r]^{d-d_{z}}. Therefore, we have Nr=32ddzrdzN_{r}=32^{d-d_{z}}r^{-d_{z}}, and the zooming dimension equals dzd_{z}.

Then we show that an argument similar to Theorem 7 yields the lower bound we need. For any k>1k>1, we prove that no algorithm can distinguish instances in k\mathcal{I}_{k} from one another in the first (k1)(k-1) batches, so the worst-case regret is at least rktkr_{k}t_{k}, which equals tktk11dz+2\frac{t_{k}}{t_{k-1}^{\frac{1}{d_{z}+2}}}. For the first batch (0,t1](0,t_{1}], we can easily construct a set of instances where the worst-case regret is at least t1t_{1}, since no information is available during this time. Thus, there exists a problem instance such that

𝔼[RT(π)]max{t1,t2t11dz+2,,tBtB11dz+2}.\displaystyle\mathbb{E}[R_{T}(\pi)]\gtrsim\max\left\{t_{1},\frac{t_{2}}{t_{1}^{\frac{1}{d_{z}+2}}},\cdots,\frac{t_{B}}{t_{B-1}^{\frac{1}{d_{z}+2}}}\right\}.

Since 0<t1<<tB=T0<t_{1}<\cdots<t_{B}=T, the inequality in Theorem 9 follows.

Recall that each uk,iu_{k,i} is in 𝒜dz\mathcal{A}_{d_{z}}. For convenience, we write uk,i=(u^k,i,0ddz)u_{k,i}=(\hat{u}_{k,i},0_{d-d_{z}}). Let Hk,i=𝔹dz(u^k,i,38rk)×[0,1]ddzH_{k,i}=\mathbb{B}_{d_{z}}(\hat{u}_{k,i},\frac{3}{8}r_{k})\times[0,1]^{d-d_{z}}, where 𝔹dz(u^k,i,38rk)\mathbb{B}_{d_{z}}(\hat{u}_{k,i},\frac{3}{8}r_{k}) denotes the dzd_{z}-dimensional ball with center u^k,i\hat{u}_{k,i} and radius 38rk\frac{3}{8}r_{k}. It is easy to verify the following properties of constructions (39), (40) and (41):

  1. For any 2iMk2\leq i\leq M_{k}, μk,i(z)=μk,1(z)\mu_{k,i}(z)=\mu_{k,1}(z) for any z𝒜Hk,iz\in\mathcal{A}\setminus H_{k,i};

  2. For any 2iMk2\leq i\leq M_{k}, μk,1(z)μk,i(z)μk,1(z)+rk4\mu_{k,1}(z)\leq\mu_{k,i}(z)\leq\mu_{k,1}(z)+\frac{r_{k}}{4}, for any zHk,iz\in H_{k,i};

  3. For any 1iMk1\leq i\leq M_{k}, under Ik,iI_{k,i}, pulling an arm that is not in Hk,iH_{k,i} incurs a regret of at least rk16\frac{r_{k}}{16}.

The lower bound of expected regret relies on the following lemma.

Lemma 8.

For any policy π\pi, there exists a problem instance IkI\in\mathcal{I}_{k} such that

𝔼[RT(π)]rk64j=1B(tjtj1)exp{tj1rk232(Mk1)}.\mathbb{E}\left[R_{T}(\pi)\right]\geq\frac{r_{k}}{64}\cdot\sum_{j=1}^{B}\left(t_{j}-t_{j-1}\right)\exp\left\{-\frac{t_{j-1}r_{k}^{2}}{32(M_{k}-1)}\right\}.
Proof.

Let xtx_{t} denote the choice of policy π\pi at time tt, and yty_{t} denote the reward. Additionally, for tj1<ttjt_{j-1}<t\leq t_{j}, we define k,it\mathbb{P}_{k,i}^{t} as the distribution of the sequence (x1,y1,,xtj1,ytj1)\left(x_{1},y_{1},\cdots,x_{t_{j-1}},y_{t_{j-1}}\right) under instance Ik,iI_{k,i} and policy π\pi. It holds that

supIk𝔼RT(π)1Mki=1Mk𝔼k,i[RT(π)]1Mki=1Mkt=1T𝔼k,it[Rt(π)]rk16t=1T1Mki=1Mkk,it(xtHk,i),\displaystyle\sup_{I\in\mathcal{I}_{k}}\mathbb{E}R_{T}(\pi)\geq\frac{1}{M_{k}}\sum_{i=1}^{M_{k}}\mathbb{E}_{\mathbb{P}_{k,i}}\left[R_{T}(\pi)\right]\geq\frac{1}{M_{k}}\sum_{i=1}^{M_{k}}\sum_{t=1}^{T}\mathbb{E}_{\mathbb{P}_{k,i}^{t}}\left[R^{t}(\pi)\right]\geq\frac{r_{k}}{16}\sum_{t=1}^{T}\frac{1}{M_{k}}\sum_{i=1}^{M_{k}}\mathbb{P}_{k,i}^{t}(x_{t}\notin H_{k,i}), (52)

where Rt(π)R^{t}(\pi) denotes the regret incurred by policy π\pi at time tt.

From our construction, it is easy to see that Hk,iHk,j=H_{k,i}\cap H_{k,j}=\varnothing for any iji\neq j, so we can construct a test Ψ\Psi such that xtHk,ix_{t}\in H_{k,i} implies Ψ=i\Psi=i. Then from Lemma 12,

1Mki=1Mkk,it(xtHk,i)1Mki=1Mkk,it(Ψi)12Mki=2Mkexp{DKL(k,1tk,it)}.\displaystyle\frac{1}{M_{k}}\sum_{i=1}^{M_{k}}\mathbb{P}_{k,i}^{t}(x_{t}\notin H_{k,i})\geq\;\frac{1}{M_{k}}\sum_{i=1}^{M_{k}}\mathbb{P}_{k,i}^{t}(\Psi\neq i)\geq\;\frac{1}{2M_{k}}\sum_{i=2}^{M_{k}}\exp\left\{-D_{KL}\left(\mathbb{P}_{k,1}^{t}\|\mathbb{P}_{k,i}^{t}\right)\right\}.

To avoid notational clutter, for any s,ss,s^{\prime} (sss\geq s^{\prime}), define

(𝐱,𝐲):s:s=\displaystyle(\mathbf{x},\mathbf{y})_{:s}^{:s^{\prime}}= (x1,y1,,xs,ys,,xs).\displaystyle\;(x_{1},y_{1},\cdots,x_{s^{\prime}},y_{s^{\prime}},\cdots,x_{s}).

Now we calculate DKL(k,1tk,it)D_{KL}\left(\mathbb{P}_{k,1}^{t}\|\mathbb{P}_{k,i}^{t}\right). From the chain rule of KL-Divergence, we have

DKL(k,1tk,it)\displaystyle\;D_{KL}\left(\mathbb{P}_{k,1}^{t}\|\mathbb{P}_{k,i}^{t}\right)
=\displaystyle= DKL(k,1t((𝐱,𝐲):tj1:tj1)k,it((𝐱,𝐲):tj1:tj1))\displaystyle\;D_{KL}\left(\mathbb{P}_{k,1}^{t}\left((\mathbf{x},\mathbf{y})_{:t_{j-1}}^{:t_{j-1}}\right)\|\mathbb{P}_{k,i}^{t}\left((\mathbf{x},\mathbf{y})_{:t_{j-1}}^{:t_{j-1}}\right)\right)
=\displaystyle= DKL(k,1t((𝐱,𝐲):tj1:tj11)k,it((𝐱,𝐲):tj1:tj11))+𝔼k,1(DKL(k,1t(ytj1|xtj1)k,it(ytj1|xtj1)))\displaystyle\;D_{KL}\left(\mathbb{P}_{k,1}^{t}\left((\mathbf{x},\mathbf{y})_{:t_{j-1}}^{:t_{j-1}-1}\right)\|\mathbb{P}_{k,i}^{t}\left((\mathbf{x},\mathbf{y})_{:t_{j-1}}^{:t_{j-1}-1}\right)\right)+\mathbb{E}_{\mathbb{P}_{k,1}}\left(D_{KL}\left(\mathbb{P}_{k,1}^{t}(y_{t_{j-1}}|x_{t_{j-1}})\|\mathbb{P}_{k,i}^{t}(y_{t_{j-1}}|x_{t_{j-1}})\right)\right) (53)
=\displaystyle= DKL(k,1t((𝐱,𝐲):tj11:tj11)k,it((𝐱,𝐲):tj11:tj11))+𝔼k,1(DKL(k,1t(ytj1|xtj1)k,it(ytj1|xtj1)))\displaystyle\;D_{KL}\left(\mathbb{P}_{k,1}^{t}\left((\mathbf{x},\mathbf{y})_{:t_{j-1}-1}^{:t_{j-1}-1}\right)\|\mathbb{P}_{k,i}^{t}\left((\mathbf{x},\mathbf{y})_{:t_{j-1}-1}^{:t_{j-1}-1}\right)\right)+\mathbb{E}_{\mathbb{P}_{k,1}}\left(D_{KL}\left(\mathbb{P}_{k,1}^{t}(y_{t_{j-1}}|x_{t_{j-1}})\|\mathbb{P}_{k,i}^{t}(y_{t_{j-1}}|x_{t_{j-1}})\right)\right) (54)
\displaystyle\leq DKL(k,1t((𝐱,𝐲):tj11:tj11)k,it((𝐱,𝐲):tj11:tj11))+𝔼k,1(DKL(N(μk,1(xtj1),1)N(μk,i(xtj1),1)))\displaystyle\;D_{KL}\left(\mathbb{P}_{k,1}^{t}\left((\mathbf{x},\mathbf{y})_{:t_{j-1}-1}^{:t_{j-1}-1}\right)\|\mathbb{P}_{k,i}^{t}\left((\mathbf{x},\mathbf{y})_{:t_{j-1}-1}^{:t_{j-1}-1}\right)\right)+\mathbb{E}_{\mathbb{P}_{k,1}}\left(D_{KL}\left(N(\mu_{k,1}(x_{t_{j-1}}),1)\|N(\mu_{k,i}(x_{t_{j-1}}),1)\right)\right) (55)
=\displaystyle= DKL(k,1t((𝐱,𝐲):tj11:tj11)k,it((𝐱,𝐲):tj11:tj11))+𝔼k,1(12(μk,1(xtj1)μk,i(xtj1))2)\displaystyle\;D_{KL}\left(\mathbb{P}_{k,1}^{t}\left((\mathbf{x},\mathbf{y})_{:t_{j-1}-1}^{:t_{j-1}-1}\right)\|\mathbb{P}_{k,i}^{t}\left((\mathbf{x},\mathbf{y})_{:t_{j-1}-1}^{:t_{j-1}-1}\right)\right)+\mathbb{E}_{\mathbb{P}_{k,1}}\left(\frac{1}{2}\left(\mu_{k,1}(x_{t_{j-1}})-\mu_{k,i}(x_{t_{j-1}})\right)^{2}\right)
\displaystyle\leq DKL(k,1t((𝐱,𝐲):tj11:tj11)k,it((𝐱,𝐲):tj11:tj11))+𝔼k,1(𝟏{xtj1Hk,i}12(rk4)2)\displaystyle\;D_{KL}\left(\mathbb{P}_{k,1}^{t}\left((\mathbf{x},\mathbf{y})_{:t_{j-1}-1}^{:t_{j-1}-1}\right)\|\mathbb{P}_{k,i}^{t}\left((\mathbf{x},\mathbf{y})_{:t_{j-1}-1}^{:t_{j-1}-1}\right)\right)+\mathbb{E}_{\mathbb{P}_{k,1}}\left(\bm{1}_{\{x_{t_{j-1}}\in H_{k,i}\}}\cdot\frac{1}{2}\left(\frac{r_{k}}{4}\right)^{2}\right) (56)
=\displaystyle= DKL(k,1t((𝐱,𝐲):tj11:tj11)k,it((𝐱,𝐲):tj11:tj11))+rk232k,1(xtj1Hk,i),\displaystyle\;D_{KL}\left(\mathbb{P}_{k,1}^{t}\left((\mathbf{x},\mathbf{y})_{:t_{j-1}-1}^{:t_{j-1}-1}\right)\|\mathbb{P}_{k,i}^{t}\left((\mathbf{x},\mathbf{y})_{:t_{j-1}-1}^{:t_{j-1}-1}\right)\right)+\frac{r_{k}^{2}}{32}\cdot\mathbb{P}_{k,1}\left(x_{t_{j-1}}\in H_{k,i}\right), (57)

where (53) uses the chain rule for KL-divergence and the conditional independence of the reward, (54) removes the dependence on xtj1x_{t_{j-1}} in the first term by another use of the chain rule and the fact that the distribution of xtj1x_{t_{j-1}} is fully determined by the policy and the distribution of (𝐱,𝐲):tj11:tj11(\mathbf{x},\mathbf{y})_{:t_{j-1}-1}^{:t_{j-1}-1}, (55) uses the fact that the rewards are corrupted by standard normal noise, and (56) uses the first two properties of the construction.
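The Gaussian KL identity used in (55), namely that the divergence between two unit-variance Gaussians equals half the squared difference of their means, can be checked numerically; the snippet below is a quick illustration with arbitrary means.

```python
import numpy as np

# Quick numerical check (illustrative) of D_KL(N(mu1,1) || N(mu2,1)) = (mu1-mu2)^2/2,
# the identity behind the r_k^2/32 factor (via |mu_{k,1} - mu_{k,i}| <= r_k/4).
mu1, mu2 = 0.3, 0.05          # arbitrary example means
xs = np.linspace(-12.0, 12.0, 400001)
p = np.exp(-(xs - mu1) ** 2 / 2) / np.sqrt(2 * np.pi)
q = np.exp(-(xs - mu2) ** 2 / 2) / np.sqrt(2 * np.pi)
kl_numeric = np.trapz(p * np.log(p / q), xs)
print(kl_numeric, (mu1 - mu2) ** 2 / 2)  # the two values agree to numerical precision
```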

Since (57) holds for all ttj1t\leq t_{j-1}, we conclude that

DKL(k,1tk,it)rk232stj1k,1(xsHk,i)=rk232𝔼k,1τi,\displaystyle D_{KL}\left(\mathbb{P}_{k,1}^{t}\|\mathbb{P}_{k,i}^{t}\right)\leq\frac{r_{k}^{2}}{32}\sum_{s\leq t_{j-1}}\mathbb{P}_{k,1}\left(x_{s}\in H_{k,i}\right)=\frac{r_{k}^{2}}{32}\mathbb{E}_{\mathbb{P}_{k,1}}\tau_{i}, (58)

where τi\tau_{i} denotes the number of pulls of arms in Hk,iH_{k,i} before the batch containing tt. Then for all t(tj1,tj]t\in(t_{j-1},t_{j}], we have

1Mki=1Mkk,it(xtHk,i)\displaystyle\frac{1}{M_{k}}\sum_{i=1}^{M_{k}}\mathbb{P}_{k,i}^{t}(x_{t}\notin H_{k,i})\geq 12Mki=2Mkexp{rk232𝔼k,1τi}\displaystyle\;\frac{1}{2M_{k}}\sum_{i=2}^{M_{k}}\exp\left\{-\frac{r_{k}^{2}}{32}\mathbb{E}_{\mathbb{P}_{k,1}}\tau_{i}\right\}
\displaystyle\geq Mk12Mkexp{rk232(Mk1)i=2Mk𝔼k,1τi}\displaystyle\;\frac{M_{k}-1}{2M_{k}}\exp\left\{-\frac{r_{k}^{2}}{32(M_{k}-1)}\sum_{i=2}^{M_{k}}\mathbb{E}_{\mathbb{P}_{k,1}}\tau_{i}\right\} (59)
\displaystyle\geq 14exp{rk2tj132(Mk1)},\displaystyle\;\frac{1}{4}\exp\left\{-\frac{r_{k}^{2}t_{j-1}}{32(M_{k}-1)}\right\}, (60)

where (59) uses Jensen’s inequality, and (60) uses the fact that i=2Mkτitj1\sum_{i=2}^{M_{k}}\tau_{i}\leq t_{j-1}. Finally, we substitute (60) into (52) to finish the proof of Lemma 8. ∎

Since Mk=tk1rk2M_{k}=t_{k-1}r_{k}^{2} and Mk2M_{k}\geq 2, the expected regret of policy π\pi satisfies

𝔼[RT(π)]\displaystyle\mathbb{E}\left[R_{T}(\pi)\right] rk64j=1B(tjtj1)exp{tj1rk232(Mk1)}\displaystyle\geq\frac{r_{k}}{64}\cdot\sum_{j=1}^{B}\left(t_{j}-t_{j-1}\right)\exp\left\{-\frac{t_{j-1}r_{k}^{2}}{32(M_{k}-1)}\right\}
rk64j=1B(tjtj1)exp{tj1rk216Mk}\displaystyle\geq\frac{r_{k}}{64}\cdot\sum_{j=1}^{B}\left(t_{j}-t_{j-1}\right)\exp\left\{-\frac{t_{j-1}r_{k}^{2}}{16M_{k}}\right\}
rk64j=1B(tjtj1)exp{tj116tk1}\displaystyle\geq\frac{r_{k}}{64}\cdot\sum_{j=1}^{B}\left(t_{j}-t_{j-1}\right)\exp\left\{-\frac{t_{j-1}}{16t_{k-1}}\right\}

on an instance Ik,ikI_{k,i}\in\mathcal{I}_{k}.

By omitting terms with j>kj>k in the above summation, we have

𝔼[RT(π)]\displaystyle\mathbb{E}[R_{T}(\pi)] rk64j=1B(tjtj1)exp{tj116tk1}\displaystyle\geq\frac{r_{k}}{64}\cdot\sum_{j=1}^{B}\left(t_{j}-t_{j-1}\right)\exp\left\{-\frac{t_{j-1}}{16t_{k-1}}\right\}
rk64j=1k(tjtj1)exp{116}\displaystyle\geq\frac{r_{k}}{64}\cdot\sum_{j=1}^{k}\left(t_{j}-t_{j-1}\right)\exp\left\{-\frac{1}{16}\right\}
=164e116rktk=164e116tktk11dz+2.\displaystyle=\frac{1}{64e^{\frac{1}{16}}}r_{k}t_{k}=\frac{1}{64e^{\frac{1}{16}}}\cdot\frac{t_{k}}{t_{k-1}^{\frac{1}{d_{z}+2}}}.

The above analysis applies to any k>1k>1. For the first batch (0,t1](0,t_{1}], we can easily construct a set of instances where the worst-case regret is at least t1t_{1}, since no information is available during this time. Thus, there exists a problem instance such that

𝔼[RT(π)]164e116max{t1,t2t11dz+2,,tBtB11dz+2}.\displaystyle\mathbb{E}[R_{T}(\pi)]\geq\frac{1}{64e^{\frac{1}{16}}}\max\left\{t_{1},\frac{t_{2}}{t_{1}^{\frac{1}{d_{z}+2}}},\cdots,\frac{t_{B}}{t_{B-1}^{\frac{1}{d_{z}+2}}}\right\}.

Since 0<t1<<tB=T0<t_{1}<\cdots<t_{B}=T, we further have

max{t1,t2t1ε,,tBtB1ε}(t1εB1(t2t1ε)εB2(tB1tB2ε)εtBtB1ε)1i=0B1εi=T1ε1εB,\displaystyle\max\left\{t_{1},\frac{t_{2}}{t_{1}^{\varepsilon}},\cdots,\frac{t_{B}}{t_{B-1}^{\varepsilon}}\right\}\geq\left(t_{1}^{\varepsilon^{B-1}}\cdot\left(\frac{t_{2}}{t_{1}^{\varepsilon}}\right)^{\varepsilon^{B-2}}\cdots\left(\frac{t_{B-1}}{t_{B-2}^{\varepsilon}}\right)^{\varepsilon}\cdot\frac{t_{B}}{t_{B-1}^{\varepsilon}}\right)^{\frac{1}{\sum_{i=0}^{B-1}\varepsilon^{i}}}=T^{\frac{1-\varepsilon}{1-\varepsilon^{B}}},

where ε=1dz+2\varepsilon=\frac{1}{d_{z}+2}. Combining the above two inequalities, we conclude that

𝔼[RT(π)]164e116T11dz+21(1dz+2)B.\mathbb{E}\left[R_{T}(\pi)\right]\geq\frac{1}{64e^{\frac{1}{16}}}\cdot T^{\frac{1-\frac{1}{d_{z}+2}}{1-\left(\frac{1}{d_{z}+2}\right)^{B}}}.\qed
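The final step combines a weighted geometric-mean bound with the fact that the product above telescopes to T. As an illustrative sanity check (not part of the proof), the Python sketch below verifies the resulting grid inequality on randomly drawn grids; the values of T, B and d_z are arbitrary.

```python
import random

# Illustrative check: for any 0 < t_1 < ... < t_B = T and eps = 1/(dz+2),
#   max{ t_1, t_2 / t_1^eps, ..., t_B / t_{B-1}^eps } >= T^{(1-eps)/(1-eps^B)}.
random.seed(1)
T, B, dz = 10 ** 6, 5, 2
eps = 1.0 / (dz + 2)
for _ in range(1000):
    grid = sorted(random.uniform(1, T) for _ in range(B - 1)) + [T]
    terms = [grid[0]] + [grid[j] / grid[j - 1] ** eps for j in range(1, B)]
    assert max(terms) >= T ** ((1 - eps) / (1 - eps ** B)) - 1e-6
print("grid inequality verified on 1000 random grids")
```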

Appendix F Proof of Theorem 10

Theorem 10. Consider Lipschitz bandit problems with time horizon TT, ambient dimension dd and zooming dimension dzdd_{z}\leq d such that the grid of reward communication 𝒯\mathcal{T} is adaptively determined by the player. If BB rounds of communications are allowed, then for any policy π\pi, there exists a problem instance with zooming dimension dzd_{z} such that

𝔼[RT(π)]1512B2T11dz+21(1dz+2)B.\displaystyle\mathbb{E}\left[R_{T}(\pi)\right]\geq\frac{1}{512B^{2}}T^{\frac{1-\frac{1}{d_{z}+2}}{1-\left(\frac{1}{d_{z}+2}\right)^{B}}}.
Proof.

The main argument in the proof of Theorem 10 is similar to that of Theorem 8. To construct hard instances in the dd-dimensional space, we use the ‘linear-decaying extension’ technique, which is the same as in the proof of Theorem 9.

To prove Theorem 10, we consider a reference static grid 𝒯r={T0,T1,,TB}\mathcal{T}_{r}=\{T_{0},T_{1},\cdots,T_{B}\}, where Tj=T1εj1εBT_{j}=T^{\frac{1-\varepsilon^{j}}{1-\varepsilon^{B}}} for ε=1dz+2\varepsilon=\frac{1}{d_{z}+2}. Then we construct a series of ‘worlds’, denoted by 1,,B\mathcal{I}_{1},\cdots,\mathcal{I}_{B}. Each world is a set of problem instances, and each problem instance in world j\mathcal{I}_{j} is defined by peak location set 𝒰j\mathcal{U}_{j} and basic height rjr_{j}, where the sets 𝒰j\mathcal{U}_{j} and quantities rjr_{j} for 1jB1\leq j\leq B are presented in the proof below. Based on these constructions, we first prove that for any adaptive grid and policy, there exists an index jj such that the event Aj={tj1<Tj1,tjTj}A_{j}=\{t_{j-1}<T_{j-1},\;t_{j}\geq T_{j}\} happens with sufficiently high probability in world j\mathcal{I}_{j}. Then similar to Theorem 9, we prove that in world j\mathcal{I}_{j} there exists a set of problem instances that is difficult to differentiate in the first j1j-1 batches. In addition, event AjA_{j} implies that tjTjt_{j}\geq T_{j}, so the worst-case regret is at least rjTjr_{j}T_{j}, which gives the lower bound we need.

Firstly, we define rj=1Tj1εBr_{j}=\frac{1}{T_{j-1}^{\varepsilon}B} and Mj=1rjdzM_{j}=\frac{1}{r_{j}^{d_{z}}}, where ε=1dz+2\varepsilon=\frac{1}{d_{z}+2}. From the definition, we have

Tj1rj2=1rjdzBdz+21rjdzB2=MjB2.\displaystyle T_{j-1}r_{j}^{2}=\frac{1}{r_{j}^{d_{z}}B^{d_{z}+2}}\leq\frac{1}{r_{j}^{d_{z}}B^{2}}=\frac{M_{j}}{B^{2}}. (61)
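As a quick numerical illustration (with arbitrary values of T, B and d_z), the sketch below checks (61) together with the identity r_j T_j = T_j/(T_{j-1}^eps B) = T^{(1-eps)/(1-eps^B)}/B for the reference grid, which is the form in which these quantities are used later in the proof of Lemma 10.

```python
# Numerical check (illustrative) of (61) and of r_j * T_j = T^{(1-eps)/(1-eps^B)} / B
# for the reference grid T_j = T^{(1-eps^j)/(1-eps^B)}, with r_j = 1/(T_{j-1}^eps * B)
# and M_j = r_j^{-dz}. The values of T, B, dz are arbitrary example choices.
T, B, dz = 10 ** 8, 6, 3
eps = 1.0 / (dz + 2)
for j in range(1, B + 1):
    T_prev = T ** ((1 - eps ** (j - 1)) / (1 - eps ** B))
    T_j = T ** ((1 - eps ** j) / (1 - eps ** B))
    r_j = 1.0 / (T_prev ** eps * B)
    M_j = r_j ** (-dz)
    # (61): T_{j-1} r_j^2 = 1/(r_j^dz B^{dz+2}) = M_j / B^{dz+2} <= M_j / B^2
    assert abs(T_prev * r_j ** 2 - M_j / B ** (dz + 2)) <= 1e-9 * M_j
    assert T_prev * r_j ** 2 <= M_j / B ** 2 + 1e-12
    # identity used for the final bound of Lemma 10
    assert abs(r_j * T_j - T ** ((1 - eps) / (1 - eps ** B)) / B) <= 1e-6 * r_j * T_j
print("identity (61) and r_j * T_j = T^{(1-eps)/(1-eps^B)} / B verified")
```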

For 1jB1\leq j\leq B, we can find sets of arms 𝒰j={uj,1,,uj,Mj}𝒜dz\mathcal{U}_{j}=\{u_{j,1},\cdots,u_{j,M_{j}}\}\subset\mathcal{A}_{d_{z}} such that (a) d𝒜(uj,m,uj,n)rjd_{\mathcal{A}}(u_{j,m},u_{j,n})\geq r_{j} for any mnm\neq n, and (b) u1,M1==uB,MBu_{1,M_{1}}=\cdots=u_{B,M_{B}}.

Then we present the construction of worlds 1,,B\mathcal{I}_{1},\cdots,\mathcal{I}_{B}. For 1jB11\leq j\leq B-1, we let j={Ij,k}k=1Mj1\mathcal{I}_{j}=\{I_{j,k}\}_{k=1}^{M_{j}-1}. We first construct a set of expected reward functions {νj,1,,νj,Mj1}\{\nu_{j,1},\cdots,\nu_{j,M_{j}-1}\} on 𝒜dz\mathcal{A}_{d_{z}}. For each 1kMj11\leq k\leq M_{j}-1, we define νj,k\nu_{j,k} as

νj,k(z)={r12+rj16+rB16,ifz=uj,k,r12+rB16,ifz=uj,Mj,max{r12,maxu𝒰j{νj,k(u)d𝒜(z,u)}},ifz𝒜dz𝒰j.\displaystyle\nu_{j,k}(z)=\begin{cases}\frac{r_{1}}{2}+\frac{r_{j}}{16}+\frac{r_{B}}{16},\;\text{if}\;z=u_{j,k},\\ \frac{r_{1}}{2}+\frac{r_{B}}{16},\;\text{if}\;z=u_{j,M_{j}},\\ \max\left\{\frac{r_{1}}{2},\max_{u\in\mathcal{U}_{j}}\left\{\nu_{j,k}(u)-d_{\mathcal{A}}(z,u)\right\}\right\},\\ \quad\text{if}\;z\in\mathcal{A}_{d_{z}}\setminus\mathcal{U}_{j}.\end{cases} (62)

Based on {νj,k}k=1Mj1\{\nu_{j,k}\}_{k=1}^{M_{j}-1}, the expected reward of Ij,kI_{j,k} is defined as

μj,k((α,β))=12(νj,k((α,0ddz))β),\displaystyle\mu_{j,k}((\alpha,\beta))=\frac{1}{2}\left(\nu_{j,k}((\alpha,0_{d-d_{z}}))-\|\beta\|_{\infty}\right), (63)

where (α,β)(\alpha,\beta) is the concatenation of α[0,1]dz\alpha\in[0,1]^{d_{z}} and β[0,1]ddz\beta\in[0,1]^{d-d_{z}}. For j=Bj=B, we let B={IB}\mathcal{I}_{B}=\{I_{B}\}. We first define a function νB\nu_{B} on 𝒜dz\mathcal{A}_{d_{z}} as

νB(z)={r12+rB16,ifz=uB,MB,max{r12,maxu𝒰B{νB(u)d𝒜(z,u)}},ifz𝒜dz{uB,MB}.\displaystyle\nu_{B}(z)=\begin{cases}\frac{r_{1}}{2}+\frac{r_{B}}{16},\;\text{if}\;z=u_{B,M_{B}},\\ \max\left\{\frac{r_{1}}{2},\max_{u\in\mathcal{U}_{B}}\left\{\nu_{B}(u)-d_{\mathcal{A}}(z,u)\right\}\right\},\\ \quad\text{if}\;z\in\mathcal{A}_{d_{z}}\setminus\{u_{B,M_{B}}\}.\end{cases} (64)

Then the expected reward of IBI_{B} is defined as

μB((α,β))=12(νB((α,0ddz))β).\displaystyle\mu_{B}((\alpha,\beta))=\frac{1}{2}\left(\nu_{B}((\alpha,0_{d-d_{z}}))-\|\beta\|_{\infty}\right). (65)

Roughly speaking, our constructions satisfy two properties: for each jBj\neq B and 1kMj11\leq k\leq M_{j}-1,

  1. μj,k\mu_{j,k} is close to μB\mu_{B};

  2. under Ij,kI_{j,k}, pulling an arm that is far from uj,ku_{j,k} incurs a regret of at least rj16\frac{r_{j}}{16}.

The formal versions of these two properties are presented in the proofs of Lemma 9 and Lemma 10.

Based on these constructions, we first show that for any adaptive grid 𝒯={t0,,tB}\mathcal{T}=\{t_{0},\cdots,t_{B}\}, there exists an index jj such that (tj1,tj](t_{j-1},t_{j}] is sufficiently large in world j\mathcal{I}_{j}. More formally, for each j[B]j\in[B], and event Aj={tj1<Tj1,tjTj}A_{j}=\{t_{j-1}<T_{j-1},\;t_{j}\geq T_{j}\}, we define the quantities pj:=1Mj1k=1Mj1j,k(Aj)p_{j}:=\frac{1}{M_{j}-1}\sum_{k=1}^{M_{j}-1}\mathbb{P}_{j,k}(A_{j}) for jB1j\leq B-1 and pB:=B(AB)p_{B}:=\mathbb{P}_{B}(A_{B}), where j,k(Aj)\mathbb{P}_{j,k}(A_{j}) denotes the probability of the event AjA_{j} under instance Ij,kI_{j,k} and policy π\pi. For these quantities, we have the following lemma.

Lemma 9.

For any adaptive grid 𝒯\mathcal{T} and policy π\pi, it holds that j=1Bpj78.\sum_{j=1}^{B}p_{j}\geq\frac{7}{8}.

Proof.

For 1jB11\leq j\leq B-1 and 1kMj11\leq k\leq M_{j}-1, we write uj,k=(u^j,k,0ddz)u_{j,k}=(\hat{u}_{j,k},0_{d-d_{z}}). We define Hj,k=𝔹dz(u^j,k,38rj)×[0,1]ddzH_{j,k}=\mathbb{B}_{d_{z}}(\hat{u}_{j,k},\frac{3}{8}r_{j})\times[0,1]^{d-d_{z}}, where 𝔹dz(u^j,k,38rj)\mathbb{B}_{d_{z}}(\hat{u}_{j,k},\frac{3}{8}r_{j}) denotes the dzd_{z}-dimensional ball with center u^j,k\hat{u}_{j,k} and radius 38rj\frac{3}{8}r_{j}. It is easy to verify the following properties of our construction (62), (63), (64) and (65):

  1. μj,k(z)=μB(z)\mu_{j,k}(z)=\mu_{B}(z) for any zHj,kz\notin H_{j,k};

  2. μB(z)μj,k(z)μB(z)+rj8\mu_{B}(z)\leq\mu_{j,k}(z)\leq\mu_{B}(z)+\frac{r_{j}}{8}, for any zHj,kz\in H_{j,k}.
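These two properties can be verified numerically on a toy one-dimensional version of the construction. The sketch below is purely illustrative: the values of r_1, r_j, r_B and the peak locations are hypothetical, and nu_jk, nu_B are written directly as max-of-cones functions mirroring (62) and (64).

```python
# Illustrative check of properties 1 and 2 above for a one-dimensional toy version
# of the construction (d_z = 1 plus one extra coordinate). All numbers below
# (r_1, r_j, r_B and the peak locations) are hypothetical example values.
r1, rj, rB = 0.2, 0.1, 0.05
u_jk, u_common = 0.3, 0.9          # stand-ins for u_{j,k} and u_{j,M_j} = u_{B,M_B}

def nu_jk(z):
    return max(r1 / 2,
               r1 / 2 + rj / 16 + rB / 16 - abs(z - u_jk),
               r1 / 2 + rB / 16 - abs(z - u_common))

def nu_B(z):
    return max(r1 / 2, r1 / 2 + rB / 16 - abs(z - u_common))

def mu(nu, alpha, beta):
    return 0.5 * (nu(alpha) - abs(beta))

n = 400
for i in range(n + 1):
    for j in range(n + 1):
        alpha, beta = i / n, j / n
        m_jk, m_B = mu(nu_jk, alpha, beta), mu(nu_B, alpha, beta)
        assert m_B <= m_jk <= m_B + rj / 8 + 1e-12   # property 2
        if abs(alpha - u_jk) > 3 * rj / 8:           # alpha outside H_{j,k}
            assert abs(m_jk - m_B) <= 1e-12          # property 1
print("properties 1 and 2 hold on the toy construction")
```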

Let xtx_{t} denote the choice of policy π\pi at time tt, and yty_{t} denote the reward. For tj1<ttjt_{j-1}<t\leq t_{j}, we define j,kt\mathbb{P}_{j,k}^{t} (resp. Bt\mathbb{P}_{B}^{t}) as the distribution of the sequence (x1,y1,,xtj1,ytj1)\left(x_{1},y_{1},\cdots,x_{t_{j-1}},y_{t_{j-1}}\right) under instance Ij,kI_{j,k} (resp. IBI_{B}) and policy π\pi. Since the event AjA_{j} can be completely described by the observations up to time Tj1T_{j-1} (AjA_{j} is an event in the σ\sigma-algebra on which j,kTj1\mathbb{P}_{j,k}^{T_{j-1}} and BTj1\mathbb{P}_{B}^{T_{j-1}} are defined), we can use the definition of total variation to get

|B(Aj)j,k(Aj)|=|BTj1(Aj)j,kTj1(Aj)|TV(BTj1,j,kTj1).\displaystyle|\mathbb{P}_{B}(A_{j})-\mathbb{P}_{j,k}(A_{j})|=\;|\mathbb{P}_{B}^{T_{j-1}}(A_{j})-\mathbb{P}_{j,k}^{T_{j-1}}(A_{j})|\leq\;TV\left(\mathbb{P}_{B}^{T_{j-1}},\mathbb{P}_{j,k}^{T_{j-1}}\right).

For the total variation, we apply Lemma 11 to get

1Mj1k=1Mj1TV(BTj1,j,kTj1)1Mj1k=1Mj11exp(DKL(BTj1j,kTj1)).\displaystyle\;\frac{1}{M_{j}-1}\sum_{k=1}^{M_{j}-1}TV\left(\mathbb{P}_{B}^{T_{j-1}},\mathbb{P}_{j,k}^{T_{j-1}}\right)\leq\;\frac{1}{M_{j}-1}\sum_{k=1}^{M_{j}-1}\sqrt{1-\exp\left(-D_{KL}\left(\mathbb{P}_{B}^{T_{j-1}}\|\mathbb{P}_{j,k}^{T_{j-1}}\right)\right)}.

An argument similar to (58) yields that

DKL(BTj1j,kTj1)rj2128𝔼Bτk,\displaystyle D_{KL}\left(\mathbb{P}_{B}^{T_{j-1}}\|\mathbb{P}_{j,k}^{T_{j-1}}\right)\leq\frac{r_{j}^{2}}{128}\mathbb{E}_{\mathbb{P}_{B}}\tau_{k},

where τk\tau_{k} denotes the number of pulls in Hj,kH_{j,k} before the batch containing Tj1T_{j-1}. Combining the above two inequalities gives

1Mj1k=1Mj1TV(BTj1,j,kTj1)\displaystyle\frac{1}{M_{j}-1}\sum_{k=1}^{M_{j}-1}TV\left(\mathbb{P}_{B}^{T_{j-1}},\mathbb{P}_{j,k}^{T_{j-1}}\right)\leq 1Mj1k=1Mj11exp(rj2128𝔼Bτk)\displaystyle\;\frac{1}{M_{j}-1}\sum_{k=1}^{M_{j}-1}\sqrt{1-\exp\left(-\frac{r_{j}^{2}}{128}\mathbb{E}_{\mathbb{P}_{B}}\tau_{k}\right)}
\displaystyle\leq 1exp(rj2128(Mj1)𝔼B[k=1Mj1τk])\displaystyle\;\sqrt{1-\exp\left(-\frac{r_{j}^{2}}{128(M_{j}-1)}\mathbb{E}_{\mathbb{P}_{B}}\left[\sum_{k=1}^{M_{j}-1}\tau_{k}\right]\right)} (66)
\displaystyle\leq 1exp(rj2Tj1128(Mj1))\displaystyle\;\sqrt{1-\exp\left(-\frac{r_{j}^{2}T_{j-1}}{128(M_{j}-1)}\right)} (67)
\displaystyle\leq 1exp(164B2)\displaystyle\;\sqrt{1-\exp\left(-\frac{1}{64B^{2}}\right)} (68)
\displaystyle\leq 18B,\displaystyle\;\frac{1}{8B},

where (66) uses Jensen’s inequality, (67) uses the fact that k=1Mj1τkTj1\sum_{k=1}^{M_{j}-1}\tau_{k}\leq T_{j-1}, and (68) uses (61) together with Mj2M_{j}\geq 2.

Combining the above bounds, we obtain

|B(Aj)pj|1Mj1k=1Mj1|B(Aj)j,k(Aj)|18B.\displaystyle|\mathbb{P}_{B}(A_{j})-p_{j}|\leq\frac{1}{M_{j}-1}\sum_{k=1}^{M_{j}-1}|\mathbb{P}_{B}(A_{j})-\mathbb{P}_{j,k}(A_{j})|\leq\frac{1}{8B}.

Since j=1BB(Aj)B(j=1BAj)=1\sum_{j=1}^{B}\mathbb{P}_{B}\left(A_{j}\right)\geq\mathbb{P}_{B}\left(\cup_{j=1}^{B}A_{j}\right)=1, it holds that

j=1BpjB(AB)+j=1B1(B(Aj)18B)78.\sum_{j=1}^{B}p_{j}\geq\mathbb{P}_{B}(A_{B})+\sum_{j=1}^{B-1}\left(\mathbb{P}_{B}(A_{j})-\frac{1}{8B}\right)\geq\frac{7}{8}.\qed

Lemma 9 implies that there exists some jj such that pj78Bp_{j}\geq\frac{7}{8B}. Then we show that the worst-case regret in world j\mathcal{I}_{j} gives the lower bound we need.

Lemma 10.

For adaptive grid 𝒯\mathcal{T} and policy π\pi, if index jj satisfies pj78Bp_{j}\geq\frac{7}{8B}, then there exists a problem instance II with zooming dimension dzd_{z} such that

𝔼[RT(π)]1512B2T11dz+21(1dz+2)B.\displaystyle\mathbb{E}\left[R_{T}(\pi)\right]\geq\frac{1}{512B^{2}}T^{\frac{1-\frac{1}{d_{z}+2}}{1-\left(\frac{1}{d_{z}+2}\right)^{B}}}.
Proof.

Here we proceed with the case where jB1j\leq B-1. The case for j=Bj=B can be proved analogously.

For any 1kMj11\leq k\leq M_{j}-1, we construct a set of problem instances j,k=(Ij,k,l)1lMj\mathcal{I}_{j,k}=\left(I_{j,k,l}\right)_{1\leq l\leq M_{j}}. For lkl\neq k, we first define a function νj,k,l\nu_{j,k,l} on 𝒜dz\mathcal{A}_{d_{z}} as

νj,k,l(z)={νj,k(z)+3rj16,ifz=uj,l,νj,k(z),ifz𝒰jandzuj,l,max{r12,maxu𝒰j{νj,k,l(u)d𝒜(z,u)}},ifz𝒜dz𝒰j,\displaystyle\nu_{j,k,l}(z)=\begin{cases}\nu_{j,k}(z)+\frac{3r_{j}}{16},\;\text{if}\;z=u_{j,l},\\ \nu_{j,k}(z),\;\text{if}\;z\in\mathcal{U}_{j}\;\text{and}\;z\neq u_{j,l},\\ \max\left\{\frac{r_{1}}{2},\max_{u\in\mathcal{U}_{j}}\left\{\nu_{j,k,l}(u)-d_{\mathcal{A}}(z,u)\right\}\right\},\\ \quad\text{if}\;z\in\mathcal{A}_{d_{z}}\setminus\mathcal{U}_{j},\end{cases}

where νj,k\nu_{j,k} is defined in (62). Then the expected reward of Ij,k,lI_{j,k,l} is defined as

μj,k,l((α,β))=12(νj,k,l((α,0ddz))β),\displaystyle\mu_{j,k,l}((\alpha,\beta))=\frac{1}{2}\left(\nu_{j,k,l}((\alpha,0_{d-d_{z}}))-\|\beta\|_{\infty}\right), (69)

where (α,β)(\alpha,\beta) is the concatenation of α[0,1]dz\alpha\in[0,1]^{d_{z}} and β[0,1]ddz\beta\in[0,1]^{d-d_{z}}. For l=kl=k, we let μj,k,k=μj,k\mu_{j,k,k}=\mu_{j,k}.

We first show that each μj,k,l\mu_{j,k,l} is 11-Lipschitz and that the zooming dimension equals dzd_{z}.

Firstly, for any (α1,β1)(\alpha_{1},\beta_{1}) and (α2,β2)(\alpha_{2},\beta_{2}), we have

μj,k,l((α1,β1))μj,k,l((α2,β2))\displaystyle\mu_{j,k,l}((\alpha_{1},\beta_{1}))-\mu_{j,k,l}((\alpha_{2},\beta_{2}))\leq μj,k,l((α1,β1))μj,k,l((α1,β2))+μj,k,l((α1,β2))μj,k,l((α2,β2))\displaystyle\mu_{j,k,l}((\alpha_{1},\beta_{1}))-\mu_{j,k,l}((\alpha_{1},\beta_{2}))+\mu_{j,k,l}((\alpha_{1},\beta_{2}))-\mu_{j,k,l}((\alpha_{2},\beta_{2}))
\displaystyle\leq 12β1β2+12(νj,k,l((α1,0ddz))νj,k,l((α2,0ddz)))\displaystyle\frac{1}{2}\|\beta_{1}-\beta_{2}\|_{\infty}+\frac{1}{2}\big{(}\nu_{j,k,l}((\alpha_{1},0_{d-d_{z}}))-\nu_{j,k,l}((\alpha_{2},0_{d-d_{z}}))\big{)}
\displaystyle\leq 12(β1β2+α1α2)\displaystyle\frac{1}{2}(\|\beta_{1}-\beta_{2}\|_{\infty}+\|\alpha_{1}-\alpha_{2}\|_{\infty})
\displaystyle\leq (α1α2,β1β2).\displaystyle\|(\alpha_{1}-\alpha_{2},\beta_{1}-\beta_{2})\|_{\infty}.

Therefore, μj,k,l\mu_{j,k,l} is 11-Lipschitz.

Secondly, for any rrjr\geq r_{j}, (63) and (69) yield that S(16r)[0,1]dz×[0,32r]ddzS(16r)\subset[0,1]^{d_{z}}\times[0,32r]^{d-d_{z}} and S(16r)[0,1]dz×[0,30r]ddzS(16r)\supset[0,1]^{d_{z}}\times[0,30r]^{d-d_{z}}. Therefore, we have 30ddzrdzNr32ddzrdz30^{d-d_{z}}r^{-d_{z}}\leq N_{r}\leq 32^{d-d_{z}}r^{-d_{z}}, and the zooming dimension equals dzd_{z}.

Then we show that an argument similar to Lemma 6 yields the lower bound we need. For each 1kMj1\leq k\leq M_{j}, we write uj,k=(u^j,k,0ddz)u_{j,k}=(\hat{u}_{j,k},0_{d-d_{z}}). We define Cj,k=𝔹dz(u^j,k,rj4)×[0,1]ddzC_{j,k}=\mathbb{B}_{d_{z}}\left(\hat{u}_{j,k},\frac{r_{j}}{4}\right)\times[0,1]^{d-d_{z}}, and our construction j,k\mathcal{I}_{j,k} has the following properties:

  1. For any lkl\neq k, μj,k,l(z)=μj,k,k(z)\mu_{j,k,l}(z)=\mu_{j,k,k}(z) for any zCj,lz\notin C_{j,l};

  2. For any lkl\neq k, μj,k,k(z)μj,k,l(z)μj,k,k(z)+3rj16\mu_{j,k,k}(z)\leq\mu_{j,k,l}(z)\leq\mu_{j,k,k}(z)+\frac{3r_{j}}{16} for any zCj,lz\in C_{j,l};

  3. For any 1lMj1\leq l\leq M_{j}, under Ij,k,lI_{j,k,l}, pulling an arm that is not in Cj,lC_{j,l} incurs a regret of at least rj32\frac{r_{j}}{32}.

Let xtx_{t} denote the choice of policy π\pi at time tt, and yty_{t} denote the reward. For tj1<ttjt_{j-1}<t\leq t_{j}, we define j,k,lt\mathbb{P}_{j,k,l}^{t} as the distribution of the sequence (x1,y1,,xtj1,ytj1)\left(x_{1},y_{1},\cdots,x_{t_{j-1}},y_{t_{j-1}}\right) under instance Ij,k,lI_{j,k,l} and policy π\pi. By an argument similar to (52), it holds that

supIj,k𝔼[RT(π)]rj32t=1T1Mjl=1Mjj,k,lt(xtCj,l).\displaystyle\sup_{I\in\mathcal{I}_{j,k}}\mathbb{E}\left[R_{T}(\pi)\right]\geq\frac{r_{j}}{32}\sum_{t=1}^{T}\frac{1}{M_{j}}\sum_{l=1}^{M_{j}}\mathbb{P}_{j,k,l}^{t}(x_{t}\notin C_{j,l}). (70)

From our construction, it is easy to see that Cj,k1Cj,k2=C_{j,k_{1}}\cap C_{j,k_{2}}=\varnothing for any k1k2k_{1}\neq k_{2}, so we can construct a test Ψ\Psi such that xtCj,kx_{t}\in C_{j,k} implies Ψ=k\Psi=k. By Lemma 12 with a star graph on [Mj][M_{j}] with center kk, we have

1Mjl=1Mjj,k,lt(xtCj,l)1Mjlkmin{dj,k,kt,dj,k,lt}.\displaystyle\frac{1}{M_{j}}\sum_{l=1}^{M_{j}}\mathbb{P}_{j,k,l}^{t}(x_{t}\notin C_{j,l})\geq\frac{1}{M_{j}}\sum_{l\neq k}\int\min\left\{d\mathbb{P}_{j,k,k}^{t},d\mathbb{P}_{j,k,l}^{t}\right\}. (71)

Combining (70) and (71) gives

supIj,k𝔼[RT(π)]\displaystyle\sup_{I\in\mathcal{I}_{j,k}}\mathbb{E}\left[R_{T}(\pi)\right]\geq rj32t=1T1Mjlkmin{dj,k,kt,dj,k,lt}\displaystyle\;\frac{r_{j}}{32}\sum_{t=1}^{T}\frac{1}{M_{j}}\sum_{l\neq k}\int\min\left\{d\mathbb{P}_{j,k,k}^{t},d\mathbb{P}_{j,k,l}^{t}\right\}
\displaystyle\geq rj32t=1Tj1Mjlkmin{dj,k,kt,dj,k,lt}\displaystyle\;\frac{r_{j}}{32}\sum_{t=1}^{T_{j}}\frac{1}{M_{j}}\sum_{l\neq k}\int\min\left\{d\mathbb{P}_{j,k,k}^{t},d\mathbb{P}_{j,k,l}^{t}\right\}
\displaystyle\geq rjTj321Mjlkmin{dj,k,kTj,dj,k,lTj}\displaystyle\;\frac{r_{j}T_{j}}{32}\cdot\frac{1}{M_{j}}\sum_{l\neq k}\int\min\left\{d\mathbb{P}_{j,k,k}^{T_{j}},d\mathbb{P}_{j,k,l}^{T_{j}}\right\} (72)
\displaystyle\geq rjTj321MjlkAjmin{dj,k,kTj,dj,k,lTj}\displaystyle\;\frac{r_{j}T_{j}}{32}\cdot\frac{1}{M_{j}}\sum_{l\neq k}\int_{A_{j}}\min\left\{d\mathbb{P}_{j,k,k}^{T_{j}},d\mathbb{P}_{j,k,l}^{T_{j}}\right\} (73)
\displaystyle\geq rjTj321MjlkAjmin{dj,k,kTj1,dj,k,lTj1},\displaystyle\;\frac{r_{j}T_{j}}{32}\cdot\frac{1}{M_{j}}\sum_{l\neq k}\int_{A_{j}}\min\left\{d\mathbb{P}_{j,k,k}^{T_{j-1}},d\mathbb{P}_{j,k,l}^{T_{j-1}}\right\}, (74)

where (72) follows from the data processing inequality for total variation and the identity min{dP,dQ}=1TV(P,Q)\int\min\left\{dP,dQ\right\}=1-TV(P,Q), (73) restricts the integration to the event AjA_{j}, and (74) holds because the observations at time TjT_{j} are the same as those at time Tj1T_{j-1} under event AjA_{j}.

For the term Ajmin{dj,k,kTj1,dj,k,lTj1}\int_{A_{j}}\min\left\{d\mathbb{P}_{j,k,k}^{T_{j-1}},d\mathbb{P}_{j,k,l}^{T_{j-1}}\right\}, it holds that

Ajmin{dj,k,kTj1,dj,k,lTj1}=\displaystyle\int_{A_{j}}\min\left\{d\mathbb{P}_{j,k,k}^{T_{j-1}},d\mathbb{P}_{j,k,l}^{T_{j-1}}\right\}= Ajdj,k,kTj1+dj,k,lTj1|dj,k,kTj1dj,k,lTj1|2\displaystyle\;\int_{A_{j}}\frac{d\mathbb{P}_{j,k,k}^{T_{j-1}}+d\mathbb{P}_{j,k,l}^{T_{j-1}}-\left|d\mathbb{P}_{j,k,k}^{T_{j-1}}-d\mathbb{P}_{j,k,l}^{T_{j-1}}\right|}{2}
=\displaystyle= j,k,kTj1(Aj)+j,k,lTj1(Aj)212Aj|dj,k,kTj1dj,k,lTj1|\displaystyle\;\frac{\mathbb{P}_{j,k,k}^{T_{j-1}}(A_{j})+\mathbb{P}_{j,k,l}^{T_{j-1}}(A_{j})}{2}-\frac{1}{2}\int_{A_{j}}\left|d\mathbb{P}_{j,k,k}^{T_{j-1}}-d\mathbb{P}_{j,k,l}^{T_{j-1}}\right|
\displaystyle\geq (j,k,kTj1(Aj)12TV(j,k,kTj1,j,k,lTj1))TV(j,k,kTj1,j,k,lTj1)\displaystyle\;\left(\mathbb{P}_{j,k,k}^{T_{j-1}}(A_{j})-\frac{1}{2}TV\left(\mathbb{P}_{j,k,k}^{T_{j-1}},\mathbb{P}_{j,k,l}^{T_{j-1}}\right)\right)-TV\left(\mathbb{P}_{j,k,k}^{T_{j-1}},\mathbb{P}_{j,k,l}^{T_{j-1}}\right) (75)
=\displaystyle= j,k(Aj)32TV(j,k,kTj1,j,k,lTj1),\displaystyle\;\mathbb{P}_{j,k}(A_{j})-\frac{3}{2}TV\left(\mathbb{P}_{j,k,k}^{T_{j-1}},\mathbb{P}_{j,k,l}^{T_{j-1}}\right), (76)

where (75) uses the inequality |(A)(A)|TV(,)|\mathbb{P}(A)-\mathbb{Q}(A)|\leq TV(\mathbb{P},\mathbb{Q}), and (76) holds because Ij,k=Ij,k,kI_{j,k}=I_{j,k,k} and AjA_{j} can be determined by the observations up to Tj1T_{j-1}.

We use an argument similar to (58) to get

DKL(j,k,kTj1j,k,lTj1)12(3rj16)2𝔼j,kτlrj232𝔼j,kτl,\displaystyle D_{KL}\left(\mathbb{P}_{j,k,k}^{T_{j-1}}\|\mathbb{P}_{j,k,l}^{T_{j-1}}\right)\leq\frac{1}{2}\cdot\left(\frac{3r_{j}}{16}\right)^{2}\mathbb{E}_{\mathbb{P}_{j,k}}\tau_{l}\leq\frac{r_{j}^{2}}{32}\mathbb{E}_{\mathbb{P}_{j,k}}\tau_{l},

where τl\tau_{l} denotes the number of pulls in Cj,lC_{j,l} before the batch containing Tj1T_{j-1}. Then from Lemma 11, we have

1MjlkTV(j,k,kTj1,j,k,lTj1)\displaystyle\frac{1}{M_{j}}\sum_{l\neq k}TV\left(\mathbb{P}_{j,k,k}^{T_{j-1}},\mathbb{P}_{j,k,l}^{T_{j-1}}\right)\leq 1Mjlk1exp(DKL(j,k,kTj1j,k,lTj1))\displaystyle\;\frac{1}{M_{j}}\sum_{l\neq k}\sqrt{1-\exp\left(-D_{KL}\left(\mathbb{P}_{j,k,k}^{T_{j-1}}\|\mathbb{P}_{j,k,l}^{T_{j-1}}\right)\right)}
\displaystyle\leq 1Mjlk1exp(rj232𝔼j,kτl)\displaystyle\;\frac{1}{M_{j}}\sum_{l\neq k}\sqrt{1-\exp\left(-\frac{r_{j}^{2}}{32}\mathbb{E}_{\mathbb{P}_{j,k}}\tau_{l}\right)}
\displaystyle\leq Mj1Mj1exp(rj232(Mj1)lk𝔼j,kτl)\displaystyle\;\frac{M_{j}-1}{M_{j}}\sqrt{1-\exp\left(-\frac{r_{j}^{2}}{32(M_{j}-1)}\sum_{l\neq k}\mathbb{E}_{\mathbb{P}_{j,k}}\tau_{l}\right)}
\displaystyle\leq Mj1Mj1exp(rj2Tj132(Mj1))\displaystyle\;\frac{M_{j}-1}{M_{j}}\sqrt{1-\exp\left(-\frac{r_{j}^{2}T_{j-1}}{32(M_{j}-1)}\right)}
\displaystyle\leq Mj1Mj1exp(Mj32(Mj1)B2)\displaystyle\;\frac{M_{j}-1}{M_{j}}\sqrt{1-\exp\left(-\frac{M_{j}}{32(M_{j}-1)B^{2}}\right)} (77)
\displaystyle\leq Mj1MjMj32(Mj1)B2\displaystyle\;\frac{M_{j}-1}{M_{j}}\sqrt{\frac{M_{j}}{32(M_{j}-1)B^{2}}}
\displaystyle\leq 14B,\displaystyle\;\frac{1}{4B}, (78)

where (77) uses (61).

Combining (74), (76) and (78) yields that

supIj,k𝔼[RT(π)]132rjTj(j,k(Aj)238B)132BT1ε1εB(j,k(Aj)238B),\displaystyle\sup_{I\in\mathcal{I}_{j,k}}\mathbb{E}\left[R_{T}(\pi)\right]\geq\frac{1}{32}r_{j}T_{j}\left(\frac{\mathbb{P}_{j,k}(A_{j})}{2}-\frac{3}{8B}\right)\geq\frac{1}{32B}T^{\frac{1-\varepsilon}{1-\varepsilon^{B}}}\left(\frac{\mathbb{P}_{j,k}(A_{j})}{2}-\frac{3}{8B}\right),

where ε=1dz+2\varepsilon=\frac{1}{d_{z}+2}; the second inequality uses rjTj=Tj/(Tj1εB)=T1ε1εB/Br_{j}T_{j}=T_{j}/(T_{j-1}^{\varepsilon}B)=T^{\frac{1-\varepsilon}{1-\varepsilon^{B}}}/B. This inequality holds for any kMj1k\leq M_{j}-1. Averaging over kk yields

supIkMj1j,k𝔼[RT(π)]132BT1ε1εB(12(Mj1)k=1Mj1j,k(Aj)38B)132BT1ε1εB(716B38B)1512B2T1ε1εB,\displaystyle\begin{split}\sup_{I\in\cup_{k\leq M_{j}-1}\mathcal{I}_{j,k}}\mathbb{E}\left[R_{T}(\pi)\right]\geq&\;\frac{1}{32B}T^{\frac{1-\varepsilon}{1-\varepsilon^{B}}}\left(\frac{1}{2(M_{j}-1)}\sum_{k=1}^{M_{j}-1}\mathbb{P}_{j,k}(A_{j})-\frac{3}{8B}\right)\\ \geq&\;\frac{1}{32B}T^{\frac{1-\varepsilon}{1-\varepsilon^{B}}}\left(\frac{7}{16B}-\frac{3}{8B}\right)\\ \geq&\;\frac{1}{512B^{2}}T^{\frac{1-\varepsilon}{1-\varepsilon^{B}}},\end{split}

where the second inequality follows from pj78Bp_{j}\geq\frac{7}{8B}. Hence, the proof of Lemma 10 is completed. ∎

Finally, combining the above two lemmas, we arrive at the lower bound in Theorem 10. ∎

Appendix G Additional Experimental results

G.1 Repeated experiments of A-BLiN

We present results of A-BLiN with different random seeds in Figure 3, where the figure legends and labels are the same as those in Figure 2b. These results consistently agree with the plot in the paper. The curve of the zooming algorithm in Figures 2b and 3 is the average of 1010 repeated experiments. The reason we did not present an averaged regret curve of A-BLiN in Figure 2b is that we want to show the batch pattern of a single A-BLiN run in the figure. Averaging across different runs breaks the batch pattern. As an example, one stochastic run may end the first batch after 100 observations, while another may end the third batch after 110 observations.

Figure 3: Results of A-BLiN with different random seeds. The figure legends and labels are the same as those in Figure 2b.

G.2 Experimental results of D-BLiN

We run D-BLiN to solve the same problem as in Section 5. The partition and elimination process of this experiment is presented in Figure 4, which shows that the optimal arm xx^{*} is not eliminated during the game, and that only 66 rounds of communications are needed for time horizon T=80000T=80000. Moreover, we present the resulting partition and the accumulated regret in Figure 5.

Figure 4: Partition and elimination process of D-BLiN. The ii-th subfigure shows the pattern before the ii-th batch. Dark gray cubes are those eliminated in the most recent batch, while the light gray ones are those eliminated in earlier batches. For the total time horizon T=80000T=80000, D-BLiN needs 66 rounds of communications.
(a) Partition (b) Regret
Figure 5: Resulting partition and regret of D-BLiN. In Figure 5a, we show the resulting partition of D-BLiN. The background color denotes the true value of expected reward μ\mu, and blue means high values. The figure shows that the partition is finer for larger values of μ\mu. In Figure 5b, we show accumulated regret of D-BLiN. In the figure, different background colors represent different batches. For the total time horizon T=80000T=80000, D-BLiN needs 66 rounds of communications (the first two batches are too small and are combined with the third batch in the plot).

Appendix H Auxiliary Lemmas

Lemma 11 (Bretagnolle-Huber Inequality [12]).

Let PP and QQ be any probability measures on the same probability space. It holds that

TV(P,Q)1exp(DKL(PQ))112exp(DKL(PQ)).\displaystyle TV(P,Q)\leq\sqrt{1-\exp\left(-D_{KL}(P\|Q)\right)}\leq 1-\frac{1}{2}\exp\left(-D_{KL}(P\|Q)\right).
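As a small illustration (not needed for the proofs), the snippet below checks Lemma 11 for two unit-variance Gaussians N(0,1) and N(delta,1), for which the total variation and KL divergence have the standard closed forms TV = erf(delta/(2*sqrt(2))) and KL = delta^2/2.

```python
import math

# Illustrative check of the Bretagnolle-Huber inequality for N(0,1) vs N(delta,1).
for delta in (0.1, 0.5, 1.0, 2.0, 5.0):
    tv = math.erf(delta / (2 * math.sqrt(2)))   # total variation in closed form
    kl = delta ** 2 / 2                         # KL divergence in closed form
    assert tv <= math.sqrt(1 - math.exp(-kl)) <= 1 - 0.5 * math.exp(-kl)
print("Lemma 11 holds in these examples")
```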
Lemma 12 ([22]).

Let Q1,,QnQ_{1},\cdots,Q_{n} be probability measures over a common probability space (Ω,)(\Omega,\mathcal{F}), and Ψ:Ω[n]\Psi:\Omega\rightarrow[n] be any measurable function (i.e., test). Then for any tree T=([n],E)T=([n],E) with vertex set [n][n] and edge set EE, we have

  1. 1ni=1nQi(Ψi)1n(i,j)Emin{dQi,dQj};\frac{1}{n}\sum_{i=1}^{n}Q_{i}(\Psi\neq i)\geq\frac{1}{n}\sum_{(i,j)\in E}\int\min\{dQ_{i},dQ_{j}\};

  2. 1ni=1nQi(Ψi)12n(i,j)Eexp(DKL(QiQj)).\frac{1}{n}\sum_{i=1}^{n}Q_{i}(\Psi\neq i)\geq\frac{1}{2n}\sum_{(i,j)\in E}\exp\left(-D_{KL}(Q_{i}\|Q_{j})\right).
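As a concrete illustration of item 2 with a star graph, the sketch below uses three hypothetical distributions on a three-point space and the maximum-likelihood test; the probabilities are arbitrary example values.

```python
import math

# Illustrative check of Lemma 12 (item 2, star graph centered at the first distribution)
# for three hypothetical distributions on a 3-point space and the maximum-likelihood test.
Q = [[0.6, 0.3, 0.1], [0.3, 0.5, 0.2], [0.2, 0.3, 0.5]]
n = len(Q)

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# left-hand side: average error probability of the maximum-likelihood test Psi
lhs = 0.0
for omega in range(3):
    psi = max(range(n), key=lambda i: Q[i][omega])
    lhs += sum(Q[i][omega] for i in range(n) if i != psi) / n

# right-hand side of item 2 with edges (1,2) and (1,3) of the star centered at Q_1
rhs = sum(math.exp(-kl(Q[0], Q[j])) for j in (1, 2)) / (2 * n)
assert lhs >= rhs
print(f"average test error {lhs:.3f} >= star-graph bound {rhs:.3f}")
```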