This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

Explicit Best Arm Identification in Linear Bandits Using No-Regret Learners

Mohammadi Zaki
Electrical Communication Engineering,
Indian Institute of Science,
Bangalore 560012.
[email protected], Avi Mohan
Faculty of Electrical Engineering,
Technion, Israel Institute of Technology,
Haifa 3200003.
[email protected] Aditya Gopalan
Electrical Communication Engineering,
Indian Institute of Science,
Bangalore 560012.
[email protected]
Under review. Please do not distribute.

Abstract.

We study the problem of best arm identification in linearly parameterised multi-armed bandits. Given a set of feature vectors 𝒳d,\mathcal{X}\subset\mathbb{R}^{d}, a confidence parameter δ\delta and an unknown vector θ,\theta^{*}, the goal is to identify argmaxx𝒳xTθ\mathop{\mathrm{argmax}}_{x\in\mathcal{X}}x^{T}\theta^{*}, with probability at least 1δ,1-\delta, using noisy measurements of the form xTθ.x^{T}\theta^{*}. For this fixed confidence (δ\delta-PAC) setting, we propose an explicitly implementable and provably order-optimal sample-complexity algorithm to solve this problem. Previous approaches rely on access to minimax optimization oracles. The algorithm, which we call the Phased Elimination Linear Exploration Game (PELEG), maintains a high-probability confidence ellipsoid containing θ\theta^{*} in each round and uses it to eliminate suboptimal arms in phases. PELEG achieves fast shrinkage of this confidence ellipsoid along the most confusing (i.e., close to, but not optimal) directions by interpreting the problem as a two player zero-sum game, and sequentially converging to its saddle point using low-regret learners to compute players’ strategies in each round. We analyze the sample complexity of PELEG and show that it matches, up to order, an instance-dependent lower bound on sample complexity in the linear bandit setting. We also provide numerical results for the proposed algorithm consistent with its theoretical guarantees.

1 Introduction

Function optimization over structured domains is a basic sequential decision making problem. A well-known formulation of this problem is Probably Approximately Correct (PAC) best arm identification in multi-armed bandits [7], in which a learner is given a set of arms with unknown (and unrelated) means. The learner must sequentially test arms and output, as soon as possible with high confidence, a near-optimal arm (where optimality is defined in terms of the largest mean).

Often, the arms (decisions) and their associated rewards, possess structural relationships, allowing for more efficient learning of the rewards and transfer of learnt information, e.g., two ‘close enough’ arms may have similar mean rewards. One of the best-known examples of structured decision spaces is the linear bandit, whose arms are vectors (points) in d\mathbb{R}^{d}. The reward or function value of an arm is an unknown linear function of its vector representation, and the goal is to find an arm with maximum reward in the shortest possible time by measuring arms’ rewards sequentially with noise. This framework models an array of structured online linear optimization problems including adaptive routing [2], smooth function optimization over graphs [20], subset selection [13] and, in the nonparametric setting, black box optimization in smooth function spaces [18], among others.

Although no-regret online learning for linear bandits is a well-understood problem (see [14] and references therein), the PAC-sample complexity of best arm identification in this model has not received significant attention until recently [17]. The state of the art here is the work of Fiez et al. [8], who give an algorithm with optimal (instance-dependent) PAC sample complexity. However, a closer look indicates that the algorithm assumes repeated oracle access to a minimax optimization problem111This is, in fact, a plug-in version of a minimax optimization problem representing an information-theoretic sample complexity lower bound for the problem.; it is not clear, from a performance standpoint, in what manner (and to what accuracy) this optimization problem should be practically solved222For its experiments, the paper implements a (approximate) minimax oracle using the Frank-Wolfe algorithm and a heuristic stopping rule, but this is not rigorously justifiable for nonsmooth optimization, see Sec. 4. to enjoy the claimed sample complexity. Hence, the question of how to design an explicit algorithm with optimal PAC sample complexity for best arm identification in linear bandits has remained open.

In this paper, we resolve this question affirmatively by giving an explicit linear bandit best-arm identification algorithm with instance-optimal PAC sample complexity and, more importantly, a clearly quantified computational effort. We achieve this goal using new techniques: the main ingredient in the proposed algorithm is a game-theoretic interpretation of the minimax optimization problem that is at the heart of the instance-based sample complexity lower bound. This in turn yields an adaptive, sample-based approach using carefully constructed confidence sets for the unknown parameter θ\theta^{*}. The adaptive sampling strategy is driven by the interaction of 2 no-regret online learning subroutines that attempt to solve the minimax problem approximately, obviating the worry of i) solving the optimal minimax allocation to a suitable precision and ii) making an integer sampling allocation from it by rounding, which occur in the approach of Fiez et al [8]. We note that the seeds of this game-theoretic approach were laid by the recent work of Degenne et al. [6] for the simple (i.e., unstructured) multiarmed bandit problem. However, our work demonstrates a novel extension of their methodology to solve best-arm learning in structured multi-armed bandits for the first time to the best of our knowledge.

1.1 Related Work

The PAC best arm identification problem for linearly parameterised bandits is first studied in [17], in which an adaptive algorithm is given with a sample complexity guarantee involving a hardness term (MM^{*}) which in general renders the sample complexity suboptimal. Tao et al [19] take the path of constructing new estimators instead of ordinary least squares, using which they give an algorithm achieving the familiar sum-of-inverse-gaps sample complexity known for standard bandits; this is, however, not optimal for general linear bandits. The LinGapE algorithm [15] is an attempt at solving best arm identification with a fully adaptive strategy, but its sample complexity in general is not instance-optimal and can additionally scale with the total number of arms, in addition to the extra dimension-dependence known to be incurred by self-normalized inequality-based confidence set constructions [1]. Zaki et al [21] design a fully adaptive algorithm based on the Lower-Upper Confidence Bound (LUCB) principle [11] with limited guarantees for 2 or 3 armed settings. Fiez et al [8] give a phased elimination algorithm achieving the ideal information-theoretic sample complexity but with minimax oracle access and an additional rounding operation; we detail an explicit arm-playing strategy that eliminates both these steps, in the same high-level template. In a separate vein, game-theoretic techniques to solve minimax problems have been in existence for over a couple of decades [9]; only recently have they been combined with optimism to give a powerful framework to solve adaptive hypothesis testing problems [6].

Table 1 compares the sample complexities of various best arm identification algorithms in the literature.

2 Problem Statement and Notation

We study the problem of best arm identification in linear bandits with the arm set 𝒳{x1,x2,,xK}\mathcal{X}\equiv\{x_{1},x_{2},\ldots,x_{K}\}, where each arm333In general, the number of arms can be much larger than the ambient dimension, i.e., dK.d\ll K. xax_{a} is a vector in d\mathbb{R}^{d}. We will interchangeably use 𝒳\mathcal{X} and the set [K]{1,2,,K}[K]\equiv\{1,2,\ldots,K\}, whenever the context is clear. In every round t=1,2,t=1,2,\ldots the agent chooses an arm xt𝒳x_{t}\in\mathcal{X}, and receives a reward y(xt)=θTxt+ηty(x_{t})={\theta^{*}}^{T}x_{t}+\eta_{t}, where θ\theta^{*} is assumed to be a fixed but unknown vector, and ηt\eta_{t} is zero-mean noise assumed to be conditionally 11- subgaussian, i.e., γ,𝔼[eγηt|x1,x2,,xt1,η1,η2,,ηt1]exp(γ22)\forall\gamma\in\mathbb{R},\mathbb{E}\left[{e^{\gamma\eta_{t}}|x_{1},x_{2},\ldots,x_{{t-1}},\eta_{1},\eta_{2},\ldots,\eta_{t-1}}\right]\leqslant\exp\Big{(}\frac{\gamma^{2}}{2}\Big{)}. We denote by νθk\nu^{k}_{\theta^{*}} the distribution of the reward obtained by pulling arm k[K],k\in[K], i.e., t1,y(xt)νθk,\forall t\geqslant 1,y(x_{t})\sim\nu^{k}_{\theta^{*}}, whenever xt=xk.x_{t}=x_{k}. Given two probability distributions μ,ν\mu,\nu over \mathbb{R}, KL(μ,ν)KL(\mu,\nu) denotes the KL Divergence of μ\mu and ν\nu (assuming μν\mu\ll\nu). Given θd\theta\in\mathbb{R}^{d}, let aa(θ)=argmaxa[K]θTxaa^{*}\equiv a^{*}(\theta)=\mathop{\mathrm{argmax}}\limits_{a\in[K]}{\theta}^{T}x_{a}, where we assume that θ\theta is such that the argmax is unique.

A learning algorithm for the best arm identification problem comprises the following rules: (1) a sampling rule, which determines based on the past play of arms and observations, which arm to pull next, (2) a stopping rule, which controls the end of sampling phase and is a function of the past observations and reward, and (3) a recommendation rule, which, when the algorithm stops, offers a guess for the best arm. The goal of a learning algorithm is: Given an error probability δ>0\delta>0, identify (guess) aa^{*} with probability 1δ\geqslant 1-\delta by pulling as few (in an expected sense) arms as possible. Any algorithm that (1) stops with probability 1 and (2) returns aa^{*} upon stopping with probability at least 1δ1-\delta is said to be δ\delta-Probably Approximately Correct (δ\delta-PAC). For clarity of exposition, we distinguish the above linear bandit setting from what we term the unstructured bandit setting, wherein K=d,K=d, and xi=e^i,k[K]x_{i}=\widehat{e}_{i},~{}\forall k\in[K] the canonical basis vectors (the former setting generalizes the latter). The (expected) number of samples τ\tau\in\mathbb{N} consumed by an algorithm in determining the optimal arm in any bandit setting (not necessarily the linear setting) is called its sample complexity.

In the rest of the paper, we will assume that xk21,xk𝒳\left\lVert x_{k}\right\rVert_{2}\leqslant 1,\forall x_{k}\in\mathcal{X}. Given a positive definite matrix AA, we denote by xA:=xTAx\left\lVert x\right\rVert_{A}:=\sqrt{x^{T}Ax}, the matrix norm induced by AA. For any i[K],iai\in[K],i\neq a^{*}, we define Δi:=θT(xaxi)\Delta_{i}:={\theta^{*}}^{T}(x_{a^{*}}-x_{i}) to be the gap between the largest expected reward and the expected reward of (suboptimal) arm xix_{i}. Let Δmin:=mini[K]Δi\Delta_{\min}:=\min\limits_{\begin{subarray}{c}i\in[K]\end{subarray}}\Delta_{i}. We denote B(z,r)B(z,r) as the closed ball with center zz and radius rr. For any measurable space (Ω,),\left(\Omega,\mathcal{F}\right), we define 𝒫(Ω)\mathcal{P}(\Omega) to be the set of all probability measures on Ω\Omega. 𝒪~\tilde{\mathcal{O}} is big-Oh notation that suppresses logarithmic dependence on problem parameters. For the benefit of the reader, we provide a glossary of commonly used symbols in Sec. A in the Appendix.

Algorithm Sample Complexity Remarks
𝒳𝒴\mathcal{X}\mathcal{Y}-static [17] 𝒪(dΔmin(ln1δ+lnK+ln1Δmin)+d2)\mathcal{O}\Big{(}\frac{d}{\Delta_{\min}}(\ln\frac{1}{\delta}+\ln K+\ln\frac{1}{\Delta_{min}})+d^{2}\Big{)}
Static allocation, worst-case optimal
Dependence on dd cannot be removed
LinGapE444Here H0H_{0} is a complicated term defined in terms of a solution to an offline optimization problem in [15]. [15] 𝒪(dH0log(dH0log1δ))\mathcal{O}\Big{(}dH_{0}\log\Big{(}dH_{0}\log\frac{1}{\delta}\Big{)}\Big{)} Fully adaptive, sub-optimal in general.
ALBA [19] 𝒪(i=1d1Δ(i)2ln(Kδ+ln1Δmin))\mathcal{O}\Big{(}\sum_{i=1}^{d}\frac{1}{\Delta^{2}_{(i)}}\ln\left(\frac{K}{\delta}+\ln\frac{1}{\Delta_{min}}\right)\Big{)} Fully adaptive, sub-optimal in general (see [8])
RAGE [8] 𝒪(1Dθlog1/Δminlog(K2log21/Δminδ))\mathcal{O}\left(\frac{1}{D_{\theta^{*}}}\log{1/\Delta_{\min}}\log{\left(\frac{K^{2}\log^{2}{1/\Delta_{min}}}{\delta}\right)}\right)
Instance-optimal, but
Minimax oracle required
PELEG (this paper) 𝒪(log2(1/Δmin)Dθ[log2((log2(1/Δmin))2K2/δ)C2])\mathcal{O}\left(\frac{\log_{2}\left(1/\Delta_{min}\right)}{D_{\theta^{*}}}\left[\frac{\log^{2}\left(\left(\log_{2}\left(1/\Delta_{min}\right)\right)^{2}K^{2}/\delta\right)}{C^{2}}\right]\right)
Instance-optimal (upto a factor of C2C^{2}),
Explicitly implementable
Table 1: Comparison of Sample complexities achieved by various algorithms for the Linear Multi-armed Bandit problem in the literature. Note that KK is the number of arms, dd is the ambient dimension, δ\delta is the PAC guarantee parameter and Δmin\Delta_{min} is the minimum reward gap. H0H_{0} is a complicated term defined in terms of a solution to an offline optimization problem in [15].

Note: C:=λmin(x𝒳xxT)C:=\lambda_{\min}\left(\sum_{x\in\mathcal{X}}xx^{T}\right) is a term depending only on the geometry of the arm set.

3 Overview of Algorithmic Techniques

In this section we describe the main ingredients in our algorithm design and how they build upon ideas introduced in recent work [6, 8] (the explicit algorithm appears in Sec. 4).

The phased elimination approach: Fiez et al [8]. We first note that a lower bound on the sample complexity of any δ\delta- PAC algorithm for the canonical (i.e., unstructured) bandit setting [10] was generalized by Fiez et al [8] to the linear bandit setting, assuming {ηt}t1\{\eta_{t}\}_{t\geqslant 1} to be standard normal random variables. This result states that any δ\delta-PAC algorithm in the linear setting must satisfy 𝔼θ[τ](log1/2.4δ)1Tθ(log1/2.4δ)1Dθ\mathbb{E}_{\theta^{*}}[\tau]\geqslant\left(\log{1/2.4\delta}\right)\frac{1}{T_{\theta^{*}}}\geqslant\left(\log{1/2.4\delta}\right)\frac{1}{D_{\theta^{*}}}, where Tθ:=maxw𝒫(𝒳)minθ:a(θ)a(θ)k[K]wkKL(νθk,νθk)T_{\theta^{*}}:=\max_{w\in\mathcal{P}(\mathcal{X})}\min_{\theta:a^{*}(\theta)\neq a^{*}(\theta^{*})}\sum_{k\in[K]}w_{k}KL\left(\nu^{k}_{\theta},\nu^{k}_{\theta^{*}}\right) and Dθ:=maxw𝒫(𝒳)minx𝒳,xx(θT(xx))2xx(x𝒳wxxxT)12D_{\theta^{*}}:=\max\limits_{w\in\mathcal{P}(\mathcal{X})}\min\limits_{x\in\mathcal{X},x\neq x^{*}}\frac{\left({\theta^{*}}^{T}\left(x^{*}-x\right)\right)^{2}}{\left\lVert x^{*}-x\right\rVert^{2}_{\left(\sum_{x\in\mathcal{X}}w_{x}xx^{T}\right)^{-1}}}, where x=xax^{*}=x_{a^{*}}. The bound suggests a natural δ\delta-PAC strategy, namely, to sample arms according to the distribution

w=argminw𝒫(𝒳)maxx𝒳{x}xx(x𝒳wxxxT)12((xx)θ)2.w^{*}=\mathop{\mathrm{argmin}}_{w\in\mathcal{P}(\mathcal{X})}\max_{x\in\mathcal{X}\setminus\{x^{*}\}}\frac{\parallel x^{*}-x\parallel^{2}_{\left(\sum_{x\in\mathcal{X}}w_{x}xx^{T}\right)^{-1}}}{\left(\left(x^{*}-x\right)\theta^{*}\right)^{2}}. (1)

In fact, as [8, Sec. 2.2] explains, using the Ordinary Least Squares (OLS) estimator θ^\widehat{\theta} for θ\theta^{*} and sampling arm x𝒳x\in\mathcal{X} exactly 2wN2\lfloor w^{*}N\rfloor times with N=𝒪(logK/δDθ)N=\mathcal{O}\left(\frac{\log{K/\delta}}{D_{\theta^{*}}}\right) ensures (xx)Tθ^>0,xx(x-x^{*})^{T}\widehat{\theta}>0,~{}\forall x\neq x^{*} with probability 1δ.\geqslant 1-\delta. Unfortunately, this sampling distribution cannot directly be implemented since xx^{*} is unknown.

Fiez et al circumvent this difficulty by designing a nontrivial strategy (RAGE) that attempts to mimic the optimal allocation ww^{*} in phases. Specifically, in phase mm, it tries to eliminate arms that are about 2m2^{-m}-suboptimal (in their gaps), by solving (1) with a plugin estimate of θ\theta^{*}. The resulting fractional allocation, passed through a separate discrete rounding procedure, gives an integer pull count distribution which ensures that all surviving arms’ mean differences are estimated with high precision and confidence.

Though direct and appealing, this phased elimination strategy is based crucially on solving minimax problems of the form (1). Though the inner (max) function is convex as a function of ww on the probability simplex (see e.g., Lemma 1 in [21]), it is non-smooth, and it is not made explicit how, and to what extent, it must be solved in [8]. Fortunately, we are able to circumvent this obstacle by using ideas from games between no-regret online learners with optimism, as introduced by the work of Degenne et al [6] for unstructured bandits.

From Pure-exploration Games to δ\delta-PAC Algorithms: Degenne et al [6]. We briefly explain some of the insights in [6] that we leverage to design an explicit linear bandit-δ\delta-PAC algorithm with low computational complexity. For a fixed weight parameter θd,\theta^{*}\in\mathbb{R}^{d}, consider the two-player, zero-sum Pure-exploration Game in which the MAXMAX player (or column player) plays an arm k[K]k\in[K] while the MINMIN (or row) player chooses an alternative bandit model θd\theta\in\mathbb{R}^{d} such that a(θ)a.a^{*}(\theta)\neq a^{*}. MAXMAX then receives a payoff of k[K]KL(νθk,νθk)\sum_{k\in[K]}KL\left(\nu^{k}_{\theta^{*}},\nu^{k}_{\theta}\right) from MIN.MIN. For a given w𝒫(𝒳),w\in\mathcal{P}(\mathcal{X}), define Tθ(w)=minθ:a(θ)ax𝒳wxxxT,T_{\theta^{*}}(w)=\min_{\theta:a^{*}(\theta)\neq a^{*}}\sum_{x\in\mathcal{X}}w_{x}xx^{T}, and w(θ)w^{*}(\theta^{*}) the mixed strategy that attains Tθ.T_{\theta^{*}}. With MAXMAX moving first and playing a mixed strategy w𝒫(𝒳),w\in\mathcal{P}(\mathcal{X}), the value of the game becomes TθT_{\theta^{*}}. In the unstructured bandit setting, to match the sample complexity lower bound, any algorithm must essentially sample arm k[K]k\in[K] at rate NtKtwk(θ),\frac{N^{K}_{t}}{t}\rightarrow w^{*}_{k}(\theta^{*}), where NtkN^{k}_{t} is the number of times Arm kk has been sampled up to time tt [12]. This helps explain why any δ\delta-PAC algorithm implicitly needs to solve the Pure Exploration Game Tθ.T_{\theta^{*}}.

We crucially employ no-regret online learners to solve the Pure Exploration Game for linear bandits. More precisely, no-regret learning with the well-known Exponential Weights rule/Negative-entropy mirror descent algorithm [16] on one hand, and a best-response convex programming subroutine on the other, provides a direct sampling strategy that obviates the need for separate allocation optimization and rounding for sampling as in [8]. One crucial advantage of our approach (inspired by [6]) is that we only use a best response oracle to solve for Tθ(w)T_{\theta^{*}}(w), which gives us a computational edge over [8] who employ the computationally more costly max-min oracle to solve Tθ(w),T_{\theta^{*}}(w), or, its linear bandit equivalent, Dθ.D_{\theta^{*}}.

4 Algorithm and Sample Complexity Bound

Our algorithm, that we call “Phased Elimination Linear Exploration Game” (PELEG), is presented in detail as Algorithm 1. PELEG proceeds in phases with each phase consisting of multiple rounds, maintaining a set of active arms 𝒳m\mathcal{X}_{m} for testing during Phase mm. An OLS estimate θ^m\widehat{\theta}_{m} of θ\theta^{*} is used to estimate the mean reward of active arms and, at the end of phase mm, every active arm with a plausible reward more than 2m\approx 2^{-m} below that of some arm in 𝒳m\mathcal{X}_{m} is eliminated. Suppose 𝒮m:={x𝒳{x}:θT(xx)<12m}\mathcal{S}_{m}:=\left\{x\in\mathcal{X}\setminus\{x^{*}\}:{\theta^{*}}^{T}\left(x^{*}-x\right)<\frac{1}{2^{m}}\right\}. If we can ensure that 𝒳m𝒮m\mathcal{X}_{m}\subset\mathcal{S}_{m} in every Phase m1,m\geqslant 1, then PELEG will terminate within log2(1/Δmin)\lceil\log_{2}(1/\Delta_{\min})\rceil phases, where Δmin=minxxθT(xx).\Delta_{\min}=\min_{x\neq x^{*}}{\theta^{*}}^{T}\left(x^{*}-x\right). This statement is proved in Corollary 2 in the Supplementary Material.

If we knew θ,\theta^{*}, then we could sample arms according to the optimal distribution ww^{*} in (1). However, since all we now have at our disposal is the knowledge that Δi2m,xi𝒳m,\Delta_{i}\leqslant 2^{-m},~{}\forall x_{i}\in\mathcal{X}_{m}, we can instead construct a sampling distribution wmw^{*}_{m} by solving the surrogate wm=argminw𝒫(𝒳)maxx,x𝒳m:xxxx(x𝒳wxxxT)12w^{*}_{m}=\mathop{\mathrm{argmin}}_{w\in\mathcal{P}(\mathcal{X})}\max_{x,x^{\prime}\in\mathcal{X}_{m}:x\neq x^{\prime}}\parallel x-x^{\prime}\parallel^{2}_{\left(\sum_{x\in\mathcal{X}}w_{x}xx^{T}\right)^{-1}}, and sampling each arm in 𝒳m\mathcal{X}_{m} sufficiently often to produce a small enough confidence set. This is precisely what RAGE [8] does. However solving this optimization is, as mentioned in Sec. 3, computationally expensive and RAGE repeatedly accesses a minmax oracle to do this. Note that in simulating this algorithm, the authors implement an approximate oracle using the Frank-Wolfe method to solve the outer optimization over ww [8, Sec. F]. The max\max operation, however, renders the optimization objective non-smooth, and it is well-known that the Frank-Wolfe iteration can fail with even simple non-differentiable objectives (see e.g., [4]). We, therefore, deviate from RAGE at this point by employing three novel techniques, the first two motivated by ideas in [6].

  • We formulate the above minimax problem as a two player, zero-sum game. We solve the game sequentially, converging to its Nash equilibrium by invoking the use of the EXP-WTS algorithm [3]. Specifically, in each round tt in a phase, PELEG supplies EXP-WTS with an appropriate loss function lt1MAXl^{MAX}_{t-1} and receives the requisite sampling distribution wtw_{t} (lines 15 & 18 of the algorithm). This wtw_{t} is then fed to the second no-regret learner – a best response subroutine – that finds the ‘most confusing’ plausible model λ\lambda to focus next on (line 16). This is a minimization of a quadratic function over a union of finitely many convex sets (halfspaces intersecting a ball) which can be transparently implemented in polynomial time.

  • Once the sampling distribution is found, there still remains the problem of actually sampling according to it. Given a distribution w𝒫(𝒳m),w\in\mathcal{P}(\mathcal{X}_{m}), approximating it by sampling x𝒳x\in\mathcal{X} Nwx\lfloor Nw_{x}\rfloor or Nwx\lceil Nw_{x}\rceil times can lead to too few (resp. many) samples. Other naive sampling strategies are, for the same reason, unusable. While [8] invokes a specialized rounding algorithm for this purpose, we opt for a more efficient tracking procedure (line 19): In each Round tt of Phase mm, we sample Arm kt:=argmink[K]nt1k/s=1twskk_{t}:=\mathop{\mathrm{argmin}}\limits_{k\in[K]}n_{t-1}^{k}/\sum\limits_{s=1}^{t}w_{s}^{k}, where ntkn^{k}_{t} is the number of times Arm kk has been sampled up to time t.t. In Lem. 3, we show that this procedure is efficient, i.e., s=1twsk(K1)ntks=1twsk+1.\sum\limits_{s=1}^{t}w_{s}^{k}-\left(K-1\right)\leqslant n_{t}^{k}\leqslant\sum\limits_{s=1}^{t}w_{s}^{k}+1.

  • Finally, in each phase mm, we need to sample arms often enough to (i) construct confidence intervals of size at most 2(m+1)2^{-(m+1)} around (xx)Tθ,x,x𝒳m,(x-x^{\prime})^{T}\theta^{*},~{}\forall x,x^{\prime}\in\mathcal{X}_{m}, (ii) ensure that 𝒳m𝒮m\mathcal{X}_{m}\subset\mathcal{S}_{m} and (iii) that x𝒳m.x^{*}\in\mathcal{X}_{m}. In Sec. E, we prove a Key Lemma (whose argument is discussed in Sec. 5) to show that our novel Phase Stopping Criterion ensures this with high probability.

It is worth remarking that naively trying to adapt the strategy of Degenne et al [6] to the linear bandit structure yields a suboptimal (multiplicative d\sqrt{d}) dependence in the sample complexity, thus we adopt the phased elimination template of Fiez et al [8]. We also find, interestingly, that this phased structure eliminates the need to use more complex, self-tuning online learners like AdaHedge [5] in favour of the simpler Exponential Weights (Hedge).

Algorithm 1 Phased Elimination Linear Exploration Game (PELEG)
1:Input: 𝒳,δ.\mathcal{X},\delta.
2:Init: m1,𝒳m𝒳.m\leftarrow 1,\mathcal{X}_{m}\leftarrow\mathcal{X}.
3:Cλmin(k=1KxkxkT).C\leftarrow\lambda_{min}\left(\sum\limits_{k=1}^{K}x_{k}x_{k}^{T}\right).
4:while {|𝒳m|>1}\left\{\left|\mathcal{X}_{m}\right|>1\right\} do
5:     δmδm2.\delta_{m}\leftarrow\frac{\delta}{m^{2}}.
6:     Dm2(21)Cmaxx,x𝒳m,xxxx22logKD_{m}\leftarrow 2(\sqrt{2}-1)\sqrt{\frac{{C}}{{\max\limits_{x,x^{\prime}\in\mathcal{X}_{m},x\neq x^{\prime}}\left\lVert x-x^{\prime}\right\rVert_{2}^{2}\log K}}}
7:     εmmin{1,DmC8log(K2/δm)}(12)m+1.\varepsilon_{m}\leftarrow\min\left\{1,\frac{D_{m}\sqrt{C}}{\sqrt{8\log\left(K^{2}/\delta_{m}\right)}}\right\}\left(\frac{1}{2}\right)^{m+1}.
8:     x𝒳m,𝒞m(x):={λd:x𝒳m,xx|λTxλTx+εm}\forall x\in\mathcal{X}_{m},\mathcal{C}_{m}(x):=\left\{\lambda\in\mathbb{R}^{d}:\exists x^{\prime}\in\mathcal{X}_{m},x^{\prime}\neq x|\lambda^{T}x^{\prime}\geqslant\lambda^{T}x+\varepsilon_{m}\right\}.
9:     t1,n0k0,k[K]t\leftarrow 1,n_{0}^{k}\leftarrow 0,\forall k\in[K].
10:     Play each arm in 𝒳\mathcal{X} once and collect rewards Ykνk,1kKY_{k}\sim\nu_{k},1\leqslant k\leqslant K. \triangleright Burn-in period
11:     k[K],nkt|t=K=nkK1\forall k\in[K],~{}n_{k}^{t}\bigg{|}_{t=K}=n_{k}^{K}\leftarrow 1, Vtm|t=K=VKmk=1KxkxkTV_{t}^{m}\bigg{|}_{t=K}=V_{K}^{m}\leftarrow\sum\limits_{k=1}^{K}x_{k}x_{k}^{T}, tKt\leftarrow K.
12:     Initialize 𝒜mMAXEXPWTS\mathcal{A}^{MAX}_{m}\equiv EXP-WTS with expert set {e^1,,e^K}K\{\widehat{e}_{1},\cdots,\widehat{e}_{K}\}\subset\mathbb{R}^{K} and loss function lt1MAX()l^{MAX}_{t-1}().\triangleright MAXMAX player: EXP-WTS
13:     while {minλx𝒳m𝒞m(x)B(0,Dm)λVtm28log(K2/δm)}\left\{\min\limits_{\lambda\in\bigcup\limits_{x\in\mathcal{X}_{m}}\mathcal{C}_{m}(x)\cap B\left(0,D_{m}\right)}\left\lVert\lambda\right\rVert^{2}_{V_{t}^{m}}\leqslant 8\log\left(K^{2}/{\delta_{m}}\right)\right\} do \triangleright Phase Stopping Criterion
14:         tt+1t\leftarrow t+1.
15:         Get wtw_{t} from 𝒜mMAX\mathcal{A}^{MAX}_{m} and form the matrix Wt=k=1KwtkxkxkTW_{t}=\sum\limits_{k=1}^{K}w_{t}^{k}x_{k}x_{k}^{T}.
16:         λtargminλx𝒳m𝒞m(x)B(0,Dm)λWt2\lambda_{t}\leftarrow\mathop{\mathrm{argmin}}\limits_{\lambda\in\cup_{x\in\mathcal{X}_{m}}\mathcal{C}_{m}(x)\cap B\left(0,D_{m}\right)}\left\lVert\lambda\right\rVert^{2}_{W_{t}}.
17:         For k[K],Utk:=(λtTxk)2.k\in[K],U_{t}^{k}:=\left(\lambda_{t}^{T}x_{k}\right)^{2}. \triangleright MINMIN player: Best response
18:         Construct loss function ltMAX(w)=wTUtl_{t}^{MAX}\left(w\right)=-w^{T}U_{t} .
19:         Play arm kt:=argmink[K]nt1ks=1twskk_{t}:=\mathop{\mathrm{argmin}}\limits_{k\in[K]}\frac{n_{t-1}^{k}}{\sum\limits_{s=1}^{t}w_{s}^{k}} \triangleright Tracking
20:         ntktntkt+1n_{t}^{k_{t}}\leftarrow n_{t}^{k_{t}}+1
21:         Collect sample Yt=θTxkt+ηtY_{t}={\theta^{*}}^{T}x_{k_{t}}+\eta_{t}
22:         Vtm=Vt1m+xktxktT.V_{t}^{m}=V_{t-1}^{m}+x_{k_{t}}{x_{k_{t}}}^{T}.
23:     end while
24:     NmtN_{m}\leftarrow t
25:     Update: θ^m(VNmm)1(s=1NmYsxks)\widehat{\theta}_{m}\leftarrow\left({V^{m}_{N_{m}}}\right)^{-1}\left(\sum\limits_{s=1}^{N_{m}}Y_{s}x_{k_{s}}\right) \triangleright Least-squares estimate of θ\theta^{*}
26:     Update: 𝒳m+1𝒳m\{x𝒳m|x𝒳m:θ^mT(xx)>2(m+2)}\mathcal{X}_{m+1}\leftarrow\mathcal{X}_{m}\Big{\backslash}\left\{x\in\mathcal{X}_{m}|\exists x^{\prime}\in\mathcal{X}_{m}:{\widehat{\theta}_{m}}^{T}\left(x^{\prime}-x\right)>2^{-(m+2)}\right\}
27:     mm+1m\leftarrow m+1
28:end while
29:return 𝒳m\mathcal{X}_{m} \triangleright Output surviving arm

The main theoretical result of this paper is the following performance guarantee.

Theorem 1 (Sample Complexity of Algorithm 1).

With probability at least 1δ1-\delta, PELEG returns the optimal arm after τ\tau rounds, with

τ2048log2(1/Δmin)Dθ[(log((log2(1/Δmin))2K2/δ))2logKC2]+256log2(1/Δmin)Dθlog((log2(1/Δmin))2K2/δ)=𝒪~(log2(K2/δ)C2Dθ).\begin{split}\tau\leqslant 2048\frac{\log_{2}\left(1/\Delta_{min}\right)}{D_{\theta^{*}}}\left[\frac{\left(\log\left(\left(\log_{2}\left(1/\Delta_{min}\right)\right)^{2}K^{2}/\delta\right)\right)^{2}\log K}{C^{2}}\right]+\\ 256\frac{\log_{2}\left(1/\Delta_{min}\right)}{D_{\theta^{*}}}\log\left(\left(\log_{2}\left(1/\Delta_{min}\right)\right)^{2}K^{2}/\delta\right)=\tilde{\mathcal{O}}\left(\frac{\log^{2}(K^{2}/\delta)}{C^{2}D_{\theta^{*}}}\right).\end{split} (2)

In Sec. 5, we sketch the arguments behind the result. The proof in its entirety can be found in Sec. F in the Supplementary Material.

Note 1.

As explained in Sec. 3, the optimal (oracle) allocation requires 𝒪(1DθlogKδ)\mathcal{O}\left(\frac{1}{D_{\theta^{*}}}\log{\frac{K}{\delta}}\right) samples. Comparing this with (2), we see that our algorithm is instance optimal up to logarithmic factors, barring the 1C2\frac{1}{C^{2}} term, so the optimality holds whenever C=Ω(1)C=\Omega(1). Recall that CC is the smallest eigenvalue of x𝒳xxT\sum_{x\in\mathcal{X}}xx^{T}. C=Ω(1)C=\Omega(1) is reasonable to expect given that in most applications, feature vectors (i.e., x1,,xKx_{1},\cdots,x_{K}) are chosen to represent the feature space well which translates to a high value of CC.

Note 2.

The main computational effort in Algorithm 1 is in checking the phase stopping criterion (line 13) and implementing the best-response model learner (line 16), both of which are explicit quadratic programs. Note also that bounding the losses submitted to EXP-WTS to within B(0,Dm)B(0,D_{m}) is required only for the regret analysis of EXP-WTS to go through. In practice, as the simulation results show, PELEG works without this and, in fact, permits efficient solution of Step 16 in the algorithm, further reducing computational complexity.

5 Sketch of Sample Complexity Analysis

This section outlines the proof of the δ\delta-PAC sample complexity of Algorithm 1 (Theorem 1) and describes the main ideas and challenges involved in the analysis.

At a high level the proof of Theorem 1 involves two main parts: (1) a correctness argument for the central while loop that eliminates arms, and (2) a bound for its length, which, when added across all phases, gives the overall sample complexity bound.

1. Ensuring progress (arm elimination) in each phase. At the heart of the analysis is the following result which guarantees that upon termination of the central while loop, the uncertainty in estimating all differences of means among the surviving (i.e., non-eliminated) arms remains bounded.

Lemma 1 (Key Lemma).

After each phase m1m\geqslant 1, maxx,x𝒳m,xxxx(VNmm)12((12)m+1)28logK2/δm.\max\limits_{x,x^{\prime}\in\mathcal{X}_{m},x\neq x^{\prime}}\left\lVert x-x^{\prime}\right\rVert^{2}_{\left({V_{N_{m}}^{m}}\right)^{-1}}\leqslant\frac{\left(\left(\frac{1}{2}\right)^{m+1}\right)^{2}}{8\log K^{2}/\delta_{m}}.

Proof sketch. Phase mm ends at time tt when the ellipsoid (0,Vtm,rm)\mathcal{E}(0,V_{t}^{m},r_{m}), with center 0 and shape according to the arms played in the phase so far, becomes small enough to avoid intersecting the half spaces 𝒞m(x)\mathcal{C}_{m}(x), for all surviving arms xx, within the ball B(0,Dm)\cap B(0,D_{m}) (Step 13 of the algorithm) which is required to keep loss functions bounded for no-regret properties.

Suppose, for the sake of simplicity, that only two arms xix_{i} and xjx_{j} are present when phase mm starts. Figure 1(a) depicts a possible situation when the phase ends. 𝒞m(xi)𝒞m(xi;εm)\mathcal{C}_{m}(x_{i})\equiv\mathcal{C}_{m}(x_{i};\varepsilon_{m}) and 𝒞m(xj;εm)\mathcal{C}_{m}(x_{j};\varepsilon_{m}) with εm2m\varepsilon_{m}\approx 2^{-m} are halfspaces, denoted in gray, that intersect the ball B(0,Dm)B(0,D_{m}) in the areas colored red. In this situation, the ellipsoid VtmV_{t}^{m}, shaded in blue, has just broken away from the red regions in the interior of the ball. Because its extent in the direction xixjx_{i}-x_{j} lies within the strip between the two hyperplanes bounding 𝒞m(i),𝒞m(j)\mathcal{C}_{m}(i),\mathcal{C}_{m}(j), it can be shown (see proof of lemma in appendix) that xixj(Vtm)1\left\lVert x_{i}-x_{j}\right\rVert_{(V_{t}^{m})^{-1}} is small enough to not exceed roughly 2m2^{-m}.

The more challenging situation is when the ellipsoid VtmV_{t}^{m} breaks away from the red regions by breaching the boundary of the ball B(0,Dm)B(0,D_{m}), as in the green ellipsoid in Figure 1(b). The while loop terminating at this time would not satisfy the objective of controlling xixj(Vtm)1\left\lVert x_{i}-x_{j}\right\rVert_{(V_{t}^{m})^{-1}} to within 2m2^{-m}, since the extent of the ellipsoid in the direction xixjx_{i}-x_{j} is larger than the gap between the halfspaces 𝒞m(xi)\mathcal{C}_{m}(x_{i}) and 𝒞m(xj)\mathcal{C}_{m}(x_{j}). A key idea we introduce here is to shrink the hyperplane gap (i.e., εm\varepsilon_{m}) by a factor (precisely DmC(8logK2/δm)1D_{m}\sqrt{C}(8\log K^{2}/\delta_{m})^{-1}) which is represented by the min operation in Step 7. In doing this we bring the halfspaces closer, and then insist that the ellipsoid break away from these new halfspaces within the ball. This more stringent requirement guarantees that when the loop terminates, the extent of the final ellipsoid (shaded in blue) stays within the original, unshrunk, gap ensuring xixj(Vtm)12m\left\lVert x_{i}-x_{j}\right\rVert_{(V_{t}^{m})^{-1}}\lessapprox 2^{-m}.

0Span(xixj)\text{Span}(x_{i}-x_{j}){λ:λT(xixj)ε}\quad\quad\{\lambda:\lambda^{T}(x_{i}-x_{j})\geqslant\varepsilon\}{λ:λT(xixj)ε}\{\lambda:\lambda^{T}(x_{i}-x_{j})\leqslant-\varepsilon\}Ellipsoid fits within this\Rightarrowxixj(Vtm)12m\left\lVert x_{i}-x_{j}\right\rVert_{(V_{t}^{m})^{-1}}\lessapprox 2^{-m}B(0,Dm)B(0,D_{m})(0,Vtm,rm)\mathcal{E}(0,V_{t}^{m},r_{m})
(a) ‘Easy’ case: The blue ellipsoid separates from the halfspaces intersecting the ball (red) by staying within the ball. In this case its extent along (xixj)(x_{i}-x_{j}) is within the gap between the hyperplanes (parallel black lines).
0Span(xixj)\text{Span}(x_{i}-x_{j}){λ:λT(xixj)ε}\quad\quad\{\lambda:\lambda^{T}(x_{i}-x_{j})\geqslant\varepsilon\}{λ:λT(xixj)ε}\{\lambda:\lambda^{T}(x_{i}-x_{j})\leqslant-\varepsilon\}Ellipsoid fits within this\Rightarrowxixj(Vtm)12m\left\lVert x_{i}-x_{j}\right\rVert_{(V_{t}^{m})^{-1}}\lessapprox 2^{-m}Ellipsoid extenttoo largeEllipsoid extentsufficiently small
(b) ‘Difficult’ case: The green ellipsoid separates from the halfspaces intersecting the the ball (red) by breaching the ball. Its extent along (xixj)(x_{i}-x_{j}) exceeds the gap between the hyperplanes (parallel black lines). When forced to separate from a closer pair of halfspaces (dotted black lines), then the ellipsoid’s (in blue) extent is within the original gap.
Figure 1: The phase stopping condition in Algorithm 1 ensures xixj(Vtm)12m\left\lVert x_{i}-x_{j}\right\rVert_{(V_{t}^{m})^{-1}}\lessapprox 2^{-m} after phase mm.

2. Bounding the number of arm pulls in a phase. The main bound on the length of the central while loop is the following result.

Lemma 2 (Phase length bound).

Let Bm:=minw𝒫Kmaxx,x𝒳m,xxxxW12B_{m}:=\min\limits_{w\in\mathcal{P}_{K}}\max\limits_{x,x^{\prime}\in\mathcal{X}_{m},x\neq x^{\prime}}\left\lVert x-x^{\prime}\right\rVert^{2}_{W^{-1}}. There exists δ0\delta_{0} such that δ<δ0\forall\delta<\delta_{0}, the length NmN_{m} of any phase mm is bounded as :

Nm{2Bm(2m+1)2[rm4logK(21)2C2]+1 if εm=DmCrm(12)m+1,2Bm(2m+1)2rm2+1 if εm=(12)m+1.\displaystyle N_{m}\leqslant\begin{cases}2B_{m}\left(2^{m+1}\right)^{2}\left[\frac{r_{m}^{4}\log K}{(\sqrt{2}-1)^{2}C^{2}}\right]+1&\text{ if }\varepsilon_{m}=\frac{D_{m}\sqrt{C}}{r_{m}}\left(\frac{1}{2}\right)^{m+1},\\ 2B_{m}\left(2^{m+1}\right)^{2}r_{m}^{2}+1&\text{ if }\varepsilon_{m}=\left(\frac{1}{2}\right)^{m+1}.\end{cases}

To prove this we use the no-regret property of both the best-response MINMIN and the EXP-WTS MAXMAX learner (the full proof appears in the appendix). A key novelty here is the introduction of the ball B(0,Dm)B(0,D_{m}) as a technical device to control the 22-norm radius of the final stopped ellipsoid (0,Vtm,rm)\mathcal{E}(0,V_{t}^{m},r_{m}) (inequality (i)(i) in the proof) when used with the basic tracking rule over arms introduced by Degenne et al [6].

6 Experiments

We numerically evaluate PELEG, against the algorithms 𝒳𝒴\mathcal{X}\mathcal{Y}-static ([17]), LUCB ([11]), ALBA ([19]), LinGapE ([15]) and RAGE ([8]), for 3 common benchmark settings. The oracle lower bound is also calculated. Note: In our implementation, we ignore the term B(0,Dm)B(0,D_{m}) in the phase stopping criterion; this has the advantage of making the criterion check-able in closed form. We simulate independent, 𝒩(0,1)\mathcal{N}(0,1) observation noise in each round. All results reported are averaged over 50 trials. We also empirically observe a 100%100\% success rate in identifying the best arm, although a confidence value of δ=0.1\delta=0.1 is passed in all cases.

Setting 1: Standard bandit. The arm set is the standard basis {e1,e2,,e5}\left\{e_{1},e_{2},\ldots,e_{5}\right\} in 5 dimensions. The unknown parameter θ\theta^{*} is set to (Δ,0,,0)\left(\Delta,0,\ldots,0\right), where Δ>0\Delta>0, with Δ\Delta swept across {0.1,0.2,0.3,0.4,0.5}\left\{0.1,0.2,0.3,0.4,0.5\right\}. As noted in [15], for Δ\Delta close to 0, 𝒳𝒴\mathcal{X}\mathcal{Y}-static’s essentially uniform allocation is optimal, since we have to estimate all directions equally accurately. However, PELEG performs better (Fig. 2(a)) due to being able to eliminate suboptimal arms earlier instead of uniformly across all arms. Fig. 2(b) compares PELEG and RAGE in the smaller window Δ[0.11,0.19]\Delta\in\left[0.11,0.19\right], where PELEG is found to be competitive (and often better than) RAGE.

Setting 2: Unit sphere. The arms set comprises of 100100 vectors sampled uniformly from the surface of the unit sphere 𝕊d1\mathbb{S}^{d-1}. We pick the two closest arms, say uu and vv, and then set θ=u+γ(vu)\theta^{*}=u+\gamma(v-u) for γ=0.01\gamma=0.01, making uu the best arm. We simulate all algorithms over dimensions d=10,20,,50d=10,20,\ldots,50. This setting was first introduced in [19], and PELEG is uniformly competitive with the other algorithms (Fig. 2(c)).

Setting 3: Standard bandit with a confounding arm [17]. We instantiate dd canonical basis arms {e1,e2,,ed}\{e_{1},e_{2},\ldots,e_{d}\} and an additional arm xd+1=(cos(ω),sin(ω),0,,0)x_{d+1}=(\cos(\omega),\sin(\omega),0,\ldots,0), d=2,,10d=2,\dots,10, with θ=e1\theta^{*}=e_{1} so that the first arm is the best arm. By setting 0<ω<<10<\omega<<1, the d+1d+1th arm becomes the closest competitor. Here, the performance critically depends on how much an agent focuses on comparing arm 11 and arm d+1d+1. LinGapE performs very well in this setting, and PELEG and RAGE are competitive with it (Fig. 2(d)).

[Uncaptioned image]
[Uncaptioned image]
[Uncaptioned image]
[Uncaptioned image]
Figure 2: Sample complexity performance of linear bandit best arm identification algorithms for 33 different settings: Standard bandit (Figs. a, b), Unit Sphere (Fig. c) and Standard bandit with confounding arm (Fig. d).

7 Concluding Remarks

We have proposed a new, explicitly described algorithm for best arm identification in linear bandits, using tools from game theory and no-regret learning to solve minimax games. Several interesting directions remain unexplored. Removing the less-than-ideal dependence on the feature CC of the arm geometry and the extra logarithmic dependence on log(1/δ)\log(1/\delta) are perhaps the most interesting technical questions. It is also of great interest to see if a more direct game-theoretic strategy, along the lines of [6], exists for structured bandit problems, as also whether one can extend this machinery to solve for best policies in more general Markov Decision Processes.

Broader Impact. This work is largely theoretical in its objective. However, the problem that it attempts to lay sound theoretical foundations for is a widely encountered search problem based on features in machine learning. As a result, we anticipate that its implications may carry over to domains that involve continuous, feature-based learning, such as attribute-based recommendation systems, adaptive sensing and robotics applications. Proper care must be taken in such cases to ensure that recommendations or decisions from the algorithms set out in this work do not transgress considerations of safety and bias. While we do not address such concerns explicitly in this work, they are important in the design and operation of automated systems that continually interact with human users.

References

  • Abbasi-Yadkori et al. [2011] Y. Abbasi-Yadkori, D. Pal, and C. Szepesvari. Improved Algorithms for Linear Stochastic Bandits. In Proc. NIPS, pages 2312–2320, 2011.
  • Awerbuch and Kleinberg [2008] B. Awerbuch and R. Kleinberg. Online linear optimization and adaptive routing. Journal of Computer and System Sciences, 74(1):97–114, 2008.
  • Cesa-Bianchi and Lugosi [2006] N. Cesa-Bianchi and G. Lugosi. Prediction, learning, and games. Cambridge University Press, 2006.
  • Cheung and Li [2018] E. Cheung and Y. Li. Solving separable nonsmooth problems using frank-wolfe with uniform affine approximations. In IJCAI, pages 2035–2041, 2018.
  • de Rooij et al. [2014] S. de Rooij, T. van Erven, P. Grunwald, and W. Koolen. Follow the leader if you can, hedge if you must. Journal of Machine Learning Research, 15:1281–1316, 2014.
  • Degenne et al. [2019] R. Degenne, W. M. Koolen, and P. Ménard. Non-asymptotic pure exploration by solving games. In NeurIPS, 2019.
  • Even-Dar et al. [2006] E. Even-Dar, S. Mannor, and Y. Mansour. Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. J. Mach. Learn. Res., 7:1079–1105, Dec. 2006.
  • Fiez et al. [2019] T. Fiez, L. Jain, K. G. Jamieson, and L. Ratliff. Sequential experimental design for transductive linear bandits. In Advances in Neural Information Processing Systems, pages 10666–10676, 2019.
  • Freund and Schapire [1999] Y. Freund and R. E. Schapire. Adaptive game playing using multiplicative weights. Games and Economic Behavior, 29(1-2):79–103, 1999.
  • Garivier and Kaufmann [Jun. 2016] A. Garivier and E. Kaufmann. Optimal best arm identification with fixed confidence. In Conference On Learning Theory, pages 998–1027, Jun. 2016.
  • Kalyanakrishnan et al. [2012] S. Kalyanakrishnan, A. Tewari, P. Auer, and P. Stone. Pac subset selection in stochastic multi-armed bandits. In ICML, 2012.
  • Kaufmann et al. [2016] E. Kaufmann, O. Cappé, and A. Garivier. On the complexity of best-arm identification in multi-armed bandit models. J. Mach. Learn. Res., 17(1):1–42, Jan. 2016.
  • Kuroki et al. [2019] Y. Kuroki, L. Xu, A. Miyauchi, J. Honda, and M. Sugiyama. Polynomial-time algorithms for combinatorial pure exploration with full-bandit feedback. arXiv preprint arXiv:1902.10582, 2019.
  • Lattimore and Szepesvári [2020] T. Lattimore and C. Szepesvári. Bandit Algorithms. Cambridge University Press, 2020.
  • Liyuan Xu [2017] M. S. Liyuan Xu, Junya Honda. Fully adaptive algorithm for pure exploration in linear bandits. In International Conference on Artificial Intelligence and Statistics, 2017.
  • Shalev-Shwartz et al. [2011] S. Shalev-Shwartz et al. Online learning and online convex optimization. Foundations and trends in Machine Learning, 4(2):107–194, 2011.
  • Soare et al. [2014] M. Soare, A. Lazaric, and R. Munos. Best-arm identification in linear bandits. Advances in Neural Information Processing Systems 27, pages 828–836, 2014.
  • Srinivas et al. [2010] N. Srinivas, A. Krause, S. M. Kakade, and M. Seeger. Gaussian process optimization in the bandit setting: no regret and experimental design. In In International Conference on Machine Learning, 2010.
  • Tao et al. [2018] C. Tao, S. Blanco, and Y. Zhou. Best arm identification in linear bandits with linear dimension dependency. volume 80 of Proceedings of Machine Learning Research, pages 4877–4886. PMLR, 2018.
  • Valko et al. [2014] M. Valko, R. Munos, B. Kveton, and T. Kocák. Spectral bandits for smooth graph functions. In International Conference on Machine Learning, pages 46–54, 2014.
  • Zaki et al. [2019] M. Zaki, A. Mohan, and A. Gopalan. Towards optimal and efficient best arm identification in linear bandits. arXiv preprint arXiv:1911.01695, 2019.

Appendix A Glossary of symbols

  1. 1.

    𝒜mMAX:\mathcal{A}^{MAX}_{m}: the EXP-WTS algorithm, used to compute the mixed strategy of the MAXMAX player in each round of PELEG.

  2. 2.

    a:a^{*}: the index of the best arm, i.e., a:=argmaxi[K]xiTθa^{*}:=\mathop{\mathrm{argmax}}_{i\in[K]}x_{i}^{T}\theta^{*}.

  3. 3.

    B(0,Dm):B(0,D_{m}): the closed ball of radius DmD_{m} in d\mathbb{R}^{d}, centered at 0.

  4. 4.

    C=λmin(x𝒳xxT)C=\lambda_{\min}\left(\sum_{x\in\mathcal{X}}xx^{T}\right).

  5. 5.

    𝒞m(x):={λd:x𝒳m,xx|λTxλTx+εm}\mathcal{C}_{m}(x):=\left\{\lambda\in\mathbb{R}^{d}:\exists x^{\prime}\in\mathcal{X}_{m},x^{\prime}\neq x|\lambda^{T}x^{\prime}\geqslant\lambda^{T}x+\varepsilon_{m}\right\} is the union of all hyperplanes {λd|λT(xx)εm}.\{\lambda\in\mathbb{R}^{d}|\lambda^{T}(x^{\prime}-x)\geqslant\varepsilon_{m}\}.

  6. 6.

    Dm:=2(21)Cmaxx,x𝒳m,xxxx22logKD_{m}:=2(\sqrt{2}-1)\sqrt{\frac{{C}}{{\max\limits_{x,x^{\prime}\in\mathcal{X}_{m},x\neq x^{\prime}}\left\lVert x-x^{\prime}\right\rVert_{2}^{2}\log K}}}.

  7. 7.

    d:d: dimension of space in which the feature vectors x1,,xKx_{1},\cdots,x_{K} reside.

  8. 8.

    Δi=(xxi)Tθ,ia.\Delta_{i}=(x^{*}-x_{i})^{T}\theta^{*},~{}i\neq a^{*}.

  9. 9.

    Δmin=miniaΔi.\Delta_{\min}=\min_{i\neq a^{*}}\Delta_{i}.

  10. 10.

    δ:\delta: maximum allowable probability of erroneous arm selection (a.k.a confidence parameter).

  11. 11.

    δm=δm2.\delta_{m}=\frac{\delta}{m^{2}}.

  12. 12.

    (0,V,r):={λdλTVλr2}\mathcal{E}(0,V,r):=\{\lambda\in\mathbb{R}^{d}\mid\lambda^{T}V\lambda\leqslant r^{2}\}, is the confidence ellipsoid with center 0, shaped by VV and r.r.

  13. 13.

    (x,x):\mathcal{H}(x,x^{\prime}):

  14. 14.

    K=|𝒳|K=|\mathcal{X}| number of feature vectors.

  15. 15.

    Nm:N_{m}: the length of Phase m.m.

  16. 16.

    νk:\nu_{k}: rewards from Arm kk are all drawn IID from νk.\nu_{k}.

  17. 17.

    𝒫(Ω):={p[0,1]|Ω|:p1=1}\mathcal{P}(\Omega):=\{p\in[0,1]^{|\Omega|}:\parallel p\parallel_{1}=1\}, the set of all probability measures on some given set Ω.\Omega.

  18. 18.

    rm=8logK2δm.r_{m}=\sqrt{8\log{\frac{K^{2}}{\delta_{m}}}}.

  19. 19.

    θ:\theta^{*}: fixed but unknown vector in d\mathbb{R}^{d} that parameterizes the means of νk,\nu_{k}, i.e., the mean of νk\nu_{k} is xkTθ.x_{k}^{T}\theta^{*}.

  20. 20.

    ntk:n_{t}^{k}: number of times Arm kk has been sampled up to Round tt of PELEG.

  21. 21.

    θ^m:\widehat{\theta}_{m}: OLS estimate of θ\theta^{*} at the end of Phase mm of PELEG.

  22. 22.

    Vtm=stxsxsTV_{t}^{m}=\sum_{s\leqslant t}x_{s}x_{s}^{T} the design matrix in Round tt of Phase m.m.

  23. 23.

    Wt=x𝒳wxxxTW_{t}=\sum_{x\in\mathcal{X}}w_{x}xx^{T} the design matrix formed by sampling armsw𝒫(𝒳).\sim w\in\mathcal{P}(\mathcal{X}).

  24. 24.

    𝒳={x1,,xK}\mathcal{X}=\{x_{1},\cdots,x_{K}\}, the feature set.

  25. 25.

    𝒳m\mathcal{X}_{m} the set of features that survive Phase mm of PELEG.

Appendix B Technical lemmas

B.1 Tracking lemma

The 𝒜MAX\mathcal{A}^{MAX} subroutine recommends a distribution wtw_{t} in every round tt over the set of arms. In order to play an arm from this distribution we use a “tracking” rule, which helps the number of arm pulls to stay close to the cumulative sum s=1twsk\sum\limits_{s=1}^{t}w_{s}^{k} for each arm k[K]k\in[K].

Lemma 3 (ntkn^{k}_{t} tracks s=1twsk\sum_{s=1}^{t}w^{k}_{s}).

In any phase m1m\geqslant 1, if for tK\forall t\geqslant K, the following strategy for pulling the arms is used.

Choose arm, kt=argmink[K]nt1ks=1twsk,\text{Choose arm, }k_{t}=\mathop{\mathrm{argmin}}\limits_{k\in[K]}\frac{n_{t-1}^{k}}{\sum\limits_{s=1}^{t}w_{s}^{k}},

then, for all tKt\geqslant K and for all k[K]k\in[K], s=1twsk(K1)ntks=1twsk+1\sum\limits_{s=1}^{t}w_{s}^{k}-\left(K-1\right)\leqslant n_{t}^{k}\leqslant\sum\limits_{s=1}^{t}w_{s}^{k}+1.

Proof.

We first show the upper-bound. We need to show that the inequality holds for all arms. First, let kktk\neq k_{t}. We will use induction on tt.
Base case: At t=Kt=K, ntj=1=s=1twsj,j[K]n_{t}^{j}=1=\sum\limits_{s=1}^{t}w_{s}^{j},\forall j\in[K].
Let, the induction hold for all s<ts<t. We will show that the inequality holds for tt. If kktk\neq k_{t}, then ntk=nt1k()s=1t1wsk+1s=1twsk+1n_{t}^{k}=n_{t-1}^{k}\stackrel{{\scriptstyle(*)}}{{\leqslant}}\sum\limits_{s=1}^{t-1}w_{s}^{k}+1\leqslant\sum\limits_{s=1}^{t}w_{s}^{k}+1.
Next, let k=ktk=k_{t}. We note that by definition of ktk_{t}, we have

nt1kts=1twskt()j=1Knt1jj=1Ks=1twsj=t1t1.\frac{n_{t-1}^{k_{t}}}{\sum\limits_{s=1}^{t}w_{s}^{k_{t}}}\stackrel{{\scriptstyle(*)}}{{\leqslant}}\frac{\sum\limits_{j=1}^{K}n_{t-1}^{j}}{\sum\limits_{j=1}^{K}\sum\limits_{s=1}^{t}w_{s}^{j}}=\frac{t-1}{t}\leqslant 1.

Here, the inequality (*) follows because of the following fact: for any sequence of positive numbers {ai}1in\{a_{i}\}_{1\leqslant i\leqslant n} and {bi}1in\{b_{i}\}_{1\leqslant i\leqslant n}, min1inaibii=1naii=1nbi.\min\limits_{1\leqslant i\leqslant n}\frac{a_{i}}{b_{i}}\leqslant\frac{\sum_{i=1}^{n}a_{i}}{\sum_{i=1}^{n}b_{i}}. Consequently, ntkts=1twskt=nt1kt+1s=1twskt1+1s=1twskt\frac{n_{t}^{k_{t}}}{\sum\limits_{s=1}^{t}w_{s}^{k_{t}}}=\frac{n_{t-1}^{k_{t}}+1}{\sum\limits_{s=1}^{t}w_{s}^{k_{t}}}\leqslant 1+\frac{1}{\sum\limits_{s=1}^{t}w_{s}^{k_{t}}}. Rearranging completes the proof of the right hand side.
For the lower bound inequality, we observe that for any k[K]k\in[K],

ntk=tjkntj()tjks=1twsj(K1)=tj=1Ks=1twsj+s=1twsk(K1)=s=1twsk(K1).n_{t}^{k}=t-\sum\limits_{j\neq k}n_{t}^{j}\stackrel{{\scriptstyle(*)}}{{\geqslant}}t-\sum\limits_{j\neq k}\sum\limits_{s=1}^{t}w_{s}^{j}-\left(K-1\right)=t-\sum\limits_{j=1}^{K}\sum\limits_{s=1}^{t}w_{s}^{j}+\sum\limits_{s=1}^{t}w_{s}^{k}-\left(K-1\right)=\sum\limits_{s=1}^{t}w_{s}^{k}-\left(K-1\right).

Here, the inequality ()(*) follows from the the upper-bound on ntk,k[K]n_{t}^{k},~{}k\in[K]. ∎

B.1.1 Details of 𝒜mMAX\mathcal{A}^{MAX}_{m} (EXP-WTS)

We employ the EXP-WTS algorithm to recommend to the MAX player, the arm to be played in round t>Kt>K. At the start of every phase m1m\geqslant 1, an EXP-WTS subroutine is instantiated afresh, with initial weight vectors to be 11 for each of the KK experts. The KK experts are taken to be standard unit vectors (0,0,,0,1,0,,0)(0,0,\ldots,0,1,0,\ldots,0) with 11 at the kthk^{th} position, k[K]k\in[K]. The EXP-WTS subroutine recommends an exponentially-weighted probability distribution over the number of arms, depending upon the weights on each expert. The loss function supplied to update the weights of each expert, is indicated in Step 18 of Algorithm 1.

EXP-WTS requires a bound on the losses (rewards) in order to set its learning parameter optimally. This is ensured by passing an upper-bound of Dm2D_{m}^{2} (\because in any Phase m,λ2Dm,m,\left\lVert\lambda\right\rVert_{2}\leqslant D_{m}, see Step 13 of Algorithm 1).

Lemma 4.

In any phase mm, at any round t>Kt>K, 𝒜mMAX\mathcal{A}^{MAX}_{m} has a regret bounded as

RtDm221tlogK.R_{t}\leqslant\frac{D_{m}^{2}}{\sqrt{2}-1}\sqrt{t\log K}.
Proof.

The proof involves a simple modification of the proof of the regret analysis of the EXP-WTS algorithm (see for example, [3]), with loss scaled by [0,Dm2][0,D_{m}^{2}] followed by the well-known doubling trick.

Appendix C Proof of Key Lemma

See 1

Proof.

Let rm:=8logK2/δmr_{m}:=\sqrt{8\log K^{2}/\delta_{m}}, for ease of notation. The phase stopping criterion is

STOP at round tK if: minλx𝒳m𝒞m(x)B(0,Dm)λ(VNmm)2>rm2.\text{STOP at round $t\geqslant K$ if: }\min\limits_{\lambda\in\bigcup\limits_{x\in\mathcal{X}_{m}}\mathcal{C}_{m}(x)\cap B(0,D_{m})}\left\lVert\lambda\right\rVert^{2}_{\left({V_{N_{m}}^{m}}\right)}>r_{m}^{2}. (3)

Note that the set 𝒞m(x)\mathcal{C}_{m}(x) depends on the value that εm\varepsilon_{m} takes in phase mm. Depending on the value of εm\varepsilon_{m}, we divide the analysis into the following two cases.

Case 1. εm=(1/2)m+1\varepsilon_{m}=\left(1/2\right)^{m+1}.

In this case DmCrm1\frac{D_{m}\sqrt{C}}{r_{m}}\geqslant 1. For any phase m1m\geqslant 1, and t1t\geqslant 1, let us define the ellipsoid (0,Vtm,rm):={θd:θVtm2rm2}\mathcal{E}\left(0,V_{t}^{m},r_{m}\right):=\left\{\theta\in\mathbb{R}^{d}:\left\lVert\theta\right\rVert_{{V_{t}^{m}}}^{2}\leqslant r_{m}^{2}\right\}. The phase stopping rule at round tKt\geqslant K is equivalent to :

STOP if :\displaystyle: (0,Vtm,rm){x𝒳m𝒞m(x)B(0,Dm)}= (empty set)\displaystyle\mathcal{E}(0,V_{t}^{m},r_{m})\bigcap\left\{\bigcup\limits_{x\in\mathcal{X}_{m}}\mathcal{C}_{m}(x)\cap B(0,D_{m})\right\}=\emptyset\text{ (empty set)}
\displaystyle\Leftrightarrow {(0,Vtm,rm)B(0,Dm)}{x𝒳m𝒞m(x)}=.\displaystyle\left\{\mathcal{E}(0,V_{t}^{m},r_{m})\cap B(0,D_{m})\right\}\bigcap\left\{\bigcup\limits_{x\in\mathcal{X}_{m}}\mathcal{C}_{m}(x)\right\}=\emptyset.

However by Rayleigh’ inequality555for any PSD matrix AA and xdx\in\mathbb{R}^{d}, λmin(A)xTAxxTxλmax(A)\lambda_{min}(A)\leqslant\frac{x^{T}Ax}{x^{T}x}\leqslant\lambda_{max}(A) followed by the fact that DmCrm1\frac{D_{m}\sqrt{C}}{r_{m}}\geqslant 1, we have for any θ(0,Vtm,rm)\theta\in\mathcal{E}(0,V_{t}^{m},r_{m}),

θ22θVtm2λmin(Vtm)()θVtm2λmin(k=1KxkxkT)rm2CDm2.\left\lVert\theta\right\rVert_{2}^{2}\leqslant\frac{\left\lVert\theta\right\rVert_{V_{t}^{m}}^{2}}{\lambda_{min}(V_{t}^{m})}\stackrel{{\scriptstyle(*)}}{{\leqslant}}\frac{\left\lVert\theta\right\rVert_{V_{t}^{m}}^{2}}{\lambda_{min}(\sum\limits_{k=1}^{K}x_{k}x_{k}^{T})}\leqslant\frac{r_{m}^{2}}{C}\leqslant D_{m}^{2}.

The inequality (*) follows from the following fact: for tKt\geqslant K, Vtm=k=1KxkxkT+s=K+1txsxsTk=1KxkxkTV_{t}^{m}=\sum\limits_{k=1}^{K}x_{k}x_{k}^{T}+\sum\limits_{s=K+1}^{t}x_{s}x_{s}^{T}\succcurlyeq\sum\limits_{k=1}^{K}x_{k}x_{k}^{T}.

(0,Vtm,rm)B(0,Dm),tK\therefore\mathcal{E}(0,V_{t}^{m},r_{m})\subseteq B(0,D_{m}),\forall t\geqslant K. Hence the phase stopping rule reduces to,

STOP if: (0,Vtm,rm){x𝒳m𝒞m(x)}=\displaystyle\text{ STOP if: }\mathcal{E}(0,V_{t}^{m},r_{m})\bigcap\left\{\bigcup\limits_{x\in\mathcal{X}_{m}}\mathcal{C}_{m}(x)\right\}=\emptyset minλx𝒳m𝒞m(x)λVtm2>rm2\displaystyle\Leftrightarrow\min\limits_{\lambda\in\cup_{x\in\mathcal{X}_{m}}\mathcal{C}_{m}(x)}\left\lVert\lambda\right\rVert_{V_{t}^{m}}^{2}>r_{m}^{2}
minλ(x,x)𝒳m2{λ:λTxλTx+(1/2)m+1}λVtm2>rm2.\displaystyle\Leftrightarrow\min\limits_{\lambda\in\bigcup\limits_{(x,x^{\prime})\in\mathcal{X}_{m}^{2}}\{\lambda^{\prime}:{\lambda^{\prime}}^{T}x^{\prime}\geqslant{\lambda^{\prime}}^{T}x+\left(1/2\right)^{m+1}\}}\left\lVert\lambda\right\rVert_{V_{t}^{m}}^{2}>r_{m}^{2}.

The above reduction is a minimization problem over union of halfspaces. For any fixed pair (x,x)𝒳m2,xx\left(x,x^{\prime}\right)\in\mathcal{X}_{m}^{2},x\neq x^{\prime}, this is a quadratic optimization problem with linear constraints, which can be explicitly solved using standard Lagrange method.

Lemma 5 (Supporting Lemma for Lem. 1).

For any two arms xx and xx^{\prime}, we have that

minλ{λ:λTxλTx+(12)m+1}λVtm2=((12)m+1)2xx(Vtm)12.\displaystyle\min\limits_{\lambda\in\{\lambda^{\prime}:{\lambda^{\prime}}^{T}x^{\prime}\geqslant{\lambda^{\prime}}^{T}x+\left(\frac{1}{2}\right)^{m+1}\}}\left\lVert\lambda\right\rVert_{V_{t}^{m}}^{2}=\frac{\left(\left(\frac{1}{2}\right)^{m+1}\right)^{2}}{\left\lVert x-x^{\prime}\right\rVert^{2}_{(V_{t}^{m})^{-1}}}.
Proof.

The result follows by solving the optimization problem explicitly using the Lagrange multiplier method. ∎

By using the above lemma we obtain:

STOP if:x,x𝒳m,xx,xx(Vtm)12<((12)m+1)28logK2/δm.\text{STOP if:}\forall x,x^{\prime}\in\mathcal{X}_{m},x\neq x^{\prime},\left\lVert x-x^{\prime}\right\rVert^{2}_{(V_{t}^{m})^{-1}}<\frac{\left(\left(\frac{1}{2}\right)^{m+1}\right)^{2}}{8\log K^{2}/\delta_{m}}.

Hence, at round t=Nmt=N_{m} we have, x,x𝒳m,xx,xx(VNmm)12<((12)m+1)28logK2/δm\forall x,x^{\prime}\in\mathcal{X}_{m},x\neq x^{\prime},\left\lVert x-x^{\prime}\right\rVert^{2}_{(V_{N_{m}}^{m})^{-1}}<\frac{\left(\left(\frac{1}{2}\right)^{m+1}\right)^{2}}{8\log K^{2}/\delta_{m}}.

Case 2. εm=DmCrm(12)m+1\varepsilon_{m}=\frac{D_{m}\sqrt{C}}{r_{m}}\left(\frac{1}{2}\right)^{m+1}.

In this case, we have DmCrm<1\frac{D_{m}\sqrt{C}}{r_{m}}<1.
The phase ends when (x,x)𝒳m2\forall(x,x^{\prime})\in\mathcal{X}_{m}^{2}, minλ{λd:λTxλTx+εm}B(0,Dm)λVtm2>rm2\min\limits_{\begin{subarray}{c}{\lambda\in\left\{\lambda\in\mathbb{R}^{d}:\lambda^{T}x^{\prime}\geqslant\lambda^{T}x+\varepsilon_{m}\right\}}{\cap B(0,D_{m})}\end{subarray}}\left\lVert\lambda\right\rVert^{2}_{V_{t}^{m}}>r_{m}^{2}. Let us decompose the optimization problem defining the phase stopping criteria into smaller sub-problems, depending on pair of arms (x,x)(x,x^{\prime}) in 𝒳m2\mathcal{X}_{m}^{2}. That is, we split the set x𝒳m𝒞m(x)\cup_{x\in\mathcal{X}_{m}}\mathcal{C}_{m}(x) in equation (3), and consider the following problem: for any pair of distinct arms (x,x)𝒳m(x,x^{\prime})\in\mathcal{X}_{m}, consider

P(x,x):minλ{λd:λTxλTx+εm}B(0,Dm)λVtm2.P(x,x^{\prime}):\min\limits_{\begin{subarray}{c}{\lambda\in\left\{\lambda\in\mathbb{R}^{d}:\lambda^{T}x^{\prime}\geqslant\lambda^{T}x+\varepsilon_{m}\right\}}{\cap B(0,D_{m})}\end{subarray}}\left\lVert\lambda\right\rVert^{2}_{V_{t}^{m}}.

let tx,xt_{x,x^{\prime}} be the first round when minλ{λd:λTxλTx+εm}B(0,Dm)λVtm2>rm2\min\limits_{\begin{subarray}{c}{\lambda\in\left\{\lambda\in\mathbb{R}^{d}:\lambda^{T}x^{\prime}\geqslant\lambda^{T}x+\varepsilon_{m}\right\}}{\cap B(0,D_{m})}\end{subarray}}\left\lVert\lambda\right\rVert^{2}_{V_{t}^{m}}>r_{m}^{2}. Clearly, we have Nm=max(x,x)𝒳m2,xxtx,xN_{m}=\max\limits_{(x,x^{\prime})\in\mathcal{X}_{m}^{2},x\neq x^{\prime}}t_{x,x^{\prime}}. In addition, for any ttx,xt\geqslant t_{x,x^{\prime}}, λVtm2=λT(Vtx,xm+s=tx,x+1txsxst)λ=λVtx,xm2+s=tx,x+1t(xsTλ)2λVtx,xm2>rm2\left\lVert\lambda\right\rVert_{V_{t}^{m}}^{2}=\lambda^{T}\left(V^{m}_{t_{x,x^{\prime}}}+\sum_{s=t_{x,x^{\prime}}+1}^{t}x_{s}x_{s}^{t}\right)\lambda=\left\lVert\lambda\right\rVert^{2}_{V^{m}_{t_{x,x^{\prime}}}}+\sum_{s=t_{x,x^{\prime}}+1}^{t}(x_{s}^{T}\lambda)^{2}\geqslant\left\lVert\lambda\right\rVert^{2}_{V^{m}_{t_{x,x^{\prime}}}}>r_{m}^{2}. Hence, once the inequality for a given pair of arms (x,x)(x,x^{\prime}) is fulfilled it is satisfied for all subsequent rounds. We will now analyze the problem P(x,x)P(x,x^{\prime}) for each pair of arms (x,x)𝒳m2(x,x^{\prime})\in\mathcal{X}_{m}^{2} individually.

For any t1t\geqslant 1 , define λtargminλ{λd:λTxλTx+εm}B(0,Dm)λVtm2\lambda_{t}^{*}\in\mathop{\mathrm{argmin}}\limits_{\begin{subarray}{c}{\lambda\in\left\{\lambda\in\mathbb{R}^{d}:\lambda^{T}x^{\prime}\geqslant\lambda^{T}x+\varepsilon_{m}\right\}}\\ {\cap B(0,D_{m})}\end{subarray}}\left\lVert\lambda\right\rVert^{2}_{V_{t}^{m}}. Note that λt\lambda_{t}^{*} is specific to the pair (x,x)(x,x^{\prime}).

Claim 1. λtT(xx)=εm,t1{\lambda_{t}^{*}}^{T}(x^{\prime}-x)=\varepsilon_{m},\forall t\geqslant 1.

Proof of Claim 1.

For the proof, let’s denote λλt\lambda^{*}\equiv\lambda_{t}^{*}. Now, suppose that the claim was not true, i.e., λT(xx)=εm+a{\lambda^{*}}^{T}(x^{\prime}-x)=\varepsilon_{m}+a for some a>0a>0. Let b=aλT(xx)b=\frac{a}{{\lambda^{*}}^{T}(x^{\prime}-x)}. Then 0<b<10<b<1. Define λ:=(1b)λ\lambda^{\prime}:=(1-b)\lambda^{*}. By construction, λT(xx)=εm{\lambda^{\prime}}^{T}(x^{\prime}-x)=\varepsilon_{m}, and λ2=(1b)λ2<λ2\left\lVert\lambda^{\prime}\right\rVert_{2}=(1-b)\left\lVert\lambda^{*}\right\rVert_{2}<\left\lVert\lambda^{*}\right\rVert_{2}. Hence λ{λd:λTxλTx+εm}B(0,Dm)\lambda^{\prime}\in{\left\{\lambda\in\mathbb{R}^{d}:\lambda^{T}x^{\prime}\geqslant\lambda^{T}x+\varepsilon_{m}\right\}}{\cap B(0,D_{m})}. However, λVtm=(1b)λVtm<λVtm\left\lVert\lambda^{\prime}\right\rVert_{V_{t}^{m}}=(1-b)\left\lVert\lambda^{*}\right\rVert_{V_{t}^{m}}<\left\lVert\lambda^{*}\right\rVert_{V_{t}^{m}}, which is a contradiction. ∎

At t=tx,x,t=t_{x,x^{\prime}}, we have minλ{λd:λTxλTx+εm}B(0,Dm)λVtm2>rm2\min\limits_{\begin{subarray}{c}{\lambda\in\left\{\lambda\in\mathbb{R}^{d}:\lambda^{T}x^{\prime}\geqslant\lambda^{T}x+\varepsilon_{m}\right\}}\\ {\cap B(0,D_{m})}\end{subarray}}\left\lVert\lambda\right\rVert^{2}_{V_{t}^{m}}>r_{m}^{2}. We have two sub-cases depending on the 2-norm of λt\lambda_{t}^{*}.

Sub-case 1. λt2<Dm\left\lVert\lambda_{t}^{*}\right\rVert_{2}<D_{m}.

In this case, we have the following equivalence:

minλ{λd:λTxλTx+εm}B(0,Dm)λVtm2minλ{λd:λTxλTx+εm}λVtm2.\min\limits_{\begin{subarray}{c}{\lambda\in\left\{\lambda\in\mathbb{R}^{d}:\lambda^{T}x^{\prime}\geqslant\lambda^{T}x+\varepsilon_{m}\right\}}\\ {\cap B(0,D_{m})}\end{subarray}}\left\lVert\lambda\right\rVert^{2}_{V_{t}^{m}}\equiv\min\limits_{\begin{subarray}{c}{\lambda\in\left\{\lambda\in\mathbb{R}^{d}:\lambda^{T}x^{\prime}\geqslant\lambda^{T}x+\varepsilon_{m}\right\}}\end{subarray}}\left\lVert\lambda\right\rVert^{2}_{V_{t}^{m}}.

This can be seen by noting that if λt2<Dm\left\lVert\lambda_{t}^{*}\right\rVert_{2}<D_{m}, then the corresponding Lagrange multiplier is zero. Hence at round t=tx,xt=t_{x,x^{\prime}}, by solving a standard Lagrange optimization problem, we get xx(Vtm)12<εm28logK2/δm=Dm2Crm2(12)2(m+1)8logK2/δm<(12)2(m+1)8logK2/δm\left\lVert x-x^{\prime}\right\rVert^{2}_{(V_{t}^{m})^{-1}}<\frac{\varepsilon_{m}^{2}}{8\log K^{2}/\delta_{m}}=\frac{D_{m}^{2}C}{r_{m}^{2}}\frac{\left(\frac{1}{2}\right)^{2(m+1)}}{8\log K^{2}/\delta_{m}}<\frac{\left(\frac{1}{2}\right)^{2(m+1)}}{8\log K^{2}/\delta_{m}}. The last inequality follows from the hypothesis of Case 2. Since Nmtx,xN_{m}\geqslant t_{x,x^{\prime}}, we get xx(VNmm)12xx(Vtx,xm)12<((12)m+1)28logK2/δm\left\lVert x-x^{\prime}\right\rVert^{2}_{(V_{N_{m}}^{m})^{-1}}\leqslant\left\lVert x-x^{\prime}\right\rVert^{2}_{\left(V^{m}_{t_{x,x^{\prime}}}\right)^{-1}}<\frac{\left(\left(\frac{1}{2}\right)^{m+1}\right)^{2}}{8\log K^{2}/\delta_{m}}.

Sub-case 2. λt2=Dm\left\lVert\lambda_{t}^{*}\right\rVert_{2}=D_{m}.

The sub-case when λt2=Dm\left\lVert\lambda_{t}^{*}\right\rVert_{2}=D_{m}, is more involved. Let’s enumerate the properties of λt\lambda_{t}^{*} at t=tx,xt=t_{x,x^{\prime}} that we have.

  • λtVtm2>rm2.\left\lVert\lambda_{t}^{*}\right\rVert_{V_{t}^{m}}^{2}>r_{m}^{2}.

  • λt2=Dm.\left\lVert\lambda_{t}^{*}\right\rVert_{2}=D_{m}.

  • λtT(xx)=εm.{\lambda_{t}^{*}}^{T}(x-x^{\prime})=\varepsilon_{m}.

We divide the analysis of this sub-case into two further sub-cases.

Sub-sub-case 1. rm2xx(Vtm)12>εm2r_{m}^{2}\left\lVert x-x^{\prime}\right\rVert^{2}_{(V_{t}^{m})^{-1}}>\varepsilon_{m}^{2}.

Let θt:=argmaxθ(0,Vtm,rm)θT(xx)\theta_{t}^{*}:=\mathop{\mathrm{argmax}}\limits_{\theta\in\mathcal{E}(0,V_{t}^{m},r_{m})}\theta^{T}(x^{\prime}-x). Then, one can verify by solving the maximization problem explicitly that θtT(xx)=rmxx(Vtm)1{\theta_{t}^{*}}^{T}(x^{\prime}-x)=r_{m}\left\lVert x^{\prime}-x\right\rVert_{(V_{t}^{m})^{-1}}. Let θ1:=θtT(xx)xx22(xx)\theta_{1}:=\frac{{\theta_{t}^{*}}^{T}(x^{\prime}-x)}{\left\lVert x^{\prime}-x\right\rVert_{2}^{2}}(x^{\prime}-x). We have the following properties of θ1\theta_{1} by construction, which are straight-forward to verify.

  • θ12=rmxx(Vtm)1xx2.\left\lVert\theta_{1}\right\rVert_{2}=\frac{r_{m}\left\lVert x^{\prime}-x\right\rVert_{(V^{m}_{t})^{-1}}}{\left\lVert x^{\prime}-x\right\rVert_{2}}.

  • θ1T(θtθ1)=0.\theta_{1}^{T}(\theta_{t}^{*}-\theta_{1})=0.

Let λ1:=λtT(xx)xx22(xx)\lambda_{1}:=\frac{{\lambda_{t}^{*}}^{T}(x^{\prime}-x)}{\left\lVert x^{\prime}-x\right\rVert_{2}^{2}}(x^{\prime}-x). It follows that, λ12=|λtT(xx)|xx2=εmxx2\left\lVert\lambda_{1}\right\rVert_{2}=\frac{\left|{\lambda_{t}^{*}}^{T}(x^{\prime}-x)\right|}{\left\lVert x^{\prime}-x\right\rVert_{2}}=\frac{\varepsilon_{m}}{\left\lVert x^{\prime}-x\right\rVert_{2}}.
Finally, let us define two more quantities. Let λ2:=rmxx(Vtm)1εmλt\lambda_{2}:=\frac{r_{m}\left\lVert x^{\prime}-x\right\rVert_{(V_{t}^{m})^{-1}}}{\varepsilon_{m}}\lambda_{t}^{*} and θ2:=εmrmxx(Vtm)1θt\theta_{2}:=\frac{\varepsilon_{m}}{r_{m}\left\lVert x^{\prime}-x\right\rVert_{(V^{m}_{t})^{-1}}}\theta_{t}^{*}. We have by the hypothesis of sub-sub-case 1, that θ222<θt22\left\lVert\theta_{2}\right\rVert_{2}^{2}<\left\lVert\theta_{t}^{*}\right\rVert_{2}^{2}. This implies that θ2(0,Vtm,rm)\theta_{2}\in\mathcal{E}(0,V_{t}^{m},r_{m}).

Next, we make the following two claims on the 2-norms of θ2\theta_{2} and θtθ1\theta_{t}^{*}-\theta_{1}.

Claim. θ22>Dm\left\lVert\theta_{2}\right\rVert_{2}>D_{m}.

Proof of Claim..

Suppose that θ2B(0,Dm)\theta_{2}\in B(0,D_{m}). By construction, θ2T(xx)=εm\theta_{2}^{T}(x^{\prime}-x)=\varepsilon_{m}. Hence, θ2{λd:λTxλTx+εm}B(0,Dm).\theta_{2}\in\left\{\lambda\in\mathbb{R}^{d}:\lambda^{T}x^{\prime}\geqslant\lambda^{T}x+\varepsilon_{m}\right\}\cap B(0,D_{m}). Since, θ2(0,Vtm,rm)\theta_{2}\in\mathcal{E}(0,V_{t}^{m},r_{m}), this implies that θ2Vtmrm\left\lVert\theta_{2}\right\rVert_{V_{t}^{m}}\leqslant r_{m}. However, this is a contradiction since at round tt, minλ{λTxλTx+εm}B(0,Dm)>rm2\min\limits_{\begin{subarray}{c}{\lambda\in\left\{\lambda^{T}x^{\prime}\geqslant\lambda^{T}x+\varepsilon_{m}\right\}}\\ {\cap B(0,D_{m})}\end{subarray}}>r_{m}^{2}. ∎

Hence, we have the following,

Dm2<θ222=εm2rm2xx(Vtm)12θt22=Dm2λ222θt22θt22>λ222.D_{m}^{2}<\left\lVert\theta_{2}\right\rVert_{2}^{2}=\frac{\varepsilon_{m}^{2}}{r_{m}^{2}\left\lVert x^{\prime}-x\right\rVert^{2}_{(V_{t}^{m})^{-1}}}\left\lVert\theta_{t}^{*}\right\rVert_{2}^{2}=\frac{D_{m}^{2}}{\left\lVert\lambda_{2}\right\rVert_{2}^{2}}\left\lVert\theta_{t}^{*}\right\rVert_{2}^{2}\Rightarrow\left\lVert\theta_{t}^{*}\right\rVert_{2}^{2}>\left\lVert\lambda_{2}\right\rVert_{2}^{2}.

Claim. θtθ122>λ2θ122\left\lVert\theta_{t}^{*}-\theta_{1}\right\rVert_{2}^{2}>\left\lVert\lambda_{2}-\theta_{1}\right\rVert_{2}^{2}.

Proof of Claim..

First we note that,

θ1T(θtλ2)\displaystyle\theta_{1}^{T}(\theta_{t}^{*}-\lambda_{2}) =θtT(xx)xx22(xx)T(θtrmxx(Vtm)1εmλt)\displaystyle=\frac{{\theta_{t}^{*}}^{T}(x^{\prime}-x)}{\left\lVert x^{\prime}-x\right\rVert_{2}^{2}}(x^{\prime}-x)^{T}\left(\theta_{t}^{*}-\frac{r_{m}\left\lVert x^{\prime}-x\right\rVert_{(V^{m}_{t})^{-1}}}{\varepsilon_{m}}\lambda_{t}^{*}\right)
=rm2xx(Vtm)12xx22rm2xx(Vtm)12xx22=0.\displaystyle=\frac{r_{m}^{2}\left\lVert x^{\prime}-x\right\rVert^{2}_{(V_{t}^{m})^{-1}}}{\left\lVert x^{\prime}-x\right\rVert_{2}^{2}}-\frac{r_{m}^{2}\left\lVert x^{\prime}-x\right\rVert^{2}_{(V_{t}^{m})^{-1}}}{\left\lVert x^{\prime}-x\right\rVert_{2}^{2}}=0.

Next observe that,

θtθ122\displaystyle\left\lVert\theta_{t}^{*}-\theta_{1}\right\rVert_{2}^{2} =θt22+θ1222θtTθ1\displaystyle=\left\lVert\theta_{t}^{*}\right\rVert_{2}^{2}+\left\lVert\theta_{1}\right\rVert_{2}^{2}-2{\theta_{t}^{*}}^{T}\theta_{1}
=θt22+θ1222(θtλ2)Tθ12θ1Tλ2\displaystyle=\left\lVert\theta_{t}^{*}\right\rVert_{2}^{2}+\left\lVert\theta_{1}\right\rVert_{2}^{2}-2({\theta_{t}^{*}-\lambda_{2}})^{T}\theta_{1}-2\theta_{1}^{T}\lambda_{2}
=θt22+θ1222θ1Tλ2\displaystyle=\left\lVert\theta_{t}^{*}\right\rVert_{2}^{2}+\left\lVert\theta_{1}\right\rVert_{2}^{2}-2\theta_{1}^{T}\lambda_{2}
>λ222+θ1222θ1Tλ2=λ2θ122.\displaystyle>\left\lVert\lambda_{2}\right\rVert_{2}^{2}+\left\lVert\theta_{1}\right\rVert_{2}^{2}-2\theta_{1}^{T}\lambda_{2}=\left\lVert\lambda_{2}-\theta_{1}\right\rVert_{2}^{2}.

Putting things together we have,

θt22\displaystyle\left\lVert\theta_{t}^{*}\right\rVert_{2}^{2} =θtθ122+θ122\displaystyle=\left\lVert\theta_{t}^{*}-\theta_{1}\right\rVert_{2}^{2}+\left\lVert\theta_{1}\right\rVert_{2}^{2}
θ122\displaystyle\Rightarrow\left\lVert\theta_{1}\right\rVert_{2}^{2} =θt22θtθ122\displaystyle=\left\lVert\theta_{t}^{*}\right\rVert_{2}^{2}-\left\lVert\theta_{t}^{*}-\theta_{1}\right\rVert_{2}^{2}
θ122\displaystyle\Rightarrow\left\lVert\theta_{1}\right\rVert_{2}^{2} <θt22λ2θ122\displaystyle<\left\lVert\theta_{t}^{*}\right\rVert_{2}^{2}-\left\lVert\lambda_{2}-\theta_{1}\right\rVert_{2}^{2}
rm2xx(Vtm)12xx22\displaystyle\Rightarrow\frac{r_{m}^{2}\left\lVert x^{\prime}-x\right\rVert^{2}_{(V_{t}^{m})^{-1}}}{\left\lVert x^{\prime}-x\right\rVert_{2}^{2}} <rm2Crm2xx(Vtm)12(Dm2εm21xx22)\displaystyle<\frac{r_{m}^{2}}{C}-r_{m}^{2}\left\lVert x^{\prime}-x\right\rVert^{2}_{(V_{t}^{m})^{-1}}\left(\frac{D_{m}^{2}}{\varepsilon_{m}^{2}}-\frac{1}{\left\lVert x^{\prime}-x\right\rVert_{2}^{2}}\right)
xx(Vtm)12xx22\displaystyle\Rightarrow\frac{\left\lVert x^{\prime}-x\right\rVert^{2}_{(V_{t}^{m})^{-1}}}{\left\lVert x^{\prime}-x\right\rVert_{2}^{2}} <1Cxx(Vtm)12(Dm2εm21xx22)\displaystyle<\frac{1}{C}-\left\lVert x^{\prime}-x\right\rVert^{2}_{(V_{t}^{m})^{-1}}\left(\frac{D_{m}^{2}}{\varepsilon_{m}^{2}}-\frac{1}{\left\lVert x^{\prime}-x\right\rVert_{2}^{2}}\right)
xx(Vtm)12xx22\displaystyle\Rightarrow\frac{\left\lVert x^{\prime}-x\right\rVert^{2}_{(V_{t}^{m})^{-1}}}{\left\lVert x^{\prime}-x\right\rVert_{2}^{2}} <1Cxx(Vtm)12Dm2εm2+xx(Vtm)12xx22\displaystyle<\frac{1}{C}-\left\lVert x^{\prime}-x\right\rVert^{2}_{(V_{t}^{m})^{-1}}\frac{D_{m}^{2}}{\varepsilon_{m}^{2}}+\frac{\left\lVert x^{\prime}-x\right\rVert^{2}_{(V_{t}^{m})^{-1}}}{\left\lVert x^{\prime}-x\right\rVert_{2}^{2}}
xx(Vtm)12\displaystyle\Rightarrow\left\lVert x^{\prime}-x\right\rVert^{2}_{(V_{t}^{m})^{-1}} <εm2Dm2C=Dm2Crm2Dm2C(12)2(m+1)=((12)m+1)28logK2/δm.\displaystyle<\frac{\varepsilon_{m}^{2}}{D_{m}^{2}C}=\frac{D_{m}^{2}C}{r_{m}^{2}D_{m}^{2}C}\left(\frac{1}{2}\right)^{2(m+1)}=\frac{\left(\left(\frac{1}{2}\right)^{m+1}\right)^{2}}{8\log K^{2}/\delta_{m}}.

Sub-sub-case 2. rm2xx(Vtm)12εm2r_{m}^{2}\left\lVert x-x^{\prime}\right\rVert^{2}_{(V_{t}^{m})^{-1}}\leqslant\varepsilon_{m}^{2}.

This case is trivial as by the hypothesis,

xx(Vtm)12εm2rm2=Dm2Crm21rm2((12)m+1)2<((12)m+1)28logK2/δm.\left\lVert x-x^{\prime}\right\rVert^{2}_{(V_{t}^{m})^{-1}}\leqslant\frac{\varepsilon_{m}^{2}}{r_{m}^{2}}=\frac{D_{m}^{2}C}{r_{m}^{2}}\frac{1}{r_{m}^{2}}\left(\left(\frac{1}{2}\right)^{m+1}\right)^{2}<\frac{\left(\left(\frac{1}{2}\right)^{m+1}\right)^{2}}{8\log K^{2}/\delta_{m}}.

This completes the proof of the key lemma.

Appendix D Proofs of bounds on phase length

In this section we will provide an upper-bound on the length of any phase m1m\geqslant 1. Clearly, the length of any phase mm is governed by the value of εm\varepsilon_{m} in that phase. Towards this, we have the following lemma.

See 2

Proof.

Recall that rm=8logK2/δm.r_{m}=\sqrt{8\log{K^{2}}/\delta_{m}}. Let tt be the last round in phase mm, before the phase ends. Then by definition of phase stopping rule (Step 12 of the algorithm),

rm2\displaystyle r_{m}^{2} minλx𝒳m𝒞m(x)B(0,Dm)λVtm2\displaystyle\geqslant\min\limits_{\lambda\in\bigcup\limits_{x\in\mathcal{X}_{m}}\mathcal{C}_{m}(x)\cap B(0,D_{m})}\left\lVert\lambda\right\rVert^{2}_{{V_{t}^{m}}}
(i)minλx𝒳m𝒞m(x)B(0,Dm)s=1tλWs2K2Dm2\displaystyle\stackrel{{\scriptstyle(i)}}{{\geqslant}}\min\limits_{\lambda\in\bigcup\limits_{x\in\mathcal{X}_{m}}\mathcal{C}_{m}(x)\cap B(0,D_{m})}\sum\limits_{s=1}^{t}\left\lVert\lambda\right\rVert^{2}_{{W_{s}}}-K^{2}D_{m}^{2}
(ii)s=1tλsWs2K2Dm2\displaystyle\stackrel{{\scriptstyle(ii)}}{{\geqslant}}\sum\limits_{s=1}^{t}\left\lVert\lambda_{s}\right\rVert^{2}_{{W_{s}}}-K^{2}D_{m}^{2}
=(iii)s=1tk=1Kwsk(λsTxk)2K2Dm2\displaystyle\stackrel{{\scriptstyle(iii)}}{{=}}\sum\limits_{s=1}^{t}\sum\limits_{k=1}^{K}w_{s}^{k}\left(\lambda_{s}^{T}x_{k}\right)^{2}-K^{2}D_{m}^{2}
(iv)maxw𝒫Ks=1tk=1Kwk(λsTxk)2Dm221tlogKK2Dm2\displaystyle\stackrel{{\scriptstyle(iv)}}{{\geqslant}}\max\limits_{w\in\mathcal{P}_{K}}\sum\limits_{s=1}^{t}\sum\limits_{k=1}^{K}w^{k}\left(\lambda_{s}^{T}x_{k}\right)^{2}-\frac{D_{m}^{2}}{\sqrt{2}-1}\sqrt{t\log K}-K^{2}D_{m}^{2}
=maxw𝒫Ks=1tλsW2Dm221tlogKK2Dm2\displaystyle=\max\limits_{w\in\mathcal{P}_{K}}\sum\limits_{s=1}^{t}\left\lVert\lambda_{s}\right\rVert^{2}_{W}-\frac{D_{m}^{2}}{\sqrt{2}-1}\sqrt{t\log K}-K^{2}D_{m}^{2}
=t.maxw𝒫Ks=1t1tλsW2Dm221tlogKK2Dm2\displaystyle=t.\max\limits_{w\in\mathcal{P}_{K}}\sum\limits_{s=1}^{t}\frac{1}{t}\left\lVert\lambda_{s}\right\rVert^{2}_{W}-\frac{D_{m}^{2}}{\sqrt{2}-1}\sqrt{t\log K}-K^{2}D_{m}^{2}
(v)t.maxw𝒫Kminq𝒫(x𝒳m𝒞m(x)B(0,Dm))𝔼λq[λW2]Dm221tlogKK2Dm2\displaystyle\stackrel{{\scriptstyle(v)}}{{\geqslant}}t.\max\limits_{w\in\mathcal{P}_{K}}\min\limits_{q\in\mathcal{P}\left(\bigcup\limits_{x\in\mathcal{X}_{m}}\mathcal{C}_{m}(x)\cap B(0,D_{m})\right)}\mathbb{E}_{\lambda\sim q}\left[\left\lVert\lambda\right\rVert^{2}_{W}\right]-\frac{D_{m}^{2}}{\sqrt{2}-1}\sqrt{t\log K}-K^{2}D_{m}^{2}
(vi)t.maxw𝒫Kminq𝒫(x𝒳m𝒞m(x))𝔼λq[λW2]Dm221tlogKK2Dm2\displaystyle\stackrel{{\scriptstyle(vi)}}{{\geqslant}}t.\max\limits_{w\in\mathcal{P}_{K}}\min\limits_{q\in\mathcal{P}\left(\bigcup\limits_{x\in\mathcal{X}_{m}}\mathcal{C}_{m}(x)\right)}\mathbb{E}_{\lambda\sim q}\left[\left\lVert\lambda\right\rVert^{2}_{W}\right]-\frac{D_{m}^{2}}{\sqrt{2}-1}\sqrt{t\log K}-K^{2}D_{m}^{2}
=(vii)tεm2BmDm221tlogKK2Dm2.\displaystyle\stackrel{{\scriptstyle(vii)}}{{=}}t\frac{\varepsilon_{m}^{2}}{B_{m}}-\frac{D_{m}^{2}}{\sqrt{2}-1}\sqrt{t\log K}-K^{2}D_{m}^{2}.

Here the inequalities follow because of (i) lemma 3, (ii) best-response of MIN player as given in Step 15 of the algorithm, (iii) by definition of WsW_{s} in Step 14, (iv) regret property of MAX player (see lemma 4), (v) s=1t1t𝟙{λ=λs}𝒫(x𝒳m𝒞m(x)B(0,Dm))\sum\limits_{s=1}^{t}\frac{1}{t}\mathbbm{1}\{\lambda=\lambda_{s}\}\in\mathcal{P}\left(\bigcup\limits_{x\in\mathcal{X}_{m}}\mathcal{C}_{m}(x)\cap B(0,D_{m})\right), (vi) taking minimum over a larger set, and (vii) follows by explicitly solving the minimization problem and recalling the definition of BmB_{m} We have that,

tBm(21)εm2Dm2logKtBmεm2rm2+Bmεm2K2Dm2.t-\frac{B_{m}}{(\sqrt{2}-1)\varepsilon_{m}^{2}}D_{m}^{2}\sqrt{\log K}\sqrt{t}\leqslant\frac{B_{m}}{\varepsilon_{m}^{2}}r_{m}^{2}+\frac{B_{m}}{\varepsilon_{m}^{2}}K^{2}D_{m}^{2}. (4)

We will do the analysis depending on the value that εm\varepsilon_{m} takes in phase mm.

Case 1. εm=DmCrm(12)m+1\varepsilon_{m}=\frac{D_{m}\sqrt{C}}{r_{m}}\left(\frac{1}{2}\right)^{m+1}.

In this case we have, DmCrm<1\frac{D_{m}\sqrt{C}}{r_{m}}<1. Applying the value of εm\varepsilon_{m} in eq. (4), we have

tBm(21)εm2Dm2logKt\displaystyle t-\frac{B_{m}}{(\sqrt{2}-1)\varepsilon_{m}^{2}}D_{m}^{2}\sqrt{\log K}\sqrt{t} Bmεm2rm2+Bmεm2K2Dm2\displaystyle\leqslant\frac{B_{m}}{\varepsilon_{m}^{2}}r_{m}^{2}+\frac{B_{m}}{\varepsilon_{m}^{2}}K^{2}D_{m}^{2}
tBm(21)Crm2(2m+1)2logKt\displaystyle\Rightarrow t-\frac{B_{m}}{(\sqrt{2}-1)C}r_{m}^{2}\left(2^{m+1}\right)^{2}\sqrt{\log K}\sqrt{t} BmDm2Crm4(2m+1)2+BmCrm2(2m+1)2K2.\displaystyle\leqslant\frac{B_{m}}{D_{m}^{2}C}r_{m}^{4}\left(2^{m+1}\right)^{2}+\frac{B_{m}}{C}r_{m}^{2}\left(2^{m+1}\right)^{2}K^{2}. (5)

Let Tm:=BmDm2Crm4(2m+1)2+BmCrm2(2m+1)2K2T_{m}:=\frac{B_{m}}{D_{m}^{2}C}r_{m}^{4}\left(2^{m+1}\right)^{2}+\frac{B_{m}}{C}r_{m}^{2}\left(2^{m+1}\right)^{2}K^{2}. The function ttt\mapsto\sqrt{t} is a differentiable concave function, meaning for any t1,t2>0t_{1},t_{2}>0, t2t1+12t1(t2t1)\sqrt{t_{2}}\leqslant\sqrt{t_{1}}+\frac{1}{2\sqrt{t_{1}}}(t_{2}-t_{1}). We therefore have

tTm+12Tm(tTm).\sqrt{t}\leqslant\sqrt{T_{m}}+\frac{1}{2\sqrt{T_{m}}}\left(t-T_{m}\right).

Applying both these to (5) and rearranging, we get

tTm(1+2Bmrm2(2m+1)2logK2(21)CTmBmrm2(2m+1)2logK).t\leqslant T_{m}\left(1+\frac{2B_{m}r_{m}^{2}\left(2^{m+1}\right)^{2}\sqrt{\log K}}{2(\sqrt{2}-1)C\sqrt{T_{m}}-B_{m}r_{m}^{2}\left(2^{m+1}\right)^{2}\sqrt{\log K}}\right).

Note that for small enough δ,\delta, the first term in the definition of TmT_{m} dominates the second term, i.e., there exists δ0(1)>0\delta^{(1)}_{0}>0 such that δ<δ0(1),\forall\delta<\delta^{(1)}_{0},

BmCrm2(2m+1)2K2\displaystyle\frac{B_{m}}{C}r_{m}^{2}\left(2^{m+1}\right)^{2}K^{2} \displaystyle\leqslant BmDm2Crm4(2m+1)2,\displaystyle\frac{B_{m}}{D_{m}^{2}C}r_{m}^{4}\left(2^{m+1}\right)^{2},
rm2K2Dm2.\displaystyle\Rightarrow r_{m}^{2}\geqslant K^{2}D_{m}^{2}. (6)

This means that Tm2BmDm2Crm4(2m+1)2,T_{m}\leqslant 2\frac{B_{m}}{D_{m}^{2}C}r_{m}^{4}\left(2^{m+1}\right)^{2}, and hence,

t\displaystyle t 2Bmrm4(2m+1)2Dm2C(1+2Bmrm2(2m+1)2logK2(21)CBmrm4(2m+1)2Dm2CBmrm2(2m+1)2logK)\displaystyle\leqslant 2\frac{B_{m}r_{m}^{4}\left(2^{m+1}\right)^{2}}{D_{m}^{2}C}\left(1+\frac{2B_{m}r_{m}^{2}\left(2^{m+1}\right)^{2}\sqrt{\log K}}{2(\sqrt{2}-1)C\sqrt{\frac{B_{m}r_{m}^{4}\left(2^{m+1}\right)^{2}}{D_{m}^{2}C}}-B_{m}r_{m}^{2}\left(2^{m+1}\right)^{2}\sqrt{\log K}}\right)
=2Bmrm4(2m+1)2Dm2C(1+2DmBm(2m+1)2logK2(21)CDmBm(2m+1)2logK).\displaystyle=2\frac{B_{m}r_{m}^{4}\left(2^{m+1}\right)^{2}}{D_{m}^{2}C}\left(1+\frac{2D_{m}\sqrt{B_{m}\left(2^{m+1}\right)^{2}{\log K}}}{2(\sqrt{2}-1)\sqrt{C}-D_{m}\sqrt{B_{m}\left(2^{m+1}\right)^{2}{\log K}}}\right).

We note here the following lower bound on BmB_{m}.

Bm\displaystyle B_{m} =minw𝒫Kmaxx,x𝒳m,xxxxW12\displaystyle=\min\limits_{w\in\mathcal{P}_{K}}\max\limits_{x,x^{\prime}\in\mathcal{X}_{m},x\neq x^{\prime}}\left\lVert x-x^{\prime}\right\rVert^{2}_{W^{-1}}
minw𝒫Kmaxx,x𝒳m,xxλmin(W1)xx22\displaystyle\geqslant\min\limits_{w\in\mathcal{P}_{K}}\max\limits_{x,x^{\prime}\in\mathcal{X}_{m},x\neq x^{\prime}}\lambda_{min}(W^{-1})\left\lVert x-x^{\prime}\right\rVert_{2}^{2}
=minw𝒫Kmaxx,x𝒳m,xx1λmax(W)xx22\displaystyle=\min\limits_{w\in\mathcal{P}_{K}}\max\limits_{x,x^{\prime}\in\mathcal{X}_{m},x\neq x^{\prime}}\frac{1}{\lambda_{max}(W)}\left\lVert x-x^{\prime}\right\rVert_{2}^{2}
minw𝒫Kmaxx,x𝒳m,xxxx22\displaystyle\geqslant\min\limits_{w\in\mathcal{P}_{K}}\max\limits_{x,x^{\prime}\in\mathcal{X}_{m},x\neq x^{\prime}}\left\lVert x-x^{\prime}\right\rVert_{2}^{2}
=maxx,x𝒳m,xxxx22.\displaystyle=\max\limits_{x,x^{\prime}\in\mathcal{X}_{m},x\neq x^{\prime}}\left\lVert x-x^{\prime}\right\rVert_{2}^{2}.

By using the value of DmD_{m} as given in Step 6 of the algorithm, we note that

DmBm(2m+1)2logK\displaystyle D_{m}\sqrt{B_{m}\left(2^{m+1}\right)^{2}{\log K}} =2(21)Cmaxx,x𝒳m,xxxx22logKBm(2m+1)2logK\displaystyle=2(\sqrt{2}-1)\sqrt{\frac{{C}}{{\max\limits_{x,x^{\prime}\in\mathcal{X}_{m},x\neq x^{\prime}}\left\lVert x-x^{\prime}\right\rVert_{2}^{2}\log K}}}\sqrt{B_{m}\left(2^{m+1}\right)^{2}{\log K}}
2(21)Cmaxx,x𝒳m,xxxx22logKmaxx,x𝒳m,xxxx22(2m+1)2logK\displaystyle\geqslant 2(\sqrt{2}-1)\sqrt{\frac{{C}}{{\max\limits_{x,x^{\prime}\in\mathcal{X}_{m},x\neq x^{\prime}}\left\lVert x-x^{\prime}\right\rVert_{2}^{2}\log K}}}\sqrt{\max\limits_{x,x^{\prime}\in\mathcal{X}_{m},x\neq x^{\prime}}\left\lVert x-x^{\prime}\right\rVert_{2}^{2}\left(2^{m+1}\right)^{2}{\log K}}
=(2m+1).2(21)C>2(21)C.\displaystyle=\left(2^{m+1}\right).2(\sqrt{2}-1)\sqrt{C}>2(\sqrt{2}-1)\sqrt{C}.

Using this we get a bound on tt as:

t\displaystyle t 2Bmrm4(2m+1)2Dm2C=2Bmrm4(2m+1)24(21)2C2(maxx,x𝒳m,xxxx22logK)\displaystyle\leqslant 2\frac{B_{m}r_{m}^{4}\left(2^{m+1}\right)^{2}}{D_{m}^{2}C}=2\frac{B_{m}r_{m}^{4}\left(2^{m+1}\right)^{2}}{4(\sqrt{2}-1)^{2}C^{2}}\left(\max\limits_{x,x^{\prime}\in\mathcal{X}_{m},x\neq x^{\prime}}\left\lVert x-x^{\prime}\right\rVert_{2}^{2}\log K\right)
2Bm(2m+1)2[rm4logK(21)2C2].\displaystyle\leqslant 2B_{m}\left(2^{m+1}\right)^{2}\left[\frac{r_{m}^{4}\log K}{(\sqrt{2}-1)^{2}C^{2}}\right].

Since, by assumption, Cλmin(k=1KxkxkT)=Θ(1)C\equiv\lambda_{min}\left(\sum\limits_{k=1}^{K}x_{k}x_{k}^{T}\right)=\Theta(1), we have tlimO(Bm(2m+1)2rm4logK),δ<δ0(1).t\leqslant\lim O\left(B_{m}\left(2^{m+1}\right)^{2}r_{m}^{4}\log K\right),~{}\forall\delta<\delta^{(1)}_{0}.

Case 2. εm=(12)m+1\varepsilon_{m}=\left(\frac{1}{2}\right)^{m+1}.

We have in this case that, DmCrm1\frac{D_{m}\sqrt{C}}{r_{m}}\geqslant 1. Applying the value of εm\varepsilon_{m} in eq. (4), we obtain

tBm(21)εm2Dm2logKt\displaystyle t-\frac{B_{m}}{(\sqrt{2}-1)\varepsilon_{m}^{2}}D_{m}^{2}\sqrt{\log K}\sqrt{t} Bmεm2rm2+Bmεm2K2Dm2\displaystyle\leqslant\frac{B_{m}}{\varepsilon_{m}^{2}}r_{m}^{2}+\frac{B_{m}}{\varepsilon_{m}^{2}}K^{2}D_{m}^{2} (7)
tBm(21)Dm2(2m+1)2logKt\displaystyle\Rightarrow t-\frac{B_{m}}{(\sqrt{2}-1)}D_{m}^{2}\left(2^{m+1}\right)^{2}\sqrt{\log K}\sqrt{t} Bmrm2(2m+1)2+Bm(2m+1)2K2Dm2.\displaystyle\leqslant{B_{m}}r_{m}^{2}\left(2^{m+1}\right)^{2}+{B_{m}}\left(2^{m+1}\right)^{2}K^{2}D_{m}^{2}. (8)

Let Tm:=Bmrm2(2m+1)2+Bm(2m+1)2K2Dm2.T_{m}:={B_{m}}r_{m}^{2}\left(2^{m+1}\right)^{2}+{B_{m}}\left(2^{m+1}\right)^{2}K^{2}D_{m}^{2}.. As before, noting that ttt\mapsto\sqrt{t} is a concave, differentiable function, we have

tTm+12Tm(tTm).\sqrt{t}\leqslant\sqrt{T_{m}}+\frac{1}{2\sqrt{T_{m}}}\left(t-T_{m}\right).

Applying this to (8) and rearranging, we get

tTm(1+2Bmrm2(2m+1)2logK2(21)CTmBmrm2(2m+1)2logK).t\leqslant T_{m}\left(1+\frac{2B_{m}r_{m}^{2}\left(2^{m+1}\right)^{2}\sqrt{\log K}}{2(\sqrt{2}-1)C\sqrt{T_{m}}-B_{m}r_{m}^{2}\left(2^{m+1}\right)^{2}\sqrt{\log K}}\right).

Going along the same lines as Case 1, we see that there exists δ0(2)>0\delta^{(2)}_{0}>0 such that δ<δ0(2)\forall~{}\delta<\delta^{(2)}_{0}, Tm2Bmrm2(2m+1)2T_{m}\leqslant 2{B_{m}}r_{m}^{2}\left(2^{m+1}\right)^{2}, whence

t2Bm(2m+1)2rm2.\displaystyle t\leqslant 2B_{m}\left(2^{m+1}\right)^{2}r_{m}^{2}.

We now set δ0=min{δ0(1),δ0(2)}\delta_{0}=\min\{\delta^{(1)}_{0},\delta^{(2)}_{0}\}.

Appendix E Justification of elimination criteria

In this section, we argue that progress is made after every phase of the algorithm. We will also show the correctness of the algorithm. Let us define a few terms which will be useful for analysis.

Let 𝒮m:={x𝒳:θT(xx)<12m}\mathcal{S}_{m}:=\left\{x\in\mathcal{X}:{\theta^{*}}^{T}\left(x^{*}-x\right)<\frac{1}{2^{m}}\right\}. Let Bm:=minw𝒫Kmax(x,x)𝒮m2,xxxxW12B_{m}^{*}:=\min\limits_{w\in\mathcal{P}_{K}}\max\limits_{(x,x^{\prime})\in{\color[rgb]{0,0,1}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,1}\mathcal{S}_{m}^{2}},x\neq x^{\prime}}\left\lVert x-x^{\prime}\right\rVert^{2}_{W^{-1}}, where W=k=1KwkxkxkW=\sum\limits_{k=1}^{K}w^{k}x_{k}x_{k}. Finally, define Tm:=BmDm2Crm4(2m+1)2+BmCrm2(2m+1)2K2Dm2T_{m}^{*}:=\frac{B_{m}^{*}}{D_{m}^{2}C}r_{m}^{4}\left(2^{m+1}\right)^{2}+\frac{B_{m}^{*}}{C}r_{m}^{2}\left(2^{m+1}\right)^{2}K^{2}D_{m}^{2}.

Define a sequence of favorable events {𝒢m}m1\left\{\mathcal{G}_{m}\right\}_{m\geqslant 1} as,

𝒢m:={NmTm(1+2Bmrm2(2m+1)2logK2(21)CTmBmrm2(2m+1)2logK)}{x𝒳m+1}{𝒳m+1𝒮m+1}.\mathcal{G}_{m}:=\left\{N_{m}\leqslant T_{m}^{*}\left(1+\frac{2B_{m}^{*}r_{m}^{2}\left(2^{m+1}\right)^{2}\sqrt{\log K}}{2(\sqrt{2}-1)C\sqrt{T_{m}^{*}}-B_{m}^{*}r_{m}^{2}\left(2^{m+1}\right)^{2}\sqrt{\log K}}\right)\right\}\bigcap\left\{x^{*}\in\mathcal{X}_{m+1}\right\}\bigcap\left\{\mathcal{X}_{m+1}\subseteq\mathcal{S}_{m+1}\right\}.
Note 3.

Conditioned on the event 𝒢m1\mathcal{G}_{m-1}, x𝒳mx^{*}\in\mathcal{X}_{m} and 𝒳m𝒮m\mathcal{X}_{m}\subseteq\mathcal{S}_{m}. Hence, BmBmB_{m}\leqslant B_{m}^{*} and TmTmT_{m}\leqslant T_{m}^{*}. Hence, under the event 𝒢m1\mathcal{G}_{m-1},

NmTm(1+2Bmrm2(2m+1)2logK2(21)CTmBmrm2(2m+1)2logK)a.s.N_{m}\leqslant T_{m}^{*}\left(1+\frac{2B_{m}^{*}r_{m}^{2}\left(2^{m+1}\right)^{2}\sqrt{\log K}}{2(\sqrt{2}-1)C\sqrt{T_{m}^{*}}-B_{m}^{*}r_{m}^{2}\left(2^{m+1}\right)^{2}\sqrt{\log K}}\right)a.s.

Note here that the right hand side is a non-random quantity.

Lemma 6.

[𝒢m|𝒢m1,,𝒢1]1δm\mathbb{P}\left[{\mathcal{G}_{m}\;\big{|}\;\mathcal{G}_{m-1},\ldots,\mathcal{G}_{1}}\right]\geqslant 1-\delta_{m}.

Proof of lemma 6.

Let y=xixjy=x_{i}-x_{j} for some xi,xj𝒳m,xixjx_{i},x_{j}\in\mathcal{X}_{m},x_{i}\neq x_{j}. Since θ^m\widehat{\theta}_{m} is a least squares estimate of θ\theta^{*}, conditioned on the realization of the set 𝒳m\mathcal{X}_{m}, yT(θ^mθ)y^{T}\left(\widehat{\theta}_{m}-\theta^{*}\right) is a y(VNmm)12\left\lVert y\right\rVert^{2}_{(V_{N_{m}}^{m})^{-1}}-sub-Gaussian random variable.

By the key lemma 1 we have that y(VNmm)1218(2m+1)2log(K2/δm)\left\lVert y\right\rVert^{2}_{(V_{N_{m}}^{m})^{-1}}\leqslant\frac{1}{8\left(2^{m+1}\right)^{2}\log\left(K^{2}/\delta_{m}\right)}. Using property of sub-Gaussian random variables, we write for any η(0,1)\eta\in\left(0,1\right),

[|yT(θ^mθ)|>2y(VNmm)12log(2/η)|𝒢m1,,𝒢1]η,\mathbb{P}\left[{\left|y^{T}\left(\widehat{\theta}_{m}-\theta^{*}\right)\right|>\sqrt{2\left\lVert y\right\rVert^{2}_{(V_{N_{m}}^{m})^{-1}}\log\left(2/\eta\right)}\;\big{|}\;\mathcal{G}_{m-1},\ldots,\mathcal{G}_{1}}\right]\leqslant\eta,

which implies that

[|yT(θ^mθ)|>2log(2/η)8(2m+1)2log(K2/δm)|𝒢m1,,𝒢1]η.\mathbb{P}\left[{\left|y^{T}\left(\widehat{\theta}_{m}-\theta^{*}\right)\right|>\sqrt{\frac{2\log\left(2/\eta\right)}{8\left(2^{m+1}\right)^{2}\log\left(K^{2}/\delta_{m}\right)}}\;\big{|}\;\mathcal{G}_{m-1},\ldots,\mathcal{G}_{1}}\right]\leqslant\eta.

Taking intersection over all possible y𝒴(𝒳m)y\in\mathcal{Y}\left(\mathcal{X}_{m}\right), and setting η=2δm/K2\eta=2\delta_{m}/{K^{2}}, gives

[y𝒴(𝒳m):|yT(θθ^m)|2(m+2)|𝒢m1,,𝒢1]>1δm.\mathbb{P}\left[{\forall y\in\mathcal{Y}\left(\mathcal{X}_{m}\right):\left|y^{T}\left(\theta^{*}-\widehat{\theta}_{m}\right)\right|\leqslant 2^{-(m+2)}\;\big{|}\;\mathcal{G}_{m-1},\ldots,\mathcal{G}_{1}}\right]>1-\delta_{m}. (9)

Conditioned on 𝒢m1\mathcal{G}_{m-1}, x𝒳mx^{*}\in\mathcal{X}_{m}. Let x𝒳mx^{\prime}\in\mathcal{X}_{m} be such that x𝒮m+1x^{\prime}\notin\mathcal{S}_{m+1}. Let y=(xx)y=\left(x^{*}-x^{\prime}\right). Then y𝒴(𝒳m)y\in\mathcal{Y}\left(\mathcal{X}_{m}\right). By eq. (9) we have with probability 1δm\geqslant 1-\delta_{m}:

(xx)T(θθ^m)2(m+2)θ^mT(xx)>2(m+1)2(m+2)=2(m+2).\left(x^{*}-x^{\prime}\right)^{T}\left(\theta^{*}-\widehat{\theta}_{m}\right)\leqslant 2^{-(m+2)}\Rightarrow\widehat{\theta}_{m}^{T}\left(x^{*}-x^{\prime}\right)>2^{-(m+1)}-2^{-(m+2)}=2^{-(m+2)}.

Thus arm xx^{\prime} will get eliminated after phase mm by the elimination criteria of algorithm 1(see step 25 of algorithm 1). Hence 𝒳m+1𝒮m+1\mathcal{X}_{m+1}\subseteq\mathcal{S}_{m+1} w.p. 1δm\geqslant 1-\delta_{m}.
Next, we show that conditioned on 𝒢m1\mathcal{G}_{m-1}, x𝒳m+1x^{*}\in\mathcal{X}_{m+1}, w.p. 1δm\geqslant 1-\delta_{m}. Suppose that xx^{*} gets eliminated at the end of phase mm. This means that x𝒳m\exists x^{\prime}\in\mathcal{X}_{m}, such that θ^mT(xx)>2(m+2)\widehat{\theta}_{m}^{T}\left(x^{\prime}-x^{*}\right)>2^{-(m+2)}. However, by eq. 9,

(xx)T(θ^mθ)2(m+2)θT(xx)<0\left(x^{\prime}-x^{*}\right)^{T}\left(\widehat{\theta}_{m}-\theta^{*}\right)\leqslant 2^{-(m+2)}\Rightarrow{\theta^{*}}^{T}\left(x^{*}-x^{\prime}\right)<0

which is a contradiction. This, along with note 3 shows that [𝒢m|𝒢m1,,𝒢1]1δm\mathbb{P}\left[{\mathcal{G}_{m}\;\big{|}\;\mathcal{G}_{m-1},\ldots,\mathcal{G}_{1}}\right]\geqslant 1-\delta_{m}.

Corollary 1.
[m1𝒢m]m=1(1δm2)1δ.\mathbb{P}\left[{\bigcap\limits_{m\geqslant 1}\mathcal{G}_{m}}\right]\geqslant\prod_{m=1}^{\infty}\left(1-\frac{\delta}{m^{2}}\right)\geqslant 1-\delta.
Corollary 2.

The maximum number of phases of Algorithm 1 is bounded by log21Δmin\log_{2}\frac{1}{\Delta_{min}}.

Proof.

Recall that Δmin=minx𝒳:xxθT(xx).\Delta_{min}=\min_{x\in\mathcal{X}:x\neq x^{*}}{\theta^{*}}^{T}\left(x^{*}-x\right). The proof follows by observing that after any phase mm, under the favorable event 𝒢m1\mathcal{G}_{m-1}, 𝒳m𝒮m\mathcal{X}_{m}\subseteq\mathcal{S}_{m}. Since the size 𝒮m\mathcal{S}_{m} shrinks exponentially with the number of phases (because𝒮m={x𝒳:θT(xx)<12m})\left(\text{because}~{}\mathcal{S}_{m}=\left\{x\in\mathcal{X}:{\theta^{*}}^{T}\left(x^{*}-x\right)<\frac{1}{2^{m}}\right\}\right), we have the result. ∎

Appendix F Proof of bound on sample complexity

We begin by observing the following useful result from [8]. Recall that

Dθ=maxwΔKminx𝒳,xx(θT(xx))2xxW12D_{\theta^{*}}=\max\limits_{w\in\Delta_{K}}\min\limits_{x\in\mathcal{X},x\neq x^{*}}\frac{\left({\theta^{*}}^{T}\left(x^{*}-x\right)\right)^{2}}{\left\lVert x^{*}-x\right\rVert^{2}_{W^{-1}}}
Proposition 1 ([8]).
m=1log21Δmin(2m)2Bm4log2(1/Δmin)Dθ.\sum_{m=1}^{\log_{2}\frac{1}{\Delta_{min}}}\left(2^{m}\right)^{2}B_{m}^{*}\leqslant\frac{4\log_{2}\left(1/\Delta_{min}\right)}{D_{\theta^{*}}}.

Using proposition 1 we now give a bound on the asymptotic sample complexity of algorithm 1.

Theorem 2.

With probability at least 1δ1-\delta, PEPEG returns the optimal arm after τ\tau rounds, with

τ(2048log2(1/Δmin)Dθ[(log((log2(1/Δmin))2K2/δ))2logK(21)2C2])+(256log2(1/Δmin)Dθlog((log2(1/Δmin))2K2/δ)).\begin{split}\tau\leqslant\left(2048\frac{\log_{2}\left(1/\Delta_{min}\right)}{D_{\theta^{*}}}\left[\frac{\left(\log\left(\left(\log_{2}\left(1/\Delta_{min}\right)\right)^{2}K^{2}/\delta\right)\right)^{2}\log K}{(\sqrt{2}-1)^{2}C^{2}}\right]\right)+\\ \left(256\frac{\log_{2}\left(1/\Delta_{min}\right)}{D_{\theta^{*}}}\log\left(\left(\log_{2}\left(1/\Delta_{min}\right)\right)^{2}K^{2}/\delta\right)\right).\end{split}
Proof.

The proof follows from Lemma 2 (phase length bound), Corollary 2 (bound on number of phases), Prop. 1 above and the fact that the sum of several non negative quantities is bigger than their max.

To begin with, the discussion in Sec. E shows that in every phase, BmBm.B_{m}\leqslant B^{*}_{m}. Next, Lemma 2 gives us (w.h.p),

τ\displaystyle\tau =m=1log2(1/Δmin)Nm\displaystyle=\sum_{m=1}^{\log_{2}\left(1/\Delta_{min}\right)}N_{m}
m=1log2(1/Δmin)max{2Bm(2m+1)2[rm4logK(21)2C2],2Bm(2m+1)2rm2}\displaystyle\leqslant\sum_{m=1}^{\log_{2}\left(1/\Delta_{min}\right)}\max\left\{2B_{m}^{*}\left(2^{m+1}\right)^{2}\left[\frac{r_{m}^{4}\log K}{(\sqrt{2}-1)^{2}C^{2}}\right],2B_{m}^{*}\left(2^{m+1}\right)^{2}r_{m}^{2}\right\}
m=1log2(1/Δmin)2Bm(2m+1)2[rm4logK(21)2C2]+m=1log2(1/Δmin)2Bm(2m+1)2rm2\displaystyle\leqslant\sum\limits_{m=1}^{\log_{2}\left(1/\Delta_{min}\right)}2B_{m}^{*}\left(2^{m+1}\right)^{2}\left[\frac{r_{m}^{4}\log K}{(\sqrt{2}-1)^{2}C^{2}}\right]+\sum\limits_{m=1}^{\log_{2}\left(1/\Delta_{min}\right)}2B_{m}^{*}\left(2^{m+1}\right)^{2}r_{m}^{2}

Hence, using the fact that rm=8logK2/δmr_{m}=\sqrt{8\log{K}^{2}/\delta_{m}} and invoking Prop. 1 we get

τm=1log2(1/Δmin)512Bm(2m)2[(log(K2/δm))2logK(21)2C2]+m=1log2(1/Δmin)64Bm(2m+1)2log(K2/δm)()2048log2(1/Δmin)Dθ[(log((log2(1/Δmin))2K2/δ))2logK(21)2C2]+256log2(1/Δmin)Dθlog((log2(1/Δmin))2K2/δ),\displaystyle\begin{split}\tau&\leqslant\sum\limits_{m=1}^{\log_{2}\left(1/\Delta_{min}\right)}512B_{m}^{*}\left(2^{m}\right)^{2}\left[\frac{\left(\log\left(K^{2}/\delta_{m}\right)\right)^{2}\log K}{(\sqrt{2}-1)^{2}C^{2}}\right]\\ &\quad\quad\quad+\sum\limits_{m=1}^{\log_{2}\left(1/\Delta_{min}\right)}64B_{m}^{*}\left(2^{m+1}\right)^{2}\log\left(K^{2}/\delta_{m}\right)\\ &\stackrel{{\scriptstyle(\ast)}}{{\leqslant}}2048\frac{\log_{2}\left(1/\Delta_{min}\right)}{D_{\theta^{*}}}\left[\frac{\left(\log\left(\left(\log_{2}\left(1/\Delta_{min}\right)\right)^{2}K^{2}/\delta\right)\right)^{2}\log K}{(\sqrt{2}-1)^{2}C^{2}}\right]\\ &\quad\quad\quad+256\frac{\log_{2}\left(1/\Delta_{min}\right)}{D_{\theta^{*}}}\log\left(\left(\log_{2}\left(1/\Delta_{min}\right)\right)^{2}K^{2}/\delta\right),\end{split}

where ()(\ast) follows from the fact that K2δm=m2K2δ(log2(1/Δmin))2K2δ.\frac{K^{2}}{\delta_{m}}=\frac{m^{2}K^{2}}{\delta}\leqslant\frac{\left(\log_{2}\left(1/\Delta_{min}\right)\right)^{2}K^{2}}{\delta}.

Appendix G Experiment Details

In this section, we provide some details on the implementation of each algorithm. Each experiment was repeated 50 times and the errorbar plots show the mean sample complexity with 1-standard deviations.

  • For implementation of PELEG, as mentioned in Sec. 6, we ignore the intersection with the ball B(0,Dm)B(0,D_{m}) in the phase stopping criterion. This helps in implementing a closed form expression for the stopping rule. The learning rate parameter in the EXP-WTS subroutine is set to be equal to (1/Dm2)8logK/t(1/D_{m}^{2})\sqrt{8\log K/t}.

  • LinGapE: In the paper of [15] LinGapE was simulated using a greedy arm selection strategy that deviates from the algorithm that is analyzed. We instead implement the LinGapE algorithm in the form that it is analyzed.

  • For implementation of RAGE, ALBA and 𝒳𝒴\mathcal{XY-}ORACLE, we have used the code provided in the Supplementary material of Fiez et al [8]. We refer the readers to Appendix Sec. F of [8] for further details of their implementations.