Explicit Best Arm Identification in Linear Bandits Using No-Regret Learners
Abstract.
We study the problem of best arm identification in linearly parameterised multi-armed bandits. Given a set of feature vectors $\mathcal{X} \subset \mathbb{R}^d$, a confidence parameter $\delta$ and an unknown vector $\theta^* \in \mathbb{R}^d$, the goal is to identify $\arg\max_{x \in \mathcal{X}} x^\top \theta^*$, with probability at least $1-\delta$, using noisy measurements of the form $x^\top \theta^*$. For this fixed confidence ($\delta$-PAC) setting, we propose an explicitly implementable and provably order-optimal sample-complexity algorithm to solve this problem. Previous approaches rely on access to minimax optimization oracles. The algorithm, which we call the Phased Elimination Linear Exploration Game (PELEG), maintains a high-probability confidence ellipsoid containing $\theta^*$ in each round and uses it to eliminate suboptimal arms in phases. PELEG achieves fast shrinkage of this confidence ellipsoid along the most confusing (i.e., close to, but not optimal) directions by interpreting the problem as a two player zero-sum game, and sequentially converging to its saddle point using low-regret learners to compute players’ strategies in each round. We analyze the sample complexity of PELEG and show that it matches, up to order, an instance-dependent lower bound on sample complexity in the linear bandit setting. We also provide numerical results for the proposed algorithm consistent with its theoretical guarantees.
1 Introduction
Function optimization over structured domains is a basic sequential decision making problem. A well-known formulation of this problem is Probably Approximately Correct (PAC) best arm identification in multi-armed bandits [7], in which a learner is given a set of arms with unknown (and unrelated) means. The learner must sequentially test arms and output, as soon as possible with high confidence, a near-optimal arm (where optimality is defined in terms of the largest mean).
Often, the arms (decisions) and their associated rewards possess structural relationships, allowing for more efficient learning of the rewards and transfer of learnt information, e.g., two ‘close enough’ arms may have similar mean rewards. One of the best-known examples of structured decision spaces is the linear bandit, whose arms are vectors (points) in $\mathbb{R}^d$. The reward or function value of an arm is an unknown linear function of its vector representation, and the goal is to find an arm with maximum reward in the shortest possible time by measuring arms’ rewards sequentially with noise. This framework models an array of structured online linear optimization problems including adaptive routing [2], smooth function optimization over graphs [20], subset selection [13] and, in the nonparametric setting, black box optimization in smooth function spaces [18], among others.
Although no-regret online learning for linear bandits is a well-understood problem (see [14] and references therein), the PAC sample complexity of best arm identification in this model has not received significant attention until recently [17]. The state of the art here is the work of Fiez et al. [8], who give an algorithm with optimal (instance-dependent) PAC sample complexity. However, a closer look indicates that the algorithm assumes repeated oracle access to a minimax optimization problem (this is, in fact, a plug-in version of a minimax optimization problem representing an information-theoretic sample complexity lower bound for the problem); it is not clear, from a performance standpoint, in what manner (and to what accuracy) this optimization problem should be practically solved to enjoy the claimed sample complexity (for its experiments, the paper implements an approximate minimax oracle using the Frank-Wolfe algorithm and a heuristic stopping rule, but this is not rigorously justifiable for nonsmooth optimization; see Sec. 4). Hence, the question of how to design an explicit algorithm with optimal PAC sample complexity for best arm identification in linear bandits has remained open.
In this paper, we resolve this question affirmatively by giving an explicit linear bandit best-arm identification algorithm with instance-optimal PAC sample complexity and, more importantly, a clearly quantified computational effort. We achieve this goal using new techniques: the main ingredient in the proposed algorithm is a game-theoretic interpretation of the minimax optimization problem that is at the heart of the instance-based sample complexity lower bound. This in turn yields an adaptive, sample-based approach using carefully constructed confidence sets for the unknown parameter $\theta^*$. The adaptive sampling strategy is driven by the interaction of two no-regret online learning subroutines that attempt to solve the minimax problem approximately, obviating the worry of (i) solving for the optimal minimax allocation to a suitable precision and (ii) making an integer sampling allocation from it by rounding, both of which occur in the approach of Fiez et al. [8]. We note that the seeds of this game-theoretic approach were laid by the recent work of Degenne et al. [6] for the simple (i.e., unstructured) multi-armed bandit problem. However, our work demonstrates a novel extension of their methodology to solve best-arm learning in structured multi-armed bandits for the first time, to the best of our knowledge.
1.1 Related Work
The PAC best arm identification problem for linearly parameterised bandits was first studied in [17], in which an adaptive algorithm is given with a sample complexity guarantee involving a hardness term which in general renders the sample complexity suboptimal. Tao et al. [19] take the path of constructing new estimators instead of ordinary least squares, using which they give an algorithm achieving the familiar sum-of-inverse-gaps sample complexity known for standard bandits; this is, however, not optimal for general linear bandits. The LinGapE algorithm [15] attempts to solve best arm identification with a fully adaptive strategy, but its sample complexity in general is not instance-optimal and can additionally scale with the total number of arms, in addition to the extra dimension dependence known to be incurred by confidence sets constructed from self-normalized inequalities [1]. Zaki et al. [21] design a fully adaptive algorithm based on the Lower-Upper Confidence Bound (LUCB) principle [11], with guarantees limited to 2- and 3-armed settings. Fiez et al. [8] give a phased elimination algorithm achieving the ideal information-theoretic sample complexity, but with minimax oracle access and an additional rounding operation; we detail an explicit arm-playing strategy that eliminates both of these steps, within the same high-level template. In a separate vein, game-theoretic techniques to solve minimax problems have existed for over two decades [9]; only recently have they been combined with optimism to give a powerful framework for solving adaptive hypothesis testing problems [6].
Table 1 compares the sample complexities of various best arm identification algorithms in the literature.
2 Problem Statement and Notation
We study the problem of best arm identification in linear bandits with the arm set $\mathcal{X} = \{x_1, \ldots, x_K\} \subset \mathbb{R}^d$, where each arm $x_i$ is a vector in $\mathbb{R}^d$ (in general, the number of arms $K$ can be much larger than the ambient dimension, i.e., $K \gg d$). We will interchangeably use $\mathcal{X}$ and the index set $[K] := \{1, \ldots, K\}$ whenever the context is clear. In every round $t$ the agent chooses an arm $x_t \in \mathcal{X}$, and receives a reward $y_t = x_t^\top \theta^* + \eta_t$, where $\theta^* \in \mathbb{R}^d$ is assumed to be a fixed but unknown vector, and $\eta_t$ is zero-mean noise assumed to be conditionally subgaussian, i.e., $\mathbb{E}\left[e^{\lambda \eta_t} \mid x_1, \eta_1, \ldots, x_{t-1}, \eta_{t-1}, x_t\right] \le e^{\lambda^2/2}$ for all $\lambda \in \mathbb{R}$. We denote by $\nu_x$ the distribution of the reward obtained by pulling arm $x$, i.e., $y_t \sim \nu_{x_t}$ whenever arm $x_t$ is pulled. Given two probability distributions $\nu, \nu'$ over $\mathbb{R}$, $\mathrm{KL}(\nu, \nu')$ denotes the KL divergence of $\nu$ and $\nu'$ (assuming $\nu \ll \nu'$). Given $\theta^*$, let $x^* := \arg\max_{x \in \mathcal{X}} x^\top \theta^*$, where we assume that $\theta^*$ is such that the argmax is unique.
A learning algorithm for the best arm identification problem comprises the following rules: (1) a sampling rule, which determines, based on the past play of arms and observations, which arm to pull next, (2) a stopping rule, which controls the end of the sampling phase and is a function of the past observations and rewards, and (3) a recommendation rule, which, when the algorithm stops, offers a guess for the best arm. The goal of a learning algorithm is: given an error probability $\delta \in (0, 1)$, identify (guess) $x^*$ with probability at least $1-\delta$ by pulling as few (in an expected sense) arms as possible. Any algorithm that (1) stops with probability 1 and (2) returns $x^*$ upon stopping with probability at least $1-\delta$ is said to be $\delta$-Probably Approximately Correct ($\delta$-PAC). For clarity of exposition, we distinguish the above linear bandit setting from what we term the unstructured bandit setting, wherein $d = K$ and $\mathcal{X}$ is the set of canonical basis vectors (the former setting generalizes the latter). The (expected) number of samples consumed by an algorithm in determining the optimal arm in any bandit setting (not necessarily the linear setting) is called its sample complexity.
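The interaction protocol just described is straightforward to simulate. The following minimal sketch (Gaussian noise and the illustrative basis-vector arm set are assumptions for the example, not part of the paper's setup) shows the sampling interface that the three rules above operate through.

```python
import numpy as np

rng = np.random.default_rng(0)

def pull(arm, theta_star, noise_sd=1.0):
    """Return one noisy reward y = <arm, theta_star> + eta."""
    return float(arm @ theta_star) + noise_sd * rng.standard_normal()

# Illustrative instance: canonical basis arms in R^5 (the unstructured special case).
d = 5
arms = np.eye(d)                          # each row is an arm vector
theta_star = np.array([1.0, 0.8, 0.6, 0.4, 0.2])
best = int(np.argmax(arms @ theta_star))  # index of x*
y = pull(arms[2], theta_star)             # one noisy observation of arm 2
```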
In the rest of the paper, we will assume that $\max_{x \in \mathcal{X}} \|x\|_2 \le 1$. Given a positive definite matrix $M$, we denote by $\|v\|_M := \sqrt{v^\top M v}$ the matrix norm induced by $M$. For any $x \in \mathcal{X}$, we define $\Delta_x := (x^* - x)^\top \theta^*$ to be the gap between the largest expected reward and the expected reward of (suboptimal) arm $x$. Let $\Delta_{\min} := \min_{x \ne x^*} \Delta_x$. We denote by $\mathcal{B}(c, r)$ the closed ball with center $c$ and radius $r > 0$. For any measurable space $\mathcal{S}$, we define $\mathcal{P}(\mathcal{S})$ to be the set of all probability measures on $\mathcal{S}$. $\tilde{O}(\cdot)$ is big-Oh notation that suppresses logarithmic dependence on problem parameters. For the benefit of the reader, we provide a glossary of commonly used symbols in Sec. A in the Appendix.
Table 1: Comparison of best arm identification algorithms for linear bandits.

| Algorithm | Sample Complexity | Remarks |
| --- | --- | --- |
| $\mathcal{XY}$-static [17] | | |
| LinGapE [15] | | Fully adaptive; sub-optimal in general. |
| ALBA [19] | | Fully adaptive; sub-optimal in general (see [8]). |
| RAGE [8] | | |
| PELEG (this paper) | | |

Notes: The LinGapE bound involves a complicated term defined in terms of a solution to an offline optimization problem in [15]. Another term appearing in the bounds depends only on the geometry of the arm set.
3 Overview of Algorithmic Techniques
In this section we describe the main ingredients in our algorithm design and how they build upon ideas introduced in recent work [6, 8] (the explicit algorithm appears in Sec. 4).
The phased elimination approach: Fiez et al. [8]. We first note that a lower bound on the sample complexity of any $\delta$-PAC algorithm for the canonical (i.e., unstructured) bandit setting [10] was generalized by Fiez et al. [8] to the linear bandit setting, assuming the noise terms $\eta_t$ to be standard normal random variables. This result states that the stopping time $\tau$ of any $\delta$-PAC algorithm in the linear setting must satisfy $\mathbb{E}[\tau] \ge \rho^* \log\left(\frac{1}{2.4\delta}\right)$, where $\rho^* := \min_{\lambda \in \mathcal{P}(\mathcal{X})} \max_{x \in \mathcal{X} \setminus \{x^*\}} \frac{\|x^* - x\|^2_{A(\lambda)^{-1}}}{\Delta_x^2}$ and $A(\lambda) := \sum_{x \in \mathcal{X}} \lambda_x \, x x^\top$. The bound suggests a natural $\delta$-PAC strategy, namely, to sample arms according to the distribution

$$\lambda^* \in \arg\min_{\lambda \in \mathcal{P}(\mathcal{X})} \max_{x \in \mathcal{X} \setminus \{x^*\}} \frac{\|x^* - x\|^2_{A(\lambda)^{-1}}}{\Delta_x^2}. \qquad (1)$$
In fact, as [8, Sec. 2.2] explains, using the Ordinary Least Squares (OLS) estimator for $\theta^*$ and sampling each arm a number of times proportional to $\lambda^*$, for a total of roughly $\rho^* \log(1/\delta)$ samples, ensures that the empirical best arm equals $x^*$ with probability at least $1-\delta$. Unfortunately, this sampling distribution cannot directly be implemented since $\theta^*$ is unknown.
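As an illustration of the objective being minimized in (1), the following minimal sketch evaluates $\max_{x \ne x^*} \|x^* - x\|^2_{A(\lambda)^{-1}} / \Delta_x^2$ for a candidate allocation $\lambda$; an oracle with knowledge of $\theta^*$ would minimize this over the probability simplex. All names are illustrative, and the true gaps are plugged in only because this is the oracle quantity.

```python
import numpy as np

def allocation_objective(arms, theta, lam, reg=1e-9):
    """max over suboptimal x of ||x* - x||^2_{A(lam)^-1} / gap(x)^2."""
    rewards = arms @ theta
    star = int(np.argmax(rewards))
    A = sum(l * np.outer(x, x) for l, x in zip(lam, arms))
    A += reg * np.eye(arms.shape[1])      # small ridge only to guarantee invertibility
    A_inv = np.linalg.inv(A)
    worst = 0.0
    for i, x in enumerate(arms):
        if i == star:
            continue
        y = arms[star] - x
        gap = rewards[star] - rewards[i]
        worst = max(worst, float(y @ A_inv @ y) / gap**2)
    return worst
```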
Fiez et al. circumvent this difficulty by designing a nontrivial strategy (RAGE) that attempts to mimic the optimal allocation $\lambda^*$ in phases. Specifically, in phase $m$, it tries to eliminate arms that are about $2^{-m}$-suboptimal (in their gaps), by solving (1) with a plug-in estimate of $\theta^*$. The resulting fractional allocation, passed through a separate discrete rounding procedure, gives an integer pull-count distribution which ensures that all surviving arms’ mean differences are estimated with high precision and confidence.
Though direct and appealing, this phased elimination strategy relies crucially on solving minimax problems of the form (1). Though the inner (max) function is convex as a function of the allocation on the probability simplex (see, e.g., Lemma 1 in [21]), it is non-smooth, and it is not made explicit in [8] how, and to what accuracy, it must be solved. Fortunately, we are able to circumvent this obstacle by using ideas from games between no-regret online learners with optimism, as introduced in the work of Degenne et al. [6] for unstructured bandits.
From Pure-exploration Games to $\delta$-PAC Algorithms: Degenne et al. [6]. We briefly explain some of the insights in [6] that we leverage to design an explicit linear bandit $\delta$-PAC algorithm with low computational complexity. For a fixed weight parameter $w \in \mathcal{P}([K])$, consider the two-player, zero-sum Pure-exploration Game in which the MAX (or column) player plays an arm $k$, while the MIN (or row) player chooses an alternative bandit model $\lambda$ whose best arm differs from that of the true model $\mu$; MAX then receives a payoff of $\mathrm{KL}(\mu_k, \lambda_k)$ from MIN. For a given $\mu$, define $D^*(\mu) := \max_{w \in \mathcal{P}([K])} \inf_{\lambda \in \mathrm{Alt}(\mu)} \sum_k w_k \, \mathrm{KL}(\mu_k, \lambda_k)$, and let $w^*(\mu)$ be the mixed strategy that attains the maximum. With MAX moving first and playing a mixed strategy $w$, the value of the game becomes $\inf_{\lambda \in \mathrm{Alt}(\mu)} \sum_k w_k \, \mathrm{KL}(\mu_k, \lambda_k)$. In the unstructured bandit setting, to match the sample complexity lower bound, any algorithm must essentially sample arm $k$ at rate $N_k(t)/t \approx w_k^*(\mu)$, where $N_k(t)$ is the number of times arm $k$ has been sampled up to time $t$ [12]. This helps explain why any $\delta$-PAC algorithm implicitly needs to solve the Pure Exploration Game.
We crucially employ no-regret online learners to solve the Pure Exploration Game for linear bandits. More precisely, no-regret learning with the well-known Exponential Weights rule (negative-entropy mirror descent) [16] on one hand, and a best-response convex programming subroutine on the other, provides a direct sampling strategy that obviates the need for separate allocation optimization and rounding for sampling as in [8]. One crucial advantage of our approach (inspired by [6]) is that we only use a best-response oracle to solve the inner minimization over alternative models, which gives us a computational edge over [8], who employ the computationally more costly max-min oracle to solve $D^*(\mu)$ or its linear bandit equivalent, (1).
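To make the interaction of the two learners concrete, here is a minimal sketch of one way such a saddle-point iteration can look in the Gaussian linear case, assuming unit-variance noise, a plug-in estimate `theta_hat`, a fixed learning rate, and ignoring PELEG's ball constraint and loss clipping; all names are illustrative, and the exact losses used by PELEG are those specified in Algorithm 1.

```python
import numpy as np

def hedge_vs_best_response(arms, theta_hat, T=200, eta=0.1, reg=1e-9):
    """Sketch: Hedge (MAX player, over arms) vs. best response (MIN player,
    over alternative models), for a plug-in estimate theta_hat."""
    K, d = arms.shape
    star = int(np.argmax(arms @ theta_hat))
    logw = np.zeros(K)                               # Hedge log-weights over arms
    for _ in range(T):
        w = np.exp(logw - logw.max()); w /= w.sum()
        A = arms.T @ (arms * w[:, None]) + reg * np.eye(d)
        A_inv = np.linalg.inv(A)
        # MIN's best response: the 'most confusing' alternative, i.e. the
        # halfspace {theta : (x*-x)^T theta <= 0} closest to theta_hat in A-norm.
        best_val, theta_alt = np.inf, theta_hat
        for i, x in enumerate(arms):
            if i == star:
                continue
            y = arms[star] - x
            val = (y @ theta_hat) ** 2 / (2.0 * (y @ A_inv @ y))
            if val < best_val:
                best_val = val
                # A-norm projection of theta_hat onto the halfspace boundary
                theta_alt = theta_hat - (y @ theta_hat) / (y @ A_inv @ y) * (A_inv @ y)
        # Per-arm payoff against this response (Gaussian KL): 0.5*(x_k^T(theta_hat - theta_alt))^2
        gains = 0.5 * (arms @ (theta_hat - theta_alt)) ** 2
        logw += eta * gains                          # exponential-weights (gain) update
    w = np.exp(logw - logw.max())
    return w / w.sum()
```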
4 Algorithm and Sample Complexity Bound
Our algorithm, which we call the “Phased Elimination Linear Exploration Game” (PELEG), is presented in detail as Algorithm 1. PELEG proceeds in phases, with each phase consisting of multiple rounds, and maintains a set $\mathcal{A}_m$ of active arms for testing during Phase $m$. An OLS estimate of $\theta^*$ is used to estimate the mean reward of active arms and, at the end of phase $m$, every active arm with a plausible reward more than roughly $2^{-m}$ below that of some arm in $\mathcal{A}_m$ is eliminated. Suppose $\Delta_{\min} > 0$. If we can ensure that $x^* \in \mathcal{A}_m$ in every Phase $m$, then PELEG will terminate within roughly $\log_2(1/\Delta_{\min})$ phases. This statement is proved in Corollary 2 in the Supplementary Material.
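For concreteness, the two per-phase primitives just described (the OLS estimate from the phase's samples, and the end-of-phase elimination) can be sketched as follows; the elimination threshold `eps_m` and the exact constants are placeholders for the quantities specified in Algorithm 1.

```python
import numpy as np

def ols_estimate(X, y, reg=1e-9):
    """OLS estimate of theta* from the arms X (rows) pulled in a phase and rewards y."""
    V = X.T @ X + reg * np.eye(X.shape[1])
    return np.linalg.solve(V, X.T @ y)

def eliminate(active, arms, theta_hat, eps_m):
    """Keep only arms whose estimated reward is within eps_m of the empirical leader."""
    means = arms @ theta_hat
    leader = max(means[i] for i in active)
    return [i for i in active if means[i] >= leader - eps_m]
```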
If we knew $\theta^*$, then we could sample arms according to the optimal distribution $\lambda^*$ in (1). However, since all we now have at our disposal is the knowledge that $x^* \in \mathcal{A}_m$, we can instead construct a sampling distribution by solving the surrogate $\min_{\lambda \in \mathcal{P}(\mathcal{X})} \max_{x, x' \in \mathcal{A}_m} \|x - x'\|^2_{A(\lambda)^{-1}}$, and sampling each arm in its support sufficiently often to produce a small enough confidence set. This is precisely what RAGE [8] does. However, solving this optimization is, as mentioned in Sec. 3, computationally expensive, and RAGE repeatedly accesses a minmax oracle to do this. Note that in simulating this algorithm, the authors implement an approximate oracle using the Frank-Wolfe method to solve the outer minimization over the allocation [8, Sec. F]. The max operation, however, renders the optimization objective non-smooth, and it is well known that the Frank-Wolfe iteration can fail on even simple non-differentiable objectives (see, e.g., [4]). We therefore deviate from RAGE at this point by employing three novel techniques, the first two motivated by ideas in [6].
- We formulate the above minimax problem as a two-player, zero-sum game. We solve the game sequentially, converging to its Nash equilibrium by invoking the EXP-WTS algorithm [3]. Specifically, in each round in a phase, PELEG supplies EXP-WTS with an appropriate loss function and receives the requisite sampling distribution (lines 15 & 18 of the algorithm). This is then fed to the second no-regret learner – a best-response subroutine – that finds the ‘most confusing’ plausible model to focus next on (line 16). This is a minimization of a quadratic function over a union of finitely many convex sets (halfspaces intersected with a ball), which can be transparently implemented in polynomial time.
- Once the sampling distribution is found, there still remains the problem of actually sampling according to it. Given a distribution, approximating it by sampling the rounded-down or rounded-up pull counts can lead to too few (resp. too many) samples. Other naive sampling strategies are, for the same reason, unusable. While [8] invokes a specialized rounding algorithm for this purpose, we opt for a more efficient tracking procedure (line 19): in each round $t$ of Phase $m$, we sample the arm $\arg\max_{x} \left(\sum_{s \le t} w_s(x) - N_x(t-1)\right)$, where $N_x(t-1)$ is the number of times arm $x$ has been sampled up to round $t-1$. In Lem. 3, we show that this procedure is efficient, i.e., the pull counts track the cumulative recommended weights up to a bounded error (a minimal sketch of this step appears after this list).
- Finally, in each phase $m$, we need to sample arms often enough to (i) construct confidence intervals of size at most roughly $2^{-m}$ around the estimated mean differences of surviving arms, (ii) ensure that the best arm is not eliminated, and (iii) ensure that the phase indeed terminates. In Sec. E, we prove a Key Lemma (whose argument is discussed in Sec. 5) to show that our novel Phase Stopping Criterion ensures this with high probability.
It is worth remarking that naively trying to adapt the strategy of Degenne et al. [6] to the linear bandit structure yields a suboptimal multiplicative factor in the sample complexity; we therefore adopt the phased elimination template of Fiez et al. [8]. We also find, interestingly, that this phased structure eliminates the need for more complex, self-tuning online learners like AdaHedge [5] in favour of the simpler Exponential Weights (Hedge) algorithm.
The main theoretical result of this paper is the following performance guarantee.
Theorem 1 (Sample Complexity of Algorithm 1).
With probability at least $1-\delta$, PELEG returns the optimal arm after $\tau$ rounds, with
(2)
In Sec. 5, we sketch the arguments behind the result. The proof in its entirety can be found in Sec. F in the Supplementary Material.
Note 1.
As explained in Sec. 3, the optimal (oracle) allocation requires about $\rho^* \log(1/\delta)$ samples. Comparing this with (2), we see that our algorithm is instance optimal up to logarithmic factors, barring an additive term involving $\lambda_{\min}$; the optimality therefore holds whenever this term does not dominate. Recall that $\lambda_{\min}$ is the smallest eigenvalue of the Gram matrix of the arm set. A large value of $\lambda_{\min}$ is reasonable to expect given that, in most applications, feature vectors (i.e., the arms in $\mathcal{X}$) are chosen to represent the feature space well, which translates to a high value of $\lambda_{\min}$.
Note 2.
The main computational effort in Algorithm 1 is in checking the phase stopping criterion (line 13) and implementing the best-response model learner (line 16), both of which are explicit quadratic programs. Note also that bounding the losses submitted to EXP-WTS to within a known range is required only for the regret analysis of EXP-WTS to go through. In practice, as the simulation results show, PELEG works without this and, in fact, this permits an efficient solution of Step 16 in the algorithm, further reducing computational complexity.
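Both computations reduce to minimizing a quadratic $\|\theta - \hat\theta\|_V^2$ over halfspaces of the form $\{\theta : a^\top\theta \le b\}$ (intersected with a ball in the full criterion). A minimal sketch of the halfspace-only step, which admits a closed form via a single Lagrange multiplier, is given below; names are illustrative and the ball constraint is omitted.

```python
import numpy as np

def min_quadratic_over_halfspace(theta_hat, V, a, b=0.0):
    """Minimize ||theta - theta_hat||_V^2 subject to a^T theta <= b.

    If theta_hat already satisfies the constraint, the minimum is 0; otherwise
    the minimizer lies on the hyperplane a^T theta = b (standard Lagrange step).
    """
    slack = float(a @ theta_hat) - b
    if slack <= 0:
        return theta_hat, 0.0
    V_inv_a = np.linalg.solve(V, a)
    theta = theta_hat - (slack / (a @ V_inv_a)) * V_inv_a
    value = slack ** 2 / float(a @ V_inv_a)
    return theta, value
```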
5 Sketch of Sample Complexity Analysis
This section outlines the proof of the $\delta$-PAC sample complexity of Algorithm 1 (Theorem 1) and describes the main ideas and challenges involved in the analysis.
At a high level the proof of Theorem 1 involves two main parts: (1) a correctness argument for the central while loop that eliminates arms, and (2) a bound for its length, which, when added across all phases, gives the overall sample complexity bound.
1. Ensuring progress (arm elimination) in each phase. At the heart of the analysis is the following result which guarantees that upon termination of the central while loop, the uncertainty in estimating all differences of means among the surviving (i.e., non-eliminated) arms remains bounded.
Lemma 1 (Key Lemma).
After each phase $m$, the uncertainty in estimating all differences of means among the arms surviving into phase $m+1$ is at most of order $2^{-m}$.
Proof sketch. Phase $m$ ends at the round at which the confidence ellipsoid, with centre at the current OLS estimate $\hat\theta$ and shape determined by the arms played in the phase so far, becomes small enough to avoid intersecting, within the stipulated ball, the half spaces associated with all surviving arm pairs (Step 13 of the algorithm); restricting attention to the ball is required to keep the loss functions bounded so that the no-regret properties hold.

Suppose, for the sake of simplicity, that only two arms $x_1$ and $x_2$ are present when phase $m$ starts. Figure 1(a) depicts a possible situation when the phase ends. The two half spaces, denoted in gray, intersect the ball in the areas coloured red. In this situation, the ellipsoid, shaded in blue, has just broken away from the red regions in the interior of the ball. Because its extent in the direction $x_1 - x_2$ lies within the strip between the two hyperplanes bounding the half spaces, it can be shown (see the proof of the lemma in the appendix) that the uncertainty in the direction $x_1 - x_2$ does not exceed roughly the phase-$m$ target.

The more challenging situation is when the ellipsoid breaks away from the red regions by breaching the boundary of the ball, as with the green ellipsoid in Figure 1(b). The while loop terminating at this time would not satisfy the objective of controlling the uncertainty to within the phase-$m$ target, since the extent of the ellipsoid in the direction $x_1 - x_2$ is larger than the gap between the two half spaces. A key idea we introduce here is to shrink the hyperplane gap by a constant factor, which is represented by the min operation in Step 7. In doing this we bring the half spaces closer, and then insist that the ellipsoid break away from these new half spaces within the ball. This more stringent requirement guarantees that when the loop terminates, the extent of the final ellipsoid (shaded in blue) stays within the original, unshrunk gap, ensuring the desired bound.
2. Bounding the number of arm pulls in a phase. The main bound on the length of the central while loop is the following result.
Lemma 2 (Phase length bound).
Let $N_m$ denote the length of phase $m$. There exists $\delta_0 > 0$ such that, for all $\delta \le \delta_0$, $N_m$ is bounded above by an explicit instance-dependent quantity; the precise bound appears in Sec. D.
To prove this we use the no-regret property of both the best-response and the EXP-WTS learners (the full proof appears in the appendix). A key novelty here is the introduction of the ball as a technical device to control the 2-norm radius of the final stopped ellipsoid (see the proof) when used with the basic tracking rule over arms introduced by Degenne et al. [6].
6 Experiments
We numerically evaluate PELEG against the algorithms $\mathcal{XY}$-static ([17]), LUCB ([11]), ALBA ([19]), LinGapE ([15]) and RAGE ([8]), for 3 common benchmark settings. The oracle lower bound is also calculated. Note: in our implementation, we ignore the intersection with the ball in the phase stopping criterion; this has the advantage of making the criterion checkable in closed form. We simulate independent standard normal observation noise in each round. All results reported are averaged over 50 trials. We also empirically observe that the best arm is identified in every trial, even though a fixed confidence value $\delta$ is passed in all cases.
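For reference, with the ball intersection ignored, one plausible closed-form reading of the simplified phase-stopping check is sketched below: the phase can stop once the squared $V^{-1}$-norm of every surviving arm difference falls below a phase-dependent threshold. The exact threshold and constants are those of Algorithm 1; the names here are illustrative.

```python
import numpy as np

def phase_can_stop(active, arms, V, threshold):
    """Illustrative stopping check: every surviving direction is well estimated,
    i.e. the largest ||x_i - x_j||^2_{V^{-1}} over surviving pairs is below threshold."""
    V_inv = np.linalg.inv(V)
    widths = [
        float((arms[i] - arms[j]) @ V_inv @ (arms[i] - arms[j]))
        for i in active for j in active if i < j
    ]
    return (not widths) or max(widths) <= threshold
```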
Setting 1: Standard bandit. The arm set is the standard basis in 5 dimensions. The unknown parameter $\theta^*$ is chosen so that the minimum gap is swept across a range of values. As noted in [15], when the suboptimal gaps are all comparable, $\mathcal{XY}$-static’s essentially uniform allocation is optimal, since all directions have to be estimated equally accurately. However, PELEG performs better (Fig. 2(a)) because it is able to eliminate suboptimal arms earlier instead of allocating uniformly across all arms. Fig. 2(b) compares PELEG and RAGE in a smaller window of gap values, where PELEG is found to be competitive with (and often better than) RAGE.
Setting 2: Unit sphere. The arm set comprises vectors sampled uniformly from the surface of the unit sphere in $\mathbb{R}^d$. We pick the two closest arms, say $x$ and $x'$, and then set $\theta^* = x + \gamma (x' - x)$ for a small $\gamma > 0$, making $x$ the best arm while keeping $x'$ hard to distinguish from it. We simulate all algorithms over a range of dimensions $d$. This setting was first introduced in [19], and PELEG is uniformly competitive with the other algorithms (Fig. 2(c)).
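A sketch of how such a unit-sphere instance can be generated; the nudge parameter `gamma` and the arm count are illustrative choices, not the exact values used in the experiments.

```python
import numpy as np

def sphere_instance(K=20, d=10, gamma=0.01, seed=0):
    """Arms uniform on the unit sphere; theta* nudged from one of the two
    closest arms toward the other so that they are hard to distinguish."""
    rng = np.random.default_rng(seed)
    arms = rng.standard_normal((K, d))
    arms /= np.linalg.norm(arms, axis=1, keepdims=True)
    # find the two closest (most aligned) distinct arms
    G = arms @ arms.T - 2.0 * np.eye(K)          # mask the diagonal
    i, j = np.unravel_index(np.argmax(G), G.shape)
    theta_star = arms[i] + gamma * (arms[j] - arms[i])
    return arms, theta_star
```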
Setting 3: Standard bandit with a confounding arm [17]. We instantiate the canonical basis arms $e_1, \ldots, e_d$ and an additional arm $x_{d+1} = (\cos\omega, \sin\omega, 0, \ldots, 0)$, with $\theta^*$ aligned with $e_1$ so that the first arm is the best arm. By setting $\omega$ to be small, the $(d+1)$-th arm becomes the closest competitor. Here, the performance critically depends on how much an agent focuses on comparing arm $1$ and arm $d+1$. LinGapE performs very well in this setting, and PELEG and RAGE are competitive with it (Fig. 2(d)).
[Figure 2: empirical sample complexities. Panels (a)–(b): Setting 1 (standard bandit); panel (c): Setting 2 (unit sphere); panel (d): Setting 3 (confounding arm).]
7 Concluding Remarks
We have proposed a new, explicitly described algorithm for best arm identification in linear bandits, using tools from game theory and no-regret learning to solve minimax games. Several interesting directions remain unexplored. Removing the less-than-ideal dependence on the $\lambda_{\min}$ feature of the arm geometry and the extra logarithmic dependence in the sample complexity are perhaps the most interesting technical questions. It is also of great interest to see whether a more direct game-theoretic strategy, along the lines of [6], exists for structured bandit problems, and whether one can extend this machinery to solve for best policies in more general Markov Decision Processes.
Broader Impact. This work is largely theoretical in its objective. However, the problem that it attempts to lay sound theoretical foundations for is a widely encountered search problem based on features in machine learning. As a result, we anticipate that its implications may carry over to domains that involve continuous, feature-based learning, such as attribute-based recommendation systems, adaptive sensing and robotics applications. Proper care must be taken in such cases to ensure that recommendations or decisions from the algorithms set out in this work do not transgress considerations of safety and bias. While we do not address such concerns explicitly in this work, they are important in the design and operation of automated systems that continually interact with human users.
References
- Abbasi-Yadkori et al. [2011] Y. Abbasi-Yadkori, D. Pal, and C. Szepesvari. Improved Algorithms for Linear Stochastic Bandits. In Proc. NIPS, pages 2312–2320, 2011.
- Awerbuch and Kleinberg [2008] B. Awerbuch and R. Kleinberg. Online linear optimization and adaptive routing. Journal of Computer and System Sciences, 74(1):97–114, 2008.
- Cesa-Bianchi and Lugosi [2006] N. Cesa-Bianchi and G. Lugosi. Prediction, learning, and games. Cambridge University Press, 2006.
- Cheung and Li [2018] E. Cheung and Y. Li. Solving separable nonsmooth problems using frank-wolfe with uniform affine approximations. In IJCAI, pages 2035–2041, 2018.
- de Rooij et al. [2014] S. de Rooij, T. van Erven, P. Grunwald, and W. Koolen. Follow the leader if you can, hedge if you must. Journal of Machine Learning Research, 15:1281–1316, 2014.
- Degenne et al. [2019] R. Degenne, W. M. Koolen, and P. Ménard. Non-asymptotic pure exploration by solving games. In NeurIPS, 2019.
- Even-Dar et al. [2006] E. Even-Dar, S. Mannor, and Y. Mansour. Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. J. Mach. Learn. Res., 7:1079–1105, Dec. 2006.
- Fiez et al. [2019] T. Fiez, L. Jain, K. G. Jamieson, and L. Ratliff. Sequential experimental design for transductive linear bandits. In Advances in Neural Information Processing Systems, pages 10666–10676, 2019.
- Freund and Schapire [1999] Y. Freund and R. E. Schapire. Adaptive game playing using multiplicative weights. Games and Economic Behavior, 29(1-2):79–103, 1999.
- Garivier and Kaufmann [Jun. 2016] A. Garivier and E. Kaufmann. Optimal best arm identification with fixed confidence. In Conference On Learning Theory, pages 998–1027, Jun. 2016.
- Kalyanakrishnan et al. [2012] S. Kalyanakrishnan, A. Tewari, P. Auer, and P. Stone. Pac subset selection in stochastic multi-armed bandits. In ICML, 2012.
- Kaufmann et al. [2016] E. Kaufmann, O. Cappé, and A. Garivier. On the complexity of best-arm identification in multi-armed bandit models. J. Mach. Learn. Res., 17(1):1–42, Jan. 2016.
- Kuroki et al. [2019] Y. Kuroki, L. Xu, A. Miyauchi, J. Honda, and M. Sugiyama. Polynomial-time algorithms for combinatorial pure exploration with full-bandit feedback. arXiv preprint arXiv:1902.10582, 2019.
- Lattimore and Szepesvári [2020] T. Lattimore and C. Szepesvári. Bandit Algorithms. Cambridge University Press, 2020.
- Xu et al. [2017] L. Xu, J. Honda, and M. Sugiyama. Fully adaptive algorithm for pure exploration in linear bandits. In International Conference on Artificial Intelligence and Statistics, 2017.
- Shalev-Shwartz et al. [2011] S. Shalev-Shwartz et al. Online learning and online convex optimization. Foundations and trends in Machine Learning, 4(2):107–194, 2011.
- Soare et al. [2014] M. Soare, A. Lazaric, and R. Munos. Best-arm identification in linear bandits. Advances in Neural Information Processing Systems 27, pages 828–836, 2014.
- Srinivas et al. [2010] N. Srinivas, A. Krause, S. M. Kakade, and M. Seeger. Gaussian process optimization in the bandit setting: no regret and experimental design. In International Conference on Machine Learning, 2010.
- Tao et al. [2018] C. Tao, S. Blanco, and Y. Zhou. Best arm identification in linear bandits with linear dimension dependency. volume 80 of Proceedings of Machine Learning Research, pages 4877–4886. PMLR, 2018.
- Valko et al. [2014] M. Valko, R. Munos, B. Kveton, and T. Kocák. Spectral bandits for smooth graph functions. In International Conference on Machine Learning, pages 46–54, 2014.
- Zaki et al. [2019] M. Zaki, A. Mohan, and A. Gopalan. Towards optimal and efficient best arm identification in linear bandits. arXiv preprint arXiv:1911.01695, 2019.
Appendix A Glossary of symbols
- EXP-WTS: the exponential-weights algorithm, used to compute the mixed strategy of the MAX player in each round of PELEG.
- $x^*$: the best arm, i.e., $\arg\max_{x \in \mathcal{X}} x^\top \theta^*$.
- $\mathcal{B}(c, r)$: the closed ball of radius $r$ in $\mathbb{R}^d$, centered at $c$.
- the union of all hyperplanes bounding the halfspaces used in the phase stopping criterion.
- $d$: dimension of the space in which the feature vectors reside.
- $\delta$: maximum allowable probability of erroneous arm selection (a.k.a. the confidence parameter).
- the confidence ellipsoid with center given by the current OLS estimate and shape given by the design matrix.
- $K$: number of feature vectors.
- $N_m$: the length of Phase $m$.
- $\nu_x$: the reward distribution of arm $x$; rewards from arm $x$ are drawn i.i.d. from $\nu_x$.
- $\mathcal{P}(\mathcal{S})$: the set of all probability measures on a given set $\mathcal{S}$.
- $\theta^*$: fixed but unknown vector in $\mathbb{R}^d$ that parameterizes the arm means, i.e., the mean reward of arm $x$ is $x^\top \theta^*$.
- $N_x(t)$: number of times arm $x$ has been sampled up to round $t$ of PELEG.
- $\hat\theta_m$: OLS estimate of $\theta^*$ at the end of Phase $m$ of PELEG.
- the design matrix in round $t$ of Phase $m$.
- the design matrix formed by sampling arms.
- $\mathcal{X}$: the feature (arm) set.
- $\mathcal{A}_m$: the set of features (arms) that survive Phase $m$ of PELEG.
Appendix B Technical lemmas
B.1 Tracking lemma
The EXP-WTS subroutine recommends a distribution $w_t$ over the set of arms in every round $t$. In order to play an arm from this distribution we use a “tracking” rule, which keeps the number of arm pulls $N_x(t)$ close to the cumulative sum $\sum_{s \le t} w_s(x)$ for each arm $x$.
Lemma 3 ($N_x(t)$ tracks $\sum_{s \le t} w_s(x)$).
In any phase $m$, if in every round $t$ the arm $\arg\max_{x} \left(\sum_{s \le t} w_s(x) - N_x(t-1)\right)$ is pulled,
then, for all arms $x$ and for all rounds $t$ in the phase, $\sum_{s \le t} w_s(x) - (K-1) \le N_x(t) \le \sum_{s \le t} w_s(x) + 1$.
Proof.
We first show the upper bound. We need to show that $N_x(t) \le \sum_{s \le t} w_s(x) + 1$ holds for all arms. Fix an arm $x$; we use induction on $t$.
Base case: at $t = 1$, $N_x(1) \le 1 \le w_1(x) + 1$.
Let the induction hypothesis hold for all rounds up to $t-1$. We will show that the inequality holds at round $t$. If arm $x$ is not pulled at round $t$, then $N_x(t) = N_x(t-1) \le \sum_{s \le t-1} w_s(x) + 1 \le \sum_{s \le t} w_s(x) + 1$.
Next, let arm $x$ be pulled at round $t$. We note that by definition of the tracking rule, we have
$$\sum_{s \le t} w_s(x) - N_x(t-1) = \max_{y} \left(\sum_{s \le t} w_s(y) - N_y(t-1)\right) \;\overset{(*)}{\ge}\; \frac{1}{K} \sum_{y} \left(\sum_{s \le t} w_s(y) - N_y(t-1)\right) = \frac{t - (t-1)}{K} > 0.$$
Here, the inequality (*) follows because of the following fact:
for any finite collection of numbers, the maximum is at least the average.
Consequently, $N_x(t) = N_x(t-1) + 1 \le \sum_{s \le t} w_s(x) + 1$. Rearranging completes the proof of the right hand side.
For the lower bound inequality, we observe that for any arm $x$,
$$N_x(t) = t - \sum_{y \ne x} N_y(t) \ge t - \sum_{y \ne x} \left(\sum_{s \le t} w_s(y) + 1\right) = \sum_{s \le t} w_s(x) - (K-1).$$
Here, the inequality follows from the upper bound on $N_y(t)$. ∎
B.1.1 Details of EXP-WTS
We employ the EXP-WTS algorithm to recommend to the MAX player the distribution over arms to be played in round $t$. At the start of every phase $m$, an EXP-WTS subroutine is instantiated afresh, with initial weights equal to $1$ for each of the experts. The experts are taken to be the standard unit vectors $e_i$, with $1$ at the $i$-th position, $i \in [K]$. The EXP-WTS subroutine recommends an exponentially weighted probability distribution over the arms, depending on the weights of the experts. The loss function supplied to update the weights of each expert is indicated in Step 18 of Algorithm 1.
EXP-WTS requires a bound on the losses (rewards) in order to set its learning parameter optimally. This is ensured by passing an upper bound on the losses in any phase (see Step 13 of Algorithm 1).
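For completeness, a minimal sketch of the exponential-weights (Hedge) update with the learning rate tuned from a known loss bound and horizon; the doubling trick mentioned in Lemma 4 is omitted, and the class name and tuning constant are illustrative.

```python
import numpy as np

class ExpWeights:
    """Exponential weights (Hedge) over K experts with losses in [0, loss_bound]."""
    def __init__(self, K, horizon, loss_bound):
        self.logw = np.zeros(K)
        # standard tuning for bounded losses over a known horizon
        self.eta = np.sqrt(8.0 * np.log(K) / horizon) / loss_bound

    def distribution(self):
        w = np.exp(self.logw - self.logw.max())
        return w / w.sum()

    def update(self, losses):
        self.logw -= self.eta * np.asarray(losses)
```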
Lemma 4.
In any phase $m$, at any round $t$, the EXP-WTS learner has regret bounded by a constant multiple of $B_m \sqrt{t \log K}$, where $B_m$ is the loss bound supplied to it in that phase.
Proof.
The proof involves a simple modification of the regret analysis of the EXP-WTS algorithm (see, for example, [3]), with the losses scaled by their bound, followed by the well-known doubling trick. ∎
Appendix C Proof of Key Lemma
See 1
Proof.
Let , for ease of notation. The phase stopping criterion is
(3)
Note that the set depends on the value that takes in phase . Depending on the value of , we divide the analysis into the following two cases.
Case 1. .
In this case . For any phase , and , let us define the ellipsoid . The phase stopping rule at round is equivalent to :
STOP if | ||||
However, by Rayleigh’s inequality (for any PSD matrix $M$ and vector $v$, $\lambda_{\min}(M)\|v\|_2^2 \le v^\top M v \le \lambda_{\max}(M)\|v\|_2^2$), followed by an elementary fact, we have for any round in the phase,
The inequality (*) follows from the following fact: for , .
Hence the phase stopping rule reduces to:
The above reduction is a minimization problem over a union of halfspaces. For any fixed pair of arms, this is a quadratic optimization problem with linear constraints, which can be explicitly solved using the standard method of Lagrange multipliers.
Lemma 5 (Supporting Lemma for Lem. 1).
For any two arms and , we have that
Proof.
The result follows by solving the optimization problem explicitly using the Lagrange multiplier method. ∎
By using the above lemma we obtain:
Hence, at round we have, .
Case 2. .
In this case, we have .
The phase ends when , .
Let us decompose the optimization problem defining the phase stopping criteria into smaller sub-problems, depending on pair of arms in . That is, we split the set in equation (3), and consider the following problem: for any pair of distinct arms , consider
let be the first round when . Clearly, we have . In addition, for any , . Hence, once the inequality for a given pair of arms is fulfilled it is satisfied for all subsequent rounds. We will now analyze the problem for each pair of arms individually.
For any , define . Note that is specific to the pair .
Claim 1. .
Proof of Claim 1.
For the proof, let’s denote . Now, suppose that the claim was not true, i.e., for some . Let . Then . Define . By construction, , and . Hence . However, , which is a contradiction. ∎
At we have . We have two sub-cases depending on the 2-norm of .
Sub-case 1. .
In this case, we have the following equivalence:
This can be seen by noting that if , then the corresponding Lagrange multiplier is zero. Hence at round , by solving a standard Lagrange optimization problem, we get . The last inequality follows from the hypothesis of Case 2. Since , we get .
Sub-case 2. .
The sub-case when , is more involved. Let’s enumerate the properties of at that we have.
-
•
-
•
-
•
We divide the analysis of this sub-case into two further sub-cases.
Sub-sub-case 1. .
Let . Then, one can verify by solving the maximization problem explicitly that . Let . We have the following properties of by construction, which are straight-forward to verify.
-
•
-
•
Let . It follows that, .
Finally, let us define two more quantities. Let and .
We have by the hypothesis of sub-sub-case 1, that . This implies that .
Next, we make the following two claims on the 2-norms of and .
Claim. .
Proof of Claim..
Suppose that . By construction, . Hence, Since, , this implies that . However, this is a contradiction since at round , . ∎
Hence, we have the following,
Claim. .
Proof of Claim..
First we note that,
Next observe that,
∎
Putting things together we have,
Sub-sub-case 2. .
This case is trivial as by the hypothesis,
This completes the proof of the key lemma.
∎
Appendix D Proofs of bounds on phase length
In this section we provide an upper bound on the length of any phase $m$. Clearly, the length of any phase is governed by the value of the phase-stopping threshold used in that phase. Towards this, we have the following lemma.
See 2
Proof.
Recall the definitions above. Let $t$ be the last round in phase $m$ before the phase ends. Then, by the definition of the phase stopping rule (Step 12 of the algorithm),
Here the inequalities follow because of (i) Lemma 3, (ii) the best response of the MIN player as given in Step 15 of the algorithm, (iii) the definition in Step 14, (iv) the regret property of the MAX player (see Lemma 4), (v) a direct bound, (vi) taking the minimum over a larger set, and (vii) explicitly solving the minimization problem and recalling the relevant definition. We then have that,
(4)
We will do the analysis depending on the value that takes in phase .
Case 1. .
In this case we have, . Applying the value of in eq. (4), we have
(5)
Let . The function is a differentiable concave function, meaning for any , . We therefore have
Applying both these to (5) and rearranging, we get
Note that for small enough $\delta$ the first term in the definition dominates the second term, i.e., there exists $\delta_0$ such that
(6)
This means that and hence,
We note here the following lower bound on .
By using the value of as given in Step 6 of the algorithm, we note that
Using this we get a bound on as:
Since, by assumption, , we have
Case 2. .
We have in this case that, . Applying the value of in eq. (4), we obtain
(7)
(8)
Let . As before, noting that is a concave, differentiable function, we have
Applying this to (8) and rearranging, we get
Going along the same lines as Case 1, we see that there exists such that , , whence
We now set .
∎
Appendix E Justification of elimination criteria
In this section, we argue that progress is made after every phase of the algorithm. We will also show the correctness of the algorithm. Let us define a few terms which will be useful for analysis.
Let . Let , where . Finally, define .
Define a sequence of favorable events as,
Note 3.
Conditioned on the event , and . Hence, and . Hence, under the event ,
Note here that the right hand side is a non-random quantity.
Lemma 6.
.
Proof of lemma 6.
Let for some . Since is a least squares estimate of , conditioned on the realization of the set , is a sub-Gaussian random variable.
By the key lemma 1 we have that . Using property of sub-Gaussian random variables, we write for any ,
which implies that
Taking intersection over all possible , and setting , gives
(9) |
Conditioned on , . Let be such that . Let . Then . By eq. (9) we have with probability :
Thus such an arm will get eliminated after phase $m$ by the elimination criterion of Algorithm 1 (see Step 25 of Algorithm 1). Hence the claim holds with the stated probability.
Next, we show that conditioned on , , w.p. . Suppose that gets eliminated at the end of phase . This means that , such that . However, by eq. 9,
which is a contradiction. This, along with note 3 shows that .
∎
Corollary 1.
Corollary 2.
The maximum number of phases of Algorithm 1 is bounded by a quantity of order $\log_2(1/\Delta_{\min})$.
Proof.
The proof follows by observing that, after any phase $m$, under the favorable event, every surviving suboptimal arm has a gap of at most order $2^{-m}$. Since this bound shrinks exponentially with the number of phases, we have the result. ∎
Appendix F Proof of bound on sample complexity
We begin by observing the following useful result from [8]. Recall the definition of $\rho^*$ from Sec. 3.
Proposition 1 ([8]).
Theorem 2.
With probability at least $1-\delta$, PELEG returns the optimal arm after $\tau$ rounds, with
Appendix G Experiment Details
In this section, we provide some details on the implementation of each algorithm. Each experiment was repeated 50 times, and the error-bar plots show the mean sample complexity with one standard deviation.
- For the implementation of PELEG, as mentioned in Sec. 6, we ignore the intersection with the ball in the phase stopping criterion. This allows a closed-form expression for the stopping rule. The learning rate parameter in the EXP-WTS subroutine is set using the standard tuning based on the loss bound.
- LinGapE: In the paper [15], LinGapE was simulated using a greedy arm selection strategy that deviates from the algorithm that is analyzed. We instead implement the LinGapE algorithm in the form in which it is analyzed.