Achieving Near Instance-Optimality and Minimax-Optimality
in Stochastic and Adversarial Linear Bandits Simultaneously
Abstract
In this work, we develop linear bandit algorithms that automatically adapt to different environments. By plugging a novel loss estimator into the optimization problem that characterizes the instance-optimal strategy, our first algorithm not only achieves nearly instance-optimal regret in stochastic environments, but also works in corrupted environments with additional regret proportional to the amount of corruption, while the state of the art (Li et al., 2019) achieves neither instance-optimality nor the optimal dependence on the corruption amount. Moreover, by equipping this algorithm with an adversarial component and carefully designed statistical tests, our second algorithm additionally enjoys minimax-optimal regret in completely adversarial environments; to our knowledge, it is the first algorithm of this kind. Finally, all our guarantees hold with high probability, while existing instance-optimal guarantees only hold in expectation.
1 Introduction
We consider the linear bandit problem with a finite and fixed action set. In this problem, the learner repeatedly selects an action from the action set and observes her loss, whose mean is the inner product between the chosen action and an unknown loss vector determined by the environment. The goal is to minimize the regret, which is the difference between the learner's total loss and the total loss of the best action in hindsight. Two standard environments are heavily studied in the literature: the stochastic environment and the adversarial environment. In the stochastic environment, the loss vector is fixed over time, and we are interested in instance-optimal regret bounds of order $\mathcal{O}(\log T)$ for any $T$, where $T$ is the number of rounds and $\mathcal{O}(\cdot)$ hides some instance-dependent constants. On the other hand, in the adversarial environment, the loss vector can be arbitrary in each round, and we are interested in the minimax-optimal regret bound $\widetilde{\mathcal{O}}(\sqrt{T})$, where $\widetilde{\mathcal{O}}(\cdot)$ hides the problem dimension and logarithmic factors in $T$.
While there are many algorithms obtaining such optimal bounds in either environment (e.g., (Lattimore & Szepesvari, 2017) in the stochastic setting and (Bubeck et al., 2012) in the adversarial setting), a natural question is whether there exists an algorithm achieving both guarantees simultaneously without knowing the type of the environment. Indeed, the same question has been studied extensively in recent years for the special case of multi-armed bandits where the action set is the standard basis (Bubeck & Slivkins, 2012; Seldin & Slivkins, 2014; Auer & Chiang, 2016; Seldin & Lugosi, 2017; Wei & Luo, 2018; Zimmert & Seldin, 2019). Notably, Zimmert & Seldin (2019) developed an algorithm that is optimal up to universal constants for both stochastic and adversarial environments, and the techniques have been extended to combinatorial semi-bandits (Zimmert et al., 2019) and finite-horizon tabular Markov decision processes (Jin & Luo, 2020). Despite all these advances, however, it is still open whether similar results can be achieved for general linear bandits.
On the other hand, another line of recent works studies the robustness of stochastic linear bandit algorithms from a different perspective and considers a corrupted setting where an adversary can corrupt the stochastic losses up to some limited total amount $C$. This was first considered in multi-armed bandits (Lykouris et al., 2018; Gupta et al., 2019; Zimmert & Seldin, 2019, 2021) and later extended to linear bandits (Li et al., 2019) and Markov decision processes (Lykouris et al., 2019; Jin & Luo, 2020). Ideally, the regret of a robust stochastic algorithm should degrade only by an additive term in $C$ in this setting, which is indeed the case in (Gupta et al., 2019; Zimmert & Seldin, 2019, 2021; Jin & Luo, 2020) for multi-armed bandits or Markov decision processes, but had not been achieved for general linear bandits.
In this paper, we make significant progress in this direction and develop algorithms with near-optimal regret simultaneously for different environments. Our main contributions are as follows.
-
•
In Section 4, we first introduce Algorithm REOLB, a simple algorithm that achieves nearly instance-optimal regret with high probability in the corrupted setting (Footnote 1: In the text, $\widetilde{\mathcal{O}}(\cdot)$ often hides lower-order terms (in terms of $T$-dependence) for simplicity. However, in all formal theorem/lemma statements, we use $\mathcal{O}(\cdot)$ to hide universal constants only.), where $c^*$ is an instance-dependent quantity such that the instance-optimal bound for the stochastic setting (i.e., $C = 0$) is $c^* \log T$. This result significantly improves over (Li et al., 2019), which only achieves a significantly worse bound in terms of $d$, $\Delta_{\min}$, and $C$, where $d$ is the dimension of the actions and $\Delta_{\min}$ is the minimum sub-optimality gap satisfying $c^* \leq \mathcal{O}(d/\Delta_{\min})$. Moreover, Algorithm REOLB also ensures an instance-independent bound that some existing instance-optimal algorithms fail to achieve even when $C = 0$ (e.g., (Jun & Zhang, 2020)).
-
•
In Section 5, based on Algorithm REOLB, we further propose Algorithm 1, which not only achieves nearly instance-optimal regret in the stochastic setting, but also achieves the minimax-optimal regret in the adversarial setting (both with high probability). To the best of our knowledge, this is the first algorithm that enjoys the best of both worlds for linear bandits. Additionally, the same algorithm also guarantees a regret bound in the corrupted setting that is slightly worse than that of Algorithm REOLB but still significantly better than (Li et al., 2019).
-
•
Finally, noticing the extra logarithmic factor in our bound for the stochastic setting, in Appendix D we also prove that this factor is in fact inevitable if the same algorithm simultaneously achieves sublinear regret in the adversarial setting with high probability (which is the case for Algorithm 1). This generalizes the result of (Auer & Chiang, 2016) for two-armed bandits.
At a high level, Algorithm REOLB utilizes a well-known optimization problem (which characterizes the lower bound in the stochastic setting) along with a robust estimator to determine a randomized strategy for each round. This ensures the near instance-optimality of the algorithm in the stochastic setting, and also its robustness to corruption when combined with a doubling trick. To handle the adversarial setting as well, Algorithm 1 switches between an adversarial linear bandit algorithm with high-probability regret guarantees and a variant of Algorithm REOLB, depending on the results of some carefully designed statistical tests on the stochasticity of the environment.
2 Related Work
Linear Bandits.
Linear bandits is a classic model to study sequential decision problems. The stochastic setting dates back to (Abe & Long, 1999). Auer (2002) first used the optimism principle to solve this problem. Later, several algorithms were proposed based on confidence ellipsoids, further improving the regret bounds (Dani et al., 2008a; Rusmevichientong & Tsitsiklis, 2010; Abbasi-Yadkori et al., 2011; Chu et al., 2011).
On the other hand, the adversarial setting was introduced by Awerbuch & Kleinberg (2004). Dani et al. (2008b) achieved the first $\widetilde{\mathcal{O}}(\sqrt{T})$ expected regret bound using the Geometric Hedge algorithm (also called Exp2) with uniform exploration over a barycentric spanner. Abernethy et al. (2008) proposed the first computationally efficient algorithm achieving $\widetilde{\mathcal{O}}(\sqrt{T})$ regret, using the Follow-the-Regularized-Leader framework. Bubeck et al. (2012) further tightened the bound by improving Exp2 with John's exploration. Our Algorithm 1 makes use of any adversarial linear bandit algorithm with high-probability guarantees (e.g., (Bartlett et al., 2008; Lee et al., 2020)) in a black-box manner.
Instance Optimality for Bandit Problems.
In the stochastic setting, Lattimore & Szepesvari (2017) showed that, unlike multi-armed bandits, optimism-based algorithms or Thompson sampling can be arbitrarily far from optimal in some simple instances. They proposed an algorithm also based on the lower bound optimization problem to achieve instance-optimality, but their algorithm is deterministic and cannot be robust to an adversary. Instance-optimality was also considered in other related problems lately such as linear contextual bandits (Hao et al., 2020; Tirinzoni et al., 2020), partial monitoring (Komiyama et al., 2015), and structured bandits (Combes et al., 2017; Jun & Zhang, 2020). Most of these works only consider expected regret, while our guarantees all hold with high probability.
Best-of-Both-Worlds.
Algorithms that are optimal for both stochastic and adversarial settings were studied in multi-armed bandits (Bubeck & Slivkins, 2012; Seldin & Slivkins, 2014; Auer & Chiang, 2016; Seldin & Lugosi, 2017; Wei & Luo, 2018; Zimmert & Seldin, 2019), semi-bandits (Zimmert et al., 2019), and Markov Decision Processes (Jin & Luo, 2020). On the other hand, linear bandits, a generalization of multi-armed bandits and semi-bandits, is much more challenging and currently underexplored in this direction. To the best of our knowledge, our algorithm is the first that guarantees near-optimal regret bounds in both stochastic and adversarial settings simultaneously.
Stochastic Bandits with Corruption.
Lykouris et al. (2018) first considered the corrupted setting for multi-armed bandits. Their results were improved by (Gupta et al., 2019; Zimmert & Seldin, 2019, 2021) and extended to linear bandits (Li et al., 2019; Bogunovic et al., 2020) and reinforcement learning (Lykouris et al., 2019). As mentioned, our results significantly improve those of (Li et al., 2019) (although their corruption model is slightly more general than ours; see Section 3). On the other hand, the results of (Bogunovic et al., 2020) are incomparable to ours, because they consider a setting where the adversary has even more power and can decide the corruption after seeing the chosen action. Finally, we note that (Lykouris et al., 2019, Theorem 3.2) considers episodic linear Markov decision processes in the corrupted setting, which can be seen as a generalization of linear bandits. However, this result is highly suboptimal when specialized to linear bandits (even ignoring other parameters).
3 Preliminaries
Let $\mathcal{A} \subset \mathbb{R}^d$ be a finite set that spans $\mathbb{R}^d$. Each element of $\mathcal{A}$ is called an arm or an action. We assume $\|a\|_2 \le 1$ for all $a \in \mathcal{A}$. A linear bandit problem proceeds in $T$ rounds. In each round $t$, the learner selects an action $a_t \in \mathcal{A}$. Simultaneously, the environment decides a hidden loss vector $\ell_t \in \mathbb{R}^d$ and generates some independent zero-mean noise $\epsilon_t(a)$ for each action $a$. Afterwards, the learner observes her loss $\langle a_t, \ell_t \rangle + \epsilon_t(a_t)$. We consider three different types of settings: stochastic, corrupted, and adversarial, explained in detail below.
In the stochastic setting, $\ell_t$ is fixed to some unknown vector $\theta$. We assume that there exists a unique optimal arm $a^* = \operatorname{argmin}_{a \in \mathcal{A}} \langle a, \theta \rangle$, and define for each $a \in \mathcal{A}$ its sub-optimality gap as $\Delta_a = \langle a - a^*, \theta \rangle$. Also denote the minimum nonzero gap by $\Delta_{\min} = \min_{a \neq a^*} \Delta_a$.
The corrupted setting is a generalization of the stochastic setting, where in addition to a fixed vector $\theta$, the environment also decides a corruption vector $c_t$ for each round $t$ (before seeing $a_t$) so that $\ell_t = \theta + c_t$. (Footnote 2: In other words, the environment corrupts the observation by adding $\langle a_t, c_t \rangle$. The setting of (Li et al., 2019) is slightly more general, with the corruption on action $a$ being $c_t(a)$ for some function $c_t$ that is not necessarily linear.) We denote the total amount of corruption over the $T$ rounds by $C$. The stochastic setting is clearly a special case with $C = 0$. In both of these settings, we define the regret as $\text{Reg} = \sum_{t=1}^{T} \langle a_t - a^*, \theta \rangle$.
Finally, in the adversarial setting, $\ell_t$ can be chosen arbitrarily (possibly dependent on the learner's algorithm and her previously chosen actions). The difference compared to the corrupted setting (which also has potentially arbitrary loss vectors) is that the regret is now defined in terms of $\ell_t$ itself: $\text{Reg} = \sum_{t=1}^{T} \langle a_t, \ell_t \rangle - \min_{a \in \mathcal{A}} \sum_{t=1}^{T} \langle a, \ell_t \rangle$.
In all settings, we assume that the mean losses $\langle a, \ell_t \rangle$ and the observed losses are in $[-1, 1]$ for all $t$ and $a$. We also write $\ell_t(a)$ for $\langle a, \ell_t \rangle$ and similarly $\ell(a)$ for $\langle a, \theta \rangle$.
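For concreteness, here is a minimal simulator of the interaction protocol described above (a sketch only: the noise distribution, its scale, and the class and attribute names are illustrative choices, not part of the paper):

```python
import numpy as np

class CorruptedLinearBandit:
    """Minimal simulator of the protocol above: in round t the learner pulls
    action a_t and observes <a_t, theta + c_t> plus independent zero-mean noise."""

    def __init__(self, actions, theta, corruptions=None, seed=0):
        self.actions = np.asarray(actions, dtype=float)   # (K, d), assumed bounded norms
        self.theta = np.asarray(theta, dtype=float)       # hidden loss vector (stochastic part)
        self.corruptions = corruptions                    # optional (T, d) array of c_t
        self.rng = np.random.default_rng(seed)
        self.t = 0

    def pull(self, i):
        ell_t = self.theta.copy()
        if self.corruptions is not None:
            ell_t += self.corruptions[self.t]             # corrupted mean loss vector
        mean = float(self.actions[i] @ ell_t)
        self.t += 1
        # illustrative bounded zero-mean noise; the paper only assumes
        # independence, zero mean, and bounded observed losses
        return mean + self.rng.uniform(-0.1, 0.1)
```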
It is known that the minimax optimal regret in the adversarial setting is $\widetilde{\Theta}(d\sqrt{T})$ (Dani et al., 2008b; Bubeck et al., 2012). The instance-optimality in the stochastic case, on the other hand, is slightly more complicated. Specifically, an algorithm is called consistent if it guarantees $\mathbb{E}[\text{Reg}] = o(T^p)$ for any $p > 0$, $\theta$, and $\mathcal{A}$. Then, a classic lower bound result (see e.g., (Lattimore & Szepesvari, 2017)) states that for a particular instance $(\mathcal{A}, \theta)$, all consistent algorithms satisfy (Footnote 3: The original proof is under the Gaussian noise assumption. To meet our boundedness assumption on the losses, it suffices to consider the case when the observed loss is a Bernoulli random variable, which only affects the constant of the lower bound.):
$$\liminf_{T\to\infty} \frac{\mathbb{E}[\text{Reg}]}{\log T} \;\ge\; c^*,$$
where $c^*$ is the objective value of the following optimization problem:
$$\inf_{N \in [0,\infty)^{\mathcal{A}}} \; \sum_{a \neq a^*} N_a \Delta_a \tag{1}$$
$$\text{subject to} \quad \|a\|^2_{H_N^{-1}} \le \frac{\Delta_a^2}{2}, \quad \forall a \neq a^*, \tag{2}$$
and $H_N = \sum_{a \in \mathcal{A}} N_a\, a a^\top$ (the notation $\|x\|_M = \sqrt{x^\top M x}$ denotes the quadratic norm with respect to a matrix $M$). This implies that the best instance-dependent bound one can hope for is $c^* \log T$ (and more generally $c^* \log T$ plus an additive term in $C$ for the corrupted setting). It can be shown that $c^* \le \mathcal{O}(d/\Delta_{\min})$ (see Lemma 16), but this upper bound can be arbitrarily loose as shown in (Lattimore & Szepesvari, 2017).
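For intuition, consider the multi-armed bandit special case where $\mathcal{A} = \{e_1,\dots,e_d\}$ is the standard basis; the worked computation below is a sketch assuming the constraint takes exactly the form in Eq. (2).

```latex
% Multi-armed bandit specialization of Eq. (1)-(2) (standard basis actions).
H_N = \sum_{a} N_a\, a a^\top = \mathrm{diag}(N_1, \dots, N_d)
\;\Longrightarrow\;
\|e_a\|_{H_N^{-1}}^2 = \frac{1}{N_a} \le \frac{\Delta_a^2}{2}
\;\Longleftrightarrow\;
N_a \ge \frac{2}{\Delta_a^2}.
% Plugging the smallest feasible N_a into the objective of Eq. (1) gives
c^* = \sum_{a \ne a^*} N_a \Delta_a = \sum_{a \ne a^*} \frac{2}{\Delta_a},
% which recovers the familiar \sum_{a \ne a^*} O(\log T / \Delta_a)
% instance-dependent lower bound for multi-armed bandits.
```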
The solution $N_a$ of the optimization problem above specifies the least number of times action $a$ should be drawn in order to distinguish between the present environment and any other alternative environment with a different optimal action. Many previous instance-optimal algorithms try to match the number of pulls of each action to this solution under some estimated gaps (Lattimore & Szepesvari, 2017; Hao et al., 2020; Jun & Zhang, 2020). While these algorithms are asymptotically optimal, their regret usually grows linearly when $T$ is small (Jun & Zhang, 2020). Furthermore, they are all deterministic algorithms and by design cannot tolerate corruptions. We will show how these issues can be addressed in the next section.
Notations.
We use $\Omega$ to denote the probability simplex over $\mathcal{A}$: $\Omega = \{p \in [0,1]^{\mathcal{A}} : \sum_{a \in \mathcal{A}} p_a = 1\}$, and define the clipping operator as $\mathrm{clip}_{[x,y]}(z) = \min\{\max\{z, x\}, y\}$ for $x \le y$.
4 A New Algorithm for the Corrupted Setting
In this section, we focus on the corrupted setting (hence covering the stochastic setting as well). We introduce a new algorithm that achieves with high probability an instance-dependent regret bound for large $T$ and also an instance-independent regret bound for any $T$. This improves over previous instance-optimal algorithms (Lattimore & Szepesvari, 2017; Hao et al., 2020; Jun & Zhang, 2020) in several respects: 1) first and foremost, our algorithm handles corruption optimally with extra regret proportional to $C$, while previous algorithms can fail completely due to their deterministic nature; 2) previous bounds only hold in expectation; 3) previous algorithms might suffer linear regret when $T$ is small, while ours is always sublinear for any $T$. The price we pay is an additional logarithmic factor in the instance-dependent bound. On the other hand, compared to the work of (Li et al., 2019), which also covers the same corrupted setting, our results are also significantly better (recall $c^* \le \mathcal{O}(d/\Delta_{\min})$), although as mentioned in Footnote 2, their results hold for an even more general setting with non-linear corruption.
Our algorithm, REOLB, proceeds in blocks of rounds whose lengths grow in a doubling manner (each block is twice as long as the previous one). At the beginning of each block $m$, we compute a distribution over actions by solving an optimization problem OP (described below) using the empirical gaps estimated in the previous block. We then use this distribution to sample actions for the entire block $m$, and construct an unbiased loss estimator in every round for every action. At the end of block $m$, we use these per-round estimators to construct a robust loss estimator for each action, which is then used to form the gap estimates for the next block. We next explain the optimization problem OP and the estimators in detail.
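The following sketch illustrates the doubling-block structure just described (illustrative only: `solve_op` and `robust_mean` stand in for OP and the clipped Catoni estimator discussed below, and all constants are placeholders):

```python
import numpy as np

def reolb_skeleton(actions, T, pull, solve_op, robust_mean, seed=0):
    """Doubling-block loop: solve OP with last block's gap estimates, sample
    from the resulting distribution for the whole block, then refresh the
    robust loss/gap estimates at the end of the block."""
    rng = np.random.default_rng(seed)
    actions = np.asarray(actions, dtype=float)          # (K, d)
    K = len(actions)
    gap_hat = np.zeros(K)                               # initial gap estimates
    t, m = 0, 0
    while t < T:
        block_len = min(2 ** m, T - t)                  # blocks double in length
        p = solve_op(actions, gap_hat)                  # distribution over actions
        H = actions.T @ (p[:, None] * actions)          # H(p) = sum_a p_a a a^T
        per_round = []                                  # unbiased per-round estimates
        for _ in range(block_len):
            i = rng.choice(K, p=p)
            loss = pull(actions[i])                     # bandit feedback
            theta_hat = np.linalg.solve(H, actions[i]) * loss
            per_round.append(actions @ theta_hat)       # <a, theta_hat> for every a
            t += 1
        # robust mean (e.g., clipped Catoni) of each action's estimates
        mean_loss = robust_mean(np.array(per_round))    # shape (K,)
        gap_hat = mean_loss - mean_loss.min()
        m += 1
    return gap_hat
```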
OP is inspired by the lower bound optimization (Eq. (1) and Eq. (2)), where we normalize the pull counts $N_a$ into a distribution over the arms, so that sampling from this distribution for a long enough block approximately realizes the prescribed pull counts. One key difference between our algorithm and previous ones (Lattimore & Szepesvari, 2017; Hao et al., 2020; Jun & Zhang, 2020) is exactly that we select actions randomly according to these distributions, while they try to deterministically match the pull count of each arm to $N_a$. Our randomized strategy not only prevents the environment from exploiting knowledge of the learner's choices, but also allows us to construct unbiased estimators following standard adversarial linear bandit algorithms (Dani et al., 2008b; Bubeck et al., 2012). In fact, as shown in our analysis, the variance of the estimator for each action is bounded in terms of its sub-optimality gap in light of the variance constraint of OP. A similar idea of imposing explicit constraints on the variance of loss estimators appears before in, for example, (Dudik et al., 2011; Agarwal et al., 2014) for contextual bandits. Finally, we point out that OP always has a solution thanks to the additive slack term in its constraint (see Lemma 11), and it can be solved efficiently by standard methods since the constraint is convex.
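Since OP itself is not reproduced here, the snippet below is only a sketch of an OP-style convex program (the objective, the variance thresholds, and the slack term are hypothetical placeholders); it mainly illustrates that a constraint of the form "$\|a\|^2_{H(p)^{-1}}$ at most a gap-dependent threshold" is convex in $p$ and can be handled by off-the-shelf solvers:

```python
import cvxpy as cp
import numpy as np

def solve_op_sketch(actions, gap_hat, budget):
    """actions: (K, d) array of arms; gap_hat: (K,) estimated gaps; budget: a
    scalar controlling how tight the variance constraints are. Minimizes the
    estimated per-round regret subject to ||a||^2_{H(p)^{-1}} being bounded in
    terms of the (estimated) gap of a, plus a slack term; budget and slack are
    placeholders and should be chosen so the program is feasible."""
    actions = np.asarray(actions, dtype=float)
    K, d = actions.shape
    p = cp.Variable(K, nonneg=True)
    H = p[0] * np.outer(actions[0], actions[0])
    for i in range(1, K):
        H = H + p[i] * np.outer(actions[i], actions[i])
    slack = 1.0 / budget                                   # placeholder slack term
    constraints = [cp.sum(p) == 1]
    constraints += [
        # matrix_frac(a, H) = a^T H^{-1} a is convex in H (hence in p)
        cp.matrix_frac(actions[i], H) <= budget * (gap_hat[i] ** 2 + slack)
        for i in range(K)
    ]
    problem = cp.Problem(cp.Minimize(p @ gap_hat), constraints)
    problem.solve()
    return np.asarray(p.value)
```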
Another important ingredient of our algorithm is the robust estimator, which is a clipped version of Catoni's estimator (Catoni, 2012) constructed using all the unbiased estimators from the current block for each action (cf. Lemma 29). From a technical perspective, this avoids a lower-order term in Bernstein-style concentration bounds and is critical for our analysis. We in fact believe that a robust estimator is necessary here, since there is no explicit regularization on the magnitude of the per-round estimators, and they can indeed have a heavy-tailed distribution. While other robust estimators are possible, we use Catoni's estimator, which was analyzed in (Wei et al., 2020) for non-i.i.d. random variables (again, important for our analysis).
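A sketch of the clipped Catoni mean estimator (the influence function follows Catoni (2012); the weight alpha and the clipping range used by the actual algorithm are not reproduced here and are treated as inputs):

```python
import numpy as np
from scipy.optimize import brentq

def catoni_psi(y):
    """Catoni's influence function: psi(y) = log(1 + y + y^2/2) for y >= 0,
    extended as an odd function to y < 0."""
    return np.sign(y) * np.log1p(np.abs(y) + 0.5 * y ** 2)

def catoni_mean(x, alpha):
    """Catoni's estimator: the unique root z of f(z) = sum_i psi(alpha*(x_i - z)).
    f is strictly decreasing in z, so widening the bracket until the signs
    differ and running Brent's method finds the root."""
    x = np.asarray(x, dtype=float)
    f = lambda z: float(np.sum(catoni_psi(alpha * (x - z))))
    lo, hi = x.min() - 1.0, x.max() + 1.0
    while f(lo) <= 0.0:
        lo -= 1.0
    while f(hi) >= 0.0:
        hi += 1.0
    return brentq(f, lo, hi)

def clipped_catoni(x, alpha, clip_lo=-1.0, clip_hi=1.0):
    """Clipped version used as the robust block estimator; the clipping range
    [-1, 1] matches the assumed range of the true (mean) losses."""
    return float(np.clip(catoni_mean(x, alpha), clip_lo, clip_hi))
```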
The following theorem summarizes the nearly instance-optimal regret bound of Algorithm REOLB.
Theorem 1.
In the corrupted setting, Algorithm REOLB guarantees that with probability at least ,
where is some constant that depends on and only.
The dominating term of this regret bound is thus as claimed. The definition of can be found in the proof (Appendix B) and is importantly independent of . In fact, in Theorem 19, we also provide an alternative (albeit weaker) bound for Algorithm REOLB without the dependence on .
The next theorem shows an instance-independent bound of order for Algorithm REOLB, which previous instance-optimal algorithms fail to achieve as mentioned.
Theorem 2.
In the corrupted setting, Algorithm REOLB guarantees that with probability at least , .
We emphasize that Algorithm REOLB is parameter-free and does not need to know the corruption amount $C$ to achieve these bounds. In the rest of the section, we provide a proof sketch for Theorem 1 and Theorem 2. First, we show that the estimated gaps are close to the true gaps up to a constant multiplicative factor and some additive terms that shrink with the block length, up to some average amount of corruption.
Lemma 3.
With probability at least , Algorithm REOLB ensures for all and all ,
(3)
(4)
where ( is defined as ), is the amount of corruption within block , and .
As mentioned, the proof of Lemma 3 heavily relies on the robust estimators we use as well as the variance constraint of OP. Next, we have the following lemma which bounds the objective value of OP.
Lemma 4.
Let be the solution of , where . Then we have .
Combining Lemma 3 and Lemma 4, we see that in block $m$, the regret of Algorithm REOLB can be upper bounded by
where in the first equality we use Lemma 3 and in the second equality we use Lemma 4 with the fact that . Further summing this over and relating to proves Theorem 2.
In addition, based on Lemma 3, we show that once the block length is larger than the total corruption plus some problem-dependent constant, the estimated gap becomes an accurate estimate of the true gap up to a constant factor. Therefore, the solution from OP is very close to the (normalized) optimal solution of Eq. (1) and Eq. (2), except that we have an additional logarithmic factor in the constraint. Therefore, the regret is nearly instance-optimal for large enough $T$. Formally, we have the following lemma.
Lemma 5.
Algorithm REOLB guarantees with probability at least , for some constant depending on , and :
5 Best of Three Worlds
In this section, building on top of Algorithm REOLB, we develop another algorithm that enjoys similar regret guarantees in the stochastic or corrupted setting, and additionally guarantees $\widetilde{\mathcal{O}}(\sqrt{T})$ regret in the adversarial setting, without having any prior knowledge of which environment it is facing. To the best of our knowledge, this kind of best-of-three-worlds guarantee has only appeared before for multi-armed bandits (Wei & Luo, 2018; Zimmert & Seldin, 2019) and Markov decision processes (Jin & Luo, 2020), but not for linear bandits.
Our algorithm requires black-box access to an adversarial linear bandit algorithm that satisfies the following:
Assumption 1.
is a linear bandit algorithm that outputs a loss estimator for each action after each time . There exist , , and universal constant , such that for all , guarantees the following with probability at least : ,
(5) |
Eq. (5) states that the regret of the black-box against any action is bounded by a $\sqrt{t}$-order term minus the deviation between the loss of that action and its estimator. While this might not seem intuitive, in fact all existing linear bandit algorithms with a near-optimal high-probability bound satisfy Assumption 1, even though this may not have been stated explicitly (and one may need to slightly change the constant parameters in these algorithms to satisfy the conditions in Assumption 1). Below, we give two examples of such black-box algorithms and justify them in Appendix E.
-
•
A variant of GeometricHedge.P (Bartlett et al., 2008) with an improved exploration scheme satisfies Assumption 1 with ()
-
•
The algorithm of (Lee et al., 2020) satisfies Assumption 1 with (, )
With such a black-box at hand, our algorithm BOTW is shown in Algorithm 1. We first present its formal guarantees in different settings.
Theorem 6.
Algorithm 1 guarantees that with probability at least , in the stochastic setting (), is at most
where is the same problem-dependent constant as in Theorem 1; and in the corrupted setting (), is at most
In the case when the black-box is the variant of GeometricHedge.P, the last bound is
Therefore, Algorithm 1 enjoys nearly instance-optimal regret in the stochastic setting, just like Algorithm REOLB (Footnote 4: Note that this holds when we choose the variant of GeometricHedge.P as the black-box, whose contribution is dominated by the main term when $T$ is sufficiently large.), but slightly worse regret in the corrupted setting (recall again $c^* \le \mathcal{O}(d/\Delta_{\min})$). In exchange, however, Algorithm 1 enjoys the following worst-case robustness in the adversarial setting.
Theorem 7.
In the adversarial setting, Algorithm 1 guarantees that with probability at least , is at most .
The dependence on $T$ in this bound is minimax-optimal as mentioned, while the dependence on the dimension depends on the coefficient of the black-box. Note that because of this adversarial robustness, the extra logarithmic dependence in Theorem 6 turns out to be unavoidable, as we show in Theorem 27. In addition, Theorem 7 also applies to the stochastic setting, which implies a worst-case regret bound slightly better than the guarantee of Algorithm REOLB shown in Theorem 2.
Next, in Section 5.1, we describe our algorithm in detail. Then in Section 5.2 and Section 5.3, we provide proof sketches for Theorem 7 and Theorem 6 respectively.
5.1 The algorithm
Algorithm 1 (BOTW) takes a black-box satisfying Assumption 1 as input, and then proceeds in epochs until the game ends. In each epoch, it runs its single-epoch version BOTW-SE (Algorithm 2) with a minimum duration. Based on the results of some statistical tests, at some point BOTW-SE will terminate with an output. Then BOTW enters the next epoch with the minimum duration updated based on this output, so that the number of epochs is always $\mathcal{O}(\log T)$.
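A minimal sketch of this outer loop (`botw_se` is a placeholder for the single-epoch procedure of Algorithm 2, and the exact rule for updating the minimum duration is an assumption here):

```python
def botw(botw_se, T, t_min_init=1):
    """Outer loop of BOTW: repeatedly run the single-epoch procedure with a
    growing minimum duration so that the number of epochs stays O(log T)."""
    t, t_min = 0, t_min_init
    while t < T:
        # botw_se runs Phase 1 (the adversarial black-box) and then Phase 2
        # (OP-based exploitation), and returns the number of rounds it used.
        rounds_used = botw_se(min_duration=t_min, remaining=T - t)
        t += rounds_used
        t_min = max(2 * t_min, rounds_used)   # assumed doubling-style update
    return t
```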
BOTW-SE has two phases. In Phase 1, the learner executes the adversarial linear bandit algorithm. Starting from the minimum duration specified by the input, the algorithm checks in every round whether Eq. (7) and Eq. (8) hold for some action (Line 3). If there exists such an action, Phase 1 terminates and the algorithm proceeds to Phase 2. This test is to detect whether the environment is likely stochastic. Indeed, Eq. (7) and Eq. (8) imply that the performance of the learner is significantly better than that of all but one action (which we denote by $\widehat{a}$). In the stochastic environment, this event happens roughly when the black-box's $\sqrt{t}$-order regret becomes comparable to $t\Delta_{\min}$, with the surviving action being $a^*$. This is exactly the timing when the learner should stop using the black-box, whose regret grows as $\sqrt{t}$, and start exploiting the better actions, in order to keep the regret logarithmic in time in the stochastic environment. We define $t_1$ to be the time when Phase 1 ends, and $\widehat{\Delta}_a$ to be the empirical gap of action $a$ with respect to the estimators obtained from the black-box so far (Line 3). In the stochastic setting, we can show that $\widehat{\Delta}_a = \Theta(\Delta_a)$ holds with high probability.
In the second phase, we calculate the action distribution using OP with the estimated gaps $\widehat{\Delta}_a$. Indeed, if these estimates are accurate, the distribution returned by OP is close to the optimal way of allocating arm pulls, leading to near-optimal regret. (Footnote 5: Here, we solve OP at every iteration for simplicity. It can in fact be done only when the elapsed time doubles, just like in Algorithm REOLB.) For technical reasons, there are some differences between Phase 2 and Algorithm REOLB. First, instead of directly using the distribution returned by OP to draw actions, we mix it with the point mass on $\widehat{a}$, and draw actions from the mixture. This way, $\widehat{a}$ is drawn with sufficiently large probability. Moreover, the loss estimator is now defined as the following:
(6) |
where the importance weight is taken with respect to the mixture distribution used for sampling. While the construction of the estimator for actions $a \neq \widehat{a}$ is the same as in Algorithm REOLB, the construction for $\widehat{a}$ is different and is based on standard inverse probability weighting. These differences arise mainly because we later use the simple average estimator instead of the robust mean estimator for $\widehat{a}$ (the latter produces a slightly looser concentration bound in our analysis). Therefore, we must ensure that $\widehat{a}$ is drawn with enough probability, and that the magnitude of its loss estimator is well-controlled.
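A sketch of the two estimator types in Eq. (6): the usual covariance-based estimator for actions other than $\widehat{a}$, and plain inverse probability weighting for $\widehat{a}$ (the sampling distribution passed in is assumed to be the OP/point-mass mixture described above):

```python
import numpy as np

def loss_estimates(actions, sample_probs, chosen_idx, observed_loss, a_hat_idx):
    """One round of loss estimation.
    actions: (K, d) array; sample_probs: distribution actually used to draw the
    action; chosen_idx: index of the drawn action; observed_loss: its loss."""
    actions = np.asarray(actions, dtype=float)
    probs = np.asarray(sample_probs, dtype=float)
    H = actions.T @ (probs[:, None] * actions)          # H = sum_a p_a a a^T
    theta_hat = np.linalg.solve(H, actions[chosen_idx]) * observed_loss
    est = actions @ theta_hat                            # <a, theta_hat> for every a
    # for a_hat, use standard inverse probability weighting instead
    est[a_hat_idx] = float(chosen_idx == a_hat_idx) * observed_loss / probs[a_hat_idx]
    return est
```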
(7), (8): the Phase 1 test (Phase 1 terminates when both hold for some action).
(9), (10): the Phase 2 tests (Phase 2 terminates when either holds).
Then, for any time in Phase 2, we define the average empirical gap of each action as the following:
(11) |
where
with the Catoni estimator applied as before (cf. Lemma 29). Note that we use a simple average estimator for $\widehat{a}$, but a hybrid of the average estimator of Phase 1 and the robust estimator of Phase 2 for the other actions. These gap estimators are useful in monitoring the non-stochasticity of the environment, which is done via the tests in Eq. (9) and Eq. (10). The first condition (Eq. (9)) checks whether the average empirical gap is still close to the estimated gap at the end of Phase 1. The second condition (Eq. (10)) checks whether the regret against $\widehat{a}$ incurred in Phase 2 is still tolerable. It can be shown (see Lemma 10) that with high probability, Eq. (9) and Eq. (10) do not hold in a stochastic environment. Therefore, when either event is detected, BOTW-SE terminates and returns its output to BOTW, which will then run BOTW-SE again from scratch with an increased minimum duration.
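Schematically, Phase 2's monitoring can be summarized as the check below (the two threshold functions are placeholders; the actual tests Eq. (9) and Eq. (10) use specific deviation and regret thresholds from the paper):

```python
def phase2_should_terminate(avg_gaps_now, gaps_end_phase1,
                            est_regret_vs_a_hat, dev_threshold, reg_threshold):
    """Return True if either non-stochasticity test fires.
    Test (9): the average empirical gaps have drifted too far from the
    estimates at the end of Phase 1.  Test (10): the estimated regret against
    a_hat accumulated in Phase 2 exceeds its tolerable budget."""
    drift = max(abs(g_now - g_old)
                for g_now, g_old in zip(avg_gaps_now, gaps_end_phase1))
    return drift > dev_threshold or est_regret_vs_a_hat > reg_threshold
```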
In the following subsections, we provide a sketch of analysis for BOTW, further revealing the ideas behind our design.
5.2 Analysis for the Adversarial Setting (Theorem 7)
We first show that at any time in Phase 2, with high probability, $\widehat{a}$ is always the best action so far.
Lemma 8.
With probability at least , at any time in Phase 2, we have .
Proof sketch.
The idea is to prove that for any , the deviation between the actual gap and the estimated gap is no larger than . This is enough to prove the statement since is of order in light of the test in Eq. (9).
Bounding the deviation for Phase 2 is somewhat similar to the analysis of Algorithm REOLB, and here we only show how to bound the deviation for Phase 1: . We start by rearranging Eq. (5) to get: . By the termination conditions of Phase 1, we have and , which then shows as desired. (See Appendix C for the full proof.) ∎
We then prove that, importantly, the regret in each epoch is bounded by (not square root of the epoch length):
Lemma 9.
With probability at least , for any time in Phase 2, we have for any ,
Proof sketch.
By Lemma 8, it suffices to consider . By Eq. (5), we know that the regret for the first rounds is directly bounded by . For the regret incurred in Phase 2, we decompose it as the sum of , , and . The first term is controlled by the test in Eq. (10). The second and third terms are martingale difference sequences with variance bounded by , which as we further show is at most with . By combining Eq. (7) and Eq. (8), it is clear that and thus the variance is in the order of . Applying Freedman’s inequality, the last two terms are thus bounded by as well, proving the claimed result (see Appendix C for the full proof). ∎
5.3 Analysis for the Corrupted Setting (Theorem 6)
The key for this analysis is the following lemma.
Lemma 10.
In the corrupted setting, BOTW-SE ensures with probability at least :
-
•
.
-
•
If , then 1) ; 2) for all ; and 3) Phase 2 never ends.
Using this lemma, we give a proof sketch of Theorem 6 for the stochastic case (i.e., $C = 0$). The full proof is deferred to Appendix C.2.
Proof sketch for Theorem 6 with $C = 0$.
By Lemma 10, we know that after roughly rounds in Phase 1, the algorithm finds $\widehat{a} = a^*$, estimates each gap up to a constant factor, and enters Phase 2 without ever going back to Phase 1. By Eq. (5), the regret in Phase 1 can be upper bounded by .
To bound the regret in Phase 2, we show that as long as the elapsed time is larger than a problem-dependent constant, there exist gap values within a constant factor of the true gaps whose corresponding allocation is a feasible solution of the constraint in OP. Therefore, we can bound the regret in this regime as follows:
() | |||
() | |||
(optimality of ) | |||
() | |||
(definition of ) |
Combining the regret bounds in Phase 1 and Phase 2, we prove the results for the stochastic setting. ∎
6 Conclusion
In this work, we make significant progress on improving the robustness and adaptivity of linear bandit algorithms. Our algorithms are the first to achieve near-optimal regret in various different settings, without having any prior knowledge on the environment. Our techniques might also be useful for more general problems such as linear contextual bandits.
In light of the work (Zimmert & Seldin, 2019) for multi-armed bandits that shows a simple Follow-the-Regularized-Leader algorithm achieves optimal regret in different settings, one interesting open question is whether there also exists such a simple Follow-the-Regularized-Leader algorithm for linear bandits with the same adaptivity to different settings. In fact, it can be shown that their algorithm has a deep connection with OP in the special case of multi-armed bandits, but we are unable to extend this connection to general linear bandits.
Acknowledgements
We thank Tor Lattimore and Julian Zimmert for helpful discussions. HL thanks Ilias Diakonikolas and Anastasia Voloshinov for initial discussions in this direction. The first four authors are supported by NSF Awards IIS-1755781 and IIS-1943607.
References
- Abbasi-Yadkori et al. (2011) Abbasi-Yadkori, Y., Pál, D., and Szepesvári, C. Improved algorithms for linear stochastic bandits. Advances in neural information processing systems, 24:2312–2320, 2011.
- Abe & Long (1999) Abe, N. and Long, P. M. Associative reinforcement learning using linear probabilistic concepts. In ICML, 1999.
- Abernethy et al. (2008) Abernethy, J., Hazan, E., and Rakhlin, A. Competing in the dark: An efficient algorithm for bandit linear optimization. In 21st Annual Conference on Learning Theory, COLT 2008, pp. 263–273, 2008.
- Agarwal et al. (2014) Agarwal, A., Hsu, D., Kale, S., Langford, J., Li, L., and Schapire, R. Taming the monster: A fast and simple algorithm for contextual bandits. In International Conference on Machine Learning, pp. 1638–1646. PMLR, 2014.
- Auer (2002) Auer, P. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3(Nov):397–422, 2002.
- Auer & Chiang (2016) Auer, P. and Chiang, C.-K. An algorithm with nearly optimal pseudo-regret for both stochastic and adversarial bandits. In Conference on Learning Theory, pp. 116–120, 2016.
- Awerbuch & Kleinberg (2004) Awerbuch, B. and Kleinberg, R. D. Adaptive routing with end-to-end feedback: Distributed learning and geometric approaches. In Proceedings of the thirty-sixth annual ACM symposium on Theory of computing, pp. 45–53, 2004.
- Bartlett et al. (2008) Bartlett, P., Dani, V., Hayes, T., Kakade, S., Rakhlin, A., and Tewari, A. High-probability regret bounds for bandit online linear optimization. In Proceedings of the 21st Annual Conference on Learning Theory-COLT 2008, pp. 335–342. Omnipress, 2008.
- Bogunovic et al. (2020) Bogunovic, I., Losalka, A., Krause, A., and Scarlett, J. Stochastic linear bandits robust to adversarial attacks. arXiv:2007.03285, 2020.
- Bubeck & Slivkins (2012) Bubeck, S. and Slivkins, A. The best of both worlds: Stochastic and adversarial bandits. In Conference on Learning Theory, 2012.
- Bubeck et al. (2012) Bubeck, S., Cesa-Bianchi, N., and Kakade, S. M. Towards minimax policies for online linear optimization with bandit feedback. In Conference on Learning Theory, 2012.
- Catoni (2012) Catoni, O. Challenging the empirical mean and empirical variance: a deviation study. In Annales de l’IHP Probabilités et statistiques, volume 48, pp. 1148–1185, 2012.
- Chu et al. (2011) Chu, W., Li, L., Reyzin, L., and Schapire, R. Contextual bandits with linear payoff functions. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 208–214, 2011.
- Combes et al. (2017) Combes, R., Magureanu, S., and Proutiere, A. Minimal exploration in structured stochastic bandits. In Advances in Neural Information Processing Systems, 2017.
- Dani et al. (2008a) Dani, V., Hayes, T. P., and Kakade, S. M. Stochastic linear optimization under bandit feedback. In 21st Annual Conference on Learning Theory, COLT 2008, 2008a.
- Dani et al. (2008b) Dani, V., Kakade, S. M., and Hayes, T. P. The price of bandit information for online optimization. In Advances in Neural Information Processing Systems, 2008b.
- Dudik et al. (2011) Dudik, M., Hsu, D., Kale, S., Karampatziakis, N., Langford, J., Reyzin, L., and Zhang, T. Efficient optimal learning for contextual bandits. In Proceedings of the 27th Conference on Uncertainty in Artificial Intelligence, UAI 2011, pp. 169, 2011.
- Gerchinovitz & Lattimore (2016) Gerchinovitz, S. and Lattimore, T. Refined lower bounds for adversarial bandits. In NeurIPS, 2016.
- Gupta et al. (2019) Gupta, A., Koren, T., and Talwar, K. Better algorithms for stochastic bandits with adversarial corruptions. In Conference on Learning Theory, 2019.
- Hao et al. (2020) Hao, B., Lattimore, T., and Szepesvari, C. Adaptive exploration in linear contextual bandit. In International Conference on Artificial Intelligence and Statistics. PMLR, 2020.
- Jin & Luo (2020) Jin, T. and Luo, H. Simultaneously learning stochastic and adversarial episodic mdps with known transition. Advances in Neural Information Processing Systems, 2020.
- Jun & Zhang (2020) Jun, K.-S. and Zhang, C. Crush optimism with pessimism: Structured bandits beyond asymptotic optimality. Advances in Neural Information Processing Systems, 2020.
- Komiyama et al. (2015) Komiyama, J., Honda, J., and Nakagawa, H. Regret lower bound and optimal algorithm in finite stochastic partial monitoring. Advances in Neural Information Processing Systems, 2015.
- Lattimore & Szepesvari (2017) Lattimore, T. and Szepesvari, C. The end of optimism? an asymptotic analysis of finite-armed linear bandits. In Artificial Intelligence and Statistics, pp. 728–737. PMLR, 2017.
- Lee et al. (2020) Lee, C.-W., Luo, H., Wei, C.-Y., and Zhang, M. Bias no more: high-probability data-dependent regret bounds for adversarial bandits and mdps. Advances in neural information processing systems, 2020.
- Li et al. (2019) Li, Y., Lou, E. Y., and Shan, L. Stochastic linear optimization with adversarial corruption. arXiv:1909.02109, 2019.
- Lykouris et al. (2018) Lykouris, T., Mirrokni, V., and Paes Leme, R. Stochastic bandits robust to adversarial corruptions. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, 2018.
- Lykouris et al. (2019) Lykouris, T., Simchowitz, M., Slivkins, A., and Sun, W. Corruption robust exploration in episodic reinforcement learning. arXiv:1911.08689, 2019.
- Mond & Pecaric (1996) Mond, B. and Pecaric, J. A mixed arithmetic-mean-harmonic-mean matrix inequality. Linear algebra and its applications, 237:449–454, 1996.
- Rusmevichientong & Tsitsiklis (2010) Rusmevichientong, P. and Tsitsiklis, J. N. Linearly parameterized bandits. Mathematics of Operations Research, 35(2):395–411, 2010.
- Seldin & Lugosi (2017) Seldin, Y. and Lugosi, G. An improved parametrization and analysis of the exp3++ algorithm for stochastic and adversarial bandits. In Conference on Learning Theory, pp. 1743–1759. PMLR, 2017.
- Seldin & Slivkins (2014) Seldin, Y. and Slivkins, A. One practical algorithm for both stochastic and adversarial bandits. In International Conference on Machine Learning, pp. 1287–1295. PMLR, 2014.
- Shalev-Shwartz & Ben-David (2014) Shalev-Shwartz, S. and Ben-David, S. Understanding machine learning: From theory to algorithms. Cambridge university press, 2014.
- Tirinzoni et al. (2020) Tirinzoni, A., Pirotta, M., Restelli, M., and Lazaric, A. An asymptotically optimal primal-dual incremental algorithm for contextual linear bandits. In Advances in Neural Information Processing Systems, 2020.
- Wei & Luo (2018) Wei, C.-Y. and Luo, H. More adaptive algorithms for adversarial bandits. In Conference On Learning Theory, pp. 1263–1291. PMLR, 2018.
- Wei et al. (2020) Wei, C.-Y., Luo, H., and Agarwal, A. Taking a hint: How to leverage loss predictors in contextual bandits? In Conference on Learning Theory, pp. 3583–3634. PMLR, 2020.
- Zimmert & Seldin (2019) Zimmert, J. and Seldin, Y. An optimal algorithm for stochastic and adversarial bandits. In The 22nd International Conference on Artificial Intelligence and Statistics. PMLR, 2019.
- Zimmert & Seldin (2021) Zimmert, J. and Seldin, Y. Tsallis-inf: An optimal algorithm for stochastic and adversarial bandits. J. Mach. Learn. Res., 22:28–1, 2021.
- Zimmert et al. (2019) Zimmert, J., Luo, H., and Wei, C.-Y. Beating stochastic and adversarial semi-bandits optimally and simultaneously. In International Conference on Machine Learning, pp. 7683–7692. PMLR, 2019.
Appendix A Auxiliary Lemmas for OP
Proof of Lemma 4.
Consider the minimizer of the following constrained minimization problem for some :
(12) |
where . We will show that
(13) | ||||
(14) |
To prove this, first note that relaxing the constraints from to the set of sub-distributions does not change the solution of this problem. This is because for any sub-distribution, we can always make it a distribution by increasing the weight of some with (at least one exists) while not increasing the objective value (since is non-decreasing in for each ). Therefore, applying the KKT conditions, we have
(15) |
where are Lagrange multipliers. Plugging in the optimal solution and taking summation over all , we have
(complementary slackness) | ||||
() |
Therefore, we have and as . This proves Eq. (13). For Eq. (14), using Eq. (15), we have
where the first inequality is due to , and the second inequality is due to .
Now we show how to transform into a distribution satisfying the constraint of OP. Choose . Let . We construct the distribution , where is defined in Lemma 11 with , and prove that satisfies the constraint of OP. Indeed, for all , we have by definition and thus
for , according to Lemma 11 below, we have .
Combining the two cases above, we prove that satisfies the constraint of OP. According to the optimality of , we thus have
(by the definition of ) | ||||
(by Eq. (13), , the definition of , and the choice of ) | ||||
(by the definition of ) |
proving the lemma. ∎
The following lemma shows that for any , there always exists a distribution that puts most weights on actions from , such that for all .
Lemma 11.
Suppose that spans and let be the uniform distribution over . For any and , there exists a distribution such that for all , where and .
Proof.
Let . As spans the whole space, is well-defined for all . Then we have
(16) | |||
where the second equality is by Sion's minimax theorem, as Eq. (16) is linear in and convex in . ∎
Lemma 12.
Given , suppose there exists a unique such that , and . Then when , where is the solution to .
Proof.
We divide actions into groups based on the following rule:
Let be the largest index such that is not empty and for . For each group , by Lemma 11, we find a distribution with , such that for all . Then we define a distribution over actions as the following:
is a valid distribution as
( and for ) | ||||
(by definition for all ) | ||||
(condition and thus ) |
Now we show that also satisfies the constraint of . Indeed, for any and such that , we use the facts for by definition and as well to arrive at:
for , we have as shown above and thus,
Thus, satisfies the constraint of OP. Therefore,
(by the feasibility of and the optimality of ) | ||||
(by the definition of and ) | ||||
( for ) | ||||
() |
proving the lemma. ∎
Lemma 13.
Let for all for some , and for some . Then .
Proof.
By the condition on , we have . Also, the condition implies that and for all . Therefore,
where the second equality is due to Lemma 12 and the other inequalities follow from for all . ∎
In the following lemma, we define a problem-dependent quantity .
Lemma 14.
Consider the optimization problem:
where and . Define its optimal objective value as (same as in Section 3). Then, there exist satisfying the constraint of this optimization problem with and being finite. (Define .)
Proof.
If there exists an assignment of for the optimal objective value which has finite , then the lemma trivially holds. Otherwise, consider the optimal solution with . According to the constraints, the following holds for all :
As is finite, by definition, we know for any , there exists a positive value such that for all , . Choosing , we have when , for all ,
Therefore, consider the solution where if and . We have . Moreover, the objective value is bounded by . ∎
Lemma 15.
Suppose for all for some , and for some , where is defined in Lemma 14. Then .
Proof.
Recall defined in Lemma 14. Define , a distribution over , as the following:
It is clear that is a valid distribution since . Also, note that by the definition of and the condition of ,
Below we show that satisfies the constraint of . Indeed, for any ,
() | ||||
(by the constraint in the definition of ) | ||||
() |
for , we have by the condition of :
Notice that implies . Thus,
() | ||||
(by the feasibility of and the optimality of ) | ||||
(by the definition of and ) | ||||
() | ||||
( proven in Lemma 14) |
finishing the proof. ∎
Lemma 16.
We have $c^* \le \mathcal{O}(d/\Delta_{\min})$.
Proof.
The idea is similar to that of Lemma 12. Define , and be the largest index such that is not empty. For each , let with be the distribution such that for all (see Lemma 11). Let and for , we let . Next we show that satisfies the constraint Eq. (2).
In fact, fix , by definition of , we have
where the first inequality is because is invertible. Therefore, the objective value of Eq. (1) is bounded as follows:
(by the definition of ) | ||||
(by the definition of ) | ||||
( for , ) | ||||
() | ||||
() |
Therefore, we have . ∎
Appendix B Analysis for Algorithm REOLB
In this section, we analyze the performance of Algorithm REOLB in the stochastic or corrupted setting. For Theorem 1, we decompose the proof into two parts. First, we show in Lemma 17 (a more concrete version of Lemma 5) that for some constant specified later, we have . Second, using Lemma 4, we know that Algorithm REOLB also enjoys a regret bound of , which gives for the first rounds and proves Theorem 2.
To prove Lemma 17, we first show in Lemma 3 that and are close up to some multiplicative factor and some additive terms related to the corruption. This holds with the help of the variance constraint of OP and the use of robust estimators. For notational convenience, first recall the following definitions from Lemma 3.
Definition 1.
Proof of Lemma 3.
We prove this by induction. For base case , by definition, and Also, and Eq. (3) holds.
If both inequalities hold for , then
(17) |
where the first inequality is because of the variance constraint of OP. Next, we show that the expectation of is and the variance of is upper bounded by for . In fact, we have
(18) | |||
(19) |
Now we are ready to prove the relation. Let . Under the induction hypothesis for the case of , using Lemma 29 with , with probability at least , we have for all :
(by Eq. (19) and Lemma 29) | |||
(using the definition of and ) | |||
(by the choice of ) | |||
(by Eq. (17)) | |||
() | |||
(by definition of ) |
Therefore, , which proves Eq. (3). The other claim Eq. (4) can be proven using similar analysis. Taking a union bound over all finishes the proof. ∎
Lemma 17 (A detailed version of Lemma 5).
Let , where and is defined in Lemma 14. Then Algorithm REOLB guarantees with probability at least :
Proof.
First, note that according to Lemma 28, implies . Let be the epoch such that , which means and thus . Then we have,
(20) |
where the first inequality is by definition of and the second inequality is because and . Then by Lemma 3, we have for all , , , which gives
(21) |
Since for all , we must have . Moreover, according to the definition of , we have . Therefore, the conditions of Lemma 15 hold. Applying Lemma 15 with , we get
Summing over , we get with probability at least ,
∎
Now we are ready to prove the main result Theorem 1.
Proof of Theorem 1.
Let denote the conditional expectation given the history up to time . By Freedman’s inequality, we have
() | ||||
(22) |
Fix a round and the corresponding epoch . According to Lemma 4, we know that
Therefore, combining Lemma 3, the regret in epoch is bounded by
Summing up the regret till round , we have
(23) |
In the following, we prove an alternative bound of , which is independent of . The following lemma is an analogue of Lemma 17, but the constant is independent of .
Lemma 18.
Let . Then Algorithm REOLB guarantees with probability
Proof.
First, note that according to Lemma 28, implies . Let be the epoch such that , which means and thus . Therefore, we still have Eq. (20), which further shows that we have Eq. (21) and . Moreover, according to the definition of , we have . Therefore, the conditions of Lemma 13 hold. Applying Lemma 13 with , we get
Summing over , we get with probability at least ,
∎
Theorem 19.
Algorithm REOLB guarantees that with probability at least ,
Proof.
Appendix C Analysis of Algorithm 1
In this section, we show that Algorithm 1 achieves both minimax-optimality in the adversarial setting and near instance-optimality in the stochastic setting. In Appendix C.1, we prove Theorem 7, showing that Algorithm 1 enjoys regret in the adversarial setting. In Appendix C.2, we prove Theorem 6, showing that Algorithm 1 also enjoys nearly instance-optimal regret in the stochastic setting and slightly worse regret in the corrupted setting.
C.1 Analysis of Algorithm 1 in the adversarial setting
To prove the guarantee in the adversarial setting, we first prove Lemma 8, which shows that at any time in Phase 2, has the smallest cumulative loss within .
Proof of Lemma 8.
By Assumption 1, for any , and any in Phase 1,
which implies
(25) |
At time , we have with probability at least ,
(by Eq. (25)) | ||||
(by Azuma’s inequality) | ||||
(by Eq. (7)) | ||||
(26) |
Bounding the deviation of for :
For all , the variance of is bounded as follows:
(27) |
where the last inequality is due to . Therefore, using Lemma 29 with , with probability at least , for all in Phase 2 and all ,
(Choose optimally) | ||||
(28) |
where the last inequality is due to the variance constraint of OP. For , since
(29) |
we have
and . Note that an increasing function when . Using Eq. (28) and , we have
(30) |
For the first rounds, according to Eq. (26), we have
(31) |
where the last inequality is due to Eq. (29). Combining Eq. (30) and Eq. (31) and noticing that , we have for all ,
(32) |
Bounding the deviation of (recall that we use the standard average estimator for ):
For the first rounds, according to Eq. (26), since , we have
For , according to Freedman’s inequality and the fact that as , we have with probability at least ,
Combining the above two inequalities, we have
(33) |
Now we are ready to prove our main lemma in the adversarial setting.
Proof of Lemma 9.
By Lemma 8, we know that the regret comparator is . By the regret bound of and the fact that (recall from Assumption 1), we have
For the regret in Phase , first note that it suffices to consider not being the last round of this phase (since the last round contributes at most to the regret). Then, consider the following decomposition:
Term 1 is upper bounded by since it corresponds to the termination condition Eq. (10).
Term 2 is a martingale difference sequence since
The variance is upper bounded by
(34) |
where the last term is because .
Term 3 is also a martingale difference sequence. As , its variance can be upper bounded by
(35) |
Therefore, with probability at least , we have by Freedman’s inequality.
As and , by Lemma 12, we have
(36) |
Combining all bounds above, we have shown
proving the lemma. ∎
C.2 Analysis of Algorithm 1 in the corrupted stochastic setting
In this section, we prove our results in the corrupted setting. To prove the main lemma Lemma 10, we separate the proof into two parts, Lemma 20 and Lemma 22.
Lemma 20.
In the stochastic setting with corruptions, within a single epoch,
-
1.
with probability at least , ;
-
2.
if , then with probability at least , ;
-
3.
if , then with probability at least , ;
-
4.
if , then with probability at least , for all .
Proof.
In the corrupted setting, we can identify as in the adversarial setting. We first show the following property: at any in Phase and with probability at least , for any ,
(37) |
By the guarantee of , we have with probability at least , for any and
(38) |
Since , we have for any ,
(39) |
Combining Eq. (38) and Eq. (39), and using for any , we get Eq. (37).
Below, we define .
Claim 1’s proof:
Let . Below we prove that if Phase 1 has not finished before time , then for the choice of , both Eq. (7) and Eq. (8) hold with high probability at time .
Consider Eq. (7). With probability at least ,
(by Azuma’s inequality) | ||||
(by Eq. (39)) | ||||
(by Eq. (37) and ) | ||||
( and ) |
showing that Eq. (7) holds for .
For Eq. (8), by the regret bound of , with probability at least , for ,
(by the regret bound of and Azuma’s inequality) | |||
(by Eq. (37) and that ) | |||
() |
By the condition of , we have for all . Thus, the last expression can further be upper bounded by , indicating that Eq. (8) also holds for all . Combining the two parts above finishes the proof.
Claim 2’s proof:
Claim 3’s proof:
Claim 4’s proof.
For notational convenience, denote the set by . We have
(hold w.p. by Eq. (37)) | ||||
(using and ) | ||||
(by Claim 3, holds w.p. ) | ||||
which finishes the proof. ∎
The next lemma shows that when grows large enough compared to the total corruption , the termination condition Eq. (10) will never be satisfied once the algorithm enters Phase .
Lemma 21.
Algorithm 1 guarantees that with probability at least , for any in Phase , when , we have
Furthermore, when ( is the constant defined in Lemma 14), we have
Proof.
Recall that and . Thus,
Except for Term 1, all terms are martingale difference sequences. Let be the expectation taken over the randomness before Phase 2. Similar to the calculation in Eq. (34) and Eq. (35), we have
and
By Freedman’s inequality, we have with probability at least , for all in Phase ,
Then we deal with Term 1. Again, by Freedman's inequality, with probability at least , for all in Phase 2,
( for ) |
When , according to Lemma 20, we know that with probability , and .
Also by Lemma 20, with probability , for any , we have . These conditions satisfy the requirement in Lemma 13 with . Therefore we can apply Lemma 13 and get
(42) |
for all . Combining all the above, we get
As argued in Eq. (36), . Therefore, the above can be further upper bounded by
(, ) | ||||
(by definition of ) | ||||
() | ||||
() | ||||
() |
Below, we use an alternative way to bound . Let , which implies . For , we use Lemma 4, and bound
(43) |
For , we use Lemma 15 and bound
(44) |
∎
Now we are ready to show that once grows large enough, Phase never ends.
Lemma 22.
If , then with probability at least , Phase 2 never ends.
Proof.
It suffices to verify the two termination conditions Eq. (9) and Eq. (10) are never satisfied. Eq. (10) does not hold because of Lemma 21. Consider Eq. (9). Let be in Phase and . According to Eq. (32) and Eq. (33), we have with probability ,
Therefore, we have
() |
This means that
Therefore, Eq. (9) is not satisfied. ∎
Finally, we prove the regret bound for the corrupted stochastic setting.
Proof of Theorem 6.
First, we consider the pure stochastic setting with . According to Lemma 20, we know that the algorithm has only one epoch as is satisfied in the first epoch. Specifically, after at most rounds in Phase , the algorithm goes to Phase and never goes back to Phase . Then we can directly apply the second claim in Lemma 21 to get the regret bound in the stochastic setting. Specifically, we bound the regret in Phase by . For the regret in Phase , according to the second claim in Lemma 21, we bound the regret by , where is the same as the one in Eq. (24). Combining them together proves the first claim.
Now we consider the corrupted stochastic setting with . Suppose that we are in the epoch with , which is the first epoch such that . Therefore, in previous epochs, we have . According to Lemma 20, we have in the previous epoch and .
We bound the regret before this epoch, as well as the regret in the first phase of this epoch by the adversarial regret bound:
If we use GeometricHedge.P as the adversarial linear bandit algorithm and , then the above is upper bounded by
For Phase of the epoch with , according to Lemma 20, we know that this phase will never end and by definition of , we have . Note that in this phase .
Therefore, by taking a summation over on Eq. (42), we bound the regret in this interval by
Combining the regret bounds finishes the proof of the second claim. ∎
Appendix D Lower Bound
In this section, we prove that the extra logarithmic factor in our bound for the stochastic setting is unavoidable if the same algorithm also achieves sublinear regret with high probability in the adversarial case. The full statement is in Theorem 27, and we first present some definitions and related lemmas. We fix a stochastic linear bandit instance (i.e., we fix the parameter and the action set ). Assume that for all and . We call this instance the first environment. The observation is generated according to the following Bernoulli distribution (footnote: In the Bernoulli noise case, we are only looking at a subclass of problems (i.e., those with and ). For problems that are outside this class, the Bernoulli noise case might be much easier than the Gaussian noise case.):
Again, let be the solution of the following optimization problem:
(45) | ||||
where . We also define .
For a fixed (which is chosen later), we divide the whole horizon into intervals of length
where . We denote these intervals as . Observe that for all .
Definition 2.
Let , for some and some universal constant .
Assumption 2.
Let be a linear bandit algorithm with the following regret guarantee: there is a problem-dependent constant (i.e., depending on and ) such that for any ,
for some .
Definition 3.
Let where the expectation is with respect to the environment of and the algorithm . Let .
Definition 4.
Let .
Lemma 23.
Let . There exists an action such that
Proof.
We use contradiction to prove this lemma. Suppose that for all we have . Then observe that where . Therefore,
satisfies the constraint of the optimization problem Eq. (45). Therefore,
This contradicts the assumption on the regret bound of . ∎
Lemma 24.
If there exists an such that
then there exists such that
Proof.
We use contradiction to prove the lemma. Suppose that for all .
Then
where in the first inequality we use
which is a generalization of the “arithmetic-mean-harmonic-mean inequality” (Mond & Pecaric, 1996). ∎
Lemma 25.
Let . If , then .
Proof.
We have
(46) |
where the last inequality is because of the definition of . For large enough , we must have . Otherwise, by Markov’s inequality, with probability at least , in interval the algorithm draws at most times, and thus the regret of would be at least , violating the assumption on . Therefore, for large enough , we have
where the last inequality is by our assumption. Combining this with Eq. (46), we get
and the conclusion follows based on our assumption. ∎
Note that when the conclusion of Lemma 25 holds, that is, , it means that the exploration in interval is not enough, since by the lower bound in (Lattimore & Szepesvari, 2017), the amount of exploration should make . Therefore, the next natural idea is to change the parameter in this interval , and argue that the amount of exploration is not enough to “detect this change with high probability”.
We now let be the first interval such that there exists with . Also, we use to denote the that satisfies this condition. Define
Notice that in this case,
That is, is a better action than under the parameter .
We now define the second environment as follows: in intervals , the losses are generated according to , but in intervals , the losses are generated according to . We use and to denote the expectation under the first environment and the second environment, and , to denote the probability measures respectively. For now, we only focus on interval (so the probability measure is only over the sequence ).
Lemma 26.
.
Proof.
Note that for any ,
(let ) | ||||
(by the assumption ) |
Therefore, .
Finally, we are ready to present the lower bound. Roughly speaking, it shows that if an algorithm achieves regret in the stochastic case with , then it cannot be robust in the adversarial setting, in the sense that it cannot guarantee a regret bound such as with probability at least .
Theorem 27.
For any , if an algorithm guarantees a pseudo regret bound of
for constant in stochastic environments for all sufficiently large , then there exists an adversarial environment such that with probability at least , the regret of the same algorithm is at least .
Proof.
Let be the event: . By Lemma 5 in (Lattimore & Szepesvari, 2017) and choosing , we have
(by Lemma 26) | ||||
(48) |
Notice that when event happens under the first environment, the regret is at least . By the assumption on , we have
implying that
Combining this with (48), we get
(49) |
Choosing , we have for large enough . Then notice that when happens under the second environment, the regret within interval (against comparator ) is at least . Since under the second environment the learner may have negative regret against in interval , in the best case the regret against in interval is at least
In conclusion, in the second environment, algorithm suffers at least regret in the first intervals with probability at least . ∎
Appendix E Adversarial Linear Bandit Algorithms with High-probability Guarantees
In this section, we show that the algorithms of (Bartlett et al., 2008) and (Lee et al., 2020) both satisfy Assumption 1.
E.1 GeometricHedge.P
We first show GeometricHedge.P in Algorithm 3 for completeness. We remark on the differences between the original version and the one shown here. First, we consider the noisy feedback instead of the zero-noise feedback . However, most of the analysis in (Bartlett et al., 2008) still holds. Second, instead of using barycentric spanner exploration (known to be suboptimal), we use John's exploration, shown to be optimal in Bubeck et al. (2012). With this replacement, Lemma 3 in Bartlett et al. (2008) can be improved to and .
Now, consider martingale difference sequence . We have and
Using Lemma 2 in (Bartlett et al., 2008), we have that with probability at least (set ),
(50) |
where the last inequality is by AM-GM inequality. Note that . Plugging this into Eq. (50), we have with probability at least ,
(51) |
The counterpart of Lemma 6 in Bartlett et al. (2008) shows that with probability at least ,
(52) |
Using Eq. (51), we have the counterpart of Lemma 7 in Bartlett et al. (2008) as follows: with probability at least ,
(53) |
The counterpart of Lemma 8 in Bartlett et al. (2008) is: with probability at least ,
(54) |
Plugging Eq. (52), Eq. (53), and Eq. (54) into Equation (2) in (Bartlett et al., 2008), we have with probability at least ,
(55) |
Again using Eq. (51) and Equation (4) in (Bartlett et al., 2008), we have with probability at least , for all ,
Combining this with Eq. (55) and assuming , we have that with probability at least , for every ,
Recalling the definition of in Eq. (50) and combining terms, we have
(56) |
It remains to decide and . Note that the analysis of (Bartlett et al., 2008) requires . From the proof of Lemma 4 in (Bartlett et al., 2008), we know that . Thus, we set so that always holds. Therefore,
Choosing , we have with probability at least , for all ,
() | ||||
(Eq. (50)) |
which proves Eq. (5).
E.2 The algorithm of (Lee et al., 2020)
Now we introduce another high-probability adversarial linear bandit algorithm from (Lee et al., 2020). The regret bound of this algorithm is slightly worse than that of (Bartlett et al., 2008). However, the algorithm is efficient even when there are infinitely many or exponentially many actions. For the concrete pseudocode of the algorithm, we refer the reader to Algorithm 2 of (Lee et al., 2020). Here, we focus on showing that it satisfies Eq. (5).
We first restate Lemma B.15 in (Lee et al., 2020) with explicit logarithmic factors: Algorithm 2 of (Lee et al., 2020) with for some universal constant guarantees that with probability at least ,
(57) |
where , is an upper bound on with probability , are two universal constants, and we replace the self-concordant parameter in their bound by a trivial upper bound .
Appendix F Auxiliary Lemmas
In this section, we provide several auxiliary lemmas that we have used in the analysis.
Lemma 28.
(Lemma A.2 of (Shalev-Shwartz & Ben-David, 2014)) Let and . If , then we have .
Lemma 29 (Concentration inequality for Catoni’s estimator (Wei et al., 2020)).
Let be a filtration, and be real random variables such that is -measurable, for some fixed , and for some fixed . Denote and let be Catoni's robust mean estimator of with a fixed parameter , that is, is the unique root of the function
where
Then for any , as long as is large enough such that , we have with probability at least ,
Choosing optimally, we have
In particular, if , we have