A Blackbox Approach to Best of Both Worlds in Bandits and Beyond
Abstract
Best-of-both-worlds algorithms for online learning, which achieve near-optimal regret in both the adversarial and the stochastic regimes, have received growing attention recently. Existing techniques often require careful adaptation to every new problem setup, including specialised potentials and delicate tuning of algorithm parameters. Yet, in domains such as linear bandits, it is still unknown whether there exists an algorithm that can simultaneously obtain regret in the stochastic regime and regret in the adversarial regime. In this work, we resolve this question positively and present a general reduction from best of both worlds to a wide family of follow-the-regularized-leader (FTRL) and online-mirror-descent (OMD) algorithms. We showcase the capability of this reduction by transforming existing algorithms that are only known to achieve worst-case guarantees into new algorithms with best-of-both-worlds guarantees in contextual bandits, graph bandits and tabular Markov decision processes.
1 Introduction
Multi-armed bandits and their various extensions have a long history (Lai et al., 1985, Auer et al., 2002a; b). Traditionally, the stochastic regime, in which all losses/rewards are i.i.d., and the adversarial regime, in which an adversary chooses the losses in an arbitrary fashion, have been studied in isolation. However, it is often unclear in practice whether an environment is best modelled by the stochastic regime, a slightly contaminated regime, or a fully adversarial regime. This is why the question of automatically adapting to the hardness of the environment, also called achieving best-of-both-worlds guarantees, has received growing attention (Bubeck and Slivkins, 2012, Seldin and Slivkins, 2014, Auer and Chiang, 2016, Seldin and Lugosi, 2017, Wei and Luo, 2018, Zimmert and Seldin, 2019, Ito, 2021b, Ito et al., 2022a, Masoudian and Seldin, 2021, Gaillard et al., 2014, Mourtada and Gaïffas, 2019, Ito, 2021a, Lee et al., 2021, Rouyer et al., 2021, Amir et al., 2022, Huang et al., 2022, Tsuchiya et al., 2022, Erez and Koren, 2021, Rouyer et al., 2022, Ito et al., 2022b, Kong et al., 2022, Jin and Luo, 2020, Jin et al., 2021, Chen et al., 2022, Saha and Gaillard, 2022, Masoudian et al., 2022, Honda et al., 2023). One of the most successful approaches has been to derive carefully tuned FTRL algorithms, which are canonically suited for the adversarial regime, and then show that these algorithms are also close to optimal in the stochastic regime. This is achieved by proving a crucial self-bounding property, which then translates into stochastic guarantees. Algorithms of this type have been proposed for multi-armed bandits (Wei and Luo, 2018, Zimmert and Seldin, 2019, Ito, 2021b, Ito et al., 2022a), combinatorial semi-bandits (Zimmert et al., 2019), bandits with graph feedback (Ito et al., 2022b), tabular MDPs (Jin and Luo, 2020, Jin et al., 2021) and others. Algorithms with self-bounding properties also automatically adapt to intermediate regimes of stochastic losses with adversarial corruptions (Lykouris et al., 2018, Gupta et al., 2019, Zimmert and Seldin, 2019, Ito, 2021a), which highlights the strong robustness of this algorithm design. However, every problem variation has required careful design of potential functions and learning rate schedules. For linear bandits, it has remained unknown whether self-bounding is possible, and therefore the state-of-the-art best-of-both-worlds algorithm (Lee et al., 2021) neither obtains the optimal stochastic rate nor canonically adapts to corruptions.
In this work, we make the following contributions: 1) We propose a general reduction from best of both worlds to typical FTRL/OMD algorithms, sidestepping the need for customized potentials and learning rates. 2) We derive the first best-of-both-worlds algorithm for linear bandits that obtains regret in the stochastic regime and optimal adversarial worst-case regret, and adapts canonically to corruptions. 3) We derive the first best-of-both-worlds algorithms for linear bandits and tabular MDPs with first-order guarantees in the adversarial regime. 4) We obtain the first best-of-both-worlds algorithms for bandits with graph feedback and bandits with expert advice with optimal stochastic regret.
Setting | Algorithm | Adversarial | Stochastic | Rd. 1 | Rd. 2
Linear bandit | Lee et al. (2021) | |||||
Theorem 3 | ✓ | |||||
Corollary 7 | ✓ | ✓ | ||||
Corollary 12 | ✓ | ✓ | ✓ | |||
Contextual bandit | Corollary 13 | ✓ | ✓ | ✓ | ||
Graph bandit Strongly observable | Ito et al. (2022b) | ✓ | ||||
Rouyer et al. (2022) | ||||||
Corollary 15 | ✓ | ✓ | ✓ | |||
Graph bandit Weakly observable | Ito et al. (2022b) | ✓ | ||||
Corollary 19 | ✓ | ✓ | ✓ | |||
Tabular MDP | Jin et al. (2021) | ✓ | ||||
Theorem 26 | ✓ | ✓ | ✓ |
2 Related Work
Our reduction procedure is related to a class of model selection algorithms that use a meta bandit algorithm to learn over a set of base bandit algorithms, with the goal of achieving performance comparable to running the best base algorithm alone. For the adversarial regime, Agarwal et al. (2017) introduced the Corral algorithm to learn over adversarial bandit algorithms that satisfy a stability condition. This framework was further improved by Foster et al. (2020), Luo et al. (2022). For the stochastic regime, Arora et al. (2021), Abbasi-Yadkori et al. (2020), Cutkosky et al. (2021) introduced another set of techniques to achieve similar guarantees, but without explicitly relying on the stability condition. While most of these results focus on obtaining a regret, Arora et al. (2021), Cutkosky et al. (2021), Wei et al. (2022), Pacchiano et al. (2022) made some progress in obtaining regret in the stochastic regime. Among them, Wei et al. (2022), Pacchiano et al. (2022) are most related to our work since they also pursue a best-of-both-worlds guarantee across adversarial and stochastic regimes. However, their regret bounds, when applied to our problems, all suffer from highly sub-optimal dependence on and on the amount of corruption.
3 Preliminaries
We consider sequential decision making problems where the learner interacts with the environment in rounds. In each round , the environment generates a loss vector and the learner generates a distribution over actions and chooses an action . The learner then suffers loss and receives some information about depending on the concrete setting. In all settings except MDPs, we assume .
In the adversarial regime, is generated arbitrarily subject to the structure of the concrete setting (e.g., in linear bandits, for arbitrary ). In the stochastic regime, we further assume that there exists an action and a gap such that for all . We also consider the corrupted stochastic regime, where the assumption is relaxed to for some . We define .
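For concreteness, the regimes above can be summarized as follows; this is a sketch in generic notation (the symbols $a^\star$, $\Delta$, $c_t$ and $C$ follow the usual best-of-both-worlds conventions and are our own shorthand, not quoted from the formal definitions):

$$\mathrm{Reg}_T(a) \;=\; \mathbb{E}\Big[\sum_{t=1}^{T}\big(\ell_t(a_t)-\ell_t(a)\big)\Big], \qquad \mathrm{Reg}_T \;=\; \max_{a}\,\mathrm{Reg}_T(a),$$

with the stochastic regime requiring $\mathbb{E}_t\big[\ell_t(a)-\ell_t(a^\star)\big]\ge \Delta$ for all $a\neq a^\star$ and all $t$, and the corrupted stochastic regime relaxing this to $\mathbb{E}_t\big[\ell_t(a)-\ell_t(a^\star)\big]\ge \Delta\cdot\mathbb{1}[a\neq a^\star]-c_t$ with corruption level $C=\sum_{t=1}^{T}c_t$.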
4 Main Techniques
4.1 The standard global self-bounding condition and a new linear bandit algorithm
To obtain best-of-both-worlds regret bounds, previous works establish the following property for their algorithms:
Definition 1 (-global-self-bounding condition, or -GSB).
where are problem-dependent constants and is the probability of the learner choosing in round .
With this condition, one can use the standard self-bounding technique to obtain best-of-both-worlds regret bounds:
Proposition 2.
If an algorithm satisfies -GSB, then its pseudo-regret is bounded by in the adversarial regime, and by in the corrupted stochastic regime.
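To make the self-bounding step explicit, here is the standard calculation in a simplified form. It is a sketch assuming a GSB bound of the generic shape below (with constants $c_1, c_2$) together with the usual corrupted-stochastic lower bound on the pseudo-regret:

$$\mathrm{Reg}_T \;\le\; c_1\,\mathbb{E}\Big[\sqrt{\textstyle\sum_{t=1}^{T}\big(1-p_t(a^\star)\big)}\Big]+c_2, \qquad \mathrm{Reg}_T \;\ge\; \Delta\,\mathbb{E}\Big[\textstyle\sum_{t=1}^{T}\big(1-p_t(a^\star)\big)\Big]-2C.$$

Writing $S=\mathbb{E}\big[\sum_{t=1}^{T}(1-p_t(a^\star))\big]$ and using Jensen's inequality, for any $\lambda\in(0,1]$,

$$\mathrm{Reg}_T \;=\;(1+\lambda)\,\mathrm{Reg}_T-\lambda\,\mathrm{Reg}_T \;\le\;(1+\lambda)\big(c_1\sqrt{S}+c_2\big)-\lambda\big(\Delta S-2C\big) \;\le\;\frac{(1+\lambda)^2c_1^2}{4\lambda\Delta}+(1+\lambda)c_2+2\lambda C,$$

where the last inequality maximizes over $S\ge 0$; tuning $\lambda$ then yields $\mathrm{Reg}_T=O\big(c_1^2/\Delta+\sqrt{C\,c_1^2/\Delta}+c_2\big)$, which is the form of the guarantee in Proposition 2.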
Previous works have found algorithms with GSB in various settings, such as multi-armed bandits, semi-bandits, graph-bandits, and MDPs. Here, we provide one such algorithm (Algorithm 3) for linear bandits based on the framework of SCRiBLe (Abernethy et al., 2008). The guarantee of Algorithm 3 is stated in the next theorem under the more general “learning with predictions” setting introduced in Rakhlin and Sridharan (2013), where in every round , the learner receives a loss predictor before making decisions.
Theorem 3.
In the “learning with predictions” setting, Algorithm 3 achieves a second-order regret bound of in the adversarial regime, where is the feature dimension, is the self-concordance parameter of the regularizer, and is the loss predictor. This also implies a first-order regret bound of if and ; it simultaneously achieves in the corrupted stochastic regime.
See Appendix B for the algorithm and the proof. We call this algorithm Variance-Reduced SCRiBLe (VR-SCRiBLe) since it is based on the original SCRiBLe updates, but with some refinement in the construction of the loss estimator to reduce its variance. A desirable property of SCRiBLe-based algorithms is that they simultaneously achieve data-dependent bounds (i.e., first- and second-order bounds), similar to the case of multi-armed bandits using FTRL/OMD with the log-barrier regularizer (Wei and Luo, 2018, Ito, 2021b). Like Rakhlin and Sridharan (2013), Bubeck et al. (2019), Ito (2021b), Ito et al. (2022a), we can also use another procedure to learn and obtain path-length or variance bounds. The details are omitted.
We notice, however, that the bound in Theorem 3 is sub-optimal in since the best-known regret for linear bandits is either or , depending on whether the number of actions is larger than . These bounds also hint at the possibility of getting a better dependence on in the stochastic regime if we can also establish GSB for these algorithms. Therefore, we ask: can we achieve best-of-both-worlds bounds for linear bandits with the optimal dependence?
A natural attempt is to show GSB for existing algorithms with optimal dependence, including the EXP2 algorithm (Bubeck et al., 2012) and the logdet-FTRL algorithm (Zimmert and Lattimore, 2022). In our attempts, it remained unclear how to adapt the analysis of logdet-FTRL to show GSB. For EXP2, using the learning-rate tuning technique by Ito et al. (2022b), one can make it satisfy a guarantee similar to GSB, albeit with an additional factor, resulting in a bound in the stochastic regime. This gives a sub-optimal rate of . Therefore, we further ask: can we achieve regret in the stochastic regime with an optimal dependence?
Motivated by these questions, we resort to approaches that do not rely on GSB, which, to the best of our knowledge, is new to the best-of-both-worlds literature. Our approach is in fact a general reduction: it not only achieves the desired bounds for linear bandits, but also provides a principled way to convert an algorithm with a worst-case guarantee into one with a best-of-both-worlds guarantee in general settings. In the next two subsections, we introduce our reduction techniques.
4.2 First reduction: Best of Both Worlds Local Self-Bounding
Our reduction relies on the base algorithm satisfying a condition weaker than GSB, defined as follows:
Definition 4 (-local-self-bounding condition, or -LSB).
We say an algorithm satisfies the -local-self-bounding condition if it takes a candidate action as input and satisfies the following pseudo-regret guarantee for any stopping time :
where are problem-dependent constants.
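In symbols, the guarantee we have in mind takes roughly the following shape (a hedged sketch; the exact constants and logarithmic factors are those of the formal statement): with candidate action $\mathring a$ given as input, for every comparator $a$ and stopping time $\tau$,

$$\mathbb{E}\Big[\sum_{t=1}^{\tau}\big(\ell_t(a_t)-\ell_t(a)\big)\Big] \;\le\; \begin{cases} c_1\,\mathbb{E}\Big[\sqrt{\sum_{t=1}^{\tau}\big(1-p_t(\mathring a)\big)}\Big]+c_2, & a=\mathring a,\\[2mm] c_1\,\mathbb{E}\big[\sqrt{\tau}\big]+c_2, & a\neq \mathring a. \end{cases}$$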
The difference between LSB in Definition 4 and GSB in Definition 1 is that LSB only requires the self-bounding term to appear when is the particular action given as input to the algorithm; for all other actions, the worst-case bound suffices. A minor additional requirement is that the pseudo-regret guarantee needs to hold for any stopping time (because our reduction may stop the algorithm during the learning procedure), but this is relatively easy to satisfy: for all algorithms in this paper, their regret bounds naturally hold for any stopping time without additional effort.
Input: LSB base algorithm ALG.
Initialize: draw the first candidate action â_1 at random from the action set; set the previous epoch length T_0 ← 0.
for epoch k = 1, 2, … do
    Restart ALG with candidate action â_k as input.
    Set counters N_k(a) ← 0 for all actions a, and the epoch length t_k ← 0.
    for t = 1, 2, … do
        Play the action a_t proposed by ALG, feed the observed feedback to ALG, and update N_k(a_t) ← N_k(a_t) + 1, t_k ← t_k + 1.
        if t_k ≥ 2 T_{k−1} and there exists a ≠ â_k such that N_k(a) ≥ t_k / 2 then
            Set â_{k+1} ← a and T_k ← t_k.
            break.
For linear bandits, we find that an adaptation of the logdet-FTRL algorithm (Zimmert and Lattimore, 2022) satisfies -LSB, as stated in the following lemma. The proof is provided in Appendix C.
Lemma 5.
For -dimensional linear bandits, by transforming the action set into , Algorithm 4 satisfies -LSB with and .
With the LSB condition defined, we develop a reduction procedure (Algorithm 1) that turns any algorithm with LSB into a best-of-both-worlds algorithm with a guarantee similar to Proposition 2. The guarantee is formally stated in the next theorem.
Theorem 6.
If satisfies -LSB, then the regret of Algorithm 1 with as the base algorithm is upper bounded by in the adversarial regime and by in the corrupted stochastic regime.
See Appendix D.1 for a proof of Theorem 6. In particular, Theorem 6 together with Lemma 5 directly leads to an improved best-of-both-worlds bound for linear bandits.
Corollary 7.
Combining Algorithm 1 and Algorithm 4 results in a linear bandit algorithm with regret in the adversarial regime and regret in the corrupted stochastic regime simultaneously.
Ideas of the first reduction
Algorithm 1 proceeds in epochs. In epoch , an action is selected as the candidate ( is randomly drawn from ). The procedure simply executes the base algorithm with input and monitors the number of times each action is drawn. If in epoch there exists some action that has been drawn more than half of the time, and the length of epoch already exceeds twice the length of epoch , then a new epoch is started with set to that action.
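The following Python sketch mirrors the restart logic just described. The interface (`make_lsb_alg`, `env`, and the initial value of the previous epoch length) is hypothetical and only meant to illustrate the control flow, not to reproduce Algorithm 1 exactly.

```python
import random

def run_first_reduction(make_lsb_alg, env, actions, horizon):
    """Epoch-doubling wrapper around an LSB base algorithm (illustrative sketch).

    make_lsb_alg(candidate) is assumed to return an object with
    .act() -> action and .update(action, feedback); env(action) returns the
    feedback of the concrete setting.
    """
    candidate = random.choice(actions)   # candidate of the first epoch
    prev_len = 1                         # length of the previous epoch (assumed init)
    t = 0
    while t < horizon:
        alg = make_lsb_alg(candidate)            # restart base with the new candidate
        counts = {a: 0 for a in actions}         # number of draws in the current epoch
        cur_len = 0
        while t < horizon:
            a = alg.act()
            alg.update(a, env(a))
            counts[a] += 1
            cur_len += 1
            t += 1
            # start a new epoch once the current epoch is at least twice as long
            # as the previous one and some other action was drawn half of the time
            if cur_len >= 2 * prev_len:
                heavy = [b for b in actions
                         if b != candidate and counts[b] >= cur_len / 2]
                if heavy:
                    candidate, prev_len = heavy[0], cur_len
                    break
```

If the condition is never triggered, the current epoch simply runs until the horizon is exhausted.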
Theorem 6 alone is not sufficient for a black-box reduction, since algorithms satisfying LSB are not common. Therefore, our next step is to present a more general reduction that makes use of recent advances in Corral algorithms.
4.3 Second reduction: Local Self-Bounding to Importance-Weighting Stable
In this subsection, we show that one can achieve LSB using the idea of model selection. Specifically, we run a variant of the Corral algorithm (Agarwal et al., 2017) over two instances: one is , and the other is an importance-weighting-stable algorithm (see Definition 8) over the action set . Here, we focus on the case of , which covers most standard settings where the worst-case regret is ; examples for in the graph bandit setting are discussed in Section 5.
First, we introduce the notion of importance-weighting stability.
Definition 8 (-iw-stable).
An algorithm is -iw-stable (importance-weighting stable) if, given an adaptive sequence of weights and the assertion that the feedback in round is observed with probability , it obtains the following pseudo-regret guarantee for any stopping time :
Definition 8 requires that even if the algorithm only receives the desired feedback with probability in round , it still achieves a meaningful regret bound that degrades smoothly with . In previous works on Corral and its variants (Agarwal et al., 2017, Foster et al., 2020), a similar notion of -stability is defined as having a regret of , which is a weaker assumption than our -iw-stability. Our stronger notion of stability is crucial for obtaining a best-of-both-worlds bound, but it is still very natural and holds for a wide range of algorithms.
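As a reading aid, the guarantee in Definition 8 can be thought of as having the following shape; this is our own hedged paraphrase (with generic constants $\alpha,\beta$), not the verbatim definition: for every comparator $a$ and stopping time $\tau$,

$$\mathbb{E}\Big[\sum_{t=1}^{\tau}\big(\ell_t(a_t)-\ell_t(a)\big)\Big] \;\le\; \mathbb{E}\Big[\sqrt{\alpha\sum_{t=1}^{\tau}\frac{1}{q_t}}\;+\;\beta\,\max_{t\le\tau}\frac{1}{q_t}\Big],$$

so the regret degrades gracefully with the cumulative and maximal inverse observation probabilities.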
Input: candidate action , -iw-stable algorithm over with constants .
Define: .
.
for do
if then draw and observe ;
Below, we provide two examples of -iw-stable algorithms. Their analysis is mostly standard FTRL analysis (accounting for the extra importance weighting) and can be found in Appendix F.
Lemma 9.
For linear bandits, EXP2 with adaptive learning rate (Algorithm 9) is -iw-stable with constants .
Lemma 10.
For contextual bandits, EXP4 with adaptive learning rate (Algorithm 10) is -iw-stable with constants .
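To illustrate the interface that Lemma 9 and Lemma 10 refer to, here is a hedged Python sketch of an EXP3-style multi-armed-bandit learner that accepts an external observation probability $q_t$ and importance-weights its loss estimates accordingly. The class name, the adaptive learning-rate schedule, and the variance proxy are illustrative choices on our part, not the paper's Algorithm 9 or 10.

```python
import numpy as np

class IWStableExp3:
    """Exponential-weights learner that tolerates external importance weighting.

    The caller reports q_t = P(feedback revealed in round t); observed losses
    are reweighted by 1 / (p_t[a] * q_t) so that the estimator stays unbiased.
    The learning rate shrinks with sum_s 1/q_s, one standard way to obtain a
    regret bound scaling with sqrt(sum_t 1/q_t), i.e. the iw-stable shape.
    """

    def __init__(self, n_actions):
        self.n = n_actions
        self.cum_loss = np.zeros(n_actions)  # cumulative loss estimates
        self.var_proxy = 1.0                 # running sum of 1/q_s

    def distribution(self):
        eta = np.sqrt(np.log(self.n) / self.var_proxy)
        w = np.exp(-eta * (self.cum_loss - self.cum_loss.min()))
        return w / w.sum()

    def act(self, rng):
        p = self.distribution()
        return int(rng.choice(self.n, p=p)), p

    def update(self, action, loss, p, q, observed):
        """loss in [0, 1]; observed says whether feedback was revealed."""
        self.var_proxy += 1.0 / q
        if observed:
            self.cum_loss[action] += loss / (p[action] * q)
```

A typical round then looks like `a, p = alg.act(rng)` followed by `alg.update(a, loss, p, q, observed)`, where `q` would be supplied by the outer Corral reduction.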
Our reduction procedure is Algorithm 2, which turns an -iw-stable algorithm into one with -LSB. The guarantee is formalized in the next theorem, whose proof is in Appendix E.1.
Theorem 11.
If is a -iw-stable algorithm with constants , then Algorithm 2 with as the base algorithm satisfies -LSB with constants where and .
Cascading Theorem 11 and Theorem 6, and combining them with Lemma 9 and Lemma 10, respectively, we obtain the following corollaries:
Corollary 12.
Combining Algorithm 1, Algorithm 2 and Algorithm 9 results in a linear bandit algorithm with regret in the adversarial regime and regret in the corrupted stochastic regime simultaneously.
Corollary 13.
Combining Algorithm 1, Algorithm 2 and Algorithm 10 results in a contextual bandit algorithm with regret in the adversarial regime and regret in the corrupted stochastic regime simultaneously, where is the number of arms.
Ideas of the second reduction
Algorithm 2 is related to the Corral algorithm (Agarwal et al., 2017, Foster et al., 2020, Luo et al., 2022). We use a special version that is essentially an FTRL algorithm (with a hybrid -Tsallis entropy and log-barrier regularizer) over two base algorithms: the candidate , and an algorithm operating on the reduced action set . For base algorithm , the Corral algorithm adds a bonus term that upper bounds the regret of under importance weighting (i.e., the quantity defined in Algorithm 6). In the regret analysis, the bonus creates a negative term as well as a bonus overhead in the learner's regret. The negative term can be used to cancel the positive regret of , and the analysis reduces to bounding the bonus overhead. Showing that the bonus overhead is upper bounded by the order of is the key to establishing the LSB property.
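As a concrete illustration of the FTRL step over the two base algorithms, the snippet below numerically solves a two-point FTRL problem with a hybrid 1/2-Tsallis-entropy plus log-barrier regularizer. It is a sketch: the learning rates, the exact regularizer weights, and how the cumulative (loss minus bonus) estimates are built are all simplified relative to Algorithm 2.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def corral_weight(cum_loss_diff, eta_tsallis, eta_log):
    """Probability of playing the candidate action in the two-way Corral step.

    cum_loss_diff = cumulative estimated (loss - bonus) of the candidate minus
    that of the base algorithm.  The objective is the FTRL objective
    <(p, 1-p), cumulative losses> + hybrid regularizer, restricted to p in (0, 1).
    """
    def objective(p):
        tsallis = -(np.sqrt(p) + np.sqrt(1.0 - p)) / eta_tsallis   # 1/2-Tsallis part
        log_barrier = -(np.log(p) + np.log(1.0 - p)) / eta_log     # log-barrier part
        return p * cum_loss_diff + tsallis + log_barrier           # convex in p

    res = minimize_scalar(objective, bounds=(1e-8, 1.0 - 1e-8), method="bounded")
    return res.x
```

The bonus folded into `cum_loss_diff` on the base-algorithm side would be an estimate of its iw-stable regret bound under the current importance weights; in the analysis, the induced negative term cancels the base algorithm's regret, as described above.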
Combining Algorithm 2 and Algorithm 1, we have the following general reduction: as long as we have an -iw-stable algorithm over for any , we have an algorithm with the best-of-both-worlds guarantee when the optimal action is unique. Notice that -iw-stable algorithms are quite common: usually a standard FTRL/OMD algorithm with an adaptive learning rate suffices.
The overall reduction is reminiscent of the G-COBE procedure by Wei et al. (2022), which also runs model selection over and a base algorithm for (similar to Algorithm 2), and dynamically restarts the procedure with increasing epoch lengths (similar to Algorithm 1). However, besides being more complicated, G-COBE only guarantees a bound of in the corrupted stochastic regime (omitting dependencies on ), which is sub-optimal in both and .¹
¹ However, Wei et al. (2022) is able to handle a more general type of corruption.
5 Case Study: Graph Bandits
In this section, we show the power of our reduction by improving the state of the art of best-of-both-worlds algorithms for bandits with graph feedback. In bandits with graph feedback, the learner is given a directed feedback graph , whose nodes correspond to the arms of the bandit. Instead of observing the loss of the played action , the player observes the loss iff . Learnable graphs are divided into strongly observable graphs and weakly observable graphs. In the first case, every arm must either receive its own feedback, i.e. , or all other arms do, i.e. . In the weakly observable case, every arm must be observable by at least one arm, i.e. . The following graph properties characterize the difficulty of the two regimes. An independent set is any subset such that no edge exists between any two distinct nodes in , i.e. we have and . The independence number is the size of the largest independent set. A related quantity is the weak independence number , which is the independence number of the sub-graph obtained by removing all one-sided edges, i.e. iff and . For undirected graphs, the two notions coincide, but in general can be larger than by up to a factor of . Finally, a weakly dominating set is a subset such that every node in is observable by some node in , i.e. . The weak domination number is the size of the smallest weakly dominating set.
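These graph quantities are easy to compute by brute force on small examples, which can help when sanity-checking the bounds in this section. The following sketch uses exponential-time enumeration and is for illustration only; the edge convention we assume is that (u, v) means playing u reveals the loss of v.

```python
from itertools import combinations

def independence_number(nodes, edges):
    """Largest set of nodes with no edge (in either direction) between two
    distinct members; self-loops are irrelevant here."""
    def independent(subset):
        return all((u, v) not in edges and (v, u) not in edges
                   for u, v in combinations(subset, 2))
    for size in range(len(nodes), 0, -1):
        if any(independent(s) for s in combinations(nodes, size)):
            return size
    return 0

def weak_independence_number(nodes, edges):
    """Independence number after removing one-sided edges: keep (u, v) with
    u != v only if (v, u) is also present."""
    mutual = {(u, v) for (u, v) in edges if u != v and (v, u) in edges}
    return independence_number(nodes, mutual)

def weak_domination_number(nodes, edges):
    """Size of the smallest set D such that every node has an in-neighbour in D."""
    for size in range(1, len(nodes) + 1):
        for d in combinations(nodes, size):
            if all(any((u, v) in edges for u in d) for v in nodes):
                return size
    return None  # no dominating set: the graph is not weakly observable
```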
Alon et al. (2015) provides an almost tight characterization of the adversarial regime, giving upper bounds of and for the strongly observable case and for the weakly observable case, as well as matching lower bounds up to log factors. Zimmert and Lattimore (2019) have shown that the dependency can be avoided and hence that is the right notion of difficulty for strongly observable graphs even in the limit (though they pay for this with a larger dependency).
State-of-the-art best-of-both-worlds guarantees for bandits with graph feedback are derived from EXP3 and obtain either or regret for strongly observable graphs and for weakly observable graphs (Rouyer et al., 2022, Ito et al., 2022b).² Our black-box reduction directly yields algorithms with optimal regret in the stochastic regime.
² Rouyer et al. (2022) obtains better bounds when the gaps are not uniform, while Ito et al. (2022b) can handle graphs that are unions of strongly observable and weakly observable sub-graphs. We do not extend our analysis to the latter case for the sake of simplicity.
5.1 Strongly observable graphs
We begin by providing an example of an -iw-stable algorithm for bandits with strongly observable graph feedback. The algorithm and the proof are in Appendix F.3.
Lemma 14.
For bandits with strongly observable graph feedback, -Tsallis-INF with adaptive learning rate (Algorithm 11) is -iw-stable with constants .
Before we apply the reduction, note that Algorithm 2 requires that the player observes the loss when playing arm , which is not necessarily the case for all strongly observable graphs. However, this can be overcome by defining an observable surrogate loss that is used in Algorithm 2 instead. We explain how this works in detail in Appendix G and assume from now on that this technical issue does not arise.
Cascading the two reduction steps, we immediately obtain the following.
Corollary 15.
Combining Algorithm 1, Algorithm 2 and Algorithm 11 results in a graph bandit algorithm with regret in the adversarial regime and regret in the corrupted stochastic regime simultaneously.
5.2 Weakly observable graphs
Weakly observable graphs motivate our general definition of LSB beyond . We first need to define the analogue of -iw-stability for this regime.
Definition 16 (-iw-stable).
An algorithm is -iw-stable (importance-weighting stable) if, given an adaptive sequence of weights such that the feedback in any round is observed with probability , it obtains the following pseudo-regret guarantee for any stopping time :
An example of such an algorithm is given by the following lemma (see Appendix F.4 for the algorithm and the proof).
Lemma 17.
For bandits with weakly observable graph feedback, EXP3 with adaptive learning rate (Algorithm 12) is -iw-stable with constants .
Similar to the -case, this allows for a general reduction to LSB algorithms.
Theorem 18.
If is a -iw-stable algorithm with constants , then Algorithm 5 with as the base algorithm satisfies -LSB with constants where and .
See Appendix E.2 for the proof. Algorithm 5 works almost the same as Algorithm 2, but we need to adapt the bonus terms to the definition of -iw-stability. The major additional difference is that we do not necessarily observe the loss of the action played, and hence we need to play exploratory actions with probability in every round to estimate the performance difference between and . A second point to note is that the base algorithm uses the action set , but is still allowed to use in its internal exploration policy. This is necessary because the sub-graph with one arm removed might have a larger dominating number, or might not even be learnable at all. By allowing in the internal exploration, we guarantee that the regret of the base algorithm is not larger than over the full action set. Finally, cascading the lemma and the reduction, we obtain
Corollary 19.
Combining Algorithm 1, Algorithm 5 and Algorithm 12 results in a graph bandit algorithm with regret in the adversarial regime and regret in the corrupted stochastic regime simultaneously.
6 More Adaptations
To demonstrate the versatility of our reduction framework, we provide two more adaptations. The first shows that our reduction can easily be adapted to obtain a data-dependent bound (i.e., a first- or second-order bound) in the adversarial regime, provided that the base algorithm achieves a corresponding data-dependent bound. The second eliminates the undesirable requirement of the Corral algorithm (Algorithm 2) that the base algorithm operate on instead of the more natural . We show that under a stronger stability assumption, we can indeed just use a base algorithm that operates on . This is helpful for settings where excluding a single action/policy in the algorithm is not straightforward (e.g., MDPs). Finally, we combine the two techniques to obtain a first-order best-of-both-worlds bound for tabular MDPs.
6.1 Reduction with first- and second-order bounds
A first-order regret bound refers to a regret bound of order , where is the best action's cumulative loss. This is meaningful under the assumption that for all . A second-order regret bound refers to a bound of order , where is a loss predictor for action that is available to the learner at the beginning of round . We refer the reader to Wei and Luo (2018), Ito (2021b) for more discussion of data-dependent bounds.
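In symbols, the two notions have roughly the following shapes; this is a hedged sketch with a generic constant $c$ and loss predictors $m_t$, and the precise statements are those of the cited works:

$$\text{first-order:}\quad \mathrm{Reg}_T=\tilde O\Big(\sqrt{c\,L_T^\star}\Big),\quad L_T^\star=\min_a\sum_{t=1}^{T}\ell_t(a); \qquad \text{second-order:}\quad \mathrm{Reg}_T=\tilde O\Big(\sqrt{c\sum_{t=1}^{T}\big(\ell_t(a_t)-m_t(a_t)\big)^2}\,\Big).$$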
We first define counterparts of the LSB condition and iw-stability condition with data-dependent guarantees:
Definition 20 (-dd-LSB).
We say an algorithm satisfies -dd-LSB (data-dependent LSB) if it takes a candidate action as input and satisfies the following pseudo-regret guarantee for any stopping time :
where are problem-dependent constants, in the second-order bound case, and in the first-order bound case.
Definition 21 (-dd-iw-stable).
An algorithm is -dd-iw-stable (data-dependent iw-stable) if, given an adaptive sequence of weights and the assertion that the feedback in round is observed with probability , it obtains the following pseudo-regret guarantee for any stopping time :
where if feedback is observed in round and otherwise, and is defined in the same way as in Definition 20.
We can turn a dd-iw-stable algorithm into one with dd-LSB (see Appendix E.3 for the proof):
Theorem 22.
If is -dd-iw-stable with constants , then Algorithm 6 with as the base algorithm satisfies -dd-LSB with constants where and .
Then we turn an algorithm with dd-LSB into one with a data-dependent best-of-both-worlds guarantee (see Appendix D.2 for the proof):
Theorem 23.
If an algorithm satisfies -dd-LSB, then the regret of Algorithm 1 with as the base algorithm is upper bounded by in the adversarial regime and by in the corrupted stochastic regime.
6.2 Achieving LSB without excluding
Our reduction in Section 4.3 requires the base algorithm to operate on the action set . This is sometimes not easy to implement for structured problems where actions share common components (e.g., MDPs or combinatorial bandits). To eliminate this requirement, so that we can simply use a base algorithm that operates on the original action space , we propose the following stronger notion of iw-stability, which we call strong iw-stability:
Definition 24 (-strongly-iw-stable).
An algorithm is -strongly-iw-stable if the following holds: given an adaptive sequence of weights and the assertion that the feedback in round is observed with probability if the algorithm chooses , it obtains the following pseudo-regret guarantee for any stopping time :
Compared with the iw-stability defined in Definition 8, strong iw-stability requires the bound to have additional flexibility: if choosing action results in a probability of observing the feedback, the regret bound needs to adapt to . For the class of FTRL/OMD algorithms, strong iw-stability holds if the stability term is bounded by a constant no matter which action is chosen. Examples include log-barrier OMD/FTRL for multi-armed bandits, SCRiBLe (Abernethy et al., 2008), and a truncated version of continuous exponential weights (Ito et al., 2020) for linear bandits. In fact, Upper-Confidence-Bound (UCB) algorithms also satisfy strong iw-stability, even though they are mainly designed for the purely stochastic setting; we will need this property when we design algorithms for adversarial MDPs, where the transition estimation is done through UCB approaches. This will be demonstrated in Section 6.3. With strong iw-stability, the second reduction only requires a base algorithm over :
Theorem 25.
If is a -strongly-iw-stable algorithm with constants , then Algorithm 7 with as the base algorithm satisfies -LSB with constants where and .
The proof is provided in Appendix E.4.
6.3 Application to MDPs
Finally, we combine the techniques developed in Section 6.1 and Section 6.2 to obtain a best-of-both-worlds guarantee for tabular MDPs with a first-order regret bound in the adversarial regime. We use Algorithm 13 as the base algorithm, which is adapted from the UOB-Log-Barrier Policy Search algorithm of Lee et al. (2020) so that it satisfies both the data-dependent property (Definition 21) and the strongly-iw-stable property (Definition 24). The Corral algorithm we use is Algorithm 8, which takes a base algorithm with the dd-strongly-iw-stable property and turns it into a data-dependent best-of-both-worlds algorithm. The details and notation are all provided in Appendix H. The guarantee of the algorithm is formally stated in the next theorem.
Theorem 26.
Combining Algorithm 1, Algorithm 8, and Algorithm 13 results in an MDP algorithm with regret in the adversarial regime, and in the corrupted stochastic regime, where is the number of states, is the number of actions, is the horizon, is the cumulative loss of the best policy, and .
7 Conclusion
We provided a general reduction from the best-of-both-worlds problem to a wide range of FTRL/OMD algorithms, which improves the state of the art in several problem settings. We showed the versatility of our approach by extending it to preserve data-dependent bounds.
Another potential application of our framework is partial monitoring, where one might improve the rates from to for both the and the regimes using our respective reductions.
A weakness of our approach is the uniqueness requirement on the best action. While this assumption is typical in the best-of-both-worlds literature, in our case it is not merely an artifact of the analysis, due to the doubling procedure in the first reduction. Additionally, our reduction can only obtain worst-case dependent bounds in the stochastic regime, which can be significantly weaker than bounds based on more refined notions of complexity.
References
- Abbasi-Yadkori et al. (2020) Yasin Abbasi-Yadkori, Aldo Pacchiano, and My Phan. Regret balancing for bandit and rl model selection. arXiv preprint arXiv:2006.05491, 2020.
- Abernethy et al. (2008) Jacob Abernethy, Elad Hazan, and Alexander Rakhlin. Competing in the dark: An efficient algorithm for bandit linear optimization. In 21st Annual Conference on Learning Theory, COLT 2008, 2008.
- Agarwal et al. (2017) Alekh Agarwal, Haipeng Luo, Behnam Neyshabur, and Robert E Schapire. Corralling a band of bandit algorithms. In Conference on Learning Theory, pages 12–38. PMLR, 2017.
- Alon et al. (2013) Noga Alon, Nicolo Cesa-Bianchi, Claudio Gentile, and Yishay Mansour. From bandits to experts: A tale of domination and independence. Advances in Neural Information Processing Systems, 26, 2013.
- Alon et al. (2015) Noga Alon, Nicolo Cesa-Bianchi, Ofer Dekel, and Tomer Koren. Online learning with feedback graphs: Beyond bandits. In Conference on Learning Theory, pages 23–35. PMLR, 2015.
- Amir et al. (2022) Idan Amir, Guy Azov, Tomer Koren, and Roi Livni. Better best of both worlds bounds for bandits with switching costs. In Advances in Neural Information Processing Systems, 2022.
- Arora et al. (2021) Raman Arora, Teodor Vanislavov Marinov, and Mehryar Mohri. Corralling stochastic bandit algorithms. In International Conference on Artificial Intelligence and Statistics, pages 2116–2124. PMLR, 2021.
- Auer and Chiang (2016) Peter Auer and Chao-Kai Chiang. An algorithm with nearly optimal pseudo-regret for both stochastic and adversarial bandits. In Conference on Learning Theory, pages 116–120. PMLR, 2016.
- Auer et al. (2002a) Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine learning, 47:235–256, 2002a.
- Auer et al. (2002b) Peter Auer, Nicolo Cesa-Bianchi, Yoav Freund, and Robert E Schapire. The nonstochastic multiarmed bandit problem. SIAM journal on computing, 32(1):48–77, 2002b.
- Bubeck and Slivkins (2012) Sébastien Bubeck and Aleksandrs Slivkins. The best of both worlds: Stochastic and adversarial bandits. In Conference on Learning Theory, pages 42–1. JMLR Workshop and Conference Proceedings, 2012.
- Bubeck et al. (2012) Sébastien Bubeck, Nicolo Cesa-Bianchi, and Sham M Kakade. Towards minimax policies for online linear optimization with bandit feedback. In Conference on Learning Theory, pages 41–1. JMLR Workshop and Conference Proceedings, 2012.
- Bubeck et al. (2019) Sébastien Bubeck, Yuanzhi Li, Haipeng Luo, and Chen-Yu Wei. Improved path-length regret bounds for bandits. In Conference On Learning Theory, pages 508–528. PMLR, 2019.
- Chen et al. (2022) Cheng Chen, Canzhe Zhao, and Shuai Li. Simultaneously learning stochastic and adversarial bandits under the position-based model. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 6202–6210, 2022.
- Cutkosky et al. (2021) Ashok Cutkosky, Christoph Dann, Abhimanyu Das, Claudio Gentile, Aldo Pacchiano, and Manish Purohit. Dynamic balancing for model selection in bandits and rl. In International Conference on Machine Learning, pages 2276–2285. PMLR, 2021.
- Erez and Koren (2021) Liad Erez and Tomer Koren. Towards best-of-all-worlds online learning with feedback graphs. Advances in Neural Information Processing Systems, 34:28511–28521, 2021.
- Foster et al. (2020) Dylan J Foster, Claudio Gentile, Mehryar Mohri, and Julian Zimmert. Adapting to misspecification in contextual bandits. Advances in Neural Information Processing Systems, 33:11478–11489, 2020.
- Gaillard et al. (2014) Pierre Gaillard, Gilles Stoltz, and Tim Van Erven. A second-order bound with excess losses. In Conference on Learning Theory, pages 176–196. PMLR, 2014.
- Gupta et al. (2019) Anupam Gupta, Tomer Koren, and Kunal Talwar. Better algorithms for stochastic bandits with adversarial corruptions. In Conference on Learning Theory, pages 1562–1578. PMLR, 2019.
- Honda et al. (2023) Junya Honda, Shinji Ito, and Taira Tsuchiya. Follow-the-perturbed-leader achieves best-of-both-worlds for bandit problems. In International Conference on Algorithmic Learning Theory. PMLR, 2023.
- Huang et al. (2022) Jiatai Huang, Yan Dai, and Longbo Huang. Adaptive best-of-both-worlds algorithm for heavy-tailed multi-armed bandits. In International Conference on Machine Learning, pages 9173–9200. PMLR, 2022.
- Ito (2021a) Shinji Ito. On optimal robustness to adversarial corruption in online decision problems. Advances in Neural Information Processing Systems, 34:7409–7420, 2021a.
- Ito (2021b) Shinji Ito. Parameter-free multi-armed bandit algorithms with hybrid data-dependent regret bounds. In Conference on Learning Theory, pages 2552–2583. PMLR, 2021b.
- Ito et al. (2020) Shinji Ito, Shuichi Hirahara, Tasuku Soma, and Yuichi Yoshida. Tight first-and second-order regret bounds for adversarial linear bandits. Advances in Neural Information Processing Systems, 33:2028–2038, 2020.
- Ito et al. (2022a) Shinji Ito, Taira Tsuchiya, and Junya Honda. Adversarially robust multi-armed bandit algorithm with variance-dependent regret bounds. In Conference on Learning Theory, pages 1421–1422. PMLR, 2022a.
- Ito et al. (2022b) Shinji Ito, Taira Tsuchiya, and Junya Honda. Nearly optimal best-of-both-worlds algorithms for online learning with feedback graphs. In Advances in Neural Information Processing Systems, 2022b.
- Jin and Luo (2020) Tiancheng Jin and Haipeng Luo. Simultaneously learning stochastic and adversarial episodic mdps with known transition. Advances in neural information processing systems, 33:16557–16566, 2020.
- Jin et al. (2021) Tiancheng Jin, Longbo Huang, and Haipeng Luo. The best of both worlds: stochastic and adversarial episodic mdps with unknown transition. Advances in Neural Information Processing Systems, 34:20491–20502, 2021.
- Kong et al. (2022) Fang Kong, Yichi Zhou, and Shuai Li. Simultaneously learning stochastic and adversarial bandits with general graph feedback. In International Conference on Machine Learning, pages 11473–11482. PMLR, 2022.
- Lai et al. (1985) Tze Leung Lai, Herbert Robbins, et al. Asymptotically efficient adaptive allocation rules. Advances in applied mathematics, 6(1):4–22, 1985.
- Lee et al. (2020) Chung-Wei Lee, Haipeng Luo, Chen-Yu Wei, and Mengxiao Zhang. Bias no more: high-probability data-dependent regret bounds for adversarial bandits and mdps. Advances in neural information processing systems, 33:15522–15533, 2020.
- Lee et al. (2021) Chung-Wei Lee, Haipeng Luo, Chen-Yu Wei, Mengxiao Zhang, and Xiaojin Zhang. Achieving near instance-optimality and minimax-optimality in stochastic and adversarial linear bandits simultaneously. In International Conference on Machine Learning, pages 6142–6151. PMLR, 2021.
- Luo (2022) Haipeng Luo. Homework 3 solution, introduction to online optimization/learning. http://haipeng-luo.net/courses/CSCI659/2022_fall/homework/HW3_solutions.pdf, November 2022.
- Luo et al. (2022) Haipeng Luo, Mengxiao Zhang, Peng Zhao, and Zhi-Hua Zhou. Corralling a larger band of bandits: A case study on switching regret for linear bandits. In Conference on Learning Theory, pages 3635–3684. PMLR, 2022.
- Lykouris et al. (2018) Thodoris Lykouris, Vahab Mirrokni, and Renato Paes Leme. Stochastic bandits robust to adversarial corruptions. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, pages 114–122, 2018.
- Masoudian and Seldin (2021) Saeed Masoudian and Yevgeny Seldin. Improved analysis of the tsallis-inf algorithm in stochastically constrained adversarial bandits and stochastic bandits with adversarial corruptions. In Conference on Learning Theory, pages 3330–3350. PMLR, 2021.
- Masoudian et al. (2022) Saeed Masoudian, Julian Zimmert, and Yevgeny Seldin. A best-of-both-worlds algorithm for bandits with delayed feedback. In Advances in Neural Information Processing Systems, 2022.
- Mourtada and Gaïffas (2019) Jaouad Mourtada and Stéphane Gaïffas. On the optimality of the hedge algorithm in the stochastic regime. Journal of Machine Learning Research, 20:1–28, 2019.
- Pacchiano et al. (2022) Aldo Pacchiano, Christoph Dann, and Claudio Gentile. Best of both worlds model selection. In Advances in Neural Information Processing Systems, 2022.
- Rakhlin and Sridharan (2013) Sasha Rakhlin and Karthik Sridharan. Optimization, learning, and games with predictable sequences. Advances in Neural Information Processing Systems, 26, 2013.
- Rouyer et al. (2021) Chloé Rouyer, Yevgeny Seldin, and Nicolo Cesa-Bianchi. An algorithm for stochastic and adversarial bandits with switching costs. In International Conference on Machine Learning, pages 9127–9135. PMLR, 2021.
- Rouyer et al. (2022) Chloé Rouyer, Dirk van der Hoeven, Nicolò Cesa-Bianchi, and Yevgeny Seldin. A near-optimal best-of-both-worlds algorithm for online learning with feedback graphs. In Advances in Neural Information Processing Systems, 2022.
- Saha and Gaillard (2022) Aadirupa Saha and Pierre Gaillard. Versatile dueling bandits: Best-of-both world analyses for learning from relative preferences. In International Conference on Machine Learning, pages 19011–19026. PMLR, 2022.
- Seldin and Lugosi (2017) Yevgeny Seldin and Gábor Lugosi. An improved parametrization and analysis of the exp3++ algorithm for stochastic and adversarial bandits. In Conference on Learning Theory, pages 1743–1759. PMLR, 2017.
- Seldin and Slivkins (2014) Yevgeny Seldin and Aleksandrs Slivkins. One practical algorithm for both stochastic and adversarial bandits. In International Conference on Machine Learning, pages 1287–1295. PMLR, 2014.
- Tsuchiya et al. (2022) Taira Tsuchiya, Shinji Ito, and Junya Honda. Best-of-both-worlds algorithms for partial monitoring. arXiv preprint arXiv:2207.14550, 2022.
- Wei and Luo (2018) Chen-Yu Wei and Haipeng Luo. More adaptive algorithms for adversarial bandits. In Conference On Learning Theory, pages 1263–1291. PMLR, 2018.
- Wei et al. (2022) Chen-Yu Wei, Christoph Dann, and Julian Zimmert. A model selection approach for corruption robust reinforcement learning. In International Conference on Algorithmic Learning Theory, pages 1043–1096. PMLR, 2022.
- Zimmert and Lattimore (2019) Julian Zimmert and Tor Lattimore. Connections between mirror descent, thompson sampling and the information ratio. Advances in Neural Information Processing Systems, 32, 2019.
- Zimmert and Lattimore (2022) Julian Zimmert and Tor Lattimore. Return of the bias: Almost minimax optimal high probability bounds for adversarial linear bandits. In Conference on Learning Theory, pages 3285–3312. PMLR, 2022.
- Zimmert and Seldin (2019) Julian Zimmert and Yevgeny Seldin. An optimal algorithm for stochastic and adversarial bandits. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 467–475. PMLR, 2019.
- Zimmert et al. (2019) Julian Zimmert, Haipeng Luo, and Chen-Yu Wei. Beating stochastic and adversarial semi-bandits optimally and simultaneously. In International Conference on Machine Learning, pages 7683–7692. PMLR, 2019.
Appendix A FTRL Analysis
Lemma 27.
The optimistic FTRL algorithm over a convex set :
guarantees the following for any :
Proof.
Lemma 28.
Consider the optimistic FTRL algorithm with bonus :
with where for some , , and is non-increasing. We have
for any and any , , such that
(3) | ||||
(4) | ||||
(5) | ||||
(6) | ||||
(7) | ||||
(8) |
for all .
Proof.
By Problem 1 in Luo (2022), we can bound stability-1 by under the condition (5). Similarly, stability-2 can be upper bounded by under the condition (7). Using Lemma 30, we can bound stability-3 by under the condition (6). Similarly, we can bound stability-4 by under (8). Collecting all terms and using the definition of finishes the proof.
∎
Lemma 29.
Consider the optimistic FTRL algorithm with bonus :
with , and is non-increasing. We have
for any and any if the following hold: and for all .
Proof.
Using Lemma 30, we can bound stability-1 by , and bound stability-2 by under the specified conditions. Collecting all terms and using the definition of finishes the proof. ∎
Lemma 30 (Stability under log barrier).
Let , and let be such that . Then
Proof.
Define . Let be the solution in the last expression. Next, we verify that under the specified conditions, we have . It suffices to show that there exists such that , since if such exists, then it must be the maximizer of and thus .
By the condition, we have for all , and so has a solution in , which is .
Therefore, , and we have
It remains to bound , which by definition can be written as
where . By the relation between and we just derived, it holds that . By the fact that for all , we have
which gives the desired bound.
∎
Lemma 31 (Stability under negentropy).
If is the negentropy and , then for any the stability is bounded by
If , then the stability is bounded by
instead.
Proof.
Let . Then we maximize , which is upper bounded by maximizing the expression in each coordinate for . We have
and hence the maximum is obtained for . Plugging this in leads to
for non-negative and
for respectively, where we used the bound which holds for any and , which holds for respectively. Summing over all coordinates finishes the proof. ∎
Lemma 32 (Stability of Tsallis-INF).
For the potential , any , any non-negative loss and any positive learning rate , we have
Proof.
We upper bound this term by maximizing over instead of . Since is positive, the optimal satisfies in all components. We have
By AM-GM inequality, we further have
∎
Appendix B Analysis for Variance-Reduced SCRiBLe (Algorithm 3 / Theorem 3)
Define: Let be a -self-concordant barrier of .
for do
(9) |
Sample uniformly from (the unit sphere in -dimension).
Define
Lemma 33.
Proof.
(note that depend on ) | ||||
(because ) |
For any , we have and
Therefore, we continue to bound by
( and ) |
∎
Lemma 34.
If , then .
Proof.
We first show that . By the definition of , we have
Define
Define . Let be the maximizer of . It suffices to show because this leads to
To show , it suffices to show that for all such that , . To see why this is the case, notice that , and because is the maximizer of . Therefore, if , then there exists in the line segment between and with such that , contradicting that is strictly concave.
Below, consider any with . By Taylor expansion, there exists in the line segment between and such that
Because is a self-concordant barrier and that , we have . Continuing from the previous inequality,
This concludes the proof. ∎
Proof of Theorem 3.
By the standard analysis of optimistic-FTRL and Lemma 34, we have for any ,
(by the tuning of ) |
In the adversarial regime, using Lemma 33, we continue to bound (LABEL:eq:_scrible_regret_tmp) by
When losses are non-negative and , we can further upper bound it by
Then solving the inequality for , we can further get the first-order regret bound of
In the corrupted stochastic setting, using Lemma 33, we continue to bound (LABEL:eq:_scrible_regret_tmp) by
Then by the self-bounding technique stated in Proposition 2, we can further bound the regret in the corrupted stochastic setting by
∎
Appendix C Analysis for LSB Log-Determinant FTRL (Algorithm 4 / Lemma 5)
Input: .
Define: Let be John’s exploration over . , .
for do
Sample an action .
Construct loss estimator:
We begin by using the following simplifying notation. For a distribution , we define as the restricted distribution over such that over , i.e. for any . We further define
Lemma 35.
Assume is such that is of rank (i.e. full rank over ). Then we have the following properties:
where denotes the pseudo-inverse.
Proof.
The first identity is a simple algebraic identity
which holds for any . For the second identity note that by the definition of and , we have . Furthermore due to the rank assumption. Multiplying the two matrices yields
∎
which implies for any
(11) | |||
(12) | |||
(13) |
Lemma 36.
Let , then for any :
Proof.
Simple algebra shows
∎
Lemma 37.
Let be arbitrary, and for , then it holds for any :
Proof.
We have
hence
For the second inequality, we have
where the last inequality uses that John’s exploration satisfies for all . ∎
Lemma 38.
The Bregman divergence between two distributions over with respect to the function is bounded by
where is the Bregman divergence of .
Proof.
We begin by simplifying . Note that
is a matrix with eigenvalues of size , since it is the identity with two rank-1 updates. The product of the remaining eigenvalues is given by considering the determinant of the sub-matrix with coordinates and . We have that
Hence we have
where is the determinant over the first eigenvalues of the sub-matrix of the first coordinates. The derivative term is given by
The first term is
(by Lemma 35) | |||
(by Lemma 35) | |||
(by (13)) |
The norm term is
Combining both terms
Combining everything,
where the last inequality follows from the positiveness of Bregman divergences. ∎
Lemma 39.
For any , any and , it holds that
Proof.
The statement is equivalent to
since can take positive or negative sign. The function is concave, so setting the derivative to is the optimal solution if that value lies in .
Plugging this in leads to
where the last line uses for any . ∎
Lemma 40.
The stability term
satisfies
Proof.
We have
Hence for :
Appendix D Analysis for the First Reduction
D.1 BOBW to LSB (Algorithm 1 / Theorem 6)
Proof of Theorem 6.
We use to denote the length of epoch , and let be the last epoch (define ). Also, we use to denote expectation conditioned on all history up to time . We first consider the adversarial regime. By the definition of local-self-bounding algorithms, we have for any ,
which implies
using the property for and . Summing the bounds over , and using the fact that for all , we get
(14) |
The same analysis also gives
(15) |
Next, consider the corrupted stochastic regime. We first argue that it suffices to consider the regret comparator . This is because if , then it holds that . Then the right-hand side of (15) is further upper bounded by
which fulfills the requirement for the stochastic regime. Below, we focus on bounding the pseudo-regret with respect to .
Let . Notice that is either the last epoch (i.e., ) or the second last (i.e., ), because for any two consecutive epochs, at least one of them must have . Below we show that .
If , by the fact that the termination condition was not triggered one round earlier, the number of plays in epoch is at most ; in other words, the number of times is at least . Further notice that because , we have .
Now consider the case and . Recall that is the action with . This implies that .
Finally, consider the case and , then we have and the statement holds trivially.
The regret up to and including epoch can be lower and upper bounded using the self-bounding technique:
(for ) | |||
(the first term is by a similar calculation as (14), but replacing by and by ) | |||
where in the last inequality we use Lemma 41.
If is not the last epoch, then it holds that for the final epoch . In this case, the regret in the final epoch is
(Lemma 41) |
∎
Lemma 41.
For and and , we have
Proof.
If , we have
where the last inequality is by the weighted AM-GM inequality. Therefore,
Choosing , we bound the last expression by
If , we have
where the last inequality is because if then . Choosing , we bound the last expression by . Combining cases finishes the proof. ∎
D.2 dd-BOBW to dd-LSB (Algorithm 1 / Theorem 23)
Proof of Theorem 23.
In the adversarial regime, we have that the regret in each phase is bounded by
We have at most epochs, since the epoch length doubles every time. Via Cauchy-Schwarz, we get
Taking the expectation on both sides and using the tower rule of expectations finishes the bound for the adversarial regime. For the stochastic regime, note that and hence
dd-LSB implies regular LSB (up to a factor of ) and hence the stochastic bound of regular LSB applies. ∎
Appendix E Analysis for the Second Reduction
E.1 -LSB to -iw-stable (Algorithm 2 / Theorem 11)
Proof of Theorem 11.
The per-step bonus is the sum of two terms:
Since , using the inequalities above, we have
(16) | ||||
(17) |
By Lemma 28 and that , we have for any ,
(18) |
We bound individual terms below:
(19) | ||||
(Cauchy-Schwarz) | ||||
Continuing from (19), we also have because .
Using all bounds above in (18), we can bound by
For comparator , we choose and bound by the pos-term above. For comparator , we first choose and upper bound by . On the other hand, by the -iw-stable assumption, . Combining them, we get that for all , we also have . Comparing the coefficients in pos-term with those in Definition 4, we see that Algorithm 2 satisfies -LSB with constants where and . ∎
E.2 -LSB to -iw-stable (Algorithm 5 / Theorem 18)
Input: candidate action , -iw-stable algorithm over with constants .
Define: .
for do
Let
if then set ;
if then draw a revealing action of and observe ;
Proof of Theorem 18.
The per-step bonus is the sum of two terms:
Since , using the inequalities above, we have
(20) | ||||
(21) |
By Lemma 28 and that , we have for any ,
(22) |
We bound individual terms below:
(23) | ||||
(Cauchy-Schwarz) | ||||
Continuing from (23), we also have because .
Using all bounds above in (22), we can bound by
For comparator , we choose and bound by the pos-term above. For comparator , we first choose and upper bound by . On the other hand, by the -iw-stable assumption, . Combining them, we get that for all , we also have .
Finally, notice that . This implies that Algorithm 5 is -LSB with coefficient with , and .
∎
Input: candidate action , -iw-stable algorithm over with constant .
Define: . .
Define: For first-order bound, and ; for second-order bound, where is the loss predictor.
for do
Receive prediction for all , and set and .
Let
if then draw and observe ;
E.3 -dd-LSB to -dd-iw-stable (Algorithm 6 / Theorem 22)
Proof of Theorem 22.
Define . Notice that we have
(24) |
By Lemma 29 and that , we have for any ,
(choosing ) | ||||
(Cauchy-Schwarz) | ||||
Combining all inequalities above, we can bound by
(25) | |||
Similar to the arguments in the proofs of Theorem 11 and Theorem 18, we end up bounding by the pos-term above for all . Finally, we process pos-term. Observe that
and that
Thus, for any ,
which implies that the algorithm satisfies -dd-LSB with constants . ∎
E.4 -LSB to -strongly-iw-stable (Algorithm 7 / Theorem 25)
Input: candidate action , -iw-stable algorithm over with constant .
Define: .
.
for do
Let
if then draw and observe ;
Proof of Theorem 25.
The proof of this theorem mostly follows that of Theorem 11. The difference is that the regret of the base algorithm is now bounded by
(26) |
because when , the base algorithm is able to receive feedback no matter which side the Corral algorithm chooses. The goal of adding the bonus is now only to cancel the first and third terms on the right-hand side of (26).
Below, we bound .
(when , ) | ||||
(28) | ||||
(Cauchy-Schwarz) | ||||
(29) |
Continuing from (28), we also have .
Using all the inequalities above in (27), we can bound by
For comparator , we set . Then we have . Observe that , so this gives the desired property of -LSB for action . For , we set and get where the additional term comes from the second term on the right-hand side of (26), which is not cancelled by the negative term. Note, however, that this still satisfies the requirement of -LSB for because for these actions, we only require the worst-case bound to hold. Overall, we have justified that Algorithm 7 satisfies -LSB with constants where and . ∎
E.5 -dd-LSB to -dd-strongly-iw-stable (Algorithm 8 / Lemma 42)
Input: candidate action , -iw-stable algorithm over with constant .
Define: . .
Define: For first-order bound, and ; for second-order bound, where is the loss predictor.
for do
Let generate an action (which is the action to be chosen if is selected in this round).
Let
if then draw and observe ;
(30) |
Lemma 42.
Let be an algorithm with the following stability guarantee: given an adaptive sequence of weights such that the feedback in round is observed with probability if is chosen, and an adaptive sequence available at the beginning of round , it obtains the following pseudo regret guarantee for any stopping time :
where if feedback is observed in round and otherwise. in the second-order bound case, and in the first-order bound case. Then Algorithm 8 with as input satisfies -dd-LSB.
Proof.
The proof of this lemma is a combination of the elements in Theorem 22 (data-dependent iw-stable) and Theorem 25 (strongly-iw-stable), so we omit the details here and only provide a sketch.
In Algorithm 8, since is -dd-strongly-iw-stable, its regret is upper bounded by the order of
(31) |
because if chooses , then the probability of observing the feedback is , and is otherwise. This motivates the choice of the bonus in (30). Then we can follow the proof of Theorem 22 step-by-step, and show that the regret compared to is upper bounded by the order of
(taking expectation over and following the calculation in the proof of Theorem 22) | |||
(taking expectation over ) | |||
(using the property that for , and thus ) | |||
() | |||
which satisfies the requirement of -dd-LSB for the regret against . For the regret against , similar to the proof of Theorem 25, an extra positive regret comes from the second term in (31). Therefore, the regret against , can be upper bounded by
which also satisfies the requirement of -dd-LSB for . ∎
Appendix F Analysis for IW-Stable Algorithms
Input: .
for do
Let
With probability , receive
(in this case, set ; otherwise, set ).
Construct loss estimator:
F.1 EXP2 (Algorithm 9 / Lemma 9)
Proof of Lemma 9.
Consider the EXP2 algorithm (Algorithm 9) which corresponds to FTRL with negentropy potential. By standard analysis of FTRL (Lemma 27), for any and any ,
To apply Lemma 31, we need to show that . We have by Cauchy-Schwarz . For each term, we have
due to the properties of John’s exploration. Hence . We can apply Lemma 31 for the stability term, resulting in
By we have furthermore
Combining everything and taking the expectation on both sides leads to
∎
F.2 EXP4 (Algorithm 10 / Lemma 10)
In this section, we use the more standard notation to denote the policy class.
Input: (policy class), (number of arms)
for do
Receive update probability .
Let
With probability , receive (in this case, set ; otherwise, set ).
Construct loss estimator:
Proof of Lemma 10.
Consider the EXP4 algorithm with adaptive stepsize (Algorithm 10), which corresponds to FTRL with negentropy regularization. By standard analysis of FTRL (Lemma 27) and Lemma 31, for any and any
Taking expectation on both sides finishes the proof. ∎
F.3 -Tsallis-INF (Algorithm 11 / Lemma 14)
Input: .
Define: , where .
for do
Let
With probability receive for all (in this case, set ; otherwise, set ).
Define
We require the following graph theoretic Lemmas.
Lemma 43.
Let be a directed graph with independence number and vertex weights , then
Proof.
Without loss of generality, assume that , since the statement is scale invariant. The statement is equivalent to
Let
Via K.K.T. conditions, there exists such that for an optimal solution it holds
Next we bound . Take the sub-graph over and take a maximal independent set over (, since ). We have
Hence . Finally,
∎
Lemma 44.
(Lemma 10 of Alon et al. (2013)) Let be a directed graph. Then, for any distribution we have:
where is the maximal acyclic sub-graph.
With these lemmas, we are ready to prove iw-stability.
Proof of Lemma 14.
Consider the -Tsallis-INF algorithm with adaptive stepsize (Algorithm 11). Applying Lemma 27, we have for any stopping time and :
where , which is well defined because by definition contains at most one arm.
We can apply Lemma 32, since the losses are strictly positive.
We split the vertices into three sets. Let be the nodes with self-loops. Let and . We have
since for any , we have . The first term is in expectation , the third term is in expectation , while the second is
The sub-graph consisting of contains self-loops, hence we can apply Lemma 44 to bound this term by . If , we instead take a permutation of such that is in position and for any
The existence of such a permutation is guaranteed by Lemma 43: Apply Lemma 43 on the graph to select . Recursively remove the -th vertex from the graph and apply Lemma 43 on the resulting sub-graph to pick . Applying this permutation yields
Combining everything
By the definition of the learning rate, we obtain
∎
F.4 EXP3 for weakly observable graphs (Algorithm 12 / Lemma 17)
Input: , dominating set (potentially including ).
Define: . is the uniform distribution over .
for do
Let
With probability receive for all (in this case, set ; otherwise, set ).
Define
Proof of Lemma 17.
Consider the EXP3 algorithm (Algorithm 12). By standard analysis of FTRL (Lemma 27) and Lemma 31 (using the non-negativity of all losses), for any and any ,
where in the last inequality we use . We have due to
Taking expectation on both sides finishes the proof. ∎
Appendix G Surrogate Loss for Strongly Observable Graph Problems
When there exist arms such that , we cannot directly apply Algorithm 2. To make the algorithm applicable to all strongly observable graphs, define the surrogate losses in the following way:
where is the distribution of the base algorithm over at round . By construction and the definition of strongly observable graphs, is observed when playing arm . (When the player does not have access to , one can also sample one action from the current distribution for an unbiased estimate of .) The losses are in range instead of and can further be shifted to be strictly non-negative. Finally observe that
as well as
That means that running Algorithm 2 with replaced by allows us to apply Theorem 11 to strongly observable graphs where not every arm receives its own loss as feedback.
Appendix H Tabular MDP (Theorem 26)
In this section, we use the UOB-Log-Barrier Policy Search algorithm of Lee et al. (2020) (their Algorithm 4) as our base algorithm. To this end, we need to show that it satisfies the dd-strongly-iw-stable condition specified in Lemma 42. We consider a variant of their algorithm that incorporates the feedback probability . The algorithm is Algorithm 13.
Input: state space , action space , candidate policy .
Define: , .
.
Initialization:
for do
Construct loss estimators
Update counters: for all ,
.
.
We refer readers to Section C.3 of Lee et al. (2020) for the setting description and notation, since we follow them closely in this section.
H.1 Base algorithm
Lemma 45.
Algorithm 13 ensures for any ,
Proof.
The regret is decomposed as follows (similar to Section C.3 in Lee et al. (2020)):
with the same definition of as in (24) of Lee et al. (2020). trivially (see (25) of Lee et al. (2020)). For the remaining four terms, we use Lemma 49, Lemma 50, Lemma 51, and Lemma 52, respectively. Combining all terms finishes the proof. ∎
Lemma 46 (Lemma C.2 of Lee et al. (2020)).
With probability , for all ,
Lemma 47 (cf. Lemma C.3 of Lee et al. (2020)).
With probability at least , for all ,
Proof.
The proof of this lemma is identical to the original one; one only needs to notice that in our case, the probability of obtaining a sample of is .
Lemma 48 (cf. Lemma C.6 of Lee et al. (2020)).
With probability at least , for any and any collection of transition functions such that for all , we have
where and .
Proof.
Following the proof of Lemma C.6 in Lee et al. (2020), we have
where
(32) |
The first term in (32) can be upper bounded by the order of
(by Lemma 47) |
The second term in (32) can be upper bounded by the order of
(by Lemma 47) |
Combining the two parts, we get
Next, we bound . By the same calculation as in Lemma C.6 of Lee et al. (2020),
For the first term above, we consider the summation over . We continue to bound it by the order of
(by AM-GM, holds for any ) | |||
(by Lemma 47) | |||
(choose the optimal ) | |||
Therefore,
Combining the bounds finishes the proof. ∎
Lemma 49 (cf. Lemma C.7 of Lee et al. (2020)).
Proof.
By Lemma 48, with probability at least ,
Error | |||
Furthermore, with probability . Therefore,
by our choice of . ∎
Lemma 50 (cf. Lemma C.8 of Lee et al. (2020)).
Proof.
Let be the event that for all .
Thus,
Summing this over ,
By Lemma 48, is upper bounded by with probability . Taking expectations on both sides and using that , we get
∎
Lemma 51 (cf. Lemma C.9 of Lee et al. (2020)).
Proof.
Let be the event that for all . By the construction of the loss estimator, we have
and thus
(by Lemma C.5 of Lee et al. (2020), ) |
and
where the last inequality is by our choice of . ∎
Lemma 52 (cf. Lemma C.10 of Lee et al. (2020)).
Proof.
By the same calculation as in the proof of Lemma C.10 in Lee et al. (2020),
Let be the time indices when , and let be the last time this happens. Summing the inequalities above and using telescoping, we get
(computed in the proof of Lemma C.10 in Lee et al. (2020)) | ||||
(by the timing we halve the learning rate) | ||||
∎
H.2 Corraling
We use Algorithm 8 as the Corral algorithm for the MDP setting, with Algorithm 13 being the base algorithm. The guarantee of Algorithm 8 is provided in Lemma 42.
Proof of Theorem 26.
To apply Lemma 42, we have to rescale the losses because for MDPs, the loss of a policy in one round is . Therefore, we scale down all losses by a factor of . Then by Lemma 45, the base algorithm satisfies the condition in Lemma 42 with and , where is defined as , the expected loss of policy after scaling. Therefore, by Lemma 42, we can transform it into an -dd-LSB algorithm with and . Finally, using Theorem 23 and scaling back the loss range, we get
regret in the adversarial regime, and
regret in the corrupted stochastic regime. ∎