
A Blackbox Approach to Best of Both Worlds in Bandits and Beyond

Christoph Dann Google Research. Email: [email protected].    Chen-Yu Wei MIT Institute for Data, Systems, and Society. Email: [email protected].    Julian Zimmert Google Research. Email: [email protected].
Abstract

Best-of-both-worlds algorithms for online learning which achieve near-optimal regret in both the adversarial and the stochastic regimes have received growing attention recently. Existing techniques often require careful adaptation to every new problem setup, including specialised potentials and careful tuning of algorithm parameters. Yet, in domains such as linear bandits, it is still unknown if there exists an algorithm that can simultaneously obtain O(\log(T)) regret in the stochastic regime and \tilde{O}(\sqrt{T}) regret in the adversarial regime. In this work, we resolve this question positively and present a general reduction from best of both worlds to a wide family of follow-the-regularized-leader (FTRL) and online-mirror-descent (OMD) algorithms. We showcase the capability of this reduction by transforming existing algorithms that are only known to achieve worst-case guarantees into new algorithms with best-of-both-worlds guarantees in contextual bandits, graph bandits and tabular Markov decision processes.

1 Introduction

Multi-armed bandits and their various extensions have a long history (Lai et al., 1985, Auer et al., 2002a; b). Traditionally, the stochastic regime in which all losses/rewards are i.i.d. and the adversarial regime in which an adversary chooses the losses in an arbitrary fashion have been studied in isolation. However, it is often unclear in practice whether an environment is best modelled by the stochastic regime, a slightly contaminated regime, or a fully adversarial regime. This is why the question of automatically adapting to the hardness of the environment, also called achieving best-of-both-worlds guarantees, has received growing attention (Bubeck and Slivkins, 2012, Seldin and Slivkins, 2014, Auer and Chiang, 2016, Seldin and Lugosi, 2017, Wei and Luo, 2018, Zimmert and Seldin, 2019, Ito, 2021b, Ito et al., 2022a, Masoudian and Seldin, 2021, Gaillard et al., 2014, Mourtada and Gaïffas, 2019, Ito, 2021a, Lee et al., 2021, Rouyer et al., 2021, Amir et al., 2022, Huang et al., 2022, Tsuchiya et al., 2022, Erez and Koren, 2021, Rouyer et al., 2022, Ito et al., 2022b, Kong et al., 2022, Jin and Luo, 2020, Jin et al., 2021, Chen et al., 2022, Saha and Gaillard, 2022, Masoudian et al., 2022, Honda et al., 2023). One of the most successful approaches has been to derive carefully tuned FTRL algorithms, which are canonically suited for the adversarial regime, and then show that these algorithms are also close to optimal in the stochastic regime. This is achieved by proving a crucial self-bounding property, which then translates into stochastic guarantees. Algorithms of this type have been proposed for multi-armed bandits (Wei and Luo, 2018, Zimmert and Seldin, 2019, Ito, 2021b, Ito et al., 2022a), combinatorial semi-bandits (Zimmert et al., 2019), bandits with graph feedback (Ito et al., 2022b), tabular MDPs (Jin and Luo, 2020, Jin et al., 2021) and others. Algorithms with self-bounding properties also automatically adapt to intermediate regimes of stochastic losses with adversarial corruptions (Lykouris et al., 2018, Gupta et al., 2019, Zimmert and Seldin, 2019, Ito, 2021a), which highlights the strong robustness of this algorithm design. However, every problem variation required careful design of potential functions and learning rate schedules. For linear bandits, it has remained unknown whether self-bounding is possible, and therefore the state-of-the-art best-of-both-worlds algorithm (Lee et al., 2021) neither obtains the optimal \log T stochastic rate nor canonically adapts to corruptions.

In this work, we make the following contributions: 1) We propose a general reduction from best of both worlds to typical FTRL/OMD algorithms, sidestepping the need for customized potentials and learning rates. 2) We derive the first best-of-both-worlds algorithm for linear bandits that obtains \log T regret in the stochastic regime, optimal adversarial worst-case regret, and adapts canonically to corruptions. 3) We derive the first best-of-both-worlds algorithm for linear bandits and tabular MDPs with first-order guarantees in the adversarial regime. 4) We obtain the first best-of-both-worlds algorithms for bandits with graph feedback and bandits with expert advice with optimal \log T stochastic regret.

Setting | Algorithm | Adversarial | Stochastic | o(C) | Rd. 1 | Rd. 2
Linear bandit | Lee et al. (2021) | \sqrt{dT\log(|\mathcal{X}|T)\log T} | \frac{d\log(|\mathcal{X}|T)\log(T)}{\Delta} | | |
Linear bandit | Theorem 3 | d\sqrt{\nu L_{\star}\log T} | \frac{d^{2}\nu\log T}{\Delta} | | |
Linear bandit | Corollary 7 | d\sqrt{T\log T} | \frac{d^{2}\log T}{\Delta} | | |
Linear bandit | Corollary 12 | \sqrt{dT\log|\mathcal{X}|} | \frac{d\log|\mathcal{X}|\log T}{\Delta} | | |
Contextual bandit | Corollary 13 | \sqrt{KT\log|\mathcal{X}|} | \frac{K\log|\mathcal{X}|\log T}{\Delta} | | |
Graph bandit (strongly observable) | Ito et al. (2022b) | \sqrt{\alpha T\log T}\log(KT) | \frac{\alpha\log^{2}(KT)\log T}{\Delta} | | |
Graph bandit (strongly observable) | Rouyer et al. (2022) | \sqrt{\widetilde{\alpha}T\log K} | \frac{\widetilde{\alpha}\log(KT)\log T}{\Delta} | | |
Graph bandit (strongly observable) | Corollary 15 | \sqrt{\min\{\widetilde{\alpha},\alpha\log K\}T\log K} | \frac{\min\{\widetilde{\alpha},\alpha\log K\}\log K\log T}{\Delta} | | |
Graph bandit (weakly observable) | Ito et al. (2022b) | (\delta\log(KT)\log T)^{\frac{1}{3}}T^{\frac{2}{3}} | \frac{\delta\log(KT)\log T}{\Delta^{2}} | | |
Graph bandit (weakly observable) | Corollary 19 | (\delta\log K)^{\frac{1}{3}}T^{\frac{2}{3}} | \frac{\delta\log K\log T}{\Delta^{2}} | | |
Tabular MDP | Jin et al. (2021) | \sqrt{H^{2}S^{2}AT\log^{2}T} | \frac{H^{6}S^{2}A\log^{2}T}{\Delta^{\prime}} | | |
Tabular MDP | Theorem 26 | \sqrt{HS^{2}AL_{\star}\,\log^{4}T} | \frac{H^{2}S^{2}A\,\log^{3}T}{\Delta} | | |

Table 1: Overview of regret bounds. The o(C) column specifies whether the algorithm achieves the optimal dependence on the amount of corruption (\sqrt{C} or C^{\frac{2}{3}}, depending on the setting). “Rd. 1” and “Rd. 2” indicate whether the result leverages the first and the second reductions described in Section 4, respectively. L_{\star} is the cumulative loss of the best action or policy. \nu in Theorem 3 is the self-concordance parameter; it holds that \nu\leq d. \Delta^{\prime} in Jin et al. (2021) is the gap in the Q^{\star} function, which is different from the policy gap \Delta we use; it holds that \Delta\leq\Delta^{\prime}. \alpha,\widetilde{\alpha},\delta are complexity measures of feedback graphs defined in Section 5.

2 Related Work

Our reduction procedure is related to a class of model selection algorithms that use a meta bandit algorithm to learn over a set of base bandit algorithms, with the goal of achieving performance comparable to running the best base algorithm alone. For the adversarial regime, Agarwal et al. (2017) introduced the Corral algorithm to learn over adversarial bandit algorithms that satisfy a stability condition. This framework was further improved by Foster et al. (2020), Luo et al. (2022). For the stochastic regime, Arora et al. (2021), Abbasi-Yadkori et al. (2020), Cutkosky et al. (2021) introduced another set of techniques to achieve similar guarantees, but without explicitly relying on the stability condition. While most of these results focus on obtaining \sqrt{T} regret, Arora et al. (2021), Cutkosky et al. (2021), Wei et al. (2022), Pacchiano et al. (2022) made some progress in obtaining \text{polylog}(T) regret in the stochastic regime. Among them, Wei et al. (2022), Pacchiano et al. (2022) are most related to ours since they also pursue a best-of-both-worlds guarantee across adversarial and stochastic regimes. However, their regret bounds, when applied to our problems, all suffer from highly sub-optimal dependence on \log(T) and the amount of corruption.

3 Preliminaries

We consider sequential decision making problems where the learner interacts with the environment over T rounds. In each round t, the environment generates a loss vector (\ell_{t,u})_{u\in\mathcal{X}}, and the learner generates a distribution p_{t} over actions and chooses an action A_{t}\sim p_{t}. The learner then suffers loss \ell_{t,A_{t}} and receives some information about (\ell_{t,u})_{u\in\mathcal{X}}, depending on the concrete setting. In all settings but MDPs, we assume \ell_{t,u}\in[-1,1].

In the adversarial regime, (\ell_{t,u})_{u\in\mathcal{X}} is generated arbitrarily subject to the structure of the concrete setting (e.g., in linear bandits, \mathbb{E}[\ell_{t,u}]=\langle u,\ell_{t}\rangle for arbitrary \ell_{t}\in\mathbb{R}^{d}). In the stochastic regime, we further assume that there exists an action x^{\star}\in\mathcal{X} and a gap \Delta>0 such that \mathbb{E}[\ell_{t,x}-\ell_{t,x^{\star}}]\geq\Delta for all x\neq x^{\star}. We also consider the corrupted stochastic regime, where the assumption is relaxed to \mathbb{E}[\ell_{t,x}-\ell_{t,x^{\star}}]\geq\Delta-C_{t} for some C_{t}\geq 0. We define C=\sum_{t=1}^{T}C_{t}.

4 Main Techniques

4.1 The standard global self-bounding condition and a new linear bandit algorithm

To obtain a best-of-both-worlds regret bound, previous works show the following property for their algorithms:

Definition 1 (\alpha-global-self-bounding condition, or \alpha-GSB).

\forall u\in\mathcal{X}:\,\mathbb{E}\left[\sum_{t=1}^{T}\left(\ell_{t,A_{t}}-\ell_{t,u}\right)\right]\leq\min\left\{c_{0}^{1-\alpha}T^{\alpha},\ \ (c_{1}\log T)^{1-\alpha}\mathbb{E}\left[\sum_{t=1}^{T}(1-p_{t,u})\right]^{\alpha}\right\}+c_{2}\log T

where c_{0},c_{1},c_{2} are problem-dependent constants and p_{t,u} is the probability of the learner choosing u in round t.
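As a point of reference (an illustration based on known multi-armed bandit results, not a claim used later): for K-armed bandits, the Tsallis-INF algorithm of Zimmert and Seldin (2019) achieves \sqrt{KT} adversarial regret together with a self-bounding guarantee that can be put in this form with c_{0}, c_{1} = O(K), in which case the condition with \alpha=\frac{1}{2} reads

\forall u\in[K]:\,\mathbb{E}\left[\sum_{t=1}^{T}\left(\ell_{t,A_{t}}-\ell_{t,u}\right)\right]\leq\min\left\{\sqrt{c_{0}T},\ \sqrt{c_{1}\log T\,\mathbb{E}\left[\sum_{t=1}^{T}(1-p_{t,u})\right]}\right\}+c_{2}\log T.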

With this condition, one can use the standard self-bounding technique to obtain a best-of-both-worlds regret bound:

Proposition 2.

If an algorithm satisfies \alpha-GSB, then its pseudo-regret is bounded by O\left(c_{0}^{1-\alpha}T^{\alpha}+c_{2}\log T\right) in the adversarial regime, and by O\big(c_{1}\log(T)\Delta^{-\frac{\alpha}{1-\alpha}}+(c_{1}\log T)^{1-\alpha}\left(C\Delta^{-1}\right)^{\alpha}+c_{2}\log(T)\big) in the corrupted stochastic regime.
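To see where the stochastic bound comes from, here is the standard self-bounding calculation for \alpha=\frac{1}{2} and C=0 (a proof sketch; the corrupted case follows the same pattern). In the stochastic regime with unique optimal action x^{\star}, the gap assumption gives

\text{Reg}:=\mathbb{E}\left[\sum_{t=1}^{T}\left(\ell_{t,A_{t}}-\ell_{t,x^{\star}}\right)\right]\geq\Delta\,\mathbb{E}\left[\sum_{t=1}^{T}(1-p_{t,x^{\star}})\right],

so instantiating GSB with u=x^{\star} yields

\text{Reg}\leq\sqrt{c_{1}\log T\cdot\frac{\text{Reg}}{\Delta}}+c_{2}\log T,

and solving this self-bounding inequality for \text{Reg} gives \text{Reg}=O\big(\frac{c_{1}\log T}{\Delta}+c_{2}\log T\big), matching Proposition 2 with \alpha=\frac{1}{2}.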

Previous works have found algorithms with GSB in various settings, such as multi-armed bandits, semi-bandits, graph bandits, and MDPs. Here, we provide one such algorithm (Algorithm 3) for linear bandits based on the framework of SCRiBLe (Abernethy et al., 2008). The guarantee of Algorithm 3 is stated in the next theorem under the more general “learning with predictions” setting introduced in Rakhlin and Sridharan (2013), where in every round t, the learner receives a loss predictor m_{t} before making decisions.

Theorem 3.

In the “learning with predictions” setting, Algorithm 3 achieves a second-order regret bound of O\Big(d\sqrt{\nu\log(T)\sum_{t=1}^{T}(\ell_{t,A_{t}}-m_{t,A_{t}})^{2}}+d\nu\log T\Big) in the adversarial regime, where d is the feature dimension, \nu\leq d is the self-concordance parameter of the regularizer, and m_{t,x}=\langle x,m_{t}\rangle is the loss predictor. This also implies a first-order regret bound of O\Big(d\sqrt{\nu\log(T)\sum_{t=1}^{T}\ell_{t,u}}\Big) if \ell_{t,x}\geq 0 and m_{t}=\mathbf{0}; it simultaneously achieves \text{Reg}=O\Big(\frac{d^{2}\nu\log T}{\Delta}+\sqrt{\frac{d^{2}\nu\log T}{\Delta}C}\Big) in the corrupted stochastic regime.

See Appendix B for the algorithm and the proof. We call this algorithm Variance-Reduced SCRiBLe (VR-SCRiBLe) since it is based on the original SCRiBLe updates, but with some refinement in the construction of the loss estimator to reduce its variance. A good property of SCRiBLe-based algorithms is that they simultaneously achieve data-dependent bounds (i.e., first- and second-order bounds), similar to the case of multi-armed bandits using FTRL/OMD with the log-barrier regularizer (Wei and Luo, 2018, Ito, 2021b). Like Rakhlin and Sridharan (2013), Bubeck et al. (2019), Ito (2021b), Ito et al. (2022a), we can also use another procedure to learn m_{t} and obtain path-length or variance bounds. The details are omitted.

We notice, however, that the bound in Theorem 3 is sub-optimal in d, since the best-known regret for linear bandits is either d\sqrt{T\log T} or \sqrt{dT\log|\mathcal{X}|}, depending on whether the number of actions is larger than T^{d}. These bounds also hint at the possibility of getting better dependence on d in the stochastic regime if we can also establish GSB for these algorithms. Therefore, we ask: can we achieve best-of-both-worlds bounds for linear bandits with the optimal d dependence?

One attempt is to show GSB for existing algorithms with optimal d dependence, including the EXP2 algorithm (Bubeck et al., 2012) and the logdet-FTRL algorithm (Zimmert and Lattimore, 2022). Based on our attempts, it is unclear how to adapt the analysis of logdet-FTRL to show GSB. For EXP2, using the learning-rate tuning technique by Ito et al. (2022b), one can make it satisfy a guarantee similar to GSB, albeit with an additional \log(T) factor, resulting in a bound of \frac{d\log(|\mathcal{X}|T)\log(T)}{\Delta} in the stochastic regime. This gives a sub-optimal rate of O(\log^{2}T). Therefore, we further ask: can we achieve O(\log T) regret in the stochastic regime with the optimal d dependence?

Motivated by these questions, we resort to an approach that does not rely on GSB, which appears for the first time in the best-of-both-worlds literature. Our approach is in fact a general reduction: it not only achieves the desired bounds for linear bandits, but also provides a principled way to convert an algorithm with a worst-case guarantee into one with a best-of-both-worlds guarantee in general settings. In the next two subsections, we introduce our reduction techniques.

4.2 First reduction:  Best of Both Worlds \rightarrow Local Self-Bounding

Our reduction approach relies on the base algorithm satisfying a weaker condition than GSB, defined as follows:

Definition 4 (\alpha-local-self-bounding condition, or \alpha-LSB).

We say an algorithm satisfies the \alpha-local-self-bounding condition if it takes a candidate action \widehat{x}\in\mathcal{X} as input and satisfies the following pseudo-regret guarantee for any stopping time t^{\prime}\in[1,T]:

\forall u\in\mathcal{X}:\,\mathbb{E}\left[\sum_{t=1}^{t^{\prime}}\left(\ell_{t,A_{t}}-\ell_{t,u}\right)\right]\leq\min\left\{c_{0}^{1-\alpha}\mathbb{E}[t^{\prime}]^{\alpha},\ \ (c_{1}\log T)^{1-\alpha}\mathbb{E}\left[\sum_{t=1}^{t^{\prime}}(1-\bm{\mathbb{I}\{u=\widehat{x}\}}p_{t,u})\right]^{\alpha}\right\}+c_{2}\log T

where c_{0},c_{1},c_{2} are problem-dependent constants.

The difference between LSB in Definition 4 and GSB in Definition 1 is that LSB only requires the self-bounding term \sum_{t}(1-p_{t,u}) to appear when u is the particular action \widehat{x} given as input to the algorithm; for all other actions, the worst-case bound suffices. A minor additional requirement is that the pseudo-regret bound needs to hold for any stopping time (because our reduction may stop this algorithm during the learning procedure), but this is relatively easy to satisfy: for all algorithms in this paper, their regret bounds naturally hold for any stopping time without additional effort.

Algorithm 1 BOBW via LSB algorithms

Input: LSB algorithm \mathcal{A}.
T_{1}\leftarrow 0,  T_{0}\leftarrow -c_{2}\log(T).
\widehat{x}_{1}\sim\text{unif}(\mathcal{X}).
t\leftarrow 1.
for k=1,2,\ldots do
    Initialize \mathcal{A} with candidate \widehat{x}_{k}.
    Set counters N_{k}(x)=0 for all x\in\mathcal{X}.
    for t=T_{k}+1,T_{k}+2,\ldots do
        Play action A_{t} as suggested by \mathcal{A}, and advance \mathcal{A} by one step.
        N_{k}(A_{t})\leftarrow N_{k}(A_{t})+1.
        if t-T_{k}\geq 2(T_{k}-T_{k-1}) and \exists x\in\mathcal{X}\setminus\{\widehat{x}_{k}\} such that N_{k}(x)\geq\frac{t-T_{k}}{2} then
            \widehat{x}_{k+1}\leftarrow x.
            T_{k+1}\leftarrow t.
            break.

For linear bandits, we find that an adaptation of the logdet-FTRL algorithm (Zimmert and Lattimore, 2022) satisfies \frac{1}{2}-LSB, as stated in the following lemma. The proof is provided in Appendix C.

Lemma 5.

For d-dimensional linear bandits, by transforming the action set into \big\{\binom{x}{0}\,\big|\,x\in\mathcal{X}\setminus\{\widehat{x}\}\big\}\cup\big\{\binom{\mathbf{0}}{1}\big\}\subset\mathbb{R}^{d+1}, Algorithm 4 satisfies \frac{1}{2}-LSB with c_{0}=O((d+1)^{2}\log T) and c_{1}=c_{2}=O((d+1)^{2}).

With the LSB condition defined, we develop a reduction procedure (Algorithm 1) that turns any algorithm with LSB into a best-of-both-worlds algorithm with a guarantee similar to Proposition 2. The guarantee is formally stated in the next theorem.

Theorem 6.

If \mathcal{A} satisfies \alpha-LSB, then the regret of Algorithm 1 with \mathcal{A} as the base algorithm is upper bounded by O\left(c_{0}^{1-\alpha}T^{\alpha}+c_{2}\log^{2}T\right) in the adversarial regime and by O\big(c_{1}\log(T)\Delta^{-\frac{\alpha}{1-\alpha}}+(c_{1}\log T)^{1-\alpha}\left(C\Delta^{-1}\right)^{\alpha}+c_{2}\log(T)\log(C\Delta^{-1})\big) in the corrupted stochastic regime.

See Appendix D.1 for a proof of Theorem 6. In particular, Theorem 6 together with Lemma 5 directly leads to a better best-of-both-worlds bound for linear bandits.

Corollary 7.

Combining Algorithm 1 and Algorithm 4 results in a linear bandit algorithm with O\big(d\sqrt{T\log T}\big) regret in the adversarial regime and O\Big(\frac{d^{2}\log T}{\Delta}+d\sqrt{\frac{C\log T}{\Delta}}\Big) regret in the corrupted stochastic regime simultaneously.

Ideas of the first reduction

Algorithm 1 proceeds in epochs. In epoch k, an action \widehat{x}_{k}\in\mathcal{X} is selected as the candidate (\widehat{x}_{1} is drawn uniformly at random from \mathcal{X}). The procedure simply executes the base algorithm \mathcal{A} with input \widehat{x}=\widehat{x}_{k}, and monitors the number of draws of each action. If in epoch k some action x\neq\widehat{x}_{k} has been drawn more than half of the time, and the length of epoch k already exceeds twice the length of epoch k-1, then a new epoch is started with \widehat{x}_{k+1} set to x.
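The restart logic can be written in a few lines. Below is a minimal sketch in Python (not the authors' implementation); lsb_factory(candidate) and env(t, action) are hypothetical interfaces for a fresh LSB base algorithm and for the environment, respectively.

```python
import math
import random

def bobw_via_lsb(lsb_factory, actions, env, T, c2):
    """Minimal sketch of Algorithm 1: restart the LSB base algorithm whenever some
    non-candidate action was played at least half of the current epoch, provided the
    epoch is already at least twice as long as the previous one."""
    T_prev = -c2 * math.log(T)              # T_0 in the pseudocode
    epoch_start = 0                         # T_k, with T_1 = 0
    candidate = random.choice(actions)      # x_hat_1 drawn uniformly
    t = 0
    while t < T:
        base = lsb_factory(candidate)       # initialize A with candidate x_hat_k
        counts = {x: 0 for x in actions}    # N_k(x)
        while t < T:
            t += 1
            action = base.act()                     # play the action suggested by A
            base.update(action, env(t, action))     # advance A by one step
            counts[action] += 1
            long_enough = (t - epoch_start) >= 2 * (epoch_start - T_prev)
            challenger = next((x for x in actions if x != candidate
                               and counts[x] >= (t - epoch_start) / 2), None)
            if long_enough and challenger is not None:
                candidate = challenger              # x_hat_{k+1} <- x
                T_prev, epoch_start = epoch_start, t
                break                               # start epoch k+1
```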

Theorem 6 alone is not sufficient to be considered a black-box reduction, since algorithms satisfying LSB are not common. Therefore, our next step is to present a more general reduction that makes use of recent advances in Corral algorithms.

4.3 Second reduction:  Local Self-Bounding \rightarrow Importance-Weighting Stable

In this subsection, we show that one can achieve LSB using the idea of model selection. Specifically, we run a variant of the Corral algorithm (Agarwal et al., 2017) over two instances: one is \widehat{x}, and the other is an importance-weighting-stable algorithm (see Definition 8) over the action set \mathcal{X}\backslash\{\widehat{x}\}. Here, we focus on the case \alpha=\frac{1}{2}, which covers most standard settings where the worst-case regret is \sqrt{T}; an example with \alpha=\frac{2}{3} in the graph bandit setting is discussed in Section 5.

First, we introduce the notion of importance-weighting stability.

Definition 8 (\frac{1}{2}-iw-stable).

An algorithm is \frac{1}{2}-iw-stable (importance-weighting stable) if, given an adaptive sequence of weights q_{1},q_{2},\dots\in(0,1] and the assertion that the feedback in round t is observed with probability q_{t}, it obtains the following pseudo-regret guarantee for any stopping time t^{\prime}:

\mathbb{E}\left[\sum_{t=1}^{t^{\prime}}\left(\ell_{t,A_{t}}-\ell_{t,u}\right)\right]\leq\mathbb{E}\left[\sqrt{c_{1}\sum_{t=1}^{t^{\prime}}\frac{1}{q_{t}}}+\frac{c_{2}}{\min_{t\leq t^{\prime}}q_{t}}\right].

Definition 8 requires that even if the algorithm only receives the desired feedback with probability q_{t} in round t, it still achieves a meaningful regret bound that degrades smoothly with \sum_{t}\frac{1}{q_{t}}. In previous works on Corral and its variants (Agarwal et al., 2017, Foster et al., 2020), a similar notion of \frac{1}{2}-stability is defined as having a regret of \sqrt{c_{1}\left(\max_{\tau\leq t^{\prime}}\frac{1}{q_{\tau}}\right)t^{\prime}}, which is a weaker assumption than our \frac{1}{2}-iw-stability. Our stronger notion of stability is crucial for obtaining a best-of-both-worlds bound, but it is still very natural and holds for a wide range of algorithms.
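To see why the \sum_{t}1/q_{t} scaling in Definition 8 is the natural one, recall the usual importance-weighting heuristic (a sketch for intuition, not a statement from the analysis): if the feedback in round t is revealed only with probability q_{t}, an unbiased surrogate loss is obtained by reweighting,

\hat{\ell}_{t}=\frac{\mathbb{I}\{\text{feedback observed in round }t\}}{q_{t}}\,\ell_{t},\qquad\mathbb{E}\big[\hat{\ell}_{t}\big]=\ell_{t},\qquad\mathbb{E}\big[\hat{\ell}_{t}^{2}\big]=\frac{\ell_{t}^{2}}{q_{t}}\leq\frac{1}{q_{t}},

so the per-round second moment (and hence the stability term of a typical FTRL/OMD analysis) is inflated by 1/q_{t}; summing over rounds and tuning the learning rate then produces a \sqrt{c_{1}\sum_{t}1/q_{t}}-type term.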

Algorithm 2 LSB via Corral (for \alpha=\frac{1}{2})

Input: candidate action \widehat{x}, \frac{1}{2}-iw-stable algorithm \mathcal{B} over \mathcal{X}\backslash\{\widehat{x}\} with constants c_{1},c_{2}.
Define: \psi_{t}(q)=-\frac{2}{\eta_{t}}\sum_{i=1}^{2}\sqrt{q_{i}}+\frac{1}{\beta}\sum_{i=1}^{2}\ln\frac{1}{q_{i}}.
B_{0}=0.
for t=1,2,\ldots do
    Let
        \bar{q}_{t}=\operatorname*{argmin}_{q\in\Delta_{2}}\left\{\left\langle q,\sum_{\tau=1}^{t-1}z_{\tau}-\begin{bmatrix}0\\ B_{t-1}\end{bmatrix}\right\rangle+\psi_{t}(q)\right\},\quad q_{t}=\left(1-\frac{1}{2t^{2}}\right)\bar{q}_{t}+\frac{1}{4t^{2}}\mathbf{1},
        \text{with}\ \ \eta_{t}=\frac{1}{\sqrt{t}+8\sqrt{c_{1}}},\qquad\beta=\frac{1}{8c_{2}}.
    Sample i_{t}\sim q_{t}.
    if i_{t}=1 then draw A_{t}=\widehat{x} and observe \ell_{t,A_{t}};
    else draw A_{t} according to base algorithm \mathcal{B} and observe \ell_{t,A_{t}};
    Define z_{t,i}=\frac{(\ell_{t,A_{t}}+1)\mathbb{I}\{i_{t}=i\}}{q_{t,i}}-1 and
        B_{t}=\sqrt{c_{1}\sum_{\tau=1}^{t}\frac{1}{q_{\tau,2}}}+\frac{c_{2}}{\min_{\tau\leq t}q_{\tau,2}}.

Below, we provide two examples of \frac{1}{2}-iw-stable algorithms. Their analyses are mostly standard FTRL analyses (accounting for the extra importance weighting) and can be found in Appendix F.

Lemma 9.

For linear bandits, EXP2 with adaptive learning rate (Algorithm 9) is \frac{1}{2}-iw-stable with constants c_{1}=c_{2}=O(d\log|\mathcal{X}|).

Lemma 10.

For contextual bandits, EXP4 with adaptive learning rate (Algorithm 10) is \frac{1}{2}-iw-stable with constants c_{1}=O(K\log|\mathcal{X}|), c_{2}=0.

Our reduction procedure is Algorithm 2, which turns a \frac{1}{2}-iw-stable algorithm into one with \frac{1}{2}-LSB. The guarantee is formalized in the next theorem, whose proof is in Appendix E.1.

Theorem 11.

If \mathcal{B} is a 12\frac{1}{2}-iw-stable algorithm with constants (c1,c2)(c_{1},c_{2}), then Algorithm 2 with \mathcal{B} as the base algorithm satisfies 12\frac{1}{2}-LSB with constants (c0,c1,c2)(c_{0}^{\prime},c_{1}^{\prime},c_{2}^{\prime}) where c0=c1=O(c1)c_{0}^{\prime}=c_{1}^{\prime}=O(c_{1}) and c2=O(c2)c_{2}^{\prime}=O(c_{2}).

Cascading Theorem 11 and Theorem 6, and combining them with Lemma 9 and Lemma 10, respectively, we get the following corollaries:

Corollary 12.

Combining Algorithm 1, Algorithm 2 and Algorithm 9 results in a linear bandit algorithm with O\big(\sqrt{dT\log|\mathcal{X}|}\big) regret in the adversarial regime and O\Big(\frac{d\log|\mathcal{X}|\log T}{\Delta}+\sqrt{\frac{d\log|\mathcal{X}|\log T}{\Delta}C}\Big) regret in the corrupted stochastic regime simultaneously.

Corollary 13.

Combining Algorithm 1, Algorithm 2 and Algorithm 10 results in a contextual bandit algorithm with O\big(\sqrt{KT\log|\mathcal{X}|}\big) regret in the adversarial regime and O\Big(\frac{K\log|\mathcal{X}|\log T}{\Delta}+\sqrt{\frac{K\log|\mathcal{X}|\log T}{\Delta}C}\Big) regret in the corrupted stochastic regime simultaneously, where K is the number of arms.

Ideas of the second reduction

Algorithm 2 is related to the Corral algorithm (Agarwal et al., 2017, Foster et al., 2020, Luo et al., 2022). We use a special version which is essentially an FTRL algorithm (with a hybrid \frac{1}{2}-Tsallis entropy and log-barrier regularizer) over two base algorithms: the candidate \widehat{x}, and an algorithm \mathcal{B} operating on the reduced action set \mathcal{X}\backslash\{\widehat{x}\}. For the base algorithm \mathcal{B}, the Corral algorithm adds a bonus term that upper bounds the regret of \mathcal{B} under importance weighting (i.e., the quantity B_{t} defined in Algorithm 2). In the regret analysis, the bonus creates a negative term as well as a bonus overhead in the learner’s regret. The negative term can be used to cancel the positive regret of \mathcal{B}, and the analysis reduces to bounding the bonus overhead. Showing that the bonus overhead is upper bounded by the order of \sqrt{c_{1}\log(T)\sum_{t=1}^{T}(1-p_{t,\widehat{x}})} is the key to establishing the LSB property.
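For concreteness, the FTRL step computing \bar{q}_{t} in Algorithm 2 is a one-dimensional convex problem over the two-point simplex, so it can be solved numerically. The following is a minimal sketch of that single step, assuming NumPy/SciPy are available; it only illustrates the update rule and is not a full implementation of the algorithm.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def corral_weights(cum_z, B_prev, t, c1, c2):
    """One FTRL step of Algorithm 2 over two 'arms':
    arm 0 = the candidate x_hat, arm 1 = the iw-stable base algorithm B.

    cum_z  : length-2 array with the cumulative loss estimates sum_{tau < t} z_tau
    B_prev : the bonus B_{t-1}, subtracted from arm 1's cumulative loss
    Returns the exploration-smoothed distribution q_t.
    """
    eta = 1.0 / (np.sqrt(t) + 8.0 * np.sqrt(c1))   # eta_t
    inv_beta = 8.0 * c2                            # 1/beta (the log-barrier vanishes if c2 = 0)
    shifted = np.asarray(cum_z, dtype=float) - np.array([0.0, B_prev])

    def objective(x):                              # parametrize q = (x, 1 - x), x in (0, 1)
        q = np.array([x, 1.0 - x])
        psi = -(2.0 / eta) * np.sum(np.sqrt(q)) + inv_beta * np.sum(np.log(1.0 / q))
        return float(np.dot(q, shifted) + psi)

    res = minimize_scalar(objective, bounds=(1e-12, 1.0 - 1e-12), method="bounded")
    q_bar = np.array([res.x, 1.0 - res.x])
    # q_t = (1 - 1/(2 t^2)) * q_bar_t + (1/(4 t^2)) * 1
    return (1.0 - 1.0 / (2.0 * t ** 2)) * q_bar + 1.0 / (4.0 * t ** 2)
```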

Combining Algorithm 2 and Algorithm 1, we have the following general reduction: as long as we have a \frac{1}{2}-iw-stable algorithm over \mathcal{X}\backslash\{\widehat{x}\} for any \widehat{x}\in\mathcal{X}, we have an algorithm with a best-of-both-worlds guarantee whenever the optimal action is unique. Notice that \frac{1}{2}-iw-stable algorithms are quite common; usually an FTRL/OMD algorithm with an adaptive learning rate suffices.

The overall reduction is reminiscent of the G-COBE procedure by Wei et al. (2022), which also runs model selection over \widehat{x} and a base algorithm for \mathcal{X}\backslash\{\widehat{x}\} (similar to Algorithm 2), and dynamically restarts the procedure with increasing epoch lengths (similar to Algorithm 1). However, besides being more complicated, G-COBE only guarantees a bound of \frac{\text{polylog}(T)}{\Delta}+C in the corrupted stochastic regime (omitting dependencies on c_{1},c_{2}), which is sub-optimal in both C and T. (However, Wei et al. (2022) is able to handle a more general type of corruption.)

5 Case Study:  Graph Bandits

In this section, we show the power of our reduction by improving the state of the art of best-of-both-worlds algorithms for bandits with graph feedback. In bandits with graph feedback, the learner is given a directed feedback graph G=(V,E), where the nodes V=[K] correspond to the K arms of the bandit. Instead of observing the loss of the played action A_{t}, the player observes the loss \ell_{t,j} iff (A_{t},j)\in E. Learnable graphs are divided into strongly observable graphs and weakly observable graphs. In the first case, every arm i\in[K] must either receive its own feedback, i.e. (i,i)\in E, or be observed by all other arms, i.e. \forall j\in[K]\setminus\{i\}:\,(j,i)\in E. In the weakly observable case, every arm i\in[K] must be observable by at least one arm, i.e. \exists j\in[K]:\,(j,i)\in E. The following graph properties characterize the difficulty of the two regimes. An independence set is any subset V^{\prime}\subset V such that no edge exists between any two distinct nodes in V^{\prime}, i.e. for all i,j\in V^{\prime} with i\neq j we have (i,j)\not\in E and (j,i)\not\in E. The independence number \alpha is the size of the largest independence set. A related quantity is the weak independence number \widetilde{\alpha}, which is the independence number of the subgraph (V,E^{\prime}) obtained by removing all one-sided edges, i.e. (i,j)\in E^{\prime} iff (i,j)\in E and (j,i)\in E. For undirected graphs, the two notions coincide, but in general \widetilde{\alpha} can be larger than \alpha by up to a factor of K. Finally, a weakly dominating set is a subset D\subset V such that every node in V is observable by some node in D, i.e. \forall j\in V\,\exists i\in D:\,(i,j)\in E. The weak domination number \delta is the size of the smallest weakly dominating set.
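These quantities are easy to compute by brute force on small graphs, which may help build intuition for the definitions. A minimal sketch (the example graph below is ours, chosen only for illustration):

```python
from itertools import combinations

def independence_number(K, edges):
    """alpha: size of the largest node set with no edge, in either direction, between two of its members."""
    def independent(S):
        return all((i, j) not in edges and (j, i) not in edges for i, j in combinations(S, 2))
    return max(len(S) for r in range(K + 1) for S in combinations(range(K), r) if independent(S))

def weak_independence_number(K, edges):
    """alpha~: independence number after removing all one-sided edges."""
    bidirectional = {(i, j) for (i, j) in edges if (j, i) in edges}
    return independence_number(K, bidirectional)

def weak_domination_number(K, edges):
    """delta: size of the smallest set D such that every node j has some i in D with (i, j) in E."""
    def dominates(D):
        return all(any((i, j) in edges for i in D) for j in range(K))
    return min(len(D) for r in range(K + 1) for D in combinations(range(K), r) if dominates(D))

# Toy example: the directed 3-cycle 0 -> 1 -> 2 -> 0 (weakly but not strongly observable).
# Here alpha = 1 while alpha~ = 3 and delta = 3, illustrating that alpha~ can exceed alpha
# by a factor of order K.
edges = {(0, 1), (1, 2), (2, 0)}
print(independence_number(3, edges), weak_independence_number(3, edges), weak_domination_number(3, edges))
```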

Alon et al. (2015) provide an almost tight characterization of the adversarial regime, with upper bounds of O(\sqrt{\alpha T\log T\log K}) and O(\sqrt{\widetilde{\alpha}T\log K}) for the strongly observable case and O((\delta\log K)^{\frac{1}{3}}T^{\frac{2}{3}}) for the weakly observable case, as well as matching lower bounds up to log factors. Zimmert and Lattimore (2019) have shown that the \log(T) dependency can be avoided and that hence \alpha is the right notion of difficulty for strongly observable graphs even in the limit T\rightarrow\infty (though they pay for this with a larger \log K dependency).

State-of-the-art best-of-both-worlds guarantees for bandits with graph feedback are derived from EXP3 and obtain either O(\frac{\widetilde{\alpha}\log^{2}(T)}{\Delta}) or O(\frac{\alpha\log(T)\log^{2}(TK)}{\Delta}) regret for strongly observable graphs and O(\frac{\delta\log^{2}(T)}{\Delta^{2}}) for weakly observable graphs (Rouyer et al., 2022, Ito et al., 2022b). (Rouyer et al. (2022) obtain better bounds when the gaps are not uniform, while Ito et al. (2022b) can handle graphs that are unions of strongly observable and weakly observable sub-graphs; we do not extend our analysis to the latter case for the sake of simplicity.) Our black-box reduction leads directly to algorithms with optimal \log(T) regret in the stochastic regime.

5.1 Strongly observable graphs

We begin by providing an example of a \frac{1}{2}-iw-stable algorithm for bandits with strongly observable graph feedback. The algorithm and the proof are in Appendix F.3.

Lemma 14.

For bandits with strongly observable graph feedback, (1-1/\log K)-Tsallis-INF with adaptive learning rate (Algorithm 11) is \frac{1}{2}-iw-stable with constants c_{1}=O(\min\{\widetilde{\alpha},\alpha\log K\}\log K), c_{2}=0.

Before we apply the reduction, note that Algorithm 2 requires that the player observes the loss \ell_{t,A_{t}} when playing arm A_{t}, which is not necessarily the case for all strongly observable graphs. However, this can be overcome by defining an observable surrogate loss that is used in Algorithm 2 instead. We explain how this works in detail in Section G and assume from now on that this technical issue does not arise.

Cascading the two reduction steps, we immediately obtain the following.

Corollary 15.

Combining Algorithm 1, Algorithm 2 and Algorithm 11 results in a graph bandit algorithm with O\big(\sqrt{\min\{\widetilde{\alpha},\alpha\log K\}T\log K}\big) regret in the adversarial regime and O\Big(\frac{\min\{\widetilde{\alpha},\alpha\log K\}\log T\log K}{\Delta}+\sqrt{\frac{\min\{\widetilde{\alpha},\alpha\log K\}\log T\log K}{\Delta}C}\Big) regret in the corrupted stochastic regime simultaneously.

5.2 Weakly observable graphs

Weakly observable graphs motivate our general definition of LSB beyond \alpha=\frac{1}{2}. We first need to define the equivalent of \frac{1}{2}-iw-stability for this regime.

Definition 16 (\frac{2}{3}-iw-stable).

An algorithm is \frac{2}{3}-iw-stable (importance-weighting stable) if, given an adaptive sequence of weights q_{1},q_{2},\dots\in(0,1] and the assertion that the feedback in round t is observed with probability q_{t}, it obtains the following pseudo-regret guarantee for any stopping time t^{\prime}:

\mathbb{E}\left[\sum_{t=1}^{t^{\prime}}\left(\ell_{t,A_{t}}-\ell_{t,u}\right)\right]\leq\mathbb{E}\left[c_{1}^{\frac{1}{3}}\left(\sum_{t=1}^{t^{\prime}}\frac{1}{\sqrt{q_{t}}}\right)^{\frac{2}{3}}+\max_{t\leq t^{\prime}}\frac{c_{2}}{q_{t}}\right].

An example of such an algorithm is given by the following lemma (see Appendix F.4 for the algorithm and the proof).

Lemma 17.

For bandits with weakly observable graph feedback, EXP3 with adaptive learning rate (Algorithm 12) is \frac{2}{3}-iw-stable with constants c_{1}=c_{2}=O(\delta\log K).

Similar to the \frac{1}{2} case, this allows for a general reduction to LSB algorithms.

Theorem 18.

If \mathcal{B} is a 23\frac{2}{3}-iw-stable algorithm with constants (c1,c2)(c_{1},c_{2}), then Algorithm 5 with \mathcal{B} as the base algorithm satisfies 23\frac{2}{3}-LSB with constants (c0,c1,c2)(c_{0}^{\prime},c_{1}^{\prime},c_{2}^{\prime}) where c0=c1=O(c1)c_{0}^{\prime}=c_{1}^{\prime}=O(c_{1}) and c2=O(c113+c2)c_{2}^{\prime}=O(c_{1}^{\frac{1}{3}}+c_{2}).

See Appendix E.2 for the proof. Algorithm 5 works almost the same as Algorithm 2, but we need to adapt the bonus terms to the definition of \frac{2}{3}-iw-stability. The major additional difference is that we do not necessarily observe the loss of the action played and hence need to play exploratory actions with probability \gamma_{t} in every round to estimate the performance difference between \widehat{x} and \mathcal{B}. A second point to notice is that the base algorithm \mathcal{B} uses the action set \mathcal{X}\setminus\{\widehat{x}\}, but is still allowed to use \widehat{x} in its internal exploration policy. This is necessary because the sub-graph with one arm removed might have a larger dominating number, or might not even be learnable at all. By allowing \widehat{x} in the internal exploration, we guarantee that the regret of the base algorithm is not larger than over the full action set. Finally, cascading the lemma and the reduction, we obtain

Corollary 19.

Combining Algorithm 1, Algorithm 5 and Algorithm 12 results in a graph bandit algorithm with O\big((\delta\log K)^{\frac{1}{3}}T^{\frac{2}{3}}\big) regret in the adversarial regime and O\Big(\frac{\delta\log K\log T}{\Delta^{2}}+\left(\frac{C^{2}\delta\log K\log T}{\Delta^{2}}\right)^{\frac{1}{3}}\Big) regret in the corrupted stochastic regime simultaneously.

6 More Adaptations

To demonstrate the versatility of our reduction framework, we provide two more adaptations. The first shows that our reduction can be easily adapted to obtain a data-dependent bound (i.e., a first- or second-order bound) in the adversarial regime, provided that the base algorithm achieves a corresponding data-dependent bound. The second eliminates the undesired requirement of the Corral algorithm (Algorithm 2) that the base algorithm operate on \mathcal{X}\backslash\{\widehat{x}\} instead of the more natural \mathcal{X}. We show that under a stronger stability assumption, we can indeed just use a base algorithm that operates on \mathcal{X}. This is helpful for settings where excluding one single action/policy in the algorithm is not straightforward (e.g., MDPs). Finally, we combine the two techniques to obtain a first-order best-of-both-worlds bound for tabular MDPs.

6.1 Reduction with first- and second-order bounds

A first-order regret bound refers to a regret bound of order O\big(\sqrt{c_{1}\,\text{polylog}(T)\,L_{\star}}+c_{2}\,\text{polylog}(T)\big), where L_{\star}=\min_{x\in\mathcal{X}}\mathbb{E}\big[\sum_{t=1}^{T}\ell_{t,x}\big] is the best action's cumulative loss. This is meaningful under the assumption that \ell_{t,x}\geq 0 for all t,x. A second-order regret bound refers to a bound of order O\Big(\sqrt{c_{1}\,\text{polylog}(T)\,\mathbb{E}\big[\sum_{t=1}^{T}(\ell_{t,A_{t}}-m_{t,A_{t}})^{2}\big]}+c_{2}\,\text{polylog}(T)\Big), where m_{t,x} is a loss predictor for action x that is available to the learner at the beginning of round t. We refer the reader to Wei and Luo (2018), Ito (2021b) for more discussion of data-dependent bounds.

We first define counterparts of the LSB condition and iw-stability condition with data-dependent guarantees:

Definition 20 (\frac{1}{2}-dd-LSB).

We say an algorithm satisfies \frac{1}{2}-dd-LSB (data-dependent LSB) if it takes a candidate action \widehat{x}\in\mathcal{X} as input and satisfies the following pseudo-regret guarantee for any stopping time t^{\prime}\in[1,T]:

\forall u\in\mathcal{X}:\quad\mathbb{E}\left[\sum_{t=1}^{t^{\prime}}\left(\ell_{t,A_{t}}-\ell_{t,u}\right)\right]\leq\sqrt{(c_{1}\log T)\,\mathbb{E}\left[\sum_{t=1}^{t^{\prime}}\left(\sum_{x}p_{t,x}\xi_{t,x}-\bm{\mathbb{I}\{u=\widehat{x}\}}p_{t,u}^{2}\xi_{t,u}\right)\right]}+c_{2}\log T

where c_{1},c_{2} are problem-dependent constants, \xi_{t,x}=(\ell_{t,x}-m_{t,x})^{2} in the second-order bound case, and \xi_{t,x}=\ell_{t,x} in the first-order bound case.

Definition 21 (\frac{1}{2}-dd-iw-stable).

An algorithm is \frac{1}{2}-dd-iw-stable (data-dependent iw-stable) if, given an adaptive sequence of weights q_{1},q_{2},\dots\in(0,1] and the assertion that the feedback in round t is observed with probability q_{t}, it obtains the following pseudo-regret guarantee for any stopping time t^{\prime}:

\mathbb{E}\left[\sum_{t=1}^{t^{\prime}}\left(\ell_{t,A_{t}}-\ell_{t,u}\right)\right]\leq\sqrt{c_{1}\,\mathbb{E}\left[\sum_{t=1}^{t^{\prime}}\frac{\text{upd}_{t}\cdot\xi_{t,A_{t}}}{q_{t}^{2}}\right]}+\mathbb{E}\left[\frac{c_{2}}{\min_{t\leq t^{\prime}}q_{t}}\right],

where \text{upd}_{t}=1 if feedback is observed in round t and \text{upd}_{t}=0 otherwise, and \xi_{t,x} is defined in the same way as in Definition 20.

We can turn a dd-iw-stable algorithm into one with dd-LSB (see Appendix E.3 for the proof):

Theorem 22.

If \mathcal{B} is 12\frac{1}{2}-dd-iw-stable with constants (c1,c2)(c_{1},c_{2}), then Algorithm 6 with \mathcal{B} as the base algorithm satisfies 12\frac{1}{2}-dd-LSB with constants (c1,c2)(c_{1}^{\prime},c_{2}^{\prime}) where c1=O(c1)c_{1}^{\prime}=O(c_{1}) and c2=O(c1+c2)c_{2}^{\prime}=O(\sqrt{c_{1}}+c_{2}).

Then we turn an algorithm with dd-LSB into one with a data-dependent best-of-both-worlds guarantee (see Appendix D.2 for the proof):

Theorem 23.

If an algorithm \mathcal{A} satisfies \frac{1}{2}-dd-LSB, then the regret of Algorithm 1 with \mathcal{A} as the base algorithm is upper bounded by O\Big(\sqrt{c_{1}\mathbb{E}\left[\sum_{t=1}^{T}\xi_{t,A_{t}}\right]\log^{2}T}+c_{2}\log^{2}T\Big) in the adversarial regime and O\Big(\frac{c_{1}\log T}{\Delta}+\sqrt{\frac{c_{1}\log T}{\Delta}C}+c_{2}\log(T)\log(C\Delta^{-1})\Big) in the corrupted stochastic regime.

6.2 Achieving LSB without excluding \widehat{x}

Our reduction in Section 4.3 requires the base algorithm \mathcal{B} to operate on the action set \mathcal{X}\backslash\{\widehat{x}\}. This is sometimes not easy to implement for structured problems where actions share common components (e.g., MDPs or combinatorial bandits). To eliminate this requirement, so that we can simply use a base algorithm \mathcal{B} that operates on the original action space \mathcal{X}, we propose the following stronger notion of iw-stability, which we call strong iw-stability:

Definition 24 (\frac{1}{2}-strongly-iw-stable).

An algorithm is \frac{1}{2}-strongly-iw-stable if the following holds: given an adaptive sequence of weights q_{1},q_{2},\dots\in(0,1]^{\mathcal{X}} and the assertion that the feedback in round t is observed with probability q_{t}(x) if the algorithm chooses A_{t}=x, it obtains the following pseudo-regret guarantee for any stopping time t^{\prime}:

\mathbb{E}\left[\sum_{t=1}^{t^{\prime}}\left(\ell_{t,A_{t}}-\ell_{t,u}\right)\right]\leq\mathbb{E}\left[\sqrt{c_{1}\sum_{t=1}^{t^{\prime}}\frac{1}{q_{t}(A_{t})}}+\frac{c_{2}}{\min_{t\leq t^{\prime}}\min_{x}q_{t}(x)}\right].

Compared with the iw-stability of Definition 8, strong iw-stability requires the bound to have additional flexibility: if choosing action x results in a probability q_{t}(x) of observing the feedback, the regret bound needs to adapt to \sum_{t}\frac{1}{q_{t}(A_{t})}. For the class of FTRL/OMD algorithms, strong iw-stability holds if the stability term is bounded by a constant no matter which action is chosen. Examples include log-barrier OMD/FTRL for multi-armed bandits, and SCRiBLe (Abernethy et al., 2008) or a truncated version of continuous exponential weights (Ito et al., 2020) for linear bandits. In fact, Upper-Confidence-Bound (UCB) algorithms also satisfy strong iw-stability, even though they are mainly designed for the purely stochastic setting; we will need this property when designing algorithms for adversarial MDPs, where the transition estimation is done through UCB approaches. This will be demonstrated in Section 6.3. With strong iw-stability, the second reduction only requires a base algorithm over \mathcal{X}:

Theorem 25.

If \mathcal{B} is a 12\frac{1}{2}-strongly-iw-stable algorithm with constants (c1,c2)(c_{1},c_{2}), then Algorithm 7 with \mathcal{B} as the base algorithm satisfies 12\frac{1}{2}-LSB with constants (c0,c1,c2)(c_{0}^{\prime},c_{1}^{\prime},c_{2}^{\prime}) where c0=c1=O(c1)c_{0}^{\prime}=c_{1}^{\prime}=O(c_{1}) and c2=O(c1+c2)c_{2}^{\prime}=O(\sqrt{c_{1}}+c_{2}).

The proof is provided in Appendix E.4.

6.3 Application to MDPs

Finally, we combine the techniques developed in Section 6.1 and Section 6.2 to obtain a best-of-both-worlds guarantee for tabular MDPs with a first-order regret bound in the adversarial regime. We use Algorithm 13 as the base algorithm, which is adapted from the UCB-log-barrier Policy Search algorithm by Lee et al. (2020) to satisfy both the data-dependent property (Definition 21) and the strong iw-stability property (Definition 24). The Corral algorithm we use is Algorithm 8, which takes a base algorithm with the dd-strongly-iw-stable property and turns it into a data-dependent best-of-both-worlds algorithm. The details and notation are all provided in Appendix H. The guarantee of the algorithm is formally stated in the next theorem.

Theorem 26.

Combining Algorithm 1, Algorithm 8, and Algorithm 13 results in an MDP algorithm with O\Big(\sqrt{S^{2}AHL_{\star}\log^{2}(T)\iota^{2}}+S^{5}A^{2}\log^{2}(T)\iota^{2}\Big) regret in the adversarial regime, and O\Big(\frac{H^{2}S^{2}A\iota^{2}\log T}{\Delta}+\sqrt{\frac{H^{2}S^{2}A\iota^{2}\log T}{\Delta}C}+S^{5}A^{2}\iota^{2}\log(T)\log(C\Delta^{-1})\Big) regret in the corrupted stochastic regime, where S is the number of states, A is the number of actions, H is the horizon, L_{\star} is the cumulative loss of the best policy, and \iota=\log(SAT).

7 Conclusion

We provided a general reduction from the best-of-both-worlds problem to a wide range of FTRL/OMD algorithms, which improves the state of the art in several problem settings. We showed the versatility of our approach by extending it to preserve data-dependent bounds.

Another potential application of our framework is partial monitoring, where one might improve the \log^{2}(T) rates to \log(T) for both the T^{\frac{1}{2}} and the T^{\frac{2}{3}} regimes using our respective reductions.

A weakness of our approach is the uniqueness requirement on the best action. While this assumption is typical in the best-of-both-worlds literature, for us it is not merely an artifact of the analysis, due to the doubling procedure in the first reduction. Additionally, our reduction can only obtain worst-case \Delta-dependent bounds in the stochastic regime, which can be significantly weaker than more refined notions of complexity.

References

  • Abbasi-Yadkori et al. (2020) Yasin Abbasi-Yadkori, Aldo Pacchiano, and My Phan. Regret balancing for bandit and rl model selection. arXiv preprint arXiv:2006.05491, 2020.
  • Abernethy et al. (2008) Jacob Abernethy, Elad Hazan, and Alexander Rakhlin. Competing in the dark: An efficient algorithm for bandit linear optimization. In 21st Annual Conference on Learning Theory, COLT 2008, 2008.
  • Agarwal et al. (2017) Alekh Agarwal, Haipeng Luo, Behnam Neyshabur, and Robert E Schapire. Corralling a band of bandit algorithms. In Conference on Learning Theory, pages 12–38. PMLR, 2017.
  • Alon et al. (2013) Noga Alon, Nicolo Cesa-Bianchi, Claudio Gentile, and Yishay Mansour. From bandits to experts: A tale of domination and independence. Advances in Neural Information Processing Systems, 26, 2013.
  • Alon et al. (2015) Noga Alon, Nicolo Cesa-Bianchi, Ofer Dekel, and Tomer Koren. Online learning with feedback graphs: Beyond bandits. In Conference on Learning Theory, pages 23–35. PMLR, 2015.
  • Amir et al. (2022) Idan Amir, Guy Azov, Tomer Koren, and Roi Livni. Better best of both worlds bounds for bandits with switching costs. In Advances in Neural Information Processing Systems, 2022.
  • Arora et al. (2021) Raman Arora, Teodor Vanislavov Marinov, and Mehryar Mohri. Corralling stochastic bandit algorithms. In International Conference on Artificial Intelligence and Statistics, pages 2116–2124. PMLR, 2021.
  • Auer and Chiang (2016) Peter Auer and Chao-Kai Chiang. An algorithm with nearly optimal pseudo-regret for both stochastic and adversarial bandits. In Conference on Learning Theory, pages 116–120. PMLR, 2016.
  • Auer et al. (2002a) Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine learning, 47:235–256, 2002a.
  • Auer et al. (2002b) Peter Auer, Nicolo Cesa-Bianchi, Yoav Freund, and Robert E Schapire. The nonstochastic multiarmed bandit problem. SIAM journal on computing, 32(1):48–77, 2002b.
  • Bubeck and Slivkins (2012) Sébastien Bubeck and Aleksandrs Slivkins. The best of both worlds: Stochastic and adversarial bandits. In Conference on Learning Theory, pages 42–1. JMLR Workshop and Conference Proceedings, 2012.
  • Bubeck et al. (2012) Sébastien Bubeck, Nicolo Cesa-Bianchi, and Sham M Kakade. Towards minimax policies for online linear optimization with bandit feedback. In Conference on Learning Theory, pages 41–1. JMLR Workshop and Conference Proceedings, 2012.
  • Bubeck et al. (2019) Sébastien Bubeck, Yuanzhi Li, Haipeng Luo, and Chen-Yu Wei. Improved path-length regret bounds for bandits. In Conference On Learning Theory, pages 508–528. PMLR, 2019.
  • Chen et al. (2022) Cheng Chen, Canzhe Zhao, and Shuai Li. Simultaneously learning stochastic and adversarial bandits under the position-based model. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 6202–6210, 2022.
  • Cutkosky et al. (2021) Ashok Cutkosky, Christoph Dann, Abhimanyu Das, Claudio Gentile, Aldo Pacchiano, and Manish Purohit. Dynamic balancing for model selection in bandits and rl. In International Conference on Machine Learning, pages 2276–2285. PMLR, 2021.
  • Erez and Koren (2021) Liad Erez and Tomer Koren. Towards best-of-all-worlds online learning with feedback graphs. Advances in Neural Information Processing Systems, 34:28511–28521, 2021.
  • Foster et al. (2020) Dylan J Foster, Claudio Gentile, Mehryar Mohri, and Julian Zimmert. Adapting to misspecification in contextual bandits. Advances in Neural Information Processing Systems, 33:11478–11489, 2020.
  • Gaillard et al. (2014) Pierre Gaillard, Gilles Stoltz, and Tim Van Erven. A second-order bound with excess losses. In Conference on Learning Theory, pages 176–196. PMLR, 2014.
  • Gupta et al. (2019) Anupam Gupta, Tomer Koren, and Kunal Talwar. Better algorithms for stochastic bandits with adversarial corruptions. In Conference on Learning Theory, pages 1562–1578. PMLR, 2019.
  • Honda et al. (2023) Junya Honda, Shinji Ito, and Taira Tsuchiya. Follow-the-perturbed-leader achieves best-of-both-worlds for bandit problems. In International Conference on Algorithmic Learning Theory. PMLR, 2023.
  • Huang et al. (2022) Jiatai Huang, Yan Dai, and Longbo Huang. Adaptive best-of-both-worlds algorithm for heavy-tailed multi-armed bandits. In International Conference on Machine Learning, pages 9173–9200. PMLR, 2022.
  • Ito (2021a) Shinji Ito. On optimal robustness to adversarial corruption in online decision problems. Advances in Neural Information Processing Systems, 34:7409–7420, 2021a.
  • Ito (2021b) Shinji Ito. Parameter-free multi-armed bandit algorithms with hybrid data-dependent regret bounds. In Conference on Learning Theory, pages 2552–2583. PMLR, 2021b.
  • Ito et al. (2020) Shinji Ito, Shuichi Hirahara, Tasuku Soma, and Yuichi Yoshida. Tight first-and second-order regret bounds for adversarial linear bandits. Advances in Neural Information Processing Systems, 33:2028–2038, 2020.
  • Ito et al. (2022a) Shinji Ito, Taira Tsuchiya, and Junya Honda. Adversarially robust multi-armed bandit algorithm with variance-dependent regret bounds. In Conference on Learning Theory, pages 1421–1422. PMLR, 2022a.
  • Ito et al. (2022b) Shinji Ito, Taira Tsuchiya, and Junya Honda. Nearly optimal best-of-both-worlds algorithms for online learning with feedback graphs. In Advances in Neural Information Processing Systems, 2022b.
  • Jin and Luo (2020) Tiancheng Jin and Haipeng Luo. Simultaneously learning stochastic and adversarial episodic mdps with known transition. Advances in neural information processing systems, 33:16557–16566, 2020.
  • Jin et al. (2021) Tiancheng Jin, Longbo Huang, and Haipeng Luo. The best of both worlds: stochastic and adversarial episodic mdps with unknown transition. Advances in Neural Information Processing Systems, 34:20491–20502, 2021.
  • Kong et al. (2022) Fang Kong, Yichi Zhou, and Shuai Li. Simultaneously learning stochastic and adversarial bandits with general graph feedback. In International Conference on Machine Learning, pages 11473–11482. PMLR, 2022.
  • Lai et al. (1985) Tze Leung Lai, Herbert Robbins, et al. Asymptotically efficient adaptive allocation rules. Advances in applied mathematics, 6(1):4–22, 1985.
  • Lee et al. (2020) Chung-Wei Lee, Haipeng Luo, Chen-Yu Wei, and Mengxiao Zhang. Bias no more: high-probability data-dependent regret bounds for adversarial bandits and mdps. Advances in neural information processing systems, 33:15522–15533, 2020.
  • Lee et al. (2021) Chung-Wei Lee, Haipeng Luo, Chen-Yu Wei, Mengxiao Zhang, and Xiaojin Zhang. Achieving near instance-optimality and minimax-optimality in stochastic and adversarial linear bandits simultaneously. In International Conference on Machine Learning, pages 6142–6151. PMLR, 2021.
  • Luo (2022) Haipeng Luo. Homework 3 solution, introduction to online optimization/learning. http://haipeng-luo.net/courses/CSCI659/2022_fall/homework/HW3_solutions.pdf, November 2022.
  • Luo et al. (2022) Haipeng Luo, Mengxiao Zhang, Peng Zhao, and Zhi-Hua Zhou. Corralling a larger band of bandits: A case study on switching regret for linear bandits. In Conference on Learning Theory, pages 3635–3684. PMLR, 2022.
  • Lykouris et al. (2018) Thodoris Lykouris, Vahab Mirrokni, and Renato Paes Leme. Stochastic bandits robust to adversarial corruptions. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, pages 114–122, 2018.
  • Masoudian and Seldin (2021) Saeed Masoudian and Yevgeny Seldin. Improved analysis of the tsallis-inf algorithm in stochastically constrained adversarial bandits and stochastic bandits with adversarial corruptions. In Conference on Learning Theory, pages 3330–3350. PMLR, 2021.
  • Masoudian et al. (2022) Saeed Masoudian, Julian Zimmert, and Yevgeny Seldin. A best-of-both-worlds algorithm for bandits with delayed feedback. In Advances in Neural Information Processing Systems, 2022.
  • Mourtada and Gaïffas (2019) Jaouad Mourtada and Stéphane Gaïffas. On the optimality of the hedge algorithm in the stochastic regime. Journal of Machine Learning Research, 20:1–28, 2019.
  • Pacchiano et al. (2022) Aldo Pacchiano, Christoph Dann, and Claudio Gentile. Best of both worlds model selection. In Advances in Neural Information Processing Systems, 2022.
  • Rakhlin and Sridharan (2013) Sasha Rakhlin and Karthik Sridharan. Optimization, learning, and games with predictable sequences. Advances in Neural Information Processing Systems, 26, 2013.
  • Rouyer et al. (2021) Chloé Rouyer, Yevgeny Seldin, and Nicolo Cesa-Bianchi. An algorithm for stochastic and adversarial bandits with switching costs. In International Conference on Machine Learning, pages 9127–9135. PMLR, 2021.
  • Rouyer et al. (2022) Chloé Rouyer, Dirk van der Hoeven, Nicolò Cesa-Bianchi, and Yevgeny Seldin. A near-optimal best-of-both-worlds algorithm for online learning with feedback graphs. In Advances in Neural Information Processing Systems, 2022.
  • Saha and Gaillard (2022) Aadirupa Saha and Pierre Gaillard. Versatile dueling bandits: Best-of-both world analyses for learning from relative preferences. In International Conference on Machine Learning, pages 19011–19026. PMLR, 2022.
  • Seldin and Lugosi (2017) Yevgeny Seldin and Gábor Lugosi. An improved parametrization and analysis of the exp3++ algorithm for stochastic and adversarial bandits. In Conference on Learning Theory, pages 1743–1759. PMLR, 2017.
  • Seldin and Slivkins (2014) Yevgeny Seldin and Aleksandrs Slivkins. One practical algorithm for both stochastic and adversarial bandits. In International Conference on Machine Learning, pages 1287–1295. PMLR, 2014.
  • Tsuchiya et al. (2022) Taira Tsuchiya, Shinji Ito, and Junya Honda. Best-of-both-worlds algorithms for partial monitoring. arXiv preprint arXiv:2207.14550, 2022.
  • Wei and Luo (2018) Chen-Yu Wei and Haipeng Luo. More adaptive algorithms for adversarial bandits. In Conference On Learning Theory, pages 1263–1291. PMLR, 2018.
  • Wei et al. (2022) Chen-Yu Wei, Christoph Dann, and Julian Zimmert. A model selection approach for corruption robust reinforcement learning. In International Conference on Algorithmic Learning Theory, pages 1043–1096. PMLR, 2022.
  • Zimmert and Lattimore (2019) Julian Zimmert and Tor Lattimore. Connections between mirror descent, thompson sampling and the information ratio. Advances in Neural Information Processing Systems, 32, 2019.
  • Zimmert and Lattimore (2022) Julian Zimmert and Tor Lattimore. Return of the bias: Almost minimax optimal high probability bounds for adversarial linear bandits. In Conference on Learning Theory, pages 3285–3312. PMLR, 2022.
  • Zimmert and Seldin (2019) Julian Zimmert and Yevgeny Seldin. An optimal algorithm for stochastic and adversarial bandits. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 467–475. PMLR, 2019.
  • Zimmert et al. (2019) Julian Zimmert, Haipeng Luo, and Chen-Yu Wei. Beating stochastic and adversarial semi-bandits optimally and simultaneously. In International Conference on Machine Learning, pages 7683–7692. PMLR, 2019.

Appendix A FTRL Analysis

Lemma 27.

The optimistic FTRL algorithm over a convex set \Omega:

p_{t}=\operatorname*{argmin}_{p\in\Omega}\left\{\left\langle p,\sum_{\tau=1}^{t-1}\ell_{\tau}+m_{t}\right\rangle+\psi_{t}(p)\right\}

guarantees the following for any t^{\prime}:

\sum_{t=1}^{t^{\prime}}\langle p_{t}-u,\ell_{t}\rangle\leq\psi_{0}(u)-\min_{p\in\Omega}\psi_{0}(p)+\sum_{t=1}^{t^{\prime}}\left(\psi_{t}(u)-\psi_{t-1}(u)-\psi_{t}(p_{t})+\psi_{t-1}(p_{t})\right)+\sum_{t=1}^{t^{\prime}}\underbrace{\max_{p\in\Omega}\left(\langle p_{t}-p,\ell_{t}-m_{t}\rangle-D_{\psi_{t}}(p,p_{t})\right)}_{\text{stability}}.
Proof.

Let Ltτ=1tτL_{t}\triangleq\sum_{\tau=1}^{t}\ell_{\tau}. Define Ft(p)=p,Lt1+mt+ψt(p)F_{t}(p)=\left\langle p,L_{t-1}+m_{t}\right\rangle+\psi_{t}(p) and Gt(p)=p,Lt+ψt(p)G_{t}(p)=\left\langle p,L_{t}\right\rangle+\psi_{t}(p). Therefore, ptp_{t} is the minimizer of FtF_{t} over Ω\Omega. Let pt+1p_{t+1}^{\prime} be the minimizer of GtG_{t} over Ω\Omega. Then by the first-order optimality condition, we have

Ft(pt)Gt(pt+1)\displaystyle F_{t}(p_{t})-G_{t}(p_{t+1}^{\prime}) Ft(pt+1)Gt(pt+1)Dψt(pt+1,pt)=pt+1,tmtDψt(pt+1,pt).\displaystyle\leq F_{t}(p_{t+1}^{\prime})-G_{t}(p_{t+1}^{\prime})-D_{\psi_{t}}(p_{t+1}^{\prime},p_{t})=-\langle p_{t+1}^{\prime},\ell_{t}-m_{t}\rangle-D_{\psi_{t}}(p_{t+1}^{\prime},p_{t}). (1)

By definition, we also have

Gt(pt+1)Ft+1(pt+1)\displaystyle G_{t}(p_{t+1}^{\prime})-F_{t+1}(p_{t+1}) Gt(pt+1)Ft+1(pt+1)=ψt(pt+1)ψt+1(pt+1)pt+1,mt+1.\displaystyle\leq G_{t}(p_{t+1})-F_{t+1}(p_{t+1})=\psi_{t}(p_{t+1})-\psi_{t+1}(p_{t+1})-\langle p_{t+1},m_{t+1}\rangle. (2)

Thus,

t=1tpt,t\displaystyle\sum_{t=1}^{t^{\prime}}\langle p_{t},\ell_{t}\rangle
t=1t(pt,tpt+1,tmtDψt(pt+1,pt)+Gt(pt+1)Ft(pt))\displaystyle\leq\sum_{t=1}^{t^{\prime}}\left(\langle p_{t},\ell_{t}\rangle-\langle p_{t+1}^{\prime},\ell_{t}-m_{t}\rangle-D_{\psi_{t}}(p_{t+1}^{\prime},p_{t})+G_{t}(p_{t+1}^{\prime})-F_{t}(p_{t})\right) (by (1))
=t=1t(pt,tpt+1,tmtDψt(pt+1,pt)+Gt1(pt)Ft(pt))+Gt(pt+1)G0(p1)\displaystyle=\sum_{t=1}^{t^{\prime}}\left(\langle p_{t},\ell_{t}\rangle-\langle p_{t+1}^{\prime},\ell_{t}-m_{t}\rangle-D_{\psi_{t}}(p_{t+1}^{\prime},p_{t})+G_{t-1}(p_{t}^{\prime})-F_{t}(p_{t})\right)+G_{t^{\prime}}(p_{t^{\prime}+1}^{\prime})-G_{0}(p_{1}^{\prime})
=t=1t(pt,tpt+1,tmtDψt(pt+1,pt)ψt(pt)+ψt1(pt)pt,mt)+Gt(pt+1)G0(p1)\displaystyle=\sum_{t=1}^{t^{\prime}}\left(\langle p_{t},\ell_{t}\rangle-\langle p_{t+1}^{\prime},\ell_{t}-m_{t}\rangle-D_{\psi_{t}}(p_{t+1}^{\prime},p_{t})-\psi_{t}(p_{t})+\psi_{t-1}(p_{t})-\langle p_{t},m_{t}\rangle\right)+G_{t^{\prime}}(p_{t^{\prime}+1}^{\prime})-G_{0}(p_{1}^{\prime})
t=1t(ptpt+1,tmtDψt(pt+1,pt)ψt(pt)+ψt1(pt))+Gt(u)minpΩψ0(p)\displaystyle\leq\sum_{t=1}^{t^{\prime}}\left(\langle p_{t}-p_{t+1}^{\prime},\ell_{t}-m_{t}\rangle-D_{\psi_{t}}(p_{t+1}^{\prime},p_{t})-\psi_{t}(p_{t})+\psi_{t-1}(p_{t})\right)+G_{t^{\prime}}(u)-\min_{p\in\Omega}\psi_{0}(p) (by (2), using that pt+1p^{\prime}_{t^{\prime}+1} is the minimizer of GtG_{t^{\prime}})
=t=1t(maxpΩ{ptp,tmtDψt(p,pt)}ψt(pt)+ψt1(pt))\displaystyle=\sum_{t=1}^{t^{\prime}}\left(\max_{p\in\Omega}\left\{\langle p_{t}-p,\ell_{t}-m_{t}\rangle-D_{\psi_{t}}(p,p_{t})\right\}-\psi_{t}(p_{t})+\psi_{t-1}(p_{t})\right)
+t=1tu,t+ψt(u)minpΩψ0(p)\displaystyle\qquad+\sum_{t=1}^{t^{\prime}}\langle u,\ell_{t}\rangle+\psi_{t^{\prime}}(u)-\min_{p\in\Omega}\psi_{0}(p)
=t=1t(maxpΩ{ptp,tmtDψt(p,pt)}+(ψt(u)ψt1(u)ψt(pt)+ψt1(pt)))\displaystyle=\sum_{t=1}^{t^{\prime}}\left(\max_{p\in\Omega}\left\{\langle p_{t}-p,\ell_{t}-m_{t}\rangle-D_{\psi_{t}}(p,p_{t})\right\}+(\psi_{t}(u)-\psi_{t-1}(u)-\psi_{t}(p_{t})+\psi_{t-1}(p_{t}))\right)
+t=1tu,t+ψ0(u)minpΩψ0(p)\displaystyle\qquad+\sum_{t=1}^{t^{\prime}}\langle u,\ell_{t}\rangle+\psi_{0}(u)-\min_{p\in\Omega}\psi_{0}(p)

Re-arranging finishes the proof. ∎
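To make the update rule in Lemma 27 concrete, the following is a minimal numerical sketch (not part of the paper) of one optimistic FTRL step over the probability simplex. The log-barrier regularizer and the generic SLSQP solver are illustrative assumptions; the lemma itself covers arbitrary regularizers ψ_t and convex sets Ω.

```python
# Minimal sketch of one optimistic FTRL step over the simplex (illustrative only):
#   p_t = argmin_{p in simplex} <p, L_{t-1} + m_t> + (1/eta) * sum_i log(1/p_i).
import numpy as np
from scipy.optimize import minimize

def optimistic_ftrl_step(cum_loss, hint, eta):
    K = len(cum_loss)
    objective = lambda p: p @ (cum_loss + hint) - np.sum(np.log(p)) / eta
    constraints = ({"type": "eq", "fun": lambda p: np.sum(p) - 1.0},)
    bounds = [(1e-9, 1.0)] * K
    p0 = np.full(K, 1.0 / K)
    res = minimize(objective, p0, bounds=bounds, constraints=constraints)
    return res.x

# toy usage: three arms, cumulative estimated losses L_{t-1} and optimistic hint m_t
p_t = optimistic_ftrl_step(np.array([1.0, 2.0, 0.5]), np.array([0.2, 0.0, 0.1]), eta=0.5)
print(p_t, p_t.sum())
```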

Lemma 28.

Consider the optimistic FTRL algorithm with bonus bt𝟎b_{t}\geq\mathbf{0}:

\displaystyle p_{t}=\operatorname*{argmin}_{p\in\Omega}\left\{\left\langle p,\sum_{\tau=1}^{t-1}(\ell_{\tau}-b_{\tau})+m_{t}\right\rangle+\psi_{t}(p)\right\}

with ψt(p)=1ηtψTs(p)+1βψLo(p)\psi_{t}(p)=\frac{1}{\eta_{t}}\psi^{\scalebox{0.5}{{{Ts}}}}(p)+\frac{1}{\beta}\psi^{\scalebox{0.5}{{{Lo}}}}(p) where ψTs(p)=11αipiα\psi^{\scalebox{0.5}{{{Ts}}}}(p)=\frac{-1}{1-\alpha}\sum_{i}p_{i}^{\alpha} for some α(0,1)\alpha\in(0,1), ψLo(p)=iln1pi\psi^{\scalebox{0.5}{{{Lo}}}}(p)=\sum_{i}\ln\frac{1}{p_{i}}, and ηt\eta_{t} is non-increasing. We have

t=1tptu,t\displaystyle\sum_{t=1}^{t^{\prime}}\langle p_{t}-u,\ell_{t}\rangle 11αK1αη0+11αt=1t(1ηt1ηt1)(i=1Kpt,iα1)+Kln1δβ+1αt=1tηti=1Kpt,i2α(t,iTsxt)2\displaystyle\leq\frac{1}{1-\alpha}\frac{K^{1-\alpha}}{\eta_{0}}+\frac{1}{1-\alpha}\sum_{t=1}^{t^{\prime}}\left(\frac{1}{\eta_{t}}-\frac{1}{\eta_{t-1}}\right)\left(\sum_{i=1}^{K}p_{t,i}^{\alpha}-1\right)+\frac{K\ln\frac{1}{\delta}}{\beta}+\frac{1}{\alpha}\sum_{t=1}^{t^{\prime}}\eta_{t}\sum_{i=1}^{K}p_{t,i}^{2-\alpha}(\ell_{t,i}^{\scalebox{0.5}{{{Ts}}}}-x_{t})^{2}
+2βt=1ti=1Kpt,i2(t,iLowt)2+(1+1α)t=1tpt,btt=1tu,bt+δt=1tu+1K𝟏,tbt.\displaystyle\qquad+2\beta\sum_{t=1}^{t^{\prime}}\sum_{i=1}^{K}p_{t,i}^{2}(\ell^{\scalebox{0.5}{{{Lo}}}}_{t,i}-w_{t})^{2}+\left(1+\frac{1}{\alpha}\right)\sum_{t=1}^{t^{\prime}}\langle p_{t},b_{t}\rangle-\sum_{t=1}^{t^{\prime}}\langle u,b_{t}\rangle+\delta\sum_{t=1}^{t^{\prime}}\left\langle-u+\frac{1}{K}\mathbf{1},\ell_{t}-b_{t}\right\rangle.

for any δ(0,1)\delta\in(0,1) and any tTs,btTsK\ell^{\scalebox{0.5}{{{Ts}}}}_{t},b^{\scalebox{0.5}{{{Ts}}}}_{t}\in\mathbb{R}^{K}, tLo,btLoK\ell^{\scalebox{0.5}{{{Lo}}}}_{t},b^{\scalebox{0.5}{{{Lo}}}}_{t}\in\mathbb{R}^{K}, xt,wtx_{t},w_{t}\in\mathbb{R} such that

tTs+tLo\displaystyle\ell_{t}^{\scalebox{0.5}{{{Ts}}}}+\ell_{t}^{\scalebox{0.5}{{{Lo}}}} =tmt,\displaystyle=\ell_{t}-m_{t}, (3)
btTs+btLo\displaystyle b_{t}^{\scalebox{0.5}{{{Ts}}}}+b_{t}^{\scalebox{0.5}{{{Lo}}}} =bt\displaystyle=b_{t} (4)
ηtpt,i1α(t,iTsxt)\displaystyle\eta_{t}p_{t,i}^{1-\alpha}(\ell_{t,i}^{\scalebox{0.5}{{{Ts}}}}-x_{t}) 14,\displaystyle\geq-\frac{1}{4}, (5)
βpt,i(t,iLowt)\displaystyle\beta p_{t,i}(\ell_{t,i}^{\scalebox{0.5}{{{Lo}}}}-w_{t}) 14,\displaystyle\geq-\frac{1}{4}, (6)
0ηtpt,i1αbt,iTs\displaystyle 0\leq\eta_{t}p_{t,i}^{1-\alpha}b_{t,i}^{\scalebox{0.5}{{{Ts}}}} 14,\displaystyle\leq\frac{1}{4}, (7)
0βpt,ibt,iLo\displaystyle 0\leq\beta p_{t,i}b_{t,i}^{\scalebox{0.5}{{{Lo}}}} 14\displaystyle\leq\frac{1}{4} (8)

for all t,it,i.

Proof.

Let u=(1δ)u+δK𝟏u^{\prime}=(1-\delta)u+\frac{\delta}{K}\mathbf{1}. By Lemma 27, we have

t=1tptu,tbt\displaystyle\sum_{t=1}^{t^{\prime}}\langle p_{t}-u^{\prime},\ell_{t}-b_{t}\rangle 11α1η0i=1Kp1,iα+11αt=1t(1ηt1ηt1)i=1K(pt,iαuiα)+1βi=1Klnp1,iui\displaystyle\leq\frac{1}{1-\alpha}\frac{1}{\eta_{0}}\sum_{i=1}^{K}p^{\alpha}_{1,i}+\frac{1}{1-\alpha}\sum_{t=1}^{t^{\prime}}\left(\frac{1}{\eta_{t}}-\frac{1}{\eta_{t-1}}\right)\sum_{i=1}^{K}\left(p_{t,i}^{\alpha}-u_{i}^{\prime\alpha}\right)+\frac{1}{\beta}\sum_{i=1}^{K}\ln\frac{p_{1,i}}{u_{i}^{\prime}}
+t=1tmaxp(ptp,tbtmt1ηtDψTs(p,pt)1βDψLo(p,pt))\displaystyle\qquad+\sum_{t=1}^{t^{\prime}}\max_{p}\left(\langle p_{t}-p,\ell_{t}-b_{t}-m_{t}\rangle-\frac{1}{\eta_{t}}D_{\psi^{\scalebox{0.5}{{{Ts}}}}}(p,p_{t})-\frac{1}{\beta}D_{\psi^{\scalebox{0.5}{{{Lo}}}}}(p,p_{t})\right)
11αK1αη0+11αt=1t(1ηt1ηt1)(i=1Kpt,iα1)+Kln1δβ\displaystyle\leq\frac{1}{1-\alpha}\frac{K^{1-\alpha}}{\eta_{0}}+\frac{1}{1-\alpha}\sum_{t=1}^{t^{\prime}}\left(\frac{1}{\eta_{t}}-\frac{1}{\eta_{t-1}}\right)\left(\sum_{i=1}^{K}p_{t,i}^{\alpha}-1\right)+\frac{K\ln\frac{1}{\delta}}{\beta}
+t=1tmaxp(ptp,Tsxt𝟏12ηtDψTs(p,pt))stability-1\displaystyle\qquad+\sum_{t=1}^{t^{\prime}}\underbrace{\max_{p}\left(\langle p_{t}-p,\ell^{\scalebox{0.5}{{{Ts}}}}-x_{t}\mathbf{1}\rangle-\frac{1}{2\eta_{t}}D_{\psi^{\scalebox{0.5}{{{Ts}}}}}(p,p_{t})\right)}_{\textbf{stability-1}}
+t=1tmaxp(ptp,btTs𝟏12ηtDψTs(p,pt))stability-2\displaystyle\qquad+\sum_{t=1}^{t^{\prime}}\underbrace{\max_{p}\left(\langle p_{t}-p,-b_{t}^{\scalebox{0.5}{{{Ts}}}}\mathbf{1}\rangle-\frac{1}{2\eta_{t}}D_{\psi^{\scalebox{0.5}{{{Ts}}}}}(p,p_{t})\right)}_{\textbf{stability-2}}
+t=1tmaxp(ptp,Lowt𝟏12βDψLo(p,pt))stability-3\displaystyle\qquad+\sum_{t=1}^{t^{\prime}}\underbrace{\max_{p}\left(\langle p_{t}-p,\ell^{\scalebox{0.5}{{{Lo}}}}-w_{t}\mathbf{1}\rangle-\frac{1}{2\beta}D_{\psi^{\scalebox{0.5}{{{Lo}}}}}(p,p_{t})\right)}_{\textbf{stability-3}}
+t=1tmaxp(ptp,btLo𝟏12βDψLo(p,pt))stability-4\displaystyle\qquad+\sum_{t=1}^{t^{\prime}}\underbrace{\max_{p}\left(\langle p_{t}-p,-b_{t}^{\scalebox{0.5}{{{Lo}}}}\mathbf{1}\rangle-\frac{1}{2\beta}D_{\psi^{\scalebox{0.5}{{{Lo}}}}}(p,p_{t})\right)}_{\textbf{stability-4}}

where in the last inequality we use ptp,𝟏=0\langle p_{t}-p,\mathbf{1}\rangle=0.

By Problem 1 in Luo (2022), we can bound stability-1 by ηtαi=1Kpt,i2α(t,iTsxt)2\frac{\eta_{t}}{\alpha}\sum_{i=1}^{K}p_{t,i}^{2-\alpha}(\ell_{t,i}^{\scalebox{0.5}{{{Ts}}}}-x_{t})^{2} under the condition (5). Similarly, stability-2 can be upper bounded by ηtαi=1Kpt,i2αbt,iTs21αi=1Kpt,ibt,iTs\frac{\eta_{t}}{\alpha}\sum_{i=1}^{K}p_{t,i}^{2-\alpha}b^{\scalebox{0.5}{{{Ts}}}2}_{t,i}\leq\frac{1}{\alpha}\sum_{i=1}^{K}p_{t,i}b^{\scalebox{0.5}{{{Ts}}}}_{t,i} under the condition (7). Using Lemma 30, we can bound stability-3 by 2βi=1Kpt,i2(t,iLowt)22\beta\sum_{i=1}^{K}p_{t,i}^{2}(\ell^{\scalebox{0.5}{{{Lo}}}}_{t,i}-w_{t})^{2} under the condition (6). Similarly, we can bound stability-4 by 2βi=1Kpt,i2bt,iLo2i=1Kpt,ibt,iLo2\beta\sum_{i=1}^{K}p_{t,i}^{2}b^{\scalebox{0.5}{{{Lo}}}2}_{t,i}\leq\sum_{i=1}^{K}p_{t,i}b^{\scalebox{0.5}{{{Lo}}}}_{t,i} under (8). Collecting all terms and using the definition of uu^{\prime} finishes the proof.

Lemma 29.

Consider the optimistic FTRL algorithm with bonus bt𝟎b_{t}\geq\mathbf{0}:

\displaystyle p_{t}=\operatorname*{argmin}_{p\in\Omega}\left\{\left\langle p,\sum_{\tau=1}^{t-1}(\ell_{\tau}-b_{\tau})+m_{t}\right\rangle+\psi_{t}(p)\right\}

with ψt(p)=1ηtiln1pi\psi_{t}(p)=\frac{1}{\eta_{t}}\sum_{i}\ln\frac{1}{p_{i}}, and ηt\eta_{t} is non-increasing. We have

t=1tptu,t\displaystyle\sum_{t=1}^{t^{\prime}}\langle p_{t}-u,\ell_{t}\rangle Kln1δηt+2t=1tηti=1Kpt,i2(t,imt,ixt)2\displaystyle\leq\frac{K\ln\frac{1}{\delta}}{\eta_{t^{\prime}}}+2\sum_{t=1}^{t^{\prime}}\eta_{t}\sum_{i=1}^{K}p_{t,i}^{2}(\ell_{t,i}-m_{t,i}-x_{t})^{2}
+2t=1tpt,btt=1tu,bt+δt=1tu+1K𝟏,tbt.\displaystyle\qquad+2\sum_{t=1}^{t^{\prime}}\langle p_{t},b_{t}\rangle-\sum_{t=1}^{t^{\prime}}\langle u,b_{t}\rangle+\delta\sum_{t=1}^{t^{\prime}}\left\langle-u+\frac{1}{K}\mathbf{1},\ell_{t}-b_{t}\right\rangle.

for any δ(0,1)\delta\in(0,1) and any xtx_{t}\in\mathbb{R} if the following hold: ηtpt,i(t,imt,ixt)14\eta_{t}p_{t,i}(\ell_{t,i}-m_{t,i}-x_{t})\geq-\frac{1}{4} and ηtpt,ibt,i14\eta_{t}p_{t,i}b_{t,i}\leq\frac{1}{4} for all t,it,i.

Proof.

Let u=(1δ)u+δK𝟏u^{\prime}=(1-\delta)u+\frac{\delta}{K}\mathbf{1}. By Lemma 27 and letting η0=\eta_{0}=\infty, we have

t=1tptu,tbt\displaystyle\sum_{t=1}^{t^{\prime}}\langle p_{t}-u^{\prime},\ell_{t}-b_{t}\rangle t=1ti=1K(lnpt,iui)(1ηt1ηt1)+t=1tmaxp(ptp,tbtmtDψt(p,pt))\displaystyle\leq\sum_{t=1}^{t^{\prime}}\sum_{i=1}^{K}\left(\ln\frac{p_{t,i}}{u_{i}^{\prime}}\right)\left(\frac{1}{\eta_{t}}-\frac{1}{\eta_{t-1}}\right)+\sum_{t=1}^{t^{\prime}}\max_{p}\left(\langle p_{t}-p,\ell_{t}-b_{t}-m_{t}\rangle-D_{\psi_{t}}(p,p_{t})\right)
KlnKδηt+t=1tmaxp(ptp,tmtxt𝟏12Dψt(p,pt))stability-1\displaystyle\leq\frac{K\ln\frac{K}{\delta}}{\eta_{t^{\prime}}}+\sum_{t=1}^{t^{\prime}}\underbrace{\max_{p}\left(\langle p_{t}-p,\ell_{t}-m_{t}-x_{t}\mathbf{1}\rangle-\frac{1}{2}D_{\psi_{t}}(p,p_{t})\right)}_{\textbf{stability-1}}
+t=1tmaxp(ptp,bt12Dψt(p,pt))stability-2\displaystyle\qquad+\sum_{t=1}^{t^{\prime}}\underbrace{\max_{p}\left(\langle p_{t}-p,-b_{t}\rangle-\frac{1}{2}D_{\psi_{t}}(p,p_{t})\right)}_{\textbf{stability-2}}

where in the last inequality we use ptp,𝟏=0\langle p_{t}-p,\mathbf{1}\rangle=0.

Using Lemma 30, we can bound stability-1 by 2ηti=1Kpt,i2(t,imt,ixt)22\eta_{t}\sum_{i=1}^{K}p_{t,i}^{2}(\ell_{t,i}-m_{t,i}-x_{t})^{2}, and bound stability-2 by 2ηti=1Kpt,i2bt,i2i=1Kpt,ibt,i2\eta_{t}\sum_{i=1}^{K}p_{t,i}^{2}b_{t,i}^{2}\leq\sum_{i=1}^{K}p_{t,i}b_{t,i} under the specified conditions. Collecting all terms and using the definition of uu^{\prime} finishes the proof. ∎

Lemma 30 (Stability under log barrier).

Let \psi(p)=\frac{1}{\beta}\sum_{i}\ln\frac{1}{p_{i}}, let p_{t}\in\Delta([K]), and let \ell_{t}\in\mathbb{R}^{K} be such that \beta p_{t,i}\ell_{t,i}\geq-\frac{1}{2} for all i. Then

maxpΔ([K]){ptp,tDψ(p,pt)}iβpt,i2t,i2.\displaystyle\max_{p\in\Delta([K])}\left\{\langle p_{t}-p,\ell_{t}\rangle-D_{\psi}(p,p_{t})\right\}\leq\sum_{i}\beta p_{t,i}^{2}\ell_{t,i}^{2}.
Proof.
maxpΔ([K]){ptp,tDψ(p,pt)}maxq+K{ptq,tDψ(q,pt)}\displaystyle\max_{p\in\Delta([K])}\left\{\langle p_{t}-p,\ell_{t}\rangle-D_{\psi}(p,p_{t})\right\}\leq\max_{q\in\mathbb{R}^{K}_{+}}\left\{\langle p_{t}-q,\ell_{t}\rangle-D_{\psi}(q,p_{t})\right\}

Define f(q)=\langle p_{t}-q,\ell_{t}\rangle-D_{\psi}(q,p_{t}) and let q^{\star} be its maximizer over \mathbb{R}^{K}_{+}. Next, we verify that under the specified condition, \nabla f(q^{\star})=0. It suffices to show that there exists q\in\mathbb{R}^{K}_{+} such that \nabla f(q)=0, since if such a q exists, then it must be the maximizer of f and thus q^{\star}=q.

[f(q)]i=t,i[ψ(q)]i+[ψ(pt)]i=t,i+1βqi1βpt,i\displaystyle[\nabla f(q)]_{i}=-\ell_{t,i}-[\nabla\psi(q)]_{i}+[\nabla\psi(p_{t})]_{i}=-\ell_{t,i}+\frac{1}{\beta q_{i}}-\frac{1}{\beta p_{t,i}}

By the condition, we have \ell_{t,i}+\frac{1}{\beta p_{t,i}}>0 for all i, and so \nabla f(q)=\mathbf{0} has a solution in \mathbb{R}^{K}_{+}, namely q_{i}=\left(\frac{1}{p_{t,i}}+\beta\ell_{t,i}\right)^{-1}.

Therefore, \nabla f(q^{\star})=-\ell_{t}-\nabla\psi(q^{\star})+\nabla\psi(p_{t})=0, and we have

maxq+K{ptq,tDψ(q,pt)}=ptq,ψ(pt)ψ(q)Dψ(q,pt)=Dψ(pt,q).\displaystyle\max_{q\in\mathbb{R}^{K}_{+}}\left\{\langle p_{t}-q,\ell_{t}\rangle-D_{\psi}(q,p_{t})\right\}=\langle p_{t}-q^{\star},\nabla\psi(p_{t})-\nabla\psi(q^{\star})\rangle-D_{\psi}(q^{\star},p_{t})=D_{\psi}(p_{t},q^{\star}).

It remains to bound Dψ(pt,q)D_{\psi}(p_{t},q^{\star}), which by definition can be written as

Dψ(pt,q)=i1βh(pt,iqi)\displaystyle D_{\psi}(p_{t},q^{\star})=\sum_{i}\frac{1}{\beta}h\left(\frac{p_{t,i}}{q^{\star}_{i}}\right)

where h(x)=x1ln(x)h(x)=x-1-\ln(x). By the relation between qiq^{\star}_{i} and pt,ip_{t,i} we just derived, it holds that pt,iqi=1+βpt,it,i\frac{p_{t,i}}{q^{\star}_{i}}=1+\beta p_{t,i}\ell_{t,i}. By the fact that ln(1+x)xx2\ln(1+x)\geq x-x^{2} for all x12x\geq-\frac{1}{2}, we have

h(pt,iqi)=βpt,it,iln(1+βpt,it,i)β2pt,i2t,i2\displaystyle h\left(\frac{p_{t,i}}{q^{\star}_{i}}\right)=\beta p_{t,i}\ell_{t,i}-\ln(1+\beta p_{t,i}\ell_{t,i})\leq\beta^{2}p_{t,i}^{2}\ell_{t,i}^{2}

which gives the desired bound. ∎
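As a purely illustrative sanity check of Lemma 30 (not part of the argument), one can evaluate the objective at the closed-form maximizer q_i = (1/p_{t,i} + β ℓ_{t,i})^{-1} derived above and compare it with the stated bound:

```python
# Numerical sanity check (illustrative) of the log-barrier stability bound in Lemma 30,
# using the unconstrained maximizer q_i = (1/p_{t,i} + beta * l_i)^{-1} derived above.
import numpy as np

rng = np.random.default_rng(0)
beta, K = 0.1, 5
for _ in range(1000):
    p_t = rng.dirichlet(np.ones(K))
    l = rng.uniform(-1.0, 5.0, size=K)
    if np.any(beta * p_t * l < -0.5):        # condition of the lemma
        continue
    q = 1.0 / (1.0 / p_t + beta * l)          # maximizer over the positive orthant
    # D_psi(q, p_t) for psi(p) = (1/beta) * sum_i log(1/p_i)
    bregman = np.sum(q / p_t - 1.0 - np.log(q / p_t)) / beta
    lhs = (p_t - q) @ l - bregman
    assert lhs <= np.sum(beta * p_t**2 * l**2) + 1e-9
```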

Lemma 31 (Stability under negentropy).

If \psi is the negentropy \psi(p)=\sum_{i}p_{i}\log p_{i} and \ell_{t,i}\geq 0, then for any \eta>0 the stability is bounded by

maxp0Kptp,t1ηDψ(p,pt)η2ipt,it,i2.\displaystyle\max_{p\in\mathbb{R}_{\geq 0}^{K}}\langle p_{t}-p,\ell_{t}\rangle-\frac{1}{\eta}D_{\psi}(p,p_{t})\leq\frac{\eta}{2}\sum_{i}p_{t,i}\ell_{t,i}^{2}\,.

If t>1η\ell_{t}>-\frac{1}{\eta}, then the stability is bounded by

maxp0Kptp,t1ηDψ(p,pt)ηipt,it,i2.\displaystyle\max_{p\in\mathbb{R}_{\geq 0}^{K}}\langle p_{t}-p,\ell_{t}\rangle-\frac{1}{\eta}D_{\psi}(p,p_{t})\leq\eta\sum_{i}p_{t,i}\ell_{t,i}^{2}\,.

instead.

Proof.

Let f_{i}(p_{i})=(p_{t,i}-p_{i})\ell_{t,i}-\frac{1}{\eta}\left(p_{i}(\log p_{i}-1)-p_{i}\log p_{t,i}+p_{t,i}\right). Then we maximize \sum_{i}f_{i}(p_{i}), which is upper bounded by maximizing the expression in each coordinate over p_{i}\geq 0. We have

fi(p)=t,i1η(logplogpt,i),\displaystyle f_{i}^{\prime}(p)=-\ell_{t,i}-\frac{1}{\eta}(\log p-\log p_{t,i}),

and hence the maximum is obtained for p=pt,iexp(ηt,i)p^{\star}=p_{t,i}\exp(-\eta\ell_{t,i}). Plugging this in leads to

fi(p)\displaystyle f_{i}(p^{\star}) =pt,it,i(1exp(ηt,i))1η(pt,iexp(ηt,i)ηt,i+pt,i(1exp(ηt,i)))\displaystyle=p_{t,i}\ell_{t,i}(1-\exp(-\eta\ell_{t,i}))-\frac{1}{\eta}\left(-p_{t,i}\exp(-\eta\ell_{t,i})\eta\ell_{t,i}+p_{t,i}(1-\exp(-\eta\ell_{t,i}))\right)
=pt,it,ipt,iη(1exp(ηt,i))pt,it,ipt,iη(ηt,i12η2t,i2)=η2pt,it,i2,\displaystyle=p_{t,i}\ell_{t,i}-\frac{p_{t,i}}{\eta}(1-\exp(-\eta\ell_{t,i}))\leq p_{t,i}\ell_{t,i}-\frac{p_{t,i}}{\eta}\left(\eta\ell_{t,i}-\frac{1}{2}\eta^{2}\ell_{t,i}^{2}\right)=\frac{\eta}{2}p_{t,i}\ell_{t,i}^{2}\,,

for non-negative t\ell_{t} and

fi(p)ηpt,it,i2\displaystyle f_{i}(p^{*})\leq\eta p_{t,i}\ell_{t,i}^{2}

for \ell_{t,i}\geq-1/\eta, where we used the bounds \exp(-x)\leq 1-x+\frac{x^{2}}{2} (which holds for any x\geq 0) and \exp(-x)\leq 1-x+x^{2} (which holds for any x\geq-1), respectively. Summing over all coordinates finishes the proof. ∎
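The closed-form maximizer p^{\star}_{i}=p_{t,i}\exp(-\eta\ell_{t,i}) also lends itself to a quick numerical sanity check of the first bound (again purely illustrative, not part of the proof):

```python
# Numerical sanity check (illustrative) of the negentropy stability bound in Lemma 31
# for non-negative losses, using the maximizer p*_i = p_{t,i} * exp(-eta * l_i).
import numpy as np

rng = np.random.default_rng(1)
eta, K = 0.3, 4
for _ in range(1000):
    p_t = rng.dirichlet(np.ones(K))
    l = rng.uniform(0.0, 3.0, size=K)
    p_star = p_t * np.exp(-eta * l)
    # D_psi(p, p_t) for the negentropy psi(p) = sum_i p_i * log(p_i)
    bregman = np.sum(p_star * np.log(p_star / p_t) - p_star + p_t)
    lhs = (p_t - p_star) @ l - bregman / eta
    assert lhs <= (eta / 2) * np.sum(p_t * l**2) + 1e-9
```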

Lemma 32 (Stability of Tsallis-INF).

For the potential ψ(p)=ipiαα(1α)\psi(p)=-\sum_{i}\frac{p_{i}^{\alpha}}{\alpha(1-\alpha)}, any ptΔ([K])p_{t}\in\Delta([K]), any non-negative loss t0\ell_{t}\geq 0 and any positive learning rate η>0\eta>0, we have

\displaystyle\max_{p}\langle p-p_{t},\ell_{t}\rangle-\frac{1}{\eta}D_{\psi}(p,p_{t})\leq\frac{\eta}{2}\sum_{i}p_{t,i}^{2-\alpha}\ell_{t,i}^{2}\,.
Proof.

We upper bound this term by maximizing over p\geq 0 instead of p\in\Delta([K]). Since \ell_{t} is non-negative, the optimal p^{\star} satisfies p^{\star}_{i}\leq p_{t,i} in all components. We have

Dψ(p,pt)=i=1Kpt,ipipt,iα1pα11α𝑑p=i=1Kpt,ipipt,ip~pα2𝑑p𝑑p~\displaystyle D_{\psi}(p^{\star},p_{t})=\sum_{i=1}^{K}\int_{p_{t,i}}^{p^{\star}_{i}}\frac{p_{t,i}^{\alpha-1}-p^{\alpha-1}}{1-\alpha}\,dp=\sum_{i=1}^{K}\int_{p_{t,i}}^{p^{\star}_{i}}\int_{p_{t,i}}^{\widetilde{p}}p^{\alpha-2}\,dp\,d\widetilde{p} i=1Kpt,ipipt,ip~pt,iα2𝑑p𝑑p~\displaystyle\geq\sum_{i=1}^{K}\int_{p_{t,i}}^{p^{\star}_{i}}\int_{p_{t,i}}^{\widetilde{p}}p_{t,i}^{\alpha-2}\,dp\,d\widetilde{p}
=i=1K12(pipt,i)2pt,iα2.\displaystyle=\sum_{i=1}^{K}\frac{1}{2}(p^{\star}_{i}-p_{t,i})^{2}p_{t,i}^{\alpha-2}\,.

By AM-GM inequality, we further have

maxpppt,t1ηDψ(p,pt)maxpi((pipt,i)t,i(pipt,i)2pt,iα22η)η2ipt,i2αt,i2.\displaystyle\max_{p}\langle p-p_{t},\ell_{t}\rangle-\frac{1}{\eta}D_{\psi}(p,p_{t})\leq\max_{p}\sum_{i}\left((p_{i}-p_{t,i})\ell_{t,i}-\frac{(p_{i}-p_{t,i})^{2}p_{t,i}^{\alpha-2}}{2\eta}\right)\leq\frac{\eta}{2}\sum_{i}p_{t,i}^{2-\alpha}\ell_{t,i}^{2}\,.

Appendix B Analysis for Variance-Reduced SCRiBLe (Algorithm 3 / Theorem 3)

Algorithm 3 VR-SCRiBLe

Define: Let ψ()\psi(\cdot) be a ν\nu-self-concordant barrier of conv(𝒳)d\text{conv}(\mathcal{X})\subset\mathbb{R}^{d}.
for t=1,2,t=1,2,\ldots do

       Receive mtdm_{t}\in\mathbb{R}^{d}. Compute
wt=argminwconv(𝒳){w,τ=1t1^τ+mt+1ηtψ(w)}\displaystyle w_{t}=\operatorname*{argmin}_{w\in\text{conv}(\mathcal{X})}\left\{\left\langle w,\sum_{\tau=1}^{t-1}\widehat{\ell}_{\tau}+m_{t}\right\rangle+\frac{1}{\eta_{t}}\psi(w)\right\} (9)
where
ηt=min{116d,νlogTτ=1t1^τmτHτ12}.\displaystyle\eta_{t}=\min\left\{\frac{1}{16d},\sqrt{\frac{\nu\log T}{\sum_{\tau=1}^{t-1}\|\widehat{\ell}_{\tau}-m_{\tau}\|^{2}_{H_{\tau}^{-1}}}}\right\}.
and Ht=2ψ(wt)H_{t}=\nabla^{2}\psi(w_{t}).
Sample s_{t} uniformly from \mathbb{S}_{d}, the unit sphere in \mathbb{R}^{d}.
Define
wt+=wt+Ht12st,wt=wtHt12st.\displaystyle w_{t}^{+}=w_{t}+H_{t}^{-\frac{1}{2}}s_{t},\qquad w_{t}^{-}=w_{t}-H_{t}^{-\frac{1}{2}}s_{t}.
Find distributions qt+q_{t}^{+} and qtq_{t}^{-} over actions such that
wt+=x𝒳qt,x+x,wt=x𝒳qt,xx\displaystyle w_{t}^{+}=\sum_{x\in\mathcal{X}}q_{t,x}^{+}x,\qquad w_{t}^{-}=\sum_{x\in\mathcal{X}}q_{t,x}^{-}x
Sample Atqtqt++qt2A_{t}\sim q_{t}\triangleq\frac{q_{t}^{+}+q_{t}^{-}}{2}, receive t,At[1,1]\ell_{t,A_{t}}\in[-1,1] with 𝔼[t,x]=x,t\mathbb{E}[\ell_{t,x}]=\langle x,\ell_{t}\rangle, and define
^t=d(t,Atmt,At)(qt,At+qt,Atqt,At++qt,At)Ht12st+mt\displaystyle\widehat{\ell}_{t}=d(\ell_{t,A_{t}}-m_{t,A_{t}})\left(\frac{q_{t,A_{t}}^{+}-q_{t,A_{t}}^{-}}{q_{t,A_{t}}^{+}+q_{t,A_{t}}^{-}}\right)H_{t}^{\frac{1}{2}}s_{t}+m_{t}
where mt,x:=x,mtm_{t,x}:=\langle x,m_{t}\rangle.
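The sampling and loss-estimation step of Algorithm 3 can be sketched in code as follows. This is a schematic illustration only: the helper decompose_into_actions, which writes w_t^{+} and w_t^{-} as convex combinations of actions, is a hypothetical placeholder (the algorithm only requires that such distributions q_t^{+}, q_t^{-} exist), and the symmetric matrix square roots are one possible choice.

```python
# Schematic sketch (not the paper's implementation) of one round of VR-SCRiBLe:
# perturb w_t along a random direction, sample an action, and form the
# variance-reduced loss estimator from the algorithm box above.
import numpy as np

def vr_scrible_round(w_t, H_t, actions, m_t, observe_loss, decompose_into_actions, rng):
    d = len(w_t)
    s = rng.normal(size=d)
    s /= np.linalg.norm(s)                                   # uniform point on the unit sphere
    evals, evecs = np.linalg.eigh(H_t)
    H_sqrt = evecs @ np.diag(np.sqrt(evals)) @ evecs.T       # symmetric H_t^{1/2}
    H_inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T    # symmetric H_t^{-1/2}
    w_plus, w_minus = w_t + H_inv_sqrt @ s, w_t - H_inv_sqrt @ s
    q_plus = decompose_into_actions(w_plus, actions)         # hypothetical helper
    q_minus = decompose_into_actions(w_minus, actions)
    q = 0.5 * (q_plus + q_minus)
    idx = rng.choice(len(actions), p=q)
    x = actions[idx]
    loss = observe_loss(x)                                   # observed loss in [-1, 1]
    ratio = (q_plus[idx] - q_minus[idx]) / (q_plus[idx] + q_minus[idx])
    ell_hat = d * (loss - x @ m_t) * ratio * (H_sqrt @ s) + m_t
    return ell_hat, x
```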
Lemma 33.

In Algorithm 3, we have 𝔼[^t]=t\mathbb{E}\left[\widehat{\ell}_{t}\right]=\ell_{t} and

𝔼[^tmt2ψ(wt)2]d2𝔼[x𝒳min{pt,x,1pt,x}(t,xmt,x)2].\displaystyle\mathbb{E}\left[\left\|\widehat{\ell}_{t}-m_{t}\right\|_{\nabla^{-2}\psi(w_{t})}^{2}\right]\leq d^{2}\mathbb{E}\left[\sum_{x\in\mathcal{X}}\min\{p_{t,x},1-p_{t,x}\}(\ell_{t,x}-m_{t,x})^{2}\right].

where pt,xp_{t,x} is the probability of choosing action xx in round tt.

Proof.
𝔼[^t]\displaystyle\mathbb{E}\left[\widehat{\ell}_{t}\right] =𝔼[d(t,Atmt,At)(qt,At+qt,Atqt,At++qt,At)Ht12st+mt]\displaystyle=\mathbb{E}\left[d(\ell_{t,A_{t}}-m_{t,A_{t}})\left(\frac{q_{t,A_{t}}^{+}-q_{t,A_{t}}^{-}}{q_{t,A_{t}}^{+}+q_{t,A_{t}}^{-}}\right)H_{t}^{\frac{1}{2}}s_{t}+m_{t}\right]
=𝔼[d𝔼[(t,Atmt,At)(qt,At+qt,Atqt,At++qt,At)|st]Ht12st+mt]\displaystyle=\mathbb{E}\left[d\mathbb{E}\left[(\ell_{t,A_{t}}-m_{t,A_{t}})\left(\frac{q_{t,A_{t}}^{+}-q_{t,A_{t}}^{-}}{q_{t,A_{t}}^{+}+q_{t,A_{t}}^{-}}\right)\leavevmode\nobreak\ \bigg{|}s_{t}\right]H_{t}^{\frac{1}{2}}s_{t}+m_{t}\right] (note that qt+,qtq_{t}^{+},q_{t}^{-} depend on sts_{t})
=𝔼[d𝔼[xqt,x(t,xmt,x)(qt,x+qt,xqt,x++qt,x)|st]Ht12st+mt]\displaystyle=\mathbb{E}\left[d\mathbb{E}\left[\sum_{x}q_{t,x}(\ell_{t,x}-m_{t,x})\left(\frac{q_{t,x}^{+}-q_{t,x}^{-}}{q_{t,x}^{+}+q_{t,x}^{-}}\right)\leavevmode\nobreak\ \bigg{|}s_{t}\right]H_{t}^{\frac{1}{2}}s_{t}+m_{t}\right]
=𝔼[d𝔼[xx,tmt(qt,x+qt,x2)|st]Ht12st+mt]\displaystyle=\mathbb{E}\left[d\mathbb{E}\left[\sum_{x}\langle x,\ell_{t}-m_{t}\rangle\left(\frac{q_{t,x}^{+}-q_{t,x}^{-}}{2}\right)\leavevmode\nobreak\ \bigg{|}s_{t}\right]H_{t}^{\frac{1}{2}}s_{t}+m_{t}\right]
=𝔼[d𝔼[wt+wt,tmt2|st]Ht12st+mt]\displaystyle=\mathbb{E}\left[d\mathbb{E}\left[\frac{\langle w_{t}^{+}-w_{t}^{-},\ell_{t}-m_{t}\rangle}{2}\leavevmode\nobreak\ \bigg{|}s_{t}\right]H_{t}^{\frac{1}{2}}s_{t}+m_{t}\right]
=𝔼[dH12st,tmtHt12st+mt]\displaystyle=\mathbb{E}\left[d\langle H^{-\frac{1}{2}}s_{t},\ell_{t}-m_{t}\rangle H_{t}^{\frac{1}{2}}s_{t}+m_{t}\right]
=𝔼[dHt12ststHt12(tmt)+mt]\displaystyle=\mathbb{E}\left[dH_{t}^{\frac{1}{2}}s_{t}s_{t}^{\top}H_{t}^{-\frac{1}{2}}(\ell_{t}-m_{t})+m_{t}\right]
=t.\displaystyle=\ell_{t}. (because 𝔼[stst]=1dI\mathbb{E}[s_{t}s_{t}^{\top}]=\frac{1}{d}I)
𝔼[^tmt2ψ(wt)2]\displaystyle\mathbb{E}\left[\left\|\widehat{\ell}_{t}-m_{t}\right\|_{\nabla^{-2}\psi(w_{t})}^{2}\right] =𝔼[d2(t,Atmt,At)2(qt,At+qt,Atqt,At++qt,At)2Ht12stHt12]\displaystyle=\mathbb{E}\left[d^{2}(\ell_{t,A_{t}}-m_{t,A_{t}})^{2}\left(\frac{q_{t,A_{t}}^{+}-q_{t,A_{t}}^{-}}{q_{t,A_{t}}^{+}+q_{t,A_{t}}^{-}}\right)^{2}\left\|H_{t}^{\frac{1}{2}}s_{t}\right\|_{H_{t}^{-1}}^{2}\right]
=𝔼[d2(t,Atmt,At)2(qt,At+qt,Atqt,At++qt,At)2]\displaystyle=\mathbb{E}\left[d^{2}(\ell_{t,A_{t}}-m_{t,A_{t}})^{2}\left(\frac{q_{t,A_{t}}^{+}-q_{t,A_{t}}^{-}}{q_{t,A_{t}}^{+}+q_{t,A_{t}}^{-}}\right)^{2}\right]
=𝔼[𝔼[d2(t,Atmt,At)2(qt,At+qt,Atqt,At++qt,At)2|st]]\displaystyle=\mathbb{E}\left[\leavevmode\nobreak\ \mathbb{E}\left[d^{2}(\ell_{t,A_{t}}-m_{t,A_{t}})^{2}\left(\frac{q_{t,A_{t}}^{+}-q_{t,A_{t}}^{-}}{q_{t,A_{t}}^{+}+q_{t,A_{t}}^{-}}\right)^{2}\leavevmode\nobreak\ \Bigg{|}\leavevmode\nobreak\ s_{t}\right]\leavevmode\nobreak\ \right]
𝔼[𝔼[xqt,xd2(t,xmt,x)2|qt,x+qt,xqt,x++qt,x||st]]\displaystyle\leq\mathbb{E}\left[\leavevmode\nobreak\ \mathbb{E}\left[\sum_{x}q_{t,x}d^{2}(\ell_{t,x}-m_{t,x})^{2}\left|\frac{q_{t,x}^{+}-q_{t,x}^{-}}{q_{t,x}^{+}+q_{t,x}^{-}}\right|\leavevmode\nobreak\ \Bigg{|}\leavevmode\nobreak\ s_{t}\right]\leavevmode\nobreak\ \right]

For any xx, we have qt,x|qt,x+qt,xqt,x++qt,x|qt,xq_{t,x}\left|\frac{q_{t,x}^{+}-q_{t,x}^{-}}{q_{t,x}^{+}+q_{t,x}^{-}}\right|\leq q_{t,x} and

qt,x|qt,x+qt,xqt,x++qt,x|=|qt,x+qt,x|21qt,x++qt,x2=1qt,x.\displaystyle q_{t,x}\left|\frac{q_{t,x}^{+}-q_{t,x}^{-}}{q_{t,x}^{+}+q_{t,x}^{-}}\right|=\frac{|q_{t,x}^{+}-q_{t,x}^{-}|}{2}\leq 1-\frac{q_{t,x}^{+}+q_{t,x}^{-}}{2}=1-q_{t,x}.

Therefore, we continue to bound 𝔼[^tmt2ψ(wt)2]\mathbb{E}\left[\left\|\widehat{\ell}_{t}-m_{t}\right\|_{\nabla^{-2}\psi(w_{t})}^{2}\right] by

𝔼[𝔼[x𝒳min{qt,x,1qt,x}d2(t,xmt,x)2|st]]\displaystyle\mathbb{E}\left[\leavevmode\nobreak\ \mathbb{E}\left[\sum_{x\in\mathcal{X}}\min\{q_{t,x},1-q_{t,x}\}d^{2}(\ell_{t,x}-m_{t,x})^{2}\leavevmode\nobreak\ \Bigg{|}\leavevmode\nobreak\ s_{t}\right]\leavevmode\nobreak\ \right]
𝔼[x𝒳min{pt,x,1pt,x}d2(t,xmt,x)2].\displaystyle\leq\mathbb{E}\left[\leavevmode\nobreak\ \sum_{x\in\mathcal{X}}\min\big{\{}p_{t,x},1-p_{t,x}\big{\}}d^{2}(\ell_{t,x}-m_{t,x})^{2}\leavevmode\nobreak\ \right]. (𝔼[min(,)]min(𝔼[],𝔼[])\mathbb{E}[\min(\cdot,\cdot)]\leq\min(\mathbb{E}[\cdot],\mathbb{E}[\cdot]) and pt,x=𝔼[𝔼[qt,x|st]]p_{t,x}=\mathbb{E}[\mathbb{E}[q_{t,x}\leavevmode\nobreak\ |\leavevmode\nobreak\ s_{t}]])

Lemma 34.

If ηt116d\eta_{t}\leq\frac{1}{16d}, then maxw(wtw,^tmt1ηtDψ(w,wt))8ηt^tmt2ψ(wt)2\max_{w}\left(\langle w_{t}-w,\widehat{\ell}_{t}-m_{t}\rangle-\frac{1}{\eta_{t}}D_{\psi}(w,w_{t})\right)\leq 8\eta_{t}\|\widehat{\ell}_{t}-m_{t}\|_{\nabla^{-2}\psi(w_{t})}^{2}.

Proof.

We first show that ηt^tmt2ψ(wt)116\eta_{t}\|\widehat{\ell}_{t}-m_{t}\|_{\nabla^{-2}\psi(w_{t})}\leq\frac{1}{16}. By the definition of ^t\widehat{\ell}_{t}, we have

ηt^tmt2ψ(wt)\displaystyle\eta_{t}\left\|\widehat{\ell}_{t}-m_{t}\right\|_{\nabla^{-2}\psi(w_{t})} 116ddHt12stHt1116.\displaystyle\leq\frac{1}{16d}\cdot d\|H_{t}^{\frac{1}{2}}s_{t}\|_{H_{t}^{-1}}\leq\frac{1}{16}.

Define

F(w)\displaystyle F(w) =wtw,^tmt1ηtDψ(w,wt).\displaystyle=\langle w_{t}-w,\widehat{\ell}_{t}-m_{t}\rangle-\frac{1}{\eta_{t}}D_{\psi}(w,w_{t}).

Define \lambda=\|\widehat{\ell}_{t}-m_{t}\|_{\nabla^{-2}\psi(w_{t})} and let w^{\prime} be the maximizer of F. It suffices to show \|w^{\prime}-w_{t}\|_{\nabla^{2}\psi(w_{t})}\leq 8\eta_{t}\lambda because this leads to

F(w)wwt2ψ(wt)^tmt2ψ(wt)8ηtλ2.\displaystyle F(w^{\prime})\leq\|w^{\prime}-w_{t}\|_{\nabla^{2}\psi(w_{t})}\|\widehat{\ell}_{t}-m_{t}\|_{\nabla^{-2}\psi(w_{t})}\leq 8\eta_{t}\lambda^{2}.

To show wwt2ψ(wt)8ηtλ\|w^{\prime}-w_{t}\|_{\nabla^{2}\psi(w_{t})}\leq 8\eta_{t}\lambda, it suffices to show that for all uu such that uwt2ψ(wt)=8ηtλ\|u-w_{t}\|_{\nabla^{2}\psi(w_{t})}=8\eta_{t}\lambda, F(u)0F(u)\leq 0. To see why this is the case, notice that F(wt)=0F(w_{t})=0, and F(w)0F(w^{\prime})\geq 0 because ww^{\prime} is the maximizer of FF. Therefore, if wwt2ψ(wt)>8ηtλ\|w^{\prime}-w_{t}\|_{\nabla^{2}\psi(w_{t})}>8\eta_{t}\lambda, then there exists uu in the line segment between wtw_{t} and ww^{\prime} with uwt2ψ(wt)=8ηtλ\|u-w_{t}\|_{\nabla^{2}\psi(w_{t})}=8\eta_{t}\lambda such that F(u)0min{F(wt),F(w)}F(u)\leq 0\leq\min\{F(w_{t}),F(w^{\prime})\}, contradicting that FF is strictly concave.

Below, consider any uu with uwt2ψ(wt)=8ηtλ\|u-w_{t}\|_{\nabla^{2}\psi(w_{t})}=8\eta_{t}\lambda. By Taylor expansion, there exists uu^{\prime} in the line segment between uu and wtw_{t} such that

F(u)\displaystyle F(u) uwt2ψ(wt)^tmt2ψ(wt)12ηtuwt2ψ(u)2.\displaystyle\leq\|u-w_{t}\|_{\nabla^{2}\psi(w_{t})}\|\widehat{\ell}_{t}-m_{t}\|_{\nabla^{-2}\psi(w_{t})}-\frac{1}{2\eta_{t}}\|u-w_{t}\|_{\nabla^{2}\psi(u^{\prime})}^{2}.

Because ψ\psi is a self-concordant barrier and that uwt2ψ(wt)uwt2ψ(wt)=8ηtλ12\|u^{\prime}-w_{t}\|_{\nabla^{2}\psi(w_{t})}\leq\|u-w_{t}\|_{\nabla^{2}\psi(w_{t})}=8\eta_{t}\lambda\leq\frac{1}{2}, we have 2ψ(u)142ψ(wt)\nabla^{2}\psi(u^{\prime})\succeq\frac{1}{4}\nabla^{2}\psi(w_{t}). Continuing from the previous inequality,

F(u)\displaystyle F(u) uwt2ψ(wt)^tmt2ψ(wt)18ηtuwt2ψ(wt)2=8ηtλλ(8ηtλ)28ηt=0.\displaystyle\leq\|u-w_{t}\|_{\nabla^{2}\psi(w_{t})}\|\widehat{\ell}_{t}-m_{t}\|_{\nabla^{-2}\psi(w_{t})}-\frac{1}{8\eta_{t}}\|u-w_{t}\|_{\nabla^{2}\psi(w_{t})}^{2}=8\eta_{t}\lambda\cdot\lambda-\frac{(8\eta_{t}\lambda)^{2}}{8\eta_{t}}=0.

This concludes the proof. ∎

Proof of Theorem 3.

By the standard analysis of optimistic FTRL (Lemma 27) and Lemma 34, we have for any uu,

𝔼[t=1T(t,Att,u)]\displaystyle\mathbb{E}\left[\sum_{t=1}^{T}(\ell_{t,A_{t}}-\ell_{t,u})\right]
O(𝔼[νlogTηT+t=1Tηt^tmt2ψ(wt)2])\displaystyle\leq O\left(\mathbb{E}\left[\frac{\nu\log T}{\eta_{T}}+\sum_{t=1}^{T}\eta_{t}\left\|\widehat{\ell}_{t}-m_{t}\right\|_{\nabla^{-2}\psi(w_{t})}^{2}\right]\right)
=O(νlogT𝔼[t=1T^tmt2ψ(wt)2]+dνlog(T)).\displaystyle=O\left(\sqrt{\nu\log T\mathbb{E}\left[\sum_{t=1}^{T}\left\|\widehat{\ell}_{t}-m_{t}\right\|_{\nabla^{-2}\psi(w_{t})}^{2}\right]}+d\nu\log(T)\right). (by the tuning of ηt\eta_{t})

In the adversarial regime, using Lemma 33, we can further bound the right-hand side of the display above by

O(dνlogT𝔼[t=1T(t,Atmt,At)2]+dνlog(T))\displaystyle O\left(d\sqrt{\nu\log T\mathbb{E}\left[\sum_{t=1}^{T}(\ell_{t,A_{t}}-m_{t,A_{t}})^{2}\right]}+d\nu\log(T)\right)

When losses are non-negative and mt=𝟎m_{t}=\mathbf{0}, we can further upper bound it by

O(dνlogT𝔼[t=1Tt,At]+dνlog(T)).\displaystyle O\left(d\sqrt{\nu\log T\mathbb{E}\left[\sum_{t=1}^{T}\ell_{t,A_{t}}\right]}+d\nu\log(T)\right).

Then solving the inequality for 𝔼[t=1Tt,At]\mathbb{E}\left[\sum_{t=1}^{T}\ell_{t,A_{t}}\right], we can further get the first-order regret bound of

𝔼[t=1T(t,Att,u)]O(dνlogT𝔼[t=1Tt,u]+dνlog(T)).\displaystyle\mathbb{E}\left[\sum_{t=1}^{T}(\ell_{t,A_{t}}-\ell_{t,u})\right]\leq O\left(d\sqrt{\nu\log T\mathbb{E}\left[\sum_{t=1}^{T}\ell_{t,u}\right]}+d\nu\log(T)\right).

In the corrupted stochastic setting, using Lemma 33, we can instead bound the right-hand side of the first display of this proof by

\displaystyle O\left(d\sqrt{\nu\log(T)\,\mathbb{E}\left[\sum_{t=1}^{T}\left((1-p_{t,u})(\ell_{t,u}-m_{t,u})^{2}+\sum_{x\neq u}p_{t,x}(\ell_{t,x}-m_{t,x})^{2}\right)\right]}\right)
O(dνlog(T)𝔼[t=1T(1pt,u)]).\displaystyle\leq O\left(d\sqrt{\nu\log(T)\mathbb{E}\left[\sum_{t=1}^{T}(1-p_{t,u})\right]}\right).

Then by the self-bounding technique stated in Proposition 2, we can further bound the regret in the corrupted stochastic setting by

O(d2νlog(T)Δ+d2νlog(T)ΔC).\displaystyle O\left(\frac{d^{2}\nu\log(T)}{\Delta}+\sqrt{\frac{d^{2}\nu\log(T)}{\Delta}C}\right).
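For completeness, here is a sketch of the self-bounding calculation behind the last display. We assume, as is standard for this regime, that Proposition 2 provides a lower bound of the form \mathrm{Reg}\geq\Delta\,\mathbb{E}[\sum_{t=1}^{T}(1-p_{t,u})]-2C; the constant \kappa below is an absolute constant introduced only for this sketch. Writing Q=\mathbb{E}[\sum_{t=1}^{T}(1-p_{t,u})] and c=d\sqrt{\nu\log(T)}, the upper bound above reads \mathrm{Reg}\leq\kappa c\sqrt{Q}, so for any z\in(0,1],

\displaystyle\mathrm{Reg}=(1+z)\,\mathrm{Reg}-z\,\mathrm{Reg}\leq(1+z)\kappa c\sqrt{Q}-z\Delta Q+2zC\leq\frac{(1+z)^{2}\kappa^{2}c^{2}}{4z\Delta}+2zC\leq\frac{\kappa^{2}c^{2}}{z\Delta}+2zC,

using a\sqrt{Q}-bQ\leq\frac{a^{2}}{4b} and z\leq 1. Choosing z=\min\{1,\,c/\sqrt{\Delta C}\} gives \mathrm{Reg}=O\left(\frac{c^{2}}{\Delta}+\sqrt{\frac{c^{2}C}{\Delta}}\right), which is the bound displayed above with c^{2}=d^{2}\nu\log(T).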

Appendix C Analysis for LSB Log-Determinant FTRL (Algorithm 4 / Lemma 5)

Algorithm 4 LSB-logdet

Input: action set \mathcal{X} (with each element of the form \binom{x}{0}) and the distinguished action \widehat{x}=\binom{\bm{0}}{1}.
Define: Let π𝒳\pi_{\mathcal{X}} be John’s exploration over 𝒳\mathcal{X}. H(α):=𝔼xα[xx]H(\alpha):=\mathbb{E}_{x\sim\alpha}[xx^{\top}] ,  μα:=𝔼xα[x]\mu_{\alpha}:=\mathbb{E}_{x\sim\alpha}[x].
for t=1,2,t=1,2,\ldots do

       Let
ηt=min{14d,log(T)τ=1t1(1pτ,x^)},\displaystyle\eta_{t}=\min\left\{\frac{1}{4d},\sqrt{\frac{\log(T)}{\sum_{\tau=1}^{t-1}(1-p_{\tau,\widehat{x}})}}\right\}\,,
pt:=argminαΔ(𝒳{x^})μα,τ=1t1^τ1ηtlogdet(H(α)μαμα),\displaystyle p_{t}:=\operatorname*{argmin}_{\alpha\in\Delta(\mathcal{X}\cup\{\widehat{x}\})}\left\langle\mu_{\alpha},\sum_{\tau=1}^{t-1}\widehat{\ell}_{\tau}\right\rangle-\frac{1}{\eta_{t}}\log\det\left(H(\alpha)-\mu_{\alpha}\mu_{\alpha}^{\top}\right)\,,
p~t:=(1ηtd)pt+ηtd((1pt,x^)π𝒳+pt,x^πx^)\displaystyle\widetilde{p}_{t}:=(1-\eta_{t}d)p_{t}+\eta_{t}d((1-p_{t,\widehat{x}})\pi_{\mathcal{X}}+p_{t,\widehat{x}}\pi_{\widehat{x}})
where πx^\pi_{\widehat{x}} denotes the sampling distribution that picks x^\widehat{x} with probability 1.
Sample an action Atp~tA_{t}\sim\widetilde{p}_{t}.
Construct loss estimator:
\displaystyle\widehat{\ell}_{t}(a)=a^{\top}\Big{(}H(\widetilde{p}_{t})-\mu_{\widetilde{p}_{t}}\mu_{\widetilde{p}_{t}}^{\top}\Big{)}^{-1}\big{(}A_{t}-\mu_{\widetilde{p}_{t}}\big{)}\ell_{t}(A_{t})\,.
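As an illustration of the optimization step in Algorithm 4 (not the paper's implementation), the following sketch evaluates the log-determinant FTRL objective for a candidate distribution α over a finite action set; computing the actual argmin over Δ(𝒳∪{x̂}) would additionally require a convex solver, which is omitted here.

```python
# Illustrative sketch: evaluate the log-determinant FTRL objective of Algorithm 4,
#   <mu_alpha, L_hat> - (1/eta) * logdet( H(alpha) - mu_alpha mu_alpha^T ),
# for a candidate distribution alpha over the rows of `actions` (assumed to include x_hat).
import numpy as np

def logdet_ftrl_objective(alpha, actions, cum_loss_hat, eta):
    mu = alpha @ actions                               # mu_alpha = E_{x ~ alpha}[x]
    H = actions.T @ (alpha[:, None] * actions)         # H(alpha) = E_{x ~ alpha}[x x^T]
    cov = H - np.outer(mu, mu)
    sign, logabsdet = np.linalg.slogdet(cov)
    assert sign > 0, "alpha must make H(alpha) - mu mu^T positive definite"
    return mu @ cum_loss_hat - logabsdet / eta
```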

We begin by using the following simplifying notation. For a distribution αΔ(𝒳{x^})\alpha\in\Delta(\mathcal{X}\cup\{\widehat{x}\}), we define α𝒳\alpha_{\mathcal{X}} as the restricted distribution over 𝒳\mathcal{X} such that α𝒳α\alpha_{\mathcal{X}}\propto\alpha over 𝒳\mathcal{X}, i.e. α𝒳,x=αx1αx^\alpha_{\mathcal{X},x}=\frac{\alpha_{x}}{1-\alpha_{\widehat{x}}} for any x𝒳x\in\mathcal{X}. We further define

H(α)=𝔼xα[xx],μα=𝔼xα[x],\displaystyle H(\alpha)=\mathbb{E}_{x\sim\alpha}[xx^{\top}]\,,\quad\mu_{\alpha}=\mathbb{E}_{x\sim\alpha}[x],
Hα𝕍=H(α)μαμα,\displaystyle H^{\mathbb{V}}_{\alpha}=H(\alpha)-\mu_{\alpha}\mu_{\alpha}^{\top},
G(α)=H(α𝒳),mα=μα𝒳,\displaystyle G(\alpha)=H(\alpha_{\mathcal{X}}),\quad m_{\alpha}=\mu_{\alpha_{\mathcal{X}}},
Gα𝕍=G(α)mαmα.\displaystyle G^{\mathbb{V}}_{\alpha}=G(\alpha)-m_{\alpha}m_{\alpha}^{\top}\,.
Lemma 35.

Assume \alpha\in\Delta(\mathcal{X}\cup\{\widehat{x}\}) is such that G^{\mathbb{V}}_{\alpha} is of rank d-1 (i.e., full rank over \mathcal{X}). Then we have the following properties:

Hα𝕍=(1αx^)(Gα𝕍+αx^(mαx^)(mαx^)),\displaystyle H^{\mathbb{V}}_{\alpha}=(1-\alpha_{\widehat{x}})\left(G^{\mathbb{V}}_{\alpha}+\alpha_{\widehat{x}}(m_{\alpha}-\widehat{x})(m_{\alpha}-\widehat{x})^{\top}\right),
[Hα𝕍]1=11αx^([Gα𝕍]++[Gα𝕍]+mαx^+x^mα[Gα𝕍]++(mα[Gα𝕍]+2+1αx^)x^x^),\displaystyle[H^{\mathbb{V}}_{\alpha}]^{-1}=\frac{1}{1-\alpha_{\widehat{x}}}\left([G^{\mathbb{V}}_{\alpha}]^{+}+[G^{\mathbb{V}}_{\alpha}]^{+}m_{\alpha}\widehat{x}^{\top}+\widehat{x}m_{\alpha}^{\top}[G^{\mathbb{V}}_{\alpha}]^{+}+\left(\left\|m_{\alpha}\right\|^{2}_{[G^{\mathbb{V}}_{\alpha}]^{+}}+\frac{1}{\alpha_{\widehat{x}}}\right)\widehat{x}\widehat{x}^{\top}\right)\,,

where [Gα𝕍]+[G^{\mathbb{V}}_{\alpha}]^{+} denotes the pseudo-inverse.

Proof.

The first identity is a simple algebraic identity

Hα𝕍\displaystyle H^{\mathbb{V}}_{\alpha} =H(α)μαμα=(1αx^)G(α)+αx^x^x^((1αx^)mα+αx^x^)((1αx^)mα+αx^x^)\displaystyle=H(\alpha)-\mu_{\alpha}\mu_{\alpha}^{\top}=(1-\alpha_{\widehat{x}})G(\alpha)+\alpha_{\widehat{x}}\widehat{x}\widehat{x}^{\top}-((1-\alpha_{\widehat{x}})m_{\alpha}+\alpha_{\widehat{x}}\widehat{x})((1-\alpha_{\widehat{x}})m_{\alpha}+\alpha_{\widehat{x}}\widehat{x})^{\top}
=(1αx^)(Gα𝕍+αx^(mαx^)(mαx^)),\displaystyle=(1-\alpha_{\widehat{x}})\left(G^{\mathbb{V}}_{\alpha}+\alpha_{\widehat{x}}(m_{\alpha}-\widehat{x})(m_{\alpha}-\widehat{x})^{\top}\right)\,,

which holds for any \alpha. For the second identity, note that by the definition of \widehat{x}=\binom{\bm{0}}{1} and \forall x\in\mathcal{X}:\,\langle x,\widehat{x}\rangle=0, we have G^{\mathbb{V}}_{\alpha}\widehat{x}=[G^{\mathbb{V}}_{\alpha}]^{+}\widehat{x}=0. Furthermore, G^{\mathbb{V}}_{\alpha}[G^{\mathbb{V}}_{\alpha}]^{+}=I-\widehat{x}\widehat{x}^{\top} due to the rank d-1 assumption. Multiplying the two matrices yields

(Gα𝕍+αx^(mαx^)(mαx^))([Gα𝕍]++[Gα𝕍]+mαx^+x^mα[Gα𝕍]++(mα[Gα𝕍]+2+1αx^)x^x^)\displaystyle\left(G^{\mathbb{V}}_{\alpha}+\alpha_{\widehat{x}}(m_{\alpha}-\widehat{x})(m_{\alpha}-\widehat{x})^{\top}\right)\cdot\left([G^{\mathbb{V}}_{\alpha}]^{+}+[G^{\mathbb{V}}_{\alpha}]^{+}m_{\alpha}\widehat{x}^{\top}+\widehat{x}m_{\alpha}^{\top}[G^{\mathbb{V}}_{\alpha}]^{+}+\left(\left\|m_{\alpha}\right\|^{2}_{[G^{\mathbb{V}}_{\alpha}]^{+}}+\frac{1}{\alpha_{\widehat{x}}}\right)\widehat{x}\widehat{x}^{\top}\right)
=Ix^x^+mαx^+αx^(mαx^)([Gα𝕍]+mα+mα[Gα𝕍]+2x^[Gα𝕍]+mα(mα[Gα𝕍]+2+1αx^)x^)\displaystyle=I-\widehat{x}\widehat{x}^{\top}+m_{\alpha}\widehat{x}^{\top}+\alpha_{\widehat{x}}(m_{\alpha}-\widehat{x})\left([G^{\mathbb{V}}_{\alpha}]^{+}m_{\alpha}+\left\|m_{\alpha}\right\|^{2}_{[G^{\mathbb{V}}_{\alpha}]^{+}}\widehat{x}-[G^{\mathbb{V}}_{\alpha}]^{+}m_{\alpha}-\left(\left\|m_{\alpha}\right\|^{2}_{[G^{\mathbb{V}}_{\alpha}]^{+}}+\frac{1}{\alpha_{\widehat{x}}}\right)\widehat{x}\right)^{\top}
=Ix^x^+mαx^(mαx^)x^=I\displaystyle=I-\widehat{x}\widehat{x}^{\top}+m_{\alpha}\widehat{x}^{\top}-(m_{\alpha}-\widehat{x})\widehat{x}^{\top}=I

which implies for any x,yspan(𝒳)x,y\in\operatorname{span}(\mathcal{X})

x,[Hα𝕍]1y=x,[Gα𝕍]+y1αx^,\displaystyle\langle x,[H^{\mathbb{V}}_{\alpha}]^{-1}y\rangle=\frac{\langle x,[G^{\mathbb{V}}_{\alpha}]^{+}y\rangle}{1-\alpha_{\widehat{x}}}, (11)
x,[Hα𝕍]1(x^mα)=0,\displaystyle\langle x,[H^{\mathbb{V}}_{\alpha}]^{-1}(\widehat{x}-m_{\alpha})\rangle=0, (12)
x^mα[Hα𝕍]12=1(1αx^)αx^.\displaystyle\left\|\widehat{x}-m_{\alpha}\right\|^{2}_{[H^{\mathbb{V}}_{\alpha}]^{-1}}=\frac{1}{(1-\alpha_{\widehat{x}})\alpha_{\widehat{x}}}. (13)
Lemma 36.

Let π1,π2Δ(𝒳{x^})\pi_{1},\pi_{2}\in\Delta(\mathcal{X}\cup\{\widehat{x}\}), then for any λ[0,1]\lambda\in[0,1]:

Hλπ1+(1λ)π2𝕍λHπ1𝕍.\displaystyle H^{\mathbb{V}}_{\lambda\pi_{1}+(1-\lambda)\pi_{2}}\succeq\lambda H^{\mathbb{V}}_{\pi_{1}}\,.
Proof.

Simple algebra shows

Hλπ1+(1λ)π2𝕍=λHπ1𝕍+(1λ)Hπ2𝕍+λ(1λ)(μπ1μπ2)(μπ1μπ2).\displaystyle H^{\mathbb{V}}_{\lambda\pi_{1}+(1-\lambda)\pi_{2}}=\lambda H^{\mathbb{V}}_{\pi_{1}}+(1-\lambda)H^{\mathbb{V}}_{\pi_{2}}+\lambda(1-\lambda)(\mu_{\pi_{1}}-\mu_{\pi_{2}})(\mu_{\pi_{1}}-\mu_{\pi_{2}})^{\top}\,.

Lemma 37.

Let πΔ(𝒳)\pi\in\Delta(\mathcal{X}) be arbitrary, and π~=(1ηtd)π+ηtdπ𝒳\tilde{\pi}=(1-\eta_{t}d)\pi+\eta_{t}d\pi_{\mathcal{X}} for (ηtd)(0,1)(\eta_{t}d)\in(0,1), then it holds for any x𝒳x\in\mathcal{X}:

mπmπ~[Gπ~𝕍]+ηtd1ηtd\displaystyle\left\|m_{\pi}-m_{\tilde{\pi}}\right\|_{[G^{\mathbb{V}}_{\tilde{\pi}}]^{+}}\leq\sqrt{\frac{\eta_{t}d}{1-\eta_{t}d}}
xmπ~[Gπ~𝕍]+2ηt\displaystyle\left\|x-m_{\tilde{\pi}}\right\|_{[G^{\mathbb{V}}_{\tilde{\pi}}]^{+}}\leq\frac{2}{\sqrt{\eta_{t}}}
Proof.

We have

Gπ~𝕍=ηtdGπ𝒳𝕍+(1ηtd)Gπ𝕍+ηtd(1ηtd)(mπmπ𝒳)(mπmπ𝒳),\displaystyle G^{\mathbb{V}}_{\widetilde{\pi}}=\eta_{t}dG^{\mathbb{V}}_{\pi_{\mathcal{X}}}+(1-\eta_{t}d)G^{\mathbb{V}}_{\pi}+\eta_{t}d(1-\eta_{t}d)(m_{\pi}-m_{\pi_{\mathcal{X}}})(m_{\pi}-m_{\pi_{\mathcal{X}}})^{\top}\,,

hence

mπmπ~[Gπ~𝕍]+2\displaystyle\left\|m_{\pi}-m_{\tilde{\pi}}\right\|_{[G^{\mathbb{V}}_{\tilde{\pi}}]^{+}}^{2} =(ηtd)2mπmπ𝒳[ηtdGπ𝒳𝕍+(1ηtd)Gπ𝕍+ηtd(1ηtd)(mπmπ𝒳)(mπmπ𝒳)]12\displaystyle=(\eta_{t}d)^{2}\left\|m_{\pi}-m_{\pi_{\mathcal{X}}}\right\|_{\left[\eta_{t}dG^{\mathbb{V}}_{\pi_{\mathcal{X}}}+(1-\eta_{t}d)G^{\mathbb{V}}_{\pi}+\eta_{t}d(1-\eta_{t}d)(m_{\pi}-m_{\pi_{\mathcal{X}}})(m_{\pi}-m_{\pi_{\mathcal{X}}})^{\top}\right]^{-1}}^{2}
(ηtd)2mπmπ𝒳[ηtd(1ηtd)(mπmπ𝒳)(mπmπ𝒳)]+2=ηtd1ηtd.\displaystyle\leq(\eta_{t}d)^{2}\left\|m_{\pi}-m_{\pi_{\mathcal{X}}}\right\|^{2}_{\left[\eta_{t}d(1-\eta_{t}d)(m_{\pi}-m_{\pi_{\mathcal{X}}})(m_{\pi}-m_{\pi_{\mathcal{X}}})^{\top}\right]^{+}}=\frac{\eta_{t}d}{1-\eta_{t}d}\,.

For the second inequality, we have

xmπ~[Gπ~𝕍]+\displaystyle\left\|x-m_{\tilde{\pi}}\right\|_{[G^{\mathbb{V}}_{\tilde{\pi}}]^{+}} xmπ𝒳[Gπ~𝕍]++mπ~mπ𝒳[Gπ~𝕍]+\displaystyle\leq\left\|x-m_{\pi_{\mathcal{X}}}\right\|_{[G^{\mathbb{V}}_{\tilde{\pi}}]^{+}}+\left\|m_{\tilde{\pi}}-m_{\pi_{\mathcal{X}}}\right\|_{[G^{\mathbb{V}}_{\tilde{\pi}}]^{+}}
1ηtd(xmπ𝒳[Gπ𝒳𝕍]++mπ~mπ𝒳[Gπ𝒳𝕍]+)2ηt,\displaystyle\leq\frac{1}{\sqrt{\eta_{t}d}}\left(\left\|x-m_{\pi_{\mathcal{X}}}\right\|_{[G^{\mathbb{V}}_{\pi_{\mathcal{X}}}]^{+}}+\left\|m_{\tilde{\pi}}-m_{\pi_{\mathcal{X}}}\right\|_{[G^{\mathbb{V}}_{\pi_{\mathcal{X}}}]^{+}}\right)\leq\frac{2}{\sqrt{\eta_{t}}}\,,

where the last inequality uses that John’s exploration satisfies xmπ𝒳[Gπ𝒳𝕍]+2d\left\|x-m_{\pi_{\mathcal{X}}}\right\|^{2}_{[G^{\mathbb{V}}_{\pi_{\mathcal{X}}}]^{+}}\leq d for all x𝒳x\in\mathcal{X}. ∎

Lemma 38.

The Bregman divergence between two distributions \alpha,\beta over \mathcal{X}\cup\{\widehat{x}\} with respect to the function F(\alpha)=-\log\det\left(H^{\mathbb{V}}_{\alpha}\right) is bounded from below by

D(α,β)Dlog(αx^,βx^)+Dlog(1αx^,1βx^)+1αx^1βx^mαmβ[Gβ𝕍]+2\displaystyle D(\alpha,\beta)\geq D_{\log}(\alpha_{\widehat{x}},\beta_{\widehat{x}})+D_{\log}(1-\alpha_{\widehat{x}},1-\beta_{\widehat{x}})+\frac{1-\alpha_{\widehat{x}}}{1-\beta_{\widehat{x}}}\left\|m_{\alpha}-m_{\beta}\right\|^{2}_{[G^{\mathbb{V}}_{\beta}]^{+}}

where DlogD_{\log} is the Bregman divergence of log(x)-\log(x).

Proof.

We begin by simplifying F(α)F(\alpha). Note that

\displaystyle H^{\mathbb{V}}_{\alpha}=(1-\alpha_{\widehat{x}})\left(G^{\mathbb{V}}_{\alpha}+\alpha_{\widehat{x}}\widehat{x}\widehat{x}^{\top}\right)^{\frac{1}{2}}M\left(G^{\mathbb{V}}_{\alpha}+\alpha_{\widehat{x}}\widehat{x}\widehat{x}^{\top}\right)^{\frac{1}{2}}\,,
\displaystyle M=I+\left(\sqrt{\alpha_{\widehat{x}}}[G^{\mathbb{V}}_{\alpha}]^{+}m_{\alpha}-\widehat{x}\right)\left(\sqrt{\alpha_{\widehat{x}}}[G^{\mathbb{V}}_{\alpha}]^{+}m_{\alpha}-\widehat{x}\right)^{\top}-\widehat{x}\widehat{x}^{\top}.

MM is a matrix with d2d-2 eigenvalues of size 11, since it is the identity with two rank-1 updates. The product of the remaining eigenvalues is given by considering the determinant of the 2×22\times 2 sub-matrix with coordinates [Gα𝕍]+mα[Gα𝕍]+mα\frac{[G^{\mathbb{V}}_{\alpha}]^{+}m_{\alpha}}{\left\|[G^{\mathbb{V}}_{\alpha}]^{+}m_{\alpha}\right\|} and x^\widehat{x}. We have that

det(M)\displaystyle\det(M) =x^Mx^mα[Gα𝕍]+[Gα𝕍]+mαM[Gα𝕍]+mα[Gα𝕍]+mα(x^M[Gα𝕍]+mα[Gα𝕍]+mα)2\displaystyle=\widehat{x}^{\top}M\widehat{x}\cdot\frac{m_{\alpha}^{\top}[G^{\mathbb{V}}_{\alpha}]^{+}}{\left\|[G^{\mathbb{V}}_{\alpha}]^{+}m_{\alpha}\right\|}M\frac{[G^{\mathbb{V}}_{\alpha}]^{+}m_{\alpha}}{\left\|[G^{\mathbb{V}}_{\alpha}]^{+}m_{\alpha}\right\|}-\left(\widehat{x}^{\top}M\frac{[G^{\mathbb{V}}_{\alpha}]^{+}m_{\alpha}}{\left\|[G^{\mathbb{V}}_{\alpha}]^{+}m_{\alpha}\right\|}\right)^{2}
=1(1+αx^[Gα𝕍]+mα2)αx^[Gα𝕍]+mα2=1.\displaystyle=1\cdot\left(1+\alpha_{\widehat{x}}\left\|[G^{\mathbb{V}}_{\alpha}]^{+}m_{\alpha}\right\|^{2}\right)-\alpha_{\widehat{x}}\left\|[G^{\mathbb{V}}_{\alpha}]^{+}m_{\alpha}\right\|^{2}=1.

Hence we have

F(α)=log(1αx^)log(αx^)logdetd1((1αx^)Gα𝕍)\displaystyle F(\alpha)=-\log(1-\alpha_{\widehat{x}})-\log(\alpha_{\widehat{x}})-\log{\textstyle\det_{d-1}}((1-\alpha_{\widehat{x}})G^{\mathbb{V}}_{\alpha})

where \det_{d-1} denotes the determinant of the submatrix formed by the first d-1 coordinates. The derivative term is given by

αβ,F(β)=x𝒳{x^}(αxβx)xμβ[Hβ𝕍]12\displaystyle\langle\alpha-\beta,\nabla F(\beta)\rangle=\sum_{x\in\mathcal{X}\cup\{\widehat{x}\}}(\alpha_{x}-\beta_{x})\left\|x-\mu_{\beta}\right\|^{2}_{[H^{\mathbb{V}}_{\beta}]^{-1}}
=x𝒳{x^}αxxμβ[Hβ𝕍]12d\displaystyle=\sum_{x\in\mathcal{X}\cup\{\widehat{x}\}}\alpha_{x}\left\|x-\mu_{\beta}\right\|^{2}_{[H^{\mathbb{V}}_{\beta}]^{-1}}-d
=Tr(x𝒳{x^}αx(xμβ)(xμβ)[Hβ𝕍]1)d\displaystyle=\operatorname{Tr}\left(\sum_{x\in\mathcal{X}\cup\{\widehat{x}\}}\alpha_{x}(x-\mu_{\beta})(x-\mu_{\beta})^{\top}[H^{\mathbb{V}}_{\beta}]^{-1}\right)-d
=Tr((Hα𝕍+(μαμβ)(μαμβ))[Hβ𝕍]1)d\displaystyle=\operatorname{Tr}\left(\left(H^{\mathbb{V}}_{\alpha}+(\mu_{\alpha}-\mu_{\beta})(\mu_{\alpha}-\mu_{\beta})^{\top}\right)[H^{\mathbb{V}}_{\beta}]^{-1}\right)-d
=Tr(Hα𝕍[Hβ𝕍]1)+μαμβ[Hβ𝕍]12d.\displaystyle=\operatorname{Tr}\left(H^{\mathbb{V}}_{\alpha}[H^{\mathbb{V}}_{\beta}]^{-1}\right)+\left\|\mu_{\alpha}-\mu_{\beta}\right\|^{2}_{[H^{\mathbb{V}}_{\beta}]^{-1}}-d.

The first term is

Tr(Hα𝕍[Hβ𝕍]1)\displaystyle\operatorname{Tr}\left(H^{\mathbb{V}}_{\alpha}[H^{\mathbb{V}}_{\beta}]^{-1}\right)
=Tr((1αx^)Gα𝕍[Hβ𝕍]1)+αx^(1αx^)x^mα[Hβ𝕍]12\displaystyle=\operatorname{Tr}\left((1-\alpha_{\widehat{x}})G^{\mathbb{V}}_{\alpha}[H^{\mathbb{V}}_{\beta}]^{-1}\right)+\alpha_{\widehat{x}}(1-\alpha_{\widehat{x}})\left\|\widehat{x}-m_{\alpha}\right\|^{2}_{[H^{\mathbb{V}}_{\beta}]^{-1}} (by Lemma 35)
=Trd1(1αx^1βx^Gα𝕍[Gβ𝕍]+)+αx^(1αx^)(x^mβ[Hβ𝕍]12+mαmβ[Hβ𝕍]12)\displaystyle=\operatorname{Tr}_{d-1}\left(\frac{1-\alpha_{\widehat{x}}}{1-\beta_{\widehat{x}}}G^{\mathbb{V}}_{\alpha}[G^{\mathbb{V}}_{\beta}]^{+}\right)+\alpha_{\widehat{x}}(1-\alpha_{\widehat{x}})\left(\left\|\widehat{x}-m_{\beta}\right\|^{2}_{[H^{\mathbb{V}}_{\beta}]^{-1}}+\left\|m_{\alpha}-m_{\beta}\right\|^{2}_{[H^{\mathbb{V}}_{\beta}]^{-1}}\right) (by Lemma 35)
=Trd1(1αx^1βx^Gα𝕍[Gβ𝕍]+)+αx^(1αx^)βx^(1βx^)+αx^(1αx^)mαmβ[Gβ𝕍]+21βx^.\displaystyle=\operatorname{Tr}_{d-1}\left(\frac{1-\alpha_{\widehat{x}}}{1-\beta_{\widehat{x}}}G^{\mathbb{V}}_{\alpha}[G^{\mathbb{V}}_{\beta}]^{+}\right)+\frac{\alpha_{\widehat{x}}(1-\alpha_{\widehat{x}})}{\beta_{\widehat{x}}(1-\beta_{\widehat{x}})}+\frac{\alpha_{\widehat{x}}(1-\alpha_{\widehat{x}})\left\|m_{\alpha}-m_{\beta}\right\|^{2}_{[G^{\mathbb{V}}_{\beta}]^{+}}}{1-\beta_{\widehat{x}}}\,. (by (13))

The norm term is

μαμβ[Hβ𝕍]12\displaystyle\left\|\mu_{\alpha}-\mu_{\beta}\right\|^{2}_{[H^{\mathbb{V}}_{\beta}]^{-1}}
=μα(αx^x^+(1αx^)mβ)[Hβ𝕍]12+(αx^x^+(1αx^)mβ)μβ[Hβ𝕍]12\displaystyle=\left\|\mu_{\alpha}-(\alpha_{\widehat{x}}\widehat{x}+(1-\alpha_{\widehat{x}})m_{\beta})\right\|^{2}_{[H^{\mathbb{V}}_{\beta}]^{-1}}+\left\|(\alpha_{\widehat{x}}\widehat{x}+(1-\alpha_{\widehat{x}})m_{\beta})-\mu_{\beta}\right\|^{2}_{[H^{\mathbb{V}}_{\beta}]^{-1}}
=(1αx^)2mαmβ[Hβ𝕍]12+(αx^βx^)2x^mβ[Hβ𝕍]12\displaystyle=(1-\alpha_{\widehat{x}})^{2}\left\|m_{\alpha}-m_{\beta}\right\|^{2}_{[H^{\mathbb{V}}_{\beta}]^{-1}}+(\alpha_{\widehat{x}}-\beta_{\widehat{x}})^{2}\left\|\widehat{x}-m_{\beta}\right\|^{2}_{[H^{\mathbb{V}}_{\beta}]^{-1}}
=(1αx^)2mαmβ[Gβ𝕍]+21βx^+(αx^βx^)2βx^(1βx^).\displaystyle=\frac{(1-\alpha_{\widehat{x}})^{2}\left\|m_{\alpha}-m_{\beta}\right\|^{2}_{[G^{\mathbb{V}}_{\beta}]^{+}}}{1-\beta_{\widehat{x}}}+\frac{(\alpha_{\widehat{x}}-\beta_{\widehat{x}})^{2}}{\beta_{\widehat{x}}(1-\beta_{\widehat{x}})}\,.

Combining both terms

αβ,F(β)\displaystyle\langle\alpha-\beta,\nabla F(\beta)\rangle
=Trd1(1αx^1βx^Gα𝕍[Gβ𝕍]+)+αx^(1αx^)+(αx^βx^)2βx^(1βx^)+1αx^1βx^mαmβ[Gβ𝕍]+2d\displaystyle=\operatorname{Tr}_{d-1}\left(\frac{1-\alpha_{\widehat{x}}}{1-\beta_{\widehat{x}}}G^{\mathbb{V}}_{\alpha}[G^{\mathbb{V}}_{\beta}]^{+}\right)+\frac{\alpha_{\widehat{x}}(1-\alpha_{\widehat{x}})+(\alpha_{\widehat{x}}-\beta_{\widehat{x}})^{2}}{\beta_{\widehat{x}}(1-\beta_{\widehat{x}})}+\frac{1-\alpha_{\widehat{x}}}{1-\beta_{\widehat{x}}}\left\|m_{\alpha}-m_{\beta}\right\|^{2}_{[G^{\mathbb{V}}_{\beta}]^{+}}-d
=Trd1(1αx^1βx^Gα𝕍[Gβ𝕍]+)+αx^βx^+1αx^1βx^1+1αx^1βx^mαmβ[Gβ𝕍]+2d.\displaystyle=\operatorname{Tr}_{d-1}\left(\frac{1-\alpha_{\widehat{x}}}{1-\beta_{\widehat{x}}}G^{\mathbb{V}}_{\alpha}[G^{\mathbb{V}}_{\beta}]^{+}\right)+\frac{\alpha_{\widehat{x}}}{\beta_{\widehat{x}}}+\frac{1-\alpha_{\widehat{x}}}{1-\beta_{\widehat{x}}}-1+\frac{1-\alpha_{\widehat{x}}}{1-\beta_{\widehat{x}}}\left\|m_{\alpha}-m_{\beta}\right\|^{2}_{[G^{\mathbb{V}}_{\beta}]^{+}}-d\,.

Combining everything, we obtain

D(α,β)\displaystyle D(\alpha,\beta) =Dlog(αx^,βx^)+Dlog(1αx^,1βx^)+1αx^1βx^mαmβ[Gβ𝕍]+2+Dd1((1αx^)Gα𝕍,(1βx^)Gβ𝕍)\displaystyle=D_{\log}(\alpha_{\widehat{x}},\beta_{\widehat{x}})+D_{\log}(1-\alpha_{\widehat{x}},1-\beta_{\widehat{x}})+\frac{1-\alpha_{\widehat{x}}}{1-\beta_{\widehat{x}}}\left\|m_{\alpha}-m_{\beta}\right\|^{2}_{[G^{\mathbb{V}}_{\beta}]^{+}}+D_{d-1}\left((1-\alpha_{\widehat{x}})G^{\mathbb{V}}_{\alpha},(1-\beta_{\widehat{x}})G^{\mathbb{V}}_{\beta}\right)
Dlog(αx^,βx^)+Dlog(1αx^,1βx^)+1αx^1βx^mαmβ[Gβ𝕍]+2,\displaystyle\geq D_{\log}(\alpha_{\widehat{x}},\beta_{\widehat{x}})+D_{\log}(1-\alpha_{\widehat{x}},1-\beta_{\widehat{x}})+\frac{1-\alpha_{\widehat{x}}}{1-\beta_{\widehat{x}}}\left\|m_{\alpha}-m_{\beta}\right\|^{2}_{[G^{\mathbb{V}}_{\beta}]^{+}}\,,

where the last inequality follows from the non-negativity of Bregman divergences. ∎

Lemma 39.

For any b(0,1)b\in(0,1), any xx\in\mathbb{R} and η12|x|\eta\leq\frac{1}{2|x|}, it holds that

\displaystyle\sup_{a\in[0,1]}|a-b|x-\frac{1}{\eta}D_{\log}(a,b)\leq\eta b^{2}x^{2}\,.
Proof.

The statement is equivalent to

supa[0,1]f(a)=supa[0,1](ba)x1ηDlog(a,b)ηb2x2,\displaystyle\sup_{a\in[0,1]}f(a)=\sup_{a\in[0,1]}(b-a)x-\frac{1}{\eta}D_{\log}(a,b)\leq\eta b^{2}x^{2}\,,

since xx can take positive or negative sign. The function is concave, so setting the derivative to 0 is the optimal solution if that value lies in (0,)(0,\infty).

f(a)=x+1η(1a1b)a=b1+ηbx.\displaystyle f^{\prime}(a)=-x+\frac{1}{\eta}\left(\frac{1}{a}-\frac{1}{b}\right)\,\qquad a^{\star}=\frac{b}{1+\eta bx}\,.

Plugging this in, leads to

f(a)\displaystyle f(a^{\star}) =ηb2x21+ηbx1η(log(1+ηbx)+11+ηbx1)\displaystyle=\frac{\eta b^{2}x^{2}}{1+\eta bx}-\frac{1}{\eta}\left(\log(1+\eta bx)+\frac{1}{1+\eta bx}-1\right)
ηb2x21+ηbx1η(ηbxη2b2x2ηbx1+ηbx)=ηb2x2,\displaystyle\leq\frac{\eta b^{2}x^{2}}{1+\eta bx}-\frac{1}{\eta}\left(\eta bx-\eta^{2}b^{2}x^{2}-\frac{\eta bx}{1+\eta bx}\right)=\eta b^{2}x^{2}\,,

where the last line uses log(1+x)xx2\log(1+x)\geq x-x^{2} for any x12x\geq-\frac{1}{2}. ∎
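A quick grid-based numerical check of the scalar inequality in Lemma 39 (illustrative only; the grid over a and the random draws are assumptions of this sketch):

```python
# Illustrative grid check of Lemma 39: for b in (0,1), x in R, eta <= 1/(2|x|),
#   sup_{a in [0,1]} |a - b| * x - (1/eta) * D_log(a, b) <= eta * b^2 * x^2,
# where D_log is the Bregman divergence of -log.
import numpy as np

rng = np.random.default_rng(2)
a = np.linspace(1e-6, 1.0, 2001)
for _ in range(1000):
    b = rng.uniform(0.05, 0.95)
    x = rng.uniform(0.1, 5.0) * rng.choice([-1.0, 1.0])
    eta = rng.uniform(0.1, 1.0) / (2.0 * abs(x))
    d_log = -np.log(a) + np.log(b) + (a - b) / b
    vals = np.abs(a - b) * x - d_log / eta
    assert vals.max() <= eta * b**2 * x**2 + 1e-9
```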

Lemma 40.

The stability term

stabt:=supαΔ(𝒳{x^})μptμα,^t1ηtD(α,pt),\displaystyle\textit{stab}_{t}:=\sup_{\alpha\in\Delta(\mathcal{X}\cup\{\widehat{x}\})}\langle\mu_{p_{t}}-\mu_{\alpha},\widehat{\ell}_{t}\rangle-\frac{1}{\eta_{t}}D(\alpha,p_{t})\,,

satisfies

𝔼t[stabt]=O((1pt,x^)ηtd).\displaystyle\mathbb{E}_{t}[\textit{stab}_{t}]=O((1-p_{t,\widehat{x}})\eta_{t}d)\,.
Proof.

We have

μptμα\displaystyle\mu_{p_{t}}-\mu_{\alpha} =(pt,x^αx^)x^+(1pt,x^)mpt(1αx^)mα\displaystyle=(p_{t,\widehat{x}}-\alpha_{\widehat{x}})\widehat{x}+(1-p_{t,\widehat{x}})m_{p_{t}}-(1-\alpha_{\widehat{x}})m_{\alpha}
=(pt,x^αx^)(x^mp~t+mp~tmpt)+(1αx^)(mp~tmα).\displaystyle=(p_{t,\widehat{x}}-\alpha_{\widehat{x}})(\widehat{x}-m_{\widetilde{p}_{t}}+m_{\widetilde{p}_{t}}-m_{p_{t}})+(1-\alpha_{\widehat{x}})(m_{\widetilde{p}_{t}}-m_{\alpha})\,.

Hence for At=y𝒳A_{t}=y\in\mathcal{X}:

μptμα,^t\displaystyle\langle\mu_{p_{t}}-\mu_{\alpha},\widehat{\ell}_{t}\rangle =μptμα,[Hp~t𝕍]1(yμp~t)t,At\displaystyle=\langle\mu_{p_{t}}-\mu_{\alpha},[H^{\mathbb{V}}_{\widetilde{p}_{t}}]^{-1}(y-\mu_{\widetilde{p}_{t}})\rangle\ell_{t,A_{t}}
=(pt,x^αx^)x^mp~t,[Hp~t𝕍]1(yμp~t)t,At\displaystyle=(p_{t,\widehat{x}}-\alpha_{\widehat{x}})\langle\widehat{x}-m_{\widetilde{p}_{t}},[H^{\mathbb{V}}_{\widetilde{p}_{t}}]^{-1}(y-\mu_{\widetilde{p}_{t}})\rangle\ell_{t,A_{t}}
+(pt,x^αx^)mp~tmpt,[Hp~t𝕍]1(yμp~t)t,At\displaystyle\qquad+(p_{t,\widehat{x}}-\alpha_{\widehat{x}})\langle m_{\widetilde{p}_{t}}-m_{p_{t}},[H^{\mathbb{V}}_{\widetilde{p}_{t}}]^{-1}(y-\mu_{\widetilde{p}_{t}})\rangle\ell_{t,A_{t}}
+(1αx^)mp~tmα,[Hp~t𝕍]1(yμp~t)t,At\displaystyle\qquad+(1-\alpha_{\widehat{x}})\langle m_{\widetilde{p}_{t}}-m_{\alpha},[H^{\mathbb{V}}_{\widetilde{p}_{t}}]^{-1}(y-\mu_{\widetilde{p}_{t}})\rangle\ell_{t,A_{t}}
=(pt,x^αx^)x^mp~t,[Hp~t𝕍]1(mp~tμp~t)t,At\displaystyle=(p_{t,\widehat{x}}-\alpha_{\widehat{x}})\langle\widehat{x}-m_{\widetilde{p}_{t}},[H^{\mathbb{V}}_{\widetilde{p}_{t}}]^{-1}(m_{\widetilde{p}_{t}}-\mu_{\widetilde{p}_{t}})\rangle\ell_{t,A_{t}} (By Eq. (12))
+(pt,x^αx^)mp~tmpt,[Hp~t𝕍]1(ymp~t)t,At\displaystyle\qquad+(p_{t,\widehat{x}}-\alpha_{\widehat{x}})\langle m_{\widetilde{p}_{t}}-m_{p_{t}},[H^{\mathbb{V}}_{\widetilde{p}_{t}}]^{-1}(y-m_{\widetilde{p}_{t}})\rangle\ell_{t,A_{t}}
+(1αx^)mp~tmα,[Hp~t𝕍]1(ymp~t)t,At\displaystyle\qquad+(1-\alpha_{\widehat{x}})\langle m_{\widetilde{p}_{t}}-m_{\alpha},[H^{\mathbb{V}}_{\widetilde{p}_{t}}]^{-1}(y-m_{\widetilde{p}_{t}})\rangle\ell_{t,A_{t}}
|αx^pt,x^|1pt,x^+|pt,x^αx^|1pt,x^|mp~tmpt,[Hp~t𝕍]1(ymp~t)|\displaystyle\leq\frac{|\alpha_{\widehat{x}}-p_{t,\widehat{x}}|}{1-p_{t,\widehat{x}}}+\frac{|p_{t,\widehat{x}}-\alpha_{\widehat{x}}|}{1-p_{t,\widehat{x}}}|\langle m_{\widetilde{p}_{t}}-m_{p_{t}},[H^{\mathbb{V}}_{\widetilde{p}_{t}}]^{-1}(y-m_{\widetilde{p}_{t}})\rangle|
+1αx^1pt,x^|mp~tmα,[Gp~t𝕍]+(ymp~t)|\displaystyle+\quad\frac{1-\alpha_{\widehat{x}}}{1-p_{t,\widehat{x}}}|\langle m_{\widetilde{p}_{t}}-m_{\alpha},[G^{\mathbb{V}}_{\widetilde{p}_{t}}]^{+}(y-m_{\widetilde{p}_{t}})\rangle| (By Eq. (13) and Eq. (11))
|pt,x^αx^|1pt,x^(1+2d)\displaystyle\leq\frac{|p_{t,\widehat{x}}-\alpha_{\widehat{x}}|}{1-p_{t,\widehat{x}}}(1+2\sqrt{d})
+43×1αx^1pt,x^mptmα[Gpt𝕍]+ymp~t)[Gp~t𝕍]+\displaystyle\qquad+\frac{4}{3}\times\frac{1-\alpha_{\widehat{x}}}{1-p_{t,\widehat{x}}}\left\|m_{p_{t}}-m_{\alpha}\right\|_{[G^{\mathbb{V}}_{p_{t}}]^{+}}\left\|y-m_{\widetilde{p}_{t}})\right\|_{[G^{\mathbb{V}}_{\widetilde{p}_{t}}]^{+}} (Cauchy-Schwarz and Lemma 37, Lemma 36)

Analogously, for At=x^A_{t}=\widehat{x},

\displaystyle\langle\mu_{p_{t}}-\mu_{\alpha},\widehat{\ell}_{t}\rangle=\langle\mu_{p_{t}}-\mu_{\alpha},[H^{\mathbb{V}}_{\widetilde{p}_{t}}]^{-1}(\widehat{x}-\mu_{\widetilde{p}_{t}})\rangle\ell_{t,A_{t}}
\displaystyle=(p_{t,\widehat{x}}-\alpha_{\widehat{x}})\langle\widehat{x}-m_{\widetilde{p}_{t}},[H^{\mathbb{V}}_{\widetilde{p}_{t}}]^{-1}(\widehat{x}-\mu_{\widetilde{p}_{t}})\rangle\ell_{t,A_{t}}
|pt,x^αx^|pt,x^.\displaystyle\leq\frac{|p_{t,\widehat{x}}-\alpha_{\widehat{x}}|}{p_{t,\widehat{x}}}\,.

Hence the stability term for At𝒳A_{t}\in\mathcal{X}, is bounded by

supαΔ(𝒳{x^})μptμα,^t1ηtD(α,pt)\displaystyle\sup_{\alpha\in\Delta(\mathcal{X}\cup\{\widehat{x}\})}\langle\mu_{p_{t}}-\mu_{\alpha},\widehat{\ell}_{t}\rangle-\frac{1}{\eta_{t}}D(\alpha,p_{t})
supαΔ(𝒳{x^})|αx^pt,x^|1pt,x^(1+2d)1ηtD(1αx^,1pt,x^)\displaystyle\leq\sup_{\alpha\in\Delta(\mathcal{X}\cup\{\widehat{x}\})}\frac{|\alpha_{\widehat{x}}-p_{t,\widehat{x}}|}{1-p_{t,\widehat{x}}}(1+2\sqrt{d})-\frac{1}{\eta_{t}}D(1-\alpha_{\widehat{x}},1-p_{t,\widehat{x}})
+1αx^1pt,x^(43mαmp~t[Gp~t𝕍]+ymp~t[Gp~t𝕍]+1ηtmαmp~t[Gp~t𝕍]+2)\displaystyle\qquad+\frac{1-\alpha_{\widehat{x}}}{1-p_{t,\widehat{x}}}\left(\frac{4}{3}\left\|m_{\alpha}-m_{\widetilde{p}_{t}}\right\|_{[G^{\mathbb{V}}_{\widetilde{p}_{t}}]^{+}}\left\|y-m_{\widetilde{p}_{t}}\right\|_{[G^{\mathbb{V}}_{\widetilde{p}_{t}}]^{+}}-\frac{1}{\eta_{t}}\left\|m_{\alpha}-m_{\widetilde{p}_{t}}\right\|_{[G^{\mathbb{V}}_{\widetilde{p}_{t}}]^{+}}^{2}\right)
ηtymp~t[Gp~t𝕍]+2+supαx^[0,1]|αx^pt,x^|1pt,x^(1+2d)\displaystyle\leq\eta_{t}\left\|y-m_{\widetilde{p}_{t}}\right\|^{2}_{[G^{\mathbb{V}}_{\widetilde{p}_{t}}]^{+}}+\sup_{\alpha_{\widehat{x}}\in[0,1]}\frac{|\alpha_{\widehat{x}}-p_{t,\widehat{x}}|}{1-p_{t,\widehat{x}}}(1+2\sqrt{d})
+ηt|αx^pt,x^|1pt,x^ymp~t[Gp~t𝕍]+21ηtD(1αx^,1pt,x^)\displaystyle\qquad+\eta_{t}\frac{|\alpha_{\widehat{x}}-p_{t,\widehat{x}}|}{1-p_{t,\widehat{x}}}\left\|y-m_{\widetilde{p}_{t}}\right\|^{2}_{[G^{\mathbb{V}}_{\widetilde{p}_{t}}]^{+}}-\frac{1}{\eta_{t}}D(1-\alpha_{\widehat{x}},1-p_{t,\widehat{x}}) (AM-GM inequality)
ηtymp~t[Gp~t𝕍]+2\displaystyle\leq\eta_{t}\left\|y-m_{\widetilde{p}_{t}}\right\|^{2}_{[G^{\mathbb{V}}_{\widetilde{p}_{t}}]^{+}}
+supαx^[0,1]|αx^pt,x^|1pt,x^(5+2d)1ηtD(1αx^,1pt,x^)\displaystyle\qquad+\sup_{\alpha_{\widehat{x}}\in[0,1]}\frac{|\alpha_{\widehat{x}}-p_{t,\widehat{x}}|}{1-p_{t,\widehat{x}}}(5+2\sqrt{d})-\frac{1}{\eta_{t}}D(1-\alpha_{\widehat{x}},1-p_{t,\widehat{x}}) (Lemma 37)
ηtymp~t[Gp~t𝕍]+2+O(ηtd).\displaystyle\leq\eta_{t}\left\|y-m_{\widetilde{p}_{t}}\right\|^{2}_{[G^{\mathbb{V}}_{\widetilde{p}_{t}}]^{+}}+O(\eta_{t}d)\,. (Lemma 39)

Taking the expectation over yp~t,𝒳y\sim\widetilde{p}_{t,\mathcal{X}} leads to

𝔼Atp~t,𝒳[stabt]=O(ηtd)\displaystyle\mathbb{E}_{A_{t}\sim\widetilde{p}_{t,\mathcal{X}}}\left[\textit{stab}_{t}\right]=O(\eta_{t}d)

For At=x^A_{t}=\widehat{x} we have two cases. If pt,x^12p_{t,\widehat{x}}\geq\frac{1}{2}, then by Lemma 39,

stabt\displaystyle\textit{stab}_{t} supαx^[0,1]|pt,x^αx^|pt,x^1ηtDlog(1αx^,1pt,x^)O(ηt(1pt,x^)2×1pt,x^2)O(ηt(1pt,x^)).\displaystyle\leq\sup_{\alpha_{\widehat{x}}\in[0,1]}\frac{|p_{t,\widehat{x}}-\alpha_{\widehat{x}}|}{p_{t,\widehat{x}}}-\frac{1}{\eta_{t}}D_{\log}(1-\alpha_{\widehat{x}},1-p_{t,\widehat{x}})\leq O\left(\eta_{t}(1-p_{t,\widehat{x}})^{2}\times\frac{1}{p_{t,\widehat{x}}^{2}}\right)\leq O\left(\eta_{t}(1-p_{t,\widehat{x}})\right)\,.

Otherwise, if 1pt,x^121-p_{t,\widehat{x}}\geq\frac{1}{2}, then by Lemma 39,

stabt\displaystyle\textit{stab}_{t} supαx^[0,1]|pt,x^αx^|pt,x^1ηtDlog(αx^,pt,x^)O(ηtpt,x^2×1pt,x^2)O(ηt(1pt,x^)).\displaystyle\leq\sup_{\alpha_{\widehat{x}}\in[0,1]}\frac{|p_{t,\widehat{x}}-\alpha_{\widehat{x}}|}{p_{t,\widehat{x}}}-\frac{1}{\eta_{t}}D_{\log}(\alpha_{\widehat{x}},p_{t,\widehat{x}})\leq O\left(\eta_{t}p_{t,\widehat{x}}^{2}\times\frac{1}{p_{t,\widehat{x}}^{2}}\right)\leq O(\eta_{t}(1-p_{t,\widehat{x}}))\,.

Finally we have

𝔼t[stabt]=(1pt,x^)𝔼Atpt,𝒳[stabt]+pt,x^𝔼Atπx^[stabt]=O((1pt,x^)ηtd)\displaystyle\mathbb{E}_{t}\left[\textit{stab}_{t}\right]=(1-p_{t,\widehat{x}})\mathbb{E}_{A_{t}\sim p_{t,\mathcal{X}}}[\textit{stab}_{t}]+p_{t,\widehat{x}}\mathbb{E}_{A_{t}\sim\pi_{\widehat{x}}}[\textit{stab}_{t}]=O((1-p_{t,\widehat{x}})\eta_{t}d)

Proof of Lemma 5.

By standard analysis of FTRL (Lemma 27) and Lemma 40, for any τ\tau and xx

t=1τ𝔼t[pt,^t^t(x)]dlogTητ+t=1τO(ηt(1pt,x^)d)O(dlogTt=1τ(1p~t,x^)),\displaystyle\sum_{t=1}^{\tau}\mathbb{E}_{t}\left[\langle p_{t},\widehat{\ell}_{t}\rangle-\widehat{\ell}_{t}(x)\right]\leq\frac{d\log T}{\eta_{\tau}}+\sum_{t=1}^{\tau}O(\eta_{t}(1-p_{t,\widehat{x}})d)\leq O\left(\sqrt{d\log T\sum_{t=1}^{\tau}(1-\widetilde{p}_{t,\widehat{x}})}\right)\,,

where the last inequality follows from the definition of the learning rate and the fact that p~t,x^=pt,x^\widetilde{p}_{t,\widehat{x}}=p_{t,\widehat{x}}. Additionally, we have

𝔼t[p~tpt,^t]=p~tpt,tηtd(1pt,x^),\displaystyle\mathbb{E}_{t}\left[\langle\widetilde{p}_{t}-p_{t},\widehat{\ell}_{t}\rangle\right]=\langle\widetilde{p}_{t}-p_{t},\ell_{t}\rangle\leq\eta_{t}d(1-p_{t,\widehat{x}})\,,

so that

t=1τ𝔼t[p~t,^t^t(x)]=O(dlogTt=1τ(1p~t,x^)).\displaystyle\sum_{t=1}^{\tau}\mathbb{E}_{t}\left[\langle\widetilde{p}_{t},\widehat{\ell}_{t}\rangle-\widehat{\ell}_{t}(x)\right]=O\left(\sqrt{d\log T\sum_{t=1}^{\tau}(1-\widetilde{p}_{t,\widehat{x}})}\right)\,.

Taking expectations on both sides finishes the proof. ∎

Appendix D Analysis for the First Reduction

D.1 BOBW to LSB (Algorithm 1 / Theorem 6)

Proof of Theorem 6.

We use τk=Tk+1Tk\tau_{k}=T_{k+1}-T_{k} to denote the length of epoch kk, and let nn be the last epoch (define Tn+1=TT_{n+1}=T). Also, we use 𝔼t[]\mathbb{E}_{t}[\cdot] to denote expectation conditioned on all history up to time tt. We first consider the adversarial regime. By the definition of local-self-bounding algorithms, we have for any uu,

𝔼Tk[t=Tk+1Tk+1(t,Att,u)]c01α𝔼Tk[τk]α+c2log(T),\displaystyle\mathbb{E}_{T_{k}}\left[\sum_{t=T_{k}+1}^{T_{k+1}}\left(\ell_{t,A_{t}}-\ell_{t,u}\right)\right]\leq c_{0}^{1-\alpha}\mathbb{E}_{T_{k}}[\tau_{k}]^{\alpha}+c_{2}\log(T),

which implies

𝔼[t=Tk+1Tk+1(t,Att,u)]c01α𝔼[τk]α+c2log(T)\displaystyle\mathbb{E}\left[\sum_{t=T_{k}+1}^{T_{k+1}}\left(\ell_{t,A_{t}}-\ell_{t,u}\right)\right]\leq c_{0}^{1-\alpha}\mathbb{E}[\tau_{k}]^{\alpha}+c_{2}\log(T)

using Jensen's inequality $\mathbb{E}[x^{\alpha}]\leq\mathbb{E}[x]^{\alpha}$ for $x\in\mathbb{R}_{+}$ and $0<\alpha<1$. Summing the bounds over $k=1,2,\ldots,n$, and using the fact that $\tau_{k}\geq 2\tau_{k-1}$ for all $k<n$, we get

𝔼[t=1T(t,Att,u)]\displaystyle\mathbb{E}\left[\sum_{t=1}^{T}\left(\ell_{t,A_{t}}-\ell_{t,u}\right)\right]
c01α(𝔼[τn]α+𝔼[τn1]α(1+12α+122α+))+c2log(T)log2(Tc2log(T))\displaystyle\leq c_{0}^{1-\alpha}\left(\mathbb{E}\left[\tau_{n}\right]^{\alpha}+\mathbb{E}\left[\tau_{n-1}\right]^{\alpha}\left(1+\frac{1}{2^{\alpha}}+\frac{1}{2^{2\alpha}}+\cdots\right)\right)+c_{2}\log(T)\log_{2}\left(\frac{T}{c_{2}\log(T)}\right)
O(c01αTα+c2log2(T)).\displaystyle\leq O\left(c_{0}^{1-\alpha}T^{\alpha}+c_{2}\log^{2}(T)\right). (14)

The same analysis also gives

𝔼[t=1T(t,Att,u)]O((c1logT)1αTα+c2log2(T)).\displaystyle\mathbb{E}\left[\sum_{t=1}^{T}\left(\ell_{t,A_{t}}-\ell_{t,u}\right)\right]\leq O\left((c_{1}\log T)^{1-\alpha}T^{\alpha}+c_{2}\log^{2}(T)\right). (15)

Next, consider the corrupted stochastic regime. We first argue that it suffices to consider the regret comparator xx^{\star}. This is because if xargminu𝒳𝔼[t=1Tt,u]x^{\star}\neq\operatorname*{argmin}_{u\in\mathcal{X}}\mathbb{E}\left[\sum_{t=1}^{T}\ell_{t,u}\right], then it holds that CTΔC\geq T\Delta. Then the right-hand side of (15) is further upper bounded by

O((c1logT)1α(CΔ1)α+c2log(T)log(CΔ1)),\displaystyle O\left((c_{1}\log T)^{1-\alpha}(C\Delta^{-1})^{\alpha}+c_{2}\log(T)\log(C\Delta^{-1})\right),

which fulfills the requirement for the stochastic regime. Below, we focus on bounding the pseudo-regret with respect to xx^{\star}.

Let m=max{k|x^kx}m=\max\{k\in\mathbb{N}\,|\,\widehat{x}_{k}\neq x^{\star}\}. Notice that mm is either the last epoch (i.e., m=nm=n) or the second last (i.e., m=n1m=n-1), because for any two consecutive epochs, at least one of them must have x^kx\widehat{x}_{k}\neq x^{\star}. Below we show that |{t[Tm+1]|Atx}|Tm+182τ0|\{t\in[T_{m+1}]\,|\,A_{t}\neq x^{\star}\}|\geq\frac{T_{m+1}}{8}-2\tau_{0}.

If τm>2τm1\tau_{m}>2\tau_{m-1}, by the fact that the termination condition was not triggered one round earlier, the number of plays Nm(x)N_{m}(x^{\star}) in epoch mm is at most τm12+1=τm+12\frac{\tau_{m}-1}{2}+1=\frac{\tau_{m}+1}{2}; in other words, the number of times t=Tm+1Tm+1𝕀{Atx}\sum_{t=T_{m}+1}^{T_{m+1}}\mathbb{I}\{A_{t}\neq x^{\star}\} is at least τm12\frac{\tau_{m}-1}{2}. Further notice that because τm>2τm14τm2>\tau_{m}>2\tau_{m-1}\geq 4\tau_{m-2}>\cdots, we have τm1214k=1mτk12=Tm+1412\frac{\tau_{m}-1}{2}\geq\frac{1}{4}\sum_{k=1}^{m}\tau_{k}-\frac{1}{2}=\frac{T_{m+1}}{4}-\frac{1}{2}.

Now consider the case m>1m>1 and τm2τm1\tau_{m}\leq 2\tau_{m-1}. Recall that x^m\widehat{x}_{m} is the action with Nm1(x^m)τm12N_{m-1}(\widehat{x}_{m})\geq\frac{\tau_{m-1}}{2}. This implies that t=Tm1+1Tm𝕀{Atx}τm1212max{τm2,12k=1m1τk}18k=1mτk=Tm+18\sum_{t=T_{m-1}+1}^{T_{m}}\mathbb{I}\{A_{t}\neq x^{\star}\}\geq\frac{\tau_{m-1}}{2}\geq\frac{1}{2}\max\left\{\frac{\tau_{m}}{2},\frac{1}{2}\sum_{k=1}^{m-1}\tau_{k}\right\}\geq\frac{1}{8}\sum_{k=1}^{m}\tau_{k}=\frac{T_{m+1}}{8}.

Finally, if $m=1$ and $\tau_{1}\leq 2\tau_{0}$, then $T_{2}-2\tau_{0}\leq 0$ and the statement holds trivially.

The regret up to and including epoch mm can be lower and upper bounded using the self-bounding technique:

𝔼[t=1Tm+1(t,Att,x)]=(1+λ)𝔼[t=1Tm+1(t,Att,x)]λ𝔼[t=1Tm+1(t,Att,x)]\displaystyle\mathbb{E}\left[\sum_{t=1}^{T_{m+1}}(\ell_{t,A_{t}}-\ell_{t,x^{\star}})\right]=(1+\lambda)\mathbb{E}\left[\sum_{t=1}^{T_{m+1}}(\ell_{t,A_{t}}-\ell_{t,x^{\star}})\right]-\lambda\mathbb{E}\left[\sum_{t=1}^{T_{m+1}}(\ell_{t,A_{t}}-\ell_{t,x^{\star}})\right] (for 0λ10\leq\lambda\leq 1)
\displaystyle\leq O\left((c_{1}\log T)^{1-\alpha}\mathbb{E}\left[T_{m+1}\right]^{\alpha}+c_{2}\log(T)\log\left(\frac{\mathbb{E}[T_{m+1}]}{c_{2}\log(T)}\right)\right)-\lambda\left(\left(\frac{1}{8}\mathbb{E}\left[T_{m+1}\right]-c_{2}\log(T)\right)\Delta-C\right) (the first term follows from a calculation similar to (14), but replacing $T$ by $T_{m+1}$ and $c_{0}$ by $c_{1}\log T$)
O(c1log(T)Δα1α+(c1logT)1α(CΔ1)α+c2log(T)log(CΔ1))\displaystyle\leq O\left(c_{1}\log(T)\Delta^{-\frac{\alpha}{1-\alpha}}+(c_{1}\log T)^{1-\alpha}\left(C\Delta^{-1}\right)^{\alpha}+c_{2}\log(T)\log(C\Delta^{-1})\right)

where in the last inequality we use Lemma 41.

If mm is not the last epoch, then it holds that x^n=x\widehat{x}_{n}=x^{\star} for the final epoch nn. In this case, the regret in the final epoch is

𝔼[t=Tn+1T(t,Att,x)]=(1+λ)𝔼[t=Tn+1T(t,Att,x)]λ𝔼[t=Tn+1T(t,Att,x)]\displaystyle\mathbb{E}\left[\sum_{t=T_{n}+1}^{T}(\ell_{t,A_{t}}-\ell_{t,x^{\star}})\right]=(1+\lambda)\mathbb{E}\left[\sum_{t=T_{n}+1}^{T}(\ell_{t,A_{t}}-\ell_{t,x^{\star}})\right]-\lambda\mathbb{E}\left[\sum_{t=T_{n}+1}^{T}(\ell_{t,A_{t}}-\ell_{t,x^{\star}})\right]
O((c1logT)1α𝔼[t=Tn+1T(1pt,x)]α)λ(𝔼[t=Tn+1T(1pt,x)]ΔC)+c2log(T)\displaystyle\leq O\left((c_{1}\log T)^{1-\alpha}\mathbb{E}\left[\sum_{t=T_{n}+1}^{T}(1-p_{t,x^{\star}})\right]^{\alpha}\right)-\lambda\left(\mathbb{E}\left[\sum_{t=T_{n}+1}^{T}(1-p_{t,x^{\star}})\right]\Delta-C\right)+c_{2}\log(T)
\displaystyle\leq O\left(c_{1}\log(T)\Delta^{-\frac{\alpha}{1-\alpha}}+(c_{1}\log T)^{1-\alpha}\left(C\Delta^{-1}\right)^{\alpha}\right)+c_{2}\log(T). (Lemma 41)

When $m=n$, the bound up to epoch $m$ already covers the whole horizon; otherwise, summing it with the bound for the final epoch yields the claimed guarantee for the corrupted stochastic regime, which completes the proof. ∎

Lemma 41.

For Δ(0,1]\Delta\in(0,1] and c,c,X1c,c^{\prime},X\geq 1 and C0C\geq 0, we have

minλ[0,1]{12c1αXα+12clogXλ(XΔC)}cΔα1α+2c1α(CΔ)α+2clog(1+c+CΔ)\displaystyle\min_{\lambda\in[0,1]}\left\{\frac{1}{2}c^{1-\alpha}X^{\alpha}+\frac{1}{2}c^{\prime}\log X-\lambda(X\Delta-C)\right\}\leq c\Delta^{-\frac{\alpha}{1-\alpha}}+2c^{1-\alpha}\left(\frac{C}{\Delta}\right)^{\alpha}+2c^{\prime}\log\left(1+\frac{c^{\prime}+C}{\Delta}\right)
Proof.

If $c^{1-\alpha}X^{\alpha}\geq c^{\prime}\log X$, we have

12c1αXα+12clogXc1αXαλXΔ+cλα1αΔα1α.\displaystyle\frac{1}{2}c^{1-\alpha}X^{\alpha}+\frac{1}{2}c^{\prime}\log X\leq c^{1-\alpha}X^{\alpha}\leq\lambda X\Delta+c\lambda^{-\frac{\alpha}{1-\alpha}}\Delta^{-\frac{\alpha}{1-\alpha}}.

where the last inequality is by the weighted AM-GM inequality. Therefore,

minλ[0,1]{12c1αXα+12clogXλ(XΔC)}minλ[0,1]cλα1αΔα1α+Cλ.\displaystyle\min_{\lambda\in[0,1]}\left\{\frac{1}{2}c^{1-\alpha}X^{\alpha}+\frac{1}{2}c^{\prime}\log X-\lambda(X\Delta-C)\right\}\leq\min_{\lambda\in[0,1]}c\lambda^{-\frac{\alpha}{1-\alpha}}\Delta^{-\frac{\alpha}{1-\alpha}}+C\lambda.

Choosing λ=min{1,c1αC(1α)Δα}\lambda=\min\{1,c^{1-\alpha}C^{-(1-\alpha)}\Delta^{-\alpha}\}, we bound the last expression by

cmax{1,(c1αC(1α)Δα)α1α}Δα1α+c1αCαΔα\displaystyle c\max\left\{1,\left(c^{1-\alpha}C^{-(1-\alpha)}\Delta^{-\alpha}\right)^{-\frac{\alpha}{1-\alpha}}\right\}\Delta^{-\frac{\alpha}{1-\alpha}}+c^{1-\alpha}C^{\alpha}\Delta^{-\alpha}
cmax{1,cαCαΔα21α}Δα1α+c1αCαΔα\displaystyle\leq c\max\left\{1,c^{-\alpha}C^{\alpha}\Delta^{\frac{\alpha^{2}}{1-\alpha}}\right\}\Delta^{-\frac{\alpha}{1-\alpha}}+c^{1-\alpha}C^{\alpha}\Delta^{-\alpha}
cΔα1α+2c1αCαΔα.\displaystyle\leq c\Delta^{-\frac{\alpha}{1-\alpha}}+2c^{1-\alpha}C^{\alpha}\Delta^{-\alpha}.

If $c^{1-\alpha}X^{\alpha}\leq c^{\prime}\log X$, we have

12c1αXα+12clogXλ(XΔC)clogXλXΔ+λCclog(1+c2λ2Δ2)+λC\displaystyle\frac{1}{2}c^{1-\alpha}X^{\alpha}+\frac{1}{2}c^{\prime}\log X-\lambda(X\Delta-C)\leq c^{\prime}\log X-\lambda X\Delta+\lambda C\leq c^{\prime}\log\left(1+\frac{c^{\prime 2}}{\lambda^{2}\Delta^{2}}\right)+\lambda C

where the last inequality is because if Xc2λ2Δ2X\geq\frac{c^{\prime 2}}{\lambda^{2}\Delta^{2}} then clogXλXΔclogXcX<0c^{\prime}\log X-\lambda X\Delta\leq c^{\prime}\log X-c^{\prime}\sqrt{X}<0. Choosing λ=min{1,cC}\lambda=\min\{1,\frac{c^{\prime}}{C}\}, we bound the last expression by clog(1+(c2+C2)/Δ2)2clog(1+(c+C)/Δ)c^{\prime}\log(1+(c^{\prime 2}+C^{2})/\Delta^{2})\leq 2c^{\prime}\log(1+(c^{\prime}+C)/\Delta). Combining cases finishes the proof. ∎
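
Since Lemma 41 is invoked repeatedly, a quick numerical sanity check may be reassuring. The following Python sketch tests the inequality on random instances; it is purely illustrative and not a substitute for the proof. It uses the fact that the objective is linear in $\lambda$, so the minimum over $[0,1]$ is attained at an endpoint.

```python
import math, random

def lemma41_lhs(c, cp, X, Delta, C, alpha):
    # (1/2) c^{1-a} X^a + (1/2) c' log(X) - lambda * (X*Delta - C), minimized over lambda in [0,1];
    # by linearity in lambda the minimum sits at lambda = 0 or lambda = 1.
    base = 0.5 * c ** (1 - alpha) * X ** alpha + 0.5 * cp * math.log(X)
    return base - max(0.0, X * Delta - C)

def lemma41_rhs(c, cp, Delta, C, alpha):
    return (c * Delta ** (-alpha / (1 - alpha))
            + 2 * c ** (1 - alpha) * (C / Delta) ** alpha
            + 2 * cp * math.log(1 + (cp + C) / Delta))

random.seed(0)
for _ in range(10000):
    alpha = random.uniform(0.05, 0.95)
    c, cp = random.uniform(1, 20), random.uniform(1, 20)
    Delta = random.uniform(1e-3, 1.0)
    C = random.uniform(0.0, 1e4)
    X = math.exp(random.uniform(0.0, 20.0))   # X >= 1
    assert lemma41_lhs(c, cp, X, Delta, C, alpha) <= lemma41_rhs(c, cp, Delta, C, alpha) + 1e-6
```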

D.2 dd-BOBW to dd-LSB (Algorithm 1 / Theorem 23)

Proof of Theorem 23.

In the adversarial regime, the regret in each epoch $k$ is bounded by

𝔼Tk[t=Tk+1Tk+1(t,Att,u)]c1log(T)𝔼Tk[t=Tk+1Tk+1xpt,xξt,x]+c2logT.\displaystyle\mathbb{E}_{T_{k}}\left[\sum_{t=T_{k}+1}^{T_{k+1}}(\ell_{t,A_{t}}-\ell_{t,u})\right]\leq\sqrt{c_{1}\log(T)\mathbb{E}_{T_{k}}\left[\sum_{t=T_{k}+1}^{T_{k+1}}\sum_{x}p_{t,x}\xi_{t,x}\right]}+c_{2}\log T\,.

There are at most $\log T$ epochs, since the epoch length doubles every time. Via Cauchy-Schwarz, we get

k=1kmax𝔼Tk[t=Tk+1Tk+1(t,Att,u)]c1logTk=1kmax𝔼Tk[t=Tk+1Tk+1xpt,xξt,x]logT+c2log2T.\displaystyle\sum_{k=1}^{k_{\max}}\mathbb{E}_{T_{k}}\left[\sum_{t=T_{k}+1}^{T_{k+1}}(\ell_{t,A_{t}}-\ell_{t,u})\right]\leq\sqrt{c_{1}\log T\sum_{k=1}^{k_{\max}}\mathbb{E}_{T_{k}}\left[\sum_{t=T_{k}+1}^{T_{k+1}}\sum_{x}p_{t,x}\xi_{t,x}\right]}\sqrt{\log T}+c_{2}\log^{2}T\,.

Taking the expectation on both sides and applying the tower rule of expectations finishes the bound for the adversarial regime. For the stochastic regime, note that $\xi_{t,x}\leq 1$ and hence

xpt,xξt,x𝕀{𝒖=𝒙^}pt,u2ξt,u\displaystyle\sum_{x}p_{t,x}\xi_{t,x}-\bm{\mathbb{I}\{u=\widehat{x}\}}p_{t,u}^{2}\xi_{t,u}
1𝕀{𝒖=𝒙^}pt,u2\displaystyle\leq 1-\bm{\mathbb{I}\{u=\widehat{x}\}}p_{t,u}^{2}
=(1𝕀{𝒖=𝒙^}pt,u)(1+𝕀{𝒖=𝒙^}pt,u)2(1𝕀{𝒖=𝒙^}pt,u).\displaystyle=(1-\bm{\mathbb{I}\{u=\widehat{x}\}}p_{t,u})(1+\bm{\mathbb{I}\{u=\widehat{x}\}}p_{t,u})\leq 2(1-\bm{\mathbb{I}\{u=\widehat{x}\}}p_{t,u})\,.

dd-LSB implies regular LSB (up to a factor of 22) and hence the stochastic bound of regular LSB applies. ∎

Appendix E Analysis for the Second Reduction

E.1 12\frac{1}{2}-LSB to 12\frac{1}{2}-iw-stable (Algorithm 2 / Theorem 11)

Proof of Theorem 11.

The per-step bonus bt=BtBt1b_{t}=B_{t}-B_{t-1} is the sum of two terms:

btTs\displaystyle b_{t}^{\scalebox{0.5}{{{Ts}}}} =c1τ=1t1qτ,2c1τ=1t11qτ,2c1qt,2c1τ=1t1qτ,2c1qt,2,\displaystyle=\sqrt{c_{1}\sum_{\tau=1}^{t}\frac{1}{q_{\tau,2}}}-\sqrt{c_{1}\sum_{\tau=1}^{t-1}\frac{1}{q_{\tau,2}}}\leq\frac{\frac{c_{1}}{q_{t,2}}}{\sqrt{c_{1}\sum_{\tau=1}^{t}\frac{1}{q_{\tau,2}}}}\leq\sqrt{\frac{c_{1}}{q_{t,2}}},
btLo\displaystyle b_{t}^{\scalebox{0.5}{{{Lo}}}} =c2minτtqτ,2c2minτt1qτ,2=c2qt,2(1minτtqτ,2minτt1qτ,2).\displaystyle=\frac{c_{2}}{\min_{\tau\leq t}q_{\tau,2}}-\frac{c_{2}}{\min_{\tau\leq t-1}q_{\tau,2}}=\frac{c_{2}}{q_{t,2}}\left(1-\frac{\min_{\tau\leq t}q_{\tau,2}}{\min_{\tau\leq t-1}q_{\tau,2}}\right).

Since q¯t,2qt,22\frac{\bar{q}_{t,2}}{q_{t,2}}\leq 2, using the inequalities above, we have

ηtq¯t,2btTs\displaystyle\eta_{t}\sqrt{\bar{q}_{t,2}}b_{t}^{\scalebox{0.5}{{{Ts}}}} ηtq¯t,2qt,2c1ηt2c114.\displaystyle\leq\eta_{t}\sqrt{\frac{\bar{q}_{t,2}}{q_{t,2}}c_{1}}\leq\eta_{t}\sqrt{2c_{1}}\leq\frac{1}{4}. (16)
βq¯t,2btLo\displaystyle\beta\bar{q}_{t,2}b_{t}^{\scalebox{0.5}{{{Lo}}}} βq¯t,2qt,2c22βc214.\displaystyle\leq\beta\frac{\bar{q}_{t,2}}{q_{t,2}}c_{2}\leq 2\beta c_{2}\leq\frac{1}{4}. (17)

By Lemma 28 and that q¯t,2qt,22\frac{\bar{q}_{t,2}}{q_{t,2}}\leq 2, we have for any uu,

t=1tqtu,ztt=1tqtq¯t,ztterm1\displaystyle\sum_{t=1}^{t^{\prime}}\langle q_{t}-u,z_{t}\rangle\leq\underbrace{\sum_{t=1}^{t^{\prime}}\langle q_{t}-\bar{q}_{t},z_{t}\rangle}_{\textbf{term}_{1}}
+O(c1+t=1tqt,2tterm2+logtβterm3+t=1tηtmin|θt|1qt,i32(zt,iθt)2term4+t=1tqt,2btTsterm5+t=1tqt,2btLoterm6)u2t=1tbt.\displaystyle\ \ +O\bigg{(}\sqrt{c_{1}}+\underbrace{\sum_{t=1}^{t^{\prime}}\frac{\sqrt{q_{t,2}}}{\sqrt{t}}}_{\textbf{term}_{2}}+\underbrace{\frac{\log t^{\prime}}{\beta}}_{\textbf{term}_{3}}+\underbrace{\sum_{t=1}^{t^{\prime}}\eta_{t}\min_{|\theta_{t}|\leq 1}q_{t,i}^{\frac{3}{2}}(z_{t,i}-\theta_{t})^{2}}_{\textbf{term}_{4}}+\underbrace{\sum_{t=1}^{t^{\prime}}q_{t,2}b_{t}^{\scalebox{0.5}{{{Ts}}}}}_{\textbf{term}_{5}}+\underbrace{\sum_{t=1}^{t^{\prime}}q_{t,2}b_{t}^{\scalebox{0.5}{{{Lo}}}}}_{\textbf{term}_{6}}\bigg{)}-u_{2}\sum_{t=1}^{t^{\prime}}b_{t}. (18)

We bound individual terms below:

𝔼[term1]=𝔼[t=1tqtq¯t,zt]O(t=1t1t2)=O(1).\displaystyle\mathbb{E}[\textbf{term}_{1}]=\mathbb{E}\left[\sum_{t=1}^{t^{\prime}}\langle q_{t}-\bar{q}_{t},z_{t}\rangle\right]\leq O\left(\sum_{t=1}^{t^{\prime}}\frac{1}{t^{2}}\right)=O(1).
term2\displaystyle\textbf{term}_{2} O(min{t,t=1tqt,2logT}).\displaystyle\leq O\left(\min\left\{\sqrt{t^{\prime}},\ \ \sqrt{\sum_{t=1}^{t^{\prime}}q_{t,2}\log T}\right\}\right).
term3\displaystyle\textbf{term}_{3} =logtβO(c2logT).\displaystyle=\frac{\log t^{\prime}}{\beta}\leq O\left(c_{2}\log T\right).
𝔼[term4]\displaystyle\mathbb{E}\left[\textbf{term}_{4}\right] =𝔼[t=1tηtminθt[1,1]qt,i32(zt,iθt)2]\displaystyle=\mathbb{E}\left[\sum_{t=1}^{t^{\prime}}\eta_{t}\min_{\theta_{t}\in[-1,1]}q_{t,i}^{\frac{3}{2}}(z_{t,i}-\theta_{t})^{2}\right]
𝔼[t=1tηti=12qt,i32(zt,it,At)2]\displaystyle\leq\mathbb{E}\left[\sum_{t=1}^{t^{\prime}}\eta_{t}\sum_{i=1}^{2}q_{t,i}^{\frac{3}{2}}\left(z_{t,i}-\ell_{t,A_{t}}\right)^{2}\right]
=𝔼[t=1tηti=121qt,i(𝕀[it=i]t,Atqt,it,At)2]\displaystyle=\mathbb{E}\left[\sum_{t=1}^{t^{\prime}}\eta_{t}\sum_{i=1}^{2}\frac{1}{\sqrt{q_{t,i}}}\left(\mathbb{I}[i_{t}=i]\ell_{t,A_{t}}-q_{t,i}\ell_{t,A_{t}}\right)^{2}\right]
𝔼[t=1tηti=12(qt,i(1qt,i)2+(1qt,i)qt,i32)]\displaystyle\leq\mathbb{E}\left[\sum_{t=1}^{t^{\prime}}\eta_{t}\sum_{i=1}^{2}\left(\sqrt{q_{t,i}}(1-q_{t,i})^{2}+(1-q_{t,i})q_{t,i}^{\frac{3}{2}}\right)\right]
O(𝔼[t=1tηtqt,2])O(𝔼[term2]).\displaystyle\leq O\left(\mathbb{E}\left[\sum_{t=1}^{t^{\prime}}\eta_{t}\sqrt{q_{t,2}}\right]\right)\leq O\left(\mathbb{E}[\textbf{term}_{2}]\right).
term5=t=1tqt,2btTs\displaystyle\textbf{term}_{5}=\sum_{t=1}^{t^{\prime}}q_{t,2}b_{t}^{\scalebox{0.5}{{{Ts}}}} t=1tqt,2(c1qt,2c1τ=1t1qτ,2)\displaystyle\leq\sum_{t=1}^{t^{\prime}}q_{t,2}\left(\frac{\frac{c_{1}}{q_{t,2}}}{\sqrt{c_{1}\sum_{\tau=1}^{t}\frac{1}{q_{\tau,2}}}}\right)
=c1t=1t1qt,2τ=1t1qτ,2×qt,2\displaystyle=\sqrt{c_{1}}\sum_{t=1}^{t^{\prime}}\frac{\frac{1}{\sqrt{q_{t,2}}}}{\sqrt{\sum_{\tau=1}^{t}\frac{1}{q_{\tau,2}}}}\times\sqrt{q_{t,2}} (19)
c1t=1t1qt,2τ=1t1qτ,2t=1tqt,2\displaystyle\leq\sqrt{c_{1}}\sqrt{\sum_{t=1}^{t^{\prime}}\frac{\frac{1}{q_{t,2}}}{\sum_{\tau=1}^{t}\frac{1}{q_{\tau,2}}}}\sqrt{\sum_{t=1}^{t^{\prime}}q_{t,2}} (Cauchy-Schwarz)
c11+log(t=1t1qt,2)t=1tqt,2\displaystyle\leq\sqrt{c_{1}}\sqrt{1+\log\left(\sum_{t=1}^{t^{\prime}}\frac{1}{q_{t,2}}\right)}\sqrt{\sum_{t=1}^{t^{\prime}}q_{t,2}}
O(c1t=1tqt,2logT).\displaystyle\leq O\left(\sqrt{c_{1}\sum_{t=1}^{t^{\prime}}q_{t,2}\log T}\right).

Continuing from (19), we also have term5t=1tqt,2btTsc1t=1t1t2c1t\textbf{term}_{5}\leq\sum_{t=1}^{t^{\prime}}q_{t,2}b_{t}^{\scalebox{0.5}{{{Ts}}}}\leq\sqrt{c_{1}}\sum_{t=1}^{t^{\prime}}\frac{1}{\sqrt{t}}\leq 2\sqrt{c_{1}t^{\prime}} because qτ,21q_{\tau,2}\leq 1.

term6=t=1tqt,2btLo\displaystyle\textbf{term}_{6}=\sum_{t=1}^{t^{\prime}}q_{t,2}b_{t}^{\scalebox{0.5}{{{Lo}}}} =c2t=1t(1minτtqτ,2minτt1qτ,2)c2t=1tlog(minτt1qτ,2minτtqτ,2)O(c2logT).\displaystyle=c_{2}\sum_{t=1}^{t^{\prime}}\left(1-\frac{\min_{\tau\leq t}q_{\tau,2}}{\min_{\tau\leq t-1}q_{\tau,2}}\right)\leq c_{2}\sum_{t=1}^{t^{\prime}}\log\left(\frac{\min_{\tau\leq t-1}q_{\tau,2}}{\min_{\tau\leq t}q_{\tau,2}}\right)\leq O\left(c_{2}\log T\right).

Using all bounds above in (18), we can bound 𝔼[t=1tqtu,zt]\mathbb{E}\left[\sum_{t=1}^{t^{\prime}}\langle q_{t}-u,z_{t}\rangle\right] by

O(min{c1𝔼[t],c1[t=1tqt,2]logT}+c2logT)pos-termu2𝔼[c1t=1t1qt,2+c2minttqt,2]neg-term.\displaystyle\underbrace{O\left(\min\left\{\sqrt{c_{1}\mathbb{E}[t^{\prime}]},\ \ \sqrt{c_{1}\left[\sum_{t=1}^{t^{\prime}}q_{t,2}\right]\log T}\right\}+c_{2}\log T\right)}_{\textbf{pos-term}}-u_{2}\underbrace{\mathbb{E}\left[\sqrt{c_{1}\sum_{t=1}^{t^{\prime}}\frac{1}{q_{t,2}}}+\frac{c_{2}}{\min_{t\leq t^{\prime}}q_{t,2}}\right]}_{\textbf{neg-term}}.

For comparator x^\widehat{x}, we choose u=𝐞1u=\mathbf{e}_{1} and bound 𝔼[t=1t(t,Att,x^)]\mathbb{E}\left[\sum_{t=1}^{t^{\prime}}(\ell_{t,A_{t}}-\ell_{t,\widehat{x}})\right] by the pos-term above. For comparator xx^x\neq\widehat{x}, we first choose u=𝐞2u=\mathbf{e}_{2} and upper bound 𝔼[t=1t(t,Att,A~t)]\mathbb{E}\left[\sum_{t=1}^{t^{\prime}}(\ell_{t,A_{t}}-\ell_{t,\widetilde{A}_{t}})\right] by pos-termneg-term\textbf{pos-term}-\textbf{neg-term}. On the other hand, by the 12\frac{1}{2}-iw-stable assumption, 𝔼[t=1t(t,A~tt,x)]neg-term\mathbb{E}\left[\sum_{t=1}^{t^{\prime}}(\ell_{t,\widetilde{A}_{t}}-\ell_{t,x})\right]\leq\textbf{neg-term}. Combining them, we get that for all xx^x\neq\widehat{x}, we also have 𝔼[t=1t(t,Att,x)]pos-term\mathbb{E}\left[\sum_{t=1}^{t^{\prime}}(\ell_{t,A_{t}}-\ell_{t,x})\right]\leq\textbf{pos-term}. Comparing the coefficients in pos-term with those in Definition 4, we see that Algorithm 2 satisfies 12\frac{1}{2}-LSB with constants (c0,c1,c2)(c_{0}^{\prime},c_{1}^{\prime},c_{2}^{\prime}) where c0=c1=O(c1)c_{0}^{\prime}=c_{1}^{\prime}=O(c_{1}) and c2=O(c2)c_{2}^{\prime}=O(c_{2}). ∎

E.2 23\frac{2}{3}-LSB to 23\frac{2}{3}-iw-stable (Algorithm 5 / Theorem 18)

Algorithm 5 LSB via Corral (for α=23\alpha=\frac{2}{3})

Input: candidate action x^\widehat{x}, 23\frac{2}{3}-iw-stable algorithm \mathcal{B} over 𝒳\{x^}\mathcal{X}\backslash\{\widehat{x}\} with constants c1,c2c_{1},c_{2}.
Define: ψt(q)=3ηti=12qi23+1βi=12ln1qi\psi_{t}(q)=-\frac{3}{\eta_{t}}\sum_{i=1}^{2}q_{i}^{\frac{2}{3}}+\frac{1}{\beta}\sum_{i=1}^{2}\ln\frac{1}{q_{i}}.  
for t=1,2,t=1,2,\ldots do

       Let \mathcal{B} generate an action A~t\widetilde{A}_{t}.
Let
q¯t=argminqΔ2{q,τ=1t1zτ[0Bt1]+ψt(q)},qt=(1γt)q¯t,\displaystyle\bar{q}_{t}=\operatorname*{argmin}_{q\in\Delta_{2}}\left\{\left\langle q,\sum_{\tau=1}^{t-1}z_{\tau}-\begin{bmatrix}0\\ B_{t-1}\end{bmatrix}\right\rangle+\psi_{t}(q)\right\},\quad q_{t}=(1-\gamma_{t})\bar{q}_{t},
whereηt=1t23+8c113,β=18c2, and γt=max{ηtqt,223,ηtqt,213}.\displaystyle\text{where}\ \ \ \eta_{t}=\frac{1}{t^{\frac{2}{3}}+8c_{1}^{\frac{1}{3}}},\,\beta=\frac{1}{8c_{2}}\text{, and }\gamma_{t}=\max\left\{\sqrt{\eta_{t}}q_{t,2}^{\frac{2}{3}},\eta_{t}q_{t,2}^{\frac{1}{3}}\right\}.
Sample itq¯ti_{t}\sim\bar{q}_{t}.
if it=1i_{t}=1 then  set A¯t=x^\bar{A}_{t}=\widehat{x} ;
       else  set A¯t=A~t\bar{A}_{t}=\widetilde{A}_{t} ;
       Sample jtγtj_{t}\sim\gamma_{t}.
if jt=1j_{t}=1 then  draw a revealing action of A¯t\bar{A}_{t} and observe t,A¯t\ell_{t,\bar{A}_{t}} ;
       else  draw At=A¯tA_{t}=\bar{A}_{t} ;
       Define zt,i=t,A¯t𝕀{it=i}𝕀{jt=1}γtz_{t,i}=\frac{\ell_{t,\bar{A}_{t}}\mathbb{I}\{i_{t}=i\}\mathbb{I}\{j_{t}=1\}}{\gamma_{t}} and
Bt=c113(τ=1t1qτ,2)23+c2minτtqτ,2.\displaystyle B_{t}=c_{1}^{\frac{1}{3}}\left(\sum_{\tau=1}^{t}\frac{1}{\sqrt{q_{\tau,2}}}\right)^{\frac{2}{3}}+\frac{c_{2}}{\min_{\tau\leq t}q_{\tau,2}}.
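
For readers who prefer code, the following is a minimal Python sketch of one round of Algorithm 5. It is illustrative only: the two-point FTRL step is solved by brute-force grid search, `base_action` stands for the proposal $\widetilde{A}_{t}$ of the base algorithm, `loss_of` is a stand-in for the environment feedback, $B_{0}$ is taken to be $0$, and the exploration rate is evaluated at $\bar{q}_{t,2}$ (the pseudocode states it in terms of $q_{t,2}$).

```python
import math, random

def ftrl_step(cum_z, bonus, eta, beta, grid=20000):
    """argmin over the 2-simplex of <q, cum_z - (0, bonus)> + psi_t(q), with
    psi_t(q) = -(3/eta) * sum_i q_i^{2/3} + (1/beta) * sum_i log(1/q_i).
    Brute-force search over q_2 suffices because the simplex is one-dimensional."""
    best_q2, best_val = 0.5, float("inf")
    for k in range(1, grid):
        q2 = k / grid
        q1 = 1.0 - q2
        val = (q1 * cum_z[0] + q2 * (cum_z[1] - bonus)
               - (3.0 / eta) * (q1 ** (2 / 3) + q2 ** (2 / 3))
               + (1.0 / beta) * (math.log(1.0 / q1) + math.log(1.0 / q2)))
        if val < best_val:
            best_q2, best_val = q2, val
    return 1.0 - best_q2, best_q2

def corral_round(t, cum_z, sum_inv_sqrt_q2, min_q2, c1, c2, x_hat, base_action, loss_of):
    """One round of the alpha = 2/3 wrapper; `base_action` and `loss_of` are placeholders."""
    eta = 1.0 / (t ** (2 / 3) + 8.0 * c1 ** (1 / 3))
    beta = 1.0 / (8.0 * c2)
    # B_{t-1}, with B_0 = 0 (empty sums)
    B_prev = (c1 ** (1 / 3) * sum_inv_sqrt_q2 ** (2 / 3) + c2 / min_q2) if t > 1 else 0.0
    q1_bar, q2_bar = ftrl_step(cum_z, B_prev, eta, beta)
    gamma = max(math.sqrt(eta) * q2_bar ** (2 / 3), eta * q2_bar ** (1 / 3))
    q2 = (1.0 - gamma) * q2_bar                      # q_{t,2} = (1 - gamma_t) * qbar_{t,2}
    i_t = 1 if random.random() < q1_bar else 2       # i_t ~ qbar_t
    a_bar = x_hat if i_t == 1 else base_action
    j_t = 1 if random.random() < gamma else 0        # j_t ~ Bernoulli(gamma_t)
    z = [0.0, 0.0]
    if j_t == 1:                                     # revealing round: observe ell_{t, a_bar}
        z[i_t - 1] = loss_of(a_bar) / gamma
    cum_z = [cum_z[0] + z[0], cum_z[1] + z[1]]
    sum_inv_sqrt_q2 += 1.0 / math.sqrt(q2)           # bookkeeping for B_t
    min_q2 = min(min_q2, q2)
    return a_bar, cum_z, sum_inv_sqrt_q2, min_q2
```

A run would start from `cum_z = [0.0, 0.0]`, `sum_inv_sqrt_q2 = 0.0`, `min_q2 = 1.0`, and call `corral_round` for $t=1,2,\ldots$.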
Proof of Theorem 18.

The per-step bonus bt=BtBt1b_{t}=B_{t}-B_{t-1} is the sum of two terms:

btTs\displaystyle b_{t}^{\scalebox{0.5}{{{Ts}}}} =c113((τ=1t1qτ,2)23(τ=1t11qτ,2)23)c1131qt,2(τ=1t1qτ,2)13(c1qt,2)13,\displaystyle=c_{1}^{\frac{1}{3}}\left(\left(\sum_{\tau=1}^{t}\frac{1}{\sqrt{q_{\tau,2}}}\right)^{\frac{2}{3}}-\left(\sum_{\tau=1}^{t-1}\frac{1}{\sqrt{q_{\tau,2}}}\right)^{\frac{2}{3}}\right)\leq c_{1}^{\frac{1}{3}}\frac{\frac{1}{\sqrt{q_{t,2}}}}{\left(\sum_{\tau=1}^{t}\frac{1}{\sqrt{q_{\tau,2}}}\right)^{\frac{1}{3}}}\leq\left(\frac{c_{1}}{q_{t,2}}\right)^{\frac{1}{3}},
btLo\displaystyle b_{t}^{\scalebox{0.5}{{{Lo}}}} =c2minτtqτ,2c2minτt1qτ,2=c2qt,2(1minτtqτ,2minτt1qτ,2).\displaystyle=\frac{c_{2}}{\min_{\tau\leq t}q_{\tau,2}}-\frac{c_{2}}{\min_{\tau\leq t-1}q_{\tau,2}}=\frac{c_{2}}{q_{t,2}}\left(1-\frac{\min_{\tau\leq t}q_{\tau,2}}{\min_{\tau\leq t-1}q_{\tau,2}}\right).

Since q¯t,2qt,22\frac{\bar{q}_{t,2}}{q_{t,2}}\leq 2, using the inequalities above, we have

ηtq¯t,213btTs\displaystyle\eta_{t}\bar{q}_{t,2}^{\frac{1}{3}}b_{t}^{\scalebox{0.5}{{{Ts}}}} ηt(q¯t,2c1qt,2)13ηt(2c1)1314.\displaystyle\leq\eta_{t}\left(\frac{\bar{q}_{t,2}c_{1}}{q_{t,2}}\right)^{\frac{1}{3}}\leq\eta_{t}(2c_{1})^{\frac{1}{3}}\leq\frac{1}{4}. (20)
βq¯t,2btLo\displaystyle\beta\bar{q}_{t,2}b_{t}^{\scalebox{0.5}{{{Lo}}}} βq¯t,2qt,2c22βc214.\displaystyle\leq\beta\frac{\bar{q}_{t,2}}{q_{t,2}}c_{2}\leq 2\beta c_{2}\leq\frac{1}{4}. (21)

By Lemma 28 and that q¯t,2qt,22\frac{\bar{q}_{t,2}}{q_{t,2}}\leq 2, we have for any uu,

t=1tqtu,ztt=1tqtq¯t,ztterm1\displaystyle\sum_{t=1}^{t^{\prime}}\langle q_{t}-u,z_{t}\rangle\leq\underbrace{\sum_{t=1}^{t^{\prime}}\langle q_{t}-\bar{q}_{t},z_{t}\rangle}_{\textbf{term}_{1}}
+O(c113+t=1tqt,223t13term2+logtβterm3+t=1tηtmin|θt|1qt,i43(zt,iθt)2term4+t=1tqt,2btTsterm5+t=1tqt,2btLoterm6)u2t=1tbt\displaystyle\ \ +O\bigg{(}c_{1}^{\frac{1}{3}}+\underbrace{\sum_{t=1}^{t^{\prime}}\frac{q_{t,2}^{\frac{2}{3}}}{t^{\frac{1}{3}}}}_{\textbf{term}_{2}}+\underbrace{\frac{\log t^{\prime}}{\beta}}_{\textbf{term}_{3}}+\underbrace{\sum_{t=1}^{t^{\prime}}\eta_{t}\min_{|\theta_{t}|\leq 1}q_{t,i}^{\frac{4}{3}}(z_{t,i}-\theta_{t})^{2}}_{\textbf{term}_{4}}+\underbrace{\sum_{t=1}^{t^{\prime}}q_{t,2}b_{t}^{\scalebox{0.5}{{{Ts}}}}}_{\textbf{term}_{5}}+\underbrace{\sum_{t=1}^{t^{\prime}}q_{t,2}b_{t}^{\scalebox{0.5}{{{Lo}}}}}_{\textbf{term}_{6}}\bigg{)}-u_{2}\sum_{t=1}^{t^{\prime}}b_{t} (22)

We bound individual terms below:

𝔼[term1]=𝔼[t=1tqtq¯t,zt]O(t=1tγt)\displaystyle\mathbb{E}[\textbf{term}_{1}]=\mathbb{E}\left[\sum_{t=1}^{t^{\prime}}\langle q_{t}-\bar{q}_{t},z_{t}\rangle\right]\leq O\left(\sum_{t=1}^{t^{\prime}}\gamma_{t}\right) =O(t=1tqt,223t13+qt,213t23)\displaystyle=O\left(\sum_{t=1}^{t^{\prime}}\frac{q_{t,2}^{\frac{2}{3}}}{t^{\frac{1}{3}}}+\frac{q_{t,2}^{\frac{1}{3}}}{t^{\frac{2}{3}}}\right)
\displaystyle=O\left(\min\left\{{t^{\prime}}^{\frac{2}{3}},\left(\sum_{t=1}^{t^{\prime}}q_{t,2}\right)^{\frac{2}{3}}(\log T)^{\frac{1}{3}}+\log T\right\}\right).
term2\displaystyle\textbf{term}_{2} O(term1)\displaystyle\leq O\left(\textbf{term}_{1}\right)
term3\displaystyle\textbf{term}_{3} =logtβO(c2logT).\displaystyle=\frac{\log t^{\prime}}{\beta}\leq O\left(c_{2}\log T\right).
𝔼[term4]\displaystyle\mathbb{E}\left[\textbf{term}_{4}\right] =𝔼[t=1tηtqt,243(zt,2zt,1)2]\displaystyle=\mathbb{E}\left[\sum_{t=1}^{t^{\prime}}\eta_{t}q_{t,2}^{\frac{4}{3}}(z_{t,2}-z_{t,1})^{2}\right]
𝔼[t=1tηtqt,i43γt]𝔼[t=1tγt2γt]=𝔼[O(term1)].\displaystyle\leq\mathbb{E}\left[\sum_{t=1}^{t^{\prime}}\eta_{t}\frac{q_{t,i}^{\frac{4}{3}}}{\gamma_{t}}\right]\leq\mathbb{E}\left[\sum_{t=1}^{t^{\prime}}\frac{\gamma_{t}^{2}}{\gamma_{t}}\right]=\mathbb{E}\left[O(\textbf{term}_{1})\right].
term5\displaystyle\textbf{term}_{5} =t=1tqt,2btTst=1tqt,2(c1131qt,2(τ=1t1qτ,2)13)\displaystyle=\sum_{t=1}^{t^{\prime}}q_{t,2}b_{t}^{\scalebox{0.5}{{{Ts}}}}\leq\sum_{t=1}^{t^{\prime}}q_{t,2}\left(c_{1}^{\frac{1}{3}}\frac{\frac{1}{\sqrt{q_{t,2}}}}{\left(\sum_{\tau=1}^{t}\frac{1}{\sqrt{q_{\tau,2}}}\right)^{\frac{1}{3}}}\right)
=c113t=1tqt,216(τ=1t1qτ,2)13×qt,223\displaystyle=c_{1}^{\frac{1}{3}}\sum_{t=1}^{t^{\prime}}\frac{q_{t,2}^{-\frac{1}{6}}}{\left(\sum_{\tau=1}^{t}\frac{1}{\sqrt{q_{\tau,2}}}\right)^{\frac{1}{3}}}\times q_{t,2}^{\frac{2}{3}} (23)
\displaystyle\leq c_{1}^{\frac{1}{3}}\left(\sum_{t=1}^{t^{\prime}}\frac{\frac{1}{\sqrt{q_{t,2}}}}{\sum_{\tau=1}^{t}\frac{1}{\sqrt{q_{\tau,2}}}}\right)^{\frac{1}{3}}\left(\sum_{t=1}^{t^{\prime}}q_{t,2}\right)^{\frac{2}{3}} (Hölder's inequality)
c113(1+log(t=1t1qt,2))13(t=1tqt,2)23\displaystyle\leq c_{1}^{\frac{1}{3}}\left(1+\log\left(\sum_{t=1}^{t^{\prime}}\frac{1}{q_{t,2}}\right)\right)^{\frac{1}{3}}\left(\sum_{t=1}^{t^{\prime}}q_{t,2}\right)^{\frac{2}{3}}
O(c113(t=1tqt,2)23(logT)13).\displaystyle\leq O\left(c_{1}^{\frac{1}{3}}\left(\sum_{t=1}^{t^{\prime}}q_{t,2}\right)^{\frac{2}{3}}(\log T)^{\frac{1}{3}}\right).

Continuing from (23), we also have term5t=1tqt,2btTsc113t=1t1t13O(c113t23)\textbf{term}_{5}\leq\sum_{t=1}^{t^{\prime}}q_{t,2}b_{t}^{\scalebox{0.5}{{{Ts}}}}\leq c_{1}^{\frac{1}{3}}\sum_{t=1}^{t^{\prime}}\frac{1}{t^{\frac{1}{3}}}\leq O(c_{1}^{\frac{1}{3}}{t^{\prime}}^{\frac{2}{3}}) because qτ,21q_{\tau,2}\leq 1.

term6=t=1tqt,2btLo\displaystyle\textbf{term}_{6}=\sum_{t=1}^{t^{\prime}}q_{t,2}b_{t}^{\scalebox{0.5}{{{Lo}}}} =c2t=1t(1minτtqτ,2minτt1qτ,2)c2t=1tlog(minτt1qτ,2minτtqτ,2)O(c2logT).\displaystyle=c_{2}\sum_{t=1}^{t^{\prime}}\left(1-\frac{\min_{\tau\leq t}q_{\tau,2}}{\min_{\tau\leq t-1}q_{\tau,2}}\right)\leq c_{2}\sum_{t=1}^{t^{\prime}}\log\left(\frac{\min_{\tau\leq t-1}q_{\tau,2}}{\min_{\tau\leq t}q_{\tau,2}}\right)\leq O\left(c_{2}\log T\right).

Using all bounds above in (22), we can bound 𝔼[t=1tqtu,zt]\mathbb{E}\left[\sum_{t=1}^{t^{\prime}}\langle q_{t}-u,z_{t}\rangle\right] by

O(min{c113𝔼[t]23,(c1logT)13[t=1tqt,2]23}+c2logT)pos-termu2𝔼[c113(t=1t1qt,2)23+c2minttqt,2]neg-term.\displaystyle\underbrace{O\left(\min\left\{c_{1}^{\frac{1}{3}}\mathbb{E}[t^{\prime}]^{\frac{2}{3}},\ \ (c_{1}\log T)^{\frac{1}{3}}\left[\sum_{t=1}^{t^{\prime}}q_{t,2}\right]^{\frac{2}{3}}\right\}+c_{2}\log T\right)}_{\textbf{pos-term}}-u_{2}\underbrace{\mathbb{E}\left[c_{1}^{\frac{1}{3}}\left(\sum_{t=1}^{t^{\prime}}\frac{1}{q_{t,2}}\right)^{\frac{2}{3}}+\frac{c_{2}}{\min_{t\leq t^{\prime}}q_{t,2}}\right]}_{\textbf{neg-term}}.

For comparator $\widehat{x}$, we choose $u=\mathbf{e}_{1}$ and bound $\mathbb{E}\left[\sum_{t=1}^{t^{\prime}}(\ell_{t,A_{t}}-\ell_{t,\widehat{x}})\right]$ by the pos-term above. For comparator $x\neq\widehat{x}$, we first choose $u=\mathbf{e}_{2}$ and upper bound $\mathbb{E}\left[\sum_{t=1}^{t^{\prime}}(\ell_{t,A_{t}}-\ell_{t,\widetilde{A}_{t}})\right]$ by $\textbf{pos-term}-\textbf{neg-term}$. On the other hand, by the $\frac{2}{3}$-iw-stable assumption, $\mathbb{E}\left[\sum_{t=1}^{t^{\prime}}(\ell_{t,\widetilde{A}_{t}}-\ell_{t,x})\right]\leq\textbf{neg-term}$. Combining them, we get that for all $x\neq\widehat{x}$, we also have $\mathbb{E}\left[\sum_{t=1}^{t^{\prime}}(\ell_{t,A_{t}}-\ell_{t,x})\right]\leq\textbf{pos-term}$.

Finally, notice that $\mathbb{E}_{t}[\mathbb{I}\{A_{t}\neq\widehat{x}\}]\geq\bar{q}_{t,2}(1-\gamma_{t})=q_{t,2}$. This implies that Algorithm 5 is $\frac{2}{3}$-LSB with constants $(c_{0}^{\prime},c_{1}^{\prime},c_{2}^{\prime})$, where $c_{0}^{\prime}=c_{1}^{\prime}=O(c_{1})$ and $c_{2}^{\prime}=O(c_{1}^{\frac{1}{3}}+c_{2})$. ∎

Algorithm 6 dd-LSB via Corral (for α=12\alpha=\frac{1}{2})

Input: candidate action $\widehat{x}$, $\frac{1}{2}$-dd-iw-stable algorithm $\mathcal{B}$ over $\mathcal{X}\backslash\{\widehat{x}\}$ with constants $(c_{1},c_{2})$.
Define: ψ(q)=i=12ln1qi\psi(q)=\sum_{i=1}^{2}\ln\frac{1}{q_{i}}. B0=0B_{0}=0.
Define: For first-order bound, ξt,x=t,x\xi_{t,x}=\ell_{t,x} and mt,x=0m_{t,x}=0; for second-order bound, ξt,x=(t,xmt,x)2\xi_{t,x}=(\ell_{t,x}-m_{t,x})^{2} where mt,xm_{t,x} is the loss predictor.
for t=1,2,t=1,2,\ldots do

       Let \mathcal{B} generate an action A~t\widetilde{A}_{t} (which is the action to be chosen if \mathcal{B} is selected in this round).
Receive prediction mt,xm_{t,x} for all x𝒳x\in\mathcal{X}, and set yt,1=mt,x^y_{t,1}=m_{t,\widehat{x}} and yt,2=mt,A~ty_{t,2}=m_{t,\widetilde{A}_{t}}.
Let
q¯t=argminqΔ2{q,τ=1t1zτ+yt[0Bt1]+1ηtψ(q)},qt=(112t2)q¯t+14t2𝟏,\displaystyle\bar{q}_{t}=\operatorname*{argmin}_{q\in\Delta_{2}}\left\{\left\langle q,\sum_{\tau=1}^{t-1}z_{\tau}+y_{t}-\begin{bmatrix}0\\ B_{t-1}\end{bmatrix}\right\rangle+\frac{1}{\eta_{t}}\psi(q)\right\},\quad q_{t}=\left(1-\frac{1}{2t^{2}}\right)\bar{q}_{t}+\frac{1}{4t^{2}}\mathbf{1},
whereηt=14(logT)12(τ=1t1(𝕀[iτ=i]qτ,i)2ξτ,Aτ+(c1+c22)logT)12.\displaystyle\text{where}\ \ \ \eta_{t}=\frac{1}{4}(\log T)^{\frac{1}{2}}\left(\sum_{\tau=1}^{t-1}(\mathbb{I}[i_{\tau}=i]-q_{\tau,i})^{2}\xi_{\tau,A_{\tau}}+(c_{1}+c_{2}^{2})\log T\right)^{-\frac{1}{2}}.
Sample itqti_{t}\sim q_{t}.
if it=1i_{t}=1 then  draw At=x^A_{t}=\widehat{x} and observe t,At\ell_{t,A_{t}} ;
       else  draw At=A~tA_{t}=\widetilde{A}_{t} and observe t,At\ell_{t,A_{t}} ;
       Define zt,i=(t,Atyt,i)𝕀{it=i}qt,i+yt,iz_{t,i}=\frac{(\ell_{t,A_{t}}-y_{t,i})\mathbb{I}\{i_{t}=i\}}{q_{t,i}}+y_{t,i} and
\displaystyle B_{t}=\sqrt{c_{1}\sum_{\tau=1}^{t}\frac{\xi_{\tau,A_{\tau}}\mathbb{I}[i_{\tau}=2]}{q_{\tau,2}^{2}}}+\frac{c_{2}}{\min_{\tau\leq t}q_{\tau,2}}.
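
The pieces that distinguish Algorithm 6 from the previous wrapper are the prediction-dependent learning rate, the optimistic importance-weighted estimate, and the data-dependent bonus; the two-point FTRL step itself is analogous to the one sketched after Algorithm 5, with the log-barrier potential $\psi(q)=\sum_{i}\ln\frac{1}{q_{i}}$ and the optimistic term $\langle q,y_{t}\rangle$ added. A minimal Python sketch of the three quantities follows; all variable names are ours.

```python
import math

def learning_rate(pred_err_sq_sum, c1, c2, log_T):
    """eta_t: shrinks with the accumulated squared prediction error
    sum_{tau<t} sum_i (I[i_tau=i] - q_{tau,i})^2 * xi_{tau, A_tau}."""
    return 0.25 * math.sqrt(log_T) / math.sqrt(pred_err_sq_sum + (c1 + c2 ** 2) * log_T)

def loss_estimate(loss, y, i_t, q):
    """Optimistic estimate z_{t,i} = (ell_{t,A_t} - y_{t,i}) * I{i_t=i} / q_{t,i} + y_{t,i}."""
    return [(loss - y[i]) * (1.0 if i_t == i + 1 else 0.0) / q[i] + y[i] for i in range(2)]

def bonus(c1, c2, sum_weighted_xi, min_q2):
    """B_t = sqrt(c1 * sum_{tau<=t} xi_{tau,A_tau} I[i_tau=2] / q_{tau,2}^2) + c2 / min_{tau<=t} q_{tau,2}."""
    return math.sqrt(c1 * sum_weighted_xi) + c2 / min_q2
```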

E.3 12\frac{1}{2}-dd-LSB to 12\frac{1}{2}-dd-iw-stable (Algorithm 6 / Theorem 22)

Proof of Theorem 22.

Define bt=BtBt1b_{t}=B_{t}-B_{t-1}. Notice that we have

ηtq¯t,2bt\displaystyle\eta_{t}\bar{q}_{t,2}b_{t} 2ηtqt,2bt\displaystyle\leq 2\eta_{t}q_{t,2}b_{t}
2ηtqt,2(c1ξt,At𝕀[it=2]qt,22c1τ=1t1ξτ,Aτ𝕀[iτ=2]qτ,22)+2c2(1minτtqτ,21minτt1qτ,2)\displaystyle\leq 2\eta_{t}q_{t,2}\left(\frac{\frac{c_{1}\xi_{t,A_{t}}\mathbb{I}[i_{t}=2]}{q_{t,2}^{2}}}{\sqrt{c_{1}\sum_{\tau=1}^{t-1}\frac{\xi_{\tau,A_{\tau}}\mathbb{I}[i_{\tau}=2]}{q_{\tau,2}^{2}}}}\right)+2c_{2}\left(\frac{1}{\min_{\tau\leq t}q_{\tau,2}}-\frac{1}{\min_{\tau\leq t-1}q_{\tau,2}}\right)
2ηtc1+2ηtc2(1minτtqτ,2minτt1qτ,2)\displaystyle\leq 2\eta_{t}\sqrt{c_{1}}+2\eta_{t}c_{2}\left(1-\frac{\min_{\tau\leq t}q_{\tau,2}}{\min_{\tau\leq t-1}q_{\tau,2}}\right)
14.\displaystyle\leq\frac{1}{4}. (24)

By Lemma 29 and that q¯t,2qt,22\frac{\bar{q}_{t,2}}{q_{t,2}}\leq 2, we have for any uu,

t=1tqtu,zt\displaystyle\sum_{t=1}^{t^{\prime}}\langle q_{t}-u,z_{t}\rangle O(logTηtterm1+t=1tηtmin|θ|1i=12qt,i2(zt,iyt,iθt)2term2\displaystyle\leq O\Bigg{(}\underbrace{\frac{\log T}{\eta_{t^{\prime}}}}_{\textbf{term}_{1}}+\underbrace{\sum_{t=1}^{t^{\prime}}\eta_{t}\min_{|\theta|\leq 1}\sum_{i=1}^{2}q_{t,i}^{2}\left(z_{t,i}-y_{t,i}-\theta_{t}\right)^{2}}_{\textbf{term}_{2}}
+t=1tqtq¯t,ztterm3+t=1tqt,2btterm4)t=1tu2bt\displaystyle\qquad\qquad+\underbrace{\sum_{t=1}^{t^{\prime}}\langle q_{t}-\bar{q}_{t},z_{t}\rangle}_{\textbf{term}_{3}}+\underbrace{\sum_{t=1}^{t^{\prime}}q_{t,2}b_{t}}_{\textbf{term}_{4}}\Bigg{)}-\sum_{t=1}^{t^{\prime}}u_{2}b_{t}
term1\displaystyle\textbf{term}_{1} O(t=1t1i=12(𝕀[it=i]qt,i)2ξt,At+(c1+c22)logTlogTlogT)\displaystyle\leq O\left(\sqrt{\frac{\sum_{t=1}^{t^{\prime}-1}\sum_{i=1}^{2}(\mathbb{I}[i_{t}=i]-q_{t,i})^{2}\xi_{t,A_{t}}+(c_{1}+c_{2}^{2})\log T}{\log T}}\log T\right)
O(t=1ti=12(𝕀[it=i]qt,i)2ξt,AtlogT+(c1+c2)logT).\displaystyle\leq O\left(\sqrt{\sum_{t=1}^{t^{\prime}}\sum_{i=1}^{2}(\mathbb{I}[i_{t}=i]-q_{t,i})^{2}\xi_{t,A_{t}}\log T}+(\sqrt{c_{1}}+c_{2})\log T\right).
𝔼[term2]\displaystyle\mathbb{E}\left[\textbf{term}_{2}\right] 𝔼[t=1tηtminθi=12qt,i2((t,Atmt,At)𝕀[it=i]qt,iθ)2]\displaystyle\leq\mathbb{E}\left[\sum_{t=1}^{t^{\prime}}\eta_{t}\min_{\theta}\sum_{i=1}^{2}q_{t,i}^{2}\left(\frac{(\ell_{t,A_{t}}-m_{t,A_{t}})\mathbb{I}[i_{t}=i]}{q_{t,i}}-\theta\right)^{2}\right]
𝔼[t=1tηti=12(𝕀[it=i]qt,i)2(t,Atmt,At)2]\displaystyle\leq\mathbb{E}\left[\sum_{t=1}^{t^{\prime}}\eta_{t}\sum_{i=1}^{2}(\mathbb{I}[i_{t}=i]-q_{t,i})^{2}(\ell_{t,A_{t}}-m_{t,A_{t}})^{2}\right] (choosing θ=t,Atmt,At\theta=\ell_{t,A_{t}}-m_{t,A_{t}})
𝔼[t=1ti=12(𝕀[it=i]qt,i)2ξt,AtlogT].\displaystyle\leq\mathbb{E}\left[\sqrt{\sum_{t=1}^{t^{\prime}}\sum_{i=1}^{2}(\mathbb{I}[i_{t}=i]-q_{t,i})^{2}\xi_{t,A_{t}}\log T}\right].
𝔼[term3]=𝔼[t=1tqtq¯t,𝔼t[zt]]=𝔼[t=1t12t2q¯t+14t2𝟏,𝔼t[zt]]O(1).\displaystyle\mathbb{E}\left[\textbf{term}_{3}\right]=\mathbb{E}\left[\sum_{t=1}^{t^{\prime}}\langle q_{t}-\bar{q}_{t},\mathbb{E}_{t}[z_{t}]\rangle\right]=\mathbb{E}\left[\sum_{t=1}^{t^{\prime}}\left\langle-\frac{1}{2t^{2}}\bar{q}_{t}+\frac{1}{4t^{2}}\mathbf{1},\mathbb{E}_{t}[z_{t}]\right\rangle\right]\leq O(1).
term4\displaystyle\textbf{term}_{4} t=1tqt,2bt\displaystyle\leq\sum_{t=1}^{t^{\prime}}q_{t,2}b_{t}
=t=1tqt,2(c1ξt,At𝕀[it=2]qt,22c1τ=1tξτ,Aτ𝕀[iτ=2]qτ,22+c2(1minτtqτ,21minτt1qτ,2))\displaystyle=\sum_{t=1}^{t^{\prime}}q_{t,2}\left(\frac{\frac{c_{1}\xi_{t,A_{t}}\mathbb{I}[i_{t}=2]}{q_{t,2}^{2}}}{\sqrt{c_{1}\sum_{\tau=1}^{t}\frac{\xi_{\tau,A_{\tau}}\mathbb{I}[i_{\tau}=2]}{q_{\tau,2}^{2}}}}+c_{2}\left(\frac{1}{\min_{\tau\leq t}q_{\tau,2}}-\frac{1}{\min_{\tau\leq t-1}q_{\tau,2}}\right)\right)
c1t=1tξt,At𝕀[it=2]qt,22τ=1tξτ,Aτ𝟏[iτ=2]qτ,22×ξt,At𝕀[it=2]+c2t=1t(1minτtqτ,2minτt1qτ,2)\displaystyle\leq\sqrt{c_{1}}\sum_{t=1}^{t^{\prime}}\sqrt{\frac{\frac{\xi_{t,A_{t}}\mathbb{I}[i_{t}=2]}{q_{t,2}^{2}}}{\sum_{\tau=1}^{t}\frac{\xi_{\tau,A_{\tau}}\mathbf{1}[i_{\tau}=2]}{q_{\tau,2}^{2}}}}\times\sqrt{\xi_{t,A_{t}}\mathbb{I}[i_{t}=2]}+c_{2}\sum_{t=1}^{t^{\prime}}\left(1-\frac{\min_{\tau\leq t}q_{\tau,2}}{\min_{\tau\leq t-1}q_{\tau,2}}\right)
c1t=1tξt,At𝕀[it=2]qt,22τ=1tξτ,Aτ𝟏[iτ=2]qτ,22×t=1tξt,At𝕀[it=2]+c2t=1tlog(minτt1qτ,2minτtqτ,2)\displaystyle\leq\sqrt{c_{1}}\sqrt{\sum_{t=1}^{t^{\prime}}\frac{\frac{\xi_{t,A_{t}}\mathbb{I}[i_{t}=2]}{q_{t,2}^{2}}}{\sum_{\tau=1}^{t}\frac{\xi_{\tau,A_{\tau}}\mathbf{1}[i_{\tau}=2]}{q_{\tau,2}^{2}}}}\times\sqrt{\sum_{t=1}^{t^{\prime}}\xi_{t,A_{t}}\mathbb{I}[i_{t}=2]}+c_{2}\sum_{t=1}^{t^{\prime}}\log\left(\frac{\min_{\tau\leq t-1}q_{\tau,2}}{\min_{\tau\leq t}q_{\tau,2}}\right) (Cauchy-Schwarz)
c11+log(t=1tξt,At𝕀[it=2]qt,22)t=1tξt,At𝕀[it=2]+c2log1minτtqτ,2\displaystyle\leq\sqrt{c_{1}}\sqrt{1+\log\left(\sum_{t=1}^{t^{\prime}}\frac{\xi_{t,A_{t}}\mathbb{I}[i_{t}=2]}{q_{t,2}^{2}}\right)}\sqrt{\sum_{t=1}^{t^{\prime}}\xi_{t,A_{t}}\mathbb{I}[i_{t}=2]}+c_{2}\log\frac{1}{\min_{\tau\leq t^{\prime}}q_{\tau,2}}
O(c1t=1tξt,At𝕀[it=2]logT+c2logT).\displaystyle\leq O\left(\sqrt{c_{1}\sum_{t=1}^{t^{\prime}}\xi_{t,A_{t}}\mathbb{I}[i_{t}=2]\log T}+c_{2}\log T\right).
term5=u2Bt=u2(c1t=1tξt,At𝕀[it=2]qt,22+c2minttqt,2).\displaystyle\textbf{term}_{5}=-u_{2}B_{t^{\prime}}=-u_{2}\left(\sqrt{c_{1}\sum_{t=1}^{t^{\prime}}\frac{\xi_{t,A_{t}}\mathbb{I}[i_{t}=2]}{q_{t,2}^{2}}}+\frac{c_{2}}{\min_{t\leq t^{\prime}}q_{t,2}}\right).

Combining all inequalities above, we can bound 𝔼[t=1tqtu,zt]\mathbb{E}\left[\sum_{t=1}^{t^{\prime}}\langle q_{t}-u,z_{t}\rangle\right] by

O(𝔼[c1t=1ti=12(𝕀[it=i]qt,i)2ξt,AtlogT+c1t=1tξt,At𝕀[it=2]logT]+(c1+c2)logT)pos-term\displaystyle\underbrace{O\left(\mathbb{E}\left[\sqrt{c_{1}\sum_{t=1}^{t^{\prime}}\sum_{i=1}^{2}(\mathbb{I}[i_{t}=i]-q_{t,i})^{2}\xi_{t,A_{t}}\log T}+\sqrt{c_{1}\sum_{t=1}^{t^{\prime}}\xi_{t,A_{t}}\mathbb{I}[i_{t}=2]\log T}\right]+(\sqrt{c_{1}}+c_{2})\log T\right)}_{\textbf{pos-term}} (25)
u2(c1t=1tξt,At𝕀[it=2]qt,22+c2minttqt,2)neg-term.\displaystyle\qquad-u_{2}\underbrace{\left(\sqrt{c_{1}\sum_{t=1}^{t^{\prime}}\frac{\xi_{t,A_{t}}\mathbb{I}[i_{t}=2]}{q_{t,2}^{2}}}+\frac{c_{2}}{\min_{t\leq t^{\prime}}q_{t,2}}\right)}_{\textbf{neg-term}}.

Similar to the arguments in the proofs of Theorem 11 and Theorem 18, we end up bounding $\mathbb{E}\left[\sum_{t=1}^{t^{\prime}}(\ell_{t,A_{t}}-\ell_{t,x})\right]$ by the pos-term above for all $x\in\mathcal{X}$. Finally, we further simplify pos-term. Observe that

𝔼t[i=12(𝕀[it=i]qt,i)2ξt,At]\displaystyle\mathbb{E}_{t}\left[\sum_{i=1}^{2}(\mathbb{I}[i_{t}=i]-q_{t,i})^{2}\xi_{t,A_{t}}\right]
=𝔼t[qt,1(1qt,1)2ξt,x^+(1qt,1)qt,12ξt,A~t+qt,2(1qt,2)2ξt,A~t+(1qt,2)qt,22ξt,x^]\displaystyle=\mathbb{E}_{t}\left[q_{t,1}(1-q_{t,1})^{2}\xi_{t,\widehat{x}}+(1-q_{t,1})q_{t,1}^{2}\xi_{t,\widetilde{A}_{t}}+q_{t,2}(1-q_{t,2})^{2}\xi_{t,\widetilde{A}_{t}}+(1-q_{t,2})q_{t,2}^{2}\xi_{t,\widehat{x}}\right]
=𝔼t[2qt,1qt,22ξt,x^+2qt,12qt,2ξt,A~t]\displaystyle=\mathbb{E}_{t}\left[2q_{t,1}q_{t,2}^{2}\xi_{t,\widehat{x}}+2q_{t,1}^{2}q_{t,2}\xi_{t,\widetilde{A}_{t}}\right]
=2qt,1qt,22ξt,x^+2qt,12(xx^pt,xξt,x)\displaystyle=2q_{t,1}q_{t,2}^{2}\xi_{t,\widehat{x}}+2q_{t,1}^{2}\left(\sum_{x\neq\widehat{x}}p_{t,x}\xi_{t,x}\right)
2pt,x^(1pt,x^)ξt,x^+2(xx^pt,xξt,x)\displaystyle\leq 2p_{t,\widehat{x}}(1-p_{t,\widehat{x}})\xi_{t,\widehat{x}}+2\left(\sum_{x\neq\widehat{x}}p_{t,x}\xi_{t,x}\right)
=2(xpt,xξt,xpt,x^2ξt,x^),\displaystyle=2\left(\sum_{x}p_{t,x}\xi_{t,x}-p_{t,\widehat{x}}^{2}\xi_{t,\widehat{x}}\right),

and that

\displaystyle\mathbb{E}_{t}\left[\xi_{t,\widetilde{A}_{t}}\mathbb{I}[i_{t}=2]\right]=\sum_{x\neq\widehat{x}}p_{t,x}\xi_{t,x}\leq\sum_{x}p_{t,x}\xi_{t,x}-p_{t,\widehat{x}}^{2}\xi_{t,\widehat{x}}\,.

Thus, for any u𝒳u\in\mathcal{X},

𝔼[t=1t(t,Att,u)]𝔼[pos-term]O(c1𝔼[t=1t(xpt,xξt,xpt,x^2ξt,x^)]logT+(c1+c2)logT),\displaystyle\mathbb{E}\left[\sum_{t=1}^{t^{\prime}}(\ell_{t,A_{t}}-\ell_{t,u})\right]\leq\mathbb{E}\left[\textbf{pos-term}\right]\leq O\left(\sqrt{c_{1}\mathbb{E}\left[\sum_{t=1}^{t^{\prime}}\left(\sum_{x}p_{t,x}\xi_{t,x}-p_{t,\widehat{x}}^{2}\xi_{t,\widehat{x}}\right)\right]\log T}+(\sqrt{c_{1}}+c_{2})\log T\right),

which implies that the algorithm satisfies 12\frac{1}{2}-dd-LSB with constants (O(c1),O(c1+c2))(O(c_{1}),O(\sqrt{c_{1}}+c_{2})). ∎

E.4 12\frac{1}{2}-LSB to 12\frac{1}{2}-strongly-iw-stable (Algorithm 7 / Theorem 25)

Algorithm 7 LSB via Corral (for α=12\alpha=\frac{1}{2}, using a 12\frac{1}{2}-strongly-iw-stable algorithm)

Input: candidate action $\widehat{x}$, $\frac{1}{2}$-strongly-iw-stable algorithm $\mathcal{B}$ over $\mathcal{X}$ with constants $(c_{1},c_{2})$.
Define: ψt(q)=2ηti=12qi+1βi=12ln1qi\psi_{t}(q)=\frac{-2}{\eta_{t}}\sum_{i=1}^{2}\sqrt{q_{i}}+\frac{1}{\beta}\sum_{i=1}^{2}\ln\frac{1}{q_{i}}.
B0=0B_{0}=0.
for t=1,2,t=1,2,\ldots do

       Let \mathcal{B} generate an action A~t\widetilde{A}_{t} (which is the action to be chosen if \mathcal{B} is selected in this round).
Let
q¯t=argminqΔ2{q,τ=1t1zτ[0Bt1]+ψt(q)},qt=(112t2)q¯t+14t2𝟏,\displaystyle\bar{q}_{t}=\operatorname*{argmin}_{q\in\Delta_{2}}\left\{\left\langle q,\sum_{\tau=1}^{t-1}z_{\tau}-\begin{bmatrix}0\\ B_{t-1}\end{bmatrix}\right\rangle+\psi_{t}(q)\right\},\quad q_{t}=\left(1-\frac{1}{2t^{2}}\right)\bar{q}_{t}+\frac{1}{4t^{2}}\mathbf{1},
whereηt=1τ=1t𝕀{A~τx^}+8c1,β=18c2.\displaystyle\text{where}\ \ \ \eta_{t}=\frac{1}{\sqrt{\sum_{\tau=1}^{t}\mathbb{I}\{\widetilde{A}_{\tau}\neq\widehat{x}\}}+8\sqrt{c_{1}}},\quad\beta=\frac{1}{8c_{2}}.
Sample itqti_{t}\sim q_{t}.
if it=1i_{t}=1 then  draw At=x^A_{t}=\widehat{x} and observe t,At\ell_{t,A_{t}} ;
       else  draw At=A~tA_{t}=\widetilde{A}_{t} and observe t,At\ell_{t,A_{t}} ;
       Define zt,i=t,At𝕀{it=i}qt,i𝕀{A~tx^}z_{t,i}=\frac{\ell_{t,A_{t}}\mathbb{I}\{i_{t}=i\}}{q_{t,i}}\mathbb{I}\{\widetilde{A}_{t}\neq\widehat{x}\} and
\displaystyle B_{t}=\sqrt{c_{1}\sum_{\tau=1}^{t}\frac{\mathbb{I}\left\{\widetilde{A}_{\tau}\neq\widehat{x}\right\}}{q_{\tau,2}}}+c_{2}\max_{\tau\leq t}\frac{1}{q_{\tau,2}}.
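
A small sketch of the quantities specific to Algorithm 7, where the loss estimates and the Tsallis part of the bonus are active only on rounds in which the base algorithm deviates from $\widehat{x}$. The helper names are ours and serve illustration only.

```python
import math

def eta(num_base_deviations, c1):
    """eta_t = 1 / (sqrt(sum_{tau<=t} I{A~_tau != x_hat}) + 8*sqrt(c1))."""
    return 1.0 / (math.sqrt(num_base_deviations) + 8.0 * math.sqrt(c1))

def loss_estimate(loss, i_t, q, base_deviates):
    """z_{t,i}: importance weighting, zeroed out whenever the base algorithm already
    proposes x_hat (in that case it receives feedback regardless of the coin)."""
    if not base_deviates:                 # A~_t == x_hat
        return [0.0, 0.0]
    return [loss * (1.0 if i_t == i + 1 else 0.0) / q[i] for i in range(2)]

def bonus(c1, c2, sum_inv_q2_on_deviation, min_q2):
    """B_t = sqrt(c1 * sum_{tau<=t} I{A~_tau != x_hat} / q_{tau,2}) + c2 / min_{tau<=t} q_{tau,2}."""
    return math.sqrt(c1 * sum_inv_q2_on_deviation) + c2 / min_q2
```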
Proof of Theorem 25.

The proof of this theorem mostly follows that of Theorem 11. The difference is that the regret of the base algorithm \mathcal{B} is now bounded by

c1t=1t1qt,2+qt,1𝕀{A~t=x^}+c2logTminttqt,2c1t=1t𝕀{A~tx^}qt,2+c1t+c2logTminttqt,2\displaystyle\sqrt{c_{1}\sum_{t=1}^{t^{\prime}}\frac{1}{q_{t,2}+q_{t,1}\mathbb{I}\{\widetilde{A}_{t}=\widehat{x}\}}}+\frac{c_{2}\log T}{\min_{t\leq t^{\prime}}q_{t,2}}\leq\sqrt{c_{1}\sum_{t=1}^{t^{\prime}}\frac{\mathbb{I}\{\widetilde{A}_{t}\neq\widehat{x}\}}{q_{t,2}}}+\sqrt{c_{1}t^{\prime}}+\frac{c_{2}\log T}{\min_{t\leq t^{\prime}}q_{t,2}} (26)

because when $\widetilde{A}_{t}=\widehat{x}$, the base algorithm $\mathcal{B}$ receives feedback no matter which side the Corral algorithm chooses. The goal of adding the bonus is now only to cancel the first and the third terms on the right-hand side of (26).

Similar to Eq. (18), we have

t=1tqtu,ztt=1tqtq¯t,ztterm1\displaystyle\sum_{t=1}^{t^{\prime}}\langle q_{t}-u,z_{t}\rangle\leq\underbrace{\sum_{t=1}^{t^{\prime}}\langle q_{t}-\bar{q}_{t},z_{t}\rangle}_{\textbf{term}_{1}}
+O(c1+t=1tqt,2𝕀{A~tx^}τ=1t𝕀{A~τx^}term2+logTβterm3+t=1tηtmin|θt|1qt,i32(zt,iθt)2term4+t=1tqt,2btTsterm5+t=1tqt,2btLoterm6)u2t=1tbt\displaystyle\ \ +O\bigg{(}\sqrt{c_{1}}+\underbrace{\sum_{t=1}^{t^{\prime}}\frac{\sqrt{q_{t,2}}\mathbb{I}\{\widetilde{A}_{t}\neq\widehat{x}\}}{\sqrt{\sum_{\tau=1}^{t}\mathbb{I}\{\widetilde{A}_{\tau}\neq\widehat{x}\}}}}_{\textbf{term}_{2}}+\underbrace{\frac{\log T}{\beta}}_{\textbf{term}_{3}}+\underbrace{\sum_{t=1}^{t^{\prime}}\eta_{t}\min_{|\theta_{t}|\leq 1}q_{t,i}^{\frac{3}{2}}(z_{t,i}-\theta_{t})^{2}}_{\textbf{term}_{4}}+\underbrace{\sum_{t=1}^{t^{\prime}}q_{t,2}b_{t}^{\scalebox{0.5}{{{Ts}}}}}_{\textbf{term}_{5}}+\underbrace{\sum_{t=1}^{t^{\prime}}q_{t,2}b_{t}^{\scalebox{0.5}{{{Lo}}}}}_{\textbf{term}_{6}}\bigg{)}-u_{2}\sum_{t=1}^{t^{\prime}}b_{t} (27)

where

btTs\displaystyle b_{t}^{\scalebox{0.5}{{{Ts}}}} =c1τ=1t𝕀{A~τx^}qτ,2c1τ=1t1𝕀{A~τx^}qτ,2c1𝕀{A~tx^}qt,2c1τ=1t𝕀{A~τx^}qτ,2c1qt,2,\displaystyle=\sqrt{c_{1}\sum_{\tau=1}^{t}\frac{\mathbb{I}\{\widetilde{A}_{\tau}\neq\widehat{x}\}}{q_{\tau,2}}}-\sqrt{c_{1}\sum_{\tau=1}^{t-1}\frac{\mathbb{I}\{\widetilde{A}_{\tau}\neq\widehat{x}\}}{q_{\tau,2}}}\leq\frac{\frac{c_{1}\mathbb{I}\{\widetilde{A}_{t}\neq\widehat{x}\}}{q_{t,2}}}{\sqrt{c_{1}\sum_{\tau=1}^{t}\frac{\mathbb{I}\{\widetilde{A}_{\tau}\neq\widehat{x}\}}{q_{\tau,2}}}}\leq\sqrt{\frac{c_{1}}{q_{t,2}}},
btLo\displaystyle b_{t}^{\scalebox{0.5}{{{Lo}}}} =c2minτtqτ,2c2minτt1qτ,2=c2qt,2(1minτtqτ,2minτt1qτ,2)\displaystyle=\frac{c_{2}}{\min_{\tau\leq t}q_{\tau,2}}-\frac{c_{2}}{\min_{\tau\leq t-1}q_{\tau,2}}=\frac{c_{2}}{q_{t,2}}\left(1-\frac{\min_{\tau\leq t}q_{\tau,2}}{\min_{\tau\leq t-1}q_{\tau,2}}\right)

satisfying ηtq¯t,2btTs12\eta_{t}\sqrt{\bar{q}_{t,2}}b_{t}^{\scalebox{0.5}{{{Ts}}}}\leq\frac{1}{2} and βq¯t,2btLo12\beta\bar{q}_{t,2}b_{t}^{\scalebox{0.5}{{{Lo}}}}\leq\frac{1}{2} as in (20) and (21).

Below, we bound term1,,term6\textbf{term}_{1},\ldots,\textbf{term}_{6}.

𝔼[term1]=𝔼[t=1tqtq¯t,zt]O(t=1t1t2)=O(1).\displaystyle\mathbb{E}[\textbf{term}_{1}]=\mathbb{E}\left[\sum_{t=1}^{t^{\prime}}\langle q_{t}-\bar{q}_{t},z_{t}\rangle\right]\leq O\left(\sum_{t=1}^{t^{\prime}}\frac{1}{t^{2}}\right)=O(1).
term2\displaystyle\textbf{term}_{2} O(min{t=1t𝕀{A~tx^},t=1tqt,2𝕀{A~tx^}logT})\displaystyle\leq O\left(\min\left\{\sqrt{\sum_{t=1}^{t^{\prime}}\mathbb{I}\{\widetilde{A}_{t}\neq\widehat{x}\}},\ \ \sqrt{\sum_{t=1}^{t^{\prime}}q_{t,2}\mathbb{I}\{\widetilde{A}_{t}\neq\widehat{x}\}\log T}\right\}\right)
term3\displaystyle\textbf{term}_{3} =logTβO(c2logT).\displaystyle=\frac{\log T}{\beta}\leq O\left(c_{2}\log T\right).
𝔼[term4]\displaystyle\mathbb{E}\left[\textbf{term}_{4}\right] =𝔼[t=1tηtminθt[1,1]qt,i32(zt,iθt)2]\displaystyle=\mathbb{E}\left[\sum_{t=1}^{t^{\prime}}\eta_{t}\min_{\theta_{t}\in[-1,1]}q_{t,i}^{\frac{3}{2}}(z_{t,i}-\theta_{t})^{2}\right]
𝔼[t=1tηti=12qt,i32(zt,it,At)2𝕀{A~tx^}]\displaystyle\leq\mathbb{E}\left[\sum_{t=1}^{t^{\prime}}\eta_{t}\sum_{i=1}^{2}q_{t,i}^{\frac{3}{2}}\left(z_{t,i}-\ell_{t,A_{t}}\right)^{2}\mathbb{I}\{\widetilde{A}_{t}\neq\widehat{x}\}\right] (when A~t=x^\widetilde{A}_{t}=\widehat{x}, zt,1=zt,2=0z_{t,1}=z_{t,2}=0)
=𝔼[t=1tηti=121qt,i(𝕀[it=i]t,Atqt,it,At)2𝕀{A~tx^}]\displaystyle=\mathbb{E}\left[\sum_{t=1}^{t^{\prime}}\eta_{t}\sum_{i=1}^{2}\frac{1}{\sqrt{q_{t,i}}}\left(\mathbb{I}[i_{t}=i]\ell_{t,A_{t}}-q_{t,i}\ell_{t,A_{t}}\right)^{2}\mathbb{I}\{\widetilde{A}_{t}\neq\widehat{x}\}\right]
𝔼[t=1tηti=12(qt,i(1qt,i)2+(1qt,i)qt,i32)𝕀{A~tx^}]\displaystyle\leq\mathbb{E}\left[\sum_{t=1}^{t^{\prime}}\eta_{t}\sum_{i=1}^{2}\left(\sqrt{q_{t,i}}(1-q_{t,i})^{2}+(1-q_{t,i})q_{t,i}^{\frac{3}{2}}\right)\mathbb{I}\{\widetilde{A}_{t}\neq\widehat{x}\}\right]
𝔼[t=1tηtqt,2𝕀{A~tx^}]O(𝔼[term2]).\displaystyle\leq\mathbb{E}\left[\sum_{t=1}^{t^{\prime}}\eta_{t}\sqrt{q_{t,2}\mathbb{I}\{\widetilde{A}_{t}\neq\widehat{x}\}}\right]\leq O\left(\mathbb{E}[\textbf{term}_{2}]\right).
term5+term6\displaystyle\textbf{term}_{5}+\textbf{term}_{6} =t=1tqt,2(btTs+btLo)\displaystyle=\sum_{t=1}^{t^{\prime}}q_{t,2}(b_{t}^{\scalebox{0.5}{{{Ts}}}}+b_{t}^{\scalebox{0.5}{{{Lo}}}})
=t=1tqt,2(c1𝕀{A~tx^}qt,2c1τ=1t𝕀{A~τx^}qτ,2+c2(1minτtqτ,21minτt1qτ,2))\displaystyle=\sum_{t=1}^{t^{\prime}}q_{t,2}\left(\frac{\frac{c_{1}\mathbb{I}\{\widetilde{A}_{t}\neq\widehat{x}\}}{q_{t,2}}}{\sqrt{c_{1}\sum_{\tau=1}^{t}\frac{\mathbb{I}\{\widetilde{A}_{\tau}\neq\widehat{x}\}}{q_{\tau,2}}}}+c_{2}\left(\frac{1}{\min_{\tau\leq t}q_{\tau,2}}-\frac{1}{\min_{\tau\leq t-1}q_{\tau,2}}\right)\right)
=c1t=1t𝕀{A~tx^}qt,2τ=1t𝕀{A~τx^}qτ,2×qt,2𝕀{A~tx^}+c2t=1t(1minτtqτ,2minτt1qτ,2)\displaystyle=\sqrt{c_{1}}\sum_{t=1}^{t^{\prime}}\frac{\frac{\mathbb{I}\{\widetilde{A}_{t}\neq\widehat{x}\}}{\sqrt{q_{t,2}}}}{\sqrt{\sum_{\tau=1}^{t}\frac{\mathbb{I}\{\widetilde{A}_{\tau}\neq\widehat{x}\}}{q_{\tau,2}}}}\times\sqrt{q_{t,2}\mathbb{I}\{\widetilde{A}_{t}\neq\widehat{x}\}}+c_{2}\sum_{t=1}^{t^{\prime}}\left(1-\frac{\min_{\tau\leq t}q_{\tau,2}}{\min_{\tau\leq t-1}q_{\tau,2}}\right) (28)
c1t=1t𝕀{A~tx^}qt,2τ=1t𝕀{A~τx^}qτ,2t=1tqt,2𝕀{A~tx^}+c2t=1tlog(minτt1qτ,2minτtqτ,2)\displaystyle\leq\sqrt{c_{1}}\sqrt{\sum_{t=1}^{t^{\prime}}\frac{\frac{\mathbb{I}\{\widetilde{A}_{t}\neq\widehat{x}\}}{q_{t,2}}}{\sum_{\tau=1}^{t}\frac{\mathbb{I}\{\widetilde{A}_{\tau}\neq\widehat{x}\}}{q_{\tau,2}}}}\sqrt{\sum_{t=1}^{t^{\prime}}q_{t,2}\mathbb{I}\{\widetilde{A}_{t}\neq\widehat{x}\}}+c_{2}\sum_{t=1}^{t^{\prime}}\log\left(\frac{\min_{\tau\leq t-1}q_{\tau,2}}{\min_{\tau\leq t}q_{\tau,2}}\right) (Cauchy-Schwarz)
c11+log(t=1t𝕀{A~tx^}qt,2)t=1tqt,2𝕀{A~tx^}+c2log1minτtqτ,2\displaystyle\leq\sqrt{c_{1}}\sqrt{1+\log\left(\sum_{t=1}^{t^{\prime}}\frac{\mathbb{I}\{\widetilde{A}_{t}\neq\widehat{x}\}}{q_{t,2}}\right)}\sqrt{\sum_{t=1}^{t^{\prime}}q_{t,2}\mathbb{I}\{\widetilde{A}_{t}\neq\widehat{x}\}}+c_{2}\log\frac{1}{\min_{\tau\leq t^{\prime}}q_{\tau,2}}
O(c1t=1tqt,2𝕀{A~tx^}logT+c2logT).\displaystyle\leq O\left(\sqrt{c_{1}\sum_{t=1}^{t^{\prime}}q_{t,2}\mathbb{I}\{\widetilde{A}_{t}\neq\widehat{x}\}\log T}+c_{2}\log T\right). (29)

Continuing from (28), we also have term5c1t=1t𝕀{A~tx^}τ=1t𝕀{A~τx^}2c1t\textbf{term}_{5}\leq\sqrt{c_{1}}\sum_{t=1}^{t^{\prime}}\frac{\mathbb{I}\{\widetilde{A}_{t}\neq\widehat{x}\}}{\sum_{\tau=1}^{t}\mathbb{I}\{\widetilde{A}_{\tau}\neq\widehat{x}\}}\leq 2\sqrt{c_{1}t^{\prime}}.

Using all the inequalities above in (27), we can bound 𝔼[t=1tqtu,zt]\mathbb{E}\left[\sum_{t=1}^{t^{\prime}}\langle q_{t}-u,z_{t}\rangle\right] by

O(min{c1𝔼[t],c1[t=1tqt,2𝕀{A~tx^}]logT}+(c1+c2)logT)pos-term\displaystyle\underbrace{O\left(\min\left\{\sqrt{c_{1}\mathbb{E}[t^{\prime}]},\ \ \sqrt{c_{1}\left[\sum_{t=1}^{t^{\prime}}q_{t,2}\mathbb{I}\{\widetilde{A}_{t}\neq\widehat{x}\}\right]\log T}\right\}+(\sqrt{c_{1}}+c_{2})\log T\right)}_{\textbf{pos-term}}
u2𝔼[c1t=1t𝕀{A~tx^}qt,2+c2minttqt,2]neg-term.\displaystyle\qquad\qquad\qquad-u_{2}\underbrace{\mathbb{E}\left[\sqrt{c_{1}\sum_{t=1}^{t^{\prime}}\frac{\mathbb{I}\{\widetilde{A}_{t}\neq\widehat{x}\}}{q_{t,2}}}+\frac{c_{2}}{\min_{t\leq t^{\prime}}q_{t,2}}\right]}_{\textbf{neg-term}}.

For comparator $\widehat{x}$, we set $u=\mathbf{e}_{1}$. Then we have $\mathbb{E}[\sum_{t=1}^{t^{\prime}}(\ell_{t,A_{t}}-\ell_{t,\widehat{x}})]\leq\textbf{pos-term}$. Observe that $\mathbb{E}\left[\sum_{t=1}^{t^{\prime}}q_{t,2}\mathbb{I}\{\widetilde{A}_{t}\neq\widehat{x}\}\right]=\mathbb{E}\left[\sum_{t=1}^{t^{\prime}}(1-p_{t,\widehat{x}})\right]$, so this gives the desired property of $\frac{1}{2}$-LSB for action $\widehat{x}$. For $x\neq\widehat{x}$, we set $u=\mathbf{e}_{2}$ and get $\mathbb{E}\left[\sum_{t=1}^{t^{\prime}}(\ell_{t,A_{t}}-\ell_{t,x})\right]=\mathbb{E}\left[\sum_{t=1}^{t^{\prime}}(\ell_{t,A_{t}}-\ell_{t,\widetilde{A}_{t}})\right]+\mathbb{E}\left[\sum_{t=1}^{t^{\prime}}(\ell_{t,\widetilde{A}_{t}}-\ell_{t,x})\right]\leq(\textbf{pos-term}-\textbf{neg-term})+(\textbf{neg-term}+\sqrt{c_{1}\mathbb{E}[t^{\prime}]})$, where the additional $\sqrt{c_{1}t^{\prime}}$ term comes from the second term on the right-hand side of (26), which is not cancelled by the negative term. Note, however, that this still satisfies the requirement of $\frac{1}{2}$-LSB for $x\neq\widehat{x}$ because for these actions, we only require the worst-case bound to hold. Overall, we have justified that Algorithm 7 satisfies $\frac{1}{2}$-LSB with constants $(c_{0}^{\prime},c_{1}^{\prime},c_{2}^{\prime})$, where $c_{0}^{\prime}=c_{1}^{\prime}=O(c_{1})$ and $c_{2}^{\prime}=O(\sqrt{c_{1}}+c_{2})$. ∎

E.5 12\frac{1}{2}-dd-LSB to 12\frac{1}{2}-dd-strongly-iw-stable (Algorithm 8 / Lemma 42)

Algorithm 8 dd-LSB via Corral (for α=12\alpha=\frac{1}{2}, using a 12\frac{1}{2}-dd-strongly-iw-stable algorithm)

Input: candidate action $\widehat{x}$, $\frac{1}{2}$-dd-strongly-iw-stable algorithm $\mathcal{B}$ over $\mathcal{X}$ with constants $(c_{1},c_{2})$.
Define: ψ(q)=i=12ln1qi\psi(q)=\sum_{i=1}^{2}\ln\frac{1}{q_{i}}. B0=0B_{0}=0.
Define: For first-order bound, ξt,x=t,x\xi_{t,x}=\ell_{t,x} and mt,x=0m_{t,x}=0; for second-order bound, ξt,x=(t,xmt,x)2\xi_{t,x}=(\ell_{t,x}-m_{t,x})^{2} where mt,xm_{t,x} is the loss predictor.
for t=1,2,t=1,2,\ldots do

       Let $\mathcal{B}$ generate an action $\widetilde{A}_{t}$ (which is the action to be chosen if $\mathcal{B}$ is selected in this round).
Receive prediction $m_{t,x}$ for all $x\in\mathcal{X}$, and set $y_{t,1}=m_{t,\widehat{x}}$ and $y_{t,2}=m_{t,\widetilde{A}_{t}}$.
Let
q¯t=argminqΔ2{q,τ=1t1zτ+yt[0Bt1]+1ηtψ(q)},qt=(112t2)q¯t+14t2𝟏,\displaystyle\bar{q}_{t}=\operatorname*{argmin}_{q\in\Delta_{2}}\left\{\left\langle q,\sum_{\tau=1}^{t-1}z_{\tau}+y_{t}-\begin{bmatrix}0\\ B_{t-1}\end{bmatrix}\right\rangle+\frac{1}{\eta_{t}}\psi(q)\right\},\quad q_{t}=\left(1-\frac{1}{2t^{2}}\right)\bar{q}_{t}+\frac{1}{4t^{2}}\mathbf{1},
whereηt=14(logT)12(τ=1t1(𝕀[iτ=i]qτ,i)2ξτ,Aτ𝕀[A~τx^]+(c1+c22)logT)12.\displaystyle\text{where}\ \ \ \eta_{t}=\frac{1}{4}(\log T)^{\frac{1}{2}}\left(\sum_{\tau=1}^{t-1}(\mathbb{I}[i_{\tau}=i]-q_{\tau,i})^{2}\xi_{\tau,A_{\tau}}\mathbb{I}[\widetilde{A}_{\tau}\neq\widehat{x}]+(c_{1}+c_{2}^{2})\log T\right)^{-\frac{1}{2}}.
Sample itqti_{t}\sim q_{t}.
if it=1i_{t}=1 then  draw At=x^A_{t}=\widehat{x} and observe t,At\ell_{t,A_{t}} ;
       else  draw At=A~tA_{t}=\widetilde{A}_{t} and observe t,At\ell_{t,A_{t}} ;
       Define zt,i=((t,Atyt,i)𝕀{it=i}qt,i+yt,i)𝕀{A~tx^}z_{t,i}=\left(\frac{(\ell_{t,A_{t}}-y_{t,i})\mathbb{I}\{i_{t}=i\}}{q_{t,i}}+y_{t,i}\right)\mathbb{I}\{\widetilde{A}_{t}\neq\widehat{x}\} and
\displaystyle B_{t}=\sqrt{c_{1}\sum_{\tau=1}^{t}\frac{\xi_{\tau,A_{\tau}}\mathbb{I}\{i_{\tau}=2\}\mathbb{I}\{\widetilde{A}_{\tau}\neq\widehat{x}\}}{q_{\tau,2}^{2}}}+\frac{c_{2}}{\min_{\tau\leq t}q_{\tau,2}}. (30)
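
Algorithm 8 combines the optimistic estimates of Algorithm 6 with the masking of Algorithm 7. A short sketch of the two modified quantities (hypothetical helper names, illustration only):

```python
import math

def loss_estimate(loss, y, i_t, q, base_deviates):
    """z_{t,i} in Algorithm 8: the optimistic estimate of Algorithm 6,
    multiplied by I{A~_t != x_hat} as in Algorithm 7."""
    if not base_deviates:
        return [0.0, 0.0]
    return [(loss - y[i]) * (1.0 if i_t == i + 1 else 0.0) / q[i] + y[i] for i in range(2)]

def bonus(c1, c2, sum_masked_weighted_xi, min_q2):
    """B_t from Eq. (30): sqrt(c1 * sum_{tau<=t} xi_{tau,A_tau} I{i_tau=2} I{A~_tau != x_hat}
    / q_{tau,2}^2) + c2 / min_{tau<=t} q_{tau,2}."""
    return math.sqrt(c1 * sum_masked_weighted_xi) + c2 / min_q2
```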
Lemma 42.

Let \mathcal{B} be an algorithm with the following stability guarantee: given an adaptive sequence of weights q1,q2,(0,1]𝒳q_{1},q_{2},\dots\in(0,1]^{\mathcal{X}} such that the feedback in round tt is observed with probability qt(x)q_{t}(x) if xx is chosen, and an adaptive sequence {mt,x}x𝒳\{m_{t,x}\}_{x\in\mathcal{X}} available at the beginning of round tt, it obtains the following pseudo regret guarantee for any stopping time tt^{\prime}:

𝔼[t=1t(t,Att,u)]𝔼[c1t=1tupdtξt,Atqt(At)2+c2minttminxqt(x)],\displaystyle\mathbb{E}\left[\sum_{t=1}^{t^{\prime}}\left(\ell_{t,A_{t}}-\ell_{t,u}\right)\right]\leq\mathbb{E}\left[\sqrt{c_{1}\sum_{t=1}^{t^{\prime}}\frac{\scalebox{0.9}{{upd}}_{t}\cdot\xi_{t,A_{t}}}{q_{t}(A_{t})^{2}}}+\frac{c_{2}}{\min_{t\leq t^{\prime}}\min_{x}q_{t}(x)}\right],

where updt=1\scalebox{0.9}{{upd}}_{t}=1 if feedback is observed in round tt and updt=0\scalebox{0.9}{{upd}}_{t}=0 otherwise. ξt,x=(t,xmt,x)2\xi_{t,x}=(\ell_{t,x}-m_{t,x})^{2} in the second-order bound case, and ξt,x=t,x\xi_{t,x}=\ell_{t,x} in the first-order bound case. Then Algorithm 8 with \mathcal{B} as input satisfies 12\frac{1}{2}-dd-LSB.

Proof.

The proof of this lemma is a combination of the elements in Theorem 22 (data-dependent iw-stable) and Theorem 25 (strongly-iw-stable), so we omit the details here and only provide a sketch.

In Algorithm 8, since \mathcal{B} is 12\frac{1}{2}-dd-strongly-iw-stable, its regret is upper bounded by the order of

c1t=1t(𝕀{A~t=x^}ξt,At1+𝕀{A~tx^}𝕀{it=2}ξt,Atqt,22)+c2minttminxqt(x)\displaystyle\sqrt{c_{1}\sum_{t=1}^{t^{\prime}}\left(\frac{\mathbb{I}\{\widetilde{A}_{t}=\widehat{x}\}\cdot\xi_{t,A_{t}}}{1}+\frac{\mathbb{I}\{\widetilde{A}_{t}\neq\widehat{x}\}\mathbb{I}\{i_{t}=2\}\xi_{t,A_{t}}}{q_{t,2}^{2}}\right)}+\frac{c_{2}}{\min_{t\leq t^{\prime}}\min_{x}q_{t}(x)}
c1t=1t𝕀{A~tx^}𝕀{it=2}ξt,Atqt,22+c1t=1tξt,At+c2minttminxqt(x)\displaystyle\leq\sqrt{c_{1}\sum_{t=1}^{t^{\prime}}\frac{\mathbb{I}\{\widetilde{A}_{t}\neq\widehat{x}\}\mathbb{I}\{i_{t}=2\}\cdot\xi_{t,A_{t}}}{q_{t,2}^{2}}}+\sqrt{c_{1}\sum_{t=1}^{t^{\prime}}\xi_{t,A_{t}}}+\frac{c_{2}}{\min_{t\leq t^{\prime}}\min_{x}q_{t}(x)} (31)

because if \mathcal{B} chooses x^\widehat{x}, then the probability of observing the feedback is 11, and is qt,2q_{t,2} otherwise. This motivates the choice of the bonus in (30). Then we can follow the proof of Theorem 22 step-by-step, and show that the regret compared to x^\widehat{x} is upper bounded by the order of

𝔼[c1t=1ti=12(𝕀[it=i]qt,i)2ξt,At𝕀{A~tx^}logT]\displaystyle\mathbb{E}\left[\sqrt{c_{1}\sum_{t=1}^{t^{\prime}}\sum_{i=1}^{2}(\mathbb{I}[i_{t}=i]-q_{t,i})^{2}\xi_{t,A_{t}}\mathbb{I}\{\widetilde{A}_{t}\neq\widehat{x}\}\log T}\right]
+𝔼[c1t=1tξt,At𝕀[it=2]𝕀{A~tx^}logT]+(c1+c2)logT\displaystyle\qquad+\mathbb{E}\left[\sqrt{c_{1}\sum_{t=1}^{t^{\prime}}\xi_{t,A_{t}}\mathbb{I}[i_{t}=2]\mathbb{I}\{\widetilde{A}_{t}\neq\widehat{x}\}\log T}\right]+(\sqrt{c_{1}}+c_{2})\log T
c1𝔼[t=1t(2qt,1qt,22ξt,x^+2qt,12qt,2ξt,A~t)𝕀{A~tx^}]\displaystyle\leq\sqrt{c_{1}\mathbb{E}\left[\sum_{t=1}^{t^{\prime}}\left(2q_{t,1}q_{t,2}^{2}\xi_{t,\widehat{x}}+2q_{t,1}^{2}q_{t,2}\xi_{t,\widetilde{A}_{t}}\right)\mathbb{I}\{\widetilde{A}_{t}\neq\widehat{x}\}\right]}
+c1𝔼[t=1tξt,A~tqt,2𝕀{A~tx^}]logT+(c1+c2)logT\displaystyle\qquad+\sqrt{c_{1}\mathbb{E}\left[\sum_{t=1}^{t^{\prime}}\xi_{t,\widetilde{A}_{t}}q_{t,2}\mathbb{I}\{\widetilde{A}_{t}\neq\widehat{x}\}\right]\log T}+(\sqrt{c_{1}}+c_{2})\log T (taking expectation over iti_{t} and following the calculation in the proof of Theorem 22)
=c1𝔼[t=1t(2qt,1qt,22ξt,x^Pr[A~tx^]+2qt,12qt,2xx^Pr[A~t=x]ξt,x)]\displaystyle=\sqrt{c_{1}\mathbb{E}\left[\sum_{t=1}^{t^{\prime}}\left(2q_{t,1}q_{t,2}^{2}\xi_{t,\widehat{x}}\Pr[\widetilde{A}_{t}\neq\widehat{x}]+2q_{t,1}^{2}q_{t,2}\sum_{x\neq\widehat{x}}\Pr[\widetilde{A}_{t}=x]\xi_{t,x}\right)\right]}
+c1𝔼[t=1tqt,2xx^Pr[A~t=x]ξt,x]logT+(c1+c2)logT\displaystyle\qquad+\sqrt{c_{1}\mathbb{E}\left[\sum_{t=1}^{t^{\prime}}q_{t,2}\sum_{x\neq\widehat{x}}\Pr[\widetilde{A}_{t}=x]\xi_{t,x}\right]\log T}+(\sqrt{c_{1}}+c_{2})\log T (taking expectation over A~t\widetilde{A}_{t})
c1𝔼[t=1t(2qt,1(1pt,x^)ξt,x^+2qt,12xx^pt,xξt,x)]\displaystyle\leq\sqrt{c_{1}\mathbb{E}\left[\sum_{t=1}^{t^{\prime}}\left(2q_{t,1}(1-p_{t,\widehat{x}})\xi_{t,\widehat{x}}+2q_{t,1}^{2}\sum_{x\neq\widehat{x}}p_{t,x}\xi_{t,x}\right)\right]}
+c1𝔼[t=1txx^pt,xξt,x]logT+(c1+c2)logT\displaystyle\qquad+\sqrt{c_{1}\mathbb{E}\left[\sum_{t=1}^{t^{\prime}}\sum_{x\neq\widehat{x}}p_{t,x}\xi_{t,x}\right]\log T}+(\sqrt{c_{1}}+c_{2})\log T (using the property that for xx^x\neq\widehat{x}, pt,x=Pr[A~t=x]qt,2p_{t,x}=\Pr[\widetilde{A}_{t}=x]q_{t,2} and thus 1pt,x^=Pr[A~tx^]qt,21-p_{t,\widehat{x}}=\Pr[\widetilde{A}_{t}\neq\widehat{x}]q_{t,2})
c1𝔼[t=1t(2pt,x^(1pt,x^)ξt,x^+2xx^pt,xξt,x)]\displaystyle\leq\sqrt{c_{1}\mathbb{E}\left[\sum_{t=1}^{t^{\prime}}\left(2p_{t,\widehat{x}}(1-p_{t,\widehat{x}})\xi_{t,\widehat{x}}+2\sum_{x\neq\widehat{x}}p_{t,x}\xi_{t,x}\right)\right]} (qt,1pt,x^1q_{t,1}\leq p_{t,\widehat{x}}\leq 1)
+c1𝔼[t=1txx^pt,xξt,x]logT+(c1+c2)logT\displaystyle\qquad+\sqrt{c_{1}\mathbb{E}\left[\sum_{t=1}^{t^{\prime}}\sum_{x\neq\widehat{x}}p_{t,x}\xi_{t,x}\right]\log T}+(\sqrt{c_{1}}+c_{2})\log T
O(c1𝔼[t=1t(xpt,xξt,xpt,x^2ξt,x^)]logT)+(c1+c2)logT\displaystyle\leq O\left(\sqrt{c_{1}\mathbb{E}\left[\sum_{t=1}^{t^{\prime}}\left(\sum_{x}p_{t,x}\xi_{t,x}-p_{t,\widehat{x}}^{2}\xi_{t,\widehat{x}}\right)\right]\log T}\right)+(\sqrt{c_{1}}+c_{2})\log T

which satisfies the requirement of 12\frac{1}{2}-dd-LSB for the regret against x^\widehat{x}. For the regret against xx^x\neq\widehat{x}, similar to the proof of Theorem 25, an extra positive regret comes from the second term in (31). Therefore, the regret against xx^x\neq\widehat{x}, can be upper bounded by

O(c1𝔼[t=1txpt,xξt,x]logT)+(c1+c2)logT\displaystyle O\left(\sqrt{c_{1}\mathbb{E}\left[\sum_{t=1}^{t^{\prime}}\sum_{x}p_{t,x}\xi_{t,x}\right]\log T}\right)+(\sqrt{c_{1}}+c_{2})\log T

which also satisfies the requirement of 12\frac{1}{2}-dd-LSB for xx^x\neq\widehat{x}. ∎

Appendix F Analysis for IW-Stable Algorithms

Algorithm 9 EXP2

Input: 𝒳\mathcal{X}.
for t=1,2,t=1,2,\ldots do

       Receive update probability qtq_{t}.
Let
ηt=min{ln|𝒳|dτ=1t1qτ,12dminτtqτ},Pt(a)exp(ηtτ=1t1^τ(a)).\displaystyle\eta_{t}=\min\left\{\sqrt{\frac{\ln|\mathcal{X}|}{d\sum_{\tau=1}^{t}\frac{1}{q_{\tau}}}},\ \frac{1}{2d}\min_{\tau\leq t}q_{\tau}\right\},\qquad P_{t}(a)\propto\exp\left(-\eta_{t}\sum_{\tau=1}^{t-1}\widehat{\ell}_{\tau}(a)\right).
Sample an action atpt=(1dηtqt)Pt+dηtqtνa_{t}\sim p_{t}=(1-\frac{d\eta_{t}}{q_{t}})P_{t}+\frac{d\eta_{t}}{q_{t}}\nu, where ν\nu is John’s exploration.
With probability qtq_{t}, receive t(at)=at,θt+noise\ell_{t}(a_{t})=\langle a_{t},\theta_{t}\rangle+\text{noise}
(in this case, set updt=1\scalebox{0.9}{{upd}}_{t}=1; otherwise, set updt=0\scalebox{0.9}{{upd}}_{t}=0).
Construct loss estimator:
^t(a)=updtqta(𝔼bpt[bb])1att(at)\displaystyle\widehat{\ell}_{t}(a)=\frac{\scalebox{0.9}{{upd}}_{t}}{q_{t}}a^{\top}\left(\mathbb{E}_{b\sim p_{t}}[bb^{\top}]\right)^{-1}a_{t}\ell_{t}(a_{t})
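To make the procedure concrete, the following is a minimal Python sketch of a single round of Algorithm 9, assuming the action set spans ℝ^d and that an exploration distribution nu (e.g. John's exploration) has been precomputed; the feedback oracle loss_fn and the overall interface are illustrative assumptions, not part of the algorithm's specification.

```python
import numpy as np

def exp2_step(X, cum_losses, qs, nu, loss_fn, rng):
    """One round of an EXP2-style update with importance weighting (sketch).

    X          : (n, d) array of action vectors (assumed to span R^d).
    cum_losses : (n,) cumulative loss estimates sum_{tau < t} hat-ell_tau(a).
    qs         : update probabilities q_1, ..., q_t (q_t = qs[-1] is the current one).
    nu         : (n,) exploration distribution, e.g. John's exploration.
    loss_fn    : hypothetical feedback oracle mapping an action vector to a loss.
    """
    n, d = X.shape
    qs = np.asarray(qs, dtype=float)
    q_t = qs[-1]
    # Learning rate of Algorithm 9.
    eta_t = min(np.sqrt(np.log(n) / (d * np.sum(1.0 / qs))), qs.min() / (2 * d))
    # FTRL with negentropy potential = exponential weights over cumulative estimates.
    w = np.exp(-eta_t * (cum_losses - cum_losses.min()))
    P_t = w / w.sum()
    gamma = d * eta_t / q_t                      # exploration mixing weight (<= 1/2)
    p_t = (1 - gamma) * P_t + gamma * nu
    a_idx = rng.choice(n, p=p_t)
    # With probability q_t the loss is revealed and the least-squares estimator
    # hat-ell_t(a) = (1/q_t) a^T Sigma^{-1} a_t ell_t(a_t) is added for every a.
    if rng.random() < q_t:
        loss = loss_fn(X[a_idx])
        Sigma = X.T @ (p_t[:, None] * X)         # E_{b ~ p_t}[b b^T]
        cum_losses = cum_losses + (X @ np.linalg.solve(Sigma, X[a_idx])) * loss / q_t
    return a_idx, cum_losses
```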

F.1 EXP2 (Algorithm 9 / Lemma 9)

Proof of Lemma 9.

Consider the EXP2 algorithm (Algorithm 9) which corresponds to FTRL with negentropy potential. By standard analysis of FTRL (Lemma 27), for any τ\tau and any aa^{\star},

t=1τ𝔼t[aPt(a)^t(a)]t=1τ𝔼t[^t(a)]\displaystyle\sum_{t=1}^{\tau}\mathbb{E}_{t}\left[\sum_{a}P_{t}(a)\widehat{\ell}_{t}(a)\right]-\sum_{t=1}^{\tau}\mathbb{E}_{t}\left[\widehat{\ell}_{t}(a^{\star})\right] ln|𝒳|ητ+t=1τ𝔼t[maxP(PtP,t1ηtDψ(P,Pt))].\displaystyle\leq\frac{\ln|\mathcal{X}|}{\eta_{\tau}}+\sum_{t=1}^{\tau}\mathbb{E}_{t}\left[\max_{P}\left(\langle P_{t}-P,\ell_{t}\rangle-\frac{1}{\eta_{t}}D_{\psi}(P,P_{t})\right)\right]\,.

To apply Lemma 31, we need to show that ηt^t(a)1\eta_{t}\widehat{\ell}_{t}(a)\geq-1. We have by Cauchy Schwarz |a(𝔼bpt[bb])1at|a(𝔼bpt[bb])1aat(𝔼bpt[bb])1at|a^{\top}\left(\mathbb{E}_{b\sim p_{t}}[bb^{\top}]\right)^{-1}a_{t}|\leq\sqrt{a^{\top}\left(\mathbb{E}_{b\sim p_{t}}[bb^{\top}]\right)^{-1}a}\sqrt{a_{t}^{\top}\left(\mathbb{E}_{b\sim p_{t}}[bb^{\top}]\right)^{-1}a_{t}}. For each term, we have

a(𝔼bpt[bb])1aqtdηta(𝔼bν[bb])1a=qtηt,\displaystyle a^{\top}\left(\mathbb{E}_{b\sim p_{t}}[bb^{\top}]\right)^{-1}a\leq\frac{q_{t}}{d\eta_{t}}a^{\top}\left(\mathbb{E}_{b\sim\nu}[bb^{\top}]\right)^{-1}a=\frac{q_{t}}{\eta_{t}}\,,

due to the properties of John’s exploration. Hence |\eta_{t}\widehat{\ell}_{t}(a)|\leq|\ell_{t}(a_{t})|\leq 1. We can now apply Lemma 31 to the stability term, resulting in

𝔼t[maxP(PtP,t1ηtDψ(P,Pt))]\displaystyle\mathbb{E}_{t}\left[\max_{P}\left(\langle P_{t}-P,\ell_{t}\rangle-\frac{1}{\eta_{t}}D_{\psi}(P,P_{t})\right)\right]
ηt𝔼t[aPt(a)a(𝔼bpt[bb])1atat(𝔼bpt[bb])1aqt2]\displaystyle\leq\eta_{t}\mathbb{E}_{t}\left[\sum_{a}P_{t}(a)\frac{a^{\top}\left(\mathbb{E}_{b\sim p_{t}}[bb^{\top}]\right)^{-1}a_{t}a_{t}^{\top}\left(\mathbb{E}_{b\sim p_{t}}[bb^{\top}]\right)^{-1}a}{q_{t}^{2}}\right]
=ηtaPt(a)a(𝔼bpt[bb])1aqt\displaystyle=\eta_{t}\sum_{a}P_{t}(a)\frac{a^{\top}\left(\mathbb{E}_{b\sim p_{t}}[bb^{\top}]\right)^{-1}a}{q_{t}}
\displaystyle\leq 2\eta_{t}\sum_{a}P_{t}(a)\frac{a^{\top}\left(\mathbb{E}_{b\sim P_{t}}[bb^{\top}]\right)^{-1}a}{q_{t}}=\frac{2\eta_{t}d}{q_{t}}\,. (using p_{t}\succeq(1-\frac{d\eta_{t}}{q_{t}})P_{t} and \frac{d\eta_{t}}{q_{t}}\leq\frac{1}{2} by the choice of \eta_{t})

By |t(a)|1|\ell_{t}(a)|\leq 1 we have furthermore

t=1τ𝔼t[a(Pt(a)pt(a))^t(a)]=t=1τ𝔼t[a(Pt(a)pt(a))t(a)]t=1τdηtqt.\displaystyle\sum_{t=1}^{\tau}\mathbb{E}_{t}\left[\sum_{a}(P_{t}(a)-p_{t}(a))\widehat{\ell}_{t}(a)\right]=\sum_{t=1}^{\tau}\mathbb{E}_{t}\left[\sum_{a}(P_{t}(a)-p_{t}(a))\ell_{t}(a)\right]\leq\sum_{t=1}^{\tau}\frac{d\eta_{t}}{q_{t}}\,.

Combining everything and taking the expectation on both sides leads to

𝔼[t=1τapt(a)t(a)t(a)]\displaystyle\mathbb{E}\left[\sum_{t=1}^{\tau}\sum_{a}p_{t}(a)\ell_{t}(a)-\ell_{t}(a^{\star})\right] 𝔼[log|𝒳|ητ+t=1τ3dηtqt]\displaystyle\leq\mathbb{E}\left[\frac{\log|\mathcal{X}|}{\eta_{\tau}}+\sum_{t=1}^{\tau}\frac{3d\eta_{t}}{q_{t}}\right]
𝔼[7dlog|𝒳|t=1τ1qt+2log|𝒳|d1mintτqt].\displaystyle\leq\mathbb{E}\left[7\sqrt{d\log|\mathcal{X}|\sum_{t=1}^{\tau}\frac{1}{q_{t}}}+2\log|\mathcal{X}|d\frac{1}{\min_{t\leq\tau}q_{t}}\right]\,.

F.2 EXP4 (Algorithm 10 / Lemma 10)

In this section, we use the more standard notation Π\Pi to denote the policy class.

Algorithm 10 EXP4

Input: Π\Pi (policy class), KK (number of arms)
for t=1,2,t=1,2,\ldots do

       Receive context xtx_{t}.
Receive update probability qtq_{t}.
Let
ηt=ln|Π|Kτ=1t1qτ,Pt(π)exp(ηtτ=1t1^τ(π(xτ))).\displaystyle\eta_{t}=\sqrt{\frac{\ln|\Pi|}{K\sum_{\tau=1}^{t}\frac{1}{q_{\tau}}}},\qquad P_{t}(\pi)\propto\exp\left(-\eta_{t}\sum_{\tau=1}^{t-1}\widehat{\ell}_{\tau}\left(\pi(x_{\tau})\right)\right).
Sample an arm a_{t}\sim p_{t} where p_{t}(a)=\sum_{\pi:\pi(x_{t})=a}P_{t}(\pi).
With probability qtq_{t}, receive t(at)\ell_{t}(a_{t}) (in this case, set updt=1\scalebox{0.9}{{upd}}_{t}=1; otherwise, set updt=0\scalebox{0.9}{{upd}}_{t}=0).
Construct loss estimator:
^t(a)=updt𝕀[at=a]t(a)qtpt(a).\displaystyle\widehat{\ell}_{t}(a)=\frac{\scalebox{0.9}{{upd}}_{t}\mathbb{I}[a_{t}=a]\ell_{t}(a)}{q_{t}p_{t}(a)}.
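For illustration, a single round of Algorithm 10 could be sketched in Python as follows; deterministic policies are represented as callables mapping a context to an arm, and the feedback oracle loss_fn and the interface are assumptions of the sketch rather than the paper's implementation.

```python
import numpy as np

def exp4_step(policies, x_t, cum_pol_losses, qs, K, loss_fn, rng):
    """One round of an EXP4-style update with importance weighting (sketch).

    policies       : list of callables pi(context) -> arm in {0, ..., K-1}.
    x_t            : current context.
    cum_pol_losses : (|Pi|,) cumulative estimated losses of the policies.
    qs             : update probabilities q_1, ..., q_t (q_t = qs[-1]).
    loss_fn        : hypothetical feedback oracle mapping an arm to its loss.
    """
    qs = np.asarray(qs, dtype=float)
    q_t = qs[-1]
    eta_t = np.sqrt(np.log(len(policies)) / (K * np.sum(1.0 / qs)))
    w = np.exp(-eta_t * (cum_pol_losses - cum_pol_losses.min()))
    P_t = w / w.sum()
    # Induced arm distribution p_t(a) = sum_{pi : pi(x_t) = a} P_t(pi).
    arms = np.array([pi(x_t) for pi in policies])
    p_t = np.bincount(arms, weights=P_t, minlength=K)
    a_t = rng.choice(K, p=p_t)
    if rng.random() < q_t:                       # feedback arrives w.p. q_t
        ell_hat = loss_fn(a_t) / (q_t * p_t[a_t])
        # hat-ell_t(pi(x_t)) is nonzero only for policies recommending a_t.
        cum_pol_losses = cum_pol_losses + np.where(arms == a_t, ell_hat, 0.0)
    return a_t, cum_pol_losses
```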
Proof of Lemma 10.

Consider the EXP4 algorithm with adaptive stepsize (Algorithm 10), which corresponds to FTRL with negentropy regularization. By standard analysis of FTRL (Lemma 27) and Lemma 31, for any τ\tau and any π\pi^{\star}

t=1τ𝔼t[πPt(π)^t(π(xt))]t=1τ𝔼t[^t(π(xt))]\displaystyle\sum_{t=1}^{\tau}\mathbb{E}_{t}\left[\sum_{\pi}P_{t}(\pi)\widehat{\ell}_{t}(\pi(x_{t}))\right]-\sum_{t=1}^{\tau}\mathbb{E}_{t}\left[\widehat{\ell}_{t}(\pi^{\star}(x_{t}))\right] ln|Π|ητ+t=1τηt2𝔼t[πPt(π)^t2(π(xt))]\displaystyle\leq\frac{\ln|\Pi|}{\eta_{\tau}}+\sum_{t=1}^{\tau}\frac{\eta_{t}}{2}\mathbb{E}_{t}\left[\sum_{\pi}P_{t}(\pi)\widehat{\ell}_{t}^{2}(\pi(x_{t}))\right]
ln|Π|ητ+t=1τηt2πPt(π)t(π(xt))qtpt(π(xt))\displaystyle\leq\frac{\ln|\Pi|}{\eta_{\tau}}+\sum_{t=1}^{\tau}\frac{\eta_{t}}{2}\sum_{\pi}P_{t}(\pi)\frac{\ell_{t}(\pi(x_{t}))}{q_{t}p_{t}(\pi(x_{t}))}
ln|Π|ητ+Kηt2qt2Kln|Π|t=1τ1qt.\displaystyle\leq\frac{\ln|\Pi|}{\eta_{\tau}}+K\frac{\eta_{t}}{2q_{t}}\leq 2\sqrt{K\ln|\Pi|\sum_{t=1}^{\tau}\frac{1}{q_{t}}}.

Taking expectation on both sides finishes the proof. ∎

F.3 (11/log(K))(1-1/\log(K))-Tsallis-INF (Algorithm 11 / Lemma 14)

Algorithm 11 (11/log(K))(1-1/\log(K))-Tsallis-Inf (for strongly observable graphs)

Input: 𝒢x^\mathcal{G}\setminus\widehat{x}.
Define: ψ(x)=i=1Kxiββ(1β)\psi(x)=\sum_{i=1}^{K}\frac{x_{i}^{\beta}}{\beta(1-\beta)}, where β=11/log(K)\beta=1-1/\log(K).
for t=1,2,t=1,2,\ldots do

       Receive update probability qtq_{t}.
Let
pt=argminxΔ([K]){τ=1t1x,^τ+1ηtψ(x)}\displaystyle p_{t}=\operatorname*{argmin}_{x\in\Delta([K])}\left\{\sum_{\tau=1}^{t-1}\langle x,\widehat{\ell}_{\tau}\rangle+\frac{1}{\eta_{t}}\psi(x)\right\}
whereηt=logKs=1t1qs(1+min{α~,αlogK}).\displaystyle\text{where}\ \ \ \eta_{t}=\sqrt{\frac{\log K}{\sum_{s=1}^{t}\frac{1}{q_{s}}(1+\min\{\widetilde{\alpha},\alpha\log K\})}}.
Sample AtptA_{t}\sim p_{t}.
With probability qtq_{t} receive t,i\ell_{t,i} for all At𝒩(i)A_{t}\in\mathcal{N}(i) (in this case, set updt=1\scalebox{0.9}{{upd}}_{t}=1; otherwise, set updt=0\scalebox{0.9}{{upd}}_{t}=0).
Define
^t,i=updt((t,i𝕀{it})𝕀{At𝒩(i)}j𝒩(i)pt,jqt+𝕀{it}),\displaystyle\widehat{\ell}_{t,i}=\scalebox{0.9}{{upd}}_{t}\left(\frac{(\ell_{t,i}-\mathbb{I}\{i\in\mathcal{I}_{t}\})\mathbb{I}\{A_{t}\in\mathcal{N}(i)\}}{\sum_{j\in\mathcal{N}(i)}p_{t,j}q_{t}}+\mathbb{I}\{i\in\mathcal{I}_{t}\}\right)\,,
where t={i[K]:i𝒩(i)pt,i>12} and 𝒩(i)={j𝒢x^|(j,i)E}.\displaystyle\text{where }\mathcal{I}_{t}=\left\{i\in[K]:i\not\in\mathcal{N}(i)\land p_{t,i}>\frac{1}{2}\right\}\,\text{ and }\mathcal{N}(i)=\{j\in\mathcal{G}\setminus\widehat{x}\,|\,(j,i)\in E\}.
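For concreteness, the loss-estimator step of Algorithm 11 can be sketched in Python as follows; the FTRL step itself (the Tsallis-entropy argmin) is omitted, and the interface (with in_neighbors encoding N(i) on 𝒢∖x̂) is an assumption of the sketch.

```python
import numpy as np

def graph_loss_estimator(losses, p, q_t, A_t, upd, in_neighbors):
    """Importance-weighted loss estimator of Algorithm 11 (sketch).

    losses       : (K,) losses ell_{t, i}; only entries with A_t in N(i) are used.
    p            : (K,) sampling distribution p_t.
    q_t          : update probability.
    A_t          : played arm.
    upd          : 1 if feedback was received this round, 0 otherwise.
    in_neighbors : list of sets, in_neighbors[i] = N(i) on the graph without x_hat.
    """
    K = len(p)
    # I_t: arms without a self-loop whose probability exceeds 1/2 (at most one such arm).
    I_t = {i for i in range(K) if i not in in_neighbors[i] and p[i] > 0.5}
    ell_hat = np.zeros(K)
    for i in range(K):
        shift = 1.0 if i in I_t else 0.0
        if upd and A_t in in_neighbors[i]:
            denom = q_t * sum(p[j] for j in in_neighbors[i])
            ell_hat[i] = (losses[i] - shift) / denom + shift
        elif upd:
            ell_hat[i] = shift
    return ell_hat
```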

We require the following graph-theoretic lemmas.

Lemma 43.

Let 𝒢=(V,E)\mathcal{G}=(V,E) be a directed graph with independence number α\alpha and vertex weights pi>0p_{i}>0, then

iV:pipi+j:(j,i)Epj2piαipi.\displaystyle\exists i\in V:\,\frac{p_{i}}{p_{i}+\sum_{j:(j,i)\in E}p_{j}}\leq\frac{2p_{i}\alpha}{\sum_{i}p_{i}}\,.
Proof.

Without loss of generality, assume that p\in\Delta(V), since the statement is scale invariant. The statement is then equivalent to

\displaystyle\max_{i}\left(p_{i}+\sum_{j:(j,i)\in E}p_{j}\right)\geq\frac{1}{2\alpha}\,.

Since \sum_{i}p_{i}=1, the left-hand side is lower bounded by the weighted average \sum_{i}p_{i}\left(p_{i}+\sum_{j:(j,i)\in E}p_{j}\right). Hence it suffices to show

\displaystyle\min_{p\in\Delta(V)}\sum_{i}\left(p_{i}^{2}+\sum_{j:(j,i)\in E}p_{i}p_{j}\right)\geq\frac{1}{2\alpha}\,.

Via K.K.T. conditions, there exists λ\lambda\in\mathbb{R} such that for an optimal solution pp^{\star} it holds

iV:\displaystyle\forall i\in V:\, either 2pi+j:(j,i)Epj+j:(i,j)Epj=λ\displaystyle\text{either \ \ }2p^{\star}_{i}+\sum_{j:(j,i)\in E}p^{\star}_{j}+\sum_{j:(i,j)\in E}p^{\star}_{j}=\lambda
or 2pi+j:(j,i)Epj+j:(i,j)Epjλ and pi=0.\displaystyle\text{or\ \ }2p^{\star}_{i}+\sum_{j:(j,i)\in E}p^{\star}_{j}+\sum_{j:(i,j)\in E}p^{\star}_{j}\geq\lambda\text{ and }p^{\star}_{i}=0\,.

Next we bound \lambda. Take the sub-graph over V_{+}=\{i:\,p^{\star}_{i}>0\} and take a maximal independent set S of V_{+} (|S|\leq\alpha, since V_{+}\subset V). We have

1jV+pj=0+iS(2pi+j:(j,i)Epj+j:(i,j)Epj)=|S|λαλ.\displaystyle 1\leq\underbrace{\sum_{j\not\in V_{+}}p^{\star}_{j}}_{=0}+\sum_{i\in S}\left(2p^{\star}_{i}+\sum_{j:(j,i)\in E}p^{\star}_{j}+\sum_{j:(i,j)\in E}p^{\star}_{j}\right)=|S|\lambda\leq\alpha\lambda.

Hence λ1α\lambda\geq\frac{1}{\alpha}. Finally,

i((pi)2+j:(j,i)Epipj)=12i(2(pi)2+j:(j,i)Epipj+j:(i,j)Epipj)ipiλ212α.\displaystyle\sum_{i}\left((p^{\star}_{i})^{2}+\sum_{j:(j,i)\in E}p^{\star}_{i}p^{\star}_{j}\right)=\frac{1}{2}\sum_{i}\left(2(p^{\star}_{i})^{2}+\sum_{j:(j,i)\in E}p^{\star}_{i}p^{\star}_{j}+\sum_{j:(i,j)\in E}p^{\star}_{i}p^{\star}_{j}\right)\geq\sum_{i}\frac{p^{\star}_{i}\lambda}{2}\geq\frac{1}{2\alpha}\,.
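As a quick illustration of Lemma 43, the following Python snippet brute-forces the independence number of small random directed graphs and verifies the claimed inequality for random weights; it is only a sanity check under the stated interface, not part of the analysis.

```python
import itertools
import random

def independence_number(n, edges):
    """Brute-force independence number: the largest S with no edge (in either
    direction) between two distinct vertices of S."""
    for r in range(n, 0, -1):
        for S in itertools.combinations(range(n), r):
            if all((i, j) not in edges and (j, i) not in edges
                   for i in S for j in S if i != j):
                return r
    return 0

def check_lemma_43(trials=200, n=5, seed=0):
    """Check that some vertex i satisfies p_i / (p_i + sum_{(j,i) in E} p_j)
    <= 2 * p_i * alpha / sum_i p_i on random instances."""
    rng = random.Random(seed)
    for _ in range(trials):
        edges = {(i, j) for i in range(n) for j in range(n)
                 if i != j and rng.random() < 0.4}
        p = [rng.random() + 1e-3 for _ in range(n)]
        alpha = independence_number(n, edges)
        total = sum(p)
        assert any(
            p[i] / (p[i] + sum(p[j] for j in range(n) if (j, i) in edges))
            <= 2 * p[i] * alpha / total
            for i in range(n))
    return True
```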

Lemma 44.

(Lemma 10 of Alon et al. (2013)) Let \mathcal{G}=(V,E) be a directed graph. Then, for any distribution p\in\Delta(V) we have:

ipipi+j:(j,i)Epjmas(𝒢),\displaystyle\sum_{i}\frac{p_{i}}{p_{i}+\sum_{j:(j,i)\in E}p_{j}}\leq\operatorname{mas}(\mathcal{G})\,,

where \operatorname{mas}(\mathcal{G})\leq\widetilde{\alpha} denotes the size of a maximal acyclic sub-graph.

With these lemmas, we are ready to prove iw-stability.

Proof of Lemma 14.

Consider the (11/log(K))(1-1/\log(K))-Tsallis-INF algorithm with adaptive stepsize (Algorithm 11). Applying Lemma 27, we have for any stopping time τ\tau and a[K]a^{\star}\in[K]:

t=1τ𝔼t[pt,^t^t,a]\displaystyle\sum_{t=1}^{\tau}\mathbb{E}_{t}[\langle p_{t},\widehat{\ell}_{t}\rangle-\widehat{\ell}_{t,a^{\star}}] 2elogKητ+t=1τ𝔼t[maxpppt,^t1ηtDψ(p,pt)]\displaystyle\leq\frac{2e\log K}{\eta_{\tau}}+\sum_{t=1}^{\tau}\mathbb{E}_{t}\left[\max_{p}\langle p-p_{t},\widehat{\ell}_{t}\rangle-\frac{1}{\eta_{t}}D_{\psi}(p,p_{t})\right]
=2elogKητ+t=1τ𝔼t[maxpppt,^t+ct𝟏1ηtDψ(p,pt)],\displaystyle=\frac{2e\log K}{\eta_{\tau}}+\sum_{t=1}^{\tau}\mathbb{E}_{t}\left[\max_{p}\langle p-p_{t},\widehat{\ell}_{t}+c_{t}\bm{1}\rangle-\frac{1}{\eta_{t}}D_{\psi}(p,p_{t})\right]\,,

where c_{t}=-\scalebox{0.9}{{upd}}_{t}\frac{(\ell_{t,\mathcal{I}_{t}}-1)\mathbb{I}\{A_{t}\in\mathcal{N}(\mathcal{I}_{t})\}}{\sum_{j\in\mathcal{N}(\mathcal{I}_{t})}p_{t,j}q_{t}}, which is well defined because \mathcal{I}_{t} contains by definition at most one arm.

We can apply Lemma 32, since the losses are strictly positive.

𝔼t[maxpppt,^t+ct𝟏1ηtDψ(p,pt)]ηt2𝔼t[ipt,i1+1logK(^t,i+ct)2].\displaystyle\mathbb{E}_{t}\left[\max_{p}\langle p-p_{t},\widehat{\ell}_{t}+c_{t}\bm{1}\rangle-\frac{1}{\eta_{t}}D_{\psi}(p,p_{t})\right]\leq\frac{\eta_{t}}{2}\mathbb{E}_{t}\left[\sum_{i}p_{t,i}^{1+\frac{1}{\log K}}(\widehat{\ell}_{t,i}+c_{t})^{2}\right]\,.

We split the vertices into three sets. Let M_{1}=\{i\,|\,i\in\mathcal{N}(i)\} be the nodes with self-loops, let M_{2}=\{i\,|\,i\not\in M_{1},\,i\not\in\mathcal{I}_{t}\}, and let M_{3}=\mathcal{I}_{t}. We have

ηt2𝔼t[ipt,i1+1logK(^t,i+ct)2]\displaystyle\frac{\eta_{t}}{2}\mathbb{E}_{t}\left[\sum_{i}p_{t,i}^{1+\frac{1}{\log K}}(\widehat{\ell}_{t,i}+c_{t})^{2}\right]
ηt𝔼t[updt(iM1M2Pt,ict2+iM1pt,i1+1logK(j𝒩(i)pt,jqt)2+iM24pt,iqt2+pt,t)],\displaystyle\leq\eta_{t}\mathbb{E}_{t}\left[\scalebox{0.9}{{upd}}_{t}\left(\sum_{i\in M_{1}\cup M_{2}}P_{t,i}c_{t}^{2}+\sum_{i\in M_{1}}\frac{p_{t,i}^{1+\frac{1}{\log K}}}{(\sum_{j\in\mathcal{N}(i)}p_{t,j}q_{t})^{2}}+\sum_{i\in M_{2}}\frac{4p_{t,i}}{q_{t}^{2}}+p_{t,\mathcal{I}_{t}}\right)\right]\,,

since for any i\in M_{2}, we have \sum_{j\in\mathcal{N}(i)}p_{t,j}>\frac{1}{2}. In expectation, the first term is bounded as \mathbb{E}_{t}\left[\sum_{i\in M_{1}\cup M_{2}}p_{t,i}c_{t}^{2}\right]\leq\frac{1}{q_{t}} and the third term as \mathbb{E}_{t}\left[\sum_{i\in M_{2}}\frac{4p_{t,i}\scalebox{0.9}{{upd}}_{t}}{q_{t}^{2}}\right]\leq\frac{4}{q_{t}}, while the second is

𝔼t[iM1pt,i1+1logKupdt(j𝒩(i)pt,jqt)2]2qtiM1pt,i1+1logKj𝒩(i)pt,j.\displaystyle\mathbb{E}_{t}\left[\sum_{i\in M_{1}}\frac{p_{t,i}^{1+\frac{1}{\log K}}\scalebox{0.9}{{upd}}_{t}}{(\sum_{j\in\mathcal{N}(i)}p_{t,j}q_{t})^{2}}\right]\leq\frac{2}{q_{t}}\sum_{i\in M_{1}}\frac{p_{t,i}^{1+\frac{1}{\log K}}}{\sum_{j\in\mathcal{N}(i)}p_{t,j}}\,.

The subgraph consisting of M_{1} contains self-loops, hence we can apply Lemma 44 to bound this term by \frac{2\widetilde{\alpha}}{q_{t}}. If \alpha\log K<\widetilde{\alpha}, we instead take a permutation \widetilde{p} of p such that the vertices of M_{1} occupy positions 1,\ldots,|M_{1}| and for any i\in[|M_{1}|]

p~t,ij𝒩(i)p~t,jp~t,ij𝒩(i)[i]p~t,j2αp~t,ij=1ip~t,j.\displaystyle\frac{\widetilde{p}_{t,i}}{\sum_{j\in\mathcal{N}(i)}\widetilde{p}_{t,j}}\leq\frac{\widetilde{p}_{t,i}}{\sum_{j\in\mathcal{N}(i)\cap[i]}\widetilde{p}_{t,j}}\leq\frac{2\alpha\widetilde{p}_{t,i}}{\sum_{j=1}^{i}\widetilde{p}_{t,j}}\,.

The existence of such a permutation is guaranteed by Lemma 43: Apply Lemma 43 on the graph M1M_{1} to select p~|M1|\tilde{p}_{|M_{1}|}. Recursively remove the ii-th vertex from the graph and apply Lemma 43 on the resulting sub-graph to pick p~i1\tilde{p}_{i-1}. Applying this permutation yields

\displaystyle\frac{2}{q_{t}}\sum_{i\in M_{1}}\frac{p_{t,i}^{1+\frac{1}{\log K}}}{\sum_{j\in\mathcal{N}(i)}p_{t,j}}\leq\frac{4\alpha}{q_{t}}\sum_{i=1}^{|M_{1}|}\frac{\widetilde{p}_{i}^{1+\frac{1}{\log K}}}{\sum_{j=1}^{i}\widetilde{p}_{j}}\leq\frac{4\alpha}{q_{t}}\sum_{i=1}^{|M_{1}|}\frac{\widetilde{p}_{i}}{\left(\sum_{j=1}^{i}\widetilde{p}_{j}\right)^{1-\frac{1}{\log K}}}\leq\frac{4\alpha\log K}{q_{t}}\,.

Combining everything, we obtain

𝔼t[maxpppt,^t+ct𝟏1ηtDψ(p,pt)]ηtqt(6+4min{α~,αlogK}).\displaystyle\mathbb{E}_{t}\left[\max_{p}\langle p-p_{t},\widehat{\ell}_{t}+c_{t}\bm{1}\rangle-\frac{1}{\eta_{t}}D_{\psi}(p,p_{t})\right]\leq\frac{\eta_{t}}{q_{t}}\left(6+4\min\{\widetilde{\alpha},\alpha\log K\}\right)\,.

By the definition of the learning rate, we obtain

t=1τ𝔼t[pt,^t^t,a]\displaystyle\sum_{t=1}^{\tau}\mathbb{E}_{t}[\langle p_{t},\widehat{\ell}_{t}\rangle-\widehat{\ell}_{t,a^{\star}}] 2elogKητ+t=1τηtqt(6+4min{α~,αlogK})\displaystyle\leq\frac{2e\log K}{\eta_{\tau}}+\sum_{t=1}^{\tau}\frac{\eta_{t}}{q_{t}}\left(6+4\min\{\widetilde{\alpha},\alpha\log K\}\right)
18(1+min{α~,αlogK})t=1τ1qtlogK.\displaystyle\leq 18\sqrt{(1+\min\{\widetilde{\alpha},\alpha\log K\})\sum_{t=1}^{\tau}\frac{1}{q_{t}}\log K}\,.

F.4 EXP3 for weakly observable graphs (Algorithm 12 / Lemma 17)

Algorithm 12 EXP3 (for weakly observable graphs)

Input: 𝒢x^\mathcal{G}\setminus\widehat{x}, dominating set DD (potentially including x^\widehat{x}).
Define: ψ(x)=i=1Kxilog(xi)\psi(x)=\sum_{i=1}^{K}x_{i}\log(x_{i}). νD\nu_{D} is the uniform distribution over DD.
for t=1,2,t=1,2,\ldots do

       Receive update probability qtq_{t}.
Let
Pt=argminxΔ([K]){τ=1t1x,^τ+1ηtψ(x)},pt=(1γt)Pt+γtνD\displaystyle P_{t}=\operatorname*{argmin}_{x\in\Delta([K])}\left\{\sum_{\tau=1}^{t-1}\langle x,\widehat{\ell}_{\tau}\rangle+\frac{1}{\eta_{t}}\psi(x)\right\},p_{t}=(1-\gamma_{t})P_{t}+\gamma_{t}\nu_{D}
whereηt=((δs=1t1qslog(K))23+4δminstqs)1 and γt=ηtδqt.\displaystyle\text{where}\ \ \ \eta_{t}=\left(\left(\frac{\sqrt{\delta}\sum_{s=1}^{t}\frac{1}{\sqrt{q_{s}}}}{\log(K)}\right)^{\frac{2}{3}}+\frac{4\delta}{\min_{s\leq t}q_{s}}\right)^{-1}\text{ and }\gamma_{t}=\sqrt{\frac{\eta_{t}\delta}{q_{t}}}.
Sample AtptA_{t}\sim p_{t}.
With probability qtq_{t} receive t,i\ell_{t,i} for all At𝒩(i)A_{t}\in\mathcal{N}(i) (in this case, set updt=1\scalebox{0.9}{{upd}}_{t}=1; otherwise, set updt=0\scalebox{0.9}{{upd}}_{t}=0).
Define
^t,i=t,iupdt𝕀{At𝒩(i)}j𝒩(i)pt,jqt.\displaystyle\widehat{\ell}_{t,i}=\frac{\ell_{t,i}\scalebox{0.9}{{upd}}_{t}\mathbb{I}\{A_{t}\in\mathcal{N}(i)\}}{\sum_{j\in\mathcal{N}(i)}p_{t,j}q_{t}}.
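A minimal Python sketch of one round of Algorithm 12 is given below; the dominating set D, the in-neighborhoods N(i), and the feedback oracle loss_fn are inputs of the sketch, and their interface is an assumption rather than the paper's implementation.

```python
import numpy as np

def exp3_weak_step(cum_losses, qs, dom_set, in_neighbors, loss_fn, rng):
    """One round of Algorithm 12 (EXP3 for weakly observable graphs), as a sketch.

    cum_losses   : (K,) cumulative loss estimates.
    qs           : update probabilities q_1, ..., q_t (q_t = qs[-1]).
    dom_set      : list of arms forming a dominating set D (delta = |D|).
    in_neighbors : list of sets, in_neighbors[i] = N(i).
    loss_fn      : hypothetical oracle returning ell_{t, i} for a revealed arm i.
    """
    K = len(cum_losses)
    qs = np.asarray(qs, dtype=float)
    q_t = qs[-1]
    delta = len(dom_set)
    # Learning rate and exploration rate of Algorithm 12.
    eta_t = 1.0 / ((np.sqrt(delta) * np.sum(1.0 / np.sqrt(qs)) / np.log(K)) ** (2 / 3)
                   + 4 * delta / qs.min())
    gamma_t = np.sqrt(eta_t * delta / q_t)
    # Negentropy FTRL = exponential weights, mixed with uniform exploration on D.
    w = np.exp(-eta_t * (cum_losses - cum_losses.min()))
    P_t = w / w.sum()
    nu_D = np.zeros(K)
    nu_D[np.asarray(dom_set)] = 1.0 / delta
    p_t = (1 - gamma_t) * P_t + gamma_t * nu_D
    A_t = rng.choice(K, p=p_t)
    cum_losses = cum_losses.copy()
    if rng.random() < q_t:                       # feedback arrives w.p. q_t
        for i in range(K):
            if A_t in in_neighbors[i]:           # ell_{t, i} is revealed
                denom = q_t * sum(p_t[j] for j in in_neighbors[i])
                cum_losses[i] += loss_fn(i) / denom
    return A_t, cum_losses
```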
Proof of Lemma 17.

Consider the EXP3 algorithm (Algorithm 12). By standard analysis of FTRL (Lemma 27) and by Lemma 31, which applies due to the non-negativity of all losses, we have for any \tau and any a^{\star},

t=1τ𝔼t[aPt(a)^t(a)]t=1τ𝔼t[^t(a)]\displaystyle\sum_{t=1}^{\tau}\mathbb{E}_{t}\left[\sum_{a}P_{t}(a)\widehat{\ell}_{t}(a)\right]-\sum_{t=1}^{\tau}\mathbb{E}_{t}\left[\widehat{\ell}_{t}(a^{\star})\right]
logKητ+t=1τηt2𝔼t[iPt,i^t,i2]\displaystyle\leq\frac{\log K}{\eta_{\tau}}+\sum_{t=1}^{\tau}\frac{\eta_{t}}{2}\mathbb{E}_{t}\left[\sum_{i}P_{t,i}\widehat{\ell}_{t,i}^{2}\right]
logKητ+t=1τηt2iPt,i1j𝒩(i)pt,jqt\displaystyle\leq\frac{\log K}{\eta_{\tau}}+\sum_{t=1}^{\tau}\frac{\eta_{t}}{2}\sum_{i}P_{t,i}\frac{1}{\sum_{j\in\mathcal{N}(i)}p_{t,j}q_{t}}
logKητ+t=1τηtδ4qt\displaystyle\leq\frac{\log K}{\eta_{\tau}}+\sum_{t=1}^{\tau}\sqrt{\frac{\eta_{t}\delta}{4q_{t}}}\,

where in the last inequality we use \frac{\eta_{t}}{2}\sum_{i}P_{t,i}\times\frac{\delta}{\gamma_{t}}\times\frac{1}{q_{t}}=\frac{1}{2}\sqrt{\frac{\eta_{t}\delta}{q_{t}}}. Since \mathbb{E}_{t}[\langle P_{t}-p_{t},\widehat{\ell}_{t}\rangle]\leq\gamma_{t}, we further have

t=1τ𝔼t[apt(a)^t(a)]t=1τ𝔼t[^t(a)]\displaystyle\sum_{t=1}^{\tau}\mathbb{E}_{t}\left[\sum_{a}p_{t}(a)\widehat{\ell}_{t}(a)\right]-\sum_{t=1}^{\tau}\mathbb{E}_{t}\left[\widehat{\ell}_{t}(a^{\star})\right] logKητ+t=1τ9ηtδ4qt\displaystyle\leq\frac{\log K}{\eta_{\tau}}+\sum_{t=1}^{\tau}\sqrt{9\frac{\eta_{t}\delta}{4q_{t}}}
\displaystyle\leq 6(\delta\log(K))^{\frac{1}{3}}\left(\sum_{t=1}^{\tau}\frac{1}{\sqrt{q_{t}}}\right)^{\frac{2}{3}}+\frac{4\delta\log(K)}{\min_{t\leq\tau}q_{t}}.

Taking expectation on both sides finishes the proof. ∎

Appendix G Surrogate Loss for Strongly Observable Graph Problems

When there exist arms i such that i\not\in\mathcal{N}(i), we cannot directly apply Algorithm 2. To make the algorithm applicable to all strongly observable graphs, we define the surrogate losses \tilde{\ell}_{t} in the following way:

~t,x^=t,x^𝕀{(x^,x^)E}j[K]{x^}pt,jt,j𝕀{(j,j)E}\displaystyle\tilde{\ell}_{t,\widehat{x}}=\ell_{t,\widehat{x}}\mathbb{I}\{(\widehat{x},\widehat{x})\in E\}-\sum_{j\in[K]\setminus\{\widehat{x}\}}p_{t,j}\ell_{t,j}\mathbb{I}\{(j,j)\not\in E\}
j[K]{x^}:\displaystyle\forall j\in[K]\setminus\{\widehat{x}\}:\, ~t,j=t,j𝕀{(j,j)E}t,x^𝕀{(x^,x^)E},\displaystyle\tilde{\ell}_{t,j}=\ell_{t,j}\mathbb{I}\{(j,j)\in E\}-\ell_{t,\widehat{x}}\mathbb{I}\{(\widehat{x},\widehat{x})\not\in E\}\,,

where ptp_{t} is the distribution of the base algorithm \mathcal{B} over [K]{x^}[K]\setminus\{\widehat{x}\} at round tt. By construction and the definition of strongly observable graphs, ~t,j\tilde{\ell}_{t,j} is observed when playing arm jj. (When the player does not have access to ptp_{t}, one can also sample one action from the current distribution for an unbiased estimate of t,x^\ell_{t,\widehat{x}}.) The losses ~t\tilde{\ell}_{t} are in range [1,1][-1,1] instead of [0,1][0,1] and can further be shifted to be strictly non-negative. Finally observe that

𝔼[~t,At~t,x^]\displaystyle\mathbb{E}[\tilde{\ell}_{t,A_{t}}-\tilde{\ell}_{t,\widehat{x}}] =qt,2(j[K]{x^}pt,j~t,j~t,x^)\displaystyle=q_{t,2}\left(\sum_{j\in[K]\setminus\{\widehat{x}\}}p_{t,j}\tilde{\ell}_{t,j}-\tilde{\ell}_{t,\widehat{x}}\right)
=qt,2(j[K]{x^}pt,jt,jt,x^)\displaystyle=q_{t,2}\left(\sum_{j\in[K]\setminus\{\widehat{x}\}}p_{t,j}\ell_{t,j}-\ell_{t,\widehat{x}}\right)
=𝔼[t,Att,x^],\displaystyle=\mathbb{E}[\ell_{t,A_{t}}-\ell_{t,\widehat{x}}],

as well as

𝔼[~t,Atj[K]{x^}pt,j~t,j]\displaystyle\mathbb{E}\left[\tilde{\ell}_{t,A_{t}}-\sum_{j\in[K]\setminus\{\widehat{x}\}}p_{t,j}\tilde{\ell}_{t,j}\right] =qt,1(~t,x^j[K]{x^}pt,j~t,j)\displaystyle=q_{t,1}\left(\tilde{\ell}_{t,\widehat{x}}-\sum_{j\in[K]\setminus\{\widehat{x}\}}p_{t,j}\tilde{\ell}_{t,j}\right)
=qt,1(~t,x^j[K]{x^}pt,jt,j)\displaystyle=q_{t,1}\left(\tilde{\ell}_{t,\widehat{x}}-\sum_{j\in[K]\setminus\{\widehat{x}\}}p_{t,j}\ell_{t,j}\right)
=𝔼[t,Atj[K]{x^}pt,jt,j].\displaystyle=\mathbb{E}\left[\ell_{t,A_{t}}-\sum_{j\in[K]\setminus\{\widehat{x}\}}p_{t,j}\ell_{t,j}\right]\,.

That means that running Algorithm 2 with \ell_{t} replaced by \tilde{\ell}_{t} allows us to apply Theorem 11 to strongly observable graphs in which not every arm observes its own loss as feedback.
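The construction above is mechanical, so a small Python sketch may help; the boolean array has_self_loop and the array-based interface are assumptions of the sketch.

```python
import numpy as np

def surrogate_losses(ell, p, x_hat, has_self_loop):
    """Surrogate losses of Appendix G (illustrative sketch).

    ell           : (K,) true losses ell_t.
    p             : (K,) base-algorithm distribution over [K] \ {x_hat}
                    (the entry at x_hat is ignored).
    x_hat         : index of the candidate arm.
    has_self_loop : (K,) booleans, True iff (j, j) is an edge of the graph.
    """
    K = len(ell)
    ell_tilde = np.zeros(K)
    # Candidate arm: keep its own loss if it has a self-loop, and subtract the
    # weighted losses of the arms without self-loops (whose losses x_hat observes).
    ell_tilde[x_hat] = ell[x_hat] * has_self_loop[x_hat] - sum(
        p[j] * ell[j] for j in range(K) if j != x_hat and not has_self_loop[j])
    # Remaining arms: keep their own loss if they have a self-loop, and subtract
    # ell[x_hat] whenever x_hat has no self-loop (so it is observable from arm j).
    for j in range(K):
        if j != x_hat:
            ell_tilde[j] = (ell[j] * has_self_loop[j]
                            - ell[x_hat] * (not has_self_loop[x_hat]))
    return ell_tilde
```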

Appendix H Tabular MDP (Theorem 26)

In this section, we consider using the UOB-Log-Barrier Policy Search algorithm of Lee et al. (2020) (their Algorithm 4) as our base algorithm. To this end, we need to show that it satisfies the d-strongly-iw-stable condition specified in Lemma 42. We consider a variant of their algorithm that incorporates the feedback probability q_{t}^{\prime}=q_{t}+(1-q_{t})\mathbb{I}[\pi_{t}=\widehat{\pi}]. The resulting variant is given as Algorithm 13.

Algorithm 13 UOB-Log-Barrier Policy Search

Input: state space 𝒮\mathcal{S}, action space 𝒜\mathcal{A}, candidate policy π^\widehat{\pi}.
Define: Ω={w^:w^(s,a,s)1T3S2A}\Omega=\left\{\widehat{w}:\leavevmode\nobreak\ \widehat{w}(s,a,s^{\prime})\geq\frac{1}{T^{3}S^{2}A}\right\}, ψ(w)=h(s,a,s)𝒮h×𝒜×𝒮h+1ln1w(s,a,s)\psi(w)=\sum_{h}\sum_{(s,a,s^{\prime})\in\mathcal{S}_{h}\times\mathcal{A}\times\mathcal{S}_{h+1}}\ln\frac{1}{w(s,a,s^{\prime})}.
δ=1T5S3A\delta=\frac{1}{T^{5}S^{3}A}.
Initialization:

w^1(s,a,s)=1|𝒮h||𝒜||𝒮h+1|,π1=πw^1,t1.\displaystyle\widehat{w}_{1}(s,a,s^{\prime})=\frac{1}{|\mathcal{S}_{h}||\mathcal{A}||\mathcal{S}_{h+1}|},\qquad\pi_{1}=\pi^{\widehat{w}_{1}},\qquad t^{\star}\leftarrow 1.

for t=1,2,t=1,2,\ldots do

       If πt=π^\pi_{t}=\widehat{\pi}, updt=1\scalebox{0.9}{{upd}}_{t}=1; otherwise, updt=1\scalebox{0.9}{{upd}}_{t}=1 w.p. qtq_{t} and updt=0\scalebox{0.9}{{upd}}_{t}=0 w.p. 1qt1-q_{t}. If updt=1\scalebox{0.9}{{upd}}_{t}=1, obtain the trajectory sh,ah,t(sh,ah)s_{h},a_{h},\ell_{t}(s_{h},a_{h}) for all h=1,,Hh=1,\ldots,H.
Construct loss estimators
^t(s,a)=updtqtt(s,a)𝕀t(s,a)ϕt(s,a),where𝕀t(s,a)=𝕀{(sh(s),ah(s))=(s,a)}\displaystyle\widehat{\ell}_{t}(s,a)=\frac{\scalebox{0.9}{{upd}}_{t}}{q_{t}^{\prime}}\cdot\frac{\ell_{t}(s,a)\mathbb{I}_{t}(s,a)}{\phi_{t}(s,a)},\quad\text{where}\ \mathbb{I}_{t}(s,a)=\mathbb{I}\{(s_{h(s)},a_{h(s)})=(s,a)\}
where qt=qt+(1qt)𝟏[πt=π^]q_{t}^{\prime}=q_{t}+(1-q_{t})\mathbf{1}[\pi_{t}=\widehat{\pi}].
Update counters: for all s,a,ss,a,s^{\prime},
Nt+1(s,a)Nt(s,a)+𝕀t(s,a),Nt+1(s,a,s)Nt(s,a,s)+𝕀t(s,a)𝕀{sh(s)+1=s}.\displaystyle N_{t+1}(s,a)\leftarrow N_{t}(s,a)+\mathbb{I}_{t}(s,a),\quad N_{t+1}(s,a,s^{\prime})\leftarrow N_{t}(s,a,s^{\prime})+\mathbb{I}_{t}(s,a)\mathbb{I}\{s_{h(s)+1}=s^{\prime}\}.
Compute confidence set
𝒫t+1={P^:|P^(s|s,a)P¯t+1(s|s,a)|ϵt+1(s|s,a),(s,a,s)}\displaystyle\mathcal{P}_{t+1}=\left\{\hat{P}:\leavevmode\nobreak\ \left|\hat{P}(s^{\prime}|s,a)-\overline{P}_{t+1}(s^{\prime}|s,a)\right|\leq\epsilon_{t+1}(s^{\prime}|s,a),\ \ \forall(s,a,s^{\prime})\right\}
where P¯t+1(s|s,a)=Nt+1(s,a,s)max{1,Nt+1(s,a)}\overline{P}_{t+1}(s^{\prime}|s,a)=\frac{N_{t+1}(s,a,s^{\prime})}{\max\{1,N_{t+1}(s,a)\}} and
ϵt+1(s|s,a)=4P¯t+1(s|s,a)ln(SAT/δ)max{1,Nt+1(s,a)}+28ln(SAT/δ)3max{1,Nt+1(s,a)}\displaystyle\epsilon_{t+1}(s^{\prime}|s,a)=4\sqrt{\frac{\overline{P}_{t+1}(s^{\prime}|s,a)\ln(SAT/\delta)}{\max\{1,N_{t+1}(s,a)\}}}+\frac{28\ln(SAT/\delta)}{3\max\{1,N_{t+1}(s,a)\}}
if  τ=tts,aupdτ𝕀τ(s,a)τ(s,a)2qt2+maxτtHqτ2S2Aln(SAT)ηt2\sum_{\tau=t^{\star}}^{t}\sum_{s,a}\frac{\scalebox{0.9}{{upd}}_{\tau}\mathbb{I}_{\tau}(s,a)\ell_{\tau}(s,a)^{2}}{q_{t}^{\prime 2}}+\max_{\tau\leq t}\frac{H}{q_{\tau}^{\prime 2}}\geq\frac{S^{2}A\ln(SAT)}{\eta_{t}^{2}}  then
             ηt+1ηt2\eta_{t+1}\leftarrow\frac{\eta_{t}}{2}
w^t+1=argminwΔ(𝒫t+1)Ωψ(w)\widehat{w}_{t+1}=\operatorname*{argmin}_{w\in\Delta(\mathcal{P}_{t+1})\cap\Omega}\psi(w)
tt+1t^{\star}\leftarrow t+1.
      else
             ηt+1=ηt\eta_{t+1}=\eta_{t}
w^t+1=argminwΔ(𝒫t+1)Ω{w,^t+1ηtDψ(w,w^t)}\widehat{w}_{t+1}=\operatorname*{argmin}_{w\in\Delta(\mathcal{P}_{t+1})\cap\Omega}\left\{\langle w,\widehat{\ell}_{t}\rangle+\frac{1}{\eta_{t}}D_{\psi}(w,\widehat{w}_{t})\right\}.
      Update policy πt+1=πw^t+1\pi_{t+1}=\pi^{\widehat{w}_{t+1}}.
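For concreteness, the importance-weighted loss estimator used in Algorithm 13, together with the adjusted feedback probability q'_t, can be sketched as follows; the dictionary-based interface is an assumption of the sketch, not the paper's implementation.

```python
def uob_loss_estimator(ell_obs, visited, phi, q_t, played_candidate, upd):
    """Loss estimator of Algorithm 13 (illustrative sketch).

    ell_obs          : dict (s, a) -> observed loss ell_t(s, a) along the trajectory.
    visited          : set of (s, a) pairs visited on the trajectory.
    phi              : dict (s, a) -> upper-occupancy probability phi_t(s, a) > 0.
    q_t              : update probability supplied by the corralling wrapper.
    played_candidate : True iff pi_t = pi_hat.
    upd              : 1 if a trajectory was observed this round, 0 otherwise.
    """
    # q'_t = q_t + (1 - q_t) * 1[pi_t = pi_hat]
    q_prime = q_t + (1 - q_t) * (1.0 if played_candidate else 0.0)
    ell_hat = {}
    for (s, a) in phi:
        indicator = 1.0 if (s, a) in visited else 0.0
        ell_hat[(s, a)] = upd / q_prime * ell_obs.get((s, a), 0.0) * indicator / phi[(s, a)]
    return ell_hat, q_prime
```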

We refer the reader to Section C.3 of Lee et al. (2020) for the setting description and notation, since we follow them closely in this section.

H.1 Base algorithm

Lemma 45.

Algorithm 13 ensures for any uu^{\star},

𝔼[t=1Twtu,t]O(HS2Aι2𝔼[t=1Ts,aupdt𝕀t(s,a)t(s,a)qt2]+𝔼[S5A2ι2mintqt]).\displaystyle\mathbb{E}\left[\sum_{t=1}^{T}\langle w_{t}-u^{\star},\ell_{t}\rangle\right]\leq O\left(\sqrt{HS^{2}A\iota^{2}\mathbb{E}\left[\sum_{t=1}^{T}\sum_{s,a}\frac{\scalebox{0.9}{{upd}}_{t}\mathbb{I}_{t}(s,a)\ell_{t}(s,a)}{q_{t}^{\prime 2}}\right]}+\mathbb{E}\left[\frac{S^{5}A^{2}\iota^{2}}{\min_{t}q_{t}^{\prime}}\right]\right).
Proof.

The regret is decomposed as follows (similar to Section C.3 in Lee et al. (2020)):

t=1Twtu,t\displaystyle\sum_{t=1}^{T}\langle w_{t}-u^{\star},\ell_{t}\rangle =t=1Twtw^t,tError+t=1Tw^t,t^tBias-1+t=1Tw^tu,^tReg-term\displaystyle=\underbrace{\sum_{t=1}^{T}\langle w_{t}-\widehat{w}_{t},\ell_{t}\rangle}_{\text{Error}}+\underbrace{\sum_{t=1}^{T}\langle\widehat{w}_{t},\ell_{t}-\widehat{\ell}_{t}\rangle}_{\text{Bias-1}}+\underbrace{\sum_{t=1}^{T}\langle\widehat{w}_{t}-u,\widehat{\ell}_{t}\rangle}_{\text{Reg-term}}
+t=1Tu,^ttBias-2+t=1Tuu,tBias-3,\displaystyle\quad+\underbrace{\sum_{t=1}^{T}\langle u,\widehat{\ell}_{t}-\ell_{t}\rangle}_{\text{Bias-2}}+\underbrace{\sum_{t=1}^{T}\langle u-u^{\star},\ell_{t}\rangle}_{\text{Bias-3}},

with the same definition of u as in (24) of Lee et al. (2020). Bias-3\leq H holds trivially (see (25) of Lee et al. (2020)). For the remaining four terms, we bound Error by Lemma 49, Bias-1 by Lemma 50, Reg-term by Lemma 52, and Bias-2 by Lemma 51. Combining all terms finishes the proof. ∎

Lemma 46 (Lemma C.2 of Lee et al. (2020)).

With probability 1O(δ)1-O(\delta), for all t,s,a,st,s,a,s^{\prime},

|P(s|s,a)P¯t(s|s,a)|ϵt(s|s,a)2.\displaystyle\left|P(s^{\prime}|s,a)-\overline{P}_{t}(s^{\prime}|s,a)\right|\leq\frac{\epsilon_{t}(s^{\prime}|s,a)}{2}.
Lemma 47 (cf. Lemma C.3 of Lee et al. (2020)).

With probability at least 1δ1-\delta, for all hh,

t=1Ts𝒮h,a𝒜qtwt(s,a)max{1,Nt(s,a)}\displaystyle\sum_{t=1}^{T}\sum_{s\in\mathcal{S}_{h},a\in\mathcal{A}}\frac{q_{t}^{\prime}\cdot w_{t}(s,a)}{\max\{1,N_{t}(s,a)\}} =O(|𝒮h|Alog(T)+ln(H/δ))\displaystyle=O\left(|\mathcal{S}_{h}|A\log(T)+\ln(H/\delta)\right)
Proof.

The proof of this lemma is identical to the original one; we only need to notice that in our case, the probability of obtaining a sample of (s,a) is q_{t}^{\prime}\cdot w_{t}(s,a).

Lemma 48 (cf. Lemma C.6 of Lee et al. (2020)).

With probability at least 1O(δ)1-O(\delta), for any tt and any collection of transition functions {Pst}s𝒮\{P^{s}_{t}\}_{s\in\mathcal{S}} such that Pst𝒫tP^{s}_{t}\in\mathcal{P}_{t} for all ss, we have

t=1Ts𝒮,a𝒜|wPst,πt(s,a)wt(s,a)|t(s,a)=O(SHAt=1Twt,t2qtι2+S5A2ι2mintqt).\displaystyle\sum_{t=1}^{T}\sum_{s\in\mathcal{S},a\in\mathcal{A}}\left|w^{P^{s}_{t},\pi_{t}}(s,a)-w_{t}(s,a)\right|\ell_{t}(s,a)=O\left(S\sqrt{HA\sum_{t=1}^{T}\frac{\langle w_{t},\ell_{t}^{2}\rangle}{q_{t}^{\prime}}\iota^{2}}+\frac{S^{5}A^{2}\iota^{2}}{\min_{t}q_{t}^{\prime}}\right).

where ιlog(SAT/δ)\iota\triangleq\log(SAT/\delta) and t2(s,a)(t(s,a))2\ell_{t}^{2}(s,a)\triangleq(\ell_{t}(s,a))^{2}.

Proof.

Following the proof of Lemma C.6 in Lee et al. (2020), we have

t=1Ts𝒮,a𝒜|wPst,πt(s,a)wt(s,a)|t(s,a)B1+SB2\displaystyle\sum_{t=1}^{T}\sum_{s\in\mathcal{S},a\in\mathcal{A}}\left|w^{P^{s}_{t},\pi_{t}}(s,a)-w_{t}(s,a)\right|\ell_{t}(s,a)\leq B_{1}+SB_{2}

where

B2\displaystyle B_{2} O(t=1Ts,a,sx,y,xP(s|s,a)ιmax{1,Nt(s,a)}wt(s,a)P(x|x,y)ιmax{1,Nt(x,y)}wt(x,y|s))\displaystyle\leq O\left(\sum_{t=1}^{T}\sum_{s,a,s^{\prime}}\sum_{x,y,x^{\prime}}\sqrt{\frac{P(s^{\prime}|s,a)\iota}{\max\{1,N_{t}(s,a)\}}}w_{t}(s,a)\sqrt{\frac{P(x^{\prime}|x,y)\iota}{\max\{1,N_{t}(x,y)\}}}w_{t}(x,y|s^{\prime})\right)
+O(t=1Ts,a,sx,y,xwt(s,a)ιmax{1,Nt(s,a)})\displaystyle\qquad+O\left(\sum_{t=1}^{T}\sum_{s,a,s^{\prime}}\sum_{x,y,x^{\prime}}\frac{w_{t}(s,a)\iota}{\max\{1,N_{t}(s,a)\}}\right) (32)

The first term in (32) can be upper bounded by the order of

t=1Ts,a,sx,y,xwt(s,a)P(x|x,y)wt(x,y|s)ιmax{1,Nt(s,a)}wt(s,a)P(s|s,a)wt(x,y|s)ιmax{1,Nt(x,y)}\displaystyle\sum_{t=1}^{T}\sum_{s,a,s^{\prime}}\sum_{x,y,x^{\prime}}\sqrt{\frac{w_{t}(s,a)P(x^{\prime}|x,y)w_{t}(x,y|s^{\prime})\iota}{\max\{1,N_{t}(s,a)\}}}\sqrt{\frac{w_{t}(s,a)P(s^{\prime}|s,a)w_{t}(x,y|s^{\prime})\iota}{\max\{1,N_{t}(x,y)\}}}
t=1Ts,a,sx,y,xwt(s,a)P(x|x,y)wt(x,y|s)ιmax{1,Nt(s,a)}s,a,sx,y,xwt(s,a)P(s|s,a)wt(x,y|s)ιmax{1,Nt(x,y)}\displaystyle\leq\sum_{t=1}^{T}\sqrt{\sum_{s,a,s^{\prime}}\sum_{x,y,x^{\prime}}\frac{w_{t}(s,a)P(x^{\prime}|x,y)w_{t}(x,y|s^{\prime})\iota}{\max\{1,N_{t}(s,a)\}}}\sqrt{\sum_{s,a,s^{\prime}}\sum_{x,y,x^{\prime}}\frac{w_{t}(s,a)P(s^{\prime}|s,a)w_{t}(x,y|s^{\prime})\iota}{\max\{1,N_{t}(x,y)\}}}
t=1THs,a,swt(s,a)ιmax{1,Nt(s,a)}Hx,y,xwt(x,y)ιmax{1,Nt(x,y)}\displaystyle\leq\sum_{t=1}^{T}\sqrt{H\sum_{s,a,s^{\prime}}\frac{w_{t}(s,a)\iota}{\max\{1,N_{t}(s,a)\}}}\sqrt{H\sum_{x,y,x^{\prime}}\frac{w_{t}(x,y)\iota}{\max\{1,N_{t}(x,y)\}}}
HSt=1Ts,a,swt(s,a)ιmax{1,Nt(s,a)}\displaystyle\leq HS\sum_{t=1}^{T}\sum_{s,a,s^{\prime}}\frac{w_{t}(s,a)\iota}{\max\{1,N_{t}(s,a)\}}
O(HS2ιmintqt(SAln(T)+Hlog(H/δ))).\displaystyle\leq O\left(\frac{HS^{2}\iota}{\min_{t}q_{t}^{\prime}}\left(SA\ln(T)+H\log(H/\delta)\right)\right). (by Lemma 47)

The second term in (32) can be upper bounded by the order of

S2At=1Ts,a,swt(s,a)ιmax{1,Nt(s,a)}=O(S3Aιmintqt(SAln(T)+Hlog(H/δ))).\displaystyle S^{2}A\sum_{t=1}^{T}\sum_{s,a,s^{\prime}}\frac{w_{t}(s,a)\iota}{\max\{1,N_{t}(s,a)\}}=O\left(\frac{S^{3}A\iota}{\min_{t}q_{t}^{\prime}}\left(SA\ln(T)+H\log(H/\delta)\right)\right). (by Lemma 47)

Combining the two parts, we get

B2O(1mintqt(S4A2ln(T)ι+HS3Aιlog(H/δ)))O(S4A2ι2mintqt).\displaystyle B_{2}\leq O\left(\frac{1}{\min_{t}q_{t}^{\prime}}\left(S^{4}A^{2}\ln(T)\iota+HS^{3}A\iota\log(H/\delta)\right)\right)\leq O\left(\frac{S^{4}A^{2}\iota^{2}}{\min_{t}q_{t}^{\prime}}\right).

Next, we bound B1B_{1}. By the same calculation as in Lemma C.6 of Lee et al. (2020),

B1\displaystyle B_{1}
\displaystyle\leq O\left(\sum_{t=1}^{T}\sum_{s,a,s^{\prime}}\sum_{x,y}w_{t}(s,a)\sqrt{\frac{P(s^{\prime}|s,a)\iota}{\max\{1,N_{t}(s,a)\}}}w_{t}(x,y|s^{\prime})\ell_{t}(x,y)+H\sum_{t=1}^{T}\sum_{s,a,s^{\prime}}\frac{w_{t}(s,a)\iota}{\max\{1,N_{t}(s,a)\}}\right)
O(t=1Ts,a,sx,ywt(s,a)P(s|s,a)ιmax{1,Nt(s,a)}wt(x,y|s)t(x,y)+HS2Aι2mintqt)\displaystyle\leq O\left(\sum_{t=1}^{T}\sum_{s,a,s^{\prime}}\sum_{x,y}w_{t}(s,a)\sqrt{\frac{P(s^{\prime}|s,a)\iota}{\max\{1,N_{t}(s,a)\}}}w_{t}(x,y|s^{\prime})\ell_{t}(x,y)+\frac{HS^{2}A\iota^{2}}{\min_{t}q_{t}^{\prime}}\right)

For the first term above, we consider the summation over (s,a,s)𝒯h𝒮h×𝒜×𝒮h+1(s,a,s^{\prime})\in\mathcal{T}_{h}\triangleq\mathcal{S}_{h}\times\mathcal{A}\times\mathcal{S}_{h+1}. We continue to bound it by the order of

t=1Ts,a,s𝒯hx,ywt(s,a)P(s|s,a)ιmax{1,Nt(s,a)}wt(x,y|s)t(x,y)\displaystyle\sum_{t=1}^{T}\sum_{s,a,s^{\prime}\in\mathcal{T}_{h}}\sum_{x,y}w_{t}(s,a)\sqrt{\frac{P(s^{\prime}|s,a)\iota}{\max\{1,N_{t}(s,a)\}}}w_{t}(x,y|s^{\prime})\ell_{t}(x,y)
αt=1T1qts,a,s𝒯hx,ywt(s,a)P(s|s,a)wt(x,y|s)t(x,y)2\displaystyle\leq\alpha\sum_{t=1}^{T}\frac{1}{q_{t}^{\prime}}\sum_{s,a,s^{\prime}\in\mathcal{T}_{h}}\sum_{x,y}w_{t}(s,a)P(s^{\prime}|s,a)w_{t}(x,y|s^{\prime})\ell_{t}(x,y)^{2}
+1αt=1Ts,a,s𝒯hx,yqtwt(s,a)wt(x,y|s)ιmax{1,Nt(s,a)}\displaystyle\qquad+\frac{1}{\alpha}\sum_{t=1}^{T}\sum_{s,a,s^{\prime}\in\mathcal{T}_{h}}\sum_{x,y}q_{t}^{\prime}w_{t}(s,a)w_{t}(x,y|s^{\prime})\cdot\frac{\iota}{\max\{1,N_{t}(s,a)\}} (by AM-GM, holds for any α>0\alpha>0)
αt=1T1qtx,ywt(x,y)t(x,y)2+H|𝒮h+1|αt=1Ts,a𝒮h×𝒜qtwt(s,a)ιNt(s,a)\displaystyle\leq\alpha\sum_{t=1}^{T}\frac{1}{q_{t}^{\prime}}\sum_{x,y}w_{t}(x,y)\ell_{t}(x,y)^{2}+\frac{H|\mathcal{S}_{h+1}|}{\alpha}\sum_{t=1}^{T}\sum_{s,a\in\mathcal{S}_{h}\times\mathcal{A}}\frac{q_{t}^{\prime}w_{t}(s,a)\iota}{N_{t}(s,a)}
αt=1Twt,t2qt+H|𝒮h+1||𝒮h|Aι2α\displaystyle\leq\alpha\sum_{t=1}^{T}\frac{\langle w_{t},\ell_{t}^{2}\rangle}{q_{t}^{\prime}}+\frac{H|\mathcal{S}_{h+1}||\mathcal{S}_{h}|A\iota^{2}}{\alpha} (by Lemma 47)
O(H|𝒮h||𝒮h+1|At=1Twt,t2ι2qt)\displaystyle\leq O\left(\sqrt{H|\mathcal{S}_{h}||\mathcal{S}_{h+1}|A\sum_{t=1}^{T}\frac{\langle w_{t},\ell_{t}^{2}\rangle\iota^{2}}{q_{t}^{\prime}}}\right) (choose the optimal α\alpha)
O((|𝒮h|+|𝒮h+1|)HAt=1Twt,t2ι2qt)\displaystyle\leq O\left((|\mathcal{S}_{h}|+|\mathcal{S}_{h+1}|)\sqrt{HA\sum_{t=1}^{T}\frac{\langle w_{t},\ell_{t}^{2}\rangle\iota^{2}}{q_{t}^{\prime}}}\right)

Therefore,

B1\displaystyle B_{1} O(h(|𝒮h|+|𝒮h+1|)HAt=1Twt,t2ι2qt+HS2Aι2mintqt)\displaystyle\leq O\left(\sum_{h}(|\mathcal{S}_{h}|+|\mathcal{S}_{h+1}|)\sqrt{HA\sum_{t=1}^{T}\frac{\langle w_{t},\ell_{t}^{2}\rangle\iota^{2}}{q_{t}^{\prime}}}+\frac{HS^{2}A\iota^{2}}{\min_{t}q_{t}^{\prime}}\right)
O(SHAt=1Twt,t2ι2qt+HS2Aι2mintqt).\displaystyle\leq O\left(S\sqrt{HA\sum_{t=1}^{T}\frac{\langle w_{t},\ell_{t}^{2}\rangle\iota^{2}}{q_{t}^{\prime}}}+\frac{HS^{2}A\iota^{2}}{\min_{t}q_{t}^{\prime}}\right).

Combining the bounds finishes the proof. ∎

Lemma 49 (cf. Lemma C.7 of Lee et al. (2020)).
𝔼[Error]=𝔼[t=1Tw^twt,t]O(𝔼[SHAt=1Twt,t2ι2qt+S5A2ι2mintqt]).\displaystyle\mathbb{E}\left[\textup{Error}\right]=\mathbb{E}\left[\sum_{t=1}^{T}\langle\widehat{w}_{t}-w_{t},\ell_{t}\rangle\right]\leq O\left(\mathbb{E}\left[S\sqrt{HA\sum_{t=1}^{T}\frac{\langle w_{t},\ell_{t}^{2}\rangle\iota^{2}}{q_{t}^{\prime}}}+\frac{S^{5}A^{2}\iota^{2}}{\min_{t}q_{t}^{\prime}}\right]\right).
Proof.

By Lemma 48, with probability at least 1O(δ)1-O(\delta),

Error =t=1Tw^twt,tt=1Ts,a|w^t(s,a)wt(s,a)|t(s,a)\displaystyle=\sum_{t=1}^{T}\langle\widehat{w}_{t}-w_{t},\ell_{t}\rangle\leq\sum_{t=1}^{T}\sum_{s,a}|\widehat{w}_{t}(s,a)-w_{t}(s,a)|\ell_{t}(s,a)
O(SHAt=1Twt,t2ι2qt+S5A2ι2mintqt)\displaystyle\leq O\left(S\sqrt{HA\sum_{t=1}^{T}\frac{\langle w_{t},\ell_{t}^{2}\rangle\iota^{2}}{q_{t}^{\prime}}}+\frac{S^{5}A^{2}\iota^{2}}{\min_{t}q_{t}^{\prime}}\right)

Furthermore, |Error|O(SAT)|\textup{Error}|\leq O(SAT) with probability 11. Therefore,

𝔼[Error]\displaystyle\mathbb{E}[\textup{Error}] O(𝔼[SHAt=1Twt,t2ι2qt+S5A2ι2mintqt]+δSAT)\displaystyle\leq O\left(\mathbb{E}\left[S\sqrt{HA\sum_{t=1}^{T}\frac{\langle w_{t},\ell_{t}^{2}\rangle\iota^{2}}{q_{t}^{\prime}}}+\frac{S^{5}A^{2}\iota^{2}}{\min_{t}q_{t}^{\prime}}\right]+\delta SAT\right)
O(𝔼[SHAt=1Twt,t2ι2qt+S5A2ι2mintqt])\displaystyle\leq O\left(\mathbb{E}\left[S\sqrt{HA\sum_{t=1}^{T}\frac{\langle w_{t},\ell_{t}^{2}\rangle\iota^{2}}{q_{t}^{\prime}}}+\frac{S^{5}A^{2}\iota^{2}}{\min_{t}q_{t}^{\prime}}\right]\right)

by our choice of δ\delta. ∎

Lemma 50 (cf. Lemma C.8 of Lee et al. (2020)).
𝔼[Bias-1]\displaystyle\mathbb{E}[\textup{Bias-1}] =𝔼[t=1Tw^t,t^t]O(𝔼[SHAt=1Twt,tqt+S5A2ι2mintqt]).\displaystyle=\mathbb{E}\left[\sum_{t=1}^{T}\langle\widehat{w}_{t},\ell_{t}-\widehat{\ell}_{t}\rangle\right]\leq O\left(\mathbb{E}\left[S\sqrt{HA\sum_{t=1}^{T}\frac{\langle w_{t},\ell_{t}\rangle}{q_{t}^{\prime}}}+\frac{S^{5}A^{2}\iota^{2}}{\min_{t}q_{t}^{\prime}}\right]\right).
Proof.

Let t\mathcal{E}_{t} be the event that P𝒫τP\in\mathcal{P}_{\tau} for all τt\tau\leq t.

𝔼t[w^t,t^t|t]\displaystyle\mathbb{E}_{t}\left[\langle\widehat{w}_{t},\ell_{t}-\widehat{\ell}_{t}\rangle\leavevmode\nobreak\ \bigg{|}\leavevmode\nobreak\ \mathcal{E}_{t}\right] =s,aw^t(s,a)t(s,a)(1wt(s,a)ϕt(s,a))\displaystyle=\sum_{s,a}\widehat{w}_{t}(s,a)\ell_{t}(s,a)\left(1-\frac{w_{t}(s,a)}{\phi_{t}(s,a)}\right)
s,a|ϕt(s,a)wt(s,a)|t(s,a)\displaystyle\leq\sum_{s,a}|\phi_{t}(s,a)-w_{t}(s,a)|\ell_{t}(s,a)

Thus,

𝔼t[w^t,t^t]\displaystyle\mathbb{E}_{t}\left[\langle\widehat{w}_{t},\ell_{t}-\widehat{\ell}_{t}\rangle\right] s,a|ϕt(s,a)wt(s,a)|t(s,a)+O(H𝕀[¯t])\displaystyle\leq\sum_{s,a}|\phi_{t}(s,a)-w_{t}(s,a)|\ell_{t}(s,a)+O(H\mathbb{I}[\overline{\mathcal{E}}_{t}])

Summing this over tt,

t=1T𝔼t[w^t,t^t]t=1Ts,a|ϕt(s,a)wt(s,a)|t(s,a)()+O(Ht=1T𝕀[¯t])\displaystyle\sum_{t=1}^{T}\mathbb{E}_{t}\left[\langle\widehat{w}_{t},\ell_{t}-\widehat{\ell}_{t}\rangle\right]\leq\underbrace{\sum_{t=1}^{T}\sum_{s,a}|\phi_{t}(s,a)-w_{t}(s,a)|\ell_{t}(s,a)}_{(\star)}+O\left(H\sum_{t=1}^{T}\mathbb{I}[\overline{\mathcal{E}}_{t}]\right)

By Lemma 48, ()(\star) is upper bounded by O(SHAt=1Twt,t2qtι2+S5A2ι2mintqt)O\left(S\sqrt{HA\sum_{t=1}^{T}\frac{\langle w_{t},\ell_{t}^{2}\rangle}{q_{t}^{\prime}}\iota^{2}}+\frac{S^{5}A^{2}\iota^{2}}{\min_{t}q_{t}^{\prime}}\right) with probability 1O(δ)1-O(\delta). Taking expectations on both sides and using that Pr[¯t]δ\Pr[\overline{\mathcal{E}}_{t}]\leq\delta, we get

𝔼[t=1Tw^t,t^t]\displaystyle\mathbb{E}\left[\sum_{t=1}^{T}\langle\widehat{w}_{t},\ell_{t}-\widehat{\ell}_{t}\rangle\right] O(𝔼[SHAt=1Twt,t2qtι2+S5A2ι2mintqt])+O(δSAT)\displaystyle\leq O\left(\mathbb{E}\left[S\sqrt{HA\sum_{t=1}^{T}\frac{\langle w_{t},\ell_{t}^{2}\rangle}{q_{t}^{\prime}}\iota^{2}}+\frac{S^{5}A^{2}\iota^{2}}{\min_{t}q_{t}^{\prime}}\right]\right)+O\left(\delta SAT\right)
O(𝔼[SHAt=1Twt,t2qtι2+S5A2ι2mintqt]).\displaystyle\leq O\left(\mathbb{E}\left[S\sqrt{HA\sum_{t=1}^{T}\frac{\langle w_{t},\ell_{t}^{2}\rangle}{q_{t}^{\prime}}\iota^{2}}+\frac{S^{5}A^{2}\iota^{2}}{\min_{t}q_{t}^{\prime}}\right]\right).

Lemma 51 (cf. Lemma C.9 of Lee et al. (2020)).
𝔼[Bias-2]=𝔼[t=1Tu,^tt]O(1).\displaystyle\mathbb{E}[\textup{Bias-2}]=\mathbb{E}\left[\sum_{t=1}^{T}\langle u,\widehat{\ell}_{t}-\ell_{t}\rangle\right]\leq O(1).
Proof.

Let t\mathcal{E}_{t} be the event that P𝒫τP\in\mathcal{P}_{\tau} for all τt\tau\leq t. By the construction of the loss estimator, we have

𝔼t[u,^tt|t]0\displaystyle\mathbb{E}_{t}\left[\langle u,\widehat{\ell}_{t}-\ell_{t}\rangle\leavevmode\nobreak\ \big{|}\leavevmode\nobreak\ \mathcal{E}_{t}\right]\leq 0

and thus

𝔼t[u,^tt]𝕀[¯t]HT3S2A\displaystyle\mathbb{E}_{t}\left[\langle u,\widehat{\ell}_{t}-\ell_{t}\rangle\right]\leq\mathbb{I}[\overline{\mathcal{E}}_{t}]\cdot H\cdot T^{3}S^{2}A (by Lemma C.5 of Lee et al. (2020), ^t(s,a)T3S2A\widehat{\ell}_{t}(s,a)\leq T^{3}S^{2}A)

and

𝔼[t=1Tu,^tt]HT4S2A𝔼[t=1T𝕀[¯t]]O(δHT5S2A)O(1),\displaystyle\mathbb{E}\left[\sum_{t=1}^{T}\langle u,\widehat{\ell}_{t}-\ell_{t}\rangle\right]\leq HT^{4}S^{2}A\mathbb{E}\left[\sum_{t=1}^{T}\mathbb{I}[\overline{\mathcal{E}}_{t}]\right]\leq O(\delta HT^{5}S^{2}A)\leq O(1),

where the last inequality is by our choice of δ\delta. ∎

Lemma 52 (cf. Lemma C.10 of Lee et al. (2020)).
𝔼[Reg-term]O(S2Aln(SAT)𝔼[t=1Ts,aupdt𝕀t(s,a)t(s,a)2qt2+maxtHqt2]).\displaystyle\mathbb{E}[\textup{Reg-term}]\leq O\left(\sqrt{S^{2}A\ln(SAT)\mathbb{E}\left[\sum_{t=1}^{T}\sum_{s,a}\frac{\scalebox{0.9}{{upd}}_{t}\mathbb{I}_{t}(s,a)\ell_{t}(s,a)^{2}}{q_{t}^{\prime 2}}+\max_{t}\frac{H}{q_{t}^{\prime 2}}\right]}\right).
Proof.

By the same calculation as in the proof of Lemma C.10 in Lee et al. (2020),

w^tu,^t\displaystyle\langle\widehat{w}_{t}-u,\widehat{\ell}_{t}\rangle Dψ(u,w^t)Dψ(u,w^t+1)ηt+ηts,aw^t(s,a)2^t(s,a)2\displaystyle\leq\frac{D_{\psi}(u,\widehat{w}_{t})-D_{\psi}(u,\widehat{w}_{t+1})}{\eta_{t}}+\eta_{t}\sum_{s,a}\widehat{w}_{t}(s,a)^{2}\widehat{\ell}_{t}(s,a)^{2}
Dψ(u,w^t)Dψ(u,w^t+1)ηt+ηts,aupdt𝕀t(s,a)t(s,a)2qt2.\displaystyle\leq\frac{D_{\psi}(u,\widehat{w}_{t})-D_{\psi}(u,\widehat{w}_{t+1})}{\eta_{t}}+\eta_{t}\sum_{s,a}\frac{\scalebox{0.9}{{upd}}_{t}\mathbb{I}_{t}(s,a)\ell_{t}(s,a)^{2}}{q_{t}^{\prime 2}}.

Let t1,t2,t_{1},t_{2},\ldots be the time indices when ηt=ηt12\eta_{t}=\frac{\eta_{t-1}}{2}, and let tit_{i^{\star}} be the last time this happens. Summing the inequalities above and using telescoping, we get

t=1Tw^tu,^t\displaystyle\sum_{t=1}^{T}\langle\widehat{w}_{t}-u,\widehat{\ell}_{t}\rangle i[Dψ(u,w^ti)ηti+ηtit=titi+11s,aupdt𝕀t(s,a)t(s,a)2qt2]\displaystyle\leq\sum_{i}\left[\frac{D_{\psi}(u,\widehat{w}_{t_{i}})}{\eta_{t_{i}}}+\eta_{t_{i}}\sum_{t=t_{i}}^{t_{i+1}-1}\sum_{s,a}\frac{\scalebox{0.9}{{upd}}_{t}\mathbb{I}_{t}(s,a)\ell_{t}(s,a)^{2}}{q_{t}^{\prime 2}}\right]
i[O(S2Aln(SAT))ηti+ηtit=titi+11s,aupdt𝕀t(s,a)t(s,a)2qt2]\displaystyle\leq\sum_{i}\left[\frac{O(S^{2}A\ln(SAT))}{\eta_{t_{i}}}+\eta_{t_{i}}\sum_{t=t_{i}}^{t_{i+1}-1}\sum_{s,a}\frac{\scalebox{0.9}{{upd}}_{t}\mathbb{I}_{t}(s,a)\ell_{t}(s,a)^{2}}{q_{t}^{\prime 2}}\right] (computed in the proof of Lemma C.10 in Lee et al. (2020))
iO(S2Aln(SAT))ηti\displaystyle\leq\sum_{i}\frac{O(S^{2}A\ln(SAT))}{\eta_{t_{i}}} (by the timing we halve the learning rate)
=O(S2Aln(SAT)ηti)\displaystyle=O\left(\frac{S^{2}A\ln(SAT)}{\eta_{t_{i^{\star}}}}\right)
O(S2Aln(SAT)(t=1Ts,aupdt𝕀t(s,a)t(s,a)2qt2+maxtHqt2)).\displaystyle\leq O\left(\sqrt{S^{2}A\ln(SAT)\left(\sum_{t=1}^{T}\sum_{s,a}\frac{\scalebox{0.9}{{upd}}_{t}\mathbb{I}_{t}(s,a)\ell_{t}(s,a)^{2}}{q_{t}^{\prime 2}}+\max_{t}\frac{H}{q_{t}^{\prime 2}}\right)}\right).

H.2 Corralling

We use Algorithm 8 as the corral algorithm for the MDP setting, with Algorithm 13 being the base algorithm. The guarantee of Algorithm 8 is provided in Lemma 42.

Proof of Theorem 26.

To apply Lemma 42, we have to rescale the losses because for MDPs, the loss of a policy in one round can be as large as H. We therefore scale all losses down by a factor of \frac{1}{H}. Then, by Lemma 45, the base algorithm satisfies the condition in Lemma 42 with c_{1}=S^{2}A\iota^{2} and c_{2}=S^{5}A^{2}\iota^{2}/H, where \xi_{t,\pi} is defined through \ell_{t,\pi}^{\prime}=\ell_{t,\pi}/H, the expected loss of policy \pi after scaling. Therefore, by Lemma 42, we can transform it into a \frac{1}{2}-d-LSB algorithm with c_{1}=S^{2}A\iota^{2} and c_{2}=S^{5}A^{2}\iota^{2}/H. Finally, using Theorem 23 and scaling the loss range back, we get

O(S2AHLlog2(T)ι2+S5A2log2(T)ι2)\displaystyle O\left(\sqrt{S^{2}AHL_{\star}\log^{2}(T)\iota^{2}}+S^{5}A^{2}\log^{2}(T)\iota^{2}\right)

regret in the adversarial regime, and

O(H2S2Aι2logTΔ+H2S2Aι2logTΔC+S5A2ι2log(T)log(CΔ1))\displaystyle O\left(\frac{H^{2}S^{2}A\iota^{2}\log T}{\Delta}+\sqrt{\frac{H^{2}S^{2}A\iota^{2}\log T}{\Delta}C}+S^{5}A^{2}\iota^{2}\log(T)\log(C\Delta^{-1})\right)

regret in the corrupted stochastic regime. ∎