
Achieving Near Instance-Optimality and Minimax-Optimality
in Stochastic and Adversarial Linear Bandits Simultaneously

Chung-Wei Lee    Haipeng Luo    Chen-Yu Wei    Mengxiao Zhang    Xiaojin Zhang
Abstract

In this work, we develop linear bandit algorithms that automatically adapt to different environments. By plugging a novel loss estimator into the optimization problem that characterizes the instance-optimal strategy, our first algorithm not only achieves nearly instance-optimal regret in stochastic environments, but also works in corrupted environments with additional regret being the amount of corruption, while the state-of-the-art (Li et al., 2019) achieves neither instance-optimality nor the optimal dependence on the corruption amount. Moreover, by equipping this algorithm with an adversarial component and carefully designed tests, our second algorithm additionally enjoys minimax-optimal regret in completely adversarial environments, which is the first of this kind to our knowledge. Finally, all our guarantees hold with high probability, while existing instance-optimal guarantees only hold in expectation.


1 Introduction

We consider the linear bandit problem with a finite and fixed action set. In this problem, the learner repeatedly selects an action from the action set and observes her loss, whose mean is the inner product between the chosen action and an unknown loss vector determined by the environment. The goal is to minimize the regret, which is the difference between the learner's total loss and the total loss of the best action in hindsight. Two standard environments are heavily studied in the literature: the stochastic environment and the adversarial environment. In the stochastic environment, the loss vector is fixed over time, and we are interested in instance-optimal regret bounds of order o(T^{\epsilon}) for any \epsilon>0, where T is the number of rounds and o(\cdot) hides some instance-dependent constants. On the other hand, in the adversarial environment, the loss vector can be arbitrary in each round, and we are interested in the minimax-optimal regret bound \widetilde{\mathcal{O}}(\sqrt{T}), where \widetilde{\mathcal{O}}(\cdot) hides the problem dimension and logarithmic factors in T.

While there are many algorithms obtaining such optimal bounds in either environment (e.g., (Lattimore & Szepesvari, 2017) in the stochastic setting and (Bubeck et al., 2012) in the adversarial setting), a natural question is whether there exists an algorithm achieving both guarantees simultaneously without knowing the type of the environment. Indeed, the same question has been studied extensively in recent years for the special case of multi-armed bandits where the action set is the standard basis (Bubeck & Slivkins, 2012; Seldin & Slivkins, 2014; Auer & Chiang, 2016; Seldin & Lugosi, 2017; Wei & Luo, 2018; Zimmert & Seldin, 2019). Notably, Zimmert & Seldin (2019) developed an algorithm that is optimal up to universal constants for both stochastic and adversarial environments, and the techniques have been extended to combinatorial semi-bandits (Zimmert et al., 2019) and finite-horizon tabular Markov decision processes (Jin & Luo, 2020). Despite all these advances, however, it is still open whether similar results can be achieved for general linear bandits.

On the other hand, another line of recent works studies the robustness of stochastic linear bandit algorithms from a different perspective and considers a corrupted setting where an adversary can corrupt the stochastic losses up to some limited amount C. This was first considered in multi-armed bandits (Lykouris et al., 2018; Gupta et al., 2019; Zimmert & Seldin, 2019, 2021) and later extended to linear bandits (Li et al., 2019) and Markov decision processes (Lykouris et al., 2019; Jin & Luo, 2020). Ideally, the regret of a robust stochastic algorithm should degrade with an additive term \mathcal{O}(C) in this setting, which is indeed the case in (Gupta et al., 2019; Zimmert & Seldin, 2019, 2021; Jin & Luo, 2020) for multi-armed bandits or Markov decision processes, but has not yet been achieved for general linear bandits.

In this paper, we make significant progress in this direction and develop algorithms with near-optimal regret simultaneously for different environments. Our main contributions are as follows.

  • In Section 4, we first introduce Algorithm LABEL:alg:REOLB, a simple algorithm that achieves \mathcal{O}(c(\mathcal{X},\theta)\log^{2}T+C) regret with high probability in the corrupted setting,¹ where c(\mathcal{X},\theta) is an instance-dependent quantity such that the instance-optimal bound for the stochastic setting (i.e., C=0) is \Theta(c(\mathcal{X},\theta)\log T). This result significantly improves over (Li et al., 2019), which only achieves \mathcal{O}\big(\frac{d^{6}\log^{2}T}{\Delta_{\min}^{2}}+\frac{d^{2.5}C\log T}{\Delta_{\min}}\big), where d is the dimension of the actions and \Delta_{\min} is the minimum sub-optimality gap satisfying c(\mathcal{X},\theta)\leq\mathcal{O}\big(\frac{d}{\Delta_{\min}}\big). Moreover, Algorithm LABEL:alg:REOLB also ensures an instance-independent bound \widetilde{\mathcal{O}}(d\sqrt{T}+C) that some existing instance-optimal algorithms fail to achieve even when C=0 (e.g., (Jun & Zhang, 2020)). ¹In the text, \mathcal{O}(\cdot) often hides lower-order terms (in terms of T dependence) for simplicity. However, in all formal theorem/lemma statements, we use \mathcal{O}(\cdot) to hide universal constants only.

  • In Section 5, based on Algorithm LABEL:alg:REOLB, we further propose Algorithm 1, which not only achieves nearly instance-optimal regret \mathcal{O}(c(\mathcal{X},\theta)\log^{2}T) in the stochastic setting, but also achieves the minimax-optimal regret \widetilde{\mathcal{O}}(\sqrt{T}) in the adversarial setting (both with high probability). To the best of our knowledge, this is the first algorithm that enjoys the best of both worlds for linear bandits. Additionally, the same algorithm also guarantees \widetilde{\mathcal{O}}\big(\frac{d\log^{2}T}{\Delta_{\min}}+C\big) in the corrupted setting, which is slightly worse than Algorithm LABEL:alg:REOLB but still significantly better than (Li et al., 2019).

  • Finally, noticing the extra \log T factor in our bound for the stochastic setting, in Appendix D we also prove that this factor is in fact inevitable if the same algorithm simultaneously achieves sublinear regret in the adversarial setting with high probability (which is the case for Algorithm 1). This generalizes the result of (Auer & Chiang, 2016) for two-armed bandits.

At a high level, Algorithm LABEL:alg:REOLB utilizes a well-known optimization problem (that characterizes the lower bound in the stochastic setting) along with a robust estimator to determine a randomized strategy for each round. This ensures the near instance-optimality of the algorithm in the stochastic setting, and also the robustness to corruption when combined with a doubling trick. To handle the adversarial setting as well, Algorithm 1 switches between an adversarial linear bandit algorithm with high-probability regret guarantees and a variant of Algorithm LABEL:alg:REOLB, depending on the results of some carefully-designed statistical tests on the stochasticity of the environment.

2 Related Work

Linear Bandits.

Linear bandits are a classic model for studying sequential decision problems. The stochastic setting dates back to (Abe & Long, 1999). Auer (2002) first applied the optimism principle to this problem. Later, several algorithms based on confidence ellipsoids were proposed, further improving the regret bounds (Dani et al., 2008a; Rusmevichientong & Tsitsiklis, 2010; Abbasi-Yadkori et al., 2011; Chu et al., 2011).

On the other hand, the adversarial setting was introduced by Awerbuch & Kleinberg (2004). Dani et al. (2008b) achieved the first \mathcal{O}(\sqrt{T}) expected regret bound using the Geometric Hedge algorithm (also called Exp2) with uniform exploration over a barycentric spanner. Abernethy et al. (2008) proposed the first computationally efficient algorithm that achieves \widetilde{\mathcal{O}}(\sqrt{T}) regret using the Follow-the-Regularized-Leader framework. Bubeck et al. (2012) further tightened the bound by improving Exp2 with John's exploration. Our Algorithm 1 makes use of any adversarial linear bandit algorithm with high-probability guarantees (e.g., (Bartlett et al., 2008; Lee et al., 2020)) in a black-box manner.

Instance Optimality for Bandit Problems.

In the stochastic setting, Lattimore & Szepesvari (2017) showed that, unlike in multi-armed bandits, optimism-based algorithms and Thompson sampling can be arbitrarily far from optimal in some simple instances. They proposed an algorithm, also based on the lower bound optimization problem, to achieve instance-optimality, but their algorithm is deterministic and cannot be robust to an adversary. Instance-optimality has also been considered recently in related problems such as linear contextual bandits (Hao et al., 2020; Tirinzoni et al., 2020), partial monitoring (Komiyama et al., 2015), and structured bandits (Combes et al., 2017; Jun & Zhang, 2020). Most of these works only consider expected regret, while our guarantees all hold with high probability.

Best-of-Both-Worlds.

Algorithms that are optimal for both stochastic and adversarial settings were studied in multi-armed bandits (Bubeck & Slivkins, 2012; Seldin & Slivkins, 2014; Auer & Chiang, 2016; Seldin & Lugosi, 2017; Wei & Luo, 2018; Zimmert & Seldin, 2019), semi-bandits (Zimmert et al., 2019), and Markov Decision Processes (Jin & Luo, 2020). On the other hand, linear bandits, a generalization of multi-armed bandits and semi-bandits, is much more challenging and currently underexplored in this direction. To the best of our knowledge, our algorithm is the first that guarantees near-optimal regret bounds in both stochastic and adversarial settings simultaneously.

Stochastic Bandits with Corruption.

Lykouris et al. (2018) first considered the corrupted setting for multi-armed bandits. Their results were improved by (Gupta et al., 2019; Zimmert & Seldin, 2019, 2021) and extended to linear bandits (Li et al., 2019; Bogunovic et al., 2020) and reinforcement learning (Lykouris et al., 2019). As mentioned, our results significantly improve those of (Li et al., 2019) (although their corruption model is slightly more general than ours; see Section 3). On the other hand, the results of (Bogunovic et al., 2020) are incomparable to ours, because they consider a setting where the adversary has even more power and can decide the corruption after seeing the chosen action. Finally, we note that (Lykouris et al., 2019, Theorem 3.2) considers episodic linear Markov decision processes in the corrupted setting, which can be seen as a generalization of linear bandits. However, this result is highly suboptimal when specialized to linear bandits (\Omega(C^{2}\sqrt{T}) ignoring other parameters).

3 Preliminaries

Let \mathcal{X}\subset\mathbb{R}^{d} be a finite set that spans \mathbb{R}^{d}. Each element in \mathcal{X} is called an arm or an action. We assume that \|x\|_{2}\leq 1 for all x\in\mathcal{X}. A linear bandit problem proceeds in T rounds. In each round t=1,\ldots,T, the learner selects an action x_{t}\in\mathcal{X}. Simultaneously, the environment decides a hidden loss vector \ell_{t}\in\mathbb{R}^{d} and generates some independent zero-mean noise \epsilon_{t}(x) for each action x. Afterwards, the learner observes her loss y_{t}=\langle x_{t},\ell_{t}\rangle+\epsilon_{t}(x_{t}). We consider three different settings: stochastic, corrupted, and adversarial, explained in detail below.

In the stochastic setting, \ell_{t} is fixed to some unknown vector \theta\in\mathbb{R}^{d}. We assume that there exists a unique optimal arm x^{*}\in\mathcal{X} such that \langle x^{*},\theta\rangle<\min_{x^{*}\neq x\in\mathcal{X}}\langle x,\theta\rangle, and define for each x\in\mathcal{X} its sub-optimality gap as \Delta_{x}=\langle x-x^{*},\theta\rangle. Also denote the minimum gap \min_{x\neq x^{*}}\Delta_{x} by \Delta_{\min}.

The corrupted setting is a generalization of the stochastic setting, where in addition to a fixed vector \theta, the environment also decides a corruption vector c_{t}\in\mathbb{R}^{d} for each round (before seeing x_{t}) so that \ell_{t}=\theta+c_{t}.² We define the total amount of corruption as C=\sum_{t}\max_{x\in\mathcal{X}}|\langle x,c_{t}\rangle|. The stochastic setting is clearly a special case with C=0. In both of these settings, we define the regret as \text{\rm Reg}(T)=\max_{x\in\mathcal{X}}\sum_{t=1}^{T}\langle x_{t}-x,\theta\rangle=\sum_{t=1}^{T}\Delta_{x_{t}}. ²In other words, the environment corrupts the observation y_{t} by adding \langle x_{t},c_{t}\rangle. The setting of (Li et al., 2019) is slightly more general, with the corruption on y_{t} being c_{t}(x_{t}) for some function c_{t} that is not necessarily linear.

Finally, in the adversarial setting, \ell_{t} can be chosen arbitrarily (possibly dependent on the learner's algorithm and her previously chosen actions). The difference compared to the corrupted setting (which also allows arbitrary loss vectors) is that the regret is now defined in terms of \ell_{t}: \text{\rm Reg}(T)=\max_{x\in\mathcal{X}}\sum_{t=1}^{T}\langle x_{t}-x,\ell_{t}\rangle.

In all settings, we assume \langle x,\theta\rangle, \langle x,c_{t}\rangle, \langle x,\ell_{t}\rangle, and y_{t} are all in [-1,1] for all t and x\in\mathcal{X}. We also denote \langle x,\ell_{t}\rangle by \ell_{t,x} and similarly \langle x,c_{t}\rangle by c_{t,x}.
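To make the protocol concrete, the following minimal sketch simulates one round of the stochastic/corrupted interaction. It is an illustration only (not part of the paper): the function name `simulate_round` is hypothetical, the learner's choice is just a uniformly random index, and the noise is a bounded uniform variable chosen for simplicity.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_round(X, theta, c_t=None, noise_scale=0.1):
    """One round of the linear bandit protocol.

    X: (K, d) array of actions with ||x||_2 <= 1.
    theta: (d,) fixed loss vector (stochastic part).
    c_t: optional (d,) corruption vector for this round (corrupted setting);
         in the adversarial setting one would instead pass an arbitrary ell_t.
    Returns the chosen index and the observed loss y_t, clipped to [-1, 1].
    """
    ell_t = theta if c_t is None else theta + c_t
    k = int(rng.integers(len(X)))                       # stand-in for the learner's choice x_t
    noise = rng.uniform(-noise_scale, noise_scale)      # independent zero-mean noise eps_t(x_t)
    y_t = float(np.clip(X[k] @ ell_t + noise, -1.0, 1.0))
    return k, y_t

# Example: a 2-dimensional instance with three actions.
X = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
theta = np.array([0.2, 0.5])
print(X @ theta - np.min(X @ theta))                    # sub-optimality gaps Delta_x
print(simulate_round(X, theta))
```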

It is known that the minimax-optimal regret in the adversarial setting is \Theta(d\sqrt{T}) (Dani et al., 2008b; Bubeck et al., 2012). Instance-optimality in the stochastic case, on the other hand, is slightly more involved. Specifically, an algorithm is called consistent if it guarantees \mathbb{E}[\text{\rm Reg}(T)]=o(T^{\epsilon}) for any \theta, \mathcal{X}, and \epsilon>0. Then, a classic lower bound result (see e.g., (Lattimore & Szepesvari, 2017)) states that, for a particular instance (\mathcal{X},\theta), all consistent algorithms satisfy the following (the original proof assumes Gaussian noise; to meet our boundedness assumption on y_{t}, it suffices to consider the case when y_{t} is a Bernoulli random variable, which only affects the constant of the lower bound):

\displaystyle\liminf_{T\to\infty}\frac{\mathbb{E}[\text{\rm Reg}(T)]}{\log T}\geq\Omega(c(\mathcal{X},\theta)),

where c(\mathcal{X},\theta) is the objective value of the following optimization problem:

\displaystyle\inf_{N\in[0,\infty)^{\mathcal{X}}}\sum_{x\in\mathcal{X}\backslash\{x^{*}\}}N_{x}\Delta_{x} \qquad (1)
\displaystyle\text{subject to}\quad\|x\|^{2}_{H^{-1}(N)}\leq\frac{\Delta_{x}^{2}}{2},\quad\forall{x\in\mathcal{X}\backslash\{x^{*}\}} \qquad (2)

and H(N)=\sum_{x\in\mathcal{X}}N_{x}xx^{\top} (the notation \|x\|_{M} denotes the quadratic norm \sqrt{x^{\top}Mx} with respect to a matrix M). This implies that the best instance-dependent bound for \text{\rm Reg}(T) one can hope for is \mathcal{O}(c(\mathcal{X},\theta)\log T) (and more generally \mathcal{O}(c(\mathcal{X},\theta)\log T+C) for the corrupted setting). It can be shown that c(\mathcal{X},\theta)\leq\mathcal{O}\left(\frac{d}{\Delta_{\min}}\right) (see Lemma 16), but this upper bound can be arbitrarily loose as shown in (Lattimore & Szepesvari, 2017).
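For concreteness, the program in Eq. (1)-(2) is convex in N and can be evaluated numerically for small instances. The sketch below (our own illustration, not part of the paper) uses cvxpy; the helper name `instance_complexity` is a placeholder.

```python
import cvxpy as cp
import numpy as np

def instance_complexity(X, theta):
    """Numerically evaluate c(X, theta) from Eq. (1)-(2)."""
    losses = X @ theta
    star = int(np.argmin(losses))                  # index of the optimal arm x^*
    gaps = losses - losses[star]                   # sub-optimality gaps Delta_x
    K, d = X.shape
    N = cp.Variable(K, nonneg=True)                # pull counts N_x
    H = sum(N[i] * np.outer(X[i], X[i]) for i in range(K))
    # ||x||^2_{H(N)^{-1}} = matrix_frac(x, H), a convex function of N
    constraints = [cp.matrix_frac(X[i], H) <= gaps[i] ** 2 / 2
                   for i in range(K) if i != star]
    problem = cp.Problem(cp.Minimize(cp.sum(cp.multiply(N, gaps))), constraints)
    problem.solve()
    return problem.value

X = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]])
theta = np.array([0.1, 0.6])
print(instance_complexity(X, theta))
```

Note that the infimum in Eq. (1) lets N_{x^{*}} be arbitrarily large at no cost, which the solver can exploit to keep H(N) invertible.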

The solution N_{x} of the optimization problem above specifies the least number of times action x should be drawn in order to distinguish between the present environment and any other alternative environment with a different optimal action. Many previous instance-optimal algorithms try to match their number of pulls for x to the solution N_{x} under some estimated gap \widehat{\Delta}_{x} (Lattimore & Szepesvari, 2017; Hao et al., 2020; Jun & Zhang, 2020). While these algorithms are asymptotically optimal, their regret usually grows linearly when T is small (Jun & Zhang, 2020). Furthermore, they are all deterministic algorithms and by design cannot tolerate corruptions. We will show how these issues can be addressed in the next section.

Notations.

We use \mathcal{P}_{\mathcal{S}} to denote the probability simplex over \mathcal{S}: \left\{p\in\mathbb{R}_{\geq 0}^{|\mathcal{S}|}:\sum_{s\in\mathcal{S}}p_{s}=1\right\}, and define the clipping operator \text{Clip}_{[a,b]}(v) as \min(\max(v,a),b) for a\leq b.

4 A New Algorithm for the Corrupted Setting

In this section, we focus on the corrupted setting (hence covering the stochastic setting as well). We introduce a new algorithm that achieves, with high probability, an instance-dependent regret bound of \mathcal{O}(c(\mathcal{X},\theta)\log^{2}T+C) for large T and an instance-independent regret bound of \widetilde{\mathcal{O}}(d\sqrt{T}+C) for any T. This improves over previous instance-optimal algorithms (Lattimore & Szepesvari, 2017; Hao et al., 2020; Jun & Zhang, 2020) in several respects: 1) first and foremost, our algorithm handles corruption optimally with extra \mathcal{O}(C) regret, while previous algorithms can fail completely due to their deterministic nature; 2) previous bounds only hold in expectation; 3) previous algorithms might suffer linear regret when T is small, while ours is always \widetilde{\mathcal{O}}(d\sqrt{T}+C) for any T. The price we pay is an additional \log T factor in the instance-dependent bound. On the other hand, compared to the work of (Li et al., 2019), which also covers the same corrupted setting and achieves \mathcal{O}\left(\frac{d^{6}\log^{2}T}{\Delta_{\min}^{2}}+\frac{d^{2.5}C\log T}{\Delta_{\min}}\right), our results are significantly better (recall c(\mathcal{X},\theta)\leq\mathcal{O}(d/\Delta_{\min})), although as mentioned in Footnote 2, their results hold for an even more general setting with non-linear corruption.

Our algorithm is presented in Algorithm LABEL:alg:REOLB, which proceeds in blocks of rounds whose lengths grow in a doubling manner (2^{0},2^{1},\ldots). At the beginning of block m (denoted as \mathcal{B}_{m}), we compute a distribution p_{m} over actions by solving an optimization problem OP (Figure LABEL:fig:_op) using the empirical gap \widehat{\Delta}_{m,x} estimated in the previous block (Line LABEL:line:_solving_OP_in_alg_1). Then we use p_{m} to sample actions for the entire block m, and construct an unbiased loss estimator \widehat{\ell}_{t,x} in every round for every action x (Line LABEL:line:_construct_loss_estimator_alg_1). At the end of each block m, we use \{\widehat{\ell}_{\tau,x}\}_{\tau\in\mathcal{B}_{m}} to construct a robust loss estimator \text{Rob}_{m,x} for each action (Line LABEL:line:_construct_robust_alg_1), which will then be used to construct \widehat{\Delta}_{m+1,x} for the next block. We next explain the optimization problem OP and the estimators in detail.

OP is inspired by the lower bound optimization (Eq. (1) and Eq. (2)), where we normalize the pull counts N into a distribution p over the arms such that, for a large m, p_{m,x}\approx\frac{N_{x}}{2^{m}} holds for x\neq x^{*}. One key difference between our algorithm and previous ones (Lattimore & Szepesvari, 2017; Hao et al., 2020; Jun & Zhang, 2020) is exactly that we select actions randomly according to these distributions, while they try to deterministically match the pull count of each arm to N. Our randomized strategy not only prevents the environment from exploiting knowledge of the learner's choices, but also allows us to construct an unbiased estimator \widehat{\ell}_{t,x} (Line LABEL:line:_construct_loss_estimator_alg_1) following standard adversarial linear bandit algorithms (Dani et al., 2008b; Bubeck et al., 2012). In fact, as shown in our analysis, the variance of the estimator \widehat{\ell}_{t,x} is exactly bounded by \|x\|_{S_{m}^{-1}}^{2} (for t\in\mathcal{B}_{m}), which is in turn bounded in terms of the sub-optimality gap of x in light of the constraint Eq. (LABEL:eqn:_opt-2-constraint). A similar idea of imposing explicit constraints on the variance of loss estimators appears before in, for example, (Dudik et al., 2011; Agarwal et al., 2014) for contextual bandits. Finally, we point out that OP always has a solution thanks to the additive term 4d in Eq. (LABEL:eqn:_opt-2-constraint) (see Lemma 11), and it can be solved efficiently by standard methods since Eq. (LABEL:eqn:_opt-2-constraint) is a convex constraint.
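The exact form of OP is given in Figure LABEL:fig:_op (not reproduced here); from Lemma 4 and its proof in Appendix A, the constraint is of the form \|x\|_{S(p)^{-1}}^{2}\leq\frac{t\widehat{\Delta}_{x}^{2}}{\beta_{t}}+4d with S(p)=\sum_{x}p_{x}xx^{\top} and \beta_{t} a logarithmic factor of order \log(t|\mathcal{X}|/\delta). Under that assumption, the following sketch (ours, not the paper's code; `solve_OP` is a placeholder name) solves OP(t, \widehat{\Delta}) with cvxpy.

```python
import cvxpy as cp
import numpy as np

def solve_OP(X, gap_hat, t, beta_t):
    """Sketch of OP(t, gap_hat): minimize <p, gap_hat> over distributions p
    subject to ||x||^2_{S(p)^{-1}} <= t*gap_hat_x^2/beta_t + 4d for every x
    (constraint form assumed from Lemma 4 / Appendix A)."""
    K, d = X.shape
    p = cp.Variable(K, nonneg=True)
    S = sum(p[i] * np.outer(X[i], X[i]) for i in range(K))
    constraints = [cp.sum(p) == 1]
    constraints += [cp.matrix_frac(X[i], S) <= t * gap_hat[i] ** 2 / beta_t + 4 * d
                    for i in range(K)]
    cp.Problem(cp.Minimize(cp.sum(cp.multiply(p, gap_hat))), constraints).solve()
    return p.value
```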

Another important ingredient of our algorithm is the robust estimator \text{Rob}_{m,x}, which is a clipped version of Catoni's estimator (Catoni, 2012) constructed using all the unbiased estimators \{\widehat{\ell}_{\tau,x}\}_{\tau\in\mathcal{B}_{m}} from this block for action x (Figure LABEL:fig:catoni). From a technical perspective, this avoids a lower-order term in Bernstein-style concentration bounds and is critical for our analysis. We in fact believe that this is necessary, since there is no explicit regularization on the magnitude of \widehat{\ell}_{t,x}, and it can indeed have a heavy-tailed distribution. While other robust estimators are possible, we use Catoni's estimator, which was analyzed in (Wei et al., 2020) for non-i.i.d. random variables (again important for our analysis).
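As a reference point, Catoni's estimator for samples z_{1},\ldots,z_{n} with parameter \alpha is the root z of \sum_{i}\psi(\alpha(z_{i}-z))=0, where \psi(y)=\mathrm{sign}(y)\log(1+|y|+y^{2}/2). The following sketch (our own illustration, not the paper's code; the clipping range [-1,1] follows Figure LABEL:fig:catoni) computes the clipped version used for \text{Rob}_{m,x}.

```python
import numpy as np
from scipy.optimize import brentq

def psi(y):
    # Catoni's influence function: sign(y) * log(1 + |y| + y^2/2)
    return np.sign(y) * np.log1p(np.abs(y) + 0.5 * y ** 2)

def clipped_catoni(samples, alpha, clip_range=(-1.0, 1.0)):
    """Clip_{[-1,1]}(Catoni_alpha({samples})): root of z -> sum_i psi(alpha*(z_i - z))."""
    samples = np.asarray(samples, dtype=float)
    f = lambda z: np.sum(psi(alpha * (samples - z)))
    lo, hi = samples.min() - 1.0, samples.max() + 1.0   # f(lo) > 0 > f(hi), so a root exists
    z = brentq(f, lo, hi)
    return float(np.clip(z, *clip_range))

# Example: heavy-tailed samples centered around 0.3
rng = np.random.default_rng(0)
data = 0.3 + rng.standard_t(df=2, size=1000)
print(clipped_catoni(data, alpha=0.05))
```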

The following theorem summarizes the nearly instance-optimal regret bound of Algorithm LABEL:alg:REOLB.

Theorem 1.

In the corrupted setting, Algorithm LABEL:alg:REOLB guarantees that with probability at least 1-\delta,

\displaystyle\text{\rm Reg}(T)=\mathcal{O}\Bigg(c(\mathcal{X},\theta)\log T\log\frac{T|\mathcal{X}|}{\delta}+M^{*}\log^{\frac{3}{2}}\frac{1}{\delta}+C+d\sqrt{\frac{C}{\Delta_{\min}}}\log\frac{C|\mathcal{X}|}{\Delta_{\min}\delta}\Bigg),

where M^{*} is some constant that depends on \mathcal{X} and \theta only.

The dominating term of this regret bound is thus \mathcal{O}(c(\mathcal{X},\theta)\log^{2}T+C) as claimed. The definition of M^{*} can be found in the proof (Appendix B) and is, importantly, independent of T. In fact, in Theorem 19, we also provide an alternative (albeit weaker) bound \mathcal{O}(\frac{d^{2}(\log T)^{2}}{\Delta_{\min}}+C) for Algorithm LABEL:alg:REOLB without the dependence on M^{*}.

The next theorem shows an instance-independent bound of order \widetilde{\mathcal{O}}(d\sqrt{T}+C) for Algorithm LABEL:alg:REOLB, which previous instance-optimal algorithms fail to achieve as mentioned.

Theorem 2.

In the corrupted setting, Algorithm LABEL:alg:REOLB guarantees that with probability at least 1-\delta, \text{\rm Reg}(T)\leq\mathcal{O}(d\sqrt{T}\log(T|\mathcal{X}|/\delta)+C).

We emphasize that Algorithm LABEL:alg:REOLB is parameter-free and does not need to know C to achieve these bounds. In the rest of this section, we provide a proof sketch for Theorem 1 and Theorem 2. First, we show that the estimated gap \widehat{\Delta}_{m,x} is close to the true gap \Delta_{x} up to a constant multiplicative factor and some additive terms that decrease at a rate of roughly 1/\sqrt{t}, up to some average amount of corruption.

Lemma 3.

With probability at least 1-\delta, Algorithm LABEL:alg:REOLB ensures for all m and all x,

\displaystyle\Delta_{x}\leq 2\widehat{\Delta}_{m,x}+\sqrt{\frac{d\gamma_{m}}{4\cdot 2^{m}}}+2\rho_{m-1}, \qquad (3)
\displaystyle\widehat{\Delta}_{m,x}\leq 2\Delta_{x}+\sqrt{\frac{d\gamma_{m}}{4\cdot 2^{m}}}+2\rho_{m-1}, \qquad (4)

where \rho_{m}=\sum_{k=0}^{m}\frac{2^{k}C_{k}}{4^{m-1}} (\rho_{-1} is defined as 0), C_{k}=\sum_{\tau\in\mathcal{B}_{k}}\max_{x\in\mathcal{X}}|c_{\tau,x}| is the amount of corruption within block k, and \gamma_{m}=2^{15}\log(2^{m}|\mathcal{X}|/\delta).

As mentioned, the proof of Lemma 3 heavily relies on the robust estimators we use as well as the variance constraint Eq. (LABEL:eqn:_opt-2-constraint). Next, we have the following lemma which bounds the objective value of OP.

Lemma 4.

Let p be the solution of \textbf{OP}(t,\widehat{\Delta}), where \widehat{\Delta}\in\mathbb{R}_{\geq 0}^{|\mathcal{X}|}. Then we have \sum_{x\in\mathcal{X}}p_{x}\widehat{\Delta}_{x}=\mathcal{O}\left(\frac{d\log(t|\mathcal{X}|/\delta)}{\sqrt{t}}\right).

Combining Lemma 3 and Lemma 4, we see that in block m, the regret of Algorithm LABEL:alg:REOLB can be upper bounded by

\displaystyle\mathcal{O}\left(2^{m}\sum_{x}p_{m,x}\Delta_{x}\right)
=\widetilde{\mathcal{O}}\left(2^{m}\sum_{x}p_{m,x}\left(\widehat{\Delta}_{m,x}+\sqrt{\frac{d}{2^{m}}}+\rho_{m-1}\right)\right)
=\widetilde{\mathcal{O}}\left(d\sqrt{2^{m}}+2^{m}\rho_{m-1}\right),

where in the first equality we use Lemma 3 and in the second equality we use Lemma 4 together with the fact that p_{m}=\textbf{OP}(2^{m},\widehat{\Delta}_{m}). Further summing this over m and relating \sum_{m}2^{m}\rho_{m-1} to C proves Theorem 2.
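To spell out the last step (a short calculation we add for completeness, using the definition of \rho_{m} from Lemma 3 and exchanging the order of summation):

\displaystyle\sum_{m}2^{m}\rho_{m-1}=\sum_{m}\sum_{k=0}^{m-1}\frac{2^{m}\cdot 2^{k}C_{k}}{4^{m-2}}=\sum_{k}2^{k}C_{k}\sum_{m\geq k+1}\frac{16\cdot 2^{m}}{4^{m}}=16\sum_{k}2^{k}C_{k}\cdot 2^{-k}=16\sum_{k}C_{k}=16C,

so the total contribution of the corruption terms over all blocks is \mathcal{O}(C).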

In addition, based on Lemma 3, we show that when t\in\mathcal{B}_{m} is larger than \Omega(C/\Delta_{\min}) plus some problem-dependent constant, the estimated gap \widehat{\Delta}_{m,x} becomes \Theta(\Delta_{x}). Therefore, the solution \{p_{m,x}\}_{x\in\mathcal{X}\setminus\{x^{*}\}} from OP is very close to \{\frac{N_{x}}{2^{m}}\}_{x\in\mathcal{X}\setminus\{x^{*}\}}, where N_{x} is the optimal solution of Eq. (1) and Eq. (2), except that we have an additional \log(2^{m}|\mathcal{X}|/\delta) factor in the constraint (coming from \beta_{2^{m}}). Therefore, the regret is bounded by \mathcal{O}(c(\mathcal{X},\theta)\log(T)\log(T|\mathcal{X}|/\delta)) for large enough T. Formally, we have the following lemma.

Lemma 5.

Algorithm LABEL:alg:REOLB guarantees with probability at least 1-\delta, for some constant T^{*} depending on \mathcal{X}, \theta, and C:

\displaystyle\sum^{T}_{t=T^{*}+1}\sum_{x}p_{t,x}\Delta_{x}\leq\mathcal{O}\left(c(\mathcal{X},\theta)\log(T)\log(T|\mathcal{X}|/\delta)\right).

Finally, to obtain Theorem 1, it suffices to apply Theorem 2 for the regret before round T^{*} and Lemma 5 for the regret after.

5 Best of Three Worlds

In this section, building on top of Algorithm LABEL:alg:REOLB, we develop another algorithm that enjoys similar regret guarantees in the stochastic or corrupted setting, and additionally guarantees \widetilde{\mathcal{O}}(\sqrt{T}) regret in the adversarial setting, without any prior knowledge of which environment it is facing. To the best of our knowledge, this kind of best-of-three-worlds guarantee has only appeared before for multi-armed bandits (Wei & Luo, 2018; Zimmert & Seldin, 2019) and Markov decision processes (Jin & Luo, 2020), but not for linear bandits.

Our algorithm requires black-box access to an adversarial linear bandit algorithm \mathcal{A} that satisfies the following:

Assumption 1.

\mathcal{A} is a linear bandit algorithm that outputs a loss estimator \widehat{\ell}_{t,x} for each action x after each time t. There exist L_{0}, C_{1}\geq 2^{15}d\log(T|\mathcal{X}|/\delta), and a universal constant C_{2}\geq 20, such that for all t\geq L_{0}, \mathcal{A} guarantees the following with probability at least 1-\frac{\delta}{T}: \forall x\in\mathcal{X},

\displaystyle\sum_{s=1}^{t}(\ell_{s,x_{s}}-\ell_{s,x})\leq\sqrt{C_{1}t}-C_{2}\left|\sum_{s=1}^{t}(\ell_{s,x}-\widehat{\ell}_{s,x})\right|. \qquad (5)

Eq. (5) states that the regret of \mathcal{A} against action x is bounded by a \sqrt{t}-order term minus the deviation between the loss of x and its estimator. While this might not seem intuitive, in fact all existing linear bandit algorithms with a near-optimal high-probability bound satisfy Assumption 1, even though this may not have been stated explicitly (and one may need to slightly change the constant parameters in these algorithms to satisfy the conditions on C_{1} and C_{2}). Below, we give two examples of such \mathcal{A} and justify them in Appendix E.

  • A variant of GeometricHedge.P (Bartlett et al., 2008) with an improved exploration scheme satisfies Assumption 1 with (\delta^{\prime}=\delta/(|\mathcal{X}|\log_{2}T))

    C_{1}=\Theta\left(d\log(T/\delta^{\prime})\right),\ \ L_{0}=\Theta\left(d\log^{2}(T/\delta^{\prime})\right).

  • The algorithm of (Lee et al., 2020) satisfies Assumption 1 with (\lg=\log(dT), \delta^{\prime\prime}=\delta/(|\mathcal{X}|T))

    C_{1}=\Theta\left(d^{6}\lg^{8}\log(\lg/\delta^{\prime\prime})\right),\ \ L_{0}=\Theta\left(\log(\lg/\delta^{\prime\prime})\right).

With such a black-box at hand, our algorithm BOTW is shown in Algorithm 1. We first present its formal guarantees in different settings.

Theorem 6.

Algorithm 1 guarantees that with probability at least 1-\delta, in the stochastic setting (C=0), \text{\rm Reg}(T) is at most

\displaystyle\mathcal{O}\left(c(\mathcal{X},\theta)\log T\log\frac{T|\mathcal{X}|}{\delta}+\frac{C_{1}\sqrt{\log T}}{\Delta_{\min}}+M^{*}\log^{\frac{3}{2}}\frac{1}{\delta}+\sqrt{C_{1}L_{0}}\right),

where M^{*} is the same problem-dependent constant as in Theorem 1; and in the corrupted setting (C>0), \text{\rm Reg}(T) is at most

\displaystyle\mathcal{O}\left(\frac{C_{1}\log T}{\Delta_{\min}}+C+\sqrt{C_{1}L_{0}}\right).

In the case when \mathcal{A} is the variant of GeometricHedge.P, the last bound is

\displaystyle\mathcal{O}\left(\frac{d\log(T|\mathcal{X}|/\delta)\log T}{\Delta_{\min}}+C\right).

Therefore, Algorithm 1 enjoys the nearly instance-optimal regret \mathcal{O}(c(\mathcal{X},\theta)\log^{2}T) in the stochastic setting just as Algorithm LABEL:alg:REOLB does,⁴ but slightly worse regret \mathcal{O}(\frac{d\log^{2}T}{\Delta_{\min}}+C) in the corrupted setting (recall again c(\mathcal{X},\theta)\leq d/\Delta_{\min}). In exchange, however, Algorithm 1 enjoys the following worst-case robustness in the adversarial setting. ⁴Note that when we choose \mathcal{A} as the variant of GeometricHedge.P, \frac{C_{1}\sqrt{\log T}}{\Delta_{\min}}=\mathcal{O}\left(\frac{d}{\Delta_{\min}}\log^{\frac{3}{2}}T\right), which is dominated by the term \mathcal{O}(c(\mathcal{X},\theta)\log^{2}T) when T is sufficiently large.

Theorem 7.

In the adversarial setting, Algorithm 1 guarantees that with probability at least 1-\delta, \text{\rm Reg}(T) is at most \mathcal{O}\left(\sqrt{C_{1}T\log T}+\sqrt{C_{1}L_{0}}\right).

The dependence on T in this bound is minimax-optimal as mentioned, while the dependence on d depends on the coefficient C_{1} of the black-box. Note that because of this adversarial robustness, the \log^{2}T dependence in Theorem 6 turns out to be unavoidable, as we show in Theorem 27. In addition, Theorem 7 also holds in the stochastic setting, which implies a regret bound of \mathcal{O}(\sqrt{dT}\log(T|\mathcal{X}|\log_{2}T/\delta)). This is a factor of \sqrt{d} better than the guarantee of Algorithm LABEL:alg:REOLB shown in Theorem 2.

Next, in Section 5.1, we describe our algorithm in detail. Then in Section 5.2 and Section 5.3, we provide proof sketches for Theorem 7 and Theorem 6 respectively.

Input: an algorithm \mathcal{A} satisfying Assumption 1.
Initialize: L\leftarrow L_{0} (L_{0} defined in Assumption 1).
while true do
       Run BOTW-SE with input L, and receive output t_{0}.
       L\leftarrow 2t_{0}.
Algorithm 1 BOTW (Best of Three Worlds)

5.1 The algorithm

Algorithm 1 (BOTW) takes a black-box \mathcal{A} satisfying Assumption 1 (with parameter L_{0}) as input, and then proceeds in epochs until the game ends. In each epoch, it runs its single-epoch version BOTW-SE (Algorithm 2) with a minimum duration L (initialized to L_{0}). Based on the results of some statistical tests, at some point BOTW-SE terminates with an output t_{0}\geq L. Then BOTW enters the next epoch with L updated to 2t_{0}, so that the number of epochs is always \mathcal{O}(\log T).

BOTW-SE has two phases. In Phase 1, the learner executes the adversarial linear bandit algorithm \mathcal{A}. Starting from t=L (i.e., after the minimum duration specified by the input), the algorithm checks in every round whether Eq. (7) and Eq. (8) hold for some action \widehat{x} (Line 3). If such an \widehat{x} exists, Phase 1 terminates and the algorithm proceeds to Phase 2. This test is designed to detect whether the environment is likely stochastic. Indeed, Eq. (7) and Eq. (8) imply that the performance of the learner is significantly better than that of all but one action (i.e., \widehat{x}). In a stochastic environment, this event happens at roughly t\approx\Theta\left(\frac{d}{\Delta_{\min}^{2}}\right) with \widehat{x}=x^{*}. This is exactly the time when the learner should stop using \mathcal{A}, whose regret grows as \widetilde{\mathcal{O}}(\sqrt{t}), and start exploiting the better actions more, in order to keep the regret logarithmic in time in the stochastic environment. We define t_{0} to be the time when Phase 1 ends, and \widehat{\Delta}_{x} to be the empirical gap for action x with respect to the estimators obtained from \mathcal{A} so far (Line 3). In the stochastic setting, we can show that \widehat{\Delta}_{x}=\Theta(\Delta_{x}) holds with high probability.

In the second phase, we compute the action distribution using OP with the estimated gaps \{\widehat{\Delta}_{x}\}_{x\in\mathcal{X}}. Indeed, if the \widehat{\Delta}_{x}'s are accurate, the distribution returned by OP is close to the optimal way of allocating arm pulls, leading to near-optimal regret (here, we solve OP in every round for simplicity; it can in fact be solved only when t doubles, just as in Algorithm LABEL:alg:REOLB). For technical reasons, there are some differences between Phase 2 and Algorithm LABEL:alg:REOLB. First, instead of using p_{t}, the distribution returned by OP, to draw actions, we mix it with e_{\widehat{x}} (the distribution that concentrates on \widehat{x}), and draw actions using \widetilde{p}_{t}=\frac{1}{2}e_{\widehat{x}}+\frac{1}{2}p_{t}. This way, \widehat{x} is drawn with probability at least \frac{1}{2}. Moreover, the loss estimator \widehat{\ell}_{t,x} is now defined as follows:

\displaystyle\widehat{\ell}_{t,x}=\begin{cases}x^{\top}\widetilde{S}_{t}^{-1}x_{t}y_{t},\qquad&x\neq\widehat{x}\\ \frac{y_{t}}{\widetilde{p}_{t,\widehat{x}}}\mathbb{I}\{x_{t}=\widehat{x}\},\qquad&x=\widehat{x}\end{cases} \qquad (6)

where \widetilde{S}_{t}=\sum_{x\in\mathcal{X}}\widetilde{p}_{t,x}xx^{\top}. While the construction of \widehat{\ell}_{t,x} for x\neq\widehat{x} is the same as in Algorithm LABEL:alg:REOLB, the construction of \widehat{\ell}_{t,\widehat{x}} is different and is based on standard inverse probability weighting. These differences arise mainly because we later use the average estimator instead of the robust mean estimator for \widehat{x} (the latter produces a slightly looser concentration bound in our analysis). Therefore, we must ensure that \widehat{x} is drawn with enough probability, and that the magnitude of \widehat{\ell}_{t,\widehat{x}} is well-controlled.
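As an illustration, the Phase 2 estimator in Eq. (6) can be computed as follows; this is our own sketch (array shapes and the helper name are placeholders), not the paper's code.

```python
import numpy as np

def phase2_loss_estimator(X, p_tilde, x_hat, chosen, y_t):
    """Eq. (6): hat-ell_{t,x} for every action x.

    X: (K, d) action matrix; p_tilde: (K,) mixed distribution 0.5*e_{x_hat} + 0.5*p_t;
    x_hat: index of the candidate best action; chosen: index of x_t; y_t: observed loss.
    """
    S_tilde = X.T @ (p_tilde[:, None] * X)                      # sum_x p_tilde_x x x^T
    ell_hat = X @ np.linalg.solve(S_tilde, X[chosen]) * y_t     # x^T S^{-1} x_t y_t for all x
    # Inverse-probability-weighted estimator for the candidate action x_hat
    ell_hat[x_hat] = (y_t / p_tilde[x_hat]) if chosen == x_hat else 0.0
    return ell_hat
```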

Input: L (minimum duration)
Define: f_{T}=\log T
Initialize: a new instance of \mathcal{A}.
// Phase 1
1 for t=1,2,\ldots do
      2 Execute and update \mathcal{A}. Receive estimators \{\widehat{\ell}_{t,x}\}_{x\in\mathcal{X}}.
      3 if t\geq L and there exists an action \widehat{x} such that
\displaystyle\sum_{s=1}^{t}y_{s}-\sum_{s=1}^{t}\widehat{\ell}_{s,\widehat{x}}\geq-5\sqrt{f_{T}C_{1}t}, \qquad (7)
\displaystyle\sum_{s=1}^{t}y_{s}-\sum_{s=1}^{t}\widehat{\ell}_{s,x}\leq-25\sqrt{f_{T}C_{1}t},\ \ \ \forall x\neq\widehat{x}, \qquad (8)
      4 then t_{0}\leftarrow t,\;\widehat{\Delta}_{x}\leftarrow\frac{1}{t_{0}}\left(\sum_{s=1}^{t_{0}}\widehat{\ell}_{s,x}-\widehat{\ell}_{s,\widehat{x}}\right),   break.
// Phase 2
5 for t=t_{0}+1,\ldots do
      6 Let p_{t}=\textbf{OP}(t,\widehat{\Delta}) and \widetilde{p}_{t}=\frac{1}{2}e_{\widehat{x}}+\frac{1}{2}p_{t}.
      7 Sample x_{t}\sim\widetilde{p}_{t} and observe y_{t}.
      8 Calculate \widehat{\ell}_{t,x} and \widehat{\Delta}_{t,x} based on Eq. (6) and Eq. (11).
      9 if
\displaystyle\exists x\neq\widehat{x},\ \ \widehat{\Delta}_{t,x}\notin\left[0.39\widehat{\Delta}_{x},1.81\widehat{\Delta}_{x}\right]\ \ \text{or} \qquad (9)
\displaystyle\sum_{s=t_{0}+1}^{t}\left(y_{s}-\widehat{\ell}_{s,\widehat{x}}\right)\geq 20\sqrt{f_{T}C_{1}t_{0}}, \qquad (10)
      10 then break.
11 Return t_{0}.
Algorithm 2 BOTW-SE (BOTW – Single Epoch)

Then, we define the average empirical gap in [1,t] for x\neq\widehat{x} and t in Phase 2 as follows:

\displaystyle\widehat{\Delta}_{t,x}=\frac{1}{t}\left(\sum_{s=1}^{t_{0}}\widehat{\ell}_{s,x}+(t-t_{0})\text{Rob}_{t,x}-\sum_{s=1}^{t}\widehat{\ell}_{s,\widehat{x}}\right) \qquad (11)

where

\displaystyle\text{Rob}_{t,x}=\text{Clip}_{[-1,1]}\left(\textbf{Catoni}_{\alpha_{x}}\left(\big\{\widehat{\ell}_{\tau,x}\big\}_{\tau=t_{0}+1}^{t}\right)\right)

with \alpha_{x}=\left(\frac{4\log(t|\mathcal{X}|/\delta)}{t-t_{0}+\sum_{\tau=t_{0}+1}^{t}2\|x\|_{S_{\tau}^{-1}}^{2}}\right)^{\frac{1}{2}} (c.f. Figure LABEL:fig:catoni). Note that we use a simple average estimator for \widehat{x}, but a hybrid of the average estimator from Phase 1 and the robust estimator from Phase 2 for other actions. These gap estimators are useful for monitoring the non-stochasticity of the environment, which is done via the tests in Eq. (9) and Eq. (10). The first condition (Eq. (9)) checks whether the average empirical gap \widehat{\Delta}_{t,x} is still close to the estimated gap \widehat{\Delta}_{x} computed at the end of Phase 1. The second condition (Eq. (10)) checks whether the regret against \widehat{x} incurred in Phase 2 is still tolerable. It can be shown (see Lemma 10) that with high probability Eq. (9) and Eq. (10) do not hold in a stochastic environment. Therefore, when either event is detected, BOTW-SE terminates and returns the value of t_{0} to BOTW, which then runs BOTW-SE again from scratch with L=2t_{0}.
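The two termination tests can be evaluated with a few lines of code. The fragment below is our own illustration (variable names are placeholders), not the paper's implementation.

```python
import numpy as np

def phase2_should_break(gap_hat, gap_hat_t, y_phase2, ellhat_xhat_phase2,
                        f_T, C1, t0, x_hat):
    """Return True if either test Eq. (9) or Eq. (10) fires.

    gap_hat: (K,) gaps hat-Delta_x fixed at the end of Phase 1.
    gap_hat_t: (K,) running averages hat-Delta_{t,x} from Eq. (11).
    y_phase2, ellhat_xhat_phase2: lists of y_s and hat-ell_{s, x_hat} for s > t0.
    """
    # Eq. (9): some gap estimate drifted outside [0.39, 1.81] * hat-Delta_x
    for x in range(len(gap_hat)):
        if x == x_hat:
            continue
        if not (0.39 * gap_hat[x] <= gap_hat_t[x] <= 1.81 * gap_hat[x]):
            return True
    # Eq. (10): regret against x_hat accumulated in Phase 2 is too large
    phase2_regret = np.sum(np.asarray(y_phase2) - np.asarray(ellhat_xhat_phase2))
    return phase2_regret >= 20 * np.sqrt(f_T * C1 * t0)
```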

In the following subsections, we provide a sketch of analysis for BOTW, further revealing the ideas behind our design.

5.2 Analysis for the Adversarial Setting (Theorem 7)

We first show that at any time t in Phase 2, with high probability, \widehat{x} is always the best action so far.

Lemma 8.

With probability at least 1-\delta, for any t in Phase 2, we have \widehat{x}\in\operatorname*{argmin}_{x\in\mathcal{X}}\sum_{s=1}^{t}\ell_{s,x}.

Proof sketch.

The idea is to prove that for any x\neq\widehat{x}, the deviation between the actual gap \sum_{s=1}^{t}(\ell_{s,x}-\ell_{s,\widehat{x}}) and the estimated gap t\widehat{\Delta}_{t,x} is no larger than \mathcal{O}(t\widehat{\Delta}_{x}). This is enough to prove the statement since t\widehat{\Delta}_{t,x} is of order \Omega(t\widehat{\Delta}_{x}) in light of the test in Eq. (9).

Bounding the deviation for Phase 2 is somewhat similar to the analysis of Algorithm LABEL:alg:REOLB, so here we only show how to bound the deviation for Phase 1: \sum_{s=1}^{t_{0}}(\ell_{s,x}-\widehat{\ell}_{s,x}). We start by rearranging Eq. (5) to get: (C_{2}-1)\left|\sum_{s=1}^{t_{0}}(\ell_{s,x}-\widehat{\ell}_{s,x})\right|\leq\sqrt{C_{1}t_{0}}-\sum_{s=1}^{t_{0}}(\ell_{s,x_{s}}-\widehat{\ell}_{s,x})=\sqrt{C_{1}t_{0}}-\sum_{s=1}^{t_{0}}(\ell_{s,x_{s}}-\widehat{\ell}_{s,\widehat{x}})+t_{0}\widehat{\Delta}_{x}. By the termination conditions of Phase 1, we have \sum_{s=1}^{t_{0}}(\ell_{s,x_{s}}-\widehat{\ell}_{s,\widehat{x}})\geq-5\sqrt{f_{T}C_{1}t_{0}} and \widehat{\Delta}_{x}\geq 20\sqrt{f_{T}C_{1}/t_{0}}, which then shows \left|\sum_{s=1}^{t_{0}}(\ell_{s,x}-\widehat{\ell}_{s,x})\right|\leq\frac{6\sqrt{f_{T}C_{1}t_{0}}+t_{0}\widehat{\Delta}_{x}}{C_{2}-1}=\mathcal{O}(t_{0}\widehat{\Delta}_{x}) as desired. (See Appendix C for the full proof.) ∎

We then prove that, importantly, the regret in each epoch is bounded by \widetilde{\mathcal{O}}(\sqrt{t_{0}}) (not by the square root of the epoch length):

Lemma 9.

With probability at least 1-\delta, for any time t in Phase 2, we have for any x\in\mathcal{X},

\displaystyle\sum_{s=1}^{t}\left(\ell_{s,x_{s}}-\ell_{s,x}\right)=\mathcal{O}\left(\sqrt{C_{1}t_{0}f_{T}}\right).
Proof sketch.

By Lemma 8, it suffices to consider x=\widehat{x}. By Eq. (5), we know that the regret for the first t_{0} rounds is directly bounded by \mathcal{O}\left(\sqrt{C_{1}t_{0}}\right). For the regret incurred in Phase 2, we decompose it as the sum of \sum_{s=t_{0}+1}^{t}(y_{s}-\widehat{\ell}_{s,\widehat{x}}), \sum_{s=t_{0}+1}^{t}(\widehat{\ell}_{s,\widehat{x}}-\ell_{s,\widehat{x}}-\epsilon_{s}(\widehat{x})), and \sum_{s=t_{0}+1}^{t}(\epsilon_{s}(\widehat{x})-\epsilon_{s}(x_{s})). The first term is controlled by the test in Eq. (10). The second and third terms are martingale difference sequences with variance bounded by \mathcal{O}(1-\widetilde{p}_{s,\widehat{x}}), which, as we further show, is at most \frac{1}{s\widehat{\Delta}_{\min}^{2}} with \widehat{\Delta}_{\min}=\min_{x\neq\widehat{x}}\widehat{\Delta}_{x}. By combining Eq. (7) and Eq. (8), it is clear that \widehat{\Delta}_{\min}\geq 20\sqrt{f_{T}C_{1}/t_{0}}, and thus the variance is of order t_{0}/s. Applying Freedman's inequality, the last two terms are thus bounded by \widetilde{\mathcal{O}}(\sqrt{t_{0}}) as well, proving the claimed result (see Appendix C for the full proof). ∎
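To make the variance step explicit (our own added calculation, using f_{T}=\log T, the bound \widehat{\Delta}_{\min}\geq 20\sqrt{f_{T}C_{1}/t_{0}}, and \sum_{s=t_{0}+1}^{t}1/s\leq\log T):

\displaystyle\sum_{s=t_{0}+1}^{t}\mathcal{O}\left(1-\widetilde{p}_{s,\widehat{x}}\right)\leq\sum_{s=t_{0}+1}^{t}\mathcal{O}\left(\frac{1}{s\widehat{\Delta}_{\min}^{2}}\right)\leq\sum_{s=t_{0}+1}^{t}\mathcal{O}\left(\frac{t_{0}}{400f_{T}C_{1}s}\right)\leq\mathcal{O}\left(\frac{t_{0}}{C_{1}}\right)\leq\mathcal{O}(t_{0}),

so Freedman's inequality bounds each of the last two sums by \widetilde{\mathcal{O}}(\sqrt{t_{0}}) with high probability.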

Finally, to obtain Theorem 7, it suffices to apply Lemma 9 together with the fact that the number of epochs is \mathcal{O}(\log T).

5.3 Analysis for the Corrupted Setting (Theorem 6)

The key for this analysis is the following lemma.

Lemma 10.

In the corrupted setting, BOTW-SE ensures with probability at least 1-15\delta:

  • t_{0}\leq\max\left\{\frac{900f_{T}C_{1}}{\Delta_{\min}^{2}},\frac{900C^{2}}{f_{T}C_{1}},L\right\}.

  • If C\leq\frac{1}{30}\sqrt{f_{T}C_{1}L}, then 1) \widehat{x}=x^{*}; 2) \widehat{\Delta}_{x}\in[0.7\Delta_{x},1.3\Delta_{x}] for all x; and 3) Phase 2 never ends.

Using this lemma, we show a proof sketch of Theorem 6 for the stochastic case (i.e., C=0). The full proof is deferred to Appendix C.2.

Proof sketch for Theorem 6 with C=0.

By Lemma 10, we know that after roughly \Theta\left(\frac{f_{T}C_{1}}{\Delta_{\min}^{2}}\right) rounds in Phase 1, the algorithm finds \widehat{x}=x^{*}, estimates \widehat{\Delta}_{x} up to a constant factor of \Delta_{x}, and enters Phase 2 without ever going back to Phase 1. By Eq. (5), the regret in Phase 1 can be upper bounded by \mathcal{O}\left(\sqrt{C_{1}\cdot\frac{f_{T}C_{1}}{\Delta_{\min}^{2}}}\right)=\mathcal{O}\left(\frac{C_{1}}{\Delta_{\min}}\sqrt{f_{T}}\right).

To bound the regret in Phase 2, we show that as long as t is larger than a problem-dependent constant T^{*}, there exist \{N_{x}\}_{x\in\mathcal{X}} satisfying \sum_{x\in\mathcal{X}}N_{x}\Delta_{x}\leq 2c(\mathcal{X},\theta) such that \{p^{*}_{t,x}\}_{x\in\mathcal{X}\setminus\{x^{*}\}}=\big\{\frac{\beta_{t}N_{x}}{2t}\big\}_{x\in\mathcal{X}\setminus\{x^{*}\}} is a feasible solution of Eq. (LABEL:eqn:_opt-2-constraint). Therefore, we can bound the regret in this regime as follows:

\displaystyle\sum_{s=T^{*}+1}^{t}\sum_{x\in\mathcal{X}}\widetilde{p}_{s,x}\Delta_{x}
=\sum_{s=T^{*}+1}^{t}\sum_{x\in\mathcal{X}}\frac{1}{2}p_{s,x}\Delta_{x} \qquad (x^{*}=\widehat{x})
\leq\sum_{s=T^{*}+1}^{t}\sum_{x\in\mathcal{X}}\frac{1}{1.4}p_{s,x}\widehat{\Delta}_{x} \qquad (\widehat{\Delta}_{x}\in[0.7\Delta_{x},1.3\Delta_{x}])
\leq\sum_{s=T^{*}+1}^{t}\sum_{x\in\mathcal{X}}\frac{1}{1.4}p^{*}_{s,x}\widehat{\Delta}_{x} \qquad (\text{optimality of }p_{s})
\leq\sum_{s=T^{*}+1}^{t}\sum_{x\in\mathcal{X}}p^{*}_{s,x}\Delta_{x} \qquad (\widehat{\Delta}_{x}\in[0.7\Delta_{x},1.3\Delta_{x}])
\leq\mathcal{O}\left(c(\mathcal{X},\theta)\log^{2}T\right). \qquad (\text{definition of }p_{s}^{*})

Combining the regret bounds in Phase 1 and Phase 2, we prove the results for the stochastic setting. ∎

6 Conclusion

In this work, we make significant progress on improving the robustness and adaptivity of linear bandit algorithms. Our algorithms are the first to achieve near-optimal regret in various different settings, without having any prior knowledge on the environment. Our techniques might also be useful for more general problems such as linear contextual bandits.

In light of the work (Zimmert & Seldin, 2019) for multi-armed bandits, which shows that a simple Follow-the-Regularized-Leader algorithm achieves optimal regret in different settings, one interesting open question is whether there also exists such a simple Follow-the-Regularized-Leader algorithm for linear bandits with the same adaptivity to different settings. In fact, it can be shown that their algorithm has a deep connection with OP in the special case of multi-armed bandits, but we are unable to extend the connection to general linear bandits.

Acknowledgements

We thank Tor Lattimore and Julian Zimmert for helpful discussions. HL thanks Ilias Diakonikolas and Anastasia Voloshinov for initial discussions in this direction. The first four authors are supported by NSF Awards IIS-1755781 and IIS-1943607.

References

  • Abbasi-Yadkori et al. (2011) Abbasi-Yadkori, Y., Pál, D., and Szepesvári, C. Improved algorithms for linear stochastic bandits. Advances in neural information processing systems, 24:2312–2320, 2011.
  • Abe & Long (1999) Abe, N. and Long, P. M. Associative reinforcement learning using linear probabilistic concepts. In ICML, 1999.
  • Abernethy et al. (2008) Abernethy, J., Hazan, E., and Rakhlin, A. Competing in the dark: An efficient algorithm for bandit linear optimization. In 21st Annual Conference on Learning Theory, COLT 2008, pp. 263–273, 2008.
  • Agarwal et al. (2014) Agarwal, A., Hsu, D., Kale, S., Langford, J., Li, L., and Schapire, R. Taming the monster: A fast and simple algorithm for contextual bandits. In International Conference on Machine Learning, pp. 1638–1646. PMLR, 2014.
  • Auer (2002) Auer, P. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3(Nov):397–422, 2002.
  • Auer & Chiang (2016) Auer, P. and Chiang, C.-K. An algorithm with nearly optimal pseudo-regret for both stochastic and adversarial bandits. In Conference on Learning Theory, pp.  116–120, 2016.
  • Awerbuch & Kleinberg (2004) Awerbuch, B. and Kleinberg, R. D. Adaptive routing with end-to-end feedback: Distributed learning and geometric approaches. In Proceedings of the thirty-sixth annual ACM symposium on Theory of computing, pp.  45–53, 2004.
  • Bartlett et al. (2008) Bartlett, P., Dani, V., Hayes, T., Kakade, S., Rakhlin, A., and Tewari, A. High-probability regret bounds for bandit online linear optimization. In Proceedings of the 21st Annual Conference on Learning Theory-COLT 2008, pp.  335–342. Omnipress, 2008.
  • Bogunovic et al. (2020) Bogunovic, I., Losalka, A., Krause, A., and Scarlett, J. Stochastic linear bandits robust to adversarial attacks. arXiv:2007.03285, 2020.
  • Bubeck & Slivkins (2012) Bubeck, S. and Slivkins, A. The best of both worlds: Stochastic and adversarial bandits. In Conference on Learning Theory, 2012.
  • Bubeck et al. (2012) Bubeck, S., Cesa-Bianchi, N., and Kakade, S. M. Towards minimax policies for online linear optimization with bandit feedback. In Conference on Learning Theory, 2012.
  • Catoni (2012) Catoni, O. Challenging the empirical mean and empirical variance: a deviation study. In Annales de l’IHP Probabilités et statistiques, volume 48, pp.  1148–1185, 2012.
  • Chu et al. (2011) Chu, W., Li, L., Reyzin, L., and Schapire, R. Contextual bandits with linear payoff functions. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp.  208–214, 2011.
  • Combes et al. (2017) Combes, R., Magureanu, S., and Proutiere, A. Minimal exploration in structured stochastic bandits. In Advances in Neural Information Processing Systems, 2017.
  • Dani et al. (2008a) Dani, V., Hayes, T. P., and Kakade, S. M. Stochastic linear optimization under bandit feedback. 2008a.
  • Dani et al. (2008b) Dani, V., Kakade, S. M., and Hayes, T. P. The price of bandit information for online optimization. In Advances in Neural Information Processing Systems, 2008b.
  • Dudik et al. (2011) Dudik, M., Hsu, D., Kale, S., Karampatziakis, N., Langford, J., Reyzin, L., and Zhang, T. Efficient optimal learning for contextual bandits. In Proceedings of the 27th Conference on Uncertainty in Artificial Intelligence, UAI 2011, pp.  169, 2011.
  • Gerchinovitz & Lattimore (2016) Gerchinovitz, S. and Lattimore, T. Refined lower bounds for adversarial bandits. In NeurIPS, 2016.
  • Gupta et al. (2019) Gupta, A., Koren, T., and Talwar, K. Better algorithms for stochastic bandits with adversarial corruptions. In Conference on Learning Theory, 2019.
  • Hao et al. (2020) Hao, B., Lattimore, T., and Szepesvari, C. Adaptive exploration in linear contextual bandit. In International Conference on Artificial Intelligence and Statistics. PMLR, 2020.
  • Jin & Luo (2020) Jin, T. and Luo, H. Simultaneously learning stochastic and adversarial episodic mdps with known transition. Advances in Neural Information Processing Systems, 2020.
  • Jun & Zhang (2020) Jun, K.-S. and Zhang, C. Crush optimism with pessimism: Structured bandits beyond asymptotic optimality. Advances in Neural Information Processing Systems, 2020.
  • Komiyama et al. (2015) Komiyama, J., Honda, J., and Nakagawa, H. Regret lower bound and optimal algorithm in finite stochastic partial monitoring. Advances in Neural Information Processing Systems, 2015.
  • Lattimore & Szepesvari (2017) Lattimore, T. and Szepesvari, C. The end of optimism? an asymptotic analysis of finite-armed linear bandits. In Artificial Intelligence and Statistics, pp.  728–737. PMLR, 2017.
  • Lee et al. (2020) Lee, C.-W., Luo, H., Wei, C.-Y., and Zhang, M. Bias no more: high-probability data-dependent regret bounds for adversarial bandits and mdps. Advances in neural information processing systems, 2020.
  • Li et al. (2019) Li, Y., Lou, E. Y., and Shan, L. Stochastic linear optimization with adversarial corruption. arXiv:1909.02109, 2019.
  • Lykouris et al. (2018) Lykouris, T., Mirrokni, V., and Paes Leme, R. Stochastic bandits robust to adversarial corruptions. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, 2018.
  • Lykouris et al. (2019) Lykouris, T., Simchowitz, M., Slivkins, A., and Sun, W. Corruption robust exploration in episodic reinforcement learning. arXiv:1911.08689, 2019.
  • Mond & Pecaric (1996) Mond, B. and Pecaric, J. A mixed arithmetic-mean-harmonic-mean matrix inequality. Linear algebra and its applications, 237:449–454, 1996.
  • Rusmevichientong & Tsitsiklis (2010) Rusmevichientong, P. and Tsitsiklis, J. N. Linearly parameterized bandits. Mathematics of Operations Research, 35(2):395–411, 2010.
  • Seldin & Lugosi (2017) Seldin, Y. and Lugosi, G. An improved parametrization and analysis of the exp3++ algorithm for stochastic and adversarial bandits. In Conference on Learning Theory, pp.  1743–1759. PMLR, 2017.
  • Seldin & Slivkins (2014) Seldin, Y. and Slivkins, A. One practical algorithm for both stochastic and adversarial bandits. In International Conference on Machine Learning, pp. 1287–1295. PMLR, 2014.
  • Shalev-Shwartz & Ben-David (2014) Shalev-Shwartz, S. and Ben-David, S. Understanding machine learning: From theory to algorithms. Cambridge university press, 2014.
  • Tirinzoni et al. (2020) Tirinzoni, A., Pirotta, M., Restelli, M., and Lazaric, A. An asymptotically optimal primal-dual incremental algorithm for contextual linear bandits. In Advances in Neural Information Processing Systems, 2020.
  • Wei & Luo (2018) Wei, C.-Y. and Luo, H. More adaptive algorithms for adversarial bandits. In Conference On Learning Theory, pp.  1263–1291. PMLR, 2018.
  • Wei et al. (2020) Wei, C.-Y., Luo, H., and Agarwal, A. Taking a hint: How to leverage loss predictors in contextual bandits? In Conference on Learning Theory, pp.  3583–3634. PMLR, 2020.
  • Zimmert & Seldin (2019) Zimmert, J. and Seldin, Y. An optimal algorithm for stochastic and adversarial bandits. In The 22nd International Conference on Artificial Intelligence and Statistics. PMLR, 2019.
  • Zimmert & Seldin (2021) Zimmert, J. and Seldin, Y. Tsallis-inf: An optimal algorithm for stochastic and adversarial bandits. J. Mach. Learn. Res., 22:28–1, 2021.
  • Zimmert et al. (2019) Zimmert, J., Luo, H., and Wei, C.-Y. Beating stochastic and adversarial semi-bandits optimally and simultaneously. In International Conference on Machine Learning, pp. 7683–7692. PMLR, 2019.

Appendix A Auxiliary Lemmas for OP

Proof of Lemma 4.

Consider the minimizer p^{*} of the following constrained minimization problem for some \xi>0:

\displaystyle\min_{p\in\Delta_{\mathcal{X}}}\sum_{x\in\mathcal{X}}p_{x}\widehat{\Delta}_{x}+\frac{2}{\xi}\left(-\ln(\det(S(p)))\right), \qquad (12)

where S(p)=\sum_{x\in\mathcal{X}}p_{x}xx^{\top}. We will show that

\displaystyle\sum_{x\in\mathcal{X}}p^{*}_{x}\widehat{\Delta}_{x}\leq\frac{2d}{\xi}, \qquad (13)
\displaystyle\|x\|_{S(p^{*})^{-1}}^{2}\leq\frac{\xi\widehat{\Delta}_{x}}{2}+d,\quad\forall x\in\mathcal{X}. \qquad (14)

To prove this, first note that relaxing the constraints from p𝒫𝒳p\in\mathcal{P}_{\mathcal{X}} to the set of sub-distributions {p:x𝒳px1 and px0,x}\{p:\sum_{x\in\mathcal{X}}p_{x}\leq 1\text{ and }p_{x}\geq 0,\;\forall x\} does not change the solution of this problem. This is because for any sub-distribution, we can always make it a distribution by increasing the weight of some xx with Δ^x=0\widehat{\Delta}_{x}=0 (at least one exists) while not increasing the objective value (since ln(det(S(p)))\ln(\det(S(p))) is non-decreasing in pxp_{x} for each xx). Therefore, applying the KKT conditions, we have

Δ^x2ξxS(p)1xλx+λ=0,\displaystyle\widehat{\Delta}_{x}-\frac{2}{\xi}x^{\top}S(p^{*})^{-1}x-\lambda_{x}+\lambda=0, (15)

where λx,λ0\lambda_{x},\lambda\geq 0 are Lagrange multipliers. Multiplying Eq. (15) by pxp^{*}_{x} and summing over all x𝒳x\in\mathcal{X}, we have

0\displaystyle 0 =x𝒳pxΔ^x2ξx𝒳pxxS(p)1xx𝒳λxpx+λ\displaystyle=\sum_{x\in\mathcal{X}}p^{*}_{x}\widehat{\Delta}_{x}-\frac{2}{\xi}\sum_{x\in\mathcal{X}}p^{*}_{x}x^{\top}S(p^{*})^{-1}x-\sum_{x\in\mathcal{X}}\lambda_{x}p^{*}_{x}+\lambda
=x𝒳pxΔ^x2ξTr(S(p)1S(p))+λ\displaystyle=\sum_{x\in\mathcal{X}}p^{*}_{x}\widehat{\Delta}_{x}-\frac{2}{\xi}\operatorname{Tr}(S(p^{*})^{-1}S(p^{*}))+\lambda (complementary slackness)
=x𝒳pxΔ^x2dξ+λ\displaystyle=\sum_{x\in\mathcal{X}}p_{x}^{*}\widehat{\Delta}_{x}-\frac{2d}{\xi}+\lambda
x𝒳pxΔ^x2dξ.\displaystyle\geq\sum_{x\in\mathcal{X}}p_{x}^{*}\widehat{\Delta}_{x}-\frac{2d}{\xi}. (λ0\lambda\geq 0)

Therefore, we have x𝒳pxΔ^x2dξ\sum_{x\in\mathcal{X}}p_{x}^{*}\widehat{\Delta}_{x}\leq\frac{2d}{\xi} and λ2dξ\lambda\leq\frac{2d}{\xi} as x𝒳pxΔ^x0\sum_{x\in\mathcal{X}}p_{x}^{*}\widehat{\Delta}_{x}\geq 0. This proves Eq. (13). For Eq. (14), using Eq. (15), we have

xS(p)12=ξ2(Δ^xλx+λ)ξ2(Δ^x+λ)ξΔ^x2+d,\displaystyle\|x\|_{S(p^{*})^{-1}}^{2}=\frac{\xi}{2}\left(\widehat{\Delta}_{x}-\lambda_{x}+\lambda\right)\leq\frac{\xi}{2}\left(\widehat{\Delta}_{x}+\lambda\right)\leq\frac{\xi\widehat{\Delta}_{x}}{2}+d,

where the first inequality is due to λx0\lambda_{x}\geq 0, and the second inequality is due to λ2dξ\lambda\leq\frac{2d}{\xi}.

Now we show how to transform pp^{*} into a distribution satisfying the constraint of OP. Choose ξ=tβt\xi=\frac{\sqrt{t}}{\beta_{t}}. Let G={x:Δ^x1t}G=\{x:\widehat{\Delta}_{x}\leq\frac{1}{\sqrt{t}}\}. We construct the distribution q=12p+12qG,κq=\frac{1}{2}p^{*}+\frac{1}{2}q^{G,\kappa}, where qG,κq^{G,\kappa} is defined in Lemma 11 with κ=1t\kappa=\frac{1}{\sqrt{t}}, and prove that qq satisfies the constraint of OP. Indeed, for all xGx\notin G, we have by definition tΔ^x1\sqrt{t}\widehat{\Delta}_{x}\geq 1 and thus

xS(q)12x12S(p)12ξΔ^x+2d=tΔ^xβt+2dtΔ^x2βt+4d;\displaystyle\|x\|_{S(q)^{-1}}^{2}\leq\|x\|_{\frac{1}{2}S(p^{*})^{-1}}^{2}\leq\xi\widehat{\Delta}_{x}+2d=\frac{\sqrt{t}\widehat{\Delta}_{x}}{\beta_{t}}+2d\leq\frac{t\widehat{\Delta}_{x}^{2}}{\beta_{t}}+4d;

for xGx\in G, according to Lemma 11 below, we have xS(q)12x12S(qG,κ)124dtΔ^x2βt+4d\|x\|_{S(q)^{-1}}^{2}\leq\|x\|_{\frac{1}{2}S(q^{G,\kappa})^{-1}}^{2}\leq 4d\leq\frac{t\widehat{\Delta}_{x}^{2}}{\beta_{t}}+4d.

Combining the two cases above, we prove that qq satisfies the constraint of OP. According to the optimality of pp, we thus have

x𝒳pxΔ^x\displaystyle\sum_{x\in\mathcal{X}}p_{x}\widehat{\Delta}_{x} x𝒳qxΔ^x\displaystyle\leq\sum_{x\in\mathcal{X}}q_{x}\widehat{\Delta}_{x}
=12x𝒳pxΔ^x+12x𝒳qxG,κΔ^x\displaystyle=\frac{1}{2}\sum_{x\in\mathcal{X}}p_{x}^{*}\widehat{\Delta}_{x}+\frac{1}{2}\sum_{x\in\mathcal{X}}q^{G,\kappa}_{x}\widehat{\Delta}_{x} (by the definition of qq)
122dξ+12t+12t\displaystyle\leq\frac{1}{2}\frac{2d}{\xi}+\frac{1}{2\sqrt{t}}+\frac{1}{2\sqrt{t}} (by Eq. (13), Δ^x1\widehat{\Delta}_{x}\leq 1, the definition of GG, and the choice of κ\kappa)
dβt+1t=𝒪(dβtt),\displaystyle\leq\frac{d\beta_{t}+1}{\sqrt{t}}=\mathcal{O}\left(\frac{d\beta_{t}}{\sqrt{t}}\right), (by the definition of ξ\xi)

proving the lemma. ∎
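For intuition, the regularized problem (12) is a smooth convex problem over the simplex and can be solved numerically. The following is a minimal sketch (not code from the paper; the exponentiated-gradient solver, the step size, and the toy instance are our own illustrative choices) that approximately minimizes Eq. (12) and then prints the quantities appearing in Eq. (13) and Eq. (14):

```python
import numpy as np

def solve_regularized_op(X, Delta_hat, xi, eta=0.05, iters=3000):
    """Approximately solve Eq. (12):
        min_{p in simplex}  sum_x p_x * Delta_hat_x + (2/xi) * (-log det S(p)),
    with S(p) = sum_x p_x x x^T, via exponentiated gradient (mirror descent).
    X: (K, d) actions assumed to span R^d; Delta_hat: (K,) estimated gaps."""
    K, d = X.shape
    p = np.full(K, 1.0 / K)                              # start from the uniform distribution
    for _ in range(iters):
        S_inv = np.linalg.inv(X.T @ (p[:, None] * X))    # S(p)^{-1}
        quad = np.einsum('kd,de,ke->k', X, S_inv, X)     # ||x||^2_{S(p)^{-1}} for every action
        grad = Delta_hat - (2.0 / xi) * quad             # gradient of the objective in p
        p = p * np.exp(-eta * grad)                      # multiplicative update, then renormalize
        p /= p.sum()
    return p

# toy instance: three actions in R^2, the first one has zero estimated gap
X = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
Delta_hat = np.array([0.0, 0.5, 0.3])
xi, d = 10.0, X.shape[1]
p = solve_regularized_op(X, Delta_hat, xi)
S_inv = np.linalg.inv(X.T @ (p[:, None] * X))
print(p @ Delta_hat, 2 * d / xi)                                       # cf. Eq. (13)
print(np.einsum('kd,de,ke->k', X, S_inv, X), xi * Delta_hat / 2 + d)   # cf. Eq. (14)
```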

The following lemma shows that for any G𝒳G\subset\mathcal{X}, there always exists a distribution p𝒫𝒳p\in\mathcal{P}_{\mathcal{X}} that puts most weights on actions from GG, such that x(x𝒳pxxx)12𝒪(d)\|x\|_{(\sum_{x\in\mathcal{X}}p_{x}xx^{\top})^{-1}}^{2}\leq\mathcal{O}(d) for all xGx\in G.

Lemma 11.

Suppose that 𝒳d\mathcal{X}\subseteq\mathbb{R}^{d} spans d\mathbb{R}^{d} and let p𝒳p_{\mathcal{X}} be the uniform distribution over 𝒳\mathcal{X}. For any G𝒳G\subseteq\mathcal{X} and κ(0,12]\kappa\in(0,\frac{1}{2}], there exists a distribution q𝒫Gq\in\mathcal{P}_{G} such that xS(qG,κ)122d\|x\|_{S(q^{G,\kappa})^{-1}}^{2}\leq 2d for all xGx\in G, where qG,κκp𝒳+(1κ)qq^{G,\kappa}\triangleq\kappa\cdot p_{\mathcal{X}}+(1-\kappa)\cdot q and S(p)=x𝒳pxxxS(p)=\sum_{x\in\mathcal{X}}p_{x}xx^{\top}.

Proof.

Let 𝒫Gκ={p𝒫𝒳|p=κp𝒳+(1κ)q,q𝒫G}\mathcal{P}_{G}^{\kappa}=\{p\in\mathcal{P}_{\mathcal{X}}\;|\;p=\kappa\cdot p_{\mathcal{X}}+(1-\kappa)\cdot q,q\in\mathcal{P}_{G}\}. As 𝒳\mathcal{X} spans the whole d\mathbb{R}^{d} space, xS(p)12\|x\|_{S(p)^{-1}}^{2} is well-defined for all p𝒫Gκp\in\mathcal{P}_{G}^{\kappa}. Then we have

minp𝒫GκmaxxGxS1(p)2\displaystyle\min_{p\in\mathcal{P}_{G}^{\kappa}}\max_{x\in G}\|x\|_{S^{-1}(p)}^{2}
=minp𝒫Gκmaxq𝒫Gx𝒳qxxx,(x𝒳pxxx)1\displaystyle=\min_{p\in\mathcal{P}_{G}^{\kappa}}\max_{q\in\mathcal{P}_{G}}\left\langle\sum_{x\in\mathcal{X}}q_{x}xx^{\top},\left(\sum_{x\in\mathcal{X}}p_{x}xx^{\top}\right)^{-1}\right\rangle
=maxq𝒫Gminp𝒫Gκx𝒳qxxx,(x𝒳pxxx)1\displaystyle=\max_{q\in\mathcal{P}_{G}}\min_{p\in\mathcal{P}_{G}^{\kappa}}\left\langle\sum_{x\in\mathcal{X}}q_{x}xx^{\top},\left(\sum_{x\in\mathcal{X}}p_{x}xx^{\top}\right)^{-1}\right\rangle (16)
maxq𝒫Gx𝒳qxxx,(x𝒳(κ|𝒳|+(1κ)qx)xx)1\displaystyle\leq\max_{q\in\mathcal{P}_{G}}\left\langle\sum_{x\in\mathcal{X}}q_{x}xx^{\top},\left(\sum_{x\in\mathcal{X}}\left(\frac{\kappa}{|\mathcal{X}|}+(1-\kappa)q_{x}\right)xx^{\top}\right)^{-1}\right\rangle
2maxq𝒫G(1κ)x𝒳qxxx,(x𝒳(κ|𝒳|+(1κ)qx)xx)1\displaystyle\leq 2\max_{q\in\mathcal{P}_{G}}\left\langle(1-\kappa)\sum_{x\in\mathcal{X}}q_{x}xx^{\top},\left(\sum_{x\in\mathcal{X}}\left(\frac{\kappa}{|\mathcal{X}|}+(1-\kappa)q_{x}\right)xx^{\top}\right)^{-1}\right\rangle
2maxq𝒫Gκ|𝒳|x𝒳xx+(1κ)x𝒳qxxx,(x𝒳(κ|𝒳|+(1κ)qx)xx)1\displaystyle\leq 2\max_{q\in\mathcal{P}_{G}}\left\langle\frac{\kappa}{|\mathcal{X}|}\sum_{x\in\mathcal{X}}xx^{\top}+(1-\kappa)\sum_{x\in\mathcal{X}}q_{x}xx^{\top},\left(\sum_{x\in\mathcal{X}}\left(\frac{\kappa}{|\mathcal{X}|}+(1-\kappa)q_{x}\right)xx^{\top}\right)^{-1}\right\rangle
=2d,\displaystyle=2d,

where the second equality is by Sion’s minimax theorem, as Eq. (16) is linear in qq and convex in pp. ∎
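Lemma 11 is an existence statement proved via a minimax argument, but such a design can also be computed explicitly, for instance with a Frank–Wolfe (Wynn-type) routine for D-optimal design. The sketch below is purely illustrative (the routine, the small ridge term, the step-size rule, and the toy actions are our own choices and are not part of the paper): it builds a design q supported on G, forms the mixture qG,κq^{G,\kappa} from the lemma, and prints xS(qG,κ)12\|x\|_{S(q^{G,\kappa})^{-1}}^{2} for xGx\in G, which should be at most 2d (up to the accuracy of the iterative solver).

```python
import numpy as np

def design_for_subset(X, G_idx, kappa, iters=500):
    """Compute a design q supported on G (indices G_idx) via Frank-Wolfe / Wynn
    updates for D-optimal design, then return the mixture
        q^{G,kappa} = kappa * uniform(X) + (1 - kappa) * q."""
    K, d = X.shape
    G = X[G_idx]
    q = np.full(len(G_idx), 1.0 / len(G_idx))
    for _ in range(iters):
        S = G.T @ (q[:, None] * G) + 1e-9 * np.eye(d)         # small ridge for numerical safety
        lev = np.einsum('kd,de,ke->k', G, np.linalg.inv(S), G)
        k = int(np.argmax(lev))                               # the least covered action in G
        gamma = (lev[k] - d) / (d * (lev[k] - 1.0) + 1e-12)   # exact line-search step size
        gamma = float(np.clip(gamma, 0.0, 1.0))
        q = (1.0 - gamma) * q
        q[k] += gamma
    q_full = np.full(K, kappa / K)
    q_full[G_idx] += (1.0 - kappa) * q
    return q_full

X = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.5, 0.5]])
qGk = design_for_subset(X, G_idx=[0, 2], kappa=0.1)
S_inv = np.linalg.inv(X.T @ (qGk[:, None] * X))
print(np.einsum('kd,de,ke->k', X[[0, 2]], S_inv, X[[0, 2]]))  # should be at most 2d = 4
```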

Lemma 12.

Given {Δ^x}x𝒳\{\widehat{\Delta}_{x}\}_{x\in\mathcal{X}}, suppose there exists a unique x^\widehat{x} such that Δ^x^=0\widehat{\Delta}_{\widehat{x}}=0, and Δ^min=minxx^Δ^x>0\widehat{\Delta}_{\min}=\min_{x\neq\widehat{x}}\widehat{\Delta}_{x}>0. Then xpxΔ^x24dβtΔ^mint\sum_{x}p_{x}\widehat{\Delta}_{x}\leq\frac{24d\beta_{t}}{\widehat{\Delta}_{\min}t} when t16dβtΔ^min2t\geq\frac{16d\beta_{t}}{\widehat{\Delta}_{\min}^{2}}, where pp is the solution to OP(t,Δ^)\textbf{OP}(t,\widehat{\Delta}).

Proof.

We divide actions into groups G0,G1,G2,G_{0},G_{1},G_{2},\ldots based on the following rule:

G0={x^},\displaystyle G_{0}=\{\widehat{x}\},
Gi={x:2i1Δ^min2Δ^x2<2iΔ^min2}.\displaystyle G_{i}=\left\{x:2^{i-1}\widehat{\Delta}_{\min}^{2}\leq\widehat{\Delta}_{x}^{2}<2^{i}\widehat{\Delta}_{\min}^{2}\right\}.

Let nn be the largest index such that GnG_{n} is not empty and zi=dβt2i2Δ^min2tz_{i}=\frac{d\beta_{t}}{2^{i-2}\widehat{\Delta}_{\min}^{2}t} for i1i\geq 1. For each group ii, by Lemma 11, we find a distribution qGi,κq^{G_{i},\kappa} with κ=1n2n|𝒳|\kappa=\frac{1}{n\cdot 2^{n}|\mathcal{X}|}, such that x(y𝒳qyGi,κyy)122d\|x\|_{\left(\sum_{y\in\mathcal{X}}q^{G_{i},\kappa}_{y}yy^{\top}\right)^{-1}}^{2}\leq 2d for all xGix\in G_{i}. Then we define a distribution p~\widetilde{p} over actions as the following:

p~x={j1zjqxGj,κif xx^1xx^p~xif x=x^.\displaystyle\widetilde{p}_{x}=\left\{\begin{aligned} &\sum_{j\geq 1}z_{j}q^{G_{j},\kappa}_{x}&&\text{if $x\neq\widehat{x}$}\\ &1-\sum_{x^{\prime}\neq\widehat{x}}\widetilde{p}_{x^{\prime}}&&\text{if $x=\widehat{x}$.}\end{aligned}\right.

p~\widetilde{p} is a valid distribution as

p~x^\displaystyle\widetilde{p}_{\widehat{x}} =1i1xGij1zjqxGj,κ\displaystyle=1-\sum_{i\geq 1}\sum_{x\in G_{i}}\sum_{j\geq 1}z_{j}q^{G_{j},\kappa}_{x}
=1i1xGiziqxGi,κi1xGiji,j1zjqxGj,κ\displaystyle=1-\sum_{i\geq 1}\sum_{x\in G_{i}}z_{i}q^{G_{i},\kappa}_{x}-\sum_{i\geq 1}\sum_{x\in G_{i}}\sum_{j\neq i,j\geq 1}z_{j}q^{G_{j},\kappa}_{x}
1i1zii1xGiji,j1zjn2n|𝒳|\displaystyle\geq 1-\sum_{i\geq 1}z_{i}-\sum_{i\geq 1}\sum_{x\in G_{i}}\sum_{j\neq i,j\geq 1}\frac{z_{j}}{n\cdot 2^{n}\cdot|\mathcal{X}|} (xGiqxGi,κ1\sum_{x\in G_{i}}q^{G_{i},\kappa}_{x}\leq 1 and qxGj,κ1n2n|𝒳|q^{G_{j},\kappa}_{x}\leq\frac{1}{n\cdot 2^{n}|\mathcal{X}|} for xGjx\notin G_{j})
1i1zii1zi\displaystyle\geq 1-\sum_{i\geq 1}z_{i}-\sum_{i\geq 1}z_{i} (by definition zj2nzi\frac{z_{j}}{2^{n}}\leq z_{i} for all i,ji,j)
12.\displaystyle\geq\frac{1}{2}. (condition t16dβtΔ^min2t\geq\frac{16d\beta_{t}}{\widehat{\Delta}_{\min}^{2}} and thus i12zii=122i216=12\sum_{i\geq 1}2z_{i}\leq\sum^{\infty}_{i=1}\frac{2}{2^{i-2}\cdot 16}=\frac{1}{2})

Now we show that p~\widetilde{p} also satisfies the constraint of OP(t,Δ^x)\textbf{OP}(t,\widehat{\Delta}_{x}). Indeed, for any xx^x\neq\widehat{x} and ii such that xGix\in G_{i}, we use the facts p~yziqyGi,κ\widetilde{p}_{y}\geq z_{i}q^{G_{i},\kappa}_{y} for yx^y\neq\widehat{x} by definition and p~x^12zin2n|𝒳|ziqx^Gi,κ\widetilde{p}_{\widehat{x}}\geq\frac{1}{2}\geq\frac{z_{i}}{n\cdot 2^{n}|\mathcal{X}|}\geq z_{i}q^{G_{i},\kappa}_{\widehat{x}} as well to arrive at:

xS(p~)12x(y𝒳ziqyGi,κyy)12=1zix(y𝒳qyGi,κyy)122dzi=2i1Δ^min2tβttΔ^x2βt;\displaystyle\|x\|_{S(\widetilde{p})^{-1}}^{2}\leq\|x\|_{\left(\sum_{y\in\mathcal{X}}z_{i}q^{G_{i},\kappa}_{y}yy^{\top}\right)^{-1}}^{2}=\frac{1}{z_{i}}\|x\|_{\left(\sum_{y\in\mathcal{X}}q^{G_{i},\kappa}_{y}yy^{\top}\right)^{-1}}^{2}\leq\frac{2d}{z_{i}}=\frac{2^{i-1}\widehat{\Delta}_{\min}^{2}t}{\beta_{t}}\leq\frac{t\widehat{\Delta}_{x}^{2}}{\beta_{t}};

for x=x^x=\widehat{x}, we have p~x^12\widetilde{p}_{\widehat{x}}\geq\frac{1}{2} as shown above and thus,

x^S(p~)12=S(p~)1x^S(p~)2S(p~)1x^12x^x^2=12x^S(p~)14x^S1(p~)22.\displaystyle\|\widehat{x}\|_{S(\widetilde{p})^{-1}}^{2}=\|S(\widetilde{p})^{-1}\widehat{x}\|_{S(\widetilde{p})}^{2}\geq\|S(\widetilde{p})^{-1}\widehat{x}\|_{\tfrac{1}{2}\widehat{x}{\widehat{x}}^{\top}}^{2}=\frac{1}{2}\|\widehat{x}\|_{S(\widetilde{p})^{-1}}^{4}~{}\Longrightarrow~{}\|\widehat{x}\|_{S^{-1}(\widetilde{p})}^{2}\leq 2.

Thus, p~\widetilde{p} satisfies the constraint of OP. Therefore,

x𝒳pxΔ^x\displaystyle\sum_{x\in\mathcal{X}}p_{x}\widehat{\Delta}_{x} x𝒳p~xΔ^x\displaystyle\leq\sum_{x\in\mathcal{X}}\widetilde{p}_{x}\widehat{\Delta}_{x} (by the feasibility of p~\widetilde{p} and the optimality of pp)
=xx^p~xΔ^x\displaystyle=\sum_{x\neq\widehat{x}}\widetilde{p}_{x}\widehat{\Delta}_{x}
i1xGij1dβt2j2Δ^min2tqxGj,κ2iΔ^min\displaystyle\leq\sum_{i\geq 1}\sum_{x\in G_{i}}\sum_{j\geq 1}\frac{d\beta_{t}}{2^{j-2}\widehat{\Delta}_{\min}^{2}t}q^{G_{j},\kappa}_{x}\sqrt{2^{i}}\widehat{\Delta}_{\min} (by the definition of p~x\widetilde{p}_{x} and GiG_{i})
=i1xGiji,j1dβtqxGj,κ2ji/22Δ^mint+i1xGidβtqxGi,κ2i/22Δ^mint\displaystyle=\sum_{i\geq 1}\sum_{x\in G_{i}}\sum_{j\neq i,j\geq 1}\frac{d\beta_{t}q^{G_{j},\kappa}_{x}}{2^{j-\nicefrac{{i}}{{2}}-2}\widehat{\Delta}_{\min}t}+\sum_{i\geq 1}\sum_{x\in G_{i}}\frac{d\beta_{t}q^{G_{i},\kappa}_{x}}{2^{\nicefrac{{i}}{{2}}-2}\widehat{\Delta}_{\min}t}
i1xGiji,j1dβt|𝒳|n2n+ji/22Δ^mint+i1xGidβtqxGi,κ2i/22Δ^mint\displaystyle\leq\sum_{i\geq 1}\sum_{x\in G_{i}}\sum_{j\neq i,j\geq 1}\frac{d\beta_{t}}{|\mathcal{X}|\cdot n\cdot 2^{n+j-\nicefrac{{i}}{{2}}-2}\widehat{\Delta}_{\min}t}+\sum_{i\geq 1}\sum_{x\in G_{i}}\frac{d\beta_{t}q^{G_{i},\kappa}_{x}}{2^{\nicefrac{{i}}{{2}}-2}\widehat{\Delta}_{\min}t} (qxGj,κ1n2n|𝒳|q^{G_{j},\kappa}_{x}\leq\frac{1}{n\cdot 2^{n}\cdot|\mathcal{X}|} for xGjx\notin G_{j})
i12dβt2i/22Δ^mint24dβtΔ^mint,\displaystyle\leq\sum_{i\geq 1}\frac{2d\beta_{t}}{2^{\nicefrac{{i}}{{2}}-2}\widehat{\Delta}_{\min}t}\leq\frac{24d\beta_{t}}{\widehat{\Delta}_{\min}t}, (n+ji2i2n+j-\frac{i}{2}\geq\frac{i}{2})

proving the lemma. ∎
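To make the construction above concrete, the sketch below (toy values only; the bucketing rule is exactly the one defining GiG_{i}, everything else is our own illustrative choice) computes the dyadic groups and the per-group exploration budgets ziz_{i}; within each group the proof then spreads ziz_{i} according to the design qGi,κq^{G_{i},\kappa} of Lemma 11, and the remaining mass goes to the empirically best action x^\widehat{x}.

```python
import numpy as np

Delta_hat = np.array([0.0, 0.05, 0.07, 0.21, 0.55])  # gap of the empirical best action is 0
Delta_min = Delta_hat[Delta_hat > 0].min()
d, beta_t, t = 2, 50.0, 10**6                        # toy values for d, beta_t and the round t
assert t >= 16 * d * beta_t / Delta_min**2           # the condition required by Lemma 12

groups = {}                                          # group i holds actions x with
for x, gap in enumerate(Delta_hat):                  # 2^{i-1} <= (gap / Delta_min)^2 < 2^i
    if gap == 0.0:
        continue
    i = int(np.floor(np.log2((gap / Delta_min) ** 2))) + 1
    groups.setdefault(i, []).append(x)

z = {i: d * beta_t / (2 ** (i - 2) * Delta_min ** 2 * t) for i in groups}
print(groups)                          # group index -> actions in that group
print(z, 1.0 - 2.0 * sum(z.values()))  # per-group budgets; at least this much mass is left for x_hat
```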

Lemma 13.

Let Δ^x[1rΔx,rΔx]\widehat{\Delta}_{x}\in\left[\frac{1}{\sqrt{r}}\Delta_{x},\sqrt{r}\Delta_{x}\right] for all x𝒳x\in\mathcal{X} for some r>1r>1, and p=OP(t,Δ^)p=\textbf{OP}(t,\widehat{\Delta}) for some t16rdβtΔmin2t\geq\frac{16rd\beta_{t}}{\Delta_{\min}^{2}}. Then x𝒳pxΔx24rdβtΔmint\sum_{x\in\mathcal{X}}p_{x}\Delta_{x}\leq\frac{24rd\beta_{t}}{\Delta_{\min}t}.

Proof.

By the condition on Δ^x\widehat{\Delta}_{x}, we have t16rdβtΔmin216dβtΔ^min2t\geq\frac{16rd\beta_{t}}{\Delta_{\min}^{2}}\geq\frac{16d\beta_{t}}{\widehat{\Delta}_{\min}^{2}}. Also, the condition implies that Δ^x=Δx=0\widehat{\Delta}_{x^{*}}=\Delta_{x^{*}}=0 and Δ^x>0\widehat{\Delta}_{x}>0 for all xxx\neq x^{*}. Therefore,

x𝒳pxΔxrx𝒳pxΔ^xr24dβtΔ^mint24rdβtΔmint,\displaystyle\sum_{x\in\mathcal{X}}p_{x}\Delta_{x}\leq\sqrt{r}\sum_{x\in\mathcal{X}}p_{x}\widehat{\Delta}_{x}\leq\sqrt{r}\frac{24d\beta_{t}}{\widehat{\Delta}_{\min}t}\leq\frac{24rd\beta_{t}}{\Delta_{\min}t},

where the second inequality is due to Lemma 12 and the other inequalities follow from Δ^x[1rΔx,rΔx]\widehat{\Delta}_{x}\in\left[\frac{1}{\sqrt{r}}\Delta_{x},\sqrt{r}\Delta_{x}\right] for all xx. ∎

In the following lemma, we define a problem-dependent quantity MM.

Lemma 14.

Consider the optimization problem:

min{Nx}x𝒳,Nx0x𝒳NxΔx\displaystyle\min_{\{N_{x}\}_{x\in\mathcal{X}},N_{x}\geq 0}\sum_{x\in\mathcal{X}}N_{x}\Delta_{x}
s.t.\displaystyle s.t.\qquad xH(N)12Δx22,x𝒳,\displaystyle\|x\|_{H(N)^{-1}}^{2}\leq\frac{\Delta_{x}^{2}}{2},\forall x\in\mathcal{X}^{-},

where H(N)=x𝒳NxxxH(N)=\sum_{x\in\mathcal{X}}N_{x}xx^{\top} and 𝒳=𝒳\{x}\mathcal{X}^{-}=\mathcal{X}\backslash\{x^{*}\}. Define its optimal objective value as c(𝒳,θ)c(\mathcal{X},\theta) (same as in Section 3). Then, there exist {Nx}x𝒳\{N^{*}_{x}\}_{x\in\mathcal{X}} satisfying the constraint of this optimization problem with x𝒳NxΔx2c(𝒳,θ)\sum_{x\in\mathcal{X}}N^{*}_{x}\Delta_{x}\leq 2c(\mathcal{X},\theta) and NxN^{*}_{x^{*}} being finite. (Define M=x𝒳NxM=\sum_{x\in\mathcal{X}}N^{*}_{x}.)

Proof.

If there exists an assignment of {Nx}x𝒳\{N_{x}\}_{x\in\mathcal{X}} for the optimal objective value which has finite NxN_{x^{*}}, then the lemma trivially holds. Otherwise, consider the optimal solution {N~x}x𝒳\{\widetilde{N}_{x}\}_{x\in\mathcal{X}} with N~x=\widetilde{N}_{x^{*}}=\infty. According to the constraints, the following holds for all x𝒳x\in\mathcal{X}^{-}:

limNx(Nxx+y𝒳N~yyy)12Δx22.\displaystyle\lim_{N\rightarrow\infty}\|x\|_{\left(Nx^{*}x^{*^{\top}}+\sum_{y\in\mathcal{X}^{-}}\widetilde{N}_{y}yy^{\top}\right)^{-1}}^{2}\leq\frac{\Delta_{x}^{2}}{2}.

As |𝒳||\mathcal{X}| is finite, by definition, we know for any ϵ\epsilon, there exists a positive value MϵM_{\epsilon} such that for all NMϵN\geq M_{\epsilon}, x(Nxx+y𝒳N~yyy)12Δx22+ϵ\|x\|^{2}_{\left(Nx^{*}x^{*^{\top}}+\sum_{y\in\mathcal{X}^{-}}\widetilde{N}_{y}yy^{\top}\right)^{-1}}\leq\frac{\Delta_{x}^{2}}{2}+\epsilon. Choosing ϵ=Δmin22\epsilon=\frac{\Delta_{\min}^{2}}{2}, we have when NMϵN\geq M_{\epsilon}, for all x𝒳x\in\mathcal{X}^{-},

x(Nxx+x𝒳N~xxx)12<Δx22+Δmin22Δx2.\displaystyle\|x\|_{\left(Nx^{*}x^{*^{\top}}+\sum_{x\in\mathcal{X}^{-}}\widetilde{N}_{x}xx^{\top}\right)^{-1}}^{2}<\frac{\Delta_{x}^{2}}{2}+\frac{\Delta_{\min}^{2}}{2}\leq\Delta_{x}^{2}.

Therefore, consider the solution {Nx}x𝒳\{N^{*}_{x}\}_{x\in\mathcal{X}} where Nx=2N~xN^{*}_{x}=2\widetilde{N}_{x} if x𝒳x\in\mathcal{X}^{-} and Nx=2MϵN^{*}_{x^{*}}=2M_{\epsilon}. We have xH(N)12Δx22\|x\|_{H(N^{*})^{-1}}^{2}\leq\frac{\Delta_{x}^{2}}{2} for all x𝒳x\in\mathcal{X}^{-}. Moreover, since Δx=0\Delta_{x^{*}}=0, the objective value is bounded by x𝒳NxΔx=2x𝒳N~xΔx=2c(𝒳,θ)\sum_{x\in\mathcal{X}}N^{*}_{x}\Delta_{x}=2\sum_{x\in\mathcal{X}^{-}}\widetilde{N}_{x}\Delta_{x}=2c(\mathcal{X},\theta). ∎

Lemma 15.

Suppose Δ^x[1rΔx,rΔx]\widehat{\Delta}_{x}\in\left[\frac{1}{\sqrt{r}}\Delta_{x},\sqrt{r}\Delta_{x}\right] for all x𝒳x\in\mathcal{X} for some r>1r>1, and p=OP(t,Δ^)p=\textbf{OP}(t,\widehat{\Delta}) for some trβtMt\geq r\beta_{t}M, where MM is defined in Lemma 14. Then x𝒳pxΔxr2βttc(𝒳,θ)\sum_{x\in\mathcal{X}}p_{x}\Delta_{x}\leq\frac{r^{2}\beta_{t}}{t}c(\mathcal{X},\theta).

Proof.

Recall NN^{*} defined in Lemma 14. Define p~\widetilde{p}, a distribution over 𝒳\mathcal{X}, as the following:

p~x={rβtNx2t,xx1xxp~xx=x.\displaystyle\widetilde{p}_{x}=\begin{cases}\frac{r\beta_{t}N_{x}^{*}}{2t},&x\neq x^{*}\\ 1-\sum_{x^{\prime}\neq x^{*}}\widetilde{p}_{x^{\prime}}&x=x^{*}.\end{cases}

It is clear that p~\widetilde{p} is a valid distribution since trβtMt\geq r\beta_{t}M. Also, note that by the definition of MM and the condition of tt,

p~x=1xxrβtNx2t1rβtM2trβtM2trβtNx2t.\displaystyle\widetilde{p}_{x^{*}}=1-\sum_{x^{\prime}\neq x^{*}}\frac{r\beta_{t}N_{x^{\prime}}^{*}}{2t}\geq 1-\frac{r\beta_{t}M}{2t}\geq\frac{r\beta_{t}M}{2t}\geq\frac{r\beta_{t}N_{x^{*}}^{*}}{2t}.

Below we show that p~\widetilde{p} satisfies the constraint of OP(t,Δ^x)\textbf{OP}(t,\widehat{\Delta}_{x}). Indeed, for any xxx\neq x^{*},

xS(p~)12\displaystyle\|x\|_{S(\widetilde{p})^{-1}}^{2} x(y𝒳rβtNy2tyy)12\displaystyle\leq\|x\|_{\left(\sum_{y\in\mathcal{X}}\frac{r\beta_{t}N^{*}_{y}}{2t}yy^{\top}\right)^{-1}}^{2} (p~xrβtNx2t\widetilde{p}_{x^{*}}\geq\frac{r\beta_{t}N_{x^{*}}^{*}}{2t})
=2trβtx(y𝒳Nyyy)12\displaystyle=\frac{2t}{r\beta_{t}}\|x\|_{\left(\sum_{y\in\mathcal{X}}N^{*}_{y}yy^{\top}\right)^{-1}}^{2}
tΔx2rβt\displaystyle\leq\frac{t\Delta_{x}^{2}}{r\beta_{t}} (by the constraint in the definition of {Nx}x𝒳\{N_{x}^{*}\}_{x\in\mathcal{X}})
tΔ^x2βt;\displaystyle\leq\frac{t\widehat{\Delta}_{x}^{2}}{\beta_{t}}; (Δ^x[1rΔx,rΔx]\widehat{\Delta}_{x}\in\left[\frac{1}{\sqrt{r}}\Delta_{x},\sqrt{r}\Delta_{x}\right])

for x=xx=x^{*}, we have p~x1rβtM2t12\widetilde{p}_{x^{*}}\geq 1-\frac{r\beta_{t}M}{2t}\geq\frac{1}{2} by the condition of tt:

xS(p~)12=S(p~)1xS(p~)2S(p~)1x12xx2=12xS(p~)14xS1(p~)22.\displaystyle\|x^{*}\|_{S(\widetilde{p})^{-1}}^{2}=\|S(\widetilde{p})^{-1}x^{*}\|_{S(\widetilde{p})}^{2}\geq\|S(\widetilde{p})^{-1}x^{*}\|_{\tfrac{1}{2}x^{*}{x^{*}}^{\top}}^{2}=\frac{1}{2}\|x^{*}\|_{S(\widetilde{p})^{-1}}^{4}~{}\Longrightarrow~{}\|x^{*}\|_{S^{-1}(\widetilde{p})}^{2}\leq 2.

Notice that Δ^x[1rΔx,rΔx]\widehat{\Delta}_{x^{*}}\in\left[\frac{1}{\sqrt{r}}\Delta_{x^{*}},\sqrt{r}\Delta_{x^{*}}\right] implies Δx=Δ^x=0\Delta_{x^{*}}=\widehat{\Delta}_{x^{*}}=0. Thus,

x𝒳pxΔx\displaystyle\sum_{x\in\mathcal{X}}p_{x}\Delta_{x} rx𝒳pxΔ^x\displaystyle\leq\sqrt{r}\sum_{x\in\mathcal{X}}p_{x}\widehat{\Delta}_{x} (Δ^x[1rΔx,rΔx]\widehat{\Delta}_{x}\in\left[\frac{1}{\sqrt{r}}\Delta_{x},\sqrt{r}\Delta_{x}\right])
rx𝒳p~xΔ^x\displaystyle\leq\sqrt{r}\sum_{x\in\mathcal{X}}\widetilde{p}_{x}\widehat{\Delta}_{x} (by the feasibility of p~\widetilde{p} and the optimality of pp)
rrβtx𝒳Nx2tΔ^x\displaystyle\leq r\sqrt{r}\beta_{t}\sum_{x\in\mathcal{X}}\frac{N_{x}^{*}}{2t}\widehat{\Delta}_{x} (by the definition of p~\widetilde{p} and Δ^x=0\widehat{\Delta}_{x^{*}}=0)
r2βtx𝒳Nx2tΔx\displaystyle\leq r^{2}\beta_{t}\sum_{x\in\mathcal{X}}\frac{N_{x}^{*}}{2t}\Delta_{x} (Δ^x[1rΔx,rΔx]\widehat{\Delta}_{x}\in\left[\frac{1}{\sqrt{r}}\Delta_{x},\sqrt{r}\Delta_{x}\right])
r2βttc(𝒳,θ),\displaystyle\leq\frac{r^{2}\beta_{t}}{t}c(\mathcal{X},\theta), (xNxΔx2c(𝒳,θ)\sum_{x}N_{x}^{*}\Delta_{x}\leq 2c(\mathcal{X},\theta) proven in Lemma 14)

finishing the proof. ∎

Lemma 16.

We have c(𝒳,θ)48dΔminc(\mathcal{X},\theta)\leq\frac{48d}{\Delta_{\min}}.

Proof.

The idea is similar to that of Lemma 12. Define G0={x}G_{0}=\{x^{*}\}, Gi={x:Δx2[2i1,2i)Δmin2}G_{i}=\left\{x:\Delta_{x}^{2}\in[2^{i-1},2^{i})\Delta_{\min}^{2}\right\} and nn be the largest index such that GnG_{n} is not empty. For each i1i\geq 1, let qGi,κ𝒫𝒳q^{G_{i},\kappa}\in\mathcal{P}_{\mathcal{X}} with κ=1|𝒳|n2n\kappa=\frac{1}{|\mathcal{X}|\cdot n\cdot 2^{n}} be the distribution such that xS(qGi,κ)122d\|x\|_{S(q^{G_{i},\kappa})^{-1}}^{2}\leq 2d for all xGix\in G_{i} (see Lemma 11). Let Nx=N_{x^{*}}=\infty and for xGix\in G_{i}, we let Nx=j14dqxGj,κ2j1Δmin2N_{x}=\sum_{j\geq 1}\frac{4dq^{G_{j},\kappa}_{x}}{2^{j-1}\Delta_{\min}^{2}}. Next we show that {Nx}x𝒳\{N_{x}\}_{x\in\mathcal{X}} satisfies the constraint Eq. (2).

In fact, fix xGi𝒳x\in G_{i}\subseteq\mathcal{X}^{-}, by definition of {Nx}x𝒳\{N_{x}\}_{x\in\mathcal{X}}, we have

x(x𝒳Nxxx)12x(x𝒳4dqxGi,κxx2i1Δmin2)12=2i1Δmin24dxS(qGi,κ)122i2Δmin2Δx22,\displaystyle\|x\|_{\left(\sum_{x\in\mathcal{X}}N_{x}xx^{\top}\right)^{-1}}^{2}\leq\|x\|_{\left(\sum_{x\in\mathcal{X}}\frac{4dq^{G_{i},\kappa}_{x}xx^{\top}}{2^{i-1}\Delta_{\min}^{2}}\right)^{-1}}^{2}=\frac{2^{i-1}\Delta_{\min}^{2}}{4d}\|x\|_{S(q^{G_{i},\kappa})^{-1}}^{2}\leq 2^{i-2}\Delta_{\min}^{2}\leq\frac{\Delta_{x}^{2}}{2},

where the first inequality is because x𝒳Nxxx4d2i1Δmin2S(qGi,κ)\sum_{x\in\mathcal{X}}N_{x}xx^{\top}\succeq\frac{4d}{2^{i-1}\Delta_{\min}^{2}}S(q^{G_{i},\kappa}) and S(qGi,κ)S(q^{G_{i},\kappa}) is invertible. Therefore, the objective value of Eq. (1) is bounded as follows:

x𝒳NxΔx\displaystyle\sum_{x\in\mathcal{X}}N_{x}\Delta_{x} =i1xGij14dqxGj,κ2j1Δmin2Δx\displaystyle=\sum_{i\geq 1}\sum_{x\in G_{i}}\sum_{j\geq 1}\frac{4dq^{G_{j},\kappa}_{x}}{2^{j-1}\Delta_{\min}^{2}}\Delta_{x} (by the definition of NxN_{x})
i1xGij14dqxGj,κ2ji/21Δmin\displaystyle\leq\sum_{i\geq 1}\sum_{x\in G_{i}}\sum_{j\geq 1}\frac{4dq^{G_{j},\kappa}_{x}}{2^{j-\nicefrac{{i}}{{2}}-1}\Delta_{\min}} (by the definition of GiG_{i})
=i1xGiji,j14dqxGj,κ2ji/21Δmin+i1xGi4dqxGi,κ2i/21Δmin\displaystyle=\sum_{i\geq 1}\sum_{x\in G_{i}}\sum_{j\neq i,j\geq 1}\frac{4dq^{G_{j},\kappa}_{x}}{2^{j-\nicefrac{{i}}{{2}}-1}\Delta_{\min}}+\sum_{i\geq 1}\sum_{x\in G_{i}}\frac{4dq^{G_{i},\kappa}_{x}}{2^{\nicefrac{{i}}{{2}}-1}\Delta_{\min}}
i1xGiji,j14d|𝒳|n2n+ji/21Δmin+i14d2i/21Δmin\displaystyle\leq\sum_{i\geq 1}\sum_{x\in G_{i}}\sum_{j\neq i,j\geq 1}\frac{4d}{|\mathcal{X}|\cdot n\cdot 2^{n+j-\nicefrac{{i}}{{2}}-1}\Delta_{\min}}+\sum_{i\geq 1}\frac{4d}{2^{\nicefrac{{i}}{{2}}-1}\Delta_{\min}} (qxGj,κ1|𝒳|n2nq^{G_{j},\kappa}_{x}\leq\frac{1}{|\mathcal{X}|\cdot n\cdot 2^{n}} for jij\neq i, xGix\in G_{i})
i14d2ni/21Δmin+i14d2i/21Δmin\displaystyle\leq\sum_{i\geq 1}\frac{4d}{2^{n-\nicefrac{{i}}{{2}}-1}\Delta_{\min}}+\sum_{i\geq 1}\frac{4d}{2^{\nicefrac{{i}}{{2}}-1}\Delta_{\min}} (j1j\geq 1)
i8d2i/21Δmin48dΔmin.\displaystyle\leq\sum_{i}\frac{8d}{2^{\nicefrac{{i}}{{2}}-1}\Delta_{\min}}\leq\frac{48d}{\Delta_{\min}}. (ini\leq n)

Therefore, we have c(𝒳,θ)48dΔminc(\mathcal{X},\theta)\leq\frac{48d}{\Delta_{\min}}. ∎
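As a purely numerical sanity check of Lemma 16 (this snippet is not from the paper; it assumes the cvxpy package is available and uses a toy instance of our own), one can solve the optimization problem defining c(𝒳,θ) directly — the constraint xH(N)12Δx2/2\|x\|_{H(N)^{-1}}^{2}\leq\Delta_{x}^{2}/2 is convex in NN and expressible with the matrix_frac atom — and compare the optimal value with the bound 48d/Δmin48d/\Delta_{\min}:

```python
import numpy as np
import cvxpy as cp

X = np.array([[1.0, 0.0], [0.0, 1.0], [0.8, 0.6]])   # toy action set in R^2
theta = np.array([0.1, 0.9])
Delta = X @ theta - (X @ theta).min()                 # suboptimality gaps
star = int(np.argmin(X @ theta))                      # the optimal action x^*
d = X.shape[1]

N = cp.Variable(len(X), nonneg=True)
H = sum(N[i] * np.outer(X[i], X[i]) for i in range(len(X)))      # H(N) = sum_x N_x x x^T
constraints = [cp.matrix_frac(X[i], H) <= Delta[i] ** 2 / 2      # ||x||^2_{H(N)^{-1}} <= Delta_x^2 / 2
               for i in range(len(X)) if i != star]
prob = cp.Problem(cp.Minimize(cp.sum(cp.multiply(Delta, N))), constraints)
prob.solve()
print(prob.value, 48 * d / Delta[Delta > 0].min())    # c(X, theta) vs. the bound of Lemma 16
```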

Appendix B Analysis for Algorithm REOLB

In this section, we analyze the performance of Algorithm REOLB in the stochastic or corrupted setting. For Theorem 1, we decompose the proof into two parts. First, we show in Lemma 17 (a more concrete version of Lemma 5) that for some constant TT^{*} specified later, we have t=T+1Txpt,xΔx=𝒪(c(𝒳,θ)logTlog(T|𝒳|/δ))\sum_{t=T^{*}+1}^{T}\sum_{x}p_{t,x}\Delta_{x}=\mathcal{O}(c(\mathcal{X},\theta)\log T\log(T|\mathcal{X}|/\delta)). Second, using Lemma 4, we know that Algorithm REOLB also enjoys a regret bound of 𝒪(dTlog(T|𝒳|/δ)+C)\mathcal{O}(d\sqrt{T}\log(T|\mathcal{X}|/\delta)+C), which bounds the regret of the first TT^{*} rounds by 𝒪(dTlog(T|𝒳|/δ)+C)\mathcal{O}(d\sqrt{T^{*}}\log(T^{*}|\mathcal{X}|/\delta)+C) and also directly proves Theorem 2.

To prove Lemma 17, we first show in Lemma 3 that Δx\Delta_{x} and Δ^m,x\widehat{\Delta}_{m,x} are close up to a multiplicative factor and some additive terms related to the corruption. This holds with the help of the constraint of OP and the use of robust estimators. For notational convenience, first recall the following definitions from Lemma 3.

Definition 1.
γm\displaystyle\gamma_{m} 215log(2m|𝒳|/δ)=β2m,\displaystyle\triangleq 2^{15}\log(2^{m}|\mathcal{X}|/\delta)=\beta_{2^{m}},
Cm\displaystyle C_{m} τmmaxx|cτ,x|,\displaystyle\triangleq\sum_{\tau\in\mathcal{B}_{m}}\max_{x}|c_{\tau,x}|,
ρm\displaystyle\rho_{m} {k=0m2kCk4m1,if m0,0,m=1.\displaystyle\triangleq\begin{cases}\sum_{k=0}^{m}\frac{2^{k}C_{k}}{4^{m-1}},&\mbox{if $m\geq 0$},\\ 0,&\mbox{$m=-1$}.\end{cases}
Proof of Lemma 3.

We prove this by induction. For the base case m=0m=0, Δ^m,x=0\widehat{\Delta}_{m,x}=0 by definition. Also, Δx2dγm4\Delta_{x}\leq 2\leq\sqrt{\frac{d\gamma_{m}}{4}}, so Eq. (3) holds.

If both inequalities hold for mm, then

xSm12\displaystyle\|x\|_{S_{m}^{-1}}^{2} 2mΔ^m,x2γm+4d2m(2Δx+dγm42m+2ρm1)2γm+4d162mΔx2γm+8d+162mρm12γm,\displaystyle\leq\frac{2^{m}\widehat{\Delta}_{m,x}^{2}}{\gamma_{m}}+4d\leq\frac{2^{m}\left(2\Delta_{x}+\sqrt{\frac{d\gamma_{m}}{4\cdot 2^{m}}}+2\rho_{m-1}\right)^{2}}{\gamma_{m}}+4d\leq\frac{16\cdot 2^{m}\Delta_{x}^{2}}{\gamma_{m}}+8d+\frac{16\cdot 2^{m}\rho_{m-1}^{2}}{\gamma_{m}}, (17)

where the first inequality is because of the constraint of OP(2m,Δ^m,x)\textbf{OP}(2^{m},\widehat{\Delta}_{m,x}). Next, we show that the expectation of ^τ,x\widehat{\ell}_{\tau,x} is x,θ+cτ\langle x,\theta+c_{\tau}\rangle and the variance of ^τ,x\widehat{\ell}_{\tau,x} is upper bounded by xSm12\|x\|_{S_{m}^{-1}}^{2} for τm\tau\in\mathcal{B}_{m}. In fact, we have

𝔼[^τ,x]=𝔼[xSm1xτ(xτ,θ+cτ+ϵτ)]=x,θ+cτ,\displaystyle\mathbb{E}\left[\widehat{\ell}_{\tau,x}\right]=\mathbb{E}\left[x^{\top}S_{m}^{-1}x_{\tau}\left(\langle x_{\tau},\theta+c_{\tau}\rangle+\epsilon_{\tau}\right)\right]=\langle x,\theta+c_{\tau}\rangle, (18)
𝔼[^τ,x2]𝔼[xSm1xτxτSm1xyτ2]xSm1x=xSm12.\displaystyle\mathbb{E}\left[\widehat{\ell}_{\tau,x}^{2}\right]\leq\mathbb{E}[x^{\top}S_{m}^{-1}x_{\tau}x_{\tau}^{\top}S_{m}^{-1}x\cdot y_{\tau}^{2}]\leq x^{\top}S_{m}^{-1}x=\|x\|_{S_{m}^{-1}}^{2}. (19)

Now we are ready to prove the relation. Let c¯m,x=12mτmcτ,x\bar{c}_{m,x}=\frac{1}{2^{m}}\sum_{\tau\in\mathcal{B}_{m}}c_{\tau,x}. Under the induction hypothesis for the case of mm, using Lemma 29 with μi=x,θ+ci\mu_{i}=\langle x,\theta+c_{i}\rangle, with probability at least 1δ4m1-\frac{\delta}{4^{m}}, we have for all x𝒳x\in\mathcal{X}:

ΔxΔ^m+1,x\displaystyle\Delta_{x}-\widehat{\Delta}_{m+1,x}
=x,θx,θRobm,x+minxRobm,x\displaystyle=\langle x,\theta\rangle-\langle x^{*},\theta\rangle-\text{Rob}_{m,x}+\min_{x}\text{Rob}_{m,x}
x,θx,θRobm,x+Robm,x\displaystyle\leq\langle x,\theta\rangle-\langle x^{*},\theta\rangle-\text{Rob}_{m,x}+\text{Rob}_{m,x^{*}}
|Robm,xx,θτmx,cτ2m|+|Robm,xx,θτmx,cτ2m|+2Cm2m\displaystyle\leq\left|\text{Rob}_{m,x^{*}}-\langle x^{*},\theta\rangle-\frac{\sum_{\tau\in\mathcal{B}_{m}}\langle x^{*},c_{\tau}\rangle}{2^{m}}\right|+\left|\text{Rob}_{m,x}-\langle x,\theta\rangle-\frac{\sum_{\tau\in\mathcal{B}_{m}}\langle x,c_{\tau}\rangle}{2^{m}}\right|+\frac{2C_{m}}{2^{m}}
12m(αx(2mxSm12+τm(cτ,xc¯m,x)2)+4log(2m|𝒳|/δ)αx)\displaystyle\leq\frac{1}{2^{m}}\left(\alpha_{x}\left(2^{m}\|x\|^{2}_{S_{m}^{-1}}+\sum_{\tau\in\mathcal{B}_{m}}(c_{\tau,x}-\bar{c}_{m,x})^{2}\right)+\frac{4\log(2^{m}|\mathcal{X}|/\delta)}{\alpha_{x}}\right)
+12m(αx(2mxSm12+τm(cτ,xc¯m,x)2)+4log(2m|𝒳|/δ)αx)+2Cm2m\displaystyle\quad+\frac{1}{2^{m}}\left(\alpha_{x^{*}}\left(2^{m}\|x^{*}\|^{2}_{S_{m}^{-1}}+\sum_{\tau\in\mathcal{B}_{m}}(c_{\tau,x^{*}}-\bar{c}_{m,x^{*}})^{2}\right)+\frac{4\log(2^{m}|\mathcal{X}|/\delta)}{\alpha_{x^{*}}}\right)+\frac{2C_{m}}{2^{m}} (by Eq. (19) and Lemma 29)
12m(αx(2mxSm12+2m)+γm212αx)+12m(αx(2mxSm12+2m)+γm212αx)+2Cm2m\displaystyle\leq\frac{1}{2^{m}}\left(\alpha_{x}\left(2^{m}\|x\|_{S_{m}^{-1}}^{2}+2^{m}\right)+\frac{\gamma_{m}}{2^{12}\alpha_{x}}\right)+\frac{1}{2^{m}}\left(\alpha_{x^{*}}\left(2^{m}\|x^{*}\|_{S_{m}^{-1}}^{2}+2^{m}\right)+\frac{\gamma_{m}}{2^{12}\alpha_{x^{*}}}\right)+\frac{2C_{m}}{2^{m}} (using the definition of γm\gamma_{m} and τm(cτ,xc¯m,x)2τmcτ,x22m\sum_{\tau\in\mathcal{B}_{m}}(c_{\tau,x}-\bar{c}_{m,x})^{2}\leq\sum_{\tau\in\mathcal{B}_{m}}c_{\tau,x}^{2}\leq 2^{m})
=2642m(2mxSm12+2m)γm+2642m(2mxSm12+2m)γm+Cm2m1\displaystyle=\frac{2}{64\cdot 2^{m}}\sqrt{\left(2^{m}\|x\|_{S_{m}^{-1}}^{2}+2^{m}\right)\gamma_{m}}+\frac{2}{64\cdot 2^{m}}\sqrt{\left(2^{m}\|x^{*}\|_{S_{m}^{-1}}^{2}+2^{m}\right)\gamma_{m}}+\frac{C_{m}}{2^{m-1}} (by the choice of αx\alpha_{x})
4642m(1622mΔx2γm+162md+1622mρm12γm)γm+Cm2m1\displaystyle\leq\frac{4}{64\cdot 2^{m}}\sqrt{\left(\frac{16\cdot 2^{2m}\Delta_{x}^{2}}{\gamma_{m}}+16\cdot 2^{m}d+\frac{16\cdot 2^{2m}\rho_{m-1}^{2}}{\gamma_{m}}\right)\gamma_{m}}+\frac{C_{m}}{2^{m-1}} (by Eq. (17))
Δx2+dγm162m+ρm14+Cm2m1\displaystyle\leq\frac{\Delta_{x}}{2}+\sqrt{\frac{d\gamma_{m}}{16\cdot 2^{m}}}+\frac{\rho_{m-1}}{4}+\frac{C_{m}}{2^{m-1}} (a+ba+b\sqrt{a+b}\leq\sqrt{a}+\sqrt{b})
Δx2+dγm162m+ρm.\displaystyle\leq\frac{\Delta_{x}}{2}+\sqrt{\frac{d\gamma_{m}}{16\cdot 2^{m}}}+\rho_{m}. (by definition of ρm\rho_{m})

Therefore, Δx2Δ^m+1,x+dγm42m+2ρm\Delta_{x}\leq 2\widehat{\Delta}_{m+1,x}+\sqrt{\frac{d\gamma_{m}}{4\cdot 2^{m}}}+2\rho_{m}, which proves Eq. (3). The other claim Eq. (4) can be proven using similar analysis. Taking a union bound over all mm finishes the proof. ∎
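The deviation bound above is driven by the robust mean estimates Rob together with the concentration inequality of Lemma 29 (not reproduced here), whose α(variance)+log(1/δ)/α\alpha\cdot(\text{variance})+\log(1/\delta)/\alpha form is characteristic of Catoni-style estimators. The following is only an illustrative sketch of such an estimator on synthetic data; the influence function, the bisection solver, and the data are our own choices and need not match the exact construction behind Rob.

```python
import numpy as np

def catoni_mean(samples, alpha, iters=60):
    """Catoni-style robust mean: the root mu of  sum_i psi(alpha * (samples_i - mu)) = 0,
    with bounded influence function psi(z) = sign(z) * log(1 + |z| + z^2 / 2).
    alpha plays the same role as alpha_x in the deviation bounds above."""
    psi = lambda z: np.sign(z) * np.log1p(np.abs(z) + z * z / 2.0)
    lo, hi = float(samples.min()), float(samples.max())
    for _ in range(iters):                  # the estimating equation is decreasing in mu,
        mid = (lo + hi) / 2.0               # so plain bisection locates its root
        if psi(alpha * (samples - mid)).sum() > 0.0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

rng = np.random.default_rng(0)
clean = rng.normal(0.3, 1.0, size=5000)                       # samples with true mean 0.3
corrupted = np.concatenate([clean, np.full(10, 1000.0)])      # a handful of huge corruptions
print(np.mean(corrupted), catoni_mean(corrupted, alpha=0.1))  # empirical mean vs. robust estimate
```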

Lemma 17 (A detailed version of Lemma 5).

Let T32CΔmin+4Mlog(2M|𝒳|δ)T^{*}\triangleq\frac{32C}{\Delta_{\min}}+4M^{\prime}\log\left(\frac{2M^{\prime}|\mathcal{X}|}{\delta}\right), where M=220(M+dΔmin2)M^{\prime}=2^{20}\left(M+\frac{d}{\Delta_{\min}^{2}}\right) and MM is defined in Lemma 14. Then Algorithm REOLB guarantees with probability at least 1δ1-\delta:

t=T+1Txpmt,xΔx𝒪(c(𝒳,θ)log(T)log(T|𝒳|/δ)).\displaystyle\sum^{T}_{t=T^{*}+1}\sum_{x}p_{m_{t},x}\Delta_{x}\leq\mathcal{O}\left(c(\mathcal{X},\theta)\log(T)\log(T|\mathcal{X}|/\delta)\right).
Proof.

First, note that according to Lemma 28, t4Mlog(2M|𝒳|δ)t\geq 4M^{\prime}\log\left(\frac{2M^{\prime}|\mathcal{X}|}{\delta}\right) implies tMlog(t|𝒳|δ)32dβtΔmin2t\geq M^{\prime}\log\left(\frac{t|\mathcal{X}|}{\delta}\right)\geq\frac{32d\beta_{t}}{\Delta_{\min}^{2}}. Let mtm_{t} be the epoch such that tmtt\in\mathcal{B}_{m_{t}}, which means t[2mt,2mt+1)t\in[2^{m_{t}},2^{m_{t}+1}) and thus βtγmt\beta_{t}\geq\gamma_{m_{t}}. Then we have,

dγmt42mt+2ρmt\displaystyle\sqrt{\frac{d\gamma_{m_{t}}}{4\cdot 2^{m_{t}}}}+2\rho_{m_{t}} dγmt42mt+2C2mt114Δmin+14Δmin12Δx,\displaystyle\leq\sqrt{\frac{d\gamma_{m_{t}}}{4\cdot 2^{m_{t}}}}+\frac{2C}{2^{m_{t}-1}}\leq\frac{1}{4}\Delta_{\min}+\frac{1}{4}\Delta_{\min}\leq\frac{1}{2}\Delta_{x}, (20)

where the first inequality is by definition of ρm\rho_{m} and the second inequality is because 2mt+1tT32CΔmin2^{m_{t}+1}\geq t\geq T^{*}\geq\frac{32C}{\Delta_{\min}} and t32dγmtΔmin2t\geq\frac{32d\gamma_{m_{t}}}{\Delta_{\min}^{2}}. Then by Lemma 3, we have for all xxx\neq x^{*}, Δx2Δ^mt,x+Δx2\Delta_{x}\leq 2\widehat{\Delta}_{m_{t},x}+\frac{\Delta_{x}}{2}, Δ^mt,x2Δx+Δx2\widehat{\Delta}_{m_{t},x}\leq 2\Delta_{x}+\frac{\Delta_{x}}{2}, which gives

Δx4Δ^mt,x,Δ^mt,x4Δx.\displaystyle\Delta_{x}\leq 4\widehat{\Delta}_{m_{t},x},\qquad\widehat{\Delta}_{m_{t},x}\leq 4\Delta_{x}. (21)

Since Δ^mt,x14Δx\widehat{\Delta}_{m_{t},x}\geq\frac{1}{4}\Delta_{x} for all xxx\neq x^{*}, we must have Δ^mt,x=Δx=0\widehat{\Delta}_{m_{t},x^{*}}=\Delta_{x^{*}}=0. Moreover, according to the definition of TT^{*}, we have t220Mlog(t|𝒳|δ)16βtMt\geq 2^{20}M\log(\frac{t|\mathcal{X}|}{\delta})\geq 16\beta_{t}M. Therefore, the conditions of Lemma 15 hold. Applying Lemma 15 with r=16r=16, we get

xpmt,xΔx𝒪(βt2mtc(𝒳,θ)).\displaystyle\sum_{x}p_{m_{t},x}\Delta_{x}\leq\mathcal{O}\left(\frac{\beta_{t}}{2^{m_{t}}}c(\mathcal{X},\theta)\right).

Summing over tT+1t\geq T^{*}+1, we get with probability at least 1δ1-\delta,

t=T+1Txpmt,xΔx=𝒪(βTc(𝒳,θ)log(T))=𝒪(c(𝒳,θ)log(T)log(T|𝒳|/δ)).\displaystyle\sum_{t=T^{*}+1}^{T}\sum_{x}p_{m_{t},x}\Delta_{x}=\mathcal{O}\left(\beta_{T}c(\mathcal{X},\theta)\log(T)\right)=\mathcal{O}\left(c(\mathcal{X},\theta)\log(T)\log(T|\mathcal{X}|/\delta)\right).

Now we are ready to prove the main result Theorem 1.

Proof of Theorem 1.

Let 𝔼t[]\mathbb{E}_{t}[\cdot] denote the conditional expectation given the history up to time tt. By Freedman’s inequality, we have

Reg(T)\displaystyle\text{\rm Reg}(T) =t=1Tx𝒳𝟙{xt=x}Δx\displaystyle=\sum_{t=1}^{T}\sum_{x\in\mathcal{X}}\mathbbm{1}\{x_{t}=x\}\Delta_{x}
t=1Tx𝒳pmt,xΔx+2log(1/δ)t=1T𝔼t[(xx𝟙{xt=x}Δx)2]+log(1/δ)\displaystyle\leq\sum_{t=1}^{T}\sum_{x\in\mathcal{X}}p_{m_{t},x}\Delta_{x}+2\sqrt{\log(1/\delta)\sum_{t=1}^{T}\mathbb{E}_{t}\left[\left(\sum_{x\neq x^{*}}\mathbbm{1}\{x_{t}=x\}\Delta_{x}\right)^{2}\right]}+\log(1/\delta)
t=1Tx𝒳pmt,xΔx+2log(1/δ)t=1T𝔼t[xx𝟙{xt=x}Δx]+log(1/δ)\displaystyle\leq\sum_{t=1}^{T}\sum_{x\in\mathcal{X}}p_{m_{t},x}\Delta_{x}+2\sqrt{\log(1/\delta)\sum_{t=1}^{T}\mathbb{E}_{t}\left[\sum_{x\neq x^{*}}\mathbbm{1}\{x_{t}=x\}\Delta_{x}\right]}+\log(1/\delta) (Δx1\Delta_{x}\leq 1)
t=1Tx𝒳pmt,xΔx+2log(1/δ)t=1Tx𝒳pmt,xΔx+log(1/δ)\displaystyle\leq\sum_{t=1}^{T}\sum_{x\in\mathcal{X}}p_{m_{t},x}\Delta_{x}+2\sqrt{\log(1/\delta)\sum_{t=1}^{T}\sum_{x\in\mathcal{X}}p_{m_{t},x}\Delta_{x}}+\log(1/\delta)
2t=1Tx𝒳pmt,xΔx+2log(1/δ).\displaystyle\leq 2\sum_{t=1}^{T}\sum_{x\in\mathcal{X}}p_{m_{t},x}\Delta_{x}+2\log(1/\delta). (22)
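For reference, one standard form of Freedman’s inequality behind this step (stated here for convenience; it is not quoted from the paper, and the version invoked may differ in constants) is the following: if X1,,XTX_{1},\ldots,X_{T} is a martingale difference sequence with XtRX_{t}\leq R almost surely, then for any fixed λ[0,1/R]\lambda\in[0,1/R], with probability at least 1δ1-\delta,

t=1TXt(e2)λt=1T𝔼t[Xt2]+ln(1/δ)λ.\displaystyle\sum_{t=1}^{T}X_{t}\leq(e-2)\lambda\sum_{t=1}^{T}\mathbb{E}_{t}[X_{t}^{2}]+\frac{\ln(1/\delta)}{\lambda}.

Choosing λ\lambda to balance the two terms (using a deterministic bound on the predictable variance, or a union bound over a grid of λ\lambda) yields bounds of the form 2Vln(1/δ)+𝒪(Rln(1/δ))2\sqrt{V\ln(1/\delta)}+\mathcal{O}(R\ln(1/\delta)) as used above.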

Fix a round tt and the corresponding epoch mtm_{t}. According to Lemma 4, we know that

x𝒳pmt,xΔ^mt,x𝒪(dlog(2mt|𝒳|/δ)2mt/2)\displaystyle\sum_{x\in\mathcal{X}}p_{m_{t},x}\widehat{\Delta}_{m_{t},x}\leq\mathcal{O}\left(\frac{d\log(2^{m_{t}}|\mathcal{X}|/\delta)}{2^{\nicefrac{{m_{t}}}{{2}}}}\right)

Therefore, combining this with Lemma 3, the regret in epoch mm is bounded by

2mx𝒳pm,xΔx\displaystyle 2^{m}\sum_{x\in\mathcal{X}}p_{m,x}\Delta_{x} 2m(2xpm,xΔ^m,x+𝒪(dγm2m+ρm1))\displaystyle\leq 2^{m}\left(2\sum_{x}p_{m,x}\widehat{\Delta}_{m,x}+\mathcal{O}\left(\sqrt{\frac{d\gamma_{m}}{2^{m}}}+\rho_{m-1}\right)\right)
𝒪(d2m/2log(2m/δ))+𝒪(2mdγm+k=0m12kCk4m2)\displaystyle\leq\mathcal{O}\left(d\cdot 2^{\nicefrac{{m}}{{2}}}\log(2^{m}/\delta)\right)+\mathcal{O}\left(\sqrt{2^{m}d\gamma_{m}}+\sum_{k=0}^{m-1}\frac{2^{k}C_{k}}{4^{m-2}}\right)
𝒪(d2m/2log(2m/δ)+2mdγm+k=0mCk2mk).\displaystyle\leq\mathcal{O}\left(d\cdot 2^{\nicefrac{{m}}{{2}}}\log(2^{m}/\delta)+\sqrt{2^{m}d\gamma_{m}}+\sum_{k=0}^{m}\frac{C_{k}}{2^{m-k}}\right).

Summing up the regret till round TT, we have

t=1Tx𝒳pmt,xΔx\displaystyle\sum_{t=1}^{T}\sum_{x\in\mathcal{X}}p_{m_{t},x}\Delta_{x} m=0log2T𝒪(d2m/2log(2m|𝒳|/δ)+2mdγm)+𝒪(k=0log2TCkm=klog2T12mk)\displaystyle\leq\sum_{m=0}^{\log_{2}T}\mathcal{O}\left(d\cdot 2^{\nicefrac{{m}}{{2}}}\log(2^{m}|\mathcal{X}|/\delta)+\sqrt{2^{m}d\gamma_{m}}\right)+\mathcal{O}\left(\sum_{k=0}^{\log_{2}T}C_{k}\sum_{m=k}^{\log_{2}T}\frac{1}{2^{m-k}}\right)
=𝒪(dTlog(T|𝒳|/δ)+k=0log2TCk)\displaystyle=\mathcal{O}\left(d\sqrt{T}\log(T|\mathcal{X}|/\delta)+\sum_{k=0}^{\log_{2}T}C_{k}\right)
=𝒪(dTlog(T|𝒳|/δ)+C).\displaystyle=\mathcal{O}\left(d\sqrt{T}\log(T|\mathcal{X}|/\delta)+C\right). (23)

Using Eq. (23) for tTt\leq T^{*} and using Lemma 17 for the rest, we know that

t=1Tx𝒳pmt,xΔx𝒪(c(𝒳,θ)log(T)log(T|𝒳|/δ)+dTlog(T|𝒳|/δ)+C).\displaystyle\sum_{t=1}^{T}\sum_{x\in\mathcal{X}}p_{m_{t},x}\Delta_{x}\leq\mathcal{O}\left(c(\mathcal{X},\theta)\log(T)\log(T|\mathcal{X}|/\delta)+d\sqrt{T^{*}}\log(T^{*}|\mathcal{X}|/\delta)+C\right).

Combining this with Eq. (22), we get

Reg(T)\displaystyle\text{\rm Reg}(T) =𝒪(t=1Tx𝒳pmt,xΔx+log(1/δ))\displaystyle=\mathcal{O}\left(\sum_{t=1}^{T}\sum_{x\in\mathcal{X}}p_{m_{t},x}\Delta_{x}+\log(1/\delta)\right)
=𝒪(c(𝒳,θ)log(T)log(T|𝒳|/δ)+dTlog(T|𝒳|/δ)+C)\displaystyle=\mathcal{O}\left(c(\mathcal{X},\theta)\log(T)\log(T|\mathcal{X}|/\delta)+d\sqrt{T^{*}}\log(T^{*}|\mathcal{X}|/\delta)+C\right)
=𝒪(c(𝒳,θ)log(T)log(T|𝒳|/δ)+dCΔminlog(C|𝒳|Δminδ)+dMlog(M|𝒳|δ)log(M|𝒳|δ)+C)\displaystyle=\mathcal{O}\left(c(\mathcal{X},\theta)\log(T)\log(T|\mathcal{X}|/\delta)+d\sqrt{\frac{C}{\Delta_{\min}}}\log\left(\frac{C|\mathcal{X}|}{\Delta_{\min}\delta}\right)+d\sqrt{M^{\prime}\log\left(\frac{M^{\prime}|\mathcal{X}|}{\delta}\right)}\log\left(\frac{M^{\prime}|\mathcal{X}|}{\delta}\right)+C\right)
=𝒪(c(𝒳,θ)log(T)log(T|𝒳|/δ)+dCΔminlog(C|𝒳|Δminδ)+C+M(log(1/δ))32)\displaystyle=\mathcal{O}\left(c(\mathcal{X},\theta)\log(T)\log(T|\mathcal{X}|/\delta)+d\sqrt{\frac{C}{\Delta_{\min}}}\log\left(\frac{C|\mathcal{X}|}{\Delta_{\min}\delta}\right)+C+M^{*}\left(\log(1/\delta)\right)^{\frac{3}{2}}\right) (24)

for some constant MM^{*} that depends on M=220(M+dΔmin2)M^{\prime}=2^{20}\left(M+\frac{d}{\Delta_{\min}^{2}}\right) and log|𝒳|\log|\mathcal{X}|. ∎

In the following, we prove an alternative bound of 𝒪(d2(logT)2Δmin+C)\mathcal{O}(\frac{d^{2}(\log T)^{2}}{\Delta_{\min}}+C), which is independent of the instance-dependent quantity MM (and hence of MM^{*}). The following lemma is an analogue of Lemma 17, but with a threshold TT^{\prime} that is also independent of MM.

Lemma 18.

Let T32CΔmin+225dΔmin2log(224d|𝒳|δΔmin2)T^{\prime}\triangleq\frac{32C}{\Delta_{\min}}+\frac{2^{25}d}{\Delta_{\min}^{2}}\log\left(\frac{2^{24}d|\mathcal{X}|}{\delta\Delta_{\min}^{2}}\right). Then Algorithm REOLB guarantees with probability at least 1δ1-\delta

t=T+1Tx𝒳pmt,xΔx𝒪(dlog(T)log(T|𝒳|/δ)Δmin).\displaystyle\sum^{T}_{t=T^{\prime}+1}\sum_{x\in\mathcal{X}}p_{m_{t},x}\Delta_{x}\leq\mathcal{O}\left(\frac{d\log(T)\log(T|\mathcal{X}|/\delta)}{\Delta_{\min}}\right).
Proof.

First, note that according to Lemma 28, t225dΔmin2log(224d|𝒳|Δmin2δ)t\geq\frac{2^{25}d}{\Delta_{\min}^{2}}\log\left(\frac{2^{24}d|\mathcal{X}|}{\Delta_{\min}^{2}\delta}\right) implies t223dΔmin2log(t|𝒳|δ)t\geq\frac{2^{23}d}{\Delta_{\min}^{2}}\log\left(\frac{t|\mathcal{X}|}{\delta}\right). Let mtm_{t} be the epoch such that tmtt\in\mathcal{B}_{m_{t}}, which means t[2mt,2mt+1)t\in[2^{m_{t}},2^{m_{t}+1}) and thus βtγmt\beta_{t}\geq\gamma_{m_{t}}. Therefore, we still have Eq. (20), which further shows that we have Eq. (21) and Δ^mt,x=Δx=0\widehat{\Delta}_{m_{t},x^{*}}=\Delta_{x^{*}}=0. Moreover, according to the definition of TT^{\prime}, we have t223dΔmin2log(t|𝒳|δ)256dβt/Δmin2t\geq\frac{2^{23}d}{\Delta_{\min}^{2}}\log(\frac{t|\mathcal{X}|}{\delta})\geq 256d\beta_{t}/\Delta_{\min}^{2}. Therefore, the conditions of Lemma 13 hold. Applying Lemma 13 with r=16r=16, we get

xpmt,xΔx𝒪(dβt2mtΔmin).\displaystyle\sum_{x}p_{m_{t},x}\Delta_{x}\leq\mathcal{O}\left(\frac{d\beta_{t}}{2^{m_{t}}\Delta_{\min}}\right).

Summing over tT+1t\geq T^{\prime}+1, we get with probability at least 1δ1-\delta,

t=T+1Txpmt,xΔx=𝒪(dβTlog(T)Δmin)=𝒪(dlog(T)log(T|𝒳|/δ)Δmin).\displaystyle\sum_{t=T^{\prime}+1}^{T}\sum_{x}p_{m_{t},x}\Delta_{x}=\mathcal{O}\left(\frac{d\beta_{T}\log(T)}{\Delta_{\min}}\right)=\mathcal{O}\left(\frac{d\log(T)\log(T|\mathcal{X}|/\delta)}{\Delta_{\min}}\right).

Theorem 19.

Algorithm REOLB guarantees that with probability at least 1δ1-\delta,

Reg(T)\displaystyle\text{\rm Reg}(T) =𝒪(d2Δminlog2(T|𝒳|Δminδ)+C).\displaystyle=\mathcal{O}\left(\frac{d^{2}}{\Delta_{\min}}\log^{2}\left(\frac{T|\mathcal{X}|}{\Delta_{\min}\delta}\right)+C\right).
Proof.

Using Eq. (23) for tTt\leq T^{\prime} and using Lemma 18 for the rest, we know that

t=1Tx𝒳pmt,xΔx𝒪(dlog(T)log(T|𝒳|/δ)Δmin+dTlog(T|𝒳|/δ)+C).\displaystyle\sum_{t=1}^{T}\sum_{x\in\mathcal{X}}p_{m_{t},x}\Delta_{x}\leq\mathcal{O}\left(\frac{d\log(T)\log(T|\mathcal{X}|/\delta)}{\Delta_{\min}}+d\sqrt{T^{\prime}}\log(T^{\prime}|\mathcal{X}|/\delta)+C\right).

Combining this with Eq. (22), we get

Reg(T)\displaystyle\text{\rm Reg}(T) =𝒪(t=1Tx𝒳pmt,xΔx+log(1/δ))\displaystyle=\mathcal{O}\left(\sum_{t=1}^{T}\sum_{x\in\mathcal{X}}p_{m_{t},x}\Delta_{x}+\log(1/\delta)\right)
=𝒪(dlog(T)log(T|𝒳|/δ)Δmin+dTlog(T|𝒳|/δ)+C)\displaystyle=\mathcal{O}\left(\frac{d\log(T)\log(T|\mathcal{X}|/\delta)}{\Delta_{\min}}+d\sqrt{T^{\prime}}\log(T^{\prime}|\mathcal{X}|/\delta)+C\right)
=𝒪(dlog(T)log(T|𝒳|/δ)Δmin+dCΔminlog(C|𝒳|Δminδ)+ddΔmin2log(d|𝒳|δΔmin2)log(d|𝒳|δΔmin2)+C)\displaystyle=\mathcal{O}\left(\frac{d\log(T)\log(T|\mathcal{X}|/\delta)}{\Delta_{\min}}+d\sqrt{\frac{C}{\Delta_{\min}}}\log\left(\frac{C|\mathcal{X}|}{\Delta_{\min}\delta}\right)+d\sqrt{\frac{d}{\Delta_{\min}^{2}}\log\left(\frac{d|\mathcal{X}|}{\delta\Delta_{\min}^{2}}\right)}\log\left(\frac{d|\mathcal{X}|}{\delta\Delta_{\min}^{2}}\right)+C\right)
=𝒪(dlog(T)log(T|𝒳|/δ)Δmin+dCΔminlog(C|𝒳|Δminδ)+C+d32log32(d|𝒳|δΔmin)Δmin)\displaystyle=\mathcal{O}\left(\frac{d\log(T)\log(T|\mathcal{X}|/\delta)}{\Delta_{\min}}+d\sqrt{\frac{C}{\Delta_{\min}}}\log\left(\frac{C|\mathcal{X}|}{\Delta_{\min}\delta}\right)+C+\frac{d^{\frac{3}{2}}\log^{\frac{3}{2}}(\frac{d|\mathcal{X}|}{\delta\Delta_{\min}})}{\Delta_{\min}}\right)
𝒪(d2Δminlog2(T|𝒳|Δminδ)+C).\displaystyle\leq\mathcal{O}\left(\frac{d^{2}}{\Delta_{\min}}\log^{2}\left(\frac{T|\mathcal{X}|}{\Delta_{\min}\delta}\right)+C\right). (using AM-GM inequality and CTC\leq T, TdT\geq d)

Finally, we prove Theorem 2, which follows directly by combining Eq. (23) and Eq. (22).

Proof of Theorem 2.

Combining Eq. (23) and Eq. (22), we have

Reg(T)=𝒪(t=1Tx𝒳pmt,xΔx+log(1/δ))=𝒪(dTlog(T|𝒳|/δ)+C).\displaystyle\text{\rm Reg}(T)=\mathcal{O}\left(\sum_{t=1}^{T}\sum_{x\in\mathcal{X}}p_{m_{t},x}\Delta_{x}+\log(1/\delta)\right)=\mathcal{O}\left(d\sqrt{T}\log(T|\mathcal{X}|/\delta)+C\right).

Appendix C Analysis of Algorithm 1

In this section, we show that Algorithm 1 achieves both minimax-optimality in the adversarial setting and near instance-optimality in the stochastic setting. In Appendix C.1, we prove Theorem 7, showing that Algorithm 1 enjoys 𝒪~(T)\widetilde{\mathcal{O}}(\sqrt{T}) regret in the adversarial setting. In Appendix C.2, we prove Theorem 6, showing that Algorithm 1 also enjoys nearly instance-optimal regret in the stochastic setting and slightly worse regret in the corrupted setting.

C.1 Analysis of Algorithm 1 in the adversarial setting

To prove the guarantee in the adversarial setting, we first prove Lemma 8, which shows that at any time in Phase 2, x^\widehat{x} has the smallest cumulative loss within [1,t][1,t].

Proof of Lemma 8.

By Assumption 1, for any xx, and any tt in Phase 1,

s=1t(s,xss,x)\displaystyle\sum_{s=1}^{t}(\ell_{s,x_{s}}-\ell_{s,x}) C1tC2|s=1t(s,x^s,x)|\displaystyle\leq\sqrt{C_{1}t}-C_{2}\left|\sum_{s=1}^{t}(\ell_{s,x}-\widehat{\ell}_{s,x})\right|
C1t(C21)|s=1t(s,x^s,x)|s=1t(s,x^s,x),\displaystyle\leq\sqrt{C_{1}t}-(C_{2}-1)\left|\sum_{s=1}^{t}(\ell_{s,x}-\widehat{\ell}_{s,x})\right|-\sum_{s=1}^{t}(\ell_{s,x}-\widehat{\ell}_{s,x}),

which implies

|s=1t(s,x^s,x)|\displaystyle\left|\sum_{s=1}^{t}(\ell_{s,x}-\widehat{\ell}_{s,x})\right| 1C21(C1ts=1t(s,x^s,x)s=1t(s,xss,x))\displaystyle\leq\frac{1}{C_{2}-1}\left(\sqrt{C_{1}t}-\sum_{s=1}^{t}(\ell_{s,x}-\widehat{\ell}_{s,x})-\sum_{s=1}^{t}(\ell_{s,x_{s}}-\ell_{s,x})\right)
1C21(C1t+s=1t^s,xs=1ts,xs).\displaystyle\leq\frac{1}{C_{2}-1}\left(\sqrt{C_{1}t}+\sum_{s=1}^{t}\widehat{\ell}_{s,x}-\sum_{s=1}^{t}\ell_{s,x_{s}}\right). (25)

At time t0t_{0}, we have with probability at least 12δ1-2\delta,

|s=1t0(s,x^s,x)|\displaystyle\left|\sum_{s=1}^{t_{0}}(\ell_{s,x}-\widehat{\ell}_{s,x})\right| 1C21(C1t0+s=1t0^s,xs=1t0s,xs)\displaystyle\leq\frac{1}{C_{2}-1}\left(\sqrt{C_{1}t_{0}}+\sum_{s=1}^{t_{0}}\widehat{\ell}_{s,x}-\sum_{s=1}^{t_{0}}\ell_{s,x_{s}}\right) (by Eq. (25))
1C21(2C1t0+s=1t0^s,xs=1t0ys)\displaystyle\leq\frac{1}{C_{2}-1}\left(2\sqrt{C_{1}t_{0}}+\sum_{s=1}^{t_{0}}\widehat{\ell}_{s,x}-\sum_{s=1}^{t_{0}}y_{s}\right) (by Azuma’s inequality)
1C21(2C1t0+5fTC1t0+s=1t0^s,xs=1t0^s,x^)\displaystyle\leq\frac{1}{C_{2}-1}\left(2\sqrt{C_{1}t_{0}}+5\sqrt{f_{T}C_{1}t_{0}}+\sum_{s=1}^{t_{0}}\widehat{\ell}_{s,x}-\sum_{s=1}^{t_{0}}\widehat{\ell}_{s,\widehat{x}}\right) (by Eq. (7))
=1C21(7fTC1t0+t0Δ^x).\displaystyle=\frac{1}{C_{2}-1}\left(7\sqrt{f_{T}C_{1}t_{0}}+t_{0}\widehat{\Delta}_{x}\right). (26)

Bounding the deviation of (tt0)Robt,x(t-t_{0})\text{Rob}_{t,x} for xx^x\neq\widehat{x}:

For all xx^x\neq\widehat{x}, the variance of ^τ,x\widehat{\ell}_{\tau,x} is bounded as follows:

Var(^τ,x)𝔼[^τ,x2]𝔼[xS~τ1xτxτS~τ1xyτ2]xS~τ122xSτ12,\displaystyle\text{Var}(\widehat{\ell}_{\tau,x})\leq\mathbb{E}\left[\widehat{\ell}_{\tau,x}^{2}\right]\leq\mathbb{E}\left[x^{\top}\widetilde{S}_{\tau}^{-1}x_{\tau}x_{\tau}^{\top}\widetilde{S}_{\tau}^{-1}x\cdot y_{\tau}^{2}\right]\leq\|x\|_{\widetilde{S}_{\tau}^{-1}}^{2}\leq 2\|x\|_{S_{\tau}^{-1}}^{2}, (27)

where the last inequality is due to S~τ=12x^x^+12Sτ12Sτ\widetilde{S}_{\tau}=\frac{1}{2}\widehat{x}\widehat{x}^{\top}+\frac{1}{2}S_{\tau}\succeq\frac{1}{2}S_{\tau}. Therefore, using Lemma 29 with μi=i,x\mu_{i}=\ell_{i,x}, with probability at least 12δ1-2\delta, for all tt in Phase 2 and all xx^x\neq\widehat{x},

|(tt0)Robt,xτ=t0+1tτ,x|\displaystyle\left|(t-t_{0})\cdot\text{Rob}_{t,x}-\sum_{\tau=t_{0}+1}^{t}\ell_{\tau,x}\right| αx(2τ=t0+1txSτ12+τ=t0+1t(τ,x1tt0τ=t0+1tτ,x)2)+2logt2|𝒳|δαx\displaystyle\leq\alpha_{x}\left(2\sum_{\tau=t_{0}+1}^{t}\|x\|_{S_{\tau}^{-1}}^{2}+\sum_{\tau=t_{0}+1}^{t}\left(\ell_{\tau,x}-\frac{1}{t-t_{0}}\sum_{\tau^{\prime}=t_{0}+1}^{t}\ell_{\tau^{\prime},x}\right)^{2}\right)+\frac{2\log\frac{t^{2}|\mathcal{X}|}{\delta}}{\alpha_{x}}
αxτ=t0+1t(2xSτ12+1)+4logt|𝒳|δαx\displaystyle\leq\alpha_{x}\sum_{\tau=t_{0}+1}^{t}\left(2\|x\|_{S_{\tau}^{-1}}^{2}+1\right)+\frac{4\log\frac{t|\mathcal{X}|}{\delta}}{\alpha_{x}}
24τ=t0+1t(2xSτ12+1)logt|𝒳|δ\displaystyle\leq 2\sqrt{4\sum_{\tau=t_{0}+1}^{t}\left(2\|x\|_{S_{\tau}^{-1}}^{2}+1\right)\log\frac{t|\mathcal{X}|}{\delta}} (Choose αx\alpha_{x} optimally)
24logt|𝒳|δτ=t0+1t(2τΔ^x2βτ+9d),\displaystyle\leq 2\sqrt{4\log\frac{t|\mathcal{X}|}{\delta}\sum_{\tau=t_{0}+1}^{t}\left(\frac{2\tau\widehat{\Delta}_{x}^{2}}{\beta_{\tau}}+9d\right)}, (28)

where the last inequality is due to the constraint of OP. For τt0\tau\geq t_{0}, since

Δ^x=1t0(s=1t0^s,x^s,x^)20fTC1t0,\displaystyle\widehat{\Delta}_{x}=\frac{1}{t_{0}}\left(\sum_{s=1}^{t_{0}}\widehat{\ell}_{s,x}-\widehat{\ell}_{s,\widehat{x}}\right)\geq 20\sqrt{\frac{f_{T}C_{1}}{t_{0}}}, (29)

we have

Δ^x20fTC1τ20fTdβTτ20dβTτ20dβττ3dβτ2τ\displaystyle\widehat{\Delta}_{x}\geq 20\sqrt{\frac{f_{T}C_{1}}{\tau}}\geq 20\sqrt{\frac{f_{T}d\beta_{T}}{\tau}}\geq 20\sqrt{\frac{d\beta_{T}}{\tau}}\geq 20\sqrt{\frac{d\beta_{\tau}}{\tau}}\geq 3\sqrt{\frac{d\beta_{\tau}}{2\tau}}

and 9d2τΔ^x2βτ9d\leq\frac{2\tau\widehat{\Delta}_{x}^{2}}{\beta_{\tau}}. Note that h(τ)=τlog(τ|𝒳|/δ)h(\tau)=\frac{\tau}{\log(\tau|\mathcal{X}|/\delta)} is an increasing function when δ0.1\delta\leq 0.1. Using Eq. (28) and 9d2τΔ^x2βτ9d\leq\frac{2\tau\widehat{\Delta}_{x}^{2}}{\beta_{\tau}}, we have

|(tt0)Robt,xs=t0+1ts,x|24logt|𝒳|δτ=t0+1t4τΔ^x2βτ216t2Δ^x2logt|𝒳|δ1βttΔ^x16.\displaystyle\left|(t-t_{0})\cdot\text{Rob}_{t,x}-\sum_{s=t_{0}+1}^{t}\ell_{s,x}\right|\leq 2\sqrt{4\log\frac{t|\mathcal{X}|}{\delta}\sum_{\tau=t_{0}+1}^{t}\frac{4\tau\widehat{\Delta}_{x}^{2}}{\beta_{\tau}}}\leq 2\sqrt{16t^{2}\widehat{\Delta}_{x}^{2}\log\frac{t|\mathcal{X}|}{\delta}\frac{1}{\beta_{t}}}\leq\frac{t\widehat{\Delta}_{x}}{16}. (30)

For the first t0t_{0} rounds, according to Eq. (26), we have

|s=1t0(s,x^s,x)|1C21(7fTC1t0+t0Δ^x)1.4t0Δ^xC21,\displaystyle\left|\sum_{s=1}^{t_{0}}(\ell_{s,x}-\widehat{\ell}_{s,x})\right|\leq\frac{1}{C_{2}-1}\left(7\sqrt{f_{T}C_{1}t_{0}}+t_{0}\widehat{\Delta}_{x}\right)\leq\frac{1.4t_{0}\widehat{\Delta}_{x}}{C_{2}-1}, (31)

where the last inequality is due to Eq. (29). Combining Eq. (30) and Eq. (31) and noticing that C220C_{2}\geq 20, we have for all xx^x\neq\widehat{x},

|s=1t0(s,x^s,x)+s=t0+1t(s,xRobt,x)|1.7tΔ^x10.\displaystyle\left|\sum_{s=1}^{t_{0}}(\ell_{s,x}-\widehat{\ell}_{s,x})+\sum_{s=t_{0}+1}^{t}\left(\ell_{s,x}-\text{Rob}_{t,x}\right)\right|\leq\frac{1.7t\widehat{\Delta}_{x}}{10}. (32)

Bounding the deviation of s=1t^s,x^\sum_{s=1}^{t}\widehat{\ell}_{s,\widehat{x}} (recall that we use the standard average estimator for x^\widehat{x}):

For the first t0t_{0} rounds, according to Eq. (26), since Δ^x^=0\widehat{\Delta}_{\widehat{x}}=0, we have

|s=1t0(s,x^^s,x^)|fTC1t0.\displaystyle\left|\sum_{s=1}^{t_{0}}(\ell_{s,\widehat{x}}-\widehat{\ell}_{s,\widehat{x}})\right|\leq\sqrt{f_{T}C_{1}t_{0}}.

For tt0+1t\geq t_{0}+1, according to Freedman’s inequality and the fact that 𝔼t[^s,x^2]=𝔼t[ys2p~s,x^2𝟙{xs=x^}]1p~s,x^2\mathbb{E}_{t}\left[\widehat{\ell}_{s,\widehat{x}}^{2}\right]=\mathbb{E}_{t}\left[\frac{y_{s}^{2}}{\widetilde{p}^{2}_{s,\widehat{x}}}\cdot\mathbbm{1}\{x_{s}=\widehat{x}\}\right]\leq\frac{1}{\widetilde{p}_{s,\widehat{x}}}\leq 2 as p~s,x^12\widetilde{p}_{s,\widehat{x}}\geq\frac{1}{2}, we have with probability at least 1δ1-\delta,

|s=t0+1ts,x^s=t0+1t^s,x^|22tlog(t|𝒳|/δ)+2log(t|𝒳|/δ)C1t.\displaystyle\left|\sum_{s=t_{0}+1}^{t}\ell_{s,\widehat{x}}-\sum_{s=t_{0}+1}^{t}\widehat{\ell}_{s,\widehat{x}}\right|\leq 2\sqrt{2t\log(t|\mathcal{X}|/\delta)}+2\log(t|\mathcal{X}|/\delta)\leq\sqrt{C_{1}t}.

Combining the above two inequalities, we have

|s=1t(s,x^^s,x^)|3fTC1t.\displaystyle\left|\sum_{s=1}^{t}(\ell_{s,\widehat{x}}-\widehat{\ell}_{s,\widehat{x}})\right|\leq 3\sqrt{f_{T}C_{1}t}. (33)

In sum: combining the bounds for xx^x\neq\widehat{x} and x=x^x=\widehat{x}, we have for all xx^x\neq\widehat{x},

s=1t(s,xs,x^)\displaystyle\sum_{s=1}^{t}(\ell_{s,x}-\ell_{s,\widehat{x}}) s=1t1(s,xs,x^)2\displaystyle\geq\sum_{s=1}^{t-1}(\ell_{s,x}-\ell_{s,\widehat{x}})-2
s=1t0(^s,x^s,x^)+((tt01)Robt1,xs=t0+1t1^s,x^)\displaystyle\geq\sum_{s=1}^{t_{0}}\left(\widehat{\ell}_{s,x}-\widehat{\ell}_{s,\widehat{x}}\right)+\left((t-t_{0}-1)\text{Rob}_{t-1,x}-\sum_{s=t_{0}+1}^{t-1}\widehat{\ell}_{s,\widehat{x}}\right)
3fTC1(t1)1.7(t1)Δ^x102\displaystyle\qquad-3\sqrt{f_{T}C_{1}(t-1)}-\frac{1.7(t-1)\widehat{\Delta}_{x}}{10}-2 (Eq. (32) and Eq. (33))
(t1)Δ^t1,x4fTC1(t1)1.7(t1)Δ^x10\displaystyle\geq(t-1)\widehat{\Delta}_{t-1,x}-4\sqrt{f_{T}C_{1}(t-1)}-\frac{1.7(t-1)\widehat{\Delta}_{x}}{10} (by the definition of Δ^t1,x\widehat{\Delta}_{t-1,x} in Eq. (11))
(t1)Δ^t1,x3.7(t1)Δ^x10\displaystyle\geq(t-1)\widehat{\Delta}_{t-1,x}-\frac{3.7(t-1)\widehat{\Delta}_{x}}{10} (by Eq. (29))
0.02(t1)Δ^x>0,\displaystyle\geq 0.02(t-1)\widehat{\Delta}_{x}>0,

where the last inequality is because tt belongs to Phase 2, which means that at time t1t-1, Eq. (9) is satisfied. ∎

Now we are ready to prove our main lemma in the adversarial setting.

Proof of Lemma 9.

By Lemma 8, we know that the regret comparator is x^\widehat{x}. By the regret bound of 𝒜\mathcal{A} and the fact that t0L0t_{0}\geq L_{0} (recall L0L_{0} from Assumption 1), we have

s=1t0(s,xss,x^)O(C1t0).\displaystyle\sum_{s=1}^{t_{0}}(\ell_{s,x_{s}}-\ell_{s,\widehat{x}})\leq O\left(\sqrt{C_{1}t_{0}}\right).

For the regret in Phase 2, first note that it suffices to consider tt not being the last round of this phase (since the last round contributes at most 2 to the regret). Then, consider the following decomposition:

s=t0+1t(s,xss,x^)\displaystyle\sum_{s=t_{0}+1}^{t}(\ell_{s,x_{s}}-\ell_{s,\widehat{x}}) s=t0+1t1(ysϵs(xs)s,x^)+2\displaystyle\leq\sum_{s=t_{0}+1}^{t-1}(y_{s}-\epsilon_{s}(x_{s})-\ell_{s,\widehat{x}})+2
=s=t0+1t1(ys^s,x^)Term 1+s=t0+1t1(^s,x^s,x^ϵs(x^))Term 2+s=t0+1t1(ϵs(x^)ϵs(xs))Term 3+2.\displaystyle=\underbrace{\sum_{s=t_{0}+1}^{t-1}\left(y_{s}-\widehat{\ell}_{s,\widehat{x}}\right)}_{\textsc{Term 1}}+\underbrace{\sum_{s=t_{0}+1}^{t-1}\left(\widehat{\ell}_{s,\widehat{x}}-\ell_{s,\widehat{x}}-\epsilon_{s}(\widehat{x})\right)}_{\textsc{Term 2}}+\underbrace{\sum_{s=t_{0}+1}^{t-1}\left(\epsilon_{s}(\widehat{x})-\epsilon_{s}(x_{s})\right)}_{\textsc{Term 3}}+2.

Term 1 is upper bounded by 𝒪(fTC1t0)\mathcal{O}\left(\sqrt{f_{T}C_{1}t_{0}}\right) since it corresponds to the termination condition Eq. (10).

Term 2 is a martingale difference sequence since

𝔼t[^s,x^s,x^ϵs(x^)]=𝔼t[(s,x^+ϵs(x^))𝕀{xs=x^}p~s,x^(s,x^+ϵs(x^))]=0.\displaystyle\mathbb{E}_{t}\left[\widehat{\ell}_{s,\widehat{x}}-\ell_{s,\widehat{x}}-\epsilon_{s}(\widehat{x})\right]=\mathbb{E}_{t}\left[\frac{(\ell_{s,\widehat{x}}+\epsilon_{s}(\widehat{x}))\mathbb{I}\{x_{s}=\widehat{x}\}}{\widetilde{p}_{s,\widehat{x}}}-(\ell_{s,\widehat{x}}+\epsilon_{s}(\widehat{x}))\right]=0.

The variance is upper bounded by

𝔼t[(^s,x^s,x^ϵs(x^))2]\displaystyle\mathbb{E}_{t}\left[\left(\widehat{\ell}_{s,\widehat{x}}-\ell_{s,\widehat{x}}-\epsilon_{s}(\widehat{x})\right)^{2}\right] =𝔼t[(s,x^+ϵs(x^))2(𝕀{xs=x^}p~s,x^1)2]\displaystyle=\mathbb{E}_{t}\left[(\ell_{s,\widehat{x}}+\epsilon_{s}(\widehat{x}))^{2}\left(\frac{\mathbb{I}\{x_{s}=\widehat{x}\}}{\widetilde{p}_{s,\widehat{x}}}-1\right)^{2}\right]
p~s,x^(1p~s,x^1)2+(1p~s,x^)\displaystyle\leq\widetilde{p}_{s,\widehat{x}}\left(\frac{1}{\widetilde{p}_{s,\widehat{x}}}-1\right)^{2}+(1-\widetilde{p}_{s,\widehat{x}})
2(1p~s,x^),\displaystyle\leq 2(1-\widetilde{p}_{s,\widehat{x}}), (34)

where the last inequality is because p~s,x^12\widetilde{p}_{s,\widehat{x}}\geq\frac{1}{2}.

Term 3 is also a martingale difference sequence. As ϵs(x)[2,2]\epsilon_{s}(x)\in[-2,2], its variance can be upper bounded by

𝔼t[(ϵs(x^)ϵs(xs))2]16𝔼t[𝕀{xsx^}]=16(1p~s,x^).\displaystyle\mathbb{E}_{t}\left[\left(\epsilon_{s}(\widehat{x})-\epsilon_{s}(x_{s})\right)^{2}\right]\leq 16\mathbb{E}_{t}\left[\mathbb{I}\{x_{s}\neq\widehat{x}\}\right]=16(1-\widetilde{p}_{s,\widehat{x}}). (35)

Therefore, with probability at least 1δ/t1-\delta/t, we have Term 2+Term 3=𝒪(s=t0+1t(1p~s,x^)log(t/δ)+log(t/δ))\textsc{Term 2}+\textsc{Term 3}=\mathcal{O}\left(\sqrt{\sum_{s=t_{0}+1}^{t}(1-\widetilde{p}_{s,\widehat{x}})\log(t/\delta)}+\log(t/\delta)\right) by Freedman’s inequality.

As pt=OP(t,Δ^)p_{t}=\textbf{OP}(t,\widehat{\Delta}) and tt0400C1fTΔ^min216dβtΔ^min2t\geq t_{0}\geq\frac{400C_{1}f_{T}}{\widehat{\Delta}_{\min}^{2}}\geq\frac{16d\beta_{t}}{\widehat{\Delta}_{\min}^{2}}, by Lemma 12, we have

1p~t,x^=12(1pt,x^)12xpt,xΔ^xΔ^min12dβttΔ^min2.\displaystyle 1-\widetilde{p}_{t,\widehat{x}}=\frac{1}{2}\left(1-p_{t,\widehat{x}}\right)\leq\frac{1}{2}\sum_{x}p_{t,x}\frac{\widehat{\Delta}_{x}}{\widehat{\Delta}_{\min}}\leq\frac{12d\beta_{t}}{t\widehat{\Delta}_{\min}^{2}}. (36)

Combining the above with Δ^min20C1fTt0\widehat{\Delta}_{\min}\geq 20\sqrt{\frac{C_{1}f_{T}}{t_{0}}}, we get

Term 2+Term 3\displaystyle\textsc{Term 2}+\textsc{Term 3} =𝒪(log(t/δ)s=t0+1tdβtsΔ^min2+log(t/δ))\displaystyle=\mathcal{O}\left(\sqrt{\log(t/\delta)\sum_{s=t_{0}+1}^{t}\frac{d\beta_{t}}{s\widehat{\Delta}_{\min}^{2}}}+\log(t/\delta)\right)
=𝒪(dt0βtlog(t/δ)log(t)fTC1+log(t/δ))\displaystyle=\mathcal{O}\left(\sqrt{\frac{dt_{0}\beta_{t}\log(t/\delta)\log(t)}{f_{T}C_{1}}}+\log(t/\delta)\right)
=𝒪(t0log(t/δ)+log(t/δ))\displaystyle=\mathcal{O}\left(\sqrt{t_{0}\log(t/\delta)}+\log(t/\delta)\right)

where the last step uses the definition of βt\beta_{t} and C1215dlog(T|𝒳|/δ)C_{1}\geq 2^{15}d\log(T|\mathcal{X}|/\delta) from Assumption 1.

Combining all bounds above, we have shown

s=1t(s,xss,x^)=𝒪(C1t0fT),\displaystyle\sum_{s=1}^{t}(\ell_{s,x_{s}}-\ell_{s,\widehat{x}})=\mathcal{O}\left(\sqrt{C_{1}t_{0}f_{T}}\right),

proving the lemma. ∎

Theorem 7 can then be proven by directly applying Lemma 9 to each epoch and using the fact that the number of epochs is at most 𝒪(logT)\mathcal{O}(\log T).

C.2 Analysis of Algorithm 1 in the corrupted stochastic setting

In this section, we prove our results in the corrupted setting. To prove the main lemma Lemma 10, we separate the proof into two parts, Lemma 20 and Lemma 22.

Lemma 20.

In the stochastic setting with corruptions, within a single epoch,

  1. with probability at least 14δ1-4\delta, t0max{900fTC1Δmin2,900C2fTC1,L}t_{0}\leq\max\left\{\frac{900f_{T}C_{1}}{\Delta_{\min}^{2}},\frac{900C^{2}}{f_{T}C_{1}},L\right\};

  2. if C130fTC1LC\leq\frac{1}{30}\sqrt{f_{T}C_{1}L}, then with probability at least 1δ1-\delta, x^=x\widehat{x}=x^{*};

  3. if C130fTC1LC\leq\frac{1}{30}\sqrt{f_{T}C_{1}L}, then with probability at least 12δ1-2\delta, t064fTC1Δmin2t_{0}\geq\frac{64f_{T}C_{1}}{\Delta_{\min}^{2}};

  4. if C130fTC1LC\leq\frac{1}{30}\sqrt{f_{T}C_{1}L}, then with probability at least 13δ1-3\delta, Δ^x[0.7Δx,1.3Δx]\widehat{\Delta}_{x}\in[0.7\Delta_{x},1.3\Delta_{x}] for all xxx\neq x^{*}.

Proof.

In the corrupted setting, we can identify +ct\ell+c_{t} with t\ell_{t} in the adversarial setting. We first show the following property: at any tt in Phase 1 and with probability at least 1δ1-\delta, for any xx,

C2|s=1t(s,x^s,x)|C1t+tΔx+2C.\displaystyle C_{2}\left|\sum_{s=1}^{t}(\ell_{s,x}-\widehat{\ell}_{s,x})\right|\leq\sqrt{C_{1}t}+t\Delta_{x}+2C. (37)

By the guarantee of 𝒜\mathcal{A}, we have with probability at least 1δ1-\delta, for any xx and t[T]t\in[T]

C2|s=1t(s,x^s,x)|C1t+s=1ts,xs=1ts,xs.\displaystyle C_{2}\left|\sum_{s=1}^{t}(\ell_{s,x}-\widehat{\ell}_{s,x})\right|\leq\sqrt{C_{1}t}+\sum_{s=1}^{t}\ell_{s,x}-\sum_{s=1}^{t}\ell_{s,x_{s}}. (38)

Since t,xtt,xmaxx𝒳|ct,x|\ell_{t,x_{t}}\geq\ell_{t,x^{*}}-\max_{x\in\mathcal{X}}|c_{t,x}|, we have for any tt,

s=1ts,xss=1ts,xC.\displaystyle\sum_{s=1}^{t}\ell_{s,x_{s}}\geq\sum_{s=1}^{t}\ell_{s,x^{*}}-C. (39)

Combining Eq. (38) and Eq. (39), and using s,xs,xΔx+maxx𝒳|cs,x|\ell_{s,x}-\ell_{s,x^{*}}\leq\Delta_{x}+\max_{x^{\prime}\in\mathcal{X}}|c_{s,x^{\prime}}| for any x𝒳x\in\mathcal{X}, we get Eq. (37).

Below, we define Devt,x|s=1t(s,x^s,x)|\textsc{Dev}_{t,x}\triangleq\left|\sum_{s=1}^{t}(\ell_{s,x}-\widehat{\ell}_{s,x})\right|.

Claim 1’s proof:

Let t=max{900fTC1Δmin2,900C2fTC1,L}t=\max\left\{\frac{900f_{T}C_{1}}{\Delta_{\min}^{2}},\frac{900C^{2}}{f_{T}C_{1}},L\right\}. Below we prove that if Phase 1 has not finished before time tt, then for the choice of x^=x\widehat{x}=x^{*}, both Eq. (7) and Eq. (8) hold with high probability at time tt.

Consider Eq. (7). With probability at least 12δ1-2\delta,

s=1tys\displaystyle\sum_{s=1}^{t}y_{s} s=1ts,xsC1t\displaystyle\geq\sum_{s=1}^{t}\ell_{s,x_{s}}-\sqrt{C_{1}t} (by Azuma’s inequality)
s=1ts,xC1tC\displaystyle\geq\sum_{s=1}^{t}\ell_{s,x^{*}}-\sqrt{C_{1}t}-C (by Eq. (39))
s=1t^s,x2C1t3C\displaystyle\geq\sum_{s=1}^{t}\widehat{\ell}_{s,x^{*}}-2\sqrt{C_{1}t}-3C (by Eq. (37) and Δx=0\Delta_{x^{*}}=0)
s=1t^s,x3fTC1t,\displaystyle\geq\sum_{s=1}^{t}\widehat{\ell}_{s,x^{*}}-3\sqrt{f_{T}C_{1}t}, (t900C2fTC1t\geq\frac{900C^{2}}{f_{T}C_{1}} and fTC1t30C\sqrt{f_{T}C_{1}t}\geq 30C)

showing that Eq. (7) holds for x^=x\widehat{x}=x^{*}.

For Eq. (8), by the regret bound of 𝒜\mathcal{A}, with probability at least 12δ1-2\delta, for xxx\neq x^{*},

s=1tyss=1t^s,x\displaystyle\sum_{s=1}^{t}y_{s}-\sum_{s=1}^{t}\widehat{\ell}_{s,x}
=s=1t(yss,xs)+s=1t(s,xss,x)+s=1t(s,xs,x)+s=1t(s,x^s,x)\displaystyle=\sum_{s=1}^{t}(y_{s}-\ell_{s,x_{s}})+\sum_{s=1}^{t}(\ell_{s,x_{s}}-\ell_{s,x^{*}})+\sum_{s=1}^{t}(\ell_{s,x^{*}}-\ell_{s,x})+\sum_{s=1}^{t}(\ell_{s,x}-\widehat{\ell}_{s,x})
C1t+(C1tC2Devt,x)+(tΔx+C)+Devt,x\displaystyle\leq\sqrt{C_{1}t}+\left(\sqrt{C_{1}t}-C_{2}\textsc{Dev}_{t,x^{*}}\right)+\left(-t\Delta_{x}+C\right)+\textsc{Dev}_{t,x} (by the regret bound of 𝒜\mathcal{A} and Azuma’s inequality)
(2+130)fTC1ttΔx+1C2(C1t+tΔx+2C)\displaystyle\leq\left(2+\frac{1}{30}\right)\sqrt{f_{T}C_{1}t}-t\Delta_{x}+\frac{1}{C_{2}}\left(\sqrt{C_{1}t}+t\Delta_{x}+2C\right) (by Eq. (37) and that 30CfTC1t30C\leq\sqrt{f_{T}C_{1}t})
0.95tΔx+2.1fTC1t.\displaystyle\leq-0.95t\Delta_{x}+2.1\sqrt{f_{T}C_{1}t}. (C220C_{2}\geq 20)

By the condition of tt, we have tΔx30fTC1tt\Delta_{x}\geq 30\sqrt{f_{T}C_{1}t} for all xxx\neq x^{*}. Thus, the last expression can further be upper bounded by (30×0.95+2.1)fTC1t25fTC1t(-30\times 0.95+2.1)\sqrt{f_{T}C_{1}t}\leq-25\sqrt{f_{T}C_{1}t}, indicating that Eq. (8) also holds for all xxx\neq x^{*}. Combining the two parts above finishes the proof.

Claim 2’s proof:

Note that Eq. (7) and Eq. (8) jointly imply that

s=1t0(^s,x^s,x^)20fTC1t0xx^.\displaystyle\sum_{s=1}^{t_{0}}(\widehat{\ell}_{s,x}-\widehat{\ell}_{s,\widehat{x}})\geq 20\sqrt{f_{T}C_{1}t_{0}}\qquad\forall x\neq\widehat{x}. (40)

However, with probability at least 1δ1-\delta, for any xxx\neq x^{*},

s=1t0(^s,x^s,x)\displaystyle\sum_{s=1}^{t_{0}}(\widehat{\ell}_{s,x^{*}}-\widehat{\ell}_{s,x}) =s=1t0(^s,xs,x)+s=1t0(s,xs,x)+s=1t0(s,x^s,x)\displaystyle=\sum_{s=1}^{t_{0}}(\widehat{\ell}_{s,x^{*}}-\ell_{s,x^{*}})+\sum_{s=1}^{t_{0}}(\ell_{s,x^{*}}-\ell_{s,x})+\sum_{s=1}^{t_{0}}(\ell_{s,x}-\widehat{\ell}_{s,x})
Devt0,x+(t0Δx+C)+Devt0,x\displaystyle\leq\textsc{Dev}_{t_{0},x^{*}}+\left(-t_{0}\Delta_{x}+C\right)+\textsc{Dev}_{t_{0},x}
1C2(C1t0+2C)+(t0Δx+C)+1C2(C1t0+t0Δx+2C)\displaystyle\leq\frac{1}{C_{2}}\left(\sqrt{C_{1}t_{0}}+2C\right)+\left(-t_{0}\Delta_{x}+C\right)+\frac{1}{C_{2}}\left(\sqrt{C_{1}t_{0}}+t_{0}\Delta_{x}+2C\right) (by Eq. (37))
5fTC1t0.\displaystyle\leq 5\sqrt{f_{T}C_{1}t_{0}}. (using C130fTC1t0C\leq\frac{1}{30}\sqrt{f_{T}C_{1}t_{0}} and C220C_{2}\geq 20)

Therefore, to make Eq. (40) hold, it must be that x^=x\widehat{x}=x^{*}.

Claim 3’s proof:

Suppose for contradiction that t064fTC1Δmin2t_{0}\leq\frac{64f_{T}C_{1}}{\Delta_{\min}^{2}}, and let xx be such that Δx=Δmin\Delta_{x}=\Delta_{\min}. Then we have

s=1t0(^s,x^s,x)\displaystyle\sum_{s=1}^{t_{0}}\left(\widehat{\ell}_{s,x}-\widehat{\ell}_{s,x^{*}}\right) s=1t0(s,xs,x)+Devt0,x+Devt0,x\displaystyle\leq\sum_{s=1}^{t_{0}}\left(\ell_{s,x}-\ell_{s,x^{*}}\right)+\textsc{Dev}_{t_{0},x}+\textsc{Dev}_{t_{0},x^{*}} (hold w.p. 1δ1-\delta)
(t0Δmin+C)+1C2(2C1t0+t0Δmin+4C)\displaystyle\leq\left(t_{0}\Delta_{\min}+C\right)+\frac{1}{C_{2}}\left(2\sqrt{C_{1}t_{0}}+t_{0}\Delta_{\min}+4C\right) (hold w.p. 1δ1-\delta by Eq. (37))
2t0Δmin+2fTC1t0\displaystyle\leq 2t_{0}\Delta_{\min}+2\sqrt{f_{T}C_{1}t_{0}} (by C130fTC1t0C\leq\frac{1}{30}\sqrt{f_{T}C_{1}t_{0}} and C220C_{2}\geq 20)
16fTC1t0+2fTC1t0\displaystyle\leq 16\sqrt{f_{T}C_{1}t_{0}}+2\sqrt{f_{T}C_{1}t_{0}} (t064fTC1Δmin2t_{0}\leq\frac{64f_{T}C_{1}}{\Delta_{\min}^{2}})
=18fTC1t0.\displaystyle=18\sqrt{f_{T}C_{1}t_{0}}.

Recall that Eq. (40) needs to hold, and recall from Claim 2 that x^=x\widehat{x}=x^{*} holds with probability 1δ1-\delta. The bound above therefore contradicts Eq. (40). Hence, with probability 12δ1-2\delta, t064fTC1Δmin2t_{0}\geq\frac{64f_{T}C_{1}}{\Delta_{\min}^{2}}.

Claim 4’s proof:

For notational convenience, denote the set [ab,a+b][a-b,a+b] by [a±b][a\pm b]. We have

t0Δ^x\displaystyle t_{0}\widehat{\Delta}_{x} =s=1t0(^s,x^s,x)[s=1t0(s,xs,x)±(Devt0,x+Devt0,x)]\displaystyle=\sum_{s=1}^{t_{0}}\left(\widehat{\ell}_{s,x}-\widehat{\ell}_{s,x^{*}}\right)\in\left[\sum_{s=1}^{t_{0}}(\ell_{s,x}-\ell_{s,x^{*}})\pm(\textsc{Dev}_{t_{0},x}+\textsc{Dev}_{t_{0},x^{*}})\right]
[t0Δx±(C+1C2(2C1t0+t0Δx+4C))]\displaystyle\subseteq\left[t_{0}\Delta_{x}\pm\left(C+\frac{1}{C_{2}}\left(2\sqrt{C_{1}t_{0}}+t_{0}\Delta_{x}+4C\right)\right)\right] (hold w.p. 1δ1-\delta by Eq. (37))
[t0Δx±(1C2t0Δx+fTC1t0)]\displaystyle\subseteq\left[t_{0}\Delta_{x}\pm\left(\frac{1}{C_{2}}t_{0}\Delta_{x}+\sqrt{f_{T}C_{1}t_{0}}\right)\right] (using C130fTC1t0C\leq\frac{1}{30}\sqrt{f_{T}C_{1}t_{0}} and C220C_{2}\geq 20)
[t0Δx±(1C2t0Δx+18t0Δx)]\displaystyle\subseteq\left[t_{0}\Delta_{x}\pm\left(\frac{1}{C_{2}}t_{0}\Delta_{x}+\frac{1}{8}t_{0}\Delta_{x}\right)\right] (by Claim 3, t064fTC1Δmin2t_{0}\geq\frac{64f_{T}C_{1}}{\Delta_{\min}^{2}} holds w.p. 12δ1-2\delta)
[(1±0.3)t0Δx],\displaystyle\subseteq\left[\left(1\pm 0.3\right)t_{0}\Delta_{x}\right],

which finishes the proof. ∎

The next lemma shows that when LL grows large enough compared to the total corruption CC, the termination condition Eq. (10) will never be satisfied once the algorithm enters Phase 22.

Lemma 21.

Algorithm 1 guarantees that with probability at least 110δ1-10\delta, for any tt in Phase 22, when 0C130fTC1L0\leq C\leq\frac{1}{30}\sqrt{f_{T}C_{1}L}, we have

s=t0+1t(ys^s,x^)\displaystyle\sum_{s=t_{0}+1}^{t}\left(y_{s}-\widehat{\ell}_{s,\widehat{x}}\right) 20fTC1t0.\displaystyle\leq 20\sqrt{f_{T}C_{1}t_{0}}.

Furthermore, when tM=10βtMt\geq M^{\prime}=10\beta_{t}M (MM is the constant defined in Lemma 14), we have

s=t0+1t(ys^s,x^)\displaystyle\sum_{s=t_{0}+1}^{t}\left(y_{s}-\widehat{\ell}_{s,\widehat{x}}\right) =𝒪(c(𝒳,θ)logTlog(T|𝒳|/δ)+fTC1t0+dβMM).\displaystyle=\mathcal{O}\left(c(\mathcal{X},\theta)\log T\log(T|\mathcal{X}|/\delta)+\sqrt{f_{T}C_{1}t_{0}}+d\beta_{M^{\prime}}\sqrt{M^{\prime}}\right).
Proof.

Recall that ys=s,xs+ϵs(xs)y_{s}=\ell_{s,x_{s}}+\epsilon_{s}(x_{s}) and ^s,x^=s,x^+ϵs(x^)p~s,x^𝟙{xs=x^}\widehat{\ell}_{s,\widehat{x}}=\frac{\ell_{s,\widehat{x}}+\epsilon_{s}(\widehat{x})}{\widetilde{p}_{s,\widehat{x}}}\mathbbm{1}\{x_{s}=\widehat{x}\}. Thus,

s=t0+1t(ys^s,x^)\displaystyle\sum_{s=t_{0}+1}^{t}\left(y_{s}-\widehat{\ell}_{s,\widehat{x}}\right)
=s=t0+1t(s,xss,x^)Term 1+s=t0+1t(s,x^s,x^p~s,x^𝟙{xs=x^})Term 2+s=t0+1t(ϵs(xs)ϵs(x^))Term 3+s=t0+1t(ϵs(x^)ϵs(x^)p~s,x^𝟙{xs=x^})Term 4\displaystyle=\underbrace{\sum_{s=t_{0}+1}^{t}\left(\ell_{s,x_{s}}-\ell_{s,\widehat{x}}\right)}_{\textsc{Term 1}}+\underbrace{\sum_{s=t_{0}+1}^{t}\left(\ell_{s,\widehat{x}}-\frac{\ell_{s,\widehat{x}}}{\widetilde{p}_{s,\widehat{x}}}\mathbbm{1}\{x_{s}=\widehat{x}\}\right)}_{\textsc{Term 2}}+\underbrace{\sum_{s=t_{0}+1}^{t}\left(\epsilon_{s}(x_{s})-\epsilon_{s}(\widehat{x})\right)}_{\textsc{Term 3}}+\underbrace{\sum_{s=t_{0}+1}^{t}\left(\epsilon_{s}(\widehat{x})-\frac{\epsilon_{s}(\widehat{x})}{\widetilde{p}_{s,\widehat{x}}}\mathbbm{1}\{x_{s}=\widehat{x}\}\right)}_{\textsc{Term 4}}

Except for Term 1, all terms are martingale difference sequences. Let 𝔼0\mathbb{E}_{0} be the expectation taken over the randomness before Phase 2. Similar to the calculation in Eq. (34) and Eq. (35), we have

𝔼s[(s,x^s,x^p~s,x^𝟙{xs=x^})2]2𝔼0[1p~s,x^],𝔼s[(ϵs(x^)ϵs(x^)p~s,x^𝟙{xs=x^})2]8𝔼0[1p~s,x^]\displaystyle\mathbb{E}_{s}\left[\left(\ell_{s,\widehat{x}}-\frac{\ell_{s,\widehat{x}}}{\widetilde{p}_{s,\widehat{x}}}\mathbbm{1}\{x_{s}=\widehat{x}\}\right)^{2}\right]\leq 2\mathbb{E}_{0}[1-\widetilde{p}_{s,\widehat{x}}],\quad\qquad\mathbb{E}_{s}\left[\left(\epsilon_{s}(\widehat{x})-\frac{\epsilon_{s}(\widehat{x})}{\widetilde{p}_{s,\widehat{x}}}\mathbbm{1}\{x_{s}=\widehat{x}\}\right)^{2}\right]\leq 8\mathbb{E}_{0}[1-\widetilde{p}_{s,\widehat{x}}]

and

𝔼s[(ϵs(xs)ϵs(x^))2]16𝔼0[1p~s,x^].\displaystyle\mathbb{E}_{s}\left[\left(\epsilon_{s}(x_{s})-\epsilon_{s}(\widehat{x})\right)^{2}\right]\leq 16\mathbb{E}_{0}\left[1-\widetilde{p}_{s,\widehat{x}}\right].

By Freedman’s inequality, we have with probability at least 13δ1-3\delta, for all tt in Phase 22,

Term 2+Term 3+Term 4\displaystyle\textsc{Term 2}+\textsc{Term 3}+\textsc{Term 4}
22s=t0+1t𝔼0[1p~s,x^]log(T/δ)+log(T/δ)+216s=t0+1t𝔼0[1p~s,x^]log(T/δ)+4log(T/δ)\displaystyle\leq 2\sqrt{2\sum_{s=t_{0}+1}^{t}\mathbb{E}_{0}[1-\widetilde{p}_{s,\widehat{x}}]\log(T/\delta)}+\log(T/\delta)+2\sqrt{16\sum_{s=t_{0}+1}^{t}\mathbb{E}_{0}[1-\widetilde{p}_{s,\widehat{x}}]\log(T/\delta)}+4\log(T/\delta)
+28s=t0+1t𝔼0[1p~s,x^]log(T/δ)+2log(T/δ)\displaystyle\qquad+2\sqrt{8\sum_{s=t_{0}+1}^{t}\mathbb{E}_{0}[1-\widetilde{p}_{s,\widehat{x}}]\log(T/\delta)}+2\log(T/\delta)
20s=t0+1t𝔼0[1p~s,x^]log(T/δ)+7log(T/δ).\displaystyle\leq 20\sqrt{\sum_{s=t_{0}+1}^{t}\mathbb{E}_{0}[1-\widetilde{p}_{s,\widehat{x}}]\log(T/\delta)}+7\log(T/\delta).

Then we deal with Term 1. Again, by Freedman’s inequality, with probability at least 1δ1-\delta, for all tt in Phase 2,

s=t0+1t(s,xss,x^)\displaystyle\sum_{s=t_{0}+1}^{t}(\ell_{s,x_{s}}-\ell_{s,\widehat{x}}) s=t0+1txx^p~s,x(s,xs,x^)+4s=t0+1t𝔼0[1p~s,x^]log(T/δ)+2log(T/δ)\displaystyle\leq\sum_{s=t_{0}+1}^{t}\sum_{x\neq\widehat{x}}\widetilde{p}_{s,x}(\ell_{s,x}-\ell_{s,\widehat{x}})+4\sqrt{\sum_{s=t_{0}+1}^{t}\mathbb{E}_{0}[1-\widetilde{p}_{s,\widehat{x}}]\log(T/\delta)}+2\log(T/\delta)
C+s=t0+1txx^p~s,x(ΔxΔx^)+4s=t0+1t𝔼0[1p~s,x^]log(T/δ)+2log(T/δ)\displaystyle\leq C+\sum_{s=t_{0}+1}^{t}\sum_{x\neq\widehat{x}}\widetilde{p}_{s,x}(\Delta_{x}-\Delta_{\widehat{x}})+4\sqrt{\sum_{s=t_{0}+1}^{t}\mathbb{E}_{0}[1-\widetilde{p}_{s,\widehat{x}}]\log(T/\delta)}+2\log(T/\delta)
C+12s=t0+1txx^ps,xΔx+4s=t0+1t𝔼0[1p~s,x^]log(T/δ)+2log(T/δ).\displaystyle\leq C+\frac{1}{2}\sum_{s=t_{0}+1}^{t}\sum_{x\neq\widehat{x}}p_{s,x}\Delta_{x}+4\sqrt{\sum_{s=t_{0}+1}^{t}\mathbb{E}_{0}[1-\widetilde{p}_{s,\widehat{x}}]\log(T/\delta)}+2\log(T/\delta). (p~s,x=12ps,x\widetilde{p}_{s,x}=\frac{1}{2}p_{s,x} for xx^x\neq\widehat{x})

When C[0,130fTC1L][0,130fTC1t0]C\in[0,\frac{1}{30}\sqrt{f_{T}C_{1}L}]\subseteq[0,\frac{1}{30}\sqrt{f_{T}C_{1}t_{0}}], according to Lemma 20, we know that with probability 14δ1-4\delta, x^=x\widehat{x}=x^{*} and Δ^x[0.7Δx,1.3Δx]\widehat{\Delta}_{x}\in[0.7\Delta_{x},1.3\Delta_{x}].

Also by Lemma 20, with probability 12δ1-2\delta, for any st0s\geq t_{0}, we have st064fTC1Δmin248dβsΔmin2s\geq t_{0}\geq\frac{64f_{T}C_{1}}{\Delta_{\min}^{2}}\geq\frac{48d\beta_{s}}{\Delta_{\min}^{2}}. These conditions satisfy the requirement in Lemma 13 with r=3r=3. Therefore we can apply Lemma 13 and get

x𝒳ps,xΔx72dβsΔmins\displaystyle\sum_{x\in\mathcal{X}}p_{s,x}\Delta_{x}\leq\frac{72d\beta_{s}}{\Delta_{\min}s} (42)

for all st0s\geq t_{0}. Combining all the above, we get

s=t0+1t(ys^s,x^)C+72dβtlogtΔmin+24s=t0+1t𝔼0[1p~s,x^]log(T/δ)+9log(T/δ).\displaystyle\sum_{s=t_{0}+1}^{t}\left(y_{s}-\widehat{\ell}_{s,\widehat{x}}\right)\leq C+\frac{72d\beta_{t}\log t}{\Delta_{\min}}+24\sqrt{\sum_{s=t_{0}+1}^{t}\mathbb{E}_{0}[1-\widetilde{p}_{s,\widehat{x}}]\log(T/\delta)}+9\log(T/\delta).

As argued in Eq. (36), 1p~s,x^12dβsΔ^min2s1-\widetilde{p}_{s,\widehat{x}}\leq\frac{12d\beta_{s}}{\widehat{\Delta}_{\min}^{2}s}. Therefore, the above can be further upper bounded by

s=t0+1t(ys^s,x^)\displaystyle\sum_{s=t_{0}+1}^{t}\left(y_{s}-\widehat{\ell}_{s,\widehat{x}}\right) C+96dβtlogtΔ^min+24s=t0+1t12dβsΔ^min2slog(T/δ)\displaystyle\leq C+\frac{96d\beta_{t}\log t}{\widehat{\Delta}_{\min}}+24\sqrt{\sum_{s=t_{0}+1}^{t}\frac{12d\beta_{s}}{\widehat{\Delta}_{\min}^{2}s}\log(T/\delta)} (Δ^x[0.7Δx,1.3Δx]\widehat{\Delta}_{x}\in[0.7\Delta_{x},1.3\Delta_{x}], 1p~s,x^12dβsΔ^min2s1-\widetilde{p}_{s,\widehat{x}}\leq\frac{12d\beta_{s}}{\widehat{\Delta}_{\min}^{2}s})
C+96dβtlogtΔ^min+144dβTlogTΔ^min\displaystyle\leq C+\frac{96d\beta_{t}\log t}{\widehat{\Delta}_{\min}}+\frac{144d\beta_{T}\log T}{\widehat{\Delta}_{\min}} (by definition of βT\beta_{T})
C+10t0fTC1dβTlogT\displaystyle\leq C+10\sqrt{\frac{t_{0}}{f_{T}C_{1}}}d\beta_{T}\log T (t064fTC1Δmin224fTC1Δ^min2t_{0}\geq\frac{64f_{T}C_{1}}{\Delta_{\min}^{2}}\geq\frac{24f_{T}C_{1}}{\widehat{\Delta}_{\min}^{2}})
130fTC1t0+10t0fTC1dβTlogT\displaystyle\leq\frac{1}{30}\sqrt{f_{T}C_{1}t_{0}}+10\sqrt{\frac{t_{0}}{f_{T}C_{1}}}d\beta_{T}\log T (C130fTC1t0C\leq\frac{1}{30}\sqrt{f_{T}C_{1}t_{0}})
20fTC1t0.\displaystyle\leq 20\sqrt{f_{T}C_{1}t_{0}}. (C1dβTC_{1}\geq d\beta_{T})

Below, we use an alternative way to bound x𝒳ps,xΔx\sum_{x\in\mathcal{X}}p_{s,x}\Delta_{x}. Let M20βMMM^{\prime}\geq 20\beta_{M}M, which implies M10βMMM^{\prime}\geq 10\beta_{M^{\prime}}M. For s[t0+1,M]s\in[t_{0}+1,M^{\prime}], we use Lemma 4, and bound

s=t0+1Mx𝒳ps,xΔx10.7s=t0+1Mx𝒳ps,xΔ^x10.7s=t0+1Mdβss𝒪(dβMM).\displaystyle\sum_{s=t_{0}+1}^{M^{\prime}}\sum_{x\in\mathcal{X}}p_{s,x}\Delta_{x}\leq\frac{1}{0.7}\sum_{s=t_{0}+1}^{M^{\prime}}\sum_{x\in\mathcal{X}}p_{s,x}\widehat{\Delta}_{x}\leq\frac{1}{0.7}\sum_{s=t_{0}+1}^{M^{\prime}}\frac{d\beta_{s}}{\sqrt{s}}\leq\mathcal{O}\left(d\beta_{M^{\prime}}\sqrt{M^{\prime}}\right). (43)

For s>Ms>M^{\prime}, we use Lemma 15 and bound

s=M+1tx𝒳ps,xΔxs=M+1t𝒪(βssc(𝒳,θ))=𝒪(c(𝒳,θ)βtlogt).\displaystyle\sum_{s=M^{\prime}+1}^{t}\sum_{x\in\mathcal{X}}p_{s,x}\Delta_{x}\leq\sum_{s=M^{\prime}+1}^{t}\mathcal{O}\left(\frac{\beta_{s}}{s}c(\mathcal{X},\theta)\right)=\mathcal{O}\left(c(\mathcal{X},\theta)\beta_{t}\log t\right). (44)

Combining Eq. (43) and Eq. (44), and following an analysis similar to that of the previous case, we have

s=t0+1t(s,xss,x^)\displaystyle\sum_{s=t_{0}+1}^{t}\left(\ell_{s,x_{s}}-\ell_{s,\widehat{x}}\right) C+𝒪(dβMM+c(𝒳;θ)βtlogt)+4s=t0+1t𝔼0[1p~s,x^]log(T/δ)+2log(T/δ)\displaystyle\leq C+\mathcal{O}\left(d\beta_{M^{\prime}}\sqrt{M^{\prime}}+c(\mathcal{X};\theta)\beta_{t}\log t\right)+4\sqrt{\sum_{s=t_{0}+1}^{t}\mathbb{E}_{0}[1-\widetilde{p}_{s,\widehat{x}}]\log(T/\delta)}+2\log(T/\delta)
𝒪(c(𝒳,θ)logTlog(T|𝒳|/δ)+fTC1t0+dβMM).\displaystyle\leq\mathcal{O}\left(c(\mathcal{X},\theta)\log T\log(T|\mathcal{X}|/\delta)+\sqrt{f_{T}C_{1}t_{0}}+d\beta_{M^{\prime}}\sqrt{M^{\prime}}\right).

Adding the bound on Term 2, Term 3, and Term 4 derived above, which is also of order \mathcal{O}(\sqrt{f_{T}C_{1}t_{0}}) by the same argument as in the previous case, proves the second claim. ∎

Now we are ready to show that once LL grows large enough, Phase 22 never ends.

Lemma 22.

If C130fTC1LC\leq\frac{1}{30}\sqrt{f_{T}C_{1}L}, then with probability at least 115δ1-15\delta, Phase 2 never ends.

Proof.

It suffices to verify that the two termination conditions, Eq. (9) and Eq. (10), are never satisfied. Eq. (10) is never satisfied because of Lemma 21. Consider Eq. (9). Let tt be in Phase 22 and xx^x\neq\widehat{x}. According to Eq. (32) and Eq. (33), we have with probability 15δ1-5\delta,

|s=1t0(s,x^s,x)+s=t0+1t(s,xRobt,x)|\displaystyle\left|\sum_{s=1}^{t_{0}}(\ell_{s,x}-\widehat{\ell}_{s,x})+\sum_{s=t_{0}+1}^{t}\left(\ell_{s,x}-\text{Rob}_{t,x}\right)\right| 1.7tΔ^x10,\displaystyle\leq\frac{1.7t\widehat{\Delta}_{x}}{10},
|s=1t(s,x^^s,x^)|\displaystyle\left|\sum_{s=1}^{t}(\ell_{s,\widehat{x}}-\widehat{\ell}_{s,\widehat{x}})\right| 3fTC1t0.15tΔ^x.\displaystyle\leq 3\sqrt{f_{T}C_{1}t}\leq 0.15t\widehat{\Delta}_{x}.

Therefore, we have

|tΔ^t,xtΔx|\displaystyle\left|t\widehat{\Delta}_{t,x}-t\Delta_{x}\right| |s=1t0(s,x^s,x)+s=t0+1t(s,xRobt,x)|+|s=1t(s,x^^s,x^)|+C\displaystyle\leq\left|\sum_{s=1}^{t_{0}}(\ell_{s,x}-\widehat{\ell}_{s,x})+\sum_{s=t_{0}+1}^{t}\left(\ell_{s,x}-\text{Rob}_{t,x}\right)\right|+\left|\sum_{s=1}^{t}(\ell_{s,\widehat{x}}-\widehat{\ell}_{s,\widehat{x}})\right|+C
0.32tΔ^x+C0.372tΔ^x.\displaystyle\leq 0.32t\widehat{\Delta}_{x}+C\leq 0.372t\widehat{\Delta}_{x}. (C130fTC1t0.052tΔ^minC\leq\frac{1}{30}\sqrt{f_{T}C_{1}t}\leq 0.052t\widehat{\Delta}_{\min})

This means that

tΔ^t,xtΔx+0.372tΔ^x10.7tΔ^x+0.372tΔ^x1.81tΔ^x,\displaystyle t\widehat{\Delta}_{t,x}\leq t\Delta_{x}+0.372t\widehat{\Delta}_{x}\leq\frac{1}{0.7}t\widehat{\Delta}_{x}+0.372t\widehat{\Delta}_{x}\leq 1.81t\widehat{\Delta}_{x},
tΔ^t,xtΔx0.372tΔ^x11.3tΔ^x0.372tΔ^x0.39tΔ^x.\displaystyle t\widehat{\Delta}_{t,x}\geq t\Delta_{x}-0.372t\widehat{\Delta}_{x}\geq\frac{1}{1.3}t\widehat{\Delta}_{x}-0.372t\widehat{\Delta}_{x}\geq 0.39t\widehat{\Delta}_{x}.

Therefore, Eq. (9) is not satisfied. ∎

Finally, we prove the regret bound for the corrupted stochastic setting.

Proof of Theorem 6.

First, we consider the pure stochastic setting with C=0C=0. According to Lemma 20, we know that the algorithm has only one epoch as C130fTC1LC\leq\frac{1}{30}\sqrt{f_{T}C_{1}L} is satisfied in the first epoch. Specifically, after at most 900fTC1Δmin2\frac{900f_{T}C_{1}}{\Delta_{\min}^{2}} rounds in Phase 11, the algorithm goes to Phase 22 and never goes back to Phase 11. Then we can directly apply the second claim in Lemma 21 to get the regret bound in the stochastic setting. Specifically, we bound the regret in Phase 11 by 𝒪(C1L0+C1900fTC1Δmin2)=𝒪(C1L0+C1logTΔmin)\mathcal{O}\left(\sqrt{C_{1}L_{0}}+\sqrt{C_{1}\cdot\frac{900f_{T}C_{1}}{\Delta_{\min}^{2}}}\right)=\mathcal{O}\left(\sqrt{C_{1}L_{0}}+\frac{C_{1}\sqrt{\log T}}{\Delta_{\min}}\right). For the regret in Phase 22, according to the second claim in Lemma 21, we bound the regret by 𝒪(c(𝒳;θ)logTlogT|𝒳|δ+dβMM)=𝒪(c(𝒳;θ)logTlogT|𝒳|δ+Mlog321δ)\mathcal{O}\left(c(\mathcal{X};\theta)\log T\log\frac{T|\mathcal{X}|}{\delta}+d\beta_{M^{\prime}}\sqrt{M^{\prime}}\right)=\mathcal{O}\left(c(\mathcal{X};\theta)\log T\log\frac{T|\mathcal{X}|}{\delta}+M^{*}\log^{\frac{3}{2}}\frac{1}{\delta}\right), where MM^{*} is the same as the one in Eq. (24). Combining them together proves the first claim.

Now we consider the corrupted stochastic setting with C>0C>0. Suppose that we are in the epoch with L=LL=L^{*}, which is the first epoch such that Lmax{900fTC1Δmin2,900C2fTC1}L^{*}\geq\max\left\{\frac{900f_{T}C_{1}}{\Delta_{\min}^{2}},\frac{900C^{2}}{f_{T}C_{1}}\right\}. Therefore, in previous epochs, we have Lmax{900fTC1Δmin2,900C2fTC1}L\leq\max\left\{\frac{900f_{T}C_{1}}{\Delta_{\min}^{2}},\frac{900C^{2}}{f_{T}C_{1}}\right\}. According to Lemma 20, we have t0max{900fTC1Δmin2,900C2fTC1}t_{0}\leq\max\left\{\frac{900f_{T}C_{1}}{\Delta_{\min}^{2}},\frac{900C^{2}}{f_{T}C_{1}}\right\} in the previous epoch and L=2t0L^{*}=2t_{0}.

We bound the regret before this epoch, as well as the regret in the first phase of this epoch by the adversarial regret bound:

𝒪(C1fTL+C1L0)=𝒪(C1L0+C1fT×(fTC1Δmin+CfTC1))=𝒪(C1L0+fTC1Δmin+C).\displaystyle\mathcal{O}\left(\sqrt{C_{1}f_{T}L^{*}}+\sqrt{C_{1}L_{0}}\right)=\mathcal{O}\left(\sqrt{C_{1}L_{0}}+\sqrt{C_{1}f_{T}}\times\left(\frac{\sqrt{f_{T}C_{1}}}{\Delta_{\min}}+\frac{C}{\sqrt{f_{T}C_{1}}}\right)\right)=\mathcal{O}\left(\sqrt{C_{1}L_{0}}+\frac{f_{T}C_{1}}{\Delta_{\min}}+C\right).

If we use GeometricHedge.P as the adversarial linear bandit algorithm and fT=logTf_{T}=\log T, then the above is upper bounded by

𝒪(dlogTlog(T|𝒳|/δ)Δmin+C).\displaystyle\mathcal{O}\left(\frac{d\log T\log(T|\mathcal{X}|/\delta)}{\Delta_{\min}}+C\right).

For Phase 22 of the epoch with L=LL=L^{*}, by definition of LL^{*} we have C130fTC1LC\leq\frac{1}{30}\sqrt{f_{T}C_{1}L}, and thus according to Lemma 22, this phase never ends. Note that in this phase, by Lemma 20, Δ^x[0.7Δx,1.3Δx]\widehat{\Delta}_{x}\in[0.7\Delta_{x},1.3\Delta_{x}].

Therefore, by taking a summation over tt on Eq. (42), we bound the regret in this interval by

𝒪(dβTlogTΔmin)=𝒪(dlogTlog(T|𝒳|/δ)Δmin).\displaystyle\mathcal{O}\left(\frac{d\beta_{T}\log T}{\Delta_{\min}}\right)=\mathcal{O}\left(\frac{d\log T\log(T|\mathcal{X}|/\delta)}{\Delta_{\min}}\right).

Combining the regret bounds finishes the proof of the second claim. ∎

Appendix D Lower Bound

In this section, we prove that the log2T\log^{2}T factor in our bound for the stochastic setting is unavoidable if the same algorithm also achieves sublinear regret with high probability in the adversarial case. The full statement is in Theorem 27, and we first present some definitions and related lemmas. We fix a stochastic linear bandit instance (i.e., we fix the parameter θ\theta and the action set 𝒳\mathcal{X}). Assume that x1\|x\|\leq 1 for all x𝒳x\in\mathcal{X} and θ14\|\theta\|\leq\frac{1}{4}. We call this instance the first environment. The observation yty_{t} is generated according to the following Bernoulli distribution (in the Bernoulli noise case, we are only looking at a subclass of problems, i.e., those with θ14\|\theta\|\leq\frac{1}{4} and x1\|x\|\leq 1; for problems that are outside this class, the Bernoulli noise case might be much easier than the Gaussian noise case):

yt={1 with probability 12+12xt,θ,1 with probability 1212xt,θ.\displaystyle y_{t}=\begin{cases}1&\text{\ with probability\ }\frac{1}{2}+\frac{1}{2}\langle x_{t},\theta\rangle,\\ -1&\text{\ with probability\ }\frac{1}{2}-\frac{1}{2}\langle x_{t},\theta\rangle.\end{cases}
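
For concreteness, the following minimal Python sketch draws one observation from this model; the specific action, parameter, and random seed are illustrative assumptions of ours and not part of the construction.

import numpy as np

def sample_observation(x, theta, rng):
    # y_t = +1 with probability 1/2 + <x, theta>/2, and -1 otherwise;
    # this is a valid probability since ||x|| <= 1 and ||theta|| <= 1/4.
    p_plus = 0.5 + 0.5 * float(x @ theta)
    return 1.0 if rng.random() < p_plus else -1.0

rng = np.random.default_rng(0)
theta = np.array([0.2, -0.1])     # example parameter with norm <= 1/4
x_t = np.array([0.6, 0.8])        # example action with norm <= 1
y_t = sample_observation(x_t, theta, rng)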

Again, let c(𝒳,θ)c(\mathcal{X},\theta) be the solution of the following optimization problem:

infN[0,)𝒳x𝒳\{x}NxΔx\displaystyle\inf_{N\in[0,\infty)^{\mathcal{X}}}\sum_{x\in\mathcal{X}\backslash\{x^{*}\}}N_{x}\Delta_{x} (45)
s.t.\displaystyle s.t. xH(N)12Δx22,x𝒳=𝒳\{x},\displaystyle\|x\|^{2}_{H(N)^{-1}}\leq\frac{\Delta_{x}^{2}}{2},\forall x\in\mathcal{X}^{-}=\mathcal{X}\backslash\{x^{*}\},

where H(N)=x𝒳NxxxH(N)=\sum_{x\in\mathcal{X}}N_{x}xx^{\top}. We also define Δmin=minxxΔx\Delta_{\min}=\min_{x\neq x^{*}}\Delta_{x}.
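
To make the role of c(𝒳,θ)c(\mathcal{X},\theta) concrete, here is a sketch that solves Eq. (45) numerically with the convex-optimization package cvxpy; the solver choice, the helper name, and the small example instance are our own illustrative assumptions (the analysis only uses the optimal value of the program).

import numpy as np
import cvxpy as cp

def instance_optimal_value(X, theta):
    # X: (K, d) array whose rows are the actions; theta: (d,) loss parameter.
    # Assumes a unique optimal action (all other gaps strictly positive).
    losses = X @ theta
    i_star = int(np.argmin(losses))
    gaps = losses - losses[i_star]                    # Delta_x for every action
    K, d = X.shape
    N = cp.Variable(K, nonneg=True)
    H = sum(N[i] * np.outer(X[i], X[i]) for i in range(K))   # H(N)
    constraints = [cp.matrix_frac(X[i], H) <= gaps[i] ** 2 / 2
                   for i in range(K) if i != i_star]  # ||x||_{H(N)^{-1}}^2 <= Delta_x^2 / 2
    problem = cp.Problem(cp.Minimize(cp.sum(cp.multiply(gaps, N))), constraints)
    problem.solve()
    return problem.value                              # approximates c(X, theta)

X = np.array([[1.0, 0.0], [0.0, 1.0], [0.8, 0.6]])
theta = np.array([0.1, 0.3])
print(instance_optimal_value(X, theta))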

For a fixed γ(0,1)\gamma\in(0,1) (which is chosen later), we divide the whole horizon into intervals of length

Tγ,4ΔminTγ,(4Δmin)2Tγ,,(4Δmin)S1Tγ,\displaystyle T^{\gamma},\frac{4}{\Delta_{\min}}T^{\gamma},\left(\frac{4}{\Delta_{\min}}\right)^{2}T^{\gamma},\ldots,\left(\frac{4}{\Delta_{\min}}\right)^{S-1}T^{\gamma},

where S=Θ(1γlog4ΔminlogT)S=\Theta\left(\frac{1-\gamma}{\log\frac{4}{\Delta_{\min}}}\log T\right). We denote these intervals as 1,,S\mathcal{I}_{1},\ldots,\mathcal{I}_{S}. Observe that |i|3Δminj<i|j||\mathcal{I}_{i}|\geq\frac{3}{\Delta_{\min}}\sum_{j<i}|\mathcal{I}_{j}| for all ii.

Definition 2.

Let U=cregS(logT)1βU=c_{\text{reg}}S(\log T)^{1-\beta}, V=US=creg(logT)1βV=\frac{U}{S}=c_{\text{reg}}(\log T)^{1-\beta} for some β0\beta\geq 0 and some universal constant cregc_{\text{reg}}.

Assumption 2.

Let 𝒜\mathcal{A} be a linear bandit algorithm with the following regret guarantee: there is a problem-dependent constant T0T_{0} (i.e., depending on θ\theta and 𝒳\mathcal{X}) such that for any TT0T\geq T_{0},

𝔼[Reg(T)]=𝔼[t=1Txx𝟙[xt=x]Δx]Uc(𝒳,θ)\displaystyle\mathbb{E}\left[\text{\rm Reg}(T)\right]=\mathbb{E}\left[\sum_{t=1}^{T}\sum_{x\neq x^{*}}\mathbbm{1}[x_{t}=x]\Delta_{x}\right]\leq U\cdot c(\mathcal{X},\theta)

for some β0\beta\geq 0.

Definition 3.

Let Gi𝔼[tixtxt]G_{i}\triangleq\mathbb{E}\left[\sum_{t\in\mathcal{I}_{i}}x_{t}x_{t}^{\top}\right] where the expectation 𝔼\mathbb{E} is with respect to the environment of θ\theta and the algorithm 𝒜\mathcal{A}. Let G=i=1SGiG=\sum_{i=1}^{S}G_{i}.

Definition 4.

Let Ti(x)ti𝟙[xt=x]T_{i}(x)\triangleq\sum_{t\in\mathcal{I}_{i}}\mathbbm{1}[x_{t}=x].

Lemma 23.

Let TT0T\geq T_{0}. There exists an action xxx\neq x^{*} such that

xG12Δx22U.\displaystyle\|x\|^{2}_{G^{-1}}\geq\frac{\Delta_{x}^{2}}{2U}.
Proof.

We prove this lemma by contradiction. Suppose that for all xxx\neq x^{*} we have xG12Δx22U\|x\|^{2}_{G^{-1}}\leq\frac{\Delta_{x}^{2}}{2U}. Then observe that xG¯12Δx22\|x\|_{\overline{G}^{-1}}^{2}\leq\frac{\Delta_{x}^{2}}{2} where G¯=1UG\overline{G}=\frac{1}{U}\cdot G. Therefore,

N^x=𝔼[t=1T𝟙[xt=x]U]\displaystyle\widehat{N}_{x}=\mathbb{E}\left[\sum_{t=1}^{T}\frac{\mathbbm{1}[x_{t}=x]}{U}\right]

satisfies the constraint of the optimization problem Eq. (45). Therefore,

c(𝒳,θ)xxN^xΔx=1U𝔼[t=1TxxΔx𝟙[xt=x]]=RegTU.\displaystyle c(\mathcal{X},\theta)\leq\sum_{x\neq x^{*}}\widehat{N}_{x}\Delta_{x}=\frac{1}{U}\mathbb{E}\left[\sum_{t=1}^{T}\sum_{x\neq x^{*}}\Delta_{x}\mathbbm{1}[x_{t}=x]\right]=\frac{\text{\rm Reg}_{T}}{U}.

This contradicts the assumption on the regret bound of 𝒜\mathcal{A}. ∎

Lemma 24.

If there exists an xxx\neq x^{*} such that

xG12Δx22U,\displaystyle\|x\|^{2}_{G^{-1}}\geq\frac{\Delta_{x}^{2}}{2U},

then there exists i[S]i\in[S] such that

xGi12Δx22V.\displaystyle\|x\|^{2}_{G_{i}^{-1}}\geq\frac{\Delta_{x}^{2}}{2V}.
Proof.

We prove the lemma by contradiction. Suppose that xGi12<Δx22V\|x\|^{2}_{G_{i}^{-1}}<\frac{\Delta_{x}^{2}}{2V} for all i[S]i\in[S].

Then

xG12\displaystyle\|x\|^{2}_{G^{-1}} =x(G1++GS)121S2(xG112++xGS12)\displaystyle=\|x\|^{2}_{\left(G_{1}+\cdots+G_{S}\right)^{-1}}\leq\frac{1}{S^{2}}\left(\|x\|^{2}_{G_{1}^{-1}}+\cdots+\|x\|^{2}_{G_{S}^{-1}}\right)
<1SΔx22V=Δx22U\displaystyle<\frac{1}{S}\frac{\Delta_{x}^{2}}{2V}=\frac{\Delta_{x}^{2}}{2U}

where in the first inequality we use

(1S(G1++GS))11S(G11++GS1),\displaystyle\left(\frac{1}{S}(G_{1}+\cdots+G_{S})\right)^{-1}\preceq\frac{1}{S}\left(G_{1}^{-1}+\cdots+G_{S}^{-1}\right),

which is a generalization of the “arithmetic-mean-harmonic-mean inequality” (Mond & Pecaric, 1996). ∎
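
As a quick numeric sanity check of this matrix inequality (the random positive-definite matrices below are purely illustrative), one can verify that the difference of the two sides is positive semi-definite:

import numpy as np

rng = np.random.default_rng(1)
S, d = 4, 3
Gs = []
for _ in range(S):
    A = rng.standard_normal((d, d))
    Gs.append(A @ A.T + 0.1 * np.eye(d))            # random positive-definite G_i
lhs = np.linalg.inv(sum(Gs) / S)                    # ((G_1 + ... + G_S)/S)^{-1}
rhs = sum(np.linalg.inv(G) for G in Gs) / S         # (G_1^{-1} + ... + G_S^{-1})/S
print(np.linalg.eigvalsh(rhs - lhs).min() >= -1e-9) # expected: True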

Lemma 25.

Let TT0T\geq T_{0}. If xGi12Δx22V\|x\|_{G_{i}^{-1}}^{2}\geq\frac{\Delta_{x}^{2}}{2V}, then xxGi12Δx28V\|x-x^{*}\|_{G_{i}^{-1}}^{2}\geq\frac{\Delta_{x}^{2}}{8V}.

Proof.

We have

xGi122xxGi12+2xGi122xxGi12+2x2𝔼[Ti(x)],\displaystyle\|x\|_{G_{i}^{-1}}^{2}\leq 2\|x-x^{*}\|_{G_{i}^{-1}}^{2}+2\|x^{*}\|_{G_{i}^{-1}}^{2}\leq 2\|x-x^{*}\|_{G_{i}^{-1}}^{2}+\frac{2\|x^{*}\|^{2}}{\mathbb{E}[T_{i}(x^{*})]}, (46)

where the last inequality is because of the definition of GiG_{i}. For large enough TT, we must have 𝔼[Ti(x)]8x2Δmin2V\mathbb{E}\left[T_{i}(x^{*})\right]\geq\frac{8\|x^{*}\|^{2}}{\Delta_{\min}^{2}}V. Otherwise, by Markov’s inequality, with probability at least 12\frac{1}{2}, in interval ii the algorithm draws xx^{*} at most 16x2Δmin2V\frac{16\|x^{*}\|^{2}}{\Delta_{\min}^{2}}V times, and thus the regret of 𝒜\mathcal{A} would be at least Δmin(|i|16x2Δmin2V)=Ω((4Δmin)i1TγΔmin16x2Δmincreg(logT)1β)=Ω(TγΔmin)\Delta_{\min}\left(|\mathcal{I}_{i}|-\frac{16\|x^{*}\|^{2}}{\Delta_{\min}^{2}}V\right)=\Omega\left(\left(\frac{4}{\Delta_{\min}}\right)^{i-1}T^{\gamma}\Delta_{\min}-\frac{16\|x^{*}\|^{2}}{\Delta_{\min}}c_{\text{reg}}(\log T)^{1-\beta}\right)=\Omega\left(T^{\gamma}\Delta_{\min}\right), violating the assumption on 𝒜\mathcal{A}. Therefore, for large enough TT, we have

2x2𝔼[Ti(x)]Δmin24V12xGi12,\displaystyle\frac{2\|x^{*}\|^{2}}{\mathbb{E}[T_{i}(x^{*})]}\leq\frac{\Delta_{\min}^{2}}{4V}\leq\frac{1}{2}\|x\|_{G_{i}^{-1}}^{2},

where the last inequality is by our assumption. Combining this with Eq. (46), we get

xGi124xxGi12\displaystyle\|x\|_{G_{i}^{-1}}^{2}\leq 4\|x-x^{*}\|^{2}_{G_{i}^{-1}}

and the conclusion follows based on our assumption. ∎

Note that when the conclusion of Lemma 25 holds, that is, xxGi12Ω(Δx2V)=Ω(Δx2logT(logT)β)\|x-x^{*}\|_{G_{i}^{-1}}^{2}\geq\Omega\left(\frac{\Delta_{x}^{2}}{V}\right)=\Omega\left(\frac{\Delta_{x}^{2}}{\log T}(\log T)^{\beta}\right), it means that the exploration in interval ii is not enough, since by the lower bound in (Lattimore & Szepesvari, 2017), the amount of exploration should make xxGi12O(Δx2logT)\|x-x^{*}\|^{2}_{G_{i}^{-1}}\leq O\left(\frac{\Delta_{x}^{2}}{\log T}\right). Therefore, the next natural idea is to change the parameter θ\theta in this interval ii, and argue that the amount of exploration performed by 𝒜\mathcal{A} is not enough to detect this change with high probability.

We now let ii be the first interval such that there exists xx with xxGi12Δx28V\|x-x^{*}\|^{2}_{G_{i}^{-1}}\geq\frac{\Delta_{x}^{2}}{8V}. Also, we use xx^{\prime} to denote the xx that satisfies this condition. Define

θ=θGi1(xx)xxGi122Δx.\displaystyle\theta^{\prime}=\theta-\frac{G_{i}^{-1}(x^{\prime}-x^{*})}{\|x^{\prime}-x^{*}\|^{2}_{G_{i}^{-1}}}2\Delta_{x^{\prime}}.

Notice that in this case,

xx,θ=xx,θ2Δx=Δx.\displaystyle\langle x^{\prime}-x^{*},\theta^{\prime}\rangle=\langle x^{\prime}-x^{*},\theta\rangle-2\Delta_{x^{\prime}}=-\Delta_{x^{\prime}}.

That is, xx^{\prime} is a better action than xx^{*} under the parameter θ\theta^{\prime}.
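
The shift from θ\theta to θ\theta^{\prime} is a purely linear-algebraic construction, so the sign flip of the gap can be checked numerically; the matrix GiG_{i}, the two actions, and the parameter in the snippet below are arbitrary illustrative choices of ours.

import numpy as np

rng = np.random.default_rng(2)
d = 3
A = rng.standard_normal((d, d))
G_i = A @ A.T + np.eye(d)                      # stand-in for the design matrix of interval i
theta = np.array([-0.2, 0.1, 0.05])            # x_star is better than x_prime under theta
x_star = np.array([1.0, 0.0, 0.0])
x_prime = np.array([0.0, 1.0, 0.0])

gap = (x_prime - x_star) @ theta               # Delta_{x'} = <x' - x*, theta> > 0
u = np.linalg.solve(G_i, x_prime - x_star)     # G_i^{-1} (x' - x*)
theta_prime = theta - 2 * gap * u / ((x_prime - x_star) @ u)

print(np.isclose((x_prime - x_star) @ theta_prime, -gap))   # expected: True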

We now define the second environment as follows: in intervals 1,,i11,\ldots,i-1, the losses are generated according to θ\theta, but in intervals i,,Si,\ldots,S, the losses are generated according to θ\theta^{\prime}. We use 𝔼\mathbb{E} and 𝔼\mathbb{E}^{\prime} to denote the expectation under the first environment and the second environment, and \mathbb{P}, \mathbb{P}^{\prime} to denote the probability measures respectively. For now, we only focus on interval ii (so the probability measure is only over the sequence (xt,t,xt)ti(x_{t},\ell_{t,x_{t}})_{t\in\mathcal{I}_{i}}).

Lemma 26.

KL(,)64V\text{\rm KL}\left(\mathbb{P},\mathbb{P}^{\prime}\right)\leq 64V.

Proof.

Note that for any xx,

|x,θx,θ|\displaystyle|\langle x,\theta^{\prime}\rangle-\langle x,\theta\rangle| =|xGi1(xx)xxGi122Δx|\displaystyle=\left|\frac{x^{\top}G_{i}^{-1}(x^{\prime}-x^{*})}{\|x^{\prime}-x^{*}\|^{2}_{G_{i}^{-1}}}2\Delta_{x^{\prime}}\right|
=|2xGi1(xx)(xx)θxxGi12|\displaystyle=\left|\frac{2x^{\top}G_{i}^{-1}(x^{\prime}-x^{*})(x^{\prime}-x^{*})^{\top}\theta}{\|x^{\prime}-x^{*}\|^{2}_{G_{i}^{-1}}}\right|
2xΦopθtr(Φ)\displaystyle\leq\frac{2\|x\|\|\Phi\|_{\text{op}}\|\theta\|}{\text{tr}(\Phi)} (let Φ=Gi1(xx)(xx)\Phi=G_{i}^{-1}(x^{\prime}-x^{*})(x^{\prime}-x^{*})^{\top})
2xθ12.\displaystyle\leq 2\|x\|\|\theta\|\leq\frac{1}{2}. (by the assumption θ14\|\theta\|\leq\frac{1}{4})

Therefore, |x,θ||x,θ|+1234|\langle x,\theta^{\prime}\rangle|\leq|\langle x,\theta\rangle|+\frac{1}{2}\leq\frac{3}{4}.

Notice that for p,q[34,34]p,q\in[-\frac{3}{4},\frac{3}{4}], the KL divergence between the following two distributions:

y={1 with probability 12+12p1 with probability 1212pandy={1 with probability 12+12q1 with probability 1212q\displaystyle y=\begin{cases}1&\text{\ with probability\ }\frac{1}{2}+\frac{1}{2}p\\ -1&\text{\ with probability\ }\frac{1}{2}-\frac{1}{2}p\end{cases}\qquad\text{and}\qquad y=\begin{cases}1&\text{\ with probability\ }\frac{1}{2}+\frac{1}{2}q\\ -1&\text{\ with probability\ }\frac{1}{2}-\frac{1}{2}q\end{cases}

is

kl(p,q)12(1+p)ln1+p1+q+12(1p)ln1p1q2(pq)2.\displaystyle\text{kl}(p,q)\triangleq\frac{1}{2}(1+p)\ln\frac{1+p}{1+q}+\frac{1}{2}(1-p)\ln\frac{1-p}{1-q}\leq 2(p-q)^{2}. (47)

Therefore,

KL(,)\displaystyle\text{\rm KL}\left(\mathbb{P},\mathbb{P}^{\prime}\right) x𝒳𝔼[Ti(x)]kl(x,θ,x,θ)\displaystyle\leq\sum_{x\in\mathcal{X}}\mathbb{E}[T_{i}(x)]\text{kl}\left(\langle x,\theta\rangle,\langle x,\theta^{\prime}\rangle\right) (by Lemma 1 in (Gerchinovitz & Lattimore, 2016))
2x𝔼[Ti(x)]x,θθ2\displaystyle\leq 2\sum_{x}\mathbb{E}[T_{i}(x)]\langle x,\theta-\theta^{\prime}\rangle^{2} (by Eq. (47))
=2θθGi2\displaystyle=2\|\theta-\theta^{\prime}\|_{G_{i}}^{2} (by definition of GiG_{i})
=8Δx2xxGi12\displaystyle=\frac{8\Delta_{x^{\prime}}^{2}}{\|x^{\prime}-x^{*}\|_{G_{i}^{-1}}^{2}} (by definition of θ\theta^{\prime})
64V.\displaystyle\leq 64V. (by the choice of xx^{\prime})

∎
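
As a quick numeric check of the elementary bound in Eq. (47) (the grid below is an arbitrary discretization of [3/4,3/4]2[-3/4,3/4]^{2} chosen for illustration):

import numpy as np

def kl(p, q):
    # KL divergence between the two {+1, -1}-valued observation distributions with means p and q
    return 0.5 * (1 + p) * np.log((1 + p) / (1 + q)) + 0.5 * (1 - p) * np.log((1 - p) / (1 - q))

grid = np.linspace(-0.75, 0.75, 301)
P, Q = np.meshgrid(grid, grid)
print(np.all(kl(P, Q) <= 2 * (P - Q) ** 2 + 1e-12))   # expected: True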

Finally, we are ready to present the lower bound. Roughly speaking, it shows that if an algorithm achieves 𝒪(c(𝒳,θ)logzT)\mathcal{O}(c(\mathcal{X},\theta)\log^{z}T) regret in the stochastic case with z[1,2)z\in[1,2), then it cannot be robust in the adversarial setting, in the sense that it cannot guarantee a regret bound such as o(T)poly(d,ln(1/δ))o(T)\cdot\text{\rm poly}(d,\ln(1/\delta)) with probability at least 1δ1-\delta.

Theorem 27.

For any γ(0,1)\gamma\in(0,1), if an algorithm guarantees a pseudo regret bound of

cregc(𝒳,θ)1γlog4Δmin(logT)2\displaystyle c_{\text{reg}}\cdot c(\mathcal{X},\theta)\frac{1-\gamma}{\log\frac{4}{\Delta_{\min}}}(\log T)^{2}

for constant creg=γ256c_{\text{reg}}=\frac{\gamma}{256} in stochastic environments for all sufficiently large TT, then there exists an adversarial environment such that with probability at least 14T14γ\frac{1}{4}T^{-\frac{1}{4}\gamma}, the regret of the same algorithm is at least 16TγΔmin\frac{1}{6}T^{\gamma}\Delta_{\min}.

Proof.

Let AA be the event: {Ti(x)|i|2}\left\{T_{i}(x^{*})\leq\frac{|\mathcal{I}_{i}|}{2}\right\}. By Lemma 5 in (Lattimore & Szepesvari, 2017) and choosing β=0\beta=0, we have

(A)+(Ac)\displaystyle\mathbb{P}(A)+\mathbb{P}^{\prime}(A^{c}) 12exp(KL(,))\displaystyle\geq\frac{1}{2}\exp(-\text{\rm KL}(\mathbb{P},\mathbb{P}^{\prime}))
12exp(64V)\displaystyle\geq\frac{1}{2}\exp(-64V) (by Lemma 26)
=12exp(64creglogT)\displaystyle=\frac{1}{2}\exp\left(-64c_{\text{reg}}\log T\right)
=12(1T)64creg.\displaystyle=\frac{1}{2}\left(\frac{1}{T}\right)^{64c_{\text{reg}}}. (48)

Notice that when event AA happens under the first environment, the regret is at least |i|Δmin2\frac{|\mathcal{I}_{i}|\Delta_{\min}}{2}. By the assumption on 𝒜\mathcal{A}, we have

(A)×|i|Δmin2RegTcregc(𝒳,θ)1γlog4Δmin(logT)2,\displaystyle\mathbb{P}(A)\times\frac{|\mathcal{I}_{i}|\Delta_{\min}}{2}\leq\text{\rm Reg}_{T}\leq c_{\text{reg}}\cdot c(\mathcal{X},\theta)\frac{1-\gamma}{\log\frac{4}{\Delta_{\min}}}(\log T)^{2},

implying that

(A)cregc(𝒳,θ)1γlog4Δmin(logT)2(4Δmin)i1TγΔmincregc(𝒳,θ)(logT)2Tγ.\displaystyle\mathbb{P}(A)\leq\frac{c_{\text{reg}}\cdot c(\mathcal{X},\theta)\frac{1-\gamma}{\log\frac{4}{\Delta_{\min}}}(\log T)^{2}}{\left(\frac{4}{\Delta_{\min}}\right)^{i-1}T^{\gamma}\Delta_{\min}}\leq\frac{c_{\text{reg}}\cdot c(\mathcal{X},\theta)(\log T)^{2}}{T^{\gamma}}.

Combining this with (48), we get

(Ac)0.5T64cregcregc(𝒳,θ)(logT)2Tγ.\displaystyle\mathbb{P}^{\prime}(A^{c})\geq 0.5\cdot T^{-64c_{\text{reg}}}-c_{\text{reg}}\cdot c(\mathcal{X},\theta)(\log T)^{2}\cdot T^{-\gamma}. (49)

Choosing creg=γ256c_{\text{reg}}=\frac{\gamma}{256}, we have (Ac)0.25T64creg\mathbb{P}^{\prime}(A^{c})\geq 0.25\cdot T^{-64c_{\text{reg}}} for large enough TT. Then notice that when AcA^{c} happens under the second environment, the regret within interval i\mathcal{I}_{i} (against comparator xx^{\prime}) is at least |i|Δmin2\frac{|\mathcal{I}_{i}|\Delta_{\min}}{2}. Under the second environment, the learner may have negative regret against xx^{\prime} in intervals 1,,i11,\ldots,i-1, but its magnitude is at most |1|++|i1||\mathcal{I}_{1}|+\cdots+|\mathcal{I}_{i-1}|, so the regret against xx^{\prime} over intervals 1,,i1,\ldots,i is at least

|i|Δmin2(|1|++|i1|)|i|Δmin6TγΔmin6.\displaystyle\frac{|\mathcal{I}_{i}|\Delta_{\min}}{2}-\left(|\mathcal{I}_{1}|+\cdots+|\mathcal{I}_{i-1}|\right)\geq\frac{|\mathcal{I}_{i}|\Delta_{\min}}{6}\geq\frac{T^{\gamma}\Delta_{\min}}{6}.

In conclusion, in the second environment, algorithm 𝒜\mathcal{A} suffers at least TγΔmin6\frac{T^{\gamma}\Delta_{\min}}{6} regret in the first ii intervals with probability at least 0.25T14γ0.25\cdot T^{-\frac{1}{4}\gamma}. ∎

Appendix E Adversarial Linear Bandit Algorithms with High-probability Guarantees

In this section, we show that the algorithms of (Bartlett et al., 2008) and (Lee et al., 2020) both satisfy Assumption 1.

E.1 GeometricHedge.P

Input: 𝒳,γ,η,δ\mathcal{X},\gamma,\eta,\delta^{\prime}, and John’s exploration distribution q𝒫𝒳q\in\mathcal{P}_{\mathcal{X}}.
Set x𝒳\forall x\in\mathcal{X}, w1(x)=1w_{1}(x)=1, and W1=|𝒳|W_{1}=|\mathcal{X}|.
for t=1t=1 to TT do
       Set pt(x)=(1γ)wt(x)Wt+γq(x),x𝒳p_{t}(x)=(1-\gamma)\frac{w_{t}(x)}{W_{t}}+\gamma q(x),~{}\forall x\in\mathcal{X}.
       Sample xtx_{t} according to distribution ptp_{t}.
       Observe loss yt=t,xt+ϵt(xt)y_{t}=\ell_{t,x_{t}}+\epsilon_{t}(x_{t}), where t,xt=xt,t\ell_{t,x_{t}}=\langle x_{t},\ell_{t}\rangle.
       Compute S(pt)=x𝒳pt(x)xxS(p_{t})=\sum_{x\in\mathcal{X}}p_{t}(x)xx^{\top} and ^t=S(pt)1xtyt\widehat{\ell}_{t}=S(p_{t})^{-1}x_{t}\cdot y_{t}.
       x𝒳\forall x\in\mathcal{X}, compute
^t,x=x,^t,~t,x=^t,x2xS(pt)12log(1/δ)dT,wt+1(x)=wt(x)exp(η~t,x).\widehat{\ell}_{t,x}=\langle x,\widehat{\ell}_{t}\rangle,\qquad\widetilde{\ell}_{t,x}=\widehat{\ell}_{t,x}-2\|x\|_{S(p_{t})^{-1}}^{2}\sqrt{\frac{\log(1/\delta^{\prime})}{dT}},\qquad w_{t+1}(x)=w_{t}(x)\exp\left(-\eta\widetilde{\ell}_{t,x}\right).
Compute Wt+1=x𝒳wt+1(x)W_{t+1}=\sum_{x\in\mathcal{X}}w_{t+1}(x).
Algorithm 3 GeometricHedge.P

We first show GeometricHedge.P in Algorithm 3 for completeness. We remark on the differences between the original version and the one shown here. First, we consider the noisy feedback yty_{t} instead of the zero-noise feedback t,xt\ell_{t,x_{t}}. However, most of the analysis in (Bartlett et al., 2008) still holds. Second, instead of using barycentric spanner exploration (known to be suboptimal), we use John’s exploration, shown to be optimal in Bubeck et al. (2012). With this replacement, Lemma 3 in Bartlett et al. (2008) can be improved to |^t,x|d/γ|\widehat{\ell}_{t,x}|\leq d/\gamma and xS(pt)12d/γ\|x\|_{S(p_{t})^{-1}}^{2}\leq d/\gamma.
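
For readers who prefer code, the following is a minimal Python sketch of one round of Algorithm 3 with these modifications. The exploration distribution q (John's exploration in the algorithm), the loss-observation callback, and all variable names are assumptions of this sketch rather than part of the original pseudocode.

import numpy as np

def geometric_hedge_p_round(X, w, q, gamma, eta, delta_prime, T, observe_loss, rng):
    # X: (K, d) matrix of actions; w: (K,) current weights; q: (K,) exploration distribution.
    K, d = X.shape
    p = (1 - gamma) * w / w.sum() + gamma * q           # sampling distribution p_t
    i_t = rng.choice(K, p=p)
    y_t = observe_loss(X[i_t])                          # noisy loss feedback y_t
    S = (X.T * p) @ X                                   # S(p_t) = sum_x p_t(x) x x^T
    S_inv = np.linalg.inv(S)
    ell_hat = S_inv @ X[i_t] * y_t                      # loss-vector estimator
    ell_hat_x = X @ ell_hat                             # estimated loss of every action
    bonus = 2 * np.einsum('ij,jk,ik->i', X, S_inv, X) * np.sqrt(np.log(1 / delta_prime) / (d * T))
    ell_tilde_x = ell_hat_x - bonus                     # downward-shifted estimator
    w_new = w * np.exp(-eta * ell_tilde_x)              # exponential-weights update
    return w_new, i_t, y_t

A full run would initialize w to the all-ones vector, loop this routine for t = 1, ..., T, and maintain the normalizer W_{t+1} as in Algorithm 3.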

Now, consider martingale difference sequence Mt(x)=^t,xt,xM_{t}(x)=\widehat{\ell}_{t,x}-\ell_{t,x}. We have |Mt(x)|dγ+1b|M_{t}(x)|\leq\frac{d}{\gamma}+1\triangleq b and

σ=t=1TVart(Mt)t=1TxS(pt)12.\sigma=\sqrt{\sum^{T}_{t=1}\text{Var}_{t}(M_{t})}\leq\sqrt{\sum_{t=1}^{T}\|x\|^{2}_{S(p_{t})^{-1}}}.

Using Lemma 2 in (Bartlett et al., 2008), we have that with probability at least 12δlog2T1-2\delta^{\prime}\log_{2}T (set δ=δ/(|𝒳|log2(T))\delta^{\prime}=\delta/(|\mathcal{X}|\log_{2}(T))),

|t=1T(^t,xt,x)|\displaystyle\left|\sum_{t=1}^{T}(\widehat{\ell}_{t,x}-\ell_{t,x})\right| 2max{2σ,blog(1/δ)}log(1/δ)\displaystyle\leq 2\max\left\{2\sigma,b\sqrt{\log(1/\delta^{\prime})}\right\}\sqrt{\log(1/\delta^{\prime})}
4σlog(1/δ)+2blog(1/δ)\displaystyle\leq 4\sigma\sqrt{\log(1/\delta^{\prime})}+2b\cdot{\log(1/\delta^{\prime})}
4t=1TxS(pt)12log(1/δ)+2(dγ+1)log(1/δ)\displaystyle\leq 4\sqrt{\sum_{t=1}^{T}\|x\|^{2}_{S(p_{t})^{-1}}}\sqrt{\log(1/\delta^{\prime})}+2\left(\frac{d}{\gamma}+1\right){\log(1/\delta^{\prime})}
1C2(t=1TxS(pt)12log(1/δ)dT)+4C2dTlog(1/δ)+2(dγ+1)log(1/δ)DevT,x,\displaystyle\leq\underbrace{\frac{1}{C_{2}}\left(\sum_{t=1}^{T}\|x\|^{2}_{S(p_{t})^{-1}}\sqrt{\frac{\log(1/\delta^{\prime})}{dT}}\right)+4C_{2}\sqrt{dT\log(1/\delta^{\prime})}+2\left(\frac{d}{\gamma}+1\right){\log(1/\delta^{\prime})}}_{\triangleq\textsc{Dev}_{T,x}}, (50)

where the last inequality is by AM-GM inequality. Note that ^t,x=~t,x+2xS(pt)12log(1/δ)dT\widehat{\ell}_{t,x}=\widetilde{\ell}_{t,x}+2\|x\|^{2}_{S(p_{t})^{-1}}\sqrt{\frac{\log(1/\delta^{\prime})}{dT}}. Plugging this into Eq. (50), we have with probability at least 12δlog2T1-2\delta^{\prime}\log_{2}T,

t=1T~t,xt=1Tt,x(t=1TxS(pt)12log(1/δ)dT)+4C2dTlog(1/δ)+2(dγ+1)log(1/δ).\displaystyle\sum_{t=1}^{T}\widetilde{\ell}_{t,x}\leq\sum_{t=1}^{T}\ell_{t,x}-{\left(\sum_{t=1}^{T}\|x\|^{2}_{S(p_{t})^{-1}}\sqrt{\frac{\log(1/\delta^{\prime})}{dT}}\right)+4C_{2}\sqrt{dT\log(1/\delta^{\prime})}+2\left(\frac{d}{\gamma}+1\right){\log(1/\delta^{\prime})}}. (51)

The counterpart of Lemma 6 in Bartlett et al. (2008) shows that with probability at least 1δ1-\delta,

t=1Tt,xtt=1Tx𝒳pt(x)^t,x(d+1)2Tlog(1/δ)+43log(1/δ)(dγ+1).\displaystyle\sum^{T}_{t=1}\ell_{t,x_{t}}-\sum^{T}_{t=1}\sum_{x\in\mathcal{X}}p_{t}(x)\widehat{\ell}_{t,x}\leq(\sqrt{d}+1)\sqrt{2T\log(1/\delta)}+\frac{4}{3}\log(1/\delta)\left(\frac{d}{\gamma}+1\right). (52)

Using Eq. (51), we have the counterpart of Lemma 7 in Bartlett et al. (2008) as follows: with probability at least 12δ1-2\delta,

γt=1Tx𝒳q(x)~t,x\displaystyle\gamma\sum_{t=1}^{T}\sum_{x\in\mathcal{X}}q(x)\widetilde{\ell}_{t,x} γt=1Tx𝒳q(x)t,x+4C2γdTlog(1/δ)+2γ(dγ+1)log(1/δ)\displaystyle\leq\gamma\sum_{t=1}^{T}\sum_{x\in\mathcal{X}}q(x)\ell_{t,x}+{4C_{2}\gamma\sqrt{dT\log(1/\delta^{\prime})}+2\gamma\left(\frac{d}{\gamma}+1\right){\log(1/\delta^{\prime})}}
γT+4C2γdTlog(1/δ)+2(d+γ)log(1/δ).\displaystyle\leq\gamma T+{4C_{2}\gamma\sqrt{dT\log(1/\delta^{\prime})}+2\left({d}+\gamma\right){\log(1/\delta^{\prime})}}. (53)

The counterpart of Lemma 8 in Bartlett et al. (2008) is: with probability at least 1δ1-\delta,

t=1Tx𝒳pt(x)^t,x2dT+dγ2Tlog(1/δ).\displaystyle\sum^{T}_{t=1}\sum_{x\in\mathcal{X}}p_{t}(x){\widehat{\ell}_{t,x}}^{2}\leq dT+\frac{d}{\gamma}\sqrt{2T\log(1/\delta)}. (54)

Plugging Eq. (52), Eq. (53), and Eq. (54) into Equation (2) in (Bartlett et al., 2008), we have with probability at least 13δ1-3\delta,

logWT+1W1\displaystyle\log\frac{W_{T+1}}{W_{1}} η1γ(t=1Tt,xt+2dTlog(1/δ)+(d+1)2Tlog(1/δ)+43log(1/δ)(dγ+1)+γT+\displaystyle\leq\frac{\eta}{1-\gamma}\left(-\sum^{T}_{t=1}\ell_{t,x_{t}}+2\sqrt{dT\log(1/\delta^{\prime})}+(\sqrt{d}+1)\sqrt{2T\log(1/\delta)}+\frac{4}{3}\log(1/\delta)\left(\frac{d}{\gamma}+1\right)+\gamma T+\right.
4C2γdTlog(1/δ)+2(d+γ)log(1/δ)+2ηdT+2ηdγ2Tlog(1/δ)+8ηlog(1/δ)dT).\displaystyle\quad\left.{4C_{2}\gamma\sqrt{dT\log(1/\delta^{\prime})}+2\left({d}+\gamma\right){\log(1/\delta^{\prime})}}+2\eta dT+\frac{2\eta d}{\gamma}\sqrt{2T\log(1/\delta)}+8\eta\log(1/\delta^{\prime})\sqrt{dT}\right). (55)

Again using Eq. (51) and Equation (4) in (Bartlett et al., 2008), we have with probability at least 1δ1-\delta, for all x𝒳x\in\mathcal{X},

logWT+1W1\displaystyle\log\frac{W_{T+1}}{W_{1}} η(t=1T~t,x)log|𝒳|\displaystyle\geq-\eta\left(\sum^{T}_{t=1}\widetilde{\ell}_{t,x}\right)-\log|\mathcal{X}|
ηt=1Tt,x+η(t=1TxS(pt)12log(1/δ)dT)4ηC2dTlog(1/δ)2η(dγ+1)log(1/δ)log|𝒳|.\displaystyle\geq-\eta\sum_{t=1}^{T}\ell_{t,x}+{\eta\left(\sum_{t=1}^{T}\|x\|^{2}_{S(p_{t})^{-1}}\sqrt{\frac{\log(1/\delta^{\prime})}{dT}}\right)-4\eta C_{2}\sqrt{dT\log(1/\delta^{\prime})}-2\eta\left(\frac{d}{\gamma}+1\right){\log(1/\delta^{\prime})}}-\log|\mathcal{X}|.

Combining this with Eq. (55) and assuming γ12\gamma\leq\frac{1}{2}, we have that with probability at least 15δ1-5\delta, for every x𝒳x\in\mathcal{X},

t=1Tt,xt\displaystyle\sum^{T}_{t=1}\ell_{t,x_{t}} t=1Tt,x12(t=1TxS(pt)12log(1/δ)dT)+4C2dTlog(1/δ)+2(dγ+1)log(1/δ)+log|𝒳|η+\displaystyle\leq\sum^{T}_{t=1}\ell_{t,x}-{\frac{1}{2}\left(\sum_{t=1}^{T}\|x\|^{2}_{S(p_{t})^{-1}}\sqrt{\frac{\log(1/\delta^{\prime})}{dT}}\right)+4C_{2}\sqrt{dT\log(1/\delta^{\prime})}+2\left(\frac{d}{\gamma}+1\right){\log(1/\delta^{\prime})}}+\frac{\log|\mathcal{X}|}{\eta}+
2dTlog(1/δ)+(d+1)2Tlog(1/δ)+43log(1/δ)(dγ+1)+γT+2C2dTlog(1/δ)+\displaystyle\qquad 2\sqrt{dT\log(1/\delta^{\prime})}+(\sqrt{d}+1)\sqrt{2T\log(1/\delta)}+\frac{4}{3}\log(1/\delta)\left(\frac{d}{\gamma}+1\right)+\gamma T+2C_{2}\sqrt{dT\log(1/\delta^{\prime})}+
2(d+γ)log(1/δ)+2ηdT+2ηdγ2Tlog(1/δ)+8ηlog(1/δ)dT.\displaystyle\qquad 2\left({d}+\gamma\right){\log(1/\delta^{\prime})}+2\eta dT+\frac{2\eta d}{\gamma}\sqrt{2T\log(1/\delta)}+8\eta\log(1/\delta^{\prime})\sqrt{dT}.

Recalling the definition of DevT,x\textsc{Dev}_{T,x} in Eq. (50) and combining terms, we have

t=1Tt,xt\displaystyle\sum^{T}_{t=1}\ell_{t,x_{t}} t=1Tt,xC2DevT,x+𝒪(dTlog(1/δ)+dγlog(1/δ))+log|𝒳|η+γT+2ηdT+\displaystyle\leq\sum^{T}_{t=1}\ell_{t,x}-C_{2}\cdot\textsc{Dev}_{T,x}+\mathcal{O}\left(\sqrt{dT\log(1/\delta^{\prime})}+\frac{d}{\gamma}\log(1/\delta^{\prime})\right)+\frac{\log|\mathcal{X}|}{\eta}+\gamma T+2\eta dT+
2ηdγ2Tlog(1/δ)+8ηlog(1/δ)dT.\displaystyle\quad\frac{2\eta d}{\gamma}\sqrt{2T\log(1/\delta)}+8\eta\log(1/\delta^{\prime})\sqrt{dT}. (56)

It remains to decide η\eta and γ\gamma. Note that the analysis of (Bartlett et al., 2008) requires |η~t,x|1|\eta\widetilde{\ell}_{t,x}|\leq 1. From the proof of Lemma 4 in (Bartlett et al., 2008), we know that |η~t,x|ηdγ(1+2log(1/δ)dT)|\eta\widetilde{\ell}_{t,x}|\leq\frac{\eta d}{\gamma}\left(1+2\sqrt{\frac{\log(1/\delta^{\prime})}{dT}}\right). Thus, we set η=γ/(d+2dlog(1/δ)dT)\eta=\gamma/\left(d+2d\sqrt{\frac{\log(1/\delta^{\prime})}{dT}}\right) so that |η~t,x|1|\eta\widetilde{\ell}_{t,x}|\leq 1 always holds. Therefore,

t=1Tt,xt\displaystyle\sum^{T}_{t=1}\ell_{t,x_{t}} t=1Tt,xC2DevT,x+𝒪(dTlog(1/δ)+dγlog(|𝒳|/δ))+2γdlog3(|𝒳|/δ)T+3γT+8γlog(1/δ)Td.\displaystyle\leq\sum^{T}_{t=1}\ell_{t,x}-C_{2}\cdot\textsc{Dev}_{T,x}+\mathcal{O}\left(\sqrt{dT\log(1/\delta^{\prime})}+\frac{d}{\gamma}\log(|\mathcal{X}|/\delta^{\prime})\right)+\frac{2}{\gamma}\sqrt{\frac{d\log^{3}(|\mathcal{X}|/\delta^{\prime})}{T}}+3\gamma T+8\gamma\log(1/\delta^{\prime})\sqrt{\frac{T}{d}}.

Choosing γ=min{12,dlog(|𝒳|/δ)T}\gamma=\min\left\{\frac{1}{2},\sqrt{\frac{d\log({|\mathcal{X}|}/{\delta^{\prime}})}{T}}\right\}, we have with probability at least 17δ1-7\delta, for all x𝒳x\in\mathcal{X},

t=1Tt,xt\displaystyle\sum^{T}_{t=1}\ell_{t,x_{t}} t=1Tt,xC2DevT,x+𝒪(dTlog(|𝒳|/δ)+dlog32(|𝒳|/δ))\displaystyle\leq\sum^{T}_{t=1}\ell_{t,x}-C_{2}\cdot\textsc{Dev}_{T,x}+\mathcal{O}\left(\sqrt{dT\log(|\mathcal{X}|/\delta^{\prime})}+d\log^{\tfrac{3}{2}}(|\mathcal{X}|/\delta^{\prime})\right)
t=1Tt,xC2DevT,x+𝒪(dTlog(|𝒳|log2(T)/δ)+dlog32(|𝒳|log2(T)/δ))\displaystyle\leq\sum^{T}_{t=1}\ell_{t,x}-C_{2}\cdot\textsc{Dev}_{T,x}+\mathcal{O}\left(\sqrt{dT\log(|\mathcal{X}|\log_{2}(T)/\delta)}+d\log^{\tfrac{3}{2}}(|\mathcal{X}|\log_{2}(T)/\delta)\right) (δ=δ/(|𝒳|log2(T))\delta^{\prime}=\delta/(|\mathcal{X}|\log_{2}(T)))
t=1Tt,xC2|t=1T(t,x^t,x)|+𝒪(dTlog(|𝒳|log2(T)/δ)+dlog32(|𝒳|log2(T)/δ)),\displaystyle\leq\sum^{T}_{t=1}\ell_{t,x}-C_{2}\left|\sum_{t=1}^{T}(\ell_{t,x}-\widehat{\ell}_{t,x})\right|+\mathcal{O}\left(\sqrt{dT\log(|\mathcal{X}|\log_{2}(T)/\delta)}+d\log^{\tfrac{3}{2}}(|\mathcal{X}|\log_{2}(T)/\delta)\right), (Eq. (50))

which proves Eq. (5).

E.2 The algorithm of (Lee et al., 2020)

Now we introduce another high-probability adversarial linear bandit algorithm from (Lee et al., 2020). The regret bound of this algorithm is slightly worse than (Bartlett et al., 2008). However, the algorithm is efficient when there are infinite or exponentially many actions. For the concrete pseudocode of the algorithm, we refer the readers to Algorithm 2 of (Lee et al., 2020). Here, we focus on showing that it satisfies Eq. (5).

We first restate Lemma B.15 in (Lee et al., 2020) with explicit logarithmic factors: Algorithm 2 of (Lee et al., 2020) with ηC3d2lg3log(lg/δ)\eta\leq\frac{C_{3}}{d^{2}\lg^{3}\log(\lg/\delta)} for some universal constant C3>0C_{3}>0 guarantees that with probability at least 1δ1-\delta,

t=1Txtx,t𝒪(dlogTη+ηd2T+lg2Tlog(lg/δ))+DevT,x(C41C5ηd2lg3Tlog(lg/δ)),\displaystyle\sum^{T}_{t=1}\langle x_{t}-x,\ell_{t}\rangle\leq\mathcal{O}\left(\frac{d\log T}{\eta}+\eta d^{2}T+\lg^{2}\sqrt{T\log(\lg/\delta)}\right)+\textsc{Dev}_{T,x}\cdot\left(C_{4}-\frac{1}{C_{5}\eta d^{2}\lg^{3}\sqrt{T\log(\lg/\delta)}}\right), (57)

where lg=log(dT)\lg=\log(dT), DevT,x\textsc{Dev}_{T,x} is an upper bound on |t=1T(t,x^t,x)|\left|\sum_{t=1}^{T}(\ell_{t,x}-\widehat{\ell}_{t,x})\right| with probability 1δ1-\delta, C4,C5>0C_{4},C_{5}>0 are two universal constants, and we replace the self-concordant parameter in their bound by a trivial upper bound dd.

Therefore, choosing η=min{C3d2lg3log(lg/δ),12C4C5d2lg3Tlog(lg/δ),12C2C5d2lg3Tlog(lg/δ)}\eta=\min\left\{\frac{C_{3}}{d^{2}\lg^{3}\log(\lg/\delta)},\frac{1}{2C_{4}C_{5}d^{2}\lg^{3}\sqrt{T\log(\lg/\delta)}},\frac{1}{2C_{2}C_{5}d^{2}\lg^{3}\sqrt{T\log(\lg/\delta)}}\right\} for some C220C_{2}\geq 20, the coefficient of DevT,x\textsc{Dev}_{T,x} becomes at most C2-C_{2}, leading to

t=1Txtx,t\displaystyle\sum^{T}_{t=1}\langle x_{t}-x,\ell_{t}\rangle 𝒪(d3lg4log(lg/δ)+d3lg4Tlog(lg/δ))C2DevT,x\displaystyle\leq\mathcal{O}\left(d^{3}\lg^{4}\log(\lg/\delta)+d^{3}\lg^{4}\sqrt{T\log(\lg/\delta)}\right)-C_{2}\cdot\textsc{Dev}_{T,x}
𝒪(d3lg4log(lg/δ)+d3lg4Tlog(lg/δ))C2|t=1T(t,x^t,x)|.\displaystyle\leq\mathcal{O}\left(d^{3}\lg^{4}\log(\lg/\delta)+d^{3}\lg^{4}\sqrt{T\log(\lg/\delta)}\right)-C_{2}\cdot\left|\sum_{t=1}^{T}(\ell_{t,x}-\widehat{\ell}_{t,x})\right|.

Finally, using a union bound over all xx similar to Theorem B.16 of (Lee et al., 2020), we get with probability at least 12δ1-2\delta, for every x𝒳x\in\mathcal{X},

t=1Txtx,t\displaystyle\sum^{T}_{t=1}\langle x_{t}-x,\ell_{t}\rangle 𝒪(d3lg4log(lg/δ′′)+d3lg4Tlog(lg/δ′′))C2|t=1T(t,x^t,x)|\displaystyle\leq\mathcal{O}\left(d^{3}\lg^{4}\log(\lg/\delta^{\prime\prime})+d^{3}\lg^{4}\sqrt{T\log(\lg/\delta^{\prime\prime})}\right)-C_{2}\left|\sum_{t=1}^{T}(\ell_{t,x}-\widehat{\ell}_{t,x})\right|

where δ′′=δ/(|𝒳|T)\delta^{\prime\prime}=\delta/(|\mathcal{X}|T). Therefore, we conclude that this algorithm satisfies Eq. (5) as well.

Appendix F Auxiliary Lemmas

In this section, we provide several auxiliary lemmas that we have used in the analysis.

Lemma 28.

(Lemma A.2 of (Shalev-Shwartz & Ben-David, 2014)) Let a1a\geq 1 and b>0b>0. If x4alog(2a)+2bx\geq 4a\log(2a)+2b, then we have xalog(x)+bx\geq a\log(x)+b.

Lemma 29 (Concentration inequality for Catoni’s estimator (Wei et al., 2020)).

Let 0n\mathcal{F}_{0}\subset\cdots\subset\mathcal{F}_{n} be a filtration, and X1,,XnX_{1},\ldots,X_{n} be real random variables such that XiX_{i} is i\mathcal{F}_{i}-measurable, 𝔼[Xi|i1]=μi\mathbb{E}[X_{i}|\mathcal{F}_{i-1}]=\mu_{i} for some fixed μi\mu_{i}, and i=1n𝔼[(Xiμi)2|i1]V\sum_{i=1}^{n}\mathbb{E}[(X_{i}-\mu_{i})^{2}|\mathcal{F}_{i-1}]\leq V for some fixed VV. Denote μ1ni=1nμi\mu\triangleq\frac{1}{n}\sum_{i=1}^{n}\mu_{i} and let μ^n,α\widehat{\mu}_{n,\alpha} be the Catoni’s robust mean estimator of X1,,XnX_{1},\ldots,X_{n} with a fixed parameter α>0\alpha>0, that is, μ^n,α\widehat{\mu}_{n,\alpha} is the unique root of the function

f(z)=i=1nψ(α(Xiz))\displaystyle f(z)=\sum_{i=1}^{n}\psi(\alpha(X_{i}-z))

where

ψ(y)={ln(1+y+y2/2),if y0,ln(1y+y2/2),else.\psi(y)=\begin{cases}\ln(1+y+y^{2}/2),&\text{if $y\geq 0$,}\\ -\ln(1-y+y^{2}/2),&\text{else.}\end{cases}

Then for any δ(0,1)\delta\in(0,1), as long as nn is large enough such that nα2(V+i=1n(μiμ)2)+2log(1/δ)n\geq\alpha^{2}(V+\sum_{i=1}^{n}(\mu_{i}-\mu)^{2})+2\log(1/\delta), we have with probability at least 12δ1-2\delta,

|μ^n,αμ|α(V+i=1n(μiμ)2)n+2log(1/δ)αn.\displaystyle|\widehat{\mu}_{n,\alpha}-\mu|\leq\frac{\alpha(V+\sum_{i=1}^{n}(\mu_{i}-\mu)^{2})}{n}+\frac{2\log(1/\delta)}{\alpha n}.

Choosing α\alpha optimally, we have

|μ^n,αμ|2n2(V+i=1n(μiμ)2)log(1/δ).\displaystyle|\widehat{\mu}_{n,\alpha}-\mu|\leq\frac{2}{n}\sqrt{2\left(V+\sum_{i=1}^{n}(\mu_{i}-\mu)^{2}\right)\log(1/\delta)}.

In particular, if μ1==μn=μ\mu_{1}=\cdots=\mu_{n}=\mu, we have

|μ^n,αμ|2n2Vlog(1/δ).\displaystyle|\widehat{\mu}_{n,\alpha}-\mu|\leq\frac{2}{n}\sqrt{2V\log(1/\delta)}.
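
A minimal Python implementation of this estimator is sketched below; the bracketing interval used for the root search and the heavy-tailed test data are our own illustrative choices.

import numpy as np
from scipy.optimize import brentq

def psi(y):
    # Catoni's influence function from Lemma 29
    return np.where(y >= 0,
                    np.log1p(y + 0.5 * y ** 2),
                    -np.log1p(-y + 0.5 * y ** 2))

def catoni_mean(X, alpha):
    # The estimator is the unique root of f(z) = sum_i psi(alpha * (X_i - z)).
    # f is decreasing in z, strictly positive below min(X) and strictly negative
    # above max(X), so [min(X) - 1, max(X) + 1] brackets the root.
    f = lambda z: float(np.sum(psi(alpha * (X - z))))
    return brentq(f, X.min() - 1.0, X.max() + 1.0)

rng = np.random.default_rng(0)
n, var, delta = 500, 3.0, 0.01
X = rng.standard_t(df=3, size=n)                    # heavy-tailed samples with mean 0 and variance 3
alpha = np.sqrt(2 * np.log(1 / delta) / (n * var))  # near-optimal alpha; the lemma's V equals n * var here
print(catoni_mean(X, alpha))                        # should be close to the true mean 0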