Achieving Near Instance-Optimality and Minimax-Optimality
in Stochastic and Adversarial Linear Bandits Simultaneously
Abstract
In this work, we develop linear bandit algorithms that automatically adapt to different environments. By plugging a novel loss estimator into the optimization problem that characterizes the instance-optimal strategy, our first algorithm not only achieves nearly instance-optimal regret in stochastic environments, but also works in corrupted environments with additional regret proportional to the amount of corruption, while the state of the art (Li et al., 2019) achieves neither instance-optimality nor the optimal dependence on the corruption amount. Moreover, by equipping this algorithm with an adversarial component and carefully designed statistical tests, our second algorithm additionally enjoys minimax-optimal regret in completely adversarial environments; to our knowledge, it is the first algorithm of this kind. Finally, all our guarantees hold with high probability, while existing instance-optimal guarantees only hold in expectation.
1 Introduction
We consider the linear bandit problem with a finite and fixed action set. In this problem, the learner repeatedly selects an action from the action set and observes her loss, whose mean is the inner product between the chosen action and an unknown loss vector determined by the environment. The goal is to minimize the regret, which is the difference between the learner's total loss and the total loss of the best action in hindsight. Two standard environments are heavily studied in the literature: the stochastic environment and the adversarial environment. In the stochastic environment, the loss vector is fixed over time, and we are interested in instance-optimal regret bounds of order $\mathcal{O}(\log T)$ for any $T$, where $T$ is the number of rounds and $\mathcal{O}(\cdot)$ hides some instance-dependent constants. On the other hand, in the adversarial environment, the loss vector can be arbitrary in each round, and we are interested in the minimax-optimal regret bound $\widetilde{\mathcal{O}}(\sqrt{T})$, where $\widetilde{\mathcal{O}}(\cdot)$ hides the problem dimension and logarithmic factors in $T$.
While there are many algorithms obtaining such optimal bounds in either environment (e.g., (Lattimore & Szepesvari, 2017) in the stochastic setting and (Bubeck et al., 2012) in the adversarial setting), a natural question is whether there exists an algorithm achieving both guarantees simultaneously without knowing the type of the environment. Indeed, the same question has been studied extensively in recent years for the special case of multi-armed bandits where the action set is the standard basis (Bubeck & Slivkins, 2012; Seldin & Slivkins, 2014; Auer & Chiang, 2016; Seldin & Lugosi, 2017; Wei & Luo, 2018; Zimmert & Seldin, 2019). Notably, Zimmert & Seldin (2019) developed an algorithm that is optimal up to universal constants for both stochastic and adversarial environments, and the techniques have been extended to combinatorial semi-bandits (Zimmert et al., 2019) and finite-horizon tabular Markov decision processes (Jin & Luo, 2020). Despite all these advances, however, it is still open whether similar results can be achieved for general linear bandits.
On the other hand, another line of recent works studies the robustness of stochastic linear bandit algorithms from a different perspective and considers a corrupted setting where an adversary can corrupt the stochastic losses up to some limited total amount $C$. This was first considered in multi-armed bandits (Lykouris et al., 2018; Gupta et al., 2019; Zimmert & Seldin, 2019, 2021) and later extended to linear bandits (Li et al., 2019) and Markov decision processes (Lykouris et al., 2019; Jin & Luo, 2020). Ideally, the regret of a robust stochastic algorithm should degrade only by an additive term in $C$ in this setting, which is indeed the case in (Gupta et al., 2019; Zimmert & Seldin, 2019, 2021; Jin & Luo, 2020) for multi-armed bandits or Markov decision processes, but had not been achieved for general linear bandits.
In this paper, we make significant progress in this direction and develop algorithms with near-optimal regret simultaneously for different environments. Our main contributions are as follows.
-
•
In Section 4, we first introduce Algorithm REOLB, a simple algorithm that achieves nearly instance-optimal regret with high probability in the corrupted setting (Footnote 1: In the text, $\widetilde{\mathcal{O}}(\cdot)$ often hides lower-order terms (in terms of $T$-dependence) for simplicity. However, in all formal theorem/lemma statements, we use $\mathcal{O}(\cdot)$ to hide universal constants only.), where $c^*$ is an instance-dependent quantity such that the instance-optimal bound for the stochastic setting (i.e., $C = 0$) is $c^* \log T$. This result significantly improves over (Li et al., 2019), which only achieves a significantly worse bound in terms of $d$, $\Delta_{\min}$, and $C$, where $d$ is the dimension of the actions and $\Delta_{\min}$ is the minimum sub-optimality gap satisfying $c^* \leq \mathcal{O}(d/\Delta_{\min})$. Moreover, Algorithm REOLB also ensures an instance-independent bound that some existing instance-optimal algorithms fail to achieve even when $C = 0$ (e.g., (Jun & Zhang, 2020)).
-
•
In Section 5, based on Algorithm REOLB, we further propose Algorithm 1, which not only achieves nearly instance-optimal regret in the stochastic setting, but also achieves the minimax-optimal regret in the adversarial setting (both with high probability). To the best of our knowledge, this is the first algorithm that enjoys the best of both worlds for linear bandits. Additionally, the same algorithm also guarantees a regret bound in the corrupted setting that is slightly worse than that of Algorithm REOLB but still significantly better than (Li et al., 2019).
-
•
Finally, noticing the extra logarithmic factor in our bound for the stochastic setting, in Appendix D we also prove that this factor is in fact inevitable if the same algorithm simultaneously achieves sublinear regret in the adversarial setting with high probability (which is the case for Algorithm 1). This generalizes the result of (Auer & Chiang, 2016) for two-armed bandits.
At a high level, Algorithm REOLB utilizes a well-known optimization problem (which characterizes the lower bound in the stochastic setting) along with a robust estimator to determine a randomized strategy for each round. This ensures the near instance-optimality of the algorithm in the stochastic setting, and also its robustness to corruption when combined with a doubling trick. To handle the adversarial setting as well, Algorithm 1 switches between an adversarial linear bandit algorithm with high-probability regret guarantees and a variant of Algorithm REOLB, depending on the results of some carefully designed statistical tests on the stochasticity of the environment.
2 Related Work
Linear Bandits.
Linear bandits is a classic model to study sequential decision problems. The stochastic setting dates back to (Abe & Long, 1999). Auer (2002) first used the optimism principle to solve this problem. Later, several algorithms were proposed based on confidence ellipsoids, further improving the regret bounds (Dani et al., 2008a; Rusmevichientong & Tsitsiklis, 2010; Abbasi-Yadkori et al., 2011; Chu et al., 2011).
On the other hand, the adversarial setting was introduced by Awerbuch & Kleinberg (2004). Dani et al. (2008b) achieved the first $\widetilde{\mathcal{O}}(\sqrt{T})$ expected regret bound using the Geometric Hedge algorithm (also called Exp2) with uniform exploration over a barycentric spanner. Abernethy et al. (2008) proposed the first computationally efficient algorithm achieving $\widetilde{\mathcal{O}}(\sqrt{T})$ regret, using the Follow-the-Regularized-Leader framework. Bubeck et al. (2012) further tightened the bound by improving Exp2 with John's exploration. Our Algorithm 1 makes use of any adversarial linear bandit algorithm with high-probability guarantees (e.g., (Bartlett et al., 2008; Lee et al., 2020)) in a black-box manner.
Instance Optimality for Bandit Problems.
In the stochastic setting, Lattimore & Szepesvari (2017) showed that, unlike multi-armed bandits, optimism-based algorithms or Thompson sampling can be arbitrarily far from optimal in some simple instances. They proposed an algorithm also based on the lower bound optimization problem to achieve instance-optimality, but their algorithm is deterministic and cannot be robust to an adversary. Instance-optimality was also considered in other related problems lately such as linear contextual bandits (Hao et al., 2020; Tirinzoni et al., 2020), partial monitoring (Komiyama et al., 2015), and structured bandits (Combes et al., 2017; Jun & Zhang, 2020). Most of these works only consider expected regret, while our guarantees all hold with high probability.
Best-of-Both-Worlds.
Algorithms that are optimal for both stochastic and adversarial settings were studied in multi-armed bandits (Bubeck & Slivkins, 2012; Seldin & Slivkins, 2014; Auer & Chiang, 2016; Seldin & Lugosi, 2017; Wei & Luo, 2018; Zimmert & Seldin, 2019), semi-bandits (Zimmert et al., 2019), and Markov Decision Processes (Jin & Luo, 2020). On the other hand, linear bandits, a generalization of multi-armed bandits and semi-bandits, is much more challenging and currently underexplored in this direction. To the best of our knowledge, our algorithm is the first that guarantees near-optimal regret bounds in both stochastic and adversarial settings simultaneously.
Stochastic Bandits with Corruption.
Lykouris et al. (2018) first considered the corrupted setting for multi-armed bandits. Their results were improved by (Gupta et al., 2019; Zimmert & Seldin, 2019, 2021) and extended to linear bandits (Li et al., 2019; Bogunovic et al., 2020) and reinforcement learning (Lykouris et al., 2019). As mentioned, our results significantly improve those of (Li et al., 2019) (although their corruption model is slightly more general than ours; see Section 3). On the other hand, the results of (Bogunovic et al., 2020) are incomparable to ours, because they consider a setting where the adversary has even more power and can decide the corruption after seeing the chosen action. Finally, we note that (Lykouris et al., 2019, Theorem 3.2) considers episodic linear Markov decision processes in the corrupted setting, which can be seen as a generalization of linear bandits. However, this result is highly suboptimal when specialized to linear bandits (even ignoring other parameters).
3 Preliminaries
Let $\mathcal{A} \subset \mathbb{R}^d$ be a finite set that spans $\mathbb{R}^d$. Each element of $\mathcal{A}$ is called an arm or an action. We assume $\|a\|_2 \le 1$ for all $a \in \mathcal{A}$. A linear bandit problem proceeds in $T$ rounds. In each round $t$, the learner selects an action $a_t \in \mathcal{A}$. Simultaneously, the environment decides a hidden loss vector $\ell_t \in \mathbb{R}^d$ and generates some independent zero-mean noise $\epsilon_t(a)$ for each action $a$. Afterwards, the learner observes her loss $\langle a_t, \ell_t \rangle + \epsilon_t(a_t)$. We consider three different types of settings: stochastic, corrupted, and adversarial, explained in detail below.
In the stochastic setting, $\ell_t$ is fixed to some unknown vector $\theta$. We assume that there exists a unique optimal arm $a^* = \operatorname{argmin}_{a \in \mathcal{A}} \langle a, \theta \rangle$, and define for each $a \in \mathcal{A}$ its sub-optimality gap as $\Delta_a = \langle a - a^*, \theta \rangle$. Also denote the minimum nonzero gap by $\Delta_{\min} = \min_{a \neq a^*} \Delta_a$.
The corrupted setting is a generalization of the stochastic setting, where in addition to a fixed vector $\theta$, the environment also decides a corruption vector $c_t$ for each round $t$ (before seeing $a_t$) so that $\ell_t = \theta + c_t$. (Footnote 2: In other words, the environment corrupts the observation by adding $\langle a_t, c_t \rangle$. The setting of (Li et al., 2019) is slightly more general, with the corruption on action $a$ being $c_t(a)$ for some function $c_t$ that is not necessarily linear.) We denote the total amount of corruption over the $T$ rounds by $C$. The stochastic setting is clearly a special case with $C = 0$. In both of these settings, we define the regret as $\text{Reg} = \sum_{t=1}^{T} \langle a_t - a^*, \theta \rangle$.
Finally, in the adversarial setting, $\ell_t$ can be chosen arbitrarily (possibly dependent on the learner's algorithm and her previously chosen actions). The difference compared to the corrupted setting (which also has potentially arbitrary loss vectors) is that the regret is now defined in terms of $\ell_t$ itself: $\text{Reg} = \sum_{t=1}^{T} \langle a_t, \ell_t \rangle - \min_{a \in \mathcal{A}} \sum_{t=1}^{T} \langle a, \ell_t \rangle$.
In all settings, we assume that the mean losses $\langle a, \ell_t \rangle$ and the observed losses are in $[-1, 1]$ for all $t$ and $a$. We also write $\ell_t(a)$ for $\langle a, \ell_t \rangle$ and similarly $\ell(a)$ for $\langle a, \theta \rangle$.
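For concreteness, here is a minimal simulator of the interaction protocol described above (a sketch only: the noise distribution, its scale, and the class and attribute names are illustrative choices, not part of the paper):

```python
import numpy as np

class CorruptedLinearBandit:
    """Minimal simulator of the protocol above: in round t the learner pulls
    action a_t and observes <a_t, theta + c_t> plus independent zero-mean noise."""

    def __init__(self, actions, theta, corruptions=None, seed=0):
        self.actions = np.asarray(actions, dtype=float)   # (K, d), assumed bounded norms
        self.theta = np.asarray(theta, dtype=float)       # hidden loss vector (stochastic part)
        self.corruptions = corruptions                    # optional (T, d) array of c_t
        self.rng = np.random.default_rng(seed)
        self.t = 0

    def pull(self, i):
        ell_t = self.theta.copy()
        if self.corruptions is not None:
            ell_t += self.corruptions[self.t]             # corrupted mean loss vector
        mean = float(self.actions[i] @ ell_t)
        self.t += 1
        # illustrative bounded zero-mean noise; the paper only assumes
        # independence, zero mean, and bounded observed losses
        return mean + self.rng.uniform(-0.1, 0.1)
```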
It is known that the minimax optimal regret in the adversarial setting is $\widetilde{\Theta}(d\sqrt{T})$ (Dani et al., 2008b; Bubeck et al., 2012). The instance-optimality in the stochastic case, on the other hand, is slightly more complicated. Specifically, an algorithm is called consistent if it guarantees $\mathbb{E}[\text{Reg}] = o(T^p)$ for any $p > 0$, $\theta$, and $\mathcal{A}$. Then, a classic lower bound result (see e.g., (Lattimore & Szepesvari, 2017)) states that for a particular instance $(\mathcal{A}, \theta)$, all consistent algorithms satisfy (Footnote 3: The original proof is under the Gaussian noise assumption. To meet our boundedness assumption on the losses, it suffices to consider the case when the observed loss is a Bernoulli random variable, which only affects the constant of the lower bound.):
$$\liminf_{T\to\infty} \frac{\mathbb{E}[\text{Reg}]}{\log T} \;\ge\; c^*,$$
where $c^*$ is the objective value of the following optimization problem:
$$\inf_{N \in [0,\infty)^{\mathcal{A}}} \; \sum_{a \neq a^*} N_a \Delta_a \tag{1}$$
$$\text{subject to} \quad \|a\|^2_{H_N^{-1}} \le \frac{\Delta_a^2}{2}, \quad \forall a \neq a^*, \tag{2}$$
and $H_N = \sum_{a \in \mathcal{A}} N_a\, a a^\top$ (the notation $\|x\|_M = \sqrt{x^\top M x}$ denotes the quadratic norm with respect to a matrix $M$). This implies that the best instance-dependent bound one can hope for is $c^* \log T$ (and more generally $c^* \log T$ plus an additive term in $C$ for the corrupted setting). It can be shown that $c^* \le \mathcal{O}(d/\Delta_{\min})$ (see Lemma 16), but this upper bound can be arbitrarily loose as shown in (Lattimore & Szepesvari, 2017).
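For intuition, consider the multi-armed bandit special case where $\mathcal{A} = \{e_1,\dots,e_d\}$ is the standard basis; the worked computation below is a sketch assuming the constraint takes exactly the form in Eq. (2).

```latex
% Multi-armed bandit specialization of Eq. (1)-(2) (standard basis actions).
H_N = \sum_{a} N_a\, a a^\top = \mathrm{diag}(N_1, \dots, N_d)
\;\Longrightarrow\;
\|e_a\|_{H_N^{-1}}^2 = \frac{1}{N_a} \le \frac{\Delta_a^2}{2}
\;\Longleftrightarrow\;
N_a \ge \frac{2}{\Delta_a^2}.
% Plugging the smallest feasible N_a into the objective of Eq. (1) gives
c^* = \sum_{a \ne a^*} N_a \Delta_a = \sum_{a \ne a^*} \frac{2}{\Delta_a},
% which recovers the familiar \sum_{a \ne a^*} O(\log T / \Delta_a)
% instance-dependent lower bound for multi-armed bandits.
```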
The solution $N_a$ of the optimization problem above specifies the least number of times action $a$ should be drawn in order to distinguish between the present environment and any other alternative environment with a different optimal action. Many previous instance-optimal algorithms try to match the number of pulls of each action to this solution under some estimated gaps (Lattimore & Szepesvari, 2017; Hao et al., 2020; Jun & Zhang, 2020). While these algorithms are asymptotically optimal, their regret usually grows linearly when $T$ is small (Jun & Zhang, 2020). Furthermore, they are all deterministic algorithms and by design cannot tolerate corruptions. We will show how these issues can be addressed in the next section.
Notations.
We use $\Omega$ to denote the probability simplex over $\mathcal{A}$: $\Omega = \{p \in [0,1]^{\mathcal{A}} : \sum_{a \in \mathcal{A}} p_a = 1\}$, and define the clipping operator as $\mathrm{clip}_{[x,y]}(z) = \min\{\max\{z, x\}, y\}$ for $x \le y$.
4 A New Algorithm for the Corrupted Setting
In this section, we focus on the corrupted setting (hence covering the stochastic setting as well). We introduce a new algorithm that achieves with high probability an instance-dependent regret bound for large $T$ and also an instance-independent regret bound for any $T$. This improves over previous instance-optimal algorithms (Lattimore & Szepesvari, 2017; Hao et al., 2020; Jun & Zhang, 2020) in several respects: 1) first and foremost, our algorithm handles corruption optimally with extra regret proportional to $C$, while previous algorithms can fail completely due to their deterministic nature; 2) previous bounds only hold in expectation; 3) previous algorithms might suffer linear regret when $T$ is small, while ours is always sublinear for any $T$. The price we pay is an additional logarithmic factor in the instance-dependent bound. On the other hand, compared to the work of (Li et al., 2019), which also covers the same corrupted setting, our results are also significantly better (recall $c^* \le \mathcal{O}(d/\Delta_{\min})$), although as mentioned in Footnote 2, their results hold for an even more general setting with non-linear corruption.
Our algorithm, REOLB, proceeds in blocks of rounds whose lengths grow in a doubling manner (each block is twice as long as the previous one). At the beginning of each block $m$, we compute a distribution over actions by solving an optimization problem OP (described below) using the empirical gaps estimated in the previous block. We then use this distribution to sample actions for the entire block $m$, and construct an unbiased loss estimator in every round for every action. At the end of block $m$, we use these per-round estimators to construct a robust loss estimator for each action, which is then used to form the gap estimates for the next block. We next explain the optimization problem OP and the estimators in detail.
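The following sketch illustrates the doubling-block structure just described (illustrative only: `solve_op` and `robust_mean` stand in for OP and the clipped Catoni estimator discussed below, and all constants are placeholders):

```python
import numpy as np

def reolb_skeleton(actions, T, pull, solve_op, robust_mean, seed=0):
    """Doubling-block loop: solve OP with last block's gap estimates, sample
    from the resulting distribution for the whole block, then refresh the
    robust loss/gap estimates at the end of the block."""
    rng = np.random.default_rng(seed)
    actions = np.asarray(actions, dtype=float)          # (K, d)
    K = len(actions)
    gap_hat = np.zeros(K)                               # initial gap estimates
    t, m = 0, 0
    while t < T:
        block_len = min(2 ** m, T - t)                  # blocks double in length
        p = solve_op(actions, gap_hat)                  # distribution over actions
        H = actions.T @ (p[:, None] * actions)          # H(p) = sum_a p_a a a^T
        per_round = []                                  # unbiased per-round estimates
        for _ in range(block_len):
            i = rng.choice(K, p=p)
            loss = pull(actions[i])                     # bandit feedback
            theta_hat = np.linalg.solve(H, actions[i]) * loss
            per_round.append(actions @ theta_hat)       # <a, theta_hat> for every a
            t += 1
        # robust mean (e.g., clipped Catoni) of each action's estimates
        mean_loss = robust_mean(np.array(per_round))    # shape (K,)
        gap_hat = mean_loss - mean_loss.min()
        m += 1
    return gap_hat
```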
OP is inspired by the lower bound optimization (Eq. (1) and Eq. (2)), where we normalize the pull counts $N_a$ into a distribution over the arms, so that sampling from this distribution for a long enough block approximately realizes the prescribed pull counts. One key difference between our algorithm and previous ones (Lattimore & Szepesvari, 2017; Hao et al., 2020; Jun & Zhang, 2020) is exactly that we select actions randomly according to these distributions, while they try to deterministically match the pull count of each arm to $N_a$. Our randomized strategy not only prevents the environment from exploiting knowledge of the learner's choices, but also allows us to construct unbiased estimators following standard adversarial linear bandit algorithms (Dani et al., 2008b; Bubeck et al., 2012). In fact, as shown in our analysis, the variance of the estimator for each action is bounded in terms of its sub-optimality gap in light of the variance constraint of OP. A similar idea of imposing explicit constraints on the variance of loss estimators appears before in, for example, (Dudik et al., 2011; Agarwal et al., 2014) for contextual bandits. Finally, we point out that OP always has a solution thanks to the additive slack term in its constraint (see Lemma 11), and it can be solved efficiently by standard methods since the constraint is convex.
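Since OP itself is not reproduced here, the snippet below is only a sketch of an OP-style convex program (the objective, the variance thresholds, and the slack term are hypothetical placeholders); it mainly illustrates that a constraint of the form "$\|a\|^2_{H(p)^{-1}}$ at most a gap-dependent threshold" is convex in $p$ and can be handled by off-the-shelf solvers:

```python
import cvxpy as cp
import numpy as np

def solve_op_sketch(actions, gap_hat, budget):
    """actions: (K, d) array of arms; gap_hat: (K,) estimated gaps; budget: a
    scalar controlling how tight the variance constraints are. Minimizes the
    estimated per-round regret subject to ||a||^2_{H(p)^{-1}} being bounded in
    terms of the (estimated) gap of a, plus a slack term; budget and slack are
    placeholders and should be chosen so the program is feasible."""
    actions = np.asarray(actions, dtype=float)
    K, d = actions.shape
    p = cp.Variable(K, nonneg=True)
    H = p[0] * np.outer(actions[0], actions[0])
    for i in range(1, K):
        H = H + p[i] * np.outer(actions[i], actions[i])
    slack = 1.0 / budget                                   # placeholder slack term
    constraints = [cp.sum(p) == 1]
    constraints += [
        # matrix_frac(a, H) = a^T H^{-1} a is convex in H (hence in p)
        cp.matrix_frac(actions[i], H) <= budget * (gap_hat[i] ** 2 + slack)
        for i in range(K)
    ]
    problem = cp.Problem(cp.Minimize(p @ gap_hat), constraints)
    problem.solve()
    return np.asarray(p.value)
```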
Another important ingredient of our algorithm is the robust estimator, which is a clipped version of Catoni's estimator (Catoni, 2012) constructed using all the unbiased estimators from the current block for each action (cf. Lemma 29). From a technical perspective, this avoids a lower-order term in Bernstein-style concentration bounds and is critical for our analysis. We in fact believe that a robust estimator is necessary here, since there is no explicit regularization on the magnitude of the per-round estimators, and they can indeed have a heavy-tailed distribution. While other robust estimators are possible, we use Catoni's estimator, which was analyzed in (Wei et al., 2020) for non-i.i.d. random variables (again, important for our analysis).
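A sketch of the clipped Catoni mean estimator (the influence function follows Catoni (2012); the weight alpha and the clipping range used by the actual algorithm are not reproduced here and are treated as inputs):

```python
import numpy as np
from scipy.optimize import brentq

def catoni_psi(y):
    """Catoni's influence function: psi(y) = log(1 + y + y^2/2) for y >= 0,
    extended as an odd function to y < 0."""
    return np.sign(y) * np.log1p(np.abs(y) + 0.5 * y ** 2)

def catoni_mean(x, alpha):
    """Catoni's estimator: the unique root z of f(z) = sum_i psi(alpha*(x_i - z)).
    f is strictly decreasing in z, so widening the bracket until the signs
    differ and running Brent's method finds the root."""
    x = np.asarray(x, dtype=float)
    f = lambda z: float(np.sum(catoni_psi(alpha * (x - z))))
    lo, hi = x.min() - 1.0, x.max() + 1.0
    while f(lo) <= 0.0:
        lo -= 1.0
    while f(hi) >= 0.0:
        hi += 1.0
    return brentq(f, lo, hi)

def clipped_catoni(x, alpha, clip_lo=-1.0, clip_hi=1.0):
    """Clipped version used as the robust block estimator; the clipping range
    [-1, 1] matches the assumed range of the true (mean) losses."""
    return float(np.clip(catoni_mean(x, alpha), clip_lo, clip_hi))
```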
The following theorem summarizes the nearly instance-optimal regret bound of Algorithm REOLB.
Theorem 1.
In the corrupted setting, Algorithm REOLB guarantees that with probability at least ,
where is some constant that depends on and only.
The dominating term of this regret bound is thus as claimed. The definition of can be found in the proof (Appendix B) and is importantly independent of . In fact, in Theorem 19, we also provide an alternative (albeit weaker) bound for Algorithm REOLB without the dependence on .
The next theorem shows an instance-independent bound of order for Algorithm REOLB, which previous instance-optimal algorithms fail to achieve as mentioned.
Theorem 2.
In the corrupted setting, Algorithm REOLB guarantees that with probability at least , .
We emphasize that Algorithm REOLB is parameter-free and does not need to know the corruption amount $C$ to achieve these bounds. In the rest of the section, we provide a proof sketch for Theorem 1 and Theorem 2. First, we show that the estimated gaps are close to the true gaps up to a constant multiplicative factor and some additive terms that shrink with the block length, up to some average amount of corruption.
Lemma 3.
With probability at least , Algorithm REOLB ensures for all and all ,
(3)
(4)
where ( is defined as ), is the amount of corruption within block , and .
As mentioned, the proof of Lemma 3 heavily relies on the robust estimators we use as well as the variance constraint of OP. Next, we have the following lemma which bounds the objective value of OP.
Lemma 4.
Let be the solution of , where . Then we have .
Combining Lemma 3 and Lemma 4, we see that in block $m$, the regret of Algorithm REOLB can be upper bounded by
where in the first equality we use Lemma 3 and in the second equality we use Lemma 4 with the fact that . Further summing this over and relating to proves Theorem 2.
In addition, based on Lemma 3, we show that once the block length is larger than the total corruption plus some problem-dependent constant, the estimated gap becomes an accurate estimate of the true gap up to a constant factor. Therefore, the solution from OP is very close to the (normalized) optimal solution of Eq. (1) and Eq. (2), except that we have an additional logarithmic factor in the constraint. Therefore, the regret is nearly instance-optimal for large enough $T$. Formally, we have the following lemma.
Lemma 5.
Algorithm REOLB guarantees with probability at least , for some constant depending on , and :
5 Best of Three Worlds
In this section, building on top of Algorithm REOLB, we develop another algorithm that enjoys similar regret guarantees in the stochastic or corrupted setting, and additionally guarantees $\widetilde{\mathcal{O}}(\sqrt{T})$ regret in the adversarial setting, without having any prior knowledge of which environment it is facing. To the best of our knowledge, this kind of best-of-three-worlds guarantee has only appeared before for multi-armed bandits (Wei & Luo, 2018; Zimmert & Seldin, 2019) and Markov decision processes (Jin & Luo, 2020), but not for linear bandits.
Our algorithm requires black-box access to an adversarial linear bandit algorithm that satisfies the following:
Assumption 1.
is a linear bandit algorithm that outputs a loss estimator for each action after each time . There exist , , and universal constant , such that for all , guarantees the following with probability at least : ,
(5) |
Eq. (5) states that the regret of the black-box against any action is bounded by a $\sqrt{t}$-order term minus the deviation between the loss of that action and its estimator. While this might not seem intuitive, in fact all existing linear bandit algorithms with a near-optimal high-probability bound satisfy Assumption 1, even though this may not have been stated explicitly (and one may need to slightly change the constant parameters in these algorithms to satisfy the conditions in Assumption 1). Below, we give two examples of such black-box algorithms and justify them in Appendix E.
-
•
A variant of GeometricHedge.P (Bartlett et al., 2008) with an improved exploration scheme satisfies Assumption 1 with ()
-
•
The algorithm of (Lee et al., 2020) satisfies Assumption 1 with (, )
With such a black-box at hand, our algorithm BOTW is shown in Algorithm 1. We first present its formal guarantees in different settings.
Theorem 6.
Algorithm 1 guarantees that with probability at least , in the stochastic setting (), is at most
where is the same problem-dependent constant as in Theorem 1; and in the corrupted setting (), is at most
In the case when the black-box is the variant of GeometricHedge.P, the last bound is
Therefore, Algorithm 1 enjoys nearly instance-optimal regret in the stochastic setting, just like Algorithm REOLB (Footnote 4: Note that this holds when we choose the variant of GeometricHedge.P as the black-box, whose contribution is dominated by the main term when $T$ is sufficiently large.), but slightly worse regret in the corrupted setting (recall again $c^* \le \mathcal{O}(d/\Delta_{\min})$). In exchange, however, Algorithm 1 enjoys the following worst-case robustness in the adversarial setting.
Theorem 7.
In the adversarial setting, Algorithm 1 guarantees that with probability at least , is at most .
The dependence on $T$ in this bound is minimax-optimal as mentioned, while the dependence on the dimension depends on the coefficient of the black-box. Note that because of this adversarial robustness, the extra logarithmic dependence in Theorem 6 turns out to be unavoidable, as we show in Theorem 27. In addition, Theorem 7 also applies to the stochastic setting, which implies a worst-case regret bound slightly better than the guarantee of Algorithm REOLB shown in Theorem 2.
Next, in Section 5.1, we describe our algorithm in detail. Then in Section 5.2 and Section 5.3, we provide proof sketches for Theorem 7 and Theorem 6 respectively.
5.1 The algorithm
Algorithm 1 (BOTW) takes a black-box satisfying Assumption 1 as input, and then proceeds in epochs until the game ends. In each epoch, it runs its single-epoch version BOTW-SE (Algorithm 2) with a minimum duration. Based on the results of some statistical tests, at some point BOTW-SE will terminate with an output. Then BOTW enters the next epoch with the minimum duration updated based on this output, so that the number of epochs is always $\mathcal{O}(\log T)$.
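A minimal sketch of this outer loop (`botw_se` is a placeholder for the single-epoch procedure of Algorithm 2, and the exact rule for updating the minimum duration is an assumption here):

```python
def botw(botw_se, T, t_min_init=1):
    """Outer loop of BOTW: repeatedly run the single-epoch procedure with a
    growing minimum duration so that the number of epochs stays O(log T)."""
    t, t_min = 0, t_min_init
    while t < T:
        # botw_se runs Phase 1 (the adversarial black-box) and then Phase 2
        # (OP-based exploitation), and returns the number of rounds it used.
        rounds_used = botw_se(min_duration=t_min, remaining=T - t)
        t += rounds_used
        t_min = max(2 * t_min, rounds_used)   # assumed doubling-style update
    return t
```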
BOTW-SE has two phases. In Phase 1, the learner executes the adversarial linear bandit algorithm. Starting from the minimum duration specified by the input, the algorithm checks in every round whether Eq. (7) and Eq. (8) hold for some action (Line 3). If there exists such an action, Phase 1 terminates and the algorithm proceeds to Phase 2. This test is to detect whether the environment is likely stochastic. Indeed, Eq. (7) and Eq. (8) imply that the performance of the learner is significantly better than that of all but one action (which we denote by $\widehat{a}$). In the stochastic environment, this event happens roughly when the black-box's $\sqrt{t}$-order regret becomes comparable to $t\Delta_{\min}$, with the surviving action being $a^*$. This is exactly the timing when the learner should stop using the black-box, whose regret grows as $\sqrt{t}$, and start exploiting the better actions, in order to keep the regret logarithmic in time in the stochastic environment. We define $t_1$ to be the time when Phase 1 ends, and $\widehat{\Delta}_a$ to be the empirical gap of action $a$ with respect to the estimators obtained from the black-box so far (Line 3). In the stochastic setting, we can show that $\widehat{\Delta}_a = \Theta(\Delta_a)$ holds with high probability.
In the second phase, we calculate the action distribution using OP with the estimated gaps $\widehat{\Delta}_a$. Indeed, if these estimates are accurate, the distribution returned by OP is close to the optimal way of allocating arm pulls, leading to near-optimal regret. (Footnote 5: Here, we solve OP at every iteration for simplicity. It can in fact be done only when the elapsed time doubles, just like in Algorithm REOLB.) For technical reasons, there are some differences between Phase 2 and Algorithm REOLB. First, instead of directly using the distribution returned by OP to draw actions, we mix it with the point mass on $\widehat{a}$, and draw actions from the mixture. This way, $\widehat{a}$ is drawn with sufficiently large probability. Moreover, the loss estimator is now defined as the following:
(6) |
where the importance weight is taken with respect to the mixture distribution used for sampling. While the construction of the estimator for actions $a \neq \widehat{a}$ is the same as in Algorithm REOLB, the construction for $\widehat{a}$ is different and is based on standard inverse probability weighting. These differences arise mainly because we later use the simple average estimator instead of the robust mean estimator for $\widehat{a}$ (the latter produces a slightly looser concentration bound in our analysis). Therefore, we must ensure that $\widehat{a}$ is drawn with enough probability, and that the magnitude of its loss estimator is well-controlled.
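A sketch of the two estimator types in Eq. (6): the usual covariance-based estimator for actions other than $\widehat{a}$, and plain inverse probability weighting for $\widehat{a}$ (the sampling distribution passed in is assumed to be the OP/point-mass mixture described above):

```python
import numpy as np

def loss_estimates(actions, sample_probs, chosen_idx, observed_loss, a_hat_idx):
    """One round of loss estimation.
    actions: (K, d) array; sample_probs: distribution actually used to draw the
    action; chosen_idx: index of the drawn action; observed_loss: its loss."""
    actions = np.asarray(actions, dtype=float)
    probs = np.asarray(sample_probs, dtype=float)
    H = actions.T @ (probs[:, None] * actions)          # H = sum_a p_a a a^T
    theta_hat = np.linalg.solve(H, actions[chosen_idx]) * observed_loss
    est = actions @ theta_hat                            # <a, theta_hat> for every a
    # for a_hat, use standard inverse probability weighting instead
    est[a_hat_idx] = float(chosen_idx == a_hat_idx) * observed_loss / probs[a_hat_idx]
    return est
```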
(7), (8): the Phase 1 test (Phase 1 terminates when both hold for some action).
(9), (10): the Phase 2 tests (Phase 2 terminates when either holds).
Then, for any time in Phase 2, we define the average empirical gap of each action as the following:
(11) |
where
with the Catoni estimator applied as before (cf. Lemma 29). Note that we use a simple average estimator for $\widehat{a}$, but a hybrid of the average estimator of Phase 1 and the robust estimator of Phase 2 for the other actions. These gap estimators are useful in monitoring the non-stochasticity of the environment, which is done via the tests in Eq. (9) and Eq. (10). The first condition (Eq. (9)) checks whether the average empirical gap is still close to the estimated gap at the end of Phase 1. The second condition (Eq. (10)) checks whether the regret against $\widehat{a}$ incurred in Phase 2 is still tolerable. It can be shown (see Lemma 10) that with high probability, Eq. (9) and Eq. (10) do not hold in a stochastic environment. Therefore, when either event is detected, BOTW-SE terminates and returns its output to BOTW, which will then run BOTW-SE again from scratch with an increased minimum duration.
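Schematically, Phase 2's monitoring can be summarized as the check below (the two threshold functions are placeholders; the actual tests Eq. (9) and Eq. (10) use specific deviation and regret thresholds from the paper):

```python
def phase2_should_terminate(avg_gaps_now, gaps_end_phase1,
                            est_regret_vs_a_hat, dev_threshold, reg_threshold):
    """Return True if either non-stochasticity test fires.
    Test (9): the average empirical gaps have drifted too far from the
    estimates at the end of Phase 1.  Test (10): the estimated regret against
    a_hat accumulated in Phase 2 exceeds its tolerable budget."""
    drift = max(abs(g_now - g_old)
                for g_now, g_old in zip(avg_gaps_now, gaps_end_phase1))
    return drift > dev_threshold or est_regret_vs_a_hat > reg_threshold
```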
In the following subsections, we provide a sketch of analysis for BOTW, further revealing the ideas behind our design.
5.2 Analysis for the Adversarial Setting (Theorem 7)
We first show that at any time in Phase 2, with high probability, $\widehat{a}$ is always the best action so far.
Lemma 8.
With probability at least , at any time in Phase 2, we have .
Proof sketch.
The idea is to prove that for any , the deviation between the actual gap and the estimated gap is no larger than . This is enough to prove the statement since is of order in light of the test in Eq. (9).
Bounding the deviation for Phase 2 is somewhat similar to the analysis of Algorithm REOLB, and here we only show how to bound the deviation for Phase 1: . We start by rearranging Eq. (5) to get: . By the termination conditions of Phase 1, we have and , which then shows as desired. (See Appendix C for the full proof.) ∎
We then prove that, importantly, the regret in each epoch is bounded by (not square root of the epoch length):
Lemma 9.
With probability at least , for any time in Phase 2, we have for any ,
Proof sketch.
By Lemma 8, it suffices to consider . By Eq. (5), we know that the regret for the first rounds is directly bounded by . For the regret incurred in Phase 2, we decompose it as the sum of , , and . The first term is controlled by the test in Eq. (10). The second and third terms are martingale difference sequences with variance bounded by , which as we further show is at most with . By combining Eq. (7) and Eq. (8), it is clear that and thus the variance is in the order of . Applying Freedman’s inequality, the last two terms are thus bounded by as well, proving the claimed result (see Appendix C for the full proof). ∎
5.3 Analysis for the Corrupted Setting (Theorem 6)
The key for this analysis is the following lemma.
Lemma 10.
In the corrupted setting, BOTW-SE ensures with probability at least :
-
•
.
-
•
If , then 1) ; 2) for all ; and 3) Phase 2 never ends.
Using this lemma, we give a proof sketch of Theorem 6 for the stochastic case (i.e., $C = 0$). The full proof is deferred to Appendix C.2.
Proof sketch for Theorem 6 with $C = 0$.
By Lemma 10, we know that after roughly rounds in Phase 1, the algorithm finds $\widehat{a} = a^*$, estimates each gap up to a constant factor, and enters Phase 2 without ever going back to Phase 1. By Eq. (5), the regret in Phase 1 can be upper bounded by .
To bound the regret in Phase 2, we show that as long as the elapsed time is larger than a problem-dependent constant, there exist gap values within a constant factor of the true gaps whose corresponding allocation is a feasible solution of the constraint in OP. Therefore, we can bound the regret in this regime as follows:
() | |||
() | |||
(optimality of ) | |||
() | |||
(definition of ) |
Combining the regret bounds in Phase 1 and Phase 2, we prove the results for the stochastic setting. ∎
6 Conclusion
In this work, we make significant progress on improving the robustness and adaptivity of linear bandit algorithms. Our algorithms are the first to achieve near-optimal regret in various different settings, without having any prior knowledge on the environment. Our techniques might also be useful for more general problems such as linear contextual bandits.
In light of the work (Zimmert & Seldin, 2019) for multi-armed bandits that shows a simple Follow-the-Regularized-Leader algorithm achieves optimal regret in different settings, one interesting open question is whether there also exists such a simple Follow-the-Regularized-Leader algorithm for linear bandits with the same adaptivity to different settings. In fact, it can be shown that their algorithm has a deep connection with OP in the special case of multi-armed bandits, but we are unable to extend this connection to general linear bandits.
Acknowledgements
We thank Tor Lattimore and Julian Zimmert for helpful discussions. HL thanks Ilias Diakonikolas and Anastasia Voloshinov for initial discussions in this direction. The first four authors are supported by NSF Awards IIS-1755781 and IIS-1943607.
References
- Abbasi-Yadkori et al. (2011) Abbasi-Yadkori, Y., Pál, D., and Szepesvári, C. Improved algorithms for linear stochastic bandits. Advances in neural information processing systems, 24:2312–2320, 2011.
- Abe & Long (1999) Abe, N. and Long, P. M. Associative reinforcement learning using linear probabilistic concepts. In ICML, 1999.
- Abernethy et al. (2008) Abernethy, J., Hazan, E., and Rakhlin, A. Competing in the dark: An efficient algorithm for bandit linear optimization. In 21st Annual Conference on Learning Theory, COLT 2008, pp. 263–273, 2008.
- Agarwal et al. (2014) Agarwal, A., Hsu, D., Kale, S., Langford, J., Li, L., and Schapire, R. Taming the monster: A fast and simple algorithm for contextual bandits. In International Conference on Machine Learning, pp. 1638–1646. PMLR, 2014.
- Auer (2002) Auer, P. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3(Nov):397–422, 2002.
- Auer & Chiang (2016) Auer, P. and Chiang, C.-K. An algorithm with nearly optimal pseudo-regret for both stochastic and adversarial bandits. In Conference on Learning Theory, pp. 116–120, 2016.
- Awerbuch & Kleinberg (2004) Awerbuch, B. and Kleinberg, R. D. Adaptive routing with end-to-end feedback: Distributed learning and geometric approaches. In Proceedings of the thirty-sixth annual ACM symposium on Theory of computing, pp. 45–53, 2004.
- Bartlett et al. (2008) Bartlett, P., Dani, V., Hayes, T., Kakade, S., Rakhlin, A., and Tewari, A. High-probability regret bounds for bandit online linear optimization. In Proceedings of the 21st Annual Conference on Learning Theory-COLT 2008, pp. 335–342. Omnipress, 2008.
- Bogunovic et al. (2020) Bogunovic, I., Losalka, A., Krause, A., and Scarlett, J. Stochastic linear bandits robust to adversarial attacks. arXiv:2007.03285, 2020.
- Bubeck & Slivkins (2012) Bubeck, S. and Slivkins, A. The best of both worlds: Stochastic and adversarial bandits. In Conference on Learning Theory, 2012.
- Bubeck et al. (2012) Bubeck, S., Cesa-Bianchi, N., and Kakade, S. M. Towards minimax policies for online linear optimization with bandit feedback. In Conference on Learning Theory, 2012.
- Catoni (2012) Catoni, O. Challenging the empirical mean and empirical variance: a deviation study. In Annales de l’IHP Probabilités et statistiques, volume 48, pp. 1148–1185, 2012.
- Chu et al. (2011) Chu, W., Li, L., Reyzin, L., and Schapire, R. Contextual bandits with linear payoff functions. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 208–214, 2011.
- Combes et al. (2017) Combes, R., Magureanu, S., and Proutiere, A. Minimal exploration in structured stochastic bandits. In Advances in Neural Information Processing Systems, 2017.
- Dani et al. (2008a) Dani, V., Hayes, T. P., and Kakade, S. M. Stochastic linear optimization under bandit feedback. In 21st Annual Conference on Learning Theory, COLT 2008, 2008a.
- Dani et al. (2008b) Dani, V., Kakade, S. M., and Hayes, T. P. The price of bandit information for online optimization. In Advances in Neural Information Processing Systems, 2008b.
- Dudik et al. (2011) Dudik, M., Hsu, D., Kale, S., Karampatziakis, N., Langford, J., Reyzin, L., and Zhang, T. Efficient optimal learning for contextual bandits. In Proceedings of the 27th Conference on Uncertainty in Artificial Intelligence, UAI 2011, pp. 169, 2011.
- Gerchinovitz & Lattimore (2016) Gerchinovitz, S. and Lattimore, T. Refined lower bounds for adversarial bandits. In NeurIPS, 2016.
- Gupta et al. (2019) Gupta, A., Koren, T., and Talwar, K. Better algorithms for stochastic bandits with adversarial corruptions. In Conference on Learning Theory, 2019.
- Hao et al. (2020) Hao, B., Lattimore, T., and Szepesvari, C. Adaptive exploration in linear contextual bandit. In International Conference on Artificial Intelligence and Statistics. PMLR, 2020.
- Jin & Luo (2020) Jin, T. and Luo, H. Simultaneously learning stochastic and adversarial episodic mdps with known transition. Advances in Neural Information Processing Systems, 2020.
- Jun & Zhang (2020) Jun, K.-S. and Zhang, C. Crush optimism with pessimism: Structured bandits beyond asymptotic optimality. Advances in Neural Information Processing Systems, 2020.
- Komiyama et al. (2015) Komiyama, J., Honda, J., and Nakagawa, H. Regret lower bound and optimal algorithm in finite stochastic partial monitoring. Advances in Neural Information Processing Systems, 2015.
- Lattimore & Szepesvari (2017) Lattimore, T. and Szepesvari, C. The end of optimism? an asymptotic analysis of finite-armed linear bandits. In Artificial Intelligence and Statistics, pp. 728–737. PMLR, 2017.
- Lee et al. (2020) Lee, C.-W., Luo, H., Wei, C.-Y., and Zhang, M. Bias no more: high-probability data-dependent regret bounds for adversarial bandits and mdps. Advances in neural information processing systems, 2020.
- Li et al. (2019) Li, Y., Lou, E. Y., and Shan, L. Stochastic linear optimization with adversarial corruption. arXiv:1909.02109, 2019.
- Lykouris et al. (2018) Lykouris, T., Mirrokni, V., and Paes Leme, R. Stochastic bandits robust to adversarial corruptions. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, 2018.
- Lykouris et al. (2019) Lykouris, T., Simchowitz, M., Slivkins, A., and Sun, W. Corruption robust exploration in episodic reinforcement learning. arXiv:1911.08689, 2019.
- Mond & Pecaric (1996) Mond, B. and Pecaric, J. A mixed arithmetic-mean-harmonic-mean matrix inequality. Linear algebra and its applications, 237:449–454, 1996.
- Rusmevichientong & Tsitsiklis (2010) Rusmevichientong, P. and Tsitsiklis, J. N. Linearly parameterized bandits. Mathematics of Operations Research, 35(2):395–411, 2010.
- Seldin & Lugosi (2017) Seldin, Y. and Lugosi, G. An improved parametrization and analysis of the exp3++ algorithm for stochastic and adversarial bandits. In Conference on Learning Theory, pp. 1743–1759. PMLR, 2017.
- Seldin & Slivkins (2014) Seldin, Y. and Slivkins, A. One practical algorithm for both stochastic and adversarial bandits. In International Conference on Machine Learning, pp. 1287–1295. PMLR, 2014.
- Shalev-Shwartz & Ben-David (2014) Shalev-Shwartz, S. and Ben-David, S. Understanding machine learning: From theory to algorithms. Cambridge university press, 2014.
- Tirinzoni et al. (2020) Tirinzoni, A., Pirotta, M., Restelli, M., and Lazaric, A. An asymptotically optimal primal-dual incremental algorithm for contextual linear bandits. In Advances in Neural Information Processing Systems, 2020.
- Wei & Luo (2018) Wei, C.-Y. and Luo, H. More adaptive algorithms for adversarial bandits. In Conference On Learning Theory, pp. 1263–1291. PMLR, 2018.
- Wei et al. (2020) Wei, C.-Y., Luo, H., and Agarwal, A. Taking a hint: How to leverage loss predictors in contextual bandits? In Conference on Learning Theory, pp. 3583–3634. PMLR, 2020.
- Zimmert & Seldin (2019) Zimmert, J. and Seldin, Y. An optimal algorithm for stochastic and adversarial bandits. In The 22nd International Conference on Artificial Intelligence and Statistics. PMLR, 2019.
- Zimmert & Seldin (2021) Zimmert, J. and Seldin, Y. Tsallis-inf: An optimal algorithm for stochastic and adversarial bandits. J. Mach. Learn. Res., 22:28–1, 2021.
- Zimmert et al. (2019) Zimmert, J., Luo, H., and Wei, C.-Y. Beating stochastic and adversarial semi-bandits optimally and simultaneously. In International Conference on Machine Learning, pp. 7683–7692. PMLR, 2019.
Appendix A Auxiliary Lemmas for OP
Proof of Lemma 4.
Consider the minimizer of the following constrained minimization problem for some :
(12) |
where . We will show that
(13) | ||||
(14) |
To prove this, first note that relaxing the constraints from to the set of sub-distributions does not change the solution of this problem. This is because for any sub-distribution, we can always make it a distribution by increasing the weight of some with (at least one exists) while not increasing the objective value (since is non-decreasing in for each ). Therefore, applying the KKT conditions, we have
(15) |
where are Lagrange multipliers. Plugging in the optimal solution and taking summation over all , we have
(complementary slackness) | ||||
() |
Therefore, we have and as . This proves Eq. (13). For Eq. (14), using Eq. (15), we have
where the first inequality is due to , and the second inequality is due to .
Now we show how to transform into a distribution satisfying the constraint of OP. Choose . Let . We construct the distribution , where is defined in Lemma 11 with , and prove that satisfies the constraint of OP. Indeed, for all , we have by definition and thus
for , according to Lemma 11 below, we have .
Combining the two cases above, we prove that satisfies the constraint of OP. According to the optimality of , we thus have
(by the definition of ) | ||||
(by Eq. (13), , the definition of , and the choice of ) | ||||
(by the definition of ) |
proving the lemma. ∎
The following lemma shows that for any , there always exists a distribution that puts most weights on actions from , such that for all .
Lemma 11.
Suppose that spans and let be the uniform distribution over . For any and , there exists a distribution such that for all , where and .
Proof.
Let . As spans the whole space, is well-defined for all . Then we have
(16) | |||
where the second equality is by Sion's minimax theorem, as Eq. (16) is linear in and convex in . ∎
Lemma 12.
Given , suppose there exists a unique such that , and . Then when , where is the solution to .
Proof.
We divide actions into groups based on the following rule:
Let be the largest index such that is not empty and for . For each group , by Lemma 11, we find a distribution with , such that for all . Then we define a distribution over actions as the following:
is a valid distribution as
( and for ) | ||||
(by definition for all ) | ||||
(condition and thus ) |
Now we show that also satisfies the constraint of . Indeed, for any and such that , we use the facts for by definition and as well to arrive at:
for , we have as shown above and thus,
Thus, satisfies the constraint of OP. Therefore,
(by the feasibility of and the optimality of ) | ||||
(by the definition of and ) | ||||
( for ) | ||||
() |
proving the lemma. ∎
Lemma 13.
Let for all for some , and for some . Then .
Proof.
By the condition on , we have . Also, the condition implies that and for all . Therefore,
where the second equality is due to Lemma 12 and the other inequalities follow from for all . ∎
In the following lemma, we define a problem-dependent quantity .
Lemma 14.
Consider the optimization problem:
where and . Define its optimal objective value as (same as in Section 3). Then, there exist satisfying the constraint of this optimization problem with and being finite. (Define .)
Proof.
If there exists an assignment of for the optimal objective value which has finite , then the lemma trivially holds. Otherwise, consider the optimal solution with . According to the constraints, the following holds for all :
As is finite, by definition, we know for any , there exists a positive value such that for all , . Choosing , we have when , for all ,
Therefore, consider the solution where if and . We have . Moreover, the objective value is bounded by . ∎
Lemma 15.
Suppose for all for some , and for some , where is defined in Lemma 14. Then .
Proof.
Recall defined in Lemma 14. Define , a distribution over , as the following:
It is clear that is a valid distribution since . Also, note that by the definition of and the condition of ,
Below we show that satisfies the constraint of . Indeed, for any ,
() | ||||
(by the constraint in the definition of ) | ||||
() |
for , we have by the condition of :
Notice that implies . Thus,
() | ||||
(by the feasibility of and the optimality of ) | ||||
(by the definition of and ) | ||||
() | ||||
( proven in Lemma 14) |
finishing the proof. ∎
Lemma 16.
We have $c^* \le \mathcal{O}(d/\Delta_{\min})$.
Proof.
The idea is similar to that of Lemma 12. Define , and be the largest index such that is not empty. For each , let with be the distribution such that for all (see Lemma 11). Let and for , we let . Next we show that satisfies the constraint Eq. (2).
In fact, fix , by definition of , we have
where the first inequality is because is invertible. Therefore, the objective value of Eq. (1) is bounded as follows:
(by the definition of ) | ||||
(by the definition of ) | ||||
( for , ) | ||||
() | ||||
() |
Therefore, we have . ∎
Appendix B Analysis for Algorithm REOLB
In this section, we analyze the performance of Algorithm REOLB in the stochastic or corrupted setting. For Theorem 1, we decompose the proof into two parts. First, we show in Lemma 17 (a more concrete version of Lemma 5) that for some constant specified later, we have . Second, using Lemma 4, we know that Algorithm REOLB also enjoys a regret bound of , which gives for the first rounds and proves Theorem 2.
To prove Lemma 17, we first show in Lemma 3 that and are close up to some multiplicative factor and some additive terms related to the corruption. This holds with the help of the variance constraint of OP and the use of robust estimators. For notational convenience, first recall the following definitions from Lemma 3.
Definition 1.
Proof of Lemma 3.
We prove this by induction. For base case , by definition, and Also, and Eq. (3) holds.
If both inequalities hold for , then
(17) |
where the first inequality is because of the variance constraint of OP. Next, we show that the expectation of is and the variance of is upper bounded by for . In fact, we have
(18) | |||
(19) |
Now we are ready to prove the relation. Let . Under the induction hypothesis for the case of , using Lemma 29 with , with probability at least , we have for all :
(by Eq. (19) and Lemma 29) | |||
(using the definition of and ) | |||
(by the choice of ) | |||
(by Eq. (17)) | |||
() | |||
(by definition of ) |
Therefore, , which proves Eq. (3). The other claim Eq. (4) can be proven using similar analysis. Taking a union bound over all finishes the proof. ∎
Lemma 17 (A detailed version of Lemma 5).
Let , where and is defined in Lemma 14. Then Algorithm REOLB guarantees with probability at least :
Proof.
First, note that according to Lemma 28, implies . Let be the epoch such that , which means and thus . Then we have,
(20) |
where the first inequality is by definition of and the second inequality is because and . Then by Lemma 3, we have for all , , , which gives
(21) |
Since for all , we must have . Moreover, according to the definition of , we have . Therefore, the conditions of Lemma 15 hold. Applying Lemma 15 with , we get
Summing over , we get with probability at least ,
∎
Now we are ready to prove the main result Theorem 1.
Proof of Theorem 1.
Let denote the conditional expectation given the history up to time . By Freedman’s inequality, we have
() | ||||
(22) |
Fix a round and the corresponding epoch . According to Lemma 4, we know that
Therefore, combining Lemma 3, the regret in epoch is bounded by
Summing up the regret till round , we have
(23) |
In the following, we prove an alternative bound of , which is independent of . The following lemma is an analogue of Lemma 17, but the constant is independent of .
Lemma 18.
Let . Then Algorithm REOLB guarantees with probability
Proof.
First, note that according to Lemma 28, implies . Let be the epoch such that , which means and thus . Therefore, we still have Eq. (20), which further shows that we have Eq. (21) and . Moreover, according to the definition of , we have . Therefore, the conditions of Lemma 13 hold. Applying Lemma 13 with , we get
Summing over , we get with probability at least ,
∎
Theorem 19.
Algorithm REOLB guarantees that with probability at least ,
Proof.
Appendix C Analysis of Algorithm 1
In this section, we show that Algorithm 1 achieves both minimax-optimality in the adversarial setting and near instance-optimality in the stochastic setting. In Appendix C.1, we prove Theorem 7, showing that Algorithm 1 enjoys regret in the adversarial setting. In Appendix C.2, we prove Theorem 6, showing that Algorithm 1 also enjoys nearly instance-optimal regret in the stochastic setting and slightly worse regret in the corrupted setting.
C.1 Analysis of Algorithm 1 in the adversarial setting
To prove the guarantee in the adversarial setting, we first prove Lemma 8, which shows that at any time in Phase 2, has the smallest cumulative loss within .
Proof of Lemma 8.
By Assumption 1, for any , and any in Phase 1,
which implies
(25) |
At time , we have with probability at least ,
(by Eq. (25)) | ||||
(by Azuma’s inequality) | ||||
(by Eq. (7)) | ||||
(26) |
Bounding the deviation of for :
For all , the variance of is bounded as follows:
(27) |
where the last inequality is due to . Therefore, using Lemma 29 with , with probability at least , for all in Phase 2 and all ,
(Choose optimally) | ||||
(28) |
where the last inequality is due to the variance constraint of OP. For , since
(29) |
we have
and . Note that an increasing function when . Using Eq. (28) and , we have
(30) |
For the first rounds, according to Eq. (26), we have
(31) |
where the last inequality is due to Eq. (29). Combining Eq. (30) and Eq. (31) and noticing that , we have for all ,
(32) |
Bounding the deviation of (recall that we use the standard average estimator for ):
For the first rounds, according to Eq. (26), since , we have
For , according to Freedman’s inequality and the fact that as , we have with probability at least ,
Combining the above two inequalities, we have
(33) |
Now we are ready to prove our main lemma in the adversarial setting.
Proof of Lemma 9.
By Lemma 8, we know that the regret comparator is . By the regret bound of and the fact that (recall from Assumption 1), we have
For the regret in Phase , first note that it suffices to consider not being the last round of this phase (since the last round contributes at most to the regret). Then, consider the following decomposition:
Term 1 is upper bounded by since it corresponds to the termination condition Eq. (10).
Term 2 is a martingale difference sequence since
The variance is upper bounded by
(34) |
where the last term is because .
Term 3 is also a martingale difference sequence. As , its variance can be upper bounded by
(35) |
Therefore, with probability at least , we have by Freedman’s inequality.
As and , by Lemma 12, we have
(36) |
Combining all bounds above, we have shown
proving the lemma. ∎
C.2 Analysis of Algorithm 1 in the corrupted stochastic setting
In this section, we prove our results in the corrupted setting. To prove the main lemma Lemma 10, we separate the proof into two parts, Lemma 20 and Lemma 22.
Lemma 20.
In the stochastic setting with corruptions, within a single epoch,
-
1.
with probability at least , ;
-
2.
if , then with probability at least , ;
-
3.
if , then with probability at least , ;
-
4.
if , then with probability at least , for all .
Proof.
In the corrupted setting, we can identify as in the adversarial setting. We first show the following property: at any in Phase and with probability at least , for any ,
(37) |
By the guarantee of , we have with probability at least , for any and
(38) |
Since , we have for any ,
(39) |
Combining Eq. (38) and Eq. (39), and using for any , we get Eq. (37).
Below, we define .
Claim 1’s proof:
Let . Below we prove that if Phase 1 has not finished before time , then for the choice of , both Eq. (7) and Eq. (8) hold with high probability at time .
Consider Eq. (7). With probability at least ,
(by Azuma’s inequality) | ||||
(by Eq. (39)) | ||||
(by Eq. (37) and ) | ||||
( and ) |
showing that Eq. (7) holds for .
For Eq. (8), by the regret bound of , with probability at least , for ,
(by the regret bound of and Azuma’s inequality) | |||
(by Eq. (37) and that ) | |||
() |
By the condition of , we have for all . Thus, the last expression can further be upper bounded by , indicating that Eq. (8) also holds for all . Combining the two parts above finishes the proof.
Claim 2’s proof:
Claim 3’s proof:
Claim 4’s proof.
For notational convenience, denote the set by . We have
(hold w.p. by Eq. (37)) | ||||
(using and ) | ||||
(by Claim 3, holds w.p. ) | ||||
which finishes the proof. ∎
The next lemma shows that when grows large enough compared to the total corruption , the termination condition Eq. (10) will never be satisfied once the algorithm enters Phase .
Lemma 21.
Algorithm 1 guarantees that with probability at least , for any in Phase , when , we have
Furthermore, when ( is the constant defined in Lemma 14), we have
Proof.
Recall that and . Thus,
Except for Term 1, all terms are martingale difference sequences. Let be the expectation taken over the randomness before Phase 2. Similar to the calculation in Eq. (34) and Eq. (35), we have
and
By Freedman’s inequality, we have with probability at least , for all in Phase ,
Then we deal with Term 1. Again, by Freedman's inequality, with probability at least , for all in Phase 2,
( for ) |
When , according to Lemma 20, we know that with probability , and .
Also by Lemma 20, with probability , for any , we have . These conditions satisfy the requirement in Lemma 13 with . Therefore we can apply Lemma 13 and get
(42) |
for all . Combining all the above, we get
As argued in Eq. (36), . Therefore, the above can be further upper bounded by
(, ) | ||||
(by definition of ) | ||||
() | ||||
() | ||||
() |
Below, we use an alternative way to bound . Let , which implies . For , we use Lemma 4, and bound
(43) |
For , we use Lemma 15 and bound
(44) |
∎
Now we are ready to show that once grows large enough, Phase never ends.
Lemma 22.
If , then with probability at least , Phase 2 never ends.
Proof.
It suffices to verify the two termination conditions Eq. (9) and Eq. (10) are never satisfied. Eq. (10) does not hold because of Lemma 21. Consider Eq. (9). Let be in Phase and . According to Eq. (32) and Eq. (33), we have with probability ,
Therefore, we have
() |
This means that
Therefore, Eq. (9) is not satisfied. ∎
Finally, we prove the regret bound for the corrupted stochastic setting.
Proof of Theorem 6.
First, we consider the pure stochastic setting with . According to Lemma 20, we know that the algorithm has only one epoch as is satisfied in the first epoch. Specifically, after at most rounds in Phase , the algorithm goes to Phase and never goes back to Phase . Then we can directly apply the second claim in Lemma 21 to get the regret bound in the stochastic setting. Specifically, we bound the regret in Phase by . For the regret in Phase , according to the second claim in Lemma 21, we bound the regret by , where is the same as the one in Eq. (24). Combining them together proves the first claim.
Now we consider the corrupted stochastic setting with . Suppose that we are in the epoch with , which is the first epoch such that . Therefore, in previous epochs, we have . According to Lemma 20, we have in the previous epoch and .
We bound the regret before this epoch, as well as the regret in the first phase of this epoch by the adversarial regret bound:
If we use GeometricHedge.P as the adversarial linear bandit algorithm and , then the above is upper bounded by
For Phase of the epoch with , according to Lemma 20, we know that this phase will never end and by definition of , we have . Note that in this phase .
Therefore, by taking a summation over on Eq. (42), we bound the regret in this interval by
Combining the regret bounds finishes the proof of the second claim. ∎
Appendix D Lower Bound
In this section, we prove that the extra logarithmic factor in our bound for the stochastic setting is unavoidable if the same algorithm also achieves sublinear regret with high probability in the adversarial case. The full statement is in Theorem 27, and we first present some definitions and related lemmas. We fix a stochastic linear bandit instance (i.e., we fix the parameter and the action set ). Assume that for all and . We call this instance the first environment. The observation is generated according to the following Bernoulli distribution (footnote: In the Bernoulli noise case, we are only looking at a subclass of problems (i.e., those with and ). For problems that are outside this class, the Bernoulli noise case might be much easier than the Gaussian noise case.):
Again, let be the solution of the following optimization problem:
(45) | ||||
where . We also define .
For a fixed (which is chosen later), we divide the whole horizon into intervals of length
where . We denote these intervals as . Observe that for all .
Definition 2.
Let , for some and some universal constant .
Assumption 2.
Let be a linear bandit algorithm with the following regret guarantee: there is a problem-dependent constant (i.e., depending on and ) such that for any ,
for some .
Definition 3.
Let where the expectation is with respect to the environment of and the algorithm . Let .
Definition 4.
Let .
Lemma 23.
Let . There exists an action such that
Proof.
We use contradiction to prove this lemma. Suppose that for all we have . Then observe that where . Therefore,
satisfies the constraint of the optimization problem Eq. (45). Therefore,
This contradicts the assumption on the regret bound of . ∎
Lemma 24.
If there exists an such that
then there exists such that
Proof.
We use contradiction to prove the lemma. Suppose that for all .
Then
where in the first inequality we use
which is a generalization of the “arithmetic-mean-harmonic-mean inequality” (Mond & Pecaric, 1996). ∎
Lemma 25.
Let . If , then .
Proof.
We have
(46) |
where the last inequality is because of the definition of . For large enough , we must have . Otherwise, by Markov’s inequality, with probability at least , in interval the algorithm draws at most times, and thus the regret of would be at least , violating the assumption on . Therefore, for large enough , we have
where the last inequality is by our assumption. Combining this with Eq. (46), we get
and the conclusion follows based on our assumption. ∎
Note that when the conclusion of Lemma 25 holds, that is, , it means that the exploration in interval is not enough, since by the lower bound in (Lattimore & Szepesvari, 2017), the amount of exploration should make . Therefore, the next natural idea is to change the parameter in this interval , and argue that the amount of exploration is not enough to “detect this change with high probability”.
We now let be the first interval such that there exists with . Also, we use to denote the that satisfies this condition. Define
Notice that in this case,
That is, is a better action than under the parameter .
We now define the second environment as follows: in intervals , the losses are generated according to , but in intervals , the losses are generated according to . We use and to denote the expectation under the first environment and the second environment, and , to denote the probability measures respectively. For now, we only focus on interval (so the probability measure is only over the sequence ).
Lemma 26.
.
Proof.
Note that for any ,
(let ) | ||||
(by the assumption ) |
Therefore, .
Finally, we are ready to present the lower bound. Roughly speaking, it shows that if an algorithm achieves regret in the stochastic case with , then it cannot be robust in the adversarial setting, in the sense that it cannot guarantee a regret bound such as with probability at least .
Theorem 27.
For any , if an algorithm guarantees a pseudo regret bound of
for constant in stochastic environments for all sufficiently large , then there exists an adversarial environment such that with probability at least , the regret of the same algorithm is at least .
Proof.
Let be the event: . By Lemma 5 in (Lattimore & Szepesvari, 2017) and choosing , we have
(by Lemma 26) | ||||
(48) |
Notice that when event happens under the first environment, the regret is at least . By the assumption on , we have
implying that
Combining this with (48), we get
(49) |
Choosing , we have for large enough . Then notice that when happens under the second environment, the regret within interval (against comparator ) is at least . Since under the second environment the learner may have negative regret against in interval , in the best case the regret against in interval is at least
In conclusion, in the second environment, algorithm suffers at least regret in the first intervals with probability at least . ∎
Appendix E Adversarial Linear Bandit Algorithms with High-probability Guarantees
In this section, we show that the algorithms of (Bartlett et al., 2008) and (Lee et al., 2020) both satisfy Assumption 1.
E.1 GeometricHedge.P
We first show GeometricHedge.P in Algorithm 3 for completeness. We remark on the differences between the original version and the one shown here. First, we consider the noisy feedback instead of the zero-noise feedback . However, most of the analysis in (Bartlett et al., 2008) still holds. Second, instead of using barycentric spanner exploration (known to be suboptimal), we use John's exploration, shown to be optimal in Bubeck et al. (2012). With this replacement, Lemma 3 in Bartlett et al. (2008) can be improved to and .
Now, consider martingale difference sequence . We have and
Using Lemma 2 in (Bartlett et al., 2008), we have that with probability at least (set ),
(50) |
where the last inequality is by AM-GM inequality. Note that . Plugging this into Eq. (50), we have with probability at least ,
(51) |
The counterpart of Lemma 6 in Bartlett et al. (2008) shows that with probability at least ,
(52) |
Using Eq. (51), we have the counterpart of Lemma 7 in Bartlett et al. (2008) as follows: with probability at least ,
(53) |
The counterpart of Lemma 8 in Bartlett et al. (2008) is: with probability at least ,
(54) |
Plugging Eq. (52), Eq. (53), and Eq. (54) into Equation (2) in (Bartlett et al., 2008), we have with probability at least ,
(55) |
Again using Eq. (51) and Equation (4) in (Bartlett et al., 2008), we have with probability at least , for all ,
Combining this with Eq. (55) and assuming , we have that with probability at least , for every ,
Recalling the definition of in Eq. (50) and combining terms, we have
(56) |
It remains to decide and . Note that the analysis of (Bartlett et al., 2008) requires . From the proof of Lemma 4 in (Bartlett et al., 2008), we know that . Thus, we set so that always holds. Therefore,
Choosing , we have with probability at least , for all ,
() | ||||
(Eq. (50)) |
which proves Eq. (5).
E.2 The algorithm of (Lee et al., 2020)
Now we introduce another high-probability adversarial linear bandit algorithm from (Lee et al., 2020). The regret bound of this algorithm is slightly worse than that of (Bartlett et al., 2008). However, the algorithm is efficient even when there are infinitely many or exponentially many actions. For the concrete pseudocode of the algorithm, we refer the reader to Algorithm 2 of (Lee et al., 2020). Here, we focus on showing that it satisfies Eq. (5).
We first restate Lemma B.15 in (Lee et al., 2020) with explicit logarithmic factors: Algorithm 2 of (Lee et al., 2020) with for some universal constant guarantees that with probability at least ,
(57) |
where , is an upper bound on with probability , are two universal constants, and we replace the self-concordant parameter in their bound by a trivial upper bound .
Appendix F Auxiliary Lemmas
In this section, we provide several auxiliary lemmas that we have used in the analysis.
Lemma 28.
(Lemma A.2 of (Shalev-Shwartz & Ben-David, 2014)) Let and . If , then we have .
Lemma 29 (Concentration inequality for Catoni’s estimator (Wei et al., 2020)).
Let be a filtration, and be real random variables such that is -measurable, for some fixed , and for some fixed . Denote and let be Catoni's robust mean estimator of with a fixed parameter , that is, is the unique root of the function
where
Then for any , as long as is large enough such that , we have with probability at least ,
Choosing optimally, we have
In particular, if , we have