
Contaminated Online Convex Optimization

Tomoya Kamijima¹ (Department of Mathematical Informatics, The University of Tokyo, Tokyo, Japan) and Shinji Ito² (NEC Corporation, Kanagawa, Japan; RIKEN AIP, Tokyo, Japan)

¹A portion of this work was done while he was an intern at NEC Corporation. Email: [email protected]
²He is currently affiliated with the University of Tokyo. Email: [email protected]
Abstract

In online convex optimization, efficient algorithms have been designed for each of the individual classes of objective functions, e.g., convex, strongly convex, and exp-concave. However, existing regret analyses, including those of universal algorithms, are limited to cases in which the objective functions in all rounds belong to the same class, and cannot be applied when the properties of the objective functions may change from round to round. This paper introduces a novel approach to address such cases, proposing a new regime we term contaminated online convex optimization. For the contaminated case, we demonstrate that the regret is lower bounded by $\Omega(\log T+\sqrt{k})$, where $k$ signifies the level of contamination in the objective functions. We also demonstrate that the regret is bounded by $O(\log T+\sqrt{k\log T})$ when universal algorithms are used. When our proposed algorithms with additional information are employed, the regret is bounded by $O(\log T+\sqrt{k})$, which matches the lower bound. These are intermediate bounds between the convex case and the strongly convex or exp-concave case.

1 Introduction

Online convex optimization (OCO) is an optimization framework in which a convex objective function changes at each time step $t\in\{1,2,\ldots,T\}$. OCO has many applications, such as prediction from expert advice (Littlestone and Warmuth 1994, Arora et al. 2012), spam filtering (Hazan 2016), shortest paths (Awerbuch and Kleinberg 2004), portfolio selection (Cover 1991, Hazan et al. 2006), and recommendation systems (Hazan and Kale 2012). The performance of an OCO algorithm is measured by regret (defined in Section 3). As shown in Table 1, it is already known that sublinear regret can be achieved for each function class, such as convex, strongly convex, and exp-concave, and that the bound depends on the function class. In addition, these upper bounds coincide with lower bounds, so they are optimal. However, each of these optimal algorithms is applicable only to one specific function class. Therefore, we need prior knowledge about the function class to which the objective functions belong.

To solve this problem, many universal algorithms that work well for multiple function classes with a single algorithm have been proposed (Hazan et al. 2007, Van Erven and Koolen 2016, Wang et al. 2020, Zhang et al. 2022, Yan et al. 2024). For example, the MetaGrad algorithm proposed by Van Erven and Koolen (2016) achieves an $O(\sqrt{T})$ regret for any sequence of convex objective functions and an $O(\log T)$ regret if all the objective functions are exp-concave. Universal algorithms are useful in that they can be used without prior knowledge about the objective functions. Some universal algorithms are introduced in Appendix A.3.

A notable limitation of the previous regret analyses of universal algorithms is that they apply only to cases where all the objective functions $f_1, f_2, \ldots, f_T$ belong to the same function class. For example, if the objective functions in a limited number of rounds are not strongly convex while all the others are, the regret bounds for strongly convex functions in previous studies are not always valid. This study aims to remove this limitation.

Table 1: Comparison of regret bounds. The parameter $d$ is the dimension of the decision set.

FUNCTION CLASS | UPPER BOUNDS | LOWER BOUNDS
Convex | $O(\sqrt{T})$ (Zinkevich 2003) | $\Omega(\sqrt{T})$ (Abernethy et al. 2008)
$\alpha$-exp-concave | $O((d/\alpha)\log T)$ (Hazan et al. 2006) | $\Omega((d/\alpha)\log T)$ (Ordentlich and Cover 1998)
$k$-contaminated $\alpha$-exp-concave | $O((d/\alpha)\log T+\sqrt{kd\log T})$ (This work, Corollary 5.5) | $\Omega((d/\alpha)\log T+\sqrt{k})$ (This work, Corollary 4.8)
$k$-contaminated $\alpha$-exp-concave (with additional information) | $O((d/\alpha)\log T+\sqrt{k})$ (This work, Theorem 6.2) | $\Omega((d/\alpha)\log T+\sqrt{k})$ (This work, Corollary 4.8)
$\lambda$-strongly convex | $O((1/\lambda)\log T)$ (Hazan et al. 2006) | $\Omega((1/\lambda)\log T)$ (Abernethy et al. 2008)
$k$-contaminated $\lambda$-strongly convex | $O((1/\lambda)\log T+\sqrt{k\log T})$ (This work, Corollary 5.7) | $\Omega((1/\lambda)\log T+\sqrt{k})$ (This work, Corollary 4.9)
$k$-contaminated $\lambda$-strongly convex (with additional information) | $O((1/\lambda)\log T+\sqrt{k})$ (This work, Theorem 6.2) | $\Omega((1/\lambda)\log T+\sqrt{k})$ (This work, Corollary 4.9)

1.1 Our Contribution

In this study, we consider situations in which the function class of the objective $f_t$ may change at each time step $t$. We call this setting contaminated OCO. More specifically, in $k$-contaminated OCO with a function class $\mathcal{F}$, we suppose that the objective function $f_t$ does not necessarily belong to the desired function class $\mathcal{F}$ (e.g., exp-concave or strongly convex functions) in $k$ rounds out of the total $T$ rounds. Section 3 introduces its formal definition and examples. This class of OCO problems can be interpreted as an intermediate setting between general OCO problems and OCO problems restricted to $\mathcal{F}$ ($\mathcal{F}$-OCO). Intuitively, the parameter $k\in[0,T]$ represents the magnitude of the impurity in the sequence of objective functions and measures how close the problem is to $\mathcal{F}$-OCO; $k=0$ and $k=T$ correspond to $\mathcal{F}$-OCO and general OCO, respectively.

The contribution of this study can be summarized as follows: (i) We introduce contaminated OCO, which captures situations in which the class of the objective functions may change over rounds. (ii) We find that the Online Newton Step (ONS), one of the optimal algorithms for exp-concave functions, does not always work well in contaminated OCO, as discussed in Section 4.1. (iii) We present regret lower bounds for contaminated OCO in Section 4.2. (iv) We show that some existing universal algorithms achieve better regret bounds than ONS for contaminated OCO; details are given in Section 5. (v) We propose algorithms that attain the optimal regret bounds under the additional assumption that information about the class of the previous objective function is accessible, in Section 6.

Regret bounds for contaminated cases, compared with existing bounds, are shown in Table 1. The new upper bounds contain the bounds of existing studies for exp-concave functions and strongly convex functions as the special case $k=0$. Additionally, the new lower bounds generalize the bounds of existing studies for convex, exp-concave, and strongly convex functions. In cases where only gradient information is available, there is a multiplicative gap of $O(\sqrt{\log T})$ between the second terms of the upper bounds and the lower bounds. This gap is eliminated when information about the class of the previous objective function is available.

To prove the regret lower bounds in Table 1, we construct distributions of problem instances of contaminated OCO for which any algorithm suffers a certain amount of regret in expectation. Such distributions are constructed by combining suitably designed problem instances of $\mathcal{F}$-OCO and general OCO.

To derive the novel regret upper bounds without additional information in Table 1, we exploit regret upper bounds expressed in terms of problem-dependent quantities, such as a measure of variance (Van Erven and Koolen 2016). By combining such regret upper bounds with inequalities derived from the definition of $k$-contaminated OCO, we obtain regret upper bounds that include the regret itself, which can be interpreted as quadratic inequalities in the regret. Solving these inequalities leads to the regret upper bounds in Table 1.

We develop algorithms that achieve the optimal regret upper bounds by taking into account the function class of the previous objective function. To accomplish this, we modify two existing OCO algorithms: the Online Newton Step (ONS), introduced by Hazan et al. (2006), and Online Gradient Descent (OGD), presented by Zinkevich (2003). The modification changes the update rule depending on the function class of the last revealed objective function.

2 Related Work

In the context of online learning, adaptive algorithms (Orabona 2019) have been extensively studied due to their practical usefulness. These algorithms work well by automatically exploiting the intrinsic properties of the sequence of objective functions and do not require parameter tuning based on prior knowledge of the objective function. For example, AdaGrad (McMahan and Streeter 2010, Duchi et al. 2011) is probably one of the best-known adaptive algorithms, which automatically adapts to the magnitude of the gradients. Studies on universal algorithms (Hazan et al. 2007, Van Erven and Koolen 2016, Wang et al. 2020, Zhang et al. 2022, Yan et al. 2024), which work well for several different function classes, can also be positioned within these research trends. Our study shows that some of these universal algorithms have further adaptability, i.e., nearly tight regret bounds for contaminated settings.

van Erven et al. (2021) have explored a setting similar to ours, focusing on robustness to outliers. They regard rounds whose gradient norms exceed some threshold as outliers and denote the number of outliers by $k$, whose definition differs from ours. They define regret only over the rounds that are not outliers, terming it robust regret, and show that an additional $O(k)$ term is inevitable in bounds on the robust regret.

Studies on best-of-both-worlds (BOBW) bandit algorithms (Bubeck and Slivkins 2012) and on stochastic bandits with adversarial corruptions (Lykouris et al. 2018, Gupta et al. 2019) are also related to our study. BOBW algorithms are designed to achieve (nearly) optimal performance in both stochastic and adversarial environments, e.g., $O(\log T)$ regret for stochastic and $O(\sqrt{T})$ regret for adversarial environments, respectively. Stochastic bandits with adversarial corruptions are problems in intermediate environments between stochastic and adversarial ones, in which the magnitude of the adversarial component is measured by the corruption level parameter $C\geq 0$. The BOBW algorithm of Bubeck and Slivkins (2012) has been shown to have a regret bound of $O(\log T+\sqrt{C\log T})$ for stochastic environments with adversarial corruptions as well, which is also nearly tight (Ito 2021). The proof of this upper bound uses an approach referred to as the self-bounding technique (Gaillard et al. 2014, Wei and Luo 2018), which yields improved guarantees via regret upper bounds that include the regret itself. Similar proof techniques are used in our study as well.

3 Problem Setting

In this section, we explain the problem setting we consider. Throughout this paper, we assume the functions $f_1, f_2, \ldots, f_T$ are differentiable and convex.

3.1 OCO Framework and Assumptions

First, the mathematical formulation of OCO is as follows. At each time step $t\in[T]\;(\coloneqq\{1,2,\ldots,T\})$, a nonempty convex set $\mathcal{X}\subset\mathbb{R}^{d}$ and the convex objective functions $f_{1},f_{2},\ldots,f_{t-1}\colon\mathcal{X}\to\mathbb{R}$ are known, while $f_{t}$ is not. A learner chooses an action $\bm{x}_{t}\in\mathcal{X}$ and incurs the loss $f_{t}(\bm{x}_{t})$ after the choice. Since $f_{t}$ is unknown when choosing $\bm{x}_{t}$, it is impossible to minimize the cumulative loss $\sum_{t=1}^{T}f_{t}(\bm{x}_{t})$ for all sequences of $f_{t}$. Instead, the goal of OCO is to minimize the regret:

R_{T}:=\sum_{t=1}^{T}f_{t}(\bm{x}_{t})-\min_{\bm{x}\in\mathcal{X}}\sum_{t=1}^{T}f_{t}(\bm{x}).

The regret is the difference between the cumulative loss of the learner and that of the best fixed choice in hindsight. The regret can be logarithmic if the objective functions are $\lambda$-strongly convex, i.e., $f(\bm{y})\geq f(\bm{x})+\left\langle{\nabla f(\bm{x})},{\bm{y}-\bm{x}}\right\rangle+\frac{\lambda}{2}\|\bm{x}-\bm{y}\|^{2}$ for all $\bm{x},\bm{y}\in\mathcal{X}$, or $\alpha$-exp-concave, i.e., $\exp(-\alpha f(\bm{x}))$ is concave on $\mathcal{X}$.
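To make the definition concrete, the following self-contained sketch evaluates the regret of a fixed sequence of plays; the quadratic losses, the naive learner, and the grid search over comparators are our own illustrative choices, not from the paper.

```python
import numpy as np

def regret(losses, plays, domain):
    """R_T = sum_t f_t(x_t) - min over the domain of sum_t f_t(x),
    with the minimum approximated on a finite grid."""
    learner = sum(f(x) for f, x in zip(losses, plays))
    best = min(sum(f(u) for f in losses) for u in domain)
    return learner - best

# Toy instance: quadratic losses f_t(x) = (x - c_t)^2 with shifting targets c_t.
rng = np.random.default_rng(0)
cs = rng.uniform(-1.0, 1.0, size=50)
losses = [lambda x, c=c: (x - c) ** 2 for c in cs]
plays = np.zeros(50)                 # a naive learner that always plays 0
grid = np.linspace(-1.0, 1.0, 2001)  # discretization of X = [-1, 1]
print(regret(losses, plays, grid))
```

Note that the comparator is a single fixed point, so a learner that tracks each $c_t$ exactly can even attain negative regret on this instance.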

Remark 3.1.

The type of information about $f_t$ that an algorithm needs to access varies. Universal algorithms utilize only gradient information, while the algorithm presented in Section 6 requires additional information besides the gradient, such as strong convexity and exp-concavity. The lower bounds discussed in Section 4.2 apply to arbitrary algorithms, even those with full access to information about the objective functions.

Next, we introduce the following two assumptions. They are standard in OCO and frequently used in regret analysis; we assume them throughout this paper without mentioning them further.

Assumption 3.2.

There exists a constant $D>0$ such that $\|\bm{x}-\bm{y}\|\leq D$ holds for all $\bm{x},\bm{y}\in\mathcal{X}$.

Assumption 3.3.

There exists a constant $G>0$ such that $\|\nabla f_{t}(\bm{x})\|\leq G$ holds for all $\bm{x}\in\mathcal{X}$ and $t\in[T]$.

These assumptions are important not only because they bound $\|\bm{x}-\bm{y}\|$ and $\|\nabla f_{t}(\bm{x})\|$, but also because they enable the following two lemmas:

Lemma 3.4.

(Hazan 2016) Let $f\colon\mathcal{X}\to\mathbb{R}$ be an $\alpha$-exp-concave function. Assume that there exist constants $D,G>0$ such that $\|\bm{x}-\bm{y}\|\leq D$ and $\|\nabla f(\bm{x})\|\leq G$ hold for all $\bm{x},\bm{y}\in\mathcal{X}$. Then the following holds for all $\gamma\leq(1/2)\min\{1/(GD),\alpha\}$ and all $\bm{x},\bm{y}\in\mathcal{X}$:

f(\bm{x})\geq f(\bm{y})+\left\langle{\nabla f(\bm{y})},{\bm{x}-\bm{y}}\right\rangle+\frac{\gamma}{2}\left(\left\langle{\nabla f(\bm{y})},{\bm{x}-\bm{y}}\right\rangle\right)^{2}.
Lemma 3.5.

(Hazan 2016) If $f\colon\mathcal{X}\to\mathbb{R}$ is a twice differentiable $\lambda$-strongly convex function satisfying $\|\nabla f(\bm{x})\|\leq G$ for all $\bm{x}\in\mathcal{X}$, then it is $\lambda/G^{2}$-exp-concave.

Combined with the fact that $-\log\left\langle{\bm{a}},{\bm{x}}\right\rangle$ is 1-exp-concave but not strongly convex, Lemma 3.5 means that exp-concavity is a strictly milder condition than strong convexity.

3.2 Contaminated Case

In this subsection, we define contaminated OCO and introduce examples that belong to this problem class. The definition is as follows.

Definition 3.6.

For some function class $\mathcal{F}$, a sequence of convex functions $(f_{1},f_{2},\ldots,f_{T})$ belongs to $k$-contaminated $\mathcal{F}$ if there exists a set of indices $I\subset[T]$ such that $|I|=k$ and $f_{t}\in\mathcal{F}$ holds for all $t\in[T]\backslash I$.

For example, if all but $k$ of the functions are $\alpha$-exp-concave, we call the sequence $k$-contaminated $\alpha$-exp-concave. In particular, for OCO problems whose objective functions are contaminated in this sense, we speak of contaminated OCO.

The following two examples describe objective functions whose function class varies with the time step. These examples motivate this study.

Example 3.7.

(Online least mean square regression) Consider the situation where a batch of input-output data $(\bm{a}_{t,i},b_{t,i})\in\mathbb{R}^{d}\times\mathbb{R}$ $(i\in\{1,2,\ldots,n\})$ is given in each round $t$, and we want to estimate $\bm{x}$ that enables the prediction $b\approx\left\langle{\bm{a}},{\bm{x}}\right\rangle$. This can be regarded as an OCO problem whose objective functions are $f_{t}(\bm{x}):=(1/n)\sum_{i=1}^{n}(\left\langle{\bm{a}_{t,i}},{\bm{x}}\right\rangle-b_{t,i})^{2}$. These functions are $\lambda_{t}$-strongly convex, where $\lambda_{t}$ is the minimum eigenvalue of the matrix $(2/n)\sum_{i=1}^{n}\bm{a}_{t,i}\bm{a}_{t,i}^{\top}$. Let $k(\lambda):=|\{t\in[T]\mid\lambda_{t}<\lambda\}|$ for any $\lambda>0$. Then $(f_{1},f_{2},\ldots,f_{T})$ is $k(\lambda)$-contaminated $\lambda$-strongly convex.
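The quantities $\lambda_t$ and $k(\lambda)$ in this example are directly computable from the data. The sketch below (with synthetic batches of our own choosing, including a few deliberately rank-deficient rounds) tabulates them with NumPy.

```python
import numpy as np

def strong_convexity_params(batches):
    """lambda_t = smallest eigenvalue of (2/n) * sum_i a_{t,i} a_{t,i}^T."""
    lams = []
    for A in batches:                 # A has shape (n, d): one row per a_{t,i}
        n = A.shape[0]
        H = (2.0 / n) * A.T @ A       # Hessian of f_t (constant in x)
        lams.append(np.linalg.eigvalsh(H)[0])  # eigvalsh sorts ascending
    return np.array(lams)

def contamination_level(lams, lam):
    """k(lambda) = number of rounds with lambda_t < lambda."""
    return int(np.sum(lams < lam))

# Synthetic stream: most batches are well-conditioned, two are rank-deficient.
rng = np.random.default_rng(0)
batches = [rng.normal(size=(10, 3)) for _ in range(20)]
for t in (3, 11):                     # degenerate rounds: all inputs collinear
    batches[t] = np.outer(np.ones(10), rng.normal(size=3))

lams = strong_convexity_params(batches)
print(contamination_level(lams, lam=0.1))
```

The degenerate rounds have $\lambda_t=0$, so they are counted as contaminated for every target $\lambda>0$.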

Example 3.8.

(Online classification via logistic regression) Consider the online classification problem. A batch of input-output data $(\bm{a}_{t,i},b_{t,i})\in\mathbb{R}^{d}\times\{\pm 1\}$ $(i\in\{1,2,\ldots,n\})$ is given in each round $t$, and we want to estimate $\bm{x}$ that enables the prediction $b=\mathrm{sgn}(\left\langle{\bm{a}},{\bm{x}}\right\rangle)$. Suppose that the objective functions are given by $f_{t}(\bm{x}):=(1/n)\sum_{i=1}^{n}\log(1+\exp(-b_{t,i}\left\langle{\bm{a}_{t,i}},{\bm{x}}\right\rangle))$. The exp-concavity of $f_{t}(\bm{x})$ on $\{\bm{x}\in\mathbb{R}^{d} \mid \|\bm{x}\|\leq 1\}$ changes with the time step. In particular, in the case $\bm{a}_{t,i}=\bm{a}_{t},~b_{t,i}=b_{t}$, $f_{t}$ is $\exp(-\|\bm{a}_{t}\|)$-exp-concave, as proved in Appendix B.1. Let $k(\alpha):=|\{t\in[T]\mid\alpha_{t}<\alpha\}|$ for any $\alpha>0$, where $\alpha_{t}$ is defined so that $f_{t}$ is $\alpha_{t}$-exp-concave. Then $(f_{1},f_{2},\ldots,f_{T})$ is $k(\alpha)$-contaminated $\alpha$-exp-concave.
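In the single-datapoint case of this example, the exp-concavity parameter is explicit, $\alpha_t=\exp(-\|\bm{a}_t\|)$, so $k(\alpha)$ is easy to tabulate. The sketch below uses synthetic feature vectors and target levels of our own choosing.

```python
import numpy as np

# Single-datapoint logistic losses f_t(x) = log(1 + exp(-b_t <a_t, x>)):
# on the unit ball, f_t is exp(-||a_t||)-exp-concave (Appendix B.1 of the paper).
rng = np.random.default_rng(1)
a = rng.normal(size=(100, 5))                   # feature vectors a_t
alpha_t = np.exp(-np.linalg.norm(a, axis=1))    # per-round exp-concavity

def k(alpha):
    """k(alpha) = number of rounds whose exp-concavity parameter is below alpha."""
    return int(np.sum(alpha_t < alpha))

# A larger target alpha makes more rounds count as contaminated.
print(k(0.05), k(0.2), k(1.0))
```

This illustrates the trade-off discussed in Remark 3.9: raising the target $\alpha$ shrinks the $O((d/\alpha)\log T)$ term but inflates the contamination level $k(\alpha)$.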

Remark 3.9.

In the two examples above, the constants $\lambda$ and $\alpha$ in the definitions of $\lambda$-strong convexity and $\alpha$-exp-concavity can be strictly positive for all time steps. However, since the regret bounds are $O((1/\lambda)\log T)$ for $\lambda$-strongly convex functions and $O((d/\alpha)\log T)$ for $\alpha$-exp-concave functions, if $\lambda$ and $\alpha$ are $O(1/T)$, then these regret bounds become trivial. The analyses in this paper give nontrivial regret bounds in such cases.

4 Regret Lower Bounds

4.1 Vulnerability of ONS

This subsection explains how the Online Newton Step (ONS) behaves for contaminated exp-concave functions. ONS is an algorithm for online learning with exp-concave losses; details of ONS are in Appendix A.2. The upper bound is as follows.

Proposition 4.1.

If a sequence of objective functions $(f_{1},f_{2},\ldots,f_{T})$ is $k$-contaminated $\alpha$-exp-concave, the regret upper bound of ONS with $\gamma=(1/2)\min\{1/(GD),\alpha\}$ and $\varepsilon=1/(\gamma^{2}D^{2})$ is $O((d/\gamma)\log T+k)$.

This proposition is proved using the proof for noncontaminated cases by Hazan (2016); a detailed proof is in Appendix B.2. This upper bound may seem trivial, but it is tight because of the lower bound stated in Corollary 4.6.

Before stating the lower bound, we introduce the following theorem, which is essential in deriving lower bounds for contaminated cases.

Theorem 4.2.

Let $\mathcal{F}$ be an arbitrary function class. Suppose that $g_{1},g_{2}$ are functions such that $\Omega(g_{1}(T))$ and $\Omega(g_{2}(T))$ are regret lower bounds for the function class $\mathcal{F}$ and for convex functions, respectively, for some OCO algorithm. If a sequence of objective functions belongs to $k$-contaminated $\mathcal{F}$, then the worst-case regret of that algorithm is $\Omega(g_{1}(T)+g_{2}(k))$.

Remark 4.3.

In Theorem 4.2, if the lower bounds $\Omega(g_{1}(T))$ and $\Omega(g_{2}(T))$ hold for all OCO algorithms, then the lower bound $\Omega(g_{1}(T)+g_{2}(k))$ also holds for all OCO algorithms.

To derive this lower bound, we use the following two instances: one is the instance used to prove the lower bound $R_{T}=\Omega(g_{1}(T))$ for the function class $\mathcal{F}$, and the other is the instance used to prove $R_{k}=\Omega(g_{2}(k))$ for convex objective functions. By considering the instance that realizes each of these with probability $1/2$, we can construct an instance that satisfies

\mathbb{E}[R_{T}]=\Omega(g_{1}(T)+g_{2}(k)),

for all OCO algorithms. A detailed proof of this theorem is postponed to Appendix B.3.

Theorem 4.2 implies that, in contaminated cases, we can derive interpolating regret lower bounds. The lower bound obtained from this theorem is $\Omega(g_{1}(T))$ if $k\ll T$, and $\Omega(g_{2}(T))$ if $k=T$. Since the contaminated case can be interpreted as an intermediate regime between restricted $\mathcal{F}$-OCO and general OCO, this result can be seen as reasonable. This lower bound applies not only to ONS but also to arbitrary algorithms.

To apply Theorem 4.2 to ONS, we show the following lower bound in the case of convex functions. This lower bound shows that ONS is not suitable for convex objective functions.

Proposition 4.4.

For any positive parameters $\gamma$ and $\varepsilon$, ONS incurs $\Omega(T)$ regret in the worst case.

To prove this proposition, consider the following instance:

f_{t}(x)=v_{t}x,~{}x\in\mathcal{X}=[-D/2,D/2],~{}x_{1}=-G,

where

v_{t}=\begin{dcases}(-1)^{t}G&t<t_{1},\\ G&t\geq t_{1},~{}x_{t_{1}}\geq 0,\\ -G&t\geq t_{1},~{}x_{t_{1}}<0,\end{dcases}

and $t_{1}$ is the minimum natural number satisfying $t_{1}\geq(1+\gamma G^{2}D/2)^{-1}T$. Then, we can get

R_{T}\geq\frac{\gamma G^{2}D/2}{2(1+\gamma G^{2}D/2)^{2}}T-\frac{1}{\gamma}\log\left(1+\frac{G^{2}}{\varepsilon}T\right)-\frac{2}{\gamma G}-\frac{G^{2}D}{2}.

A detailed proof of this proposition is postponed to Appendix B.4.
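To see this vulnerability concretely, the following self-contained sketch runs a one-dimensional ONS (scalar $A_t=\varepsilon+\sum_{s\le t}g_s^2$, projection by clipping) against the adversarial sequence above. The parameter values $G=D=\alpha=1$ and the measurement at $T=4000$ are our own illustrative choices; on this instance the measured regret grows roughly linearly in $T$, as the proposition predicts.

```python
import math

def ons_regret(T, G=1.0, D=1.0, alpha=1.0):
    """1-D ONS on the adversarial linear losses f_t(x) = v_t * x over
    X = [-D/2, D/2], with the parameter choices of Proposition 4.1."""
    gamma = 0.5 * min(1.0 / (G * D), alpha)
    eps = 1.0 / (gamma ** 2 * D ** 2)
    t1 = math.ceil(T / (1.0 + gamma * G ** 2 * D / 2.0))
    x = max(-G, -D / 2.0)        # x_1 = -G, projected onto X
    A = eps                      # scalar version of A_t
    xs, vs = [], []
    v_fixed = None
    for t in range(1, T + 1):
        if t < t1:
            v = ((-1) ** t) * G              # alternating phase
        else:
            if v_fixed is None:              # adversary commits at t = t1
                v_fixed = G if x >= 0 else -G
            v = v_fixed
        xs.append(x)
        vs.append(v)
        g = v                                # gradient of f_t
        A += g * g
        x = x - g / (gamma * A)              # Newton-style step
        x = min(max(x, -D / 2.0), D / 2.0)   # projection onto X
    learner_loss = sum(v * y for v, y in zip(vs, xs))
    best_loss = -abs(sum(vs)) * D / 2.0      # best fixed point is an endpoint
    return learner_loss - best_loss

print(ons_regret(4000))
```

Quadrupling the horizon roughly quadruples the measured regret here, in contrast to the logarithmic growth ONS enjoys on purely exp-concave sequences.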

Remark 4.5.

Proposition 4.4 states a lower bound that holds only for ONS. However, if better algorithms are used, the lower bound can be improved. Therefore, it is not a contradiction that the general lower bound in Table 1 is better than that of ONS. The same holds for Corollary 4.6, which concerns the contaminated case.

The lower bound for $\alpha$-exp-concave functions can be derived as follows. The lower bound for $1$-exp-concave functions is $\Omega(d\log T)$ (Ordentlich and Cover 1998). When the functions are divided by $\alpha$, $1$-exp-concave functions turn into $\alpha$-exp-concave functions, and the regret is also divided by $\alpha$. Hence, the lower bound for $\alpha$-exp-concave functions is $\Omega((d/\alpha)\log T)$.

We get the following from this lower bound for exp-concave functions, Proposition 4.4, and Theorem 4.2.

Corollary 4.6.

If a sequence of objective functions $(f_{1},f_{2},\ldots,f_{T})$ is $k$-contaminated $\alpha$-exp-concave, the worst-case regret of ONS is $\Omega((d/\alpha)\log T+k)$.

This corollary shows that the regret analysis in Proposition 4.1 is tight. While ONS does not work well for contaminated OCO, universal algorithms exhibit more robust performance. In Section 5, we analyze some universal algorithms in this regard.

Remark 4.7.

For the one-dimensional instance above, ONS can also be regarded as OGD (Algorithm 2 in Appendix A.1) with $\Theta(1/t)$ step size. This implies that OGD with $\Theta(1/t)$ step size can also incur $\Omega(T)$ regret in the worst case. Therefore, for $k$-contaminated $\lambda$-strongly convex functions, the worst-case regret of OGD with $\Theta(1/t)$ step size is $\Omega((1/\lambda)\log T+k)$.

4.2 General Lower Bounds

In this subsection, we present regret lower bounds for arbitrary algorithms.

Using Theorem 4.2, we can obtain a lower bound for $k$-contaminated exp-concave functions. As mentioned in Section 4.1, the regret lower bound for $\alpha$-exp-concave functions is $\Omega((d/\alpha)\log T)$. From this lower bound and the lower bound $\Omega(GD\sqrt{T})$ for convex functions (Abernethy et al. 2008), we can derive the following. This corollary shows that $k$-contamination of exp-concave functions worsens the regret lower bound by at least $\Omega(GD\sqrt{k})$.

Corollary 4.8.

If $(f_{1},f_{2},\ldots,f_{T})$ is $k$-contaminated $\alpha$-exp-concave, the worst-case regret is $\Omega((d/\alpha)\log T+GD\sqrt{k})$ for all OCO algorithms.

According to Abernethy et al. (2008), the regret lower bound for $\lambda$-strongly convex functions is $\Omega((G^{2}/\lambda)\log T)$. Therefore, the following corollary is derived in the same way.

Corollary 4.9.

If a sequence of objective functions $(f_{1},f_{2},\ldots,f_{T})$ is $k$-contaminated $\lambda$-strongly convex, the worst-case regret is $\Omega((G^{2}/\lambda)\log T+GD\sqrt{k})$ for all OCO algorithms.

5 Regret Upper Bounds by Universal Algorithms

In this section, we explain the regret upper bounds of some universal algorithms when the objective functions are contaminated. The algorithms we analyze in this paper are the multiple eta gradient algorithm (MetaGrad) (Van Erven and Koolen 2016), multiple sub-algorithms and learning rates (Maler) (Wang et al. 2020), and the universal strategy for online convex optimization (USC) (Zhang et al. 2022). Though there are two variations of MetaGrad, full MetaGrad and diag MetaGrad, in this paper MetaGrad means full MetaGrad. We denote $R_{T}^{\bm{x}}:=\sum_{t=1}^{T}(f_{t}(\bm{x}_{t})-f_{t}(\bm{x}))$, $\tilde{R}_{T}^{\bm{x}}:=\sum_{t=1}^{T}\left\langle{\nabla f_{t}(\bm{x}_{t})},{\bm{x}_{t}-\bm{x}}\right\rangle$, $V_{T}^{\bm{x}}:=\sum_{t=1}^{T}(\left\langle{\nabla f_{t}(\bm{x}_{t})},{\bm{x}_{t}-\bm{x}}\right\rangle)^{2}$, and $W_{T}^{\bm{x}}:=G^{2}\sum_{t=1}^{T}\|\bm{x}_{t}-\bm{x}\|^{2}$.

Concerning MetaGrad and Maler, the following regret bounds hold without assuming exp-concavity or strong convexity:

Theorem 5.1.

(Van Erven and Koolen 2016) For MetaGrad, $R_{T}^{\bm{x}}$ is simultaneously bounded by $O(GD\sqrt{T\log\log T})$, and by

R_{T}^{\bm{x}}\leq\tilde{R}_{T}^{\bm{x}}=O(\sqrt{V_{T}^{\bm{x}}d\log T}+GDd\log T),

for any $\bm{x}\in\mathcal{X}$.

Theorem 5.2.

(Wang et al. 2020) For Maler, $R_{T}^{\bm{x}}$ is simultaneously bounded by $O(GD\sqrt{T})$,

R_{T}^{\bm{x}}\leq\tilde{R}_{T}^{\bm{x}}=O(\sqrt{V_{T}^{\bm{x}}d\log T})\quad\mathrm{and}
R_{T}^{\bm{x}}\leq\tilde{R}_{T}^{\bm{x}}=O(\sqrt{W_{T}^{\bm{x}}\log T}),

for any $\bm{x}\in\mathcal{X}$.

Though Theorem 5.2 is derived only for $\bm{x}\in\mathrm{argmin}_{\bm{x}\in\mathcal{X}}\sum_{t=1}^{T}f_{t}(\bm{x})$ in the original paper by Wang et al. (2020), the proof remains valid when $\bm{x}$ is any vector in $\mathcal{X}$, so we restate the theorem in this form. The proof of this generalization is provided in Appendix B.5. Further explanations of universal algorithms are in Appendix A.3.

Concerning the regret bound under contaminated exp-concavity, the following theorem holds. Its assumption is satisfied when using MetaGrad or Maler, and the resulting bounds for them are described after the proof of this theorem.

Theorem 5.3.

Let $\alpha_{t}$ be a constant such that $f_{t}$ is $\alpha_{t}$-exp-concave ($\alpha_{t}=0$ if $f_{t}$ is not exp-concave) for each $t$. Suppose that

R_{T}^{\bm{x}}\leq\tilde{R}_{T}^{\bm{x}}=O\left(\sqrt{V_{T}^{\bm{x}}r_{1}(T)}+r_{2}(T)\right) (1)

holds for some functions $r_{1}$, $r_{2}$, and a point $\bm{x}\in\mathcal{X}$. Then, it holds for any $\gamma>0$ that

R_{T}^{\bm{x}}=O\left(\frac{1}{\gamma}r_{1}(T)+GD\sqrt{k_{\gamma}r_{1}(T)}+r_{2}(T)\right),

where $k_{\gamma}:=\sum_{t=1}^{T}\max\{1-\gamma_{t}/\gamma,0\}$ and $\gamma_{t}:=(1/2)\min\{1/(GD),\alpha_{t}\}$.
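The quantity $k_\gamma$ is a soft count of the contaminated rounds: a round contributes nothing when $\gamma_t\geq\gamma$ and at most $1$ when $\gamma_t$ falls far below $\gamma$. A small sketch (with made-up $\alpha_t$ values of our own choosing) makes the definition, and the fact that $k_\gamma$ never exceeds the hard count $k$, tangible.

```python
import numpy as np

def k_gamma(alphas, gamma, G=1.0, D=1.0):
    """k_gamma = sum_t max{1 - gamma_t / gamma, 0},
    with gamma_t = 0.5 * min(1/(G*D), alpha_t)."""
    gts = 0.5 * np.minimum(1.0 / (G * D), alphas)
    return float(np.sum(np.maximum(1.0 - gts / gamma, 0.0)))

alpha = 1.0                                        # target exp-concavity level
gamma = 0.5 * min(1.0, alpha)
alphas = np.array([1.0] * 7 + [0.3] * 2 + [0.0])   # 3 rounds fall below alpha
k = int(np.sum(alphas < alpha))                    # hard contamination count
print(k_gamma(alphas, gamma), k)
```

Mildly contaminated rounds ($\alpha_t=0.3$ here) contribute only fractionally to $k_\gamma$, which is why the bound of Theorem 5.3 can be strictly better than one stated in terms of $k$.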

Before proving this theorem, we prepare a lemma. The proof of this lemma is given in Appendix B.6.

Lemma 5.4.

For all $a,b,x\geq 0$, if $x\leq\sqrt{ax}+b$, then $x\leq 3(a+b)/2$.
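This lemma is the tool that converts the quadratic self-bounding inequality into an explicit bound. A quick numerical check (ours, not from the paper): substituting $y=\sqrt{x}$ turns $x\leq\sqrt{ax}+b$ into $y^2-\sqrt{a}\,y-b\leq 0$, so the largest feasible $x$ is $x^*=\big((\sqrt{a}+\sqrt{a+4b})/2\big)^2$, and this never exceeds $3(a+b)/2$.

```python
import math
import random

def x_star(a, b):
    """Largest x >= 0 satisfying x <= sqrt(a*x) + b."""
    y = (math.sqrt(a) + math.sqrt(a + 4.0 * b)) / 2.0   # positive root in y
    return y * y

random.seed(0)
worst = 0.0
for _ in range(100_000):
    a, b = random.uniform(0, 100), random.uniform(0, 100)
    worst = max(worst, x_star(a, b) / (1.5 * (a + b) + 1e-12))
print(worst)   # stays <= 1, as Lemma 5.4 guarantees
```

Expanding $x^*$ shows the ratio $x^*/\tfrac{3}{2}(a+b)$ is at most $1$ because the gap reduces to $3a^2+b^2\geq 0$, so the constant $3/2$ is valid (though not the tightest possible).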

Proof of Theorem 5.3.

From Lemma 3.4, we have

R_{T}^{\bm{x}} =\sum_{t=1}^{T}(f_{t}(\bm{x}_{t})-f_{t}(\bm{x}))
\leq\sum_{t=1}^{T}\left(\left\langle{\nabla f_{t}(\bm{x}_{t})},{\bm{x}_{t}-\bm{x}}\right\rangle-\frac{\gamma_{t}}{2}(\left\langle{\nabla f_{t}(\bm{x}_{t})},{\bm{x}-\bm{x}_{t}}\right\rangle)^{2}\right)
=\tilde{R}_{T}^{\bm{x}}-\frac{\gamma}{2}V_{T}^{\bm{x}}+\sum_{t=1}^{T}\frac{\gamma-\gamma_{t}}{2}(\left\langle{\nabla f_{t}(\bm{x}_{t})},{\bm{x}-\bm{x}_{t}}\right\rangle)^{2}
\leq\tilde{R}_{T}^{\bm{x}}-\frac{\gamma}{2}V_{T}^{\bm{x}}+\sum_{t:\gamma_{t}<\gamma}\frac{\gamma-\gamma_{t}}{2}G^{2}D^{2}
\leq\tilde{R}_{T}^{\bm{x}}-\frac{\gamma}{2}V_{T}^{\bm{x}}+\frac{\gamma}{2}k_{\gamma}G^{2}D^{2}.

If $R_{T}^{\bm{x}}<0$, then $0$ is an upper bound, so it suffices to consider the case $R_{T}^{\bm{x}}\geq 0$. In this case, we have

V_{T}^{\bm{x}}\leq\frac{2}{\gamma}\tilde{R}_{T}^{\bm{x}}+k_{\gamma}G^{2}D^{2}. (2)

From equation (1), there exists a constant $C>0$ such that

\tilde{R}_{T}^{\bm{x}} \leq C\left(\sqrt{V_{T}^{\bm{x}}r_{1}(T)}+r_{2}(T)\right)
\leq C\left(\sqrt{\left(\frac{2}{\gamma}\tilde{R}_{T}^{\bm{x}}+k_{\gamma}G^{2}D^{2}\right)r_{1}(T)}+r_{2}(T)\right)
\leq\sqrt{\frac{2}{\gamma}C^{2}r_{1}(T)\tilde{R}_{T}^{\bm{x}}}+CGD\sqrt{k_{\gamma}r_{1}(T)}+Cr_{2}(T). (3)

The second inequality holds by inequality (2), and the last inequality holds by $\sqrt{x+y}\leq\sqrt{x}+\sqrt{y}$ for $x,y\geq 0$.

From Lemma 5.4 with $a=(2/\gamma)C^{2}r_{1}(T)$ and $b=CGD\sqrt{k_{\gamma}r_{1}(T)}+Cr_{2}(T)$, we have

\tilde{R}_{T}^{\bm{x}}\leq\frac{3}{2}\left(\frac{2}{\gamma}C^{2}r_{1}(T)+CGD\sqrt{k_{\gamma}r_{1}(T)}+Cr_{2}(T)\right).

From this inequality and $R_{T}^{\bm{x}}\leq\tilde{R}_{T}^{\bm{x}}$, Theorem 5.3 follows. ∎

The core of this proof is inequality (3), which can be regarded as a quadratic inequality in $\sqrt{\tilde{R}_{T}^{\bm{x}}}$. Solving this inequality enables us to obtain a regret upper bound for contaminated cases from a regret upper bound for non-contaminated cases.

Theorem 5.3 combined with Theorem 5.1, Theorem 5.2, and Theorem A.1 in the appendix gives upper bounds for the universal algorithms MetaGrad, Maler, and USC. The following corollary shows that, even if exp-concave objective functions are $k$-contaminated, the regret can be bounded with an additional term of $O(\sqrt{kd\log T})$. This regret bound is better than that of ONS in the parameter $k$.

Corollary 5.5.

If a sequence of objective functions $(f_{1},f_{2},\ldots,f_{T})$ is $k$-contaminated $\alpha$-exp-concave, the regret bound of MetaGrad, Maler, and USC with MetaGrad or Maler as an expert algorithm is

R_{T}=O\left(\frac{d}{\gamma}\log T+GD\sqrt{kd\log T}\right), (4)

where $\gamma:=(1/2)\min\{1/(GD),\alpha\}$.

We give the proof only for MetaGrad and Maler here; the proof for USC is provided in Appendix B.7.

Proof.

As for MetaGrad and Maler, from Theorem 5.1 and Theorem 5.2,

\tilde{R}_{T}^{\bm{x}}=O(\sqrt{V_{T}^{\bm{x}}d\log T}+GDd\log T)

holds for any $\bm{x}\in\mathcal{X}$. Therefore, by Theorem 5.3, we have

R_{T}^{\bm{x}}=O\left(\frac{d}{\gamma}\log T+GD\sqrt{k_{\gamma}d\log T}\right),

where $GD=O(1/\gamma)$ is used, which follows from $\gamma=(1/2)\min\{1/(GD),\alpha\}$. Here, $k_{\gamma}$ satisfies

k_{\gamma}=\sum_{t=1}^{T}\max\left\{1-\frac{\gamma_{t}}{\gamma},0\right\}=\sum_{t\colon\gamma_{t}<\gamma}\left(1-\frac{\gamma_{t}}{\gamma}\right)\leq k.

The inequality follows from the fact that if $\gamma_{t}<\gamma$, then $\alpha_{t}<\alpha$ holds. Hence, we have

R_{T}^{\bm{x}}=O\left(\frac{d}{\gamma}\log T+GD\sqrt{kd\log T}\right),

and, in particular, we obtain the regret upper bound (4). ∎

As for strongly convex functions, we can obtain a result similar to Theorem 5.3.

Theorem 5.6.

For each $t$, let $\lambda_{t}$ be a constant such that $f_{t}$ is $\lambda_{t}$-strongly convex ($\lambda_{t}=0$ if $f_{t}$ is not strongly convex). Suppose that

R_{T}^{\bm{x}}\leq\tilde{R}_{T}^{\bm{x}}=O\left(\sqrt{W_{T}^{\bm{x}}r_{1}(T)}+r_{2}(T)\right)

holds for some functions $r_{1}$, $r_{2}$, and point $\bm{x}\in\mathcal{X}$. Then, it holds for any $\lambda>0$ that

R_{T}^{\bm{x}}=O\left(\frac{G^{2}}{\lambda}r_{1}(T)+GD\sqrt{k_{\lambda}r_{1}(T)}+r_{2}(T)\right),

where $k_{\lambda}:=\sum_{t=1}^{T}\max\{1-\lambda_{t}/\lambda,0\}$.

This theorem can be derived in almost the same manner as Theorem 5.3, except that the definition of strong convexity and $k_{\lambda}$ are used instead. A detailed proof is given in Appendix B.8.

Theorem 5.6 combined with Theorem 5.1, Theorem 5.2, and Theorem A.1 in the appendix gives upper bounds for the universal algorithms MetaGrad, Maler, and USC. The following corollary shows that, even if strongly convex objective functions are $k$-contaminated, the regret can be bounded with an additional term of $O(\sqrt{k\log T})$ if Maler, or USC with Maler as an expert algorithm, is used.

Corollary 5.7.

If a sequence of objective functions $(f_{1},f_{2},\ldots,f_{T})$ is $k$-contaminated $\lambda$-strongly convex, the regret bound of MetaGrad, Maler, and USC with Maler as an expert algorithm is

R_{T}=O\left(\left(\frac{G^{2}}{\lambda}+GD\right)\tilde{d}\log T+GD\sqrt{k\tilde{d}\log T}\right),

where $\tilde{d}$ equals $d$ in the case of MetaGrad and $1$ in the case of the other two algorithms.

This corollary can be derived from Theorem 5.6 in almost the same manner as Corollary 5.5. Detailed proofs are given in Appendix B.9 and Appendix B.10.

Remark 5.8.

If $(f_{1},f_{2},\ldots,f_{T})$ is $k_{1}$-contaminated $\alpha$-exp-concave and $k_{2}$-contaminated $\lambda$-strongly convex, then we have two regret upper bounds:

R_{T}=O\left(\frac{d}{\gamma}\log T+GD\sqrt{k_{1}d\log T}\right),

from Corollary 5.5, and

R_{T}=O\left(\left(\frac{G^{2}}{\lambda}+GD\right)\tilde{d}\log T+GD\sqrt{k_{2}\tilde{d}\log T}\right),

from Corollary 5.7. Since strongly convex functions are also exp-concave by Lemma 3.5, if $\lambda/G^{2}\geq\alpha$, then $k_{1}\leq k_{2}$.

Remark 5.9.

Note that the universal algorithms analyzed in this section do not require additional information other than the gradient, which is a valuable property in practical use. However, comparing the lower bounds in Corollary 4.8 and Corollary 4.9 with the upper bounds in Corollary 5.5 and Corollary 5.7, respectively, there are gaps between them. This implies that our upper bounds in this section or the lower bounds in Section 4.2 might not be tight. As we discuss in the next section, this gap can be removed if additional information on function classes is available, though such information is not always available in real-world applications.

6 Regret Upper Bounds Given Additional Information

In this section, we propose a method that achieves better bounds than those of the universal algorithms discussed in the previous section, under the condition that the class of each objective function is revealed after it is played. The experimental performance of this method is shown in Appendix C.

We denote $S_{1}\coloneqq\{t\in[T]\mid f_{t}~\text{is}~\lambda\text{-strongly convex}\}$, $S_{2}\coloneqq\{t\in[T]\setminus S_{1}\mid f_{t}~\text{is}~\alpha\text{-exp-concave}\}$, $U\coloneqq[T]\setminus(S_{1}\cup S_{2})$, and $k\coloneqq|U|$. The algorithm using additional information is shown in Algorithm 1 ($\bm{I}_{d}$ is the $d$-dimensional identity matrix, and $\|\cdot\|_{\bm{A}}^{2}$ denotes $\left\langle{\bm{A}\cdot},{\cdot}\right\rangle$). This algorithm is a generalization of OGD and ONS: $(S_{1},S_{2},U)=([T],\emptyset,\emptyset)$ recovers standard OGD, and $(S_{1},S_{2},U)=(\emptyset,[T],\emptyset)$ recovers standard ONS.

Algorithm 1 Algorithm using additional information
0:  convex set $\mathcal{X}\subset\mathbb{R}^{d}$, $\bm{x}_{1}\in\mathcal{X}$, $T$, $D$, $G$, $\alpha$, $\lambda$
1:  Set $\gamma\coloneqq(1/2)\min\{1/(GD),\alpha\}$, $\varepsilon\coloneqq\sqrt{2}G/D$, $\bm{A}_{0}\coloneqq\varepsilon\bm{I}_{d}$.
2:  for $t=1$ to $T$ do
3:     Play $\bm{x}_{t}$ and observe cost $f_{t}(\bm{x}_{t})$.
4:     Update:
\bm{A}_{t}=\bm{A}_{t-1}+\begin{dcases}\lambda\bm{I}_{d},&t\in S_{1}\\ \gamma\nabla f_{t}(\bm{x}_{t})\nabla f_{t}(\bm{x}_{t})^{\top},&t\in S_{2}\\ \frac{G}{D\sqrt{2|[t]\cap U|}}\bm{I}_{d},&t\in U,\end{dcases}
5:     Newton step and generalized projection:
\bm{y}_{t+1}=\bm{x}_{t}-\bm{A}_{t}^{-1}\nabla f_{t}(\bm{x}_{t}),
\bm{x}_{t+1}=\Pi_{\mathcal{X}}^{\bm{A}_{t}}(\bm{y}_{t+1})\coloneqq\operatorname*{argmin}_{\bm{x}\in\mathcal{X}}\{\|\bm{y}_{t+1}-\bm{x}\|_{\bm{A}_{t}}^{2}\}.
6:  end for
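To make the update concrete, here is a minimal one-dimensional sketch of Algorithm 1. The function name `contaminated_newton_1d` and the label strings `'sc'`, `'exp'`, `'cvx'` (standing for $S_1$, $S_2$, and $U$) are our illustrative choices; in one dimension $\bm{A}_{t}$ is a scalar and the generalized projection reduces to clipping onto an interval.

```python
import math

def contaminated_newton_1d(grads, labels, lo, hi, x1, G, alpha, lam):
    """1-D sketch of Algorithm 1: A_t is a scalar, projection is clipping.

    grads:  list of gradient oracles g_t(x) for f_t
    labels: 'sc' (lambda-strongly convex), 'exp' (alpha-exp-concave), 'cvx' (convex only)
    """
    D = hi - lo                          # diameter of the feasible interval
    gamma = 0.5 * min(1.0 / (G * D), alpha)
    A = math.sqrt(2) * G / D             # A_0 = eps * I_d with eps = sqrt(2) G / D
    x, n_u, xs = x1, 0, []
    for g_t, lab in zip(grads, labels):
        xs.append(x)
        g = g_t(x)
        if lab == 'sc':                  # t in S_1: add lambda * I_d
            A += lam
        elif lab == 'exp':               # t in S_2: add gamma * g g^T (scalar: gamma g^2)
            A += gamma * g * g
        else:                            # t in U: add G / (D sqrt(2 |[t] cap U|)) * I_d
            n_u += 1
            A += G / (D * math.sqrt(2 * n_u))
        y = x - g / A                    # Newton step with A_t^{-1}
        x = min(max(y, lo), hi)          # generalized projection = clipping in 1-D
    return xs
```

For example, running this on $f_t(x)=(x-1)^2$ over $[0,2]$ with all rounds labeled `'sc'` reproduces the $\Theta(1/t)$-step behavior of OGD for strongly convex losses, and mixing in a few `'cvx'` rounds mimics contamination.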

Before stating the regret upper bounds of Algorithm 1, we prepare the following lemma:

Lemma 6.1.

Let $\{\bm{x}_{t}\}_{t}$ be the sequence generated by Algorithm 1 and $S_{1}$, $S_{2}$, $U$, and $k$ be as defined above. Then, the following inequalities hold:

\sum_{t\in S_{1}}\|\nabla f_{t}(\bm{x}_{t})\|_{\bm{A}_{t}^{-1}}^{2}\leq\frac{G^{2}}{\lambda}\log\left(1+\frac{\lambda D}{\sqrt{2}G}|S_{1}|\right), (5)
\sum_{t\in S_{2}}\|\nabla f_{t}(\bm{x}_{t})\|_{\bm{A}_{t}^{-1}}^{2}\leq\frac{d}{\gamma}\log\left(1+\frac{\lambda D}{\sqrt{2}G}|S_{1}|+\frac{\gamma GD}{\sqrt{2}}|S_{2}|+\sqrt{k}\right), (6)
\sum_{t\in U}\|\nabla f_{t}(\bm{x}_{t})\|_{\bm{A}_{t}^{-1}}^{2}\leq\sqrt{2}GD(\sqrt{k+1}-1). (7)
Proof.

Inequality (5) can be obtained as follows:

\sum_{t\in S_{1}}\|\nabla f_{t}(\bm{x}_{t})\|_{\bm{A}_{t}^{-1}}^{2}\leq G^{2}\sum_{t\in S_{1}}\frac{1}{\lambda_{\mathrm{min}}(\bm{A}_{t})}\leq G^{2}\sum_{i=1}^{|S_{1}|}\frac{1}{\varepsilon+\lambda i}\leq G^{2}\int_{0}^{|S_{1}|}\frac{\mathrm{d}s}{\varepsilon+\lambda s}=\frac{G^{2}}{\lambda}\log\left(1+\frac{\lambda D}{\sqrt{2}G}|S_{1}|\right),

where $\lambda_{\mathrm{min}}(\bm{A}_{t})$ denotes the minimum eigenvalue of the matrix $\bm{A}_{t}$, which increases by at least $\lambda$ when $t\in S_{1}$.
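The sum-to-integral step above can be checked numerically. The helper below (our illustrative `lhs_rhs`, evaluated at arbitrary test values of $G$, $D$, $\lambda$) compares both sides of inequality (5) in the worst case $\lambda_{\mathrm{min}}(\bm{A}_{t})=\varepsilon+\lambda i$:

```python
import math

def lhs_rhs(G, D, lam, n):
    """Compare sum_{i=1}^n G^2/(eps + lam*i) with (G^2/lam) log(1 + lam*D*n/(sqrt(2) G))."""
    eps = math.sqrt(2) * G / D          # eps = sqrt(2) G / D, as set in Algorithm 1
    lhs = sum(G**2 / (eps + lam * i) for i in range(1, n + 1))
    rhs = (G**2 / lam) * math.log(1 + lam * D * n / (math.sqrt(2) * G))
    return lhs, rhs
```

Since the summand is decreasing in $i$, the sum is dominated by the integral from $0$ to $n$, so `lhs <= rhs` holds for any positive inputs.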

We can bound the left-hand side of inequality (6) as follows:

\sum_{t\in S_{2}}\|\nabla f_{t}(\bm{x}_{t})\|_{\bm{A}_{t}^{-1}}^{2}=\sum_{t\in S_{2}}\tr(\bm{A}_{t}^{-1}\nabla f_{t}(\bm{x}_{t})(\nabla f_{t}(\bm{x}_{t}))^{\top})=\frac{1}{\gamma}\sum_{t\in S_{2}}\tr(\bm{A}_{t}^{-1}(\bm{A}_{t}-\bm{A}_{t-1}))\leq\frac{1}{\gamma}\sum_{t\in S_{2}}\log\frac{|\bm{A}_{t}|}{|\bm{A}_{t-1}|}.

The inequality is from Lemma B.5 in Appendix B.11. Since $|\bm{A}_{t}|\geq|\bm{A}_{t-1}|$ for all $t\in S_{1}\cup U$,

\frac{1}{\gamma}\sum_{t\in S_{2}}\log\frac{|\bm{A}_{t}|}{|\bm{A}_{t-1}|}\leq\frac{1}{\gamma}\sum_{t=1}^{T}\log\frac{|\bm{A}_{t}|}{|\bm{A}_{t-1}|}=\frac{1}{\gamma}\log\frac{|\bm{A}_{T}|}{|\bm{A}_{0}|}\leq\frac{d}{\gamma}\log\left(1+\frac{\lambda D}{\sqrt{2}G}|S_{1}|+\frac{\gamma GD}{\sqrt{2}}|S_{2}|+\sqrt{k}\right).

The last inequality is from the fact that the largest eigenvalue of $\bm{A}_{T}$ is at most $\sqrt{2}G/D+\lambda|S_{1}|+\gamma G^{2}|S_{2}|+(G/D)\sqrt{2k}$.

Inequality (7) can be obtained as follows:

tUft(𝒙t)𝑨t12\displaystyle\sum_{t\in U}\|\nabla f_{t}(\bm{x}_{t})\|_{\bm{A}_{t}^{-1}}^{2} G2tU1λmin(𝑨t)\displaystyle\leq G^{2}\sum_{t\in U}\frac{1}{\lambda_{\mathrm{min}}(\bm{A}_{t})}
G2tU1ε+i=1|[t]U|GD2i\displaystyle\leq G^{2}\sum_{t\in U}\frac{1}{\varepsilon+\sum_{i=1}^{|[t]\cap U|}\frac{G}{D\sqrt{2i}}}
G2tU1ε+2GD(|[t]U|+11)\displaystyle\leq G^{2}\sum_{t\in U}\frac{1}{\varepsilon+\sqrt{2}\frac{G}{D}(\sqrt{|[t]\cap U|+1}-1)}
=GD2i=1k1i+1\displaystyle=\frac{GD}{\sqrt{2}}\sum_{i=1}^{k}\frac{1}{\sqrt{i+1}}
2GD(k+11).\displaystyle\leq\sqrt{2}GD(\sqrt{k+1}-1).
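The elementary estimate used in the last step, $\sum_{i=1}^{k}1/\sqrt{i+1}\leq 2(\sqrt{k+1}-1)$, can also be verified numerically (illustrative helper, our naming):

```python
import math

def check_sqrt_sum(k):
    """Verify sum_{i=1}^k 1/sqrt(i+1) <= 2(sqrt(k+1) - 1), used in inequality (7)."""
    lhs = sum(1 / math.sqrt(i + 1) for i in range(1, k + 1))
    rhs = 2 * (math.sqrt(k + 1) - 1)
    return lhs <= rhs
```

The bound follows from comparing the sum with $\int_{1}^{k+1}s^{-1/2}\,\mathrm{d}s$, so it holds for every $k\geq 1$.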

Using this lemma, we can bound the regret of Algorithm 1 as follows:

Theorem 6.2.

Let $k$ be as defined at the beginning of this section. Algorithm 1 guarantees

R_{T}=O\left(\left(\frac{G^{2}}{\lambda}+\frac{d}{\gamma}\right)\log T+GD\sqrt{k}\right).
Proof.

When $t\in S_{1}$, from the definition of strong convexity,

2(f_{t}(\bm{x}_{t})-f_{t}(\bm{x}^{\ast}))\leq 2\left\langle{\nabla f_{t}(\bm{x}_{t})},{\bm{x}_{t}-\bm{x}^{\ast}}\right\rangle-\lambda\|\bm{x}_{t}-\bm{x}^{\ast}\|^{2} (8)

holds. When $t\in S_{2}$, from Lemma 3.4,

2(f_{t}(\bm{x}_{t})-f_{t}(\bm{x}^{\ast}))\leq 2\left\langle{\nabla f_{t}(\bm{x}_{t})},{\bm{x}_{t}-\bm{x}^{\ast}}\right\rangle-\gamma(\left\langle{\nabla f_{t}(\bm{x}_{t})},{\bm{x}_{t}-\bm{x}^{\ast}}\right\rangle)^{2} (9)

holds. When $t\in U$, since $f_{t}$ is convex,

2(f_{t}(\bm{x}_{t})-f_{t}(\bm{x}^{\ast}))\leq 2\left\langle{\nabla f_{t}(\bm{x}_{t})},{\bm{x}_{t}-\bm{x}^{\ast}}\right\rangle-\frac{G}{D\sqrt{2|[t]\cap U|}}\|\bm{x}_{t}-\bm{x}^{\ast}\|^{2}+\frac{G}{D\sqrt{2|[t]\cap U|}}\|\bm{x}_{t}-\bm{x}^{\ast}\|^{2} (10)

holds. From the update rule of $\bm{A}_{t}$, inequalities (8), (9), and (10) can be combined into one inequality

2(f_{t}(\bm{x}_{t})-f_{t}(\bm{x}^{\ast}))\leq 2\left\langle{\nabla f_{t}(\bm{x}_{t})},{\bm{x}_{t}-\bm{x}^{\ast}}\right\rangle-\left\langle{(\bm{A}_{t}-\bm{A}_{t-1})(\bm{x}_{t}-\bm{x}^{\ast})},{\bm{x}_{t}-\bm{x}^{\ast}}\right\rangle+\frac{G1_{U}(t)}{D\sqrt{2|[t]\cap U|}}\|\bm{x}_{t}-\bm{x}^{\ast}\|^{2},

where $1_{U}$ is the indicator function, i.e., $1_{U}(t)=1$ if $t\in U$, and $1_{U}(t)=0$ otherwise. The first and second terms on the right-hand side can be bounded as follows:

2\left\langle{\nabla f_{t}(\bm{x}_{t})},{\bm{x}_{t}-\bm{x}^{\ast}}\right\rangle-\left\langle{(\bm{A}_{t}-\bm{A}_{t-1})(\bm{x}_{t}-\bm{x}^{\ast})},{\bm{x}_{t}-\bm{x}^{\ast}}\right\rangle
=2\left\langle{\bm{A}_{t}(\bm{y}_{t+1}-\bm{x}_{t})},{\bm{x}^{\ast}-\bm{x}_{t}}\right\rangle-\|\bm{x}_{t}-\bm{x}^{\ast}\|_{\bm{A}_{t}}^{2}+\|\bm{x}_{t}-\bm{x}^{\ast}\|_{\bm{A}_{t-1}}^{2}
=\|\bm{y}_{t+1}-\bm{x}_{t}\|_{\bm{A}_{t}}^{2}-\|\bm{y}_{t+1}-\bm{x}^{\ast}\|_{\bm{A}_{t}}^{2}+\|\bm{x}_{t}-\bm{x}^{\ast}\|_{\bm{A}_{t-1}}^{2}
\leq\|\bm{y}_{t+1}-\bm{x}_{t}\|_{\bm{A}_{t}}^{2}-\|\bm{x}_{t+1}-\bm{x}^{\ast}\|_{\bm{A}_{t}}^{2}+\|\bm{x}_{t}-\bm{x}^{\ast}\|_{\bm{A}_{t-1}}^{2}.

The first equality follows from the algorithm, the second from the law of cosines, and the last inequality from the nonexpansiveness of the projection. Therefore, we have

2(f_{t}(\bm{x}_{t})-f_{t}(\bm{x}^{\ast}))\leq\|\bm{y}_{t+1}-\bm{x}_{t}\|_{\bm{A}_{t}}^{2}-\|\bm{x}_{t+1}-\bm{x}^{\ast}\|_{\bm{A}_{t}}^{2}+\|\bm{x}_{t}-\bm{x}^{\ast}\|_{\bm{A}_{t-1}}^{2}+\frac{G1_{U}(t)}{D\sqrt{2|[t]\cap U|}}\|\bm{x}_{t}-\bm{x}^{\ast}\|^{2}.

By summing up from $t=1$ to $T$, we can bound the regret as follows:

2R_{T}\leq\sum_{t=1}^{T}\|\bm{y}_{t+1}-\bm{x}_{t}\|_{\bm{A}_{t}}^{2}+\|\bm{x}_{1}-\bm{x}^{\ast}\|_{\bm{A}_{0}}^{2}+\sum_{t\in U}\frac{G}{D\sqrt{2|[t]\cap U|}}\|\bm{x}_{t}-\bm{x}^{\ast}\|^{2}
\leq\sum_{t=1}^{T}\|\nabla f_{t}(\bm{x}_{t})\|_{\bm{A}_{t}^{-1}}^{2}+D^{2}\varepsilon+\frac{GD}{\sqrt{2}}\sum_{i=1}^{k}\frac{1}{\sqrt{i}}
\leq\sum_{t\in S_{1}}\|\nabla f_{t}(\bm{x}_{t})\|_{\bm{A}_{t}^{-1}}^{2}+\sum_{t\in S_{2}}\|\nabla f_{t}(\bm{x}_{t})\|_{\bm{A}_{t}^{-1}}^{2}+\sum_{t\in U}\|\nabla f_{t}(\bm{x}_{t})\|_{\bm{A}_{t}^{-1}}^{2}+\sqrt{2}GD(\sqrt{k}+1).

From Lemma 6.1, we obtain

2R_{T}\leq\frac{G^{2}}{\lambda}\log\left(1+\frac{\lambda D}{\sqrt{2}G}|S_{1}|\right)+\frac{d}{\gamma}\log\left(1+\frac{\lambda D}{\sqrt{2}G}|S_{1}|+\frac{\gamma GD}{\sqrt{2}}|S_{2}|+\sqrt{k}\right)+2\sqrt{2}GD\sqrt{k+1}
=O\left(\left(\frac{G^{2}}{\lambda}+\frac{d}{\gamma}\right)\log T+GD\sqrt{k}\right). ∎

The key point of Theorem 6.2 is that the second term of the regret upper bound is proportional to $\sqrt{k}$. Compared with Corollary 5.5, we can see that the additional information improves the regret upper bound.

Remark 6.3.

Algorithm 1 is written in a general form, and it is better to set $S_{2}=\emptyset$ in the contaminated strongly convex case, because Algorithm 1 requires $O(d^{3})$ computation to calculate $\bm{A}_{t}^{-1}$ when $S_{2}$ is nonempty. When $S_{1}=\emptyset$ or $S_{2}=\emptyset$, the regret bound in Theorem 6.2 reduces to $O((d/\gamma)\log T+GD\sqrt{k})$ or $O((G^{2}/\lambda)\log T+GD\sqrt{k})$, respectively.

Remark 6.4.

As mentioned in Remark 5.9, the algorithm analyzed in this section needs information that is not always available in the real world. Therefore, the improved regret bound in Theorem 6.2 is mainly of theoretical interest, and the regret bounds for the universal algorithms explained in Section 5 are more important in real applications. However, the algorithm in this section has the notable advantage that its regret upper bound matches the lower bound.

7 Conclusion

In this paper, we proposed a problem class for OCO, namely contaminated OCO, in which the properties of the objective functions may change from step to step. In this regime, we derived regret upper bounds for existing and proposed algorithms, as well as lower bounds. While we obtained optimal upper bounds given additional information on the function class of each revealed objective function, gaps of $O(\sqrt{d\log T})$ or $O(\sqrt{\log T})$ remain between the upper and lower bounds without additional information. One natural direction for future research is to close these gaps. We believe there is room for improvement in the upper bounds, whereas the lower bounds seem tight; indeed, the lower bounds in this study interpolate well between tight bounds for general OCO and for (restricted) $\mathcal{F}$-OCO.

Acknowledgments

Portions of this research were conducted during visits by the first author, Kamijima, to NEC Corporation.

References

  • Abernethy et al. (2008) J. D. Abernethy, P. L. Bartlett, A. Rakhlin, and A. Tewari. Optimal strategies and minimax lower bounds for online convex games. In Conference on Learning Theory, pages 414–424, 2008.
  • Arora et al. (2012) S. Arora, E. Hazan, and S. Kale. The multiplicative weights update method: a meta-algorithm and applications. Theory of Computing, 8(1):121–164, 2012.
  • Awerbuch and Kleinberg (2004) B. Awerbuch and R. D. Kleinberg. Adaptive routing with end-to-end feedback: Distributed learning and geometric approaches. In Proceedings of the thirty-sixth annual ACM symposium on Theory of Computing, pages 45–53, 2004.
  • Bubeck and Slivkins (2012) S. Bubeck and A. Slivkins. The best of both worlds: Stochastic and adversarial bandits. In Conference on Learning Theory, pages 42.1–42.23, 2012.
  • Cover (1991) T. M. Cover. Universal portfolios. Mathematical Finance, 1(1):1–29, 1991.
  • Duchi et al. (2011) J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(7), 2011.
  • Gaillard et al. (2014) P. Gaillard, G. Stoltz, and T. Van Erven. A second-order bound with excess losses. In Conference on Learning Theory, pages 176–196. PMLR, 2014.
  • Gupta et al. (2019) A. Gupta, T. Koren, and K. Talwar. Better algorithms for stochastic bandits with adversarial corruptions. In Conference on Learning Theory, pages 1562–1578. PMLR, 2019.
  • Hazan (2016) E. Hazan. Introduction to online convex optimization. Found. Trends Optim., 2:157–325, 2016.
  • Hazan and Kale (2012) E. Hazan and S. Kale. Projection-free online learning. In 29th International Conference on Machine Learning, ICML 2012, pages 521–528, 2012.
  • Hazan et al. (2006) E. Hazan, A. Agarwal, and S. Kale. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69:169–192, 2006.
  • Hazan et al. (2007) E. Hazan, A. Rakhlin, and P. Bartlett. Adaptive online gradient descent. Advances in Neural Information Processing Systems, 20, 2007.
  • Ito (2021) S. Ito. On optimal robustness to adversarial corruption in online decision problems. Advances in Neural Information Processing Systems, 34:7409–7420, 2021.
  • Littlestone and Warmuth (1994) N. Littlestone and M. K. Warmuth. The weighted majority algorithm. Information and Computation, 108(2):212–261, 1994.
  • Lykouris et al. (2018) T. Lykouris, V. Mirrokni, and R. Paes Leme. Stochastic bandits robust to adversarial corruptions. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, pages 114–122, 2018.
  • McMahan and Streeter (2010) H. B. McMahan and M. Streeter. Adaptive bound optimization for online convex optimization. Conference on Learning Theory, page 244, 2010.
  • Orabona (2019) F. Orabona. A modern introduction to online learning. arXiv preprint arXiv:1912.13213, 2019.
  • Ordentlich and Cover (1998) E. Ordentlich and T. M. Cover. The cost of achieving the best portfolio in hindsight. Math. Oper. Res., 23:960–982, 1998.
  • Van Erven and Koolen (2016) T. Van Erven and W. M. Koolen. Metagrad: Multiple learning rates in online learning. Advances in Neural Information Processing Systems, 29, 2016.
  • van Erven et al. (2021) T. van Erven, S. Sachs, W. M. Koolen, and W. Kotlowski. Robust online convex optimization in the presence of outliers. In Conference on Learning Theory, pages 4174–4194. PMLR, 2021.
  • Wang et al. (2020) G. Wang, S. Lu, and L. Zhang. Adaptivity and optimality: A universal algorithm for online convex optimization. In Uncertainty in Artificial Intelligence, pages 659–668. PMLR, 2020.
  • Wei and Luo (2018) C.-Y. Wei and H. Luo. More adaptive algorithms for adversarial bandits. In Conference on Learning Theory, pages 1263–1291. PMLR, 2018.
  • Yan et al. (2024) Y.-H. Yan, P. Zhao, and Z.-H. Zhou. Universal online learning with gradient variations: A multi-layer online ensemble approach. Advances in Neural Information Processing Systems, 36, 2024.
  • Zhang et al. (2022) L. Zhang, G. Wang, J. Yi, and T. Yang. A simple yet universal strategy for online convex optimization. In International Conference on Machine Learning, pages 26605–26623. PMLR, 2022.
  • Zinkevich (2003) M. A. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In International Conference on Machine Learning, 2003.

Appendix A Existing Algorithms and Known Regret Bounds

This section introduces existing algorithms for OCO and their regret bounds.

A.1 OGD Algorithm

In OCO, one of the most fundamental algorithms is online gradient descent (OGD), shown in Algorithm 2. In each step, the action $\bm{x}_{t}$ is updated using the gradient at that point and projected onto the feasible region $\mathcal{X}$. If all the objective functions are convex and the learning rates are set to $\Theta(1/\sqrt{t})$, the regret is bounded by $O(\sqrt{T})$ (Zinkevich 2003); if all the objective functions are $\lambda$-strongly convex and the learning rates are set to $\Theta(1/t)$, the regret is bounded by $O((1/\lambda)\log T)$ (Hazan et al. 2006).
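As a concrete illustration, the following is a minimal one-dimensional sketch of OGD with the $\Theta(1/\sqrt{t})$ step size for convex losses (the function name `ogd_1d`, the step size $\eta_t=D/(G\sqrt{t})$, and the interval domain are our illustrative choices; on an interval, Euclidean projection is clipping):

```python
import math

def ogd_1d(grads, lo, hi, x1, D, G):
    """Online gradient descent on an interval with eta_t = D / (G sqrt(t))."""
    x, xs = x1, []
    for t, g_t in enumerate(grads, start=1):
        xs.append(x)
        eta = D / (G * math.sqrt(t))     # Theta(1/sqrt(t)) step size for convex f_t
        y = x - eta * g_t(x)             # gradient step
        x = min(max(y, lo), hi)          # Euclidean projection onto [lo, hi]
    return xs
```

Running it against a fixed convex loss such as $f_t(x)=(x-0.5)^2$ on $[0,1]$ drives the iterates toward the offline optimum.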

A.2 ONS Algorithm

If all the objective functions are $\alpha$-exp-concave, ONS, shown in Algorithm 3, works well. This algorithm was proposed by Hazan et al. (2006) as an online version of the (offline) Newton method. It requires parameters $\gamma,\varepsilon>0$; if $\gamma=(1/2)\min\{1/(GD),\alpha\}$ and $\varepsilon=1/(\gamma^{2}D^{2})$, the regret is bounded by $O((d/\alpha)\log T)$.

Algorithm 2 Online Gradient Descent (Zinkevich 2003)
0:  convex set $\mathcal{X}\subset\mathbb{R}^{d}$, $T$, $\bm{x}_{1}\in\mathcal{X}$, parameters $\eta_{t}$
1:  for $t=1$ to $T$ do
2:     Play $\bm{x}_{t}$ and observe cost $f_{t}(\bm{x}_{t})$.
3:     Gradient step and projection:
\bm{y}_{t+1}=\bm{x}_{t}-\eta_{t}\nabla f_{t}(\bm{x}_{t}),
\bm{x}_{t+1}=\Pi_{\mathcal{X}}(\bm{y}_{t+1})\coloneqq\operatorname*{argmin}_{\bm{x}\in\mathcal{X}}\{\|\bm{y}_{t+1}-\bm{x}\|^{2}\}.
4:  end for
Algorithm 3 Online Newton step (Hazan et al. 2006)
0:  convex set $\mathcal{X}\subset\mathbb{R}^{d}$, $T$, $\bm{x}_{1}\in\mathcal{X}$, parameters $\gamma,\varepsilon>0$, $\bm{A}_{0}=\varepsilon\bm{I}_{d}$
1:  for $t=1$ to $T$ do
2:     Play $\bm{x}_{t}$ and observe cost $f_{t}(\bm{x}_{t})$.
3:     Rank-1 update: $\bm{A}_{t}=\bm{A}_{t-1}+\nabla f_{t}(\bm{x}_{t})(\nabla f_{t}(\bm{x}_{t}))^{\top}$.
4:     Newton step and generalized projection:
\bm{y}_{t+1}=\bm{x}_{t}-\gamma^{-1}\bm{A}_{t}^{-1}\nabla f_{t}(\bm{x}_{t}),
\bm{x}_{t+1}=\Pi_{\mathcal{X}}^{\bm{A}_{t}}(\bm{y}_{t+1}).
5:  end for
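For comparison, here is a one-dimensional sketch of ONS (our illustrative `ons_1d`; in one dimension $\bm{A}_{t}$ is a scalar, so no matrix inversion is needed), using the parameter choices $\gamma=(1/2)\min\{1/(GD),\alpha\}$ and $\varepsilon=1/(\gamma^{2}D^{2})$ stated above:

```python
import math

def ons_1d(grads, lo, hi, x1, D, G, alpha):
    """Online Newton step on an interval; A_t is a scalar running sum of g_t^2."""
    gamma = 0.5 * min(1.0 / (G * D), alpha)
    eps = 1.0 / (gamma**2 * D**2)        # eps = 1 / (gamma^2 D^2)
    A, x, xs = eps, x1, []
    for g_t in grads:
        xs.append(x)
        g = g_t(x)
        A += g * g                       # rank-1 update (scalar in 1-D)
        y = x - g / (gamma * A)          # Newton step with gamma^{-1} A_t^{-1}
        x = min(max(y, lo), hi)          # generalized projection = clipping in 1-D
    return xs
```

On an exp-concave loss such as $f_t(x)=(x-0.5)^2$ over $[0,1]$ (which is $\alpha$-exp-concave for suitable $\alpha$ on a bounded domain), the iterates converge quickly to the minimizer.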

A.3 Universal Algorithms

In real-world applications, it may be unknown which function class the objective functions belong to. To cope with such cases, many universal algorithms have been developed. Most universal algorithms consist of two types of algorithms: a meta-algorithm and expert algorithms. Each expert algorithm is an online learning algorithm and is not necessarily universal itself. In each time step, the expert algorithms update their outputs $\bm{x}_{t}^{i}$, and the meta-algorithm aggregates these outputs, e.g., by a convex combination. In the following, we explain three universal algorithms: the multiple eta gradient algorithm (MetaGrad) (Van Erven and Koolen 2016), multiple sub-algorithms and learning rates (Maler) (Wang et al. 2020), and the universal strategy for online convex optimization (USC) (Zhang et al. 2022).

First, MetaGrad is an algorithm with multiple experts, each with a different parameter $\eta$, as shown in Algorithms 4 and 5. In contrast to nonuniversal algorithms, which require parameters to be set beforehand depending on the properties of the objective functions, MetaGrad maintains multiple values of $\eta$ so that users do not need prior knowledge. MetaGrad achieves $O(\sqrt{T\log\log T})$, $O((d/\lambda)\log T)$, and $O((d/\alpha)\log T)$ regret bounds for convex, $\lambda$-strongly convex, and $\alpha$-exp-concave objective functions, respectively.

Algorithm 4 MetaGrad Master (Van Erven and Koolen 2016)
0:  TT, GG, DD, C=1+1/(1+(1/2)log2T)C=1+1/(1+\lceil(1/2)\log_{2}T\rceil)
1:  Set ηi=2i/(5GD),π1ηi=C/((i+1)(i+2))\eta_{i}=2^{-i}/(5GD),~{}\pi_{1}^{\eta_{i}}=C/((i+1)(i+2)) for i=0,1,,(1/2)log2Ti=0,1,\ldots,\lceil(1/2)\log_{2}T\rceil.
2:  for t=1t=1 to TT do
3:     Get prediction 𝒙tηi\bm{x}_{t}^{\eta_{i}} of slave for each ii.
4:     Play 𝒙t\bm{x}_{t}:
𝒙t=iπtηiηi𝒙tηiiπtηiηi.\bm{x}_{t}=\frac{\sum_{i}\pi_{t}^{\eta_{i}}\eta_{i}\bm{x}_{t}^{\eta_{i}}}{\sum_{i}\pi_{t}^{\eta_{i}}\eta_{i}}.
5:     Update for each ii:
tηi(𝒙tηi)=ηi𝒙t𝒙tηi,ft(𝒙t)+ηi2(𝒙t𝒙tηi,ft(𝒙t))2,\ell_{t}^{\eta_{i}}(\bm{x}_{t}^{\eta_{i}})=-{\eta_{i}}\langle\bm{x}_{t}-\bm{x}_{t}^{\eta_{i}},\nabla f_{t}(\bm{x}_{t})\rangle+\eta_{i}^{2}(\langle\bm{x}_{t}-\bm{x}_{t}^{\eta_{i}},\nabla f_{t}(\bm{x}_{t})\rangle)^{2},
\pi_{t+1}^{\eta_{i}}=\frac{\pi_{t}^{\eta_{i}}\mathrm{e}^{-\ell_{t}^{\eta_{i}}(\bm{x}_{t}^{\eta_{i}})}}{\sum_{j}\pi_{t}^{\eta_{j}}\mathrm{e}^{-\ell_{t}^{\eta_{j}}(\bm{x}_{t}^{\eta_{j}})}}.
6:  end for
Algorithm 5 MetaGrad Slave (Van Erven and Koolen 2016)
0:  convex set 𝒳d\mathcal{X}\subset\mathbb{R}^{d}, TT, η\eta, DD
1:  Set 𝒙1η=𝟎\bm{x}_{1}^{\eta}=\bm{0}, 𝚺1η=D2𝑰d\bm{\Sigma}_{1}^{\eta}=D^{2}\bm{I}_{d}
2:  for t=1t=1 to TT do
3:     Issue 𝒙tη\bm{x}_{t}^{\eta} to master.
4:     Update:
𝚺t+1η=(1D2𝑰d+2η2s=1tft(𝒙t)(ft(𝒙t)))1,\bm{\Sigma}_{t+1}^{\eta}=\left(\frac{1}{D^{2}}\bm{I}_{d}+2\eta^{2}\sum_{s=1}^{t}\nabla f_{t}(\bm{x}_{t})(\nabla f_{t}(\bm{x}_{t}))^{\top}\right)^{-1},
𝒙~t+1η=𝒙tηη𝚺t+1η(1+2η2ft(𝒙t),𝒙tη𝒙t)ft(𝒙t),\tilde{\bm{x}}_{t+1}^{\eta}=\bm{x}_{t}^{\eta}-\eta\bm{\Sigma}_{t+1}^{\eta}(1+2\eta^{2}\langle\nabla f_{t}(\bm{x}_{t}),\bm{x}_{t}^{\eta}-\bm{x}_{t}\rangle)\nabla f_{t}(\bm{x}_{t}),
𝒙t+1η=Π𝒳(𝚺t+1η)1(𝒙~t+1η).\bm{x}_{t+1}^{\eta}=\Pi_{\mathcal{X}}^{(\bm{\Sigma}_{t+1}^{\eta})^{-1}}(\tilde{\bm{x}}_{t+1}^{\eta}).
5:  end for

Second, Maler is an algorithm with three types of expert algorithms (a convex expert algorithm, strongly convex expert algorithms, and exp-concave expert algorithms), as shown in Algorithms 6 to 9. These correspond to OGD with $\Theta(1/\sqrt{t})$ step size, OGD with $\Theta(1/t)$ step size, and ONS, respectively. As in MetaGrad, the strongly convex and exp-concave experts are instantiated with multiple parameters $\eta$. Maler achieves $O(\sqrt{T})$, $O((1/\lambda)\log T)$, and $O((d/\alpha)\log T)$ regret bounds for convex, $\lambda$-strongly convex, and $\alpha$-exp-concave objective functions, respectively.

Algorithm 6 Maler Meta (Wang et al. 2020)
0:  TT, GG, DD, C=1+1/(1+(1/2)log2T)C=1+1/(1+\lceil(1/2)\log_{2}T\rceil)
1:  Set ηc=1/(2GDT),ηi=2i/(5GD)\eta^{c}=1/(2GD\sqrt{T}),~{}\eta_{i}=2^{-i}/(5GD) for i=0,1,,(1/2)log2Ti=0,1,\ldots,\lceil(1/2)\log_{2}T\rceil.
2:  Set πc=1/3,π1ηi,=π1ηi,s=C/(3(i+1)(i+2))\pi^{c}=1/3,~{}\pi_{1}^{\eta_{i},\ell}=\pi_{1}^{\eta_{i},s}=C/(3(i+1)(i+2)) for i=0,1,,(1/2)log2Ti=0,1,\ldots,\lceil(1/2)\log_{2}T\rceil.
3:  for t=1t=1 to TT do
4:     Get predictions 𝒙tc\bm{x}^{c}_{t} from Algorithm 7 and 𝒙tηi,,𝒙tηi,s\bm{x}_{t}^{\eta_{i},\ell},\bm{x}_{t}^{\eta_{i},s} from Algorithms 8 and 9 for all ii.
5:     Play
𝒙t=πtcηc𝒙tc+i(πtηi,sηi𝒙tηi,s+πtηi,ηi𝒙tηi,)πtcηc+i(πtηi,sηi+πtηi,ηi).\bm{x}_{t}=\frac{\pi_{t}^{c}\eta^{c}\bm{x}_{t}^{c}+\sum_{i}(\pi_{t}^{\eta_{i},s}\eta_{i}\bm{x}_{t}^{\eta_{i},s}+\pi_{t}^{\eta_{i},\ell}\eta_{i}\bm{x}_{t}^{\eta_{i},\ell})}{\pi_{t}^{c}\eta^{c}+\sum_{i}(\pi_{t}^{\eta_{i},s}\eta_{i}+\pi_{t}^{\eta_{i},\ell}\eta_{i})}.
6:     Observe gradient ft(𝒙t)\nabla f_{t}(\bm{x}_{t}) and send it to all experts.
7:     Update weights:
πt+1c=πtcect(𝒙tc)Φt,\pi_{t+1}^{c}=\frac{\pi_{t}^{c}\mathrm{e}^{-c_{t}(\bm{x}_{t}^{c})}}{\Phi_{t}},
πt+1ηi,s=πtηi,sestηi(𝒙tηi,s)Φtfor eachi,\pi_{t+1}^{\eta_{i},s}=\frac{\pi_{t}^{\eta_{i},s}\mathrm{e}^{-s_{t}^{\eta_{i}}(\bm{x}_{t}^{\eta_{i},s})}}{\Phi_{t}}~{}\text{for~{}each}~{}i,
πt+1ηi,=πtηi,etηi(𝒙tηi,)Φtfor eachi,\pi_{t+1}^{\eta_{i},\ell}=\frac{\pi_{t}^{\eta_{i},\ell}\mathrm{e}^{-\ell_{t}^{\eta_{i}}(\bm{x}_{t}^{\eta_{i},\ell})}}{\Phi_{t}}~{}\text{for~{}each}~{}i,
where
Φt=i(π1ηi,seτ=1tsτηi(𝒙τηi,s)+π1ηi,eτ=1tτηi(𝒙τηi,))+π1ceτ=1tcτ(𝒙τc),\Phi_{t}=\sum_{i}(\pi_{1}^{\eta_{i},s}\mathrm{e}^{-\sum_{\tau=1}^{t}s_{\tau}^{\eta_{i}}(\bm{x}_{\tau}^{\eta_{i},s})}+\pi_{1}^{\eta_{i},\ell}\mathrm{e}^{-\sum_{\tau=1}^{t}\ell_{\tau}^{\eta_{i}}(\bm{x}_{\tau}^{\eta_{i},\ell})})+\pi_{1}^{c}\mathrm{e}^{-\sum_{\tau=1}^{t}c_{\tau}(\bm{x}_{\tau}^{c})},
ct(𝒙)=ηcft(𝒙t),𝒙t𝒙+(ηcGD)2,c_{t}(\bm{x})=-\eta^{c}\left\langle{\nabla f_{t}(\bm{x}_{t})},{\bm{x}_{t}-\bm{x}}\right\rangle+(\eta^{c}GD)^{2},
stη(𝒙)=ηft(𝒙t),𝒙t𝒙+η2G2𝒙t𝒙2,s_{t}^{\eta}(\bm{x})=-\eta\left\langle{\nabla f_{t}(\bm{x}_{t})},{\bm{x}_{t}-\bm{x}}\right\rangle+\eta^{2}G^{2}\|\bm{x}_{t}-\bm{x}\|^{2},
tη(𝒙)=ηft(𝒙t),𝒙t𝒙+η2(ft(𝒙t),𝒙t𝒙)2.\ell_{t}^{\eta}(\bm{x})=-\eta\left\langle{\nabla f_{t}(\bm{x}_{t})},{\bm{x}_{t}-\bm{x}}\right\rangle+\eta^{2}(\left\langle{\nabla f_{t}(\bm{x}_{t})},{\bm{x}_{t}-\bm{x}}\right\rangle)^{2}.
8:  end for
Algorithm 7 Maler Convex Expert (Wang et al. 2020)
0:  convex set 𝒳d\mathcal{X}\subset\mathbb{R}^{d}, TT, GG, DD, ηc\eta^{c}
1:  Set 𝒙tc=𝟎\bm{x}_{t}^{c}=\bm{0}.
2:  for t=1t=1 to TT do
3:     Send 𝒙tc\bm{x}_{t}^{c} to Algorithm 6.
4:     Receive gradient ft(𝒙t)\nabla f_{t}(\bm{x}_{t}) from Algorithm 6.
5:     Update:
𝒙t+1c=Π𝒳(𝒙tcDηcGtct(𝒙tc)).\bm{x}_{t+1}^{c}=\Pi_{\mathcal{X}}\left(\bm{x}_{t}^{c}-\frac{D}{\eta^{c}G\sqrt{t}}\nabla c_{t}(\bm{x}_{t}^{c})\right).
6:  end for
Algorithm 8 Maler Exp-concave Expert (Wang et al. 2020)
0:  convex set 𝒳d\mathcal{X}\subset\mathbb{R}^{d}, TT, DD, η\eta
1:  Set 𝒙tη,=𝟎,β=1/2,𝚺1=(1/(β2D2))𝑰d\bm{x}_{t}^{\eta,\ell}=\bm{0},~{}\beta=1/2,~{}\bm{\Sigma}_{1}=(1/(\beta^{2}D^{2}))\bm{I}_{d}.
2:  for t=1t=1 to TT do
3:     Send 𝒙tη,\bm{x}_{t}^{\eta,\ell} to Algorithm 6.
4:     Receive gradient ft(𝒙t)\nabla f_{t}(\bm{x}_{t}) from Algorithm 6.
5:     Update:
𝚺t+1=𝚺t+tη(𝒙tη,)(tη(𝒙tη,)),\bm{\Sigma}_{t+1}=\bm{\Sigma}_{t}+\nabla\ell_{t}^{\eta}(\bm{x}_{t}^{\eta,\ell})(\nabla\ell_{t}^{\eta}(\bm{x}_{t}^{\eta,\ell}))^{\top},
𝒙t+1η,=Π𝒳𝚺t+1(𝒙tη,1β𝚺t+11tη(𝒙tη,)).\bm{x}_{t+1}^{\eta,\ell}=\Pi_{\mathcal{X}}^{\bm{\Sigma}_{t+1}}\left(\bm{x}_{t}^{\eta,\ell}-\frac{1}{\beta}\bm{\Sigma}_{t+1}^{-1}\nabla\ell_{t}^{\eta}(\bm{x}_{t}^{\eta,\ell})\right).
6:  end for
Algorithm 9 Maler Strongly Convex Expert (Wang et al. 2020)
0:  convex set 𝒳d\mathcal{X}\subset\mathbb{R}^{d}, TT, GG, η\eta
1:  Set 𝒙tη,s=𝟎\bm{x}_{t}^{\eta,s}=\bm{0}.
2:  for t=1t=1 to TT do
3:     Send 𝒙tη,s\bm{x}_{t}^{\eta,s} to Algorithm 6.
4:     Receive gradient ft(𝒙t)\nabla f_{t}(\bm{x}_{t}) from Algorithm 6.
5:     Update:
𝒙t+1η,s=Π𝒳(𝒙tη,s12η2G2tstη(𝒙tη,s)).\bm{x}_{t+1}^{\eta,s}=\Pi_{\mathcal{X}}\left(\bm{x}_{t}^{\eta,s}-\frac{1}{2\eta^{2}G^{2}t}\nabla s_{t}^{\eta}(\bm{x}_{t}^{\eta,s})\right).
6:  end for

Finally, USC is an algorithm with many expert algorithms, as shown in Algorithm 10. In contrast to Maler, which contains only OGD and ONS as expert algorithms, USC can contain many more. To aggregate these experts, USC uses Adapt-ML-Prod (Gaillard et al. 2014) as a meta-algorithm, which yields a universal regret bound. Concerning the regret bound of USC, the following theorem holds.

Theorem A.1.

(Zhang et al. 2022) Let $\mathcal{E}$ be a set of expert algorithms and $\bm{x}_{t}^{i}$ be the output of the $i$th algorithm in time step $t$. Then,

\sum_{t=1}^{T}(f_{t}(\bm{x}_{t})-f_{t}(\bm{x}_{t}^{i}))\leq\sum_{t=1}^{T}\left\langle{\nabla f_{t}(\bm{x}_{t})},{\bm{x}_{t}-\bm{x}_{t}^{i}}\right\rangle\leq 4\Gamma GD+\frac{\Gamma}{\sqrt{\log|\mathcal{E}|}}\sqrt{4G^{2}D^{2}+\sum_{t=1}^{T}(\left\langle{\nabla f_{t}(\bm{x}_{t})},{\bm{x}_{t}-\bm{x}_{t}^{i}}\right\rangle)^{2}},

where $\Gamma=O(\log\log T)$.

In USC, expert algorithms are chosen so that ||=O(logT)|\mathcal{E}|=O(\log T) holds. This theorem holds without assuming exp-concavity or strong convexity. In addition, it is known that USC achieves O(LTloglogT)O(\sqrt{L_{T}\log\log T}), O((1/λ)(min{logLT,logVT}+loglogT))O((1/\lambda)\cdot(\min\{\log L_{T},\log V_{T}\}+\log\log T)), and O((1/α)(dmin{logLT,logVT}+loglogT))O((1/\alpha)(d\min\{\log L_{T},\log V_{T}\}+\log\log T)) regret bounds for convex, λ\lambda-strongly convex, and α\alpha-exp-concave objective functions, respectively, where LT:=min𝒙𝒳t=1Tft(𝒙)=O(T)L_{T}:=\min_{\bm{x}\in\mathcal{X}}\sum_{t=1}^{T}f_{t}(\bm{x})=O(T) and VT:=t=1Tmax𝒙𝒳ft(𝒙)ft1(𝒙)22=O(T)V_{T}:=\sum_{t=1}^{T}\max_{\bm{x}\in\mathcal{X}}\|\nabla f_{t}(\bm{x})-\nabla f_{t-1}(\bm{x})\|_{2}^{2}=O(T).

Algorithm 10 Universal Strategy for Online Convex Optimization (USC) (Zhang et al. 2022)
0:  𝒜str\mathcal{A}_{str}, 𝒜exp\mathcal{A}_{exp}, and 𝒜con\mathcal{A}_{con}, which are sets of algorithms designed for strongly convex functions, exp-concave functions and general convex functions respectively; 𝒫str\mathcal{P}_{str} and 𝒫exp\mathcal{P}_{exp}, which are sets of parameters of strong convexity and exp-concavity respectively.
1:  Initialize =\mathcal{E}=\emptyset.
2:  for each algorithm A𝒜strA\in\mathcal{A}_{str} do
3:     for each λ𝒫str\lambda\in\mathcal{P}_{str} do
4:        Create an expert E(A,λ)E(A,\lambda).
5:        Update =E(A,λ)\mathcal{E}=\mathcal{E}\cup E(A,\lambda).
6:     end for
7:  end for
8:  for each algorithm A𝒜expA\in\mathcal{A}_{exp} do
9:     for each α𝒫exp\alpha\in\mathcal{P}_{exp} do
10:        Create an expert E(A,α)E(A,\alpha).
11:        Update =E(A,α)\mathcal{E}=\mathcal{E}\cup E(A,\alpha).
12:     end for
13:  end for
14:  for each algorithm A𝒜conA\in\mathcal{A}_{con} do
15:     Create an expert E(A)E(A).
16:     Update =E(A)\mathcal{E}=\mathcal{E}\cup E(A).
17:  end for
18:  for t=1t=1 to TT do
19:     Calculate the weight ptip_{t}^{i} of each expert EiE^{i} by
pti=ηt1iwt1ij=1||ηt1jwt1j.p_{t}^{i}=\frac{\eta_{t-1}^{i}w_{t-1}^{i}}{\sum_{j=1}^{|\mathcal{E}|}\eta_{t-1}^{j}w_{t-1}^{j}}.
20:     Receive 𝒙ti\bm{x}_{t}^{i} from each expert EiE^{i}\in\mathcal{E}.
21:     Output the weighted average 𝒙t=i=1||pti𝒙ti\bm{x}_{t}=\sum_{i=1}^{|\mathcal{E}|}p_{t}^{i}\bm{x}_{t}^{i}.
22:     Observe the loss function ft()f_{t}(\cdot).
23:     Send the function ft()f_{t}(\cdot) to each expert EiE^{i}\in\mathcal{E}.
24:  end for
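The meta-step of Algorithm 10 (lines 19–21) can be sketched as follows. This is a minimal illustration of the weighted-average aggregation only, not USC itself: the learning rates `etas` and potential weights `ws` below are hypothetical placeholder values, whereas Adapt-ML-Prod maintains them from the experts' observed losses.

```python
import numpy as np

# Minimal sketch of USC's meta-step: combine expert outputs by a
# weighted average with weights proportional to eta^i * w^i, as in
# Adapt-ML-Prod.  `etas` and `ws` are hypothetical placeholders here.

def meta_step(expert_points, etas, ws):
    """Return the normalized weights p_t and the aggregated decision x_t."""
    p = etas * ws
    p = p / p.sum()               # p_t^i = eta^i w^i / sum_j eta^j w^j
    x_t = p @ expert_points       # x_t = sum_i p_t^i x_t^i
    return p, x_t

expert_points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
etas = np.array([0.1, 0.2, 0.1])
ws = np.array([1.0, 1.0, 2.0])
p, x_t = meta_step(expert_points, etas, ws)
```

The aggregation itself is independent of how the weights are updated, which is why the meta-regret analysis of Theorem A.1 only needs the inner-product linearization of the losses.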

Appendix B MISSING PROOFS

In this section, we provide the proofs omitted from the main text.

B.1 Proof of the Exp-Concavity of the Function in Example 3.8

In this subsection, we present the proof of the exp-concavity of the function ftf_{t} in Example 3.8 in the case at,i=at,bt,i=bta_{t,i}=a_{t},~{}b_{t,i}=b_{t}. Before the proof, we introduce the following lemma.

Lemma B.1.

(Hazan 2016) A twice-differentiable function f:df\colon\mathbb{R}^{d}\to\mathbb{R} is α\alpha-exp-concave at 𝐱\bm{x} if and only if

2f(𝒙)αf(𝒙)f(𝒙).\nabla^{2}f(\bm{x})\succeq\alpha\nabla f(\bm{x})\nabla f(\bm{x})^{\top}.

Using this lemma, we can check the exp-concavity of the function ftf_{t}.

Proof.

By differentiating ftf_{t}, we have

ft(𝒙)\displaystyle\nabla f_{t}(\bm{x}) =bt𝒂t1+exp(bt𝒂t,𝒙),2ft(𝒙)=bt2𝒂t𝒂texp(bt𝒂t,𝒙)(1+exp(bt𝒂t,𝒙))2.\displaystyle=-\frac{b_{t}\bm{a}_{t}}{1+\exp(b_{t}\left\langle{\bm{a}_{t}},{\bm{x}}\right\rangle)},~{}\nabla^{2}f_{t}(\bm{x})=\frac{b_{t}^{2}\bm{a}_{t}\bm{a}_{t}^{\top}\exp(b_{t}\left\langle{\bm{a}_{t}},{\bm{x}}\right\rangle)}{(1+\exp(b_{t}\left\langle{\bm{a}_{t}},{\bm{x}}\right\rangle))^{2}}.

For all 𝒗d\bm{v}\in\mathbb{R}^{d},

𝒗(2ft(𝒙)αft(𝒙)ft(𝒙))𝒗=bt2𝒂t,𝒗2exp(bt𝒂t,𝒙)α(1+exp(bt𝒂t,𝒙))2\bm{v}^{\top}(\nabla^{2}f_{t}(\bm{x})-\alpha\nabla f_{t}(\bm{x})\nabla f_{t}(\bm{x})^{\top})\bm{v}=b_{t}^{2}\left\langle{\bm{a}_{t}},{\bm{v}}\right\rangle^{2}\frac{\exp(b_{t}\left\langle{\bm{a}_{t}},{\bm{x}}\right\rangle)-\alpha}{(1+\exp(b_{t}\left\langle{\bm{a}_{t}},{\bm{x}}\right\rangle))^{2}}

holds, and combined with Lemma B.1, ftf_{t} is exp(𝒂t)\exp(-\|\bm{a}_{t}\|)-exp-concave. ∎
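The computation above can be checked numerically. The following sketch verifies that the matrix in Lemma B.1, with α = exp(−‖a‖), is positive semidefinite at random points; restricting x to the unit ball is an assumption made here so that the exponent b⟨a, x⟩ stays above −‖a‖.

```python
import numpy as np

# Numerical check of the exp-concavity computation: for the logistic
# loss f(x) = log(1 + exp(-b <a, x>)), whose gradient and Hessian are
# the expressions derived above, H - alpha * g g^T should be PSD with
# alpha = exp(-||a||).  Restricting x to the unit ball is an assumption.

rng = np.random.default_rng(0)
d = 3
a = rng.normal(size=d)
alpha = np.exp(-np.linalg.norm(a))

def grad_hess(x, b):
    z = b * (a @ x)
    g = -b * a / (1.0 + np.exp(z))
    H = (b ** 2) * np.outer(a, a) * np.exp(z) / (1.0 + np.exp(z)) ** 2
    return g, H

ok = True
for _ in range(100):
    x = rng.normal(size=d)
    x = x / max(1.0, np.linalg.norm(x))   # keep x in the unit ball
    for b in (-1.0, 1.0):
        g, H = grad_hess(x, b)
        lam_min = np.linalg.eigvalsh(H - alpha * np.outer(g, g)).min()
        ok = ok and lam_min >= -1e-10
```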

B.2 Proof of Proposition 4.1

This subsection presents the proof of Proposition 4.1.

Proof.

Let 𝒙argmin𝒙𝒳t=1Tft(𝒙)\bm{x}^{\ast}\in\operatorname*{argmin}_{\bm{x}\in\mathcal{X}}\sum_{t=1}^{T}f_{t}(\bm{x}). We can bound regret as follows:

RT\displaystyle R_{T} =t=1T(ft(𝒙t)ft(𝒙))\displaystyle=\sum_{t=1}^{T}(f_{t}(\bm{x}_{t})-f_{t}(\bm{x}^{\ast}))
t=1T(ft(𝒙t),𝒙t𝒙γt2(ft(𝒙t),𝒙t𝒙)2)\displaystyle\leq\sum_{t=1}^{T}\left(\left\langle{\nabla f_{t}(\bm{x}_{t})},{\bm{x}_{t}-\bm{x}^{\ast}}\right\rangle-\frac{\gamma_{t}}{2}(\left\langle{\nabla f_{t}(\bm{x}_{t})},{\bm{x}_{t}-\bm{x}^{\ast}}\right\rangle)^{2}\right)
=t=1T(ft(𝒙t),𝒙t𝒙γ2(ft(𝒙t),𝒙t𝒙)2)+t=1Tγγt2(ft(𝒙t),𝒙t𝒙)2\displaystyle=\sum_{t=1}^{T}\left(\left\langle{\nabla f_{t}(\bm{x}_{t})},{\bm{x}_{t}-\bm{x}^{\ast}}\right\rangle-\frac{\gamma}{2}(\left\langle{\nabla f_{t}(\bm{x}_{t})},{\bm{x}_{t}-\bm{x}^{\ast}}\right\rangle)^{2}\right)+\sum_{t=1}^{T}\frac{\gamma-\gamma_{t}}{2}(\left\langle{\nabla f_{t}(\bm{x}_{t})},{\bm{x}_{t}-\bm{x}^{\ast}}\right\rangle)^{2}
t=1T(ft(𝒙t),𝒙t𝒙γ2(ft(𝒙t),𝒙t𝒙)2)+t:γt<γγγt2G2D2\displaystyle\leq\sum_{t=1}^{T}\left(\left\langle{\nabla f_{t}(\bm{x}_{t})},{\bm{x}_{t}-\bm{x}^{\ast}}\right\rangle-\frac{\gamma}{2}(\left\langle{\nabla f_{t}(\bm{x}_{t})},{\bm{x}_{t}-\bm{x}^{\ast}}\right\rangle)^{2}\right)+\sum_{t:\gamma_{t}<\gamma}\frac{\gamma-\gamma_{t}}{2}G^{2}D^{2}
2dγlogT+14kGD,\displaystyle\leq\frac{2d}{\gamma}\log T+\frac{1}{4}kGD,

where γt\gamma_{t} is defined in the same way as in Theorem 5.3. The first inequality follows from Lemma 3.4. In the last inequality, the first term is bounded by (2d/γ)logT(2d/\gamma)\log T by the analysis of ONS's regret bound in Hazan (2016), and the second term is bounded by (1/4)kGD(1/4)kGD since γ1/(2GD)\gamma\leq 1/(2GD) by the definition of γ\gamma. ∎

B.3 Proof of Theorem 4.2

This subsection presents the proof of Theorem 4.2.

Proof.

Let I1I_{1} and I2I_{2} be the instances used to prove the lower bounds RT=Ω(g1(T))R_{T}=\Omega(g_{1}(T)) for the function class \mathcal{F} and Rk=Ω(g2(k))R_{k}=\Omega(g_{2}(k)) for convex objective functions, respectively; let fi,t(i=1,2)f_{i,t}~{}(i=1,2) be the objective function of IiI_{i} at time step tt, and let 𝒳i\mathcal{X}_{i} be the set to which the decision variables of IiI_{i} belong. Take a set 𝒳\mathcal{X} such that there exist surjections ϕi:𝒳𝒳i\phi_{i}\colon\mathcal{X}\to\mathcal{X}_{i}. For this 𝒳\mathcal{X}, let I~1\tilde{I}_{1} be the instance whose objective function at time step tt is f1,tϕ1f_{1,t}\circ\phi_{1}, and let I~2\tilde{I}_{2} be the instance whose objective function at time step tt is f2,tϕ2f_{2,t}\circ\phi_{2} if tkt\leq k and, otherwise, some function in \mathcal{F} whose minimizer coincides with that of t=1kf2,t\sum_{t=1}^{k}f_{2,t}. Consider the case in which each of the instances I~1\tilde{I}_{1} and I~2\tilde{I}_{2} is realized with probability 1/21/2. In this case, the expected regret satisfies

𝔼[RT]\displaystyle\mathbb{E}[R_{T}] =12Ω(g1(T))+12Ω(g2(k))\displaystyle=\frac{1}{2}\Omega(g_{1}(T))+\frac{1}{2}\Omega(g_{2}(k))
=Ω(g1(T)+g2(k)),\displaystyle=\Omega(g_{1}(T)+g_{2}(k)),

for all OCO algorithms. Therefore, Theorem 4.2 follows. ∎

B.4 Proof of Proposition 4.4

This subsection presents the proof of Proposition 4.4.

Proof.

Consider the following instance:

ft(x)=vtx,x𝒳=[D/2,D/2],x1=D/2,f_{t}(x)=v_{t}x,~{}x\in\mathcal{X}=[-D/2,D/2],~{}x_{1}=-D/2,

where

vt={(1)tGt<t1,Gtt1,xt10,Gtt1,xt1<0,v_{t}=\begin{dcases}(-1)^{t}G&t<t_{1},\\ G&t\geq t_{1},~{}x_{t_{1}}\geq 0,\\ -G&t\geq t_{1},~{}x_{t_{1}}<0,\end{dcases}

and t1t_{1} is the minimum natural number satisfying t1(1+γG2D/2)1Tt_{1}\geq(1+\gamma G^{2}D/2)^{-1}T. Then,

minx𝒳t=1Tft(x)(T+t1)GD2(γG2D/21+γG2D/2T+1)GD2.\displaystyle\min_{x\in\mathcal{X}}\sum_{t=1}^{T}f_{t}(x)\leq(-T+t_{1})\frac{GD}{2}\leq\left(-\frac{\gamma G^{2}D/2}{1+\gamma G^{2}D/2}T+1\right)\frac{GD}{2}. (11)

The second inequality follows from t1(1+γG2D/2)1T+1t_{1}\leq(1+\gamma G^{2}D/2)^{-1}T+1. If xt10x_{t_{1}}\geq 0,

t=1Tft(xt)=Gt=1t11(1)txt+Gt=t1Txt.\displaystyle\sum_{t=1}^{T}f_{t}(x_{t})=G\sum_{t=1}^{t_{1}-1}(-1)^{t}x_{t}+G\sum_{t=t_{1}}^{T}x_{t}. (12)

For the first term, since

At=ε+G2tA_{t}=\varepsilon+G^{2}t

holds for all t[T]t\in[T], for t<t1t<t_{1} we have

yt+1=xtγ1(ε+G2t)1(1)tG.y_{t+1}=x_{t}-\gamma^{-1}(\varepsilon+G^{2}t)^{-1}(-1)^{t}G.

Now, xt+1x_{t+1} is defined as

xt+1={D2yt+1>D2,yt+1D2yt+1D2,D2yt+1<D2,x_{t+1}=\begin{dcases}\frac{D}{2}&y_{t+1}>\frac{D}{2},\\ y_{t+1}&-\frac{D}{2}\leq y_{t+1}\leq\frac{D}{2},\\ -\frac{D}{2}&y_{t+1}<-\frac{D}{2},\end{dcases}

and therefore, we get

xt={(1)tD2tt2,(1)t2D2s=t2t1(1)sGγ(ε+G2s)t>t2,x_{t}=\begin{dcases}(-1)^{t}\frac{D}{2}&t\leq t_{2},\\ (-1)^{t_{2}}\frac{D}{2}-\sum_{s=t_{2}}^{t-1}\frac{(-1)^{s}G}{\gamma(\varepsilon+G^{2}s)}&t>t_{2},\end{dcases}

where t2t_{2} is the minimum time step tt satisfying Gγ1(ε+G2t)1<DG\gamma^{-1}(\varepsilon+G^{2}t)^{-1}<D, i.e., t>1/(γGD)ε/G2t>1/(\gamma GD)-\varepsilon/G^{2}. From this, for sufficiently large TT such that t2t12t_{2}\leq t_{1}-2, we have

Gt=1t11(1)txt\displaystyle G\sum_{t=1}^{t_{1}-1}(-1)^{t}x_{t} =Gt=1t2(1)t(1)tD2+t=t2+1t11(1)t((1)t2D2s=t2t1(1)sGγ(ε+G2s))\displaystyle=G\sum_{t=1}^{t_{2}}(-1)^{t}(-1)^{t}\frac{D}{2}+\sum_{t=t_{2}+1}^{t_{1}-1}(-1)^{t}\left((-1)^{t_{2}}\frac{D}{2}-\sum_{s=t_{2}}^{t-1}\frac{(-1)^{s}G}{\gamma(\varepsilon+G^{2}s)}\right)
t212GD+G2γt=t2+1t11s=t2t1(1)s+t+1ε+G2s\displaystyle\geq\frac{t_{2}-1}{2}GD+\frac{G^{2}}{\gamma}\sum_{t=t_{2}+1}^{t_{1}-1}\sum_{s=t_{2}}^{t-1}\frac{(-1)^{s+t+1}}{\varepsilon+G^{2}s}
=t212GD+G2γs=t2t12t=s+1t11(1)s+t+1ε+G2s\displaystyle=\frac{t_{2}-1}{2}GD+\frac{G^{2}}{\gamma}\sum_{s=t_{2}}^{t_{1}-2}\sum_{t=s+1}^{t_{1}-1}\frac{(-1)^{s+t+1}}{\varepsilon+G^{2}s}
t212GDG2γs=t2t121ε+G2s\displaystyle\geq\frac{t_{2}-1}{2}GD-\frac{G^{2}}{\gamma}\sum_{s=t_{2}}^{t_{1}-2}\frac{1}{\varepsilon+G^{2}s}
t212GDG2γt21t12dsε+G2s\displaystyle\geq\frac{t_{2}-1}{2}GD-\frac{G^{2}}{\gamma}\int_{t_{2}-1}^{t_{1}-2}\frac{\differential s}{\varepsilon+G^{2}s}
=t212GD1γlogε+(t12)G2ε+(t21)G2\displaystyle=\frac{t_{2}-1}{2}GD-\frac{1}{\gamma}\log\frac{\varepsilon+(t_{1}-2)G^{2}}{\varepsilon+(t_{2}-1)G^{2}}
1γlog(1+G2εT).\displaystyle\geq-\frac{1}{\gamma}\log\left(1+\frac{G^{2}}{\varepsilon}T\right). (13)

Next, for the second term of equation (12), if xt10x_{t_{1}}\geq 0 and tt1t\geq t_{1}, we then have

yt+1=xtγ1(ε+G2t)1xtγ1G2t11y_{t+1}=x_{t}-\gamma^{-1}(\varepsilon+G^{2}t)^{-1}\geq x_{t}-\gamma^{-1}G^{-2}t_{1}^{-1}

and since xt+1yt+1x_{t+1}\geq y_{t+1} holds from yt+1xtD/2y_{t+1}\leq x_{t}\leq D/2, we have

yt+1\displaystyle y_{t+1} xt1(tt1)γ1G2t11\displaystyle\geq x_{t_{1}}-(t-t_{1})\gamma^{-1}G^{-2}t_{1}^{-1}
(Tt1)γ1G2t11\displaystyle\geq-(T-t_{1})\gamma^{-1}G^{-2}t_{1}^{-1}
=(Tt11)G2γ1\displaystyle=-\left(\frac{T}{t_{1}}-1\right)G^{-2}\gamma^{-1}
(T(1+γG2D/2)1T1)G2γ1\displaystyle\geq-\left(\frac{T}{(1+\gamma G^{2}D/2)^{-1}T}-1\right)G^{-2}\gamma^{-1}
=D2.\displaystyle=-\frac{D}{2}.

Therefore, from

xt=xt1s=t1t1γ1(ε+G2s)1,x_{t}=x_{t_{1}}-\sum_{s=t_{1}}^{t-1}\gamma^{-1}(\varepsilon+G^{2}s)^{-1},

we have

Gt=t1Txt\displaystyle G\sum_{t=t_{1}}^{T}x_{t} =Gt=t1T(xt1s=t1t1γ1(ε+G2s)1)\displaystyle=G\sum_{t=t_{1}}^{T}\left(x_{t_{1}}-\sum_{s=t_{1}}^{t-1}\gamma^{-1}(\varepsilon+G^{2}s)^{-1}\right)
1γGt=t1Ts=t1t1s1\displaystyle\geq-\frac{1}{\gamma G}\sum_{t=t_{1}}^{T}\sum_{s=t_{1}}^{t-1}s^{-1}
=1γGs=t1T1t=s+1Ts1\displaystyle=-\frac{1}{\gamma G}\sum_{s=t_{1}}^{T-1}\sum_{t=s+1}^{T}s^{-1}
=1γGs=t1T1(1+Ts)\displaystyle=-\frac{1}{\gamma G}\sum_{s=t_{1}}^{T-1}\left(-1+\frac{T}{s}\right)
Tt11γGTγGlogTt1\displaystyle\geq\frac{T-t_{1}-1}{\gamma G}-\frac{T}{\gamma G}\log\frac{T}{t_{1}}
T(1+γG2D/2)1T2γGTγGlogT(1+γG2D/2)1T\displaystyle\geq\frac{T-(1+\gamma G^{2}D/2)^{-1}T-2}{\gamma G}-\frac{T}{\gamma G}\log\frac{T}{(1+\gamma G^{2}D/2)^{-1}T}
=TG(G2D/21+γG2D/2γ1log(1+γG2D/2))2γG.\displaystyle=\frac{T}{G}\left(\frac{G^{2}D/2}{1+\gamma G^{2}D/2}-\gamma^{-1}\log(1+\gamma G^{2}D/2)\right)-\frac{2}{\gamma G}. (14)

The same bound can be derived analogously in the case xt1<0x_{t_{1}}<0.

Combining inequality (11), equality (12), and inequalities (13) and (14), we complete the proof:

RT\displaystyle R_{T} 1γlog(1+G2εT)+TG(G2D/21+γG2D/2γ1log(1+γG2D/2))2γG\displaystyle\geq-\frac{1}{\gamma}\log\left(1+\frac{G^{2}}{\varepsilon}T\right)+\frac{T}{G}\left(\frac{G^{2}D/2}{1+\gamma G^{2}D/2}-\gamma^{-1}\log(1+\gamma G^{2}D/2)\right)-\frac{2}{\gamma G}
(γG2D/21+γG2D/2T+1)GD2\displaystyle\quad-\left(-\frac{\gamma G^{2}D/2}{1+\gamma G^{2}D/2}T+1\right)\frac{GD}{2}
T(G2D2γ1(γG2D2(γG2D/2)22(1+γG2D/2)2))1γlog(1+G2εT)2γGGD2\displaystyle\geq T\left(\frac{G^{2}D}{2}-\gamma^{-1}\left(\gamma\frac{G^{2}D}{2}-\frac{(\gamma G^{2}D/2)^{2}}{2(1+\gamma G^{2}D/2)^{2}}\right)\right)-\frac{1}{\gamma}\log\left(1+\frac{G^{2}}{\varepsilon}T\right)-\frac{2}{\gamma G}-\frac{GD}{2}
=γG2D/22(1+γG2D/2)2T1γlog(1+G2εT)2γGG2D2\displaystyle=\frac{\gamma G^{2}D/2}{2(1+\gamma G^{2}D/2)^{2}}T-\frac{1}{\gamma}\log\left(1+\frac{G^{2}}{\varepsilon}T\right)-\frac{2}{\gamma G}-\frac{G^{2}D}{2}
=Ω(T).\displaystyle=\Omega(T).

The second inequality follows from the inequality log(1+γG2D/2)γG2D/2(γG2D/2)22(1+γG2D/2)2\log(1+\gamma G^{2}D/2)\leq\gamma G^{2}D/2-\frac{(\gamma G^{2}D/2)^{2}}{2(1+\gamma G^{2}D/2)^{2}} for any γ>0\gamma>0 by Taylor’s theorem. ∎
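The Taylor-type inequality invoked in the last step can be checked numerically; the sketch below evaluates both sides over several orders of magnitude of u, where u stands for γG²D/2.

```python
import math

# Numerical check of the inequality used in the last step of the proof:
# log(1 + u) <= u - u^2 / (2 (1 + u)^2) for all u > 0
# (u plays the role of gamma * G^2 * D / 2).

us = [10.0 ** e for e in range(-6, 4)] + [0.5, 1.0, 2.0]
ok = all(
    math.log1p(u) <= u - u * u / (2.0 * (1.0 + u) ** 2) + 1e-12
    for u in us
)
```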

B.5 Proof of Theorem 5.2

This subsection proves that Theorem 5.2 holds for any 𝒙𝒳\bm{x}\in\mathcal{X}. Before stating the proof of Theorem 5.2, we introduce the following two lemmas. Here, ctc_{t}, 𝒙tc\bm{x}_{t}^{c}, tη\ell_{t}^{\eta}, 𝒙tη,\bm{x}_{t}^{\eta,\ell}, stηs_{t}^{\eta}, and 𝒙tη,s\bm{x}_{t}^{\eta,s} are introduced in Algorithm 6.

Lemma B.2.

(Wang et al. 2020) For every grid point η\eta, we have

t=1T(ct(𝒙t)ct(𝒙tc))\displaystyle\sum_{t=1}^{T}(c_{t}(\bm{x}_{t})-c_{t}(\bm{x}_{t}^{c})) log3+14,\displaystyle\leq\log 3+\frac{1}{4}, (15)
t=1T(tη(𝒙t)tη(𝒙tη,))\displaystyle\sum_{t=1}^{T}(\ell_{t}^{\eta}(\bm{x}_{t})-\ell_{t}^{\eta}(\bm{x}_{t}^{\eta,\ell})) 2log(3(12log2T+3)),\displaystyle\leq 2\log\left(\sqrt{3}\left(\frac{1}{2}\log_{2}T+3\right)\right), (16)

and

t=1T(stη(𝒙t)stη(𝒙tη,s))\displaystyle\sum_{t=1}^{T}(s_{t}^{\eta}(\bm{x}_{t})-s_{t}^{\eta}(\bm{x}_{t}^{\eta,s})) 2log(3(12log2T+3)).\displaystyle\leq 2\log\left(\sqrt{3}\left(\frac{1}{2}\log_{2}T+3\right)\right). (17)
Remark B.3.

According to Wang et al. (2020),

t=1T(ct(𝒙t)ct(𝒙tc))log3\sum_{t=1}^{T}(c_{t}(\bm{x}_{t})-c_{t}(\bm{x}_{t}^{c}))\leq\log 3

holds instead of inequality (15). However, the above inequality needs to be corrected. In the last part of the proof of this lemma in their paper, they derived

0t=1Tct(𝒙tc)+log1πc=t=1Tct(𝒙tc)+log3.0\leq\sum_{t=1}^{T}c_{t}(\bm{x}_{t}^{c})+\log\frac{1}{\pi^{c}}=\sum_{t=1}^{T}c_{t}(\bm{x}_{t}^{c})+\log 3.

From the definition of ctc_{t} and ηc=1/(2GDT)\eta^{c}=1/(2GD\sqrt{T}), we have

t=1Tct(𝒙t)=t=1T(ηcGD)2=t=1T14T=14,\sum_{t=1}^{T}c_{t}(\bm{x}_{t})=\sum_{t=1}^{T}(\eta^{c}GD)^{2}=\sum_{t=1}^{T}\frac{1}{4T}=\frac{1}{4},

though they treated this term as 0. Combining these relationships, we obtain inequality (15). This mistake seems to be a mere typo, since the regret bound in their paper coincides with the result derived from inequality (15).

Lemma B.4.

(Wang et al. 2020) For every grid point η\eta and any 𝐱𝒳\bm{x}\in\mathcal{X}, we have

t=1T(ct(𝒙tc)ctη(𝒙))\displaystyle\sum_{t=1}^{T}(c_{t}(\bm{x}_{t}^{c})-c_{t}^{\eta}(\bm{x})) 34,\displaystyle\leq\frac{3}{4}, (18)
t=1T(tη(𝒙tη,)tη(𝒙))\displaystyle\sum_{t=1}^{T}(\ell_{t}^{\eta}(\bm{x}_{t}^{\eta,\ell})-\ell_{t}^{\eta}(\bm{x})) 10dlogT,\displaystyle\leq 10d\log T, (19)

and

t=1T(stη(𝒙tη,s)stη(𝒙))\displaystyle\sum_{t=1}^{T}(s_{t}^{\eta}(\bm{x}_{t}^{\eta,s})-s_{t}^{\eta}(\bm{x})) 1+logT.\displaystyle\leq 1+\log T. (20)
Proof of Theorem 5.2.

We can obtain the O(GDT)O(GD\sqrt{T}) bound as follows:

RT𝒙R~T𝒙\displaystyle R_{T}^{\bm{x}}\leq\tilde{R}_{T}^{\bm{x}} =1ηct=1T((ηcGD)2ct(𝒙))\displaystyle=\frac{1}{\eta^{c}}\sum_{t=1}^{T}((\eta^{c}GD)^{2}-c_{t}(\bm{x}))
=1ηc(t=1T(ct(𝒙t)ct(𝒙tc))+t=1T(ct(𝒙tc)ct(𝒙)))\displaystyle=\frac{1}{\eta^{c}}\left(\sum_{t=1}^{T}(c_{t}(\bm{x}_{t})-c_{t}(\bm{x}_{t}^{c}))+\sum_{t=1}^{T}(c_{t}(\bm{x}_{t}^{c})-c_{t}(\bm{x}))\right)
2(1+log3)GDT,\displaystyle\leq 2(1+\log 3)GD\sqrt{T},

where the last inequality follows from inequalities (15) and (18).

We can obtain the O(WT𝒙logT)O(\sqrt{W_{T}^{\bm{x}}\log T}) bound as follows:

RT𝒙R~T𝒙\displaystyle R_{T}^{\bm{x}}\leq\tilde{R}_{T}^{\bm{x}} =1ηt=1T((ηG)2𝒙t𝒙2stη(𝒙))\displaystyle=\frac{1}{\eta}\sum_{t=1}^{T}((\eta G)^{2}\|\bm{x}_{t}-\bm{x}\|^{2}-s_{t}^{\eta}(\bm{x}))
=ηt=1TG2𝒙t𝒙2+1η(t=1T(st(𝒙t)st(𝒙ts))+t=1T(st(𝒙ts)st(𝒙)))\displaystyle=\eta\sum_{t=1}^{T}G^{2}\|\bm{x}_{t}-\bm{x}\|^{2}+\frac{1}{\eta}\left(\sum_{t=1}^{T}(s_{t}(\bm{x}_{t})-s_{t}(\bm{x}_{t}^{s}))+\sum_{t=1}^{T}(s_{t}(\bm{x}_{t}^{s})-s_{t}(\bm{x}))\right)
ηWT𝒙+Aη,\displaystyle\leq\eta W_{T}^{\bm{x}}+\frac{A}{\eta},

where A2log(3((1/2)log2T+3))+1+logT1A\coloneqq 2\log(\sqrt{3}((1/2)\log_{2}T+3))+1+\log T\geq 1 and the last inequality follows from inequalities (17) and (20). Let η^A/WT𝒙1/(5GDT)\hat{\eta}\coloneqq\sqrt{A/W_{T}^{\bm{x}}}\geq 1/(5GD\sqrt{T}). If η^1/(5GD)\hat{\eta}\leq 1/(5GD), there exists a grid point ηi[η^/2,η^]\eta_{i}\in[\hat{\eta}/2,\hat{\eta}] and we get

R~T𝒙ηiWT𝒙+Aηiη^WT𝒙+2Aη^=3WT𝒙A.\tilde{R}_{T}^{\bm{x}}\leq\eta_{i}W_{T}^{\bm{x}}+\frac{A}{\eta_{i}}\leq\hat{\eta}W_{T}^{\bm{x}}+\frac{2A}{\hat{\eta}}=3\sqrt{W_{T}^{\bm{x}}A}.

Otherwise, since WT𝒙25G2D2AW_{T}^{\bm{x}}\leq 25G^{2}D^{2}A, by taking η=η1=1/(5GD)\eta=\eta_{1}=1/(5GD), we get

R~T𝒙η1WT𝒙+Aη110GDA.\tilde{R}_{T}^{\bm{x}}\leq\eta_{1}W_{T}^{\bm{x}}+\frac{A}{\eta_{1}}\leq 10GDA.

Therefore, R~T𝒙=O(WT𝒙logT)\tilde{R}_{T}^{\bm{x}}=O(\sqrt{W_{T}^{\bm{x}}\log T}) holds.
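The grid-point argument above can be checked numerically. The sketch below assumes the geometric grid ηᵢ = η₁/2^{i−1} with η₁ = 1/(5GD) and ⌈(1/2)log₂T⌉+1 grid points (the MetaGrad/Maler-style grid, an assumption here), and verifies that every η̂ between the smallest grid point and η₁ admits a grid point in [η̂/2, η̂].

```python
import math

# Check of the grid argument: with the geometric grid
# eta_i = eta_1 / 2^{i-1}, eta_1 = 1/(5 G D), every eta_hat between the
# smallest grid point and eta_1 has a grid point in [eta_hat/2, eta_hat].

G, D, T = 1.0, 1.0, 1024
eta1 = 1.0 / (5.0 * G * D)
n = math.ceil(0.5 * math.log2(T)) + 1      # number of grid points
grid = [eta1 / 2 ** i for i in range(n)]

ok = True
for j in range(1000):
    eta_hat = grid[-1] + (eta1 - grid[-1]) * j / 999.0
    ok = ok and any(eta_hat / 2.0 <= e <= eta_hat for e in grid)
```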

We can obtain the O(VT𝒙dlogT)O(\sqrt{V_{T}^{\bm{x}}d\log T}) bound as follows:

RT𝒙R~T𝒙\displaystyle R_{T}^{\bm{x}}\leq\tilde{R}_{T}^{\bm{x}} =1ηt=1T(η2(ft(𝒙t),𝒙t𝒙)2tη(𝒙))\displaystyle=\frac{1}{\eta}\sum_{t=1}^{T}(\eta^{2}(\left\langle{\nabla f_{t}(\bm{x}_{t})},{\bm{x}_{t}-\bm{x}}\right\rangle)^{2}-\ell_{t}^{\eta}(\bm{x}))
=ηt=1T(ft(𝒙t),𝒙t𝒙)2+1η(t=1T(t(𝒙t)t(𝒙t))+t=1T(t(𝒙t)t(𝒙)))\displaystyle=\eta\sum_{t=1}^{T}(\left\langle{\nabla f_{t}(\bm{x}_{t})},{\bm{x}_{t}-\bm{x}}\right\rangle)^{2}+\frac{1}{\eta}\left(\sum_{t=1}^{T}(\ell_{t}(\bm{x}_{t})-\ell_{t}(\bm{x}_{t}^{\ell}))+\sum_{t=1}^{T}(\ell_{t}(\bm{x}_{t}^{\ell})-\ell_{t}(\bm{x}))\right)
ηVT𝒙+Bη,\displaystyle\leq\eta V_{T}^{\bm{x}}+\frac{B}{\eta},

where B2log(3((1/2)log2T+3))+10dlogTB\coloneqq 2\log(\sqrt{3}((1/2)\log_{2}T+3))+10d\log T and the last inequality follows from inequalities (16) and (19). By similar arguments, R~T𝒙=O(VT𝒙dlogT)\tilde{R}_{T}^{\bm{x}}=O(\sqrt{V_{T}^{\bm{x}}d\log T}) holds. ∎

B.6 Proof of Lemma 5.4

In this subsection, we prove Lemma 5.4 used in the proof of Theorem 5.3.

Proof.

Since x3(a+b)/2x\leq 3(a+b)/2 holds when xbx\leq b, it is sufficient to consider the case x>bx>b. If xax+bx\leq\sqrt{ax}+b, then we have

x2(a+2b)x+b20.x^{2}-(a+2b)x+b^{2}\leq 0.

By solving this, we have

xa+2b+a2+4ab2a+b+ab32(a+b).x\leq\frac{a+2b+\sqrt{a^{2}+4ab}}{2}\leq a+b+\sqrt{ab}\leq\frac{3}{2}(a+b).

The second inequality holds from the inequality x+yx+y\sqrt{x+y}\leq\sqrt{x}+\sqrt{y} for x,y0x,y\geq 0, and the last inequality holds from the inequality of arithmetic and geometric means. ∎
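Lemma 5.4 can be sanity-checked numerically: the largest x satisfying the hypothesis x ≤ √(ax) + b is the larger root of x² − (a+2b)x + b² = 0, and the sketch below confirms that it never exceeds (3/2)(a + b) for random nonnegative a and b.

```python
import math
import random

# Numerical check of Lemma 5.4: if x <= sqrt(a x) + b with a, b >= 0,
# then x <= (3/2)(a + b).  The largest such x is the larger root of
# x^2 - (a + 2b) x + b^2 = 0, computed in closed form below.

random.seed(0)
ok = True
for _ in range(10000):
    a = random.uniform(0.0, 100.0)
    b = random.uniform(0.0, 100.0)
    x_max = (a + 2.0 * b + math.sqrt(a * a + 4.0 * a * b)) / 2.0
    ok = ok and x_max <= 1.5 * (a + b) + 1e-9
```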

B.7 Proof of Corollary 5.5 for USC

This subsection presents the proof of Corollary 5.5 for USC.

Proof.

The regret satisfies

RT\displaystyle R_{T} =t=1Tft(𝒙t)t=1Tft(𝒙ti)+t=1Tft(𝒙ti)min𝒙𝒳t=1Tft(𝒙)\displaystyle=\sum_{t=1}^{T}f_{t}(\bm{x}_{t})-\sum_{t=1}^{T}f_{t}(\bm{x}_{t}^{i})+\sum_{t=1}^{T}f_{t}(\bm{x}_{t}^{i})-\min_{\bm{x}\in\mathcal{X}}\sum_{t=1}^{T}f_{t}(\bm{x})
=RTmeta+RTexpert,\displaystyle=R_{T}^{\mathrm{meta}}+R_{T}^{\mathrm{expert}},

where RTmeta:=t=1Tft(𝒙t)t=1Tft(𝒙ti)R_{T}^{\mathrm{meta}}:=\sum_{t=1}^{T}f_{t}(\bm{x}_{t})-\sum_{t=1}^{T}f_{t}(\bm{x}_{t}^{i}) and RTexpert:=t=1Tft(𝒙ti)min𝒙𝒳t=1Tft(𝒙)R_{T}^{\mathrm{expert}}:=\sum_{t=1}^{T}f_{t}(\bm{x}_{t}^{i})-\min_{\bm{x}\in\mathcal{X}}\sum_{t=1}^{T}f_{t}(\bm{x}). From Theorem A.1,

RTmeta\displaystyle R_{T}^{\mathrm{meta}} t=1Tft(𝒙t),𝒙t𝒙ti\displaystyle\leq\sum_{t=1}^{T}\left\langle{\nabla f_{t}(\bm{x}_{t})},{\bm{x}_{t}-\bm{x}_{t}^{i}}\right\rangle
4ΓGD+Γlog||4G2D2+t=1T(ft(𝒙t),𝒙t𝒙ti)2\displaystyle\leq 4\Gamma GD+\frac{\Gamma}{\sqrt{\log|\mathcal{E}|}}\sqrt{4G^{2}D^{2}+\sum_{t=1}^{T}(\left\langle{\nabla f_{t}(\bm{x}_{t})},{\bm{x}_{t}-\bm{x}_{t}^{i}}\right\rangle)^{2}}
2ΓGD(2+1log||)+Γ2log||t=1T(ft(𝒙t),𝒙t𝒙ti)2.\displaystyle\leq 2\Gamma GD\left(2+\frac{1}{\sqrt{\log|\mathcal{E}|}}\right)+\sqrt{\frac{\Gamma^{2}}{\log|\mathcal{E}|}\sum_{t=1}^{T}(\left\langle{\nabla f_{t}(\bm{x}_{t})},{\bm{x}_{t}-\bm{x}_{t}^{i}}\right\rangle)^{2}}.

The last inequality follows from the inequality x+yx+y\sqrt{x+y}\leq\sqrt{x}+\sqrt{y} for x,y0x,y\geq 0.

As in equation (2), if RTmeta>0R_{T}^{\mathrm{meta}}>0, the inequality

t=1T(ft(𝒙t),𝒙t𝒙ti)22γt=1Tft(𝒙t),𝒙t𝒙ti+kγG2D2\sum_{t=1}^{T}(\left\langle{\nabla f_{t}(\bm{x}_{t})},{\bm{x}_{t}-\bm{x}_{t}^{i}}\right\rangle)^{2}\leq\frac{2}{\gamma}\sum_{t=1}^{T}\left\langle{\nabla f_{t}(\bm{x}_{t})},{\bm{x}_{t}-\bm{x}_{t}^{i}}\right\rangle+k_{\gamma}G^{2}D^{2}

holds. By combining these inequalities, we have

t=1Tft(𝒙t),𝒙t𝒙ti\displaystyle\sum_{t=1}^{T}\left\langle{\nabla f_{t}(\bm{x}_{t})},{\bm{x}_{t}-\bm{x}_{t}^{i}}\right\rangle 2ΓGD(2+1log||)+Γ2log||(2γt=1Tft(𝒙t),𝒙t𝒙ti+kγG2D2)\displaystyle\leq 2\Gamma GD\left(2+\frac{1}{\sqrt{\log|\mathcal{E}|}}\right)+\sqrt{\frac{\Gamma^{2}}{\log|\mathcal{E}|}\left(\frac{2}{\gamma}\sum_{t=1}^{T}\left\langle{\nabla f_{t}(\bm{x}_{t})},{\bm{x}_{t}-\bm{x}_{t}^{i}}\right\rangle+k_{\gamma}G^{2}D^{2}\right)}
ΓGD(4+2+kγlog||)+2Γ2γlog||t=1Tft(𝒙t),𝒙t𝒙ti.\displaystyle\leq\Gamma GD\left(4+\frac{2+\sqrt{k_{\gamma}}}{\sqrt{\log|\mathcal{E}|}}\right)+\sqrt{\frac{2\Gamma^{2}}{\gamma\log|\mathcal{E}|}\sum_{t=1}^{T}\left\langle{\nabla f_{t}(\bm{x}_{t})},{\bm{x}_{t}-\bm{x}_{t}^{i}}\right\rangle}.

From Lemma 5.4 with a=2Γ2/(γlog||)a=2\Gamma^{2}/(\gamma\log|\mathcal{E}|) and b=ΓGD(4+(2+kγ)/log||)b=\Gamma GD(4+(2+\sqrt{k_{\gamma}})/\sqrt{\log|\mathcal{E}|}), we have

RTmeta\displaystyle R_{T}^{\mathrm{meta}} t=1Tft(𝒙t),𝒙t𝒙ti\displaystyle\leq\sum_{t=1}^{T}\left\langle{\nabla f_{t}(\bm{x}_{t})},{\bm{x}_{t}-\bm{x}_{t}^{i}}\right\rangle
32(ΓGD(4+2+kγlog||)+2Γ2γlog||).\displaystyle\leq\frac{3}{2}\left(\Gamma GD\left(4+\frac{2+\sqrt{k_{\gamma}}}{\sqrt{\log|\mathcal{E}|}}\right)+\frac{2\Gamma^{2}}{\gamma\log|\mathcal{E}|}\right).

Since ||=O(logT)|\mathcal{E}|=O(\log T) in USC, we obtain the following loose upper bound:

RTmeta=O(dγlogT+GDkdlogT).R_{T}^{\mathrm{meta}}=O\left(\frac{d}{\gamma}\log T+GD\sqrt{kd\log T}\right).

On the other hand, by considering the case in which the iith expert is MetaGrad or Maler, from Corollary 5.5 for MetaGrad and Maler we have

RTexpert=O(dγlogT+GDkdlogT).R_{T}^{\mathrm{expert}}=O\left(\frac{d}{\gamma}\log T+GD\sqrt{kd\log T}\right).

Combining these bounds, we get

RT=O(dγlogT+GDkdlogT).R_{T}=O\left(\frac{d}{\gamma}\log T+GD\sqrt{kd\log T}\right).

B.8 Proof of Theorem 5.6

This subsection presents the proof of Theorem 5.6.

Proof.

From the definition of strong convexity, we have

RT𝒙\displaystyle R_{T}^{\bm{x}} =t=1T(ft(𝒙t)ft(𝒙))\displaystyle=\sum_{t=1}^{T}(f_{t}(\bm{x}_{t})-f_{t}(\bm{x}))
t=1T(ft(𝒙t),𝒙t𝒙λt2𝒙t𝒙2)\displaystyle\leq\sum_{t=1}^{T}\left(\left\langle{\nabla f_{t}(\bm{x}_{t})},{\bm{x}_{t}-\bm{x}}\right\rangle-\frac{\lambda_{t}}{2}\|\bm{x}_{t}-\bm{x}\|^{2}\right)
=R~T𝒙λ2G2WT𝒙+t=1Tλλt2𝒙t𝒙2\displaystyle=\tilde{R}_{T}^{\bm{x}}-\frac{\lambda}{2G^{2}}W_{T}^{\bm{x}}+\sum_{t=1}^{T}\frac{\lambda-\lambda_{t}}{2}\|\bm{x}_{t}-\bm{x}\|^{2}
R~T𝒙λ2G2WT𝒙+t:λt<λλλt2D2\displaystyle\leq\tilde{R}_{T}^{\bm{x}}-\frac{\lambda}{2G^{2}}W_{T}^{\bm{x}}+\sum_{t:\lambda_{t}<\lambda}\frac{\lambda-\lambda_{t}}{2}D^{2}
R~T𝒙λ2G2WT𝒙+λ2kλD2.\displaystyle\leq\tilde{R}_{T}^{\bm{x}}-\frac{\lambda}{2G^{2}}W_{T}^{\bm{x}}+\frac{\lambda}{2}k_{\lambda}D^{2}.

If RT𝒙<0R_{T}^{\bm{x}}<0, then 0 is an upper bound, so it suffices to consider the case RT𝒙0R_{T}^{\bm{x}}\geq 0. In this case, we have

WT𝒙2G2λR~T𝒙+kλG2D2.W_{T}^{\bm{x}}\leq\frac{2G^{2}}{\lambda}\tilde{R}_{T}^{\bm{x}}+k_{\lambda}G^{2}D^{2}.

From the assumption of Theorem 5.6, there exists a positive constant C>0C>0 such that

R~T𝒙\displaystyle\tilde{R}_{T}^{\bm{x}} C(WT𝒙r1(T)+r2(T))\displaystyle\leq C\left(\sqrt{W_{T}^{\bm{x}}r_{1}(T)}+r_{2}(T)\right)
C((2G2λR~T𝒙+kλG2D2)r1(T)+r2(T))\displaystyle\leq C\left(\sqrt{\left(\frac{2G^{2}}{\lambda}\tilde{R}_{T}^{\bm{x}}+k_{\lambda}G^{2}D^{2}\right)r_{1}(T)}+r_{2}(T)\right)
2G2λC2r1(T)R~T𝒙+CGDkλr1(T)+Cr2(T).\displaystyle\leq\sqrt{\frac{2G^{2}}{\lambda}C^{2}r_{1}(T)\tilde{R}_{T}^{\bm{x}}}+CGD\sqrt{k_{\lambda}r_{1}(T)}+Cr_{2}(T).

The last inequality follows from the inequality x+yx+y\sqrt{x+y}\leq\sqrt{x}+\sqrt{y} for x,y0x,y\geq 0. Applying Lemma 5.4 with a=(2G2/λ)C2r1(T)a=(2G^{2}/\lambda)C^{2}r_{1}(T) and b=CGDkλr1(T)+Cr2(T)b=CGD\sqrt{k_{\lambda}r_{1}(T)}+Cr_{2}(T) yields

R~T𝒙\displaystyle\tilde{R}_{T}^{\bm{x}} 32(2G2λC2r1(T)+CGDkλr1(T)+Cr2(T)).\displaystyle\leq\frac{3}{2}\left(\frac{2G^{2}}{\lambda}C^{2}r_{1}(T)+CGD\sqrt{k_{\lambda}r_{1}(T)}+Cr_{2}(T)\right).

From this inequality and RT𝒙R~T𝒙R_{T}^{\bm{x}}\leq\tilde{R}_{T}^{\bm{x}}, Theorem 5.6 follows. ∎

B.9 Proof of Corollary 5.7 for MetaGrad and Maler

This subsection presents the proof of Corollary 5.7 for MetaGrad and Maler.

Proof.

As for MetaGrad and Maler, from Theorem 5.1 and Theorem 5.2,

R~T𝒙=O(WT𝒙d~logT+GDd~logT)\tilde{R}_{T}^{\bm{x}}=O(\sqrt{W_{T}^{\bm{x}}\tilde{d}\log T}+GD\tilde{d}\log T)

holds for any 𝒙𝒳\bm{x}\in\mathcal{X}, where d~\tilde{d} is dd and 1 in the case of MetaGrad and Maler, respectively. Therefore, by Theorem 5.6, we have

RT𝒙=O((G2λ+GD)d~logT+GDkλd~logT).R_{T}^{\bm{x}}=O\left(\left(\frac{G^{2}}{\lambda}+GD\right)\tilde{d}\log T+GD\sqrt{k_{\lambda}\tilde{d}\log T}\right).

Here, kλk_{\lambda} satisfies

kλ\displaystyle k_{\lambda} =t=1Tmax{1λtλ,0}=t:λt<λ(1λtλ)k.\displaystyle=\sum_{t=1}^{T}\max\left\{1-\frac{\lambda_{t}}{\lambda},0\right\}=\sum_{t\colon\lambda_{t}<\lambda}\left(1-\frac{\lambda_{t}}{\lambda}\right)\leq k.

Hence, we have

RT𝒙=O((G2λ+GD)d~logT+GDkd~logT),R_{T}^{\bm{x}}=O\left(\left(\frac{G^{2}}{\lambda}+GD\right)\tilde{d}\log T+GD\sqrt{k\tilde{d}\log T}\right),

and especially,

RT=O((G2λ+GD)d~logT+GDkd~logT).R_{T}=O\left(\left(\frac{G^{2}}{\lambda}+GD\right)\tilde{d}\log T+GD\sqrt{k\tilde{d}\log T}\right).

B.10 Proof of Corollary 5.7 for USC

This subsection presents the proof of Corollary 5.7 for USC.

Proof.

As in the proof of Corollary 5.5, we have

RT=RTmeta+RTexpert.R_{T}=R_{T}^{\mathrm{meta}}+R_{T}^{\mathrm{expert}}.

From Theorem A.1, we have

RTmeta\displaystyle R_{T}^{\mathrm{meta}} t=1Tft(𝒙t),𝒙t𝒙ti\displaystyle\leq\sum_{t=1}^{T}\left\langle{\nabla f_{t}(\bm{x}_{t})},{\bm{x}_{t}-\bm{x}_{t}^{i}}\right\rangle
2ΓGD(2+1log||)+Γ2log||t=1Tft(𝒙t),𝒙t𝒙ti2\displaystyle\leq 2\Gamma GD\left(2+\frac{1}{\sqrt{\log|\mathcal{E}|}}\right)+\sqrt{\frac{\Gamma^{2}}{\log|\mathcal{E}|}\sum_{t=1}^{T}\left\langle{\nabla f_{t}(\bm{x}_{t})},{\bm{x}_{t}-\bm{x}_{t}^{i}}\right\rangle^{2}}
2ΓGD(2+1log||)+Γ2G2log||t=1T𝒙t𝒙ti2.\displaystyle\leq 2\Gamma GD\left(2+\frac{1}{\sqrt{\log|\mathcal{E}|}}\right)+\sqrt{\frac{\Gamma^{2}G^{2}}{\log|\mathcal{E}|}\sum_{t=1}^{T}\|\bm{x}_{t}-\bm{x}_{t}^{i}\|^{2}}.

From the definition of strong convexity, we have

RTmeta\displaystyle R_{T}^{\mathrm{meta}} =t=1T(ft(𝒙t)ft(𝒙ti))\displaystyle=\sum_{t=1}^{T}(f_{t}(\bm{x}_{t})-f_{t}(\bm{x}_{t}^{i}))
t=1T(ft(𝒙t),𝒙t𝒙tiλ2𝒙t𝒙ti2)+t=1Tλ2max{1λtλ,0}𝒙t𝒙ti2\displaystyle\leq\sum_{t=1}^{T}\left(\left\langle{\nabla f_{t}(\bm{x}_{t})},{\bm{x}_{t}-\bm{x}_{t}^{i}}\right\rangle-\frac{\lambda}{2}\|\bm{x}_{t}-\bm{x}_{t}^{i}\|^{2}\right)+\sum_{t=1}^{T}\frac{\lambda}{2}\max\left\{1-\frac{\lambda_{t}}{\lambda},0\right\}\|\bm{x}_{t}-\bm{x}_{t}^{i}\|^{2}
t=1Tft(𝒙t),𝒙t𝒙tiλ2t=1T𝒙t𝒙ti2+λ2kλD2.\displaystyle\leq\sum_{t=1}^{T}\left\langle{\nabla f_{t}(\bm{x}_{t})},{\bm{x}_{t}-\bm{x}_{t}^{i}}\right\rangle-\frac{\lambda}{2}\sum_{t=1}^{T}\|\bm{x}_{t}-\bm{x}_{t}^{i}\|^{2}+\frac{\lambda}{2}k_{\lambda}D^{2}.

If RTmeta<0R_{T}^{\mathrm{meta}}<0, then 0 is an upper bound, so it suffices to consider the case RTmeta0R_{T}^{\mathrm{meta}}\geq 0. In this case, we have

t=1T𝒙t𝒙ti22λt=1Tft(𝒙t),𝒙t𝒙ti+kλD2.\sum_{t=1}^{T}\|\bm{x}_{t}-\bm{x}_{t}^{i}\|^{2}\leq\frac{2}{\lambda}\sum_{t=1}^{T}\left\langle{\nabla f_{t}(\bm{x}_{t})},{\bm{x}_{t}-\bm{x}_{t}^{i}}\right\rangle+k_{\lambda}D^{2}.

By combining these inequalities, we have

t=1Tft(𝒙t),𝒙t𝒙ti\displaystyle\sum_{t=1}^{T}\left\langle{\nabla f_{t}(\bm{x}_{t})},{\bm{x}_{t}-\bm{x}_{t}^{i}}\right\rangle 2ΓGD(2+1log||)+Γ2G2log||(2λt=1Tft(𝒙t),𝒙t𝒙ti+kλD2)\displaystyle\leq 2\Gamma GD\left(2+\frac{1}{\sqrt{\log|\mathcal{E}|}}\right)+\sqrt{\frac{\Gamma^{2}G^{2}}{\log|\mathcal{E}|}\left(\frac{2}{\lambda}\sum_{t=1}^{T}\left\langle{\nabla f_{t}(\bm{x}_{t})},{\bm{x}_{t}-\bm{x}_{t}^{i}}\right\rangle+k_{\lambda}D^{2}\right)}
ΓGD(4+2+kλlog||)+2Γ2G2λlog||t=1Tft(𝒙t),𝒙t𝒙ti.\displaystyle\leq\Gamma GD\left(4+\frac{2+\sqrt{k_{\lambda}}}{\sqrt{\log|\mathcal{E}|}}\right)+\sqrt{\frac{2\Gamma^{2}G^{2}}{\lambda\log|\mathcal{E}|}\sum_{t=1}^{T}\left\langle{\nabla f_{t}(\bm{x}_{t})},{\bm{x}_{t}-\bm{x}_{t}^{i}}\right\rangle}.

From Lemma 5.4 with a=2Γ2G2/(λlog||)a=2\Gamma^{2}G^{2}/(\lambda\log|\mathcal{E}|) and b=ΓGD(4+(2+kλ)/log||)b=\Gamma GD(4+(2+\sqrt{k_{\lambda}})/\sqrt{\log|\mathcal{E}|}), we have

RTmetat=1Tft(𝒙t),𝒙t𝒙ti32(ΓGD(4+2+kλlog||)+2Γ2G2λlog||).R_{T}^{\mathrm{meta}}\leq\sum_{t=1}^{T}\left\langle{\nabla f_{t}(\bm{x}_{t})},{\bm{x}_{t}-\bm{x}_{t}^{i}}\right\rangle\leq\frac{3}{2}\left(\Gamma GD\left(4+\frac{2+\sqrt{k_{\lambda}}}{\sqrt{\log|\mathcal{E}|}}\right)+\frac{2\Gamma^{2}G^{2}}{\lambda\log|\mathcal{E}|}\right).

Since ||=O(logT)|\mathcal{E}|=O(\log T) in USC, we obtain the following loose upper bound:

RTmeta=O((G2λ+GD)logT+GDklogT).R_{T}^{\mathrm{meta}}=O\left(\left(\frac{G^{2}}{\lambda}+GD\right)\log T+GD\sqrt{k\log T}\right).

On the other hand, by considering the case in which the iith expert is Maler, from Corollary 5.7 for Maler we have

RTexpert=O((G2λ+GD)logT+GDklogT).R_{T}^{\mathrm{expert}}=O\left(\left(\frac{G^{2}}{\lambda}+GD\right)\log T+GD\sqrt{k\log T}\right).

Combining these bounds, we get

RT=O((G2λ+GD)logT+GDklogT).R_{T}=O\left(\left(\frac{G^{2}}{\lambda}+GD\right)\log T+GD\sqrt{k\log T}\right).

B.11 The Lemma in the Proof of Theorem 6.2

In this subsection, we introduce the following lemma used in the proof of Theorem 6.2.

Lemma B.5.

(Hazan 2016) Let 𝐀𝐁𝐎\bm{A}\succeq\bm{B}\succ\bm{O} be positive definite matrices. We then have

tr(𝑨1(𝑨𝑩))log|𝑨||𝑩|.\tr(\bm{A}^{-1}(\bm{A}-\bm{B}))\leq\log\frac{|\bm{A}|}{|\bm{B}|}.
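Lemma B.5 can be checked numerically on random positive definite matrices, as in the following sketch, where A and B are built so that the order A ⪰ B ≻ O holds by construction.

```python
import numpy as np

# Numerical check of Lemma B.5: for A >= B > 0 in the PSD order,
# tr(A^{-1}(A - B)) <= log(det(A) / det(B)).

rng = np.random.default_rng(0)
ok = True
for _ in range(100):
    d = 4
    M = rng.normal(size=(d, d))
    N = rng.normal(size=(d, d))
    B = M @ M.T + np.eye(d)      # B is positive definite
    A = B + N @ N.T              # A - B is PSD, hence A >= B
    lhs = np.trace(np.linalg.solve(A, A - B))
    rhs = np.linalg.slogdet(A)[1] - np.linalg.slogdet(B)[1]
    ok = ok and lhs <= rhs + 1e-8
```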

Appendix C Numerical Experiments

In this section, we explain the experimental results. We compare the performances of five OCO algorithms: OGD with step sizes ηt=D/(Gt)\eta_{t}=D/(G\sqrt{t}), ONS, MetaGrad, Algorithm 1 with S1=S_{1}=\emptyset (Con-ONS), and Algorithm 1 with S2=S_{2}=\emptyset (Con-OGD). We include OGD, ONS, and MetaGrad because OGD and ONS are well-known OCO algorithms and MetaGrad is one of the universal algorithms. All the experiments are implemented in Python 3.9.2 on a MacBook Air with a 1.8 GHz dual-core Intel Core i5 processor and 8 GB of memory.

C.1 Experiment 1: Contaminated Exp-Concave

In this experiment, we set d=1d=1, 𝒳=[0,1]\mathcal{X}=[0,1] and the objective function is as follows:

ft(x){100xtI,log(x+0.01)otherwise,f_{t}(x)\coloneqq\begin{dcases}100x&t\in I,\\ -\log(x+0.01)&\mathrm{otherwise},\end{dcases}

where I[T]I\subset[T] is chosen uniformly at random under the condition that |I|=k|I|=k. The sequence (f1,f2,,fT)(f_{1},f_{2},\ldots,f_{T}) is kk-contaminated 1-exp-concave, and the minimum value of t=1Tft\sum_{t=1}^{T}f_{t} is T2k(Tk)log((Tk)/(100k))T-2k-(T-k)\log((T-k)/(100k)) if 2k<T2k<T. We repeat this numerical experiment with 100 different random seeds and report the mean and standard deviation. The other parameters are shown in Table 2.
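The closed-form optimum stated above can be verified with a short computation; the sketch below minimizes the total loss over a fine grid on [0, 1] and compares the result with T − 2k − (T − k)log((T − k)/(100k)).

```python
import numpy as np

# Sanity check of the closed form: the total loss is
# sum_t f_t(x) = 100 k x - (T - k) log(x + 0.01), whose minimum over
# [0, 1] equals T - 2k - (T - k) log((T - k)/(100 k)) when 2k < T,
# attained at x* = (T - k)/(100 k) - 0.01.

T, k = 1000, 250

def total_loss(x):
    return 100.0 * k * x - (T - k) * np.log(x + 0.01)

xs = np.linspace(0.0, 1.0, 100001)
vals = total_loss(xs)
x_opt = xs[vals.argmin()]
closed = T - 2 * k - (T - k) * np.log((T - k) / (100.0 * k))
```

With the parameters of Table 2, this recovers x* = 0.02.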

We compare the performances of 4 OCO algorithms: OGD, ONS, MetaGrad, and Con-ONS. The parameters of ONS are set as γ=0.005\gamma=0.005 and ε=1/(γ2D2)=40000\varepsilon=1/(\gamma^{2}D^{2})=40000.

Table 2: Parameter setting in Experiment 1.

x1=0x_{1}=0, x=0.02x^{\ast}=0.02, T=1000T=1000, k=250k=250, D=1D=1, G=100G=100, α=1\alpha=1

The time variation of regret and xtx_{t} is shown in Figure 1. In the graphs presented in this paper, the error bars represent the standard deviation; the standard deviations are large because contamination causes the sequence of solutions to fluctuate. Only points where tt is a multiple of 25 are plotted to keep the graphs readable. The left graph shows that the regrets of all methods are sublinear. MetaGrad and ONS perform particularly well, followed by Con-ONS. The right graph confirms that xtx_{t} converges quickly to the optimal solution for all methods; OGD appears slightly more affected by contamination than the other methods.


Figure 1: The comparison of the time variation of regret (left) and xtx_{t} (right) in Experiment 1.

C.2 Experiment 2: Contaminated Strongly Convex

In this experiment, we set d=1d=1 and 𝒳=[0,1]\mathcal{X}=[0,1], and define the objective functions as follows:

ft(x){xtI,12(x1)2otherwise,f_{t}(x)\coloneqq\begin{dcases}x&t\in I,\\ \frac{1}{2}(x-1)^{2}&\mathrm{otherwise},\end{dcases}

where I[T]I\subset[T] is chosen uniformly at random subject to |I|=k|I|=k. The sequence (f1,f2,,fT)(f_{1},f_{2},\ldots,f_{T}) is kk-contaminated 1-strongly convex, and the minimum value of t=1Tft\sum_{t=1}^{T}f_{t} is (2kT3k2)/(2T2k)(2kT-3k^{2})/(2T-2k) whenever 2k<T2k<T. We repeat this experiment with 100 different random seeds and report the mean and standard deviation. The other parameters are shown in Table 3.
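As in Experiment 1, the closed-form minimum can be verified numerically. The sketch below (using the parameters of Table 3; the variable names are ours) compares the cumulative loss at its stationary point with the stated expression:

```python
# Parameters of Table 3 (T rounds, k of them contaminated).
T, k = 1000, 250

# Cumulative loss: k contaminated linear rounds plus (T - k) strongly convex rounds.
def total(x):
    return k * x + (T - k) * 0.5 * (x - 1.0) ** 2

# First-order condition: k + (T - k)(x - 1) = 0, i.e. x* = 1 - k/(T - k).
x_star = 1.0 - k / (T - k)  # = 2/3 for these parameters
closed_form = (2 * k * T - 3 * k ** 2) / (2 * T - 2 * k)
assert abs(total(x_star) - closed_form) < 1e-9
```

Note that x* = 2/3 matches the optimal solution listed in Table 3.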

We compare the performance of three OCO algorithms: OGD, MetaGrad, and Con-OGD.

Table 3: Parameter setting in Experiment 2.

x1=0x_{1}=0, x=2/3x^{\ast}=2/3, T=1000T=1000, k=250k=250, D=1D=1, G=1G=1, λ=1\lambda=1

The time variation of regret and xtx_{t} is shown in Figure 2. Only points where tt is a multiple of 25 are plotted to keep the graphs readable. The left graph shows that the regrets of all methods are sublinear. Con-OGD performs best, followed by MetaGrad and OGD, which is consistent with the theoretical results. The right graph confirms that xtx_{t} converges quickly to the optimal solution for all methods.


Figure 2: The comparison of the time variation of regret (left) and xtx_{t} (right) in Experiment 2.

C.3 Experiment 3: Mini-Batch Least Mean Square Regressions

The experimental settings are as follows. We use the squared loss as the objective function:

ft(𝒙):=1ni=1n(𝒂t,i,𝒙bt,i)2,f_{t}(\bm{x}):=\frac{1}{n}\sum_{i=1}^{n}(\left\langle{\bm{a}_{t,i}},{\bm{x}}\right\rangle-b_{t,i})^{2},

which is exemplified in Example 3.7. In this experiment, each component of the vector 𝒂t,i\bm{a}_{t,i} is drawn independently from the uniform distribution on [1,2][1,2]. We set 𝒳=[0,1]d\mathcal{X}=[0,1]^{d} and assume there exists an optimal solution 𝒙\bm{x}^{\ast} drawn from the uniform distribution on 𝒳\mathcal{X}; that is, we take bt,i=𝒂t,i,𝒙b_{t,i}=\left\langle{\bm{a}_{t,i}},{\bm{x}^{\ast}}\right\rangle. We fix kk first and compute the thresholds α\alpha and λ\lambda from kk, although this is impossible in real applications. The parameters GG, λ\lambda, and α\alpha are calculated for each sequence of 𝒂t,i\bm{a}_{t,i} and bt,ib_{t,i}; e.g., G429G\simeq 429, λ0.0969\lambda\simeq 0.0969, and α5.28×107\alpha\simeq 5.28\times 10^{-7} for one such sequence. The parameters of ONS are set as described in Section A.2. The other parameters are shown in Table 4.
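The data generation described above can be sketched as follows. The function `make_round` and its defaults are our own illustrative choices, not the code used in the experiments:

```python
import numpy as np

def make_round(n=10, d=5, seed=0):
    """Generate one round of the mini-batch least-squares loss."""
    rng = np.random.default_rng(seed)
    # Each feature component is drawn uniformly from [1, 2].
    A = rng.uniform(1.0, 2.0, size=(n, d))
    # Plant an optimal solution in [0, 1]^d; the targets are noiseless,
    # so x* attains zero loss in every round.
    x_star = rng.uniform(0.0, 1.0, size=d)
    b = A @ x_star

    def f(x):
        # Mini-batch squared loss averaged over the n samples.
        return np.mean((A @ x - b) ** 2)

    return f, x_star
```

Because the targets are noiseless, f(x*) = 0 holds in every round, so the offline optimum attains zero cumulative loss.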

Table 4: Parameter setting in Experiment 3.

𝒙1=𝟎\bm{x}_{1}=\bm{0}, n=10n=10, d=5d=5, T=1000T=1000, k=250k=250, D=5D=\sqrt{5}

Figure 3: The comparison of the time variation of regret (left) and 𝒙t\|\bm{x}_{t}\| (right).

The time variation of regret and 𝒙t\|\bm{x}_{t}\| is shown in Figure 3. OGD, Con-OGD, and Con-ONS perform almost identically in this experiment. The left graph shows that the regrets of OGD, MetaGrad, and our proposed methods are sublinear, consistent with the theoretical results. Although it is outside the plotted range, the regret of ONS remains almost linear even when we take T=10000T=10000. The right graph confirms that 𝒙t\|\bm{x}_{t}\| converges quickly to some point for OGD, MetaGrad, and the proposed methods, whereas that of ONS changes little. ONS performs poorly because γ\gamma is too small to allow sufficiently large stepsizes; this result shows the vulnerability of ONS.