
Non-stationary Online Convex Optimization with Arbitrary Delays

Yuanyu Wan    Chang Yao    Mingli Song    Lijun Zhang
Abstract

Online convex optimization (OCO) with arbitrary delays, in which gradients or other information of functions could be arbitrarily delayed, has received increasing attention recently. Different from previous studies that focus on stationary environments, this paper investigates the delayed OCO in non-stationary environments, and aims to minimize the dynamic regret with respect to any sequence of comparators. To this end, we first propose a simple algorithm, namely DOGD, which performs a gradient descent step for each delayed gradient according to their arrival order. Despite its simplicity, our novel analysis shows that the dynamic regret of DOGD can be automatically bounded by $O(\sqrt{\bar{d}T}(P_T+1))$ under mild assumptions, and $O(\sqrt{dT}(P_T+1))$ in the worst case, where $\bar{d}$ and $d$ denote the average and maximum delay respectively, $T$ is the time horizon, and $P_T$ is the path-length of comparators. Furthermore, we develop an improved algorithm, which reduces those dynamic regret bounds achieved by DOGD to $O(\sqrt{\bar{d}T(P_T+1)})$ and $O(\sqrt{dT(P_T+1)})$, respectively. The key idea is to run multiple DOGD with different learning rates, and utilize a meta-algorithm to track the best one based on their delayed performance. Finally, we demonstrate that our improved algorithm is optimal in the worst case by deriving a matching lower bound.


1 Introduction

Online convex optimization (OCO) has become a popular paradigm for solving sequential decision-making problems (Shalev-Shwartz, 2011; Hazan, 2016; Orabona, 2019). In OCO, an online player acts as the decision maker, which chooses a decision $\mathbf{x}_t$ from a convex set $\mathcal{K}\subseteq\mathbb{R}^n$ at each round $t\in[T]$. After the decision $\mathbf{x}_t$ is committed, the player suffers a loss $f_t(\mathbf{x}_t)$, where $f_t(\mathbf{x}):\mathcal{K}\mapsto\mathbb{R}$ is a convex function selected by an adversary. To improve the performance in subsequent rounds, the player needs to update the decision by exploiting information about loss functions in previous rounds. Plenty of algorithms and theories have been introduced to guide the player (Zinkevich, 2003; Shalev-Shwartz & Singer, 2007; Hazan et al., 2007).

However, most existing studies assume that the information about each function $f_t(\mathbf{x})$ is revealed at the end of round $t$, which is not necessarily satisfied in many real applications. For example, in online advertisement (McMahan et al., 2013; He et al., 2014), each loss function depends on whether a user clicks an ad or not, which may not be decided even when the user has observed the ad for a long period of time. To tackle this issue, there has been a surge of research interest in OCO with arbitrary delays (Joulani et al., 2013; McMahan & Streeter, 2014; Quanrud & Khashabi, 2015; Joulani et al., 2016; Flaspohler et al., 2021; Wan et al., 2022a, b, 2023a), where the information about $f_t(\mathbf{x})$ is revealed at the end of round $t+d_t-1$, and $d_t\geq 1$ denotes the delay. However, these studies focus on developing algorithms to minimize the static regret of the player, i.e., $R(T)=\sum_{t=1}^{T}f_t(\mathbf{x}_t)-\min_{\mathbf{x}\in\mathcal{K}}\sum_{t=1}^{T}f_t(\mathbf{x})$, which is only meaningful for stationary environments where at least one fixed decision can minimize the cumulative loss well, and thus cannot handle non-stationary environments where the best decision is drifting over time.

To address this limitation, we investigate the delayed OCO with a more suitable performance metric called dynamic regret (Zinkevich, 2003):

$$R(\mathbf{u}_1,\dots,\mathbf{u}_T)=\sum_{t=1}^{T}f_t(\mathbf{x}_t)-\sum_{t=1}^{T}f_t(\mathbf{u}_t)$$

which compares the player against any sequence of changing comparators $\mathbf{u}_1,\dots,\mathbf{u}_T\in\mathcal{K}$. It is well-known that in the non-delayed setting, online gradient descent (OGD) can attain a dynamic regret bound of $O(\sqrt{T}(P_T+1))$ (Zinkevich, 2003), where $P_T=\sum_{t=2}^{T}\|\mathbf{u}_t-\mathbf{u}_{t-1}\|_2$ is the path-length of comparators, and multiple OGD with different learning rates can be combined to achieve an optimal dynamic regret bound of $O(\sqrt{T(P_T+1)})$ by using a meta-algorithm (Zhang et al., 2018a). Thus, it is natural to ask whether these algorithms and dynamic regret bounds can be generalized into the setting with arbitrary delays.

In this paper, we provide an affirmative answer to the above question. Specifically, we first propose delayed online gradient descent (DOGD), and provide a novel analysis on its dynamic regret. In the literature, Quanrud & Khashabi (2015) have developed a delayed variant of OGD for minimizing the static regret, which performs a gradient descent step by using the sum of gradients received in each round. Different from their algorithm, our DOGD performs a gradient descent step for each delayed gradient according to their arrival order, which allows us to exploit an In-Order property (i.e., delays do not change the arrival order of gradients) to reduce the dynamic regret. Let $\bar{d}=\sum_{t=1}^{T}d_t/T$ and $d=\max\{d_1,\dots,d_T\}$ denote the average and maximum delay, respectively. Our analysis shows that the dynamic regret of DOGD can be automatically bounded by $O(\sqrt{\bar{d}T}(P_T+1))$ under mild assumptions such as the In-Order property, and $O(\sqrt{dT}(P_T+1))$ in the worst case.

Furthermore, inspired by Zhang et al. (2018a), we propose an improved algorithm based on DOGD, namely multiple delayed online gradient descent (Mild-OGD). The essential idea is to run multiple DOGD, each with a different learning rate that enjoys small dynamic regret for a specific path-length, and combine them with a meta-algorithm. Compared with Zhang et al. (2018a), the key challenge is that the performance of each DOGD is required by the meta-algorithm, but it is also arbitrarily delayed. To address this difficulty, our meta-algorithm is built upon the delayed Hedge, a technique for prediction with delayed expert advice (Korotin et al., 2020), which can track the best DOGD based on their delayed performance. We prove that the dynamic regret of Mild-OGD can be automatically bounded by $O(\sqrt{\bar{d}T(P_T+1)})$ under mild assumptions such as the In-Order property, and $O(\sqrt{dT(P_T+1)})$ in the worst case. In the special case without delay, both bounds reduce to the $O(\sqrt{T(P_T+1)})$ bound achieved by Zhang et al. (2018a). Finally, we demonstrate that our Mild-OGD is optimal in the worst case by deriving a matching lower bound.

2 Related Work

In this section, we briefly review related work on OCO with arbitrary delays and the dynamic regret.

2.1 OCO with Arbitrary Delays

To deal with arbitrary delays, Joulani et al. (2013) first propose a black-box technique, which can extend any non-delayed OCO algorithm into the delayed setting. The main idea is to pool multiple instances of the non-delayed algorithm, each of which runs over a subsequence of rounds that satisfies the non-delayed assumption. Moreover, Joulani et al. (2013) show that if the non-delayed algorithm has a static regret bound of $R(T)$, this technique can attain a static regret bound of $dR(T/d)$. Notice that in the non-delayed setting, there exist plenty of algorithms with an $O(\sqrt{T})$ static regret bound, such as OGD (Zinkevich, 2003). As a result, combining with OGD, this technique can achieve a static regret bound of $O(\sqrt{dT})$. However, despite the generality of this technique, it needs to run multiple instances of the non-delayed algorithm, which could be prohibitively resource-intensive (Quanrud & Khashabi, 2015; Joulani et al., 2016). For this reason, instead of adopting the technique of Joulani et al. (2013), subsequent studies extend many specific non-delayed OCO algorithms into the delayed setting by only running a single instance of them with the delayed information about all loss functions.

Specifically, Quanrud & Khashabi (2015) propose a delayed variant of OGD, and reduce the static regret to $O(\sqrt{\bar{d}T})$, which depends on the average delay $\bar{d}$, instead of the maximum delay $d$. By additionally assuming that the In-Order property holds, McMahan & Streeter (2014) develop a delayed variant of the adaptive gradient (AdaGrad) algorithm (McMahan & Streeter, 2010; Duchi et al., 2011), and establish a data-dependent static regret bound, which could be tighter than $O(\sqrt{dT})$ for sparse data. Later, Joulani et al. (2016) propose another delayed variant of AdaGrad, which can attain a data-dependent static regret bound without the In-Order property. Recently, Flaspohler et al. (2021) develop delayed variants of optimistic algorithms (Rakhlin & Sridharan, 2013; Joulani et al., 2017), which can make use of "hints" about expected future loss functions to improve the $O(\sqrt{dT})$ static regret. Wan et al. (2022a) extend the delayed variant of OGD (Quanrud & Khashabi, 2015) to further exploit the strong convexity of functions. Wan et al. (2022b, 2023a) develop a delayed variant of online Frank-Wolfe (Hazan & Kale, 2012), and obtain a static regret bound of $O(T^{3/4}+\bar{d}T^{1/4})$. Their algorithm is projection-free and can be efficiently implemented over complex constraints. We also notice that Korotin et al. (2020) consider the problem of prediction with expert advice, a special case of OCO with linear functions and simplex decision sets, and propose a delayed variant of Hedge (Freund & Schapire, 1997) to achieve the $O(\sqrt{\bar{d}T})$ static regret.

2.2 Dynamic Regret

Dynamic regret of OCO is first introduced by Zinkevich (2003), who demonstrates that OGD can attain a dynamic regret bound of $O(\sqrt{T}(P_T+1))$ by simply utilizing a constant learning rate. Later, Zhang et al. (2018a) establish a lower bound of $\Omega(\sqrt{T(P_T+1)})$ for the dynamic regret. Moreover, to improve the upper bound, Zhang et al. (2018a) propose a novel algorithm that runs multiple instances of OGD with different learning rates in parallel, and tracks the best one via Hedge (Freund & Schapire, 1997). Although the strategy of maintaining multiple learning rates is originally proposed to adaptively minimize the static regret for multiple types of functions (van Erven & Koolen, 2016; van Erven et al., 2021), Zhang et al. (2018a) extend it to achieve an optimal dynamic regret bound of $O(\sqrt{T(P_T+1)})$. Subsequent studies achieve tighter dynamic regret bounds for special types of data (Cutkosky, 2020) and functions (Zhao et al., 2020; Baby & Wang, 2021, 2022, 2023), and reduce the computational complexity for handling complex constraints (Zhao et al., 2022; Wang et al., 2024). Besides, there also exist plenty of studies (Jadbabaie et al., 2015; Besbes et al., 2015; Yang et al., 2016; Mokhtari et al., 2016; Zhang et al., 2017, 2018b; Baby & Wang, 2019; Wan et al., 2021, 2023b; Zhao & Zhang, 2021; Wang et al., 2021, 2023) that focus on a restricted form of the dynamic regret, in which $\mathbf{u}_t=\mathbf{x}_t^{\ast}\in\operatorname*{argmin}_{\mathbf{x}\in\mathcal{K}}f_t(\mathbf{x})$. However, as discussed by Zhang et al. (2018a), the restricted dynamic regret is too pessimistic and less flexible than the general one.

2.3 Discussions

Although both arbitrary delays and the dynamic regret have attracted much research interest, it is still unclear how arbitrary delays affect the dynamic regret. Recently, Wang et al. (2021, 2023) have demonstrated that, under a fixed and knowable delay $d'$, simply performing OGD with a delayed gradient $\nabla f_{t-d'+1}(\mathbf{x}_{t-d'+1})$ is able to achieve a restricted dynamic regret bound of $O(\sqrt{d'T(P_T^{\ast}+1)})$ when $P_T^{\ast}=\sum_{t=2}^{T}\|\mathbf{x}_t^{\ast}-\mathbf{x}_{t-1}^{\ast}\|_2$ is also knowable. (Note that Wang et al. (2021, 2023) aim to handle a special decision set with long-term constraints, and thus their algorithm is more complicated than OGD with the delayed gradient. Here, we omit other details of their algorithm because such a decision set is beyond the scope of this paper.) However, their algorithm and theoretical results do not apply to the general dynamic regret under arbitrary delays. Moreover, one may try to extend existing algorithms with dynamic regret bounds into the delayed setting via the black-box technique of Joulani et al. (2013). However, we want to emphasize that they focus on the static regret, and their analysis cannot directly yield a dynamic regret bound. In addition, since their technique does not achieve the $O(\sqrt{\bar{d}T})$ static regret, it seems also unable to achieve the $O(\sqrt{\bar{d}T(P_T+1)})$ dynamic regret even under the In-Order assumption.

3 Main Results

In this section, we first introduce necessary assumptions, and then present our DOGD and Mild-OGD. Finally, we provide a matching lower bound to demonstrate the optimality of our Mild-OGD in the worst case.

3.1 Assumptions

Assumption 3.1.

The gradients of all functions are bounded by $G$, i.e., $\|\nabla f_t(\mathbf{x})\|_2\leq G$ for any $\mathbf{x}\in\mathcal{K}$ and $t\in[T]$.

Assumption 3.2.

The decision set $\mathcal{K}$ contains the origin $\mathbf{0}$, and its diameter is bounded by $D$, i.e., $\|\mathbf{x}-\mathbf{y}\|_2\leq D$ for any $\mathbf{x},\mathbf{y}\in\mathcal{K}$.

Assumption 3.3.

Delays do not change the arrival order of gradients, i.e., the gradient $\nabla f_i(\mathbf{x}_i)$ is received before the gradient $\nabla f_j(\mathbf{x}_j)$, for any $1\leq i<j\leq T$.

Remark: The first two assumptions have been commonly utilized in previous studies on OCO (Shalev-Shwartz, 2011; Hazan, 2016). To further justify the rationality of Assumption 3.3, we notice that parallel and distributed optimization (McMahan & Streeter, 2014; Zhou et al., 2018) is also a representative application of delayed OCO. For parallel optimization with many threads, the delay is mainly caused by the computing time of gradients. Thus, as in McMahan & Streeter (2014), it is reasonable to assume that these delays satisfy the In-Order assumption, because the gradient computed first is more likely to be obtained first. Even for general parallel and distributed optimization, polynomially growing delays, which imply $d_i\leq d_j$ for $i<j$ and thus satisfy the In-Order assumption, have received much attention in recent years (Zhou et al., 2018; Ren et al., 2020; Zhou et al., 2022). Moreover, we want to emphasize that Assumption 3.3 is only utilized to achieve the dynamic regret bound depending on the average delay $\bar{d}$, and the case without this assumption is also considered.

3.2 DOGD with Dynamic Regret

In the following, we first introduce detailed procedures of DOGD, and then present its theoretical guarantees.

3.2.1 Detailed Procedures

Recall that in the non-delayed setting, the classical OGD algorithm (Zinkevich, 2003) at each round $t$ updates the decision as

$$\mathbf{x}_{t+1}=\operatorname*{argmin}_{\mathbf{x}\in\mathcal{K}}\|\mathbf{x}-(\mathbf{x}_t-\eta\nabla f_t(\mathbf{x}_t))\|_2^2 \quad (1)$$

where $\eta$ is a learning rate. To handle the setting with arbitrary delays, Quanrud & Khashabi (2015) have proposed a delayed variant of OGD by replacing $\nabla f_t(\mathbf{x}_t)$ with the sum of gradients received in round $t$. However, it ignores the arrival order of gradients, and thus cannot benefit from the In-Order property when minimizing the dynamic regret.

Algorithm 1 DOGD
1:  Input: a learning rate $\eta$
2:  Initialization: set $\mathbf{y}_1=\mathbf{0}$ and $\tau=1$
3:  for $t=1,\dots,T$ do
4:     Play $\mathbf{x}_t=\mathbf{y}_\tau$ and query $\nabla f_t(\mathbf{x}_t)$
5:     Receive $\{\nabla f_k(\mathbf{x}_k)\,|\,k\in\mathcal{F}_t\}$
6:     for $k\in\mathcal{F}_t$ (in the ascending order) do
7:        Compute $\mathbf{y}_{\tau+1}$ as in (2) and set $\tau=\tau+1$
8:     end for
9:  end for

To address this limitation, we propose a new delayed variant of OGD, which performs a gradient descent step for each delayed gradient according to their arrival order. Specifically, our algorithm is named delayed online gradient descent (DOGD) and outlined in Algorithm 1, where $\tau$ records the number of generated decisions and $\mathbf{y}_\tau$ denotes the $\tau$-th generated decision. Initially, we set $\mathbf{y}_1=\mathbf{0}$ and $\tau=1$. At each round $t\in[T]$, we first play the latest decision $\mathbf{x}_t=\mathbf{y}_\tau$ and query the gradient $\nabla f_t(\mathbf{x}_t)$.

After that, due to the effect of arbitrary delays, we receive a set of delayed gradients $\{\nabla f_k(\mathbf{x}_k)\,|\,k\in\mathcal{F}_t\}$, where

$$\mathcal{F}_t=\{k\in[T]\,|\,k+d_k-1=t\}.$$

For each $k\in\mathcal{F}_t$, inspired by (1), we perform the following update

$$\mathbf{y}_{\tau+1}=\operatorname*{argmin}_{\mathbf{x}\in\mathcal{K}}\|\mathbf{x}-(\mathbf{y}_\tau-\eta\nabla f_k(\mathbf{x}_k))\|_2^2 \quad (2)$$

and then set $\tau=\tau+1$. Moreover, to utilize the In-Order property, elements in the set $\mathcal{F}_t$ are sorted and traversed in the ascending order. For example, with delays $d_1=2$ and $d_2=1$, both gradients arrive at the end of round 2, so $\mathcal{F}_2=\{1,2\}$ and they are applied in the order $\nabla f_1(\mathbf{x}_1),\nabla f_2(\mathbf{x}_2)$.
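To make these procedures concrete, the following Python snippet gives a minimal sketch of Algorithm 1. For illustration only, it assumes the decision set $\mathcal{K}$ is the Euclidean ball of diameter $D$ centered at the origin (so the projection in (2) has a closed form); the interfaces `grad_fn` and `delays` are our own illustrative assumptions rather than part of the paper.

```python
import numpy as np

def dogd(grad_fn, delays, T, eta, D, n):
    """Sketch of DOGD: grad_fn(t, x) returns grad f_t(x), and delays[t] = d_t >= 1
    (1-indexed), so the gradient queried at round t arrives at the end of
    round t + d_t - 1. K is assumed to be the ball of diameter D."""
    y = np.zeros(n)          # y_1 = 0: the latest generated decision y_tau
    pending = {}             # arrival round -> list of (time-stamp k, gradient)
    decisions = []
    for t in range(1, T + 1):
        x_t = y.copy()       # play x_t = y_tau and query grad f_t(x_t)
        decisions.append(x_t)
        g_t = grad_fn(t, x_t)
        pending.setdefault(t + delays[t] - 1, []).append((t, g_t))
        # receive {grad f_k(x_k) | k in F_t}, traversed in ascending order of k
        for k, g_k in sorted(pending.pop(t, []), key=lambda kg: kg[0]):
            y = y - eta * g_k                  # gradient step as in (2)
            norm = np.linalg.norm(y)
            if norm > D / 2:                   # project back onto K
                y = (D / 2) * y / norm
    return decisions
```

In the non-delayed case ($d_t=1$ for all $t$), the inner loop fires exactly once per round and the sketch reduces to standard OGD.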

3.2.2 Theoretical Guarantees

We notice that due to the effect of delays, there could exist some gradients that arrive after round $T$. Although our DOGD does not need to utilize these gradients, they are useful to facilitate our analysis and discussion. Therefore, in the analysis of DOGD, we virtually set $\mathbf{x}_t=\mathbf{y}_\tau$ and perform steps 5 to 8 in Algorithm 1 at some additional rounds $t=T+1,\dots,T+d-1$. In this way, all queried gradients are utilized to generate decisions

$$\mathbf{y}_1,\dots,\mathbf{y}_{T+1}.$$

Moreover, we denote the time-stamp of the $\tau$-th utilized gradient by $c_\tau$. To help understanding, one can imagine that DOGD also sets $c_\tau=k$ at the beginning of its step 7.

Then, we establish the following theorem with only Assumptions 3.1 and 3.2.

Theorem 3.4.

Under Assumptions 3.1 and 3.2, for any comparator sequence $\mathbf{u}_1,\dots,\mathbf{u}_T\in\mathcal{K}$, Algorithm 1 ensures

$$R(\mathbf{u}_1,\dots,\mathbf{u}_T)\leq\frac{D^2+DP_T}{\eta}+\eta G^2\sum_{t=1}^{T}m_t+\sum_{t=1}^{T}G\|\mathbf{u}_t-\mathbf{u}_{c_t}\|_2 \quad (3)$$

where $m_t=t-\sum_{i=1}^{t-1}|\mathcal{F}_i|$.

Remark: The value of $m_t-1$ actually counts the number of gradients that have been queried, but still not received at the end of round $t-1$. Since the gradient $\nabla f_t(\mathbf{x}_t)$ will only be counted as an unreceived gradient in $d_t-1$ rounds, it is easy to verify that

$$\sum_{t=1}^{T}m_t\leq\sum_{t=1}^{T}d_t=\bar{d}T. \quad (4)$$

Therefore, the first two terms on the right-hand side of (3) are upper bounded by $(2D+P_T)G\sqrt{\bar{d}T}$ so long as

$$\eta=\frac{D}{G\sqrt{\sum_{t=1}^{T}m_t}}. \quad (5)$$
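Notably, $m_t$ is defined purely by counting arrivals, so it can be computed on the fly, a fact that also underlies the doubling trick discussed later. The following minimal sketch (with an illustrative delay sequence of our own choosing) computes $m_t$ and checks the bound (4):

```python
def outstanding_counts(delays):
    """Sketch: compute m_t = t - sum_{i<t} |F_i| from a 1-indexed delay list;
    m_t - 1 is the number of gradients queried before round t that remain
    unreceived at the end of round t - 1."""
    T = len(delays) - 1
    arrivals = {}                          # round i -> |F_i|
    for k in range(1, T + 1):
        i = k + delays[k] - 1
        arrivals[i] = arrivals.get(i, 0) + 1
    m, received = [], 0
    for t in range(1, T + 1):
        m.append(t - received)             # m_t
        received += arrivals.get(t, 0)
    return m

delays = [None, 2, 1, 3, 1]                # an illustrative sequence d_1, ..., d_4
assert outstanding_counts(delays) == [1, 2, 1, 2]
assert sum(outstanding_counts(delays)) <= sum(delays[1:])   # the bound (4)
```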

However, we still need to bound the last term on the right-hand side of (3), which reflects the "comparator drift" caused by arbitrary delays, and has never appeared in previous studies on the delayed feedback and dynamic regret.

To this end, we establish the following lemma regarding the comparator drift.

Lemma 3.5.

Under Assumption 3.2, for any comparator sequence $\mathbf{u}_1,\dots,\mathbf{u}_T\in\mathcal{K}$, Algorithm 1 ensures

$$\sum_{t=1}^{T}\|\mathbf{u}_t-\mathbf{u}_{c_t}\|_2\leq\min\left\{KD,2dP_T\right\}\leq\sqrt{2dKDP_T}$$

where $K=\sum_{t=1}^{T}\mathbb{I}(t\neq c_t)$ and $\mathbb{I}(\cdot)$ denotes the indicator function.

Remark: Since Algorithm 1 utilizes the received gradients in the ascending order, the value of $K$ counts the number of delays that are not in order. Therefore, Lemma 3.5 implies that the comparator drift can be upper bounded by $O(\sqrt{dTP_T})$ in the worst case because of $K\leq T$, and vanishes if the In-Order property holds, i.e., $K=0$. To facilitate discussions, we mainly focus on these two extremes, though the comparator drift can be bounded by $O(\sqrt{\bar{d}TP_T})$ in an intermediate case with $K\leq O(T\bar{d}/d)$.

By further combining Theorem 3.4 with (4) and Lemma 3.5, we derive the following corollary.

Corollary 3.6.

Under Assumptions 3.1 and 3.2, by setting $\eta$ as in (5), Algorithm 1 ensures

$$R(\mathbf{u}_1,\dots,\mathbf{u}_T)\leq(2D+P_T)G\sqrt{\bar{d}T}+C$$

for any comparator sequence $\mathbf{u}_1,\dots,\mathbf{u}_T\in\mathcal{K}$, where

$$C=\begin{cases}0,&\text{if Assumption 3.3 also holds;}\\\min\left\{TGD,2dGP_T\right\},&\text{otherwise.}\end{cases} \quad (6)$$

Remark: From Corollary 3.6, our DOGD enjoys a dynamic regret bound of $O(\sqrt{\bar{d}T}(P_T+1)+C)$, which is adaptive to the upper bound of comparator drift. First, because of $\min\left\{TGD,2dGP_T\right\}\leq G\sqrt{2dTDP_T}$ and $\bar{d}\leq d$, the dynamic regret of DOGD can be bounded by $O(\sqrt{dT}(P_T+1))$ in the worst case, which magnifies the $O(\sqrt{T}(P_T+1))$ dynamic regret of OGD (Zinkevich, 2003) in the non-delayed setting by a coefficient depending on the maximum delay $d$. Second, in case $C\leq O(\sqrt{\bar{d}T}P_T)$, the dynamic regret of DOGD automatically reduces to $O(\sqrt{\bar{d}T}(P_T+1))$, which depends on the average delay. According to (6), this condition can be simply satisfied for all possible $P_T$ when the In-Order property holds or $d\leq\sqrt{\bar{d}T}$. Third, by substituting $\mathbf{u}_1=\dots=\mathbf{u}_T$ into Corollary 3.6, we find that DOGD can attain a static regret bound of $O(\sqrt{\bar{d}T})$ for arbitrary delays, which matches the best existing result (Quanrud & Khashabi, 2015).

Remark: At first glance, Corollary 3.6 needs to set the learning rate as in (5), which may become a limitation of DOGD, because the value of $\sum_{t=1}^{T}m_t$ is generally unknown in practice. However, we note that Quanrud & Khashabi (2015) also face this issue when minimizing the static regret of OCO with arbitrary delays, and have introduced a simple solution by utilizing the standard "doubling trick" (Cesa-Bianchi et al., 1997) to adaptively adjust the learning rate. The main insight behind this solution is that the value of $\sum_{t=1}^{T}m_t$ can be calculated on the fly. The details about DOGD with the doubling trick are provided in the appendix.
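As a rough illustration of this idea (the exact restart schedule and its guarantee are deferred to the appendix), the sketch below tracks the running sum of $m_t$, doubles a budget $2^s$ whenever the sum exceeds it, and recomputes the learning rate in (5) with the budget in place of the unknown $\sum_{t=1}^{T}m_t$:

```python
def doubling_trick_learning_rates(G, D, m_stream):
    """Rough sketch of the doubling trick: m_1, m_2, ... arrive on the fly,
    and the learning rate mimics (5) with the running budget 2^s standing in
    for the unknown sum_t m_t. A full implementation would also restart DOGD
    from y_1 = 0 at each budget doubling."""
    s, running = 1, 0
    for m_t in m_stream:
        running += m_t
        while running > 2 ** s:   # budget exceeded: double it (possibly repeatedly)
            s += 1
        yield D / (G * (2 ** s) ** 0.5)
```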

3.3 Mild-OGD with Improved Dynamic Regret

One unsatisfactory point of DOGD is that its dynamic regret depends linearly on the path-length. Notice that if only a specific path-length $P_T$ is considered, from Theorem 3.4, we can tune the learning rate as

$$\eta_{\ast}=\frac{\sqrt{D(D+P_T)}}{G\sqrt{\sum_{t=1}^{T}m_t}}$$

and obtain a dynamic regret sublinear in $P_T$. However, our goal is to minimize the dynamic regret with respect to any possible path-length $P_T$. To address this dilemma, inspired by Zhang et al. (2018a), we develop an algorithm that runs multiple DOGD as experts, each with a different learning rate for a specific path-length, and combines them with a meta-algorithm.

It is worth noting that the meta-algorithm of Zhang et al. (2018a) is incompatible with the delayed setting studied here. To this end, we adopt the delayed Hedge (Korotin et al., 2020), an expert-tracking method under arbitrary delays, to design our meta-algorithm. Moreover, there exist two options for the meta-algorithm to maintain these expert-algorithms: running them over the original functions $\{f_t(\mathbf{x})\}_{t\in[T]}$ or the surrogate functions $\{\ell_t(\mathbf{x})\}_{t\in[T]}$, where

$$\ell_t(\mathbf{x})=\langle\nabla f_t(\mathbf{x}_t),\mathbf{x}-\mathbf{x}_t\rangle \quad (7)$$

and $\mathbf{x}_t$ is the decision of the meta-algorithm. In this paper, we choose the second option, because the surrogate functions allow expert-algorithms to reuse the gradient of the meta-algorithm, and thus can avoid inconsistent delays between the meta-algorithm and expert-algorithms. Specifically, our algorithm is named multiple delayed online gradient descent (Mild-OGD), and stated below.

Algorithm 2 Mild-OGD: Meta-algorithm
1:  Input: a parameter $\alpha$ and a set $\mathcal{H}$ containing learning rates for experts
2:  Activate a set of experts $\{E^{\eta}\,|\,\eta\in\mathcal{H}\}$ by invoking the expert-algorithm for each learning rate $\eta\in\mathcal{H}$
3:  Sort learning rates in the ascending order, i.e., $\eta_1\leq\dots\leq\eta_{|\mathcal{H}|}$, and set $w_1^{\eta_i}=\frac{|\mathcal{H}|+1}{i(i+1)|\mathcal{H}|}$
4:  for $t=1,\dots,T$ do
5:     Receive $\mathbf{x}_t^{\eta}$ from each expert $E^{\eta}$
6:     Play the decision $\mathbf{x}_t=\sum_{\eta\in\mathcal{H}}w_t^{\eta}\mathbf{x}_t^{\eta}$
7:     Query $\nabla f_t(\mathbf{x}_t)$ and receive $\{\nabla f_k(\mathbf{x}_k)\,|\,k\in\mathcal{F}_t\}$
8:     Update the weight of each expert as in (8)
9:     Send $\{\nabla f_k(\mathbf{x}_k)\,|\,k\in\mathcal{F}_t\}$ to each expert $E^{\eta}$
10:  end for
Meta-algorithm

Let $\mathcal{H}$ denote a set of learning rates for experts. We first activate a set of experts $\{E^{\eta}\,|\,\eta\in\mathcal{H}\}$ by invoking the expert-algorithm for each learning rate $\eta\in\mathcal{H}$. Let $\eta_i$ be the $i$-th smallest learning rate in $\mathcal{H}$. Following Zhang et al. (2018a), the initial weight of each expert $E^{\eta_i}$ is set as

$$w_1^{\eta_i}=\frac{|\mathcal{H}|+1}{i(i+1)|\mathcal{H}|}.$$

In each round $t\in[T]$, our meta-algorithm receives a decision $\mathbf{x}_t^{\eta}$ from each expert $E^{\eta}$, and then plays the weighted decision

$$\mathbf{x}_t=\sum_{\eta\in\mathcal{H}}w_t^{\eta}\mathbf{x}_t^{\eta}.$$

After that, it queries the gradient $\nabla f_t(\mathbf{x}_t)$, but only receives $\{\nabla f_k(\mathbf{x}_k)\,|\,k\in\mathcal{F}_t\}$ due to the effect of arbitrary delays. Then, according to the delayed Hedge (Korotin et al., 2020), we update the weight of each expert as

$$w_{t+1}^{\eta}=\frac{w_t^{\eta}e^{-\alpha\sum_{k\in\mathcal{F}_t}\ell_k(\mathbf{x}_k^{\eta})}}{\sum_{\mu\in\mathcal{H}}w_t^{\mu}e^{-\alpha\sum_{k\in\mathcal{F}_t}\ell_k(\mathbf{x}_k^{\mu})}} \quad (8)$$

where $\alpha$ is a parameter and $\ell_k(\mathbf{x})$ is defined in (7). This is the critical difference between our meta-algorithm and that of Zhang et al. (2018a), which updates the weights according to the vanilla Hedge (Cesa-Bianchi et al., 1997).

Finally, we send the gradients $\{\nabla f_k(\mathbf{x}_k)\,|\,k\in\mathcal{F}_t\}$ to each expert $E^{\eta}$ so that they can update their own decisions without querying additional gradients. The detailed procedures of our meta-algorithm are summarized in Algorithm 2.
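For concreteness, here is a minimal sketch of the initialization in step 3 and the delayed Hedge update (8); the pre-aggregated vector `delayed_losses` (one entry per expert, summing the surrogate losses that arrived in round $t$) is an assumed input for illustration:

```python
import numpy as np

def initial_weights(N):
    """Initial weights w_1^{eta_i} = (N + 1) / (i (i + 1) N) from step 3 of
    Algorithm 2; by telescoping they sum to one."""
    return np.array([(N + 1) / (i * (i + 1) * N) for i in range(1, N + 1)])

def delayed_hedge_step(w, alpha, delayed_losses):
    """One weight update as in (8). delayed_losses[j] is assumed to hold
    sum_{k in F_t} l_k(x_k^{eta_j}), i.e., the surrogate losses of expert j
    whose gradients arrived in round t. Computed in log-space for numerical
    stability; with no arrivals (F_t empty) the weights stay unchanged."""
    logits = np.log(w) - alpha * np.asarray(delayed_losses, dtype=float)
    logits -= logits.max()           # shift before exponentiating
    w_new = np.exp(logits)
    return w_new / w_new.sum()       # the normalization in (8)
```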

Algorithm 3 Mild-OGD: Expert-algorithm
1:  Input: a learning rate $\eta$
2:  Initialization: set $\mathbf{y}_1^{\eta}=\mathbf{0}$ and $\tau=1$
3:  for $t=1,\dots,T$ do
4:     Submit $\mathbf{x}_t^{\eta}=\mathbf{y}_\tau^{\eta}$ to the meta-algorithm
5:     Receive gradients $\{\nabla f_k(\mathbf{x}_k)\,|\,k\in\mathcal{F}_t\}$ from the meta-algorithm
6:     for $k\in\mathcal{F}_t$ (in the ascending order) do
7:        Compute $\mathbf{y}_{\tau+1}^{\eta}$ as in (9) and set $\tau=\tau+1$
8:     end for
9:  end for
Expert-algorithm

The expert-algorithm is instantiated by running DOGD over the surrogate loss function defined in (7), instead of the real loss function. To emphasize this difference, we present its procedures in Algorithm 3. The input and initialization are the same as those in DOGD. At each round $t\in[T]$, the expert-algorithm first submits the decision $\mathbf{x}_t^{\eta}=\mathbf{y}_\tau^{\eta}$ to the meta-algorithm, and then receives gradients $\{\nabla f_k(\mathbf{x}_k)\,|\,k\in\mathcal{F}_t\}$ from the meta-algorithm. For each $k\in\mathcal{F}_t$, it updates the decision as

$$\mathbf{y}_{\tau+1}^{\eta}=\operatorname*{argmin}_{\mathbf{x}\in\mathcal{K}}\|\mathbf{x}-(\mathbf{y}_\tau^{\eta}-\eta\nabla f_k(\mathbf{x}_k))\|_2^2 \quad (9)$$

and sets $\tau=\tau+1$.

We have the following theoretical guarantee for the dynamic regret of Mild-OGD.

Theorem 3.7.

Let $m_t=t-\sum_{i=1}^{t-1}|\mathcal{F}_i|$. Under Assumptions 3.1 and 3.2, by setting

$$\mathcal{H}=\left\{\eta_i=\frac{2^{i-1}D}{G\sqrt{\beta}}\,\middle|\,i=1,\dots,N\right\}\text{ and }\alpha=\frac{1}{GD\sqrt{\beta}}$$

where $N=\left\lceil\frac{1}{2}\log_2(T+1)\right\rceil+1$ and $\beta=\sum_{t=1}^{T}m_t$, Algorithm 2 ensures

$$\begin{split}R(\mathbf{u}_1,\dots,\mathbf{u}_T)\leq&\,(3\sqrt{D(D+P_T)}+D)G\sqrt{\bar{d}T}+C+2GD\sqrt{\bar{d}T}\ln\left(k+1\right)\\=&\,O\left(\sqrt{\bar{d}T(P_T+1)}+C\right)\end{split}$$

for any comparator sequence $\mathbf{u}_1,\dots,\mathbf{u}_T\in\mathcal{K}$, where

$$k=\left\lfloor\log_2\sqrt{(P_T+D)/D}\right\rfloor+1$$

and $C$ is defined in (6).

Remark: Theorem 3.7 shows that Mild-OGD can attain a dynamic regret bound of $O(\sqrt{\bar{d}T(P_T+1)}+C)$, which is also adaptive to the upper bound of comparator drift. It is easy to verify that this dynamic regret bound becomes $O(\sqrt{dT(P_T+1)})$ in the worst case. Moreover, it reduces to $O(\sqrt{\bar{d}T(P_T+1)})$ in case $C\leq O(\sqrt{\bar{d}TP_T})$, which can be satisfied for all possible $P_T$ when the In-Order property holds, or for $P_T\leq\bar{d}T/d^2$. Compared with the dynamic regret of DOGD, Mild-OGD reduces the linear dependence on $P_T$ to a sublinear one. Moreover, compared with the optimal $O(\sqrt{T(P_T+1)})$ bound achieved in the non-delayed setting (Zhang et al., 2018a), Mild-OGD magnifies it by a coefficient depending on delays. We also notice that although Theorem 3.7 requires the value of $\sum_{t=1}^{T}m_t$ to tune parameters, as previously discussed, this requirement can be removed by utilizing the doubling trick. The details about Mild-OGD with the doubling trick are provided in the appendix.
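Concretely, the parameter choices of Theorem 3.7 can be instantiated as in the following sketch, where `beta` stands for $\sum_{t=1}^{T}m_t$ (in practice obtained on the fly, or replaced by the doubling trick as discussed above):

```python
import math

def mild_ogd_parameters(T, G, D, beta):
    """Sketch of the choices in Theorem 3.7: a geometric grid of N learning
    rates for the experts, and the meta-algorithm's parameter alpha."""
    N = math.ceil(0.5 * math.log2(T + 1)) + 1
    H = [2 ** (i - 1) * D / (G * math.sqrt(beta)) for i in range(1, N + 1)]
    alpha = 1.0 / (G * D * math.sqrt(beta))
    return H, alpha
```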

3.4 Lower Bound

Finally, we show that our Mild-OGD is optimal in the worst case by establishing the following lower bound.

Theorem 3.8.

Let $L=\left\lceil TD/\max\{P,D\}\right\rceil$. Suppose $\mathcal{K}=\left[-D/(2\sqrt{n}),D/(2\sqrt{n})\right]^n$, which satisfies Assumption 3.2. For any OCO algorithm, any $P\in[0,TD]$, and any positive integer $d$, there exists a sequence of comparators $\mathbf{u}_1,\dots,\mathbf{u}_T\in\mathcal{K}$ satisfying $P_T\leq P$, a sequence of functions $f_1(\mathbf{x}),\dots,f_T(\mathbf{x})$ satisfying Assumption 3.1, and a sequence of delays $1\leq d_1,\dots,d_T\leq d$ such that

$$R(\mathbf{u}_1,\dots,\mathbf{u}_T)\geq\begin{cases}\dfrac{DGT}{2\sqrt{2}},&\text{if }d>L;\\[2mm]\dfrac{G\sqrt{dD\max\{P,D\}T}}{4\sqrt{2}},&\text{otherwise.}\end{cases}$$

Remark: From Theorem 3.8, if $d>L=\Omega(T/(P_T+1))$, there exists an $\Omega(T)$ lower bound on the dynamic regret, which can be trivially matched by any OCO algorithm including our Algorithm 2. As a result, we mainly focus on the case $d\leq L$, and notice that Theorem 3.8 essentially establishes an $\Omega(\sqrt{dT(P_T+1)})$ lower bound, which matches the $O(\sqrt{dT(P_T+1)})$ dynamic regret of our Mild-OGD in the worst case. To the best of our knowledge, this is the first lower bound for the dynamic regret of the delayed OCO.

4 Analysis

In this section, we prove Theorem 3.4, Lemma 3.5, Theorem 3.7, and Theorem 3.8 by introducing some lemmas. The omitted proofs can be found in the appendix.

4.1 Proof of Theorem 3.4

It is easy to verify that

$$R(\mathbf{u}_1,\dots,\mathbf{u}_T)\leq\sum_{t=1}^{T}\left\langle\nabla f_t(\mathbf{x}_t),\mathbf{x}_t-\mathbf{u}_t\right\rangle$$

where the inequality is due to the convexity of functions.

Let $\tau_t=1+\sum_{i=1}^{t-1}|\mathcal{F}_i|$. Then, combining the above inequality with the fact that $c_1,\dots,c_T$ is a permutation of $1,\dots,T$, we have

$$\begin{split}&R(\mathbf{u}_1,\dots,\mathbf{u}_T)\\\leq&\sum_{t=1}^{T}\left\langle\nabla f_{c_t}(\mathbf{x}_{c_t}),\mathbf{x}_{c_t}-\mathbf{u}_{c_t}\right\rangle\\=&\sum_{t=1}^{T}\left\langle\nabla f_{c_t}(\mathbf{x}_{c_t}),\mathbf{y}_{\tau_{c_t}}-\mathbf{u}_{c_t}\right\rangle\\=&\sum_{t=1}^{T}\left\langle\nabla f_{c_t}(\mathbf{x}_{c_t}),\mathbf{y}_t-\mathbf{u}_t+\mathbf{y}_{\tau_{c_t}}-\mathbf{y}_t\right\rangle+\sum_{t=1}^{T}\left\langle\nabla f_{c_t}(\mathbf{x}_{c_t}),\mathbf{u}_t-\mathbf{u}_{c_t}\right\rangle\\\leq&\sum_{t=1}^{T}\left\langle\nabla f_{c_t}(\mathbf{x}_{c_t}),\mathbf{y}_t-\mathbf{u}_t\right\rangle+\sum_{t=1}^{T}G\left\|\mathbf{y}_{\tau_{c_t}}-\mathbf{y}_t\right\|_2+\sum_{t=1}^{T}G\|\mathbf{u}_t-\mathbf{u}_{c_t}\|_2\end{split} \quad (10)$$

where the first equality is due to $\mathbf{x}_t=\mathbf{y}_{\tau_t}$ in Algorithm 1, and the last inequality is due to Assumption 3.1.

Let $\mathbf{y}'_{t+1}=\mathbf{y}_t-\eta\nabla f_{c_t}(\mathbf{x}_{c_t})$. For the first term on the right-hand side of (10), we have

$$\begin{split}&\sum_{t=1}^{T}\left\langle\nabla f_{c_t}(\mathbf{x}_{c_t}),\mathbf{y}_t-\mathbf{u}_t\right\rangle=\sum_{t=1}^{T}\frac{\left\langle\mathbf{y}_t-\mathbf{y}'_{t+1},\mathbf{y}_t-\mathbf{u}_t\right\rangle}{\eta}\\=&\sum_{t=1}^{T}\frac{\|\mathbf{y}_t-\mathbf{u}_t\|_2^2-\|\mathbf{y}'_{t+1}-\mathbf{u}_t\|_2^2+\|\mathbf{y}_t-\mathbf{y}'_{t+1}\|_2^2}{2\eta}\\=&\sum_{t=1}^{T}\frac{\|\mathbf{y}_t-\mathbf{u}_t\|_2^2-\|\mathbf{y}'_{t+1}-\mathbf{u}_t\|_2^2+\|\eta\nabla f_{c_t}(\mathbf{x}_{c_t})\|_2^2}{2\eta}\\\leq&\sum_{t=1}^{T}\frac{1}{2\eta}\left(\|\mathbf{y}_t-\mathbf{u}_t\|_2^2-\|\mathbf{y}_{t+1}-\mathbf{u}_t\|_2^2+\eta^2G^2\right)\\=&\sum_{t=1}^{T}\left(\frac{\|\mathbf{y}_t\|_2^2-\|\mathbf{y}_{t+1}\|_2^2}{2\eta}+\frac{\langle\mathbf{y}_{t+1}-\mathbf{y}_t,\mathbf{u}_t\rangle}{\eta}+\frac{\eta G^2}{2}\right)\\\leq&\frac{1}{\eta}\langle\mathbf{y}_{T+1},\mathbf{u}_T\rangle+\sum_{t=2}^{T}\frac{1}{\eta}\langle\mathbf{u}_{t-1}-\mathbf{u}_t,\mathbf{y}_t\rangle+\frac{\eta TG^2}{2}\\\leq&\frac{1}{\eta}\|\mathbf{y}_{T+1}\|_2\|\mathbf{u}_T\|_2+\sum_{t=2}^{T}\frac{1}{\eta}\|\mathbf{u}_{t-1}-\mathbf{u}_t\|_2\|\mathbf{y}_t\|_2+\frac{\eta TG^2}{2}\\\leq&\frac{D^2+DP_T}{\eta}+\frac{\eta TG^2}{2}\end{split} \quad (11)$$

where the first inequality is due to Assumption 3.1, the second inequality is due to $\mathbf{y}_1=\mathbf{0}$ and $\|\mathbf{y}_{T+1}\|_2^2\geq 0$, and the last inequality is due to Assumption 3.2.

Then, we proceed to bound the second term on the right-hand side of (10). Note that before round $c_t$, Algorithm 1 has received $\tau_{c_t}-1$ gradients, and thus has generated $\mathbf{y}_1,\dots,\mathbf{y}_{\tau_{c_t}}$. Moreover, let $q=c_t+d_{c_t}-1$. It is easy to verify that $q\geq c_t$, and thus Algorithm 1 has also generated $\mathbf{y}_1,\dots,\mathbf{y}_{\tau_{c_t}}$ before round $q$. Since the gradient $\nabla f_{c_t}(\mathbf{x}_{c_t})$ is used to update $\mathbf{y}_t$ in round $q$, we have

$$\tau_{c_t}\leq t. \quad (12)$$

From (12), we have

$$\begin{split}\sum_{t=1}^{T}\left\|\mathbf{y}_{\tau_{c_t}}-\mathbf{y}_t\right\|_2\leq&\sum_{t=1}^{T}\sum_{k=\tau_{c_t}}^{t-1}\left\|\mathbf{y}_k-\mathbf{y}_{k+1}\right\|_2\\\leq&\sum_{t=1}^{T}\sum_{k=\tau_{c_t}}^{t-1}\left\|\mathbf{y}_k-\mathbf{y}'_{k+1}\right\|_2\\\leq&\sum_{t=1}^{T}\sum_{k=\tau_{c_t}}^{t-1}\left\|\eta\nabla f_{c_k}(\mathbf{x}_{c_k})\right\|_2\\\leq&\,\eta G\sum_{t=1}^{T}\left(t-\tau_{c_t}\right)\end{split} \quad (13)$$

where the last inequality is due to Assumption 3.1.

Moreover, because of the definitions of $\tau_t$ and $m_t$, we have

$$\begin{split}\sum_{t=1}^{T}\left(t-\tau_{c_t}\right)=&\sum_{t=1}^{T}\left(t-1\right)-\sum_{t=1}^{T}\sum_{i=1}^{c_t-1}|\mathcal{F}_i|\\=&\sum_{t=1}^{T}\left(t-1\right)-\sum_{t=1}^{T}\sum_{i=1}^{t-1}|\mathcal{F}_i|\\=&\sum_{t=1}^{T}\left(t-1-\sum_{i=1}^{t-1}|\mathcal{F}_i|\right)\\=&\sum_{t=1}^{T}\left(m_t-1\right)\end{split} \quad (14)$$

where the second equality is due to the fact that $c_1,\dots,c_T$ is a permutation of $1,\dots,T$.

Then, combining (13) with (14), we have

$$\sum_{t=1}^{T}G\left\|\mathbf{y}_{\tau_{c_t}}-\mathbf{y}_t\right\|_2\leq\eta G^2\sum_{t=1}^{T}\left(m_t-1\right). \quad (15)$$

Finally, combining (10) with (11) and (15), we have

$$\sum_{t=1}^{T}\left(f_t(\mathbf{x}_t)-f_t(\mathbf{u}_t)\right)\leq\frac{D^2+DP_T}{\eta}+\eta G^2\sum_{t=1}^{T}m_t+\sum_{t=1}^{T}G\|\mathbf{u}_t-\mathbf{u}_{c_t}\|_2$$

which completes this proof.

4.2 Proof of Lemma 3.5

Since $\nabla f_{c_t}(\mathbf{x}_{c_t})$ is the $t$-th used gradient and arrives at the end of round $c_t+d_{c_t}-1$, it is not hard to verify that

$$t\leq c_t+d_{c_t}-1\leq c_t+d-1 \quad (16)$$

for any $t\in[T]$, and there are at most $t-1$ arrived gradients before round $c_t+d_{c_t}-1$. Notice that the gradients queried at rounds $1,\dots,t$ must have arrived by the end of round $t+d-1$. Therefore, we also have $c_t+d_{c_t}-2<t+d-1$, which implies that

$$c_t\leq t+d-d_{c_t}\leq t+d-1. \quad (17)$$

If $t\in[T]$ and $c_t\leq t$, according to (16), we have

$$\|\mathbf{u}_t-\mathbf{u}_{c_t}\|_2\leq\sum_{k=c_t}^{t-1}\|\mathbf{u}_{k+1}-\mathbf{u}_k\|_2\leq\sum_{k=c_t}^{\min\{c_t+d-2,T-1\}}\|\mathbf{u}_{k+1}-\mathbf{u}_k\|_2. \quad (18)$$

Otherwise, if $t\in[T]$ and $c_t>t$, according to (17), we have

$$\|\mathbf{u}_t-\mathbf{u}_{c_t}\|_2\leq\sum_{k=t}^{c_t-1}\|\mathbf{u}_{k+1}-\mathbf{u}_k\|_2\leq\sum_{k=t}^{\min\{t+d-2,T-1\}}\|\mathbf{u}_{k+1}-\mathbf{u}_k\|_2. \quad (19)$$

Therefore, combining (18) and (19), we have

$$\begin{split}&\sum_{t=1}^{T}\|\mathbf{u}_t-\mathbf{u}_{c_t}\|_2\\\leq&\sum_{t=1}^{T}\sum_{k=c_t}^{\min\{c_t+d-2,T-1\}}\|\mathbf{u}_{k+1}-\mathbf{u}_k\|_2+\sum_{t=1}^{T}\sum_{k=t}^{\min\{t+d-2,T-1\}}\|\mathbf{u}_{k+1}-\mathbf{u}_k\|_2\\=&\,2\sum_{t=1}^{T}\sum_{k=t}^{\min\{t+d-2,T-1\}}\|\mathbf{u}_{k+1}-\mathbf{u}_k\|_2\\\leq&\,2\sum_{k=1}^{d}\sum_{t=1}^{T-1}\|\mathbf{u}_{t+1}-\mathbf{u}_t\|_2=2dP_T\end{split}$$

where the equality is due to the fact that $c_1,\dots,c_T$ is a permutation of $1,\dots,T$.

Then, we complete this proof by further noticing that Assumption 3.2 and the definition of $K$ can ensure

$$\sum_{t=1}^{T}\|\mathbf{u}_t-\mathbf{u}_{c_t}\|_2=\sum_{t=1}^{T}\mathbb{I}(t\neq c_t)\|\mathbf{u}_t-\mathbf{u}_{c_t}\|_2\leq\sum_{t=1}^{T}\mathbb{I}(t\neq c_t)D=DK.$$
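As a quick numerical sanity check of Lemma 3.5 (under illustrative random delays and a slowly moving comparator sequence of our own choosing), one can simulate the usage order $c_1,\dots,c_T$ and verify the bound directly:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d, D = 200, 5, 1.0
delays = rng.integers(1, d + 1, size=T + 1)     # delays[t] = d_t (index 0 unused)

# usage order of gradients: sorted by arrival round k + d_k - 1, ties broken
# by k, matching Algorithm 1's ascending traversal of F_t; order[t-1] = c_t
order = sorted(range(1, T + 1), key=lambda k: (k + delays[k] - 1, k))
K = sum(t != c for t, c in enumerate(order, start=1))

# comparators moving slowly along one coordinate of a set with diameter D
u = np.zeros((T + 1, 2))
u[1:, 0] = np.linspace(-D / 2, D / 2, T)
P_T = sum(np.linalg.norm(u[t] - u[t - 1]) for t in range(2, T + 1))

drift = sum(np.linalg.norm(u[t] - u[order[t - 1]]) for t in range(1, T + 1))
assert drift <= min(K * D, 2 * d * P_T) + 1e-9  # the bound in Lemma 3.5
```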

4.3 Proof of Theorem 3.7

Let $\eta_{\ast}=\sqrt{D(D+P_T)/(\beta G^2)}$, where $\beta=\sum_{t=1}^{T}m_t$. From Assumption 3.2, we have

$$0\leq P_T=\sum_{t=2}^{T}\|\mathbf{u}_t-\mathbf{u}_{t-1}\|_2\leq TD$$

which implies that

$$\eta_1=\frac{D}{G\sqrt{\beta}}\leq\eta_{\ast}\leq\frac{D\sqrt{T+1}}{G\sqrt{\beta}}\leq\eta_{|\mathcal{H}|}.$$

Therefore, for any possible value of $P_T$, there must exist a learning rate $\eta_k\in\mathcal{H}$ such that

$$\eta_k\leq\eta_{\ast}\leq 2\eta_k \quad (20)$$

where $k=\lfloor\log_2\sqrt{(P_T+D)/D}\rfloor+1$.

Then, the dynamic regret can be upper bounded as follows:

$$\begin{split}R(\mathbf{u}_1,\dots,\mathbf{u}_T)\leq&\sum_{t=1}^{T}\langle\nabla f_t(\mathbf{x}_t),\mathbf{x}_t-\mathbf{u}_t\rangle\\=&\sum_{t=1}^{T}\ell_t\left(\mathbf{x}_t\right)-\sum_{t=1}^{T}\ell_t\left(\mathbf{x}_t^{\eta_k}\right)+\sum_{t=1}^{T}\ell_t\left(\mathbf{x}_t^{\eta_k}\right)-\sum_{t=1}^{T}\ell_t\left(\mathbf{u}_t\right).\end{split} \quad (21)$$

To bound the first term on the right-hand side of (21), we introduce the following lemma.

Lemma 4.1.

Let $m_t=t-\sum_{i=1}^{t-1}|\mathcal{F}_i|$. Under Assumptions 3.1 and 3.2, for any $\eta\in\mathcal{H}$, Algorithm 2 has

$$\sum_{t=1}^{T}\ell_t\left(\mathbf{x}_t\right)-\sum_{t=1}^{T}\ell_t(\mathbf{x}_t^{\eta})\leq\frac{1}{\alpha}\ln\frac{1}{w_1^{\eta}}+\alpha G^2D^2\sum_{t=1}^{T}m_t.$$

Combining Lemma 4.1 with $(1/w_1^{\eta_k})\leq(k+1)^2$ and $\alpha=\frac{1}{GD\sqrt{\sum_{t=1}^{T}m_t}}$, under Assumptions 3.1 and 3.2, we have

$$\begin{split}\sum_{t=1}^{T}\ell_t\left(\mathbf{x}_t\right)-\sum_{t=1}^{T}\ell_t(\mathbf{x}_t^{\eta_k})\leq&\,2GD\sqrt{\sum_{t=1}^{T}m_t}\ln(k+1)+GD\sqrt{\sum_{t=1}^{T}m_t}\\\leq&\,2GD\sqrt{\bar{d}T}\ln\left(k+1\right)+GD\sqrt{\bar{d}T}\end{split}$$

where the last inequality is due to (4).

Note that each expert $E^{\eta}$ actually is equal to running Algorithm 1 with $\ell_1(\mathbf{x}),\dots,\ell_T(\mathbf{x})$, where each gradient $\nabla\ell_t(\mathbf{x}_t^{\eta})=\nabla f_t(\mathbf{x}_t)$ is delayed to the end of round $t+d_t-1$. Therefore, combining Theorem 3.4 with Lemma 3.5 and the definition of $C$ in (6), under Assumptions 3.1 and 3.2, we have

$$\begin{split}\sum_{t=1}^{T}\ell_t\left(\mathbf{x}_t^{\eta_k}\right)-\sum_{t=1}^{T}\ell_t\left(\mathbf{u}_t\right)\leq&\frac{D^2+DP_T}{\eta_k}+\eta_kG^2\sum_{t=1}^{T}m_t+C\\\leq&\frac{2(D^2+DP_T)}{\eta_{\ast}}+\eta_{\ast}G^2\sum_{t=1}^{T}m_t+C\\\leq&\,3G\sqrt{D(D+P_T)}\sqrt{\bar{d}T}+C\end{split}$$

where the second inequality is due to (20), and the last inequality is due to the definition of $\eta_{\ast}$ and (4).

Finally, we complete this proof by combining (21) with the above two inequalities.

4.4 Proof of Theorem 3.8

Inspired by the proof of the lower bound in the non-delayed setting (Zhang et al., 2018a), we first need to establish a lower bound of static regret in the delayed setting. Although the seminal work of Weinberger & Ordentlich (2002) has already provided such a lower bound, it only holds in the special case that $d$ divides $T$. To address this limitation, we establish a lower bound of static regret for any $d$ and $T$, which is presented in the following lemma.

Lemma 4.2.

Suppose $\mathcal{K}=\left[-D/(2\sqrt{n}),D/(2\sqrt{n})\right]^n$, which satisfies Assumption 3.2. For any OCO algorithm and any positive integer $d$, there exists a sequence of functions $f_1(\mathbf{x}),\dots,f_T(\mathbf{x})$ satisfying Assumption 3.1 and a sequence of delays $1\leq d_1,\dots,d_T\leq d$ such that

$$R(T)\geq\frac{DGT}{2\sqrt{2\left\lceil T/d\right\rceil}}.$$

Let $Z=\lceil T/L\rceil$. We then divide the total $T$ rounds into $Z$ blocks, where the length of the first $Z-1$ blocks is $L$ and that of the last block is $T-(Z-1)L$. In this way, we can define the set of rounds in block $z$ as

$$\mathcal{T}_z=\{(z-1)L+1,\dots,\min\{zL,T\}\}.$$

Moreover, we define the feasible set of $\mathbf{u}_1,\dots,\mathbf{u}_T$ as

$$\mathcal{C}(P)=\left\{\mathbf{u}_1,\dots,\mathbf{u}_T\in\mathcal{K}\,\middle|\,\sum_{t=2}^{T}\|\mathbf{u}_t-\mathbf{u}_{t-1}\|_2\leq P\right\}$$

and construct a subset of $\mathcal{C}(P)$ as

$$\mathcal{C}'(P)=\left\{\mathbf{u}_1,\dots,\mathbf{u}_T\in\mathcal{K}\,\middle|\,\mathbf{u}_i=\mathbf{u}_j,\,\forall z\in[Z],\,i,j\in\mathcal{T}_z\right\}.$$

Notice that the inclusion $\mathcal{C}'(P)\subseteq\mathcal{C}(P)$ is derived from the fact that any comparator sequence in $\mathcal{C}'(P)$ only changes $Z-1\leq P/D$ times, and each change is bounded by the diameter $D$, so its path-length does not exceed $P$.

Then, because of $\mathcal{C}'(P)\subseteq\mathcal{C}(P)$ and Lemma 4.2, there exists a sequence of functions $f_1(\mathbf{x}),\dots,f_T(\mathbf{x})$ satisfying Assumption 3.1 and a sequence of delays $1\leq d_1,\dots,d_T\leq d$ such that

$$\begin{split}&\sum_{t=1}^{T}f_t(\mathbf{x}_t)-\min_{\mathbf{u}_1,\dots,\mathbf{u}_T\in\mathcal{C}(P)}\sum_{t=1}^{T}f_t(\mathbf{u}_t)\\\geq&\sum_{t=1}^{T}f_t(\mathbf{x}_t)-\min_{\mathbf{u}_1,\dots,\mathbf{u}_T\in\mathcal{C}'(P)}\sum_{t=1}^{T}f_t(\mathbf{u}_t)\\=&\sum_{z=1}^{Z}\left(\sum_{t\in\mathcal{T}_z}f_t(\mathbf{x}_t)-\min_{\mathbf{x}\in\mathcal{K}}\sum_{t\in\mathcal{T}_z}f_t(\mathbf{x})\right)\\\geq&\sum_{z=1}^{Z}\frac{DG|\mathcal{T}_z|}{2\sqrt{2\left\lceil|\mathcal{T}_z|/d\right\rceil}}.\end{split}$$

Finally, we can complete this proof by further noticing that

$$\begin{split}\sum_{z=1}^{Z}\frac{DG|\mathcal{T}_z|}{2\sqrt{2\left\lceil|\mathcal{T}_z|/d\right\rceil}}\geq&\sum_{z=1}^{Z}\frac{DG|\mathcal{T}_z|}{2\sqrt{2\left\lceil L/d\right\rceil}}=\frac{DGT}{2\sqrt{2\left\lceil L/d\right\rceil}}\\\geq&\begin{cases}\dfrac{DGT}{2\sqrt{2}},&\text{if }d>L;\\[2mm]\dfrac{G\sqrt{dD\max\{P,D\}T}}{4\sqrt{2}},&\text{otherwise;}\end{cases}\end{split}$$

where the first inequality is due to $|\mathcal{T}_z|\leq L$ for any $z\in[Z]$, and the last inequality is mainly due to

$$\left\lceil L/d\right\rceil\leq 2L/d=2\left\lceil TD/\max\{P,D\}\right\rceil/d\leq 4TD/(\max\{P,D\}d)$$

for $d\leq L$.

5 Conclusion and Future Work

In this paper, we study the dynamic regret of OCO with arbitrary delays. To this end, we first propose a simple algorithm called DOGD, the dynamic regret of which can be automatically bounded by $O(\sqrt{\bar{d}T}(P_T+1))$ under mild assumptions such as the In-Order property, and $O(\sqrt{dT}(P_T+1))$ in the worst case. Furthermore, based on DOGD, we develop an improved algorithm called Mild-OGD, which can automatically enjoy an $O(\sqrt{\bar{d}T(P_T+1)})$ dynamic regret bound under mild assumptions such as the In-Order property, and an $O(\sqrt{dT(P_T+1)})$ dynamic regret bound in the worst case. Finally, we provide a matching lower bound to show the optimality of our Mild-OGD in the worst case.

It is worth noting that there are still several directions for future research, which are discussed in the appendix due to the limitation of space.

Acknowledgements

This work was partially supported by the National Natural Science Foundation of China (62306275, 62122037), the Zhejiang Province High-Level Talents Special Support Program “Leading Talent of Technological Innovation of Ten-Thousands Talents Program” (No. 2022R52046), the Key Research and Development Program of Zhejiang Province (No. 2023C03192), and the Open Research Fund of the State Key Laboratory of Blockchain and Data Security, Zhejiang University. The authors would also like to thank the anonymous reviewers for their helpful comments.

Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning. There are some potential societal consequences of our work, but we feel that none of them must be specifically highlighted here.

References

  • Abernethy et al. (2008) Abernethy, J. D., Bartlett, P. L., Rakhlin, A., and Tewari, A. Optimal strategies and minimax lower bounds for online convex games. In Proceedings of the 21st Annual Conference on Learning Theory, pp.  415–424, 2008.
  • Baby & Wang (2019) Baby, D. and Wang, Y.-X. Online forecasting of total-variation-bounded sequences. In Advances in Neural Information Processing Systems 32, pp. 11071–11081, 2019.
  • Baby & Wang (2021) Baby, D. and Wang, Y.-X. Optimal dynamic regret in exp-concave online learning. In Proceedings of 34th Annual Conference on Learning Theory, pp.  359–409, 2021.
  • Baby & Wang (2022) Baby, D. and Wang, Y.-X. Optimal dynamic regret in proper online learning with strongly convex losses and beyond. In Proceedings of the 25th International Conference on Artificial Intelligence and Statistics, pp.  1805–1845, 2022.
  • Baby & Wang (2023) Baby, D. and Wang, Y.-X. Second order path variationals in non-stationary online learning. In Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, pp.  9024–9075, 2023.
  • Besbes et al. (2015) Besbes, O., Gur, Y., and Zeevi, A. Non-stationary stochastic optimization. Operations Research, 63(5):1227–1244, 2015.
  • Cesa-Bianchi & Lugosi (2006) Cesa-Bianchi, N. and Lugosi, G. Prediction, Learning, and Games. Cambridge University Press, 2006.
  • Cesa-Bianchi et al. (1997) Cesa-Bianchi, N., Freund, Y., Haussler, D., Helmbold, D. P., Schapire, R. E., and Warmuth, M. K. How to use expert advice. Journal of the ACM, 44(3):427–485, 1997.
  • Cutkosky (2020) Cutkosky, A. Parameter-free, dynamic, and strongly-adaptive online learning. In Proceedings of the 37th International Conference on Machine Learning, pp.  2250–2259, 2020.
  • Duchi et al. (2011) Duchi, J. C., Agarwal, A., and Wainwright, M. J. Dual averaging for distributed optimization: Convergence analysis and network scaling. IEEE Transactions on Automatic Control, 57(3):592–606, 2011.
  • Flaspohler et al. (2021) Flaspohler, G. E., Orabona, F., Cohen, J., Mouatadid, S., Oprescu, M., Orenstein, P., and Mackey, L. Online learning with optimism and delay. In Proceedings of the 38th International Conference on Machine Learning, pp.  3363–3373, 2021.
  • Freund & Schapire (1997) Freund, Y. and Schapire, R. E. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.
  • Hazan (2016) Hazan, E. Introduction to online convex optimization. Foundations and Trends in Optimization, 2(3–4):157–325, 2016.
  • Hazan & Kale (2012) Hazan, E. and Kale, S. Projection-free online learning. In Proceedings of the 29th International Conference on Machine Learning, pp.  1843–1850, 2012.
  • Hazan et al. (2007) Hazan, E., Agarwal, A., and Kale, S. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2):169–192, 2007.
  • He et al. (2014) He, X., Pan, J., Jin, O., Xu, T., Liu, B., Xu, T., Shi, Y., Atallah, A., Herbrich, R., Bowers, S., and Candela, J. Q. Practical lessons from predicting clicks on ads at facebook. In Proceedings of the 8th International Workshop on Data Mining for Online Advertising, pp.  1–9, 2014.
  • Hoeffding (1963) Hoeffding, W. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, 1963.
  • Jadbabaie et al. (2015) Jadbabaie, A., Rakhlin, A., Shahrampour, S., and Sridharan, K. Online optimization: Competing with dynamic comparators. In Proceedings of the 18th International Conference on Artificial Intelligence and Statistics, pp.  398–406, 2015.
  • Joulani et al. (2013) Joulani, P., György, A., and Szepesvári, C. Online learning under delayed feedback. In Proceedings of the 30th International Conference on Machine Learning, pp.  1453–1461, 2013.
  • Joulani et al. (2016) Joulani, P., György, A., and Szepesvári, C. Delay-tolerant online convex optimization: Unified analysis and adaptive-gradient algorithms. In Proceedings of the 30th AAAI Conference on Artificial Intelligence, pp.  1744–1750, 2016.
  • Joulani et al. (2017) Joulani, P., György, A., and Szepesvári, C. A modular analysis of adaptive (non-)convex optimization: Optimism, composite objectives, and variational bounds. In Proceedings of the 28th International Conference on Algorithmic Learning Theory, pp.  681–720, 2017.
  • Korotin et al. (2020) Korotin, A., V’yugin, V., and Burnaev, E. Adaptive hedging under delayed feedback. Neurocomputing, 397:356–368, 2020.
  • McMahan & Streeter (2010) McMahan, H. B. and Streeter, M. Adaptive bound optimization for online convex optimization. In Proceedings of the 23rd Conference on Learning Theory, pp. 244–256, 2010.
  • McMahan & Streeter (2014) McMahan, H. B. and Streeter, M. Delay-tolerant algorithms for asynchronous distributed online learning. In Advances in Neural Information Processing Systems 27, pp. 2915–2923, 2014.
  • McMahan et al. (2013) McMahan, H. B., Holt, G., Sculley, D., Young, M., Ebner, D., Grady, J., Nie, L., Phillips, T., Davydov, E., Golovin, D., Chikkerur, S., Liu, D., Wattenberg, M., Hrafnkelsson, A. M., Boulos, T., and Kubica, J. Ad click prediction: a view from the trenches. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp.  1222–1230, 2013.
  • Mokhtari et al. (2016) Mokhtari, A., Shahrampour, S., Jadbabaie, A., and Ribeiro, A. Online optimization in dynamic environments: Improved regret rates for strongly convex problems. In Proceedings of 55th Conference on Decision and Control, pp.  7195–7201, 2016.
  • Orabona (2019) Orabona, F. A modern introduction to online learning. arXiv:1912.13213, 2019.
  • Quanrud & Khashabi (2015) Quanrud, K. and Khashabi, D. Online learning with adversarial delays. In Advances in Neural Information Processing Systems 28, pp. 1270–1278, 2015.
  • Rakhlin & Sridharan (2013) Rakhlin, A. and Sridharan, K. Online learning with predictable sequences. In Proceedings of the 26th Annual Conference on Learning Theory, pp.  993–1019, 2013.
  • Ren et al. (2020) Ren, Z., Zhou, Z., Qiu, L., Deshpande, A., and Kalagnanam, J. Delay-adaptive distributed stochastic optimization. In Proceedings of the 34th AAAI Conference on Artificial Intelligence, pp.  5503–5510, 2020.
  • Shalev-Shwartz (2011) Shalev-Shwartz, S. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107–194, 2011.
  • Shalev-Shwartz & Singer (2007) Shalev-Shwartz, S. and Singer, Y. A primal-dual perspective of online learning algorithms. Machine Learning, 69(2–3):115–142, 2007.
  • van Erven & Koolen (2016) van Erven, T. and Koolen, W. M. MetaGrad: Multiple learning rates in online learning. In Advances in Neural Information Processing Systems 29, pp. 3666–3674, 2016.
  • van Erven et al. (2021) van Erven, T., Koolen, W. M., and van der Hoeven, D. MetaGrad: Adaptation using multiple learning rates in online learning. Journal of Machine Learning Research, 22(161):1–61, 2021.
  • Wan et al. (2021) Wan, Y., Xue, B., and Zhang, L. Projection-free online learning in dynamic environments. In Proceedings of the 35th AAAI Conference on Artificial Intelligence, pp.  10067–10075, 2021.
  • Wan et al. (2022a) Wan, Y., Tu, W.-W., and Zhang, L. Online strongly convex optimization with unknown delays. Machine Learning, 111(3):871–893, 2022a.
  • Wan et al. (2022b) Wan, Y., Tu, W.-W., and Zhang, L. Online Frank-Wolfe with arbitrary delays. In Advances in Neural Information Processing Systems 35, 2022b.
  • Wan et al. (2023a) Wan, Y., Wang, Y., Yao, C., Tu, W.-W., and Zhang, L. Projection-free online learning with arbitrary delays. arXiv:2204.04964v2, 2023a.
  • Wan et al. (2023b) Wan, Y., Zhang, L., and Song, M. Improved dynamic regret for online Frank-Wolfe. In Proceedings of the 36th Annual Conference on Learning Theory, 2023b.
  • Wang et al. (2021) Wang, J., Liang, B., Dong, M., Boudreau, G., and Abou-Zeid, H. Delay-tolerant constrained OCO with application to network resource allocation. In Proceedings of the 2021 IEEE Conference on Computer Communications, pp.  1–10, 2021.
  • Wang et al. (2023) Wang, J., Dong, M., Liang, B., Boudreau, G., and Abou-Zeid, H. Delay-tolerant OCO with long-term constraints: Algorithm and its application to network resource allocation. IEEE/ACM Transactions on Networking, 31(1):147–163, 2023.
  • Wang et al. (2024) Wang, Y., Yang, W., Jiang, W., Lu, S., Wang, B., Tang, H., Wan, Y., and Zhang, L. Non-stationary projection-free online learning with dynamic and adaptive regret guarantees. In Proceedings of the 38th AAAI Conference on Artificial Intelligence, pp.  15671–15679, 2024.
  • Weinberger & Ordentlich (2002) Weinberger, M. J. and Ordentlich, E. On delayed prediction of individual sequences. IEEE Transactions on Information Theory, 48(7):1959–1976, 2002.
  • Yang et al. (2016) Yang, T., Zhang, L., Jin, R., and Yi, J. Tracking slowly moving clairvoyant: Optimal dynamic regret of online learning with true and noisy gradient. In Proceedings of the 33rd International Conference on Machine Learning, 2016.
  • Zhang et al. (2017) Zhang, L., Yang, T., Yi, J., Jin, R., and Zhou, Z.-H. Improved dynamic regret for non-degenerate functions. In Advances in Neural Information Processing Systems 30, pp. 732–741, 2017.
  • Zhang et al. (2018a) Zhang, L., Lu, S., and Zhou, Z.-H. Adaptive online learning in dynamic environments. In Advances in Neural Information Processing Systems 31, pp. 1323–1333, 2018a.
  • Zhang et al. (2018b) Zhang, L., Yang, T., Jin, R., and Zhou, Z.-H. Dynamic regret of strongly adaptive methods. In Proceedings of the 35th International Conference on Machine Learning, pp.  5877–5886, 2018b.
  • Zhao & Zhang (2021) Zhao, P. and Zhang, L. Improved analysis for dynamic regret of strongly convex and smooth functions. In Proceedings of the 3rd Conference on Learning for Dynamics and Control, pp.  48–59, 2021.
  • Zhao et al. (2020) Zhao, P., Zhang, Y.-J., Zhang, L., and Zhou, Z.-H. Dynamic regret of convex and smooth functions. In Advances in Neural Information Processing Systems 33, pp. 12510–12520, 2020.
  • Zhao et al. (2022) Zhao, P., Xie, Y.-F., Zhang, L., and Zhou, Z.-H. Efficient methods for non-stationary online learning. In Advances in Neural Information Processing Systems 35, pp. 11573–11585, 2022.
  • Zhou et al. (2018) Zhou, Z., Mertikopoulos, P., Bambos, N., Glynn, P., Ye, Y., Li, L.-J., and Fei-Fei, L. Distributed asynchronous optimization with unbounded delays: How slow can you go? In Proceedings of the 35th International Conference on Machine Learning, pp.  5970–5979, 2018.
  • Zhou et al. (2022) Zhou, Z., Mertikopoulos, P., Bambos, N., Glynn, P. W., and Ye, Y. Distributed stochastic optimization with large delays. Mathematics of Operations Research, 47(3):2082–2111, 2022.
  • Zinkevich (2003) Zinkevich, M. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning, pp.  928–936, 2003.

Appendix A Detailed Discussions on Future Work

First, we notice that the O(\sqrt{\bar{d}T}) static regret bound can be achieved under arbitrary delays (Quanrud & Khashabi, 2015). Thus, it is natural to ask whether the O(\sqrt{\bar{d}T(P_{T}+1)}) dynamic regret bound can also be achieved without additional assumptions. However, as shown by Theorem 3.4, minimizing the dynamic regret in the delayed setting is more challenging than minimizing the static regret, because delays further cause a comparator drift, i.e., \sum_{t=1}^{T}\|\mathbf{u}_{t}-\mathbf{u}_{c_{t}}\|_{2}. It seems highly non-trivial to reduce this comparator drift without additional assumptions, and we leave this question as future work.

Second, we have utilized the doubling trick to avoid tuning the learning rate with the unknown cumulative delay. One potential limitation of this technique is that it repeatedly restarts itself, forgetting all preceding information. For minimizing the static regret with arbitrary delays, Joulani et al. (2016) have addressed this limitation by continuously adjusting the learning rate according to the norms of received gradients. Thus, it is also appealing to extend this idea to minimizing the dynamic regret with arbitrary delays.

Third, our proposed algorithms require the time-stamps of the delayed feedback. It is interesting to investigate how to minimize the dynamic regret with anonymous and arbitrary delays. A potentially useful property is that under the In-Order assumption, the arrival order of the delayed gradients already matches the ascending order of their time-stamps. Since our DOGD in Algorithm 1 only utilizes the time-stamps to sort the elements in \mathcal{F}_{t}, it can actually be implemented by simply performing the gradient descent step in (1) whenever a gradient arrives, even without the time-stamps, as sketched below. However, in our Mild-OGD, the time-stamps are further utilized to compute the delayed surrogate losses of the experts, i.e., \ell_{k}(\mathbf{x}_{k}^{\eta}) in (8), and thus cannot be discarded.
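To make this concrete, the following minimal sketch implements the time-stamp-free variant of DOGD, assuming the In-Order property holds; the names grad_stream and project, as well as the unit-ball example, are illustrative assumptions rather than part of the paper.

```python
import numpy as np

def dogd_inorder(grad_stream, project, eta, x0):
    """Time-stamp-free DOGD sketch under the In-Order assumption.

    At round t, grad_stream yields the delayed gradients arriving at the
    end of that round; In-Order guarantees they arrive in query order, so
    no sorting by time-stamp is needed.
    """
    y = np.asarray(x0, dtype=float).copy()
    decisions = []
    for arriving_grads in grad_stream:
        decisions.append(y.copy())    # play x_t = y_tau
        for g in arriving_grads:      # one step of (1) per arriving gradient
            y = project(y - eta * g)
    return decisions

# Example: K is the Euclidean unit ball and every gradient is delayed by one
# round, so nothing arrives at round 1.
project = lambda x: x / max(1.0, np.linalg.norm(x))
stream = [[], [np.ones(2)], [np.ones(2)]]
xs = dogd_inorder(stream, project, eta=0.1, x0=np.zeros(2))
```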

Appendix B Proof of Lemma 4.1

We first define

Ltη=i=1tkik(𝐱kη),L~tη=i=1ti(𝐱iη), and W~t=ηw1ηeαL~tη.L_{t}^{\eta}=\sum_{i=1}^{t}\sum_{k\in\mathcal{F}_{i}}\ell_{k}(\mathbf{x}_{k}^{\eta}),\tilde{L}_{t}^{\eta}=\sum_{i=1}^{t}\ell_{i}(\mathbf{x}_{i}^{\eta}),\text{ and }\tilde{W}_{t}=\sum_{\eta\in\mathcal{H}}w_{1}^{\eta}e^{-\alpha\tilde{L}_{t}^{\eta}}.

Moreover, we define

𝐜t=(Ltη)η||,𝐜~t=(L~tη)η||, and 𝐰t=(wtη)η||.\mathbf{c}_{t}=(L_{t}^{\eta})_{\eta\in\mathcal{H}}\in\mathbb{R}^{|\mathcal{H}|},\tilde{\mathbf{c}}_{t}=(\tilde{L}_{t}^{\eta})_{\eta\in\mathcal{H}}\in\mathbb{R}^{|\mathcal{H}|},\text{ and }\mathbf{w}_{t}=(w_{t}^{\eta})_{\eta\in\mathcal{H}}\in\mathbb{R}^{|\mathcal{H}|}.

According to Algorithm 2, for any t1t\geq 1, it is easy to verify that

wt+1η=wtηeαktk(𝐱kη)μwtμeαktk(𝐱kμ)=w1ηeαLtημw1μeαLtμ.w_{t+1}^{\eta}=\frac{w_{t}^{\eta}e^{-\alpha\sum_{k\in\mathcal{F}_{t}}\ell_{k}(\mathbf{x}_{k}^{\eta})}}{\sum_{\mu\in\mathcal{H}}w_{t}^{\mu}e^{-\alpha\sum_{k\in\mathcal{F}_{t}}\ell_{k}(\mathbf{x}_{k}^{\mu})}}=\frac{w_{1}^{\eta}e^{-\alpha L_{t}^{\eta}}}{\sum_{\mu\in\mathcal{H}}w_{1}^{\mu}e^{-\alpha L_{t}^{\mu}}}.

Combining with the above definitions, we have

𝐰t+1=argmin𝐰Δ1αln(𝐰1)+𝐜t,𝐰+1α(𝐰)\mathbf{w}_{t+1}=\operatorname*{argmin}_{\mathbf{w}\in\Delta}\left\langle-\frac{1}{\alpha}\ln(\mathbf{w}_{1})+\mathbf{c}_{t},\mathbf{w}\right\rangle+\frac{1}{\alpha}\mathcal{R}(\mathbf{w})

where Δ={𝐰𝟎|𝐰,𝟏=1}\Delta=\left\{\mathbf{w}\succeq\mathbf{0}|\langle\mathbf{w},\mathbf{1}\rangle=1\right\} and (𝐰)=iwilnwi\mathcal{R}(\mathbf{w})=\sum_{i}w_{i}\ln w_{i}.
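As a quick sanity check (not part of the proof), the following snippet verifies numerically that this entropy-regularized minimization over the simplex recovers the multiplicative-weights form of w_{t+1} above; all names and values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, m = 0.1, 5               # step size and number of experts
w1 = np.full(m, 1.0 / m)        # prior weights w_1
L = rng.normal(size=m)          # cumulative surrogate losses L_t^eta

# Closed-form multiplicative-weights update: w_{t+1} is proportional to
# w_1 * exp(-alpha * L_t).
closed_form = w1 * np.exp(-alpha * L)
closed_form /= closed_form.sum()

# Solving the entropic minimization via its first-order conditions gives a
# softmax of log(w_1) - alpha * L_t, which coincides with the closed form.
logits = np.log(w1) - alpha * L
mirror = np.exp(logits - logits.max())
mirror /= mirror.sum()

assert np.allclose(closed_form, mirror)
```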

Similarly, for any t1t\geq 1, we define

𝐰~t+1=argmin𝐰Δ1αln(𝐰1)+𝐜~t,𝐰+1α(𝐰)\tilde{\mathbf{w}}_{t+1}=\operatorname*{argmin}_{\mathbf{w}\in\Delta}\left\langle-\frac{1}{\alpha}\ln(\mathbf{w}_{1})+\tilde{\mathbf{c}}_{t},\mathbf{w}\right\rangle+\frac{1}{\alpha}\mathcal{R}(\mathbf{w})

In this way, for any t1t\geq 1, we also have 𝐰~t+1=(w~t+1η)η\tilde{\mathbf{w}}_{t+1}=(\tilde{w}_{t+1}^{\eta})_{\eta\in\mathcal{H}}, where

w~t+1η=w1ηeαL~tημw1μeαL~tμ.\tilde{w}_{t+1}^{\eta}=\frac{w_{1}^{\eta}e^{-\alpha\tilde{L}_{t}^{\eta}}}{\sum_{\mu\in\mathcal{H}}w_{1}^{\mu}e^{-\alpha\tilde{L}_{t}^{\mu}}}.

Moreover, we define 𝐰~1=𝐰1\tilde{\mathbf{w}}_{1}=\mathbf{w}_{1} and

𝐱~t=ηw~tη𝐱tη.\tilde{\mathbf{x}}_{t}=\sum_{\eta\in\mathcal{H}}\tilde{w}_{t}^{\eta}\mathbf{x}_{t}^{\eta}. (22)

Then, we will bound the distance between 𝐱~t\tilde{\mathbf{x}}_{t} and 𝐱t\mathbf{x}_{t} based on the following lemma.

Lemma B.1.

(Lemma 5 in Duchi et al. (2011)) Let Π𝒦(𝐮,α)=argmin𝐱𝒦𝐮,𝐱+1α(𝐱)\Pi_{\mathcal{K}}(\mathbf{u},\alpha)=\operatorname*{argmin}_{\mathbf{x}\in\mathcal{K}}\langle\mathbf{u},\mathbf{x}\rangle+\frac{1}{\alpha}\mathcal{R}(\mathbf{x}). If (𝐱)\mathcal{R}(\mathbf{x}) is 11-strongly convex with respect to a norm \|\cdot\|, it holds that

Π𝒦(𝐮,α)Π𝒦(𝐯,α)α𝐮𝐯\|\Pi_{\mathcal{K}}(\mathbf{u},\alpha)-\Pi_{\mathcal{K}}(\mathbf{v},\alpha)\|\leq\alpha\|\mathbf{u}-\mathbf{v}\|_{\ast}

for any 𝐮\mathbf{u} and 𝐯\mathbf{v}, where \|\cdot\|_{\ast} is the dual norm of \|\cdot\|.

Since (𝐰)=iwilnwi\mathcal{R}(\mathbf{w})=\sum_{i}w_{i}\ln w_{i} is 11-strongly convex with respect to 1\|\cdot\|_{1}, by applying Lemma B.1, for any t>1t>1, we have

𝐱~t𝐱t2=η(w~tηwtη)𝐱tη2η|w~tηwtη|𝐱tη2D𝐰~t𝐰t1αD𝐜~t1𝐜t1.\begin{split}\left\|\tilde{\mathbf{x}}_{t}-\mathbf{x}_{t}\right\|_{2}=&\left\|\sum_{\eta\in\mathcal{H}}(\tilde{w}_{t}^{\eta}-w_{t}^{\eta})\mathbf{x}_{t}^{\eta}\right\|_{2}\leq\sum_{\eta\in\mathcal{H}}|\tilde{w}_{t}^{\eta}-w_{t}^{\eta}|\left\|\mathbf{x}_{t}^{\eta}\right\|_{2}\leq D\|\tilde{\mathbf{w}}_{t}-\mathbf{w}_{t}\|_{1}\leq\alpha D\|\tilde{\mathbf{c}}_{t-1}-\mathbf{c}_{t-1}\|_{\infty}.\end{split}

Let \mathcal{U}_{t}=[t]\setminus\cup_{i\in[t]}\mathcal{F}_{i}. Note that \mathcal{U}_{t} records the time-stamps of gradients that have been queried but have not yet arrived by the end of round t. Then, for t>1, it is not hard to verify that

𝐱~t𝐱t2αD𝐜~t1𝐜t1αDmaxη|k𝒰t1k(𝐱kη)|α(t1i=1t1|i|)GD2=α(mt1)GD2\begin{split}\left\|\tilde{\mathbf{x}}_{t}-\mathbf{x}_{t}\right\|_{2}\leq&\alpha D\|\tilde{\mathbf{c}}_{t-1}-\mathbf{c}_{t-1}\|_{\infty}\leq\alpha D\max_{\eta\in\mathcal{H}}\left|\sum_{k\in\mathcal{U}_{t-1}}\ell_{k}(\mathbf{x}_{k}^{\eta})\right|\leq\alpha\left(t-1-\sum_{i=1}^{t-1}|\mathcal{F}_{i}|\right)GD^{2}=\alpha\left(m_{t}-1\right)GD^{2}\end{split} (23)

where the last inequality is due to the definition of 𝒰t\mathcal{U}_{t} and the fact that Assumptions 3.1 and 3.2 ensure

|k(𝐱kη)|=|fk(𝐱k),𝐱kη𝐱k|fk(𝐱k)2𝐱kη𝐱k2GD|\ell_{k}(\mathbf{x}_{k}^{\eta})|=|\langle\nabla f_{k}(\mathbf{x}_{k}),\mathbf{x}_{k}^{\eta}-\mathbf{x}_{k}\rangle|\leq\|\nabla f_{k}(\mathbf{x}_{k})\|_{2}\|\mathbf{x}_{k}^{\eta}-\mathbf{x}_{k}\|_{2}\leq GD (24)

for any k[T]k\in[T] and η\eta\in\mathcal{H}.

The above inequality shows that 𝐱~t\tilde{\mathbf{x}}_{t} is close to 𝐱t\mathbf{x}_{t}. In the following, we first focus on the analysis of 𝐱~t\tilde{\mathbf{x}}_{t}, and then combine with the distance between 𝐱~t\tilde{\mathbf{x}}_{t} and 𝐱t\mathbf{x}_{t}. To this end, we notice that

lnW~T=ln(ηw1ηeαL~Tη)ln(maxηw1ηeαL~Tη)=αminη(L~Tη+1αln1w1η).\begin{split}\ln\tilde{W}_{T}=&\ln\left(\sum_{\eta\in\mathcal{H}}w_{1}^{\eta}e^{-\alpha\tilde{L}_{T}^{\eta}}\right)\geq\ln\left(\max_{\eta\in\mathcal{H}}w_{1}^{\eta}e^{-\alpha\tilde{L}_{T}^{\eta}}\right)=-\alpha\min_{\eta\in\mathcal{H}}\left(\tilde{L}_{T}^{\eta}+\frac{1}{\alpha}\ln\frac{1}{{w}_{1}^{\eta}}\right).\end{split} (25)

Next, for any t2t\geq 2, we have

ln(W~tW~t1)=ln(ηw1ηeαL~tηηw1ηeαL~t1η)=ln(ηw1ηeαL~t1ηeαt(𝐱tη)ηw1ηeαL~t1η)=ln(ηw~tηeαt(𝐱tη)).\begin{split}\ln\left(\frac{\tilde{W}_{t}}{\tilde{W}_{t-1}}\right)=&\ln\left(\frac{\sum_{\eta\in\mathcal{H}}w_{1}^{\eta}e^{-\alpha\tilde{L}_{t}^{\eta}}}{\sum_{\eta\in\mathcal{H}}w_{1}^{\eta}e^{-\alpha\tilde{L}_{t-1}^{\eta}}}\right)=\ln\left(\frac{\sum_{\eta\in\mathcal{H}}w_{1}^{\eta}e^{-\alpha\tilde{L}_{t-1}^{\eta}}e^{-\alpha\ell_{t}(\mathbf{x}_{t}^{\eta})}}{\sum_{\eta\in\mathcal{H}}w_{1}^{\eta}e^{-\alpha\tilde{L}_{t-1}^{\eta}}}\right)=\ln\left(\sum_{\eta\in\mathcal{H}}\tilde{w}_{t}^{\eta}e^{-\alpha\ell_{t}(\mathbf{x}_{t}^{\eta})}\right).\end{split} (26)

Combining (26) and \tilde{w}_{1}^{\eta}=w_{1}^{\eta}, we have

lnW~T=lnW~1+t=2Tln(W~tW~t1)=t=1Tln(ηw~tηeαt(𝐱tη)).\begin{split}\ln\tilde{W}_{T}=\ln\tilde{W}_{1}+\sum_{t=2}^{T}\ln\left(\frac{\tilde{W}_{t}}{\tilde{W}_{t-1}}\right)=\sum_{t=1}^{T}\ln\left(\sum_{\eta\in\mathcal{H}}\tilde{w}_{t}^{\eta}e^{-\alpha\ell_{t}(\mathbf{x}_{t}^{\eta})}\right).\end{split} (27)

To proceed, we introduce Hoeffding's lemma (Hoeffding, 1963).

Lemma B.2.

Let XX be a random variable with aXba\leq X\leq b. Then, for any ss\in\mathbb{R}, it holds that

ln𝔼[esX]s𝔼[X]+s2(ba)28.\ln\mathbb{E}[e^{sX}]\leq s\mathbb{E}[X]+\frac{s^{2}(b-a)^{2}}{8}.

From (24) and Lemma B.2, we have

ln(ηw~tηeαt(𝐱tη))αηw~tηt(𝐱tη)+α2G2D22αt(𝐱~t)+α2G2D22\begin{split}\ln\left(\sum_{\eta\in\mathcal{H}}\tilde{w}_{t}^{\eta}e^{-\alpha\ell_{t}(\mathbf{x}_{t}^{\eta})}\right)\leq&-\alpha\sum_{\eta\in\mathcal{H}}\tilde{w}_{t}^{\eta}\ell_{t}(\mathbf{x}_{t}^{\eta})+\frac{\alpha^{2}G^{2}D^{2}}{2}\leq-\alpha\ell_{t}\left(\tilde{\mathbf{x}}_{t}\right)+\frac{\alpha^{2}G^{2}D^{2}}{2}\end{split} (28)

where the second inequality is due to Jensen’s inequality and (22).

Combining (27) with (28), we have

lnW~Tαt=1Tt(𝐱~t)+α2G2D2T2.\begin{split}\ln\tilde{W}_{T}\leq-\alpha\sum_{t=1}^{T}\ell_{t}\left(\tilde{\mathbf{x}}_{t}\right)+\frac{\alpha^{2}G^{2}D^{2}T}{2}.\end{split}

Then, by further combining with (25), we have

t=1Tt(𝐱~t)minη(t=1Tt(𝐱tη)+1αln1w1η)αG2D2T2.\begin{split}\sum_{t=1}^{T}\ell_{t}\left(\tilde{\mathbf{x}}_{t}\right)-\min_{\eta\in\mathcal{H}}\left(\sum_{t=1}^{T}\ell_{t}(\mathbf{x}_{t}^{\eta})+\frac{1}{\alpha}\ln\frac{1}{{w}_{1}^{\eta}}\right)\leq\frac{\alpha G^{2}D^{2}T}{2}.\end{split}

Finally, combining with (23), for any η\eta\in\mathcal{H}, we have

t=1Tt(𝐱t)(t=1Tt(𝐱tη)+1αln1w1η)=t=1Tt(𝐱t)t=1Tt(𝐱~t)+t=1Tt(𝐱~t)(t=1Tt(𝐱tη)+1αln1w1η)t=1Tft(𝐱t),𝐱t𝐱~t+αG2D2T2t=1Tft(𝐱t)2𝐱t𝐱~t2+αG2D2T2αG2D2t=1T(mt1)+αG2D2T2αG2D2t=1Tmt\begin{split}&\sum_{t=1}^{T}\ell_{t}\left(\mathbf{x}_{t}\right)-\left(\sum_{t=1}^{T}\ell_{t}(\mathbf{x}_{t}^{\eta})+\frac{1}{\alpha}\ln\frac{1}{{w}_{1}^{\eta}}\right)\\ =&\sum_{t=1}^{T}\ell_{t}\left(\mathbf{x}_{t}\right)-\sum_{t=1}^{T}\ell_{t}\left(\tilde{\mathbf{x}}_{t}\right)+\sum_{t=1}^{T}\ell_{t}\left(\tilde{\mathbf{x}}_{t}\right)-\left(\sum_{t=1}^{T}\ell_{t}(\mathbf{x}_{t}^{\eta})+\frac{1}{\alpha}\ln\frac{1}{{w}_{1}^{\eta}}\right)\\ \leq&\sum_{t=1}^{T}\langle\nabla f_{t}(\mathbf{x}_{t}),\mathbf{x}_{t}-\tilde{\mathbf{x}}_{t}\rangle+\frac{\alpha G^{2}D^{2}T}{2}\leq\sum_{t=1}^{T}\|\nabla f_{t}(\mathbf{x}_{t})\|_{2}\|\mathbf{x}_{t}-\tilde{\mathbf{x}}_{t}\|_{2}+\frac{\alpha G^{2}D^{2}T}{2}\\ \leq&\alpha G^{2}D^{2}\sum_{t=1}^{T}\left(m_{t}-1\right)+\frac{\alpha G^{2}D^{2}T}{2}\leq\alpha G^{2}D^{2}\sum_{t=1}^{T}m_{t}\end{split} (29)

which completes the proof.

Appendix C Proof of Lemma 4.2

Let Z=\left\lceil T/d\right\rceil. We first divide the total T rounds into Z blocks, where each of the first Z-1 blocks has length d and the last block has length T-(Z-1)d. In this way, we can define the set of rounds in block z as

𝒯z={(z1)d+1,,min{zd,T}}.\mathcal{T}_{z}=\{(z-1)d+1,\dots,\min\{zd,T\}\}.

For any z[Z]z\in[Z] and t𝒯zt\in\mathcal{T}_{z}, we construct the delay as

dt=min{zd,T}t+1d_{t}=\min\{zd,T\}-t+1

which satisfies 1\leq d_{t}\leq d. These delays ensure that the information of any function in block z is delayed to the end of that block, which is critical for constructing loss functions that maximize the impact of delays on the static regret.
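The following quick check (illustrative, not part of the proof) confirms that this construction keeps every delay in [1,d] and postpones all feedback of block z to its last round \min\{zd,T\}.

```python
# Verify the constructed delays d_t = min{zd, T} - t + 1 for small T and d.
T, d = 10, 3
Z = -(-T // d)  # ceil(T / d)
for z in range(1, Z + 1):
    block_end = min(z * d, T)
    for t in range((z - 1) * d + 1, block_end + 1):
        d_t = block_end - t + 1
        assert 1 <= d_t <= d                 # delays stay within [1, d]
        assert t + d_t - 1 == block_end      # feedback arrives at block end
```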

Note that to establish the lower bound of the static regret in the non-delayed setting, one can utilize a randomized strategy to select loss functions for each round (Abernethy et al., 2008). Here, to maximize the impact of delays, we only select one loss function hz(𝐱)h_{z}(\mathbf{x}) for all rounds in the same block zz, i.e., ft(𝐱)=hz(𝐱)f_{t}(\mathbf{x})=h_{z}(\mathbf{x}) for any t𝒯zt\in\mathcal{T}_{z}. Specifically, we set

hz(𝐱)=Gn𝐰z,𝐱h_{z}(\mathbf{x})=\frac{G}{\sqrt{n}}\langle\mathbf{w}_{z},\mathbf{x}\rangle

where the i-th coordinate of \mathbf{w}_{z}, denoted by w_{z,i}, is \pm 1 with probability 1/2, independently over all i\in[n] and z\in[Z]. It is not hard to verify that h_{z}(\mathbf{x}) satisfies Assumption 3.1.

From the above definitions, we have

𝔼𝐰1,,𝐰Z[R(T)]=𝔼𝐰1,,𝐰Z[t=1Tft(𝐱t)min𝐱𝒦t=1Tft(𝐱)]=𝔼𝐰1,,𝐰Z[z=1Zt𝒯zGn𝐰z,𝐱tmin𝐱𝒦z=1Zt𝒯zGn𝐰z,𝐱]=𝔼𝐰1,,𝐰Z[min𝐱𝒦z=1ZG|𝒯z|n𝐰z,𝐱]\begin{split}\mathbb{E}_{\mathbf{w}_{1},\dots,\mathbf{w}_{Z}}[R(T)]=&\mathbb{E}_{\mathbf{w}_{1},\dots,\mathbf{w}_{Z}}\left[\sum_{t=1}^{T}f_{t}(\mathbf{x}_{t})-\min_{\mathbf{x}\in\mathcal{K}}\sum_{t=1}^{T}f_{t}(\mathbf{x})\right]\\ =&\mathbb{E}_{\mathbf{w}_{1},\dots,\mathbf{w}_{Z}}\left[\sum_{z=1}^{Z}\sum_{t\in\mathcal{T}_{z}}\frac{G}{\sqrt{n}}\langle\mathbf{w}_{z},\mathbf{x}_{t}\rangle-\min_{\mathbf{x}\in\mathcal{K}}\sum_{z=1}^{Z}\sum_{t\in\mathcal{T}_{z}}\frac{G}{\sqrt{n}}\langle\mathbf{w}_{z},\mathbf{x}\rangle\right]\\ =&\mathbb{E}_{\mathbf{w}_{1},\dots,\mathbf{w}_{Z}}\left[-\min_{\mathbf{x}\in\mathcal{K}}\sum_{z=1}^{Z}\frac{G|\mathcal{T}_{z}|}{\sqrt{n}}\langle\mathbf{w}_{z},\mathbf{x}\rangle\right]\end{split}

where the third equality is due to \mathbb{E}_{\mathbf{w}_{1},\dots,\mathbf{w}_{Z}}[\langle\mathbf{w}_{z},\mathbf{x}_{t}\rangle]=0 for any t\in\mathcal{T}_{z}, which follows from the fact that any decision \mathbf{x}_{t} in block z is made before receiving the information of \mathbf{w}_{z}, and thus is independent of \mathbf{w}_{z}.

Since a linear function is minimized at the vertices of the cube, we further have

𝔼𝐰1,,𝐰Z[R(T)]=𝔼𝐰1,,𝐰Z[min𝐱{D/(2n),D/(2n)}nz=1ZG|𝒯z|n𝐰z,𝐱]=𝔼𝐰1,,𝐰Z[i=1nD2n|z=1Zwz,iG|𝒯z|n|]=DG2𝔼𝐰1,,𝐰Z[|z=1Zwz,1|𝒯z||]DG22z=1Z|𝒯z|2DG22(z=1Z|𝒯z|)2Z=DGT22T/d\begin{split}\mathbb{E}_{\mathbf{w}_{1},\dots,\mathbf{w}_{Z}}[R(T)]=&-\mathbb{E}_{\mathbf{w}_{1},\dots,\mathbf{w}_{Z}}\left[\min_{\mathbf{x}\in\{-D/(2\sqrt{n}),D/(2\sqrt{n})\}^{n}}\sum_{z=1}^{Z}\frac{G|\mathcal{T}_{z}|}{\sqrt{n}}\langle\mathbf{w}_{z},\mathbf{x}\rangle\right]\\ =&\mathbb{E}_{\mathbf{w}_{1},\dots,\mathbf{w}_{Z}}\left[\sum_{i=1}^{n}\frac{D}{2\sqrt{n}}\left|\sum_{z=1}^{Z}\frac{w_{z,i}G|\mathcal{T}_{z}|}{\sqrt{n}}\right|\right]\\ =&\frac{DG}{2}\mathbb{E}_{\mathbf{w}_{1},\dots,\mathbf{w}_{Z}}\left[\left|\sum_{z=1}^{Z}w_{z,1}|\mathcal{T}_{z}|\right|\right]\geq\frac{DG}{2\sqrt{2}}\sqrt{\sum_{z=1}^{Z}|\mathcal{T}_{z}|^{2}}\\ \geq&\frac{DG}{2\sqrt{2}}\sqrt{\frac{(\sum_{z=1}^{Z}|\mathcal{T}_{z}|)^{2}}{Z}}=\frac{DGT}{2\sqrt{2\left\lceil T/d\right\rceil}}\end{split} (30)

where the first inequality is due to the Khintchine inequality and the second inequality is due to the Cauchy-Schwarz inequality.
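For completeness, the form of the Khintchine inequality applied in the first step states that for independent random signs \varepsilon_{1},\dots,\varepsilon_{Z} taking values \pm 1 with equal probability, and any real coefficients a_{1},\dots,a_{Z},

\mathbb{E}\left[\left|\sum_{z=1}^{Z}\varepsilon_{z}a_{z}\right|\right]\geq\frac{1}{\sqrt{2}}\left(\sum_{z=1}^{Z}a_{z}^{2}\right)^{1/2}

where 1/\sqrt{2} is known to be the optimal constant. Here, it is applied with \varepsilon_{z}=w_{z,1} and a_{z}=|\mathcal{T}_{z}|.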

The expected lower bound in (30) implies that for any OCO algorithm and any positive integer dd, there exists a particular choice of 𝐰1,,𝐰Z\mathbf{w}_{1},\dots,\mathbf{w}_{Z} such that

R(T)DGT22T/d.\begin{split}R(T)\geq\frac{DGT}{2\sqrt{2\left\lceil T/d\right\rceil}}.\end{split}

Appendix D DOGD with the Doubling Trick

As discussed after Corollary 3.6, our DOGD needs a learning rate depending on the following value

t=1Tmt=t=1T(ti=1t1|i|).\sum_{t=1}^{T}m_{t}=\sum_{t=1}^{T}\left(t-\sum_{i=1}^{t-1}|\mathcal{F}_{i}|\right).

However, it may not be available beforehand. Fortunately, the doubling trick (Cesa-Bianchi & Lugosi, 2006) provides a way to adaptively estimate this value. Specifically, it divides the total T rounds into several epochs, and runs a new instance of DOGD in each epoch. Let s_{v} and s_{v+1}-1 denote the start round and the end round of the v-th epoch, respectively. In this way, to tune the learning rate for the v-th epoch, we only need to know the following value

t=svsv+11(t+1svi=svt1|isv|)\sum_{t=s_{v}}^{s_{v+1}-1}\left(t+1-s_{v}-\sum_{i=s_{v}}^{t-1}|\mathcal{F}_{i}^{s_{v}}|\right)

where isv={k[sv,i]|k+dk1=i}\mathcal{F}_{i}^{s_{v}}=\{k\in[s_{v},i]|k+d_{k}-1=i\}.

According to the doubling trick, we estimate this value as 2^{v} at the start round s_{v} of the v-th epoch. Then, for any t>s_{v}, we first check whether the estimate is still valid, i.e.,

j=svt(j+1svi=svj1|isv|)2v\sum_{j=s_{v}}^{t}\left(j+1-s_{v}-\sum_{i=s_{v}}^{j-1}|\mathcal{F}_{i}^{s_{v}}|\right)\leq 2^{v}

where the left-hand side can be calculated at the beginning of round t. If this condition holds, round t is still assigned to the v-th epoch, and the instance of DOGD keeps running. Otherwise, round t is set as the start round of the (v+1)-th epoch, and a new instance of DOGD is activated. Notice that at the start round of the (v+1)-th epoch, the new estimate must be valid, since t=s_{v+1} and

j=sv+1t(j+1sv+1i=sv+1j1|isv+1|)=12v+1.\sum_{j=s_{v+1}}^{t}\left(j+1-s_{v+1}-\sum_{i=s_{v+1}}^{j-1}|\mathcal{F}_{i}^{s_{v+1}}|\right)=1\leq 2^{v+1}.

Moreover, it is natural to set s1=1s_{1}=1. Then, the detailed procedures of DOGD with the doubling trick are summarized in Algorithm 4.

Remark: First, in Algorithm 4, the learning rate ηv\eta_{v} is set by replacing t=1Tmt\sum_{t=1}^{T}m_{t} in the learning rate required by Corollary 3.6 with 2v2^{v}. Second, in each epoch vv, we do not need to utilize gradients queried before this epoch. For this reason, in Algorithm 4, we only receive {fk(𝐱k)|ktsv}\{\nabla f_{k}(\mathbf{x}_{k})|k\in\mathcal{F}_{t}^{s_{v}}\}, instead of {fk(𝐱k)|kt}\{\nabla f_{k}(\mathbf{x}_{k})|k\in\mathcal{F}_{t}\}.
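To illustrate the epoch management of Algorithm 4, the following sketch tracks the number of outstanding gradients and restarts DOGD once the estimate 2^{v} is exhausted; grad_oracle, project, and delays are assumed to be supplied by the environment, and the names are ours rather than the paper's.

```python
import numpy as np

def dogd_doubling(T, delays, grad_oracle, project, D, G, x0):
    """Sketch of Algorithm 4. delays[t] = d_t, so the gradient queried at
    round t arrives at the end of round t + d_t - 1."""
    y = np.asarray(x0, dtype=float).copy()
    v, outstanding, budget_used = 1, 0, 0
    pending = {}       # arrival round -> list of (query round, gradient)
    decisions = []
    for t in range(1, T + 1):
        # Step 3: the summand for round t equals (outstanding gradients) + 1,
        # so the epoch's running sum would exceed 2^v once this check fails.
        if budget_used + outstanding + 1 > 2 ** v:
            y, v = np.asarray(x0, dtype=float).copy(), v + 1
            outstanding, budget_used, pending = 0, 0, {}  # drop the old epoch
        decisions.append(y.copy())                        # play x_t
        g = grad_oracle(t, y)                             # query the gradient
        pending.setdefault(t + delays[t] - 1, []).append((t, g))
        outstanding += 1
        budget_used += outstanding
        eta_v = D / (G * 2 ** (v / 2))                    # step 9
        for _, gk in sorted(pending.pop(t, [])):          # ascending order
            y = project(y - eta_v * gk)                   # descent step (1)
            outstanding -= 1
    return decisions
```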

We have the following theorem, which can recover the dynamic regret bound in Corollary 3.6 up to a constant factor.

Algorithm 4 DOGD with the Doubling Trick
1:  Initialization: set 𝐲1=𝟎\mathbf{y}_{1}=\mathbf{0}, τ=1\tau=1, v=1v=1, and sv=1s_{v}=1
2:  for t=1,,Tt=1,\dots,T do
3:     if j=svt(j+1svi=svj1|isv|)>2v\sum_{j=s_{v}}^{t}\left(j+1-s_{v}-\sum_{i=s_{v}}^{j-1}|\mathcal{F}_{i}^{s_{v}}|\right)>2^{v} then
4:        Set 𝐲1=𝟎\mathbf{y}_{1}=\mathbf{0}, τ=1\tau=1, v=v+1v=v+1, and sv=ts_{v}=t
5:     end if
6:     Play 𝐱t=𝐲τ\mathbf{x}_{t}=\mathbf{y}_{\tau} and query ft(𝐱t)\nabla f_{t}(\mathbf{x}_{t})
7:     Receive {fk(𝐱k)|ktsv}\{\nabla f_{k}(\mathbf{x}_{k})|k\in\mathcal{F}_{t}^{s_{v}}\}, where tsv={k[sv,t]|k+dk1=t}\mathcal{F}_{t}^{s_{v}}=\{k\in[s_{v},t]|k+d_{k}-1=t\}
8:     for ktsvk\in\mathcal{F}_{t}^{s_{v}} (in the ascending order) do
9:        Compute 𝐲τ+1=argmin𝐱𝒦𝐱(𝐲τηvfk(𝐱k))22\mathbf{y}_{\tau+1}=\operatorname*{argmin}_{\mathbf{x}\in\mathcal{K}}\|\mathbf{x}-(\mathbf{y}_{\tau}-\eta_{v}\nabla f_{k}(\mathbf{x}_{k}))\|_{2}^{2}, where ηv=DG2v/2\eta_{v}=\frac{D}{G2^{v/2}}
10:        Set τ=τ+1\tau=\tau+1
11:     end for
12:  end for
Theorem D.1.

Under Assumptions 3.1 and 3.2, for any comparator sequence \mathbf{u}_{1},\dots,\mathbf{u}_{T}\in\mathcal{K}, Algorithm 4 ensures

R(𝐮1,,𝐮T)\displaystyle R(\mathbf{u}_{1},\dots,\mathbf{u}_{T})\leq 2G(2D+PT)d¯T21+C\displaystyle\frac{2G\left(2D+P_{T}\right)\sqrt{\bar{d}T}}{\sqrt{2}-1}+C

where CC is defined in (6).

Proof.

For any s_{v} and j\geq s_{v}, we first notice that the value of j-s_{v}-\sum_{i=s_{v}}^{j-1}|\mathcal{F}_{i}^{s_{v}}| counts the number of gradients that have been queried over the interval [s_{v},j-1] but have not yet arrived by the end of round j-1. Moreover, the gradient \nabla f_{j}(\mathbf{x}_{j}) is only counted as an unreceived gradient in d_{j}-1 rounds. Therefore, for any s_{v}\leq t\leq T, it is easy to verify that

j=svt(j+1svi=svj1|isv|)j=svtdjj=1Tdj=d¯T.\displaystyle\sum_{j=s_{v}}^{t}\left(j+1-s_{v}-\sum_{i=s_{v}}^{j-1}|\mathcal{F}_{i}^{s_{v}}|\right)\leq\sum_{j=s_{v}}^{t}d_{j}\leq\sum_{j=1}^{T}d_{j}=\bar{d}T.

For brevity, let VV denote the final vv of Algorithm 4, and let S=d¯TS=\bar{d}T. It is easy to verify that

V1+log2S.V\leq 1+\log_{2}S. (31)

Then, let s_{V+1}=T+1. We notice that for v\in[V], Algorithm 4 actually starts or restarts Algorithm 1 with learning rate \eta_{v} at round s_{v}, and this instance ends at round s_{v+1}-1. Therefore, combining Theorem 3.4 with Lemma 3.5, under Assumptions 3.1 and 3.2, we have

t=svsv+11ft(𝐱t)t=svsv+11ft(𝐮t)D2+Dt=sv+1sv+11𝐮t𝐮t12ηv+ηvG2j=svsv+11(j+1svi=svj1|isv|)+Cv\begin{split}&\sum_{t=s_{v}}^{s_{v+1}-1}f_{t}(\mathbf{x}_{t})-\sum_{t=s_{v}}^{s_{v+1}-1}f_{t}(\mathbf{u}_{t})\\ \leq&\frac{D^{2}+D\sum_{t=s_{v}+1}^{s_{v+1}-1}\|\mathbf{u}_{t}-\mathbf{u}_{t-1}\|_{2}}{\eta_{v}}+{\eta_{v}G^{2}}\sum_{j=s_{v}}^{s_{v+1}-1}\left(j+1-s_{v}-\sum_{i=s_{v}}^{j-1}|\mathcal{F}_{i}^{s_{v}}|\right)+C_{v}\end{split} (32)

where

Cv={0,if Assumption 3.3 also holds;min{(sv+1sv)GD,2dGt=sv+1sv+11𝐮t𝐮t12},otherwise.C_{v}=\left\{\begin{aligned} &0,~{}\text{if Assumption \ref{assum4} also holds;}\\ &\min\left\{(s_{v+1}-s_{v})GD,2dG\sum_{t=s_{v}+1}^{s_{v+1}-1}\|\mathbf{u}_{t}-\mathbf{u}_{t-1}\|_{2}\right\},~{}\text{otherwise.}\end{aligned}\right. (33)

Moreover, we notice that Algorithm 4 also ensures that

j=svsv+11(j+1svi=svj1|isv|)2v.\sum_{j=s_{v}}^{s_{v+1}-1}\left(j+1-s_{v}-\sum_{i=s_{v}}^{j-1}|\mathcal{F}_{i}^{s_{v}}|\right)\leq 2^{v}. (34)

By substituting the above inequality into (32), we have

t=svsv+11ft(𝐱t)t=svsv+11ft(𝐮t)D2+Dt=sv+1sv+11𝐮t𝐮t12ηv+ηvG22v+Cv=G2v/2(2D+t=sv+1sv+11𝐮t𝐮t12)+CvG2v/2(2D+PT)+Cv.\begin{split}\sum_{t=s_{v}}^{s_{v+1}-1}f_{t}(\mathbf{x}_{t})-\sum_{t=s_{v}}^{s_{v+1}-1}f_{t}(\mathbf{u}_{t})\leq&\frac{D^{2}+D\sum_{t=s_{v}+1}^{s_{v+1}-1}\|\mathbf{u}_{t}-\mathbf{u}_{t-1}\|_{2}}{\eta_{v}}+{\eta_{v}G^{2}}2^{v}+C_{v}\\ =&G2^{v/2}\left(2D+\sum_{t=s_{v}+1}^{s_{v+1}-1}\|\mathbf{u}_{t}-\mathbf{u}_{t-1}\|_{2}\right)+C_{v}\\ \leq&G2^{v/2}\left(2D+P_{T}\right)+C_{v}.\end{split} (35)

Then, because of (31), we have

R(𝐮1,,𝐮T)=v=1V(t=svsv+11ft(𝐱t)t=svsv+11ft(𝐮t))v=1VG2v/2(2D+PT)+v=1VCv=G(2D+PT)2(2V/21)21+v=1VCv2G(2D+PT)S21+v=1VCv.\begin{split}R(\mathbf{u}_{1},\dots,\mathbf{u}_{T})=&\sum_{v=1}^{V}\left(\sum_{t=s_{v}}^{s_{v+1}-1}f_{t}(\mathbf{x}_{t})-\sum_{t=s_{v}}^{s_{v+1}-1}f_{t}(\mathbf{u}_{t})\right)\leq\sum_{v=1}^{V}G2^{v/2}\left(2D+P_{T}\right)+\sum_{v=1}^{V}C_{v}\\ =&G\left(2D+P_{T}\right)\frac{\sqrt{2}(2^{V/2}-1)}{\sqrt{2}-1}+\sum_{v=1}^{V}C_{v}\leq\frac{2G\left(2D+P_{T}\right)\sqrt{S}}{\sqrt{2}-1}+\sum_{v=1}^{V}C_{v}.\end{split} (36)

Moreover, it is not hard to verify that

v=1Vmin{(sv+1sv)GD,2dGt=sv+1sv+11𝐮t𝐮t12}\displaystyle\sum_{v=1}^{V}\min\left\{(s_{v+1}-s_{v})GD,2dG\sum_{t=s_{v}+1}^{s_{v+1}-1}\|\mathbf{u}_{t}-\mathbf{u}_{t-1}\|_{2}\right\}
\displaystyle\leq min{v=1V(sv+1sv)GD,v=1V2dGt=sv+1sv+11𝐮t𝐮t12}\displaystyle\min\left\{\sum_{v=1}^{V}(s_{v+1}-s_{v})GD,\sum_{v=1}^{V}2dG\sum_{t=s_{v}+1}^{s_{v+1}-1}\|\mathbf{u}_{t}-\mathbf{u}_{t-1}\|_{2}\right\}
\displaystyle\leq min{TGD,2dGPT}\displaystyle\min\left\{TGD,2dGP_{T}\right\}

which implies that

v=1VCvC.\sum_{v=1}^{V}C_{v}\leq C. (37)

Finally, we complete the proof by substituting (37) and S=\bar{d}T into (36). ∎

Algorithm 5 Mild-OGD with the Doubling Trick: Meta-algorithm
1:  Initialization: set v=1v=1 and sv=1s_{v}=1
2:  Activate a set of experts \left\{E^{\eta}|\eta\in\mathcal{H}\right\} by invoking the expert-algorithm for each constant \eta\in\mathcal{H}, where
={ηi=D2i1G|i=1,,log2T+1+1}\mathcal{H}=\left\{\eta_{i}=\left.\frac{D2^{i-1}}{G}\right|i=1,\dots,\left\lceil\log_{2}\sqrt{T+1}\right\rceil+1\right\}
3:  Set w_{1}^{\eta_{i}}=\frac{|\mathcal{H}|+1}{i(i+1)|\mathcal{H}|}
4:  for t=1,,Tt=1,\dots,T do
5:     if j=svt(j+1svi=svj1|isv|)>2v\sum_{j=s_{v}}^{t}\left(j+1-s_{v}-\sum_{i=s_{v}}^{j-1}|\mathcal{F}_{i}^{s_{v}}|\right)>2^{v} then
6:        Set v=v+1v=v+1, sv=ts_{v}=t, and wtηi=||+1i(i+1)||w_{t}^{\eta_{i}}=\frac{|\mathcal{H}|+1}{i(i+1)|\mathcal{H}|}
7:     end if
8:     Receive 𝐱tη\mathbf{x}_{t}^{\eta} from each expert EηE^{\eta}
9:     Play the decision 𝐱t=ηwtη𝐱tη\mathbf{x}_{t}=\sum_{\eta\in\mathcal{H}}w_{t}^{\eta}\mathbf{x}_{t}^{\eta}
10:     Query ft(𝐱t)\nabla f_{t}(\mathbf{x}_{t}) and receive {fk(𝐱k)|ktsv}\{\nabla f_{k}(\mathbf{x}_{k})|k\in\mathcal{F}_{t}^{s_{v}}\}, where tsv={k[sv,t]|k+dk1=t}\mathcal{F}_{t}^{s_{v}}=\{k\in[s_{v},t]|k+d_{k}-1=t\}
11:     Update the weight of each expert by
wt+1η=wtηeαvktsvk(𝐱kη)μwtμeαvktsvk(𝐱kμ)w_{t+1}^{\eta}=\frac{w_{t}^{\eta}e^{-\alpha_{v}\sum_{k\in\mathcal{F}_{t}^{s_{v}}}\ell_{k}(\mathbf{x}_{k}^{\eta})}}{\sum_{\mu\in\mathcal{H}}w_{t}^{\mu}e^{-\alpha_{v}\sum_{k\in\mathcal{F}_{t}^{s_{v}}}\ell_{k}(\mathbf{x}_{k}^{\mu})}}
where k(𝐱)=fk(𝐱k),𝐱𝐱k\ell_{k}(\mathbf{x})=\langle\nabla f_{k}(\mathbf{x}_{k}),\mathbf{x}-\mathbf{x}_{k}\rangle and αv=1GD2v/2\alpha_{v}=\frac{1}{GD2^{v/2}}
12:     Send {fk(𝐱k)|ktsv}\{\nabla f_{k}(\mathbf{x}_{k})|k\in\mathcal{F}_{t}^{s_{v}}\} to each expert EηE^{\eta}
13:  end for
Algorithm 6 Mild-OGD with the Doubling Trick: Expert-algorithm
1:  Input: a constant η\eta
2:  Initialization: set 𝐲1η=𝟎\mathbf{y}_{1}^{\eta}=\mathbf{0}, τ=1\tau=1, v=1v=1, and sv=1s_{v}=1
3:  for t=1,,Tt=1,\dots,T do
4:     if j=svt(j+1svi=svj1|isv|)>2v\sum_{j=s_{v}}^{t}\left(j+1-s_{v}-\sum_{i=s_{v}}^{j-1}|\mathcal{F}_{i}^{s_{v}}|\right)>2^{v} then
5:        Set \mathbf{y}_{1}^{\eta}=\mathbf{0}, \tau=1, v=v+1, and s_{v}=t
6:     end if
7:     Submit 𝐱tη=𝐲τη\mathbf{x}_{t}^{\eta}=\mathbf{y}_{\tau}^{\eta} to the meta-algorithm
8:     Receive gradients {fk(𝐱k)|ktsv}\{\nabla f_{k}(\mathbf{x}_{k})|k\in\mathcal{F}_{t}^{s_{v}}\} from the meta-algorithm
9:     for k\in\mathcal{F}_{t}^{s_{v}} (in the ascending order) do
10:        Compute 𝐲τ+1η=argmin𝐱𝒦𝐱(𝐲τηη2v/2fk(𝐱k))22\mathbf{y}_{\tau+1}^{\eta}=\operatorname*{argmin}_{\mathbf{x}\in\mathcal{K}}\left\|\mathbf{x}-\left(\mathbf{y}_{\tau}^{\eta}-\frac{\eta}{2^{v/2}}\nabla f_{k}(\mathbf{x}_{k})\right)\right\|_{2}^{2}
11:        Set τ=τ+1\tau=\tau+1
12:     end for
13:  end for

Appendix E Mild-OGD with the Doubling Trick

Similar to DOGD, Mild-OGD requires the value of t=1Tmt\sum_{t=1}^{T}m_{t} for setting

α=1GDt=1Tmt and ηi=2i1DGt=1Tmt\alpha=\frac{1}{GD\sqrt{\sum_{t=1}^{T}m_{t}}}\text{~{}and~{}}\eta_{i}=\frac{2^{i-1}D}{G\sqrt{\sum_{t=1}^{T}m_{t}}} (38)

where \alpha is the learning rate for updating the weights, and \eta_{i} is the learning rate of the i-th expert. Since \sum_{t=1}^{T}m_{t} may not be available beforehand, we can utilize the doubling trick as described in the previous section. The only change is to replace DOGD with Mild-OGD. The detailed procedures of Mild-OGD with the doubling trick are outlined in Algorithms 5 and 6.

Remark: We would like to emphasize that since multiple instances of the expert-algorithm run over the surrogate losses defined by the meta-algorithm, these instances and the meta-algorithm start a new epoch synchronously. Moreover, as shown in step 6 of Algorithm 5, at the start of each epoch, we need to reinitialize the weight of each expert E^{\eta}. As shown in step 11, in each epoch v, we update the weights by using the learning rate \alpha_{v}, which replaces \sum_{t=1}^{T}m_{t} in (38) with 2^{v}. Additionally, to facilitate presentation, in step 2 of Algorithm 5, each \eta_{i} in \mathcal{H} only contains the constant part that does not depend on the value of \sum_{t=1}^{T}m_{t}. Meanwhile, according to steps 1 and 10 of Algorithm 6, the i-th expert receives \eta_{i} from the meta-algorithm, and combines it with the estimate of \sum_{t=1}^{T}m_{t} to compute its learning rate, as illustrated below.
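The following illustrative helper (our naming, not the paper's) makes this concrete by building the grid \mathcal{H} from step 2 of Algorithm 5, the prior weights from step 3, and the per-epoch rates used in steps 10 and 11.

```python
import math

def epoch_rates(D, G, T, v):
    """Constant parts eta_i, prior weights, and the rates used in epoch v."""
    m = math.ceil(math.log2(math.sqrt(T + 1))) + 1               # m = |H|
    H = [D * 2 ** (i - 1) / G for i in range(1, m + 1)]          # step 2
    w1 = [(m + 1) / (i * (i + 1) * m) for i in range(1, m + 1)]  # step 3
    assert abs(sum(w1) - 1.0) < 1e-12         # the prior sums to one
    expert_rates = [eta / 2 ** (v / 2) for eta in H]             # step 10
    alpha_v = 1.0 / (G * D * 2 ** (v / 2))                       # step 11
    return H, w1, expert_rates, alpha_v

# For instance, with D = G = 1 and T = 15, H = [1, 2, 4], so |H| = 3.
```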

Furthermore, we have the following theorem, which can recover the dynamic regret bound in Theorem 3.7 up to a constant factor.

Theorem E.1.

Under Assumptions 3.1 and 3.2, for any comparator sequence \mathbf{u}_{1},\dots,\mathbf{u}_{T}\in\mathcal{K}, Algorithms 5 and 6 ensure

R(𝐮1,,𝐮T)2((2ln(log2(D+PT)/D+2)+1)GD+3GD2+DPT)d¯T21+CR(\mathbf{u}_{1},\dots,\mathbf{u}_{T})\leq\frac{2\left(\left(2\ln\left(\left\lfloor\log_{2}\sqrt{\left(D+P_{T}\right)/D}\right\rfloor+2\right)+1\right)GD+3G\sqrt{D^{2}+DP_{T}}\right)\sqrt{\bar{d}T}}{\sqrt{2}-1}+C

where CC is defined in (6).

Proof.

Following the proof of Theorem D.1, we use VV to denote the final vv of Algorithms 5 and 6 and define sV+1=T+1s_{V+1}=T+1. Moreover, let S=d¯TS=\bar{d}T. It is easy to verify that (31) also holds.

Then, we consider the dynamic regret of Algorithm 5 over the interval [sv,sv+11][s_{v},s_{v+1}-1] for each v[V]v\in[V]. Let

ηv=DG2(D+t=sv+1sv+11𝐮t𝐮t12).\eta_{\ast}^{v}=\sqrt{\frac{D}{G^{2}}\left(D+\sum_{t=s_{v}+1}^{s_{v+1}-1}\|\mathbf{u}_{t}-\mathbf{u}_{t-1}\|_{2}\right)}.

From Assumption 3.2, we have

0t=sv+1sv+11𝐮t𝐮t12(sv+1sv1)DTD0\leq\sum_{t=s_{v}+1}^{s_{v+1}-1}\|\mathbf{u}_{t}-\mathbf{u}_{t-1}\|_{2}\leq(s_{v+1}-s_{v}-1)D\leq TD

which implies that

η1=DGηvDT+1Gη||.\eta_{1}=\frac{D}{G}\leq\eta_{\ast}^{v}\leq\frac{D\sqrt{T+1}}{G}\leq\eta_{|\mathcal{H}|}.

Therefore, for any possible value of t=sv+1sv+11𝐮t𝐮t12\sum_{t=s_{v}+1}^{s_{v+1}-1}\|\mathbf{u}_{t}-\mathbf{u}_{t-1}\|_{2}, there must exist a constant ηkv\eta_{k_{v}}\in\mathcal{H} such that

ηkvηv2ηkv\eta_{k_{v}}\leq\eta_{\ast}^{v}\leq 2\eta_{k_{v}} (39)

where

kv=log2(D+t=sv+1sv+11𝐮t𝐮t12)/D+1log2(D+PT)/D+1.k_{v}=\left\lfloor\log_{2}\sqrt{\left(D+\sum_{t=s_{v}+1}^{s_{v+1}-1}\|\mathbf{u}_{t}-\mathbf{u}_{t-1}\|_{2}\right)/D}\right\rfloor+1\leq\left\lfloor\log_{2}\sqrt{\left(D+P_{T}\right)/D}\right\rfloor+1.

Moreover, we notice that each expert EηE^{\eta} over the interval [sv,sv+11][s_{v},s_{v+1}-1] actually runs Algorithm 1 with the learning rate η2v/2\frac{\eta}{2^{v/2}} to handle the surrogate losses sv(𝐱),,sv+11(𝐱)\ell_{s_{v}}(\mathbf{x}),\dots,\ell_{s_{v+1}-1}(\mathbf{x}), where each gradient t(𝐱tη)=ft(𝐱t)\nabla\ell_{t}(\mathbf{x}_{t}^{\eta})=\nabla f_{t}(\mathbf{x}_{t}) is delayed to the end of round t+dt1t+d_{t}-1 for t[sv,sv+11]t\in[s_{v},s_{v+1}-1].

Therefore, by combining Theorem 3.4 with Lemma 3.5, under Assumptions 3.1 and 3.2, we have

t=svsv+11(t(𝐱tηkv)t(𝐮t))2v/2(D2+Dt=sv+1sv+11𝐮t𝐮t12)ηkv+ηkvG22v/2j=svsv+11(j+1svi=svj1|isv|)+Cv2v/2(D2+Dt=sv+1sv+11𝐮t𝐮t12)ηkv+ηkvG22v/2+Cv3G2v(D2+Dt=sv+1sv+11𝐮t𝐮t12)+Cv3G2v(D2+DPT)+Cv\begin{split}&\sum_{t=s_{v}}^{s_{v+1}-1}\left(\ell_{t}(\mathbf{x}_{t}^{\eta_{k_{v}}})-\ell_{t}(\mathbf{u}_{t})\right)\\ \leq&\frac{2^{v/2}\left(D^{2}+D\sum_{t=s_{v}+1}^{s_{v+1}-1}\|\mathbf{u}_{t}-\mathbf{u}_{t-1}\|_{2}\right)}{\eta_{k_{v}}}+\frac{\eta_{k_{v}}G^{2}}{2^{v/2}}\sum_{j=s_{v}}^{s_{v+1}-1}\left(j+1-s_{v}-\sum_{i=s_{v}}^{j-1}|\mathcal{F}_{i}^{s_{v}}|\right)+C_{v}\\ \leq&\frac{2^{v/2}\left(D^{2}+D\sum_{t=s_{v}+1}^{s_{v+1}-1}\|\mathbf{u}_{t}-\mathbf{u}_{t-1}\|_{2}\right)}{\eta_{k_{v}}}+\eta_{k_{v}}G^{2}2^{v/2}+C_{v}\\ \leq&3G\sqrt{2^{v}\left(D^{2}+D\sum_{t=s_{v}+1}^{s_{v+1}-1}\|\mathbf{u}_{t}-\mathbf{u}_{t-1}\|_{2}\right)}+C_{v}\\ \leq&3G\sqrt{2^{v}\left(D^{2}+DP_{T}\right)}+C_{v}\end{split} (40)

where CvC_{v} is defined in (33), the second inequality is due to the fact that Algorithm 6 also ensures (34), and the third inequality is due to (39) and the definition of ηv\eta_{\ast}^{v}.

Moreover, it is also easy to verify that Algorithm 5 actually starts or restarts Algorithm 2 with learning rate \alpha_{v} at round s_{v}, and this instance ends at round s_{v+1}-1. Then, by using Lemma 4.1 with 1/w_{s_{v}}^{\eta_{k_{v}}}\leq(k_{v}+1)^{2}, under Assumptions 3.1 and 3.2, we have

t=svsv+11t(𝐱t)t=svsv+11t(𝐱tηkv)2αvln(kv+1)+αvG2D2j=svsv+11(j+1svi=svj1|isv|)(2ln(log2(D+PT)/D+2)+1)2v/2GD\begin{split}\sum_{t=s_{v}}^{s_{v+1}-1}\ell_{t}\left(\mathbf{x}_{t}\right)-\sum_{t=s_{v}}^{s_{v+1}-1}\ell_{t}(\mathbf{x}_{t}^{\eta_{k_{v}}})\leq&\frac{2}{\alpha_{v}}\ln(k_{v}+1)+\alpha_{v}G^{2}D^{2}\sum_{j=s_{v}}^{s_{v+1}-1}\left(j+1-s_{v}-\sum_{i=s_{v}}^{j-1}|\mathcal{F}_{i}^{s_{v}}|\right)\\ \leq&\left(2\ln\left(\left\lfloor\log_{2}\sqrt{\left(D+P_{T}\right)/D}\right\rfloor+2\right)+1\right)2^{v/2}GD\end{split} (41)

where the second inequality is due to αv=1GD2v/2\alpha_{v}=\frac{1}{GD2^{v/2}}, the definition of kvk_{v}, and the fact that Algorithm 5 also ensures (34).

By combining (40) and (41), it is not hard to verify that

t=svsv+11ft(𝐱t)t=svsv+11ft(𝐮t)t=svsv+11t(𝐱t)t=svsv+11t(𝐮t)=t=svsv+11t(𝐱t)t=svsv+11t(𝐱tηkv)+t=svsv+11t(𝐱tηkv)t=svsv+11t(𝐮t)(2ln(log2(D+PT)/D+2)+1)2v/2GD+3G2v(D2+DPT)+Cv\begin{split}&\sum_{t=s_{v}}^{s_{v+1}-1}f_{t}(\mathbf{x}_{t})-\sum_{t=s_{v}}^{s_{v+1}-1}f_{t}(\mathbf{u}_{t})\\ \leq&\sum_{t=s_{v}}^{s_{v+1}-1}\ell_{t}(\mathbf{x}_{t})-\sum_{t=s_{v}}^{s_{v+1}-1}\ell_{t}(\mathbf{u}_{t})\\ =&\sum_{t=s_{v}}^{s_{v+1}-1}\ell_{t}(\mathbf{x}_{t})-\sum_{t=s_{v}}^{s_{v+1}-1}\ell_{t}(\mathbf{x}_{t}^{\eta_{k_{v}}})+\sum_{t=s_{v}}^{s_{v+1}-1}\ell_{t}(\mathbf{x}_{t}^{\eta_{k_{v}}})-\sum_{t=s_{v}}^{s_{v+1}-1}\ell_{t}(\mathbf{u}_{t})\\ \leq&\left(2\ln\left(\left\lfloor\log_{2}\sqrt{\left(D+P_{T}\right)/D}\right\rfloor+2\right)+1\right)2^{v/2}GD+3G\sqrt{2^{v}\left(D^{2}+DP_{T}\right)}+C_{v}\end{split} (42)

Then, we have

R(𝐮1,,𝐮T)=v=1V(t=svsv+11ft(𝐱t)t=svsv+11ft(𝐮t))((2ln(log2(D+PT)/D+2)+1)GD+3GD2+DPT)v=1V2v/2+v=1VCv=((2ln(log2(D+PT)/D+2)+1)GD+3GD2+DPT)2(2V/21)21+v=1VCv2((2ln(log2(D+PT)/D+2)+1)GD+3GD2+DPT)S21+v=1VCv\begin{split}&R(\mathbf{u}_{1},\dots,\mathbf{u}_{T})=\sum_{v=1}^{V}\left(\sum_{t=s_{v}}^{s_{v+1}-1}f_{t}(\mathbf{x}_{t})-\sum_{t=s_{v}}^{s_{v+1}-1}f_{t}(\mathbf{u}_{t})\right)\\ \leq&\left(\left(2\ln\left(\left\lfloor\log_{2}\sqrt{\left(D+P_{T}\right)/D}\right\rfloor+2\right)+1\right)GD+3G\sqrt{D^{2}+DP_{T}}\right)\sum_{v=1}^{V}2^{v/2}+\sum_{v=1}^{V}C_{v}\\ =&\left(\left(2\ln\left(\left\lfloor\log_{2}\sqrt{\left(D+P_{T}\right)/D}\right\rfloor+2\right)+1\right)GD+3G\sqrt{D^{2}+DP_{T}}\right)\frac{\sqrt{2}(2^{V/2}-1)}{\sqrt{2}-1}+\sum_{v=1}^{V}C_{v}\\ \leq&\frac{2\left(\left(2\ln\left(\left\lfloor\log_{2}\sqrt{\left(D+P_{T}\right)/D}\right\rfloor+2\right)+1\right)GD+3G\sqrt{D^{2}+DP_{T}}\right)\sqrt{S}}{\sqrt{2}-1}+\sum_{v=1}^{V}C_{v}\end{split}

where the first inequality is due to (42), and the last inequality is due to (31).

Finally, by substituting (37) and S=\bar{d}T into the above inequality, we complete the proof. ∎