
Online Convex Optimization Using Coordinate Descent Algorithms

Yankai Lin [email protected]    Iman Shames [email protected]    Dragan Nesic [email protected] Department of Mechanical Engineering, Eindhoven University of Technology, the Netherlands CIICADA Lab, School of Engineering, The Australian National University, Acton, 0200, ACT, Australia Department of Electrical and Electronic Engineering, The University of Melbourne, Parkville, 3010, Victoria, Australia
Abstract

This paper considers the problem of online optimization where the objective function is time-varying. In particular, we extend coordinate descent type algorithms to the online case, where the objective function varies after a finite number of iterations of the algorithm. Instead of solving the problem exactly at each time step, we only apply a finite number of iterations at each time step. Commonly used notions of regret are used to measure the performance of the online algorithm. Moreover, coordinate descent algorithms with different updating rules are considered, including both deterministic and stochastic rules that are developed in the literature of classical offline optimization. A thorough regret analysis is given for each case. Finally, numerical simulations are provided to illustrate the theoretical results.

keywords:
Online convex optimization; Coordinate descent; Online learning; Regret minimization.
thanks: This paper was not presented at any conference. This work was supported by the Australian Research Council under the Discovery Projects DP170104099, DP210102454, and the Australian Government, via grant AUSMURIB000001 associated with ONR MURI grant N00014-19-1-2571. Corresponding author Y. Lin.


1 Introduction

Online learning [29], resource allocation [9], demand response in power systems [14], and localization of moving targets [2] are just a few examples where online convex optimization (OCO) has been applied. In the problem setup of OCO, the objective functions are time-varying and are not available to the decision maker a priori. At each time instant, after an update of the decision variable, new information about the latest cost function is made available to the decision maker. The objective of the decision maker is to minimize the objective function over time. One commonly used performance measure of an online optimization algorithm is the notion of regret, which measures the suboptimality of the algorithm compared to the outcome generated by the best decision at each time step.

In the seminal work of [41], the method of online gradient descent is proposed for OCO problems, where at each time step the decision maker performs one gradient descent step using the latest available information. A static regret upper bound that is sublinear in $T$ is proved, where $T$ is the length of the horizon. Under stronger assumptions on the cost functions, such as strong convexity, an improved logarithmic regret bound can be achieved [11, 23, 10]. If future information is available, it can be used to further improve the performance of the online optimization algorithm in terms of regret bounds. The work [15] introduces an additional predictive step following the algorithm developed in [41], provided certain conditions on the estimated gradient and descent direction are met. Similar algorithms have also been extended to cases where zeroth-order [6, 30, 36, 26, 32] and second-order [16] oracles are used instead of (sub)gradients. The works [6, 36] on bandit feedback consider the situation where there are time-varying inequality constraints. In such cases, the algorithms proposed in [41] are hard to implement because of the high computational demand of the projection operation. This motivates recent research on online optimization algorithms with time-varying constraints, including the primal-dual algorithm proposed in [13, 37] and a modified saddle-point method given in [9]. Other algorithms have been proposed to handle stochastic constraints [38] and to cover continuous-time applications [27]. In the case where only the values, rather than the exact form, of the cost function are revealed to the decision maker, bandit feedback based online algorithms [6, 36] can be used to solve the problem; other methods, such as forward gradient [22], have also been proposed recently to deal with this issue. The need for applications in large-scale systems has also led to extensive research on distributed OCO. Distributed online algorithms that achieve sublinear regret bounds for convex optimization problems with static constraints can be found in [28, 12, 33]. For instance, [28] proposes a distributed version of the dynamic mirror descent algorithm, which is a generalization of the classical gradient descent methods suitable for high-dimensional optimization problems. The work [19] proposes distributed online primal-dual algorithms for optimization problems with static coupled inequality constraints, while the work [35] studies distributed online convex optimization with time-varying inequality constraints in the discrete-time setting. For a more detailed account of recent advances in online optimization, we refer the reader to the survey paper [17].

To the best of our knowledge, coordinate descent [31], an important class of optimization algorithms, has not been sufficiently analyzed in the online optimization community. In coordinate descent algorithms, most components of the decision variable are fixed during one iteration while the cost function is minimized with respect to the remaining components. The resulting problem is lower-dimensional and often much easier to solve. Thus, coordinate descent algorithms have great potential in applications such as machine learning, where an iteration using full gradient information is computationally expensive. In [24], it is shown that for huge-scale problems, coordinate descent can be very efficient. Another situation where coordinate descent is useful is dual decomposition based methods for distributed optimization; see [21] and references therein. Specifically, the dual problem of multi-agent optimal consensus results in a sum of functions with very loose coupling between the dual variables. Calculation of one component of the gradient of the dual function only involves computations and communications of a pair of agents (or processors). Moreover, it can also be implemented in a parallel fashion, as shown in [3]. Therefore, considerable effort has been made recently to develop theoretical performance guarantees for various coordinate descent algorithms [31]. In this paper, we aim to extend this appealing class of algorithms to OCO problems by providing an in-depth regret analysis for different types of online coordinate descent algorithms.

The main contributions of the paper can be summarized as follows. First, we extend the coordinate descent algorithms considered in [31] to the online case and provide their regret analysis. To the best of our knowledge, this is the first attempt to explore the use of coordinate descent methods for OCO problems. Second, we provide an in-depth regret analysis of coordinate descent algorithms with different updating rules, such as cyclic and random rules. Specifically, we consider both random and deterministic online coordinate descent algorithms under assumptions commonly used in the literature. In particular, most of the existing literature on OCO is based on extensions of offline algorithms that monotonically reduce the distance from the decision variable to the set of solutions at each iteration; an example is the well-known online gradient descent [41, 11]. However, the offline deterministic coordinate descent algorithm, although it has provable convergence to the set of solutions, does not necessarily produce an iterate that is closer to the set of solutions at each step. We overcome this issue by using predictive-like updates at each time step, which are detailed in Section 5. Lastly, we show that the regret bounds achieved by our online coordinate descent algorithms are comparable to those achieved in the literature by centralized full-gradient based online algorithms.

We summarize the regret upper bounds proved in Theorems 1-7 and the corollaries of Section 5 for Algorithms 1-5 in Table 1 below.

Table 1: Regret bounds proved in the paper, where $C_T=\sum_{t=1}^{T}|x_t^*-x_{t-1}^*|$ and $C_{T,2}=\sum_{t=1}^{T}|x_t^*-x_{t-1}^*|^2$; C stands for the convex case and SC for the strongly convex case.

|                 | $R_T^s$, C         | $R_T^s$, SC      | $R_T^d$, C            | $R_T^d$, SC                     |
|-----------------|--------------------|------------------|-----------------------|---------------------------------|
| Alg. 1          | $O(\sqrt{T})$      | $O(\log T)$      | $O(\sqrt{C_T T})$     | $O(C_T)$                        |
| Alg. 2, 4       | $O(\sqrt{T})$      | $O(\log T)$      | $O(\sqrt{C_T T})$     | $O(C_{T,2})$                    |
| Alg. 3, 5       | $O(\sqrt{T})$      | $O(\log T)$      | $O(\sqrt{C_T T})$     | $O(C_{T,2})$                    |
| Gradient method | $O(\sqrt{T})$ [41] | $O(\log T)$ [11] | $O(\sqrt{C_T T})$ [7] | $O(C_T)$ [28], $O(C_{T,2})$ [8] |

The regret bounds summarized in Table 1 are consistent with the regret bounds of full-gradient based online optimization algorithms proved in the existing literature [29, 11, 7, 23] under similar settings. Our dynamic regret bounds for strongly convex functions proved in Theorems 6 and 7 might require multiple updates at each time $t$. This setup is also adopted in some existing works, including [40, 8], to achieve less conservative regret bounds. It should also be noted that, although the online algorithms with different update rules achieve regret bounds of the same order, the exact coefficients of the regret bounds may still differ.

The rest of the paper is organized as follows. The problem formulation is presented in Section 2. The online coordinate descent algorithms considered in this paper are given in Section 3. Regret bounds for random online coordinate descent algorithms are given in Section 4, followed by regret bounds for deterministic online coordinate descent algorithms in Section 5. Numerical simulations are given in Section 6. Finally, the results presented in this paper are summarized in Section 7.

Notation: Let $\mathbb{R}$ be the set of real numbers and $\mathbb{R}^n$ be the $n$-dimensional Euclidean space, and let $\mathbb{R}_{\geq 0}$ (resp. $\mathbb{R}_{>0}$) be the set of non-negative (resp. positive) real numbers. For $x,y\in\mathbb{R}^n$, $\langle x,y\rangle$ denotes the inner product in $\mathbb{R}^n$. The set of non-negative (resp. positive) integers is denoted by $\mathbb{Z}_{\geq 0}$ (resp. $\mathbb{Z}_{>0}$). Furthermore, $|\cdot|$ denotes the Euclidean norm of a vector $x\in\mathbb{R}^n$. The matrix $I_n$ is used to denote the $n$-dimensional identity matrix, and $n$ is omitted when the dimension is clear. For a given vector $x$, $x_{(i)}$ denotes the $i$-th component of $x$, $x_i$ denotes the value of $x$ at time $i$, and $x_{(i),j}$ denotes the value of the $i$-th component of $x$ at time $j$. All random variables considered in this paper are defined on a probability space $(\Omega,\mathcal{F},\mathbb{P})$, where $\Omega$ is the sample space, the $\sigma$-field $\mathcal{F}$ is the set of events, and $\mathbb{P}$ is the probability measure defined on $(\Omega,\mathcal{F})$.

2 Problem formulation

We consider the following optimization problem

$$\min_{x\in\Theta}\ f_t(x), \qquad (1)$$

where $f_t:\mathbb{R}^n\rightarrow\mathbb{R}$ is the convex cost function at time $t$, $x\in\mathbb{R}^n$ is the decision variable, and $\Theta\subseteq\mathbb{R}^n$ is the non-empty closed convex constraint set. For simplicity, we assume $f_t$ is continuously differentiable for any $t\in\mathbb{Z}_{\geq 0}$. Moreover, let the decision variable $x_t$ at any given time $t$ be partitioned as $x_t=[x_{(1),t}^T,x_{(2),t}^T,\ldots,x_{(P),t}^T]$, $x_{(p),t}\in\mathbb{R}^{n_p}$, $\sum_{p=1}^{P}n_p=n$. The $P$ components of the vector are assigned to $P$ individual processors. For any integer $p\in\{1,\ldots,P\}$, processor $p$ is responsible for updating its own "local" decision variable $x_{(p),t}$ [3].
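As a concrete illustration of this partition, the following minimal sketch (in Python, with hypothetical block sizes $n_p$) shows how a vector in $\mathbb{R}^n$ can be split into $P$ blocks so that each processor reads and writes only its own block.

```python
import numpy as np

# Hypothetical block sizes n_1, ..., n_P with n_1 + ... + n_P = n.
block_sizes = [2, 3, 1]                     # P = 3 blocks, n = 6
offsets = np.cumsum([0] + block_sizes)      # start index of each block

def get_block(x, p):
    """Return the p-th block x_(p) of x (0-indexed here)."""
    return x[offsets[p]:offsets[p + 1]]

def set_block(x, p, value):
    """Overwrite the p-th block of x in place."""
    x[offsets[p]:offsets[p + 1]] = value

x = np.zeros(sum(block_sizes))
set_block(x, 1, np.array([1.0, 2.0, 3.0]))  # processor p = 1 updates its block
print(get_block(x, 1))                      # [1. 2. 3.]
```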

Denote the minimizer of $f_t(x)$ at time $t$ by $x_t^*\in\Theta$. We use a notion of regret to measure the processors' ability to track $x_t^*$. Regret refers to a quantity that measures the overall difference between the cost incurred by an algorithm and the cost at the best possible point from an offline view up to a horizon $T$. Two notions of regret commonly considered in the literature are static regret and dynamic regret. The static regret is defined as:

$$R_T^s := \sum_{t=1}^{T}f_t(x_t) - \min_{x\in\Theta}\sum_{t=1}^{T}f_t(x) = \sum_{t=1}^{T}f_t(x_t) - \sum_{t=1}^{T}f_t(x^*), \qquad (2)$$

where $x^*:=\arg\min_{x\in\Theta}\sum_{t=1}^{T}f_t(x)$ for a given $T$; the subscript $T$ is omitted for convenience. On the other hand, the dynamic regret is defined as

$$R_T^d := \sum_{t=1}^{T}f_t(x_t) - \sum_{t=1}^{T}f_t(x_t^*). \qquad (3)$$
Remark 1.

Static regret is a useful performance metric in applications such as static parameter estimation, where the variable of interest is static. However, if the variable of interest evolves over time (e.g., tracking moving targets), the notion of dynamic regret makes more sense than its static counterpart. It can be seen from (2) and (3) that regret captures the accumulation of errors due to the fact that the optimization problems are not solved exactly at each step. If the regret is sublinear in $T$, then the average accumulated error $R_T^s/T$ (or $R_T^d/T$) converges to 0 as $T\rightarrow\infty$. This further implies that $x_t$ converges to $x^*$ (or $x_t^*$). Although it is the more appropriate performance metric in most applications, dynamic regret does not in general admit a provable bound sublinear in $T$. To obtain a sublinear regret bound, additional regularity assumptions, such as bounded variation of the environment, are typically required; see [28] for example.

If stochastic algorithms are considered, similar notions of regret can be defined via expectations:

$$R_T^s := \sum_{t=1}^{T}\mathbb{E}[f_t(x_t)] - \min_{x\in\Theta}\sum_{t=1}^{T}f_t(x). \qquad (4)$$

$$R_T^d := \sum_{t=1}^{T}\mathbb{E}[f_t(x_t)] - \sum_{t=1}^{T}f_t(x_t^*). \qquad (5)$$

With some abuse of notation, $R_T^s$ and $R_T^d$ are used for both the random and deterministic cases. However, this will not lead to confusion, since it will be clear from the context whether an algorithm is stochastic or not.
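To make these definitions concrete, the following minimal sketch evaluates (2) and (3) numerically; `f_seq` is a hypothetical list of cost functions, and the minimizers are assumed to be available (in practice they would come from an offline solver).

```python
import numpy as np

def static_regret(f_seq, x_seq, x_star):
    """R_T^s as in (2): accumulated cost minus the cost at the single best point."""
    return sum(f(x) for f, x in zip(f_seq, x_seq)) - sum(f(x_star) for f in f_seq)

def dynamic_regret(f_seq, x_seq, x_star_seq):
    """R_T^d as in (3): compares against the per-step minimizers x_t^*."""
    return sum(f(x) - f(xs) for f, x, xs in zip(f_seq, x_seq, x_star_seq))

# Toy example: f_t(x) = |x - c_t|^2 with a drifting minimizer c_t.
c = [np.array([0.1 * t]) for t in range(10)]
f_seq = [lambda x, ct=ct: float(np.sum((x - ct) ** 2)) for ct in c]
x_seq = [np.zeros(1) for _ in range(10)]      # a (poor) constant decision
print(dynamic_regret(f_seq, x_seq, c))
```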

3 Online coordinate descent algorithms

We construct our online block coordinate descent algorithms following the setup in [31]. At time $t$, we select a component of $x_t$ to update. The updating component at time $t$ is denoted by $i_t$. If the $i$-th component is updated, then the updating equation for component $i$ is given by

$$x_{(i),t+1} = x_{(i),t} - \alpha_t[\nabla_{(i)}f_t(x_t)], \qquad (6)$$

where $\alpha_t$ is the stepsize at time $t$ and $\nabla_{(i)}f_t(x_t)$ is the $i$-th component of the gradient of $f_t$ evaluated at $x_t$. For any $p\neq i_t$, we have

$$x_{(p),t+1} = x_{(p),t}. \qquad (7)$$

Then $x_{t+1}$ is projected onto $\Theta$. We define the matrix $U_t=\mathrm{diag}(U_{(1),t},U_{(2),t},\ldots,U_{(P),t})$. For any $t\in\mathbb{Z}_{\geq 0}$ and $1\leq p\leq P$, $U_{(p),t}\in\{I_{n_p},0_{n_p}\}$, where $I_{n_p}$ and $0_{n_p}$ denote identity and zero matrices of dimension $n_p$, respectively. Then the update of the coordinate descent algorithm at time $t$ can be written as

$$x_{t+1} = \Pi_\Theta(x_t - \alpha_t U_t[\nabla f_t(x_t)]), \qquad (8)$$

where $\Pi_\Theta(\cdot)$ denotes the projection onto $\Theta$, which is well defined by the closedness and convexity of $\Theta$. The following non-expansive property of the projection operator will be used extensively throughout the paper.

Lemma 1.

[4, Proposition 3.2.1] Let $\Theta$ be a non-empty closed convex set. We have $|\Pi_\Theta(x)-\Pi_\Theta(y)|\leq|x-y|$ for any $x,y\in\mathbb{R}^n$.
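For simple choices of $\Theta$ the projection has a closed form; the sketch below shows two standard examples (a box and a Euclidean ball, chosen here purely for illustration) and checks the non-expansiveness of Lemma 1 numerically.

```python
import numpy as np

def project_box(x, lo, hi):
    """Projection onto the box {x : lo <= x <= hi}: componentwise clipping."""
    return np.clip(x, lo, hi)

def project_ball(x, radius):
    """Projection onto the Euclidean ball {x : |x| <= radius}."""
    norm = np.linalg.norm(x)
    return x if norm <= radius else (radius / norm) * x

# Non-expansiveness check (Lemma 1): |Pi(x) - Pi(y)| <= |x - y|.
rng = np.random.default_rng(0)
x, y = rng.normal(size=5), rng.normal(size=5)
assert (np.linalg.norm(project_ball(x, 1.0) - project_ball(y, 1.0))
        <= np.linalg.norm(x - y) + 1e-12)
```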

In this paper, we consider the following three commonly used schemes for selecting the updating component $i_t$.

  • The random coordinate descent algorithm. In this case, $i_t$ is selected randomly with equal probability, independently of the selections made at previous iterations.

  • The cyclic coordinate descent algorithm. In this case, the updating component $i_t$ is selected in a pre-determined cyclic fashion: $i_{t+1}=(i_t\ \mathrm{mod}\ P)+1$.

  • The coordinate descent algorithm with the Gauss-Southwell rule. In this case, the updating component $i_t$ is selected as $i_t\in\arg\max_i|\nabla_{(i)}f_t(x_t)|$.

In this paper, we consider the case where only one coordinate is allowed to change per iteration. The results on random coordinate descent can be extended to cases where the selection probabilities are unequal and the components potentially overlap [20], such as the random sleep scheme [34]. Intuitively speaking, if for any $t$ the expectation of the update direction takes the form $\Gamma\nabla f_t(x)$ for a given positive definite diagonal matrix $\Gamma$, then the analysis in this work can be applied with mild modifications. Our analysis for the deterministic cases cannot be trivially extended to cover overlapping components. We aim to address this topic in future research.
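For reference, the three selection rules above can be written compactly as follows; the sketch assumes scalar blocks ($n_p=1$) and 0-indexed coordinates, with `grad` denoting $\nabla f_t(x_t)$.

```python
import numpy as np

def select_random(P, rng):
    """Random rule: each of the P coordinates is picked with probability 1/P."""
    return int(rng.integers(P))

def select_cyclic(i_prev, P):
    """Cyclic rule: i_{t+1} = (i_t mod P) + 1, written here 0-indexed."""
    return (i_prev + 1) % P

def select_gauss_southwell(grad):
    """Gauss-Southwell rule: coordinate with the largest gradient magnitude."""
    return int(np.argmax(np.abs(grad)))

grad = np.array([0.2, -1.5, 0.7])
print(select_gauss_southwell(grad))   # 1
print(select_cyclic(2, P=3))          # wraps back to 0
```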

Remark 2.

A substantial review of variants of coordinate descent algorithms can be found in [4, Section 6.5.1]. Cyclic selection of the coordinates is normally assumed to ensure convergence of the algorithm. On the other hand, irregular orders have been considered to accelerate convergence. In particular, it is shown in [31] that randomization leads to faster convergence in expectation; obviously, this is not guaranteed for each instance of the algorithm. The Gauss-Southwell method leads to faster convergence at the cost of extra computations and evaluations of gradients during the selection of coordinates, which can be an issue in large-scale problems [25].

4 Regret bounds for online coordinate descent algorithms with random coordinate selection rules

The online random coordinate descent algorithm considered in this section is given in Algorithm 1.

Algorithm 1 Online random coordinate descent algorithm
1: Initialization: $x_0\in\Theta$.
2: Coordinate Selection: At time $t$, $i_t=p$ with probability $\frac{1}{P}$, where $p\in\{1,2,\ldots,P\}$.
3: Update: For $i=i_t$:
$$x_{(i),t+1} \leftarrow x_{(i),t} - \alpha_t[\nabla_{(i)}f_t(x_t)].$$
For all $i\neq i_t$:
$$x_{(i),t+1} \leftarrow x_{(i),t}.$$
4: Projection: $x_{t+1} \leftarrow \Pi_\Theta(x_{t+1})$.
5: Set $t\leftarrow t+1$ and go to Step 2.
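A minimal sketch of Algorithm 1, assuming scalar blocks, a box constraint set, and a user-supplied gradient oracle `grad_f(t, x)` (all hypothetical choices for illustration), could look as follows.

```python
import numpy as np

def online_random_cd(grad_f, x0, stepsizes, lo, hi, seed=0):
    """Sketch of Algorithm 1 with scalar blocks and a box constraint [lo, hi]^n."""
    rng = np.random.default_rng(seed)
    x = x0.copy()
    iterates = [x.copy()]
    for t, alpha in enumerate(stepsizes):
        i = rng.integers(x.size)             # Step 2: coordinate i_t, prob. 1/P
        x[i] -= alpha * grad_f(t, x)[i]      # Step 3: update one coordinate
        x = np.clip(x, lo, hi)               # Step 4: projection onto the box
        iterates.append(x.copy())
    return iterates

# Toy usage: f_t(x) = |x - c_t|^2 with a slowly drifting minimizer c_t.
grad_f = lambda t, x: 2.0 * (x - 0.01 * t * np.ones_like(x))
xs = online_random_cd(grad_f, np.zeros(3), [0.1] * 200, lo=-5.0, hi=5.0)
```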

4.1 Static regret for convex functions

Before we state the main results, we first list the assumptions.

Assumption 1.

For any given $t\in\mathbb{Z}_{>0}$, $x\in\Theta$, and $x^*:=\arg\min_{x\in\Theta}\sum_{t=1}^{T}f_t(x)$, the following inequalities hold uniformly in $t$ and $T$ for problem (1):

(i) $|\nabla f_t(x)|\leq G$;

(ii) $|x_t-x^*|\leq R$,

where $x_t$ is the decision variable at time $t$.

Remark 3.

Item (ii) of Assumption 1 can be ensured if the constraint set $\Theta$ is bounded. Moreover, when $\Theta$ is bounded, item (i) of Assumption 1 holds in many cases, including linear regression and logistic regression [5]. By [29, Lemma 2.6], Assumption 1 (i) implies that $f_t$ is also Lipschitz continuous uniformly in $t$ over $\Theta$. These assumptions are quite standard in the literature on online optimization [6, 7, 5, 17, 37, 39, 15, 18, 28].

Before we state the first main result of the paper, we present the so-called doubling trick stepsize rule introduced in [29, Section 2.3.1].

Definition 1 (Doubling trick scheme).

A stepsize sequence $\{\alpha_t\}$ is said to be chosen by the doubling trick scheme if, for $q=0,1,\ldots,\lceil\log_2 T\rceil+1$, $\alpha_t=\frac{1}{\sqrt{2^q}}$ in each period of $2^q$ iterations $t=2^q,2^q+1,\ldots,2^{q+1}-1$.

Remark 4.

The doubling trick scheme of stepsize choices is particularly useful when stepsizes of type $\sqrt{1/T}$ are needed to achieve a desirable regret bound. Since in applications it is often unrealistic to know the horizon $T$ in advance, by using the doubling trick scheme, regret bounds of the same order can be achieved without explicit knowledge of $T$. The same trick will be used extensively throughout the paper to derive regret bounds. It should also be noted that the doubling trick scheme in general does not result in a regret bound of better order compared to other stepsize rules [41].
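A sketch of the schedule in Definition 1 is given below: the stepsize $\frac{1}{\sqrt{2^q}}$ is held constant over the $q$-th period of $2^q$ iterations, so no knowledge of $T$ is required.

```python
import math

def doubling_trick_stepsize(t):
    """Stepsize alpha_t = 1/sqrt(2^q) for t in {2^q, ..., 2^{q+1}-1}, t >= 1."""
    q = int(math.floor(math.log2(t)))
    return 1.0 / math.sqrt(2 ** q)

# Constant within each period; scaled down by sqrt(2) across periods:
print([round(doubling_trick_stepsize(t), 3) for t in range(1, 9)])
# [1.0, 0.707, 0.707, 0.5, 0.5, 0.5, 0.5, 0.354]
```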

The following result states that, under Assumption 1, if the stepsize at each iteration is chosen by the doubling trick scheme, there is an upper bound for the static regret defined in (4). Moreover, the upper bound is of order $O(\sqrt{T})$ for convex costs.

Theorem 1.

Suppose Assumption 1 holds and the stepsize is chosen according to Definition 1. Then the static regret (4) achieved by Algorithm 1 satisfies

$$R_T^s \leq (B_1+B_2)\sqrt{T}, \qquad (9)$$

where $B_1=\frac{PR^2}{2}$ and $B_2=\frac{\sqrt{2}G^2}{2(\sqrt{2}-1)}$.

Proof.

From (8) and Lemma 1, we have

$$|x_{t+1}-x|^2 \leq |x_t-\alpha_t U_t[\nabla f_t(x_t)]-x|^2 = |x_t-x|^2 - 2\alpha_t[U_t\nabla f_t(x_t)]^T(x_t-x) + \alpha_t^2|U_t\nabla f_t(x_t)|^2$$

for any $x\in\Theta$. Given $x_t$, denote the $\sigma$-field containing the past data of Algorithm 1 up to time $t$ by $\mathcal{F}_t$. Then, by convexity of $f_t$, we have

$$\begin{split}\mathbb{E}[|x_{t+1}-x|^2\,|\,\mathcal{F}_t] &\leq |x_t-x|^2 - \frac{2\alpha_t}{P}[\nabla f_t(x_t)]^T(x_t-x) + \frac{\alpha_t^2}{P}|\nabla f_t(x_t)|^2\\ &\leq |x_t-x|^2 - \frac{2\alpha_t}{P}(f_t(x_t)-f_t(x)) + \frac{\alpha_t^2}{P}|\nabla f_t(x_t)|^2.\end{split} \qquad (10)$$

Substituting $x^*:=\arg\min_{x\in\Theta}\sum_{t=1}^{T}f_t(x)$ into (10), taking total expectations using $\mathbb{E}[\mathbb{E}[x|\mathcal{F}_t]|\mathcal{F}_{t-1}]=\mathbb{E}[x|\mathcal{F}_{t-1}]$ for $\mathcal{F}_{t-1}\subset\mathcal{F}_t$, and rearranging (10) leads to

$$\mathbb{E}[f_t(x_t)-f_t(x^*)] \leq \frac{P}{2\alpha_t}\left(\mathbb{E}[|x_t-x^*|^2]-\mathbb{E}[|x_{t+1}-x^*|^2]\right) + \frac{\alpha_t}{2}|\nabla f_t(x_t)|^2. \qquad (11)$$

Note that

$$R_T^s \leq \sum_{t=1}^{T}\frac{P}{2\alpha_t}\left(\mathbb{E}[|x_t-x^*|^2]-\mathbb{E}[|x_{t+1}-x^*|^2]\right) + \sum_{t=1}^{T}\frac{\alpha_t}{2}|\nabla f_t(x_t)|^2. \qquad (12)$$

Due to Assumption 1, the gradient of $f_t$ is uniformly bounded by $G$ and $|x_t-x^*|$ is uniformly upper bounded by $R$. Consequently,

$$\begin{split}&\sum_{t=1}^{T}\frac{P}{2\alpha_t}\left(\mathbb{E}[|x_t-x^*|^2]-\mathbb{E}[|x_{t+1}-x^*|^2]\right)\\ &= \frac{P}{2\alpha_1}\mathbb{E}[|x_1-x^*|^2] - \frac{P}{2\alpha_T}\mathbb{E}[|x_{T+1}-x^*|^2] + \sum_{t=2}^{T}\left(\frac{P}{2\alpha_t}-\frac{P}{2\alpha_{t-1}}\right)\mathbb{E}[|x_t-x^*|^2]\\ &\leq \frac{PR^2}{2\alpha_1} + R^2\sum_{t=2}^{T}\left(\frac{P}{2\alpha_t}-\frac{P}{2\alpha_{t-1}}\right) = \frac{PR^2}{2\alpha_T} \leq \frac{PR^2}{2}\sqrt{T} := B_1\sqrt{T}.\end{split} \qquad (13)$$

In the last inequality of (13), the property $\alpha_T\geq 1/\sqrt{T}$ is used, which is a direct consequence of the definition of the stepsize rule. On the other hand, the remaining term in (12) can be upper bounded as follows:

$$\sum_{t=1}^{T}\frac{\alpha_t}{2}|\nabla f_t(x_t)|^2 \leq \sum_{t=1}^{T}\alpha_t\frac{G^2}{2}.$$

To derive the regret bound, we again use the properties of the doubling trick scheme. First, we set the horizon to be some known constant $T^*$ and choose the stepsize as the constant $\frac{1}{\sqrt{T^*}}$. Then it is obvious that

$$\sum_{t=1}^{T^*}\frac{\alpha_t}{2}|\nabla f_t(x_t)|^2 \leq \sum_{t=1}^{T^*}\alpha_t\frac{G^2}{2} \leq \frac{G^2}{2}\sqrt{T^*}. \qquad (14)$$

Since for $q=0,1,\ldots,\lceil\log_2 T\rceil+1$, $\alpha_t=\frac{1}{\sqrt{2^q}}$ in each period of $2^q$ iterations $t=2^q,2^q+1,\ldots,2^{q+1}-1$, each period corresponds to $T^*=2^q$. Then we can sum (14) over any given $T$ as

$$\begin{split}\sum_{t=1}^{T}\frac{\alpha_t}{2}|\nabla f_t(x_t)|^2 &\leq \sum_{q=0}^{\lceil\log_2 T\rceil}\sqrt{2^q}\,\frac{G^2}{2} = \frac{G^2}{2}\cdot\frac{1-\sqrt{2}^{\lceil\log_2 T\rceil+1}}{1-\sqrt{2}}\\ &\leq \frac{G^2}{2}\cdot\frac{1-\sqrt{2T}}{1-\sqrt{2}} \leq \frac{\sqrt{2}G^2}{2(\sqrt{2}-1)}\sqrt{T} := B_2\sqrt{T}.\end{split} \qquad (15)$$

Thus, the proof is complete.

Remark 4.1.

In Theorem 1, the differentiability of $f_t$ is in fact never used. Thus, the results apply to the case where subgradients are used when $f_t$ is not differentiable for some $t$, as long as item (i) of Assumption 1 is satisfied by the subgradients used in the iterations. Moreover, the regret bound established in Theorem 1 is of the same order as the one established in [29] for online gradient descent under the same assumptions.

4.2 Static regret for strongly convex functions

In this part, we show that if the cost functions are strongly convex, then an improved static regret bound can be achieved for the online random coordinate descent algorithm.

Assumption 2.

For any $t\in\mathbb{Z}_{\geq 0}$, the function $f_t$ is uniformly $\mu$-strongly convex, i.e., there exists $\mu>0$ such that the following inequality holds for any $t\in\mathbb{Z}_{\geq 0}$, $x,y\in\mathbb{R}^n$:

$$f_t(y) \geq f_t(x) + \nabla f_t(x)^T(y-x) + \frac{\mu}{2}|y-x|^2. \qquad (16)$$
Theorem 2.

Suppose Assumption 1 (i) and Assumption 2 hold, and the stepsize is chosen as $\alpha_t=\frac{P}{\mu t}$. Then the static regret (4) achieved by Algorithm 1 satisfies

$$R_T^s \leq \frac{PG^2}{2\mu}(1+\log T). \qquad (17)$$
Proof.

Similar to (10), given $x_t$, we have the following relationship for any $x\in\Theta$:

$$\begin{split}\mathbb{E}[|x_{t+1}-x|^2\,|\,\mathcal{F}_t] &\leq |x_t-x|^2 - \frac{2\alpha_t}{P}[\nabla f_t(x_t)]^T(x_t-x) + \frac{\alpha_t^2}{P}|\nabla f_t(x_t)|^2\\ &\leq \left(1-\frac{\alpha_t\mu}{P}\right)|x_t-x|^2 - \frac{2\alpha_t}{P}(f_t(x_t)-f_t(x)) + \frac{\alpha_t^2}{P}|\nabla f_t(x_t)|^2.\end{split} \qquad (18)$$

By substituting $x^*$ into (18) and rearranging terms, we obtain the following inequality:

$$\mathbb{E}[f_t(x_t)-f_t(x^*)] \leq \frac{P}{2\alpha_t}\left(\mathbb{E}[|x_t-x^*|^2]-\mathbb{E}[|x_{t+1}-x^*|^2]\right) - \frac{\mu}{2}\mathbb{E}[|x_t-x^*|^2] + \frac{\alpha_t}{2}|\nabla f_t(x_t)|^2.$$

Since the gradient of $f_t$ is uniformly bounded by $G$, the following inequality holds:

$$\begin{split}R_T^s \leq& \sum_{t=1}^{T}\left[\frac{P}{2\alpha_t}\left(\mathbb{E}[|x_t-x^*|^2]-\mathbb{E}[|x_{t+1}-x^*|^2]\right)\right]\\ &- \sum_{t=1}^{T}\frac{\mu}{2}\mathbb{E}[|x_t-x^*|^2] + \sum_{t=1}^{T}\frac{\alpha_t}{2}|\nabla f_t(x_t)|^2.\end{split} \qquad (19)$$

By choosing $\alpha_t=\frac{P}{\mu t}$, we have the following relationship:

$$\begin{split}R_T^s &\leq 0 + \sum_{t=2}^{T}\left(\frac{P}{2\alpha_t}-\frac{P}{2\alpha_{t-1}}-\frac{\mu}{2}\right)\mathbb{E}[|x_t-x^*|^2] - \frac{\mu T}{2}\mathbb{E}[|x_{T+1}-x^*|^2] + \sum_{t=1}^{T}\frac{\alpha_t G^2}{2}\\ &\leq 0 + \sum_{t=1}^{T}\frac{\alpha_t G^2}{2} \leq \frac{PG^2}{2\mu}(1+\log T),\end{split} \qquad (20)$$

where the second inequality follows by expanding the sum: with $\alpha_t=\frac{P}{\mu t}$, each coefficient $\frac{P}{2\alpha_t}-\frac{P}{2\alpha_{t-1}}-\frac{\mu}{2}$ vanishes. Thus, the proof is complete.

By making the extra assumption of strong convexity, the sublinear static regret bound of order $O(\sqrt{T})$ established in Theorem 1 is improved to order $O(\log T)$, which is consistent with the regret bound for online gradient descent algorithms established under the same assumptions in [11].

4.3 Dynamic regret for convex functions

Now, we provide an upper bound of the dynamic regret of the online coordinate descent algorithm. First, we define the following measure of variations of the problem (1) which is commonly used in the literature [23, 28]

$$C_T := \sum_{t=1}^{T}|x_t^*-x_{t-1}^*|, \qquad (21)$$

where $x_t^*$ is a solution to the optimization problem at time $t$ and $x_0^*$ can be an arbitrary constant. The term $C_T$ captures the accumulated variation of the optimal points at two consecutive time instances over time. If the cost functions are general convex functions and an upper bound of $C_T$ is known, the following result can be stated.

Theorem 3.

Suppose Assumption 1 holds. Furthermore, if the stepsize is chosen as $\alpha_t=\sqrt{\frac{C_T}{T}}$, the dynamic regret (5) achieved by Algorithm 1 satisfies

$$R_T^d \leq \left(\frac{5R^2}{2\sqrt{C_T}} + RP\sqrt{C_T}\right)\sqrt{T} + \frac{\sqrt{C_T}G^2}{2}\sqrt{T}.$$
Proof.

Substituting $x_t^*$ into (10) for any $t\geq 2$ yields

$$\mathbb{E}[f_t(x_t)-f_t(x_t^*)] \leq \frac{P}{2\alpha_t}\left(\mathbb{E}[|x_t-x_t^*|^2]-\mathbb{E}[|x_{t+1}-x_t^*|^2]\right) + \frac{\alpha_t}{2}|\nabla f_t(x_t)|^2.$$

Furthermore,

$$\begin{split}&\sum_{t=1}^{T}(|x_t-x_t^*|^2 - |x_{t+1}-x_t^*|^2)\\ &\leq |x_1-x_1^*|^2 + \sum_{t=2}^{T}(|x_t-x_t^*|^2 - |x_t-x_{t-1}^*|^2)\\ &= |x_1-x_1^*|^2 + \sum_{t=2}^{T}2x_t^T(x_{t-1}^*-x_t^*) + \sum_{t=2}^{T}(|x_t^*|^2-|x_{t-1}^*|^2)\\ &\leq |x_1-x_1^*|^2 + |x_T^*|^2 + 2\sum_{t=2}^{T}2|x_t|\,|x_{t-1}^*-x_t^*|\\ &\leq 5R^2 + 2RC_T.\end{split} \qquad (22)$$

Then,

$$R_T^d \leq \frac{P}{2\alpha_t}(5R^2+2RC_T) + \frac{\alpha_t}{2}G^2 T.$$

Choosing the stepsize to be $\sqrt{\frac{C_T}{T}}$ leads to

$$R_T^d \leq \left(\frac{5R^2}{2\sqrt{C_T}} + RP\sqrt{C_T}\right)\sqrt{T} + \frac{\sqrt{C_T}G^2}{2}\sqrt{T}.$$

Remark 4.2.

As discussed in Remark 4, the choice of stepsize $\sqrt{C_T/T}$ can be replaced by the doubling trick scheme (see Definition 1) to derive a regret bound of the same order. Moreover, the exact value of $C_T$ is not necessary to derive the regret bound given in Theorem 3. If $C_T$ is not available, a stepsize of the same order as $\sqrt{C_T/T}$ can be used at the cost of constant-factor errors. We refer readers to [7, Theorem 1] for a more detailed discussion on how to implement stepsizes of the form $\sqrt{C_T/T}$ and obtain an upper bound of $C_T$ in practice using limited information. The inclusion of $C_T$ in the dynamic regret bound is common in the existing literature [7, 28, 37]. As argued in [7], if $C_T$ is sublinear in $T$, then the overall dynamic regret bound in Theorem 3 is sublinear in $T$, which is desirable and consistent with the dynamic regret bounds established in [7] for full-gradient based online algorithms. To see this, if the variation of the minimizers decreases with order $1/t$, then $C_T=O(\log T)$ and $R_T^d\leq O(\sqrt{T\log T})$. Similarly, if the variation of the minimizers decreases with order $1/t^q$ with $0<q<1$, then $C_T=O(T^{1-q})$ and $R_T^d\leq O(T^{1-\frac{q}{2}})$. In the worst case, by Assumption 1, we have $C_T\leq RT$ and $R_T^d\leq O(\sqrt{R}T)$. Thus, $R_T^d/T\leq O(\sqrt{R})$, which means the corresponding online algorithm incurs a steady-state tracking error. Moreover, the error decreases with the constant bound on the variation of the minimizers.

4.4 Dynamic regret for strongly convex functions

Now, we consider the dynamic regret of the online random coordinate descent algorithm for strongly convex functions. As before, we consider $\mu$-strongly convex functions. In addition, we make the following assumption, commonly seen in the online optimization literature [18, 23].

Assumption 3.

For any $t\in\mathbb{Z}_{\geq 0}$, the gradient of $f_t$ is uniformly Lipschitz continuous, i.e., there exists $L>0$ such that the following inequality holds for any $t\in\mathbb{Z}_{\geq 0}$, $x,y\in\mathbb{R}^n$:

$$|\nabla f_t(x)-\nabla f_t(y)| \leq L|x-y|. \qquad (23)$$

It is shown in [4, Proposition 6.1.2] that, under the same conditions stated in Assumption 3, (23) is equivalent to

$$f_t(y) \leq f_t(x) + \nabla f_t(x)^T(y-x) + \frac{L}{2}|y-x|^2. \qquad (24)$$

The main result regarding the dynamic regret bound of Algorithm 1 for strongly convex functions with Lipschitz gradients is stated as follows.

Theorem 4.

Suppose $\nabla f_t(x_t^*)=0$ for any $t\in\mathbb{Z}_{\geq 0}$, and Assumption 1 (i) and Assumptions 2-3 hold. If the stepsize is chosen as $\alpha_t=\alpha\leq\frac{2}{\mu+L}$, the dynamic regret (5) achieved by Algorithm 1 satisfies

$$R_T^d \leq \frac{G}{e}(C_T + C_1),$$

where $e=1-\sqrt{1-\frac{2\alpha}{P}\frac{\mu L}{\mu+L}}$, $C_T=\sum_{t=1}^{T}|x_t^*-x_{t-1}^*|$, and $C_1=|x_1-x_1^*|-|x_1^*-x_0^*|$.

Proof.

From (16) and (23), the following inequality holds [4, Proposition 6.1.9] for any $t\in\mathbb{Z}_{\geq 0}$, $x,y\in\mathbb{R}^n$:

$$(\nabla f_t(x)-\nabla f_t(y))^T(x-y) \geq \frac{\mu L}{\mu+L}|x-y|^2 + \frac{1}{\mu+L}|\nabla f_t(x)-\nabla f_t(y)|^2. \qquad (25)$$

Then, using Lemma 1, we have $|x_{t+1}-x_t^*|^2 \leq |x_t-\alpha U_t\nabla f_t(x_t)-x_t^*|^2$. Next, we take the conditional expectation given the past history up to time $t$:

$$\begin{split}\mathbb{E}[|x_{t+1}-x_t^*|^2\,|\,\mathcal{F}_t] \leq& -\frac{2\alpha}{P}(\nabla f_t(x_t)-\nabla f_t(x_t^*))^T(x_t-x_t^*)\\ &+ |x_t-x_t^*|^2 + \alpha^2\,\mathbb{E}[|U_t\nabla f_t(x_t)|^2\,|\,\mathcal{F}_t].\end{split} \qquad (26)$$

The last term in (26) can be upper bounded using the following inequality

$$\mathbb{E}[|U_t\nabla f_t(x_t)|^2\,|\,\mathcal{F}_t] \leq \frac{1}{P}|\nabla f_t(x_t)-\nabla f_t(x_t^*)|^2. \qquad (27)$$

The inequality (27) comes from the fact that

$$\mathbb{E}[|U_t\nabla f_t(x_t)|^2\,|\,\mathcal{F}_t] = |\nabla f_t(x_t)|^2/P,$$

together with $\nabla f_t(x_t^*)=0$.

Combining (25), (26), and (27), we have for any $t\in\mathbb{Z}_{\geq 0}$

$$\begin{split}\mathbb{E}[|x_{t+1}-x_t^*|^2\,|\,\mathcal{F}_t] &\leq \left(1-\frac{2\alpha}{P}\frac{\mu L}{\mu+L}\right)|x_t-x_t^*|^2 + \left(\frac{\alpha^2}{P}-\frac{2\alpha}{P}\frac{1}{\mu+L}\right)|\nabla f_t(x_t)-\nabla f_t(x_t^*)|^2\\ &\leq \left(1-\frac{2\alpha}{P}\frac{\mu L}{\mu+L}\right)|x_t-x_t^*|^2,\end{split} \qquad (28)$$

where the last inequality comes from the constant stepsize choice $\alpha\leq 2/(\mu+L)$. By Jensen's inequality, we have

$$\mathbb{E}[|x_{t+1}-x_t^*|\,|\,\mathcal{F}_t] \leq \sqrt{1-\frac{2\alpha}{P}\frac{\mu L}{\mu+L}}\,|x_t-x_t^*|. \qquad (29)$$

Note that $\sqrt{1-\frac{2\alpha}{P}\frac{\mu L}{\mu+L}}<1$. Thus, there exists $0<e<1$ such that $\sqrt{1-\frac{2\alpha}{P}\frac{\mu L}{\mu+L}}=1-e$.

Assumption 1 (i) implies $|f_t(x)-f_t(y)|\leq G|x-y|$ for any $t\in\mathbb{Z}_{\geq 0}$ and any $x,y\in\Theta$. As a result, the stochastic dynamic regret is bounded as $R_T^d \leq G\,\mathbb{E}\left[\sum_{t=1}^{T}|x_t-x_t^*|\right]$. Moreover, we have

$$|x_{t+1}-x_{t+1}^*| \leq |x_{t+1}-x_t^*| + |x_t^*-x_{t+1}^*|. \qquad (30)$$

Now, taking total expectations, using (29), and summing up both sides of (30) from $t=1$ to $t=T-1$ yields

$$R_T^d/G \leq |x_1-x_1^*| + (1-e)R_T^d/G + \sum_{t=1}^{T}|x_t^*-x_{t+1}^*|. \qquad (31)$$

Rearranging (31) leads to

$$R_T^d \leq \frac{G}{e}(C_T+C_1).$$

Remark 4.3.

By assuming uniform strong convexity and uniform Lipschitz continuity of the gradients of $f_t$, Theorem 4 improves the order of the theoretical dynamic regret bound from $O(\sqrt{C_T T})$ in Theorem 3 to $O(C_T)$, which is consistent with the results shown in [23] for full-gradient based online algorithms. This means that if problem (1) does not change over time, i.e., $f_t=f$ for all $t$, then $C_T=0$ and the regret grows at a rate of $O(1)$, which is consistent with the convergence results for coordinate descent algorithms in the offline setting [31]. As item (ii) of Assumption 1 is not required for proving Theorem 4, the regret bound is applicable to unconstrained problems with bounded gradients.

5 Regret bounds for online coordinate descent algorithms with deterministic coordinate selection rules

We consider two deterministic online coordinate descent algorithms in this section; they are given in Algorithms 2 and 3, respectively.

Algorithm 2 Online cyclic coordinate descent algorithm
1: Initialization: $x_0$, $i_0$.
2: Select Coordinate: $i_t \leftarrow (i_{t-1}\ \mathrm{mod}\ P)+1$.
3: Update: For $i=i_t$:
$$x_{(i),t+1} \leftarrow x_{(i),t} - \alpha_t[\nabla_{(i)}f_t(x_t)].$$
For all $i\neq i_t$:
$$x_{(i),t+1} \leftarrow x_{(i),t}.$$
4: Projection: $x_{t+1} \leftarrow \Pi_\Theta(x_{t+1})$.
5: Set $t\leftarrow t+1$ and go to Step 2.

Algorithm 3 Online coordinate descent algorithm with Gauss-Southwell Rule
1: Initialization: $x_0\in\Theta$.
2: Select Coordinate: $i_t \leftarrow \arg\max_i|\nabla_{(i)}f_t(x_t)|$.
3: Update: For $i=i_t$:
$$x_{(i),t+1} \leftarrow x_{(i),t} - \alpha_t[\nabla_{(i)}f_t(x_t)].$$
For all $i\neq i_t$:
$$x_{(i),t+1} \leftarrow x_{(i),t}.$$
4: Projection: $x_{t+1} \leftarrow \Pi_\Theta(x_{t+1})$.
5: Set $t\leftarrow t+1$ and go to Step 2.

Note that both Algorithms 2 and 3 are deterministic and update one component of the decision variable per iteration. We can therefore relate the regrets achievable by Algorithms 2 and 3 to the regret achievable by the online projected gradient descent algorithm, which takes the form

$$x_{t+1} = \Pi_\Theta(x_t - \alpha_t\nabla f_t(x_t)). \qquad (32)$$

It can be easily seen that Algorithms 2 and 3 take the following form

$$x_{t+1} = \Pi_\Theta(x_t - \alpha_t(\nabla f_t(x_t) + \nu_t)), \qquad (33)$$

where $\nu_t := \overline{\nabla}_{(i_t)}f_t(x_t) - \nabla f_t(x_t)$, and $\overline{\nabla}_{(i_t)}f_t(x_t)\in\mathbb{R}^n$ denotes the vector whose $i_t$-th component is $\nabla_{(i_t)}f_t(x_t)$ while all other entries are 0. The vector $\nu_t$ captures the effect of the components of the gradient that are not updated. We use $\bar{R}_T^s$ and $\bar{R}_T^d$ to denote the static and dynamic regrets of the online projected gradient descent algorithm, respectively. Then, we have the following result.

Proposition 5.1.

Suppose Assumption 1 holds. Then the static regret $R_T^s$ and the dynamic regret $R_T^d$ of the iterations (33) satisfy the following two relationships:

1. $R_T^s \leq \bar{R}_T^s + G^2\sum_{t=1}^{T}\alpha_t$;

2. $R_T^d \leq \bar{R}_T^d + G^2\sum_{t=1}^{T}\alpha_t$.

Proof.

By the definitions of the regrets in (2) and (3), it is obvious that $R_T^s - \bar{R}_T^s = R_T^d - \bar{R}_T^d$. Thus, if one of the two items is proved, so is the other. Since Assumption 1 holds, from (2) and Lemma 1, we know that $R_T^s - \bar{R}_T^s \leq \sum_{t=1}^{T}G|\alpha_t\nu_t| \leq G^2\sum_{t=1}^{T}|\alpha_t| = G^2\sum_{t=1}^{T}\alpha_t$. Thus the proof is complete.

By using Proposition 5.1, we can establish regret bounds for Algorithms 2 and 3 using known regret bounds for online gradient descent algorithms. Moreover, if the regret bounds for online gradient descent are sublinear in $T$ and $\sum_{t=1}^{T}\alpha_t$ is also sublinear in $T$, then the established regret bounds for Algorithms 2 and 3 will be sublinear.

5.1 Static regret for convex functions

The following static regret bound of the online projected gradient descent algorithm is proved in [41, Theorem 1].

Lemma 5.2.

Suppose Assumption 1 holds. Furthermore, if the stepsize is chosen as $\alpha_t=\sqrt{\frac{1}{t}}$, then the static regret of the online projected gradient descent iterations (32) satisfies

$$\bar{R}_T^s \leq \frac{R^2\sqrt{T}}{2} + \left(\sqrt{T}-\frac{1}{2}\right)G^2.$$

The following result on static regret bounds of Algorithms 2 and 3 is a direct corollary of Proposition 5.1 and Lemma 5.2.

Corollary 5.3.

Suppose Assumption 1 holds. Furthermore, if the stepsize is chosen as $\alpha_t=\sqrt{\frac{1}{t}}$, then the static regrets of Algorithms 2 and 3 satisfy

$$R_T^s \leq \frac{R^2\sqrt{T}}{2} + \left(\sqrt{T}-\frac{1}{2}\right)G^2 + 2G^2\sqrt{T}.$$
Proof.

Since $\alpha_t=\sqrt{\frac{1}{t}}$, we have

$$\sum_{t=1}^{T}\alpha_t = 1 + \sum_{t=2}^{T}\alpha_t \leq 1 + \int_{1}^{T}\frac{1}{\sqrt{s}}\,ds = 2\sqrt{T}-1 \leq 2\sqrt{T}.$$

The conclusion follows as a result of Proposition 5.1 and Lemma 5.2.

5.2 Static regret for strongly convex functions

The static regret bound in Lemma 5.2 is improved in [11, Theorem 1] under the assumption of strong convexity.

Lemma 5.4.

Suppose Assumptions 1 and 2 hold. Furthermore, if the stepsize is chosen as $\alpha_t=\frac{1}{\mu t}$, then the static regret of the online projected gradient descent algorithm (32) satisfies

$$\bar{R}_T^s \leq \frac{G^2}{2\mu}(1+\log T).$$

The following result on static regret bounds of Algorithms 2 and 3 for strongly convex functions is a direct corollary of Proposition 5.1 and Lemma 5.4.

Corollary 5.5.

Suppose Assumptions 1 and 2 hold. Furthermore, if the stepsize is chosen as $\alpha_t=\frac{1}{\mu t}$, then the static regrets of Algorithms 2 and 3 satisfy

$$R_T^s \leq \frac{3G^2}{2\mu}(1+\log T).$$
Proof.

Since $\alpha_t=\frac{1}{\mu t}$, we have

$$\sum_{t=1}^{T}\alpha_t = \frac{1}{\mu} + \sum_{t=2}^{T}\alpha_t \leq \frac{1}{\mu} + \int_{1}^{T}\frac{1}{\mu s}\,ds = \frac{1+\log T}{\mu}.$$

The conclusion follows as a result of Proposition 5.1 and Lemma 5.4.

5.3 Dynamic regret for convex functions

We can adapt the proof of Theorem 3 to the deterministic case to derive a dynamic regret bound for convex functions, which is given in the following result.

Proposition 5.6.

Suppose Assumption 1 holds. Furthermore, if the stepsize is chosen as $\alpha_t=\sqrt{\frac{C_T}{T}}$, the dynamic regret achieved by the online gradient descent algorithm (32) satisfies

$$\bar{R}_T^d \leq \left(\frac{5R^2}{2\sqrt{C_T}} + R\sqrt{C_T}\right)\sqrt{T} + \frac{\sqrt{C_T}G^2}{2}\sqrt{T}.$$
Proof.

From (32) and Lemma 1, we have

$$\begin{split}|x_{t+1}-x|^2 &\leq |x_t-\alpha_t\nabla f_t(x_t)-x|^2\\ &= |x_t-x|^2 - 2\alpha_t\nabla f_t(x_t)^T(x_t-x) + \alpha_t^2|\nabla f_t(x_t)|^2\end{split} \qquad (34)$$

for any $x\in\Theta$. Substituting $x_t^*$ into (34) results in the following inequality for any $t\geq 2$:

$$f_t(x_t)-f_t(x_t^*) \leq \frac{1}{2\alpha_t}(|x_t-x_t^*|^2 - |x_{t+1}-x_t^*|^2) + \frac{\alpha_t}{2}|\nabla f_t(x_t)|^2.$$

Note that the inequality (22) still holds in this case. Thus, we have

$$\bar{R}_T^d \leq \frac{1}{2\alpha_t}(5R^2+2RC_T) + \frac{\alpha_t}{2}G^2 T.$$

If we set the stepsize to be $\sqrt{\frac{C_T}{T}}$, we have

$$\bar{R}_T^d \leq \left(\frac{5R^2}{2\sqrt{C_T}} + R\sqrt{C_T}\right)\sqrt{T} + \frac{\sqrt{C_T}G^2}{2}\sqrt{T}.$$

Using Propositions 5.1 and 5.6, we can state the following result on dynamic regret bounds of Algorithms 2 and 3.

Corollary 5.7.

Suppose Assumption 1 holds. Furthermore, if the stepsize is chosen as $\alpha_t=\sqrt{\frac{C_T}{T}}$, then the dynamic regrets of Algorithms 2 and 3 satisfy

$$R_T^d \leq \frac{R^2\sqrt{T}}{2} + \left(\sqrt{T}-\frac{1}{2}\right)G^2 + 2G^2\sqrt{C_T T}.$$
Proof.

Since $\alpha_t=\sqrt{\frac{C_T}{T}}$, we have $\sum_{t=1}^{T}\alpha_t=\sqrt{C_T T}$. The conclusion follows as a result of Propositions 5.1 and 5.6. As discussed in Remark 4.2, the stepsize choice $\alpha_t=\sqrt{\frac{C_T}{T}}$ can be made independent of $T$ by using the doubling trick scheme.

5.4 Dynamic regret for strongly convex functions

Note that when the cost functions are strongly convex and constant stepsizes are used, the result in Proposition 5.1 only gives a dynamic regret bound that is linear in $T$. Therefore, we aim to establish better dynamic regret bounds for Algorithms 2 and 3. In this subsection, we make the following assumption on problem (1).

Assumption 4.

The gradient of $f_t$ is block-wise Lipschitz continuous uniformly in $t$, i.e., for any $1\leq i\leq P$, there exists $L_i>0$ such that the following inequality holds for any $t\in\mathbb{Z}_{\geq 0}$, $x\in\mathbb{R}^n$, and $u\in\mathbb{R}^{n_i}$:

$$|\nabla_{(i)}f_t(x+H_i u)-\nabla_{(i)}f_t(x)| \leq L_i|u|, \qquad (35)$$

where $H_i\in\mathbb{R}^{n\times n_i}$ is such that $[H_1\ H_2\cdots H_P]=I_n$.

We denote the largest $L_i$ for $1\leq i\leq P$ by $L_{\mathrm{max}}:=\max\{L_1,\cdots,L_P\}$. By further assuming $\nabla f_t(x_t^*)=0$ for any $t\in\mathbb{Z}_{\geq 0}$, the dynamic regret bounds can be derived without Assumption 1. Therefore, they are applicable to unconstrained problems with unbounded gradients. Before analyzing the regrets, we define the following measure of the variation of problem (1), first introduced in [40]:

$$C_{T,2} := \sum_{t=1}^{T}|x_t^*-x_{t-1}^*|^2. \qquad (36)$$
Remark 5.8.

Compared to $C_T$, $C_{T,2}$ can be significantly smaller if the variation of the minimizers is small. Following the discussion in Remark 4.2, if the variation of the minimizers decreases with order $1/t^q$ with $1/2<q\leq 1$, then $C_{T,2}$ is finite while $C_T$ grows to $\infty$ at a rate sublinear in $T$. Similarly, if $0<q<1/2$, then $C_{T,2}=O(T^{1-2q})$ and $C_T=O(T^{1-q})$.

Theorem 5.

Suppose $\nabla f_t(x_t^*)=0$ for any $t\in\mathbb{Z}_{\geq 0}$ and Assumptions 2-4 hold. Furthermore, we assume the number of blocks $P$ satisfies $P<\frac{1}{B_3 L_{\mathrm{max}}}$ with $B_3:=\frac{1}{\mu}-\frac{1}{2L}>0$. If the stepsize is chosen such that $\alpha_t=\alpha\in\left(\frac{1-\sqrt{1-B_3PL_{\mathrm{max}}}}{L_{\mathrm{max}}},\frac{1+\sqrt{1-B_3PL_{\mathrm{max}}}}{L_{\mathrm{max}}}\right)$, then the dynamic regret (5) achieved by Algorithm 3 satisfies

$$R_T^d \leq \frac{L}{B_4}(C_{T,2}+C_2),$$

where $B_4=\frac{1}{2}-\sqrt{L\left(\frac{1}{\mu}-\frac{2\alpha-\alpha^2 L_{\mathrm{max}}}{2P}\right)}>0$ and $C_2=|x_1-x_1^*|^2-2|x_1^*-x_0^*|^2$.

Proof.

Let $\{y_t\}$ denote the sequence of real vectors such that $y_{t+1}=x_t-\alpha_t U_t[\nabla f_t(x_t)]$, where the value of $U_t$ follows the coordinate selection rule in Algorithm 3. The variable $y_t\in\mathbb{R}^n$ stores the value of the decision variable $x_t$ before projection, i.e., $x_t=\Pi_\Theta(y_t)$. By the block descent lemma [1, Lemma 3.2] and the fact that $L_{\mathrm{max}}\geq L_i$ for all $i$, we have $f_t(y_{t+1}) \leq f_t(x_t) + \frac{\alpha^2 L_{\mathrm{max}}}{2}|\nabla_{(i)}f_t(x_t)|^2 - \alpha|\nabla_{(i)}f_t(x_t)|^2$. That is,

$$f_t(x_t) - f_t(y_{t+1}) \geq \left(\alpha - \frac{\alpha^2 L_{\mathrm{max}}}{2}\right)|\nabla_{(i)}f_t(x_t)|^2. \qquad (37)$$

From (37) and $i_t\in\arg\max_i|\nabla_{(i)}f_t(x_t)|$:

$$\begin{split}f_t(x_t)-f_t(y_{t+1}) &\geq \left(\alpha-\frac{\alpha^2 L_{\mathrm{max}}}{2}\right)|\nabla_{(i)}f_t(x_t)|^2\\ &\geq \frac{1}{P}\left(\alpha-\frac{\alpha^2 L_{\mathrm{max}}}{2}\right)|\nabla f_t(x_t)|^2. \end{split} \qquad (38)$$

Since $\nabla f_t(x_t^*)=0$ for any $t\in\mathbb{Z}_{\geq 0}$, minimizing both sides of (16) with respect to $y$ yields $f_t(x_t)-f_t(x_t^*)\leq\frac{1}{2\mu}|\nabla f_t(x_t)|^2$ for any $x_t\in\Theta$ (known as the Polyak-Łojasiewicz condition). Then, by (38), we have

$$\begin{split}f_t(x_t)-f_t(x_t^*)-&(f_t(y_{t+1})-f_t(x_t^*)) \geq \bar{A}|\nabla f_t(x_t)|^2\\ &\geq 2\mu\bar{A}(f_t(x_t)-f_t(x_t^*)),\end{split}$$

where $\bar{A}=\frac{1}{P}\left(\alpha-\frac{\alpha^2 L_{\mathrm{max}}}{2}\right)$. Consequently,

$$f_t(y_{t+1})-f_t(x_t^*) \leq (1-2\mu\bar{A})(f_t(x_t)-f_t(x_t^*)). \qquad (39)$$

By the fact that $\nabla f_t(x_t^*)=0$ and Assumption 3, we have $f_t(x_t)-f_t(x_t^*)\leq\frac{L}{2}|x_t-x_t^*|^2$ and hence

$$\frac{\mu}{2}|x_t-x_t^*|^2 \leq f_t(x_t)-f_t(x_t^*) \leq \frac{L}{2}|x_t-x_t^*|^2. \qquad (40)$$

By (39) and (40), the following inequality holds as a result of Lemma 1

$$|x_{t+1}-x_t^*|^2 \leq |y_{t+1}-x_t^*|^2 \leq \frac{L}{\mu}(1-2\mu\bar{A})|x_t-x_t^*|^2. \qquad (41)$$

Next, we show that $\frac{L}{\mu}(1-2\mu\bar{A})<\frac{1}{2}$ under the stated assumptions. Since $\bar{A}=\frac{1}{P}\left(\alpha-\frac{\alpha^2 L_{\mathrm{max}}}{2}\right)$, it can be shown after some algebraic manipulations that $\frac{L}{\mu}(1-2\mu\bar{A})<\frac{1}{2}$ is equivalent to $L_{\mathrm{max}}\alpha^2-2\alpha+B_3P<0$. When $P<\frac{1}{B_3L_{\mathrm{max}}}$, we have $4-4B_3PL_{\mathrm{max}}>0$, and the set of solutions to the inequality $L_{\mathrm{max}}\alpha^2-2\alpha+B_3P<0$ with respect to $\alpha$ is non-empty. Moreover, the solution is given by $\frac{1-\sqrt{1-B_3PL_{\mathrm{max}}}}{L_{\mathrm{max}}}<\alpha<\frac{1+\sqrt{1-B_3PL_{\mathrm{max}}}}{L_{\mathrm{max}}}$. Therefore, the conditions listed in the statement of the theorem ensure that $\frac{L}{\mu}(1-2\mu\bar{A})<\frac{1}{2}$. Thus, (41) implies

$$|x_{t+1}-x_t^*|^2 \leq \frac{1}{2}|x_t-x_t^*|^2 - B_4|x_t-x_t^*|^2, \qquad (42)$$

with $B_4=\frac{1}{2}-\frac{L}{\mu}(1-2\mu\bar{A})>0$.

Note that

$$|x_{t+1}-x_{t+1}^*|^2 \leq 2|x_{t+1}-x_t^*|^2 + 2|x_t^*-x_{t+1}^*|^2. \qquad (43)$$

The deterministic dynamic regret (3) can be upper bounded as follows

$$\begin{split}R_T^d &= \sum_{t=1}^{T}f_t(x_t) - \sum_{t=1}^{T}f_t(x_t^*)\\ &\overset{(24)}{\leq} \sum_{t=1}^{T}\left[\nabla f_t(x_t^*)^T(x_t-x_t^*) + \frac{L}{2}|x_t-x_t^*|^2\right] = \frac{L}{2}\sum_{t=1}^{T}|x_t-x_t^*|^2.\end{split} \qquad (44)$$

From (42), we have the following relationship by summing up both sides of (43) from $t=1$ to $t=T-1$:

$$\begin{split}&\sum_{t=1}^{T}|x_t-x_t^*|^2 - |x_1-x_1^*|^2\\ &\leq (1-2B_4)\sum_{t=1}^{T}|x_t-x_t^*|^2 + 2C_{T,2} - 2|x_1^*-x_0^*|^2.\end{split} \qquad (45)$$

Since $B_4>0$, (45) leads to $R_T^d \leq \frac{L}{2}\sum_{t=1}^{T}|x_t-x_t^*|^2 \leq \frac{L}{B_4}(C_{T,2}+C_2)$.

Remark 5.9.

Note that in Theorem 5, we impose an upper bound on the number of components $P$, which increases to $\infty$ as $B_3$ decreases to 0. This is mainly a result of the conservatism introduced in the derivation of (41). However, we manage to improve the regret bound of Theorem 4 from $O(C_T)$ to $O(C_{T,2})$, following the discussion in Remark 5.8. The bound is also valid for unconstrained OCO problems that are not covered by Theorems 1-4.

The following two theorems give dynamic regret bounds without assuming an upper bound on the number of components $P$. However, at each time $t$, potentially multiple offline steps are needed to guarantee desirable regret bounds.

The modified algorithms are given in Algorithms 4 and 5.

Algorithm 4 Online cyclic coordinate descent algorithm
1: Initialization: $x_0\in\Theta$, $i_0$, $\kappa$, and some integer $k\geq 1$.
2: Update: At time $t$, $\kappa\leftarrow 1$, $i_{\kappa-1}\leftarrow i_{t-1}$, $\hat{x}_{\kappa-1}\leftarrow x_t$.
3: Select Coordinate: $i_\kappa \leftarrow (i_{\kappa-1}\ \mathrm{mod}\ P)+1$.
4: Update: For $i=i_\kappa$, such that $\kappa\leq k$:
$$\hat{x}_{(i),\kappa} \leftarrow \hat{x}_{(i),\kappa-1} - \alpha_t[\nabla_{(i)}f_t(\hat{x}_{\kappa-1})].$$
For all $i\neq i_\kappa$, such that $\kappa\leq k$:
$$\hat{x}_{(i),\kappa} \leftarrow \hat{x}_{(i),\kappa-1}.$$
5: For $\kappa<k$, set $\kappa\leftarrow\kappa+1$ and go to Step 3.
6: For $\kappa\geq k$, set $t\leftarrow t+1$, $i_t=i_k$, $x_{t+1}\leftarrow\Pi_\Theta(\hat{x}_k)$, and go to Step 2.

Algorithm 5 Online coordinate descent algorithm with Gauss-Southwell Rule
1: Initialization: $x_0\in\Theta$, $\kappa$, and some integer $k\geq 1$.
2: Update: At time $t$, $\kappa\leftarrow 1$, $\hat{x}_{\kappa-1}\leftarrow x_t$.
3: Select Coordinate: $i_\kappa \leftarrow \arg\max_i|\nabla_{(i)}f_t(\hat{x}_{\kappa-1})|$.
4: Update: For $i=i_\kappa$, such that $\kappa\leq k$:
$$\hat{x}_{(i),\kappa} \leftarrow \hat{x}_{(i),\kappa-1} - \alpha_t[\nabla_{(i)}f_t(\hat{x}_{\kappa-1})].$$
For all $i\neq i_\kappa$, such that $\kappa\leq k$:
$$\hat{x}_{(i),\kappa} \leftarrow \hat{x}_{(i),\kappa-1}.$$
5: For $\kappa<k$, set $\kappa\leftarrow\kappa+1$ and go to Step 3.
6: For $\kappa\geq k$, set $t\leftarrow t+1$, $x_{t+1}\leftarrow\Pi_\Theta(\hat{x}_k)$, and go to Step 2.

It can be seen that in Algorithms 4 and 5, at each time $t$, $k$ updates are performed, where $k$ is an integer to be chosen. In Algorithms 2 and 3, however, only one step is performed at each time $t$.
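A minimal sketch of this inner loop, using the Gauss-Southwell rule of Algorithm 5 and assuming scalar blocks and a box constraint set, could look as follows: $k$ coordinate updates are applied to the current cost $f_t$ before the projection.

```python
import numpy as np

def online_gs_cd_multistep(grad_f, x0, alpha, T, k, lo, hi):
    """Sketch of Algorithm 5: k Gauss-Southwell coordinate steps per time t."""
    x = x0.copy()
    for t in range(T):
        x_hat = x.copy()                   # Step 2: initialize the inner iterate
        for _ in range(k):                 # Steps 3-5: k inner updates on f_t
            g = grad_f(t, x_hat)
            i = int(np.argmax(np.abs(g)))  # Gauss-Southwell coordinate selection
            x_hat[i] -= alpha * g[i]
        x = np.clip(x_hat, lo, hi)         # Step 6: projection, then t <- t + 1
    return x
```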

Theorem 6.

Suppose that ft(xt)=0\nabla f_{t}(x_{t}^{*})=0 for all t0t\in\mathbb{Z}_{\geq 0}, Assumptions 2-4 hold, and the stepsize is chosen such that αt=α<2Lmax\alpha_{t}=\alpha<\frac{2}{L_{\mathrm{max}}}. If kk is an integer such that B5:=Lμ(12μA)k+1PP(1+αLmax)2P2<1/2B_{5}:=\frac{L}{\mu}(1-2\mu A)^{\frac{k+1-P}{P}}(1+\alpha L_{\mathrm{max}})^{2P-2}<1/2, then the dynamic regret (5) achieved by Algorithm 4 satisfies

RTdL24B5(2CT,2+C2),R_{T}^{d}\leq\frac{L}{2-4B_{5}}(2C_{T,2}+C_{2}),

where A=(αα2Lmax2)/[2(1+α2L2P)]A=(\alpha-\frac{\alpha^{2}L_{\mathrm{max}}}{2})/[2(1+\alpha^{2}L^{2}P)] and C2=|x1x1|22|x1x0|2C_{2}=|x_{1}-x_{1}^{*}|^{2}-2|x_{1}^{*}-x_{0}^{*}|^{2}.

{pf}

By the block descent lemma [1, Lemma 3.2] and the fact that LmaxLiL_{\mathrm{max}}\geq L_{i} for all ii, we have ft(x^κ+1)ft(x^κ)+α2Lmax2|(i)ft(x^κ)|2α|(i)ft(x^κ)|2f_{t}(\hat{x}_{\kappa+1})\leq f_{t}(\hat{x}_{\kappa})+\frac{\alpha^{2}L_{\mathrm{max}}}{2}|\nabla_{(i)}f_{t}(\hat{x}_{\kappa})|^{2}-\alpha|\nabla_{(i)}f_{t}(\hat{x}_{\kappa})|^{2} at any tt, for any 0κk10\leq\kappa\leq k-1. That is

ft(x^κ)ft(x^κ+1)(αα2Lmax2)|(i)ft(x^κ)|2.f_{t}(\hat{x}_{\kappa})-f_{t}(\hat{x}_{\kappa+1})\geq(\alpha-\frac{\alpha^{2}L_{\mathrm{max}}}{2})|\nabla_{(i)}f_{t}(\hat{x}_{\kappa})|^{2}.

Summing over all PP blocks in a full round where all components have been updated exactly once leads to

ft(x^κ)ft(x^κ+P)(αα2Lmax2)i=1P|(i)ft(x^κ+i)|2.f_{t}(\hat{x}_{\kappa})-f_{t}(\hat{x}_{\kappa+P})\geq(\alpha-\frac{\alpha^{2}L_{\mathrm{max}}}{2})\sum_{i=1}^{P}|\nabla_{(i)}f_{t}(\hat{x}_{\kappa+i})|^{2}.

By the update equation and Lipschitz continuity of the gradient, we have |ft(x^κ)ft(x^κ+i)|2α2L2j=1i|(j)ft(x^κ+j1)|2|\nabla f_{t}(\hat{x}_{\kappa})-\nabla f_{t}(\hat{x}_{\kappa+i})|^{2}\leq\alpha^{2}L^{2}\sum_{j=1}^{i}|\nabla_{(j)}f_{t}(\hat{x}_{\kappa+j-1})|^{2}. Therefore

|(i)ft(x^κ)|2(|(i)ft(x^κ)(i)ft(x^κ+i)|+|(i)ft(x^κ+i)|)22|(i)ft(x^κ+i)|2+2α2L2j=1i|(j)ft(x^κ+j1)|2.\begin{split}&|\nabla_{(i)}f_{t}(\hat{x}_{\kappa})|^{2}\\ &\leq(|\nabla_{(i)}f_{t}(\hat{x}_{\kappa})-\nabla_{(i)}f_{t}(\hat{x}_{\kappa+i})|+|\nabla_{(i)}f_{t}(\hat{x}_{\kappa+i})|)^{2}\\ &\leq 2|\nabla_{(i)}f_{t}(\hat{x}_{\kappa+i})|^{2}+2\alpha^{2}L^{2}\sum_{j=1}^{i}|\nabla_{(j)}f_{t}(\hat{x}_{\kappa+j-1})|^{2}.\end{split}

Summing over PP blocks leads to

i=1P|(i)ft(x^κ)|22i=1P(1+(Pi)α2L2)|(i)ft(x^κ+i)|22(1+α2PL2)i=1P|(i)ft(x^κ+i)|2.\begin{split}\sum_{i=1}^{P}|\nabla_{(i)}f_{t}(\hat{x}_{\kappa})|^{2}&\leq 2\sum_{i=1}^{P}(1+(P-i)\alpha^{2}L^{2})|\nabla_{(i)}f_{t}(\hat{x}_{\kappa+i})|^{2}\\ &\leq 2(1+\alpha^{2}PL^{2})\sum_{i=1}^{P}|\nabla_{(i)}f_{t}(\hat{x}_{\kappa+i})|^{2}.\\ \end{split}

Therefore,

ft(x^κ)ft(x^κ+P)(αα2Lmax2)/[2(1+α2L2P)]|ft(x^κ)|2:=A|ft(x^κ)|2.\begin{split}&f_{t}(\hat{x}_{\kappa})-f_{t}(\hat{x}_{\kappa+P})\\ &\geq(\alpha-\frac{\alpha^{2}L_{\mathrm{max}}}{2})/[2(1+\alpha^{2}L^{2}P)]|\nabla f_{t}(\hat{x}_{\kappa})|^{2}\\ &:=A|\nabla f_{t}(\hat{x}_{\kappa})|^{2}.\\ \end{split} (46)

By Assumption 2, minimizing both sides of (16) with respect to yy, we have ft(x^κ)ft(xt)12μ|ft(x^κ)|2f_{t}(\hat{x}_{\kappa})-f_{t}(x_{t}^{*})\leq\frac{1}{2\mu}|\nabla f_{t}(\hat{x}_{\kappa})|^{2}. Then, by (46), we have

ft(x^κ)ft(xt)(ft(x^κ+P)ft(xt))A|ft(x^κ)|22μA(ft(x^κ)ft(xt)),\begin{split}&f_{t}(\hat{x}_{\kappa})-f_{t}(x_{t}^{*})-(f_{t}(\hat{x}_{\kappa+P})-f_{t}(x_{t}^{*}))\geq A|\nabla f_{t}(\hat{x}_{\kappa})|^{2}\\ &\geq 2\mu A(f_{t}(\hat{x}_{\kappa})-f_{t}(x_{t}^{*})),\end{split}

which further implies,

ft(x^κ+P)ft(xt)(12μA)(ft(x^κ)ft(xt)).f_{t}(\hat{x}_{\kappa+P})-f_{t}(x_{t}^{*})\leq(1-2\mu A)(f_{t}(\hat{x}_{\kappa})-f_{t}(x_{t}^{*})). (47)

By Assumption 3, we have |x^κ+1xt|(1+αLmax)|x^κxt||\hat{x}_{\kappa+1}-x_{t}^{*}|\leq(1+\alpha L_{\mathrm{max}})|\hat{x}_{\kappa}-x_{t}^{*}|. For any k0k\in\mathbb{Z}_{\geq 0}, there exist k10k_{1}\in\mathbb{Z}_{\geq 0} and k2{0,1,,P1}k_{2}\in\{0,1,\ldots,P-1\} such that k=k1P+k2k=k_{1}P+k_{2}. Then, by Assumptions 2 and 3, we have

μ2|xxt|2ft(x)ft(xt)L2|xxt|2,\frac{\mu}{2}|x-x_{t}^{*}|^{2}\leq f_{t}(x)-f_{t}(x_{t}^{*})\leq\frac{L}{2}|x-x_{t}^{*}|^{2}, (48)

for any xnx\in\mathbb{R}^{n}. Consequently, (47) and (48) yield

|x^kxt|2=|x^k1P+k2xt|22μ(12μA)k1L2|x^k2xt|2=Lμ(12μA)k1|x^k2xt|2Lμ(12μA)kk2P(1+αLmax)2P2|xtxt|2Lμ(12μA)k+1PP(1+αLmax)2P2|xtxt|2,\begin{split}&|\hat{x}_{k}-x_{t}^{*}|^{2}=|\hat{x}_{k_{1}P+k_{2}}-x_{t}^{*}|^{2}\\ &\leq\frac{2}{\mu}(1-2\mu A)^{k_{1}}\frac{L}{2}|\hat{x}_{k_{2}}-x_{t}^{*}|^{2}\\ &={\frac{L}{\mu}}(1-2\mu A)^{k_{1}}|\hat{x}_{k_{2}}-x_{t}^{*}|^{2}\\ &\leq{\frac{L}{\mu}}(1-2\mu A)^{\frac{k-k_{2}}{P}}(1+\alpha L_{\mathrm{max}})^{2P-2}|x_{t}-x_{t}^{*}|^{2}\\ &\leq{\frac{L}{\mu}}(1-2\mu A)^{\frac{k+1-P}{P}}(1+\alpha L_{\mathrm{max}})^{2P-2}|x_{t}-x_{t}^{*}|^{2},\end{split} (49)

at any t0t\in\mathbb{Z}_{\geq 0}. Since α<2/Lmax\alpha<{2}/{L_{\mathrm{max}}}, we have A>0A>0 and 0<12μA<10<1-2\mu A<1. Hence there exists kk such that B5=Lμ(12μA)k+1PP(1+αLmax)2P2<1/2B_{5}=\frac{L}{\mu}(1-2\mu A)^{\frac{k+1-P}{P}}(1+\alpha L_{\mathrm{max}})^{2P-2}<1/2. Then, by Lemma 1,

|xt+1xt|2|x^kxt|2B5|xtxt|2.|x_{t+1}-x_{t}^{*}|^{2}\leq|\hat{x}_{k}-x_{t}^{*}|^{2}\leq B_{5}|x_{t}-x_{t}^{*}|^{2}. (50)

By (44), we have RTdL2t=1T|xtxt|2R_{T}^{d}\leq\frac{L}{2}\sum_{t=1}^{T}|x_{t}-x_{t}^{*}|^{2}. Note that

|xt+1xt+1|22|xt+1xt|2+2|xtxt+1|2.|x_{t+1}-x_{t+1}^{*}|^{2}\leq 2|x_{t+1}-x_{t}^{*}|^{2}+2|x_{t}^{*}-x_{t+1}^{*}|^{2}. (51)

Summing up both sides of (51) from t=1t=1 to t=T1t=T-1 using (50), we have

(12B5)2RTd/L|x1x1|2+2t=1T|xtxt+1|2=2CT,2+C2.\begin{split}(1-2B_{5})2R_{T}^{d}/L&\leq|x_{1}-x_{1}^{*}|^{2}+2\sum_{t=1}^{T}|x_{t}^{*}-x_{t+1}^{*}|^{2}\\ &=2C_{T,2}+C_{2}.\end{split}

Since 2B5<12B_{5}<1, RTdL24B5(2CT,2+C2)R_{T}^{d}\leq\frac{L}{2-4B_{5}}(2C_{T,2}+C_{2}) follows.
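Since 0 < 1-2\mu A < 1, the constant B_5 decays geometrically in k, so the smallest admissible k can be computed directly. A minimal sketch, with problem constants chosen by us purely for illustration (not taken from the paper):

```python
L, mu, L_max, P = 2.0, 0.5, 2.0, 5         # illustrative constants
alpha = 1.0 / L_max                        # satisfies alpha < 2 / L_max
A = (alpha - alpha**2 * L_max / 2) / (2 * (1 + alpha**2 * L**2 * P))

def B5(k):
    # B_5 from Theorem 6 as a function of the number of inner steps k
    return (L / mu) * (1 - 2 * mu * A) ** ((k + 1 - P) / P) \
           * (1 + alpha * L_max) ** (2 * P - 2)

k = P
while B5(k) >= 0.5:                        # smallest k with B_5 < 1/2
    k += 1
print(k, B5(k))
```

For these constants the resulting k runs into the thousands, which reflects the conservatism of the bound (in particular the factor (1+\alpha L_max)^{2P-2}) rather than the practical behavior of Algorithm 4.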

Theorem 7.

Suppose that ft(xt)=0\nabla f_{t}(x_{t}^{*})=0 for all t0t\in\mathbb{Z}_{\geq 0}, Assumptions 2-4 hold, and the stepsize is chosen such that αt=α<2Lmax\alpha_{t}=\alpha<\frac{2}{L_{\mathrm{max}}}. If kk is an integer such that B6:=Lμ(12μA¯)k+1PP(1+αLmax)2P2<12B_{6}:={\frac{L}{\mu}}(1-2\mu\bar{A})^{\frac{k+1-P}{P}}(1+\alpha L_{\mathrm{max}})^{2P-2}<\frac{1}{2}, then the dynamic regret (5) achieved by Algorithm 5 satisfies

RTdL24B6(2CT,2+C2),R_{T}^{d}\leq\frac{L}{2-4B_{6}}(2C_{T,2}+C_{2}),

where A¯=1P(αα2Lmax2)\bar{A}=\frac{1}{P}(\alpha-\frac{\alpha^{2}L_{\mathrm{max}}}{2}) and C2=|x1x1|22|x1x0|2C_{2}=|x_{1}-x_{1}^{*}|^{2}-2|x_{1}^{*}-x_{0}^{*}|^{2}.

{pf}

Following steps similar to those in the proof of Theorem 6 and using (39) and (48), we obtain

|xt+1xt|2Lμ(12μA¯)k+1PP(1+αLmax)2P2|xtxt|2,\begin{split}&|x_{t+1}-x_{t}^{*}|^{2}\\ &\leq{\frac{L}{\mu}}(1-2\mu\bar{A})^{\frac{k+1-P}{P}}(1+\alpha L_{\mathrm{max}})^{2P-2}|x_{t}-x_{t}^{*}|^{2},\end{split} (52)

where A¯=1P(αα2Lmax2)\bar{A}=\frac{1}{P}(\alpha-\frac{\alpha^{2}L_{\mathrm{max}}}{2}) as in the proof of Theorem 5. Since A¯>0\bar{A}>0, we know there exists kk such that B6=Lμ(12μA¯)k+1PP(1+αLmax)2P2<12B_{6}={\frac{L}{\mu}}(1-2\mu\bar{A})^{\frac{k+1-P}{P}}(1+\alpha L_{\mathrm{max}})^{2P-2}<\frac{1}{2}. Hence, we have

|xt+1xt|2B6|xtxt|2.|x_{t+1}-x_{t}^{*}|^{2}\leq B_{6}|x_{t}-x_{t}^{*}|^{2}. (53)

Summing both sides of (51) from t=1t=1 to t=T1t=T-1 and using (53), we obtain

(12B6)2RTd/L|x1x1|2+2t=1T|xtxt+1|2=2CT,2+C2.\begin{split}(1-2B_{6})2R_{T}^{d}/L&\leq|x_{1}-x_{1}^{*}|^{2}+2\sum_{t=1}^{T}|x_{t}^{*}-x_{t+1}^{*}|^{2}\\ &=2C_{T,2}+C_{2}.\end{split}

Since 2B6<12B_{6}<1, we have RTdL24B6(2CT,2+C2)R_{T}^{d}\leq\frac{L}{2-4B_{6}}(2C_{T,2}+C_{2}).

6 Numerical simulations

First, we study the following unconstrained quadratic problem

minx2012xTQtxbTx,\underset{x\in\mathbb{R}^{20}}{\text{min}}\ \frac{1}{2}x^{T}Q_{t}x-b^{T}x, (54)

where b20b\in\mathbb{R}^{20} is a randomly generated constant vector and QtQ_{t} is a time-varying matrix that is positive definite for all t1t\geq 1. Moreover, all elements of QtQ_{t} are in the closed interval [1,2][1,2]. As a result, (54) satisfies Assumptions 2-4. Next, we discuss Assumption 1 (i) made in Theorem 4. The constant GG from Assumption 1 (i) is only used to show RTdG𝔼[t=1T|xtxt|]R_{T}^{d}\leq G\mathbb{E}[\sum_{t=1}^{T}|x_{t}-x_{t}^{*}|] such that (31) holds. All other arguments made in the proof of Theorem 4 remain true even if G=G=\infty. Thus, if the stepsize is small enough, every iteration of Algorithm 1 moves xtx_{t} closer to xt=Qt1bx_{t}^{*}=Q_{t}^{-1}b in expectation. Since in our setup Qt1bQ_{t}^{-1}b is uniformly bounded, the expectation of xtx_{t} remains bounded as well. This means that xtx_{t} and the gradient QtxtbQ_{t}x_{t}-b remain bounded almost surely. We choose P=20P=20, and for each 1i201\leq i\leq 20, x(i)x_{(i)} is a scalar. The horizon length is T=5000T=5000, and the constant stepsize is chosen as αt=α=0.001\alpha_{t}=\alpha=0.001. Since the static regret is always upper bounded by the dynamic regret, we only show the plots of the dynamic regrets and their time-averages RTd/TR_{T}^{d}/T.
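A minimal sketch of this experiment with the cyclic rule (Algorithm 2) is given below. It is our own reconstruction: the exact mechanism generating Q_t is not specified beyond the entry range and positive definiteness, so the symmetric draw with a diagonal shift is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
n, T, alpha = 20, 5000, 0.001
b = rng.uniform(1, 2, n)                   # random constant vector b

def make_Q():
    # symmetric matrix with entries drawn from [1, 2]; the diagonal shift
    # enforcing positive definiteness is our simplification
    M = rng.uniform(1, 2, (n, n))
    return (M + M.T) / 2 + n * np.eye(n)

x = np.zeros(n)
regret = 0.0
for t in range(1, T + 1):
    Q = make_Q()
    f = lambda z: 0.5 * z @ Q @ z - b @ z
    x_star = np.linalg.solve(Q, b)         # per-step minimizer x_t^*
    regret += f(x) - f(x_star)             # dynamic regret increment at x_t
    i = (t - 1) % n                        # cyclic coordinate selection
    x[i] -= alpha * (Q @ x - b)[i]         # one coordinate-descent step
print(regret, regret / T)                  # R_T^d and its time-average
```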

Figure 1: Plots of the dynamic regrets RTdR_{T}^{d} and their time-averages RTd/TR_{T}^{d}/T.

It can be seen from Fig. 1 that when the constant stepsize is chosen sufficiently small, the regrets in all cases grow sublinearly, and therefore their time-averages converge to 0 as TT becomes sufficiently large. This is consistent with our theoretical results from Theorems 4-7 on strongly convex functions.

Figure 2: Plots of the dynamic regrets RTdR_{T}^{d} and their time-averages RTd/TR_{T}^{d}/T with slow variation.

To see how the variation of the problem impacts the performance of the algorithms, we add an extra term of 100I20100I_{20} so that the matrix QtQ_{t} is diagonally dominant and therefore less sensitive to tt. We test the online algorithms in this case and the results are shown in Fig. 2. It can be seen that when problem (54) varies slowly with respect to time, the regret curves in Fig. 2 grow at a lower rate than those shown in Fig. 1.

As expected, the algorithm using the full gradient has the best performance in terms of minimizing the dynamic regret. Yet, it is worth mentioning that, among the three coordinate descent algorithms considered for this numerical example, the Gauss-Southwell rule gives the best performance, which is consistent with Remark 2. The extra information of the component-wise gradient norms enables a better selection of the coordinate to update. An in-depth theoretical analysis of this phenomenon in the online setting is left for future work.

Next, we consider the following problem of minimizing entropy functions online

minxΘi=15x(i)pi,tlnx(i)pi,t.\underset{x\in\Theta}{\text{min}}\sum_{i=1}^{5}\frac{x_{(i)}}{p_{i,t}}\ln\frac{x_{(i)}}{p_{i,t}}. (55)

The variable xΘx\in\Theta is decomposed into 5 scalar components with the compact constraint set Θ={x5| 0.001x(i)1000,i=1,2,3,4,5}\Theta=\{x\in\mathbb{R}^{5}\ |\ 0.001\leq x_{(i)}\leq 1000,i=1,2,3,4,5\}. Each pi,1p_{i,1} is randomly and independently selected from [1,5][1,5]. For t2t\geq 2, pi,t=pi,t1+1t1p_{i,t}=p_{i,t-1}+\frac{1}{t-1} for all ii. It can be verified that this selection ensures that |xt+1xt|=1t|x_{t+1}^{*}-x_{t}^{*}|=\frac{1}{t} and CT=O(logT)C_{T}=O(\log T). Note that the cost function in (55) is convex but not strongly convex, and hence only Theorems 1 and 3 apply. We again show the plots of the dynamic regrets of Algorithms 1-3 and the full-gradient-based algorithm in Fig. 3, with constant stepsize α=0.05\alpha=0.05 and T=5000T=5000. Moreover, the plot for random coordinate descent is averaged over 1000 runs.
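A sketch of this experiment under the same conventions is given below (ours; shown for the Gauss-Southwell rule only, with the projection onto the box Θ implemented as coordinate-wise clipping):

```python
import numpy as np

rng = np.random.default_rng(1)
n, T, alpha = 5, 5000, 0.05
lo, hi = 0.001, 1000.0                     # box constraint set Theta
p = rng.uniform(1, 5, n)                   # p_{i,1} drawn from [1, 5]
x = np.ones(n)

f = lambda z, q: np.sum(z / q * np.log(z / q))
regret = 0.0
for t in range(1, T + 1):
    grad = (np.log(x / p) + 1) / p         # gradient of the entropy cost
    x_star = np.clip(p / np.e, lo, hi)     # minimizer x_(i) = p_i / e lies in the box
    regret += f(x, p) - f(x_star, p)
    i = np.argmax(np.abs(grad))            # Gauss-Southwell coordinate selection
    x[i] -= alpha * grad[i]
    x = np.clip(x, lo, hi)                 # projection onto Theta
    p = p + 1.0 / t                        # p_{i,t+1} = p_{i,t} + 1/t
print(regret, regret / T)
```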

It can be seen from Figs. 1-3 that, for the quadratic problem, the dynamic regrets are much flatter when tt is large. On the other hand, the dynamic regrets in Fig. 3 still exhibit significant growth at t=T=5000t=T=5000, even though we select the time-varying parameters to ensure that CT=O(logT)C_{T}=O(\log T). These findings are consistent with the improved regret bounds established in Theorems 4-7 for uniformly strongly convex functions with uniformly Lipschitz gradients.

Figure 3: Plots of the dynamic regrets RTdR_{T}^{d} and their time-averages RTd/TR_{T}^{d}/T in the non-strongly convex case.

7 Summary

In this work, we have proposed online coordinate descent algorithms to deal with optimization problems that change over time. Three widely used update rules of coordinate descent are considered. Under different assumptions, we have provided upper bounds on the regrets of these online algorithms. In particular, we have verified that the established regret bounds of these coordinate descent algorithms are of orders similar to those of online gradient descent methods under the same settings. The regret bounds proved in this paper are summarized in Table 1. Lastly, numerical examples were given to illustrate our main results. The possibility of using coordinate blocks with overlapping components is an interesting future research direction, especially for the deterministic case. Another topic of interest is the use of inaccurate gradient information in online coordinate descent algorithms.

{ack}

The authors would like to thank Dr. Yuen-Man Pun for pointing out an error in the proof of Theorem 6 in the initial draft.

References

  • [1] Amir Beck and Luba Tetruashvili. On the convergence of block coordinate descent type methods. SIAM Journal on Optimization, 23(4):2037–2060, 2013.
  • [2] A. S. Bedi, P. Sarma, and K. Rajawat. Tracking moving agents via inexact online gradient descent algorithm. IEEE Journal of Selected Topics in Signal Processing, 12(1):202–217, 2018.
  • [3] D. P. Bertsekas and J. N. Tsitsiklis. Parallel and Distributed Computation: Numerical Methods. Prentice-Hall, 1989.
  • [4] Dimitri P Bertsekas. Convex Optimization Algorithms. Athena Scientific, 2015.
  • [5] Xuanyu Cao and Tamer Başar. Decentralized online convex optimization with feedback delays. IEEE Transactions on Automatic Control, 67(6):2889–2904, 2022.
  • [6] Xuanyu Cao and KJ Ray Liu. Online convex optimization with time-varying constraints and bandit feedback. IEEE Transactions on Automatic Control, 64(7):2665–2680, 2019.
  • [7] Xuanyu Cao, Junshan Zhang, and H Vincent Poor. Online stochastic optimization with time-varying distributions. IEEE Transactions on Automatic Control, 66(4):1840–1847, 2021.
  • [8] Ting-Jui Chang and Shahin Shahrampour. On online optimization: Dynamic regret analysis of strongly convex and smooth problems. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 6966–6973, 2021.
  • [9] Tianyi Chen, Qing Ling, and Georgios B Giannakis. An online convex optimization approach to proactive network resource allocation. IEEE Transactions on Signal Processing, 65(24):6350–6364, 2017.
  • [10] Dan Garber and Elad Hazan. A linearly convergent variant of the conditional gradient algorithm under strong convexity, with applications to online and stochastic optimization. SIAM Journal on Optimization, 26(3):1493–1528, 2016.
  • [11] Elad Hazan, Amit Agarwal, and Satyen Kale. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2-3):169–192, 2007.
  • [12] Saghar Hosseini, Airlie Chapman, and Mehran Mesbahi. Online distributed convex optimization on dynamic networks. IEEE Transactions on Automatic Control, 61(11):3545–3550, 2016.
  • [13] Rodolphe Jenatton, Jim Huang, and Cédric Archambeau. Adaptive algorithms for online convex optimization with long-term constraints. In International Conference on Machine Learning, pages 402–411, 2016.
  • [14] A. Lesage-Landry and J. A. Taylor. Setpoint tracking with partially observed loads. IEEE Transactions on Power Systems, 33(5):5615–5627, 2018.
  • [15] Antoine Lesage-Landry, Iman Shames, and Joshua A Taylor. Predictive online convex optimization. Automatica, 113:108771, 2020.
  • [16] Antoine Lesage-Landry, Joshua A Taylor, and Iman Shames. Second-order online nonconvex optimization. IEEE Transactions on Automatic Control, 2020.
  • [17] Xiuxian Li, Lihua Xie, and Na Li. A survey of decentralized online learning. arXiv preprint arXiv:2205.00473, 2022.
  • [18] Yingying Li, Guannan Qu, and Na Li. Online optimization with predictions and switching costs: Fast algorithms and the fundamental limit. IEEE Transactions on Automatic Control, 66(10):4761–4768, 2021.
  • [19] Zhenhong Li, Zhengtao Ding, Junyong Sun, and Zhongkui Li. Distributed adaptive convex optimization on directed graphs via continuous-time algorithms. IEEE Transactions on Automatic Control, 63(5):1434–1441, 2018.
  • [20] Yankai Lin, Iman Shames, and Dragan Nešić. Asynchronous distributed optimization via dual decomposition and block coordinate ascent. In 58th IEEE Conference on Decision and Control, pages 6380–6385, 2019.
  • [21] Yankai Lin, Iman Shames, and Dragan Nešić. Asynchronous distributed optimization via dual decomposition and block coordinate subgradient methods. IEEE Transactions On Control Of Network Systems, 8(3):1348–1359, 2021.
  • [22] Behnam Mafakheri, Iman Shames, and Jonathan Manton. First order online optimisation using forward gradients under polyak-Lojasiewicz condition. arXiv preprint arXiv:2211.15825, 2022.
  • [23] Aryan Mokhtari, Shahin Shahrampour, Ali Jadbabaie, and Alejandro Ribeiro. Online optimization in dynamic environments: Improved regret rates for strongly convex problems. In 55th IEEE Conference on Decision and Control, pages 7195–7201, 2016.
  • [24] Yu Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2):341–362, 2012.
  • [25] Julie Nutini, Mark Schmidt, Issam Laradji, Michael Friedlander, and Hoyt Koepke. Coordinate descent converges faster with the gauss-southwell rule than random selection. In International Conference on Machine Learning, pages 1632–1641. PMLR, 2015.
  • [26] Yipeng Pang and Guoqiang Hu. Randomized gradient-free distributed online optimization via a dynamic regret analysis. IEEE Transactions on Automatic Control, 2023.
  • [27] Santiago Paternain and Alejandro Ribeiro. Online learning of feasible strategies in unknown environments. IEEE Transactions on Automatic Control, 62(6):2807–2822, 2017.
  • [28] Shahin Shahrampour and Ali Jadbabaie. Distributed online optimization in dynamic environments using mirror descent. IEEE Transactions on Automatic Control, 63(3):714–725, 2018.
  • [29] S. Shalev-Shwartz. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107–194, 2012.
  • [30] Iman Shames, Daniel Selvaratnam, and Jonathan H Manton. Online optimization using zeroth order oracles. IEEE Control Systems Letters, 4(1):31–36, 2019.
  • [31] S. J. Wright. Coordinate descent algorithms. Mathematical Programming, 151(1):3–34, 2015.
  • [32] Yongyang Xiong, Xiang Li, Keyou You, and Ligang Wu. Distributed online optimization in time-varying unbalanced networks without explicit subgradients. IEEE Transactions on Signal Processing, 70:4047–4060, 2022.
  • [33] Feng Yan, Shreyas Sundaram, S. V. N Vishwanathan, and Yuan Qi. Distributed autonomous online learning: Regrets and intrinsic privacy-preserving properties. IEEE Transactions on Knowledge and Data Engineering, 25(11):2483–2493, 2013.
  • [34] Peng Yi and Yiguang Hong. Stochastic sub-gradient algorithm for distributed optimization with random sleep scheme. Control Theory and Technology, 13:333–347, 2015.
  • [35] Xinlei Yi, Xiuxian Li, Lihua Xie, and Karl H Johansson. Distributed online convex optimization with time-varying coupled inequality constraints. IEEE Transactions on Signal Processing, 68:731–746, 2020.
  • [36] Xinlei Yi, Xiuxian Li, Tao Yang, Lihua Xie, Tianyou Chai, and Karl H Johansson. Distributed bandit online convex optimization with time-varying coupled inequality constraints. IEEE Transactions on Automatic Control, 66:4620–4635, 2021.
  • [37] Xinlei Yi, Xiuxian Li, Tao Yang, Lihua Xie, Tianyou Chai, and Karl H Johansson. Regret and cumulative constraint violation analysis for distributed online constrained convex optimization. IEEE Transactions on Automatic Control, 68:2875–2890, 2023.
  • [38] Hao Yu, Michael J Neely, and Xiaohan Wei. Online convex optimization with stochastic constraints. In Advances in Neural Information Processing Systems, pages 1428–1438, 2017.
  • [39] Deming Yuan, Alexandre Proutiere, and Guodong Shi. Distributed online optimization with long-term constraints. IEEE Transactions on Automatic Control, 67(3):1089–1104, 2022.
  • [40] Lijun Zhang, Tianbao Yang, Jinfeng Yi, Rong Jin, and Zhi-Hua Zhou. Improved dynamic regret for non-degenerate functions. Advances in Neural Information Processing Systems, 30, 2017.
  • [41] Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning, pages 928–936, 2003.