
Online Convex Optimization Using Coordinate Descent Algorithms

Yankai Lin [email protected]    Iman Shames [email protected]    Dragan Nesic [email protected] Department of Mechanical Engineering, Eindhoven University of Technology, the Netherlands CIICADA Lab, School of Engineering, The Australian National University, Acton, 0200, ACT, Australia Department of Electrical and Electronic Engineering, The University of Melbourne, Parkville, 3010, Victoria, Australia
Abstract

This paper considers the problem of online optimization where the objective function is time-varying. In particular, we extend coordinate descent type algorithms to the online case, where the objective function varies after a finite number of iterations of the algorithm. Instead of solving the problem exactly at each time step, we only apply a finite number of iterations at each time step. Commonly used notions of regret are used to measure the performance of the online algorithm. Moreover, coordinate descent algorithms with different updating rules are considered, including both deterministic and stochastic rules that are developed in the literature of classical offline optimization. A thorough regret analysis is given for each case. Finally, numerical simulations are provided to illustrate the theoretical results.

keywords:
Online convex optimization; Coordinate descent; Online learning; Regret minimization.
thanks: This paper was not presented at any conference. This work was supported by the Australian Research Council under the Discovery Projects DP170104099, DP210102454, and the Australian Government, via grant AUSMURIB000001 associated with ONR MURI grant N00014-19-1-2571. Corresponding author Y. Lin.


1 Introduction

Online learning [29], resource allocation [9], demand response in power systems [14], and localization of moving targets [2] are just a few examples where online convex optimization (OCO) has been applied. In the problem setup of OCO, the objective functions are time-varying and are not available to the decision maker a priori. At each time instant, after an update of the decision variable, new information about the latest cost function is made available to the decision maker. The objective of the decision maker is to minimize the objective function over time. One commonly used performance measure of an online optimization algorithm is the notion of regret, which measures the suboptimality of the algorithm compared to the outcome generated by the best decision at each time step.

In the seminal work of [41], the method of online gradient descent is proposed for OCO problems, where at each time step the decision maker performs one gradient descent step using the latest available information. A static regret upper bound that is sublinear in $T$ is proved, where $T$ is the length of the horizon. Under stronger assumptions on the cost functions, such as strong convexity, an improved logarithmic regret bound can be achieved [11, 23, 10]. If future information is available, it can be used to further improve the performance of the online optimization algorithm in terms of regret bounds. The work [15] introduces an additional predictive step following the algorithm developed in [41], provided certain conditions on the estimated gradient and descent direction are met. Similar algorithms have also been extended to cases where zeroth-order [6, 30, 36, 26, 32] and second-order [16] oracles are used instead of (sub)gradients. The works [6, 36] on bandit feedback consider the situation where there are time-varying inequality constraints. In such cases, the algorithms proposed in [41] are hard to implement because of the high computational demand of the projection operation. This motivates recent research on online optimization algorithms with time-varying constraints, including the primal-dual algorithm proposed in [13, 37] and a modified saddle-point method given in [9]. Other algorithms have been proposed to handle stochastic constraints [38] and to cover continuous-time applications [27]. In the case where only the values, rather than the exact form, of the cost function are revealed to the decision maker, bandit feedback based online algorithms [6, 36] can be used to solve the problem; other methods, such as forward gradient [22], have also been proposed recently to deal with this issue. The need for applications in large-scale systems has also led to extensive research on distributed OCO. Distributed online algorithms that achieve sublinear regret bounds for convex optimization problems with static constraints can be found in [28, 12, 33]. For instance, [28] proposes a distributed version of the dynamic mirror descent algorithm, which is a generalization of the classical gradient descent methods suitable for high-dimensional optimization problems. The work [19] proposes distributed online primal-dual algorithms for optimization problems with static coupled inequality constraints, while the work [35] studies distributed online convex optimization with time-varying inequality constraints in the discrete-time setting. For a more detailed account of recent advances in online optimization, we refer the reader to the survey paper [17].

To the best of our knowledge, coordinate descent [31], an important class of optimization algorithms, has not been sufficiently analyzed in the online optimization community. In coordinate descent algorithms, most components of the decision variable are fixed during one iteration while the cost function is minimized with respect to the remaining components. The resulting problem is lower-dimensional and often much easier to solve. Thus, coordinate descent algorithms have great potential in applications such as machine learning, where an iteration using full gradient information is computationally expensive. In [24], it is shown that for huge-scale problems, coordinate descent can be very efficient. Another situation where coordinate descent is useful is dual decomposition based methods for distributed optimization; see [21] and references therein. Specifically, the dual problem of multi-agent optimal consensus results in a sum of functions with very loose coupling between the dual variables. Calculation of one component of the gradient of the dual function only involves computations and communications of a pair of agents (or processors). Moreover, it can also be implemented in a parallel fashion, as shown in [3]. Therefore, considerable effort has been made recently to develop theoretical performance guarantees for various coordinate descent algorithms [31]. In this paper, we aim to extend this appealing class of algorithms to OCO problems by providing an in-depth regret analysis for different types of online coordinate descent algorithms.

The main contributions of the paper can be summarized as follows. First, we extend the coordinate descent algorithms considered in [31] to the online case and provide their regret analysis. To the best of our knowledge, this is the first attempt to explore the use of coordinate descent methods for OCO problems. Second, we provide an in-depth regret analysis of coordinate descent algorithms with different updating rules, such as cyclic and random rules. Specifically, we consider both random and deterministic online coordinate descent algorithms under assumptions commonly used in the literature. In particular, most of the existing literature on OCO is based on extensions of offline algorithms that monotonically reduce the distance from the decision variable to the set of solutions at each iteration; an example is the well-known online gradient descent [41, 11]. However, the offline deterministic coordinate descent algorithm, although it has provable convergence to the set of solutions, does not necessarily produce an iterate that is closer to the set of solutions at each step. We overcome this issue by using predictive-like updates at each time step, which are detailed in Section 5. Lastly, we show that the regret bounds achieved by our online coordinate descent algorithms are comparable to those achieved in the literature by centralized full-gradient based online algorithms.

We summarize the regret upper bounds proved in Theorems 1-7 and the corollaries of Section 5 for Algorithms 1-5 in Table 1 below.

Table 1: Regret bounds proved in the paper, where $C_T=\sum_{t=1}^{T}|x_t^*-x_{t-1}^*|$ and $C_{T,2}=\sum_{t=1}^{T}|x_t^*-x_{t-1}^*|^2$; C stands for the convex case and SC for the strongly convex case.

|                 | $R_T^s$, C         | $R_T^s$, SC      | $R_T^d$, C            | $R_T^d$, SC                     |
|-----------------|--------------------|------------------|-----------------------|---------------------------------|
| Alg. 1          | $O(\sqrt{T})$      | $O(\log T)$      | $O(\sqrt{C_T T})$     | $O(C_T)$                        |
| Alg. 2, 4       | $O(\sqrt{T})$      | $O(\log T)$      | $O(\sqrt{C_T T})$     | $O(C_{T,2})$                    |
| Alg. 3, 5       | $O(\sqrt{T})$      | $O(\log T)$      | $O(\sqrt{C_T T})$     | $O(C_{T,2})$                    |
| Gradient method | $O(\sqrt{T})$ [41] | $O(\log T)$ [11] | $O(\sqrt{C_T T})$ [7] | $O(C_T)$ [28], $O(C_{T,2})$ [8] |

The regret bounds summarized in Table 1 are consistent with the regret bounds of full-gradient based online optimization algorithms proved in the existing literature [29, 11, 7, 23] under similar settings. Our dynamic regret bounds for strongly convex functions proved in Theorems 6 and 7 might require multiple updates at each time $t$. This setup is also adopted in some existing works, including [40, 8], to achieve less conservative regret bounds. It should also be noted that, although the online algorithms with different update rules achieve regret bounds of the same order, the exact coefficients of the regret bounds may still differ.

The rest of the paper is organized as follows. The problem formulation is presented in Section 2. The online coordinate descent algorithms considered in this paper are given in Section 3. Regret bounds for random online coordinate descent algorithms are given in Section 4, followed by regret bounds for deterministic online coordinate descent algorithms in Section 5. Numerical simulations are given in Section 6. Finally, the results presented in this paper are summarized in Section 7.

Notation: Let $\mathbb{R}$ be the set of real numbers and $\mathbb{R}^n$ be the $n$-dimensional Euclidean space, and let $\mathbb{R}_{\geq 0}$ (resp. $\mathbb{R}_{>0}$) be the set of non-negative (resp. positive) real numbers. For $x,y\in\mathbb{R}^n$, $\langle x,y\rangle$ denotes the inner product in $\mathbb{R}^n$. The set of non-negative (resp. positive) integers is denoted by $\mathbb{Z}_{\geq 0}$ (resp. $\mathbb{Z}_{>0}$). Furthermore, $|\cdot|$ denotes the Euclidean norm of a vector $x\in\mathbb{R}^n$. The matrix $I_n$ is used to denote the $n$-dimensional identity matrix, and $n$ is omitted when the dimension is clear. For a given vector $x$, $x_{(i)}$ denotes the $i$-th component of $x$, $x_i$ denotes the value of $x$ at time $i$, and $x_{(i),j}$ denotes the value of the $i$-th component of $x$ at time $j$. All random variables considered in this paper are defined on a probability space $(\Omega,\mathcal{F},\mathbb{P})$, where $\Omega$ is the sample space, the $\sigma$-field $\mathcal{F}$ is the set of events, and $\mathbb{P}$ is the probability measure defined on $(\Omega,\mathcal{F})$.

2 Problem formulation

We consider the following optimization problem

$$\min_{x\in\Theta}\ f_t(x), \qquad (1)$$

where $f_t:\mathbb{R}^n\rightarrow\mathbb{R}$ is the convex cost function at time $t$, $x\in\mathbb{R}^n$ is the decision variable, and $\Theta\subseteq\mathbb{R}^n$ is the non-empty closed convex constraint set. For simplicity, we assume $f_t$ is continuously differentiable for any $t\in\mathbb{Z}_{\geq 0}$. Moreover, let the decision variable $x_t$ at any given time $t$ be partitioned as $x_t=[x_{(1),t}^T,x_{(2),t}^T,\ldots,x_{(P),t}^T]$, $x_{(p),t}\in\mathbb{R}^{n_p}$, $\sum_{p=1}^{P}n_p=n$. The $P$ components of the vector are assigned to $P$ individual processors. For any integer $p\in\{1,\ldots,P\}$, processor $p$ is responsible for updating its own "local" decision variable $x_{(p),t}$ [3].
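As a concrete illustration of this partition, the following minimal sketch (in Python, with hypothetical block sizes $n_p$) shows how a vector in $\mathbb{R}^n$ can be split into $P$ blocks so that each processor reads and writes only its own block.

```python
import numpy as np

# Hypothetical block sizes n_1, ..., n_P with n_1 + ... + n_P = n.
block_sizes = [2, 3, 1]                     # P = 3 blocks, n = 6
offsets = np.cumsum([0] + block_sizes)      # start index of each block

def get_block(x, p):
    """Return the p-th block x_(p) of x (0-indexed here)."""
    return x[offsets[p]:offsets[p + 1]]

def set_block(x, p, value):
    """Overwrite the p-th block of x in place."""
    x[offsets[p]:offsets[p + 1]] = value

x = np.zeros(sum(block_sizes))
set_block(x, 1, np.array([1.0, 2.0, 3.0]))  # processor p = 1 updates its block
print(get_block(x, 1))                      # [1. 2. 3.]
```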

Denote the minimizer of $f_t(x)$ at time $t$ by $x_t^*\in\Theta$. We use a notion of regret to measure the processors' ability to track $x_t^*$. Regret refers to a quantity that measures the overall difference between the cost incurred by an algorithm and the cost at the best possible point from an offline view up to a horizon $T$. Two notions of regret commonly considered in the literature are static regret and dynamic regret. The static regret is defined as:

$$R_T^s := \sum_{t=1}^{T}f_t(x_t) - \min_{x\in\Theta}\sum_{t=1}^{T}f_t(x) = \sum_{t=1}^{T}f_t(x_t) - \sum_{t=1}^{T}f_t(x^*), \qquad (2)$$

where $x^*:=\arg\min_{x\in\Theta}\sum_{t=1}^{T}f_t(x)$ for a given $T$; the subscript $T$ is omitted for convenience. On the other hand, the dynamic regret is defined as

$$R_T^d := \sum_{t=1}^{T}f_t(x_t) - \sum_{t=1}^{T}f_t(x_t^*). \qquad (3)$$
Remark 1.

Static regret is a useful performance metric in applications such as static parameter estimation, where the variable of interest is static. However, if the variable of interest evolves over time (e.g., tracking moving targets), the notion of dynamic regret makes more sense than its static counterpart. It can be seen from (2) and (3) that regret captures the accumulation of errors due to the fact that the optimization problems are not solved exactly at each step. If the regret is sublinear in $T$, then the average accumulated error $R_T^s/T$ (or $R_T^d/T$) converges to 0 as $T\rightarrow\infty$. This further implies that $x_t$ converges to $x^*$ (or $x_t^*$). Although it is the more appropriate performance metric in most applications, dynamic regret does not in general admit a provable bound sublinear in $T$. To obtain a sublinear regret bound, additional regularity assumptions, such as bounded variation of the environment, are typically required; see [28] for example.

If stochastic algorithms are considered, similar notions of regret can be defined via expectations:

$$R_T^s := \sum_{t=1}^{T}\mathbb{E}[f_t(x_t)] - \min_{x\in\Theta}\sum_{t=1}^{T}f_t(x). \qquad (4)$$

$$R_T^d := \sum_{t=1}^{T}\mathbb{E}[f_t(x_t)] - \sum_{t=1}^{T}f_t(x_t^*). \qquad (5)$$

With some abuse of notation, $R_T^s$ and $R_T^d$ are used for both the random and deterministic cases. However, this will not lead to confusion, since it will be clear from the context whether an algorithm is stochastic or not.
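To make these definitions concrete, the following minimal sketch evaluates (2) and (3) numerically; `f_seq` is a hypothetical list of cost functions, and the minimizers are assumed to be available (in practice they would come from an offline solver).

```python
import numpy as np

def static_regret(f_seq, x_seq, x_star):
    """R_T^s as in (2): accumulated cost minus the cost at the single best point."""
    return sum(f(x) for f, x in zip(f_seq, x_seq)) - sum(f(x_star) for f in f_seq)

def dynamic_regret(f_seq, x_seq, x_star_seq):
    """R_T^d as in (3): compares against the per-step minimizers x_t^*."""
    return sum(f(x) - f(xs) for f, x, xs in zip(f_seq, x_seq, x_star_seq))

# Toy example: f_t(x) = |x - c_t|^2 with a drifting minimizer c_t.
c = [np.array([0.1 * t]) for t in range(10)]
f_seq = [lambda x, ct=ct: float(np.sum((x - ct) ** 2)) for ct in c]
x_seq = [np.zeros(1) for _ in range(10)]      # a (poor) constant decision
print(dynamic_regret(f_seq, x_seq, c))
```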

3 Online coordinate descent algorithms

We construct our online block coordinate descent algorithms following the setup in [31]. At time $t$, we select a component of $x_t$ to update. The updating component at time $t$ is denoted by $i_t$. If the $i$-th component is updated, then the updating equation for component $i$ is given by

$$x_{(i),t+1} = x_{(i),t} - \alpha_t[\nabla_{(i)}f_t(x_t)], \qquad (6)$$

where $\alpha_t$ is the stepsize at time $t$ and $\nabla_{(i)}f_t(x_t)$ is the $i$-th component of the gradient of $f_t$ evaluated at $x_t$. For any $p\neq i_t$, we have

$$x_{(p),t+1} = x_{(p),t}. \qquad (7)$$

Then $x_{t+1}$ is projected onto $\Theta$. We define the matrix $U_t=\mathrm{diag}(U_{(1),t},U_{(2),t},\ldots,U_{(P),t})$. For any $t\in\mathbb{Z}_{\geq 0}$ and $1\leq p\leq P$, $U_{(p),t}\in\{I_{n_p},0_{n_p}\}$, where $I_{n_p}$ and $0_{n_p}$ denote identity and zero matrices of dimension $n_p$, respectively. Then the update of the coordinate descent algorithm at time $t$ can be written as

$$x_{t+1} = \Pi_\Theta(x_t - \alpha_t U_t[\nabla f_t(x_t)]), \qquad (8)$$

where $\Pi_\Theta(\cdot)$ denotes the projection onto $\Theta$, which is well defined by the closedness and convexity of $\Theta$. The following non-expansive property of the projection operator will be used extensively throughout the paper.

Lemma 1.

[4, Proposition 3.2.1] Let $\Theta$ be a non-empty closed convex set. We have $|\Pi_\Theta(x)-\Pi_\Theta(y)|\leq|x-y|$ for any $x,y\in\mathbb{R}^n$.
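For simple choices of $\Theta$ the projection has a closed form; the sketch below shows two standard examples (a box and a Euclidean ball, chosen here purely for illustration) and checks the non-expansiveness of Lemma 1 numerically.

```python
import numpy as np

def project_box(x, lo, hi):
    """Projection onto the box {x : lo <= x <= hi}: componentwise clipping."""
    return np.clip(x, lo, hi)

def project_ball(x, radius):
    """Projection onto the Euclidean ball {x : |x| <= radius}."""
    norm = np.linalg.norm(x)
    return x if norm <= radius else (radius / norm) * x

# Non-expansiveness check (Lemma 1): |Pi(x) - Pi(y)| <= |x - y|.
rng = np.random.default_rng(0)
x, y = rng.normal(size=5), rng.normal(size=5)
assert (np.linalg.norm(project_ball(x, 1.0) - project_ball(y, 1.0))
        <= np.linalg.norm(x - y) + 1e-12)
```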

In this paper, we consider the following three commonly used schemes for selecting the updating component $i_t$.

  • The random coordinate descent algorithm. In this case, $i_t$ is selected randomly with equal probability, independently of the selections made at previous iterations.

  • The cyclic coordinate descent algorithm. In this case, the updating component $i_t$ is selected in a pre-determined cyclic fashion: $i_{t+1}=(i_t\ \mathrm{mod}\ P)+1$.

  • The coordinate descent algorithm with the Gauss-Southwell rule. In this case, the updating component $i_t$ is selected as $i_t\in\arg\max_i|\nabla_{(i)}f_t(x_t)|$.

In this paper, we consider the case where only one coordinate is allowed to change per iteration. The results on random coordinate descent can be extended to cases where the selection probabilities are unequal and the components potentially overlap [20], such as the random sleep scheme [34]. Intuitively speaking, if for any $t$ the expectation of the update direction takes the form $\Gamma\nabla f_t(x)$ for a given positive definite diagonal matrix $\Gamma$, then the analysis in this work can be applied with mild modifications. Our analysis for the deterministic cases cannot be trivially extended to cover overlapping components. We aim to address this topic in future research.
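For reference, the three selection rules above can be written compactly as follows; the sketch assumes scalar blocks ($n_p=1$) and 0-indexed coordinates, with `grad` denoting $\nabla f_t(x_t)$.

```python
import numpy as np

def select_random(P, rng):
    """Random rule: each of the P coordinates is picked with probability 1/P."""
    return int(rng.integers(P))

def select_cyclic(i_prev, P):
    """Cyclic rule: i_{t+1} = (i_t mod P) + 1, written here 0-indexed."""
    return (i_prev + 1) % P

def select_gauss_southwell(grad):
    """Gauss-Southwell rule: coordinate with the largest gradient magnitude."""
    return int(np.argmax(np.abs(grad)))

grad = np.array([0.2, -1.5, 0.7])
print(select_gauss_southwell(grad))   # 1
print(select_cyclic(2, P=3))          # wraps back to 0
```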

Remark 2.

A substantial review of variants of coordinate descent algorithms can be found in [4, Section 6.5.1]. Cyclic selection of the coordinates is normally assumed to ensure convergence of the algorithm. On the other hand, irregular orders have been considered to accelerate convergence. In particular, it is shown in [31] that randomization leads to faster convergence in expectation; obviously, this is not guaranteed for each instance of the algorithm. The Gauss-Southwell method leads to faster convergence at the cost of extra computations and evaluations of gradients during the selection of coordinates, which can be an issue in large-scale problems [25].

4 Regret bounds for online coordinate descent algorithms with random coordinate selection rules

The online random coordinate descent algorithm considered in this section is given in Algorithm 1.

Algorithm 1 Online random coordinate descent algorithm
1: Initialization: $x_0\in\Theta$.
2: Coordinate Selection: At time $t$, $i_t=p$ with probability $\frac{1}{P}$, where $p\in\{1,2,\ldots,P\}$.
3: Update: For $i=i_t$:
$$x_{(i),t+1} \leftarrow x_{(i),t} - \alpha_t[\nabla_{(i)}f_t(x_t)].$$
For all $i\neq i_t$:
$$x_{(i),t+1} \leftarrow x_{(i),t}.$$
4: Projection: $x_{t+1} \leftarrow \Pi_\Theta(x_{t+1})$.
5: Set $t\leftarrow t+1$ and go to Step 2.
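A minimal sketch of Algorithm 1, assuming scalar blocks, a box constraint set, and a user-supplied gradient oracle `grad_f(t, x)` (all hypothetical choices for illustration), could look as follows.

```python
import numpy as np

def online_random_cd(grad_f, x0, stepsizes, lo, hi, seed=0):
    """Sketch of Algorithm 1 with scalar blocks and a box constraint [lo, hi]^n."""
    rng = np.random.default_rng(seed)
    x = x0.copy()
    iterates = [x.copy()]
    for t, alpha in enumerate(stepsizes):
        i = rng.integers(x.size)             # Step 2: coordinate i_t, prob. 1/P
        x[i] -= alpha * grad_f(t, x)[i]      # Step 3: update one coordinate
        x = np.clip(x, lo, hi)               # Step 4: projection onto the box
        iterates.append(x.copy())
    return iterates

# Toy usage: f_t(x) = |x - c_t|^2 with a slowly drifting minimizer c_t.
grad_f = lambda t, x: 2.0 * (x - 0.01 * t * np.ones_like(x))
xs = online_random_cd(grad_f, np.zeros(3), [0.1] * 200, lo=-5.0, hi=5.0)
```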

4.1 Static regret for convex functions

Before we state the main results, we first list the assumptions.

Assumption 1.

For any given $t\in\mathbb{Z}_{>0}$, $x\in\Theta$, and $x^*:=\arg\min_{x\in\Theta}\sum_{t=1}^{T}f_t(x)$, the following inequalities hold uniformly in $t$ and $T$ for problem (1):

(i) $|\nabla f_t(x)|\leq G$;

(ii) $|x_t-x^*|\leq R$,

where $x_t$ is the decision variable at time $t$.

Remark 3.

Item (ii) of Assumption 1 can be ensured if the constraint set $\Theta$ is bounded. Moreover, when $\Theta$ is bounded, item (i) of Assumption 1 holds in many cases, including linear regression and logistic regression [5]. By [29, Lemma 2.6], Assumption 1 (i) implies that $f_t$ is also Lipschitz continuous uniformly in $t$ over $\Theta$. These assumptions are quite standard in the literature on online optimization [6, 7, 5, 17, 37, 39, 15, 18, 28].

Before we state the first main result of the paper, we present the so-called doubling trick stepsize rule introduced in [29, Section 2.3.1].

Definition 1 (Doubling trick scheme).

A stepsize sequence $\{\alpha_t\}$ is said to be chosen by the doubling trick scheme if, for $q=0,1,\ldots,\lceil\log_2 T\rceil+1$, $\alpha_t=\frac{1}{\sqrt{2^q}}$ in each period of $2^q$ iterations $t=2^q,2^q+1,\ldots,2^{q+1}-1$.

Remark 4.

The doubling trick scheme of stepsize choices is particularly useful when stepsizes of type $\sqrt{1/T}$ are needed to achieve a desirable regret bound. Since in applications it is often unrealistic to know the horizon $T$ in advance, by using the doubling trick scheme, regret bounds of the same order can be achieved without explicit knowledge of $T$. The same trick will be used extensively throughout the paper to derive regret bounds. It should also be noted that the doubling trick scheme in general does not result in a regret bound of better order compared to other stepsize rules [41].
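A sketch of the schedule in Definition 1 is given below: the stepsize $\frac{1}{\sqrt{2^q}}$ is held constant over the $q$-th period of $2^q$ iterations, so no knowledge of $T$ is required.

```python
import math

def doubling_trick_stepsize(t):
    """Stepsize alpha_t = 1/sqrt(2^q) for t in {2^q, ..., 2^{q+1}-1}, t >= 1."""
    q = int(math.floor(math.log2(t)))
    return 1.0 / math.sqrt(2 ** q)

# Constant within each period; scaled down by sqrt(2) across periods:
print([round(doubling_trick_stepsize(t), 3) for t in range(1, 9)])
# [1.0, 0.707, 0.707, 0.5, 0.5, 0.5, 0.5, 0.354]
```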

The following result states that, under Assumption 1, if the stepsize at each iteration is chosen by the doubling trick scheme, there is an upper bound for the static regret defined in (4). Moreover, the upper bound is of order $O(\sqrt{T})$ for convex costs.

Theorem 1.

Suppose Assumption 1 holds and the stepsize is chosen according to Definition 1. Then the static regret (4) achieved by Algorithm 1 satisfies

$$R_T^s \leq (B_1+B_2)\sqrt{T}, \qquad (9)$$

where $B_1=\frac{PR^2}{2}$ and $B_2=\frac{\sqrt{2}G^2}{2(\sqrt{2}-1)}$.

Proof.

From (8) and Lemma 1, we have

$$|x_{t+1}-x|^2 \leq |x_t-\alpha_t U_t[\nabla f_t(x_t)]-x|^2 = |x_t-x|^2 - 2\alpha_t[U_t\nabla f_t(x_t)]^T(x_t-x) + \alpha_t^2|U_t\nabla f_t(x_t)|^2$$

for any $x\in\Theta$. Given $x_t$, denote the $\sigma$-field containing the past data of Algorithm 1 up to time $t$ by $\mathcal{F}_t$. Then, by convexity of $f_t$, we have

$$\begin{split}\mathbb{E}[|x_{t+1}-x|^2\,|\,\mathcal{F}_t] &\leq |x_t-x|^2 - \frac{2\alpha_t}{P}[\nabla f_t(x_t)]^T(x_t-x) + \frac{\alpha_t^2}{P}|\nabla f_t(x_t)|^2\\ &\leq |x_t-x|^2 - \frac{2\alpha_t}{P}(f_t(x_t)-f_t(x)) + \frac{\alpha_t^2}{P}|\nabla f_t(x_t)|^2.\end{split} \qquad (10)$$

Substituting $x^*:=\arg\min_{x\in\Theta}\sum_{t=1}^{T}f_t(x)$ into (10), taking total expectations using $\mathbb{E}[\mathbb{E}[x|\mathcal{F}_t]|\mathcal{F}_{t-1}]=\mathbb{E}[x|\mathcal{F}_{t-1}]$ for $\mathcal{F}_{t-1}\subset\mathcal{F}_t$, and rearranging (10) leads to

$$\mathbb{E}[f_t(x_t)-f_t(x^*)] \leq \frac{P}{2\alpha_t}\left(\mathbb{E}[|x_t-x^*|^2]-\mathbb{E}[|x_{t+1}-x^*|^2]\right) + \frac{\alpha_t}{2}|\nabla f_t(x_t)|^2. \qquad (11)$$

Note that

$$R_T^s \leq \sum_{t=1}^{T}\frac{P}{2\alpha_t}\left(\mathbb{E}[|x_t-x^*|^2]-\mathbb{E}[|x_{t+1}-x^*|^2]\right) + \sum_{t=1}^{T}\frac{\alpha_t}{2}|\nabla f_t(x_t)|^2. \qquad (12)$$

Due to Assumption 1, the gradient of $f_t$ is uniformly bounded by $G$ and $|x_t-x^*|$ is uniformly upper bounded by $R$. Consequently,

$$\begin{split}&\sum_{t=1}^{T}\frac{P}{2\alpha_t}\left(\mathbb{E}[|x_t-x^*|^2]-\mathbb{E}[|x_{t+1}-x^*|^2]\right)\\ &= \frac{P}{2\alpha_1}\mathbb{E}[|x_1-x^*|^2] - \frac{P}{2\alpha_T}\mathbb{E}[|x_{T+1}-x^*|^2] + \sum_{t=2}^{T}\left(\frac{P}{2\alpha_t}-\frac{P}{2\alpha_{t-1}}\right)\mathbb{E}[|x_t-x^*|^2]\\ &\leq \frac{PR^2}{2\alpha_1} + R^2\sum_{t=2}^{T}\left(\frac{P}{2\alpha_t}-\frac{P}{2\alpha_{t-1}}\right) = \frac{PR^2}{2\alpha_T} \leq \frac{PR^2}{2}\sqrt{T} := B_1\sqrt{T}.\end{split} \qquad (13)$$

In the last inequality of (13), the property $\alpha_T\geq 1/\sqrt{T}$ is used, which is a direct consequence of the definition of the stepsize rule. On the other hand, the remaining term in (12) can be upper bounded as follows:

$$\sum_{t=1}^{T}\frac{\alpha_t}{2}|\nabla f_t(x_t)|^2 \leq \sum_{t=1}^{T}\alpha_t\frac{G^2}{2}.$$

To derive the regret bound, we again use the properties of the doubling trick scheme. First, we set the horizon to be some known constant $T^*$ and choose the stepsize as the constant $\frac{1}{\sqrt{T^*}}$. Then it is obvious that

$$\sum_{t=1}^{T^*}\frac{\alpha_t}{2}|\nabla f_t(x_t)|^2 \leq \sum_{t=1}^{T^*}\alpha_t\frac{G^2}{2} \leq \frac{G^2}{2}\sqrt{T^*}. \qquad (14)$$

Since for $q=0,1,\ldots,\lceil\log_2 T\rceil+1$, $\alpha_t=\frac{1}{\sqrt{2^q}}$ in each period of $2^q$ iterations $t=2^q,2^q+1,\ldots,2^{q+1}-1$, each period corresponds to $T^*=2^q$. Then we can sum (14) over any given $T$ as

$$\begin{split}\sum_{t=1}^{T}\frac{\alpha_t}{2}|\nabla f_t(x_t)|^2 &\leq \sum_{q=0}^{\lceil\log_2 T\rceil}\sqrt{2^q}\,\frac{G^2}{2} = \frac{G^2}{2}\cdot\frac{1-\sqrt{2}^{\lceil\log_2 T\rceil+1}}{1-\sqrt{2}}\\ &\leq \frac{G^2}{2}\cdot\frac{1-\sqrt{2T}}{1-\sqrt{2}} \leq \frac{\sqrt{2}G^2}{2(\sqrt{2}-1)}\sqrt{T} := B_2\sqrt{T}.\end{split} \qquad (15)$$

Thus, the proof is complete.

Remark 4.1.

In Theorem 1, the differentiability of $f_t$ is in fact never used. Thus, the results apply to the case where subgradients are used when $f_t$ is not differentiable for some $t$, as long as item (i) of Assumption 1 is satisfied by the subgradients used in the iterations. Moreover, the regret bound established in Theorem 1 is of the same order as the one established in [29] for online gradient descent under the same assumptions.

4.2 Static regret for strongly convex functions

In this part, we show that if the cost functions are strongly convex, then an improved static regret bound can be achieved for the online random coordinate descent algorithm.

Assumption 2.

For any $t\in\mathbb{Z}_{\geq 0}$, the function $f_t$ is uniformly $\mu$-strongly convex, i.e., there exists $\mu>0$ such that the following inequality holds for any $t\in\mathbb{Z}_{\geq 0}$, $x,y\in\mathbb{R}^n$:

$$f_t(y) \geq f_t(x) + \nabla f_t(x)^T(y-x) + \frac{\mu}{2}|y-x|^2. \qquad (16)$$
Theorem 2.

Suppose Assumption 1 (i) and Assumption 2 hold, and the stepsize is chosen as $\alpha_t=\frac{P}{\mu t}$. Then the static regret (4) achieved by Algorithm 1 satisfies

$$R_T^s \leq \frac{PG^2}{2\mu}(1+\log T). \qquad (17)$$
Proof.

Similar to (10), given $x_t$, we have the following relationship for any $x\in\Theta$:

$$\begin{split}\mathbb{E}[|x_{t+1}-x|^2\,|\,\mathcal{F}_t] &\leq |x_t-x|^2 - \frac{2\alpha_t}{P}[\nabla f_t(x_t)]^T(x_t-x) + \frac{\alpha_t^2}{P}|\nabla f_t(x_t)|^2\\ &\leq \left(1-\frac{\alpha_t\mu}{P}\right)|x_t-x|^2 - \frac{2\alpha_t}{P}(f_t(x_t)-f_t(x)) + \frac{\alpha_t^2}{P}|\nabla f_t(x_t)|^2.\end{split} \qquad (18)$$

By substituting $x^*$ into (18) and rearranging terms, we obtain the following inequality:

$$\mathbb{E}[f_t(x_t)-f_t(x^*)] \leq \frac{P}{2\alpha_t}\left(\mathbb{E}[|x_t-x^*|^2]-\mathbb{E}[|x_{t+1}-x^*|^2]\right) - \frac{\mu}{2}\mathbb{E}[|x_t-x^*|^2] + \frac{\alpha_t}{2}|\nabla f_t(x_t)|^2.$$

Since the gradient of $f_t$ is uniformly bounded by $G$, the following inequality holds:

$$\begin{split}R_T^s \leq& \sum_{t=1}^{T}\left[\frac{P}{2\alpha_t}\left(\mathbb{E}[|x_t-x^*|^2]-\mathbb{E}[|x_{t+1}-x^*|^2]\right)\right]\\ &- \sum_{t=1}^{T}\frac{\mu}{2}\mathbb{E}[|x_t-x^*|^2] + \sum_{t=1}^{T}\frac{\alpha_t}{2}|\nabla f_t(x_t)|^2.\end{split} \qquad (19)$$

By choosing $\alpha_t=\frac{P}{\mu t}$, we have the following relationship:

$$\begin{split}R_T^s &\leq 0 + \sum_{t=2}^{T}\left(\frac{P}{2\alpha_t}-\frac{P}{2\alpha_{t-1}}-\frac{\mu}{2}\right)\mathbb{E}[|x_t-x^*|^2] - \frac{\mu T}{2}\mathbb{E}[|x_{T+1}-x^*|^2] + \sum_{t=1}^{T}\frac{\alpha_t G^2}{2}\\ &\leq 0 + \sum_{t=1}^{T}\frac{\alpha_t G^2}{2} \leq \frac{PG^2}{2\mu}(1+\log T),\end{split} \qquad (20)$$

where the second inequality follows by expanding the sum: with $\alpha_t=\frac{P}{\mu t}$, each coefficient $\frac{P}{2\alpha_t}-\frac{P}{2\alpha_{t-1}}-\frac{\mu}{2}$ vanishes. Thus, the proof is complete.

By making the extra assumption of strong convexity, the sublinear static regret bound of order $O(\sqrt{T})$ established in Theorem 1 is improved to order $O(\log T)$, which is consistent with the regret bound for online gradient descent algorithms established under the same assumptions in [11].

4.3 Dynamic regret for convex functions

Now, we provide an upper bound of the dynamic regret of the online coordinate descent algorithm. First, we define the following measure of variations of the problem (1) which is commonly used in the literature [23, 28]

$$C_T := \sum_{t=1}^{T}|x_t^*-x_{t-1}^*|, \qquad (21)$$

where $x_t^*$ is a solution to the optimization problem at time $t$ and $x_0^*$ can be an arbitrary constant. The term $C_T$ captures the accumulated variation of the optimal points at two consecutive time instances over time. If the cost functions are general convex functions and an upper bound of $C_T$ is known, the following result can be stated.

Theorem 3.

Suppose Assumption 1 holds. Furthermore, if the stepsize is chosen as $\alpha_t=\sqrt{\frac{C_T}{T}}$, the dynamic regret (5) achieved by Algorithm 1 satisfies

$$R_T^d \leq \left(\frac{5R^2}{2\sqrt{C_T}} + RP\sqrt{C_T}\right)\sqrt{T} + \frac{\sqrt{C_T}G^2}{2}\sqrt{T}.$$
Proof.

Substituting $x_t^*$ into (10) for any $t\geq 2$ yields

$$\mathbb{E}[f_t(x_t)-f_t(x_t^*)] \leq \frac{P}{2\alpha_t}\left(\mathbb{E}[|x_t-x_t^*|^2]-\mathbb{E}[|x_{t+1}-x_t^*|^2]\right) + \frac{\alpha_t}{2}|\nabla f_t(x_t)|^2.$$

Furthermore,

$$\begin{split}&\sum_{t=1}^{T}(|x_t-x_t^*|^2 - |x_{t+1}-x_t^*|^2)\\ &\leq |x_1-x_1^*|^2 + \sum_{t=2}^{T}(|x_t-x_t^*|^2 - |x_t-x_{t-1}^*|^2)\\ &= |x_1-x_1^*|^2 + \sum_{t=2}^{T}2x_t^T(x_{t-1}^*-x_t^*) + \sum_{t=2}^{T}(|x_t^*|^2-|x_{t-1}^*|^2)\\ &\leq |x_1-x_1^*|^2 + |x_T^*|^2 + 2\sum_{t=2}^{T}2|x_t|\,|x_{t-1}^*-x_t^*|\\ &\leq 5R^2 + 2RC_T.\end{split} \qquad (22)$$

Then,

$$R_T^d \leq \frac{P}{2\alpha_t}(5R^2+2RC_T) + \frac{\alpha_t}{2}G^2 T.$$

Choosing the stepsize to be $\sqrt{\frac{C_T}{T}}$ leads to

$$R_T^d \leq \left(\frac{5R^2}{2\sqrt{C_T}} + RP\sqrt{C_T}\right)\sqrt{T} + \frac{\sqrt{C_T}G^2}{2}\sqrt{T}.$$

Remark 4.2.

As discussed in Remark 4, the choice of stepsize $\sqrt{C_T/T}$ can be replaced by the doubling trick scheme (see Definition 1) to derive a regret bound of the same order. Moreover, the exact value of $C_T$ is not necessary to derive the regret bound given in Theorem 3. If $C_T$ is not available, a stepsize of the same order as $\sqrt{C_T/T}$ can be used at the cost of constant-factor errors. We refer readers to [7, Theorem 1] for a more detailed discussion on how to implement stepsizes of the form $\sqrt{C_T/T}$ and obtain an upper bound of $C_T$ in practice using limited information. The inclusion of $C_T$ in the dynamic regret bound is common in the existing literature [7, 28, 37]. As argued in [7], if $C_T$ is sublinear in $T$, then the overall dynamic regret bound in Theorem 3 is sublinear in $T$, which is desirable and consistent with the dynamic regret bounds established in [7] for full-gradient based online algorithms. To see this, if the variation of the minimizers decreases with order $1/t$, then $C_T=O(\log T)$ and $R_T^d\leq O(\sqrt{T\log T})$. Similarly, if the variation of the minimizers decreases with order $1/t^q$ with $0<q<1$, then $C_T=O(T^{1-q})$ and $R_T^d\leq O(T^{1-\frac{q}{2}})$. In the worst case, by Assumption 1, we have $C_T\leq RT$ and $R_T^d\leq O(\sqrt{R}T)$. Thus, $R_T^d/T\leq O(\sqrt{R})$, which means the corresponding online algorithm incurs a steady-state tracking error. Moreover, the error decreases with the constant bound on the variation of the minimizers.

4.4 Dynamic regret for strongly convex functions

Now, we consider the dynamic regret of the online random coordinate descent algorithm for strongly convex functions. As before, we consider $\mu$-strongly convex functions. In addition, we make the following assumption, commonly seen in the online optimization literature [18, 23].

Assumption 3.

For any $t\in\mathbb{Z}_{\geq 0}$, the gradient of $f_t$ is uniformly Lipschitz continuous, i.e., there exists $L>0$ such that the following inequality holds for any $t\in\mathbb{Z}_{\geq 0}$, $x,y\in\mathbb{R}^n$:

$$|\nabla f_t(x)-\nabla f_t(y)| \leq L|x-y|. \qquad (23)$$

It is shown in [4, Proposition 6.1.2] that, under the same conditions stated in Assumption 3, (23) is equivalent to

$$f_t(y) \leq f_t(x) + \nabla f_t(x)^T(y-x) + \frac{L}{2}|y-x|^2. \qquad (24)$$

The main result regarding the dynamic regret bound of Algorithm 1 for strongly convex functions with Lipschitz gradients is stated as follows.

Theorem 4.

Suppose $\nabla f_t(x_t^*)=0$ for any $t\in\mathbb{Z}_{\geq 0}$, and Assumption 1 (i) and Assumptions 2-3 hold. If the stepsize is chosen as $\alpha_t=\alpha\leq\frac{2}{\mu+L}$, the dynamic regret (5) achieved by Algorithm 1 satisfies

$$R_T^d \leq \frac{G}{e}(C_T + C_1),$$

where $e=1-\sqrt{1-\frac{2\alpha}{P}\frac{\mu L}{\mu+L}}$, $C_T=\sum_{t=1}^{T}|x_t^*-x_{t-1}^*|$, and $C_1=|x_1-x_1^*|-|x_1^*-x_0^*|$.

Proof.

From (16) and (23), the following inequality holds [4, Proposition 6.1.9] for any $t\in\mathbb{Z}_{\geq 0}$, $x,y\in\mathbb{R}^n$:

$$(\nabla f_t(x)-\nabla f_t(y))^T(x-y) \geq \frac{\mu L}{\mu+L}|x-y|^2 + \frac{1}{\mu+L}|\nabla f_t(x)-\nabla f_t(y)|^2. \qquad (25)$$

Then, using Lemma 1, we have $|x_{t+1}-x_t^*|^2 \leq |x_t-\alpha U_t\nabla f_t(x_t)-x_t^*|^2$. Next, we take the conditional expectation given the past history up to time $t$:

$$\begin{split}\mathbb{E}[|x_{t+1}-x_t^*|^2\,|\,\mathcal{F}_t] \leq& -\frac{2\alpha}{P}(\nabla f_t(x_t)-\nabla f_t(x_t^*))^T(x_t-x_t^*)\\ &+ |x_t-x_t^*|^2 + \alpha^2\,\mathbb{E}[|U_t\nabla f_t(x_t)|^2\,|\,\mathcal{F}_t].\end{split} \qquad (26)$$

The last term in (26) can be upper bounded using the following inequality

$$\mathbb{E}[|U_t\nabla f_t(x_t)|^2\,|\,\mathcal{F}_t] \leq \frac{1}{P}|\nabla f_t(x_t)-\nabla f_t(x_t^*)|^2. \qquad (27)$$

The inequality (27) comes from the fact that

$$\mathbb{E}[|U_t\nabla f_t(x_t)|^2\,|\,\mathcal{F}_t] = |\nabla f_t(x_t)|^2/P,$$

together with $\nabla f_t(x_t^*)=0$.

Combining (25), (26), and (27), we have for any $t\in\mathbb{Z}_{\geq 0}$

$$\begin{split}\mathbb{E}[|x_{t+1}-x_t^*|^2\,|\,\mathcal{F}_t] &\leq \left(1-\frac{2\alpha}{P}\frac{\mu L}{\mu+L}\right)|x_t-x_t^*|^2 + \left(\frac{\alpha^2}{P}-\frac{2\alpha}{P}\frac{1}{\mu+L}\right)|\nabla f_t(x_t)-\nabla f_t(x_t^*)|^2\\ &\leq \left(1-\frac{2\alpha}{P}\frac{\mu L}{\mu+L}\right)|x_t-x_t^*|^2,\end{split} \qquad (28)$$

where the last inequality comes from the constant stepsize choice $\alpha\leq 2/(\mu+L)$. By Jensen's inequality, we have

$$\mathbb{E}[|x_{t+1}-x_t^*|\,|\,\mathcal{F}_t] \leq \sqrt{1-\frac{2\alpha}{P}\frac{\mu L}{\mu+L}}\,|x_t-x_t^*|. \qquad (29)$$

Note that $\sqrt{1-\frac{2\alpha}{P}\frac{\mu L}{\mu+L}}<1$. Thus, there exists $0<e<1$ such that $\sqrt{1-\frac{2\alpha}{P}\frac{\mu L}{\mu+L}}=1-e$.

Assumption 1 (i) implies $|f_t(x)-f_t(y)|\leq G|x-y|$ for any $t\in\mathbb{Z}_{\geq 0}$ and any $x,y\in\Theta$. As a result, the stochastic dynamic regret is bounded as $R_T^d \leq G\,\mathbb{E}\left[\sum_{t=1}^{T}|x_t-x_t^*|\right]$. Moreover, we have

$$|x_{t+1}-x_{t+1}^*| \leq |x_{t+1}-x_t^*| + |x_t^*-x_{t+1}^*|. \qquad (30)$$

Now, taking total expectations, using (29), and summing up both sides of (30) from $t=1$ to $t=T-1$ yields

$$R_T^d/G \leq |x_1-x_1^*| + (1-e)R_T^d/G + \sum_{t=1}^{T}|x_t^*-x_{t+1}^*|. \qquad (31)$$

Rearranging (31) leads to

$$R_T^d \leq \frac{G}{e}(C_T+C_1).$$

Remark 4.3.

By assuming uniform strong convexity and uniform Lipschitz continuity of the gradients of $f_t$, Theorem 4 improves the order of the theoretical dynamic regret bound from $O(\sqrt{C_T T})$ in Theorem 3 to $O(C_T)$, which is consistent with the results shown in [23] for full-gradient based online algorithms. This means that if problem (1) does not change over time, i.e., $f_t=f$ for all $t$, then $C_T=0$ and the regret grows at a rate of $O(1)$, which is consistent with the convergence results for coordinate descent algorithms in the offline setting [31]. As item (ii) of Assumption 1 is not required for proving Theorem 4, the regret bound is applicable to unconstrained problems with bounded gradients.

5 Regret bounds for online coordinate descent algorithms with deterministic coordinate selection rules

We consider two deterministic online coordinate descent algorithms in this section; they are given in Algorithms 2 and 3, respectively.

Algorithm 2 Online cyclic coordinate descent algorithm
1: Initialization: $x_0$, $i_0$.
2: Select Coordinate: $i_t \leftarrow (i_{t-1}\ \mathrm{mod}\ P)+1$.
3: Update: For $i=i_t$:
$$x_{(i),t+1} \leftarrow x_{(i),t} - \alpha_t[\nabla_{(i)}f_t(x_t)].$$
For all $i\neq i_t$:
$$x_{(i),t+1} \leftarrow x_{(i),t}.$$
4: Projection: $x_{t+1} \leftarrow \Pi_\Theta(x_{t+1})$.
5: Set $t\leftarrow t+1$ and go to Step 2.

Algorithm 3 Online coordinate descent algorithm with Gauss-Southwell Rule
1: Initialization: $x_0\in\Theta$.
2: Select Coordinate: $i_t \leftarrow \arg\max_i|\nabla_{(i)}f_t(x_t)|$.
3: Update: For $i=i_t$:
$$x_{(i),t+1} \leftarrow x_{(i),t} - \alpha_t[\nabla_{(i)}f_t(x_t)].$$
For all $i\neq i_t$:
$$x_{(i),t+1} \leftarrow x_{(i),t}.$$
4: Projection: $x_{t+1} \leftarrow \Pi_\Theta(x_{t+1})$.
5: Set $t\leftarrow t+1$ and go to Step 2.

Note that both Algorithms 2 and 3 are deterministic and update one component of the decision variable per iteration. We can therefore relate the regrets achievable by Algorithms 2 and 3 to the regret achievable by the online projected gradient descent algorithm, which takes the form

$$x_{t+1} = \Pi_\Theta(x_t - \alpha_t\nabla f_t(x_t)). \qquad (32)$$

It can be easily seen that Algorithms 2 and 3 take the following form

$$x_{t+1} = \Pi_\Theta(x_t - \alpha_t(\nabla f_t(x_t) + \nu_t)), \qquad (33)$$

where $\nu_t := \overline{\nabla}_{(i_t)}f_t(x_t) - \nabla f_t(x_t)$, and $\overline{\nabla}_{(i_t)}f_t(x_t)\in\mathbb{R}^n$ denotes the vector whose $i_t$-th component is $\nabla_{(i_t)}f_t(x_t)$ while all other entries are 0. The vector $\nu_t$ captures the effect of the components of the gradient that are not updated. We use $\bar{R}_T^s$ and $\bar{R}_T^d$ to denote the static and dynamic regrets of the online projected gradient descent algorithm, respectively. Then, we have the following result.

Proposition 5.1.

Suppose Assumption 1 holds. Then the static regret $R_T^s$ and the dynamic regret $R_T^d$ of the iterations (33) satisfy the following two relationships:

1. $R_T^s \leq \bar{R}_T^s + G^2\sum_{t=1}^{T}\alpha_t$;

2. $R_T^d \leq \bar{R}_T^d + G^2\sum_{t=1}^{T}\alpha_t$.

Proof.

By the definitions of the regrets in (2) and (3), it is obvious that $R_T^s - \bar{R}_T^s = R_T^d - \bar{R}_T^d$. Thus, if one of the two items is proved, so is the other. Since Assumption 1 holds, from (2) and Lemma 1, we know that $R_T^s - \bar{R}_T^s \leq \sum_{t=1}^{T}G|\alpha_t\nu_t| \leq G^2\sum_{t=1}^{T}|\alpha_t| = G^2\sum_{t=1}^{T}\alpha_t$. Thus the proof is complete.

By using Proposition 5.1, we can establish regret bounds for Algorithms 2 and 3 using known regret bounds for online gradient descent algorithms. Moreover, if the regret bounds for online gradient descent are sublinear in $T$ and $\sum_{t=1}^{T}\alpha_t$ is also sublinear in $T$, then the established regret bounds for Algorithms 2 and 3 will be sublinear.

5.1 Static regret for convex functions

The following static regret bound of the online projected gradient descent algorithm is proved in [41, Theorem 1].

Lemma 5.2.

Suppose Assumption 1 holds. Furthermore, if the stepsize is chosen as $\alpha_t=\sqrt{\frac{1}{t}}$, then the static regret of the online projected gradient descent iterations (32) satisfies

$$\bar{R}_T^s \leq \frac{R^2\sqrt{T}}{2} + \left(\sqrt{T}-\frac{1}{2}\right)G^2.$$

The following result on static regret bounds of Algorithms 2 and 3 is a direct corollary of Proposition 5.1 and Lemma 5.2.

Corollary 5.3.

Suppose Assumption 1 holds. Furthermore, if the stepsize is chosen as $\alpha_t=\sqrt{\frac{1}{t}}$, then the static regrets of Algorithms 2 and 3 satisfy

$$R_T^s \leq \frac{R^2\sqrt{T}}{2} + \left(\sqrt{T}-\frac{1}{2}\right)G^2 + 2G^2\sqrt{T}.$$
Proof.

Since $\alpha_t=\sqrt{\frac{1}{t}}$, we have

$$\sum_{t=1}^{T}\alpha_t = 1 + \sum_{t=2}^{T}\alpha_t \leq 1 + \int_{1}^{T}\frac{1}{\sqrt{s}}\,ds = 2\sqrt{T}-1 \leq 2\sqrt{T}.$$

The conclusion follows as a result of Proposition 5.1 and Lemma 5.2.

5.2 Static regret for strongly convex functions

The static regret bound in Lemma 5.2 is improved in [11, Theorem 1] under the assumption of strong convexity.

Lemma 5.4.

Suppose Assumptions 1 and 2 hold. Furthermore, if the stepsize is chosen as $\alpha_t=\frac{1}{\mu t}$, then the static regret of the online projected gradient descent algorithm (32) satisfies

$$\bar{R}_T^s \leq \frac{G^2}{2\mu}(1+\log T).$$

The following result on static regret bounds of Algorithms 2 and 3 for strongly convex functions is a direct corollary of Proposition 5.1 and Lemma 5.4.

Corollary 5.5.

Suppose Assumptions 1 and 2 hold. Furthermore, if the stepsize is chosen as $\alpha_t=\frac{1}{\mu t}$, then the static regrets of Algorithms 2 and 3 satisfy

$$R_T^s \leq \frac{3G^2}{2\mu}(1+\log T).$$
Proof.

Since $\alpha_t=\frac{1}{\mu t}$, we have

$$\sum_{t=1}^{T}\alpha_t = \frac{1}{\mu} + \sum_{t=2}^{T}\alpha_t \leq \frac{1}{\mu} + \int_{1}^{T}\frac{1}{\mu s}\,ds = \frac{1+\log T}{\mu}.$$

The conclusion follows as a result of Proposition 5.1 and Lemma 5.4.

5.3 Dynamic regret for convex functions

We can adapt the proof of Theorem 3 to the deterministic case to derive a dynamic regret bound for convex functions, which is given in the following result.

Proposition 5.6.

Suppose Assumption 1 holds. Furthermore, if the stepsize is chosen as $\alpha_t=\sqrt{\frac{C_T}{T}}$, the dynamic regret achieved by the online gradient descent algorithm (32) satisfies

$$\bar{R}_T^d \leq \left(\frac{5R^2}{2\sqrt{C_T}} + R\sqrt{C_T}\right)\sqrt{T} + \frac{\sqrt{C_T}G^2}{2}\sqrt{T}.$$
Proof.

From (32) and Lemma 1, we have

$$\begin{split}|x_{t+1}-x|^2 &\leq |x_t-\alpha_t\nabla f_t(x_t)-x|^2\\ &= |x_t-x|^2 - 2\alpha_t\nabla f_t(x_t)^T(x_t-x) + \alpha_t^2|\nabla f_t(x_t)|^2\end{split} \qquad (34)$$

for any $x\in\Theta$. Substituting $x_t^*$ into (34) results in the following inequality for any $t\geq 2$:

$$f_t(x_t)-f_t(x_t^*) \leq \frac{1}{2\alpha_t}(|x_t-x_t^*|^2 - |x_{t+1}-x_t^*|^2) + \frac{\alpha_t}{2}|\nabla f_t(x_t)|^2.$$

Note that the inequality (22) still holds in this case. Thus, we have

$$\bar{R}_T^d \leq \frac{1}{2\alpha_t}(5R^2+2RC_T) + \frac{\alpha_t}{2}G^2 T.$$

If we set the stepsize to be $\sqrt{\frac{C_T}{T}}$, we have

$$\bar{R}_T^d \leq \left(\frac{5R^2}{2\sqrt{C_T}} + R\sqrt{C_T}\right)\sqrt{T} + \frac{\sqrt{C_T}G^2}{2}\sqrt{T}.$$

Using Propositions 5.1 and 5.6, we can state the following result on dynamic regret bounds of Algorithms 2 and 3.

Corollary 5.7.

Suppose Assumption 1 holds. Furthermore, if the stepsize is chosen as $\alpha_t=\sqrt{\frac{C_T}{T}}$, then the dynamic regrets of Algorithms 2 and 3 satisfy

$$R_T^d \leq \frac{R^2\sqrt{T}}{2} + \left(\sqrt{T}-\frac{1}{2}\right)G^2 + 2G^2\sqrt{C_T T}.$$
Proof.

Since $\alpha_t=\sqrt{\frac{C_T}{T}}$, we have $\sum_{t=1}^{T}\alpha_t=\sqrt{C_T T}$. The conclusion follows as a result of Propositions 5.1 and 5.6. As discussed in Remark 4.2, the stepsize choice $\alpha_t=\sqrt{\frac{C_T}{T}}$ can be made independent of $T$ by using the doubling trick scheme.

5.4 Dynamic regret for strongly convex functions

Note that when the cost functions are strongly convex and constant stepsizes are used, the result in Proposition 5.1 only gives a dynamic regret bound that is linear in $T$. Therefore, we aim to establish better dynamic regret bounds for Algorithms 2 and 3. In this subsection, we make the following assumption on problem (1).

Assumption 4.

The gradient of $f_t$ is block-wise Lipschitz continuous uniformly in $t$, i.e., for any $1\leq i\leq P$, there exists $L_i>0$ such that the following inequality holds for any $t\in\mathbb{Z}_{\geq 0}$, $x\in\mathbb{R}^n$, and $u\in\mathbb{R}^{n_i}$:

$$|\nabla_{(i)}f_t(x+H_i u)-\nabla_{(i)}f_t(x)| \leq L_i|u|, \qquad (35)$$

where $H_i\in\mathbb{R}^{n\times n_i}$ is such that $[H_1\ H_2\cdots H_P]=I_n$.

We denote the largest $L_i$ for $1\leq i\leq P$ by $L_{\mathrm{max}}:=\max\{L_1,\cdots,L_P\}$. By further assuming $\nabla f_t(x_t^*)=0$ for any $t\in\mathbb{Z}_{\geq 0}$, the dynamic regret bounds can be derived without Assumption 1. Therefore, they are applicable to unconstrained problems with unbounded gradients. Before analyzing the regrets, we define the following measure of the variation of problem (1), first introduced in [40]:

$$C_{T,2} := \sum_{t=1}^{T}|x_t^*-x_{t-1}^*|^2. \qquad (36)$$
Remark 5.8.

Compared to $C_T$, $C_{T,2}$ can be significantly smaller if the variation of the minimizers is small. Following the discussion in Remark 4.2, if the variation of the minimizers decreases with order $1/t^q$ with $1/2<q\leq 1$, then $C_{T,2}$ is finite while $C_T$ grows to $\infty$ at a rate sublinear in $T$. Similarly, if $0<q<1/2$, then $C_{T,2}=O(T^{1-2q})$ and $C_T=O(T^{1-q})$.

Theorem 5.

Suppose $\nabla f_t(x_t^*)=0$ for any $t\in\mathbb{Z}_{\geq 0}$ and Assumptions 2-4 hold. Furthermore, we assume the number of blocks $P$ satisfies $P<\frac{1}{B_3 L_{\mathrm{max}}}$ with $B_3:=\frac{1}{\mu}-\frac{1}{2L}>0$. If the stepsize is chosen such that $\alpha_t=\alpha\in\left(\frac{1-\sqrt{1-B_3PL_{\mathrm{max}}}}{L_{\mathrm{max}}},\frac{1+\sqrt{1-B_3PL_{\mathrm{max}}}}{L_{\mathrm{max}}}\right)$, then the dynamic regret (5) achieved by Algorithm 3 satisfies

$$R_T^d \leq \frac{L}{B_4}(C_{T,2}+C_2),$$

where $B_4=\frac{1}{2}-\sqrt{L\left(\frac{1}{\mu}-\frac{2\alpha-\alpha^2 L_{\mathrm{max}}}{2P}\right)}>0$ and $C_2=|x_1-x_1^*|^2-2|x_1^*-x_0^*|^2$.

Proof.

Let $\{y_t\}$ denote the sequence of real vectors such that $y_{t+1}=x_t-\alpha_t U_t[\nabla f_t(x_t)]$, where the value of $U_t$ follows the coordinate selection rule in Algorithm 3. The variable $y_t\in\mathbb{R}^n$ stores the value of the decision variable $x_t$ before projection, i.e., $x_t=\Pi_\Theta(y_t)$. By the block descent lemma [1, Lemma 3.2] and the fact that $L_{\mathrm{max}}\geq L_i$ for all $i$, we have $f_t(y_{t+1}) \leq f_t(x_t) + \frac{\alpha^2 L_{\mathrm{max}}}{2}|\nabla_{(i)}f_t(x_t)|^2 - \alpha|\nabla_{(i)}f_t(x_t)|^2$. That is,

$$f_t(x_t) - f_t(y_{t+1}) \geq \left(\alpha - \frac{\alpha^2 L_{\mathrm{max}}}{2}\right)|\nabla_{(i)}f_t(x_t)|^2. \qquad (37)$$

From (37) and $i_t\in\arg\max_i|\nabla_{(i)}f_t(x_t)|$:

$$\begin{split}f_t(x_t)-f_t(y_{t+1}) &\geq \left(\alpha-\frac{\alpha^2 L_{\mathrm{max}}}{2}\right)|\nabla_{(i)}f_t(x_t)|^2\\ &\geq \frac{1}{P}\left(\alpha-\frac{\alpha^2 L_{\mathrm{max}}}{2}\right)|\nabla f_t(x_t)|^2. \end{split} \qquad (38)$$

Since $\nabla f_t(x_t^*)=0$ for any $t\in\mathbb{Z}_{\geq 0}$, minimizing both sides of (16) with respect to $y$ yields $f_t(x_t)-f_t(x_t^*)\leq\frac{1}{2\mu}|\nabla f_t(x_t)|^2$ for any $x_t\in\Theta$ (known as the Polyak-Łojasiewicz condition). Then, by (38), we have

$$\begin{split}f_t(x_t)-f_t(x_t^*)-&(f_t(y_{t+1})-f_t(x_t^*)) \geq \bar{A}|\nabla f_t(x_t)|^2\\ &\geq 2\mu\bar{A}(f_t(x_t)-f_t(x_t^*)),\end{split}$$

where $\bar{A}=\frac{1}{P}\left(\alpha-\frac{\alpha^2 L_{\mathrm{max}}}{2}\right)$. Consequently,

$$f_t(y_{t+1})-f_t(x_t^*) \leq (1-2\mu\bar{A})(f_t(x_t)-f_t(x_t^*)). \qquad (39)$$

By the fact that $\nabla f_t(x_t^*)=0$ and Assumption 3, we have $f_t(x_t)-f_t(x_t^*)\leq\frac{L}{2}|x_t-x_t^*|^2$ and hence

$$\frac{\mu}{2}|x_t-x_t^*|^2 \leq f_t(x_t)-f_t(x_t^*) \leq \frac{L}{2}|x_t-x_t^*|^2. \qquad (40)$$

By (39) and (40), the following inequality holds as a result of Lemma 1

$$|x_{t+1}-x_t^*|^2 \leq |y_{t+1}-x_t^*|^2 \leq \frac{L}{\mu}(1-2\mu\bar{A})|x_t-x_t^*|^2. \qquad (41)$$

Next, we show that $\frac{L}{\mu}(1-2\mu\bar{A})<\frac{1}{2}$ under the stated assumptions. Since $\bar{A}=\frac{1}{P}\left(\alpha-\frac{\alpha^2 L_{\mathrm{max}}}{2}\right)$, it can be shown after some algebraic manipulations that $\frac{L}{\mu}(1-2\mu\bar{A})<\frac{1}{2}$ is equivalent to $L_{\mathrm{max}}\alpha^2-2\alpha+B_3P<0$. When $P<\frac{1}{B_3L_{\mathrm{max}}}$, we have $4-4B_3PL_{\mathrm{max}}>0$, and the set of solutions to the inequality $L_{\mathrm{max}}\alpha^2-2\alpha+B_3P<0$ with respect to $\alpha$ is non-empty. Moreover, the solution is given by $\frac{1-\sqrt{1-B_3PL_{\mathrm{max}}}}{L_{\mathrm{max}}}<\alpha<\frac{1+\sqrt{1-B_3PL_{\mathrm{max}}}}{L_{\mathrm{max}}}$. Therefore, the conditions listed in the statement of the theorem ensure that $\frac{L}{\mu}(1-2\mu\bar{A})<\frac{1}{2}$. Thus, (41) implies

$$|x_{t+1}-x_t^*|^2 \leq \frac{1}{2}|x_t-x_t^*|^2 - B_4|x_t-x_t^*|^2, \qquad (42)$$

with $B_4=\frac{1}{2}-\frac{L}{\mu}(1-2\mu\bar{A})>0$.

Note that

$$|x_{t+1}-x_{t+1}^*|^2 \leq 2|x_{t+1}-x_t^*|^2 + 2|x_t^*-x_{t+1}^*|^2. \qquad (43)$$

The deterministic dynamic regret (3) can be upper bounded as follows

$$\begin{split}R_T^d &= \sum_{t=1}^{T}f_t(x_t) - \sum_{t=1}^{T}f_t(x_t^*)\\ &\overset{(24)}{\leq} \sum_{t=1}^{T}\left[\nabla f_t(x_t^*)^T(x_t-x_t^*) + \frac{L}{2}|x_t-x_t^*|^2\right] = \frac{L}{2}\sum_{t=1}^{T}|x_t-x_t^*|^2.\end{split} \qquad (44)$$

From (42), we have the following relationship by summing up both sides of (43) from $t=1$ to $t=T-1$:

$$\begin{split}&\sum_{t=1}^{T}|x_t-x_t^*|^2 - |x_1-x_1^*|^2\\ &\leq (1-2B_4)\sum_{t=1}^{T}|x_t-x_t^*|^2 + 2C_{T,2} - 2|x_1^*-x_0^*|^2.\end{split} \qquad (45)$$

Since $B_4>0$, (45) leads to $R_T^d \leq \frac{L}{2}\sum_{t=1}^{T}|x_t-x_t^*|^2 \leq \frac{L}{B_4}(C_{T,2}+C_2)$.

Remark 5.9.

Note that in Theorem 5, we impose an upper bound on the number of components $P$, which increases to $\infty$ as $B_3$ decreases to 0. This is mainly a result of the conservatism introduced in the derivation of (41). However, we manage to improve the regret bound of Theorem 4 from $O(C_T)$ to $O(C_{T,2})$, following the discussion in Remark 5.8. The bound is also valid for unconstrained OCO problems that are not covered by Theorems 1-4.

The following two theorems give dynamic regret bounds without assuming an upper bound on the number of components $P$. However, at each time $t$, potentially multiple offline steps are needed to guarantee desirable regret bounds.

The modified algorithms are given in Algorithms 4 and 5.

Algorithm 4 Online cyclic coordinate descent algorithm
1: Initialization: $x_0\in\Theta$, $i_0$, $\kappa$, and some integer $k\geq 1$.
2: Update: At time $t$, $\kappa\leftarrow 1$, $i_{\kappa-1}\leftarrow i_{t-1}$, $\hat{x}_{\kappa-1}\leftarrow x_t$.
3: Select Coordinate: $i_\kappa \leftarrow (i_{\kappa-1}\ \mathrm{mod}\ P)+1$.
4: Update: For $i=i_\kappa$, such that $\kappa\leq k$:
$$\hat{x}_{(i),\kappa} \leftarrow \hat{x}_{(i),\kappa-1} - \alpha_t[\nabla_{(i)}f_t(\hat{x}_{\kappa-1})].$$
For all $i\neq i_\kappa$, such that $\kappa\leq k$:
$$\hat{x}_{(i),\kappa} \leftarrow \hat{x}_{(i),\kappa-1}.$$
5: For $\kappa<k$, set $\kappa\leftarrow\kappa+1$ and go to Step 3.
6: For $\kappa\geq k$, set $t\leftarrow t+1$, $i_t=i_k$, $x_{t+1}\leftarrow\Pi_\Theta(\hat{x}_k)$, and go to Step 2.

Algorithm 5 Online coordinate descent algorithm with Gauss-Southwell Rule
1: Initialization: $x_0\in\Theta$, $\kappa$, and some integer $k\geq 1$.
2: Update: At time $t$, $\kappa\leftarrow 1$, $\hat{x}_{\kappa-1}\leftarrow x_t$.
3: Select Coordinate: $i_\kappa \leftarrow \arg\max_i|\nabla_{(i)}f_t(\hat{x}_{\kappa-1})|$.
4: Update: For $i=i_\kappa$, such that $\kappa\leq k$:
$$\hat{x}_{(i),\kappa} \leftarrow \hat{x}_{(i),\kappa-1} - \alpha_t[\nabla_{(i)}f_t(\hat{x}_{\kappa-1})].$$
For all $i\neq i_\kappa$, such that $\kappa\leq k$:
$$\hat{x}_{(i),\kappa} \leftarrow \hat{x}_{(i),\kappa-1}.$$
5: For $\kappa<k$, set $\kappa\leftarrow\kappa+1$ and go to Step 3.
6: For $\kappa\geq k$, set $t\leftarrow t+1$, $x_{t+1}\leftarrow\Pi_\Theta(\hat{x}_k)$, and go to Step 2.

It can be seen that in Algorithms 4 and 5, at each time $t$, $k$ updates are performed, where $k$ is an integer to be chosen. In Algorithms 2 and 3, however, only one step is performed at each time $t$.
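A minimal sketch of this inner loop, using the Gauss-Southwell rule of Algorithm 5 and assuming scalar blocks and a box constraint set, could look as follows: $k$ coordinate updates are applied to the current cost $f_t$ before the projection.

```python
import numpy as np

def online_gs_cd_multistep(grad_f, x0, alpha, T, k, lo, hi):
    """Sketch of Algorithm 5: k Gauss-Southwell coordinate steps per time t."""
    x = x0.copy()
    for t in range(T):
        x_hat = x.copy()                   # Step 2: initialize the inner iterate
        for _ in range(k):                 # Steps 3-5: k inner updates on f_t
            g = grad_f(t, x_hat)
            i = int(np.argmax(np.abs(g)))  # Gauss-Southwell coordinate selection
            x_hat[i] -= alpha * g[i]
        x = np.clip(x_hat, lo, hi)         # Step 6: projection, then t <- t + 1
    return x
```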

Theorem 6.

Suppose that ft(xt)=0\nabla f_{t}(x_{t}^{*})=0 for all t0t\in\mathbb{Z}_{\geq 0}, Assumptions 2-4 hold, and the stepsize is chosen such that αt=α<2Lmax\alpha_{t}=\alpha<\frac{2}{L_{\mathrm{max}}}. If kk is an integer such that B5:=Lμ(12μA)k+1PP(1+αLmax)2P2<1/2B_{5}:=\frac{L}{\mu}(1-2\mu A)^{\frac{k+1-P}{P}}(1+\alpha L_{\mathrm{max}})^{2P-2}<1/2, then the dynamic regret (5) achieved by Algorithm 4 satisfies

RTdL24B5(2CT,2+C2),R_{T}^{d}\leq\frac{L}{2-4B_{5}}(2C_{T,2}+C_{2}),

where A=(αα2Lmax2)/[2(1+α2L2P)]A=(\alpha-\frac{\alpha^{2}L_{\mathrm{max}}}{2})/[2(1+\alpha^{2}L^{2}P)] and C2=|x1x1|22|x1x0|2C_{2}=|x_{1}-x_{1}^{*}|^{2}-2|x_{1}^{*}-x_{0}^{*}|^{2}.

{pf}

By the block descent lemma [1, Lemma 3.2] and the fact that LmaxLiL_{\mathrm{max}}\geq L_{i} for all ii, we have ft(x^κ+1)ft(x^κ)+α2Lmax2|(i)ft(x^κ)|2α|(i)ft(x^κ)|2f_{t}(\hat{x}_{\kappa+1})\leq f_{t}(\hat{x}_{\kappa})+\frac{\alpha^{2}L_{\mathrm{max}}}{2}|\nabla_{(i)}f_{t}(\hat{x}_{\kappa})|^{2}-\alpha|\nabla_{(i)}f_{t}(\hat{x}_{\kappa})|^{2} at any tt, for any 0κk10\leq\kappa\leq k-1. That is

ft(x^κ)ft(x^κ+1)(αα2Lmax2)|(i)ft(x^κ)|2.f_{t}(\hat{x}_{\kappa})-f_{t}(\hat{x}_{\kappa+1})\geq(\alpha-\frac{\alpha^{2}L_{\mathrm{max}}}{2})|\nabla_{(i)}f_{t}(\hat{x}_{\kappa})|^{2}.

Summing over all PP blocks in a full round where all components have been updated exactly once leads to

ft(x^κ)ft(x^κ+P)(αα2Lmax2)i=1P|(i)ft(x^κ+i)|2.f_{t}(\hat{x}_{\kappa})-f_{t}(\hat{x}_{\kappa+P})\geq(\alpha-\frac{\alpha^{2}L_{\mathrm{max}}}{2})\sum_{i=1}^{P}|\nabla_{(i)}f_{t}(\hat{x}_{\kappa+i})|^{2}.

By the update equation and Lipschitz continuity of the gradient, we have |ft(x^κ)ft(x^κ+i)|2α2L2j=1i|(j)ft(x^κ+j1)|2|\nabla f_{t}(\hat{x}_{\kappa})-\nabla f_{t}(\hat{x}_{\kappa+i})|^{2}\leq\alpha^{2}L^{2}\sum_{j=1}^{i}|\nabla_{(j)}f_{t}(\hat{x}_{\kappa+j-1})|^{2}. Therefore

|(i)ft(x^κ)|2(|(i)ft(x^κ)(i)ft(x^κ+i)|+|(i)ft(x^κ+i)|)22|(i)ft(x^κ+i)|2+2α2L2j=1i|(j)ft(x^κ+j1)|2.\begin{split}&|\nabla_{(i)}f_{t}(\hat{x}_{\kappa})|^{2}\\ &\leq(|\nabla_{(i)}f_{t}(\hat{x}_{\kappa})-\nabla_{(i)}f_{t}(\hat{x}_{\kappa+i})|+|\nabla_{(i)}f_{t}(\hat{x}_{\kappa+i})|)^{2}\\ &\leq 2|\nabla_{(i)}f_{t}(\hat{x}_{\kappa+i})|^{2}+2\alpha^{2}L^{2}\sum_{j=1}^{i}|\nabla_{(j)}f_{t}(\hat{x}_{\kappa+j-1})|^{2}.\end{split}

Summing over PP blocks leads to

i=1P|(i)ft(x^κ)|22i=1P(1+(Pi)α2L2)|(i)ft(x^κ+i)|22(1+α2PL2)i=1P|(i)ft(x^κ+i)|2.\begin{split}\sum_{i=1}^{P}|\nabla_{(i)}f_{t}(\hat{x}_{\kappa})|^{2}&\leq 2\sum_{i=1}^{P}(1+(P-i)\alpha^{2}L^{2})|\nabla_{(i)}f_{t}(\hat{x}_{\kappa+i})|^{2}\\ &\leq 2(1+\alpha^{2}PL^{2})\sum_{i=1}^{P}|\nabla_{(i)}f_{t}(\hat{x}_{\kappa+i})|^{2}.\\ \end{split}

Therefore,

ft(x^κ)ft(x^κ+P)(αα2Lmax2)/[2(1+α2L2P)]|ft(x^κ)|2:=A|ft(x^κ)|2.\begin{split}&f_{t}(\hat{x}_{\kappa})-f_{t}(\hat{x}_{\kappa+P})\\ &\geq(\alpha-\frac{\alpha^{2}L_{\mathrm{max}}}{2})/[2(1+\alpha^{2}L^{2}P)]|\nabla f_{t}(\hat{x}_{\kappa})|^{2}\\ &:=A|\nabla f_{t}(\hat{x}_{\kappa})|^{2}.\\ \end{split} (46)

By Assumption 2, minimizing both sides of (16) with respect to yy, we have ft(x^κ)ft(xt)12μ|ft(x^κ)|2f_{t}(\hat{x}_{\kappa})-f_{t}(x_{t}^{*})\leq\frac{1}{2\mu}|\nabla f_{t}(\hat{x}_{\kappa})|^{2}. Then, by (46), we have

ft(x^κ)ft(xt)(ft(x^κ+P)ft(xt))A|ft(x^κ)|22μA(ft(x^κ)ft(xt)),\begin{split}&f_{t}(\hat{x}_{\kappa})-f_{t}(x_{t}^{*})-(f_{t}(\hat{x}_{\kappa+P})-f_{t}(x_{t}^{*}))\geq A|\nabla f_{t}(\hat{x}_{\kappa})|^{2}\\ &\geq 2\mu A(f_{t}(\hat{x}_{\kappa})-f_{t}(x_{t}^{*})),\end{split}

which further implies,

ft(x^κ+P)ft(xt)(12μA)(ft(x^κ)ft(xt)).f_{t}(\hat{x}_{\kappa+P})-f_{t}(x_{t}^{*})\leq(1-2\mu A)(f_{t}(\hat{x}_{\kappa})-f_{t}(x_{t}^{*})). (47)

By Assumption 3, we have |x^κ+1xt|(1+αLmax)|x^κxt||\hat{x}_{\kappa+1}-x_{t}^{*}|\leq(1+\alpha L_{\mathrm{max}})|\hat{x}_{\kappa}-x_{t}^{*}|. For any k0k\in\mathbb{Z}_{\geq 0}, there exist k10k_{1}\in\mathbb{Z}_{\geq 0} and k2{0,1,,P1}k_{2}\in\{0,1,\ldots,P-1\} such that k=k1P+k2k=k_{1}P+k_{2}. Then, by Assumptions 2 and 3, we have

μ2|xxt|2ft(x)ft(xt)L2|xxt|2,\frac{\mu}{2}|x-x_{t}^{*}|^{2}\leq f_{t}(x)-f_{t}(x_{t}^{*})\leq\frac{L}{2}|x-x_{t}^{*}|^{2}, (48)

for any xnx\in\mathbb{R}^{n}. Consequently, (47) and (48) yield

|x^kxt|2=|x^k1P+k2xt|22μ(12μA)k1L2|x^k2xt|2=Lμ(12μA)k1|x^k2xt|2Lμ(12μA)kk2P(1+αLmax)2P2|xtxt|2Lμ(12μA)k+1PP(1+αLmax)2P2|xtxt|2,\begin{split}&|\hat{x}_{k}-x_{t}^{*}|^{2}=|\hat{x}_{k_{1}P+k_{2}}-x_{t}^{*}|^{2}\\ &\leq\frac{2}{\mu}(1-2\mu A)^{k_{1}}\frac{L}{2}|\hat{x}_{k_{2}}-x_{t}^{*}|^{2}\\ &={\frac{L}{\mu}}(1-2\mu A)^{k_{1}}|\hat{x}_{k_{2}}-x_{t}^{*}|^{2}\\ &\leq{\frac{L}{\mu}}(1-2\mu A)^{\frac{k-k_{2}}{P}}(1+\alpha L_{\mathrm{max}})^{2P-2}|x_{t}-x_{t}^{*}|^{2}\\ &\leq{\frac{L}{\mu}}(1-2\mu A)^{\frac{k+1-P}{P}}(1+\alpha L_{\mathrm{max}})^{2P-2}|x_{t}-x_{t}^{*}|^{2},\end{split} (49)

at any t0t\in\mathbb{Z}_{\geq 0}. Since α<2/Lmax\alpha<{2}/{L_{\mathrm{max}}}, we have A>0A>0 and 0<12μA<10<1-2\mu A<1. Hence there exists kk such that B5=Lμ(12μA)k+1PP(1+αLmax)2P2<1/2B_{5}=\frac{L}{\mu}(1-2\mu A)^{\frac{k+1-P}{P}}(1+\alpha L_{\mathrm{max}})^{2P-2}<1/2. Then, by Lemma 1,

|xt+1xt|2|x^kxt|2B5|xtxt|2.|x_{t+1}-x_{t}^{*}|^{2}\leq|\hat{x}_{k}-x_{t}^{*}|^{2}\leq B_{5}|x_{t}-x_{t}^{*}|^{2}. (50)

By (44), we have RTdL2t=1T|xtxt|2R_{T}^{d}\leq\frac{L}{2}\sum_{t=1}^{T}|x_{t}-x_{t}^{*}|^{2}. Note that

|xt+1xt+1|22|xt+1xt|2+2|xtxt+1|2.|x_{t+1}-x_{t+1}^{*}|^{2}\leq 2|x_{t+1}-x_{t}^{*}|^{2}+2|x_{t}^{*}-x_{t+1}^{*}|^{2}. (51)

Summing up both sides of (51) from t=1t=1 to t=T1t=T-1 using (50), we have

(12B5)2RTd/L|x1x1|2+2t=1T|xtxt+1|2=2CT,2+C2.\begin{split}(1-2B_{5})2R_{T}^{d}/L&\leq|x_{1}-x_{1}^{*}|^{2}+2\sum_{t=1}^{T}|x_{t}^{*}-x_{t+1}^{*}|^{2}\\ &=2C_{T,2}+C_{2}.\end{split}

Since 2B5<12B_{5}<1, RTdL24B5(2CT,2+C2)R_{T}^{d}\leq\frac{L}{2-4B_{5}}(2C_{T,2}+C_{2}) follows.
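Since 0 < 1-2\mu A < 1, the constant B_5 decays geometrically in k, so the smallest admissible k can be computed directly. A minimal sketch, with problem constants chosen by us purely for illustration (not taken from the paper):

```python
L, mu, L_max, P = 2.0, 0.5, 2.0, 5         # illustrative constants
alpha = 1.0 / L_max                        # satisfies alpha < 2 / L_max
A = (alpha - alpha**2 * L_max / 2) / (2 * (1 + alpha**2 * L**2 * P))

def B5(k):
    # B_5 from Theorem 6 as a function of the number of inner steps k
    return (L / mu) * (1 - 2 * mu * A) ** ((k + 1 - P) / P) \
           * (1 + alpha * L_max) ** (2 * P - 2)

k = P
while B5(k) >= 0.5:                        # smallest k with B_5 < 1/2
    k += 1
print(k, B5(k))
```

For these constants the resulting k runs into the thousands, which reflects the conservatism of the bound (in particular the factor (1+\alpha L_max)^{2P-2}) rather than the practical behavior of Algorithm 4.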

Theorem 7.

Suppose that ft(xt)=0\nabla f_{t}(x_{t}^{*})=0 for all t0t\in\mathbb{Z}_{\geq 0}, Assumptions 2-4 hold, and the stepsize is chosen such that αt=α<2Lmax\alpha_{t}=\alpha<\frac{2}{L_{\mathrm{max}}}. If kk is an integer such that B6:=Lμ(12μA¯)k+1PP(1+αLmax)2P2<12B_{6}:={\frac{L}{\mu}}(1-2\mu\bar{A})^{\frac{k+1-P}{P}}(1+\alpha L_{\mathrm{max}})^{2P-2}<\frac{1}{2}, then the dynamic regret (5) achieved by Algorithm 5 satisfies

RTdL24B6(2CT,2+C2),R_{T}^{d}\leq\frac{L}{2-4B_{6}}(2C_{T,2}+C_{2}),

where A¯=1P(αα2Lmax2)\bar{A}=\frac{1}{P}(\alpha-\frac{\alpha^{2}L_{\mathrm{max}}}{2}) and C2=|x1x1|22|x1x0|2C_{2}=|x_{1}-x_{1}^{*}|^{2}-2|x_{1}^{*}-x_{0}^{*}|^{2}.

{pf}

Following steps similar to those in the proof of Theorem 6 and using (39) and (48), we obtain

|xt+1xt|2Lμ(12μA¯)k+1PP(1+αLmax)2P2|xtxt|2,\begin{split}&|x_{t+1}-x_{t}^{*}|^{2}\\ &\leq{\frac{L}{\mu}}(1-2\mu\bar{A})^{\frac{k+1-P}{P}}(1+\alpha L_{\mathrm{max}})^{2P-2}|x_{t}-x_{t}^{*}|^{2},\end{split} (52)

where A¯=1P(αα2Lmax2)\bar{A}=\frac{1}{P}(\alpha-\frac{\alpha^{2}L_{\mathrm{max}}}{2}) as in the proof of Theorem 5. Since A¯>0\bar{A}>0, we know there exists kk such that B6=Lμ(12μA¯)k+1PP(1+αLmax)2P2<12B_{6}={\frac{L}{\mu}}(1-2\mu\bar{A})^{\frac{k+1-P}{P}}(1+\alpha L_{\mathrm{max}})^{2P-2}<\frac{1}{2}. Hence, we have

|xt+1xt|2B6|xtxt|2.|x_{t+1}-x_{t}^{*}|^{2}\leq B_{6}|x_{t}-x_{t}^{*}|^{2}. (53)

Summing both sides of (51) from t=1t=1 to t=T1t=T-1 and using (53), we obtain

(12B6)2RTd/L|x1x1|2+2t=1T|xtxt+1|2=2CT,2+C2.\begin{split}(1-2B_{6})2R_{T}^{d}/L&\leq|x_{1}-x_{1}^{*}|^{2}+2\sum_{t=1}^{T}|x_{t}^{*}-x_{t+1}^{*}|^{2}\\ &=2C_{T,2}+C_{2}.\end{split}

Since 2B6<12B_{6}<1, we have RTdL24B6(2CT,2+C2)R_{T}^{d}\leq\frac{L}{2-4B_{6}}(2C_{T,2}+C_{2}).

6 Numerical simulations

First, we study the following unconstrained quadratic problem

minx2012xTQtxbTx,\underset{x\in\mathbb{R}^{20}}{\text{min}}\ \frac{1}{2}x^{T}Q_{t}x-b^{T}x, (54)

where b20b\in\mathbb{R}^{20} is a randomly generated constant vector and QtQ_{t} is a time-varying matrix that is positive definite for all t1t\geq 1. Moreover, all elements of QtQ_{t} are in the closed interval [1,2][1,2]. As a result, (54) satisfies Assumptions 2-4. Next, we discuss Assumption 1 (i) made in Theorem 4. The constant GG from Assumption 1 (i) is only used to show RTdG𝔼[t=1T|xtxt|]R_{T}^{d}\leq G\mathbb{E}[\sum_{t=1}^{T}|x_{t}-x_{t}^{*}|] such that (31) holds. All other arguments made in the proof of Theorem 4 remain true even if G=G=\infty. Thus, if the stepsize is small enough, every iteration of Algorithm 1 moves xtx_{t} closer to xt=Qt1bx_{t}^{*}=Q_{t}^{-1}b in expectation. Since in our setup Qt1bQ_{t}^{-1}b is uniformly bounded, the expectation of xtx_{t} remains bounded as well. This means that xtx_{t} and the gradient QtxtbQ_{t}x_{t}-b remain bounded almost surely. We choose P=20P=20, and for each 1i201\leq i\leq 20, x(i)x_{(i)} is a scalar. The horizon length is T=5000T=5000, and the constant stepsize is chosen as αt=α=0.001\alpha_{t}=\alpha=0.001. Since the static regret is always upper bounded by the dynamic regret, we only show the plots of the dynamic regrets and their time-averages RTd/TR_{T}^{d}/T.
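A minimal sketch of this experiment with the cyclic rule (Algorithm 2) is given below. It is our own reconstruction: the exact mechanism generating Q_t is not specified beyond the entry range and positive definiteness, so the symmetric draw with a diagonal shift is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
n, T, alpha = 20, 5000, 0.001
b = rng.uniform(1, 2, n)                   # random constant vector b

def make_Q():
    # symmetric matrix with entries drawn from [1, 2]; the diagonal shift
    # enforcing positive definiteness is our simplification
    M = rng.uniform(1, 2, (n, n))
    return (M + M.T) / 2 + n * np.eye(n)

x = np.zeros(n)
regret = 0.0
for t in range(1, T + 1):
    Q = make_Q()
    f = lambda z: 0.5 * z @ Q @ z - b @ z
    x_star = np.linalg.solve(Q, b)         # per-step minimizer x_t^*
    regret += f(x) - f(x_star)             # dynamic regret increment at x_t
    i = (t - 1) % n                        # cyclic coordinate selection
    x[i] -= alpha * (Q @ x - b)[i]         # one coordinate-descent step
print(regret, regret / T)                  # R_T^d and its time-average
```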

Figure 1: Plots of the dynamic regrets RTdR_{T}^{d} and their time-averages RTd/TR_{T}^{d}/T.

It can be seen from Fig. 1 that when the constant stepsize is chosen sufficiently small, the regrets in all cases grow sublinearly, and therefore their time-averages converge to 0 as TT becomes sufficiently large. This is consistent with our theoretical results from Theorems 4-7 on strongly convex functions.

Figure 2: Plots of the dynamic regrets RTdR_{T}^{d} and their time-averages RTd/TR_{T}^{d}/T with slow variation.

To see how the variation of the problem impacts the performance of the algorithms, we add an extra term of 100I20100I_{20} so that the matrix QtQ_{t} is diagonally dominant and therefore less sensitive to tt. We test the online algorithms in this case and the results are shown in Fig. 2. It can be seen that when problem (54) varies slowly with respect to time, the regret curves in Fig. 2 grow at a lower rate than those shown in Fig. 1.

As expected, the algorithm using the full gradient has the best performance in terms of minimizing the dynamic regret. Yet, it is worth mentioning that, among the three coordinate descent algorithms considered for this numerical example, the Gauss-Southwell rule gives the best performance, which is consistent with Remark 2. The extra information of the component-wise gradient norms enables a better selection of the coordinate to update. An in-depth theoretical analysis of this phenomenon in the online setting is left for future work.

Next, we consider the following problem of minimizing entropy functions online

minxΘi=15x(i)pi,tlnx(i)pi,t.\underset{x\in\Theta}{\text{min}}\sum_{i=1}^{5}\frac{x_{(i)}}{p_{i,t}}\ln\frac{x_{(i)}}{p_{i,t}}. (55)

The variable xΘx\in\Theta is decomposed into 5 scalar components with the compact constraint set Θ={x5| 0.001x(i)1000,i=1,2,3,4,5}\Theta=\{x\in\mathbb{R}^{5}\ |\ 0.001\leq x_{(i)}\leq 1000,i=1,2,3,4,5\}. Each pi,1p_{i,1} is randomly and independently selected from [1,5][1,5]. For t2t\geq 2, pi,t=pi,t1+1t1p_{i,t}=p_{i,t-1}+\frac{1}{t-1} for all ii. It can be verified that this selection ensures that |xt+1xt|=1t|x_{t+1}^{*}-x_{t}^{*}|=\frac{1}{t} and CT=O(logT)C_{T}=O(\log T). Note that the cost function in (55) is convex but not strongly convex, and hence only Theorems 1 and 3 apply. We again show the plots of the dynamic regrets of Algorithms 1-3 and the full-gradient-based algorithm in Fig. 3, with constant stepsize α=0.05\alpha=0.05 and T=5000T=5000. Moreover, the plot for random coordinate descent is averaged over 1000 runs.
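A sketch of this experiment under the same conventions is given below (ours; shown for the Gauss-Southwell rule only, with the projection onto the box Θ implemented as coordinate-wise clipping):

```python
import numpy as np

rng = np.random.default_rng(1)
n, T, alpha = 5, 5000, 0.05
lo, hi = 0.001, 1000.0                     # box constraint set Theta
p = rng.uniform(1, 5, n)                   # p_{i,1} drawn from [1, 5]
x = np.ones(n)

f = lambda z, q: np.sum(z / q * np.log(z / q))
regret = 0.0
for t in range(1, T + 1):
    grad = (np.log(x / p) + 1) / p         # gradient of the entropy cost
    x_star = np.clip(p / np.e, lo, hi)     # minimizer x_(i) = p_i / e lies in the box
    regret += f(x, p) - f(x_star, p)
    i = np.argmax(np.abs(grad))            # Gauss-Southwell coordinate selection
    x[i] -= alpha * grad[i]
    x = np.clip(x, lo, hi)                 # projection onto Theta
    p = p + 1.0 / t                        # p_{i,t+1} = p_{i,t} + 1/t
print(regret, regret / T)
```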

It can be seen from Figs. 1-3 that, for the quadratic problem, the dynamic regrets are much flatter when tt is large. On the other hand, the dynamic regrets in Fig. 3 still exhibit significant growth at t=T=5000t=T=5000, even though we select the time-varying parameters to ensure that CT=O(logT)C_{T}=O(\log T). These findings are consistent with the improved regret bounds established in Theorems 4-7 for uniformly strongly convex functions with uniformly Lipschitz gradients.

Figure 3: Plots of the dynamic regrets RTdR_{T}^{d} and their time-averages RTd/TR_{T}^{d}/T in the non-strongly convex case.

7 Summary

In this work, we have proposed online coordinate descent algorithms to deal with optimization problems that change over time. Three widely used update rules of coordinate descent are considered. Under different assumptions, we have provided upper bounds on the regrets of these online algorithms. In particular, we have verified that the established regret bounds of these coordinate descent algorithms are of orders similar to those of online gradient descent methods under the same settings. The regret bounds proved in this paper are summarized in Table 1. Lastly, numerical examples were given to illustrate our main results. The possibility of using coordinate blocks with overlapping components is an interesting future research direction, especially for the deterministic case. Another topic of interest is the use of inaccurate gradient information in online coordinate descent algorithms.

{ack}

The authors would like to thank Dr. Yuen-Man Pun for pointing out an error in the proof of Theorem 6 in the initial draft.

References

  • [1] Amir Beck and Luba Tetruashvili. On the convergence of block coordinate descent type methods. SIAM Journal on Optimization, 23(4):2037–2060, 2013.
  • [2] A. S. Bedi, P. Sarma, and K. Rajawat. Tracking moving agents via inexact online gradient descent algorithm. IEEE Journal of Selected Topics in Signal Processing, 12(1):202–217, 2018.
  • [3] D. P. Bertsekas and J. N. Tsitsiklis. Parallel and Distributed Computation: Numerical Methods. Prentice-Hall, 1989.
  • [4] Dimitri P Bertsekas. Convex Optimization Algorithms. Athena Scientific, 2015.
  • [5] Xuanyu Cao and Tamer Başar. Decentralized online convex optimization with feedback delays. IEEE Transactions on Automatic Control, 67(6):2889–2904, 2022.
  • [6] Xuanyu Cao and KJ Ray Liu. Online convex optimization with time-varying constraints and bandit feedback. IEEE Transactions on Automatic Control, 64(7):2665–2680, 2019.
  • [7] Xuanyu Cao, Junshan Zhang, and H Vincent Poor. Online stochastic optimization with time-varying distributions. IEEE Transactions on Automatic Control, 66(4):1840–1847, 2021.
  • [8] Ting-Jui Chang and Shahin Shahrampour. On online optimization: Dynamic regret analysis of strongly convex and smooth problems. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 6966–6973, 2021.
  • [9] Tianyi Chen, Qing Ling, and Georgios B Giannakis. An online convex optimization approach to proactive network resource allocation. IEEE Transactions on Signal Processing, 65(24):6350–6364, 2017.
  • [10] Dan Garber and Elad Hazan. A linearly convergent variant of the conditional gradient algorithm under strong convexity, with applications to online and stochastic optimization. SIAM Journal on Optimization, 26(3):1493–1528, 2016.
  • [11] Elad Hazan, Amit Agarwal, and Satyen Kale. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2-3):169–192, 2007.
  • [12] Saghar Hosseini, Airlie Chapman, and Mehran Mesbahi. Online distributed convex optimization on dynamic networks. IEEE Transactions on Automatic Control, 61(11):3545–3550, 2016.
  • [13] Rodolphe Jenatton, Jim Huang, and Cédric Archambeau. Adaptive algorithms for online convex optimization with long-term constraints. In International Conference on Machine Learning, pages 402–411, 2016.
  • [14] A. Lesage-Landry and J. A. Taylor. Setpoint tracking with partially observed loads. IEEE Transactions on Power Systems, 33(5):5615–5627, 2018.
  • [15] Antoine Lesage-Landry, Iman Shames, and Joshua A Taylor. Predictive online convex optimization. Automatica, 113:108771, 2020.
  • [16] Antoine Lesage-Landry, Joshua A Taylor, and Iman Shames. Second-order online nonconvex optimization. IEEE Transactions on Automatic Control, 2020.
  • [17] Xiuxian Li, Lihua Xie, and Na Li. A survey of decentralized online learning. arXiv preprint arXiv:2205.00473, 2022.
  • [18] Yingying Li, Guannan Qu, and Na Li. Online optimization with predictions and switching costs: Fast algorithms and the fundamental limit. IEEE Transactions on Automatic Control, 66(10):4761–4768, 2021.
  • [19] Zhenhong Li, Zhengtao Ding, Junyong Sun, and Zhongkui Li. Distributed adaptive convex optimization on directed graphs via continuous-time algorithms. IEEE Transactions on Automatic Control, 63(5):1434–1441, 2018.
  • [20] Yankai Lin, Iman Shames, and Dragan Nešić. Asynchronous distributed optimization via dual decomposition and block coordinate ascent. In 58th IEEE Conference on Decision and Control, pages 6380–6385, 2019.
  • [21] Yankai Lin, Iman Shames, and Dragan Nešić. Asynchronous distributed optimization via dual decomposition and block coordinate subgradient methods. IEEE Transactions On Control Of Network Systems, 8(3):1348–1359, 2021.
  • [22] Behnam Mafakheri, Iman Shames, and Jonathan Manton. First order online optimisation using forward gradients under polyak-Lojasiewicz condition. arXiv preprint arXiv:2211.15825, 2022.
  • [23] Aryan Mokhtari, Shahin Shahrampour, Ali Jadbabaie, and Alejandro Ribeiro. Online optimization in dynamic environments: Improved regret rates for strongly convex problems. In 55th IEEE Conference on Decision and Control, pages 7195–7201, 2016.
  • [24] Yu Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2):341–362, 2012.
  • [25] Julie Nutini, Mark Schmidt, Issam Laradji, Michael Friedlander, and Hoyt Koepke. Coordinate descent converges faster with the gauss-southwell rule than random selection. In International Conference on Machine Learning, pages 1632–1641. PMLR, 2015.
  • [26] Yipeng Pang and Guoqiang Hu. Randomized gradient-free distributed online optimization via a dynamic regret analysis. IEEE Transactions on Automatic Control, 2023.
  • [27] Santiago Paternain and Alejandro Ribeiro. Online learning of feasible strategies in unknown environments. IEEE Transactions on Automatic Control, 62(6):2807–2822, 2017.
  • [28] Shahin Shahrampour and Ali Jadbabaie. Distributed online optimization in dynamic environments using mirror descent. IEEE Transactions on Automatic Control, 63(3):714–725, 2018.
  • [29] S. Shalev-Shwartz. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107–194, 2012.
  • [30] Iman Shames, Daniel Selvaratnam, and Jonathan H Manton. Online optimization using zeroth order oracles. IEEE Control Systems Letters, 4(1):31–36, 2019.
  • [31] S. J. Wright. Coordinate descent algorithms. Mathematical Programming, 151(1):3–34, 2015.
  • [32] Yongyang Xiong, Xiang Li, Keyou You, and Ligang Wu. Distributed online optimization in time-varying unbalanced networks without explicit subgradients. IEEE Transactions on Signal Processing, 70:4047–4060, 2022.
  • [33] Feng Yan, Shreyas Sundaram, S. V. N Vishwanathan, and Yuan Qi. Distributed autonomous online learning: Regrets and intrinsic privacy-preserving properties. IEEE Transactions on Knowledge and Data Engineering, 25(11):2483–2493, 2013.
  • [34] Peng Yi and Yiguang Hong. Stochastic sub-gradient algorithm for distributed optimization with random sleep scheme. Control Theory and Technology, 13:333–347, 2015.
  • [35] Xinlei Yi, Xiuxian Li, Lihua Xie, and Karl H Johansson. Distributed online convex optimization with time-varying coupled inequality constraints. IEEE Transactions on Signal Processing, 68:731–746, 2020.
  • [36] Xinlei Yi, Xiuxian Li, Tao Yang, Lihua Xie, Tianyou Chai, and Karl H Johansson. Distributed bandit online convex optimization with time-varying coupled inequality constraints. IEEE Transactions on Automatic Control, 66:4620–4635, 2021.
  • [37] Xinlei Yi, Xiuxian Li, Tao Yang, Lihua Xie, Tianyou Chai, and Karl H Johansson. Regret and cumulative constraint violation analysis for distributed online constrained convex optimization. IEEE Transactions on Automatic Control, 68:2875–2890, 2023.
  • [38] Hao Yu, Michael J Neely, and Xiaohan Wei. Online convex optimization with stochastic constraints. In Advances in Neural Information Processing Systems, pages 1428–1438, 2017.
  • [39] Deming Yuan, Alexandre Proutiere, and Guodong Shi. Distributed online optimization with long-term constraints. IEEE Transactions on Automatic Control, 67(3):1089–1104, 2022.
  • [40] Lijun Zhang, Tianbao Yang, Jinfeng Yi, Rong Jin, and Zhi-Hua Zhou. Improved dynamic regret for non-degenerate functions. Advances in Neural Information Processing Systems, 30, 2017.
  • [41] Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning, pages 928–936, 2003.