
Dynamic Regret for Online Composite Optimization

Ruijie Hou, Xiuxian Li, Senior Member, IEEE, and Yang Shi, Fellow, IEEE. This research was supported by the National Natural Science Foundation of China under Grant 62003243, and the Shanghai Municipal Science and Technology Major Project, No. 2021SHZDZX0100. R. Hou and X. Li are with the Department of Control Science and Engineering, College of Electronics and Information Engineering, Tongji University, Shanghai 201800, China (e-mail: [email protected], [email protected]). X. Li is also with the Shanghai Research Institute for Intelligent Autonomous Systems, Shanghai 201210, China. Y. Shi is with the Department of Mechanical Engineering, University of Victoria, Victoria, B.C. V8W 3P6, Canada (e-mail: [email protected]).
Abstract

This paper investigates online composite optimization in dynamic environments, where each objective or loss function contains a time-varying nondifferentiable regularizer. To solve it, an online proximal gradient algorithm is studied for two distinct scenarios: convex and strongly convex objectives, both without the smoothness condition. In both scenarios, unlike most existing works, an extended version of the conventional path variation is employed to bound the considered performance metric, i.e., the dynamic regret. In the convex scenario, a bound $\mathcal{O}(\sqrt{T^{1-\beta}D_{\beta}(T)+T})$ is obtained, which is comparable to the best-known result, where $D_{\beta}(T)$ is the extended path variation with $\beta\in[0,1)$ and $T$ is the total number of rounds. In the strongly convex case, a bound $\mathcal{O}(\log T(1+T^{-\beta}D_{\beta}(T)))$ on the dynamic regret is established. Finally, numerical examples are presented to support the theoretical findings.

Index Terms:
Online optimization, composite optimization, convex optimization, dynamic regret, dynamic environments.

I Introduction

Composite optimization has been studied extensively in recent decades. Typically, the problem is composed of functions with different properties. This structure arises in statistics, machine learning, and engineering, e.g., empirical risk minimization, logistic regression, and compressed sensing[1]. Many algorithms have been proposed for solving composite optimization, such as the second-order primal-dual algorithm [2], ADA-FTRL and ADA-MD[3], COMID[4], ODCMD[5], SAGE[6], AC-SA[7], and RDA[8], most of which use the well-known proximal operator. The proximal operator, introduced in [9], is an efficient tool for minimizing functions with a convex but possibly nondifferentiable part.

This paper focuses on online composite optimization in dynamic environments, one of many important settings in online optimization. In particular, online convex optimization has received immense attention, with many applications in dynamic settings, such as online trajectory optimization[10], network resource allocation[11], and radio resource management[12]. The typical setup of online optimization is as follows. At each time $t$, a learner selects an action $x_{t}\in\mathcal{X}$, where $\mathcal{X}\subset\mathbb{R}^{n}$ is a feasible constraint set, and then an adversary selects a corresponding convex loss $F_{t}\in\mathcal{F}$, where $\mathcal{F}$ is a given set of loss functions available to the adversary. The objective of online optimization is to select actions $\{x_{t}\}_{t=1}^{T}$ such that $\sum_{t=1}^{T}F_{t}(x_{t})$ is minimized, so as to track the best decision in hindsight.

Static regret is a classical performance metric: $\bm{Reg}_{T}^{s}:=\sum_{t=1}^{T}F_{t}(x_{t})-\sum_{t=1}^{T}F_{t}(x^{*})$, where $x^{*}$ minimizes the sum of all loss functions over the constraint set[13]. An effective algorithm ensures that the average static regret goes to zero over an infinite time horizon[14]. For instance, the Online Gradient Method (OGM) is proved to have bounds $O(\sqrt{T})$ and $O(\log T)$ when $F_{t}$ is convex and strongly convex, respectively[15]. The Online Proximal Gradient (OPG) algorithm achieves the same bounds for composite functions whose first term is convex or strongly convex[16, 6]. Although both share the same static regret bounds, OPG can handle composite loss functions with a nondifferentiable part, such as objectives regularized by the $\ell_1$ norm. Besides, there are other related works solving online composite problems under static regret. The authors in [5] developed an online distributed composite mirror descent using the Bregman divergence and attained an average static regret bound of order $\mathcal{O}(\frac{1}{\sqrt{T}})$ for each agent, which they called the average regularized regret. Even for the multipoint bandit feedback model, they achieved the same result.

In a wide range of applications, $F_{t}(x)$ often has a different minimizer at each $t$. As such, the dynamic regret introduced in [17] is a well-known extension of static regret. Many previous works [18, 19] have pointed out that the dynamic regret is hard to make sublinear in $T$, since the function sequence may fluctuate arbitrarily. In order to bound the dynamic regret, one needs to specify some complexity measures. Letting $\{u_{t}\}_{t=1}^{T}$ be a sequence of arbitrary feasible comparators, several common complexity measures are shown as follows:

I-1 The function variation

V(T):=\sum_{t=2}^{T}\sup_{u\in\mathcal{X}}|F_{t}(u)-F_{t-1}(u)|.

$V(T)$ is used in problems where the cost functions change over time, as in non-stationary variants of sequential stochastic optimization. Minimax regret bounds $\mathcal{O}(T^{\frac{2}{3}}V(T)^{\frac{1}{3}})$ and $\mathcal{O}(T^{\frac{1}{2}}V(T)^{\frac{1}{2}})$ have been obtained for convex and strongly convex functions, respectively, in a non-stationary setting under noisy gradient access[20].

I-2 The gradient variation

F(T):=\sum_{t=2}^{T}\lVert\nabla F_{t}(u_{t})-M_{t}\rVert_{2}^{2},

where $\{M_{1},M_{2},\ldots,M_{T}\}$ is computed according to past observations or side information. For instance, one choice is $M_{t}=\nabla F_{t-1}(u_{t-1})$[21]. That work, on the online Frank-Wolfe algorithm, established an $\mathcal{O}(\sqrt{T}(1+V(T)+\sqrt{F(T)}))$ dynamic regret bound for convex and smooth loss functions in non-stationary settings.

I-3 The path variation

D(T):=\sum_{t=2}^{T}\lVert u_{t}-u_{t-1}\rVert,
S(T):=\sum_{t=2}^{T}\lVert u_{t}-u_{t-1}\rVert^{2}.

If the reference points move slowly, $S(T)$ may be smaller than $D(T)$; otherwise, $D(T)$ may be smaller than $S(T)$. In particular, $D(T)$ and $S(T)$ are also used to measure the variation of the optimal points $x_{t}^{*}:=\arg\min_{x\in\mathcal{X}}f_{t}(x)$[22, 23]. Works on the path variation significantly outnumber those on the first two measures. Earlier research on OGM showed that the dynamic regret has an upper bound $\mathcal{O}(\sqrt{T}(D(T)+1))$ for convex cost functions[17]. Later, the bound was improved to $\mathcal{O}(\sqrt{T(D(T)+1)})$, which matches a lower bound[24]. Under stricter conditions, a regret bound of order $\mathcal{O}(D(T)+1)$ was established for strongly convex and smooth loss functions[25]. Another study, on Online Multiple Gradient Descent for smooth and strongly convex functions, improved the regret bound to $\mathcal{O}(\min(D(T),S(T)))$[23].

I-4 An extended path variation

D_{\beta}(T):=\sum_{t=2}^{T}t^{\beta}\lVert u_{t}-u_{t-1}\rVert, (1)

where $0\leqslant\beta<1$[26]. Larger weights are allocated to later terms than to earlier ones. When $\beta=0$, $D_{0}(T)$ equals $D(T)$, and when $0<\beta<1$, $D_{0}(T)<D_{\beta}(T)<T^{\beta}D_{0}(T)$. Therefore, $D_{\beta}(T)$ is an extension of $D(T)$. In the literature, works on online optimization based on $D_{\beta}(T)$ are few. One related work using $D_{\beta}(T)$ studies the proximal online gradient algorithm[26], establishing a dynamic regret of order $\mathcal{O}(\sqrt{T^{1-\beta}D_{\beta}(T)}+\sqrt{T})$ for convex loss functions.
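For concreteness, the quantity in (1) is straightforward to evaluate for a given comparator sequence; the following is a minimal sketch, where the array layout and function name are illustrative assumptions.

```python
import numpy as np

def extended_path_variation(u, beta=0.0):
    """Extended path variation (1): D_beta(T) = sum_{t=2}^T t^beta ||u_t - u_{t-1}||.

    u    : array of shape (T, n) holding the comparators u_1, ..., u_T.
    beta : weight exponent in [0, 1); beta = 0 recovers D(T).
    """
    T = u.shape[0]
    t = np.arange(2, T + 1)                        # 1-indexed time, as in (1)
    steps = np.linalg.norm(u[1:] - u[:-1], axis=1)
    return float(np.sum(t ** beta * steps))
```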

TABLE I: Summary of results on dynamic regret ($F_{t}=f_{t}+r_{t}$)
Reference | Algorithm | Assumptions on loss functions | Dynamic regret
[24] | OGD | $F_{t}$: convex, without $r_{t}$ | $\mathcal{O}(\sqrt{T(D_{0}(T)+1)})$
[25] | OGD | $F_{t}$: strongly convex and smooth, without $r_{t}$ | $\mathcal{O}(D_{0}(T)+1)$
[26] | POG | $f_{t}$: convex, with fixed regularizer | $\mathcal{O}(\sqrt{T^{1-\beta}D_{\beta}(T)}+\sqrt{T})$
[27] | DP-OGD | $f_{t}$: strongly convex and smooth | $\mathcal{O}(\log T(1+D_{0}(T)))$
[28] | OIPG | $f_{t}$: strongly convex and smooth | $\mathcal{O}(1+D_{0}(T))$
This work | OPG | $f_{t}$: convex | $\mathcal{O}(\sqrt{T^{1-\beta}D_{\beta}(T)+T})$
This work | OPG | $f_{t}$: strongly convex | $\mathcal{O}(\log T(1+T^{-\beta}D_{\beta}(T)))$

In addition, there are plenty of works studying online composite optimization. In [28], an inexact online proximal gradient algorithm was proposed for loss functions composed of a strongly convex, smooth term and a convex term, where the updates rely on estimated gradients of the smooth convex function and an approximate proximal operator. Similarly, to allow the loss functions to be subsampled, inexact gradients were incorporated into an online proximal gradient algorithm for a finite sum of composite functions[29]. When both algorithms are applied to the exact-gradient case, the upper bounds are both of order $\mathcal{O}(1+D(T))$. Furthermore, considering inexactness from approximate gradients, an approximate proximal operator, and communication noise[30], an online inexact distributed proximal gradient method (DPGM) was shown to converge Q-linearly to the optimal solution up to some error. Decentralized algorithms have also been proposed for composite optimization, as the composite loss function of each agent may be held privately. Unconstrained problems were considered first: an online DPGM was proposed in [27] for private cost functions composed of a strongly convex, smooth function and a convex nonsmooth regularizer, obtaining a regret bound of order $\mathcal{O}(\log T(1+D(T)))$. Later works use mirror descent for constrained problems; a distributed online primal-dual mirror descent algorithm was developed to handle composite objective functions consisting of local convex functions and regularizers with time-varying coupled inequality constraints[31].

In this paper, we revisit the OPG algorithm for composite functions with a time-varying regularizer and analyze the dynamic regret in terms of $D_{\beta}(T)$. Most related works on OPG consider a time-invariant $r(x)$[26, 32, 33, 34, 35], while our problem extends the stationary regularizer to the time-varying scenario.

The contributions of this paper include: (a) solving online composite optimization problems with two time-varying nonsmooth parts in terms of $D_{\beta}(T)$; (b) obtaining a dynamic regret bound comparable to the best-known one for composite functions with two convex but nondifferentiable terms; (c) establishing a dynamic regret bound of OPG for functions composed of a strongly convex term and a convex but nondifferentiable term.

The paper is organized as follows. Section II describes the online composite optimization problem. In Section III, OPG algorithm is introduced, necessary assumptions are presented and main results are provided for two cases. Numerical examples of the OPG algorithm are given in Section IV. The conclusions are presented in Section V.

Notations: Let $\lVert\cdot\rVert_{1}$ denote the $\ell_1$ norm and $\lVert\cdot\rVert$ the $\ell_2$ norm by default. $\partial f(x)$ denotes the subdifferential of a convex function $f$ at $x$, and $\tilde{\nabla}f(x)$ denotes one subgradient in $\partial f(x)$. Let $[N]:=\{1,2,\ldots,N\}$, and let $\langle\cdot,\cdot\rangle$ be the Euclidean inner product on $\mathbb{R}^{n}$.

II Problem Formulation

The loss function at each time $t\geq 0$ is composed of two parts as follows:

F_{t}(x):=f_{t}(x)+r_{t}(x),\quad \mathrm{s.t.}~x\in\mathcal{X}, (2)

where $\mathcal{X}\subset\mathbb{R}^{n}$ is a feasible constraint set, and the functions $f_{t},r_{t}:\mathcal{X}\to\mathbb{R}$ are not necessarily differentiable. At each instant $t$, an action $x_{t}\in\mathcal{X}$ is selected by the learner, and then the loss functions $f_{t},r_{t}:\mathcal{X}\to\mathbb{R}$ are chosen by an adversary and revealed to the learner. A proper action sequence $\{x_{t}\}_{t=1}^{T}$ is sought such that the dynamic regret is sublinear in the horizon $T$, i.e., $\lim_{T\to\infty}\frac{Reg_{T}^{d}}{T}=0$, where

\bm{Reg}_{T}^{d}:=\sum_{t=1}^{T}F_{t}(x_{t})-\sum_{t=1}^{T}F_{t}(u_{t}), (3)

and $\{u_{t}\}_{t=1}^{T}$ is a sequence of arbitrary feasible comparators. It is worth noting that an important case of the dynamic regret is $u_{t}=x_{t}^{*}:=\arg\min_{x\in\mathcal{X}}F_{t}(x)$.
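As a small illustration, the metric (3) can be evaluated directly once the losses, actions, and comparators are available; the sketch below assumes the losses are given as Python callables.

```python
def dynamic_regret(losses, actions, comparators):
    """Dynamic regret (3): sum_t F_t(x_t) - sum_t F_t(u_t).

    losses      : list of callables, losses[t-1](x) evaluates F_t at x.
    actions     : learner's actions x_1, ..., x_T.
    comparators : feasible comparators u_1, ..., u_T.
    """
    return sum(F(x) - F(u) for F, x, u in zip(losses, actions, comparators))
```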

Examples of applications using the time-varying nonsmooth ftf_{t} and the time-varying rtr_{t} are briefly presented below.

II-1 Examples of time-varying nonsmooth $f_{t}(x)$

In machine learning, signal processing, and statistics, many nonsmooth error functions are widely used for regression and classification problems[36]. In an online streaming data-generating process, the probability density function may change with time[37]. Some examples of online nonsmooth convex loss functions are listed below, where $y_{t}$ denotes the label and $a_{t}$ the feature vector; a subgradient sketch for the first of them follows the list.

  • Hinge loss:

    f_{t}(x)=\max(0,1-y_{t}a_{t}^{\top}x).
  • Generalized hinge loss:

    f_{t}(x)=\begin{cases}1-\alpha y_{t}a_{t}^{\top}x, & \text{if}~y_{t}a_{t}^{\top}x\leqslant 0\\ 1-y_{t}a_{t}^{\top}x, & \text{if}~0<y_{t}a_{t}^{\top}x<1\\ 0, & \text{if}~y_{t}a_{t}^{\top}x\geqslant 1,\end{cases}

    where $\alpha>1$.

  • Absolute loss:

    f_{t}(x)=\max_{\alpha\in[-1,1]}\alpha(y_{t}-a_{t}^{\top}x).
  • $\epsilon$-insensitive loss:

    f_{t}(x)=\max(|y_{t}-a_{t}^{\top}x|-\epsilon,0),

    where $\epsilon$ is a small positive constant.
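Since OPG (introduced in Section III) queries only one subgradient of $f_{t}$ per round, each loss above is easy to use in practice. The following minimal sketch of a subgradient oracle for the hinge loss illustrates this; the function name is an assumption of this example.

```python
import numpy as np

def hinge_subgradient(x, a_t, y_t):
    """One subgradient of f_t(x) = max(0, 1 - y_t * a_t^T x)."""
    if y_t * (a_t @ x) < 1.0:
        return -y_t * a_t            # gradient of the active affine piece
    return np.zeros_like(x)          # flat (zero) piece: 0 is a valid subgradient
```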

II-2 Examples of time-varying nondifferentiable $r_{t}(x)$

  • The weighted $\ell_1$ norm function:
    Assume the optimal solution is sparse, with many zero or near-zero coefficients. To exploit this, let $r_{t}(x)$ be the weighted $\ell_1$ norm function[38, 39, 40]:

    r_{t}(x):=\rho\sum_{i=1}^{N}\omega_{i}^{t}|x_{i}|, (4)

    where $x_{i}$ is the $i$-th component of the $N$-dimensional $x$ and $\rho>0$ is a regularization parameter. The weight $\omega_{i}^{t}$ $(i=1,2,\ldots,N)$ is defined by

    \omega_{i}^{t}:=\begin{cases}\epsilon, & \text{if}~|x_{i}^{t-1}|>\tau\\ 1, & \text{otherwise},\end{cases}

    with a parameter $\tau>0$ and a small $\epsilon>0$, where $x_{i}^{t-1}$ is the $(t-1)$-th iterate of $x_{i}$ produced by the algorithm at hand. The parameter $\tau$ may rely on noise statistics, e.g., the variance. With the weights $\omega_{i}^{t}$, larger $|x_{i}|$ receive smaller coefficients and are thus less affected by the proximity operator. Therefore, the weighted $\ell_1$ norm exploits sparsity more efficiently; a proximal-step sketch is given after this list.

  • The indicator function of time-varying set constraints:
    In a network flow problem, to handle the constraints $x\in\mathcal{X}_{t}$ with $\mathcal{X}_{t}\subset\mathcal{X}$ being a convex set, the authors in [28] use indicator functions based on a given network $(\mathcal{N},\mathcal{E})$. The relevant variables are defined as follows: $z(i,s)$ denotes the rate produced at node $i$ for traffic $s$, and $x(ij,s)$ is the flow between nodes $i$ and $j$ for traffic $s$. For brevity, let $z$ and $x$ be the stacked traffic and link rates, respectively. The time-varying nonsmooth $r_{t}(z,x)$ is composed of two functions, $r(z,x)$ and $\delta_{\mathcal{X}_{t}}(z,x)$, where $r(z,x)=\frac{\nu}{2}(\lVert z\rVert^{2}+\lVert x\rVert^{2})$ is convex with some $\nu>0$ and $\delta_{\mathcal{X}_{t}}(z,x)$ is an indicator function of the constraints, defined as

    \delta_{\mathcal{X}_{t}}(z,x):=\begin{cases}+\infty, & x\notin\mathcal{X}_{t}\\ 0, & x\in\mathcal{X}_{t}.\end{cases}

    The relevant constraints in the network are as follows. The link capacity constraint is $0\leqslant\sum_{s}x(ij,s)+w_{t}(ij)\leqslant c_{t}(ij)$, where $c_{t}(ij)$ is the time-varying link capacity and $w_{t}(ij)$ is a time-varying, non-controllable link traffic. The flow conservation constraint implies $z(s)=\bar{R}x(s)$ with the routing matrix $\bar{R}$, and $z^{\max}$ denotes the maximal traffic rate. Therefore, the time-varying nonsmooth $r_{t}(z,x)$ can be written as

    r_{t}(z,x)=\sum_{i,s}\delta_{\{0\leqslant\sum_{s}x(ij,s)+w_{t}(ij)\leqslant c_{t}(ij)\}}+\delta_{\{0\leqslant z\leqslant z^{\max}\}}+\sum_{s}\delta_{\{z(s)=\bar{R}x(s)\}}+\frac{\nu}{2}(\lVert z\rVert^{2}+\lVert x\rVert^{2}).
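Returning to the first example above, when $\mathcal{X}=\mathbb{R}^{N}$ the proximal operator of the weighted $\ell_1$ norm (4) reduces to componentwise soft-thresholding. The sketch below makes this assumption explicit; with a nontrivial constraint set, an additional projection step would be needed.

```python
import numpy as np

def weights(x_prev, tau, eps):
    """Weights omega_i^t built from the previous iterate, as defined in (4)."""
    return np.where(np.abs(x_prev) > tau, eps, 1.0)

def prox_weighted_l1(v, eta, rho, w):
    """prox_{eta * r_t}(v) for r_t(x) = rho * sum_i w_i |x_i|,
    assuming X = R^N, i.e., componentwise soft-thresholding."""
    return np.sign(v) * np.maximum(np.abs(v) - eta * rho * w, 0.0)
```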

III Main Results

This section introduces the OPG algorithm and presents the main results.

Firstly, we describe how the OPG algorithm updates the iterate $x_{t}$. At each iteration $t$, based on a subgradient of $f_{t}(x)$ and a positive step size $\eta_{t}$, the next iterate $x_{t+1}$ is computed as

x_{t+1}=\mathrm{prox}_{\eta_{t}r_{t}}(x_{t}-\eta_{t}\tilde{\nabla}f_{t}(x_{t})), (5)

where the proximal operator is defined by

\mathrm{prox}_{\eta_{t}r_{t}}(x):=\arg\min_{u\in\mathcal{X}}\{r_{t}(u)+\frac{1}{2\eta_{t}}\lVert u-x\rVert^{2}\}. (6)

Then based on (5) and (6), the update is equivalent to

x_{t+1}=\arg\min_{x\in\mathcal{X}}\{r_{t}(x)+\langle x,\tilde{\nabla}f_{t}(x_{t})\rangle+\frac{1}{2\eta_{t}}\lVert x-x_{t}\rVert^{2}\}. (7)

At each iteration $t$, the algorithm assumes the availability of $\tilde{\nabla}f_{t}(x_{t})$.

Algorithm 1 Online Proximal Gradient (OPG)
Input: initial vector $x_{1}$, learning rates $\eta_{t}$ with $1\leqslant t\leqslant T$.
Output: $x_{2},x_{3},\ldots,x_{T}$
1:  for all $t=1,2,\ldots,T$ do
2:     Compute the next action:
x_{t+1}=\arg\min_{x\in\mathcal{X}}\{r_{t}(x)+\langle x,\tilde{\nabla}f_{t}(x_{t})\rangle+\frac{1}{2\eta_{t}}\lVert x-x_{t}\rVert^{2}\}.
3:     return $x_{t+1}$
4:  end for
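A minimal sketch of Algorithm 1 is given below, assuming access to a subgradient oracle for $f_{t}$ and a routine evaluating the proximal operator (6) over $\mathcal{X}$; the callable signatures are illustrative assumptions.

```python
def opg(x1, T, eta, subgrad_f, prox_r):
    """Online Proximal Gradient (Algorithm 1), a minimal sketch.

    x1        : initial vector in X.
    eta       : eta(t) returns the step size at round t (1-indexed).
    subgrad_f : subgrad_f(t, x) returns one subgradient of f_t at x.
    prox_r    : prox_r(t, v, s) evaluates prox_{s * r_t}(v) over X, as in (6).
    """
    x, actions = x1, []
    for t in range(1, T + 1):
        g = subgrad_f(t, x)
        x = prox_r(t, x - eta(t) * g, eta(t))   # the update (5)
        actions.append(x)
    return actions
```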

To proceed, the following standard assumptions are useful in the analysis.

Assumption 1.

The functions $f_{t},r_{t}:\mathcal{X}\to\mathbb{R}$ are closed and convex for all $t\in[T]$.

Assumption 1 ensures the existence of an optimal point for $f_{t}$ and $r_{t}$. Besides, by convexity, for any $x,y\in\mathcal{X}$ and any subgradients $\tilde{\nabla}f_{t}(x)\in\partial f_{t}(x)$ and $\tilde{\nabla}r_{t}(x)\in\partial r_{t}(x)$, one has:

f_{t}(y)\geqslant f_{t}(x)+\langle\tilde{\nabla}f_{t}(x),y-x\rangle,\quad\forall x,y\in\mathcal{X},
r_{t}(y)\geqslant r_{t}(x)+\langle\tilde{\nabla}r_{t}(x),y-x\rangle,\quad\forall x,y\in\mathcal{X}.
Assumption 2.

The function $f_{t}:\mathcal{X}\to\mathbb{R}$ is $\mu$-strongly convex for all $t\in[T]$.

According to $\mu$-strong convexity, for any $x,y\in\mathcal{X}$ and any subgradient $\tilde{\nabla}f_{t}(x)\in\partial f_{t}(x)$, one has:

f_{t}(y)\geqslant f_{t}(x)+\langle\tilde{\nabla}f_{t}(x),y-x\rangle+\frac{\mu}{2}\lVert x-y\rVert^{2}.
Assumption 3.

The set $\mathcal{X}$ is convex and compact, and $\lVert x-y\rVert\leqslant R$ for some $R>0$ and all $x,y\in\mathcal{X}$.

Assumption 4.

For any $x\in\mathcal{X}$, $\lVert\tilde{\nabla}F_{t}(x)\rVert\leqslant M$ for some $M>0$.

If $\lVert\tilde{\nabla}f_{t}(x)\rVert$ and $\lVert\tilde{\nabla}r_{t}(x)\rVert$ are upper-bounded by $m_{f}$ and $m_{r}$, respectively, one can set $M=m_{f}+m_{r}$.

III-A Dynamic Regret Bound for Convex $f_{t}$

In this case, the main result is provided in the following theorem.

Theorem 1.

Let Assumptions 1, 3, and 4 hold. Set the non-increasing step size $\eta_{t}=t^{-\gamma}\sigma$ in OPG, where $\sigma$ is the constant

\sigma=\frac{\sqrt{2(1-\gamma)RT^{2\gamma-\beta-1}D_{\beta}(T)+R^{2}T^{2\gamma-1}}}{M},

and $\gamma\in[\beta,1)$. Then there holds

\bm{Reg}_{T}^{d}=\mathcal{O}(\sqrt{T^{1-\beta}D_{\beta}(T)+T}), (8)

where $D_{\beta}(T)$ is defined in (1).

Proof.

See Appendix A. ∎

Remark 1: Note that the bound (8) is comparable to the best-known one, $\mathcal{O}(\sqrt{T^{1-\beta}D_{\beta}(T)}+\sqrt{T})$, established in [26], where time-invariant regularizers are considered; here, more general time-varying regularizers are handled.

Remark 2: Note that the bound $\mathcal{O}(\sqrt{T(D_{0}(T)+1)})$ in [24] is an optimal dynamic regret, matching the lower bound given in [24, Theorem 2]. By setting $\beta=0$, that bound is recovered as a special case of (8).
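For illustration, the step-size schedule of Theorem 1 can be computed as follows; the sketch assumes that $D_{\beta}(T)$ (or an estimate of it, cf. Remark 5) and the constants $R$ and $M$ are available in advance.

```python
import numpy as np

def theorem1_stepsizes(T, R, M, beta, gamma, D_beta_T):
    """Step sizes eta_t = t^{-gamma} * sigma from Theorem 1, gamma in [beta, 1)."""
    sigma = np.sqrt(2 * (1 - gamma) * R * T ** (2 * gamma - beta - 1) * D_beta_T
                    + R ** 2 * T ** (2 * gamma - 1)) / M
    t = np.arange(1, T + 1)
    return sigma * t ** (-gamma)
```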

III-B Dynamic Regret Bound for Strongly Convex $f_{t}$

The following Lemma 1 gives a per-round upper bound valid for any point in $\mathcal{X}$ and any positive step size. Moreover, when $\mu=0$, it also covers the case without strong convexity, which is discussed in Section III-A. It will be useful for deriving the main results.

Lemma 1.

Let Assumptions 1, 2, 3, and 4 hold. For any $u_{t}\in\mathcal{X}$, any positive step size $\eta_{t}>0$, and the iterate $x_{t}$ in OPG, there holds

F_{t}(x_{t})-F_{t}(u_{t})\leqslant\frac{1}{2}\big(\frac{1}{\eta_{t}}-\mu\big)\lVert u_{t}-x_{t}\rVert^{2}-\frac{1}{2\eta_{t}}\lVert u_{t}-x_{t+1}\rVert^{2}+\frac{\eta_{t}}{2}\lVert\tilde{\nabla}f_{t}(x_{t})+\tilde{\nabla}r_{t}(x_{t})\rVert^{2}. (9)
Proof.

See Appendix B. ∎

Now, by summing (9) over $t$ and selecting an appropriate step size $\eta_{t}$, the following theorem characterizes the dynamic regret bound for strongly convex $f_{t}$.

Theorem 2.

Let Assumptions 1, 2, 3, and 4 hold. For any $0\leqslant\beta<1$ in OPG, set the non-increasing step size $\eta_{t}=\frac{\gamma}{t}$ and choose $\gamma$ as

\gamma=\frac{2RT^{-\beta}D_{\beta}(T)+R^{2}}{\delta R^{2}+(\mu-\delta)\frac{1}{T}\lVert u_{1}-x_{1}\rVert^{2}},

where $\delta\in(0,\mu)$ is small enough that $\gamma\delta<1$, and $u_{1}$ and $x_{1}$ are the initial comparator and the initial vector in OPG, respectively. Then there holds

\bm{Reg}_{T}^{d}=\mathcal{O}(\log T(1+T^{-\beta}D_{\beta}(T))), (10)

where $D_{\beta}(T)$ is defined in (1).

Proof.

See Appendix C. ∎

It is worth noting that $\gamma$ in Theorem 2 is a fixed value, while $\gamma$ in Theorem 1 can be any value in $[\beta,1)$; they are chosen differently in the two scenarios.

Remark 3: Choosing $\beta=0$, the dynamic regret is of order $\mathcal{O}(\log T(1+D_{0}(T)))$, which matches the result for the distributed proximal gradient algorithm over dynamic graphs [27]. However, Theorem 2 drops the $\ell$-smoothness assumption on $f_{t}(x)$ required in [27]. Existing works for composite optimization that establish dynamic regret upper bounds under strongly convex $f_{t}(x)$ usually assume smoothness of $f_{t}(x)$ [28, 30, 41]. Therefore, the result without the smoothness assumption here greatly expands the scope of application.

Remark 4: Theorem 2 also applies to static regret. If the sequence $\{u_{t}\}_{t=1}^{T}$ is time-invariant, then $D_{\beta}(T)=\sum_{t=2}^{T}t^{\beta}\lVert u_{t}-u_{t-1}\rVert=0$. In this case, $\bm{Reg}_{T}^{d}=\mathcal{O}(\log T)$ is obtained, which is the optimal static regret bound in the strongly convex case[16]. Therefore, Theorem 2 may be optimal in this sense, which is of interest as one possible future work.

Remark 5: A promising extension can be considered in future work. The positive step size $\eta_{t}$ depends on $D_{\beta}(T)$ in Theorems 1 and 2. However, in most dynamic environments, $D_{\beta}(T)$ may be unknown in advance. In [26], the authors pointed out that in some online learning cases, $D_{\beta}(T)$ can be estimated from observed data. By adjusting the step size, the dynamic regret is expected to decrease at an appropriate rate, and $D_{\beta}(T)$ may then be derived in reverse. Moreover, the step-size selection in Theorems 1 and 2 requires knowing $T$ in advance. If $T$ is unknown, similar results can still be ensured by leveraging the so-called doubling trick, as done in[42].

Remark 6: The result in Theorem 2 can be further improved when $f_{t}$ is also $\ell$-smooth. The result in [28] shows that

\lVert x_{t+1}-x_{t}^{*}\rVert\leqslant\rho_{t}\lVert x_{t}-x_{t}^{*}\rVert, (11)

where $\rho_{t}:=\max\{|1-\eta_{t}\mu|,|1-\eta_{t}\ell|\}$ and $x_{t}^{*}$ is the optimal point of $F_{t}(x)=f_{t}(x)+r_{t}(x)$. Based on the convexity of $F_{t}$ and the bound on $\lVert\tilde{\nabla}F_{t}\rVert$, the upper bound for the strongly convex and smooth case is of order $\mathcal{O}(1+D_{0}(T))$, matching the best-known one in [28].

IV Numerical Examples

Numerical experiments are provided here to show the performance of OPG in comparison with some state-of-the-art algorithms that are valid for online composite convex optimization.

Objective functions: The performance of OPG is tested by optimizing two types of loss functions, corresponding to the two cases in the theoretical part. Based on the nonsmooth convex hinge loss and the time-varying nondifferentiable regularization function $r_{t}(x)$ defined in (4), two online regression objectives are defined:

F^{1}_{t}(x)=l_{t}(x)+r_{t}(x), (12)
F^{2}_{t}(x)=l_{t}(x)+\frac{\lambda}{2}\lVert x\rVert^{2}+r_{t}(x). (13)

Here $l_{t}(x)$ is the hinge loss, defined by

l_{t}(x)=\max(0,1-y_{t}a_{t}^{\top}x),

where $y_{t}$ and $a_{t}$ are generated from the dataset Usenet2. Usenet2 is based on the newsgroups collection and records which messages are considered interesting or junk in each time period; the 99-dimensional vector serves as the feature $a_{t}$ and the 1-dimensional value as the label $y_{t}$. The parameters of $r_{t}(x)$ are $\rho=0.4$, $\tau=1$, and $\epsilon=0.1$ for both $F^{1}_{t}(x)$ and $F^{2}_{t}(x)$. The parameter of $f_{t}(x)=l_{t}(x)+\frac{\lambda}{2}\lVert x\rVert^{2}$ is $\lambda=1$ for $F^{2}_{t}(x)$.
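Combining the earlier sketches, the two objectives can be wired into the OPG sketch of Section III as follows; the arrays `A` (features) and `y` (labels) stand in for the Usenet2 data and, like the helper names, are assumptions of this illustration.

```python
import numpy as np

rho, tau, eps, lam = 0.4, 1.0, 0.1, 1.0   # experiment parameters from the text

def make_oracles(A, y, x1, strongly_convex=False):
    """Oracles for F_t^1 (12) (strongly_convex=False) or F_t^2 (13) (True).

    A, y : feature rows a_t and labels y_t (placeholders for the Usenet2 data).
    x1   : initial point, used to seed the weights of r_t.
    """
    state = {"x_prev": x1}                 # previous iterate, defines omega^t

    def subgrad_f(t, x):
        state["x_prev"] = x                # x_t becomes the reference for omega
        g = hinge_subgradient(x, A[t - 1], y[t - 1])
        return g + lam * x if strongly_convex else g

    def prox_r(t, v, s):
        w = weights(state["x_prev"], tau, eps)
        return prox_weighted_l1(v, s, rho, w)

    return subgrad_f, prox_r
```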

The two functions represent different types of loss. Since the first term $f_{t}(x)$ of $F^{1}_{t}(x)$ is the nonsmooth convex loss $l_{t}(x)$, $F^{1}_{t}(x)$ shows the performance of OPG for nonsmooth and convex $f_{t}(x)$. Since the first term $f_{t}(x)$ of $F^{2}_{t}(x)$ is the nonsmooth strongly convex loss $l_{t}(x)+\frac{\lambda}{2}\lVert x\rVert^{2}$, $F^{2}_{t}(x)$ shows the performance of OPG for nonsmooth and strongly convex $f_{t}(x)$. Besides, both loss functions have a time-varying nonsmooth regularization function. Based on the theoretical analysis in Section III, the OPG algorithm is able to solve both of them.

Other online methods: In order to show the effectiveness of the proposed OPG algorithm, let us consider the following efficient online algorithms.

IV-1 SAGE (Stochastic Accelerated GradiEnt)

It is an accelerated gradient method for stochastic composite optimization[6]. Although designed for the stochastic setting, the authors also propose an accelerated gradient method for the online setting. Each iteration of the method takes the form

x_{t}=(1-\alpha_{t})y_{t-1}+\alpha_{t}z_{t-1},
y_{t}=\arg\min_{x}\{\langle\tilde{\nabla}f_{t-1}(x_{t}),x-x_{t}\rangle+\frac{L_{t}}{2}\lVert x-x_{t}\rVert^{2}+r_{t}(x)\},
z_{t}=z_{t-1}-\alpha_{t}(L_{t}+\mu\alpha_{t})^{-1}[L_{t}(x_{t}-y_{t})+\mu(z_{t-1}-x_{t})],

where $\{\alpha_{t}\}$ and $\{L_{t}\}$ are two parameter sequences and $\mu$ is the strong-convexity constant of $F_{t}(x)$. With suitably chosen $\{\alpha_{t}\}$ and $\{L_{t}\}$, it has established static regret bounds of order $\mathcal{O}(\sqrt{T})$ for convex $f_{t}(x)$ and $\mathcal{O}(\log T)$ for strongly convex $f_{t}(x)$[6].

IV-2 AC-SA (Accelerated Stochastic Approximation)

It is a stochastic gradient descent algorithm for convex composite optimization problems[7]. For a fair comparison, we use the true subgradient $\tilde{\nabla}f_{t}$ instead of a stochastic one. Each iteration of the method takes the form

x_{t}^{md}=\frac{(1-\alpha_{t})(\mu+\gamma_{t})}{\gamma_{t}+(1-\alpha_{t}^{2})\mu}x_{t-1}^{ag}+\frac{\alpha_{t}[(1-\alpha_{t})\mu+\gamma_{t}]}{\gamma_{t}+(1-\alpha_{t}^{2})\mu}x_{t-1},
x_{t}=\arg\min_{x}\{\alpha_{t}[\langle\tilde{\nabla}f_{t}(x_{t}^{md}),x\rangle+r_{t}(x)+\mu V(x_{t}^{md},x)]+[(1-\alpha_{t})\mu+\gamma_{t}]V(x_{t-1},x)\},
x_{t}^{ag}=\alpha_{t}x_{t}+(1-\alpha_{t})x_{t-1}^{ag},

where $\{\alpha_{t}\}$ and $\{\gamma_{t}\}$ are two parameter sequences, $\mu$ denotes the strong-convexity constant of $F_{t}(x)$, and $V(\cdot,\cdot)$ is the Bregman divergence; here we choose $V(x,z)=\frac{1}{2}\lVert x-z\rVert^{2}$. Optimal convergence rates for tackling different types of stochastic composite optimization problems were discussed in [7].

IV-3 RDA (Regularized Dual Averaging Method)

It is an extension of Nesterov's dual averaging method to composite optimization problems, where one term is a convex loss function and the other a regularization term[8]. Each iteration of the method takes the form

x_{t+1}=\arg\min_{x}\{\frac{1}{t}\sum_{\tau=1}^{t}\langle\tilde{\nabla}f_{\tau}(x_{\tau}),x\rangle+r_{t}(x)+\frac{\beta_{t}}{t}h(x)\},

where $\{\beta_{t}\}$ is a parameter sequence and $h(x)$ is an auxiliary strongly convex function; here we choose $h(x)=\frac{1}{2}\lVert x\rVert^{2}$. It ensures an $\mathcal{O}(\sqrt{T})$ bound on static regret with $\beta_{t}=\mathcal{O}(\sqrt{T})$ for convex losses[8]. A sketch of this update follows.
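With $h(x)=\frac{1}{2}\lVert x\rVert^{2}$ and the minimization taken over $\mathbb{R}^{n}$, the RDA update reduces to a single proximal step on the averaged subgradients; a minimal sketch under these assumptions:

```python
def rda_step(t, grad_sum, beta_t, prox_r):
    """One RDA update with h(x) = ||x||^2 / 2 over R^n, a minimal sketch.

    grad_sum : running sum of subgradients tilde-grad f_tau(x_tau), tau <= t.
    prox_r   : prox_r(t, v, s) evaluates prox_{s * r_t}(v).
    Completing the square shows the argmin equals
    prox_{(t/beta_t) r_t}(-(t/beta_t) * gbar), with gbar = grad_sum / t.
    """
    scale = t / beta_t
    return prox_r(t, -scale * (grad_sum / t), scale)
```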

IV-A Performance on Nonsmooth Convex $f_{t}(x)$

For $F^{1}_{t}(x)$ in (12), a time-varying nondifferentiable regularization function is added to a nonsmooth convex loss function. The step size $\eta_{t}$ is set to $0.001/t$ in OPG, while the step sizes in the other three algorithms are set optimally based on their analyses. The simulation is run 1500 times, and the time-averaged dynamic regret $\bm{Reg}_{T}^{d}/T$ is shown in Fig. 1. It can be seen that OPG performs the best in our simulation.

Figure 1: The performance on $F^{1}_{t}(x)$.

IV-B Performance on Nonsmooth Strongly Convex $f_{t}(x)$

The example (13) shows the performance of OPG on nonsmooth strongly convex $f_{t}(x)$ with time-varying nondifferentiable $r_{t}(x)$. To achieve the best possible upper bound for each algorithm, the step sizes are set according to their respective analyses. The simulation is run 1500 times. Fig. 2 shows how the time-averaged dynamic regret $\bm{Reg}_{T}^{d}/T$ changes with the horizon $T$; again, OPG performs the best in the simulation.

Figure 2: The performance on $F^{2}_{t}(x)$.

Therefore, the numerical results show that, compared with the other three algorithms, the OPG algorithm performs the best in the two cases considered in the theoretical analysis.

V Conclusion

This paper considers online optimization with composite loss functions, where both terms are time-varying and nonsmooth. The loss function is optimized by the online proximal gradient (OPG) algorithm, analyzed in terms of the extended path variation $D_{\beta}(T)$. Upper bounds on its dynamic regret are obtained for two cases. In the first case, where $f_{t}(x)$ is convex, we established a regret bound comparable to the best-known one in the existing literature, while extending the problem to time-varying regularizers. In the second case, where $f_{t}(x)$ is strongly convex, we obtained a better bound than prior work while removing the smoothness assumption. Finally, numerical experiments verified the performance of the OPG algorithm in both cases. Possible future directions include investigating whether the obtained regret bounds are optimal and relaxing the step-size selection so that it does not depend on $D_{\beta}(T)$.

In the following appendices, detailed proofs of the lemmas and theorems are given.

Appendix A Proof of Theorem 1

First, the following result reveals the relationship between any sequence $\{u_{t}\}_{t=1}^{T}\subset\mathcal{X}$ and the iterate sequence $\{x_{t}\}_{t=1}^{T}$ in OPG. If Assumption 3 is satisfied, then for any non-increasing step sizes $0<\eta_{t+1}\leqslant\eta_{t}$ in the OPG algorithm, it follows from [26, Lemma 4] that

\sum_{t=1}^{T}\frac{1}{\eta_{t}}(\lVert u_{t}-x_{t}\rVert^{2}-\lVert u_{t}-x_{t+1}\rVert^{2})\leqslant 2R\sum_{t=1}^{T-1}\frac{1}{\eta_{t}}\lVert u_{t+1}-u_{t}\rVert+\frac{R^{2}}{\eta_{T}}, (14)

where $R$ is defined in Assumption 3.

When $\mu=0$, Lemma 1 covers the case of convex $f_{t}(x)$, and (9) reduces to

F_{t}(x_{t})-F_{t}(u_{t})\leqslant\frac{1}{2\eta_{t}}\lVert u_{t}-x_{t}\rVert^{2}-\frac{1}{2\eta_{t}}\lVert u_{t}-x_{t+1}\rVert^{2}+\frac{\eta_{t}}{2}\lVert\tilde{\nabla}f_{t}(x_{t})+\tilde{\nabla}r_{t}(x_{t})\rVert^{2}. (15)

According to (14), the sum of $\frac{1}{2\eta_{t}}\lVert u_{t}-x_{t}\rVert^{2}-\frac{1}{2\eta_{t}}\lVert u_{t}-x_{t+1}\rVert^{2}$ is bounded by $R\sum_{t=1}^{T-1}\frac{1}{\eta_{t}}\lVert u_{t+1}-u_{t}\rVert+\frac{R^{2}}{2\eta_{T}}$. Assumption 4 then gives the upper bound $M^{2}$ on $\lVert\tilde{\nabla}f_{t}(x_{t})+\tilde{\nabla}r_{t}(x_{t})\rVert^{2}$. Hence, (15) implies

\sum_{t=1}^{T}(F_{t}(x_{t})-F_{t}(u_{t}))\leqslant R\sum_{t=1}^{T-1}\frac{1}{\eta_{t}}\lVert u_{t+1}-u_{t}\rVert+\frac{R^{2}}{2\eta_{T}}+\frac{M^{2}}{2}\sum_{t=1}^{T}\eta_{t}. (16)

Multiplying and dividing $\lVert u_{t+1}-u_{t}\rVert$ by $t^{\beta}$ and bounding $\frac{1}{\eta_{t}t^{\beta}}$ by $\max_{t\in[T]}\{\frac{1}{\eta_{t}t^{\beta}}\}$, the definition of $D_{\beta}(T)$ and (16) yield

\sum_{t=1}^{T}(F_{t}(x_{t})-F_{t}(u_{t}))\leqslant R\max_{t\in[T]}\big\{\frac{1}{\eta_{t}t^{\beta}}\big\}\sum_{t=1}^{T-1}t^{\beta}\lVert u_{t+1}-u_{t}\rVert+\frac{R^{2}}{2\eta_{T}}+\frac{M^{2}}{2}\sum_{t=1}^{T}\eta_{t}=R\max_{t\in[T]}\big\{\frac{1}{\eta_{t}t^{\beta}}\big\}D_{\beta}(T)+\frac{R^{2}}{2\eta_{T}}+\frac{M^{2}}{2}\sum_{t=1}^{T}\eta_{t}. (17)

By setting $\eta_{t}=t^{-\gamma}\sigma$ with $\sigma$ a constant and $\gamma\in[\beta,1)$, so that $\max_{t\in[T]}\{\frac{1}{\eta_{t}t^{\beta}}\}=\frac{T^{\gamma-\beta}}{\sigma}$, (17) yields

\sum_{t=1}^{T}(F_{t}(x_{t})-F_{t}(u_{t}))\leqslant\frac{RT^{\gamma-\beta}}{\sigma}D_{\beta}(T)+\frac{R^{2}T^{\gamma}}{2\sigma}+\frac{M^{2}\sigma}{2}\sum_{t=1}^{T}t^{-\gamma}\leqslant\frac{RT^{\gamma-\beta}}{\sigma}D_{\beta}(T)+\frac{R^{2}T^{\gamma}}{2\sigma}+\frac{M^{2}\sigma T^{1-\gamma}}{2(1-\gamma)}. (18)

According to the Cauchy inequality, choosing the optimal $\sigma$ as

\sigma=\frac{\sqrt{2(1-\gamma)RT^{2\gamma-\beta-1}D_{\beta}(T)+R^{2}T^{2\gamma-1}}}{M},

it can then be obtained that

\sum_{t=1}^{T}(F_{t}(x_{t})-F_{t}(u_{t}))\leqslant\sqrt{\frac{M^{2}(2RT^{1-\beta}D_{\beta}(T)+TR^{2})}{1-\gamma}}. (19)

Thus, there holds

\bm{Reg}_{T}^{d}=\mathcal{O}(\sqrt{T^{1-\beta}D_{\beta}(T)+T}).

 

Appendix B Proof of Lemma 1

Let us start by defining the Bregman divergence $B_{\psi}(x,u)$ associated with a differentiable, strongly convex function $\psi(x)$. The following properties of the Bregman divergence will be useful[26]:

  • a. $B_{\psi}(x,u):=\psi(x)-\psi(u)-\langle\nabla\psi(u),x-u\rangle$ (its definition);

  • b. $\nabla_{x}B_{\psi}(x,u)=\nabla\psi(x)-\nabla\psi(u)$;

  • c. $B_{\psi}(x,u)=B_{\psi}(x,z)+B_{\psi}(z,u)+\langle x-z,\nabla\psi(z)-\nabla\psi(u)\rangle$.

Let $\psi(x)=\frac{1}{2}\lVert x\rVert^{2}$ in the following. By Algorithm 1, for any sequence $\{u_{t}\}_{t=1}^{T}\subset\mathcal{X}$, one has

F_{t}(x_{t})-F_{t}(u_{t})=f_{t}(x_{t})+r_{t}(x_{t})-f_{t}(u_{t})-r_{t}(u_{t}).

Invoking the optimality criterion for (7) implies that, for any $x\in\mathcal{X}$,

0\leqslant\langle x-x_{t+1},\tilde{\nabla}f_{t}(x_{t})+\tilde{\nabla}r_{t}(x_{t+1})+\frac{1}{\eta_{t}}(x_{t+1}-x_{t})\rangle. (20)

Choosing $x=u_{t}$ in this result, it follows that

\langle\tilde{\nabla}f_{t}(x_{t})+\tilde{\nabla}r_{t}(x_{t+1}),x_{t+1}-u_{t}\rangle\leqslant\frac{1}{\eta_{t}}\langle x_{t+1}-x_{t},u_{t}-x_{t+1}\rangle. (21)

The strong convexity of $f_{t}$ and the convexity of $r_{t}$ imply that

f_{t}(x_{t})+r_{t}(x_{t})-f_{t}(u_{t})-r_{t}(u_{t})
= f_{t}(x_{t})+r_{t}(x_{t+1})-f_{t}(u_{t})-r_{t}(u_{t})+r_{t}(x_{t})-r_{t}(x_{t+1})
\leqslant \langle\tilde{\nabla}f_{t}(x_{t}),x_{t}-u_{t}\rangle-\frac{\mu}{2}\lVert u_{t}-x_{t}\rVert^{2}+\langle\tilde{\nabla}r_{t}(x_{t+1}),x_{t+1}-u_{t}\rangle+\langle\tilde{\nabla}r_{t}(x_{t}),x_{t}-x_{t+1}\rangle
= \langle\tilde{\nabla}f_{t}(x_{t}),x_{t+1}-u_{t}\rangle+\langle\tilde{\nabla}f_{t}(x_{t}),x_{t}-x_{t+1}\rangle-\frac{\mu}{2}\lVert u_{t}-x_{t}\rVert^{2}+\langle\tilde{\nabla}r_{t}(x_{t+1}),x_{t+1}-u_{t}\rangle+\langle\tilde{\nabla}r_{t}(x_{t}),x_{t}-x_{t+1}\rangle
= \langle\tilde{\nabla}f_{t}(x_{t})+\tilde{\nabla}r_{t}(x_{t+1}),x_{t+1}-u_{t}\rangle-\frac{\mu}{2}\lVert u_{t}-x_{t}\rVert^{2}+\langle\tilde{\nabla}f_{t}(x_{t})+\tilde{\nabla}r_{t}(x_{t}),x_{t}-x_{t+1}\rangle. (22)

Combining (21) with (22) yields

f_{t}(x_{t})+r_{t}(x_{t})-f_{t}(u_{t})-r_{t}(u_{t})
\leqslant \frac{1}{\eta_{t}}\langle x_{t+1}-x_{t},u_{t}-x_{t+1}\rangle-\frac{\mu}{2}\lVert u_{t}-x_{t}\rVert^{2}+\langle\tilde{\nabla}f_{t}(x_{t})+\tilde{\nabla}r_{t}(x_{t}),x_{t}-x_{t+1}\rangle
= \frac{1}{\eta_{t}}\langle\nabla\psi(x_{t+1})-\nabla\psi(x_{t}),u_{t}-x_{t+1}\rangle-\frac{\mu}{2}\lVert u_{t}-x_{t}\rVert^{2}+\langle\tilde{\nabla}f_{t}(x_{t})+\tilde{\nabla}r_{t}(x_{t}),x_{t}-x_{t+1}\rangle
= \frac{1}{\eta_{t}}(B_{\psi}(u_{t},x_{t})-B_{\psi}(u_{t},x_{t+1})-B_{\psi}(x_{t+1},x_{t}))-\frac{\mu}{2}\lVert u_{t}-x_{t}\rVert^{2}+\langle\tilde{\nabla}f_{t}(x_{t})+\tilde{\nabla}r_{t}(x_{t}),x_{t}-x_{t+1}\rangle. (23)

By the definition of $\psi(x)$, the term $\langle x_{t+1}-x_{t},u_{t}-x_{t+1}\rangle$ equals $\langle\nabla\psi(x_{t+1})-\nabla\psi(x_{t}),u_{t}-x_{t+1}\rangle$, and the last equality in (23) follows from property c of the Bregman divergence. Thus, by (23), invoking the definition of the Bregman divergence leads to

f_{t}(x_{t})+r_{t}(x_{t})-f_{t}(u_{t})-r_{t}(u_{t})\leqslant\frac{1}{2\eta_{t}}(\lVert u_{t}-x_{t}\rVert^{2}-\lVert u_{t}-x_{t+1}\rVert^{2}-\lVert x_{t+1}-x_{t}\rVert^{2})-\frac{\mu}{2}\lVert u_{t}-x_{t}\rVert^{2}+\langle\tilde{\nabla}f_{t}(x_{t})+\tilde{\nabla}r_{t}(x_{t}),x_{t}-x_{t+1}\rangle. (24)

By Young’s inequality, one can obtain that

\langle\tilde{\nabla}f_{t}(x_{t})+\tilde{\nabla}r_{t}(x_{t}),x_{t}-x_{t+1}\rangle-\frac{1}{2\eta_{t}}\lVert x_{t}-x_{t+1}\rVert^{2}\leqslant\frac{\eta_{t}}{2}\lVert\tilde{\nabla}f_{t}(x_{t})+\tilde{\nabla}r_{t}(x_{t})\rVert^{2}. (25)

Thus, invoking (25), together with (24), yields that

f_{t}(x_{t})+r_{t}(x_{t})-f_{t}(u_{t})-r_{t}(u_{t})\leqslant\frac{1}{2\eta_{t}}(\lVert u_{t}-x_{t}\rVert^{2}-\lVert u_{t}-x_{t+1}\rVert^{2})-\frac{\mu}{2}\lVert u_{t}-x_{t}\rVert^{2}+\frac{\eta_{t}}{2}\lVert\tilde{\nabla}f_{t}(x_{t})+\tilde{\nabla}r_{t}(x_{t})\rVert^{2}.

 

Appendix C Proof of Theorem 2

According to Assumption 4, $\lVert\tilde{\nabla}f_{t}(x)+\tilde{\nabla}r_{t}(x)\rVert^{2}\leqslant M^{2}$. Substituting this into Lemma 1 and summing over $t$ yields

\sum_{t=1}^{T}(F_{t}(x_{t})-F_{t}(u_{t}))=\sum_{t=1}^{T}(f_{t}(x_{t})+r_{t}(x_{t})-f_{t}(u_{t})-r_{t}(u_{t}))\leqslant\frac{1}{2}\sum_{t=1}^{T}\big((\frac{1}{\eta_{t}}-\mu)\lVert u_{t}-x_{t}\rVert^{2}-\frac{1}{\eta_{t}}\lVert u_{t}-x_{t+1}\rVert^{2}\big)+\frac{M^{2}}{2}\sum_{t=1}^{T}\eta_{t}. (26)

Now let us bound the first term of (26) as follows

\sum_{t=1}^{T}\big((\frac{1}{\eta_{t}}-\mu)\lVert u_{t}-x_{t}\rVert^{2}-\frac{1}{\eta_{t}}\lVert u_{t}-x_{t+1}\rVert^{2}\big)
=\sum_{t=0}^{T-1}(\frac{1}{\eta_{t+1}}-\mu)\lVert u_{t+1}-x_{t+1}\rVert^{2}-\sum_{t=1}^{T}\frac{1}{\eta_{t}}\lVert u_{t}-x_{t+1}\rVert^{2}
=\sum_{t=0}^{T-1}(\frac{1}{\eta_{t+1}}-\frac{1}{\eta_{t}}-\mu)\lVert u_{t+1}-x_{t+1}\rVert^{2}-\sum_{t=1}^{T}\frac{1}{\eta_{t}}\lVert u_{t}-x_{t+1}\rVert^{2}+\sum_{t=0}^{T-1}\frac{1}{\eta_{t}}\lVert u_{t+1}-x_{t+1}\rVert^{2}
=\sum_{t=1}^{T-1}(\frac{1}{\eta_{t+1}}-\frac{1}{\eta_{t}}-\mu)\lVert u_{t+1}-x_{t+1}\rVert^{2}-\sum_{t=1}^{T}\frac{1}{\eta_{t}}\lVert u_{t}-x_{t+1}\rVert^{2}+\sum_{t=1}^{T-1}\frac{1}{\eta_{t}}\lVert u_{t+1}-x_{t+1}\rVert^{2}+(\frac{1}{\eta_{1}}-\mu)\lVert u_{1}-x_{1}\rVert^{2}
\leqslant\sum_{t=1}^{T-1}(\frac{1}{\eta_{t+1}}-\frac{1}{\eta_{t}}-\mu)\lVert u_{t+1}-x_{t+1}\rVert^{2}+(\frac{1}{\eta_{1}}-\mu)\lVert u_{1}-x_{1}\rVert^{2}+\sum_{t=1}^{T-1}\frac{1}{\eta_{t}}(\lVert u_{t+1}-x_{t+1}\rVert^{2}-\lVert u_{t}-x_{t+1}\rVert^{2}), (27)
where the last step drops the nonpositive term $-\frac{1}{\eta_{T}}\lVert u_{T}-x_{T+1}\rVert^{2}$.

Using the identity

\lVert u_{t}-x_{t+1}\rVert^{2}=\lVert u_{t+1}-x_{t+1}\rVert^{2}-2\langle u_{t+1}-x_{t+1},u_{t+1}-u_{t}\rangle+\lVert u_{t+1}-u_{t}\rVert^{2}, (28)

one has

\sum_{t=1}^{T}\big((\frac{1}{\eta_{t}}-\mu)\lVert u_{t}-x_{t}\rVert^{2}-\frac{1}{\eta_{t}}\lVert u_{t}-x_{t+1}\rVert^{2}\big)
\leqslant\sum_{t=1}^{T-1}(\frac{1}{\eta_{t+1}}-\frac{1}{\eta_{t}}-\mu)\lVert u_{t+1}-x_{t+1}\rVert^{2}+(\frac{1}{\eta_{1}}-\mu)\lVert u_{1}-x_{1}\rVert^{2}+\sum_{t=1}^{T-1}\frac{2}{\eta_{t}}\langle u_{t+1}-x_{t+1},u_{t+1}-u_{t}\rangle-\sum_{t=1}^{T-1}\frac{1}{\eta_{t}}\lVert u_{t+1}-u_{t}\rVert^{2}
\leqslant\sum_{t=1}^{T-1}(\frac{1}{\eta_{t+1}}-\frac{1}{\eta_{t}}-\mu)\lVert u_{t+1}-x_{t+1}\rVert^{2}+(\frac{1}{\eta_{1}}-\mu)\lVert u_{1}-x_{1}\rVert^{2}+\sum_{t=1}^{T-1}\frac{2}{\eta_{t}}\lVert u_{t+1}-x_{t+1}\rVert\lVert u_{t+1}-u_{t}\rVert-\sum_{t=1}^{T-1}\frac{1}{\eta_{t}}\lVert u_{t+1}-u_{t}\rVert^{2}
\leqslant\sum_{t=1}^{T-1}(\frac{1}{\eta_{t+1}}-\frac{1}{\eta_{t}}-\mu)\lVert u_{t+1}-x_{t+1}\rVert^{2}+(\frac{1}{\eta_{1}}-\mu)\lVert u_{1}-x_{1}\rVert^{2}+\sum_{t=1}^{T-1}\frac{2R}{\eta_{t}}\lVert u_{t+1}-u_{t}\rVert. (29)

Considering this upper bound, the dynamic regret is bounded above by

\sum_{t=1}^{T}(f_{t}(x_{t})+r_{t}(x_{t})-f_{t}(u_{t})-r_{t}(u_{t}))
\leqslant\frac{M^{2}}{2}\sum_{t=1}^{T}\eta_{t}+\frac{1}{2}\sum_{t=1}^{T-1}(\frac{1}{\eta_{t+1}}-\frac{1}{\eta_{t}}-\mu)\lVert u_{t+1}-x_{t+1}\rVert^{2}+\frac{1}{2}(\frac{1}{\eta_{1}}-\mu)\lVert u_{1}-x_{1}\rVert^{2}+\sum_{t=1}^{T-1}\frac{R}{\eta_{t}}\lVert u_{t+1}-u_{t}\rVert
\leqslant\frac{M^{2}}{2}\sum_{t=1}^{T}\eta_{t}+\frac{1}{2}\sum_{t=1}^{T-1}(\frac{1}{\eta_{t+1}}-\frac{1}{\eta_{t}}-\mu)\lVert u_{t+1}-x_{t+1}\rVert^{2}+\frac{1}{2}(\frac{1}{\eta_{1}}-\mu)\lVert u_{1}-x_{1}\rVert^{2}+R\max_{t\in[T]}\big\{\frac{1}{\eta_{t}t^{\beta}}\big\}\sum_{t=1}^{T-1}t^{\beta}\lVert u_{t+1}-u_{t}\rVert. (30)

By introducing a small positive number $\delta<\mu$, (30) yields

\sum_{t=1}^{T}(f_{t}(x_{t})+r_{t}(x_{t})-f_{t}(u_{t})-r_{t}(u_{t}))\leqslant\frac{M^{2}}{2}\sum_{t=1}^{T}\eta_{t}+\frac{1}{2}\sum_{t=1}^{T-1}(\frac{1}{\eta_{t+1}}-\frac{1}{\eta_{t}}-\delta)\lVert u_{t+1}-x_{t+1}\rVert^{2}+\frac{1}{2}(\frac{1}{\eta_{1}}-\delta)\lVert u_{1}-x_{1}\rVert^{2}-\frac{1}{2}(\mu-\delta)\lVert u_{1}-x_{1}\rVert^{2}+R\max_{t\in[T]}\big\{\frac{1}{\eta_{t}t^{\beta}}\big\}\sum_{t=1}^{T-1}t^{\beta}\lVert u_{t+1}-u_{t}\rVert.

Setting $\eta_{t}=\frac{\gamma}{t}$ with $\gamma<\frac{1}{\delta}$, both $\frac{1}{\eta_{t+1}}-\frac{1}{\eta_{t}}-\delta$ and $\frac{1}{\eta_{1}}-\delta$ are positive. Thus, there holds

\sum_{t=1}^{T}(f_{t}(x_{t})+r_{t}(x_{t})-f_{t}(u_{t})-r_{t}(u_{t}))
\leqslant\frac{M^{2}}{2}\sum_{t=1}^{T}\eta_{t}+\frac{R^{2}}{2}\sum_{t=1}^{T-1}(\frac{1}{\eta_{t+1}}-\frac{1}{\eta_{t}}-\delta)+\frac{R^{2}}{2}(\frac{1}{\eta_{1}}-\delta)-\frac{1}{2}(\mu-\delta)\lVert u_{1}-x_{1}\rVert^{2}+R\max_{t\in[T]}\big\{\frac{1}{\eta_{t}t^{\beta}}\big\}D_{\beta}(T)
\leqslant\frac{M^{2}}{2}\sum_{t=1}^{T}\eta_{t}+R\max_{t\in[T]}\big\{\frac{1}{\eta_{t}t^{\beta}}\big\}D_{\beta}(T)-\frac{\delta R^{2}T}{2}+\frac{R^{2}}{2\eta_{T}}-\frac{1}{2}(\mu-\delta)\lVert u_{1}-x_{1}\rVert^{2}, (31)

where Assumption 3 is used to bound $\lVert u_{t+1}-x_{t+1}\rVert^{2}$ and $\lVert u_{1}-x_{1}\rVert^{2}$, and the definition of $D_{\beta}(T)$ is used. The bound (31) then follows by telescoping the sum $\sum_{t=1}^{T-1}(\frac{1}{\eta_{t+1}}-\frac{1}{\eta_{t}})$.

As $\eta_{t}=\frac{\gamma}{t}$ and $0\leqslant\beta<1$, $\max_{t\in[T]}\{\frac{1}{\eta_{t}t^{\beta}}\}$ can be bounded as

\max_{t\in[T]}\big\{\frac{1}{\eta_{t}t^{\beta}}\big\}=\max_{t\in[T]}\big\{\frac{t^{1-\beta}}{\gamma}\big\}=\frac{T^{1-\beta}}{\gamma}. (32)

Since the integral of $\frac{1}{t}$ from $1$ to $T$ upper-bounds the discrete sum from $2$ to $T$, one can bound $\sum_{t=1}^{T}\eta_{t}$ as follows:

\sum_{t=1}^{T}\eta_{t}=\gamma\sum_{t=1}^{T}\frac{1}{t}=\gamma\big(1+\sum_{t=2}^{T}\frac{1}{t}\big)\leqslant\gamma\big(1+\int_{1}^{T}\frac{1}{t}\,dt\big)=\gamma(1+\log T). (33)

Now, substituting (32) and (33) into (31) gives

\sum_{t=1}^{T}(f_{t}(x_{t})+r_{t}(x_{t})-f_{t}(u_{t})-r_{t}(u_{t}))\leqslant\frac{M^{2}\gamma}{2}(1+\log T)+\frac{RT^{1-\beta}}{\gamma}D_{\beta}(T)-\frac{\delta R^{2}T}{2}+\frac{TR^{2}}{2\gamma}-\frac{1}{2}(\mu-\delta)\lVert u_{1}-x_{1}\rVert^{2}. (34)

By choosing

\gamma=\frac{2RT^{-\beta}D_{\beta}(T)+R^{2}}{\delta R^{2}+(\mu-\delta)\frac{1}{T}\lVert u_{1}-x_{1}\rVert^{2}},

it is easy to verify that the term $\frac{RT^{1-\beta}}{\gamma}D_{\beta}(T)-\frac{\delta R^{2}T}{2}+\frac{TR^{2}}{2\gamma}-\frac{1}{2}(\mu-\delta)\lVert u_{1}-x_{1}\rVert^{2}$ vanishes. Under the prior requirement $\delta\gamma<1$ and the $\gamma$ chosen here, choosing $\delta$ small enough makes $\delta\gamma=\frac{2RT^{-\beta}D_{\beta}(T)+R^{2}}{R^{2}+(\frac{\mu}{\delta}-1)\frac{1}{T}\lVert u_{1}-x_{1}\rVert^{2}}$ smaller than $1$. Invoking this $\gamma$, together with (34), yields

\sum_{t=1}^{T}(f_{t}(x_{t})+r_{t}(x_{t})-f_{t}(u_{t})-r_{t}(u_{t}))\leqslant\frac{M^{2}}{2\delta R}(1+\log T)(2T^{-\beta}D_{\beta}(T)+R). (35)

Therefore, one has

\bm{Reg}_{T}^{d}=\mathcal{O}(\log T(1+T^{-\beta}D_{\beta}(T))). (36)

 

References

  • [1] A. Hamadouche, Y. Wu, A. M. Wallace, and J. F. C. Mota, “Approximate proximal-gradient methods,” in 2021 Sensor Signal Processing for Defence Conference (SSPD), 2021, pp. 1–6.
  • [2] N. K. Dhingra, S. Z. Khong, and M. R. Jovanović, “A second order primal-dual method for nonsmooth convex composite optimization,” IEEE Transactions on Automatic Control, vol. 67, no. 8, pp. 4061–4076, 2022.
  • [3] P. Joulani, A. György, and C. Szepesvári, “A modular analysis of adaptive (non-)convex optimization: Optimism, composite objectives, variance reduction, and variational bounds,” Theoretical Computer Science, vol. 808, pp. 108–138, 2020.
  • [4] J. Duchi, S. Shalev-Shwartz, Y. Singer, and A. Tewari, “Composite objective mirror descent,” in COLT, 2010, pp. 14–26.
  • [5] D. Yuan, Y. Hong, D. W. C. Ho, and S. Xu, “Distributed mirror descent for online composite optimization,” IEEE Transactions on Automatic Control, vol. 66, no. 2, pp. 714–729, 2021.
  • [6] C. Hu, J. T.-Y. Kwok, and W. Pan, “Accelerated gradient methods for stochastic optimization and online learning,” in NeurIPS, 2009.
  • [7] S. Ghadimi and G. Lan, “Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization i: A generic algorithmic framework,” SIAM J. Optim., vol. 22, pp. 1469–1492, 2012.
  • [8] X. Chen, Q. Lin, and J. F. Pena, “Optimal regularized dual averaging methods for stochastic optimization,” in NeurIPS, 2012.
  • [9] A. Taylor, J. Hendrickx, and F. Glineur, “Exact worst-case convergence rates of the proximal gradient method for composite convex minimization,” Journal of Optimization Theory and Applications, vol. 178, 08 2018.
  • [10] M. K. Nutalapati, A. S. Bedi, K. Rajawat, and M. Coupechoux, “Online trajectory optimization using inexact gradient feedback for time-varying environments,” IEEE Transactions on Signal Processing, vol. 68, pp. 4824–4838, 2020.
  • [11] T. Chen, Q. Ling, and G. B. Giannakis, “An online convex optimization approach to proactive network resource allocation,” IEEE Transactions on Signal Processing, vol. 65, no. 24, pp. 6350–6364, 2017.
  • [12] T. Wang, W. Yu, and S. Wang, “Inter-slice radio resource management via online convex optimization,” in Proceedings of IEEE International Conference on Communications, 2021, pp. 1–6.
  • [13] Q. Tao, Q.-K. Gao, D.-J. Chu, and G.-W. Wu, “Stochastic learning via optimizing the variational inequalities,” IEEE Transactions on Neural Networks and Learning Systems, vol. 25, no. 10, pp. 1769–1778, 2014.
  • [14] W. Xue and W. Zhang, “Learning a coupled linearized method in online setting,” IEEE Transactions on Neural Networks and Learning Systems, vol. 28, no. 2, pp. 438–450, 2017.
  • [15] E. Hazan, “Introduction to online convex optimization,” Foundations and Trends in Optimization, vol. 2, pp. 157–325, 01 2016.
  • [16] Y. Li, X. Cao, and H. Chen, “Fully projection-free proximal stochastic gradient method with optimal convergence rates,” IEEE Access, vol. 8, pp. 165 904–165 912, 2020.
  • [17] M. Zinkevich, “Online convex programming and generalized infinitesimal gradient ascent,” in ICML, 2003, p. 928–935.
  • [18] X. Li, X. Yi, and L. Xie, “Distributed online optimization for multi-agent networks with coupled inequality constraints,” IEEE Transactions on Automatic Control, vol. 66, no. 8, pp. 3575–3591, 2021.
  • [19] ——, “Distributed online convex optimization with an aggregative variable,” IEEE Transactions on Control of Network Systems, vol. 9, no. 1, pp. 438–449, 2022.
  • [20] O. Besbes, Y. Gur, and A. Zeevi, “Non-stationary stochastic optimization,” Operations Research, vol. 63, pp. 1227–1244, 09 2015.
  • [21] D. S. Kalhan, A. Singh Bedi, A. Koppel, K. Rajawat, H. Hassani, A. K. Gupta, and A. Banerjee, “Dynamic online learning via Frank-Wolfe algorithm,” IEEE Transactions on Signal Processing, vol. 69, pp. 932–947, 2021.
  • [22] P. Zhao and L. Zhang, “Improved Analysis for Dynamic Regret of Strongly Convex and Smooth Functions,” arXiv e-prints, p. arXiv:2006.05876, Jun. 2020.
  • [23] L. Zhang, T. Yang, J. Yi, R. Jin, and Z.-H. Zhou, “Improved dynamic regret for non-degenerate functions,” in NeurIPS, Red Hook, NY, USA, 2017, p. 732–741.
  • [24] L. Zhang, S. Lu, and Z.-H. Zhou, “Adaptive online learning in dynamic environments,” in NeurIPS, Red Hook, NY, USA, 2018, p. 1330–1340.
  • [25] A. Mokhtari, S. Shahrampour, A. Jadbabaie, and A. Ribeiro, “Online optimization in dynamic environments: Improved regret rates for strongly convex problems,” in 2016 IEEE 55th Conference on Decision and Control (CDC), 2016, pp. 7195–7201.
  • [26] Y. Zhao, S. Qiu, K. Li, L. Luo, J. Yin, and J. Liu, “Proximal online gradient is optimum for dynamic regret: A general lower bound,” IEEE Transactions on Neural Networks and Learning Systems, pp. 1–10, 2021.
  • [27] R. Dixit, A. S. Bedi, and K. Rajawat, “Online learning over dynamic graphs via distributed proximal gradient algorithm,” IEEE Transactions on Automatic Control, vol. 66, pp. 5065–5079, 2021.
  • [28] A. Ajalloeian, A. Simonetto, and E. Dall’Anese, “Inexact online proximal-gradient method for time-varying convex optimization,” in 2020 American Control Conference (ACC), 2020, pp. 2850–2857.
  • [29] R. Dixit, A. S. Bedi, R. Tripathi, and K. Rajawat, “Online learning with inexact proximal online gradient descent algorithms,” IEEE Transactions on Signal Processing, vol. 67, no. 5, pp. 1338–1352, 2019.
  • [30] N. Bastianello and E. Dall’Anese, “Distributed and inexact proximal gradient method for online convex optimization,” in 2021 European Control Conference (ECC), 2021, pp. 2432–2437.
  • [31] X. Yi, X. Li, L. Xie, and K. H. Johansson, “Distributed online convex optimization with time-varying coupled inequality constraints,” IEEE Transactions on Signal Processing, vol. 68, pp. 731–746, 2020.
  • [32] J. Duchi and Y. Singer, “Efficient online and batch learning using forward backward splitting,” J. Mach. Learn. Res., vol. 10, p. 2899–2934, Dec. 2009.
  • [33] L. M. Lopez-Ramos and B. Beferull-Lozano, “Online hyperparameter search interleaved with proximal parameter updates,” in 2020 28th European Signal Processing Conference (EUSIPCO), 2021, pp. 2085–2089.
  • [34] J. Liang and C. Poon, “Variable screening for sparse online regression,” Journal of Computational and Graphical Statistics, pp. 1–51, 07 2022.
  • [35] R. Shafipour and G. Mateos, “Online proximal gradient for learning graphs from streaming signals,” in 2020 28th European Signal Processing Conference (EUSIPCO), 2021, pp. 865–869.
  • [36] T. Yang, R. Jin, M. Mahdavi, and S. Zhu, “An efficient primal dual prox method for non-smooth optimization,” Machine Learning, vol. 98, pp. 369–406, 2014.
  • [37] G. Ditzler, M. Roveri, C. Alippi, and R. Polikar, “Learning in nonstationary environments: A survey,” IEEE Computational Intelligence Magazine, vol. 10, no. 4, pp. 12–25, 2015.
  • [38] Y. Murakami, M. Yamagishi, M. Yukawa, and I. Yamada, “A sparse adaptive filtering using time-varying soft-thresholding techniques,” in 2010 IEEE International Conference on Acoustics, Speech and Signal Processing, 2010, pp. 3734–3737.
  • [39] M. Yamagishi, M. Yukawa, and I. Yamada, “Acceleration of adaptive proximal forward-backward splitting method and its application to sparse system identification,” in 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2011, pp. 4296–4299.
  • [40] T. Yamamoto, M. Yamagishi, and I. Yamada, “Adaptive proximal forward-backward splitting for sparse system identification under impulsive noise,” 2012 Proceedings of the 20th European Signal Processing Conference (EUSIPCO), pp. 2620–2624, 2012.
  • [41] S. A. Alghunaim, E. K. Ryu, K. Yuan, and A. H. Sayed, “Decentralized proximal gradient algorithms with linear convergence rates,” IEEE Transactions on Automatic Control, vol. 66, pp. 2787–2794, 2021.
  • [42] H. Yu and M. J. Neely, “A low complexity algorithm with O(T){O}(\sqrt{T}) regret and O(1){O}(1) constraint violations for online convex optimization with long term constraints,” J. Mach. Learn. Res., vol. 21, pp. 1–24, 2020.