
Dynamic Regret for Online Composite Optimization

Ruijie Hou, Xiuxian Li, Senior Member, IEEE, and Yang Shi, Fellow, IEEE. This research was supported by the National Natural Science Foundation of China under Grant 62003243, and the Shanghai Municipal Science and Technology Major Project, No. 2021SHZDZX0100. R. Hou and X. Li are with the Department of Control Science and Engineering, College of Electronics and Information Engineering, Tongji University, Shanghai 201800, China (e-mail: [email protected], [email protected]). X. Li is also with the Shanghai Research Institute for Intelligent Autonomous Systems, Shanghai 201210, China. Y. Shi is with the Department of Mechanical Engineering, University of Victoria, Victoria, B.C. V8W 3P6, Canada (e-mail: [email protected]).
Abstract

This paper investigates online composite optimization in dynamic environments, where each objective or loss function contains a time-varying nondifferentiable regularizer. To solve it, an online proximal gradient algorithm is studied for two distinct scenarios: convex and strongly convex objectives, both without the smoothness condition. In both scenarios, unlike most existing works, an extended version of the conventional path variation is employed to bound the considered performance metric, i.e., the dynamic regret. In the convex scenario, a bound $\mathcal{O}(\sqrt{T^{1-\beta}D_{\beta}(T)+T})$ is obtained, which is comparable to the best-known result, where $D_{\beta}(T)$ is the extended path variation with $\beta\in[0,1)$ and $T$ is the total number of rounds. In the strongly convex case, a bound $\mathcal{O}(\log T(1+T^{-\beta}D_{\beta}(T)))$ on the dynamic regret is established. Finally, numerical examples are presented to support the theoretical findings.

Index Terms:
Online optimization, composite optimization, convex optimization, dynamic regret, dynamic environments.

I Introduction

Composite optimization has been studied extensively in recent decades. Typically, the problem is composed of functions with different properties. This structure arises in statistics, machine learning, and engineering, e.g., empirical risk minimization, logistic regression, and compressed sensing[1]. Many algorithms have been proposed for solving composite optimization, such as the second-order primal-dual algorithm [2], ADA-FTRL and ADA-MD[3], COMID[4], ODCMD[5], SAGE[6], AC-SA[7], and RDA[8], most of which use the well-known proximal operator. The proximal operator, introduced in [9], is an efficient tool for minimizing functions with a convex but possibly nondifferentiable part.

This paper focuses on online composite optimization in dynamic environments, one of many important settings in online optimization. In particular, online convex optimization has received immense attention, with many applications in dynamic settings, such as online trajectory optimization[10], network resource allocation[11], and radio resource management[12]. The typical setup of online optimization is as follows. At each time $t$, a learner selects an action $x_{t}\in\mathcal{X}$, where $\mathcal{X}\subset\mathbb{R}^{n}$ is a feasible constraint set, and then an adversary selects a corresponding convex loss $F_{t}\in\mathcal{F}$, where $\mathcal{F}$ is a given set of loss functions available to the adversary. The objective of online optimization is to select actions $\{x_{t}\}_{t=1}^{T}$ such that $\sum_{t=1}^{T}F_{t}(x_{t})$ is minimized, so as to track the best decision in hindsight.

Static regret is a classical performance metric: $\bm{Reg}_{T}^{s}:=\sum_{t=1}^{T}F_{t}(x_{t})-\sum_{t=1}^{T}F_{t}(x^{*})$, where $x^{*}$ minimizes the sum of all loss functions over the constraint set[13]. An effective algorithm ensures that the average static regret goes to zero over an infinite time horizon[14]. For instance, the Online Gradient Method (OGM) is proved to have bounds $O(\sqrt{T})$ and $O(\log T)$ when $F_{t}$ is convex and strongly convex, respectively[15]. The Online Proximal Gradient (OPG) algorithm achieves the same bounds for composite functions whose first term is convex or strongly convex[16, 6]. Although both share the same static regret bounds, OPG can handle composite loss functions with a nondifferentiable part, such as objectives regularized by the $\ell_1$ norm. Besides, there are other related works solving online composite problems under static regret. The authors in [5] developed an online distributed composite mirror descent using the Bregman divergence and attained an average static regret bound of order $\mathcal{O}(\frac{1}{\sqrt{T}})$ for each agent, which they called the average regularized regret. Even for the multipoint bandit feedback model, they achieved the same result.

In a wide range of applications, $F_{t}(x)$ often has a different minimizer at each $t$. As such, the dynamic regret introduced in [17] is a well-known extension of static regret. Many previous works [18, 19] have pointed out that the dynamic regret is hard to make sublinear in $T$, since the function sequence may fluctuate arbitrarily. In order to bound the dynamic regret, one needs to specify some complexity measures. Letting $\{u_{t}\}_{t=1}^{T}$ be a sequence of arbitrary feasible comparators, several common complexity measures are shown as follows:

I-1 The function variation

V(T):=\sum_{t=2}^{T}\sup_{u\in\mathcal{X}}|F_{t}(u)-F_{t-1}(u)|.

$V(T)$ is used in problems where the cost functions change over time, as in non-stationary variants of sequential stochastic optimization. Minimax regret bounds $\mathcal{O}(T^{\frac{2}{3}}V(T)^{\frac{1}{3}})$ and $\mathcal{O}(T^{\frac{1}{2}}V(T)^{\frac{1}{2}})$ have been obtained for convex and strongly convex functions, respectively, in a non-stationary setting under noisy gradient access[20].

I-2 The gradient variation

F(T):=\sum_{t=2}^{T}\lVert\nabla F_{t}(u_{t})-M_{t}\rVert_{2}^{2},

where $\{M_{1},M_{2},\ldots,M_{T}\}$ is computed according to past observations or side information. For instance, one choice is $M_{t}=\nabla F_{t-1}(u_{t-1})$[21]. That work, on the online Frank-Wolfe algorithm, established an $\mathcal{O}(\sqrt{T}(1+V(T)+\sqrt{F(T)}))$ dynamic regret bound for convex and smooth loss functions in non-stationary settings.

I-3 The path variation

D(T):=\sum_{t=2}^{T}\lVert u_{t}-u_{t-1}\rVert,
S(T):=\sum_{t=2}^{T}\lVert u_{t}-u_{t-1}\rVert^{2}.

If the reference points move slowly, $S(T)$ may be smaller than $D(T)$; otherwise, $D(T)$ may be smaller than $S(T)$. In particular, $D(T)$ and $S(T)$ are also used to measure the variation of the optimal points $x_{t}^{*}:=\arg\min_{x\in\mathcal{X}}f_{t}(x)$[22, 23]. Works on the path variation significantly outnumber those on the first two measures. Earlier research on OGM showed that the dynamic regret has an upper bound $\mathcal{O}(\sqrt{T}(D(T)+1))$ for convex cost functions[17]. Later, the bound was improved to $\mathcal{O}(\sqrt{T(D(T)+1)})$, which matches a lower bound[24]. Under stricter conditions, a regret bound of order $\mathcal{O}(D(T)+1)$ was established for strongly convex and smooth loss functions[25]. Another study, on Online Multiple Gradient Descent for smooth and strongly convex functions, improved the regret bound to $\mathcal{O}(\min(D(T),S(T)))$[23].

I-4 An extended path variation

D_{\beta}(T):=\sum_{t=2}^{T}t^{\beta}\lVert u_{t}-u_{t-1}\rVert, (1)

where $0\leqslant\beta<1$[26]. Larger weights are allocated to later terms than to earlier ones. When $\beta=0$, $D_{0}(T)$ equals $D(T)$, and when $0<\beta<1$, $D_{0}(T)<D_{\beta}(T)<T^{\beta}D_{0}(T)$. Therefore, $D_{\beta}(T)$ is an extension of $D(T)$. In the literature, works on online optimization based on $D_{\beta}(T)$ are few. One related work using $D_{\beta}(T)$ studies the proximal online gradient algorithm[26], establishing a dynamic regret of order $\mathcal{O}(\sqrt{T^{1-\beta}D_{\beta}(T)}+\sqrt{T})$ for convex loss functions.
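For concreteness, the quantity in (1) is straightforward to evaluate for a given comparator sequence; the following is a minimal sketch, where the array layout and function name are illustrative assumptions.

```python
import numpy as np

def extended_path_variation(u, beta=0.0):
    """Extended path variation (1): D_beta(T) = sum_{t=2}^T t^beta ||u_t - u_{t-1}||.

    u    : array of shape (T, n) holding the comparators u_1, ..., u_T.
    beta : weight exponent in [0, 1); beta = 0 recovers D(T).
    """
    T = u.shape[0]
    t = np.arange(2, T + 1)                        # 1-indexed time, as in (1)
    steps = np.linalg.norm(u[1:] - u[:-1], axis=1)
    return float(np.sum(t ** beta * steps))
```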

TABLE I: Summary of results on dynamic regret ($F_{t}=f_{t}+r_{t}$)
Reference | Algorithm | Assumptions on loss functions | Dynamic regret
[24] | OGD | $F_{t}$: convex, without $r_{t}$ | $\mathcal{O}(\sqrt{T(D_{0}(T)+1)})$
[25] | OGD | $F_{t}$: strongly convex and smooth, without $r_{t}$ | $\mathcal{O}(D_{0}(T)+1)$
[26] | POG | $f_{t}$: convex, with fixed regularizer | $\mathcal{O}(\sqrt{T^{1-\beta}D_{\beta}(T)}+\sqrt{T})$
[27] | DP-OGD | $f_{t}$: strongly convex and smooth | $\mathcal{O}(\log T(1+D_{0}(T)))$
[28] | OIPG | $f_{t}$: strongly convex and smooth | $\mathcal{O}(1+D_{0}(T))$
This work | OPG | $f_{t}$: convex | $\mathcal{O}(\sqrt{T^{1-\beta}D_{\beta}(T)+T})$
This work | OPG | $f_{t}$: strongly convex | $\mathcal{O}(\log T(1+T^{-\beta}D_{\beta}(T)))$

In addition, there are plenty of works studying online composite optimization. In [28], an inexact online proximal gradient algorithm was proposed for loss functions composed of a strongly convex, smooth term and a convex term, where the updates rely on estimated gradients of the smooth convex function and an approximate proximal operator. Similarly, to allow the loss functions to be subsampled, inexact gradients were incorporated into an online proximal gradient algorithm for a finite sum of composite functions[29]. When both algorithms are applied to the exact-gradient case, the upper bounds are both of order $\mathcal{O}(1+D(T))$. Furthermore, considering inexactness from approximate gradients, an approximate proximal operator, and communication noise[30], an online inexact distributed proximal gradient method (DPGM) was shown to converge Q-linearly to the optimal solution up to some error. Decentralized algorithms have also been proposed for composite optimization, as the composite loss function of each agent may be held privately. Unconstrained problems were considered first: an online DPGM was proposed in [27] for private cost functions composed of a strongly convex, smooth function and a convex nonsmooth regularizer, obtaining a regret bound of order $\mathcal{O}(\log T(1+D(T)))$. Later works use mirror descent for constrained problems; a distributed online primal-dual mirror descent algorithm was developed to handle composite objective functions consisting of local convex functions and regularizers with time-varying coupled inequality constraints[31].

In this paper, we revisit the OPG algorithm for composite functions with a time-varying regularizer and analyze the dynamic regret in terms of $D_{\beta}(T)$. Most related works on OPG consider a time-invariant $r(x)$[26, 32, 33, 34, 35], while our problem extends the stationary regularizer to the time-varying scenario.

The contributions of this paper include: (a) solving online composite optimization problems with two time-varying nonsmooth parts in terms of $D_{\beta}(T)$; (b) obtaining a dynamic regret bound comparable to the best-known one for composite functions with two convex but nondifferentiable terms; (c) establishing a dynamic regret bound of OPG for functions composed of a strongly convex term and a convex but nondifferentiable term.

The paper is organized as follows. Section II describes the online composite optimization problem. In Section III, OPG algorithm is introduced, necessary assumptions are presented and main results are provided for two cases. Numerical examples of the OPG algorithm are given in Section IV. The conclusions are presented in Section V.

Notations: Let $\lVert\cdot\rVert_{1}$ denote the $\ell_1$ norm and $\lVert\cdot\rVert$ the $\ell_2$ norm by default. $\partial f(x)$ denotes the subdifferential of a convex function $f$ at $x$, and $\tilde{\nabla}f(x)$ denotes one subgradient in $\partial f(x)$. Let $[N]:=\{1,2,\ldots,N\}$, and let $\langle\cdot,\cdot\rangle$ be the Euclidean inner product on $\mathbb{R}^{n}$.

II Problem Formulation

The loss function at each time $t\geq 0$ is composed of two parts as follows:

F_{t}(x):=f_{t}(x)+r_{t}(x),\quad \mathrm{s.t.}~x\in\mathcal{X}, (2)

where $\mathcal{X}\subset\mathbb{R}^{n}$ is a feasible constraint set, and the functions $f_{t},r_{t}:\mathcal{X}\to\mathbb{R}$ are not necessarily differentiable. At each instant $t$, an action $x_{t}\in\mathcal{X}$ is selected by the learner, and then the loss functions $f_{t},r_{t}:\mathcal{X}\to\mathbb{R}$ are chosen by an adversary and revealed to the learner. A proper action sequence $\{x_{t}\}_{t=1}^{T}$ is sought such that the dynamic regret is sublinear in the horizon $T$, i.e., $\lim_{T\to\infty}\frac{Reg_{T}^{d}}{T}=0$, where

\bm{Reg}_{T}^{d}:=\sum_{t=1}^{T}F_{t}(x_{t})-\sum_{t=1}^{T}F_{t}(u_{t}), (3)

and $\{u_{t}\}_{t=1}^{T}$ is a sequence of arbitrary feasible comparators. It is worth noting that an important case of the dynamic regret is $u_{t}=x_{t}^{*}:=\arg\min_{x\in\mathcal{X}}F_{t}(x)$.
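As a small illustration, the metric (3) can be evaluated directly once the losses, actions, and comparators are available; the sketch below assumes the losses are given as Python callables.

```python
def dynamic_regret(losses, actions, comparators):
    """Dynamic regret (3): sum_t F_t(x_t) - sum_t F_t(u_t).

    losses      : list of callables, losses[t-1](x) evaluates F_t at x.
    actions     : learner's actions x_1, ..., x_T.
    comparators : feasible comparators u_1, ..., u_T.
    """
    return sum(F(x) - F(u) for F, x, u in zip(losses, actions, comparators))
```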

Examples of applications using the time-varying nonsmooth ftf_{t} and the time-varying rtr_{t} are briefly presented below.

II-1 Examples of time-varying nonsmooth $f_{t}(x)$

In machine learning, signal processing, and statistics, many nonsmooth error functions are widely used for regression and classification problems[36]. In an online streaming data-generating process, the probability density function may change with time[37]. Some examples of online nonsmooth convex loss functions are listed below, where $y_{t}$ denotes the label and $a_{t}$ the feature vector; a subgradient sketch for the first of them follows the list.

  • Hinge loss:

    f_{t}(x)=\max(0,1-y_{t}a_{t}^{\top}x).
  • Generalized hinge loss:

    f_{t}(x)=\begin{cases}1-\alpha y_{t}a_{t}^{\top}x, & \text{if}~y_{t}a_{t}^{\top}x\leqslant 0\\ 1-y_{t}a_{t}^{\top}x, & \text{if}~0<y_{t}a_{t}^{\top}x<1\\ 0, & \text{if}~y_{t}a_{t}^{\top}x\geqslant 1,\end{cases}

    where $\alpha>1$.

  • Absolute loss:

    f_{t}(x)=\max_{\alpha\in[-1,1]}\alpha(y_{t}-a_{t}^{\top}x).
  • $\epsilon$-insensitive loss:

    f_{t}(x)=\max(|y_{t}-a_{t}^{\top}x|-\epsilon,0),

    where $\epsilon$ is a small positive constant.
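Since OPG (introduced in Section III) queries only one subgradient of $f_{t}$ per round, each loss above is easy to use in practice. The following minimal sketch of a subgradient oracle for the hinge loss illustrates this; the function name is an assumption of this example.

```python
import numpy as np

def hinge_subgradient(x, a_t, y_t):
    """One subgradient of f_t(x) = max(0, 1 - y_t * a_t^T x)."""
    if y_t * (a_t @ x) < 1.0:
        return -y_t * a_t            # gradient of the active affine piece
    return np.zeros_like(x)          # flat (zero) piece: 0 is a valid subgradient
```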

II-2 Examples of time-varying nondifferentiable $r_{t}(x)$

  • The weighted $\ell_1$ norm function:
    Assume the optimal solution is sparse, with many zero or near-zero coefficients. To exploit this, let $r_{t}(x)$ be the weighted $\ell_1$ norm function[38, 39, 40]:

    r_{t}(x):=\rho\sum_{i=1}^{N}\omega_{i}^{t}|x_{i}|, (4)

    where $x_{i}$ is the $i$-th component of the $N$-dimensional $x$ and $\rho>0$ is a regularization parameter. The weight $\omega_{i}^{t}$ $(i=1,2,\ldots,N)$ is defined by

    \omega_{i}^{t}:=\begin{cases}\epsilon, & \text{if}~|x_{i}^{t-1}|>\tau\\ 1, & \text{otherwise},\end{cases}

    with a parameter $\tau>0$ and a small $\epsilon>0$, where $x_{i}^{t-1}$ is the $(t-1)$-th iterate of $x_{i}$ produced by the algorithm at hand. The parameter $\tau$ may rely on noise statistics, e.g., the variance. With the weights $\omega_{i}^{t}$, larger $|x_{i}|$ receive smaller coefficients and are thus less affected by the proximity operator. Therefore, the weighted $\ell_1$ norm exploits sparsity more efficiently; a proximal-step sketch is given after this list.

  • The indicator function of time-varying set constraints:
    In a network flow problem, to handle the constraints $x\in\mathcal{X}_{t}$ with $\mathcal{X}_{t}\subset\mathcal{X}$ being a convex set, the authors in [28] use indicator functions based on a given network $(\mathcal{N},\mathcal{E})$. The relevant variables are defined as follows: $z(i,s)$ denotes the rate produced at node $i$ for traffic $s$, and $x(ij,s)$ is the flow between nodes $i$ and $j$ for traffic $s$. For brevity, let $z$ and $x$ be the stacked traffic and link rates, respectively. The time-varying nonsmooth $r_{t}(z,x)$ is composed of two functions, $r(z,x)$ and $\delta_{\mathcal{X}_{t}}(z,x)$, where $r(z,x)=\frac{\nu}{2}(\lVert z\rVert^{2}+\lVert x\rVert^{2})$ is convex with some $\nu>0$ and $\delta_{\mathcal{X}_{t}}(z,x)$ is an indicator function of the constraints, defined as

    \delta_{\mathcal{X}_{t}}(z,x):=\begin{cases}+\infty, & x\notin\mathcal{X}_{t}\\ 0, & x\in\mathcal{X}_{t}.\end{cases}

    The relevant constraints in the network are as follows. The link capacity constraint is $0\leqslant\sum_{s}x(ij,s)+w_{t}(ij)\leqslant c_{t}(ij)$, where $c_{t}(ij)$ is the time-varying link capacity and $w_{t}(ij)$ is a time-varying, non-controllable link traffic. The flow conservation constraint implies $z(s)=\bar{R}x(s)$ with the routing matrix $\bar{R}$, and $z^{\max}$ denotes the maximal traffic rate. Therefore, the time-varying nonsmooth $r_{t}(z,x)$ can be written as

    r_{t}(z,x)=\sum_{i,s}\delta_{\{0\leqslant\sum_{s}x(ij,s)+w_{t}(ij)\leqslant c_{t}(ij)\}}+\delta_{\{0\leqslant z\leqslant z^{\max}\}}+\sum_{s}\delta_{\{z(s)=\bar{R}x(s)\}}+\frac{\nu}{2}(\lVert z\rVert^{2}+\lVert x\rVert^{2}).
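Returning to the first example above, when $\mathcal{X}=\mathbb{R}^{N}$ the proximal operator of the weighted $\ell_1$ norm (4) reduces to componentwise soft-thresholding. The sketch below makes this assumption explicit; with a nontrivial constraint set, an additional projection step would be needed.

```python
import numpy as np

def weights(x_prev, tau, eps):
    """Weights omega_i^t built from the previous iterate, as defined in (4)."""
    return np.where(np.abs(x_prev) > tau, eps, 1.0)

def prox_weighted_l1(v, eta, rho, w):
    """prox_{eta * r_t}(v) for r_t(x) = rho * sum_i w_i |x_i|,
    assuming X = R^N, i.e., componentwise soft-thresholding."""
    return np.sign(v) * np.maximum(np.abs(v) - eta * rho * w, 0.0)
```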

III Main Results

This section introduces the OPG algorithm and presents the main results.

Firstly, we describe how the OPG algorithm updates the iterate $x_{t}$. At each iteration $t$, based on a subgradient of $f_{t}(x)$ and a positive step size $\eta_{t}$, the next iterate $x_{t+1}$ is computed as

x_{t+1}=\mathrm{prox}_{\eta_{t}r_{t}}(x_{t}-\eta_{t}\tilde{\nabla}f_{t}(x_{t})), (5)

where the proximal operator is defined by

\mathrm{prox}_{\eta_{t}r_{t}}(x):=\arg\min_{u\in\mathcal{X}}\{r_{t}(u)+\frac{1}{2\eta_{t}}\lVert u-x\rVert^{2}\}. (6)

Then based on (5) and (6), the update is equivalent to

x_{t+1}=\arg\min_{x\in\mathcal{X}}\{r_{t}(x)+\langle x,\tilde{\nabla}f_{t}(x_{t})\rangle+\frac{1}{2\eta_{t}}\lVert x-x_{t}\rVert^{2}\}. (7)

At each iteration $t$, the algorithm assumes the availability of $\tilde{\nabla}f_{t}(x_{t})$.

Algorithm 1 Online Proximal Gradient (OPG)
Input: initial vector $x_{1}$, learning rates $\eta_{t}$ with $1\leqslant t\leqslant T$.
Output: $x_{2},x_{3},\ldots,x_{T}$
1:  for all $t=1,2,\ldots,T$ do
2:     Compute the next action:
x_{t+1}=\arg\min_{x\in\mathcal{X}}\{r_{t}(x)+\langle x,\tilde{\nabla}f_{t}(x_{t})\rangle+\frac{1}{2\eta_{t}}\lVert x-x_{t}\rVert^{2}\}.
3:     return $x_{t+1}$
4:  end for
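A minimal sketch of Algorithm 1 is given below, assuming access to a subgradient oracle for $f_{t}$ and a routine evaluating the proximal operator (6) over $\mathcal{X}$; the callable signatures are illustrative assumptions.

```python
def opg(x1, T, eta, subgrad_f, prox_r):
    """Online Proximal Gradient (Algorithm 1), a minimal sketch.

    x1        : initial vector in X.
    eta       : eta(t) returns the step size at round t (1-indexed).
    subgrad_f : subgrad_f(t, x) returns one subgradient of f_t at x.
    prox_r    : prox_r(t, v, s) evaluates prox_{s * r_t}(v) over X, as in (6).
    """
    x, actions = x1, []
    for t in range(1, T + 1):
        g = subgrad_f(t, x)
        x = prox_r(t, x - eta(t) * g, eta(t))   # the update (5)
        actions.append(x)
    return actions
```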

To proceed, the following standard assumptions are useful in the analysis.

Assumption 1.

The functions $f_{t},r_{t}:\mathcal{X}\to\mathbb{R}$ are closed and convex for all $t\in[T]$.

Assumption 1 ensures the existence of an optimal point for $f_{t}$ and $r_{t}$. Besides, by convexity, for any $x,y\in\mathcal{X}$ and any subgradients $\tilde{\nabla}f_{t}(x)\in\partial f_{t}(x)$ and $\tilde{\nabla}r_{t}(x)\in\partial r_{t}(x)$, one has:

f_{t}(y)\geqslant f_{t}(x)+\langle\tilde{\nabla}f_{t}(x),y-x\rangle,\quad\forall x,y\in\mathcal{X},
r_{t}(y)\geqslant r_{t}(x)+\langle\tilde{\nabla}r_{t}(x),y-x\rangle,\quad\forall x,y\in\mathcal{X}.
Assumption 2.

The function $f_{t}:\mathcal{X}\to\mathbb{R}$ is $\mu$-strongly convex for all $t\in[T]$.

According to $\mu$-strong convexity, for any $x,y\in\mathcal{X}$ and any subgradient $\tilde{\nabla}f_{t}(x)\in\partial f_{t}(x)$, one has:

f_{t}(y)\geqslant f_{t}(x)+\langle\tilde{\nabla}f_{t}(x),y-x\rangle+\frac{\mu}{2}\lVert x-y\rVert^{2}.
Assumption 3.

The set $\mathcal{X}$ is convex and compact, and $\lVert x-y\rVert\leqslant R$ for some $R>0$ and all $x,y\in\mathcal{X}$.

Assumption 4.

For any $x\in\mathcal{X}$, $\lVert\tilde{\nabla}F_{t}(x)\rVert\leqslant M$ for some $M>0$.

If $\lVert\tilde{\nabla}f_{t}(x)\rVert$ and $\lVert\tilde{\nabla}r_{t}(x)\rVert$ are upper-bounded by $m_{f}$ and $m_{r}$, respectively, one can set $M=m_{f}+m_{r}$.

III-A Dynamic Regret Bound for Convex $f_{t}$

In this case, the main result is provided in the following theorem.

Theorem 1.

Let Assumptions 1, 3, and 4 hold. Set the non-increasing step size $\eta_{t}=t^{-\gamma}\sigma$ in OPG, where $\sigma$ is the constant

\sigma=\frac{\sqrt{2(1-\gamma)RT^{2\gamma-\beta-1}D_{\beta}(T)+R^{2}T^{2\gamma-1}}}{M},

and $\gamma\in[\beta,1)$. Then there holds

\bm{Reg}_{T}^{d}=\mathcal{O}(\sqrt{T^{1-\beta}D_{\beta}(T)+T}), (8)

where $D_{\beta}(T)$ is defined in (1).

Proof.

See Appendix A. ∎

Remark 1: Note that the bound (8) is comparable to the best-known one, $\mathcal{O}(\sqrt{T^{1-\beta}D_{\beta}(T)}+\sqrt{T})$, established in [26], where time-invariant regularizers are considered; here, more general time-varying regularizers are handled.

Remark 2: Note that the bound $\mathcal{O}(\sqrt{T(D_{0}(T)+1)})$ in [24] is an optimal dynamic regret, matching the lower bound given in [24, Theorem 2]. By setting $\beta=0$, that bound is recovered as a special case of (8).
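For illustration, the step-size schedule of Theorem 1 can be computed as follows; the sketch assumes that $D_{\beta}(T)$ (or an estimate of it, cf. Remark 5) and the constants $R$ and $M$ are available in advance.

```python
import numpy as np

def theorem1_stepsizes(T, R, M, beta, gamma, D_beta_T):
    """Step sizes eta_t = t^{-gamma} * sigma from Theorem 1, gamma in [beta, 1)."""
    sigma = np.sqrt(2 * (1 - gamma) * R * T ** (2 * gamma - beta - 1) * D_beta_T
                    + R ** 2 * T ** (2 * gamma - 1)) / M
    t = np.arange(1, T + 1)
    return sigma * t ** (-gamma)
```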

III-B Dynamic Regret Bound for Strongly Convex $f_{t}$

The following Lemma 1 gives a per-round upper bound valid for any point in $\mathcal{X}$ and any positive step size. Moreover, when $\mu=0$, it also covers the case without strong convexity, which is discussed in Section III-A. It will be useful for deriving the main results.

Lemma 1.

Let Assumptions 1, 2, 3, and 4 hold. For any $u_{t}\in\mathcal{X}$, any positive step size $\eta_{t}>0$, and the iterate $x_{t}$ in OPG, there holds

F_{t}(x_{t})-F_{t}(u_{t})\leqslant\frac{1}{2}\big(\frac{1}{\eta_{t}}-\mu\big)\lVert u_{t}-x_{t}\rVert^{2}-\frac{1}{2\eta_{t}}\lVert u_{t}-x_{t+1}\rVert^{2}+\frac{\eta_{t}}{2}\lVert\tilde{\nabla}f_{t}(x_{t})+\tilde{\nabla}r_{t}(x_{t})\rVert^{2}. (9)
Proof.

See Appendix B. ∎

Now, by summing (9) over $t$ and selecting an appropriate step size $\eta_{t}$, the following theorem characterizes the dynamic regret bound for strongly convex $f_{t}$.

Theorem 2.

Let Assumptions 1, 2, 3, and 4 hold. For any $0\leqslant\beta<1$ in OPG, set the non-increasing step size $\eta_{t}=\frac{\gamma}{t}$ and choose $\gamma$ as

\gamma=\frac{2RT^{-\beta}D_{\beta}(T)+R^{2}}{\delta R^{2}+(\mu-\delta)\frac{1}{T}\lVert u_{1}-x_{1}\rVert^{2}},

where $\delta\in(0,\mu)$ is small enough that $\gamma\delta<1$, and $u_{1}$ and $x_{1}$ are the initial comparator and the initial vector in OPG, respectively. Then there holds

\bm{Reg}_{T}^{d}=\mathcal{O}(\log T(1+T^{-\beta}D_{\beta}(T))), (10)

where $D_{\beta}(T)$ is defined in (1).

Proof.

See Appendix C. ∎

It is worth noting that $\gamma$ in Theorem 2 is a fixed value, while $\gamma$ in Theorem 1 can be any value in $[\beta,1)$; they are chosen differently in the two scenarios.

Remark 3: Choosing $\beta=0$, the dynamic regret is of order $\mathcal{O}(\log T(1+D_{0}(T)))$, which matches the result for the distributed proximal gradient algorithm over dynamic graphs [27]. However, Theorem 2 drops the $\ell$-smoothness assumption on $f_{t}(x)$ required in [27]. Existing works for composite optimization that establish dynamic regret upper bounds under strongly convex $f_{t}(x)$ usually assume smoothness of $f_{t}(x)$ [28, 30, 41]. Therefore, the result without the smoothness assumption here greatly expands the scope of application.

Remark 4: Theorem 2 also applies to static regret. If the sequence $\{u_{t}\}_{t=1}^{T}$ is time-invariant, then $D_{\beta}(T)=\sum_{t=2}^{T}t^{\beta}\lVert u_{t}-u_{t-1}\rVert=0$. In this case, $\bm{Reg}_{T}^{d}=\mathcal{O}(\log T)$ is obtained, which is the optimal static regret bound in the strongly convex case[16]. Therefore, Theorem 2 may be optimal in this sense, which is of interest as one possible future work.

Remark 5: A promising extension can be considered in future work. The positive step size $\eta_{t}$ depends on $D_{\beta}(T)$ in Theorems 1 and 2. However, in most dynamic environments, $D_{\beta}(T)$ may be unknown in advance. In [26], the authors pointed out that in some online learning cases, $D_{\beta}(T)$ can be estimated from observed data. By adjusting the step size, the dynamic regret is expected to decrease at an appropriate rate, and $D_{\beta}(T)$ may then be derived in reverse. Moreover, the step-size selection in Theorems 1 and 2 requires knowing $T$ in advance. If $T$ is unknown, similar results can still be ensured by leveraging the so-called doubling trick, as done in[42].

Remark 6: The result in Theorem 2 can be further improved when $f_{t}$ is also $\ell$-smooth. The result in [28] shows that

\lVert x_{t+1}-x_{t}^{*}\rVert\leqslant\rho_{t}\lVert x_{t}-x_{t}^{*}\rVert, (11)

where $\rho_{t}:=\max\{|1-\eta_{t}\mu|,|1-\eta_{t}\ell|\}$ and $x_{t}^{*}$ is the optimal point of $F_{t}(x)=f_{t}(x)+r_{t}(x)$. Based on the convexity of $F_{t}$ and the bound on $\lVert\tilde{\nabla}F_{t}\rVert$, the upper bound for the strongly convex and smooth case is of order $\mathcal{O}(1+D_{0}(T))$, matching the best-known one in [28].

IV Numerical Examples

Numerical experiments are provided here to show the performance of OPG in comparison with some state-of-the-art algorithms that are valid for online composite convex optimization.

Objective functions: The performance of OPG is tested by optimizing two types of loss functions, corresponding to the two cases in the theoretical part. Based on the nonsmooth convex hinge loss and the time-varying nondifferentiable regularization function $r_{t}(x)$ defined in (4), two online regression objectives are defined:

F^{1}_{t}(x)=l_{t}(x)+r_{t}(x), (12)
F^{2}_{t}(x)=l_{t}(x)+\frac{\lambda}{2}\lVert x\rVert^{2}+r_{t}(x). (13)

Here $l_{t}(x)$ is the hinge loss, defined by

l_{t}(x)=\max(0,1-y_{t}a_{t}^{\top}x),

where $y_{t}$ and $a_{t}$ are generated from the dataset Usenet2. Usenet2 is based on the newsgroups collection and records which messages are considered interesting or junk in each time period; the 99-dimensional vector serves as the feature $a_{t}$ and the 1-dimensional value as the label $y_{t}$. The parameters of $r_{t}(x)$ are $\rho=0.4$, $\tau=1$, and $\epsilon=0.1$ for both $F^{1}_{t}(x)$ and $F^{2}_{t}(x)$. The parameter of $f_{t}(x)=l_{t}(x)+\frac{\lambda}{2}\lVert x\rVert^{2}$ is $\lambda=1$ for $F^{2}_{t}(x)$.
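Combining the earlier sketches, the two objectives can be wired into the OPG sketch of Section III as follows; the arrays `A` (features) and `y` (labels) stand in for the Usenet2 data and, like the helper names, are assumptions of this illustration.

```python
import numpy as np

rho, tau, eps, lam = 0.4, 1.0, 0.1, 1.0   # experiment parameters from the text

def make_oracles(A, y, x1, strongly_convex=False):
    """Oracles for F_t^1 (12) (strongly_convex=False) or F_t^2 (13) (True).

    A, y : feature rows a_t and labels y_t (placeholders for the Usenet2 data).
    x1   : initial point, used to seed the weights of r_t.
    """
    state = {"x_prev": x1}                 # previous iterate, defines omega^t

    def subgrad_f(t, x):
        state["x_prev"] = x                # x_t becomes the reference for omega
        g = hinge_subgradient(x, A[t - 1], y[t - 1])
        return g + lam * x if strongly_convex else g

    def prox_r(t, v, s):
        w = weights(state["x_prev"], tau, eps)
        return prox_weighted_l1(v, s, rho, w)

    return subgrad_f, prox_r
```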

The two functions represent different types of loss. Since the first term $f_{t}(x)$ of $F^{1}_{t}(x)$ is the nonsmooth convex loss $l_{t}(x)$, $F^{1}_{t}(x)$ shows the performance of OPG for nonsmooth and convex $f_{t}(x)$. Since the first term $f_{t}(x)$ of $F^{2}_{t}(x)$ is the nonsmooth strongly convex loss $l_{t}(x)+\frac{\lambda}{2}\lVert x\rVert^{2}$, $F^{2}_{t}(x)$ shows the performance of OPG for nonsmooth and strongly convex $f_{t}(x)$. Besides, both loss functions have a time-varying nonsmooth regularization function. Based on the theoretical analysis in Section III, the OPG algorithm is able to solve both of them.

Other online methods: In order to show the effectiveness of the proposed OPG algorithm, let us consider the following efficient online algorithms.

IV-1 SAGE (Stochastic Accelerated GradiEnt)

It is an accelerated gradient method for stochastic composite optimization[6]. Although designed for the stochastic setting, the authors also propose an accelerated gradient method for the online setting. Each iteration of the method takes the form

x_{t}=(1-\alpha_{t})y_{t-1}+\alpha_{t}z_{t-1},
y_{t}=\arg\min_{x}\{\langle\tilde{\nabla}f_{t-1}(x_{t}),x-x_{t}\rangle+\frac{L_{t}}{2}\lVert x-x_{t}\rVert^{2}+r_{t}(x)\},
z_{t}=z_{t-1}-\alpha_{t}(L_{t}+\mu\alpha_{t})^{-1}[L_{t}(x_{t}-y_{t})+\mu(z_{t-1}-x_{t})],

where $\{\alpha_{t}\}$ and $\{L_{t}\}$ are two parameter sequences and $\mu$ is the strong-convexity constant of $F_{t}(x)$. With suitably chosen $\{\alpha_{t}\}$ and $\{L_{t}\}$, it has established static regret bounds of order $\mathcal{O}(\sqrt{T})$ for convex $f_{t}(x)$ and $\mathcal{O}(\log T)$ for strongly convex $f_{t}(x)$[6].

IV-2 AC-SA (Accelerated Stochastic Approximation)

It is a stochastic gradient descent algorithm for convex composite optimization problems[7]. For a fair comparison, we use the true subgradient $\tilde{\nabla}f_{t}$ instead of a stochastic one. Each iteration of the method takes the form

x_{t}^{md}=\frac{(1-\alpha_{t})(\mu+\gamma_{t})}{\gamma_{t}+(1-\alpha_{t}^{2})\mu}x_{t-1}^{ag}+\frac{\alpha_{t}[(1-\alpha_{t})\mu+\gamma_{t}]}{\gamma_{t}+(1-\alpha_{t}^{2})\mu}x_{t-1},
x_{t}=\arg\min_{x}\{\alpha_{t}[\langle\tilde{\nabla}f_{t}(x_{t}^{md}),x\rangle+r_{t}(x)+\mu V(x_{t}^{md},x)]+[(1-\alpha_{t})\mu+\gamma_{t}]V(x_{t-1},x)\},
x_{t}^{ag}=\alpha_{t}x_{t}+(1-\alpha_{t})x_{t-1}^{ag},

where $\{\alpha_{t}\}$ and $\{\gamma_{t}\}$ are two parameter sequences, $\mu$ denotes the strong-convexity constant of $F_{t}(x)$, and $V(\cdot,\cdot)$ is the Bregman divergence; here we choose $V(x,z)=\frac{1}{2}\lVert x-z\rVert^{2}$. Optimal convergence rates for tackling different types of stochastic composite optimization problems were discussed in [7].

IV-3 RDA (Regularized Dual Averaging Method)

It is an extension of Nesterov's dual averaging method to composite optimization problems, where one term is a convex loss function and the other a regularization term[8]. Each iteration of the method takes the form

x_{t+1}=\arg\min_{x}\{\frac{1}{t}\sum_{\tau=1}^{t}\langle\tilde{\nabla}f_{\tau}(x_{\tau}),x\rangle+r_{t}(x)+\frac{\beta_{t}}{t}h(x)\},

where $\{\beta_{t}\}$ is a parameter sequence and $h(x)$ is an auxiliary strongly convex function; here we choose $h(x)=\frac{1}{2}\lVert x\rVert^{2}$. It ensures an $\mathcal{O}(\sqrt{T})$ bound on static regret with $\beta_{t}=\mathcal{O}(\sqrt{T})$ for convex losses[8]. A sketch of this update follows.
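With $h(x)=\frac{1}{2}\lVert x\rVert^{2}$ and the minimization taken over $\mathbb{R}^{n}$, the RDA update reduces to a single proximal step on the averaged subgradients; a minimal sketch under these assumptions:

```python
def rda_step(t, grad_sum, beta_t, prox_r):
    """One RDA update with h(x) = ||x||^2 / 2 over R^n, a minimal sketch.

    grad_sum : running sum of subgradients tilde-grad f_tau(x_tau), tau <= t.
    prox_r   : prox_r(t, v, s) evaluates prox_{s * r_t}(v).
    Completing the square shows the argmin equals
    prox_{(t/beta_t) r_t}(-(t/beta_t) * gbar), with gbar = grad_sum / t.
    """
    scale = t / beta_t
    return prox_r(t, -scale * (grad_sum / t), scale)
```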

IV-A Performance on Nonsmooth Convex $f_{t}(x)$

For $F^{1}_{t}(x)$ in (12), a time-varying nondifferentiable regularization function is added to a nonsmooth convex loss function. The step size $\eta_{t}$ is set to $0.001/t$ in OPG, while the step sizes in the other three algorithms are set optimally based on their analyses. The simulation is run 1500 times, and the time-averaged dynamic regret $\bm{Reg}_{T}^{d}/T$ is shown in Fig. 1. It can be seen that OPG performs the best in our simulation.

Figure 1: The performance on $F^{1}_{t}(x)$.

IV-B Performance on Nonsmooth Strongly Convex $f_{t}(x)$

The example (13) shows the performance of OPG on nonsmooth strongly convex $f_{t}(x)$ with time-varying nondifferentiable $r_{t}(x)$. To achieve the best possible upper bound for each algorithm, the step sizes are set according to their respective analyses. The simulation is run 1500 times. Fig. 2 shows how the time-averaged dynamic regret $\bm{Reg}_{T}^{d}/T$ changes with the horizon $T$; again, OPG performs the best in the simulation.

Figure 2: The performance on $F^{2}_{t}(x)$.

Therefore, the numerical results show that, compared with the other three algorithms, the OPG algorithm performs the best in the two cases considered in the theoretical analysis.

V Conclusion

This paper considers online optimization with composite loss functions, where both terms are time-varying and nonsmooth. The loss function is optimized by the online proximal gradient (OPG) algorithm, analyzed in terms of the extended path variation $D_{\beta}(T)$. Upper bounds on its dynamic regret are obtained for two cases. In the first case, where $f_{t}(x)$ is convex, we established a regret bound comparable to the best-known one in the existing literature, while extending the problem to time-varying regularizers. In the second case, where $f_{t}(x)$ is strongly convex, we obtained a better bound than prior work while removing the smoothness assumption. Finally, numerical experiments verified the performance of the OPG algorithm in both cases. Possible future directions include investigating whether the obtained regret bounds are optimal and relaxing the step-size selection so that it does not depend on $D_{\beta}(T)$.

In the following appendices, detailed proofs of the lemmas and theorems are given.

Appendix A Proof of Theorem 1

First, the following result reveals the relationship between any sequence $\{u_{t}\}_{t=1}^{T}\subset\mathcal{X}$ and the iterate sequence $\{x_{t}\}_{t=1}^{T}$ in OPG. If Assumption 3 is satisfied, then for any non-increasing step sizes $0<\eta_{t+1}\leqslant\eta_{t}$ in the OPG algorithm, it follows from [26, Lemma 4] that

\sum_{t=1}^{T}\frac{1}{\eta_{t}}(\lVert u_{t}-x_{t}\rVert^{2}-\lVert u_{t}-x_{t+1}\rVert^{2})\leqslant 2R\sum_{t=1}^{T-1}\frac{1}{\eta_{t}}\lVert u_{t+1}-u_{t}\rVert+\frac{R^{2}}{\eta_{T}}, (14)

where $R$ is defined in Assumption 3.

When $\mu=0$, Lemma 1 covers the case of convex $f_{t}(x)$, and (9) reduces to

F_{t}(x_{t})-F_{t}(u_{t})\leqslant\frac{1}{2\eta_{t}}\lVert u_{t}-x_{t}\rVert^{2}-\frac{1}{2\eta_{t}}\lVert u_{t}-x_{t+1}\rVert^{2}+\frac{\eta_{t}}{2}\lVert\tilde{\nabla}f_{t}(x_{t})+\tilde{\nabla}r_{t}(x_{t})\rVert^{2}. (15)

According to (14), the sum of $\frac{1}{2\eta_{t}}\lVert u_{t}-x_{t}\rVert^{2}-\frac{1}{2\eta_{t}}\lVert u_{t}-x_{t+1}\rVert^{2}$ is bounded by $R\sum_{t=1}^{T-1}\frac{1}{\eta_{t}}\lVert u_{t+1}-u_{t}\rVert+\frac{R^{2}}{2\eta_{T}}$. Assumption 4 then gives the upper bound $M^{2}$ on $\lVert\tilde{\nabla}f_{t}(x_{t})+\tilde{\nabla}r_{t}(x_{t})\rVert^{2}$. Hence, (15) implies

\sum_{t=1}^{T}(F_{t}(x_{t})-F_{t}(u_{t}))\leqslant R\sum_{t=1}^{T-1}\frac{1}{\eta_{t}}\lVert u_{t+1}-u_{t}\rVert+\frac{R^{2}}{2\eta_{T}}+\frac{M^{2}}{2}\sum_{t=1}^{T}\eta_{t}. (16)

Multiplying and dividing $\lVert u_{t+1}-u_{t}\rVert$ by $t^{\beta}$ and bounding $\frac{1}{\eta_{t}t^{\beta}}$ by $\max_{t\in[T]}\{\frac{1}{\eta_{t}t^{\beta}}\}$, the definition of $D_{\beta}(T)$ and (16) yield

\sum_{t=1}^{T}(F_{t}(x_{t})-F_{t}(u_{t}))\leqslant R\max_{t\in[T]}\big\{\frac{1}{\eta_{t}t^{\beta}}\big\}\sum_{t=1}^{T-1}t^{\beta}\lVert u_{t+1}-u_{t}\rVert+\frac{R^{2}}{2\eta_{T}}+\frac{M^{2}}{2}\sum_{t=1}^{T}\eta_{t}=R\max_{t\in[T]}\big\{\frac{1}{\eta_{t}t^{\beta}}\big\}D_{\beta}(T)+\frac{R^{2}}{2\eta_{T}}+\frac{M^{2}}{2}\sum_{t=1}^{T}\eta_{t}. (17)

By setting $\eta_{t}=t^{-\gamma}\sigma$ with $\sigma$ a constant and $\gamma\in[\beta,1)$, so that $\max_{t\in[T]}\{\frac{1}{\eta_{t}t^{\beta}}\}=\frac{T^{\gamma-\beta}}{\sigma}$, (17) yields

\sum_{t=1}^{T}(F_{t}(x_{t})-F_{t}(u_{t}))\leqslant\frac{RT^{\gamma-\beta}}{\sigma}D_{\beta}(T)+\frac{R^{2}T^{\gamma}}{2\sigma}+\frac{M^{2}\sigma}{2}\sum_{t=1}^{T}t^{-\gamma}\leqslant\frac{RT^{\gamma-\beta}}{\sigma}D_{\beta}(T)+\frac{R^{2}T^{\gamma}}{2\sigma}+\frac{M^{2}\sigma T^{1-\gamma}}{2(1-\gamma)}. (18)

According to the Cauchy inequality, choosing the optimal $\sigma$ as

\sigma=\frac{\sqrt{2(1-\gamma)RT^{2\gamma-\beta-1}D_{\beta}(T)+R^{2}T^{2\gamma-1}}}{M},

it can then be obtained that

\sum_{t=1}^{T}(F_{t}(x_{t})-F_{t}(u_{t}))\leqslant\sqrt{\frac{M^{2}(2RT^{1-\beta}D_{\beta}(T)+TR^{2})}{1-\gamma}}. (19)

Thus, there holds

\bm{Reg}_{T}^{d}=\mathcal{O}(\sqrt{T^{1-\beta}D_{\beta}(T)+T}).

 

Appendix B Proof of Lemma 1

Let us start by defining the Bregman divergence $B_{\psi}(x,u)$ associated with a differentiable, strongly convex function $\psi(x)$. The following properties of the Bregman divergence will be useful[26]:

  • a. $B_{\psi}(x,u):=\psi(x)-\psi(u)-\langle\nabla\psi(u),x-u\rangle$ (its definition);

  • b. $\nabla_{x}B_{\psi}(x,u)=\nabla\psi(x)-\nabla\psi(u)$;

  • c. $B_{\psi}(x,u)=B_{\psi}(x,z)+B_{\psi}(z,u)+\langle x-z,\nabla\psi(z)-\nabla\psi(u)\rangle$.

Let $\psi(x)=\frac{1}{2}\lVert x\rVert^{2}$ in the following. By Algorithm 1, for any sequence $\{u_{t}\}_{t=1}^{T}\subset\mathcal{X}$, one has

F_{t}(x_{t})-F_{t}(u_{t})=f_{t}(x_{t})+r_{t}(x_{t})-f_{t}(u_{t})-r_{t}(u_{t}).

Invoking the optimality criterion for (7) implies that, for any $x\in\mathcal{X}$,

0\leqslant\langle x-x_{t+1},\tilde{\nabla}f_{t}(x_{t})+\tilde{\nabla}r_{t}(x_{t+1})+\frac{1}{\eta_{t}}(x_{t+1}-x_{t})\rangle. (20)

Choosing $x=u_{t}$ in this result, it follows that

\langle\tilde{\nabla}f_{t}(x_{t})+\tilde{\nabla}r_{t}(x_{t+1}),x_{t+1}-u_{t}\rangle\leqslant\frac{1}{\eta_{t}}\langle x_{t+1}-x_{t},u_{t}-x_{t+1}\rangle. (21)

The strong convexity of $f_{t}$ and the convexity of $r_{t}$ imply that

f_{t}(x_{t})+r_{t}(x_{t})-f_{t}(u_{t})-r_{t}(u_{t})
= f_{t}(x_{t})+r_{t}(x_{t+1})-f_{t}(u_{t})-r_{t}(u_{t})+r_{t}(x_{t})-r_{t}(x_{t+1})
\leqslant \langle\tilde{\nabla}f_{t}(x_{t}),x_{t}-u_{t}\rangle-\frac{\mu}{2}\lVert u_{t}-x_{t}\rVert^{2}+\langle\tilde{\nabla}r_{t}(x_{t+1}),x_{t+1}-u_{t}\rangle+\langle\tilde{\nabla}r_{t}(x_{t}),x_{t}-x_{t+1}\rangle
= \langle\tilde{\nabla}f_{t}(x_{t}),x_{t+1}-u_{t}\rangle+\langle\tilde{\nabla}f_{t}(x_{t}),x_{t}-x_{t+1}\rangle-\frac{\mu}{2}\lVert u_{t}-x_{t}\rVert^{2}+\langle\tilde{\nabla}r_{t}(x_{t+1}),x_{t+1}-u_{t}\rangle+\langle\tilde{\nabla}r_{t}(x_{t}),x_{t}-x_{t+1}\rangle
= \langle\tilde{\nabla}f_{t}(x_{t})+\tilde{\nabla}r_{t}(x_{t+1}),x_{t+1}-u_{t}\rangle-\frac{\mu}{2}\lVert u_{t}-x_{t}\rVert^{2}+\langle\tilde{\nabla}f_{t}(x_{t})+\tilde{\nabla}r_{t}(x_{t}),x_{t}-x_{t+1}\rangle. (22)

Combining (21) with (22) yields

f_{t}(x_{t})+r_{t}(x_{t})-f_{t}(u_{t})-r_{t}(u_{t})
\leqslant \frac{1}{\eta_{t}}\langle x_{t+1}-x_{t},u_{t}-x_{t+1}\rangle-\frac{\mu}{2}\lVert u_{t}-x_{t}\rVert^{2}+\langle\tilde{\nabla}f_{t}(x_{t})+\tilde{\nabla}r_{t}(x_{t}),x_{t}-x_{t+1}\rangle
= \frac{1}{\eta_{t}}\langle\nabla\psi(x_{t+1})-\nabla\psi(x_{t}),u_{t}-x_{t+1}\rangle-\frac{\mu}{2}\lVert u_{t}-x_{t}\rVert^{2}+\langle\tilde{\nabla}f_{t}(x_{t})+\tilde{\nabla}r_{t}(x_{t}),x_{t}-x_{t+1}\rangle
= \frac{1}{\eta_{t}}(B_{\psi}(u_{t},x_{t})-B_{\psi}(u_{t},x_{t+1})-B_{\psi}(x_{t+1},x_{t}))-\frac{\mu}{2}\lVert u_{t}-x_{t}\rVert^{2}+\langle\tilde{\nabla}f_{t}(x_{t})+\tilde{\nabla}r_{t}(x_{t}),x_{t}-x_{t+1}\rangle. (23)

By the definition of $\psi(x)$, the term $\langle x_{t+1}-x_{t},u_{t}-x_{t+1}\rangle$ equals $\langle\nabla\psi(x_{t+1})-\nabla\psi(x_{t}),u_{t}-x_{t+1}\rangle$, and the last equality in (23) follows from property c of the Bregman divergence. Thus, by (23), invoking the definition of the Bregman divergence leads to

f_{t}(x_{t})+r_{t}(x_{t})-f_{t}(u_{t})-r_{t}(u_{t})\leqslant\frac{1}{2\eta_{t}}(\lVert u_{t}-x_{t}\rVert^{2}-\lVert u_{t}-x_{t+1}\rVert^{2}-\lVert x_{t+1}-x_{t}\rVert^{2})-\frac{\mu}{2}\lVert u_{t}-x_{t}\rVert^{2}+\langle\tilde{\nabla}f_{t}(x_{t})+\tilde{\nabla}r_{t}(x_{t}),x_{t}-x_{t+1}\rangle. (24)

By Young’s inequality, one can obtain that

\langle\tilde{\nabla}f_{t}(x_{t})+\tilde{\nabla}r_{t}(x_{t}),x_{t}-x_{t+1}\rangle-\frac{1}{2\eta_{t}}\lVert x_{t}-x_{t+1}\rVert^{2}\leqslant\frac{\eta_{t}}{2}\lVert\tilde{\nabla}f_{t}(x_{t})+\tilde{\nabla}r_{t}(x_{t})\rVert^{2}. (25)

Thus, invoking (25), together with (24), yields that

f_{t}(x_{t})+r_{t}(x_{t})-f_{t}(u_{t})-r_{t}(u_{t})\leqslant\frac{1}{2\eta_{t}}(\lVert u_{t}-x_{t}\rVert^{2}-\lVert u_{t}-x_{t+1}\rVert^{2})-\frac{\mu}{2}\lVert u_{t}-x_{t}\rVert^{2}+\frac{\eta_{t}}{2}\lVert\tilde{\nabla}f_{t}(x_{t})+\tilde{\nabla}r_{t}(x_{t})\rVert^{2}.

 

Appendix C Proof of Theorem 2

According to Assumption 4, $\lVert\tilde{\nabla}f_{t}(x)+\tilde{\nabla}r_{t}(x)\rVert^{2}\leqslant M^{2}$. Substituting this into Lemma 1 and summing over $t$ yields

\sum_{t=1}^{T}(F_{t}(x_{t})-F_{t}(u_{t}))=\sum_{t=1}^{T}(f_{t}(x_{t})+r_{t}(x_{t})-f_{t}(u_{t})-r_{t}(u_{t}))\leqslant\frac{1}{2}\sum_{t=1}^{T}\big((\frac{1}{\eta_{t}}-\mu)\lVert u_{t}-x_{t}\rVert^{2}-\frac{1}{\eta_{t}}\lVert u_{t}-x_{t+1}\rVert^{2}\big)+\frac{M^{2}}{2}\sum_{t=1}^{T}\eta_{t}. (26)

Now let us bound the first term of (26) as follows

\sum_{t=1}^{T}\big((\frac{1}{\eta_{t}}-\mu)\lVert u_{t}-x_{t}\rVert^{2}-\frac{1}{\eta_{t}}\lVert u_{t}-x_{t+1}\rVert^{2}\big)
=\sum_{t=0}^{T-1}(\frac{1}{\eta_{t+1}}-\mu)\lVert u_{t+1}-x_{t+1}\rVert^{2}-\sum_{t=1}^{T}\frac{1}{\eta_{t}}\lVert u_{t}-x_{t+1}\rVert^{2}
=\sum_{t=0}^{T-1}(\frac{1}{\eta_{t+1}}-\frac{1}{\eta_{t}}-\mu)\lVert u_{t+1}-x_{t+1}\rVert^{2}-\sum_{t=1}^{T}\frac{1}{\eta_{t}}\lVert u_{t}-x_{t+1}\rVert^{2}+\sum_{t=0}^{T-1}\frac{1}{\eta_{t}}\lVert u_{t+1}-x_{t+1}\rVert^{2}
=\sum_{t=1}^{T-1}(\frac{1}{\eta_{t+1}}-\frac{1}{\eta_{t}}-\mu)\lVert u_{t+1}-x_{t+1}\rVert^{2}-\sum_{t=1}^{T}\frac{1}{\eta_{t}}\lVert u_{t}-x_{t+1}\rVert^{2}+\sum_{t=1}^{T-1}\frac{1}{\eta_{t}}\lVert u_{t+1}-x_{t+1}\rVert^{2}+(\frac{1}{\eta_{1}}-\mu)\lVert u_{1}-x_{1}\rVert^{2}
\leqslant\sum_{t=1}^{T-1}(\frac{1}{\eta_{t+1}}-\frac{1}{\eta_{t}}-\mu)\lVert u_{t+1}-x_{t+1}\rVert^{2}+(\frac{1}{\eta_{1}}-\mu)\lVert u_{1}-x_{1}\rVert^{2}+\sum_{t=1}^{T-1}\frac{1}{\eta_{t}}(\lVert u_{t+1}-x_{t+1}\rVert^{2}-\lVert u_{t}-x_{t+1}\rVert^{2}), (27)
where the last step drops the nonpositive term $-\frac{1}{\eta_{T}}\lVert u_{T}-x_{T+1}\rVert^{2}$.

Using the identity

\lVert u_{t}-x_{t+1}\rVert^{2}=\lVert u_{t+1}-x_{t+1}\rVert^{2}-2\langle u_{t+1}-x_{t+1},u_{t+1}-u_{t}\rangle+\lVert u_{t+1}-u_{t}\rVert^{2}, (28)

one has

\sum_{t=1}^{T}\big((\frac{1}{\eta_{t}}-\mu)\lVert u_{t}-x_{t}\rVert^{2}-\frac{1}{\eta_{t}}\lVert u_{t}-x_{t+1}\rVert^{2}\big)
\leqslant\sum_{t=1}^{T-1}(\frac{1}{\eta_{t+1}}-\frac{1}{\eta_{t}}-\mu)\lVert u_{t+1}-x_{t+1}\rVert^{2}+(\frac{1}{\eta_{1}}-\mu)\lVert u_{1}-x_{1}\rVert^{2}+\sum_{t=1}^{T-1}\frac{2}{\eta_{t}}\langle u_{t+1}-x_{t+1},u_{t+1}-u_{t}\rangle-\sum_{t=1}^{T-1}\frac{1}{\eta_{t}}\lVert u_{t+1}-u_{t}\rVert^{2}
\leqslant\sum_{t=1}^{T-1}(\frac{1}{\eta_{t+1}}-\frac{1}{\eta_{t}}-\mu)\lVert u_{t+1}-x_{t+1}\rVert^{2}+(\frac{1}{\eta_{1}}-\mu)\lVert u_{1}-x_{1}\rVert^{2}+\sum_{t=1}^{T-1}\frac{2}{\eta_{t}}\lVert u_{t+1}-x_{t+1}\rVert\lVert u_{t+1}-u_{t}\rVert-\sum_{t=1}^{T-1}\frac{1}{\eta_{t}}\lVert u_{t+1}-u_{t}\rVert^{2}
\leqslant\sum_{t=1}^{T-1}(\frac{1}{\eta_{t+1}}-\frac{1}{\eta_{t}}-\mu)\lVert u_{t+1}-x_{t+1}\rVert^{2}+(\frac{1}{\eta_{1}}-\mu)\lVert u_{1}-x_{1}\rVert^{2}+\sum_{t=1}^{T-1}\frac{2R}{\eta_{t}}\lVert u_{t+1}-u_{t}\rVert. (29)

Considering this upper bound, the dynamic regret is bounded above by

\sum_{t=1}^{T}(f_{t}(x_{t})+r_{t}(x_{t})-f_{t}(u_{t})-r_{t}(u_{t}))
\leqslant\frac{M^{2}}{2}\sum_{t=1}^{T}\eta_{t}+\frac{1}{2}\sum_{t=1}^{T-1}(\frac{1}{\eta_{t+1}}-\frac{1}{\eta_{t}}-\mu)\lVert u_{t+1}-x_{t+1}\rVert^{2}+\frac{1}{2}(\frac{1}{\eta_{1}}-\mu)\lVert u_{1}-x_{1}\rVert^{2}+\sum_{t=1}^{T-1}\frac{R}{\eta_{t}}\lVert u_{t+1}-u_{t}\rVert
\leqslant\frac{M^{2}}{2}\sum_{t=1}^{T}\eta_{t}+\frac{1}{2}\sum_{t=1}^{T-1}(\frac{1}{\eta_{t+1}}-\frac{1}{\eta_{t}}-\mu)\lVert u_{t+1}-x_{t+1}\rVert^{2}+\frac{1}{2}(\frac{1}{\eta_{1}}-\mu)\lVert u_{1}-x_{1}\rVert^{2}+R\max_{t\in[T]}\big\{\frac{1}{\eta_{t}t^{\beta}}\big\}\sum_{t=1}^{T-1}t^{\beta}\lVert u_{t+1}-u_{t}\rVert. (30)

By introducing a small positive number $\delta<\mu$, (30) yields

\sum_{t=1}^{T}(f_{t}(x_{t})+r_{t}(x_{t})-f_{t}(u_{t})-r_{t}(u_{t}))\leqslant\frac{M^{2}}{2}\sum_{t=1}^{T}\eta_{t}+\frac{1}{2}\sum_{t=1}^{T-1}(\frac{1}{\eta_{t+1}}-\frac{1}{\eta_{t}}-\delta)\lVert u_{t+1}-x_{t+1}\rVert^{2}+\frac{1}{2}(\frac{1}{\eta_{1}}-\delta)\lVert u_{1}-x_{1}\rVert^{2}-\frac{1}{2}(\mu-\delta)\lVert u_{1}-x_{1}\rVert^{2}+R\max_{t\in[T]}\big\{\frac{1}{\eta_{t}t^{\beta}}\big\}\sum_{t=1}^{T-1}t^{\beta}\lVert u_{t+1}-u_{t}\rVert.

Setting $\eta_{t}=\frac{\gamma}{t}$ with $\gamma<\frac{1}{\delta}$, both $\frac{1}{\eta_{t+1}}-\frac{1}{\eta_{t}}-\delta$ and $\frac{1}{\eta_{1}}-\delta$ are positive. Thus, there holds

\sum_{t=1}^{T}(f_{t}(x_{t})+r_{t}(x_{t})-f_{t}(u_{t})-r_{t}(u_{t}))
\leqslant\frac{M^{2}}{2}\sum_{t=1}^{T}\eta_{t}+\frac{R^{2}}{2}\sum_{t=1}^{T-1}(\frac{1}{\eta_{t+1}}-\frac{1}{\eta_{t}}-\delta)+\frac{R^{2}}{2}(\frac{1}{\eta_{1}}-\delta)-\frac{1}{2}(\mu-\delta)\lVert u_{1}-x_{1}\rVert^{2}+R\max_{t\in[T]}\big\{\frac{1}{\eta_{t}t^{\beta}}\big\}D_{\beta}(T)
\leqslant\frac{M^{2}}{2}\sum_{t=1}^{T}\eta_{t}+R\max_{t\in[T]}\big\{\frac{1}{\eta_{t}t^{\beta}}\big\}D_{\beta}(T)-\frac{\delta R^{2}T}{2}+\frac{R^{2}}{2\eta_{T}}-\frac{1}{2}(\mu-\delta)\lVert u_{1}-x_{1}\rVert^{2}, (31)

where Assumption 3 is used to bound $\lVert u_{t+1}-x_{t+1}\rVert^{2}$ and $\lVert u_{1}-x_{1}\rVert^{2}$, and the definition of $D_{\beta}(T)$ is used. The bound (31) then follows by telescoping the sum $\sum_{t=1}^{T-1}(\frac{1}{\eta_{t+1}}-\frac{1}{\eta_{t}})$.

As $\eta_{t}=\frac{\gamma}{t}$ and $0\leqslant\beta<1$, $\max_{t\in[T]}\{\frac{1}{\eta_{t}t^{\beta}}\}$ can be bounded as

\max_{t\in[T]}\big\{\frac{1}{\eta_{t}t^{\beta}}\big\}=\max_{t\in[T]}\big\{\frac{t^{1-\beta}}{\gamma}\big\}=\frac{T^{1-\beta}}{\gamma}. (32)

Since the integral of $\frac{1}{t}$ from $1$ to $T$ upper-bounds the discrete sum from $2$ to $T$, one can bound $\sum_{t=1}^{T}\eta_{t}$ as follows:

\sum_{t=1}^{T}\eta_{t}=\gamma\sum_{t=1}^{T}\frac{1}{t}=\gamma\big(1+\sum_{t=2}^{T}\frac{1}{t}\big)\leqslant\gamma\big(1+\int_{1}^{T}\frac{1}{t}\,dt\big)=\gamma(1+\log T). (33)

Now, substituting (32) and (33) into (31) gives

\sum_{t=1}^{T}(f_{t}(x_{t})+r_{t}(x_{t})-f_{t}(u_{t})-r_{t}(u_{t}))\leqslant\frac{M^{2}\gamma}{2}(1+\log T)+\frac{RT^{1-\beta}}{\gamma}D_{\beta}(T)-\frac{\delta R^{2}T}{2}+\frac{TR^{2}}{2\gamma}-\frac{1}{2}(\mu-\delta)\lVert u_{1}-x_{1}\rVert^{2}. (34)

By choosing

\gamma=\frac{2RT^{-\beta}D_{\beta}(T)+R^{2}}{\delta R^{2}+(\mu-\delta)\frac{1}{T}\lVert u_{1}-x_{1}\rVert^{2}},

it is easy to verify that the term $\frac{RT^{1-\beta}}{\gamma}D_{\beta}(T)-\frac{\delta R^{2}T}{2}+\frac{TR^{2}}{2\gamma}-\frac{1}{2}(\mu-\delta)\lVert u_{1}-x_{1}\rVert^{2}$ vanishes. Under the prior requirement $\delta\gamma<1$ and the $\gamma$ chosen here, choosing $\delta$ small enough makes $\delta\gamma=\frac{2RT^{-\beta}D_{\beta}(T)+R^{2}}{R^{2}+(\frac{\mu}{\delta}-1)\frac{1}{T}\lVert u_{1}-x_{1}\rVert^{2}}$ smaller than $1$. Invoking this $\gamma$, together with (34), yields

\sum_{t=1}^{T}(f_{t}(x_{t})+r_{t}(x_{t})-f_{t}(u_{t})-r_{t}(u_{t}))\leqslant\frac{M^{2}}{2\delta R}(1+\log T)(2T^{-\beta}D_{\beta}(T)+R). (35)

Therefore, one has

\bm{Reg}_{T}^{d}=\mathcal{O}(\log T(1+T^{-\beta}D_{\beta}(T))). (36)

 

References

  • [1] A. Hamadouche, Y. Wu, A. M. Wallace, and J. F. C. Mota, “Approximate proximal-gradient methods,” in 2021 Sensor Signal Processing for Defence Conference (SSPD), 2021, pp. 1–6.
  • [2] N. K. Dhingra, S. Z. Khong, and M. R. Jovanović, “A second order primal-dual method for nonsmooth convex composite optimization,” IEEE Transactions on Automatic Control, vol. 67, no. 8, pp. 4061–4076, 2022.
  • [3] P. Joulani, A. György, and C. Szepesvári, “A modular analysis of adaptive (non-)convex optimization: Optimism, composite objectives, variance reduction, and variational bounds,” Theoretical Computer Science, vol. 808, pp. 108–138, 2020.
  • [4] J. Duchi, S. Shalev-Shwartz, Y. Singer, and A. Tewari, “Composite objective mirror descent,” in COLT, 2010, pp. 14–26.
  • [5] D. Yuan, Y. Hong, D. W. C. Ho, and S. Xu, “Distributed mirror descent for online composite optimization,” IEEE Transactions on Automatic Control, vol. 66, no. 2, pp. 714–729, 2021.
  • [6] C. Hu, J. T.-Y. Kwok, and W. Pan, “Accelerated gradient methods for stochastic optimization and online learning,” in NeurIPS, 2009.
  • [7] S. Ghadimi and G. Lan, “Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization i: A generic algorithmic framework,” SIAM J. Optim., vol. 22, pp. 1469–1492, 2012.
  • [8] X. Chen, Q. Lin, and J. F. Pena, “Optimal regularized dual averaging methods for stochastic optimization,” in NeurIPS, 2012.
  • [9] A. Taylor, J. Hendrickx, and F. Glineur, “Exact worst-case convergence rates of the proximal gradient method for composite convex minimization,” Journal of Optimization Theory and Applications, vol. 178, 08 2018.
  • [10] M. K. Nutalapati, A. S. Bedi, K. Rajawat, and M. Coupechoux, “Online trajectory optimization using inexact gradient feedback for time-varying environments,” IEEE Transactions on Signal Processing, vol. 68, pp. 4824–4838, 2020.
  • [11] T. Chen, Q. Ling, and G. B. Giannakis, “An online convex optimization approach to proactive network resource allocation,” IEEE Transactions on Signal Processing, vol. 65, no. 24, pp. 6350–6364, 2017.
  • [12] T. Wang, W. Yu, and S. Wang, “Inter-slice radio resource management via online convex optimization,” in Proceedings of IEEE International Conference on Communications, 2021, pp. 1–6.
  • [13] Q. Tao, Q.-K. Gao, D.-J. Chu, and G.-W. Wu, “Stochastic learning via optimizing the variational inequalities,” IEEE Transactions on Neural Networks and Learning Systems, vol. 25, no. 10, pp. 1769–1778, 2014.
  • [14] W. Xue and W. Zhang, “Learning a coupled linearized method in online setting,” IEEE Transactions on Neural Networks and Learning Systems, vol. 28, no. 2, pp. 438–450, 2017.
  • [15] E. Hazan, “Introduction to online convex optimization,” Foundations and Trends in Optimization, vol. 2, pp. 157–325, 01 2016.
  • [16] Y. Li, X. Cao, and H. Chen, “Fully projection-free proximal stochastic gradient method with optimal convergence rates,” IEEE Access, vol. 8, pp. 165 904–165 912, 2020.
  • [17] M. Zinkevich, “Online convex programming and generalized infinitesimal gradient ascent,” in ICML, 2003, p. 928–935.
  • [18] X. Li, X. Yi, and L. Xie, “Distributed online optimization for multi-agent networks with coupled inequality constraints,” IEEE Transactions on Automatic Control, vol. 66, no. 8, pp. 3575–3591, 2021.
  • [19] ——, “Distributed online convex optimization with an aggregative variable,” IEEE Transactions on Control of Network Systems, vol. 9, no. 1, pp. 438–449, 2022.
  • [20] O. Besbes, Y. Gur, and A. Zeevi, “Non-stationary stochastic optimization,” Operations Research, vol. 63, pp. 1227–1244, 09 2015.
  • [21] D. S. Kalhan, A. Singh Bedi, A. Koppel, K. Rajawat, H. Hassani, A. K. Gupta, and A. Banerjee, “Dynamic online learning via Frank-Wolfe algorithm,” IEEE Transactions on Signal Processing, vol. 69, pp. 932–947, 2021.
  • [22] P. Zhao and L. Zhang, “Improved Analysis for Dynamic Regret of Strongly Convex and Smooth Functions,” arXiv e-prints, p. arXiv:2006.05876, Jun. 2020.
  • [23] L. Zhang, T. Yang, J. Yi, R. Jin, and Z.-H. Zhou, “Improved dynamic regret for non-degenerate functions,” in NeurIPS, Red Hook, NY, USA, 2017, p. 732–741.
  • [24] L. Zhang, S. Lu, and Z.-H. Zhou, “Adaptive online learning in dynamic environments,” in NeurIPS, Red Hook, NY, USA, 2018, p. 1330–1340.
  • [25] A. Mokhtari, S. Shahrampour, A. Jadbabaie, and A. Ribeiro, “Online optimization in dynamic environments: Improved regret rates for strongly convex problems,” in 2016 IEEE 55th Conference on Decision and Control (CDC), 2016, pp. 7195–7201.
  • [26] Y. Zhao, S. Qiu, K. Li, L. Luo, J. Yin, and J. Liu, “Proximal online gradient is optimum for dynamic regret: A general lower bound,” IEEE Transactions on Neural Networks and Learning Systems, pp. 1–10, 2021.
  • [27] R. Dixit, A. S. Bedi, and K. Rajawat, “Online learning over dynamic graphs via distributed proximal gradient algorithm,” IEEE Transactions on Automatic Control, vol. 66, pp. 5065–5079, 2021.
  • [28] A. Ajalloeian, A. Simonetto, and E. Dall’Anese, “Inexact online proximal-gradient method for time-varying convex optimization,” in 2020 American Control Conference (ACC), 2020, pp. 2850–2857.
  • [29] R. Dixit, A. S. Bedi, R. Tripathi, and K. Rajawat, “Online learning with inexact proximal online gradient descent algorithms,” IEEE Transactions on Signal Processing, vol. 67, no. 5, pp. 1338–1352, 2019.
  • [30] N. Bastianello and E. Dall’Anese, “Distributed and inexact proximal gradient method for online convex optimization,” in 2021 European Control Conference (ECC), 2021, pp. 2432–2437.
  • [31] X. Yi, X. Li, L. Xie, and K. H. Johansson, “Distributed online convex optimization with time-varying coupled inequality constraints,” IEEE Transactions on Signal Processing, vol. 68, pp. 731–746, 2020.
  • [32] J. Duchi and Y. Singer, “Efficient online and batch learning using forward backward splitting,” J. Mach. Learn. Res., vol. 10, p. 2899–2934, Dec. 2009.
  • [33] L. M. Lopez-Ramos and B. Beferull-Lozano, “Online hyperparameter search interleaved with proximal parameter updates,” in 2020 28th European Signal Processing Conference (EUSIPCO), 2021, pp. 2085–2089.
  • [34] J. Liang and C. Poon, “Variable screening for sparse online regression,” Journal of Computational and Graphical Statistics, pp. 1–51, 07 2022.
  • [35] R. Shafipour and G. Mateos, “Online proximal gradient for learning graphs from streaming signals,” in 2020 28th European Signal Processing Conference (EUSIPCO), 2021, pp. 865–869.
  • [36] T. Yang, R. Jin, M. Mahdavi, and S. Zhu, “An efficient primal dual prox method for non-smooth optimization,” Machine Learning, vol. 98, pp. 369–406, 2014.
  • [37] G. Ditzler, M. Roveri, C. Alippi, and R. Polikar, “Learning in nonstationary environments: A survey,” IEEE Computational Intelligence Magazine, vol. 10, no. 4, pp. 12–25, 2015.
  • [38] Y. Murakami, M. Yamagishi, M. Yukawa, and I. Yamada, “A sparse adaptive filtering using time-varying soft-thresholding techniques,” in 2010 IEEE International Conference on Acoustics, Speech and Signal Processing, 2010, pp. 3734–3737.
  • [39] M. Yamagishi, M. Yukawa, and I. Yamada, “Acceleration of adaptive proximal forward-backward splitting method and its application to sparse system identification,” in 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2011, pp. 4296–4299.
  • [40] T. Yamamoto, M. Yamagishi, and I. Yamada, “Adaptive proximal forward-backward splitting for sparse system identification under impulsive noise,” 2012 Proceedings of the 20th European Signal Processing Conference (EUSIPCO), pp. 2620–2624, 2012.
  • [41] S. A. Alghunaim, E. K. Ryu, K. Yuan, and A. H. Sayed, “Decentralized proximal gradient algorithms with linear convergence rates,” IEEE Transactions on Automatic Control, vol. 66, pp. 2787–2794, 2021.
  • [42] H. Yu and M. J. Neely, “A low complexity algorithm with O(T){O}(\sqrt{T}) regret and O(1){O}(1) constraint violations for online convex optimization with long term constraints,” J. Mach. Learn. Res., vol. 21, pp. 1–24, 2020.