
Resource-Aware Discretization of Accelerated Optimization Flows

Miguel Vaquero  Pol Mestres  Jorge Cortés. This work was supported by NSF Award ECCS-1917177. MV and JC are with the Department of Mechanical and Aerospace Engineering, University of California, San Diego, {mivaquerovallina,cortes}@ucsd.edu. PM is with the Centro de Formacion Interdisciplinaria Superior, Universidad Politécnica de Cataluña, [email protected]
Abstract

This paper tackles the problem of discretizing accelerated optimization flows while retaining their convergence properties. Inspired by the success of resource-aware control in developing efficient closed-loop feedback implementations on digital systems, we view the last sampled state of the system as the resource to be aware of. The resulting variable-stepsize discrete-time algorithms retain by design the desired decrease of the Lyapunov certificate of their continuous-time counterparts. Our algorithm design employs various concepts and techniques from resource-aware control that, in the present context, have interesting parallelisms with the discrete-time implementation of optimization algorithms. These include derivative- and performance-based triggers to monitor the evolution of the Lyapunov function as a way of determining the algorithm stepsize, exploiting sampled information to enhance algorithm performance, and employing high-order holds using more accurate integrators of the original dynamics. Throughout the paper, we illustrate our approach on a newly introduced continuous-time dynamics termed heavy-ball dynamics with displaced gradient, but the ideas proposed here have broad applicability to other globally asymptotically stable flows endowed with a Lyapunov certificate.

I Introduction

A recent body of research seeks to understand the acceleration phenomena of first-order discrete optimization methods by means of models that evolve in continuous time. Roughly speaking, the idea is to study the behavior of ordinary differential equations (ODEs) which arise as continuous limits of discrete-time accelerated algorithms. The basic premise is that the availability of the powerful tools of the continuous realm, such as differential calculus, Lie derivatives, and Lyapunov stability theory, can be then brought to bear to analyze and explain the accelerated behavior of these flows, providing insight into their discrete counterparts. Fully closing the circle to provide a complete description of the acceleration phenomenon requires solving the outstanding open question of how to discretize the continuous flows while retaining their accelerated convergence properties. However, the discretization of accelerated flows has proven to be a challenging task, where retaining acceleration seems to depend largely on the particular ODE and the discretization method employed. This paper develops a resource-aware approach to the discretization of accelerated optimization flows.

Literature Review

The acceleration phenomenon goes back to the seminal paper [1] introducing the so-called heavy-ball method, which employed momentum terms to speed up the convergence of the classical gradient descent method. The heavy-ball method achieves the optimal convergence rate in a neighborhood of the minimizer for arbitrary convex functions and the global optimal convergence rate for quadratic objective functions. Later on, the work [2] proposed Nesterov's accelerated gradient method and, employing the technique of estimating sequences, showed that it converges globally with the optimal convergence rate for convex and strongly-convex smooth functions. The algebraic nature of the technique of estimating sequences does not fully explain the mechanisms behind the acceleration phenomenon, and this has motivated many approaches in the literature to provide fundamental understanding and insights. These include coupling dynamics [3], dissipativity theory [4], integral quadratic constraints [5, 6], and geometric arguments [7].

Of specific relevance to this paper is a recent line of research initiated by [8] that seeks to understand the acceleration phenomenon in first-order optimization methods by means of models that evolve in continuous time. [8] introduced a second-order ODE as the continuous limit of Nesterov's accelerated gradient method and characterized its accelerated convergence properties using Lyapunov stability analysis. The ODE approach to acceleration now includes the use of Hamiltonian dynamical systems [9, 10], inertial systems with Hessian-driven damping [11], and high-resolution ODEs [12, 13]. This body of research is also reminiscent of the classical dynamical systems approach to algorithms in optimization, see [14, 15]. The question of how to discretize the continuous flows while maintaining their accelerated convergence rates has also attracted significant attention, motivated by the ultimate goal of fully understanding the acceleration phenomenon and taking advantage of it to design better optimization algorithms. Interestingly, discretizations of these ODEs do not necessarily lead to acceleration [16]. In fact, explicit discretization schemes, like forward Euler, can even become numerically unstable after a few iterations [17]. Most of the discretization approaches found in the literature are based on the study of well-known integrators, including symplectic integrators [9, 18], Runge-Kutta integrators [19], or modifications of Nesterov's three sequences [17, 18, 20]. Our previous work [21] instead developed a variable-stepsize discretization using zero-order holds and state-triggers based on the derivative of the Lyapunov function of the original continuous flow. Here, we provide a comprehensive approach based on powerful tools from resource-aware control, including performance-based triggering and state holds that more effectively use sampled information. Other recent approaches to the acceleration phenomenon and the synthesis of optimization algorithms using control-theoretic notions and techniques include [22], which employs hybrid systems to design a continuous-time dynamics with a feedback regulator of the viscosity of the heavy-ball ODE to guarantee arbitrarily fast exponential convergence, and [23], which introduced an algorithm that alternates between two continuous heavy-ball dynamics (one fast but unstable, used far from the minimizer, and another slower but stable, used near the minimizer).

Statement of Contributions

This paper develops a resource-aware control framework for the discretization of accelerated optimization flows that fully exploits their dynamical properties. Our approach relies on the key observation that resource-aware control provides a principled way to go from continuous-time control design to real-time implementation with stability and performance guarantees by opportunistically prescribing when a certain resource should be employed. In our treatment, the resource to be aware of is the last sampled state of the system, and hence what we seek to maximize is the stepsize of the resulting discrete-time algorithm. Our first contribution is the introduction of a second-order differential equation which we term heavy-ball dynamics with displaced gradient. This dynamics generalizes the continuous-time heavy-ball dynamics analyzed in the literature by evaluating the gradient of the objective function at a displaced point that takes into account the second-order nature of the flow. We establish that the proposed dynamics retains the same convergence properties as the original one while providing additional flexibility in the form of a design parameter.

Our second contribution uses trigger design concepts from resource-aware control to synthesize criteria that determine the variable stepsize of the discrete-time implementation of the heavy-ball dynamics with displaced gradient. We refer to these criteria as event- or self-triggered, depending on whether the stepsize is implicitly or explicitly defined. We employ derivative- and performance-based triggering to ensure the algorithm retains the desired decrease of the Lyapunov function of the continuous flow. In doing so, we face the challenge that the evaluation of this function requires knowledge of the unknown optimizer of the objective function. To circumvent this hurdle, we derive bounds on the evolution of the Lyapunov function that can be evaluated without knowledge of the optimizer. We characterize the convergence properties of the resulting discrete-time algorithms, establishing the existence of a minimum inter-event time and performance guarantees with regard to the objective function.

Our last two contributions provide ways of exploiting the sampled information to enhance the algorithm performance. Our third contribution provides an implementation of the algorithms that adaptively adjusts the value of the gradient displacement parameter depending on the region of the state space to which the current iterate belongs. Our fourth and last contribution builds on the fact that the continuous-time heavy-ball dynamics can be decomposed as the sum of a second-order linear dynamics and a nonlinear forcing term corresponding to the gradient of the objective function. Building on this observation, we provide a more accurate hold for the resource-aware implementation by using the samples only in the nonlinear term and integrating exactly the resulting linear system with constant forcing. We establish the existence of a minimum inter-event time and characterize the performance with regard to the objective function of the resulting high-order-hold algorithm. Finally, we illustrate the proposed optimization algorithms in simulation, comparing them against the heavy-ball and Nesterov's accelerated gradient methods and showing superior performance to other discretization methods proposed in the literature.

II Preliminaries

This section presents basic notation and preliminaries.

II-A Notation

We denote by ${\mathbb{R}}$ and $\mathbb{R}_{>0}$ the sets of real and positive real numbers, respectively. All vectors are column vectors. We denote their scalar product by $\langle\cdot,\cdot\rangle$. We use $\lVert\cdot\rVert$ to denote the $2$-norm in Euclidean space. Given $\mu\in\mathbb{R}_{>0}$, a continuously differentiable function $f$ is $\mu$-strongly convex if $f(y)-f(x)\geq\langle\nabla f(x),y-x\rangle+\frac{\mu}{2}\lVert x-y\rVert^{2}$ for $x,y\in{\mathbb{R}}^{n}$. Given $L\in\mathbb{R}_{>0}$ and a function $f:X\rightarrow Y$ between two normed spaces $(X,\lVert\cdot\rVert_{X})$ and $(Y,\lVert\cdot\rVert_{Y})$, $f$ is $L$-Lipschitz if $\lVert f(x)-f(x^{\prime})\rVert_{Y}\leq L\lVert x-x^{\prime}\rVert_{X}$ for $x,x^{\prime}\in X$. The functions we consider here are continuously differentiable, $\mu$-strongly convex, and have $L$-Lipschitz continuous gradient. We denote the set of functions with all these properties by $\mathcal{S}_{\mu,L}^{1}({\mathbb{R}}^{n})$. A function $f:{\mathbb{R}}^{n}\rightarrow{\mathbb{R}}$ is positive definite relative to $x_{*}$ if $f(x_{*})=0$ and $f(x)>0$ for $x\in{\mathbb{R}}^{n}\setminus\{x_{*}\}$.

II-B Resource-Aware Control

Our work builds on ideas from resource-aware control to develop discretizations of continuous-time accelerated flows. Here, we provide a brief exposition of its basic elements and refer to [24, 25] for further details.

Given a controlled dynamical system $\dot{p}=X(p,u)$, with $p\in{\mathbb{R}}^{n}$ and $u\in{\mathbb{R}}^{m}$, assume we are given a stabilizing continuous state-feedback $\mathfrak{k}:{\mathbb{R}}^{n}\rightarrow{\mathbb{R}}^{m}$ so that the closed-loop system $\dot{p}=X(p,\mathfrak{k}(p))$ has $p_{*}$ as a globally asymptotically stable equilibrium point. Assume also that a Lyapunov function $V:{\mathbb{R}}^{n}\rightarrow{\mathbb{R}}$ is available as a certificate of the globally stabilizing nature of the controller. Here, we assume this takes the form

\dot{V}=\langle\nabla V(p),X(p,\mathfrak{k}(p))\rangle\leq-\frac{\sqrt{\mu}}{4}V(p), \quad (1)

for all $p\in{\mathbb{R}}^{n}$. Although exponential decay of $V$ along the system trajectories is not necessary, we restrict our attention to this case as it arises naturally in our treatment.

Suppose we are given the task of implementing the control signal over a digital platform, meaning that the actuator cannot be continuously updated as prescribed by the specification $u=\mathfrak{k}(p)$. In such a case, one is forced to discretize the control action along the execution of the dynamics, while making sure that stability is still preserved. A simple-to-implement approach is to update the control action periodically, i.e., fix $h>0$, sample the state as $\{p(kh)\}_{k=0}^{\infty}$, and implement

\dot{p}(t)=X(p(t),\mathfrak{k}(p(kh))),\quad t\in[kh,(k+1)h].

This approach requires $h$ to be small enough to ensure that $V$ remains a Lyapunov function and, consequently, the system remains stable. By contrast, in resource-aware control, one employs the information generated by the system along its trajectory to update the control action in an opportunistic fashion. Specifically, we seek to determine, in a state-dependent fashion, a sequence of times $\{t_{k}\}_{k=0}^{\infty}$, not necessarily uniformly spaced, such that $p_{*}$ remains a globally asymptotically stable equilibrium for the system

\dot{p}(t)=X(p(t),\mathfrak{k}(p(t_{k}))),\quad t\in[t_{k},t_{k+1}]. \quad (2)

The main idea to accomplish this is to let the state sampling be guided by the principle of maintaining the same type of exponential decay (1) along the new dynamics. To do this, one defines triggers that prescribe a new state sampling whenever this decay is about to be violated. Formally, one sets $t_{0}=0$ and $t_{k+1}=t_{k}+\operatorname{step}(p(t_{k}))$, where the stepsize is defined by

\operatorname{step}(\hat{p})=\min\{t>0\;|\;b(\hat{p},t)=0\}. \quad (3)

We refer to the criterion as event-triggering (ET) or self-triggering (ST), depending on whether the evaluation of the function $b$ requires monitoring the state $p$ along the trajectory of (2) or just knowledge of its initial condition $\hat{p}$. The more stringent implementation requirements of event-triggering (continuous monitoring of the state) lead to larger stepsizes, versus the more conservative stepsizes characteristic of self-triggering. In order for the state sampling to be implementable in practice, the inter-event times $\{t_{k+1}-t_{k}\}_{k=0}^{\infty}$ must be uniformly lower bounded by a positive minimum inter-event time, abbreviated MIET. In particular, the existence of a MIET rules out Zeno behavior, i.e., the possibility of an infinite number of triggers in a finite amount of time.

Depending on how the evolution of the function $V$ is examined, we describe two types of triggering conditions (in both cases, for a given $\hat{p}\in{\mathbb{R}}^{n}$, we let $p(t;\hat{p})$ denote the solution of $\dot{p}(t)=X(p(t),\mathfrak{k}(\hat{p}))$ with initial condition $p(0)=\hat{p}$):

Derivative-based trigger: in this case, $b^{\operatorname{d}}$ is defined as an upper bound of the expression $\frac{d}{dt}V(p(t;\hat{p}))+\frac{\sqrt{\mu}}{4}V(p(t;\hat{p}))$. This definition ensures that (1) is maintained along (2);

Performance-based trigger: in this case, $b^{\operatorname{p}}$ is defined as an upper bound of the expression $V(p(t;\hat{p}))-e^{-\frac{\sqrt{\mu}}{4}t}V(\hat{p})$. Note that this definition ensures that the integral version of (1) is maintained along (2). In general, the performance-based trigger gives rise to stepsizes that are at least as large as the ones determined by the derivative-based approach, cf. [26]. This is because the latter prescribes an update as soon as the exponential decay is about to be violated and, therefore, does not take into account the fact that the Lyapunov function might have been decreasing at a faster rate since the last update. Instead, the performance-based approach reasons over the accumulated decay of the Lyapunov function since the last update, potentially yielding longer inter-sampling times.

A final point worth mentioning is that, in the event-triggered control literature, the resource to be aware of can be many different things beyond the actuator described above, including the sensor, the sensor-controller communication channel, communication with other agents, etc. This richness opens the way to explore more elaborate uses of the sampled information beyond the zero-order hold in (2), something that we also leverage later in our presentation.
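To fix ideas, the following minimal sketch illustrates the resource-aware sampling loop (2)-(3) with a derivative-based trigger on a hypothetical scalar plant $\dot{p}=-p+u$ with feedback $\mathfrak{k}(p)=-p$ and certificate $V(p)=p^{2}$, which satisfies (1) with $\mu=16$; the plant, controller, and constants are assumptions made only for this illustration.

```python
import numpy as np

# Hypothetical scalar example: p' = X(p,u) = -p + u with feedback k(p) = -p.
# The closed loop gives V' = -4 V <= -(sqrt(mu)/4) V for V(p) = p^2 and mu = 16.
mu = 16.0
X = lambda p, u: -p + u
k = lambda p: -p
V = lambda p: p ** 2

def step_derivative_trigger(p_hat, dt=1e-4, t_max=5.0):
    """Flow the zero-order-hold dynamics p' = X(p, k(p_hat)) from p_hat and
    return (stepsize, end state), where the stepsize is the first time at
    which d/dt V(p) + (sqrt(mu)/4) V(p) stops being negative (cf. (3))."""
    p, t = p_hat, 0.0
    u = k(p_hat)                                   # held control input
    while t < t_max:
        dp = X(p, u)
        if 2 * p * dp + np.sqrt(mu) / 4 * V(p) >= 0 and t > 0:
            return t, p                            # trigger fires: resample
        p, t = p + dt * dp, t + dt                 # forward-Euler ZOH flow
    return t_max, p

p, t = 3.0, 0.0
while abs(p) > 1e-6:
    h, p = step_derivative_trigger(p)              # opportunistic sampling (2)
    t += h
    print(f"t = {t:.3f}, p = {p:.6f}, step = {h:.3f}")
```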

III Problem Statement

Our motivation here is to show that principled approaches to discretization can retain the accelerated convergence properties of continuous-time dynamics, fill the gap between the continuous and discrete viewpoints on optimization algorithms, and lead to the construction of new ones. Throughout the paper, we focus on the continuous-time version of the celebrated heavy-ball method [1]. Let $f$ be a function in $\mathcal{S}_{\mu,L}^{1}({\mathbb{R}}^{n})$ and let $x_{*}$ be its unique minimizer. The heavy-ball method is known to have an optimal convergence rate in a neighborhood of the minimizer. For its continuous-time counterpart, consider the following $s$-dependent family of second-order differential equations, with $s\in\mathbb{R}_{>0}$, proposed in [12],

\begin{bmatrix}\dot{x}\\ \dot{v}\end{bmatrix}=\begin{bmatrix}v\\ -2\sqrt{\mu}v-(1+\sqrt{\mu s})\nabla f(x)\end{bmatrix}, \quad (4a)
x(0)=x_{0},\quad v(0)=-\frac{2\sqrt{s}\nabla f(x_{0})}{1+\sqrt{\mu s}}. \quad (4b)

We refer to this dynamics as $X_{\operatorname{hb}}$. The following result characterizes the convergence properties of (4) to $p_{*}=[x_{*},0]^{T}$.

Theorem III.1 ([12]).

Let $V:{\mathbb{R}}^{n}\times{\mathbb{R}}^{n}\rightarrow{\mathbb{R}}$ be

V(x,v)=(1+\sqrt{\mu s})(f(x)-f(x_{*}))+\frac{1}{4}\lVert v\rVert^{2}+\frac{1}{4}\lVert v+2\sqrt{\mu}(x-x_{*})\rVert^{2}, \quad (5)

which is positive definite relative to $[x_{*},0]^{T}$. Then $\dot{V}\leq-\frac{\sqrt{\mu}}{4}V$ along the dynamics (4) and, as a consequence, $p_{*}=[x_{*},0]^{T}$ is globally asymptotically stable. Moreover, for $s\leq 1/L$, the exponential decrease of $V$ implies

f(x(t))-f(x_{*})\leq\frac{7\lVert x(0)-x_{*}\rVert^{2}}{2s}e^{-\frac{\sqrt{\mu}}{4}t}. \quad (6)

This result, along with analogous results in [12] for Nesterov's accelerated gradient descent, serves as an inspiration to build Lyapunov functions that help explain the accelerated convergence rate of the discrete-time methods.

Inspired by the success of resource-aware control in developing efficient closed-loop feedback implementations on digital systems, here we present a discretization approach to accelerated optimization flows based on resource-aware control. At the basis of this approach is the observation that the convergence rate (6) of the continuous flow is a direct consequence of the Lyapunov nature of the function (III.1). In fact, the integration of $\dot{V}\leq-\frac{\sqrt{\mu}}{4}V$ along the system trajectories yields

V(x(t),v(t))\leq e^{-\frac{\sqrt{\mu}}{4}t}V(x(0),v(0)).

Since $f(x(t))-f(x_{*})\leq V(x(t),v(t))$, we deduce

f(x(t))-f(x_{*})\leq e^{-\frac{\sqrt{\mu}}{4}t}V(x(0),v(0))=\mathcal{O}(e^{-\frac{\sqrt{\mu}}{4}t}).

The characterization of the convergence rate via the decay of the Lyapunov function is indeed common among accelerated optimization flows. This observation motivates the resource-aware approach to discretization pursued here, where the resource we aim to use efficiently is the sampling of the state itself. The ultimate goal is to obtain large stepsizes that take maximum advantage of the decay of the Lyapunov function (and, consequently, of the accelerated nature) of the continuous-time dynamics in the resulting discrete-time implementation.
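As an illustration of this reasoning, the following sketch integrates the heavy-ball dynamics (4) on a hypothetical strongly convex quadratic and checks numerically that the Lyapunov function (5) satisfies the integrated decay bound; the test function, the constants $\mu$ and $L$, and the integrator are assumptions made only for this illustration.

```python
import numpy as np

# Hypothetical test problem: f(x) = 0.5 x^T Q x, whose strong-convexity and
# smoothness constants are the extreme eigenvalues of Q.
Q = np.diag([1.0, 10.0])
f = lambda x: 0.5 * x @ Q @ x
grad_f = lambda x: Q @ x
mu, L = 1.0, 10.0
s = 1.0 / L
x_star = np.zeros(2)

def V(x, v):
    """Lyapunov function (5) for the heavy-ball dynamics (4)."""
    return ((1 + np.sqrt(mu * s)) * (f(x) - f(x_star))
            + 0.25 * np.linalg.norm(v) ** 2
            + 0.25 * np.linalg.norm(v + 2 * np.sqrt(mu) * (x - x_star)) ** 2)

def hb_flow(x, v):
    """Right-hand side of the heavy-ball dynamics X_hb in (4)."""
    return v, -2 * np.sqrt(mu) * v - (1 + np.sqrt(mu * s)) * grad_f(x)

# Integrate (4) with a small Runge-Kutta-4 step and monitor V along the way.
x = np.array([2.0, -1.0])
v = -2 * np.sqrt(s) * grad_f(x) / (1 + np.sqrt(mu * s))   # initial condition (4b)
dt, T = 1e-3, 10.0
V0 = V(x, v)
for k in range(int(T / dt)):
    k1 = hb_flow(x, v)
    k2 = hb_flow(x + 0.5 * dt * k1[0], v + 0.5 * dt * k1[1])
    k3 = hb_flow(x + 0.5 * dt * k2[0], v + 0.5 * dt * k2[1])
    k4 = hb_flow(x + dt * k3[0], v + dt * k3[1])
    x = x + dt / 6 * (k1[0] + 2 * k2[0] + 2 * k3[0] + k4[0])
    v = v + dt / 6 * (k1[1] + 2 * k2[1] + 2 * k3[1] + k4[1])
    t = (k + 1) * dt
    assert V(x, v) <= np.exp(-np.sqrt(mu) / 4 * t) * V0 + 1e-9  # decay bound
print("f(x(T)) - f* =", f(x) - f(x_star))
```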

IV Resource-Aware Discretization of Accelerated Optimization Flows

In this section we propose a discretization of accelerated optimization flows using state-dependent triggering and analyze the properties of the resulting discrete-time algorithm. For convenience, we use the shorthand notation $p=[x,v]^{T}$. Following the exposition in Section II-B, we start by considering the zero-order hold implementation $\dot{p}=X_{\operatorname{hb}}(\hat{p})$, $p(0)=\hat{p}$ of the heavy-ball dynamics (4),

\dot{x}=\hat{v}, \quad (7a)
\dot{v}=-2\sqrt{\mu}\hat{v}-(1+\sqrt{\mu s})\nabla f(\hat{x}). \quad (7b)

Note that the solution trajectory takes the form $p(t)=\hat{p}+tX_{\operatorname{hb}}(\hat{p})$, which in discrete-time terminology corresponds to a forward-Euler discretization of (4). Component-wise, we have

x(t)=\hat{x}+t\hat{v},
v(t)=\hat{v}-t\big{(}2\sqrt{\mu}\hat{v}+(1+\sqrt{\mu s})\nabla f(\hat{x})\big{)}.

As pointed out in Section II-B, the use of sampled information opens the way to more elaborate constructions than the zero-order hold in (7). As an example, given the second-order nature of the heavy-ball dynamics, it is reasonable to leverage the (position, velocity) interpretation of the pair $(\hat{x},\hat{v})$ (meaning that, at position $\hat{x}$, the system is moving with velocity $\hat{v}$) by employing the modified zero-order hold

\dot{x}=\hat{v}, \quad (8a)
\dot{v}=-2\sqrt{\mu}\hat{v}-(1+\sqrt{\mu s})\nabla f(\hat{x}+a\hat{v}), \quad (8b)

where $a\geq 0$. Note that the trajectory of (8) corresponds to the forward-Euler discretization of the continuous-time dynamics

\begin{bmatrix}\dot{x}\\ \dot{v}\end{bmatrix}=\begin{bmatrix}v\\ -2\sqrt{\mu}v-(1+\sqrt{\mu s})\nabla f(x+av)\end{bmatrix}. \quad (9)

We refer to this as the heavy-ball dynamics with displaced gradient and denote it by $X^{a}_{\operatorname{hb}}$ (note that (8) and (9) with $a=0$ recover (7) and (4), respectively). In order to pursue the resource-aware approach laid out in Section II-B with the modified zero-order hold in (8), we need to characterize the asymptotic convergence properties of the heavy-ball dynamics with displaced gradient, which we tackle next.
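In code, the modified zero-order hold (8) amounts to the following one-step update (a minimal sketch; the stepsize $t$ is left as an input here and is later selected by the triggers designed in Section IV-B):

```python
import numpy as np

def displaced_gradient_step(x_hat, v_hat, t, grad_f, mu, s, a):
    """One zero-order-hold step of length t along (8): the forward-Euler update
    p_next = p_hat + t * X_hb^a(p_hat) of the displaced-gradient dynamics (9).
    With a = 0 this reduces to the plain heavy-ball hold (7)."""
    g = grad_f(x_hat + a * v_hat)                 # gradient at the displaced point
    x_next = x_hat + t * v_hat
    v_next = v_hat - t * (2 * np.sqrt(mu) * v_hat + (1 + np.sqrt(mu * s)) * g)
    return x_next, v_next
```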

Remark IV.1.

(Connection between the use of sampled information and high-resolution ODEs). A number of works [27, 28, 29] have explored formulations of Nesterov's accelerated gradient method that employ displaced-gradient-like terms similar to the one used above. Here, we make this connection explicit. Given Nesterov's algorithm

y_{k+1}=x_{k}-s\nabla f(x_{k}),
x_{k+1}=y_{k+1}+\frac{1-\sqrt{\mu s}}{1+\sqrt{\mu s}}(y_{k+1}-y_{k}),

the work [12] obtains the following limiting high-resolution ODE

\ddot{x}+2\sqrt{\mu}\dot{x}+\sqrt{s}\nabla^{2}f(x)\dot{x}+(1+\sqrt{\mu s})\nabla f(x)=0. \quad (10)

Interestingly, considering instead the evolution of the $y$-variable and applying arguments similar to the ones in [12], one instead obtains

\ddot{y}+2\sqrt{\mu}\dot{y}+(1+\sqrt{\mu s})\nabla f\big{(}y+\frac{\sqrt{s}}{1+\sqrt{\mu s}}\dot{y}\big{)}=0, \quad (11)

which corresponds to the continuous heavy-ball dynamics in (4) evaluated with a displaced gradient, i.e., (9). Even further, if we Taylor expand the last term in (11) as

\nabla f\big{(}y+\frac{\sqrt{s}}{1+\sqrt{\mu s}}\dot{y}\big{)}=\nabla f(y)+\nabla^{2}f(y)\frac{\sqrt{s}}{1+\sqrt{\mu s}}\dot{y}+\mathcal{O}(s)

and disregard the $\mathcal{O}(s)$ term, we recover (10). This shows that (11) is just (10) with extra higher-order terms in $s$, and provides evidence of the role of gradient displacement in enlarging the modeling capabilities of high-resolution ODEs. $\bullet$

IV-A Asymptotic Convergence of Heavy-Ball Dynamics with Displaced Gradient

In this section, we study the asymptotic convergence of the heavy-ball dynamics with displaced gradient. Interestingly, for $a$ sufficiently small, this dynamics enjoys the same convergence properties as the dynamics (4), as the following result shows.

Theorem IV.2.

(Global asymptotic stability of heavy-ball dynamics with displaced gradient). Let $\beta_{1},\dots,\beta_{4}>0$ be

\beta_{1}=\sqrt{\mu_{s}}\,\mu,\quad\beta_{2}=\frac{\sqrt{\mu_{s}}\,L}{\sqrt{\mu}},\quad\beta_{3}=\frac{13\sqrt{\mu}}{16},\quad\beta_{4}=\frac{4\mu^{2}\sqrt{s}+3L\sqrt{\mu}\sqrt{\mu_{s}}}{8L^{2}},

where, for brevity, $\sqrt{\mu_{s}}=1+\sqrt{\mu s}$, and define

a^{*}_{1}=\frac{2}{\beta_{2}^{2}}\Big{(}\beta_{1}\beta_{4}+\sqrt{\beta_{2}^{2}\beta_{3}\beta_{4}+\beta_{1}^{2}\beta_{4}^{2}}\Big{)}. \quad (12)

Then, for $0\leq a\leq a^{*}_{1}$, $\dot{V}\leq-\frac{\sqrt{\mu}}{4}V$ along the dynamics (9) and, as a consequence, $p_{*}=[x_{*},0]^{T}$ is globally asymptotically stable. Moreover, for $s\leq 1/L$, the exponential decrease of $V$ implies that (6) holds along the trajectories of $X^{a}_{\operatorname{hb}}$.

Proof.

Note that

V(p),Xhba(p)+μ4V(p)=\displaystyle\langle\nabla V(p),X^{a}_{\operatorname{hb}}(p)\rangle+\frac{\sqrt{\mu}}{4}V(p)=
=(1+μs)f(x),vμv2μsf(x+av),v\displaystyle=(1\!+\!\sqrt{\mu s})\langle\nabla f(x),v\rangle\!-\!\sqrt{\mu}\left\lVert v\right\rVert^{2}\!-\!\sqrt{\mu_{s}}\langle\nabla f(x\!+\!av),v\rangle
μμsf(x+av),xx+μ4V(x,v)\displaystyle\quad-\sqrt{\mu}\sqrt{\mu_{s}}\langle\nabla f(x+av),x-x_{*}\rangle+\frac{\sqrt{\mu}}{4}V(x,v)
=μv2μμsf(x),xx+μ4V(x,v)Term I\displaystyle=\underbrace{-\sqrt{\mu}\left\lVert v\right\rVert^{2}-\sqrt{\mu}\sqrt{\mu_{s}}\langle\nabla f(x),x-x_{*}\rangle+\frac{\sqrt{\mu}}{4}V(x,v)}_{\textrm{Term~{}I}}
μsf(x+av)f(x),vTerm II\displaystyle\quad\underbrace{-\sqrt{\mu_{s}}\langle\nabla f(x+av)-\nabla f(x),v\rangle}_{\textrm{Term~{}II}}
μμsf(x+av)f(x),xxTerm III,\displaystyle\quad\underbrace{-\sqrt{\mu}\sqrt{\mu_{s}}\langle\nabla f(x+av)-\nabla f(x),x-x_{*}\rangle}_{\textrm{Term~{}III}},

where in the second equality we have added and subtracted $\sqrt{\mu}\sqrt{\mu_{s}}\langle\nabla f(x),x-x_{*}\rangle$. Observe that "Term I" corresponds to $\langle\nabla V(p),X_{\operatorname{hb}}(p)\rangle+\frac{\sqrt{\mu}}{4}V(p)$ and is therefore negative by Theorem III.1. From [21], this term can be bounded as

Term I 13μ16v2\displaystyle\leq\frac{-13\sqrt{\mu}}{16}\left\lVert v\right\rVert^{2}
+(4μ2s+3Lμμs8L2)f(x)2.\displaystyle\quad+\Big{(}\frac{4\mu^{2}\sqrt{s}+3L\sqrt{\mu}\sqrt{\mu_{s}}}{8L^{2}}\Big{)}\left\lVert\nabla f(x)\right\rVert^{2}.

Let us study the other two terms. By strong convexity, we have $-\langle\nabla f(x+av)-\nabla f(x),v\rangle\leq-a\mu\lVert v\rVert^{2}$, and therefore

Term II aμsμv20.\displaystyle\leq-a\sqrt{\mu_{s}}\mu\left\lVert v\right\rVert^{2}\leq 0.

Regarding Term III, one can use the $L$-Lipschitzness of $\nabla f$ and strong convexity to obtain

Term III aμμμsLvf(x).\displaystyle\leq\displaystyle\frac{a}{\mu}\sqrt{\mu}\sqrt{\mu_{s}}L\left\lVert v\right\rVert\left\lVert\nabla f(x)\right\rVert.

Now, using the notation in the statement, we can write

V(p),Xhba(p)+μ4V(p)\displaystyle\langle\nabla V(p),X^{a}_{\operatorname{hb}}(p)\rangle+\frac{\sqrt{\mu}}{4}V(p) (13)
a(β1v2+β2vf(x))β3v2β4f(x)2.\displaystyle\leq a\big{(}\!-\!\beta_{1}\left\lVert v\right\rVert^{2}+\beta_{2}\left\lVert v\right\rVert\left\lVert\nabla f(x)\right\rVert\big{)}\!-\!\beta_{3}\left\lVert v\right\rVert^{2}\!-\!\beta_{4}\left\lVert\nabla f(x)\right\rVert^{2}.

If $-\beta_{1}\lVert v\rVert^{2}+\beta_{2}\lVert v\rVert\lVert\nabla f(x)\rVert\leq 0$, then the RHS of (13) is negative for any $a\geq 0$. If $-\beta_{1}\lVert v\rVert^{2}+\beta_{2}\lVert v\rVert\lVert\nabla f(x)\rVert>0$, the RHS of (13) is negative if and only if

aβ3v2+β4f(x)2β1v2+β2vf(x).a\leq\displaystyle\frac{\beta_{3}\left\lVert v\right\rVert^{2}+\beta_{4}\left\lVert\nabla f(x)\right\rVert^{2}}{-\beta_{1}\left\lVert v\right\rVert^{2}+\beta_{2}\left\lVert v\right\rVert\left\lVert\nabla f(x)\right\rVert}.

The RHS of this equation corresponds to $g(\lVert\nabla f(x)\rVert/\lVert v\rVert)$, with the function $g$ defined in (A.3). From Lemma A.1, as long as $-\beta_{1}\lVert v\rVert^{2}+\beta_{2}\lVert v\rVert\lVert\nabla f(x)\rVert>0$, this function is lower bounded by

a1=β3+β4(zroot+)2β1+β2zroot+>0,\displaystyle a_{1}^{*}=\frac{\beta_{3}+\beta_{4}(z_{\operatorname{root}}^{+})^{2}}{-\beta_{1}+\beta_{2}z_{\operatorname{root}}^{+}}>0,

where $z_{\operatorname{root}}^{+}$ is defined in (A.4). This lower bound exactly corresponds to (12), concluding the result. ∎

Remark IV.3.

(Adaptive displacement along the trajectories of heavy-ball dynamics with displaced gradient). From the proof of Theorem IV.2, one can observe that if $(x,v)$ is such that $\underline{n}\leq\lVert\nabla f(x)\rVert<\overline{n}$ and $\underline{m}\leq\lVert v\rVert<\overline{m}$, for $\underline{n},\overline{n},\underline{m},\overline{m}\in\mathbb{R}_{>0}$, then one can upper bound the LHS of (13) by

a(-\beta_{1}\underline{m}^{2}+\beta_{2}\overline{m}\,\overline{n})-\beta_{3}\underline{m}^{2}-\beta_{4}\underline{n}^{2}.

If $-\beta_{1}\underline{m}^{2}+\beta_{2}\overline{m}\,\overline{n}\leq 0$, any $a\geq 0$ makes this expression negative. If instead $-\beta_{1}\underline{m}^{2}+\beta_{2}\overline{m}\,\overline{n}\geq 0$, then $a$ must satisfy

a\leq\Big{|}\frac{\beta_{3}\underline{m}^{2}+\beta_{4}\underline{n}^{2}}{-\beta_{1}\underline{m}^{2}+\beta_{2}\overline{m}\,\overline{n}}\Big{|}. \quad (14)

This argument shows that over the region $R=\{(x,v)\;|\;\underline{n}\leq\lVert\nabla f(x)\rVert<\overline{n}\text{ and }\underline{m}\leq\lVert v\rVert<\overline{m}\}$, any $a\geq 0$ satisfying (14) ensures that $\dot{V}\leq-\frac{\sqrt{\mu}}{4}V$, and hence the desired exponential decrease of the Lyapunov function. This observation opens the way to modify the value of the parameter $a$ adaptively along the execution of the heavy-ball dynamics with displaced gradient, depending on the region of the state space visited by its trajectories. $\bullet$
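For concreteness, the following sketch evaluates the constants of Theorem IV.2, the uniform bound $a^{*}_{1}$ in (12), and the region-dependent bound (14); the numerical values in the usage example are hypothetical.

```python
import numpy as np

def beta_constants(mu, L, s):
    """Constants beta_1..beta_4 from Theorem IV.2 (sq_mu_s plays the role of 1 + sqrt(mu*s))."""
    sq_mu_s = 1 + np.sqrt(mu * s)
    b1 = sq_mu_s * mu
    b2 = sq_mu_s * L / np.sqrt(mu)
    b3 = 13 * np.sqrt(mu) / 16
    b4 = (4 * mu ** 2 * np.sqrt(s) + 3 * L * np.sqrt(mu) * sq_mu_s) / (8 * L ** 2)
    return b1, b2, b3, b4

def a_star_1(mu, L, s):
    """Uniform bound (12) on the admissible gradient displacement."""
    b1, b2, b3, b4 = beta_constants(mu, L, s)
    return 2 / b2 ** 2 * (b1 * b4 + np.sqrt(b2 ** 2 * b3 * b4 + b1 ** 2 * b4 ** 2))

def a_region(mu, L, s, m_lo, m_hi, n_lo, n_hi):
    """Region-dependent bound (14), valid when m_lo <= ||v|| < m_hi and
    n_lo <= ||grad f(x)|| < n_hi; returns inf when any a >= 0 is admissible."""
    b1, b2, b3, b4 = beta_constants(mu, L, s)
    denom = -b1 * m_lo ** 2 + b2 * m_hi * n_hi
    if denom <= 0:
        return np.inf
    return abs((b3 * m_lo ** 2 + b4 * n_lo ** 2) / denom)

# Example usage with hypothetical constants.
print(a_star_1(mu=1.0, L=10.0, s=0.1))
print(a_region(mu=1.0, L=10.0, s=0.1, m_lo=0.5, m_hi=1.0, n_lo=0.5, n_hi=2.0))
```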

IV-B Triggered Design of Variable-Stepsize Algorithms

In this section we propose a discretization of the continuous heavy-ball dynamics based on resource-aware control. To do so, we employ the approaches to trigger design described in Section II-B on the dynamics $X^{a}_{\operatorname{hb}}$, whose forward-Euler discretization corresponds to the modified zero-order hold (8) of the heavy-ball dynamics.

Our starting point is the characterization of the asymptotic convergence properties of $X^{a}_{\operatorname{hb}}$ developed in Section IV-A. The trigger design requires bounding the evolution of the Lyapunov function $V$ in (III.1) along the zero-order hold implementation of the continuous-time heavy-ball dynamics with displaced gradient. However, this task presents the challenge that the definition of $V$ involves the minimizer $x_{*}$ of the optimization problem itself, which is unknown (in fact, finding it is the ultimate objective of the discrete-time algorithm we seek to design). In order to synthesize computable triggers, this raises the issue of bounding the evolution of $V$ as accurately as possible while avoiding any requirement on the knowledge of $x_{*}$. The following result, whose proof is presented in Appendix A, addresses this point.

Proposition IV.4.

(Upper bound for derivative-based triggering with zero-order hold). Let $a\geq 0$ and define

b^{\operatorname{d}}_{\operatorname{ET}}(\hat{p},t;a)=A_{\operatorname{ET}}(\hat{p},t;a)+B_{\operatorname{ET}}(\hat{p},t;a)+C_{\operatorname{ET}}(\hat{p};a),
b^{\operatorname{d}}_{\operatorname{ST}}(\hat{p},t;a)=B^{q}_{\operatorname{ST}}(\hat{p};a)t^{2}+(A_{\operatorname{ST}}(\hat{p};a)+B^{l}_{\operatorname{ST}}(\hat{p};a))t+C_{\operatorname{ST}}(\hat{p};a),

where

AET(p^,t;a)=2μtv^2+μs(f(x^+tv^)f(x^),v^\displaystyle A_{\operatorname{ET}}(\hat{p},t;a)=2\mu t\left\lVert\hat{v}\right\rVert^{2}+\sqrt{\mu_{s}}\big{(}\langle\nabla f(\hat{x}+t\hat{v})-\nabla f(\hat{x}),\hat{v}\rangle
+2tμf(x^+av^),v^+tμsf(x^+av^)2),\displaystyle\quad+2t\sqrt{\mu}\langle\nabla f(\hat{x}+a\hat{v}),\hat{v}\rangle+t\sqrt{\mu_{s}}\left\lVert\nabla f(\hat{x}+a\hat{v})\right\rVert^{2}\big{)},
BET(p^,t;a)=μt2162μv^+μsf(x^+av^)2\displaystyle B_{\operatorname{ET}}(\hat{p},t;a)=\frac{\sqrt{\mu}t^{2}}{16}\left\lVert 2\sqrt{\mu}\hat{v}+\sqrt{\mu_{s}}\nabla f(\hat{x}+a\hat{v})\right\rVert^{2}
tμ4v^2+μμs4(f(x^+tv^)f(x^)+\displaystyle\quad-\frac{t\mu}{4}\left\lVert\hat{v}\right\rVert^{2}+\frac{\sqrt{\mu}\sqrt{\mu_{s}}}{4}\big{(}f(\hat{x}+t\hat{v})-f(\hat{x})+
tv^,f(x^+av^)+t2μs4f(x^+av^)2\displaystyle\quad-t\langle\hat{v},\nabla f(\hat{x}+a\hat{v})\rangle+\frac{t^{2}\sqrt{\mu_{s}}}{4}\left\lVert\nabla f(\hat{x}+a\hat{v})\right\rVert^{2}
tμLf(x^+av^)2+tμav^,f(x^+av^)),\displaystyle\quad-\frac{t\sqrt{\mu}}{L}\left\lVert\nabla f(\hat{x}+a\hat{v})\right\rVert^{2}+t\sqrt{\mu}\langle a\hat{v},\nabla f(\hat{x}+a\hat{v})\rangle\big{)},
CET(p^;a)=13μ16v^2μ2s2f(x^)2L2\displaystyle C_{\operatorname{ET}}(\hat{p};a)=-\frac{13\sqrt{\mu}}{16}\left\lVert\hat{v}\right\rVert^{2}-\frac{\mu^{2}\sqrt{s}}{2}\displaystyle\frac{\left\lVert\nabla f(\hat{x})\right\rVert^{2}}{L^{2}}
+μs(3μ8Lf(x^)2\displaystyle\quad+\sqrt{\mu_{s}}\big{(}\frac{-3\sqrt{\mu}}{8L}\left\lVert\nabla f(\hat{x})\right\rVert^{2}
+μ(f(x^)f(x^+av^))+μf(x^)av^\displaystyle\quad+\sqrt{\mu}(f(\hat{x})-f(\hat{x}+a\hat{v}))+\sqrt{\mu}\left\lVert\nabla f(\hat{x})\right\rVert\left\lVert a\hat{v}\right\rVert
μ3/22av^2f(x^+av^)f(x^),v^\displaystyle\quad-\frac{\mu^{3/2}}{2}\left\lVert a\hat{v}\right\rVert^{2}-\langle\nabla f(\hat{x}+a\hat{v})-\nabla f(\hat{x}),\hat{v}\rangle
+μf(x^+av^),av^),\displaystyle\quad+\sqrt{\mu}\langle\nabla f(\hat{x}+a\hat{v}),a\hat{v}\rangle\big{)},
AST(p^;a)=2μv^2+μs(Lv^2+2μf(x^+av^),v^\displaystyle A_{\operatorname{ST}}(\hat{p};a)=2\mu\left\lVert\hat{v}\right\rVert^{2}+\sqrt{\mu_{s}}\big{(}L\left\lVert\hat{v}\right\rVert^{2}+2\sqrt{\mu}\langle\nabla f(\hat{x}+a\hat{v}),\hat{v}\rangle
+μsf(x^+av^)2),\displaystyle\quad+\sqrt{\mu_{s}}\left\lVert\nabla f(\hat{x}+a\hat{v})\right\rVert^{2}\big{)},
BSTl(p^;a)=μ4(μv^2+μs(f(x^)f(x^+av^),v^\displaystyle B^{l}_{\operatorname{ST}}(\hat{p};a)=\frac{\sqrt{\mu}}{4}\big{(}-\sqrt{\mu}\left\lVert\hat{v}\right\rVert^{2}+\sqrt{\mu_{s}}(\langle\nabla f(\hat{x})-\nabla f(\hat{x}+a\hat{v}),\hat{v}\rangle
μLf(x^+av^)2+μav^,f(x^+av^))),\displaystyle\quad-\frac{\sqrt{\mu}}{L}\left\lVert\nabla f(\hat{x}+a\hat{v})\right\rVert^{2}+\sqrt{\mu}\langle a\hat{v},\nabla f(\hat{x}+a\hat{v})\rangle)\big{)},
BSTq(p^;a)=μ162μv^+μsf(x^+av^)2\displaystyle B^{q}_{\operatorname{ST}}(\hat{p};a)=\frac{\sqrt{\mu}}{16}\left\lVert 2\sqrt{\mu}\hat{v}+\sqrt{\mu_{s}}\nabla f(\hat{x}+a\hat{v})\right\rVert^{2}
+μμs4(L2v^2+μs4f(x^+av^)2),\displaystyle\quad+\frac{\sqrt{\mu}\sqrt{\mu_{s}}}{4}\big{(}\frac{L}{2}\left\lVert\hat{v}\right\rVert^{2}+\frac{\sqrt{\mu_{s}}}{4}\left\lVert\nabla f(\hat{x}+a\hat{v})\right\rVert^{2}\big{)},
CST(p^;a)=CET(p^;a).\displaystyle C_{\operatorname{ST}}(\hat{p};a)=C_{\operatorname{ET}}(\hat{p};a).

Let $t\mapsto p(t)=\hat{p}+tX^{a}_{\operatorname{hb}}(\hat{p})$ be the trajectory of the zero-order hold dynamics $\dot{p}=X^{a}_{\operatorname{hb}}(\hat{p})$, $p(0)=\hat{p}$. Then, for $t\geq 0$,

\frac{d}{dt}V(p(t))+\frac{\sqrt{\mu}}{4}V(p(t))\leq b^{\operatorname{d}}_{\operatorname{ET}}(\hat{p},t;a)\leq b^{\operatorname{d}}_{\operatorname{ST}}(\hat{p},t;a).

The importance of Proposition IV.4 stems from the fact that the triggering conditions defined by $b^{\operatorname{d}}_{\#}$, $\#\in\{\operatorname{ET},\operatorname{ST}\}$, can be evaluated without knowledge of the optimizer $x_{*}$. We build on this result next to establish an upper bound for the performance-based triggering condition.

Proposition IV.5.

(Upper bound for performance-based triggering with zero-order hold). Let $a\geq 0$ and define

b^{\operatorname{p}}_{\#}(\hat{p},t;a)=\int_{0}^{t}e^{\frac{\sqrt{\mu}}{4}\zeta}b^{\operatorname{d}}_{\#}(\hat{p},\zeta;a)\,d\zeta,

for $\#\in\{\operatorname{ET},\operatorname{ST}\}$. Let $t\mapsto p(t)=\hat{p}+tX^{a}_{\operatorname{hb}}(\hat{p})$ be the trajectory of the zero-order hold dynamics $\dot{p}=X^{a}_{\operatorname{hb}}(\hat{p})$, $p(0)=\hat{p}$. Then, for $t\geq 0$,

V(p(t))-e^{-\frac{\sqrt{\mu}}{4}t}V(\hat{p})\leq e^{-\frac{\sqrt{\mu}}{4}t}b^{\operatorname{p}}_{\operatorname{ET}}(\hat{p},t;a)\leq e^{-\frac{\sqrt{\mu}}{4}t}b^{\operatorname{p}}_{\operatorname{ST}}(\hat{p},t;a).
Proof.

We rewrite $V(p(t))-e^{-\frac{\sqrt{\mu}}{4}t}V(\hat{p})=e^{-\frac{\sqrt{\mu}}{4}t}\big{(}e^{\frac{\sqrt{\mu}}{4}t}V(p(t))-V(\hat{p})\big{)}$, and note that

e^{\frac{\sqrt{\mu}}{4}t}V(p(t))-V(\hat{p})=\int_{0}^{t}\frac{d}{d\zeta}\big{(}e^{\frac{\sqrt{\mu}}{4}\zeta}V(p(\zeta))\big{)}d\zeta=\int_{0}^{t}e^{\frac{\sqrt{\mu}}{4}\zeta}\Big{(}\frac{d}{d\zeta}V(p(\zeta))+\frac{\sqrt{\mu}}{4}V(p(\zeta))\Big{)}d\zeta.

Note that the integrand corresponds to the derivative-based criterion bounded in Proposition IV.4. Therefore,

e^{\frac{\sqrt{\mu}}{4}t}V(p(t))-V(\hat{p})\leq\int_{0}^{t}e^{\frac{\sqrt{\mu}}{4}\zeta}b^{\operatorname{d}}_{\operatorname{ET}}(\hat{p},\zeta;a)\,d\zeta=b^{\operatorname{p}}_{\operatorname{ET}}(\hat{p},t;a)\leq b^{\operatorname{p}}_{\operatorname{ST}}(\hat{p},t;a)

for $t\geq 0$, and the result follows. ∎

Propositions IV.4 and IV.5 provide us with the tools to determine the stepsize according to the derivative- and performance-based triggering criteria, respectively. For convenience, and following the notation in (3), we define the stepsizes

\operatorname{step}^{\operatorname{d}}_{\#}(\hat{p};a)=\min\{t>0\;|\;b^{\operatorname{d}}_{\#}(\hat{p},t;a)=0\}, \quad (15a)
\operatorname{step}^{\operatorname{p}}_{\#}(\hat{p};a)=\min\{t>0\;|\;b^{\operatorname{p}}_{\#}(\hat{p},t;a)=0\}, \quad (15b)

for $\#\in\{\operatorname{ET},\operatorname{ST}\}$. Observe that, as long as $\hat{p}\neq p_{*}=[x_{*},0]^{T}$ and $0\leq a\leq a^{*}_{1}$, we have $C_{\#}(\hat{p};a)<0$ for $\#\in\{\operatorname{ST},\operatorname{ET}\}$ and, as a consequence, $b^{\operatorname{d}}_{\#}(\hat{p},0;a)<0$. The ET/ST terminology is justified by the following observation: in the ET case, the equation defining the stepsize is in general implicit in $t$, whereas in the ST case it is explicit in $t$. Equipped with this notation, we define the variable-stepsize strategy described in Algorithm 1, which consists of following the dynamics (8) until the exponential decay of the Lyapunov function is about to be violated, as estimated by the derivative-based ($\diamond={\operatorname{d}}$) or the performance-based ($\diamond={\operatorname{p}}$) triggering condition. When this happens, the algorithm re-samples the state before continuing to flow along (8).

Design Choices: $\diamond\in\{{\operatorname{d}},{\operatorname{p}}\}$, $\#\in\{\operatorname{ET},\operatorname{ST}\}$. Initialization: initial point $p_{0}$, objective function $f$, tolerance $\epsilon$, $a\geq 0$, $k=0$
while $\lVert\nabla f(x_{k})\rVert\geq\epsilon$ do
       Compute stepsize $\Delta_{k}=\operatorname{step}^{\diamond}_{\#}(p_{k};a)$
       Compute next iterate $p_{k+1}=p_{k}+\Delta_{k}X^{a}_{\operatorname{hb}}(p_{k})$
       Set $k=k+1$
end while
Algorithm 1: Displaced-Gradient Algorithm
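The following sketch implements Algorithm 1 for the self-triggered, derivative-based design choice $(\#,\diamond)=(\operatorname{ST},{\operatorname{d}})$: the stepsize (15a) is the unique positive root of the quadratic $b^{\operatorname{d}}_{\operatorname{ST}}$ from Proposition IV.4. The transcription of the coefficient functions is assumed faithful to the proposition, and the quadratic test problem at the end is a hypothetical example.

```python
import numpy as np

def st_coefficients(x_hat, v_hat, a, f, grad_f, mu, L, s):
    """Self-triggered coefficients A_ST, B^l_ST, B^q_ST, C_ST of Proposition IV.4,
    so that b^d_ST(p_hat, t; a) = Bq*t**2 + (A + Bl)*t + C."""
    sq, ms = np.sqrt(mu), 1 + np.sqrt(mu * s)        # ms plays the role of 1 + sqrt(mu*s)
    g0 = grad_f(x_hat)                               # gradient at the sampled point
    ga = grad_f(x_hat + a * v_hat)                   # gradient at the displaced point
    nv2, nga2, ng02 = v_hat @ v_hat, ga @ ga, g0 @ g0
    A = 2 * mu * nv2 + ms * (L * nv2 + 2 * sq * (ga @ v_hat) + ms * nga2)
    Bl = sq / 4 * (-sq * nv2 + ms * ((g0 - ga) @ v_hat - sq / L * nga2
                                     + sq * a * (v_hat @ ga)))
    Bq = (sq / 16 * np.linalg.norm(2 * sq * v_hat + ms * ga) ** 2
          + sq * ms / 4 * (L / 2 * nv2 + ms / 4 * nga2))
    C = (-13 * sq / 16 * nv2 - mu ** 2 * np.sqrt(s) / 2 * ng02 / L ** 2
         + ms * (-3 * sq / (8 * L) * ng02 + sq * (f(x_hat) - f(x_hat + a * v_hat))
                 + sq * np.linalg.norm(g0) * np.linalg.norm(a * v_hat)
                 - mu ** 1.5 / 2 * np.linalg.norm(a * v_hat) ** 2
                 - (ga - g0) @ v_hat + sq * a * (ga @ v_hat)))
    return A, Bl, Bq, C

def step_ST_d(x_hat, v_hat, a, f, grad_f, mu, L, s):
    """Stepsize (15a) for the self-triggered, derivative-based criterion: the
    unique positive root of the quadratic b^d_ST (C < 0 away from the optimizer)."""
    A, Bl, Bq, C = st_coefficients(x_hat, v_hat, a, f, grad_f, mu, L, s)
    return (-(A + Bl) + np.sqrt((A + Bl) ** 2 - 4 * Bq * C)) / (2 * Bq)

def displaced_gradient_algorithm(x0, f, grad_f, mu, L, s, a=0.0, eps=1e-6):
    """Sketch of Algorithm 1 with the (ST, d) design choice."""
    x = np.asarray(x0, dtype=float)
    v = -2 * np.sqrt(s) * grad_f(x) / (1 + np.sqrt(mu * s))   # initialization (4b)
    while np.linalg.norm(grad_f(x)) >= eps:
        dt = step_ST_d(x, v, a, f, grad_f, mu, L, s)
        g = grad_f(x + a * v)
        x, v = x + dt * v, v - dt * (2 * np.sqrt(mu) * v + (1 + np.sqrt(mu * s)) * g)
    return x

# Hypothetical test problem: quadratic with mu = 1, L = 10.
Q = np.diag([1.0, 10.0])
x_opt = displaced_gradient_algorithm(
    x0=[2.0, -1.0], f=lambda x: 0.5 * x @ Q @ x, grad_f=lambda x: Q @ x,
    mu=1.0, L=10.0, s=0.1, a=0.0)
print(x_opt)
```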

IV-C Convergence Analysis of Displaced-Gradient Algorithm

Here we characterize the convergence properties of the derivative- and performance-based implementations of the Displaced-Gradient Algorithm. In each case, we show that the algorithm is implementable (i.e., it admits a MIET) and inherits the convergence rate from the continuous-time dynamics. The following result deals with the derivative-based implementation of Algorithm 1.

Theorem IV.6.

(Convergence of derivative-based implementation of Displaced-Gradient Algorithm). Let $\hat{\beta}_{1},\dots,\hat{\beta}_{5}>0$ be

\hat{\beta}_{1}=\sqrt{\mu_{s}}\Big{(}\frac{3\sqrt{\mu}}{2}+L\Big{)},\quad\hat{\beta}_{2}=\frac{3\sqrt{\mu}\sqrt{\mu_{s}}}{2},\quad\hat{\beta}_{3}=\frac{13\sqrt{\mu}}{16},\quad\hat{\beta}_{4}=\frac{4\mu^{2}\sqrt{s}+3L\sqrt{\mu}\sqrt{\mu_{s}}}{8L^{2}},\quad\hat{\beta}_{5}=\sqrt{\mu_{s}}\Big{(}\frac{5\sqrt{\mu}L}{2}-\frac{\mu^{3/2}}{2}\Big{)},

and define

a^{*}_{2}=\alpha\min\Big{\{}\frac{-\hat{\beta}_{1}+\sqrt{\hat{\beta}_{1}^{2}+4\hat{\beta}_{5}\hat{\beta}_{3}}}{2\hat{\beta}_{5}},\frac{\hat{\beta}_{4}}{\hat{\beta}_{2}}\Big{\}}, \quad (16)

with $0<\alpha<1$. Then, for $0\leq a\leq a^{*}_{2}$, $\diamond={\operatorname{d}}$, and $\#\in\{\operatorname{ET},\operatorname{ST}\}$, the variable-stepsize strategy in Algorithm 1 has the following properties:

(i) the stepsize is uniformly lower bounded by the positive constant $\operatorname{MIET}(a)$, where

\operatorname{MIET}(a)=-\nu+\sqrt{\nu^{2}+\eta}, \quad (17)

$\eta=\min\{\eta_{1},\eta_{2}\}$, $\nu=\max\{\nu_{1},\nu_{2}\}$, and

    η1\displaystyle\eta_{1} =8aμs(a(μ5L)2Lμ3)+132μsL(3a2μsL+1)+8μ,\displaystyle=\frac{8a\sqrt{\mu_{s}}\left(a(\mu-5L)-\frac{2L}{\sqrt{\mu}}-3\right)+13}{2\sqrt{\mu_{s}}L\left(3a^{2}\sqrt{\mu_{s}}L+1\right)+8\mu},
    η2\displaystyle\eta_{2} =3μsμL(4aL1)4μ2s3μsμL2,\displaystyle=-\frac{3\sqrt{\mu_{s}}\sqrt{\mu}L(4aL-1)-4\mu^{2}\sqrt{s}}{3\mu_{s}\sqrt{\mu}L^{2}},
    ν1\displaystyle\nu_{1} =μ(2a3μsL2+aμs+16)+8μsL(2a2μsL+1)2μ(μsL(3a2μsL+1)+4μ)\displaystyle=\frac{\mu\left(2a^{3}\sqrt{\mu_{s}}L^{2}+a\sqrt{\mu_{s}}+16\right)+8\sqrt{\mu_{s}}L\left(2a^{2}\sqrt{\mu_{s}}L+1\right)}{2\sqrt{\mu}\left(\sqrt{\mu_{s}}L\left(3a^{2}\sqrt{\mu_{s}}L+1\right)+4\mu\right)}
    +μs(aL(8aL+1)+4)μsL(3a2μsL+1)+4μ,\displaystyle\quad+\frac{\sqrt{\mu_{s}}(aL(8aL+1)+4)}{\sqrt{\mu_{s}}L\left(3a^{2}\sqrt{\mu_{s}}L+1\right)+4\mu},
    ν2\displaystyle\nu_{2} =aμ+8μs+8μ3μsμ;\displaystyle=\frac{a\mu+8\sqrt{\mu_{s}}+8\sqrt{\mu}}{3\sqrt{\mu_{s}}\sqrt{\mu}};
(ii) $\frac{d}{dt}V(p_{k}+tX^{a}_{\operatorname{hb}}(p_{k}))\leq-\frac{\sqrt{\mu}}{4}V(p_{k}+tX^{a}_{\operatorname{hb}}(p_{k}))$ for all $t\in[0,\Delta_{k}]$ and $k\in\{0\}\cup\mathbb{N}$.

As a consequence, $f(x_{k+1})-f(x_{*})=\mathcal{O}(e^{-\frac{\sqrt{\mu}}{4}\sum_{i=0}^{k}\Delta_{i}})$.

Proof.

Regarding fact (i), we prove the result for the $\operatorname{ST}$ case, as the $\operatorname{ET}$ case follows from $\operatorname{step}^{\operatorname{d}}_{\operatorname{ET}}(\hat{p};a)\geq\operatorname{step}^{\operatorname{d}}_{\operatorname{ST}}(\hat{p};a)$. We start by upper bounding $C_{\operatorname{ST}}(\hat{p};a)$ by a negative quadratic function of $\lVert\hat{v}\rVert$ and $\lVert\nabla f(\hat{x})\rVert$ as follows,

CST(p^;a)=13μ16v^2+μs3μ8Lf(x^)2\displaystyle C_{\operatorname{ST}}(\hat{p};a)=-\frac{13\sqrt{\mu}}{16}\left\lVert\hat{v}\right\rVert^{2}+\sqrt{\mu_{s}}\frac{-3\sqrt{\mu}}{8L}\left\lVert\nabla f(\hat{x})\right\rVert^{2}
μ2s2L2f(x^)2+μs(μ(f(x^)f(x^+av^))(a)\displaystyle\quad-\frac{\mu^{2}\sqrt{s}}{2L^{2}}\displaystyle\left\lVert\nabla f(\hat{x})\right\rVert^{2}+\sqrt{\mu_{s}}\big{(}\sqrt{\mu}\underbrace{(f(\hat{x})-f(\hat{x}+a\hat{v}))}_{\textrm{(a)}}
+μf(x^)av^(b)μ3/22av^2\displaystyle\quad+\sqrt{\mu}\underbrace{\left\lVert\nabla f(\hat{x})\right\rVert\left\lVert a\hat{v}\right\rVert}_{\textrm{(b)}}-\frac{\mu^{3/2}}{2}\left\lVert a\hat{v}\right\rVert^{2}
+f(x^)f(x^+av^),v^(c)+μf(x^+av^),av^(d)).\displaystyle\quad+\underbrace{\langle\nabla f(\hat{x})-\nabla f(\hat{x}+a\hat{v}),\hat{v}\rangle}_{\textrm{(c)}}+\sqrt{\mu}\underbrace{\langle\nabla f(\hat{x}+a\hat{v}),a\hat{v}\rangle}_{\textrm{(d)}}\big{)}.

Using the LL-Lipschitzness of the gradient and Young’s inequality, we can easily upper bound

(a) f(x^+av^),av^+L2a2v^2Using (A.1c)\displaystyle\leq\underbrace{\langle\nabla f(\hat{x}+a\hat{v}),-a\hat{v}\rangle+\frac{L}{2}a^{2}\left\lVert\hat{v}\right\rVert^{2}}_{\textrm{Using~{}\eqref{eq:aux-d}}}
=f(x^+av^)f(x^),av^+L2a2v^2\displaystyle=\langle\nabla f(\hat{x}+a\hat{v})-\nabla f(\hat{x}),-a\hat{v}\rangle+\frac{L}{2}a^{2}\left\lVert\hat{v}\right\rVert^{2}
+f(x^),av^\displaystyle\quad+\langle\nabla f(\hat{x}),-a\hat{v}\rangle
La2v^2+L2a2v^2+a(f(x^)22+v^22)\displaystyle\leq La^{2}\left\lVert\hat{v}\right\rVert^{2}+\frac{L}{2}a^{2}\left\lVert\hat{v}\right\rVert^{2}+a\big{(}\frac{\left\lVert\nabla f(\hat{x})\right\rVert^{2}}{2}+\frac{\left\lVert\hat{v}\right\rVert^{2}}{2}\big{)}
=3La2+a2v^2+a2f(x^)2,\displaystyle=\frac{3La^{2}+a}{2}\left\lVert\hat{v}\right\rVert^{2}+\frac{a}{2}\left\lVert\nabla f(\hat{x})\right\rVert^{2},
(b) a(f(x^)22+v^22),\displaystyle\leq a\big{(}\frac{\left\lVert\nabla f(\hat{x})\right\rVert^{2}}{2}+\frac{\left\lVert\hat{v}\right\rVert^{2}}{2}\big{)},
(c) Lav^2,\displaystyle\leq La\left\lVert\hat{v}\right\rVert^{2},
(d) =f(x^+av^)f(x^)+f(x^),av^\displaystyle=\langle\nabla f(\hat{x}+a\hat{v})-\nabla f(\hat{x})+\nabla f(\hat{x}),a\hat{v}\rangle
La2v^2+f(x^),av^\displaystyle\leq La^{2}\left\lVert\hat{v}\right\rVert^{2}+\langle\nabla f(\hat{x}),a\hat{v}\rangle
=2La2+a2v^2+a2f(z^)2.\displaystyle=\frac{2La^{2}+a}{2}\left\lVert\hat{v}\right\rVert^{2}+\frac{a}{2}\left\lVert\nabla f(\hat{z})\right\rVert^{2}.

Note that, with the definition of the constants $\hat{\beta}_{1},\dots,\hat{\beta}_{5}>0$ in the statement, we can write

CST(p^;a)\displaystyle C_{\operatorname{ST}}(\hat{p};a) aβ^1v^2+a2β^5v^2+aβ^2f(x^)2\displaystyle\leq a\hat{\beta}_{1}\left\lVert\hat{v}\right\rVert^{2}+a^{2}\hat{\beta}_{5}\left\lVert\hat{v}\right\rVert^{2}+a\hat{\beta}_{2}\left\lVert\nabla f(\hat{x})\right\rVert^{2}
β^3v^2β^4f(x^)2.\displaystyle\quad-\hat{\beta}_{3}\left\lVert\hat{v}\right\rVert^{2}-\hat{\beta}_{4}\left\lVert\nabla f(\hat{x})\right\rVert^{2}.

Therefore, for $a\in[0,a^{*}_{2}]$, we have

aβ^1+a2β^5β^3\displaystyle a\hat{\beta}_{1}+a^{2}\hat{\beta}_{5}-\hat{\beta}_{3} a2β^1+(a2)2β^5β^3=γ1<0\displaystyle\leq a^{*}_{2}\hat{\beta}_{1}+(a^{*}_{2})^{2}\hat{\beta}_{5}-\hat{\beta}_{3}=-\gamma_{1}<0
aβ^2β^4\displaystyle a\hat{\beta}_{2}-\hat{\beta}_{4} a2β^2β^4=γ2<0,\displaystyle\leq a^{*}_{2}\hat{\beta}_{2}-\hat{\beta}_{4}=-\gamma_{2}<0,

and hence $C_{\operatorname{ST}}(\hat{p};a)\leq-\gamma_{1}\lVert\hat{v}\rVert^{2}-\gamma_{2}\lVert\nabla f(\hat{x})\rVert^{2}$. Similarly, introducing

γ3\displaystyle\gamma_{3} =2a2μsL2+2a2μsμL2+μsμ+μsL+2μ,\displaystyle=2a^{2}\mu_{s}L^{2}+2a^{2}\sqrt{\mu_{s}}\sqrt{\mu}L^{2}+\sqrt{\mu_{s}}\sqrt{\mu}+\sqrt{\mu_{s}}L+2\mu,
γ4\displaystyle\gamma_{4} =2μs+2μsμ,γ5=18aμs(2a2μL2+μ+2μL),\displaystyle=2\mu_{s}+2\sqrt{\mu_{s}}\sqrt{\mu},\;\gamma_{5}\!=\!\frac{1}{8}a\sqrt{\mu_{s}}\left(2a^{2}\mu L^{2}+\mu+2\sqrt{\mu}L\right),
γ6\displaystyle\gamma_{6} =aμμs4,γ7=38a2μsμL2+18μsμL+μ3/22,\displaystyle=\frac{a\mu\sqrt{\mu_{s}}}{4},\;\gamma_{7}=\frac{3}{8}a^{2}\mu_{s}\sqrt{\mu}L^{2}+\frac{1}{8}\sqrt{\mu_{s}}\sqrt{\mu}L+\frac{\mu^{3/2}}{2},
γ8\displaystyle\gamma_{8} =3μsμ8,\displaystyle=\frac{3\mu_{s}\sqrt{\mu}}{8},

one can show that

AST(p^;a)A^ST(p^;a)\displaystyle A_{\operatorname{ST}}(\hat{p};a)\leq\hat{A}_{\operatorname{ST}}(\hat{p};a) =γ3v^2+γ4f(x^)2,\displaystyle=\gamma_{3}\left\lVert\hat{v}\right\rVert^{2}+\gamma_{4}\left\lVert\nabla f(\hat{x})\right\rVert^{2},
BSTl(p^;a)B^STl(p^;a)\displaystyle B^{l}_{\operatorname{ST}}(\hat{p};a)\leq\hat{B}^{l}_{\operatorname{ST}}(\hat{p};a) =γ5v^2+γ6f(x^)2,\displaystyle=\gamma_{5}\left\lVert\hat{v}\right\rVert^{2}+\gamma_{6}\left\lVert\nabla f(\hat{x})\right\rVert^{2},
BSTq(p^;a)B^STq(p^;a)\displaystyle B^{q}_{\operatorname{ST}}(\hat{p};a)\leq\hat{B}^{q}_{\operatorname{ST}}(\hat{p};a) =γ7v^2+γ8f(x^)2.\displaystyle=\gamma_{7}\left\lVert\hat{v}\right\rVert^{2}+\gamma_{8}\left\lVert\nabla f(\hat{x})\right\rVert^{2}.

Thus, from (15a), we have

stepSTd(p^;a)(A^ST(p^;a)+B^STl(p^;a))2B^STq(p^;a)\displaystyle\operatorname{step}^{\operatorname{d}}_{\operatorname{ST}}(\hat{p};a)\geq\frac{-(\hat{A}_{\operatorname{ST}}(\hat{p};a)+\hat{B}^{l}_{\operatorname{ST}}(\hat{p};a))}{2\hat{B}^{q}_{\operatorname{ST}}(\hat{p};a)} (18)
+(A^ST(p^;a)+B^STl(p^;a)2B^STq(p^;a))2CST(p^;a)B^STq(p^;a).\displaystyle\quad+\sqrt{\left(\frac{\hat{A}_{\operatorname{ST}}(\hat{p};a)+\hat{B}^{l}_{\operatorname{ST}}(\hat{p};a)}{2\hat{B}^{q}_{\operatorname{ST}}(\hat{p};a)}\right)^{2}-\frac{C_{\operatorname{ST}}(\hat{p};a)}{\hat{B}^{q}_{\operatorname{ST}}(\hat{p};a)}}.

Using now [21, supplementary material, Lemma 1], we deduce

ηCST(p^;a)B^STq(p^;a),A^ST(p^;a)+B^STl(p^;a)2B^STq(p^;a)ν,\displaystyle\eta\leq\frac{-C_{\operatorname{ST}}(\hat{p};a)}{\hat{B}^{q}_{\operatorname{ST}}(\hat{p};a)},\quad\frac{\hat{A}_{\operatorname{ST}}(\hat{p};a)+\hat{B}^{l}_{\operatorname{ST}}(\hat{p};a)}{2\hat{B}^{q}_{\operatorname{ST}}(\hat{p};a)}\leq\nu,

where

η\displaystyle\eta =min{γ1γ7,γ2γ8},ν=max{γ3+γ52γ7,γ4+γ62γ8}.\displaystyle=\min\{\frac{\gamma_{1}}{\gamma_{7}},\frac{\gamma_{2}}{\gamma_{8}}\},\quad\nu=\max\{\frac{\gamma_{3}+\gamma_{5}}{2\gamma_{7}},\frac{\gamma_{4}+\gamma_{6}}{2\gamma_{8}}\}.

With these elements in place and referring to (18), we have

stepSTd(p^;a)\displaystyle\operatorname{step}^{\operatorname{d}}_{\operatorname{ST}}(\hat{p};a) (A^ST(p^;a)+B^STl(p^;a))2B^STq(p^;a)\displaystyle\geq\frac{-(\hat{A}_{\operatorname{ST}}(\hat{p};a)+\hat{B}^{l}_{\operatorname{ST}}(\hat{p};a))}{2\hat{B}^{q}_{\operatorname{ST}}(\hat{p};a)}
+(A^ST(p^;a)+B^STl(p^;a)2B^STq(p^;a))2+η.\displaystyle\quad+\sqrt{\left(\frac{\hat{A}_{\operatorname{ST}}(\hat{p};a)+\hat{B}^{l}_{\operatorname{ST}}(\hat{p};a)}{2\hat{B}^{q}_{\operatorname{ST}}(\hat{p};a)}\right)^{2}+\eta}.

We observe now that $z\mapsto g(z)=-z+\sqrt{z^{2}+\eta}$ is monotonically decreasing and lower bounded. So, if $z$ is upper bounded, then $g(z)$ is lower bounded by a positive constant. Taking $z=\frac{\hat{A}_{\operatorname{ST}}(\hat{p};a)+\hat{B}^{l}_{\operatorname{ST}}(\hat{p};a)}{2\hat{B}^{q}_{\operatorname{ST}}(\hat{p};a)}\leq\nu$ gives the lower bound on the stepsize. Finally, the algorithm design together with Proposition IV.4 ensures fact (ii) throughout its evolution. ∎

It is worth noticing that the derivative-based implementation of the Displaced-Gradient Algorithm generalizes the algorithm proposed in our previous work [21] (in fact, the strategy proposed there corresponds to the choice $a=0$). The next result characterizes the convergence properties of the performance-based implementation of Algorithm 1.

Theorem IV.7.

(Convergence of performance-based implementation of Displaced-Gradient Algorithm). For $0\leq a\leq a^{*}_{2}$, $\diamond={\operatorname{p}}$, and $\#\in\{\operatorname{ET},\operatorname{ST}\}$, the variable-stepsize strategy in Algorithm 1 has the following properties:

(i) the stepsize is uniformly lower bounded by the positive constant $\operatorname{MIET}(a)$;

(ii) $V(p_{k}+tX^{a}_{\operatorname{hb}}(p_{k}))\leq e^{-\frac{\sqrt{\mu}}{4}t}V(p_{k})$ for all $t\in[0,\Delta_{k}]$ and $k\in\{0\}\cup\mathbb{N}$.

As a consequence, $f(x_{k+1})-f(x_{*})=\mathcal{O}(e^{-\frac{\sqrt{\mu}}{4}\sum_{i=0}^{k}\Delta_{i}})$.

Proof.

To show (i), notice that it is sufficient to prove that $\operatorname{step}^{\operatorname{p}}_{\operatorname{ST}}$ is uniformly lower bounded away from zero. This is because of the definition of the stepsize in (15b) and the fact that $b^{\operatorname{p}}_{\operatorname{ET}}(\hat{p},t;a)\leq b^{\operatorname{p}}_{\operatorname{ST}}(\hat{p},t;a)$ for all $\hat{p}$ and all $t$. For an arbitrary fixed $\hat{p}$, note that $t\mapsto b^{\operatorname{d}}_{\operatorname{ST}}(\hat{p},t;a)$ is strictly negative on the interval $[0,\operatorname{step}^{\operatorname{d}}_{\operatorname{ST}}(\hat{p};a))$ given the definition of the stepsize in (15a). Consequently, the function $t\mapsto b^{\operatorname{p}}_{\operatorname{ST}}(\hat{p},t;a)=\int_{0}^{t}e^{\frac{\sqrt{\mu}}{4}\zeta}b^{\operatorname{d}}_{\operatorname{ST}}(\hat{p},\zeta;a)\,d\zeta$ is strictly negative over $(0,\operatorname{step}^{\operatorname{d}}_{\operatorname{ST}}(\hat{p};a))$. From the definition of $\operatorname{step}^{\operatorname{p}}_{\operatorname{ST}}$, it then follows that $\operatorname{step}^{\operatorname{p}}_{\operatorname{ST}}(\hat{p};a)\geq\operatorname{step}^{\operatorname{d}}_{\operatorname{ST}}(\hat{p};a)$. The result now follows by noting that $\operatorname{step}^{\operatorname{d}}_{\operatorname{ST}}$ is uniformly lower bounded away from zero by a positive constant, cf. Theorem IV.6(i).

To show (ii), we recall that $\Delta_{k}=\operatorname{step}^{\operatorname{p}}_{\#}(p_{k};a)$ for $\#\in\{\operatorname{ET},\operatorname{ST}\}$ and use Proposition IV.5 with $\hat{p}=p_{k}$ to obtain, for all $t\in[0,\Delta_{k}]$,

V(p(t))-e^{-\frac{\sqrt{\mu}}{4}t}V(p_{k})\leq e^{-\frac{\sqrt{\mu}}{4}t}b^{\operatorname{p}}_{\#}(p_{k},t;a)\leq e^{-\frac{\sqrt{\mu}}{4}t}b^{\operatorname{p}}_{\#}(p_{k},\Delta_{k};a)=0,

as claimed. ∎

The proof of Theorem IV.7 brings up an interesting geometric interpretation of the relationship between the stepsizes determined according to the derivative- and performance-based approaches. In fact, since

\frac{d}{dt}b^{\operatorname{p}}_{\#}(\hat{p},t;a)=e^{\frac{\sqrt{\mu}}{4}t}b^{\operatorname{d}}_{\#}(\hat{p},t;a),

we observe that $\operatorname{step}^{\operatorname{d}}_{\#}(\hat{p};a)$ is precisely the first (positive) critical point of $t\mapsto b^{\operatorname{p}}_{\#}(\hat{p},t;a)$. Therefore, $\operatorname{step}^{\operatorname{p}}_{\#}(\hat{p};a)$ is the smallest nonzero root of $t\mapsto b^{\operatorname{p}}_{\#}(\hat{p},t;a)$, whereas $\operatorname{step}^{\operatorname{d}}_{\#}(\hat{p};a)$ is the time at which $t\mapsto b^{\operatorname{p}}_{\#}(\hat{p},t;a)$ achieves its smallest value and is, consequently, furthest away from zero. This confirms that the performance-based approach yields larger stepsizes than the derivative-based approach.
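A sketch of the corresponding performance-based stepsize (15b) makes this interpretation concrete: it accumulates the weighted integral of $b^{\operatorname{d}}_{\operatorname{ST}}$ and returns its first positive zero, reusing the st_coefficients helper from the Algorithm 1 sketch above; the numerical integration rule and the horizon are assumptions of this illustration.

```python
import numpy as np

def step_ST_p(x_hat, v_hat, a, f, grad_f, mu, L, s, dt=1e-4, t_max=10.0):
    """Performance-based, self-triggered stepsize (15b): the first positive zero of
    b^p_ST(t) = int_0^t exp(sqrt(mu)/4 * z) * b^d_ST(z) dz, accumulated numerically."""
    A, Bl, Bq, C = st_coefficients(x_hat, v_hat, a, f, grad_f, mu, L, s)
    b_d = lambda z: Bq * z ** 2 + (A + Bl) * z + C    # quadratic bound of Prop. IV.4
    integral, t = 0.0, 0.0
    while t < t_max:
        integral += np.exp(np.sqrt(mu) / 4 * t) * b_d(t) * dt   # left Riemann sum
        t += dt
        if integral >= 0:                              # b^p_ST returns to zero: trigger
            return t
    return t_max
```

Up to the discretization error of the integration, the returned time is never smaller than the derivative-based stepsize, consistent with the geometric interpretation above.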

V Exploiting Sampled Information to Enhance Algorithm Performance

Here we describe two different refinements of the implementations proposed in Section IV to further enhance their performance. Both of them are based on further exploiting the sampled information about the system. The first refinement, cf. Section V-A, looks at the possibility of adapting the value of the gradient displacement as the algorithm is executed. The second refinement, cf. Section V-B, develops a high-order hold that more accurately approximates the evolution of the continuous-time heavy-ball dynamics with displaced gradient.

V-A Adaptive Gradient Displacement

The derivative- and performance-based triggered implementations in Section IV-B both employ a constant value of the parameter aa. Here, motivated by the observation made in Remark IV.3, we develop triggered implementations that adaptively adjust the value of the gradient displacement depending on the region of the space to which the state belongs. Rather than relying on the condition (14), which would require partitioning the state space based on bounds on f(x)\nabla f(x) and vv, we seek to compute on the fly a value of the parameter aa that ensures the exponential decrease of the Lyapunov function at the current state. Formally, the strategy is stated in Algorithm 2.

Design Choices: {d,p}\diamond\in\{{\operatorname{d}},{\operatorname{p}}\}, #{ET,ST}\#\in\{\operatorname{ET},\operatorname{ST}\} Initialization: Initial point (p0p_{0}), objective function (ff), tolerance (ϵ\epsilon), increase rate (ri>1r_{i}>1), decrease rate (0<rd<10<r_{d}<1), stepsize lower bound (τ\tau), a0a\geq 0, k=0k=0
while f(xk)ϵ\left\lVert\nabla f(x_{k})\right\rVert\geq\epsilon do
       increase = True
      exit = False
      while  exit = False do
             while C#(pk;a)0C_{\#}(p_{k};a)\geq 0  do
                   a=arda=ar_{d}
                  increase = False
             end while
            if step#(pk;a)τ\operatorname{step}^{\diamond}_{\#}(p_{k};a)\geq\tau then
                   exit = True
             else
                   a=arda=ar_{d}
                  increase = False
            
       end while
      
      Compute stepsize Δk=step#(pk;a)\Delta_{k}=\operatorname{step}^{\diamond}_{\#}(p_{k};a)
       Compute next iterate pk+1=pk+ΔkXhba(pk)p_{k+1}=p_{k}+\Delta_{k}X_{\operatorname{hb}}^{a}(p_{k})
       Set k=k+1k=k+1
      if increase = True then
             a=aria=ar_{i}
      
end while
Algorithm 2 Adaptive Displaced-Gradient Algorithm
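For concreteness, the following minimal Python sketch mirrors the logic of Algorithm 2, under the assumption that the user supplies the triggering function, the stepsize rule, the gradient of the objective, and the displaced-gradient vector field as the callables C_bound, step_fn, grad_f, and X_hb; these names are ours and are not part of any library.

```python
# Minimal sketch of the adaptive loop in Algorithm 2 (Adaptive Displaced-Gradient).
# C_bound(p, a), step_fn(p, a), grad_f(x), and X_hb(p, a) are user-supplied
# placeholders for C_#, step_#, the gradient of f, and X_hb^a, respectively.
import numpy as np

def adaptive_displaced_gradient(p0, grad_f, C_bound, step_fn, X_hb,
                                a=1.0, eps=1e-6, r_i=2.0, r_d=0.5,
                                tau=1e-3, max_iter=10000):
    p = np.asarray(p0, dtype=float)
    n = p.size // 2                                   # p = [x; v]
    for _ in range(max_iter):
        if np.linalg.norm(grad_f(p[:n])) < eps:       # termination test on x_k
            break
        increase = True
        while True:                                   # tune a at the current state
            while C_bound(p, a) >= 0.0:               # enforce C_#(p_k; a) < 0
                a *= r_d
                increase = False
            if step_fn(p, a) >= tau:                  # accept a once stepsize >= tau
                break
            a *= r_d
            increase = False
        dt = step_fn(p, a)                            # stepsize Delta_k
        p = p + dt * X_hb(p, a)                       # zero-order-hold (Euler) update
        if increase:
            a *= r_i                                  # attempt a larger displacement
    return p
```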
Proposition V.1.

(Convergence of Adaptive Displaced-Gradient Algorithm). For {d,p}\diamond\in\{{\operatorname{d}},{\operatorname{p}}\}, #{ET,ST}\#\in\{\operatorname{ET},\operatorname{ST}\}, and τmina[0,a2]MIET(a)\tau\leq\min_{a\in[0,a^{*}_{2}]}\operatorname{MIET}(a), the variable-stepsize strategy in Algorithm 2 has the following properties:

  1. (i)

    it is executable (i.e., at each iteration, the parameter aa is determined in a finite number of steps);

  2. (ii)

    the stepsize is uniformly lower bounded by τ\tau;

  3. (iii)

    it satisfies f(xk+1)f(x)=𝒪(eμ4i=0kΔi)f(x_{k+1})\!-\!f(x_{*})\!=\!\mathcal{O}(e^{-\frac{\sqrt{\mu}}{4}\sum_{i=0}^{k}\Delta_{i}}), for k{0}k\in\{0\}\cup\mathbb{N}.

Proof.

Notice first that the function aMIET(a)>0a\mapsto\operatorname{MIET}(a)>0 defined in (17) is continuous and therefore attains its minimum over the compact set [0,a2][0,a^{*}_{2}], so the choice of τ\tau is well defined. At each iteration, Algorithm 2 first ensures that C#(p^;a)<0C_{\#}(\hat{p};a)<0, decreasing aa if this is not the case. This condition is guaranteed to hold as soon as a<a2a<a_{2}^{*} (cf. proof of Theorem IV.6), and hence the inner loop only takes a finite number of steps. Once C#(p^;a)<0C_{\#}(\hat{p};a)<0, the stepsize can be computed so as to guarantee the desired decrease of the Lyapunov function VV. The algorithm next checks whether the stepsize is lower bounded by τ\tau. If that is not the case, the algorithm reduces aa and re-checks whether C#(p^;a)<0C_{\#}(\hat{p};a)<0. In a finite number of steps, the algorithm therefore either computes a stepsize lower bounded by τ\tau with a>a2a>a^{*}_{2}, or decreases aa enough to make aa2a\leq a^{*}_{2}, in which case the stepsize is already lower bounded by τ\tau by the choice of τ\tau. These arguments establish facts (i) and (ii) simultaneously. Finally, fact (iii) is a consequence of the prescribed decrease of the Lyapunov function along the algorithm execution. ∎

V-B Discretization via High-Order Hold

The modified zero-order hold based on employing displaced gradients developed in Section IV is an example of the possibilities enabled by more elaborate uses of sampled information. In this section, we propose another such use based on the observation that the continuous-time heavy-ball dynamics can be decomposed as the sum of a linear term and a nonlinear term. Specifically, we have

Xhba(p)\displaystyle X_{\operatorname{hb}}^{a}(p) =[v2μv]+[0μsf(x+av)].\displaystyle=\begin{bmatrix}v\\ -2\sqrt{\mu}v\end{bmatrix}+\begin{bmatrix}0\\ -\sqrt{\mu_{s}}\nabla f(x+av)\end{bmatrix}.

Note that the first term in this decomposition is linear, whereas the other one contains the potentially nonlinear gradient term that complicates finding a closed-form solution. Keeping this in mind when considering a discrete-time implementation, it would seem reasonable to perform a zero-order hold only on the nonlinear term while exactly integrating the resulting differential equation. Formally, a zero-order hold at p^=[x^,v^]\hat{p}=[\hat{x},\hat{v}] of the nonlinear term above yields a system of the form

[x˙v˙]\displaystyle\begin{bmatrix}\dot{x}\\ \dot{v}\end{bmatrix} =A[xv]+b,\displaystyle=A\begin{bmatrix}x\\ v\end{bmatrix}+b, (19)

with p(0)=p^p(0)=\hat{p}, and where

A=[0102μ],b=[0μsf(x^+av^)].\displaystyle A=\begin{bmatrix}0&1\\ 0&-2\sqrt{\mu}\end{bmatrix},\quad b=\begin{bmatrix}0\\ -\sqrt{\mu_{s}}\nabla f(\hat{x}+a\hat{v})\end{bmatrix}.

Equation (19) describes an inhomogeneous linear dynamical system, which can be integrated by the method of variation of constants [30]. Its solution is given by p(t)=eAt(0teAζb𝑑ζ+p(0))p(t)=e^{At}\big{(}\int_{0}^{t}e^{-A\zeta}bd\zeta+p(0)\big{)}, or equivalently,

x(t)\displaystyle x(t) =x^μsf(x^+av^)t2μ\displaystyle=\hat{x}-\frac{\sqrt{\mu_{s}}\nabla f(\hat{x}+a\hat{v})t}{2\sqrt{\mu}} (20a)
+(1e2μt)μsf(x^+av^)+2μv^4μ,\displaystyle\quad+(1-e^{-2\sqrt{\mu}t})\frac{\sqrt{\mu_{s}}\nabla f(\hat{x}+a\hat{v})+2\sqrt{\mu}\hat{v}}{4\mu},
v(t)\displaystyle v(t) =e2μtv^+(e2μt1)μsf(x^+av^)2μ.\displaystyle=e^{-2\sqrt{\mu}t}\hat{v}+(e^{-2\sqrt{\mu}t}-1)\frac{\sqrt{\mu_{s}}\nabla f(\hat{x}+a\hat{v})}{2\sqrt{\mu}}. (20b)

We refer to this trajectory as a high-order-hold integrator. In order to develop a discrete-time algorithm based on this type of integrator, the next result provides a bound on the evolution of the Lyapunov function VV along the high-order-hold integrator trajectories. The proof is presented in Appendix A.
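For illustration, the following Python sketch evaluates the closed-form trajectory (20); grad_f is a user-supplied gradient of the objective, and mu, mu_s, a correspond to the constants and the gradient displacement used above (the function name and argument conventions are ours). It can be cross-checked against any numerical integrator applied to (19).

```python
# Minimal sketch of the high-order-hold integrator (20): the displaced gradient is
# frozen at the sampled state (x_hat, v_hat) and the resulting linear system (19)
# is integrated exactly.
import numpy as np

def high_order_hold(x_hat, v_hat, t, grad_f, mu, mu_s, a):
    """Evaluate the trajectory (20) at time t from the sample (x_hat, v_hat)."""
    g = grad_f(x_hat + a * v_hat)                     # gradient held over the interval
    sq_mu, sq_mus = np.sqrt(mu), np.sqrt(mu_s)
    decay = np.exp(-2.0 * sq_mu * t)
    x_t = (x_hat - sq_mus * g * t / (2.0 * sq_mu)
           + (1.0 - decay) * (sq_mus * g + 2.0 * sq_mu * v_hat) / (4.0 * mu))
    v_t = decay * v_hat + (decay - 1.0) * sq_mus * g / (2.0 * sq_mu)
    return x_t, v_t
```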

Proposition V.2.

(Upper bound for derivative-based triggering with high-order hold). Let a0a\geq 0 and define

𝔟ETd(p^,t;a)\displaystyle\mathfrak{b}^{\operatorname{d}}_{\operatorname{ET}}(\hat{p},t;a) =𝔄ET(p^,t;a)+𝔅ET(p^,t;a)\displaystyle=\mathfrak{A}_{\operatorname{ET}}(\hat{p},t;a)+\mathfrak{B}_{\operatorname{ET}}(\hat{p},t;a)
+ET(p^;a)+𝔇ET(p^,t;a),\displaystyle\quad+\mathfrak{C}_{\operatorname{ET}}(\hat{p};a)+\mathfrak{D}_{\operatorname{ET}}(\hat{p},t;a),
𝔟STd(p^,t;a)\displaystyle\mathfrak{b}^{\operatorname{d}}_{\operatorname{ST}}(\hat{p},t;a) =(𝔄STq(p^;a)+𝔅STq(p^;a))t2+(𝔄STl(p^;a)\displaystyle=(\mathfrak{A}_{\operatorname{ST}}^{q}(\hat{p};a)+\mathfrak{B}^{q}_{\operatorname{ST}}(\hat{p};a))t^{2}+(\mathfrak{A}_{\operatorname{ST}}^{l}(\hat{p};a)
+𝔅STl(p^;a)+𝔇ST(p^;a))t+ST(p^;a),\displaystyle\quad+\mathfrak{B}^{l}_{\operatorname{ST}}(\hat{p};a)+\mathfrak{D}_{\operatorname{ST}}(\hat{p};a))t+\mathfrak{C}_{\operatorname{ST}}(\hat{p};a),

where

𝔄ET(p^,t;a)=μs(f(x(t))f(x^),v(t)\displaystyle\mathfrak{A}_{\operatorname{ET}}(\hat{p},t;a)=\sqrt{\mu_{s}}(\langle\nabla f(x(t))-\nabla f(\hat{x}),v(t)\rangle
v(t)v^,f(x^+av^)\displaystyle\quad-\langle v(t)-\hat{v},\nabla f(\hat{x}+a\hat{v})\rangle
μx(t)x^,f(x^+av^))\displaystyle\quad-\sqrt{\mu}\langle x(t)-\hat{x},\nabla f(\hat{x}+a\hat{v})\rangle)
μv(t)v^,v(t),\displaystyle\quad-\sqrt{\mu}\langle v(t)-\hat{v},v(t)\rangle,
𝔅ET(p^,t;a)=μ4(μs(f(x(t))f(x^))\displaystyle\mathfrak{B}_{\operatorname{ET}}(\hat{p},t;a)=\frac{\sqrt{\mu}}{4}\big{(}\sqrt{\mu_{s}}(f(x(t))-f(\hat{x}))
μμstf(x^+av^)2L\displaystyle\quad-\sqrt{\mu}\sqrt{\mu_{s}}t\frac{\left\lVert\nabla f(\hat{x}+a\hat{v})\right\rVert^{2}}{L}
+μμstf(x^+av^),av^+14(v(t)2v^2)\displaystyle\quad+\sqrt{\mu}\sqrt{\mu_{s}}t\langle\nabla f(\hat{x}+a\hat{v}),a\hat{v}\rangle+\frac{1}{4}(\left\lVert v(t)\right\rVert^{2}-\left\lVert\hat{v}\right\rVert^{2})
+14v(t)v^+2μ(x(t)x^)2\displaystyle\quad+\frac{1}{4}\left\lVert v(t)-\hat{v}+2\sqrt{\mu}(x(t)-\hat{x})\right\rVert^{2}
+12v(t)v^+2μ(x(t)x^),v^),\displaystyle\quad+\frac{1}{2}\langle v(t)-\hat{v}+2\sqrt{\mu}(x(t)-\hat{x}),\hat{v}\rangle\big{)},
ET(p^;a)=CET(p^;a),\displaystyle\mathfrak{C}_{\operatorname{ET}}(\hat{p};a)=C_{\operatorname{ET}}(\hat{p};a),
𝔇ET(p^,t;a)=μsf(x^),v(t)v^\displaystyle\mathfrak{D}_{\operatorname{ET}}(\hat{p},t;a)=\sqrt{\mu_{s}}\langle\nabla f(\hat{x}),v(t)-\hat{v}\rangle
μv^,v(t)v^,\displaystyle\quad-\sqrt{\mu}\langle\hat{v},v(t)-\hat{v}\rangle,

and

𝔄STl(p^;a)=2μv^+μsf(x^+av^)(μv^\displaystyle\mathfrak{A}_{\operatorname{ST}}^{l}(\hat{p};a)=\left\lVert 2\sqrt{\mu}\hat{v}+\sqrt{\mu_{s}}\nabla f(\hat{x}+a\hat{v})\right\rVert\Big{(}\sqrt{\mu}\left\lVert\hat{v}\right\rVert
+Lμs2μv^+3μs2f(x^+av^))\displaystyle\quad+\frac{L\sqrt{\mu_{s}}}{2\sqrt{\mu}}\left\lVert\hat{v}\right\rVert+\frac{3\sqrt{\mu_{s}}}{2}\left\lVert\nabla f(\hat{x}+a\hat{v})\right\rVert\Big{)}
+μs2f(x^+av^)(Lμv^+f(x^+av^)),\displaystyle\quad+\frac{\mu_{s}}{2}\left\lVert\nabla f(\hat{x}+a\hat{v})\right\rVert\Big{(}\frac{L}{\sqrt{\mu}}\left\lVert\hat{v}\right\rVert+\left\lVert\nabla f(\hat{x}+a\hat{v})\right\rVert\Big{)},
𝔄STq(p^;a)=2μv^+μsf(x^+av^)\displaystyle\mathfrak{A}_{\operatorname{ST}}^{q}(\hat{p};a)=\left\lVert 2\sqrt{\mu}\hat{v}+\sqrt{\mu_{s}}\nabla f(\hat{x}+a\hat{v})\right\rVert
((Lμs2μ+μ)2μv^+μsf(x^+av^)\displaystyle\quad\cdot{}\Big{(}\big{(}\frac{L\sqrt{\mu_{s}}}{2\sqrt{\mu}}+\sqrt{\mu}\big{)}\left\lVert 2\sqrt{\mu}\hat{v}+\sqrt{\mu_{s}}\nabla f(\hat{x}+a\hat{v})\right\rVert
+Lμs2μf(x^+av^)),\displaystyle\quad+\frac{L\mu_{s}}{2\sqrt{\mu}}\left\lVert\nabla f(\hat{x}+a\hat{v})\right\rVert\Big{)},
𝔅STl(p^;a)=μμs4(μs2μf(x^+av^)f(x^)\displaystyle\mathfrak{B}^{l}_{\operatorname{ST}}(\hat{p};a)=\frac{\sqrt{\mu}\sqrt{\mu_{s}}}{4}\Big{(}\frac{\sqrt{\mu_{s}}}{2\sqrt{\mu}}\left\lVert\nabla f(\hat{x}+a\hat{v})\right\rVert\left\lVert\nabla f(\hat{x})\right\rVert
+122μv^+μsf(x^+av^)(f(x^)μ+v^μs)\displaystyle\quad+\frac{1}{2}\left\lVert 2\sqrt{\mu}\hat{v}\!+\!\sqrt{\mu_{s}}\nabla f(\hat{x}+a\hat{v})\right\rVert\big{(}\frac{\left\lVert\nabla f(\hat{x})\right\rVert}{\sqrt{\mu}}\!+\!\frac{\left\lVert\hat{v}\right\rVert}{\sqrt{\mu_{s}}}\big{)}
μf(x^+av^)2L+(aμ12)f(x^+av^),v^),\displaystyle\quad-\sqrt{\mu}\frac{\left\lVert\nabla f(\hat{x}+a\hat{v})\right\rVert^{2}}{L}+(a\sqrt{\mu}-\frac{1}{2})\langle\nabla f(\hat{x}+a\hat{v}),\hat{v}\rangle\Big{)},
𝔅STq(p^;a)=10μ2+L2μs32μ3/2\displaystyle\mathfrak{B}^{q}_{\operatorname{ST}}(\hat{p};a)=\frac{10\mu^{2}+L^{2}\sqrt{\mu_{s}}}{32\mu^{3/2}}
2μv^+μsf(x^+av^)2\displaystyle\quad\cdot{}\left\lVert 2\sqrt{\mu}\hat{v}+\sqrt{\mu_{s}}\nabla f(\hat{x}+a\hat{v})\right\rVert^{2}
+μs(4μ2+L2μs)32μ3/2f(x^+av^)2\displaystyle\quad+\frac{\mu_{s}\left(4\mu^{2}+L^{2}\sqrt{\mu_{s}}\right)}{32\mu^{3/2}}\left\lVert\nabla f(\hat{x}+a\hat{v})\right\rVert^{2}
+μs(4μ2+L2μs)16μ3/22μv^+μsf(x^+av^)\displaystyle\quad+\frac{\sqrt{\mu_{s}}\left(4\mu^{2}+L^{2}\sqrt{\mu_{s}}\right)}{16\mu^{3/2}}\left\lVert 2\sqrt{\mu}\hat{v}+\sqrt{\mu_{s}}\nabla f(\hat{x}+a\hat{v})\right\rVert
f(x^+av^),\displaystyle\quad\cdot{}\left\lVert\nabla f(\hat{x}+a\hat{v})\right\rVert,
ST(p^;a)=CST(p^;a),\displaystyle\mathfrak{C}_{\operatorname{ST}}(\hat{p};a)=C_{\operatorname{ST}}(\hat{p};a),
𝔇ST(p^;a)=2μv^+μsf(x^+av^)\displaystyle\mathfrak{D}_{\operatorname{ST}}(\hat{p};a)=\left\lVert 2\sqrt{\mu}\hat{v}+\sqrt{\mu_{s}}\nabla f(\hat{x}+a\hat{v})\right\rVert\cdot
(μsf(x^)+μv^).\displaystyle\quad\Big{(}\sqrt{\mu_{s}}\left\lVert\nabla f(\hat{x})\right\rVert+\sqrt{\mu}\left\lVert\hat{v}\right\rVert\Big{)}.

Let tp(t)t\mapsto p(t) be the high-order-hold integrator trajectory (20) from p(0)=p^p(0)=\hat{p}. Then, for t0t\geq 0,

ddtV(p(t))+μ4V(p(t))\displaystyle\frac{d}{dt}V(p(t))+\frac{\sqrt{\mu}}{4}V({p}(t)) 𝔟ETd(p^,t;a)𝔟STd(p^,t;a).\displaystyle\leq\mathfrak{b}^{\operatorname{d}}_{\operatorname{ET}}(\hat{p},t;a)\leq\mathfrak{b}^{\operatorname{d}}_{\operatorname{ST}}(\hat{p},t;a).

Analogously to what we did in Section IV-B, we build on this result to establish an upper bound for the performance-based triggering condition with the high-order-hold integrator.

Proposition V.3.

(Upper bound for performance-based triggering with high-order hold). Let a0a\geq 0 and

𝔟#p(p^,t;a)\displaystyle\mathfrak{b}^{\operatorname{p}}_{\#}(\hat{p},t;a) =0teμ4ζ𝔟#d(p^,ζ;a)𝑑ζ,\displaystyle=\int_{0}^{t}e^{\frac{\sqrt{\mu}}{4}\zeta}\mathfrak{b}^{\operatorname{d}}_{\#}(\hat{p},\zeta;a)d\zeta, (21)

for #{ET,ST}\#\in\{\operatorname{ET},\operatorname{ST}\}. Let tp(t)t\mapsto p(t) be the high-order-hold integrator trajectory (20) from p(0)=p^p(0)=\hat{p}. Then, for t0t\geq 0,

V(p(t))eμ4tV(p^)eμ4t𝔟ETp(p^,t;a)eμ4t𝔟STp(p^,t;a).\displaystyle V(p(t))\!-\!e^{-\frac{\sqrt{\mu}}{4}t}V(\hat{p})\!\leq\!e^{-\frac{\sqrt{\mu}}{4}t}\mathfrak{b}^{\operatorname{p}}_{\operatorname{ET}}(\hat{p},t;a)\!\leq\!e^{-\frac{\sqrt{\mu}}{4}t}\mathfrak{b}^{\operatorname{p}}_{\operatorname{ST}}(\hat{p},t;a).

Using Proposition V.2, the proof of this result is analogous to that of Proposition IV.5, and we omit it for space reasons. Propositions V.2 and V.3 are all we need to fully specify the variable-stepsize algorithm based on high-order-hold integrators. Formally, we set

𝔰𝔱𝔢𝔭#(p^;a)=min{t>0|𝔟#(p^,t;a)=0},\displaystyle\mathfrak{step}^{\diamond}_{\#}(\hat{p};a)=\min\{t>0\;|\;\mathfrak{b}^{\diamond}_{\#}(\hat{p},t;a)=0\}, (22)

for {d,p}\diamond\in\{{\operatorname{d}},{\operatorname{p}}\} and #{ET,ST}\#\in\{\operatorname{ET},\operatorname{ST}\}. With this in place, we design Algorithm 3, which is a higher-order counterpart to Algorithm 2, and whose convergence properties are characterized in the following result.

Design Choices: {d,p}\diamond\in\{{\operatorname{d}},{\operatorname{p}}\}, #{ET,ST}\#\in\{\operatorname{ET},\operatorname{ST}\} Initialization: Initial point (p0p_{0}), objective function (ff), tolerance (ϵ\epsilon), increase rate (ri>1r_{i}>1), decrease rate (0<rd<10<r_{d}<1), stepsize lower bound (τ\tau), a0a\geq 0, k=0k=0
while f(xk)ϵ\left\lVert\nabla f(x_{k})\right\rVert\geq\epsilon do
       increase = True
      exit = False
      while  exit = False do
             while #(pk;a)0\mathfrak{C}_{\#}(p_{k};a)\geq 0  do
                   a=arda=ar_{d}
                  increase = False
             end while
            if 𝔰𝔱𝔢𝔭#(pk;a)τ\mathfrak{step}^{\diamond}_{\#}(p_{k};a)\geq\tau then
                   exit = True
             else
                   a=arda=ar_{d}
                  increase = False
            
       end while
      
      Compute stepsize Δk=𝔰𝔱𝔢𝔭#(pk;a)\Delta_{k}=\mathfrak{step}^{\diamond}_{\#}(p_{k};a)
       Compute next iterate pk+1p_{k+1} using (20)
       Set k=k+1k=k+1
      if increase = True then
             a=aria=ar_{i}
      
end while
Algorithm 3 Adaptive High-Order-Hold Algorithm
Proposition V.4.

(Convergence of Adaptive High-Order-Hold Algorithm). For {d,p}\diamond\in\{{\operatorname{d}},{\operatorname{p}}\}, and #{ET,ST}\#\in\{\operatorname{ET},\operatorname{ST}\}, there exists MIET\operatorname{MIET}^{\diamond} such that for τMIET\tau\leq\operatorname{MIET}^{\diamond}, the variable-stepsize strategy in Algorithm 3 has the following properties:

  1. (i)

    it is executable (i.e., at each iteration, the parameter aa is determined in a finite number of steps);

  2. (ii)

    the stepsize is uniformly lower bounded by τ\tau;

  3. (iii)

    it satisfies f(xk+1)f(x)=𝒪(eμ4i=0kΔi)f(x_{k+1})\!-\!f(x_{*})\!=\!\mathcal{O}(e^{-\frac{\sqrt{\mu}}{4}\sum_{i=0}^{k}\Delta_{i}}), for k{0}k\in\{0\}\cup\mathbb{N}.

We omit the proof of this result, which is analogous to that of Proposition V.1, albeit with lengthier computations.

VI Simulations

Here we illustrate the performance of the methods resulting from the proposed resource-aware discretization approach to accelerated optimization flows. Specifically, we simulate in two examples the performance-based implementation of the Displaced-Gradient algorithm (denoted DGp{}^{\operatorname{p}}) and the derivative- and performance-based implementations of the High-Order-Hold algorithm (denoted HOHd{}^{\operatorname{d}} and HOHp{}^{\operatorname{p}}, respectively). We compare these algorithms against Nesterov’s accelerated gradient method and the heavy-ball method, as these still exhibit similar or superior performance to the discretization approaches proposed in the literature, cf. Section I.

Optimization of Ill-Conditioned Quadratic Objective Function

Consider the optimization of the objective function f:2f:{\mathbb{R}}^{2}\rightarrow{\mathbb{R}} defined by f(x)=102x12+102x22f(x)=10^{-2}x_{1}^{2}+10^{2}x_{2}^{2}. Note that μ=2102\mu=2\cdot 10^{-2} and L=2102L=2\cdot 10^{2}. We use s=μ/(36L2)s=\mu/(36L^{2}) and initialize the velocity according to (4b). For DGp{}^{\operatorname{p}}, HOHd{}^{\operatorname{d}}, and HOHp{}^{\operatorname{p}}, we set a=0.1a=0.1 and implement the event-triggered approach (at each iteration, we employ a numerical zero-finding routine to explicitly determine the stepsizes stepETp\operatorname{step}^{\operatorname{p}}_{\operatorname{ET}}, 𝔰𝔱𝔢𝔭ETd\mathfrak{step}^{\operatorname{d}}_{\operatorname{ET}}, and 𝔰𝔱𝔢𝔭ETp\mathfrak{step}^{\operatorname{p}}_{\operatorname{ET}}, respectively).
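The following sketch reproduces the setup of this example together with a generic bracketing-plus-Brent zero-finding helper of the kind used to compute the event-triggered stepsizes; the bound argument is a placeholder for whichever bound is being zeroed at the current iterate (the lengthy expressions for those bounds are not reproduced here), and the helper name and defaults are ours.

```python
# Setup of the ill-conditioned quadratic example and a generic helper that returns
# the smallest positive zero of a scalar bound t -> bound(t), assuming bound(0+) < 0.
import numpy as np
from scipy.optimize import brentq

Q = np.diag([1e-2, 1e2])                    # f(x) = 1e-2*x1**2 + 1e2*x2**2
f = lambda x: x @ Q @ x
grad_f = lambda x: 2.0 * Q @ x
mu, L = 2e-2, 2e2                           # strong-convexity and smoothness constants
s = mu / (36.0 * L**2)                      # parameter s used in the paper
a = 0.1                                     # gradient displacement

def event_triggered_stepsize(bound, t_max=10.0, n_grid=2000):
    ts = np.linspace(1e-9, t_max, n_grid)
    vals = np.array([bound(t) for t in ts])
    above = np.nonzero(vals >= 0.0)[0]      # grid points where the bound is nonnegative
    if above.size == 0:
        return t_max                        # no sign change found on the horizon
    i = above[0]
    lo = ts[i - 1] if i > 0 else 1e-12
    return brentq(bound, lo, ts[i])         # refine the first sign change
```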

Figure 1(a) illustrates how the stepsize of HOHp{}^{\operatorname{p}} changes during the first 1000 iterations. After being tuned during the first few iterations, the stepsize remains quite steady (likely due to the simplicity of quadratic functions) until the trajectory approaches the minimizer. After 5 iterations, the algorithm stepsize is almost equal to the optimal stepsize.

Figure 1: Ill-conditioned quadratic objective function example. (a) Evolution of the stepsize along the execution of HOHp{}^{\operatorname{p}} during the first 1000 iterations. (b) State evolution along DGp{}^{\operatorname{p}}, HOHd{}^{\operatorname{d}}, HOHp{}^{\operatorname{p}}, continuous heavy-ball dynamics, and Nesterov’s method starting from x=(50,50)x=(50,50) and v=(0.0023,4.7139)v=(-0.0023,-4.7139).

Figure 1(b) compares the performance of DGp{}^{\operatorname{p}}, HOHd{}^{\operatorname{d}}, and HOHp{}^{\operatorname{p}} against the continuous heavy-ball method and the discrete Nesterov method for strongly convex functions. The DGp{}^{\operatorname{p}} algorithm takes large stepsizes, following the evolution of the continuous heavy-ball dynamics along the straight lines p(t)=pk+tXhba(pk)p(t)=p_{k}+tX_{\operatorname{hb}}^{a}(p_{k}). Meanwhile, the higher-order nature of the hold employed by HOHd{}^{\operatorname{d}} and HOHp{}^{\operatorname{p}} enables them to leap over the oscillations, yielding a state evolution similar to Nesterov’s method. Figure 2 shows the evolution of the objective and Lyapunov functions. We observe that, after some initial iterations, HOHp{}^{\operatorname{p}} outperforms Nesterov’s method. Eventually, DGp{}^{\operatorname{p}} also catches up with Nesterov’s method.

Figure 2: Ill-conditioned quadratic objective function example. (a) Evolution of the logarithm of the objective function under DGp{}^{\operatorname{p}}, HOHd{}^{\operatorname{d}}, HOHp{}^{\operatorname{p}}, the heavy-ball method, and Nesterov’s method starting from x=(50,50)x=(50,50) and v=(0.0023,4.7139)v=(-0.0023,-4.7139). (b) Corresponding evolution of the logarithm of the Lyapunov function along DGp{}^{\operatorname{p}}, HOHd{}^{\operatorname{d}}, and HOHp{}^{\operatorname{p}}.

Logistic Regression

Consider the optimization of the regularized logistic regression cost function f:4f:{\mathbb{R}}^{4}\rightarrow{\mathbb{R}} defined by f(x)=i=110log(1+eyizi,x)+12x2f(x)=\sum_{i=1}^{10}\log(1+e^{-y_{i}\langle z_{i},x\rangle})+\frac{1}{2}\left\lVert x\right\rVert^{2}, where the points {zi}i=1104\{z_{i}\}_{i=1}^{10}\subset\mathbb{R}^{4} are generated randomly using a uniform distribution on the interval [5,5][-5,5], and the labels {yi}i=110{1,1}\{y_{i}\}_{i=1}^{10}\subset\{-1,1\} are generated similarly and then quantized. This objective function is 1-strongly convex, and one can compute the value L=177.49L=177.49 for the generated data. We use a=0.025a=0.025 and s=μ/(36L2)s=\mu/(36L^{2}), and initialize the velocity according to (4b). Figures 3(a) and (b) show the evolution of the stepsize along HOHp{}^{\operatorname{p}} and its difference with the optimal stepsize; the stepsize changes as a function of the state so as to satisfy the desired decay of the Lyapunov function.
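A sketch of this objective and its gradient is given below; since the data are random, a fixed seed is an assumption of the sketch, so the reported value L=177.49 corresponds to the authors' realization and not necessarily to this one.

```python
# Regularized logistic-regression objective of this example, with randomly
# generated data (the seed is an assumption of this sketch).
import numpy as np

rng = np.random.default_rng(0)
Z = rng.uniform(-5.0, 5.0, size=(10, 4))        # feature vectors z_i in R^4
y = rng.choice([-1.0, 1.0], size=10)            # labels y_i in {-1, 1}

def f(x):
    margins = -y * (Z @ x)
    return np.sum(np.log1p(np.exp(margins))) + 0.5 * np.dot(x, x)

def grad_f(x):
    margins = -y * (Z @ x)
    sigma = 1.0 / (1.0 + np.exp(-margins))      # logistic factor per data point
    return -(Z.T @ (y * sigma)) + x             # gradient of loss plus regularizer
```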

Figure 3: Logistic regression example. (a) Evolution of the stepsize along the execution of HOHp{}^{\operatorname{p}} starting from x=(50,50,50,50)x=(50,50,50,50) and v=(0.1026,0.09265,0.1078,0.0899)v=(-0.1026,-0.09265,-0.1078,-0.0899). Notice the complex pattern, with significant increases and oscillations along the trajectory. (b) Difference between the optimal stepsize (computed using the exact Lyapunov function, which assumes knowledge of the minimizer) and the stepsize of HOHp{}^{\operatorname{p}}. The largest difference is achieved at the beginning: after a few iterations, the difference decreases significantly, periodically becoming almost zero.

Figure 4 shows the evolution of the objective and Lyapunov functions. We observe how HOHd{}^{\operatorname{d}} and HOHp{}^{\operatorname{p}} outperform Nesterov’s method, although eventually the heavy-ball algorithm performs the best. The Lyapunov function decreases at a much faster rate along HOHd{}^{\operatorname{d}} and HOHp{}^{\operatorname{p}} than along DGp{}^{\operatorname{p}}.

Figure 4: Logistic regression example. (a) Evolution of the logarithm of the objective function under DGp{}^{\operatorname{p}}, HOHd{}^{\operatorname{d}}, HOHp{}^{\operatorname{p}}, the heavy-ball method, and Nesterov’s method starting from x=(50,50,50,50)x=(50,50,50,50) and v=(0.1026,0.09265,0.1078,0.0899)v=(-0.1026,-0.09265,-0.1078,-0.0899). (b) Corresponding evolution of the logarithm of the Lyapunov function along DGp{}^{\operatorname{p}}, HOHd{}^{\operatorname{d}}, and HOHp{}^{\operatorname{p}}.

VII Conclusions

We have introduced a resource-aware control framework for the discretization of accelerated optimization flows that specifically takes advantage of their dynamical properties. We have exploited fundamental concepts from opportunistic state-triggering related to the various ways of encoding the notion of valid Lyapunov certificates, the use of sampled-data information, and the construction of state estimators and holds to synthesize variable-stepsize optimization algorithms that retain by design the convergence properties of their continuous-time counterparts. We believe these results open the way to a number of exciting research directions. Among them, we highlight the design of adaptive learning schemes to refine the use of sampled data and optimize the algorithm performance with regard to the objective function, the characterization of accelerated convergence rates, the use of tools and insights from hybrid systems for analysis and design, the enrichment of the proposed designs with re-start schemes as triggering conditions to avoid overshooting and oscillations, and the development of distributed implementations for network optimization problems.

References

  • [1] B. T. Polyak, “Some methods of speeding up the convergence of iterative methods,” USSR Computational Mathematics and Mathematical Physics, vol. 4, no. 5, pp. 1–17, 1964.
  • [2] Y. E. Nesterov, “A method of solving a convex programming problem with convergence rate O(1/k2){O}(1/k^{2}),” Soviet Mathematics Doklady, vol. 27, no. 2, pp. 372–376, 1983.
  • [3] Z. Allen-Zhu and L. Orecchia, “Linear Coupling: An Ultimate Unification of Gradient and Mirror Descent,” in 8th Innovations in Theoretical Computer Science Conference (ITCS 2017), Dagstuhl, Germany, 2017, pp. 1–22.
  • [4] B. Hu and L. Lessard, “Dissipativity theory for Nesterov’s accelerated method,” in International Conference on Machine Learning, International Convention Centre, Sydney, Australia, August 2017, pp. 1549–1557.
  • [5] L. Lessard, B. Recht, and A. Packard, “Analysis and design of optimization algorithms via integral quadratic constraints,” SIAM Journal on Optimization, vol. 26, no. 1, pp. 57–95, 2016.
  • [6] B. V. Scoy, R. A. Freeman, and K. M. Lynch, “The fastest known globally convergent first-order method for minimizing strongly convex functions,” IEEE Control Systems Letters, vol. 2, no. 1, pp. 49–54, 2018.
  • [7] S. Bubeck, Y. Lee, and M. Singh, “A geometric alternative to Nesterov’s accelerated gradient descent,” arXiv preprint arXiv:1506.08187, 2015.
  • [8] W. Su, S. Boyd, and E. J. Candès, “A differential equation for modeling Nesterov’s accelerated gradient method: theory and insights,” Journal of Machine Learning Research, vol. 17, pp. 1–43, 2016.
  • [9] M. Betancourt, M. Jordan, and A. C. Wilson, “On symplectic optimization,” arXiv preprint arXiv: 1802.03653, 2018.
  • [10] C. J. Maddison, D. Paulin, Y. W. Teh, B. O’Donoghue, and A. Doucet, “Hamiltonian descent methods,” arXiv preprint arXiv:1809.05042, 2018.
  • [11] H. Attouch, Z. Chbani, J. Fadili, and H. Riahi, “First-order optimization algorithms via inertial systems with Hessian driven damping,” arXiv preprint arXiv:1907.10536, 2019.
  • [12] B. Shi, S. S. Du, M. I. Jordan, and W. J. Su, “Understanding the acceleration phenomenon via high-resolution differential equations,” arXiv preprint arXiv:1810.08907, 2018.
  • [13] B. Sun, J. George, and S. Kia, “High-resolution modeling of the fastest first-order optimization method for strongly convex functions,” arXiv preprint arXiv:2008.11199, 2020.
  • [14] R. W. Brockett, “Dynamical systems that sort lists, diagonalize matrices, and solve linear programming problems,” Linear Algebra and its Applications, vol. 146, pp. 79–91, 1991.
  • [15] U. Helmke and J. B. Moore, Optimization and Dynamical Systems.   Springer, 1994.
  • [16] B. Shi, S. S. Du, M. I. Jordan, and W. J. Su, “Acceleration via symplectic discretization of high-resolution differential equations,” arXiv preprint arXiv:1902.03694, 2019.
  • [17] A. Wibisono, A. C. Wilson, and M. I. Jordan, “A variational perspective on accelerated methods in optimization,” Proceedings of the National Academy of Sciences, vol. 113, no. 47, pp. E7351–E7358, 2016.
  • [18] A. Wilson, L. Mackey, and A. Wibisono, “Accelerating rescaled gradient descent: Fast optimization of smooth functions,” in Advances in Neural Information Processing Systems, H. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché Buc, E. Fox, and R. Garnett, Eds., vol. 31.   Curran Associates, Inc., 2019, pp. 13 555–13 565.
  • [19] J. Zhang, A. Mokhtari, S. Sra, and A. Jadbabaie, “Direct Runge-Kutta discretization achieves acceleration,” in Advances in Neural Information Processing Systems, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, Eds., vol. 31.   Curran Associates, Inc., 2018, pp. 3900–3909.
  • [20] A. C. Wilson, B. Recht, and M. I. Jordan, “A Lyapunov analysis of momentum methods in optimization,” arXiv preprint arXiv:1611.02635, 2018.
  • [21] M. Vaquero and J. Cortés, “Convergence-rate-matching discretization of accelerated optimization flows through opportunistic state-triggered control,” in Advances in Neural Information Processing Systems, H. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché Buc, E. Fox, and R. Garnett, Eds.   Curran Associates, Inc., 2019, vol. 32, pp. 9767–9776.
  • [22] A. S. Kolarijani, P. M. Esfahani, and T. Keviczky, “Fast gradient-based methods with exponential rate: a hybrid control framework,” in International Conference on Machine Learning, July 2018, pp. 2728–2736.
  • [23] D. Hustig-Schultz and R. G. Sanfelice, “A robust hybrid Heavy-Ball algorithm for optimization with high performance,” in American Control Conference, July 2019, pp. 151–156.
  • [24] W. P. M. H. Heemels, K. H. Johansson, and P. Tabuada, “An introduction to event-triggered and self-triggered control,” in IEEE Conf. on Decision and Control, Maui, HI, 2012, pp. 3270–3285.
  • [25] C. Nowzari, E. Garcia, and J. Cortés, “Event-triggered control and communication of networked systems for multi-agent consensus,” Automatica, vol. 105, pp. 1–27, 2019.
  • [26] P. Ong and J. Cortés, “Event-triggered control design with performance barrier,” in IEEE Conf. on Decision and Control, Miami Beach, FL, Dec. 2018, pp. 951–956.
  • [27] M. Laborde and A. Oberman, “A Lyapunov analysis for accelerated gradient methods: From deterministic to stochastic case,” in AISTATS, vol. 108, Online, 2020, pp. 602–612.
  • [28] M. Muehlebach and M. Jordan, “A dynamical systems perspective on Nesterov acceleration,” in International Conference on Machine Learning, vol. 97, Long Beach, California, USA, 2019, pp. 4656–4662.
  • [29] I. Sutskever, J. Martens, G. Dahl, and G. Hinton, “On the importance of initialization and momentum in deep learning,” in International Conference on Machine Learning, vol. 28, Atlanta, Georgia, USA, 2013, pp. 1139–1147.
  • [30] L. Perko, Differential Equations and Dynamical Systems, 3rd ed., ser. Texts in Applied Mathematics.   New York: Springer, 2000, vol. 7.
  • [31] S. Lang, Real and Functional Analysis, 3rd ed.   New York: Springer, 1993.

Appendix A

Throughout the appendix, we make use of a number of basic facts that we gather here for convenience,

f(x)f(x)\displaystyle f(x_{*})-f(x) f(x)22L\displaystyle\leq-\frac{\left\lVert\nabla f(x)\right\rVert^{2}}{2L} (A.1a)
f(x)Lxx\displaystyle\frac{\left\lVert\nabla f(x)\right\rVert}{L}\leq\left\lVert x-x_{*}\right\rVert f(x)μ\displaystyle\leq\frac{\left\lVert\nabla f(x)\right\rVert}{\mu} (A.1b)
f(y)f(x)f(x),yx\displaystyle f(y)-f(x)-\langle\nabla f(x),y-x\rangle L2yx2\displaystyle\leq\frac{L}{2}\left\lVert y-x\right\rVert^{2} (A.1c)
1Lf(x)f(y)2\displaystyle\frac{1}{L}\left\lVert\nabla f(x)-\nabla f(y)\right\rVert^{2} f(x)f(y),xy\displaystyle\leq\langle\nabla f(x)-\nabla f(y),x-y\rangle (A.1d)
f(y)f(x)f(x),yx\displaystyle f(y)-f(x)-\langle\nabla f(x),y-x\rangle 12μf(y)f(x)2\displaystyle\leq\frac{1}{2\mu}\left\lVert\nabla f(y)-\nabla f(x)\right\rVert^{2} (A.1e)

We also resort at various points to the expression of the gradient of VV,

V(p)\displaystyle\nabla V(p) =[μsf(x)+μv+2μ(xx)v+μ(xx)].\displaystyle=\begin{bmatrix}\sqrt{\mu_{s}}\nabla f(x)+\sqrt{\mu}v+2\mu(x-x_{*})\\ v+\sqrt{\mu}(x-x_{*})\end{bmatrix}. (A.2)

The following result is used in the proof of Theorem IV.2.

Lemma A.1.

For β1,,β4>0\beta_{1},\dots,\beta_{4}>0, the function

g(z)=β3+β4z2β1+β2z\displaystyle g(z)=\frac{\beta_{3}+\beta_{4}z^{2}}{-\beta_{1}+\beta_{2}z} (A.3)

is positively lower bounded on (β1/β2,)(\beta_{1}/\beta_{2},\infty).

Proof.

The derivative of gg is

g(z)=β2β3+β4z(2β1+β2z)(β1β2z)2.g^{\prime}(z)=\frac{-\beta_{2}\beta_{3}+\beta_{4}z(-2\beta_{1}+\beta_{2}z)}{(\beta_{1}-\beta_{2}z)^{2}}.

The solutions to g(z)=0g^{\prime}(z)=0 are then given by

zroot±=β1β4±β22β3β4+β12β42β2β4.\displaystyle z_{\operatorname{root}}^{\pm}=\frac{\beta_{1}\beta_{4}\pm\sqrt{\beta_{2}^{2}\beta_{3}\beta_{4}+\beta_{1}^{2}\beta_{4}^{2}}}{\beta_{2}\beta_{4}}. (A.4)

Note that zroot<0<β1/β2<zroot+z^{-}_{\operatorname{root}}<0<\beta_{1}/\beta_{2}<z^{+}_{\operatorname{root}}, gg^{\prime} is negative on (zroot,zroot+)(z^{-}_{\operatorname{root}},z^{+}_{\operatorname{root}}), and positive on (zroot+,)(z^{+}_{\operatorname{root}},\infty). Therefore the minimum value over (β1/β2,)(\beta_{1}/\beta_{2},\infty) is achieved at zroot+z^{+}_{\operatorname{root}}, and corresponds to g(zroot+)>0g(z^{+}_{\operatorname{root}})>0. ∎
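As a quick numerical sanity check of Lemma A.1 (the coefficient values below are arbitrary placeholders), the critical point from (A.4) should yield a strictly positive value of g that lower bounds g on a grid inside the interval.

```python
# Numerical sanity check of Lemma A.1 with arbitrary positive coefficients.
import numpy as np

b1, b2, b3, b4 = 1.0, 2.0, 3.0, 0.5
g = lambda z: (b3 + b4 * z**2) / (-b1 + b2 * z)
z_plus = (b1 * b4 + np.sqrt(b2**2 * b3 * b4 + b1**2 * b4**2)) / (b2 * b4)  # (A.4)

zs = np.linspace(b1 / b2 + 1e-3, 50.0, 10000)    # grid inside (beta1/beta2, infinity)
print(g(z_plus) > 0.0, np.all(g(zs) >= g(z_plus) - 1e-9))   # both should be True
```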

Proof of Proposition IV.4.

We break out ddtV(p(t))+μ4V(p(t))\frac{d}{dt}V(p(t))+\frac{\sqrt{\mu}}{4}V({p}(t)) as follows

ddtV(p^+tXhba(p^))+μ4V(p^+tXhba(p^))=\displaystyle\frac{d}{dt}V(\hat{p}+tX_{\operatorname{hb}}^{a}(\hat{p}))+\frac{\sqrt{\mu}}{4}V(\hat{p}+tX_{\operatorname{hb}}^{a}(\hat{p}))=
=V(p^),Xhba(p^)+μ4V(p^)Term I + II + III\displaystyle=\underbrace{\langle\nabla V(\hat{p}),X_{\operatorname{hb}}^{a}(\hat{p})\rangle+\frac{\sqrt{\mu}}{4}V(\hat{p})}_{\textrm{Term I + II + III}}
+V(p^+tXhba(p^))V(p^),Xhba(p^)Term IV + V\displaystyle\quad+\underbrace{\langle\nabla V(\hat{p}+tX_{\operatorname{hb}}^{a}(\hat{p}))-\nabla V(\hat{p}),X_{\operatorname{hb}}^{a}(\hat{p})\rangle}_{\textrm{Term IV + V}}
+μ4(V(p^+tXhba(p^))V(p^)Term VI),\displaystyle\quad+\frac{\sqrt{\mu}}{4}(\underbrace{V(\hat{p}+tX_{\operatorname{hb}}^{a}(\hat{p}))-V(\hat{p})}_{\textrm{Term VI}}),

and bound each term separately.

𝐓𝐞𝐫𝐦𝐈+𝐈𝐈+𝐈𝐈𝐈\mathbf{Term\ I+II+III}. From the definition (III.1) of VV and the fact that y1+y222y12+2y22\left\lVert y_{1}+y_{2}\right\rVert^{2}\leq 2\left\lVert y_{1}\right\rVert^{2}+2\left\lVert y_{2}\right\rVert^{2}, we have

V(p^)\displaystyle V(\hat{p}) =μs(f(x^)f(x))+14v^2\displaystyle=\sqrt{\mu_{s}}(f(\hat{x})-f(x_{*}))+\frac{1}{4}\left\lVert\hat{v}\right\rVert^{2}
+14v^+2μ(x^x)2\displaystyle\quad+\frac{1}{4}\left\lVert\hat{v}+2\sqrt{\mu}(\hat{x}-x_{*})\right\rVert^{2}
μs(f(x^)f(x))\displaystyle\leq\sqrt{\mu_{s}}(f(\hat{x})-f(x_{*}))
+14v^2+24v^2+242μ(x^x)2\displaystyle\quad+\frac{1}{4}\left\lVert\hat{v}\right\rVert^{2}+\frac{2}{4}\left\lVert\hat{v}\right\rVert^{2}+\frac{2}{4}\left\lVert 2\sqrt{\mu}(\hat{x}-x_{*})\right\rVert^{2}
=μs(f(x^)f(x))+34v^2+2μx^x2.\displaystyle=\sqrt{\mu_{s}}(f(\hat{x})-f(x_{*}))+\frac{3}{4}\left\lVert\hat{v}\right\rVert^{2}+2\mu\left\lVert\hat{x}-x_{*}\right\rVert^{2}.

Using this bound, we obtain

V(p^),Xhba(p^)+μ4V(p^)\displaystyle\langle\nabla V(\hat{p}),X^{a}_{\operatorname{hb}}(\hat{p})\rangle+\frac{\sqrt{\mu}}{4}V(\hat{p})
μv^2+μ4μs(f(x^)f(x))+3μ16v^2\displaystyle\leq-\sqrt{\mu}\left\lVert\hat{v}\right\rVert^{2}+\frac{\sqrt{\mu}}{4}\sqrt{\mu_{s}}(f(\hat{x})-f(x_{*}))+\frac{3\sqrt{\mu}}{16}\left\lVert\hat{v}\right\rVert^{2}
+μμ2x^x2+μsf(x^)f(x^+av^),v^\displaystyle\quad+\frac{\mu\sqrt{\mu}}{2}\left\lVert\hat{x}-x_{*}\right\rVert^{2}+\sqrt{\mu_{s}}\langle\nabla f(\hat{x})-\nabla f(\hat{x}+a\hat{v}),\hat{v}\rangle
μμsf(x^+av^),x^x.\displaystyle\quad-\sqrt{\mu}\sqrt{\mu_{s}}\langle\nabla f(\hat{x}+a\hat{v}),\hat{x}-x_{*}\rangle.

Writing 0 as 0=av^av^0=a\hat{v}-a\hat{v} and using strong convexity, we can upper bound f(x^+av^),xx^\langle\nabla f(\hat{x}+a\hat{v}),x_{*}-\hat{x}\rangle in the last summand by the expression

f(x)f(x^+av^)μ2x^+av^x2+f(x^+av^),av^.\displaystyle f(x_{*})-f(\hat{x}+a\hat{v})-\frac{\mu}{2}\left\lVert\hat{x}+a\hat{v}-x_{*}\right\rVert^{2}+\langle\nabla f(\hat{x}+a\hat{v}),a\hat{v}\rangle.

Substituting this bound above and re-grouping terms,

V(p^),Xhba(p^)+μ4V(p^)μv^2\displaystyle\langle\nabla V(\hat{p}),X^{a}_{\operatorname{hb}}(\hat{p})\rangle+\frac{\sqrt{\mu}}{4}V(\hat{p})\leq-\sqrt{\mu}\left\lVert\hat{v}\right\rVert^{2}
+μμs(14(f(x^)f(x))+f(x)f(x^+av^))(a)\displaystyle\quad+\underbrace{\sqrt{\mu}\sqrt{\mu_{s}}\Big{(}\frac{1}{4}(f(\hat{x})-f(x_{*}))+f(x_{*})-f(\hat{x}+a\hat{v})\Big{)}}_{\textrm{(a)}}
+3μ16v^2+μsf(x^)f(x^+av^),v^\displaystyle\quad+\frac{3\sqrt{\mu}}{16}\left\lVert\hat{v}\right\rVert^{2}+\sqrt{\mu_{s}}\langle\nabla f(\hat{x})-\nabla f(\hat{x}+a\hat{v}),\hat{v}\rangle
+μμ2x^x2+μμs(μ2x^+av^x2)(b)\displaystyle\quad\underbrace{+\frac{\mu\sqrt{\mu}}{2}\left\lVert\hat{x}-x_{*}\right\rVert^{2}+\sqrt{\mu}\sqrt{\mu_{s}}(-\frac{\mu}{2}\left\lVert\hat{x}+a\hat{v}-x_{*}\right\rVert^{2})}_{\textrm{(b)}}
+μμsf(x^+av^),av^.\displaystyle\quad+\sqrt{\mu}\sqrt{\mu_{s}}\langle\nabla f(\hat{x}+a\hat{v}),a\hat{v}\rangle.

Observe that

(a) =μμs(34(f(x^)f(x))+f(x^)f(x^+av^)),\displaystyle=\sqrt{\mu}\sqrt{\mu_{s}}\big{(}-\frac{3}{4}(f(\hat{x})-f(x_{*}))+f(\hat{x})-f(\hat{x}+a\hat{v})\big{)},
(b) μ2s2x^x2+μsμ3/2x^xav^\displaystyle\leq-\frac{\mu^{2}\sqrt{s}}{2}\left\lVert\hat{x}-x_{*}\right\rVert^{2}+\sqrt{\mu_{s}}\mu^{3/2}\left\lVert\hat{x}-x_{*}\right\rVert\left\lVert a\hat{v}\right\rVert
μsμ3/2/2av^2,\displaystyle\quad-\sqrt{\mu_{s}}\mu^{3/2}/2\left\lVert a\hat{v}\right\rVert^{2},

where, in the expression of (a), we have expressed 0 as 0=3/4(f(x^)f(x^))0=3/4(f(\hat{x})-f(\hat{x})) and, in the expression of (b), we have expanded the square and used the Cauchy–Schwarz inequality [31]. Finally, resorting to (A.1), we obtain

V(p^),Xhba(p^)+μ4V(p^)CET(p^;a)=CST(p^;a).\displaystyle\langle\nabla V(\hat{p}),X^{a}_{\operatorname{hb}}(\hat{p})\rangle+\frac{\sqrt{\mu}}{4}V(\hat{p})\leq C_{\operatorname{ET}}(\hat{p};a)=C_{\operatorname{ST}}(\hat{p};a).

\bullet 𝐓𝐞𝐫𝐦𝐈𝐕+𝐕\mathbf{Term\ IV+V}. Using (A.2) we have

V(p^+tXhba(p^))=\displaystyle\nabla V(\hat{p}+tX_{\operatorname{hb}}^{a}(\hat{p}))=
[μsf(x^+tv^)+μv^2μtv^tμμsf(x^+av^)+2μ(x^+tv^x)v^2tμv^tμsf(x^+av^)+μ(x^+tv^x)].\displaystyle\begin{bmatrix}\sqrt{\mu_{s}}\nabla f(\hat{x}+t\hat{v})+\sqrt{\mu}\hat{v}-2\mu t\hat{v}\\ -t\sqrt{\mu}\sqrt{\mu_{s}}\nabla f(\hat{x}+a\hat{v})+2\mu(\hat{x}+t\hat{v}-x_{*})\\ \vskip 6.0pt plus 2.0pt minus 2.0pt\cr\hat{v}-2t\sqrt{\mu}\hat{v}-t\sqrt{\mu_{s}}\nabla f(\hat{x}+a\hat{v})+\sqrt{\mu}(\hat{x}+t\hat{v}-x_{*})\end{bmatrix}.

Therefore, V(p^+tXhba(p^))V(p^)\nabla V(\hat{p}+tX_{\operatorname{hb}}^{a}(\hat{p}))-\nabla V(\hat{p}) reads

[μs(f(x^+tv^)f(x^))tμμsf(x^+av^)μtv^tμsf(x^+av^)]\displaystyle\begin{bmatrix}\sqrt{\mu_{s}}(\nabla f(\hat{x}\!+\!t\hat{v})\!-\!\nabla f(\hat{x}))\!-\!t\sqrt{\mu}\sqrt{\mu_{s}}\nabla f(\hat{x}\!+\!a\hat{v})\\ \vskip 6.0pt plus 2.0pt minus 2.0pt\cr-\sqrt{\mu}t\hat{v}-t\sqrt{\mu_{s}}\nabla f(\hat{x}+a\hat{v})\end{bmatrix}

and hence

V(p^+tXhba(p^))V(p^),Xhba(p^)\displaystyle\langle\nabla V(\hat{p}+tX_{\operatorname{hb}}^{a}(\hat{p}))-\nabla V(\hat{p}),X_{\operatorname{hb}}^{a}(\hat{p})\rangle
=μsf(x^+tv^)f(x^),v^\displaystyle=\sqrt{\mu_{s}}\langle\nabla f(\hat{x}+t\hat{v})-\nabla f(\hat{x}),\hat{v}\rangle
+2tμμsf(x^+av^),v^+2tμv^2\displaystyle\quad+2t\sqrt{\mu}\sqrt{\mu_{s}}\langle\nabla f(\hat{x}+a\hat{v}),\hat{v}\rangle+2t\mu\left\lVert\hat{v}\right\rVert^{2}
+tμsf(x^+av^)2.\displaystyle\quad+t\mu_{s}\left\lVert\nabla f(\hat{x}+a\hat{v})\right\rVert^{2}.

The RHS of the last expression is precisely AET(p^,t;a)A_{\operatorname{ET}}(\hat{p},t;a). Using the LL-Lipschitzness of f\nabla f, one can see that AET(p^,t;a)AST(p^;a)tA_{\operatorname{ET}}(\hat{p},t;a)\leq A_{\operatorname{ST}}(\hat{p};a)t.

\bullet 𝐓𝐞𝐫𝐦𝐕𝐈\mathbf{Term\ VI}. From (III.1),

V(p^+tXhba(p^))V(p^)=μs(f(x^+tv^)f(x))\displaystyle V(\hat{p}+tX_{\operatorname{hb}}^{a}(\hat{p}))-V(\hat{p})=\sqrt{\mu_{s}}(f(\hat{x}+t\hat{v})-f(x_{*}))
+14v^2tμv^tμsf(x^+av^)2\displaystyle\quad+\frac{1}{4}\left\lVert\hat{v}-2t\sqrt{\mu}\hat{v}-t\sqrt{\mu_{s}}\nabla f(\hat{x}+a\hat{v})\right\rVert^{2}
+14v^2tμv^tμsf(x^+av^)\displaystyle\quad+\frac{1}{4}\left\lVert\hat{v}\right.-2t\sqrt{\mu}\hat{v}-t\sqrt{\mu_{s}}\nabla f(\hat{x}+a\hat{v})
+2μ(x^+tv^x)2μs(f(x^)f(x))\displaystyle\quad+\left.2\sqrt{\mu}(\hat{x}+t\hat{v}-x_{*})\right\rVert^{2}-\sqrt{\mu_{s}}(f(\hat{x})-f(x_{*}))
14v^214v^+2μ(x^x)2.\displaystyle\quad-\frac{1}{4}\left\lVert\hat{v}\right\rVert^{2}-\frac{1}{4}\left\lVert\hat{v}+2\sqrt{\mu}(\hat{x}-x_{*})\right\rVert^{2}.

Expanding the squares in the second and third summands, and simplifying, we obtain

V(p^+tXhba(p^))V(p^)=μs(f(x^+tv^)f(x^))\displaystyle V(\hat{p}+tX_{\operatorname{hb}}^{a}(\hat{p}))-V(\hat{p})=\sqrt{\mu_{s}}(f(\hat{x}+t\hat{v})-f(\hat{x}))
+142tμv^tμsf(x^+av^)2\displaystyle\quad+\frac{1}{4}\left\lVert-2t\sqrt{\mu}\hat{v}-t\sqrt{\mu_{s}}\nabla f(\hat{x}+a\hat{v})\right\rVert^{2}
+12v^,2tμv^tμsf(x^+av^)\displaystyle\quad+\frac{1}{2}\langle\hat{v},-2t\sqrt{\mu}\hat{v}-t\sqrt{\mu_{s}}\nabla f(\hat{x}+a\hat{v})\rangle
+14tμsf(x^+av^)2\displaystyle\quad+\frac{1}{4}\left\lVert-t\sqrt{\mu_{s}}\nabla f(\hat{x}+a\hat{v})\right\rVert^{2}
+12v^+2μ(x^x),tμsf(x^+av^)\displaystyle\quad+\frac{1}{2}\langle\hat{v}+2\sqrt{\mu}(\hat{x}-x_{*}),-t\sqrt{\mu_{s}}\nabla f(\hat{x}+a\hat{v})\rangle
=μs(f(x^+tv^)f(x^))\displaystyle=\sqrt{\mu_{s}}(f(\hat{x}+t\hat{v})-f(\hat{x}))
+142tμv^tμsf(x^+av^)2\displaystyle\quad+\frac{1}{4}\left\lVert-2t\sqrt{\mu}\hat{v}-t\sqrt{\mu_{s}}\nabla f(\hat{x}+a\hat{v})\right\rVert^{2}
tμv^2tμsv^,f(x^+av^)\displaystyle\quad-t\sqrt{\mu}\left\lVert\hat{v}\right\rVert^{2}-t\sqrt{\mu_{s}}\langle\hat{v},\nabla f(\hat{x}+a\hat{v})\rangle
+14tμsf(x^+av^)2\displaystyle\quad+\frac{1}{4}\left\lVert-t\sqrt{\mu_{s}}\nabla f(\hat{x}+a\hat{v})\right\rVert^{2}
+μ(x^x),tμsf(x^+av^).\displaystyle\quad+\langle\sqrt{\mu}(\hat{x}-x_{*}),-t\sqrt{\mu_{s}}\nabla f(\hat{x}+a\hat{v})\rangle.

Note that

xx^,f(x^+av^)\displaystyle\langle x_{*}-\hat{x},\nabla f(\hat{x}+a\hat{v})\rangle
=xx^av,f(x^+av^)+av^,f(x^+av^)\displaystyle=\langle x_{*}-\hat{x}-av,\nabla f(\hat{x}+a\hat{v})\rangle+\langle a\hat{v},\nabla f(\hat{x}+a\hat{v})\rangle
f(x^+av^)2L+av^,f(x^+av^),\displaystyle\leq-\frac{\left\lVert\nabla f(\hat{x}+a\hat{v})\right\rVert^{2}}{L}+\langle a\hat{v},\nabla f(\hat{x}+a\hat{v})\rangle,

where in the inequality we have used (A.1d) with x=x^+av^x=\hat{x}+a\hat{v} and y=xy=x_{*}. Using this in the equation above, one identifies the expression of BET(p^,t;a)B_{\operatorname{ET}}(\hat{p},t;a). Finally, applying (A.1c), one can show that BET(p^,t;a)BSTl(p^;a)t+BSTq(p^;a)t2B_{\operatorname{ET}}(\hat{p},t;a)\leq B_{\operatorname{ST}}^{l}(\hat{p};a)t+B^{q}_{\operatorname{ST}}(\hat{p};a)t^{2}, concluding the proof. ∎

Proof of Proposition V.2.

For convenience, let

Xhba,p^(p)\displaystyle X_{\operatorname{hb}}^{a,\hat{p}}(p) =[v2μvμsf(x^+av^)],\displaystyle=\begin{bmatrix}v\\ -2\sqrt{\mu}v-\sqrt{\mu_{s}}\nabla f(\hat{x}+a\hat{v})\end{bmatrix},

where p^=[x^,v^]\hat{p}=[\hat{x},\hat{v}]. We next provide a bound for the expression

ddtV(p(t))+μ4V(p(t))=V(p^),Xhba,p^(p^)+μ4V(p^)Term I + II + III\displaystyle\frac{d}{dt}V(p(t))+\frac{\sqrt{\mu}}{4}V(p(t))=\underbrace{\langle\nabla V(\hat{p}),X_{\operatorname{hb}}^{a,\hat{p}}(\hat{p})\rangle+\frac{\sqrt{\mu}}{4}V(\hat{p})}_{\textrm{Term I + II + III}}
+V(p(t))V(p^),Xhba,p^(p(t))Term IV\displaystyle\quad+\underbrace{\langle\nabla V(p(t))-\nabla V(\hat{p}),X_{\operatorname{hb}}^{a,\hat{p}}(p(t))\rangle}_{\textrm{Term IV}}
+V(p^),Xhba,p^(p(t))Xhba,p^(p^)Term V+μ4(V(p(t))V(p^))Term VI.\displaystyle\quad+\underbrace{\langle\nabla V(\hat{p}),X_{\operatorname{hb}}^{a,\hat{p}}(p(t))-X_{\operatorname{hb}}^{a,\hat{p}}(\hat{p})\rangle}_{\textrm{Term V}}+\frac{\sqrt{\mu}}{4}\underbrace{(V(p(t))-V(\hat{p}))}_{\textrm{Term VI}}.

Next, we bound each term separately.

\bullet 𝐓𝐞𝐫𝐦𝐈+𝐈𝐈+𝐈𝐈𝐈\mathbf{Term\ I+II+III}. Since Xhba,p^(p^)=Xhba(p^)X_{\operatorname{hb}}^{a,\hat{p}}(\hat{p})=X_{\operatorname{hb}}^{a}(\hat{p}), this term is exactly the same as Term I + II + III in the proof of Proposition IV.4, and hence the bound obtained there is valid.

\bullet 𝐓𝐞𝐫𝐦𝐈𝐕\mathbf{Term\ IV}. Using (A.2), we have

V(p(t))V(p^),Xhba,p^(p(t))\displaystyle\langle\nabla V(p(t))-\nabla V(\hat{p}),X_{\operatorname{hb}}^{a,\hat{p}}(p(t))\rangle
=μsf(x(t))f(x^),v(t)\displaystyle=\sqrt{\mu_{s}}\langle\nabla f(x(t))-\nabla f(\hat{x}),v(t)\rangle
+μv(t)v^,v(t)+2μx(t)x^,v(t)\displaystyle\quad+\sqrt{\mu}\langle v(t)-\hat{v},v(t)\rangle+2\mu\langle x(t)-\hat{x},v(t)\rangle
2μv(t)v^,v(t)μsv(t)v^,f(x^+av^)\displaystyle\quad-2\sqrt{\mu}\langle v(t)-\hat{v},v(t)\rangle-\sqrt{\mu_{s}}\langle v(t)-\hat{v},\nabla f(\hat{x}+a\hat{v})\rangle
2μx(t)x^,v(t)μsμx(t)x^,f(x^+av^)\displaystyle\quad-2\mu\langle x(t)-\hat{x},v(t)\rangle-\sqrt{\mu_{s}}\sqrt{\mu}\langle x(t)-\hat{x},\nabla f(\hat{x}+a\hat{v})\rangle
=μsf(x(t))f(x^),v(t)μv(t)v^,v(t)\displaystyle=\sqrt{\mu_{s}}\langle\nabla f(x(t))-\nabla f(\hat{x}),v(t)\rangle-\sqrt{\mu}\langle v(t)-\hat{v},v(t)\rangle
μsv(t)v^,f(x^+av^)\displaystyle\quad-\sqrt{\mu_{s}}\langle v(t)-\hat{v},\nabla f(\hat{x}+a\hat{v})\rangle
μsμx(t)x^,f(x^+av^),\displaystyle\quad-\sqrt{\mu_{s}}\sqrt{\mu}\langle x(t)-\hat{x},\nabla f(\hat{x}+a\hat{v})\rangle,

from which we obtain TermIV𝔄ET(p^,t;a)\mathrm{Term\ IV}\leq\mathfrak{A}_{\operatorname{ET}}(\hat{p},t;a). Now, using 0=v^v^0=\hat{v}-\hat{v}, the LL-Lipschitzness of f\nabla f, and the Cauchy–Schwarz inequality, we have

|𝔄ET(p^,t;a)|\displaystyle|\mathfrak{A}_{\operatorname{ET}}(\hat{p},t;a)| μsLx(t)x^(v(t)v^+v^)\displaystyle\leq\sqrt{\mu_{s}}L\left\lVert x(t)-\hat{x}\right\rVert(\left\lVert v(t)-\hat{v}\right\rVert+\left\lVert\hat{v}\right\rVert)
+μv(t)v^2+μv(t)v^v^\displaystyle\quad+\sqrt{\mu}\left\lVert v(t)-\hat{v}\right\rVert^{2}+\sqrt{\mu}\left\lVert v(t)-\hat{v}\right\rVert\left\lVert\hat{v}\right\rVert
+μsv(t)v^f(x^+av^)\displaystyle\quad+\sqrt{\mu_{s}}\left\lVert v(t)-\hat{v}\right\rVert\left\lVert\nabla f(\hat{x}+a\hat{v})\right\rVert
+μsμx(t)x^f(x^+av^).\displaystyle\quad+\sqrt{\mu_{s}}\sqrt{\mu}\left\lVert x(t)-\hat{x}\right\rVert\left\lVert\nabla f(\hat{x}+a\hat{v})\right\rVert.

Using (20), the triangle inequality, and 1e2μt2μt1-e^{-2\sqrt{\mu}t}\leq 2\sqrt{\mu}t, we can write

x(t)x^\displaystyle\left\lVert x(t)-\hat{x}\right\rVert t2μ2μv^+μsf(x^+av^)\displaystyle\leq\frac{t}{2\sqrt{\mu}}\left\lVert 2\sqrt{\mu}\hat{v}+\sqrt{\mu_{s}}\nabla f(\hat{x}+a\hat{v})\right\rVert
+μst2μf(x^+av^),\displaystyle\quad+\frac{\sqrt{\mu_{s}}t}{2\sqrt{\mu}}\left\lVert\nabla f(\hat{x}+a\hat{v})\right\rVert, (A.5a)
v(t)v^\displaystyle\left\lVert v(t)-\hat{v}\right\rVert t2μv^+μsf(x^+av^).\displaystyle\leq t\left\lVert 2\sqrt{\mu}\hat{v}+\sqrt{\mu_{s}}\nabla f(\hat{x}+a\hat{v})\right\rVert. (A.5b)

Substituting into the bound for |𝔄ET(p^,t;a)||\mathfrak{A}_{\operatorname{ET}}(\hat{p},t;a)| above, we obtain

|𝔄ET(p^,t;a)|𝔄STq(p^;a)t2+𝔄STl(p^;a)t\displaystyle|\mathfrak{A}_{\operatorname{ET}}(\hat{p},t;a)|\leq\mathfrak{A}^{q}_{ST}(\hat{p};a)t^{2}+\mathfrak{A}_{\operatorname{ST}}^{l}(\hat{p};a)t

as claimed.

\bullet 𝐓𝐞𝐫𝐦𝐕\mathbf{Term\ V}. Using (A.2), we have

V(p^),Xhba,p^(p(t))Xhba,p^(p^)\displaystyle\Big{\langle}\nabla V(\hat{p}),X_{\operatorname{hb}}^{a,\hat{p}}(p(t))-X_{\operatorname{hb}}^{a,\hat{p}}(\hat{p})\Big{\rangle}
=[μsf(x^)+μv^+2μ(x^x)v^+μ(x^x)],\displaystyle=\langle\begin{bmatrix}\sqrt{\mu_{s}}\nabla f(\hat{x})+\sqrt{\mu}\hat{v}+2\mu(\hat{x}-x_{*})\\ \hat{v}+\sqrt{\mu}(\hat{x}-x_{*})\end{bmatrix},
[v(t)v^2μ(v(t)v^)]\displaystyle\quad\begin{bmatrix}v(t)-\hat{v}\\ -2\sqrt{\mu}(v(t)-\hat{v})\end{bmatrix}\rangle
=μsf(x^),v(t)v^+μv^,v(t)v^\displaystyle=\sqrt{\mu_{s}}\langle\nabla f(\hat{x}),v(t)-\hat{v}\rangle+\sqrt{\mu}\langle\hat{v},v(t)-\hat{v}\rangle
+2μx^x,v(t)v^2μv^,v(t)v^\displaystyle\quad+2\mu\langle\hat{x}-x_{*},v(t)-\hat{v}\rangle-2\sqrt{\mu}\langle\hat{v},v(t)-\hat{v}\rangle
2μx^x,v(t)v^=𝔇ET(p^,t;a).\displaystyle\quad-2\mu\langle\hat{x}-x_{*},v(t)-\hat{v}\rangle=\mathfrak{D}_{\operatorname{ET}}(\hat{p},t;a).

Taking the absolute value and using the Cauchy–Schwarz inequality in conjunction with (A.5), we obtain the expression corresponding to 𝔇ST\mathfrak{D}_{\operatorname{ST}}.

\bullet 𝐓𝐞𝐫𝐦𝐕𝐈\mathbf{Term\ VI}. From (III.1),

V(p(t))V(p^)=μs(f(x(t))f(x))+14v(t)2\displaystyle V(p(t))-V(\hat{p})=\sqrt{\mu_{s}}(f(x(t))-f(x_{*}))+\frac{1}{4}\left\lVert v(t)\right\rVert^{2}
+14v(t)+2μ(x(t)x)2\displaystyle\quad+\frac{1}{4}\left\lVert v(t)+2\sqrt{\mu}(x(t)-x_{*})\right\rVert^{2}
μs(f(x^)f(x))14v^2\displaystyle\quad-\sqrt{\mu_{s}}(f(\hat{x})-f(x_{*}))-\frac{1}{4}\left\lVert\hat{v}\right\rVert^{2}
14v^+2μ(x^x)2.\displaystyle\quad-\frac{1}{4}\left\lVert\hat{v}+2\sqrt{\mu}(\hat{x}-x_{*})\right\rVert^{2}.

Expanding the third summand (using x(t)=x^+(x(t)x^)x(t)=\hat{x}+(x(t)-\hat{x}) and v(t)=v^+(v(t)v^)v(t)=\hat{v}+(v(t)-\hat{v})) as v^+2μ(x^x)2+2v^+2μ(x^x),v(t)v^+2μ(x(t)x^)+v(t)v^+2μ(x(t)x^)2\left\lVert\hat{v}+2\sqrt{\mu}(\hat{x}-x_{*})\right\rVert^{2}+2\langle\hat{v}+2\sqrt{\mu}(\hat{x}-x_{*}),v(t)-\hat{v}+2\sqrt{\mu}(x(t)-\hat{x})\rangle+\left\lVert v(t)-\hat{v}+2\sqrt{\mu}(x(t)-\hat{x})\right\rVert^{2}, we obtain after simplification

V(p(t))V(p^)=μs(f(x(t))f(x^))\displaystyle V(p(t))-V(\hat{p})=\sqrt{\mu_{s}}(f(x(t))-f(\hat{x})) (A.6)
+14(v(t)2v^2)+14v(t)v^+2μ(x(t)x^)2\displaystyle\quad+\frac{1}{4}(\left\lVert v(t)\right\rVert^{2}-\left\lVert\hat{v}\right\rVert^{2})+\frac{1}{4}\left\lVert v(t)-\hat{v}+2\sqrt{\mu}(x(t)-\hat{x})\right\rVert^{2}
+12v(t)v^+2μ(x(t)x^),v^+2μ(x^x).\displaystyle\quad+\frac{1}{2}\langle v(t)-\hat{v}+2\sqrt{\mu}(x(t)-\hat{x}),\hat{v}+2\sqrt{\mu}(\hat{x}-x_{\ast})\rangle.

Using (20), we have

v(t)v^+2μ(x(t)x^),2μ(x^x)\displaystyle\langle v(t)-\hat{v}+2\sqrt{\mu}(x(t)-\hat{x}),2\sqrt{\mu}(\hat{x}-x_{\ast})\rangle
=2μμstf(x^+av^),x^x\displaystyle=-2\sqrt{\mu}\sqrt{\mu_{s}}t\langle\nabla f(\hat{x}+a\hat{v}),\hat{x}-x_{\ast}\rangle
=2μμstf(x^+av^),x^+av^x\displaystyle=-2\sqrt{\mu}\sqrt{\mu_{s}}t\langle\nabla f(\hat{x}+a\hat{v}),\hat{x}+a\hat{v}-x_{\ast}\rangle
2μμstf(x^+av^),av^\displaystyle\quad-2\sqrt{\mu}\sqrt{\mu_{s}}t\langle\nabla f(\hat{x}+a\hat{v}),-a\hat{v}\rangle
2μμstf(x^+av^)2L\displaystyle\leq-2\sqrt{\mu}\sqrt{\mu_{s}}t\frac{\left\lVert\nabla f(\hat{x}+a\hat{v})\right\rVert^{2}}{L}
+2μμstf(x^+av^),av^,\displaystyle\quad+2\sqrt{\mu}\sqrt{\mu_{s}}t\langle\nabla f(\hat{x}+a\hat{v}),a\hat{v}\rangle,

where we have used (A.1d) to derive the inequality. Substituting this bound into (A.6), we obtain μ4(V(p(t))V(p^))𝔅ET(p^,t;a)\frac{\sqrt{\mu}}{4}(V(p(t))-V(\hat{p}))\leq\mathfrak{B}_{\operatorname{ET}}(\hat{p},t;a). To obtain the ST\operatorname{ST}-expressions, we bound each remaining term separately as follows. Note that

f(x(t))f(x^)(A.1e)f(x^),x(t)x^+L22μx(t)x^2\displaystyle f(x(t))-f(\hat{x})\underbrace{\leq}_{\eqref{eq:aux-f}}\langle\nabla f(\hat{x}),x(t)-\hat{x}\rangle+\frac{L^{2}}{2\mu}\left\lVert x(t)-\hat{x}\right\rVert^{2}
x(t)x^f(x^)+L22μx(t)x^2\displaystyle\leq\left\lVert x(t)-\hat{x}\right\rVert\left\lVert\nabla f(\hat{x})\right\rVert+\frac{L^{2}}{2\mu}\left\lVert x(t)-\hat{x}\right\rVert^{2}
t2μ2μv^+μsf(x^+av^)f(x^)\displaystyle\leq\frac{t}{2\sqrt{\mu}}\left\lVert 2\sqrt{\mu}\hat{v}+\sqrt{\mu_{s}}\nabla f(\hat{x}+a\hat{v})\right\rVert\left\lVert\nabla f(\hat{x})\right\rVert
+μst2μf(x^+av^)f(x^)\displaystyle\quad+\frac{\sqrt{\mu_{s}}t}{2\sqrt{\mu}}\left\lVert\nabla f(\hat{x}+a\hat{v})\right\rVert\left\lVert\nabla f(\hat{x})\right\rVert
+L22μ(t24μ2μv^+μsf(x^+av^)2\displaystyle\quad+\frac{L^{2}}{2\mu}(\frac{t^{2}}{4\mu}\left\lVert 2\sqrt{\mu}\hat{v}+\sqrt{\mu_{s}}\nabla f(\hat{x}+a\hat{v})\right\rVert^{2}
+μst24μf(x^+av^)2\displaystyle\quad+\frac{\mu_{s}t^{2}}{4\mu}\left\lVert\nabla f(\hat{x}+a\hat{v})\right\rVert^{2}
+μst22μ2μv^+μsf(x^+av^)f(x^+av^)),\displaystyle\quad+\frac{\sqrt{\mu_{s}}t^{2}}{2\mu}\left\lVert 2\sqrt{\mu}\hat{v}+\sqrt{\mu_{s}}\nabla f(\hat{x}+a\hat{v})\right\rVert\left\lVert\nabla f(\hat{x}+a\hat{v})\right\rVert),

where we have used (A.5a) to obtain the last inequality. Next,

v(t)2v^2=v(t)v^2+2v(t)v^,v^\displaystyle\left\lVert v(t)\right\rVert^{2}-\left\lVert\hat{v}\right\rVert^{2}=\left\lVert v(t)-\hat{v}\right\rVert^{2}+2\langle v(t)-\hat{v},\hat{v}\rangle
t22μv^+μsf(x^+av^)2\displaystyle\leq t^{2}\left\lVert 2\sqrt{\mu}\hat{v}+\sqrt{\mu_{s}}\nabla f(\hat{x}+a\hat{v})\right\rVert^{2}
+2t2μv^+μsf(x^+av^)v^,\displaystyle\quad+2t\left\lVert 2\sqrt{\mu}\hat{v}+\sqrt{\mu_{s}}\nabla f(\hat{x}+a\hat{v})\right\rVert\left\lVert\hat{v}\right\rVert,

where we have used (A.5b) to obtain the last inequality. Using y1+y222y12+2y22\left\lVert y_{1}+y_{2}\right\rVert^{2}\leq 2\left\lVert y_{1}\right\rVert^{2}+2\left\lVert y_{2}\right\rVert^{2}, we bound

v(t)v^+2μ(x(t)x^)22v(t)v^2+8μx(t)x^2\displaystyle\left\lVert v(t)\!-\!\hat{v}\!+\!2\sqrt{\mu}(x(t)\!-\!\hat{x})\right\rVert^{2}\leq 2\left\lVert v(t)\!-\!\hat{v}\right\rVert^{2}\!+\!8\mu\left\lVert x(t)\!-\!\hat{x}\right\rVert^{2}
2t22μv^+μsf(x^+av^)2+4μt\displaystyle\leq 2t^{2}\left\lVert 2\sqrt{\mu}\hat{v}+\sqrt{\mu_{s}}\nabla f(\hat{x}+a\hat{v})\right\rVert^{2}+4\sqrt{\mu}t\cdot
(2μv^+μsf(x^+av^)+μsf(x^+av^))2,\displaystyle\quad\cdot\big{(}\left\lVert 2\sqrt{\mu}\hat{v}+\sqrt{\mu_{s}}\nabla f(\hat{x}+a\hat{v})\right\rVert+\sqrt{\mu_{s}}\left\lVert\nabla f(\hat{x}+a\hat{v})\right\rVert\big{)}^{2},

where we have used (A.5). Finally,

v(t)v^+2μ(x(t)x^),v^μstf(x^+av^),v^.\displaystyle\langle v(t)-\hat{v}+2\sqrt{\mu}(x(t)-\hat{x}),\hat{v}\rangle\leq-\sqrt{\mu_{s}}t\langle\nabla f(\hat{x}+a\hat{v}),\hat{v}\rangle.

Employing these bounds in the expression of 𝔅ET\mathfrak{B}_{\operatorname{ET}}, we obtain |𝔅ET(p^,t;a)|𝔅STq(p^;a)t2+𝔅STl(p^;a)t|\mathfrak{B}_{\operatorname{ET}}(\hat{p},t;a)|\leq\mathfrak{B}^{q}_{ST}(\hat{p};a)t^{2}+\mathfrak{B}_{\operatorname{ST}}^{l}(\hat{p};a)t, as claimed. ∎

Miguel Vaquero was born in Galicia, Spain. He received his Licenciatura and Master’s degree in mathematics from the Universidad de Santiago de Compostela, Spain and the Ph.D. degree in mathematics from Instituto de Ciencias Matemáticas (ICMAT), Spain in 2015. He was then a postdoctoral scholar working on the ERC project “Invariant Submanifolds in Dynamical Systems and PDE” also at ICMAT. Since October 2017, he has been a postdoctoral scholar at the Department of Mechanical and Aerospace Engineering of UC San Diego. He will start in 2021 as an Assistant Professor at the School of Human Sciences and Technology of IE University, Madrid, Spain. His interests include optimization, dynamical systems, control theory, machine learning, and geometric mechanics.
Pol Mestres was born in Catalonia, Spain in 1997. He is currently a student of the Bachelor’s Degree in Mathematics and the Bachelor’s Degree in Engineering Physics at Universitat Politècnica de Catalunya, Barcelona, Spain. He was a visiting scholar at University of California, San Diego, CA, USA, from September 2019 to March 2020.
Jorge Cortés (M’02, SM’06, F’14) received the Licenciatura degree in mathematics from Universidad de Zaragoza, Zaragoza, Spain, in 1997, and the Ph.D. degree in engineering mathematics from Universidad Carlos III de Madrid, Madrid, Spain, in 2001. He held postdoctoral positions with the University of Twente, Twente, The Netherlands, and the University of Illinois at Urbana-Champaign, Urbana, IL, USA. He was an Assistant Professor with the Department of Applied Mathematics and Statistics, University of California, Santa Cruz, CA, USA, from 2004 to 2007. He is currently a Professor in the Department of Mechanical and Aerospace Engineering, University of California, San Diego, CA, USA. He is the author of Geometric, Control and Numerical Aspects of Nonholonomic Systems (Springer-Verlag, 2002) and co-author (together with F. Bullo and S. Martínez) of Distributed Control of Robotic Networks (Princeton University Press, 2009). He is a Fellow of IEEE and SIAM. At the IEEE Control Systems Society, he has been a Distinguished Lecturer (2010-2014), and is currently its Director of Operations and an elected member (2018-2020) of its Board of Governors. His current research interests include distributed control and optimization, network science, resource-aware control, nonsmooth analysis, reasoning and decision making under uncertainty, network neuroscience, and multi-agent coordination in robotic, power, and transportation networks.