
Dynamic behavior for a gradient algorithm with energy and momentum

Hailiang Liu Department of Mathematics, Iowa State University, Ames, IA, USA, ([email protected]).    Xuping Tian Department of Mathematics, Iowa State University, Ames, IA, USA, ([email protected]).
Abstract

This paper investigates a novel gradient algorithm, AGEM, using both energy and momentum, for addressing general non-convex optimization problems. The solution properties of the AGEM algorithm, including aspects such as uniform boundedness and convergence to critical points, are examined. The dynamic behavior is studied through a comprehensive analysis of a high-resolution ODE system. This ODE system, being nonlinear, is derived by taking the limit of the discrete scheme while preserving the momentum effect through a rescaling of the momentum parameter. The paper emphasizes the global well-posedness of the ODE system and the time-asymptotic convergence of solution trajectories. Furthermore, we establish a linear convergence rate for objective functions that adhere to the Polyak-Łojasiewicz condition.

keywords:
gradient algorithm; energy adaptive; dynamical systems; PL condition
{AMS}

34D05; 65K10

1 Introduction

In this paper, we consider the nonconvex optimization problem,

minθnf(θ),\min_{\theta\in\mathbb{R}^{n}}f(\theta), (1.1)

where the differentiable objective function f:nf:\mathbb{R}^{n}\to\mathbb{R} is assumed to be bounded from below, i.e., f+c>0f+c>0 for some cc\in\mathbb{R}. Applications like training machine learning models often involve large-scale problems with non-convex objective functions represented as a finite sum over terms associated with individual data. In such settings, first-order gradient methods such as gradient descent (GD) and its variants are commonly employed, due to their computational efficiency and satisfactory performance [15].

Despite their wide usage, GDs face a potential limitation on step size, attributed to the conditional stability of GD. This limitation can significantly impede convergence, especially in large-scale machine learning applications. This same challenge also persists for the stochastic approximation of GD, known as Stochastic Gradient Descent (SGD). Consequently, there has been much effort directed towards enhancing the convergence of first-order gradient methods. Among these efforts, adaptive step size [44, 26] and momentum [40] emerge as two widely-used techniques, aiming to address the aforementioned limitations and improve the efficiency of optimization algorithms in practice.

To overcome the issue of step size limitations, the authors in [31, 32] introduced an adaptive gradient descent with energy (AEGD). The algorithm starts from θ0\theta_{0} and r0=F(θ0)r_{0}=F(\theta_{0}), with F(θ)=f(θ)+cF(\theta)=\sqrt{f(\theta)+c}, and inductively defines

vk\displaystyle v_{k} =F(θk),\displaystyle=\nabla F(\theta_{k}), (1.2a)
rk+1\displaystyle r_{k+1} =rk1+2η|vk|2,\displaystyle=\frac{r_{k}}{1+2\eta|v_{k}|^{2}}, (1.2b)
θk+1\displaystyle\theta_{k+1} =θk2ηrk+1vk,\displaystyle=\theta_{k}-2\eta r_{k+1}v_{k}, (1.2c)

where cc\in\mathbb{R} is chosen so that infθΘ(f(θ)+c)>0\inf\limits_{\theta\in\Theta}\left(f(\theta)+c\right)>0. In sharp contrast to GD, energy-adaptive gradient methods exhibit unconditional energy stability, regardless of the base step size η>0\eta>0. Additionally, findings in [31] indicate that the parameter cc has minimal impact on the performance of (1.2).

AEGD can be extended to include momentum for accelerated convergence  [32]. In this study, we explore a variant of AEGD with momentum, referred to as AGEM. It is defined inductively as follows:

vk\displaystyle v_{k} =βvk1+(1β)F(θk),v1=𝟎,\displaystyle=\beta v_{k-1}+(1-\beta)\nabla F(\theta_{k}),\quad v_{-1}=\bf{0}, (1.3a)
rk+1\displaystyle r_{k+1} =rk1+2η|vk|2,\displaystyle=\frac{r_{k}}{1+2\eta|v_{k}|^{2}}, (1.3b)
θk+1\displaystyle\theta_{k+1} =θk2ηrk+1vk,\displaystyle=\theta_{k}-2\eta r_{k+1}v_{k}, (1.3c)

where the momentum parameter β[0,1)\beta\in[0,1) is a crucial element, and setting β=0\beta=0 reduces it to the original AEGD in (1.2). Unlike GD, all these energy-adaptive gradient methods exhibit unconditional energy stability, irrespective of the base step size η\eta. The excellent performance of AEGD-type schemes has been demonstrated across various optimization tasks [31, 32]. On the theoretical front, the convergence rates obtained in [31] depend on krkkr_{k}, rather than solely on kk. This dependence on the norm of energy could pose challenges when the energy variable decays too rapidly, potentially impacting the convergence behavior. Addressing this issue represents a main challenge in achieving a better understanding of the energy-adaptive gradient methods.
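To make the update rule concrete, the following minimal Python sketch implements one AGEM iteration of (1.3); the gradient routine grad_F, the default parameter values, and the driver code are illustrative assumptions rather than part of the algorithm's specification. Setting beta = 0 recovers AEGD (1.2).

import numpy as np

def agem_step(theta, v, r, grad_F, eta=1e-3, beta=0.9):
    # One AGEM iteration (1.3); beta = 0 recovers AEGD (1.2).
    g = grad_F(theta)                         # \nabla F(theta_k), with F = sqrt(f + c)
    v = beta * v + (1.0 - beta) * g           # momentum average of gradients of F
    r = r / (1.0 + 2.0 * eta * np.dot(v, v))  # energy update; r is strictly decreasing
    theta = theta - 2.0 * eta * r * v         # parameter update
    return theta, v, r

# Initialization per (1.3): v_{-1} = 0 and r_0 = F(theta_0).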

On the other hand, a link between ODEs and numerical optimization can be established by considering infinitesimally small step sizes, so that the trajectory or solution path converges to a curve modeled by an ODE. Leveraging the well-established theory and dynamic properties of ODEs can yield profound insights into optimization processes, as exemplified in [3, 8, 5, 43].

The motivation for this study stems from our two recent papers [32, 30], wherein we successfully demonstrated the convergence of AEGD with momentum through comprehensive experiments. These experiments showcased its superior performance across stochastic and non-stochastic scenarios. Our aim in the current work is to deepen our understanding of AGEM, as expressed in equation (1.3), specifically focusing on elements such as the momentum effect, dynamics of the energy variable, and convergence rates. By delving into its continuous formulation, we seek to gain additional insights that will guide the implementation of the algorithm and further enhance its effectiveness. Specifically, we derive an ODE system for (1.3), representing the precise limit of (1.3) when employing infinitesimally small step sizes while maintaining ϵ=βη/(1β)\epsilon=\beta\eta/(1-\beta) as a constant. This ODE system, featuring an unknown vector U:=(v,r,θ)U:=(v,r,\theta), is expressed as

ϵv˙\displaystyle\epsilon\dot{v} =v+F(θ),\displaystyle=-v+\nabla F(\theta), (1.4a)
r˙\displaystyle\dot{r} =2r|v|2,\displaystyle=-2r|v|^{2}, (1.4b)
θ˙\displaystyle\dot{\theta} =2rv,\displaystyle=-2rv, (1.4c)

for t>0t>0, with initial conditions U(0)=(0,F(θ0),θ0)U(0)=(0,F(\theta_{0}),\theta_{0}); here, θ0\theta_{0} is the starting point in the AGEM scheme, and U˙\dot{U} denotes the time derivative. Notably, this study represents the pioneering effort to model energy-adaptive schemes using ODE systems. This linkage enables the examination of the convergence properties of (1.3) from the standpoint of continuous dynamics, particularly for general smooth objective functions.

For a gradient method, the geometric properties of the objective function ff often play a crucial role in determining the convergence and convergence rates. In the case of non-convex ff, we consider an established condition originally introduced by Polyak [39], who showed its sufficiency for GD to achieve a linear convergence rate. A recent convergence study under this condition is presented in [25], referring to this condition as the Polyak-Łojasiewicz (PL) inequality. This designation originates from its connection to the gradient inequality introduced by Łojasiewicz in 1963 [48]. Observations regarding optimization-related dynamical systems have suggested that, in general, convergence along subsequences is typically achievable. However, achieving convergence of the entire trajectory requires additional geometric properties. For instance, the Łojasiewicz inequality has been used to establish trajectory convergence in [4] for a second-order gradient-like dissipative dynamical system with Hessian-driven damping. Section 7 of this work shows how such an inequality can ensure trajectory convergence for θ(t)\theta(t) in (1.4). We should point out that several significant non-convex problems in machine learning, including certain classes of neural networks, have recently been shown to satisfy the PL condition [10, 17, 42, 29]. The prevalent belief is that the PL/KL condition provides a pertinent and attractive framework for numerous machine learning problems, especially in the over-parametrized regime.

Contributions. Our findings provide a rigorous and precise understanding of the convergence behavior for both the discrete scheme (1.3) and its continuous counterpart (1.4). Our main contributions can be summarized as follows.

  1. (1)

    For (1.3), we establish that the iterates are uniformly bounded and converge to the set of critical points of the objective function when the step size is appropriately small. Additionally, we demonstrate that limkrk=r>0\lim_{k\to\infty}r_{k}=r^{*}>0.

  2. (2)

    We derive (1.4) as a high-resolution ODE system corresponding to the discrete scheme (1.3). For this ODE system, we initially show global well-posedness by constructing a suitable Lyapunov function. Subsequently, we establish the time-asymptotic convergence of the solution toward critical points using the LaSalle invariance principle. Furthermore, we provide a positive lower bound for the energy variable.

  3. (3)

    For objective functions satisfying the PL condition (refer to Section 2.2), we establish a linear convergence rate:

    f(θ(t))f(θ)(f(θ0)f(θ))eαtf(\theta(t))-f(\theta^{*})\leq(f(\theta_{0})-f(\theta^{*}))e^{-\alpha t}

    for some α>0\alpha>0, where θ\theta^{*} represents the global minimizer of ff (not necessarily convex).

  4. (4)

    We propose several variations and extensions. For a broader class of objective functions satisfying the Łojasiewicz inequality, θ(t)\theta(t) is shown to have finite length, hence converging to a single minimum of ff, accompanied by associated convergence rates.

Obtaining convergence rates for non-convex optimization problems poses a significant challenge. The approach employed in Section 6 to deliver the linear convergence rate for the system (1.4) draws inspiration from [5], where a linear convergence rate was established for the Heavy Ball dynamical system, namely (1.5), with a(t)a(t) being a constant. In the case of the more intricate nonlinear system (1.4), we manage to formulate a class of control functions that serve a similar role to those in [5]. The derivation of the convergence rate results for (1.4) differs significantly and is more intricate. Notably, we employ a subtle optimization argument to identify an optimal control function that facilitates the desired decay rate.

1.1 Related works.

IEQ strategy. The fundamental concept underpinning AEGD is the invariant energy quadratization (IEQ) strategy, originally introduced to devise linear and unconditionally energy stable schemes for gradient flows expressed as partial differential equations [46, 47]. The AEGD algorithm [31] stands out as the pioneer in utilizing the energy update to stabilize GD in the context of optimization.

Optimization with momentum. The two prominent momentum-based optimization methods are Polyak’s Heavy Ball (HB) [40] and Nesterov’s Accelerated Gradient (NAG) [37]. The HB method combines the current gradient with a history of the previous steps to accelerate the algorithm’s convergence:

vk\displaystyle v_{k} =μvk1+f(θk),\displaystyle=\mu v_{k-1}+\nabla f(\theta_{k}),
θk+1\displaystyle\theta_{k+1} =θkηvk\displaystyle=\theta_{k}-\eta v_{k}

where μ[0,1]\mu\in[0,1] is the momentum coefficient, and η\eta is the step size. Research in [40] showed that HB can significantly accelerate convergence to a local minimum. For strongly convex functions, it requires κ\sqrt{\kappa} times fewer iterations than GD to achieve the same level of accuracy, where κ\kappa is the condition number of the curvature at the minimum and μ\mu is set to κ1κ+1\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}. Similar to the HB method, NAG also ensures a faster convergence rate than GD in specific scenarios. Particularly, for general smooth (non-strongly) convex functions, NAG achieves a global convergence rate of O(1/k2)O(1/k^{2}) (versus O(1/k)O(1/k) of GD) [37]. Recent studies have indicated that integrating momentum with other techniques can further enhance the performance of certain optimization methods. For instance, momentum has been incorporated into adaptive step size methods [26] and into variance-reduction-based methods [2] for stochastic optimization.

The approach discussed in this work integrates the momentum technique into the energy-adaptive gradient methods. As demonstrated in [31, 32], the resulting algorithms not only exhibit unconditional energy stability but also showcase faster convergence, inheriting advantages from both techniques.

ODE perspectives. In recent years, the optimization community has shown a growing interest in examining the continuous-time ODE limit of optimization methods as the step size tends to zero. This perspective has spurred numerous recent studies that critically examine established optimization methods in their continuous-time limit. A main aspect of the research in this domain focuses on developing Lyapunov functions or estimating sequences that emerge independently from the dynamical system.

ODEs have proven to be particularly useful in the theoretical analysis of accelerated gradient algorithms. For instance, researchers have investigated the convergence behavior of the HB method using second-order ODEs:

\ddot{\theta}(t)+a(t)\dot{\theta}+\nabla f(\theta)=0,\quad\theta(0)=\theta_{0},\quad\dot{\theta}(0)=0 (1.5)

with a(t)a(t) being a smooth, positive and non-increasing function. The Lyapunov function E=12|θ˙|2+f(θ)E=\frac{1}{2}|\dot{\theta}|^{2}+f(\theta) plays a pivotal role in the analysis of (1.5) within both convex and non-convex settings [3, 8]. The NAG method has been linked to (1.5) with a(t)=3ta(t)=\frac{3}{t} in [43], wherein the optimal convergence rate of O(1/t2)O(1/t^{2}) of the discrete scheme in the convex setting is recovered [38]. In [41], the distinction between the acceleration phenomena of the HB method and the NAG method is investigated through an analysis of high-resolution ODEs. Further analysis of (1.5) is conducted in Attouch et al. [7], while Wibisono et al. [45] frame it within a variational context. This motivated further research on structure-preserving integration schemes for discretizing continuous-time optimization algorithms, as explored by Betancourt et al. [11]. Franca et al. [20] and Muehlebach and Jordan [36] highlight important geometric properties of the underlying dynamics, analyzing corresponding structure-preserving discretization schemes. Maddison et al. [35] propose that convergence can be enhanced through a suitable selection of kinetic energy.

In [18], a non-autonomous system of differential equations was derived and analyzed as the continuous-time limit of adaptive optimization methods, including Adam and RMSprop. Beyond examining convergence behavior, this dynamical approach provides insight into qualitative features of discrete algorithms. For example, by analyzing the sign GD of the form θ˙=f(θ)|f(θ)|\dot{\theta}=-\frac{\nabla f(\theta)}{|\nabla f(\theta)|} and its variants, [34] identified three behavior patterns of Adam and RMSprop: fast initial progress, small oscillations, and large spikes. Our work aligns with this approach, although, to our knowledge, (1.4) differs from existing ODEs for optimization methods and presents subtle analytical challenges.

The ODE perspective has inspired researchers to propose innovative optimization methods. Indeed, starting from a continuous dynamical system with favorable convergence properties, one can discretize it to derive novel algorithms. For example, a new second-order inertial optimization method, known as the INNA algorithm, is obtained from a continuous inclusion system when the objective function is non-smooth [16].

The rest of the paper is organized as follows. In Section 2, we begin with the problem setup, revisiting existing energy-adaptive gradient methods, and introducing a novel approach. We delve into the PL condition and its associated properties. In Section 3, we analyze solution properties for the newly proposed method. In Section 4, we derive a continuous dynamical system and present a convergence result to show the dynamical system is a faithful model for the discrete scheme. In Section 5, we analyze the global existence and asymptotic behavior of the solution to the dynamical system. In Section 6, we obtain a linear convergence rate for objective functions satisfying the PL condition. In Section 7, we propose several variations and extensions. Some concluding remarks are presented in Section 8.

2 Energy-adaptive gradient algorithms

Throughout this study, we make the following assumption.

Assumption 2. The objective function ff in (1.1) satisfies

  1. (1)

    fC2(n)f\in C^{2}(\mathbb{R}^{n}) and is coercive, i.e., lim|θ|f(θ)=\lim_{|\theta|\to\infty}f(\theta)=\infty.

  2. (2)

    ff is bounded from below, so that F=f+cF=\sqrt{f+c} is well-defined for some cc\in\mathbb{R}.

We represent f=minf(θ)f^{*}=\min f(\theta), F=f+cF^{*}=\sqrt{f^{*}+c}, and assume that cc is chosen such that F=f+cF>0F=\sqrt{f+c}\geq F^{*}>0. We use f\nabla f and 2f\nabla^{2}f to denote the gradient and Hessian of ff; and if\partial_{i}f to denote the ii-th element of f\nabla f. For xnx\in\mathbb{R}^{n}, |x||x| denotes its Euclidean norm (|x|=x12++xn2|x|=\sqrt{x^{2}_{1}+...+x^{2}_{n}}). The outer product of x,ynx,y\in\mathbb{R}^{n} is denoted as xyx\otimes y, and [m][m] denotes the set {1,2,,m}\{1,2,...,m\}.

Under Assumption 2, recognizing that f+c=F2f+c=F^{2}, we deduce f=2FF\nabla f=2F\nabla F and

D2f=2FD2F+2FF.D^{2}f=2FD^{2}F+2\nabla F\otimes\nabla F.

The fact that FF>0F\geq F^{*}>0 ensures that FC2(n)F\in C^{2}(\mathbb{R}^{n}) and FF is coercive.
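For completeness, these identities follow from differentiating the relation f+c=F^{2}:

\nabla f=\nabla(F^{2})=2F\nabla F,\qquad D^{2}f=D(2F\nabla F)=2\nabla F\otimes\nabla F+2FD^{2}F.

In particular, since FF>0F\geq F^{*}>0, one may also write \nabla F=\nabla f/(2F), a form used repeatedly below.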

2.1 Energy-adaptive gradient algorithms.

Scheme (1.2) represents the global and non-stochastic counterpart of AEGD, as introduced in [31]. Its notable feature is unconditional energy stability: the energy variable rkr_{k}, serving as an approximation to F(θk)=f(θk)+cF(\theta_{k})=\sqrt{f(\theta_{k})+c}, strictly decreases unless θk+1=θk\theta_{k+1}=\theta_{k}. To enhance the convergence speed of AEGD, momentum was introduced into the energy-adaptive methods. The first momentum version, named AEGDM, was introduced in [30]:

vk\displaystyle v_{k} =βvk1+F(θk),\displaystyle=\beta v_{k-1}+\nabla F(\theta_{k}), (2.6a)
rk+1\displaystyle r_{k+1} =rk1+2η|F(θk)|2,r0=F(θ0),\displaystyle=\frac{r_{k}}{1+2\eta|\nabla F(\theta_{k})|^{2}},\quad r_{0}=F(\theta_{0}), (2.6b)
θk+1\displaystyle\theta_{k+1} =θk2ηrk+1vk.\displaystyle=\theta_{k}-2\eta r_{k+1}v_{k}. (2.6c)

Here, β(0,1)\beta\in(0,1) represents the momentum parameter. The theoretical and numerical superiority of AEGDM over AEGD has been established in [30]. Subsequently, in [32], a stochastic version of AEGD with momentum, named SGEM, was introduced, emphasizing its performance in training large-scale machine learning tasks. The formulation is as follows:

mk\displaystyle m_{k} =βmk1+(1β)f(θk),\displaystyle=\beta m_{k-1}+(1-\beta)\nabla f(\theta_{k}), (2.7a)
vk\displaystyle v_{k} =mk(1βk)2F(θk),\displaystyle=\frac{m_{k}}{(1-\beta^{k})2F(\theta_{k})}, (2.7b)
rk+1\displaystyle r_{k+1} =rk1+2η|vk|2,r0=F(θ0),\displaystyle=\frac{r_{k}}{1+2\eta|v_{k}|^{2}},\quad r_{0}=F(\theta_{0}), (2.7c)
θk+1\displaystyle\theta_{k+1} =θk2ηrk+1vk,\displaystyle=\theta_{k}-2\eta r_{k+1}v_{k}, (2.7d)

with β(0,1)\beta\in(0,1) being the momentum parameter. SGEM has been shown to achieve faster convergence than AEGD, while exhibiting better generalization than stochastic gradient descent (SGD) with momentum in training various benchmark deep neural networks [32].

(a) AGEM   (b) SGEM   (c) AEGDM   (d) GDM
Figure 1: The number of iterations required for the three energy-adaptive methods (AEGDM, SGEM, AGEM) and GDM to converge to the global minimum of the two-dimensional Rosenbrock function, across different values of β\beta and η\eta. In the case of GDM, the white areas on the graph indicate divergence.

This work introduces a novel variant of (1.2), named AGEM:

vk\displaystyle v_{k} =βvk1+(1β)F(θk),v1=𝟎,\displaystyle=\beta v_{k-1}+(1-\beta)\nabla F(\theta_{k}),\quad v_{-1}=\bf{0}, (2.8a)
rk+1\displaystyle r_{k+1} =rk1+2η|vk|2,r0=F(θ0),\displaystyle=\frac{r_{k}}{1+2\eta|v_{k}|^{2}},\quad r_{0}=F(\theta_{0}), (2.8b)
θk+1\displaystyle\theta_{k+1} =θk2ηrk+1vk.\displaystyle=\theta_{k}-2\eta r_{k+1}v_{k}. (2.8c)

All of the above energy-adaptive methods – AEGD, AEGDM, SGEM, and AGEM – share the property of unconditional energy stability, as stated in the following theorem.

Theorem 2.1 (Unconditional energy stability).

The schemes (1.2), (2.6), (2.7), and (2.8) all exhibit unconditional energy stability, as expressed by the relation:

rk+12=rk2(rk+1rk)21η|θk+1θk|2.r^{2}_{k+1}=r^{2}_{k}-(r_{k+1}-r_{k})^{2}-\frac{1}{\eta}|\theta_{k+1}-\theta_{k}|^{2}. (2.9)

This guarantees the strict decrease of rkr_{k} with convergence to r0r^{*}\geq 0 as kk\to\infty, accompanied by the limit limk|θk+1θk|2=0\lim_{k\to\infty}|\theta_{k+1}-\theta_{k}|^{2}=0. Furthermore, the following inequalities hold:

ηk=0(rk+1rk)2+k=0|θk+1θk|2ηr02.\eta\sum_{k=0}^{\infty}(r_{k+1}-r_{k})^{2}+\sum_{k=0}^{\infty}|\theta_{k+1}-\theta_{k}|^{2}\leq\eta r^{2}_{0}. (2.10)
Proof 2.2.

Starting with the common AEGD equations:

rk+1\displaystyle r_{k+1} =rk1+2η|vk|2,\displaystyle=\frac{r_{k}}{1+2\eta|v_{k}|^{2}},
θk+1\displaystyle\theta_{k+1} =θk2ηrk+1vk,\displaystyle=\theta_{k}-2\eta r_{k+1}v_{k},

we derive the relation:

rk+1+2ηrk+1|vk|2=rk,|θk+1θk|2=4η2rk+12|vk|2.r_{k+1}+2\eta r_{k+1}|v_{k}|^{2}=r_{k},\quad|\theta_{k+1}-\theta_{k}|^{2}=4\eta^{2}r^{2}_{k+1}|v_{k}|^{2}.

Hence,

rk+12rk2\displaystyle r^{2}_{k+1}-r^{2}_{k} =2rk+1(rk+1rk)(rk+1rk)2\displaystyle=2r_{k+1}(r_{k+1}-r_{k})-(r_{k+1}-r_{k})^{2}
=4ηrk+12|vk|2(rk+1rk)2\displaystyle=-4\eta r^{2}_{k+1}|v_{k}|^{2}-(r_{k+1}-r_{k})^{2}
=1η|θk+1θk|2(rk+1rk)2.\displaystyle=-\frac{1}{\eta}|\theta_{k+1}-\theta_{k}|^{2}-(r_{k+1}-r_{k})^{2}.

This establishes (2.9). As {rk}\{r_{k}\} is decreasing and bounded from below, it converges to rr^{*}. Summing (2.9) over kk leads to

1ηk=0|θk+1θk|2+k=0(rk+1rk)2\displaystyle\frac{1}{\eta}\sum_{k=0}^{\infty}|\theta_{k+1}-\theta_{k}|^{2}+\sum_{k=0}^{\infty}(r_{k+1}-r_{k})^{2} =limk(r02rk2)=r02(r)2r02.\displaystyle=\lim_{k\to\infty}(r^{2}_{0}-r^{2}_{k})=r^{2}_{0}-(r^{*})^{2}\leq r^{2}_{0}.

This implies (2.10), leading to limk|θk+1θk|2=0\lim_{k\to\infty}|\theta_{k+1}-\theta_{k}|^{2}=0.

To gain a deeper understanding of how AGEM, a scheme informed by its continuous ODE system, distinguishes itself from the other two energy-adaptive methods, AEGDM [30] and SGEM [32], we conducted a robustness test on all three methods and on GDM (GD with momentum). The test used the Rosenbrock function:

f(x)=\sum_{i=1}^{n-1}\left[(1-x_{i})^{2}+100(x_{i+1}-x_{i}^{2})^{2}\right],

where nn is an integer. This nonconvex function poses a challenge due to its global minimum being located at the bottom of a long, narrow valley, making the optimization problem inherently difficult. Setting n=2n=2, we investigated the impact of the momentum parameter β\beta and step size η\eta on the convergence performance of each method, with results presented in Figure 1. Our observations indicate that, in comparison to GDM, the energy-adaptive methods exhibit greater robustness in the sense that their convergence performance is less sensitive to variations in β\beta and η\eta, enabling convergence to the global minimum within a moderate number of iterations across a wide range of parameter values. In contrast, GDM diverges when η\eta exceeds a certain threshold (e.g. 5e45e-4). Moreover, AGEM exhibits better robustness than the other two methods, achieving the fastest convergence under the settings β=0.9\beta=0.9 and η=7e4\eta=7e-4.
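As a rough illustration of the experimental setup behind Figure 1, the sketch below runs the AGEM update (reusing the agem_step routine sketched in the introduction) on the two-dimensional Rosenbrock function with the reported best settings β=0.9 and η=7e-4; the starting point, the choice c=1, and the stopping tolerance are our own illustrative assumptions.

import numpy as np

def f(x):
    return (1.0 - x[0])**2 + 100.0*(x[1] - x[0]**2)**2

def grad_f(x):
    return np.array([-2.0*(1.0 - x[0]) - 400.0*x[0]*(x[1] - x[0]**2),
                     200.0*(x[1] - x[0]**2)])

c = 1.0                                     # any c with f + c > 0; performance is reported to be insensitive to c
F = lambda x: np.sqrt(f(x) + c)
grad_F = lambda x: grad_f(x) / (2.0*F(x))   # \nabla F = \nabla f / (2F)

theta = np.array([-1.5, 2.0])               # illustrative starting point
v, r = np.zeros(2), F(theta)
for k in range(200000):
    theta, v, r = agem_step(theta, v, r, grad_F, eta=7e-4, beta=0.9)
    if np.linalg.norm(grad_f(theta)) < 1e-8:
        break
print(k, theta)                             # expected to approach the global minimizer (1, 1)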

Addressing a more challenging scenario in higher dimensions, we considered the Rosenbrock function with n=100n=100. The landscape in this case is quite intricate, with two minimizers: a global minimizer at x=(1,,1)x^{*}=(1,...,1)^{\top} with f(x)=0f(x^{*})=0 and a local minimizer near x=(1,1,,1)x=(-1,1,...,1)^{\top} with f3.99f\sim 3.99. A comparison between GDM and AGEM is presented in Figure 2. Both methods achieve their fastest convergence with η=1e4\eta=1e-4 and η=3e5\eta=3e-5, respectively. Notably, as η\eta increases for both methods, GDM becomes trapped at a local minimum, while AGEM continues to converge to the global minimum, though at a slower pace.

(a) n=2   (b) n=100
Figure 2: A comparative analysis showcasing the performance of various methods on the Rosenbrock function.

The primary goal of this study is to provide an in-depth analysis of AGEM, considering both its discrete and continuous formulations. These findings are expected to be applicable to (2.7). Our specific focus lies in understanding the convergence characteristics of AGEM. This examination will be conducted by exploring continuous dynamics for objective functions that are generally smooth. Additionally, leveraging a structural condition on ff, known as the Polyak-Łojasiewicz (PL) condition, we establish a convergence rate in Section 6.

2.2 PL condition for non-convex objectives.

In this section, we provide a brief overview of the PL condition and its relevance to the loss function within the context of deep neural networks.

Definition 2.3.

For a differentiable function f:nf:\mathbb{R}^{n}\to\mathbb{R} with argminf(θ){\rm argmin}f(\theta)\neq\emptyset, we say ff satisfies the PL condition if there exists some constant μ>0\mu>0, such that the following inequality holds for any θargminf(θ)\theta^{*}\in{\rm argmin}f(\theta):

12|f(θ)|2μ(f(θ)f(θ)),θn.\frac{1}{2}|\nabla f(\theta)|^{2}\geq\mu(f(\theta)-f(\theta^{*})),\quad\forall\theta\in\mathbb{R}^{n}. (2.11)

This condition implies that a local minimizer θ\theta^{*} is also a global minimizer. It’s important to note that strongly convex functions satisfy the PL condition, although a function satisfying the PL condition need not be convex. For instance, the function

f(θ)=θ2+3sin2θf(\theta)=\theta^{2}+3\sin^{2}\theta

is not convex but satisfies the PL condition with μ=132,f=0\mu=\frac{1}{32},f^{*}=0.
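A quick numerical sanity check of this example (a sketch only; the grid and the exclusion window around the minimizer are arbitrary choices) evaluates the PL ratio |f'(\theta)|^{2}/\big(2(f(\theta)-f^{*})\big) on a grid and verifies that it stays above μ=1/32:

import numpy as np

theta = np.linspace(-10.0, 10.0, 400001)
theta = theta[np.abs(theta) > 1e-6]        # exclude the minimizer theta = 0 to avoid 0/0
f_val = theta**2 + 3.0*np.sin(theta)**2
df = 2.0*theta + 3.0*np.sin(2.0*theta)     # f'(theta), using 2 sin(t) cos(t) = sin(2t)
ratio = 0.5*df**2 / f_val                  # here f^* = 0
print(ratio.min())                         # expected to be bounded below by mu = 1/32, cf. (2.11)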

As we mentioned in the introduction, the PL condition has attracted increasing attention in the context of training deep neural networks. In this informal exploration, we investigate how a commonly used loss function in deep learning tasks can satisfy the PL condition. Consider the following loss function

f(θ):=12i=1m(g(xi;θ)yi)2,f(\theta):=\frac{1}{2}\sum_{i=1}^{m}(g(x_{i};\theta)-y_{i})^{2},

where f(θ)=0f(\theta^{*})=0 for any minimizer θ\theta^{*}. Here, {xi,yi}i=1m\{x_{i},y_{i}\}_{i=1}^{m} represents training data points, and gg is a deep neural network parameterized by θ\theta. The gradient of this loss function is given by:

f(θ)=i=1m(g(xi;θ)yi)g(xi;θ)θ.\nabla f(\theta)=\sum_{i=1}^{m}(g(x_{i};\theta)-y_{i})\frac{\partial g(x_{i};\theta)}{\partial\theta}.

Denoting u=(u1,,um)u=(u_{1},...,u_{m}) where ui=g(xi;θ)yiu_{i}=g(x_{i};\theta)-y_{i}, we have

12|f(θ)|2=12i=1mj=1muiHijuj=12uHu,\frac{1}{2}|\nabla f(\theta)|^{2}=\frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m}u_{i}H_{ij}u_{j}=\frac{1}{2}uHu^{\top}, (2.12)

where HH is a m×mm\times m matrix with (i,j)(i,j) entry defined by

Hij=g(xi;θ)θ,g(xj;θ)θ.H_{ij}=\langle\frac{\partial g(x_{i};\theta)}{\partial\theta},\frac{\partial g(x_{j};\theta)}{\partial\theta}\rangle. (2.13)

From (2.12) it is evident that if the smallest eigenvalue of HH is bounded from below by μ\mu, then:

12|f(θ)|212μ|u|2=μf(θ).\frac{1}{2}|\nabla f(\theta)|^{2}\geq\frac{1}{2}\mu|u|^{2}=\mu f(\theta).

This condition aligns with the PL condition. Notably, using over-parameterization and random initialization, [19] has proved that for a two-layer neural network with rectified linear unit (ReLU) activation, the smallest eigenvalue of HH is indeed strictly positive. This eigenvalue proves to be crucial in deriving the linear convergence rate of gradient descent, as illustrated in [19]. A similar convergence result for implicit networks is also derived in [22].
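To see (2.13) in code, the small sketch below (a hypothetical two-layer ReLU network with random data; not the precise setting or scaling of [19]) assembles HH from per-example parameter gradients and inspects its smallest eigenvalue, a strictly positive value of which yields the PL constant μ\mu discussed above.

import numpy as np

rng = np.random.default_rng(0)
m, d, width = 20, 5, 256                          # samples, input dim, hidden width (illustrative)
X = rng.normal(size=(m, d))
W = rng.normal(size=(width, d)) / np.sqrt(d)      # trainable hidden weights, theta = W
a = rng.choice([-1.0, 1.0], size=width)           # fixed output weights

def per_example_grad(x):
    # g(x; W) = a^T relu(W x); returns dg/dW flattened into a vector
    act = (W @ x > 0).astype(float)
    return ((a * act)[:, None] * x[None, :]).ravel()

G = np.stack([per_example_grad(x) for x in X])    # row i is dg(x_i)/dtheta
H = G @ G.T                                       # H_ij = <dg(x_i)/dtheta, dg(x_j)/dtheta>, cf. (2.13)
print(np.linalg.eigvalsh(H).min())                # smallest eigenvalue of H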

3 Solution properties of the AGEM algorithm

In this section, we establish that the iterates generated by (2.8) are uniformly bounded and convergent if η\eta is suitably small.

Theorem 3.1.

Under Assumption 2, for (θk,rk,vk)(\theta_{k},r_{k},v_{k}) generated by (2.8), if η<η\eta<\eta^{*} for some η>0\eta^{*}>0, we have the following results:

  1. (1)

    Uniform boundedness: θk\theta_{k} and vkv_{k} are bounded for all k1k\geq 1. Moreover,

    θk{θn|F(θ)2F(θ0)},|vk|maxjk|F(θj)|.\theta_{k}\in\{\theta\in\mathbb{R}^{n}\;|\;F(\theta)\leq 2F(\theta_{0})\},\quad|v_{k}|\leq\max_{j\leq k}|\nabla F(\theta_{j})|.
  2. (2)

    Lower bound on energy: The lower bound of rkr_{k} is strictly positive:

    rkrk+1>r>0,k0.r_{k}\geq r_{k+1}>r^{*}>0,\quad\forall k\geq 0.
  3. (3)

    Convergence: As kk\to\infty, we have

    rkr>0,vk0,f(θk)0.r_{k}\to r^{*}>0,\quad v_{k}\to 0,\quad\nabla f(\theta_{k})\to 0.
Proof 3.2.

Define

Qk:=F(θk)+ϵrk|vk1|2,Q_{k}:=F(\theta_{k})+\epsilon r_{k}|v_{k-1}|^{2}, (3.14)

where ϵ=ηβ1β\epsilon=\frac{\eta\beta}{1-\beta}, then we claim:

Qk+1Qkϵrk+1|vkvk1|22ηrk+1|vk|2+LF(θk+12)2|θk+1θk|2,Q_{k+1}-Q_{k}\leq-\epsilon r_{k+1}|v_{k}-v_{k-1}|^{2}-2\eta r_{k+1}|v_{k}|^{2}+\frac{L_{F}(\theta_{k+\frac{1}{2}})}{2}|\theta_{k+1}-\theta_{k}|^{2}, (3.15)

where LF(θk+12)L_{F}(\theta_{k+\frac{1}{2}}) denotes the largest eigenvalue of D2F((1s)θk+sθk+1)D^{2}F((1-s)\theta_{k}+s\theta_{k+1}), maximized over s[0,1]s\in[0,1]. In fact,

Qk+1Qk=F(θk+1)F(θk)+ϵrk+1|vk|2ϵrk|vk1|2,Q_{k+1}-Q_{k}=F(\theta_{k+1})-F(\theta_{k})+\epsilon r_{k+1}|v_{k}|^{2}-\epsilon r_{k}|v_{k-1}|^{2}, (3.16)

where

F(θk+1)F(θk)F(θk),θk+1θk+LF(θk+12)2|θk+1θk|2,F(\theta_{k+1})-F(\theta_{k})\leq\langle\nabla F(\theta_{k}),\theta_{k+1}-\theta_{k}\rangle+\frac{L_{F}(\theta_{k+\frac{1}{2}})}{2}|\theta_{k+1}-\theta_{k}|^{2}, (3.17)

in which,

F(θk),θk+1θk\displaystyle\quad\langle\nabla F(\theta_{k}),\theta_{k+1}-\theta_{k}\rangle
=2ηrk+1F(θk),vk\displaystyle=-2\eta r_{k+1}\langle\nabla F(\theta_{k}),v_{k}\rangle
=2ηrk+1F(θk)vk,vk2ηrk+1|vk|2\displaystyle=-2\eta r_{k+1}\langle\nabla F(\theta_{k})-v_{k},v_{k}\rangle-2\eta r_{k+1}|v_{k}|^{2}
=2ηrk+1β1β(vkvk1),vk2ηrk+1|vk|2\displaystyle=-2\eta r_{k+1}\langle\frac{\beta}{1-\beta}(v_{k}-v_{k-1}),v_{k}\rangle-2\eta r_{k+1}|v_{k}|^{2}
=ϵrk+1(|vk|2|vk1|2+|vkvk1|2)2ηrk+1|vk|2\displaystyle=-\epsilon r_{k+1}(|v_{k}|^{2}-|v_{k-1}|^{2}+|v_{k}-v_{k-1}|^{2})-2\eta r_{k+1}|v_{k}|^{2}
ϵrk+1|vk|2+ϵrk|vk1|2ϵrk+1|vkvk1|22ηrk+1|vk|2,\displaystyle\leq-\epsilon r_{k+1}|v_{k}|^{2}+\epsilon r_{k}|v_{k-1}|^{2}-\epsilon r_{k+1}|v_{k}-v_{k-1}|^{2}-2\eta r_{k+1}|v_{k}|^{2}, (3.18)

where rk+1<rkr_{k+1}<r_{k} is used in the last inequality. Connecting (3.16), (3.17), and (3.18) gives (3.15).

(1) To demonstrate uniform boundedness, we sum up (3.15) from 0 to kk, omitting negative terms on the right-hand side to obtain:

Qk+1Q0+maxjkLF(θj+12)2j=0k|θj+1θj|2.\displaystyle Q_{k+1}\leq Q_{0}+\frac{\max_{j\leq k}L_{F}(\theta_{j+\frac{1}{2}})}{2}\sum_{j=0}^{k}|\theta_{j+1}-\theta_{j}|^{2}. (3.19)

Using Q0=F(θ0)Q_{0}=F(\theta_{0}) and the bound in (2.10), we derive:

F(θk+1)F(θ0)+12ηr02maxjkLF(θj+12).F(\theta_{k+1})\leq F(\theta_{0})+\frac{1}{2}\eta r^{2}_{0}\max_{j\leq k}L_{F}(\theta_{j+\frac{1}{2}}). (3.20)

Next, we establish that if η\eta is suitably small, {θk}\{\theta_{k}\} is uniformly bounded, ensuring the boundedness of {vk}\{v_{k}\}, as indicated by:

|vk|(1β)j=0kβkj|F(θj)|maxjk|F(θj)|.|v_{k}|\leq(1-\beta)\sum_{j=0}^{k}\beta^{k-j}|\nabla F(\theta_{j})|\leq\max_{j\leq k}|\nabla F(\theta_{j})|. (3.21)

To proceed, we introduce the notation:

Σk:={θn|F(θ)kF(θ0)},k=2,3\Sigma_{k}:=\{\theta\in\mathbb{R}^{n}\;|\;F(\theta)\leq kF(\theta_{0})\},\quad k=2,3

and Σ~k\tilde{\Sigma}_{k} for the convex hull of Σk\Sigma_{k} (the convex hull of a given set SS is the unique minimal convex set containing SS). By assumption, Σ2Σ3\Sigma_{2}\subset\Sigma_{3} are bounded domains. Our objective is to identify conditions on η\eta such that {θk}Σ2\{\theta_{k}\}\subset\Sigma_{2} for all k1k\geq 1. This can be accomplished through induction. Since θ0Σ2\theta_{0}\in\Sigma_{2}, it suffices to show that under suitable conditions on η\eta, we have:

{θj}jkΣ2θk+1Σ2\{\theta_{j}\}_{j\leq k}\subset\Sigma_{2}\Rightarrow\theta_{k+1}\in\Sigma_{2}

The main task is to establish a sufficient condition on η\eta, for which we argue in two steps:
(i) By the continuity of FF, there exists δF(r0)\delta_{F}(r_{0}) such that if

|θk+1θk|<δF(r0),|\theta_{k+1}-\theta_{k}|<\delta_{F}(r_{0}), (3.22)

we have

|F(θk+1)F(θk)|<r0.|F(\theta_{k+1})-F(\theta_{k})|<r_{0}.

This implies F(θk+1)F(θk)+r03r0F(\theta_{k+1})\leq F(\theta_{k})+r_{0}\leq 3r_{0}, indicating θk+1Σ3.\theta_{k+1}\in\Sigma_{3}. On the other hand,

|θk+1θk|=|2ηrk+1vk|2ηr0maxjk|F(θj)|2ηr0maxθΣ2|F(θ)|.|\theta_{k+1}-\theta_{k}|=|-2\eta r_{k+1}v_{k}|\leq 2\eta r_{0}\max_{j\leq k}|\nabla F(\theta_{j})|\leq 2\eta r_{0}\max_{\theta\in\Sigma_{2}}|\nabla F(\theta)|.

This ensures (3.22) if we set

η<η1:=δF(r0)2r0maxθΣ2|F(θ)|.\eta<\eta_{1}:=\frac{\delta_{F}(r_{0})}{2r_{0}\max_{\theta\in\Sigma_{2}}|\nabla F(\theta)|}.

(ii) Considering that θk+1Σ3\theta_{k+1}\in\Sigma_{3} and {θj}jkΣ2Σ3\{\theta_{j}\}_{j\leq k}\in\Sigma_{2}\subset\Sigma_{3}, we have {θj+12}jkΣ~3\{\theta_{j+\frac{1}{2}}\}_{j\leq k}\subset\tilde{\Sigma}_{3}. By (3.20), we get:

F(θk+1)F(θ0)+12ηr02maxθΣ~3LF(θ).F(\theta_{k+1})\leq F(\theta_{0})+\frac{1}{2}\eta r^{2}_{0}\max_{\theta\in\tilde{\Sigma}_{3}}L_{F}(\theta). (3.23)

This implies

F(θk+1)2F(θ0),F(\theta_{k+1})\leq 2F(\theta_{0}),

provided

ηη2:=2r0maxθΣ~3LF(θ).\eta\leq\eta_{2}:=\frac{2}{r_{0}\max_{\theta\in\tilde{\Sigma}_{3}}L_{F}(\theta)}.

Hence, if η<min{η1,η2}\eta<\min\{\eta_{1},\eta_{2}\}, then θkΣ2\theta_{k}\in\Sigma_{2} for all kk.

(2) Theorem 2.1 asserts that rkr0r_{k}\to r^{*}\geq 0 as kk\to\infty. Here, we identify a sufficient condition on η\eta to ensure r>0r^{*}>0. Following a similar argument as in (1), while retaining the negative term

2ηrk+1|vk|2=rk+1rk,-2\eta r_{k+1}|v_{k}|^{2}=r_{k+1}-r_{k},

on the right-hand side of (3.15), we derive:

F(θk)F(θ0)+rkr0+12ηr02maxθΣ~3LF(θ),F(\theta_{k})\leq F(\theta_{0})+r_{k}-r_{0}+\frac{1}{2}\eta r_{0}^{2}\max_{\theta\in\tilde{\Sigma}_{3}}L_{F}(\theta),

under the condition that η<η1\eta<\eta_{1}. Since r0=F(θ0)r_{0}=F(\theta_{0}), we obtain:

rkF(θk)12ηr02maxθΣ~3LF(θ).r_{k}\geq F(\theta_{k})-\frac{1}{2}\eta r_{0}^{2}\max_{\theta\in\tilde{\Sigma}_{3}}L_{F}(\theta).

Letting kk\to\infty, we have:

rlim infkF(θk)12ηr02maxθΣ~3LF(θ)F12ηr02maxθΣ~3LF(θ)>0,r^{*}\geq\liminf_{k\to\infty}F(\theta_{k})-\frac{1}{2}\eta r_{0}^{2}\max_{\theta\in\tilde{\Sigma}_{3}}L_{F}(\theta)\geq F^{*}-\frac{1}{2}\eta r_{0}^{2}\max_{\theta\in\tilde{\Sigma}_{3}}L_{F}(\theta)>0,

provided

ηη3:=2Fr02maxθΣ~3LF(θ).\eta\leq\eta_{3}:=\frac{2F^{*}}{r^{2}_{0}\max_{\theta\in\tilde{\Sigma}_{3}}L_{F}(\theta)}.

Therefore, if η<min{η1,η3}\eta<\min\{\eta_{1},\eta_{3}\}, then r>0r^{*}>0. Note that η3η2\eta_{3}\leq\eta_{2}, hence both (1) and (2) are satisfied if we choose η<η:=min{η1,η3}\eta<\eta^{*}:=\min\{\eta_{1},\eta_{3}\}.

(3) In Theorem 2.1, we have shown that |θk+1θk|0|\theta_{k+1}-\theta_{k}|\to 0. Using (2.8), we deduce that

rk+1|vk|0.r_{k+1}|v_{k}|\to 0.

According to (2), rkr>0r_{k}\to r^{*}>0 when η<η\eta<\eta^{*}. Consequently, |vk|0|v_{k}|\to 0 must hold. As indicated by (2.8a), this further implies F(θk)0\nabla F(\theta_{k})\to 0, consequently, f(θk)=2F(θk)F(θk)0\nabla f(\theta_{k})=2F(\theta_{k})\nabla F(\theta_{k})\to 0 also holds.

Remark 3.3.

When ff is assumed to be LL-smooth, LFL_{F} becomes uniformly bounded. In this scenario, the solution’s boundedness is attained without imposing any restriction on η\eta, as evident from (3.20). However, it is crucial to note that the LL-smoothness assumption excludes a significant class of objective functions, including those of the form |θ|4|\theta|^{4}.

4 Discrete schemes vs continuous-time-limits

The analysis of scheme (2.8) involves studying the limiting ODEs obtained by letting the step size η\eta approach 0. Notably, distinct ODEs may arise based on how the hyper-parameters are scaled. In this context, we will derive an ODE system that preserves certain momentum effects.

4.1 High resolution ODEs.

For any k0k\geq 0, let tk=kηt_{k}=k\eta, and assume vk=v(tk),rk=r(tk),θk=θ(tk)v_{k}=v(t_{k}),r_{k}=r(t_{k}),\theta_{k}=\theta(t_{k}) for some sufficiently smooth curve (v(t),r(t),θ(t))(v(t),r(t),\theta(t)). Performing the Taylor expansion on vk1v_{k-1} in powers of η\eta, we get:

vk1=v(tk1)=v(tk)ηv˙(tk)+O(η2).v_{k-1}=v(t_{k-1})=v(t_{k})-\eta\dot{v}(t_{k})+O(\eta^{2}).

Substituting vk1v_{k-1} into (2.8a), we have

v(tk)=β(v(tk)ηv˙(tk)+O(η2))+(1β)F(θ(tk)),v(t_{k})=\beta(v(t_{k})-\eta\dot{v}(t_{k})+O(\eta^{2}))+(1-\beta)\nabla F(\theta(t_{k})),

which gives

βη1βv˙=v+F(θ)+O(η2),att=tk.\frac{\beta\eta}{1-\beta}\dot{v}=-v+\nabla F(\theta)+O(\eta^{2}),\quad\text{at}\;t=t_{k}. (4.24)

Plugging rk+1=r(tk)+ηr˙(tk)+O(η2)r_{k+1}=r(t_{k})+\eta\dot{r}(t_{k})+O(\eta^{2}) into (2.8b), we have

r˙=2r|v|21+2η|v|2+O(η2),att=tk.\dot{r}=-\frac{2r|v|^{2}}{1+2\eta|v|^{2}}+O(\eta^{2}),\quad\text{at}\;t=t_{k}. (4.25)

Similarly, from (2.8c) we have

θ˙=2rv1+2η|v|2+O(η2),att=tk.\dot{\theta}=-\frac{2rv}{1+2\eta|v|^{2}}+O(\eta^{2}),\quad\text{at}\;t=t_{k}. (4.26)

We first discard O(η2)O(\eta^{2}) and keep O(η)O(\eta) terms in (4.24) (4.25), (4.26), which leads to the following ODE system:

ϵv˙\displaystyle\epsilon\dot{v} =v+F(θ),\displaystyle=-v+\nabla F(\theta), (4.27a)
r˙\displaystyle\dot{r} =2r|v|21+2η|v|2,\displaystyle=-\frac{2r|v|^{2}}{1+2\eta|v|^{2}}, (4.27b)
θ˙\displaystyle\dot{\theta} =2rv1+2η|v|2,\displaystyle=-\frac{2rv}{1+2\eta|v|^{2}}, (4.27c)

where ϵ=βη1β\epsilon=\frac{\beta\eta}{1-\beta}. This can be viewed as a high-resolution ODE [41] because of the presence of O(η)O(\eta) terms. Let η0\eta\to 0, then (4.27) reduces to the following ODE system:

r˙\displaystyle\dot{r} =2r|F(θ)|2,\displaystyle=-2r|\nabla F(\theta)|^{2}, (4.28a)
θ˙\displaystyle\dot{\theta} =2rF(θ).\displaystyle=-2r\nabla F(\theta). (4.28b)

From the two equations in (4.28), together with r(0)=F(θ0)r(0)=F(\theta_{0}), we can show

r(t)=F(θ(t)),t>0,r(t)=F(\theta(t)),\quad\forall t>0,

with which, (4.28b) reduces to

θ˙=2F(θ)F(θ)=f(θ).\dot{\theta}=-2F(\theta)\nabla F(\theta)=-\nabla f(\theta). (4.29)

This asserts that the ODE system (4.28) is equivalent to the gradient flow (4.29).
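The identity r(t)=F(\theta(t)) claimed above follows from a one-line computation along solutions of (4.28):

\frac{d}{dt}\big(r(t)-F(\theta(t))\big)=-2r|\nabla F(\theta)|^{2}-\langle\nabla F(\theta),\dot{\theta}\rangle=-2r|\nabla F(\theta)|^{2}+2r|\nabla F(\theta)|^{2}=0,

so that r(t)-F(\theta(t))\equiv r(0)-F(\theta_{0})=0 for all t>0.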

To explore the momentum effect, we maintain ϵ=βη1β\epsilon=\frac{\beta\eta}{1-\beta} unchanged while allowing η0\eta\to 0. This yields the following system of ODEs:

ϵv˙\displaystyle\epsilon\dot{v} =v+F(θ),\displaystyle=-v+\nabla F(\theta), (4.30a)
r˙\displaystyle\dot{r} =2r|v|2,\displaystyle=-2r|v|^{2}, (4.30b)
θ˙\displaystyle\dot{\theta} =2rv.\displaystyle=-2rv. (4.30c)

Compared with (4.30), the high-resolution ODEs (4.27) serve as more accurate continuous-time counterparts for the corresponding discrete algorithm (2.8). While the analysis of (4.27) would be more intricate, in the remainder of this work, we concentrate on (4.30) and provide a detailed analysis of it.

Theorem 4.1.

Denote Uk=(θk,vk,rk)U_{k}=(\theta_{k},v_{k},r_{k}) as the solution provided in Theorem 3.1, and U(t)=(θ(t),v(t),r(t))U(t)=(\theta(t),v(t),r(t)) as the solution to (4.30). Then, for any T>0T>0, we have:

limη0kη=TUk=U(T).\lim_{\begin{subarray}{c}\eta\to 0\\ k\eta=T\end{subarray}}U_{k}=U(T).

This result is a consequence of a standard ODE analysis. In the present context, it is assured by the boundedness of UkU_{k}, as demonstrated in Theorem 3.1, and by the boundedness of U(t)U(t), which will be shown in Theorem 5.1.
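Theorem 4.1 can also be checked numerically. The sketch below (with an illustrative quadratic objective, solver tolerances, and horizon of our own choosing, and reusing the agem_step routine sketched in the introduction) integrates (4.30) with scipy and compares the AGEM iterates against U(T)U(T) as η\eta decreases while ϵ=βη/(1β)\epsilon=\beta\eta/(1-\beta) is held fixed.

import numpy as np
from scipy.integrate import solve_ivp

c = 1.0
f = lambda th: 0.5*np.dot(th, th)                 # illustrative objective
F = lambda th: np.sqrt(f(th) + c)
grad_F = lambda th: th / (2.0*F(th))

eps, T = 0.5, 5.0
theta0 = np.array([2.0, -1.0])

def rhs(t, u):                                    # u = (v, r, theta) as in (4.30)
    v, r, th = u[:2], u[2], u[3:]
    return np.concatenate([(-v + grad_F(th))/eps, [-2.0*r*np.dot(v, v)], -2.0*r*v])

sol = solve_ivp(rhs, (0.0, T), np.concatenate([np.zeros(2), [F(theta0)], theta0]),
                rtol=1e-10, atol=1e-12)

for eta in [1e-1, 1e-2, 1e-3]:
    beta = eps / (eps + eta)                      # keeps eps = beta*eta/(1 - beta) fixed
    th, v, r = theta0.copy(), np.zeros(2), F(theta0)
    for _ in range(int(round(T/eta))):
        th, v, r = agem_step(th, v, r, grad_F, eta=eta, beta=beta)
    print(eta, np.linalg.norm(th - sol.y[3:, -1]))  # |theta_k - theta(T)| should shrink with eta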

4.2 Time discretization.

Indeed, the discrete AGEM dynamics can be interpreted as a discretization of the continuous ODE system (4.30). Specifically, we employ a first-order finite difference approximation with implicit treatment of rr and θ\theta, but explicit treatment of vv, resulting in:

ϵvk+1vkτ\displaystyle\epsilon\frac{v_{k+1}-v_{k}}{\tau} =vk+F(θk+1),\displaystyle=-v_{k}+\nabla F(\theta_{k+1}),
rk+1rkτ\displaystyle\frac{r_{k+1}-r_{k}}{\tau} =2rk+1|vk|2,\displaystyle=-2r_{k+1}|v_{k}|^{2},
θk+1θkτ\displaystyle\frac{\theta_{k+1}-\theta_{k}}{\tau} =2rk+1vk.\displaystyle=-2r_{k+1}v_{k}.

This formulation reproduces the discrete AGEM dynamics (2.8) when setting β=τ/ϵ\beta=\tau/\epsilon and η=τ\eta=\tau. Moreover, this perspective enables the derivation of new algorithms by applying diverse time discretizations to (4.30). For instance, a second-order discretization of (4.30) leads to

ϵ3vk+14vk+vk12τ\displaystyle\epsilon\frac{3v_{k+1}-4v_{k}+v_{k-1}}{2\tau} =vk+1+F(2θkθk1),\displaystyle=-v_{k+1}+\nabla F(2\theta_{k}-\theta_{k-1}),
3rk+14rk+rk12τ\displaystyle\frac{3r_{k+1}-4r_{k}+r_{k-1}}{2\tau} =2rk+1|2vkvk1|2,\displaystyle=-2r_{k+1}|2v_{k}-v_{k-1}|^{2},
3θk+14θk+θk12τ\displaystyle\frac{3\theta_{k+1}-4\theta_{k}+\theta_{k-1}}{2\tau} =2rk+1vk+1.\displaystyle=-2r_{k+1}v_{k+1}.

By setting β=τ/ϵ\beta=\tau/\epsilon and η=τ\eta=\tau the following recursive relationships are derived:

vk+1\displaystyle v_{k+1} =13+2β(4vkvk1+2βF(2θkθk1)),\displaystyle=\frac{1}{3+2\beta}\left(4v_{k}-v_{k-1}+2\beta\nabla F(2\theta_{k}-\theta_{k-1})\right),
rk+1\displaystyle r_{k+1} =4rkrk13+4η|2vkvk1|2,\displaystyle=\frac{4r_{k}-r_{k-1}}{3+4\eta|2v_{k}-v_{k-1}|^{2}},
θk+1\displaystyle\theta_{k+1} =13(4θkθk14ηrk+1vk+1).\displaystyle=\frac{1}{3}\left(4\theta_{k}-\theta_{k-1}-4\eta r_{k+1}v_{k+1}\right).

These recursions begin with the initial setup of v1=0,r1=r0v_{-1}=0,r_{-1}=r_{0} and θ1=θ0\theta_{-1}=\theta_{0}. A more comprehensive analysis of this scheme exceeds the scope of the present article.
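For reference, a direct transcription of these recursions into Python might read as follows; this is only a sketch of the update step, with the initialization v_{-1}=0, r_{-1}=r_{0}, θ_{-1}=θ_{0} handled by the caller, and, as noted above, the scheme is not analyzed further here.

import numpy as np

def agem_bdf2_step(theta, theta_prev, v, v_prev, r, r_prev, grad_F, eta=1e-3, beta=0.9):
    # Second-order (BDF2-type) recursion from Section 4.2, with beta = tau/eps and eta = tau.
    v_new = (4.0*v - v_prev + 2.0*beta*grad_F(2.0*theta - theta_prev)) / (3.0 + 2.0*beta)
    w = 2.0*v - v_prev                             # extrapolated momentum variable
    r_new = (4.0*r - r_prev) / (3.0 + 4.0*eta*np.dot(w, w))
    theta_new = (4.0*theta - theta_prev - 4.0*eta*r_new*v_new) / 3.0
    return theta_new, theta, v_new, v, r_new, r    # shift the two-step history forward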

5 Dynamic solution behavior

This section is dedicated to the examination of (4.30), including aspects such as global existence, asymptotic convergence to equilibria through the renowned LaSalle invariance principle [28], and the determination of convergence rates for non-convex objective functions that adhere to the PL property.

5.1 Global Existence and Uniqueness.

Theorem 5.1.

Under Assumption 2, the ODE (4.30) with the initial condition (θ0,r0>0,v0)(\theta_{0},r_{0}>0,v_{0}) admits a unique global solution (θ(t),v(t),r(t))C1([0,);2n+1)(\theta(t),v(t),r(t))\in C^{1}([0,\infty);\mathbb{R}^{2n+1}). Specifically, for any t>0t>0, the following conditions hold:

θ(t){θn|F(θ)F(θ0)+ϵr0|v0|2},\displaystyle\theta(t)\in\{\theta\in\mathbb{R}^{n}\;|\;F(\theta)\leq F(\theta_{0})+\epsilon r_{0}|v_{0}|^{2}\},
|v(t)|max{|v(0)|,max[0,t]|F(θ())|},\displaystyle|v(t)|\leq\max\{|v(0)|,\max_{[0,t]}|\nabla F(\theta(\cdot))|\},
0<r(t)r0.\displaystyle 0<r(t)\leq r_{0}. (5.31)
Proof 5.2.

For fC1(n)f\in C^{1}(\mathbb{R}^{n}), by Picard and Lindelöf, there exists a unique local solution to (4.30) with the initial condition (θ0,r0,v0)(\theta_{0},r_{0},v_{0}). The extension theorem implies that the global existence fails only if the solution ‘escapes’ to infinity. To establish global existence, it is sufficient to demonstrate that (θ(t),r(t),v(t))(\theta(t),r(t),v(t)) remains bounded for all t>0t>0. To achieve this, we introduce a suitable control function. A candidate based on the discrete counterpart is:

Q:=Q(θ,r,v)=F(θ)+ϵr|v|2.Q:=Q(\theta,r,v)=F(\theta)+\epsilon r|v|^{2}. (5.32)

Using the chain rule, if (θ,r,v)(\theta,r,v) satisfies (4.30), then:

Q˙\displaystyle\dot{Q} =F(θ),θ˙+ϵr˙|v|2+2ϵrv,v˙\displaystyle=\langle\nabla F(\theta),\dot{\theta}\rangle+\epsilon\dot{r}|v|^{2}+2\epsilon r\langle v,\dot{v}\rangle
=2rF(θ),v2ϵr|v|4+2rv,v+F(θ)\displaystyle=-2r\langle\nabla F(\theta),v\rangle-2\epsilon r|v|^{4}+2r\langle v,-v+\nabla F(\theta)\rangle
=2r|v|22ϵr|v|40.\displaystyle=-2r|v|^{2}-2\epsilon r|v|^{4}\leq 0. (5.33)

Hence, QQ is guaranteed to decrease. In particular, we have:

F(θ)+ϵr|v|2Q(θ0,r0,v0),F(\theta)+\epsilon r|v|^{2}\leq Q(\theta_{0},r_{0},v_{0}), (5.34)

which implies the boundedness of F(θ)F(\theta). This, combined with the coerciveness of ff and, consequently FF, guarantees that θ(t)\theta(t) remains bounded for all t>0t>0. However, the above inequality does not provide a uniform bound for v(t)v(t).

To establish a useful bound for v(t)v(t), we use (4.30a):

tv+1ϵv=1ϵF(θ(t)),\partial_{t}v+\frac{1}{\epsilon}v=\frac{1}{\epsilon}\nabla F(\theta(t)),

resulting in:

ddt(v(t)et/ϵ)=1ϵet/ϵF(θ(t)).\frac{d}{dt}(v(t)e^{t/\epsilon})=\frac{1}{\epsilon}e^{t/\epsilon}\nabla F(\theta(t)).

Consequently, for every tt, we have:

v(t)=v(0)et/ϵ+1ϵ0te(τt)/ϵF(θ(τ))𝑑τ.v(t)=v(0)e^{-t/\epsilon}+\frac{1}{\epsilon}\int_{0}^{t}e^{(\tau-t)/\epsilon}\nabla F(\theta(\tau))d\tau.

Therefore,

|v(t)|\displaystyle|v(t)| |v(0)|et/ϵ+(1et/ϵ)maxτ[0,t]|F(θ(τ))|\displaystyle\leq|v(0)|e^{-t/\epsilon}+(1-e^{-t/\epsilon})\max_{\tau\in[0,t]}|\nabla F(\theta(\tau))|
max{|v(0)|,maxτ[0,t]|F(θ(τ))|}.\displaystyle\leq\max\{|v(0)|,\max_{\tau\in[0,t]}|\nabla F(\theta(\tau))|\}.

Note that the boundedness of F(θ(t))\nabla F(\theta(t)) is ensured by

|F(θ(t))|=|f(θ(t))2F(θ(t))||f(θ(t))2F|.|\nabla F(\theta(t))|=\bigg{|}\frac{\nabla f(\theta(t))}{2F(\theta(t))}\bigg{|}\leq\bigg{|}\frac{\nabla f(\theta(t))}{2F^{*}}\bigg{|}.

From (4.30b), it is evident that rr decreases monotonically; moreover, r(t)=r_{0}\exp\big(-2\int_{0}^{t}|v(s)|^{2}\,ds\big)>0, so that 0<r(t)\leq r_{0}. This concludes the proof.

Remark 5.3.

The uniform boundedness of |v||v| justifies the removal of 2η|v|22\eta|v|^{2} in (4.27b) and (4.27c), while taking the limit η0\eta\to 0.

5.2 Asymptotic behavior of solutions.

Theorem 5.4.

Under Assumption 2, let (θ(t),r(t),v(t))(\theta(t),r(t),v(t)) be the global solution to the ODE (4.30) as stated in Theorem 5.1, with the initial condition (θ0,r0,v0)(\theta_{0},r_{0},v_{0}). If the conditions

r0>F(θ0)F,ϵ|v0|2<1F(θ0)Fr0r_{0}>F(\theta_{0})-F^{*},\quad\epsilon|v_{0}|^{2}<1-\frac{F(\theta_{0})-F^{*}}{r_{0}} (5.35)

hold, then as tt\to\infty,

F(θ(t))0,r(t)r>0,v(t)v=0.\nabla F(\theta(t))\rightarrow 0,\quad r(t)\rightarrow r^{*}>0,\quad v(t)\rightarrow v^{*}=0.

Furthermore, F(θ(t))FF(\theta(t))\rightarrow F^{*} as the minimum value of FF, and

r<r(t)r0,FF(θ(t))F(θ0)+ϵr0|v0|2.r^{*}<r(t)\leq r_{0},\quad F^{*}\leq F(\theta(t))\leq F(\theta_{0})+\epsilon r_{0}|v_{0}|^{2}. (5.36)

The upper bound on FF follows directly from (5.34). We establish the convergence result through three steps.

Step 1: We demonstrate that r:=limtr(t)r^{*}:=\lim_{t\to\infty}r(t) exists and r>0r^{*}>0. First, since r˙0\dot{r}\leq 0 and r0r\geq 0, it follows that 0|r˙|𝑑tr0<\int_{0}^{\infty}|\dot{r}|dt\leq r_{0}<\infty. This ensures the existence of limtr(t)=r\lim_{t\to\infty}r(t)=r^{*}. Note that (5.33) can be expressed as

r˙=Q˙+2ϵr|v|4.\dot{r}=\dot{Q}+2\epsilon r|v|^{4}.

Integrating both sides from 0 to tt and using (5.35), we obtain:

r(t)\displaystyle r(t) =r0+F(θ(t))+ϵr(t)|v(t)|2(F(θ0)+ϵr0|v0|2)+2ϵ0tr|v|4𝑑s\displaystyle=r_{0}+F(\theta(t))+\epsilon r(t)|v(t)|^{2}-(F(\theta_{0})+\epsilon r_{0}|v_{0}|^{2})+2\epsilon\int_{0}^{t}r|v|^{4}ds
r0(F(θ0)F)ϵr0|v0|2>0.\displaystyle\geq r_{0}-(F(\theta_{0})-F^{*})-\epsilon r_{0}|v_{0}|^{2}>0. (5.37)

From this lower bound, we deduce:

r=limtr(t)r0+FF(θ0)ϵr0|v0|2>0.r^{*}=\lim_{t\to\infty}r(t)\geq r_{0}+F^{*}-F(\theta_{0})-\epsilon r_{0}|v_{0}|^{2}>0.

Step 2: We proceed to analyze the asymptotic behavior of θ\theta and vv by using LaSalle’s invariance principle [28]. Recall the function QQ defined in Theorem 5.1,

Q(θ,r,v)=F(θ)+ϵr|v|2.Q(\theta,r,v)=F(\theta)+\epsilon r|v|^{2}.

Given that r(t)r>0r(t)\geq r^{*}>0, this implies that the set

Ω={(θ,r,v)|Q(θ,r,v)Q(θ0,r0,v0)}\Omega=\{(\theta,r,v)\;|\;Q(\theta,r,v)\leq Q(\theta_{0},r_{0},v_{0})\}

is a compact positively invariant set with respect to (4.30). For any (θ,r,v)Ω(\theta,r,v)\in\Omega, we observe that

Q˙(θ,r,v)=2r|v|22ϵr|v|40.\dot{Q}(\theta,r,v)=-2r|v|^{2}-2\epsilon r|v|^{4}\leq 0.

Define the set:

Σ={(θ,r,v)Ω|Q˙(θ,r,v)=0}.\Sigma=\{(\theta,r,v)\in\Omega\;|\;\dot{Q}(\theta,r,v)=0\}.

Since r(t)r>0r(t)\geq r^{*}>0 for all t>0t>0, it follows that

Σ={(θ,r,v)Ω|r|v|=0}={(θ,r,v)Ω|v=0,rr}.\Sigma=\{(\theta,r,v)\in\Omega\;|\;r|v|=0\}=\{(\theta,r,v)\in\Omega\;|\;v=0,r\geq r^{*}\}.

Next, we identify the largest invariant set in Σ\Sigma. Suppose that at some t1t_{1}, v(t1)=0v(t_{1})=0, then we have θ˙=0,r˙=0\dot{\theta}=0,\dot{r}=0, but ϵv˙(t1)=F(θ(t1))\epsilon\dot{v}(t_{1})=\nabla F(\theta(t_{1})). As a result, the trajectory will not stay in Σ\Sigma unless F(θ(t1))=0\nabla F(\theta(t_{1}))=0. In other words, the largest invariant set in Σ\Sigma must be:

Σ0={(θ,r,v)Ω|v=0,F(θ)=0,rr}.\Sigma_{0}=\{(\theta,r,v)\in\Omega\;|\;v=0,\nabla F(\theta)=0,r\geq r^{*}\}.

By LaSalle’s invariance principle, every trajectory of (4.30) starting in Ω\Omega tends to Σ0\Sigma_{0} as tt\to\infty. From Step 1, rr^{*} is the only limit of r(t)r(t). Hence, all trajectories will admit a limit or a cluster point in

{(θ,r,0)|F(θ)=0}.\{(\theta,r^{*},0)\;|\;\nabla F(\theta)=0\}.

This implies that: limtF(θ(t))=0\lim_{t\to\infty}\nabla F(\theta(t))=0. Note that QQ is monotone and bounded, hence admitting a limit b>0b>0. This further implies that:

limtF(θ(t))=limt[Q(θ(t),r(t),v(t))ϵr(t)|v(t)|2]=binfF.\lim_{t\to\infty}F(\theta(t))=\lim_{t\to\infty}[Q(\theta(t),r(t),v(t))-\epsilon r(t)|v(t)|^{2}]=b\geq\inf F.

Step 3. Finally, we demonstrate that bb is a local minimum value of FF, rather than a local maximum value. We prove this by contradiction. Suppose bb is a local maximum value of FF. Recall that F(θ)bF(\theta)\to b, rrr\to r^{*} and |v|0|v|\to 0 as tt\to\infty. Thus, for any δ>0\delta>0, there exists t1=t1(δ)t_{1}=t_{1}(\delta) such that for any tt1t\geq t_{1}:

|F(θ(t))b|δ,|r(t)r|δ,|v(t)|δ.|F(\theta(t))-b|\leq\delta,\quad|r(t)-r^{*}|\leq\delta,\quad|v(t)|\leq\delta.

Since FF is continuous, there exists t2t1t_{2}\geq t_{1} such that:

F(θ(t2))bδ2,F(\theta(t_{2}))\leq b-\frac{\delta}{2},

where we have used the assumption that bb is a local maximum value of FF. From Step 2, we know that QQ is non-increasing in tt. Hence, for any tt2t\geq t_{2}, we have:

Q(t)Q(t2)=F(θ(t2))+ϵr(t2)|v(t2)|2bδ2+ϵ(r+δ)δ2<b,Q(t)\leq Q(t_{2})=F(\theta(t_{2}))+\epsilon r(t_{2})|v(t_{2})|^{2}\leq b-\frac{\delta}{2}+\epsilon(r^{*}+\delta)\delta^{2}<b,

provided δ\delta is small enough so that (r+δ)δ<12ϵ(r^{*}+\delta)\delta<\frac{1}{2\epsilon}. Now letting tt\to\infty, we obtain b=limtQ(t)<bb=\lim_{t\to\infty}Q(t)<b. This is a contradiction. Hence, bb has to be a local minimum value of FF, i.e., b=Fb=F^{*}.

6 Convergence rates

The speed at which the solution converges to the minimum value of ff depends on the geometry of the objective function near ff^{*}. In fact, when the PL condition is satisfied, it can be shown that f(θ(t))f(\theta(t)) converges exponentially fast to ff^{*}.

Theorem 6.1.

Consider the global solution (θ(t),r(t),v(t))(\theta(t),r(t),v(t)) to (4.30) as presented in Theorem 5.4, initialized with (θ0,r0,v0)=(θ0,F(θ0),0)(\theta_{0},r_{0},v_{0})=(\theta_{0},F(\theta_{0}),0). Assume that ff also satisfies the PL condition (2.11) with μ>0\mu>0. For any δ(0,1)\delta\in(0,1), the following bounds are valid:

f(θ(t))f(f(θ0)f)(e2μrF(θ0)(2δ)t+2μϵδ(2δ)eδϵt),f(\theta(t))-f^{*}\leq(f(\theta_{0})-f^{*})\bigg{(}e^{-\frac{2\mu r^{*}}{F(\theta_{0})(2-\delta)}t}+\frac{2\mu\epsilon}{\delta(2-\delta)}e^{-\frac{\delta}{\epsilon}t}\bigg{)}, (6.38)

given that ϵmin{ϵ1,ϵ2}\epsilon\leq\min\{\epsilon_{1},\epsilon_{2}\}, where

ϵ1=δ(1δ)F2LF(θ0),ϵ2=12μF(θ0)(2δ)(1δ)(F(θ0)+FrF(θ0)).\epsilon_{1}=\delta(1-\delta)\frac{F^{*}}{2LF(\theta_{0})},\quad\epsilon_{2}=\frac{1}{2\mu F(\theta_{0})}(2-\delta)(1-\delta)\bigg{(}F(\theta_{0})+\frac{F^{*}r^{*}}{F(\theta_{0})}\bigg{)}. (6.39)

Here, LL represents the largest eigenvalue of D2f(θ)D^{2}f(\theta), where θ\theta belongs to the solution domain in Theorem 5.1.

Remark 6.2.

(Comparison with the gradient flow (4.29) without momentum.) For small values of ϵ\epsilon, where momentum exerts a diminished influence on the dynamical system, the convergence rate is predominantly governed by the term e2μrF(θ0)(2δ)te^{-\frac{2\mu r^{*}}{F(\theta_{0})(2-\delta)}t}. Specifically:

f(θ(t))f(f(θ0)f)(e2μrF(θ0)(2δ)t),f(\theta(t))-f^{*}\leq(f(\theta_{0})-f^{*})\bigg{(}e^{-\frac{2\mu r^{*}}{F(\theta_{0})(2-\delta)}t}\bigg{)},

if ϵmin{ϵ1,ϵ2,ϵ3}\epsilon\leq\min\{\epsilon_{1},\epsilon_{2},\epsilon_{3}\} with ϵ3=F(θ0)δ(2δ)2μr\epsilon_{3}=\frac{F(\theta_{0})\delta(2-\delta)}{2\mu r^{*}}. Moreover, 2μrF(θ0)(2δ)\frac{2\mu r^{*}}{F(\theta_{0})(2-\delta)} becomes larger as δ1\delta\to 1. For δ=1\delta=1, ϵ\epsilon is compelled to be 0, reducing (4.30) to (4.28) or (4.29). In this scenario, 2μrF(θ0)(2δ)\frac{2\mu r^{*}}{F(\theta_{0})(2-\delta)} becomes 2μ2\mu, restoring the convergence rate of the gradient flow (4.29) under the PL condition [39].

Remark 6.3.

(Comparison with gradient flow with momentum and with the work [5]). To compare with GD with momentum of the form

mk\displaystyle m_{k} =βmk1+(1β)f(θk),m1=0,\displaystyle=\beta m_{k-1}+(1-\beta)\nabla f(\theta_{k}),\quad m_{-1}=0,
θk+1\displaystyle\theta_{k+1} =θkηmk,\displaystyle=\theta_{k}-\eta m_{k},

we follow the same strategy as in Section 4.1 to obtain its continuous formulation:

ϵθ¨+θ˙+f(θ)=0.\epsilon\ddot{\theta}+\dot{\theta}+\nabla f(\theta)=0.

This upon rescaling s=t/ϵs=t/\sqrt{\epsilon} leads to

θ′′+1ϵθ+f(θ)=0.\theta^{\prime\prime}+\frac{1}{\sqrt{\epsilon}}\theta^{\prime}+\nabla f(\theta)=0.

This is the second-order ODE (1.5) with a constant a(t)a(t), for which we refer to the findings presented in [5, Theorem 3.1]. When dealing with non-convex functions satisfying the PL condition (2.11), particularly in scenarios where κ>9/8\kappa>9/8 and 1ϵ=(2κκ1)μ\frac{1}{\sqrt{\epsilon}}=(2\sqrt{\kappa}-\sqrt{\kappa-1})\sqrt{\mu}, [5] establishes the estimate:

f(θ(t))fO(1)e2(κκ1)μs=O(1)e2μb1t,f(\theta(t))-f^{*}\leq O(1)e^{-2(\sqrt{\kappa}-\sqrt{\kappa-1})\sqrt{\mu}s}=O(1)e^{-2\mu b_{1}t},

where

b1=3κ13κ(κ1).b_{1}=3\kappa-1-3\sqrt{\kappa(\kappa-1)}.

We examine a scenario where the dominant factor in the convergence rate (6.38) is eδϵte^{-\frac{\delta}{\epsilon}t}, which when taking the same ϵ\epsilon as above gives the convergence rate of e2μb2te^{-2\mu b_{2}t} with

b2=δ2(5κ14κ(κ1)).b_{2}=\frac{\delta}{2}(5\kappa-1-4\sqrt{\kappa(\kappa-1)}).

While our convergence rate is comparable to the specific rate e2μb1te^{-2\mu b_{1}t}, it is considered suboptimal owing to the additional constraint ϵmin{ϵ1,ϵ2}\epsilon\leq\min\{\epsilon_{1},\epsilon_{2}\} imposed by the proof technique.

The proof outlined below comprises three steps:
Step 1: We introduce a candidate control function, denoted as

E=a(f(θ)f)ϵf(θ),v+λϵr|v|2,E=a(f(\theta)-f^{*})-\epsilon\langle\nabla f(\theta),v\rangle+\lambda\epsilon r|v|^{2},

where the parameters aa and λ>0\lambda>0 are to be determined.
Step 2: We determine admissible pairs (a,λ)(a,\lambda) such that E(t)E(t) exhibits exponential decay.
Step 3: Using W˙(t)=f(θ),θ˙=2rf(θ),v\dot{W}(t)=\langle\nabla f(\theta),\dot{\theta}\rangle=-2r\langle\nabla f(\theta),v\rangle, we link EE to W:=f(θ)fW:=f(\theta)-f^{*} by

E=aW+ϵW˙2r+λϵr|v|2,E=aW+\frac{\epsilon\dot{W}}{2r}+\lambda\epsilon r|v|^{2}, (6.40)

and derive the convergence rate of WW based on the convergence rate of EE.

Proof 6.4.

Step 1: Decay of the control function. Define

E=a(f(θ)f)ϵf(θ),v+λϵr|v|2,E=a(f(\theta)-f^{*})-\epsilon\langle\nabla f(\theta),v\rangle+\lambda\epsilon r|v|^{2}, (6.41)

where parameters a,λ>0a,\lambda>0 will be determined later so that along the trajectory of (4.30) E=E(t)E=E(t) has an exponential decay rate.

For each term in EE, we evaluate its time derivative along the trajectory of (4.30). For the first term, we get

ddt(f(θ)f)=f(θ),θ˙=2rf(θ),v.\frac{d}{dt}(f(\theta)-f^{*})=\langle\nabla f(\theta),\dot{\theta}\rangle=-2r\langle\nabla f(\theta),v\rangle.

For the second term, we have

ddtϵf(θ),v\displaystyle\frac{d}{dt}\epsilon\langle\nabla f(\theta),v\rangle =ϵ2f(θ)θ˙,v+ϵf(θ),v˙\displaystyle=\epsilon\langle\nabla^{2}f(\theta)\dot{\theta},v\rangle+\epsilon\langle\nabla f(\theta),\dot{v}\rangle
=2ϵr2f(θ)v,vf(θ),v+f(θ),F(θ).\displaystyle=-2\epsilon r\langle\nabla^{2}f(\theta)v,v\rangle-\langle\nabla f(\theta),v\rangle+\langle\nabla f(\theta),\nabla F(\theta)\rangle. (6.42)

Note that

2f(θ)v,vL|v|2,and\langle\nabla^{2}f(\theta)v,v\rangle\leq L|v|^{2},\quad\text{and}
f(θ),F(θ)=12F(θ)|f(θ)|2μF(θ)(f(θ)f).\langle\nabla f(\theta),\nabla F(\theta)\rangle=\frac{1}{2F(\theta)}|\nabla f(\theta)|^{2}\geq\frac{\mu}{F(\theta)}(f(\theta)-f^{*}).

Here we used F(θ)=f(θ)/(2F(θ))\nabla F(\theta)=\nabla f(\theta)/(2F(\theta)) and the PL property for ff.

Hence (6.42) reduces to

ddtϵf(θ),v2ϵLr|v|2f(θ),v+μF(θ)(f(θ)f).\frac{d}{dt}\epsilon\langle\nabla f(\theta),v\rangle\geq-2\epsilon Lr|v|^{2}-\langle\nabla f(\theta),v\rangle+\frac{\mu}{F(\theta)}(f(\theta)-f^{*}).

For the third term, we have

ddtϵr|v|2\displaystyle\frac{d}{dt}\epsilon r|v|^{2} =ϵr˙|v|2+2ϵrv,v˙\displaystyle=\epsilon\dot{r}|v|^{2}+2\epsilon r\langle v,\dot{v}\rangle
=2ϵr|v|4+2rv,v+F(θ)\displaystyle=-2\epsilon r|v|^{4}+2r\langle v,-v+\nabla F(\theta)\rangle
=2ϵr|v|42r|v|2+2rF(θ),v\displaystyle=-2\epsilon r|v|^{4}-2r|v|^{2}+2r\langle\nabla F(\theta),v\rangle
2r|v|2+rF(θ)f(θ),v.\displaystyle\leq-2r|v|^{2}+\frac{r}{F(\theta)}\langle\nabla f(\theta),v\rangle.

Combining the above three estimates together we obtain

E˙(t)μF(θ)(f(θ)f)+(2ϵL2λ)r|v|2+(2ar+1+λrF(θ))f(θ),v.\dot{E}(t)\leq-\frac{\mu}{F(\theta)}(f(\theta)-f^{*})+(2\epsilon L-2\lambda)r|v|^{2}+\bigg{(}-2ar+1+\frac{\lambda r}{F(\theta)}\bigg{)}\langle\nabla f(\theta),v\rangle. (6.43)

Note that (6.41) can be rewritten as

f(θ),v=aϵ(f(θ)f)+λr|v|21ϵE(t).\langle\nabla f(\theta),v\rangle=\frac{a}{\epsilon}(f(\theta)-f^{*})+\lambda r|v|^{2}-\frac{1}{\epsilon}E(t). (6.44)

This upon substitution into (6.43) gives

E˙(t)b(t)ϵE(t)+(ab(t)ϵμF(θ))(f(θ)f)+(λb(t)+2ϵL2λ)r|v|2,\dot{E}(t)\leq-\frac{b(t)}{\epsilon}E(t)+\left(\frac{ab(t)}{\epsilon}-\frac{\mu}{F(\theta)}\right)(f(\theta)-f^{*})+(\lambda b(t)+2\epsilon L-2\lambda)r|v|^{2}, (6.45)

where b(t):=2ar(t)+1+λr(t)F(θ(t))b(t):=-2ar(t)+1+\frac{\lambda r(t)}{F(\theta(t))}. This leads to

E˙(t)b(t)ϵE(t),\dot{E}(t)\leq-\frac{b(t)}{\epsilon}E(t), (6.46)

as long as we can identify a,λa,\lambda so that

b(t)=2ar(t)+1+λr(t)F(θ(t))>0,\displaystyle b(t)=-2ar(t)+1+\frac{\lambda r(t)}{F(\theta(t))}>0, (6.47a)
ab(t)ϵμF(θ(t))0,\displaystyle\frac{ab(t)}{\epsilon}-\frac{\mu}{F(\theta(t))}\leq 0, (6.47b)
λb(t)+2ϵL2λ0,\displaystyle\lambda b(t)+2\epsilon L-2\lambda\leq 0, (6.47c)

for all t>0t>0.

Step 2: Admissible choice for λ\lambda and aa. Using the solution bounds rr(t)r0r^{*}\leq r(t)\leq r_{0} and FF(θ(t))F(θ0)F^{*}\leq F(\theta(t))\leq F(\theta_{0}), we have

bb(t)b,t>0,b_{*}\leq b(t)\leq b^{*},\quad\forall t>0,

where

b=b(a,λ)=2ar0+1+λrF(θ0),\displaystyle b_{*}=b_{*}(a,\lambda)=-2ar_{0}+1+\frac{\lambda r_{*}}{F(\theta_{0})},
b=b(a,λ)=2ar+1+λr0F.\displaystyle b^{*}=b^{*}(a,\lambda)=-2ar^{*}+1+\frac{\lambda r_{0}}{F^{*}}.

In order to ensure (6.47c) we must have b<2b^{*}<2. Putting these together, condition (6.47) can be ensured by the following stronger constraints:

0<b,b<2,\displaystyle 0<b_{*},\quad b^{*}<2, (6.48a)
abϵμF(θ0)0,\displaystyle ab^{*}-\frac{\epsilon\mu}{F(\theta_{0})}\leq 0, (6.48b)
λb+2ϵL2λ0.\displaystyle\lambda b^{*}+2\epsilon L-2\lambda\leq 0. (6.48c)

A direct calculation shows that, for small ϵ\epsilon, such an admissible pair exists, with a=O(ϵ)a=O(\epsilon) and λFr0\lambda\leq\frac{F^{*}}{r_{0}}, the latter induced by b<2b^{*}<2. To be more precise, we fix δ(0,1)\delta\in(0,1) and require that

bδ.b_{*}\geq\delta.

The constraint b<2b^{*}<2 would impose a lower bound on aϵa\sim\epsilon (which we should avoid in order to stay consistent with the discrete algorithm) unless λ\lambda is chosen to satisfy b+2ar2δb^{*}+2ar^{*}\leq 2-\delta. This is equivalent to requiring

λ(1δ)Fr0.\lambda\leq(1-\delta)\frac{F^{*}}{r_{0}}.

With the above two constraints, it is safe to replace bb^{*} by 2δ2-\delta in (6.48b) and (6.48c), respectively, so that

a(2δ)ϵμF(θ0),2ϵLδλ.a(2-\delta)\leq\frac{\epsilon\mu}{F(\theta_{0})},\quad 2\epsilon L\leq\delta\lambda. (6.49)

The convergence rate estimate in the next step requires aa to be as large as possible, so we simply take

a=ϵμF(θ0)(2δ).a=\frac{\epsilon\mu}{F(\theta_{0})(2-\delta)}. (6.50)

The second relation in (6.49) now reduces to

ϵδλ2L.\epsilon\leq\frac{\delta\lambda}{2L}.

Note that the constraint bδb_{*}\geq\delta is met if 2ar01δ+λrF(θ0)2ar_{0}\leq 1-\delta+\frac{\lambda r^{*}}{F(\theta_{0})}, imposing another upper bound on ϵ\epsilon. To be concrete, we take

λ=(1δ)Fr0,\lambda=(1-\delta)\frac{F^{*}}{r_{0}}, (6.51)

then (6.48) is met if

ϵmin{ϵ1,ϵ2},\epsilon\leq\min\{\epsilon_{1},\epsilon_{2}\},

with

ϵ1=δ(1δ)F2Lr0,ϵ2=12μr0(2δ)(1δ)(F(θ0)+Frr0).\epsilon_{1}=\delta(1-\delta)\frac{F^{*}}{2Lr_{0}},\quad\epsilon_{2}=\frac{1}{2\mu r_{0}}(2-\delta)(1-\delta)\bigg{(}F(\theta_{0})+\frac{F^{*}r^{*}}{r_{0}}\bigg{)}.

This is (6.39) with r0=F(θ0)r_{0}=F(\theta_{0}). Hence, for suitably small ϵ\epsilon, we have obtained a set of admissible pairs (a,λ)(a,\lambda) with

a=ϵμF(θ0)(2δ),λ=(1δ)Fr0,a=\frac{\epsilon\mu}{F(\theta_{0})(2-\delta)},\quad\lambda=(1-\delta)\frac{F^{*}}{r_{0}}, (6.52)

as δ\delta varies in (0,1)(0,1).
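As a sanity check on these formulas, one can evaluate the admissible pair and the ϵ\epsilon thresholds for concrete constants. In the sketch below, the values of μ\mu, LL, FF^{*}, F(θ0)F(\theta_{0}) and rr^{*} are placeholder assumptions chosen only for illustration; the expressions follow (6.50)–(6.52) and the formulas for ϵ1,ϵ2\epsilon_{1},\epsilon_{2} above, with r0=F(θ0)r_{0}=F(\theta_{0}).

```python
# Placeholder problem constants (assumptions, for illustration only).
mu, L = 0.5, 4.0              # PL constant and Lipschitz constant of grad f
F_star, F_theta0 = 1.0, 3.0   # F^* and F(theta_0)
r_star = 0.4                  # lower bound r^* on r(t)
r0 = F_theta0                 # r_0 = F(theta_0)
delta = 0.5                   # fixed delta in (0, 1)

# Admissible lambda (6.51) and the epsilon thresholds epsilon_1, epsilon_2.
lam = (1 - delta) * F_star / r0
eps1 = delta * (1 - delta) * F_star / (2 * L * r0)
eps2 = (2 - delta) * (1 - delta) * (F_theta0 + F_star * r_star / r0) / (2 * mu * r0)
eps = min(eps1, eps2)         # any eps <= min{eps1, eps2} is admissible

# The choice (6.50) for a, taken as large as the constraint allows.
a = eps * mu / (F_theta0 * (2 - delta))
print(f"lambda = {lam:.4f}, eps1 = {eps1:.4f}, eps2 = {eps2:.4f}, a = {a:.6f}")
```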

Step 3: Convergence rate of WW. With above choices of (a,λ)(a,\lambda), we have

E˙(t)b(t)ϵE(t)bϵE(t).\dot{E}(t)\leq-\frac{b(t)}{\epsilon}E(t)\leq-\frac{b_{*}}{\epsilon}E(t). (6.53)

This gives

E(t)E(0)e(b/ϵ)t.E(t)\leq E(0)e^{-(b_{*}/\epsilon)t}. (6.54)

This when combined with (6.40), i.e.,

E=aW+ϵW˙2r+λϵr|v|2,E=aW+\frac{\epsilon\dot{W}}{2r}+\lambda\epsilon r|v|^{2},

and E(0)=aW(0)E(0)=aW(0) (since v0=0v_{0}=0) allows us to rewrite (6.54) as

2ar(t)ϵW(t)+W˙(t)2ar(t)W(0)ϵe(b/ϵ)t.\frac{2ar(t)}{\epsilon}W(t)+\dot{W}(t)\leq\frac{2ar(t)W(0)}{\epsilon}e^{-(b_{*}/\epsilon)t}.

That is

ddt(W(t)e0t(2ar(s)/ϵ)𝑑s)2ar(t)W(0)ϵe(b/ϵ)t+0t(2ar(s)/ϵ)𝑑s.\frac{d}{dt}(W(t)e^{\int_{0}^{t}(2ar(s)/\epsilon)ds})\leq\frac{2ar(t)W(0)}{\epsilon}e^{-(b_{*}/\epsilon)t+\int_{0}^{t}(2ar(s)/\epsilon)ds}.

Integration of this gives

W(t)\displaystyle W(t) W(0)e0t(2ar(s)/ϵ)𝑑s+2aW(0)ϵe0t(2ar(s)/ϵ)𝑑s0tr(s)e(b/ϵ)s+0s(2ar(τ)/ϵ)𝑑τ𝑑s\displaystyle\leq W(0)e^{-\int_{0}^{t}(2ar(s)/\epsilon)ds}+\frac{2aW(0)}{\epsilon}e^{-\int_{0}^{t}(2ar(s)/\epsilon)ds}\int_{0}^{t}r(s)e^{-(b_{*}/\epsilon)s+\int_{0}^{s}(2ar(\tau)/\epsilon)d\tau}ds
W(0)e(2ar/ϵ)t+2ar0W(0)ϵ0te(b/ϵ)s𝑑s\displaystyle\leq W(0)e^{-(2ar^{*}/\epsilon)t}+\frac{2ar_{0}W(0)}{\epsilon}\int_{0}^{t}e^{-(b_{*}/\epsilon)s}ds
W(0)e(2ar/ϵ)t+2ar0W(0)|b|e(b/ϵ)t.\displaystyle\leq W(0)e^{-(2ar^{*}/\epsilon)t}+\frac{2ar_{0}W(0)}{|b_{*}|}e^{-(b_{*}/\epsilon)t}. (6.55)

Recall that in Step 2, aa is chosen as ϵμF(θ0)(2δ)\frac{\epsilon\mu}{F(\theta_{0})(2-\delta)} for a fixed δ(0,1)\delta\in(0,1) and ϵmin{ϵ1,ϵ2}\epsilon\leq\min\{\epsilon_{1},\epsilon_{2}\} so that bδb_{*}\geq\delta. Also using r0=F(θ0)r_{0}=F(\theta_{0}), we have

W(t)W(0)(e2μrF(θ0)(2δ)t+2μϵδ(2δ)eδϵt).W(t)\leq W(0)\bigg{(}e^{-\frac{2\mu r^{*}}{F(\theta_{0})(2-\delta)}t}+\frac{2\mu\epsilon}{\delta(2-\delta)}e^{-\frac{\delta}{\epsilon}t}\bigg{)}. (6.56)

This is (6.38) with W(t)=f(θ(t))fW(t)=f(\theta(t))-f^{*}.
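For readers who wish to see the estimate (6.56) in action, the following Python sketch integrates a one-dimensional instance of the dynamics by forward Euler and prints f(θ(t))ff(\theta(t))-f^{*} next to the bound. The form of the system used in the code (ϵv˙=v+F(θ)\epsilon\dot{v}=-v+\nabla F(\theta), r˙=2r|v|2\dot{r}=-2r|v|^{2}, θ˙=2rv\dot{\theta}=-2rv with v(0)=0v(0)=0, r(0)=F(θ0)r(0)=F(\theta_{0})) is inferred from the relations used in Step 1 and is an assumption here, as are the objective, the value of ϵ\epsilon, and the step size; the final value of r(t)r(t) serves only as a numerical proxy for rr^{*}.

```python
import numpy as np

# A simple PL objective (1-D): f(theta) = 0.5*theta^2, so mu = L = 1 and f* = 0.
f      = lambda th: 0.5 * th ** 2
grad_f = lambda th: th
c = 1.0                                    # shift so that f + c > 0
F      = lambda th: np.sqrt(f(th) + c)
grad_F = lambda th: grad_f(th) / (2.0 * F(th))

eps, dt, T = 0.05, 1e-4, 10.0              # assumed epsilon and (arbitrary) step size
theta, v = 2.0, 0.0                        # v(0) = 0, as used for E(0) = a*W(0)
r = F(theta)                               # r(0) = F(theta_0)
mu, delta = 1.0, 0.5
W0, F0 = f(theta), F(theta)

history = []
for k in range(int(T / dt)):
    # Forward-Euler step of the assumed form of (4.30):
    #   eps*v' = -v + grad F(theta),   r' = -2*r*|v|^2,   theta' = -2*r*v
    dv  = (-v + grad_F(theta)) / eps
    dr  = -2.0 * r * v ** 2
    dth = -2.0 * r * v
    v, r, theta = v + dt * dv, r + dt * dr, theta + dt * dth
    history.append(((k + 1) * dt, f(theta)))

r_star = r                                 # final r as a numerical proxy for r^*
for t, W in history[::20000]:
    bound = W0 * (np.exp(-2 * mu * r_star * t / (F0 * (2 - delta)))
                  + 2 * mu * eps / (delta * (2 - delta)) * np.exp(-delta * t / eps))
    print(f"t={t:5.2f}   f(theta)-f* = {W:9.3e}   bound (6.56) = {bound:9.3e}")
```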

Remark 6.5.

In Step 3, equation (6.55) reveals that the convergence rate is dominated by eKte^{-Kt}, where K(a,λ;ϵ):=min{2ar/ϵ,b/ϵ}K(a,\lambda;\epsilon):=\min\{2ar^{*}/\epsilon,b_{*}/\epsilon\}. Consequently, in the current methodology, determining the optimal values of a,λa,\lambda involves solving the following constrained optimization problem:

maxa,λK(a,λ;ϵ)\displaystyle\max_{a,\lambda}K(a,\lambda;\epsilon) (6.57)
s.t.\displaystyle s.t. 0<b<b<2,abϵμB0,λb+2ϵL2λ0.\displaystyle 0<b_{*}<b^{*}<2,\quad ab^{*}-\frac{\epsilon\mu}{B}\leq 0,\quad\lambda b^{*}+2\epsilon L-2\lambda\leq 0.

The constraint 0<b<b<20<b_{*}<b^{*}<2 in (6.57) forms an open set, making it natural to confine the range to δb<b2δ\delta\leq b_{*}<b^{*}\leq 2-\delta for a small δ>0\delta>0. Notably, when ϵ\epsilon is sufficiently small, the expression

K(a,λ;ϵ)=2ar/ϵK(a,\lambda;\epsilon)=2ar^{*}/\epsilon

holds true, owing to the fact that aϵμF(θ0)(2δ)a\leq\frac{\epsilon\mu}{F(\theta_{0})(2-\delta)} in Step 2. Consequently, the estimation obtained in Step 2 is nearly optimal within the confines of the present approach.
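A brute-force way to explore (6.57) is a grid search over (a,λ)(a,\lambda) for fixed ϵ\epsilon and δ\delta, keeping only pairs that satisfy the constraints. The sketch below does this for placeholder constants (all numerical values are assumptions chosen for illustration); it is not part of the analysis, but it gives a feel for how tight the admissible region is.

```python
import numpy as np

# Placeholder constants (assumptions, for illustration only).
mu, L = 0.5, 4.0
F_star, F_theta0 = 1.0, 3.0
r_star, r0 = 0.4, 3.0            # r0 = F(theta_0)
eps, delta = 0.005, 0.05         # small eps; delta confines the open constraint set

def rate_K(a, lam):
    """Return K(a, lam; eps) = min{2*a*r^*, b_*}/eps if (a, lam) is admissible, else None."""
    b_lo = -2 * a * r0 + 1 + lam * r_star / F_theta0      # b_*
    b_hi = -2 * a * r_star + 1 + lam * r0 / F_star        # b^*
    admissible = (delta <= b_lo <= b_hi <= 2 - delta
                  and a * b_hi <= eps * mu / F_theta0
                  and lam * b_hi + 2 * eps * L <= 2 * lam)
    return min(2 * a * r_star, b_lo) / eps if admissible else None

candidates = []
for a in np.linspace(1e-6, 0.005, 200):
    for lam in np.linspace(1e-3, 1.0, 200):
        K = rate_K(a, lam)
        if K is not None:
            candidates.append((K, a, lam))
print("best (K, a, lambda):", max(candidates) if candidates else "none found on this grid")
```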

7 Convergence of trajectories

In this section, we provide insights into the previously established results as stated in Theorems 5.4 and 6.1 and put forth several variations and extensions.

In Theorem 5.4, we demonstrated that F(θ(t))0\nabla F(\theta(t))\to 0 and F(θ(t))FF(\theta(t))\to F^{*}. It is reasonable to anticipate that, under appropriate conditions on FF (or ff), θ(t)\theta(t) will converge toward a single minimum point of FF. On the other hand, the PL condition (2.11) used in Theorem 6.1 pertains to the geometric attributes of ff rather than its regularity. We revisit the details of the proof of Theorem 6.1, wherein for sufficiently small ϵ\epsilon, EE is bounded below by a positive multiple of (ff)+|v|2(f-f^{*})+|v|^{2}. Consequently, the exponential decay of EE extends to |v||v|, leading to the convergence of θ(t)\theta(t) since |θ˙|=2r|v||\dot{\theta}|=2r|v|.

It is interesting to explore the minimal general condition on ff that would ensure the convergence of θ(t)\theta(t) towards a single limit point. In general, addressing this question poses significant challenges. However, we can identify a larger class of functions beyond those covered by the PL condition. It is important to note that, following Łojasiewicz’s groundbreaking work on real-analytic functions [48, 33], the key factor ensuring convergence of θ(t)\theta(t) with θ˙=f(θ)\dot{\theta}=-\nabla f(\theta) lies in the geometric properties of ff. This is exemplified by a counterexample presented by Palis and De Melo [24, p.14], highlighting that CC^{\infty} smoothness is insufficient to guarantee single limit-point convergence. These findings underscore the importance of the Łojasiewicz inequality for gradient flows. The Łojasiewicz inequality asserts that for a real-analytic function and a critical point aa there exists α(0,1)\alpha\in(0,1) such that the function |ff(a)|1α|f|1|f-f(a)|^{1-\alpha}|\nabla f|^{-1} remains bounded around aa. Such a gradient inequality has been extended by Kurdyka [27] to C1C^{1} functions whose graphs belong to an o-minimal structure, and Bolte et al. [14, 12] extended it to a broad class of nonsmooth subanalytic functions. This gradient inequality, or its generalization with a desingularizing function, is known as the Kurdyka-Łojasiewicz inequality. In the optimization literature, the KL inequality has proven to be a powerful tool for characterizing the convergence properties of iterative algorithms; refer to, for instance, [1, 6, 13, 21, 23].

For the system (4.30), we show that the Łojasiewicz inequality is sufficient to ensure the convergence of θ(t)\theta(t). Additionally, we provide convergence rates for θ(t)\theta(t) under different values of α\alpha.

Definition 7.1 (Łojasiewicz inequality).

For a differentiable function f:nf:\mathbb{R}^{n}\to\mathbb{R} with argminf{\rm argmin}f\neq\emptyset, we say ff satisfies the Łojasiewicz inequality if there exists σ>0\sigma>0, c>0c>0, and α(0,1)\alpha\in(0,1), such that the following inequality holds for any aargminfa\in{\rm argmin}f:

c|f(θ)f(a)|1α|f(θ)|,θB(a,σ),c|f(\theta)-f(a)|^{1-\alpha}\leq|\nabla f(\theta)|,\quad\forall\theta\in B(a,\sigma), (7.58)

where B(a,σ)B(a,\sigma) is the neighborhood of aa with radius σ\sigma.

Note that a diverse range of functions has been shown to satisfy (7.58), ranging from real-analytic functions [48] to non-smooth lower semi-continuous functions [14]. The PL condition (2.11) stated as a global condition corresponds to (7.58) with α=1/2\alpha=1/2 and c=2μc=\sqrt{2\mu}.
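As a simple illustration of Definition 7.1 (not part of the argument), the sketch below checks (7.58) on a grid for the one-dimensional example f(θ)=θ4f(\theta)=\theta^{4}, which satisfies the inequality near its minimizer θ=0\theta=0 with α=1/4\alpha=1/4 and c=4c=4; the choice of example, radius, and grid are assumptions made purely for the illustration.

```python
import numpy as np

# Example: f(theta) = theta^4, argmin f = {0}.  Here |f - f(0)|^{3/4} = |theta|^3 and
# |f'(theta)| = 4|theta|^3, so (7.58) holds with alpha = 1/4 and c = 4.  The grid check
# below merely illustrates the inequality; it is not a proof.
f      = lambda th: th ** 4
grad_f = lambda th: 4 * th ** 3
alpha, c, sigma = 0.25, 4.0, 1.0

thetas = np.linspace(-sigma, sigma, 2001)
lhs = c * np.abs(f(thetas)) ** (1 - alpha)     # c |f(theta) - f(a)|^{1-alpha} with f(a) = 0
rhs = np.abs(grad_f(thetas))                   # |grad f(theta)|
print("(7.58) holds on the grid:", bool(np.all(lhs <= rhs + 1e-12)))
```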

Lemma 7.2.

Let f:nf:\mathbb{R}^{n}\to\mathbb{R} satisfy the Łojasiewicz inequality (7.58), then

|F(θ)F(a)|1α|F(θ)|,θB(a,σ).\displaystyle|F(\theta)-F(a)|^{1-\alpha}\leq|\nabla F(\theta)|,\quad\forall\theta\in B(a,\sigma). (7.59)
Proof 7.3.

Using the relation F2=f+cF^{2}=f+c, inequality (7.58) reduces to

c|F(θ)+F(a)|1α|F(θ)F(a)|1α2|F(θ)||F(θ)|.c|F(\theta)+F(a)|^{1-\alpha}|F(\theta)-F(a)|^{1-\alpha}\leq 2|F(\theta)||\nabla F(\theta)|.

This leads to (7.59) if cc is chosen as 2maxθB(a,σ)|F(θ)||F+F(a)|α12\max_{\theta\in B(a,\sigma)}|F(\theta)||F+F(a)|^{\alpha-1}.

Theorem 7.4.

Consider θ(t)\theta(t) as a bounded solution of (4.30) as stated in Theorem 5.4, and further assume that FF satisfies (7.59). Then θ˙L1([0,+])\dot{\theta}\in L^{1}([0,+\infty]) and θ(t)\theta(t) converges towards a local minimum point of FF as tt\to\infty.

Proof 7.5.

Let ω(θ)\omega(\theta) be the ω\omega-limit set of θ\theta. According to Theorem 5.4, F(θ(t))FF(\theta(t))\to F^{*}, ensuring F=FF=F^{*} on ω(θ)\omega(\theta), and

limtdist(θ(t),ω(θ))=0.\lim_{t\to\infty}\text{dist}(\theta(t),\omega(\theta))=0.

The inequality (7.59) asserts the existence of T>0T>0 and α(0,1)\alpha\in(0,1) such that for any tTt\geq T,

|F(θ(t))F|1α|F(θ(t))|.|F(\theta(t))-F^{*}|^{1-\alpha}\leq|\nabla F(\theta(t))|.

The proof of the convergence of θ(t)\theta(t) relies on a novel functional

R(t)=Q(t)F+λϵ2|v˙(t)|2,R(t)=Q(t)-F^{*}+\frac{\lambda\epsilon}{2}|\dot{v}(t)|^{2},

with the parameter λ\lambda yet to be determined. A direct calculation gives

R˙\displaystyle\dot{R} =Q˙+λϵv˙v¨\displaystyle=\dot{Q}+\lambda\epsilon\dot{v}\cdot\ddot{v}
=2r|v|2(1+ϵ|v|2)+λv˙ddt(F(θ(t))v)\displaystyle=-2r|v|^{2}(1+\epsilon|v|^{2})+\lambda\dot{v}\cdot\frac{d}{dt}(\nabla F(\theta(t))-v)
=2r|v|2(1+ϵ|v|2)λ|v˙|2+λv˙D2Fθ˙\displaystyle=-2r|v|^{2}(1+\epsilon|v|^{2})-\lambda|\dot{v}|^{2}+\lambda\dot{v}\cdot D^{2}F\dot{\theta}
=2r|v|2(1+ϵ|v|2)λ|v˙|22λrv˙D2Fv\displaystyle=-2r|v|^{2}(1+\epsilon|v|^{2})-\lambda|\dot{v}|^{2}-2\lambda r\dot{v}\cdot D^{2}Fv
2r|v|2(1+ϵ|v|2)λ|v˙|2+λLFδ|v˙|2+λLFr2δ|v|2\displaystyle\leq-2r|v|^{2}(1+\epsilon|v|^{2})-\lambda|\dot{v}|^{2}+\lambda L_{F}\delta|\dot{v}|^{2}+\frac{\lambda L_{F}r^{2}}{\delta}|v|^{2}
r|v|2(2λLFr0δ)λ|v˙|2(1LFδ).\displaystyle\leq-r|v|^{2}(2-\frac{\lambda L_{F}r_{0}}{\delta})-\lambda|\dot{v}|^{2}(1-L_{F}\delta).

Here 2abδa2+b2δ-2ab\leq\delta a^{2}+\frac{b^{2}}{\delta} for any δ>0\delta>0 and D2F(θ)LF\|D^{2}F(\theta)\|\leq L_{F} were used in the first inequality. Take δ=1/(2LF)\delta=1/(2L_{F}) and λ=1/(2LF2r0)\lambda=1/(2L_{F}^{2}r_{0}), then

R˙\displaystyle\dot{R} r|v|212λ|v˙|2\displaystyle\leq-r^{*}|v|^{2}-\frac{1}{2}\lambda|\dot{v}|^{2}
=r|v|214LF2r0|v˙|2\displaystyle=-r^{*}|v|^{2}-\frac{1}{4L_{F}^{2}r_{0}}|\dot{v}|^{2}
12min{r,14LF2r0}(|v|+|v˙|)2\displaystyle\leq-\frac{1}{2}\min\{r^{*},\frac{1}{4L_{F}^{2}r_{0}}\}(|v|+|\dot{v}|)^{2}
=:C1(|v|+|v˙|)2.\displaystyle=:-C_{1}(|v|+|\dot{v}|)^{2}.

In the last inequality, we used the inequality (a+b)22(a2+b2)(a+b)^{2}\leq 2(a^{2}+b^{2}). Conversely,

R(t)\displaystyle R(t) =F(θ(t))F+ϵr|v|2+ϵ4LF2r0|v˙|2\displaystyle=F(\theta(t))-F^{*}+\epsilon r|v|^{2}+\frac{\epsilon}{4L_{F}^{2}r_{0}}|\dot{v}|^{2}
C2(|F(θ(t))F|+|v|2+|v˙|2),\displaystyle\leq C_{2}(|F(\theta(t))-F^{*}|+|v|^{2}+|\dot{v}|^{2}),

where C2=max{1,ϵr0,ϵ4LF2r0}.C_{2}=\max\{1,\epsilon r_{0},\frac{\epsilon}{4L_{F}^{2}r_{0}}\}. We proceed by distinguishing two cases:
(i) α(0,1/2]\alpha\in(0,1/2]. Using the inequality (a+b)1αa1α+b1α(a+b)^{1-\alpha}\leq a^{1-\alpha}+b^{1-\alpha}, we obtain

R(t)1αC21α(|F(θ(t))F|1α+|v|2(1α)+|v˙|2(1α)).R(t)^{1-\alpha}\leq C_{2}^{1-\alpha}(|F(\theta(t))-F^{*}|^{1-\alpha}+|v|^{2(1-\alpha)}+|\dot{v}|^{2(1-\alpha)}).

Using (7.59), for tTt\geq T,

R(t)1αC21α(|F(θ(t))|+|v|2(1α)+|v˙|2(1α)).R(t)^{1-\alpha}\leq C_{2}^{1-\alpha}(|\nabla F(\theta(t))|+|v|^{2(1-\alpha)}+|\dot{v}|^{2(1-\alpha)}).

Since |v(t)|,|v˙(t)|0|v(t)|,|\dot{v}(t)|\to 0 as tt\to\infty and 2(1α)12(1-\alpha)\geq 1,

R(t)1α\displaystyle R(t)^{1-\alpha} C21α(|F(θ(t))|+|v(t)|+|v˙(t)|)\displaystyle\leq C_{2}^{1-\alpha}(|\nabla F(\theta(t))|+|v(t)|+|\dot{v}(t)|)
C21α(|v+ϵv˙|+|v(t)|+|v˙(t)|)\displaystyle\leq C_{2}^{1-\alpha}(|v+\epsilon\dot{v}|+|v(t)|+|\dot{v}(t)|)
C21α(2|v(t)|+(ϵ+1)|v˙(t)|)\displaystyle\leq C_{2}^{1-\alpha}(2|v(t)|+(\epsilon+1)|\dot{v}(t)|)
C3(|v(t)|+|v˙(t)|),\displaystyle\leq C_{3}(|v(t)|+|\dot{v}(t)|), (7.60)

where C3=C21αmax{2,ϵ+1}C_{3}=C_{2}^{1-\alpha}\max\{2,\epsilon+1\}.

We are prepared to prove that θ˙\dot{\theta} belongs to L1([0,])L^{1}([0,\infty]) by considering tTt\geq T:

ddtR(t)α\displaystyle-\frac{d}{dt}R(t)^{\alpha} =αR(t)α1R˙=αR˙R1α\displaystyle=-\alpha R(t)^{\alpha-1}\cdot\dot{R}=\alpha\frac{-\dot{R}}{R^{1-\alpha}}
αC1(|v|+|v˙|)2C3(|v|+|v˙|)\displaystyle\geq\alpha\frac{C_{1}(|v|+|\dot{v}|)^{2}}{C_{3}(|v|+|\dot{v}|)}
=1C(|v|+|v˙|),C:=C3αC1.\displaystyle=\frac{1}{C}(|v|+|\dot{v}|),\quad C:=\frac{C_{3}}{\alpha C_{1}}.

Integrating both sides from TT to \infty yields:

T(|v|+|v˙|)𝑑tC(R(T)αlimtR(t)α)=CR(T)α.\int_{T}^{\infty}(|v|+|\dot{v}|)dt\leq C(R(T)^{\alpha}-\lim_{t\to\infty}R(t)^{\alpha})=CR(T)^{\alpha}. (7.61)

Here, we take into account that limtR(t)α=0\lim_{t\to\infty}R(t)^{\alpha}=0. Note that |θ˙|=2r|v|2r0|v||\dot{\theta}|=2r|v|\leq 2r_{0}|v|, hence

T|θ˙|𝑑t2r0T|v|𝑑t2r0CR(T)α2r0CR(0)α.\int_{T}^{\infty}|\dot{\theta}|dt\leq 2r_{0}\int_{T}^{\infty}|v|dt\leq 2r_{0}CR(T)^{\alpha}\leq 2r_{0}CR(0)^{\alpha}.

Thus θ˙\dot{\theta} belongs to L1([0,])L^{1}([0,\infty]), implying that, θ:=limtθ(t)\theta^{*}:=\lim_{t\to\infty}\theta(t) exists, and F(θ)=FF(\theta^{*})=F^{*}.
(ii) For α(1/2,1)\alpha\in(1/2,1), we set α=α/2(0,1/2)\alpha^{\prime}=\alpha/2\in(0,1/2). With this adjustment, (7.59) remains valid for α\alpha^{\prime} since

|FF|1α|FF|1α|F-F^{*}|^{1-\alpha^{\prime}}\leq|F-F^{*}|^{1-\alpha}

for |FF||F-F^{*}| suitably small. Also, (7.60) holds with α\alpha replaced by α\alpha^{\prime}. Thus, we conclude the convergence of θ(t)\theta(t) in this case. Moreover,

T(|v|+|v˙|)𝑑tCR(T)α/2\int_{T}^{\infty}(|v|+|\dot{v}|)dt\leq CR(T)^{\alpha/2}

for some constant CC.

Before getting into the estimation of the rate of convergence, let us define

u(t):=t(|v(s)|+|v˙(s)|)𝑑s.u(t):=\int_{t}^{\infty}(|v(s)|+|\dot{v}(s)|)ds. (7.62)

This expression serves to control the tail length function for both θ(t)\theta(t) and v(t)v(t). In fact,

|θ(t)θ|\displaystyle|\theta(t)-\theta^{*}| t|θ˙(s)|𝑑s=t2r(s)|v(s)|𝑑s2r0t|v(s)|𝑑s2r0u(t).\displaystyle\leq\int^{\infty}_{t}|\dot{\theta}(s)|ds=\int^{\infty}_{t}2r(s)|v(s)|ds\leq 2r_{0}\int^{\infty}_{t}|v(s)|ds\leq 2r_{0}u(t). (7.63)

From (7.61) in the proof of Theorem 7.4, we see that

u(T)CR(T)β,β={α,if 0<α12,α/2,if 12<α<1.u(T)\leq CR(T)^{\beta},\quad\beta=\begin{cases}\alpha,&\text{if $0<\alpha\leq\frac{1}{2}$},\\ \alpha/2,&\text{if $\frac{1}{2}<\alpha<1$}.\end{cases}

This inequality remains true for every tTt\geq T, thus

u(t)CR(t)βtT.\displaystyle u(t)\leq CR(t)^{\beta}\quad\forall t\geq T. (7.64)

We now present the following result.

Theorem 7.6.

Under the same conditions as in Theorem 7.4, there exists TT such that for any tTt\geq T, the following results hold:

  1. (1)

    If α=12\alpha=\frac{1}{2}, then

    |θ(t)θ|C(T)eKt;|\theta(t)-\theta^{*}|\leq C(T)e^{-Kt};
  2. (2)

    If 0<α<120<\alpha<\frac{1}{2}, then

    |θ(t)θ|C(T)(t+1)α12α;|\theta(t)-\theta^{*}|\leq C(T)(t+1)^{-\frac{\alpha}{1-2\alpha}};
  3. (3)

    If 12<α<1\frac{1}{2}<\alpha<1, then

    |θ(t)θ|C(T)(t+1)α22α,|\theta(t)-\theta^{*}|\leq C(T)(t+1)^{-\frac{\alpha}{2-2\alpha}},

where C(T)C(T) depends on TT and K>0K>0 is independent of TT.
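These polynomial rates can also be observed numerically. The sketch below integrates a one-dimensional instance of the dynamics (the same assumed form ϵv˙=v+F(θ)\epsilon\dot{v}=-v+\nabla F(\theta), r˙=2r|v|2\dot{r}=-2r|v|^{2}, θ˙=2rv\dot{\theta}=-2rv as before) for f(θ)=θ4f(\theta)=\theta^{4}, whose Łojasiewicz exponent near θ=0\theta^{*}=0 is α=1/4\alpha=1/4, and compares |θ(t)||\theta(t)| with the predicted order (t+1)α12α=(t+1)1/2(t+1)^{-\frac{\alpha}{1-2\alpha}}=(t+1)^{-1/2}. All numerical parameters are illustrative assumptions, and only the decay exponent, not the constant, is expected to match.

```python
import numpy as np

# Toy objective with Lojasiewicz exponent alpha = 1/4 at theta* = 0.
f      = lambda th: th ** 4
grad_f = lambda th: 4 * th ** 3
c = 1.0                                    # shift so that F = sqrt(f + c) > 0
F      = lambda th: np.sqrt(f(th) + c)
grad_F = lambda th: grad_f(th) / (2.0 * F(th))

eps, dt = 0.05, 1e-3                       # assumed epsilon and step size
theta, v = 1.0, 0.0
r = F(theta)
alpha = 0.25
checkpoints = {10.0, 100.0, 1000.0}

for k in range(int(1000.0 / dt)):
    # Forward-Euler step of the assumed form of (4.30).
    dv  = (-v + grad_F(theta)) / eps
    dr  = -2.0 * r * v ** 2
    dth = -2.0 * r * v
    v, r, theta = v + dt * dv, r + dt * dr, theta + dt * dth
    t = (k + 1) * dt
    if round(t, 6) in checkpoints:
        predicted = (t + 1) ** (-alpha / (1 - 2 * alpha))
        print(f"t={t:7.1f}   |theta(t)| = {abs(theta):.4e}   (t+1)^(-1/2) = {predicted:.4e}")
```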

Proof 7.7.

Let B(θ,σ)B(\theta^{*},\sigma) be a neighborhood of θ\theta^{*} in which the Łojasiewicz inequality holds. Since θ(t)\theta(t) converges to θ\theta^{*}, there exists T>0T>0 such that θ(t)B(θ,σ)\theta(t)\in B(\theta^{*},\sigma) for every tTt\geq T. In particular, (7.64) holds. It suffices to consider β=α(0,1/2]\beta=\alpha\in(0,1/2]:

u(t)\displaystyle u(t) CR(t)α=C(R(t)1α)α1α\displaystyle\leq CR(t)^{\alpha}=C(R(t)^{1-\alpha})^{\frac{\alpha}{1-\alpha}}
C(C3(|v(t)|+|v˙(t)|))α1α=CC3α1α(u˙(t))α1α,\displaystyle\leq C(C_{3}(|v(t)|+|\dot{v}(t)|))^{\frac{\alpha}{1-\alpha}}=CC_{3}^{\frac{\alpha}{1-\alpha}}(-\dot{u}(t))^{\frac{\alpha}{1-\alpha}},

where C3C_{3} was defined in (7.60). The above can be rearranged as

u˙(t)Ku(t)1αα,\dot{u}(t)\leq-Ku(t)^{\frac{1-\alpha}{\alpha}}, (7.65)

where K=(C1ααC3)1K=(C^{\frac{1-\alpha}{\alpha}}C_{3})^{-1}. Next we define an auxiliary comparison problem of the form

z˙(t)=Kz(t)1αα,z(T)=u(T),\dot{z}(t)=-Kz(t)^{\frac{1-\alpha}{\alpha}},\quad z(T)=u(T), (7.66)

so that u(t)z(t)u(t)\leq z(t) by comparison; hence by (7.63) we obtain |θ(t)θ|2r0z(t)|\theta(t)-\theta^{*}|\leq 2r_{0}z(t).

The solution of z(t)z(t) can be determined for two cases:
(1) α=12\alpha=\frac{1}{2}. In this case, (7.66) takes the form z˙=Kz\dot{z}=-Kz and admits the unique solution

z(t)=u(T)eKTeKt.z(t)=u(T)e^{KT}e^{-Kt}.

(2) 0<α<120<\alpha<\frac{1}{2}. The exact solution in such case can be obtained as

z(t)=(u(T)21/α+K(1α2)(tT))α12αz(t)=\left(u(T)^{2-1/\alpha}+K(\frac{1}{\alpha}-2)(t-T)\right)^{-\frac{\alpha}{1-2\alpha}}

which is bounded by C(T)(t+1)α12αC(T)(t+1)^{-\frac{\alpha}{1-2\alpha}}.
(3) 12<α<1\frac{1}{2}<\alpha<1. This case reduces to case (2) by simply replacing α\alpha by α/2\alpha/2 in the obtained convergence rate for (2). The proof is complete.
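The closed-form expression in case (2) can be cross-checked against a direct numerical integration of (7.66). The sketch below does so for arbitrary illustrative values of α\alpha, KK, and u(T)u(T); these numbers are assumptions, not quantities from the proof.

```python
# Check the closed-form solution of (7.66) for 0 < alpha < 1/2 against forward-Euler integration.
alpha, K, uT, T = 0.25, 2.0, 1.0, 0.0       # illustrative values (assumptions)
p = (1 - alpha) / alpha                     # exponent in z' = -K z^p

def z_exact(t):
    # z(t) = (u(T)^{2 - 1/alpha} + K*(1/alpha - 2)*(t - T))^{-alpha/(1 - 2*alpha)}
    return (uT ** (2 - 1 / alpha) + K * (1 / alpha - 2) * (t - T)) ** (-alpha / (1 - 2 * alpha))

z, dt, t_end = uT, 1e-5, T + 5.0
for _ in range(int((t_end - T) / dt)):      # integrate z' = -K z^p on [T, t_end]
    z += dt * (-K * z ** p)
print(f"numerical z({t_end}) = {z:.6f}    closed form = {z_exact(t_end):.6f}")
```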

Remark 7.8.

The convergence rate in (3) is suboptimal as it relies on (7.64) with β=α/2\beta=\alpha/2 for 1/2<α<11/2<\alpha<1. In this scenario, the full potential of the Łojasiewicz inequality is not used in the proof of Theorem 7.4. Had (7.64) held with β=α\beta=\alpha in such cases, it would have led to z(t)=0z(t)=0 for some finite tt, indicating a faster, finite-time convergence. This assertion is demonstrated in [14, Theorem 4.7] for the case θ˙=f(θ)\dot{\theta}=-\nabla f(\theta).

Remark 7.9.

The exponential convergence rate in (1) aligns with the result obtained in Theorem 6.1, though the constant KK obtained here may not be optimally sharp.

Remark 7.10.

A counterexample by Baillon [9] for the steepest descent equation θ˙+f(θ)=0\dot{\theta}+\nabla f(\theta)=0 suggests that convexity alone is likely not sufficient for the trajectories of (1.4) to converge strongly. In the case of a general convex function ff, one may instead aim to demonstrate the weak convergence of trajectories, following the approach outlined in [4, Theorem 5.1] for a dissipative dynamical system with Hessian-driven damping. It is important to note that the condition (7.58) is insufficient to imply convexity of the function. This is exemplified by f(θ)=|θ2sin(θ1)|1/αf(\theta)=|\theta_{2}-\sin(\theta_{1})|^{1/\alpha} with α(0,1)\alpha\in(0,1). Although this function satisfies (7.58) with c=1/αc=1/\alpha, it is not convex. Furthermore, the set of minimizers of ff, characterized by the graph of θ2=sinθ1\theta_{2}=\sin\theta_{1}, is also non-convex.

8 Discussion

This paper explores a novel energy-adaptive gradient algorithm with momentum, termed AGEM. Empirically, we observe that the inclusion of momentum significantly accelerates convergence, particularly in non-convex settings. Theoretical investigations reveal that AGEM retains the unconditional energy stability property of AEGD. Additionally, we establish convergence to critical points for general non-convex objective functions when step sizes are suitably small.

To capture the dynamical behavior of AGEM, we derive a high resolution ODE system that preserves the momentum effect. This system, featuring infinitely many steady states, poses challenges in analysis. Nonetheless, leveraging various analytical tools, we successfully establish a series of results: (i) establishing global well-posedness through a construction of a suitable Lyapunov function; (ii) characterizing the time-asymptotic behavior of the solution towards critical points, using the LaSalle invariance principle; (iii) deriving a linear convergence rate for non-convex objective functions that satisfy the PL condition, and (iv) demonstrating the finite length of θ(t)\theta(t) and its convergence to the minimum of ff based on the Łojasiewicz inequality.

We anticipate that the analytical framework developed in this article can be extended to study other variants of AEGD. As an illustration, the associated system for scheme (2.7), which takes the form

ϵm˙\displaystyle\epsilon\dot{m} =m+f(θ),\displaystyle=-m+\nabla f(\theta),
v\displaystyle v =m2F(θ)(1et/ϵ),\displaystyle=\frac{m}{2F(\theta)(1-e^{-t/\epsilon})},
r˙\displaystyle\dot{r} =2r|v|2,\displaystyle=-2r|v|^{2},
θ˙\displaystyle\dot{\theta} =2rv\displaystyle=-2rv

is anticipated to exhibit analogous solution properties. In the context of large-scale optimization problems, the practical preference often leans towards a stochastic version of AGEM. Consequently, it becomes intriguing to investigate the convergence behavior of AGEM in the stochastic setting.
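A minimal numerical sketch of this associated system is given below (one-dimensional toy objective, forward Euler, arbitrary ϵ\epsilon and step size; all numerical choices are assumptions made for illustration). Taking m(0)=0m(0)=0 keeps vv finite as t0+t\to 0^{+}, since mm and the factor 1et/ϵ1-e^{-t/\epsilon} vanish at the same rate.

```python
import numpy as np

# One-dimensional toy objective; F = sqrt(f + c) with c chosen so that f + c > 0.
f      = lambda th: 0.5 * th ** 2
grad_f = lambda th: th
c = 1.0
F = lambda th: np.sqrt(f(th) + c)

eps, dt, T = 0.05, 1e-4, 5.0
theta, m = 2.0, 0.0                         # m(0) = 0
r = F(theta)                                # r(0) = F(theta_0)

for k in range(int(T / dt)):
    t = (k + 1) * dt
    m += dt * (-m + grad_f(theta)) / eps                     # eps*m' = -m + grad f(theta)
    v = m / (2.0 * F(theta) * (1.0 - np.exp(-t / eps)))      # v = m / (2 F(theta) (1 - e^{-t/eps}))
    r += dt * (-2.0 * r * v ** 2)                            # r' = -2 r |v|^2
    theta += dt * (-2.0 * r * v)                             # theta' = -2 r v
print(f"f(theta(T)) = {f(theta):.3e},  r(T) = {r:.4f}")
```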

Acknowledgements. This research was partially supported by the National Science Foundation under Grant DMS1812666.

References

  • [1] Pierre-Antoine Absil, Robert Mahony, and Benjamin Andrews, Convergence of the iterates of descent methods for analytic cost functions, SIAM Journal on Optimization 16 (2005), no. 2, 531–547.
  • [2] Zeyuan Allen-Zhu, Katyusha: The first direct acceleration of stochastic gradient methods, Journal of Machine Learning Research 18 (2018), no. 221, 1–51.
  • [3] Felipe Alvarez, On the minimizing property of a second order dissipative system in Hilbert spaces, SIAM J. Control and Optimization 38 (1998), 1102–1119.
  • [4] Felipe Alvarez, Hedy Attouch, Jérôme Bolte, and Patrick Redont, A second-order gradient-like dissipative dynamical system with Hessian-driven damping.: Application to optimization and mechanics, Journal de mathématiques pures et appliquées 81 (2002), no. 8, 747–779.
  • [5] Vassilis Apidopoulos, Nicolò Ginatta, and Silvia Villa, Convergence rates for the heavy-ball continuous dynamics for non-convex optimization, under Polyak–Łojasiewicz condition, Journal of Global Optimization 84 (2022), no. 3, 563–589.
  • [6] Hedy Attouch and Jérôme Bolte, On the convergence of the proximal algorithm for nonsmooth functions involving analytic features, Mathematical Programming 116 (2009), no. 1, 5–16.
  • [7] Hedy Attouch, Zaki Chbani, Juan Peypouquet, and Patrick Redont, Fast convergence of inertial dynamics and algorithms with asymptotic vanishing viscosity, Mathematical Programming 168 (2018), 123–175.
  • [8] Hedy Attouch, Xavier Goudou, and Patrick Redont, The heavy ball with friction method, I. the continuous dynamical system: global exploration of the local minima of a real-valued function by asymptotic analysis of a dissipative dynamical system, Communications in Contemporary Mathematics 2 (2000), no. 01, 1–34.
  • [9] JB Baillon, Un exemple concernant le comportement asymptotique de la solution du problème dudt+ϑ(μ)0dudt+\partial\vartheta(\mu)\ni 0, Journal of Functional Analysis 28 (1978), no. 3, 369–376.
  • [10] Raef Bassily, Mikhail Belkin, and Siyuan Ma, On exponential convergence of SGD in non-convex over-parametrized learning, arXiv preprint arXiv:1811.02564 (2018).
  • [11] Michael Betancourt, Michael I Jordan, and Ashia C Wilson, On symplectic optimization, arXiv preprint arXiv:1802.03653 (2018).
  • [12] Jérôme Bolte, Aris Daniilidis, Olivier Ley, and Laurent Mazet, Characterizations of Łojasiewicz inequalities: subgradient flows, talweg, convexity, Transactions of the American Mathematical Society 362 (2010), no. 6, 3319–3363.
  • [13] Jérôme Bolte, Shoham Sabach, and Marc Teboulle, Proximal alternating linearized minimization for nonconvex and nonsmooth problems, Mathematical Programming 146 (2014), no. 1, 459–494.
  • [14] Jérôme Bolte, Aris Daniilidis, and Adrian Lewis, The Łojasiewicz inequality for nonsmooth subanalytic functions with applications to subgradient dynamical systems, SIAM Journal on Optimization 17 (2007), no. 4, 1205–1223.
  • [15] Léon Bottou, Frank E. Curtis, and Jorge Nocedal, Optimization methods for large-scale machine learning, SIAM Review 60(2) (2018), 223–311.
  • [16] Camille Castera, Jérôme Bolte, Cédric Févotte, and Edouard Pauwels, An inertial Newton algorithm for deep learning, Journal of Machine Learning Research 22 (2021), 1–31.
  • [17] Zachary Charles and Dimitris Papailiopoulos, Stability and generalization of learning algorithms that converge to global optima, International conference on machine learning, 2018, pp. 745–754.
  • [18] Andre Belotto da Silva and Maxime Gazeau, A general system of differential equations to model first-order adaptive algorithms, Journal of Machine Learning Research 21 (2020), no. 129, 1–42.
  • [19] Simon S. Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh, Gradient descent provably optimizes over-parameterized neural networks, International Conference on Learning Representations, 2019.
  • [20] Guilherme França, Jeremias Sulam, Daniel Robinson, and René Vidal, Conformal symplectic and relativistic optimization, Advances in Neural Information Processing Systems 33 (2020), 16916–16926.
  • [21] Pierre Frankel, Guillaume Garrigos, and Juan Peypouquet, Splitting methods with variable metric for Kurdyka–Łojasiewicz functions and general convergence rates, Journal of Optimization Theory and Applications 165 (2015), no. 3, 874–900.
  • [22] Tianxiang Gao, Hailiang Liu, Jia Liu, Hridesh Rajan, and Hongyang Gao, A global convergence theory for deep ReLU implicit networks via over-parameterization, International Conference on Learning Representations, 2021.
  • [23] Guillaume Garrigos, Lorenzo Rosasco, and Silvia Villa, Convergence of the forward-backward algorithm: beyond the worst-case with the help of geometry, Mathematical Programming (2022), 1–60.
  • [24] Jacob Palis Junior and Welington De Melo, Geometric theory of dynamical systems: an introduction, Springer-Verlag, 1982.
  • [25] Hamed Karimi, Julie Nutini, and Mark Schmidt, Linear convergence of gradient and proximal-gradient methods under the Polyak-Łojasiewicz condition, Machine Learning and Knowledge Discovery in Database, Springer, 2016, pp. 795–811.
  • [26] Diederik Kingma and Jimmy Ba, Adam: A method for stochastic optimization, International Conference on Learning Representations (ICLR), 2014.
  • [27] Krzysztof Kurdyka, On gradients of functions definable in o-minimal structures, Annales de l’institut Fourier, vol. 48, 1998, pp. 769–783.
  • [28] J. LaSalle, Some extensions of Lyapunov’s second method, IRE Transactions on Circuit Theory 7 (1960), no. 4, 520–527.
  • [29] Chaoyue Liu, Libin Zhu, and Mikhail Belkin, Loss landscapes and optimization in over-parameterized non-linear systems and neural networks, Applied and Computational Harmonic Analysis (2022).
  • [30] Hailiang Liu and Xuping Tian, An adaptive gradient method with energy and momentum, Annals of Applied Mathematics 38 (2022), no. 2, 183–222.
  • [31] Hailiang Liu and Xuping Tian, Aegd: adaptive gradient descent with energy, Numerical Algebra, Control and Optimization (2023).
  • [32]  , SGEM: stochastic gradient with energy and momentum, Numerical Algorithms (2023), 1–28.
  • [33] S. Łojasiewicz, Sur les trajectoires de gradient d’une fonction analytique, Seminari di Geometria 1982-1983 (lecture notes), Dipartemento di Matematica, Universita di Bologna (1984), 115–117.
  • [34] Chao Ma, Lei Wu, and E Weinan, A qualitative study of the dynamic behavior for adaptive gradient algorithms, Mathematical and Scientific Machine Learning, 2022, pp. 671–692.
  • [35] Chris J Maddison, Daniel Paulin, Yee Whye Teh, Brendan O’Donoghue, and Arnaud Doucet, Hamiltonian descent methods, arXiv preprint arXiv:1809.05042 (2018).
  • [36] Michael Muehlebach and Michael Jordan, A dynamical systems perspective on nesterov acceleration, International Conference on Machine Learning, 2019, pp. 4656–4662.
  • [37] Yurii Nesterov, A method of solving a convex programming problem with convergence rate O(1/k2)(1/k^{2}), Doklady Akademii Nauk, vol. 269, Russian Academy of Sciences, 1983, pp. 543–547.
  • [38]  , Introductory lectures on convex optimization: A basic course, vol. 87, Springer Science & Business Media, 2013.
  • [39] Boris Polyak, Gradient methods for the minimisation of functionals, USSR Computational Mathematics and Mathematical Physics 3 (1963), 864–878.
  • [40] Boris Polyak, Some methods of speeding up the convergence of iteration methods, Ussr computational mathematics and mathematical physics 4 (1964), no. 5, 1–17.
  • [41] Bin Shi, Simon S Du, Michael I Jordan, and Weijie J Su, Understanding the acceleration phenomenon via high-resolution differential equations, Mathematical Programming (2021), 1–70.
  • [42] Mahdi Soltanolkotabi, Adel Javanmard, and Jason D Lee, Theoretical insights into the optimization landscape of over-parameterized shallow neural networks, IEEE Transactions on Information Theory 65 (2018), no. 2, 742–769.
  • [43] Weijie Su, Stephen Boyd, and Emmanuel Candes, A differential equation for modeling Nesterov’s accelerated gradient method: Theory and insights, Advances in Neural Information Processing Systems 3 (2015).
  • [44] Tijmen Tieleman, Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude, COURSERA: Neural networks for machine learning 4 (2012), no. 2, 26.
  • [45] Andre Wibisono, Ashia C Wilson, and Michael I Jordan, A variational perspective on accelerated methods in optimization, proceedings of the National Academy of Sciences 113 (2016), no. 47, E7351–E7358.
  • [46] Xiaofeng Yang, Linear, first and second-order, unconditionally energy stable numerical schemes for the phase field model of homopolymer blends, Journal of Computational Physics 327 (2016), 294–316.
  • [47] Jia Zhao, Qi Wang, and Xiaofeng Yang, Numerical approximations for a phase field dendritic crystal growth model based on the invariant energy quadratization approach, International Journal for Numerical Methods in Engineering 110 (2017), no. 3, 279–300.
  • [48] S. Łojasiewicz, A topological property of real analytic subsets (in French), Coll. du CNRS, Les équations aux derivéés partielles (1963), 87–89.