
Dynamic behavior for a gradient algorithm with energy and momentum

Hailiang Liu Department of Mathematics, Iowa State University, Ames, IA, USA, ([email protected]).    Xuping Tian Department of Mathematics, Iowa State University, Ames, IA, USA, ([email protected]).
Abstract

This paper investigates a novel gradient algorithm, AGEM, using both energy and momentum, for addressing general non-convex optimization problems. The solution properties of the AGEM algorithm, including aspects such as uniform boundedness and convergence to critical points, are examined. The dynamic behavior is studied through a comprehensive analysis of a high-resolution ODE system. This ODE system, being nonlinear, is derived by taking the limit of the discrete scheme while preserving the momentum effect through a rescaling of the momentum parameter. The paper emphasizes the global well-posedness of the ODE system and the time-asymptotic convergence of solution trajectories. Furthermore, we establish a linear convergence rate for objective functions that adhere to the Polyak-Łojasiewicz condition.

keywords:
gradient algorithm; energy adaptive; dynamical systems; PL condition
{AMS}

34D05; 65K10

1 Introduction

In this paper, we consider the nonconvex optimization problem,

minθnf(θ),\min_{\theta\in\mathbb{R}^{n}}f(\theta), (1.1)

where the differentiable objective function f:nf:\mathbb{R}^{n}\to\mathbb{R} is assumed to be bounded from below, i.e., f+c>0f+c>0 for some cc\in\mathbb{R}. Applications like training machine learning models often involve large-scale problems with non-convex objective functions represented as a finite sum over terms associated with individual data. In such settings, first-order gradient methods such as gradient descent (GD) and its variants are commonly employed, due to their computational efficiency and satisfactory performance [15].

Despite their wide usage, GDs face a potential limitation on step size, attributed to the conditional stability of GD. This limitation can significantly impede convergence, especially in large-scale machine learning applications. This same challenge also persists for the stochastic approximation of GD, known as Stochastic Gradient Descent (SGD). Consequently, there has been much effort directed towards enhancing the convergence of first-order gradient methods. Among these efforts, adaptive step size [44, 26] and momentum [40] emerge as two widely-used techniques, aiming to address the aforementioned limitations and improve the efficiency of optimization algorithms in practice.

To overcome the issue of step size limitations, the authors in [31, 32] introduced an adaptive gradient descent with energy (AEGD). The algorithm starts from θ0\theta_{0} and r0=F(θ0)r_{0}=F(\theta_{0}), with F(θ)=f(θ)+cF(\theta)=\sqrt{f(\theta)+c}, and inductively defines

vk\displaystyle v_{k} =F(θk),\displaystyle=\nabla F(\theta_{k}), (1.2a)
rk+1\displaystyle r_{k+1} =rk1+2η|vk|2,\displaystyle=\frac{r_{k}}{1+2\eta|v_{k}|^{2}}, (1.2b)
θk+1\displaystyle\theta_{k+1} =θk2ηrk+1vk,\displaystyle=\theta_{k}-2\eta r_{k+1}v_{k}, (1.2c)

where cc\in\mathbb{R} is chosen so that infθΘ(f(θ)+c)>0\inf\limits_{\theta\in\Theta}\left(f(\theta)+c\right)>0. In sharp contrast to GD, energy-adaptive gradient methods exhibit unconditional energy stability, regardless of the base step size η>0\eta>0. Additionally, findings in [31] indicate that the parameter cc has minimal impact on the performance of (1.2).

AEGD can be extended to include momentum for accelerated convergence  [32]. In this study, we explore a variant of AEGD with momentum, referred to as AGEM. It is defined inductively as follows:

vk\displaystyle v_{k} =βvk1+(1β)F(θk),v1=𝟎,\displaystyle=\beta v_{k-1}+(1-\beta)\nabla F(\theta_{k}),\quad v_{-1}=\bf{0}, (1.3a)
rk+1\displaystyle r_{k+1} =rk1+2η|vk|2,\displaystyle=\frac{r_{k}}{1+2\eta|v_{k}|^{2}}, (1.3b)
θk+1\displaystyle\theta_{k+1} =θk2ηrk+1vk,\displaystyle=\theta_{k}-2\eta r_{k+1}v_{k}, (1.3c)

where the momentum parameter β[0,1)\beta\in[0,1) is a crucial element, and setting β=0\beta=0 reduces it to the original AEGD in (1.2). Unlike GD, all these energy-adaptive gradient methods exhibit unconditional energy stability, irrespective of the base step size η\eta. The excellent performance of AEGD-type schemes has been demonstrated across various optimization tasks [31, 32]. On the theoretical front, the convergence rates obtained in [31] depend on krkkr_{k}, rather than solely on kk. This dependence on the norm of energy could pose challenges when the energy variable decays too rapidly, potentially impacting the convergence behavior. Addressing this issue represents a main challenge in achieving a better understanding of the energy-adaptive gradient methods.
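To make the update rule concrete, the following minimal Python sketch implements one AGEM iteration of (1.3); the gradient routine grad_F, the default parameter values, and the driver code are illustrative assumptions rather than part of the algorithm's specification. Setting beta = 0 recovers AEGD (1.2).

import numpy as np

def agem_step(theta, v, r, grad_F, eta=1e-3, beta=0.9):
    # One AGEM iteration (1.3); beta = 0 recovers AEGD (1.2).
    g = grad_F(theta)                         # \nabla F(theta_k), with F = sqrt(f + c)
    v = beta * v + (1.0 - beta) * g           # momentum average of gradients of F
    r = r / (1.0 + 2.0 * eta * np.dot(v, v))  # energy update; r is strictly decreasing
    theta = theta - 2.0 * eta * r * v         # parameter update
    return theta, v, r

# Initialization per (1.3): v_{-1} = 0 and r_0 = F(theta_0).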

On the other hand, a link between ODEs and numerical optimization can be established by considering infinitesimally small step sizes, so that the trajectory or solution path converges to a curve modeled by an ODE. Leveraging the well-established theory and dynamic properties of ODEs can yield profound insights into optimization processes, as exemplified in [3, 8, 5, 43].

The motivation for this study stems from our two recent papers [32, 30], wherein we successfully demonstrated the convergence of AEGD with momentum through comprehensive experiments. These experiments showcased its superior performance across stochastic and non-stochastic scenarios. Our aim in the current work is to deepen our understanding of AGEM, as expressed in equation (1.3), specifically focusing on elements such as the momentum effect, dynamics of the energy variable, and convergence rates. By delving into its continuous formulation, we seek to gain additional insights that will guide the implementation of the algorithm and further enhance its effectiveness. Specifically, we derive an ODE system for (1.3), representing the precise limit of (1.3) when employing infinitesimally small step sizes while maintaining ϵ=βη/(1β)\epsilon=\beta\eta/(1-\beta) as a constant. This ODE system, featuring an unknown vector U:=(v,r,θ)U:=(v,r,\theta), is expressed as

ϵv˙\displaystyle\epsilon\dot{v} =v+F(θ),\displaystyle=-v+\nabla F(\theta), (1.4a)
r˙\displaystyle\dot{r} =2r|v|2,\displaystyle=-2r|v|^{2}, (1.4b)
θ˙\displaystyle\dot{\theta} =2rv,\displaystyle=-2rv, (1.4c)

for t>0t>0, with initial conditions U(0)=(0,F(θ0),θ0)U(0)=(0,F(\theta_{0}),\theta_{0}); here, θ0\theta_{0} is the starting point in the AGEM scheme, and U˙\dot{U} denotes the time derivative. Notably, this study represents the pioneering effort to model energy-adaptive schemes using ODE systems. This linkage enables the examination of the convergence properties of (1.3) from the standpoint of continuous dynamics, particularly for general smooth objective functions.

For a gradient method, the geometric properties of the objective function ff often play a crucial role in determining the convergence and convergence rates. In the case of non-convex ff, we consider an established condition originally introduced by Polyak [39], who showed its sufficiency for GD to achieve a linear convergence rate. A recent convergence study under this condition is presented in [25], referring to this condition as the Polyak-Łojasiewicz (PL) inequality. This designation originates from its connection to the gradient inequality introduced by Łojasiewicz in 1963 [48]. Observations regarding optimization-related dynamical systems have suggested that, in general, convergence along subsequences is typically achievable. However, achieving convergence of the entire trajectory requires additional geometric properties. For instance, the Łojasiewicz inequality has been used to establish trajectory convergence in [4] for a second-order gradient-like dissipative dynamical system with Hessian-driven damping. Section 7 of this work shows how such an inequality can ensure trajectory convergence for θ(t)\theta(t) in (1.4). We should point out that several significant non-convex problems in machine learning, including certain classes of neural networks, have recently been shown to satisfy the PL condition [10, 17, 42, 29]. The prevalent belief is that the PL/KL condition provides a pertinent and attractive framework for numerous machine learning problems, especially in the over-parametrized regime.

Contributions. Our findings provide a rigorous and precise understanding of the convergence behavior for both the discrete scheme (1.3) and its continuous counterpart (1.4). Our main contributions can be summarized as follows.

  1. (1)

    For (1.3), we establish that the iterates are uniformly bounded and converge to the set of critical points of the objective function when the step size is appropriately small. Additionally, we demonstrate that limkrk=r>0\lim_{k\to\infty}r_{k}=r^{*}>0.

  2. (2)

    We derive (1.4) as a high-resolution ODE system corresponding to the discrete scheme (1.3). For this ODE system, we initially show global well-posedness by constructing a suitable Lyapunov function. Subsequently, we establish the time-asymptotic convergence of the solution toward critical points using the LaSalle invariance principle. Furthermore, we provide a positive lower bound for the energy variable.

  3. (3)

    For objective functions satisfying the PL condition (refer to Section 2.2), we establish a linear convergence rate:

    f(θ(t))f(θ)(f(θ0)f(θ))eαtf(\theta(t))-f(\theta^{*})\leq(f(\theta_{0})-f(\theta^{*}))e^{-\alpha t}

    for some α>0\alpha>0, where θ\theta^{*} represents the global minimizer of ff (not necessarily convex).

  4. (4)

    We propose several variations and extensions. For a broader class of objective functions satisfying the Łojasiewicz inequality, θ(t)\theta(t) is shown to have finite length, hence converging to a single minimum of ff, accompanied by associated convergence rates.

Obtaining convergence rates for non-convex optimization problems poses a significant challenge. The approach employed in Section 6 to deliver the linear convergence rate for the system (1.4) draws inspiration from [5], where a linear convergence rate was established for the Heavy Ball dynamical system, namely (1.5), with a(t)a(t) being a constant. In the case of the more intricate nonlinear system (1.4), we manage to formulate a class of control functions that serve a similar role to those in [5]. The derivation of the convergence rate results for (1.4) differs significantly and is more intricate. Notably, we employ a subtle optimization argument to identify an optimal control function that facilitates the desired decay rate.

1.1 Related works.

IEQ strategy. The fundamental concept underpinning AEGD is the invariant energy quadratization (IEQ) strategy, originally introduced to devise linear and unconditionally energy stable schemes for gradient flows expressed as partial differential equations [46, 47]. The AEGD algorithm [31] stands out as the pioneer in utilizing the energy update to stabilize GD in the context of optimization.

Optimization with momentum. The two prominent momentum-based optimization methods are Polyak’s Heavy Ball (HB) [40] and Nesterov’s Accelerated Gradient (NAG) [37]. The HB method combines the current gradient with a history of the previous steps to accelerate the algorithm’s convergence:

vk\displaystyle v_{k} =μvk1+f(θk),\displaystyle=\mu v_{k-1}+\nabla f(\theta_{k}),
θk+1\displaystyle\theta_{k+1} =θkηvk\displaystyle=\theta_{k}-\eta v_{k}

where μ[0,1]\mu\in[0,1] is the momentum coefficient, and η\eta is the step size. Research in [40] showed that HB can significantly accelerate convergence to a local minimum. For strongly convex functions, it requires κ\sqrt{\kappa} times fewer iterations than GD to achieve the same level of accuracy, where κ\kappa is the condition number of the curvature at the minimum and μ\mu is set to κ1κ+1\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}. Similar to the HB method, NAG also ensures a faster convergence rate than GD in specific scenarios. Particularly, for general smooth (non-strongly) convex functions, NAG achieves a global convergence rate of O(1/k2)O(1/k^{2}) (versus O(1/k)O(1/k) of GD) [37]. Recent studies have indicated that integrating momentum with other techniques can further enhance the performance of certain optimization methods. For instance, momentum has been incorporated into adaptive step size methods [26] and into variance-reduction-based methods [2] for stochastic optimization.

The approach discussed in this work integrates the momentum technique into the energy-adaptive gradient methods. As demonstrated in [31, 32], the resulting algorithms not only exhibit unconditional energy stability but also showcase faster convergence, inheriting advantages from both techniques.

ODE perspectives. In recent years, the optimization community has shown a growing interest in examining the continuous-time ODE limit of optimization methods as the step size tends to zero. This perspective has spurred numerous recent studies that critically examine established optimization methods in their continuous-time limit. A main aspect of the research in this domain focuses on developing Lyapunov functions or estimating sequences that emerge independently from the dynamical system.

ODEs have proven to be particularly useful in the theoretical analysis of accelerated gradient algorithms. For instance, researchers have investigated the convergence behavior of the HB method using second-order ODEs:

\ddot{\theta}(t)+a(t)\dot{\theta}+\nabla f(\theta)=0,\quad\theta(0)=\theta_{0},\quad\dot{\theta}(0)=0 (1.5)

with a(t)a(t) being a smooth, positive and non-increasing function. The Lyapunov function E=12|θ˙|2+f(θ)E=\frac{1}{2}|\dot{\theta}|^{2}+f(\theta) plays a pivotal role in the analysis of (1.5) within both convex and non-convex settings [3, 8]. The NAG method has been linked to (1.5) with a(t)=3ta(t)=\frac{3}{t} in [43], wherein the optimal convergence rate of O(1/t2)O(1/t^{2}) of the discrete scheme in the convex setting is recovered [38]. In [41], the distinction between the acceleration phenomena of the HB method and the NAG method is investigated through an analysis of high-resolution ODEs. Further analysis of (1.5) is conducted in Attouch et al. [7], while Wibisono et al. [45] frame it within a variational context. This motivated further research on structure-preserving integration schemes for discretizing continuous-time optimization algorithms, as explored by Betancourt et al. [11]. Franca et al. [20] and Muehlebach and Jordan [36] highlight important geometric properties of the underlying dynamics, analyzing corresponding structure-preserving discretization schemes. Maddison et al. [35] propose that convergence can be enhanced through a suitable selection of kinetic energy.

In [18], a non-autonomous system of differential equations was derived and analyzed as the continuous-time limit of adaptive optimization methods, including Adam and RMSprop. Beyond examining convergence behavior, this dynamical approach provides insight into qualitative features of discrete algorithms. For example, by analyzing the sign GD of the form θ˙=f(θ)|f(θ)|\dot{\theta}=-\frac{\nabla f(\theta)}{|\nabla f(\theta)|} and its variants, [34] identified three behavior patterns of Adam and RMSprop: fast initial progress, small oscillations, and large spikes. Our work aligns with this approach, although, to our knowledge, (1.4) differs from existing ODEs for optimization methods and presents subtle analytical challenges.

The ODE perspective has inspired researchers to propose innovative optimization methods. Indeed, starting from a continuous dynamical system with favorable convergence properties, one can discretize it to derive novel algorithms. For example, a new second-order inertial optimization method, known as the INNA algorithm, is obtained from a continuous inclusion system when the objective function is non-smooth [16].

The rest of the paper is organized as follows. In Section 2, we begin with the problem setup, revisiting existing energy-adaptive gradient methods, and introducing a novel approach. We delve into the PL condition and its associated properties. In Section 3, we analyze solution properties for the newly proposed method. In Section 4, we derive a continuous dynamical system and present a convergence result to show the dynamical system is a faithful model for the discrete scheme. In Section 5, we analyze the global existence and asymptotic behavior of the solution to the dynamical system. In Section 6, we obtain a linear convergence rate for objective functions satisfying the PL condition. In Section 7, we propose several variations and extensions. Some concluding remarks are presented in Section 8.

2 Energy-adaptive gradient algorithms

Throughout this study, we make the following assumption.

Assumption 2. The objective function ff in (1.1) satisfies

  1. (1)

    fC2(n)f\in C^{2}(\mathbb{R}^{n}) and is coercive, i.e., lim|θ|f(θ)=\lim_{|\theta|\to\infty}f(\theta)=\infty.

  2. (2)

    ff is bounded from below, so that F=f+cF=\sqrt{f+c} is well-defined for some cc\in\mathbb{R}.

We represent f=minf(θ)f^{*}=\min f(\theta), F=f+cF^{*}=\sqrt{f^{*}+c}, and assume that cc is chosen such that F=f+cF>0F=\sqrt{f+c}\geq F^{*}>0. We use f\nabla f and 2f\nabla^{2}f to denote the gradient and Hessian of ff; and if\partial_{i}f to denote the ii-th element of f\nabla f. For xnx\in\mathbb{R}^{n}, |x||x| denotes its Euclidean norm (|x|=x12++xn2|x|=\sqrt{x^{2}_{1}+...+x^{2}_{n}}). The outer product of x,ynx,y\in\mathbb{R}^{n} is denoted as xyx\otimes y, and [m][m] denotes the set {1,2,,m}\{1,2,...,m\}.

Under Assumption 2, recognizing that f+c=F2f+c=F^{2}, we deduce f=2FF\nabla f=2F\nabla F and

D2f=2FD2F+2FF.D^{2}f=2FD^{2}F+2\nabla F\otimes\nabla F.

The fact that FF>0F\geq F^{*}>0 ensures that FC2(n)F\in C^{2}(\mathbb{R}^{n}) and FF is coercive.
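For completeness, these identities follow from differentiating the relation f+c=F^{2}:

\nabla f=\nabla(F^{2})=2F\nabla F,\qquad D^{2}f=D(2F\nabla F)=2\nabla F\otimes\nabla F+2FD^{2}F.

In particular, since FF>0F\geq F^{*}>0, one may also write \nabla F=\nabla f/(2F), a form used repeatedly below.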

2.1 Energy-adaptive gradient algorithms.

Scheme (1.2) represents the global and non-stochastic counterpart of AEGD, as introduced in [31]. Its notable feature is unconditional energy stability: the energy variable rkr_{k}, serving as an approximation to F(θk)=f(θk)+cF(\theta_{k})=\sqrt{f(\theta_{k})+c}, strictly decreases unless θk+1=θk\theta_{k+1}=\theta_{k}. To enhance the convergence speed of AEGD, momentum was introduced into the energy-adaptive methods. The first momentum version, named AEGDM, was introduced in [30]:

vk\displaystyle v_{k} =βvk1+F(θk),\displaystyle=\beta v_{k-1}+\nabla F(\theta_{k}), (2.6a)
rk+1\displaystyle r_{k+1} =rk1+2η|F(θk)|2,r0=F(θ0),\displaystyle=\frac{r_{k}}{1+2\eta|\nabla F(\theta_{k})|^{2}},\quad r_{0}=F(\theta_{0}), (2.6b)
θk+1\displaystyle\theta_{k+1} =θk2ηrk+1vk.\displaystyle=\theta_{k}-2\eta r_{k+1}v_{k}. (2.6c)

Here, β(0,1)\beta\in(0,1) represents the momentum parameter. The theoretical and numerical superiority of AEGDM over AEGD has been established in [30]. Subsequently, in [32], a stochastic version of AEGD with momentum, named SGEM, was introduced, emphasizing its performance in training large-scale machine learning tasks. The formulation is as follows:

mk\displaystyle m_{k} =βmk1+(1β)f(θk),\displaystyle=\beta m_{k-1}+(1-\beta)\nabla f(\theta_{k}), (2.7a)
vk\displaystyle v_{k} =mk(1βk)2F(θk),\displaystyle=\frac{m_{k}}{(1-\beta^{k})2F(\theta_{k})}, (2.7b)
rk+1\displaystyle r_{k+1} =rk1+2η|vk|2,r0=F(θ0),\displaystyle=\frac{r_{k}}{1+2\eta|v_{k}|^{2}},\quad r_{0}=F(\theta_{0}), (2.7c)
θk+1\displaystyle\theta_{k+1} =θk2ηrk+1vk,\displaystyle=\theta_{k}-2\eta r_{k+1}v_{k}, (2.7d)

with β(0,1)\beta\in(0,1) being the momentum parameter. SGEM has been shown to achieve faster convergence than AEGD, while exhibiting better generalization than stochastic gradient descent (SGD) with momentum in training various benchmark deep neural networks [32].

(a) AGEM   (b) SGEM   (c) AEGDM   (d) GDM
Figure 1: The number of iterations required for the three energy-adaptive methods (AEGDM, SGEM, AGEM) and GDM to converge to the global minimum of the two-dimensional Rosenbrock function, across different values of β\beta and η\eta. In the case of GDM, the white areas on the graph indicate divergence.

This work introduces a novel variant of (1.2), named AGEM:

vk\displaystyle v_{k} =βvk1+(1β)F(θk),v1=𝟎,\displaystyle=\beta v_{k-1}+(1-\beta)\nabla F(\theta_{k}),\quad v_{-1}=\bf{0}, (2.8a)
rk+1\displaystyle r_{k+1} =rk1+2η|vk|2,r0=F(θ0),\displaystyle=\frac{r_{k}}{1+2\eta|v_{k}|^{2}},\quad r_{0}=F(\theta_{0}), (2.8b)
θk+1\displaystyle\theta_{k+1} =θk2ηrk+1vk.\displaystyle=\theta_{k}-2\eta r_{k+1}v_{k}. (2.8c)

All of the above energy-adaptive methods – AEGD, AEGDM, SGEM, and AGEM – share the property of unconditional energy stability, as stated in the following theorem.

Theorem 2.1 (Unconditional energy stability).

The schemes (1.2), (2.6), (2.7), and (2.8) all exhibit unconditional energy stability, as expressed by the relation:

rk+12=rk2(rk+1rk)21η|θk+1θk|2.r^{2}_{k+1}=r^{2}_{k}-(r_{k+1}-r_{k})^{2}-\frac{1}{\eta}|\theta_{k+1}-\theta_{k}|^{2}. (2.9)

This guarantees the strict decrease of rkr_{k} with convergence to r0r^{*}\geq 0 as kk\to\infty, accompanied by the limit limk|θk+1θk|2=0\lim_{k\to\infty}|\theta_{k+1}-\theta_{k}|^{2}=0. Furthermore, the following inequalities hold:

ηk=0(rk+1rk)2+k=0|θk+1θk|2ηr02.\eta\sum_{k=0}^{\infty}(r_{k+1}-r_{k})^{2}+\sum_{k=0}^{\infty}|\theta_{k+1}-\theta_{k}|^{2}\leq\eta r^{2}_{0}. (2.10)
Proof 2.2.

Starting with the common AEGD equations:

rk+1\displaystyle r_{k+1} =rk1+2η|vk|2,\displaystyle=\frac{r_{k}}{1+2\eta|v_{k}|^{2}},
θk+1\displaystyle\theta_{k+1} =θk2ηrk+1vk,\displaystyle=\theta_{k}-2\eta r_{k+1}v_{k},

we derive the relation:

rk+1+2ηrk+1|vk|2=rk,|θk+1θk|2=4η2rk+12|vk|2.r_{k+1}+2\eta r_{k+1}|v_{k}|^{2}=r_{k},\quad|\theta_{k+1}-\theta_{k}|^{2}=4\eta^{2}r^{2}_{k+1}|v_{k}|^{2}.

Hence,

rk+12rk2\displaystyle r^{2}_{k+1}-r^{2}_{k} =2rk+1(rk+1rk)(rk+1rk)2\displaystyle=2r_{k+1}(r_{k+1}-r_{k})-(r_{k+1}-r_{k})^{2}
=4ηrk+12|vk|2(rk+1rk)2\displaystyle=-4\eta r^{2}_{k+1}|v_{k}|^{2}-(r_{k+1}-r_{k})^{2}
=1η|θk+1θk|2(rk+1rk)2.\displaystyle=-\frac{1}{\eta}|\theta_{k+1}-\theta_{k}|^{2}-(r_{k+1}-r_{k})^{2}.

This establishes (2.9). As {rk}\{r_{k}\} is decreasing and bounded from below, it converges to rr^{*}. Summing (2.9) over kk leads to

1ηk=0|θk+1θk|2+k=0(rk+1rk)2\displaystyle\frac{1}{\eta}\sum_{k=0}^{\infty}|\theta_{k+1}-\theta_{k}|^{2}+\sum_{k=0}^{\infty}(r_{k+1}-r_{k})^{2} =limk(r02rk2)=r02(r)2r02.\displaystyle=\lim_{k\to\infty}(r^{2}_{0}-r^{2}_{k})=r^{2}_{0}-(r^{*})^{2}\leq r^{2}_{0}.

This implies (2.10), leading to limk|θk+1θk|2=0\lim_{k\to\infty}|\theta_{k+1}-\theta_{k}|^{2}=0.

To gain a deeper understanding of how AGEM, a scheme informed by its continuous ODE system, distinguishes itself from the other two energy-adaptive methods, AEGDM [30] and SGEM [32], we conducted a robustness test on all three methods and on GDM (GD with momentum). The test used the Rosenbrock function:

f(x)=\sum_{i=1}^{n-1}\left[(1-x_{i})^{2}+100(x_{i+1}-x_{i}^{2})^{2}\right],

where nn is an integer. This nonconvex function poses a challenge due to its global minimum being located at the bottom of a long, narrow valley, making the optimization problem inherently difficult. Setting n=2n=2, we investigated the impact of the momentum parameter β\beta and step size η\eta on the convergence performance of each method, with results presented in Figure 1. Our observations indicate that, in comparison to GDM, the energy-adaptive methods exhibit greater robustness in the sense that their convergence performance is less sensitive to variations in β\beta and η\eta, enabling convergence to the global minimum within a moderate number of iterations across a wide range of parameter values. In contrast, GDM diverges when η\eta exceeds a certain threshold (e.g. 5e45e-4). Moreover, AGEM exhibits better robustness than the other two methods, achieving the fastest convergence under the settings β=0.9\beta=0.9 and η=7e4\eta=7e-4.
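As a rough illustration of the experimental setup behind Figure 1, the sketch below runs the AGEM update (reusing the agem_step routine sketched in the introduction) on the two-dimensional Rosenbrock function with the reported best settings β=0.9 and η=7e-4; the starting point, the choice c=1, and the stopping tolerance are our own illustrative assumptions.

import numpy as np

def f(x):
    return (1.0 - x[0])**2 + 100.0*(x[1] - x[0]**2)**2

def grad_f(x):
    return np.array([-2.0*(1.0 - x[0]) - 400.0*x[0]*(x[1] - x[0]**2),
                     200.0*(x[1] - x[0]**2)])

c = 1.0                                     # any c with f + c > 0; performance is reported to be insensitive to c
F = lambda x: np.sqrt(f(x) + c)
grad_F = lambda x: grad_f(x) / (2.0*F(x))   # \nabla F = \nabla f / (2F)

theta = np.array([-1.5, 2.0])               # illustrative starting point
v, r = np.zeros(2), F(theta)
for k in range(200000):
    theta, v, r = agem_step(theta, v, r, grad_F, eta=7e-4, beta=0.9)
    if np.linalg.norm(grad_f(theta)) < 1e-8:
        break
print(k, theta)                             # expected to approach the global minimizer (1, 1)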

Addressing a more challenging scenario in higher dimensions, we considered the Rosenbrock function with n=100n=100. The landscape in this case is quite intricate, with two minimizers: a global minimizer at x=(1,,1)x^{*}=(1,...,1)^{\top} with f(x)=0f(x^{*})=0 and a local minimizer near x=(1,1,,1)x=(-1,1,...,1)^{\top} with f3.99f\sim 3.99. A comparison between GDM and AGEM is presented in Figure 2. Both methods achieve their fastest convergence with η=1e4\eta=1e-4 and η=3e5\eta=3e-5, respectively. Notably, as η\eta increases for both methods, GDM becomes trapped at a local minimum, while AGEM continues to converge to the global minimum, though at a slower pace.

(a) n=2   (b) n=100
Figure 2: A comparative analysis showcasing the performance of various methods on the Rosenbrock function.

The primary goal of this study is to provide an in-depth analysis of AGEM, considering both its discrete and continuous formulations. These findings are expected to be applicable to (2.7). Our specific focus lies in understanding the convergence characteristics of AGEM. This examination will be conducted by exploring continuous dynamics for objective functions that are generally smooth. Additionally, leveraging a structural condition on ff, known as the Polyak-Łojasiewicz (PL) condition, we establish a convergence rate in Section 6.

2.2 PL condition for non-convex objectives.

In this section, we provide a brief overview of the PL condition and its relevance to the loss function within the context of deep neural networks.

Definition 2.3.

For a differentiable function f:nf:\mathbb{R}^{n}\to\mathbb{R} with argminf(θ){\rm argmin}f(\theta)\neq\emptyset, we say ff satisfies the PL condition if there exists some constant μ>0\mu>0, such that the following inequality holds for any θargminf(θ)\theta^{*}\in{\rm argmin}f(\theta):

12|f(θ)|2μ(f(θ)f(θ)),θn.\frac{1}{2}|\nabla f(\theta)|^{2}\geq\mu(f(\theta)-f(\theta^{*})),\quad\forall\theta\in\mathbb{R}^{n}. (2.11)

This condition implies that a local minimizer θ\theta^{*} is also a global minimizer. It’s important to note that strongly convex functions satisfy the PL condition, although a function satisfying the PL condition need not be convex. For instance, the function

f(θ)=θ2+3sin2θf(\theta)=\theta^{2}+3\sin^{2}\theta

is not convex but satisfies the PL condition with μ=132,f=0\mu=\frac{1}{32},f^{*}=0.
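A quick numerical sanity check of this example (a sketch only; the grid and the exclusion window around the minimizer are arbitrary choices) evaluates the PL ratio |f'(\theta)|^{2}/\big(2(f(\theta)-f^{*})\big) on a grid and verifies that it stays above μ=1/32:

import numpy as np

theta = np.linspace(-10.0, 10.0, 400001)
theta = theta[np.abs(theta) > 1e-6]        # exclude the minimizer theta = 0 to avoid 0/0
f_val = theta**2 + 3.0*np.sin(theta)**2
df = 2.0*theta + 3.0*np.sin(2.0*theta)     # f'(theta), using 2 sin(t) cos(t) = sin(2t)
ratio = 0.5*df**2 / f_val                  # here f^* = 0
print(ratio.min())                         # expected to be bounded below by mu = 1/32, cf. (2.11)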

As we mentioned in the introduction, the PL condition has attracted increasing attention in the context of training deep neural networks. In this informal exploration, we investigate how a commonly used loss function in deep learning tasks can satisfy the PL condition. Consider the following loss function

f(θ):=12i=1m(g(xi;θ)yi)2,f(\theta):=\frac{1}{2}\sum_{i=1}^{m}(g(x_{i};\theta)-y_{i})^{2},

where f(θ)=0f(\theta^{*})=0 for any minimizer θ\theta^{*}. Here, {xi,yi}i=1m\{x_{i},y_{i}\}_{i=1}^{m} represents training data points, and gg is a deep neural network parameterized by θ\theta. The gradient of this loss function is given by:

f(θ)=i=1m(g(xi;θ)yi)g(xi;θ)θ.\nabla f(\theta)=\sum_{i=1}^{m}(g(x_{i};\theta)-y_{i})\frac{\partial g(x_{i};\theta)}{\partial\theta}.

Denoting u=(u1,,um)u=(u_{1},...,u_{m}) where ui=g(xi;θ)yiu_{i}=g(x_{i};\theta)-y_{i}, we have

12|f(θ)|2=12i=1mj=1muiHijuj=12uHu,\frac{1}{2}|\nabla f(\theta)|^{2}=\frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m}u_{i}H_{ij}u_{j}=\frac{1}{2}uHu^{\top}, (2.12)

where HH is a m×mm\times m matrix with (i,j)(i,j) entry defined by

Hij=g(xi;θ)θ,g(xj;θ)θ.H_{ij}=\langle\frac{\partial g(x_{i};\theta)}{\partial\theta},\frac{\partial g(x_{j};\theta)}{\partial\theta}\rangle. (2.13)

From (2.12) it is evident that if the smallest eigenvalue of HH is bounded from below by μ\mu, then:

12|f(θ)|212μ|u|2=μf(θ).\frac{1}{2}|\nabla f(\theta)|^{2}\geq\frac{1}{2}\mu|u|^{2}=\mu f(\theta).

This condition aligns with the PL condition. Notably, using over-parameterization and random initialization, [19] has proved that for a two-layer neural network with rectified linear unit (ReLU) activation, the smallest eigenvalue of HH is indeed strictly positive. This eigenvalue proves to be crucial in deriving the linear convergence rate of gradient descent, as illustrated in [19]. A similar convergence result for implicit networks is also derived in [22].
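To see (2.13) in code, the small sketch below (a hypothetical two-layer ReLU network with random data; not the precise setting or scaling of [19]) assembles HH from per-example parameter gradients and inspects its smallest eigenvalue, a strictly positive value of which yields the PL constant μ\mu discussed above.

import numpy as np

rng = np.random.default_rng(0)
m, d, width = 20, 5, 256                          # samples, input dim, hidden width (illustrative)
X = rng.normal(size=(m, d))
W = rng.normal(size=(width, d)) / np.sqrt(d)      # trainable hidden weights, theta = W
a = rng.choice([-1.0, 1.0], size=width)           # fixed output weights

def per_example_grad(x):
    # g(x; W) = a^T relu(W x); returns dg/dW flattened into a vector
    act = (W @ x > 0).astype(float)
    return ((a * act)[:, None] * x[None, :]).ravel()

G = np.stack([per_example_grad(x) for x in X])    # row i is dg(x_i)/dtheta
H = G @ G.T                                       # H_ij = <dg(x_i)/dtheta, dg(x_j)/dtheta>, cf. (2.13)
print(np.linalg.eigvalsh(H).min())                # smallest eigenvalue of H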

3 Solution properties of the AGEM algorithm

In this section, we establish that the iterates generated by (2.8) are uniformly bounded and convergent if η\eta is suitably small.

Theorem 3.1.

Under Assumption 2, for (θk,rk,vk)(\theta_{k},r_{k},v_{k}) generated by (2.8), if η<η\eta<\eta^{*} for some η>0\eta^{*}>0, we have the following results:

  1. (1)

    Uniform boundedness: θk\theta_{k} and vkv_{k} are bounded for all k1k\geq 1. Moreover,

    θk{θn|F(θ)2F(θ0)},|vk|maxjk|F(θj)|.\theta_{k}\in\{\theta\in\mathbb{R}^{n}\;|\;F(\theta)\leq 2F(\theta_{0})\},\quad|v_{k}|\leq\max_{j\leq k}|\nabla F(\theta_{j})|.
  2. (2)

    Lower bound on energy: The lower bound of rkr_{k} is strictly positive:

    rkrk+1>r>0,k0.r_{k}\geq r_{k+1}>r^{*}>0,\quad\forall k\geq 0.
  3. (3)

    Convergence: As kk\to\infty, we have

    rkr>0,vk0,f(θk)0.r_{k}\to r^{*}>0,\quad v_{k}\to 0,\quad\nabla f(\theta_{k})\to 0.
Proof 3.2.

Define

Qk:=F(θk)+ϵrk|vk1|2,Q_{k}:=F(\theta_{k})+\epsilon r_{k}|v_{k-1}|^{2}, (3.14)

where ϵ=ηβ1β\epsilon=\frac{\eta\beta}{1-\beta}, then we claim:

Qk+1Qkϵrk+1|vkvk1|22ηrk+1|vk|2+LF(θk+12)2|θk+1θk|2,Q_{k+1}-Q_{k}\leq-\epsilon r_{k+1}|v_{k}-v_{k-1}|^{2}-2\eta r_{k+1}|v_{k}|^{2}+\frac{L_{F}(\theta_{k+\frac{1}{2}})}{2}|\theta_{k+1}-\theta_{k}|^{2}, (3.15)

where LF(θk+12)L_{F}(\theta_{k+\frac{1}{2}}) denotes the largest eigenvalue of D2F((1s)θk+sθk+1)D^{2}F((1-s)\theta_{k}+s\theta_{k+1}), maximized over s[0,1]s\in[0,1]. In fact,

Qk+1Qk=F(θk+1)F(θk)+ϵrk+1|vk|2ϵrk|vk1|2,Q_{k+1}-Q_{k}=F(\theta_{k+1})-F(\theta_{k})+\epsilon r_{k+1}|v_{k}|^{2}-\epsilon r_{k}|v_{k-1}|^{2}, (3.16)

where

F(θk+1)F(θk)F(θk),θk+1θk+LF(θk+12)2|θk+1θk|2,F(\theta_{k+1})-F(\theta_{k})\leq\langle\nabla F(\theta_{k}),\theta_{k+1}-\theta_{k}\rangle+\frac{L_{F}(\theta_{k+\frac{1}{2}})}{2}|\theta_{k+1}-\theta_{k}|^{2}, (3.17)

in which,

F(θk),θk+1θk\displaystyle\quad\langle\nabla F(\theta_{k}),\theta_{k+1}-\theta_{k}\rangle
=2ηrk+1F(θk),vk\displaystyle=-2\eta r_{k+1}\langle\nabla F(\theta_{k}),v_{k}\rangle
=2ηrk+1F(θk)vk,vk2ηrk+1|vk|2\displaystyle=-2\eta r_{k+1}\langle\nabla F(\theta_{k})-v_{k},v_{k}\rangle-2\eta r_{k+1}|v_{k}|^{2}
=2ηrk+1β1β(vkvk1),vk2ηrk+1|vk|2\displaystyle=-2\eta r_{k+1}\langle\frac{\beta}{1-\beta}(v_{k}-v_{k-1}),v_{k}\rangle-2\eta r_{k+1}|v_{k}|^{2}
=ϵrk+1(|vk|2|vk1|2+|vkvk1|2)2ηrk+1|vk|2\displaystyle=-\epsilon r_{k+1}(|v_{k}|^{2}-|v_{k-1}|^{2}+|v_{k}-v_{k-1}|^{2})-2\eta r_{k+1}|v_{k}|^{2}
ϵrk+1|vk|2+ϵrk|vk1|2ϵrk+1|vkvk1|22ηrk+1|vk|2,\displaystyle\leq-\epsilon r_{k+1}|v_{k}|^{2}+\epsilon r_{k}|v_{k-1}|^{2}-\epsilon r_{k+1}|v_{k}-v_{k-1}|^{2}-2\eta r_{k+1}|v_{k}|^{2}, (3.18)

where rk+1<rkr_{k+1}<r_{k} is used in the last inequality. Connecting (3.16), (3.17), and (3.18) gives (3.15).

(1) To demonstrate uniform boundedness, we sum up (3.15) from 0 to kk, omitting negative terms on the right-hand side to obtain:

Qk+1Q0+maxjkLF(θj+12)2j=0k|θj+1θj|2.\displaystyle Q_{k+1}\leq Q_{0}+\frac{\max_{j\leq k}L_{F}(\theta_{j+\frac{1}{2}})}{2}\sum_{j=0}^{k}|\theta_{j+1}-\theta_{j}|^{2}. (3.19)

Using Q0=F(θ0)Q_{0}=F(\theta_{0}) and the bound in (2.10), we derive:

F(θk+1)F(θ0)+12ηr02maxjkLF(θj+12).F(\theta_{k+1})\leq F(\theta_{0})+\frac{1}{2}\eta r^{2}_{0}\max_{j\leq k}L_{F}(\theta_{j+\frac{1}{2}}). (3.20)

Next, we establish that if η\eta is suitably small, {θk}\{\theta_{k}\} is uniformly bounded, ensuring the boundedness of {vk}\{v_{k}\}, as indicated by:

|vk|(1β)j=0kβkj|F(θj)|maxjk|F(θj)|.|v_{k}|\leq(1-\beta)\sum_{j=0}^{k}\beta^{k-j}|\nabla F(\theta_{j})|\leq\max_{j\leq k}|\nabla F(\theta_{j})|. (3.21)

To proceed, we introduce the notation:

Σk:={θn|F(θ)kF(θ0)},k=2,3\Sigma_{k}:=\{\theta\in\mathbb{R}^{n}\;|\;F(\theta)\leq kF(\theta_{0})\},\quad k=2,3

and Σ~k\tilde{\Sigma}_{k} for the convex hull of Σk\Sigma_{k} (the convex hull of a given set SS is the unique minimal convex set containing SS). By assumption, Σ2Σ3\Sigma_{2}\subset\Sigma_{3} are bounded domains. Our objective is to identify conditions on η\eta such that {θk}Σ2\{\theta_{k}\}\subset\Sigma_{2} for all k1k\geq 1. This can be accomplished through induction. Since θ0Σ2\theta_{0}\in\Sigma_{2}, it suffices to show that under suitable conditions on η\eta, we have:

{θj}jkΣ2θk+1Σ2\{\theta_{j}\}_{j\leq k}\subset\Sigma_{2}\Rightarrow\theta_{k+1}\in\Sigma_{2}

The main task is to establish a sufficient condition on η\eta, for which we argue in two steps:
(i) By the continuity of FF, there exists δF(r0)\delta_{F}(r_{0}) such that if

|θk+1θk|<δF(r0),|\theta_{k+1}-\theta_{k}|<\delta_{F}(r_{0}), (3.22)

we have

|F(θk+1)F(θk)|<r0.|F(\theta_{k+1})-F(\theta_{k})|<r_{0}.

This implies F(θk+1)F(θk)+r03r0F(\theta_{k+1})\leq F(\theta_{k})+r_{0}\leq 3r_{0}, indicating θk+1Σ3.\theta_{k+1}\in\Sigma_{3}. On the other hand,

|θk+1θk|=|2ηrk+1vk|2ηr0maxjk|F(θj)|2ηr0maxθΣ2|F(θ)|.|\theta_{k+1}-\theta_{k}|=|-2\eta r_{k+1}v_{k}|\leq 2\eta r_{0}\max_{j\leq k}|\nabla F(\theta_{j})|\leq 2\eta r_{0}\max_{\theta\in\Sigma_{2}}|\nabla F(\theta)|.

This ensures (3.22) if we set

η<η1:=δF(r0)2r0maxθΣ2|F(θ)|.\eta<\eta_{1}:=\frac{\delta_{F}(r_{0})}{2r_{0}\max_{\theta\in\Sigma_{2}}|\nabla F(\theta)|}.

(ii) Considering that θk+1Σ3\theta_{k+1}\in\Sigma_{3} and {θj}jkΣ2Σ3\{\theta_{j}\}_{j\leq k}\in\Sigma_{2}\subset\Sigma_{3}, we have {θj+12}jkΣ~3\{\theta_{j+\frac{1}{2}}\}_{j\leq k}\subset\tilde{\Sigma}_{3}. By (3.20), we get:

F(θk+1)F(θ0)+12ηr02maxθΣ~3LF(θ).F(\theta_{k+1})\leq F(\theta_{0})+\frac{1}{2}\eta r^{2}_{0}\max_{\theta\in\tilde{\Sigma}_{3}}L_{F}(\theta). (3.23)

This implies

F(θk+1)2F(θ0),F(\theta_{k+1})\leq 2F(\theta_{0}),

provided

ηη2:=2r0maxθΣ~3LF(θ).\eta\leq\eta_{2}:=\frac{2}{r_{0}\max_{\theta\in\tilde{\Sigma}_{3}}L_{F}(\theta)}.

Hence, if η<min{η1,η2}\eta<\min\{\eta_{1},\eta_{2}\}, then θkΣ2\theta_{k}\in\Sigma_{2} for all kk.

(2) Theorem 2.1 asserts that rkr0r_{k}\to r^{*}\geq 0 as kk\to\infty. Here, we identify a sufficient condition on η\eta to ensure r>0r^{*}>0. Following a similar argument as in (1), while retaining the negative term

2ηrk+1|vk|2=rk+1rk,-2\eta r_{k+1}|v_{k}|^{2}=r_{k+1}-r_{k},

on the right-hand side of (3.15), we derive:

F(θk)F(θ0)+rkr0+12ηr02maxθΣ~3LF(θ),F(\theta_{k})\leq F(\theta_{0})+r_{k}-r_{0}+\frac{1}{2}\eta r_{0}^{2}\max_{\theta\in\tilde{\Sigma}_{3}}L_{F}(\theta),

under the condition that η<η1\eta<\eta_{1}. Since r0=F(θ0)r_{0}=F(\theta_{0}), we obtain:

rkF(θk)12ηr02maxθΣ~3LF(θ).r_{k}\geq F(\theta_{k})-\frac{1}{2}\eta r_{0}^{2}\max_{\theta\in\tilde{\Sigma}_{3}}L_{F}(\theta).

Letting kk\to\infty, we have:

rlim infkF(θk)12ηr02maxθΣ~3LF(θ)F12ηr02maxθΣ~3LF(θ)>0,r^{*}\geq\liminf_{k\to\infty}F(\theta_{k})-\frac{1}{2}\eta r_{0}^{2}\max_{\theta\in\tilde{\Sigma}_{3}}L_{F}(\theta)\geq F^{*}-\frac{1}{2}\eta r_{0}^{2}\max_{\theta\in\tilde{\Sigma}_{3}}L_{F}(\theta)>0,

provided

ηη3:=2Fr02maxθΣ~3LF(θ).\eta\leq\eta_{3}:=\frac{2F^{*}}{r^{2}_{0}\max_{\theta\in\tilde{\Sigma}_{3}}L_{F}(\theta)}.

Therefore, if η<min{η1,η3}\eta<\min\{\eta_{1},\eta_{3}\}, then r>0r^{*}>0. Note that η3η2\eta_{3}\leq\eta_{2}, hence both (1) and (2) are satisfied if we choose η<η:=min{η1,η3}\eta<\eta^{*}:=\min\{\eta_{1},\eta_{3}\}.

(3) In Theorem 2.1, we have shown that |θk+1θk|0|\theta_{k+1}-\theta_{k}|\to 0. Using (2.8), we deduce that

rk+1|vk|0.r_{k+1}|v_{k}|\to 0.

According to (2), rkr>0r_{k}\to r^{*}>0 when η<η\eta<\eta^{*}. Consequently, |vk|0|v_{k}|\to 0 must hold. As indicated by (2.8a), this further implies F(θk)0\nabla F(\theta_{k})\to 0, consequently, f(θk)=2F(θk)F(θk)0\nabla f(\theta_{k})=2F(\theta_{k})\nabla F(\theta_{k})\to 0 also holds.

Remark 3.3.

When ff is assumed to be LL-smooth, LFL_{F} becomes uniformly bounded. In this scenario, the solution’s boundedness is attained without imposing any restriction on η\eta, as evident from (3.20). However, it is crucial to note that the LL-smoothness assumption excludes a significant class of objective functions, including those of the form |θ|4|\theta|^{4}.

4 Discrete schemes vs continuous-time-limits

The analysis of scheme (2.8) involves studying the limiting ODEs obtained by letting the step size η\eta approach 0. Notably, distinct ODEs may arise based on how the hyper-parameters are scaled. In this context, we will derive an ODE system that preserves certain momentum effects.

4.1 High resolution ODEs.

For any k0k\geq 0, let tk=kηt_{k}=k\eta, and assume vk=v(tk),rk=r(tk),θk=θ(tk)v_{k}=v(t_{k}),r_{k}=r(t_{k}),\theta_{k}=\theta(t_{k}) for some sufficiently smooth curve (v(t),r(t),θ(t))(v(t),r(t),\theta(t)). Performing the Taylor expansion on vk1v_{k-1} in powers of η\eta, we get:

vk1=v(tk1)=v(tk)ηv˙(tk)+O(η2).v_{k-1}=v(t_{k-1})=v(t_{k})-\eta\dot{v}(t_{k})+O(\eta^{2}).

Substituting vk1v_{k-1} into (2.8a), we have

v(tk)=β(v(tk)ηv˙(tk)+O(η2))+(1β)F(θ(tk)),v(t_{k})=\beta(v(t_{k})-\eta\dot{v}(t_{k})+O(\eta^{2}))+(1-\beta)\nabla F(\theta(t_{k})),

which gives

βη1βv˙=v+F(θ)+O(η2),att=tk.\frac{\beta\eta}{1-\beta}\dot{v}=-v+\nabla F(\theta)+O(\eta^{2}),\quad\text{at}\;t=t_{k}. (4.24)

Plugging rk+1=r(tk)+ηr˙(tk)+O(η2)r_{k+1}=r(t_{k})+\eta\dot{r}(t_{k})+O(\eta^{2}) into (2.8b), we have

r˙=2r|v|21+2η|v|2+O(η2),att=tk.\dot{r}=-\frac{2r|v|^{2}}{1+2\eta|v|^{2}}+O(\eta^{2}),\quad\text{at}\;t=t_{k}. (4.25)

Similarly, from (2.8c) we have

θ˙=2rv1+2η|v|2+O(η2),att=tk.\dot{\theta}=-\frac{2rv}{1+2\eta|v|^{2}}+O(\eta^{2}),\quad\text{at}\;t=t_{k}. (4.26)

We first discard O(η2)O(\eta^{2}) and keep O(η)O(\eta) terms in (4.24) (4.25), (4.26), which leads to the following ODE system:

ϵv˙\displaystyle\epsilon\dot{v} =v+F(θ),\displaystyle=-v+\nabla F(\theta), (4.27a)
r˙\displaystyle\dot{r} =2r|v|21+2η|v|2,\displaystyle=-\frac{2r|v|^{2}}{1+2\eta|v|^{2}}, (4.27b)
θ˙\displaystyle\dot{\theta} =2rv1+2η|v|2,\displaystyle=-\frac{2rv}{1+2\eta|v|^{2}}, (4.27c)

where ϵ=βη1β\epsilon=\frac{\beta\eta}{1-\beta}. This can be viewed as a high-resolution ODE [41] because of the presence of O(η)O(\eta) terms. Let η0\eta\to 0, then (4.27) reduces to the following ODE system:

r˙\displaystyle\dot{r} =2r|F(θ)|2,\displaystyle=-2r|\nabla F(\theta)|^{2}, (4.28a)
θ˙\displaystyle\dot{\theta} =2rF(θ).\displaystyle=-2r\nabla F(\theta). (4.28b)

From the two equations in (4.28), together with r(0)=F(θ0)r(0)=F(\theta_{0}), we can show

r(t)=F(θ(t)),t>0,r(t)=F(\theta(t)),\quad\forall t>0,

with which, (4.28b) reduces to

θ˙=2F(θ)F(θ)=f(θ).\dot{\theta}=-2F(\theta)\nabla F(\theta)=-\nabla f(\theta). (4.29)

This asserts that the ODE system (4.28) is equivalent to the gradient flow (4.29).
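The identity r(t)=F(\theta(t)) claimed above follows from a one-line computation along solutions of (4.28):

\frac{d}{dt}\big(r(t)-F(\theta(t))\big)=-2r|\nabla F(\theta)|^{2}-\langle\nabla F(\theta),\dot{\theta}\rangle=-2r|\nabla F(\theta)|^{2}+2r|\nabla F(\theta)|^{2}=0,

so that r(t)-F(\theta(t))\equiv r(0)-F(\theta_{0})=0 for all t>0.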

To explore the momentum effect, we maintain ϵ=βη1β\epsilon=\frac{\beta\eta}{1-\beta} unchanged while allowing η0\eta\to 0. This yields the following system of ODEs:

ϵv˙\displaystyle\epsilon\dot{v} =v+F(θ),\displaystyle=-v+\nabla F(\theta), (4.30a)
r˙\displaystyle\dot{r} =2r|v|2,\displaystyle=-2r|v|^{2}, (4.30b)
θ˙\displaystyle\dot{\theta} =2rv.\displaystyle=-2rv. (4.30c)

Compared with (4.30), the high-resolution ODEs (4.27) serve as more accurate continuous-time counterparts for the corresponding discrete algorithm (2.8). While the analysis of (4.27) would be more intricate, in the remainder of this work, we concentrate on (4.30) and provide a detailed analysis of it.

Theorem 4.1.

Denote Uk=(θk,vk,rk)U_{k}=(\theta_{k},v_{k},r_{k}) as the solution provided in Theorem 3.1, and U(t)=(θ(t),v(t),r(t))U(t)=(\theta(t),v(t),r(t)) as the solution to (4.30). Then, for any T>0T>0, we have:

limη0kη=TUk=U(T).\lim_{\begin{subarray}{c}\eta\to 0\\ k\eta=T\end{subarray}}U_{k}=U(T).

This result is a consequence of a standard ODE analysis. In the present context, it is assured by the boundedness of UkU_{k}, as demonstrated in Theorem 3.1, and by the boundedness of U(t)U(t), which will be shown in Theorem 5.1.
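Theorem 4.1 can also be checked numerically. The sketch below (with an illustrative quadratic objective, solver tolerances, and horizon of our own choosing, and reusing the agem_step routine sketched in the introduction) integrates (4.30) with scipy and compares the AGEM iterates against U(T)U(T) as η\eta decreases while ϵ=βη/(1β)\epsilon=\beta\eta/(1-\beta) is held fixed.

import numpy as np
from scipy.integrate import solve_ivp

c = 1.0
f = lambda th: 0.5*np.dot(th, th)                 # illustrative objective
F = lambda th: np.sqrt(f(th) + c)
grad_F = lambda th: th / (2.0*F(th))

eps, T = 0.5, 5.0
theta0 = np.array([2.0, -1.0])

def rhs(t, u):                                    # u = (v, r, theta) as in (4.30)
    v, r, th = u[:2], u[2], u[3:]
    return np.concatenate([(-v + grad_F(th))/eps, [-2.0*r*np.dot(v, v)], -2.0*r*v])

sol = solve_ivp(rhs, (0.0, T), np.concatenate([np.zeros(2), [F(theta0)], theta0]),
                rtol=1e-10, atol=1e-12)

for eta in [1e-1, 1e-2, 1e-3]:
    beta = eps / (eps + eta)                      # keeps eps = beta*eta/(1 - beta) fixed
    th, v, r = theta0.copy(), np.zeros(2), F(theta0)
    for _ in range(int(round(T/eta))):
        th, v, r = agem_step(th, v, r, grad_F, eta=eta, beta=beta)
    print(eta, np.linalg.norm(th - sol.y[3:, -1]))  # |theta_k - theta(T)| should shrink with eta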

4.2 Time discretization.

Indeed, the discrete AGEM dynamics can be interpreted as a discretization of the continuous ODE system (4.30). Specifically, we employ a first-order finite difference approximation with implicit treatment of rr and θ\theta, but explicit treatment of vv, resulting in:

ϵvk+1vkτ\displaystyle\epsilon\frac{v_{k+1}-v_{k}}{\tau} =vk+F(θk+1),\displaystyle=-v_{k}+\nabla F(\theta_{k+1}),
rk+1rkτ\displaystyle\frac{r_{k+1}-r_{k}}{\tau} =2rk+1|vk|2,\displaystyle=-2r_{k+1}|v_{k}|^{2},
θk+1θkτ\displaystyle\frac{\theta_{k+1}-\theta_{k}}{\tau} =2rk+1vk.\displaystyle=-2r_{k+1}v_{k}.

This formulation reproduces the discrete AGEM dynamics (2.8) when setting β=τ/ϵ\beta=\tau/\epsilon and η=τ\eta=\tau. Moreover, this perspective enables the derivation of new algorithms by applying diverse time discretizations to (4.30). For instance, a second-order discretization of (4.30) leads to

ϵ3vk+14vk+vk12τ\displaystyle\epsilon\frac{3v_{k+1}-4v_{k}+v_{k-1}}{2\tau} =vk+1+F(2θkθk1),\displaystyle=-v_{k+1}+\nabla F(2\theta_{k}-\theta_{k-1}),
3rk+14rk+rk12τ\displaystyle\frac{3r_{k+1}-4r_{k}+r_{k-1}}{2\tau} =2rk+1|2vkvk1|2,\displaystyle=-2r_{k+1}|2v_{k}-v_{k-1}|^{2},
3θk+14θk+θk12τ\displaystyle\frac{3\theta_{k+1}-4\theta_{k}+\theta_{k-1}}{2\tau} =2rk+1vk+1.\displaystyle=-2r_{k+1}v_{k+1}.

By setting β=τ/ϵ\beta=\tau/\epsilon and η=τ\eta=\tau the following recursive relationships are derived:

vk+1\displaystyle v_{k+1} =13+2β(4vkvk1+2βF(2θkθk1)),\displaystyle=\frac{1}{3+2\beta}\left(4v_{k}-v_{k-1}+2\beta\nabla F(2\theta_{k}-\theta_{k-1})\right),
rk+1\displaystyle r_{k+1} =4rkrk13+4η|2vkvk1|2,\displaystyle=\frac{4r_{k}-r_{k-1}}{3+4\eta|2v_{k}-v_{k-1}|^{2}},
θk+1\displaystyle\theta_{k+1} =13(4θkθk14ηrk+1vk+1).\displaystyle=\frac{1}{3}\left(4\theta_{k}-\theta_{k-1}-4\eta r_{k+1}v_{k+1}\right).

These recursions begin with the initial setup of v1=0,r1=r0v_{-1}=0,r_{-1}=r_{0} and θ1=θ0\theta_{-1}=\theta_{0}. A more comprehensive analysis of this scheme exceeds the scope of the present article.
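For reference, a direct transcription of these recursions into Python might read as follows; this is only a sketch of the update step, with the initialization v_{-1}=0, r_{-1}=r_{0}, θ_{-1}=θ_{0} handled by the caller, and, as noted above, the scheme is not analyzed further here.

import numpy as np

def agem_bdf2_step(theta, theta_prev, v, v_prev, r, r_prev, grad_F, eta=1e-3, beta=0.9):
    # Second-order (BDF2-type) recursion from Section 4.2, with beta = tau/eps and eta = tau.
    v_new = (4.0*v - v_prev + 2.0*beta*grad_F(2.0*theta - theta_prev)) / (3.0 + 2.0*beta)
    w = 2.0*v - v_prev                             # extrapolated momentum variable
    r_new = (4.0*r - r_prev) / (3.0 + 4.0*eta*np.dot(w, w))
    theta_new = (4.0*theta - theta_prev - 4.0*eta*r_new*v_new) / 3.0
    return theta_new, theta, v_new, v, r_new, r    # shift the two-step history forward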

5 Dynamic solution behavior

This section is dedicated to the examination of (4.30), including aspects such as global existence, asymptotic convergence to equilibria through the renowned LaSalle invariance principle [28], and the determination of convergence rates for non-convex objective functions that adhere to the PL property.

5.1 Global Existence and Uniqueness.

Theorem 5.1.

Under Assumption 2, the ODE (4.30) with the initial condition (θ0,r0>0,v0)(\theta_{0},r_{0}>0,v_{0}) admits a unique global solution (θ(t),v(t),r(t))C1([0,);2n+1)(\theta(t),v(t),r(t))\in C^{1}([0,\infty);\mathbb{R}^{2n+1}). Specifically, for any t>0t>0, the following conditions hold:

θ(t){θn|F(θ)F(θ0)+ϵr0|v0|2},\displaystyle\theta(t)\in\{\theta\in\mathbb{R}^{n}\;|\;F(\theta)\leq F(\theta_{0})+\epsilon r_{0}|v_{0}|^{2}\},
|v(t)|max{|v(0)|,max[0,t]|F(θ())|},\displaystyle|v(t)|\leq\max\{|v(0)|,\max_{[0,t]}|\nabla F(\theta(\cdot))|\},
0<r(t)r0.\displaystyle 0<r(t)\leq r_{0}. (5.31)
Proof 5.2.

For fC1(n)f\in C^{1}(\mathbb{R}^{n}), by Picard and Lindelöf, there exists a unique local solution to (4.30) with the initial condition (θ0,r0,v0)(\theta_{0},r_{0},v_{0}). The extension theorem implies that the global existence fails only if the solution ‘escapes’ to infinity. To establish global existence, it is sufficient to demonstrate that (θ(t),r(t),v(t))(\theta(t),r(t),v(t)) remains bounded for all t>0t>0. To achieve this, we introduce a suitable control function. A candidate based on the discrete counterpart is:

Q:=Q(θ,r,v)=F(θ)+ϵr|v|2.Q:=Q(\theta,r,v)=F(\theta)+\epsilon r|v|^{2}. (5.32)

Using the chain rule, if (θ,r,v)(\theta,r,v) satisfies (4.30), then:

Q˙\displaystyle\dot{Q} =F(θ),θ˙+ϵr˙|v|2+2ϵrv,v˙\displaystyle=\langle\nabla F(\theta),\dot{\theta}\rangle+\epsilon\dot{r}|v|^{2}+2\epsilon r\langle v,\dot{v}\rangle
=2rF(θ),v2ϵr|v|4+2rv,v+F(θ)\displaystyle=-2r\langle\nabla F(\theta),v\rangle-2\epsilon r|v|^{4}+2r\langle v,-v+\nabla F(\theta)\rangle
=2r|v|22ϵr|v|40.\displaystyle=-2r|v|^{2}-2\epsilon r|v|^{4}\leq 0. (5.33)

Hence, QQ is guaranteed to decrease. In particular, we have:

F(θ)+ϵr|v|2Q(θ0,r0,v0),F(\theta)+\epsilon r|v|^{2}\leq Q(\theta_{0},r_{0},v_{0}), (5.34)

which implies the boundedness of F(θ)F(\theta). This, combined with the coerciveness of ff and, consequently FF, guarantees that θ(t)\theta(t) remains bounded for all t>0t>0. However, the above inequality does not provide a uniform bound for v(t)v(t).

To establish a useful bound for v(t)v(t), we use (4.30a):

tv+1ϵv=1ϵF(θ(t)),\partial_{t}v+\frac{1}{\epsilon}v=\frac{1}{\epsilon}\nabla F(\theta(t)),

resulting in:

ddt(v(t)et/ϵ)=1ϵet/ϵF(θ(t)).\frac{d}{dt}(v(t)e^{t/\epsilon})=\frac{1}{\epsilon}e^{t/\epsilon}\nabla F(\theta(t)).

Consequently, for every tt, we have:

v(t)=v(0)et/ϵ+1ϵ0te(τt)/ϵF(θ(τ))𝑑τ.v(t)=v(0)e^{-t/\epsilon}+\frac{1}{\epsilon}\int_{0}^{t}e^{(\tau-t)/\epsilon}\nabla F(\theta(\tau))d\tau.

Therefore,

|v(t)|\displaystyle|v(t)| |v(0)|et/ϵ+(1et/ϵ)maxτ[0,t]|F(θ(τ))|\displaystyle\leq|v(0)|e^{-t/\epsilon}+(1-e^{-t/\epsilon})\max_{\tau\in[0,t]}|\nabla F(\theta(\tau))|
max{|v(0)|,maxτ[0,t]|F(θ(τ))|}.\displaystyle\leq\max\{|v(0)|,\max_{\tau\in[0,t]}|\nabla F(\theta(\tau))|\}.

Note that the boundedness of F(θ(t))\nabla F(\theta(t)) is ensured by

|F(θ(t))|=|f(θ(t))2F(θ(t))||f(θ(t))2F|.|\nabla F(\theta(t))|=\bigg{|}\frac{\nabla f(\theta(t))}{2F(\theta(t))}\bigg{|}\leq\bigg{|}\frac{\nabla f(\theta(t))}{2F^{*}}\bigg{|}.

From (4.30b), it is evident that rr decreases monotonically; moreover, r(t)=r_{0}\exp\big(-2\int_{0}^{t}|v(s)|^{2}\,ds\big)>0, so that 0<r(t)\leq r_{0}. This concludes the proof.

Remark 5.3.

The uniform boundedness of |v||v| justifies the removal of 2η|v|22\eta|v|^{2} in (4.27b) and (4.27c), while taking the limit η0\eta\to 0.

5.2 Asymptotic behavior of solutions.

Theorem 5.4.

Under Assumption 2, let (θ(t),r(t),v(t))(\theta(t),r(t),v(t)) be the global solution to the ODE (4.30) as stated in Theorem 5.1, with the initial condition (θ0,r0,v0)(\theta_{0},r_{0},v_{0}). If the conditions

r0>F(θ0)F,ϵ|v0|2<1F(θ0)Fr0r_{0}>F(\theta_{0})-F^{*},\quad\epsilon|v_{0}|^{2}<1-\frac{F(\theta_{0})-F^{*}}{r_{0}} (5.35)

hold, then as tt\to\infty,

F(θ(t))0,r(t)r>0,v(t)v=0.\nabla F(\theta(t))\rightarrow 0,\quad r(t)\rightarrow r^{*}>0,\quad v(t)\rightarrow v^{*}=0.

Furthermore, F(θ(t))FF(\theta(t))\rightarrow F^{*} as the minimum value of FF, and

r<r(t)r0,FF(θ(t))F(θ0)+ϵr0|v0|2.r^{*}<r(t)\leq r_{0},\quad F^{*}\leq F(\theta(t))\leq F(\theta_{0})+\epsilon r_{0}|v_{0}|^{2}. (5.36)

The upper bound on FF follows directly from (5.34). We establish the convergence result through three steps.

Step 1: We demonstrate that r:=limtr(t)r^{*}:=\lim_{t\to\infty}r(t) exists and r>0r^{*}>0. First, since r˙0\dot{r}\leq 0 and r0r\geq 0, it follows that 0|r˙|𝑑tr0<\int_{0}^{\infty}|\dot{r}|dt\leq r_{0}<\infty. This ensures the existence of limtr(t)=r\lim_{t\to\infty}r(t)=r^{*}. Note that (5.33) can be expressed as

r˙=Q˙+2ϵr|v|4.\dot{r}=\dot{Q}+2\epsilon r|v|^{4}.

Integrating both sides from 0 to tt and using (5.35), we obtain:

r(t)\displaystyle r(t) =r0+F(θ(t))+ϵr(t)|v(t)|2(F(θ0)+ϵr0|v0|2)+2ϵ0tr|v|4𝑑s\displaystyle=r_{0}+F(\theta(t))+\epsilon r(t)|v(t)|^{2}-(F(\theta_{0})+\epsilon r_{0}|v_{0}|^{2})+2\epsilon\int_{0}^{t}r|v|^{4}ds
r0(F(θ0)F)ϵr0|v0|2>0.\displaystyle\geq r_{0}-(F(\theta_{0})-F^{*})-\epsilon r_{0}|v_{0}|^{2}>0. (5.37)

From this lower bound, we deduce:

r=limtr(t)r0+FF(θ0)ϵr0|v0|2>0.r^{*}=\lim_{t\to\infty}r(t)\geq r_{0}+F^{*}-F(\theta_{0})-\epsilon r_{0}|v_{0}|^{2}>0.

Step 2: We proceed to analyze the asymptotic behavior of θ\theta and vv by using LaSalle’s invariance principle [28]. Recall the function QQ defined in Theorem 5.1,

Q(θ,r,v)=F(θ)+ϵr|v|2.Q(\theta,r,v)=F(\theta)+\epsilon r|v|^{2}.

Given that r(t)r>0r(t)\geq r^{*}>0, this implies that the set

Ω={(θ,r,v)|Q(θ,r,v)Q(θ0,r0,v0)}\Omega=\{(\theta,r,v)\;|\;Q(\theta,r,v)\leq Q(\theta_{0},r_{0},v_{0})\}

is a compact positively invariant set with respect to (4.30). For any (θ,r,v)Ω(\theta,r,v)\in\Omega, we observe that

Q˙(θ,r,v)=2r|v|22ϵr|v|40.\dot{Q}(\theta,r,v)=-2r|v|^{2}-2\epsilon r|v|^{4}\leq 0.

Define the set:

Σ={(θ,r,v)Ω|Q˙(θ,r,v)=0}.\Sigma=\{(\theta,r,v)\in\Omega\;|\;\dot{Q}(\theta,r,v)=0\}.

Since r(t)r>0r(t)\geq r^{*}>0 for all t>0t>0, it follows that

Σ={(θ,r,v)Ω|r|v|=0}={(θ,r,v)Ω|v=0,rr}.\Sigma=\{(\theta,r,v)\in\Omega\;|\;r|v|=0\}=\{(\theta,r,v)\in\Omega\;|\;v=0,r\geq r^{*}\}.

Next, we identify the largest invariant set in Σ\Sigma. Suppose that at some t1t_{1}, v(t1)=0v(t_{1})=0, then we have θ˙=0,r˙=0\dot{\theta}=0,\dot{r}=0, but ϵv˙(t1)=F(θ(t1))\epsilon\dot{v}(t_{1})=\nabla F(\theta(t_{1})). As a result, the trajectory will not stay in Σ\Sigma unless F(θ(t1))=0\nabla F(\theta(t_{1}))=0. In other words, the largest invariant set in Σ\Sigma must be:

Σ0={(θ,r,v)Ω|v=0,F(θ)=0,rr}.\Sigma_{0}=\{(\theta,r,v)\in\Omega\;|\;v=0,\nabla F(\theta)=0,r\geq r^{*}\}.

By LaSalle’s invariance principle, every trajectory of (4.30) starting in Ω\Omega tends to Σ0\Sigma_{0} as tt\to\infty. From Step 1, rr^{*} is the only limit of r(t)r(t). Hence, all trajectories will admit a limit or a cluster point in

{(θ,r,0)|F(θ)=0}.\{(\theta,r^{*},0)\;|\;\nabla F(\theta)=0\}.

This implies that: limtF(θ(t))=0\lim_{t\to\infty}\nabla F(\theta(t))=0. Note that QQ is monotone and bounded, hence admitting a limit b>0b>0. This further implies that:

limtF(θ(t))=limt[Q(θ(t),r(t),v(t))ϵr(t)|v(t)|2]=binfF.\lim_{t\to\infty}F(\theta(t))=\lim_{t\to\infty}[Q(\theta(t),r(t),v(t))-\epsilon r(t)|v(t)|^{2}]=b\geq\inf F.

Step 3. Finally, we demonstrate that bb is a local minimum value of FF, rather than a local maximum value. We prove this by contradiction. Suppose bb is a local maximum value of FF. Recall that F(θ)bF(\theta)\to b, rrr\to r^{*} and |v|0|v|\to 0 as tt\to\infty. Thus, for any δ>0\delta>0, there exists t1=t1(δ)t_{1}=t_{1}(\delta) such that for any tt1t\geq t_{1}:

|F(θ(t))b|δ,|r(t)r|δ,|v(t)|δ.|F(\theta(t))-b|\leq\delta,\quad|r(t)-r^{*}|\leq\delta,\quad|v(t)|\leq\delta.

Since FF is continuous, there exists t2t1t_{2}\geq t_{1} such that:

F(θ(t2))bδ2,F(\theta(t_{2}))\leq b-\frac{\delta}{2},

where we have used the assumption that bb is a local maximum value of FF. From Step 2, we know that QQ is non-increasing in tt. Hence, for any tt2t\geq t_{2}, we have:

Q(t)Q(t2)=F(θ(t2))+ϵr(t2)|v(t2)|2bδ2+ϵ(r+δ)δ2<b,Q(t)\leq Q(t_{2})=F(\theta(t_{2}))+\epsilon r(t_{2})|v(t_{2})|^{2}\leq b-\frac{\delta}{2}+\epsilon(r^{*}+\delta)\delta^{2}<b,

provided δ\delta is small enough so that (r+δ)δ<12ϵ(r^{*}+\delta)\delta<\frac{1}{2\epsilon}. Now letting tt\to\infty, we obtain b=limtQ(t)<bb=\lim_{t\to\infty}Q(t)<b. This is a contradiction. Hence, bb has to be a local minimum value of FF, i.e., b=Fb=F^{*}.

6 Convergence rates

The speed at which the solution converges to the minimum value of ff depends on the geometry of the objective function near ff^{*}. In fact, when the PL condition is satisfied, it can be shown that f(θ(t))f(\theta(t)) converges exponentially fast to ff^{*}.

Theorem 6.1.

Consider the global solution (θ(t),r(t),v(t))(\theta(t),r(t),v(t)) to (4.30) as presented in Theorem 5.4, initialized with (θ0,r0,v0)=(θ0,F(θ0),0)(\theta_{0},r_{0},v_{0})=(\theta_{0},F(\theta_{0}),0). Assume that ff also satisfies the PL condition (2.11) with μ>0\mu>0. For any δ(0,1)\delta\in(0,1), the following bounds are valid:

f(θ(t))f(f(θ0)f)(e2μrF(θ0)(2δ)t+2μϵδ(2δ)eδϵt),f(\theta(t))-f^{*}\leq(f(\theta_{0})-f^{*})\bigg{(}e^{-\frac{2\mu r^{*}}{F(\theta_{0})(2-\delta)}t}+\frac{2\mu\epsilon}{\delta(2-\delta)}e^{-\frac{\delta}{\epsilon}t}\bigg{)}, (6.38)

given that ϵmin{ϵ1,ϵ2}\epsilon\leq\min\{\epsilon_{1},\epsilon_{2}\}, where

ϵ1=δ(1δ)F2LF(θ0),ϵ2=12μF(θ0)(2δ)(1δ)(F(θ0)+FrF(θ0)).\epsilon_{1}=\delta(1-\delta)\frac{F^{*}}{2LF(\theta_{0})},\quad\epsilon_{2}=\frac{1}{2\mu F(\theta_{0})}(2-\delta)(1-\delta)\bigg{(}F(\theta_{0})+\frac{F^{*}r^{*}}{F(\theta_{0})}\bigg{)}. (6.39)

Here, LL represents the largest eigenvalue of D2f(θ)D^{2}f(\theta), where θ\theta belongs to the solution domain in Theorem 5.1.

Remark 6.2.

(Comparison with the gradient flow (4.29) without momentum.) For small values of ϵ\epsilon, where momentum exerts a diminished influence on the dynamical system, the convergence rate is predominantly governed by the term e2μrF(θ0)(2δ)te^{-\frac{2\mu r^{*}}{F(\theta_{0})(2-\delta)}t}. Specifically:

f(θ(t))f(f(θ0)f)(e2μrF(θ0)(2δ)t),f(\theta(t))-f^{*}\leq(f(\theta_{0})-f^{*})\bigg{(}e^{-\frac{2\mu r^{*}}{F(\theta_{0})(2-\delta)}t}\bigg{)},

if ϵmin{ϵ1,ϵ2,ϵ3}\epsilon\leq\min\{\epsilon_{1},\epsilon_{2},\epsilon_{3}\} with ϵ3=F(θ0)δ(2δ)2μr\epsilon_{3}=\frac{F(\theta_{0})\delta(2-\delta)}{2\mu r^{*}}. Moreover, 2μrF(θ0)(2δ)\frac{2\mu r^{*}}{F(\theta_{0})(2-\delta)} becomes larger as δ1\delta\to 1. For δ=1\delta=1, ϵ\epsilon is compelled to be 0, reducing (4.30) to (4.28) or (4.29). In this scenario, 2μrF(θ0)(2δ)\frac{2\mu r^{*}}{F(\theta_{0})(2-\delta)} becomes 2μ2\mu, restoring the convergence rate of the gradient flow (4.29) under the PL condition [39].

Remark 6.3.

(Comparison with gradient flow with momentum and with the work [5]). To compare with GD with momentum of the form

mk\displaystyle m_{k} =βmk1+(1β)f(θk),m1=0,\displaystyle=\beta m_{k-1}+(1-\beta)\nabla f(\theta_{k}),\quad m_{-1}=0,
θk+1\displaystyle\theta_{k+1} =θkηmk,\displaystyle=\theta_{k}-\eta m_{k},

we follow the same strategy as in Section 4.1 to obtain its continuous formulation:

ϵθ¨+θ˙+f(θ)=0.\epsilon\ddot{\theta}+\dot{\theta}+\nabla f(\theta)=0.

This upon rescaling s=t/ϵs=t/\sqrt{\epsilon} leads to

θ′′+1ϵθ+f(θ)=0.\theta^{\prime\prime}+\frac{1}{\sqrt{\epsilon}}\theta^{\prime}+\nabla f(\theta)=0.

This is the second-order ODE (1.5) with a constant a(t)a(t), for which we refer to the findings presented in [5, Theorem 3.1]. When dealing with non-convex functions satisfying the PL condition (2.11), particularly in scenarios where κ>9/8\kappa>9/8 and 1ϵ=(2κκ1)μ\frac{1}{\sqrt{\epsilon}}=(2\sqrt{\kappa}-\sqrt{\kappa-1})\sqrt{\mu}, [5] establishes the estimate:

f(θ(t))fO(1)e2(κκ1)μs=O(1)e2μb1t,f(\theta(t))-f^{*}\leq O(1)e^{-2(\sqrt{\kappa}-\sqrt{\kappa-1})\sqrt{\mu}s}=O(1)e^{-2\mu b_{1}t},

where

b1=3κ13κ(κ1).b_{1}=3\kappa-1-3\sqrt{\kappa(\kappa-1)}.

We examine a scenario where the dominant factor in the convergence rate (6.38) is eδϵte^{-\frac{\delta}{\epsilon}t}, which when taking the same ϵ\epsilon as above gives the convergence rate of e2μb2te^{-2\mu b_{2}t} with

b2=δ2(5κ14κ(κ1)).b_{2}=\frac{\delta}{2}(5\kappa-1-4\sqrt{\kappa(\kappa-1)}).

While our convergence rate is comparable to the specific rate e2μb1te^{-2\mu b_{1}t}, it is considered suboptimal owing to the additional constraint ϵmin{ϵ1,ϵ2}\epsilon\leq\min\{\epsilon_{1},\epsilon_{2}\} imposed by the proof technique.

The proof outlined below comprises three steps:
Step 1: We introduce a candidate control function, denoted as

E=a(f(θ)f)ϵf(θ),v+λϵr|v|2,E=a(f(\theta)-f^{*})-\epsilon\langle\nabla f(\theta),v\rangle+\lambda\epsilon r|v|^{2},

where the parameters aa and λ>0\lambda>0 are to be determined.
Step 2: We determine admissible pairs (a,λ)(a,\lambda) such that E(t)E(t) exhibits exponential decay.
Step 3: Using W˙(t)=f(θ),θ˙=2rf(θ),v\dot{W}(t)=\langle\nabla f(\theta),\dot{\theta}\rangle=-2r\langle\nabla f(\theta),v\rangle, we link EE to W:=f(θ)fW:=f(\theta)-f^{*} by

E=aW+ϵW˙2r+λϵr|v|2,E=aW+\frac{\epsilon\dot{W}}{2r}+\lambda\epsilon r|v|^{2}, (6.40)

and derive the convergence rate of WW based on the convergence rate of EE.

Proof 6.4.

Step 1: Decay of the control function. Define

E=a(f(θ)f)ϵf(θ),v+λϵr|v|2,E=a(f(\theta)-f^{*})-\epsilon\langle\nabla f(\theta),v\rangle+\lambda\epsilon r|v|^{2}, (6.41)

where parameters a,λ>0a,\lambda>0 will be determined later so that along the trajectory of (4.30) E=E(t)E=E(t) has an exponential decay rate.

For each term in EE, we evaluate its time derivative along the trajectory of (4.30). For the first term, we get

ddt(f(θ)f)=f(θ),θ˙=2rf(θ),v.\frac{d}{dt}(f(\theta)-f^{*})=\langle\nabla f(\theta),\dot{\theta}\rangle=-2r\langle\nabla f(\theta),v\rangle.

For the second term, we have

ddtϵf(θ),v\displaystyle\frac{d}{dt}\epsilon\langle\nabla f(\theta),v\rangle =ϵ2f(θ)θ˙,v+ϵf(θ),v˙\displaystyle=\epsilon\langle\nabla^{2}f(\theta)\dot{\theta},v\rangle+\epsilon\langle\nabla f(\theta),\dot{v}\rangle
=2ϵr2f(θ)v,vf(θ),v+f(θ),F(θ).\displaystyle=-2\epsilon r\langle\nabla^{2}f(\theta)v,v\rangle-\langle\nabla f(\theta),v\rangle+\langle\nabla f(\theta),\nabla F(\theta)\rangle. (6.42)

Note that

2f(θ)v,vL|v|2,and\langle\nabla^{2}f(\theta)v,v\rangle\leq L|v|^{2},\quad\text{and}
f(θ),F(θ)=12F(θ)|f(θ)|2μF(θ)(f(θ)f).\langle\nabla f(\theta),\nabla F(\theta)\rangle=\frac{1}{2F(\theta)}|\nabla f(\theta)|^{2}\geq\frac{\mu}{F(\theta)}(f(\theta)-f^{*}).

Here we used F(θ)=f(θ)/(2F(θ))\nabla F(\theta)=\nabla f(\theta)/(2F(\theta)) and the PL property for ff.

Hence (6.42) reduces to

ddtϵf(θ),v2ϵLr|v|2f(θ),v+μF(θ)(f(θ)f).\frac{d}{dt}\epsilon\langle\nabla f(\theta),v\rangle\geq-2\epsilon Lr|v|^{2}-\langle\nabla f(\theta),v\rangle+\frac{\mu}{F(\theta)}(f(\theta)-f^{*}).

For the third term, we have

ddtϵr|v|2\displaystyle\frac{d}{dt}\epsilon r|v|^{2} =ϵr˙|v|2+2ϵrv,v˙\displaystyle=\epsilon\dot{r}|v|^{2}+2\epsilon r\langle v,\dot{v}\rangle
=2ϵr|v|4+2rv,v+F(θ)\displaystyle=-2\epsilon r|v|^{4}+2r\langle v,-v+\nabla F(\theta)\rangle
=2ϵr|v|42r|v|2+2rF(θ),v\displaystyle=-2\epsilon r|v|^{4}-2r|v|^{2}+2r\langle\nabla F(\theta),v\rangle
2r|v|2+rF(θ)f(θ),v.\displaystyle\leq-2r|v|^{2}+\frac{r}{F(\theta)}\langle\nabla f(\theta),v\rangle.

Combining the above three estimates together we obtain

E˙(t)μF(θ)(f(θ)f)+(2ϵL2λ)r|v|2+(2ar+1+λrF(θ))f(θ),v.\dot{E}(t)\leq-\frac{\mu}{F(\theta)}(f(\theta)-f^{*})+(2\epsilon L-2\lambda)r|v|^{2}+\bigg{(}-2ar+1+\frac{\lambda r}{F(\theta)}\bigg{)}\langle\nabla f(\theta),v\rangle. (6.43)

Note that (6.41) can be rewritten as

f(θ),v=aϵ(f(θ)f)+λr|v|21ϵE(t).\langle\nabla f(\theta),v\rangle=\frac{a}{\epsilon}(f(\theta)-f^{*})+\lambda r|v|^{2}-\frac{1}{\epsilon}E(t). (6.44)

This upon substitution into (6.43) gives

E˙(t)b(t)ϵE(t)+(ab(t)ϵμF(θ))(f(θ)f)+(λb(t)+2ϵL2λ)r|v|2,\dot{E}(t)\leq-\frac{b(t)}{\epsilon}E(t)+\left(\frac{ab(t)}{\epsilon}-\frac{\mu}{F(\theta)}\right)(f(\theta)-f^{*})+(\lambda b(t)+2\epsilon L-2\lambda)r|v|^{2}, (6.45)

where b(t):=2ar(t)+1+λr(t)F(θ(t))b(t):=-2ar(t)+1+\frac{\lambda r(t)}{F(\theta(t))}. This leads to

E˙(t)b(t)ϵE(t),\dot{E}(t)\leq-\frac{b(t)}{\epsilon}E(t), (6.46)

as long as we can identify a,λa,\lambda so that

b(t)=2ar(t)+1+λr(t)F(θ(t))>0,\displaystyle b(t)=-2ar(t)+1+\frac{\lambda r(t)}{F(\theta(t))}>0, (6.47a)
ab(t)ϵμF(θ(t))0,\displaystyle\frac{ab(t)}{\epsilon}-\frac{\mu}{F(\theta(t))}\leq 0, (6.47b)
λb(t)+2ϵL2λ0,\displaystyle\lambda b(t)+2\epsilon L-2\lambda\leq 0, (6.47c)

for all t>0t>0.

Step 2: Admissible choice for λ\lambda and aa. Using the solution bounds rr(t)r0r^{*}\leq r(t)\leq r_{0} and FF(θ(t))F(θ0)F^{*}\leq F(\theta(t))\leq F(\theta_{0}), we have

bb(t)b,t>0,b_{*}\leq b(t)\leq b^{*},\quad\forall t>0,

where

b=b(a,λ)=2ar0+1+λrF(θ0),\displaystyle b_{*}=b_{*}(a,\lambda)=-2ar_{0}+1+\frac{\lambda r_{*}}{F(\theta_{0})},
b=b(a,λ)=2ar+1+λr0F.\displaystyle b^{*}=b^{*}(a,\lambda)=-2ar^{*}+1+\frac{\lambda r_{0}}{F^{*}}.

In order to ensure (6.47c) we must have b<2b^{*}<2. Putting these together, condition (6.47) can be ensured by the following stronger constraints:

0<b,b<2,\displaystyle 0<b_{*},\quad b^{*}<2, (6.48a)
abϵμF(θ0)0,\displaystyle ab^{*}-\frac{\epsilon\mu}{F(\theta_{0})}\leq 0, (6.48b)
λb+2ϵL2λ0.\displaystyle\lambda b^{*}+2\epsilon L-2\lambda\leq 0. (6.48c)

A direct calculation shows that, for small ϵ\epsilon, such an admissible pair exists, with a=O(ϵ)a=O(\epsilon) and λFr0\lambda\leq\frac{F^{*}}{r_{0}}, the latter induced by b<2b^{*}<2. To be more precise, we fix δ(0,1)\delta\in(0,1) and require that

bδ.b_{*}\geq\delta.

The constraint b<2b^{*}<2 would impose a lower bound on aϵa\sim\epsilon (which we should avoid in order to stay consistent with the discrete algorithm) unless λ\lambda is chosen to satisfy b+2ar2δb^{*}+2ar^{*}\leq 2-\delta. This is equivalent to requiring

λ(1δ)Fr0.\lambda\leq(1-\delta)\frac{F^{*}}{r_{0}}.

With the above two constraints, it is safe to replace bb^{*} by 2δ2-\delta in (6.48b) and (6.48c), respectively, so that

a(2δ)ϵμF(θ0),2ϵLδλ.a(2-\delta)\leq\frac{\epsilon\mu}{F(\theta_{0})},\quad 2\epsilon L\leq\delta\lambda. (6.49)

The convergence rate estimate in the next step requires aa to be as large as possible, so we simply take

a=ϵμF(θ0)(2δ).a=\frac{\epsilon\mu}{F(\theta_{0})(2-\delta)}. (6.50)

The second relation in (6.49) now reduces to

ϵδλ2L.\epsilon\leq\frac{\delta\lambda}{2L}.

Note that the constraint bδb_{*}\geq\delta is met if 2ar01δ+λrF(θ0)2ar_{0}\leq 1-\delta+\frac{\lambda r^{*}}{F(\theta_{0})}, imposing another upper bound on ϵ\epsilon. To be concrete, we take

λ=(1δ)Fr0,\lambda=(1-\delta)\frac{F^{*}}{r_{0}}, (6.51)

then (6.48) is met if

ϵmin{ϵ1,ϵ2},\epsilon\leq\min\{\epsilon_{1},\epsilon_{2}\},

with

ϵ1=δ(1δ)F2Lr0,ϵ2=12μr0(2δ)(1δ)(F(θ0)+Frr0).\epsilon_{1}=\delta(1-\delta)\frac{F^{*}}{2Lr_{0}},\quad\epsilon_{2}=\frac{1}{2\mu r_{0}}(2-\delta)(1-\delta)\bigg{(}F(\theta_{0})+\frac{F^{*}r^{*}}{r_{0}}\bigg{)}.

This is (6.39) with r0=F(θ0)r_{0}=F(\theta_{0}). Hence, for suitably small ϵ\epsilon, we have obtained a set of admissible pairs (a,λ)(a,\lambda) with

a=ϵμF(θ0)(2δ),λ=(1δ)Fr0,a=\frac{\epsilon\mu}{F(\theta_{0})(2-\delta)},\quad\lambda=(1-\delta)\frac{F^{*}}{r_{0}}, (6.52)

as δ\delta varies in (0,1)(0,1).
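As a sanity check on these formulas, one can evaluate the admissible pair and the ϵ\epsilon thresholds for concrete constants. In the sketch below, the values of μ\mu, LL, FF^{*}, F(θ0)F(\theta_{0}) and rr^{*} are placeholder assumptions chosen only for illustration; the expressions follow (6.50)–(6.52) and the formulas for ϵ1,ϵ2\epsilon_{1},\epsilon_{2} above, with r0=F(θ0)r_{0}=F(\theta_{0}).

```python
# Placeholder problem constants (assumptions, for illustration only).
mu, L = 0.5, 4.0              # PL constant and Lipschitz constant of grad f
F_star, F_theta0 = 1.0, 3.0   # F^* and F(theta_0)
r_star = 0.4                  # lower bound r^* on r(t)
r0 = F_theta0                 # r_0 = F(theta_0)
delta = 0.5                   # fixed delta in (0, 1)

# Admissible lambda (6.51) and the epsilon thresholds epsilon_1, epsilon_2.
lam = (1 - delta) * F_star / r0
eps1 = delta * (1 - delta) * F_star / (2 * L * r0)
eps2 = (2 - delta) * (1 - delta) * (F_theta0 + F_star * r_star / r0) / (2 * mu * r0)
eps = min(eps1, eps2)         # any eps <= min{eps1, eps2} is admissible

# The choice (6.50) for a, taken as large as the constraint allows.
a = eps * mu / (F_theta0 * (2 - delta))
print(f"lambda = {lam:.4f}, eps1 = {eps1:.4f}, eps2 = {eps2:.4f}, a = {a:.6f}")
```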

Step 3: Convergence rate of WW. With above choices of (a,λ)(a,\lambda), we have

E˙(t)b(t)ϵE(t)bϵE(t).\dot{E}(t)\leq-\frac{b(t)}{\epsilon}E(t)\leq-\frac{b_{*}}{\epsilon}E(t). (6.53)

This gives

E(t)E(0)e(b/ϵ)t.E(t)\leq E(0)e^{-(b_{*}/\epsilon)t}. (6.54)

This when combined with (6.40), i.e.,

E=aW+ϵW˙2r+λϵr|v|2,E=aW+\frac{\epsilon\dot{W}}{2r}+\lambda\epsilon r|v|^{2},

and E(0)=aW(0)E(0)=aW(0) (since v0=0v_{0}=0) allows us to rewrite (6.54) as

2ar(t)ϵW(t)+W˙(t)2ar(t)W(0)ϵe(b/ϵ)t.\frac{2ar(t)}{\epsilon}W(t)+\dot{W}(t)\leq\frac{2ar(t)W(0)}{\epsilon}e^{-(b_{*}/\epsilon)t}.

That is

ddt(W(t)e0t(2ar(s)/ϵ)𝑑s)2ar(t)W(0)ϵe(b/ϵ)t+0t(2ar(s)/ϵ)𝑑s.\frac{d}{dt}(W(t)e^{\int_{0}^{t}(2ar(s)/\epsilon)ds})\leq\frac{2ar(t)W(0)}{\epsilon}e^{-(b_{*}/\epsilon)t+\int_{0}^{t}(2ar(s)/\epsilon)ds}.

Integration of this gives

W(t)\displaystyle W(t) W(0)e0t(2ar(s)/ϵ)𝑑s+2aW(0)ϵe0t(2ar(s)/ϵ)𝑑s0tr(s)e(b/ϵ)s+0s(2ar(τ)/ϵ)𝑑τ𝑑s\displaystyle\leq W(0)e^{-\int_{0}^{t}(2ar(s)/\epsilon)ds}+\frac{2aW(0)}{\epsilon}e^{-\int_{0}^{t}(2ar(s)/\epsilon)ds}\int_{0}^{t}r(s)e^{-(b_{*}/\epsilon)s+\int_{0}^{s}(2ar(\tau)/\epsilon)d\tau}ds
W(0)e(2ar/ϵ)t+2ar0W(0)ϵ0te(b/ϵ)s𝑑s\displaystyle\leq W(0)e^{-(2ar^{*}/\epsilon)t}+\frac{2ar_{0}W(0)}{\epsilon}\int_{0}^{t}e^{-(b_{*}/\epsilon)s}ds
W(0)e(2ar/ϵ)t+2ar0W(0)|b|e(b/ϵ)t.\displaystyle\leq W(0)e^{-(2ar^{*}/\epsilon)t}+\frac{2ar_{0}W(0)}{|b_{*}|}e^{-(b_{*}/\epsilon)t}. (6.55)

Recall that in Step 2, aa is chosen as ϵμF(θ0)(2δ)\frac{\epsilon\mu}{F(\theta_{0})(2-\delta)} for a fixed δ(0,1)\delta\in(0,1) and ϵmin{ϵ1,ϵ2}\epsilon\leq\min\{\epsilon_{1},\epsilon_{2}\} so that bδb_{*}\geq\delta. Also using r0=F(θ0)r_{0}=F(\theta_{0}), we have

W(t)W(0)(e2μrF(θ0)(2δ)t+2μϵδ(2δ)eδϵt).W(t)\leq W(0)\bigg{(}e^{-\frac{2\mu r^{*}}{F(\theta_{0})(2-\delta)}t}+\frac{2\mu\epsilon}{\delta(2-\delta)}e^{-\frac{\delta}{\epsilon}t}\bigg{)}. (6.56)

This is (6.38) with W(t)=f(θ(t))fW(t)=f(\theta(t))-f^{*}.
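For readers who wish to see the estimate (6.56) in action, the following Python sketch integrates a one-dimensional instance of the dynamics by forward Euler and prints f(θ(t))ff(\theta(t))-f^{*} next to the bound. The form of the system used in the code (ϵv˙=v+F(θ)\epsilon\dot{v}=-v+\nabla F(\theta), r˙=2r|v|2\dot{r}=-2r|v|^{2}, θ˙=2rv\dot{\theta}=-2rv with v(0)=0v(0)=0, r(0)=F(θ0)r(0)=F(\theta_{0})) is inferred from the relations used in Step 1 and is an assumption here, as are the objective, the value of ϵ\epsilon, and the step size; the final value of r(t)r(t) serves only as a numerical proxy for rr^{*}.

```python
import numpy as np

# A simple PL objective (1-D): f(theta) = 0.5*theta^2, so mu = L = 1 and f* = 0.
f      = lambda th: 0.5 * th ** 2
grad_f = lambda th: th
c = 1.0                                    # shift so that f + c > 0
F      = lambda th: np.sqrt(f(th) + c)
grad_F = lambda th: grad_f(th) / (2.0 * F(th))

eps, dt, T = 0.05, 1e-4, 10.0              # assumed epsilon and (arbitrary) step size
theta, v = 2.0, 0.0                        # v(0) = 0, as used for E(0) = a*W(0)
r = F(theta)                               # r(0) = F(theta_0)
mu, delta = 1.0, 0.5
W0, F0 = f(theta), F(theta)

history = []
for k in range(int(T / dt)):
    # Forward-Euler step of the assumed form of (4.30):
    #   eps*v' = -v + grad F(theta),   r' = -2*r*|v|^2,   theta' = -2*r*v
    dv  = (-v + grad_F(theta)) / eps
    dr  = -2.0 * r * v ** 2
    dth = -2.0 * r * v
    v, r, theta = v + dt * dv, r + dt * dr, theta + dt * dth
    history.append(((k + 1) * dt, f(theta)))

r_star = r                                 # final r as a numerical proxy for r^*
for t, W in history[::20000]:
    bound = W0 * (np.exp(-2 * mu * r_star * t / (F0 * (2 - delta)))
                  + 2 * mu * eps / (delta * (2 - delta)) * np.exp(-delta * t / eps))
    print(f"t={t:5.2f}   f(theta)-f* = {W:9.3e}   bound (6.56) = {bound:9.3e}")
```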

Remark 6.5.

In Step 3, equation (6.55) reveals that the convergence rate is dominated by eKte^{-Kt}, where K(a,λ;ϵ):=min{2ar/ϵ,b/ϵ}K(a,\lambda;\epsilon):=\min\{2ar^{*}/\epsilon,b_{*}/\epsilon\}. Consequently, in the current methodology, determining the optimal values of a,λa,\lambda involves solving the following constrained optimization problem:

maxa,λK(a,λ;ϵ)\displaystyle\max_{a,\lambda}K(a,\lambda;\epsilon) (6.57)
s.t.\displaystyle s.t. 0<b<b<2,abϵμB0,λb+2ϵL2λ0.\displaystyle 0<b_{*}<b^{*}<2,\quad ab^{*}-\frac{\epsilon\mu}{B}\leq 0,\quad\lambda b^{*}+2\epsilon L-2\lambda\leq 0.

The constraint 0<b<b<20<b_{*}<b^{*}<2 in (6.57) forms an open set, making it natural to confine the range to δb<b2δ\delta\leq b_{*}<b^{*}\leq 2-\delta for a small δ>0\delta>0. Notably, when ϵ\epsilon is sufficiently small, the expression

K(a,λ;ϵ)=2ar/ϵK(a,\lambda;\epsilon)=2ar^{*}/\epsilon

holds true, owing to the fact that aϵμF(θ0)(2δ)a\leq\frac{\epsilon\mu}{F(\theta_{0})(2-\delta)} in Step 2. Consequently, the estimation obtained in Step 2 is nearly optimal within the confines of the present approach.
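A brute-force way to explore (6.57) is a grid search over (a,λ)(a,\lambda) for fixed ϵ\epsilon and δ\delta, keeping only pairs that satisfy the constraints. The sketch below does this for placeholder constants (all numerical values are assumptions chosen for illustration); it is not part of the analysis, but it gives a feel for how tight the admissible region is.

```python
import numpy as np

# Placeholder constants (assumptions, for illustration only).
mu, L = 0.5, 4.0
F_star, F_theta0 = 1.0, 3.0
r_star, r0 = 0.4, 3.0            # r0 = F(theta_0)
eps, delta = 0.005, 0.05         # small eps; delta confines the open constraint set

def rate_K(a, lam):
    """Return K(a, lam; eps) = min{2*a*r^*, b_*}/eps if (a, lam) is admissible, else None."""
    b_lo = -2 * a * r0 + 1 + lam * r_star / F_theta0      # b_*
    b_hi = -2 * a * r_star + 1 + lam * r0 / F_star        # b^*
    admissible = (delta <= b_lo <= b_hi <= 2 - delta
                  and a * b_hi <= eps * mu / F_theta0
                  and lam * b_hi + 2 * eps * L <= 2 * lam)
    return min(2 * a * r_star, b_lo) / eps if admissible else None

candidates = []
for a in np.linspace(1e-6, 0.005, 200):
    for lam in np.linspace(1e-3, 1.0, 200):
        K = rate_K(a, lam)
        if K is not None:
            candidates.append((K, a, lam))
print("best (K, a, lambda):", max(candidates) if candidates else "none found on this grid")
```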

7 Convergence of trajectories

In this section, we provide insights into the previously established results as stated in Theorems 5.4 and 6.1 and put forth several variations and extensions.

In Theorem 5.4, we demonstrated that F(θ(t))0\nabla F(\theta(t))\to 0 and F(θ(t))FF(\theta(t))\to F^{*}. It is reasonable to anticipate that, under appropriate conditions on FF (or ff), θ(t)\theta(t) will converge toward a single minimum point of FF. On the other hand, the PL condition (2.11) used in Theorem 6.1 pertains to the geometric attributes of ff rather than its regularity. We revisit the details of the proof of Theorem 6.1, wherein for sufficiently small ϵ\epsilon, EE is bounded below by a positive multiple of (ff)+|v|2(f-f^{*})+|v|^{2}. Consequently, the exponential decay of EE extends to |v||v|, leading to the convergence of θ(t)\theta(t) since |θ˙|=2r|v||\dot{\theta}|=2r|v|.

It is interesting to explore the minimal general condition on ff that would ensure the convergence of θ(t)\theta(t) towards a single limit point. In general, addressing this question poses significant challenges. However, we can identify a larger class of functions beyond those covered by the PL condition. It is important to note that, following Łojasiewicz’s groundbreaking work on real-analytic functions [48, 33], the key factor ensuring convergence of θ(t)\theta(t) with θ˙=f(θ)\dot{\theta}=-\nabla f(\theta) lies in the geometric properties of ff. This is exemplified by a counterexample presented by Palis and De Melo [24, p.14], highlighting that CC^{\infty} smoothness is insufficient to guarantee single limit-point convergence. These findings underscore the importance of the Łojasiewicz inequality for gradient flows. The Łojasiewicz inequality asserts that for a real-analytic function and a critical point aa there exists α(0,1)\alpha\in(0,1) such that the function |ff(a)|1α|f|1|f-f(a)|^{1-\alpha}|\nabla f|^{-1} remains bounded around aa. Such a gradient inequality has been extended by Kurdyka [27] to C1C^{1} functions whose graphs belong to an o-minimal structure, and Bolte et al. [14, 12] extended it to a broad class of nonsmooth subanalytic functions. This gradient inequality, or its generalization with a desingularizing function, is known as the Kurdyka-Łojasiewicz inequality. In the optimization literature, the KL inequality has proven to be a powerful tool for characterizing the convergence properties of iterative algorithms; refer to, for instance, [1, 6, 13, 21, 23].

For the system (4.30), we show that the Łojasiewicz inequality is sufficient to ensure the convergence of θ(t)\theta(t). Additionally, we provide convergence rates for θ(t)\theta(t) under different values of α\alpha.

Definition 7.1 (Łojasiewicz inequality).

For a differentiable function f:nf:\mathbb{R}^{n}\to\mathbb{R} with argminf{\rm argmin}f\neq\emptyset, we say ff satisfies the Łojasiewicz inequality if there exists σ>0\sigma>0, c>0c>0, and α(0,1)\alpha\in(0,1), such that the following inequality holds for any aargminfa\in{\rm argmin}f:

c|f(θ)f(a)|1α|f(θ)|,θB(a,σ),c|f(\theta)-f(a)|^{1-\alpha}\leq|\nabla f(\theta)|,\quad\forall\theta\in B(a,\sigma), (7.58)

where B(a,σ)B(a,\sigma) is the neighborhood of aa with radius σ\sigma.

Note that a diverse range of functions has been shown to satisfy (7.58), ranging from real-analytic functions [48] to non-smooth lower semi-continuous functions [14]. The PL condition (2.11) stated as a global condition corresponds to (7.58) with α=1/2\alpha=1/2 and c=2μc=\sqrt{2\mu}.
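As a simple illustration of Definition 7.1 (not part of the argument), the sketch below checks (7.58) on a grid for the one-dimensional example f(θ)=θ4f(\theta)=\theta^{4}, which satisfies the inequality near its minimizer θ=0\theta=0 with α=1/4\alpha=1/4 and c=4c=4; the choice of example, radius, and grid are assumptions made purely for the illustration.

```python
import numpy as np

# Example: f(theta) = theta^4, argmin f = {0}.  Here |f - f(0)|^{3/4} = |theta|^3 and
# |f'(theta)| = 4|theta|^3, so (7.58) holds with alpha = 1/4 and c = 4.  The grid check
# below merely illustrates the inequality; it is not a proof.
f      = lambda th: th ** 4
grad_f = lambda th: 4 * th ** 3
alpha, c, sigma = 0.25, 4.0, 1.0

thetas = np.linspace(-sigma, sigma, 2001)
lhs = c * np.abs(f(thetas)) ** (1 - alpha)     # c |f(theta) - f(a)|^{1-alpha} with f(a) = 0
rhs = np.abs(grad_f(thetas))                   # |grad f(theta)|
print("(7.58) holds on the grid:", bool(np.all(lhs <= rhs + 1e-12)))
```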

Lemma 7.2.

Let f:nf:\mathbb{R}^{n}\to\mathbb{R} satisfy the Łojasiewicz inequality (7.58), then

|F(θ)F(a)|1α|F(θ)|,θB(a,σ).\displaystyle|F(\theta)-F(a)|^{1-\alpha}\leq|\nabla F(\theta)|,\quad\forall\theta\in B(a,\sigma). (7.59)
Proof 7.3.

Using the relation F2=f+cF^{2}=f+c, inequality (7.58) reduces to

c|F(θ)+F(a)|1α|F(θ)F(a)|1α2|F(θ)||F(θ)|.c|F(\theta)+F(a)|^{1-\alpha}|F(\theta)-F(a)|^{1-\alpha}\leq 2|F(\theta)||\nabla F(\theta)|.

This leads to (7.59) if cc is chosen as 2maxθB(a,σ)|F(θ)||F+F(a)|α12\max_{\theta\in B(a,\sigma)}|F(\theta)||F+F(a)|^{\alpha-1}.

Theorem 7.4.

Consider θ(t)\theta(t) as a bounded solution of (4.30) as stated in Theorem 5.4, and further assume that FF satisfies (7.59). Then θ˙L1([0,+])\dot{\theta}\in L^{1}([0,+\infty]) and θ(t)\theta(t) converges towards a local minimum point of FF as tt\to\infty.

Proof 7.5.

Let ω(θ)\omega(\theta) be the ω\omega-limit set of θ\theta. According to Theorem 5.4, F(θ(t))FF(\theta(t))\to F^{*}, ensuring F=FF=F^{*} on ω(θ)\omega(\theta), and

limtdist(θ(t),ω(θ))=0.\lim_{t\to\infty}\text{dist}(\theta(t),\omega(\theta))=0.

The inequality (7.59) asserts the existence of T>0T>0 and α(0,1)\alpha\in(0,1) such that for any tTt\geq T,

|F(θ(t))F|1α|F(θ(t))|.|F(\theta(t))-F^{*}|^{1-\alpha}\leq|\nabla F(\theta(t))|.

The proof of the convergence of θ(t)\theta(t) relies on a novel functional

R(t)=Q(t)F+λϵ2|v˙(t)|2,R(t)=Q(t)-F^{*}+\frac{\lambda\epsilon}{2}|\dot{v}(t)|^{2},

with the parameter λ\lambda yet to be determined. A direct calculation gives

R˙\displaystyle\dot{R} =Q˙+λϵv˙v¨\displaystyle=\dot{Q}+\lambda\epsilon\dot{v}\cdot\ddot{v}
=2r|v|2(1+ϵ|v|2)+λv˙ddt(F(θ(t))v)\displaystyle=-2r|v|^{2}(1+\epsilon|v|^{2})+\lambda\dot{v}\cdot\frac{d}{dt}(\nabla F(\theta(t))-v)
=2r|v|2(1+ϵ|v|2)λ|v˙|2+λv˙D2Fθ˙\displaystyle=-2r|v|^{2}(1+\epsilon|v|^{2})-\lambda|\dot{v}|^{2}+\lambda\dot{v}\cdot D^{2}F\dot{\theta}
=2r|v|2(1+ϵ|v|2)λ|v˙|22λrv˙D2Fv\displaystyle=-2r|v|^{2}(1+\epsilon|v|^{2})-\lambda|\dot{v}|^{2}-2\lambda r\dot{v}\cdot D^{2}Fv
2r|v|2(1+ϵ|v|2)λ|v˙|2+λLFδ|v˙|2+λLFr2δ|v|2\displaystyle\leq-2r|v|^{2}(1+\epsilon|v|^{2})-\lambda|\dot{v}|^{2}+\lambda L_{F}\delta|\dot{v}|^{2}+\frac{\lambda L_{F}r^{2}}{\delta}|v|^{2}
r|v|2(2λLFr0δ)λ|v˙|2(1LFδ).\displaystyle\leq-r|v|^{2}(2-\frac{\lambda L_{F}r_{0}}{\delta})-\lambda|\dot{v}|^{2}(1-L_{F}\delta).

Here 2abδa2+b2δ-2ab\leq\delta a^{2}+\frac{b^{2}}{\delta} for any δ>0\delta>0 and D2F(θ)LF\|D^{2}F(\theta)\|\leq L_{F} were used in the first inequality. Take δ=1/(2LF)\delta=1/(2L_{F}) and λ=1/(2LF2r0)\lambda=1/(2L_{F}^{2}r_{0}), then

R˙\displaystyle\dot{R} r|v|212λ|v˙|2\displaystyle\leq-r^{*}|v|^{2}-\frac{1}{2}\lambda|\dot{v}|^{2}
=r|v|214LF2r0|v˙|2\displaystyle=-r^{*}|v|^{2}-\frac{1}{4L_{F}^{2}r_{0}}|\dot{v}|^{2}
12min{r,14LF2r0}(|v|+|v˙|)2\displaystyle\leq-\frac{1}{2}\min\{r^{*},\frac{1}{4L_{F}^{2}r_{0}}\}(|v|+|\dot{v}|)^{2}
=:C1(|v|+|v˙|)2.\displaystyle=:-C_{1}(|v|+|\dot{v}|)^{2}.

In the last inequality, we used the inequality (a+b)22(a2+b2)(a+b)^{2}\leq 2(a^{2}+b^{2}). Conversely,

R(t)\displaystyle R(t) =F(θ(t))F+ϵr|v|2+ϵ4LF2r0|v˙|2\displaystyle=F(\theta(t))-F^{*}+\epsilon r|v|^{2}+\frac{\epsilon}{4L_{F}^{2}r_{0}}|\dot{v}|^{2}
C2(|F(θ(t))F|+|v|2+|v˙|2),\displaystyle\leq C_{2}(|F(\theta(t))-F^{*}|+|v|^{2}+|\dot{v}|^{2}),

where C2=max{1,ϵr0,ϵ4LF2r0}.C_{2}=\max\{1,\epsilon r_{0},\frac{\epsilon}{4L_{F}^{2}r_{0}}\}. We proceed by distinguishing two cases:
(i) α(0,1/2]\alpha\in(0,1/2]. Using the inequality (a+b)1αa1α+b1α(a+b)^{1-\alpha}\leq a^{1-\alpha}+b^{1-\alpha}, we obtain

R(t)1αC21α(|F(θ(t))F|1α+|v|2(1α)+|v˙|2(1α)).R(t)^{1-\alpha}\leq C_{2}^{1-\alpha}(|F(\theta(t))-F^{*}|^{1-\alpha}+|v|^{2(1-\alpha)}+|\dot{v}|^{2(1-\alpha)}).

Using (7.59), for tTt\geq T,

R(t)1αC21α(|F(θ(t))|+|v|2(1α)+|v˙|2(1α)).R(t)^{1-\alpha}\leq C_{2}^{1-\alpha}(|\nabla F(\theta(t))|+|v|^{2(1-\alpha)}+|\dot{v}|^{2(1-\alpha)}).

Since |v(t)|,|v˙(t)|0|v(t)|,|\dot{v}(t)|\to 0 as tt\to\infty and 2(1α)12(1-\alpha)\geq 1,

R(t)1α\displaystyle R(t)^{1-\alpha} C21α(|F(θ(t))|+|v(t)|+|v˙(t)|)\displaystyle\leq C_{2}^{1-\alpha}(|\nabla F(\theta(t))|+|v(t)|+|\dot{v}(t)|)
C21α(|v+ϵv˙|+|v(t)|+|v˙(t)|)\displaystyle\leq C_{2}^{1-\alpha}(|v+\epsilon\dot{v}|+|v(t)|+|\dot{v}(t)|)
C21α(2|v(t)|+(ϵ+1)|v˙(t)|)\displaystyle\leq C_{2}^{1-\alpha}(2|v(t)|+(\epsilon+1)|\dot{v}(t)|)
C3(|v(t)|+|v˙(t)|),\displaystyle\leq C_{3}(|v(t)|+|\dot{v}(t)|), (7.60)

where C3=C21αmax{2,ϵ+1}C_{3}=C_{2}^{1-\alpha}\max\{2,\epsilon+1\}.

We are prepared to prove that θ˙\dot{\theta} belongs to L1([0,])L^{1}([0,\infty]) by considering tTt\geq T:

ddtR(t)α\displaystyle-\frac{d}{dt}R(t)^{\alpha} =αR(t)α1R˙=αR˙R1α\displaystyle=-\alpha R(t)^{\alpha-1}\cdot\dot{R}=\alpha\frac{-\dot{R}}{R^{1-\alpha}}
αC1(|v|+|v˙|)2C3(|v|+|v˙|)\displaystyle\geq\alpha\frac{C_{1}(|v|+|\dot{v}|)^{2}}{C_{3}(|v|+|\dot{v}|)}
=1C(|v|+|v˙|),C:=C3αC1.\displaystyle=\frac{1}{C}(|v|+|\dot{v}|),\quad C:=\frac{C_{3}}{\alpha C_{1}}.

Integrating both sides from TT to \infty yields:

T(|v|+|v˙|)𝑑tC(R(T)αlimtR(t)α)=CR(T)α.\int_{T}^{\infty}(|v|+|\dot{v}|)dt\leq C(R(T)^{\alpha}-\lim_{t\to\infty}R(t)^{\alpha})=CR(T)^{\alpha}. (7.61)

Here, we take into account that limtR(t)α=0\lim_{t\to\infty}R(t)^{\alpha}=0. Note that |θ˙|=2r|v|2r0|v||\dot{\theta}|=2r|v|\leq 2r_{0}|v|, hence

T|θ˙|𝑑t2r0T|v|𝑑t2r0CR(T)α2r0CR(0)α.\int_{T}^{\infty}|\dot{\theta}|dt\leq 2r_{0}\int_{T}^{\infty}|v|dt\leq 2r_{0}CR(T)^{\alpha}\leq 2r_{0}CR(0)^{\alpha}.

Thus θ˙\dot{\theta} belongs to L1([0,])L^{1}([0,\infty]), implying that, θ:=limtθ(t)\theta^{*}:=\lim_{t\to\infty}\theta(t) exists, and F(θ)=FF(\theta^{*})=F^{*}.
(ii) For α(1/2,1)\alpha\in(1/2,1), we set α=α/2(0,1/2)\alpha^{\prime}=\alpha/2\in(0,1/2). With this adjustment, (7.59) remains valid for α\alpha^{\prime} since

|FF|1α|FF|1α|F-F^{*}|^{1-\alpha^{\prime}}\leq|F-F^{*}|^{1-\alpha}

for |FF||F-F^{*}| suitably small. Also, (7.60) holds with α\alpha replaced by α\alpha^{\prime}. Thus, we conclude the convergence of θ(t)\theta(t) in this case. Moreover,

T(|v|+|v˙|)𝑑tCR(T)α/2\int_{T}^{\infty}(|v|+|\dot{v}|)dt\leq CR(T)^{\alpha/2}

for some constant CC.

Before getting into the estimation of the rate of convergence, let us define

u(t):=t(|v(s)|+|v˙(s)|)𝑑s.u(t):=\int_{t}^{\infty}(|v(s)|+|\dot{v}(s)|)ds. (7.62)

This expression serves to control the tail length function for both θ(t)\theta(t) and v(t)v(t). In fact,

|θ(t)θ|\displaystyle|\theta(t)-\theta^{*}| t|θ˙(s)|𝑑s=t2r(s)|v(s)|𝑑s2r0t|v(s)|𝑑s2r0u(t).\displaystyle\leq\int^{\infty}_{t}|\dot{\theta}(s)|ds=\int^{\infty}_{t}2r(s)|v(s)|ds\leq 2r_{0}\int^{\infty}_{t}|v(s)|ds\leq 2r_{0}u(t). (7.63)

From (7.61) in the proof of Theorem 7.4, we see that

u(T)CR(T)β,β={α,if 0<α12,α/2,if 12<α<1.u(T)\leq CR(T)^{\beta},\quad\beta=\begin{cases}\alpha,&\text{if $0<\alpha\leq\frac{1}{2}$},\\ \alpha/2,&\text{if $\frac{1}{2}<\alpha<1$}.\end{cases}

This inequality remains true for every tTt\geq T, thus

u(t)CR(t)βtT.\displaystyle u(t)\leq CR(t)^{\beta}\quad\forall t\geq T. (7.64)

We now present the following result.

Theorem 7.6.

Under the same conditions as in Theorem 7.4, there exists TT such that for any tTt\geq T, the following results hold:

  1. (1)

    If α=12\alpha=\frac{1}{2}, then

    |θ(t)θ|C(T)eKt;|\theta(t)-\theta^{*}|\leq C(T)e^{-Kt};
  2. (2)

    If 0<α<120<\alpha<\frac{1}{2}, then

    |θ(t)θ|C(T)(t+1)α12α;|\theta(t)-\theta^{*}|\leq C(T)(t+1)^{-\frac{\alpha}{1-2\alpha}};
  3. (3)

    If 12<α<1\frac{1}{2}<\alpha<1, then

    |θ(t)θ|C(T)(t+1)α22α,|\theta(t)-\theta^{*}|\leq C(T)(t+1)^{-\frac{\alpha}{2-2\alpha}},

where C(T)C(T) depends on TT and K>0K>0 is independent of TT.
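These polynomial rates can also be observed numerically. The sketch below integrates a one-dimensional instance of the dynamics (the same assumed form ϵv˙=v+F(θ)\epsilon\dot{v}=-v+\nabla F(\theta), r˙=2r|v|2\dot{r}=-2r|v|^{2}, θ˙=2rv\dot{\theta}=-2rv as before) for f(θ)=θ4f(\theta)=\theta^{4}, whose Łojasiewicz exponent near θ=0\theta^{*}=0 is α=1/4\alpha=1/4, and compares |θ(t)||\theta(t)| with the predicted order (t+1)α12α=(t+1)1/2(t+1)^{-\frac{\alpha}{1-2\alpha}}=(t+1)^{-1/2}. All numerical parameters are illustrative assumptions, and only the decay exponent, not the constant, is expected to match.

```python
import numpy as np

# Toy objective with Lojasiewicz exponent alpha = 1/4 at theta* = 0.
f      = lambda th: th ** 4
grad_f = lambda th: 4 * th ** 3
c = 1.0                                    # shift so that F = sqrt(f + c) > 0
F      = lambda th: np.sqrt(f(th) + c)
grad_F = lambda th: grad_f(th) / (2.0 * F(th))

eps, dt = 0.05, 1e-3                       # assumed epsilon and step size
theta, v = 1.0, 0.0
r = F(theta)
alpha = 0.25
checkpoints = {10.0, 100.0, 1000.0}

for k in range(int(1000.0 / dt)):
    # Forward-Euler step of the assumed form of (4.30).
    dv  = (-v + grad_F(theta)) / eps
    dr  = -2.0 * r * v ** 2
    dth = -2.0 * r * v
    v, r, theta = v + dt * dv, r + dt * dr, theta + dt * dth
    t = (k + 1) * dt
    if round(t, 6) in checkpoints:
        predicted = (t + 1) ** (-alpha / (1 - 2 * alpha))
        print(f"t={t:7.1f}   |theta(t)| = {abs(theta):.4e}   (t+1)^(-1/2) = {predicted:.4e}")
```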

Proof 7.7.

Let B(θ,σ)B(\theta^{*},\sigma) be a neighborhood of θ\theta^{*} in which the Łojasiewicz inequality holds. Since θ(t)\theta(t) converges to θ\theta^{*}, there exists T>0T>0 such that θ(t)B(θ,σ)\theta(t)\in B(\theta^{*},\sigma) for every tTt\geq T. In particular, (7.64) holds. It suffices to consider β=α(0,1/2]\beta=\alpha\in(0,1/2]:

u(t)\displaystyle u(t) CR(t)α=C(R(t)1α)α1α\displaystyle\leq CR(t)^{\alpha}=C(R(t)^{1-\alpha})^{\frac{\alpha}{1-\alpha}}
C(C3(|v(t)|+|v˙(t)|))α1α=CC3α1α(u˙(t))α1α,\displaystyle\leq C(C_{3}(|v(t)|+|\dot{v}(t)|))^{\frac{\alpha}{1-\alpha}}=CC_{3}^{\frac{\alpha}{1-\alpha}}(-\dot{u}(t))^{\frac{\alpha}{1-\alpha}},

where C3C_{3} was defined in (7.60). The above can be rearranged as

u˙(t)Ku(t)1αα,\dot{u}(t)\leq-Ku(t)^{\frac{1-\alpha}{\alpha}}, (7.65)

where K=(C1ααC3)1K=(C^{\frac{1-\alpha}{\alpha}}C_{3})^{-1}. Next we define an auxiliary comparison problem of the form

z˙(t)=Kz(t)1αα,z(T)=u(T),\dot{z}(t)=-Kz(t)^{\frac{1-\alpha}{\alpha}},\quad z(T)=u(T), (7.66)

so that u(t)z(t)u(t)\leq z(t) by comparison; hence by (7.63) we obtain |θ(t)θ|2r0z(t)|\theta(t)-\theta^{*}|\leq 2r_{0}z(t).

The solution of z(t)z(t) can be determined for two cases:
(1) α=12\alpha=\frac{1}{2}. In this case, (7.66) takes the form z˙=Kz\dot{z}=-Kz and admits the unique solution

z(t)=u(T)eKTeKt.z(t)=u(T)e^{KT}e^{-Kt}.

(2) 0<α<120<\alpha<\frac{1}{2}. The exact solution in such case can be obtained as

z(t)=(u(T)21/α+K(1α2)(tT))α12αz(t)=\left(u(T)^{2-1/\alpha}+K(\frac{1}{\alpha}-2)(t-T)\right)^{-\frac{\alpha}{1-2\alpha}}

which is bounded by C(T)(t+1)α12αC(T)(t+1)^{-\frac{\alpha}{1-2\alpha}}.
(3) 12<α<1\frac{1}{2}<\alpha<1. This case reduces to case (2) by simply replacing α\alpha by α/2\alpha/2 in the obtained convergence rate for (2). The proof is complete.
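The closed-form expression in case (2) can be cross-checked against a direct numerical integration of (7.66). The sketch below does so for arbitrary illustrative values of α\alpha, KK, and u(T)u(T); these numbers are assumptions, not quantities from the proof.

```python
# Check the closed-form solution of (7.66) for 0 < alpha < 1/2 against forward-Euler integration.
alpha, K, uT, T = 0.25, 2.0, 1.0, 0.0       # illustrative values (assumptions)
p = (1 - alpha) / alpha                     # exponent in z' = -K z^p

def z_exact(t):
    # z(t) = (u(T)^{2 - 1/alpha} + K*(1/alpha - 2)*(t - T))^{-alpha/(1 - 2*alpha)}
    return (uT ** (2 - 1 / alpha) + K * (1 / alpha - 2) * (t - T)) ** (-alpha / (1 - 2 * alpha))

z, dt, t_end = uT, 1e-5, T + 5.0
for _ in range(int((t_end - T) / dt)):      # integrate z' = -K z^p on [T, t_end]
    z += dt * (-K * z ** p)
print(f"numerical z({t_end}) = {z:.6f}    closed form = {z_exact(t_end):.6f}")
```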

Remark 7.8.

The convergence rate in (3) is suboptimal as it relies on (7.64) with β=α/2\beta=\alpha/2 for 1/2<α<11/2<\alpha<1. In this scenario, the full potential of the Łojasiewicz inequality is not used in the proof of Theorem 7.4. Had (7.64) held with β=α\beta=\alpha in such cases, it would have led to z(t)=0z(t)=0 for some finite tt, indicating a faster, finite-time convergence. This assertion is demonstrated in [14, Theorem 4.7] for the case θ˙=f(θ)\dot{\theta}=-\nabla f(\theta).

Remark 7.9.

The exponential convergence rate in (1) aligns with the result obtained in Theorem 6.1, though the constant KK obtained here may not be optimally sharp.

Remark 7.10.

A counterexample by Baillon [9] for the steepest descent equation θ˙+f(θ)=0\dot{\theta}+\nabla f(\theta)=0 suggests that convexity alone is likely not sufficient for the trajectories of (1.4) to converge strongly. In the case of a general convex function ff, one may instead aim to demonstrate the weak convergence of trajectories, following the approach outlined in [4, Theorem 5.1] for a dissipative dynamical system with Hessian-driven damping. It is important to note that the condition (7.58) is insufficient to imply convexity of the function. This is exemplified by f(θ)=|θ2sin(θ1)|1/αf(\theta)=|\theta_{2}-\sin(\theta_{1})|^{1/\alpha} with α(0,1)\alpha\in(0,1). Although this function satisfies (7.58) with c=1/αc=1/\alpha, it is not convex. Furthermore, the set of minimizers of ff, characterized by the graph of θ2=sinθ1\theta_{2}=\sin\theta_{1}, is also non-convex.

8 Discussion

This paper explores a novel energy-adaptive gradient algorithm with momentum, termed AGEM. Empirically, we observe that the inclusion of momentum significantly accelerates convergence, particularly in non-convex settings. Theoretical investigations reveal that AGEM retains the unconditional energy stability property of AEGD. Additionally, we establish convergence to critical points for general non-convex objective functions when step sizes are suitably small.

To capture the dynamical behavior of AGEM, we derive a high resolution ODE system that preserves the momentum effect. This system, featuring infinitely many steady states, poses challenges in analysis. Nonetheless, leveraging various analytical tools, we successfully establish a series of results: (i) establishing global well-posedness through a construction of a suitable Lyapunov function; (ii) characterizing the time-asymptotic behavior of the solution towards critical points, using the LaSalle invariance principle; (iii) deriving a linear convergence rate for non-convex objective functions that satisfy the PL condition, and (iv) demonstrating the finite length of θ(t)\theta(t) and its convergence to the minimum of ff based on the Łojasiewicz inequality.

We anticipate that the analytical framework developed in this article can be extended to study other variants of AEGD. As an illustration, the associated system for scheme (2.7), which takes the form

ϵm˙\displaystyle\epsilon\dot{m} =m+f(θ),\displaystyle=-m+\nabla f(\theta),
v\displaystyle v =m2F(θ)(1et/ϵ),\displaystyle=\frac{m}{2F(\theta)(1-e^{-t/\epsilon})},
r˙\displaystyle\dot{r} =2r|v|2,\displaystyle=-2r|v|^{2},
θ˙\displaystyle\dot{\theta} =2rv\displaystyle=-2rv

is anticipated to exhibit analogous solution properties. In the context of large-scale optimization problems, the practical preference often leans towards a stochastic version of AGEM. Consequently, it becomes intriguing to investigate the convergence behavior of AGEM in the stochastic setting.
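A minimal numerical sketch of this associated system is given below (one-dimensional toy objective, forward Euler, arbitrary ϵ\epsilon and step size; all numerical choices are assumptions made for illustration). Taking m(0)=0m(0)=0 keeps vv finite as t0+t\to 0^{+}, since mm and the factor 1et/ϵ1-e^{-t/\epsilon} vanish at the same rate.

```python
import numpy as np

# One-dimensional toy objective; F = sqrt(f + c) with c chosen so that f + c > 0.
f      = lambda th: 0.5 * th ** 2
grad_f = lambda th: th
c = 1.0
F = lambda th: np.sqrt(f(th) + c)

eps, dt, T = 0.05, 1e-4, 5.0
theta, m = 2.0, 0.0                         # m(0) = 0
r = F(theta)                                # r(0) = F(theta_0)

for k in range(int(T / dt)):
    t = (k + 1) * dt
    m += dt * (-m + grad_f(theta)) / eps                     # eps*m' = -m + grad f(theta)
    v = m / (2.0 * F(theta) * (1.0 - np.exp(-t / eps)))      # v = m / (2 F(theta) (1 - e^{-t/eps}))
    r += dt * (-2.0 * r * v ** 2)                            # r' = -2 r |v|^2
    theta += dt * (-2.0 * r * v)                             # theta' = -2 r v
print(f"f(theta(T)) = {f(theta):.3e},  r(T) = {r:.4f}")
```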

Acknowledgements. This research was partially supported by the National Science Foundation under Grant DMS1812666.

References

  • [1] Pierre-Antoine Absil, Robert Mahony, and Benjamin Andrews, Convergence of the iterates of descent methods for analytic cost functions, SIAM Journal on Optimization 16 (2005), no. 2, 531–547.
  • [2] Zeyuan Allen-Zhu, Katyusha: The first direct acceleration of stochastic gradient methods, Journal of Machine Learning Research 18 (2018), no. 221, 1–51.
  • [3] Felipe Alvarez, On the minimizing property of a second order dissipative system in Hilbert spaces, SIAM J. Control and Optimization 38 (1998), 1102–1119.
  • [4] Felipe Alvarez, Hedy Attouch, Jérôme Bolte, and Patrick Redont, A second-order gradient-like dissipative dynamical system with Hessian-driven damping.: Application to optimization and mechanics, Journal de mathématiques pures et appliquées 81 (2002), no. 8, 747–779.
  • [5] Vassilis Apidopoulos, Nicolò Ginatta, and Silvia Villa, Convergence rates for the heavy-ball continuous dynamics for non-convex optimization, under Polyak–Łojasiewicz condition, Journal of Global Optimization 84 (2022), no. 3, 563–589.
  • [6] Hedy Attouch and Jérôme Bolte, On the convergence of the proximal algorithm for nonsmooth functions involving analytic features, Mathematical Programming 116 (2009), no. 1, 5–16.
  • [7] Hedy Attouch, Zaki Chbani, Juan Peypouquet, and Patrick Redont, Fast convergence of inertial dynamics and algorithms with asymptotic vanishing viscosity, Mathematical Programming 168 (2018), 123–175.
  • [8] Hedy Attouch, Xavier Goudou, and Patrick Redont, The heavy ball with friction method, I. the continuous dynamical system: global exploration of the local minima of a real-valued function by asymptotic analysis of a dissipative dynamical system, Communications in Contemporary Mathematics 2 (2000), no. 01, 1–34.
  • [9] JB Baillon, Un exemple concernant le comportement asymptotique de la solution du problème dudt+ϑ(μ)0dudt+\partial\vartheta(\mu)\ni 0, Journal of Functional Analysis 28 (1978), no. 3, 369–376.
  • [10] Raef Bassily, Mikhail Belkin, and Siyuan Ma, On exponential convergence of SGD in non-convex over-parametrized learning, arXiv preprint arXiv:1811.02564 (2018).
  • [11] Michael Betancourt, Michael I Jordan, and Ashia C Wilson, On symplectic optimization, arXiv preprint arXiv:1802.03653 (2018).
  • [12] Jérôme Bolte, Aris Daniilidis, Olivier Ley, and Laurent Mazet, Characterizations of Łojasiewicz inequalities: subgradient flows, talweg, convexity, Transactions of the American Mathematical Society 362 (2010), no. 6, 3319–3363.
  • [13] Jérôme Bolte, Shoham Sabach, and Marc Teboulle, Proximal alternating linearized minimization for nonconvex and nonsmooth problems, Mathematical Programming 146 (2014), no. 1, 459–494.
  • [14] Jérôme Bolte, Aris Daniilidis, and Adrian Lewis, The Łojasiewicz inequality for nonsmooth subanalytic functions with applications to subgradient dynamical systems, SIAM Journal on Optimization 17 (2007), no. 4, 1205–1223.
  • [15] Léon Bottou, Frank E. Curtis, and Jorge Nocedal, Optimization methods for large-scale machine learning, SIAM Review 60(2) (2018), 223–311.
  • [16] Camille Castera, Jérôme Bolte, Cédric Févotte, and Edouard Pauwels, An inertial Newton algorithm for deep learning, Journal of Machine Learning Research 22 (2021), 1–31.
  • [17] Zachary Charles and Dimitris Papailiopoulos, Stability and generalization of learning algorithms that converge to global optima, International conference on machine learning, 2018, pp. 745–754.
  • [18] Andre Belotto da Silva and Maxime Gazeau, A general system of differential equations to model first-order adaptive algorithms, Journal of Machine Learning Research 21 (2020), no. 129, 1–42.
  • [19] Simon S. Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh, Gradient descent provably optimizes over-parameterized neural networks, International Conference on Learning Representations, 2019.
  • [20] Guilherme França, Jeremias Sulam, Daniel Robinson, and René Vidal, Conformal symplectic and relativistic optimization, Advances in Neural Information Processing Systems 33 (2020), 16916–16926.
  • [21] Pierre Frankel, Guillaume Garrigos, and Juan Peypouquet, Splitting methods with variable metric for Kurdyka–Łojasiewicz functions and general convergence rates, Journal of Optimization Theory and Applications 165 (2015), no. 3, 874–900.
  • [22] Tianxiang Gao, Hailiang Liu, Jia Liu, Hridesh Rajan, and Hongyang Gao, A global convergence theory for deep ReLU implicit networks via over-parameterization, International Conference on Learning Representations, 2021.
  • [23] Guillaume Garrigos, Lorenzo Rosasco, and Silvia Villa, Convergence of the forward-backward algorithm: beyond the worst-case with the help of geometry, Mathematical Programming (2022), 1–60.
  • [24] Jacob Palis Junior and Welington De Melo, Geometric theory of dynamical systems: an introduction, Springer-Verlag, 1982.
  • [25] Hamed Karimi, Julie Nutini, and Mark Schmidt, Linear convergence of gradient and proximal-gradient methods under the Polyak-Łojasiewicz condition, Machine Learning and Knowledge Discovery in Database, Springer, 2016, pp. 795–811.
  • [26] Diederik Kingma and Jimmy Ba, Adam: A method for stochastic optimization, International Conference on Learning Representations (ICLR), 2014.
  • [27] Krzysztof Kurdyka, On gradients of functions definable in o-minimal structures, Annales de l’institut Fourier, vol. 48, 1998, pp. 769–783.
  • [28] J. LaSalle, Some extensions of Lyapunov’s second method, IRE Transactions on Circuit Theory 7 (1960), no. 4, 520–527.
  • [29] Chaoyue Liu, Libin Zhu, and Mikhail Belkin, Loss landscapes and optimization in over-parameterized non-linear systems and neural networks, Applied and Computational Harmonic Analysis (2022).
  • [30] Hailiang Liu and Xuping Tian, An adaptive gradient method with energy and momentum, Annals of Applied Mathematics 38 (2022), no. 2, 183–222.
  • [31] Hailiang Liu and Xuping Tian, Aegd: adaptive gradient descent with energy, Numerical Algebra, Control and Optimization (2023).
  • [32]  , SGEM: stochastic gradient with energy and momentum, Numerical Algorithms (2023), 1–28.
  • [33] S. Łojasiewicz, Sur les trajectoires de gradient d’une fonction analytique, Seminari di Geometria 1982-1983 (lecture notes), Dipartemento di Matematica, Universita di Bologna (1984), 115–117.
  • [34] Chao Ma, Lei Wu, and E Weinan, A qualitative study of the dynamic behavior for adaptive gradient algorithms, Mathematical and Scientific Machine Learning, 2022, pp. 671–692.
  • [35] Chris J Maddison, Daniel Paulin, Yee Whye Teh, Brendan O’Donoghue, and Arnaud Doucet, Hamiltonian descent methods, arXiv preprint arXiv:1809.05042 (2018).
  • [36] Michael Muehlebach and Michael Jordan, A dynamical systems perspective on nesterov acceleration, International Conference on Machine Learning, 2019, pp. 4656–4662.
  • [37] Yurii Nesterov, A method of solving a convex programming problem with convergence rate O(1/k2)(1/k^{2}), Doklady Akademii Nauk, vol. 269, Russian Academy of Sciences, 1983, pp. 543–547.
  • [38]  , Introductory lectures on convex optimization: A basic course, vol. 87, Springer Science & Business Media, 2013.
  • [39] Boris Polyak, Gradient methods for the minimisation of functionals, USSR Computational Mathematics and Mathematical Physics 3 (1963), 864–878.
  • [40] Boris Polyak, Some methods of speeding up the convergence of iteration methods, Ussr computational mathematics and mathematical physics 4 (1964), no. 5, 1–17.
  • [41] Bin Shi, Simon S Du, Michael I Jordan, and Weijie J Su, Understanding the acceleration phenomenon via high-resolution differential equations, Mathematical Programming (2021), 1–70.
  • [42] Mahdi Soltanolkotabi, Adel Javanmard, and Jason D Lee, Theoretical insights into the optimization landscape of over-parameterized shallow neural networks, IEEE Transactions on Information Theory 65 (2018), no. 2, 742–769.
  • [43] Weijie Su, Stephen Boyd, and Emmanuel Candes, A differential equation for modeling Nesterov’s accelerated gradient method: Theory and insights, Advances in Neural Information Processing Systems 3 (2015).
  • [44] Tijmen Tieleman, Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude, COURSERA: Neural networks for machine learning 4 (2012), no. 2, 26.
  • [45] Andre Wibisono, Ashia C Wilson, and Michael I Jordan, A variational perspective on accelerated methods in optimization, proceedings of the National Academy of Sciences 113 (2016), no. 47, E7351–E7358.
  • [46] Xiaofeng Yang, Linear, first and second-order, unconditionally energy stable numerical schemes for the phase field model of homopolymer blends, Journal of Computational Physics 327 (2016), 294–316.
  • [47] Jia Zhao, Qi Wang, and Xiaofeng Yang, Numerical approximations for a phase field dendritic crystal growth model based on the invariant energy quadratization approach, International Journal for Numerical Methods in Engineering 110 (2017), no. 3, 279–300.
  • [48] S. Łojasiewicz, A topological property of real analytic subsets (in French), Coll. du CNRS, Les équations aux derivéés partielles (1963), 87–89.