Dynamic behavior for a gradient algorithm with energy and momentum
Abstract
This paper investigates a novel gradient algorithm, AGEM, which uses both energy and momentum, for addressing general non-convex optimization problems. The solution properties of the AGEM algorithm, including uniform boundedness and convergence to critical points, are examined. The dynamic behavior is studied through a comprehensive analysis of a high-resolution ODE system. This nonlinear ODE system is derived by taking the limit of the discrete scheme while preserving the momentum effect through a rescaling of the momentum parameter. The paper emphasizes the global well-posedness of the ODE system and the time-asymptotic convergence of solution trajectories. Furthermore, we establish a linear convergence rate for objective functions that adhere to the Polyak-Łojasiewicz condition.
keywords:
gradient algorithm; energy adaptive; dynamical systems; PL condition. MSC: 34D05; 65K10
1 Introduction
In this paper, we consider the nonconvex optimization problem
(1.1) $\min_{x \in \mathbb{R}^n} f(x),$
where the differentiable objective function $f$ is assumed to be bounded from below, i.e., $\inf_x f(x) > -\infty$. Applications like training machine learning models often involve large-scale problems with non-convex objective functions represented as a finite sum over terms associated with individual data. In such settings, first-order gradient methods such as gradient descent (GD) and its variants are commonly employed, due to their computational efficiency and satisfactory performance [15].
Despite its wide usage, GD faces a potential limitation on the step size, attributed to the conditional stability of GD. This limitation can significantly impede convergence, especially in large-scale machine learning applications. The same challenge persists for the stochastic approximation of GD, known as stochastic gradient descent (SGD). Consequently, there has been much effort directed towards enhancing the convergence of first-order gradient methods. Among these efforts, adaptive step sizes [44, 26] and momentum [40] emerge as two widely-used techniques, aiming to address the aforementioned limitations and improve the efficiency of optimization algorithms in practice.
To overcome the issue of step size limitations, the authors in [31, 32] introduced an adaptive gradient descent with energy (AEGD). With $F(x) = \sqrt{f(x)+c}$, the algorithm initiates from $x_0$ and $r_0 = F(x_0)$, and inductively defines
(1.2a) $v_k = \nabla F(x_k) = \dfrac{\nabla f(x_k)}{2\sqrt{f(x_k)+c}},$
(1.2b) $r_{k+1} = \dfrac{r_k}{1 + 2\eta |v_k|^2},$
(1.2c) $x_{k+1} = x_k - 2\eta\, r_{k+1} v_k,$
where $c$ is chosen so that $f(x) + c > 0$. In sharp contrast to GD, energy-adaptive gradient methods exhibit unconditional energy stability, regardless of the base step size $\eta > 0$. Additionally, findings in [31] indicate that the parameter $c$ has minimal impact on the performance of (1.2).
AEGD can be extended to include momentum for accelerated convergence [32]. In this study, we explore a variant of AEGD with momentum, referred to as AGEM. It is defined inductively as follows:
(1.3a) $m_k = \beta m_{k-1} + (1-\beta)\nabla F(x_k),$
(1.3b) $r_{k+1} = \dfrac{r_k}{1 + 2\eta |m_k|^2},$
(1.3c) $x_{k+1} = x_k - 2\eta\, r_{k+1} m_k,$
where the momentum parameter $\beta \in [0,1)$ is a crucial element, and setting $\beta = 0$ reduces it to the original AEGD in (1.2). Unlike GD, all these energy-adaptive gradient methods exhibit unconditional energy stability, irrespective of the base step size $\eta$. The excellent performance of AEGD-type schemes has been demonstrated across various optimization tasks [31, 32]. On the theoretical front, the convergence rates obtained in [31] depend on the energy variable $r_k$, rather than solely on the step size $\eta$. This dependence on the energy could pose challenges when the energy variable decays too rapidly, potentially impacting the convergence behavior. Addressing this issue represents a main challenge in achieving a better understanding of the energy-adaptive gradient methods.
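For concreteness, the following is a minimal Python sketch of the AGEM iteration (1.3) as written above; the zero initialization of the momentum, the function names, and the parameter values are our illustrative choices rather than a reference implementation, and setting beta = 0 recovers AEGD (1.2).

```python
import numpy as np

def agem(grad_f, f, x0, eta=0.1, beta=0.9, c=1.0, n_iter=500):
    """Sketch of the AGEM iteration (1.3); beta = 0 recovers AEGD (1.2)."""
    x = np.asarray(x0, dtype=float)
    r = np.sqrt(f(x) + c)              # energy variable r_0 = F(x_0)
    m = np.zeros_like(x)               # momentum accumulator (m_{-1} = 0 assumed)
    for _ in range(n_iter):
        v = grad_f(x) / (2.0 * np.sqrt(f(x) + c))   # v_k = grad F(x_k)
        m = beta * m + (1.0 - beta) * v             # (1.3a): momentum average
        r = r / (1.0 + 2.0 * eta * np.dot(m, m))    # (1.3b): energy can only decrease
        x = x - 2.0 * eta * r * m                   # (1.3c): state update
    return x, r

# usage: minimize f(x) = |x|^2 starting from (3, 4)
f = lambda x: np.dot(x, x)
grad_f = lambda x: 2.0 * x
x_min, r_final = agem(grad_f, f, [3.0, 4.0])
```

Note how $r$ can only decrease in (1.3b), regardless of the size of $\eta$; this is the mechanism behind the unconditional energy stability discussed above.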
On the other hand, the link between ODEs and numerical optimization can be established by considering infinitesimally small step sizes. This approach ensures that the trajectory or solution path converges to a curve modeled by an ODE. Leveraging the well-established theory and dynamic properties of ODEs can yield profound insights into optimization processes, as exemplified in [3, 8, 5, 43].
The motivation for this study stems from our two recent papers [32, 30], wherein we demonstrated the convergence of AEGD with momentum through comprehensive experiments. These experiments showcased its superior performance across stochastic and non-stochastic scenarios. Our aim in the current work is to deepen our understanding of AGEM, as expressed in equation (1.3), specifically focusing on elements such as the momentum effect, dynamics of the energy variable, and convergence rates. By delving into its continuous formulation, we seek to gain additional insights that will guide the implementation of the algorithm and further enhance its effectiveness. Specifically, we derive an ODE system for (1.3), representing the precise limit of (1.3) when employing infinitesimally small step sizes while maintaining the rescaled momentum parameter $\mu = (1-\beta)/h$ as a constant. This ODE system, with unknowns $(m(t), r(t), X(t))$, is expressed as
(1.4a) $\dot m = \mu\left(\nabla F(X) - m\right),$
(1.4b) $\dot r = -2|m|^2 r,$
(1.4c) $\dot X = -2 r m,$
for $t > 0$, with initial conditions $m(0) = \nabla F(x_0)$, $r(0) = F(x_0)$, $X(0) = x_0$; here, $x_0$ is the starting point in the AGEM scheme, and the dot denotes the time derivative. Notably, this study represents the pioneering effort to model energy-adaptive schemes using ODE systems. This linkage enables the examination of the convergence properties of (1.3) from the standpoint of continuous dynamics, particularly for general smooth objective functions.
For a gradient method, the geometric properties of the objective function often play a crucial role in determining convergence and convergence rates. In the case of non-convex $f$, we consider an established condition originally introduced by Polyak [39], who showed its sufficiency for GD to achieve a linear convergence rate. A recent convergence study under this condition is presented in [25], referring to this condition as the Polyak-Łojasiewicz (PL) inequality. This designation originates from its connection to the gradient inequality introduced by Łojasiewicz in 1963 [48]. Observations regarding optimization-related dynamical systems have suggested that, in general, convergence along subsequences is typically achievable. However, achieving convergence of the entire trajectory requires additional geometric properties. For instance, the Łojasiewicz inequality has been used to establish trajectory convergence in [4] for a second-order gradient-like dissipative dynamical system with Hessian-driven damping. Section 7 of this work shows how such an inequality can ensure trajectory convergence for $X(t)$ in (1.4). We should point out that several significant non-convex problems in machine learning, including certain classes of neural networks, have recently been shown to satisfy the PL condition [10, 17, 42, 29]. The prevalent belief is that the PL/KL condition provides a pertinent and attractive framework for numerous machine learning problems, especially in the over-parametrized regime.
Contributions. Our findings provide a rigorous and precise understanding of the convergence behavior for both the discrete scheme (1.3) and its continuous counterpart (1.4). Our main contributions can be summarized as follows.
(1) For (1.3), we establish that the iterates are uniformly bounded and converge to the set of critical points of the objective function when the step size is appropriately small. Additionally, we demonstrate that the energy variable $r_k$ admits a positive limit.
(2) We derive (1.4) as a high-resolution ODE system corresponding to the discrete scheme (1.3). For this ODE system, we first show global well-posedness by constructing a suitable Lyapunov function. Subsequently, we establish the time-asymptotic convergence of the solution toward critical points using the LaSalle invariance principle. Furthermore, we provide a positive lower bound on the energy variable.
(3) For objective functions satisfying the PL condition (refer to Section 2.2), we establish a linear convergence rate:
$f(X(t)) - f(x^*) \le C e^{-\lambda t}$
for some $C, \lambda > 0$, where $x^*$ represents the global minimizer of $f$ (not necessarily convex).
(4) We propose several variations and extensions. For a broader class of objective functions satisfying the Łojasiewicz inequality, the trajectory $X(t)$ is shown to have finite length, hence converging to a single minimum of $f$, accompanied by associated convergence rates.
Obtaining convergence rates for non-convex optimization problems poses a significant challenge. The approach employed in Section 6 to deliver the linear convergence rate for the system (1.4) draws inspiration from [5], where a linear convergence rate was established for the Heavy Ball dynamical system, namely (1.5) with $\gamma(t)$ being a constant. In the case of the more intricate nonlinear system (1.4), we manage to formulate a class of control functions that serve a similar role to those in [5]. The derivation of the convergence rate results for (1.4) differs significantly and is more intricate. Notably, we employ a subtle optimization argument to identify an optimal control function that facilitates the desired decay rate.
1.1 Related works.
IEQ strategy. The fundamental concept underpinning AEGD is the invariant energy quadratization (IEQ) strategy, originally introduced to devise linear and unconditionally energy stable schemes for gradient flows expressed as partial differential equations [46, 47]. The AEGD algorithm [31] stands out as the pioneer in utilizing the energy update to stabilize GD in the context of optimization.
Optimization with momentum. The two prominent momentum-based optimization methods are Polyak's Heavy Ball (HB) [40] and Nesterov's Accelerated Gradient (NAG) [37]. The HB method combines the current gradient with a history of the previous steps to accelerate the algorithm's convergence:
$x_{k+1} = x_k - \eta \nabla f(x_k) + \beta(x_k - x_{k-1}),$
where $\beta$ is the momentum coefficient, and $\eta$ is the step size. Research in [40] showed that HB can significantly accelerate convergence to a local minimum. For strongly convex functions, it requires $\sqrt{\kappa}$ times fewer iterations than GD to achieve the same level of accuracy, where $\kappa$ is the condition number of the curvature at the minimum and $\beta$ is set to $\left(\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}\right)^2$. Similar to the HB method, NAG also ensures a faster convergence rate than GD in specific scenarios. Particularly, for general smooth (non-strongly) convex functions, NAG achieves a global convergence rate of $O(1/k^2)$ (versus $O(1/k)$ of GD) [37]. Recent studies have indicated that integrating momentum with other techniques can further enhance the performance of certain optimization methods. For instance, momentum has been incorporated into adaptive step sizes [26] and variance-reduction-based methods [2] for stochastic optimization.
The approach discussed in this work integrates the momentum technique into the energy-adaptive gradient methods. As demonstrated in [31, 32], the resulting algorithms not only exhibit unconditional energy stability but also showcase faster convergence, inheriting advantages from both techniques.
ODE perspectives. In recent years, the optimization community has shown a growing interest in examining the continuous-time ODE limit of optimization methods as the step size tends to zero. This perspective has spurred numerous recent studies that critically examine established optimization methods in their continuous-time limit. A main aspect of the research in this domain focuses on developing Lyapunov functions or estimating sequences that emerge independently from the dynamical system.
ODEs have proven to be particularly useful in the theoretical analysis of accelerated gradient algorithms. For instance, researchers have investigated the convergence behavior of the HB method using second-order ODEs:
(1.5) $\ddot X + \gamma(t)\dot X + \nabla f(X) = 0,$
with $\gamma(t)$ being a smooth, positive and non-increasing damping function. The Lyapunov function $\mathcal{E}(t) = f(X(t)) + \frac{1}{2}|\dot X(t)|^2$ plays a pivotal role in the analysis of (1.5) within both convex and non-convex settings [3, 8]. The NAG method has been linked to (1.5) with $\gamma(t) = 3/t$ in [43], wherein the optimal convergence rate $O(1/k^2)$ of the discrete scheme in the convex setting is recovered [38]. In [41], the distinction between the acceleration phenomena of the HB method and the NAG method is investigated through an analysis of high-resolution ODEs. Further analysis of (1.5) is conducted in Attouch et al. [7], and Wibisono et al. [45] frame it within a variational context. This motivated further research on structure-preserving integration schemes for discretizing continuous-time optimization algorithms, as explored by Betancourt et al. [11]. Franca et al. [20] and Muehlebach and Jordan [36] highlight important geometric properties of the underlying dynamics, analyzing corresponding structure-preserving discretization schemes. Maddison et al. [35] propose that convergence can be enhanced through a suitable selection of kinetic energy.
In [18], a non-autonomous system of differential equations was derived and analyzed as the continuous-time limit of adaptive optimization methods, including Adam and RMSprop. Beyond examining convergence behavior, this dynamical approach provides insight into qualitative features of discrete algorithms. For example, by analyzing sign GD of the form $x_{k+1} = x_k - \eta\,\mathrm{sign}(\nabla f(x_k))$ and its variants, [34] identified three behavior patterns of Adam and RMSprop: fast initial progress, small oscillations, and large spikes. Our work aligns with this approach, although, to our knowledge, (1.4) differs from existing ODEs for optimization methods and presents subtle analytical challenges.
The ODE perspective has inspired researchers to propose innovative optimization methods. Indeed, starting from a continuous dynamical system with favorable convergence properties, one can discretize it to derive novel algorithms. For example, a new second-order inertial optimization method, known as the INNA algorithm, is obtained from a continuous inclusion system when the objective function is non-smooth [16].
The rest of the paper is organized as follows. In Section 2, we begin with the problem setup, revisiting existing energy-adaptive gradient methods, and introducing a novel approach. We delve into the PL condition and its associated properties. In Section 3, we analyze solution properties for the newly proposed method. In Section 4, we derive a continuous dynamical system and present a convergence result to show the dynamical system is a faithful model for the discrete scheme. In Section 5, we analyze the global existence and asymptotic behavior of the solution to the dynamical system. In Section 6, we obtain a linear convergence rate for objective functions satisfying the PL condition. In Section 7, we propose several variations and extensions. Some concluding remarks are presented in Section 8.
2 Energy-adaptive gradient algorithms
Throughout this study, we make the following assumptions.
Assumption. The objective function $f$ in (1.1) satisfies:
(1) $f$ is sufficiently smooth and coercive, i.e., $\lim_{|x|\to\infty} f(x) = +\infty$.
(2) $f$ is bounded from below, so that $F(x) = \sqrt{f(x)+c}$ is well-defined for some $c > 0$.
We write $F(x) = \sqrt{f(x)+c}$ and $f^* = \inf_x f(x)$, and assume that $c$ is chosen such that $F$ is well-defined and positive. We use $\nabla f$ and $\nabla^2 f$ to denote the gradient and Hessian of $f$, and $x_i$ to denote the $i$-th element of $x$. For $x \in \mathbb{R}^n$, $|x|$ denotes its Euclidean norm. The outer product of $u, v \in \mathbb{R}^n$ is denoted by $u \otimes v$, and $[n]$ denotes the set $\{1, \dots, n\}$.
2.1 Energy-adaptive gradient algorithms.
(1.2) represents the global and non-stochastic counterpart of AEGD, as introduced in [31]. Its notable feature lies in being unconditionally energy stable: the energy variable $r_k$, serving as an approximation to $F(x_k)$, strictly decreases unless $\nabla f(x_k) = 0$. To enhance the convergence speed of AEGD, momentum was introduced into the energy-adaptive methods. The initial momentum version, named AEGDM, was introduced in [30]:
(2.6a)
(2.6b)
(2.6c)
Here, $\beta$ represents the momentum parameter. The theoretical and numerical superiority of AEGDM over AEGD has been established in [30]. Subsequently, in [32], a stochastic version of AEGD with momentum, named SGEM, was introduced, emphasizing its performance in training large-scale machine learning tasks. The formulation is as follows:
(2.7a)
(2.7b)
(2.7c)
(2.7d)
with $\beta$ being the momentum parameter. SGEM has been shown to achieve faster convergence compared to AEGD, while exhibiting superior generalization capabilities over stochastic gradient descent (SGD) with momentum in training various benchmark deep neural networks [32].
Figure 1: Convergence performance of GDM, AEGDM, SGEM, and AGEM on the Rosenbrock function under varying momentum parameter $\beta$ and step size $\eta$.
This work introduces a novel variant of (1.2), named AGEM:
(2.8a) $m_k = \beta m_{k-1} + (1-\beta)\nabla F(x_k),$
(2.8b) $r_{k+1} = \dfrac{r_k}{1 + 2\eta |m_k|^2},$
(2.8c) $x_{k+1} = x_k - 2\eta\, r_{k+1} m_k.$
All above energy-adaptive methods – AEGD, AEGDM, SGEM, and AGEM – share a common property of unconditional energy stability, as stated in the following.
Theorem 2.1 (Unconditional energy stability).
For any step size $\eta > 0$, the scheme (2.8) is unconditionally energy stable: the energy variable is positive and monotonically decreasing,
(2.9) $0 < r_{k+1} \le r_k \quad \text{for all } k \ge 0,$
with strict decrease unless $m_k = 0$; moreover,
(2.10) $\sum_{k=0}^{\infty} 2\eta\, r_{k+1} |m_k|^2 = r_0 - \lim_{k\to\infty} r_k \le r_0.$
Proof 2.2.
From (2.8b), $r_k - r_{k+1} = 2\eta |m_k|^2 r_{k+1} \ge 0$, which gives (2.9); summing over $k$ and telescoping gives (2.10).
To gain a deeper understanding of how AGEM, a scheme inspired by its continuous ODE system, distinguishes itself from the other two energy-adaptive methods, AEGDM [30] and SGEM [32], we conducted a robustness test on all three methods and GDM (GD with momentum). The test used the Rosenbrock function:
$f(x) = \sum_{i=1}^{d-1}\left[100\,(x_{i+1} - x_i^2)^2 + (1 - x_i)^2\right],$
where $d \ge 2$ is an integer. This nonconvex function poses a challenge due to its global minimum being located at the bottom of a long, narrow valley, making the optimization problem inherently difficult. Fixing the dimension $d$, we investigated the impact of the momentum parameter $\beta$ and step size $\eta$ on the convergence performance of each method, with results presented in Figure 1. Our observations indicate that, in comparison to GDM, the energy-adaptive methods exhibit greater robustness in the sense that their convergence performance is less sensitive to variations in $\beta$ and $\eta$: they converge within a certain number of iterations to the global minimum across a wide range of parameter values. In contrast, GDM diverges when $\eta$ exceeds a certain threshold. Moreover, AGEM exhibits better robustness than the other two methods, achieving the fastest convergence in our tests.
Addressing a more challenging scenario in higher dimensions, we considered the Rosenbrock function with larger $d$. The landscape in this case is quite intricate, with two minimizers: a global minimizer at $x^* = (1, \dots, 1)$ with $f(x^*) = 0$ and a local minimizer near $(-1, 1, \dots, 1)$. A comparison between GDM and AGEM is presented in Figure 2. Both methods achieve their fastest convergence with suitably tuned $\eta$ and $\beta$, respectively. Notably, as $\beta$ increases for both methods, GDM becomes trapped at the local minimum, while AGEM continues to converge to the global minimum, though at a slower pace.
Figure 2: Comparison of GDM and AGEM on the higher-dimensional Rosenbrock function.
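The following self-contained Python sketch reproduces the flavor of this robustness test; the Rosenbrock function and its gradient are standard, the AGEM loop follows the form of (2.8) as reconstructed above, and the starting point and parameter values are illustrative rather than those used to produce Figures 1 and 2.

```python
import numpy as np

def rosenbrock(x):
    return np.sum(100.0 * (x[1:] - x[:-1]**2)**2 + (1.0 - x[:-1])**2)

def rosenbrock_grad(x):
    g = np.zeros_like(x)
    g[:-1] = -400.0 * x[:-1] * (x[1:] - x[:-1]**2) - 2.0 * (1.0 - x[:-1])
    g[1:] += 200.0 * (x[1:] - x[:-1]**2)
    return g

def gdm(x, eta, beta, n_iter):
    """GD with heavy-ball momentum, for comparison."""
    x_prev = x.copy()
    for _ in range(n_iter):
        x_new = x - eta * rosenbrock_grad(x) + beta * (x - x_prev)
        x_prev, x = x, x_new
    return x

def agem(x, eta, beta, c, n_iter):
    """AGEM (2.8) applied to the Rosenbrock function."""
    r = np.sqrt(rosenbrock(x) + c)                    # energy r_0 = F(x_0)
    m = np.zeros_like(x)
    for _ in range(n_iter):
        v = rosenbrock_grad(x) / (2.0 * np.sqrt(rosenbrock(x) + c))
        m = beta * m + (1.0 - beta) * v               # momentum average
        r = r / (1.0 + 2.0 * eta * np.dot(m, m))      # energy update
        x = x - 2.0 * eta * r * m
    return x

x0 = np.array([-1.2, 1.0])                            # classic starting point
print("GDM  final f:", rosenbrock(gdm(x0.copy(), 1e-4, 0.9, 50000)))
print("AGEM final f:", rosenbrock(agem(x0.copy(), 1e-2, 0.9, 1.0, 50000)))
```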
The primary goal of this study is to provide an in-depth analysis of AGEM, considering both its discrete and continuous formulations. These findings are expected to be applicable to (2.7). Our specific focus lies in understanding the convergence characteristics of AGEM. This examination will be conducted by exploring continuous dynamics for objective functions that are generally smooth. Additionally, leveraging a structural condition on $f$ – known as the Polyak-Łojasiewicz (PL) condition – we establish a convergence rate in Section 6.
2.2 PL condition for non-convex objectives.
In this section, we provide a brief overview of the PL condition and its relevance to the loss function within the context of deep neural networks.
Definition 2.3.
For a differentiable function $f$ with $f^* = \min_x f(x)$, we say $f$ satisfies the PL condition if there exists some constant $\mu_f > 0$, such that the following inequality holds for any $x \in \mathbb{R}^n$:
(2.11) $\frac{1}{2}|\nabla f(x)|^2 \ge \mu_f\left(f(x) - f^*\right).$
This condition implies that a local minimizer is also a global minimizer. It is important to note that strongly convex functions satisfy the PL condition, although a function satisfying the PL condition need not be convex. For instance, the function
$f(x) = x^2 + 3\sin^2(x)$
is not convex but satisfies the PL condition with $\mu_f = 1/32$.
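A quick numerical sanity check of this example is given below; it verifies the inequality on a finite grid only, so it is illustrative rather than a proof.

```python
import numpy as np

# Check (2.11) for f(x) = x^2 + 3 sin^2(x) with mu_f = 1/32:
# 0.5 * f'(x)^2 >= mu_f * (f(x) - f*), where f* = f(0) = 0.
x = np.linspace(-20.0, 20.0, 400001)
f = x**2 + 3.0 * np.sin(x)**2
df = 2.0 * x + 3.0 * np.sin(2.0 * x)      # f'(x)
gap = 0.5 * df**2 - (1.0 / 32.0) * f      # should be nonnegative everywhere
print("min of |f'|^2/2 - mu_f (f - f*):", gap.min())   # expect >= 0
```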
As mentioned in the introduction, the PL condition has attracted increasing attention in the context of training deep neural networks. In this informal exploration, we investigate how a commonly used loss function in deep learning tasks can satisfy the PL condition. Consider the following loss function
$L(\theta) = \frac{1}{2}\sum_{i=1}^{N}\left(h(x_i;\theta) - y_i\right)^2,$
where $L(\theta^*) = 0$ for any minimizer $\theta^*$. Here, $\{(x_i, y_i)\}_{i=1}^N$ represents training data points, and $h(x;\theta)$ is a deep neural network parameterized by $\theta$. The gradient of this loss function is given by:
$\nabla_\theta L(\theta) = \sum_{i=1}^{N}\left(h(x_i;\theta) - y_i\right)\nabla_\theta h(x_i;\theta).$
Denoting $u = u(\theta) \in \mathbb{R}^N$ where $u_i = h(x_i;\theta) - y_i$, we have
(2.12) $|\nabla_\theta L(\theta)|^2 = u^\top G(\theta)\, u,$
where $G(\theta)$ is an $N \times N$ matrix with entry defined by
(2.13) $G_{ij}(\theta) = \nabla_\theta h(x_i;\theta) \cdot \nabla_\theta h(x_j;\theta).$
From (2.12) it is evident that if the smallest eigenvalue of $G(\theta)$ is bounded from below by some $\lambda_0 > 0$, then:
$|\nabla_\theta L(\theta)|^2 \ge \lambda_0 |u|^2 = 2\lambda_0 L(\theta).$
This condition aligns with the PL condition. Notably, using over-parameterization and random initialization, [19] has proved that for a two-layer neural network with rectified linear unit (ReLU) activation, the smallest eigenvalue of $G$ is indeed strictly positive. This eigenvalue proves to be crucial in deriving the linear convergence rate of gradient descent, as illustrated in [19]. A similar convergence result for implicit networks is also derived in [22].
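To make this concrete, the following sketch assembles the Gram matrix (2.13) for a toy two-layer tanh network and reports its smallest eigenvalue; the architecture, sizes, and random initialization are our illustrative choices and differ from the ReLU setting analyzed in [19].

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-layer network h(x; theta) = a . tanh(W x), theta = (W, a).
d, width, N = 3, 64, 10
W = rng.normal(size=(width, d)) / np.sqrt(d)
a = rng.normal(size=width) / np.sqrt(width)
X = rng.normal(size=(N, d))                    # toy inputs

def per_sample_grad(x):
    """Gradient of h(x; theta) w.r.t. all parameters, flattened."""
    t = np.tanh(W @ x)
    grad_a = t                                 # dh/da_j = tanh(w_j . x)
    grad_W = np.outer(a * (1.0 - t**2), x)     # dh/dW_{jk}
    return np.concatenate([grad_W.ravel(), grad_a])

J = np.stack([per_sample_grad(x) for x in X])  # N x (#params) Jacobian
G = J @ J.T                                    # Gram matrix (2.13)
lam_min = np.linalg.eigvalsh(G).min()
print("smallest eigenvalue of G:", lam_min)    # lam_min > 0 gives |grad L|^2 >= 2 lam_min L
```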
3 Solution properties of the AGEM algorithm
In this section, we establish that the iterates generated by (2.8) are uniformly bounded and convergent if $\eta$ is suitably small.
Theorem 3.1.
Proof 3.2.
Define
(3.14)
where , then we claim:
(3.15)
where is the largest eigenvalue of . In fact,
(3.16)
where
(3.17)
in which,
(3.18)
where is used in the last inequality. Connecting (3.16), (3.17), (3.18) gives (3.15).
(1) To demonstrate uniform boundedness, we sum up (3.15) from to , omitting negative terms on the right-hand side to obtain:
(3.19)
Using and the bound in (2.10), we derive:
(3.20)
Next, we establish that if $\eta$ is suitably small, is uniformly bounded, ensuring the boundedness of , as indicated by:
(3.21)
To proceed, we introduce the notation:
and for the convex hull of . (The convex hull of a given set is the unique minimal convex set containing it.) By assumption, these are bounded domains. Our objective is to identify conditions on $\eta$ such that for all . This can be accomplished through induction. Since , it suffices to show that under suitable conditions on $\eta$, we have:
The main task is to establish a sufficient condition on , for which we argue in two steps:
(i) By the continuity of , there exists such that if
(3.22)
we have
This implies , indicating . On the other hand,
This ensures (3.22) if we set
(ii) Considering that and , we have . By (3.20), we get:
(3.23)
This implies
provided
Hence, if , then for all .
(2) Theorem 2.1 asserts that $r_k$ is decreasing and admits a limit $r_\infty \ge 0$ as $k \to \infty$. Here, we identify a sufficient condition on $\eta$ to ensure $r_\infty > 0$. Following a similar argument as in (1), while retaining the negative term
on the right-hand side of (3.15), we derive:
under the condition that . Since , we obtain:
Letting , we have:
provided
Therefore, if , then . Note that both requirements are compatible, hence (1) and (2) are satisfied if we choose $\eta$ below the smaller of the two thresholds identified above.
Remark 3.3.
When $f$ is assumed to be $L$-smooth, the largest eigenvalue of $\nabla^2 f$ becomes uniformly bounded. In this scenario, the solution's boundedness is attained without imposing any restriction on $\eta$, as is evident from the proof above. However, it is crucial to note that the $L$-smoothness assumption excludes a significant class of objective functions, including those of the form $f(x) = |x|^4$.
4 Discrete schemes vs continuous-time-limits
The analysis of scheme (2.8) involves studying the limiting ODEs obtained by letting the step size approach $0$. Notably, distinct ODEs may arise based on how the hyper-parameters are scaled. In this context, we will derive an ODE system that preserves certain momentum effects.
4.1 High resolution ODEs.
For any $h > 0$, let $t_k = kh$, and assume $x_k \approx X(t_k)$ for some sufficiently smooth curve $X(t)$. Performing the Taylor expansion of $X$ in powers of $h$, we get:
$X(t_k \pm h) = X(t_k) \pm h\dot X(t_k) + \frac{h^2}{2}\ddot X(t_k) + O(h^3).$
Substituting into (2.8a), we have
which gives
(4.24)
Plugging into (2.8b), we have
(4.25)
Similarly, from (2.8c) we have
(4.26)
We first discard $O(h^2)$ terms and keep $O(h)$ terms in (4.24), (4.25), (4.26), which leads to the following ODE system:
(4.27a)
(4.27b)
(4.27c)
where . This can be viewed as a high-resolution ODE [41] because of the presence of $O(h)$ terms. Letting $h \to 0$, (4.27) reduces to the following ODE system:
(4.28a) $\dot r = -2|\nabla F(X)|^2\, r,$
(4.28b) $\dot X = -2 r \nabla F(X).$
From the two equations in (4.28) together with
$\frac{d}{dt} F(X) = \nabla F(X)\cdot \dot X = -2 r |\nabla F(X)|^2 = \dot r,$
we can show $r(t) = F(X(t))$, with which, (4.28b) reduces to
(4.29) $\dot X = -2 F(X)\nabla F(X) = -\nabla f(X).$
This asserts that the ODE system (4.28) is equivalent to the gradient flow (4.29).
To explore the momentum effect, we maintain $\mu = (1-\beta)/h$ unchanged while allowing $h \to 0$. This yields the following system of ODEs:
(4.30a) $\dot m = \mu\left(\nabla F(X) - m\right),$
(4.30b) $\dot r = -2|m|^2 r,$
(4.30c) $\dot X = -2 r m.$
Compared with (4.30), the high-resolution ODEs (4.27) serve as more accurate continuous-time counterparts for the corresponding discrete algorithm (2.8). While the analysis of (4.27) would be more intricate, in the remainder of this work, we concentrate on (4.30) and provide a detailed analysis of it.
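As a sanity check on the dynamics, (4.30) can be integrated numerically; the sketch below uses scipy's solve_ivp on a simple quadratic objective, with the system in the form reconstructed above and illustrative parameter values, and verifies the monotone decay of the energy $r(t)$ implied by (4.30b).

```python
import numpy as np
from scipy.integrate import solve_ivp

# f(x) = |x|^2 as a test objective; F(x) = sqrt(f(x) + c).
c, mu, n = 1.0, 10.0, 2
f = lambda x: np.dot(x, x)
grad_F = lambda x: 2.0 * x / (2.0 * np.sqrt(f(x) + c))

def rhs(t, z):
    m, r, X = z[:n], z[n], z[n+1:]     # state z = (m, r, X) stacked
    dm = mu * (grad_F(X) - m)          # (4.30a)
    dr = -2.0 * np.dot(m, m) * r       # (4.30b)
    dX = -2.0 * r * m                  # (4.30c)
    return np.concatenate([dm, [dr], dX])

x0 = np.array([3.0, 4.0])
z0 = np.concatenate([grad_F(x0), [np.sqrt(f(x0) + c)], x0])
sol = solve_ivp(rhs, (0.0, 50.0), z0, rtol=1e-8, atol=1e-10)

r_traj = sol.y[n]                      # energy component r(t)
assert np.all(np.diff(r_traj) <= 1e-8), "r(t) should be non-increasing"
print("final X:", sol.y[n+1:, -1], " final r:", r_traj[-1])
```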
Theorem 4.1.
4.2 Time discretization.
Certainly, the discrete AGEM dynamics can be interpreted as a discretization of the continuous ODE system (4.30). Specifically, we employ a first-order finite difference approximation with implicit treatment in $r$ and $x$, but explicit treatment in $m$, resulting in:
This formulation reproduces the discrete AGEM dynamics (2.8) when setting $h = \eta$ and $\beta = 1 - \mu h$. Moreover, this perspective enables the derivation of new algorithms by applying diverse time discretizations to (4.30). For instance, a second-order discretization of (4.30) leads to
By setting and , the following recursive relationships are derived:
These recursions begin with the initial setup of and . A more comprehensive analysis of this scheme exceeds the scope of the present article.
5 Dynamic solution behavior
This section is dedicated to the examination of (4.30), including aspects such as global existence, asymptotic convergence to equilibria through the renowned LaSalle invariance principle [28], and the determination of convergence rates for non-convex objective functions that adhere to the PL property.
5.1 Global Existence and Uniqueness.
Theorem 5.1.
Proof 5.2.
By the Picard–Lindelöf theorem, there exists a unique local solution to (4.30) with the initial condition . The extension theorem implies that global existence fails only if the solution 'escapes' to infinity in finite time. To establish global existence, it is sufficient to demonstrate that the solution remains bounded for all . To achieve this, we introduce a suitable control function. A candidate based on the discrete counterpart is:
(5.32)
Using the chain rule, if $(m, r, X)$ satisfies (4.30), then:
(5.33)
Hence, is guaranteed to decrease. In particular, we have:
(5.34)
which implies the boundedness of . This, combined with the coerciveness of $f$ and, consequently, of $F$, guarantees that $X(t)$ remains bounded for all $t$. However, the above inequality does not provide a uniform lower bound for the energy $r(t)$.
5.2 Asymptotic behavior of solutions.
Theorem 5.4.
The upper bound follows directly from (5.34). We establish the convergence result through three steps.
Step 1: We demonstrate that $r_\infty := \lim_{t\to\infty} r(t)$ exists and is positive. First, since $\dot r = -2|m|^2 r$ and $r(0) > 0$, it follows that $r(t)$ is positive and non-increasing. This ensures the existence of $r_\infty \ge 0$. Note that (4.30b) can be expressed as
$\frac{d}{dt}\log r = -2|m|^2.$
Integrating both sides from $0$ to $t$ and using (5.35), we obtain:
(5.37)
From this lower bound, we deduce:
Step 2: We proceed to analyze the asymptotic behavior of and by using LaSalle’s invariance principle [28]. Recall the function defined in Theorem 5.1,
Given that , this implies that the set
is a compact positively invariant set with respect to (4.30). For any , we observe that
Define the set:
Since for all , it follows that
Next, we identify the largest invariant set in . Suppose that at some , , then we have , but . As a result, the trajectory will not stay in unless . In other words, the largest invariant set in must be:
By LaSalle’s invariance principle, every trajectory of (4.30) starting in tends to as . From Step 1, is the only limit of . Hence, all trajectories will admit a limit or a cluster point in
This implies that: . Note that is monotone and bounded, hence admitting a limit . This further implies that:
Step 3. Finally, we demonstrate that is a local minimum value of , rather than a local maximum value. We prove this by contradiction. Suppose is a local maximum value of . Recall that , and as . Thus, for any , there exists such that for any :
Since is continuous, there exists such that:
where we have used the assumption that is a local maximum value of . From Step 2, we know that is non-increasing in . Hence, for any , we have:
provided is small enough so that . Now letting , we obtain . This is a contradiction. Hence, has to be a local minimum value of , i.e., .
6 Convergence rates
The speed at which the solution converges to the minimum value of $f$ depends on the geometry of the objective function near the minimizer. In fact, when the PL condition is satisfied, it can be shown that $f(X(t))$ converges exponentially fast to the minimum value.
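For orientation, recall how the PL condition already forces exponential decay along the momentum-free gradient flow (4.29); the following standard one-line computation shows the mechanism that Theorem 6.1 extends to the coupled system (4.30).

```latex
% Along the gradient flow (4.29), \dot X = -\nabla f(X), the PL condition (2.11) gives
\frac{d}{dt}\bigl(f(X(t)) - f^*\bigr)
  = \nabla f(X)\cdot \dot X
  = -|\nabla f(X)|^2
  \;\le\; -2\mu_f\bigl(f(X(t)) - f^*\bigr),
% so Gronwall's inequality yields
f(X(t)) - f^* \;\le\; \bigl(f(X(0)) - f^*\bigr)\, e^{-2\mu_f t}.
```

For (4.30) this argument no longer closes by itself, since $\frac{d}{dt} f(X) = -2r\,\nabla f(X)\cdot m$ couples the decay to $m$ and $r$; this is what the control functions constructed below are designed to handle.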
Theorem 6.1.
Consider the global solution to (4.30) as presented in Theorem 5.4, initialized with . Assume that $f$ also satisfies the PL condition (2.11) with constant $\mu_f$. For any , the following bounds are valid:
(6.38)
given that , where
(6.39)
Here, represents the largest eigenvalue of $\nabla^2 f(x)$, where $x$ belongs to the solution domain in Theorem 5.1.
Remark 6.2.
(Comparison with the gradient flow (4.29), without momentum.) For small values of , where momentum exerts a diminished influence on the dynamical system, the convergence rate is predominantly governed by the term . Specifically:
if with . Also gets larger as . For , is compelled to be , reducing (4.30) to (4.28) or (4.29). In this scenario, becomes , restoring the convergence rate of gradient flow (4.29) under the PL condition [39].
Remark 6.3.
(Comparison with gradient flow with momentum and the work [5]). To compare with the GD with momentum of form
we follow the same strategy as in Section 4.1 to obtain its continuous formulation:
This upon rescaling leads to
This is the second-order ODE (1.5) with a constant , for which we refer to the findings presented in [5, Theorem 3.1]. When dealing with non-convex functions satisfying the PL condition (2.11), particularly in scenarios where and , [5] establishes the estimate:
where
We examine a scenario where the dominant factor in the convergence rate (6.38) is , which when taking the same as above gives the convergence rate of with
While our convergence rate is comparable to the specific rate , it is considered suboptimal owing to the additional constraint imposed by the proof technique.
The proof outlined below comprises three steps:
Step 1: We introduce a candidate control function, denoted as
where the parameters and are to be determined.
Step 2: We determine admissible pairs such that exhibits exponential decay.
Step 3: Using , we link to by
(6.40)
and derive the convergence rate of based on the convergence rate of .
Proof 6.4.
Step 1: Decay of the control function. Define
(6.41)
where parameters will be determined later so that along the trajectory of (4.30) has an exponential decay rate.
For each term in we proceed to evaluate their time derivative along the trajectory of (4.30). For the first term, we get
For the second term, we have
(6.42)
Note that
Here we used and the PL property for .
Hence (6.42) reduces to
For the third term, we have
Combining the above three estimates together we obtain
(6.43)
Note that (6.41) can be rewritten as
(6.44)
This upon substitution into (6.43) gives
(6.45)
where . This leads to
(6.46)
as long as we can identify so that
(6.47a)
(6.47b)
(6.47c)
for all . Step 2: Admissible choice for and . Using the solution bounds and , we have
where
In order to ensure (6.47c) we must have . Putting these together, condition (6.47) can be ensured by the following stronger constraints:
(6.48a)
(6.48b)
(6.48c)
A direct calculation shows that for small , such admissible pair exists, with , and which is induced from . To be more precise, we fix and require that
The constraint would impose a lower bound for (which we should avoid, to stay consistent with the discrete algorithm) unless is chosen to satisfy . This is equivalent to requiring
With the above two constraints, it is safe to replace by in (6.48b) and (6.48c), respectively, so that
(6.49)
The convergence rate estimate in the next step requires to be as large as possible; we thus simply take
(6.50)
The second relation in (6.49) now reduces to
Note that the constraint is met if , imposing another upper bound on . To be concrete, we take
(6.51)
then (6.48) is met if
with
This is (6.39) with . Hence, for suitably small , we have obtained a set of admissible pairs with
(6.52)
as varies in .
Step 3: Convergence rate of . With above choices of , we have
(6.53)
This gives
(6.54)
This, when combined with (6.40), i.e.,
and (since ), allows us to rewrite (6.54) as
That is
Integration of this gives
(6.55)
Recall that in Step 2, is chosen as for a fixed and so that ; also using , we have
(6.56)
This is (6.38) with .
Remark 6.5.
In Step 3, equation (6.55) reveals that the convergence rate is dominated by , where . Consequently, in the current methodology, determining the optimal values of involves solving the following constrained optimization problem:
(6.57)
The constraint in (6.57) forms an open set, making it natural to confine the range to for a small . Notably, when is sufficiently small, the expression
holds true, owing to the fact that in Step 2. Consequently, the estimation obtained in Step 2 is nearly optimal within the confines of the present approach.
7 Convergence of trajectories
In this section, we provide insights into the previously established results as stated in Theorems 5.4 and 6.1 and put forth several variations and extensions.
In Theorem 5.4, we demonstrated that and . It is reasonable to anticipate that, under appropriate conditions on (or ), will converge toward a single minimum point of . On the other hand, the PL condition (2.11) used in Theorem 6.1 pertains to the geometric attributes of rather than its regularity. We revisit the details of the proof of Theorem 6.1, wherein for sufficiently small , is bounded below by . Consequently, the exponential decay of extends to , leading to the convergence of since .
It is interesting to explore the minimal general condition on that would ensure the convergence for towards a single limit-point. In general, addressing this question poses significant challenges. However, we can identify a larger class of functions beyond those covered by the PL condition. It is important to note that, following Łojasiewicz’s groundbreaking work on real-analytic functions [48, 33], the key factor ensuring convergence of with lies in the geometric properties of . This is exemplified by a counterexample presented by Palis and De Melo [24, p.14], highlighting that smoothness is insufficient to guarantee single limit-point convergence. These findings illustrate the importance of gradient vector fields of functions satisfying the Łojasiewicz inequality. The Łojasiewicz inequality asserts that for a real-analytic function and a critical point there exists such that the function remains bounded away from around . Such gradient inequality has been extended by Kurdyka [27] to functions whose graphs belong to an o-minimal structure, and Bolte et al. [14, 12] extended it to a broad class of nonsmooth subanalytic functions. This gradient inequality, or its generalization with a desingularizing function, is known as the Kurdyka-Łojasiewicz inequality. In the optimization literature, the KL inequality has proven to be a powerful tool for characterizing the convergence properties of iterative algorithms; refer to, for instance, [1, 6, 13, 21, 23].
For the system (4.30), we show that the Łojasiewicz inequality is sufficient to ensure the convergence of . Additionally, we provide convergence rates for under different values of .
Definition 7.1 (Łojasiewicz inequality).
For a differentiable function $f$ with a critical point $\bar x$, we say $f$ satisfies the Łojasiewicz inequality if there exist $\theta \in (0, 1)$, $c > 0$, and $\rho > 0$, such that the following inequality holds for any $x \in B_\rho(\bar x)$:
(7.58) $|\nabla f(x)| \ge c\,|f(x) - f(\bar x)|^{\theta},$
where $B_\rho(\bar x)$ is the neighborhood of $\bar x$ with radius $\rho$.
Note that a diverse range of functions has been shown to satisfy (7.58), ranging from real-analytic functions [48] to non-smooth lower semi-continuous functions [14]. The PL condition (2.11), stated as a global condition, corresponds to (7.58) with $\theta = 1/2$ and $c = \sqrt{2\mu_f}$.
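For concreteness, under the normalization of (7.58) written above, the correspondence with the PL condition is a one-line computation:

```latex
% PL (2.11)  <=>  Lojasiewicz (7.58) with theta = 1/2, globally:
\tfrac{1}{2}|\nabla f(x)|^2 \ \ge\ \mu_f\bigl(f(x) - f^*\bigr)
\quad\Longleftrightarrow\quad
|\nabla f(x)| \ \ge\ \sqrt{2\mu_f}\,\bigl(f(x) - f^*\bigr)^{1/2}.
```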
Lemma 7.2.
Let satisfy the Łojasiewicz inequality (7.58), then
(7.59)
Theorem 7.4.
Proof 7.5.
Let be the -limit set of . According to Theorem 5.3, , ensuring on , and
The inequality (7.59) asserts the existence of and such that for any ,
The proof of the convergence of relies on a novel functional
with the parameter yet to be determined. A direct calculation gives
Here for any and were used in the first inequality. Take and , then
In the last inequality, we used the inequality . Conversely,
where
We proceed by distinguishing two cases:
(i) .
Using the inequality , we obtain
Using (7.59), for ,
Since as and ,
(7.60)
where .
We are prepared to prove that belongs to by considering :
Integrating both sides from to yields:
(7.61)
Here, we take into account that . Note that , hence
Thus belongs to , implying that, exists, and .
(ii) For , we set . With this adjustment, (7.59) remains valid for since
for suitably small. Also, (7.5) holds with replaced by . Thus, we conclude the convergence of in this case. Moreover,
for some constant .
Before getting into the estimation of the rate of convergence, let us define
(7.62)
This expression serves to control the tail length function for both and . In fact,
(7.63)
From (7.61) in the proof of Theorem 7.4, we see that
This inequality remains true for every , thus
(7.64)
We now present the following result.
Theorem 7.6.
Under the same conditions as in Theorem 7.4, there exists such that for any , the following results hold:
(1) If , then
(2) If , then
(3) If , then
where depends on and is independent of .
Proof 7.7.
Let be a neighborhood of in which the Łojasiewicz inequality holds. Since converges to , there exists such that for every . In particular, (7.64) holds. It suffices to consider :
where was defined in (7.5). The above can be rearranged as
(7.65)
where . Next we define another problem of the form
(7.66)
so that by comparison; hence by (7.63) we obtain .
The solution of can be determined in the following cases:
(1) . In this case, (7.66), of the form
admits a unique solution
(2) . The exact solution in this case can be obtained as
which is bounded by .
(3) . This case reduces to case (2) by simply replacing by in the obtained convergence rate for (2). The proof is complete.
Remark 7.8.
The convergence rate in (3) is suboptimal as it relies on (7.64) with for . In this scenario, the full potential of the Łojasiewicz inequality is not used in the proof of Theorem 7.4. Had (7.64) held with in such cases, it would have led to for some finite , indicating a faster, finite time convergence. This assertion is demonstrated in [14, Theorem 4.7] for the case for .
Remark 7.9.
The exponential convergence rate in (1) aligns with the result obtained in Theorem 6.1, though may not be optimally sharp.
Remark 7.10.
A counterexample by Baillon [9] for the steepest descent equation suggests that, likely, convexity alone is not sufficient for the trajectories of (1.4) to converge strongly. In the case of a general convex function , one may instead aim to demonstrate the weak convergence of trajectories, following the approach outlined in [4, Theorem 5.1] for a dissipative dynamical system with Hessian-driven damping. It is important to note that the conditions outlined in (7.58) are insufficient to imply the convexity of the function. This is exemplified by the case of where . Although this function satisfies (7.58) with , it does not guarantee convexity. Furthermore, the set of minimizers for , characterized by the graph of , is also non-convex.
8 Discussion
This paper explores a novel energy-adaptive gradient algorithm with momentum, termed AGEM. Empirically, we observe that the inclusion of momentum significantly accelerates convergence, particularly in non-convex settings. Theoretical investigations reveal that AGEM retains the unconditional energy stability property akin to AEGD. Additionally, we establish convergence to critical points for general non-convex objective functions when step sizes are suitably small.
To capture the dynamical behavior of AGEM, we derive a high resolution ODE system that preserves the momentum effect. This system, featuring infinitely many steady states, poses challenges in analysis. Nonetheless, leveraging various analytical tools, we successfully establish a series of results: (i) establishing global well-posedness through a construction of a suitable Lyapunov function; (ii) characterizing the time-asymptotic behavior of the solution towards critical points, using the LaSalle invariance principle; (iii) deriving a linear convergence rate for non-convex objective functions that satisfy the PL condition, and (iv) demonstrating the finite length of and its convergence to the minimum of based on the Łojasiewicz inequality.
We anticipate that the analytical framework developed in this article can be extended to study other variants of AEGD. As an illustration, the associated system for scheme (2.7) in the format:
is anticipated to exhibit analogous solution properties. In the context of large-scale optimization problems, the practical preference often leans towards a stochastic version of AGEM. Consequently, it becomes intriguing to investigate the convergence behavior of AGEM in the stochastic setting.
Acknowledgements. This research was partially supported by the National Science Foundation under Grant DMS1812666.
References
- [1] Pierre-Antoine Absil, Robert Mahony, and Benjamin Andrews, Convergence of the iterates of descent methods for analytic cost functions, SIAM Journal on Optimization 16 (2005), no. 2, 531–547.
- [2] Zeyuan Allen-Zhu, Katyusha: The first direct acceleration of stochastic gradient methods, Journal of Machine Learning Research 18 (2018), no. 221, 1–51.
- [3] Felipe Alvarez, On the minimizing property of a second order dissipative system in Hilbert spaces, SIAM J. Control and Optimization 38 (1998), 1102–1119.
- [4] Felipe Alvarez, Hedy Attouch, Jérôme Bolte, and Patrick Redont, A second-order gradient-like dissipative dynamical system with Hessian-driven damping.: Application to optimization and mechanics, Journal de mathématiques pures et appliquées 81 (2002), no. 8, 747–779.
- [5] Vassilis Apidopoulos, Nicolò Ginatta, and Silvia Villa, Convergence rates for the heavy-ball continuous dynamics for non-convex optimization, under Polyak–Łojasiewicz condition, Journal of Global Optimization 84 (2022), no. 3, 563–589.
- [6] Hedy Attouch and Jérôme Bolte, On the convergence of the proximal algorithm for nonsmooth functions involving analytic features, Mathematical Programming 116 (2009), no. 1, 5–16.
- [7] Hedy Attouch, Zaki Chbani, Juan Peypouquet, and Patrick Redont, Fast convergence of inertial dynamics and algorithms with asymptotic vanishing viscosity, Mathematical Programming 168 (2018), 123–175.
- [8] Hedy Attouch, Xavier Goudou, and Patrick Redont, The heavy ball with friction method, I. the continuous dynamical system: global exploration of the local minima of a real-valued function by asymptotic analysis of a dissipative dynamical system, Communications in Contemporary Mathematics 2 (2000), no. 01, 1–34.
- [9] JB Baillon, Un exemple concernant le comportement asymptotique de la solution du problème , Journal of Functional Analysis 28 (1978), no. 3, 369–376.
- [10] Raef Bassily, Mikhail Belkin, and Siyuan Ma, On exponential convergence of SGD in non-convex over-parametrized learning, arXiv preprint arXiv:1811.02564 (2018).
- [11] Michael Betancourt, Michael I Jordan, and Ashia C Wilson, On symplectic optimization, arXiv preprint arXiv:1802.03653 (2018).
- [12] Jérôme Bolte, Aris Daniilidis, Olivier Ley, and Laurent Mazet, Characterizations of Łojasiewicz inequalities: subgradient flows, talweg, convexity, Transactions of the American Mathematical Society 362 (2010), no. 6, 3319–3363.
- [13] Jérôme Bolte, Shoham Sabach, and Marc Teboulle, Proximal alternating linearized minimization for nonconvex and nonsmooth problems, Mathematical Programming 146 (2014), no. 1, 459–494.
- [14] Jérôme Bolte, Aris Daniilidis, and Adrian Lewis, The Łojasiewicz inequality for nonsmooth subanalytic functions with applications to subgradient dynamical systems, SIAM Journal on Optimization 17 (2007), no. 4, 1205–1223.
- [15] Léon Bottou, Frank E. Curtis, and Jorge Nocedal, Optimization methods for large-scale machine learning, SIAM Review 60(2) (2018), 223–311.
- [16] Camille Castera, Jérôme Bolte, Cédric Févotte, and Edouard Pauwels, An inertial Newton algorithm for deep learning, Journal of Machine Learning Research 22 (2021), 1–31.
- [17] Zachary Charles and Dimitris Papailiopoulos, Stability and generalization of learning algorithms that converge to global optima, International conference on machine learning, 2018, pp. 745–754.
- [18] Andre Belotto da Silva and Maxime Gazeau, A general system of differential equations to model first-order adaptive algorithms, Journal of Machine Learning Research 21 (2020), no. 129, 1–42.
- [19] Simon S. Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh, Gradient descent provably optimizes over-parameterized neural networks, International Conference on Learning Representations, 2019.
- [20] Guilherme França, Jeremias Sulam, Daniel Robinson, and René Vidal, Conformal symplectic and relativistic optimization, Advances in Neural Information Processing Systems 33 (2020), 16916–16926.
- [21] Pierre Frankel, Guillaume Garrigos, and Juan Peypouquet, Splitting methods with variable metric for Kurdyka–Łojasiewicz functions and general convergence rates, Journal of Optimization Theory and Applications 165 (2015), no. 3, 874–900.
- [22] Tianxiang Gao, Hailiang Liu, Jia Liu, Hridesh Rajan, and Hongyang Gao, A global convergence theory for deep ReLU implicit networks via over-parameterization, International Conference on Learning Representations, 2021.
- [23] Guillaume Garrigos, Lorenzo Rosasco, and Silvia Villa, Convergence of the forward-backward algorithm: beyond the worst-case with the help of geometry, Mathematical Programming (2022), 1–60.
- [24] Jacob Palis Junior and Welington De Melo, Geometric theory of dynamical systems: an introduction, Springer-Verlag, 1982.
- [25] Hamed Karimi, Julie Nutini, and Mark Schmidt, Linear convergence of gradient and proximal-gradient methods under the Polyak-Łojasiewicz condition, Machine Learning and Knowledge Discovery in Database, Springer, 2016, pp. 795–811.
- [26] Diederik Kingma and Jimmy Ba, Adam: A method for stochastic optimization, International Conference on Learning Representations (ICLR), 2014.
- [27] Krzysztof Kurdyka, On gradients of functions definable in o-minimal structures, Annales de l’institut Fourier, vol. 48, 1998, pp. 769–783.
- [28] J. LaSalle, Some extensions of Lyapunov’s second method, IRE Transactions on Circuit Theory 7 (1960), no. 4, 520–527.
- [29] Chaoyue Liu, Libin Zhu, and Mikhail Belkin, Loss landscapes and optimization in over-parameterized non-linear systems and neural networks, Applied and Computational Harmonic Analysis (2022).
- [30] Hailiang Liu and Xuping Tian, An adaptive gradient method with energy and momentum, Annals of Applied Mathematics 38 (2022), no. 2, 183–222.
- [31] Hailiang Liu and Xuping Tian, Aegd: adaptive gradient descent with energy, Numerical Algebra, Control and Optimization (2023).
- [32] Hailiang Liu and Xuping Tian, SGEM: stochastic gradient with energy and momentum, Numerical Algorithms (2023), 1–28.
- [33] S. Łojasiewicz, Sur les trajectoires de gradient d’une fonction analytique, Seminari di Geometria 1982-1983 (lecture notes), Dipartemento di Matematica, Universita di Bologna (1984), 115–117.
- [34] Chao Ma, Lei Wu, and E Weinan, A qualitative study of the dynamic behavior for adaptive gradient algorithms, Mathematical and Scientific Machine Learning, 2022, pp. 671–692.
- [35] Chris J Maddison, Daniel Paulin, Yee Whye Teh, Brendan O’Donoghue, and Arnaud Doucet, Hamiltonian descent methods, arXiv preprint arXiv:1809.05042 (2018).
- [36] Michael Muehlebach and Michael Jordan, A dynamical systems perspective on nesterov acceleration, International Conference on Machine Learning, 2019, pp. 4656–4662.
- [37] Yurii Nesterov, A method of solving a convex programming problem with convergence rate $O(1/k^2)$, Doklady Akademii Nauk, vol. 269, Russian Academy of Sciences, 1983, pp. 543–547.
- [38] Yurii Nesterov, Introductory lectures on convex optimization: A basic course, vol. 87, Springer Science & Business Media, 2013.
- [39] Boris Polyak, Gradient methods for the minimisation of functionals, USSR Computational Mathematics and Mathematical Physics 3 (1963), 864–878.
- [40] Boris Polyak, Some methods of speeding up the convergence of iteration methods, USSR Computational Mathematics and Mathematical Physics 4 (1964), no. 5, 1–17.
- [41] Bin Shi, Simon S Du, Michael I Jordan, and Weijie J Su, Understanding the acceleration phenomenon via high-resolution differential equations, Mathematical Programming (2021), 1–70.
- [42] Mahdi Soltanolkotabi, Adel Javanmard, and Jason D Lee, Theoretical insights into the optimization landscape of over-parameterized shallow neural networks, IEEE Transactions on Information Theory 65 (2018), no. 2, 742–769.
- [43] Weijie Su, Stephen Boyd, and Emmanuel Candes, A differential equation for modeling Nesterov’s accelerated gradient method: Theory and insights, Advances in Neural Information Processing Systems 3 (2015).
- [44] Tijmen Tieleman, Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude, COURSERA: Neural networks for machine learning 4 (2012), no. 2, 26.
- [45] Andre Wibisono, Ashia C Wilson, and Michael I Jordan, A variational perspective on accelerated methods in optimization, proceedings of the National Academy of Sciences 113 (2016), no. 47, E7351–E7358.
- [46] Xiaofeng Yang, Linear, first and second-order, unconditionally energy stable numerical schemes for the phase field model of homopolymer blends, Journal of Computational Physics 327 (2016), 294–316.
- [47] Jia Zhao, Qi Wang, and Xiaofeng Yang, Numerical approximations for a phase field dendritic crystal growth model based on the invariant energy quadratization approach, International Journal for Numerical Methods in Engineering 110 (2017), no. 3, 279–300.
- [48] S. Łojasiewicz, A topological property of real analytic subsets (in French), Coll. du CNRS, Les équations aux derivéés partielles (1963), 87–89.