Convergence Analysis of Stochastic Gradient Descent with MCMC Estimators
Abstract
Understanding stochastic gradient descent (SGD) and its variants is essential for machine learning. However, most of the preceding analyses are conducted under amenable conditions such as unbiased gradient estimators and bounded objective functions, which do not encompass many sophisticated applications, such as variational Monte Carlo, entropy-regularized reinforcement learning and variational inference. In this paper, we consider the SGD algorithm that employs the Markov chain Monte Carlo (MCMC) estimator to compute the gradient, called MCMC-SGD. While MCMC reduces the sampling complexity significantly, it yields a biased estimator that is only asymptotically convergent in practice. Moreover, once a general class of unbounded functions is admitted, the MCMC sampling error becomes much more difficult to analyze. Therefore, we assume that the function is sub-exponential and use the Bernstein inequality for non-stationary Markov chains to derive error bounds of the MCMC estimator. Consequently, MCMC-SGD is proven to have a first order convergence rate with iterations and a sample size . This partially explains how MCMC influences the behavior of SGD. Furthermore, we verify the correlated negative curvature condition under reasonable assumptions. It is shown that MCMC-SGD escapes from saddle points and reaches approximate second order stationary points or -variance points at least steps with high probability. Our analysis unveils the convergence pattern of MCMC-SGD across a broad class of stochastic optimization problems, and interprets the convergence phenomena observed in practical applications.
1 Introduction
We consider a general class of stochastic optimization problems. Given a measurable space , let be a family of probability distributions on and be a function for . A general form of stochastic optimization problems is stated as
(1.1) |
where denotes the space of parameters, which is typically a subset of a Euclidean space . In the field of stochastic optimization, the most common form is typically expressed by a given distribution and a parameterized function . Also, there are numerous tasks, such as reinforcement learning (RL), that aim to find a suitable distribution under a given function . Compared to the two aforementioned problems, (1.1) represents a broader framework, wherein both the distribution and the function are dependent on parameters .
A main issue in solving the stochastic optimization problem (1.1) is the difficulty of directly computing integrals over high-dimensional spaces. The Monte Carlo method offers a strategy for estimating expectations using samples, thereby facilitating the computation of stochastic gradients. However, for certain complex variational distributions, it is prohibitive to obtain their exact probability density functions, which impedes direct sampling. In order to circumvent the intractable normalization constants in the probabilities, Markov chain Monte Carlo (MCMC) is employed to sample from the unnormalized probability. In contrast to ordinary unbiased Monte Carlo methods, the MCMC method requires some time for mixing and produces samples that are biased and correlated. There are several common MCMC algorithms, such as the Metropolis-Hastings algorithm [29, 16], Hamiltonian Monte Carlo [11], Stochastic Gradient Langevin Dynamics [45], etc. The efficiency of the optimization algorithm relies on the error of the MCMC sampling, which has not been adequately investigated in the previous literature. Previous studies [30, 19, 13, 37] on concentration inequalities for MCMC sampling typically assume a more idealized setting where the random variables are uniformly bounded.
With the stochastic gradient estimated by MCMC algorithms, the objective function is minimized using stochastic gradient descent (SGD) methods. The vanilla SGD method samples independently from a uniform distribution and has been extensively studied. Moulines and Bach first show linear convergence of SGD non-asymptotically for strongly convex functions [33]. Needell et al. improve these results by removing the quadratic dependency on the condition number in the iteration complexity results [34]. Among these convergence results, the gradient noise assumptions for i.i.d. samples are of vital importance to establish an convergence rate for general non-convex cost functions, where is the sample size per iteration and is the total number of iterations. However, stochastic sampling is not always independent or unbiased. The convergence of SGD with Markovian gradients has been studied in [12, 44], and SGD with biased gradient estimators is considered in [2]. Moreover, there is also some research on SGD with MCMC noise, which we call MCMC-SGD. Atchade et al. consider stochastic proximal gradient algorithms with MCMC sampling on convex objectives [3]. The uniform-in- ergodic assumption is proposed to analyze the Markov noise for convex bounded functions in their stochastic proximal gradient algorithm. Karimi et al. study SGD with Markovian noise [22]. They analyze the convergence by controlling the noise through the Poisson equation, but strong assumptions are imposed and the convergence result has an extra term that depends on the bias of the estimation scheme.
The contributions of this paper are multifaceted and advance the study of stochastic optimization, particularly in the context of machine learning and scientific computing. Three main contributions are delineated as follows:
1. Error analysis of MCMC estimators: A novel aspect of this paper is the application of concentration inequalities to estimate the upper bounds of the bias and variance associated with MCMC sampling. Compared with the conventional boundedness assumption in [30, 19, 13, 37], we adopt a broader assumption of unbounded functions, recognizing the more complex and realistic scenarios often encountered in MCMC sampling. Based on concentration inequalities, our approach investigates the MCMC error for a class of sub-exponential functions from the perspective of the spectral gap. This analysis is universal and non-asymptotic, entailing a meticulous examination of the non-stationary Markov chain on unbounded functions.
2. First-order convergence of MCMC-SGD: We demonstrate that the biased stochastic gradient, a result of the MCMC sampling, achieves first-order stationary convergence. In comparison with [3], we conduct a detailed analysis of the MCMC error and extend its assumption of convex bounded functions. Meanwhile, our analysis does not need assumptions involving the Poisson equation as in [22]. The convergence rate is quantified as , which is established after iterations, given a sufficiently large sample size . This result is instrumental in validating the effectiveness of the MCMC-SGD algorithm and provides a theoretical viewpoint that can be used to guide the design and tuning of the algorithm.
3. Escaping from saddle points and second-order convergence: Our investigation extends into the realm of second-order convergence properties. SGD escaping from saddle points with unbiased gradient estimators has been studied in [14, 8, 20]. Under the influence of biased MCMC noise, it is imperative to examine the second-order convergence properties of MCMC-SGD for the problem (1.1). By substantiating the correlated negative curvature (CNC) condition under errors, we generalize the analysis of unbiased SGD escaping from saddle points in [8] to MCMC-SGD, where the gradient estimator is biased and requires more careful analysis. A second-order convergence guarantee characterized by a rate of is established. This guarantee not only provides theoretical assurance of the convergence speed but also serves specifically as a testament to the algorithm's ability to efficiently approximate eigenvalues for variational eigenvalue problems in quantum mechanics.
The rest of this paper is organized as follows. In Section 2, we describe applications of the general optimization problem and introduce the SGD algorithm with MCMC sampling. In Section 3, we first give our assumptions on the function and the variational distribution. Then, the sampling error is analyzed non-asymptotically via a concentration inequality for Markov chains. We prove that the MCMC-SGD algorithm converges to stationary points and provide the convergence rates. In Section 4, the correlated negative curvature condition of the MCMC-SGD algorithm is verified for empirical sampling under reasonable assumptions, and the second-order convergence properties are further investigated. We establish the convergence guarantee for avoiding saddle points with high probability, based on the stochastic perturbations of MCMC algorithms.
2 SGD with MCMC estimators
For the problem (1.1), we notice that parameter dependency exists in both the function and distribution. It is regarded as a more generalized form than purely optimizing distributions or functions in stochastic optimization problems.
To begin with, we formally present the assumptions regarding and in problem (1.1).
Assumption 1.
For the stochastic optimization problem (1.1), the following two properties hold correspondingly.
(1) The function satisfies
(2.1)
(2) There exists an energy function such that the distribution is parameterized as
(2.2)
where is a known parameterization.
The function need not be assumed bounded on the space , as numerous examples demonstrate that such an assumption is often difficult to satisfy. Assumption 1(1) constitutes a natural extension of the case where is independent of the parameter , that is, for all . The function with restriction (2.1) not only covers a wider range of stochastic optimization problems but also retains the form of policy gradients, which facilitates our optimization efforts.
The distribution is typically viewed as an artificially designed parametric distribution, encompassing distributions from the exponential family (e.g., Gaussian, Bernoulli, Poisson distributions), structured probabilistic models such as graphical models (e.g., Bayesian networks, Markov random fields), and a series of deep learning architectures that serve as function approximators. In most scenarios, we may not have direct access to the value of ; instead, is parameterized by the energy function . Assumption 1(2) states that possesses an unnormalized form with the energy function , which frequently occurs in distributions expressed by neural networks. The relation (2.2) is a weak assumption and often appears in the design of complex distributions. Although the energy function is typically chosen to be tractable and differentiable with respect to the parameters , the computation of is often infeasible for complex models or high-dimensional spaces, as it involves an integral over all possible states of . However, MCMC techniques, such as the Metropolis-Hastings algorithm and Stochastic Gradient Langevin Dynamics, can sample from unnormalized distributions, thereby circumventing the need for complex integral computations.
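To make this concrete, the following is a minimal sketch of a random-walk Metropolis-Hastings sampler that targets a distribution specified only through a hypothetical energy function energy(theta, x); since the acceptance ratio involves only energy differences, the intractable normalization constant never has to be computed. The function names, the Gaussian proposal, and the step size are illustrative assumptions rather than part of the algorithms analyzed in this paper.

```python
import numpy as np

def metropolis_hastings(energy, theta, x0, n_samples, n_burnin, step=0.5, rng=None):
    """Random-walk Metropolis-Hastings targeting p_theta(x) proportional to exp(-energy(theta, x)).

    Only energy differences enter the acceptance ratio, so the normalization
    constant of p_theta is never required.
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float)
    e = energy(theta, x)
    samples = []
    for t in range(n_burnin + n_samples):
        proposal = x + step * rng.standard_normal(x.shape)  # symmetric Gaussian proposal
        e_prop = energy(theta, proposal)
        # Accept with probability min(1, exp(e - e_prop)).
        if np.log(rng.uniform()) < e - e_prop:
            x, e = proposal, e_prop
        if t >= n_burnin:  # discard the burn-in portion of the chain
            samples.append(x.copy())
    return np.array(samples)
```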
To substantiate the validity of this framework, let us consider the following three problems as exemplary cases, where it can be verified that is unbounded and satisfies (2.1).
2.1 Applications
2.1.1 Variational Monte Carlo for many-body quantum problems
Consider a many-body quantum system with particles. We denote the -particle configuration of this system by , with being the one-particle configuration space, which can be continuous or discrete. The wavefunction describes the quantum state of the many-body system, and is often required to satisfy some symmetric/anti-symmetric conditions. The Hamiltonian is a self-adjoint operator acting on the wavefunction, which determines the dynamics and the ground state of the quantum system.
A central issue in quantum physics is to compute the ground state energy and wavefunction of the system, which corresponds to the lowest eigenvalue and its corresponding eigenfunction of
(2.1) |
Since it is challenging to solve the very high dimensional eigenvalue problem (2.1) directly, an alternative way is to minimize the following Rayleigh quotient
(2.2) |
where the wavefunction is parameterized by a suitable ansatz within a finite parameter space. When is a discrete configuration space, the above integrals in (2.2) are regarded as summations over all configurations. We can regard (2.2) as a stochastic optimization problem by the following reformulation:
(2.3) |
As the Hamiltonian is a self-adjoint linear operator, the following holds for :
For example, the -electron configuration for a many-electron system in dimension is with , where represents the spatial coordinate and is the spin coordinate. The Hamiltonian of the electron system is given by
(2.4) |
where is the ionic potential and represents the interaction between electrons
(2.5) |
with and being the position and atomic number of the -th nucleus.
In quantum many-body systems, the classical VMC employs MCMC algorithms to compute approximate gradients, and it has garnered renewed interest due to the rise of modern neural network ansatzes. These techniques typically rely on the ability of artificial neural networks to represent complex high-dimensional wavefunctions, which has already been explored in many fields of physics and chemistry. The RBM was first proposed by Carleo and Troyer [6] as a variational ansatz for many-body quantum problems. Furthermore, a large number of deep neural networks, such as FFNNs [41, 5], deep RBMs [9, 15, 36, 23], convolutional neural networks [27, 7], and variational autoencoders [40], have been applied to capture the physical features and improve the accuracy of the ground state. Motivated by the traditional Slater-Jastrow-backflow ansatzes, PauliNet [17] and FermiNet [38] use permutation equivariant and invariant constructions of networks and many determinants to approximate a general antisymmetric wavefunction. However, the convergence properties of VMC methods have remained underexplored for a long time. A recent study [1] by Abrahamsen et al. provides a proof of convergence for VMC methods with an unbiased gradient estimator.
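As a concrete illustration of how MCMC samples enter the VMC gradient, the following sketch implements the commonly used covariance form of the Rayleigh-quotient gradient, \( \nabla_\theta L \approx 2\,\mathbb{E}\bigl[(E_{\mathrm{loc}}-\bar{E})\,\nabla_\theta \log|\psi_\theta|\bigr] \), where \(E_{\mathrm{loc}}(x) = (H\psi_\theta)(x)/\psi_\theta(x)\) is the local energy. This form assumes a real-valued ansatz, and the callables local_energy and grad_log_psi are hypothetical user-supplied routines rather than anything defined in this paper.

```python
import numpy as np

def vmc_gradient(samples, local_energy, grad_log_psi):
    """Stochastic gradient of the Rayleigh quotient from MCMC samples.

    samples      : configurations drawn (approximately) from |psi|^2 by MCMC
    local_energy : callable x -> (H psi)(x) / psi(x)
    grad_log_psi : callable x -> gradient of log|psi(x)| w.r.t. the parameters
    """
    e_loc = np.array([local_energy(x) for x in samples])    # local energies
    scores = np.array([grad_log_psi(x) for x in samples])   # score vectors
    e_mean = e_loc.mean()                                    # Monte Carlo energy estimate
    # Covariance form: centering by e_mean reduces the variance of the estimator.
    return 2.0 * np.mean((e_loc - e_mean)[:, None] * scores, axis=0)
```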
2.1.2 Entropy-regularized reinforcement learning
In RL, exploration in actions is critical to finding good policies during optimization. Without a large number of diverse state-action pairs, the algorithm might settle for a weak policy. To prevent concentration of the policy, entropy regularization is used to encourage random exploration [46, 32]. Let be the trajectory and be the cumulative reward over the trajectory . Compared to classical reinforcement learning, the introduction of entropy regularization for the parameterized policy yields a distinct objective function, that is,
(2.6) |
Since , it holds that
(2.7) |
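For reference, one common way to write the entropy-regularized objective and its score-function gradient is sketched below, with an assumed sign convention in which the reward is maximized and \(\lambda > 0\) weights the entropy bonus; the exact expressions (2.6)-(2.7) may differ in sign or normalization.

\[
J(\theta) \;=\; \mathbb{E}_{\tau\sim p_\theta}\bigl[R(\tau)\bigr] \;-\; \lambda\,\mathbb{E}_{\tau\sim p_\theta}\bigl[\log p_\theta(\tau)\bigr],
\qquad
\nabla_\theta J(\theta) \;=\; \mathbb{E}_{\tau\sim p_\theta}\Bigl[\bigl(R(\tau)-\lambda \log p_\theta(\tau)\bigr)\,\nabla_\theta \log p_\theta(\tau)\Bigr],
\]

where the second identity uses \(\mathbb{E}_{\tau\sim p_\theta}[\nabla_\theta \log p_\theta(\tau)]=0\), so the constant produced by differentiating the entropy term vanishes.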
2.1.3 Variational Inference
Approximating complex distributions constitutes a significant facet of Bayesian statistics. Compared to directly handling distributions within high-dimensional spaces, variational inference (VI) seeks to circumvent the computational complexities by approximating the target distribution with a simpler parameterized distribution [21, 39, 25, 4]. This approach is particularly advantageous when the exact posterior distribution is either too complex to be analytically tractable or computationally prohibitive to sample from, as is often the case in high-dimensional Bayesian inference problems. VI operates on the principle of optimization, where the objective is to minimize the discrepancy between the approximating variational distribution and the true posterior . This discrepancy is frequently quantified using the Kullback-Leibler (KL) divergence, a non-symmetric measure of the difference between two probability distributions:
(2.8) |
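In the common situation where only the joint density \(p(x,z)\) is available and the marginal likelihood is intractable, the gradient of this KL objective admits a score-function form that requires no normalization constant; the following is a sketch under that assumption:

\[
\nabla_\theta\,\mathrm{KL}\bigl(q_\theta \,\|\, p(\cdot\mid x)\bigr)
\;=\; \mathbb{E}_{z\sim q_\theta}\Bigl[\bigl(\log q_\theta(z) - \log p(x,z)\bigr)\,\nabla_\theta \log q_\theta(z)\Bigr],
\]

since the \(\log p(x)\) term and the extra constant arising from differentiating \(\log q_\theta\) both vanish by \(\mathbb{E}_{z\sim q_\theta}[\nabla_\theta \log q_\theta(z)]=0\).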
There has also been much attention on improving the explicit variational distribution in VI by applying MCMC sampling. Salimans et al. (2015) propose a hybrid approach that utilizes MCMC within the variational framework to refine the variational distribution [42]. By employing MCMC transitions that leave the target posterior distribution invariant, they introduce a rich class of variational distributions that can potentially capture complex dependencies and multimodal structures. A continuous relaxation is introduced by Maddison et al. (2017) for discrete random variables [28], which facilitates the application of gradient-based optimization techniques. Their paper is particularly relevant when dealing with discrete latent variables in VI, where MCMC sampling can be leveraged to approximate the gradients of the objective function more effectively. Li et al. (2017) propose utilizing MCMC sampling not just as a means of refining a given variational distribution, but also as a mechanism for learning an efficient proposal distribution for variational inference [26]. By amortizing the cost of MCMC sampling across different iterations of the inference process, they aim to reduce the computational overhead while maintaining the benefits of MCMC sampling.
2.2 Gradient approximation using MCMC
To solve the stochastic optimization problem (1.1), a common approach is to compute the stochastic gradient for iterations. When Assumption 1 holds, the gradient of the objective function is derived as
(2.9) | ||||
The second equality is due to and , implied by Assumption 1. We notice that the gradient can be expressed in the form of an expectation, thus allowing the approximation of the gradient through the Monte Carlo method. As for the unnormalized distribution , it is natural to employ MCMC methods to estimate the expectations appearing in the gradient.
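One standard route to a gradient expression of this type is the score-function (log-derivative) identity, sketched here under the assumption that differentiation and integration can be interchanged and using \(\mathcal{L}\) as an assumed symbol for the objective in (1.1):

\[
\nabla_\theta \mathcal{L}(\theta)
\;=\; \nabla_\theta \int f(\theta,x)\, p_\theta(x)\,\mathrm{d}x
\;=\; \mathbb{E}_{p_\theta}\bigl[\nabla_\theta f(\theta,x)\bigr]
\;+\; \mathbb{E}_{p_\theta}\bigl[f(\theta,x)\,\nabla_\theta \log p_\theta(x)\bigr].
\]

Moreover, writing \(q_\theta\) for the unnormalized density in (2.2), one has \(\nabla_\theta \log p_\theta(x) = \nabla_\theta \log q_\theta(x) - \mathbb{E}_{p_\theta}[\nabla_\theta \log q_\theta]\), so the normalization constant never enters the gradient.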
Let us consider a Markov chain with a stationary distribution . We have obtained a set of samples following burn-in steps. These samples are generated to approximate expectations with respect to , which is typically intractable. The objective function and its gradient are estimated by the MCMC samples :
(2.10) | ||||
(2.11) |
While MCMC methods offer a tractable and efficient strategy for approximating expectations, it must be noted that estimators derived from finitely many MCMC samples are generally biased. Furthermore, the samples are correlated, which affects the variance of the estimator. To mitigate the bias, one must ensure a sufficiently large sample size. However, an excessively large sample size may increase the computational burden and reduce the stochasticity needed for escaping local regions. In light of these considerations, the practitioner must carefully balance the sample size to minimize bias while preserving randomness and computational tractability. A rigorous approach may involve adaptive MCMC methods that adjust the burn-in period and sample size based on convergence diagnostics of Markov chains. This suggests that we should pay attention to the influence of sampling methods in optimization.
The SGD algorithm is utilized to update the parameters with MCMC samples:
(2.12) |
where are chosen stepsizes and are the obtained samples with the stationary distribution . MCMC-SGD provides a powerful framework for optimization in settings where the objective function is defined in terms of expectations over complex distributions. A detailed analysis of the interplay between SGD and MCMC sampling will aid us in deepening our understanding of convergence and refining the algorithms.
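Putting (2.10)-(2.12) together, a schematic sketch of one possible MCMC-SGD loop reads as follows; mcmc_sample and stochastic_grad are assumed callables (for instance, a Metropolis-Hastings routine and a score-function estimator of the form (2.11)), and the constant stepsize and fixed burn-in length are illustrative simplifications.

```python
import numpy as np

def mcmc_sgd(theta0, mcmc_sample, stochastic_grad, n_iters, n_samples, n_burnin, stepsize):
    """Schematic MCMC-SGD iteration in the spirit of (2.12).

    mcmc_sample(theta, n_samples, n_burnin) -> samples approximately distributed as p_theta
    stochastic_grad(theta, samples)         -> Monte Carlo estimate of the gradient
    """
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_iters):
        samples = mcmc_sample(theta, n_samples, n_burnin)  # biased, correlated samples
        grad = stochastic_grad(theta, samples)             # MCMC gradient estimator
        theta = theta - stepsize * grad                    # SGD update
    return theta
```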
3 Convergence analysis of MCMC-SGD
In this section, we explore how SGD performs when it is coupled with MCMC samples. We begin by stating a number of assumptions that are crucial to guaranteeing the necessary properties of both the objective function and the sampling process. Then, the MCMC estimator is analyzed by a concentration inequality specific to Markov chains. This approach is fundamentally distinct from that employed in traditional SGD settings. Upon establishing this analytical framework, we present our convergence theorem for MCMC-SGD. This theorem reveals that, within a practical and flexible setting, the expected value of the gradient norm is assured to converge to zero. This convergence is noteworthy since it persists despite the bias inherent in MCMC sampling methods, highlighting the effectiveness of the algorithm under a variety of conditions. It enhances our theoretical understanding of the integration of MCMC sampling and SGD and offers guidance for improving algorithmic methodologies for various stochastic optimization problems.
3.1 Assumptions
To lay the groundwork for our subsequent assumptions, we first expound upon the definition of sub-exponential random variables. This class of random variables is characterized by their tail behavior, which, in a certain sense, is less extreme than that of heavy-tailed distributions. A precise mathematical definition is as follows:
Definition 1.
The sub-exponential norm of a random variable , which assumes values in a space and follows a distribution , is defined by the expression
In the event that is bounded above by a finite value, is called sub-exponential with parameter . Moreover, the sub-exponential norm of a measurable function with respect to a distribution is defined by
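For concreteness, one common convention for these norms (assumed here; the exact normalization constants used in this paper may differ) is

\[
\|X\|_{\psi_1} \;=\; \inf\Bigl\{\, t>0 \;:\; \mathbb{E}\exp\bigl(|X|/t\bigr) \le 2 \Bigr\},
\qquad
\|g\|_{\psi_1(\pi)} \;=\; \bigl\|g(X)\bigr\|_{\psi_1}, \quad X\sim\pi .
\]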
We introduce some regularity conditions on . Noticing that may be unbounded for a general class of problems, we give a sub-exponential assumption on under the distribution .
Assumption 2.
Let given in (1.1) satisfy the following conditions. There exist constants such that
(1) ,
(2) .
The mentioned two assumptions are both reasonable and encompass a broad class of scenarios. A common example is with for all , which often appears in the optimization of entropy regularization and KL divergence. In this situation, is equivalent to
Since holds for the distribution , there must exist a constant relying on such that , and then Assumption 2 (1) is easily satisfied for a compact parameter space . Meanwhile, , which can be bounded through Assumption 3 (1) on the variational distribution . By establishing the reasonableness of Assumption 3 (1) on the variational distribution, one can deduce the validity of Assumption 2 (2) in this setting. As another instance, when solving the many-electron Schrödinger equation, people often use the Jastrow factor [18] to enable the network to efficiently capture the cusps and decay of the wavefunction. Exploiting physical knowledge in the construction of the wavefunction ansatz can make the function smooth with respect to the parameters , such that Assumption 2 is satisfied in practical VMC calculations.
There are also some regularity conditions on the variational distribution , in order to guarantee Lipschitz continuity of the gradient . Typically, it is more practical to impose these smoothness conditions on the energy function rather than directly on the distribution . This kind of smoothness helps us control how the gradient changes, which is important for ensuring that our optimization methods work efficiently and reliably.
Assumption 3.
Let be differentiable with respect to the parameters for any . There exist constants such that
(1) ,
(2) .
The reasonableness of these two assumptions depends solely on the choice of parameterization. For instance, suppose is taken as a canonical form of the exponential family, i.e., , where represents the natural parameter vector and is the sufficient statistic. As and , Assumption 3 holds when is uniformly bounded for all . If is parameterized by a complex network, it is equally imperative to ensure that its gradients with respect to any input and parameters are bounded.
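For the canonical exponential family mentioned above, a short computation (with \(T\) denoting the sufficient statistic, an assumed notation) illustrates why boundedness of \(T\) yields the required gradient bounds:

\[
p_\theta(x) \;\propto\; \exp\bigl(\theta^{\top} T(x)\bigr)
\quad\Longrightarrow\quad
\nabla_\theta \log p_\theta(x) \;=\; T(x) - \mathbb{E}_{p_\theta}\bigl[T(X)\bigr],
\]

so that \(\sup_x \|\nabla_\theta \log p_\theta(x)\| \le 2\sup_x \|T(x)\|\) whenever \(T\) is uniformly bounded.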
3.2 Analysis of the MCMC error
We introduce following notations which are used frequently throughout this paper. For any function and any distribution on , we write its expectation and -th central moment . The second central moment, called the variance, is denoted by .
Before the analysis of MCMC methods, we provide several general definitions about Markov chains. Within our consideration, the state space is Polish and equipped with its -algebra . Let be a time-homogeneous Markov chain defined on . The distribution of the Markov chain is uniquely determined by its initial distribution and its transition kernel . For any Borel set , let
A distribution is called stationary with respect to a transition kernel if
When the initial distribution , we call the Markov chain stationary.
Our analysis starts from the perspective of operator theory on Hilbert spaces. Let be the stationary distribution of a Markov chain and be the Hilbert space equipped with the norm . Each transition kernel can be viewed as a Markov operator on the Hilbert space . The Markov operator is defined by
It is easy to show that has the largest eigenvalue . Intuitively, though not rigorously, the gap between and the rest of the spectrum governs how fast the Markov chain moves from non-stationarity towards stationarity. Hence, we introduce the definition of the absolute spectral gap.
Definition 2 (absolute spectral gap).
A Markov operator admits an absolute spectral gap if
where is the projection operator with denoting the identity operator and is the operator norm induced by on .
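Written out explicitly (under notation that we assume matches the convention of this paper), the Markov operator and the absolute spectral gap take the form

\[
(Pf)(x) \;=\; \int_{\Omega} f(y)\,P(x,\mathrm{d}y),
\qquad
\lambda(P) \;=\; 1 - \bigl\|P - \Pi\bigr\|_{L^2(\pi)\to L^2(\pi)},
\qquad
(\Pi f)(x) \;=\; \int_{\Omega} f\,\mathrm{d}\pi,
\]

so a positive gap means that \(P\) strictly contracts the component of \(f\) orthogonal to the constants in \(L^2(\pi)\).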
We consider the sample set generated from the desired distribution by the MCMC algorithm . The Markov operator, denoted by , determines the convergence rate of the Markov chain. We assume that there exists a uniform lower bound of the spectral gap.
Assumption 4.
Let be the Markov operator induced by the MCMC algorithm with the absolute spectral gap . For any , there is a positive lower bound of absolute spectral gaps, that is,
This assumption excludes the situation that , which ensures the uniform convergence of the Markov chain. The spectral gap provides a lower bound on the mixing rate of the Markov chain, thereby guaranteeing the ergodicity of the MCMC algorithm. Compared to the uniformly geometric ergodicity assumptions commonly found in other articles [22, 3], the spectral gap assumption is significantly tighter, intuitively reflecting the factors that influence the convergence rate of a Markov chain. The spectral gap of the MCMC algorithm has been studied for some specific examples. On finite-state spaces, becomes a transition matrix while relates to its eigenvalues. A survey of spectra for Markov chains on discrete state spaces is given in [43], which develops a number of analytic techniques, and [10] provides further applications to the Metropolis-Hastings algorithm. For continuous spaces, there are few examples of sharp rates of convergence for the Metropolis-Hastings algorithm. In [24, 31], it is claimed that the spectral gap for the Gaussian proposal .
We commence by stating Theorem 1, which articulates the core result analyzing the error of MCMC gradient estimators. The proof of this theorem relies on a series of auxiliary propositions, namely Lemmas 1, 2 and 3. Lemma 1 introduces a Bernstein inequality for general Markov chains. Using the Bernstein inequality, we obtain the tail bound of the MCMC estimators for sub-exponential functions in Lemma 2. Then, a detailed and intuitive error bound for the MCMC estimators is derived in Lemma 3. Finally, through Lemma 3, we succinctly capture the sampling error inherent in MCMC algorithms. In the following theorem, we delineate the error bounds of this stochastic gradient, which is one of the main results of this paper.
Theorem 1.
Let Assumptions 1, 2, 3 and 4 hold. For a fixed parameter , we generate the MCMC samples with the stationary distribution by the MCMC algorithm. Suppose we start from the initial distribution and let . The stochastic gradient , defined by (2.11), has the following error bounds:
(3.1) | ||||
where with and , these factors are defined by , and , .
Proof.
The error of the stochastic gradient can be rewritten as
For any given such that , is approximated by . We denote the stationary variables , , with and the empirical variables , , . Then it holds
(3.2) | ||||
With Assumptions 3 and 2, we have and . Then the variance is bounded by . It also holds
which implies . Applying Lemma 3 to , we have
(3.3) |
To complete the proof of Theorem 1, we first introduce a pivotal tool of this paper, concentration inequalities, to analyze the error of the MCMC estimators in detail. The essence of concentration inequalities lies in their capacity to quantify how a random variable concentrates around its expected value. A Bernstein inequality for general Markov chains is proposed by Jiang and Fan [19] and is beneficial to our analysis. The following lemma is a direct corollary of Theorem 2 in [19].
Lemma 1.
Let be a Markov chain with stationary distribution and absolute spectral gap . Suppose the initial distribution is absolutely continuous with respect to the stationary distribution and its derivative . Consider a bounded function with and variance . Then, when , that is, when is stationary, it holds that
(3.5) |
The Bernstein inequality (3.5) shows how the average of MCMC samples concentrates around the expectation. However, the Markov chain is not always stationary. We define as the chi-squared divergence between and , by which the Bernstein inequality can be extended to a non-stationary version. We abbreviate and . Then, we derive the tail bound of the MCMC estimator for sub-exponential functions in the following lemma.
Lemma 2.
Let be a non-stationary Markov chain with stationary distribution and absolute spectral gap . Suppose the initial distribution is absolutely continuous with respect to the stationary distribution and its derivative . We consider a function satisfying . If , the following tail bound holds:
(3.6) |
In other words, for , with probability at least , it holds that
(3.7) |
Proof.
Without loss of generality, we assume that in the following proof. To deal with a possibly unbounded , we first fix a and consider the truncation function and . When is a sub-exponential random variable, its tail decays at least as fast as an exponential function. As , Markov's inequality implies
(3.8) |
Notice that for any event , it holds from the Cauchy-Schwarz inequality
(3.9) | ||||
Hence, we only consider the case , i.e., the case is stationary. We first fix a large . For any , we have
(3.10) |
It follows from the inclusion of events that
Notice that . Therefore, when , it holds that
where the third inequality is due to (3.5), and the last inequality uses . Finally, for any fixed , we take and then obtain
In the above lemma, we establish the tail bound of the MCMC estimator for unbounded functions. To the best of our knowledge, this result has not appeared in the literature on Bernstein inequalities for Markov chains. The novelty of our method is manifested in the sub-exponential assumption tailored for unbounded functions and the utilization of spectral gap analysis within concentration inequalities. Consequently, a comprehensive and intuitive error bound for the MCMC estimator is provided as follows.
Lemma 3.
Suppose that the condition of Lemma 2 holds. Then, error bounds for the MCMC estimator are given as follows.
(1) The bias satisfies that
(3.11) |
where and . In particular, when , .
(2) The variance satisfies that
(3.12) |
where , , , and .
Proof.
We follow the notations in the proof of Lemma 2.
Bias bound.
Clearly, we only need to bound , where is the distribution of . For the distribution whose derivative , it holds from the Cauchy-Schwarz inequality that
(3.13) | ||||
Besides, noticing that , we have
Then (3.13) implies that
(3.14) | ||||
and as long as , we can choose in (3.14) to derive . Combining (3.13) and (3.14) yields that
(3.15) |
Now, by the definition of , it holds that . Let us consider the smallest such that . Then clearly , and for , . Hence,
where the first inequality is the triangle inequality and the second uses (3.15). Therefore, it holds that
Variance bound.
In the following proof, we provide an upper bound on all higher-order moments of simultaneously. Denote . Then, we only need to bound for even under the condition
Notice that
where and . Thus, we further denote . Then
where the last inequality is due to the fact for any integer . Similarly,
Combining these two cases, we obtain
This completes the proof. ∎
3.3 First order convergence
In this subsection, we will first analyze the decrease for a single iteration based on the MCMC error bounds. Then, our main result delineates the convergence of the expected norm of the gradient throughout the iterations of SGD, when cooperating with the MCMC estimator. This convergence is established through a rigorous theoretical framework, underscoring the efficacy of the SGD algorithm in the presence of sampling randomness.
Under our assumptions in Subsection 3.1, we can affirm the -smoothness of the objective function, which facilitates a conventional optimization analysis. This is a common condition for ensuring expected descent. We prove the following lemma by bounding the Hessian.
Lemma 4.
Proof.
It is already known that the stochastic gradient is biased, unlike the unbiased estimator in classical SGD. However, the bias decays as the burn-in time or the sample size increases, and can therefore be regarded as sufficiently small. We discuss the expected descent of the function value within each iteration in the following lemma.
Lemma 5.
If the stepsize , for any given the objective function decreases in expectation as
(3.19) |
where and are defined in Theorem 1.
Proof.
For convenience, we simplify the notation with , and for any . By Lemma 4 and the property (3.18), we perform a gradient descent analysis of the iteration (2.12):
(3.20) | ||||
Taking the conditional expectation of (3.20) on gives
(3.21) | ||||
Through the fact , it holds
(3.22) |
Therefore, we can plug in (3.22) to (3.21) and rearrange to get
(3.23) |
Thus, (3.19) holds when we substitute the error bounds in Theorem 1 into (3.23). ∎
As goes towards positive infinity, the objective function decreases as in exact gradient descent. However, and cannot be too large because of the high sampling cost in practice. The error bounds studied in Section 3.2 are thus valuable for our analysis of the stochastic optimization. Since is non-convex, we consider the convergence rate in terms of the expected norm of the gradient . Finally, we establish our first-order convergence results as follows.
Theorem 2.
Proof.
Theorem 2 shows the convergence rate of MCMC-SGD with certain choices of stepsizes. The convergence rate is related to the bias and variance produced by the MCMC estimators. A trade-off involves balancing the convergence rate of the algorithm with the sample size. As the bias term depends on , we can select an appropriate sample size to ensure that the bias term has little impact on convergence. Also, in practical applications, we often choose a long burn-in period at the start of the algorithm to reduce the bias in sampling.
4 Escaping from saddle points
In this section, we discuss how MCMC-SGD escapes from saddle points and converges to second-order stationarity. Daneshmand et al. first propose the CNC assumption, under which plain SGD without explicit perturbations is proven to have second-order convergence [8]. Inspired by the ideas of this study, we proceed to analyze the escape from saddle points under biased gradient estimators. To begin with, a second moment lower bound for non-stationary Markov chains is developed to guarantee sufficient noise in the biased gradient. Then, we verify that the CNC condition under errors is satisfied when our assumptions hold. Eventually, we present our convergence analysis showing that Algorithm 1 reaches approximate second-order stationary points with high probability.
To deal with the second-order structure, the following assumption is needed.
Assumption 5.
(1) The Hessian matrix defined in (3.17) is -Lipschitz continuous, i.e.,
(2) Let be the unit eigenvector with respect to the minimum eigenvalue of the Hessian matrix . There exists a constant such that
(3) For any , the ratio of the sub-exponential norm of to its variance has an upper bound , that is,
The first assumption is common when performing a second-order convergence analysis. The second assumption refers to the correlated negative curvature (CNC) condition introduced by Daneshmand et al. [8], which assumes that the second moment of the projection of stochastic gradients along the negative curvature direction is uniformly bounded away from zero. In this way, plain SGD without explicit perturbations is proven to have second-order convergence. However, the CNC condition is difficult to satisfy when the stochastic gradient is almost surely zero. We modify their CNC condition into a more precise one by introducing the variance of the function . The third assumption guarantees that has a relatively light-tailed distribution. It holds for most distributions, especially near the ground state of quantum many-body problems.
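For reference, the CNC condition as stated in [8] takes the form (in our assumed notation)

\[
\mathbb{E}\bigl[\langle g(\theta),\, v_\theta\rangle^{2}\bigr] \;\ge\; \gamma \;>\; 0
\qquad \text{for all } \theta,
\]

where \(g(\theta)\) denotes the stochastic gradient and \(v_\theta\) is a unit eigenvector associated with the smallest eigenvalue of the Hessian; Assumption 5 (2) replaces the uniform constant \(\gamma\) by a quantity involving the variance of the function, which is the modification described above.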
Besides, in addition to Assumptions 2 and 3, we can assume an upper bound on the stochastic gradient to simplify our analysis:
(4.1) |
This treatment is valid because is assumed to be sub-exponential. Otherwise, the proof can also be established with techniques similar to those in Section 3.2. The upper bound will simplify our proof of Lemma 9.
4.1 The correlated negative curvature condition under errors
We first prove a lemma about the second moment for non-stationary Markov chains. This lemma establishes that the second moment of the MCMC estimator is bounded from below by a positive quantity depending on the sample size , the spectral gap and .
Lemma 6.
Let be the Markov chain with the stationary distribution and the initial distribution . Suppose it admits an absolute spectral gap and . The function has a finite fourth central moment . If and , it holds that
(4.2) |
Proof.
We denote and . Then it holds
(4.3) |
Notice the difference
(4.4) |
where the inequality changes the measure as in (3.13). It follows from the Cauchy-Schwarz inequality that
where the second equality is because is the stationary distribution of the Markov chain. Therefore, we derive that
(4.5) |
It holds from Theorem 3.1 in [37] that
(4.6) |
where on page 24 of [37]. Using the spectral method therein, we obtain that
(4.7) | ||||
Next, we bound the term in (4.3). By Lemma 3, it holds that
(4.8) |
Finally, combining (4.3), (4.5), (4.6), (4.7) and (4.8) yields
where the third inequality holds when and . ∎
The CNC condition under errors (4.9) below is of vital importance for analyzing the saddle-escaping property of SGD. It is one of the reasons why stochastic optimization algorithms can escape from saddle points. If the CNC condition does not hold, the noise may not be strong enough to escape along the descent direction. The following lemma shows that the empirical stochastic gradient defined in (2.11) satisfies the CNC condition under errors.
Lemma 7.
Proof.
The MCMC estimate of the gradient is computed by (2.11). For convenience, we fix a unit vector and denote the empirical variables by , with representing the average of . Let the stationary variables be and with a dependent variable ; then it holds that
To obtain the lower bound, we have
Since when , it holds that . It implies the kurtosis is less than . Then, when and , it follows from Lemma 6 that
(4.10) |
Besides, the upper bound of can be obtained by the Cauchy-Schwarz inequality that
(4.11) |
By the result in the proof of Lemma 3, we have
(4.12) |
Similarly, noticing that , we can apply [13, Theorem 12] to obtain
Using the similar technique in the proof of Lemma 2, we derive
(4.13) |
Plugging (4.12) and (4.13) into (4.11) yields
Therefore, as long as . Finally, combining our discussion of and , we derive that
This completes the proof. ∎
4.2 Second order stationarity
As discussed in Section 3, MCMC-SGD converges to first order stationary points. While the convergence of SGD is well understood for convex functions, the existence of saddle points and local minima poses challenges for non-convex optimization. Since ordinary GD often gets stuck near saddle points, the additional noise within SGD allows it to escape from them. We have verified the CNC condition for biased gradients, by which we are able to analyze the behavior of MCMC-SGD near saddle points. Compared to previous studies [8, 14, 20], the presence of bias in the stochastic gradient requires a more nuanced analysis.
We first give the definition of approximate second-order stationary points.
Definition 3 (Approximate second-order stationary point).
Given a function , an approximate second-order stationary point of is defined as
where and denote the gradient and Hessian of respectively.
If , the point is a second-order stationary point. The second order analysis contributes to understanding how MCMC-SGD leaves these saddle points and finally reaches approximate second-order stationary points.
An important observation is that when the variance of the function approaches zero, the efficacy of MCMC-SGD is compromised, as the stochastic gradient tends to zero. Notably, in the context of variational eigenvalue problems, the specific situation of interest arises when an eigenvalue is obtained as . This scenario is precisely what is required for our analysis. Hence, we define -variance points as another criterion for Algorithm 1.
Definition 4 (-variance point).
For the optimization problem (1.1), we call an -variance point if the function satisfies .
This definition may differentiate the regions where modeling and parameterization cause the algorithm to get stuck. For notational simplicity, we adopt an abbreviation in this subsection.
To escape from saddle points, a new stepsize schedule is adopted for Algorithm 1. Given a period , note that and are the constant stepsizes with , with values given in Table 1. Within one iteration period of steps, we adopt a large stepsize at the beginning of the period and a small one at the other iterations, that is,
(4.14) |
It will be shown that the schedule can be suitably designed to achieve sufficient descent in one period.
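A minimal sketch of the schedule (4.14): the first iteration of each period uses the large stepsize and the remaining iterations use the small one. The variable names below are illustrative and not taken from Table 1.

```python
def stepsize_schedule(t, period, large_step, small_step):
    """Periodic stepsize in the spirit of (4.14): a large step at the start of
    each period and small steps otherwise (iterations indexed from 0)."""
    return large_step if t % period == 0 else small_step

# Example: with period 5, iterations 0, 5, 10, ... take the large stepsize.
schedule = [stepsize_schedule(t, period=5, large_step=0.1, small_step=0.01) for t in range(12)]
```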
Suppose the total number of iterations is a multiple of and there are periods. We denote for in each period. For some given , we consider four regimes of the iterates as follows:
(4.15) | ||||
(4.16) | ||||
(4.17) |
stands for the regime with a large gradient, where the stochastic gradient works effectively. When the iterate lies in , despite being close to a first-order stationary point, the CNC condition mentioned in section 4.1 guarantees a decrease after iterations under our schedule. is a regime of approximate second order stationary points or variance points. We need to show Algorithm 1 will reach with high probability, that is, converge to approximate second order stationary points.
The analysis below relies on a particular choice of parameters, whose values satisfy the following lemmas and the main theorem. For ease of verification, the choice of parameters is collected in Table 1.
Parameter | Value | Order | Conditions
---|---|---|---
 | | | (4.20), (4.34)
 | | | (4.34)
 | | | (4.20), (4.34)
 | | | Lemma 7
 | - | | (3.1), Lemma 7
 | | | (4.20), (4.34)
 | | | (4.38)
 | | | (4.42)
We shall briefly introduce each parameter in Table 1 to facilitate the understanding of the following proofs. is the larger stepsize that assists SGD in achieving sufficient descent when the gradient norm is large. is the smaller stepsize that helps SGD escape saddle points when the algorithm approaches first-order stationary points. stands for the sufficient decrease in the expected function value near the large gradient region and saddle points. and are the sample size and the length of the burn-in period of the MCMC algorithms. is the bias of the MCMC algorithms, depending on and . represents the period length for switching between the stepsizes and , in which we adopt the large stepsize at the beginning and the small one at the other iterations. Finally, is the total number of iterations of MCMC-SGD. These parameters are chosen such that all the conditions are satisfied. We presuppose the validity of these conditions and then deduce the values of the parameters from them. Although the choice of each parameter may not appear intuitive, it is not necessary to substitute these parameters in their entirety during computation. One only needs to verify the fulfillment of the conditions listed in Table 1.
With a large gradient, it is easy to show a sufficient decrease of the objective function value. We have analyzed the per-iteration decrease in Lemma 5, and a similar argument applies to MCMC-SGD.
Lemma 8.
Proof.
We first decompose the difference of the expected function value into each iteration,
(4.18) |
where due to the definition . Using the choice in Table 1, it follows from (4.18) that
(4.19) | ||||
where the first inequality is by Lemma 5 for and the second is by direct substitution. Then, by the choice of in Table 1, the following conditions hold:
(4.20) |
As lies in , which means , we plug (4.20) into (4.19) and obtain that
This completes the proof. ∎
Near the saddle points, the classical GD gets stuck if the gradient is orthogonal to the negative curvature direction. However, the stochastic gradient with the CNC condition has inherent noise along the negative curvature direction. Under our stepsize schedule (4.14), the objective function value can have sufficient decrease after a period.
Lemma 9.
Proof.
The proof is by contradiction. Without loss of generality, we suppose that in this proof and denote
for . Every expectation in this proof is taken over all the randomness appearing in the corresponding formula. We assume the expected function value decreases by no more than , i.e.,
(4.21) |
We proceed to show that the assumption (4.21) is invalid.
We start with estimating the expected distance between and for . By Lemma 5, it holds
(4.22) | ||||
where the first inequality is derived similarly to (4.19), and the second inequality is due to . Together with the assumption (4.21), it follows that
(4.23) |
A direct implication of (4.23) is that
(4.24) |
We proceed to bound as follows. Firstly,
(4.25) | ||||
where the second equality is because we can take conditional expectation on first, the following inequality holds from the result and in Theorem 1, and the last inequality is due to AM-GM inequality. Therefore, we can bound
(4.26) | ||||
where the second inequality is due to (4.24), Theorem 1 and (4.25), and the final inequality follows from . Thus, we can take
and then
(4.27) |
When we take , (4.26) shows that the expected distance between and is bounded by a quadratic function of .
We further prove the expected distance between and grows at least exponentially for , leading to a contradiction. Since stays close to , the quadratic Taylor approximation of the function at is introduced as
We denote and for . Using the Taylor approximation in this way was first proposed in [14]. As is twice differentiable with a -Lipschitz Hessian, [35, Lemma 1.2.4] gives that
(4.28) |
Thus, . To derive the lower bound, is decomposed as
Let be the minimum eigenvalue of the Hessian , and let be the unit eigenvector with respect to (which is deterministic conditional on ). Then , and hence
Recursively expanding this equality out, we finally obtain
Therefore,
(4.29) | ||||
By the Cauchy-Schwarz inequality, Lemma 7 implies that
(4.30) |
where . Next, because is deterministic, the term can be bounded as
(4.31) | ||||
where the first inequality is due to , and the second inequality uses Theorem 1.
We next upper bound the term as follows.
(4.32) | ||||
where the second inequality is due to the Cauchy-Schwarz inequality, and in the last inequality we use Theorem 1.
Finally, we bound the term :
(4.33) | ||||
where the second inequality is due to (4.1), the third inequality uses (4.27), and the last inequality is because .
Substituting the four inequalities (4.30), (4.31), (4.32) and (4.33) into (4.29), we obtain the lower bound as
According to our settings in Table 1, these conditions are satisfied:
(4.34) |
It follows that
(4.35) |
Therefore, we can conclude that
(4.36) |
In other words, (4.36) shows that the expected distance between and grows at least exponentially for . Substituting , we arrive at a contradiction if the lower bound of is greater than the upper bound, i.e.,
(4.37) |
We choose
(4.38) |
in Table 1, taking a sufficiently large , and then (4.37) implies a contradiction. This completes the proof. ∎
Finally, we establish the main theorem of this section. It is shown that MCMC-SGD returns approximate second-order stationary points or -variance points with high probability. This result may explain the observation of convergence to eigenvalues in solving variational eigenvalue problems, where an -variance point means that an eigenvalue is obtained even if the objective function does not approach the global minimum. When an -variance point is desired in this kind of problem, we should design better neural network architectures to reduce those meaningless second-order stationary points.
Theorem 3.
Let Assumptions 2, 3, 4 and 5 hold. For any , with the stepsizes (4.14) and parameters in Table 1, Algorithm 1 returns an approximate second-order stationary point or an -variance point with probability at least after the following number of steps
Proof.
Suppose is the event
and its complement is . Let denote the probability of the occurrence of the event .
When occurs, by Lemmas 8 and 9, we have
(4.39) |
On the other hand, when occurs, it follows from (4.19) that
(4.40) |
where the first inequality is obtained by discarding positive terms in (4.19) and the second inequality is due to the choices in Table 1. This means that the function value may increase by no more than . Taking the overall expectation, (4.39) and (4.40) imply that
(4.41) |
5 Conclusions
In this paper, we explore the convergence properties of the SGD algorithm coupled with MCMC sampling. The upper bounds of the bias and variance of the MCMC estimator are obtained via concentration inequalities without resorting to the conventional boundedness assumption. Then, we show that MCMC-SGD achieves first order stationary convergence based on the error analysis of MCMC. It has convergence rate with a sufficiently large sample size after iterations. Moreover, we conduct a study of the second-order convergence properties of MCMC-SGD under reasonable assumptions. By verifying the CNC condition under errors, we discuss how MCMC-SGD escapes from saddle points and establish a second order convergence guarantee of with high probability. Our results explain the observation of convergence to eigenvalues in solving variational eigenvalue problems, thereby demonstrating the favorable second-order convergence properties of the MCMC-SGD algorithm.
There are several potential directions for future research. (1) The relationship between non-minimum eigenvalues in variational eigenvalue problems and optimization algorithms remains an area ripe for exploration. (2) Our convergence analysis suggests that improving the efficiency of sampling methods is of vital importance for better performance of stochastic algorithms. (3) The convergence of the natural gradient method and the KFAC method with MCMC samples is also of interest.
References
- [1] Nilin Abrahamsen, Zhiyan Ding, Gil Goldshlager, and Lin Lin. Convergence of stochastic gradient descent on parameterized sphere with applications to variational Monte Carlo simulation. arXiv preprint arXiv:2303.11602, 2023.
- [2] Ahmad Ajalloeian and Sebastian U Stich. On the convergence of SGD with biased gradients. arXiv preprint arXiv:2008.00051, 2020.
- [3] Yves F Atchadé, Gersende Fort, and Eric Moulines. On perturbed proximal gradient algorithms. The Journal of Machine Learning Research, 18(1):310–342, 2017.
- [4] David M Blei, Alp Kucukelbir, and Jon D McAuliffe. Variational inference: A review for statisticians. Journal of the American statistical Association, 112(518):859–877, 2017.
- [5] Z. Cai. Approximating quantum many-body wave-functions using artificial neural networks. Phys. Rev. B, 97:035116, 2018.
- [6] G. Carleo and M. Troyer. Solving the quantum many-body problem with artificial neural networks. Science, 355:602–606, 2017.
- [7] Kenny Choo, Giuseppe Carleo, Nicolas Regnault, and Titus Neupert. Symmetries and many-body excitations with neural-network quantum states. Physical review letters, 121(16):167204, 2018.
- [8] Hadi Daneshmand, Jonas Kohler, Aurelien Lucchi, and Thomas Hofmann. Escaping saddles with stochastic gradients. In International Conference on Machine Learning, pages 1155–1164. PMLR, 2018.
- [9] D. Deng, X. Li, and S. Das Sarma. Quantum entanglement in neural network states. Phys. Rev. X, 7:021021, 2017.
- [10] Persi Diaconis and Laurent Saloff-Coste. What do we know about the Metropolis algorithm? Journal of Computer and System Sciences, 57(1):20–36, 1998.
- [11] Simon Duane, Anthony D Kennedy, Brian J Pendleton, and Duncan Roweth. Hybrid Monte Carlo. Physics Letters B, 195(2):216–222, 1987.
- [12] John C Duchi, Alekh Agarwal, Mikael Johansson, and Michael I Jordan. Ergodic mirror descent. SIAM Journal on Optimization, 22(4):1549–1578, 2012.
- [13] Jianqing Fan, Bai Jiang, and Qiang Sun. Hoeffding’s inequality for general Markov chains and its applications to statistical learning. Journal of Machine Learning Research, 22(139):1–35, 2021.
- [14] Rong Ge, Furong Huang, Chi Jin, and Yang Yuan. Escaping from saddle points—online stochastic gradient for tensor decomposition. In Conference on learning theory, pages 797–842. PMLR, 2015.
- [15] I. Glasser, N. Pancotti, M. August, I. Rodriguez, and J. Cirac. Neural networks quantum states, string-bond states and chiral topological states. Phys. Rev. X, 8:011006, 2018.
- [16] W. K. Hastings. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57(1):97–109, 1970.
- [17] Jan Hermann, Zeno Schätzle, and Frank Noé. Deep-neural-network solution of the electronic Schrödinger equation. Nature Chemistry, 12(10):891–897, 2020.
- [18] Robert Jastrow. Many-body problem with strong forces. Physical Review, 98(5):1479, 1955.
- [19] Bai Jiang, Qiang Sun, and Jianqing Fan. Bernstein’s inequality for general Markov chains. arXiv preprint arXiv:1805.10721, 2018.
- [20] Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M Kakade, and Michael I Jordan. How to escape saddle points efficiently. In International conference on machine learning, pages 1724–1732. PMLR, 2017.
- [21] Michael I Jordan, Zoubin Ghahramani, Tommi S Jaakkola, and Lawrence K Saul. An introduction to variational methods for graphical models. Machine learning, 37:183–233, 1999.
- [22] Belhal Karimi, Blazej Miasojedow, Éric Moulines, and Hoi-To Wai. Non-asymptotic analysis of biased stochastic approximation scheme. Proceedings of Machine Learning Research vol XX, 1:33, 2019.
- [23] Raphael Kaubruegger, Lorenzo Pastori, and Jan Carl Budich. Chiral topological phases from artificial neural networks. Physical Review B, 97(19):195136, 2018.
- [24] Jörg Kienitz. Convergence of Markov chains via analytic and isoperimetric inequalities. PhD thesis, Bielefeld University, 2000.
- [25] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
- [26] Yingzhen Li, Richard E Turner, and Qiang Liu. Approximate inference with amortised MCMC. arXiv preprint arXiv:1702.08343, 2017.
- [27] Xiao Liang, Wen-Yuan Liu, Pei-Ze Lin, Guang-Can Guo, Yong-Sheng Zhang, and Lixin He. Solving frustrated quantum many-particle models with convolutional neural networks. Physical Review B, 98(10):104426, 2018.
- [28] Chris J Maddison, John Lawson, George Tucker, Nicolas Heess, Mohammad Norouzi, Andriy Mnih, Arnaud Doucet, and Yee Teh. Filtering variational objectives. Advances in Neural Information Processing Systems, 30, 2017.
- [29] Nicholas Metropolis, Arianna W Rosenbluth, Marshall N Rosenbluth, Augusta H Teller, and Edward Teller. Equation of state calculations by fast computing machines. The journal of chemical physics, 21(6):1087–1092, 1953.
- [30] Błażej Miasojedow. Hoeffding’s inequalities for geometrically ergodic Markov chains on general state space. Statistics & Probability Letters, 87:115–120, 2014.
- [31] Laurent Miclo and Cyril Roberto. Trous spectraux pour certains algorithmes de Métropolis sur r. Séminaire de Probabilités XXXIV, pages 336–352, 2000.
- [32] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International conference on machine learning, pages 1928–1937. PMLR, 2016.
- [33] Eric Moulines and Francis Bach. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. Advances in neural information processing systems, 24, 2011.
- [34] Deanna Needell, Rachel Ward, and Nati Srebro. Stochastic gradient descent, weighted sampling, and the randomized Kaczmarz algorithm. Advances in neural information processing systems, 27, 2014.
- [35] Yurii Nesterov. Introductory lectures on convex optimization: A basic course, volume 87. Springer Science & Business Media, Louvain-la-Neuve, 2003.
- [36] Y. Nomura, A. Darmawan, Y. Yamaji, and M. Imada. Restricted Boltzmann machine learning for solving strongly correlated quantum systems. Phys. Rev. B, 96:205152, 2017.
- [37] Daniel Paulin. Concentration inequalities for Markov chains by Marton couplings and spectral methods. Electronic Journal of Probability, 20:1–32, 2015.
- [38] David Pfau, James S Spencer, Alexander GDG Matthews, and W Matthew C Foulkes. Ab initio solution of the many-electron Schrödinger equation with deep neural networks. Physical Review Research, 2(3):033429, 2020.
- [39] Rajesh Ranganath, Sean Gerrish, and David Blei. Black box variational inference. In Artificial intelligence and statistics, pages 814–822. PMLR, 2014.
- [40] A. Rocchetto, E. Grant, S. Strelchuk, G. Carleo, and S. Severini. Learning hard quantum distributions with variational autoencoders. Quantum Inf., 4:28, 2018.
- [41] H. Saito and M. Kato. Machine learning technique to find quantum many-body ground states of Bosons on a lattice. J. Phys. Soc. Jpn., 87:014001, 2017.
- [42] Tim Salimans, Diederik Kingma, and Max Welling. Markov chain Monte Carlo and variational inference: Bridging the gap. In International conference on machine learning, pages 1218–1226. PMLR, 2015.
- [43] Laurent Saloff-Coste. Lectures on finite Markov chains. Lectures on probability theory and statistics, pages 301–413, 1997.
- [44] Tao Sun, Yuejiao Sun, and Wotao Yin. On Markov chain gradient descent. Advances in neural information processing systems, 31, 2018.
- [45] Max Welling and Yee W Teh. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th international conference on machine learning (ICML-11), pages 681–688. Citeseer, 2011.
- [46] Ronald J Williams and Jing Peng. Function optimization using connectionist reinforcement learning algorithms. Connection Science, 3(3):241–268, 1991.
- [47] Stephen Wright, Jorge Nocedal, et al. Numerical optimization. Springer Science, 35(67-68):7, 1999.