On the global convergence of randomized coordinate gradient descent for non-convex optimization
Abstract.
In this work, we analyze the global convergence property of coordinate gradient descent with random choice of coordinates and stepsizes for non-convex optimization problems. Under generic assumptions, we prove that the algorithm iterate will almost surely escape strict saddle points of the objective function. As a result, the algorithm is guaranteed to converge to local minima if all saddle points are strict. Our proof is based on viewing the coordinate descent algorithm as a nonlinear random dynamical system and on a quantitative finite block analysis of its linearization around saddle points.
1. Introduction
In this paper, we analyze the global convergence of the coordinate gradient descent algorithm for the smooth but non-convex optimization problem
(1.1) |
More specifically, we consider coordinate gradient descent with random coordinate selection and random stepsizes, as described in Algorithm 1.
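Algorithm 1 itself is not reproduced in this excerpt. For concreteness, a minimal sketch of one natural reading (a coordinate drawn uniformly at random and a stepsize drawn uniformly from an interval [nu, eta_max] at every iteration) is given below; the function names, parameter values, and test objective are illustrative assumptions rather than the paper's exact specification.

```python
import numpy as np

def random_coordinate_gd(grad, x0, num_iters=5000, nu=0.05, eta_max=0.1, seed=0):
    """Coordinate gradient descent with random coordinate and random stepsize:
    at each step draw i uniformly from {0, ..., d-1} and eta uniformly from
    [nu, eta_max], then update only the i-th coordinate."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    d = x.size
    for _ in range(num_iters):
        i = rng.integers(d)             # random coordinate
        eta = rng.uniform(nu, eta_max)  # random stepsize
        x[i] -= eta * grad(x)[i]        # coordinate gradient step
    return x

# Smooth non-convex test function f(x) = x_1^2 - cos(x_2), minimized at the origin
grad_f = lambda x: np.array([2.0 * x[0], np.sin(x[1])])
x_final = random_coordinate_gd(grad_f, [1.0, 1.0])
```

In this run the iterate approaches the minimizer at the origin; the point of the paper's main theorem is that for objectives with strict saddle points, such iterates avoid the saddles with probability one.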
The main result of this paper, Theorem 1, is that for any initial guess that is not a strict saddle point of , under some mild conditions, with probability , Algorithm 1 will escape any strict saddle point; thus, under some additional structural assumptions on , the algorithm will globally converge to a local minimum.
In order to establish the global convergence, we view the algorithm as a random dynamical system and carry out the analysis based on the theory of random dynamical systems. This might be of separate interest: to the best of our knowledge, the theory of random dynamical systems has not been utilized in analyzing randomized algorithms, while it offers a natural framework for establishing the long time behavior of such algorithms. Let us now briefly explain the random dynamical system view of the algorithm and our analysis; more details can be found in Section 3.
Let be the probability space for all randomness used in the algorithm, such that each is a sequence of coordinates and stepsizes. The iterate of Algorithm 1 can be described as a random dynamical system where is a nonlinear map for any given and .
Consider an isolated stationary point of the dynamical system, which corresponds to a critical point of . Near , the dynamical system can be approximated by its linearization: , where . The limiting behavior of the linear dynamical system can be well understood by the celebrated multiplicative ergodic theorem: Under some assumptions, the limit exists almost surely. The eigenvalues of the matrix , , characterize the long time behavior of the system. In particular, if the largest Lyapunov exponent is strictly positive, then if has some non-trivial component in the unstable subspace, would exponentially diverge from . More preliminaries on linear random dynamical systems can be found in Section 2.
Intuitively, one expects that the nonlinear dynamical system can be approximated by its linearization around a critical point , and would hence escape the strict saddle point, following the linearized system. However, the approximation by the linear dynamical system cannot hold over an infinite time horizon, due to error accumulation. Therefore, we cannot naively conclude using the multiplicative ergodic theorem and the linear approximation. Instead, a major part of the analysis is devoted to establishing a quantitative finite block analysis of the behavior of the dynamical system over a finite time interval. In particular, we will prove that when the iterate is in a neighborhood of , the distance will be exponentially amplified for a duration with high probability. This would then be used to prove that with probability the nonlinear system will escape strict saddle points.
1.1. Related work
Coordinate gradient descent is a popular approach in optimization; see, e.g., the review articles [Wright-15, shi2016primer]. Advantages of the coordinate gradient method include that, compared with full gradient descent, it allows larger stepsizes [Nesterov-12], enjoys faster convergence [Saha-13], and is amenable to parallelization [Liu-15, Richtarik-16].
The convergence of coordinate gradient descent has been analyzed in several settings, depending on the properties of the objective function and on the strategy of coordinate selection. The understanding of convergence for convex problems is quite complete: For methods with cyclic choice of coordinates, the convergence has been established in [Beck-13, Saha-13, Sun-15], and the worst-case complexity is investigated for convex quadratic objective functions in [Sun-19]. For methods with random choice of coordinates, it is shown in [Nesterov-12] that converges to sublinearly in the convex case and linearly in the strongly convex case. Convergence of the objective function with high probability has also been established in [Nesterov-12]. We also refer to [Richtarik-14, Liu-14, Liu-15, Wright-15] for further convergence results for random coordinate selection for convex problems. More recently, the convergence of methods with random permutation of coordinates (i.e., a random permutation of the coordinates is used for every step of the algorithm) has been analyzed, mostly for the case of quadratic objective functions [Lee-19, Oswald-17, Gurbuzbalaban-20, Wright-20]. Comparing various coordinate selection strategies in various settings remains an ongoing research direction. In addition, in the non-convex and non-smooth setting, the convergence of coordinate/alternating descent methods can be analyzed for tame/semi-algebraic functions with the Kurdyka-Łojasiewicz property (see, e.g., [attouch2013convergence, attouch2010proximal, bolte2014proximal, boct2020inertial]).
For non-convex objective functions, the global convergence analysis is less developed, as the situation becomes more complicated. Escaping strict saddle points has been a focused research topic in non-convex optimization, motivated by applications in machine learning. It has been established that various first-order algorithms with gradient noise or added randomness to iterates would escape strict saddle points, see e.g., [Ge-15, Levy-16, Jin-17, Jin-18, Jin-19, Guo-20] for works in this direction.
Among previous works on escaping saddle points, perhaps the closest in spirit to our current result are [Lee-16, ONeill-19, Lee-2019, leadeigen-19], where algorithms without gradient or iterate randomness are studied. It is proved in [Lee-16] that for almost every initial guess, the trajectory of the gradient descent algorithm (without any randomness) with constant stepsize does not converge to a strict saddle point. The result has been extended in [Lee-2019] to a broader class of deterministic first-order algorithms, including coordinate gradient descent with cyclic choice of coordinates. A global convergence result for cyclic coordinate gradient descent is also proved in [leadeigen-19] under slightly more relaxed conditions. A similar convergence result is obtained for the heavy-ball method in [ONeill-19]. Let us emphasize that in the case of coordinate algorithms, whether the algorithm can escape strict saddle points without randomly perturbing gradients or iterates is not merely a technical question. In fact, one simply cannot employ such random perturbations, e.g., adding a random Gaussian vector to the iterate, since doing so would destroy the coordinate nature of the algorithm.
The analysis in the works [Lee-16, Lee-2019, ONeill-19, leadeigen-19] is based on viewing the algorithm as a deterministic dynamical system and applying the center-stable manifold theorem for deterministic dynamical systems [Shub-87], which characterizes the local behavior of nonlinear dynamical systems near a stationary point. Such a framework obviously does not work for randomized algorithms. To some extent, our analysis can be understood as a natural generalization to the framework of random dynamical systems, which allows us to analyze the long time behavior of randomized algorithms, in particular coordinate gradient descent with random coordinate selection.
Let us mention that various stable, unstable, and center manifold theorems have been established in the literature on random dynamical systems; see, e.g., [Arnold_RDS, Ruelle-1979, Ruelle-82, Boxler-89-center, Liu-06]. These sample-dependent random manifolds also characterize the local behavior of random dynamical systems. However, as far as we can tell, one cannot simply apply such “off-the-shelf” results for the analysis of Algorithm 1. Instead, for the study of the algorithm, we have to carry out a quantitative finite block analysis for the random dynamical system near the stationary points. Our proof technique is inspired by the stability analysis of Lyapunov exponents of random dynamical systems, as in [Ledrappier-91, Froyland-15].
1.2. Organization
2. Preliminaries of random dynamical systems
In this section, we recall basic notions and results of random dynamical systems; for more details, we refer the readers to standard references such as [Arnold_RDS]. After introducing the preliminaries in this section, we will define the random dynamical system associated with Algorithm 1 in Section 3.1. Let be a probability space and let be a semigroup with being its Borel -algebra. serves as the notion of time. In the setting of Algorithm 1, we have , corresponding to the one-sided discrete time setting. Other possible examples of include , , and , with the assumption that .
Let us first define a random dynamical system. As we have mentioned in the introduction, the dynamics starting from can be determined once a sample is fixed. From the viewpoint of random dynamical systems, specifying the dynamics of is equivalent to specifying the dynamics of : Suppose that at time , the dynamics corresponds to ; then, to prescribe the future dynamics starting from time , we can specify the corresponding for some map . More precisely, we have the following definition of dynamics on .
Definition 2.1 (Metric dynamical system).
A metric dynamical system on a probability space is a family of maps satisfying that
-
(i)
The mapping is measurable;
-
(ii)
It holds that and ;
-
(iii)
is -preserving for any , where we say a map is -preserving if
The random dynamical system can then be defined as follows.
Definition 2.2 (Random dynamical system).
Let be a measurable space and let be a metric dynamical system on . Then a random dynamical system on over is a measurable map
satisfying the following cocycle property: for any , , and , it holds that
and that
(2.1) |
The cocycle property (2.1) is a key property of a random dynamical system: After time , if we restart the system at , the future dynamics corresponds to the sample . Note that is a map on ; with some abuse of notation, we also use to denote this map on and write . Then the cocycle property (2.1) can be written as
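The cocycle property can be checked directly on finite sample paths in the discrete-time setting. In the sketch below, a sample omega is a finite sequence of (coordinate, stepsize) pairs, the shift theta^m acts as `omega[m:]`, and the one-step map is a coordinate gradient step for a hypothetical quadratic objective; all of these concrete choices are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def step(x, pair, grad):
    """Single-step map: one coordinate gradient update from the pair (i, eta)."""
    i, eta = pair
    y = x.copy()
    y[i] -= eta * grad(x)[i]
    return y

def phi(n, omega, x, grad):
    """phi(n, omega): composition of the first n single-step maps along omega."""
    for k in range(n):
        x = step(x, omega[k], grad)
    return x

rng = np.random.default_rng(1)
grad = lambda x: x  # gradient of the illustrative quadratic f(x) = |x|^2 / 2
omega = [(int(rng.integers(3)), float(rng.uniform(0.05, 0.1))) for _ in range(20)]
x0 = rng.standard_normal(3)

# Cocycle property: phi(n + m, omega) = phi(n, theta^m omega) composed with phi(m, omega)
m, n = 7, 9
lhs = phi(n + m, omega, x0, grad)
rhs = phi(n, omega[m:], phi(m, omega, x0, grad), grad)
assert np.allclose(lhs, rhs)
```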
In this work, we will focus on the one-sided discrete time and , where is -preserving and is the -fold composition of . Suppose that and is measurable. Consider a linear random dynamical system defined as (we use for the linear system, while reserving for the nonlinear dynamics considered later)
where the right-hand side is the product of a sequence of random matrices. In this setting, the behavior of the linear system is well understood via the celebrated multiplicative ergodic theorem, also known as the Oseledets theorem, which we recall in Theorem 2.3. Such results were first established by V. I. Oseledets [oseledets] and further developed in many works, such as [raghunathan, Ruelle-1979, Walters-93].
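Numerically, the largest Lyapunov exponent of such a matrix product can be estimated by tracking the log-norm growth of a vector with per-step renormalization (to avoid overflow or underflow). The i.i.d. model below, a Haar-random rotation composed with a fixed diagonal stretch, is an illustrative assumption unrelated to any specific algorithm.

```python
import numpy as np

def haar_orthogonal(rng, d):
    """Haar-distributed orthogonal matrix: QR of a Gaussian matrix, with a
    sign correction taken from the diagonal of R."""
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    return q * np.sign(np.diag(r))

def top_lyapunov(n_steps=20000, seed=0):
    """Estimate lambda_1 = lim (1/n) log |A(theta^{n-1} omega) ... A(omega) v|."""
    rng = np.random.default_rng(seed)
    stretch = np.diag([3.0, 1.0, 0.8])  # fixed deformation applied every step
    v = np.array([1.0, 0.0, 0.0])
    log_growth = 0.0
    for _ in range(n_steps):
        v = haar_orthogonal(rng, 3) @ (stretch @ v)
        nrm = np.linalg.norm(v)
        log_growth += np.log(nrm)
        v /= nrm                        # renormalize; keep only the direction
    return log_growth / n_steps

lam1 = top_lyapunov()  # positive: the product expands along a typical direction
assert 0.0 < lam1 < np.log(3.0)
```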
Theorem 2.3 (Multiplicative ergodic theorem, [Arnold_RDS, Theorem 3.4.1]).
Suppose that
where we have used the short-hand . Then there exists an -invariant with , such that the following hold for any :
-
(i)
It holds that the limit
(2.2) exists and is a positive definite matrix. Here denotes the transpose of the matrix (as is a linear map on ).
-
(ii)
Suppose has distinct eigenvalues, which are ordered as . Denote the corresponding eigenspace, with dimension , for . Then the functions , , and , , are all measurable and -invariant on .
-
(iii)
Set , and . Then it holds that
(2.3) for . The maps and from to the Grassmannian manifold are measurable.
-
(iv)
It holds that
-
(v)
When is ergodic, i.e., every with satisfies or , the functions , , and , , are constant on .
In Theorem 2.3, are known as Lyapunov exponents and is the Oseledets filtration. We can see from the above theorem that both the Lyapunov exponents and the Oseledets filtration are -forward invariant.
The Lyapunov exponents describe the asymptotic growth rate of as . More specifically, (2.3) implies that when , for any , there exists some , such that
holds for any . The subspaces spanned by eigenvectors of corresponding to eigenvalues smaller than, equal to, and greater than are the stable subspace, center subspace, and unstable subspace, respectively. The stable and unstable subspaces correspond to exponential convergence and exponential divergence, respectively. When starting from the center subspace, we would get some sub-exponential behavior.
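In the degenerate case of a single deterministic matrix (so that the Lyapunov exponents reduce to the logarithms of the eigenvalue moduli), the dichotomy between the stable and unstable subspaces can be seen directly; the matrix below is an illustrative choice.

```python
import numpy as np

# Eigenvalues 0.5 (stable, exponent log 0.5 < 0) and 2.0 (unstable, log 2 > 0)
A = np.array([[0.5, 0.0],
              [0.0, 2.0]])

xs = np.array([1.0, 0.0])  # initial vector in the stable subspace
xu = np.array([0.0, 1.0])  # initial vector in the unstable subspace
for _ in range(30):
    xs, xu = A @ xs, A @ xu

assert np.linalg.norm(xs) < 1e-8  # exponential convergence: 0.5**30
assert np.linalg.norm(xu) > 1e+8  # exponential divergence: 2.0**30
```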
The multiplicative ergodic theorem also generalizes to continuous time and two-sided time. We refer interested readers to [Arnold_RDS, Theorem 3.4.1, Theorem 3.4.11] for details.
The stable, unstable, and center subspaces can be generalized to stable, unstable, and center manifolds when considering nonlinear systems; see, e.g., [Arnold_RDS, Ruelle-1979, Ruelle-82, Boxler-89-center, Weigu-Sternberg, Liu-06, Lian-10, Li-13, Guo-16]. These manifolds play similar roles in characterizing the local behavior of nonlinear random dynamical systems as the subspaces do for linear random dynamical systems. In particular, the Hartman–Grobman theorem establishes the topological conjugacy between a nonlinear system and its linearization [Wanner-95]. There are also other conjugacy results for random dynamical systems; see, e.g., [Weigu-Sternberg, Weigu-05, Weigu-08, Weigu-16].
3. Main results
3.1. Setup of the random dynamical system
Let us first specify the random dynamical system corresponding to Algorithm 1.
-
•
Probability space. For each , denote the usual probability space for the distribution , where and are the uniform distributions on the set and the interval , respectively. Let be the product probability space of all . Denote by the projection from onto . Thus, a sample can be represented as a sequence , where . Let be the filtration defined by
-
•
Metric dynamical system. The metric dynamical system on is constructed by the (left) shifting operator defined as
which is clearly measurable and -preserving. The metric dynamical system is then given by for .
-
•
Random dynamical system. For any and , we define to be a (nonlinear) map on as
where is the first pair/element in the sequence , and we define the map via
while is the identity operator. It is clear that satisfies the cocycle property (2.1) and hence defines a random dynamical system on over . The iterate of Algorithm 1 follows the random dynamical system as
It can be seen that is -predictable, i.e., is -measurable for any , since is determined by samples .
In our analysis, we will use linearization of the dynamical system at a critical point of . Without loss of generality, we assume ; otherwise we consider the system with state being . The resulting linear system, which depends on , is given by (here and in the sequel, we use the superscript to indicate dependence on the matrix)
(3.1) |
where
(3.2) |
Note that is bounded in . We know that is integrable. When , the matrix is invertible, and the inverse is given explicitly by applying the Sherman-Morrison formula:
(3.3) |
In particular, we have
(3.4) |
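The inline formulas (3.2)-(3.4) are not rendered in this excerpt. Assuming the single-step matrix of the linearized system has the rank-one-update form I - eta * e_i (H e_i)^T (a coordinate gradient step on the quadratic model with symmetric Hessian H), its Sherman-Morrison inverse can be verified numerically; the dimensions and values below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, eta, i = 4, 0.1, 2
H = rng.standard_normal((d, d))
H = (H + H.T) / 2.0          # symmetric Hessian
e_i = np.eye(d)[i]

# Single-step matrix of the linearized system: only row i differs from I
A = np.eye(d) - eta * np.outer(e_i, H[i])

# Sherman-Morrison for the rank-one update I + u v^T with u = -eta*e_i, v = H e_i:
#   (I + u v^T)^{-1} = I - u v^T / (1 + v^T u),  here 1 + v^T u = 1 - eta*H[i, i]
A_inv = np.eye(d) + eta * np.outer(e_i, H[i]) / (1.0 - eta * H[i, i])

assert np.allclose(A @ A_inv, np.eye(d))
assert np.allclose(A_inv @ A, np.eye(d))
```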
Thus, if we take the maximal stepsize such that , is bounded in , and as a result is also integrable. Therefore, the assumptions of Theorem 2.3 hold. The shifting operator is ergodic on by Kolmogorov’s 0–1 law. Then Theorem 2.3 applies for with , , and all being a.e. constant. For any in the set from Theorem 2.3 satisfying , we denote
(3.5) |
Then the following invariant property holds
Note that works as a center-stable subspace. That is, for any and any , it holds that , for sufficiently large , and for , grows exponentially as with rate greater than for any .
3.2. Assumptions
In this section, we specify the assumptions on the objective function used in this paper. The first is a standard smoothness assumption on .
Assumption 1.
and the Hessian is uniformly bounded, i.e., there exists such that for all .
An optimization algorithm is expected to converge, under some reasonable assumptions, to a critical point of where the gradient vanishes. Our aim is to further characterize the possible limits of the algorithm iterates. For this purpose, we distinguish , the set of all strict saddle points (including local maxima with non-degenerate Hessian) of :
where we use the subscript to emphasize that it is strict. Due to the presence of a negative eigenvalue of the Hessian, if we were considering the gradient dynamics near the critical point, the saddle point would be an unstable equilibrium. Our first result is that this instability also occurs in the linear random dynamical system where . In other words, the dimension of defined in (3.5) is greater than . While this mainly serves as a preliminary step for our analysis of the nonlinear dynamics, the conclusion by itself might be of interest, and is stated as follows. The proof will be deferred to Section 4.1.
Proposition 3.1.
Let have a negative eigenvalue and let ; then the largest Lyapunov exponent of is positive.
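Proposition 3.1 can be illustrated numerically (an illustration, not a proof): for a Hessian with one negative eigenvalue, the linearized random coordinate iterates grow exponentially in norm for a typical initial vector. The diagonal Hessian and the stepsize range below are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
H = np.diag([1.0, 2.0, -1.0])     # strict saddle: one negative eigenvalue
L = np.max(np.abs(np.diag(H)))    # gradient Lipschitz constant of the quadratic
x = np.ones(3)
log_norm = 0.0
n = 4000
for _ in range(n):
    i = rng.integers(3)
    eta = rng.uniform(0.25 / L, 0.5 / L)
    x[i] -= eta * (H @ x)[i]      # linearized single step at the saddle
    nrm = np.linalg.norm(x)
    log_norm += np.log(nrm)       # track cumulative log-growth...
    x /= nrm                      # ...and renormalize to avoid overflow

lyap_estimate = log_norm / n      # positive: exponential escape along e_3
assert lyap_estimate > 0.01
```

The growth comes from the third coordinate, which is multiplied by a factor 1 + eta greater than one each time it is selected, while the other coordinates contract.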
Our goal is to generalize such results to the nonlinear dynamics near strict saddle points of , for which we require two additional assumptions, as follows.
Assumption 2.
For every , is non-degenerate, i.e., is a non-degenerate critical point of in the sense that any eigenvalue of is nonzero.
Assumption 2 is also a standard assumption, which in particular guarantees that each strict saddle point is isolated, due to the non-degenerate Hessian. For each strict saddle point, Proposition 3.1 guarantees that the corresponding unstable subspace is non-trivial (has dimension at least ). We would in fact require a stronger technical assumption on its structure.
Assumption 3.
For every , it holds that , for every and almost every , where is the orthogonal projection onto with .
We expect that Assumption 3 holds generically. We also show in Appendix A that Assumption 3 can be verified when has no zero off-diagonal elements (and is not trivial). However, there exist cases where Assumption 3 might not hold. One example is where only has positive eigenvalues and only has negative eigenvalues, which implies that and .
We also remark that Assumption 1 is essential in our framework, since the linearized system is defined using the Hessian matrix. Analysis of randomized coordinate method for non-smooth optimization problems requires new techniques and deserves future research.
3.3. Main results
Given an initial guess , the behavior of the algorithm, in particular, the limit of , depends on the particular sample . For any , we denote the set of all such that the algorithm starting at would converge to :
We further define the set as the union of all over ,
Thus, if , the limit , if it exists, will not be one of the strict saddle points. Our main result in this paper proves that the set is of measure zero, i.e., for any initial guess that is not a strict saddle point, with probability , Algorithm 1 will not converge to a strict saddle point.
The intuition behind the proof of Theorem 1 is to compare the nonlinear dynamics around a strict saddle point with its linearization, as the linear dynamics has a non-trivial unstable subspace, thanks to Proposition 3.1. Ideally, one would hope that the nonlinear dynamics closely follows the linear dynamics, and thus leaves the neighborhood of eventually; the obstacle to such an argument, however, is that the approximation by the linearization is only valid over a finite time interval. Therefore, to establish the instability behavior of the nonlinear dynamics, we need a much more refined and quantitative argument using the instability of the linear system. In particular, we need to show that over a finite interval, with high probability, the linear system, and hence the nonlinear system, moves away from the strict saddle point with quantitative bounds; see Theorem 4.4 in Section 4.2. Theorem 1 then follows from an argument in a similar spirit to the law of large numbers; see Section 4.3.
Remark 3.2.
The technical Assumption 3 and the randomness in the stepsizes are imposed so that the iterate obtains some non-trivial component in the unstable subspace, which is then further amplified within a sufficiently long but finite time interval. When and are fixed and relatively large, a random would keep away from with high probability; see Section 4.2 for more details. It is an interesting open question whether it is possible to establish similar results without such assumptions. Our conjecture is that still holds for unless is located in a set with Lebesgue measure zero, similar to the results established in [Lee-2019].
As an application of our main result Theorem 1, we can obtain the global convergence to stationary points with no negative Hessian eigenvalues for Algorithm 1. More specifically, denote by
the set of all critical points of , we have the following corollary, which will also be proved in Section 4.3.
Corollary 3.3.
Assume the same assumptions as in Theorem 1 and, further, that every is an isolated critical point. Then for any with bounded level set , with probability , is convergent with limit in .
Remark 3.4.
In Corollary 3.3, if we further assume that all saddle points of are strict, then the algorithm iterate converges to a local minimum with probability . Let us also mention that for many non-convex problems, saddle points are suboptimal while there do not exist “bad” local minima, e.g., phase retrieval [sun2018geometric], deep learning [kawaguchi2016deep, lu2017depth], and low-rank matrix problems [ge2017no]. For these problems, convergence to local minima suffices to guarantee good performance.
Remark 3.5.
In our setting, without adding noise to the gradient or iterate, we cannot hope for good convergence rates for an arbitrary initial iterate. In fact, as shown in [Du-17], the convergence of the deterministic gradient descent algorithm to a local minimum might take an exponentially long time; we expect similar behavior for the randomized coordinate gradient descent Algorithm 1. Let us also remark that while we need the random stepsize as discussed in Remark 3.2, the interval could be made arbitrarily small; the result holds as long as .
4. Proofs
We collect all proofs in this section.
4.1. Analysis of the linearized system
We will first study the linear dynamical system, for which we assume the objective function is given by
(4.1) |
where is a symmetric matrix in with at least one negative eigenvalue. In this case, the coordinate descent algorithm is given by
which corresponds to the linear dynamical system with single step map , defined in (3.1) and (3.2) respectively.
Our main goal in this subsection is to prove Proposition 3.1 for this linear dynamical system, which states that at least one Lyapunov exponent of is positive. It suffices to show that there exists some , such that grows exponentially to infinity, which will follow from an energy argument similar to the proof of [Lee-2019, Proposition 5]. Although we consider a randomized coordinate gradient descent algorithm instead of a cyclic one, one step of the proof of Proposition 4.1, namely Lemma 4.3, closely follows the proof in [Lee-2019, Appendix A]. We start from with and consider a finite time interval with length . Proposition 4.1 establishes a quantitative decay estimate for compared with , which leads to our desired result, Proposition 3.1.
Proposition 4.1.
Let be fixed. For the objective function (4.1) with , suppose that . Then there exists depending on , , , and , such that
holds as long as and (in the sense of sets).
Remark 4.2.
The condition above is known as “generalized Gauss-Seidel rule” in the literature of coordinate methods [Tseng-09, Wright-12].
Proof.
Without loss of generality, we assume that . Due to the choice of , we have the following simple non-increasing property for any :
(4.2) |
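Assuming (4.2) is the standard monotone decrease of the quadratic objective along coordinate steps, it can be checked numerically: for f(x) = x^T H x / 2, a step x_i <- x_i - eta*(Hx)_i with eta <= 1/L changes f by -eta*(Hx)_i^2*(1 - eta*H_ii/2) <= 0, even for indefinite H. The matrix below is an illustrative choice.

```python
import numpy as np

# Indefinite symmetric Hessian (the origin is a saddle point)
H = np.array([[2.0, 0.5, 0.0],
              [0.5, 1.0, 0.3],
              [0.0, 0.3, -1.0]])
L = np.max(np.abs(np.linalg.eigvalsh(H)))  # spectral norm; bounds every |H_ii|
f = lambda x: 0.5 * x @ H @ x

rng = np.random.default_rng(0)
x = rng.standard_normal(3)
values = [f(x)]
for _ in range(200):
    i = rng.integers(3)
    eta = rng.uniform(0.05 / L, 1.0 / L)
    x[i] -= eta * (H @ x)[i]               # coordinate gradient step
    values.append(f(x))

# f never increases along the trajectory (up to rounding)
assert all(b <= a + 1e-12 for a, b in zip(values, values[1:]))
```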
Write with and . Let
(4.3) |
Then holds for any . Using (4.2), to give an upper bound for , we would like a nontrivial lower bound for for some , which is guaranteed by Lemma 4.3, whose proof will be postponed.
Lemma 4.3.
Suppose that . For any
(4.4) |
where and are the smallest and largest positive singular values of respectively, if , then there exists such that , where the sequence is given as in (4.3).
Assuming the lemma, there exists such that
with a fixed satisfying (4.4). We can further constrain that . Thus, we have
which combined with , yields that
Set and we get that . ∎
We finish the proof by establishing Lemma 4.3 below.
Proof of Lemma 4.3.
Suppose on the contrary that for any . It holds that
We claim that
(4.5) |
for any . By induction, assume that holds for some , then
where the last inequality uses . It follows that .
We are now ready to prove Proposition 3.1 which states the existence of positive Lyapunov exponent of the linear dynamical system.
Proof of Proposition 3.1.
It suffices to show that for almost every there exist some , , and , such that satisfies for any . Let be an eigenvector corresponding to a negative eigenvalue of . Then it holds that . Consider a fixed . For any , set if and otherwise. We can see that the random variables, , are independent and identically distributed with . By Proposition 4.1, we obtain that
where is the constant from Proposition 4.1. Therefore,
which implies that
(4.6) |
Note that . The strong law of large numbers implies that for almost every , there exists some , such that for all
(4.7) |
Combining (4.6) and (4.7), we arrive at
Noticing that is greater than , grows exponentially in and we complete the proof. ∎
4.2. Finite block analysis
In this subsection, we study the behavior of the nonlinear dynamical system near a strict saddle point of , which without loss of generality can be assumed to be . As mentioned above, in a small neighborhood of , while it is not possible to control the difference between nonlinear and linear systems for infinite time, the nonlinear system can be approximated by the linear system during a finite time horizon.
The main conclusion of this subsection is the following theorem which states that after a finite time interval with length , the distance of the iterate from will be amplified exponentially with high probability.
Theorem 4.4.
The lower bound (4.8) quantifies the amplification of : While we always have (see (4.20) below), the result states that with probability at least , the amplification factor is at least the right-hand side of (4.8), which is exponentially large in . Hence, on average, would be much larger than . This leads to the escape of the iterate from the neighborhood of .
To prove Theorem 4.4, we require a more quantitative characterization of the behavior of the linearization at . In particular, we need a high probability estimate of the distance of the iterate from after some time interval. For this purpose, conditioned on with the iterate , we will first show in Lemma 4.8 that, after some finite time, the orthogonal projection of the iterate onto the unstable subspace, where for some constant , is significant. The component in the unstable subspace is then further amplified by , where . Here the time duration is chosen sufficiently large such that the distance from to is exponentially amplified. Theorem 4.4 follows by setting . In the second step above, we need to control the closeness between the linear and nonlinear systems over a time horizon of length .
Such a finite block analysis approach has been used to establish the stability of Lyapunov exponent of random dynamical systems [Ledrappier-91, Froyland-15], which inspired our proof technique for Theorem 4.4 and Theorem 1.
We first set the small constant in Theorem 4.4 which controls the failure probability of the amplification bound. Let be the Lyapunov exponents of the linearized system at . We set
(4.9) |
Note that . Let be sufficiently small such that
(4.10) |
The reason for such choice will become clear later (see (4.23)). For the rest of the section, we will consider a fixed .
We now state and prove several lemmas for Theorem 4.4. First, in the following Lemma 4.5, we construct a stopping time that is bounded almost surely and such that the component of the gradient is comparable with in amplitude with high probability.
Lemma 4.5.
Let be a fixed constant. There exists some constant , such that for any , there exists a measurable such that and
(4.11) |
Proof.
For any , use to denote the smallest non-negative integer , such that
It is clear that , since for each step the coordinate is randomly chosen. Hence, there exists some , such that
We finish the proof by setting , which has the desired property. ∎
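The almost-sure finiteness of the stopping time rests on the fact that a fixed coordinate is first selected after a geometrically distributed number of uniform draws, so its tail P(tau > n) = (1 - 1/d)^n decays exponentially. A quick Monte Carlo check with illustrative parameters:

```python
import numpy as np

def first_hit(rng, d, i_star):
    """Number of uniform coordinate draws until i_star is first selected."""
    t = 1
    while rng.integers(d) != i_star:
        t += 1
    return t

rng = np.random.default_rng(0)
d, i_star, n_trials = 5, 3, 50000
taus = np.array([first_hit(rng, d, i_star) for _ in range(n_trials)])

empirical = np.mean(taus > 10)
theoretical = (1.0 - 1.0 / d) ** 10   # geometric tail, about 0.107
assert abs(empirical - theoretical) < 0.01
```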
We now carry out the amplification part of the finite block analysis for the linearized dynamics at . To simplify expressions in the following, for , we introduce the short-hand notation
and the finite time transition matrix (i.e., composition of linear maps)
Recall that is the probability space for for . We also denote as the projection operator onto the subspace spanned by the right singular vectors of corresponding to the largest singular values, where and is the dimension of the -th eigenspace as in Theorem 2.3 (ii) for the linearized system at .
As we mentioned in the proof sketch, we want to amplify , for which we need to establish a non-trivial lower bound for the unstable component . This is achieved by several lemmas. We will establish three lower bounds in the sequel:
Let us first control in the following lemma, which utilizes Assumption 3. For simplicity of notation, in Lemma 4.6 and Lemma 4.7, we state the results for and instead, which is slightly more general.
Lemma 4.6.
Proof.
We will then be able to handle using Lemma 4.6 and the closeness between and , as the former converges to the latter as by Theorem 2.3. More precisely, denote the singular values of by . Then for sufficiently large, we have
(4.12) |
for every , where are the Lyapunov exponents from Theorem 2.3, is given by (4.9), and the map satisfies that if and only if , so that the indices of the singular values correspond to those of the Lyapunov exponents. Moreover, the convergence also implies that
for sufficiently large , which then leads to
(4.13) |
for every , where is the projection operator onto the subspace spanned by the right singular vectors of corresponding to largest singular values. Let
(4.14) |
The following lemma states that has high probability for sufficiently large , where with slight abuse of notation, we write .
Lemma 4.7.
Under the same assumptions as in Lemma 4.6, there exists some , such that for every , it holds that .
Proof.
For , it follows from Theorem 2.3, in particular (2.2), and standard matrix perturbation analysis (see e.g., [bhatia, Theorem VI.2.1, Theorem VII.3.1]) that
(4.15) |
for any , and that
(4.16) |
By Egorov’s theorem, there exists with , such that the convergences in (4.15) and (4.16) are both uniform on . Here is as in Lemma 4.6. Therefore, for some sufficiently large, we have
and
(4.17) |
Combining Lemma 4.6 and (4.17), we obtain that
For any , by the definition of , it holds that
which implies the desired estimate
∎
Note that is independent of , , and . The next lemma shows that with high probability, the choice of will lead to a nontrivial orthogonal projection of onto the unstable subspace of .
Lemma 4.8.
For any , , , and , there exists with where is the Lebesgue measure, such that for any , it holds that
(4.18) |
Proof.
We assume that ; otherwise the result is trivial.
For simplicity of notation, we write
where the last line defines and . Using the short-hand notation
we then observe that (4.18) holds if and only if is not located in a ball of radius centered at .
It follows from the definition of and (4.13) that , which then leads to
Thus, the set of such that consists of an interval in whose length is , since the diameter of is ; this implies . The lemma is then proved by setting . ∎
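The mechanism of Lemma 4.8 — a uniformly drawn stepsize lands in a fixed bad interval of length proportional to the radius with probability at most proportional to that radius — can be sanity-checked numerically. In the sketch below, the interval center, the constant `C`, and the sampling range are all illustrative choices of ours, not quantities from the paper.

```python
import random

def escape_prob(bad_center, r, lo=0.0, hi=1.0, trials=20000, seed=0):
    """Fraction of uniform draws on [lo, hi] that avoid the bad interval
    of radius C * r around bad_center (Monte Carlo estimate)."""
    C = 2.0  # illustrative constant playing the role of the lemma's constant
    rng = random.Random(seed)
    hits = sum(1 for _ in range(trials)
               if abs(rng.uniform(lo, hi) - bad_center) > C * r)
    return hits / trials

# The escape probability behaves like 1 - 2*C*r/(hi - lo): it approaches 1
# as the radius r shrinks, matching the measure bound in the lemma.
print(escape_prob(0.5, 0.01), escape_prob(0.5, 0.001))
```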
With Lemmas 4.5–4.8 in hand, we now prove Theorem 4.4, which relies on the approximation of the nonlinear dynamics by its linearization and on the amplification from the finite block analysis of the linearized system.
Proof of Theorem 4.4.
Without loss of generality, we will assume in the proof to simplify notation. Since is non-degenerate, we can take a neighborhood of and some fixed such that
(4.19)
Assumption 1 implies that
Using the above inequality and , it holds for every and that
(4.20)
and similarly that
We thus define
so that
(4.21)
We now choose the time duration large enough in the finite block analysis to guarantee significant amplification. More specifically, we choose so large that ( defined in Lemma 4.7) and that the following two inequalities hold:
(4.22)
and
(4.23)
where is the upper bound defined in Lemma 4.5, is a fixed constant as in Lemma 4.5, is from Lemma 4.6, and is set in (4.19). Thanks to (4.10) for our choice of and to the fact that from (4.9), inequalities (4.22) and (4.23) are indeed satisfied for all sufficiently large .
Next, we show that, for any sufficiently large as above, there exists a convex neighborhood of , such that for any , any , and any , it holds that
(4.24)
We first define a convex neighborhood of such that
for any , any , and any . Applying the inequality times for , we obtain
and hence inequality (4.24).
Setting , we then have for any , which implies that as . According to Lemmas 4.5–4.8, for any given , with probability at least , we have , and the following hold:
(4.25)
(4.26)
where the probability is the marginal probability on .
Recall and in (4.9), and . It follows from (4.12) and the construction of the set that
That is to say, the largest singular values of are all greater than or equal to . Therefore, it holds that
(4.27)
where the first inequality follows from the fact that and are orthogonal. Combining (4.24) and (4.27), we obtain that
Therefore, it holds that
We finally arrive at (4.8) by setting and combining the above with (4.23). ∎
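Schematically, and in illustrative notation of our own (deviation $y_k = x_k - x^\ast$ from the saddle, unstable projection $P$, top Lyapunov exponent $\lambda_1 > 0$, block length $T$), the two ingredients just combined are the linear amplification of (4.27) and the linearization error bound of (4.24):

```latex
% Illustrative notation; the paper's symbols may differ. The first display
% is the linear amplification over one block of length T; the second says
% the nonlinear iteration stays within a small relative distance of its
% linearization on the neighborhood constructed in the proof.
\|P\,M_T(\omega)\,y_0\| \;\ge\; e^{(\lambda_1-\varepsilon)T}\,\|P\,y_0\|,
\qquad
\bigl\| y_T - M_T(\omega)\,y_0 \bigr\| \;\le\; \delta\,\|y_0\| .
```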
4.3. Proof of main results
In this section, we first prove the following theorem, which relies on the local amplification with high probability established in Theorem 4.4. The main result, Theorem 1, will then follow as an immediate corollary, since is countable and strict saddle points are isolated.
Theorem 4.9.
Proof.
Without loss of generality, we assume that . Conditioned on with , where can be assumed to be bounded, Theorem 4.4 states that with probability at least ,
where to simplify notation we denote
the amplification factor appearing on the right-hand side of (4.8). Notice also that, due to (4.21), we always have
It suffices to show that for any , with probability there exists some such that .
Let us consider the iterates every steps: denote and for , and denote the stopping time
it suffices to show that . We define a sequence of random variables as follows:
By the discussion in the beginning of the proof, we have
and moreover, setting , we have for ,
Denote . Since , there exists , such that
Therefore, for any , it holds that
As we will show in the next lemma, the right-hand side above goes to as , and thus , which implies that . ∎
Lemma 4.10.
For any , it holds that
Proof.
Let be a sequence of i.i.d. random variables with being a Bernoulli random variable with expectation . Denote the average . The weak law of large numbers yields that
as . ∎
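The law-of-large-numbers step can be checked numerically. The sketch below (the success probability p = 0.3 and the sample sizes are illustrative choices of ours) estimates by Monte Carlo the probability that the Bernoulli average falls below p/2; the weak law drives this to zero as the number of summands grows.

```python
import random

def tail_prob(p, n, trials=2000, seed=0):
    """Monte Carlo estimate of P(average of n i.i.d. Bernoulli(p) <= p/2)."""
    rng = random.Random(seed)
    bad = 0
    for _ in range(trials):
        s = sum(1 for _ in range(n) if rng.random() < p)
        if s / n <= p / 2:
            bad += 1
    return bad / trials

# The estimated tail probability decays as n grows, as the weak law predicts.
print(tail_prob(0.3, 10), tail_prob(0.3, 200))
```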
The main theorem then follows directly from Theorem 4.9.
Proof of Theorem 1.
We now prove the global convergence, i.e., Corollary 3.3, for which we will show that, under some appropriate assumptions, Algorithm 1 converges to a critical point of . We first show that the limit of each convergent subsequence of is a critical point of .
Proposition 4.11.
If Assumption 1 holds and , for any with bounded level set , with probability , every accumulation point of is in .
Proof.
Algorithm 1 is monotone, i.e., the objective value is non-increasing along the iterates, since the following holds for any by Taylor expansion:
(4.28)
where , which implies that the whole sequence is contained in the bounded level set .
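If Assumption 1 is read as a Lipschitz-gradient condition with constant $L$ (our reading; the paper's exact statement may differ), then the Taylor-expansion step behind (4.28) is the standard one-coordinate descent estimate:

```latex
% Our notation: the update is x^{k+1} = x^k - \eta_k\,\partial_{i_k} f(x^k)\,e_{i_k},
% and L bounds the Lipschitz constant of the partial gradient along the segment.
f(x^{k+1}) \;\le\; f(x^k)
  \;-\; \eta_k\Bigl(1 - \tfrac{L\eta_k}{2}\Bigr)\bigl|\partial_{i_k} f(x^k)\bigr|^2 ,
```

so the objective is non-increasing as long as the stepsizes stay below $2/L$.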
Let us consider any and set
which is either empty or compact. We claim that with probability , the accumulation points of will not be located in . This is clear when is empty, so it suffices to consider compact . Set as a fixed constant. For any , there exists an open neighborhood of and a coordinate , such that
(4.29)
and that
(4.30)
Noticing that , by the compactness, there exist finitely many points, say , such that
For any , combining (4.28), (4.29), and (4.30), we know that for any , conditioned on with , if (which has probability ), then and thus for all .
Therefore, the probability that there are infinitely many with is zero, which implies that does not have accumulation points in with probability . We conclude that with probability , does not contain any accumulation points of , as is finite. Since this holds for any , for -a.e. the sequence has no accumulation points in any , which leads to the desired result. ∎
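To make the monotonicity concrete, here is a minimal sketch of coordinate gradient descent with random coordinates and random stepsizes on a toy objective of our own choosing; the function, stepsize range, and iteration count below are illustrative and not taken from the paper. The objective value is non-increasing, as in (4.28), and the iterate settles at a local minimizer.

```python
import random

def f(x):
    # Illustrative smooth non-convex objective (not from the paper):
    # f(x, y) = (x^2 - 1)^2 + y^2 has minima at (+/-1, 0) and a strict
    # saddle at (0, 0).
    return (x[0] ** 2 - 1) ** 2 + x[1] ** 2

def grad(x):
    return [4 * x[0] * (x[0] ** 2 - 1), 2 * x[1]]

def coord_gd(x0, steps=2000, seed=0):
    """Coordinate gradient descent with a random coordinate and a random
    stepsize at every iteration, in the spirit of Algorithm 1."""
    rng = random.Random(seed)
    x = list(x0)
    vals = [f(x)]
    for _ in range(steps):
        i = rng.randrange(len(x))       # random coordinate
        eta = rng.uniform(0.01, 0.1)    # random stepsize, small vs. L
        x[i] -= eta * grad(x)[i]
        vals.append(f(x))
    return x, vals

x, vals = coord_gd([0.1, 1.0])
# The objective is non-increasing and the iterate approaches a minimizer.
print(x, vals[-1])
```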
Proposition 4.11 implies that any accumulation point of the algorithm iterates is a critical point. If we further assume that each critical point of is isolated, we can conclude that the whole sequence converges and the limit is in .
Proposition 4.12.
Under the assumptions of Proposition 4.11, if every is an isolated critical point of , then with probability , converges to some critical point of as .
Proof.
It follows from Proposition 4.11 that converges to as for . Indeed, if there were a subsequence and with , then by the boundedness of , would have some accumulation point that is not a stationary point of , which leads to a contradiction.
Moreover, is a finite set, since otherwise would have a limit point, which would be a non-isolated critical point of , violating the assumption.
Consider a fixed with . Select an open neighborhood for every , such that there exists some with
If had more than one accumulation point, there would be infinitely many iterates located in , which is compact. Therefore, would have an accumulation point in , contradicting Proposition 4.11. ∎
Corollary 3.3 is now an immediate consequence.
References
Appendix A Validity of Assumption 3
In this appendix, we provide some justification of Assumption 3, which is expected to hold generically. In particular, the following proposition validates this assumption when the off-diagonal entries of are all non-zero.
Proposition A.1.
Suppose that the largest Lyapunov exponent of is positive. Then Assumption 3 holds as long as and every off-diagonal entry of is non-zero.
Proof.
For any element in , we take the smallest such that and write
where . We have that is finite for a.e. . Note that we can view as a stopping time; in particular, given , has distribution and is independent of .
Let be a set of basis vectors for . Then a set of basis vectors for is given by
Denote by and the matrices formed by concatenating these column vectors. If , then is column-rank deficient, since the existence of a positive Lyapunov exponent implies that . Here and for the rest of the appendix, we denote by the matrix obtained by removing the -th row of .
Therefore, as Assumption 3 is equivalent to the statement that holds for any and almost every , it suffices to show that has full column-rank with probability . The key point is that, given , are independent of and . Thus, it suffices to show that with fixed , , , and , the set of all that yield the rank-deficiency of has measure zero; without loss of generality, we can assume . Noticing that cover all the coordinates and that every off-diagonal entry of is non-zero, the desired result follows directly from Lemma A.2 below, applied repeatedly. ∎
Lemma A.2.
Suppose that and are full-column-rank matrices satisfying (we suppress in the notation the dependence of on and for simplicity). Then the following hold:
(i) If has full column-rank, then for any , also has full column-rank for a.e. .
(ii) Suppose that is column-rank deficient, and let be row indices such that
If , then with probability , either has full column-rank or is column-rank deficient with
If and , then has full column-rank.
Proof of Lemma A.2.
By (3.3), it holds that for and that
For point (i), we notice that if , then has full column-rank. If , then it follows from that also has full column-rank for a.e. .
For point (ii), we have
If , then holds for . Therefore, we obtain that are linearly independent, which implies that either has full column-rank or
If , then has full column-rank since . ∎
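The genericity used in point (i) — a rank-deficient configuration is destroyed for a.e. value of the perturbation parameter — can be illustrated in miniature. In this hypothetical 2×2 example (not the paper's cocycle), perturbing one column of a rank-one matrix by t times a suitable vector yields a determinant that is a non-trivial polynomial in t, hence non-zero outside a finite set:

```python
def det2(m):
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

# Hypothetical example: B has rank 1, and perturbing its second column by
# t * c makes it invertible for all but finitely many t, since det(B_t)
# is a non-zero polynomial in t.
B = [[1.0, 2.0], [2.0, 4.0]]  # rank-one matrix
c = [1.0, 0.0]

def det_perturbed(t):
    Bt = [[B[0][0], B[0][1] + t * c[0]],
          [B[1][0], B[1][1] + t * c[1]]]
    return det2(Bt)

# det(B_t) = 1*4 - (2 + t)*2 = -2t, which vanishes only at t = 0.
print([det_perturbed(t) for t in (0.0, 0.5, 1.0)])  # → [0.0, -1.0, -2.0]
```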