
On the global convergence of randomized coordinate gradient descent for non-convex optimization

Ziang Chen (ZC) Department of Mathematics, Duke University, Box 90320, Durham, NC 27708, USA. [email protected] Yingzhou Li (YL) School of Mathematical Sciences, Fudan University, Shanghai 200433, China [email protected]  and  Jianfeng Lu (JL) Departments of Mathematics, Physics, and Chemistry, Duke University, Box 90320, Durham, NC 27708, USA. [email protected]
Abstract.

In this work, we analyze the global convergence property of coordinate gradient descent with random choice of coordinates and stepsizes for non-convex optimization problems. Under generic assumptions, we prove that the algorithm iterate will almost surely escape strict saddle points of the objective function. As a result, the algorithm is guaranteed to converge to local minima if all saddle points are strict. Our proof is based on viewing the coordinate descent algorithm as a nonlinear random dynamical system and on a quantitative finite block analysis of its linearization around saddle points.

This work is supported in part by the National Science Foundation via grants DMS-2012286 and CHE-2037263, and by US Department of Energy via grant DE-SC0019449. Y. Li is partially supported by National Natural Science Foundation of China under Grant No. 12271109. We thank Jonathan Mattingly, Zhe Wang, and Stephen J. Wright for helpful discussions.

1. Introduction

In this paper, we analyze the global convergence of the coordinate gradient descent algorithm for the smooth but non-convex optimization problem

(1.1) minxdf(x).\min_{x\in\mathbb{R}^{d}}f(x).

More specifically, we consider coordinate gradient descent with random coordinate selection and random stepsizes, as Algorithm 1.

Algorithm 1 Randomized coordinate gradient descent
Initialization: x0dx_{0}\in\mathbb{R}^{d}, t=0t=0.
while not convergent do
     Draw a coordinate iti_{t} uniformly at random from {1,2,,d}\{1,2,\dots,d\}.
     Draw a stepsize αt\alpha_{t} uniformly at random from [αmin,αmax][\alpha_{\min},\alpha_{\max}].
     xt+1xtαteititf(xt)x_{t+1}\leftarrow x_{t}-\alpha_{t}e_{i_{t}}\partial_{i_{t}}f(x_{t}).
     tt+1t\leftarrow t+1.
end while
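For concreteness, the following is a minimal Python sketch of Algorithm 1 (illustrative only; the gradient oracle grad_f, the stepsize range, and the iteration budget are placeholders supplied by the user, and the stopping test is replaced by a fixed number of iterations).

    import numpy as np

    def randomized_coordinate_gd(grad_f, x0, alpha_min, alpha_max, n_iters=10000, rng=None):
        # Minimal sketch of Algorithm 1: update one uniformly chosen coordinate per step,
        # with a stepsize drawn uniformly from [alpha_min, alpha_max].
        rng = np.random.default_rng() if rng is None else rng
        x = np.array(x0, dtype=float)
        d = x.size
        for _ in range(n_iters):                       # "while not convergent" replaced by a fixed budget
            i = rng.integers(d)                        # coordinate i_t, uniform on {0, ..., d-1}
            alpha = rng.uniform(alpha_min, alpha_max)  # stepsize alpha_t, uniform on [alpha_min, alpha_max]
            x[i] -= alpha * grad_f(x)[i]               # x_{t+1} = x_t - alpha_t * e_{i_t} * (partial_{i_t} f)(x_t)
        return x

For instance, for the quadratic f(x) = x^T H x / 2 one may pass grad_f = lambda x: H @ x.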

The main result of this paper, Theorem 1, is that for any initial guess x0x_{0} that is not a strict saddle point of ff, under some mild conditions, with probability 11, Algorithm 1 will escape any strict saddle points; thus, under some additional structural assumptions on ff, the algorithm converges globally to a local minimum.

In order to establish the global convergence, we view the algorithm as a random dynamical system and carry out the analysis based on the theory of random dynamical systems. This might be of independent interest; in particular, to the best of our knowledge, the theory of random dynamical systems has not been utilized in analyzing randomized algorithms, while it offers a natural framework for establishing the long time behavior of such algorithms. Let us now briefly explain the random dynamical system view of the algorithm and our analysis; more details can be found in Section 3.

Let (Ω,,)(\Omega,\mathcal{F},\mathbb{P}) be the probability space for all randomness used in the algorithm, such that each ωΩ\omega\in\Omega is a sequence of coordinates and stepsizes. The iterate of Algorithm 1 can be described as a random dynamical system xt=φ(t,ω)x0x_{t}=\varphi(t,\omega)x_{0} where φ(t,ω):dd\varphi(t,\omega):\mathbb{R}^{d}\rightarrow\mathbb{R}^{d} is a nonlinear map for any given tt\in\mathbb{N} and ωΩ\omega\in\Omega.

Consider an isolated stationary point xx^{*} of the dynamical system, which corresponds to a critical point of ff. Near xx^{*}, the dynamical system can be approximated by its linearization: xt=Φ(t,ω)x0x_{t}=\Phi(t,\omega)x_{0}, where Φ(t,ω)d×d\Phi(t,\omega)\in\mathbb{R}^{d\times d}. The limiting behavior of the linear dynamical system can be well understood by the celebrated multiplicative ergodic theorem: Under some assumptions, the limit Λ(ω)=limt(Φ(t,ω)Φ(t,ω))1/2t\Lambda(\omega)=\lim_{t\rightarrow\infty}\left(\Phi(t,\omega)^{\top}\Phi(t,\omega)\right)^{1/2t} exists almost surely. The eigenvalues of the matrix Λ(ω)\Lambda(\omega), eλ1(ω)>eλ2(ω)>>eλp(ω)(ω)e^{\lambda_{1}(\omega)}>e^{\lambda_{2}(\omega)}>\cdots>e^{\lambda_{p(\omega)}(\omega)}, characterize the long time behavior of the system. In particular, if the largest Lyapunov exponent λ1(ω)\lambda_{1}(\omega) is strictly positive, then if x0x_{0} has some non-trivial component in the unstable subspace, xt=Φ(t,ω)x0x_{t}=\Phi(t,\omega)x_{0} would exponentially diverge from xx^{*}. More details of preliminaries of linear random dynamical system can be found in Section 2.

Intuitively, one expects that the nonlinear dynamical system can be approximated by its linearization around a critical point xx^{*}, and would hence escape the strict saddle point, following the linearized system. However, the approximation by the linear dynamical system cannot hold over an infinite time horizon, due to error accumulation. Therefore, we cannot naively conclude using the multiplicative ergodic theorem and the linear approximation. Instead, a major part of the analysis is devoted to establishing a quantitative finite block analysis of the behavior of the dynamical system over a finite time interval. In particular, we will prove that when the iterate is in a neighborhood of xx^{*}, the distance xtx\left\|x_{t}-x^{*}\right\| will be exponentially amplified over a duration TT with high probability. This is then used to prove that with probability 11 the nonlinear system escapes strict saddle points.

1.1. Related work

Coordinate gradient descent is a popular approach in optimization; see, e.g., the review articles [Wright-15, shi2016primer]. Compared with full gradient descent, coordinate gradient methods allow larger stepsizes [Nesterov-12], enjoy faster convergence [Saha-13], and are also amenable to parallelization [Liu-15, Richtarik-16].

The convergence of coordinate gradient descent has been analyzed in several settings, depending on the properties of the objective function and on the strategy of coordinate selection. The understanding of convergence for convex problems is quite complete: for methods with cyclic choice of coordinates, the convergence has been established in [Beck-13, Saha-13, Sun-15], and the worst-case complexity for convex quadratic objective functions is investigated in [Sun-19]. For methods with random choice of coordinates, it is shown in [Nesterov-12] that 𝔼f(xt)\mathbb{E}f(x_{t}) converges to f=minxdf(x)f^{*}=\min_{x\in\mathbb{R}^{d}}f(x) sublinearly in the convex case and linearly in the strongly convex case. Convergence of the objective function with high probability has also been established in [Nesterov-12]. We also refer to [Richtarik-14, Liu-14, Liu-15, Wright-15] for further convergence results for random coordinate selection in convex problems. More recently, the convergence of methods with random permutation of coordinates (i.e., a random permutation of the dd coordinates is used for every dd steps of the algorithm) has been analyzed, mostly for the case of quadratic objective functions [Lee-19, Oswald-17, Gurbuzbalaban-20, Wright-20]. It has been an ongoing research direction to compare various coordinate selection strategies in various settings. In addition, in the non-convex and non-smooth setting, the convergence of coordinate/alternating descent methods can be analyzed for tame/semi-algebraic functions with the Kurdyka-Łojasiewicz property (see e.g. [attouch2013convergence, attouch2010proximal, bolte2014proximal, boct2020inertial]).

For non-convex objective functions, the global convergence analysis is less developed, as the situation becomes more complicated. Escaping strict saddle points has been a focused research topic in non-convex optimization, motivated by applications in machine learning. It has been established that various first-order algorithms with gradient noise or added randomness to iterates would escape strict saddle points, see e.g., [Ge-15, Levy-16, Jin-17, Jin-18, Jin-19, Guo-20] for works in this direction.

Among previous works on escaping saddle points, perhaps the closest in spirit to our current result are [Lee-16, ONeill-19, Lee-2019, leadeigen-19], where algorithms without gradient or iterate randomness are studied. It is proved in [Lee-16] that for almost every initial guess, the trajectory of the gradient descent algorithm (without any randomness) with constant stepsize would not converge to a strict saddle point. The result has been extended in [Lee-2019] to a broader class of deterministic first-order algorithms, including coordinate gradient descent with cyclic choice of coordinates. The global convergence result for cyclic coordinate gradient descent is also proved in [leadeigen-19] under slightly more relaxed conditions. A similar convergence result is also obtained for the heavy-ball method in [ONeill-19]. Let us emphasize that in the case of coordinate algorithms, it is not merely a technical question whether the algorithm can escape the strict saddle points without randomly perturbing gradients or iterates. In fact, one simply cannot employ such random perturbations, e.g., adding a random Gaussian vector to the iterate, since doing so would destroy the coordinate nature of the algorithm.

The analysis in the works [Lee-16, Lee-2019, ONeill-19, leadeigen-19] is based on viewing the algorithm as a deterministic dynamical system and applying the center-stable manifold theorem for deterministic dynamical systems [Shub-87], which characterizes the local behavior of nonlinear dynamical systems near a stationary point. Such a framework obviously does not work for randomized algorithms. To some extent, our analysis can be understood as a natural generalization to the framework of random dynamical systems, which allows us to analyze the long time behavior of randomized algorithms, in particular coordinate gradient descent with random coordinate selection.

Let us mention that various stable, unstable, and center manifold theorems have been established in the literature of random dynamical systems, see e.g., [Arnold_RDS, Ruelle-1979, Ruelle-82, Boxler-89-center, Liu-06]. These sample-dependent random manifolds also characterize the local behavior of random dynamical systems. However, as far as we can tell, one cannot simply apply such “off-the-shelf results” for the analysis of Algorithm 1. Instead, for the study of the algorithm, we have to carry out a quantitative finite block analysis of the random dynamical system near the stationary points. Our proof technique is inspired by the stability analysis of Lyapunov exponents of random dynamical systems, as in [Ledrappier-91, Froyland-15].

1.2. Organization

The rest of this paper is organized as follows. In Section 2, we review preliminaries of random dynamical systems for the convenience of the reader. Our main result is stated in Section 3. The proofs can be found in Section 4.

2. Preliminaries of random dynamical systems

In this section, we recall basic notions and results of random dynamical systems; for more details, we refer the reader to standard references such as [Arnold_RDS]. After introducing the preliminaries in this section, we will define the random dynamical system associated with Algorithm 1 in Section 3.1. Let (Ω,,)(\Omega,\mathcal{F},\mathbb{P}) be a probability space and let 𝕋\mathbb{T} be a semigroup with (𝕋)\mathcal{B}(\mathbb{T}) being its Borel σ\sigma-algebra. 𝕋\mathbb{T} serves as the notion of time. In the setting of Algorithm 1, we have 𝕋=\mathbb{T}=\mathbb{N}, corresponding to the one-sided discrete time setting. Other possible examples of 𝕋\mathbb{T} include 𝕋=\mathbb{T}=\mathbb{Z}, 𝕋=0\mathbb{T}=\mathbb{R}_{\geq 0}, and 𝕋=\mathbb{T}=\mathbb{R}, with the assumption that 0𝕋0\in\mathbb{T}.

Let us first define a random dynamical system. As we have mentioned in the introduction, the dynamics starting from x0x_{0} is determined once a sample ωΩ\omega\in\Omega is fixed. From the viewpoint of random dynamical systems, specifying the dynamics of xx is equivalent to specifying the dynamics of ω\omega: suppose that at time 0 the dynamics corresponds to ω\omega; then, to prescribe the future dynamics starting from time tt, we can specify the corresponding θ(t)ωΩ\theta(t)\omega\in\Omega for some map θ(t):ΩΩ\theta(t):\Omega\to\Omega. More precisely, we have the following definition of dynamics on Ω\Omega.

Definition 2.1 (Metric dynamical system).

A metric dynamical system on a probability space (Ω,,)(\Omega,\mathcal{F},\mathbb{P}) is a family of maps {θ(t):ΩΩ}t𝕋\{\theta(t):\Omega\rightarrow\Omega\}_{t\in\mathbb{T}} satisfying that

  • (i)

    The mapping 𝕋×ΩΩ,(t,ω)θ(t)ω\mathbb{T}\times\Omega\rightarrow\Omega,\ (t,\omega)\mapsto\theta(t)\omega is measurable;

  • (ii)

    It holds that θ(0)=IdΩ\theta(0)=\mathrm{Id}_{\Omega} and θ(t+s)=θ(t)θ(s),s,t𝕋\theta(t+s)=\theta(t)\circ\theta(s),\ \forall\ s,t\in\mathbb{T};

  • (iii)

    θ(t)\theta(t) is \mathbb{P}-preserving for any t𝕋t\in\mathbb{T}, where we say a map θ:ΩΩ\theta:\Omega\rightarrow\Omega is \mathbb{P}-preserving if

    (θ1B)=(B),B.\mathbb{P}(\theta^{-1}B)=\mathbb{P}(B),\quad\forall\ B\in\mathcal{F}.

The random dynamical system can then be defined as follows.

Definition 2.2 (Random dynamical system).

Let (X,X)(X,\mathcal{F}_{X}) be a measurable space and let {θ(t):ΩΩ}t𝕋\{\theta(t):\Omega\rightarrow\Omega\}_{t\in\mathbb{T}} be a metric dynamical system on (Ω,,)(\Omega,\mathcal{F},\mathbb{P}). Then a random dynamical system on (X,X)(X,\mathcal{F}_{X}) over {θ(t)}t𝕋\{\theta(t)\}_{t\in\mathbb{T}} is a measurable map

φ:𝕋×Ω×XX,(t,ω,x)φ(t,ω,x),\begin{split}\varphi:\mathbb{T}\times\Omega\times X&\rightarrow\quad X,\\ (t,\omega,x)\ \ &\mapsto\varphi(t,\omega,x),\end{split}

satisfying the following cocycle property: for any ωΩ\omega\in\Omega, xXx\in X, and s,t𝕋s,t\in\mathbb{T}, it holds that

φ(0,ω,x)=x,\varphi(0,\omega,x)=x,

and that

(2.1) φ(t+s,ω,x)=φ(t,θ(s)ω,φ(s,ω,x)).\varphi(t+s,\omega,x)=\varphi(t,\theta(s)\omega,\varphi(s,\omega,x)).

The cocycle property (2.1) is a key property of a random dynamical system: after time ss, if we restart the system at xsx_{s}, the future dynamics corresponds to the sample θ(s)ω\theta(s)\omega. Note that φ(t,ω,)\varphi(t,\omega,\cdot) is a map on XX; with a slight abuse of notation, we also use φ(t,ω)\varphi(t,\omega) to denote this map on XX and write φ(t,ω)x=φ(t,ω,x)\varphi(t,\omega)x=\varphi(t,\omega,x). Then the cocycle property (2.1) can be written as

φ(t+s,ω)=φ(t,θ(s)ω)φ(s,ω).\varphi(t+s,\omega)=\varphi(t,\theta(s)\omega)\circ\varphi(s,\omega).

In this work, we will focus on the one-sided discrete time 𝕋=\mathbb{T}=\mathbb{N} and θ(t)=θt\theta(t)=\theta^{t}, where θ\theta is \mathbb{P}-preserving and θt\theta^{t} is the tt-fold composition of θ\theta. Suppose that X=dX=\mathbb{R}^{d} and A:ΩGL(d,)A:\Omega\rightarrow\text{GL}(d,\mathbb{R}) is measurable. Consider a linear random dynamical system defined as (we use Φ\Phi for the linear system, while reserving φ\varphi for nonlinear dynamics considered later)

Φ(t,ω)=A(θt1ω)A(θω)A(ω),\Phi(t,\omega)=A(\theta^{t-1}\omega)\cdots A(\theta\omega)A(\omega),

where the right-hand side is the product of a sequence of random matrices. In this setting, the behavior of the linear system xt=Φ(t,ω)x0x_{t}=\Phi(t,\omega)x_{0} is well understood by the celebrated multiplicative ergodic theorem, also known as the Oseledets theorem, which we recall in Theorem 2.3. Results of this type were first established by V.I. Oseledets [oseledets] and further developed in many works such as [raghunathan, Ruelle-1979, Walters-93].
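As a purely numerical aside before stating the theorem, the Lyapunov exponents it provides for such products of random matrices can be estimated by the standard QR re-orthonormalization procedure; the following minimal Python sketch assumes a user-supplied sampler sample_A that draws one realization of the random matrix A(ω).

    import numpy as np

    def lyapunov_spectrum(sample_A, d, n_steps=20000, rng=None):
        # Estimate the Lyapunov exponents of Phi(t, omega) = A(theta^{t-1} omega) ... A(omega)
        # by pushing an orthonormal frame through the cocycle and accumulating log |diag(R)|
        # from repeated QR factorizations.
        rng = np.random.default_rng() if rng is None else rng
        Q = np.eye(d)
        log_r = np.zeros(d)
        for _ in range(n_steps):
            Z = sample_A(rng) @ Q
            Q, R = np.linalg.qr(Z)
            log_r += np.log(np.abs(np.diag(R)))
        return np.sort(log_r / n_steps)[::-1]  # approximate exponents, repeated with multiplicity

For example, taking sample_A to return I - alpha * e_i e_i^T H for a uniformly random coordinate i and stepsize alpha approximates the exponents of the linearized coordinate descent system studied in Section 3.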

Theorem 2.3 (Multiplicative ergodic theorem, [Arnold_RDS, Theorem 3.4.1]).

Suppose that

(logA())+,(logA()1)+L1(Ω,,),\left(\log\left\|A(\cdot)\right\|\right)_{+},\ \left(\log\left\|A(\cdot)^{-1}\right\|\right)_{+}\in L^{1}(\Omega,\mathcal{F},\mathbb{P}),

where we have used the short-hand a+:=max{a,0}a_{+}:=\max\{a,0\}. Then there exists a θ\theta-invariant Ω~\widetilde{\Omega}\in\mathcal{F} with (Ω~)=1\mathbb{P}(\widetilde{\Omega})=1, such that the following hold for any ωΩ~\omega\in\widetilde{\Omega}:

  • (i)

    It holds that the limit

    (2.2) Λ(ω)=limt(Φ(t,ω)Φ(t,ω))1/2t,\Lambda(\omega)=\lim_{t\rightarrow\infty}\left(\Phi(t,\omega)^{\top}\Phi(t,\omega)\right)^{1/2t},

    exists and is a positive definite matrix. Here Φ(t,ω)\Phi(t,\omega)^{\top} denotes the transpose of the matrix (as Φ(t,ω)\Phi(t,\omega) is a linear map on XX).

  • (ii)

    Suppose Λ(ω)\Lambda(\omega) has p(ω)p(\omega) distinct eigenvalues, which are ordered as eλ1(ω)>eλ2(ω)>>eλp(ω)(ω)e^{\lambda_{1}(\omega)}>e^{\lambda_{2}(\omega)}>\cdots>e^{\lambda_{p(\omega)}(\omega)}. Denote Vi(ω)V_{i}(\omega) the corresponding eigenspace, with dimension di(ω)d_{i}(\omega), for i=1,2,,p(ω)i=1,2,\dots,p(\omega). Then the functions p()p(\cdot), λi()\lambda_{i}(\cdot), and di()d_{i}(\cdot), i=1,2,,p()i=1,2,\dots,p(\cdot), are all measurable and θ\theta-invariant on Ω~\widetilde{\Omega}.

  • (iii)

    Set Wi(ω)=jiVj(ω),i=1,2,,p(ω)W_{i}(\omega)=\bigoplus_{j\geq i}V_{j}(\omega),\ i=1,2,\dots,p(\omega), and Wp(ω)+1(ω)={0}W_{p(\omega)+1}(\omega)=\{0\}. Then it holds that

    (2.3) limt1tlogΦ(t,ω)x=λi(ω),xWi(ω)\Wi+1(ω),\lim_{t\rightarrow\infty}\frac{1}{t}\log\left\|\Phi(t,\omega)x\right\|=\lambda_{i}(\omega),\quad\forall\ x\in W_{i}(\omega)\backslash W_{i+1}(\omega),

    for i=1,2,,p(ω)i=1,2,\dots,p(\omega). The maps Vi()V_{i}(\cdot) and Wi()W_{i}(\cdot) from Ω~\widetilde{\Omega} to the Grassmannian manifold are measurable.

  • (iv)

    It holds that

    Wi(θω)=A(ω)Wi(ω).W_{i}(\theta\omega)=A(\omega)W_{i}(\omega).
  • (v)

    When (Ω,,,θ)(\Omega,\mathcal{F},\mathbb{P},\theta) is ergodic, i.e., every BB\in\mathcal{F} with θ1B=B\theta^{-1}B=B satisfies (B)=0\mathbb{P}(B)=0 or (B)=1\mathbb{P}(B)=1, the functions p()p(\cdot), λi()\lambda_{i}(\cdot), and di()d_{i}(\cdot), i=1,2,,p()i=1,2,\dots,p(\cdot), are constant on Ω~\widetilde{\Omega}.

In Theorem 2.3, λ1(ω)>λ2(ω)>>λp(ω)(ω)\lambda_{1}(\omega)>\lambda_{2}(\omega)>\cdots>\lambda_{p(\omega)}(\omega) are known as Lyapunov exponents and {0}Wp(ω)(ω)W1(ω)d\{0\}\subseteq W_{p(\omega)}(\omega)\subseteq\cdots\subseteq W_{1}(\omega)\subseteq\mathbb{R}^{d} is the Oseledets filtration. We can see from the above theorem that the Lyapunov exponents are θ\theta-invariant and that the Oseledets filtration is forward invariant under AA, in the sense of (iv).

The Lyapunov exponents describe the asymptotic growth rate of Φ(t,ω)x\left\|\Phi(t,\omega)x\right\| as tt\rightarrow\infty. More specifically, (2.3) implies that when xWi(ω)\Wi+1(ω)x\in W_{i}(\omega)\backslash W_{i+1}(\omega), for any ϵ>0\epsilon>0, there exists some T>0T>0, such that

et(λi(ω)ϵ)Φ(t,ω)xet(λi(ω)+ϵ),e^{t(\lambda_{i}(\omega)-\epsilon)}\leq\left\|\Phi(t,\omega)x\right\|\leq e^{t(\lambda_{i}(\omega)+\epsilon)},

holds for any t>Tt>T. The subspaces spanned by eigenvectors of Λ(ω)\Lambda(\omega) corresponding to Lyapunov exponents smaller than, equal to, and greater than 0 are the stable subspace, center subspace, and unstable subspace, respectively. The stable and unstable subspaces correspond to exponential convergence and exponential divergence, respectively. When starting from the center subspace, we obtain sub-exponential behavior.
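As a simple illustration (not needed in the sequel), consider the scalar case d=1 with A(\theta^{i}\omega)=a_{i} i.i.d. Then \Phi(t,\omega)=a_{t-1}\cdots a_{0} and, by the strong law of large numbers, \lambda_{1}=\lim_{t\rightarrow\infty}\frac{1}{t}\log\lvert a_{t-1}\cdots a_{0}\rvert=\mathbb{E}\log\lvert a_{0}\rvert almost surely. The origin is stable when \mathbb{E}\log\lvert a_{0}\rvert<0 and unstable when \mathbb{E}\log\lvert a_{0}\rvert>0; note that by Jensen's inequality \mathbb{E}\log\lvert a_{0}\rvert\leq\log\mathbb{E}\lvert a_{0}\rvert, so the Lyapunov exponent can be negative even when \mathbb{E}\lvert a_{0}\rvert>1.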

The multiplicative ergodic theorem also generalizes to continuous time and two-sided time. We refer interested readers to [Arnold_RDS, Theorem 3.4.1, Theorem 3.4.11] for details.

The stable, unstable, and center subspaces can be generalized to stable, unstable, and center manifolds when considering nonlinear systems, see e.g., [Arnold_RDS, Ruelle-1979, Ruelle-82, Boxler-89-center, Weigu-Sternberg, Liu-06, Lian-10, Li-13, Guo-16]. These manifolds play roles in characterizing the local behavior of nonlinear random dynamical systems similar to those played by the subspaces for linear random dynamical systems. In particular, the Hartman–Grobman theorem establishes the topological conjugacy between a nonlinear system and its linearization [Wanner-95]. There are also other conjugacy results for random dynamical systems, see e.g., [Weigu-Sternberg, Weigu-05, Weigu-08, Weigu-16].

3. Main results

3.1. Setup of the random dynamical system

Let us first specify the random dynamical system corresponding to Algorithm 1.

  • Probability space. For each tt\in\mathbb{N}, denote by (Ωt,Σt,t)(\Omega_{t},\Sigma_{t},\mathbb{P}_{t}) the usual probability space for the distribution 𝒰{1,2,,d}×𝒰[αmin,αmax]\mathcal{U}_{\{1,2,\dots,d\}}\times\mathcal{U}_{[\alpha_{\min},\alpha_{\max}]}, where 𝒰{1,2,,d}\mathcal{U}_{\{1,2,\dots,d\}} and 𝒰[αmin,αmax]\mathcal{U}_{[\alpha_{\min},\alpha_{\max}]} are the uniform distributions on the set {1,2,,d}\{1,2,\dots,d\} and the interval [αmin,αmax][\alpha_{\min},\alpha_{\max}], respectively. Let (Ω,,)(\Omega,\mathcal{F},\mathbb{P}) be the product probability space of all (Ωt,Σt,t),t(\Omega_{t},\Sigma_{t},\mathbb{P}_{t}),\ t\in\mathbb{N}. Denote by πt\pi_{t} the projection from (Ω,,)(\Omega,\mathcal{F},\mathbb{P}) onto (Ωt,Σt,t),t(\Omega_{t},\Sigma_{t},\mathbb{P}_{t}),\ t\in\mathbb{N}. Thus, a sample ωΩ\omega\in\Omega can be represented as a sequence ((i0,α0),(i1,α1),)((i_{0},\alpha_{0}),(i_{1},\alpha_{1}),\dots), where (it,αt)=πt(ω),t(i_{t},\alpha_{t})=\pi_{t}(\omega),\ t\in\mathbb{N}. Let {t}t\{\mathcal{F}_{t}\}_{t\in\mathbb{N}} be the filtration defined by

    t=σ{(B0××Bt)×(j>tΩj):BiΣi,i=0,1,,t}.\mathcal{F}_{t}=\sigma\left\{(B_{0}\times\cdots\times B_{t})\times\left(\prod_{j>t}\Omega_{j}\right):B_{i}\in\Sigma_{i},\ i=0,1,\dots,t\right\}.
  • Metric dynamical system. The metric dynamical system on Ω\Omega is constructed by the (left) shifting operator τ:ΩΩ\tau:\Omega\rightarrow\Omega defined as

    τ(ω)=τ(π0(ω),π1(ω),):=(π1(ω),π2(ω),),\tau(\omega)=\tau(\pi_{0}(\omega),\pi_{1}(\omega),\cdots):=(\pi_{1}(\omega),\pi_{2}(\omega),\cdots),

    which is clearly measurable and \mathbb{P}-preserving. The metric dynamical system is then given by θ(t)=τt\theta(t)=\tau^{t} for tt\in\mathbb{N}.

  • Random dynamical system. For any ωΩ\omega\in\Omega and tt\in\mathbb{N}, we define ϕ(ω)\phi(\omega) to be a (nonlinear) map on d\mathbb{R}^{d} as

    ϕ(ω):ddxxαeieif(x),\begin{split}\phi(\omega):\mathbb{R}^{d}&\rightarrow\quad\quad\quad\mathbb{R}^{d}\\ x\ &\mapsto x-\alpha e_{i}e_{i}^{\top}\nabla f(x),\end{split}

    where (i,α)=π0(ω)(i,\alpha)=\pi_{0}(\omega) is the first pair/element in the sequence ω\omega, and we define the map φ(t,ω)\varphi(t,\omega) via

    φ(t,ω)=ϕ(τt1ω)ϕ(τω)ϕ(ω),for t1,\varphi(t,\omega)=\phi(\tau^{t-1}\omega)\circ\cdots\circ\phi(\tau\omega)\circ\phi(\omega),\qquad\text{for }t\geq 1,

    while φ(0,ω)\varphi(0,\omega) is the identity operator. It is clear that φ(t,ω)\varphi(t,\omega) satisfies the cocycle property (2.1) and hence defines a random dynamical system on X=dX=\mathbb{R}^{d} over {τt}t\{\tau^{t}\}_{t\in\mathbb{N}}. The iterate of Algorithm 1 follows the random dynamical system as

    xt=ϕ(τt1ω)xt1==ϕ(τt1ω)ϕ(τω)ϕ(ω)x0=φ(t,ω)x0.x_{t}=\phi(\tau^{t-1}\omega)x_{t-1}=\cdots=\phi(\tau^{t-1}\omega)\circ\cdots\circ\phi(\tau\omega)\circ\phi(\omega)x_{0}=\varphi(t,\omega)x_{0}.

    It can be seen that {xt}t\{x_{t}\}_{t\in\mathbb{N}} is {t}\{\mathcal{F}_{t}\}-predictable, i.e., xtx_{t} is t1\mathcal{F}_{t-1}-measurable for any t+t\in\mathbb{N}_{+}, since xtx_{t} is determined by samples (i0,α0),(i1,α1),,(it1,αt1)(i_{0},\alpha_{0}),(i_{1},\alpha_{1}),\dots,(i_{t-1},\alpha_{t-1}).
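As a sanity check of this construction, the single-step maps can be composed explicitly and the cocycle property (2.1) verified numerically; the following minimal Python sketch (illustrative only, with an arbitrary quadratic test objective and a finite truncation of the sample sequence in place of ω) does exactly this.

    import numpy as np

    def phi_step(x, i, alpha, grad_f):
        # One-step map phi(omega): x -> x - alpha * e_i e_i^T grad f(x).
        x = x.copy()
        x[i] -= alpha * grad_f(x)[i]
        return x

    def varphi(t, omega, x, grad_f):
        # varphi(t, omega) x, where omega = [(i_0, alpha_0), (i_1, alpha_1), ...].
        for i, alpha in omega[:t]:
            x = phi_step(x, i, alpha, grad_f)
        return x

    # Cocycle check: varphi(t + s, omega) x = varphi(t, tau^s omega)(varphi(s, omega) x).
    rng = np.random.default_rng(0)
    H = np.array([[2.0, 0.7], [0.7, -1.0]])
    grad_f = lambda x: H @ x                     # gradient of the quadratic f(x) = x^T H x / 2
    omega = [(rng.integers(2), rng.uniform(0.1, 0.4)) for _ in range(10)]
    x0 = np.array([0.3, -0.2])
    s, t = 4, 6
    lhs = varphi(t + s, omega, x0, grad_f)
    rhs = varphi(t, omega[s:], varphi(s, omega, x0, grad_f), grad_f)  # omega[s:] plays the role of tau^s omega
    assert np.allclose(lhs, rhs)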

In our analysis, we will use linearization of the dynamical system φ(t,ω)\varphi(t,\omega) at a critical point xx^{*} of ff. Without loss of generality, we assume x=0x^{*}=0; otherwise we consider the system with state being xxx-x^{*}. The resulting linear system, which depends on H=2f(x)=(Hij)1i,jdH=\nabla^{2}f(x^{*})=(H_{ij})_{1\leq i,j\leq d}, is given by (here and in the sequel, we use the superscript HH to indicate dependence on the matrix)

(3.1) ΦH(t,ω)=AH(τt1ω)AH(τω)AH(ω),\Phi^{H}(t,\omega)=A^{H}(\tau^{t-1}\omega)\cdots A^{H}(\tau\omega)A^{H}(\omega),

where

(3.2) AH(ω)=IαeieiH,(i,α)=π0(ω).A^{H}(\omega)=I-\alpha e_{i}e_{i}^{\top}H,\quad(i,\alpha)=\pi_{0}(\omega).

Note that AH()A^{H}(\cdot) is bounded on Ω\Omega, and hence (logAH())+\left(\log\left\|A^{H}(\cdot)\right\|\right)_{+} is integrable. When α<1/|Hii|\alpha<1/|H_{ii}|, the matrix AH(ω)=IαeieiHA^{H}(\omega)=I-\alpha e_{i}e_{i}^{\top}H is invertible, and the inverse is given explicitly by applying the Sherman-Morrison formula:

(3.3) AH(ω)1=(IαeieiH)1=I+αeieiH1αHii.\begin{split}A^{H}(\omega)^{-1}&=\left(I-\alpha e_{i}e_{i}^{\top}H\right)^{-1}=I+\frac{\alpha e_{i}e_{i}^{\top}H}{1-\alpha H_{ii}}.\end{split}
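Indeed, (3.3) can be checked directly: since e_{i}^{\top}He_{i}=H_{ii}, one has (e_{i}e_{i}^{\top}H)(e_{i}e_{i}^{\top}H)=H_{ii}\,e_{i}e_{i}^{\top}H, and hence

\left(I-\alpha e_{i}e_{i}^{\top}H\right)\left(I+\frac{\alpha e_{i}e_{i}^{\top}H}{1-\alpha H_{ii}}\right)=I-\alpha e_{i}e_{i}^{\top}H+\frac{\alpha e_{i}e_{i}^{\top}H-\alpha^{2}H_{ii}\,e_{i}e_{i}^{\top}H}{1-\alpha H_{ii}}=I-\alpha e_{i}e_{i}^{\top}H+\alpha e_{i}e_{i}^{\top}H=I.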

In particular, we have

(3.4) AH(ω)11+αH1α|Hii|.\left\|A^{H}(\omega)^{-1}\right\|\leq 1+\frac{\alpha\left\|H\right\|}{1-\alpha|H_{ii}|}.

Thus, if we take the maximal stepsize αmax\alpha_{\max} such that αmax<1/max1id|Hii|\alpha_{\max}<1/\max_{1\leq i\leq d}|H_{ii}|, then AH()1\left\|A^{H}(\cdot)^{-1}\right\| is bounded on Ω\Omega, and as a result (logAH()1)+\left(\log\left\|A^{H}(\cdot)^{-1}\right\|\right)_{+} is also integrable. Therefore, the assumptions of Theorem 2.3 hold. The shifting operator τ\tau is ergodic on (Ω,,)(\Omega,\mathcal{F},\mathbb{P}) by Kolmogorov’s 0–1 law. Then Theorem 2.3 applies for θ=τ\theta=\tau with pH()p^{H}(\cdot), λiH()\lambda_{i}^{H}(\cdot), and diH()d_{i}^{H}(\cdot) all being a.e. constant. For any ωΩ~\omega\in\widetilde{\Omega}, where Ω~\widetilde{\Omega} is the set in Theorem 2.3 satisfying (Ω~)=1\mathbb{P}(\widetilde{\Omega})=1, we denote

(3.5) W+H(ω)=λi>0ViH(ω),andWH(ω)=λi0ViH(ω).W_{+}^{H}(\omega)=\bigoplus_{\lambda_{i}>0}V_{i}^{H}(\omega),\ \text{and}\ W_{-}^{H}(\omega)=\bigoplus_{\lambda_{i}\leq 0}V_{i}^{H}(\omega).

Then the following invariant property holds

WH(τω)=AH(ω)WH(ω).W_{-}^{H}(\tau\omega)=A^{H}(\omega)W_{-}^{H}(\omega).

Note that WH(ω)W_{-}^{H}(\omega) works as a center-stable subspace. That is, for any xWH(ω)x\in W_{-}^{H}(\omega) and any ϵ>0\epsilon>0, it holds that Φ(t,ω)xetϵ\left\|\Phi(t,\omega)x\right\|\leq e^{t\epsilon}, for sufficiently large tt, and for xWH(ω)x\notin W_{-}^{H}(\omega), Φ(t,ω)x\left\|\Phi(t,\omega)x\right\| grows exponentially as tt\rightarrow\infty with rate greater than minλi>0λiϵ\min_{\lambda_{i}>0}\lambda_{i}-\epsilon for any ϵ>0\epsilon>0.

3.2. Assumptions

In this section, we specify the assumptions on the objective function ff used in this paper. The first is a standard smoothness assumption on ff.

Assumption 1.

fC2(d)f\in C^{2}(\mathbb{R}^{d}) and the Hessian 2f\nabla^{2}f is uniformly bounded, i.e., there exists M>0M>0 such that 2f(x)M\left\|\nabla^{2}f(x)\right\|\leq M for all xdx\in\mathbb{R}^{d}.

An optimization algorithm is expected to converge, under some reasonable assumptions, to a critical point of ff where the gradient vanishes. Our aim is to further characterize the possible limits of the algorithm iterates. For this purpose, we distinguish Crits(f)\mathrm{Crit}_{s}(f), the set of all strict saddle points (including local maxima with non-degenerate Hessian) of ff:

Crits(f):={xd:f(x)=0,λmin(2f(x))<0},\mathrm{Crit}_{s}(f):=\{x\in\mathbb{R}^{d}:\nabla f(x)=0,\ \lambda_{\min}(\nabla^{2}f(x))<0\},

where we use the subscript ss to emphasize that it is strict. Due to the presence of a negative eigenvalue of the Hessian, if we were considering the gradient dynamics near the critical point, the saddle point would be an unstable equilibrium. Our first result is that this instability also occurs in the linear random dynamical system ΦH(t,ω)\Phi^{H}(t,\omega) where H=2f(x)H=\nabla^{2}f(x^{*}). In other words, the dimension of W+H(ω)W^{H}_{+}(\omega) defined in (3.5) is greater than 0. While this mainly serves as a preliminary step for our analysis of the nonlinear dynamics, the conclusion by itself might be of interest, and is stated as follows. The proof is deferred to Section 4.1.

Proposition 3.1.

Suppose that HH has a negative eigenvalue and that 0<αmin<αmax<1/max1id|Hii|0<\alpha_{\min}<\alpha_{\max}<1/\max_{1\leq i\leq d}|H_{ii}|. Then the largest Lyapunov exponent of ΦH(t,ω)\Phi^{H}(t,\omega) is positive.

Our goal is to generalize such results to the nonlinear dynamics near strict saddle points of ff, for which we require two additional assumptions, stated as follows.

Assumption 2.

For every xCrits(f)x^{*}\in\mathrm{Crit}_{s}(f), 2f(x)\nabla^{2}f(x^{*}) is non-degenerate, i.e., xx^{*} is a non-degenerate critical point of ff in the sense that any eigenvalue of 2f(x)\nabla^{2}f(x^{*}) is nonzero.

Assumption 2 is also a standard assumption, which in particular guarantees that each strict saddle point is isolated, due to the non-degenerate Hessian. For each strict saddle point, Proposition 3.1 guarantees that the corresponding unstable subspace W+H(ω)W^{H}_{+}(\omega) is non-trivial (has dimension at least 11). We would in fact require a stronger technical assumption on its structure.

Assumption 3.

For every xCrits(f)x^{*}\in\mathrm{Crit}_{s}(f), it holds that 𝒫+H(ω)ei0\mathcal{P}_{+}^{H}(\omega)e_{i}\neq 0, for every i{1,2,,d}i\in\{1,2,\dots,d\} and almost every ωΩ\omega\in\Omega, where 𝒫+H(ω)\mathcal{P}_{+}^{H}(\omega) is the orthogonal projection onto W+H(ω)W_{+}^{H}(\omega) with H=2f(x)H=\nabla^{2}f(x^{*}).

We expect that Assumption 3 holds generically. We also show in Appendix A that Assumption 3 can be verified when HH has no zero off-diagonal elements (and W+H(ω)W^{H}_{+}(\omega) is not trivial). However, there exist cases where Assumption 3 might not hold. One example is H=2f(x)=diag(H1,H2)H=\nabla^{2}f(x^{*})=\text{diag}(H_{1},H_{2}) where H1d1×d1H_{1}\in\mathbb{R}^{d_{1}\times d_{1}} only has positive eigenvalues and H2d2×d2H_{2}\in\mathbb{R}^{d_{2}\times d_{2}} only has negative eigenvalues, which implies that WH(ω)=span{ei:1id1}W_{-}^{H}(\omega)=\text{span}\{e_{i}:1\leq i\leq d_{1}\} and W+H(ω)=span{ei:d1+1id}W_{+}^{H}(\omega)=\text{span}\{e_{i}:d_{1}+1\leq i\leq d\}; in this case, the projection of eie_{i} onto W+H(ω)W_{+}^{H}(\omega) vanishes for 1id11\leq i\leq d_{1}.
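This failure mode can also be observed numerically; the following Python sketch (illustrative only, with an arbitrary diagonal choice of the two blocks) builds a long random product of the single-step maps I - alpha * e_i e_i^T H and checks that the projection of e_1 onto the span of its leading right singular vectors is numerically zero.

    import numpy as np

    rng = np.random.default_rng(1)
    H = np.diag([1.0, 0.5, -0.8, -1.2])           # H_1 > 0 on the first two coordinates, H_2 < 0 on the last two
    d, d_plus = 4, 2                              # for this H, the unstable subspace has dimension 2
    alpha_min, alpha_max = 0.1, 0.5               # satisfies alpha_max < 1 / max_i |H_ii|

    Phi = np.eye(d)
    for _ in range(200):                          # random product of single-step maps I - alpha * e_i e_i^T H
        i = rng.integers(d)
        alpha = rng.uniform(alpha_min, alpha_max)
        Phi = (np.eye(d) - alpha * np.outer(np.eye(d)[i], H[i])) @ Phi

    _, _, Vt = np.linalg.svd(Phi)                 # right singular vectors, singular values in decreasing order
    P_plus = Vt[:d_plus].T @ Vt[:d_plus]          # projection onto the span of the two leading right singular vectors
    print(np.linalg.norm(P_plus @ np.eye(d)[0]))  # ~ 0: e_1 has no unstable component, so Assumption 3 fails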

We also remark that Assumption 1 is essential in our framework, since the linearized system is defined using the Hessian matrix. Analysis of randomized coordinate method for non-smooth optimization problems requires new techniques and deserves future research.

3.3. Main results

Given an initial guess x0x_{0}, the behavior of the algorithm, in particular, the limit of xtx_{t}, depends on the particular sample ωΩ\omega\in\Omega. For any xCrits(f)x^{*}\in\mathrm{Crit}_{s}(f), we denote the set of all ω\omega such that the algorithm starting at x0x_{0} would converge to xx^{*}:

Ωs(x,x0):={ωΩ:limtxt=limtφ(t,ω)x0=x}.\Omega_{s}(x^{*},x_{0}):=\left\{\omega\in\Omega:\lim_{t\rightarrow\infty}x_{t}=\lim_{t\rightarrow\infty}\varphi(t,\omega)x_{0}=x^{*}\right\}.

We further define the set Ωs(Crits(f),x0)\Omega_{s}(\mathrm{Crit}_{s}(f),x_{0}) as the union of all Ωs(x,x0)\Omega_{s}(x^{*},x_{0}) over xCrits(f)x^{*}\in\mathrm{Crit}_{s}(f),

Ωs(Crits(f),x0):=xCrits(f)Ωs(x,x0).\Omega_{s}(\mathrm{Crit}_{s}(f),x_{0}):=\bigcup_{x^{*}\in\mathrm{Crit}_{s}(f)}\Omega_{s}(x^{*},x_{0}).

Thus, if ωΩs(Crits(f),x0)\omega\notin\Omega_{s}(\mathrm{Crit}_{s}(f),x_{0}), the limit limtxt\lim_{t\to\infty}x_{t}, if it exists, will not be one of the strict saddle points. Our main result in this paper proves that this set is of measure zero, i.e., for any initial guess x0x_{0} that is not a strict saddle point, with probability 11, Algorithm 1 will not converge to a strict saddle point.

Theorem 1.

Suppose that Assumptions 1, 2, and 3 hold and that 0<αmin<αmax<1/M0<\alpha_{\min}<\alpha_{\max}<1/M, then for any x0d\Crits(f)x_{0}\in\mathbb{R}^{d}\backslash\mathrm{Crit}_{s}(f), it holds that

(Ωs(Crits(f),x0))=0.\mathbb{P}(\Omega_{s}(\mathrm{Crit}_{s}(f),x_{0}))=0.

The intuition behind the proof of Theorem 1 is to compare the nonlinear dynamics around a strict saddle point xCrits(f)x^{*}\in\mathrm{Crit}_{s}(f) with its linearization, as the linear dynamics has a non-trivial unstable subspace, thanks to Proposition 3.1. Ideally, one would hope that the nonlinear dynamics closely follows the linear dynamics, and thus eventually leaves the neighborhood of xx^{*}; the obstacle to such an argument, however, is that the linear approximation is only valid over a finite time interval. Therefore, to establish the instability of the nonlinear dynamics, we need a much more refined and quantitative argument using the instability of the linear system. In particular, we need to show that over a finite interval, with high probability, the linear system, and hence the nonlinear system, drives xtx_{t} away from the strict saddle point with quantitative bounds; see Theorem 4.4 in Section 4.2. Theorem 1 then follows from an argument in a similar spirit to the law of large numbers; see Section 4.3.

Remark 3.2.

The technical Assumption 3 and the randomness in stepsizes are made so that the iterate xt=xt1αt1eit1eit1f(xt1)x_{t}=x_{t-1}-\alpha_{t-1}e_{i_{t-1}}e_{i_{t-1}}^{\top}\nabla f(x_{t-1}) would obtain some non-trivial component in the unstable subspace, which would be further amplified within a sufficiently long but finite time interval. When 𝒫+H(τtω)eit1\left\|\mathcal{P}_{+}^{H}(\tau^{t}\omega)e_{i_{t-1}}\right\| and |eit1f(xt1)|\lvert e_{i_{t-1}}^{\top}\nabla f(x_{t-1})\rvert are fixed and relatively large, a random αt1\alpha_{t-1} would keep 𝒫+H(τtω)xt\left\|\mathcal{P}_{+}^{H}(\tau^{t}\omega)x_{t}\right\| away from 0 with high probability; see Section 4.2 for more details. It is an interesting open question whether it is possible to establish similar results without such assumptions. Our conjecture is that (Ωs(Crits(f),x0))=0\mathbb{P}(\Omega_{s}(\mathrm{Crit}_{s}(f),x_{0}))=0 still holds for x0d\Crits(f)x_{0}\in\mathbb{R}^{d}\backslash\mathrm{Crit}_{s}(f) unless x0x_{0} is located in a set with Lebesgue measure zero, similar to the results established in [Lee-2019].

As an application of our main result Theorem 1, we can obtain the global convergence to stationary points with no negative Hessian eigenvalues for Algorithm 1. More specifically, denote by

Crit(f):={xd:f(x)=0},\mathrm{Crit}(f):=\{x\in\mathbb{R}^{d}:\nabla f(x)=0\},

the set of all critical points of ff. We then have the following corollary, which will also be proved in Section 4.3.

Corollary 3.3.

Under the same assumptions as in Theorem 1, assume further that every xCrit(f)x^{*}\in\mathrm{Crit}(f) is an isolated critical point. Then for any x0d\Crits(f)x_{0}\in\mathbb{R}^{d}\backslash\mathrm{Crit}_{s}(f) with bounded level set L(x0)={xd:f(x)f(x0)}L(x_{0})=\{x\in\mathbb{R}^{d}:f(x)\leq f(x_{0})\}, with probability 11, {xt}t\{x_{t}\}_{t\in\mathbb{N}} is convergent with limit in Crit(f)\Crits(f)\mathrm{Crit}(f)\backslash\mathrm{Crit}_{s}(f).

Remark 3.4.

In Corollary 3.3, if we further assume that all saddle points of ff are strict, then the algorithm iterate converges to a local minimum with probability 11. Let us also mention that for many non-convex problems, saddle points are suboptimal while there do not exist “bad” local minima, e.g. phase retrieval [sun2018geometric], deep learning [kawaguchi2016deep, lu2017depth], and low-rank matrix problems [ge2017no]. For these problems, convergence to local minima suffices to guarantee good performance.

Remark 3.5.

In our setting, without adding noise to the gradient or the iterate, we cannot hope for good convergence rates for an arbitrary initial iterate. In fact, as shown in [Du-17], the convergence of the deterministic gradient descent algorithm to a local minimum might take an exponentially long time; we expect similar behavior for the randomized coordinate gradient descent Algorithm 1. Let us also remark that while we need the random stepsize as discussed in Remark 3.2, the interval [αmin,αmax][\alpha_{\min},\alpha_{\max}] could be made arbitrarily small; the result holds as long as 0<αmin<αmax<1/M0<\alpha_{\min}<\alpha_{\max}<1/M.

4. Proofs

We collect all proofs in this section.

4.1. Analysis of the linearized system

We will first study the linear dynamical system, for which we assume the objective function is given by

(4.1) fH(x)=12xHx,f^{H}(x)=\frac{1}{2}x^{\top}Hx,

where HH is a symmetric matrix in d×d\mathbb{R}^{d\times d} with at least one negative eigenvalue. In this case, the coordinate descent algorithm is given by

xt+1=(IαteiteitH)xt,x_{t+1}=\left(I-\alpha_{t}e_{i_{t}}e_{i_{t}}^{\top}H\right)x_{t},

which corresponds to the linear dynamical system ΦH(t,ω)\Phi^{H}(t,\omega) with single step map AH(ω)A^{H}(\omega), defined in (3.1) and (3.2) respectively.
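Before turning to the analysis, the instability asserted by Proposition 3.1 is easy to observe numerically; the following Python sketch (illustrative only, with an arbitrary indefinite H and stepsize range) iterates the linear dynamics with renormalization and reports the empirical growth rate of log of the iterate norm, which approximates the largest Lyapunov exponent.

    import numpy as np

    rng = np.random.default_rng(2)
    H = np.array([[1.0, 0.3], [0.3, -0.5]])        # symmetric with one negative eigenvalue
    alpha_min, alpha_max = 0.2, 0.8                # alpha_max < 1 / max_i |H_ii| = 1
    x = np.array([1.0, 1.0])
    log_growth = 0.0

    T = 5000
    for _ in range(T):
        i = rng.integers(2)
        alpha = rng.uniform(alpha_min, alpha_max)
        x = x - alpha * np.eye(2)[i] * (H @ x)[i]  # x_{t+1} = (I - alpha_t e_{i_t} e_{i_t}^T H) x_t
        n = np.linalg.norm(x)
        log_growth += np.log(n)
        x = x / n                                  # renormalize to avoid overflow; growth is tracked in log_growth
    print(log_growth / T)                          # positive: the iterate escapes the saddle exponentially fast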

Our main goal in this subsection is to prove Proposition 3.1 for this linear dynamical system, which states that at least one Lyapunov exponent of ΦH(t,ω)\Phi^{H}(t,\omega) is positive. It suffices to show that there exists some x0x_{0} such that xt\left\|x_{t}\right\| grows exponentially to infinity, which will follow from an energy argument similar to the proof of [Lee-2019, Proposition 5]. Although we consider a randomized coordinate gradient descent algorithm instead of a cyclic one, one step in the proof of Proposition 4.1, namely Lemma 4.3, closely follows the proof in [Lee-2019, Appendix A]. We start from x0x_{0} with fH(x0)<0f^{H}(x_{0})<0 and consider a finite time interval of length mdm\geq d. Proposition 4.1 establishes a quantitative decay estimate for fH(xt+m)f^{H}(x_{t+m}) compared with fH(xt)f^{H}(x_{t}), which leads to our desired result, Proposition 3.1.

Proposition 4.1.

Let mdm\geq d be fixed. For the objective function (4.1) with λmin(H)<0\lambda_{\min}(H)<0, suppose that 0<αmin<αmax<1/max1id|Hii|0<\alpha_{\min}<\alpha_{\max}<1/\max_{1\leq i\leq d}|H_{ii}|. Then there exists c(0,1)c\in(0,1) depending on mm, HH, αmin\alpha_{\min}, and αmax\alpha_{\max}, such that

fH(xt+m)fH(xt)cfH(xt),f^{H}(x_{t+m})-f^{H}(x_{t})\leq c\,f^{H}(x_{t}),

holds as long as fH(xt)<0f^{H}(x_{t})<0 and {1,2,,d}={it,it+1,,it+m1}\{1,2,\dots,d\}=\{i_{t},i_{t+1},\dots,i_{t+m-1}\} (in the sense of sets).

Remark 4.2.

The condition {1,2,,d}={it,it+1,,it+m1}\{1,2,\dots,d\}=\{i_{t},i_{t+1},\dots,i_{t+m-1}\} above is known as “generalized Gauss-Seidel rule” in the literature of coordinate methods [Tseng-09, Wright-12].

Proof.

Without loss of generality, we assume that t=0t=0. Due to the choice of αmax\alpha_{\max}, we have the following simple non-increasing property for any tt^{\prime}\in\mathbb{N}:

(4.2) fH(xt+1)=12xt+1Hxt+1=12xt(IαteiteitH)H(IαteiteitH)xt=fH(xt)αt(eitHxt)2+12αt2eitHeit(eitHxt)2fH(xt)αt2(eitHxt)2.\begin{split}f^{H}(x_{t^{\prime}+1})&=\frac{1}{2}x_{t^{\prime}+1}^{\top}Hx_{t^{\prime}+1}\\ &=\frac{1}{2}x_{t^{\prime}}^{\top}\left(I-\alpha_{t^{\prime}}e_{i_{t^{\prime}}}e_{i_{t^{\prime}}}^{\top}H\right)^{\top}H\left(I-\alpha_{t^{\prime}}e_{i_{t^{\prime}}}e_{i_{t^{\prime}}}^{\top}H\right)x_{t^{\prime}}\\ &=f^{H}(x_{t^{\prime}})-\alpha_{t^{\prime}}\left(e_{i_{t^{\prime}}}^{\top}Hx_{t^{\prime}}\right)^{2}+\frac{1}{2}\alpha_{t^{\prime}}^{2}e_{i_{t^{\prime}}}^{\top}He_{i_{t^{\prime}}}\left(e_{i_{t^{\prime}}}^{\top}Hx_{t^{\prime}}\right)^{2}\\ &\leq f^{H}(x_{t^{\prime}})-\frac{\alpha_{t^{\prime}}}{2}\left(e_{i_{t^{\prime}}}^{\top}Hx_{t^{\prime}}\right)^{2}.\end{split}

Write x0=y+y0x_{0}=y^{*}+y_{0} with yker(H)y^{*}\in\ker(H) and y0ran(H)y_{0}\in\operatorname{ran}(H). Let

(4.3) yt+1=ytαteiteitHyt,t=0,1,,m1.y_{t^{\prime}+1}=y_{t^{\prime}}-\alpha_{t^{\prime}}e_{i_{t^{\prime}}}e_{i_{t^{\prime}}}^{\top}Hy_{t^{\prime}},\qquad t^{\prime}=0,1,\dots,m-1.

Then xt=y+ytx_{t^{\prime}}=y^{*}+y_{t^{\prime}} holds for any t=0,1,,mt^{\prime}=0,1,\dots,m. Using (4.2), to give an upper bound for fH(xm)fH(x0)f^{H}(x_{m})-f^{H}(x_{0}), we would like a nontrivial lower bound for αt(eitHxt)2=αt(eitHyt)2\alpha_{t^{\prime}}\bigl{(}e_{i_{t^{\prime}}}^{\top}Hx_{t^{\prime}}\bigr{)}^{2}=\alpha_{t^{\prime}}\bigl{(}e_{i_{t^{\prime}}}^{\top}Hy_{t^{\prime}}\bigr{)}^{2} for some t{0,1,,m1}t^{\prime}\in\{0,1,\dots,m-1\}, which is guaranteed by Lemma 4.3, whose proof will be postponed.

Lemma 4.3.

Suppose that {1,2,,d}={i0,i1,,im1}\{1,2,\dots,d\}=\{i_{0},i_{1},\dots,i_{m-1}\}. For any

(4.4) 0<δmin{12m,αminσmin(H)2m(mαmaxσmax(H)+1)},0<\delta\leq\min\left\{\frac{1}{2m},\frac{\alpha_{\min}\sigma_{\min}(H)}{2\sqrt{m}(m\alpha_{\max}\sigma_{\max}(H)+1)}\right\},

where σmin(H)\sigma_{\min}(H) and σmax(H)\sigma_{\max}(H) are the smallest and largest positive singular values of HH respectively, if 0y0ran(H)0\neq y_{0}\in\operatorname{ran}(H), then there exists T{0,1,,m1}T\in\{0,1,\dots,m-1\} such that αT|eiTHyT|δyT\alpha_{T}\left|e_{i_{T}}^{\top}Hy_{T}\right|\geq\delta\left\|y_{T}\right\|, where the sequence yty_{t} is given as in (4.3).

Assuming the lemma, there exists T{0,1,,m1}T\in\{0,1,\dots,m-1\} such that

αT|eiTHyT|δyT,\alpha_{T}\left|e_{i_{T}}^{\top}Hy_{T}\right|\geq\delta\left\|y_{T}\right\|,

with a fixed δ>0\delta>0 satisfying (4.4). We can further constrain that δ2αminσmax(H)<1\frac{\delta^{2}}{\alpha_{\min}\sigma_{\max}(H)}<1. Thus, we have

fH(xm)fH(xT+1)fH(xT)αT2(eiTHxT)2fH(xT)δ22αTyT2,\begin{split}f^{H}(x_{m})&\leq f^{H}(x_{T+1})\leq f^{H}(x_{T})-\frac{\alpha_{T}}{2}\left(e_{i_{T}}^{\top}Hx_{T}\right)^{2}\leq f^{H}(x_{T})-\frac{\delta^{2}}{2\alpha_{T}}\left\|y_{T}\right\|^{2},\end{split}

which combined with σmax(H)yT2yTHyT=xTHxT=2fH(xT)\sigma_{\max}(H)\left\|y_{T}\right\|^{2}\geq-y_{T}^{\top}Hy_{T}=-x_{T}^{\top}Hx_{T}=-2f^{H}(x_{T}), yields that

fH(xm)(1+δ2αTσmax(H))fH(xT)(1+δ2αTσmax(H))fH(x0).f^{H}(x_{m})\leq\left(1+\frac{\delta^{2}}{\alpha_{T}\sigma_{\max}(H)}\right)f^{H}(x_{T})\leq\left(1+\frac{\delta^{2}}{\alpha_{T}\sigma_{\max}(H)}\right)f^{H}(x_{0}).

Set c=δ2αmaxσmax(H)c=\frac{\delta^{2}}{\alpha_{\max}\sigma_{\max}(H)} and we get that fH(xm)fH(x0)cfH(x0)f^{H}(x_{m})-f^{H}(x_{0})\leq cf^{H}(x_{0}). ∎

We finish the proof by establishing Lemma 4.3 below.

Proof of Lemma 4.3.

Suppose on the contrary that αt|eitHyt|<δyt\alpha_{t}\left|e_{i_{t}}^{\top}Hy_{t}\right|<\delta\left\|y_{t}\right\| for any t{0,1,,m1}t\in\{0,1,\dots,m-1\}. It holds that

y1y0=α0|ei0Hy0|<δy0<2δy0.\left\|y_{1}-y_{0}\right\|=\alpha_{0}\left|e_{i_{0}}^{\top}Hy_{0}\right|<\delta\left\|y_{0}\right\|<2\delta\left\|y_{0}\right\|.

We claim that

(4.5) yty0<2tδy0,\left\|y_{t}-y_{0}\right\|<2t\delta\left\|y_{0}\right\|,

for any t=1,2,,m1t=1,2,\dots,m-1. By induction, assume that yty0<2tδy0\left\|y_{t}-y_{0}\right\|<2t\delta\left\|y_{0}\right\| holds for some t{1,2,,m2}t\in\{1,2,\dots,m-2\}, then

yt+1yt=αt|eitHyt|<δyt<δ(2tδ+1)y0<2δy0,\left\|y_{t+1}-y_{t}\right\|=\alpha_{t}\left|e_{i_{t}}^{\top}Hy_{t}\right|<\delta\left\|y_{t}\right\|<\delta(2t\delta+1)\left\|y_{0}\right\|<2\delta\left\|y_{0}\right\|,

where the last inequality uses 2tδ<2mδ12t\delta<2m\delta\leq 1. It follows yt+1y0yty0+yt+1yt<2(t+1)δy0\left\|y_{t+1}-y_{0}\right\|\leq\left\|y_{t}-y_{0}\right\|+\left\|y_{t+1}-y_{t}\right\|<2(t+1)\delta\left\|y_{0}\right\|.

Using (4.5) and max1idHeiσmax(H)\max_{1\leq i\leq d}\left\|He_{i}\right\|\leq\sigma_{\max}(H), we have

αt|eitHy0|αt|eitH(yty0)|+αt|eitHyt|<αmaxσmax(H)2tδy0+2δy0<2δ(mαmaxσmax(H)+1)y0,\begin{split}\alpha_{t}\bigl{\lvert}e_{i_{t}}^{\top}Hy_{0}\bigr{\rvert}&\leq\alpha_{t}\bigl{\lvert}e_{i_{t}}^{\top}H(y_{t}-y_{0})\bigr{\rvert}+\alpha_{t}\bigl{\lvert}e_{i_{t}}^{\top}Hy_{t}\bigr{\rvert}\\ &<\alpha_{\max}\sigma_{\max}(H)\cdot 2t\delta\left\|y_{0}\right\|+2\delta\left\|y_{0}\right\|\\ &<2\delta(m\alpha_{\max}\sigma_{\max}(H)+1)\left\|y_{0}\right\|,\end{split}

for t=0,1,,m1t=0,1,\dots,m-1. Since span{eik:k=0,1,,m1}=d\text{span}\{e_{i_{k}}:k=0,1,\dots,m-1\}=\mathbb{R}^{d}, noticing that y0ranHy_{0}\in\operatorname{ran}{H}, we have

αminσmin(H)y0αminHy0(t=0m1(αt|eitHy0|)2)1/2<2δm(mαmaxσmax(H)+1)y0,\begin{split}\alpha_{\min}\sigma_{\min}(H)\left\|y_{0}\right\|&\leq\alpha_{\min}\left\|Hy_{0}\right\|\leq\left(\sum_{t=0}^{m-1}\left(\alpha_{t}\bigl{\lvert}e_{i_{t}}^{\top}Hy_{0}\bigr{\rvert}\right)^{2}\right)^{1/2}\\ &<2\delta\sqrt{m}(m\alpha_{\max}\sigma_{\max}(H)+1)\left\|y_{0}\right\|,\end{split}

which contradicts the choice of δ\delta in (4.4). ∎

We are now ready to prove Proposition 3.1 which states the existence of positive Lyapunov exponent of the linear dynamical system.

Proof of Proposition 3.1.

It suffices to show that for almost every ωΩ\omega\in\Omega there exist some x0dx_{0}\in\mathbb{R}^{d}, ϵ>0\epsilon>0, and T>0T>0, such that xt=Φ(t,ω)x0x_{t}=\Phi(t,\omega)x_{0} satisfies xteϵt\left\|x_{t}\right\|\geq e^{\epsilon t} for any t>Tt>T. Let x0x_{0} be an eigenvector corresponding to a negative eigenvalue of HH. Then it holds that fH(x0)<0f^{H}(x_{0})<0. Consider a fixed mdm\geq d. For any kk\in\mathbb{N}, set Ik=1I_{k}=1 if {1,2,,d}={ikm,ikm+1,,ikm+m1}\{1,2,\dots,d\}=\{i_{km},i_{km+1},\dots,i_{km+m-1}\} and Ik=0I_{k}=0 otherwise. We can see that the random variables, I0,I1,I2,I_{0},I_{1},I_{2},\dots, are independent and identically distributed with 𝔼I0=(I0=1)(0,1)\mathbb{E}I_{0}=\mathbb{P}(I_{0}=1)\in(0,1). By Proposition 4.1, we obtain that

fH(x(k+1)m){(1+c)fH(xkm),if Ik=1,fH(xkm),if Ik=0,f^{H}(x_{(k+1)m})\leq\begin{cases}(1+c)f^{H}(x_{km}),&\text{if }I_{k}=1,\\ f^{H}(x_{km}),&\text{if }I_{k}=0,\end{cases}

where cc is the constant from Proposition 4.1. Therefore,

λmin(H)2xkm2fH(xkm)(1+c)j=0k1IjfH(x0),\frac{\lambda_{\min}(H)}{2}\left\|x_{km}\right\|^{2}\leq f^{H}(x_{km})\leq(1+c)^{\sum_{j=0}^{k-1}I_{j}}f^{H}(x_{0}),

which implies that

(4.6) xkm(2fH(x0)λmin(H))1/2(1+c)12j=0k1Ij.\left\|x_{km}\right\|\geq\left(\frac{2f^{H}(x_{0})}{\lambda_{\min}(H)}\right)^{1/2}\cdot(1+c)^{\frac{1}{2}\sum_{j=0}^{k-1}I_{j}}.

Note that 𝔼|I0|=𝔼I0<\mathbb{E}|I_{0}|=\mathbb{E}I_{0}<\infty. The strong law of large numbers implies that for almost every ωΩ\omega\in\Omega, there exists some KK such that for all kKk\geq K

(4.7) j=0k1Ij𝔼I02k.\sum_{j=0}^{k-1}I_{j}\geq\frac{\mathbb{E}I_{0}}{2}k.

Combining (4.6) and (4.7), we arrive at

xkm(2fH(x0)λmin(H))1/2(1+c)𝔼I04mkm.\left\|x_{km}\right\|\geq\left(\frac{2f^{H}(x_{0})}{\lambda_{\min}(H)}\right)^{1/2}\cdot(1+c)^{\frac{\mathbb{E}I_{0}}{4m}\cdot km}.

Noticing that (1+c)𝔼I04m(1+c)^{\frac{\mathbb{E}I_{0}}{4m}} is greater than 11, xkm\left\|x_{km}\right\| grows exponentially in kmkm and we complete the proof. ∎

4.2. Finite block analysis

In this subsection, we study the behavior of the nonlinear dynamical system near a strict saddle point of ff, which without loss of generality can be assumed to be x=0x^{*}=0. As mentioned above, in a small neighborhood of xx^{*}, while it is not possible to control the difference between the nonlinear and linear systems for infinite time, the nonlinear system can be approximated by the linear system over a finite time horizon.

The main conclusion of this subsection is the following theorem which states that after a finite time interval with length TT, the distance of the iterate from x=0x^{*}=0 will be amplified exponentially with high probability.

Theorem 4.4.

Suppose that Assumptions 1, 2, and 3 hold and that 0<αmin<αmax<1/M0<\alpha_{\min}<\alpha_{\max}<1/M. There exists ϵ(0,1/6)\epsilon_{*}\in(0,1/6) such that for any ϵ(0,ϵ)\epsilon\in(0,\epsilon_{*}), there exists T=T(ϵ)+T_{*}=T_{*}(\epsilon)\in\mathbb{N}_{+} such that for any T+T\in\mathbb{N}_{+} with TTT\geq T_{*} and any tt\in\mathbb{N}, conditioned on t1\mathcal{F}_{t-1}, with probability at least 14ϵ1-4\epsilon, it holds for all xtVx_{t}\in V that

(4.8) xt+Texp(6ϵ16ϵ|log(1Mαmax)|T)xt,\left\|x_{t+T}\right\|\geq\exp\biggl{(}\frac{6\epsilon}{1-6\epsilon}\bigl{\lvert}\log(1-M\alpha_{\max})\bigr{\rvert}\,T\biggr{)}\left\|x_{t}\right\|,

where VV is a neighborhood of x=0x^{*}=0, depending on ϵ\epsilon, TT, and ff near xx^{*}.

The lower bound (4.8) quantifies the amplification of xt+T\left\|x_{t+T}\right\|: While we always have xt+T(1Mαmax)Txt\left\|x_{t+T}\right\|\geq(1-M\alpha_{\max})^{T}\left\|x_{t}\right\| (see (4.20) below), the result states that with probability at least 14ϵ1-4\epsilon, the amplification factor is at least the right-hand side of (4.8), which is exponentially large in TT. Hence on average xt+T\left\|x_{t+T}\right\| would be much larger than xt\left\|x_{t}\right\|. This would lead to escaping of the iterate from the neighborhood of x=0x^{*}=0.

To prove Theorem 4.4, we would require a more quantitative characterization of the behavior of its linearization at xx^{*}. In particular, we need a high probability estimate of the distance of the iterate from xx^{*} after some time interval. For this purpose, conditioned on t1\mathcal{F}_{t-1} with the iterate xtx_{t}, we will first show in Lemma 4.8 that, after some finite time, the orthogonal projection of the iterate xϱtx_{\varrho_{t}} onto the unstable subspace, where t<ϱtt+Lt<\varrho_{t}\leq t+L for some constant LL, is significant. The component in the unstable subspace would then be further amplified subsequently by ΦH(S,τϱtω)\Phi^{H}(S,\tau^{\varrho_{t}}\omega), where H=2f(x)H=\nabla^{2}f(x^{*}). Here the time duration SS would be chosen sufficiently large such that the distance from xϱt+Sx_{\varrho_{t}+S} to xx^{*} is exponentially amplified. Theorem 4.4 follows by setting T=L+ST=L+S. In the second step above, we would need to control the closeness between the linear and nonlinear systems within a time horizon with length SS.

Such a finite block analysis approach has been used to establish the stability of Lyapunov exponent of random dynamical systems [Ledrappier-91, Froyland-15], which inspired our proof technique for Theorem 4.4 and Theorem 1.

We first set the small constant ϵ\epsilon in Theorem 4.4 which controls the failure probability of the amplification bound. Let λ1>λ2>>λp\lambda_{1}>\lambda_{2}>\cdots>\lambda_{p} be the Lyapunov exponents of the linearized system at x=0x^{*}=0. We set

(4.9) λ+=minλi>0λiand0<γ<12min{min1i<p|λiλi+1|,λ+}.\lambda_{+}=\min_{\lambda_{i}>0}\lambda_{i}\quad\text{and}\quad 0<\gamma<\frac{1}{2}\min\left\{\min_{1\leq i<p}|\lambda_{i}-\lambda_{i+1}|,\lambda_{+}\right\}.

Note that γ<λ+\gamma<\lambda_{+}. Let ϵ(0,1/6)\epsilon_{*}\in(0,1/6) be sufficiently small such that

(4.10) (16ϵ)(λ+γ)+6ϵlog(1Mαmax)>0,ϵ(0,ϵ).(1-6\epsilon)(\lambda_{+}-\gamma)+6\epsilon\cdot\log(1-M\alpha_{\max})>0,\quad\forall\ \epsilon\in(0,\epsilon_{*}).

The reason for such choice will become clear later (see (4.23)). For the rest of the section, we will consider a fixed ϵ(0,ϵ)\epsilon\in(0,\epsilon_{*}).

We now state and prove several lemmas for Theorem 4.4. First, in the following Lemma 4.5, we construct a stopping time ϱt1\varrho_{t}-1 that is bounded almost surely and such that the gradient component |eiϱt1f(xϱt1)|\bigl{\lvert}e_{i_{\varrho_{t}-1}}^{\top}\nabla f(x_{\varrho_{t}-1})\bigr{\rvert} is comparable to f(xϱt1)\left\|\nabla f(x_{\varrho_{t}-1})\right\| in amplitude with high probability.

Lemma 4.5.

Let 0<μ1d0<\mu\leq\frac{1}{\sqrt{d}} be a fixed constant. There exists some constant L>0L>0, such that for any tt\in\mathbb{N}, there exists a measurable ϱt:Ω+\varrho_{t}:\Omega\rightarrow\mathbb{N}_{+} such that t<ϱtt+Lt<\varrho_{t}\leq t+L and

(4.11) (|eiϱt1f(xϱt1)|μf(xϱt1)|t1)1ϵ.\mathbb{P}\Bigl{(}\bigl{\lvert}e_{i_{\varrho_{t}-1}}^{\top}\nabla f(x_{\varrho_{t}-1})\bigr{\rvert}\geq\mu\left\|\nabla f(x_{\varrho_{t}-1})\right\|\;\Big{|}\;\mathcal{F}_{t-1}\Bigr{)}\geq 1-\epsilon.
Proof.

For any tt\in\mathbb{N}, define

0=min{+:|eit+1f(xt+1)|μf(xt+1)}.\ell_{0}=\min\Bigl{\{}\ell\in\mathbb{N}_{+}\,:\,\bigl{\lvert}e_{i_{t+\ell-1}}^{\top}\nabla f(x_{t+\ell-1})\bigr{\rvert}\geq\mu\left\|\nabla f(x_{t+\ell-1})\right\|\Bigr{\}}.

It is clear that (0>t1)(11/d)\mathbb{P}(\ell_{0}>\ell\mid\mathcal{F}_{t-1})\leq(1-1/d)^{\ell}: since μ1/d\mu\leq 1/\sqrt{d}, for any xdx\in\mathbb{R}^{d} at least one coordinate ii satisfies |eif(x)|μf(x)\lvert e_{i}^{\top}\nabla f(x)\rvert\geq\mu\left\|\nabla f(x)\right\|, and each step selects such a coordinate with probability at least 1/d1/d. Hence, there exists some L>0L>0, such that

(0Lt1)1ϵ.\mathbb{P}(\ell_{0}\leq L\mid\mathcal{F}_{t-1})\geq 1-\epsilon.

We finish the proof by setting ϱt=t+min{0,L}\varrho_{t}=t+\min\{\ell_{0},L\} which has the desired property. ∎

We now carry out the amplification part of the finite block analysis for the linearized dynamics at x=0x^{*}=0. To simplify expressions in the following, for t1<t2t_{1}<t_{2}, we introduce the short-hand notation

(i,α)t1:t21=((it1,αt1),,(it21,αt21))Ωt1××Ωt21,(i,\alpha)_{t_{1}:t_{2}-1}=\big{(}(i_{t_{1}},\alpha_{t_{1}}),\dots,(i_{t_{2}-1},\alpha_{t_{2}-1})\big{)}\in\Omega_{t_{1}}\times\cdots\times\Omega_{t_{2}-1},

and the finite time transition matrix (i.e., composition of linear maps)

ΦH((i,α)t1:t21)=(Iαt21eit21eit21H)(Iαt1eit1eit1H).\Phi^{H}\bigl{(}(i,\alpha)_{t_{1}:t_{2}-1}\bigr{)}=(I-\alpha_{t_{2}-1}e_{i_{t_{2}-1}}e_{i_{t_{2}-1}}^{\top}H)\cdots(I-\alpha_{t_{1}}e_{i_{t_{1}}}e_{i_{t_{1}}}^{\top}H).

Recall that (Ωt,Σt,t)(\Omega_{t},\Sigma_{t},\mathbb{P}_{t}) is the probability space for 𝒰{1,2,,d}×𝒰[αmin,αmax]\mathcal{U}_{\{1,2,\dots,d\}}\times\mathcal{U}_{[\alpha_{\min},\alpha_{\max}]} for tt\in\mathbb{N}. We also denote 𝒫+H((i,α)t1:t21)\mathcal{P}_{+}^{H}\bigl{(}(i,\alpha)_{t_{1}:t_{2}-1}\bigr{)} as the projection operator onto the subspace spanned by the right singular vectors of ΦH((i,α)t1:t21)\Phi^{H}\bigl{(}(i,\alpha)_{t_{1}:t_{2}-1}\bigr{)} corresponding to d+d_{+} largest singular values, where d+=λi>0did_{+}=\sum_{\lambda_{i}>0}d_{i} and did_{i} is the dimension of the ii-th eigenspace as in Theorem 2.3 (ii) for the linearized system at xx^{*}.

As we mentioned in the proof sketch, we want ΦH(S,τϱtω)=ΦH((i,α)ϱt:ϱt+S1)\Phi^{H}(S,\tau^{\varrho_{t}}\omega)=\Phi^{H}\left((i,\alpha)_{\varrho_{t}:\varrho_{t}+S-1}\right) to amplify xϱtx_{\varrho_{t}}, for which we need to establish a non-trivial lower bound for the unstable component 𝒫+H((i,α)ϱt:ϱt+S1)xϱt\left\|\mathcal{P}_{+}^{H}\left((i,\alpha)_{\varrho_{t}:\varrho_{t}+S-1}\right)x_{\varrho_{t}}\right\|. This is achieved by several lemmas. We will establish three lower bounds in the sequel:

  • 𝒫+H(τϱtω)eiϱt1\left\|\mathcal{P}_{+}^{H}\left(\tau^{\varrho_{t}}\omega\right)e_{i_{\varrho_{t}-1}}\right\| using Lemma 4.6;

  • 𝒫+H((i,α)ϱt:ϱt+S1)eiϱt1\left\|\mathcal{P}_{+}^{H}\left((i,\alpha)_{\varrho_{t}:\varrho_{t}+S-1}\right)e_{i_{\varrho_{t}-1}}\right\| in Lemma 4.7; and finally the desired

  • 𝒫+H((i,α)ϱt:ϱt+S1)xϱt\left\|\mathcal{P}_{+}^{H}\left((i,\alpha)_{\varrho_{t}:\varrho_{t}+S-1}\right)x_{\varrho_{t}}\right\| in Lemma 4.8.

Let us first control 𝒫+H(τϱtω)eiϱt1\left\|\mathcal{P}_{+}^{H}\left(\tau^{\varrho_{t}}\omega\right)e_{i_{\varrho_{t}-1}}\right\| in the following lemma, which utilizes Assumption 3. For simplicity of notation, in Lemma 4.6 and Lemma 4.7, we state the results for 𝒫+H(ω)ei\left\|\mathcal{P}_{+}^{H}(\omega)e_{i}\right\| and 𝒫+H((i,α)0:S1)ej\left\|\mathcal{P}_{+}^{H}((i,\alpha)_{0:S-1})e_{j}\right\| instead, which is slightly more general.

Lemma 4.6.

Under Assumption 3, there exist δ>0\delta>0 and measurable Ω1ϵΩ~\Omega_{1}^{\epsilon}\subset\widetilde{\Omega}, where Ω~\widetilde{\Omega} is from Theorem 2.3, such that (Ω1ϵ)1ϵ\mathbb{P}(\Omega_{1}^{\epsilon})\geq 1-\epsilon and

𝒫+H(ω)eiδ,ωΩ1ϵ,i{1,2,,d}.\left\|\mathcal{P}^{H}_{+}(\omega)e_{i}\right\|\geq\delta,\quad\forall\ \omega\in\Omega_{1}^{\epsilon},\ i\in\{1,2,\dots,d\}.
Proof.

Assumption 3 implies that

({ωΩ~:𝒫+H(ω)ei>0,i{1,2,,d}})=1\mathbb{P}\bigl{(}\{\omega\in\widetilde{\Omega}:\left\|\mathcal{P}^{H}_{+}(\omega)e_{i}\right\|>0,\quad\forall\ i\in\{1,2,\dots,d\}\}\bigr{)}=1

Notice that

{ωΩ~:𝒫+H(ω)ei>0,i{1,2,,d}}==n+{ωΩ~:𝒫+H(ω)ei1n,i{1,2,,d}}.\bigl{\{}\omega\in\widetilde{\Omega}:\left\|\mathcal{P}^{H}_{+}(\omega)e_{i}\right\|>0,\quad\forall\ i\in\{1,2,\dots,d\}\bigr{\}}=\\ =\bigcup_{n\in\mathbb{N}_{+}}\Bigl{\{}\omega\in\widetilde{\Omega}:\left\|\mathcal{P}^{H}_{+}(\omega)e_{i}\right\|\geq\frac{1}{n},\quad\forall\ i\in\{1,2,\dots,d\}\Bigr{\}}.

The Lemma follows from continuity of measure. ∎

We will then be able to handle 𝒫+H((i,α)0:S1)ej\left\|\mathcal{P}_{+}^{H}((i,\alpha)_{0:S-1})e_{j}\right\| using Lemma 4.6 and the closeness between (ΦH(S,ω)ΦH(S,ω))1/2S\left(\Phi^{H}(S,\omega)^{\top}\Phi^{H}(S,\omega)\right)^{1/2S} and Λ(ω)\Lambda(\omega), as the former converges to the latter as SS\to\infty by Theorem 2.3. More precisely, denote the singular values of Xd×dX\in\mathbb{R}^{d\times d} by s1(X)s2(X)sd(X)s_{1}(X)\geq s_{2}(X)\geq\cdots\geq s_{d}(X). Then for S+S\in\mathbb{N}_{+} sufficiently large, we have

(4.12) |1Slogsj(ΦH(S,ω))λμ(j)|=|1Slogsj(ΦH((i,α)0:S1))λμ(j)|γ,\left|\frac{1}{S}\log s_{j}\left(\Phi^{H}(S,\omega)\right)-\lambda_{\mu(j)}\right|=\left|\frac{1}{S}\log s_{j}\left(\Phi^{H}\bigl{(}(i,\alpha)_{0:S-1}\bigr{)}\right)-\lambda_{\mu(j)}\right|\leq\gamma,

for every j{1,2,,d}j\in\{1,2,\dots,d\}, where λ1>λ2>>λp\lambda_{1}>\lambda_{2}>\dots>\lambda_{p} are the Lyapunov exponents from Theorem 2.3, γ\gamma is given by (4.9), and the map μ:{1,2,,d}{1,2,,p}\mu:\{1,2,\dots,d\}\rightarrow\{1,2,\dots,p\} satisfies that μ(j)=i\mu(j)=i if and only if d1++di1<jd1++did_{1}+\cdots+d_{i-1}<j\leq d_{1}+\cdots+d_{i}, so μ\mu matches the index of the singular values with that of the Lyapunov exponents. Moreover, the convergence also implies that

𝒫+H(S,ω)𝒫+H(ω)δ2,\left\|\mathcal{P}_{+}^{H}(S,\omega)-\mathcal{P}_{+}^{H}(\omega)\right\|\leq\frac{\delta}{2},

for sufficiently large SS, which then leads to

(4.13) 𝒫+H(S,ω)ej=𝒫+H((i,α)0:S1)ejδ2,\left\|\mathcal{P}^{H}_{+}\left(S,\omega\right)e_{j}\right\|=\left\|\mathcal{P}^{H}_{+}\left((i,\alpha)_{0:S-1}\right)e_{j}\right\|\geq\frac{\delta}{2},

for every j\in\{1,2,\dots,d\}, where \mathcal{P}^{H}_{+}(S,\omega) is the projection operator onto the subspace spanned by the right singular vectors of \Phi^{H}(S,\omega) corresponding to the d_{+} largest singular values. Let

(4.14) ΩS={(i,α)0:S1Ω0××ΩS1:(4.12) and (4.13) hold}.\Omega^{S}=\bigl{\{}(i,\alpha)_{0:S-1}\in\Omega_{0}\times\cdots\times\Omega_{S-1}:\eqref{close: singular value}\text{ and }\eqref{close: singular space}\text{ hold}\bigr{\}}.

The following lemma states that \Omega^{S} has high probability for sufficiently large S, where, with a slight abuse of notation, we write \mathbb{P}(\Omega^{S})=\mathbb{P}\left(\Omega^{S}\times\left(\bigtimes_{t\geq S}\Omega_{t}\right)\right).
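
To make the finite-block quantities above concrete, the following minimal numerical sketch (in Python with NumPy; not part of the analysis) assembles \Phi^{H}((i,\alpha)_{0:S-1}) for one particular symmetric H with eigenvalues of both signs and non-zero off-diagonal entries (all choices below are assumptions of the illustration), computes the finite-time exponents \frac{1}{S}\log s_{j}, and forms \mathcal{P}^{H}_{+}(S,\omega) from the leading right singular vectors, so that the norms \left\|\mathcal{P}^{H}_{+}(S,\omega)e_{i}\right\| appearing in (4.13) can be inspected directly.

import numpy as np

rng = np.random.default_rng(0)
d, S = 4, 200                                # S kept moderate so a plain SVD can resolve all singular values
a_min, a_max = 0.05, 0.4                     # assumed stepsize range with a_max < 1/||H||

# a symmetric Hessian with eigenvalues of both signs (a strict saddle) and non-zero off-diagonal entries
Q = np.linalg.qr(rng.standard_normal((d, d)))[0]
H = Q @ np.diag([1.0, 0.6, -0.8, -1.5]) @ Q.T

coords = rng.integers(0, d, size=S)          # i_t uniform on {0, ..., d-1}
steps = rng.uniform(a_min, a_max, size=S)    # alpha_t uniform on [a_min, a_max]

Phi = np.eye(d)
for i, a in zip(coords, steps):
    A = np.eye(d)
    A[i, :] -= a * H[i, :]                   # A_t = I - alpha_t e_{i_t} e_{i_t}^T H acts only on row i_t
    Phi = A @ Phi                            # Phi^H((i, alpha)_{0:t}) = A_t ... A_1 A_0

U, s, Vt = np.linalg.svd(Phi)
exponents = np.log(s) / S                    # finite-time analogues of (1/S) log s_j(Phi^H(S, omega))
d_plus = int(np.sum(exponents > 0))
P_plus = Vt[:d_plus].T @ Vt[:d_plus]         # P_+^H(S, omega): projection onto the top-d_+ right singular vectors
print("finite-time exponents:", exponents)
print("||P_+^H(S, omega) e_i|| for each coordinate:", np.linalg.norm(P_plus, axis=0))

For much larger S one would track the product incrementally (e.g., via repeated QR factorizations) to avoid ill-conditioning; the sketch above is only meant to expose the objects entering the definition of \Omega^{S}.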

Lemma 4.7.

Under the same assumptions as in Lemma 4.6, there exists some S_{*}>0 such that for every S\in\mathbb{N}_{+} with S\geq S_{*}, it holds that \mathbb{P}(\Omega^{S})\geq 1-2\epsilon.

Proof.

For a.e.ωΩa.e.\ \omega\in\Omega, it follows from Theorem 2.3, in particular (2.2), and standard matrix perturbation analysis (see e.g., [bhatia, Theorem VI.2.1, Theorem VII.3.1]) that

(4.15) \frac{1}{S}\log s_{j}\left(\Phi^{H}(S,\omega)\right)\rightarrow\lambda_{\mu(j)},\quad S\rightarrow\infty,

for any j{1,2,,d}j\in\{1,2,\dots,d\}, and that

(4.16) 𝒫+H(S,ω)𝒫+H(ω),S.\mathcal{P}^{H}_{+}(S,\omega)\rightarrow\mathcal{P}^{H}_{+}(\omega),\quad S\rightarrow\infty.

By Egorov’s theorem, there exists Ω2ϵΩ1ϵ\Omega_{2}^{\epsilon}\subset\Omega_{1}^{\epsilon} with (Ω2ϵ)12ϵ\mathbb{P}(\Omega_{2}^{\epsilon})\geq 1-2\epsilon, such that the convergences in (4.15) and (4.16) are both uniform on Ω2ϵ\Omega_{2}^{\epsilon}. Here Ω1ϵ\Omega_{1}^{\epsilon} is as in Lemma 4.6. Therefore, for some SS_{*} sufficiently large, we have

|1Slogsj(ΦH(S,ω))λμ(j)|γ,j{1,2,,d},SS,ωΩ2ϵ,\left|\frac{1}{S}\log s_{j}\left(\Phi^{H}(S,\omega)\right)-\lambda_{\mu(j)}\right|\leq\gamma,\quad\forall\ j\in\{1,2,\dots,d\},\ S\geq S_{*},\ \omega\in\Omega_{2}^{\epsilon},

and

(4.17) 𝒫+H(S,ω)𝒫+H(ω)δ2,SS,ωΩ2ϵ.\left\|\mathcal{P}^{H}_{+}(S,\omega)-\mathcal{P}^{H}_{+}(\omega)\right\|\leq\frac{\delta}{2},\quad\forall\ S\geq S_{*},\ \omega\in\Omega_{2}^{\epsilon}.

Combining Lemma 4.6 and (4.17), we obtain that

𝒫+H(S,ω)eiδ2,i{1,2,,d},SS,ωΩ2ϵ.\left\|\mathcal{P}^{H}_{+}(S,\omega)e_{i}\right\|\geq\frac{\delta}{2},\quad\forall\ i\in\{1,2,\dots,d\},\ \forall\ S\geq S_{*},\ \omega\in\Omega_{2}^{\epsilon}.

For any SSS\geq S_{*}, by the definition of ΩS\Omega^{S}, it holds that

Ω2ϵΩS×(×tSΩt),\Omega_{2}^{\epsilon}\subset\Omega^{S}\times\Bigl{(}\bigtimes_{t\geq S}\Omega_{t}\Bigr{)},

which implies the desired estimate

\mathbb{P}(\Omega^{S})=\mathbb{P}\biggl(\Omega^{S}\times\Bigl(\bigtimes_{t\geq S}\Omega_{t}\Bigr)\biggr)\geq\mathbb{P}(\Omega_{2}^{\epsilon})\geq 1-2\epsilon. ∎

Note that αϱt1𝒰[αmin,αmax]\alpha_{\varrho_{t}-1}\sim\mathcal{U}_{[\alpha_{\min},\alpha_{\max}]} is independent of ϱt2\mathcal{F}_{\varrho_{t}-2}, iϱt1i_{\varrho_{t}-1}, and (i,α)ϱt:ϱt+S1(i,\alpha)_{\varrho_{t}:\varrho_{t}+S-1}. The next lemma shows that with high probability, the choice of αϱt1\alpha_{\varrho_{t}-1} will lead to a nontrivial orthogonal projection of xϱtx_{\varrho_{t}} onto the unstable subspace of ΦH(S,τϱtω)=ΦH((i,α)ϱt:ϱt+S1)\Phi^{H}(S,\tau^{\varrho_{t}}\omega)=\Phi^{H}\bigl{(}(i,\alpha)_{\varrho_{t}:\varrho_{t}+S-1}\bigr{)}.

Lemma 4.8.

For any S+S\in\mathbb{N}_{+}, xϱt1x_{\varrho_{t}-1}, iϱt1i_{\varrho_{t}-1}, and (i,α)ϱt:ϱt+S1ΩS(i,\alpha)_{\varrho_{t}:\varrho_{t}+S-1}\in\Omega^{S}, there exists I[αmin,αmax]I\subset[\alpha_{\min},\alpha_{\max}] with m(I)(1ϵ)(αmaxαmin)m(I)\geq(1-\epsilon)(\alpha_{\max}-\alpha_{\min}) where m()m(\cdot) is the Lebesgue measure, such that for any αϱt1I\alpha_{\varrho_{t}-1}\in I, it holds that

(4.18) 𝒫+H((i,α)ϱt:ϱt+S1)xϱtϵδ(αmaxαmin)4|eiϱt1f(xϱt1)|.\left\|\mathcal{P}^{H}_{+}\left((i,\alpha)_{\varrho_{t}:\varrho_{t}+S-1}\right)x_{\varrho_{t}}\right\|\geq\frac{\epsilon\delta(\alpha_{\max}-\alpha_{\min})}{4}\bigl{\lvert}e_{i_{\varrho_{t}-1}}^{\top}\nabla f(x_{\varrho_{t}-1})\bigr{\rvert}.
Proof.

We assume that |eiϱt1f(xϱt1)|0\bigl{\lvert}e_{i_{\varrho_{t}-1}}^{\top}\nabla f(x_{\varrho_{t}-1})\bigr{\rvert}\neq 0; otherwise the result is trivial.

For simplicity of notation, we write

𝒫+H((i,α)ϱt:ϱt+S1)xϱt\displaystyle\mathcal{P}^{H}_{+}\left((i,\alpha)_{\varrho_{t}:\varrho_{t}+S-1}\right)x_{\varrho_{t}} =𝒫+H((i,α)ϱt:ϱt+S1)xϱt1\displaystyle=\mathcal{P}^{H}_{+}\left((i,\alpha)_{\varrho_{t}:\varrho_{t}+S-1}\right)x_{\varrho_{t}-1}
αϱt1eiϱt1f(xϱt1)𝒫+H((i,α)ϱt:ϱt+S1)eiϱt1\displaystyle\qquad-\alpha_{\varrho_{t}-1}e_{i_{\varrho_{t}-1}}^{\top}\nabla f(x_{\varrho_{t}-1})\mathcal{P}^{H}_{+}\left((i,\alpha)_{\varrho_{t}:\varrho_{t}+S-1}\right)e_{i_{\varrho_{t}-1}}
=:y2αϱt1y1,\displaystyle=:y_{2}-\alpha_{\varrho_{t}-1}y_{1},

where the last line defines y1y_{1} and y2y_{2}. Using the short-hand notation

r=ϵδ(αmaxαmin)4|eiϱt1f(xϱt1)|,r=\frac{\epsilon\delta(\alpha_{\max}-\alpha_{\min})}{4}\bigl{\lvert}e_{i_{\varrho_{t}-1}}^{\top}\nabla f(x_{\varrho_{t}-1})\bigr{\rvert},

we observe that (4.18) holds if and only if \alpha_{\varrho_{t}-1}y_{1} is not located in the ball of radius r centered at y_{2}.

It follows from the definition of ΩS\Omega^{S} and (4.13) that 𝒫+H((i,α)ϱt:ϱt+S1)eiϱt1δ2\left\|\mathcal{P}^{H}_{+}\left((i,\alpha)_{\varrho_{t}:\varrho_{t}+S-1}\right)e_{i_{\varrho_{t}-1}}\right\|\geq\frac{\delta}{2}, which then leads to

y1δ2|eiϱt1f(xϱt1)|=:2rϵ(αmaxαmin).\left\|y_{1}\right\|\geq\frac{\delta}{2}\bigl{\lvert}e_{i_{\varrho_{t}-1}}^{\top}\nabla f(x_{\varrho_{t}-1})\bigr{\rvert}=:\frac{2r}{\epsilon(\alpha_{\max}-\alpha_{\min})}.

Thus, the set of \alpha_{\varrho_{t}-1} such that \alpha_{\varrho_{t}-1}y_{1}\in\mathcal{B}_{r}(y_{2}) is an interval J\subset\mathbb{R} with \|\sup(J)\cdot y_{1}-\inf(J)\cdot y_{1}\|\leq 2r, as the diameter of \mathcal{B}_{r}(y_{2}) is 2r, which implies m(J)\leq 2r/\left\|y_{1}\right\|\leq\epsilon(\alpha_{\max}-\alpha_{\min}). The lemma is then proved by setting I=[\alpha_{\min},\alpha_{\max}]\backslash J. ∎
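
As a sanity check of the measure bound above, the following sketch (with arbitrary vectors y_{1},y_{2} and a radius r chosen so that 2r/\left\|y_{1}\right\|=\epsilon(\alpha_{\max}-\alpha_{\min}); all of these are assumptions of the illustration) confirms on a fine grid that the fraction of stepsizes \alpha\in[\alpha_{\min},\alpha_{\max}] with \alpha y_{1}\in\mathcal{B}_{r}(y_{2}) does not exceed \epsilon.

import numpy as np

rng = np.random.default_rng(1)
a_min, a_max, eps = 0.05, 0.4, 0.1
y1 = rng.standard_normal(3)
y2 = rng.standard_normal(3)
r = 0.5 * eps * (a_max - a_min) * np.linalg.norm(y1)   # so that 2r/||y1|| = eps * (a_max - a_min)

alphas = np.linspace(a_min, a_max, 200001)
in_ball = np.linalg.norm(y2[None, :] - alphas[:, None] * y1[None, :], axis=1) <= r
print("fraction of excluded stepsizes:", in_ball.mean(), "<= eps =", eps)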

With Lemmas 4.5–4.8 at hand, we now prove Theorem 4.4, which relies on approximating the nonlinear dynamics by its linearization and on the amplification from the finite-block analysis of the linearized system.

Proof of Theorem 4.4.

Without loss of generality, we will assume t=0t=0 in the proof to simplify notation. Since H=2f(x)H=\nabla^{2}f(x^{*}) is non-degenerate, we can take a neighborhood UU of x=0x^{*}=0 and some fixed σ>0\sigma>0 such that

(4.19) f(x)σx,xU.\left\|\nabla f(x)\right\|\geq\sigma\left\|x\right\|,\quad\forall\ x\in U.

Assumption 1 implies that

f(x)=f(x)f(x)Mxx=Mx,xd.\left\|\nabla f(x)\right\|=\left\|\nabla f(x)-\nabla f(x^{*})\right\|\leq M\left\|x-x^{*}\right\|=M\left\|x\right\|,\quad\forall\ x\in\mathbb{R}^{d}.

Using the above inequality and αmax<1/M\alpha_{\max}<1/M, it holds for every ωΩ\omega\in\Omega and tt^{\prime}\in\mathbb{N} that

(4.20) xt+1=xtαteiteitf(xt)xtαteiteitf(xt)(1Mαmax)xt,\begin{split}\left\|x_{t^{\prime}+1}\right\|&=\left\|x_{t^{\prime}}-\alpha_{t^{\prime}}e_{i_{t^{\prime}}}e_{i_{t^{\prime}}}^{\top}\nabla f(x_{t^{\prime}})\right\|\\ &\geq\left\|x_{t^{\prime}}\right\|-\alpha_{t^{\prime}}\left\|e_{i_{t^{\prime}}}e_{i_{t^{\prime}}}^{\top}\right\|\cdot\left\|\nabla f(x_{t^{\prime}})\right\|\\ &\geq(1-M\alpha_{\max})\left\|x_{t^{\prime}}\right\|,\end{split}

and similarly that

xt+1(1+Mαmax)xt.\left\|x_{t^{\prime}+1}\right\|\leq(1+M\alpha_{\max})\left\|x_{t^{\prime}}\right\|.

We thus define

r:=1Mαmaxandr+:=1+Mαmax,r_{-}:=1-M\alpha_{\max}\quad\text{and}\quad r_{+}:=1+M\alpha_{\max},

so that

(4.21) rxtxt+1r+xt.r_{-}\left\|x_{t^{\prime}}\right\|\leq\left\|x_{t^{\prime}+1}\right\|\leq r_{+}\left\|x_{t^{\prime}}\right\|.

We now choose the time duration SS large enough in the finite block analysis to guarantee significant amplification. More specifically, we choose SS so large that SSS\geq S_{*} (SS_{*} defined in Lemma 4.7) and that the following two inequalities hold:

(4.22) exp(S(λ+γ))ϵδμσ(r)L1(αmaxαmin)8(r+)L,\exp(S(\lambda_{+}-\gamma))\cdot\frac{\epsilon\delta\mu\sigma(r_{-})^{L-1}(\alpha_{\max}-\alpha_{\min})}{8}\geq(r_{+})^{L},

and

(4.23) (16ϵ)(S(λ+γ)+logϵδμσ(r)2(L1)(αmaxαmin)8)+6ϵ(L+S)logr>0,(1-6\epsilon)\left(S(\lambda_{+}-\gamma)+\log\frac{\epsilon\delta\mu\sigma(r_{-})^{2(L-1)}(\alpha_{\max}-\alpha_{\min})}{8}\right)+6\epsilon(L+S)\log r_{-}>0,

where L is the upper bound defined in Lemma 4.5, \mu\leq 1/\sqrt{d} is the fixed constant from Lemma 4.5, \delta is from Lemma 4.6, and \sigma is set in (4.19). Thanks to (4.10) for our choice of \epsilon, and since \gamma<\lambda_{+} by (4.9), both (4.22) and (4.23) are satisfied for all sufficiently large S.

Next, we show that, for any SS sufficiently large as above, there exists a convex neighborhood U1UU_{1}\subset U of x=0x^{*}=0, such that for any tt^{\prime}\in\mathbb{N}, any xtU1x_{t^{\prime}}\in U_{1}, and any (i,α)t:t+S1(i,\alpha)_{t^{\prime}:t^{\prime}+S-1}, it holds that

(4.24) xt+SΦH((i,α)t:t+S1)xtxt.\left\|x_{t^{\prime}+S}\right\|\geq\left\|\Phi^{H}\bigl{(}(i,\alpha)_{t^{\prime}:t^{\prime}+S-1}\bigr{)}x_{t^{\prime}}\right\|-\left\|x_{t^{\prime}}\right\|.

We first define a convex neighborhood U0UU_{0}\subset U of x=0x^{*}=0 such that

\left\|\left(x-\alpha e_{i}e_{i}^{\top}\nabla f(x)\right)-\left(I-\alpha e_{i}e_{i}^{\top}H\right)x\right\|=\left\|\alpha e_{i}e_{i}^{\top}\left(\nabla f(x)-Hx\right)\right\|=\left\|\alpha e_{i}e_{i}^{\top}\int_{0}^{1}\left(\nabla^{2}f(\eta x)-H\right)x\,\mathrm{d}\eta\right\|\leq\frac{1}{S(r_{+})^{S-1}}\left\|x\right\|,

for any xU0x\in U_{0}, any i{1,2,,d}i\in\{1,2,\dots,d\}, and any α[αmin,αmax]\alpha\in[\alpha_{\min},\alpha_{\max}]. Applying the inequality SS times for xtU1=(r+)(S1)U0x_{t^{\prime}}\in U_{1}=(r_{+})^{-(S-1)}U_{0}, we have,

xt+SΦH((i,α)t:t+S1)xtxt+S(Iαt+S1eit+S1eit+S1H)xt+S1+Iαt+S1eit+S1eit+S1Hxt+S1ΦH((i,α)t:t+S2)xt1S(r+)S1xt+S1+r+xt+S1ΦH((i,α)t:t+S2)xt1S(r+)S1(xt+S1+r+xt+S2++(r+)S1xt)xt,\begin{split}&\left\|x_{t^{\prime}+S}-\Phi^{H}\bigl{(}(i,\alpha)_{t^{\prime}:t^{\prime}+S-1}\bigr{)}x_{t^{\prime}}\right\|\\ \leq&\left\|x_{t^{\prime}+S}-\left(I-\alpha_{t^{\prime}+S-1}e_{i_{t^{\prime}+S-1}}e_{i_{t^{\prime}+S-1}}^{\top}H\right)x_{t^{\prime}+S-1}\right\|\\ &+\left\|I-\alpha_{t^{\prime}+S-1}e_{i_{t^{\prime}+S-1}}e_{i_{t^{\prime}+S-1}}^{\top}H\right\|\cdot\left\|x_{t^{\prime}+S-1}-\Phi^{H}\bigl{(}(i,\alpha)_{t^{\prime}:t^{\prime}+S-2}\bigr{)}x_{t^{\prime}}\right\|\\ \leq&\frac{1}{S(r_{+})^{S-1}}\left\|x_{t^{\prime}+S-1}\right\|+r_{+}\left\|x_{t^{\prime}+S-1}-\Phi^{H}\bigl{(}(i,\alpha)_{t^{\prime}:t^{\prime}+S-2}\bigr{)}x_{t^{\prime}}\right\|\\ \leq&\frac{1}{S(r_{+})^{S-1}}\left(\left\|x_{t^{\prime}+S-1}\right\|+r_{+}\left\|x_{t^{\prime}+S-2}\right\|+\cdots+(r_{+})^{S-1}\left\|x_{t^{\prime}}\right\|\right)\\ \leq&\left\|x_{t^{\prime}}\right\|,\end{split}

and hence, inequality (4.24).
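
The approximation bound (4.24) can be probed numerically. The sketch below assumes a specific objective f(x)=\frac{1}{2}x^{\top}Hx+\frac{1}{4}\sum_{i}x_{i}^{4} (so that \nabla^{2}f(0)=H), together with a particular H and stepsize range; it runs S steps of Algorithm 1 from a small initial point and compares the result with the linearized cocycle applied to the same point. It is an illustration only, not part of the proof.

import numpy as np

rng = np.random.default_rng(2)
d, S = 4, 50
a_min, a_max = 0.05, 0.4                     # assumed stepsize range

Q = np.linalg.qr(rng.standard_normal((d, d)))[0]
H = Q @ np.diag([1.0, 0.6, -0.8, -1.5]) @ Q.T

def grad_f(x):
    # f(x) = 0.5 x^T H x + 0.25 * sum(x_i^4), so the Hessian at the saddle x* = 0 equals H
    return H @ x + x ** 3

coords = rng.integers(0, d, size=S)
steps = rng.uniform(a_min, a_max, size=S)

x0 = 1e-3 * rng.standard_normal(d)           # start close to the saddle x* = 0
x, Phi = x0.copy(), np.eye(d)
for i, a in zip(coords, steps):
    x = x - a * np.eye(d)[i] * grad_f(x)[i]  # one step of Algorithm 1
    A = np.eye(d)
    A[i, :] -= a * H[i, :]
    Phi = A @ Phi                            # linearized cocycle Phi^H((i, alpha)_{0:t})

# the left-hand side being at most ||x0|| gives (4.24) for this realization
print(np.linalg.norm(x - Phi @ x0), "<=", np.linalg.norm(x0))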

Setting V=(r_{+})^{-(L-1)}U_{1}, we then have x_{\varrho_{0}-1}\in U for any x_{0}\in V, which implies that \left\|\nabla f(x_{\varrho_{0}-1})\right\|\geq\sigma\left\|x_{\varrho_{0}-1}\right\| as \varrho_{0}\leq L. According to Lemmas 4.5–4.8, for any given x_{0}\in V, with probability at least 1-4\epsilon, we have (i,\alpha)_{\varrho_{0}:\varrho_{0}+S-1}\in\Omega^{S} and the following hold:

(4.25) |eiϱ01f(xϱ01)|μf(xϱ01),\displaystyle\bigl{\lvert}e_{i_{\varrho_{0}-1}}^{\top}\nabla f(x_{\varrho_{0}-1})\bigr{\rvert}\geq\mu\left\|\nabla f(x_{\varrho_{0}-1})\right\|,
(4.26) 𝒫+H((i,α)ϱ0:ϱ0+S1)xϱ0ϵδ(αmaxαmin)4|eiϱ01f(xϱ01)|,\displaystyle\left\|\mathcal{P}^{H}_{+}\left((i,\alpha)_{\varrho_{0}:\varrho_{0}+S-1}\right)x_{\varrho_{0}}\right\|\geq\frac{\epsilon\delta(\alpha_{\max}-\alpha_{\min})}{4}\bigl{\lvert}e_{i_{\varrho_{0}-1}}^{\top}\nabla f(x_{\varrho_{0}-1})\bigr{\rvert},

where the probability is the marginal probability on (i,α)0:L+S1Ω0××ΩL+S1(i,\alpha)_{0:L+S-1}\in\Omega_{0}\times\cdots\times\Omega_{L+S-1}.

Recall \lambda_{+} and \gamma in (4.9), and d_{+}=\sum_{\lambda_{i}>0}d_{i}. It follows from (4.12) in the construction of the set \Omega^{S} that

1Slogsj(ΦH((i,α)ϱ0:ϱ0+S1))λμ(j)γλ+γ,jd+.\frac{1}{S}\log s_{j}\left(\Phi^{H}\left((i,\alpha)_{\varrho_{0}:\varrho_{0}+S-1}\right)\right)\geq\lambda_{\mu(j)}-\gamma\geq\lambda_{+}-\gamma,\quad\forall\ j\leq d_{+}.

That is, the d_{+} largest singular values of \Phi^{H}\left((i,\alpha)_{\varrho_{0}:\varrho_{0}+S-1}\right) are all greater than or equal to \exp(S(\lambda_{+}-\gamma)). Therefore, it holds that

(4.27) ΦH((i,α)ϱ0:ϱ0+S1)xϱ0ΦH((i,α)ϱ0:ϱ0+S1)𝒫+H((i,α)ϱ0:ϱ0+S1)xϱ0(4.26)exp(S(λ+γ))ϵδ(αmaxαmin)4|eiϱ01f(xϱ01)|(4.25)exp(S(λ+γ))ϵδμ(αmaxαmin)4f(xϱ01),\begin{split}\left\|\Phi^{H}\left((i,\alpha)_{\varrho_{0}:\varrho_{0}+S-1}\right)x_{\varrho_{0}}\right\|&\geq\left\|\Phi^{H}\left((i,\alpha)_{\varrho_{0}:\varrho_{0}+S-1}\right)\mathcal{P}^{H}_{+}\left((i,\alpha)_{\varrho_{0}:\varrho_{0}+S-1}\right)x_{\varrho_{0}}\right\|\\ &\leavevmode\kern-10.1389pt\mathrel{\mathop{\geq}\limits^{\eqref{ineq2}}}\exp(S(\lambda_{+}-\gamma))\cdot\frac{\epsilon\delta(\alpha_{\max}-\alpha_{\min})}{4}\bigl{\lvert}e_{i_{\varrho_{0}-1}}^{\top}\nabla f(x_{\varrho_{0}-1})\bigr{\rvert}\\ &\leavevmode\kern-10.1389pt\mathrel{\mathop{\geq}\limits^{\eqref{ineq1}}}\exp(S(\lambda_{+}-\gamma))\cdot\frac{\epsilon\delta\mu(\alpha_{\max}-\alpha_{\min})}{4}\left\|\nabla f(x_{\varrho_{0}-1})\right\|,\end{split}

where the first inequality follows from the fact that ΦH((i,α)ϱ0:ϱ0+S1)𝒫+H((i,α)ϱ0:ϱ0+S1)xϱ0\Phi^{H}\left((i,\alpha)_{\varrho_{0}:\varrho_{0}+S-1}\right)\mathcal{P}^{H}_{+}\left((i,\alpha)_{\varrho_{0}:\varrho_{0}+S-1}\right)x_{\varrho_{0}} and ΦH((i,α)ϱ0:ϱ0+S1)(I𝒫+H((i,α)ϱ0:ϱ0+S1))xϱ0\Phi^{H}\left((i,\alpha)_{\varrho_{0}:\varrho_{0}+S-1}\right)\left(I-\mathcal{P}^{H}_{+}\left((i,\alpha)_{\varrho_{0}:\varrho_{0}+S-1}\right)\right)x_{\varrho_{0}} are orthogonal. Combining (4.24) and (4.27), we obtain that

xϱ0+Sexp(S(λ+γ))ϵδμ(αmaxαmin)4f(xϱ01)xϱ0(4.21)(exp(S(λ+γ))ϵδμσ(r)L1(αmaxαmin)4(r+)L)x0(4.22)exp(S(λ+γ))ϵδμσ(r)L1(αmaxαmin)8x0.\begin{split}\left\|x_{\varrho_{0}+S}\right\|&\geq\exp(S(\lambda_{+}-\gamma))\cdot\frac{\epsilon\delta\mu(\alpha_{\max}-\alpha_{\min})}{4}\left\|\nabla f(x_{\varrho_{0}-1})\right\|-\left\|x_{\varrho_{0}}\right\|\\ &\leavevmode\kern-37.38895pt\mathrel{\mathop{\geq}\limits^{\eqref{eq:compareiterate}}}\left(\exp(S(\lambda_{+}-\gamma))\cdot\frac{\epsilon\delta\mu\sigma(r_{-})^{L-1}(\alpha_{\max}-\alpha_{\min})}{4}-(r_{+})^{L}\right)\cdot\left\|x_{0}\right\|\\ &\leavevmode\kern-16.12503pt\mathrel{\mathop{\geq}\limits^{\eqref{large S1}}}\exp(S(\lambda_{+}-\gamma))\cdot\frac{\epsilon\delta\mu\sigma(r_{-})^{L-1}(\alpha_{\max}-\alpha_{\min})}{8}\cdot\left\|x_{0}\right\|.\end{split}

Therefore, it holds that

xL+S\displaystyle\left\|x_{L+S}\right\| (4.21)(r)L1xϱ0+S\displaystyle\;\leavevmode\kern-37.38895pt\mathrel{\mathop{\geq}\limits^{\eqref{eq:compareiterate}}}(r_{-})^{L-1}\left\|x_{\varrho_{0}+S}\right\|
exp(S(λ+γ))ϵδμσ(r)2(L1)(αmaxαmin)8x0.\displaystyle\;\geq\exp(S(\lambda_{+}-\gamma))\cdot\frac{\epsilon\delta\mu\sigma(r_{-})^{2(L-1)}(\alpha_{\max}-\alpha_{\min})}{8}\cdot\left\|x_{0}\right\|.

We finally arrive at (4.8) by setting T=L+ST=L+S and combining the above with (4.23). ∎

4.3. Proof of main results

In this section, we first prove the following theorem, which relies on the high-probability local amplification established in Theorem 4.4. The main result, Theorem 1, will then follow as an immediate corollary, since \mathrm{Crit}_{s}(f) is countable and strict saddle points are isolated.

Theorem 4.9.

Suppose that Assumptions 1, 2, and 3 hold and that 0<\alpha_{\min}<\alpha_{\max}<1/M. Then for every x^{*}\in\mathrm{Crit}_{s}(f) and every x_{0}\in\mathbb{R}^{d}\backslash\{x^{*}\}, it holds that

(Ωs(x,x0))=0.\mathbb{P}(\Omega_{s}(x^{*},x_{0}))=0.
Proof.

Without loss of generality, we assume that x=0x^{*}=0. Conditioned on t1\mathcal{F}_{t-1} with xtVx_{t}\in V, where VV can be assumed to be bounded, Theorem 4.4 states that with probability at least 14ϵ1-4\epsilon,

xt+TAxt,\left\|x_{t+T}\right\|\geq A\left\|x_{t}\right\|,

where, to simplify notation, we denote by

A:=exp(6ϵ16ϵ|log(1Mαmax)|T)A:=\exp\biggl{(}\frac{6\epsilon}{1-6\epsilon}\bigl{\lvert}\log(1-M\alpha_{\max})\bigr{\rvert}\,T\biggr{)}

the amplification factor appearing on the right-hand side of (4.8). Notice also that, due to (4.21), we always have

xt+T(r)Txt.\left\|x_{t+T}\right\|\geq(r_{-})^{T}\left\|x_{t}\right\|.

It suffices to show that for any x0V\{x}x_{0}\in V\backslash\{x^{*}\}, with probability 11 there exists some t+t\in\mathbb{N}_{+} such that xtVx_{t}\notin V.

Let us consider the iterates every T steps: denote y_{t}=x_{Tt} and \mathcal{G}_{t}=\mathcal{F}_{Tt-1} for t\in\mathbb{N}. Defining the stopping time

ρ=inf{t:ytV},\rho=\inf\{t\in\mathbb{N}:y_{t}\notin V\},

it suffices to show that (ρ<)=1\mathbb{P}(\rho<\infty)=1. We define a sequence of random variables ItI_{t} as follows:

It(ω)={1,if yt+1Ayt,0,otherwise.I_{t}(\omega)=\begin{cases}1,&\text{if }\left\|y_{t+1}\right\|\geq A\left\|y_{t}\right\|,\\ 0,&\text{otherwise}.\end{cases}

By the discussion in the beginning of the proof, we have

(It=1|𝒢t,t<ρ)14ϵ,\mathbb{P}(I_{t}=1\ |\ \mathcal{G}_{t},t<\rho)\geq 1-4\epsilon,

and moreover, setting S_{t}(\omega)=\sum_{0\leq s<t}I_{s}(\omega), we have for t<\rho(\omega),

yty0ASt(ω)(r)T(tSt(ω)).\frac{\left\|y_{t}\right\|}{\left\|y_{0}\right\|}\geq A^{S_{t}(\omega)}\cdot(r_{-})^{T(t-S_{t}(\omega))}.

Denote R:=\sup_{x\in V}\left\|x\right\|<\infty. Since (1-5\epsilon)\log A+5\epsilon T\log r_{-}=\epsilon T\lvert\log r_{-}\rvert/(1-6\epsilon)>0 by the definition of A, there exists t_{*}\in\mathbb{N} such that

(A15ϵ(r)5ϵT)t>Ry0,tt.\left(A^{1-5\epsilon}\cdot(r_{-})^{5\epsilon T}\right)^{t}>\frac{R}{\left\|y_{0}\right\|},\quad\forall\ t\geq t_{*}.

Therefore, for any ttt\geq t_{*}, it holds that

(ρ>t)=(ρ>t,St(15ϵ)t)i(15ϵ)t(ti)(14ϵ)i(4ϵ)ti.\mathbb{P}(\rho>t)=\mathbb{P}\bigl{(}\rho>t,S_{t}\leq(1-5\epsilon)t\bigr{)}\leq\sum_{i\leq(1-5\epsilon)t}\binom{t}{i}(1-4\epsilon)^{i}(4\epsilon)^{t-i}.

Here the equality holds because, for t\geq t_{*}, on the event \{\rho>t,\ S_{t}>(1-5\epsilon)t\} we would have \left\|y_{t}\right\|\geq\left(A^{1-5\epsilon}(r_{-})^{5\epsilon T}\right)^{t}\left\|y_{0}\right\|>R, contradicting y_{t}\in V; the inequality follows from the conditional bound \mathbb{P}(I_{t}=1\,|\,\mathcal{G}_{t},t<\rho)\geq 1-4\epsilon by stochastic domination. As we show in the next lemma, the right-hand side above goes to 0 as t\rightarrow\infty; thus \lim_{t\rightarrow\infty}\mathbb{P}(\rho>t)=0, which implies that \mathbb{P}(\rho<\infty)=1. ∎

Lemma 4.10.

For any ϵ(0,1/4)\epsilon\in(0,1/4), it holds that

limti(15ϵ)t(ti)(14ϵ)i(4ϵ)ti=0.\lim_{t\rightarrow\infty}\sum_{i\leq(1-5\epsilon)t}\binom{t}{i}(1-4\epsilon)^{i}(4\epsilon)^{t-i}=0.
Proof.

Let X0,X1,X2,X_{0},X_{1},X_{2},\dots be a sequence of i.i.d. random variables with XiX_{i} being a Bernoulli random variable with expectation 14ϵ1-4\epsilon. Denote the average X¯t=1t0s<tXs\bar{X}_{t}=\frac{1}{t}\sum_{0\leq s<t}X_{s}. The weak law of large numbers yields that

i(15ϵ)t(ti)(14ϵ)i(4ϵ)ti=(X¯t15ϵ)(|X¯t𝔼X0|ϵ)0,\sum_{i\leq(1-5\epsilon)t}\binom{t}{i}(1-4\epsilon)^{i}(4\epsilon)^{t-i}=\mathbb{P}\left(\bar{X}_{t}\leq 1-5\epsilon\right)\leq\mathbb{P}\left(|\bar{X}_{t}-\mathbb{E}X_{0}|\geq\epsilon\right)\rightarrow 0,

as tt\rightarrow\infty. ∎
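
The decay of this binomial tail is also easy to observe numerically; the short sketch below evaluates the sum directly for \epsilon=0.05 (an arbitrary choice) and a few values of t.

from math import comb, floor

def tail(t, eps):
    # sum_{i <= (1 - 5*eps) t} C(t, i) (1 - 4*eps)^i (4*eps)^(t - i)
    k = floor((1 - 5 * eps) * t)
    return sum(comb(t, i) * (1 - 4 * eps) ** i * (4 * eps) ** (t - i) for i in range(k + 1))

for t in [50, 200, 400]:
    print(t, tail(t, eps=0.05))    # decreases towards 0 as t grows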

The main theorem then follows directly from Theorem 4.9.

Proof of Theorem 1.

For each x^{*}\in\mathrm{Crit}_{s}(f), Assumption 2 guarantees that, in a small neighborhood of x^{*}, the gradient \nabla f(x)=\nabla^{2}f(x^{*})(x-x^{*})+o(\left\|x-x^{*}\right\|) is non-vanishing as long as x\neq x^{*}, which implies that x^{*} is an isolated critical point. Therefore, \mathrm{Crit}_{s}(f) is countable, and Theorem 1 follows directly from Theorem 4.9 and the countability of \mathrm{Crit}_{s}(f). ∎

We now prove the global convergence result, Corollary 3.3, for which we will show that Algorithm 1 converges to a critical point of f under appropriate assumptions. We first show that the limit of each convergent subsequence of \{x_{t}\}_{t\in\mathbb{N}} is a critical point of f.

Proposition 4.11.

Suppose that Assumption 1 holds and 0<\alpha_{\min}<\alpha_{\max}<1/M. Then for any x_{0}\in\mathbb{R}^{d} with bounded level set L(x_{0})=\{x\in\mathbb{R}^{d}:f(x)\leq f(x_{0})\}, with probability 1, every accumulation point of \{x_{t}\}_{t\in\mathbb{N}} is in \mathrm{Crit}(f).

Proof.

Algorithm 1 is monotone in the objective value, since the following holds for any t\in\mathbb{N} by Taylor's expansion:

(4.28) \begin{split}f(x_{t+1})&=f\left(x_{t}-\alpha_{t}e_{i_{t}}e_{i_{t}}^{\top}\nabla f(x_{t})\right)\\ &=f(x_{t})-\alpha_{t}\left(e_{i_{t}}^{\top}\nabla f(x_{t})\right)^{2}+\frac{1}{2}\alpha_{t}^{2}\left(e_{i_{t}}^{\top}\nabla f(x_{t})\right)^{2}\cdot e_{i_{t}}^{\top}\nabla^{2}f\left(x_{t}-\theta_{t}\alpha_{t}e_{i_{t}}e_{i_{t}}^{\top}\nabla f(x_{t})\right)e_{i_{t}}\\ &\leq f(x_{t})-\frac{1}{2}\alpha_{t}\left(e_{i_{t}}^{\top}\nabla f(x_{t})\right)^{2}\\ &\leq f(x_{t}),\end{split}

where θt(0,1)\theta_{t}\in(0,1), which implies that the whole sequence {xt}t\{x_{t}\}_{t\in\mathbb{N}} is contained in the bounded level set L(x0)L(x_{0}).

Let us consider any η>0\eta>0 and set

L(x0,η)={xL(x0):f(x)η},L(x_{0},\eta)=\{x\in L(x_{0}):\left\|\nabla f(x)\right\|\geq\eta\},

which is either empty or compact. We claim that, with probability 1, no accumulation point of \{x_{t}\}_{t\in\mathbb{N}} is located in L(x_{0},\eta). This is clear when L(x_{0},\eta) is empty, so it suffices to consider compact L(x_{0},\eta). Fix a constant \mu\in(0,1/\sqrt{d}]. For any x\in L(x_{0},\eta), there exist an open neighborhood U_{x} of x and a coordinate i_{x}\in\{1,2,\dots,d\} such that

(4.29) |eixf(y)|μf(x)μη,yUx,\bigl{\lvert}e_{i_{x}}^{\top}\nabla f(y)\bigr{\rvert}\geq\mu\left\|\nabla f(x)\right\|\geq\mu\eta,\quad\forall\ y\in U_{x},

and that

(4.30) supyUxf(y)infyUxf(y)<αminμ2η22.\sup_{y\in U_{x}}f(y)-\inf_{y\in U_{x}}f(y)<\frac{\alpha_{\min}\mu^{2}\eta^{2}}{2}.

Noticing that L(x0,η)xL(x0,η)UxL(x_{0},\eta)\subset\bigcup_{x\in L(x_{0},\eta)}U_{x}, by the compactness, there exist finitely many points, say x1,x2,,xKx^{1},x^{2},\dots,x^{K}, such that

L(x0,η)1kKUxk.L(x_{0},\eta)\subset\bigcup_{1\leq k\leq K}U_{x^{k}}.

For any k\in\{1,2,\dots,K\}, combining (4.28), (4.29), and (4.30), we know that for any t, conditioned on \mathcal{F}_{t-1} with x_{t}\in U_{x^{k}}, if i_{t}=i_{x^{k}} (which has probability 1/d), then f(x_{t+1})<\inf_{y\in U_{x^{k}}}f(y) and thus x_{t^{\prime}}\notin U_{x^{k}} for all t^{\prime}>t.

Therefore, the probability that there are infinitely many t\in\mathbb{N} with x_{t}\in U_{x^{k}} is zero, since at each such visit the iterate leaves U_{x^{k}} permanently with conditional probability at least 1/d; this implies that, with probability 1, \{x_{t}\}_{t\in\mathbb{N}} has no accumulation point in U_{x^{k}}. We conclude that, with probability 1, L(x_{0},\eta) does not contain any accumulation point of \{x_{t}\}_{t\in\mathbb{N}}, as K is finite. Since this holds for any \eta>0, for \mathbb{P}-a.e. \omega\in\Omega, \{x_{t}\}_{t\in\mathbb{N}} has no accumulation point in \bigcup_{n\geq 1}L(x_{0},1/n), which leads to the desired result. ∎
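
The monotone decrease (4.28) and the vanishing of the gradient along the iterates are easy to observe in practice. The sketch below runs Algorithm 1 on a simple separable non-convex function; the choice of f, of the initialization, and of a stepsize range with \alpha_{\max}<1/M on the relevant level set are all assumptions of this illustration.

import numpy as np

rng = np.random.default_rng(3)
d, T = 10, 20000
a_min, a_max = 0.01, 0.09      # chosen so that a_max < 1/M on the relevant level set of this f (checked by hand)

def f(x):
    # a simple separable non-convex test function (an assumption of this sketch)
    return np.sum(0.25 * x ** 4 - 0.5 * x ** 2)

def grad_f(x):
    return x ** 3 - x

x = rng.uniform(-1.2, 1.2, size=d)
values = [f(x)]
for _ in range(T):
    i = rng.integers(d)                    # random coordinate i_t
    a = rng.uniform(a_min, a_max)          # random stepsize alpha_t
    x[i] -= a * grad_f(x)[i]               # coordinate gradient step of Algorithm 1
    values.append(f(x))

monotone = all(v2 <= v1 + 1e-12 for v1, v2 in zip(values, values[1:]))
print("monotone decrease (up to rounding):", monotone)
print("final gradient norm:", np.linalg.norm(grad_f(x)))   # expected to be close to 0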

Proposition 4.11 implies that any accumulation point of the algorithm iterates is a critical point. If we further assume that each critical point of f is isolated, then the whole sequence \{x_{t}\}_{t\in\mathbb{N}} converges and its limit lies in \mathrm{Crit}(f).

Proposition 4.12.

Under the assumptions of Proposition 4.11, if every x^{*}\in\mathrm{Crit}(f) is an isolated critical point of f, then with probability 1, x_{t} converges to some critical point of f as t\rightarrow\infty.

Proof.

It follows from Proposition 4.11 that \left\|\nabla f(x_{t})\right\| converges to 0 as t\rightarrow\infty for a.e. \omega\in\Omega. Indeed, if there were a subsequence \{x_{t_{k}}\}_{k\in\mathbb{N}} and \epsilon>0 with \left\|\nabla f(x_{t_{k}})\right\|\geq\epsilon for all k\in\mathbb{N}, then by the boundedness of L(x_{0}), \{x_{t_{k}}\}_{k\in\mathbb{N}} would have an accumulation point that is not a critical point of f, which leads to a contradiction.

Moreover, \mathrm{Crit}(f)\cap L(x_{0}) is a finite set: otherwise, \mathrm{Crit}(f)\cap L(x_{0}) would have a limit point, which would be a non-isolated critical point of f, violating the assumption.

Consider a fixed \omega\in\Omega with \lim_{t\rightarrow\infty}\left\|\nabla f(x_{t})\right\|=0. Select an open neighborhood U_{x^{*}} for every x^{*}\in\mathrm{Crit}(f)\cap L(x_{0}), such that there exists some \delta>0 with

\text{dist}(U_{x^{*}},U_{y^{*}})=\inf_{x\in U_{x^{*}},y\in U_{y^{*}}}\left\|x-y\right\|>\delta,\quad\forall\ x^{*}\neq y^{*}\in\mathrm{Crit}(f)\cap L(x_{0}).

If \{x_{t}\}_{t\in\mathbb{N}} had more than one accumulation point, then, since \left\|x_{t+1}-x_{t}\right\|\leq\alpha_{\max}\left\|\nabla f(x_{t})\right\|\rightarrow 0 is eventually smaller than \delta, the iterates would visit L(x_{0})\backslash\bigcup_{x^{*}\in\mathrm{Crit}(f)\cap L(x_{0})}U_{x^{*}} infinitely often when traveling between different neighborhoods. This set is compact, so \{x_{t}\}_{t\in\mathbb{N}} would have an accumulation point in L(x_{0})\backslash\bigcup_{x^{*}\in\mathrm{Crit}(f)\cap L(x_{0})}U_{x^{*}}, which contradicts Proposition 4.11. ∎

Corollary 3.3 is now an immediate consequence.

Proof of Corollary 3.3.

The result follows directly from Theorem 1 and Proposition 4.12. ∎

References

Appendix A Validity of Assumption 3

In this appendix, we provide some justification of Assumption 3, which is expected to hold generically. In particular, the following proposition validates this assumption when the off-diagonal entries of HH are all non-zero.

Proposition A.1.

Suppose that the largest Lyapunov exponent of \Phi^{H}(t,\omega) is positive. Then Assumption 3 holds as long as 0<\alpha_{\min}<\alpha_{\max}<1/\max_{1\leq i\leq d}|H_{ii}| and every off-diagonal entry of H is non-zero.

Proof.

For any element ω\omega in Ω\Omega, we take the smallest \ell such that {1,2,,d}={i0,i1,,i1}\{1,2,\dots,d\}=\{i_{0},i_{1},\dots,i_{\ell-1}\} and write

ω=((i0,α0),,(i1,α1),ω),\omega=((i_{0},\alpha_{0}),\dots,(i_{\ell-1},\alpha_{\ell-1}),\omega^{\prime}),

where ω=τωΩ\omega^{\prime}=\tau^{\ell}\omega\in\Omega. We have that \ell is finite for a.e. ωΩ\omega\in\Omega. Note that we can view 1\ell-1 as a stopping time, in particular, given \ell, ω\omega^{\prime} has distribution \mathbb{P} and is independent with 1\mathcal{F}_{\ell-1}.

Let {v1,v2,,vm}\{v_{1}^{\prime},v_{2}^{\prime},\dots,v_{m}^{\prime}\} be a set of basis vectors for WH(ω)=WH(τω)W^{H}_{-}(\omega^{\prime})=W^{H}_{-}(\tau^{\ell}\omega). Then a set of basis vectors for WH(ω)W^{H}_{-}(\omega) is given by

vj=(Iα0ei0ei0H)1(Iα1ei1ei1H)1vj,j=1,2,,m.v_{j}=\bigl{(}I-\alpha_{0}e_{i_{0}}e_{i_{0}}^{\top}H\bigr{)}^{-1}\cdots\bigl{(}I-\alpha_{\ell-1}e_{i_{\ell-1}}e_{i_{\ell-1}}^{\top}H\bigr{)}^{-1}v_{j}^{\prime},\quad j=1,2,\dots,m.

Denote by V^{\prime}=\bigl(v_{1}^{\prime}|v_{2}^{\prime}|\cdots|v_{m}^{\prime}\bigr) and V=\bigl(v_{1}|v_{2}|\cdots|v_{m}\bigr) the matrices formed by concatenating these column vectors. If e_{i}\in W^{H}_{-}(\omega)=\text{span}\{v_{1},v_{2},\dots,v_{m}\}, then V_{\hat{\imath},:} is column-rank deficient, since the existence of a positive Lyapunov exponent implies that m\leq d-1. Here and for the rest of the appendix, we denote by V_{\hat{\imath},:} the (d-1)\times m matrix obtained by removing the i-th row of V\in\mathbb{R}^{d\times m}.

Therefore, as Assumption 3 is equivalent to the statement that e_{i}\notin W^{H}_{-}(\omega) for every i\in\{1,2,\dots,d\} and almost every \omega\in\Omega, it suffices to show that V_{\hat{\imath},:} has full column-rank with probability 1. The key point is that, given \ell, the stepsizes \alpha_{0},\alpha_{1},\dots,\alpha_{\ell-1} are independent of i_{0},i_{1},\dots,i_{\ell-1} and \omega^{\prime}=\tau^{\ell}\omega. Thus, it suffices to show that, with fixed \ell, i_{0},i_{1},\dots,i_{\ell-1}, \omega^{\prime}=\tau^{\ell}\omega, and v_{1}^{\prime},v_{2}^{\prime},\dots,v_{m}^{\prime}, the set of all \alpha_{0},\alpha_{1},\dots,\alpha_{\ell-1} that yield the rank-deficiency of V_{\hat{\imath},:} is of measure zero; without loss of generality, we can assume i=1. Noticing that i_{0},i_{1},\dots,i_{\ell-1} cover all the coordinates and that every off-diagonal entry of H is non-zero, the desired result follows directly from Lemma A.2 below, applied repeatedly. ∎
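
The rank mechanism behind this argument can be illustrated numerically. In the sketch below, an arbitrary random m-dimensional subspace plays the role of W^{H}_{-}(\omega^{\prime}) (an assumption of the illustration, so this is not a verification of Assumption 3); its basis is pushed through a block of coordinate maps covering all coordinates with random stepsizes, and we check that deleting any single row leaves a full-column-rank matrix, i.e., that no e_{i} lies in the resulting subspace.

import numpy as np

rng = np.random.default_rng(4)
d, m = 4, 2
H = np.array([[ 1.0,  0.3,  0.2,  0.1],
              [ 0.3,  0.5,  0.4,  0.2],
              [ 0.2,  0.4, -0.8,  0.3],
              [ 0.1,  0.2,  0.3, -1.5]])   # symmetric with all off-diagonal entries non-zero
a_min, a_max = 0.05, 0.4                   # 0 < a_min < a_max < 1/max_i |H_ii|

Vp = rng.standard_normal((d, m))           # basis of a stand-in for W_-^H(omega'); an assumption of the sketch
V = Vp.copy()
for k in rng.permutation(d):               # a block of coordinates covering {1, ..., d}
    a = rng.uniform(a_min, a_max)
    A = np.eye(d)
    A[k, :] -= a * H[k, :]                 # I - alpha e_k e_k^T H
    V = np.linalg.solve(A, V)              # V <- (I - alpha e_k e_k^T H)^{-1} V

for i in range(d):
    V_hat = np.delete(V, i, axis=0)        # remove the i-th row
    # full column rank m certifies that e_i is not in the span of the columns of V
    print(i, np.linalg.matrix_rank(V_hat) == m)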

Lemma A.2.

Suppose that X=\bigl(X_{1}|X_{2}|\cdots|X_{d}\bigr)^{\top} and Y=\bigl(Y_{1}|Y_{2}|\cdots|Y_{d}\bigr)^{\top} are full-column-rank d\times m matrices satisfying Y=(I-\alpha e_{k}e_{k}^{\top}H)^{-1}X (we suppress in the notation the dependence of Y on k and \alpha for simplicity). Then the following hold:

  • (i)

    If X_{\hat{1},:} has full column-rank, then for any k\in\{1,2,\dots,d\}, Y_{\hat{1},:} also has full column-rank for a.e. \alpha.

  • (ii)

    Suppose that X1^,:X_{\hat{1},:} is column-rank deficient, and let 2j1<j2<<jm1d2\leq j_{1}<j_{2}<\dots<j_{m-1}\leq d be row indices such that

    Xjspan{Xj1,Xj2,,Xjm1},j{2,3,,d}.X_{j}\in\emph{span}\{X_{j_{1}},X_{j_{2}},\dots,X_{j_{m-1}}\},\quad\forall\ j\in\{2,3,\dots,d\}.

    If k{1,j1,j2,,jm1}k\in\{1,j_{1},j_{2},\dots,j_{m-1}\}, then we have with probability 11, either Y1^,:Y_{\hat{1},:} has full column-rank or Y1^,:Y_{\hat{1},:} is column-rank deficient with

    Yjspan{Yj1,Yj2,,Yjm1},j{2,3,,d}.Y_{j}\in\emph{span}\{Y_{j_{1}},Y_{j_{2}},\dots,Y_{j_{m-1}}\},\quad\forall\ j\in\{2,3,\dots,d\}.

    If k{1,j1,j2,,jm1}k\notin\{1,j_{1},j_{2},\dots,j_{m-1}\} and Hk10H_{k1}\neq 0, then Y1^,:Y_{\hat{1},:} has full column-rank.

Proof of Lemma A.2.

By (3.3), it holds that Yj=XjY_{j}=X_{j} for jkj\neq k and that

Yk=11αHkk(Xk+αjkHkjXj).Y_{k}=\frac{1}{1-\alpha H_{kk}}\left(X_{k}+\alpha\sum_{j\neq k}H_{kj}X_{j}\right).

For point (i), we notice that if k=1k=1, then Y1^,:=X1^,:Y_{\hat{1},:}=X_{\hat{1},:} has full column-rank. If k>1k>1, then it follows from X1span{X2,,Xd}X_{1}\in\text{span}\{X_{2},\dots,X_{d}\} that Y1^,:Y_{\hat{1},:} also has full column-rank for a.e. α\alpha.

For point (ii), we have

span{X1,Xj1,Xj2,,Xjm1}=m.\text{span}\{X_{1},X_{j_{1}},X_{j_{2}},\dots,X_{j_{m-1}}\}=\mathbb{R}^{m}.

If k{1,j1,j2,,jm1}k\in\{1,j_{1},j_{2},\dots,j_{m-1}\}, then span{Y1,Yj1,Yj2,,Yjm1}=m\text{span}\{Y_{1},Y_{j_{1}},Y_{j_{2}},\dots,Y_{j_{m-1}}\}=\mathbb{R}^{m} holds for a.e.a.e. α\alpha. Therefore, we obtain that Yj1,Yj2,,Yjm1Y_{j_{1}},Y_{j_{2}},\dots,Y_{j_{m-1}} are linearly independent, which implies that either Y1^,:Y_{\hat{1},:} has full column-rank or

Y_{j}\in\text{span}\{Y_{j_{1}},Y_{j_{2}},\dots,Y_{j_{m-1}}\},\quad\forall\ j\in\{2,3,\dots,d\}.

If k\notin\{1,j_{1},j_{2},\dots,j_{m-1}\} and H_{k1}\neq 0, then the coefficient \alpha H_{k1}/(1-\alpha H_{kk}) of X_{1} in Y_{k} is non-zero, so Y_{\hat{1},:} has full column-rank. ∎