
On the global convergence of randomized coordinate gradient descent for non-convex optimization

Ziang Chen (ZC) Department of Mathematics, Duke University, Box 90320, Durham, NC 27708, USA. [email protected] Yingzhou Li (YL) School of Mathematical Sciences, Fudan University, Shanghai 200433, China [email protected]  and  Jianfeng Lu (JL) Departments of Mathematics, Physics, and Chemistry, Duke University, Box 90320, Durham, NC 27708, USA. [email protected]
Abstract.

In this work, we analyze the global convergence property of coordinate gradient descent with random choice of coordinates and stepsizes for non-convex optimization problems. Under generic assumptions, we prove that the algorithm iterate will almost surely escape strict saddle points of the objective function. As a result, the algorithm is guaranteed to converge to local minima if all saddle points are strict. Our proof is based on viewing the coordinate descent algorithm as a nonlinear random dynamical system and on a quantitative finite block analysis of its linearization around saddle points.

This work is supported in part by the National Science Foundation via grants DMS-2012286 and CHE-2037263, and by US Department of Energy via grant DE-SC0019449. Y. Li is partially supported by National Natural Science Foundation of China under Grant No. 12271109. We thank Jonathan Mattingly, Zhe Wang, and Stephen J. Wright for helpful discussions.

1. Introduction

In this paper, we analyze the global convergence of the coordinate gradient descent algorithm for the smooth but non-convex optimization problem

(1.1) minxdf(x).\min_{x\in\mathbb{R}^{d}}f(x).

More specifically, we consider coordinate gradient descent with random coordinate selection and random stepsizes, as Algorithm 1.

Algorithm 1 Randomized coordinate gradient descent
Initialization: x0dx_{0}\in\mathbb{R}^{d}, t=0t=0.
while not convergent do
     Draw a coordinate iti_{t} uniformly at random from {1,2,,d}\{1,2,\dots,d\}.
     Draw a stepsize αt\alpha_{t} uniformly at random from [αmin,αmax][\alpha_{\min},\alpha_{\max}].
     xt+1xtαteititf(xt)x_{t+1}\leftarrow x_{t}-\alpha_{t}e_{i_{t}}\partial_{i_{t}}f(x_{t}).
     tt+1t\leftarrow t+1.
end while
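For concreteness, the following is a minimal Python sketch of Algorithm 1 (illustrative only; the gradient oracle grad_f, the stepsize range, and the iteration budget are placeholders supplied by the user, and the stopping test is replaced by a fixed number of iterations).

    import numpy as np

    def randomized_coordinate_gd(grad_f, x0, alpha_min, alpha_max, n_iters=10000, rng=None):
        # Minimal sketch of Algorithm 1: update one uniformly chosen coordinate per step,
        # with a stepsize drawn uniformly from [alpha_min, alpha_max].
        rng = np.random.default_rng() if rng is None else rng
        x = np.array(x0, dtype=float)
        d = x.size
        for _ in range(n_iters):                       # "while not convergent" replaced by a fixed budget
            i = rng.integers(d)                        # coordinate i_t, uniform on {0, ..., d-1}
            alpha = rng.uniform(alpha_min, alpha_max)  # stepsize alpha_t, uniform on [alpha_min, alpha_max]
            x[i] -= alpha * grad_f(x)[i]               # x_{t+1} = x_t - alpha_t * e_{i_t} * (partial_{i_t} f)(x_t)
        return x

For instance, for the quadratic f(x) = x^T H x / 2 one may pass grad_f = lambda x: H @ x.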

The main result of this paper, Theorem 1, is that for any initial guess x0x_{0} that is not a strict saddle point of ff, under some mild conditions, with probability 11, Algorithm 1 will escape any strict saddle points; thus, under some additional structural assumptions on ff, the algorithm converges globally to a local minimum.

In order to establish the global convergence, we view the algorithm as a random dynamical system and carry out the analysis based on the theory of random dynamical systems. This might be of independent interest; in particular, to the best of our knowledge, the theory of random dynamical systems has not been utilized in analyzing randomized algorithms, while it offers a natural framework for establishing the long time behavior of such algorithms. Let us now briefly explain the random dynamical system view of the algorithm and our analysis; more details can be found in Section 3.

Let (Ω,,)(\Omega,\mathcal{F},\mathbb{P}) be the probability space for all randomness used in the algorithm, such that each ωΩ\omega\in\Omega is a sequence of coordinates and stepsizes. The iterate of Algorithm 1 can be described as a random dynamical system xt=φ(t,ω)x0x_{t}=\varphi(t,\omega)x_{0} where φ(t,ω):dd\varphi(t,\omega):\mathbb{R}^{d}\rightarrow\mathbb{R}^{d} is a nonlinear map for any given tt\in\mathbb{N} and ωΩ\omega\in\Omega.

Consider an isolated stationary point xx^{*} of the dynamical system, which corresponds to a critical point of ff. Near xx^{*}, the dynamical system can be approximated by its linearization: xt=Φ(t,ω)x0x_{t}=\Phi(t,\omega)x_{0}, where Φ(t,ω)d×d\Phi(t,\omega)\in\mathbb{R}^{d\times d}. The limiting behavior of the linear dynamical system can be well understood by the celebrated multiplicative ergodic theorem: Under some assumptions, the limit Λ(ω)=limt(Φ(t,ω)Φ(t,ω))1/2t\Lambda(\omega)=\lim_{t\rightarrow\infty}\left(\Phi(t,\omega)^{\top}\Phi(t,\omega)\right)^{1/2t} exists almost surely. The eigenvalues of the matrix Λ(ω)\Lambda(\omega), eλ1(ω)>eλ2(ω)>>eλp(ω)(ω)e^{\lambda_{1}(\omega)}>e^{\lambda_{2}(\omega)}>\cdots>e^{\lambda_{p(\omega)}(\omega)}, characterize the long time behavior of the system. In particular, if the largest Lyapunov exponent λ1(ω)\lambda_{1}(\omega) is strictly positive, then if x0x_{0} has some non-trivial component in the unstable subspace, xt=Φ(t,ω)x0x_{t}=\Phi(t,\omega)x_{0} would exponentially diverge from xx^{*}. More details of preliminaries of linear random dynamical system can be found in Section 2.

Intuitively, one expects that the nonlinear dynamical system can be approximated by its linearization around a critical point xx^{*}, and would hence escape the strict saddle point, following the linearized system. However, the approximation by the linear dynamical system cannot hold over an infinite time horizon, due to error accumulation. Therefore, we cannot naively conclude using the multiplicative ergodic theorem and the linear approximation. Instead, a major part of the analysis is devoted to establishing a quantitative finite block analysis of the behavior of the dynamical system over a finite time interval. In particular, we will prove that when the iterate is in a neighborhood of xx^{*}, the distance xtx\left\|x_{t}-x^{*}\right\| will be exponentially amplified over a duration TT with high probability. This is then used to prove that with probability 11 the nonlinear system escapes strict saddle points.

1.1. Related work

Coordinate gradient descent is a popular approach in optimization; see, e.g., the review articles [Wright-15, shi2016primer]. Compared with full gradient descent, coordinate gradient methods allow larger stepsizes [Nesterov-12], enjoy faster convergence [Saha-13], and are also amenable to parallelization [Liu-15, Richtarik-16].

The convergence of coordinate gradient descent has been analyzed in several settings, depending on the properties of the objective function and on the strategy of coordinate selection. The understanding of convergence for convex problems is quite complete: for methods with cyclic choice of coordinates, the convergence has been established in [Beck-13, Saha-13, Sun-15], and the worst-case complexity for convex quadratic objective functions is investigated in [Sun-19]. For methods with random choice of coordinates, it is shown in [Nesterov-12] that 𝔼f(xt)\mathbb{E}f(x_{t}) converges to f=minxdf(x)f^{*}=\min_{x\in\mathbb{R}^{d}}f(x) sublinearly in the convex case and linearly in the strongly convex case. Convergence of the objective function with high probability has also been established in [Nesterov-12]. We also refer to [Richtarik-14, Liu-14, Liu-15, Wright-15] for further convergence results for random coordinate selection in convex problems. More recently, the convergence of methods with random permutation of coordinates (i.e., a random permutation of the dd coordinates is used for every dd steps of the algorithm) has been analyzed, mostly for the case of quadratic objective functions [Lee-19, Oswald-17, Gurbuzbalaban-20, Wright-20]. It has been an ongoing research direction to compare various coordinate selection strategies in various settings. In addition, in the non-convex and non-smooth setting, the convergence of coordinate/alternating descent methods can be analyzed for tame/semi-algebraic functions with the Kurdyka-Łojasiewicz property (see e.g. [attouch2013convergence, attouch2010proximal, bolte2014proximal, boct2020inertial]).

For non-convex objective functions, the global convergence analysis is less developed, as the situation becomes more complicated. Escaping strict saddle points has been a focused research topic in non-convex optimization, motivated by applications in machine learning. It has been established that various first-order algorithms with gradient noise or added randomness to iterates would escape strict saddle points, see e.g., [Ge-15, Levy-16, Jin-17, Jin-18, Jin-19, Guo-20] for works in this direction.

Among previous works on escaping saddle points, perhaps the closest in spirit to our current result are [Lee-16, ONeill-19, Lee-2019, leadeigen-19], where algorithms without gradient or iterate randomness are studied. It is proved in [Lee-16] that for almost every initial guess, the trajectory of the gradient descent algorithm (without any randomness) with constant stepsize would not converge to a strict saddle point. The result has been extended in [Lee-2019] to a broader class of deterministic first-order algorithms, including coordinate gradient descent with cyclic choice of coordinates. The global convergence result for cyclic coordinate gradient descent is also proved in [leadeigen-19] under slightly more relaxed conditions. A similar convergence result is also obtained for the heavy-ball method in [ONeill-19]. Let us emphasize that in the case of coordinate algorithms, it is not merely a technical question whether the algorithm can escape the strict saddle points without randomly perturbing gradients or iterates. In fact, one simply cannot employ such random perturbations, e.g., adding a random Gaussian vector to the iterate, since doing so would destroy the coordinate nature of the algorithm.

The analysis in the works [Lee-16, Lee-2019, ONeill-19, leadeigen-19] is based on viewing the algorithm as a deterministic dynamical system and applying the center-stable manifold theorem for deterministic dynamical systems [Shub-87], which characterizes the local behavior of nonlinear dynamical systems near a stationary point. Such a framework obviously does not work for randomized algorithms. To some extent, our analysis can be understood as a natural generalization to the framework of random dynamical systems, which allows us to analyze the long time behavior of randomized algorithms, in particular coordinate gradient descent with random coordinate selection.

Let us mention that various stable, unstable, and center manifold theorems have been established in the literature of random dynamical systems, see e.g., [Arnold_RDS, Ruelle-1979, Ruelle-82, Boxler-89-center, Liu-06]. These sample-dependent random manifolds also characterize the local behavior of random dynamical systems. However, as far as we can tell, one cannot simply apply such “off-the-shelf results” for the analysis of Algorithm 1. Instead, for the study of the algorithm, we have to carry out a quantitative finite block analysis of the random dynamical system near the stationary points. Our proof technique is inspired by the stability analysis of Lyapunov exponents of random dynamical systems, as in [Ledrappier-91, Froyland-15].

1.2. Organization

The rest of this paper is organized as follows. In Section 2, we review preliminaries of random dynamical systems for the convenience of the reader. Our main result is stated in Section 3. The proofs can be found in Section 4.

2. Preliminaries of random dynamical systems

In this section, we recall basic notions and results of random dynamical systems; for more details, we refer the reader to standard references such as [Arnold_RDS]. After introducing the preliminaries in this section, we will define the random dynamical system associated with Algorithm 1 in Section 3.1. Let (Ω,,)(\Omega,\mathcal{F},\mathbb{P}) be a probability space and let 𝕋\mathbb{T} be a semigroup with (𝕋)\mathcal{B}(\mathbb{T}) being its Borel σ\sigma-algebra. 𝕋\mathbb{T} serves as the notion of time. In the setting of Algorithm 1, we have 𝕋=\mathbb{T}=\mathbb{N}, corresponding to the one-sided discrete time setting. Other possible examples of 𝕋\mathbb{T} include 𝕋=\mathbb{T}=\mathbb{Z}, 𝕋=0\mathbb{T}=\mathbb{R}_{\geq 0}, and 𝕋=\mathbb{T}=\mathbb{R}, with the assumption that 0𝕋0\in\mathbb{T}.

Let us first define a random dynamical system. As we have mentioned in the introduction, the dynamics starting from x0x_{0} is determined once a sample ωΩ\omega\in\Omega is fixed. From the viewpoint of random dynamical systems, specifying the dynamics of xx is equivalent to specifying the dynamics of ω\omega: suppose that at time 0 the dynamics corresponds to ω\omega; then, to prescribe the future dynamics starting from time tt, we can specify the corresponding θ(t)ωΩ\theta(t)\omega\in\Omega for some map θ(t):ΩΩ\theta(t):\Omega\to\Omega. More precisely, we have the following definition of dynamics on Ω\Omega.

Definition 2.1 (Metric dynamical system).

A metric dynamical system on a probability space (Ω,,)(\Omega,\mathcal{F},\mathbb{P}) is a family of maps {θ(t):ΩΩ}t𝕋\{\theta(t):\Omega\rightarrow\Omega\}_{t\in\mathbb{T}} satisfying that

  • (i)

    The mapping 𝕋×ΩΩ,(t,ω)θ(t)ω\mathbb{T}\times\Omega\rightarrow\Omega,\ (t,\omega)\mapsto\theta(t)\omega is measurable;

  • (ii)

    It holds that θ(0)=IdΩ\theta(0)=\mathrm{Id}_{\Omega} and θ(t+s)=θ(t)θ(s),s,t𝕋\theta(t+s)=\theta(t)\circ\theta(s),\ \forall\ s,t\in\mathbb{T};

  • (iii)

    θ(t)\theta(t) is \mathbb{P}-preserving for any t𝕋t\in\mathbb{T}, where we say a map θ:ΩΩ\theta:\Omega\rightarrow\Omega is \mathbb{P}-preserving if

    (θ1B)=(B),B.\mathbb{P}(\theta^{-1}B)=\mathbb{P}(B),\quad\forall\ B\in\mathcal{F}.

The random dynamical system can then be defined as follows.

Definition 2.2 (Random dynamical system).

Let (X,X)(X,\mathcal{F}_{X}) be a measurable space and let {θ(t):ΩΩ}t𝕋\{\theta(t):\Omega\rightarrow\Omega\}_{t\in\mathbb{T}} be a metric dynamical system on (Ω,,)(\Omega,\mathcal{F},\mathbb{P}). Then a random dynamical system on (X,X)(X,\mathcal{F}_{X}) over {θ(t)}t𝕋\{\theta(t)\}_{t\in\mathbb{T}} is a measurable map

φ:𝕋×Ω×XX,(t,ω,x)φ(t,ω,x),\begin{split}\varphi:\mathbb{T}\times\Omega\times X&\rightarrow\quad X,\\ (t,\omega,x)\ \ &\mapsto\varphi(t,\omega,x),\end{split}

satisfying the following cocycle property: for any ωΩ\omega\in\Omega, xXx\in X, and s,t𝕋s,t\in\mathbb{T}, it holds that

φ(0,ω,x)=x,\varphi(0,\omega,x)=x,

and that

(2.1) φ(t+s,ω,x)=φ(t,θ(s)ω,φ(s,ω,x)).\varphi(t+s,\omega,x)=\varphi(t,\theta(s)\omega,\varphi(s,\omega,x)).

The cocycle property (2.1) is a key property of a random dynamical system: after time ss, if we restart the system at xsx_{s}, the future dynamics corresponds to the sample θ(s)ω\theta(s)\omega. Note that φ(t,ω,)\varphi(t,\omega,\cdot) is a map on XX; with a slight abuse of notation, we also use φ(t,ω)\varphi(t,\omega) to denote this map on XX and write φ(t,ω)x=φ(t,ω,x)\varphi(t,\omega)x=\varphi(t,\omega,x). Then the cocycle property (2.1) can be written as

φ(t+s,ω)=φ(t,θ(s)ω)φ(s,ω).\varphi(t+s,\omega)=\varphi(t,\theta(s)\omega)\circ\varphi(s,\omega).

In this work, we will focus on the one-sided discrete time 𝕋=\mathbb{T}=\mathbb{N} and θ(t)=θt\theta(t)=\theta^{t}, where θ\theta is \mathbb{P}-preserving and θt\theta^{t} is the tt-fold composition of θ\theta. Suppose that X=dX=\mathbb{R}^{d} and A:ΩGL(d,)A:\Omega\rightarrow\text{GL}(d,\mathbb{R}) is measurable. Consider a linear random dynamical system defined as (we use Φ\Phi for the linear system, while reserving φ\varphi for nonlinear dynamics considered later)

Φ(t,ω)=A(θt1ω)A(θω)A(ω),\Phi(t,\omega)=A(\theta^{t-1}\omega)\cdots A(\theta\omega)A(\omega),

where the right-hand side is the product of a sequence of random matrices. In this setting, the behavior of the linear system xt=Φ(t,ω)x0x_{t}=\Phi(t,\omega)x_{0} is well understood by the celebrated multiplicative ergodic theorem, also known as the Oseledets theorem, which we recall in Theorem 2.3. Results of this type were first established by V.I. Oseledets [oseledets] and further developed in many works such as [raghunathan, Ruelle-1979, Walters-93].
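As a purely numerical aside before stating the theorem, the Lyapunov exponents it provides for such products of random matrices can be estimated by the standard QR re-orthonormalization procedure; the following minimal Python sketch assumes a user-supplied sampler sample_A that draws one realization of the random matrix A(ω).

    import numpy as np

    def lyapunov_spectrum(sample_A, d, n_steps=20000, rng=None):
        # Estimate the Lyapunov exponents of Phi(t, omega) = A(theta^{t-1} omega) ... A(omega)
        # by pushing an orthonormal frame through the cocycle and accumulating log |diag(R)|
        # from repeated QR factorizations.
        rng = np.random.default_rng() if rng is None else rng
        Q = np.eye(d)
        log_r = np.zeros(d)
        for _ in range(n_steps):
            Z = sample_A(rng) @ Q
            Q, R = np.linalg.qr(Z)
            log_r += np.log(np.abs(np.diag(R)))
        return np.sort(log_r / n_steps)[::-1]  # approximate exponents, repeated with multiplicity

For example, taking sample_A to return I - alpha * e_i e_i^T H for a uniformly random coordinate i and stepsize alpha approximates the exponents of the linearized coordinate descent system studied in Section 3.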

Theorem 2.3 (Multiplicative ergodic theorem, [Arnold_RDS, Theorem 3.4.1]).

Suppose that

(logA())+,(logA()1)+L1(Ω,,),\left(\log\left\|A(\cdot)\right\|\right)_{+},\ \left(\log\left\|A(\cdot)^{-1}\right\|\right)_{+}\in L^{1}(\Omega,\mathcal{F},\mathbb{P}),

where we have used the short-hand a+:=max{a,0}a_{+}:=\max\{a,0\}. Then there exists a θ\theta-invariant Ω~\widetilde{\Omega}\in\mathcal{F} with (Ω~)=1\mathbb{P}(\widetilde{\Omega})=1, such that the following hold for any ωΩ~\omega\in\widetilde{\Omega}:

  • (i)

    It holds that the limit

    (2.2) Λ(ω)=limt(Φ(t,ω)Φ(t,ω))1/2t,\Lambda(\omega)=\lim_{t\rightarrow\infty}\left(\Phi(t,\omega)^{\top}\Phi(t,\omega)\right)^{1/2t},

    exists and is a positive definite matrix. Here Φ(t,ω)\Phi(t,\omega)^{\top} denotes the transpose of the matrix (as Φ(t,ω)\Phi(t,\omega) is a linear map on XX).

  • (ii)

    Suppose Λ(ω)\Lambda(\omega) has p(ω)p(\omega) distinct eigenvalues, which are ordered as eλ1(ω)>eλ2(ω)>>eλp(ω)(ω)e^{\lambda_{1}(\omega)}>e^{\lambda_{2}(\omega)}>\cdots>e^{\lambda_{p(\omega)}(\omega)}. Denote Vi(ω)V_{i}(\omega) the corresponding eigenspace, with dimension di(ω)d_{i}(\omega), for i=1,2,,p(ω)i=1,2,\dots,p(\omega). Then the functions p()p(\cdot), λi()\lambda_{i}(\cdot), and di()d_{i}(\cdot), i=1,2,,p()i=1,2,\dots,p(\cdot), are all measurable and θ\theta-invariant on Ω~\widetilde{\Omega}.

  • (iii)

    Set Wi(ω)=jiVj(ω),i=1,2,,p(ω)W_{i}(\omega)=\bigoplus_{j\geq i}V_{j}(\omega),\ i=1,2,\dots,p(\omega), and Wp(ω)+1(ω)={0}W_{p(\omega)+1}(\omega)=\{0\}. Then it holds that

    (2.3) limt1tlogΦ(t,ω)x=λi(ω),xWi(ω)\Wi+1(ω),\lim_{t\rightarrow\infty}\frac{1}{t}\log\left\|\Phi(t,\omega)x\right\|=\lambda_{i}(\omega),\quad\forall\ x\in W_{i}(\omega)\backslash W_{i+1}(\omega),

    for i=1,2,,p(ω)i=1,2,\dots,p(\omega). The maps Vi()V_{i}(\cdot) and Wi()W_{i}(\cdot) from Ω~\widetilde{\Omega} to the Grassmannian manifold are measurable.

  • (iv)

    It holds that

    Wi(θω)=A(ω)Wi(ω).W_{i}(\theta\omega)=A(\omega)W_{i}(\omega).
  • (v)

    When (Ω,,,θ)(\Omega,\mathcal{F},\mathbb{P},\theta) is ergodic, i.e., every BB\in\mathcal{F} with θ1B=B\theta^{-1}B=B satisfies (B)=0\mathbb{P}(B)=0 or (B)=1\mathbb{P}(B)=1, the functions p()p(\cdot), λi()\lambda_{i}(\cdot), and di()d_{i}(\cdot), i=1,2,,p()i=1,2,\dots,p(\cdot), are constant on Ω~\widetilde{\Omega}.

In Theorem 2.3, λ1(ω)>λ2(ω)>>λp(ω)(ω)\lambda_{1}(\omega)>\lambda_{2}(\omega)>\cdots>\lambda_{p(\omega)}(\omega) are known as Lyapunov exponents and {0}Wp(ω)(ω)W1(ω)d\{0\}\subseteq W_{p(\omega)}(\omega)\subseteq\cdots\subseteq W_{1}(\omega)\subseteq\mathbb{R}^{d} is the Oseledets filtration. We can see from the above theorem that the Lyapunov exponents are θ\theta-invariant and that the Oseledets filtration is forward invariant under AA, in the sense of (iv).

The Lyapunov exponents describe the asymptotic growth rate of Φ(t,ω)x\left\|\Phi(t,\omega)x\right\| as tt\rightarrow\infty. More specifically, (2.3) implies that when xWi(ω)\Wi+1(ω)x\in W_{i}(\omega)\backslash W_{i+1}(\omega), for any ϵ>0\epsilon>0, there exists some T>0T>0, such that

et(λi(ω)ϵ)Φ(t,ω)xet(λi(ω)+ϵ),e^{t(\lambda_{i}(\omega)-\epsilon)}\leq\left\|\Phi(t,\omega)x\right\|\leq e^{t(\lambda_{i}(\omega)+\epsilon)},

holds for any t>Tt>T. The subspaces spanned by eigenvectors of Λ(ω)\Lambda(\omega) corresponding to Lyapunov exponents smaller than, equal to, and greater than 0 are the stable subspace, center subspace, and unstable subspace, respectively. The stable and unstable subspaces correspond to exponential convergence and exponential divergence, respectively. When starting from the center subspace, we obtain sub-exponential behavior.
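As a simple illustration (not needed in the sequel), consider the scalar case d=1 with A(\theta^{i}\omega)=a_{i} i.i.d. Then \Phi(t,\omega)=a_{t-1}\cdots a_{0} and, by the strong law of large numbers, \lambda_{1}=\lim_{t\rightarrow\infty}\frac{1}{t}\log\lvert a_{t-1}\cdots a_{0}\rvert=\mathbb{E}\log\lvert a_{0}\rvert almost surely. The origin is stable when \mathbb{E}\log\lvert a_{0}\rvert<0 and unstable when \mathbb{E}\log\lvert a_{0}\rvert>0; note that by Jensen's inequality \mathbb{E}\log\lvert a_{0}\rvert\leq\log\mathbb{E}\lvert a_{0}\rvert, so the Lyapunov exponent can be negative even when \mathbb{E}\lvert a_{0}\rvert>1.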

The multiplicative ergodic theorem also generalizes to continuous time and two-sided time. We refer interested readers to [Arnold_RDS, Theorem 3.4.1, Theorem 3.4.11] for details.

The stable, unstable, and center subspaces can be generalized to stable, unstable, and center manifolds when considering nonlinear systems, see e.g., [Arnold_RDS, Ruelle-1979, Ruelle-82, Boxler-89-center, Weigu-Sternberg, Liu-06, Lian-10, Li-13, Guo-16]. These manifolds play roles in characterizing the local behavior of nonlinear random dynamical systems similar to those played by the subspaces for linear random dynamical systems. In particular, the Hartman–Grobman theorem establishes the topological conjugacy between a nonlinear system and its linearization [Wanner-95]. There are also other conjugacy results for random dynamical systems, see e.g., [Weigu-Sternberg, Weigu-05, Weigu-08, Weigu-16].

3. Main results

3.1. Setup of the random dynamical system

Let us first specify the random dynamical system corresponding to Algorithm 1.

  • Probability space. For each tt\in\mathbb{N}, denote by (Ωt,Σt,t)(\Omega_{t},\Sigma_{t},\mathbb{P}_{t}) the usual probability space for the distribution 𝒰{1,2,,d}×𝒰[αmin,αmax]\mathcal{U}_{\{1,2,\dots,d\}}\times\mathcal{U}_{[\alpha_{\min},\alpha_{\max}]}, where 𝒰{1,2,,d}\mathcal{U}_{\{1,2,\dots,d\}} and 𝒰[αmin,αmax]\mathcal{U}_{[\alpha_{\min},\alpha_{\max}]} are the uniform distributions on the set {1,2,,d}\{1,2,\dots,d\} and the interval [αmin,αmax][\alpha_{\min},\alpha_{\max}], respectively. Let (Ω,,)(\Omega,\mathcal{F},\mathbb{P}) be the product probability space of all (Ωt,Σt,t),t(\Omega_{t},\Sigma_{t},\mathbb{P}_{t}),\ t\in\mathbb{N}. Denote by πt\pi_{t} the projection from (Ω,,)(\Omega,\mathcal{F},\mathbb{P}) onto (Ωt,Σt,t),t(\Omega_{t},\Sigma_{t},\mathbb{P}_{t}),\ t\in\mathbb{N}. Thus, a sample ωΩ\omega\in\Omega can be represented as a sequence ((i0,α0),(i1,α1),)((i_{0},\alpha_{0}),(i_{1},\alpha_{1}),\dots), where (it,αt)=πt(ω),t(i_{t},\alpha_{t})=\pi_{t}(\omega),\ t\in\mathbb{N}. Let {t}t\{\mathcal{F}_{t}\}_{t\in\mathbb{N}} be the filtration defined by

    t=σ{(B0××Bt)×(j>tΩj):BiΣi,i=0,1,,t}.\mathcal{F}_{t}=\sigma\left\{(B_{0}\times\cdots\times B_{t})\times\left(\prod_{j>t}\Omega_{j}\right):B_{i}\in\Sigma_{i},\ i=0,1,\dots,t\right\}.
  • Metric dynamical system. The metric dynamical system on Ω\Omega is constructed by the (left) shifting operator τ:ΩΩ\tau:\Omega\rightarrow\Omega defined as

    τ(ω)=τ(π0(ω),π1(ω),):=(π1(ω),π2(ω),),\tau(\omega)=\tau(\pi_{0}(\omega),\pi_{1}(\omega),\cdots):=(\pi_{1}(\omega),\pi_{2}(\omega),\cdots),

    which is clearly measurable and \mathbb{P}-preserving. The metric dynamical system is then given by θ(t)=τt\theta(t)=\tau^{t} for tt\in\mathbb{N}.

  • Random dynamical system. For any ωΩ\omega\in\Omega and tt\in\mathbb{N}, we define ϕ(ω)\phi(\omega) to be a (nonlinear) map on d\mathbb{R}^{d} as

    ϕ(ω):ddxxαeieif(x),\begin{split}\phi(\omega):\mathbb{R}^{d}&\rightarrow\quad\quad\quad\mathbb{R}^{d}\\ x\ &\mapsto x-\alpha e_{i}e_{i}^{\top}\nabla f(x),\end{split}

    where (i,α)=π0(ω)(i,\alpha)=\pi_{0}(\omega) is the first pair/element in the sequence ω\omega, and we define the map φ(t,ω)\varphi(t,\omega) via

    φ(t,ω)=ϕ(τt1ω)ϕ(τω)ϕ(ω),for t1,\varphi(t,\omega)=\phi(\tau^{t-1}\omega)\circ\cdots\circ\phi(\tau\omega)\circ\phi(\omega),\qquad\text{for }t\geq 1,

    while φ(0,ω)\varphi(0,\omega) is the identity operator. It is clear that φ(t,ω)\varphi(t,\omega) satisfies the cocycle property (2.1) and hence defines a random dynamical system on X=dX=\mathbb{R}^{d} over {τt}t\{\tau^{t}\}_{t\in\mathbb{N}}. The iterate of Algorithm 1 follows the random dynamical system as

    xt=ϕ(τt1ω)xt1==ϕ(τt1ω)ϕ(τω)ϕ(ω)x0=φ(t,ω)x0.x_{t}=\phi(\tau^{t-1}\omega)x_{t-1}=\cdots=\phi(\tau^{t-1}\omega)\circ\cdots\circ\phi(\tau\omega)\circ\phi(\omega)x_{0}=\varphi(t,\omega)x_{0}.

    It can be seen that {xt}t\{x_{t}\}_{t\in\mathbb{N}} is {t}\{\mathcal{F}_{t}\}-predictable, i.e., xtx_{t} is t1\mathcal{F}_{t-1}-measurable for any t+t\in\mathbb{N}_{+}, since xtx_{t} is determined by samples (i0,α0),(i1,α1),,(it1,αt1)(i_{0},\alpha_{0}),(i_{1},\alpha_{1}),\dots,(i_{t-1},\alpha_{t-1}).
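As a sanity check of this construction, the single-step maps can be composed explicitly and the cocycle property (2.1) verified numerically; the following minimal Python sketch (illustrative only, with an arbitrary quadratic test objective and a finite truncation of the sample sequence in place of ω) does exactly this.

    import numpy as np

    def phi_step(x, i, alpha, grad_f):
        # One-step map phi(omega): x -> x - alpha * e_i e_i^T grad f(x).
        x = x.copy()
        x[i] -= alpha * grad_f(x)[i]
        return x

    def varphi(t, omega, x, grad_f):
        # varphi(t, omega) x, where omega = [(i_0, alpha_0), (i_1, alpha_1), ...].
        for i, alpha in omega[:t]:
            x = phi_step(x, i, alpha, grad_f)
        return x

    # Cocycle check: varphi(t + s, omega) x = varphi(t, tau^s omega)(varphi(s, omega) x).
    rng = np.random.default_rng(0)
    H = np.array([[2.0, 0.7], [0.7, -1.0]])
    grad_f = lambda x: H @ x                     # gradient of the quadratic f(x) = x^T H x / 2
    omega = [(rng.integers(2), rng.uniform(0.1, 0.4)) for _ in range(10)]
    x0 = np.array([0.3, -0.2])
    s, t = 4, 6
    lhs = varphi(t + s, omega, x0, grad_f)
    rhs = varphi(t, omega[s:], varphi(s, omega, x0, grad_f), grad_f)  # omega[s:] plays the role of tau^s omega
    assert np.allclose(lhs, rhs)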

In our analysis, we will use linearization of the dynamical system φ(t,ω)\varphi(t,\omega) at a critical point xx^{*} of ff. Without loss of generality, we assume x=0x^{*}=0; otherwise we consider the system with state being xxx-x^{*}. The resulting linear system, which depends on H=2f(x)=(Hij)1i,jdH=\nabla^{2}f(x^{*})=(H_{ij})_{1\leq i,j\leq d}, is given by (here and in the sequel, we use the superscript HH to indicate dependence on the matrix)

(3.1) ΦH(t,ω)=AH(τt1ω)AH(τω)AH(ω),\Phi^{H}(t,\omega)=A^{H}(\tau^{t-1}\omega)\cdots A^{H}(\tau\omega)A^{H}(\omega),

where

(3.2) AH(ω)=IαeieiH,(i,α)=π0(ω).A^{H}(\omega)=I-\alpha e_{i}e_{i}^{\top}H,\quad(i,\alpha)=\pi_{0}(\omega).

Note that AH()A^{H}(\cdot) is bounded on Ω\Omega, and hence (logAH())+\left(\log\left\|A^{H}(\cdot)\right\|\right)_{+} is integrable. When α<1/|Hii|\alpha<1/|H_{ii}|, the matrix AH(ω)=IαeieiHA^{H}(\omega)=I-\alpha e_{i}e_{i}^{\top}H is invertible, and the inverse is given explicitly by applying the Sherman-Morrison formula:

(3.3) AH(ω)1=(IαeieiH)1=I+αeieiH1αHii.\begin{split}A^{H}(\omega)^{-1}&=\left(I-\alpha e_{i}e_{i}^{\top}H\right)^{-1}=I+\frac{\alpha e_{i}e_{i}^{\top}H}{1-\alpha H_{ii}}.\end{split}
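Indeed, (3.3) can be checked directly: since e_{i}^{\top}He_{i}=H_{ii}, one has (e_{i}e_{i}^{\top}H)(e_{i}e_{i}^{\top}H)=H_{ii}\,e_{i}e_{i}^{\top}H, and hence

\left(I-\alpha e_{i}e_{i}^{\top}H\right)\left(I+\frac{\alpha e_{i}e_{i}^{\top}H}{1-\alpha H_{ii}}\right)=I-\alpha e_{i}e_{i}^{\top}H+\frac{\alpha e_{i}e_{i}^{\top}H-\alpha^{2}H_{ii}\,e_{i}e_{i}^{\top}H}{1-\alpha H_{ii}}=I-\alpha e_{i}e_{i}^{\top}H+\alpha e_{i}e_{i}^{\top}H=I.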

In particular, we have

(3.4) AH(ω)11+αH1α|Hii|.\left\|A^{H}(\omega)^{-1}\right\|\leq 1+\frac{\alpha\left\|H\right\|}{1-\alpha|H_{ii}|}.

Thus, if we take the maximal stepsize αmax\alpha_{\max} such that αmax<1/max1id|Hii|\alpha_{\max}<1/\max_{1\leq i\leq d}|H_{ii}|, then AH()1\left\|A^{H}(\cdot)^{-1}\right\| is bounded on Ω\Omega, and as a result (logAH()1)+\left(\log\left\|A^{H}(\cdot)^{-1}\right\|\right)_{+} is also integrable. Therefore, the assumptions of Theorem 2.3 hold. The shifting operator τ\tau is ergodic on (Ω,,)(\Omega,\mathcal{F},\mathbb{P}) by Kolmogorov’s 0–1 law. Then Theorem 2.3 applies for θ=τ\theta=\tau with pH()p^{H}(\cdot), λiH()\lambda_{i}^{H}(\cdot), and diH()d_{i}^{H}(\cdot) all being a.e. constant. For any ωΩ~\omega\in\widetilde{\Omega}, where Ω~\widetilde{\Omega} is the set in Theorem 2.3 satisfying (Ω~)=1\mathbb{P}(\widetilde{\Omega})=1, we denote

(3.5) W+H(ω)=λi>0ViH(ω),andWH(ω)=λi0ViH(ω).W_{+}^{H}(\omega)=\bigoplus_{\lambda_{i}>0}V_{i}^{H}(\omega),\ \text{and}\ W_{-}^{H}(\omega)=\bigoplus_{\lambda_{i}\leq 0}V_{i}^{H}(\omega).

Then the following invariant property holds

WH(τω)=AH(ω)WH(ω).W_{-}^{H}(\tau\omega)=A^{H}(\omega)W_{-}^{H}(\omega).

Note that WH(ω)W_{-}^{H}(\omega) works as a center-stable subspace. That is, for any xWH(ω)x\in W_{-}^{H}(\omega) and any ϵ>0\epsilon>0, it holds that Φ(t,ω)xetϵ\left\|\Phi(t,\omega)x\right\|\leq e^{t\epsilon}, for sufficiently large tt, and for xWH(ω)x\notin W_{-}^{H}(\omega), Φ(t,ω)x\left\|\Phi(t,\omega)x\right\| grows exponentially as tt\rightarrow\infty with rate greater than minλi>0λiϵ\min_{\lambda_{i}>0}\lambda_{i}-\epsilon for any ϵ>0\epsilon>0.

3.2. Assumptions

In this section, we specify the assumptions on the objective function ff used in this paper. The first is a standard smoothness assumption on ff.

Assumption 1.

fC2(d)f\in C^{2}(\mathbb{R}^{d}) and the Hessian 2f\nabla^{2}f is uniformly bounded, i.e., there exists M>0M>0 such that 2f(x)M\left\|\nabla^{2}f(x)\right\|\leq M for all xdx\in\mathbb{R}^{d}.

An optimization algorithm is expected to converge, under some reasonable assumptions, to a critical point of ff where the gradient vanishes. Our aim is to further characterize the possible limits of the algorithm iterates. For this purpose, we distinguish Crits(f)\mathrm{Crit}_{s}(f), the set of all strict saddle points (including local maxima with non-degenerate Hessian) of ff:

Crits(f):={xd:f(x)=0,λmin(2f(x))<0},\mathrm{Crit}_{s}(f):=\{x\in\mathbb{R}^{d}:\nabla f(x)=0,\ \lambda_{\min}(\nabla^{2}f(x))<0\},

where we use the subscript ss to emphasize that it is strict. Due to the presence of a negative eigenvalue of the Hessian, if we were considering the gradient dynamics near the critical point, the saddle point would be an unstable equilibrium. Our first result is that this instability also occurs in the linear random dynamical system ΦH(t,ω)\Phi^{H}(t,\omega) where H=2f(x)H=\nabla^{2}f(x^{*}). In other words, the dimension of W+H(ω)W^{H}_{+}(\omega) defined in (3.5) is greater than 0. While this mainly serves as a preliminary step for our analysis of the nonlinear dynamics, the conclusion by itself might be of interest, and is stated as follows. The proof is deferred to Section 4.1.

Proposition 3.1.

Suppose that HH has a negative eigenvalue and that 0<αmin<αmax<1/max1id|Hii|0<\alpha_{\min}<\alpha_{\max}<1/\max_{1\leq i\leq d}|H_{ii}|. Then the largest Lyapunov exponent of ΦH(t,ω)\Phi^{H}(t,\omega) is positive.

Our goal is to generalize such results to the nonlinear dynamics near strict saddle points of ff, for which we require two additional assumptions, stated as follows.

Assumption 2.

For every xCrits(f)x^{*}\in\mathrm{Crit}_{s}(f), 2f(x)\nabla^{2}f(x^{*}) is non-degenerate, i.e., xx^{*} is a non-degenerate critical point of ff in the sense that any eigenvalue of 2f(x)\nabla^{2}f(x^{*}) is nonzero.

Assumption 2 is also a standard assumption, which in particular guarantees that each strict saddle point is isolated, due to the non-degenerate Hessian. For each strict saddle point, Proposition 3.1 guarantees that the corresponding unstable subspace W+H(ω)W^{H}_{+}(\omega) is non-trivial (has dimension at least 11). We would in fact require a stronger technical assumption on its structure.

Assumption 3.

For every xCrits(f)x^{*}\in\mathrm{Crit}_{s}(f), it holds that 𝒫+H(ω)ei0\mathcal{P}_{+}^{H}(\omega)e_{i}\neq 0, for every i{1,2,,d}i\in\{1,2,\dots,d\} and almost every ωΩ\omega\in\Omega, where 𝒫+H(ω)\mathcal{P}_{+}^{H}(\omega) is the orthogonal projection onto W+H(ω)W_{+}^{H}(\omega) with H=2f(x)H=\nabla^{2}f(x^{*}).

We expect that Assumption 3 holds generically. We also show in Appendix A that Assumption 3 can be verified when HH has no zero off-diagonal elements (and W+H(ω)W^{H}_{+}(\omega) is not trivial). However, there exist cases where Assumption 3 might not hold. One example is H=2f(x)=diag(H1,H2)H=\nabla^{2}f(x^{*})=\text{diag}(H_{1},H_{2}) where H1d1×d1H_{1}\in\mathbb{R}^{d_{1}\times d_{1}} only has positive eigenvalues and H2d2×d2H_{2}\in\mathbb{R}^{d_{2}\times d_{2}} only has negative eigenvalues, which implies that WH(ω)=span{ei:1id1}W_{-}^{H}(\omega)=\text{span}\{e_{i}:1\leq i\leq d_{1}\} and W+H(ω)=span{ei:d1+1id}W_{+}^{H}(\omega)=\text{span}\{e_{i}:d_{1}+1\leq i\leq d\}; in this case, the projection of eie_{i} onto W+H(ω)W_{+}^{H}(\omega) vanishes for 1id11\leq i\leq d_{1}.
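This failure mode can also be observed numerically; the following Python sketch (illustrative only, with an arbitrary diagonal choice of the two blocks) builds a long random product of the single-step maps I - alpha * e_i e_i^T H and checks that the projection of e_1 onto the span of its leading right singular vectors is numerically zero.

    import numpy as np

    rng = np.random.default_rng(1)
    H = np.diag([1.0, 0.5, -0.8, -1.2])           # H_1 > 0 on the first two coordinates, H_2 < 0 on the last two
    d, d_plus = 4, 2                              # for this H, the unstable subspace has dimension 2
    alpha_min, alpha_max = 0.1, 0.5               # satisfies alpha_max < 1 / max_i |H_ii|

    Phi = np.eye(d)
    for _ in range(200):                          # random product of single-step maps I - alpha * e_i e_i^T H
        i = rng.integers(d)
        alpha = rng.uniform(alpha_min, alpha_max)
        Phi = (np.eye(d) - alpha * np.outer(np.eye(d)[i], H[i])) @ Phi

    _, _, Vt = np.linalg.svd(Phi)                 # right singular vectors, singular values in decreasing order
    P_plus = Vt[:d_plus].T @ Vt[:d_plus]          # projection onto the span of the two leading right singular vectors
    print(np.linalg.norm(P_plus @ np.eye(d)[0]))  # ~ 0: e_1 has no unstable component, so Assumption 3 fails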

We also remark that Assumption 1 is essential in our framework, since the linearized system is defined using the Hessian matrix. Analysis of randomized coordinate method for non-smooth optimization problems requires new techniques and deserves future research.

3.3. Main results

Given an initial guess x0x_{0}, the behavior of the algorithm, in particular, the limit of xtx_{t}, depends on the particular sample ωΩ\omega\in\Omega. For any xCrits(f)x^{*}\in\mathrm{Crit}_{s}(f), we denote the set of all ω\omega such that the algorithm starting at x0x_{0} would converge to xx^{*}:

Ωs(x,x0):={ωΩ:limtxt=limtφ(t,ω)x0=x}.\Omega_{s}(x^{*},x_{0}):=\left\{\omega\in\Omega:\lim_{t\rightarrow\infty}x_{t}=\lim_{t\rightarrow\infty}\varphi(t,\omega)x_{0}=x^{*}\right\}.

We further define the set Ωs(Crits(f),x0)\Omega_{s}(\mathrm{Crit}_{s}(f),x_{0}) as the union of all Ωs(x,x0)\Omega_{s}(x^{*},x_{0}) over xCrits(f)x^{*}\in\mathrm{Crit}_{s}(f),

Ωs(Crits(f),x0):=xCrits(f)Ωs(x,x0).\Omega_{s}(\mathrm{Crit}_{s}(f),x_{0}):=\bigcup_{x^{*}\in\mathrm{Crit}_{s}(f)}\Omega_{s}(x^{*},x_{0}).

Thus, if ωΩs(Crits(f),x0)\omega\notin\Omega_{s}(\mathrm{Crit}_{s}(f),x_{0}), the limit limtxt\lim_{t\to\infty}x_{t}, if it exists, will not be one of the strict saddle points. Our main result in this paper proves that this set is of measure zero, i.e., for any initial guess x0x_{0} that is not a strict saddle point, with probability 11, Algorithm 1 will not converge to a strict saddle point.

Theorem 1.

Suppose that Assumptions 1, 2, and 3 hold and that 0<αmin<αmax<1/M0<\alpha_{\min}<\alpha_{\max}<1/M, then for any x0d\Crits(f)x_{0}\in\mathbb{R}^{d}\backslash\mathrm{Crit}_{s}(f), it holds that

(Ωs(Crits(f),x0))=0.\mathbb{P}(\Omega_{s}(\mathrm{Crit}_{s}(f),x_{0}))=0.

The intuition behind the proof of Theorem 1 is to compare the nonlinear dynamics around a strict saddle point xCrits(f)x^{*}\in\mathrm{Crit}_{s}(f) with its linearization, as the linear dynamics has a non-trivial unstable subspace, thanks to Proposition 3.1. Ideally, one would hope that the nonlinear dynamics closely follows the linear dynamics, and thus eventually leaves the neighborhood of xx^{*}; the obstacle to such an argument, however, is that the linear approximation is only valid over a finite time interval. Therefore, to establish the instability of the nonlinear dynamics, we need a much more refined and quantitative argument using the instability of the linear system. In particular, we need to show that over a finite interval, with high probability, the linear system, and hence the nonlinear system, drives xtx_{t} away from the strict saddle point with quantitative bounds; see Theorem 4.4 in Section 4.2. Theorem 1 then follows from an argument in a similar spirit to the law of large numbers; see Section 4.3.

Remark 3.2.

The technical Assumption 3 and the randomness in stepsizes are made so that the iterate xt=xt1αt1eit1eit1f(xt1)x_{t}=x_{t-1}-\alpha_{t-1}e_{i_{t-1}}e_{i_{t-1}}^{\top}\nabla f(x_{t-1}) would obtain some non-trivial component in the unstable subspace, which would be further amplified within a sufficiently long but finite time interval. When 𝒫+H(τtω)eit1\left\|\mathcal{P}_{+}^{H}(\tau^{t}\omega)e_{i_{t-1}}\right\| and |eit1f(xt1)|\lvert e_{i_{t-1}}^{\top}\nabla f(x_{t-1})\rvert are fixed and relatively large, a random αt1\alpha_{t-1} would keep 𝒫+H(τtω)xt\left\|\mathcal{P}_{+}^{H}(\tau^{t}\omega)x_{t}\right\| away from 0 with high probability; see Section 4.2 for more details. It is an interesting open question whether it is possible to establish similar results without such assumptions. Our conjecture is that (Ωs(Crits(f),x0))=0\mathbb{P}(\Omega_{s}(\mathrm{Crit}_{s}(f),x_{0}))=0 still holds for x0d\Crits(f)x_{0}\in\mathbb{R}^{d}\backslash\mathrm{Crit}_{s}(f) unless x0x_{0} is located in a set with Lebesgue measure zero, similar to the results established in [Lee-2019].

As an application of our main result Theorem 1, we can obtain the global convergence to stationary points with no negative Hessian eigenvalues for Algorithm 1. More specifically, denote by

Crit(f):={xd:f(x)=0},\mathrm{Crit}(f):=\{x\in\mathbb{R}^{d}:\nabla f(x)=0\},

the set of all critical points of ff. We then have the following corollary, which will also be proved in Section 4.3.

Corollary 3.3.

Under the same assumptions as in Theorem 1, assume further that every xCrit(f)x^{*}\in\mathrm{Crit}(f) is an isolated critical point. Then for any x0d\Crits(f)x_{0}\in\mathbb{R}^{d}\backslash\mathrm{Crit}_{s}(f) with bounded level set L(x0)={xd:f(x)f(x0)}L(x_{0})=\{x\in\mathbb{R}^{d}:f(x)\leq f(x_{0})\}, with probability 11, {xt}t\{x_{t}\}_{t\in\mathbb{N}} is convergent with limit in Crit(f)\Crits(f)\mathrm{Crit}(f)\backslash\mathrm{Crit}_{s}(f).

Remark 3.4.

In Corollary 3.3, if we further assume that all saddle points of ff are strict, then the algorithm iterate converges to a local minimum with probability 11. Let us also mention that for many non-convex problems, saddle points are suboptimal while there do not exist “bad” local minima, e.g. phase retrieval [sun2018geometric], deep learning [kawaguchi2016deep, lu2017depth], and low-rank matrix problems [ge2017no]. For these problems, convergence to local minima suffices to guarantee good performance.

Remark 3.5.

In our setting, without adding noise to the gradient or the iterate, we cannot hope for good convergence rates for an arbitrary initial iterate. In fact, as shown in [Du-17], the convergence of the deterministic gradient descent algorithm to a local minimum might take an exponentially long time; we expect similar behavior for the randomized coordinate gradient descent Algorithm 1. Let us also remark that while we need the random stepsize as discussed in Remark 3.2, the interval [αmin,αmax][\alpha_{\min},\alpha_{\max}] could be made arbitrarily small; the result holds as long as 0<αmin<αmax<1/M0<\alpha_{\min}<\alpha_{\max}<1/M.

4. Proofs

We collect all proofs in this section.

4.1. Analysis of the linearized system

We will first study the linear dynamical system, for which we assume the objective function is given by

(4.1) fH(x)=12xHx,f^{H}(x)=\frac{1}{2}x^{\top}Hx,

where HH is a symmetric matrix in d×d\mathbb{R}^{d\times d} with at least one negative eigenvalue. In this case, the coordinate descent algorithm is given by

xt+1=(IαteiteitH)xt,x_{t+1}=\left(I-\alpha_{t}e_{i_{t}}e_{i_{t}}^{\top}H\right)x_{t},

which corresponds to the linear dynamical system ΦH(t,ω)\Phi^{H}(t,\omega) with single step map AH(ω)A^{H}(\omega), defined in (3.1) and (3.2) respectively.
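Before turning to the analysis, the instability asserted by Proposition 3.1 is easy to observe numerically; the following Python sketch (illustrative only, with an arbitrary indefinite H and stepsize range) iterates the linear dynamics with renormalization and reports the empirical growth rate of log of the iterate norm, which approximates the largest Lyapunov exponent.

    import numpy as np

    rng = np.random.default_rng(2)
    H = np.array([[1.0, 0.3], [0.3, -0.5]])        # symmetric with one negative eigenvalue
    alpha_min, alpha_max = 0.2, 0.8                # alpha_max < 1 / max_i |H_ii| = 1
    x = np.array([1.0, 1.0])
    log_growth = 0.0

    T = 5000
    for _ in range(T):
        i = rng.integers(2)
        alpha = rng.uniform(alpha_min, alpha_max)
        x = x - alpha * np.eye(2)[i] * (H @ x)[i]  # x_{t+1} = (I - alpha_t e_{i_t} e_{i_t}^T H) x_t
        n = np.linalg.norm(x)
        log_growth += np.log(n)
        x = x / n                                  # renormalize to avoid overflow; growth is tracked in log_growth
    print(log_growth / T)                          # positive: the iterate escapes the saddle exponentially fast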

Our main goal in this subsection is to prove Proposition 3.1 for this linear dynamical system, which states that at least one Lyapunov exponent of ΦH(t,ω)\Phi^{H}(t,\omega) is positive. It suffices to show that there exists some x0x_{0} such that xt\left\|x_{t}\right\| grows exponentially to infinity, which will follow from an energy argument similar to the proof of [Lee-2019, Proposition 5]. Although we consider a randomized coordinate gradient descent algorithm instead of a cyclic one, one step in the proof of Proposition 4.1, namely Lemma 4.3, closely follows the proof in [Lee-2019, Appendix A]. We start from x0x_{0} with fH(x0)<0f^{H}(x_{0})<0 and consider a finite time interval of length mdm\geq d. Proposition 4.1 establishes a quantitative decay estimate for fH(xt+m)f^{H}(x_{t+m}) compared with fH(xt)f^{H}(x_{t}), which leads to our desired result, Proposition 3.1.

Proposition 4.1.

Let mdm\geq d be fixed. For the objective function (4.1) with λmin(H)<0\lambda_{\min}(H)<0, suppose that 0<αmin<αmax<1/max1id|Hii|0<\alpha_{\min}<\alpha_{\max}<1/\max_{1\leq i\leq d}|H_{ii}|. Then there exists c(0,1)c\in(0,1) depending on mm, HH, αmin\alpha_{\min}, and αmax\alpha_{\max}, such that

fH(xt+m)fH(xt)cfH(xt),f^{H}(x_{t+m})-f^{H}(x_{t})\leq c\,f^{H}(x_{t}),

holds as long as fH(xt)<0f^{H}(x_{t})<0 and {1,2,,d}={it,it+1,,it+m1}\{1,2,\dots,d\}=\{i_{t},i_{t+1},\dots,i_{t+m-1}\} (in the sense of sets).

Remark 4.2.

The condition {1,2,,d}={it,it+1,,it+m1}\{1,2,\dots,d\}=\{i_{t},i_{t+1},\dots,i_{t+m-1}\} above is known as “generalized Gauss-Seidel rule” in the literature of coordinate methods [Tseng-09, Wright-12].

Proof.

Without loss of generality, we assume that t=0t=0. Due to the choice of αmax\alpha_{\max}, we have the following simple non-increasing property for any tt^{\prime}\in\mathbb{N}:

(4.2) fH(xt+1)=12xt+1Hxt+1=12xt(IαteiteitH)H(IαteiteitH)xt=fH(xt)αt(eitHxt)2+12αt2eitHeit(eitHxt)2fH(xt)αt2(eitHxt)2.\begin{split}f^{H}(x_{t^{\prime}+1})&=\frac{1}{2}x_{t^{\prime}+1}^{\top}Hx_{t^{\prime}+1}\\ &=\frac{1}{2}x_{t^{\prime}}^{\top}\left(I-\alpha_{t^{\prime}}e_{i_{t^{\prime}}}e_{i_{t^{\prime}}}^{\top}H\right)^{\top}H\left(I-\alpha_{t^{\prime}}e_{i_{t^{\prime}}}e_{i_{t^{\prime}}}^{\top}H\right)x_{t^{\prime}}\\ &=f^{H}(x_{t^{\prime}})-\alpha_{t^{\prime}}\left(e_{i_{t^{\prime}}}^{\top}Hx_{t^{\prime}}\right)^{2}+\frac{1}{2}\alpha_{t^{\prime}}^{2}e_{i_{t^{\prime}}}^{\top}He_{i_{t^{\prime}}}\left(e_{i_{t^{\prime}}}^{\top}Hx_{t^{\prime}}\right)^{2}\\ &\leq f^{H}(x_{t^{\prime}})-\frac{\alpha_{t^{\prime}}}{2}\left(e_{i_{t^{\prime}}}^{\top}Hx_{t^{\prime}}\right)^{2}.\end{split}

Write x0=y+y0x_{0}=y^{*}+y_{0} with yker(H)y^{*}\in\ker(H) and y0ran(H)y_{0}\in\operatorname{ran}(H). Let

(4.3) yt+1=ytαteiteitHyt,t=0,1,,m1.y_{t^{\prime}+1}=y_{t^{\prime}}-\alpha_{t^{\prime}}e_{i_{t^{\prime}}}e_{i_{t^{\prime}}}^{\top}Hy_{t^{\prime}},\qquad t^{\prime}=0,1,\dots,m-1.

Then xt=y+ytx_{t^{\prime}}=y^{*}+y_{t^{\prime}} holds for any t=0,1,,mt^{\prime}=0,1,\dots,m. Using (4.2), to give an upper bound for fH(xm)fH(x0)f^{H}(x_{m})-f^{H}(x_{0}), we would like a nontrivial lower bound for αt(eitHxt)2=αt(eitHyt)2\alpha_{t^{\prime}}\bigl{(}e_{i_{t^{\prime}}}^{\top}Hx_{t^{\prime}}\bigr{)}^{2}=\alpha_{t^{\prime}}\bigl{(}e_{i_{t^{\prime}}}^{\top}Hy_{t^{\prime}}\bigr{)}^{2} for some t{0,1,,m1}t^{\prime}\in\{0,1,\dots,m-1\}, which is guaranteed by Lemma 4.3, whose proof will be postponed.

Lemma 4.3.

Suppose that {1,2,,d}={i0,i1,,im1}\{1,2,\dots,d\}=\{i_{0},i_{1},\dots,i_{m-1}\}. For any

(4.4) 0<δmin{12m,αminσmin(H)2m(mαmaxσmax(H)+1)},0<\delta\leq\min\left\{\frac{1}{2m},\frac{\alpha_{\min}\sigma_{\min}(H)}{2\sqrt{m}(m\alpha_{\max}\sigma_{\max}(H)+1)}\right\},

where σmin(H)\sigma_{\min}(H) and σmax(H)\sigma_{\max}(H) are the smallest and largest positive singular values of HH respectively, if 0y0ran(H)0\neq y_{0}\in\operatorname{ran}(H), then there exists T{0,1,,m1}T\in\{0,1,\dots,m-1\} such that αT|eiTHyT|δyT\alpha_{T}\left|e_{i_{T}}^{\top}Hy_{T}\right|\geq\delta\left\|y_{T}\right\|, where the sequence yty_{t} is given as in (4.3).

Assuming the lemma, there exists T{0,1,,m1}T\in\{0,1,\dots,m-1\} such that

αT|eiTHyT|δyT,\alpha_{T}\left|e_{i_{T}}^{\top}Hy_{T}\right|\geq\delta\left\|y_{T}\right\|,

with a fixed δ>0\delta>0 satisfying (4.4). We can further constrain that δ2αminσmax(H)<1\frac{\delta^{2}}{\alpha_{\min}\sigma_{\max}(H)}<1. Thus, we have

fH(xm)fH(xT+1)fH(xT)αT2(eiTHxT)2fH(xT)δ22αTyT2,\begin{split}f^{H}(x_{m})&\leq f^{H}(x_{T+1})\leq f^{H}(x_{T})-\frac{\alpha_{T}}{2}\left(e_{i_{T}}^{\top}Hx_{T}\right)^{2}\leq f^{H}(x_{T})-\frac{\delta^{2}}{2\alpha_{T}}\left\|y_{T}\right\|^{2},\end{split}

which combined with σmax(H)yT2yTHyT=xTHxT=2fH(xT)\sigma_{\max}(H)\left\|y_{T}\right\|^{2}\geq-y_{T}^{\top}Hy_{T}=-x_{T}^{\top}Hx_{T}=-2f^{H}(x_{T}), yields that

fH(xm)(1+δ2αTσmax(H))fH(xT)(1+δ2αTσmax(H))fH(x0).f^{H}(x_{m})\leq\left(1+\frac{\delta^{2}}{\alpha_{T}\sigma_{\max}(H)}\right)f^{H}(x_{T})\leq\left(1+\frac{\delta^{2}}{\alpha_{T}\sigma_{\max}(H)}\right)f^{H}(x_{0}).

Set c=δ2αmaxσmax(H)c=\frac{\delta^{2}}{\alpha_{\max}\sigma_{\max}(H)} and we get that fH(xm)fH(x0)cfH(x0)f^{H}(x_{m})-f^{H}(x_{0})\leq cf^{H}(x_{0}). ∎

We finish the proof by establishing Lemma 4.3 below.

Proof of Lemma 4.3.

Suppose on the contrary that αt|eitHyt|<δyt\alpha_{t}\left|e_{i_{t}}^{\top}Hy_{t}\right|<\delta\left\|y_{t}\right\| for any t{0,1,,m1}t\in\{0,1,\dots,m-1\}. It holds that

y1y0=α0|ei0Hy0|<δy0<2δy0.\left\|y_{1}-y_{0}\right\|=\alpha_{0}\left|e_{i_{0}}^{\top}Hy_{0}\right|<\delta\left\|y_{0}\right\|<2\delta\left\|y_{0}\right\|.

We claim that

(4.5) yty0<2tδy0,\left\|y_{t}-y_{0}\right\|<2t\delta\left\|y_{0}\right\|,

for any t=1,2,,m1t=1,2,\dots,m-1. By induction, assume that yty0<2tδy0\left\|y_{t}-y_{0}\right\|<2t\delta\left\|y_{0}\right\| holds for some t{1,2,,m2}t\in\{1,2,\dots,m-2\}, then

yt+1yt=αt|eitHyt|<δyt<δ(2tδ+1)y0<2δy0,\left\|y_{t+1}-y_{t}\right\|=\alpha_{t}\left|e_{i_{t}}^{\top}Hy_{t}\right|<\delta\left\|y_{t}\right\|<\delta(2t\delta+1)\left\|y_{0}\right\|<2\delta\left\|y_{0}\right\|,

where the last inequality uses 2tδ<2mδ12t\delta<2m\delta\leq 1. It follows yt+1y0yty0+yt+1yt<2(t+1)δy0\left\|y_{t+1}-y_{0}\right\|\leq\left\|y_{t}-y_{0}\right\|+\left\|y_{t+1}-y_{t}\right\|<2(t+1)\delta\left\|y_{0}\right\|.

Using (4.5) and max1idHeiσmax(H)\max_{1\leq i\leq d}\left\|He_{i}\right\|\leq\sigma_{\max}(H), we have

αt|eitHy0|αt|eitH(yty0)|+αt|eitHyt|<αmaxσmax(H)2tδy0+2δy0<2δ(mαmaxσmax(H)+1)y0,\begin{split}\alpha_{t}\bigl{\lvert}e_{i_{t}}^{\top}Hy_{0}\bigr{\rvert}&\leq\alpha_{t}\bigl{\lvert}e_{i_{t}}^{\top}H(y_{t}-y_{0})\bigr{\rvert}+\alpha_{t}\bigl{\lvert}e_{i_{t}}^{\top}Hy_{t}\bigr{\rvert}\\ &<\alpha_{\max}\sigma_{\max}(H)\cdot 2t\delta\left\|y_{0}\right\|+2\delta\left\|y_{0}\right\|\\ &<2\delta(m\alpha_{\max}\sigma_{\max}(H)+1)\left\|y_{0}\right\|,\end{split}

for t=0,1,,m1t=0,1,\dots,m-1. Since span{eik:k=0,1,,m1}=d\text{span}\{e_{i_{k}}:k=0,1,\dots,m-1\}=\mathbb{R}^{d}, noticing that y0ranHy_{0}\in\operatorname{ran}{H}, we have

αminσmin(H)y0αminHy0(t=0m1(αt|eitHy0|)2)1/2<2δm(mαmaxσmax(H)+1)y0,\begin{split}\alpha_{\min}\sigma_{\min}(H)\left\|y_{0}\right\|&\leq\alpha_{\min}\left\|Hy_{0}\right\|\leq\left(\sum_{t=0}^{m-1}\left(\alpha_{t}\bigl{\lvert}e_{i_{t}}^{\top}Hy_{0}\bigr{\rvert}\right)^{2}\right)^{1/2}\\ &<2\delta\sqrt{m}(m\alpha_{\max}\sigma_{\max}(H)+1)\left\|y_{0}\right\|,\end{split}

which contradicts the choice of δ\delta in (4.4). ∎

We are now ready to prove Proposition 3.1 which states the existence of positive Lyapunov exponent of the linear dynamical system.

Proof of Proposition 3.1.

It suffices to show that for almost every ωΩ\omega\in\Omega there exist some x0dx_{0}\in\mathbb{R}^{d}, ϵ>0\epsilon>0, and T>0T>0, such that xt=Φ(t,ω)x0x_{t}=\Phi(t,\omega)x_{0} satisfies xteϵt\left\|x_{t}\right\|\geq e^{\epsilon t} for any t>Tt>T. Let x0x_{0} be an eigenvector corresponding to a negative eigenvalue of HH. Then it holds that fH(x0)<0f^{H}(x_{0})<0. Consider a fixed mdm\geq d. For any kk\in\mathbb{N}, set Ik=1I_{k}=1 if {1,2,,d}={ikm,ikm+1,,ikm+m1}\{1,2,\dots,d\}=\{i_{km},i_{km+1},\dots,i_{km+m-1}\} and Ik=0I_{k}=0 otherwise. We can see that the random variables, I0,I1,I2,I_{0},I_{1},I_{2},\dots, are independent and identically distributed with 𝔼I0=(I0=1)(0,1)\mathbb{E}I_{0}=\mathbb{P}(I_{0}=1)\in(0,1). By Proposition 4.1, we obtain that

fH(x(k+1)m){(1+c)fH(xkm),if Ik=1,fH(xkm),if Ik=0,f^{H}(x_{(k+1)m})\leq\begin{cases}(1+c)f^{H}(x_{km}),&\text{if }I_{k}=1,\\ f^{H}(x_{km}),&\text{if }I_{k}=0,\end{cases}

where cc is the constant from Proposition 4.1. Therefore,

λmin(H)2xkm2fH(xkm)(1+c)j=0k1IjfH(x0),\frac{\lambda_{\min}(H)}{2}\left\|x_{km}\right\|^{2}\leq f^{H}(x_{km})\leq(1+c)^{\sum_{j=0}^{k-1}I_{j}}f^{H}(x_{0}),

which implies that

(4.6) xkm(2fH(x0)λmin(H))1/2(1+c)12j=0k1Ij.\left\|x_{km}\right\|\geq\left(\frac{2f^{H}(x_{0})}{\lambda_{\min}(H)}\right)^{1/2}\cdot(1+c)^{\frac{1}{2}\sum_{j=0}^{k-1}I_{j}}.

Note that 𝔼|I0|=𝔼I0<\mathbb{E}|I_{0}|=\mathbb{E}I_{0}<\infty. The strong law of large numbers implies that for almost every ωΩ\omega\in\Omega, there exists some KK such that for all kKk\geq K

(4.7) j=0k1Ij𝔼I02k.\sum_{j=0}^{k-1}I_{j}\geq\frac{\mathbb{E}I_{0}}{2}k.

Combining (4.6) and (4.7), we arrive at

xkm(2fH(x0)λmin(H))1/2(1+c)𝔼I04mkm.\left\|x_{km}\right\|\geq\left(\frac{2f^{H}(x_{0})}{\lambda_{\min}(H)}\right)^{1/2}\cdot(1+c)^{\frac{\mathbb{E}I_{0}}{4m}\cdot km}.

Noticing that (1+c)𝔼I04m(1+c)^{\frac{\mathbb{E}I_{0}}{4m}} is greater than 11, xkm\left\|x_{km}\right\| grows exponentially in kmkm and we complete the proof. ∎

4.2. Finite block analysis

In this subsection, we study the behavior of the nonlinear dynamical system near a strict saddle point of ff, which without loss of generality can be assumed to be x=0x^{*}=0. As mentioned above, in a small neighborhood of xx^{*}, while it is not possible to control the difference between the nonlinear and linear systems for infinite time, the nonlinear system can be approximated by the linear system over a finite time horizon.

The main conclusion of this subsection is the following theorem which states that after a finite time interval with length TT, the distance of the iterate from x=0x^{*}=0 will be amplified exponentially with high probability.

Theorem 4.4.

Suppose that Assumptions 1, 2, and 3 hold and that 0<αmin<αmax<1/M0<\alpha_{\min}<\alpha_{\max}<1/M. There exists ϵ(0,1/6)\epsilon_{*}\in(0,1/6) such that for any ϵ(0,ϵ)\epsilon\in(0,\epsilon_{*}), there exists T=T(ϵ)+T_{*}=T_{*}(\epsilon)\in\mathbb{N}_{+} such that for any T+T\in\mathbb{N}_{+} with TTT\geq T_{*} and any tt\in\mathbb{N}, conditioned on t1\mathcal{F}_{t-1}, with probability at least 14ϵ1-4\epsilon, it holds for all xtVx_{t}\in V that

(4.8) xt+Texp(6ϵ16ϵ|log(1Mαmax)|T)xt,\left\|x_{t+T}\right\|\geq\exp\biggl{(}\frac{6\epsilon}{1-6\epsilon}\bigl{\lvert}\log(1-M\alpha_{\max})\bigr{\rvert}\,T\biggr{)}\left\|x_{t}\right\|,

where VV is a neighborhood of x=0x^{*}=0, depending on ϵ\epsilon, TT, and ff near xx^{*}.

The lower bound (4.8) quantifies the amplification of xt+T\left\|x_{t+T}\right\|: While we always have xt+T(1Mαmax)Txt\left\|x_{t+T}\right\|\geq(1-M\alpha_{\max})^{T}\left\|x_{t}\right\| (see (4.20) below), the result states that with probability at least 14ϵ1-4\epsilon, the amplification factor is at least the right-hand side of (4.8), which is exponentially large in TT. Hence on average xt+T\left\|x_{t+T}\right\| would be much larger than xt\left\|x_{t}\right\|. This would lead to escaping of the iterate from the neighborhood of x=0x^{*}=0.

To prove Theorem 4.4, we would require a more quantitative characterization of the behavior of its linearization at xx^{*}. In particular, we need a high probability estimate of the distance of the iterate from xx^{*} after some time interval. For this purpose, conditioned on t1\mathcal{F}_{t-1} with the iterate xtx_{t}, we will first show in Lemma 4.8 that, after some finite time, the orthogonal projection of the iterate xϱtx_{\varrho_{t}} onto the unstable subspace, where t<ϱtt+Lt<\varrho_{t}\leq t+L for some constant LL, is significant. The component in the unstable subspace would then be further amplified subsequently by ΦH(S,τϱtω)\Phi^{H}(S,\tau^{\varrho_{t}}\omega), where H=2f(x)H=\nabla^{2}f(x^{*}). Here the time duration SS would be chosen sufficiently large such that the distance from xϱt+Sx_{\varrho_{t}+S} to xx^{*} is exponentially amplified. Theorem 4.4 follows by setting T=L+ST=L+S. In the second step above, we would need to control the closeness between the linear and nonlinear systems within a time horizon with length SS.

Such a finite block analysis approach has been used to establish the stability of Lyapunov exponent of random dynamical systems [Ledrappier-91, Froyland-15], which inspired our proof technique for Theorem 4.4 and Theorem 1.

We first set the small constant ϵ\epsilon in Theorem 4.4 which controls the failure probability of the amplification bound. Let λ1>λ2>>λp\lambda_{1}>\lambda_{2}>\cdots>\lambda_{p} be the Lyapunov exponents of the linearized system at x=0x^{*}=0. We set

(4.9) λ+=minλi>0λiand0<γ<12min{min1i<p|λiλi+1|,λ+}.\lambda_{+}=\min_{\lambda_{i}>0}\lambda_{i}\quad\text{and}\quad 0<\gamma<\frac{1}{2}\min\left\{\min_{1\leq i<p}|\lambda_{i}-\lambda_{i+1}|,\lambda_{+}\right\}.

Note that γ<λ+\gamma<\lambda_{+}. Let ϵ(0,1/6)\epsilon_{*}\in(0,1/6) be sufficiently small such that

(4.10) (16ϵ)(λ+γ)+6ϵlog(1Mαmax)>0,ϵ(0,ϵ).(1-6\epsilon)(\lambda_{+}-\gamma)+6\epsilon\cdot\log(1-M\alpha_{\max})>0,\quad\forall\ \epsilon\in(0,\epsilon_{*}).

The reason for such choice will become clear later (see (4.23)). For the rest of the section, we will consider a fixed ϵ(0,ϵ)\epsilon\in(0,\epsilon_{*}).

We now state and prove several lemmas for Theorem 4.4. First, in the following Lemma 4.5, we construct a stopping time ϱt1\varrho_{t}-1 that is bounded almost surely and such that the gradient component |eiϱt1f(xϱt1)|\bigl{\lvert}e_{i_{\varrho_{t}-1}}^{\top}\nabla f(x_{\varrho_{t}-1})\bigr{\rvert} is comparable to f(xϱt1)\left\|\nabla f(x_{\varrho_{t}-1})\right\| in amplitude with high probability.

Lemma 4.5.

Let 0<μ1d0<\mu\leq\frac{1}{\sqrt{d}} be a fixed constant. There exists some constant L>0L>0, such that for any tt\in\mathbb{N}, there exists a measurable ϱt:Ω+\varrho_{t}:\Omega\rightarrow\mathbb{N}_{+} such that t<ϱtt+Lt<\varrho_{t}\leq t+L and

(4.11) (|eiϱt1f(xϱt1)|μf(xϱt1)|t1)1ϵ.\mathbb{P}\Bigl{(}\bigl{\lvert}e_{i_{\varrho_{t}-1}}^{\top}\nabla f(x_{\varrho_{t}-1})\bigr{\rvert}\geq\mu\left\|\nabla f(x_{\varrho_{t}-1})\right\|\;\Big{|}\;\mathcal{F}_{t-1}\Bigr{)}\geq 1-\epsilon.
Proof.

For any tt\in\mathbb{N}, define

0=min{+:|eit+1f(xt+1)|μf(xt+1)}.\ell_{0}=\min\Bigl{\{}\ell\in\mathbb{N}_{+}\,:\,\bigl{\lvert}e_{i_{t+\ell-1}}^{\top}\nabla f(x_{t+\ell-1})\bigr{\rvert}\geq\mu\left\|\nabla f(x_{t+\ell-1})\right\|\Bigr{\}}.

It is clear that (0>t1)(11/d)\mathbb{P}(\ell_{0}>\ell\mid\mathcal{F}_{t-1})\leq(1-1/d)^{\ell}: since μ1/d\mu\leq 1/\sqrt{d}, for any xdx\in\mathbb{R}^{d} at least one coordinate ii satisfies |eif(x)|μf(x)\lvert e_{i}^{\top}\nabla f(x)\rvert\geq\mu\left\|\nabla f(x)\right\|, and each step selects such a coordinate with probability at least 1/d1/d. Hence, there exists some L>0L>0, such that

(0Lt1)1ϵ.\mathbb{P}(\ell_{0}\leq L\mid\mathcal{F}_{t-1})\geq 1-\epsilon.

We finish the proof by setting ϱt=t+min{0,L}\varrho_{t}=t+\min\{\ell_{0},L\} which has the desired property. ∎

We now carry out the amplification part of the finite block analysis for the linearized dynamics at x=0x^{*}=0. To simplify expressions in the following, for t1<t2t_{1}<t_{2}, we introduce the short-hand notation

(i,α)t1:t21=((it1,αt1),,(it21,αt21))Ωt1××Ωt21,(i,\alpha)_{t_{1}:t_{2}-1}=\big{(}(i_{t_{1}},\alpha_{t_{1}}),\dots,(i_{t_{2}-1},\alpha_{t_{2}-1})\big{)}\in\Omega_{t_{1}}\times\cdots\times\Omega_{t_{2}-1},

and the finite time transition matrix (i.e., composition of linear maps)

ΦH((i,α)t1:t21)=(Iαt21eit21eit21H)(Iαt1eit1eit1H).\Phi^{H}\bigl{(}(i,\alpha)_{t_{1}:t_{2}-1}\bigr{)}=(I-\alpha_{t_{2}-1}e_{i_{t_{2}-1}}e_{i_{t_{2}-1}}^{\top}H)\cdots(I-\alpha_{t_{1}}e_{i_{t_{1}}}e_{i_{t_{1}}}^{\top}H).

Recall that (Ωt,Σt,t)(\Omega_{t},\Sigma_{t},\mathbb{P}_{t}) is the probability space for 𝒰{1,2,,d}×𝒰[αmin,αmax]\mathcal{U}_{\{1,2,\dots,d\}}\times\mathcal{U}_{[\alpha_{\min},\alpha_{\max}]} for tt\in\mathbb{N}. We also denote 𝒫+H((i,α)t1:t21)\mathcal{P}_{+}^{H}\bigl{(}(i,\alpha)_{t_{1}:t_{2}-1}\bigr{)} as the projection operator onto the subspace spanned by the right singular vectors of ΦH((i,α)t1:t21)\Phi^{H}\bigl{(}(i,\alpha)_{t_{1}:t_{2}-1}\bigr{)} corresponding to d+d_{+} largest singular values, where d+=λi>0did_{+}=\sum_{\lambda_{i}>0}d_{i} and did_{i} is the dimension of the ii-th eigenspace as in Theorem 2.3 (ii) for the linearized system at xx^{*}.

As we mentioned in the proof sketch, we want ΦH(S,τϱtω)=ΦH((i,α)ϱt:ϱt+S1)\Phi^{H}(S,\tau^{\varrho_{t}}\omega)=\Phi^{H}\left((i,\alpha)_{\varrho_{t}:\varrho_{t}+S-1}\right) to amplify xϱtx_{\varrho_{t}}, for which we need to establish a non-trivial lower bound for the unstable component 𝒫+H((i,α)ϱt:ϱt+S1)xϱt\left\|\mathcal{P}_{+}^{H}\left((i,\alpha)_{\varrho_{t}:\varrho_{t}+S-1}\right)x_{\varrho_{t}}\right\|. This is achieved by several lemmas. We will establish three lower bounds in the sequel:

  • 𝒫+H(τϱtω)eiϱt1\left\|\mathcal{P}_{+}^{H}\left(\tau^{\varrho_{t}}\omega\right)e_{i_{\varrho_{t}-1}}\right\| using Lemma 4.6;

  • 𝒫+H((i,α)ϱt:ϱt+S1)eiϱt1\left\|\mathcal{P}_{+}^{H}\left((i,\alpha)_{\varrho_{t}:\varrho_{t}+S-1}\right)e_{i_{\varrho_{t}-1}}\right\| in Lemma 4.7; and finally the desired

  • 𝒫+H((i,α)ϱt:ϱt+S1)xϱt\left\|\mathcal{P}_{+}^{H}\left((i,\alpha)_{\varrho_{t}:\varrho_{t}+S-1}\right)x_{\varrho_{t}}\right\| in Lemma 4.8.

Let us first control 𝒫+H(τϱtω)eiϱt1\left\|\mathcal{P}_{+}^{H}\left(\tau^{\varrho_{t}}\omega\right)e_{i_{\varrho_{t}-1}}\right\| in the following lemma, which utilizes Assumption 3. For simplicity of notation, in Lemma 4.6 and Lemma 4.7, we state the results for 𝒫+H(ω)ei\left\|\mathcal{P}_{+}^{H}(\omega)e_{i}\right\| and 𝒫+H((i,α)0:S1)ej\left\|\mathcal{P}_{+}^{H}((i,\alpha)_{0:S-1})e_{j}\right\| instead, which is slightly more general.

Lemma 4.6.

Under Assumption 3, there exist δ>0\delta>0 and measurable Ω1ϵΩ~\Omega_{1}^{\epsilon}\subset\widetilde{\Omega}, where Ω~\widetilde{\Omega} is from Theorem 2.3, such that (Ω1ϵ)1ϵ\mathbb{P}(\Omega_{1}^{\epsilon})\geq 1-\epsilon and

𝒫+H(ω)eiδ,ωΩ1ϵ,i{1,2,,d}.\left\|\mathcal{P}^{H}_{+}(\omega)e_{i}\right\|\geq\delta,\quad\forall\ \omega\in\Omega_{1}^{\epsilon},\ i\in\{1,2,\dots,d\}.
Proof.

Assumption 3 implies that

({ωΩ~:𝒫+H(ω)ei>0,i{1,2,,d}})=1\mathbb{P}\bigl{(}\{\omega\in\widetilde{\Omega}:\left\|\mathcal{P}^{H}_{+}(\omega)e_{i}\right\|>0,\quad\forall\ i\in\{1,2,\dots,d\}\}\bigr{)}=1

Notice that

{ωΩ~:𝒫+H(ω)ei>0,i{1,2,,d}}==n+{ωΩ~:𝒫+H(ω)ei1n,i{1,2,,d}}.\bigl{\{}\omega\in\widetilde{\Omega}:\left\|\mathcal{P}^{H}_{+}(\omega)e_{i}\right\|>0,\quad\forall\ i\in\{1,2,\dots,d\}\bigr{\}}=\\ =\bigcup_{n\in\mathbb{N}_{+}}\Bigl{\{}\omega\in\widetilde{\Omega}:\left\|\mathcal{P}^{H}_{+}(\omega)e_{i}\right\|\geq\frac{1}{n},\quad\forall\ i\in\{1,2,\dots,d\}\Bigr{\}}.

The Lemma follows from continuity of measure. ∎

We will then be able to handle 𝒫+H((i,α)0:S1)ej\left\|\mathcal{P}_{+}^{H}((i,\alpha)_{0:S-1})e_{j}\right\| using Lemma 4.6 and the closeness between (ΦH(S,ω)ΦH(S,ω))1/2S\left(\Phi^{H}(S,\omega)^{\top}\Phi^{H}(S,\omega)\right)^{1/2S} and Λ(ω)\Lambda(\omega), as the former converges to the latter as SS\to\infty by Theorem 2.3. More precisely, denote the singular values of Xd×dX\in\mathbb{R}^{d\times d} by s1(X)s2(X)sd(X)s_{1}(X)\geq s_{2}(X)\geq\cdots\geq s_{d}(X). Then for S+S\in\mathbb{N}_{+} sufficiently large, we have

(4.12) |1Slogsj(ΦH(S,ω))λμ(j)|=|1Slogsj(ΦH((i,α)0:S1))λμ(j)|γ,\left|\frac{1}{S}\log s_{j}\left(\Phi^{H}(S,\omega)\right)-\lambda_{\mu(j)}\right|=\left|\frac{1}{S}\log s_{j}\left(\Phi^{H}\bigl{(}(i,\alpha)_{0:S-1}\bigr{)}\right)-\lambda_{\mu(j)}\right|\leq\gamma,

for every j{1,2,,d}j\in\{1,2,\dots,d\}, where λ1>λ2>>λp\lambda_{1}>\lambda_{2}>\dots>\lambda_{p} are the Lyapunov exponents from Theorem 2.3, γ\gamma is given by (4.9), and the map μ:{1,2,,d}{1,2,,p}\mu:\{1,2,\dots,d\}\rightarrow\{1,2,\dots,p\} satisfies that μ(j)=i\mu(j)=i if and only if d1++di1<jd1++did_{1}+\cdots+d_{i-1}<j\leq d_{1}+\cdots+d_{i}, so μ\mu matches the index of the singular values with that of the Lyapunov exponents. Moreover, the convergence also implies that

𝒫+H(S,ω)𝒫+H(ω)δ2,\left\|\mathcal{P}_{+}^{H}(S,\omega)-\mathcal{P}_{+}^{H}(\omega)\right\|\leq\frac{\delta}{2},

for sufficiently large SS, which then leads to

(4.13) 𝒫+H(S,ω)ej=𝒫+H((i,α)0:S1)ejδ2,\left\|\mathcal{P}^{H}_{+}\left(S,\omega\right)e_{j}\right\|=\left\|\mathcal{P}^{H}_{+}\left((i,\alpha)_{0:S-1}\right)e_{j}\right\|\geq\frac{\delta}{2},

for every j\in\{1,2,\dots,d\}, where \mathcal{P}^{H}_{+}(S,\omega) is the projection operator onto the subspace spanned by the right singular vectors of \Phi^{H}(S,\omega) corresponding to the d_{+} largest singular values. Let

(4.14) ΩS={(i,α)0:S1Ω0××ΩS1:(4.12) and (4.13) hold}.\Omega^{S}=\bigl{\{}(i,\alpha)_{0:S-1}\in\Omega_{0}\times\cdots\times\Omega_{S-1}:\eqref{close: singular value}\text{ and }\eqref{close: singular space}\text{ hold}\bigr{\}}.

The following lemma states that \Omega^{S} has high probability for sufficiently large S, where, with a slight abuse of notation, we write \mathbb{P}(\Omega^{S})=\mathbb{P}\left(\Omega^{S}\times\left(\bigtimes_{t\geq S}\Omega_{t}\right)\right).
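
To make the finite-block quantities above concrete, the following minimal numerical sketch (in Python with NumPy; not part of the analysis) assembles \Phi^{H}((i,\alpha)_{0:S-1}) for one particular symmetric H with eigenvalues of both signs and non-zero off-diagonal entries (all choices below are assumptions of the illustration), computes the finite-time exponents \frac{1}{S}\log s_{j}, and forms \mathcal{P}^{H}_{+}(S,\omega) from the leading right singular vectors, so that the norms \left\|\mathcal{P}^{H}_{+}(S,\omega)e_{i}\right\| appearing in (4.13) can be inspected directly.

import numpy as np

rng = np.random.default_rng(0)
d, S = 4, 200                                # S kept moderate so a plain SVD can resolve all singular values
a_min, a_max = 0.05, 0.4                     # assumed stepsize range with a_max < 1/||H||

# a symmetric Hessian with eigenvalues of both signs (a strict saddle) and non-zero off-diagonal entries
Q = np.linalg.qr(rng.standard_normal((d, d)))[0]
H = Q @ np.diag([1.0, 0.6, -0.8, -1.5]) @ Q.T

coords = rng.integers(0, d, size=S)          # i_t uniform on {0, ..., d-1}
steps = rng.uniform(a_min, a_max, size=S)    # alpha_t uniform on [a_min, a_max]

Phi = np.eye(d)
for i, a in zip(coords, steps):
    A = np.eye(d)
    A[i, :] -= a * H[i, :]                   # A_t = I - alpha_t e_{i_t} e_{i_t}^T H acts only on row i_t
    Phi = A @ Phi                            # Phi^H((i, alpha)_{0:t}) = A_t ... A_1 A_0

U, s, Vt = np.linalg.svd(Phi)
exponents = np.log(s) / S                    # finite-time analogues of (1/S) log s_j(Phi^H(S, omega))
d_plus = int(np.sum(exponents > 0))
P_plus = Vt[:d_plus].T @ Vt[:d_plus]         # P_+^H(S, omega): projection onto the top-d_+ right singular vectors
print("finite-time exponents:", exponents)
print("||P_+^H(S, omega) e_i|| for each coordinate:", np.linalg.norm(P_plus, axis=0))

For much larger S one would track the product incrementally (e.g., via repeated QR factorizations) to avoid ill-conditioning; the sketch above is only meant to expose the objects entering the definition of \Omega^{S}.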

Lemma 4.7.

Under the same assumptions as in Lemma 4.6, there exists some S_{*}>0 such that for every S\in\mathbb{N}_{+} with S\geq S_{*}, it holds that \mathbb{P}(\Omega^{S})\geq 1-2\epsilon.

Proof.

For a.e.ωΩa.e.\ \omega\in\Omega, it follows from Theorem 2.3, in particular (2.2), and standard matrix perturbation analysis (see e.g., [bhatia, Theorem VI.2.1, Theorem VII.3.1]) that

(4.15) \frac{1}{S}\log s_{j}\left(\Phi^{H}(S,\omega)\right)\rightarrow\lambda_{\mu(j)},\quad S\rightarrow\infty,

for any j{1,2,,d}j\in\{1,2,\dots,d\}, and that

(4.16) 𝒫+H(S,ω)𝒫+H(ω),S.\mathcal{P}^{H}_{+}(S,\omega)\rightarrow\mathcal{P}^{H}_{+}(\omega),\quad S\rightarrow\infty.

By Egorov’s theorem, there exists Ω2ϵΩ1ϵ\Omega_{2}^{\epsilon}\subset\Omega_{1}^{\epsilon} with (Ω2ϵ)12ϵ\mathbb{P}(\Omega_{2}^{\epsilon})\geq 1-2\epsilon, such that the convergences in (4.15) and (4.16) are both uniform on Ω2ϵ\Omega_{2}^{\epsilon}. Here Ω1ϵ\Omega_{1}^{\epsilon} is as in Lemma 4.6. Therefore, for some SS_{*} sufficiently large, we have

|1Slogsj(ΦH(S,ω))λμ(j)|γ,j{1,2,,d},SS,ωΩ2ϵ,\left|\frac{1}{S}\log s_{j}\left(\Phi^{H}(S,\omega)\right)-\lambda_{\mu(j)}\right|\leq\gamma,\quad\forall\ j\in\{1,2,\dots,d\},\ S\geq S_{*},\ \omega\in\Omega_{2}^{\epsilon},

and

(4.17) 𝒫+H(S,ω)𝒫+H(ω)δ2,SS,ωΩ2ϵ.\left\|\mathcal{P}^{H}_{+}(S,\omega)-\mathcal{P}^{H}_{+}(\omega)\right\|\leq\frac{\delta}{2},\quad\forall\ S\geq S_{*},\ \omega\in\Omega_{2}^{\epsilon}.

Combining Lemma 4.6 and (4.17), we obtain that

𝒫+H(S,ω)eiδ2,i{1,2,,d},SS,ωΩ2ϵ.\left\|\mathcal{P}^{H}_{+}(S,\omega)e_{i}\right\|\geq\frac{\delta}{2},\quad\forall\ i\in\{1,2,\dots,d\},\ \forall\ S\geq S_{*},\ \omega\in\Omega_{2}^{\epsilon}.

For any SSS\geq S_{*}, by the definition of ΩS\Omega^{S}, it holds that

Ω2ϵΩS×(×tSΩt),\Omega_{2}^{\epsilon}\subset\Omega^{S}\times\Bigl{(}\bigtimes_{t\geq S}\Omega_{t}\Bigr{)},

which implies the desired estimate

\mathbb{P}(\Omega^{S})=\mathbb{P}\biggl(\Omega^{S}\times\Bigl(\bigtimes_{t\geq S}\Omega_{t}\Bigr)\biggr)\geq\mathbb{P}(\Omega_{2}^{\epsilon})\geq 1-2\epsilon. ∎

Note that αϱt1𝒰[αmin,αmax]\alpha_{\varrho_{t}-1}\sim\mathcal{U}_{[\alpha_{\min},\alpha_{\max}]} is independent of ϱt2\mathcal{F}_{\varrho_{t}-2}, iϱt1i_{\varrho_{t}-1}, and (i,α)ϱt:ϱt+S1(i,\alpha)_{\varrho_{t}:\varrho_{t}+S-1}. The next lemma shows that with high probability, the choice of αϱt1\alpha_{\varrho_{t}-1} will lead to a nontrivial orthogonal projection of xϱtx_{\varrho_{t}} onto the unstable subspace of ΦH(S,τϱtω)=ΦH((i,α)ϱt:ϱt+S1)\Phi^{H}(S,\tau^{\varrho_{t}}\omega)=\Phi^{H}\bigl{(}(i,\alpha)_{\varrho_{t}:\varrho_{t}+S-1}\bigr{)}.

Lemma 4.8.

For any S+S\in\mathbb{N}_{+}, xϱt1x_{\varrho_{t}-1}, iϱt1i_{\varrho_{t}-1}, and (i,α)ϱt:ϱt+S1ΩS(i,\alpha)_{\varrho_{t}:\varrho_{t}+S-1}\in\Omega^{S}, there exists I[αmin,αmax]I\subset[\alpha_{\min},\alpha_{\max}] with m(I)(1ϵ)(αmaxαmin)m(I)\geq(1-\epsilon)(\alpha_{\max}-\alpha_{\min}) where m()m(\cdot) is the Lebesgue measure, such that for any αϱt1I\alpha_{\varrho_{t}-1}\in I, it holds that

(4.18) 𝒫+H((i,α)ϱt:ϱt+S1)xϱtϵδ(αmaxαmin)4|eiϱt1f(xϱt1)|.\left\|\mathcal{P}^{H}_{+}\left((i,\alpha)_{\varrho_{t}:\varrho_{t}+S-1}\right)x_{\varrho_{t}}\right\|\geq\frac{\epsilon\delta(\alpha_{\max}-\alpha_{\min})}{4}\bigl{\lvert}e_{i_{\varrho_{t}-1}}^{\top}\nabla f(x_{\varrho_{t}-1})\bigr{\rvert}.
Proof.

We assume that |eiϱt1f(xϱt1)|0\bigl{\lvert}e_{i_{\varrho_{t}-1}}^{\top}\nabla f(x_{\varrho_{t}-1})\bigr{\rvert}\neq 0; otherwise the result is trivial.

For simplicity of notation, we write

𝒫+H((i,α)ϱt:ϱt+S1)xϱt\displaystyle\mathcal{P}^{H}_{+}\left((i,\alpha)_{\varrho_{t}:\varrho_{t}+S-1}\right)x_{\varrho_{t}} =𝒫+H((i,α)ϱt:ϱt+S1)xϱt1\displaystyle=\mathcal{P}^{H}_{+}\left((i,\alpha)_{\varrho_{t}:\varrho_{t}+S-1}\right)x_{\varrho_{t}-1}
αϱt1eiϱt1f(xϱt1)𝒫+H((i,α)ϱt:ϱt+S1)eiϱt1\displaystyle\qquad-\alpha_{\varrho_{t}-1}e_{i_{\varrho_{t}-1}}^{\top}\nabla f(x_{\varrho_{t}-1})\mathcal{P}^{H}_{+}\left((i,\alpha)_{\varrho_{t}:\varrho_{t}+S-1}\right)e_{i_{\varrho_{t}-1}}
=:y2αϱt1y1,\displaystyle=:y_{2}-\alpha_{\varrho_{t}-1}y_{1},

where the last line defines y1y_{1} and y2y_{2}. Using the short-hand notation

r=ϵδ(αmaxαmin)4|eiϱt1f(xϱt1)|,r=\frac{\epsilon\delta(\alpha_{\max}-\alpha_{\min})}{4}\bigl{\lvert}e_{i_{\varrho_{t}-1}}^{\top}\nabla f(x_{\varrho_{t}-1})\bigr{\rvert},

we observe that (4.18) holds if and only if \alpha_{\varrho_{t}-1}y_{1} is not located in the ball of radius r centered at y_{2}.

It follows from the definition of ΩS\Omega^{S} and (4.13) that 𝒫+H((i,α)ϱt:ϱt+S1)eiϱt1δ2\left\|\mathcal{P}^{H}_{+}\left((i,\alpha)_{\varrho_{t}:\varrho_{t}+S-1}\right)e_{i_{\varrho_{t}-1}}\right\|\geq\frac{\delta}{2}, which then leads to

y1δ2|eiϱt1f(xϱt1)|=:2rϵ(αmaxαmin).\left\|y_{1}\right\|\geq\frac{\delta}{2}\bigl{\lvert}e_{i_{\varrho_{t}-1}}^{\top}\nabla f(x_{\varrho_{t}-1})\bigr{\rvert}=:\frac{2r}{\epsilon(\alpha_{\max}-\alpha_{\min})}.

Thus, the set of \alpha_{\varrho_{t}-1} such that \alpha_{\varrho_{t}-1}y_{1}\in\mathcal{B}_{r}(y_{2}) is an interval J\subset\mathbb{R} with \|\sup(J)\cdot y_{1}-\inf(J)\cdot y_{1}\|\leq 2r, as the diameter of \mathcal{B}_{r}(y_{2}) is 2r, which implies m(J)\leq 2r/\left\|y_{1}\right\|\leq\epsilon(\alpha_{\max}-\alpha_{\min}). The lemma is then proved by setting I=[\alpha_{\min},\alpha_{\max}]\backslash J. ∎
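
As a sanity check of the measure bound above, the following sketch (with arbitrary vectors y_{1},y_{2} and a radius r chosen so that 2r/\left\|y_{1}\right\|=\epsilon(\alpha_{\max}-\alpha_{\min}); all of these are assumptions of the illustration) confirms on a fine grid that the fraction of stepsizes \alpha\in[\alpha_{\min},\alpha_{\max}] with \alpha y_{1}\in\mathcal{B}_{r}(y_{2}) does not exceed \epsilon.

import numpy as np

rng = np.random.default_rng(1)
a_min, a_max, eps = 0.05, 0.4, 0.1
y1 = rng.standard_normal(3)
y2 = rng.standard_normal(3)
r = 0.5 * eps * (a_max - a_min) * np.linalg.norm(y1)   # so that 2r/||y1|| = eps * (a_max - a_min)

alphas = np.linspace(a_min, a_max, 200001)
in_ball = np.linalg.norm(y2[None, :] - alphas[:, None] * y1[None, :], axis=1) <= r
print("fraction of excluded stepsizes:", in_ball.mean(), "<= eps =", eps)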

With Lemmas 4.5–4.8 at hand, we now prove Theorem 4.4, which relies on approximating the nonlinear dynamics by its linearization and on the amplification from the finite-block analysis of the linearized system.

Proof of Theorem 4.4.

Without loss of generality, we will assume t=0t=0 in the proof to simplify notation. Since H=2f(x)H=\nabla^{2}f(x^{*}) is non-degenerate, we can take a neighborhood UU of x=0x^{*}=0 and some fixed σ>0\sigma>0 such that

(4.19) f(x)σx,xU.\left\|\nabla f(x)\right\|\geq\sigma\left\|x\right\|,\quad\forall\ x\in U.

Assumption 1 implies that

f(x)=f(x)f(x)Mxx=Mx,xd.\left\|\nabla f(x)\right\|=\left\|\nabla f(x)-\nabla f(x^{*})\right\|\leq M\left\|x-x^{*}\right\|=M\left\|x\right\|,\quad\forall\ x\in\mathbb{R}^{d}.

Using the above inequality and αmax<1/M\alpha_{\max}<1/M, it holds for every ωΩ\omega\in\Omega and tt^{\prime}\in\mathbb{N} that

(4.20) xt+1=xtαteiteitf(xt)xtαteiteitf(xt)(1Mαmax)xt,\begin{split}\left\|x_{t^{\prime}+1}\right\|&=\left\|x_{t^{\prime}}-\alpha_{t^{\prime}}e_{i_{t^{\prime}}}e_{i_{t^{\prime}}}^{\top}\nabla f(x_{t^{\prime}})\right\|\\ &\geq\left\|x_{t^{\prime}}\right\|-\alpha_{t^{\prime}}\left\|e_{i_{t^{\prime}}}e_{i_{t^{\prime}}}^{\top}\right\|\cdot\left\|\nabla f(x_{t^{\prime}})\right\|\\ &\geq(1-M\alpha_{\max})\left\|x_{t^{\prime}}\right\|,\end{split}

and similarly that

xt+1(1+Mαmax)xt.\left\|x_{t^{\prime}+1}\right\|\leq(1+M\alpha_{\max})\left\|x_{t^{\prime}}\right\|.

We thus define

r:=1Mαmaxandr+:=1+Mαmax,r_{-}:=1-M\alpha_{\max}\quad\text{and}\quad r_{+}:=1+M\alpha_{\max},

so that

(4.21) rxtxt+1r+xt.r_{-}\left\|x_{t^{\prime}}\right\|\leq\left\|x_{t^{\prime}+1}\right\|\leq r_{+}\left\|x_{t^{\prime}}\right\|.

We now choose the time duration SS large enough in the finite block analysis to guarantee significant amplification. More specifically, we choose SS so large that SSS\geq S_{*} (SS_{*} defined in Lemma 4.7) and that the following two inequalities hold:

(4.22) exp(S(λ+γ))ϵδμσ(r)L1(αmaxαmin)8(r+)L,\exp(S(\lambda_{+}-\gamma))\cdot\frac{\epsilon\delta\mu\sigma(r_{-})^{L-1}(\alpha_{\max}-\alpha_{\min})}{8}\geq(r_{+})^{L},

and

(4.23) (16ϵ)(S(λ+γ)+logϵδμσ(r)2(L1)(αmaxαmin)8)+6ϵ(L+S)logr>0,(1-6\epsilon)\left(S(\lambda_{+}-\gamma)+\log\frac{\epsilon\delta\mu\sigma(r_{-})^{2(L-1)}(\alpha_{\max}-\alpha_{\min})}{8}\right)+6\epsilon(L+S)\log r_{-}>0,

where L is the upper bound defined in Lemma 4.5, \mu\leq 1/\sqrt{d} is the fixed constant from Lemma 4.5, \delta is from Lemma 4.6, and \sigma is set in (4.19). Thanks to (4.10) for our choice of \epsilon, and since \gamma<\lambda_{+} by (4.9), both (4.22) and (4.23) are satisfied for all sufficiently large S.

Next, we show that, for any SS sufficiently large as above, there exists a convex neighborhood U1UU_{1}\subset U of x=0x^{*}=0, such that for any tt^{\prime}\in\mathbb{N}, any xtU1x_{t^{\prime}}\in U_{1}, and any (i,α)t:t+S1(i,\alpha)_{t^{\prime}:t^{\prime}+S-1}, it holds that

(4.24) xt+SΦH((i,α)t:t+S1)xtxt.\left\|x_{t^{\prime}+S}\right\|\geq\left\|\Phi^{H}\bigl{(}(i,\alpha)_{t^{\prime}:t^{\prime}+S-1}\bigr{)}x_{t^{\prime}}\right\|-\left\|x_{t^{\prime}}\right\|.

We first define a convex neighborhood U0UU_{0}\subset U of x=0x^{*}=0 such that

\left\|\left(x-\alpha e_{i}e_{i}^{\top}\nabla f(x)\right)-\left(I-\alpha e_{i}e_{i}^{\top}H\right)x\right\|=\left\|\alpha e_{i}e_{i}^{\top}\left(\nabla f(x)-Hx\right)\right\|=\left\|\alpha e_{i}e_{i}^{\top}\int_{0}^{1}\left(\nabla^{2}f(\eta x)-H\right)x\,\mathrm{d}\eta\right\|\leq\frac{1}{S(r_{+})^{S-1}}\left\|x\right\|,

for any xU0x\in U_{0}, any i{1,2,,d}i\in\{1,2,\dots,d\}, and any α[αmin,αmax]\alpha\in[\alpha_{\min},\alpha_{\max}]. Applying the inequality SS times for xtU1=(r+)(S1)U0x_{t^{\prime}}\in U_{1}=(r_{+})^{-(S-1)}U_{0}, we have,

xt+SΦH((i,α)t:t+S1)xtxt+S(Iαt+S1eit+S1eit+S1H)xt+S1+Iαt+S1eit+S1eit+S1Hxt+S1ΦH((i,α)t:t+S2)xt1S(r+)S1xt+S1+r+xt+S1ΦH((i,α)t:t+S2)xt1S(r+)S1(xt+S1+r+xt+S2++(r+)S1xt)xt,\begin{split}&\left\|x_{t^{\prime}+S}-\Phi^{H}\bigl{(}(i,\alpha)_{t^{\prime}:t^{\prime}+S-1}\bigr{)}x_{t^{\prime}}\right\|\\ \leq&\left\|x_{t^{\prime}+S}-\left(I-\alpha_{t^{\prime}+S-1}e_{i_{t^{\prime}+S-1}}e_{i_{t^{\prime}+S-1}}^{\top}H\right)x_{t^{\prime}+S-1}\right\|\\ &+\left\|I-\alpha_{t^{\prime}+S-1}e_{i_{t^{\prime}+S-1}}e_{i_{t^{\prime}+S-1}}^{\top}H\right\|\cdot\left\|x_{t^{\prime}+S-1}-\Phi^{H}\bigl{(}(i,\alpha)_{t^{\prime}:t^{\prime}+S-2}\bigr{)}x_{t^{\prime}}\right\|\\ \leq&\frac{1}{S(r_{+})^{S-1}}\left\|x_{t^{\prime}+S-1}\right\|+r_{+}\left\|x_{t^{\prime}+S-1}-\Phi^{H}\bigl{(}(i,\alpha)_{t^{\prime}:t^{\prime}+S-2}\bigr{)}x_{t^{\prime}}\right\|\\ \leq&\frac{1}{S(r_{+})^{S-1}}\left(\left\|x_{t^{\prime}+S-1}\right\|+r_{+}\left\|x_{t^{\prime}+S-2}\right\|+\cdots+(r_{+})^{S-1}\left\|x_{t^{\prime}}\right\|\right)\\ \leq&\left\|x_{t^{\prime}}\right\|,\end{split}

and hence, inequality (4.24).
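
The approximation bound (4.24) can be probed numerically. The sketch below assumes a specific objective f(x)=\frac{1}{2}x^{\top}Hx+\frac{1}{4}\sum_{i}x_{i}^{4} (so that \nabla^{2}f(0)=H), together with a particular H and stepsize range; it runs S steps of Algorithm 1 from a small initial point and compares the result with the linearized cocycle applied to the same point. It is an illustration only, not part of the proof.

import numpy as np

rng = np.random.default_rng(2)
d, S = 4, 50
a_min, a_max = 0.05, 0.4                     # assumed stepsize range

Q = np.linalg.qr(rng.standard_normal((d, d)))[0]
H = Q @ np.diag([1.0, 0.6, -0.8, -1.5]) @ Q.T

def grad_f(x):
    # f(x) = 0.5 x^T H x + 0.25 * sum(x_i^4), so the Hessian at the saddle x* = 0 equals H
    return H @ x + x ** 3

coords = rng.integers(0, d, size=S)
steps = rng.uniform(a_min, a_max, size=S)

x0 = 1e-3 * rng.standard_normal(d)           # start close to the saddle x* = 0
x, Phi = x0.copy(), np.eye(d)
for i, a in zip(coords, steps):
    x = x - a * np.eye(d)[i] * grad_f(x)[i]  # one step of Algorithm 1
    A = np.eye(d)
    A[i, :] -= a * H[i, :]
    Phi = A @ Phi                            # linearized cocycle Phi^H((i, alpha)_{0:t})

# the left-hand side being at most ||x0|| gives (4.24) for this realization
print(np.linalg.norm(x - Phi @ x0), "<=", np.linalg.norm(x0))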

Setting V=(r_{+})^{-(L-1)}U_{1}, we then have x_{\varrho_{0}-1}\in U for any x_{0}\in V, which implies that \left\|\nabla f(x_{\varrho_{0}-1})\right\|\geq\sigma\left\|x_{\varrho_{0}-1}\right\| as \varrho_{0}\leq L. According to Lemmas 4.5–4.8, for any given x_{0}\in V, with probability at least 1-4\epsilon, we have (i,\alpha)_{\varrho_{0}:\varrho_{0}+S-1}\in\Omega^{S} and the following hold:

(4.25) |eiϱ01f(xϱ01)|μf(xϱ01),\displaystyle\bigl{\lvert}e_{i_{\varrho_{0}-1}}^{\top}\nabla f(x_{\varrho_{0}-1})\bigr{\rvert}\geq\mu\left\|\nabla f(x_{\varrho_{0}-1})\right\|,
(4.26) 𝒫+H((i,α)ϱ0:ϱ0+S1)xϱ0ϵδ(αmaxαmin)4|eiϱ01f(xϱ01)|,\displaystyle\left\|\mathcal{P}^{H}_{+}\left((i,\alpha)_{\varrho_{0}:\varrho_{0}+S-1}\right)x_{\varrho_{0}}\right\|\geq\frac{\epsilon\delta(\alpha_{\max}-\alpha_{\min})}{4}\bigl{\lvert}e_{i_{\varrho_{0}-1}}^{\top}\nabla f(x_{\varrho_{0}-1})\bigr{\rvert},

where the probability is the marginal probability on (i,α)0:L+S1Ω0××ΩL+S1(i,\alpha)_{0:L+S-1}\in\Omega_{0}\times\cdots\times\Omega_{L+S-1}.

Recall \lambda_{+} and \gamma in (4.9), and d_{+}=\sum_{\lambda_{i}>0}d_{i}. It follows from (4.12) in the construction of the set \Omega^{S} that

1Slogsj(ΦH((i,α)ϱ0:ϱ0+S1))λμ(j)γλ+γ,jd+.\frac{1}{S}\log s_{j}\left(\Phi^{H}\left((i,\alpha)_{\varrho_{0}:\varrho_{0}+S-1}\right)\right)\geq\lambda_{\mu(j)}-\gamma\geq\lambda_{+}-\gamma,\quad\forall\ j\leq d_{+}.

That is, the d_{+} largest singular values of \Phi^{H}\left((i,\alpha)_{\varrho_{0}:\varrho_{0}+S-1}\right) are all greater than or equal to \exp(S(\lambda_{+}-\gamma)). Therefore, it holds that

(4.27) ΦH((i,α)ϱ0:ϱ0+S1)xϱ0ΦH((i,α)ϱ0:ϱ0+S1)𝒫+H((i,α)ϱ0:ϱ0+S1)xϱ0(4.26)exp(S(λ+γ))ϵδ(αmaxαmin)4|eiϱ01f(xϱ01)|(4.25)exp(S(λ+γ))ϵδμ(αmaxαmin)4f(xϱ01),\begin{split}\left\|\Phi^{H}\left((i,\alpha)_{\varrho_{0}:\varrho_{0}+S-1}\right)x_{\varrho_{0}}\right\|&\geq\left\|\Phi^{H}\left((i,\alpha)_{\varrho_{0}:\varrho_{0}+S-1}\right)\mathcal{P}^{H}_{+}\left((i,\alpha)_{\varrho_{0}:\varrho_{0}+S-1}\right)x_{\varrho_{0}}\right\|\\ &\leavevmode\kern-10.1389pt\mathrel{\mathop{\geq}\limits^{\eqref{ineq2}}}\exp(S(\lambda_{+}-\gamma))\cdot\frac{\epsilon\delta(\alpha_{\max}-\alpha_{\min})}{4}\bigl{\lvert}e_{i_{\varrho_{0}-1}}^{\top}\nabla f(x_{\varrho_{0}-1})\bigr{\rvert}\\ &\leavevmode\kern-10.1389pt\mathrel{\mathop{\geq}\limits^{\eqref{ineq1}}}\exp(S(\lambda_{+}-\gamma))\cdot\frac{\epsilon\delta\mu(\alpha_{\max}-\alpha_{\min})}{4}\left\|\nabla f(x_{\varrho_{0}-1})\right\|,\end{split}

where the first inequality follows from the fact that ΦH((i,α)ϱ0:ϱ0+S1)𝒫+H((i,α)ϱ0:ϱ0+S1)xϱ0\Phi^{H}\left((i,\alpha)_{\varrho_{0}:\varrho_{0}+S-1}\right)\mathcal{P}^{H}_{+}\left((i,\alpha)_{\varrho_{0}:\varrho_{0}+S-1}\right)x_{\varrho_{0}} and ΦH((i,α)ϱ0:ϱ0+S1)(I𝒫+H((i,α)ϱ0:ϱ0+S1))xϱ0\Phi^{H}\left((i,\alpha)_{\varrho_{0}:\varrho_{0}+S-1}\right)\left(I-\mathcal{P}^{H}_{+}\left((i,\alpha)_{\varrho_{0}:\varrho_{0}+S-1}\right)\right)x_{\varrho_{0}} are orthogonal. Combining (4.24) and (4.27), we obtain that

xϱ0+Sexp(S(λ+γ))ϵδμ(αmaxαmin)4f(xϱ01)xϱ0(4.21)(exp(S(λ+γ))ϵδμσ(r)L1(αmaxαmin)4(r+)L)x0(4.22)exp(S(λ+γ))ϵδμσ(r)L1(αmaxαmin)8x0.\begin{split}\left\|x_{\varrho_{0}+S}\right\|&\geq\exp(S(\lambda_{+}-\gamma))\cdot\frac{\epsilon\delta\mu(\alpha_{\max}-\alpha_{\min})}{4}\left\|\nabla f(x_{\varrho_{0}-1})\right\|-\left\|x_{\varrho_{0}}\right\|\\ &\leavevmode\kern-37.38895pt\mathrel{\mathop{\geq}\limits^{\eqref{eq:compareiterate}}}\left(\exp(S(\lambda_{+}-\gamma))\cdot\frac{\epsilon\delta\mu\sigma(r_{-})^{L-1}(\alpha_{\max}-\alpha_{\min})}{4}-(r_{+})^{L}\right)\cdot\left\|x_{0}\right\|\\ &\leavevmode\kern-16.12503pt\mathrel{\mathop{\geq}\limits^{\eqref{large S1}}}\exp(S(\lambda_{+}-\gamma))\cdot\frac{\epsilon\delta\mu\sigma(r_{-})^{L-1}(\alpha_{\max}-\alpha_{\min})}{8}\cdot\left\|x_{0}\right\|.\end{split}

Therefore, it holds that

xL+S\displaystyle\left\|x_{L+S}\right\| (4.21)(r)L1xϱ0+S\displaystyle\;\leavevmode\kern-37.38895pt\mathrel{\mathop{\geq}\limits^{\eqref{eq:compareiterate}}}(r_{-})^{L-1}\left\|x_{\varrho_{0}+S}\right\|
exp(S(λ+γ))ϵδμσ(r)2(L1)(αmaxαmin)8x0.\displaystyle\;\geq\exp(S(\lambda_{+}-\gamma))\cdot\frac{\epsilon\delta\mu\sigma(r_{-})^{2(L-1)}(\alpha_{\max}-\alpha_{\min})}{8}\cdot\left\|x_{0}\right\|.

We finally arrive at (4.8) by setting T=L+ST=L+S and combining the above with (4.23). ∎

4.3. Proof of main results

In this section, we first prove the following theorem, which relies on the high-probability local amplification established in Theorem 4.4. The main result, Theorem 1, will then follow as an immediate corollary, since \mathrm{Crit}_{s}(f) is countable and strict saddle points are isolated.

Theorem 4.9.

Suppose that Assumptions 1, 2, and 3 hold and that 0<\alpha_{\min}<\alpha_{\max}<1/M. Then for every x^{*}\in\mathrm{Crit}_{s}(f) and every x_{0}\in\mathbb{R}^{d}\backslash\{x^{*}\}, it holds that

(Ωs(x,x0))=0.\mathbb{P}(\Omega_{s}(x^{*},x_{0}))=0.
Proof.

Without loss of generality, we assume that x=0x^{*}=0. Conditioned on t1\mathcal{F}_{t-1} with xtVx_{t}\in V, where VV can be assumed to be bounded, Theorem 4.4 states that with probability at least 14ϵ1-4\epsilon,

xt+TAxt,\left\|x_{t+T}\right\|\geq A\left\|x_{t}\right\|,

where, to simplify notation, we denote by

A:=exp(6ϵ16ϵ|log(1Mαmax)|T)A:=\exp\biggl{(}\frac{6\epsilon}{1-6\epsilon}\bigl{\lvert}\log(1-M\alpha_{\max})\bigr{\rvert}\,T\biggr{)}

the amplification factor appearing on the right-hand side of (4.8). Notice also that, due to (4.21), we always have

xt+T(r)Txt.\left\|x_{t+T}\right\|\geq(r_{-})^{T}\left\|x_{t}\right\|.

It suffices to show that for any x0V\{x}x_{0}\in V\backslash\{x^{*}\}, with probability 11 there exists some t+t\in\mathbb{N}_{+} such that xtVx_{t}\notin V.

Let us consider the iterates every T steps: denote y_{t}=x_{Tt} and \mathcal{G}_{t}=\mathcal{F}_{Tt-1} for t\in\mathbb{N}. Defining the stopping time

ρ=inf{t:ytV},\rho=\inf\{t\in\mathbb{N}:y_{t}\notin V\},

it suffices to show that (ρ<)=1\mathbb{P}(\rho<\infty)=1. We define a sequence of random variables ItI_{t} as follows:

It(ω)={1,if yt+1Ayt,0,otherwise.I_{t}(\omega)=\begin{cases}1,&\text{if }\left\|y_{t+1}\right\|\geq A\left\|y_{t}\right\|,\\ 0,&\text{otherwise}.\end{cases}

By the discussion in the beginning of the proof, we have

(It=1|𝒢t,t<ρ)14ϵ,\mathbb{P}(I_{t}=1\ |\ \mathcal{G}_{t},t<\rho)\geq 1-4\epsilon,

and moreover, setting S_{t}(\omega)=\sum_{0\leq s<t}I_{s}(\omega), we have for t<\rho(\omega),

yty0ASt(ω)(r)T(tSt(ω)).\frac{\left\|y_{t}\right\|}{\left\|y_{0}\right\|}\geq A^{S_{t}(\omega)}\cdot(r_{-})^{T(t-S_{t}(\omega))}.

Denote R:=\sup_{x\in V}\left\|x\right\|<\infty. Since (1-5\epsilon)\log A+5\epsilon T\log r_{-}=\epsilon T\lvert\log r_{-}\rvert/(1-6\epsilon)>0 by the definition of A, there exists t_{*}\in\mathbb{N} such that

(A15ϵ(r)5ϵT)t>Ry0,tt.\left(A^{1-5\epsilon}\cdot(r_{-})^{5\epsilon T}\right)^{t}>\frac{R}{\left\|y_{0}\right\|},\quad\forall\ t\geq t_{*}.

Therefore, for any ttt\geq t_{*}, it holds that

(ρ>t)=(ρ>t,St(15ϵ)t)i(15ϵ)t(ti)(14ϵ)i(4ϵ)ti.\mathbb{P}(\rho>t)=\mathbb{P}\bigl{(}\rho>t,S_{t}\leq(1-5\epsilon)t\bigr{)}\leq\sum_{i\leq(1-5\epsilon)t}\binom{t}{i}(1-4\epsilon)^{i}(4\epsilon)^{t-i}.

Here the equality holds because, for t\geq t_{*}, on the event \{\rho>t,\ S_{t}>(1-5\epsilon)t\} we would have \left\|y_{t}\right\|\geq\left(A^{1-5\epsilon}(r_{-})^{5\epsilon T}\right)^{t}\left\|y_{0}\right\|>R, contradicting y_{t}\in V; the inequality follows from the conditional bound \mathbb{P}(I_{t}=1\,|\,\mathcal{G}_{t},t<\rho)\geq 1-4\epsilon by stochastic domination. As we show in the next lemma, the right-hand side above goes to 0 as t\rightarrow\infty; thus \lim_{t\rightarrow\infty}\mathbb{P}(\rho>t)=0, which implies that \mathbb{P}(\rho<\infty)=1. ∎

Lemma 4.10.

For any ϵ(0,1/4)\epsilon\in(0,1/4), it holds that

limti(15ϵ)t(ti)(14ϵ)i(4ϵ)ti=0.\lim_{t\rightarrow\infty}\sum_{i\leq(1-5\epsilon)t}\binom{t}{i}(1-4\epsilon)^{i}(4\epsilon)^{t-i}=0.
Proof.

Let X0,X1,X2,X_{0},X_{1},X_{2},\dots be a sequence of i.i.d. random variables with XiX_{i} being a Bernoulli random variable with expectation 14ϵ1-4\epsilon. Denote the average X¯t=1t0s<tXs\bar{X}_{t}=\frac{1}{t}\sum_{0\leq s<t}X_{s}. The weak law of large numbers yields that

i(15ϵ)t(ti)(14ϵ)i(4ϵ)ti=(X¯t15ϵ)(|X¯t𝔼X0|ϵ)0,\sum_{i\leq(1-5\epsilon)t}\binom{t}{i}(1-4\epsilon)^{i}(4\epsilon)^{t-i}=\mathbb{P}\left(\bar{X}_{t}\leq 1-5\epsilon\right)\leq\mathbb{P}\left(|\bar{X}_{t}-\mathbb{E}X_{0}|\geq\epsilon\right)\rightarrow 0,

as tt\rightarrow\infty. ∎
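
The decay of this binomial tail is also easy to observe numerically; the short sketch below evaluates the sum directly for \epsilon=0.05 (an arbitrary choice) and a few values of t.

from math import comb, floor

def tail(t, eps):
    # sum_{i <= (1 - 5*eps) t} C(t, i) (1 - 4*eps)^i (4*eps)^(t - i)
    k = floor((1 - 5 * eps) * t)
    return sum(comb(t, i) * (1 - 4 * eps) ** i * (4 * eps) ** (t - i) for i in range(k + 1))

for t in [50, 200, 400]:
    print(t, tail(t, eps=0.05))    # decreases towards 0 as t grows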

The main theorem then follows directly from Theorem 4.9.

Proof of Theorem 1.

For each x^{*}\in\mathrm{Crit}_{s}(f), Assumption 2 guarantees that, in a small neighborhood of x^{*}, the gradient \nabla f(x)=\nabla^{2}f(x^{*})(x-x^{*})+o(\left\|x-x^{*}\right\|) is non-vanishing as long as x\neq x^{*}, which implies that x^{*} is an isolated critical point. Therefore, \mathrm{Crit}_{s}(f) is countable, and Theorem 1 follows directly from Theorem 4.9 and the countability of \mathrm{Crit}_{s}(f). ∎

We now prove the global convergence result, Corollary 3.3, for which we will show that Algorithm 1 converges to a critical point of f under appropriate assumptions. We first show that the limit of each convergent subsequence of \{x_{t}\}_{t\in\mathbb{N}} is a critical point of f.

Proposition 4.11.

Suppose that Assumption 1 holds and 0<\alpha_{\min}<\alpha_{\max}<1/M. Then for any x_{0}\in\mathbb{R}^{d} with bounded level set L(x_{0})=\{x\in\mathbb{R}^{d}:f(x)\leq f(x_{0})\}, with probability 1, every accumulation point of \{x_{t}\}_{t\in\mathbb{N}} is in \mathrm{Crit}(f).

Proof.

Algorithm 1 is monotone in the objective value, since the following holds for any t\in\mathbb{N} by Taylor's expansion:

(4.28) \begin{split}f(x_{t+1})&=f\left(x_{t}-\alpha_{t}e_{i_{t}}e_{i_{t}}^{\top}\nabla f(x_{t})\right)\\ &=f(x_{t})-\alpha_{t}\left(e_{i_{t}}^{\top}\nabla f(x_{t})\right)^{2}+\frac{1}{2}\alpha_{t}^{2}\left(e_{i_{t}}^{\top}\nabla f(x_{t})\right)^{2}\cdot e_{i_{t}}^{\top}\nabla^{2}f\left(x_{t}-\theta_{t}\alpha_{t}e_{i_{t}}e_{i_{t}}^{\top}\nabla f(x_{t})\right)e_{i_{t}}\\ &\leq f(x_{t})-\frac{1}{2}\alpha_{t}\left(e_{i_{t}}^{\top}\nabla f(x_{t})\right)^{2}\\ &\leq f(x_{t}),\end{split}

where θt(0,1)\theta_{t}\in(0,1), which implies that the whole sequence {xt}t\{x_{t}\}_{t\in\mathbb{N}} is contained in the bounded level set L(x0)L(x_{0}).

Let us consider any η>0\eta>0 and set

L(x0,η)={xL(x0):f(x)η},L(x_{0},\eta)=\{x\in L(x_{0}):\left\|\nabla f(x)\right\|\geq\eta\},

which is either empty or compact. We claim that, with probability 1, no accumulation point of \{x_{t}\}_{t\in\mathbb{N}} is located in L(x_{0},\eta). This is clear when L(x_{0},\eta) is empty, so it suffices to consider compact L(x_{0},\eta). Fix a constant \mu\in(0,1/\sqrt{d}]. For any x\in L(x_{0},\eta), there exist an open neighborhood U_{x} of x and a coordinate i_{x}\in\{1,2,\dots,d\} such that

(4.29) |eixf(y)|μf(x)μη,yUx,\bigl{\lvert}e_{i_{x}}^{\top}\nabla f(y)\bigr{\rvert}\geq\mu\left\|\nabla f(x)\right\|\geq\mu\eta,\quad\forall\ y\in U_{x},

and that

(4.30) supyUxf(y)infyUxf(y)<αminμ2η22.\sup_{y\in U_{x}}f(y)-\inf_{y\in U_{x}}f(y)<\frac{\alpha_{\min}\mu^{2}\eta^{2}}{2}.

Noticing that L(x0,η)xL(x0,η)UxL(x_{0},\eta)\subset\bigcup_{x\in L(x_{0},\eta)}U_{x}, by the compactness, there exist finitely many points, say x1,x2,,xKx^{1},x^{2},\dots,x^{K}, such that

L(x0,η)1kKUxk.L(x_{0},\eta)\subset\bigcup_{1\leq k\leq K}U_{x^{k}}.

For any k\in\{1,2,\dots,K\}, combining (4.28), (4.29), and (4.30), we know that for any t, conditioned on \mathcal{F}_{t-1} with x_{t}\in U_{x^{k}}, if i_{t}=i_{x^{k}} (which has probability 1/d), then f(x_{t+1})<\inf_{y\in U_{x^{k}}}f(y) and thus x_{t^{\prime}}\notin U_{x^{k}} for all t^{\prime}>t.

Therefore, the probability that there are infinitely many t\in\mathbb{N} with x_{t}\in U_{x^{k}} is zero, since at each such visit the iterate leaves U_{x^{k}} permanently with conditional probability at least 1/d; this implies that, with probability 1, \{x_{t}\}_{t\in\mathbb{N}} has no accumulation point in U_{x^{k}}. We conclude that, with probability 1, L(x_{0},\eta) does not contain any accumulation point of \{x_{t}\}_{t\in\mathbb{N}}, as K is finite. Since this holds for any \eta>0, for \mathbb{P}-a.e. \omega\in\Omega, \{x_{t}\}_{t\in\mathbb{N}} has no accumulation point in \bigcup_{n\geq 1}L(x_{0},1/n), which leads to the desired result. ∎
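
The monotone decrease (4.28) and the vanishing of the gradient along the iterates are easy to observe in practice. The sketch below runs Algorithm 1 on a simple separable non-convex function; the choice of f, of the initialization, and of a stepsize range with \alpha_{\max}<1/M on the relevant level set are all assumptions of this illustration.

import numpy as np

rng = np.random.default_rng(3)
d, T = 10, 20000
a_min, a_max = 0.01, 0.09      # chosen so that a_max < 1/M on the relevant level set of this f (checked by hand)

def f(x):
    # a simple separable non-convex test function (an assumption of this sketch)
    return np.sum(0.25 * x ** 4 - 0.5 * x ** 2)

def grad_f(x):
    return x ** 3 - x

x = rng.uniform(-1.2, 1.2, size=d)
values = [f(x)]
for _ in range(T):
    i = rng.integers(d)                    # random coordinate i_t
    a = rng.uniform(a_min, a_max)          # random stepsize alpha_t
    x[i] -= a * grad_f(x)[i]               # coordinate gradient step of Algorithm 1
    values.append(f(x))

monotone = all(v2 <= v1 + 1e-12 for v1, v2 in zip(values, values[1:]))
print("monotone decrease (up to rounding):", monotone)
print("final gradient norm:", np.linalg.norm(grad_f(x)))   # expected to be close to 0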

Proposition 4.11 implies that any accumulation point of the algorithm iterates is a critical point. If we further assume that each critical point of f is isolated, then the whole sequence \{x_{t}\}_{t\in\mathbb{N}} converges and its limit lies in \mathrm{Crit}(f).

Proposition 4.12.

Under the assumptions of Proposition 4.11, if every x^{*}\in\mathrm{Crit}(f) is an isolated critical point of f, then with probability 1, x_{t} converges to some critical point of f as t\rightarrow\infty.

Proof.

It follows from Proposition 4.11 that \left\|\nabla f(x_{t})\right\| converges to 0 as t\rightarrow\infty for a.e. \omega\in\Omega. Indeed, if there were a subsequence \{x_{t_{k}}\}_{k\in\mathbb{N}} and \epsilon>0 with \left\|\nabla f(x_{t_{k}})\right\|\geq\epsilon for all k\in\mathbb{N}, then by the boundedness of L(x_{0}), \{x_{t_{k}}\}_{k\in\mathbb{N}} would have an accumulation point that is not a critical point of f, which leads to a contradiction.

Moreover, \mathrm{Crit}(f)\cap L(x_{0}) is a finite set: otherwise, \mathrm{Crit}(f)\cap L(x_{0}) would have a limit point, which would be a non-isolated critical point of f, violating the assumption.

Consider a fixed \omega\in\Omega with \lim_{t\rightarrow\infty}\left\|\nabla f(x_{t})\right\|=0. Select an open neighborhood U_{x^{*}} for every x^{*}\in\mathrm{Crit}(f)\cap L(x_{0}), such that there exists some \delta>0 with

\text{dist}(U_{x^{*}},U_{y^{*}})=\inf_{x\in U_{x^{*}},y\in U_{y^{*}}}\left\|x-y\right\|>\delta,\quad\forall\ x^{*}\neq y^{*}\in\mathrm{Crit}(f)\cap L(x_{0}).

If \{x_{t}\}_{t\in\mathbb{N}} had more than one accumulation point, then, since \left\|x_{t+1}-x_{t}\right\|\leq\alpha_{\max}\left\|\nabla f(x_{t})\right\|\rightarrow 0 is eventually smaller than \delta, the iterates would visit L(x_{0})\backslash\bigcup_{x^{*}\in\mathrm{Crit}(f)\cap L(x_{0})}U_{x^{*}} infinitely often when traveling between different neighborhoods. This set is compact, so \{x_{t}\}_{t\in\mathbb{N}} would have an accumulation point in L(x_{0})\backslash\bigcup_{x^{*}\in\mathrm{Crit}(f)\cap L(x_{0})}U_{x^{*}}, which contradicts Proposition 4.11. ∎

Corollary 3.3 is now an immediate consequence.

Proof of Corollary 3.3.

The result follows directly from Theorem 1 and Proposition 4.12. ∎

References

Appendix A Validity of Assumption 3

In this appendix, we provide some justification of Assumption 3, which is expected to hold generically. In particular, the following proposition validates this assumption when the off-diagonal entries of HH are all non-zero.

Proposition A.1.

Suppose that the largest Lyapunov exponent of \Phi^{H}(t,\omega) is positive. Then Assumption 3 holds as long as 0<\alpha_{\min}<\alpha_{\max}<1/\max_{1\leq i\leq d}|H_{ii}| and every off-diagonal entry of H is non-zero.

Proof.

For any element ω\omega in Ω\Omega, we take the smallest \ell such that {1,2,,d}={i0,i1,,i1}\{1,2,\dots,d\}=\{i_{0},i_{1},\dots,i_{\ell-1}\} and write

ω=((i0,α0),,(i1,α1),ω),\omega=((i_{0},\alpha_{0}),\dots,(i_{\ell-1},\alpha_{\ell-1}),\omega^{\prime}),

where ω=τωΩ\omega^{\prime}=\tau^{\ell}\omega\in\Omega. We have that \ell is finite for a.e. ωΩ\omega\in\Omega. Note that we can view 1\ell-1 as a stopping time, in particular, given \ell, ω\omega^{\prime} has distribution \mathbb{P} and is independent with 1\mathcal{F}_{\ell-1}.

Let {v1,v2,,vm}\{v_{1}^{\prime},v_{2}^{\prime},\dots,v_{m}^{\prime}\} be a set of basis vectors for WH(ω)=WH(τω)W^{H}_{-}(\omega^{\prime})=W^{H}_{-}(\tau^{\ell}\omega). Then a set of basis vectors for WH(ω)W^{H}_{-}(\omega) is given by

vj=(Iα0ei0ei0H)1(Iα1ei1ei1H)1vj,j=1,2,,m.v_{j}=\bigl{(}I-\alpha_{0}e_{i_{0}}e_{i_{0}}^{\top}H\bigr{)}^{-1}\cdots\bigl{(}I-\alpha_{\ell-1}e_{i_{\ell-1}}e_{i_{\ell-1}}^{\top}H\bigr{)}^{-1}v_{j}^{\prime},\quad j=1,2,\dots,m.

Denote by V^{\prime}=\bigl(v_{1}^{\prime}|v_{2}^{\prime}|\cdots|v_{m}^{\prime}\bigr) and V=\bigl(v_{1}|v_{2}|\cdots|v_{m}\bigr) the matrices formed by concatenating these column vectors. If e_{i}\in W^{H}_{-}(\omega)=\text{span}\{v_{1},v_{2},\dots,v_{m}\}, then V_{\hat{\imath},:} is column-rank deficient, since the existence of a positive Lyapunov exponent implies that m\leq d-1. Here and for the rest of the appendix, we denote by V_{\hat{\imath},:} the (d-1)\times m matrix obtained by removing the i-th row of V\in\mathbb{R}^{d\times m}.

Therefore, as Assumption 3 is equivalent to the statement that e_{i}\notin W^{H}_{-}(\omega) for every i\in\{1,2,\dots,d\} and almost every \omega\in\Omega, it suffices to show that V_{\hat{\imath},:} has full column-rank with probability 1. The key point is that, given \ell, the stepsizes \alpha_{0},\alpha_{1},\dots,\alpha_{\ell-1} are independent of i_{0},i_{1},\dots,i_{\ell-1} and \omega^{\prime}=\tau^{\ell}\omega. Thus, it suffices to show that, with fixed \ell, i_{0},i_{1},\dots,i_{\ell-1}, \omega^{\prime}=\tau^{\ell}\omega, and v_{1}^{\prime},v_{2}^{\prime},\dots,v_{m}^{\prime}, the set of all \alpha_{0},\alpha_{1},\dots,\alpha_{\ell-1} that yield the rank-deficiency of V_{\hat{\imath},:} is of measure zero; without loss of generality, we can assume i=1. Noticing that i_{0},i_{1},\dots,i_{\ell-1} cover all the coordinates and that every off-diagonal entry of H is non-zero, the desired result follows directly from Lemma A.2 below, applied repeatedly. ∎
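
The rank mechanism behind this argument can be illustrated numerically. In the sketch below, an arbitrary random m-dimensional subspace plays the role of W^{H}_{-}(\omega^{\prime}) (an assumption of the illustration, so this is not a verification of Assumption 3); its basis is pushed through a block of coordinate maps covering all coordinates with random stepsizes, and we check that deleting any single row leaves a full-column-rank matrix, i.e., that no e_{i} lies in the resulting subspace.

import numpy as np

rng = np.random.default_rng(4)
d, m = 4, 2
H = np.array([[ 1.0,  0.3,  0.2,  0.1],
              [ 0.3,  0.5,  0.4,  0.2],
              [ 0.2,  0.4, -0.8,  0.3],
              [ 0.1,  0.2,  0.3, -1.5]])   # symmetric with all off-diagonal entries non-zero
a_min, a_max = 0.05, 0.4                   # 0 < a_min < a_max < 1/max_i |H_ii|

Vp = rng.standard_normal((d, m))           # basis of a stand-in for W_-^H(omega'); an assumption of the sketch
V = Vp.copy()
for k in rng.permutation(d):               # a block of coordinates covering {1, ..., d}
    a = rng.uniform(a_min, a_max)
    A = np.eye(d)
    A[k, :] -= a * H[k, :]                 # I - alpha e_k e_k^T H
    V = np.linalg.solve(A, V)              # V <- (I - alpha e_k e_k^T H)^{-1} V

for i in range(d):
    V_hat = np.delete(V, i, axis=0)        # remove the i-th row
    # full column rank m certifies that e_i is not in the span of the columns of V
    print(i, np.linalg.matrix_rank(V_hat) == m)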

Lemma A.2.

Suppose that X=\bigl(X_{1}|X_{2}|\cdots|X_{d}\bigr)^{\top} and Y=\bigl(Y_{1}|Y_{2}|\cdots|Y_{d}\bigr)^{\top} are full-column-rank d\times m matrices satisfying Y=(I-\alpha e_{k}e_{k}^{\top}H)^{-1}X (we suppress in the notation the dependence of Y on k and \alpha for simplicity). Then the following hold:

  • (i)

    If X_{\hat{1},:} has full column-rank, then for any k\in\{1,2,\dots,d\}, Y_{\hat{1},:} also has full column-rank for a.e. \alpha.

  • (ii)

    Suppose that X1^,:X_{\hat{1},:} is column-rank deficient, and let 2j1<j2<<jm1d2\leq j_{1}<j_{2}<\dots<j_{m-1}\leq d be row indices such that

    Xjspan{Xj1,Xj2,,Xjm1},j{2,3,,d}.X_{j}\in\emph{span}\{X_{j_{1}},X_{j_{2}},\dots,X_{j_{m-1}}\},\quad\forall\ j\in\{2,3,\dots,d\}.

    If k{1,j1,j2,,jm1}k\in\{1,j_{1},j_{2},\dots,j_{m-1}\}, then we have with probability 11, either Y1^,:Y_{\hat{1},:} has full column-rank or Y1^,:Y_{\hat{1},:} is column-rank deficient with

    Yjspan{Yj1,Yj2,,Yjm1},j{2,3,,d}.Y_{j}\in\emph{span}\{Y_{j_{1}},Y_{j_{2}},\dots,Y_{j_{m-1}}\},\quad\forall\ j\in\{2,3,\dots,d\}.

    If k{1,j1,j2,,jm1}k\notin\{1,j_{1},j_{2},\dots,j_{m-1}\} and Hk10H_{k1}\neq 0, then Y1^,:Y_{\hat{1},:} has full column-rank.

Proof of Lemma A.2.

By (3.3), it holds that Yj=XjY_{j}=X_{j} for jkj\neq k and that

Yk=11αHkk(Xk+αjkHkjXj).Y_{k}=\frac{1}{1-\alpha H_{kk}}\left(X_{k}+\alpha\sum_{j\neq k}H_{kj}X_{j}\right).

For point (i), we notice that if k=1k=1, then Y1^,:=X1^,:Y_{\hat{1},:}=X_{\hat{1},:} has full column-rank. If k>1k>1, then it follows from X1span{X2,,Xd}X_{1}\in\text{span}\{X_{2},\dots,X_{d}\} that Y1^,:Y_{\hat{1},:} also has full column-rank for a.e. α\alpha.

For point (ii), we have

span{X1,Xj1,Xj2,,Xjm1}=m.\text{span}\{X_{1},X_{j_{1}},X_{j_{2}},\dots,X_{j_{m-1}}\}=\mathbb{R}^{m}.

If k{1,j1,j2,,jm1}k\in\{1,j_{1},j_{2},\dots,j_{m-1}\}, then span{Y1,Yj1,Yj2,,Yjm1}=m\text{span}\{Y_{1},Y_{j_{1}},Y_{j_{2}},\dots,Y_{j_{m-1}}\}=\mathbb{R}^{m} holds for a.e.a.e. α\alpha. Therefore, we obtain that Yj1,Yj2,,Yjm1Y_{j_{1}},Y_{j_{2}},\dots,Y_{j_{m-1}} are linearly independent, which implies that either Y1^,:Y_{\hat{1},:} has full column-rank or

Y_{j}\in\text{span}\{Y_{j_{1}},Y_{j_{2}},\dots,Y_{j_{m-1}}\},\quad\forall\ j\in\{2,3,\dots,d\}.

If k\notin\{1,j_{1},j_{2},\dots,j_{m-1}\} and H_{k1}\neq 0, then the coefficient \alpha H_{k1}/(1-\alpha H_{kk}) of X_{1} in Y_{k} is non-zero, so Y_{\hat{1},:} has full column-rank. ∎