
Yian Deng, Tingting Mu
Department of Computer Science, University of Manchester
Email: {yian.deng, tingting.mu}@manchester.ac.uk

Faster Riemannian Newton-type Optimization
by Subsampling and Cubic Regularization

Yian Deng    Tingting Mu
(Manuscript was submitted to Machine Learning in 10/2021 and accepted in 01/2023.)
Abstract

This work is on constrained large-scale non-convex optimization where the constraint set implies a manifold structure. Solving such problems is important in a multitude of fundamental machine learning tasks. Recent advances on Riemannian optimization have enabled the convenient recovery of solutions by adapting unconstrained optimization algorithms over manifolds. However, it remains challenging to scale up and meanwhile maintain stable convergence rates and handle saddle points. We propose a new second-order Riemannian optimization algorithm, aiming at improving convergence rate and reducing computational cost. It enhances the Riemannian trust-region algorithm, which exploits curvature information to escape saddle points, through a combination of subsampling and cubic regularization techniques. We conduct rigorous analysis to study the convergence behavior of the proposed algorithm. We also perform extensive experiments to evaluate it based on two general machine learning tasks using multiple datasets. The proposed algorithm exhibits improved computational speed and convergence behavior compared to a large set of state-of-the-art Riemannian optimization algorithms.

Keywords:
Optimization · Cubic regularization · Riemannian manifolds · Subsampling

1 Introduction

In modern machine learning, many learning tasks are formulated as non-convex optimization problems. This is because, as compared to linear or convex formulations, they can often capture more accurately the underlying structures within the data, and model more precisely the learning performance (or losses). There is an important class of non-convex problems of which the constraint sets possess manifold structures, e.g., to optimize over a set of orthogonal matrices. A manifold in mathematics refers to a topological space that locally behaves as a Euclidean space near each point. Over a D-dimensional manifold \mathcal{M} in a d-dimensional ambient space (D<d), each local patch around each data point (a subset of \mathcal{M}) is homeomorphic to a local open subset of the Euclidean space \mathbb{R}^{D}. This special structure enables a straightforward adoption of any unconstrained optimization algorithm for solving a constrained problem over a manifold, simply by applying a systematic way to modify the gradient and Hessian calculations. The modified calculations are called the Riemannian gradients and the Riemannian Hessians, which will be rigorously defined later. Such an accessible method for developing optimized solutions has benefited many applications and encouraged the implementation of various optimization libraries.

A representative learning task that gives rise to non-convex problems over manifolds is low-rank matrix completion, widely applied in signal denoising, recommendation systems and image recovery liu2019convolution . It is formulated as optimization problems constrained on fixed-rank matrices that belong to a Grassmann manifold. Another example task is principal component analysis (PCA), popularly used in statistical data analysis and dimensionality reduction shahid2015robust . It seeks an optimal orthogonal projection matrix from a Stiefel manifold. A more general problem setup than PCA is subspace learning mishra2019riemannian , where a low-dimensional space is an instance of a Grassmann manifold. When training a neural network, in order to reduce overfitting, the orthogonality constraint that provides a Stiefel manifold structure is sometimes imposed over the network weights anandkumar2016efficient . Additionally, in hand gesture recognition nguyen2019neural , optimizing over a symmetric positive definite (SPD) manifold has been shown to be effective.

Recent developments in optimization on Riemannian manifolds absil2009optimization have offered a convenient and unified solution framework for solving the aforementioned class of non-convex problems. The Riemannian optimization techniques translate the constrained problems into unconstrained ones on the manifold whilst preserving the geometric structure of the solution. For example, one Riemannian way to implement PCA is to preserve the SPD geometric structure of the solutions without explicit constraints horev2016geometry . A simplified description of how Riemannian optimization works is that it first applies a straightforward way to modify the calculation of the first-order and second-order gradient information, then it adopts an unconstrained optimization algorithm that uses the modified gradient information. There are systematic ways to compute these modifications by analyzing the geometric structure of the manifold. Various libraries have implemented these methods and are available to practitioners, e.g., Manopt boumal2014manopt and Pymanopt townsend2016pymanopt .

Among such techniques, Riemannian gradient descent (RGD) is the simplest. To handle large-scale computation with a finite-sum structure, Riemannian stochastic gradient descent (RSGD) bonnabel2013stochastic has been proposed to estimate the gradient from a single sample (or a sample batch) in each iteration of the optimization. Here, an iteration refers to the process by which an incumbent solution is updated with gradient and (or) higher-order derivative information; for example, Eq. (3) in the upcoming text defines an RSGD iteration. Convergence rates of RGD and RSGD are compared in zhang2016first together with a global complexity analysis. The work concludes that RGD can converge linearly while RSGD converges sub-linearly, but RSGD becomes computationally cheaper when there is a significant increase in the number of samples to process, and it can also potentially prevent overfitting. By using RSGD to optimize over the Stiefel manifold, politz2016interpretable attempts to improve the interpretability of domain adaptation and has demonstrated its benefits for text classification.

A major drawback of RSGD is the variance issue, where the variance of the update direction can slow down the convergence and result in poor solutions. Typical techniques for variance reduction include the Riemannian stochastic variance reduced gradient (RSVRG) zhang2016riemannian and the Riemannian stochastic recursive gradient (RSRG) kasai2018riemannian . RSVRG reduces the gradient variance by using a momentum term, which takes into account the gradient information obtained from both RGD and RSGD. RSRG follows a different strategy and considers only the information in the last and current iterations. This has the benefit of avoiding large cumulative errors, which can be caused by transporting the gradient vector along a distant path when aligning two vectors at the same tangent plane. It has been shown by kasai2018riemannian that RSRG performs better than RSVRG particularly for large-scale problems.

The RSGD variants can suffer from oscillation across the slopes of a ravine kumar2018geometry . This also happens when performing stochastic gradient descent in Euclidean spaces. To address this, various adaptive algorithms have been proposed. The core idea is to control the learning process with adaptive learning rates in addition to the gradient momentum. Riemannian techniques of this kind include R-RMSProp kumar2018geometry , R-AMSGrad cho2017riemannian , R-AdamNC becigneul2018riemannian , RPG huang2021riemannian and RAGDsDR alimisis2021momentum .

Although improvements have been made for first-order optimization, they might still be insufficient for handling saddle points in non-convex problems mokhtari2018escaping . They can only guarantee convergence to stationary points and do not have control over getting trapped at saddle points due to the lack of higher-order information. As an alternative, second-order algorithms are normally good at escaping saddle points by exploiting curvature information kohler2017sub ; tripuraneni2018stochastic . Representative examples of this are the trust-region (TR) methods. Their capacity for handling saddle points and their improved convergence over many first-order methods have been demonstrated in weiwei2013newton for various non-convex problems. The TR technique has been extended to a Riemannian setting for the first time by absil2007trust , referred to as the Riemannian TR (RTR) technique.

It is well known that what prevents the wide use of second-order Riemannian techniques in large-scale problems is the high cost of computing the exact Hessian matrices. Inexact techniques have therefore been proposed to iteratively search for solutions without explicit Hessian computations. They can also handle non-positive-definite Hessian matrices and improve operational robustness. Two representative inexact examples are the conjugate gradient and the Lanczos methods zhu2017riemannian ; xu2016matrix . However, their reduced complexity is still proportional to the sample size, and they can still be computationally costly when working with large-scale problems. To address this issue, the subsampling technique has been proposed, whose core idea is to approximate the gradient and Hessian using a batch of samples. It has been proved by shen2019stochastic that the TR method with subsampled gradient and Hessian can achieve a convergence rate of order \mathcal{O}\left(\frac{1}{k^{2/3}}\right), with k denoting the iteration number. A sample-efficient stochastic TR approach is proposed by shen2019stochastic which finds an (\epsilon,\sqrt{\epsilon})-approximate local minimum within \mathcal{O}(\sqrt{n}/\epsilon^{1.5}) stochastic Hessian oracle queries, where n denotes the sample number. The subsampling technique has been applied to improve second-order Riemannian optimization for the first time by kasai2018inexact . Their proposed inexact RTR algorithm employs subsampling over the Riemannian manifold and achieves faster convergence than the standard RTR method. Nonetheless, subsampling can be sensitive to the configured batch size, and overly small batch sizes can lead to poor convergence.

In the latest development of second-order unconstrained optimization, it has been shown that the adaptive cubic regularization technique cartis2011adaptive can improve the standard and subsampled TR algorithms and the Newton-type methods, resulting in, for instance, improved convergence and effectiveness at escaping strict saddle points kohler2017sub ; xu2020newton . To improve performance, variance reduction techniques have been combined with cubic regularization and extended to cases with inexact solutions. Example works include zhou2020stochastic ; zhou2019stochastic , which were the first to rigorously demonstrate the advantage of variance reduction for second-order optimization algorithms. Recently, the potential of cubic regularization for solving non-convex problems over constraint sets with Riemannian manifold structures has been shown by zhang2018cubic ; agarwal2021adaptive .

We aim at improving the RTR optimization by taking advantage of both the adaptive cubic regularization and subsampling techniques. Our problem of interest is to find a local minimum of a non-convex finite-sum minimization problem constrained on a set endowed with a Riemannian manifold structure. Letting f_{i}:\mathcal{M}\rightarrow\mathbb{R} be a real-valued function defined on a Riemannian manifold \mathcal{M}, we consider a twice-differentiable finite-sum objective function:

\min_{\mathbf{x}\in\mathcal{M}}\ f(\mathbf{x})=\frac{1}{n}\sum_{i=1}^{n}f_{i}(\mathbf{x}). (1)

In the machine learning context, n denotes the sample number, and f_{i}(\mathbf{x}) is a smooth real-valued and twice differentiable cost (or loss) function computed for the i-th sample. The n samples are assumed to be uniformly sampled, and thus \mathbb{E}\left[f_{i}(\mathbf{x})\right]=\lim_{n\to\infty}\frac{1}{n}\sum_{i=1}^{n}f_{i}(\mathbf{x}).

We propose a cubic Riemannian Newton-like (RN) method to solve more effectively the problem in Eq. (1). Specifically, we enable two key improvements in the Riemannian space: (1) approximating the Riemannian gradient and Hessian using the subsampling technique, and (2) improving the subproblem formulation by replacing the trust-region constraint with a cubic regularization term. The resulting algorithm is named Inexact Sub-RN-CR (the abbreviation Sub-RN-CR comes from Sub-sampled Riemannian Newton-like Cubic Regularization; we follow the tradition of referring to a TR method enhanced by cubic regularization as a Newton-like method cartis2011adaptive . An implementation of Inexact Sub-RN-CR is provided at https://github.com/xqdi/isrncr). After introducing cubic regularization, it becomes more challenging to solve the subproblem, for which we demonstrate two effective solvers based on the Lanczos and conjugate gradient methods. We provide convergence analysis for the proposed Inexact Sub-RN-CR algorithm and present the main results in Theorems 3 and 4. Additionally, we provide analysis for the subproblem solvers, regarding their solution quality, e.g., whether and how they meet a set of desired conditions as presented in Assumptions 1-3, and their convergence property. The key results are presented in Lemma 1, Theorem 2, and Lemmas 2 and 3. Overall, our results are satisfactory. The proposed algorithm finds an \left(\epsilon_{g},\epsilon_{H}\right)-optimal solution (defined in Section 3.1) in fewer iterations than the state-of-the-art RTR kasai2018inexact . Specifically, the required number of iterations is reduced from \mathcal{O}\left(\max\left(\epsilon_{g}^{-2}\epsilon_{H}^{-1},\epsilon_{H}^{-3}\right)\right) to \mathcal{O}\left(\max\left(\epsilon_{g}^{-2},\epsilon_{H}^{-3}\right)\right). When tested in extensive experiments on PCA and matrix completion tasks with different datasets and applications in image analysis, our algorithm shows much better performance than most state-of-the-art and popular Riemannian optimization algorithms, in terms of both solution quality and computing time.

2 Notations and Preliminaries

We start by familiarizing the readers with the notations and concepts that will be used in the paper, and recommend absil2009optimization for a more detailed explanation of the relevant concepts. The manifold \mathcal{M} is equipped with a smooth inner product \langle\cdot,\cdot\rangle_{\mathbf{x}} associated with the tangent space T_{\mathbf{x}}\mathcal{M} at any \mathbf{x}\in\mathcal{M}, and this inner product is referred to as the Riemannian metric. The norm of a tangent vector \bm{\eta}\in T_{\mathbf{x}}\mathcal{M} is denoted by \left\|\bm{\eta}\right\|_{\mathbf{x}}, computed as \left\|\bm{\eta}\right\|_{\mathbf{x}}=\sqrt{\langle\bm{\eta},\bm{\eta}\rangle_{\mathbf{x}}}. When we use the notation \left\|\bm{\eta}\right\|_{\mathbf{x}}, \bm{\eta} by default belongs to the tangent space T_{\mathbf{x}}\mathcal{M}. We use \mathbf{0}_{\mathbf{x}}\in T_{\mathbf{x}}\mathcal{M} to denote the zero vector of the tangent space at \mathbf{x}. The retraction mapping R_{\mathbf{x}}\left(\bm{\eta}\right):T_{\mathbf{x}}\mathcal{M}\rightarrow\mathcal{M} is used to move \mathbf{x}\in\mathcal{M} in the direction \bm{\eta}\in T_{\mathbf{x}}\mathcal{M} while remaining on \mathcal{M}; it is the Riemannian counterpart of \mathbf{x}+\bm{\eta} in a Euclidean space. The pullback of f at \mathbf{x} is defined by \hat{f}(\bm{\eta})=f(R_{\mathbf{x}}(\bm{\eta})), and \hat{f}(\mathbf{0}_{\mathbf{x}})=f(\mathbf{x}). The vector transport operator \mathcal{T}_{\mathbf{x}}^{\mathbf{y}}(\mathbf{v}):T_{\mathbf{x}}\mathcal{M}\rightarrow T_{\mathbf{y}}\mathcal{M} moves a tangent vector \mathbf{v}\in T_{\mathbf{x}}\mathcal{M} from a point \mathbf{x}\in\mathcal{M} to another \mathbf{y}\in\mathcal{M}. We also use the shorthand notation \mathcal{T}_{\bm{\eta}}({\mathbf{v}}) to describe \mathcal{T}_{\mathbf{x}}^{\mathbf{y}}(\mathbf{v}) for a moving direction \bm{\eta}\in T_{\mathbf{x}}\mathcal{M} from \mathbf{x} to \mathbf{y} satisfying R_{\mathbf{x}}\left(\bm{\eta}\right)=\mathbf{y}. The parallel transport operator \mathcal{P}_{\bm{\eta},\gamma}(\mathbf{v}) is a special instance of the vector transport. It moves \mathbf{v}\in T_{\mathbf{x}}\mathcal{M} in the direction of \bm{\eta}\in T_{\mathbf{x}}\mathcal{M} along a geodesic \gamma:[0,1]\rightarrow\mathcal{M}, where \gamma(0)=\mathbf{x}, \gamma(1)=\mathbf{y} and \gamma^{\prime}(0)=\bm{\eta}, and during the movement it has to satisfy the parallel condition on the geodesic curve. We simplify the notation \mathcal{P}_{\bm{\eta},\gamma}(\mathbf{v}) to \mathcal{P}_{\bm{\eta}}(\mathbf{v}). Fig. 1 illustrates a manifold and the operations over it. Additionally, we use \|\cdot\| to denote the l_{2}-norm operation in a Euclidean space.

Figure 1: Illustration of the retraction and parallel transport operations.

The Riemannian gradient of a real-valued differentiable function f at \mathbf{x}\in\mathcal{M}, denoted by {\rm grad}f(\mathbf{x}), is defined as the unique element of T_{\mathbf{x}}\mathcal{M} satisfying \langle{\rm grad}f(\mathbf{x}),\bm{\xi}\rangle_{\mathbf{x}}=\mathcal{D}f(\mathbf{x})[\bm{\xi}],\ \forall\bm{\xi}\in T_{\mathbf{x}}\mathcal{M}. Here, \mathcal{D}f(\mathbf{x})[\bm{\xi}] generalizes the notion of the directional derivative to a manifold, defined as the derivative of f(\gamma(t)) at t=0, where \gamma(t) is a curve on \mathcal{M} such that \gamma(0)=\mathbf{x} and \dot{\gamma}(0)=\bm{\xi}. When operating in a Euclidean space, we use the same notation \mathcal{D}f(\mathbf{x})[\bm{\xi}] to denote the classical directional derivative. The Riemannian Hessian of a real-valued differentiable function f at \mathbf{x}\in\mathcal{M}, denoted by {\rm Hess}f(\mathbf{x})[\bm{\xi}]:T_{\mathbf{x}}\mathcal{M}\rightarrow T_{\mathbf{x}}\mathcal{M}, is a linear mapping defined based on the Riemannian connection, as {\rm Hess}f(\mathbf{x})[\bm{\xi}]=\tilde{\nabla}_{\bm{\xi}}{\rm grad}f(\mathbf{x}). The Riemannian connection \tilde{\nabla}_{\bm{\xi}}\bm{\eta}:T_{\mathbf{x}}\mathcal{M}\times T_{\mathbf{x}}\mathcal{M}\rightarrow T_{\mathbf{x}}\mathcal{M} generalizes the notion of the directional derivative of a vector field. For a function f defined over an embedded manifold, its Riemannian gradient can be computed by projecting the Euclidean gradient \nabla f(\mathbf{x}) onto the tangent space, as {\rm grad}f(\mathbf{x})=P_{T_{\mathbf{x}}\mathcal{M}}\left[\nabla f(\mathbf{x})\right], where P_{T_{\mathbf{x}}\mathcal{M}}[\cdot] is the orthogonal projection onto T_{\mathbf{x}}\mathcal{M}. Similarly, its Riemannian Hessian can be computed by projecting the classical directional derivative of {\rm grad}f(\mathbf{x}), defined by \nabla^{2}f(\mathbf{x})[\bm{\xi}]=\mathcal{D}{\rm grad}f(\mathbf{x})[\bm{\xi}], onto the tangent space, resulting in {\rm Hess}f(\mathbf{x})[\bm{\xi}]=P_{T_{\mathbf{x}}\mathcal{M}}\left[\nabla^{2}f(\mathbf{x})[\bm{\xi}]\right]. When the function f is defined over a quotient manifold, the Riemannian gradient and Hessian can be computed by projecting \nabla f(\mathbf{x}) and \nabla^{2}f(\mathbf{x})[\bm{\xi}] onto the horizontal space of the manifold.

Taking the PCA problem as an example (see Eqs. (60) and (61) for its formulation), the general objective in Eq. (1) can be instantiated by \min_{\mathbf{U}\in{\rm Gr}\left(r,d\right)} \frac{1}{n}\sum_{i=1}^{n}\left\|\mathbf{z}_{i}-\mathbf{UU}^{T}\mathbf{z}_{i}\right\|^{2}, where \mathcal{M}={\rm Gr}\left(r,d\right) is a Grassmann manifold. The function of interest is f_{i}(\mathbf{U})=\left\|\mathbf{z}_{i}-\mathbf{UU}^{T}\mathbf{z}_{i}\right\|^{2}. The Grassmann manifold {\rm Gr}\left(r,d\right) contains the set of r-dimensional linear subspaces of the d-dimensional vector space. Each subspace corresponds to a point on the manifold that is an equivalence class of d\times r orthogonal matrices, expressed as \mathbf{x}=\left[\mathbf{U}\right]:=\left\{\mathbf{UO}:\mathbf{U}^{T}\mathbf{U}=\mathbf{I},\mathbf{O}\in{\rm O}\left(r\right)\right\}, where {\rm O}\left(r\right) denotes the orthogonal group in \mathbb{R}^{r\times r}. A tangent vector \bm{\eta}\in T_{\mathbf{x}}{\rm Gr}(r,d) of the Grassmann manifold has the form \bm{\eta}=\mathbf{U}^{\bot}\mathbf{B} edelman1998geometry , where \mathbf{B}\in\mathbb{R}^{(d-r)\times r}, and \mathbf{U}^{\bot}\in\mathbb{R}^{d\times(d-r)} is the orthogonal complement of \mathbf{U} such that \left[\mathbf{U},\mathbf{U}^{\bot}\right] is orthogonal. A commonly used Riemannian metric for the Grassmann manifold is the canonical inner product \langle\bm{\eta},\bm{\xi}\rangle_{\mathbf{x}}={\rm tr}(\bm{\eta}^{T}\bm{\xi}) given \bm{\eta},\bm{\xi}\in T_{\mathbf{x}}{\rm Gr}(r,d), resulting in \left\|\bm{\eta}\right\|_{\mathbf{x}}=\|\bm{\eta}\| (Section 2.3.2 of edelman1998geometry ). As we can see, the Riemannian metric and the norm here are equivalent to the Euclidean inner product and norm. The same result can be derived from another commonly used metric of the Grassmann manifold, i.e., \langle\bm{\eta},\bm{\xi}\rangle_{\mathbf{x}}={\rm tr}\left(\bm{\eta}^{T}\left(\mathbf{I}-\frac{1}{2}\mathbf{U}\mathbf{U}^{T}\right)\bm{\xi}\right) for \bm{\eta},\bm{\xi}\in T_{\mathbf{x}}{\rm Gr}(r,d) (Section 2.5 of edelman1998geometry ). Expressing two given tangent vectors as \bm{\eta}=\mathbf{U}^{\bot}\mathbf{B}_{\eta} and \bm{\xi}=\mathbf{U}^{\bot}\mathbf{B}_{\xi} with \mathbf{B}_{\eta},\mathbf{B}_{\xi}\in\mathbb{R}^{(d-r)\times r}, we have

\langle\bm{\eta},\bm{\xi}\rangle_{\mathbf{x}}={\rm tr}\left(\left(\mathbf{U}^{\bot}\mathbf{B}_{\eta}\right)^{T}\left(\mathbf{I}-\frac{1}{2}\mathbf{U}\mathbf{U}^{T}\right)\mathbf{U}^{\bot}\mathbf{B}_{\xi}\right)={\rm tr}\left(\left(\mathbf{U}^{\bot}\mathbf{B}_{\eta}\right)^{T}\mathbf{U}^{\bot}\mathbf{B}_{\xi}\right)={\rm tr}\left(\bm{\eta}^{T}\bm{\xi}\right). (2)

Here we provide a few examples of the key operations explained earlier on the Grassmann manifold, taken from boumal2014manopt . Given a data point [\mathbf{U}], a moving direction \bm{\eta} and the step size t, one way to construct the retraction mapping is to perform singular value decomposition (SVD) on \mathbf{U}+t\bm{\eta}, i.e., \mathbf{U}+t\bm{\eta}=\bar{\mathbf{U}}\mathbf{S}\bar{\mathbf{V}}^{T}, and the new data point after moving is \left[\bar{\mathbf{U}}\bar{\mathbf{V}}^{T}\right]. A transport operation can be implemented by projecting a given tangent vector using the orthogonal projector \mathbf{I}-\mathbf{U}\mathbf{U}^{T}. Both the Riemannian gradient and Hessian can be computed by projecting the Euclidean gradient and Hessian of f(\mathbf{U}) using the same projector \mathbf{I}-\mathbf{U}\mathbf{U}^{T}.
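To make these operations concrete, below is a minimal NumPy sketch (not taken from the paper's implementation, and with illustrative function names) of the SVD-based retraction, the projector-based transport, and the projected Riemannian gradient for the PCA cost on the Grassmann manifold; the Euclidean gradient uses the simplification \left\|\mathbf{z}-\mathbf{UU}^{T}\mathbf{z}\right\|^{2}=\|\mathbf{z}\|^{2}-\|\mathbf{U}^{T}\mathbf{z}\|^{2}, valid when \mathbf{U}^{T}\mathbf{U}=\mathbf{I}.

```python
import numpy as np

def grassmann_retract(U, eta, t=1.0):
    """Retraction via SVD: factor U + t*eta = Ubar S Vbar^T and return Ubar Vbar^T."""
    Ubar, _, VbarT = np.linalg.svd(U + t * eta, full_matrices=False)
    return Ubar @ VbarT

def grassmann_project(U, V):
    """Project an ambient matrix V onto the tangent space at [U] with I - U U^T
    (the same projector also serves as a simple vector transport)."""
    return V - U @ (U.T @ V)

def pca_cost(U, Z):
    """f(U) = (1/n) sum_i ||z_i - U U^T z_i||^2, samples stored as rows of Z."""
    return np.mean(np.sum(Z**2, axis=1) - np.sum((Z @ U)**2, axis=1))

def pca_rgrad(U, Z):
    """Riemannian gradient: tangent projection of the Euclidean gradient -(2/n) Z^T Z U."""
    egrad = -(2.0 / Z.shape[0]) * Z.T @ (Z @ U)
    return grassmann_project(U, egrad)

# Usage: move a point on Gr(3, 8) along the projected negative gradient direction.
rng = np.random.default_rng(0)
Z = rng.standard_normal((100, 8))                    # n = 100 samples in d = 8 dimensions
U = np.linalg.qr(rng.standard_normal((8, 3)))[0]     # a point on Gr(3, 8)
U_new = grassmann_retract(U, -pca_rgrad(U, Z), t=0.1)
```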

2.1 First-order Algorithms

To optimize the problem in Eq. (1), the first-order Riemannian optimization algorithm RSGD updates the solution at each k-th iteration by using an f_{i} instance, as

\mathbf{x}_{k+1}=R_{\mathbf{x}_{k}}\left(-\beta_{k}\,\text{grad}f_{i}\left(\mathbf{x}_{k}\right)\right), (3)

where \beta_{k} is the step size. Assume that the algorithm runs for multiple epochs, referred to as the outer iterations. Each epoch contains multiple inner iterations, each of which corresponds to a randomly selected f_{i} for calculating the update. Letting \mathbf{x}_{k}^{t} be the solution at the t-th inner iteration of the k-th outer iteration and \tilde{\mathbf{x}}_{k} be the solution at the last inner iteration of the k-th outer iteration, RSVRG employs a variance-reduced extension zhang2016riemannian of the update defined in Eq. (3), given as

\mathbf{x}_{k}^{t+1}=R_{\mathbf{x}_{k}^{t}}\left(-\beta_{k}\mathbf{v}_{k}^{t}\right), (4)

where

\mathbf{v}_{k}^{t}=\text{grad}f_{i}\left(\mathbf{x}_{k}^{t}\right)-\mathcal{T}_{\tilde{\mathbf{x}}_{k-1}}^{\mathbf{x}_{k}^{t}}\left(\text{grad}f_{i}\left(\tilde{\mathbf{x}}_{k-1}\right)-\text{grad}f\left(\tilde{\mathbf{x}}_{k-1}\right)\right). (5)

Here, the full gradient information {\rm grad}f\left(\tilde{\mathbf{x}}_{k-1}\right) is used to reduce the variance in the stochastic gradient \mathbf{v}_{k}^{t}. As a later development, RSRG kasai2018riemannian suggests a recursive formulation to improve the variance-reduced gradient \mathbf{v}_{k}^{t}. Starting from \mathbf{v}_{k}^{0}={\rm grad}f\left(\tilde{\mathbf{x}}_{k-1}\right), it updates by

\mathbf{v}_{k}^{t}=\text{grad}f_{i}\left(\mathbf{x}_{k}^{t}\right)-\mathcal{T}_{\mathbf{x}_{k}^{t-1}}^{\mathbf{x}_{k}^{t}}\left(\text{grad}f_{i}\left(\mathbf{x}_{k}^{t-1}\right)\right)+\mathcal{T}_{\mathbf{x}_{k}^{t-1}}^{\mathbf{x}_{k}^{t}}\left(\mathbf{v}_{k}^{t-1}\right). (6)

This formulation is designed to avoid the accumulated error caused by a distant vector transport.
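For illustration, below is a minimal sketch of the RSGD update in Eq. (3) and the recursive direction of Eq. (6), reusing the Grassmann helpers (grassmann_retract, grassmann_project, pca_rgrad) from the sketch in Section 2 and using the tangent projector as a simple vector transport; it is an illustrative reading of the formulas, not the implementation of the cited works.

```python
def rsgd_step(U, Z, beta, batch):
    """One RSGD iteration, Eq. (3), using a mini-batch of sample indices."""
    G = pca_rgrad(U, Z[batch])              # stochastic Riemannian gradient estimate
    return grassmann_retract(U, -beta * G)

def rsrg_direction(U_prev, U_curr, Z, batch, v_prev):
    """Recursive variance-reduced direction of Eq. (6); the tangent projector
    at the new point plays the role of the vector transport."""
    g_curr = pca_rgrad(U_curr, Z[batch])
    g_prev_t = grassmann_project(U_curr, pca_rgrad(U_prev, Z[batch]))
    v_prev_t = grassmann_project(U_curr, v_prev)
    return g_curr - g_prev_t + v_prev_t
```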

2.2 Inexact RTR

For second-order Riemannian optimization, the Inexact RTR kasai2018inexact improves the standard RTR absil2007trust through subsampling. It optimizes an approximation of the objective function, formulated using the second-order Taylor expansion within a trust region of radius \Delta_{k} around \mathbf{x}_{k} at iteration k. A moving direction \bm{\eta}_{k} within the trust region is found by solving the subproblem at iteration k:

\bm{\eta}_{k}=\mathop{\arg\min}_{\bm{\eta}\in T_{\mathbf{x}_{k}}\mathcal{M}}\; f(\mathbf{x}_{k})+\langle\mathbf{G}_{k},\bm{\eta}\rangle_{\mathbf{x}_{k}}+\frac{1}{2}\langle\bm{\eta},\mathbf{H}_{k}[\bm{\eta}]\rangle_{\mathbf{x}_{k}},\quad\text{subject to }\left\|\bm{\eta}\right\|_{\mathbf{x}_{k}}\leq\Delta_{k}, (7)

where \mathbf{G}_{k} and \mathbf{H}_{k}[\bm{\eta}] are the approximate Riemannian gradient and Hessian calculated by using the subsampling technique. The approximation is based on the current solution \mathbf{x}_{k} and the moving direction \bm{\eta}, calculated as

\mathbf{G}_{k}=\frac{1}{\left|\mathcal{S}_{g}\right|}\sum_{i\in\mathcal{S}_{g}}\text{grad}f_{i}(\mathbf{x}_{k}), (8)
\mathbf{H}_{k}[\bm{\eta}]=\frac{1}{\left|\mathcal{S}_{H}\right|}\sum_{i\in\mathcal{S}_{H}}\text{Hess}f_{i}(\mathbf{x}_{k})[\bm{\eta}], (9)

where \mathcal{S}_{g},\mathcal{S}_{H}\subset\left\{1,...,n\right\} are the sets of the subsampled indices used for estimating the Riemannian gradient and Hessian. The updated solution \mathbf{x}_{k+1}=R_{\mathbf{x}_{k}}(\bm{\eta}_{k}) will be accepted and \Delta_{k} will be increased if the decrease of the true objective function f is sufficiently large as compared to that of the approximate objective used in Eq. (7). Otherwise, \Delta_{k} will be decreased because of the poor agreement between the approximate and true objectives.
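The subsampled estimates in Eqs. (8)-(9) amount to averaging per-sample Riemannian gradients and Hessian-vector products over the index sets; a minimal sketch follows, where rgrad_i(x, i) and rhess_i(x, i, eta) are assumed, hypothetical callables returning {\rm grad}f_{i}(\mathbf{x}) and {\rm Hess}f_{i}(\mathbf{x})[\bm{\eta}], and rng is a NumPy random Generator.

```python
def sample_indices(n, batch_g, batch_H, rng):
    """Draw the index sets S_g and S_H uniformly without replacement."""
    idx_g = rng.choice(n, size=min(batch_g, n), replace=False)
    idx_H = rng.choice(n, size=min(batch_H, n), replace=False)
    return idx_g, idx_H

def subsampled_gradient(x, rgrad_i, idx_g):
    """G_k in Eq. (8): average of per-sample Riemannian gradients over S_g."""
    return sum(rgrad_i(x, i) for i in idx_g) / len(idx_g)

def subsampled_hessian(x, rhess_i, idx_H):
    """H_k in Eq. (9): returns a linear operator eta -> mean_i Hess f_i(x)[eta]."""
    def H(eta):
        return sum(rhess_i(x, i, eta) for i in idx_H) / len(idx_H)
    return H
```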

3 Proposed Method

3.1 Inexact Sub-RN-CR Algorithm

We propose to improve the subsampling-based construction of the RTR subproblem in Eq. (7) by cubic regularization. This gives rise to the minimization

\bm{\eta}_{k}=\mathop{\arg\min}_{\bm{\eta}\in T_{\mathbf{x}_{k}}\mathcal{M}}\hat{m}_{k}(\bm{\eta}), (10)

where

\hat{m}_{k}(\bm{\eta})=\left\{\begin{aligned} &h_{\mathbf{x}_{k}}(\bm{\eta})+\langle\mathbf{G}_{k},\bm{\eta}\rangle_{\mathbf{x}_{k}}, &&\text{ if }\|\mathbf{G}_{k}\|_{\mathbf{x}_{k}}\geq\epsilon_{g},\\ &h_{\mathbf{x}_{k}}(\bm{\eta}), &&\text{ otherwise}.\end{aligned}\right. (11)

Here, 0<\epsilon_{g}<1 is a user-specified parameter that plays a role in the convergence analysis, which we will explain later. The core objective component h_{\mathbf{x}_{k}}(\bm{\eta}) is formulated by extending the adaptive cubic regularization technique cartis2011adaptive , given as

h_{\mathbf{x}_{k}}(\bm{\eta})=f(\mathbf{x}_{k})+\frac{1}{2}\langle\bm{\eta},\mathbf{H}_{k}[\bm{\eta}]\rangle_{\mathbf{x}_{k}}+\frac{\sigma_{k}}{3}\|\bm{\eta}\|_{\mathbf{x}_{k}}^{3}, (12)

with

\sigma_{k}=\left\{\begin{array}{ll}\max\left(\frac{\sigma_{k-1}}{\gamma},\epsilon_{\sigma}\right), &\text{if}\ \rho_{k-1}\geq\tau,\\ \gamma\sigma_{k-1}, &\text{otherwise,}\end{array}\right. (13)

and

\rho_{k}=\frac{\hat{f}_{k}(\mathbf{0}_{\mathbf{x}_{k}})-\hat{f}_{k}(\bm{\eta}_{k})}{\hat{m}_{k}(\mathbf{0}_{\mathbf{x}_{k}})-\hat{m}_{k}(\bm{\eta}_{k})}, (14)

where the subscript k is used to highlight the pullback of f at \mathbf{x}_{k}, written as \hat{f}_{k}(\cdot). Overall, there are four hyper-parameters to be set by the user: the lower bound 0<\epsilon_{\sigma}<1 on the trust parameter, the dynamic control parameter \gamma>1 that adjusts the cubic regularization weight, the model validity threshold 0<\tau<1, and the initial trust parameter \sigma_{0}. We will discuss the setup of the algorithm in more detail.

Algorithm 1 Main Inexact Sub-RN-CR Solver
Input: \epsilon_{\sigma}\in(0,1), \gamma>1, 0<\tau<1, \sigma_{0}>0, 0<\epsilon_{g},\epsilon_{H}<1.
1:  for k=1,2,\ldots do
2:     Sample the index sets \mathcal{S}_{g} and \mathcal{S}_{H}
3:     Compute the subsampled gradient \mathbf{G}_{k} and \lambda_{min}(\mathbf{H}_{k}) based on Eqs. (8)-(9) and (16)
4:     if \|\mathbf{G}_{k}\|_{\mathbf{x}_{k}}\leq\epsilon_{g} and \lambda_{min}(\mathbf{H}_{k})\geq-\epsilon_{H} then
5:        Return \mathbf{x}_{k}
6:     else if \|\mathbf{G}_{k}\|_{\mathbf{x}_{k}}\leq\epsilon_{g} then
7:        \mathbf{G}_{k}=\mathbf{0}_{\mathbf{x}_{k}}
8:     end if
9:     Inexactly solve \bm{\eta}_{k}^{*}=\mathop{\arg\min}_{\bm{\eta}\in T_{\mathbf{x}_{k}}\mathcal{M}}\hat{m}_{k}(\bm{\eta}) by Algorithm 2 or Algorithm 3
10:     Calculate \rho_{k}=\frac{\hat{f}_{k}(\mathbf{0}_{\mathbf{x}_{k}})-\hat{f}_{k}(\bm{\eta}_{k}^{*})}{\hat{m}_{k}(\mathbf{0}_{\mathbf{x}_{k}})-\hat{m}_{k}(\bm{\eta}_{k}^{*})}
11:     Set \mathbf{x}_{k+1}=R_{\mathbf{x}_{k}}(\bm{\eta}_{k}^{*}) if \rho_{k}\geq\tau, and \mathbf{x}_{k+1}=\mathbf{x}_{k} otherwise
12:     Set \sigma_{k+1}=\max(\sigma_{k}/\gamma,\epsilon_{\sigma}) if \rho_{k}\geq\tau, and \sigma_{k+1}=\gamma\sigma_{k} otherwise
13:  end for
Output: \mathbf{x}_{k}

We expect the norm of the approximate gradient to approach \epsilon_{g} with 0<\epsilon_{g}<1. Following a similar treatment in kasai2018inexact , when the gradient norm is smaller than \epsilon_{g}, the gradient-based term is ignored. This is important to the convergence analysis shown in the next section.

The trust-region radius \Delta_{k} is no longer explicitly defined, but replaced by the cubic regularization term \frac{\sigma_{k}}{3}\|\bm{\eta}\|_{\mathbf{x}_{k}}^{3}, where \sigma_{k} is related to a Lagrange multiplier on a cubic trust-region constraint. Naturally, the smaller \sigma_{k} is, the larger a moving step is allowed. Benefits of cubic regularization have been shown in griewank1981modification ; kohler2017sub . It can not only accelerate the local convergence, especially when the Hessian is singular, but also help escape strict saddle points better than the TR methods, providing stronger convergence properties.

The cubic term \frac{\sigma_{k}}{3}\|\bm{\eta}\|_{\mathbf{x}_{k}}^{3} is equipped with a dynamic penalization control through the adaptive trust quantity \sigma_{k}\geq 0. The value of \sigma_{k} is determined by examining how successful each iteration k is. An iteration k is considered successful if \rho_{k}\geq\tau, and unsuccessful otherwise, where the value of \rho_{k} quantifies the agreement between the changes of the approximate objective \hat{m}_{k}(\bm{\eta}) and the true objective f(\mathbf{x}). The larger \rho_{k} is, the more effective the approximate model is. Given \gamma>1, in an unsuccessful iteration, \sigma_{k} is increased to \gamma\sigma_{k}, hoping to obtain a more accurate approximation in the next iteration. Conversely, in a successful iteration, \sigma_{k} is decreased to \frac{\sigma_{k}}{\gamma}, relaxing the approximation, but it is still restricted to stay above the lower bound \epsilon_{\sigma}. This bound \epsilon_{\sigma} helps avoid solution candidates with overly large norms \|\bm{\eta}_{k}\|_{\mathbf{x}_{k}} that can cause an unstable optimization. Below we formally define what an (un)successful iteration is, which will be used in our later analysis.
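A minimal sketch of this acceptance rule and the update of \sigma_{k} (Steps (10)-(12) of Algorithm 1 and Eqs. (13)-(14)) is given below; f_pullback and m_hat are assumed callables evaluating \hat{f}_{k} and \hat{m}_{k} on tangent vectors, and the function names are illustrative.

```python
def agreement_ratio(f_pullback, m_hat, eta_star, zero_eta):
    """rho_k of Eq. (14): actual versus model decrease of the pullback objective."""
    actual = f_pullback(zero_eta) - f_pullback(eta_star)
    predicted = m_hat(zero_eta) - m_hat(eta_star)
    return actual / predicted

def accept_and_update_sigma(rho, sigma, x, x_proposal, gamma, tau, eps_sigma):
    """Acceptance test and trust-quantity update, Steps (11)-(12) of Algorithm 1."""
    if rho >= tau:                      # successful iteration: accept the step, relax sigma
        return x_proposal, max(sigma / gamma, eps_sigma)
    return x, gamma * sigma             # unsuccessful: keep the iterate, tighten sigma
```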

Definition 1 (Successful and Unsuccessful Iterations)

An iteration k in Algorithm 1 is considered successful if the agreement \rho_{k}\geq\tau, and unsuccessful if \rho_{k}<\tau. In addition, based on Step (12) of Algorithm 1, a successful iteration has \sigma_{k+1}\leq\sigma_{k}, while an unsuccessful one has \sigma_{k+1}>\sigma_{k}.

3.2 Optimality Examination

The stopping condition of the algorithm follows the definition of \left(\epsilon_{g},\epsilon_{H}\right)-optimality nocedal2006numerical , stated as below.

Definition 2 (\left(\epsilon_{g},\epsilon_{H}\right)-optimality)

Given 0<\epsilon_{g},\epsilon_{H}<1, a solution \mathbf{x} satisfies \left(\epsilon_{g},\epsilon_{H}\right)-optimality if

\left\|{\rm grad}f(\mathbf{x})\right\|_{\mathbf{x}}\leq\epsilon_{g}\quad{\rm and}\quad{\rm Hess}f(\mathbf{x})[\bm{\eta}]\succeq-\epsilon_{H}\mathbf{I}, (15)

for all \bm{\eta}\in T_{\mathbf{x}}\mathcal{M}, where \mathbf{I} is an identity matrix.

This is a relaxation and a manifold extension of the standard second-order optimality conditions \left\|{\rm grad}f(\mathbf{x})\right\|_{\mathbf{x}}=0 and \text{Hess}f(\mathbf{x})\succeq 0 in Euclidean spaces mokhtari2018escaping . The algorithm stops (1) when the gradient norm is sufficiently small and (2) when the Hessian is sufficiently close to being positive semidefinite.

To examine the Hessian, we follow a similar approach as in han2021riemannian by assessing the solution of the following minimization problem:

\lambda_{min}(\mathbf{H}_{k}):=\min_{\bm{\eta}\in T_{\mathbf{x}}\mathcal{M},\ \left\|\bm{\eta}\right\|_{\mathbf{x}}=1}\langle\bm{\eta},\mathbf{H}_{k}[\bm{\eta}]\rangle_{\mathbf{x}_{k}}, (16)

which resembles the smallest eigenvalue of the Riemannian Hessian. As a result, the algorithm stops when \left\|{\rm grad}f(\mathbf{x})\right\|_{\mathbf{x}}\leq\epsilon_{g} (referred to as the gradient condition) and when \lambda_{min}(\mathbf{H}_{k})\geq-\epsilon_{H} (referred to as the Hessian condition), where \epsilon_{g},\epsilon_{H}\in(0,1) are the user-set stopping parameters. Note that we use the same \epsilon_{g} for thresholding as in Eq. (11). Pseudo code of the complete Inexact Sub-RN-CR is provided in Algorithm 1.
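In an implementation, \lambda_{min}(\mathbf{H}_{k}) in Eq. (16) can be approximated by materializing the Hessian operator on an orthonormal basis of the tangent space and taking the smallest eigenvalue of the resulting symmetric matrix; a minimal sketch follows, where basis (a list of orthonormal tangent vectors) and inner (the Riemannian metric) are assumed to be supplied by the manifold implementation.

```python
import numpy as np

def smallest_hessian_eigenvalue(H, basis, inner):
    """Approximate lambda_min(H_k) of Eq. (16): represent the Hessian operator
    in an orthonormal tangent basis {q_i} and take the smallest eigenvalue."""
    D = len(basis)
    M = np.empty((D, D))
    for j, qj in enumerate(basis):
        Hqj = H(qj)                      # Hessian-vector product H_k[q_j]
        for i, qi in enumerate(basis):
            M[i, j] = inner(qi, Hqj)
    M = 0.5 * (M + M.T)                  # symmetrize against round-off error
    return np.linalg.eigvalsh(M)[0]

def is_eps_optimal(grad_norm, lam_min, eps_g, eps_H):
    """Stopping test of Algorithm 1: gradient and Hessian conditions."""
    return grad_norm <= eps_g and lam_min >= -eps_H
```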

3.3 Subproblem Solvers

Step (9) of Algorithm 1 requires solving the subproblem in Eq. (10). We rewrite its objective function \hat{m}_{k}(\bm{\eta}) as below for the convenience of explanation:

\bm{\eta}_{k}^{*}=\arg\min_{\bm{\eta}\in T_{\mathbf{x}_{k}}\mathcal{M}}f(\mathbf{x}_{k})+\delta\langle\mathbf{G}_{k},\bm{\eta}\rangle_{\mathbf{x}_{k}}+\frac{1}{2}\langle\bm{\eta},\mathbf{H}_{k}[\bm{\eta}]\rangle_{\mathbf{x}_{k}}+\frac{\sigma_{k}}{3}\|\bm{\eta}\|_{\mathbf{x}_{k}}^{3}, (17)

where \delta=1 if \|\mathbf{G}_{k}\|_{\mathbf{x}_{k}}\geq\epsilon_{g}, and \delta=0 otherwise. We demonstrate two solvers commonly used in practice.

3.3.1 The Lanczos Method

The Lanczos method 1999Solving has been widely used to solve the cubic regularization problem in Euclidean spaces xu2020newton ; kohler2017sub ; cartis2011adaptive ; jia2021solving and has recently been adapted to Riemannian spaces agarwal2021adaptive . Let D denote the manifold dimension. Its core operation is to construct a Krylov subspace \mathcal{K}_{D}, of which the basis \{\mathbf{q}_{i}\}_{i=1}^{D} spans the tangent space T_{\mathbf{x}_{k}}\mathcal{M} in which \bm{\eta} lies. After expressing the solution as an element of \mathcal{K}_{D}, i.e., \bm{\eta}:=\sum_{i=1}^{D}y_{i}\mathbf{q}_{i}, the minimization problem in Eq. (17) can be converted to one in the Euclidean space \mathbb{R}^{D}, as

\mathbf{y}^{*}=\arg\min_{\mathbf{y}\in\mathbb{R}^{D}}\ y_{1}\delta\left\|\mathbf{G}_{k}\right\|_{\mathbf{x}_{k}}+\frac{1}{2}\mathbf{y}^{T}\mathbf{T}_{D}\mathbf{y}+\frac{\sigma_{k}}{3}\left\|\mathbf{y}\right\|_{2}^{3}, (18)

where \mathbf{T}_{D}\in\mathbb{R}^{D\times D} is a symmetric tridiagonal matrix determined by the basis construction process, e.g., Algorithm 1 of jia2021solving . We provide a detailed derivation of Eq. (18) in Appendix A. The global solution of this converted problem, i.e., \mathbf{y}^{*}=\left[y_{1}^{*},y_{2}^{*},\ldots,y_{D}^{*}\right], can be found by many existing techniques, see press2007chapter . We employ the Newton root-finding method adopted by agarwal2021adaptive , which was originally proposed in Section 6 of cartis2011adaptive . It reduces the problem to a univariate root-finding problem. After this, the global solution of the subproblem is computed by \bm{\eta}_{k}^{*}=\sum_{i=1}^{D}y_{i}^{*}\mathbf{q}_{i}.
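For reference, below is a minimal sketch of this root-finding view of Eq. (18): the stationarity condition (\mathbf{T}_{D}+\lambda\mathbf{I})\mathbf{y}=-\delta\|\mathbf{G}_{k}\|_{\mathbf{x}_{k}}\mathbf{e}_{1} with \lambda=\sigma_{k}\|\mathbf{y}\| is solved for \lambda via an eigendecomposition of the small matrix \mathbf{T}_{D} and bracketed root finding. It assumes the gradient term is active (\delta=1), ignores the so-called hard case, and is not the exact routine used in the paper.

```python
import numpy as np
from scipy.optimize import brentq

def solve_reduced_cubic(T, g_norm, sigma):
    """Sketch of a global solver for Eq. (18):
        min_y  g_norm * y_1 + 0.5 * y^T T y + (sigma/3) * ||y||^3,  g_norm > 0.
    Stationarity gives (T + lam*I) y = -g with g = g_norm*e_1, lam = sigma*||y||,
    and lam >= max(0, -lambda_min(T)); the hard case is ignored in this sketch."""
    d, Q = np.linalg.eigh(T)                    # T = Q diag(d) Q^T, d ascending
    g = np.zeros(T.shape[0])
    g[0] = g_norm
    gt = Q.T @ g

    def phi(lam):                               # root of phi gives lam = sigma*||y(lam)||
        return np.linalg.norm(gt / (d + lam)) - lam / sigma

    lam_lo = max(0.0, -d[0]) + 1e-12
    lam_hi = lam_lo + 1.0
    while phi(lam_hi) > 0:                      # expand bracket until the sign changes
        lam_hi *= 2.0
    lam = brentq(phi, lam_lo, lam_hi)
    return -Q @ (gt / (d + lam))                # y* in the Krylov coordinates
```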

Algorithm 2 Subproblem Solver by Lanczos agarwal2021adaptive
Input: \mathbf{G}_{k} and \mathbf{H}_{k}[\bm{\eta}], \kappa_{\theta}\in(0,1/6], \sigma_{k}.
1:  \mathbf{q}_{1}=\frac{\mathbf{G}_{k}}{\left\|\mathbf{G}_{k}\right\|_{\mathbf{x}_{k}}}, \mathbf{T}=\mathbf{0}\in\mathbb{R}^{D\times D}, \alpha=\langle\mathbf{q}_{1},\mathbf{H}_{k}[\mathbf{q}_{1}]\rangle_{\mathbf{x}_{k}}, T_{1,1}=\alpha
2:  \mathbf{r}=\mathbf{H}_{k}[\mathbf{q}_{1}]-\alpha\mathbf{q}_{1}
3:  for l=1,2,\ldots,D do
4:     Obtain \mathbf{y}^{*} by optimizing Eq. (18) with D=l using Newton root finding
5:     \beta=\left\|\mathbf{r}\right\|_{\mathbf{x}_{k}}
6:     \mathbf{q}_{l+1}=-\frac{\mathbf{r}}{\beta}
7:     \alpha=\langle\mathbf{q}_{l+1},\mathbf{H}_{k}[\mathbf{q}_{l+1}]-\beta\mathbf{q}_{l}\rangle_{\mathbf{x}_{k}}
8:     \mathbf{r}=\mathbf{H}_{k}[\mathbf{q}_{l+1}]-\beta\mathbf{q}_{l}-\alpha\mathbf{q}_{l+1}
9:     T_{l,l+1}=T_{l+1,l}=\beta, T_{l+1,l+1}=\alpha
10:     if Eq. (47) is satisfied then
11:        Return \sum_{i=1}^{l}y_{i}^{*}\mathbf{q}_{i}
12:     end if
13:  end for

In practice, when the manifold dimension D is large, it is more practical to find a good solution rather than the global solution. By working with a lower-dimensional Krylov subspace \mathcal{K}_{l} with l<D, one can derive Eq. (18) in \mathbb{R}^{l}, and its solution \mathbf{y}^{*l} results in a subproblem solution \bm{\eta}_{k}^{*l}=\sum_{i=1}^{l}y_{i}^{*l}\mathbf{q}_{i}. Both the global solution \bm{\eta}_{k}^{*} and the approximate solution \bm{\eta}_{k}^{*l} are always guaranteed to be at least as good as the solution obtained by performing a line search along the gradient direction, i.e.,

\hat{m}_{k}\left(\bm{\eta}_{k}^{*}\right)\leq\hat{m}_{k}\left(\bm{\eta}_{k}^{*l}\right)\leq\min_{\alpha\in\mathbb{R}}\hat{m}_{k}(\alpha\mathbf{G}_{k}), (19)

because \alpha\mathbf{G}_{k} is a common basis vector shared by all the constructed Krylov subspaces \{\mathcal{K}_{l}\}_{l=1}^{D}. We provide pseudo code for the Lanczos subproblem solver in Algorithm 2.

To benefit practitioners and improve understanding of the Lanczos solver, we analyse the gap between a practical solution \bm{\eta}_{k}^{*l} and the global minimizer \bm{\eta}_{k}^{*}. Firstly, we define \lambda_{max}(\mathbf{H}_{k}) in a similar manner to \lambda_{min}(\mathbf{H}_{k}) in Eq. (16). It resembles the largest eigenvalue of the Riemannian Hessian, as

\lambda_{max}(\mathbf{H}_{k}):=\max_{\bm{\eta}\in T_{\mathbf{x}}\mathcal{M},\ \left\|\bm{\eta}\right\|_{\mathbf{x}}=1}\langle\bm{\eta},\mathbf{H}_{k}[\bm{\eta}]\rangle_{\mathbf{x}_{k}}. (20)

We denote a degree-l polynomial evaluated at \mathbf{H}_{k}[\bm{\eta}] by p_{l}\left(\mathbf{H}_{k}\right)[\bm{\eta}], such that

p_{l}\left(\mathbf{H}_{k}\right)[\bm{\eta}]:=c_{l}\mathbf{H}_{k}^{l}[\bm{\eta}]+c_{l-1}\mathbf{H}_{k}^{l-1}[\bm{\eta}]+\cdots+c_{1}\mathbf{H}_{k}[\bm{\eta}]+c_{0}\bm{\eta}, (21)

for some coefficients c_{0},c_{1},\ldots,c_{l}\in\mathbb{R}. The quantity \mathbf{H}_{k}^{l}[\bm{\eta}] is recursively defined by \mathbf{H}_{k}\left[\mathbf{H}_{k}^{l-1}\left[\bm{\eta}\right]\right] for l=2,3,\ldots. We define below an induced norm, as

\left\|p_{l+1}\left(\tilde{\mathbf{H}}_{k}\right)-{\rm Id}\right\|_{\mathbf{x}_{k}}=\sup_{\bm{\eta}\in T_{\mathbf{x}_{k}}\mathcal{M},\,\|\bm{\eta}\|_{\mathbf{x}_{k}}\neq 0}\frac{\left\|\left(p_{l+1}\left(\tilde{\mathbf{H}}_{k}\right)-{\rm Id}\right)[\bm{\eta}]\right\|_{\mathbf{x}_{k}}}{\left\|\bm{\eta}\right\|_{\mathbf{x}_{k}}}, (22)

where the identity mapping operator works as {\rm Id}[\bm{\eta}]=\bm{\eta}. Now we are ready to present our result in the following lemma.

Lemma 1 (Lanczos Solution Gap)

Let \bm{\eta}_{k}^{*} be the global minimizer of the subproblem \hat{m}_{k} in Eq. (10). Denote the subproblem without cubic regularization in Eq. (7) by \bar{m}_{k} and let \bar{\bm{\eta}}_{k}^{*} be its global minimizer. For each l>0, the solution \bm{\eta}_{k}^{*l} returned by Algorithm 2 satisfies

\hat{m}_{k}\left(\bm{\eta}_{k}^{*l}\right)-\hat{m}_{k}\left(\bm{\eta}_{k}^{*}\right)\leq\frac{4\lambda_{max}\left(\tilde{\mathbf{H}}_{k}\right)}{\lambda_{min}\left(\tilde{\mathbf{H}}_{k}\right)}\left(\bar{m}_{k}\left(\mathbf{0}_{\mathbf{x}_{k}}\right)-\bar{m}_{k}\left(\bar{\bm{\eta}}_{k}^{*}\right)\right)\phi_{l}\left(\tilde{\mathbf{H}}_{k}\right)^{2}, (23)

where \tilde{\mathbf{H}}_{k}[\bm{\eta}]:=(\mathbf{H}_{k}+\sigma_{k}\left\|\bm{\eta}_{k}^{*}\right\|_{\mathbf{x}_{k}}{\rm Id})[\bm{\eta}] for a moving direction \bm{\eta}, and \phi_{l}\left(\tilde{\mathbf{H}}_{k}\right) is an upper bound of the induced norm \left\|p_{l+1}\left(\tilde{\mathbf{H}}_{k}\right)-{\rm Id}\right\|_{\mathbf{x}_{k}}.

We provide its proof in Appendix B. In Euclidean spaces, carmon2018analysis has shown that \phi_{l}(\mathbf{H}_{k})=2e^{\frac{-2(l+1)}{\sqrt{\lambda_{max}(\mathbf{H}_{k})\lambda_{min}^{-1}(\mathbf{H}_{k})}}}. With the help of Lemma 1, this can serve as a reference for understanding the solution quality of the Lanczos method in Riemannian spaces.

3.3.2 The Conjugate Gradient Method

We experiment with an alternative subproblem solver by adapting the non-linear conjugate gradient technique to Riemannian spaces. It starts from the initialization \bm{\eta}_{k}^{0}=\mathbf{0}_{\mathbf{x}_{k}} and the first conjugate direction \mathbf{p}_{1}=-\mathbf{G}_{k} (the negative gradient direction). At each inner iteration i (as opposed to the outer iteration k in the main algorithm), it solves the minimization problem with one input variable:

\alpha_{i}^{*}=\arg\min_{\alpha\geq 0}\hat{m}_{k}\left(\bm{\eta}_{k}^{i-1}+\alpha\mathbf{p}_{i}\right). (24)

The global solution of this one-variable minimization problem can be computed by zeroing the derivative of \hat{m}_{k} with respect to \alpha, resulting in a polynomial equation in \alpha, which can then be solved by eigen-decomposition edelman1995polynomial . The root that attains the minimal value of \hat{m}_{k} is retrieved. The algorithm then updates the next conjugate direction \mathbf{p}_{i+1} using the returned \alpha_{i}^{*} and \mathbf{p}_{i}. Pseudo code for the conjugate gradient subproblem solver is provided in Algorithm 3.
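A minimal sketch of this exact line search is shown below: zeroing the derivative of \hat{m}_{k}(\bm{\eta}_{k}^{i-1}+\alpha\mathbf{p}_{i}) and squaring the resulting equation gives a quartic in \alpha, whose roots are obtained from the companion matrix (np.roots) and filtered against the original stationarity equation. Here inner and H are assumed callables for the Riemannian metric and the subsampled Hessian operator, and tangent vectors are plain arrays; this is an illustrative reading of the step, not the paper's exact routine.

```python
import numpy as np

def cg_exact_line_search(inner, G, H, eta, p, sigma, delta=1.0):
    """Solve Eq. (24) by zeroing d/dalpha of m_hat(eta + alpha*p), which after
    squaring becomes a quartic polynomial in alpha."""
    Heta, Hp = H(eta), H(p)
    c0 = delta * inner(G, p) + inner(p, Heta)
    c1 = inner(p, Hp)
    A, B, C = inner(eta, p), inner(p, p), inner(eta, eta)

    # (c0 + c1*a)^2 = sigma^2 * (C + 2*A*a + B*a^2) * (A + B*a)^2
    lhs = np.array([c1**2, 2 * c0 * c1, c0**2])                   # degree-2 coefficients
    quad = np.array([B, 2 * A, C])                                # ||eta + a*p||^2
    lin2 = np.polymul(np.array([B, A]), np.array([B, A]))         # (A + B*a)^2
    rhs = sigma**2 * np.polymul(quad, lin2)                       # degree 4
    poly = np.polysub(rhs, np.pad(lhs, (len(rhs) - len(lhs), 0)))

    def m_hat(alpha):                    # model value (constant f(x_k) omitted)
        v = eta + alpha * p
        return (delta * inner(G, v) + 0.5 * inner(v, H(v))
                + sigma / 3.0 * inner(v, v) ** 1.5)

    best_alpha, best_val = 0.0, m_hat(0.0)
    for r in np.roots(poly):
        if abs(r.imag) < 1e-10 and r.real >= 0:
            a = r.real
            # keep only roots of the original (unsquared) stationarity equation
            res = c0 + c1 * a + sigma * np.sqrt(C + 2*A*a + B*a*a) * (A + B*a)
            if abs(res) < 1e-6 * (1 + abs(c0)) and m_hat(a) < best_val:
                best_alpha, best_val = a, m_hat(a)
    return best_alpha
```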

Algorithm 3 Subproblem Solver by Non-linear Conjugate Gradient
Input: subproblem \hat{m}_{k}(\bm{\eta}), \mathbf{G}_{k}, m, \kappa,\theta>0, \kappa_{\theta}\in(0,1/6].
1:  \bm{\eta}_{k}^{0}=\mathbf{0}_{\mathbf{x}_{k}}, \mathbf{r}_{0}=\mathbf{G}_{k}, \mathbf{p}_{1}=-\mathbf{r}_{0}, \mathbf{x}_{k}^{0}=\mathbf{x}_{k}
2:  for i=1,2,\ldots,m do
3:     Solve Eq. (24) by zeroing its derivative and solving the resulting polynomial equation
4:     if \alpha_{i}^{*}\leq 10^{-10} then
5:        Return \bm{\eta}_{k}^{*}=\bm{\eta}_{k}^{i-1}
6:     end if
7:     \bm{\eta}_{k}^{i}=\bm{\eta}_{k}^{i-1}+\alpha_{i}^{*}\mathbf{p}_{i}
8:     if Eq. (47) is satisfied then
9:        Return \bm{\eta}_{k}^{*}=\bm{\eta}_{k}^{i}
10:     end if
11:     \mathbf{r}_{i}=\mathbf{r}_{i-1}+\alpha_{i}^{*}\mathbf{H}_{k}^{i-1}[\mathbf{p}_{i}]
12:     \mathbf{x}_{k}^{i}=R_{\mathbf{x}_{k}^{i-1}}(\alpha_{i}^{*}\mathbf{p}_{i})
13:     if Eq. (31) is met then
14:        Return \bm{\eta}_{k}^{*}=\bm{\eta}_{k}^{i}
15:     end if
16:     Compute \beta_{i} by Eq. (28)
17:     \mathbf{p}_{i+1}=-\mathbf{r}_{i}+\beta_{i}\mathcal{P}_{\alpha_{i}^{*}\mathbf{p}_{i}}\mathbf{p}_{i}
18:  end for
Output: \bm{\eta}_{k}^{*}=\bm{\eta}_{k}^{m}

Convergence of a conjugate gradient method largely depends on how its conjugate direction is updated. This is controlled by the setting of \beta_{i} for calculating \mathbf{p}_{i+1}=-\mathbf{r}_{i}+\beta_{i}\mathcal{P}_{\alpha_{i}^{*}\mathbf{p}_{i}}\mathbf{p}_{i} in Step (17) of Algorithm 3. Working in Riemannian spaces under the subsampling setting, it has been proven by sakai2021sufficient that, when the Fletcher-Reeves formula fletcher1964function is used, i.e.,

\beta_{i}=\frac{\left\|\mathbf{G}_{k}^{i}\right\|_{\mathbf{x}_{k}^{i}}^{2}}{\left\|\mathbf{G}_{k}^{i-1}\right\|_{\mathbf{x}_{k}^{i-1}}^{2}}, (25)

where

\mathbf{G}_{k}^{i}=\frac{1}{|\mathcal{S}_{g}|}\sum_{j\in\mathcal{S}_{g}}{\rm grad}f_{j}\left(\mathbf{x}_{k}^{i}\right), (26)

a conjugate gradient method can converge to a stationary point with \lim_{i\to\infty}\left\|{\rm grad}f\left(\mathbf{x}_{k}^{i}\right)\right\|_{\mathbf{x}_{k}^{i}}=0. Working in Euclidean spaces, wei2006convergence has shown that the Polak–Ribiere–Polyak formula, i.e.,

\beta_{i}=\frac{\left\langle\nabla f\left(\mathbf{x}_{k}^{i}\right),\nabla f\left(\mathbf{x}_{k}^{i}\right)-\nabla f\left(\mathbf{x}_{k}^{i-1}\right)\right\rangle}{\left\|\nabla f\left(\mathbf{x}_{k}^{i-1}\right)\right\|^{2}}, (27)

performs better than the Fletcher-Reeves formula. Building upon these, we propose to compute \beta_{i} by a modified Polak–Ribiere–Polyak formula in Riemannian spaces in Step (16) of Algorithm 3, given as

\beta_{i}=\frac{\left\langle\mathbf{r}_{i},\mathbf{r}_{i}-\frac{\left\|\mathbf{r}_{i}\right\|_{\mathbf{x}_{k}^{i}}}{\left\|\mathbf{r}_{i-1}\right\|_{\mathbf{x}_{k}^{i-1}}}\mathcal{P}_{\alpha_{i}^{*}\mathbf{p}_{i}}\mathbf{r}_{i-1}\right\rangle_{\mathbf{x}_{k}^{i}}}{2\left\langle\mathbf{r}_{i-1},\mathbf{r}_{i-1}\right\rangle_{\mathbf{x}_{k}^{i-1}}}. (28)
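As an illustration, Eq. (28) can be evaluated as below once \mathbf{r}_{i-1} has been transported to the tangent space of the new iterate; the sketch assumes, as in the Grassmann example of Section 2, a point-independent (Euclidean) metric inner, so the inner products at \mathbf{x}_{k}^{i} and \mathbf{x}_{k}^{i-1} use the same callable.

```python
def modified_prp_beta(inner, r_curr, r_prev, r_prev_transported):
    """beta_i of Eq. (28), a Riemannian Polak-Ribiere-Polyak-type coefficient;
    r_prev_transported is P_{alpha* p_i}(r_{i-1}) carried to the new tangent space."""
    scale = (inner(r_curr, r_curr) ** 0.5) / (inner(r_prev, r_prev) ** 0.5)
    num = inner(r_curr, r_curr - scale * r_prev_transported)
    den = 2.0 * inner(r_prev, r_prev)
    return num / den
```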

We prove that the resulting algorithm converges to a stationary point, and present the convergence result in Theorem 1, with its proof deferred to Appendix C.

Theorem 1 (Convergence of the Conjugate Gradient Solver)

Assume that the step size \alpha_{i}^{*} in Algorithm 3 satisfies the strong Wolfe conditions hosseini2018line , i.e., given a smooth function f:\mathcal{M}\to\mathbb{R}, it holds that

f\left(R_{\mathbf{x}_{k}^{i-1}}(\alpha_{i}^{*}\mathbf{p}_{i})\right)\leq f\left(\mathbf{x}_{k}^{i-1}\right)+c_{1}\alpha_{i}^{*}\left\langle\mathbf{G}_{k}^{i-1},\mathbf{p}_{i}\right\rangle_{\mathbf{x}_{k}^{i-1}}, (29)
\left|\left\langle\mathbf{G}_{k}^{i},\mathcal{P}_{\alpha_{i}^{*}\mathbf{p}_{i}}(\mathbf{p}_{i})\right\rangle_{\mathbf{x}_{k}^{i}}\right|\leq-c_{2}\left\langle\mathbf{G}_{k}^{i-1},\mathbf{p}_{i}\right\rangle_{\mathbf{x}_{k}^{i-1}}, (30)

with 0<c_{1}<c_{2}<1. When Step (16) of Algorithm 3 computes \beta_{i} by Eq. (28), Algorithm 3 converges to a stationary point, i.e., \lim_{i\to\infty}\left\|\mathbf{G}_{k}^{i}\right\|_{\mathbf{x}_{k}^{i}}=0.

In practice, Algorithm 3 terminates when there is no obvious change in the solution, which is examined in Step (4) by checking whether the step size is sufficiently small, i.e., whether \alpha_{i}^{*}\leq 10^{-10} (Section 9 in agarwal2021adaptive ). To improve the convergence rate, the algorithm also terminates when \mathbf{r}_{i} in Step (13) is sufficiently small, i.e., following a classical criterion absil2007trust , by checking whether

\left\|\mathbf{r}_{i}\right\|_{\mathbf{x}_{k}^{i}}\leq\left\|\mathbf{r}_{0}\right\|_{\mathbf{x}_{k}}\min\left(\left\|\mathbf{r}_{0}\right\|_{\mathbf{x}_{k}}^{\theta},\kappa\right), (31)

for some \theta,\kappa>0.

3.3.3 Properties of the Subproblem Solutions

In Algorithm 2, the basis \{\mathbf{q}_{i}\}_{i=1}^{D} is constructed successively starting from \mathbf{q}_{1}=\mathbf{G}_{k}/\left\|\mathbf{G}_{k}\right\|_{\mathbf{x}_{k}}, while the converted problem in Eq. (18) is solved for each \mathcal{K}_{l} starting from l=1. This process allows up to D inner iterations. The solution \bm{\eta}_{k}^{*} obtained in the last inner iteration, where l=D, is the global minimizer over \mathbb{R}^{D}. Differently, Algorithm 3 converges to a stationary point as proved in Theorem 1. In practice, a maximum inner iteration number m is set in advance. Algorithm 3 stops when it reaches the maximum iteration number or converges to a status where the change in either the solution or the conjugate direction is very small.

The convergence property of the main algorithm presented in Algorithm 1 relies on the quality of the subproblem solution. Before discussing it, we first familiarize the reader with the classical TR concepts of Cauchy steps and eigensteps, but defined for the Inexact RTR problem introduced in Section 2.2. According to Section 3.3 of boumal2019global , when \hat{m}_{k} is the RTR subproblem, the closed-form Cauchy step \hat{\bm{\eta}}_{k}^{C} is an improving direction defined by

\hat{\bm{\eta}}_{k}^{C}:=\min\left(\frac{\left\|\mathbf{G}_{k}\right\|_{\mathbf{x}_{k}}^{2}}{\langle\mathbf{G}_{k},\mathbf{H}_{k}[\mathbf{G}_{k}]\rangle_{\mathbf{x}_{k}}},\frac{\Delta_{k}}{\left\|\mathbf{G}_{k}\right\|_{\mathbf{x}_{k}}}\right)\mathbf{G}_{k}. (32)

It points towards the gradient direction with an optimal step size computed by the \min(\cdot,\cdot) operation, and follows the form of the general Cauchy step defined by

\bm{\eta}_{k}^{C}:=\arg\min_{\alpha\in\mathbb{R}}\left(\hat{m}_{k}(\alpha\mathbf{G}_{k})\right)\mathbf{G}_{k}. (33)

According to Section 2.2 of kasai2018inexact , for some \nu\in(0,1], the eigenstep \bm{\eta}_{k}^{E} satisfies

\left\langle\bm{\eta}_{k}^{E},\mathbf{H}_{k}\left[\bm{\eta}_{k}^{E}\right]\right\rangle_{\mathbf{x}_{k}}\leq\nu\lambda_{min}(\mathbf{H}_{k})\left\|\bm{\eta}_{k}^{E}\right\|^{2}_{\mathbf{x}_{k}}<0. (34)

It is an approximation of the negative curvature direction by an eigenvector associated with the smallest negative eigenvalue.

The following three assumptions on the subproblem solution are needed by the convergence analysis later. We define the induced norm for the Hessian as below:

\|\mathbf{H}_{k}\|_{\mathbf{x}_{k}}=\sup_{\bm{\eta}\in T_{\mathbf{x}_{k}}\mathcal{M},\,\|\bm{\eta}\|_{\mathbf{x}_{k}}\neq 0}\frac{\left\|\mathbf{H}_{k}[\bm{\eta}]\right\|_{\mathbf{x}_{k}}}{\left\|\bm{\eta}\right\|_{\mathbf{x}_{k}}}. (35)
Assumption 1 (Sufficient Descent Step)

Given the Cauchy step \bm{\eta}_{k}^{C} and the eigenstep \bm{\eta}_{k}^{E} for \nu\in(0,1], assume the subproblem solution \bm{\eta}_{k}^{*} satisfies the Cauchy condition

\hat{m}_{k}\left(\bm{\eta}_{k}^{*}\right)\leq\hat{m}_{k}\left(\bm{\eta}_{k}^{C}\right)\leq\hat{m}_{k}\left(\mathbf{0}_{\mathbf{x}_{k}}\right)-\max(a_{k},b_{k}), (36)

and the eigenstep condition

\hat{m}_{k}\left(\bm{\eta}_{k}^{*}\right)\leq\hat{m}_{k}\left(\bm{\eta}_{k}^{E}\right)\leq\hat{m}_{k}\left(\mathbf{0}_{\mathbf{x}_{k}}\right)-c_{k}, (37)

where

a_{k}=\frac{\|\mathbf{G}_{k}\|_{\mathbf{x}_{k}}}{2\sqrt{3}}\min\left(\frac{\|\mathbf{G}_{k}\|_{\mathbf{x}_{k}}}{\|\mathbf{H}_{k}\|_{\mathbf{x}_{k}}},\sqrt{\frac{\|\mathbf{G}_{k}\|_{\mathbf{x}_{k}}}{\sigma_{k}}}\right), (38)
b_{k}=\frac{\left\|\bm{\eta}_{k}^{C}\right\|_{\mathbf{x}_{k}}^{2}}{12}\left(\sqrt{\|\mathbf{H}_{k}\|_{\mathbf{x}_{k}}^{2}+4\sigma_{k}\|\mathbf{G}_{k}\|_{\mathbf{x}_{k}}}-\|\mathbf{H}_{k}\|_{\mathbf{x}_{k}}\right), (39)
c_{k}=\frac{\nu\left|\lambda_{min}(\mathbf{H}_{k})\right|}{6}\max\left(\left\|\bm{\eta}_{k}^{E}\right\|_{\mathbf{x}_{k}}^{2},\frac{\nu^{2}|\lambda_{min}(\mathbf{H}_{k})|^{2}}{\sigma_{k}^{2}}\right). (40)

The last two inequalities in Eqs. (36) and (37), concerning the Cauchy step and the eigenstep, are derived in Lemma 6 and Lemma 7 of xu2020newton . Assumption 1 generalizes Condition 3 in xu2020newton to the Riemannian case. It assumes that the subproblem solution \bm{\eta}_{k}^{*} is better than the Cauchy step and the eigenstep, decreasing the value of the subproblem objective function more. The following two assumptions enable a stronger convergence result for Algorithm 1, which will be used in the proof of Theorem 4.

Assumption 2 (Sub-model Gradient Norm cartis2011adaptive ; kohler2017sub )

Assume the subproblem solution \bm{\eta}_{k}^{*} satisfies

\left\|\nabla_{\bm{\eta}}\hat{m}_{k}\left(\bm{\eta}_{k}^{*}\right)\right\|_{\mathbf{x}_{k}}\leq\kappa_{\theta}\min\left(1,\left\|\bm{\eta}_{k}^{*}\right\|_{\mathbf{x}_{k}}\right)\|\mathbf{G}_{k}\|_{\mathbf{x}_{k}}, (41)

where \kappa_{\theta}\in(0,1/6].

Assumption 3 (Approximate Global Minimizer cartis2011adaptive ; yao2021inexact )

Assume the subproblem solution \bm{\eta}_{k}^{*} satisfies

\langle\mathbf{G}_{k},\bm{\eta}_{k}^{*}\rangle_{\mathbf{x}_{k}}+\langle\bm{\eta}_{k}^{*},\mathbf{H}_{k}[\bm{\eta}_{k}^{*}]\rangle_{\mathbf{x}_{k}}+\sigma_{k}\|\bm{\eta}_{k}^{*}\|_{\mathbf{x}_{k}}^{3}=0, (42)
\langle\bm{\eta}_{k}^{*},\mathbf{H}_{k}[\bm{\eta}_{k}^{*}]\rangle_{\mathbf{x}_{k}}+\sigma_{k}\|\bm{\eta}_{k}^{*}\|_{\mathbf{x}_{k}}^{3}\geq 0. (43)

Driven by these assumptions, we characterize the subproblem solutions and present the results in the following lemmas. Their proofs are deferred to Appendix D.

Lemma 2 (Lanczos Solution)

The subproblem solution obtained by Algorithm 2 when executed for DD (the dimension of \mathcal{M}) iterations satisfies Assumptions 1, 2 and 3. When executed for l<Dl<D iterations, the solution satisfies the Cauchy condition in Assumption 1, as well as Assumptions 2 and 3.

Lemma 3 (Conjugate Gradient Solution)

The subproblem solution obtained by Algorithm 3 satisfies the Cauchy condition in Assumption 1. Assuming m^k(𝛈)f(R𝐱k(𝛈))\hat{m}_{k}(\bm{\eta})\approx f(R_{\mathbf{x}_{k}}(\bm{\eta})), it also satisfies

𝜼m^k(𝜼k)𝐱k0,\left\|\nabla_{\bm{\eta}}\hat{m}_{k}\left(\bm{\eta}_{k}^{*}\right)\right\|_{\mathbf{x}_{k}}\approx 0, (44)

and approximately the first condition of Assumption 3, as

𝐆k,𝜼k𝐱k+𝜼k,𝐇k[𝜼k]𝐱k+σk𝜼k𝐱k30.\langle\mathbf{G}_{k},\bm{\eta}_{k}^{*}\rangle_{\mathbf{x}_{k}}+\langle\bm{\eta}_{k}^{*},\mathbf{H}_{k}[\bm{\eta}_{k}^{*}]\rangle_{\mathbf{x}_{k}}+\sigma_{k}\left\|\bm{\eta}_{k}^{*}\right\|_{\mathbf{x}_{k}}^{3}\approx 0. (45)

In practice, Algorithm 2 based on lower-dimensional Krylov subspaces with l<Dl<D returns a less accurate solution, while Algorithm 3 returns at most a local minimum. Neither is guaranteed to satisfy the eigenstep condition in Assumption 1. The early-returned solutions from Algorithm 2 still satisfy Assumptions 2 and 3. Solutions from Algorithm 3, however, do not satisfy these two assumptions exactly, but they can come close in an approximate manner. For instance, according to Lemma 3, 𝜼m^k(𝜼k)𝐱k0\left\|\nabla_{\bm{\eta}}\hat{m}_{k}\left(\bm{\eta}_{k}^{*}\right)\right\|_{\mathbf{x}_{k}}\approx 0, and we know that 0κθmin(1,𝜼k𝐱k)𝐆k𝐱k0\leq\kappa_{\theta}\min\left(1,\left\|\bm{\eta}_{k}^{*}\right\|_{\mathbf{x}_{k}}\right)\|\mathbf{G}_{k}\|_{\mathbf{x}_{k}}; thus, there is a fair chance for Eq. (41) in Assumption 2 to be met by the solution from Algorithm 3. Also, given that 𝜼k\bm{\eta}_{k}^{*} is a descent direction, we have 𝐆k,𝜼k𝐱k0\langle\mathbf{G}_{k},\bm{\eta}_{k}^{*}\rangle_{\mathbf{x}_{k}}\leq 0. Combining this with Eq. (45) in Lemma 3, there is a fair chance for Eq. (42) in Assumption 3 to be met. We present experimental results in Section 5.5, showing empirically to what extent the different solutions satisfy, or come close to, the eigenstep condition in Assumption 1, Assumption 2 and Assumption 3.
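As a rough numerical companion to this discussion (and to the measurements reported later in Section 5.5), a candidate solution can be checked against Eqs. (41)-(43) as in the sketch below. This is our own illustration: tangent vectors are stored as flat arrays, and the callables hess_vec and grad_m_hat, which supply the Hessian-vector product and the model gradient, are hypothetical placeholders.

```python
import numpy as np

def check_assumptions_2_and_3(eta, G, hess_vec, grad_m_hat, sigma,
                              kappa_theta=0.08, tol=1e-8):
    """Numerically check Eq. (41) and Eqs. (42)-(43) for a candidate eta.

    hess_vec(eta) returns H_k[eta]; grad_m_hat(eta) returns the gradient of
    the cubic model at eta. Tangent vectors are treated as flat arrays."""
    eta_norm = np.linalg.norm(eta)

    # Assumption 2, Eq. (41): sub-model gradient norm condition.
    a2 = (np.linalg.norm(grad_m_hat(eta))
          <= kappa_theta * min(1.0, eta_norm) * np.linalg.norm(G))

    # Assumption 3, Eqs. (42) and (43): approximate global minimizer.
    quad = float(eta @ hess_vec(eta))
    cubic = sigma * eta_norm ** 3
    a3_eq = abs(float(G @ eta) + quad + cubic) <= tol
    a3_ineq = quad + cubic >= -tol
    return a2, a3_eq, a3_ineq
```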

3.4 Practical Early Stopping

In practice, it is often more efficient to stop the optimization before the optimality conditions are met and obtain a reasonably good solution much faster. We employ a simple and practical early-stopping mechanism to accommodate this need. Algorithm 1 is allowed to terminate early when: (1) the norm of the approximate gradient fails to decrease for KK consecutive iterations, and (2) the relative function decrement is lower than a given threshold, i.e.,

f(𝐱k1)f(𝐱k)|f(𝐱k1)|τf,\frac{f(\mathbf{x}_{k-1})-f(\mathbf{x}_{k})}{|f(\mathbf{x}_{k-1})|}\leq\tau_{f}, (46)

for a consecutive number of KK times, with KK and τf>0\tau_{f}>0 being user-defined.
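A minimal sketch of this outer-loop early-stopping rule is given below. It covers only criterion (2) in Eq. (46), and the names step and f are hypothetical placeholders for one outer iteration of Algorithm 1 and the (possibly subsampled) cost evaluation.

```python
def run_with_early_stopping(step, f, x0, max_iters=1000, K=5, tau_f=1e-10):
    """Outer loop with the stopping rule of Eq. (46): terminate once the
    relative function decrement stays below tau_f for K consecutive
    iterations. `step` performs one outer iteration of the optimizer and
    `f` evaluates the cost; both are placeholders supplied by the caller."""
    x, f_prev = x0, f(x0)
    stall = 0
    for _ in range(max_iters):
        x = step(x)
        f_curr = f(x)
        if (f_prev - f_curr) / abs(f_prev) <= tau_f:
            stall += 1
            if stall >= K:          # K consecutive small decrements
                break
        else:
            stall = 0
        f_prev = f_curr
    return x
```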

For the subproblem, both Algorithms 2 and 3 are allowed to terminate when the current solution 𝜼k\bm{\eta}_{k}, i.e., 𝜼k=𝜼kl\bm{\eta}_{k}=\bm{\eta}_{k}^{*l} for Algorithm 2 and 𝜼k=𝜼ki\bm{\eta}_{k}=\bm{\eta}_{k}^{i} for Algorithm 3, satisfies

𝜼m^k(𝜼k)𝐱kκθmin(1,𝜼k𝐱k)𝐆k𝐱k.\left\|\nabla_{\bm{\eta}}\hat{m}_{k}\left(\bm{\eta}_{k}\right)\right\|_{\mathbf{x}_{k}}\leq\kappa_{\theta}\min\left(1,\left\|\bm{\eta}_{k}\right\|_{\mathbf{x}_{k}}\right)\|\mathbf{G}_{k}\|_{\mathbf{x}_{k}}. (47)

This implements a check of Assumption 2. Regarding Assumption 1, both Algorithms 2 and 3 optimize along the direction of the Cauchy step in their first iteration and thus satisfy the Cauchy condition. Therefore, there is no need to examine it. As for the eigenstep condition, it is costly to compute and compare with the eigenstep in each inner iteration, so we do not use it as a stopping criterion in practice. Regarding Assumption 3, according to Lemma 2, it is always satisfied by the solution from Algorithm 2. Therefore, there is no need to examine it in Algorithm 2. As for Algorithm 3, the examination by Eq. (47) also plays a role in checking Assumption 3. For the first condition in Assumption 3, Eq. (42) is equivalent to 𝜼m^k(𝜼k),𝜼k𝐱k=0\langle\nabla_{\bm{\eta}}\hat{m}_{k}\left(\bm{\eta}_{k}^{*}\right),\bm{\eta}_{k}^{*}\rangle_{\mathbf{x}_{k}}=0 (this results from Eq. (106) in Appendix D.2). In practice, when Eq. (47) is satisfied with a small value of 𝜼m^k(𝜼k)𝐱k\left\|\nabla_{\bm{\eta}}\hat{m}_{k}\left(\bm{\eta}_{k}^{*}\right)\right\|_{\mathbf{x}_{k}}, we have 𝜼m^k(𝜼k),𝜼k𝐱k0\left\langle\nabla_{\bm{\eta}}\hat{m}_{k}\left(\bm{\eta}_{k}^{*}\right),\bm{\eta}_{k}^{*}\right\rangle_{\mathbf{x}_{k}}\approx 0, indicating that the first condition of Assumption 3 is met approximately. Also, since 𝐆k,𝜼k𝐱k0\langle\mathbf{G}_{k},\bm{\eta}_{k}^{*}\rangle_{\mathbf{x}_{k}}\leq 0 because 𝜼k\bm{\eta}_{k}^{*} is a descent direction, the second condition of Assumption 3 has a fairly high chance of being met.

4 Convergence Analysis

4.1 Preliminaries

We start by listing the assumptions and conditions from the existing literature that are adopted to support our analysis. Given a function ff, the Hessian of its pullback 2f^(𝐱)[𝜼]\nabla^{2}\hat{f}\left(\mathbf{x}\right)[\mathbf{\bm{\eta}}] and its Riemannian Hessian Hessf(𝐱)[𝜼]{\rm Hess}f\left(\mathbf{x}\right)[\bm{\eta}] are identical when a second-order retraction is used boumal2019global , and we adopt this as an assumption to ease the analysis.

Assumption 4 (Second-order Retraction boumal2020introduction )

The retraction mapping is assumed to be a second-order retraction. That is, for all 𝐱\mathbf{x}\in\mathcal{M} and all 𝛈T𝐱\bm{\eta}\in T_{\mathbf{x}}\mathcal{M}, the curve γ(t):=R𝐱(t𝛈)\gamma(t):=R_{\mathbf{x}}(t\bm{\eta}) has zero acceleration at t=0t=0, i.e., γ′′(0)=𝒟2dt2R𝐱(t𝛈)|t=0=0\gamma^{\prime\prime}(0)=\frac{\mathcal{D}^{2}}{dt^{2}}R_{\mathbf{x}}(t\bm{\eta})\big{|}_{t=0}=0.

The following two assumptions originate from the assumptions required by the convergence analysis of the standard RTR algorithm boumal2019global ; ferreira2002kantorovich , and are adopted here to support the inexact analysis.

Assumption 5 (Restricted Lipschitz Hessian)

There exists LH>0L_{H}>0 such that for all 𝐱k\mathbf{x}_{k} generated by Algorithm 1 and all 𝛈kT𝐱k\bm{\eta}_{k}\in T_{\mathbf{x}_{k}}\mathcal{M}, f^k\hat{f}_{k} satisfies

|f^k(𝜼k)f(𝐱k)gradf(𝐱k),𝜼k𝐱k12𝜼k,2f^k(𝟎𝐱k)[𝜼k]𝐱k|LH6𝜼k𝐱k3,\left|\hat{f}_{k}(\bm{\eta}_{k})-f(\mathbf{x}_{k})-\langle{\rm grad}f(\mathbf{x}_{k}),\bm{\eta}_{k}\rangle_{\mathbf{x}_{k}}-\frac{1}{2}\left\langle\bm{\eta}_{k},\nabla^{2}\hat{f}_{k}(\mathbf{0}_{\mathbf{x}_{k}})[\bm{\eta}_{k}]\right\rangle_{\mathbf{x}_{k}}\right|\leq\frac{L_{H}}{6}\|\bm{\eta}_{k}\|_{\mathbf{x}_{k}}^{3}, (48)

and

𝒫𝜼k1(gradf^k(𝜼k))gradf(𝐱k)2f^k(𝟎𝐱k)[𝜼k]𝐱kLH2𝜼k𝐱k2,\left\|\mathcal{P}_{\bm{\eta}_{k}}^{-1}\left({\rm grad}\hat{f}_{k}(\bm{\eta}_{k})\right)-{\rm grad}f(\mathbf{x}_{k})-\nabla^{2}\hat{f}_{k}(\mathbf{0}_{\mathbf{x}_{k}})[\bm{\eta}_{k}]\right\|_{\mathbf{x}_{k}}\leq\frac{L_{H}}{2}\|\bm{\eta}_{k}\|_{\mathbf{x}_{k}}^{2}, (49)

where 𝒫1\mathcal{P}^{-1} denotes the inverse process of the parallel transport operator.

Assumption 6 (Norm Bound on Hessian)

For all 𝐱k\mathbf{x}_{k}, there exists KH0K_{H}\geq 0 so that the inexact Hessian 𝐇k\mathbf{H}_{k} satisfies

𝐇k𝐱kKH.\begin{split}\|\mathbf{H}_{k}\|_{\mathbf{x}_{k}}\leq K_{H}.\end{split} (50)

The following key conditions on the inexact gradient and Hessian approximations were developed in Euclidean spaces by roosta2019sub (Section 2.2) and xu2020newton (Section 1.3), respectively. We adapt them here to the Riemannian setting.

Condition 1 (Approximation Error Bounds)

For all 𝐱k\mathbf{x}_{k} and 𝛈kT𝐱k\bm{\eta}_{k}\in T_{\mathbf{x}_{k}}\mathcal{M}, suppose that there exist δg,δH>0\delta_{g},\delta_{H}>0, such that the approximate gradient and Hessian satisfy

𝐆kgradf(𝐱k)𝐱k\displaystyle\|\mathbf{G}_{k}-{\rm grad}f(\mathbf{x}_{k})\|_{\mathbf{x}_{k}} δg,\displaystyle\leq\delta_{g}, (51)
𝐇k[𝜼k]2f^k(𝟎𝐱k)[𝜼k]𝐱k\displaystyle\|\mathbf{H}_{k}[\bm{\eta}_{k}]-\nabla^{2}\hat{f}_{k}(\mathbf{0}_{\mathbf{x}_{k}})[\bm{\eta}_{k}]\|_{\mathbf{x}_{k}} δH𝜼k𝐱k.\displaystyle\leq\delta_{H}\|\bm{\eta}_{k}\|_{\mathbf{x}_{k}}. (52)

As will be shown in Theorem 2, these bounds allow the sampling sizes used in the gradient and Hessian approximations to be fixed throughout the training process. As a result, they guarantee algorithmic efficiency when dealing with large-scale problems.

4.2 Supporting Theorem and Assumption

In this section, we prove a preparation theorem and present new conditions required by our results. Below, we re-develop Theorem 4.1 in kasai2018inexact using the matrix Bernstein inequality gross2011recovering . It provides lower bounds on the required subsampling size for approximating the gradient and Hessian in order for Condition 1 to hold. The proof is provided in Appendix E.

Theorem 2 (Gradient and Hessian Sampling Size)

Define the suprema of the Riemannian gradient and Hessian

Kgmax:=\displaystyle K_{g_{\max}}:= maxi[n]sup𝐱gradfi(𝐱)𝐱,\displaystyle\max_{i\in[n]}\sup_{\mathbf{x}\in\mathcal{M}}\left\|{\rm grad}f_{i}(\mathbf{x})\right\|_{\mathbf{x}}, (53)
KHmax:=\displaystyle K_{H_{\max}}:= maxi[n]sup𝐱sup𝜼T𝐱𝜼𝐱0Hessfi(𝐱)[𝜼]𝐱𝜼𝐱.\displaystyle\max_{i\in[n]}\sup_{\mathbf{x}\in\mathcal{M}}\sup_{\begin{subarray}{c}\bm{\eta}\in T_{\mathbf{x}}\mathcal{M}\\ \|\bm{\eta}\|_{\mathbf{x}}\neq 0\end{subarray}}\frac{\left\|{\rm Hess}f_{i}(\mathbf{x})[\mathbf{\bm{\eta}}]\right\|_{\mathbf{x}}}{\left\|\bm{\eta}\right\|_{\mathbf{x}}}. (54)

Given 0<δ<10<\delta<1, Condition 1 is satisfied with probability at least (1δ)\left(1-\delta\right) if

|𝒮g|\displaystyle|\mathcal{S}_{g}| 8(Kgmax2+Kgmax)ln(d+rδ)δg2,\displaystyle\geq\frac{8\left(K_{g_{max}}^{2}+K_{g_{max}}\right)\ln\left(\frac{d+r}{\delta}\right)}{\delta_{g}^{2}}, (55)
|𝒮H|\displaystyle|\mathcal{S}_{H}| 8(KHmax2+KHmax𝜼𝐱)ln(d+rδ)δH2.\displaystyle\geq\frac{8\left(K_{H_{max}}^{2}+\frac{K_{H_{max}}}{\left\|\bm{\eta}\right\|_{\mathbf{x}}}\right)\ln\left(\frac{d+r}{\delta}\right)}{\delta_{H}^{2}}. (56)

where |𝒮g||\mathcal{S}_{g}| and |𝒮H||\mathcal{S}_{H}| denote the sampling sizes, while dd and rr are the dimensions of 𝐱\mathbf{x}.

The two quantities δg\delta_{g} and δH\delta_{H} in Condition 1 are the upper bounds of the gradient and Hessian approximation errors, respectively. The following assumption bounds δg\delta_{g} and δH\delta_{H}.

Assumption 7 (Restrictions on δg\delta_{g} and δH\delta_{H})

Given ν(0,1]\nu\in(0,1], KH0K_{H}\geq 0, LH>0L_{H}>0, 0<τ,ϵg<10<\tau,\epsilon_{g}<1, we assume that δg\delta_{g} and δH\delta_{H} satisfy

δg\displaystyle\delta_{g}\leq\; (1τ)(KH2+4LHϵgKH)248LH,\displaystyle\frac{(1-\tau)\left(\sqrt{K_{H}^{2}+4L_{H}\epsilon_{g}}-K_{H}\right)^{2}}{48L_{H}}, (57)
δH\displaystyle\delta_{H}\leq\; min(1τ12(KH2+4LHϵgKH),1τ3νϵH).\displaystyle\min\Bigg{(}\frac{1-\tau}{12}\left(\sqrt{K_{H}^{2}+4L_{H}\epsilon_{g}}-K_{H}\right),\frac{1-\tau}{3}\nu\epsilon_{H}\Bigg{)}. (58)

As seen in Eqs. (55) and (56), the required sampling sizes |𝒮g||\mathcal{S}_{g}| and |𝒮H||\mathcal{S}_{H}| grow with the probability (1δ)(1-\delta) and are inversely proportional to the squared error tolerances δg2\delta_{g}^{2} and δH2\delta_{H}^{2}, respectively. Hence, a higher (1δ)(1-\delta) and smaller δg\delta_{g} and δH\delta_{H} (affected by KHK_{H} and LHL_{H}) require larger |𝒮g||\mathcal{S}_{g}| and |𝒮H||\mathcal{S}_{H}| for estimating the inexact Riemannian gradient and Hessian.
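For concreteness, the bounds in Eqs. (55)-(58) can be evaluated numerically as in the sketch below (our own illustration; all constants, including the suprema from Theorem 2 and the norm of the current update, are assumed to be supplied by the caller). It first caps δg\delta_{g} and δH\delta_{H} via Assumption 7 and then returns the corresponding sampling-size lower bounds.

```python
import numpy as np

def sampling_size_bounds(Kg_max, KH_max, eta_norm, d, r, delta,
                         K_H, L_H, eps_g, eps_H, tau, nu=1.0):
    """Cap (delta_g, delta_H) via Eqs. (57)-(58), then return the sampling
    size lower bounds of Eqs. (55)-(56)."""
    root = np.sqrt(K_H ** 2 + 4.0 * L_H * eps_g) - K_H
    delta_g = (1.0 - tau) * root ** 2 / (48.0 * L_H)              # Eq. (57)
    delta_H = min((1.0 - tau) / 12.0 * root,
                  (1.0 - tau) / 3.0 * nu * eps_H)                 # Eq. (58)

    log_term = np.log((d + r) / delta)
    S_g = 8.0 * (Kg_max ** 2 + Kg_max) * log_term / delta_g ** 2  # Eq. (55)
    S_H = 8.0 * (KH_max ** 2 + KH_max / eta_norm) * log_term / delta_H ** 2  # Eq. (56)
    return delta_g, delta_H, int(np.ceil(S_g)), int(np.ceil(S_H))
```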

4.3 Main Results

Now we are ready to present our main convergence results in two main theorems for Algorithm 1. Different from sun2019escaping which explores the escape rate from a saddle point to a local minimum, we study the convergence rate from a random point.

Theorem 3 (Convergence Complexity of Algorithm 1)

Consider 0<ϵg,ϵH<10<\epsilon_{g},\epsilon_{H}<1 and δg,δH>0\delta_{g},\delta_{H}>0. Suppose that Assumptions 4, 5, 6 and 7 hold and the solution of the subproblem in Eq. (10) satisfies Assumption 1. Then, if the inexact gradient 𝐆k\mathbf{G}_{k} and Hessian 𝐇k\mathbf{H}_{k} satisfy Condition 1, Algorithm 1 returns an (ϵg,ϵH)\left(\epsilon_{g},\epsilon_{H}\right)-optimal solution in 𝒪(max(ϵg2,ϵH3))\mathcal{O}\left(\max\left(\epsilon_{g}^{-2},\epsilon_{H}^{-3}\right)\right) iterations.

The proof along with the supporting lemmas is provided in Appendices F.1 and F.2. When the Hessian at the solution is close to positive semi-definite, which indicates a small ϵH\epsilon_{H}, the Inexact Sub-RN-CR finds an (ϵg,ϵH)\left(\epsilon_{g},\epsilon_{H}\right)-optimal solution in fewer iterations than the Inexact RTR, i.e., 𝒪(max(ϵg2,ϵH3))\mathcal{O}\left(\max\left(\epsilon_{g}^{-2},\epsilon_{H}^{-3}\right)\right) iterations for the Inexact Sub-RN-CR as compared to 𝒪(max(ϵg2ϵH1,ϵH3))\mathcal{O}\left(\max\left(\epsilon_{g}^{-2}\epsilon_{H}^{-1},\epsilon_{H}^{-3}\right)\right) for the Inexact RTR kasai2018inexact , which is a favourable improvement. Combining Theorems 2 and 3 leads to the following corollary.

Corollary 1

Consider 0<ϵg,ϵH<10<\epsilon_{g},\epsilon_{H}<1 and δg,δH>0\delta_{g},\delta_{H}>0. Suppose that Assumptions 4, 5, 6 and 7 hold and the solution of the subproblem in Eq. (10) satisfies Assumption 1. For any 0<δ<10<\delta<1, suppose Eqs. (55) and (56) are satisfied at each iteration. Then, Algorithm 1 returns an (ϵg,ϵH)\left(\epsilon_{g},\epsilon_{H}\right)-optimal solution in 𝒪(max(ϵg2,ϵH3))\mathcal{O}\left(\max\left(\epsilon_{g}^{-2},\epsilon_{H}^{-3}\right)\right) iterations with a probability at least p=(1δ)𝒪(max(ϵg2,ϵH3))p=(1-\delta)^{\mathcal{O}\left(\max\left(\epsilon_{g}^{-2},\epsilon_{H}^{-3}\right)\right)}.

The proof is provided in Appendix F.3. We use an example to illustrate the effect of δ\delta on the approximate gradient sample size |𝒮g||\mathcal{S}_{g}|. Suppose ϵg2>ϵH3\epsilon^{-2}_{g}>\epsilon^{-3}_{H}, then p=(1δ)𝒪(ϵg2)p=(1-\delta)^{\mathcal{O}\left(\epsilon_{g}^{-2}\right)}. In addition, when δ=𝒪(ϵg2/10)\delta=\mathcal{O}\left(\epsilon^{2}_{g}/10\right), p0.9p\approx 0.9. Replacing δ\delta with 𝒪(ϵg2/10)\mathcal{O}\left(\epsilon^{2}_{g}/10\right) in Eqs. (55) and (56), it can be obtained that the lower bound of |𝒮g||\mathcal{S}_{g}| is proportional to ln(10ϵg2)\ln\left(10\epsilon^{-2}_{g}\right).
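This back-of-the-envelope calculation can be reproduced in a few lines; the constant hidden inside 𝒪(ϵg2)\mathcal{O}\left(\epsilon_{g}^{-2}\right) is set to one here purely for illustration.

```python
eps_g = 1e-2                       # gradient tolerance
n_iters = round(eps_g ** -2)       # O(eps_g^{-2}) iterations, constant set to 1
delta = eps_g ** 2 / 10            # delta = O(eps_g^2 / 10)
p = (1 - delta) ** n_iters         # success probability over all iterations
print(f"iterations={n_iters}, delta={delta:.1e}, p={p:.3f}")   # p is roughly 0.9
```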

Combining Assumption 3 and the stopping condition in Eq. (47) for the inexact solver, a stronger convergence result can be obtained for Algorithm 1, which is presented in the following theorem and corollary.

Theorem 4 (Optimal Convergence Complexity of Algorithm 1)

Consider 0<ϵg,ϵH<10<\epsilon_{g},\epsilon_{H}<1 and δg,δH>0\delta_{g},\delta_{H}>0. Suppose that Assumptions 4, 5, 6 and 7 hold and the solution of the subproblem satisfies Assumptions 1, 2 and 3. Then, if the inexact gradient 𝐆k\mathbf{G}_{k} and Hessian 𝐇k\mathbf{H}_{k} satisfy Condition 1 and δgδHκθϵg\delta_{g}\leq\delta_{H}\leq\kappa_{\theta}\epsilon_{g}, Algorithm 1 returns an (ϵg,ϵH)\left(\epsilon_{g},\epsilon_{H}\right)-optimal solution in 𝒪(max(ϵg32,ϵH3))\mathcal{O}\left(\max\left(\epsilon_{g}^{-\frac{3}{2}},\epsilon_{H}^{-3}\right)\right) iterations.

Corollary 2

Consider 0<ϵg,ϵH<10<\epsilon_{g},\epsilon_{H}<1 and δg,δH>0\delta_{g},\delta_{H}>0. Suppose that Assumptions 4, 5, 6 and 7 hold, and the solution of the subproblem satisfies Assumptions 1, 2 and 3. For any 0<δ<10<\delta<1, suppose Eqs. (55) and (56) are satisfied at each iteration. Then, Algorithm 1 returns an (ϵg,ϵH)\left(\epsilon_{g},\epsilon_{H}\right)-optimal solution in 𝒪(max(ϵg32,ϵH3))\mathcal{O}\left(\max\left(\epsilon_{g}^{-\frac{3}{2}},\epsilon_{H}^{-3}\right)\right) iterations with a probability at least p=(1δ)𝒪(max(ϵg32,ϵH3))p=(1-\delta)^{\mathcal{O}\left(\max\left(\epsilon_{g}^{-\frac{3}{2}},\epsilon_{H}^{-3}\right)\right)}.

The proof of Theorem 4 along with its supporting lemmas is provided in Appendices G.1 and G.2. The proof of Corollary 2 follows Corollary 1 and is provided in Appendix G.3.

4.4 Computational Complexity Analysis

We analyse the number of main operations required by the proposed algorithm. Taking the PCA task as an example, it optimizes over the Grassmann manifold Gr(r,d){\rm Gr}\left(r,d\right). Denote by mm the number of inexact iterations and DD the manifold dimension, i.e., D=d×(dr)D=d\times(d-r) for the Grassmann manifold. Starting from the gradient and Hessian computation, the full case requires 𝒪(ndr)\mathcal{O}(ndr) operations for both in the PCA task. By using the subsampling technique, these can be reduced to 𝒪(|𝒮g|dr)\mathcal{O}(|\mathcal{S}_{g}|dr) and 𝒪(|𝒮H|dr)\mathcal{O}(|\mathcal{S}_{H}|dr) by gradient and Hessian approximation. Following an existing setup for cost computation, i.e., Inexact RTR method kasai2018inexact , the full function cost evaluation takes nn operations, while the approximate cost evaluation after subsampling becomes 𝒪(|𝒮n|dr)\mathcal{O}(|\mathcal{S}_{n}|dr), where 𝒮n\mathcal{S}_{n} is the subsampled set of data points used to compute the function cost. These show that, for large-scale practices with nmax(|𝒮g|,|𝒮H|,|𝒮n|)n\gg\max\left(|\mathcal{S}_{g}|,|\mathcal{S}_{H}|,|\mathcal{S}_{n}|\right), the computational cost reduction gained from the subsampling technique is significant.

For the subproblem solver by Algorithm 2 or 3, the dominant computation within each iteration is the Hessian computation, which as mentioned above requires 𝒪(|𝒮H|dr)\mathcal{O}(|\mathcal{S}_{H}|dr) operations after using the subsampling technique. Taking this into account to analyze Algorithm 1, its overall computational complexity becomes 𝒪(max(ϵg2,ϵH3))×[𝒪(n+|𝒮g|dr)+𝒪(m|𝒮H|d2(dr)r)]\mathcal{O}\left(\max\left(\epsilon_{g}^{-2},\epsilon_{H}^{-3}\right)\right)\times\left[\mathcal{O}(n+|\mathcal{S}_{g}|dr)+\mathcal{O}\left(m|\mathcal{S}_{H}|d^{2}(d-r)r\right)\right] based on Theorem 3, where 𝒪(n+|𝒮g|dr)\mathcal{O}(n+|\mathcal{S}_{g}|dr) corresponds to the operations for computing the full function cost and the approximate gradient in an outer iteration. This overall complexity can be simplified to 𝒪(max(ϵg2,ϵH3))×𝒪(n+|𝒮g|dr+m|𝒮H|d2(dr)r)\mathcal{O}\left(\max\left(\epsilon_{g}^{-2},\epsilon_{H}^{-3}\right)\right)\times\mathcal{O}(n+|\mathcal{S}_{g}|dr+m|\mathcal{S}_{H}|d^{2}(d-r)r), where 𝒪(m|𝒮H|d2(dr)r)\mathcal{O}(m|\mathcal{S}_{H}|d^{2}(d-r)r) is the cost of the subproblem solver by either Algorithm 2 or Algorithm 3. Algorithm 2 is guaranteed to return the optimal subproblem solution within at most m=D=d×(dr)m=D=d\times(d-r) inner iterations, of which the complexity is at most 𝒪(|𝒮H|d2(dr)2r2)\mathcal{O}(|\mathcal{S}_{H}|d^{2}(d-r)^{2}r^{2}). Such a polynomial complexity is at least as good as the ST-tCG solver used in the Inexact RTR algorithm. For Algorithm 3, although mm is not guaranteed to be bounded and polynomial, we have empirically observed that mm is generally smaller than DD in practice, presenting a similar complexity to Algorithm 2.
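As a rough illustration of these per-iteration costs, the sketch below tabulates the dominant operation counts for given problem and batch sizes, dropping constant factors and counting each inner iteration as one subsampled Hessian-vector product; the chosen value of mm is hypothetical.

```python
def per_outer_iteration_cost(n, d, r, S_g, S_H, m):
    """Rough operation counts per outer iteration (constants dropped), with
    each inner iteration counted as one subsampled Hessian-vector product."""
    costs = {
        "full cost evaluation": n,
        "subsampled gradient": S_g * d * r,
        "subproblem solver (m inner iterations)": m * S_H * d * r,
    }
    costs["total per outer iteration"] = sum(costs.values())
    return costs

# Illustrative values roughly matching the P1 setting; m = 50 is hypothetical.
print(per_outer_iteration_cost(n=5 * 10**5, d=10**3, r=5,
                               S_g=5 * 10**5, S_H=5 * 10**3, m=50))
```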

5 Experiments and Result Analysis

Figure 2: Performance comparison by optimality gap for the PCA task. (a) Synthetic dataset P1; (b) MNIST dataset; (c) Covertype dataset.
Figure 3: PCA performance summary for the participating methods.

We compare the proposed Inexact Sub-RN-CR algorithm with state-of-the-art and popular Riemannian optimization algorithms. These include the Riemannian stochastic gradient descent (RSGD) bonnabel2013stochastic , Riemannian steepest descent (RSD) absil2009optimization , Riemannian conjugate gradient (RCG) absil2009optimization , Riemannian limited memory BFGS algorithm (RLBFGS) yuan2016riemannian , Riemannian stochastic variance-reduced gradient (RSVRG) zhang2016riemannian , Riemannian stochastic recursive gradient (RSRG) kasai2018riemannian , RTR absil2007trust , Inexact RTR kasai2018inexact and RTRMC boumal2011rtrmc . Existing implementations of these algorithms are available in either Manopt boumal2014manopt or Pymanopt townsend2016pymanopt library. They are often used for algorithm comparison in existing literature, e.g., by Inexact RTR kasai2018inexact . Particularly, RSGD, RSD, RCG, RLBFGS, RTR and RTRMC algorithms have been encapsulated into Manopt, and RSD, RCG and RTR also into Pymanopt. RSVRG, RSRG and Inexact RTR are implemented by kasai2018inexact based on Manopt. We use existing implementations to reproduce their methods. Our Inexact Sub-RN-CR implementation builds on Manopt.

For the competing methods, we follow the same parameter settings from the existing implementations, including the batch size (i.e. sampling size), step size (i.e. learning rate) and the inner iteration number to ensure the same results as the reported ones. For our method, we first keep the common algorithm parameters the same as the competing methods, including γ\gamma, τ\tau, ϵg\epsilon_{g} and ϵH\epsilon_{H}. Then, we use a grid search to find appropriate values of θ\theta and κθ\kappa_{\theta} for both Algorithms 2 and 3. Specifically, the searching grid for θ\theta is (0.02,0.05,0.1,0.2,0.5,1)(0.02,0.05,0.1,0.2,0.5,1), and the searching grid for κθ\kappa_{\theta} is (0.005,0.01,0.02,0.04,0.08,0.16)(0.005,0.01,0.02,0.04,0.08,0.16). For the parameter κ\kappa in Algorithm 3, we keep it the same as the other conjugate gradient solvers. The early stopping approach as described in Section 3.4 is applied to all the compared algorithms.

Regarding the batch setting, which is also the sample size setting for approximating the gradient and Hessian, we adopt the same value as used in existing subsampling implementations to keep consistency. Also, the same settings are used for both the PCA and matrix completion tasks. Specifically, the batch size |𝒮g|=n/100\left|\mathcal{S}_{g}\right|=n/100 is used for RSGD, RSVRG and RSRG where 𝒮H\mathcal{S}_{H} is not considered as these are first-order methods. For both the Inexact RTR and the proposed Inexact Sub-RN-CR, |𝒮H|=n/100\left|\mathcal{S}_{H}\right|=n/100 and |𝒮g|=n\left|\mathcal{S}_{g}\right|=n is used. This is to follow the existing setting in kasai2018inexact for benchmark purposes, which exploits the approximate Hessian but the full gradient. In addition to these, we experiment with another batch setting of {|𝒮H|=n/100,|𝒮g|=n/10}\left\{\left|\mathcal{S}_{H}\right|=n/100,\left|\mathcal{S}_{g}\right|=n/10\right\} for both the Inexact RTR and Inexact Sub-RN-CR. This is flagged by (G)(G) in the algorithm name, meaning that the algorithm uses the approximate gradient in addition to the approximate Hessian. Its purpose is to evaluate the effect of 𝒮g\mathcal{S}_{g} in the optimization.

Evaluation is conducted based on two machine learning tasks of PCA and low-rank matrix completion using both synthetic and real-world datasets with nd1n\gg d\gg 1. Both tasks can be formulated as non-convex optimization problems on the Grassmann manifold Gr(r,d)\left(r,d\right). The algorithm performance is evaluated by oracle calls and the run time. The former counts the number of function, gradient, and Hessian-vector product computations. For instance, Algorithm 1 requires n+|𝒮g|+m|𝒮H|n+|\mathcal{S}_{g}|+m|\mathcal{S}_{H}| oracle calls each iteration, where mm is the number of iterations of the subproblem solver. Regarding the user-defined parameters in Algorithm 1, we use ϵσ=1018\epsilon_{\sigma}=10^{-18}. Empirical observations suggest that the magnitude of the data entries affects the optimization in its early stage, and hence these factors are taken into account in the setting of σ0\sigma_{0}. Let 𝐒=[sij]\mathbf{S}=[s_{ij}] denote the input data matrix containing LL rows and HH columns. We compute σ0\sigma_{0} by considering the data dimension, also the averaged data magnitude normalized by its standard deviation, given as

σ0=(i[L],j[H]|sij|LH)2(dim(M)d1LHi[L],j[H](sijμS)2)12,\sigma_{0}=\left(\sum_{i\in[L],j\in[H]}\frac{|s_{ij}|}{LH}\right)^{2}\left(\frac{dim(M)*d}{\sqrt{\frac{1}{LH}\sum_{i\in[L],j\in[H]}(s_{ij}-\mu_{S})^{2}}}\right)^{\frac{1}{2}}, (59)

where μS=1LHi[L],j[H]si,j\mu_{S}=\frac{1}{LH}\sum_{i\in[L],j\in[H]}s_{i,j} and dim(M)dim(M) is the manifold dimension.
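A direct transcription of Eq. (59) is sketched below for reference; the manifold dimension dim(M)dim(M) is passed in by the caller.

```python
import numpy as np

def sigma0_init(S, dim_M, d):
    """Initial regularization parameter following Eq. (59); S is the L-by-H
    input data matrix and dim_M denotes the manifold dimension dim(M)."""
    mean_abs = np.abs(S).mean()          # averaged magnitude of the entries
    std = S.std()                        # standard deviation of the entries
    return mean_abs ** 2 * (dim_M * d / std) ** 0.5
```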

Regarding the early stopping setting in Eq. (46), K=5K=5 is used for both tasks, and we use τf=1012\tau_{f}=10^{-12} for MNIST and τf=1010\tau_{f}=10^{-10} for the remaining datasets in the PCA task. In the matrix completion task, we set τf=1010\tau_{f}=10^{-10} for the synthetic datasets and τf=103\tau_{f}=10^{-3} for the real-world datasets. For the early stopping settings in Eq. (47) and in Step (13) of Algorithm 3, we adopt κθ=0.08\kappa_{\theta}=0.08 and θ=0.1\theta=0.1. Between the two subproblem solvers, we observe that Algorithm 2 by the Lanczos method and Algorithm 3 by the conjugate gradient perform similarly. Therefore, we report the main results using Algorithm 2, and provide supplementary results for Algorithm 3 in a separate Section 5.4.

Figure 4: Additional comparisons for the PCA task. (a) P1, no early stopping; (b) P1, varying |𝒮g||\mathcal{S}_{g}| settings; (c) P1, varying |𝒮H||\mathcal{S}_{H}| settings; (d) MNIST, varying |𝒮H||\mathcal{S}_{H}| settings.
Figure 5: Performance comparison by MSE for the matrix completion task. (a) Synthetic Dataset M1; (b) Synthetic Dataset M2; (c) Synthetic Dataset M3; (d) Jester Dataset.

5.1 PCA Experiments

PCA can be interpreted as a minimization of the sum of squared residual errors between the projected and the original data points, formulated as

min𝐔Gr(r,d)1ni=1n𝐳i𝐔𝐔𝐓𝐳i22,\min_{\mathbf{U}\in{\rm Gr}\left(r,d\right)}\frac{1}{n}\sum_{i=1}^{n}\left\|\mathbf{z}_{i}-\mathbf{UU^{T}}\mathbf{z}_{i}\right\|_{2}^{2}, (60)

where 𝐳id\mathbf{z}_{i}\in\mathbb{R}^{d}. The objective function can be re-expressed as one on the Grassmann manifold via

min𝐔Gr(r,d)1ni=1n𝐳iT𝐔𝐔𝐓𝐳i.\min_{\mathbf{U}\in{\rm Gr}\left(r,d\right)}-\frac{1}{n}\sum_{i=1}^{n}\mathbf{z}_{i}^{T}\mathbf{UU^{T}}\mathbf{z}_{i}. (61)

One synthetic dataset P1 and two real-world datasets including MNIST lecun1998gradient and Covertype blackard1999comparative are used in the evaluation. The P1 dataset is first generated by randomly sampling each element of a matrix 𝐀n×d\mathbf{A}\in\mathbb{R}^{n\times d} from a normal distribution 𝒩(0,1)\mathcal{N}(0,1). This is then followed by a multiplication with a diagonal matrix 𝐒d×d\mathbf{S}\in\mathbb{R}^{d\times d} with each diagonal element randomly sampled from an exponential distribution Exp(2)\textmd{Exp}(2), which increases the difference between the feature variances. After that, a mean-subtraction preprocessing is applied to 𝐀𝐒\mathbf{A}\mathbf{S} to obtain the final P1 dataset. The (n,d,r)\left(n,d,r\right) values are: (5×105,103,5)\left(5\times 10^{5},10^{3},5\right) for P1, (6×104,784,10)\left(6\times 10^{4},784,10\right) for MNIST, and (581012,54,10)\left(581012,54,10\right) for Covertype. Algorithm accuracy is assessed by the optimality gap, defined as the absolute difference |ff||f-f^{*}|, where f=1ni=1n𝐳iT𝐔^𝐔^T𝐳if=-\frac{1}{n}\sum_{i=1}^{n}\mathbf{z}_{i}^{T}\hat{\mathbf{U}}\hat{\mathbf{U}}^{T}\mathbf{z}_{i} with 𝐔^\hat{\mathbf{U}} as the optimal solution returned by Algorithm 1. The optimal function value f=1ni=1n𝐳iT𝐔~𝐔~T𝐳if^{*}=-\frac{1}{n}\sum_{i=1}^{n}\mathbf{z}_{i}^{T}\tilde{\mathbf{U}}\tilde{\mathbf{U}}^{T}\mathbf{z}_{i} is computed by using the eigen-decomposition solution 𝐔~\tilde{\mathbf{U}}, which is the classical way to obtain the PCA result without going through an optimization procedure.
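For reference, the sketch below evaluates the (optionally subsampled) cost of Eq. (61) and its Riemannian gradient, obtained by projecting the Euclidean gradient onto the tangent space of the Grassmann manifold; this is a plain NumPy illustration rather than the Manopt-based implementation used in our experiments.

```python
import numpy as np

def pca_cost_and_rgrad(U, Z, idx=None):
    """Cost of Eq. (61) and its Riemannian gradient on Gr(r, d).

    U : d-by-r matrix with orthonormal columns (a point on the manifold).
    Z : d-by-n data matrix; idx optionally selects a column subsample S_g."""
    Zs = Z if idx is None else Z[:, idx]
    n_s = Zs.shape[1]
    P = U.T @ Zs                            # r-by-n_s projected data
    cost = -np.sum(P * P) / n_s             # -(1/n) sum_i z_i^T U U^T z_i
    egrad = -2.0 / n_s * (Zs @ P.T)         # Euclidean gradient w.r.t. U
    rgrad = egrad - U @ (U.T @ egrad)       # projection onto the tangent space
    return cost, rgrad
```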

Figure 6: Matrix completion performance summary on synthetic datasets.
Figure 7: Matrix completion performance summary on the Jester dataset.

Fig. 2 compares the optimality gap changes over iterations for all the competing algorithms. Additionally, Fig. 3 summarizes their accuracy and convergence performance in optimality gap and run time. Fig. 4a reports the performance without early stopping for the P1 dataset. It can be seen that the Inexact Sub-RN-CR reaches the minimum in the fewest iterations for both the synthetic and real-world datasets. In particular, the larger the scale of a problem, the more pronounced the advantage of our Inexact Sub-RN-CR, as evidenced by the widening performance gap.

However, both the Inexact RTR and Inexact Sub-RN-CR achieve their best PCA performance when using a full gradient calculation accompanied by a subsampled Hessian. The subsampled gradient does not seem to result in a satisfactory solution as shown in Fig. 2 with |𝒮g|=n/10\left|\mathcal{S}_{g}\right|=n/10. Additionally, we report more results for the Inexact RTR and the proposed Inexact Sub-RN-CR in Fig. 4b on the P1 dataset with different gradient batch sizes, including |𝒮g|{n/1.5,n/2,n/5,n/10}\left|\mathcal{S}_{g}\right|\in\left\{n/1.5,n/2,n/5,n/10\right\}. They all perform less well than |𝒮g|=n\left|\mathcal{S}_{g}\right|=n. More accurate gradient information is required to produce a high-precision solution in these tested cases. A hypothesis on the cause of this phenomenon might be that the variance of the approximate gradients across samples is larger than that of the approximate Hessians. Hence, a sufficiently large sample size is needed for a stable approximation of the gradient information. Errors in approximate gradients may cause the algorithm to converge to a sub-optimal point with a higher cost, thus performing less well. Another hypothesis might be that the quadratic term 𝐔𝐔T\mathbf{U}\mathbf{U}^{T} in Eq. (61) would square the approximation error from the approximate gradient, which could significantly increase the PCA reconstruction error.

By fixing the gradient batch size |𝒮g|=n\left|\mathcal{S}_{g}\right|=n for both the Inexact RTR and Inexact Sub-RN-CR, we compare in Figs. 4c and 4d their sensitivity to the batch size used for Hessian approximation. We experiment with |𝒮H|{n/10,n/102,n/103,n/104}\left|\mathcal{S}_{H}\right|\in\{n/10,n/10^{2},n/10^{3},n/10^{4}\}. It can be seen that the Inexact Sub-RN-CR outperforms the Inexact RTR in almost all cases, except for |𝒮H|=n/104\left|\mathcal{S}_{H}\right|=n/10^{4} on the MNIST dataset. The number of oracle calls of the Inexact Sub-RN-CR also grows significantly more slowly for large sample sizes. This implies that the Inexact Sub-RN-CR is more robust than the Inexact RTR to batch-size changes in the inexact Hessian approximation.

Figure 8: Inexact RTR vs. Inexact Sub-RN-CR for matrix completion with varying subsampling sizes for gradient and Hessian approximation. (a) Synthetic M3 (left) and Jester (right) with varying |𝒮g||\mathcal{S}_{g}| settings; (b) Synthetic M3 (left) and Jester (right) with varying |𝒮H||\mathcal{S}_{H}| settings.

5.2 Low-rank Matrix Completion Experiments

Low-rank matrix completion aims at completing a partial matrix 𝐙\mathbf{Z} with only a small number of entries observed, under the assumption that the matrix has a low rank. One way to formulate the problem is shown as below

min𝐔Gr(r,d),𝐀r×n1|Ω|𝒫Ω(𝐔𝐀)𝒫Ω(𝐙)F2,\min_{\mathbf{U}\in{\rm Gr}\left(r,d\right),\mathbf{A}\in\mathbb{R}^{r\times n}}\frac{1}{|\Omega|}\left\|\mathcal{P}_{\Omega}\left(\mathbf{UA}\right)-\mathcal{P}_{\Omega}\left(\mathbf{Z}\right)\right\|_{F}^{2}, (62)

where Ω\Omega denotes the index set of the observed matrix entries. The operator 𝒫Ω:d×nd×n\mathcal{P}_{\Omega}:\mathbb{R}^{d\times n}\rightarrow\mathbb{R}^{d\times n} is defined as 𝒫Ω(𝐙)ij=𝐙ij\mathcal{P}_{\Omega}\left(\mathbf{Z}\right)_{ij}=\mathbf{Z}_{ij} if (i,j)Ω\left(i,j\right)\in\Omega, while 𝒫Ω(𝐙)ij=0\mathcal{P}_{\Omega}\left(\mathbf{Z}\right)_{ij}=0 otherwise. We generate it by uniformly sampling a set of |Ω|=4r(n+dr)|\Omega|=4r(n+d-r) elements from the dndn entries. Let 𝐚i\mathbf{a}_{i} be the ii-th column of 𝐀\mathbf{A}, 𝐳i\mathbf{z}_{i} be the ii-th column of 𝐙\mathbf{Z}, and Ωi{\Omega}_{i} be the subset of Ω{\Omega} that contains sampled indices for the ii-th column of 𝐙\mathbf{Z}. Then, 𝐚i\mathbf{a}_{i} has a closed-form solution 𝐚i=(𝐔Ωi)𝐳Ωi\mathbf{a}_{i}=(\mathbf{U}_{{\Omega}_{i}})^{\dagger}\mathbf{z}_{\Omega_{i}} kasai2018inexact , where 𝐔Ωi\mathbf{U}_{{\Omega}_{i}} contains the selected rows of 𝐔\mathbf{U}, and 𝐳Ωi\mathbf{z}_{\Omega_{i}} the selected elements of 𝐳i\mathbf{z}_{i} according to the indices in Ωi\Omega_{i}, and \dagger denotes the pseudo inverse operation. To evaluate a solution 𝐔\mathbf{U}, we generate another index set Ω~\tilde{\Omega}, which is used as the test set and satisfies Ω~Ω=\tilde{\Omega}\cap\Omega=\emptyset, following the same way as generating Ω\Omega. We compute the mean squared error (MSE) by

MSE=1|Ω~|𝒫Ω~(𝐔𝐀)𝒫Ω~(𝐙)F2.\textmd{MSE}=\frac{1}{\left|\tilde{\Omega}\right|}\left\|\mathcal{P}_{\tilde{\Omega}}(\mathbf{U}\mathbf{A})-\mathcal{P}_{\tilde{\Omega}}(\mathbf{Z})\right\|_{F}^{2}. (63)
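The closed-form column update and the test MSE of Eq. (63) can be sketched as follows; this is an illustrative NumPy fragment rather than the implementation used in the experiments.

```python
import numpy as np

def complete_columns(U, Z, omega_cols):
    """Closed-form a_i = pinv(U_{Omega_i}) z_{Omega_i} for every column i;
    omega_cols[i] lists the observed row indices of column i of Z."""
    A = np.zeros((U.shape[1], Z.shape[1]))
    for i, rows in enumerate(omega_cols):
        A[:, i] = np.linalg.pinv(U[rows, :]) @ Z[rows, i]
    return A

def test_mse(U, A, Z, omega_test):
    """MSE of Eq. (63) over held-out (row, column) index pairs."""
    rows, cols = map(list, zip(*omega_test))
    pred = np.einsum("ij,ji->i", U[rows, :], A[:, cols])   # (UA) at test entries
    return np.mean((pred - Z[rows, cols]) ** 2)
```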

In evaluation, three synthetic datasets and a real-world dataset Jester goldberg2001eigentaste are used where the training and test sets are already predefined by goldberg2001eigentaste . The synthetic datasets are generated by following a generative model similar to ngo2012scaled based on SVD. Specifically, to develop a synthetic dataset, we generate two matrices 𝐀d×r\mathbf{A}\in\mathbb{R}^{d\times r} and 𝐁n×r\mathbf{B}\in\mathbb{R}^{n\times r} with their elements independently sampled from the normal distribution 𝒩(0,1)\mathcal{N}(0,1). Then, we generate two orthogonal matrices 𝐐A\mathbf{Q}_{A} and 𝐐B\mathbf{Q}_{B} by applying the QR decomposition trefethen1997numerical respectively to 𝐀\mathbf{A} and 𝐁\mathbf{B}. After that, we construct a diagonal matrix 𝐒r×r\mathbf{S}\in\mathbb{R}^{r\times r} of which the diagonal elements are computed by si,i=103+(ir)log10(c)r1s_{i,i}=10^{3+\frac{(i-r)\log_{10}(c)}{r-1}} for i=1,,ri=1,...,r, and the final data matrix is computed by 𝐙=𝐐A𝐒𝐐BT\mathbf{Z}=\mathbf{Q}_{A}\mathbf{S}\mathbf{Q}_{B}^{T}. The reason to construct 𝐒\mathbf{S} in this specific way is to have an easy control over the condition number of the data matrix, denoted by κ(𝐙)\kappa(\mathbf{Z}), which is the ratio between the maximum and minimum singular values of 𝐙\mathbf{Z}. Because κ(𝐙)=σmax(𝐙)σmin(𝐙)=103103log10(c)=c\kappa(\mathbf{Z})=\frac{\sigma_{max}(\mathbf{Z})}{\sigma_{min}(\mathbf{Z})}=\frac{10^{3}}{10^{3-\log_{10}(c)}}=c, we can adjust the condition number by tuning the parameter cc. Following this generative model, each synthetic dataset is generated by randomly sampling two orthogonal matrices and constructing one diagonal matrix subject to the constraint of condition numbers, i.e., c=κ(𝐙)=5c=\kappa(\mathbf{Z})=5 for M1 and c=κ(𝐙)=20c=\kappa(\mathbf{Z})=20 for M2 and M3. The (n,d,r)\left(n,d,r\right) values of the used datasets are: (105,102,5)\left(10^{5},10^{2},5\right) for M1, (105,102,5)\left(10^{5},10^{2},5\right) for M2, (3×104,102,5)\left(3\times 10^{4},10^{2},5\right) for M3, and (24983,102,5)\left(24983,10^{2},5\right) for Jester.
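The generative model described above can be sketched as follows (an illustration under the stated construction; the random seed and variable names are our own).

```python
import numpy as np

def make_synthetic(n, d, r, cond, seed=0):
    """Generate Z = Q_A S Q_B^T with condition number `cond`
    (cond = 5 for M1; cond = 20 for M2 and M3)."""
    rng = np.random.default_rng(seed)
    Q_A, _ = np.linalg.qr(rng.standard_normal((d, r)))   # d-by-r orthonormal
    Q_B, _ = np.linalg.qr(rng.standard_normal((n, r)))   # n-by-r orthonormal
    i = np.arange(1, r + 1)
    s = 10.0 ** (3 + (i - r) * np.log10(cond) / (r - 1)) # diagonal of S
    return Q_A @ np.diag(s) @ Q_B.T                      # d-by-n matrix of rank r

Z = make_synthetic(n=3 * 10**4, d=100, r=5, cond=20)     # an M3-like matrix
```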

Figure 9: PCA optimality gap comparison for fMRI analysis.

Fig. 5 compares the MSE changes over iterations, while Fig. 6 and Fig. 7 summarize both the MSE performance and the run time in the same plot for different algorithms and datasets. In Fig. 6, the Inexact Sub-RN-CR outperforms the others in most cases, and it can even be nearly twice as fast as the state-of-the-art methods for cases with a larger condition number (see dataset M2 and M3). This shows that the proposed algorithm is efficient at handling ill-conditioned problems. In Fig. 7, the Inexact Sub-RN-CR achieves a sufficiently small MSE with the shortest run time, faster than the Inexact RTR and RTRMC. Unlike in the PCA task, the subsampled gradient approximation actually helps to improve the convergence. A hypothesis for explaining this phenomenon could be that, as compared to the quadratic term 𝐔𝐔T\mathbf{U}\mathbf{U}^{T} in the PCA objective function, the linear term 𝐔\mathbf{U} in the matrix completion objective function accumulates fewer errors from the inexact gradient, making the optimization more stable.

Additionally, Fig. 8a compares the Inexact RTR and the Inexact Sub-RN-CR with varying batch sizes for gradient estimation and with fixed |𝒮H|=n|\mathcal{S}_{H}|=n. The M1-M3 results show that our algorithm exhibits stronger robustness to |𝒮g||\mathcal{S}_{g}|, as it converges to the minima with only 50%\% additional oracle calls when reducing |𝒮g||\mathcal{S}_{g}| from n/10n/10 to n/102n/10^{2}, whereas Inexact RTR requires twice as many calls. For the Jester dataset, in all settings of gradient sample sizes, our method achieves lower MSE than the Inexact RTR, especially when |𝒮g|=n/104|\mathcal{S}_{g}|=n/10^{4}. Fig. 8b compares sensitivities in Hessian sample sizes |𝒮H||\mathcal{S}_{H}| with fixed |𝒮g|=n|\mathcal{S}_{g}|=n. Inexact Sub-RN-CR performs better for the synthetic dataset M3 with |𝒮H|{n/10,n/102}\left|\mathcal{S}_{H}\right|\in\{n/10,n/10^{2}\}, showing roughly twice faster convergence. For the Jester dataset, Inexact Sub-RN-CR performs better with |𝒮H|{n/10,n/102,n/103}\left|\mathcal{S}_{H}\right|\in\{n/10,n/10^{2},n/10^{3}\} except for the case of |𝒮H|=n/104\left|\mathcal{S}_{H}\right|=n/10^{4}, which is possibly because the construction of the Krylov subspace requires a more accurately approximated Hessian.

To summarize, we have observed from both the PCA and matrix completion tasks that the effectiveness of the subsampled gradient in the proposed approach can be sensitive to the choice of the practical problems, while the subsampled Hessian steadily contributes to a faster convergence rate.

Figure 10: Comparison of the learned fMRI principal components. (a) component 1; (b) component 2; (c) component 3.
Figure 11: MSE comparison for scene image recovery by matrix completion.

5.3 Imaging Applications

In this section, we demonstrate some practical applications of PCA and matrix completion, which are solved by using the proposed optimization algorithm Inexact Sub-RN-CR, for analyzing medical images and scene images.

5.3.1 Functional Connectivity in fMRI by PCA

Functional magnetic-resonance imaging (fMRI) can be used to measure brain activity, and PCA is often used to find functional connectivities between brain regions based on the fMRI scans zhong2009detecting . This method is based on the assumption that the activation is independent of other signal variations such as brain motion and physiological signals zhong2009detecting . Usually, an fMRI scan is represented as a 4D data block (3 spatial dimensions and 1 time dimension) subject to observational noise. Following a common preprocessing routine sidhu2012kernel ; kohoutova2020toward , we denote an fMRI data block by 𝐃u×v×w×T\mathbf{D}\in\mathbb{R}^{u\times v\times w\times T} and a mask by 𝐌{0,1}u×v×w\mathbf{M}\in\{0,1\}^{u\times v\times w} that contains dd nonzero elements marked by 11. By applying the mask to the data block, we obtain a feature matrix 𝐟d×T\mathbf{f}\in\mathbb{R}^{d\times T}, where each column stores the features of the brain at a given time stamp. One can increase the sample size by collecting kk fMRI data blocks corresponding to kk human subjects, after which the feature matrix is expanded to a larger matrix 𝐅d×kT\mathbf{F}\in\mathbb{R}^{d\times kT}.
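A minimal sketch of this masking step (our own illustration, with hypothetical variable names) is given below.

```python
import numpy as np

def fmri_feature_matrix(blocks, mask):
    """Stack masked fMRI blocks into a feature matrix of shape (d, k*T).

    blocks : list of k arrays, each of shape (u, v, w, T).
    mask   : array of shape (u, v, w) whose d nonzero entries select voxels."""
    mask = mask.astype(bool)
    cols = [block[mask, :] for block in blocks]   # each block becomes (d, T)
    return np.concatenate(cols, axis=1)           # feature matrix F, (d, k*T)
```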

In this experiment, an fMRI dataset referred to as ds000030ds000030 provided by the OpenfMRI database poldrack2017openfmri is used, where u=v=64u=v=64, w=34w=34, and T=184T=184. We select k=13k=13 human subjects and use the provided brain mask with d=228483d=228483. The dimension of the final data matrix is (n,d)=(2392,228483)(n,d)=(2392,228483), where n=kTn=kT. We set the rank as r=5r=5 which is sufficient to capture over 93%93\% of the variance in the data. After the PCA processing, each computed principal component can be rendered back to the brain reconstruction by using the open-source library Nilearn abraham2014machine . Fig. 9 displays the optimization performance, where the Inexact Sub-RN-CR converges faster in terms of both run time and oracle calls. For our method and the Inexact RTR, adopting the subsampled gradient leads to a suboptimal solution in less time than using the full gradient. We speculate that imprecise gradients cause an oscillation of the optimization near local optima. Fig. 10 compares the results obtained by our optimization algorithm with those computed by eigen-decomposition. The highlighted regions denote the main activated regions with positive connectivity (yellow) and negative connectivity (blue). The components learned by the two approaches are similar, with some cases learned by our approach tending to be more connected (see Figs. 10a and 10c).

5.3.2 Image Recovery by Matrix Completion

In this application, we demonstrate image recovery with matrix completion using a (W,H,C)=2519×1679×3(W,H,C)=2519\times 1679\times 3 scene image selected from the BIG dataset cheng2020cascadepsp . As seen in Fig. 12a, this image contains rich texture information. The values of (n,d,r)(n,d,r) for conducting the matrix completion task are (1679,2519,20)(1679,2519,20), where we use a relatively large rank to allow more accurate recovery. The sampling ratio for observing the pixels is set as SR=|Ω|W×H×C=0.6{\rm SR}=\frac{\left|\Omega\right|}{W\times H\times C}=0.6. Fig. 11 compares the performance of different algorithms, where the Inexact Sub-RN-CR takes the shortest time to obtain a satisfactory solution. The subsampled gradient improves the optimization speed of the Inexact Sub-RN-CR without sacrificing much MSE accuracy. Fig. 12 illustrates the images recovered by three representative algorithms, which provide similar visual results.

Figure 12: Comparison of the scene images recovered by different algorithms. (a) Original image; (b) Observation; (c) Inexact Sub-RN-CR; (d) Inexact RTR; (e) RTRMC.

5.4 Results of Conjugate Gradient Subproblem Solver

Figure 13: Performance comparison by optimality gap for the PCA task (using the CG solver in Inexact Sub-RN-CR). (a) Synthetic dataset P1; (b) MNIST dataset; (c) Covertype dataset.
Figure 14: Performance comparison by MSE for the matrix completion task (using the CG solver in Inexact Sub-RN-CR). (a) Synthetic Dataset M1; (b) Synthetic Dataset M2; (c) Synthetic Dataset M3; (d) Jester Dataset.
Figure 15: PCA optimality gap comparison for fMRI analysis (using the CG solver in Inexact Sub-RN-CR).
Figure 16: MSE comparison for scene image recovery by matrix completion (using the CG solver in Inexact Sub-RN-CR).

We experiment with Algorithm 3 for solving the subproblem. In Step (3) of Algorithm 3, the eigen-decomposition method edelman1995polynomial used to solve the minimization problem has a complexity of 𝒪(C3)\mathcal{O}(C^{3}), where C=4C=4 is the fixed degree of the polynomial. Figs. 13-16 display the results for both the PCA and matrix completion tasks. Overall, Algorithm 3 can obtain the optimal results with the fastest convergence speed, as compared to the competing approaches. We have observed that, in general, Algorithm 3 provides similar results to Algorithm 2, but they differ in run time. For instance, Algorithm 2 runs 18%18\% faster for the PCA task with the MNIST dataset and 20%20\% faster for the matrix completion task with the M1 dataset, as compared to Algorithm 3. But Algorithm 3 runs 17%17\% faster than Algorithm 2 for the matrix completion task with the M2 dataset. A hypothesis could be that Algorithm 2 performs well on well-conditioned data (e.g. MNIST and M1) because of its strength in finding the global solution, while for ill-conditioned data (e.g. M2), it may not show significant advantages over Algorithm 3. Moreover, from a computational perspective, Step (3) of Algorithm 3 is of 𝒪(C3)\mathcal{O}(C^{3}) complexity, which tends to be faster than solving Eq. (18) as required by Algorithm 2. Overall, this makes Algorithm 3 likely a better choice than Algorithm 2 for processing ill-conditioned data.

5.5 Examination of Convergence Analysis Assumptions

As explained in Section 3.3.3 and Section 3.4, the eigenstep condition in Assumption 1, Assumption 2 and Assumption 3, although required by the convergence analysis, are not always satisfied by a subproblem solver in practice. In this experiment, we estimate the probability PP that an assumption is satisfied during the optimization by counting the number of outer iterations of Algorithm 1 in which the assumption holds. We repeat this entire process five times (T=5T=5) to attain a stable result. Let NiN_{i} be the number of outer iterations where the assumption is satisfied, and MiM_{i} the total number of outer iterations, in the ii-th repetition (i=1, 2,,5i=1,\;2,\;\ldots,5). We compute the probability by P=i[T]Nii[T]MiP=\frac{\sum_{i\in[T]}N_{i}}{\sum_{i\in[T]}M_{i}}. Experiments are conducted for the PCA task using the P1 dataset.

In order to examine Assumption 2, which is the stopping criterion in Eq. (47), we temporarily deactivate the other stopping criteria. We observe that Algorithm 2 can always produce a solution that satisfies Assumption 2. However, Algorithm 3 has only P50%P\approx 50\% chance to produce a solution satisfying Assumption 2. The reason is probably that when computing 𝐫i\mathbf{r}_{i} in Step (11) of Algorithm 3, the first-order approximation of f(R𝐱k(𝜼k))\nabla f(R_{\mathbf{x}_{k}}(\bm{\eta}_{k}^{*})) is used rather than the exact f(R𝐱k(𝜼k))\nabla f(R_{\mathbf{x}_{k}}(\bm{\eta}_{k}^{*})) for the sake of computational efficiency. This can result in an approximation error.

Regarding the eigenstep condition in Assumption 1, it can always be met by Algorithm 2 with P100%P\approx 100\%. This indicates that even a few inner iterations are sufficient for it to find a solution pointing in the direction of negative curvature. However, Algorithm 3 has a P70%P\approx 70\% chance to meet the eigenstep condition. This might be caused by insufficient inner iterations according to Theorem 1. Moreover, the solution obtained by Algorithm 3 is only guaranteed to be stationary according to Theorem 1, rather than pointing in the direction of the negative curvature. This could be a second cause for Algorithm 3 not to meet the eigenstep condition in Eq. (34).

While about Assumption 3, according to Lemma 2, Algorithm 2 always satisfies it. This is verified by our results with P=100%P=100\%. Algorithm 3 has a P80%P\approx 80\% chance to meet Assumption 3 empirically. This empirical result matches the theoretical result indicated by Lemma 3 where solutions from Algorithm 3 tend to approximately satisfy Assumption 3.

6 Conclusion

We have proposed the Inexact Sub-RN-CR algorithm to offer an effective and fast optimization for an important class of non-convex problems whose constraint sets possess manifold structures. The algorithm improves the current state of the art in second-order Riemannian optimization by using cubic regularization and subsampled Hessian and gradient approximations. We have also provided rigorous theoretical results on its convergence, and empirically evaluated and compared it with state-of-the-art Riemannian optimization techniques for two general machine learning tasks and multiple datasets. Both theoretical and experimental results demonstrate that the Inexact Sub-RN-CR offers improved convergence and reduced computational cost. Although the proposed method is promising in solving large-sample problems, there remains an open and interesting question of whether the proposed algorithm can be effective in training a constrained deep neural network. This is more demanding in its required computational complexity and convergence characteristics than many other machine learning problems, and it is more challenging to perform the Hessian approximation. Our future work will pursue this direction.

References

  • (1) Abraham, A., Pedregosa, F., Eickenberg, M., Gervais, P., Mueller, A., Kossaifi, J., Gramfort, A., Thirion, B., Varoquaux, G.: Machine learning for neuroimaging with scikit-learn. Frontiers in neuroinformatics 8, 14 (2014)
  • (2) Absil, P.A., Baker, C.G., Gallivan, K.A.: Trust-region methods on riemannian manifolds. Foundations of Computational Mathematics 7(3), 303–330 (2007)
  • (3) Absil, P.A., Mahony, R., Sepulchre, R.: Optimization algorithms on matrix manifolds. Princeton University Press (2009)
  • (4) Agarwal, N., Boumal, N., Bullins, B., Cartis, C.: Adaptive regularization with cubics on manifolds. Mathematical Programming 188(1), 85–134 (2021)
  • (5) Alimisis, F., Orvieto, A., Bécigneul, G., Lucchi, A.: Momentum improves optimization on riemannian manifolds. In: International Conference on Artificial Intelligence and Statistics, pp. 1351–1359. PMLR (2021)
  • (6) Anandkumar, A., Ge, R.: Efficient approaches for escaping higher order saddle points in non-convex optimization. In: Conference on learning theory, pp. 81–102 (2016)
  • (7) Becigneul, G., Ganea, O.E.: Riemannian adaptive optimization methods. In: International Conference on Learning Representations (2019)
  • (8) Blackard, J.A., Dean, D.J.: Comparative accuracies of artificial neural networks and discriminant analysis in predicting forest cover types from cartographic variables. Computers and electronics in agriculture 24(3), 131–151 (1999)
  • (9) Bonnabel, S.: Stochastic gradient descent on riemannian manifolds. IEEE Transactions on Automatic Control 58(9), 2217–2229 (2013)
  • (10) Boumal, N.: An introduction to optimization on smooth manifolds. Available online, May 3 (2020)
  • (11) Boumal, N., Absil, P.a.: Rtrmc: A riemannian trust-region method for low-rank matrix completion. In: Advances in neural information processing systems, pp. 406–414 (2011)
  • (12) Boumal, N., Absil, P.A., Cartis, C.: Global rates of convergence for nonconvex optimization on manifolds. IMA Journal of Numerical Analysis 39(1), 1–33 (2019)
  • (13) Boumal, N., Mishra, B., Absil, P.A., Sepulchre, R.: Manopt, a matlab toolbox for optimization on manifolds. The Journal of Machine Learning Research 15(1), 1455–1459 (2014)
  • (14) Carmon, Y., Duchi, J.C.: Analysis of krylov subspace solutions of regularized non-convex quadratic problems. Advances in Neural Information Processing Systems 31 (2018)
  • (15) Cartis, C., Gould, N.I., Toint, P.L.: Adaptive cubic regularisation methods for unconstrained optimization. part i: motivation, convergence and numerical results. Mathematical Programming 127(2), 245–295 (2011)
  • (16) Cheng, H.K., Chung, J., Tai, Y.W., Tang, C.K.: Cascadepsp: toward class-agnostic and very high-resolution segmentation via global and local refinement. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8890–8899 (2020)
  • (17) Cho, M., Lee, J.: Riemannian approach to batch normalization. In: Advances in Neural Information Processing Systems, pp. 5225–5235 (2017)
  • (18) Conn, A.R., Gould, N.I., Toint, P.L.: Trust region methods. SIAM (2000)
  • (19) Edelman, A., Arias, T.A., Smith, S.T.: The geometry of algorithms with orthogonality constraints. SIAM journal on Matrix Analysis and Applications 20(2), 303–353 (1998)
  • (20) Edelman, A., Murakami, H.: Polynomial roots from companion matrix eigenvalues. Mathematics of Computation 64(210), 763–776 (1995)
  • (21) Ferreira, O.P., Svaiter, B.F.: Kantorovich’s theorem on newton’s method in riemannian manifolds. Journal of Complexity 18(1), 304–329 (2002)
  • (22) Fletcher, R., Reeves, C.M.: Function minimization by conjugate gradients. The computer journal 7(2), 149–154 (1964)
  • (23) Goldberg, K., Roeder, T., Gupta, D., Perkins, C.: Eigentaste: A constant time collaborative filtering algorithm. information retrieval 4(2), 133–151 (2001)
  • (24) Gould, N., Lucidi, S., Roma, M., Toint, P.L.: Solving the trust-region subproblem using the lanczos method. Siam Journal on Optimization 9(2), 504–525 (1999)
  • (25) Griewank, A.: The modification of newton’s method for unconstrained optimization by bounding cubic terms. Tech. rep., Technical report NA/12 (1981)
  • (26) Gross, D.: Recovering low-rank matrices from few coefficients in any basis. IEEE Transactions on Information Theory 57(3), 1548–1566 (2011)
  • (27) Han, A., Mishra, B., Jawanpuria, P.K., Gao, J.: On riemannian optimization over positive definite matrices with the bures-wasserstein geometry. Advances in Neural Information Processing Systems 34 (2021)
  • (28) Horev, I., Yger, F., Sugiyama, M.: Geometry-aware principal component analysis for symmetric positive definite matrices. pp. 493–522. Springer (2017)
  • (29) Hosseini, S., Huang, W., Yousefpour, R.: Line search algorithms for locally lipschitz functions on riemannian manifolds. SIAM Journal on Optimization 28(1), 596–619 (2018)
  • (30) Huang, W., Wei, K.: Riemannian proximal gradient methods. Mathematical Programming pp. 1–43 (2021)
  • (31) Jia, X., Liang, X., Shen, C., Zhang, L.H.: Solving the cubic regularization model by a nested restarting lanczos method (2021)
  • (32) Kasai, H., Mishra, B.: Inexact trust-region algorithms on riemannian manifolds. In: Advances in Neural Information Processing Systems, pp. 4249–4260 (2018)
  • (33) Kasai, H., Sato, H., Mishra, B.: Riemannian stochastic recursive gradient algorithm. In: International Conference on Machine Learning, pp. 2516–2524 (2018)
  • (34) Kohler, J.M., Lucchi, A.: Sub-sampled cubic regularization for non-convex optimization. In: Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1895–1904. JMLR. org (2017)
  • (35) Kohoutová, L., Heo, J., Cha, S., Lee, S., Moon, T., Wager, T.D., Woo, C.W.: Toward a unified framework for interpreting machine-learning models in neuroimaging. Nature protocols 15(4), 1399–1435 (2020)
  • (36) Kumar Roy, S., Mhammedi, Z., Harandi, M.: Geometry aware constrained optimization techniques for deep learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4460–4469 (2018)
  • (37) LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11), 2278–2324 (1998)
  • (38) Liu, X., He, J., Duddy, S., O’Sullivan, L.: Convolution-consistent collective matrix completion. In: International Conference on Information and Knowledge Management, pp. 2209–2212 (2019)
  • (39) Mishra, B., Kasai, H., Jawanpuria, P., Saroop, A.: A riemannian gossip approach to subspace learning on grassmann manifold. Machine Learning 108(10), 1783–1803 (2019)
  • (40) Mokhtari, A., Ozdaglar, A., Jadbabaie, A.: Escaping saddle points in constrained optimization. In: Advances in Neural Information Processing Systems, pp. 3629–3639 (2018)
  • (41) Ngo, T., Saad, Y.: Scaled gradients on grassmann manifolds for matrix completion. Advances in neural information processing systems 25 (2012)
  • (42) Nguyen, X.S., Brun, L., Lézoray, O., Bougleux, S.: A neural network based on spd manifold learning for skeleton-based hand gesture recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 12036–12045 (2019)
  • (43) Nocedal, J., Wright, S.: Numerical optimization. Springer Science & Business Media (2006)
  • (44) Poldrack, R.A., Gorgolewski, K.J.: Openfmri: Open sharing of task fmri data. Neuroimage 144, 259–261 (2017)
  • (45) Pölitz, C., Duivesteijn, W., Morik, K.: Interpretable domain adaptation via optimization over the stiefel manifold. Machine Learning 104(2), 315–336 (2016)
  • (46) Press, W.H., Teukolsky, S.A., Vetterling, W.T., Flannery, B.: Chapter 9: Root finding and nonlinear sets of equations. Book: Numerical Recipes: The Art of Scientific Computing, New York, Cambridge University Press, 10 (2007)
  • (47) Qi, C.: Numerical optimization methods on riemannian manifolds (2011)
  • (48) Roosta-Khorasani, F., Mahoney, M.W.: Sub-sampled newton methods. Mathematical Programming 174(1), 293–326 (2019)
  • (49) Sakai, H., Iiduka, H.: Sufficient descent riemannian conjugate gradient methods. Journal of Optimization Theory and Applications 190(1), 130–150 (2021)
  • (50) Sato, H., Iwai, T.: A new, globally convergent riemannian conjugate gradient method. Optimization 64(4), 1011–1031 (2015)
  • (51) Shahid, N., Kalofolias, V., Bresson, X., Bronstein, M., Vandergheynst, P.: Robust principal component analysis on graphs. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2812–2820 (2015)
  • (52) Shen, Z., Zhou, P., Fang, C., Ribeiro, A.: A stochastic trust region method for non-convex minimization. arXiv preprint arXiv:1903.01540 (2019)
  • (53) Sidhu, G.S., Asgarian, N., Greiner, R., Brown, M.R.: Kernel principal component analysis for dimensionality reduction in fMRI-based diagnosis of ADHD. Frontiers in Systems Neuroscience 6, 74 (2012)
  • (54) Sun, Y., Flammarion, N., Fazel, M.: Escaping from saddle points on Riemannian manifolds. arXiv preprint arXiv:1906.07355 (2019)
  • (55) Townsend, J., Koep, N., Weichwald, S.: Pymanopt: A python toolbox for optimization on manifolds using automatic differentiation. The Journal of Machine Learning Research 17(1), 4755–4759 (2016)
  • (56) Trefethen, L.N., Bau III, D.: Numerical linear algebra, vol. 50. SIAM (1997)
  • (57) Tripuraneni, N., Stern, M., Jin, C., Regier, J., Jordan, M.I.: Stochastic cubic regularization for fast nonconvex optimization. Advances in Neural Information Processing Systems 31 (2018)
  • (58) Tropp, J.A.: An introduction to matrix concentration inequalities. arXiv preprint arXiv:1501.01571 (2015)
  • (59) Wei, Z., Yao, S., Liu, L.: The convergence properties of some new conjugate gradient methods. Applied Mathematics and computation 183(2), 1341–1350 (2006)
  • (60) Weiwei, Y., Yueting, Y., Chenhui, Z., Mingyuan, C.: A Newton-like trust region method for large-scale unconstrained nonconvex minimization. In: Abstract and Applied Analysis, vol. 2013. Hindawi (2013)
  • (61) Xu, P., Roosta, F., Mahoney, M.W.: Newton-type methods for non-convex optimization under inexact hessian information. Mathematical Programming 184(1), 35–70 (2020)
  • (62) Xu, Z., Zhao, P., Cao, J., Li, X.: Matrix eigen-decomposition via doubly stochastic Riemannian optimization. In: International Conference on Machine Learning, pp. 1660–1669 (2016)
  • (63) Yao, Z., Xu, P., Roosta, F., Mahoney, M.W.: Inexact nonconvex Newton-type methods. INFORMS Journal on Optimization 3(2), 154–182 (2021)
  • (64) Yuan, X., Huang, W., Absil, P.A., Gallivan, K.A.: A Riemannian limited-memory BFGS algorithm for computing the matrix geometric mean. Procedia Computer Science 80, 2147–2157 (2016)
  • (65) Zhang, H., Reddi, S.J., Sra, S.: Riemannian SVRG: Fast stochastic optimization on Riemannian manifolds. In: Advances in Neural Information Processing Systems, pp. 4592–4600 (2016)
  • (66) Zhang, H., Sra, S.: First-order methods for geodesically convex optimization. In: Conference on Learning Theory, pp. 1617–1638 (2016)
  • (67) Zhang, J., Zhang, S.: A cubic regularized Newton's method over Riemannian manifolds. arXiv preprint arXiv:1805.05565 (2018)
  • (68) Zhong, Y., Wang, H., Lu, G., Zhang, Z., Jiao, Q., Liu, Y.: Detecting functional connectivity in fMRI using PCA and regression analysis. Brain Topography 22(2), 134–144 (2009)
  • (69) Zhou, D., Gu, Q.: Stochastic recursive variance-reduced cubic regularization methods. In: International Conference on Artificial Intelligence and Statistics, pp. 3980–3990. PMLR (2020)
  • (70) Zhou, D., Xu, P., Gu, Q.: Stochastic variance-reduced cubic regularization methods. J. Mach. Learn. Res. 20(134), 1–47 (2019)
  • (71) Zhu, X.: A Riemannian conjugate gradient method for optimization on the Stiefel manifold. Computational Optimization and Applications 67(1), 73–110 (2017)

Appendix A Appendix: Derivation of Lanczos Method

Instead of solving the subproblem over the tangent space T_{\mathbf{x}_{k}}\mathcal{M}, whose dimension equals the manifold dimension D, the Lanczos method solves it within a Krylov subspace \mathcal{K}_{l}, where l can range from 1 to D. This subspace is defined as the span of the following elements:

𝒦l(𝐇k,𝐆k):={𝐆k,𝐇k[𝐆k],𝐇k2[𝐆k],,𝐇kl[𝐆k]},\mathcal{K}_{l}(\mathbf{H}_{k},\mathbf{G}_{k}):=\left\{\mathbf{G}_{k},\mathbf{H}_{k}[\mathbf{G}_{k}],\mathbf{H}_{k}^{2}[\mathbf{G}_{k}],...,\mathbf{H}_{k}^{l}[\mathbf{G}_{k}]\right\}, (64)

where, for l2l\geq 2, 𝐇kl[𝐆k]\mathbf{H}_{k}^{l}[\mathbf{G}_{k}] is recursively defined by 𝐇k[𝐇kl1[𝐆k]]\mathbf{H}_{k}\left[\mathbf{H}^{l-1}_{k}\left[\mathbf{G}_{k}\right]\right]. Its orthonormal basis 𝐐l={𝐪1,,𝐪l}\mathbf{Q}_{l}=\{\mathbf{q}_{1},...,\mathbf{q}_{l}\}, where 𝐐lT𝐐l=𝐈\mathbf{Q}_{l}^{T}\mathbf{Q}_{l}=\mathbf{I}, is successively constructed to satisfy

𝐪1=\displaystyle\mathbf{q}_{1}= 𝐆k𝐆k𝐱k,\displaystyle\frac{\mathbf{G}_{k}}{\left\|\mathbf{G}_{k}\right\|_{\mathbf{x}_{k}}}, (65)
𝐪i,𝐇k[𝐪j]𝐱k=\displaystyle\langle\mathbf{q}_{i},\mathbf{H}_{k}[\mathbf{q}_{j}]\rangle_{\mathbf{x}_{k}}= (𝐓l)i,j,\displaystyle\left(\mathbf{T}_{l}\right)_{i,j}, (66)

for i,j\in[l], where \left(\mathbf{T}_{l}\right)_{i,j} denotes the ij-th element of the matrix \mathbf{T}_{l}. Each element \bm{\eta}\in\mathcal{K}_{l} of the Krylov subspace can be expressed as \bm{\eta}=\sum_{i=1}^{l}y_{i}\mathbf{q}_{i}. Storing these \{y_{i}\}_{i=1}^{l} in the vector \mathbf{y}\in\mathbb{R}^{l}, the subproblem \mathop{\min}_{\bm{\eta}\in\mathcal{K}_{l}}\hat{m}(\bm{\eta}) is minimized over the Krylov subspace instead. By substituting \bm{\eta}:=\sum_{i=1}^{l}y_{i}\mathbf{q}_{i} into \hat{m}(\bm{\eta}), the objective function becomes

m^(𝜼)\displaystyle\hat{m}(\bm{\eta}) =f(𝐱k)+δ𝐆k,i=1lyi𝐪i𝐱k+12i=1lyi𝐪i,𝐇k[i=1lyi𝐪i]𝐱k+σk3i=1lyi𝐪i𝐱k3\displaystyle=f(\mathbf{x}_{k})+\delta\left\langle\mathbf{G}_{k},\sum_{i=1}^{l}y_{i}\mathbf{q}_{i}\right\rangle_{\mathbf{x}_{k}}+\frac{1}{2}\left\langle\sum_{i=1}^{l}y_{i}\mathbf{q}_{i},\mathbf{H}_{k}\left[\sum_{i=1}^{l}y_{i}\mathbf{q}_{i}\right]\right\rangle_{\mathbf{x}_{k}}+\frac{\sigma_{k}}{3}\left\|\sum_{i=1}^{l}y_{i}\mathbf{q}_{i}\right\|_{\mathbf{x}_{k}}^{3}
=f(𝐱k)+δ𝐆k,y1𝐪1𝐱k+12i,j=1lyiyj𝐪i,𝐇k[𝐪j]𝐱k+σk3𝐲23\displaystyle=f(\mathbf{x}_{k})+\delta\left\langle\mathbf{G}_{k},y_{1}\mathbf{q}_{1}\right\rangle_{\mathbf{x}_{k}}+\frac{1}{2}\sum_{i,j=1}^{l}y_{i}y_{j}\left\langle\mathbf{q}_{i},\mathbf{H}_{k}[\mathbf{q}_{j}]\right\rangle_{\mathbf{x}_{k}}+\frac{\sigma_{k}}{3}\left\|\mathbf{y}\right\|_{2}^{3}
\displaystyle=f(\mathbf{x}_{k})+\ y_{1}\delta\left\|\mathbf{G}_{k}\right\|_{\mathbf{x}_{k}}+\frac{1}{2}\mathbf{y}^{T}\mathbf{T}_{l}\mathbf{y}+\frac{\sigma_{k}}{3}\left\|\mathbf{y}\right\|_{2}^{3}. (67)

The properties \mathbf{q}_{i}\perp\mathbf{q}_{j} for i\neq j and \mathbf{G}_{k}\perp\mathbf{q}_{i} for i\neq 1 are used in the derivation. Therefore, solving \min_{\bm{\eta}\in\mathcal{K}_{l}}\ \hat{m}(\bm{\eta}) is equivalent to solving

\min_{\mathbf{y}\in\mathbb{R}^{l}}\ y_{1}\delta\left\|\mathbf{G}_{k}\right\|_{\mathbf{x}_{k}}+\frac{1}{2}\mathbf{y}^{T}\mathbf{T}_{l}\mathbf{y}+\frac{\sigma_{k}}{3}\left\|\mathbf{y}\right\|_{2}^{3}. (68)
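For concreteness, the following is a minimal Euclidean sketch of this reduction in Python (NumPy/SciPy), assuming a symmetric matrix H and a vector g as stand-ins for the operator \mathbf{H}_{k} and the subsampled gradient \mathbf{G}_{k}; the helper names lanczos_tridiag and solve_reduced_cubic, and the use of a generic L-BFGS-B search for the reduced problem, are illustrative assumptions rather than the paper's actual subproblem solver.

import numpy as np
from scipy.optimize import minimize

def lanczos_tridiag(H, g, l):
    """Orthonormal basis Q_l of the Krylov subspace K_l(H, g) and the
    tridiagonal matrix T_l = Q_l^T H Q_l (no breakdown handling, for brevity)."""
    D = g.size
    l = min(l, D)
    Q = np.zeros((D, l))
    alphas = np.zeros(l)
    betas = np.zeros(max(l - 1, 0))
    Q[:, 0] = g / np.linalg.norm(g)
    for i in range(l):
        w = H @ Q[:, i]
        alphas[i] = Q[:, i] @ w
        w = w - alphas[i] * Q[:, i]
        if i > 0:
            w = w - betas[i - 1] * Q[:, i - 1]
        if i < l - 1:
            betas[i] = np.linalg.norm(w)
            Q[:, i + 1] = w / betas[i]
    T = np.diag(alphas)
    if l > 1:
        T = T + np.diag(betas, 1) + np.diag(betas, -1)
    return Q, T

def solve_reduced_cubic(H, g, sigma, l, delta=1.0):
    """Minimize the reduced model of Eq. (68):
    y_1 * delta * ||g|| + 0.5 * y^T T_l y + (sigma / 3) * ||y||^3,
    then map the solution back as eta = Q_l y.  delta weights the gradient term."""
    Q, T = lanczos_tridiag(H, g, l)
    gnorm = np.linalg.norm(g)

    def reduced_model(y):
        return delta * y[0] * gnorm + 0.5 * y @ T @ y + sigma / 3.0 * np.linalg.norm(y) ** 3

    # A generic solver stands in for a dedicated cubic-subproblem solver.
    y = minimize(reduced_model, np.zeros(l), method="L-BFGS-B").x
    return Q @ y

# Toy usage with an indefinite quadratic model plus cubic regularization.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 20))
H = 0.5 * (A + A.T)          # symmetric stand-in for the Hessian operator H_k
g = rng.standard_normal(20)  # stand-in for the (subsampled) gradient G_k
eta = solve_reduced_cubic(H, g, sigma=1.0, l=8)
print("norm of the candidate step:", np.linalg.norm(eta))

Because the reduced matrix is tridiagonal and l is typically much smaller than D, the problem in Eq. (68) is cheap to solve even when the ambient dimension is large.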

Appendix B Appendix: Proof of Lemma 1

Proof

Let \lambda_{*}=\sigma_{k}\left\|\bm{\eta}_{k}^{*}\right\|_{\mathbf{x}_{k}}. The Krylov subspaces are invariant to shifts by scalar matrices; therefore, \mathcal{K}_{l}(\mathbf{H}_{k},\mathbf{G}_{k})=\mathcal{K}_{l}(\mathbf{H}_{k}+\lambda_{*}{\rm Id},\mathbf{G}_{k})=\mathcal{K}_{l}(\tilde{\mathbf{H}}_{k},\mathbf{G}_{k}) carmon2018analysis, where the definition of \mathcal{K}_{l}(\mathbf{H}_{k},\mathbf{G}_{k}) follows Eq. (64). Let \bm{\xi}_{l}\in\mathcal{K}_{l} be the solution found in the Krylov subspace \mathcal{K}_{l}(\mathbf{H}_{k},\mathbf{G}_{k}), which is thus an element of \mathcal{K}_{l}(\tilde{\mathbf{H}}_{k},\mathbf{G}_{k}) expressed by

𝝃l=pl(𝐇~k)[𝐆k]=c0𝐆kc1𝐇~k[𝐆k]cl𝐇~kl[𝐆k],\bm{\xi}_{l}=-p_{l}\left(\tilde{\mathbf{H}}_{k}\right)[\mathbf{G}_{k}]=-c_{0}\mathbf{G}_{k}-c_{1}\tilde{\mathbf{H}}_{k}[\mathbf{G}_{k}]\cdots-c_{l}\tilde{\mathbf{H}}_{k}^{l}[\mathbf{G}_{k}], (69)

for some values of c0,c1,,clc_{0},\;c_{1},\;\ldots,\;c_{l}\in\mathbb{R}. According to Section 6.2 of absil2009optimization , a global minimizer 𝜼¯k\bar{\bm{\eta}}_{k}^{*} of the RTR subproblem without cubic regularization in Eq. (7) is expected to satisfy the Riemannian quasi-Newton equation:

gradm¯k(𝟎k)+(Hessm¯k(𝟎k)+λId)[𝜼¯k]=𝟎𝐱k,{\rm grad}\bar{m}_{k}({\mathbf{0}_{k}})+\left({\rm Hess}\bar{m}_{k}(\mathbf{0}_{k})+\lambda_{*}{\rm Id}\right)[\bar{\bm{\eta}}_{k}^{*}]=\mathbf{0}_{\mathbf{x}_{k}}, (70)

where λmax(λmin(𝐇k),0)\lambda_{*}\geq\max(-\lambda_{min}(\mathbf{H}_{k}),0) and λ(Δk𝜼¯k𝐱k)=0\lambda_{*}\left(\Delta_{k}-\left\|\bar{\bm{\eta}}_{k}^{*}\right\|_{\mathbf{x}_{k}}\right)=0 according to Corollary 7.2.2 of conn2000trust . Using the approximate gradient and Hessian, the inexact minimizer is expected to satisfy

𝐆k+(𝐇k+λId)[𝜼¯k]=𝐆k+𝐇~k[𝜼¯k]=𝟎𝐱k.\mathbf{G}_{k}+(\mathbf{H}_{k}+\lambda_{*}{\rm Id})[\bar{\bm{\eta}}_{k}^{*}]=\mathbf{G}_{k}+\tilde{\mathbf{H}}_{k}[\bar{\bm{\eta}}_{k}^{*}]=\mathbf{0}_{\mathbf{x}_{k}}. (71)

Introduce 𝜻l=(1α)𝝃l\bm{\zeta}_{l}=(1-\alpha)\bm{\xi}_{l} where α=𝝃l𝐱k𝜼¯k𝐱kmax(𝝃l𝐱k,𝜼¯k𝐱k)\alpha=\frac{\left\|\bm{\xi}_{l}\right\|_{\mathbf{x}_{k}}-\left\|\bar{\bm{\eta}}_{k}^{*}\right\|_{\mathbf{x}_{k}}}{\max\left(\left\|\bm{\xi}_{l}\right\|_{\mathbf{x}_{k}},\left\|\bar{\bm{\eta}}_{k}^{*}\right\|_{\mathbf{x}_{k}}\right)}. When 𝝃l𝐱k<𝜼¯k𝐱k\left\|\bm{\xi}_{l}\right\|_{\mathbf{x}_{k}}<\left\|\bar{\bm{\eta}}_{k}^{*}\right\|_{\mathbf{x}_{k}}, we start from the fact that (𝜼¯k𝐱k𝝃l𝐱k)20\left(\left\|\bar{\bm{\eta}}_{k}^{*}\right\|_{\mathbf{x}_{k}}-\left\|\bm{\xi}_{l}\right\|_{\mathbf{x}_{k}}\right)^{2}\geq 0, which results in the following:

(2𝜼¯k𝐱k𝝃l𝐱k)𝝃l𝐱k𝜼¯k𝐱k2\displaystyle\left(2\left\|\bar{\bm{\eta}}_{k}^{*}\right\|_{\mathbf{x}_{k}}-\left\|\bm{\xi}_{l}\right\|_{\mathbf{x}_{k}}\right)\left\|\bm{\xi}_{l}\right\|_{\mathbf{x}_{k}}\leq\left\|\bar{\bm{\eta}}_{k}^{*}\right\|_{\mathbf{x}_{k}}^{2} (1+𝜼¯k𝐱k𝝃l𝐱k𝜼¯k𝐱k)𝝃l𝐱k𝜼¯k𝐱k\displaystyle\Longleftrightarrow\left(1+\frac{\left\|\bar{\bm{\eta}}_{k}^{*}\right\|_{\mathbf{x}_{k}}-\left\|\bm{\xi}_{l}\right\|_{\mathbf{x}_{k}}}{\left\|\bar{\bm{\eta}}_{k}^{*}\right\|_{\mathbf{x}_{k}}}\right)\left\|\bm{\xi}_{l}\right\|_{\mathbf{x}_{k}}\leq\left\|\bar{\bm{\eta}}_{k}^{*}\right\|_{\mathbf{x}_{k}}
(1α)𝝃l𝐱k𝜼¯k𝐱k\displaystyle\Longleftrightarrow(1-\alpha)\left\|\bm{\xi}_{l}\right\|_{\mathbf{x}_{k}}\leq\left\|\bar{\bm{\eta}}_{k}^{*}\right\|_{\mathbf{x}_{k}}
𝜻l𝐱k𝜼¯k𝐱k.\displaystyle\Longleftrightarrow\left\|\bm{\zeta}_{l}\right\|_{\mathbf{x}_{k}}\leq\left\|\bar{\bm{\eta}}_{k}^{*}\right\|_{\mathbf{x}_{k}}. (72)

When \left\|\bm{\xi}_{l}\right\|_{\mathbf{x}_{k}}\geq\left\|\bar{\bm{\eta}}_{k}^{*}\right\|_{\mathbf{x}_{k}}, we have

𝜻l𝐱k=(1α)𝝃l𝐱k=(1𝝃l𝐱k𝜼¯k𝐱k𝝃l𝐱k)𝝃l𝐱k=𝜼¯k𝐱k𝝃l𝐱k𝝃l𝐱k=𝜼¯k𝐱k.\left\|{\bm{\zeta}}_{l}\right\|_{\mathbf{x}_{k}}=\left\|(1-\alpha)\bm{\xi}_{l}\right\|_{\mathbf{x}_{k}}=\left\|\left(1-\frac{\left\|\bm{\xi}_{l}\right\|_{\mathbf{x}_{k}}-\left\|\bar{\bm{\eta}}_{k}^{*}\right\|_{\mathbf{x}_{k}}}{\left\|\bm{\xi}_{l}\right\|_{\mathbf{x}_{k}}}\right)\bm{\xi}_{l}\right\|_{\mathbf{x}_{k}}=\frac{\left\|\bar{\bm{\eta}}_{k}^{*}\right\|_{\mathbf{x}_{k}}\left\|{\bm{\xi}}_{l}\right\|_{\mathbf{x}_{k}}}{\left\|{\bm{\xi}}_{l}\right\|_{\mathbf{x}_{k}}}=\left\|\bar{\bm{\eta}}_{k}^{*}\right\|_{\mathbf{x}_{k}}. (73)

This shows that, for any \bm{\xi}_{l}, we have \left\|\bm{\zeta}_{l}\right\|_{\mathbf{x}_{k}}\leq\left\|\bar{\bm{\eta}}_{k}^{*}\right\|_{\mathbf{x}_{k}}.

We introduce the notation \tilde{m}_{k} to denote the subproblem in Eq. (7) using the inexact Hessian \tilde{\mathbf{H}}_{k}. Let \psi_{k}^{*}=\left(p_{l+1}\left(\tilde{\mathbf{H}}_{k}\right)-{\rm Id}\right)[\bar{\bm{\eta}}_{k}^{*}] and \iota\left(\tilde{\mathbf{H}}_{k}\right)=\frac{\lambda_{max}(\tilde{\mathbf{H}}_{k})}{\lambda_{min}(\tilde{\mathbf{H}}_{k})}. Since \phi_{l}\left(\tilde{\mathbf{H}}_{k}\right) is an upper bound of \left\|p_{l+1}\left(\tilde{\mathbf{H}}_{k}\right)-{\rm Id}\right\|_{\mathbf{x}_{k}}, we have \left\|\psi_{k}^{*}\right\|_{\mathbf{x}_{k}}\leq\phi_{l}\left(\tilde{\mathbf{H}}_{k}\right)\left\|\bar{\bm{\eta}}_{k}^{*}\right\|_{\mathbf{x}_{k}}. Then, we have

m¯k(𝜻l)m¯k(𝜼¯k)\displaystyle\bar{m}_{k}(\bm{\zeta}_{l})-\bar{m}_{k}(\bar{\bm{\eta}}_{k}^{*})
=\displaystyle= m~k(𝜻l)m~k(𝜼¯k)+λ2(𝜼¯k𝐱k2𝜻l𝐱k2)\displaystyle\;\tilde{m}_{k}(\bm{\zeta}_{l})-\tilde{m}_{k}(\bar{\bm{\eta}}_{k}^{*})+\frac{\lambda_{*}}{2}\left(\left\|\bar{\bm{\eta}}_{k}^{*}\right\|^{2}_{\mathbf{x}_{k}}-\left\|\bm{\zeta}_{l}\right\|^{2}_{\mathbf{x}_{k}}\right)
=\displaystyle= 12𝜻l𝜼¯k,𝐇~k[𝜻l𝜼¯k]𝐱k+λ2(𝜼¯k𝐱k2𝜻l𝐱k2)\displaystyle\;\frac{1}{2}\left\langle\bm{\zeta}_{l}-\bar{\bm{\eta}}_{k}^{*},\tilde{\mathbf{H}}_{k}[\bm{\zeta}_{l}-\bar{\bm{\eta}}_{k}^{*}]\right\rangle_{\mathbf{x}_{k}}+\frac{\lambda_{*}}{2}\left(\left\|\bar{\bm{\eta}}_{k}^{*}\right\|^{2}_{\mathbf{x}_{k}}-\left\|\bm{\zeta}_{l}\right\|^{2}_{\mathbf{x}_{k}}\right)
\displaystyle\leq 12𝜻l𝜼¯k,𝐇~k[𝜻l𝜼¯k]𝐱k+λ𝜼¯k𝐱k(𝜼¯k𝐱k𝜻l𝐱k)\displaystyle\;\frac{1}{2}\left\langle\bm{\zeta}_{l}-\bar{\bm{\eta}}_{k}^{*},\tilde{\mathbf{H}}_{k}[\bm{\zeta}_{l}-\bar{\bm{\eta}}_{k}^{*}]\right\rangle_{\mathbf{x}_{k}}+\lambda_{*}\left\|\bar{\bm{\eta}}_{k}^{*}\right\|_{\mathbf{x}_{k}}\left(\left\|\bar{\bm{\eta}}_{k}^{*}\right\|_{\mathbf{x}_{k}}-\left\|\bm{\zeta}_{l}\right\|_{\mathbf{x}_{k}}\right)
\displaystyle\leq (1α)22(pl+1(𝐇~k)Id)[𝜼¯k],𝐇~k[(pl+1(𝐇~k)Id)[𝜼¯k]]𝐱k+λ𝜼¯k𝐱k2α2\displaystyle\;\frac{(1-\alpha)^{2}}{2}\left\langle\left(p_{l+1}\left(\tilde{\mathbf{H}}_{k}\right)-{\rm Id}\right)[\bar{\bm{\eta}}_{k}^{*}],\tilde{\mathbf{H}}_{k}\left[\left(p_{l+1}\left(\tilde{\mathbf{H}}_{k}\right)-{\rm Id}\right)\left[\bar{\bm{\eta}}_{k}^{*}\right]\right]\right\rangle_{\mathbf{x}_{k}}+\lambda_{*}\left\|\bar{\bm{\eta}}_{k}^{*}\right\|_{\mathbf{x}_{k}}^{2}\alpha^{2} (74)
=\displaystyle= (1α)22ψk𝐱k2ψkψk𝐱k,𝐇~k[ψkψk𝐱k]𝐱k+λ𝜼¯k𝐱k2α2\displaystyle\;\frac{(1-\alpha)^{2}}{2}\left\|\psi_{k}^{*}\right\|_{\mathbf{x}_{k}}^{2}\left\langle\frac{\psi_{k}^{*}}{\left\|\psi_{k}^{*}\right\|_{\mathbf{x}_{k}}},\tilde{\mathbf{H}}_{k}\left[\frac{\psi_{k}^{*}}{\left\|\psi_{k}^{*}\right\|_{\mathbf{x}_{k}}}\right]\right\rangle_{\mathbf{x}_{k}}+\lambda_{*}\left\|\bar{\bm{\eta}}_{k}^{*}\right\|_{\mathbf{x}_{k}}^{2}\alpha^{2}
\displaystyle\leq  2ϕl(𝐇~k)2𝜼¯k𝐱k2ι(𝐇~k)𝜼¯k𝜼¯k𝐱k,𝐇~k[𝜼¯k𝜼¯k𝐱k]𝐱k+ϕl(𝐇~k)2λ𝜼¯k𝐱k2\displaystyle\;2\phi_{l}\left(\tilde{\mathbf{H}}_{k}\right)^{2}\left\|\bar{\bm{\eta}}_{k}^{*}\right\|_{\mathbf{x}_{k}}^{2}\iota\left(\tilde{\mathbf{H}}_{k}\right)\left\langle\frac{\bar{\bm{\eta}}_{k}^{*}}{\left\|\bar{\bm{\eta}}_{k}^{*}\right\|_{\mathbf{x}_{k}}},\tilde{\mathbf{H}}_{k}\left[\frac{\bar{\bm{\eta}}_{k}^{*}}{\left\|\bar{\bm{\eta}}_{k}^{*}\right\|_{\mathbf{x}_{k}}}\right]\right\rangle_{\mathbf{x}_{k}}+\phi_{l}(\tilde{\mathbf{H}}_{k})^{2}\lambda_{*}\left\|\bar{\bm{\eta}}_{k}^{*}\right\|_{\mathbf{x}_{k}}^{2} (75)
\displaystyle\leq  4ι(𝐇~k)(12𝜼¯k,𝐇~k[𝜼¯k]𝐱k+12λ𝜼¯k𝐱k2)ϕl(𝐇~k)2\displaystyle\;4\iota\left(\tilde{\mathbf{H}}_{k}\right)\left(\frac{1}{2}\left\langle\bar{\bm{\eta}}_{k}^{*},\tilde{\mathbf{H}}_{k}[\bar{\bm{\eta}}_{k}^{*}]\right\rangle_{\mathbf{x}_{k}}+\frac{1}{2}\lambda_{*}\left\|\bar{\bm{\eta}}_{k}^{*}\right\|_{\mathbf{x}_{k}}^{2}\right)\phi_{l}\left(\tilde{\mathbf{H}}_{k}\right)^{2} (76)
=\displaystyle=  4ι(𝐇~k)(m¯k(𝟎𝐱k)m¯k(𝜼¯k))ϕl(𝐇~k)2.\displaystyle\;4\iota\left(\tilde{\mathbf{H}}_{k}\right)\left(\bar{m}_{k}(\mathbf{0}_{\mathbf{x}_{k}})-\bar{m}_{k}\left(\bar{\bm{\eta}}_{k}^{*}\right)\right)\phi_{l}\left(\tilde{\mathbf{H}}_{k}\right)^{2}. (77)

To derive Eq. (74), \mathbf{G}_{k}=-\tilde{\mathbf{H}}_{k}[\bar{\bm{\eta}}_{k}^{*}] from Eq. (71) is used. To derive Eq. (75), we use the definition in Eq. (35), where for a non-zero \bm{\xi}\in T_{\mathbf{x}_{k}}\mathcal{M}, we have

𝐇k[𝝃]𝐱k𝝃𝐱k𝐇k𝐱k=sup𝜼T𝐱k,𝜼𝐱k0𝐇k[𝜼]𝐱k𝜼𝐱k.\frac{\left\|\mathbf{H}_{k}[\bm{\xi}]\right\|_{\mathbf{x}_{k}}}{\left\|\bm{\xi}\right\|_{\mathbf{x}_{k}}}\leq\left\|\mathbf{H}_{k}\right\|_{\mathbf{x}_{k}}=\sup_{{\mathbf{\bm{\eta}}\in T_{\mathbf{x}_{k}}\mathcal{M},\|\mathbf{\bm{\eta}}\|_{\mathbf{x}_{k}}\neq 0}}\frac{\left\|\mathbf{H}_{k}[\mathbf{\bm{\eta}}]\right\|_{\mathbf{x}_{k}}}{\left\|\bm{\eta}\right\|_{\mathbf{x}_{k}}}. (78)

Eq. (75) also uses (1) the fact that \alpha\geq-1, which follows from \alpha being of the form \frac{a-b}{\max(a,b)}, so that 1-\alpha\leq 2; (2) the definition of the smallest and largest eigenvalues in Eqs. (16) and (20), which gives \frac{\langle\bm{\eta},\tilde{\mathbf{H}}_{k}[\bm{\eta}]\rangle_{\mathbf{x}_{k}}}{\langle\bm{\xi},\tilde{\mathbf{H}}_{k}[\bm{\xi}]\rangle_{\mathbf{x}_{k}}}\leq\frac{\lambda_{max}(\tilde{\mathbf{H}}_{k})}{\lambda_{min}(\tilde{\mathbf{H}}_{k})} for any unit tangent vectors \bm{\eta} and \bm{\xi}; and (3) the fact that

|α|\displaystyle|\alpha| =|𝝃l𝐱k𝜼¯k𝐱k|max(𝝃l𝐱k,𝜼¯k𝐱k)𝝃l𝜼¯k𝐱kmax(𝝃l𝐱k,𝜼¯k𝐱k)\displaystyle=\frac{\left|\left\|\bm{\xi}_{l}\right\|_{\mathbf{x}_{k}}-\left\|\bar{\bm{\eta}}_{k}^{*}\right\|_{\mathbf{x}_{k}}\right|}{\max\left(\left\|\bm{\xi}_{l}\right\|_{\mathbf{x}_{k}},\left\|\bar{\bm{\eta}}_{k}^{*}\right\|_{\mathbf{x}_{k}}\right)}\leq\frac{\left\|\bm{\xi}_{l}-\bar{\bm{\eta}}_{k}^{*}\right\|_{\mathbf{x}_{k}}}{\max\left(\left\|\bm{\xi}_{l}\right\|_{\mathbf{x}_{k}},\left\|\bar{\bm{\eta}}_{k}^{*}\right\|_{\mathbf{x}_{k}}\right)}
=(pl+1(𝐇~k)Id)[𝜼¯k]𝐱kmax(𝝃l𝐱k,𝜼¯k𝐱k)ϕl(𝐇k)𝜼¯k𝐱kmax(𝝃l𝐱k,𝜼¯k𝐱k)ϕl(𝐇k).\displaystyle=\frac{\left\|\left(p_{l+1}\left(\tilde{\mathbf{H}}_{k}\right)-{\rm Id}\right)[\bar{\bm{\eta}}_{k}^{*}]\right\|_{\mathbf{x}_{k}}}{\max\left(\left\|\bm{\xi}_{l}\right\|_{\mathbf{x}_{k}},\left\|\bar{\bm{\eta}}_{k}^{*}\right\|_{\mathbf{x}_{k}}\right)}\leq\frac{\phi_{l}(\mathbf{H}_{k})\left\|\bar{\bm{\eta}}_{k}^{*}\right\|_{\mathbf{x}_{k}}}{\max\left(\left\|\bm{\xi}_{l}\right\|_{\mathbf{x}_{k}},\left\|\bar{\bm{\eta}}_{k}^{*}\right\|_{\mathbf{x}_{k}}\right)}\leq\phi_{l}(\mathbf{H}_{k}). (79)

To derive Eq. (77), we use

m¯k(𝟎𝐱k)m¯k(𝜼¯k)=12𝜼¯k,𝐇~k[𝜼¯k]𝐱k+λ2𝜼¯k𝐱k2.\bar{m}_{k}(\mathbf{0}_{\mathbf{x}_{k}})-\bar{m}_{k}(\bar{\bm{\eta}}_{k}^{*})=\frac{1}{2}\left\langle\bar{\bm{\eta}}_{k}^{*},\tilde{\mathbf{H}}_{k}[\bar{\bm{\eta}}_{k}^{*}]\right\rangle_{\mathbf{x}_{k}}+\frac{\lambda_{*}}{2}\left\|\bar{\bm{\eta}}_{k}^{*}\right\|_{\mathbf{x}_{k}}^{2}. (80)

Next, as \bm{\eta}_{k}^{*l} is the minimizer of \bar{m}_{k} over the subspace \mathcal{K}_{l}(\mathbf{H}_{k},\mathbf{G}_{k}), we have \bar{m}_{k}\left(\bm{\eta}_{k}^{*l}\right)\leq\bar{m}_{k}\left(\bm{\zeta}_{l}\right), and hence

m¯k(𝜼kl)m¯k(𝜼¯k)4ι(𝐇~k)(m¯k(𝟎𝐱k)m¯k(𝜼¯k))ϕl(𝐇~k)2.\bar{m}_{k}\left(\bm{\eta}_{k}^{*l}\right)-\bar{m}_{k}\left(\bar{\bm{\eta}}_{k}^{*}\right)\leq 4\iota\left(\tilde{\mathbf{H}}_{k}\right)(\bar{m}_{k}(\mathbf{0}_{\mathbf{x}_{k}})-\bar{m}_{k}(\bar{\bm{\eta}}_{k}^{*}))\phi_{l}\left(\tilde{\mathbf{H}}_{k}\right)^{2}. (81)

We then show that the Lanczos method exhibits at least the same convergence property as above for the subsampled Riemannian cubic-regularization subproblem. Let 𝜼k{\bm{\eta}}_{k}^{*} be the global minimizer for the subproblem m^k\hat{m}_{k} in Eq. (10). 𝜼k{\bm{\eta}}_{k}^{*} is equivalent to 𝜼¯k\bar{\bm{\eta}}_{k}^{*} in the RTR subproblem with Δk=𝜼k𝐱k\Delta_{k}=\left\|{\bm{\eta}}_{k}^{*}\right\|_{\mathbf{x}_{k}} and λ=σk𝜼k𝐱k\lambda_{*}=\sigma_{k}\left\|{\bm{\eta}}_{k}^{*}\right\|_{\mathbf{x}_{k}} carmon2018analysis . Then, letting 𝜼kl\bm{\eta}_{k}^{*l} be the minimizer of m^k\hat{m}_{k} over 𝒦l(𝐇k,𝐆k)\mathcal{K}_{l}(\mathbf{H}_{k},\mathbf{G}_{k}) satisfying 𝜼kl𝐱k𝜼k𝐱k=Δk\left\|{\bm{\eta}}_{k}^{*l}\right\|_{\mathbf{x}_{k}}\leq\left\|{\bm{\eta}}_{k}^{*}\right\|_{\mathbf{x}_{k}}=\Delta_{k}, we have

\hat{m}_{k}({\bm{\eta}}_{k}^{*l})-\hat{m}_{k}({\bm{\eta}}_{k}^{*})
=m¯k(𝜼kl)m¯k(𝜼k)+σk3(𝜼kl𝐱k3𝜼k𝐱k3)\displaystyle=\bar{m}_{k}({\bm{\eta}}_{k}^{*l})-\bar{m}_{k}({\bm{\eta}}_{k}^{*})+\frac{\sigma_{k}}{3}(\left\|{\bm{\eta}}_{k}^{*l}\right\|_{\mathbf{x}_{k}}^{3}-\left\|{\bm{\eta}}_{k}^{*}\right\|_{\mathbf{x}_{k}}^{3})
m¯k(𝜼kl)m¯k(𝜼k)=m¯k(𝜼kl)m¯k(𝜼¯k)\displaystyle\leq\bar{m}_{k}({\bm{\eta}}_{k}^{*l})-\bar{m}_{k}({\bm{\eta}}_{k}^{*})=\bar{m}_{k}({\bm{\eta}}_{k}^{*l})-\bar{m}_{k}(\bar{\bm{\eta}}_{k}^{*}) (82)

Combining this with Eq. (81), we have

m^k(𝜼kl)m^k(𝜼k)4ι(𝐇~k)(m¯k(𝟎𝐱k)m¯k(𝜼¯k))ϕl(𝐇~k)2.\hat{m}_{k}\left(\bm{\eta}_{k}^{*l}\right)-\hat{m}_{k}\left({\bm{\eta}}_{k}^{*}\right)\leq 4\iota\left(\tilde{\mathbf{H}}_{k}\right)(\bar{m}_{k}(\mathbf{0}_{\mathbf{x}_{k}})-\bar{m}_{k}(\bar{\bm{\eta}}_{k}^{*}))\phi_{l}\left(\tilde{\mathbf{H}}_{k}\right)^{2}. (83)

This completes the proof.

Appendix C Appendix: Proof of Theorem 1

Proof

We first prove the relationship between 𝐆ki\mathbf{G}_{k}^{i} and 𝐫i\mathbf{r}_{i}. According to Algorithm 3, 𝐫0=𝐆k0=𝐆k\mathbf{r}_{0}=\mathbf{G}_{k}^{0}=\mathbf{G}_{k}. Then for i>0i>0, we have

𝐆ki\displaystyle\mathbf{G}_{k}^{i} =1|𝒮g|j𝒮gαi𝐩ifj(R𝐱ki1(αi𝐩i))\displaystyle=\frac{1}{|\mathcal{S}_{g}|}\sum_{j\in\mathcal{S}_{g}}\nabla_{\alpha_{i}^{*}\mathbf{p}_{i}}f_{j}\left(R_{\mathbf{x}_{k}^{i-1}}\left(\alpha_{i}^{*}\mathbf{p}_{i}\right)\right)
1|𝒮g|j𝒮gαi𝐩i(fj(𝐱ki1)+𝐆ki1,αi𝐩i𝐱k+12αi𝐩i,𝐇ki1[αi𝐩i]𝐱k)\displaystyle\approx\frac{1}{|\mathcal{S}_{g}|}\sum_{j\in\mathcal{S}_{g}}\nabla_{\alpha_{i}^{*}\mathbf{p}_{i}}\left(f_{j}\left(\mathbf{x}_{k}^{i-1}\right)+\left\langle\mathbf{G}_{k}^{i-1},\alpha_{i}^{*}\mathbf{p}_{i}\right\rangle_{\mathbf{x}_{k}}+\frac{1}{2}\left\langle\alpha_{i}^{*}\mathbf{p}_{i},\mathbf{H}_{k}^{i-1}[\alpha_{i}^{*}\mathbf{p}_{i}]\right\rangle_{\mathbf{x}_{k}}\right) (84)
=𝐆ki1+αi𝐇ki1[𝐩i]=𝐫i1+αi𝐇ki1[𝐩i]=𝐫i,\displaystyle=\mathbf{G}_{k}^{i-1}+\alpha_{i}^{*}\mathbf{H}_{k}^{i-1}[\mathbf{p}_{i}]=\mathbf{r}_{i-1}+\alpha_{i}^{*}\mathbf{H}_{k}^{i-1}[\mathbf{p}_{i}]=\mathbf{r}_{i}, (85)

where \approx comes from the first-order Taylor expansion and Eq. (85) follows Step (11) in Algorithm 3.

The exact line search in Step (3) of Algorithm 3 approximates

αi=argminα0f(R𝐱ki1(α𝐩i)).\alpha_{i}^{*}=\arg\min_{\alpha\geq 0}f\left(R_{\mathbf{x}_{k}^{i-1}}(\alpha\mathbf{p}_{i})\right). (86)

Setting the derivative of the objective in Eq. (86) with respect to \alpha to zero gives

𝟎𝐱ki=αif(R𝐱ki1(αi𝐩i))=f(𝐱ki),𝒫αi𝐩id(αi𝐩i)dαi𝐱ki𝐆ki,𝒫αi𝐩i𝐩i𝐱ki,\mathbf{0}_{\mathbf{x}_{k}^{i}}=\nabla_{\alpha_{i}^{*}}f\left(R_{\mathbf{x}_{k}^{i-1}}\left(\alpha_{i}^{*}\mathbf{p}_{i}\right)\right)=\left\langle\nabla f\left(\mathbf{x}_{k}^{i}\right),\mathcal{P}_{\alpha_{i}^{*}\mathbf{p}_{i}}\frac{d(\alpha_{i}^{*}\mathbf{p}_{i})}{d\alpha_{i}^{*}}\right\rangle_{\mathbf{x}_{k}^{i}}\approx\left\langle\mathbf{G}_{k}^{i},\mathcal{P}_{\alpha_{i}^{*}\mathbf{p}_{i}}\mathbf{p}_{i}\right\rangle_{\mathbf{x}_{k}^{i}}, (87)

where \approx results from the use of the subsampled gradient \mathbf{G}_{k}^{i} to approximate the full gradient \nabla f(\mathbf{x}_{k}^{i}). We then show that each search direction is a sufficient descent direction, i.e., \left\langle\mathbf{G}_{k}^{i},\mathbf{p}_{i+1}\right\rangle_{\mathbf{x}_{k}^{i}}\leq-C\left\|\mathbf{G}_{k}^{i}\right\|_{\mathbf{x}_{k}^{i}}^{2} for some constant C>0 sakai2021sufficient. When i=0, \mathbf{p}_{1}=-\mathbf{G}_{k}^{0}, and thus \left\langle\mathbf{G}_{k}^{0},\mathbf{p}_{1}\right\rangle_{\mathbf{x}_{k}^{0}}=-\left\|\mathbf{G}_{k}^{0}\right\|_{\mathbf{x}_{k}^{0}}^{2}. When i>0, from Step (17) in Algorithm 3 and Eq. (85), we have \mathbf{p}_{i+1}\approx-\mathbf{G}_{k}^{i}+\beta_{i}\mathbf{p}_{i}. Applying the inner product with \mathbf{G}_{k}^{i} to both sides, we have

𝐆ki,𝐩i+1𝐱ki\displaystyle\langle\mathbf{G}_{k}^{i},\mathbf{p}_{i+1}\rangle_{\mathbf{x}_{k}^{i}} 𝐆ki𝐱ki2+βi𝐆ki,𝒫αi𝐩i𝐩i𝐱ki\displaystyle\approx-\left\|\mathbf{G}_{k}^{i}\right\|_{\mathbf{x}_{k}^{i}}^{2}+\beta_{i}\langle\mathbf{G}_{k}^{i},\mathcal{P}_{\alpha_{i}^{*}\mathbf{p}_{i}}\mathbf{p}_{i}\rangle_{\mathbf{x}_{k}^{i}} (88)
𝐆ki𝐱ki2C𝐆ki𝐱ki2,\displaystyle\approx-\left\|\mathbf{G}_{k}^{i}\right\|_{\mathbf{x}_{k}^{i}}^{2}\leq-C\left\|\mathbf{G}_{k}^{i}\right\|_{\mathbf{x}_{k}^{i}}^{2}, (89)

with a selected C>0C>0. Here, Eq. (89) builds on Eq. (87). Given the sufficient descent direction 𝐩i\mathbf{p}_{i} and the strong Wolfe conditions satisfied by αi\alpha_{i}^{*}, Theorem 2.4.1 in qi2011numerical shows that the Zoutendijk Condition holds sato2015new , i.e.,

i=0𝐆ki,𝐩i+1𝐱ki2𝐩i+1𝐱ki2<.\sum_{i=0}^{\infty}\frac{\left\langle\mathbf{G}_{k}^{i},\mathbf{p}_{i+1}\right\rangle_{\mathbf{x}_{k}^{i}}^{2}}{\left\|\mathbf{p}_{i+1}\right\|_{\mathbf{x}_{k}^{i}}^{2}}<\infty. (90)

Next, we show that βi\beta_{i} is upper bounded. Using Eq. (85), we have

βi\displaystyle\beta_{i} 𝐆ki,𝐆ki𝐆ki𝐱ki𝐆ki1𝐱ki1𝒫αi𝐓i𝐆ki1𝐱ki2𝐆ki1,𝐆ki1𝐱ki1\displaystyle\approx\frac{\left\langle\mathbf{G}_{k}^{i},\mathbf{G}_{k}^{i}-\frac{\left\|\mathbf{G}_{k}^{i}\right\|_{\mathbf{x}_{k}^{i}}}{\left\|\mathbf{G}_{k}^{i-1}\right\|_{\mathbf{x}_{k}^{i-1}}}\mathcal{P}_{\alpha_{i}^{*}\mathbf{T}_{i}}\mathbf{G}_{k}^{i-1}\right\rangle_{\mathbf{x}_{k}^{i}}}{2\left\langle\mathbf{G}_{k}^{i-1},\mathbf{G}_{k}^{i-1}\right\rangle_{\mathbf{x}_{k}^{i-1}}}
=𝐆ki𝐱ki2𝐆ki𝐱ki𝐆ki1𝐱ki1cosθ𝐆ki𝐱ki𝐆ki1𝐱ki12𝐆ki1𝐱ki12𝐆ki𝐱ki2𝐆ki1𝐱ki12,\displaystyle=\frac{\left\|\mathbf{G}_{k}^{i}\right\|_{\mathbf{x}_{k}^{i}}^{2}-\frac{\left\|\mathbf{G}_{k}^{i}\right\|_{\mathbf{x}_{k}^{i}}}{\left\|\mathbf{G}_{k}^{i-1}\right\|_{\mathbf{x}_{k}^{i-1}}}\cos\theta\left\|\mathbf{G}_{k}^{i}\right\|_{\mathbf{x}_{k}^{i}}\left\|\mathbf{G}_{k}^{i-1}\right\|_{\mathbf{x}_{k}^{i-1}}}{2\left\|\mathbf{G}_{k}^{i-1}\right\|_{\mathbf{x}_{k}^{i-1}}^{2}}\leq\frac{\left\|\mathbf{G}_{k}^{i}\right\|_{\mathbf{x}_{k}^{i}}^{2}}{\left\|\mathbf{G}_{k}^{i-1}\right\|_{\mathbf{x}_{k}^{i-1}}^{2}}, (91)

where c3>1c_{3}>1 is some constant.

Now, we prove \lim_{i\to\infty}\left\|\mathbf{G}_{k}^{i}\right\|_{\mathbf{x}_{k}^{i}}=0 by contradiction. Assume that \lim_{i\to\infty}\left\|\mathbf{G}_{k}^{i}\right\|_{\mathbf{x}_{k}^{i}}>0, that is, there exists \gamma>0 such that \left\|\mathbf{G}_{k}^{i}\right\|_{\mathbf{x}_{k}^{i}}>\gamma>0 for all i. Taking the squared norm of Step (17) of Algorithm 3 and applying Eqs. (30), (85), (89) and (91), we have

𝐩i+1𝐱ki2\displaystyle\left\|\mathbf{p}_{i+1}\right\|_{\mathbf{x}_{k}^{i}}^{2} 𝐆ki𝐱ki2+2βi|𝐆ki,𝒫αi𝐩i𝐩i𝐱ki|+βi2𝒫αi𝐩i𝐩i𝐱ki2\displaystyle\leq\left\|\mathbf{G}_{k}^{i}\right\|_{\mathbf{x}_{k}^{i}}^{2}+2\beta_{i}\left|\left\langle\mathbf{G}_{k}^{i},\mathcal{P}_{\alpha_{i}^{*}\mathbf{p}_{i}}\mathbf{p}_{i}\right\rangle_{\mathbf{x}_{k}^{i}}\right|+\beta_{i}^{2}\left\|\mathcal{P}_{\alpha_{i}^{*}\mathbf{p}_{i}}\mathbf{p}_{i}\right\|_{\mathbf{x}_{k}^{i}}^{2}
𝐆ki𝐱ki22c2𝐆ki𝐱ki2𝐆ki1𝐱ki12𝐆ki1,𝐩i𝐱ki1+βi2𝐩i𝐱ki12\displaystyle\leq\left\|\mathbf{G}_{k}^{i}\right\|_{\mathbf{x}_{k}^{i}}^{2}-2c_{2}\frac{\left\|\mathbf{G}_{k}^{i}\right\|_{\mathbf{x}_{k}^{i}}^{2}}{\left\|\mathbf{G}_{k}^{i-1}\right\|_{\mathbf{x}_{k}^{i-1}}^{2}}\left\langle\mathbf{G}_{k}^{i-1},\mathbf{p}_{i}\right\rangle_{\mathbf{x}_{k}^{i-1}}+\beta_{i}^{2}\left\|\mathbf{p}_{i}\right\|_{\mathbf{x}_{k}^{i-1}}^{2}
𝐆ki𝐱ki2+2c2C𝐆ki𝐱ki2𝐆ki1𝐱ki12𝐆ki1𝐱ki12+βi2𝐩i𝐱ki12\displaystyle\leq\left\|\mathbf{G}_{k}^{i}\right\|_{\mathbf{x}_{k}^{i}}^{2}+2c_{2}C\frac{\left\|\mathbf{G}_{k}^{i}\right\|_{\mathbf{x}_{k}^{i}}^{2}}{\left\|\mathbf{G}_{k}^{i-1}\right\|_{\mathbf{x}_{k}^{i-1}}^{2}}\left\|\mathbf{G}_{k}^{i-1}\right\|_{\mathbf{x}_{k}^{i-1}}^{2}+\beta_{i}^{2}\left\|\mathbf{p}_{i}\right\|_{\mathbf{x}_{k}^{i-1}}^{2}
=C^𝐆ki𝐱ki2+βi2𝐩i𝐱ki12,\displaystyle=\hat{C}\left\|\mathbf{G}_{k}^{i}\right\|_{\mathbf{x}_{k}^{i}}^{2}+\beta_{i}^{2}\left\|\mathbf{p}_{i}\right\|_{\mathbf{x}_{k}^{i-1}}^{2}, (92)

where \hat{C}=1+2c_{2}C>1. Applying this recursively, we have

𝐩i+1𝐱ki2\displaystyle\left\|\mathbf{p}_{i+1}\right\|_{\mathbf{x}_{k}^{i}}^{2} C^𝐆ki𝐱ki2+βi2(C^𝐆ki1𝐱ki12+βi12𝐩i1𝐱ki22)\displaystyle\leq\hat{C}\left\|\mathbf{G}_{k}^{i}\right\|_{\mathbf{x}_{k}^{i}}^{2}+\beta_{i}^{2}\left(\hat{C}\left\|\mathbf{G}_{k}^{i-1}\right\|_{\mathbf{x}_{k}^{i-1}}^{2}+\beta_{i-1}^{2}\left\|\mathbf{p}_{i-1}\right\|_{\mathbf{x}_{k}^{i-2}}^{2}\right)
C^(𝐆ki𝐱ki2+βi2𝐆ki1𝐱ki12++j=2iβj2𝐆k1𝐱k12)+𝐩1𝐱k02j=1iβj2\displaystyle\leq\hat{C}\left(\left\|\mathbf{G}_{k}^{i}\right\|_{\mathbf{x}_{k}^{i}}^{2}+\beta_{i}^{2}\left\|\mathbf{G}_{k}^{i-1}\right\|_{\mathbf{x}_{k}^{i-1}}^{2}+\cdots+\prod_{j=2}^{i}\beta_{j}^{2}\left\|\mathbf{G}_{k}^{1}\right\|_{\mathbf{x}_{k}^{1}}^{2}\right)+\left\|\mathbf{p}_{1}\right\|_{\mathbf{x}_{k}^{0}}^{2}\prod_{j=1}^{i}\beta_{j}^{2}
C^𝐆ki𝐱ki4(1𝐆ki𝐱ki2+1𝐆ki1𝐱ki12++1𝐆k1𝐱k12+1C^𝐆k0𝐱k02)\displaystyle\leq\hat{C}\left\|\mathbf{G}_{k}^{i}\right\|_{\mathbf{x}_{k}^{i}}^{4}\left(\frac{1}{\left\|\mathbf{G}_{k}^{i}\right\|_{\mathbf{x}_{k}^{i}}^{2}}+\frac{1}{\left\|\mathbf{G}_{k}^{i-1}\right\|_{\mathbf{x}_{k}^{i-1}}^{2}}+\cdots+\frac{1}{\left\|\mathbf{G}_{k}^{1}\right\|_{\mathbf{x}_{k}^{1}}^{2}}+\frac{1}{\hat{C}\left\|\mathbf{G}_{k}^{0}\right\|_{\mathbf{x}_{k}^{0}}^{2}}\right) (93)
C^𝐆ki𝐱ki4j=0i1𝐆kj𝐱kj2C^(i+1)γ2𝐆ki𝐱ki4,\displaystyle\leq\hat{C}\left\|\mathbf{G}_{k}^{i}\right\|_{\mathbf{x}_{k}^{i}}^{4}\sum_{j=0}^{i}\frac{1}{\left\|\mathbf{G}_{k}^{j}\right\|_{\mathbf{x}_{k}^{j}}^{2}}\leq\frac{\hat{C}(i+1)}{\gamma^{2}}\left\|\mathbf{G}_{k}^{i}\right\|_{\mathbf{x}_{k}^{i}}^{4},

where Eq. (93) uses Eq. (91) and \mathbf{p}_{1}=-\mathbf{G}_{k}^{0} (so that \left\|\mathbf{p}_{1}\right\|_{\mathbf{x}_{k}^{0}}=\left\|\mathbf{G}_{k}^{0}\right\|_{\mathbf{x}_{k}^{0}}), and the last inequality uses \left\|\mathbf{G}_{k}^{i}\right\|_{\mathbf{x}_{k}^{i}}>\gamma. Subsequently, this gives

𝐆ki𝐱ki4𝐩i+1𝐱ki2γ2C^(i+1).\frac{\left\|\mathbf{G}_{k}^{i}\right\|_{\mathbf{x}_{k}^{i}}^{4}}{\left\|\mathbf{p}_{i+1}\right\|_{\mathbf{x}_{k}^{i}}^{2}}\geq\frac{\gamma^{2}}{\hat{C}(i+1)}. (94)

Combining Eqs. (89) and (94), we have

i=0𝐆ki,𝐩i+1𝐱ki2𝐩i+1𝐱ki2=i=0𝐆ki𝐱ki4𝐩i+1𝐱ki2×𝐆ki,𝐩i+1𝐱ki2𝐆ki𝐱ki4i=0C2γ2C^(i+1)=.\sum_{i=0}^{\infty}\frac{\left\langle\mathbf{G}_{k}^{i},\mathbf{p}_{i+1}\right\rangle_{\mathbf{x}_{k}^{i}}^{2}}{\left\|\mathbf{p}_{i+1}\right\|_{\mathbf{x}_{k}^{i}}^{2}}=\sum_{i=0}^{\infty}\frac{\left\|\mathbf{G}_{k}^{i}\right\|_{\mathbf{x}_{k}^{i}}^{4}}{\left\|\mathbf{p}_{i+1}\right\|_{\mathbf{x}_{k}^{i}}^{2}}\times\frac{\langle\mathbf{G}_{k}^{i},\mathbf{p}_{i+1}\rangle_{\mathbf{x}_{k}^{i}}^{2}}{\left\|\mathbf{G}_{k}^{i}\right\|_{\mathbf{x}_{k}^{i}}^{4}}\geq\sum_{i=0}^{\infty}\frac{C^{2}\gamma^{2}}{\hat{C}(i+1)}=\infty. (95)

This contradicts Eq. (90) and completes the proof.

Appendix D Appendix: Subproblem Solvers

D.1 Proof of Lemma 2

D.1.1 The Case of l=Dl=D

Proof

Assumption 1: Regarding the Cauchy condition in Eq. (36), it is satisfied simply because of Eq. (19) and Eq. (33). Regarding the eigenstep condition, the proof is also fairly simple. The solution \bm{\eta}_{k}^{*} from Algorithm 2 with l=D is the global minimizer over \mathbb{R}^{D}. As the subspace spanned by Cauchy steps and eigensteps is contained in \mathbb{R}^{D}, i.e., {\rm Span}\left\{\bm{\eta}_{k}^{C},\bm{\eta}_{k}^{E}\right\}\subseteq\mathbb{R}^{D}, we have

m^k(𝜼k)min𝜼Span{𝜼kC,𝜼kE}m^k(𝜼).\hat{m}_{k}(\bm{\eta}_{k}^{*})\leq\min_{\bm{\eta}\in{\rm Span}\left\{\bm{\eta}_{k}^{C},\bm{\eta}_{k}^{E}\right\}}\hat{m}_{k}(\bm{\eta}). (96)

Hence, the solution from Algorithm 2 satisfies the eigenstep condition in Eq. (37) of Assumption 1.

Assumption 2: As stated in Section 3.3 of cartis2011adaptive, any minimizer, including the global minimizer \bm{\eta}_{k}^{*} from Algorithm 2, is a stationary point of \hat{m}_{k}, and naturally has the property \nabla_{\bm{\eta}}\hat{m}_{k}(\bm{\eta}_{k}^{*})=\mathbf{0}_{\mathbf{x}_{k}}. Hence, we have \left\|\nabla_{\bm{\eta}}\hat{m}_{k}(\bm{\eta}_{k}^{*})\right\|_{\mathbf{x}_{k}}=0\leq\kappa_{\theta}\min\left(1,\left\|\bm{\eta}_{k}^{*}\right\|_{\mathbf{x}_{k}}\right)\|\mathbf{G}_{k}\|_{\mathbf{x}_{k}}. Assumption 2 is then satisfied.

Assumption 3: Given the above-mentioned property of the global minimizer \bm{\eta}_{k}^{*} of \hat{m}_{k} from Algorithm 2, i.e., \nabla_{\bm{\eta}}\hat{m}_{k}(\bm{\eta}_{k}^{*})=\mathbf{0}_{\mathbf{x}_{k}}, and using the definition of \nabla_{\bm{\eta}}\hat{m}_{k}(\bm{\eta}_{k}^{*}), we have

𝜼m^k(𝜼k)=𝐆k+𝐇k[𝜼k]+σk𝜼k𝐱k𝜼k=𝟎𝐱k.\nabla_{\bm{\eta}}\hat{m}_{k}(\bm{\eta}_{k}^{*})=\mathbf{G}_{k}+\mathbf{H}_{k}[\bm{\eta}_{k}^{*}]+\sigma_{k}\left\|\bm{\eta}_{k}^{*}\right\|_{\mathbf{x}_{k}}\bm{\eta}_{k}^{*}=\mathbf{0}_{\mathbf{x}_{k}}. (97)

Taking the inner product of both sides of Eq. (97) with \bm{\eta}_{k}^{*}, we have

𝐆k,𝜼k𝐱k+𝜼k,𝐇k[𝜼k]𝐱k+σk𝜼k𝐱k3=0,\langle\mathbf{G}_{k},\bm{\eta}_{k}^{*}\rangle_{\mathbf{x}_{k}}+\langle\bm{\eta}_{k}^{*},\mathbf{H}_{k}[\bm{\eta}_{k}^{*}]\rangle_{\mathbf{x}_{k}}+\sigma_{k}\left\|\bm{\eta}_{k}^{*}\right\|_{\mathbf{x}_{k}}^{3}=0, (98)

which corresponds to Eq. (42). As \bm{\eta}_{k}^{*} is a descent direction, we have \langle\mathbf{G}_{k},\bm{\eta}_{k}^{*}\rangle_{\mathbf{x}_{k}}\leq 0. Combining this with Eq. (98), we have

𝜼k,𝐇k[𝜼k]𝐱k+σk𝜼k𝐱k30.\langle\bm{\eta}_{k}^{*},\mathbf{H}_{k}[\bm{\eta}_{k}^{*}]\rangle_{\mathbf{x}_{k}}+\sigma_{k}\left\|\bm{\eta}_{k}^{*}\right\|_{\mathbf{x}_{k}}^{3}\geq 0. (99)

This is Eq. (43), which completes the proof.

D.1.2 The Case of l<Dl<D

Proof

Cauchy condition in Assumption 1: As implied by Eq. (19), any intermediate solution \bm{\eta}_{k}^{*l} satisfies the Cauchy condition.

Assumption 2: As stated in Section 3.3 of cartis2011adaptive, any minimizer \bm{\eta}^{*} of \hat{m}_{k} admits the property \nabla_{\bm{\eta}}\hat{m}_{k}(\bm{\eta}^{*})=\mathbf{0}_{\mathbf{x}_{k}}. Since each \bm{\eta}_{k}^{*l} is the minimizer of \hat{m}_{k} over \mathcal{K}_{l}, we have \nabla_{\bm{\eta}}\hat{m}_{k}(\bm{\eta}_{k}^{*l})=\mathbf{0}_{\mathbf{x}_{k}}, and hence \left\|\nabla_{\bm{\eta}}\hat{m}_{k}(\bm{\eta}_{k}^{*l})\right\|_{\mathbf{x}_{k}}=0\leq\kappa_{\theta}\min\left(1,\left\|\bm{\eta}_{k}^{*l}\right\|_{\mathbf{x}_{k}}\right)\|\mathbf{G}_{k}\|_{\mathbf{x}_{k}}. Assumption 2 is then satisfied.

Assumption 3: According to Lemma 3.2 of cartis2011adaptive , a global minimizer of m^k\hat{m}_{k} over a subspace of D\mathbb{R}^{D} satisfies Assumption 3. As the solution 𝜼kl\bm{\eta}_{k}^{*l} in each iteration of Algorithm 2 is a global solution over the subspace 𝒦l\mathcal{K}_{l}, each 𝜼kl\bm{\eta}_{k}^{*l} satisfies Assumption 3.

D.2 Proof of Lemma 3

Proof

For the Cauchy Condition of Assumption 1: In the first iteration of Algorithm 3, the step size is optimized along the gradient direction, as in Cauchy's classical steepest-descent method, i.e.,

𝜼k1=(argminαm^k(α𝐆k))𝐆k.\bm{\eta}_{k}^{1}=\left(\arg\min_{\alpha\in\mathbb{R}}\hat{m}_{k}(\alpha\mathbf{G}_{k})\right)\mathbf{G}_{k}. (100)

At each iteration i of Algorithm 3, the line search in Eq. (24) aims at finding a step size that achieves a cost decrease; otherwise, the step size is zero, meaning that no strict decrease can be achieved and the algorithm stops at Step (4). Because of this, we have

m^k(𝜼ki1+αi𝐩i)m^k(𝜼ki1).\hat{m}_{k}\left(\bm{\eta}_{k}^{i-1}+\alpha_{i}\mathbf{p}_{i}\right)\leq\hat{m}_{k}\left(\bm{\eta}_{k}^{i-1}\right). (101)

Given 𝜼ki=𝜼ki1+αi𝐩i\bm{\eta}_{k}^{i}=\bm{\eta}_{k}^{i-1}+\alpha_{i}^{*}\mathbf{p}_{i} in Algorithm 3, we have

m^k(𝜼ki)m^k(𝜼ki1).\hat{m}_{k}\left(\bm{\eta}_{k}^{i}\right)\leq\hat{m}_{k}\left(\bm{\eta}_{k}^{i-1}\right). (102)

Then, considering all i=1,2,i=1,2,..., we have

m^k(𝜼k)m^k(𝜼k1)m^k(𝜼k0).\hat{m}_{k}\left(\bm{\eta}_{k}^{*}\right)\leq\ldots\leq\hat{m}_{k}\left(\bm{\eta}_{k}^{1}\right)\leq\hat{m}_{k}\left(\bm{\eta}_{k}^{0}\right). (103)

This shows that Algorithm 3 always returns a solution at least as good as the Cauchy step, so the Cauchy condition is satisfied.
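For illustration, the following is a minimal Euclidean sketch of such a Cauchy step in Python (NumPy/SciPy); the helper cauchy_step and the generic scalar search used for the one-dimensional minimization are illustrative assumptions, not the exact line-search routine of Algorithm 3.

import numpy as np
from scipy.optimize import minimize_scalar

def cauchy_step(H, g, sigma):
    """Minimize the cubic model along the gradient direction, as in Eq. (100):
    m(alpha * g) = alpha * ||g||^2 + 0.5 * alpha^2 * g^T H g + (sigma / 3) * |alpha|^3 * ||g||^3
    (the constant f(x_k) term is dropped since it does not affect the minimizer)."""
    gnorm = np.linalg.norm(g)
    gHg = g @ (H @ g)

    def m_along_g(alpha):
        return alpha * gnorm ** 2 + 0.5 * alpha ** 2 * gHg + sigma / 3.0 * abs(alpha) ** 3 * gnorm ** 3

    alpha_star = minimize_scalar(m_along_g).x  # generic scalar search over alpha
    return alpha_star * g

rng = np.random.default_rng(1)
A = rng.standard_normal((10, 10))
H = 0.5 * (A + A.T)          # symmetric stand-in for the Hessian operator
g = rng.standard_normal(10)  # stand-in for the subsampled gradient
eta_c = cauchy_step(H, g, sigma=0.5)
print("Cauchy step length:", np.linalg.norm(eta_c))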

For \left\|\nabla_{\bm{\eta}}\hat{m}_{k}(\bm{\eta})\right\|_{\mathbf{x}_{k}}\approx 0: The approximation (\approx) of interest here builds on the fact that \hat{m}_{k}(\bm{\eta}) is used as an approximation of the true objective function f(R_{\mathbf{x}_{k}}(\bm{\eta})). Assuming \hat{m}_{k}(\bm{\eta})\approx f(R_{\mathbf{x}_{k}}(\bm{\eta})), this leads to

𝜼m^k(𝜼)𝜼f(R𝐱k(𝜼)).\nabla_{\bm{\eta}}\hat{m}_{k}(\bm{\eta})\approx\nabla_{\bm{\eta}}f(R_{\mathbf{x}_{k}}(\bm{\eta})). (104)

Let \mathbf{G}_{k+1} be the subsampled gradient evaluated at R_{\mathbf{x}_{k}}(\bm{\eta}_{k}^{*}). Based on Theorem 1, we have \mathbf{G}_{k+1}=\lim_{i\to\infty}\mathbf{G}_{k}^{i}, where \mathbf{G}_{k}^{i} is the resulting subsampled gradient after i inner iterations. Since \mathbf{G}_{k+1} is an unbiased approximation of the full gradient \nabla f(R_{\mathbf{x}_{k}}(\bm{\eta}_{k}^{*})), we have \nabla f(R_{\mathbf{x}_{k}}(\bm{\eta}_{k}^{*}))=\mathbb{E}[\mathbf{G}_{k+1}]=\mathbb{E}[\lim_{i\to\infty}\mathbf{G}_{k}^{i}]. Hence, we have

𝜼f(R𝐱k(𝜼k))𝐱k\displaystyle\left\|\nabla_{\bm{\eta}}f\left(R_{\mathbf{x}_{k}}(\bm{\eta}_{k}^{*})\right)\right\|_{\mathbf{x}_{k}} (105)
=\displaystyle= f(R𝐱k(𝜼k))d(R𝐱k(𝜼k))d𝜼|𝜼=𝜼k𝐱k=𝔼[limi𝐆ki]𝐱k𝔼[limi𝐆ki𝐱ki]=0,\displaystyle\left\|\nabla f\left(R_{\mathbf{x}_{k}}(\bm{\eta}_{k}^{*})\right)\frac{d\left(R_{\mathbf{x}_{k}}(\bm{\eta}_{k}^{*})\right)}{d\bm{\eta}}\Big{|}_{\bm{\eta}=\bm{\eta}_{k}^{*}}\right\|_{\mathbf{x}_{k}}=\left\|\mathbb{E}\left[\lim_{i\to\infty}\mathbf{G}_{k}^{i}\right]\right\|_{\mathbf{x}_{k}}\leq\mathbb{E}\left[\lim_{i\to\infty}\left\|\mathbf{G}_{k}^{i}\right\|_{\mathbf{x}_{k}^{i}}\right]=0,

which indicates that equality holds throughout, i.e., \left\|\nabla_{\bm{\eta}}f\left(R_{\mathbf{x}_{k}}(\bm{\eta}_{k}^{*})\right)\right\|_{\mathbf{x}_{k}}=0. Combining this with Eq. (104) completes the proof.

For Condition 1 of Assumption 3: Using the definition of \nabla_{\bm{\eta}}\hat{m}_{k}(\bm{\eta}_{k}^{*}) as in Eq. (97), we have

𝜼m^k(𝜼k),𝜼k\displaystyle\left\langle\nabla_{\bm{\eta}}\hat{m}_{k}(\bm{\eta}_{k}^{*}),\bm{\eta}_{k}^{*}\right\rangle =𝐆k+𝐇k[𝜼k]+σk𝜼k𝐱k𝜼k,𝜼k𝐱k\displaystyle=\langle\mathbf{G}_{k}+\mathbf{H}_{k}[\bm{\eta}_{k}^{*}]+\sigma_{k}\left\|\bm{\eta}_{k}^{*}\right\|_{\mathbf{x}_{k}}\bm{\eta}_{k}^{*},\bm{\eta}_{k}^{*}\rangle_{\mathbf{x}_{k}} (106)
=𝐆k,𝜼k𝐱k+𝜼k,𝐇k[𝜼k]𝐱k+σk𝜼k𝐱k3.\displaystyle=\langle\mathbf{G}_{k},\bm{\eta}_{k}^{*}\rangle_{\mathbf{x}_{k}}+\langle\bm{\eta}_{k}^{*},\mathbf{H}_{k}[\bm{\eta}_{k}^{*}]\rangle_{\mathbf{x}_{k}}+\sigma_{k}\left\|\bm{\eta}_{k}^{*}\right\|_{\mathbf{x}_{k}}^{3}.

Also, since 𝜼m^k(𝜼k)𝐱k0\left\|\nabla_{\bm{\eta}}\hat{m}_{k}(\bm{\eta}_{k}^{*})\right\|_{\mathbf{x}_{k}}\approx 0, we have

|𝜼m^k(𝜼k),𝜼k|𝜼m^k(𝜼k)𝐱k𝜼k𝐱k0.\left|\left\langle\nabla_{\bm{\eta}}\hat{m}_{k}(\bm{\eta}_{k}^{*}),\bm{\eta}_{k}^{*}\right\rangle\right|\leq\left\|\nabla_{\bm{\eta}}\hat{m}_{k}(\bm{\eta}_{k}^{*})\right\|_{\mathbf{x}_{k}}\left\|\bm{\eta}_{k}^{*}\right\|_{\mathbf{x}_{k}}\approx 0. (107)

Combining Eqs. (106) and (107) results in Eq. (45), which completes the proof.

Appendix E Appendix: Proof of Theorem 2

E.1 Matrix Bernstein Inequality

We build the proof on the matrix Bernstein inequality. We restate this inequality in the following lemma.

Lemma 4 (Matrix Bernstein Inequality (gross2011recovering ; tropp2015introduction ))

Let 𝐀1,,𝐀n\mathbf{A}_{1},...,\mathbf{A}_{n} be independent, centered random matrices with the common dimension d1×d2d_{1}\times d_{2}. Assume that each one is uniformly bounded:

𝔼[𝐀i]=0,𝐀iμ,i=1,,n.\mathbb{E}[\mathbf{A}_{i}]=0,\|\mathbf{A}_{i}\|\leq\mu,\ i=1,...,n. (108)

Given the matrix sum 𝐙=i=1n𝐀i\mathbf{Z}=\sum_{i=1}^{n}\mathbf{A}_{i}, we define its variance ν(𝐙)\nu(\mathbf{Z}) by

ν(𝐙):=max{i=1n𝔼[𝐀i𝐀iT],i=1n𝔼[𝐀iT𝐀i]}.\begin{split}\nu(\mathbf{Z})&:=\max\left\{\left\|\sum_{i=1}^{n}\mathbb{E}\left[\mathbf{A}_{i}\mathbf{A}_{i}^{T}\right]\right\|,\left\|\sum_{i=1}^{n}\mathbb{E}\left[\mathbf{A}_{i}^{T}\mathbf{A}_{i}\right]\right\|\right\}.\end{split} (109)

Then

Pr(𝐙ϵ)(d1+d2)exp(ϵ2/2ν(𝐙)+μϵ/3)forallϵ>0.{\rm Pr}(\left\|\mathbf{Z}\right\|\geq\epsilon)\leq(d_{1}+d_{2})\exp\left(\frac{-\epsilon^{2}/2}{\nu(\mathbf{Z})+\mu\epsilon/3}\right)\ {\rm for\ all}\ \epsilon>0. (110)

This lemma supports the proof of Theorem 2 below.
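As a quick numerical sanity check of Lemma 4 (the matrix construction, dimensions and threshold below are illustrative assumptions only), the following Python sketch compares the empirical tail probability of \left\|\mathbf{Z}\right\| with the bound in Eq. (110) for a sum of independent Rademacher-weighted copies of a fixed matrix.

import numpy as np

# Empirical tail of ||Z|| for Z = sum_i s_i * B (s_i Rademacher signs) versus
# the bound of Eq. (110); the construction and constants are illustrative only.
rng = np.random.default_rng(0)
d1, d2, n, trials, eps = 5, 3, 200, 2000, 40.0

B = rng.standard_normal((d1, d2))
B /= np.linalg.norm(B, 2)          # spectral norm 1, so mu = ||A_i|| = 1
mu = 1.0
# E[A_i] = 0 and sum_i E[A_i A_i^T] = n * B B^T, hence nu(Z) = n * ||B||^2 = n.
nu = n * max(np.linalg.norm(B @ B.T, 2), np.linalg.norm(B.T @ B, 2))

exceed = 0
for _ in range(trials):
    signs = rng.choice([-1.0, 1.0], size=n)
    Z = (signs[:, None, None] * B).sum(axis=0)
    exceed += float(np.linalg.norm(Z, 2) >= eps)

empirical = exceed / trials
bound = (d1 + d2) * np.exp(-(eps ** 2 / 2.0) / (nu + mu * eps / 3.0))
print(f"empirical Pr(||Z|| >= {eps}) = {empirical:.4f}  vs  bound = {bound:.4f}")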

E.2 Main Proof

Proof

Following the subsampling process, a total of |\mathcal{S}_{g}| matrices are uniformly sampled from the set of n Riemannian gradients \left\{{\rm grad}f_{i}(\mathbf{x})\subseteq\mathbb{R}^{d\times r}\right\}_{i=1}^{n}. We denote each sampled element as \mathbf{G}^{(i)}_{\mathbf{x}}, and we have

Pr(𝐆𝐱(i))=1n,i=1,,|𝒮g|.{\rm Pr}\left(\mathbf{G}^{(i)}_{\mathbf{x}}\right)=\frac{1}{n},\ i=1,...,|\mathcal{S}_{g}|. (111)

Define the random matrix

𝐗i:=𝐆𝐱(i)gradf(𝐱),i=1,2,..,|𝒮g|.\mathbf{X}_{i}:=\mathbf{G}^{(i)}_{\mathbf{x}}-{\rm grad}f(\mathbf{x}),\ i=1,2,..,|\mathcal{S}_{g}|. (112)

Since our focus is the type of problem in Eq. (1), we have

𝔼[𝐗i]=𝔼[𝐆𝐱(i)]𝔼[gradf(𝐱)]=𝔼[gradfi(𝐱)]𝔼[1ni=1ngradfi(𝐱)]=𝟎.\displaystyle\mathbb{E}[\mathbf{X}_{i}]=\mathbb{E}\left[\mathbf{G}^{(i)}_{\mathbf{x}}\right]-\mathbb{E}\left[{\rm grad}f(\mathbf{x})\right]=\mathbb{E}\left[{\rm grad}f_{i}(\mathbf{x})\right]-\mathbb{E}\left[\frac{1}{n}\sum_{i=1}^{n}{\rm grad}f_{i}(\mathbf{x})\right]=\mathbf{0}. (113)

Define a random variable

𝐗:=1|𝒮g|i=1|𝒮g|𝐗i=1|𝒮g|i=1|𝒮g|(𝐆𝐱(i)gradf(𝐱)).\mathbf{X}:=\frac{1}{|\mathcal{S}_{g}|}\sum_{i=1}^{|\mathcal{S}_{g}|}\mathbf{X}_{i}=\frac{1}{|\mathcal{S}_{g}|}\sum_{i=1}^{|\mathcal{S}_{g}|}\left(\mathbf{G}^{(i)}_{\mathbf{x}}-{\rm grad}f(\mathbf{x})\right). (114)

Its variance satisfies

ν(𝐗)\displaystyle\nu(\mathbf{X}) =max{1|𝒮g|2i=1|𝒮g|𝔼[𝐗i𝐗iT]𝐱,1|𝒮g|2i=1|𝒮g|𝔼[𝐗iT𝐗i]𝐱}\displaystyle=\max\left\{\frac{1}{|\mathcal{S}_{g}|^{2}}\left\|\sum_{i=1}^{|\mathcal{S}_{g}|}\mathbb{E}\left[\mathbf{X}_{i}\mathbf{X}_{i}^{T}\right]\right\|_{\mathbf{x}},\frac{1}{|\mathcal{S}_{g}|^{2}}\left\|\sum_{i=1}^{|\mathcal{S}_{g}|}\mathbb{E}\left[\mathbf{X}_{i}^{T}\mathbf{X}_{i}\right]\right\|_{\mathbf{x}}\right\}
1|𝒮g|2max{i=1|𝒮g|𝔼[𝐗i𝐗iT𝐱],i=1|𝒮g|𝔼[𝐗iT𝐗i𝐱]}\displaystyle\leq\frac{1}{|\mathcal{S}_{g}|^{2}}\max\left\{\sum_{i=1}^{|\mathcal{S}_{g}|}\mathbb{E}\left[\left\|\mathbf{X}_{i}\mathbf{X}_{i}^{T}\right\|_{\mathbf{x}}\right],\sum_{i=1}^{|\mathcal{S}_{g}|}\mathbb{E}\left[\left\|\mathbf{X}_{i}^{T}\mathbf{X}_{i}\right\|_{\mathbf{x}}\right]\right\} (115)

Taking \mathbf{G}^{(i)}_{\mathbf{x}}={\rm grad}f_{1}(\mathbf{x}) as an example and applying the definition of K_{g_{max}} in Eq. (53), we have

𝔼[𝐗i𝐱]\displaystyle\mathbb{E}[\|\mathbf{X}_{i}\|_{\mathbf{x}}] =𝔼[gradf1(𝐱)gradf(𝐱)𝐱]=𝔼[gradf1(𝐱)1ni=1ngradfi(𝐱)𝐱]\displaystyle=\mathbb{E}\left[\left\|{\rm grad}f_{1}(\mathbf{x})-{\rm grad}f(\mathbf{x})\right\|_{\mathbf{x}}\right]=\mathbb{E}\left[\left\|{\rm grad}f_{1}(\mathbf{x})-\frac{1}{n}\sum_{i=1}^{n}{\rm grad}f_{i}(\mathbf{x})\right\|_{\mathbf{x}}\right]
=𝔼[n1ngradf1(𝐱)1ni=2ngradfi(𝐱)𝐱]\displaystyle=\mathbb{E}\left[\left\|\frac{n-1}{n}{\rm grad}f_{1}(\mathbf{x})-\frac{1}{n}\sum_{i=2}^{n}{\rm grad}f_{i}(\mathbf{x})\right\|_{\mathbf{x}}\right]
𝔼[(2(n1)2n2gradf1(𝐱)𝐱2+2n2i=2ngradfi(𝐱)𝐱2)12]\displaystyle\leq\mathbb{E}\left[\left(\frac{2(n-1)^{2}}{n^{2}}\left\|{\rm grad}f_{1}(\mathbf{x})\right\|_{\mathbf{x}}^{2}+\frac{2}{n^{2}}\left\|\sum_{i=2}^{n}{\rm grad}f_{i}(\mathbf{x})\right\|_{\mathbf{x}}^{2}\right)^{\frac{1}{2}}\right]
\leq\mathbb{E}\left[\left(\frac{2(n-1)^{2}}{n^{2}}K_{g_{max}}^{2}+\frac{2(n-1)^{2}}{n^{2}}K_{g_{max}}^{2}\right)^{\frac{1}{2}}\right]\leq 2K_{g_{max}}. (116)

Here, the first inequality uses (a+b)^{2}\leq 2a^{2}+2b^{2}. Combining Eqs. (115) and (116), we have

ν(𝐗)1|𝒮g|2i=1|𝒮g|𝔼[𝐗i𝐱]𝔼[𝐗iT𝐱]1|𝒮g|2i=1|𝒮g|4Kgmax2=4|𝒮g|Kgmax2.\displaystyle\nu(\mathbf{X})\leq\frac{1}{|\mathcal{S}_{g}|^{2}}\sum_{i=1}^{|\mathcal{S}_{g}|}\mathbb{E}\left[\left\|\mathbf{X}_{i}\right\|_{\mathbf{x}}\right]\mathbb{E}\left[\left\|\mathbf{X}_{i}^{T}\right\|_{\mathbf{x}}\right]\leq\frac{1}{|\mathcal{S}_{g}|^{2}}\sum_{i=1}^{|\mathcal{S}_{g}|}4K_{g_{max}}^{2}=\frac{4}{|\mathcal{S}_{g}|}K_{g_{max}}^{2}. (117)

Now we are ready to apply Lemma 4. Given \mathbb{E}\left[\frac{1}{|\mathcal{S}_{g}|}\mathbf{X}_{i}\right]=\frac{1}{|\mathcal{S}_{g}|}\mathbb{E}[\mathbf{X}_{i}]=\mathbf{0} and, according to the supremum definition, \left\|\frac{1}{|\mathcal{S}_{g}|}\mathbf{X}_{i}\right\|_{\mathbf{x}}=\frac{1}{|\mathcal{S}_{g}|}\|\mathbf{X}_{i}\|_{\mathbf{x}}\leq\frac{K_{g_{max}}}{|\mathcal{S}_{g}|}, the following is obtained from the matrix Bernstein inequality:

Pr(𝐗𝐱ϵ)\displaystyle{\rm Pr}(\|\mathbf{X}\|_{\mathbf{x}}\geq\epsilon) =Pr(1|𝒮g|i=1|𝒮g|𝐆𝐱(i)gradf(𝐱)𝐱ϵ)\displaystyle={\rm Pr}\left(\left\|\frac{1}{|\mathcal{S}_{g}|}\sum_{i=1}^{|\mathcal{S}_{g}|}\mathbf{G}^{(i)}_{\mathbf{x}}-{\rm grad}f(\mathbf{x})\right\|_{\mathbf{x}}\geq\epsilon\right) (118)
(d+r)exp(ϵ2/24|𝒮g|Kgmax2+Kgmax|𝒮g|ϵ/3)\displaystyle\leq(d+r)\exp\left(\frac{-\epsilon^{2}/2}{\frac{4}{|\mathcal{S}_{g}|}K_{g_{max}}^{2}+\frac{K_{g_{max}}}{|\mathcal{S}_{g}|}\epsilon/3}\right) (119)
(d+r)exp(|𝒮g|ϵ28(Kgmax2+Kgmax))=δ,\displaystyle\leq(d+r)\exp\left(\frac{-|\mathcal{S}_{g}|\epsilon^{2}}{8(K_{g_{max}}^{2}+K_{g_{max}})}\right)=\delta, (120)

of which the last equality implies

\epsilon=2\sqrt{\frac{2\left(K_{g_{max}}^{2}+K_{g_{max}}\right)\ln\left(\frac{d+r}{\delta}\right)}{|\mathcal{S}_{g}|}}. (121)

In other words, with probability at least 1-\delta, we have \|\mathbf{X}\|_{\mathbf{x}}<\epsilon. Letting

\epsilon=2\sqrt{\frac{2\left(K_{g_{max}}^{2}+K_{g_{max}}\right)\ln\left(\frac{d+r}{\delta}\right)}{|\mathcal{S}_{g}|}}\leq\delta_{g}, (122)

we obtain the sample size bound in Eq. (55):

|𝒮g|8(Kgmax2+Kgmax)ln(d+rδ)δg2.\ |\mathcal{S}_{g}|\geq\frac{8\left(K_{g_{max}}^{2}+K_{g_{max}}\right)\ln\left(\frac{d+r}{\delta}\right)}{\delta_{g}^{2}}. (123)

Therefore, we have \|\mathbf{X}\|_{\mathbf{x}}\leq\delta_{g}. Expanding \mathbf{X}, the following is satisfied with probability at least 1-\delta:

𝐗𝐱=1|𝒮g|i=1|𝒮g|𝐆𝐱(i)gradf(𝐱)𝐱=𝐆kgradf(𝐱k)𝐱δg,\|\mathbf{X}\|_{\mathbf{x}}=\left\|\frac{1}{|\mathcal{S}_{g}|}\sum_{i=1}^{|\mathcal{S}_{g}|}\mathbf{G}^{(i)}_{\mathbf{x}}-{\rm grad}f(\mathbf{x})\right\|_{\mathbf{x}}=\left\|\mathbf{G}_{k}-{\rm grad}f(\mathbf{x}_{k})\right\|_{\mathbf{x}}\leq\delta_{g}, (124)

which is Eq. (51) of Condition 1.
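In code, Eq. (123) amounts to a simple sample-size rule; the sketch below (the function name and the numbers in the usage example are illustrative assumptions) returns the smallest |\mathcal{S}_{g}| satisfying the bound, capped at the full sample size n.

import math

def gradient_sample_size(K_gmax, d, r, delta, delta_g, n):
    """Smallest |S_g| satisfying Eq. (123), capped at the full sample size n."""
    bound = 8.0 * (K_gmax ** 2 + K_gmax) * math.log((d + r) / delta) / delta_g ** 2
    return min(n, math.ceil(bound))

# Illustrative numbers: K_gmax = 10, d + r = 1020, delta = 0.01 and delta_g = 1
# request roughly 1.0e4 subsampled gradients, well below the cap n = 50000 here.
print(gradient_sample_size(K_gmax=10.0, d=1000, r=20, delta=0.01, delta_g=1.0, n=50000))

The analogous rule with K_{H_{max}} and \delta_{H}, including the extra K_{H_{max}}/\left\|\bm{\eta}\right\|_{\mathbf{x}} term, yields the Hessian sample size derived below in Eq. (139).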

The proof of the other sample size bound follows the same strategy. A total of |\mathcal{S}_{H}| matrices are uniformly sampled from the set of n Riemannian Hessians \left\{\nabla^{2}\hat{f}_{i}(\mathbf{0}_{x})[\bm{\eta}]\subseteq\mathbb{R}^{d\times r}\right\}_{i=1}^{n}. We denote each sampled element as \mathbf{H}^{(i)}_{\mathbf{x}}[\bm{\eta}], and we have

Pr(𝐇𝐱(i)[𝜼])=1n,i=1,,|𝒮H|.{\rm Pr}\left(\mathbf{H}^{(i)}_{\mathbf{x}}[\bm{\eta}]\right)=\frac{1}{n},\ i=1,...,|\mathcal{S}_{H}|. (125)

Define the random matrix

𝐘i:=𝐇𝐱(i)[𝜼]2f^(𝟎x)[𝜼],i=1,2,..,|𝒮H|.\mathbf{Y}_{i}:=\mathbf{H}^{(i)}_{\mathbf{x}}[\bm{\eta}]-\nabla^{2}\hat{f}(\mathbf{0}_{x})[\bm{\eta}],\ i=1,2,..,|\mathcal{S}_{H}|. (126)

For the problem defined in Eq. (1) with a second-order retraction, we have

𝔼[𝐘i]=𝔼[𝐇𝐱(i)[𝜼]]𝔼[2f^(𝟎x)[𝜼]]\displaystyle\mathbb{E}[\mathbf{Y}_{i}]=\mathbb{E}\left[\mathbf{H}^{(i)}_{\mathbf{x}}[\bm{\eta}]\right]-\mathbb{E}\left[\nabla^{2}\hat{f}(\mathbf{0}_{x})[\bm{\eta}]\right] (127)
=𝔼[2f^i(𝟎x)[𝜼]]𝔼[1ni=1n2f^i(𝟎x)[𝜼]]=𝟎.\displaystyle=\mathbb{E}\left[\nabla^{2}\hat{f}_{i}(\mathbf{0}_{x})[\bm{\eta}]\right]-\mathbb{E}\left[\frac{1}{n}\sum_{i=1}^{n}\nabla^{2}\hat{f}_{i}(\mathbf{0}_{x})[\bm{\eta}]\right]=\mathbf{0}. (128)

Define a random variable

𝐘:=1|𝒮H|i=1|𝒮H|𝐘i=1|𝒮H|i=1|𝒮H|(𝐇𝐱(i)[𝜼]2f^(𝟎x)[𝜼]).\mathbf{Y}:=\frac{1}{|\mathcal{S}_{H}|}\sum_{i=1}^{|\mathcal{S}_{H}|}\mathbf{Y}_{i}=\frac{1}{|\mathcal{S}_{H}|}\sum_{i=1}^{|\mathcal{S}_{H}|}\left(\mathbf{H}^{(i)}_{\mathbf{x}}[\bm{\eta}]-\nabla^{2}\hat{f}(\mathbf{0}_{x})[\bm{\eta}]\right). (129)

Its variance satisfies

ν(𝐘)\displaystyle\nu(\mathbf{Y}) =max{1|𝒮H|2i=1|𝒮H|𝔼[𝐘i𝐘iT]𝐱,1|𝒮H|2i=1|𝒮H|𝔼[𝐘iT𝐘i]𝐱}\displaystyle=\max\left\{\frac{1}{|\mathcal{S}_{H}|^{2}}\left\|\sum_{i=1}^{|\mathcal{S}_{H}|}\mathbb{E}\left[\mathbf{Y}_{i}\mathbf{Y}_{i}^{T}\right]\right\|_{\mathbf{x}},\frac{1}{|\mathcal{S}_{H}|^{2}}\left\|\sum_{i=1}^{|\mathcal{S}_{H}|}\mathbb{E}\left[\mathbf{Y}_{i}^{T}\mathbf{Y}_{i}\right]\right\|_{\mathbf{x}}\right\}
1|𝒮H|2max{i=1|𝒮H|𝔼[𝐘i𝐘iT𝐱],i=1|𝒮H|𝔼[𝐘iT𝐘i𝐱]}.\displaystyle\leq\frac{1}{|\mathcal{S}_{H}|^{2}}\max\left\{\sum_{i=1}^{|\mathcal{S}_{H}|}\mathbb{E}\left[\left\|\mathbf{Y}_{i}\mathbf{Y}_{i}^{T}\right\|_{\mathbf{x}}\right],\sum_{i=1}^{|\mathcal{S}_{H}|}\mathbb{E}\left[\left\|\mathbf{Y}_{i}^{T}\mathbf{Y}_{i}\right\|_{\mathbf{x}}\right]\right\}. (130)

Taking \mathbf{H}^{(i)}_{\mathbf{x}}[\bm{\eta}]=\nabla^{2}\hat{f}_{1}(\mathbf{0}_{x})[\bm{\eta}] as an example and applying the definition of K_{H_{max}} in Eq. (54), we have

𝔼[𝐘i𝐱]\displaystyle\mathbb{E}[\|\mathbf{Y}_{i}\|_{\mathbf{x}}] =𝔼[2f^1(𝟎x)[𝜼]1ni=1n2f^i(𝟎x)[𝜼]𝐱]\displaystyle=\mathbb{E}\left[\left\|\nabla^{2}\hat{f}_{1}(\mathbf{0}_{x})[\bm{\eta}]-\frac{1}{n}\sum_{i=1}^{n}\nabla^{2}\hat{f}_{i}(\mathbf{0}_{x})[\bm{\eta}]\right\|_{\mathbf{x}}\right]
=𝔼[n1n2f^1(𝟎x)[𝜼]1ni=2n2f^i(𝟎x)[𝜼]𝐱]\displaystyle=\mathbb{E}\left[\left\|\frac{n-1}{n}\nabla^{2}\hat{f}_{1}(\mathbf{0}_{x})[\bm{\eta}]-\frac{1}{n}\sum_{i=2}^{n}\nabla^{2}\hat{f}_{i}(\mathbf{0}_{x})[\bm{\eta}]\right\|_{\mathbf{x}}\right]
𝔼[(2(n1)2n22f^1(𝟎x)[𝜼]𝐱2+2n2i=2n2f^i(𝟎x)[𝜼]𝐱2)12]\displaystyle\leq\mathbb{E}\left[\left(\frac{2(n-1)^{2}}{n^{2}}\left\|\nabla^{2}\hat{f}_{1}(\mathbf{0}_{x})[\bm{\eta}]\right\|_{\mathbf{x}}^{2}+\frac{2}{n^{2}}\left\|\sum_{i=2}^{n}\nabla^{2}\hat{f}_{i}(\mathbf{0}_{x})[\bm{\eta}]\right\|_{\mathbf{x}}^{2}\right)^{\frac{1}{2}}\right]
\leq\mathbb{E}\left[\left(\frac{2(n-1)^{2}}{n^{2}}K_{H_{max}}^{2}\left\|\bm{\eta}\right\|_{\mathbf{x}}^{2}+\frac{2(n-1)^{2}}{n^{2}}K_{H_{max}}^{2}\left\|\bm{\eta}\right\|_{\mathbf{x}}^{2}\right)^{\frac{1}{2}}\right]
2KHmax𝜼𝐱.\displaystyle\leq 2K_{H_{max}}\left\|\bm{\eta}\right\|_{\mathbf{x}}. (131)

Here, the first inequality uses (a+b)^{2}\leq 2a^{2}+2b^{2}, and \bm{\eta} is the current moving direction being optimized in Eq. (10). Combining Eqs. (130) and (131), we have

ν(𝐘)\displaystyle\nu(\mathbf{Y}) 1|𝒮H|2i=1|𝒮H|𝔼[𝐘i𝐱]𝔼[𝐘iT𝐱]\displaystyle\leq\frac{1}{|\mathcal{S}_{H}|^{2}}\sum_{i=1}^{|\mathcal{S}_{H}|}\mathbb{E}\left[\left\|\mathbf{Y}_{i}\right\|_{\mathbf{x}}\right]\mathbb{E}\left[\left\|\mathbf{Y}_{i}^{T}\right\|_{\mathbf{x}}\right] (132)
1|𝒮H|2i=1|𝒮H|4KHmax2𝜼𝐱2=4|𝒮H|KHmax2𝜼𝐱2.\displaystyle\leq\frac{1}{|\mathcal{S}_{H}|^{2}}\sum_{i=1}^{|\mathcal{S}_{H}|}4K_{H_{max}}^{2}\left\|\bm{\eta}\right\|_{\mathbf{x}}^{2}=\frac{4}{|\mathcal{S}_{H}|}K_{H_{max}}^{2}\left\|\bm{\eta}\right\|_{\mathbf{x}}^{2}. (133)

We then apply Lemma 4. Given \mathbb{E}\left[\frac{1}{|\mathcal{S}_{H}|}\mathbf{Y}_{i}\right]=\frac{1}{|\mathcal{S}_{H}|}\mathbb{E}[\mathbf{Y}_{i}]=\mathbf{0} and, according to the supremum definition, \left\|\frac{1}{|\mathcal{S}_{H}|}\mathbf{Y}_{i}\right\|_{\mathbf{x}}=\frac{1}{|\mathcal{S}_{H}|}\|\mathbf{Y}_{i}\|_{\mathbf{x}}\leq\frac{K_{H_{max}}\left\|\bm{\eta}\right\|_{\mathbf{x}}}{|\mathcal{S}_{H}|}, the following is obtained from the matrix Bernstein inequality:

Pr(𝐘𝐱ϵ)\displaystyle{\rm Pr}(\|\mathbf{Y}\|_{\mathbf{x}}\geq\epsilon) =Pr(1|𝒮H|i=1|𝒮H|𝐇𝐱(i)[𝜼]2f^(𝟎x)[𝜼]𝐱ϵ)\displaystyle={\rm Pr}\left(\left\|\frac{1}{|\mathcal{S}_{H}|}\sum_{i=1}^{|\mathcal{S}_{H}|}\mathbf{H}^{(i)}_{\mathbf{x}}[\bm{\eta}]-\nabla^{2}\hat{f}(\mathbf{0}_{x})[\bm{\eta}]\right\|_{\mathbf{x}}\geq\epsilon\right) (134)
(d+r)exp(ϵ2/24|𝒮H|KHmax2𝜼𝐱2+KHmax𝜼𝐱|𝒮H|ϵ/3)\displaystyle\leq(d+r)\exp\left(\frac{-\epsilon^{2}/2}{\frac{4}{|\mathcal{S}_{H}|}K_{H_{max}}^{2}\left\|\bm{\eta}\right\|_{\mathbf{x}}^{2}+\frac{K_{H_{max}}\left\|\bm{\eta}\right\|_{\mathbf{x}}}{|\mathcal{S}_{H}|}\epsilon/3}\right) (135)
(d+r)exp(|𝒮H|ϵ28(KHmax2𝜼𝐱2+KHmax𝜼𝐱))=δ,\displaystyle\leq(d+r)\exp\left(\frac{-|\mathcal{S}_{H}|\epsilon^{2}}{8(K_{H_{max}}^{2}\left\|\bm{\eta}\right\|_{\mathbf{x}}^{2}+K_{H_{max}}\left\|\bm{\eta}\right\|_{\mathbf{x}})}\right)=\delta, (136)

of which the last equality indicates

ϵ=22(KHmax2𝜼𝐱2+KHmax𝜼𝐱)ln(d+rδ)|𝒮H|.\epsilon=2\sqrt{\frac{2\left(K_{H_{max}}^{2}\left\|\bm{\eta}\right\|_{\mathbf{x}}^{2}+K_{H_{max}}\left\|\bm{\eta}\right\|_{\mathbf{x}}\right)\ln\left(\frac{d+r}{\delta}\right)}{|\mathcal{S}_{H}|}}. (137)

In other words, with probability at least 1-\delta, we have \|\mathbf{Y}\|_{\mathbf{x}}<\epsilon. By letting

ϵ=22(KHmax2𝜼𝐱2+KHmax𝜼𝐱)ln(d+rδ)|𝒮H|δH𝜼𝐱,\epsilon=2\sqrt{\frac{2\left(K_{H_{max}}^{2}\left\|\bm{\eta}\right\|_{\mathbf{x}}^{2}+K_{H_{max}}\left\|\bm{\eta}\right\|_{\mathbf{x}}\right)\ln\left(\frac{d+r}{\delta}\right)}{|\mathcal{S}_{H}|}}\leq\delta_{H}\left\|\bm{\eta}\right\|_{\mathbf{x}}, (138)

we obtain the sample size bound in Eq. (56):

|𝒮H|8(KHmax2+KHmax𝜼𝐱)ln(d+rδ)δH2.\displaystyle|\mathcal{S}_{H}|\geq\frac{8\left(K_{H_{max}}^{2}+\frac{K_{H_{max}}}{\left\|\bm{\eta}\right\|_{\mathbf{x}}}\right)\ln\left(\frac{d+r}{\delta}\right)}{\delta_{H}^{2}}. (139)

Then we have \|\mathbf{Y}\|_{\mathbf{x}}\leq\delta_{H}\left\|\bm{\eta}\right\|_{\mathbf{x}}. Expanding \mathbf{Y}, the following is satisfied with probability at least 1-\delta:

𝐘𝐱=1|𝒮H|i=1|𝒮H|𝐇𝐱(i)[𝜼]2f^(𝟎x)[𝜼]𝐱=𝐇k[𝜼]2f^(𝟎x)[𝜼]𝐱δH𝜼𝐱,\|\mathbf{Y}\|_{\mathbf{x}}=\left\|\frac{1}{|\mathcal{S}_{H}|}\sum_{i=1}^{|\mathcal{S}_{H}|}\mathbf{H}^{(i)}_{\mathbf{x}}[\bm{\eta}]-\nabla^{2}\hat{f}(\mathbf{0}_{x})[\bm{\eta}]\right\|_{\mathbf{x}}=\left\|\mathbf{H}_{k}[\bm{\eta}]-\nabla^{2}\hat{f}(\mathbf{0}_{x})[\bm{\eta}]\right\|_{\mathbf{x}}\leq\delta_{H}\left\|\bm{\eta}\right\|_{\mathbf{x}}, (140)

which is Eq. (52) of Condition 1.

Appendix F Appendix: Theorem 3 and Corollary 1

F.1 Supporting Lemmas for Theorem 3

Lemma 5

Suppose Condition 1 and Assumptions 1 and 2 hold. Then, for the case of \|\mathbf{G}_{k}\|\geq\epsilon_{g}, we have

f^k(𝜼k)m^k(𝜼k)(LH6σk3)𝜼k𝒙k3+δg𝜼k𝒙k+12δH𝜼k𝒙k2.\begin{split}\hat{f}_{k}(\bm{\eta}_{k})-\hat{m}_{k}(\bm{\eta}_{k})\leq\left(\frac{L_{H}}{6}-\frac{\sigma_{k}}{3}\right)\|\bm{\eta}_{k}\|_{\bm{x}_{k}}^{3}+\delta_{g}\|\bm{\eta}_{k}\|_{\bm{x}_{k}}+\frac{1}{2}\delta_{H}\|\bm{\eta}_{k}\|_{\bm{x}_{k}}^{2}.\end{split} (141)
Proof

We start from the left-hand side of Eq. (141), and this leads to

f^k(𝜼k)m^k(𝜼k)\displaystyle\hat{f}_{k}(\bm{\eta}_{k})-\hat{m}_{k}(\bm{\eta}_{k})
=\displaystyle=\; f^k(𝜼k)f(𝒙k)𝐆k,𝜼k𝒙k12𝜼k,𝐇k[𝜼k]𝒙kσk3𝜼k𝒙k3\displaystyle\hat{f}_{k}(\bm{\eta}_{k})-f(\bm{x}_{k})-\left\langle\mathbf{G}_{k},\bm{\eta}_{k}\right\rangle_{\bm{x}_{k}}-\frac{1}{2}\left\langle\bm{\eta}_{k},\mathbf{H}_{k}[\bm{\eta}_{k}]\right\rangle_{\bm{x}_{k}}-\frac{\sigma_{k}}{3}\|\bm{\eta}_{k}\|_{\bm{x}_{k}}^{3}
=\displaystyle=\; f^k(𝜼k)f(𝒙k)gradf(𝒙k),𝜼k𝒙k12𝜼k,2f^k(𝟎𝒙k)[𝜼k]𝒙k\displaystyle\hat{f}_{k}(\bm{\eta}_{k})-f(\bm{x}_{k})-\langle\text{grad}f(\bm{x}_{k}),\bm{\eta}_{k}\rangle_{\bm{x}_{k}}-\frac{1}{2}\left\langle\bm{\eta}_{k},\nabla^{2}\hat{f}_{k}(\bm{0}_{\bm{x}_{k}})[\bm{\eta}_{k}]\right\rangle_{\bm{x}_{k}}
+gradf(𝒙k),𝜼k𝒙k𝐆k,𝜼k𝒙k+12𝜼k,2f^k(𝟎𝒙k)[𝜼k]𝒙k\displaystyle+\langle\text{grad}f(\bm{x}_{k}),\bm{\eta}_{k}\rangle_{\bm{x}_{k}}-\langle\mathbf{G}_{k},\bm{\eta}_{k}\rangle_{\bm{x}_{k}}+\frac{1}{2}\left\langle\bm{\eta}_{k},\nabla^{2}\hat{f}_{k}(\bm{0}_{\bm{x}_{k}})[\bm{\eta}_{k}]\right\rangle_{\bm{x}_{k}}
12𝜼k,𝐇k[𝜼k]𝒙kσk3𝜼k𝒙k3\displaystyle-\frac{1}{2}\left\langle\bm{\eta}_{k},\mathbf{H}_{k}[\bm{\eta}_{k}]\right\rangle_{\bm{x}_{k}}-\frac{\sigma_{k}}{3}\|\bm{\eta}_{k}\|_{\bm{x}_{k}}^{3}
\displaystyle\leq\; |f^k(𝜼k)f(𝒙k)gradf(𝒙k),𝜼k𝒙k12𝜼k,2f^k(𝟎𝒙k)[𝜼k]𝒙k|\displaystyle\left|\hat{f}_{k}(\bm{\eta}_{k})-f(\bm{x}_{k})-\langle\text{grad}f(\bm{x}_{k}),\bm{\eta}_{k}\rangle_{\bm{x}_{k}}-\frac{1}{2}\left\langle\bm{\eta}_{k},\nabla^{2}\hat{f}_{k}(\bm{0}_{\bm{x}_{k}})[\bm{\eta}_{k}]\right\rangle_{\bm{x}_{k}}\right|
+|gradf(𝒙k),𝜼k𝒙k𝐆k,𝜼k𝒙k|\displaystyle+\left|\langle\text{grad}f(\bm{x}_{k}),\bm{\eta}_{k}\rangle_{\bm{x}_{k}}-\langle\mathbf{G}_{k},\bm{\eta}_{k}\rangle_{\bm{x}_{k}}\right|
+|12𝜼k,2f^k(𝟎𝒙k)[𝜼k]𝒙k12𝜼k,𝐇k[𝜼k]𝒙k|σk3𝜼k𝒙k3\displaystyle+\left|\frac{1}{2}\left\langle\bm{\eta}_{k},\nabla^{2}\hat{f}_{k}(\bm{0}_{\bm{x}_{k}})[\bm{\eta}_{k}]\right\rangle_{\bm{x}_{k}}-\frac{1}{2}\langle\bm{\eta}_{k},\mathbf{H}_{k}[\bm{\eta}_{k}]\rangle_{\bm{x}_{k}}\right|-\frac{\sigma_{k}}{3}\|\bm{\eta}_{k}\|_{\bm{x}_{k}}^{3}
\displaystyle\leq\; 16LH𝜼k𝒙k3+δg𝜼k𝒙k+12δH𝜼k𝒙k2σk3𝜼k𝒙k3\displaystyle\frac{1}{6}L_{H}\|\bm{\eta}_{k}\|_{\bm{x}_{k}}^{3}+\delta_{g}\|\bm{\eta}_{k}\|_{\bm{x}_{k}}+\frac{1}{2}\delta_{H}\|\bm{\eta}_{k}\|_{\bm{x}_{k}}^{2}-\frac{\sigma_{k}}{3}\|\bm{\eta}_{k}\|_{\bm{x}_{k}}^{3}
=\displaystyle=\; (LH6σk3)𝜼k𝒙k3+δg𝜼k𝒙k+12δH𝜼k𝒙k2,\displaystyle\left(\frac{L_{H}}{6}-\frac{\sigma_{k}}{3}\right)\|\bm{\eta}_{k}\|_{\bm{x}_{k}}^{3}+\delta_{g}\|\bm{\eta}_{k}\|_{\bm{x}_{k}}+\frac{1}{2}\delta_{H}\|\bm{\eta}_{k}\|_{\bm{x}_{k}}^{2}, (142)

where the first inequality uses the Cauchy-Schwarz inequality and the second one uses Eqs. (48), (51) and (52). The term \left\langle\mathbf{G}_{k},\bm{\eta}_{k}\right\rangle_{\bm{x}_{k}} cannot be neglected here because of the condition \|\mathbf{G}_{k}\|\geq\epsilon_{g} used in constructing the subproblem model \hat{m}_{k}, and this results in the term \delta_{g}\|\bm{\eta}_{k}\|_{\bm{x}_{k}} in Eq. (142).

Lemma 6

Suppose Condition 1 and Assumptions 1 and 2 hold. Then, for the case of 𝐆k<ϵg\|\mathbf{G}_{k}\|<\epsilon_{g} and λmin(𝐇k)<ϵH\lambda_{min}(\mathbf{H}_{k})<-\epsilon_{H}, we have

f^k(𝜼k)m^k(𝜼k)(LH6σk3)𝜼k𝒙k3+12δH𝜼k𝒙k2.\begin{split}\hat{f}_{k}(\bm{\eta}_{k})-\hat{m}_{k}(\bm{\eta}_{k})\leq\left(\frac{L_{H}}{6}-\frac{\sigma_{k}}{3}\right)\|\bm{\eta}_{k}\|_{\bm{x}_{k}}^{3}+\frac{1}{2}\delta_{H}\|\bm{\eta}_{k}\|_{\bm{x}_{k}}^{2}.\end{split} (143)
Proof

For each 𝜼k\bm{\eta}_{k}, at least one of the following two inequalities holds:

𝜼k,gradf(𝒙k)𝒙k\displaystyle\langle\bm{\eta}_{k},\text{grad}f(\bm{x}_{k})\rangle_{\bm{x}_{k}}\leq 0,\displaystyle 0, (144)
𝜼k,gradf(𝒙k)𝒙k\displaystyle\langle-\bm{\eta}_{k},\text{grad}f(\bm{x}_{k})\rangle_{\bm{x}_{k}}\leq 0.\displaystyle 0. (145)

Without loss of generality, we assume 𝜼k,gradf(𝒙k)𝒙k0\langle\bm{\eta}_{k},\text{grad}f(\bm{x}_{k})\rangle_{\bm{x}_{k}}\leq 0, which is also the assumption adopted by yao2021inexact. It then follows that

f^k(𝜼k)m^k(𝜼k)\displaystyle\hat{f}_{k}(\bm{\eta}_{k})-\hat{m}_{k}(\bm{\eta}_{k})
=\displaystyle=\; f^k(𝜼k)f(𝒙k)12𝜼k,𝐇k[𝜼k]𝒙kσk3𝜼k𝒙k3\displaystyle\hat{f}_{k}(\bm{\eta}_{k})-f(\bm{x}_{k})-\frac{1}{2}\langle\bm{\eta}_{k},\mathbf{H}_{k}[\bm{\eta}_{k}]\rangle_{\bm{x}_{k}}-\frac{\sigma_{k}}{3}\|\bm{\eta}_{k}\|_{\bm{x}_{k}}^{3}
=\displaystyle=\; f^k(𝜼k)f(𝒙k)gradf(𝒙k),𝜼k𝒙k12𝜼k,2f^k(𝟎𝒙k)[𝜼k]𝒙k\displaystyle\hat{f}_{k}(\bm{\eta}_{k})-f(\bm{x}_{k})-\langle\text{grad}f(\bm{x}_{k}),\bm{\eta}_{k}\rangle_{\bm{x}_{k}}-\frac{1}{2}\left\langle\bm{\eta}_{k},\nabla^{2}\hat{f}_{k}(\bm{0}_{\bm{x}_{k}})[\bm{\eta}_{k}]\right\rangle_{\bm{x}_{k}}
+gradf(𝒙k),𝜼k𝒙k+12𝜼k,2f^k(𝟎𝒙k)[𝜼k]𝒙k\displaystyle+\langle\text{grad}f(\bm{x}_{k}),\bm{\eta}_{k}\rangle_{\bm{x}_{k}}+\frac{1}{2}\left\langle\bm{\eta}_{k},\nabla^{2}\hat{f}_{k}(\bm{0}_{\bm{x}_{k}})[\bm{\eta}_{k}]\right\rangle_{\bm{x}_{k}}
12𝜼k,𝐇k[𝜼k]𝒙kσk3𝜼k𝒙k3\displaystyle-\frac{1}{2}\langle\bm{\eta}_{k},\mathbf{H}_{k}[\bm{\eta}_{k}]\rangle_{\bm{x}_{k}}-\frac{\sigma_{k}}{3}\|\bm{\eta}_{k}\|_{\bm{x}_{k}}^{3}
\displaystyle\leq\; |f^k(𝜼k)f(𝒙k)gradf(𝒙k),𝜼k𝒙k12𝜼k,2f^k(𝟎𝒙k)[𝜼k]𝒙k|\displaystyle\left|\hat{f}_{k}(\bm{\eta}_{k})-f(\bm{x}_{k})-\langle\text{grad}f(\bm{x}_{k}),\bm{\eta}_{k}\rangle_{\bm{x}_{k}}-\frac{1}{2}\left\langle\bm{\eta}_{k},\nabla^{2}\hat{f}_{k}(\bm{0}_{\bm{x}_{k}})[\bm{\eta}_{k}]\right\rangle_{\bm{x}_{k}}\right|
+|12𝜼k,2f^k(𝟎𝒙k)[𝜼k]𝒙k12𝜼k,𝐇k[𝜼k]𝒙k|σk3𝜼k𝒙k3\displaystyle+\left|\frac{1}{2}\left\langle\bm{\eta}_{k},\nabla^{2}\hat{f}_{k}(\bm{0}_{\bm{x}_{k}})[\bm{\eta}_{k}]\right\rangle_{\bm{x}_{k}}-\frac{1}{2}\langle\bm{\eta}_{k},\mathbf{H}_{k}[\bm{\eta}_{k}]\rangle_{\bm{x}_{k}}\right|-\frac{\sigma_{k}}{3}\|\bm{\eta}_{k}\|_{\bm{x}_{k}}^{3}
\displaystyle\leq\; 16LH𝜼k𝒙k3+12δH𝜼k𝒙k2σk3𝜼k𝒙k3\displaystyle\frac{1}{6}L_{H}\|\bm{\eta}_{k}\|_{\bm{x}_{k}}^{3}+\frac{1}{2}\delta_{H}\|\bm{\eta}_{k}\|_{\bm{x}_{k}}^{2}-\frac{\sigma_{k}}{3}\|\bm{\eta}_{k}\|_{\bm{x}_{k}}^{3}
=\displaystyle=\; (LH6σk3)𝜼k𝒙k3+12δH𝜼k𝒙k2.\displaystyle\left(\frac{L_{H}}{6}-\frac{\sigma_{k}}{3}\right)\|\bm{\eta}_{k}\|_{\bm{x}_{k}}^{3}+\frac{1}{2}\delta_{H}\|\bm{\eta}_{k}\|_{\bm{x}_{k}}^{2}. (146)

In this case, the condition 𝐆k<ϵg\|\mathbf{G}_{k}\|<\epsilon_{g} and λmin(𝐇k)<ϵH\lambda_{min}(\mathbf{H}_{k})<-\epsilon_{H} means that the term 𝐆k,𝜼k𝒙k\left\langle\mathbf{G}_{k},\bm{\eta}_{k}\right\rangle_{\bm{x}_{k}} in the model m^k\hat{m}_{k} can be neglected although the optimization process is not yet finished. As a result, the term δg𝜼k𝒙k\delta_{g}\|\bm{\eta}_{k}\|_{\bm{x}_{k}} no longer appears in Eq. (146), unlike in Eq. (142).

Lemma 7

Suppose Condition 1 and Assumptions 1, 2, 3, 4 and 5 hold. Then, when 𝐆kϵg\|\mathbf{G}_{k}\|\geq\epsilon_{g} and σkLH\sigma_{k}\geq L_{H}, we have

(LH6σk3)𝜼k𝒙k3+δg𝜼k𝒙k+δH2𝜼k𝒙k2δg𝜼kC𝒙k+δH2𝜼kC𝒙k2.\begin{split}\left(\frac{L_{H}}{6}-\frac{\sigma_{k}}{3}\right)\|\bm{\eta}_{k}\|_{\bm{x}_{k}}^{3}&+\delta_{g}\|\bm{\eta}_{k}\|_{\bm{x}_{k}}+\frac{\delta_{H}}{2}\|\bm{\eta}_{k}\|_{\bm{x}_{k}}^{2}\leq\delta_{g}\left\|\bm{\eta}_{k}^{C}\right\|_{\bm{x}_{k}}+\frac{\delta_{H}}{2}\left\|\bm{\eta}_{k}^{C}\right\|_{\bm{x}_{k}}^{2}.\end{split} (147)
Proof

According to Lemma 6 of xu2020newton, we have

𝜼kC𝒙k12σk(KH2+4σk𝐆k𝒙kKH).\begin{split}\left\|\bm{\eta}_{k}^{C}\right\|_{\bm{x}_{k}}\geq\frac{1}{2\sigma_{k}}\left(\sqrt{K_{H}^{2}+4\sigma_{k}\|\mathbf{G}_{k}\|_{\bm{x}_{k}}}-K_{H}\right).\end{split} (148)

Then we consider two cases for 𝜼kC𝒙k\left\|\bm{\eta}_{k}^{C}\right\|_{\bm{x}_{k}}.

(i) If 𝜼k𝒙k𝜼kC𝒙k\|\bm{\eta}_{k}\|_{\bm{x}_{k}}\leq\left\|\bm{\eta}_{k}^{C}\right\|_{\bm{x}_{k}}, since σkLH\sigma_{k}\geq L_{H}, it follows that

(LH6σk3)𝜼k𝒙k3+δg𝜼k𝒙k+δH2𝜼k𝒙k2δg𝜼kC𝒙k+δH2𝜼kC𝒙k2.\begin{split}\left(\frac{L_{H}}{6}-\frac{\sigma_{k}}{3}\right)\|\bm{\eta}_{k}\|_{\bm{x}_{k}}^{3}+\delta_{g}\|\bm{\eta}_{k}\|_{\bm{x}_{k}}+\frac{\delta_{H}}{2}\|\bm{\eta}_{k}\|_{\bm{x}_{k}}^{2}\leq\delta_{g}\left\|\bm{\eta}_{k}^{C}\right\|_{\bm{x}_{k}}+\frac{\delta_{H}}{2}\left\|\bm{\eta}_{k}^{C}\right\|_{\bm{x}_{k}}^{2}.\end{split} (149)

(ii) If 𝜼k𝒙k𝜼kC𝒙k\|\bm{\eta}_{k}\|_{\bm{x}_{k}}\geq\left\|\bm{\eta}_{k}^{C}\right\|_{\bm{x}_{k}}, since σkLH\sigma_{k}\geq L_{H}, we have

(LH6σk3)𝜼k𝒙k3+δg𝜼k𝒙k+δH2𝜼k𝒙k2\displaystyle\;\left(\frac{L_{H}}{6}-\frac{\sigma_{k}}{3}\right)\|\bm{\eta}_{k}\|_{\bm{x}_{k}}^{3}+\delta_{g}\|\bm{\eta}_{k}\|_{\bm{x}_{k}}+\frac{\delta_{H}}{2}\|\bm{\eta}_{k}\|_{\bm{x}_{k}}^{2}
\displaystyle\leq σk6𝜼k𝒙k3+δg𝜼k𝒙k+δH2𝜼k𝒙k2\displaystyle\;-\frac{\sigma_{k}}{6}\|\bm{\eta}_{k}\|_{\bm{x}_{k}}^{3}+\delta_{g}\|\bm{\eta}_{k}\|_{\bm{x}_{k}}+\frac{\delta_{H}}{2}\|\bm{\eta}_{k}\|_{\bm{x}_{k}}^{2}
\displaystyle\leq (σk12𝜼kC𝒙k+δH2)𝜼k𝒙k2+(σk12𝜼kC𝒙k2+δg)𝜼k𝒙k\displaystyle\;\left(-\frac{\sigma_{k}}{12}\left\|\bm{\eta}_{k}^{C}\right\|_{\bm{x}_{k}}+\frac{\delta_{H}}{2}\right)\|\bm{\eta}_{k}\|_{\bm{x}_{k}}^{2}+\left(-\frac{\sigma_{k}}{12}\left\|\bm{\eta}_{k}^{C}\right\|_{\bm{x}_{k}}^{2}+\delta_{g}\right)\|\bm{\eta}_{k}\|_{\bm{x}_{k}}
\displaystyle\leq (KH2+4σk𝐆k𝒙kKH24+δH2)𝜼k𝒙k2+\displaystyle\;\left(-\frac{\sqrt{K_{H}^{2}+4\sigma_{k}\|\mathbf{G}_{k}\|_{\bm{x}_{k}}}-K_{H}}{24}+\frac{\delta_{H}}{2}\right)\|\bm{\eta}_{k}\|_{\bm{x}_{k}}^{2}+
(112(KH2+4σk𝐆k𝒙kKH)2𝐆k𝒙k4σk𝐆k𝒙k+δg)𝜼k𝒙k\displaystyle\;\left(-\frac{\frac{1}{12}\left(\sqrt{K_{H}^{2}+4\sigma_{k}\|\mathbf{G}_{k}\|_{\bm{x}_{k}}}-K_{H}\right)^{2}\|\mathbf{G}_{k}\|_{\bm{x}_{k}}}{4\sigma_{k}\|\mathbf{G}_{k}\|_{\bm{x}_{k}}}+\delta_{g}\right)\|\bm{\eta}_{k}\|_{\bm{x}_{k}}
\displaystyle\leq (KH2+4LHϵgKH24+δH2)𝜼k𝒙k2+\displaystyle\;\left(-\frac{\sqrt{K_{H}^{2}+4L_{H}\epsilon_{g}}-K_{H}}{24}+\frac{\delta_{H}}{2}\right)\|\bm{\eta}_{k}\|_{\bm{x}_{k}}^{2}+
(112(KH2+4LHϵgKH)2ϵg4LHϵg+δg)𝜼k𝒙k\displaystyle\;\left(-\frac{\frac{1}{12}\left(\sqrt{K_{H}^{2}+4L_{H}\epsilon_{g}}-K_{H}\right)^{2}\epsilon_{g}}{4L_{H}\epsilon_{g}}+\delta_{g}\right)\|\bm{\eta}_{k}\|_{\bm{x}_{k}}
\displaystyle\leq  0+0δg𝜼kC𝒙k+δH2𝜼kC𝒙k2.\displaystyle\;0+0\leq\delta_{g}\left\|\bm{\eta}_{k}^{C}\right\|_{\bm{x}_{k}}+\frac{\delta_{H}}{2}\left\|\bm{\eta}_{k}^{C}\right\|_{\bm{x}_{k}}^{2}. (150)

The third inequality follows from Eq. (148). The fourth inequality holds since the function h(x)=(α2+xα)2xh(x)=\frac{\left(\sqrt{\alpha^{2}+x}-\alpha\right)^{2}}{x} is an increasing function of xx for α0\alpha\geq 0. The penultimate inequality also holds given Eqs. (57), (58) and 1τ12112\frac{1-\tau}{12}\leq\frac{1}{12} since τ(0,1]\tau\in(0,1].
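As a quick numerical sanity check of the monotonicity claim used in the fourth inequality above, the following Python snippet (illustrative only; the function name and the grid are our own choices) evaluates h(x) = (sqrt(α² + x) − α)² / x on an increasing grid and confirms that it is non-decreasing in x for several values of α ≥ 0.

```python
import numpy as np

def h(x, alpha):
    # h(x) = (sqrt(alpha^2 + x) - alpha)^2 / x, the function used in the fourth inequality above
    return (np.sqrt(alpha**2 + x) - alpha)**2 / x

xs = np.linspace(1e-3, 100.0, 10_000)       # increasing grid of x > 0
for alpha in [0.0, 0.5, 1.0, 5.0]:          # sample values of alpha >= 0
    vals = h(xs, alpha)
    # successive differences should be (numerically) non-negative if h is increasing in x
    assert np.all(np.diff(vals) >= -1e-12), f"monotonicity violated for alpha={alpha}"
print("h(x) is numerically non-decreasing in x for all tested alpha values")
```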

Lemma 8

Suppose Condition 1 and Assumptions 1, 2, 4 and 5 hold. Then, when 𝐆kϵg\|\mathbf{G}_{k}\|\leq\epsilon_{g} and λmin(𝐇k)ϵH\lambda_{min}(\mathbf{H}_{k})\leq-\epsilon_{H}, if σkLH\sigma_{k}\geq L_{H}, we have

(LH6σk3)𝜼k𝒙k3+δH2𝜼k𝒙k2δH2𝜼kE𝒙k2.\begin{split}\left(\frac{L_{H}}{6}-\frac{\sigma_{k}}{3}\right)\|\bm{\eta}_{k}\|_{\bm{x}_{k}}^{3}&+\frac{\delta_{H}}{2}\|\bm{\eta}_{k}\|_{\bm{x}_{k}}^{2}\leq\frac{\delta_{H}}{2}\left\|\bm{\eta}_{k}^{E}\right\|_{\bm{x}_{k}}^{2}.\end{split} (151)
Proof

From Lemma 7 in xu2020newton, we have

𝜼kE𝒙kνσk|λmin(𝐇k)|.\begin{split}\left\|\bm{\eta}_{k}^{E}\right\|_{\bm{x}_{k}}\geq\frac{\nu}{\sigma_{k}}|\lambda_{min}(\mathbf{H}_{k})|.\end{split} (152)

Then we consider two cases for 𝜼kE𝒙k\left\|\bm{\eta}_{k}^{E}\right\|_{\bm{x}_{k}}.

(i) If 𝜼k𝒙k𝜼kE𝒙k\|\bm{\eta}_{k}\|_{\bm{x}_{k}}\leq\left\|\bm{\eta}_{k}^{E}\right\|_{\bm{x}_{k}}, since σkLH\sigma_{k}\geq L_{H}, it follows that

(LH6σk3)𝜼k𝒙k3+δH2𝜼k𝒙k2δH2𝜼k𝒙k2δH2𝜼kE𝒙k2.\begin{split}\left(\frac{L_{H}}{6}-\frac{\sigma_{k}}{3}\right)\|\bm{\eta}_{k}\|_{\bm{x}_{k}}^{3}+\frac{\delta_{H}}{2}\|\bm{\eta}_{k}\|_{\bm{x}_{k}}^{2}\leq\frac{\delta_{H}}{2}\|\bm{\eta}_{k}\|_{\bm{x}_{k}}^{2}\leq\frac{\delta_{H}}{2}\left\|\bm{\eta}_{k}^{E}\right\|_{\bm{x}_{k}}^{2}.\end{split} (153)

(ii) If 𝜼k𝒙k𝜼kE𝒙k\|\bm{\eta}_{k}\|_{\bm{x}_{k}}\geq\left\|\bm{\eta}_{k}^{E}\right\|_{\bm{x}_{k}}, since σkLH\sigma_{k}\geq L_{H}, we have

(LH6σk3)𝜼k𝒙k3+δH2𝜼k𝒙k2\displaystyle\;\left(\frac{L_{H}}{6}-\frac{\sigma_{k}}{3}\right)\|\bm{\eta}_{k}\|_{\bm{x}_{k}}^{3}+\frac{\delta_{H}}{2}\|\bm{\eta}_{k}\|_{\bm{x}_{k}}^{2} (154)
\displaystyle\leq σk6𝜼k𝒙k3+δH2𝜼k𝒙k2σk6𝜼kE𝒙k𝜼k𝒙k2+δH2𝜼k𝒙k2\displaystyle\;-\frac{\sigma_{k}}{6}\|\bm{\eta}_{k}\|_{\bm{x}_{k}}^{3}+\frac{\delta_{H}}{2}\|\bm{\eta}_{k}\|_{\bm{x}_{k}}^{2}\leq-\frac{\sigma_{k}}{6}\left\|\bm{\eta}_{k}^{E}\right\|_{\bm{x}_{k}}\|\bm{\eta}_{k}\|_{\bm{x}_{k}}^{2}+\frac{\delta_{H}}{2}\|\bm{\eta}_{k}\|_{\bm{x}_{k}}^{2}
\displaystyle\leq (ν6|λmin(𝐇k)|+(1τ)νϵH6)𝜼k𝒙k2(νϵH6+νϵH6)𝜼k𝒙k20δH2𝜼kE𝒙k2.\displaystyle\;\left(-\frac{\nu}{6}|\lambda_{min}(\mathbf{H}_{k})|+\frac{(1-\tau)\nu\epsilon_{H}}{6}\right)\|\bm{\eta}_{k}\|_{\bm{x}_{k}}^{2}\leq\left(-\frac{\nu\epsilon_{H}}{6}+\frac{\nu\epsilon_{H}}{6}\right)\|\bm{\eta}_{k}\|_{\bm{x}_{k}}^{2}\leq 0\leq\frac{\delta_{H}}{2}\left\|\bm{\eta}_{k}^{E}\right\|_{\bm{x}_{k}}^{2}.

The third inequality follows from Eqs. (58) and (152), and the fourth one holds since |λmin(𝐇k)|ϵH|\lambda_{min}(\mathbf{H}_{k})|\geq\epsilon_{H} and τ<1\tau<1.

Lemma 9

Suppose Condition 1 and Assumptions 1, 2, 3, 4 and 5 hold. Then, when 𝐆k𝐱kϵg\|\mathbf{G}_{k}\|_{\bm{x}_{k}}\geq\epsilon_{g}, if σkLH\sigma_{k}\geq L_{H}, iteration kk is successful, i.e., ρkτ\rho_{k}\geq\tau and σk+1σk\sigma_{k+1}\leq\sigma_{k}.

Proof

We have that

1ρk\displaystyle 1-\rho_{k} =f^k(𝜼k)m^k(𝜼k)m^k(𝟎𝒙k)m^k(𝜼k)\displaystyle=\frac{\hat{f}_{k}(\bm{\eta}_{k})-\hat{m}_{k}(\bm{\eta}_{k})}{\hat{m}_{k}(\bm{0}_{\bm{x}_{k}})-\hat{m}_{k}(\bm{\eta}_{k})}
(LH6σk3)𝜼k𝒙k3+δg𝜼k𝒙k+12δH𝜼k𝒙k2m^k(𝟎𝒙k)m^k(𝜼kC)\displaystyle\leq\frac{\left(\frac{L_{H}}{6}-\frac{\sigma_{k}}{3}\right)\|\bm{\eta}_{k}\|_{\bm{x}_{k}}^{3}+\delta_{g}\|\bm{\eta}_{k}\|_{\bm{x}_{k}}+\frac{1}{2}\delta_{H}\|\bm{\eta}_{k}\|_{\bm{x}_{k}}^{2}}{\hat{m}_{k}(\bm{0}_{\bm{x}_{k}})-\hat{m}_{k}\left(\bm{\eta}_{k}^{C}\right)}
δg𝜼kC𝒙k+12δH𝜼kC𝒙k2112𝜼kC𝒙k2(KH2+4σk𝐆k𝒙kKH)\displaystyle\leq\frac{\delta_{g}\left\|\bm{\eta}_{k}^{C}\right\|_{\bm{x}_{k}}+\frac{1}{2}\delta_{H}\left\|\bm{\eta}_{k}^{C}\right\|_{\bm{x}_{k}}^{2}}{\frac{1}{12}\left\|\bm{\eta}_{k}^{C}\right\|_{\bm{x}_{k}}^{2}\left(\sqrt{K_{H}^{2}+4\sigma_{k}\|\mathbf{G}_{k}\|_{\bm{x}_{k}}}-K_{H}\right)}
4σk𝐆k𝒙kδg16𝐆k𝒙k(KH2+4σk𝐆k𝒙kKH)2+6δHKH2+4LHϵgKH\displaystyle\leq\frac{4\sigma_{k}\|\mathbf{G}_{k}\|_{\bm{x}_{k}}\delta_{g}}{\frac{1}{6}\|\mathbf{G}_{k}\|_{\bm{x}_{k}}\left(\sqrt{K_{H}^{2}+4\sigma_{k}\|\mathbf{G}_{k}\|_{\bm{x}_{k}}}-K_{H}\right)^{2}}+\frac{6\delta_{H}}{\sqrt{K_{H}^{2}+4L_{H}\epsilon_{g}}-K_{H}}
4LHϵgδg16ϵg(KH2+4LHϵgKH)2+6δH(KH2+4LHϵgKH)\displaystyle\leq\frac{4L_{H}\epsilon_{g}\delta_{g}}{\frac{1}{6}\epsilon_{g}\left(\sqrt{K_{H}^{2}+4L_{H}\epsilon_{g}}-K_{H}\right)^{2}}+\frac{6\delta_{H}}{\left(\sqrt{K_{H}^{2}+4L_{H}\epsilon_{g}}-K_{H}\right)}
1τ2+1τ2=1τ,\displaystyle\leq\frac{1-\tau}{2}+\frac{1-\tau}{2}=1-\tau, (155)

where the first inequality follows from Eqs. (36) and (141), the second from Eqs. (147), (36) and (39), and the third from Eq. (148). The last two inequalities hold given Eqs. (57), (58) and the fact that the function h(x)=x(α2+xα)2h(x)=\frac{x}{\left(\sqrt{\alpha^{2}+x}-\alpha\right)^{2}} is decreasing in xx for α>0\alpha>0. Consequently, we have ρkτ\rho_{k}\geq\tau, i.e., the iteration is successful. Based on Step (12) of Algorithm 1, we have σk+1=max(σk/γ,ϵσ)σk\sigma_{k+1}=\max(\sigma_{k}/\gamma,\epsilon_{\sigma})\leq\sigma_{k}.
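For concreteness, the acceptance test and the penalty update referred to as Step (12) of Algorithm 1 can be sketched as below. This is a minimal illustrative sketch in Python, assuming the quantities ρ_k, τ, γ and ϵ_σ defined in the paper; the function name update_sigma is ours, and the actual algorithm may organize these steps differently.

```python
def update_sigma(rho_k, sigma_k, tau, gamma, eps_sigma):
    """Sketch (not the paper's exact implementation) of the success test and
    cubic-regularization update discussed above: a step is accepted when
    rho_k >= tau, in which case sigma_{k+1} = max(sigma_k / gamma, eps_sigma);
    otherwise sigma is inflated as sigma_{k+1} = gamma * sigma_k (gamma > 1)."""
    successful = rho_k >= tau
    if successful:
        sigma_next = max(sigma_k / gamma, eps_sigma)   # successful iteration: relax sigma
    else:
        sigma_next = gamma * sigma_k                   # unsuccessful iteration: inflate sigma
    return successful, sigma_next
```

The inflation branch is the one appearing in the contradiction argument of Lemma 11 below.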

Lemma 10

Suppose Condition 1 and Assumptions 1, 2, 4 and 5 hold. Then, when 𝐆k𝐱k<ϵg\|\mathbf{G}_{k}\|_{\bm{x}_{k}}<\epsilon_{g} and λmin(𝐇k)ϵH\lambda_{min}(\mathbf{H}_{k})\leq-\epsilon_{H}, if σkLH\sigma_{k}\geq L_{H}, iteration kk is successful, i.e., ρkτ\rho_{k}\geq\tau and σk+1σk\sigma_{k+1}\leq\sigma_{k}.

Proof

We have that

1ρk\displaystyle 1-\rho_{k} =f^k(𝜼k)m^k(𝜼k)m^k(𝟎𝒙k)m^k(𝜼k)(LH6σk3)𝜼k𝒙k3+12δH𝜼k𝒙k2m^k(𝟎𝒙k)m^k(𝜼kE)\displaystyle=\frac{\hat{f}_{k}(\bm{\eta}_{k})-\hat{m}_{k}(\bm{\eta}_{k})}{\hat{m}_{k}(\bm{0}_{\bm{x}_{k}})-\hat{m}_{k}(\bm{\eta}_{k})}\leq\frac{\left(\frac{L_{H}}{6}-\frac{\sigma_{k}}{3}\right)\|\bm{\eta}_{k}\|_{\bm{x}_{k}}^{3}+\frac{1}{2}\delta_{H}\|\bm{\eta}_{k}\|_{\bm{x}_{k}}^{2}}{\hat{m}_{k}(\bm{0}_{\bm{x}_{k}})-\hat{m}_{k}\left(\bm{\eta}_{k}^{E}\right)}
≤12δH𝜼kE𝒙k2ν|λmin(𝐇k)|𝜼kE𝒙k2/63δHνϵH1τ,\displaystyle\leq\frac{\frac{1}{2}\delta_{H}\left\|\bm{\eta}_{k}^{E}\right\|_{\bm{x}_{k}}^{2}}{\nu|\lambda_{min}(\mathbf{H}_{k})|\left\|\bm{\eta}_{k}^{E}\right\|_{\bm{x}_{k}}^{2}/6}\leq\frac{3\delta_{H}}{\nu\epsilon_{H}}\leq 1-\tau, (156)

where the first and second inequalities follow from Eqs. (143), (37), (151) and (40), and the last inequality uses Eq. (58). Consequently, we have ρkτ\rho_{k}\geq\tau, which indicates that the iteration is successful. Based on Step (12) of Algorithm 1, we have σk+1=max(σk/γ,ϵσ)σk\sigma_{k+1}=\max(\sigma_{k}/\gamma,\epsilon_{\sigma})\leq\sigma_{k}.

Lemma 11

Suppose Condition 1 and Assumptions 1, 2, 3, 4 and 5 hold. Then, for all kk, we have

σkmax(σ0,2γLH).\sigma_{k}\leq\max\left(\sigma_{0},2\gamma L_{H}\right). (157)
Proof

We prove this lemma by contradiction, considering the following two cases.

(i) If σ02γLH\sigma_{0}\leq 2\gamma L_{H}, we assume there exists an iteration k0k\geq 0 that is the first unsuccessful iteration such that σk+1=γσk>2γLH\sigma_{k+1}=\gamma\sigma_{k}>2\gamma L_{H}. This implies σk>LH\sigma_{k}>L_{H} and σk+1>σk\sigma_{k+1}>\sigma_{k}. Since the algorithm fails to terminate at iteration kk, we have 𝑮𝒌𝒙kϵg\|\bm{G_{k}}\|_{\bm{x}_{k}}\geq\epsilon_{g} or λmin(𝐇k)<ϵH\lambda_{min}(\mathbf{H}_{k})<-\epsilon_{H}. Then, according to Lemmas 9 and 10, iteration kk is successful and thus σk+1=max(σk/γ,ϵσ)σk\sigma_{k+1}=\max(\sigma_{k}/\gamma,\epsilon_{\sigma})\leq\sigma_{k}. This contradicts the earlier statement of σk+1>σk\sigma_{k+1}>\sigma_{k}. We thus have σk2γLH\sigma_{k}\leq 2\gamma L_{H} for all kk.

(ii) If σ0>2γLH\sigma_{0}>2\gamma L_{H}, similarly, we assume there exists an iteration k0k\geq 0 that is the first unsuccessful iteration such that σk+1=γσk>σ0\sigma_{k+1}=\gamma\sigma_{k}>\sigma_{0}. This implies σk>LH\sigma_{k}>L_{H} and σk+1>σk\sigma_{k+1}>\sigma_{k}. Since the algorithm fails to terminate at iteration kk, we have 𝑮𝒌𝒙kϵg\|\bm{G_{k}}\|_{\bm{x}_{k}}\geq\epsilon_{g} or λmin(𝐇k)<ϵH\lambda_{min}(\mathbf{H}_{k})<-\epsilon_{H}. According to Lemmas 9 and 10, iteration kk is successful and thus σk+1=max(σk/γ,ϵσ)σk\sigma_{k+1}=\max(\sigma_{k}/\gamma,\epsilon_{\sigma})\leq\sigma_{k}, which is a contradiction. Thus, we have σkσ0\sigma_{k}\leq\sigma_{0} for all kk.
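The bound of Lemma 11 can also be illustrated numerically. The sketch below (our own illustrative code with arbitrary constants) simulates the σ dynamics under the property established in Lemmas 9 and 10, namely that any iteration with σ_k ≥ L_H is successful, while iterations with σ_k < L_H are allowed to succeed or fail arbitrarily; the bound σ_k ≤ max(σ_0, 2γL_H) is checked along the trajectory.

```python
import random

def simulate_sigma(sigma0, L_H, gamma, eps_sigma, steps=10_000, seed=0):
    """Illustrative check of Lemma 11: under the assumption (proved in Lemmas 9
    and 10) that every iteration with sigma_k >= L_H is successful, and with
    arbitrary success/failure outcomes below L_H, sigma_k never exceeds
    max(sigma_0, 2 * gamma * L_H)."""
    rng = random.Random(seed)
    sigma, bound = sigma0, max(sigma0, 2 * gamma * L_H)
    for _ in range(steps):
        successful = True if sigma >= L_H else (rng.random() < 0.5)
        sigma = max(sigma / gamma, eps_sigma) if successful else gamma * sigma
        assert sigma <= bound + 1e-12, "bound of Lemma 11 violated"
    return sigma

simulate_sigma(sigma0=1.0, L_H=5.0, gamma=2.0, eps_sigma=1e-3)   # arbitrary constants
```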

F.2 Main Proof of Theorem 3

Proof

Letting σb=max(σ0,2γLH)\sigma_{b}=\max\left(\sigma_{0},2\gamma L_{H}\right), when 𝐆kϵg\|\mathbf{G}_{k}\|\geq\epsilon_{g}, from Eq. (36) and Lemma 11 we have

m^k(𝟎𝒙k)m^k(𝜼k)𝐆k𝒙k23min(𝐆k𝒙kKH,𝐆k𝒙kσk)ϵg223min(1KH,1σb).\hat{m}_{k}(\bm{0}_{\bm{x}_{k}})-\hat{m}_{k}(\bm{\eta}_{k})\geq\frac{\|\mathbf{G}_{k}\|_{\bm{x}_{k}}}{2\sqrt{3}}\min\left(\frac{\|\mathbf{G}_{k}\|_{\bm{x}_{k}}}{K_{H}},\sqrt{\frac{\|\mathbf{G}_{k}\|_{\bm{x}_{k}}}{\sigma_{k}}}\right)\geq\frac{\epsilon_{g}^{2}}{2\sqrt{3}}\min\left(\frac{1}{K_{H}},{\frac{1}{\sqrt{\sigma_{b}}}}\right). (158)

When 𝐆k<ϵg\|\mathbf{G}_{k}\|<\epsilon_{g} and λmin(𝐇k)ϵH\lambda_{min}(\mathbf{H}_{k})\leq-\epsilon_{H}, from Eq. (37) and Lemma 11, we have

m^k(𝟎𝒙k)m^k(𝜼k)ν36σk2|λmin(𝐇k)|3ν3ϵH36σb2.\begin{split}\hat{m}_{k}(\bm{0}_{\bm{x}_{k}})-\hat{m}_{k}(\bm{\eta}_{k})\geq\frac{\nu^{3}}{6\sigma_{k}^{2}}|\lambda_{min}(\mathbf{H}_{k})|^{3}\geq\frac{\nu^{3}\epsilon_{H}^{3}}{6\sigma_{b}^{2}}.\end{split} (159)

Let 𝒮succ\mathcal{S}_{succ} be the set of successful iterations before Algorithm 1 terminates and f^min\hat{f}_{min} be the function minimum. Since f^k(𝜼k)\hat{f}_{k}(\bm{\eta}_{k}) is monotonically decreasing, we have

f^0(𝟎𝒙0)f^min\displaystyle\hat{f}_{0}(\bm{0}_{\bm{x}_{0}})-\hat{f}_{min} k=0(f^k(𝟎𝒙k)f^k(𝜼k))k𝒮succ(f^k(𝟎𝒙k)f^k(𝜼k))\displaystyle\geq\sum_{k=0}^{\infty}\left(\hat{f}_{k}(\bm{0}_{\bm{x}_{k}})-\hat{f}_{k}(\bm{\eta}_{k})\right)\geq\sum_{k\in\mathcal{S}_{succ}}\left(\hat{f}_{k}(\bm{0}_{\bm{x}_{k}})-\hat{f}_{k}(\bm{\eta}_{k})\right) (160)
τ(k𝒮succm^k(𝟎𝒙k)m^k(𝜼k))\displaystyle\geq\tau\left(\sum_{k\in\mathcal{S}_{succ}}\hat{m}_{k}(\bm{0}_{\bm{x}_{k}})-\hat{m}_{k}(\bm{\eta}_{k})\right)
τ|𝒮succ|min(ν3ϵH36σb2,ϵg223min(1KH,1σb))|𝒮succ|τκmin(ϵg2,ϵH3),\displaystyle\geq\tau|\mathcal{S}_{succ}|\min\left(\frac{\nu^{3}\epsilon_{H}^{3}}{6\sigma_{b}^{2}},\frac{\epsilon_{g}^{2}}{2\sqrt{3}}\min\left(\frac{1}{K_{H}},{\frac{1}{\sqrt{\sigma_{b}}}}\right)\right)\geq|\mathcal{S}_{succ}|\tau\kappa\min\left(\epsilon_{g}^{2},\epsilon_{H}^{3}\right),

where κ=min(ν36σb2,123min(1KH,1σb))\kappa=\min\left(\frac{\nu^{3}}{6\sigma_{b}^{2}},\frac{1}{2\sqrt{3}}\min\left(\frac{1}{K_{H}},\frac{1}{\sqrt{\sigma_{b}}}\right)\right). Let 𝒮fail\mathcal{S}_{fail} be the set of unsuccessful iterations and TT be the total number of iterations of the algorithm. Then we have σT=σ0γ|𝒮fail||𝒮succ|\sigma_{T}=\sigma_{0}\gamma^{|\mathcal{S}_{fail}|-|\mathcal{S}_{succ}|}. Combining this with σTσb\sigma_{T}\leq\sigma_{b} from Lemma 11, we have

σT=σ0γ|𝒮fail||𝒮succ|σb\displaystyle\sigma_{T}=\sigma_{0}\gamma^{|\mathcal{S}_{fail}|-|\mathcal{S}_{succ}|}\leq\sigma_{b}
\displaystyle\Longrightarrow log(γ|𝒮fail||𝒮succ|)log(σbσ0)\displaystyle\log\left(\gamma^{|\mathcal{S}_{fail}|-|\mathcal{S}_{succ}|}\right)\leq\log\left(\frac{\sigma_{b}}{\sigma_{0}}\right)
\displaystyle\Longrightarrow (|𝒮fail||𝒮succ|)log(σb/σ0)logγ\left({|\mathcal{S}_{fail}|-|\mathcal{S}_{succ}|}\right)\leq\frac{\log(\sigma_{b}/\sigma_{0})}{\log{\gamma}}
\displaystyle\Longrightarrow |𝒮fail|log(σb/σ0)logγ+|𝒮succ|.\displaystyle|\mathcal{S}_{fail}|\leq\frac{\log(\sigma_{b}/\sigma_{0})}{\log{\gamma}}+|\mathcal{S}_{succ}|. (161)

Finally, combining Eqs. (160), (161), we have

T=|𝒮fail|+|𝒮succ|\displaystyle T=|\mathcal{S}_{fail}|+|\mathcal{S}_{succ}| log(σb/σ0)logγ+2|𝒮succ|\displaystyle\leq\frac{\log(\sigma_{b}/\sigma_{0})}{\log{\gamma}}+2|\mathcal{S}_{succ}| (162)
log(σb/σ0)logγ+2(f^0(𝟎𝒙0)f^min)τκmax(ϵg2,ϵH3).\displaystyle\leq\frac{\log(\sigma_{b}/\sigma_{0})}{\log{\gamma}}+\frac{2\left(\hat{f}_{0}(\bm{0}_{\bm{x}_{0}})-\hat{f}_{min}\right)}{\tau\kappa}\max\left(\epsilon_{g}^{-2},\epsilon_{H}^{-3}\right).

This completes the proof.
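To give a feel for the magnitude of the bound in Eq. (162), the following snippet evaluates it for a set of purely hypothetical constants (none of these values are taken from the paper or its experiments); since it is a worst-case bound, the resulting number is deliberately pessimistic.

```python
import math

# Hypothetical placeholder constants for illustrating Eq. (162); not values from the paper.
sigma_0, gamma, L_H = 1.0, 2.0, 10.0
tau, nu, K_H = 0.1, 0.5, 10.0
eps_g, eps_H = 1e-2, 1e-1
f0_minus_fmin = 1.0

sigma_b = max(sigma_0, 2 * gamma * L_H)
kappa = min(nu**3 / (6 * sigma_b**2),
            (1 / (2 * math.sqrt(3))) * min(1 / K_H, 1 / math.sqrt(sigma_b)))
T_bound = (math.log(sigma_b / sigma_0) / math.log(gamma)
           + 2 * f0_minus_fmin / (tau * kappa) * max(eps_g**-2, eps_H**-3))
print(f"worst-case iteration bound of Theorem 3 (illustrative constants): {T_bound:.3e}")
```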

F.3 Main Proof of Corollary 1

Proof

Under the given assumptions, when Theorem 3 holds, Algorithm 1 returns an (ϵg,ϵH)\left(\epsilon_{g},\epsilon_{H}\right)-optimal solution in T=𝒪(max(ϵg2,ϵH3))T=\mathcal{O}\left(\max\left(\epsilon_{g}^{-2},\epsilon_{H}^{-3}\right)\right) iterations. Also, according to Theorem 2, Condition 1 is satisfied at each iteration with probability (1δ)(1-\delta), which can be achieved independently at each iteration by selecting proper subsampling sizes for the approximate gradient and Hessian. Let EE be the event that Algorithm 1 returns an (ϵg,ϵH)\left(\epsilon_{g},\epsilon_{H}\right)-optimal solution and EiE_{i} be the event that Condition 1 is satisfied at iteration ii. According to Theorem 3, for event EE to happen, Condition 1 needs to be satisfied at every iteration, and thus we have

Pr(E)=i=1TPr(Ei)=(1δ)T=(1δ)𝒪(max(ϵg2,ϵH3)).{\rm Pr}(E)=\prod_{i=1}^{T}{\rm Pr}(E_{i})=(1-\delta)^{T}=(1-\delta)^{\mathcal{O}\left(\max\left(\epsilon_{g}^{-2},\epsilon_{H}^{-3}\right)\right)}. (163)

This completes the proof.
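For intuition about the probability statement in Eq. (163), the snippet below (with hypothetical values of T and of the target overall probability p) shows how small the per-iteration failure probability δ must be so that (1 − δ)^T stays above p; the rearrangement δ ≤ 1 − p^{1/T} follows directly from requiring (1 − δ)^T ≥ p.

```python
# Hypothetical iteration budget and target overall success probability.
T = 10_000
p = 0.99

# (1 - delta)^T >= p  <=>  delta <= 1 - p**(1/T)
delta_max = 1 - p ** (1 / T)
print(f"per-iteration delta must satisfy delta <= {delta_max:.2e}")
print(f"check: (1 - delta_max)^T = {(1 - delta_max) ** T:.4f}")   # equals p up to rounding
```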

Appendix G Appendix: Theorem 4 and Corollary 2

G.1 Supporting Lemmas for Theorem 4

The proof of Theorem 4 builds on an existing lemma, which we restate as follows.

Lemma 12

(Sufficient Function Decrease, Lemma 3.3 in cartis2011adaptive) Suppose the solution 𝛈k\bm{\eta}_{k} satisfies Assumption 3. Then we have

m^k(𝟎𝒙k)m^k(𝜼k)σk6𝜼k𝒙k3.\begin{split}\hat{m}_{k}(\bm{0}_{\bm{x}_{k}})-\hat{m}_{k}(\bm{\eta}_{k})\geq\frac{\sigma_{k}}{6}\|\bm{\eta}_{k}\|_{\bm{x}_{k}}^{3}.\end{split} (164)

The proof also relies on the following lemma, which we develop below.

Lemma 13

(Sufficiently Long Steps) Suppose Condition 1 and Assumptions 1, 2, 3, 4, 5 hold. If δgδHκθϵg\delta_{g}\leq\delta_{H}\leq\kappa_{\theta}\epsilon_{g}, Algorithm 3 returns a solution 𝛈k\bm{\eta}_{k} such that

𝜼k𝒙kκsϵg,\begin{split}\|\bm{\eta}_{k}\|_{\bm{x}_{k}}\geq\kappa_{s}\sqrt{\epsilon_{g}},\end{split} (165)

when 𝐆k+1𝐱k+1ϵg\|\mathbf{G}_{k+1}\|_{\bm{x}_{k+1}}\geq\epsilon_{g} for k>0k>0 and when the inner stopping criterion of Eq. (47) is used. Here κs=min(1/(LH+2σb+ϵg3+Ll3),1/5LH3+10σb3+11Ll9)\kappa_{s}=\min\left(1/\sqrt{(L_{H}+2\sigma_{b}+\frac{\epsilon_{g}}{3}+\frac{L_{l}}{3})},1/\sqrt{\frac{5L_{H}}{3}+\frac{10\sigma_{b}}{3}+\frac{11L_{l}}{9}}\right) with Ll>0L_{l}>0 and σb=max(σ0,2γLH)\sigma_{b}=\max\left(\sigma_{0},2\gamma L_{H}\right).

Proof

By differentiating the approximate model, we have

m^k(𝜼k)𝒙k=𝐆k+𝐇k[𝜼k]+σk𝜼k𝒙k𝜼k𝒙k\displaystyle\|\nabla\hat{m}_{k}(\bm{\eta}_{k})\|_{\bm{x}_{k}}=\bigg{|}\bigg{|}\mathbf{G}_{k}+\mathbf{H}_{k}[\bm{\eta}_{k}]+\sigma_{k}\|\bm{\eta}_{k}\|_{\bm{x}_{k}}\bm{\eta}_{k}\bigg{|}\bigg{|}_{\bm{x}_{k}}
=\displaystyle= 𝒫𝜼k1gradf^k(𝜼k)+𝐆kgradf(𝒙k)+𝐇k[𝜼k]2f^k(𝟎𝒙k)[𝜼k]\displaystyle\left\|\mathcal{P}_{\bm{\eta}_{k}}^{-1}\text{grad}\hat{f}_{k}(\bm{\eta}_{k})+\mathbf{G}_{k}-\text{grad}f(\bm{x}_{k})+\mathbf{H}_{k}[\bm{\eta}_{k}]-\nabla^{2}\hat{f}_{k}(\bm{0}_{\bm{x}_{k}})[\bm{\eta}_{k}]\right.
+gradf(𝒙k)+2f^k(𝟎𝒙k)[𝜼k]𝒫𝜼k1gradf^k(𝜼k)+σk𝜼k𝒙k𝜼k𝒙k\displaystyle\left.+\text{grad}f(\bm{x}_{k})+\nabla^{2}\hat{f}_{k}(\bm{0}_{\bm{x}_{k}})[\bm{\eta}_{k}]-\mathcal{P}_{\bm{\eta}_{k}}^{-1}\text{grad}\hat{f}_{k}(\bm{\eta}_{k})+\sigma_{k}\|\bm{\eta}_{k}\|_{\bm{x}_{k}}\bm{\eta}_{k}\right\|_{\bm{x}_{k}}
\displaystyle\geq 𝒫𝜼k1gradf^k(𝜼k)𝒙k𝐆kgradf(𝒙k)𝒙k𝐇k[𝜼k]2f^k(𝟎𝒙k)[𝜼k]𝒙k\displaystyle\left\|\mathcal{P}_{\bm{\eta}_{k}}^{-1}\text{grad}\hat{f}_{k}(\bm{\eta}_{k})\right\|_{\bm{x}_{k}}-\left\|\mathbf{G}_{k}-\text{grad}f(\bm{x}_{k})\right\|_{\bm{x}_{k}}-\left\|\mathbf{H}_{k}[\bm{\eta}_{k}]-\nabla^{2}\hat{f}_{k}(\bm{0}_{\bm{x}_{k}})[\bm{\eta}_{k}]\right\|_{\bm{x}_{k}}
𝒫𝜼k1gradf^k(𝜼k)gradf(𝒙k)2f^k(𝟎𝒙k)[𝜼k]𝒙kσk𝜼k𝒙k2\displaystyle-\left\|\mathcal{P}_{\bm{\eta}_{k}}^{-1}\text{grad}\hat{f}_{k}(\bm{\eta}_{k})-\text{grad}f(\bm{x}_{k})-\nabla^{2}\hat{f}_{k}(\bm{0}_{\bm{x}_{k}})[\bm{\eta}_{k}]\right\|_{\bm{x}_{k}}-\sigma_{k}\|\bm{\eta}_{k}\|_{\bm{x}_{k}}^{2}
\displaystyle\geq 𝒫𝜼k1gradf^k(𝜼k)𝒙kδgδH𝜼k𝒙kLH2𝜼k𝒙k2σb𝜼k𝒙k2,\displaystyle\left\lVert\mathcal{P}_{\bm{\eta}_{k}}^{-1}\text{grad}\hat{f}_{k}(\bm{\eta}_{k})\right\rVert_{\bm{x}_{k}}-\delta_{g}-\delta_{H}\|\bm{\eta}_{k}\|_{\bm{x}_{k}}-\frac{L_{H}}{2}\|\bm{\eta}_{k}\|_{\bm{x}_{k}}^{2}-\sigma_{b}\|\bm{\eta}_{k}\|_{\bm{x}_{k}}^{2}, (166)

where the first inequality follows from the triangle inequality and the second inequality from Eqs. (49), (51) and (52). Additionally, from Lemma 3.8 in kasai2018riemannian , we have

gradfk(𝒙k)𝒙k𝒫𝜼k1gradf^k(𝜼k)𝒙k𝒫𝜼k1gradf^k(𝜼k)gradf(𝒙k)𝒙kLl𝜼k𝒙k,\begin{split}\left\lVert\text{grad}f_{k}(\bm{x}_{k})\right\rVert_{\bm{x}_{k}}&-\left\lVert\mathcal{P}_{\bm{\eta}_{k}}^{-1}\text{grad}\hat{f}_{k}(\bm{\eta}_{k})\right\rVert_{\bm{x}_{k}}\\ &\leq\left\lVert\mathcal{P}_{\bm{\eta}_{k}}^{-1}\text{grad}\hat{f}_{k}(\bm{\eta}_{k})-\text{grad}f(\bm{x}_{k})\right\rVert_{\bm{x}_{k}}\leq L_{l}\left\lVert\bm{\eta}_{k}\right\rVert_{\bm{x}_{k}},\end{split} (167)

where Ll>0L_{l}>0 is a constant. Then, we have

𝐆k𝒙k\displaystyle\|\mathbf{G}_{k}\|_{\bm{x}_{k}}\leq 𝐆kgradf(𝒙k)𝒙k+gradf(𝒙k)𝒙k\displaystyle\;\|\mathbf{G}_{k}-\text{grad}f(\bm{x}_{k})\|_{\bm{x}_{k}}+\|\text{grad}f(\bm{x}_{k})\|_{\bm{x}_{k}}
\displaystyle\leq δg+Ll𝜼k𝒙k+𝒫𝜼k1gradf^k(𝜼k)𝒙k,\displaystyle\;\delta_{g}+L_{l}\|\bm{\eta}_{k}\|_{\bm{x}_{k}}+\left\lVert\mathcal{P}_{\bm{\eta}_{k}}^{-1}\text{grad}\hat{f}_{k}(\bm{\eta}_{k})\right\rVert_{\bm{x}_{k}}, (168)

with the first inequality following from the triangle inequality and the second from Eqs. (51), (52) and (167). Then, by combining Eqs. (47), (166) and (168), with θk:=κθmin(1,𝜼ki𝐱k)\theta_{k}:=\kappa_{\theta}\min(1,\left\|\bm{\eta}_{k}^{i}\right\|_{\mathbf{x}_{k}}) we obtain

𝒫𝜼k1gradf^k(𝜼k)𝒙kδgδH𝜼k𝒙k(LH2+σb)𝜼k𝒙k2\displaystyle\left\lVert\mathcal{P}_{\bm{\eta}_{k}}^{-1}\text{grad}\hat{f}_{k}(\bm{\eta}_{k})\right\rVert_{\bm{x}_{k}}-\delta_{g}-\delta_{H}\|\bm{\eta}_{k}\|_{\bm{x}_{k}}-\left(\frac{L_{H}}{2}+\sigma_{b}\right)\|\bm{\eta}_{k}\|_{\bm{x}_{k}}^{2} (169)
\displaystyle\leq m^k(𝜼k)𝒙kθk𝐆k𝒙k\displaystyle\;\|\nabla\hat{m}_{k}(\bm{\eta}_{k})\|_{\bm{x}_{k}}\leq\theta_{k}\|\mathbf{G}_{k}\|_{\bm{x}_{k}}
\displaystyle\leq θk(δg+Ll𝜼k𝒙k+𝒫𝜼k1gradf^k(𝜼k)𝒙k).\displaystyle\;\theta_{k}\left(\delta_{g}+L_{l}\|\bm{\eta}_{k}\|_{\bm{x}_{k}}+\left\lVert\mathcal{P}_{\bm{\eta}_{k}}^{-1}\text{grad}\hat{f}_{k}(\bm{\eta}_{k})\right\rVert_{\bm{x}_{k}}\right).

This results in

(LH2+σb)𝜼k𝒙k2+(δH+θkLl)𝜼k𝒙k+(1+θk)δg\displaystyle\left(\frac{L_{H}}{2}+\sigma_{b}\right)\|\bm{\eta}_{k}\|_{\bm{x}_{k}}^{2}+\left(\delta_{H}+\theta_{k}L_{l}\right)\|\bm{\eta}_{k}\|_{\bm{x}_{k}}+\left(1+\theta_{k}\right)\delta_{g} (170)
\displaystyle\geq (1θk)gradf^k(𝜼k)𝒙k+1(1θk)(𝐆k+1𝒙k+1δg).\displaystyle\;\left(1-\theta_{k}\right)\left\lVert\text{grad}\hat{f}_{k}(\bm{\eta}_{k})\right\rVert_{\bm{x}_{k+1}}\geq\left(1-\theta_{k}\right)\left(\|\mathbf{G}_{k+1}\|_{\bm{x}_{k+1}}-\delta_{g}\right).

It then follows that

(LH2+σb)𝜼k𝒙k2+(δH+θkLl)𝜼k𝒙k+2δg(1θk)(𝐆k+1𝒙k+1).\left(\frac{L_{H}}{2}+\sigma_{b}\right)\|\bm{\eta}_{k}\|_{\bm{x}_{k}}^{2}+\left(\delta_{H}+\theta_{k}L_{l}\right)\|\bm{\eta}_{k}\|_{\bm{x}_{k}}+2\delta_{g}\geq\left(1-\theta_{k}\right)\left(\|\mathbf{G}_{k+1}\|_{\bm{x}_{k+1}}\right). (171)

In the above derivation, we use the property that parallel transport preserves the length of the transported vector. The last inequality in Eq. (170) is based on Eq. (51) and the triangle inequality.

Now, we consider the following two cases. (i) If 𝜼k𝒙k1\|\bm{\eta}_{k}\|_{\bm{x}_{k}}\geq 1, from Eq. (47) we have θk=κθ\theta_{k}=\kappa_{\theta}, and therefore

(1κθ)(𝐆k+1𝒙k+1)2δg(LH2+σb+δH+κθLl)𝜼k𝒙k2.\left(1-\kappa_{\theta}\right)\left(\|\mathbf{G}_{k+1}\|_{\bm{x}_{k+1}}\right)-2\delta_{g}\leq\left(\frac{L_{H}}{2}+\sigma_{b}+\delta_{H}+\kappa_{\theta}L_{l}\right)\|\bm{\eta}_{k}\|_{\bm{x}_{k}}^{2}. (172)

This then gives

𝜼k𝒙k2(1κθ)(𝐆k+1𝒙k+1)2δgLH2+σb+δH+κθLl12ϵgLH2+σb+ϵg6+Ll6,\|\bm{\eta}_{k}\|_{\bm{x}_{k}}^{2}\geq\frac{\left(1-\kappa_{\theta}\right)\left(\|\mathbf{G}_{k+1}\|_{\bm{x}_{k+1}}\right)-2\delta_{g}}{\frac{L_{H}}{2}+\sigma_{b}+\delta_{H}+\kappa_{\theta}L_{l}}\geq\frac{\frac{1}{2}\epsilon_{g}}{\frac{L_{H}}{2}+\sigma_{b}+\frac{\epsilon_{g}}{6}+\frac{L_{l}}{6}}, (173)

where the last inequality holds because δgδHκθϵg\delta_{g}\leq\delta_{H}\leq\kappa_{\theta}\epsilon_{g} and κθ<16\kappa_{\theta}<\frac{1}{6}. (ii) If 𝜼k𝒙k<1\|\bm{\eta}_{k}\|_{\bm{x}_{k}}<1, then θk=κθ𝜼k𝒙k<κθ\theta_{k}=\kappa_{\theta}\|\bm{\eta}_{k}\|_{\bm{x}_{k}}<\kappa_{\theta}. Given Eq. (168) and δgδHκθϵg\delta_{g}\leq\delta_{H}\leq\kappa_{\theta}\epsilon_{g}, we have

δHκθϵgκθ𝐆k𝒙kκθ(δH+Ll𝜼k𝒙k+𝐆k+1𝒙k+1).\delta_{H}\leq\kappa_{\theta}\epsilon_{g}\leq\kappa_{\theta}\|\mathbf{G}_{k}\|_{\bm{x}_{k}}\leq\kappa_{\theta}\left(\delta_{H}+L_{l}\|\bm{\eta}_{k}\|_{\bm{x}_{k}}+\|\mathbf{G}_{k+1}\|_{\bm{x}_{k+1}}\right). (174)

Then, we have

(1θk)(𝐆k+1𝒙k+1)2δg\displaystyle\left(1-\theta_{k}\right)\left(\|\mathbf{G}_{k+1}\|_{\bm{x}_{k+1}}\right)-2\delta_{g} (175)
\displaystyle\leq (LH2+σb)𝜼k𝒙k2+(δH+θkLl)𝜼k𝒙k\displaystyle\left(\frac{L_{H}}{2}+\sigma_{b}\right)\|\bm{\eta}_{k}\|_{\bm{x}_{k}}^{2}+\left(\delta_{H}+\theta_{k}L_{l}\right)\|\bm{\eta}_{k}\|_{\bm{x}_{k}}
\displaystyle\leq (LH2+σb)𝜼k𝒙k2+(κθ1κθ+κθ)Ll𝜼k𝒙k2+κθ𝐆k+1𝒙k+11κθ,\displaystyle\left(\frac{L_{H}}{2}+\sigma_{b}\right)\|\bm{\eta}_{k}\|_{\bm{x}_{k}}^{2}+\left(\frac{\kappa_{\theta}}{1-\kappa_{\theta}}+\kappa_{\theta}\right)L_{l}\|\bm{\eta}_{k}\|_{\bm{x}_{k}}^{2}+\frac{\kappa_{\theta}\|\mathbf{G}_{k+1}\|_{\bm{x}_{k+1}}}{1-\kappa_{\theta}},

which results in

𝜼k𝒙k2(1κθκθ1κθ)𝐆k+1𝒙k+12δgLH2+σb+(κθ1κθ+κθ)Ll310ϵgLH2+σb+1130Ll.\|\bm{\eta}_{k}\|_{\bm{x}_{k}}^{2}\geq\frac{\left(1-\kappa_{\theta}-\frac{\kappa_{\theta}}{1-\kappa_{\theta}}\right)\|\mathbf{G}_{k+1}\|_{\bm{x}_{k+1}}-2\delta_{g}}{\frac{L_{H}}{2}+\sigma_{b}+\left(\frac{\kappa_{\theta}}{1-\kappa_{\theta}}+\kappa_{\theta}\right)L_{l}}\geq\frac{\frac{3}{10}\epsilon_{g}}{\frac{L_{H}}{2}+\sigma_{b}+\frac{11}{30}L_{l}}. (176)

This completes the proof.
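As a numerical cross-check of the algebra above, the snippet below (with arbitrary positive constants of our own choosing) verifies that the case-by-case lower bounds in Eqs. (173) and (176) coincide with the two terms inside the definition of κ_s in Lemma 13, so that the step-length bound κ_s√ϵ_g covers both cases.

```python
import math

# Arbitrary positive constants for the cross-check (not values from the paper).
L_H, sigma_b, L_l, eps_g = 3.0, 7.0, 2.0, 1e-2

# Right-hand sides of Eqs. (173) and (176): the two case-dependent lower bounds on ||eta_k||^2.
bound_case_i  = (0.5 * eps_g) / (L_H / 2 + sigma_b + eps_g / 6 + L_l / 6)
bound_case_ii = (0.3 * eps_g) / (L_H / 2 + sigma_b + 11 * L_l / 30)

# The two terms appearing in kappa_s^2 * eps_g, from the statement of Lemma 13.
term_i  = eps_g / (L_H + 2 * sigma_b + eps_g / 3 + L_l / 3)
term_ii = eps_g / (5 * L_H / 3 + 10 * sigma_b / 3 + 11 * L_l / 9)

assert math.isclose(bound_case_i, term_i) and math.isclose(bound_case_ii, term_ii)
kappa_s = min(1 / math.sqrt(L_H + 2 * sigma_b + eps_g / 3 + L_l / 3),
              1 / math.sqrt(5 * L_H / 3 + 10 * sigma_b / 3 + 11 * L_l / 9))
print(f"kappa_s = {kappa_s:.4f}; step-length lower bound kappa_s*sqrt(eps_g) = {kappa_s * math.sqrt(eps_g):.4f}")
```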

G.2 Main Proof of Theorem 4

Proof

Let σb=max(σ0,2γLH)\sigma_{b}=\max\left(\sigma_{0},2\gamma L_{H}\right), σmin=min(σk)\sigma_{min}=\min\left(\sigma_{k}\right) for k0k\geq 0 and 𝒮succ1\mathcal{S}_{succ}^{1} be the set of successful iterations such that 𝐆k+1𝒙k+1ϵg\|\mathbf{G}_{k+1}\|_{\bm{x}_{k+1}}\geq\epsilon_{g} for k𝒮succ1k\in\mathcal{S}_{succ}^{1}. As f^k(𝜼k)\hat{f}_{k}(\bm{\eta}_{k}) is monotonically decreasing, we have

f^0(𝟎𝒙0)f^min\displaystyle\hat{f}_{0}(\bm{0}_{\bm{x}_{0}})-\hat{f}_{min} k=0(f^k(𝟎𝒙k)f^k(𝜼k))k𝒮succ1(f^k(𝟎𝒙k)f^k(𝜼k))\displaystyle\geq\sum_{k=0}^{\infty}\left(\hat{f}_{k}(\bm{0}_{\bm{x}_{k}})-\hat{f}_{k}(\bm{\eta}_{k})\right)\geq\sum_{k\in\mathcal{S}_{succ}^{1}}\left(\hat{f}_{k}(\bm{0}_{\bm{x}_{k}})-\hat{f}_{k}(\bm{\eta}_{k})\right) (177)
τk𝒮succ1(m^k(𝟎𝒙k)m^k(𝜼k))τ|𝒮succ1|min(ν3ϵH36σk2,σmin6𝜼k𝒙k3)\displaystyle\geq\tau\sum_{k\in\mathcal{S}_{succ}^{1}}\left(\hat{m}_{k}(\bm{0}_{\bm{x}_{k}})-\hat{m}_{k}(\bm{\eta}_{k})\right)\geq\tau|\mathcal{S}_{succ}^{1}|\min\left(\frac{\nu^{3}\epsilon_{H}^{3}}{6\sigma_{k}^{2}},\frac{\sigma_{min}}{6}\|\bm{\eta}_{k}\|_{\bm{x}_{k}}^{3}\right)
τ|𝒮succ1|min(ν3ϵH36σb2,ϵσκs3ϵg326)τκ1|𝒮succ1|min(ϵg32,ϵH3),\displaystyle\geq\tau|\mathcal{S}_{succ}^{1}|\min\left(\frac{\nu^{3}\epsilon_{H}^{3}}{6\sigma_{b}^{2}},\frac{\epsilon_{\sigma}\kappa_{s}^{3}\epsilon_{g}^{\frac{3}{2}}}{6}\right)\geq\tau\kappa_{1}|\mathcal{S}_{succ}^{1}|\min\left(\epsilon_{g}^{\frac{3}{2}},\epsilon_{H}^{3}\right),

where κ1=16min(ν3σb2,ϵσκs3)\kappa_{1}=\frac{1}{6}\min\left(\frac{\nu^{3}}{\sigma_{b}^{2}},\epsilon_{\sigma}\kappa_{s}^{3}\right). The fourth inequality follows from Eqs. (37) and (164), while the fifth from Eq. (165).

Let 𝒮succ2\mathcal{S}_{succ}^{2} be the set of successful iterations such that 𝐆k+1𝒙k+1<ϵg\|\mathbf{G}_{k+1}\|_{\bm{x}_{k+1}}<\epsilon_{g} and λmin(𝐇k+1)<ϵH\lambda_{min}(\mathbf{H}_{k+1})<-\epsilon_{H} for k𝒮succ2k\in\mathcal{S}_{succ}^{2}. Then there is an iteration t𝒮succ2t\in\mathcal{S}_{succ}^{2} in which 𝐆t𝒙tϵg\|\mathbf{G}_{t}\|_{\bm{x}_{t}}\geq\epsilon_{g} and 𝐆t+1𝒙t+1<ϵg\|\mathbf{G}_{t+1}\|_{\bm{x}_{t+1}}<\epsilon_{g}. Thus, we have

f^0(𝟎𝒙0)f^mink=t(f^k(𝟎𝒙k)f^k(𝜼k))f^0(𝟎𝒙0)f^t(𝜼t)+k𝒮succ2(f^k(𝟎𝒙k)f^k(𝜼k)),\hat{f}_{0}(\bm{0}_{\bm{x}_{0}})-\hat{f}_{min}\geq\sum_{k=t}^{\infty}\left(\hat{f}_{k}(\bm{0}_{\bm{x}_{k}})-\hat{f}_{k}(\bm{\eta}_{k})\right)\geq\hat{f}_{0}(\bm{0}_{\bm{x}_{0}})-\hat{f}_{t}(\bm{\eta}_{t})+\sum_{k\in\mathcal{S}_{succ}^{2}}\left(\hat{f}_{k}(\bm{0}_{\bm{x}_{k}})-\hat{f}_{k}(\bm{\eta}_{k})\right), (178)

and this results in

f^t(𝜼t)f^min\displaystyle\hat{f}_{t}(\bm{\eta}_{t})-\hat{f}_{min} k𝒮succ2(f^k(𝟎𝒙k)f^k(𝜼k))τk𝒮succ2(m^k(𝟎𝒙k)m^k(𝜼k))\displaystyle\geq\sum_{k\in\mathcal{S}_{succ}^{2}}\left(\hat{f}_{k}(\bm{0}_{\bm{x}_{k}})-\hat{f}_{k}(\bm{\eta}_{k})\right)\geq\tau\sum_{k\in\mathcal{S}_{succ}^{2}}\left(\hat{m}_{k}(\bm{0}_{\bm{x}_{k}})-\hat{m}_{k}(\bm{\eta}_{k})\right)
τ|𝒮succ2|ν3ϵH36σk2τ|𝒮succ2|ν3ϵH36σb2τκ2|𝒮succ2|ϵH3,\displaystyle\geq\tau|\mathcal{S}_{succ}^{2}|\frac{\nu^{3}\epsilon_{H}^{3}}{6\sigma_{k}^{2}}\geq\tau|\mathcal{S}_{succ}^{2}|\frac{\nu^{3}\epsilon_{H}^{3}}{6\sigma_{b}^{2}}\geq\tau\kappa_{2}|\mathcal{S}_{succ}^{2}|\epsilon_{H}^{3}, (179)

where κ2=ν36σb2\kappa_{2}=\frac{\nu^{3}}{6\sigma_{b}^{2}}. The third and fourth inequalities follow from Eq. (37) and Eq. (157), respectively. Then, the bound for the total number of successful iterations is

|𝒮succ|=\displaystyle|\mathcal{S}_{succ}|= |𝒮succ1|+|𝒮succ2|+1\displaystyle\;|\mathcal{S}_{succ}^{1}|+|\mathcal{S}_{succ}^{2}|+1
\displaystyle\leq f^0(𝟎𝒙0)f^minτκ1max(ϵg32,ϵH3)+f^t(𝜼t)f^minτκ2ϵH3+1\displaystyle\;\frac{\hat{f}_{0}(\bm{0}_{\bm{x}_{0}})-\hat{f}_{min}}{\tau\kappa_{1}}\max\left(\epsilon_{g}^{-\frac{3}{2}},\epsilon_{H}^{-3}\right)+\frac{\hat{f}_{t}(\bm{\eta}_{t})-\hat{f}_{min}}{\tau\kappa_{2}}\epsilon_{H}^{-3}+1
\displaystyle\leq (f^0(𝟎𝒙0)f^minτκ1+f^t(𝜼t)f^minτκ2)max(ϵg32,ϵH3)+1,\displaystyle\;\left(\frac{\hat{f}_{0}(\bm{0}_{\bm{x}_{0}})-\hat{f}_{min}}{\tau\kappa_{1}}+\frac{\hat{f}_{t}(\bm{\eta}_{t})-\hat{f}_{min}}{\tau\kappa_{2}}\right)\max\left(\epsilon_{g}^{-\frac{3}{2}},\epsilon_{H}^{-3}\right)+1, (180)

where the extra iteration corresponds to the final successful iteration of Algorithm 1 with λmin(𝐇k+1)ϵH\lambda_{min}(\mathbf{H}_{k+1})\geq-\epsilon_{H}. Then, similar to Eq. (162), we have the improved iteration bound for Algorithm 1 given as

T=|𝒮fail|+|𝒮succ|log(σb/σ0)logγ+2|𝒮succ|log(σb/σ0)logγ+2Cmax(ϵg32,ϵH3)+2,T=|\mathcal{S}_{fail}|+|\mathcal{S}_{succ}|\leq\frac{\log(\sigma_{b}/\sigma_{0})}{\log{\gamma}}+2|\mathcal{S}_{succ}|\leq\frac{\log(\sigma_{b}/\sigma_{0})}{\log{\gamma}}+2C\max\left(\epsilon_{g}^{-\frac{3}{2}},\epsilon_{H}^{-3}\right)+2, (181)

where C=f^0(𝟎𝒙0)f^minτκ1+f^t(𝜼t)f^minτκ2C=\frac{\hat{f}_{0}(\bm{0}_{\bm{x}_{0}})-\hat{f}_{min}}{\tau\kappa_{1}}+\frac{\hat{f}_{t}(\bm{\eta}_{t})-\hat{f}_{min}}{\tau\kappa_{2}}. This completes the proof. ∎
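The improvement of Theorem 4 over Theorem 3 lies in the ϵ_g-dependence of the bound, from max(ϵ_g^{-2}, ϵ_H^{-3}) to max(ϵ_g^{-3/2}, ϵ_H^{-3}). The short snippet below illustrates this gap for a few sample tolerance pairs (chosen by us for illustration only).

```python
# Sample tolerance pairs chosen purely for illustration.
for eps_g, eps_H in [(1e-2, 1e-1), (1e-3, 1e-1), (1e-4, 1e-2)]:
    scale_thm3 = max(eps_g ** -2.0, eps_H ** -3.0)    # Theorem 3 dependence
    scale_thm4 = max(eps_g ** -1.5, eps_H ** -3.0)    # Theorem 4 dependence
    print(f"eps_g={eps_g:.0e}, eps_H={eps_H:.0e}: "
          f"Theorem 3 scale {scale_thm3:.1e}, Theorem 4 scale {scale_thm4:.1e}")
```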

G.3 Main Proof of Corollary 2

Although the argument follows exactly the same steps as the proof of Corollary 1, we repeat it here for the reader's convenience.

Proof

Under the given assumptions, when Theorem 4 holds, Algorithm 1 returns an (ϵg,ϵH)\left(\epsilon_{g},\epsilon_{H}\right)-optimal solution in T=𝒪(max(ϵg32,ϵH3))T=\mathcal{O}\left(\max\left(\epsilon_{g}^{-\frac{3}{2}},\epsilon_{H}^{-3}\right)\right) iterations. Also, according to Theorem 2, Condition 1 is satisfied at each iteration with probability (1δ)(1-\delta), which can be achieved independently at each iteration by selecting proper subsampling sizes for the approximate gradient and Hessian. Let EE be the event that Algorithm 1 returns an (ϵg,ϵH)\left(\epsilon_{g},\epsilon_{H}\right)-optimal solution and EiE_{i} be the event that Condition 1 is satisfied at iteration ii. According to Theorem 4, for event EE to happen, Condition 1 needs to be satisfied at every iteration, and thus we have

Pr(E)=i=1TPr(Ei)=(1δ)T=(1δ)𝒪(max(ϵg32,ϵH3)).{\rm Pr}(E)=\prod_{i=1}^{T}{\rm Pr}(E_{i})=(1-\delta)^{T}=(1-\delta)^{\mathcal{O}\left(\max\left(\epsilon_{g}^{-\frac{3}{2}},\epsilon_{H}^{-3}\right)\right)}. (182)

This completes the proof.