
Yian Deng, Tingting Mu
Department of Computer Science, University of Manchester
Email: {yian.deng, tingting.mu}@manchester.ac.uk

Faster Riemannian Newton-type Optimization
by Subsampling and Cubic Regularization

Yian Deng    Tingting Mu
(Manuscript was submitted to Machine Learning in 10/2021 and accepted in 01/2023.)
Abstract

This work is on constrained large-scale non-convex optimization where the constraint set implies a manifold structure. Solving such problems is important in a multitude of fundamental machine learning tasks. Recent advances on Riemannian optimization have enabled the convenient recovery of solutions by adapting unconstrained optimization algorithms over manifolds. However, it remains challenging to scale up and meanwhile maintain stable convergence rates and handle saddle points. We propose a new second-order Riemannian optimization algorithm, aiming at improving convergence rate and reducing computational cost. It enhances the Riemannian trust-region algorithm, which exploits curvature information to escape saddle points, through a combination of subsampling and cubic regularization techniques. We conduct rigorous analysis to study the convergence behavior of the proposed algorithm. We also perform extensive experiments to evaluate it based on two general machine learning tasks using multiple datasets. The proposed algorithm exhibits improved computational speed and convergence behavior compared to a large set of state-of-the-art Riemannian optimization algorithms.

Keywords:
Optimization · Cubic regularization · Riemannian manifolds · Subsampling

1 Introduction

In modern machine learning, many learning tasks are formulated as non-convex optimization problems. This is because, as compared to linear or convex formulations, they can often capture more accurately the underlying structures within the data, and model more precisely the learning performance (or losses). There is an important class of non-convex problems of which the constraint sets possess manifold structures, e.g., to optimize over a set of orthogonal matrices. A manifold in mathematics refers to a topological space that locally behaves as a Euclidean space near each point. Over a D-dimensional manifold \mathcal{M} in a d-dimensional ambient space (D<d), each local patch around each data point (a subset of \mathcal{M}) is homeomorphic to a local open subset of the Euclidean space \mathbb{R}^{D}. This special structure enables a straightforward adoption of any unconstrained optimization algorithm for solving a constrained problem over a manifold, simply by applying a systematic way to modify the gradient and Hessian calculations. The modified calculations are called the Riemannian gradients and the Riemannian Hessians, which will be rigorously defined later. Such an accessible method for developing optimized solutions has benefited many applications and encouraged the implementation of various optimization libraries.

A representative learning task that gives rise to non-convex problems over manifolds is low-rank matrix completion, widely applied in signal denoising, recommendation systems and image recovery liu2019convolution . It is formulated as optimization problems constrained on fixed-rank matrices that belong to a Grassmann manifold. Another example task is principal component analysis (PCA), popularly used in statistical data analysis and dimensionality reduction shahid2015robust . It seeks an optimal orthogonal projection matrix from a Stiefel manifold. A more general problem setup than PCA is subspace learning mishra2019riemannian , where a low-dimensional space is an instance of a Grassmann manifold. When training a neural network, in order to reduce overfitting, the orthogonality constraint that provides a Stiefel manifold structure is sometimes imposed over the network weights anandkumar2016efficient . Additionally, in hand gesture recognition nguyen2019neural , optimizing over a symmetric positive definite (SPD) manifold has been shown to be effective.

Recent developments in optimization on Riemannian manifolds absil2009optimization have offered a convenient and unified solution framework for solving the aforementioned class of non-convex problems. The Riemannian optimization techniques translate the constrained problems into unconstrained ones on the manifold whilst preserving the geometric structure of the solution. For example, one Riemannian way to implement PCA is to preserve the SPD geometric structure of the solutions without explicit constraints horev2016geometry . A simplified description of how Riemannian optimization works is that it first applies a straightforward way to modify the calculation of the first-order and second-order gradient information, then it adopts an unconstrained optimization algorithm that uses the modified gradient information. There are systematic ways to compute these modifications by analyzing the geometric structure of the manifold. Various libraries have implemented these methods and are available to practitioners, e.g., Manopt boumal2014manopt and Pymanopt townsend2016pymanopt .

Among such techniques, Riemannian gradient descent (RGD) is the simplest. To handle large-scale computation with a finite-sum structure, Riemannian stochastic gradient descent (RSGD) bonnabel2013stochastic has been proposed to estimate the gradient from a single sample (or a sample batch) in each iteration of the optimization. Here, an iteration refers to the process by which an incumbent solution is updated with gradient and (or) higher-order derivative information; for example, Eq. (3) in the upcoming text defines an RSGD iteration. Convergence rates of RGD and RSGD are compared in zhang2016first together with a global complexity analysis. The work concludes that RGD can converge linearly while RSGD converges sub-linearly, but RSGD becomes computationally cheaper when there is a significant increase in the number of samples to process, and it can also potentially prevent overfitting. By using RSGD to optimize over the Stiefel manifold, politz2016interpretable attempts to improve the interpretability of domain adaptation and has demonstrated its benefits for text classification.

A major drawback of RSGD is the variance issue, where the variance of the update direction can slow down the convergence and result in poor solutions. Typical techniques for variance reduction include the Riemannian stochastic variance reduced gradient (RSVRG) zhang2016riemannian and the Riemannian stochastic recursive gradient (RSRG) kasai2018riemannian . RSVRG reduces the gradient variance by using a momentum term, which takes into account the gradient information obtained from both RGD and RSGD. RSRG follows a different strategy and considers only the information in the last and current iterations. This has the benefit of avoiding large cumulative errors, which can be caused by transporting the gradient vector along a distant path when aligning two vectors at the same tangent plane. It has been shown by kasai2018riemannian that RSRG performs better than RSVRG particularly for large-scale problems.

The RSGD variants can suffer from oscillation across the slopes of a ravine kumar2018geometry . This also happens when performing stochastic gradient descent in Euclidean spaces. To address this, various adaptive algorithms have been proposed. The core idea is to control the learning process with adaptive learning rates in addition to the gradient momentum. Riemannian techniques of this kind include R-RMSProp kumar2018geometry , R-AMSGrad cho2017riemannian , R-AdamNC becigneul2018riemannian , RPG huang2021riemannian and RAGDsDR alimisis2021momentum .

Although improvements have been made for first-order optimization, they might still be insufficient for handling saddle points in non-convex problems mokhtari2018escaping . They can only guarantee convergence to stationary points and do not have control over getting trapped at saddle points due to the lack of higher-order information. As an alternative, second-order algorithms are normally good at escaping saddle points by exploiting curvature information kohler2017sub ; tripuraneni2018stochastic . Representative examples of this are the trust-region (TR) methods. Their capacity for handling saddle points and their improved convergence over many first-order methods have been demonstrated in weiwei2013newton for various non-convex problems. The TR technique has been extended to a Riemannian setting for the first time by absil2007trust , referred to as the Riemannian TR (RTR) technique.

It is well known that what prevents the wide use of second-order Riemannian techniques in large-scale problems is the high cost of computing the exact Hessian matrices. Inexact techniques have therefore been proposed to iteratively search for solutions without explicit Hessian computations. They can also handle non-positive-definite Hessian matrices and improve operational robustness. Two representative inexact examples are the conjugate gradient and the Lanczos methods zhu2017riemannian ; xu2016matrix . However, their reduced complexity is still proportional to the sample size, and they can still be computationally costly when working with large-scale problems. To address this issue, the subsampling technique has been proposed, whose core idea is to approximate the gradient and Hessian using a batch of samples. It has been proved by shen2019stochastic that the TR method with subsampled gradient and Hessian can achieve a convergence rate of order \mathcal{O}\left(\frac{1}{k^{2/3}}\right), with k denoting the iteration number. A sample-efficient stochastic TR approach is proposed by shen2019stochastic which finds an (\epsilon,\sqrt{\epsilon})-approximate local minimum within \mathcal{O}(\sqrt{n}/\epsilon^{1.5}) stochastic Hessian oracle queries, where n denotes the sample number. The subsampling technique has been applied to improve second-order Riemannian optimization for the first time by kasai2018inexact . Their proposed inexact RTR algorithm employs subsampling over the Riemannian manifold and achieves faster convergence than the standard RTR method. Nonetheless, subsampling can be sensitive to the configured batch size, and overly small batch sizes can lead to poor convergence.

In the latest development of second-order unconstrained optimization, it has been shown that the adaptive cubic regularization technique cartis2011adaptive can improve the standard and subsampled TR algorithms and the Newton-type methods, resulting in, for instance, improved convergence and effectiveness at escaping strict saddle points kohler2017sub ; xu2020newton . To improve performance, variance reduction techniques have been combined with cubic regularization and extended to cases with inexact solutions. Example works include zhou2020stochastic ; zhou2019stochastic , which were the first to rigorously demonstrate the advantage of variance reduction for second-order optimization algorithms. Recently, the potential of cubic regularization for solving non-convex problems over constraint sets with Riemannian manifold structures has been shown by zhang2018cubic ; agarwal2021adaptive .

We aim at improving the RTR optimization by taking advantage of both the adaptive cubic regularization and subsampling techniques. Our problem of interest is to find a local minimum of a non-convex finite-sum minimization problem constrained on a set endowed with a Riemannian manifold structure. Letting f_{i}:\mathcal{M}\rightarrow\mathbb{R} be a real-valued function defined on a Riemannian manifold \mathcal{M}, we consider a twice-differentiable finite-sum objective function:

\min_{\mathbf{x}\in\mathcal{M}}\ f(\mathbf{x})=\frac{1}{n}\sum_{i=1}^{n}f_{i}(\mathbf{x}). (1)

In the machine learning context, n denotes the sample number, and f_{i}(\mathbf{x}) is a smooth real-valued and twice differentiable cost (or loss) function computed for the i-th sample. The n samples are assumed to be uniformly sampled, and thus \mathbb{E}\left[f_{i}(\mathbf{x})\right]=\lim_{n\to\infty}\frac{1}{n}\sum_{i=1}^{n}f_{i}(\mathbf{x}).

We propose a cubic Riemannian Newton-like (RN) method to solve more effectively the problem in Eq. (1). Specifically, we enable two key improvements in the Riemannian space: (1) approximating the Riemannian gradient and Hessian using the subsampling technique, and (2) improving the subproblem formulation by replacing the trust-region constraint with a cubic regularization term. The resulting algorithm is named Inexact Sub-RN-CR (the abbreviation Sub-RN-CR comes from Sub-sampled Riemannian Newton-like Cubic Regularization; we follow the tradition of referring to a TR method enhanced by cubic regularization as a Newton-like method cartis2011adaptive . An implementation of Inexact Sub-RN-CR is provided at https://github.com/xqdi/isrncr). After introducing cubic regularization, it becomes more challenging to solve the subproblem, for which we demonstrate two effective solvers based on the Lanczos and conjugate gradient methods. We provide convergence analysis for the proposed Inexact Sub-RN-CR algorithm and present the main results in Theorems 3 and 4. Additionally, we provide analysis for the subproblem solvers, regarding their solution quality, e.g., whether and how they meet a set of desired conditions as presented in Assumptions 1-3, and their convergence property. The key results are presented in Lemma 1, Theorem 2, and Lemmas 2 and 3. Overall, our results are satisfactory. The proposed algorithm finds an \left(\epsilon_{g},\epsilon_{H}\right)-optimal solution (defined in Section 3.1) in fewer iterations than the state-of-the-art RTR kasai2018inexact . Specifically, the required number of iterations is reduced from \mathcal{O}\left(\max\left(\epsilon_{g}^{-2}\epsilon_{H}^{-1},\epsilon_{H}^{-3}\right)\right) to \mathcal{O}\left(\max\left(\epsilon_{g}^{-2},\epsilon_{H}^{-3}\right)\right). When tested in extensive experiments on PCA and matrix completion tasks with different datasets and applications in image analysis, our algorithm shows much better performance than most state-of-the-art and popular Riemannian optimization algorithms, in terms of both solution quality and computing time.

2 Notations and Preliminaries

We start by familiarizing the readers with the notations and concepts that will be used in the paper, and recommend absil2009optimization for a more detailed explanation of the relevant concepts. The manifold \mathcal{M} is equipped with a smooth inner product \langle\cdot,\cdot\rangle_{\mathbf{x}} associated with the tangent space T_{\mathbf{x}}\mathcal{M} at any \mathbf{x}\in\mathcal{M}, and this inner product is referred to as the Riemannian metric. The norm of a tangent vector \bm{\eta}\in T_{\mathbf{x}}\mathcal{M} is denoted by \left\|\bm{\eta}\right\|_{\mathbf{x}}, computed as \left\|\bm{\eta}\right\|_{\mathbf{x}}=\sqrt{\langle\bm{\eta},\bm{\eta}\rangle_{\mathbf{x}}}. When we use the notation \left\|\bm{\eta}\right\|_{\mathbf{x}}, \bm{\eta} by default belongs to the tangent space T_{\mathbf{x}}\mathcal{M}. We use \mathbf{0}_{\mathbf{x}}\in T_{\mathbf{x}}\mathcal{M} to denote the zero vector of the tangent space at \mathbf{x}. The retraction mapping R_{\mathbf{x}}\left(\bm{\eta}\right):T_{\mathbf{x}}\mathcal{M}\rightarrow\mathcal{M} is used to move \mathbf{x}\in\mathcal{M} in the direction \bm{\eta}\in T_{\mathbf{x}}\mathcal{M} while remaining on \mathcal{M}; it is the Riemannian counterpart of \mathbf{x}+\bm{\eta} in a Euclidean space. The pullback of f at \mathbf{x} is defined by \hat{f}(\bm{\eta})=f(R_{\mathbf{x}}(\bm{\eta})), and \hat{f}(\mathbf{0}_{\mathbf{x}})=f(\mathbf{x}). The vector transport operator \mathcal{T}_{\mathbf{x}}^{\mathbf{y}}(\mathbf{v}):T_{\mathbf{x}}\mathcal{M}\rightarrow T_{\mathbf{y}}\mathcal{M} moves a tangent vector \mathbf{v}\in T_{\mathbf{x}}\mathcal{M} from a point \mathbf{x}\in\mathcal{M} to another \mathbf{y}\in\mathcal{M}. We also use the shorthand notation \mathcal{T}_{\bm{\eta}}({\mathbf{v}}) to describe \mathcal{T}_{\mathbf{x}}^{\mathbf{y}}(\mathbf{v}) for a moving direction \bm{\eta}\in T_{\mathbf{x}}\mathcal{M} from \mathbf{x} to \mathbf{y} satisfying R_{\mathbf{x}}\left(\bm{\eta}\right)=\mathbf{y}. The parallel transport operator \mathcal{P}_{\bm{\eta},\gamma}(\mathbf{v}) is a special instance of the vector transport. It moves \mathbf{v}\in T_{\mathbf{x}}\mathcal{M} in the direction of \bm{\eta}\in T_{\mathbf{x}}\mathcal{M} along a geodesic \gamma:[0,1]\rightarrow\mathcal{M}, where \gamma(0)=\mathbf{x}, \gamma(1)=\mathbf{y} and \gamma^{\prime}(0)=\bm{\eta}, and during the movement it has to satisfy the parallel condition on the geodesic curve. We simplify the notation \mathcal{P}_{\bm{\eta},\gamma}(\mathbf{v}) to \mathcal{P}_{\bm{\eta}}(\mathbf{v}). Fig. 1 illustrates a manifold and the operations over it. Additionally, we use \|\cdot\| to denote the l_{2}-norm operation in a Euclidean space.

Figure 1: Illustration of the retraction and parallel transport operations.

The Riemannian gradient of a real-valued differentiable function f at \mathbf{x}\in\mathcal{M}, denoted by {\rm grad}f(\mathbf{x}), is defined as the unique element of T_{\mathbf{x}}\mathcal{M} satisfying \langle{\rm grad}f(\mathbf{x}),\bm{\xi}\rangle_{\mathbf{x}}=\mathcal{D}f(\mathbf{x})[\bm{\xi}],\ \forall\bm{\xi}\in T_{\mathbf{x}}\mathcal{M}. Here, \mathcal{D}f(\mathbf{x})[\bm{\xi}] generalizes the notion of the directional derivative to a manifold, defined as the derivative of f(\gamma(t)) at t=0, where \gamma(t) is a curve on \mathcal{M} such that \gamma(0)=\mathbf{x} and \dot{\gamma}(0)=\bm{\xi}. When operating in a Euclidean space, we use the same notation \mathcal{D}f(\mathbf{x})[\bm{\xi}] to denote the classical directional derivative. The Riemannian Hessian of a real-valued differentiable function f at \mathbf{x}\in\mathcal{M}, denoted by {\rm Hess}f(\mathbf{x})[\bm{\xi}]:T_{\mathbf{x}}\mathcal{M}\rightarrow T_{\mathbf{x}}\mathcal{M}, is a linear mapping defined based on the Riemannian connection, as {\rm Hess}f(\mathbf{x})[\bm{\xi}]=\tilde{\nabla}_{\bm{\xi}}{\rm grad}f(\mathbf{x}). The Riemannian connection \tilde{\nabla}_{\bm{\xi}}\bm{\eta}:T_{\mathbf{x}}\mathcal{M}\times T_{\mathbf{x}}\mathcal{M}\rightarrow T_{\mathbf{x}}\mathcal{M} generalizes the notion of the directional derivative of a vector field. For a function f defined over an embedded manifold, its Riemannian gradient can be computed by projecting the Euclidean gradient \nabla f(\mathbf{x}) onto the tangent space, as {\rm grad}f(\mathbf{x})=P_{T_{\mathbf{x}}\mathcal{M}}\left[\nabla f(\mathbf{x})\right], where P_{T_{\mathbf{x}}\mathcal{M}}[\cdot] is the orthogonal projection onto T_{\mathbf{x}}\mathcal{M}. Similarly, its Riemannian Hessian can be computed by projecting the classical directional derivative of {\rm grad}f(\mathbf{x}), defined by \nabla^{2}f(\mathbf{x})[\bm{\xi}]=\mathcal{D}{\rm grad}f(\mathbf{x})[\bm{\xi}], onto the tangent space, resulting in {\rm Hess}f(\mathbf{x})[\bm{\xi}]=P_{T_{\mathbf{x}}\mathcal{M}}\left[\nabla^{2}f(\mathbf{x})[\bm{\xi}]\right]. When the function f is defined over a quotient manifold, the Riemannian gradient and Hessian can be computed by projecting \nabla f(\mathbf{x}) and \nabla^{2}f(\mathbf{x})[\bm{\xi}] onto the horizontal space of the manifold.

Taking the PCA problem as an example (see Eqs. (60) and (61) for its formulation), the general objective in Eq. (1) can be instantiated by \min_{\mathbf{U}\in{\rm Gr}\left(r,d\right)} \frac{1}{n}\sum_{i=1}^{n}\left\|\mathbf{z}_{i}-\mathbf{UU}^{T}\mathbf{z}_{i}\right\|^{2}, where \mathcal{M}={\rm Gr}\left(r,d\right) is a Grassmann manifold. The function of interest is f_{i}(\mathbf{U})=\left\|\mathbf{z}_{i}-\mathbf{UU}^{T}\mathbf{z}_{i}\right\|^{2}. The Grassmann manifold {\rm Gr}\left(r,d\right) contains the set of r-dimensional linear subspaces of the d-dimensional vector space. Each subspace corresponds to a point on the manifold that is an equivalence class of d\times r orthogonal matrices, expressed as \mathbf{x}=\left[\mathbf{U}\right]:=\left\{\mathbf{UO}:\mathbf{U}^{T}\mathbf{U}=\mathbf{I},\mathbf{O}\in{\rm O}\left(r\right)\right\}, where {\rm O}\left(r\right) denotes the orthogonal group in \mathbb{R}^{r\times r}. A tangent vector \bm{\eta}\in T_{\mathbf{x}}{\rm Gr}(r,d) of the Grassmann manifold has the form \bm{\eta}=\mathbf{U}^{\bot}\mathbf{B} edelman1998geometry , where \mathbf{B}\in\mathbb{R}^{(d-r)\times r}, and \mathbf{U}^{\bot}\in\mathbb{R}^{d\times(d-r)} is the orthogonal complement of \mathbf{U} such that \left[\mathbf{U},\mathbf{U}^{\bot}\right] is orthogonal. A commonly used Riemannian metric for the Grassmann manifold is the canonical inner product \langle\bm{\eta},\bm{\xi}\rangle_{\mathbf{x}}={\rm tr}(\bm{\eta}^{T}\bm{\xi}) given \bm{\eta},\bm{\xi}\in T_{\mathbf{x}}{\rm Gr}(r,d), resulting in \left\|\bm{\eta}\right\|_{\mathbf{x}}=\|\bm{\eta}\| (Section 2.3.2 of edelman1998geometry ). As we can see, the Riemannian metric and the norm here are equivalent to the Euclidean inner product and norm. The same result can be derived from another commonly used metric of the Grassmann manifold, i.e., \langle\bm{\eta},\bm{\xi}\rangle_{\mathbf{x}}={\rm tr}\left(\bm{\eta}^{T}\left(\mathbf{I}-\frac{1}{2}\mathbf{U}\mathbf{U}^{T}\right)\bm{\xi}\right) for \bm{\eta},\bm{\xi}\in T_{\mathbf{x}}{\rm Gr}(r,d) (Section 2.5 of edelman1998geometry ). Expressing two given tangent vectors as \bm{\eta}=\mathbf{U}^{\bot}\mathbf{B}_{\eta} and \bm{\xi}=\mathbf{U}^{\bot}\mathbf{B}_{\xi} with \mathbf{B}_{\eta},\mathbf{B}_{\xi}\in\mathbb{R}^{(d-r)\times r}, we have

\langle\bm{\eta},\bm{\xi}\rangle_{\mathbf{x}}={\rm tr}\left(\left(\mathbf{U}^{\bot}\mathbf{B}_{\eta}\right)^{T}\left(\mathbf{I}-\frac{1}{2}\mathbf{U}\mathbf{U}^{T}\right)\mathbf{U}^{\bot}\mathbf{B}_{\xi}\right)={\rm tr}\left(\left(\mathbf{U}^{\bot}\mathbf{B}_{\eta}\right)^{T}\mathbf{U}^{\bot}\mathbf{B}_{\xi}\right)={\rm tr}\left(\bm{\eta}^{T}\bm{\xi}\right). (2)

Here we provide a few examples of the key operations explained earlier on the Grassmann manifold, taken from boumal2014manopt . Given a data point [\mathbf{U}], a moving direction \bm{\eta} and the step size t, one way to construct the retraction mapping is to perform singular value decomposition (SVD) on \mathbf{U}+t\bm{\eta}, i.e., \mathbf{U}+t\bm{\eta}=\bar{\mathbf{U}}\mathbf{S}\bar{\mathbf{V}}^{T}, and the new data point after moving is \left[\bar{\mathbf{U}}\bar{\mathbf{V}}^{T}\right]. A transport operation can be implemented by projecting a given tangent vector using the orthogonal projector \mathbf{I}-\mathbf{U}\mathbf{U}^{T}. Both the Riemannian gradient and Hessian can be computed by projecting the Euclidean gradient and Hessian of f(\mathbf{U}) using the same projector \mathbf{I}-\mathbf{U}\mathbf{U}^{T}.
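To make these operations concrete, below is a minimal NumPy sketch (not taken from the paper's implementation, and with illustrative function names) of the SVD-based retraction, the projector-based transport, and the projected Riemannian gradient for the PCA cost on the Grassmann manifold; the Euclidean gradient uses the simplification \left\|\mathbf{z}-\mathbf{UU}^{T}\mathbf{z}\right\|^{2}=\|\mathbf{z}\|^{2}-\|\mathbf{U}^{T}\mathbf{z}\|^{2}, valid when \mathbf{U}^{T}\mathbf{U}=\mathbf{I}.

```python
import numpy as np

def grassmann_retract(U, eta, t=1.0):
    """Retraction via SVD: factor U + t*eta = Ubar S Vbar^T and return Ubar Vbar^T."""
    Ubar, _, VbarT = np.linalg.svd(U + t * eta, full_matrices=False)
    return Ubar @ VbarT

def grassmann_project(U, V):
    """Project an ambient matrix V onto the tangent space at [U] with I - U U^T
    (the same projector also serves as a simple vector transport)."""
    return V - U @ (U.T @ V)

def pca_cost(U, Z):
    """f(U) = (1/n) sum_i ||z_i - U U^T z_i||^2, samples stored as rows of Z."""
    return np.mean(np.sum(Z**2, axis=1) - np.sum((Z @ U)**2, axis=1))

def pca_rgrad(U, Z):
    """Riemannian gradient: tangent projection of the Euclidean gradient -(2/n) Z^T Z U."""
    egrad = -(2.0 / Z.shape[0]) * Z.T @ (Z @ U)
    return grassmann_project(U, egrad)

# Usage: move a point on Gr(3, 8) along the projected negative gradient direction.
rng = np.random.default_rng(0)
Z = rng.standard_normal((100, 8))                    # n = 100 samples in d = 8 dimensions
U = np.linalg.qr(rng.standard_normal((8, 3)))[0]     # a point on Gr(3, 8)
U_new = grassmann_retract(U, -pca_rgrad(U, Z), t=0.1)
```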

2.1 First-order Algorithms

To optimize the problem in Eq. (1), the first-order Riemannian optimization algorithm RSGD updates the solution at each k-th iteration by using an f_{i} instance, as

\mathbf{x}_{k+1}=R_{\mathbf{x}_{k}}\left(-\beta_{k}\,\text{grad}f_{i}\left(\mathbf{x}_{k}\right)\right), (3)

where \beta_{k} is the step size. Assume that the algorithm runs for multiple epochs, referred to as the outer iterations. Each epoch contains multiple inner iterations, each of which corresponds to a randomly selected f_{i} for calculating the update. Letting \mathbf{x}_{k}^{t} be the solution at the t-th inner iteration of the k-th outer iteration and \tilde{\mathbf{x}}_{k} be the solution at the last inner iteration of the k-th outer iteration, RSVRG employs a variance-reduced extension zhang2016riemannian of the update defined in Eq. (3), given as

\mathbf{x}_{k}^{t+1}=R_{\mathbf{x}_{k}^{t}}\left(-\beta_{k}\mathbf{v}_{k}^{t}\right), (4)

where

\mathbf{v}_{k}^{t}=\text{grad}f_{i}\left(\mathbf{x}_{k}^{t}\right)-\mathcal{T}_{\tilde{\mathbf{x}}_{k-1}}^{\mathbf{x}_{k}^{t}}\left(\text{grad}f_{i}\left(\tilde{\mathbf{x}}_{k-1}\right)-\text{grad}f\left(\tilde{\mathbf{x}}_{k-1}\right)\right). (5)

Here, the full gradient information {\rm grad}f\left(\tilde{\mathbf{x}}_{k-1}\right) is used to reduce the variance in the stochastic gradient \mathbf{v}_{k}^{t}. As a later development, RSRG kasai2018riemannian suggests a recursive formulation to improve the variance-reduced gradient \mathbf{v}_{k}^{t}. Starting from \mathbf{v}_{k}^{0}={\rm grad}f\left(\tilde{\mathbf{x}}_{k-1}\right), it updates by

\mathbf{v}_{k}^{t}=\text{grad}f_{i}\left(\mathbf{x}_{k}^{t}\right)-\mathcal{T}_{\mathbf{x}_{k}^{t-1}}^{\mathbf{x}_{k}^{t}}\left(\text{grad}f_{i}\left(\mathbf{x}_{k}^{t-1}\right)\right)+\mathcal{T}_{\mathbf{x}_{k}^{t-1}}^{\mathbf{x}_{k}^{t}}\left(\mathbf{v}_{k}^{t-1}\right). (6)

This formulation is designed to avoid the accumulated error caused by a distant vector transport.
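For illustration, below is a minimal sketch of the RSGD update in Eq. (3) and the recursive direction of Eq. (6), reusing the Grassmann helpers (grassmann_retract, grassmann_project, pca_rgrad) from the sketch in Section 2 and using the tangent projector as a simple vector transport; it is an illustrative reading of the formulas, not the implementation of the cited works.

```python
def rsgd_step(U, Z, beta, batch):
    """One RSGD iteration, Eq. (3), using a mini-batch of sample indices."""
    G = pca_rgrad(U, Z[batch])              # stochastic Riemannian gradient estimate
    return grassmann_retract(U, -beta * G)

def rsrg_direction(U_prev, U_curr, Z, batch, v_prev):
    """Recursive variance-reduced direction of Eq. (6); the tangent projector
    at the new point plays the role of the vector transport."""
    g_curr = pca_rgrad(U_curr, Z[batch])
    g_prev_t = grassmann_project(U_curr, pca_rgrad(U_prev, Z[batch]))
    v_prev_t = grassmann_project(U_curr, v_prev)
    return g_curr - g_prev_t + v_prev_t
```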

2.2 Inexact RTR

For second-order Riemannian optimization, the Inexact RTR kasai2018inexact improves the standard RTR absil2007trust through subsampling. It optimizes an approximation of the objective function, formulated using the second-order Taylor expansion within a trust region of radius \Delta_{k} around \mathbf{x}_{k} at iteration k. A moving direction \bm{\eta}_{k} within the trust region is found by solving the subproblem at iteration k:

\bm{\eta}_{k}=\mathop{\arg\min}_{\bm{\eta}\in T_{\mathbf{x}_{k}}\mathcal{M}}\; f(\mathbf{x}_{k})+\langle\mathbf{G}_{k},\bm{\eta}\rangle_{\mathbf{x}_{k}}+\frac{1}{2}\langle\bm{\eta},\mathbf{H}_{k}[\bm{\eta}]\rangle_{\mathbf{x}_{k}},\quad\text{subject to }\left\|\bm{\eta}\right\|_{\mathbf{x}_{k}}\leq\Delta_{k}, (7)

where \mathbf{G}_{k} and \mathbf{H}_{k}[\bm{\eta}] are the approximate Riemannian gradient and Hessian calculated by using the subsampling technique. The approximation is based on the current solution \mathbf{x}_{k} and the moving direction \bm{\eta}, calculated as

\mathbf{G}_{k}=\frac{1}{\left|\mathcal{S}_{g}\right|}\sum_{i\in\mathcal{S}_{g}}\text{grad}f_{i}(\mathbf{x}_{k}), (8)
\mathbf{H}_{k}[\bm{\eta}]=\frac{1}{\left|\mathcal{S}_{H}\right|}\sum_{i\in\mathcal{S}_{H}}\text{Hess}f_{i}(\mathbf{x}_{k})[\bm{\eta}], (9)

where \mathcal{S}_{g},\mathcal{S}_{H}\subset\left\{1,...,n\right\} are the sets of the subsampled indices used for estimating the Riemannian gradient and Hessian. The updated solution \mathbf{x}_{k+1}=R_{\mathbf{x}_{k}}(\bm{\eta}_{k}) will be accepted and \Delta_{k} will be increased if the decrease of the true objective function f is sufficiently large as compared to that of the approximate objective used in Eq. (7). Otherwise, \Delta_{k} will be decreased because of the poor agreement between the approximate and true objectives.
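The subsampled estimates in Eqs. (8)-(9) amount to averaging per-sample Riemannian gradients and Hessian-vector products over the index sets; a minimal sketch follows, where rgrad_i(x, i) and rhess_i(x, i, eta) are assumed, hypothetical callables returning {\rm grad}f_{i}(\mathbf{x}) and {\rm Hess}f_{i}(\mathbf{x})[\bm{\eta}], and rng is a NumPy random Generator.

```python
def sample_indices(n, batch_g, batch_H, rng):
    """Draw the index sets S_g and S_H uniformly without replacement."""
    idx_g = rng.choice(n, size=min(batch_g, n), replace=False)
    idx_H = rng.choice(n, size=min(batch_H, n), replace=False)
    return idx_g, idx_H

def subsampled_gradient(x, rgrad_i, idx_g):
    """G_k in Eq. (8): average of per-sample Riemannian gradients over S_g."""
    return sum(rgrad_i(x, i) for i in idx_g) / len(idx_g)

def subsampled_hessian(x, rhess_i, idx_H):
    """H_k in Eq. (9): returns a linear operator eta -> mean_i Hess f_i(x)[eta]."""
    def H(eta):
        return sum(rhess_i(x, i, eta) for i in idx_H) / len(idx_H)
    return H
```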

3 Proposed Method

3.1 Inexact Sub-RN-CR Algorithm

We propose to improve the subsampling-based construction of the RTR subproblem in Eq. (7) by cubic regularization. This gives rise to the minimization

\bm{\eta}_{k}=\mathop{\arg\min}_{\bm{\eta}\in T_{\mathbf{x}_{k}}\mathcal{M}}\hat{m}_{k}(\bm{\eta}), (10)

where

\hat{m}_{k}(\bm{\eta})=\left\{\begin{aligned} &h_{\mathbf{x}_{k}}(\bm{\eta})+\langle\mathbf{G}_{k},\bm{\eta}\rangle_{\mathbf{x}_{k}}, &&\text{ if }\|\mathbf{G}_{k}\|_{\mathbf{x}_{k}}\geq\epsilon_{g},\\ &h_{\mathbf{x}_{k}}(\bm{\eta}), &&\text{ otherwise}.\end{aligned}\right. (11)

Here, 0<\epsilon_{g}<1 is a user-specified parameter that plays a role in the convergence analysis, which we will explain later. The core objective component h_{\mathbf{x}_{k}}(\bm{\eta}) is formulated by extending the adaptive cubic regularization technique cartis2011adaptive , given as

h_{\mathbf{x}_{k}}(\bm{\eta})=f(\mathbf{x}_{k})+\frac{1}{2}\langle\bm{\eta},\mathbf{H}_{k}[\bm{\eta}]\rangle_{\mathbf{x}_{k}}+\frac{\sigma_{k}}{3}\|\bm{\eta}\|_{\mathbf{x}_{k}}^{3}, (12)

with

\sigma_{k}=\left\{\begin{array}{ll}\max\left(\frac{\sigma_{k-1}}{\gamma},\epsilon_{\sigma}\right), &\text{if}\ \rho_{k-1}\geq\tau,\\ \gamma\sigma_{k-1}, &\text{otherwise,}\end{array}\right. (13)

and

\rho_{k}=\frac{\hat{f}_{k}(\mathbf{0}_{\mathbf{x}_{k}})-\hat{f}_{k}(\bm{\eta}_{k})}{\hat{m}_{k}(\mathbf{0}_{\mathbf{x}_{k}})-\hat{m}_{k}(\bm{\eta}_{k})}, (14)

where the subscript k is used to highlight the pullback of f at \mathbf{x}_{k}, written as \hat{f}_{k}(\cdot). Overall, there are four hyper-parameters to be set by the user: the lower bound 0<\epsilon_{\sigma}<1 on the trust parameter, the dynamic control parameter \gamma>1 that adjusts the cubic regularization weight, the model validity threshold 0<\tau<1, and the initial trust parameter \sigma_{0}. We will discuss the setup of the algorithm in more detail.

Algorithm 1 Main Inexact Sub-RN-CR Solver
Input: \epsilon_{\sigma}\in(0,1), \gamma>1, 0<\tau<1, \sigma_{0}>0, 0<\epsilon_{g},\epsilon_{H}<1.
1:  for k=1,2,\ldots do
2:     Sample the index sets \mathcal{S}_{g} and \mathcal{S}_{H}
3:     Compute the subsampled gradient \mathbf{G}_{k} and \lambda_{min}(\mathbf{H}_{k}) based on Eqs. (8)-(9) and (16)
4:     if \|\mathbf{G}_{k}\|_{\mathbf{x}_{k}}\leq\epsilon_{g} and \lambda_{min}(\mathbf{H}_{k})\geq-\epsilon_{H} then
5:        Return \mathbf{x}_{k}
6:     else if \|\mathbf{G}_{k}\|_{\mathbf{x}_{k}}\leq\epsilon_{g} then
7:        \mathbf{G}_{k}=\mathbf{0}_{\mathbf{x}_{k}}
8:     end if
9:     Inexactly solve \bm{\eta}_{k}^{*}=\mathop{\arg\min}_{\bm{\eta}\in T_{\mathbf{x}_{k}}\mathcal{M}}\hat{m}_{k}(\bm{\eta}) by Algorithm 2 or Algorithm 3
10:     Calculate \rho_{k}=\frac{\hat{f}_{k}(\mathbf{0}_{\mathbf{x}_{k}})-\hat{f}_{k}(\bm{\eta}_{k}^{*})}{\hat{m}_{k}(\mathbf{0}_{\mathbf{x}_{k}})-\hat{m}_{k}(\bm{\eta}_{k}^{*})}
11:     Set \mathbf{x}_{k+1}=R_{\mathbf{x}_{k}}(\bm{\eta}_{k}^{*}) if \rho_{k}\geq\tau, and \mathbf{x}_{k+1}=\mathbf{x}_{k} otherwise
12:     Set \sigma_{k+1}=\max(\sigma_{k}/\gamma,\epsilon_{\sigma}) if \rho_{k}\geq\tau, and \sigma_{k+1}=\gamma\sigma_{k} otherwise
13:  end for
Output: \mathbf{x}_{k}

We expect the norm of the approximate gradient to approach \epsilon_{g} with 0<\epsilon_{g}<1. Following a similar treatment in kasai2018inexact , when the gradient norm is smaller than \epsilon_{g}, the gradient-based term is ignored. This is important to the convergence analysis shown in the next section.

The trust-region radius \Delta_{k} is no longer explicitly defined, but replaced by the cubic regularization term \frac{\sigma_{k}}{3}\|\bm{\eta}\|_{\mathbf{x}_{k}}^{3}, where \sigma_{k} is related to a Lagrange multiplier on a cubic trust-region constraint. Naturally, the smaller \sigma_{k} is, the larger a moving step is allowed. Benefits of cubic regularization have been shown in griewank1981modification ; kohler2017sub . It can not only accelerate the local convergence, especially when the Hessian is singular, but also help escape strict saddle points better than the TR methods, providing stronger convergence properties.

The cubic term \frac{\sigma_{k}}{3}\|\bm{\eta}\|_{\mathbf{x}_{k}}^{3} is equipped with a dynamic penalization control through the adaptive trust quantity \sigma_{k}\geq 0. The value of \sigma_{k} is determined by examining how successful each iteration k is. An iteration k is considered successful if \rho_{k}\geq\tau, and unsuccessful otherwise, where the value of \rho_{k} quantifies the agreement between the changes of the approximate objective \hat{m}_{k}(\bm{\eta}) and the true objective f(\mathbf{x}). The larger \rho_{k} is, the more effective the approximate model is. Given \gamma>1, in an unsuccessful iteration, \sigma_{k} is increased to \gamma\sigma_{k}, hoping to obtain a more accurate approximation in the next iteration. Conversely, in a successful iteration, \sigma_{k} is decreased to \frac{\sigma_{k}}{\gamma}, relaxing the approximation, but it is still restricted to stay above the lower bound \epsilon_{\sigma}. This bound \epsilon_{\sigma} helps avoid solution candidates with overly large norms \|\bm{\eta}_{k}\|_{\mathbf{x}_{k}} that can cause an unstable optimization. Below we formally define what an (un)successful iteration is, which will be used in our later analysis.
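A minimal sketch of this acceptance rule and the update of \sigma_{k} (Steps (10)-(12) of Algorithm 1 and Eqs. (13)-(14)) is given below; f_pullback and m_hat are assumed callables evaluating \hat{f}_{k} and \hat{m}_{k} on tangent vectors, and the function names are illustrative.

```python
def agreement_ratio(f_pullback, m_hat, eta_star, zero_eta):
    """rho_k of Eq. (14): actual versus model decrease of the pullback objective."""
    actual = f_pullback(zero_eta) - f_pullback(eta_star)
    predicted = m_hat(zero_eta) - m_hat(eta_star)
    return actual / predicted

def accept_and_update_sigma(rho, sigma, x, x_proposal, gamma, tau, eps_sigma):
    """Acceptance test and trust-quantity update, Steps (11)-(12) of Algorithm 1."""
    if rho >= tau:                      # successful iteration: accept the step, relax sigma
        return x_proposal, max(sigma / gamma, eps_sigma)
    return x, gamma * sigma             # unsuccessful: keep the iterate, tighten sigma
```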

Definition 1 (Successful and Unsuccessful Iterations)

An iteration k in Algorithm 1 is considered successful if the agreement \rho_{k}\geq\tau, and unsuccessful if \rho_{k}<\tau. In addition, based on Step (12) of Algorithm 1, a successful iteration has \sigma_{k+1}\leq\sigma_{k}, while an unsuccessful one has \sigma_{k+1}>\sigma_{k}.

3.2 Optimality Examination

The stopping condition of the algorithm follows the definition of \left(\epsilon_{g},\epsilon_{H}\right)-optimality nocedal2006numerical , stated as below.

Definition 2 (\left(\epsilon_{g},\epsilon_{H}\right)-optimality)

Given 0<\epsilon_{g},\epsilon_{H}<1, a solution \mathbf{x} satisfies \left(\epsilon_{g},\epsilon_{H}\right)-optimality if

\left\|{\rm grad}f(\mathbf{x})\right\|_{\mathbf{x}}\leq\epsilon_{g}\quad{\rm and}\quad{\rm Hess}f(\mathbf{x})[\bm{\eta}]\succeq-\epsilon_{H}\mathbf{I}, (15)

for all \bm{\eta}\in T_{\mathbf{x}}\mathcal{M}, where \mathbf{I} is an identity matrix.

This is a relaxation and a manifold extension of the standard second-order optimality conditions \left\|{\rm grad}f(\mathbf{x})\right\|_{\mathbf{x}}=0 and \text{Hess}f(\mathbf{x})\succeq 0 in Euclidean spaces mokhtari2018escaping . The algorithm stops (1) when the gradient norm is sufficiently small and (2) when the Hessian is sufficiently close to being positive semidefinite.

To examine the Hessian, we follow a similar approach as in han2021riemannian by assessing the solution of the following minimization problem:

\lambda_{min}(\mathbf{H}_{k}):=\min_{\bm{\eta}\in T_{\mathbf{x}}\mathcal{M},\ \left\|\bm{\eta}\right\|_{\mathbf{x}}=1}\langle\bm{\eta},\mathbf{H}_{k}[\bm{\eta}]\rangle_{\mathbf{x}_{k}}, (16)

which resembles the smallest eigenvalue of the Riemannian Hessian. As a result, the algorithm stops when \left\|{\rm grad}f(\mathbf{x})\right\|_{\mathbf{x}}\leq\epsilon_{g} (referred to as the gradient condition) and when \lambda_{min}(\mathbf{H}_{k})\geq-\epsilon_{H} (referred to as the Hessian condition), where \epsilon_{g},\epsilon_{H}\in(0,1) are the user-set stopping parameters. Note that we use the same \epsilon_{g} for thresholding as in Eq. (11). Pseudo code of the complete Inexact Sub-RN-CR is provided in Algorithm 1.
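In an implementation, \lambda_{min}(\mathbf{H}_{k}) in Eq. (16) can be approximated by materializing the Hessian operator on an orthonormal basis of the tangent space and taking the smallest eigenvalue of the resulting symmetric matrix; a minimal sketch follows, where basis (a list of orthonormal tangent vectors) and inner (the Riemannian metric) are assumed to be supplied by the manifold implementation.

```python
import numpy as np

def smallest_hessian_eigenvalue(H, basis, inner):
    """Approximate lambda_min(H_k) of Eq. (16): represent the Hessian operator
    in an orthonormal tangent basis {q_i} and take the smallest eigenvalue."""
    D = len(basis)
    M = np.empty((D, D))
    for j, qj in enumerate(basis):
        Hqj = H(qj)                      # Hessian-vector product H_k[q_j]
        for i, qi in enumerate(basis):
            M[i, j] = inner(qi, Hqj)
    M = 0.5 * (M + M.T)                  # symmetrize against round-off error
    return np.linalg.eigvalsh(M)[0]

def is_eps_optimal(grad_norm, lam_min, eps_g, eps_H):
    """Stopping test of Algorithm 1: gradient and Hessian conditions."""
    return grad_norm <= eps_g and lam_min >= -eps_H
```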

3.3 Subproblem Solvers

Step (9) of Algorithm 1 requires solving the subproblem in Eq. (10). We rewrite its objective function \hat{m}_{k}(\bm{\eta}) as below for the convenience of explanation:

\bm{\eta}_{k}^{*}=\arg\min_{\bm{\eta}\in T_{\mathbf{x}_{k}}\mathcal{M}}f(\mathbf{x}_{k})+\delta\langle\mathbf{G}_{k},\bm{\eta}\rangle_{\mathbf{x}_{k}}+\frac{1}{2}\langle\bm{\eta},\mathbf{H}_{k}[\bm{\eta}]\rangle_{\mathbf{x}_{k}}+\frac{\sigma_{k}}{3}\|\bm{\eta}\|_{\mathbf{x}_{k}}^{3}, (17)

where \delta=1 if \|\mathbf{G}_{k}\|_{\mathbf{x}_{k}}\geq\epsilon_{g}, and \delta=0 otherwise. We demonstrate two solvers commonly used in practice.

3.3.1 The Lanczos Method

The Lanczos method 1999Solving has been widely used to solve the cubic regularization problem in Euclidean spaces xu2020newton ; kohler2017sub ; cartis2011adaptive ; jia2021solving and has recently been adapted to Riemannian spaces agarwal2021adaptive . Let D denote the manifold dimension. Its core operation is to construct a Krylov subspace \mathcal{K}_{D}, of which the basis \{\mathbf{q}_{i}\}_{i=1}^{D} spans the tangent space T_{\mathbf{x}_{k}}\mathcal{M} in which \bm{\eta} lies. After expressing the solution as an element of \mathcal{K}_{D}, i.e., \bm{\eta}:=\sum_{i=1}^{D}y_{i}\mathbf{q}_{i}, the minimization problem in Eq. (17) can be converted to one in the Euclidean space \mathbb{R}^{D}, as

\mathbf{y}^{*}=\arg\min_{\mathbf{y}\in\mathbb{R}^{D}}\ y_{1}\delta\left\|\mathbf{G}_{k}\right\|_{\mathbf{x}_{k}}+\frac{1}{2}\mathbf{y}^{T}\mathbf{T}_{D}\mathbf{y}+\frac{\sigma_{k}}{3}\left\|\mathbf{y}\right\|_{2}^{3}, (18)

where \mathbf{T}_{D}\in\mathbb{R}^{D\times D} is a symmetric tridiagonal matrix determined by the basis construction process, e.g., Algorithm 1 of jia2021solving . We provide a detailed derivation of Eq. (18) in Appendix A. The global solution of this converted problem, i.e., \mathbf{y}^{*}=\left[y_{1}^{*},y_{2}^{*},\ldots,y_{D}^{*}\right], can be found by many existing techniques, see press2007chapter . We employ the Newton root-finding method adopted by agarwal2021adaptive , which was originally proposed in Section 6 of cartis2011adaptive . It reduces the problem to a univariate root-finding problem. After this, the global solution of the subproblem is computed by \bm{\eta}_{k}^{*}=\sum_{i=1}^{D}y_{i}^{*}\mathbf{q}_{i}.
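For reference, below is a minimal sketch of this root-finding view of Eq. (18): the stationarity condition (\mathbf{T}_{D}+\lambda\mathbf{I})\mathbf{y}=-\delta\|\mathbf{G}_{k}\|_{\mathbf{x}_{k}}\mathbf{e}_{1} with \lambda=\sigma_{k}\|\mathbf{y}\| is solved for \lambda via an eigendecomposition of the small matrix \mathbf{T}_{D} and bracketed root finding. It assumes the gradient term is active (\delta=1), ignores the so-called hard case, and is not the exact routine used in the paper.

```python
import numpy as np
from scipy.optimize import brentq

def solve_reduced_cubic(T, g_norm, sigma):
    """Sketch of a global solver for Eq. (18):
        min_y  g_norm * y_1 + 0.5 * y^T T y + (sigma/3) * ||y||^3,  g_norm > 0.
    Stationarity gives (T + lam*I) y = -g with g = g_norm*e_1, lam = sigma*||y||,
    and lam >= max(0, -lambda_min(T)); the hard case is ignored in this sketch."""
    d, Q = np.linalg.eigh(T)                    # T = Q diag(d) Q^T, d ascending
    g = np.zeros(T.shape[0])
    g[0] = g_norm
    gt = Q.T @ g

    def phi(lam):                               # root of phi gives lam = sigma*||y(lam)||
        return np.linalg.norm(gt / (d + lam)) - lam / sigma

    lam_lo = max(0.0, -d[0]) + 1e-12
    lam_hi = lam_lo + 1.0
    while phi(lam_hi) > 0:                      # expand bracket until the sign changes
        lam_hi *= 2.0
    lam = brentq(phi, lam_lo, lam_hi)
    return -Q @ (gt / (d + lam))                # y* in the Krylov coordinates
```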

Algorithm 2 Subproblem Solver by Lanczos agarwal2021adaptive
Input: \mathbf{G}_{k} and \mathbf{H}_{k}[\bm{\eta}], \kappa_{\theta}\in(0,1/6], \sigma_{k}.
1:  \mathbf{q}_{1}=\frac{\mathbf{G}_{k}}{\left\|\mathbf{G}_{k}\right\|_{\mathbf{x}_{k}}}, \mathbf{T}=\mathbf{0}\in\mathbb{R}^{D\times D}, \alpha=\langle\mathbf{q}_{1},\mathbf{H}_{k}[\mathbf{q}_{1}]\rangle_{\mathbf{x}_{k}}, T_{1,1}=\alpha
2:  \mathbf{r}=\mathbf{H}_{k}[\mathbf{q}_{1}]-\alpha\mathbf{q}_{1}
3:  for l=1,2,\ldots,D do
4:     Obtain \mathbf{y}^{*} by optimizing Eq. (18) with D=l using Newton root finding
5:     \beta=\left\|\mathbf{r}\right\|_{\mathbf{x}_{k}}
6:     \mathbf{q}_{l+1}=-\frac{\mathbf{r}}{\beta}
7:     \alpha=\langle\mathbf{q}_{l+1},\mathbf{H}_{k}[\mathbf{q}_{l+1}]-\beta\mathbf{q}_{l}\rangle_{\mathbf{x}_{k}}
8:     \mathbf{r}=\mathbf{H}_{k}[\mathbf{q}_{l+1}]-\beta\mathbf{q}_{l}-\alpha\mathbf{q}_{l+1}
9:     T_{l,l+1}=T_{l+1,l}=\beta, T_{l+1,l+1}=\alpha
10:     if Eq. (47) is satisfied then
11:        Return \sum_{i=1}^{l}y_{i}^{*}\mathbf{q}_{i}
12:     end if
13:  end for

In practice, when the manifold dimension D is large, it is more practical to find a good solution rather than the global solution. By working with a lower-dimensional Krylov subspace \mathcal{K}_{l} with l<D, one can derive Eq. (18) in \mathbb{R}^{l}, and its solution \mathbf{y}^{*l} results in a subproblem solution \bm{\eta}_{k}^{*l}=\sum_{i=1}^{l}y_{i}^{*l}\mathbf{q}_{i}. Both the global solution \bm{\eta}_{k}^{*} and the approximate solution \bm{\eta}_{k}^{*l} are always guaranteed to be at least as good as the solution obtained by performing a line search along the gradient direction, i.e.,

\hat{m}_{k}\left(\bm{\eta}_{k}^{*}\right)\leq\hat{m}_{k}\left(\bm{\eta}_{k}^{*l}\right)\leq\min_{\alpha\in\mathbb{R}}\hat{m}_{k}(\alpha\mathbf{G}_{k}), (19)

because \alpha\mathbf{G}_{k} is a common basis vector shared by all the constructed Krylov subspaces \{\mathcal{K}_{l}\}_{l=1}^{D}. We provide pseudo code for the Lanczos subproblem solver in Algorithm 2.

To benefit practitioners and improve understanding of the Lanczos solver, we analyse the gap between a practical solution \bm{\eta}_{k}^{*l} and the global minimizer \bm{\eta}_{k}^{*}. Firstly, we define \lambda_{max}(\mathbf{H}_{k}) in a similar manner to \lambda_{min}(\mathbf{H}_{k}) in Eq. (16). It resembles the largest eigenvalue of the Riemannian Hessian, as

\lambda_{max}(\mathbf{H}_{k}):=\max_{\bm{\eta}\in T_{\mathbf{x}}\mathcal{M},\ \left\|\bm{\eta}\right\|_{\mathbf{x}}=1}\langle\bm{\eta},\mathbf{H}_{k}[\bm{\eta}]\rangle_{\mathbf{x}_{k}}. (20)

We denote a degree-l polynomial evaluated at \mathbf{H}_{k}[\bm{\eta}] by p_{l}\left(\mathbf{H}_{k}\right)[\bm{\eta}], such that

p_{l}\left(\mathbf{H}_{k}\right)[\bm{\eta}]:=c_{l}\mathbf{H}_{k}^{l}[\bm{\eta}]+c_{l-1}\mathbf{H}_{k}^{l-1}[\bm{\eta}]+\cdots+c_{1}\mathbf{H}_{k}[\bm{\eta}]+c_{0}\bm{\eta}, (21)

for some coefficients c_{0},c_{1},\ldots,c_{l}\in\mathbb{R}. The quantity \mathbf{H}_{k}^{l}[\bm{\eta}] is recursively defined by \mathbf{H}_{k}\left[\mathbf{H}_{k}^{l-1}\left[\bm{\eta}\right]\right] for l=2,3,\ldots. We define below an induced norm, as

\left\|p_{l+1}\left(\tilde{\mathbf{H}}_{k}\right)-{\rm Id}\right\|_{\mathbf{x}_{k}}=\sup_{\bm{\eta}\in T_{\mathbf{x}_{k}}\mathcal{M},\,\|\bm{\eta}\|_{\mathbf{x}_{k}}\neq 0}\frac{\left\|\left(p_{l+1}\left(\tilde{\mathbf{H}}_{k}\right)-{\rm Id}\right)[\bm{\eta}]\right\|_{\mathbf{x}_{k}}}{\left\|\bm{\eta}\right\|_{\mathbf{x}_{k}}}, (22)

where the identity mapping operator works as {\rm Id}[\bm{\eta}]=\bm{\eta}. Now we are ready to present our result in the following lemma.

Lemma 1 (Lanczos Solution Gap)

Let \bm{\eta}_{k}^{*} be the global minimizer of the subproblem \hat{m}_{k} in Eq. (10). Denote the subproblem without cubic regularization in Eq. (7) by \bar{m}_{k} and let \bar{\bm{\eta}}_{k}^{*} be its global minimizer. For each l>0, the solution \bm{\eta}_{k}^{*l} returned by Algorithm 2 satisfies

\hat{m}_{k}\left(\bm{\eta}_{k}^{*l}\right)-\hat{m}_{k}\left(\bm{\eta}_{k}^{*}\right)\leq\frac{4\lambda_{max}\left(\tilde{\mathbf{H}}_{k}\right)}{\lambda_{min}\left(\tilde{\mathbf{H}}_{k}\right)}\left(\bar{m}_{k}\left(\mathbf{0}_{\mathbf{x}_{k}}\right)-\bar{m}_{k}\left(\bar{\bm{\eta}}_{k}^{*}\right)\right)\phi_{l}\left(\tilde{\mathbf{H}}_{k}\right)^{2}, (23)

where \tilde{\mathbf{H}}_{k}[\bm{\eta}]:=(\mathbf{H}_{k}+\sigma_{k}\left\|\bm{\eta}_{k}^{*}\right\|_{\mathbf{x}_{k}}{\rm Id})[\bm{\eta}] for a moving direction \bm{\eta}, and \phi_{l}\left(\tilde{\mathbf{H}}_{k}\right) is an upper bound of the induced norm \left\|p_{l+1}\left(\tilde{\mathbf{H}}_{k}\right)-{\rm Id}\right\|_{\mathbf{x}_{k}}.

We provide its proof in Appendix B. In Euclidean spaces, carmon2018analysis has shown that \phi_{l}(\mathbf{H}_{k})=2e^{\frac{-2(l+1)}{\sqrt{\lambda_{max}(\mathbf{H}_{k})\lambda_{min}^{-1}(\mathbf{H}_{k})}}}. With the help of Lemma 1, this can serve as a reference for understanding the solution quality of the Lanczos method in Riemannian spaces.

3.3.2 The Conjugate Gradient Method

We experiment with an alternative subproblem solver by adapting the non-linear conjugate gradient technique to Riemannian spaces. It starts from the initialization \bm{\eta}_{k}^{0}=\mathbf{0}_{\mathbf{x}_{k}} and the first conjugate direction \mathbf{p}_{1}=-\mathbf{G}_{k} (the negative gradient direction). At each inner iteration i (as opposed to the outer iteration k in the main algorithm), it solves the minimization problem with one input variable:

\alpha_{i}^{*}=\arg\min_{\alpha\geq 0}\hat{m}_{k}\left(\bm{\eta}_{k}^{i-1}+\alpha\mathbf{p}_{i}\right). (24)

The global solution of this one-variable minimization problem can be computed by zeroing the derivative of \hat{m}_{k} with respect to \alpha, resulting in a polynomial equation in \alpha, which can then be solved by eigen-decomposition edelman1995polynomial . The root that attains the minimal value of \hat{m}_{k} is retrieved. The algorithm then updates the next conjugate direction \mathbf{p}_{i+1} using the returned \alpha_{i}^{*} and \mathbf{p}_{i}. Pseudo code for the conjugate gradient subproblem solver is provided in Algorithm 3.
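A minimal sketch of this exact line search is shown below: zeroing the derivative of \hat{m}_{k}(\bm{\eta}_{k}^{i-1}+\alpha\mathbf{p}_{i}) and squaring the resulting equation gives a quartic in \alpha, whose roots are obtained from the companion matrix (np.roots) and filtered against the original stationarity equation. Here inner and H are assumed callables for the Riemannian metric and the subsampled Hessian operator, and tangent vectors are plain arrays; this is an illustrative reading of the step, not the paper's exact routine.

```python
import numpy as np

def cg_exact_line_search(inner, G, H, eta, p, sigma, delta=1.0):
    """Solve Eq. (24) by zeroing d/dalpha of m_hat(eta + alpha*p), which after
    squaring becomes a quartic polynomial in alpha."""
    Heta, Hp = H(eta), H(p)
    c0 = delta * inner(G, p) + inner(p, Heta)
    c1 = inner(p, Hp)
    A, B, C = inner(eta, p), inner(p, p), inner(eta, eta)

    # (c0 + c1*a)^2 = sigma^2 * (C + 2*A*a + B*a^2) * (A + B*a)^2
    lhs = np.array([c1**2, 2 * c0 * c1, c0**2])                   # degree-2 coefficients
    quad = np.array([B, 2 * A, C])                                # ||eta + a*p||^2
    lin2 = np.polymul(np.array([B, A]), np.array([B, A]))         # (A + B*a)^2
    rhs = sigma**2 * np.polymul(quad, lin2)                       # degree 4
    poly = np.polysub(rhs, np.pad(lhs, (len(rhs) - len(lhs), 0)))

    def m_hat(alpha):                    # model value (constant f(x_k) omitted)
        v = eta + alpha * p
        return (delta * inner(G, v) + 0.5 * inner(v, H(v))
                + sigma / 3.0 * inner(v, v) ** 1.5)

    best_alpha, best_val = 0.0, m_hat(0.0)
    for r in np.roots(poly):
        if abs(r.imag) < 1e-10 and r.real >= 0:
            a = r.real
            # keep only roots of the original (unsquared) stationarity equation
            res = c0 + c1 * a + sigma * np.sqrt(C + 2*A*a + B*a*a) * (A + B*a)
            if abs(res) < 1e-6 * (1 + abs(c0)) and m_hat(a) < best_val:
                best_alpha, best_val = a, m_hat(a)
    return best_alpha
```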

Algorithm 3 Subproblem Solver by Non-linear Conjugate Gradient
Input: subproblem \hat{m}_{k}(\bm{\eta}), \mathbf{G}_{k}, m, \kappa,\theta>0, \kappa_{\theta}\in(0,1/6].
1:  \bm{\eta}_{k}^{0}=\mathbf{0}_{\mathbf{x}_{k}}, \mathbf{r}_{0}=\mathbf{G}_{k}, \mathbf{p}_{1}=-\mathbf{r}_{0}, \mathbf{x}_{k}^{0}=\mathbf{x}_{k}
2:  for i=1,2,\ldots,m do
3:     Solve Eq. (24) by zeroing its derivative and solving the resulting polynomial equation
4:     if \alpha_{i}^{*}\leq 10^{-10} then
5:        Return \bm{\eta}_{k}^{*}=\bm{\eta}_{k}^{i-1}
6:     end if
7:     \bm{\eta}_{k}^{i}=\bm{\eta}_{k}^{i-1}+\alpha_{i}^{*}\mathbf{p}_{i}
8:     if Eq. (47) is satisfied then
9:        Return \bm{\eta}_{k}^{*}=\bm{\eta}_{k}^{i}
10:     end if
11:     \mathbf{r}_{i}=\mathbf{r}_{i-1}+\alpha_{i}^{*}\mathbf{H}_{k}^{i-1}[\mathbf{p}_{i}]
12:     \mathbf{x}_{k}^{i}=R_{\mathbf{x}_{k}^{i-1}}(\alpha_{i}^{*}\mathbf{p}_{i})
13:     if Eq. (31) is met then
14:        Return \bm{\eta}_{k}^{*}=\bm{\eta}_{k}^{i}
15:     end if
16:     Compute \beta_{i} by Eq. (28)
17:     \mathbf{p}_{i+1}=-\mathbf{r}_{i}+\beta_{i}\mathcal{P}_{\alpha_{i}^{*}\mathbf{p}_{i}}\mathbf{p}_{i}
18:  end for
Output: \bm{\eta}_{k}^{*}=\bm{\eta}_{k}^{m}

Convergence of a conjugate gradient method largely depends on how its conjugate direction is updated. This is controlled by the setting of \beta_{i} for calculating \mathbf{p}_{i+1}=-\mathbf{r}_{i}+\beta_{i}\mathcal{P}_{\alpha_{i}^{*}\mathbf{p}_{i}}\mathbf{p}_{i} in Step (17) of Algorithm 3. Working in Riemannian spaces under the subsampling setting, it has been proven by sakai2021sufficient that, when the Fletcher-Reeves formula fletcher1964function is used, i.e.,

\beta_{i}=\frac{\left\|\mathbf{G}_{k}^{i}\right\|_{\mathbf{x}_{k}^{i}}^{2}}{\left\|\mathbf{G}_{k}^{i-1}\right\|_{\mathbf{x}_{k}^{i-1}}^{2}}, (25)

where

\mathbf{G}_{k}^{i}=\frac{1}{|\mathcal{S}_{g}|}\sum_{j\in\mathcal{S}_{g}}{\rm grad}f_{j}\left(\mathbf{x}_{k}^{i}\right), (26)

a conjugate gradient method can converge to a stationary point with \lim_{i\to\infty}\left\|{\rm grad}f\left(\mathbf{x}_{k}^{i}\right)\right\|_{\mathbf{x}_{k}^{i}}=0. Working in Euclidean spaces, wei2006convergence has shown that the Polak–Ribiere–Polyak formula, i.e.,

\beta_{i}=\frac{\left\langle\nabla f\left(\mathbf{x}_{k}^{i}\right),\nabla f\left(\mathbf{x}_{k}^{i}\right)-\nabla f\left(\mathbf{x}_{k}^{i-1}\right)\right\rangle}{\left\|\nabla f\left(\mathbf{x}_{k}^{i-1}\right)\right\|^{2}}, (27)

performs better than the Fletcher-Reeves formula. Building upon these, we propose to compute \beta_{i} by a modified Polak–Ribiere–Polyak formula in Riemannian spaces in Step (16) of Algorithm 3, given as

\beta_{i}=\frac{\left\langle\mathbf{r}_{i},\mathbf{r}_{i}-\frac{\left\|\mathbf{r}_{i}\right\|_{\mathbf{x}_{k}^{i}}}{\left\|\mathbf{r}_{i-1}\right\|_{\mathbf{x}_{k}^{i-1}}}\mathcal{P}_{\alpha_{i}^{*}\mathbf{p}_{i}}\mathbf{r}_{i-1}\right\rangle_{\mathbf{x}_{k}^{i}}}{2\left\langle\mathbf{r}_{i-1},\mathbf{r}_{i-1}\right\rangle_{\mathbf{x}_{k}^{i-1}}}. (28)
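As an illustration, Eq. (28) can be evaluated as below once \mathbf{r}_{i-1} has been transported to the tangent space of the new iterate; the sketch assumes, as in the Grassmann example of Section 2, a point-independent (Euclidean) metric inner, so the inner products at \mathbf{x}_{k}^{i} and \mathbf{x}_{k}^{i-1} use the same callable.

```python
def modified_prp_beta(inner, r_curr, r_prev, r_prev_transported):
    """beta_i of Eq. (28), a Riemannian Polak-Ribiere-Polyak-type coefficient;
    r_prev_transported is P_{alpha* p_i}(r_{i-1}) carried to the new tangent space."""
    scale = (inner(r_curr, r_curr) ** 0.5) / (inner(r_prev, r_prev) ** 0.5)
    num = inner(r_curr, r_curr - scale * r_prev_transported)
    den = 2.0 * inner(r_prev, r_prev)
    return num / den
```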

We prove that the resulting algorithm converges to a stationary point, and present the convergence result in Theorem 1, with its proof deferred to Appendix C.

Theorem 1 (Convergence of the Conjugate Gradient Solver)

Assume that the step size \alpha_{i}^{*} in Algorithm 3 satisfies the strong Wolfe conditions hosseini2018line , i.e., given a smooth function f:\mathcal{M}\to\mathbb{R}, it holds that

f\left(R_{\mathbf{x}_{k}^{i-1}}(\alpha_{i}^{*}\mathbf{p}_{i})\right)\leq f\left(\mathbf{x}_{k}^{i-1}\right)+c_{1}\alpha_{i}^{*}\left\langle\mathbf{G}_{k}^{i-1},\mathbf{p}_{i}\right\rangle_{\mathbf{x}_{k}^{i-1}}, (29)
\left|\left\langle\mathbf{G}_{k}^{i},\mathcal{P}_{\alpha_{i}^{*}\mathbf{p}_{i}}(\mathbf{p}_{i})\right\rangle_{\mathbf{x}_{k}^{i}}\right|\leq-c_{2}\left\langle\mathbf{G}_{k}^{i-1},\mathbf{p}_{i}\right\rangle_{\mathbf{x}_{k}^{i-1}}, (30)

with 0<c_{1}<c_{2}<1. When Step (16) of Algorithm 3 computes \beta_{i} by Eq. (28), Algorithm 3 converges to a stationary point, i.e., \lim_{i\to\infty}\left\|\mathbf{G}_{k}^{i}\right\|_{\mathbf{x}_{k}^{i}}=0.

In practice, Algorithm 3 terminates when there is no obvious change in the solution, which is examined in Step (4) by checking whether the step size is sufficiently small, i.e., whether \alpha_{i}^{*}\leq 10^{-10} (Section 9 in agarwal2021adaptive ). To improve the convergence rate, the algorithm also terminates when \mathbf{r}_{i} in Step (13) is sufficiently small, i.e., following a classical criterion absil2007trust , by checking whether

\left\|\mathbf{r}_{i}\right\|_{\mathbf{x}_{k}^{i}}\leq\left\|\mathbf{r}_{0}\right\|_{\mathbf{x}_{k}}\min\left(\left\|\mathbf{r}_{0}\right\|_{\mathbf{x}_{k}}^{\theta},\kappa\right), (31)

for some \theta,\kappa>0.

3.3.3 Properties of the Subproblem Solutions

In Algorithm 2, the basis \{\mathbf{q}_{i}\}_{i=1}^{D} is constructed successively starting from \mathbf{q}_{1}=\mathbf{G}_{k}/\left\|\mathbf{G}_{k}\right\|_{\mathbf{x}_{k}}, while the converted problem in Eq. (18) is solved for each \mathcal{K}_{l} starting from l=1. This process allows up to D inner iterations. The solution \bm{\eta}_{k}^{*} obtained in the last inner iteration, where l=D, is the global minimizer over \mathbb{R}^{D}. Differently, Algorithm 3 converges to a stationary point as proved in Theorem 1. In practice, a maximum inner iteration number m is set in advance. Algorithm 3 stops when it reaches the maximum iteration number or converges to a status where the change in either the solution or the conjugate direction is very small.

The convergence property of the main algorithm presented in Algorithm 1 relies on the quality of the subproblem solution. Before discussing it, we first familiarize the reader with the classical TR concepts of Cauchy steps and eigensteps, but defined for the Inexact RTR problem introduced in Section 2.2. According to Section 3.3 of boumal2019global , when \hat{m}_{k} is the RTR subproblem, the closed-form Cauchy step \hat{\bm{\eta}}_{k}^{C} is an improving direction defined by

\hat{\bm{\eta}}_{k}^{C}:=\min\left(\frac{\left\|\mathbf{G}_{k}\right\|_{\mathbf{x}_{k}}^{2}}{\langle\mathbf{G}_{k},\mathbf{H}_{k}[\mathbf{G}_{k}]\rangle_{\mathbf{x}_{k}}},\frac{\Delta_{k}}{\left\|\mathbf{G}_{k}\right\|_{\mathbf{x}_{k}}}\right)\mathbf{G}_{k}. (32)

It points towards the gradient direction with an optimal step size computed by the \min(\cdot,\cdot) operation, and follows the form of the general Cauchy step defined by

\bm{\eta}_{k}^{C}:=\arg\min_{\alpha\in\mathbb{R}}\left(\hat{m}_{k}(\alpha\mathbf{G}_{k})\right)\mathbf{G}_{k}. (33)

According to Section 2.2 of kasai2018inexact , for some \nu\in(0,1], the eigenstep \bm{\eta}_{k}^{E} satisfies

\left\langle\bm{\eta}_{k}^{E},\mathbf{H}_{k}\left[\bm{\eta}_{k}^{E}\right]\right\rangle_{\mathbf{x}_{k}}\leq\nu\lambda_{min}(\mathbf{H}_{k})\left\|\bm{\eta}_{k}^{E}\right\|^{2}_{\mathbf{x}_{k}}<0. (34)

It is an approximation of the negative curvature direction by an eigenvector associated with the smallest negative eigenvalue.

The following three assumptions on the subproblem solution are needed by the convergence analysis later. We define the induced norm for the Hessian as below:

\|\mathbf{H}_{k}\|_{\mathbf{x}_{k}}=\sup_{\bm{\eta}\in T_{\mathbf{x}_{k}}\mathcal{M},\,\|\bm{\eta}\|_{\mathbf{x}_{k}}\neq 0}\frac{\left\|\mathbf{H}_{k}[\bm{\eta}]\right\|_{\mathbf{x}_{k}}}{\left\|\bm{\eta}\right\|_{\mathbf{x}_{k}}}. (35)
Assumption 1 (Sufficient Descent Step)

Given the Cauchy step \bm{\eta}_{k}^{C} and the eigenstep \bm{\eta}_{k}^{E} for \nu\in(0,1], assume the subproblem solution \bm{\eta}_{k}^{*} satisfies the Cauchy condition

\hat{m}_{k}\left(\bm{\eta}_{k}^{*}\right)\leq\hat{m}_{k}\left(\bm{\eta}_{k}^{C}\right)\leq\hat{m}_{k}\left(\mathbf{0}_{\mathbf{x}_{k}}\right)-\max(a_{k},b_{k}), (36)

and the eigenstep condition

\hat{m}_{k}\left(\bm{\eta}_{k}^{*}\right)\leq\hat{m}_{k}\left(\bm{\eta}_{k}^{E}\right)\leq\hat{m}_{k}\left(\mathbf{0}_{\mathbf{x}_{k}}\right)-c_{k}, (37)

where

a_{k}=\frac{\|\mathbf{G}_{k}\|_{\mathbf{x}_{k}}}{2\sqrt{3}}\min\left(\frac{\|\mathbf{G}_{k}\|_{\mathbf{x}_{k}}}{\|\mathbf{H}_{k}\|_{\mathbf{x}_{k}}},\sqrt{\frac{\|\mathbf{G}_{k}\|_{\mathbf{x}_{k}}}{\sigma_{k}}}\right), (38)
b_{k}=\frac{\left\|\bm{\eta}_{k}^{C}\right\|_{\mathbf{x}_{k}}^{2}}{12}\left(\sqrt{\|\mathbf{H}_{k}\|_{\mathbf{x}_{k}}^{2}+4\sigma_{k}\|\mathbf{G}_{k}\|_{\mathbf{x}_{k}}}-\|\mathbf{H}_{k}\|_{\mathbf{x}_{k}}\right), (39)
c_{k}=\frac{\nu\left|\lambda_{min}(\mathbf{H}_{k})\right|}{6}\max\left(\left\|\bm{\eta}_{k}^{E}\right\|_{\mathbf{x}_{k}}^{2},\frac{\nu^{2}|\lambda_{min}(\mathbf{H}_{k})|^{2}}{\sigma_{k}^{2}}\right). (40)

The last two inequalities in Eqs. (36) and (37), concerning the Cauchy step and the eigenstep, are derived in Lemma 6 and Lemma 7 of xu2020newton . Assumption 1 generalizes Condition 3 in xu2020newton to the Riemannian case. It assumes that the subproblem solution \bm{\eta}_{k}^{*} is better than the Cauchy step and the eigenstep, decreasing the value of the subproblem objective function more. The following two assumptions enable a stronger convergence result for Algorithm 1, which will be used in the proof of Theorem 4.

Assumption 2 (Sub-model Gradient Norm cartis2011adaptive ; kohler2017sub )

Assume the subproblem solution \bm{\eta}_{k}^{*} satisfies

\left\|\nabla_{\bm{\eta}}\hat{m}_{k}\left(\bm{\eta}_{k}^{*}\right)\right\|_{\mathbf{x}_{k}}\leq\kappa_{\theta}\min\left(1,\left\|\bm{\eta}_{k}^{*}\right\|_{\mathbf{x}_{k}}\right)\|\mathbf{G}_{k}\|_{\mathbf{x}_{k}}, (41)

where \kappa_{\theta}\in(0,1/6].

Assumption 3 (Approximate Global Minimizer cartis2011adaptive ; yao2021inexact )

Assume the subproblem solution \bm{\eta}_{k}^{*} satisfies

\langle\mathbf{G}_{k},\bm{\eta}_{k}^{*}\rangle_{\mathbf{x}_{k}}+\langle\bm{\eta}_{k}^{*},\mathbf{H}_{k}[\bm{\eta}_{k}^{*}]\rangle_{\mathbf{x}_{k}}+\sigma_{k}\|\bm{\eta}_{k}^{*}\|_{\mathbf{x}_{k}}^{3}=0, (42)
\langle\bm{\eta}_{k}^{*},\mathbf{H}_{k}[\bm{\eta}_{k}^{*}]\rangle_{\mathbf{x}_{k}}+\sigma_{k}\|\bm{\eta}_{k}^{*}\|_{\mathbf{x}_{k}}^{3}\geq 0. (43)

Driven by these assumptions, we characterize the subproblem solutions and present the results in the following lemmas. Their proofs are deferred to Appendix D.

Lemma 2 (Lanczos Solution)

The subproblem solution obtained by Algorithm 2 when executed for DD (the dimension of \mathcal{M}) iterations satisfies Assumptions 1, 2 and 3. When executed for l<Dl<D iterations, the solution satisfies the Cauchy condition in Assumption 1, as well as Assumptions 2 and 3.

Lemma 3 (Conjugate Gradient Solution)

The subproblem solution obtained by Algorithm 3 satisfies the Cauchy condition in Assumption 1. Assuming m^k(𝛈)f(R𝐱k(𝛈))\hat{m}_{k}(\bm{\eta})\approx f(R_{\mathbf{x}_{k}}(\bm{\eta})), it also satisfies

𝜼m^k(𝜼k)𝐱k0,\left\|\nabla_{\bm{\eta}}\hat{m}_{k}\left(\bm{\eta}_{k}^{*}\right)\right\|_{\mathbf{x}_{k}}\approx 0, (44)

and approximately the first condition of Assumption 3, as

𝐆k,𝜼k𝐱k+𝜼k,𝐇k[𝜼k]𝐱k+σk𝜼k𝐱k30.\langle\mathbf{G}_{k},\bm{\eta}_{k}^{*}\rangle_{\mathbf{x}_{k}}+\langle\bm{\eta}_{k}^{*},\mathbf{H}_{k}[\bm{\eta}_{k}^{*}]\rangle_{\mathbf{x}_{k}}+\sigma_{k}\left\|\bm{\eta}_{k}^{*}\right\|_{\mathbf{x}_{k}}^{3}\approx 0. (45)

In practice, Algorithm 2 based on lower-dimensional Krylov subspaces with l<Dl<D returns a less accurate solution, while Algorithm 3 returns at most a local minimum. Neither is guaranteed to satisfy the eigenstep condition in Assumption 1. The early-returned solutions from Algorithm 2 still satisfy Assumptions 2 and 3. Solutions from Algorithm 3, however, do not satisfy these two assumptions exactly, but they can come close in an approximate manner. For instance, according to Lemma 3, 𝜼m^k(𝜼k)𝐱k0\left\|\nabla_{\bm{\eta}}\hat{m}_{k}\left(\bm{\eta}_{k}^{*}\right)\right\|_{\mathbf{x}_{k}}\approx 0, and we know that 0κθmin(1,𝜼k𝐱k)𝐆k𝐱k0\leq\kappa_{\theta}\min\left(1,\left\|\bm{\eta}_{k}^{*}\right\|_{\mathbf{x}_{k}}\right)\|\mathbf{G}_{k}\|_{\mathbf{x}_{k}}; thus, there is a fair chance for Eq. (41) in Assumption 2 to be met by the solution from Algorithm 3. Also, given that 𝜼k\bm{\eta}_{k}^{*} is a descent direction, we have 𝐆k,𝜼k𝐱k0\langle\mathbf{G}_{k},\bm{\eta}_{k}^{*}\rangle_{\mathbf{x}_{k}}\leq 0. Combining this with Eq. (45) in Lemma 3, there is a fair chance for Eq. (42) in Assumption 3 to be met. We present experimental results in Section 5.5, showing empirically to what extent the different solutions satisfy, or come close to, the eigenstep condition in Assumption 1, Assumption 2 and Assumption 3.
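As a rough numerical companion to this discussion (and to the measurements reported later in Section 5.5), a candidate solution can be checked against Eqs. (41)-(43) as in the sketch below. This is our own illustration: tangent vectors are stored as flat arrays, and the callables hess_vec and grad_m_hat, which supply the Hessian-vector product and the model gradient, are hypothetical placeholders.

```python
import numpy as np

def check_assumptions_2_and_3(eta, G, hess_vec, grad_m_hat, sigma,
                              kappa_theta=0.08, tol=1e-8):
    """Numerically check Eq. (41) and Eqs. (42)-(43) for a candidate eta.

    hess_vec(eta) returns H_k[eta]; grad_m_hat(eta) returns the gradient of
    the cubic model at eta. Tangent vectors are treated as flat arrays."""
    eta_norm = np.linalg.norm(eta)

    # Assumption 2, Eq. (41): sub-model gradient norm condition.
    a2 = (np.linalg.norm(grad_m_hat(eta))
          <= kappa_theta * min(1.0, eta_norm) * np.linalg.norm(G))

    # Assumption 3, Eqs. (42) and (43): approximate global minimizer.
    quad = float(eta @ hess_vec(eta))
    cubic = sigma * eta_norm ** 3
    a3_eq = abs(float(G @ eta) + quad + cubic) <= tol
    a3_ineq = quad + cubic >= -tol
    return a2, a3_eq, a3_ineq
```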

3.4 Practical Early Stopping

In practice, it is often more efficient to stop the optimization before the optimality conditions are met and obtain a reasonably good solution much faster. We employ a simple and practical early-stopping mechanism to accommodate this need. Algorithm 1 is allowed to terminate early when: (1) the norm of the approximate gradient fails to decrease for KK consecutive iterations, and (2) the relative function decrement is lower than a given threshold, i.e.,

f(𝐱k1)f(𝐱k)|f(𝐱k1)|τf,\frac{f(\mathbf{x}_{k-1})-f(\mathbf{x}_{k})}{|f(\mathbf{x}_{k-1})|}\leq\tau_{f}, (46)

for a consecutive number of KK times, with KK and τf>0\tau_{f}>0 being user-defined.
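A minimal sketch of this outer-loop early-stopping rule is given below. It covers only criterion (2) in Eq. (46), and the names step and f are hypothetical placeholders for one outer iteration of Algorithm 1 and the (possibly subsampled) cost evaluation.

```python
def run_with_early_stopping(step, f, x0, max_iters=1000, K=5, tau_f=1e-10):
    """Outer loop with the stopping rule of Eq. (46): terminate once the
    relative function decrement stays below tau_f for K consecutive
    iterations. `step` performs one outer iteration of the optimizer and
    `f` evaluates the cost; both are placeholders supplied by the caller."""
    x, f_prev = x0, f(x0)
    stall = 0
    for _ in range(max_iters):
        x = step(x)
        f_curr = f(x)
        if (f_prev - f_curr) / abs(f_prev) <= tau_f:
            stall += 1
            if stall >= K:          # K consecutive small decrements
                break
        else:
            stall = 0
        f_prev = f_curr
    return x
```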

For the subproblem, both Algorithms 2 and 3 are allowed to terminate when the current solution 𝜼k\bm{\eta}_{k}, i.e., 𝜼k=𝜼kl\bm{\eta}_{k}=\bm{\eta}_{k}^{*l} for Algorithm 2 and 𝜼k=𝜼ki\bm{\eta}_{k}=\bm{\eta}_{k}^{i} for Algorithm 3, satisfies

𝜼m^k(𝜼k)𝐱kκθmin(1,𝜼k𝐱k)𝐆k𝐱k.\left\|\nabla_{\bm{\eta}}\hat{m}_{k}\left(\bm{\eta}_{k}\right)\right\|_{\mathbf{x}_{k}}\leq\kappa_{\theta}\min\left(1,\left\|\bm{\eta}_{k}\right\|_{\mathbf{x}_{k}}\right)\|\mathbf{G}_{k}\|_{\mathbf{x}_{k}}. (47)

This implements a check of Assumption 2. Regarding Assumption 1, both Algorithms 2 and 3 optimize along the direction of the Cauchy step in their first iteration and thus satisfy the Cauchy condition. Therefore, there is no need to examine it. As for the eigenstep condition, it is costly to compute and compare with the eigenstep in each inner iteration, so we do not use it as a stopping criterion in practice. Regarding Assumption 3, according to Lemma 2, it is always satisfied by the solution from Algorithm 2. Therefore, there is no need to examine it in Algorithm 2. As for Algorithm 3, the examination by Eq. (47) also plays a role in checking Assumption 3. For the first condition in Assumption 3, Eq. (42) is equivalent to 𝜼m^k(𝜼k),𝜼k𝐱k=0\langle\nabla_{\bm{\eta}}\hat{m}_{k}\left(\bm{\eta}_{k}^{*}\right),\bm{\eta}_{k}^{*}\rangle_{\mathbf{x}_{k}}=0 (this results from Eq. (106) in Appendix D.2). In practice, when Eq. (47) is satisfied with a small value of 𝜼m^k(𝜼k)𝐱k\left\|\nabla_{\bm{\eta}}\hat{m}_{k}\left(\bm{\eta}_{k}^{*}\right)\right\|_{\mathbf{x}_{k}}, we have 𝜼m^k(𝜼k),𝜼k𝐱k0\left\langle\nabla_{\bm{\eta}}\hat{m}_{k}\left(\bm{\eta}_{k}^{*}\right),\bm{\eta}_{k}^{*}\right\rangle_{\mathbf{x}_{k}}\approx 0, indicating that the first condition of Assumption 3 is met approximately. Also, since 𝐆k,𝜼k𝐱k0\langle\mathbf{G}_{k},\bm{\eta}_{k}^{*}\rangle_{\mathbf{x}_{k}}\leq 0 because 𝜼k\bm{\eta}_{k}^{*} is a descent direction, the second condition of Assumption 3 has a fairly high chance of being met.

4 Convergence Analysis

4.1 Preliminaries

We start by listing the assumptions and conditions from the existing literature that are adopted to support our analysis. Given a function ff, the Hessian of its pullback 2f^(𝐱)[𝜼]\nabla^{2}\hat{f}\left(\mathbf{x}\right)[\mathbf{\bm{\eta}}] and its Riemannian Hessian Hessf(𝐱)[𝜼]{\rm Hess}f\left(\mathbf{x}\right)[\bm{\eta}] are identical when a second-order retraction is used boumal2019global , and we adopt this as an assumption to ease the analysis.

Assumption 4 (Second-order Retraction boumal2020introduction )

The retraction mapping is assumed to be a second-order retraction. That is, for all 𝐱\mathbf{x}\in\mathcal{M} and all 𝛈T𝐱\bm{\eta}\in T_{\mathbf{x}}\mathcal{M}, the curve γ(t):=R𝐱(t𝛈)\gamma(t):=R_{\mathbf{x}}(t\bm{\eta}) has zero acceleration at t=0t=0, i.e., γ′′(0)=𝒟2dt2R𝐱(t𝛈)|t=0=0\gamma^{\prime\prime}(0)=\frac{\mathcal{D}^{2}}{dt^{2}}R_{\mathbf{x}}(t\bm{\eta})\big{|}_{t=0}=0.

The following two assumptions originate from the assumptions required by the convergence analysis of the standard RTR algorithm boumal2019global ; ferreira2002kantorovich , and are adopted here to support the inexact analysis.

Assumption 5 (Restricted Lipschitz Hessian)

There exists LH>0L_{H}>0 such that for all 𝐱k\mathbf{x}_{k} generated by Algorithm 1 and all 𝛈kT𝐱k\bm{\eta}_{k}\in T_{\mathbf{x}_{k}}\mathcal{M}, f^k\hat{f}_{k} satisfies

|f^k(𝜼k)f(𝐱k)gradf(𝐱k),𝜼k𝐱k12𝜼k,2f^k(𝟎𝐱k)[𝜼k]𝐱k|LH6𝜼k𝐱k3,\left|\hat{f}_{k}(\bm{\eta}_{k})-f(\mathbf{x}_{k})-\langle{\rm grad}f(\mathbf{x}_{k}),\bm{\eta}_{k}\rangle_{\mathbf{x}_{k}}-\frac{1}{2}\left\langle\bm{\eta}_{k},\nabla^{2}\hat{f}_{k}(\mathbf{0}_{\mathbf{x}_{k}})[\bm{\eta}_{k}]\right\rangle_{\mathbf{x}_{k}}\right|\leq\frac{L_{H}}{6}\|\bm{\eta}_{k}\|_{\mathbf{x}_{k}}^{3}, (48)

and

𝒫𝜼k1(gradf^k(𝜼k))gradf(𝐱k)2f^k(𝟎𝐱k)[𝜼k]𝐱kLH2𝜼k𝐱k2,\left\|\mathcal{P}_{\bm{\eta}_{k}}^{-1}\left({\rm grad}\hat{f}_{k}(\bm{\eta}_{k})\right)-{\rm grad}f(\mathbf{x}_{k})-\nabla^{2}\hat{f}_{k}(\mathbf{0}_{\mathbf{x}_{k}})[\bm{\eta}_{k}]\right\|_{\mathbf{x}_{k}}\leq\frac{L_{H}}{2}\|\bm{\eta}_{k}\|_{\mathbf{x}_{k}}^{2}, (49)

where 𝒫1\mathcal{P}^{-1} denotes the inverse process of the parallel transport operator.

Assumption 6 (Norm Bound on Hessian)

For all 𝐱k\mathbf{x}_{k}, there exists KH0K_{H}\geq 0 so that the inexact Hessian 𝐇k\mathbf{H}_{k} satisfies

𝐇k𝐱kKH.\begin{split}\|\mathbf{H}_{k}\|_{\mathbf{x}_{k}}\leq K_{H}.\end{split} (50)

The following key conditions on the inexact gradient and Hessian approximations were developed in Euclidean spaces by roosta2019sub (Section 2.2) and xu2020newton (Section 1.3), respectively. We adapt them here to the Riemannian setting.

Condition 1 (Approximation Error Bounds)

For all 𝐱k\mathbf{x}_{k} and 𝛈kT𝐱k\bm{\eta}_{k}\in T_{\mathbf{x}_{k}}\mathcal{M}, suppose that there exist δg,δH>0\delta_{g},\delta_{H}>0, such that the approximate gradient and Hessian satisfy

𝐆kgradf(𝐱k)𝐱k\displaystyle\|\mathbf{G}_{k}-{\rm grad}f(\mathbf{x}_{k})\|_{\mathbf{x}_{k}} δg,\displaystyle\leq\delta_{g}, (51)
𝐇k[𝜼k]2f^k(𝟎𝐱k)[𝜼k]𝐱k\displaystyle\|\mathbf{H}_{k}[\bm{\eta}_{k}]-\nabla^{2}\hat{f}_{k}(\mathbf{0}_{\mathbf{x}_{k}})[\bm{\eta}_{k}]\|_{\mathbf{x}_{k}} δH𝜼k𝐱k.\displaystyle\leq\delta_{H}\|\bm{\eta}_{k}\|_{\mathbf{x}_{k}}. (52)

As will be shown in Theorem 2, these bounds allow the sampling sizes used in the gradient and Hessian approximations to be fixed throughout the training process. As a result, they guarantee algorithmic efficiency when dealing with large-scale problems.

4.2 Supporting Theorem and Assumption

In this section, we prove a preparation theorem and present new conditions required by our results. Below, we re-develop Theorem 4.1 in kasai2018inexact using the matrix Bernstein inequality gross2011recovering . It provides lower bounds on the required subsampling size for approximating the gradient and Hessian in order for Condition 1 to hold. The proof is provided in Appendix E.

Theorem 2 (Gradient and Hessian Sampling Size)

Define the suprema of the Riemannian gradient and Hessian

Kgmax:=\displaystyle K_{g_{\max}}:= maxi[n]sup𝐱gradfi(𝐱)𝐱,\displaystyle\max_{i\in[n]}\sup_{\mathbf{x}\in\mathcal{M}}\left\|{\rm grad}f_{i}(\mathbf{x})\right\|_{\mathbf{x}}, (53)
KHmax:=\displaystyle K_{H_{\max}}:= maxi[n]sup𝐱sup𝜼T𝐱𝜼𝐱0Hessfi(𝐱)[𝜼]𝐱𝜼𝐱.\displaystyle\max_{i\in[n]}\sup_{\mathbf{x}\in\mathcal{M}}\sup_{\begin{subarray}{c}\bm{\eta}\in T_{\mathbf{x}}\mathcal{M}\\ \|\bm{\eta}\|_{\mathbf{x}}\neq 0\end{subarray}}\frac{\left\|{\rm Hess}f_{i}(\mathbf{x})[\mathbf{\bm{\eta}}]\right\|_{\mathbf{x}}}{\left\|\bm{\eta}\right\|_{\mathbf{x}}}. (54)

Given 0<δ<10<\delta<1, Condition 1 is satisfied with probability at least (1δ)\left(1-\delta\right) if

|𝒮g|\displaystyle|\mathcal{S}_{g}| 8(Kgmax2+Kgmax)ln(d+rδ)δg2,\displaystyle\geq\frac{8\left(K_{g_{max}}^{2}+K_{g_{max}}\right)\ln\left(\frac{d+r}{\delta}\right)}{\delta_{g}^{2}}, (55)
|𝒮H|\displaystyle|\mathcal{S}_{H}| 8(KHmax2+KHmax𝜼𝐱)ln(d+rδ)δH2.\displaystyle\geq\frac{8\left(K_{H_{max}}^{2}+\frac{K_{H_{max}}}{\left\|\bm{\eta}\right\|_{\mathbf{x}}}\right)\ln\left(\frac{d+r}{\delta}\right)}{\delta_{H}^{2}}. (56)

where |𝒮g||\mathcal{S}_{g}| and |𝒮H||\mathcal{S}_{H}| denote the sampling sizes, while dd and rr are the dimensions of 𝐱\mathbf{x}.

The two quantities δg\delta_{g} and δH\delta_{H} in Condition 1 are the upper bounds of the gradient and Hessian approximation errors, respectively. The following assumption bounds δg\delta_{g} and δH\delta_{H}.

Assumption 7 (Restrictions on δg\delta_{g} and δH\delta_{H})

Given ν(0,1]\nu\in(0,1], KH0K_{H}\geq 0, LH>0L_{H}>0, 0<τ,ϵg<10<\tau,\epsilon_{g}<1, we assume that δg\delta_{g} and δH\delta_{H} satisfy

δg\displaystyle\delta_{g}\leq\; (1τ)(KH2+4LHϵgKH)248LH,\displaystyle\frac{(1-\tau)\left(\sqrt{K_{H}^{2}+4L_{H}\epsilon_{g}}-K_{H}\right)^{2}}{48L_{H}}, (57)
δH\displaystyle\delta_{H}\leq\; min(1τ12(KH2+4LHϵgKH),1τ3νϵH).\displaystyle\min\Bigg{(}\frac{1-\tau}{12}\left(\sqrt{K_{H}^{2}+4L_{H}\epsilon_{g}}-K_{H}\right),\frac{1-\tau}{3}\nu\epsilon_{H}\Bigg{)}. (58)

As seen in Eqs. (55) and (56), the required sampling sizes |𝒮g||\mathcal{S}_{g}| and |𝒮H||\mathcal{S}_{H}| grow with the probability (1δ)(1-\delta) and are inversely proportional to the squared error tolerances δg2\delta_{g}^{2} and δH2\delta_{H}^{2}, respectively. Hence, a higher (1δ)(1-\delta) and smaller δg\delta_{g} and δH\delta_{H} (affected by KHK_{H} and LHL_{H}) require larger |𝒮g||\mathcal{S}_{g}| and |𝒮H||\mathcal{S}_{H}| for estimating the inexact Riemannian gradient and Hessian.
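For concreteness, the bounds in Eqs. (55)-(58) can be evaluated numerically as in the sketch below (our own illustration; all constants, including the suprema from Theorem 2 and the norm of the current update, are assumed to be supplied by the caller). It first caps δg\delta_{g} and δH\delta_{H} via Assumption 7 and then returns the corresponding sampling-size lower bounds.

```python
import numpy as np

def sampling_size_bounds(Kg_max, KH_max, eta_norm, d, r, delta,
                         K_H, L_H, eps_g, eps_H, tau, nu=1.0):
    """Cap (delta_g, delta_H) via Eqs. (57)-(58), then return the sampling
    size lower bounds of Eqs. (55)-(56)."""
    root = np.sqrt(K_H ** 2 + 4.0 * L_H * eps_g) - K_H
    delta_g = (1.0 - tau) * root ** 2 / (48.0 * L_H)              # Eq. (57)
    delta_H = min((1.0 - tau) / 12.0 * root,
                  (1.0 - tau) / 3.0 * nu * eps_H)                 # Eq. (58)

    log_term = np.log((d + r) / delta)
    S_g = 8.0 * (Kg_max ** 2 + Kg_max) * log_term / delta_g ** 2  # Eq. (55)
    S_H = 8.0 * (KH_max ** 2 + KH_max / eta_norm) * log_term / delta_H ** 2  # Eq. (56)
    return delta_g, delta_H, int(np.ceil(S_g)), int(np.ceil(S_H))
```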

4.3 Main Results

Now we are ready to present our main convergence results in two main theorems for Algorithm 1. Different from sun2019escaping which explores the escape rate from a saddle point to a local minimum, we study the convergence rate from a random point.

Theorem 3 (Convergence Complexity of Algorithm 1)

Consider 0<ϵg,ϵH<10<\epsilon_{g},\epsilon_{H}<1 and δg,δH>0\delta_{g},\delta_{H}>0. Suppose that Assumptions 4, 5, 6 and 7 hold and the solution of the subproblem in Eq. (10) satisfies Assumption 1. Then, if the inexact gradient 𝐆k\mathbf{G}_{k} and Hessian 𝐇k\mathbf{H}_{k} satisfy Condition 1, Algorithm 1 returns an (ϵg,ϵH)\left(\epsilon_{g},\epsilon_{H}\right)-optimal solution in 𝒪(max(ϵg2,ϵH3))\mathcal{O}\left(\max\left(\epsilon_{g}^{-2},\epsilon_{H}^{-3}\right)\right) iterations.

The proof along with the supporting lemmas is provided in Appendices F.1 and F.2. When the Hessian at the solution is close to positive semi-definite, which indicates a small ϵH\epsilon_{H}, the Inexact Sub-RN-CR finds an (ϵg,ϵH)\left(\epsilon_{g},\epsilon_{H}\right)-optimal solution in fewer iterations than the Inexact RTR, i.e., 𝒪(max(ϵg2,ϵH3))\mathcal{O}\left(\max\left(\epsilon_{g}^{-2},\epsilon_{H}^{-3}\right)\right) iterations for the Inexact Sub-RN-CR as compared to 𝒪(max(ϵg2ϵH1,ϵH3))\mathcal{O}\left(\max\left(\epsilon_{g}^{-2}\epsilon_{H}^{-1},\epsilon_{H}^{-3}\right)\right) for the Inexact RTR kasai2018inexact , which is a favourable improvement. Combining Theorems 2 and 3 leads to the following corollary.

Corollary 1

Consider 0<ϵg,ϵH<10<\epsilon_{g},\epsilon_{H}<1 and δg,δH>0\delta_{g},\delta_{H}>0. Suppose that Assumptions 4, 5, 6 and 7 hold and the solution of the subproblem in Eq. (10) satisfies Assumption 1. For any 0<δ<10<\delta<1, suppose Eqs. (55) and (56) are satisfied at each iteration. Then, Algorithm 1 returns an (ϵg,ϵH)\left(\epsilon_{g},\epsilon_{H}\right)-optimal solution in 𝒪(max(ϵg2,ϵH3))\mathcal{O}\left(\max\left(\epsilon_{g}^{-2},\epsilon_{H}^{-3}\right)\right) iterations with a probability at least p=(1δ)𝒪(max(ϵg2,ϵH3))p=(1-\delta)^{\mathcal{O}\left(\max\left(\epsilon_{g}^{-2},\epsilon_{H}^{-3}\right)\right)}.

The proof is provided in Appendix F.3. We use an example to illustrate the effect of δ\delta on the approximate gradient sample size |𝒮g||\mathcal{S}_{g}|. Suppose ϵg2>ϵH3\epsilon^{-2}_{g}>\epsilon^{-3}_{H}, then p=(1δ)𝒪(ϵg2)p=(1-\delta)^{\mathcal{O}\left(\epsilon_{g}^{-2}\right)}. In addition, when δ=𝒪(ϵg2/10)\delta=\mathcal{O}\left(\epsilon^{2}_{g}/10\right), p0.9p\approx 0.9. Replacing δ\delta with 𝒪(ϵg2/10)\mathcal{O}\left(\epsilon^{2}_{g}/10\right) in Eqs. (55) and (56), it can be obtained that the lower bound of |𝒮g||\mathcal{S}_{g}| is proportional to ln(10ϵg2)\ln\left(10\epsilon^{-2}_{g}\right).
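This back-of-the-envelope calculation can be reproduced in a few lines; the constant hidden inside 𝒪(ϵg2)\mathcal{O}\left(\epsilon_{g}^{-2}\right) is set to one here purely for illustration.

```python
eps_g = 1e-2                       # gradient tolerance
n_iters = round(eps_g ** -2)       # O(eps_g^{-2}) iterations, constant set to 1
delta = eps_g ** 2 / 10            # delta = O(eps_g^2 / 10)
p = (1 - delta) ** n_iters         # success probability over all iterations
print(f"iterations={n_iters}, delta={delta:.1e}, p={p:.3f}")   # p is roughly 0.9
```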

Combining Assumption 3 and the stopping condition in Eq. (47) for the inexact solver, a stronger convergence result can be obtained for Algorithm 1, which is presented in the following theorem and corollary.

Theorem 4 (Optimal Convergence Complexity of Algorithm 1)

Consider 0<ϵg,ϵH<10<\epsilon_{g},\epsilon_{H}<1 and δg,δH>0\delta_{g},\delta_{H}>0. Suppose that Assumptions 4, 5, 6 and 7 hold and the solution of the subproblem satisfies Assumptions 1, 2 and 3. Then, if the inexact gradient 𝐆k\mathbf{G}_{k} and Hessian 𝐇k\mathbf{H}_{k} satisfy Condition 1 and δgδHκθϵg\delta_{g}\leq\delta_{H}\leq\kappa_{\theta}\epsilon_{g}, Algorithm 1 returns an (ϵg,ϵH)\left(\epsilon_{g},\epsilon_{H}\right)-optimal solution in 𝒪(max(ϵg32,ϵH3))\mathcal{O}\left(\max\left(\epsilon_{g}^{-\frac{3}{2}},\epsilon_{H}^{-3}\right)\right) iterations.

Corollary 2

Consider 0<ϵg,ϵH<10<\epsilon_{g},\epsilon_{H}<1 and δg,δH>0\delta_{g},\delta_{H}>0. Suppose that Assumptions 4, 5, 6 and 7 hold, and the solution of the subproblem satisfies Assumptions 1, 2 and 3. For any 0<δ<10<\delta<1, suppose Eqs. (55) and (56) are satisfied at each iteration. Then, Algorithm 1 returns an (ϵg,ϵH)\left(\epsilon_{g},\epsilon_{H}\right)-optimal solution in 𝒪(max(ϵg32,ϵH3))\mathcal{O}\left(\max\left(\epsilon_{g}^{-\frac{3}{2}},\epsilon_{H}^{-3}\right)\right) iterations with a probability at least p=(1δ)𝒪(max(ϵg32,ϵH3))p=(1-\delta)^{\mathcal{O}\left(\max\left(\epsilon_{g}^{-\frac{3}{2}},\epsilon_{H}^{-3}\right)\right)}.

The proof of Theorem 4 along with its supporting lemmas is provided in Appendices G.1 and G.2. The proof of Corollary 2 follows Corollary 1 and is provided in Appendix G.3.

4.4 Computational Complexity Analysis

We analyse the number of main operations required by the proposed algorithm. Taking the PCA task as an example, it optimizes over the Grassmann manifold Gr(r,d){\rm Gr}\left(r,d\right). Denote by mm the number of inexact iterations and DD the manifold dimension, i.e., D=d×(dr)D=d\times(d-r) for the Grassmann manifold. Starting from the gradient and Hessian computation, the full case requires 𝒪(ndr)\mathcal{O}(ndr) operations for both in the PCA task. By using the subsampling technique, these can be reduced to 𝒪(|𝒮g|dr)\mathcal{O}(|\mathcal{S}_{g}|dr) and 𝒪(|𝒮H|dr)\mathcal{O}(|\mathcal{S}_{H}|dr) by gradient and Hessian approximation. Following an existing setup for cost computation, i.e., Inexact RTR method kasai2018inexact , the full function cost evaluation takes nn operations, while the approximate cost evaluation after subsampling becomes 𝒪(|𝒮n|dr)\mathcal{O}(|\mathcal{S}_{n}|dr), where 𝒮n\mathcal{S}_{n} is the subsampled set of data points used to compute the function cost. These show that, for large-scale practices with nmax(|𝒮g|,|𝒮H|,|𝒮n|)n\gg\max\left(|\mathcal{S}_{g}|,|\mathcal{S}_{H}|,|\mathcal{S}_{n}|\right), the computational cost reduction gained from the subsampling technique is significant.

For the subproblem solver by Algorithm 2 or 3, the dominant computation within each iteration is the Hessian computation, which as mentioned above requires 𝒪(|𝒮H|dr)\mathcal{O}(|\mathcal{S}_{H}|dr) operations after using the subsampling technique. Taking this into account to analyze Algorithm 1, its overall computational complexity becomes 𝒪(max(ϵg2,ϵH3))×[𝒪(n+|𝒮g|dr)+𝒪(m|𝒮H|d2(dr)r)]\mathcal{O}\left(\max\left(\epsilon_{g}^{-2},\epsilon_{H}^{-3}\right)\right)\times\left[\mathcal{O}(n+|\mathcal{S}_{g}|dr)+\mathcal{O}\left(m|\mathcal{S}_{H}|d^{2}(d-r)r\right)\right] based on Theorem 3, where 𝒪(n+|𝒮g|dr)\mathcal{O}(n+|\mathcal{S}_{g}|dr) corresponds to the operations for computing the full function cost and the approximate gradient in an outer iteration. This overall complexity can be simplified to 𝒪(max(ϵg2,ϵH3))×𝒪(n+|𝒮g|dr+m|𝒮H|d2(dr)r)\mathcal{O}\left(\max\left(\epsilon_{g}^{-2},\epsilon_{H}^{-3}\right)\right)\times\mathcal{O}(n+|\mathcal{S}_{g}|dr+m|\mathcal{S}_{H}|d^{2}(d-r)r), where 𝒪(m|𝒮H|d2(dr)r)\mathcal{O}(m|\mathcal{S}_{H}|d^{2}(d-r)r) is the cost of the subproblem solver by either Algorithm 2 or Algorithm 3. Algorithm 2 is guaranteed to return the optimal subproblem solution within at most m=D=d×(dr)m=D=d\times(d-r) inner iterations, of which the complexity is at most 𝒪(|𝒮H|d2(dr)2r2)\mathcal{O}(|\mathcal{S}_{H}|d^{2}(d-r)^{2}r^{2}). Such a polynomial complexity is at least as good as the ST-tCG solver used in the Inexact RTR algorithm. For Algorithm 3, although mm is not guaranteed to be bounded and polynomial, we have empirically observed that mm is generally smaller than DD in practice, presenting a similar complexity to Algorithm 2.
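As a rough illustration of these per-iteration costs, the sketch below tabulates the dominant operation counts for given problem and batch sizes, dropping constant factors and counting each inner iteration as one subsampled Hessian-vector product; the chosen value of mm is hypothetical.

```python
def per_outer_iteration_cost(n, d, r, S_g, S_H, m):
    """Rough operation counts per outer iteration (constants dropped), with
    each inner iteration counted as one subsampled Hessian-vector product."""
    costs = {
        "full cost evaluation": n,
        "subsampled gradient": S_g * d * r,
        "subproblem solver (m inner iterations)": m * S_H * d * r,
    }
    costs["total per outer iteration"] = sum(costs.values())
    return costs

# Illustrative values roughly matching the P1 setting; m = 50 is hypothetical.
print(per_outer_iteration_cost(n=5 * 10**5, d=10**3, r=5,
                               S_g=5 * 10**5, S_H=5 * 10**3, m=50))
```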

5 Experiments and Result Analysis

Figure 2: Performance comparison by optimality gap for the PCA task. (a) Synthetic dataset P1; (b) MNIST dataset; (c) Covertype dataset.
Figure 3: PCA performance summary for the participating methods.

We compare the proposed Inexact Sub-RN-CR algorithm with state-of-the-art and popular Riemannian optimization algorithms. These include the Riemannian stochastic gradient descent (RSGD) bonnabel2013stochastic , Riemannian steepest descent (RSD) absil2009optimization , Riemannian conjugate gradient (RCG) absil2009optimization , Riemannian limited memory BFGS algorithm (RLBFGS) yuan2016riemannian , Riemannian stochastic variance-reduced gradient (RSVRG) zhang2016riemannian , Riemannian stochastic recursive gradient (RSRG) kasai2018riemannian , RTR absil2007trust , Inexact RTR kasai2018inexact and RTRMC boumal2011rtrmc . Existing implementations of these algorithms are available in either Manopt boumal2014manopt or Pymanopt townsend2016pymanopt library. They are often used for algorithm comparison in existing literature, e.g., by Inexact RTR kasai2018inexact . Particularly, RSGD, RSD, RCG, RLBFGS, RTR and RTRMC algorithms have been encapsulated into Manopt, and RSD, RCG and RTR also into Pymanopt. RSVRG, RSRG and Inexact RTR are implemented by kasai2018inexact based on Manopt. We use existing implementations to reproduce their methods. Our Inexact Sub-RN-CR implementation builds on Manopt.

For the competing methods, we follow the same parameter settings from the existing implementations, including the batch size (i.e. sampling size), step size (i.e. learning rate) and the inner iteration number to ensure the same results as the reported ones. For our method, we first keep the common algorithm parameters the same as the competing methods, including γ\gamma, τ\tau, ϵg\epsilon_{g} and ϵH\epsilon_{H}. Then, we use a grid search to find appropriate values of θ\theta and κθ\kappa_{\theta} for both Algorithms 2 and 3. Specifically, the searching grid for θ\theta is (0.02,0.05,0.1,0.2,0.5,1)(0.02,0.05,0.1,0.2,0.5,1), and the searching grid for κθ\kappa_{\theta} is (0.005,0.01,0.02,0.04,0.08,0.16)(0.005,0.01,0.02,0.04,0.08,0.16). For the parameter κ\kappa in Algorithm 3, we keep it the same as the other conjugate gradient solvers. The early stopping approach as described in Section 3.4 is applied to all the compared algorithms.

Regarding the batch setting, which is also the sample size setting for approximating the gradient and Hessian, we adopt the same value as used in existing subsampling implementations to keep consistency. Also, the same settings are used for both the PCA and matrix completion tasks. Specifically, the batch size |𝒮g|=n/100\left|\mathcal{S}_{g}\right|=n/100 is used for RSGD, RSVRG and RSRG where 𝒮H\mathcal{S}_{H} is not considered as these are first-order methods. For both the Inexact RTR and the proposed Inexact Sub-RN-CR, |𝒮H|=n/100\left|\mathcal{S}_{H}\right|=n/100 and |𝒮g|=n\left|\mathcal{S}_{g}\right|=n is used. This is to follow the existing setting in kasai2018inexact for benchmark purposes, which exploits the approximate Hessian but the full gradient. In addition to these, we experiment with another batch setting of {|𝒮H|=n/100,|𝒮g|=n/10}\left\{\left|\mathcal{S}_{H}\right|=n/100,\left|\mathcal{S}_{g}\right|=n/10\right\} for both the Inexact RTR and Inexact Sub-RN-CR. This is flagged by (G)(G) in the algorithm name, meaning that the algorithm uses the approximate gradient in addition to the approximate Hessian. Its purpose is to evaluate the effect of 𝒮g\mathcal{S}_{g} in the optimization.

Evaluation is conducted based on two machine learning tasks of PCA and low-rank matrix completion using both synthetic and real-world datasets with nd1n\gg d\gg 1. Both tasks can be formulated as non-convex optimization problems on the Grassmann manifold Gr(r,d)\left(r,d\right). The algorithm performance is evaluated by oracle calls and the run time. The former counts the number of function, gradient, and Hessian-vector product computations. For instance, Algorithm 1 requires n+|𝒮g|+m|𝒮H|n+|\mathcal{S}_{g}|+m|\mathcal{S}_{H}| oracle calls each iteration, where mm is the number of iterations of the subproblem solver. Regarding the user-defined parameters in Algorithm 1, we use ϵσ=1018\epsilon_{\sigma}=10^{-18}. Empirical observations suggest that the magnitude of the data entries affects the optimization in its early stage, and hence these factors are taken into account in the setting of σ0\sigma_{0}. Let 𝐒=[sij]\mathbf{S}=[s_{ij}] denote the input data matrix containing LL rows and HH columns. We compute σ0\sigma_{0} by considering the data dimension, also the averaged data magnitude normalized by its standard deviation, given as

σ0=(i[L],j[H]|sij|LH)2(dim(M)d1LHi[L],j[H](sijμS)2)12,\sigma_{0}=\left(\sum_{i\in[L],j\in[H]}\frac{|s_{ij}|}{LH}\right)^{2}\left(\frac{dim(M)*d}{\sqrt{\frac{1}{LH}\sum_{i\in[L],j\in[H]}(s_{ij}-\mu_{S})^{2}}}\right)^{\frac{1}{2}}, (59)

where μS=1LHi[L],j[H]si,j\mu_{S}=\frac{1}{LH}\sum_{i\in[L],j\in[H]}s_{i,j} and dim(M)dim(M) is the manifold dimension.
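A direct transcription of Eq. (59) is sketched below for reference; the manifold dimension dim(M)dim(M) is passed in by the caller.

```python
import numpy as np

def sigma0_init(S, dim_M, d):
    """Initial regularization parameter following Eq. (59); S is the L-by-H
    input data matrix and dim_M denotes the manifold dimension dim(M)."""
    mean_abs = np.abs(S).mean()          # averaged magnitude of the entries
    std = S.std()                        # standard deviation of the entries
    return mean_abs ** 2 * (dim_M * d / std) ** 0.5
```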

Regarding the early stopping setting in Eq. (46), K=5K=5 is used for both tasks, and we use τf=1012\tau_{f}=10^{-12} for MNIST and τf=1010\tau_{f}=10^{-10} for the remaining datasets in the PCA task. In the matrix completion task, we set τf=1010\tau_{f}=10^{-10} for the synthetic datasets and τf=103\tau_{f}=10^{-3} for the real-world datasets. For the early stopping settings in Eq. (47) and in Step (13) of Algorithm 3, we adopt κθ=0.08\kappa_{\theta}=0.08 and θ=0.1\theta=0.1. Between the two subproblem solvers, we observe that Algorithm 2 by the Lanczos method and Algorithm 3 by the conjugate gradient perform similarly. Therefore, we report the main results using Algorithm 2, and provide supplementary results for Algorithm 3 in a separate Section 5.4.

Figure 4: Additional comparisons for the PCA task. (a) P1, no early stopping; (b) P1, varying |𝒮g||\mathcal{S}_{g}| settings; (c) P1, varying |𝒮H||\mathcal{S}_{H}| settings; (d) MNIST, varying |𝒮H||\mathcal{S}_{H}| settings.
Figure 5: Performance comparison by MSE for the matrix completion task. (a) Synthetic Dataset M1; (b) Synthetic Dataset M2; (c) Synthetic Dataset M3; (d) Jester Dataset.

5.1 PCA Experiments

PCA can be interpreted as a minimization of the sum of squared residual errors between the projected and the original data points, formulated as

min𝐔Gr(r,d)1ni=1n𝐳i𝐔𝐔𝐓𝐳i22,\min_{\mathbf{U}\in{\rm Gr}\left(r,d\right)}\frac{1}{n}\sum_{i=1}^{n}\left\|\mathbf{z}_{i}-\mathbf{UU^{T}}\mathbf{z}_{i}\right\|_{2}^{2}, (60)

where 𝐳id\mathbf{z}_{i}\in\mathbb{R}^{d}. The objective function can be re-expressed as one on the Grassmann manifold via

min𝐔Gr(r,d)1ni=1n𝐳iT𝐔𝐔𝐓𝐳i.\min_{\mathbf{U}\in{\rm Gr}\left(r,d\right)}-\frac{1}{n}\sum_{i=1}^{n}\mathbf{z}_{i}^{T}\mathbf{UU^{T}}\mathbf{z}_{i}. (61)

One synthetic dataset P1 and two real-world datasets including MNIST lecun1998gradient and Covertype blackard1999comparative are used in the evaluation. The P1 dataset is first generated by randomly sampling each element of a matrix 𝐀n×d\mathbf{A}\in\mathbb{R}^{n\times d} from a normal distribution 𝒩(0,1)\mathcal{N}(0,1). This is then followed by a multiplication with a diagonal matrix 𝐒d×d\mathbf{S}\in\mathbb{R}^{d\times d} with each diagonal element randomly sampled from an exponential distribution Exp(2)\textmd{Exp}(2), which increases the difference between the feature variances. After that, a mean-subtraction preprocessing is applied to 𝐀𝐒\mathbf{A}\mathbf{S} to obtain the final P1 dataset. The (n,d,r)\left(n,d,r\right) values are: (5×105,103,5)\left(5\times 10^{5},10^{3},5\right) for P1, (6×104,784,10)\left(6\times 10^{4},784,10\right) for MNIST, and (581012,54,10)\left(581012,54,10\right) for Covertype. Algorithm accuracy is assessed by the optimality gap, defined as the absolute difference |ff||f-f^{*}|, where f=1ni=1n𝐳iT𝐔^𝐔^T𝐳if=-\frac{1}{n}\sum_{i=1}^{n}\mathbf{z}_{i}^{T}\hat{\mathbf{U}}\hat{\mathbf{U}}^{T}\mathbf{z}_{i} with 𝐔^\hat{\mathbf{U}} as the optimal solution returned by Algorithm 1. The optimal function value f=1ni=1n𝐳iT𝐔~𝐔~T𝐳if^{*}=-\frac{1}{n}\sum_{i=1}^{n}\mathbf{z}_{i}^{T}\tilde{\mathbf{U}}\tilde{\mathbf{U}}^{T}\mathbf{z}_{i} is computed by using the eigen-decomposition solution 𝐔~\tilde{\mathbf{U}}, which is the classical way to obtain the PCA result without going through an optimization procedure.
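For reference, the sketch below evaluates the (optionally subsampled) cost of Eq. (61) and its Riemannian gradient, obtained by projecting the Euclidean gradient onto the tangent space of the Grassmann manifold; this is a plain NumPy illustration rather than the Manopt-based implementation used in our experiments.

```python
import numpy as np

def pca_cost_and_rgrad(U, Z, idx=None):
    """Cost of Eq. (61) and its Riemannian gradient on Gr(r, d).

    U : d-by-r matrix with orthonormal columns (a point on the manifold).
    Z : d-by-n data matrix; idx optionally selects a column subsample S_g."""
    Zs = Z if idx is None else Z[:, idx]
    n_s = Zs.shape[1]
    P = U.T @ Zs                            # r-by-n_s projected data
    cost = -np.sum(P * P) / n_s             # -(1/n) sum_i z_i^T U U^T z_i
    egrad = -2.0 / n_s * (Zs @ P.T)         # Euclidean gradient w.r.t. U
    rgrad = egrad - U @ (U.T @ egrad)       # projection onto the tangent space
    return cost, rgrad
```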

Figure 6: Matrix completion performance summary on synthetic datasets.
Figure 7: Matrix completion performance summary on the Jester dataset.

Fig. 2 compares the optimality gap changes over iterations for all the competing algorithms. Additionally, Fig. 3 summarizes their accuracy and convergence performance in optimality gap and run time. Fig. 4a reports the performance without early stopping for the P1 dataset. It can be seen that the Inexact Sub-RN-CR reaches the minimum in the fewest iterations for both the synthetic and real-world datasets. In particular, the larger the scale of a problem, the more pronounced the advantage of our Inexact Sub-RN-CR, as evidenced by the widening performance gap.

However, both the Inexact RTR and Inexact Sub-RN-CR achieve their best PCA performance when using a full gradient calculation accompanied by a subsampled Hessian. The subsampled gradient does not seem to result in a satisfactory solution as shown in Fig. 2 with |𝒮g|=n/10\left|\mathcal{S}_{g}\right|=n/10. Additionally, we report more results for the Inexact RTR and the proposed Inexact Sub-RN-CR in Fig. 4b on the P1 dataset with different gradient batch sizes, including |𝒮g|{n/1.5,n/2,n/5,n/10}\left|\mathcal{S}_{g}\right|\in\left\{n/1.5,n/2,n/5,n/10\right\}. They all perform less well than |𝒮g|=n\left|\mathcal{S}_{g}\right|=n. More accurate gradient information is required to produce a high-precision solution in these tested cases. A hypothesis on the cause of this phenomenon might be that the variance of the approximate gradients across samples is larger than that of the approximate Hessians. Hence, a sufficiently large sample size is needed for a stable approximation of the gradient information. Errors in approximate gradients may cause the algorithm to converge to a sub-optimal point with a higher cost, thus performing less well. Another hypothesis might be that the quadratic term 𝐔𝐔T\mathbf{U}\mathbf{U}^{T} in Eq. (61) would square the approximation error from the approximate gradient, which could significantly increase the PCA reconstruction error.

By fixing the gradient batch size |𝒮g|=n\left|\mathcal{S}_{g}\right|=n for both the Inexact RTR and Inexact Sub-RN-CR, we compare in Figs. 4c and 4d their sensitivity to the batch size used for Hessian approximation. We experiment with |𝒮H|{n/10,n/102,n/103,n/104}\left|\mathcal{S}_{H}\right|\in\{n/10,n/10^{2},n/10^{3},n/10^{4}\}. It can be seen that the Inexact Sub-RN-CR outperforms the Inexact RTR in almost all cases, except for |𝒮H|=n/104\left|\mathcal{S}_{H}\right|=n/10^{4} on the MNIST dataset. The number of oracle calls of the Inexact Sub-RN-CR also grows significantly more slowly for large sample sizes. This implies that the Inexact Sub-RN-CR is more robust than the Inexact RTR to batch-size changes in the inexact Hessian approximation.

Figure 8: Inexact RTR vs. Inexact Sub-RN-CR for matrix completion with varying subsampling sizes for gradient and Hessian approximation. (a) Synthetic M3 (left) and Jester (right) with varying |𝒮g||\mathcal{S}_{g}| settings; (b) Synthetic M3 (left) and Jester (right) with varying |𝒮H||\mathcal{S}_{H}| settings.

5.2 Low-rank Matrix Completion Experiments

Low-rank matrix completion aims at completing a partial matrix 𝐙\mathbf{Z} with only a small number of entries observed, under the assumption that the matrix has a low rank. One way to formulate the problem is shown as below

min𝐔Gr(r,d),𝐀r×n1|Ω|𝒫Ω(𝐔𝐀)𝒫Ω(𝐙)F2,\min_{\mathbf{U}\in{\rm Gr}\left(r,d\right),\mathbf{A}\in\mathbb{R}^{r\times n}}\frac{1}{|\Omega|}\left\|\mathcal{P}_{\Omega}\left(\mathbf{UA}\right)-\mathcal{P}_{\Omega}\left(\mathbf{Z}\right)\right\|_{F}^{2}, (62)

where Ω\Omega denotes the index set of the observed matrix entries. The operator 𝒫Ω:d×nd×n\mathcal{P}_{\Omega}:\mathbb{R}^{d\times n}\rightarrow\mathbb{R}^{d\times n} is defined as 𝒫Ω(𝐙)ij=𝐙ij\mathcal{P}_{\Omega}\left(\mathbf{Z}\right)_{ij}=\mathbf{Z}_{ij} if (i,j)Ω\left(i,j\right)\in\Omega, while 𝒫Ω(𝐙)ij=0\mathcal{P}_{\Omega}\left(\mathbf{Z}\right)_{ij}=0 otherwise. We generate it by uniformly sampling a set of |Ω|=4r(n+dr)|\Omega|=4r(n+d-r) elements from the dndn entries. Let 𝐚i\mathbf{a}_{i} be the ii-th column of 𝐀\mathbf{A}, 𝐳i\mathbf{z}_{i} be the ii-th column of 𝐙\mathbf{Z}, and Ωi{\Omega}_{i} be the subset of Ω{\Omega} that contains sampled indices for the ii-th column of 𝐙\mathbf{Z}. Then, 𝐚i\mathbf{a}_{i} has a closed-form solution 𝐚i=(𝐔Ωi)𝐳Ωi\mathbf{a}_{i}=(\mathbf{U}_{{\Omega}_{i}})^{\dagger}\mathbf{z}_{\Omega_{i}} kasai2018inexact , where 𝐔Ωi\mathbf{U}_{{\Omega}_{i}} contains the selected rows of 𝐔\mathbf{U}, and 𝐳Ωi\mathbf{z}_{\Omega_{i}} the selected elements of 𝐳i\mathbf{z}_{i} according to the indices in Ωi\Omega_{i}, and \dagger denotes the pseudo inverse operation. To evaluate a solution 𝐔\mathbf{U}, we generate another index set Ω~\tilde{\Omega}, which is used as the test set and satisfies Ω~Ω=\tilde{\Omega}\cap\Omega=\emptyset, following the same way as generating Ω\Omega. We compute the mean squared error (MSE) by

MSE=1|Ω~|𝒫Ω~(𝐔𝐀)𝒫Ω~(𝐙)F2.\textmd{MSE}=\frac{1}{\left|\tilde{\Omega}\right|}\left\|\mathcal{P}_{\tilde{\Omega}}(\mathbf{U}\mathbf{A})-\mathcal{P}_{\tilde{\Omega}}(\mathbf{Z})\right\|_{F}^{2}. (63)
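The closed-form column update and the test MSE of Eq. (63) can be sketched as follows; this is an illustrative NumPy fragment rather than the implementation used in the experiments.

```python
import numpy as np

def complete_columns(U, Z, omega_cols):
    """Closed-form a_i = pinv(U_{Omega_i}) z_{Omega_i} for every column i;
    omega_cols[i] lists the observed row indices of column i of Z."""
    A = np.zeros((U.shape[1], Z.shape[1]))
    for i, rows in enumerate(omega_cols):
        A[:, i] = np.linalg.pinv(U[rows, :]) @ Z[rows, i]
    return A

def test_mse(U, A, Z, omega_test):
    """MSE of Eq. (63) over held-out (row, column) index pairs."""
    rows, cols = map(list, zip(*omega_test))
    pred = np.einsum("ij,ji->i", U[rows, :], A[:, cols])   # (UA) at test entries
    return np.mean((pred - Z[rows, cols]) ** 2)
```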

In evaluation, three synthetic datasets and a real-world dataset Jester goldberg2001eigentaste are used where the training and test sets are already predefined by goldberg2001eigentaste . The synthetic datasets are generated by following a generative model similar to ngo2012scaled based on SVD. Specifically, to develop a synthetic dataset, we generate two matrices 𝐀d×r\mathbf{A}\in\mathbb{R}^{d\times r} and 𝐁n×r\mathbf{B}\in\mathbb{R}^{n\times r} with their elements independently sampled from the normal distribution 𝒩(0,1)\mathcal{N}(0,1). Then, we generate two orthogonal matrices 𝐐A\mathbf{Q}_{A} and 𝐐B\mathbf{Q}_{B} by applying the QR decomposition trefethen1997numerical respectively to 𝐀\mathbf{A} and 𝐁\mathbf{B}. After that, we construct a diagonal matrix 𝐒r×r\mathbf{S}\in\mathbb{R}^{r\times r} of which the diagonal elements are computed by si,i=103+(ir)log10(c)r1s_{i,i}=10^{3+\frac{(i-r)\log_{10}(c)}{r-1}} for i=1,,ri=1,...,r, and the final data matrix is computed by 𝐙=𝐐A𝐒𝐐BT\mathbf{Z}=\mathbf{Q}_{A}\mathbf{S}\mathbf{Q}_{B}^{T}. The reason to construct 𝐒\mathbf{S} in this specific way is to have an easy control over the condition number of the data matrix, denoted by κ(𝐙)\kappa(\mathbf{Z}), which is the ratio between the maximum and minimum singular values of 𝐙\mathbf{Z}. Because κ(𝐙)=σmax(𝐙)σmin(𝐙)=103103log10(c)=c\kappa(\mathbf{Z})=\frac{\sigma_{max}(\mathbf{Z})}{\sigma_{min}(\mathbf{Z})}=\frac{10^{3}}{10^{3-\log_{10}(c)}}=c, we can adjust the condition number by tuning the parameter cc. Following this generative model, each synthetic dataset is generated by randomly sampling two orthogonal matrices and constructing one diagonal matrix subject to the constraint of condition numbers, i.e., c=κ(𝐙)=5c=\kappa(\mathbf{Z})=5 for M1 and c=κ(𝐙)=20c=\kappa(\mathbf{Z})=20 for M2 and M3. The (n,d,r)\left(n,d,r\right) values of the used datasets are: (105,102,5)\left(10^{5},10^{2},5\right) for M1, (105,102,5)\left(10^{5},10^{2},5\right) for M2, (3×104,102,5)\left(3\times 10^{4},10^{2},5\right) for M3, and (24983,102,5)\left(24983,10^{2},5\right) for Jester.
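The generative model described above can be sketched as follows (an illustration under the stated construction; the random seed and variable names are our own).

```python
import numpy as np

def make_synthetic(n, d, r, cond, seed=0):
    """Generate Z = Q_A S Q_B^T with condition number `cond`
    (cond = 5 for M1; cond = 20 for M2 and M3)."""
    rng = np.random.default_rng(seed)
    Q_A, _ = np.linalg.qr(rng.standard_normal((d, r)))   # d-by-r orthonormal
    Q_B, _ = np.linalg.qr(rng.standard_normal((n, r)))   # n-by-r orthonormal
    i = np.arange(1, r + 1)
    s = 10.0 ** (3 + (i - r) * np.log10(cond) / (r - 1)) # diagonal of S
    return Q_A @ np.diag(s) @ Q_B.T                      # d-by-n matrix of rank r

Z = make_synthetic(n=3 * 10**4, d=100, r=5, cond=20)     # an M3-like matrix
```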

Figure 9: PCA optimality gap comparison for fMRI analysis.

Fig. 5 compares the MSE changes over iterations, while Fig. 6 and Fig. 7 summarize both the MSE performance and the run time in the same plot for different algorithms and datasets. In Fig. 6, the Inexact Sub-RN-CR outperforms the others in most cases, and it can even be nearly twice as fast as the state-of-the-art methods for cases with a larger condition number (see dataset M2 and M3). This shows that the proposed algorithm is efficient at handling ill-conditioned problems. In Fig. 7, the Inexact Sub-RN-CR achieves a sufficiently small MSE with the shortest run time, faster than the Inexact RTR and RTRMC. Unlike in the PCA task, the subsampled gradient approximation actually helps to improve the convergence. A hypothesis for explaining this phenomenon could be that, as compared to the quadratic term 𝐔𝐔T\mathbf{U}\mathbf{U}^{T} in the PCA objective function, the linear term 𝐔\mathbf{U} in the matrix completion objective function accumulates fewer errors from the inexact gradient, making the optimization more stable.

Additionally, Fig. 8a compares the Inexact RTR and the Inexact Sub-RN-CR with varying batch sizes for gradient estimation and with fixed |𝒮H|=n|\mathcal{S}_{H}|=n. The M1-M3 results show that our algorithm exhibits stronger robustness to |𝒮g||\mathcal{S}_{g}|, as it converges to the minima with only 50%\% additional oracle calls when reducing |𝒮g||\mathcal{S}_{g}| from n/10n/10 to n/102n/10^{2}, whereas Inexact RTR requires twice as many calls. For the Jester dataset, in all settings of gradient sample sizes, our method achieves lower MSE than the Inexact RTR, especially when |𝒮g|=n/104|\mathcal{S}_{g}|=n/10^{4}. Fig. 8b compares sensitivities in Hessian sample sizes |𝒮H||\mathcal{S}_{H}| with fixed |𝒮g|=n|\mathcal{S}_{g}|=n. Inexact Sub-RN-CR performs better for the synthetic dataset M3 with |𝒮H|{n/10,n/102}\left|\mathcal{S}_{H}\right|\in\{n/10,n/10^{2}\}, showing roughly twice faster convergence. For the Jester dataset, Inexact Sub-RN-CR performs better with |𝒮H|{n/10,n/102,n/103}\left|\mathcal{S}_{H}\right|\in\{n/10,n/10^{2},n/10^{3}\} except for the case of |𝒮H|=n/104\left|\mathcal{S}_{H}\right|=n/10^{4}, which is possibly because the construction of the Krylov subspace requires a more accurately approximated Hessian.

To summarize, we have observed from both the PCA and matrix completion tasks that the effectiveness of the subsampled gradient in the proposed approach can be sensitive to the choice of the practical problems, while the subsampled Hessian steadily contributes to a faster convergence rate.

Figure 10: Comparison of the learned fMRI principal components. (a) component 1; (b) component 2; (c) component 3.
Figure 11: MSE comparison for scene image recovery by matrix completion.

5.3 Imaging Applications

In this section, we demonstrate some practical applications of PCA and matrix completion, which are solved by using the proposed optimization algorithm Inexact Sub-RN-CR, for analyzing medical images and scene images.

5.3.1 Functional Connectivity in fMRI by PCA

Functional magnetic-resonance imaging (fMRI) can be used to measure brain activity, and PCA is often used to find functional connectivities between brain regions based on the fMRI scans zhong2009detecting . This method is based on the assumption that the activation is independent of other signal variations such as brain motion and physiological signals zhong2009detecting . Usually, an fMRI scan is represented as a 4D data block (3 spatial dimensions and 1 time dimension) subject to observational noise. Following a common preprocessing routine sidhu2012kernel ; kohoutova2020toward , we denote an fMRI data block by 𝐃u×v×w×T\mathbf{D}\in\mathbb{R}^{u\times v\times w\times T} and a mask by 𝐌{0,1}u×v×w\mathbf{M}\in\{0,1\}^{u\times v\times w} that contains dd nonzero elements marked by 11. By applying the mask to the data block, we obtain a feature matrix 𝐟d×T\mathbf{f}\in\mathbb{R}^{d\times T}, where each column stores the features of the brain at a given time stamp. One can increase the sample size by collecting kk fMRI data blocks corresponding to kk human subjects, after which the feature matrix is expanded to a larger matrix 𝐅d×kT\mathbf{F}\in\mathbb{R}^{d\times kT}.
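A minimal sketch of this masking step (our own illustration, with hypothetical variable names) is given below.

```python
import numpy as np

def fmri_feature_matrix(blocks, mask):
    """Stack masked fMRI blocks into a feature matrix of shape (d, k*T).

    blocks : list of k arrays, each of shape (u, v, w, T).
    mask   : array of shape (u, v, w) whose d nonzero entries select voxels."""
    mask = mask.astype(bool)
    cols = [block[mask, :] for block in blocks]   # each block becomes (d, T)
    return np.concatenate(cols, axis=1)           # feature matrix F, (d, k*T)
```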

In this experiment, an fMRI dataset referred to as ds000030ds000030 provided by the OpenfMRI database poldrack2017openfmri is used, where u=v=64u=v=64, w=34w=34, and T=184T=184. We select k=13k=13 human subjects and use the provided brain mask with d=228483d=228483. The dimension of the final data matrix is (n,d)=(2392,228483)(n,d)=(2392,228483), where n=kTn=kT. We set the rank as r=5r=5 which is sufficient to capture over 93%93\% of the variance in the data. After the PCA processing, each computed principal component can be rendered back to the brain reconstruction by using the open-source library Nilearn abraham2014machine . Fig. 9 displays the optimization performance, where the Inexact Sub-RN-CR converges faster in terms of both run time and oracle calls. For our method and the Inexact RTR, adopting the subsampled gradient leads to a suboptimal solution in less time than using the full gradient. We speculate that imprecise gradients cause an oscillation of the optimization near local optima. Fig. 10 compares the results obtained by our optimization algorithm with those computed by eigen-decomposition. The highlighted regions denote the main activated regions with positive connectivity (yellow) and negative connectivity (blue). The components learned by the two approaches are similar, with some cases learned by our approach tending to be more connected (see Figs. 10a and 10c).

5.3.2 Image Recovery by Matrix Completion

In this application, we demonstrate image recovery with matrix completion using a (W,H,C)=2519×1679×3(W,H,C)=2519\times 1679\times 3 scene image selected from the BIG dataset cheng2020cascadepsp . As seen in Fig. 12a, this image contains rich texture information. The values of (n,d,r)(n,d,r) for conducting the matrix completion task are (1679,2519,20)(1679,2519,20), where we use a relatively large rank to allow more accurate recovery. The sampling ratio for observing the pixels is set as SR=|Ω|W×H×C=0.6{\rm SR}=\frac{\left|\Omega\right|}{W\times H\times C}=0.6. Fig. 11 compares the performance of different algorithms, where the Inexact Sub-RN-CR takes the shortest time to obtain a satisfactory solution. The subsampled gradient improves the optimization speed of the Inexact Sub-RN-CR without sacrificing much MSE accuracy. Fig. 12 illustrates the images recovered by three representative algorithms, which provide similar visual results.

Figure 12: Comparison of the scene images recovered by different algorithms. (a) Original image; (b) Observation; (c) Inexact Sub-RN-CR; (d) Inexact RTR; (e) RTRMC.

5.4 Results of Conjugate Gradient Subproblem Solver

Figure 13: Performance comparison by optimality gap for the PCA task (using the CG solver in Inexact Sub-RN-CR). (a) Synthetic dataset P1; (b) MNIST dataset; (c) Covertype dataset.
Figure 14: Performance comparison by MSE for the matrix completion task (using the CG solver in Inexact Sub-RN-CR). (a) Synthetic Dataset M1; (b) Synthetic Dataset M2; (c) Synthetic Dataset M3; (d) Jester Dataset.
Figure 15: PCA optimality gap comparison for fMRI analysis (using the CG solver in Inexact Sub-RN-CR).
Figure 16: MSE comparison for scene image recovery by matrix completion (using the CG solver in Inexact Sub-RN-CR).

We experiment with Algorithm 3 for solving the subproblem. In Step (3) of Algorithm 3, the eigen-decomposition method edelman1995polynomial used to solve the minimization problem has a complexity of 𝒪(C3)\mathcal{O}(C^{3}), where C=4C=4 is the fixed degree of the polynomial. Figs. 13-16 display the results for both the PCA and matrix completion tasks. Overall, Algorithm 3 can obtain the optimal results with the fastest convergence speed, as compared to the competing approaches. We have observed that, in general, Algorithm 3 provides similar results to Algorithm 2, but they differ in run time. For instance, Algorithm 2 runs 18%18\% faster for the PCA task with the MNIST dataset and 20%20\% faster for the matrix completion task with the M1 dataset, as compared to Algorithm 3. But Algorithm 3 runs 17%17\% faster than Algorithm 2 for the matrix completion task with the M2 dataset. A hypothesis could be that Algorithm 2 performs well on well-conditioned data (e.g. MNIST and M1) because of its strength in finding the global solution, while for ill-conditioned data (e.g. M2), it may not show significant advantages over Algorithm 3. Moreover, from a computational perspective, Step (3) of Algorithm 3 is of 𝒪(C3)\mathcal{O}(C^{3}) complexity, which tends to be faster than solving Eq. (18) as required by Algorithm 2. Overall, this makes Algorithm 3 likely a better choice than Algorithm 2 for processing ill-conditioned data.

5.5 Examination of Convergence Analysis Assumptions

As explained in Section 3.3.3 and Section 3.4, the eigenstep condition in Assumption 1, Assumption 2 and Assumption 3, although required by the convergence analysis, are not always satisfied by a subproblem solver in practice. In this experiment, we estimate the probability PP that an assumption is satisfied during the optimization by counting the number of outer iterations of Algorithm 1 in which the assumption holds. We repeat this entire process five times (T=5T=5) to attain a stable result. Let NiN_{i} be the number of outer iterations where the assumption is satisfied, and MiM_{i} the total number of outer iterations, in the ii-th repetition (i=1, 2,,5i=1,\;2,\;\ldots,5). We compute the probability by P=i[T]Nii[T]MiP=\frac{\sum_{i\in[T]}N_{i}}{\sum_{i\in[T]}M_{i}}. Experiments are conducted for the PCA task using the P1 dataset.

In order to examine Assumption 2, which is the stopping criterion in Eq. (47), we temporarily deactivate the other stopping criteria. We observe that Algorithm 2 can always produce a solution that satisfies Assumption 2. However, Algorithm 3 has only P50%P\approx 50\% chance to produce a solution satisfying Assumption 2. The reason is probably that when computing 𝐫i\mathbf{r}_{i} in Step (11) of Algorithm 3, the first-order approximation of f(R𝐱k(𝜼k))\nabla f(R_{\mathbf{x}_{k}}(\bm{\eta}_{k}^{*})) is used rather than the exact f(R𝐱k(𝜼k))\nabla f(R_{\mathbf{x}_{k}}(\bm{\eta}_{k}^{*})) for the sake of computational efficiency. This can result in an approximation error.

Regarding the eigenstep condition in Assumption 1, it can always be met by Algorithm 2 with P100%P\approx 100\%. This indicates that even a few inner iterations are sufficient for it to find a solution pointing in the direction of negative curvature. However, Algorithm 3 has a P70%P\approx 70\% chance to meet the eigenstep condition. This might be caused by insufficient inner iterations according to Theorem 1. Moreover, the solution obtained by Algorithm 3 is only guaranteed to be stationary according to Theorem 1, rather than pointing in the direction of the negative curvature. This could be a second cause for Algorithm 3 not to meet the eigenstep condition in Eq. (34).

While about Assumption 3, according to Lemma 2, Algorithm 2 always satisfies it. This is verified by our results with P=100%P=100\%. Algorithm 3 has a P80%P\approx 80\% chance to meet Assumption 3 empirically. This empirical result matches the theoretical result indicated by Lemma 3 where solutions from Algorithm 3 tend to approximately satisfy Assumption 3.

6 Conclusion

We have proposed the Inexact Sub-RN-CR algorithm to offer an effective and fast optimization for an important class of non-convex problems whose constraint sets possess manifold structures. The algorithm improves the current state of the art in second-order Riemannian optimization by using cubic regularization and subsampled Hessian and gradient approximations. We have also provided rigorous theoretical results on its convergence, and empirically evaluated and compared it with state-of-the-art Riemannian optimization techniques for two general machine learning tasks and multiple datasets. Both theoretical and experimental results demonstrate that the Inexact Sub-RN-CR offers improved convergence and reduced computational cost. Although the proposed method is promising in solving large-sample problems, there remains an open and interesting question of whether the proposed algorithm can be effective in training a constrained deep neural network. This is more demanding in its required computational complexity and convergence characteristics than many other machine learning problems, and it is more challenging to perform the Hessian approximation. Our future work will pursue this direction.

References

  • (1) Abraham, A., Pedregosa, F., Eickenberg, M., Gervais, P., Mueller, A., Kossaifi, J., Gramfort, A., Thirion, B., Varoquaux, G.: Machine learning for neuroimaging with scikit-learn. Frontiers in neuroinformatics 8, 14 (2014)
  • (2) Absil, P.A., Baker, C.G., Gallivan, K.A.: Trust-region methods on riemannian manifolds. Foundations of Computational Mathematics 7(3), 303–330 (2007)
  • (3) Absil, P.A., Mahony, R., Sepulchre, R.: Optimization algorithms on matrix manifolds. Princeton University Press (2009)
  • (4) Agarwal, N., Boumal, N., Bullins, B., Cartis, C.: Adaptive regularization with cubics on manifolds. Mathematical Programming 188(1), 85–134 (2021)
  • (5) Alimisis, F., Orvieto, A., Bécigneul, G., Lucchi, A.: Momentum improves optimization on riemannian manifolds. In: International Conference on Artificial Intelligence and Statistics, pp. 1351–1359. PMLR (2021)
  • (6) Anandkumar, A., Ge, R.: Efficient approaches for escaping higher order saddle points in non-convex optimization. In: Conference on learning theory, pp. 81–102 (2016)
  • (7) Becigneul, G., Ganea, O.E.: Riemannian adaptive optimization methods. In: International Conference on Learning Representations (2019)
  • (8) Blackard, J.A., Dean, D.J.: Comparative accuracies of artificial neural networks and discriminant analysis in predicting forest cover types from cartographic variables. Computers and electronics in agriculture 24(3), 131–151 (1999)
  • (9) Bonnabel, S.: Stochastic gradient descent on riemannian manifolds. IEEE Transactions on Automatic Control 58(9), 2217–2229 (2013)
  • (10) Boumal, N.: An introduction to optimization on smooth manifolds. Available online, May 3 (2020)
  • (11) Boumal, N., Absil, P.a.: Rtrmc: A riemannian trust-region method for low-rank matrix completion. In: Advances in neural information processing systems, pp. 406–414 (2011)
  • (12) Boumal, N., Absil, P.A., Cartis, C.: Global rates of convergence for nonconvex optimization on manifolds. IMA Journal of Numerical Analysis 39(1), 1–33 (2019)
  • (13) Boumal, N., Mishra, B., Absil, P.A., Sepulchre, R.: Manopt, a matlab toolbox for optimization on manifolds. The Journal of Machine Learning Research 15(1), 1455–1459 (2014)
  • (14) Carmon, Y., Duchi, J.C.: Analysis of krylov subspace solutions of regularized non-convex quadratic problems. Advances in Neural Information Processing Systems 31 (2018)
  • (15) Cartis, C., Gould, N.I., Toint, P.L.: Adaptive cubic regularisation methods for unconstrained optimization. part i: motivation, convergence and numerical results. Mathematical Programming 127(2), 245–295 (2011)
  • (16) Cheng, H.K., Chung, J., Tai, Y.W., Tang, C.K.: Cascadepsp: toward class-agnostic and very high-resolution segmentation via global and local refinement. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8890–8899 (2020)
  • (17) Cho, M., Lee, J.: Riemannian approach to batch normalization. In: Advances in Neural Information Processing Systems, pp. 5225–5235 (2017)
  • (18) Conn, A.R., Gould, N.I., Toint, P.L.: Trust region methods. SIAM (2000)
  • (19) Edelman, A., Arias, T.A., Smith, S.T.: The geometry of algorithms with orthogonality constraints. SIAM journal on Matrix Analysis and Applications 20(2), 303–353 (1998)
  • (20) Edelman, A., Murakami, H.: Polynomial roots from companion matrix eigenvalues. Mathematics of Computation 64(210), 763–776 (1995)
  • (21) Ferreira, O.P., Svaiter, B.F.: Kantorovich’s theorem on newton’s method in riemannian manifolds. Journal of Complexity 18(1), 304–329 (2002)
  • (22) Fletcher, R., Reeves, C.M.: Function minimization by conjugate gradients. The computer journal 7(2), 149–154 (1964)
  • (23) Goldberg, K., Roeder, T., Gupta, D., Perkins, C.: Eigentaste: A constant time collaborative filtering algorithm. information retrieval 4(2), 133–151 (2001)
  • (24) Gould, N., Lucidi, S., Roma, M., Toint, P.L.: Solving the trust-region subproblem using the lanczos method. Siam Journal on Optimization 9(2), 504–525 (1999)
  • (25) Griewank, A.: The modification of newton’s method for unconstrained optimization by bounding cubic terms. Tech. rep., Technical report NA/12 (1981)
  • (26) Gross, D.: Recovering low-rank matrices from few coefficients in any basis. IEEE Transactions on Information Theory 57(3), 1548–1566 (2011)
  • (27) Han, A., Mishra, B., Jawanpuria, P.K., Gao, J.: On riemannian optimization over positive definite matrices with the bures-wasserstein geometry. Advances in Neural Information Processing Systems 34 (2021)
  • (28) Horev, I., Yger, F., Sugiyama, M.: Geometry-aware principal component analysis for symmetric positive definite matrices. pp. 493–522. Springer (2017)
  • (29) Hosseini, S., Huang, W., Yousefpour, R.: Line search algorithms for locally lipschitz functions on riemannian manifolds. SIAM Journal on Optimization 28(1), 596–619 (2018)
  • (30) Huang, W., Wei, K.: Riemannian proximal gradient methods. Mathematical Programming pp. 1–43 (2021)
  • (31) Jia, X., Liang, X., Shen, C., Zhang, L.H.: Solving the cubic regularization model by a nested restarting lanczos method (2021)
  • (32) Kasai, H., Mishra, B.: Inexact trust-region algorithms on riemannian manifolds. In: Advances in Neural Information Processing Systems, pp. 4249–4260 (2018)
  • (33) Kasai, H., Sato, H., Mishra, B.: Riemannian stochastic recursive gradient algorithm. In: International Conference on Machine Learning, pp. 2516–2524 (2018)
  • (34) Kohler, J.M., Lucchi, A.: Sub-sampled cubic regularization for non-convex optimization. In: Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1895–1904. JMLR. org (2017)
  • (35) Kohoutová, L., Heo, J., Cha, S., Lee, S., Moon, T., Wager, T.D., Woo, C.W.: Toward a unified framework for interpreting machine-learning models in neuroimaging. Nature protocols 15(4), 1399–1435 (2020)
  • (36) Kumar Roy, S., Mhammedi, Z., Harandi, M.: Geometry aware constrained optimization techniques for deep learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4460–4469 (2018)
  • (37) LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11), 2278–2324 (1998)
  • (38) Liu, X., He, J., Duddy, S., O’Sullivan, L.: Convolution-consistent collective matrix completion. In: International Conference on Information and Knowledge Management, pp. 2209–2212 (2019)
  • (39) Mishra, B., Kasai, H., Jawanpuria, P., Saroop, A.: A riemannian gossip approach to subspace learning on grassmann manifold. Machine Learning 108(10), 1783–1803 (2019)
  • (40) Mokhtari, A., Ozdaglar, A., Jadbabaie, A.: Escaping saddle points in constrained optimization. In: Advances in Neural Information Processing Systems, pp. 3629–3639 (2018)
  • (41) Ngo, T., Saad, Y.: Scaled gradients on grassmann manifolds for matrix completion. Advances in neural information processing systems 25 (2012)
  • (42) Nguyen, X.S., Brun, L., Lézoray, O., Bougleux, S.: A neural network based on spd manifold learning for skeleton-based hand gesture recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 12036–12045 (2019)
  • (43) Nocedal, J., Wright, S.: Numerical optimization. Springer Science & Business Media (2006)
  • (44) Poldrack, R.A., Gorgolewski, K.J.: Openfmri: Open sharing of task fmri data. Neuroimage 144, 259–261 (2017)
  • (45) Pölitz, C., Duivesteijn, W., Morik, K.: Interpretable domain adaptation via optimization over the stiefel manifold. Machine Learning 104(2), 315–336 (2016)
  • (46) Press, W.H., Teukolsky, S.A., Vetterling, W.T., Flannery, B.: Chapter 9: Root finding and nonlinear sets of equations. Book: Numerical Recipes: The Art of Scientific Computing, New York, Cambridge University Press, 10 (2007)
  • (47) Qi, C.: Numerical optimization methods on riemannian manifolds (2011)
  • (48) Roosta-Khorasani, F., Mahoney, M.W.: Sub-sampled newton methods. Mathematical Programming 174(1), 293–326 (2019)
  • (49) Sakai, H., Iiduka, H.: Sufficient descent riemannian conjugate gradient methods. Journal of Optimization Theory and Applications 190(1), 130–150 (2021)
  • (50) Sato, H., Iwai, T.: A new, globally convergent riemannian conjugate gradient method. Optimization 64(4), 1011–1031 (2015)
  • (51) Shahid, N., Kalofolias, V., Bresson, X., Bronstein, M., Vandergheynst, P.: Robust principal component analysis on graphs. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2812–2820 (2015)
  • (52) Shen, Z., Zhou, P., Fang, C., Ribeiro, A.: A stochastic trust region method for non-convex minimization. arXiv preprint arXiv:1903.01540 (2019)
  • (53) Sidhu, G.S., Asgarian, N., Greiner, R., Brown, M.R.: Kernel principal component analysis for dimensionality reduction in fMRI-based diagnosis of ADHD. Frontiers in Systems Neuroscience 6, 74 (2012)
  • (54) Sun, Y., Flammarion, N., Fazel, M.: Escaping from saddle points on Riemannian manifolds. arXiv preprint arXiv:1906.07355 (2019)
  • (55) Townsend, J., Koep, N., Weichwald, S.: Pymanopt: A python toolbox for optimization on manifolds using automatic differentiation. The Journal of Machine Learning Research 17(1), 4755–4759 (2016)
  • (56) Trefethen, L.N., Bau III, D.: Numerical linear algebra, vol. 50. SIAM (1997)
  • (57) Tripuraneni, N., Stern, M., Jin, C., Regier, J., Jordan, M.I.: Stochastic cubic regularization for fast nonconvex optimization. Advances in Neural Information Processing Systems 31 (2018)
  • (58) Tropp, J.A.: An introduction to matrix concentration inequalities. arXiv preprint arXiv:1501.01571 (2015)
  • (59) Wei, Z., Yao, S., Liu, L.: The convergence properties of some new conjugate gradient methods. Applied Mathematics and computation 183(2), 1341–1350 (2006)
  • (60) Weiwei, Y., Yueting, Y., Chenhui, Z., Mingyuan, C.: A Newton-like trust region method for large-scale unconstrained nonconvex minimization. In: Abstract and Applied Analysis, vol. 2013. Hindawi (2013)
  • (61) Xu, P., Roosta, F., Mahoney, M.W.: Newton-type methods for non-convex optimization under inexact hessian information. Mathematical Programming 184(1), 35–70 (2020)
  • (62) Xu, Z., Zhao, P., Cao, J., Li, X.: Matrix eigen-decomposition via doubly stochastic Riemannian optimization. In: International Conference on Machine Learning, pp. 1660–1669 (2016)
  • (63) Yao, Z., Xu, P., Roosta, F., Mahoney, M.W.: Inexact nonconvex Newton-type methods. INFORMS Journal on Optimization 3(2), 154–182 (2021)
  • (64) Yuan, X., Huang, W., Absil, P.A., Gallivan, K.A.: A Riemannian limited-memory BFGS algorithm for computing the matrix geometric mean. Procedia Computer Science 80, 2147–2157 (2016)
  • (65) Zhang, H., Reddi, S.J., Sra, S.: Riemannian SVRG: Fast stochastic optimization on Riemannian manifolds. In: Advances in Neural Information Processing Systems, pp. 4592–4600 (2016)
  • (66) Zhang, H., Sra, S.: First-order methods for geodesically convex optimization. In: Conference on Learning Theory, pp. 1617–1638 (2016)
  • (67) Zhang, J., Zhang, S.: A cubic regularized Newton's method over Riemannian manifolds. arXiv preprint arXiv:1805.05565 (2018)
  • (68) Zhong, Y., Wang, H., Lu, G., Zhang, Z., Jiao, Q., Liu, Y.: Detecting functional connectivity in fMRI using PCA and regression analysis. Brain Topography 22(2), 134–144 (2009)
  • (69) Zhou, D., Gu, Q.: Stochastic recursive variance-reduced cubic regularization methods. In: International Conference on Artificial Intelligence and Statistics, pp. 3980–3990. PMLR (2020)
  • (70) Zhou, D., Xu, P., Gu, Q.: Stochastic variance-reduced cubic regularization methods. J. Mach. Learn. Res. 20(134), 1–47 (2019)
  • (71) Zhu, X.: A Riemannian conjugate gradient method for optimization on the Stiefel manifold. Computational Optimization and Applications 67(1), 73–110 (2017)

Appendix A Appendix: Derivation of Lanczos Method

Instead of solving the subproblem over the tangent space T_{\mathbf{x}_{k}}\mathcal{M}, whose dimension equals the manifold dimension D, the Lanczos method solves it within a Krylov subspace \mathcal{K}_{l}, where l can range from 1 to D. This subspace is defined as the span of the following elements:

𝒦l(𝐇k,𝐆k):={𝐆k,𝐇k[𝐆k],𝐇k2[𝐆k],,𝐇kl[𝐆k]},\mathcal{K}_{l}(\mathbf{H}_{k},\mathbf{G}_{k}):=\left\{\mathbf{G}_{k},\mathbf{H}_{k}[\mathbf{G}_{k}],\mathbf{H}_{k}^{2}[\mathbf{G}_{k}],...,\mathbf{H}_{k}^{l}[\mathbf{G}_{k}]\right\}, (64)

where, for l2l\geq 2, 𝐇kl[𝐆k]\mathbf{H}_{k}^{l}[\mathbf{G}_{k}] is recursively defined by 𝐇k[𝐇kl1[𝐆k]]\mathbf{H}_{k}\left[\mathbf{H}^{l-1}_{k}\left[\mathbf{G}_{k}\right]\right]. Its orthonormal basis 𝐐l={𝐪1,,𝐪l}\mathbf{Q}_{l}=\{\mathbf{q}_{1},...,\mathbf{q}_{l}\}, where 𝐐lT𝐐l=𝐈\mathbf{Q}_{l}^{T}\mathbf{Q}_{l}=\mathbf{I}, is successively constructed to satisfy

𝐪1=\displaystyle\mathbf{q}_{1}= 𝐆k𝐆k𝐱k,\displaystyle\frac{\mathbf{G}_{k}}{\left\|\mathbf{G}_{k}\right\|_{\mathbf{x}_{k}}}, (65)
𝐪i,𝐇k[𝐪j]𝐱k=\displaystyle\langle\mathbf{q}_{i},\mathbf{H}_{k}[\mathbf{q}_{j}]\rangle_{\mathbf{x}_{k}}= (𝐓l)i,j,\displaystyle\left(\mathbf{T}_{l}\right)_{i,j}, (66)

for i,j\in[l], where \left(\mathbf{T}_{l}\right)_{i,j} denotes the ij-th element of the matrix \mathbf{T}_{l}. Each element \bm{\eta}\in\mathcal{K}_{l} of the Krylov subspace can be expressed as \bm{\eta}=\sum_{i=1}^{l}y_{i}\mathbf{q}_{i}. Storing these \{y_{i}\}_{i=1}^{l} in the vector \mathbf{y}\in\mathbb{R}^{l}, the subproblem \mathop{\min}_{\bm{\eta}\in\mathcal{K}_{l}}\hat{m}(\bm{\eta}) is minimized over the Krylov subspace instead. By substituting \bm{\eta}:=\sum_{i=1}^{l}y_{i}\mathbf{q}_{i} into \hat{m}(\bm{\eta}), the objective function becomes

m^(𝜼)\displaystyle\hat{m}(\bm{\eta}) =f(𝐱k)+δ𝐆k,i=1lyi𝐪i𝐱k+12i=1lyi𝐪i,𝐇k[i=1lyi𝐪i]𝐱k+σk3i=1lyi𝐪i𝐱k3\displaystyle=f(\mathbf{x}_{k})+\delta\left\langle\mathbf{G}_{k},\sum_{i=1}^{l}y_{i}\mathbf{q}_{i}\right\rangle_{\mathbf{x}_{k}}+\frac{1}{2}\left\langle\sum_{i=1}^{l}y_{i}\mathbf{q}_{i},\mathbf{H}_{k}\left[\sum_{i=1}^{l}y_{i}\mathbf{q}_{i}\right]\right\rangle_{\mathbf{x}_{k}}+\frac{\sigma_{k}}{3}\left\|\sum_{i=1}^{l}y_{i}\mathbf{q}_{i}\right\|_{\mathbf{x}_{k}}^{3}
=f(𝐱k)+δ𝐆k,y1𝐪1𝐱k+12i,j=1lyiyj𝐪i,𝐇k[𝐪j]𝐱k+σk3𝐲23\displaystyle=f(\mathbf{x}_{k})+\delta\left\langle\mathbf{G}_{k},y_{1}\mathbf{q}_{1}\right\rangle_{\mathbf{x}_{k}}+\frac{1}{2}\sum_{i,j=1}^{l}y_{i}y_{j}\left\langle\mathbf{q}_{i},\mathbf{H}_{k}[\mathbf{q}_{j}]\right\rangle_{\mathbf{x}_{k}}+\frac{\sigma_{k}}{3}\left\|\mathbf{y}\right\|_{2}^{3}
\displaystyle=f(\mathbf{x}_{k})+\ y_{1}\delta\left\|\mathbf{G}_{k}\right\|_{\mathbf{x}_{k}}+\frac{1}{2}\mathbf{y}^{T}\mathbf{T}_{l}\mathbf{y}+\frac{\sigma_{k}}{3}\left\|\mathbf{y}\right\|_{2}^{3}. (67)

The properties \mathbf{q}_{i}\perp\mathbf{q}_{j} for i\neq j and \mathbf{G}_{k}\perp\mathbf{q}_{i} for i\neq 1 are used in the derivation. Therefore, solving \min_{\bm{\eta}\in\mathcal{K}_{l}}\ \hat{m}(\bm{\eta}) is equivalent to solving

\min_{\mathbf{y}\in\mathbb{R}^{l}}\ y_{1}\delta\left\|\mathbf{G}_{k}\right\|_{\mathbf{x}_{k}}+\frac{1}{2}\mathbf{y}^{T}\mathbf{T}_{l}\mathbf{y}+\frac{\sigma_{k}}{3}\left\|\mathbf{y}\right\|_{2}^{3}. (68)
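For concreteness, the following is a minimal Euclidean sketch of this reduction in Python (NumPy/SciPy), assuming a symmetric matrix H and a vector g as stand-ins for the operator \mathbf{H}_{k} and the subsampled gradient \mathbf{G}_{k}; the helper names lanczos_tridiag and solve_reduced_cubic, and the use of a generic L-BFGS-B search for the reduced problem, are illustrative assumptions rather than the paper's actual subproblem solver.

import numpy as np
from scipy.optimize import minimize

def lanczos_tridiag(H, g, l):
    """Orthonormal basis Q_l of the Krylov subspace K_l(H, g) and the
    tridiagonal matrix T_l = Q_l^T H Q_l (no breakdown handling, for brevity)."""
    D = g.size
    l = min(l, D)
    Q = np.zeros((D, l))
    alphas = np.zeros(l)
    betas = np.zeros(max(l - 1, 0))
    Q[:, 0] = g / np.linalg.norm(g)
    for i in range(l):
        w = H @ Q[:, i]
        alphas[i] = Q[:, i] @ w
        w = w - alphas[i] * Q[:, i]
        if i > 0:
            w = w - betas[i - 1] * Q[:, i - 1]
        if i < l - 1:
            betas[i] = np.linalg.norm(w)
            Q[:, i + 1] = w / betas[i]
    T = np.diag(alphas)
    if l > 1:
        T = T + np.diag(betas, 1) + np.diag(betas, -1)
    return Q, T

def solve_reduced_cubic(H, g, sigma, l, delta=1.0):
    """Minimize the reduced model of Eq. (68):
    y_1 * delta * ||g|| + 0.5 * y^T T_l y + (sigma / 3) * ||y||^3,
    then map the solution back as eta = Q_l y.  delta weights the gradient term."""
    Q, T = lanczos_tridiag(H, g, l)
    gnorm = np.linalg.norm(g)

    def reduced_model(y):
        return delta * y[0] * gnorm + 0.5 * y @ T @ y + sigma / 3.0 * np.linalg.norm(y) ** 3

    # A generic solver stands in for a dedicated cubic-subproblem solver.
    y = minimize(reduced_model, np.zeros(l), method="L-BFGS-B").x
    return Q @ y

# Toy usage with an indefinite quadratic model plus cubic regularization.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 20))
H = 0.5 * (A + A.T)          # symmetric stand-in for the Hessian operator H_k
g = rng.standard_normal(20)  # stand-in for the (subsampled) gradient G_k
eta = solve_reduced_cubic(H, g, sigma=1.0, l=8)
print("norm of the candidate step:", np.linalg.norm(eta))

Because the reduced matrix is tridiagonal and l is typically much smaller than D, the problem in Eq. (68) is cheap to solve even when the ambient dimension is large.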

Appendix B Appendix: Proof of Lemma 1

Proof

Let \lambda_{*}=\sigma_{k}\left\|\bm{\eta}_{k}^{*}\right\|_{\mathbf{x}_{k}}. The Krylov subspaces are invariant to shifts by scalar matrices; therefore, \mathcal{K}_{l}(\mathbf{H}_{k},\mathbf{G}_{k})=\mathcal{K}_{l}(\mathbf{H}_{k}+\lambda_{*}{\rm Id},\mathbf{G}_{k})=\mathcal{K}_{l}(\tilde{\mathbf{H}}_{k},\mathbf{G}_{k}) carmon2018analysis, where the definition of \mathcal{K}_{l}(\mathbf{H}_{k},\mathbf{G}_{k}) follows Eq. (64). Let \bm{\xi}_{l}\in\mathcal{K}_{l} be the solution found in the Krylov subspace \mathcal{K}_{l}(\mathbf{H}_{k},\mathbf{G}_{k}), which is thus an element of \mathcal{K}_{l}(\tilde{\mathbf{H}}_{k},\mathbf{G}_{k}) expressed by

𝝃l=pl(𝐇~k)[𝐆k]=c0𝐆kc1𝐇~k[𝐆k]cl𝐇~kl[𝐆k],\bm{\xi}_{l}=-p_{l}\left(\tilde{\mathbf{H}}_{k}\right)[\mathbf{G}_{k}]=-c_{0}\mathbf{G}_{k}-c_{1}\tilde{\mathbf{H}}_{k}[\mathbf{G}_{k}]\cdots-c_{l}\tilde{\mathbf{H}}_{k}^{l}[\mathbf{G}_{k}], (69)

for some values of c0,c1,,clc_{0},\;c_{1},\;\ldots,\;c_{l}\in\mathbb{R}. According to Section 6.2 of absil2009optimization , a global minimizer 𝜼¯k\bar{\bm{\eta}}_{k}^{*} of the RTR subproblem without cubic regularization in Eq. (7) is expected to satisfy the Riemannian quasi-Newton equation:

gradm¯k(𝟎k)+(Hessm¯k(𝟎k)+λId)[𝜼¯k]=𝟎𝐱k,{\rm grad}\bar{m}_{k}({\mathbf{0}_{k}})+\left({\rm Hess}\bar{m}_{k}(\mathbf{0}_{k})+\lambda_{*}{\rm Id}\right)[\bar{\bm{\eta}}_{k}^{*}]=\mathbf{0}_{\mathbf{x}_{k}}, (70)

where λmax(λmin(𝐇k),0)\lambda_{*}\geq\max(-\lambda_{min}(\mathbf{H}_{k}),0) and λ(Δk𝜼¯k𝐱k)=0\lambda_{*}\left(\Delta_{k}-\left\|\bar{\bm{\eta}}_{k}^{*}\right\|_{\mathbf{x}_{k}}\right)=0 according to Corollary 7.2.2 of conn2000trust . Using the approximate gradient and Hessian, the inexact minimizer is expected to satisfy

𝐆k+(𝐇k+λId)[𝜼¯k]=𝐆k+𝐇~k[𝜼¯k]=𝟎𝐱k.\mathbf{G}_{k}+(\mathbf{H}_{k}+\lambda_{*}{\rm Id})[\bar{\bm{\eta}}_{k}^{*}]=\mathbf{G}_{k}+\tilde{\mathbf{H}}_{k}[\bar{\bm{\eta}}_{k}^{*}]=\mathbf{0}_{\mathbf{x}_{k}}. (71)

Introduce 𝜻l=(1α)𝝃l\bm{\zeta}_{l}=(1-\alpha)\bm{\xi}_{l} where α=𝝃l𝐱k𝜼¯k𝐱kmax(𝝃l𝐱k,𝜼¯k𝐱k)\alpha=\frac{\left\|\bm{\xi}_{l}\right\|_{\mathbf{x}_{k}}-\left\|\bar{\bm{\eta}}_{k}^{*}\right\|_{\mathbf{x}_{k}}}{\max\left(\left\|\bm{\xi}_{l}\right\|_{\mathbf{x}_{k}},\left\|\bar{\bm{\eta}}_{k}^{*}\right\|_{\mathbf{x}_{k}}\right)}. When 𝝃l𝐱k<𝜼¯k𝐱k\left\|\bm{\xi}_{l}\right\|_{\mathbf{x}_{k}}<\left\|\bar{\bm{\eta}}_{k}^{*}\right\|_{\mathbf{x}_{k}}, we start from the fact that (𝜼¯k𝐱k𝝃l𝐱k)20\left(\left\|\bar{\bm{\eta}}_{k}^{*}\right\|_{\mathbf{x}_{k}}-\left\|\bm{\xi}_{l}\right\|_{\mathbf{x}_{k}}\right)^{2}\geq 0, which results in the following:

(2𝜼¯k𝐱k𝝃l𝐱k)𝝃l𝐱k𝜼¯k𝐱k2\displaystyle\left(2\left\|\bar{\bm{\eta}}_{k}^{*}\right\|_{\mathbf{x}_{k}}-\left\|\bm{\xi}_{l}\right\|_{\mathbf{x}_{k}}\right)\left\|\bm{\xi}_{l}\right\|_{\mathbf{x}_{k}}\leq\left\|\bar{\bm{\eta}}_{k}^{*}\right\|_{\mathbf{x}_{k}}^{2} (1+𝜼¯k𝐱k𝝃l𝐱k𝜼¯k𝐱k)𝝃l𝐱k𝜼¯k𝐱k\displaystyle\Longleftrightarrow\left(1+\frac{\left\|\bar{\bm{\eta}}_{k}^{*}\right\|_{\mathbf{x}_{k}}-\left\|\bm{\xi}_{l}\right\|_{\mathbf{x}_{k}}}{\left\|\bar{\bm{\eta}}_{k}^{*}\right\|_{\mathbf{x}_{k}}}\right)\left\|\bm{\xi}_{l}\right\|_{\mathbf{x}_{k}}\leq\left\|\bar{\bm{\eta}}_{k}^{*}\right\|_{\mathbf{x}_{k}}
(1α)𝝃l𝐱k𝜼¯k𝐱k\displaystyle\Longleftrightarrow(1-\alpha)\left\|\bm{\xi}_{l}\right\|_{\mathbf{x}_{k}}\leq\left\|\bar{\bm{\eta}}_{k}^{*}\right\|_{\mathbf{x}_{k}}
𝜻l𝐱k𝜼¯k𝐱k.\displaystyle\Longleftrightarrow\left\|\bm{\zeta}_{l}\right\|_{\mathbf{x}_{k}}\leq\left\|\bar{\bm{\eta}}_{k}^{*}\right\|_{\mathbf{x}_{k}}. (72)

When \left\|\bm{\xi}_{l}\right\|_{\mathbf{x}_{k}}\geq\left\|\bar{\bm{\eta}}_{k}^{*}\right\|_{\mathbf{x}_{k}}, we have

𝜻l𝐱k=(1α)𝝃l𝐱k=(1𝝃l𝐱k𝜼¯k𝐱k𝝃l𝐱k)𝝃l𝐱k=𝜼¯k𝐱k𝝃l𝐱k𝝃l𝐱k=𝜼¯k𝐱k.\left\|{\bm{\zeta}}_{l}\right\|_{\mathbf{x}_{k}}=\left\|(1-\alpha)\bm{\xi}_{l}\right\|_{\mathbf{x}_{k}}=\left\|\left(1-\frac{\left\|\bm{\xi}_{l}\right\|_{\mathbf{x}_{k}}-\left\|\bar{\bm{\eta}}_{k}^{*}\right\|_{\mathbf{x}_{k}}}{\left\|\bm{\xi}_{l}\right\|_{\mathbf{x}_{k}}}\right)\bm{\xi}_{l}\right\|_{\mathbf{x}_{k}}=\frac{\left\|\bar{\bm{\eta}}_{k}^{*}\right\|_{\mathbf{x}_{k}}\left\|{\bm{\xi}}_{l}\right\|_{\mathbf{x}_{k}}}{\left\|{\bm{\xi}}_{l}\right\|_{\mathbf{x}_{k}}}=\left\|\bar{\bm{\eta}}_{k}^{*}\right\|_{\mathbf{x}_{k}}. (73)

This shows that, for any \bm{\xi}_{l}, we have \left\|\bm{\zeta}_{l}\right\|_{\mathbf{x}_{k}}\leq\left\|\bar{\bm{\eta}}_{k}^{*}\right\|_{\mathbf{x}_{k}}.

We introduce the notation \tilde{m}_{k} to denote the subproblem in Eq. (7) using the inexact Hessian \tilde{\mathbf{H}}_{k}. Let \psi_{k}^{*}=\left(p_{l+1}\left(\tilde{\mathbf{H}}_{k}\right)-{\rm Id}\right)[\bar{\bm{\eta}}_{k}^{*}] and \iota\left(\tilde{\mathbf{H}}_{k}\right)=\frac{\lambda_{max}(\tilde{\mathbf{H}}_{k})}{\lambda_{min}(\tilde{\mathbf{H}}_{k})}. Since \phi_{l}\left(\tilde{\mathbf{H}}_{k}\right) is an upper bound of \left\|p_{l+1}\left(\tilde{\mathbf{H}}_{k}\right)-{\rm Id}\right\|_{\mathbf{x}_{k}}, we have \left\|\psi_{k}^{*}\right\|_{\mathbf{x}_{k}}\leq\phi_{l}\left(\tilde{\mathbf{H}}_{k}\right)\left\|\bar{\bm{\eta}}_{k}^{*}\right\|_{\mathbf{x}_{k}}. Then, we have

m¯k(𝜻l)m¯k(𝜼¯k)\displaystyle\bar{m}_{k}(\bm{\zeta}_{l})-\bar{m}_{k}(\bar{\bm{\eta}}_{k}^{*})
=\displaystyle= m~k(𝜻l)m~k(𝜼¯k)+λ2(𝜼¯k𝐱k2𝜻l𝐱k2)\displaystyle\;\tilde{m}_{k}(\bm{\zeta}_{l})-\tilde{m}_{k}(\bar{\bm{\eta}}_{k}^{*})+\frac{\lambda_{*}}{2}\left(\left\|\bar{\bm{\eta}}_{k}^{*}\right\|^{2}_{\mathbf{x}_{k}}-\left\|\bm{\zeta}_{l}\right\|^{2}_{\mathbf{x}_{k}}\right)
=\displaystyle= 12𝜻l𝜼¯k,𝐇~k[𝜻l𝜼¯k]𝐱k+λ2(𝜼¯k𝐱k2𝜻l𝐱k2)\displaystyle\;\frac{1}{2}\left\langle\bm{\zeta}_{l}-\bar{\bm{\eta}}_{k}^{*},\tilde{\mathbf{H}}_{k}[\bm{\zeta}_{l}-\bar{\bm{\eta}}_{k}^{*}]\right\rangle_{\mathbf{x}_{k}}+\frac{\lambda_{*}}{2}\left(\left\|\bar{\bm{\eta}}_{k}^{*}\right\|^{2}_{\mathbf{x}_{k}}-\left\|\bm{\zeta}_{l}\right\|^{2}_{\mathbf{x}_{k}}\right)
\displaystyle\leq 12𝜻l𝜼¯k,𝐇~k[𝜻l𝜼¯k]𝐱k+λ𝜼¯k𝐱k(𝜼¯k𝐱k𝜻l𝐱k)\displaystyle\;\frac{1}{2}\left\langle\bm{\zeta}_{l}-\bar{\bm{\eta}}_{k}^{*},\tilde{\mathbf{H}}_{k}[\bm{\zeta}_{l}-\bar{\bm{\eta}}_{k}^{*}]\right\rangle_{\mathbf{x}_{k}}+\lambda_{*}\left\|\bar{\bm{\eta}}_{k}^{*}\right\|_{\mathbf{x}_{k}}\left(\left\|\bar{\bm{\eta}}_{k}^{*}\right\|_{\mathbf{x}_{k}}-\left\|\bm{\zeta}_{l}\right\|_{\mathbf{x}_{k}}\right)
\displaystyle\leq (1α)22(pl+1(𝐇~k)Id)[𝜼¯k],𝐇~k[(pl+1(𝐇~k)Id)[𝜼¯k]]𝐱k+λ𝜼¯k𝐱k2α2\displaystyle\;\frac{(1-\alpha)^{2}}{2}\left\langle\left(p_{l+1}\left(\tilde{\mathbf{H}}_{k}\right)-{\rm Id}\right)[\bar{\bm{\eta}}_{k}^{*}],\tilde{\mathbf{H}}_{k}\left[\left(p_{l+1}\left(\tilde{\mathbf{H}}_{k}\right)-{\rm Id}\right)\left[\bar{\bm{\eta}}_{k}^{*}\right]\right]\right\rangle_{\mathbf{x}_{k}}+\lambda_{*}\left\|\bar{\bm{\eta}}_{k}^{*}\right\|_{\mathbf{x}_{k}}^{2}\alpha^{2} (74)
=\displaystyle= (1α)22ψk𝐱k2ψkψk𝐱k,𝐇~k[ψkψk𝐱k]𝐱k+λ𝜼¯k𝐱k2α2\displaystyle\;\frac{(1-\alpha)^{2}}{2}\left\|\psi_{k}^{*}\right\|_{\mathbf{x}_{k}}^{2}\left\langle\frac{\psi_{k}^{*}}{\left\|\psi_{k}^{*}\right\|_{\mathbf{x}_{k}}},\tilde{\mathbf{H}}_{k}\left[\frac{\psi_{k}^{*}}{\left\|\psi_{k}^{*}\right\|_{\mathbf{x}_{k}}}\right]\right\rangle_{\mathbf{x}_{k}}+\lambda_{*}\left\|\bar{\bm{\eta}}_{k}^{*}\right\|_{\mathbf{x}_{k}}^{2}\alpha^{2}
\displaystyle\leq  2ϕl(𝐇~k)2𝜼¯k𝐱k2ι(𝐇~k)𝜼¯k𝜼¯k𝐱k,𝐇~k[𝜼¯k𝜼¯k𝐱k]𝐱k+ϕl(𝐇~k)2λ𝜼¯k𝐱k2\displaystyle\;2\phi_{l}\left(\tilde{\mathbf{H}}_{k}\right)^{2}\left\|\bar{\bm{\eta}}_{k}^{*}\right\|_{\mathbf{x}_{k}}^{2}\iota\left(\tilde{\mathbf{H}}_{k}\right)\left\langle\frac{\bar{\bm{\eta}}_{k}^{*}}{\left\|\bar{\bm{\eta}}_{k}^{*}\right\|_{\mathbf{x}_{k}}},\tilde{\mathbf{H}}_{k}\left[\frac{\bar{\bm{\eta}}_{k}^{*}}{\left\|\bar{\bm{\eta}}_{k}^{*}\right\|_{\mathbf{x}_{k}}}\right]\right\rangle_{\mathbf{x}_{k}}+\phi_{l}(\tilde{\mathbf{H}}_{k})^{2}\lambda_{*}\left\|\bar{\bm{\eta}}_{k}^{*}\right\|_{\mathbf{x}_{k}}^{2} (75)
\displaystyle\leq  4ι(𝐇~k)(12𝜼¯k,𝐇~k[𝜼¯k]𝐱k+12λ𝜼¯k𝐱k2)ϕl(𝐇~k)2\displaystyle\;4\iota\left(\tilde{\mathbf{H}}_{k}\right)\left(\frac{1}{2}\left\langle\bar{\bm{\eta}}_{k}^{*},\tilde{\mathbf{H}}_{k}[\bar{\bm{\eta}}_{k}^{*}]\right\rangle_{\mathbf{x}_{k}}+\frac{1}{2}\lambda_{*}\left\|\bar{\bm{\eta}}_{k}^{*}\right\|_{\mathbf{x}_{k}}^{2}\right)\phi_{l}\left(\tilde{\mathbf{H}}_{k}\right)^{2} (76)
=\displaystyle=  4ι(𝐇~k)(m¯k(𝟎𝐱k)m¯k(𝜼¯k))ϕl(𝐇~k)2.\displaystyle\;4\iota\left(\tilde{\mathbf{H}}_{k}\right)\left(\bar{m}_{k}(\mathbf{0}_{\mathbf{x}_{k}})-\bar{m}_{k}\left(\bar{\bm{\eta}}_{k}^{*}\right)\right)\phi_{l}\left(\tilde{\mathbf{H}}_{k}\right)^{2}. (77)

To derive Eq. (74), \mathbf{G}_{k}=-\tilde{\mathbf{H}}_{k}[\bar{\bm{\eta}}_{k}^{*}] from Eq. (71) is used. To derive Eq. (75), we use the definition in Eq. (35), where for a non-zero \bm{\xi}\in T_{\mathbf{x}_{k}}\mathcal{M}, we have

𝐇k[𝝃]𝐱k𝝃𝐱k𝐇k𝐱k=sup𝜼T𝐱k,𝜼𝐱k0𝐇k[𝜼]𝐱k𝜼𝐱k.\frac{\left\|\mathbf{H}_{k}[\bm{\xi}]\right\|_{\mathbf{x}_{k}}}{\left\|\bm{\xi}\right\|_{\mathbf{x}_{k}}}\leq\left\|\mathbf{H}_{k}\right\|_{\mathbf{x}_{k}}=\sup_{{\mathbf{\bm{\eta}}\in T_{\mathbf{x}_{k}}\mathcal{M},\|\mathbf{\bm{\eta}}\|_{\mathbf{x}_{k}}\neq 0}}\frac{\left\|\mathbf{H}_{k}[\mathbf{\bm{\eta}}]\right\|_{\mathbf{x}_{k}}}{\left\|\bm{\eta}\right\|_{\mathbf{x}_{k}}}. (78)

Eq. (75) also uses (1) the fact that \alpha\geq-1, which follows from \alpha being of the form \frac{a-b}{\max(a,b)}, so that 1-\alpha\leq 2; (2) the definition of the smallest and largest eigenvalues in Eqs. (16) and (20), which gives \frac{\langle\bm{\eta},\tilde{\mathbf{H}}_{k}[\bm{\eta}]\rangle_{\mathbf{x}_{k}}}{\langle\bm{\xi},\tilde{\mathbf{H}}_{k}[\bm{\xi}]\rangle_{\mathbf{x}_{k}}}\leq\frac{\lambda_{max}(\tilde{\mathbf{H}}_{k})}{\lambda_{min}(\tilde{\mathbf{H}}_{k})} for any unit tangent vectors \bm{\eta} and \bm{\xi}; and (3) the fact that

|α|\displaystyle|\alpha| =|𝝃l𝐱k𝜼¯k𝐱k|max(𝝃l𝐱k,𝜼¯k𝐱k)𝝃l𝜼¯k𝐱kmax(𝝃l𝐱k,𝜼¯k𝐱k)\displaystyle=\frac{\left|\left\|\bm{\xi}_{l}\right\|_{\mathbf{x}_{k}}-\left\|\bar{\bm{\eta}}_{k}^{*}\right\|_{\mathbf{x}_{k}}\right|}{\max\left(\left\|\bm{\xi}_{l}\right\|_{\mathbf{x}_{k}},\left\|\bar{\bm{\eta}}_{k}^{*}\right\|_{\mathbf{x}_{k}}\right)}\leq\frac{\left\|\bm{\xi}_{l}-\bar{\bm{\eta}}_{k}^{*}\right\|_{\mathbf{x}_{k}}}{\max\left(\left\|\bm{\xi}_{l}\right\|_{\mathbf{x}_{k}},\left\|\bar{\bm{\eta}}_{k}^{*}\right\|_{\mathbf{x}_{k}}\right)}
=(pl+1(𝐇~k)Id)[𝜼¯k]𝐱kmax(𝝃l𝐱k,𝜼¯k𝐱k)ϕl(𝐇k)𝜼¯k𝐱kmax(𝝃l𝐱k,𝜼¯k𝐱k)ϕl(𝐇k).\displaystyle=\frac{\left\|\left(p_{l+1}\left(\tilde{\mathbf{H}}_{k}\right)-{\rm Id}\right)[\bar{\bm{\eta}}_{k}^{*}]\right\|_{\mathbf{x}_{k}}}{\max\left(\left\|\bm{\xi}_{l}\right\|_{\mathbf{x}_{k}},\left\|\bar{\bm{\eta}}_{k}^{*}\right\|_{\mathbf{x}_{k}}\right)}\leq\frac{\phi_{l}(\mathbf{H}_{k})\left\|\bar{\bm{\eta}}_{k}^{*}\right\|_{\mathbf{x}_{k}}}{\max\left(\left\|\bm{\xi}_{l}\right\|_{\mathbf{x}_{k}},\left\|\bar{\bm{\eta}}_{k}^{*}\right\|_{\mathbf{x}_{k}}\right)}\leq\phi_{l}(\mathbf{H}_{k}). (79)

To derive Eq. (77), we use

m¯k(𝟎𝐱k)m¯k(𝜼¯k)=12𝜼¯k,𝐇~k[𝜼¯k]𝐱k+λ2𝜼¯k𝐱k2.\bar{m}_{k}(\mathbf{0}_{\mathbf{x}_{k}})-\bar{m}_{k}(\bar{\bm{\eta}}_{k}^{*})=\frac{1}{2}\left\langle\bar{\bm{\eta}}_{k}^{*},\tilde{\mathbf{H}}_{k}[\bar{\bm{\eta}}_{k}^{*}]\right\rangle_{\mathbf{x}_{k}}+\frac{\lambda_{*}}{2}\left\|\bar{\bm{\eta}}_{k}^{*}\right\|_{\mathbf{x}_{k}}^{2}. (80)

Next, as \bm{\eta}_{k}^{*l} is the minimizer of \bar{m}_{k} over the subspace \mathcal{K}_{l}(\mathbf{H}_{k},\mathbf{G}_{k}), we have \bar{m}_{k}\left(\bm{\eta}_{k}^{*l}\right)\leq\bar{m}_{k}\left(\bm{\zeta}_{l}\right), and hence

m¯k(𝜼kl)m¯k(𝜼¯k)4ι(𝐇~k)(m¯k(𝟎𝐱k)m¯k(𝜼¯k))ϕl(𝐇~k)2.\bar{m}_{k}\left(\bm{\eta}_{k}^{*l}\right)-\bar{m}_{k}\left(\bar{\bm{\eta}}_{k}^{*}\right)\leq 4\iota\left(\tilde{\mathbf{H}}_{k}\right)(\bar{m}_{k}(\mathbf{0}_{\mathbf{x}_{k}})-\bar{m}_{k}(\bar{\bm{\eta}}_{k}^{*}))\phi_{l}\left(\tilde{\mathbf{H}}_{k}\right)^{2}. (81)

We then show that the Lanczos method exhibits at least the same convergence property as above for the subsampled Riemannian cubic-regularization subproblem. Let 𝜼k{\bm{\eta}}_{k}^{*} be the global minimizer for the subproblem m^k\hat{m}_{k} in Eq. (10). 𝜼k{\bm{\eta}}_{k}^{*} is equivalent to 𝜼¯k\bar{\bm{\eta}}_{k}^{*} in the RTR subproblem with Δk=𝜼k𝐱k\Delta_{k}=\left\|{\bm{\eta}}_{k}^{*}\right\|_{\mathbf{x}_{k}} and λ=σk𝜼k𝐱k\lambda_{*}=\sigma_{k}\left\|{\bm{\eta}}_{k}^{*}\right\|_{\mathbf{x}_{k}} carmon2018analysis . Then, letting 𝜼kl\bm{\eta}_{k}^{*l} be the minimizer of m^k\hat{m}_{k} over 𝒦l(𝐇k,𝐆k)\mathcal{K}_{l}(\mathbf{H}_{k},\mathbf{G}_{k}) satisfying 𝜼kl𝐱k𝜼k𝐱k=Δk\left\|{\bm{\eta}}_{k}^{*l}\right\|_{\mathbf{x}_{k}}\leq\left\|{\bm{\eta}}_{k}^{*}\right\|_{\mathbf{x}_{k}}=\Delta_{k}, we have

\hat{m}_{k}({\bm{\eta}}_{k}^{*l})-\hat{m}_{k}({\bm{\eta}}_{k}^{*})
=m¯k(𝜼kl)m¯k(𝜼k)+σk3(𝜼kl𝐱k3𝜼k𝐱k3)\displaystyle=\bar{m}_{k}({\bm{\eta}}_{k}^{*l})-\bar{m}_{k}({\bm{\eta}}_{k}^{*})+\frac{\sigma_{k}}{3}(\left\|{\bm{\eta}}_{k}^{*l}\right\|_{\mathbf{x}_{k}}^{3}-\left\|{\bm{\eta}}_{k}^{*}\right\|_{\mathbf{x}_{k}}^{3})
m¯k(𝜼kl)m¯k(𝜼k)=m¯k(𝜼kl)m¯k(𝜼¯k)\displaystyle\leq\bar{m}_{k}({\bm{\eta}}_{k}^{*l})-\bar{m}_{k}({\bm{\eta}}_{k}^{*})=\bar{m}_{k}({\bm{\eta}}_{k}^{*l})-\bar{m}_{k}(\bar{\bm{\eta}}_{k}^{*}) (82)

Combining this with Eq. (81), we have

m^k(𝜼kl)m^k(𝜼k)4ι(𝐇~k)(m¯k(𝟎𝐱k)m¯k(𝜼¯k))ϕl(𝐇~k)2.\hat{m}_{k}\left(\bm{\eta}_{k}^{*l}\right)-\hat{m}_{k}\left({\bm{\eta}}_{k}^{*}\right)\leq 4\iota\left(\tilde{\mathbf{H}}_{k}\right)(\bar{m}_{k}(\mathbf{0}_{\mathbf{x}_{k}})-\bar{m}_{k}(\bar{\bm{\eta}}_{k}^{*}))\phi_{l}\left(\tilde{\mathbf{H}}_{k}\right)^{2}. (83)

This completes the proof.

Appendix C Appendix: Proof of Theorem 1

Proof

We first prove the relationship between 𝐆ki\mathbf{G}_{k}^{i} and 𝐫i\mathbf{r}_{i}. According to Algorithm 3, 𝐫0=𝐆k0=𝐆k\mathbf{r}_{0}=\mathbf{G}_{k}^{0}=\mathbf{G}_{k}. Then for i>0i>0, we have

𝐆ki\displaystyle\mathbf{G}_{k}^{i} =1|𝒮g|j𝒮gαi𝐩ifj(R𝐱ki1(αi𝐩i))\displaystyle=\frac{1}{|\mathcal{S}_{g}|}\sum_{j\in\mathcal{S}_{g}}\nabla_{\alpha_{i}^{*}\mathbf{p}_{i}}f_{j}\left(R_{\mathbf{x}_{k}^{i-1}}\left(\alpha_{i}^{*}\mathbf{p}_{i}\right)\right)
1|𝒮g|j𝒮gαi𝐩i(fj(𝐱ki1)+𝐆ki1,αi𝐩i𝐱k+12αi𝐩i,𝐇ki1[αi𝐩i]𝐱k)\displaystyle\approx\frac{1}{|\mathcal{S}_{g}|}\sum_{j\in\mathcal{S}_{g}}\nabla_{\alpha_{i}^{*}\mathbf{p}_{i}}\left(f_{j}\left(\mathbf{x}_{k}^{i-1}\right)+\left\langle\mathbf{G}_{k}^{i-1},\alpha_{i}^{*}\mathbf{p}_{i}\right\rangle_{\mathbf{x}_{k}}+\frac{1}{2}\left\langle\alpha_{i}^{*}\mathbf{p}_{i},\mathbf{H}_{k}^{i-1}[\alpha_{i}^{*}\mathbf{p}_{i}]\right\rangle_{\mathbf{x}_{k}}\right) (84)
=𝐆ki1+αi𝐇ki1[𝐩i]=𝐫i1+αi𝐇ki1[𝐩i]=𝐫i,\displaystyle=\mathbf{G}_{k}^{i-1}+\alpha_{i}^{*}\mathbf{H}_{k}^{i-1}[\mathbf{p}_{i}]=\mathbf{r}_{i-1}+\alpha_{i}^{*}\mathbf{H}_{k}^{i-1}[\mathbf{p}_{i}]=\mathbf{r}_{i}, (85)

where \approx comes from the first-order Taylor expansion and Eq. (85) follows Step (11) in Algorithm 3.

The exact line search in Step (3) of Algorithm 3 approximates

αi=argminα0f(R𝐱ki1(α𝐩i)).\alpha_{i}^{*}=\arg\min_{\alpha\geq 0}f\left(R_{\mathbf{x}_{k}^{i-1}}(\alpha\mathbf{p}_{i})\right). (86)

Setting the derivative of the objective in Eq. (86) with respect to \alpha to zero gives

𝟎𝐱ki=αif(R𝐱ki1(αi𝐩i))=f(𝐱ki),𝒫αi𝐩id(αi𝐩i)dαi𝐱ki𝐆ki,𝒫αi𝐩i𝐩i𝐱ki,\mathbf{0}_{\mathbf{x}_{k}^{i}}=\nabla_{\alpha_{i}^{*}}f\left(R_{\mathbf{x}_{k}^{i-1}}\left(\alpha_{i}^{*}\mathbf{p}_{i}\right)\right)=\left\langle\nabla f\left(\mathbf{x}_{k}^{i}\right),\mathcal{P}_{\alpha_{i}^{*}\mathbf{p}_{i}}\frac{d(\alpha_{i}^{*}\mathbf{p}_{i})}{d\alpha_{i}^{*}}\right\rangle_{\mathbf{x}_{k}^{i}}\approx\left\langle\mathbf{G}_{k}^{i},\mathcal{P}_{\alpha_{i}^{*}\mathbf{p}_{i}}\mathbf{p}_{i}\right\rangle_{\mathbf{x}_{k}^{i}}, (87)

where \approx results from the use of the subsampled gradient \mathbf{G}_{k}^{i} to approximate the full gradient \nabla f(\mathbf{x}_{k}^{i}). We then show that each search direction is a sufficient descent direction, i.e., \left\langle\mathbf{G}_{k}^{i},\mathbf{p}_{i+1}\right\rangle_{\mathbf{x}_{k}^{i}}\leq-C\left\|\mathbf{G}_{k}^{i}\right\|_{\mathbf{x}_{k}^{i}}^{2} for some constant C>0 sakai2021sufficient. When i=0, \mathbf{p}_{1}=-\mathbf{G}_{k}^{0}, and thus \left\langle\mathbf{G}_{k}^{0},\mathbf{p}_{1}\right\rangle_{\mathbf{x}_{k}^{0}}=-\left\|\mathbf{G}_{k}^{0}\right\|_{\mathbf{x}_{k}^{0}}^{2}. When i>0, from Step (17) in Algorithm 3 and Eq. (85), we have \mathbf{p}_{i+1}\approx-\mathbf{G}_{k}^{i}+\beta_{i}\mathbf{p}_{i}. Applying the inner product with \mathbf{G}_{k}^{i} to both sides, we have

𝐆ki,𝐩i+1𝐱ki\displaystyle\langle\mathbf{G}_{k}^{i},\mathbf{p}_{i+1}\rangle_{\mathbf{x}_{k}^{i}} 𝐆ki𝐱ki2+βi𝐆ki,𝒫αi𝐩i𝐩i𝐱ki\displaystyle\approx-\left\|\mathbf{G}_{k}^{i}\right\|_{\mathbf{x}_{k}^{i}}^{2}+\beta_{i}\langle\mathbf{G}_{k}^{i},\mathcal{P}_{\alpha_{i}^{*}\mathbf{p}_{i}}\mathbf{p}_{i}\rangle_{\mathbf{x}_{k}^{i}} (88)
𝐆ki𝐱ki2C𝐆ki𝐱ki2,\displaystyle\approx-\left\|\mathbf{G}_{k}^{i}\right\|_{\mathbf{x}_{k}^{i}}^{2}\leq-C\left\|\mathbf{G}_{k}^{i}\right\|_{\mathbf{x}_{k}^{i}}^{2}, (89)

with a selected C>0C>0. Here, Eq. (89) builds on Eq. (87). Given the sufficient descent direction 𝐩i\mathbf{p}_{i} and the strong Wolfe conditions satisfied by αi\alpha_{i}^{*}, Theorem 2.4.1 in qi2011numerical shows that the Zoutendijk Condition holds sato2015new , i.e.,

i=0𝐆ki,𝐩i+1𝐱ki2𝐩i+1𝐱ki2<.\sum_{i=0}^{\infty}\frac{\left\langle\mathbf{G}_{k}^{i},\mathbf{p}_{i+1}\right\rangle_{\mathbf{x}_{k}^{i}}^{2}}{\left\|\mathbf{p}_{i+1}\right\|_{\mathbf{x}_{k}^{i}}^{2}}<\infty. (90)

Next, we show that βi\beta_{i} is upper bounded. Using Eq. (85), we have

βi\displaystyle\beta_{i} 𝐆ki,𝐆ki𝐆ki𝐱ki𝐆ki1𝐱ki1𝒫αi𝐓i𝐆ki1𝐱ki2𝐆ki1,𝐆ki1𝐱ki1\displaystyle\approx\frac{\left\langle\mathbf{G}_{k}^{i},\mathbf{G}_{k}^{i}-\frac{\left\|\mathbf{G}_{k}^{i}\right\|_{\mathbf{x}_{k}^{i}}}{\left\|\mathbf{G}_{k}^{i-1}\right\|_{\mathbf{x}_{k}^{i-1}}}\mathcal{P}_{\alpha_{i}^{*}\mathbf{T}_{i}}\mathbf{G}_{k}^{i-1}\right\rangle_{\mathbf{x}_{k}^{i}}}{2\left\langle\mathbf{G}_{k}^{i-1},\mathbf{G}_{k}^{i-1}\right\rangle_{\mathbf{x}_{k}^{i-1}}}
=𝐆ki𝐱ki2𝐆ki𝐱ki𝐆ki1𝐱ki1cosθ𝐆ki𝐱ki𝐆ki1𝐱ki12𝐆ki1𝐱ki12𝐆ki𝐱ki2𝐆ki1𝐱ki12,\displaystyle=\frac{\left\|\mathbf{G}_{k}^{i}\right\|_{\mathbf{x}_{k}^{i}}^{2}-\frac{\left\|\mathbf{G}_{k}^{i}\right\|_{\mathbf{x}_{k}^{i}}}{\left\|\mathbf{G}_{k}^{i-1}\right\|_{\mathbf{x}_{k}^{i-1}}}\cos\theta\left\|\mathbf{G}_{k}^{i}\right\|_{\mathbf{x}_{k}^{i}}\left\|\mathbf{G}_{k}^{i-1}\right\|_{\mathbf{x}_{k}^{i-1}}}{2\left\|\mathbf{G}_{k}^{i-1}\right\|_{\mathbf{x}_{k}^{i-1}}^{2}}\leq\frac{\left\|\mathbf{G}_{k}^{i}\right\|_{\mathbf{x}_{k}^{i}}^{2}}{\left\|\mathbf{G}_{k}^{i-1}\right\|_{\mathbf{x}_{k}^{i-1}}^{2}}, (91)

where c3>1c_{3}>1 is some constant.

Now, we prove \lim_{i\to\infty}\left\|\mathbf{G}_{k}^{i}\right\|_{\mathbf{x}_{k}^{i}}=0 by contradiction. Assume that \lim_{i\to\infty}\left\|\mathbf{G}_{k}^{i}\right\|_{\mathbf{x}_{k}^{i}}>0, that is, there exists \gamma>0 such that \left\|\mathbf{G}_{k}^{i}\right\|_{\mathbf{x}_{k}^{i}}>\gamma>0 for all i. Taking the squared norm of Step (17) of Algorithm 3 and applying Eqs. (30), (85), (89) and (91), we have

𝐩i+1𝐱ki2\displaystyle\left\|\mathbf{p}_{i+1}\right\|_{\mathbf{x}_{k}^{i}}^{2} 𝐆ki𝐱ki2+2βi|𝐆ki,𝒫αi𝐩i𝐩i𝐱ki|+βi2𝒫αi𝐩i𝐩i𝐱ki2\displaystyle\leq\left\|\mathbf{G}_{k}^{i}\right\|_{\mathbf{x}_{k}^{i}}^{2}+2\beta_{i}\left|\left\langle\mathbf{G}_{k}^{i},\mathcal{P}_{\alpha_{i}^{*}\mathbf{p}_{i}}\mathbf{p}_{i}\right\rangle_{\mathbf{x}_{k}^{i}}\right|+\beta_{i}^{2}\left\|\mathcal{P}_{\alpha_{i}^{*}\mathbf{p}_{i}}\mathbf{p}_{i}\right\|_{\mathbf{x}_{k}^{i}}^{2}
𝐆ki𝐱ki22c2𝐆ki𝐱ki2𝐆ki1𝐱ki12𝐆ki1,𝐩i𝐱ki1+βi2𝐩i𝐱ki12\displaystyle\leq\left\|\mathbf{G}_{k}^{i}\right\|_{\mathbf{x}_{k}^{i}}^{2}-2c_{2}\frac{\left\|\mathbf{G}_{k}^{i}\right\|_{\mathbf{x}_{k}^{i}}^{2}}{\left\|\mathbf{G}_{k}^{i-1}\right\|_{\mathbf{x}_{k}^{i-1}}^{2}}\left\langle\mathbf{G}_{k}^{i-1},\mathbf{p}_{i}\right\rangle_{\mathbf{x}_{k}^{i-1}}+\beta_{i}^{2}\left\|\mathbf{p}_{i}\right\|_{\mathbf{x}_{k}^{i-1}}^{2}
𝐆ki𝐱ki2+2c2C𝐆ki𝐱ki2𝐆ki1𝐱ki12𝐆ki1𝐱ki12+βi2𝐩i𝐱ki12\displaystyle\leq\left\|\mathbf{G}_{k}^{i}\right\|_{\mathbf{x}_{k}^{i}}^{2}+2c_{2}C\frac{\left\|\mathbf{G}_{k}^{i}\right\|_{\mathbf{x}_{k}^{i}}^{2}}{\left\|\mathbf{G}_{k}^{i-1}\right\|_{\mathbf{x}_{k}^{i-1}}^{2}}\left\|\mathbf{G}_{k}^{i-1}\right\|_{\mathbf{x}_{k}^{i-1}}^{2}+\beta_{i}^{2}\left\|\mathbf{p}_{i}\right\|_{\mathbf{x}_{k}^{i-1}}^{2}
=C^𝐆ki𝐱ki2+βi2𝐩i𝐱ki12,\displaystyle=\hat{C}\left\|\mathbf{G}_{k}^{i}\right\|_{\mathbf{x}_{k}^{i}}^{2}+\beta_{i}^{2}\left\|\mathbf{p}_{i}\right\|_{\mathbf{x}_{k}^{i-1}}^{2}, (92)

where \hat{C}=1+2c_{2}C>1. Applying this recursively, we have

𝐩i+1𝐱ki2\displaystyle\left\|\mathbf{p}_{i+1}\right\|_{\mathbf{x}_{k}^{i}}^{2} C^𝐆ki𝐱ki2+βi2(C^𝐆ki1𝐱ki12+βi12𝐩i1𝐱ki22)\displaystyle\leq\hat{C}\left\|\mathbf{G}_{k}^{i}\right\|_{\mathbf{x}_{k}^{i}}^{2}+\beta_{i}^{2}\left(\hat{C}\left\|\mathbf{G}_{k}^{i-1}\right\|_{\mathbf{x}_{k}^{i-1}}^{2}+\beta_{i-1}^{2}\left\|\mathbf{p}_{i-1}\right\|_{\mathbf{x}_{k}^{i-2}}^{2}\right)
C^(𝐆ki𝐱ki2+βi2𝐆ki1𝐱ki12++j=2iβj2𝐆k1𝐱k12)+𝐩1𝐱k02j=1iβj2\displaystyle\leq\hat{C}\left(\left\|\mathbf{G}_{k}^{i}\right\|_{\mathbf{x}_{k}^{i}}^{2}+\beta_{i}^{2}\left\|\mathbf{G}_{k}^{i-1}\right\|_{\mathbf{x}_{k}^{i-1}}^{2}+\cdots+\prod_{j=2}^{i}\beta_{j}^{2}\left\|\mathbf{G}_{k}^{1}\right\|_{\mathbf{x}_{k}^{1}}^{2}\right)+\left\|\mathbf{p}_{1}\right\|_{\mathbf{x}_{k}^{0}}^{2}\prod_{j=1}^{i}\beta_{j}^{2}
C^𝐆ki𝐱ki4(1𝐆ki𝐱ki2+1𝐆ki1𝐱ki12++1𝐆k1𝐱k12+1C^𝐆k0𝐱k02)\displaystyle\leq\hat{C}\left\|\mathbf{G}_{k}^{i}\right\|_{\mathbf{x}_{k}^{i}}^{4}\left(\frac{1}{\left\|\mathbf{G}_{k}^{i}\right\|_{\mathbf{x}_{k}^{i}}^{2}}+\frac{1}{\left\|\mathbf{G}_{k}^{i-1}\right\|_{\mathbf{x}_{k}^{i-1}}^{2}}+\cdots+\frac{1}{\left\|\mathbf{G}_{k}^{1}\right\|_{\mathbf{x}_{k}^{1}}^{2}}+\frac{1}{\hat{C}\left\|\mathbf{G}_{k}^{0}\right\|_{\mathbf{x}_{k}^{0}}^{2}}\right) (93)
C^𝐆ki𝐱ki4j=0i1𝐆kj𝐱kj2C^(i+1)γ2𝐆ki𝐱ki4,\displaystyle\leq\hat{C}\left\|\mathbf{G}_{k}^{i}\right\|_{\mathbf{x}_{k}^{i}}^{4}\sum_{j=0}^{i}\frac{1}{\left\|\mathbf{G}_{k}^{j}\right\|_{\mathbf{x}_{k}^{j}}^{2}}\leq\frac{\hat{C}(i+1)}{\gamma^{2}}\left\|\mathbf{G}_{k}^{i}\right\|_{\mathbf{x}_{k}^{i}}^{4},

where Eq. (93) uses Eq. (91) and \mathbf{p}_{1}=-\mathbf{G}_{k}^{0} (so that \left\|\mathbf{p}_{1}\right\|_{\mathbf{x}_{k}^{0}}=\left\|\mathbf{G}_{k}^{0}\right\|_{\mathbf{x}_{k}^{0}}), and the last inequality uses \left\|\mathbf{G}_{k}^{i}\right\|_{\mathbf{x}_{k}^{i}}>\gamma. Subsequently, this gives

𝐆ki𝐱ki4𝐩i+1𝐱ki2γ2C^(i+1).\frac{\left\|\mathbf{G}_{k}^{i}\right\|_{\mathbf{x}_{k}^{i}}^{4}}{\left\|\mathbf{p}_{i+1}\right\|_{\mathbf{x}_{k}^{i}}^{2}}\geq\frac{\gamma^{2}}{\hat{C}(i+1)}. (94)

Combining Eqs. (89) and (94), we have

i=0𝐆ki,𝐩i+1𝐱ki2𝐩i+1𝐱ki2=i=0𝐆ki𝐱ki4𝐩i+1𝐱ki2×𝐆ki,𝐩i+1𝐱ki2𝐆ki𝐱ki4i=0C2γ2C^(i+1)=.\sum_{i=0}^{\infty}\frac{\left\langle\mathbf{G}_{k}^{i},\mathbf{p}_{i+1}\right\rangle_{\mathbf{x}_{k}^{i}}^{2}}{\left\|\mathbf{p}_{i+1}\right\|_{\mathbf{x}_{k}^{i}}^{2}}=\sum_{i=0}^{\infty}\frac{\left\|\mathbf{G}_{k}^{i}\right\|_{\mathbf{x}_{k}^{i}}^{4}}{\left\|\mathbf{p}_{i+1}\right\|_{\mathbf{x}_{k}^{i}}^{2}}\times\frac{\langle\mathbf{G}_{k}^{i},\mathbf{p}_{i+1}\rangle_{\mathbf{x}_{k}^{i}}^{2}}{\left\|\mathbf{G}_{k}^{i}\right\|_{\mathbf{x}_{k}^{i}}^{4}}\geq\sum_{i=0}^{\infty}\frac{C^{2}\gamma^{2}}{\hat{C}(i+1)}=\infty. (95)

This contradicts Eq. (90) and completes the proof.

Appendix D Appendix: Subproblem Solvers

D.1 Proof of Lemma 2

D.1.1 The Case of l=Dl=D

Proof

Assumption 1: Regarding the Cauchy condition in Eq. (36), it is satisfied simply because of Eq. (19) and Eq. (33). Regarding the eigenstep condition, the proof is also fairly simple. The solution \bm{\eta}_{k}^{*} from Algorithm 2 with l=D is the global minimizer over \mathbb{R}^{D}. As the subspace spanned by Cauchy steps and eigensteps is contained in \mathbb{R}^{D}, i.e., {\rm Span}\left\{\bm{\eta}_{k}^{C},\bm{\eta}_{k}^{E}\right\}\subseteq\mathbb{R}^{D}, we have

m^k(𝜼k)min𝜼Span{𝜼kC,𝜼kE}m^k(𝜼).\hat{m}_{k}(\bm{\eta}_{k}^{*})\leq\min_{\bm{\eta}\in{\rm Span}\left\{\bm{\eta}_{k}^{C},\bm{\eta}_{k}^{E}\right\}}\hat{m}_{k}(\bm{\eta}). (96)

Hence, the solution from Algorithm 2 satisfies the eigenstep condition in Eq. (37) of Assumption 1.

Assumption 2: As stated in Section 3.3 of cartis2011adaptive, any minimizer, including the global minimizer \bm{\eta}_{k}^{*} from Algorithm 2, is a stationary point of \hat{m}_{k}, and naturally has the property \nabla_{\bm{\eta}}\hat{m}_{k}(\bm{\eta}_{k}^{*})=\mathbf{0}_{\mathbf{x}_{k}}. Hence, we have \left\|\nabla_{\bm{\eta}}\hat{m}_{k}(\bm{\eta}_{k}^{*})\right\|_{\mathbf{x}_{k}}=0\leq\kappa_{\theta}\min\left(1,\left\|\bm{\eta}_{k}^{*}\right\|_{\mathbf{x}_{k}}\right)\|\mathbf{G}_{k}\|_{\mathbf{x}_{k}}. Assumption 2 is then satisfied.

Assumption 3: Given the above-mentioned property of the global minimizer \bm{\eta}_{k}^{*} of \hat{m}_{k} from Algorithm 2, i.e., \nabla_{\bm{\eta}}\hat{m}_{k}(\bm{\eta}_{k}^{*})=\mathbf{0}_{\mathbf{x}_{k}}, and using the definition of \nabla_{\bm{\eta}}\hat{m}_{k}(\bm{\eta}_{k}^{*}), we have

𝜼m^k(𝜼k)=𝐆k+𝐇k[𝜼k]+σk𝜼k𝐱k𝜼k=𝟎𝐱k.\nabla_{\bm{\eta}}\hat{m}_{k}(\bm{\eta}_{k}^{*})=\mathbf{G}_{k}+\mathbf{H}_{k}[\bm{\eta}_{k}^{*}]+\sigma_{k}\left\|\bm{\eta}_{k}^{*}\right\|_{\mathbf{x}_{k}}\bm{\eta}_{k}^{*}=\mathbf{0}_{\mathbf{x}_{k}}. (97)

Taking the inner product of both sides of Eq. (97) with \bm{\eta}_{k}^{*}, we have

𝐆k,𝜼k𝐱k+𝜼k,𝐇k[𝜼k]𝐱k+σk𝜼k𝐱k3=0,\langle\mathbf{G}_{k},\bm{\eta}_{k}^{*}\rangle_{\mathbf{x}_{k}}+\langle\bm{\eta}_{k}^{*},\mathbf{H}_{k}[\bm{\eta}_{k}^{*}]\rangle_{\mathbf{x}_{k}}+\sigma_{k}\left\|\bm{\eta}_{k}^{*}\right\|_{\mathbf{x}_{k}}^{3}=0, (98)

which corresponds to Eq. (42). As \bm{\eta}_{k}^{*} is a descent direction, we have \langle\mathbf{G}_{k},\bm{\eta}_{k}^{*}\rangle_{\mathbf{x}_{k}}\leq 0. Combining this with Eq. (98), we have

𝜼k,𝐇k[𝜼k]𝐱k+σk𝜼k𝐱k30.\langle\bm{\eta}_{k}^{*},\mathbf{H}_{k}[\bm{\eta}_{k}^{*}]\rangle_{\mathbf{x}_{k}}+\sigma_{k}\left\|\bm{\eta}_{k}^{*}\right\|_{\mathbf{x}_{k}}^{3}\geq 0. (99)

This is Eq. (43), which completes the proof.

D.1.2 The Case of l<Dl<D

Proof

Cauchy condition in Assumption 1: As implied by Eq. (19), any intermediate solution \bm{\eta}_{k}^{*l} satisfies the Cauchy condition.

Assumption 2: As stated in Section 3.3 of cartis2011adaptive, any minimizer \bm{\eta}^{*} of \hat{m}_{k} admits the property \nabla_{\bm{\eta}}\hat{m}_{k}(\bm{\eta}^{*})=\mathbf{0}_{\mathbf{x}_{k}}. Since each \bm{\eta}_{k}^{*l} is the minimizer of \hat{m}_{k} over \mathcal{K}_{l}, we have \nabla_{\bm{\eta}}\hat{m}_{k}(\bm{\eta}_{k}^{*l})=\mathbf{0}_{\mathbf{x}_{k}}, and hence \left\|\nabla_{\bm{\eta}}\hat{m}_{k}(\bm{\eta}_{k}^{*l})\right\|_{\mathbf{x}_{k}}=0\leq\kappa_{\theta}\min\left(1,\left\|\bm{\eta}_{k}^{*l}\right\|_{\mathbf{x}_{k}}\right)\|\mathbf{G}_{k}\|_{\mathbf{x}_{k}}. Assumption 2 is then satisfied.

Assumption 3: According to Lemma 3.2 of cartis2011adaptive , a global minimizer of m^k\hat{m}_{k} over a subspace of D\mathbb{R}^{D} satisfies Assumption 3. As the solution 𝜼kl\bm{\eta}_{k}^{*l} in each iteration of Algorithm 2 is a global solution over the subspace 𝒦l\mathcal{K}_{l}, each 𝜼kl\bm{\eta}_{k}^{*l} satisfies Assumption 3.

D.2 Proof of Lemma 3

Proof

For the Cauchy Condition of Assumption 1: In the first iteration of Algorithm 3, the step size is optimized along the gradient direction, as in Cauchy's classical steepest-descent method, i.e.,

𝜼k1=(argminαm^k(α𝐆k))𝐆k.\bm{\eta}_{k}^{1}=\left(\arg\min_{\alpha\in\mathbb{R}}\hat{m}_{k}(\alpha\mathbf{G}_{k})\right)\mathbf{G}_{k}. (100)

At each iteration i of Algorithm 3, the line search in Eq. (24) aims at finding a step size that achieves a cost decrease; otherwise, the step size is zero, meaning that no strict decrease can be achieved and the algorithm stops at Step (4). Because of this, we have

m^k(𝜼ki1+αi𝐩i)m^k(𝜼ki1).\hat{m}_{k}\left(\bm{\eta}_{k}^{i-1}+\alpha_{i}\mathbf{p}_{i}\right)\leq\hat{m}_{k}\left(\bm{\eta}_{k}^{i-1}\right). (101)

Given 𝜼ki=𝜼ki1+αi𝐩i\bm{\eta}_{k}^{i}=\bm{\eta}_{k}^{i-1}+\alpha_{i}^{*}\mathbf{p}_{i} in Algorithm 3, we have

m^k(𝜼ki)m^k(𝜼ki1).\hat{m}_{k}\left(\bm{\eta}_{k}^{i}\right)\leq\hat{m}_{k}\left(\bm{\eta}_{k}^{i-1}\right). (102)

Then, considering all i=1,2,i=1,2,..., we have

m^k(𝜼k)m^k(𝜼k1)m^k(𝜼k0).\hat{m}_{k}\left(\bm{\eta}_{k}^{*}\right)\leq\ldots\leq\hat{m}_{k}\left(\bm{\eta}_{k}^{1}\right)\leq\hat{m}_{k}\left(\bm{\eta}_{k}^{0}\right). (103)

This shows that Algorithm 3 always returns a solution at least as good as the Cauchy step, so the Cauchy condition is satisfied.
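For illustration, the following is a minimal Euclidean sketch of such a Cauchy step in Python (NumPy/SciPy); the helper cauchy_step and the generic scalar search used for the one-dimensional minimization are illustrative assumptions, not the exact line-search routine of Algorithm 3.

import numpy as np
from scipy.optimize import minimize_scalar

def cauchy_step(H, g, sigma):
    """Minimize the cubic model along the gradient direction, as in Eq. (100):
    m(alpha * g) = alpha * ||g||^2 + 0.5 * alpha^2 * g^T H g + (sigma / 3) * |alpha|^3 * ||g||^3
    (the constant f(x_k) term is dropped since it does not affect the minimizer)."""
    gnorm = np.linalg.norm(g)
    gHg = g @ (H @ g)

    def m_along_g(alpha):
        return alpha * gnorm ** 2 + 0.5 * alpha ** 2 * gHg + sigma / 3.0 * abs(alpha) ** 3 * gnorm ** 3

    alpha_star = minimize_scalar(m_along_g).x  # generic scalar search over alpha
    return alpha_star * g

rng = np.random.default_rng(1)
A = rng.standard_normal((10, 10))
H = 0.5 * (A + A.T)          # symmetric stand-in for the Hessian operator
g = rng.standard_normal(10)  # stand-in for the subsampled gradient
eta_c = cauchy_step(H, g, sigma=0.5)
print("Cauchy step length:", np.linalg.norm(eta_c))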

For \left\|\nabla_{\bm{\eta}}\hat{m}_{k}(\bm{\eta})\right\|_{\mathbf{x}_{k}}\approx 0: The approximation (\approx) of interest here builds on the fact that \hat{m}_{k}(\bm{\eta}) is used as an approximation of the true objective function f(R_{\mathbf{x}_{k}}(\bm{\eta})). Assuming \hat{m}_{k}(\bm{\eta})\approx f(R_{\mathbf{x}_{k}}(\bm{\eta})), this leads to

𝜼m^k(𝜼)𝜼f(R𝐱k(𝜼)).\nabla_{\bm{\eta}}\hat{m}_{k}(\bm{\eta})\approx\nabla_{\bm{\eta}}f(R_{\mathbf{x}_{k}}(\bm{\eta})). (104)

Let \mathbf{G}_{k+1} be the subsampled gradient evaluated at R_{\mathbf{x}_{k}}(\bm{\eta}_{k}^{*}). Based on Theorem 1, we have \mathbf{G}_{k+1}=\lim_{i\to\infty}\mathbf{G}_{k}^{i}, where \mathbf{G}_{k}^{i} is the resulting subsampled gradient after i inner iterations. Since \mathbf{G}_{k+1} is an unbiased approximation of the full gradient \nabla f(R_{\mathbf{x}_{k}}(\bm{\eta}_{k}^{*})), we have \nabla f(R_{\mathbf{x}_{k}}(\bm{\eta}_{k}^{*}))=\mathbb{E}[\mathbf{G}_{k+1}]=\mathbb{E}[\lim_{i\to\infty}\mathbf{G}_{k}^{i}]. Hence, we have

𝜼f(R𝐱k(𝜼k))𝐱k\displaystyle\left\|\nabla_{\bm{\eta}}f\left(R_{\mathbf{x}_{k}}(\bm{\eta}_{k}^{*})\right)\right\|_{\mathbf{x}_{k}} (105)
=\displaystyle= f(R𝐱k(𝜼k))d(R𝐱k(𝜼k))d𝜼|𝜼=𝜼k𝐱k=𝔼[limi𝐆ki]𝐱k𝔼[limi𝐆ki𝐱ki]=0,\displaystyle\left\|\nabla f\left(R_{\mathbf{x}_{k}}(\bm{\eta}_{k}^{*})\right)\frac{d\left(R_{\mathbf{x}_{k}}(\bm{\eta}_{k}^{*})\right)}{d\bm{\eta}}\Big{|}_{\bm{\eta}=\bm{\eta}_{k}^{*}}\right\|_{\mathbf{x}_{k}}=\left\|\mathbb{E}\left[\lim_{i\to\infty}\mathbf{G}_{k}^{i}\right]\right\|_{\mathbf{x}_{k}}\leq\mathbb{E}\left[\lim_{i\to\infty}\left\|\mathbf{G}_{k}^{i}\right\|_{\mathbf{x}_{k}^{i}}\right]=0,

which indicates that equality holds throughout, i.e., \left\|\nabla_{\bm{\eta}}f\left(R_{\mathbf{x}_{k}}(\bm{\eta}_{k}^{*})\right)\right\|_{\mathbf{x}_{k}}=0. Combining this with Eq. (104) completes the proof.

For Condition 1 of Assumption 3: Using the definition of \nabla_{\bm{\eta}}\hat{m}_{k}(\bm{\eta}_{k}^{*}) as in Eq. (97), we have

𝜼m^k(𝜼k),𝜼k\displaystyle\left\langle\nabla_{\bm{\eta}}\hat{m}_{k}(\bm{\eta}_{k}^{*}),\bm{\eta}_{k}^{*}\right\rangle =𝐆k+𝐇k[𝜼k]+σk𝜼k𝐱k𝜼k,𝜼k𝐱k\displaystyle=\langle\mathbf{G}_{k}+\mathbf{H}_{k}[\bm{\eta}_{k}^{*}]+\sigma_{k}\left\|\bm{\eta}_{k}^{*}\right\|_{\mathbf{x}_{k}}\bm{\eta}_{k}^{*},\bm{\eta}_{k}^{*}\rangle_{\mathbf{x}_{k}} (106)
=𝐆k,𝜼k𝐱k+𝜼k,𝐇k[𝜼k]𝐱k+σk𝜼k𝐱k3.\displaystyle=\langle\mathbf{G}_{k},\bm{\eta}_{k}^{*}\rangle_{\mathbf{x}_{k}}+\langle\bm{\eta}_{k}^{*},\mathbf{H}_{k}[\bm{\eta}_{k}^{*}]\rangle_{\mathbf{x}_{k}}+\sigma_{k}\left\|\bm{\eta}_{k}^{*}\right\|_{\mathbf{x}_{k}}^{3}.

Also, since 𝜼m^k(𝜼k)𝐱k0\left\|\nabla_{\bm{\eta}}\hat{m}_{k}(\bm{\eta}_{k}^{*})\right\|_{\mathbf{x}_{k}}\approx 0, we have

|𝜼m^k(𝜼k),𝜼k|𝜼m^k(𝜼k)𝐱k𝜼k𝐱k0.\left|\left\langle\nabla_{\bm{\eta}}\hat{m}_{k}(\bm{\eta}_{k}^{*}),\bm{\eta}_{k}^{*}\right\rangle\right|\leq\left\|\nabla_{\bm{\eta}}\hat{m}_{k}(\bm{\eta}_{k}^{*})\right\|_{\mathbf{x}_{k}}\left\|\bm{\eta}_{k}^{*}\right\|_{\mathbf{x}_{k}}\approx 0. (107)

Combining Eqs. (106) and (107) results in Eq. (45), which completes the proof.

Appendix E Appendix: Proof of Theorem 2

E.1 Matrix Bernstein Inequality

We build the proof on the matrix Bernstein inequality. We restate this inequality in the following lemma.

Lemma 4 (Matrix Bernstein Inequality (gross2011recovering ; tropp2015introduction ))

Let 𝐀1,,𝐀n\mathbf{A}_{1},...,\mathbf{A}_{n} be independent, centered random matrices with the common dimension d1×d2d_{1}\times d_{2}. Assume that each one is uniformly bounded:

𝔼[𝐀i]=0,𝐀iμ,i=1,,n.\mathbb{E}[\mathbf{A}_{i}]=0,\|\mathbf{A}_{i}\|\leq\mu,\ i=1,...,n. (108)

Given the matrix sum 𝐙=i=1n𝐀i\mathbf{Z}=\sum_{i=1}^{n}\mathbf{A}_{i}, we define its variance ν(𝐙)\nu(\mathbf{Z}) by

ν(𝐙):=max{i=1n𝔼[𝐀i𝐀iT],i=1n𝔼[𝐀iT𝐀i]}.\begin{split}\nu(\mathbf{Z})&:=\max\left\{\left\|\sum_{i=1}^{n}\mathbb{E}\left[\mathbf{A}_{i}\mathbf{A}_{i}^{T}\right]\right\|,\left\|\sum_{i=1}^{n}\mathbb{E}\left[\mathbf{A}_{i}^{T}\mathbf{A}_{i}\right]\right\|\right\}.\end{split} (109)

Then

Pr(𝐙ϵ)(d1+d2)exp(ϵ2/2ν(𝐙)+μϵ/3)forallϵ>0.{\rm Pr}(\left\|\mathbf{Z}\right\|\geq\epsilon)\leq(d_{1}+d_{2})\exp\left(\frac{-\epsilon^{2}/2}{\nu(\mathbf{Z})+\mu\epsilon/3}\right)\ {\rm for\ all}\ \epsilon>0. (110)

This lemma supports the proof of Theorem 2 below.
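As a quick numerical sanity check of Lemma 4 (the matrix construction, dimensions and threshold below are illustrative assumptions only), the following Python sketch compares the empirical tail probability of \left\|\mathbf{Z}\right\| with the bound in Eq. (110) for a sum of independent Rademacher-weighted copies of a fixed matrix.

import numpy as np

# Empirical tail of ||Z|| for Z = sum_i s_i * B (s_i Rademacher signs) versus
# the bound of Eq. (110); the construction and constants are illustrative only.
rng = np.random.default_rng(0)
d1, d2, n, trials, eps = 5, 3, 200, 2000, 40.0

B = rng.standard_normal((d1, d2))
B /= np.linalg.norm(B, 2)          # spectral norm 1, so mu = ||A_i|| = 1
mu = 1.0
# E[A_i] = 0 and sum_i E[A_i A_i^T] = n * B B^T, hence nu(Z) = n * ||B||^2 = n.
nu = n * max(np.linalg.norm(B @ B.T, 2), np.linalg.norm(B.T @ B, 2))

exceed = 0
for _ in range(trials):
    signs = rng.choice([-1.0, 1.0], size=n)
    Z = (signs[:, None, None] * B).sum(axis=0)
    exceed += float(np.linalg.norm(Z, 2) >= eps)

empirical = exceed / trials
bound = (d1 + d2) * np.exp(-(eps ** 2 / 2.0) / (nu + mu * eps / 3.0))
print(f"empirical Pr(||Z|| >= {eps}) = {empirical:.4f}  vs  bound = {bound:.4f}")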

E.2 Main Proof

Proof

Following the subsampling process, a total of |\mathcal{S}_{g}| matrices are uniformly sampled from the set of n Riemannian gradients \left\{{\rm grad}f_{i}(\mathbf{x})\subseteq\mathbb{R}^{d\times r}\right\}_{i=1}^{n}. We denote each sampled element as \mathbf{G}^{(i)}_{\mathbf{x}}, and we have

Pr(𝐆𝐱(i))=1n,i=1,,|𝒮g|.{\rm Pr}\left(\mathbf{G}^{(i)}_{\mathbf{x}}\right)=\frac{1}{n},\ i=1,...,|\mathcal{S}_{g}|. (111)

Define the random matrix

𝐗i:=𝐆𝐱(i)gradf(𝐱),i=1,2,..,|𝒮g|.\mathbf{X}_{i}:=\mathbf{G}^{(i)}_{\mathbf{x}}-{\rm grad}f(\mathbf{x}),\ i=1,2,..,|\mathcal{S}_{g}|. (112)

Since our focus is the type of problem in Eq. (1), we have

𝔼[𝐗i]=𝔼[𝐆𝐱(i)]𝔼[gradf(𝐱)]=𝔼[gradfi(𝐱)]𝔼[1ni=1ngradfi(𝐱)]=𝟎.\displaystyle\mathbb{E}[\mathbf{X}_{i}]=\mathbb{E}\left[\mathbf{G}^{(i)}_{\mathbf{x}}\right]-\mathbb{E}\left[{\rm grad}f(\mathbf{x})\right]=\mathbb{E}\left[{\rm grad}f_{i}(\mathbf{x})\right]-\mathbb{E}\left[\frac{1}{n}\sum_{i=1}^{n}{\rm grad}f_{i}(\mathbf{x})\right]=\mathbf{0}. (113)

Define a random variable

𝐗:=1|𝒮g|i=1|𝒮g|𝐗i=1|𝒮g|i=1|𝒮g|(𝐆𝐱(i)gradf(𝐱)).\mathbf{X}:=\frac{1}{|\mathcal{S}_{g}|}\sum_{i=1}^{|\mathcal{S}_{g}|}\mathbf{X}_{i}=\frac{1}{|\mathcal{S}_{g}|}\sum_{i=1}^{|\mathcal{S}_{g}|}\left(\mathbf{G}^{(i)}_{\mathbf{x}}-{\rm grad}f(\mathbf{x})\right). (114)

Its variance satisfies

ν(𝐗)\displaystyle\nu(\mathbf{X}) =max{1|𝒮g|2i=1|𝒮g|𝔼[𝐗i𝐗iT]𝐱,1|𝒮g|2i=1|𝒮g|𝔼[𝐗iT𝐗i]𝐱}\displaystyle=\max\left\{\frac{1}{|\mathcal{S}_{g}|^{2}}\left\|\sum_{i=1}^{|\mathcal{S}_{g}|}\mathbb{E}\left[\mathbf{X}_{i}\mathbf{X}_{i}^{T}\right]\right\|_{\mathbf{x}},\frac{1}{|\mathcal{S}_{g}|^{2}}\left\|\sum_{i=1}^{|\mathcal{S}_{g}|}\mathbb{E}\left[\mathbf{X}_{i}^{T}\mathbf{X}_{i}\right]\right\|_{\mathbf{x}}\right\}
1|𝒮g|2max{i=1|𝒮g|𝔼[𝐗i𝐗iT𝐱],i=1|𝒮g|𝔼[𝐗iT𝐗i𝐱]}\displaystyle\leq\frac{1}{|\mathcal{S}_{g}|^{2}}\max\left\{\sum_{i=1}^{|\mathcal{S}_{g}|}\mathbb{E}\left[\left\|\mathbf{X}_{i}\mathbf{X}_{i}^{T}\right\|_{\mathbf{x}}\right],\sum_{i=1}^{|\mathcal{S}_{g}|}\mathbb{E}\left[\left\|\mathbf{X}_{i}^{T}\mathbf{X}_{i}\right\|_{\mathbf{x}}\right]\right\} (115)

Taking \mathbf{G}^{(i)}_{\mathbf{x}}={\rm grad}f_{1}(\mathbf{x}) as an example and applying the definition of K_{g_{max}} in Eq. (53), we have

𝔼[𝐗i𝐱]\displaystyle\mathbb{E}[\|\mathbf{X}_{i}\|_{\mathbf{x}}] =𝔼[gradf1(𝐱)gradf(𝐱)𝐱]=𝔼[gradf1(𝐱)1ni=1ngradfi(𝐱)𝐱]\displaystyle=\mathbb{E}\left[\left\|{\rm grad}f_{1}(\mathbf{x})-{\rm grad}f(\mathbf{x})\right\|_{\mathbf{x}}\right]=\mathbb{E}\left[\left\|{\rm grad}f_{1}(\mathbf{x})-\frac{1}{n}\sum_{i=1}^{n}{\rm grad}f_{i}(\mathbf{x})\right\|_{\mathbf{x}}\right]
=𝔼[n1ngradf1(𝐱)1ni=2ngradfi(𝐱)𝐱]\displaystyle=\mathbb{E}\left[\left\|\frac{n-1}{n}{\rm grad}f_{1}(\mathbf{x})-\frac{1}{n}\sum_{i=2}^{n}{\rm grad}f_{i}(\mathbf{x})\right\|_{\mathbf{x}}\right]
𝔼[(2(n1)2n2gradf1(𝐱)𝐱2+2n2i=2ngradfi(𝐱)𝐱2)12]\displaystyle\leq\mathbb{E}\left[\left(\frac{2(n-1)^{2}}{n^{2}}\left\|{\rm grad}f_{1}(\mathbf{x})\right\|_{\mathbf{x}}^{2}+\frac{2}{n^{2}}\left\|\sum_{i=2}^{n}{\rm grad}f_{i}(\mathbf{x})\right\|_{\mathbf{x}}^{2}\right)^{\frac{1}{2}}\right]
\leq\mathbb{E}\left[\left(\frac{2(n-1)^{2}}{n^{2}}K_{g_{max}}^{2}+\frac{2(n-1)^{2}}{n^{2}}K_{g_{max}}^{2}\right)^{\frac{1}{2}}\right]\leq 2K_{g_{max}}. (116)

Here, the first inequality uses (a+b)^{2}\leq 2a^{2}+2b^{2}. Combining Eqs. (115) and (116), we have

ν(𝐗)1|𝒮g|2i=1|𝒮g|𝔼[𝐗i𝐱]𝔼[𝐗iT𝐱]1|𝒮g|2i=1|𝒮g|4Kgmax2=4|𝒮g|Kgmax2.\displaystyle\nu(\mathbf{X})\leq\frac{1}{|\mathcal{S}_{g}|^{2}}\sum_{i=1}^{|\mathcal{S}_{g}|}\mathbb{E}\left[\left\|\mathbf{X}_{i}\right\|_{\mathbf{x}}\right]\mathbb{E}\left[\left\|\mathbf{X}_{i}^{T}\right\|_{\mathbf{x}}\right]\leq\frac{1}{|\mathcal{S}_{g}|^{2}}\sum_{i=1}^{|\mathcal{S}_{g}|}4K_{g_{max}}^{2}=\frac{4}{|\mathcal{S}_{g}|}K_{g_{max}}^{2}. (117)

Now we are ready to apply Lemma 4. Given \mathbb{E}\left[\frac{1}{|\mathcal{S}_{g}|}\mathbf{X}_{i}\right]=\frac{1}{|\mathcal{S}_{g}|}\mathbb{E}[\mathbf{X}_{i}]=\mathbf{0} and, according to the supremum definition, \left\|\frac{1}{|\mathcal{S}_{g}|}\mathbf{X}_{i}\right\|_{\mathbf{x}}=\frac{1}{|\mathcal{S}_{g}|}\|\mathbf{X}_{i}\|_{\mathbf{x}}\leq\frac{K_{g_{max}}}{|\mathcal{S}_{g}|}, the following is obtained from the matrix Bernstein inequality:

Pr(𝐗𝐱ϵ)\displaystyle{\rm Pr}(\|\mathbf{X}\|_{\mathbf{x}}\geq\epsilon) =Pr(1|𝒮g|i=1|𝒮g|𝐆𝐱(i)gradf(𝐱)𝐱ϵ)\displaystyle={\rm Pr}\left(\left\|\frac{1}{|\mathcal{S}_{g}|}\sum_{i=1}^{|\mathcal{S}_{g}|}\mathbf{G}^{(i)}_{\mathbf{x}}-{\rm grad}f(\mathbf{x})\right\|_{\mathbf{x}}\geq\epsilon\right) (118)
(d+r)exp(ϵ2/24|𝒮g|Kgmax2+Kgmax|𝒮g|ϵ/3)\displaystyle\leq(d+r)\exp\left(\frac{-\epsilon^{2}/2}{\frac{4}{|\mathcal{S}_{g}|}K_{g_{max}}^{2}+\frac{K_{g_{max}}}{|\mathcal{S}_{g}|}\epsilon/3}\right) (119)
(d+r)exp(|𝒮g|ϵ28(Kgmax2+Kgmax))=δ,\displaystyle\leq(d+r)\exp\left(\frac{-|\mathcal{S}_{g}|\epsilon^{2}}{8(K_{g_{max}}^{2}+K_{g_{max}})}\right)=\delta, (120)

of which the last equality implies

\epsilon=2\sqrt{\frac{2\left(K_{g_{max}}^{2}+K_{g_{max}}\right)\ln\left(\frac{d+r}{\delta}\right)}{|\mathcal{S}_{g}|}}. (121)

In other words, with probability at least 1-\delta, we have \|\mathbf{X}\|_{\mathbf{x}}<\epsilon. Letting

\epsilon=2\sqrt{\frac{2\left(K_{g_{max}}^{2}+K_{g_{max}}\right)\ln\left(\frac{d+r}{\delta}\right)}{|\mathcal{S}_{g}|}}\leq\delta_{g}, (122)

we obtain the sample size bound in Eq. (55):

|𝒮g|8(Kgmax2+Kgmax)ln(d+rδ)δg2.\ |\mathcal{S}_{g}|\geq\frac{8\left(K_{g_{max}}^{2}+K_{g_{max}}\right)\ln\left(\frac{d+r}{\delta}\right)}{\delta_{g}^{2}}. (123)

Therefore, we have \|\mathbf{X}\|_{\mathbf{x}}\leq\delta_{g}. Expanding \mathbf{X}, the following is satisfied with probability at least 1-\delta:

𝐗𝐱=1|𝒮g|i=1|𝒮g|𝐆𝐱(i)gradf(𝐱)𝐱=𝐆kgradf(𝐱k)𝐱δg,\|\mathbf{X}\|_{\mathbf{x}}=\left\|\frac{1}{|\mathcal{S}_{g}|}\sum_{i=1}^{|\mathcal{S}_{g}|}\mathbf{G}^{(i)}_{\mathbf{x}}-{\rm grad}f(\mathbf{x})\right\|_{\mathbf{x}}=\left\|\mathbf{G}_{k}-{\rm grad}f(\mathbf{x}_{k})\right\|_{\mathbf{x}}\leq\delta_{g}, (124)

which is Eq. (51) of Condition 1.
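In code, Eq. (123) amounts to a simple sample-size rule; the sketch below (the function name and the numbers in the usage example are illustrative assumptions) returns the smallest |\mathcal{S}_{g}| satisfying the bound, capped at the full sample size n.

import math

def gradient_sample_size(K_gmax, d, r, delta, delta_g, n):
    """Smallest |S_g| satisfying Eq. (123), capped at the full sample size n."""
    bound = 8.0 * (K_gmax ** 2 + K_gmax) * math.log((d + r) / delta) / delta_g ** 2
    return min(n, math.ceil(bound))

# Illustrative numbers: K_gmax = 10, d + r = 1020, delta = 0.01 and delta_g = 1
# request roughly 1.0e4 subsampled gradients, well below the cap n = 50000 here.
print(gradient_sample_size(K_gmax=10.0, d=1000, r=20, delta=0.01, delta_g=1.0, n=50000))

The analogous rule with K_{H_{max}} and \delta_{H}, including the extra K_{H_{max}}/\left\|\bm{\eta}\right\|_{\mathbf{x}} term, yields the Hessian sample size derived below in Eq. (139).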

The proof of the other sample size bound follows the same strategy. A total of |\mathcal{S}_{H}| matrices are uniformly sampled from the set of n Riemannian Hessians \left\{\nabla^{2}\hat{f}_{i}(\mathbf{0}_{x})[\bm{\eta}]\subseteq\mathbb{R}^{d\times r}\right\}_{i=1}^{n}. We denote each sampled element as \mathbf{H}^{(i)}_{\mathbf{x}}[\bm{\eta}], and we have

Pr(𝐇𝐱(i)[𝜼])=1n,i=1,,|𝒮H|.{\rm Pr}\left(\mathbf{H}^{(i)}_{\mathbf{x}}[\bm{\eta}]\right)=\frac{1}{n},\ i=1,...,|\mathcal{S}_{H}|. (125)

Define the random matrix

𝐘i:=𝐇𝐱(i)[𝜼]2f^(𝟎x)[𝜼],i=1,2,..,|𝒮H|.\mathbf{Y}_{i}:=\mathbf{H}^{(i)}_{\mathbf{x}}[\bm{\eta}]-\nabla^{2}\hat{f}(\mathbf{0}_{x})[\bm{\eta}],\ i=1,2,..,|\mathcal{S}_{H}|. (126)

For the problem defined in Eq. (1) with a second-order retraction, we have

𝔼[𝐘i]=𝔼[𝐇𝐱(i)[𝜼]]𝔼[2f^(𝟎x)[𝜼]]\displaystyle\mathbb{E}[\mathbf{Y}_{i}]=\mathbb{E}\left[\mathbf{H}^{(i)}_{\mathbf{x}}[\bm{\eta}]\right]-\mathbb{E}\left[\nabla^{2}\hat{f}(\mathbf{0}_{x})[\bm{\eta}]\right] (127)
=𝔼[2f^i(𝟎x)[𝜼]]𝔼[1ni=1n2f^i(𝟎x)[𝜼]]=𝟎.\displaystyle=\mathbb{E}\left[\nabla^{2}\hat{f}_{i}(\mathbf{0}_{x})[\bm{\eta}]\right]-\mathbb{E}\left[\frac{1}{n}\sum_{i=1}^{n}\nabla^{2}\hat{f}_{i}(\mathbf{0}_{x})[\bm{\eta}]\right]=\mathbf{0}. (128)

Define a random variable

𝐘:=1|𝒮H|i=1|𝒮H|𝐘i=1|𝒮H|i=1|𝒮H|(𝐇𝐱(i)[𝜼]2f^(𝟎x)[𝜼]).\mathbf{Y}:=\frac{1}{|\mathcal{S}_{H}|}\sum_{i=1}^{|\mathcal{S}_{H}|}\mathbf{Y}_{i}=\frac{1}{|\mathcal{S}_{H}|}\sum_{i=1}^{|\mathcal{S}_{H}|}\left(\mathbf{H}^{(i)}_{\mathbf{x}}[\bm{\eta}]-\nabla^{2}\hat{f}(\mathbf{0}_{x})[\bm{\eta}]\right). (129)

Its variance satisfies

ν(𝐘)\displaystyle\nu(\mathbf{Y}) =max{1|𝒮H|2i=1|𝒮H|𝔼[𝐘i𝐘iT]𝐱,1|𝒮H|2i=1|𝒮H|𝔼[𝐘iT𝐘i]𝐱}\displaystyle=\max\left\{\frac{1}{|\mathcal{S}_{H}|^{2}}\left\|\sum_{i=1}^{|\mathcal{S}_{H}|}\mathbb{E}\left[\mathbf{Y}_{i}\mathbf{Y}_{i}^{T}\right]\right\|_{\mathbf{x}},\frac{1}{|\mathcal{S}_{H}|^{2}}\left\|\sum_{i=1}^{|\mathcal{S}_{H}|}\mathbb{E}\left[\mathbf{Y}_{i}^{T}\mathbf{Y}_{i}\right]\right\|_{\mathbf{x}}\right\}
1|𝒮H|2max{i=1|𝒮H|𝔼[𝐘i𝐘iT𝐱],i=1|𝒮H|𝔼[𝐘iT𝐘i𝐱]}.\displaystyle\leq\frac{1}{|\mathcal{S}_{H}|^{2}}\max\left\{\sum_{i=1}^{|\mathcal{S}_{H}|}\mathbb{E}\left[\left\|\mathbf{Y}_{i}\mathbf{Y}_{i}^{T}\right\|_{\mathbf{x}}\right],\sum_{i=1}^{|\mathcal{S}_{H}|}\mathbb{E}\left[\left\|\mathbf{Y}_{i}^{T}\mathbf{Y}_{i}\right\|_{\mathbf{x}}\right]\right\}. (130)

Taking \mathbf{H}^{(i)}_{\mathbf{x}}[\bm{\eta}]=\nabla^{2}\hat{f}_{1}(\mathbf{0}_{x})[\bm{\eta}] as an example and applying the definition of K_{H_{max}} in Eq. (54), we have

𝔼[𝐘i𝐱]\displaystyle\mathbb{E}[\|\mathbf{Y}_{i}\|_{\mathbf{x}}] =𝔼[2f^1(𝟎x)[𝜼]1ni=1n2f^i(𝟎x)[𝜼]𝐱]\displaystyle=\mathbb{E}\left[\left\|\nabla^{2}\hat{f}_{1}(\mathbf{0}_{x})[\bm{\eta}]-\frac{1}{n}\sum_{i=1}^{n}\nabla^{2}\hat{f}_{i}(\mathbf{0}_{x})[\bm{\eta}]\right\|_{\mathbf{x}}\right]
=𝔼[n1n2f^1(𝟎x)[𝜼]1ni=2n2f^i(𝟎x)[𝜼]𝐱]\displaystyle=\mathbb{E}\left[\left\|\frac{n-1}{n}\nabla^{2}\hat{f}_{1}(\mathbf{0}_{x})[\bm{\eta}]-\frac{1}{n}\sum_{i=2}^{n}\nabla^{2}\hat{f}_{i}(\mathbf{0}_{x})[\bm{\eta}]\right\|_{\mathbf{x}}\right]
𝔼[(2(n1)2n22f^1(𝟎x)[𝜼]𝐱2+2n2i=2n2f^i(𝟎x)[𝜼]𝐱2)12]\displaystyle\leq\mathbb{E}\left[\left(\frac{2(n-1)^{2}}{n^{2}}\left\|\nabla^{2}\hat{f}_{1}(\mathbf{0}_{x})[\bm{\eta}]\right\|_{\mathbf{x}}^{2}+\frac{2}{n^{2}}\left\|\sum_{i=2}^{n}\nabla^{2}\hat{f}_{i}(\mathbf{0}_{x})[\bm{\eta}]\right\|_{\mathbf{x}}^{2}\right)^{\frac{1}{2}}\right]
\leq\mathbb{E}\left[\left(\frac{2(n-1)^{2}}{n^{2}}K_{H_{max}}^{2}\left\|\bm{\eta}\right\|_{\mathbf{x}}^{2}+\frac{2(n-1)^{2}}{n^{2}}K_{H_{max}}^{2}\left\|\bm{\eta}\right\|_{\mathbf{x}}^{2}\right)^{\frac{1}{2}}\right]
2KHmax𝜼𝐱.\displaystyle\leq 2K_{H_{max}}\left\|\bm{\eta}\right\|_{\mathbf{x}}. (131)

Here, the first inequality uses (a+b)^{2}\leq 2a^{2}+2b^{2}, and \bm{\eta} is the current moving direction being optimized in Eq. (10). Combining Eqs. (130) and (131), we have

ν(𝐘)\displaystyle\nu(\mathbf{Y}) 1|𝒮H|2i=1|𝒮H|𝔼[𝐘i𝐱]𝔼[𝐘iT𝐱]\displaystyle\leq\frac{1}{|\mathcal{S}_{H}|^{2}}\sum_{i=1}^{|\mathcal{S}_{H}|}\mathbb{E}\left[\left\|\mathbf{Y}_{i}\right\|_{\mathbf{x}}\right]\mathbb{E}\left[\left\|\mathbf{Y}_{i}^{T}\right\|_{\mathbf{x}}\right] (132)
1|𝒮H|2i=1|𝒮H|4KHmax2𝜼𝐱2=4|𝒮H|KHmax2𝜼𝐱2.\displaystyle\leq\frac{1}{|\mathcal{S}_{H}|^{2}}\sum_{i=1}^{|\mathcal{S}_{H}|}4K_{H_{max}}^{2}\left\|\bm{\eta}\right\|_{\mathbf{x}}^{2}=\frac{4}{|\mathcal{S}_{H}|}K_{H_{max}}^{2}\left\|\bm{\eta}\right\|_{\mathbf{x}}^{2}. (133)

We then apply Lemma 4. Given \mathbb{E}\left[\frac{1}{|\mathcal{S}_{H}|}\mathbf{Y}_{i}\right]=\frac{1}{|\mathcal{S}_{H}|}\mathbb{E}[\mathbf{Y}_{i}]=\mathbf{0} and, according to the supremum definition, \left\|\frac{1}{|\mathcal{S}_{H}|}\mathbf{Y}_{i}\right\|_{\mathbf{x}}=\frac{1}{|\mathcal{S}_{H}|}\|\mathbf{Y}_{i}\|_{\mathbf{x}}\leq\frac{K_{H_{max}}\left\|\bm{\eta}\right\|_{\mathbf{x}}}{|\mathcal{S}_{H}|}, the following is obtained from the matrix Bernstein inequality:

Pr(𝐘𝐱ϵ)\displaystyle{\rm Pr}(\|\mathbf{Y}\|_{\mathbf{x}}\geq\epsilon) =Pr(1|𝒮H|i=1|𝒮H|𝐇𝐱(i)[𝜼]2f^(𝟎x)[𝜼]𝐱ϵ)\displaystyle={\rm Pr}\left(\left\|\frac{1}{|\mathcal{S}_{H}|}\sum_{i=1}^{|\mathcal{S}_{H}|}\mathbf{H}^{(i)}_{\mathbf{x}}[\bm{\eta}]-\nabla^{2}\hat{f}(\mathbf{0}_{x})[\bm{\eta}]\right\|_{\mathbf{x}}\geq\epsilon\right) (134)
(d+r)exp(ϵ2/24|𝒮H|KHmax2𝜼𝐱2+KHmax𝜼𝐱|𝒮H|ϵ/3)\displaystyle\leq(d+r)\exp\left(\frac{-\epsilon^{2}/2}{\frac{4}{|\mathcal{S}_{H}|}K_{H_{max}}^{2}\left\|\bm{\eta}\right\|_{\mathbf{x}}^{2}+\frac{K_{H_{max}}\left\|\bm{\eta}\right\|_{\mathbf{x}}}{|\mathcal{S}_{H}|}\epsilon/3}\right) (135)
(d+r)exp(|𝒮H|ϵ28(KHmax2𝜼𝐱2+KHmax𝜼𝐱))=δ,\displaystyle\leq(d+r)\exp\left(\frac{-|\mathcal{S}_{H}|\epsilon^{2}}{8(K_{H_{max}}^{2}\left\|\bm{\eta}\right\|_{\mathbf{x}}^{2}+K_{H_{max}}\left\|\bm{\eta}\right\|_{\mathbf{x}})}\right)=\delta, (136)

of which the last equality indicates

ϵ=22(KHmax2𝜼𝐱2+KHmax𝜼𝐱)ln(d+rδ)|𝒮H|.\epsilon=2\sqrt{\frac{2\left(K_{H_{max}}^{2}\left\|\bm{\eta}\right\|_{\mathbf{x}}^{2}+K_{H_{max}}\left\|\bm{\eta}\right\|_{\mathbf{x}}\right)\ln\left(\frac{d+r}{\delta}\right)}{|\mathcal{S}_{H}|}}. (137)

In other words, with probability at least 1-\delta, we have \|\mathbf{Y}\|_{\mathbf{x}}<\epsilon. By letting

ϵ=22(KHmax2𝜼𝐱2+KHmax𝜼𝐱)ln(d+rδ)|𝒮H|δH𝜼𝐱,\epsilon=2\sqrt{\frac{2\left(K_{H_{max}}^{2}\left\|\bm{\eta}\right\|_{\mathbf{x}}^{2}+K_{H_{max}}\left\|\bm{\eta}\right\|_{\mathbf{x}}\right)\ln\left(\frac{d+r}{\delta}\right)}{|\mathcal{S}_{H}|}}\leq\delta_{H}\left\|\bm{\eta}\right\|_{\mathbf{x}}, (138)

we obtain the sample size bound in Eq. (56):

|𝒮H|8(KHmax2+KHmax𝜼𝐱)ln(d+rδ)δH2.\displaystyle|\mathcal{S}_{H}|\geq\frac{8\left(K_{H_{max}}^{2}+\frac{K_{H_{max}}}{\left\|\bm{\eta}\right\|_{\mathbf{x}}}\right)\ln\left(\frac{d+r}{\delta}\right)}{\delta_{H}^{2}}. (139)

Then we have \|\mathbf{Y}\|_{\mathbf{x}}\leq\delta_{H}\left\|\bm{\eta}\right\|_{\mathbf{x}}. Expanding \mathbf{Y}, the following is satisfied with probability at least 1-\delta:

𝐘𝐱=1|𝒮H|i=1|𝒮H|𝐇𝐱(i)[𝜼]2f^(𝟎x)[𝜼]𝐱=𝐇k[𝜼]2f^(𝟎x)[𝜼]𝐱δH𝜼𝐱,\|\mathbf{Y}\|_{\mathbf{x}}=\left\|\frac{1}{|\mathcal{S}_{H}|}\sum_{i=1}^{|\mathcal{S}_{H}|}\mathbf{H}^{(i)}_{\mathbf{x}}[\bm{\eta}]-\nabla^{2}\hat{f}(\mathbf{0}_{x})[\bm{\eta}]\right\|_{\mathbf{x}}=\left\|\mathbf{H}_{k}[\bm{\eta}]-\nabla^{2}\hat{f}(\mathbf{0}_{x})[\bm{\eta}]\right\|_{\mathbf{x}}\leq\delta_{H}\left\|\bm{\eta}\right\|_{\mathbf{x}}, (140)

which is Eq. (52) of Condition 1.

Appendix F Appendix: Theorem 3 and Corollary 1

F.1 Supporting Lemmas for Theorem 3

Lemma 5

Suppose Condition 1 and Assumptions 1 and 2 hold. Then, for the case of \|\mathbf{G}_{k}\|\geq\epsilon_{g}, we have

f^k(𝜼k)m^k(𝜼k)(LH6σk3)𝜼k𝒙k3+δg𝜼k𝒙k+12δH𝜼k𝒙k2.\begin{split}\hat{f}_{k}(\bm{\eta}_{k})-\hat{m}_{k}(\bm{\eta}_{k})\leq\left(\frac{L_{H}}{6}-\frac{\sigma_{k}}{3}\right)\|\bm{\eta}_{k}\|_{\bm{x}_{k}}^{3}+\delta_{g}\|\bm{\eta}_{k}\|_{\bm{x}_{k}}+\frac{1}{2}\delta_{H}\|\bm{\eta}_{k}\|_{\bm{x}_{k}}^{2}.\end{split} (141)
Proof

We start from the left-hand side of Eq. (141), and this leads to

f^k(𝜼k)m^k(𝜼k)\displaystyle\hat{f}_{k}(\bm{\eta}_{k})-\hat{m}_{k}(\bm{\eta}_{k})
=\displaystyle=\; f^k(𝜼k)f(𝒙k)𝐆k,𝜼k𝒙k12𝜼k,𝐇k[𝜼k]𝒙kσk3𝜼k𝒙k3\displaystyle\hat{f}_{k}(\bm{\eta}_{k})-f(\bm{x}_{k})-\left\langle\mathbf{G}_{k},\bm{\eta}_{k}\right\rangle_{\bm{x}_{k}}-\frac{1}{2}\left\langle\bm{\eta}_{k},\mathbf{H}_{k}[\bm{\eta}_{k}]\right\rangle_{\bm{x}_{k}}-\frac{\sigma_{k}}{3}\|\bm{\eta}_{k}\|_{\bm{x}_{k}}^{3}
=\displaystyle=\; f^k(𝜼k)f(𝒙k)gradf(𝒙k),𝜼k𝒙k12𝜼k,2f^k(𝟎𝒙k)[𝜼k]𝒙k\displaystyle\hat{f}_{k}(\bm{\eta}_{k})-f(\bm{x}_{k})-\langle\text{grad}f(\bm{x}_{k}),\bm{\eta}_{k}\rangle_{\bm{x}_{k}}-\frac{1}{2}\left\langle\bm{\eta}_{k},\nabla^{2}\hat{f}_{k}(\bm{0}_{\bm{x}_{k}})[\bm{\eta}_{k}]\right\rangle_{\bm{x}_{k}}
+gradf(𝒙k),𝜼k𝒙k𝐆k,𝜼k𝒙k+12𝜼k,2f^k(𝟎𝒙k)[𝜼k]𝒙k\displaystyle+\langle\text{grad}f(\bm{x}_{k}),\bm{\eta}_{k}\rangle_{\bm{x}_{k}}-\langle\mathbf{G}_{k},\bm{\eta}_{k}\rangle_{\bm{x}_{k}}+\frac{1}{2}\left\langle\bm{\eta}_{k},\nabla^{2}\hat{f}_{k}(\bm{0}_{\bm{x}_{k}})[\bm{\eta}_{k}]\right\rangle_{\bm{x}_{k}}
12𝜼k,𝐇k[𝜼k]𝒙kσk3𝜼k𝒙k3\displaystyle-\frac{1}{2}\left\langle\bm{\eta}_{k},\mathbf{H}_{k}[\bm{\eta}_{k}]\right\rangle_{\bm{x}_{k}}-\frac{\sigma_{k}}{3}\|\bm{\eta}_{k}\|_{\bm{x}_{k}}^{3}
\displaystyle\leq\; |f^k(𝜼k)f(𝒙k)gradf(𝒙k),𝜼k𝒙k12𝜼k,2f^k(𝟎𝒙k)[𝜼k]𝒙k|\displaystyle\left|\hat{f}_{k}(\bm{\eta}_{k})-f(\bm{x}_{k})-\langle\text{grad}f(\bm{x}_{k}),\bm{\eta}_{k}\rangle_{\bm{x}_{k}}-\frac{1}{2}\left\langle\bm{\eta}_{k},\nabla^{2}\hat{f}_{k}(\bm{0}_{\bm{x}_{k}})[\bm{\eta}_{k}]\right\rangle_{\bm{x}_{k}}\right|
+|gradf(𝒙k),𝜼k𝒙k𝐆k,𝜼k𝒙k|\displaystyle+\left|\langle\text{grad}f(\bm{x}_{k}),\bm{\eta}_{k}\rangle_{\bm{x}_{k}}-\langle\mathbf{G}_{k},\bm{\eta}_{k}\rangle_{\bm{x}_{k}}\right|
+|12𝜼k,2f^k(𝟎𝒙k)[𝜼k]𝒙k12𝜼k,𝐇k[𝜼k]𝒙k|σk3𝜼k𝒙k3\displaystyle+\left|\frac{1}{2}\left\langle\bm{\eta}_{k},\nabla^{2}\hat{f}_{k}(\bm{0}_{\bm{x}_{k}})[\bm{\eta}_{k}]\right\rangle_{\bm{x}_{k}}-\frac{1}{2}\langle\bm{\eta}_{k},\mathbf{H}_{k}[\bm{\eta}_{k}]\rangle_{\bm{x}_{k}}\right|-\frac{\sigma_{k}}{3}\|\bm{\eta}_{k}\|_{\bm{x}_{k}}^{3}
\displaystyle\leq\; 16LH𝜼k𝒙k3+δg𝜼k𝒙k+12δH𝜼k𝒙k2σk3𝜼k𝒙k3\displaystyle\frac{1}{6}L_{H}\|\bm{\eta}_{k}\|_{\bm{x}_{k}}^{3}+\delta_{g}\|\bm{\eta}_{k}\|_{\bm{x}_{k}}+\frac{1}{2}\delta_{H}\|\bm{\eta}_{k}\|_{\bm{x}_{k}}^{2}-\frac{\sigma_{k}}{3}\|\bm{\eta}_{k}\|_{\bm{x}_{k}}^{3}
=\displaystyle=\; (LH6σk3)𝜼k𝒙k3+δg𝜼k𝒙k+12δH𝜼k𝒙k2,\displaystyle\left(\frac{L_{H}}{6}-\frac{\sigma_{k}}{3}\right)\|\bm{\eta}_{k}\|_{\bm{x}_{k}}^{3}+\delta_{g}\|\bm{\eta}_{k}\|_{\bm{x}_{k}}+\frac{1}{2}\delta_{H}\|\bm{\eta}_{k}\|_{\bm{x}_{k}}^{2}, (142)

where the first inequality uses the Cauchy-Schwarz inequality and the second one uses Eqs. (48), (51) and (52). The term \left\langle\mathbf{G}_{k},\bm{\eta}_{k}\right\rangle_{\bm{x}_{k}} cannot be neglected here because of the condition \|\mathbf{G}_{k}\|\geq\epsilon_{g} used in constructing the subproblem model \hat{m}_{k}, and this results in the term \delta_{g}\|\bm{\eta}_{k}\|_{\bm{x}_{k}} in Eq. (142).

Lemma 6

Suppose Condition 1 and Assumptions 1 and 2 hold. Then, for the case of 𝐆k<ϵg\|\mathbf{G}_{k}\|<\epsilon_{g} and λmin(𝐇k)<ϵH\lambda_{min}(\mathbf{H}_{k})<-\epsilon_{H}, we have

f^k(𝜼k)m^k(𝜼k)(LH6σk3)𝜼k𝒙k3+12δH𝜼k𝒙k2.\begin{split}\hat{f}_{k}(\bm{\eta}_{k})-\hat{m}_{k}(\bm{\eta}_{k})\leq\left(\frac{L_{H}}{6}-\frac{\sigma_{k}}{3}\right)\|\bm{\eta}_{k}\|_{\bm{x}_{k}}^{3}+\frac{1}{2}\delta_{H}\|\bm{\eta}_{k}\|_{\bm{x}_{k}}^{2}.\end{split} (143)
Proof

For each 𝜼k\bm{\eta}_{k}, at least one of the following two inequalities holds:

𝜼k,gradf(𝒙k)𝒙k\displaystyle\langle\bm{\eta}_{k},\text{grad}f(\bm{x}_{k})\rangle_{\bm{x}_{k}}\leq 0,\displaystyle 0, (144)
𝜼k,gradf(𝒙k)𝒙k\displaystyle\langle-\bm{\eta}_{k},\text{grad}f(\bm{x}_{k})\rangle_{\bm{x}_{k}}\leq 0.\displaystyle 0. (145)

Without loss of generality, we assume 𝜼k,gradf(𝒙k)𝒙k0\langle\bm{\eta}_{k},\text{grad}f(\bm{x}_{k})\rangle_{\bm{x}_{k}}\leq 0, which is also the assumption adopted by yao2021inexact. It then follows that

f^k(𝜼k)m^k(𝜼k)\displaystyle\hat{f}_{k}(\bm{\eta}_{k})-\hat{m}_{k}(\bm{\eta}_{k})
=\displaystyle=\; f^k(𝜼k)f(𝒙k)12𝜼k,𝐇k[𝜼k]𝒙kσk3𝜼k𝒙k3\displaystyle\hat{f}_{k}(\bm{\eta}_{k})-f(\bm{x}_{k})-\frac{1}{2}\langle\bm{\eta}_{k},\mathbf{H}_{k}[\bm{\eta}_{k}]\rangle_{\bm{x}_{k}}-\frac{\sigma_{k}}{3}\|\bm{\eta}_{k}\|_{\bm{x}_{k}}^{3}
=\displaystyle=\; f^k(𝜼k)f(𝒙k)gradf(𝒙k),𝜼k𝒙k12𝜼k,2f^k(𝟎𝒙k)[𝜼k]𝒙k\displaystyle\hat{f}_{k}(\bm{\eta}_{k})-f(\bm{x}_{k})-\langle\text{grad}f(\bm{x}_{k}),\bm{\eta}_{k}\rangle_{\bm{x}_{k}}-\frac{1}{2}\left\langle\bm{\eta}_{k},\nabla^{2}\hat{f}_{k}(\bm{0}_{\bm{x}_{k}})[\bm{\eta}_{k}]\right\rangle_{\bm{x}_{k}}
+gradf(𝒙k),𝜼k𝒙k+12𝜼k,2f^k(𝟎𝒙k)[𝜼k]𝒙k\displaystyle+\langle\text{grad}f(\bm{x}_{k}),\bm{\eta}_{k}\rangle_{\bm{x}_{k}}+\frac{1}{2}\left\langle\bm{\eta}_{k},\nabla^{2}\hat{f}_{k}(\bm{0}_{\bm{x}_{k}})[\bm{\eta}_{k}]\right\rangle_{\bm{x}_{k}}
12𝜼k,𝐇k[𝜼k]𝒙kσk3𝜼k𝒙k3\displaystyle-\frac{1}{2}\langle\bm{\eta}_{k},\mathbf{H}_{k}[\bm{\eta}_{k}]\rangle_{\bm{x}_{k}}-\frac{\sigma_{k}}{3}\|\bm{\eta}_{k}\|_{\bm{x}_{k}}^{3}
\displaystyle\leq\; |f^k(𝜼k)f(𝒙k)gradf(𝒙k),𝜼k𝒙k12𝜼k,2f^k(𝟎𝒙k)[𝜼k]𝒙k|\displaystyle\left|\hat{f}_{k}(\bm{\eta}_{k})-f(\bm{x}_{k})-\langle\text{grad}f(\bm{x}_{k}),\bm{\eta}_{k}\rangle_{\bm{x}_{k}}-\frac{1}{2}\left\langle\bm{\eta}_{k},\nabla^{2}\hat{f}_{k}(\bm{0}_{\bm{x}_{k}})[\bm{\eta}_{k}]\right\rangle_{\bm{x}_{k}}\right|
+|12𝜼k,2f^k(𝟎𝒙k)[𝜼k]𝒙k12𝜼k,𝐇k[𝜼k]𝒙k|σk3𝜼k𝒙k3\displaystyle+\left|\frac{1}{2}\left\langle\bm{\eta}_{k},\nabla^{2}\hat{f}_{k}(\bm{0}_{\bm{x}_{k}})[\bm{\eta}_{k}]\right\rangle_{\bm{x}_{k}}-\frac{1}{2}\langle\bm{\eta}_{k},\mathbf{H}_{k}[\bm{\eta}_{k}]\rangle_{\bm{x}_{k}}\right|-\frac{\sigma_{k}}{3}\|\bm{\eta}_{k}\|_{\bm{x}_{k}}^{3}
\displaystyle\leq\; 16LH𝜼k𝒙k3+12δH𝜼k𝒙k2σk3𝜼k𝒙k3\displaystyle\frac{1}{6}L_{H}\|\bm{\eta}_{k}\|_{\bm{x}_{k}}^{3}+\frac{1}{2}\delta_{H}\|\bm{\eta}_{k}\|_{\bm{x}_{k}}^{2}-\frac{\sigma_{k}}{3}\|\bm{\eta}_{k}\|_{\bm{x}_{k}}^{3}
=\displaystyle=\; (LH6σk3)𝜼k𝒙k3+12δH𝜼k𝒙k2.\displaystyle\left(\frac{L_{H}}{6}-\frac{\sigma_{k}}{3}\right)\|\bm{\eta}_{k}\|_{\bm{x}_{k}}^{3}+\frac{1}{2}\delta_{H}\|\bm{\eta}_{k}\|_{\bm{x}_{k}}^{2}. (146)

In this case, the condition 𝐆k<ϵg\|\mathbf{G}_{k}\|<\epsilon_{g} and λmin(𝐇k)<ϵH\lambda_{min}(\mathbf{H}_{k})<-\epsilon_{H} means that the term 𝐆k,𝜼k𝒙k\left\langle\mathbf{G}_{k},\bm{\eta}_{k}\right\rangle_{\bm{x}_{k}} in the model m^k\hat{m}_{k} can be neglected although the optimization process is not yet finished. As a result, the term δg𝜼k𝒙k\delta_{g}\|\bm{\eta}_{k}\|_{\bm{x}_{k}} no longer appears in Eq. (146), unlike in Eq. (142).

Lemma 7

Suppose Condition 1 and Assumptions 1, 2, 3, 4 and 5 hold. Then, when 𝐆kϵg\|\mathbf{G}_{k}\|\geq\epsilon_{g} and σkLH\sigma_{k}\geq L_{H}, we have

(LH6σk3)𝜼k𝒙k3+δg𝜼k𝒙k+δH2𝜼k𝒙k2δg𝜼kC𝒙k+δH2𝜼kC𝒙k2.\begin{split}\left(\frac{L_{H}}{6}-\frac{\sigma_{k}}{3}\right)\|\bm{\eta}_{k}\|_{\bm{x}_{k}}^{3}&+\delta_{g}\|\bm{\eta}_{k}\|_{\bm{x}_{k}}+\frac{\delta_{H}}{2}\|\bm{\eta}_{k}\|_{\bm{x}_{k}}^{2}\leq\delta_{g}\left\|\bm{\eta}_{k}^{C}\right\|_{\bm{x}_{k}}+\frac{\delta_{H}}{2}\left\|\bm{\eta}_{k}^{C}\right\|_{\bm{x}_{k}}^{2}.\end{split} (147)
Proof

According to Lemma 6 of xu2020newton, we have

𝜼kC𝒙k12σk(KH2+4σk𝐆k𝒙kKH).\begin{split}\left\|\bm{\eta}_{k}^{C}\right\|_{\bm{x}_{k}}\geq\frac{1}{2\sigma_{k}}\left(\sqrt{K_{H}^{2}+4\sigma_{k}\|\mathbf{G}_{k}\|_{\bm{x}_{k}}}-K_{H}\right).\end{split} (148)

Then we consider two cases for 𝜼kC𝒙k\left\|\bm{\eta}_{k}^{C}\right\|_{\bm{x}_{k}}.

(i) If 𝜼k𝒙k𝜼kC𝒙k\|\bm{\eta}_{k}\|_{\bm{x}_{k}}\leq\left\|\bm{\eta}_{k}^{C}\right\|_{\bm{x}_{k}}, since σkLH\sigma_{k}\geq L_{H}, it follows that

(LH6σk3)𝜼k𝒙k3+δg𝜼k𝒙k+δH2𝜼k𝒙k2δg𝜼kC𝒙k+δH2𝜼kC𝒙k2.\begin{split}\left(\frac{L_{H}}{6}-\frac{\sigma_{k}}{3}\right)\|\bm{\eta}_{k}\|_{\bm{x}_{k}}^{3}+\delta_{g}\|\bm{\eta}_{k}\|_{\bm{x}_{k}}+\frac{\delta_{H}}{2}\|\bm{\eta}_{k}\|_{\bm{x}_{k}}^{2}\leq\delta_{g}\left\|\bm{\eta}_{k}^{C}\right\|_{\bm{x}_{k}}+\frac{\delta_{H}}{2}\left\|\bm{\eta}_{k}^{C}\right\|_{\bm{x}_{k}}^{2}.\end{split} (149)

(ii) If 𝜼k𝒙k𝜼kC𝒙k\|\bm{\eta}_{k}\|_{\bm{x}_{k}}\geq\left\|\bm{\eta}_{k}^{C}\right\|_{\bm{x}_{k}}, since σkLH\sigma_{k}\geq L_{H}, we have

(LH6σk3)𝜼k𝒙k3+δg𝜼k𝒙k+δH2𝜼k𝒙k2\displaystyle\;\left(\frac{L_{H}}{6}-\frac{\sigma_{k}}{3}\right)\|\bm{\eta}_{k}\|_{\bm{x}_{k}}^{3}+\delta_{g}\|\bm{\eta}_{k}\|_{\bm{x}_{k}}+\frac{\delta_{H}}{2}\|\bm{\eta}_{k}\|_{\bm{x}_{k}}^{2}
\displaystyle\leq σk6𝜼k𝒙k3+δg𝜼k𝒙k+δH2𝜼k𝒙k2\displaystyle\;-\frac{\sigma_{k}}{6}\|\bm{\eta}_{k}\|_{\bm{x}_{k}}^{3}+\delta_{g}\|\bm{\eta}_{k}\|_{\bm{x}_{k}}+\frac{\delta_{H}}{2}\|\bm{\eta}_{k}\|_{\bm{x}_{k}}^{2}
\displaystyle\leq (σk12𝜼kC𝒙k+δH2)𝜼k𝒙k2+(σk12𝜼kC𝒙k2+δg)𝜼k𝒙k\displaystyle\;\left(-\frac{\sigma_{k}}{12}\left\|\bm{\eta}_{k}^{C}\right\|_{\bm{x}_{k}}+\frac{\delta_{H}}{2}\right)\|\bm{\eta}_{k}\|_{\bm{x}_{k}}^{2}+\left(-\frac{\sigma_{k}}{12}\left\|\bm{\eta}_{k}^{C}\right\|_{\bm{x}_{k}}^{2}+\delta_{g}\right)\|\bm{\eta}_{k}\|_{\bm{x}_{k}}
\displaystyle\leq (KH2+4σk𝐆k𝒙kKH24+δH2)𝜼k𝒙k2+\displaystyle\;\left(-\frac{\sqrt{K_{H}^{2}+4\sigma_{k}\|\mathbf{G}_{k}\|_{\bm{x}_{k}}}-K_{H}}{24}+\frac{\delta_{H}}{2}\right)\|\bm{\eta}_{k}\|_{\bm{x}_{k}}^{2}+
(112(KH2+4σk𝐆k𝒙kKH)2𝐆k𝒙k4σk𝐆k𝒙k+δg)𝜼k𝒙k\displaystyle\;\left(-\frac{\frac{1}{12}\left(\sqrt{K_{H}^{2}+4\sigma_{k}\|\mathbf{G}_{k}\|_{\bm{x}_{k}}}-K_{H}\right)^{2}\|\mathbf{G}_{k}\|_{\bm{x}_{k}}}{4\sigma_{k}\|\mathbf{G}_{k}\|_{\bm{x}_{k}}}+\delta_{g}\right)\|\bm{\eta}_{k}\|_{\bm{x}_{k}}
\displaystyle\leq (KH2+4LHϵgKH24+δH2)𝜼k𝒙k2+\displaystyle\;\left(-\frac{\sqrt{K_{H}^{2}+4L_{H}\epsilon_{g}}-K_{H}}{24}+\frac{\delta_{H}}{2}\right)\|\bm{\eta}_{k}\|_{\bm{x}_{k}}^{2}+
(112(KH2+4LHϵgKH)2ϵg4LHϵg+δg)𝜼k𝒙k\displaystyle\;\left(-\frac{\frac{1}{12}\left(\sqrt{K_{H}^{2}+4L_{H}\epsilon_{g}}-K_{H}\right)^{2}\epsilon_{g}}{4L_{H}\epsilon_{g}}+\delta_{g}\right)\|\bm{\eta}_{k}\|_{\bm{x}_{k}}
\displaystyle\leq  0+0δg𝜼kC𝒙k+δH2𝜼kC𝒙k2.\displaystyle\;0+0\leq\delta_{g}\left\|\bm{\eta}_{k}^{C}\right\|_{\bm{x}_{k}}+\frac{\delta_{H}}{2}\left\|\bm{\eta}_{k}^{C}\right\|_{\bm{x}_{k}}^{2}. (150)

The third inequality follows from Eq. (148). The fourth inequality holds since the function h(x)=(α2+xα)2xh(x)=\frac{\left(\sqrt{\alpha^{2}+x}-\alpha\right)^{2}}{x} is an increasing function of xx for α0\alpha\geq 0. The penultimate inequality also holds given Eqs. (57), (58) and 1τ12112\frac{1-\tau}{12}\leq\frac{1}{12} since τ(0,1]\tau\in(0,1].
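As a quick numerical sanity check of the monotonicity claim used in the fourth inequality above, the following Python snippet (illustrative only; the function name and the grid are our own choices) evaluates h(x) = (sqrt(α² + x) − α)² / x on an increasing grid and confirms that it is non-decreasing in x for several values of α ≥ 0.

```python
import numpy as np

def h(x, alpha):
    # h(x) = (sqrt(alpha^2 + x) - alpha)^2 / x, the function used in the fourth inequality above
    return (np.sqrt(alpha**2 + x) - alpha)**2 / x

xs = np.linspace(1e-3, 100.0, 10_000)       # increasing grid of x > 0
for alpha in [0.0, 0.5, 1.0, 5.0]:          # sample values of alpha >= 0
    vals = h(xs, alpha)
    # successive differences should be (numerically) non-negative if h is increasing in x
    assert np.all(np.diff(vals) >= -1e-12), f"monotonicity violated for alpha={alpha}"
print("h(x) is numerically non-decreasing in x for all tested alpha values")
```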

Lemma 8

Suppose Condition 1 and Assumptions 1, 2, 4 and 5 hold. Then, when 𝐆kϵg\|\mathbf{G}_{k}\|\leq\epsilon_{g} and λmin(𝐇k)ϵH\lambda_{min}(\mathbf{H}_{k})\leq-\epsilon_{H}, if σkLH\sigma_{k}\geq L_{H}, we have

(LH6σk3)𝜼k𝒙k3+δH2𝜼k𝒙k2δH2𝜼kE𝒙k2.\begin{split}\left(\frac{L_{H}}{6}-\frac{\sigma_{k}}{3}\right)\|\bm{\eta}_{k}\|_{\bm{x}_{k}}^{3}&+\frac{\delta_{H}}{2}\|\bm{\eta}_{k}\|_{\bm{x}_{k}}^{2}\leq\frac{\delta_{H}}{2}\left\|\bm{\eta}_{k}^{E}\right\|_{\bm{x}_{k}}^{2}.\end{split} (151)
Proof

From Lemma 7 in xu2020newton, we have

𝜼kE𝒙kνσk|λmin(𝐇k)|.\begin{split}\left\|\bm{\eta}_{k}^{E}\right\|_{\bm{x}_{k}}\geq\frac{\nu}{\sigma_{k}}|\lambda_{min}(\mathbf{H}_{k})|.\end{split} (152)

Then we consider two cases for 𝜼kE𝒙k\left\|\bm{\eta}_{k}^{E}\right\|_{\bm{x}_{k}}.

(i) If 𝜼k𝒙k𝜼kE𝒙k\|\bm{\eta}_{k}\|_{\bm{x}_{k}}\leq\left\|\bm{\eta}_{k}^{E}\right\|_{\bm{x}_{k}}, since σkLH\sigma_{k}\geq L_{H}, it follows that

(LH6σk3)𝜼k𝒙k3+δH2𝜼k𝒙k2δH2𝜼k𝒙k2δH2𝜼kE𝒙k2.\begin{split}\left(\frac{L_{H}}{6}-\frac{\sigma_{k}}{3}\right)\|\bm{\eta}_{k}\|_{\bm{x}_{k}}^{3}+\frac{\delta_{H}}{2}\|\bm{\eta}_{k}\|_{\bm{x}_{k}}^{2}\leq\frac{\delta_{H}}{2}\|\bm{\eta}_{k}\|_{\bm{x}_{k}}^{2}\leq\frac{\delta_{H}}{2}\left\|\bm{\eta}_{k}^{E}\right\|_{\bm{x}_{k}}^{2}.\end{split} (153)

(ii) If 𝜼k𝒙k𝜼kE𝒙k\|\bm{\eta}_{k}\|_{\bm{x}_{k}}\geq\left\|\bm{\eta}_{k}^{E}\right\|_{\bm{x}_{k}}, since σkLH\sigma_{k}\geq L_{H}, we have

(LH6σk3)𝜼k𝒙k3+δH2𝜼k𝒙k2\displaystyle\;\left(\frac{L_{H}}{6}-\frac{\sigma_{k}}{3}\right)\|\bm{\eta}_{k}\|_{\bm{x}_{k}}^{3}+\frac{\delta_{H}}{2}\|\bm{\eta}_{k}\|_{\bm{x}_{k}}^{2} (154)
\displaystyle\leq σk6𝜼k𝒙k3+δH2𝜼k𝒙k2σk6𝜼kE𝒙k𝜼k𝒙k2+δH2𝜼k𝒙k2\displaystyle\;-\frac{\sigma_{k}}{6}\|\bm{\eta}_{k}\|_{\bm{x}_{k}}^{3}+\frac{\delta_{H}}{2}\|\bm{\eta}_{k}\|_{\bm{x}_{k}}^{2}\leq-\frac{\sigma_{k}}{6}\left\|\bm{\eta}_{k}^{E}\right\|_{\bm{x}_{k}}\|\bm{\eta}_{k}\|_{\bm{x}_{k}}^{2}+\frac{\delta_{H}}{2}\|\bm{\eta}_{k}\|_{\bm{x}_{k}}^{2}
\displaystyle\leq (ν6|λmin(𝐇k)|+(1τ)νϵH6)𝜼k𝒙k2(νϵH6+νϵH6)𝜼k𝒙k20δH2𝜼kE𝒙k2.\displaystyle\;\left(-\frac{\nu}{6}|\lambda_{min}(\mathbf{H}_{k})|+\frac{(1-\tau)\nu\epsilon_{H}}{6}\right)\|\bm{\eta}_{k}\|_{\bm{x}_{k}}^{2}\leq\left(-\frac{\nu\epsilon_{H}}{6}+\frac{\nu\epsilon_{H}}{6}\right)\|\bm{\eta}_{k}\|_{\bm{x}_{k}}^{2}\leq 0\leq\frac{\delta_{H}}{2}\left\|\bm{\eta}_{k}^{E}\right\|_{\bm{x}_{k}}^{2}.

The third inequality follows from Eqs. (58) and (152), and the fourth one holds since |λmin(𝐇k)|ϵH|\lambda_{min}(\mathbf{H}_{k})|\geq\epsilon_{H} and τ<1\tau<1.

Lemma 9

Suppose Condition 1 and Assumptions 1, 2, 3, 4 and 5 hold. Then, when 𝐆k𝐱kϵg\|\mathbf{G}_{k}\|_{\bm{x}_{k}}\geq\epsilon_{g}, if σkLH\sigma_{k}\geq L_{H}, iteration kk is successful, i.e., ρkτ\rho_{k}\geq\tau and σk+1σk\sigma_{k+1}\leq\sigma_{k}.

Proof

We have that

1ρk\displaystyle 1-\rho_{k} =f^k(𝜼k)m^k(𝜼k)m^k(𝟎𝒙k)m^k(𝜼k)\displaystyle=\frac{\hat{f}_{k}(\bm{\eta}_{k})-\hat{m}_{k}(\bm{\eta}_{k})}{\hat{m}_{k}(\bm{0}_{\bm{x}_{k}})-\hat{m}_{k}(\bm{\eta}_{k})}
(LH6σk3)𝜼k𝒙k3+δg𝜼k𝒙k+12δH𝜼k𝒙k2m^k(𝟎𝒙k)m^k(𝜼kC)\displaystyle\leq\frac{\left(\frac{L_{H}}{6}-\frac{\sigma_{k}}{3}\right)\|\bm{\eta}_{k}\|_{\bm{x}_{k}}^{3}+\delta_{g}\|\bm{\eta}_{k}\|_{\bm{x}_{k}}+\frac{1}{2}\delta_{H}\|\bm{\eta}_{k}\|_{\bm{x}_{k}}^{2}}{\hat{m}_{k}(\bm{0}_{\bm{x}_{k}})-\hat{m}_{k}\left(\bm{\eta}_{k}^{C}\right)}
δg𝜼kC𝒙k+12δH𝜼kC𝒙k2112𝜼kC𝒙k2(KH2+4σk𝐆k𝒙kKH)\displaystyle\leq\frac{\delta_{g}\left\|\bm{\eta}_{k}^{C}\right\|_{\bm{x}_{k}}+\frac{1}{2}\delta_{H}\left\|\bm{\eta}_{k}^{C}\right\|_{\bm{x}_{k}}^{2}}{\frac{1}{12}\left\|\bm{\eta}_{k}^{C}\right\|_{\bm{x}_{k}}^{2}\left(\sqrt{K_{H}^{2}+4\sigma_{k}\|\mathbf{G}_{k}\|_{\bm{x}_{k}}}-K_{H}\right)}
4σk𝐆k𝒙kδg16𝐆k𝒙k(KH2+4σk𝐆k𝒙kKH)2+6δHKH2+4LHϵgKH\displaystyle\leq\frac{4\sigma_{k}\|\mathbf{G}_{k}\|_{\bm{x}_{k}}\delta_{g}}{\frac{1}{6}\|\mathbf{G}_{k}\|_{\bm{x}_{k}}\left(\sqrt{K_{H}^{2}+4\sigma_{k}\|\mathbf{G}_{k}\|_{\bm{x}_{k}}}-K_{H}\right)^{2}}+\frac{6\delta_{H}}{\sqrt{K_{H}^{2}+4L_{H}\epsilon_{g}}-K_{H}}
4LHϵgδg16ϵg(KH2+4LHϵgKH)2+6δH(KH2+4LHϵgKH)\displaystyle\leq\frac{4L_{H}\epsilon_{g}\delta_{g}}{\frac{1}{6}\epsilon_{g}\left(\sqrt{K_{H}^{2}+4L_{H}\epsilon_{g}}-K_{H}\right)^{2}}+\frac{6\delta_{H}}{\left(\sqrt{K_{H}^{2}+4L_{H}\epsilon_{g}}-K_{H}\right)}
1τ2+1τ2=1τ,\displaystyle\leq\frac{1-\tau}{2}+\frac{1-\tau}{2}=1-\tau, (155)

where the first inequality follows from Eqs. (36) and (141), the second from Eqs. (147), (36) and (39), and the third from Eq. (148). The last two inequalities hold given Eqs. (57), (58) and the fact that the function h(x)=x(α2+xα)2h(x)=\frac{x}{\left(\sqrt{\alpha^{2}+x}-\alpha\right)^{2}} is decreasing in xx for α>0\alpha>0. Consequently, we have ρkτ\rho_{k}\geq\tau, i.e., the iteration is successful. Based on Step (12) of Algorithm 1, we have σk+1=max(σk/γ,ϵσ)σk\sigma_{k+1}=\max(\sigma_{k}/\gamma,\epsilon_{\sigma})\leq\sigma_{k}.
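For concreteness, the acceptance test and the penalty update referred to as Step (12) of Algorithm 1 can be sketched as below. This is a minimal illustrative sketch in Python, assuming the quantities ρ_k, τ, γ and ϵ_σ defined in the paper; the function name update_sigma is ours, and the actual algorithm may organize these steps differently.

```python
def update_sigma(rho_k, sigma_k, tau, gamma, eps_sigma):
    """Sketch (not the paper's exact implementation) of the success test and
    cubic-regularization update discussed above: a step is accepted when
    rho_k >= tau, in which case sigma_{k+1} = max(sigma_k / gamma, eps_sigma);
    otherwise sigma is inflated as sigma_{k+1} = gamma * sigma_k (gamma > 1)."""
    successful = rho_k >= tau
    if successful:
        sigma_next = max(sigma_k / gamma, eps_sigma)   # successful iteration: relax sigma
    else:
        sigma_next = gamma * sigma_k                   # unsuccessful iteration: inflate sigma
    return successful, sigma_next
```

The inflation branch is the one appearing in the contradiction argument of Lemma 11 below.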

Lemma 10

Suppose Condition 1 and Assumptions 1, 2, 4 and 5 hold. Then, when 𝐆k𝐱k<ϵg\|\mathbf{G}_{k}\|_{\bm{x}_{k}}<\epsilon_{g} and λmin(𝐇k)ϵH\lambda_{min}(\mathbf{H}_{k})\leq-\epsilon_{H}, if σkLH\sigma_{k}\geq L_{H}, iteration kk is successful, i.e., ρkτ\rho_{k}\geq\tau and σk+1σk\sigma_{k+1}\leq\sigma_{k}.

Proof

We have that

1ρk\displaystyle 1-\rho_{k} =f^k(𝜼k)m^k(𝜼k)m^k(𝟎𝒙k)m^k(𝜼k)(LH6σk3)𝜼k𝒙k3+12δH𝜼k𝒙k2m^k(𝟎𝒙k)m^k(𝜼kE)\displaystyle=\frac{\hat{f}_{k}(\bm{\eta}_{k})-\hat{m}_{k}(\bm{\eta}_{k})}{\hat{m}_{k}(\bm{0}_{\bm{x}_{k}})-\hat{m}_{k}(\bm{\eta}_{k})}\leq\frac{\left(\frac{L_{H}}{6}-\frac{\sigma_{k}}{3}\right)\|\bm{\eta}_{k}\|_{\bm{x}_{k}}^{3}+\frac{1}{2}\delta_{H}\|\bm{\eta}_{k}\|_{\bm{x}_{k}}^{2}}{\hat{m}_{k}(\bm{0}_{\bm{x}_{k}})-\hat{m}_{k}\left(\bm{\eta}_{k}^{E}\right)}
≤12δH𝜼kE𝒙k2ν|λmin(𝐇k)|𝜼kE𝒙k2/63δHνϵH1τ,\displaystyle\leq\frac{\frac{1}{2}\delta_{H}\left\|\bm{\eta}_{k}^{E}\right\|_{\bm{x}_{k}}^{2}}{\nu|\lambda_{min}(\mathbf{H}_{k})|\left\|\bm{\eta}_{k}^{E}\right\|_{\bm{x}_{k}}^{2}/6}\leq\frac{3\delta_{H}}{\nu\epsilon_{H}}\leq 1-\tau, (156)

where the first and second inequalities follow from Eqs. (143), (37), (151) and (40), and the last inequality uses Eq. (58). Consequently, we have ρkτ\rho_{k}\geq\tau, which indicates that the iteration is successful. Based on Step (12) of Algorithm 1, we have σk+1=max(σk/γ,ϵσ)σk\sigma_{k+1}=\max(\sigma_{k}/\gamma,\epsilon_{\sigma})\leq\sigma_{k}.

Lemma 11

Suppose Condition 1 and Assumptions 1, 2, 3, 4 and 5 hold. Then, for all kk, we have

σkmax(σ0,2γLH).\sigma_{k}\leq\max\left(\sigma_{0},2\gamma L_{H}\right). (157)
Proof

We prove this lemma by contradiction, considering the following two cases.

(i) If σ02γLH\sigma_{0}\leq 2\gamma L_{H}, we assume there exists an iteration k0k\geq 0 that is the first unsuccessful iteration such that σk+1=γσk>2γLH\sigma_{k+1}=\gamma\sigma_{k}>2\gamma L_{H}. This implies σk>LH\sigma_{k}>L_{H} and σk+1>σk\sigma_{k+1}>\sigma_{k}. Since the algorithm fails to terminate at iteration kk, we have 𝑮𝒌𝒙kϵg\|\bm{G_{k}}\|_{\bm{x}_{k}}\geq\epsilon_{g} or λmin(𝐇k)<ϵH\lambda_{min}(\mathbf{H}_{k})<-\epsilon_{H}. Then, according to Lemmas 9 and 10, iteration kk is successful and thus σk+1=max(σk/γ,ϵσ)σk\sigma_{k+1}=\max(\sigma_{k}/\gamma,\epsilon_{\sigma})\leq\sigma_{k}. This contradicts the earlier statement of σk+1>σk\sigma_{k+1}>\sigma_{k}. We thus have σk2γLH\sigma_{k}\leq 2\gamma L_{H} for all kk.

(ii) If σ0>2γLH\sigma_{0}>2\gamma L_{H}, similarly, we assume there exists an iteration k0k\geq 0 that is the first unsuccessful iteration such that σk+1=γσk>σ0\sigma_{k+1}=\gamma\sigma_{k}>\sigma_{0}. This implies σk>LH\sigma_{k}>L_{H} and σk+1>σk\sigma_{k+1}>\sigma_{k}. Since the algorithm fails to terminate at iteration kk, we have 𝑮𝒌𝒙kϵg\|\bm{G_{k}}\|_{\bm{x}_{k}}\geq\epsilon_{g} or λmin(𝐇k)<ϵH\lambda_{min}(\mathbf{H}_{k})<-\epsilon_{H}. According to Lemmas 9 and 10, iteration kk is successful and thus σk+1=max(σk/γ,ϵσ)σk\sigma_{k+1}=\max(\sigma_{k}/\gamma,\epsilon_{\sigma})\leq\sigma_{k}, which is a contradiction. Thus, we have σkσ0\sigma_{k}\leq\sigma_{0} for all kk.
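The bound of Lemma 11 can also be illustrated numerically. The sketch below (our own illustrative code with arbitrary constants) simulates the σ dynamics under the property established in Lemmas 9 and 10, namely that any iteration with σ_k ≥ L_H is successful, while iterations with σ_k < L_H are allowed to succeed or fail arbitrarily; the bound σ_k ≤ max(σ_0, 2γL_H) is checked along the trajectory.

```python
import random

def simulate_sigma(sigma0, L_H, gamma, eps_sigma, steps=10_000, seed=0):
    """Illustrative check of Lemma 11: under the assumption (proved in Lemmas 9
    and 10) that every iteration with sigma_k >= L_H is successful, and with
    arbitrary success/failure outcomes below L_H, sigma_k never exceeds
    max(sigma_0, 2 * gamma * L_H)."""
    rng = random.Random(seed)
    sigma, bound = sigma0, max(sigma0, 2 * gamma * L_H)
    for _ in range(steps):
        successful = True if sigma >= L_H else (rng.random() < 0.5)
        sigma = max(sigma / gamma, eps_sigma) if successful else gamma * sigma
        assert sigma <= bound + 1e-12, "bound of Lemma 11 violated"
    return sigma

simulate_sigma(sigma0=1.0, L_H=5.0, gamma=2.0, eps_sigma=1e-3)   # arbitrary constants
```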

F.2 Main Proof of Theorem 3

Proof

Letting σb=max(σ0,2γLH)\sigma_{b}=\max\left(\sigma_{0},2\gamma L_{H}\right), when 𝐆kϵg\|\mathbf{G}_{k}\|\geq\epsilon_{g}, from Eq. (36) and Lemma 11 we have

m^k(𝟎𝒙k)m^k(𝜼k)𝐆k𝒙k23min(𝐆k𝒙kKH,𝐆k𝒙kσk)ϵg223min(1KH,1σb).\hat{m}_{k}(\bm{0}_{\bm{x}_{k}})-\hat{m}_{k}(\bm{\eta}_{k})\geq\frac{\|\mathbf{G}_{k}\|_{\bm{x}_{k}}}{2\sqrt{3}}\min\left(\frac{\|\mathbf{G}_{k}\|_{\bm{x}_{k}}}{K_{H}},\sqrt{\frac{\|\mathbf{G}_{k}\|_{\bm{x}_{k}}}{\sigma_{k}}}\right)\geq\frac{\epsilon_{g}^{2}}{2\sqrt{3}}\min\left(\frac{1}{K_{H}},{\frac{1}{\sqrt{\sigma_{b}}}}\right). (158)

When 𝐆k<ϵg\|\mathbf{G}_{k}\|<\epsilon_{g} and λmin(𝐇k)ϵH\lambda_{min}(\mathbf{H}_{k})\leq-\epsilon_{H}, from Eq. (37) and Lemma 11, we have

m^k(𝟎𝒙k)m^k(𝜼k)ν36σk2|λmin(𝐇k)|3ν3ϵH36σb2.\begin{split}\hat{m}_{k}(\bm{0}_{\bm{x}_{k}})-\hat{m}_{k}(\bm{\eta}_{k})\geq\frac{\nu^{3}}{6\sigma_{k}^{2}}|\lambda_{min}(\mathbf{H}_{k})|^{3}\geq\frac{\nu^{3}\epsilon_{H}^{3}}{6\sigma_{b}^{2}}.\end{split} (159)

Let 𝒮succ\mathcal{S}_{succ} be the set of successful iterations before Algorithm 1 terminates and f^min\hat{f}_{min} be the function minimum. Since f^k(𝜼k)\hat{f}_{k}(\bm{\eta}_{k}) is monotonically decreasing, we have

f^0(𝟎𝒙0)f^min\displaystyle\hat{f}_{0}(\bm{0}_{\bm{x}_{0}})-\hat{f}_{min} k=0(f^k(𝟎𝒙k)f^k(𝜼k))k𝒮succ(f^k(𝟎𝒙k)f^k(𝜼k))\displaystyle\geq\sum_{k=0}^{\infty}\left(\hat{f}_{k}(\bm{0}_{\bm{x}_{k}})-\hat{f}_{k}(\bm{\eta}_{k})\right)\geq\sum_{k\in\mathcal{S}_{succ}}\left(\hat{f}_{k}(\bm{0}_{\bm{x}_{k}})-\hat{f}_{k}(\bm{\eta}_{k})\right) (160)
τ(k𝒮succm^k(𝟎𝒙k)m^k(𝜼k))\displaystyle\geq\tau\left(\sum_{k\in\mathcal{S}_{succ}}\hat{m}_{k}(\bm{0}_{\bm{x}_{k}})-\hat{m}_{k}(\bm{\eta}_{k})\right)
τ|𝒮succ|min(ν3ϵH36σb2,ϵg223min(1KH,1σb))|𝒮succ|τκmin(ϵg2,ϵH3),\displaystyle\geq\tau|\mathcal{S}_{succ}|\min\left(\frac{\nu^{3}\epsilon_{H}^{3}}{6\sigma_{b}^{2}},\frac{\epsilon_{g}^{2}}{2\sqrt{3}}\min\left(\frac{1}{K_{H}},{\frac{1}{\sqrt{\sigma_{b}}}}\right)\right)\geq|\mathcal{S}_{succ}|\tau\kappa\min\left(\epsilon_{g}^{2},\epsilon_{H}^{3}\right),

where κ=min(ν36σb2,123min(1KH,1σb))\kappa=\min\left(\frac{\nu^{3}}{6\sigma_{b}^{2}},\frac{1}{2\sqrt{3}}\min\left(\frac{1}{K_{H}},\frac{1}{\sqrt{\sigma_{b}}}\right)\right). Let 𝒮fail\mathcal{S}_{fail} be the set of unsuccessful iterations and TT be the total number of iterations of the algorithm. Then we have σT=σ0γ|𝒮fail||𝒮succ|\sigma_{T}=\sigma_{0}\gamma^{|\mathcal{S}_{fail}|-|\mathcal{S}_{succ}|}. Combining this with σTσb\sigma_{T}\leq\sigma_{b} from Lemma 11, we have

σT=σ0γ|𝒮fail||𝒮succ|σb\displaystyle\sigma_{T}=\sigma_{0}\gamma^{|\mathcal{S}_{fail}|-|\mathcal{S}_{succ}|}\leq\sigma_{b}
\displaystyle\Longrightarrow log(γ|𝒮fail||𝒮succ|)log(σbσ0)\displaystyle\log\left(\gamma^{|\mathcal{S}_{fail}|-|\mathcal{S}_{succ}|}\right)\leq\log\left(\frac{\sigma_{b}}{\sigma_{0}}\right)
\displaystyle\Longrightarrow (|𝒮fail||𝒮succ|)log(σb/σ0)logγ\left({|\mathcal{S}_{fail}|-|\mathcal{S}_{succ}|}\right)\leq\frac{\log(\sigma_{b}/\sigma_{0})}{\log{\gamma}}
\displaystyle\Longrightarrow |𝒮fail|log(σb/σ0)logγ+|𝒮succ|.\displaystyle|\mathcal{S}_{fail}|\leq\frac{\log(\sigma_{b}/\sigma_{0})}{\log{\gamma}}+|\mathcal{S}_{succ}|. (161)

Finally, combining Eqs. (160), (161), we have

T=|𝒮fail|+|𝒮succ|\displaystyle T=|\mathcal{S}_{fail}|+|\mathcal{S}_{succ}| log(σb/σ0)logγ+2|𝒮succ|\displaystyle\leq\frac{\log(\sigma_{b}/\sigma_{0})}{\log{\gamma}}+2|\mathcal{S}_{succ}| (162)
log(σb/σ0)logγ+2(f^0(𝟎𝒙0)f^min)τκmax(ϵg2,ϵH3).\displaystyle\leq\frac{\log(\sigma_{b}/\sigma_{0})}{\log{\gamma}}+\frac{2\left(\hat{f}_{0}(\bm{0}_{\bm{x}_{0}})-\hat{f}_{min}\right)}{\tau\kappa}\max\left(\epsilon_{g}^{-2},\epsilon_{H}^{-3}\right).

This completes the proof.
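To give a feel for the magnitude of the bound in Eq. (162), the following snippet evaluates it for a set of purely hypothetical constants (none of these values are taken from the paper or its experiments); since it is a worst-case bound, the resulting number is deliberately pessimistic.

```python
import math

# Hypothetical placeholder constants for illustrating Eq. (162); not values from the paper.
sigma_0, gamma, L_H = 1.0, 2.0, 10.0
tau, nu, K_H = 0.1, 0.5, 10.0
eps_g, eps_H = 1e-2, 1e-1
f0_minus_fmin = 1.0

sigma_b = max(sigma_0, 2 * gamma * L_H)
kappa = min(nu**3 / (6 * sigma_b**2),
            (1 / (2 * math.sqrt(3))) * min(1 / K_H, 1 / math.sqrt(sigma_b)))
T_bound = (math.log(sigma_b / sigma_0) / math.log(gamma)
           + 2 * f0_minus_fmin / (tau * kappa) * max(eps_g**-2, eps_H**-3))
print(f"worst-case iteration bound of Theorem 3 (illustrative constants): {T_bound:.3e}")
```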

F.3 Main Proof of Corollary 1

Proof

Under the given assumptions, when Theorem 3 holds, Algorithm 1 returns an (ϵg,ϵH)\left(\epsilon_{g},\epsilon_{H}\right)-optimal solution in T=𝒪(max(ϵg2,ϵH3))T=\mathcal{O}\left(\max\left(\epsilon_{g}^{-2},\epsilon_{H}^{-3}\right)\right) iterations. Also, according to Theorem 2, Condition 1 is satisfied at each iteration with probability (1δ)(1-\delta), which can be achieved independently at each iteration by selecting proper subsampling sizes for the approximate gradient and Hessian. Let EE be the event that Algorithm 1 returns an (ϵg,ϵH)\left(\epsilon_{g},\epsilon_{H}\right)-optimal solution and EiE_{i} be the event that Condition 1 is satisfied at iteration ii. According to Theorem 3, for event EE to happen, Condition 1 needs to be satisfied at every iteration, and thus we have

Pr(E)=i=1TPr(Ei)=(1δ)T=(1δ)𝒪(max(ϵg2,ϵH3)).{\rm Pr}(E)=\prod_{i=1}^{T}{\rm Pr}(E_{i})=(1-\delta)^{T}=(1-\delta)^{\mathcal{O}\left(\max\left(\epsilon_{g}^{-2},\epsilon_{H}^{-3}\right)\right)}. (163)

This completes the proof.
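For intuition about the probability statement in Eq. (163), the snippet below (with hypothetical values of T and of the target overall probability p) shows how small the per-iteration failure probability δ must be so that (1 − δ)^T stays above p; the rearrangement δ ≤ 1 − p^{1/T} follows directly from requiring (1 − δ)^T ≥ p.

```python
# Hypothetical iteration budget and target overall success probability.
T = 10_000
p = 0.99

# (1 - delta)^T >= p  <=>  delta <= 1 - p**(1/T)
delta_max = 1 - p ** (1 / T)
print(f"per-iteration delta must satisfy delta <= {delta_max:.2e}")
print(f"check: (1 - delta_max)^T = {(1 - delta_max) ** T:.4f}")   # equals p up to rounding
```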

Appendix G Appendix: Theorem 4 and Corollary 2

G.1 Supporting Lemmas for Theorem 4

The proof of Theorem 4 builds on an existing lemma, which we restate as follows.

Lemma 12

(Sufficient Function Decrease, Lemma 3.3 in cartis2011adaptive) Suppose the solution 𝛈k\bm{\eta}_{k} satisfies Assumption 3. Then we have

m^k(𝟎𝒙k)m^k(𝜼k)σk6𝜼k𝒙k3.\begin{split}\hat{m}_{k}(\bm{0}_{\bm{x}_{k}})-\hat{m}_{k}(\bm{\eta}_{k})\geq\frac{\sigma_{k}}{6}\|\bm{\eta}_{k}\|_{\bm{x}_{k}}^{3}.\end{split} (164)

The proof also relies on the following lemma, which we develop below.

Lemma 13

(Sufficiently Long Steps) Suppose Condition 1 and Assumptions 1, 2, 3, 4, 5 hold. If δgδHκθϵg\delta_{g}\leq\delta_{H}\leq\kappa_{\theta}\epsilon_{g}, Algorithm 3 returns a solution 𝛈k\bm{\eta}_{k} such that

𝜼k𝒙kκsϵg,\begin{split}\|\bm{\eta}_{k}\|_{\bm{x}_{k}}\geq\kappa_{s}\sqrt{\epsilon_{g}},\end{split} (165)

when 𝐆k+1𝐱k+1ϵg\|\mathbf{G}_{k+1}\|_{\bm{x}_{k+1}}\geq\epsilon_{g} for k>0k>0 and when the inner stopping criterion of Eq. (47) is used. Here κs=min(1/(LH+2σb+ϵg3+Ll3),1/5LH3+10σb3+11Ll9)\kappa_{s}=\min\left(1/\sqrt{(L_{H}+2\sigma_{b}+\frac{\epsilon_{g}}{3}+\frac{L_{l}}{3})},1/\sqrt{\frac{5L_{H}}{3}+\frac{10\sigma_{b}}{3}+\frac{11L_{l}}{9}}\right) with Ll>0L_{l}>0 and σb=max(σ0,2γLH)\sigma_{b}=\max\left(\sigma_{0},2\gamma L_{H}\right).

Proof

By differentiating the approximate model, we have

m^k(𝜼k)𝒙k=𝐆k+𝐇k[𝜼k]+σk𝜼k𝒙k𝜼k𝒙k\displaystyle\|\nabla\hat{m}_{k}(\bm{\eta}_{k})\|_{\bm{x}_{k}}=\bigg{|}\bigg{|}\mathbf{G}_{k}+\mathbf{H}_{k}[\bm{\eta}_{k}]+\sigma_{k}\|\bm{\eta}_{k}\|_{\bm{x}_{k}}\bm{\eta}_{k}\bigg{|}\bigg{|}_{\bm{x}_{k}}
=\displaystyle= 𝒫𝜼k1gradf^k(𝜼k)+𝐆kgradf(𝒙k)+𝐇k[𝜼k]2f^k(𝟎𝒙k)[𝜼k]\displaystyle\left\|\mathcal{P}_{\bm{\eta}_{k}}^{-1}\text{grad}\hat{f}_{k}(\bm{\eta}_{k})+\mathbf{G}_{k}-\text{grad}f(\bm{x}_{k})+\mathbf{H}_{k}[\bm{\eta}_{k}]-\nabla^{2}\hat{f}_{k}(\bm{0}_{\bm{x}_{k}})[\bm{\eta}_{k}]\right.
+gradf(𝒙k)+2f^k(𝟎𝒙k)[𝜼k]𝒫𝜼k1gradf^k(𝜼k)+σk𝜼k𝒙k𝜼k𝒙k\displaystyle\left.+\text{grad}f(\bm{x}_{k})+\nabla^{2}\hat{f}_{k}(\bm{0}_{\bm{x}_{k}})[\bm{\eta}_{k}]-\mathcal{P}_{\bm{\eta}_{k}}^{-1}\text{grad}\hat{f}_{k}(\bm{\eta}_{k})+\sigma_{k}\|\bm{\eta}_{k}\|_{\bm{x}_{k}}\bm{\eta}_{k}\right\|_{\bm{x}_{k}}
\displaystyle\geq 𝒫𝜼k1gradf^k(𝜼k)𝒙k𝐆kgradf(𝒙k)𝒙k𝐇k[𝜼k]2f^k(𝟎𝒙k)[𝜼k]𝒙k\displaystyle\left\|\mathcal{P}_{\bm{\eta}_{k}}^{-1}\text{grad}\hat{f}_{k}(\bm{\eta}_{k})\right\|_{\bm{x}_{k}}-\left\|\mathbf{G}_{k}-\text{grad}f(\bm{x}_{k})\right\|_{\bm{x}_{k}}-\left\|\mathbf{H}_{k}[\bm{\eta}_{k}]-\nabla^{2}\hat{f}_{k}(\bm{0}_{\bm{x}_{k}})[\bm{\eta}_{k}]\right\|_{\bm{x}_{k}}
𝒫𝜼k1gradf^k(𝜼k)gradf(𝒙k)2f^k(𝟎𝒙k)[𝜼k]𝒙kσk𝜼k𝒙k2\displaystyle-\left\|\mathcal{P}_{\bm{\eta}_{k}}^{-1}\text{grad}\hat{f}_{k}(\bm{\eta}_{k})-\text{grad}f(\bm{x}_{k})-\nabla^{2}\hat{f}_{k}(\bm{0}_{\bm{x}_{k}})[\bm{\eta}_{k}]\right\|_{\bm{x}_{k}}-\sigma_{k}\|\bm{\eta}_{k}\|_{\bm{x}_{k}}^{2}
\displaystyle\geq 𝒫𝜼k1gradf^k(𝜼k)𝒙kδgδH𝜼k𝒙kLH2𝜼k𝒙k2σb𝜼k𝒙k2,\displaystyle\left\lVert\mathcal{P}_{\bm{\eta}_{k}}^{-1}\text{grad}\hat{f}_{k}(\bm{\eta}_{k})\right\rVert_{\bm{x}_{k}}-\delta_{g}-\delta_{H}\|\bm{\eta}_{k}\|_{\bm{x}_{k}}-\frac{L_{H}}{2}\|\bm{\eta}_{k}\|_{\bm{x}_{k}}^{2}-\sigma_{b}\|\bm{\eta}_{k}\|_{\bm{x}_{k}}^{2}, (166)

where the first inequality follows from the triangle inequality and the second inequality from Eqs. (49), (51) and (52). Additionally, from Lemma 3.8 in kasai2018riemannian , we have

gradfk(𝒙k)𝒙k𝒫𝜼k1gradf^k(𝜼k)𝒙k𝒫𝜼k1gradf^k(𝜼k)gradf(𝒙k)𝒙kLl𝜼k𝒙k,\begin{split}\left\lVert\text{grad}f_{k}(\bm{x}_{k})\right\rVert_{\bm{x}_{k}}&-\left\lVert\mathcal{P}_{\bm{\eta}_{k}}^{-1}\text{grad}\hat{f}_{k}(\bm{\eta}_{k})\right\rVert_{\bm{x}_{k}}\\ &\leq\left\lVert\mathcal{P}_{\bm{\eta}_{k}}^{-1}\text{grad}\hat{f}_{k}(\bm{\eta}_{k})-\text{grad}f(\bm{x}_{k})\right\rVert_{\bm{x}_{k}}\leq L_{l}\left\lVert\bm{\eta}_{k}\right\rVert_{\bm{x}_{k}},\end{split} (167)

where Ll>0L_{l}>0 is a constant. Then, we have

𝐆k𝒙k\displaystyle\|\mathbf{G}_{k}\|_{\bm{x}_{k}}\leq 𝐆kgradf(𝒙k)𝒙k+gradf(𝒙k)𝒙k\displaystyle\;\|\mathbf{G}_{k}-\text{grad}f(\bm{x}_{k})\|_{\bm{x}_{k}}+\|\text{grad}f(\bm{x}_{k})\|_{\bm{x}_{k}}
\displaystyle\leq δg+Ll𝜼k𝒙k+𝒫𝜼k1gradf^k(𝜼k)𝒙k,\displaystyle\;\delta_{g}+L_{l}\|\bm{\eta}_{k}\|_{\bm{x}_{k}}+\left\lVert\mathcal{P}_{\bm{\eta}_{k}}^{-1}\text{grad}\hat{f}_{k}(\bm{\eta}_{k})\right\rVert_{\bm{x}_{k}}, (168)

with the first inequality following from the triangle inequality and the second from Eqs. (51), (52) and (167). Then, by combining Eqs. (47), (166) and (168), with θk:=κθmin(1,𝜼ki𝐱k)\theta_{k}:=\kappa_{\theta}\min(1,\left\|\bm{\eta}_{k}^{i}\right\|_{\mathbf{x}_{k}}) we obtain

𝒫𝜼k1gradf^k(𝜼k)𝒙kδgδH𝜼k𝒙k(LH2+σb)𝜼k𝒙k2\displaystyle\left\lVert\mathcal{P}_{\bm{\eta}_{k}}^{-1}\text{grad}\hat{f}_{k}(\bm{\eta}_{k})\right\rVert_{\bm{x}_{k}}-\delta_{g}-\delta_{H}\|\bm{\eta}_{k}\|_{\bm{x}_{k}}-\left(\frac{L_{H}}{2}+\sigma_{b}\right)\|\bm{\eta}_{k}\|_{\bm{x}_{k}}^{2} (169)
\displaystyle\leq m^k(𝜼k)𝒙kθk𝐆k𝒙k\displaystyle\;\|\nabla\hat{m}_{k}(\bm{\eta}_{k})\|_{\bm{x}_{k}}\leq\theta_{k}\|\mathbf{G}_{k}\|_{\bm{x}_{k}}
\displaystyle\leq θk(δg+Ll𝜼k𝒙k+𝒫𝜼k1gradf^k(𝜼k)𝒙k).\displaystyle\;\theta_{k}\left(\delta_{g}+L_{l}\|\bm{\eta}_{k}\|_{\bm{x}_{k}}+\left\lVert\mathcal{P}_{\bm{\eta}_{k}}^{-1}\text{grad}\hat{f}_{k}(\bm{\eta}_{k})\right\rVert_{\bm{x}_{k}}\right).

This results in

(LH2+σb)𝜼k𝒙k2+(δH+θkLl)𝜼k𝒙k+(1+θk)δg\displaystyle\left(\frac{L_{H}}{2}+\sigma_{b}\right)\|\bm{\eta}_{k}\|_{\bm{x}_{k}}^{2}+\left(\delta_{H}+\theta_{k}L_{l}\right)\|\bm{\eta}_{k}\|_{\bm{x}_{k}}+\left(1+\theta_{k}\right)\delta_{g} (170)
\displaystyle\geq (1θk)gradf^k(𝜼k)𝒙k+1(1θk)(𝐆k+1𝒙k+1δg).\displaystyle\;\left(1-\theta_{k}\right)\left\lVert\text{grad}\hat{f}_{k}(\bm{\eta}_{k})\right\rVert_{\bm{x}_{k+1}}\geq\left(1-\theta_{k}\right)\left(\|\mathbf{G}_{k+1}\|_{\bm{x}_{k+1}}-\delta_{g}\right).

It then follows that

(LH2+σb)𝜼k𝒙k2+(δH+θkLl)𝜼k𝒙k+2δg(1θk)(𝐆k+1𝒙k+1).\left(\frac{L_{H}}{2}+\sigma_{b}\right)\|\bm{\eta}_{k}\|_{\bm{x}_{k}}^{2}+\left(\delta_{H}+\theta_{k}L_{l}\right)\|\bm{\eta}_{k}\|_{\bm{x}_{k}}+2\delta_{g}\geq\left(1-\theta_{k}\right)\left(\|\mathbf{G}_{k+1}\|_{\bm{x}_{k+1}}\right). (171)

In the above derivation, we use the property that parallel transport preserves the length of the transported vector. The last inequality in Eq. (170) is based on Eq. (51) and the triangle inequality.

Now, we consider the following two cases. (i) If 𝜼k𝒙k1\|\bm{\eta}_{k}\|_{\bm{x}_{k}}\geq 1, from Eq. (47) we have θk=κθ\theta_{k}=\kappa_{\theta}, and therefore

(1κθ)(𝐆k+1𝒙k+1)2δg(LH2+σb+δH+κθLl)𝜼k𝒙k2.\left(1-\kappa_{\theta}\right)\left(\|\mathbf{G}_{k+1}\|_{\bm{x}_{k+1}}\right)-2\delta_{g}\leq\left(\frac{L_{H}}{2}+\sigma_{b}+\delta_{H}+\kappa_{\theta}L_{l}\right)\|\bm{\eta}_{k}\|_{\bm{x}_{k}}^{2}. (172)

This then gives

𝜼k𝒙k2(1κθ)(𝐆k+1𝒙k+1)2δgLH2+σb+δH+κθLl12ϵgLH2+σb+ϵg6+Ll6,\|\bm{\eta}_{k}\|_{\bm{x}_{k}}^{2}\geq\frac{\left(1-\kappa_{\theta}\right)\left(\|\mathbf{G}_{k+1}\|_{\bm{x}_{k+1}}\right)-2\delta_{g}}{\frac{L_{H}}{2}+\sigma_{b}+\delta_{H}+\kappa_{\theta}L_{l}}\geq\frac{\frac{1}{2}\epsilon_{g}}{\frac{L_{H}}{2}+\sigma_{b}+\frac{\epsilon_{g}}{6}+\frac{L_{l}}{6}}, (173)

where the last inequality holds because δgδHκθϵg\delta_{g}\leq\delta_{H}\leq\kappa_{\theta}\epsilon_{g} and κθ<16\kappa_{\theta}<\frac{1}{6}. (ii) If 𝜼k𝒙k<1\|\bm{\eta}_{k}\|_{\bm{x}_{k}}<1, then θk=κθ𝜼k𝒙k<κθ\theta_{k}=\kappa_{\theta}\|\bm{\eta}_{k}\|_{\bm{x}_{k}}<\kappa_{\theta}. Given Eq. (168) and δgδHκθϵg\delta_{g}\leq\delta_{H}\leq\kappa_{\theta}\epsilon_{g}, we have

δHκθϵgκθ𝐆k𝒙kκθ(δH+Ll𝜼k𝒙k+𝐆k+1𝒙k+1).\delta_{H}\leq\kappa_{\theta}\epsilon_{g}\leq\kappa_{\theta}\|\mathbf{G}_{k}\|_{\bm{x}_{k}}\leq\kappa_{\theta}\left(\delta_{H}+L_{l}\|\bm{\eta}_{k}\|_{\bm{x}_{k}}+\|\mathbf{G}_{k+1}\|_{\bm{x}_{k+1}}\right). (174)

Then, we have

(1θk)(𝐆k+1𝒙k+1)2δg\displaystyle\left(1-\theta_{k}\right)\left(\|\mathbf{G}_{k+1}\|_{\bm{x}_{k+1}}\right)-2\delta_{g} (175)
\displaystyle\leq (LH2+σb)𝜼k𝒙k2+(δH+θkLl)𝜼k𝒙k\displaystyle\left(\frac{L_{H}}{2}+\sigma_{b}\right)\|\bm{\eta}_{k}\|_{\bm{x}_{k}}^{2}+\left(\delta_{H}+\theta_{k}L_{l}\right)\|\bm{\eta}_{k}\|_{\bm{x}_{k}}
\displaystyle\leq (LH2+σb)𝜼k𝒙k2+(κθ1κθ+κθ)Ll𝜼k𝒙k2+κθ𝐆k+1𝒙k+11κθ,\displaystyle\left(\frac{L_{H}}{2}+\sigma_{b}\right)\|\bm{\eta}_{k}\|_{\bm{x}_{k}}^{2}+\left(\frac{\kappa_{\theta}}{1-\kappa_{\theta}}+\kappa_{\theta}\right)L_{l}\|\bm{\eta}_{k}\|_{\bm{x}_{k}}^{2}+\frac{\kappa_{\theta}\|\mathbf{G}_{k+1}\|_{\bm{x}_{k+1}}}{1-\kappa_{\theta}},

which results in

𝜼k𝒙k2(1κθκθ1κθ)𝐆k+1𝒙k+12δgLH2+σb+(κθ1κθ+κθ)Ll310ϵgLH2+σb+1130Ll.\|\bm{\eta}_{k}\|_{\bm{x}_{k}}^{2}\geq\frac{\left(1-\kappa_{\theta}-\frac{\kappa_{\theta}}{1-\kappa_{\theta}}\right)\|\mathbf{G}_{k+1}\|_{\bm{x}_{k+1}}-2\delta_{g}}{\frac{L_{H}}{2}+\sigma_{b}+\left(\frac{\kappa_{\theta}}{1-\kappa_{\theta}}+\kappa_{\theta}\right)L_{l}}\geq\frac{\frac{3}{10}\epsilon_{g}}{\frac{L_{H}}{2}+\sigma_{b}+\frac{11}{30}L_{l}}. (176)

This completes the proof.
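As a numerical cross-check of the algebra above, the snippet below (with arbitrary positive constants of our own choosing) verifies that the case-by-case lower bounds in Eqs. (173) and (176) coincide with the two terms inside the definition of κ_s in Lemma 13, so that the step-length bound κ_s√ϵ_g covers both cases.

```python
import math

# Arbitrary positive constants for the cross-check (not values from the paper).
L_H, sigma_b, L_l, eps_g = 3.0, 7.0, 2.0, 1e-2

# Right-hand sides of Eqs. (173) and (176): the two case-dependent lower bounds on ||eta_k||^2.
bound_case_i  = (0.5 * eps_g) / (L_H / 2 + sigma_b + eps_g / 6 + L_l / 6)
bound_case_ii = (0.3 * eps_g) / (L_H / 2 + sigma_b + 11 * L_l / 30)

# The two terms appearing in kappa_s^2 * eps_g, from the statement of Lemma 13.
term_i  = eps_g / (L_H + 2 * sigma_b + eps_g / 3 + L_l / 3)
term_ii = eps_g / (5 * L_H / 3 + 10 * sigma_b / 3 + 11 * L_l / 9)

assert math.isclose(bound_case_i, term_i) and math.isclose(bound_case_ii, term_ii)
kappa_s = min(1 / math.sqrt(L_H + 2 * sigma_b + eps_g / 3 + L_l / 3),
              1 / math.sqrt(5 * L_H / 3 + 10 * sigma_b / 3 + 11 * L_l / 9))
print(f"kappa_s = {kappa_s:.4f}; step-length lower bound kappa_s*sqrt(eps_g) = {kappa_s * math.sqrt(eps_g):.4f}")
```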

G.2 Main Proof of Theorem 4

Proof

Let σb=max(σ0,2γLH)\sigma_{b}=\max\left(\sigma_{0},2\gamma L_{H}\right), σmin=min(σk)\sigma_{min}=\min\left(\sigma_{k}\right) for k0k\geq 0 and 𝒮succ1\mathcal{S}_{succ}^{1} be the set of successful iterations such that 𝐆k+1𝒙k+1ϵg\|\mathbf{G}_{k+1}\|_{\bm{x}_{k+1}}\geq\epsilon_{g} for k𝒮succ1k\in\mathcal{S}_{succ}^{1}. As f^k(𝜼k)\hat{f}_{k}(\bm{\eta}_{k}) is monotonically decreasing, we have

f^0(𝟎𝒙0)f^min\displaystyle\hat{f}_{0}(\bm{0}_{\bm{x}_{0}})-\hat{f}_{min} k=0(f^k(𝟎𝒙k)f^k(𝜼k))k𝒮succ1(f^k(𝟎𝒙k)f^k(𝜼k))\displaystyle\geq\sum_{k=0}^{\infty}\left(\hat{f}_{k}(\bm{0}_{\bm{x}_{k}})-\hat{f}_{k}(\bm{\eta}_{k})\right)\geq\sum_{k\in\mathcal{S}_{succ}^{1}}\left(\hat{f}_{k}(\bm{0}_{\bm{x}_{k}})-\hat{f}_{k}(\bm{\eta}_{k})\right) (177)
τk𝒮succ1(m^k(𝟎𝒙k)m^k(𝜼k))τ|𝒮succ1|min(ν3ϵH36σk2,σmin6𝜼k𝒙k3)\displaystyle\geq\tau\sum_{k\in\mathcal{S}_{succ}^{1}}\left(\hat{m}_{k}(\bm{0}_{\bm{x}_{k}})-\hat{m}_{k}(\bm{\eta}_{k})\right)\geq\tau|\mathcal{S}_{succ}^{1}|\min\left(\frac{\nu^{3}\epsilon_{H}^{3}}{6\sigma_{k}^{2}},\frac{\sigma_{min}}{6}\|\bm{\eta}_{k}\|_{\bm{x}_{k}}^{3}\right)
τ|𝒮succ1|min(ν3ϵH36σb2,ϵσκs3ϵg326)τκ1|𝒮succ1|min(ϵg32,ϵH3),\displaystyle\geq\tau|\mathcal{S}_{succ}^{1}|\min\left(\frac{\nu^{3}\epsilon_{H}^{3}}{6\sigma_{b}^{2}},\frac{\epsilon_{\sigma}\kappa_{s}^{3}\epsilon_{g}^{\frac{3}{2}}}{6}\right)\geq\tau\kappa_{1}|\mathcal{S}_{succ}^{1}|\min\left(\epsilon_{g}^{\frac{3}{2}},\epsilon_{H}^{3}\right),

where κ1=16min(ν3σb2,ϵσκs3)\kappa_{1}=\frac{1}{6}\min\left(\frac{\nu^{3}}{\sigma_{b}^{2}},\epsilon_{\sigma}\kappa_{s}^{3}\right). The fourth inequality follows from Eqs. (37) and (164), while the fifth from Eq. (165).

Let 𝒮succ2\mathcal{S}_{succ}^{2} be the set of successful iterations such that 𝐆k+1𝒙k+1<ϵg\|\mathbf{G}_{k+1}\|_{\bm{x}_{k+1}}<\epsilon_{g} and λmin(𝐇k+1)<ϵH\lambda_{min}(\mathbf{H}_{k+1})<-\epsilon_{H} for k𝒮succ2k\in\mathcal{S}_{succ}^{2}. Then there is an iteration t𝒮succ2t\in\mathcal{S}_{succ}^{2} in which 𝐆t𝒙tϵg\|\mathbf{G}_{t}\|_{\bm{x}_{t}}\geq\epsilon_{g} and 𝐆t+1𝒙t+1<ϵg\|\mathbf{G}_{t+1}\|_{\bm{x}_{t+1}}<\epsilon_{g}. Thus, we have

f^0(𝟎𝒙0)f^mink=t(f^k(𝟎𝒙k)f^k(𝜼k))f^0(𝟎𝒙0)f^t(𝜼t)+k𝒮succ2(f^k(𝟎𝒙k)f^k(𝜼k)),\hat{f}_{0}(\bm{0}_{\bm{x}_{0}})-\hat{f}_{min}\geq\sum_{k=t}^{\infty}\left(\hat{f}_{k}(\bm{0}_{\bm{x}_{k}})-\hat{f}_{k}(\bm{\eta}_{k})\right)\geq\hat{f}_{0}(\bm{0}_{\bm{x}_{0}})-\hat{f}_{t}(\bm{\eta}_{t})+\sum_{k\in\mathcal{S}_{succ}^{2}}\left(\hat{f}_{k}(\bm{0}_{\bm{x}_{k}})-\hat{f}_{k}(\bm{\eta}_{k})\right), (178)

and this results in

f^t(𝜼t)f^min\displaystyle\hat{f}_{t}(\bm{\eta}_{t})-\hat{f}_{min} k𝒮succ2(f^k(𝟎𝒙k)f^k(𝜼k))τk𝒮succ2(m^k(𝟎𝒙k)m^k(𝜼k))\displaystyle\geq\sum_{k\in\mathcal{S}_{succ}^{2}}\left(\hat{f}_{k}(\bm{0}_{\bm{x}_{k}})-\hat{f}_{k}(\bm{\eta}_{k})\right)\geq\tau\sum_{k\in\mathcal{S}_{succ}^{2}}\left(\hat{m}_{k}(\bm{0}_{\bm{x}_{k}})-\hat{m}_{k}(\bm{\eta}_{k})\right)
τ|𝒮succ2|ν3ϵH36σk2τ|𝒮succ2|ν3ϵH36σb2τκ2|𝒮succ2|ϵH3,\displaystyle\geq\tau|\mathcal{S}_{succ}^{2}|\frac{\nu^{3}\epsilon_{H}^{3}}{6\sigma_{k}^{2}}\geq\tau|\mathcal{S}_{succ}^{2}|\frac{\nu^{3}\epsilon_{H}^{3}}{6\sigma_{b}^{2}}\geq\tau\kappa_{2}|\mathcal{S}_{succ}^{2}|\epsilon_{H}^{3}, (179)

where κ2=ν36σb2\kappa_{2}=\frac{\nu^{3}}{6\sigma_{b}^{2}}. The third and fourth inequalities follow from Eq. (37) and Eq. (157), respectively. Then, the bound for the total number of successful iterations is

|𝒮succ|=\displaystyle|\mathcal{S}_{succ}|= |𝒮succ1|+|𝒮succ2|+1\displaystyle\;|\mathcal{S}_{succ}^{1}|+|\mathcal{S}_{succ}^{2}|+1
\displaystyle\leq f^0(𝟎𝒙0)f^minτκ1max(ϵg32,ϵH3)+f^t(𝜼t)f^minτκ2ϵH3+1\displaystyle\;\frac{\hat{f}_{0}(\bm{0}_{\bm{x}_{0}})-\hat{f}_{min}}{\tau\kappa_{1}}\max\left(\epsilon_{g}^{-\frac{3}{2}},\epsilon_{H}^{-3}\right)+\frac{\hat{f}_{t}(\bm{\eta}_{t})-\hat{f}_{min}}{\tau\kappa_{2}}\epsilon_{H}^{-3}+1
\displaystyle\leq (f^0(𝟎𝒙0)f^minτκ1+f^t(𝜼t)f^minτκ2)max(ϵg32,ϵH3)+1,\displaystyle\;\left(\frac{\hat{f}_{0}(\bm{0}_{\bm{x}_{0}})-\hat{f}_{min}}{\tau\kappa_{1}}+\frac{\hat{f}_{t}(\bm{\eta}_{t})-\hat{f}_{min}}{\tau\kappa_{2}}\right)\max\left(\epsilon_{g}^{-\frac{3}{2}},\epsilon_{H}^{-3}\right)+1, (180)

where the extra iteration corresponds to the final successful iteration of Algorithm 1 with λmin(𝐇k+1)ϵH\lambda_{min}(\mathbf{H}_{k+1})\geq-\epsilon_{H}. Then, similar to Eq. (162), we have the improved iteration bound for Algorithm 1 given as

T=|𝒮fail|+|𝒮succ|log(σb/σ0)logγ+2|𝒮succ|log(σb/σ0)logγ+2Cmax(ϵg32,ϵH3)+2,T=|\mathcal{S}_{fail}|+|\mathcal{S}_{succ}|\leq\frac{\log(\sigma_{b}/\sigma_{0})}{\log{\gamma}}+2|\mathcal{S}_{succ}|\leq\frac{\log(\sigma_{b}/\sigma_{0})}{\log{\gamma}}+2C\max\left(\epsilon_{g}^{-\frac{3}{2}},\epsilon_{H}^{-3}\right)+2, (181)

where C=f^0(𝟎𝒙0)f^minτκ1+f^t(𝜼t)f^minτκ2C=\frac{\hat{f}_{0}(\bm{0}_{\bm{x}_{0}})-\hat{f}_{min}}{\tau\kappa_{1}}+\frac{\hat{f}_{t}(\bm{\eta}_{t})-\hat{f}_{min}}{\tau\kappa_{2}}. This completes the proof. ∎
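The improvement of Theorem 4 over Theorem 3 lies in the ϵ_g-dependence of the bound, from max(ϵ_g^{-2}, ϵ_H^{-3}) to max(ϵ_g^{-3/2}, ϵ_H^{-3}). The short snippet below illustrates this gap for a few sample tolerance pairs (chosen by us for illustration only).

```python
# Sample tolerance pairs chosen purely for illustration.
for eps_g, eps_H in [(1e-2, 1e-1), (1e-3, 1e-1), (1e-4, 1e-2)]:
    scale_thm3 = max(eps_g ** -2.0, eps_H ** -3.0)    # Theorem 3 dependence
    scale_thm4 = max(eps_g ** -1.5, eps_H ** -3.0)    # Theorem 4 dependence
    print(f"eps_g={eps_g:.0e}, eps_H={eps_H:.0e}: "
          f"Theorem 3 scale {scale_thm3:.1e}, Theorem 4 scale {scale_thm4:.1e}")
```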

G.3 Main Proof of Corollary 2

Although the argument follows exactly the same steps as the proof of Corollary 1, we repeat it here for the reader's convenience.

Proof

Under the given assumptions, when Theorem 4 holds, Algorithm 1 returns an (ϵg,ϵH)\left(\epsilon_{g},\epsilon_{H}\right)-optimal solution in T=𝒪(max(ϵg32,ϵH3))T=\mathcal{O}\left(\max\left(\epsilon_{g}^{-\frac{3}{2}},\epsilon_{H}^{-3}\right)\right) iterations. Also, according to Theorem 2, Condition 1 is satisfied at each iteration with probability (1δ)(1-\delta), which can be achieved independently at each iteration by selecting proper subsampling sizes for the approximate gradient and Hessian. Let EE be the event that Algorithm 1 returns an (ϵg,ϵH)\left(\epsilon_{g},\epsilon_{H}\right)-optimal solution and EiE_{i} be the event that Condition 1 is satisfied at iteration ii. According to Theorem 4, for event EE to happen, Condition 1 needs to be satisfied at every iteration, and thus we have

Pr(E)=i=1TPr(Ei)=(1δ)T=(1δ)𝒪(max(ϵg32,ϵH3)).{\rm Pr}(E)=\prod_{i=1}^{T}{\rm Pr}(E_{i})=(1-\delta)^{T}=(1-\delta)^{\mathcal{O}\left(\max\left(\epsilon_{g}^{-\frac{3}{2}},\epsilon_{H}^{-3}\right)\right)}. (182)

This completes the proof.