University of Manchester
Email: {yian.deng, tingting.mu}@manchester.ac.uk
Faster Riemannian Newton-type Optimization
by Subsampling and Cubic Regularization
Abstract
This work is on constrained large-scale non-convex optimization where the constraint set implies a manifold structure. Solving such problems is important in a multitude of fundamental machine learning tasks. Recent advances in Riemannian optimization have enabled the convenient recovery of solutions by adapting unconstrained optimization algorithms to manifolds. However, it remains challenging to scale up while maintaining stable convergence rates and handling saddle points. We propose a new second-order Riemannian optimization algorithm, aiming at improving the convergence rate and reducing the computational cost. It enhances the Riemannian trust-region algorithm, which exploits curvature information to escape saddle points, through a combination of subsampling and cubic regularization techniques. We conduct rigorous analysis to study the convergence behavior of the proposed algorithm. We also perform extensive experiments to evaluate it based on two general machine learning tasks using multiple datasets. The proposed algorithm exhibits improved computational speed and convergence behavior compared to a large set of state-of-the-art Riemannian optimization algorithms.
Keywords: Optimization, Cubic Regularization, Riemannian manifolds, Subsampling

1 Introduction
In modern machine learning, many learning tasks are formulated as non-convex optimization problems. This is because, as compared to linear or convex formulations, they can often capture more accurately the underlying structures within the data, and model more precisely the learning performance (or losses). There is an important class of non-convex problems of which the constraint sets possess manifold structures, e.g., to optimize over a set of orthogonal matrices. A manifold in mathematics refers to a topological space that locally behaves like a Euclidean space near each point. Over a $d$-dimensional manifold $\mathcal{M}$ embedded in a higher-dimensional ambient space, each local patch around each data point (a subset of $\mathcal{M}$) is homeomorphic to a local open subset of the Euclidean space $\mathbb{R}^d$. This special structure enables a straightforward adoption of any unconstrained optimization algorithm for solving a constrained problem over a manifold, simply by applying a systematic way to modify the gradient and Hessian calculation. The modified calculations are called the Riemannian gradients and the Riemannian Hessians, which will be rigorously defined later. Such an accessible method for developing optimized solutions has benefited many applications and encouraged the implementation of various optimization libraries.
A representative learning task that gives rise to non-convex problems over manifolds is low-rank matrix completion, widely applied in signal denoising, recommendation systems and image recovery liu2019convolution . It is formulated as an optimization problem constrained on fixed-rank matrices that belong to a Grassmann manifold. Another example task is principal component analysis (PCA), popularly used in statistical data analysis and dimensionality reduction shahid2015robust . It seeks an optimal orthogonal projection matrix from a Stiefel manifold. A more general problem setup than PCA is subspace learning mishra2019riemannian , where a low-dimensional subspace is an instance of a Grassmann manifold. When training a neural network, in order to reduce overfitting, an orthogonality constraint that provides a Stiefel manifold structure is sometimes imposed on the network weights anandkumar2016efficient . Additionally, in hand gesture recognition nguyen2019neural , optimizing over a symmetric positive definite (SPD) manifold has been shown effective.
Recent developments in optimization on Riemannian manifolds absil2009optimization have offered a convenient and unified solution framework for solving the aforementioned class of non-convex problems. The Riemannian optimization techniques translate the constrained problems into unconstrained ones on the manifold whilst preserving the geometric structure of the solution. For example, one Riemannian way to implement a PCA is to preserve the SPD geometric structure of the solutions without explicit constraints horev2016geometry . A simplified description of how Riemannian optimization works is that it first applies a straightforward way to modify the calculation of the first-order and second-order gradient information, then it adopts an unconstrained optimization algorithm that uses the modified gradient information. There are systematic ways to compute these modifications by analyzing the geometric structure of the manifold. Various libraries have implemented these methods and are available to practitioners, e.g., Manopt boumal2014manopt and Pymanopt townsend2016pymanopt .
Among such techniques, Riemannian gradient descent (RGD) is the simplest. To handle large-scale computation with a finite-sum structure, Riemannian stochastic gradient descent (RSGD) bonnabel2013stochastic has been proposed to estimate the gradient from a single sample (or a sample batch) in each iteration of the optimization. Here, an iteration refers to the process by which an incumbent solution is updated with gradient and (or) higher-order derivative information; for example, Eq. (3) in the upcoming text defines an RSGD iteration. Convergence rates of RGD and RSGD are compared in zhang2016first together with a global complexity analysis. The work concludes that RGD can converge linearly while RSGD converges sub-linearly; however, RSGD becomes computationally cheaper when the number of samples to process grows significantly, and it can also potentially prevent overfitting. By using RSGD to optimize over the Stiefel manifold, politz2016interpretable attempts to improve the interpretability of domain adaptation and has demonstrated its benefits for text classification.
A major drawback of RSGD is the variance issue, where the variance of the update direction can slow down the convergence and result in poor solutions. Typical techniques for variance reduction include the Riemannian stochastic variance reduced gradient (RSVRG) zhang2016riemannian and the Riemannian stochastic recursive gradient (RSRG) kasai2018riemannian . RSVRG reduces the gradient variance by using a momentum term, which takes into account the gradient information obtained from both RGD and RSGD. RSRG follows a different strategy and considers only the information in the last and current iterations. This has the benefit of avoiding large cumulative errors, which can be caused by transporting the gradient vector along a distant path when aligning two vectors at the same tangent plane. It has been shown by kasai2018riemannian that RSRG performs better than RSVRG particularly for large-scale problems.
The RSGD variants can suffer from oscillation across the slopes of a ravine kumar2018geometry . This also happens when performing stochastic gradient descent in Euclidean spaces. To address this, various adaptive algorithms have been proposed. The core idea is to control the learning process with adaptive learning rates in addition to the gradient momentum. Riemannian techniques of this kind include R-RMSProp kumar2018geometry , R-AMSGrad cho2017riemannian , R-AdamNC becigneul2018riemannian , RPG huang2021riemannian and RAGDsDR alimisis2021momentum .
Although improvements have been made for first-order optimization, they might still be insufficient for handling saddle points in non-convex problems mokhtari2018escaping . They can only guarantee convergence to stationary points and do not have control over getting trapped at saddle points due to the lack of higher-order information. As an alternative, second-order algorithms are normally good at escaping saddle points by exploiting curvature information kohler2017sub ; tripuraneni2018stochastic . Representative examples of this are the trust-region (TR) methods. Their capacity for handling saddle points and their improved convergence over many first-order methods have been demonstrated in weiwei2013newton for various non-convex problems. The TR technique has been extended to a Riemannian setting for the first time by absil2007trust , referred to as the Riemannian TR (RTR) technique.
It is well known that what prevents the wide use of second-order Riemannian techniques in large-scale problems is the high cost of computing the exact Hessian matrices. Inexact techniques are therefore proposed to iteratively search for solutions without explicit Hessian computations. They can also handle non-positive-definite Hessian matrices and improve operational robustness. Two representative inexact examples are the conjugate gradient and the Lanczos methods zhu2017riemannian ; xu2016matrix . However, their reduced complexity is still proportional to the sample size, and they can still be computationally costly when working with large-scale problems. To address this issue, the subsampling technique has been proposed; its core idea is to approximate the gradient and Hessian using a batch of samples. It has been proved by shen2019stochastic that the TR method with subsampled gradient and Hessian retains a provable convergence rate in terms of the iteration number. A sample-efficient stochastic TR approach is also proposed by shen2019stochastic , which finds an $\epsilon$-approximate local minimum within a number of stochastic Hessian oracle queries that depends on the sample number. The subsampling technique has been applied to improve second-order Riemannian optimization for the first time by kasai2018inexact . Their proposed inexact RTR algorithm employs subsampling over the Riemannian manifold and achieves faster convergence than the standard RTR method. Nonetheless, subsampling can be sensitive to the configured batch size. Overly small batch sizes can lead to poor convergence.
In the latest development of second-order unconstrained optimization, it has been shown that the adaptive cubic regularization technique cartis2011adaptive can improve the standard and subsampled TR algorithms and the Newton-type methods, resulting in, for instance, improved convergence and effectiveness at escaping strict saddle points kohler2017sub ; xu2020newton . To improve performance, the variance reduction techniques have been combined into cubic regularization and extended to cases with inexact solutions. Example works of this include zhou2020stochastic ; zhou2019stochastic which were the first to rigorously demonstrate the advantage of variance reduction for second-order optimization algorithms. Recently, the potential of cubic regularization for solving non-convex problems over constraint sets with Riemannian manifold structures has been shown by zhang2018cubic ; agarwal2021adaptive .
We aim at improving the RTR optimization by taking advantage of both the adaptive cubic regularization and subsampling techniques. Our problem of interest is to find a local minimum of a non-convex finite-sum minimization problem constrained on a set endowed with a Riemannian manifold structure. Letting $f: \mathcal{M} \to \mathbb{R}$ be a real-valued function defined on a Riemannian manifold $\mathcal{M}$, we consider a twice-differentiable finite-sum objective function:
$\min_{\mathbf{x} \in \mathcal{M}} f(\mathbf{x}) = \frac{1}{n} \sum_{i=1}^{n} f_i(\mathbf{x}). \qquad (1)$
In the machine learning context, $n$ denotes the sample number, and each $f_i$ is a smooth, real-valued and twice-differentiable cost (or loss) function computed for the $i$-th sample. The samples are assumed to be uniformly sampled, and thus $\mathbb{E}[f_i(\mathbf{x})] = f(\mathbf{x})$.
We propose a cubic Riemannian Newton-like (RN) method to solve more effectively the problem in Eq. (1). Specifically, we enable two key improvements in the Riemannian space, including (1) approximating the Riemannian gradient and Hessian using the subsampling technique and (2) improving the subproblem formulation by replacing the trust-region constraint with a cubic regularization term. The resulting algorithm is named Inexact Sub-RN-CR (the abbreviation Sub-RN-CR comes from Sub-sampled Riemannian Newton-like Cubic Regularization; we follow the tradition of referring to a TR method enhanced by cubic regularization as a Newton-like method cartis2011adaptive , and the implementation of the Inexact Sub-RN-CR is provided at https://github.com/xqdi/isrncr). After introducing cubic regularization, it becomes more challenging to solve the subproblem, for which we demonstrate two effective solvers based on the Lanczos and conjugate gradient methods. We provide convergence analysis for the proposed Inexact Sub-RN-CR algorithm and present the main results in Theorems 3 and 4. Additionally, we provide analysis for the subproblem solvers, regarding their solution quality, e.g., whether and how they meet a set of desired conditions as presented in Assumptions 1-3, and their convergence property. The key results are presented in Lemma 1, Theorem 2, and Lemmas 2 and 3. Overall, our results are satisfactory. The proposed algorithm finds an $\epsilon$-optimal solution (defined in Section 3.1) in fewer iterations than the state-of-the-art Inexact RTR kasai2018inexact , i.e., the required number of iterations is reduced compared with that of the Inexact RTR. When tested in extensive experiments on PCA and matrix completion tasks with different datasets and applications in image analysis, our algorithm shows much better performance than most state-of-the-art and popular Riemannian optimization algorithms, in terms of both solution quality and computing time.
2 Notations and Preliminaries
We start by familiarizing the readers with the notations and concepts that will be used in the paper, and recommend absil2009optimization for a more detailed explanation of the relevant concepts. The manifold $\mathcal{M}$ is equipped with a smooth inner product $\langle \cdot, \cdot \rangle_{\mathbf{x}}$ associated with the tangent space $T_{\mathbf{x}}\mathcal{M}$ at any $\mathbf{x} \in \mathcal{M}$, and this inner product is referred to as the Riemannian metric. The norm of a tangent vector $\boldsymbol{\eta} \in T_{\mathbf{x}}\mathcal{M}$ is denoted by $\|\boldsymbol{\eta}\|_{\mathbf{x}}$, which is computed by $\|\boldsymbol{\eta}\|_{\mathbf{x}} = \sqrt{\langle \boldsymbol{\eta}, \boldsymbol{\eta} \rangle_{\mathbf{x}}}$. When we use the notation $\boldsymbol{\eta}$, by default it belongs to the tangent space $T_{\mathbf{x}}\mathcal{M}$. We use $\mathbf{0}_{\mathbf{x}}$ to denote the zero vector of the tangent space at $\mathbf{x}$. The retraction mapping, denoted by $R_{\mathbf{x}}(\boldsymbol{\eta})$, is used to move $\mathbf{x}$ in the direction $\boldsymbol{\eta}$ while remaining on $\mathcal{M}$, and it is the manifold equivalent of $\mathbf{x} + \boldsymbol{\eta}$ in a Euclidean space. The pullback of $f$ at $\mathbf{x}$ is defined by $\hat{f}_{\mathbf{x}}(\boldsymbol{\eta}) = f(R_{\mathbf{x}}(\boldsymbol{\eta}))$, and $\hat{f}_{\mathbf{x}}(\mathbf{0}_{\mathbf{x}}) = f(\mathbf{x})$. The vector transport operator $\mathcal{T}_{\mathbf{x}}^{\mathbf{y}}(\boldsymbol{\eta})$ moves a tangent vector $\boldsymbol{\eta}$ from a point $\mathbf{x}$ to another point $\mathbf{y} \in \mathcal{M}$. We also use the shorthand notation $\mathcal{T}_{\boldsymbol{\xi}}(\boldsymbol{\eta})$ to describe the transport along a moving direction $\boldsymbol{\xi}$ from $\mathbf{x}$ to $\mathbf{y}$ satisfying $\mathbf{y} = R_{\mathbf{x}}(\boldsymbol{\xi})$. The parallel transport operator is a special instance of the vector transport. It moves $\boldsymbol{\eta}$ in the direction of $\boldsymbol{\xi}$ along a geodesic curve starting at $\mathbf{x}$ with initial velocity $\boldsymbol{\xi}$, and during the movement it has to satisfy the parallel condition on the geodesic curve. We simplify its notation to $\mathcal{P}_{\boldsymbol{\xi}}(\boldsymbol{\eta})$. Fig. 1 illustrates a manifold and the operations over it. Additionally, we use $\|\cdot\|_2$ to denote the $\ell_2$-norm operation in a Euclidean space.
The Riemannian gradient of a real-valued differentiable function $f$ at $\mathbf{x}$, denoted by $\operatorname{grad} f(\mathbf{x})$, is defined as the unique element of $T_{\mathbf{x}}\mathcal{M}$ satisfying $\langle \operatorname{grad} f(\mathbf{x}), \boldsymbol{\eta} \rangle_{\mathbf{x}} = D f(\mathbf{x})[\boldsymbol{\eta}]$ for all $\boldsymbol{\eta} \in T_{\mathbf{x}}\mathcal{M}$. Here, $D f(\mathbf{x})[\boldsymbol{\eta}]$ generalizes the notion of the directional derivative to a manifold, defined as the derivative of $f(\gamma(t))$ at $t = 0$, where $\gamma$ is a curve on $\mathcal{M}$ such that $\gamma(0) = \mathbf{x}$ and $\dot{\gamma}(0) = \boldsymbol{\eta}$. When operating in a Euclidean space, we use the same notation to denote the classical directional derivative. The Riemannian Hessian of a real-valued differentiable function $f$ at $\mathbf{x}$, denoted by $\operatorname{Hess} f(\mathbf{x})$, is a linear mapping from $T_{\mathbf{x}}\mathcal{M}$ to $T_{\mathbf{x}}\mathcal{M}$ defined based on the Riemannian connection, as $\operatorname{Hess} f(\mathbf{x})[\boldsymbol{\eta}] = \nabla_{\boldsymbol{\eta}} \operatorname{grad} f(\mathbf{x})$. The Riemannian connection $\nabla$ generalizes the notion of the directional derivative of a vector field. For a function defined over an embedded manifold, its Riemannian gradient can be computed by projecting the Euclidean gradient onto the tangent space, as $\operatorname{grad} f(\mathbf{x}) = P_{\mathbf{x}}(\text{Euclidean gradient of } f)$, where $P_{\mathbf{x}}$ is the orthogonal projection onto $T_{\mathbf{x}}\mathcal{M}$. Similarly, its Riemannian Hessian can be computed by projecting the classical directional derivative of $\operatorname{grad} f$, defined by $D \operatorname{grad} f(\mathbf{x})[\boldsymbol{\eta}]$, onto the tangent space, resulting in $\operatorname{Hess} f(\mathbf{x})[\boldsymbol{\eta}] = P_{\mathbf{x}}\!\left(D \operatorname{grad} f(\mathbf{x})[\boldsymbol{\eta}]\right)$. When the function is defined over a quotient manifold, the Riemannian gradient and Hessian can be computed by projecting the corresponding derivatives onto the horizontal space of the manifold.
Taking the PCA problem as an example (see Eqs. (60) and (61) for its formulation), the general objective in Eq. (1) can be instantiated with $\mathcal{M}$ being a Grassmann manifold $\operatorname{Gr}(r, d)$, and the function of interest being the per-sample PCA reconstruction cost. The Grassmann manifold contains the set of $r$-dimensional linear subspaces of the $d$-dimensional vector space. Each subspace corresponds to a point on the manifold that is an equivalence class of $d \times r$ orthonormal matrices, expressed as $[\mathbf{U}] = \{\mathbf{U}\mathbf{O} : \mathbf{O} \in \mathcal{O}(r)\}$, where $\mathcal{O}(r)$ denotes the orthogonal group in $\mathbb{R}^{r \times r}$. A tangent vector of the Grassmann manifold at $[\mathbf{U}]$ has the form $\boldsymbol{\xi} = \mathbf{U}_{\perp}\mathbf{B}$ edelman1998geometry , where $\mathbf{B} \in \mathbb{R}^{(d-r) \times r}$, and $\mathbf{U}_{\perp} \in \mathbb{R}^{d \times (d-r)}$ is the orthogonal complement of $\mathbf{U}$ such that $[\mathbf{U}, \mathbf{U}_{\perp}]$ is orthogonal. A commonly used Riemannian metric for the Grassmann manifold is the canonical inner product $\langle \boldsymbol{\xi}_1, \boldsymbol{\xi}_2 \rangle_{\mathbf{U}} = \operatorname{tr}(\boldsymbol{\xi}_1^{\top}\boldsymbol{\xi}_2)$ for given tangent vectors $\boldsymbol{\xi}_1, \boldsymbol{\xi}_2$ (Section 2.3.2 of edelman1998geometry ). As we can see, the Riemannian metric and the norm here are equivalent to the Euclidean inner product and norm. The same result can be derived from another commonly used metric of the Grassmann manifold, i.e., the metric given in Section 2.5 of edelman1998geometry . Expressing two given tangent vectors as $\boldsymbol{\xi}_1 = \mathbf{U}_{\perp}\mathbf{B}_1$ and $\boldsymbol{\xi}_2 = \mathbf{U}_{\perp}\mathbf{B}_2$ with $\mathbf{B}_1, \mathbf{B}_2 \in \mathbb{R}^{(d-r) \times r}$, we have
$\langle \boldsymbol{\xi}_1, \boldsymbol{\xi}_2 \rangle_{\mathbf{U}} = \operatorname{tr}(\boldsymbol{\xi}_1^{\top}\boldsymbol{\xi}_2) = \operatorname{tr}(\mathbf{B}_1^{\top}\mathbf{B}_2). \qquad (2)$
Here we provide a few examples of the key operations explained earlier on the Grassmann manifold, taken from boumal2014manopt . Given a data point $\mathbf{U}$, a moving direction $\boldsymbol{\xi}$ and a step size $t$, one way to construct the retraction mapping is to perform a thin singular value decomposition (SVD) of $\mathbf{U} + t\boldsymbol{\xi}$, i.e., $\mathbf{U} + t\boldsymbol{\xi} = \mathbf{W}\boldsymbol{\Sigma}\mathbf{V}^{\top}$, and the new data point after moving is $\mathbf{W}\mathbf{V}^{\top}$. A transport operation can be implemented by projecting a given tangent vector using the orthogonal projector $P_{\mathbf{U}} = \mathbf{I} - \mathbf{U}\mathbf{U}^{\top}$. Both the Riemannian gradient and Hessian can be computed by projecting the Euclidean gradient and Hessian of $f$ using the same projector $P_{\mathbf{U}}$.
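To make these operations concrete, below is a minimal NumPy sketch of the SVD-based retraction and the projection-based transport and gradient conversion on the Grassmann manifold; it assumes points are represented by $d \times r$ matrices with orthonormal columns, and the function names are our own rather than any library's API.

```python
import numpy as np

def svd_retraction(U, xi):
    """Retract U + xi back onto the manifold via a thin SVD, keeping W V^T."""
    W, _, Vt = np.linalg.svd(U + xi, full_matrices=False)
    return W @ Vt

def project_tangent(U, Z):
    """Orthogonal projector (I - U U^T) applied to Z. It is used both to
    transport a tangent vector to the tangent space at U and to turn a
    Euclidean gradient into a Riemannian one."""
    return Z - U @ (U.T @ Z)

# Illustrative usage on Gr(r, d) with random data.
d, r = 50, 5
rng = np.random.default_rng(0)
U = np.linalg.qr(rng.standard_normal((d, r)))[0]       # a point on the manifold
xi = project_tangent(U, rng.standard_normal((d, r)))   # a tangent direction
U_new = svd_retraction(U, 0.1 * xi)                    # move with step size 0.1
```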
2.1 First-order Algorithms
To optimize the problem in Eq. (1), the first-order Riemannian optimization algorithm RSGD updates the solution at the $t$-th iteration by using a single randomly sampled instance $f_{i_t}$, as
$\mathbf{x}_{t+1} = R_{\mathbf{x}_t}\!\left(-\alpha_t \operatorname{grad} f_{i_t}(\mathbf{x}_t)\right), \qquad (3)$
where $\alpha_t > 0$ is the step size. Assume that the algorithm runs for multiple epochs, referred to as the outer iterations. Each epoch contains multiple inner iterations, each of which corresponds to a randomly selected $f_{i_t}$ for calculating the update. Letting $\mathbf{x}_t^{s}$ be the solution at the $t$-th inner iteration of the $s$-th outer iteration and $\tilde{\mathbf{x}}^{s-1}$ be the solution at the last inner iteration of the $(s-1)$-th outer iteration, RSVRG employs a variance-reduced extension zhang2016riemannian of the update defined in Eq. (3), given as
$\mathbf{x}_{t+1}^{s} = R_{\mathbf{x}_t^{s}}\!\left(-\alpha_t \boldsymbol{\xi}_t^{s}\right), \qquad (4)$
where
$\boldsymbol{\xi}_t^{s} = \operatorname{grad} f_{i_t}(\mathbf{x}_t^{s}) - \mathcal{T}_{\tilde{\mathbf{x}}^{s-1}}^{\mathbf{x}_t^{s}}\!\left(\operatorname{grad} f_{i_t}(\tilde{\mathbf{x}}^{s-1}) - \operatorname{grad} f(\tilde{\mathbf{x}}^{s-1})\right). \qquad (5)$
Here, the full gradient information $\operatorname{grad} f(\tilde{\mathbf{x}}^{s-1})$ is used to reduce the variance in the stochastic gradient $\operatorname{grad} f_{i_t}(\mathbf{x}_t^{s})$. As a later development, RSRG kasai2018riemannian suggests a recursive formulation to improve the variance-reduced gradient $\boldsymbol{\xi}_t$. Starting from $\boldsymbol{\xi}_0 = \operatorname{grad} f(\mathbf{x}_0)$, it updates $\boldsymbol{\xi}_t$ by
$\boldsymbol{\xi}_t = \operatorname{grad} f_{i_t}(\mathbf{x}_t) - \mathcal{T}_{\mathbf{x}_{t-1}}^{\mathbf{x}_t}\!\left(\operatorname{grad} f_{i_t}(\mathbf{x}_{t-1}) - \boldsymbol{\xi}_{t-1}\right). \qquad (6)$
This formulation is designed to avoid the accumulated error caused by a distant vector transport.
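As an illustration of the update in Eq. (3), the following sketch runs RSGD on the Grassmann manifold for a per-sample PCA cost with batch size one; the QR-based retraction, the fixed step size and the function names are simplifying choices of this sketch rather than the settings used in the paper.

```python
import numpy as np

def rsgd_pca(X, r, n_iters=2000, step=1e-3, seed=0):
    """RSGD (Eq. (3)) with batch size 1 for minimizing -||x_i^T U||^2 over Gr(r, d)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    U = np.linalg.qr(rng.standard_normal((d, r)))[0]   # random starting point
    for _ in range(n_iters):
        x = X[rng.integers(n)][:, None]                # one sample, shape (d, 1)
        egrad = -2.0 * x @ (x.T @ U)                   # Euclidean gradient of -||x^T U||^2
        rgrad = egrad - U @ (U.T @ egrad)              # project onto the tangent space
        U = np.linalg.qr(U - step * rgrad)[0]          # retract via QR factorization
    return U
```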
2.2 Inexact RTR
For second-order Riemannian optimization, the Inexact RTR kasai2018inexact improves the standard RTR absil2007trust through subsampling. It optimizes an approximation of the objective function formulated using the second-order Taylor expansion within a trust region around the current solution $\mathbf{x}_t$ at iteration $t$. A moving direction $\boldsymbol{\eta}$ within the trust region is found by solving the subproblem at iteration $t$:
$\min_{\boldsymbol{\eta} \in T_{\mathbf{x}_t}\mathcal{M}} \; m_t(\boldsymbol{\eta}) := f(\mathbf{x}_t) + \langle \mathbf{G}_t, \boldsymbol{\eta} \rangle_{\mathbf{x}_t} + \frac{1}{2} \langle \mathbf{H}_t[\boldsymbol{\eta}], \boldsymbol{\eta} \rangle_{\mathbf{x}_t} \qquad (7)$
subject to $\|\boldsymbol{\eta}\|_{\mathbf{x}_t} \leq \Delta_t$,
where $\mathbf{G}_t$ and $\mathbf{H}_t$ are the approximate Riemannian gradient and Hessian calculated by using the subsampling technique. The approximation is based on the current solution $\mathbf{x}_t$ and the moving direction $\boldsymbol{\eta}$, calculated as
$\mathbf{G}_t = \frac{1}{|\mathcal{S}_g|} \sum_{i \in \mathcal{S}_g} \operatorname{grad} f_i(\mathbf{x}_t), \qquad (8)$
$\mathbf{H}_t[\boldsymbol{\eta}] = \frac{1}{|\mathcal{S}_H|} \sum_{i \in \mathcal{S}_H} \operatorname{Hess} f_i(\mathbf{x}_t)[\boldsymbol{\eta}], \qquad (9)$
where $\mathcal{S}_g$ and $\mathcal{S}_H$ are the sets of subsampled indices used for estimating the Riemannian gradient and Hessian, respectively. The updated solution will be accepted and the trust-region radius $\Delta_t$ will be increased if the decrease of the true objective function is sufficiently large compared to that of the approximate objective used in Eq. (7). Otherwise, $\Delta_t$ will be decreased because of the poor agreement between the approximate and true objectives.
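The subsampled estimates in Eqs. (8) and (9) can be assembled as in the sketch below for the Grassmann case, given per-sample callbacks for the Euclidean gradient and for the directional derivative of the gradient; the callback names and the projection-only treatment (following the embedded-manifold recipe stated in Section 2) are assumptions of this sketch.

```python
import numpy as np

def subsampled_riemannian_grad(egrad_i, U, idx):
    """Average per-sample Euclidean gradients over the sampled index set and
    project onto the tangent space at U, giving the estimate of Eq. (8)."""
    G = sum(egrad_i(i, U) for i in idx) / len(idx)
    return G - U @ (U.T @ G)

def subsampled_hessian_vec(dgrad_i, U, eta, idx):
    """Average per-sample directional derivatives of the gradient along eta
    over the sampled index set and project, giving an approximate Riemannian
    Hessian-vector product as in Eq. (9)."""
    H = sum(dgrad_i(i, U, eta) for i in idx) / len(idx)
    return H - U @ (U.T @ H)
```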
3 Proposed Method
3.1 Inexact Sub-RN-CR Algorithm
We propose to improve the subsampling-based construction of the RTR subproblem in Eq. (7) by cubic regularization. This gives rise to the minimization
(10)
where
(11) |
Here, is a user-specified parameter that plays a role in convergence analysis, which we will explain later. The core objective component is formulated by extending the adaptive cubic regularization technique cartis2011adaptive , given as
(12) |
with
(13) |
and
(14) |
where the subscript is used to highlight the pullback of at as . Overall, there are four hyper-parameters to be set by the user, including the gradient norm threshold , the dynamic control parameter to adjust the cubic regularization weight, the model validity threshold , and the initial trust parameter . We will discuss the setup of the algorithm in more detail.
We expect the norm of the approximate gradient to approach zero as the iterations proceed. Following a similar treatment in kasai2018inexact , when the gradient norm is smaller than the threshold $\epsilon_g$, the gradient-based term is ignored. This is important to the convergence analysis shown in the next section.
The trust-region radius is no longer explicitly defined, but replaced by the cubic regularization term $\frac{\sigma_t}{3}\|\boldsymbol{\eta}\|_{\mathbf{x}_t}^{3}$, where $\sigma_t$ is related to a Lagrange multiplier on a cubic trust-region constraint. Naturally, the smaller $\sigma_t$ is, the larger a moving step is allowed. Benefits of cubic regularization have been shown in griewank1981modification ; kohler2017sub . It can not only accelerate the local convergence, especially when the Hessian is singular, but also escape strict saddle points more effectively than the TR methods, providing stronger convergence properties.
The cubic term is equipped with a dynamic penalization control through an adaptive trust quantity. Its value is determined by examining how successful each iteration is. An iteration is considered successful if the agreement ratio between the change of the true objective and that of the approximate objective exceeds the model validity threshold, and unsuccessful otherwise. The larger this ratio is, the more effective the approximate model is. In an unsuccessful iteration, the trust quantity is increased by a constant factor, hoping to obtain a more accurate approximation in the next iteration. On the opposite, in a successful iteration it is decreased by the same factor, relaxing the approximation, but it is still restricted within a lower bound. This bound helps avoid solution candidates with overly large norms that can cause an unstable optimization. Below we formally define what an (un)successful iteration is, which will be used in our later analysis.
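A minimal sketch of the acceptance test and the dynamic control described above is given below; the names tau (model validity threshold), gamma (scaling factor) and sigma_min (lower bound) are hypothetical stand-ins for the corresponding user parameters.

```python
def accept_and_update(f_x, f_new, model_decrease, sigma,
                      tau=0.1, gamma=2.0, sigma_min=1e-6):
    """Return (accepted, new_sigma). rho compares the true decrease of f with
    the decrease predicted by the approximate model; a successful iteration
    relaxes the cubic penalty, an unsuccessful one tightens it."""
    rho = (f_x - f_new) / max(model_decrease, 1e-18)
    if rho >= tau:                        # successful iteration: accept the step
        return True, max(sigma / gamma, sigma_min)
    return False, gamma * sigma           # unsuccessful iteration: reject the step
```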
3.2 Optimality Examination
The stopping condition of the algorithm follows the definition of $(\epsilon_g, \epsilon_H)$-optimality nocedal2006numerical , stated as below.
Definition 2 ($(\epsilon_g, \epsilon_H)$-optimality)
Given $\epsilon_g, \epsilon_H > 0$, a solution $\mathbf{x}$ satisfies $(\epsilon_g, \epsilon_H)$-optimality if
$\|\operatorname{grad} f(\mathbf{x})\|_{\mathbf{x}} \leq \epsilon_g \quad \text{and} \quad \langle (\operatorname{Hess} f(\mathbf{x}) + \epsilon_H \mathbf{I})[\boldsymbol{\eta}], \boldsymbol{\eta} \rangle_{\mathbf{x}} \geq 0 \qquad (15)$
for all $\boldsymbol{\eta} \in T_{\mathbf{x}}\mathcal{M}$, where $\mathbf{I}$ is an identity matrix.
This is a relaxation and a manifold extension of the standard second-order optimality conditions, i.e., a vanishing gradient and a positive semidefinite Hessian, in Euclidean spaces mokhtari2018escaping . The algorithm stops (1) when the gradient norm is sufficiently small and (2) when the Hessian is sufficiently close to being positive semidefinite.
To examine the Hessian, we follow a similar approach as in han2021riemannian by assessing the solution of the following minimization problem:
$\lambda_{\min}(\mathbf{H}_t) := \min_{\boldsymbol{\eta} \in T_{\mathbf{x}_t}\mathcal{M},\, \|\boldsymbol{\eta}\|_{\mathbf{x}_t} = 1} \langle \mathbf{H}_t[\boldsymbol{\eta}], \boldsymbol{\eta} \rangle_{\mathbf{x}_t}, \qquad (16)$
which resembles the smallest eigenvalue of the Riemannian Hessian. As a result, the algorithm stops when $\|\mathbf{G}_t\|_{\mathbf{x}_t} \leq \epsilon_g$ (referred to as the gradient condition) and when $\lambda_{\min}(\mathbf{H}_t) \geq -\epsilon_H$ (referred to as the Hessian condition), where $\epsilon_g, \epsilon_H > 0$ are the user-set stopping parameters. Note that we use the same $\epsilon_g$ for thresholding the gradient-based term in the subproblem model. Pseudo code of the complete Inexact Sub-RN-CR is provided in Algorithm 1.
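The stopping test can be sketched as follows when the approximate Hessian is available as a dense matrix in tangent-space coordinates (an assumed representation; in practice only Hessian-vector products are formed and the smallest eigenvalue is estimated through Eq. (16)).

```python
import numpy as np

def stopping_test(grad, hess_mat, eps_g, eps_H):
    """Gradient condition: small gradient norm. Hessian condition: smallest
    eigenvalue not below -eps_H, i.e., approximately positive semidefinite."""
    grad_ok = np.linalg.norm(grad) <= eps_g
    hess_ok = np.linalg.eigvalsh(hess_mat).min() >= -eps_H
    return grad_ok and hess_ok
```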
3.3 Subproblem Solvers
The step (9) of Algorithm 1 requires solving the subproblem in Eq. (10). We rewrite its objective function as below for the convenience of explanation:
$\hat{m}_t(\boldsymbol{\eta}) = f(\mathbf{x}_t) + \delta \langle \mathbf{G}_t, \boldsymbol{\eta} \rangle_{\mathbf{x}_t} + \frac{1}{2} \langle \mathbf{H}_t[\boldsymbol{\eta}], \boldsymbol{\eta} \rangle_{\mathbf{x}_t} + \frac{\sigma_t}{3} \|\boldsymbol{\eta}\|_{\mathbf{x}_t}^{3}, \qquad (17)$
where $\delta = 1$ if $\|\mathbf{G}_t\|_{\mathbf{x}_t} \geq \epsilon_g$, and $\delta = 0$ otherwise. We demonstrate two solvers commonly used in practice.
3.3.1 The Lanczos Method
The Lanczos method 1999Solving has been widely used to solve the cubic regularization problem in Euclidean spaces xu2020newton ; kohler2017sub ; cartis2011adaptive ; jia2021solving and has recently been adapted to Riemannian spaces agarwal2021adaptive . Let $d$ denote the manifold dimension. Its core operation is to construct a Krylov subspace $\mathcal{K}_d$, of which the basis $\{\mathbf{q}_1, \ldots, \mathbf{q}_d\}$ spans the tangent space $T_{\mathbf{x}_t}\mathcal{M}$. After expressing the solution as an element in $\mathcal{K}_d$, i.e., $\boldsymbol{\eta} = \sum_{i=1}^{d} y_i \mathbf{q}_i$, the minimization problem in Eq. (17) can be converted to one in the Euclidean space $\mathbb{R}^{d}$, as
$\min_{\mathbf{y} \in \mathbb{R}^{d}} \; \|\mathbf{G}_t\|_{\mathbf{x}_t}\, \mathbf{e}_1^{\top}\mathbf{y} + \frac{1}{2} \mathbf{y}^{\top} \mathbf{T}_d\, \mathbf{y} + \frac{\sigma_t}{3} \|\mathbf{y}\|_2^{3}, \qquad (18)$
where $\mathbf{T}_d \in \mathbb{R}^{d \times d}$ is a symmetric tridiagonal matrix determined by the basis construction process, e.g., Algorithm 1 of jia2021solving . We provide a detailed derivation of Eq. (18) in Appendix A. The global solution of this converted problem, i.e., $\mathbf{y}^{*}$, can be found by many existing techniques, see press2007chapter . We employ the Newton root-finding method adopted by agarwal2021adaptive , which was originally proposed in Section 6 of cartis2011adaptive . It reduces the problem to a univariate root-finding problem. After this, the global solution of the subproblem is computed by $\boldsymbol{\eta}^{*} = \sum_{i=1}^{d} y_i^{*} \mathbf{q}_i$.
In practice, when the manifold dimension $d$ is large, it is more practical to find a good solution rather than the global solution. By working with a lower-dimensional Krylov subspace $\mathcal{K}_l$ with $l < d$, one can derive Eq. (18) in $\mathbb{R}^{l}$, and its solution $\mathbf{y}^{l}$ results in a subproblem solution $\boldsymbol{\eta}^{l} = \sum_{i=1}^{l} y_i^{l} \mathbf{q}_i$. Both the global solution and the approximate solution are always guaranteed to be at least as good as the solution obtained by performing a line search along the gradient direction, i.e.,
$\hat{m}_t(\boldsymbol{\eta}^{l}) \leq \min_{\alpha \in \mathbb{R}} \hat{m}_t(\alpha \mathbf{G}_t), \qquad (19)$
because the normalized gradient $\mathbf{q}_1 = \mathbf{G}_t / \|\mathbf{G}_t\|_{\mathbf{x}_t}$ is a common basis vector shared by all the constructed Krylov subspaces. We provide pseudo code for the Lanczos subproblem solver in Algorithm 2.
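For intuition, the following is a compact Euclidean-style sketch of the construction behind Algorithm 2: it builds the Krylov basis from Hessian-vector products, forms the tridiagonal matrix, and minimizes the reduced cubic model of Eq. (18). The generic solver used for the reduced problem and all function names are simplifications of this sketch; as noted above, the paper's implementation uses the Newton root-finding approach for the reduced problem instead.

```python
import numpy as np
from scipy.optimize import minimize

def lanczos_cubic(hvp, g, sigma, max_dim=25, tol=1e-10):
    """hvp(v): Hessian-vector product, g: gradient vector, sigma: cubic weight."""
    gnorm = np.linalg.norm(g)
    Q, alphas, betas = [], [], []
    q, q_prev, beta = g / gnorm, np.zeros_like(g), 0.0
    for _ in range(max_dim):
        Q.append(q)
        w = hvp(q) - beta * q_prev
        alpha = q @ w
        w = w - alpha * q
        alphas.append(alpha)
        beta = np.linalg.norm(w)
        if beta < tol:                        # Krylov subspace exhausted
            break
        betas.append(beta)
        q_prev, q = q, w / beta
    k = len(alphas)
    T = np.diag(alphas) + np.diag(betas[:k - 1], 1) + np.diag(betas[:k - 1], -1)

    def reduced_model(y):                     # Eq. (18) restricted to the k-dim subspace
        return gnorm * y[0] + 0.5 * y @ T @ y + sigma / 3.0 * np.linalg.norm(y) ** 3

    y = minimize(reduced_model, np.zeros(k), method="L-BFGS-B").x
    return np.column_stack(Q) @ y             # map back: eta = sum_i y_i q_i
```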
To benefit practitioners and improve understanding of the Lanczos solver, we analyse the gap between a practical solution $\boldsymbol{\eta}^{l}$ and the global minimizer $\boldsymbol{\eta}^{*}$. Firstly, we define $\lambda_{\max}(\mathbf{H}_t)$ in a manner similar to $\lambda_{\min}(\mathbf{H}_t)$ in Eq. (16). It resembles the largest eigenvalue of the Riemannian Hessian, as
$\lambda_{\max}(\mathbf{H}_t) := \max_{\boldsymbol{\eta} \in T_{\mathbf{x}_t}\mathcal{M},\, \|\boldsymbol{\eta}\|_{\mathbf{x}_t} = 1} \langle \mathbf{H}_t[\boldsymbol{\eta}], \boldsymbol{\eta} \rangle_{\mathbf{x}_t}. \qquad (20)$
We denote a degree- polynomial evaluated at by , such that
(21) |
for some coefficients . The quantity is recursively defined by for We define below an induced norm, as
(22) |
where the identity mapping operator works as . Now we are ready to present our result in the following lemma.
Lemma 1 (Lanczos Solution Gap)
We provide its proof in Appendix B. In Euclidean spaces, carmon2018analysis has shown that . With the help of Lemma 1, this could serve as a reference to gain an understanding of the solution quality for the Lanczos method in Riemannian spaces.
3.3.2 The Conjugate Gradient Method
We experiment with an alternative subproblem solver by adapting the non-linear conjugate gradient technique to Riemannian spaces. It starts from the initialization $\boldsymbol{\eta}^{0} = \mathbf{0}_{\mathbf{x}_t}$ and the first conjugate direction $\mathbf{p}^{0} = -\mathbf{G}_t$ (the negative gradient direction). At each inner iteration $k$ (as opposed to the outer iteration $t$ in the main algorithm), it solves a minimization problem with one input variable:
$\alpha^{k} = \operatorname*{arg\,min}_{\alpha \in \mathbb{R}} \; \hat{m}_t\!\left(\boldsymbol{\eta}^{k} + \alpha\, \mathbf{p}^{k}\right). \qquad (24)$
The global solution of this one-input minimization problem can be computed by zeroing the derivative of $\hat{m}_t(\boldsymbol{\eta}^{k} + \alpha \mathbf{p}^{k})$ with respect to $\alpha$, resulting in a polynomial equation in $\alpha$, which can then be solved by eigen-decomposition edelman1995polynomial . The root that attains the minimal value of $\hat{m}_t$ is retrieved. The algorithm then updates the next conjugate direction using the returned step size and the new iterate. Pseudo code for the conjugate gradient subproblem solver is provided in Algorithm 3.
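To illustrate the one-variable step computation, the sketch below zeroes the derivative of the cubic model along the direction p: after isolating the norm term and squaring, the stationarity condition becomes a quartic in alpha, whose real roots are obtained with numpy.roots and screened by evaluating the model. The eigen-decomposition-based root finder of edelman1995polynomial is not reproduced here, and tangent vectors are treated as flat arrays for simplicity.

```python
import numpy as np

def cg_step_size(G, Hp, Heta, eta, p, sigma):
    """Minimize m(a) = <G, eta + a p> + 0.5 <H[eta + a p], eta + a p>
    + sigma/3 ||eta + a p||^3 over a, given precomputed H[p] and H[eta]."""
    A = G @ p + Heta @ p                 # constant part of dm/da
    B = Hp @ p                           # coefficient of a in dm/da (quadratic term)
    c2, c1, c0 = p @ p, 2.0 * (eta @ p), eta @ eta   # ||eta + a p||^2 = c2 a^2 + c1 a + c0
    # Stationarity: A + B a + sigma * sqrt(q(a)) * ((eta.p) + a (p.p)) = 0.
    # Squaring both sides gives a quartic polynomial in a.
    lhs = np.polymul([B, A], [B, A])
    inner = [p @ p, eta @ p]
    rhs = sigma ** 2 * np.polymul([c2, c1, c0], np.polymul(inner, inner))
    roots = np.roots(np.polysub(rhs, lhs))
    candidates = [r.real for r in roots if abs(r.imag) < 1e-10]

    def model(a):
        v = eta + a * p
        return G @ v + 0.5 * (Heta + a * Hp) @ v + sigma / 3.0 * np.linalg.norm(v) ** 3

    return min(candidates, key=model)    # keep the root with the smallest model value
```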
Convergence of a conjugate gradient method largely depends on how its conjugate direction is updated. This is controlled by the setting of the coefficient used for calculating the next conjugate direction in Step (17) of Algorithm 3. Working in Riemannian spaces under the subsampling setting, it has been proven by sakai2021sufficient that, when the Fletcher-Reeves formula fletcher1964function is used, i.e.,
(25) |
where
(26) |
a conjugate gradient method can converge to a stationary point with . Working in Euclidean spaces, wei2006convergence has shown that the Polak–Ribiere–Polyak formula, i.e.,
(27) |
performs better than the Fletcher-Reeves formula. Building upon these, we propose to compute by a modified Polak–Ribiere–Polyak formula in Riemannian spaces in Step (16) of Algorithm 3, given as
(28) |
We prove that the resulting algorithm converges to a stationary point, and present the convergence result in Theorem 1, with its proof deferred to Appendix C.
Theorem 1 (Convergence of the Conjugate Gradient Solver)
In practice, Algorithm 3 terminates when there is no obvious change in the solution, which is examined in Step (4) by checking whether the step size is sufficiently small (Section 9 in agarwal2021adaptive ). To improve the convergence rate, the algorithm also terminates when the quantity computed in Step (13) is sufficiently small, following a classical criterion absil2007trust , i.e., by checking whether
(31) |
for some small positive threshold.
3.3.3 Properties of the Subproblem Solutions
In Algorithm 2, the basis is constructed successively starting from $\mathbf{q}_1$, while the converted problem in Eq. (18) is solved for each subspace dimension $l$ starting from $l = 1$. This process allows up to $d$ inner iterations. The solution obtained in the last inner iteration, where $l = d$, is the global minimizer over the tangent space. Differently, Algorithm 3 converges to a stationary point as proved in Theorem 1. In practice, a maximum inner iteration number is set in advance. Algorithm 3 stops when it reaches the maximum iteration number or converges to a status where the change in either the solution or the conjugate direction is very small.
The convergence property of the main algorithm presented in Algorithm 1 relies on the quality of the subproblem solution. Before discussing it, we first familiarize the reader with the classical TR concepts of Cauchy steps and eigensteps, here defined for the Inexact RTR problem introduced in Section 2.2. According to Section 3.3 of boumal2019global , when the subproblem takes the RTR form in Eq. (7), the closed-form Cauchy step is an improving direction defined by
(32) |
It points towards the gradient direction with an optimal step size computed by the operation, and follows the form of the general Cauchy step defined by
(33) |
According to Section 2.2 of kasai2018inexact , for some , the eigenstep satisfies
(34) |
It is an approximation of the negative curvature direction by an eigenvector associated with the smallest negative eigenvalue.
The following three assumptions on the subproblem solution are needed by the convergence analysis later. We define the induced norm for the Hessian as below:
(35) |
Assumption 1 (Sufficient Descent Step)
Given the Cauchy step and the eigenstep for , assume the subproblem solution satisfies the Cauchy condition
(36) |
and the eigenstep condition
(37) |
where
(38)
(39)
(40) |
The last two inequalities in Eqs. (36) and (37), concerning the Cauchy step and the eigenstep, are derived in Lemma 6 and Lemma 7 of xu2020newton . Assumption 1 generalizes Condition 3 in xu2020newton to the Riemannian case. It assumes that the subproblem solution is better than the Cauchy step and the eigenstep, in the sense that it decreases the value of the subproblem objective function more. The following two assumptions enable a stronger convergence result for Algorithm 1, which will be used in the proof of Theorem 4.
Assumption 2 (Sub-model Gradient Norm cartis2011adaptive ; kohler2017sub )
Assume the subproblem solution satisfies
(41) |
where .
Assumption 3 (Approximate Global Minimizer cartis2011adaptive ; yao2021inexact )
Assume the subproblem solution satisfies
(42)
(43) |
Driven by these assumptions, we characterize the subproblem solutions and present the results in the following lemmas. Their proofs are deferred to Appendix D.
Lemma 2 (Lanczos Solution)
Lemma 3 (Conjugate Gradient Solution)
In practice, Algorithm 2 based on lower-dimensional Krylov subspaces with $l < d$ returns a less optimal solution, while Algorithm 3 returns at most a local minimum. They are not guaranteed to satisfy the eigenstep condition in Assumption 1. The early-returned solutions from Algorithm 2 still satisfy Assumptions 2 and 3. However, solutions from Algorithm 3 do not satisfy these two assumptions exactly, although they can come close in an approximate manner. For instance, by the stationarity property stated in Lemma 3, there is a fair chance for Eq. (41) in Assumption 2 to be met by the solution from Algorithm 3. Also, given that the returned update is a descent direction, combining this with Eq. (45) in Lemma 3, there is a fair chance for Eq. (42) in Assumption 3 to be met. We present experimental results in Section 5.5, showing empirically to what extent the different solutions satisfy, or are close to, the eigenstep condition in Assumption 1 and Assumptions 2 and 3.
3.4 Practical Early Stopping
In practice, it is often more efficient to stop the optimization before the optimality conditions are met and obtain a reasonably good solution much faster. We employ a practical and simple early-stopping mechanism to accommodate this need. Algorithm 1 is allowed to terminate earlier when (1) the norm of the approximate gradient continually fails to decrease for a given number of consecutive iterations, and (2) the percentage of the function decrement is lower than a given threshold, i.e.,
(46) |
for a consecutive number of times, with the patience counts and the decrement threshold being user-defined.
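A small helper illustrating the two early-stopping tests is sketched below; the patience counts and the decrement threshold are hypothetical parameter names, and how the two tests are combined is a choice of this sketch.

```python
class EarlyStopper:
    """Stop when the gradient norm fails to decrease `patience_g` times in a
    row, or when the relative function decrement (cf. Eq. (46)) stays below
    `tol_f` for `patience_f` consecutive iterations."""

    def __init__(self, patience_g=5, patience_f=5, tol_f=1e-4):
        self.patience_g, self.patience_f, self.tol_f = patience_g, patience_f, tol_f
        self.best_gnorm = float("inf")
        self.bad_g = self.bad_f = 0

    def should_stop(self, gnorm, f_prev, f_curr):
        self.bad_g = 0 if gnorm < self.best_gnorm else self.bad_g + 1
        self.best_gnorm = min(self.best_gnorm, gnorm)
        rel_dec = abs(f_prev - f_curr) / max(abs(f_prev), 1e-18)
        self.bad_f = self.bad_f + 1 if rel_dec < self.tol_f else 0
        return self.bad_g >= self.patience_g or self.bad_f >= self.patience_f
```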
For the subproblem, both Algorithms 2 and 3 are allowed to terminate when the current solution, i.e., $\boldsymbol{\eta}^{l}$ for Algorithm 2 and $\boldsymbol{\eta}^{k}$ for Algorithm 3, satisfies
(47) |
This implements an examination of Assumption 2. Regarding Assumption 1, both Algorithms 2 and 3 optimize along the direction of the Cauchy step in their first iteration and thus satisfy the Cauchy condition, so there is no need to examine it. As for the eigenstep condition, it is costly to compute and compare with the eigenstep in each inner iteration, so we do not use it as a stopping criterion in practice. Regarding Assumption 3, according to Lemma 2, it is always satisfied by the solution from Algorithm 2, and therefore there is no need to examine it in Algorithm 2. As for Algorithm 3, the examination by Eq. (47) also plays a role in checking Assumption 3. The first condition in Assumption 3, Eq. (42), admits an equivalent reformulation (this results from Eq. (106) in Appendix D.2); in practice, when Eq. (47) is satisfied with a small threshold, the first condition of Assumption 3 is met approximately. Also, since the returned update is a descent direction, the second condition of Assumption 3 has a fairly high chance to be met.
4 Convergence Analysis
4.1 Preliminaries
We start by listing those assumptions and conditions from the existing literature that are adopted to support our analysis. Given a function $f$, the Hessian of its pullback $\hat{f}_{\mathbf{x}}$ and its Riemannian Hessian are identical when a second-order retraction is used boumal2019global , and this serves as an assumption to ease the analysis.
Assumption 4 (Second-order Retraction boumal2020introduction )
The retraction mapping is assumed to be a second-order retraction. That is, for all $\mathbf{x} \in \mathcal{M}$ and all $\boldsymbol{\eta} \in T_{\mathbf{x}}\mathcal{M}$, the curve $\gamma(t) = R_{\mathbf{x}}(t\boldsymbol{\eta})$ has zero acceleration at $t = 0$, i.e., $\ddot{\gamma}(0) = 0$.
The following two assumptions originate from the assumptions required by the convergence analysis of the standard RTR algorithm boumal2019global ; ferreira2002kantorovich , and are adopted here to support the inexact analysis.
Assumption 5 (Restricted Lipschitz Hessian)
There exists such that for all generated by Algorithm 1 and all , satisfies
(48) |
and
(49) |
where denotes the inverse process of the parallel transport operator.
Assumption 6 (Norm Bound on Hessian)
For all , there exists so that the inexact Hessian satisfies
(50) |
The following key conditions on the inexact gradient and Hessian approximations are developed in Euclidean spaces by roosta2019sub (Section 2.2) and xu2020newton (Section 1.3), respectively. We make use of these in Riemannian spaces.
Condition 1 (Approximation Error Bounds)
For all $\mathbf{x}_t$ generated by Algorithm 1 and all $\boldsymbol{\eta} \in T_{\mathbf{x}_t}\mathcal{M}$, suppose that there exist $\delta_g, \delta_H > 0$ such that the approximate gradient and Hessian satisfy
$\|\mathbf{G}_t - \operatorname{grad} f(\mathbf{x}_t)\|_{\mathbf{x}_t} \leq \delta_g, \qquad (51)$
$\|\mathbf{H}_t[\boldsymbol{\eta}] - \operatorname{Hess} f(\mathbf{x}_t)[\boldsymbol{\eta}]\|_{\mathbf{x}_t} \leq \delta_H \|\boldsymbol{\eta}\|_{\mathbf{x}_t}. \qquad (52)$
As will be shown in Theorem 2 later, these allow the used sampling size in the gradient and Hessian approximations to be fixed throughout the training process. As a result, it can serve as a guarantee of the algorithmic efficiency when dealing with large-scale problems.
4.2 Supporting Theorem and Assumption
In this section, we prove a preparation theorem and present new conditions required by our results. Below, we re-develop Theorem 4.1 in kasai2018inexact using the matrix Bernstein inequality gross2011recovering . It provides lower bounds on the required subsampling size for approximating the gradient and Hessian in order for Condition 1 to hold. The proof is provided in Appendix E.
Theorem 2 (Gradient and Hessian Sampling Size)
Define the suprema of the Riemannian gradient and Hessian
(53)
(54) |
Given $0 < \delta < 1$, Condition 1 is satisfied with probability at least $1 - \delta$ if
(55)
(56) |
where and denote the sampling sizes, while and are the dimensions of .
The two quantities $\delta_g$ and $\delta_H$ in Condition 1 are the upper bounds of the gradient and Hessian approximation errors, respectively. The following assumption bounds $\delta_g$ and $\delta_H$.
Assumption 7 (Restrictions on $\delta_g$ and $\delta_H$)
Given the relevant algorithm parameters, we assume that $\delta_g$ and $\delta_H$ satisfy
(57)
(58) |
4.3 Main Results
Now we are ready to present our main convergence results in two main theorems for Algorithm 1. Different from sun2019escaping which explores the escape rate from a saddle point to a local minimum, we study the convergence rate from a random point.
Theorem 3 (Convergence Complexity of Algorithm 1)
The proof along with the supporting lemmas is provided in Appendices F.1 and F.2. When the Hessian at the solution is close to positive semi-definite, which indicates a small Hessian tolerance, the Inexact Sub-RN-CR finds an $(\epsilon_g, \epsilon_H)$-optimal solution in fewer iterations than the Inexact RTR kasai2018inexact . Such a result is satisfactory. Combining Theorems 2 and 3 leads to the following corollary.
Corollary 1
The proof is provided in Appendix F.3. We use an example to illustrate the effect of on the approximate gradient sample size . Suppose , then . In addition, when , . Replacing with in Eqs. (55) and (56), it can be obtained that the lower bound of is proportional to .
Combining Assumption 3 and the stopping condition in Eq. (47) for the inexact solver, a stronger convergence result can be obtained for Algorithm 1, which is presented in the following theorem and corollary.
Theorem 4 (Optimal Convergence Complexity of Algorithm 1)
Corollary 2
4.4 Computational Complexity Analysis
We analyse the number of main operations required by the proposed algorithm. Taking the PCA task as an example, it optimizes over the Grassmann manifold . Denote by the number of inexact iterations and the manifold dimension, i.e., for the Grassmann manifold. Starting from the gradient and Hessian computation, the full case requires operations for both in the PCA task. By using the subsampling technique, these can be reduced to and by gradient and Hessian approximation. Following an existing setup for cost computation, i.e., Inexact RTR method kasai2018inexact , the full function cost evaluation takes operations, while the approximate cost evaluation after subsampling becomes , where is the subsampled set of data points used to compute the function cost. These show that, for large-scale practices with , the computational cost reduction gained from the subsampling technique is significant.
For the subproblem solver by Algorithm 2 or 3, the dominant computation within each iteration is the Hessian computation, which as mentioned above requires operations after using the subsampling technique. Taking this into account to analyze Algorithm 1, its overall computational complexity becomes based on Theorem 3, where corresponds to the operations for computing the full function cost and the approximate gradient in an outer iteration. This overall complexity can be simplified to , where is the cost of the subproblem solver by either Algorithm 2 or Algorithm 3. Algorithm 2 is guaranteed to return the optimal subproblem solution within at most inner iterations, of which the complexity is at most . Such a polynomial complexity is at least as good as the ST-tCG solver used in the Inexact RTR algorithm. For Algorithm 3, although is not guaranteed to be bounded and polynomial, we have empirically observed that is generally smaller than in practice, presenting a similar complexity to Algorithm 2.
5 Experiments and Result Analysis
We compare the proposed Inexact Sub-RN-CR algorithm with state-of-the-art and popular Riemannian optimization algorithms. These include the Riemannian stochastic gradient descent (RSGD) bonnabel2013stochastic , Riemannian steepest descent (RSD) absil2009optimization , Riemannian conjugate gradient (RCG) absil2009optimization , Riemannian limited memory BFGS algorithm (RLBFGS) yuan2016riemannian , Riemannian stochastic variance-reduced gradient (RSVRG) zhang2016riemannian , Riemannian stochastic recursive gradient (RSRG) kasai2018riemannian , RTR absil2007trust , Inexact RTR kasai2018inexact and RTRMC boumal2011rtrmc . Existing implementations of these algorithms are available in either Manopt boumal2014manopt or Pymanopt townsend2016pymanopt library. They are often used for algorithm comparison in existing literature, e.g., by Inexact RTR kasai2018inexact . Particularly, RSGD, RSD, RCG, RLBFGS, RTR and RTRMC algorithms have been encapsulated into Manopt, and RSD, RCG and RTR also into Pymanopt. RSVRG, RSRG and Inexact RTR are implemented by kasai2018inexact based on Manopt. We use existing implementations to reproduce their methods. Our Inexact Sub-RN-CR implementation builds on Manopt.
For the competing methods, we follow the same parameter settings from the existing implementations, including the batch size (i.e. sampling size), step size (i.e. learning rate) and the inner iteration number to ensure the same results as the reported ones. For our method, we first keep the common algorithm parameters the same as the competing methods, including , , and . Then, we use a grid search to find appropriate values of and for both Algorithms 2 and 3. Specifically, the searching grid for is , and the searching grid for is . For the parameter in Algorithm 3, we keep it the same as the other conjugate gradient solvers. The early stopping approach as described in Section 3.4 is applied to all the compared algorithms.
Regarding the batch setting, which is also the sample size setting for approximating the gradient and Hessian, we adopt the same value as used in existing subsampling implementations to keep consistency. Also, the same settings are used for both the PCA and matrix completion tasks. Specifically, the batch size is used for RSGD, RSVRG and RSRG where is not considered as these are first-order methods. For both the Inexact RTR and the proposed Inexact Sub-RN-CR, and is used. This is to follow the existing setting in kasai2018inexact for benchmark purposes, which exploits the approximate Hessian but the full gradient. In addition to these, we experiment with another batch setting of for both the Inexact RTR and Inexact Sub-RN-CR. This is flagged by in the algorithm name, meaning that the algorithm uses the approximate gradient in addition to the approximate Hessian. Its purpose is to evaluate the effect of in the optimization.
Evaluation is conducted based on two machine learning tasks of PCA and low-rank matrix completion using both synthetic and real-world datasets with . Both tasks can be formulated as non-convex optimization problems on the Grassmann manifold Gr. The algorithm performance is evaluated by oracle calls and the run time. The former counts the number of function, gradient, and Hessian-vector product computations. For instance, Algorithm 1 requires oracle calls each iteration, where is the number of iterations of the subproblem solver. Regarding the user-defined parameters in Algorithm 1, we use . Empirical observations suggest that the magnitude of the data entries affects the optimization in its early stage, and hence these factors are taken into account in the setting of . Let denote the input data matrix containing rows and columns. We compute by considering the data dimension, also the averaged data magnitude normalized by its standard deviation, given as
(59) |
where and is the manifold dimension.
Regarding the early stopping setting in Eq. (46), is used for both tasks, and we use for MNIST and for the remaining datasets in the PCA task. In the matrix completion task, we set for the synthetic datasets and for the real-world datasets. For the early stopping settings in Eq. (47) and in Step (13) of Algorithm 3, we adopt and . Between the two subproblem solvers, we observe that Algorithm 2 by the Lanczos method and Algorithm 3 by the conjugate gradient perform similarly. Therefore, we report the main results using Algorithm 2, and provide supplementary results for Algorithm 3 in a separate Section 5.4.
5.1 PCA Experiments
PCA can be interpreted as a minimization of the sum of squared residual errors between the projected and the original data points, formulated as
$\min_{\mathbf{U}^{\top}\mathbf{U} = \mathbf{I}_r} \; \frac{1}{n} \sum_{i=1}^{n} \left\| \mathbf{x}_i - \mathbf{U}\mathbf{U}^{\top}\mathbf{x}_i \right\|_2^2, \qquad (60)$
where $\mathbf{U} \in \mathbb{R}^{d \times r}$ is the projection matrix and $\mathbf{x}_i \in \mathbb{R}^{d}$ denotes the $i$-th data sample. The objective function can be re-expressed as one on the Grassmann manifold via
$\min_{[\mathbf{U}] \in \operatorname{Gr}(r, d)} \; -\frac{1}{n} \sum_{i=1}^{n} \mathbf{x}_i^{\top}\mathbf{U}\mathbf{U}^{\top}\mathbf{x}_i. \qquad (61)$
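For reference, the Grassmann PCA cost of Eq. (60) and the Euclidean gradient of its re-expressed form can be evaluated as below; this sketch assumes mean-centred samples stored as the rows of X and is not tied to any particular optimization library.

```python
import numpy as np

def pca_cost_and_egrad(U, X):
    """Cost: mean squared residual of projecting the samples onto span(U).
    Gradient: Euclidean gradient of the equivalent objective -tr(U^T X^T X U)/n,
    to be projected onto the tangent space before use."""
    n = X.shape[0]
    XU = X @ U
    cost = (np.sum(X ** 2) - np.sum(XU ** 2)) / n   # ||x - U U^T x||^2 summed, averaged
    egrad = -2.0 * X.T @ XU / n
    return cost, egrad
```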
One synthetic dataset P1 and two real-world datasets including MNIST lecun1998gradient and Covertype blackard1999comparative are used in the evaluation. The P1 dataset is firstly generated by randomly sampling each element of a matrix from a normal distribution. This is then followed by a multiplication with a diagonal matrix whose diagonal elements are randomly sampled from an exponential distribution, which increases the difference between the feature variances. After that, a mean-subtraction preprocessing is applied to obtain the final P1 dataset. The values are: for P1, for MNIST, and for Covertype. Algorithm accuracy is assessed by the optimality gap, defined as the absolute difference between the objective value at the solution returned by Algorithm 1 and the optimal function value. The optimal function value is computed by using the eigen-decomposition solution, which is a classical way to obtain the PCA result without going through the optimization program.
Fig. 2 compares the optimality gap changes over iterations for all the competing algorithms. Additionally, Fig. 3 summarizes their accuracy and convergence performance in optimality gap and run time. Fig. 4a reports the performance without using early stopping for the P1 dataset. It can be seen that the Inexact Sub-RN-CR reaches the minima with the smallest iteration number for both the synthetic and real-world datasets. In particular, the larger the scale of a problem is, the more obvious the advantage of our Inexact Sub-RN-CR is, evidenced by the performance difference.
However, both the Inexact RTR and the Inexact Sub-RN-CR achieve their best PCA performance when using a full gradient calculation accompanied by a subsampled Hessian. The subsampled gradient does not seem to result in a satisfactory solution, as shown in Fig. 2. Additionally, we report more results for the Inexact RTR and the proposed Inexact Sub-RN-CR in Fig. 4b on the P1 dataset with different gradient batch sizes. They all perform less well than the full-gradient setting. More accurate gradient information is required to produce a high-precision solution in these tested cases. A hypothesis on the cause of this phenomenon might be that the variance of the approximate gradients across samples is larger than that of the approximate Hessians. Hence, a sufficiently large sample size is needed for a stable approximation of the gradient information. Errors in approximate gradients may cause the algorithm to converge to a sub-optimal point with a higher cost, thus performing less well. Another hypothesis might be that the quadratic term in Eq. (61) would square the approximation error from the approximate gradient, which could significantly increase the PCA reconstruction error.
By fixing the gradient batch size for both the Inexact RTR and the Inexact Sub-RN-CR, we compare in Figs. 4c and 4d their sensitivity to the batch size used for Hessian approximation, experimenting with a range of Hessian batch sizes. It can be seen that the Inexact Sub-RN-CR outperforms the Inexact RTR in almost all cases, with a single exception on the MNIST dataset. The Inexact Sub-RN-CR also exhibits a significantly smaller rate of increase in oracle calls for large sample sizes. This implies that the Inexact Sub-RN-CR is more robust than the Inexact RTR to batch-size changes for inexact Hessian approximation.
5.2 Low-rank Matrix Completion Experiments
Low-rank matrix completion aims at completing a partial matrix with only a small number of entries observed, under the assumption that the matrix has a low rank. One way to formulate the problem is shown as below
$\min_{[\mathbf{U}] \in \operatorname{Gr}(r, d),\, \mathbf{W} \in \mathbb{R}^{r \times N}} \; \frac{1}{|\Omega|} \left\| P_{\Omega}(\mathbf{U}\mathbf{W}) - P_{\Omega}(\mathbf{A}) \right\|_F^2, \qquad (62)$
where $\Omega$ denotes the index set of the observed entries of the matrix $\mathbf{A} \in \mathbb{R}^{d \times N}$. The operator $P_{\Omega}$ is defined as $P_{\Omega}(\mathbf{A})_{ij} = \mathbf{A}_{ij}$ if $(i, j) \in \Omega$, and $P_{\Omega}(\mathbf{A})_{ij} = 0$ otherwise. We generate $\Omega$ by uniformly sampling a set of elements from the $d \times N$ entries. Let $\mathbf{w}_j$ be the $j$-th column of $\mathbf{W}$, $\mathbf{a}_j$ be the $j$-th column of $\mathbf{A}$, and $\Omega_j$ be the subset of $\Omega$ that contains the sampled indices for the $j$-th column of $\mathbf{A}$. Then, $\mathbf{w}_j$ has a closed-form solution $\mathbf{w}_j = \left(\mathbf{U}_{\Omega_j}\right)^{\dagger} \mathbf{a}_{\Omega_j}$ kasai2018inexact , where $\mathbf{U}_{\Omega_j}$ contains the selected rows of $\mathbf{U}$, and $\mathbf{a}_{\Omega_j}$ the selected elements of $\mathbf{a}_j$, according to the indices in $\Omega_j$, and $\dagger$ denotes the pseudo-inverse operation. To evaluate a solution $\mathbf{U}$, we generate another index set $\Omega_{\mathrm{test}}$, which is used as the test set and satisfies $\Omega_{\mathrm{test}} \cap \Omega = \emptyset$, following the same way as generating $\Omega$. We compute the mean squared error (MSE) by
$\mathrm{MSE} = \frac{1}{|\Omega_{\mathrm{test}}|} \left\| P_{\Omega_{\mathrm{test}}}(\mathbf{U}\mathbf{W}) - P_{\Omega_{\mathrm{test}}}(\mathbf{A}) \right\|_F^2. \qquad (63)$
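A direct NumPy sketch of the column-wise closed-form fit and the test error of Eq. (63) is given below, with the observed entries encoded as boolean masks (an assumed representation of the index sets).

```python
import numpy as np

def mc_fit_and_mse(U, A, train_mask, test_mask):
    """For each column, fit the coefficients on the observed (training) entries
    by least squares via the pseudo-inverse, then report the mean squared
    error over the held-out (test) entries."""
    pred = np.zeros_like(A)
    for j in range(A.shape[1]):
        obs = train_mask[:, j]
        w = np.linalg.pinv(U[obs]) @ A[obs, j]    # closed-form column coefficients
        pred[:, j] = U @ w
    err = (pred - A)[test_mask]
    return np.mean(err ** 2)
```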
In evaluation, three synthetic datasets and a real-world dataset Jester goldberg2001eigentaste are used where the training and test sets are already predefined by goldberg2001eigentaste . The synthetic datasets are generated by following a generative model similar to ngo2012scaled based on SVD. Specifically, to develop a synthetic dataset, we generate two matrices and with their elements independently sampled from the normal distribution . Then, we generate two orthogonal matrices and by applying the QR decomposition trefethen1997numerical respectively to and . After that, we construct a diagonal matrix of which the diagonal elements are computed by for , and the final data matrix is computed by . The reason to construct in this specific way is to have an easy control over the condition number of the data matrix, denoted by , which is the ratio between the maximum and minimum singular values of . Because , we can adjust the condition number by tuning the parameter . Following this generative model, each synthetic dataset is generated by randomly sampling two orthogonal matrices and constructing one diagonal matrix subject to the constraint of condition numbers, i.e., for M1 and for M2 and M3. The values of the used datasets are: for M1, for M2, for M3, and for Jester.
Fig. 5 compares the MSE changes over iterations, while Fig. 6 and Fig. 7 summarize both the MSE performance and the run time in the same plot for different algorithms and datasets. In Fig. 6, the Inexact Sub-RN-CR outperforms the others in most cases, and it can even be nearly twice as fast as the state-of-the-art methods for cases with a larger condition number (see dataset M2 and M3). This shows that the proposed algorithm is efficient at handling ill-conditioned problems. In Fig. 7, the Inexact Sub-RN-CR achieves a sufficiently small MSE with the shortest run time, faster than the Inexact RTR and RTRMC. Unlike in the PCA task, the subsampled gradient approximation actually helps to improve the convergence. A hypothesis for explaining this phenomenon could be that, as compared to the quadratic term in the PCA objective function, the linear term in the matrix completion objective function accumulates fewer errors from the inexact gradient, making the optimization more stable.
Additionally, Fig. 8a compares the Inexact RTR and the Inexact Sub-RN-CR with varying batch sizes for gradient estimation and with a fixed Hessian batch size. The M1-M3 results show that our algorithm exhibits stronger robustness to the gradient batch size, as it converges to the minima with only 50 additional oracle calls when the gradient batch size is reduced, whereas the Inexact RTR requires twice as many calls. For the Jester dataset, in all settings of gradient sample sizes, our method achieves a lower MSE than the Inexact RTR, especially for the smaller batch sizes. Fig. 8b compares sensitivities to the Hessian sample size with a fixed gradient batch size. The Inexact Sub-RN-CR performs better for the synthetic dataset M3, showing roughly twice faster convergence. For the Jester dataset, the Inexact Sub-RN-CR performs better in most settings, except for one small Hessian batch size, which is possibly because the construction of the Krylov subspace requires a more accurately approximated Hessian.
To summarize, we have observed from both the PCA and matrix completion tasks that the effectiveness of the subsampled gradient in the proposed approach can be sensitive to the choice of the practical problems, while the subsampled Hessian steadily contributes to a faster convergence rate.
5.3 Imaging Applications
In this section, we demonstrate some practical applications of PCA and matrix completion, which are solved by using the proposed optimization algorithm Inexact Sub-RN-CR, for analyzing medical images and scene images.
5.3.1 Functional Connectivity in fMRI by PCA
Functional magnetic-resonance imaging (fMRI) can be used to measure brain activities and PCA is often used to find functional connectivities between brain regions based on the fMRI scans zhong2009detecting . This method is based on the assumption that the activation is independent of other signal variations such as brain motion and physiological signals zhong2009detecting . Usually, the fMRI images are represented as a 4D data block subject to observational noise, including 3 spatial dimensions and 1 time dimension. Following a common preprocessing routine sidhu2012kernel ; kohoutova2020toward , we denote an fMRI data block by and a mask by that contains nonzero elements marked by . By applying the mask to the data block, we obtain a feature matrix , where each column stores the features of the brain at a given time stamp. One can increase the sample size by collecting fMRI data blocks corresponding to human subjects, after which the feature matrix is expanded to a larger matrix .
In this experiment, an fMRI dataset provided by the OpenfMRI database poldrack2017openfmri is used. We select a subset of human subjects and use the provided brain mask. The rank is set sufficiently large to capture the majority of the variance in the data. After the PCA processing, each computed principal component can be rendered back to the brain reconstruction by using the open-source library Nilearn abraham2014machine . Fig. 9 displays the optimization performance, where the Inexact Sub-RN-CR converges faster in terms of both run time and oracle calls. For our method and the Inexact RTR, adopting the subsampled gradient leads to a suboptimal solution in less time than using the full gradient. We speculate that imprecise gradients cause an oscillation of the optimization near local optima. Fig. 10 compares the results obtained by our optimization algorithm with those computed by eigen-decomposition. The highlighted regions denote the main activated regions with positive connectivity (yellow) and negative connectivity (blue). The components learned by the two approaches are similar, with some components learned by our approach tending to be more connected (see Figs. 10a and 10c).
5.3.2 Image Recovery by Matrix Completion
In this application, we demonstrate image recovery with matrix completion using a scene image selected from the BIG dataset cheng2020cascadepsp . As seen in Fig. 12a, this image contains rich texture information. The matrix completion task is conducted with a relatively large rank to allow more accurate recovery, and with a fixed sampling ratio for observing the pixels. Fig. 11 compares the performance of different algorithms, where the Inexact Sub-RN-CR takes the shortest time to obtain a satisfactory solution. The subsampled gradient promotes the optimization speed of the Inexact Sub-RN-CR without sacrificing much MSE. Fig. 12 illustrates the recovered image using three representative algorithms, which provide similar visual results.
5.4 Results of Conjugate Gradient Subproblem Solver
We experiment with Algorithm 3 for solving the subproblem. In Step (3) of Algorithm 3, the eigen-decomposition method edelman1995polynomial used to solve the minimization problem has a complexity determined by the fixed degree of the polynomial. Figs. 13-16 display the results for both the PCA and matrix completion tasks. Overall, Algorithm 3 can obtain the optimal results with the fastest convergence speed, as compared to the opponent approaches. We have observed that, in general, Algorithm 3 provides similar results to Algorithm 2, but they differ in run time. For instance, Algorithm 2 runs faster for the PCA task with the MNIST dataset and for the matrix completion task with the M1 dataset, as compared to Algorithm 3. But Algorithm 3 runs faster than Algorithm 2 for the matrix completion task with the M2 dataset. A hypothesis could be that Algorithm 2 performs well on well-conditioned data (e.g., MNIST and M1) because of its strength in finding the global solution, while for ill-conditioned data (e.g., M2) it may not show significant advantages over Algorithm 3. Moreover, from the computational aspect, Step (3) in Algorithm 3 has a low per-iteration complexity, which tends to be faster than solving Eq. (18) as required by Algorithm 2. Overall, this makes Algorithm 3 probably a better choice than Algorithm 2 for processing ill-conditioned data.
5.5 Examination of Convergence Analysis Assumptions
As explained in Section 3.3.3 and Section 3.4, the eigenstep condition in Assumption 1, Assumption 2 and Assumption 3, although required by the convergence analysis, are not always satisfied by a subproblem solver in practice. In this experiment, we estimate the probability that an assumption is satisfied during optimization by counting the number of outer iterations of Algorithm 1 in which the assumption holds. We repeat this entire process five times () to obtain a stable result. Let be the number of outer iterations in which the assumption is satisfied, and the total number of outer iterations, in the -th repetition (). We compute the probability by . Experiments are conducted for the PCA task using the P1 dataset.
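Under the natural reading of this averaging step (the mean of the per-repetition ratios), the estimate can be computed as in the following sketch, where the iteration counts are hypothetical placeholders.

```python
import numpy as np

# s[i]: outer iterations of Algorithm 1 in which the assumption held, and
# T[i]: total outer iterations, in the i-th of the five repetitions (placeholders).
s = np.array([18, 20, 19, 17, 20])
T = np.array([20, 20, 20, 20, 20])

p = float(np.mean(s / T))   # average satisfaction rate across repetitions
print(p)
```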
In order to examine Assumption 2, which is the stopping criterion in Eq. (47), we temporarily deactivate the other stopping criteria. We observe that Algorithm 2 always produces a solution that satisfies Assumption 2. However, Algorithm 3 has only a chance of producing a solution satisfying Assumption 2. The reason is probably that, when computing in Step (11) of Algorithm 3, the first-order approximation of is used rather than the exact for the sake of computational efficiency. This can result in an approximation error.
Regarding the eigenstep condition in Assumption 1, it can always be met by Algorithm 2 with . This indicates that even a few inner iterations are sufficient for it to find a solution pointing in a direction of negative curvature. However, Algorithm 3 has a chance of meeting the eigenstep condition. This might be caused by an insufficient number of inner iterations according to Theorem 1. Moreover, the solution obtained by Algorithm 3 is only guaranteed to be stationary according to Theorem 1, rather than to point in a direction of negative curvature. This could be a second reason why Algorithm 3 fails to meet the eigenstep condition in Eq. (34).
As for Assumption 3, according to Lemma 2, Algorithm 2 always satisfies it. This is verified by our results with . Empirically, Algorithm 3 has a chance of meeting Assumption 3. This empirical result matches the theoretical finding in Lemma 3, namely that solutions from Algorithm 3 tend to satisfy Assumption 3 approximately.
6 Conclusion
We have proposed the Inexact Sub-RN-CR algorithm to offer effective and fast optimization for an important class of non-convex problems whose constraint sets possess manifold structures. The algorithm improves the current state of the art in second-order Riemannian optimization by using cubic regularization together with subsampled Hessian and gradient approximations. We have also provided rigorous theoretical results on its convergence, and have empirically evaluated and compared it with state-of-the-art Riemannian optimization techniques on two general machine learning tasks and multiple datasets. Both the theoretical and the experimental results demonstrate that the Inexact Sub-RN-CR offers improved convergence behavior and reduced computational cost. Although the proposed method is promising for solving large-sample problems, it remains an open and interesting question whether it can be effective in training a constrained deep neural network. This task is more demanding than many other machine learning problems in terms of computational complexity and convergence requirements, and its Hessian approximation is more challenging to perform. Our future work will pursue this direction.
References
- (1) Abraham, A., Pedregosa, F., Eickenberg, M., Gervais, P., Mueller, A., Kossaifi, J., Gramfort, A., Thirion, B., Varoquaux, G.: Machine learning for neuroimaging with scikit-learn. Frontiers in Neuroinformatics 8, 14 (2014)
- (2) Absil, P.A., Baker, C.G., Gallivan, K.A.: Trust-region methods on Riemannian manifolds. Foundations of Computational Mathematics 7(3), 303–330 (2007)
- (3) Absil, P.A., Mahony, R., Sepulchre, R.: Optimization algorithms on matrix manifolds. Princeton University Press (2009)
- (4) Agarwal, N., Boumal, N., Bullins, B., Cartis, C.: Adaptive regularization with cubics on manifolds. Mathematical Programming 188(1), 85–134 (2021)
- (5) Alimisis, F., Orvieto, A., Bécigneul, G., Lucchi, A.: Momentum improves optimization on Riemannian manifolds. In: International Conference on Artificial Intelligence and Statistics, pp. 1351–1359. PMLR (2021)
- (6) Anandkumar, A., Ge, R.: Efficient approaches for escaping higher order saddle points in non-convex optimization. In: Conference on Learning Theory, pp. 81–102 (2016)
- (7) Becigneul, G., Ganea, O.E.: Riemannian adaptive optimization methods. In: International Conference on Learning Representations (2019)
- (8) Blackard, J.A., Dean, D.J.: Comparative accuracies of artificial neural networks and discriminant analysis in predicting forest cover types from cartographic variables. Computers and Electronics in Agriculture 24(3), 131–151 (1999)
- (9) Bonnabel, S.: Stochastic gradient descent on Riemannian manifolds. IEEE Transactions on Automatic Control 58(9), 2217–2229 (2013)
- (10) Boumal, N.: An introduction to optimization on smooth manifolds. Available online, May 3 (2020)
- (11) Boumal, N., Absil, P.A.: RTRMC: A Riemannian trust-region method for low-rank matrix completion. In: Advances in Neural Information Processing Systems, pp. 406–414 (2011)
- (12) Boumal, N., Absil, P.A., Cartis, C.: Global rates of convergence for nonconvex optimization on manifolds. IMA Journal of Numerical Analysis 39(1), 1–33 (2019)
- (13) Boumal, N., Mishra, B., Absil, P.A., Sepulchre, R.: Manopt, a Matlab toolbox for optimization on manifolds. The Journal of Machine Learning Research 15(1), 1455–1459 (2014)
- (14) Carmon, Y., Duchi, J.C.: Analysis of Krylov subspace solutions of regularized non-convex quadratic problems. Advances in Neural Information Processing Systems 31 (2018)
- (15) Cartis, C., Gould, N.I., Toint, P.L.: Adaptive cubic regularisation methods for unconstrained optimization. Part I: motivation, convergence and numerical results. Mathematical Programming 127(2), 245–295 (2011)
- (16) Cheng, H.K., Chung, J., Tai, Y.W., Tang, C.K.: CascadePSP: Toward class-agnostic and very high-resolution segmentation via global and local refinement. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8890–8899 (2020)
- (17) Cho, M., Lee, J.: Riemannian approach to batch normalization. In: Advances in Neural Information Processing Systems, pp. 5225–5235 (2017)
- (18) Conn, A.R., Gould, N.I., Toint, P.L.: Trust region methods. SIAM (2000)
- (19) Edelman, A., Arias, T.A., Smith, S.T.: The geometry of algorithms with orthogonality constraints. SIAM Journal on Matrix Analysis and Applications 20(2), 303–353 (1998)
- (20) Edelman, A., Murakami, H.: Polynomial roots from companion matrix eigenvalues. Mathematics of Computation 64(210), 763–776 (1995)
- (21) Ferreira, O.P., Svaiter, B.F.: Kantorovich’s theorem on Newton’s method in Riemannian manifolds. Journal of Complexity 18(1), 304–329 (2002)
- (22) Fletcher, R., Reeves, C.M.: Function minimization by conjugate gradients. The Computer Journal 7(2), 149–154 (1964)
- (23) Goldberg, K., Roeder, T., Gupta, D., Perkins, C.: Eigentaste: A constant time collaborative filtering algorithm. Information Retrieval 4(2), 133–151 (2001)
- (24) Gould, N., Lucidi, S., Roma, M., Toint, P.L.: Solving the trust-region subproblem using the Lanczos method. SIAM Journal on Optimization 9(2), 504–525 (1999)
- (25) Griewank, A.: The modification of Newton’s method for unconstrained optimization by bounding cubic terms. Tech. rep., Technical report NA/12 (1981)
- (26) Gross, D.: Recovering low-rank matrices from few coefficients in any basis. IEEE Transactions on Information Theory 57(3), 1548–1566 (2011)
- (27) Han, A., Mishra, B., Jawanpuria, P.K., Gao, J.: On Riemannian optimization over positive definite matrices with the Bures-Wasserstein geometry. Advances in Neural Information Processing Systems 34 (2021)
- (28) Horev, I., Yger, F., Sugiyama, M.: Geometry-aware principal component analysis for symmetric positive definite matrices. pp. 493–522. Springer (2017)
- (29) Hosseini, S., Huang, W., Yousefpour, R.: Line search algorithms for locally Lipschitz functions on Riemannian manifolds. SIAM Journal on Optimization 28(1), 596–619 (2018)
- (30) Huang, W., Wei, K.: Riemannian proximal gradient methods. Mathematical Programming pp. 1–43 (2021)
- (31) Jia, X., Liang, X., Shen, C., Zhang, L.H.: Solving the cubic regularization model by a nested restarting Lanczos method (2021)
- (32) Kasai, H., Mishra, B.: Inexact trust-region algorithms on Riemannian manifolds. In: Advances in Neural Information Processing Systems, pp. 4249–4260 (2018)
- (33) Kasai, H., Sato, H., Mishra, B.: Riemannian stochastic recursive gradient algorithm. In: International Conference on Machine Learning, pp. 2516–2524 (2018)
- (34) Kohler, J.M., Lucchi, A.: Sub-sampled cubic regularization for non-convex optimization. In: Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1895–1904. JMLR.org (2017)
- (35) Kohoutová, L., Heo, J., Cha, S., Lee, S., Moon, T., Wager, T.D., Woo, C.W.: Toward a unified framework for interpreting machine-learning models in neuroimaging. Nature protocols 15(4), 1399–1435 (2020)
- (36) Kumar Roy, S., Mhammedi, Z., Harandi, M.: Geometry aware constrained optimization techniques for deep learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4460–4469 (2018)
- (37) LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11), 2278–2324 (1998)
- (38) Liu, X., He, J., Duddy, S., O’Sullivan, L.: Convolution-consistent collective matrix completion. In: International Conference on Information and Knowledge Management, pp. 2209–2212 (2019)
- (39) Mishra, B., Kasai, H., Jawanpuria, P., Saroop, A.: A Riemannian gossip approach to subspace learning on Grassmann manifold. Machine Learning 108(10), 1783–1803 (2019)
- (40) Mokhtari, A., Ozdaglar, A., Jadbabaie, A.: Escaping saddle points in constrained optimization. In: Advances in Neural Information Processing Systems, pp. 3629–3639 (2018)
- (41) Ngo, T., Saad, Y.: Scaled gradients on Grassmann manifolds for matrix completion. Advances in Neural Information Processing Systems 25 (2012)
- (42) Nguyen, X.S., Brun, L., Lézoray, O., Bougleux, S.: A neural network based on SPD manifold learning for skeleton-based hand gesture recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 12036–12045 (2019)
- (43) Nocedal, J., Wright, S.: Numerical optimization. Springer Science & Business Media (2006)
- (44) Poldrack, R.A., Gorgolewski, K.J.: OpenfMRI: Open sharing of task fMRI data. Neuroimage 144, 259–261 (2017)
- (45) Pölitz, C., Duivesteijn, W., Morik, K.: Interpretable domain adaptation via optimization over the Stiefel manifold. Machine Learning 104(2), 315–336 (2016)
- (46) Press, W.H., Teukolsky, S.A., Vetterling, W.T., Flannery, B.: Chapter 9: Root finding and nonlinear sets of equations. In: Numerical Recipes: The Art of Scientific Computing, New York, Cambridge University Press, 10 (2007)
- (47) Qi, C.: Numerical optimization methods on Riemannian manifolds (2011)
- (48) Roosta-Khorasani, F., Mahoney, M.W.: Sub-sampled Newton methods. Mathematical Programming 174(1), 293–326 (2019)
- (49) Sakai, H., Iiduka, H.: Sufficient descent Riemannian conjugate gradient methods. Journal of Optimization Theory and Applications 190(1), 130–150 (2021)
- (50) Sato, H., Iwai, T.: A new, globally convergent Riemannian conjugate gradient method. Optimization 64(4), 1011–1031 (2015)
- (51) Shahid, N., Kalofolias, V., Bresson, X., Bronstein, M., Vandergheynst, P.: Robust principal component analysis on graphs. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2812–2820 (2015)
- (52) Shen, Z., Zhou, P., Fang, C., Ribeiro, A.: A stochastic trust region method for non-convex minimization. arXiv preprint arXiv:1903.01540 (2019)
- (53) Sidhu, G.S., Asgarian, N., Greiner, R., Brown, M.R.: Kernel principal component analysis for dimensionality reduction in fMRI-based diagnosis of ADHD. Frontiers in Systems Neuroscience 6, 74 (2012)
- (54) Sun, Y., Flammarion, N., Fazel, M.: Escaping from saddle points on Riemannian manifolds. arXiv preprint arXiv:1906.07355 (2019)
- (55) Townsend, J., Koep, N., Weichwald, S.: Pymanopt: A python toolbox for optimization on manifolds using automatic differentiation. The Journal of Machine Learning Research 17(1), 4755–4759 (2016)
- (56) Trefethen, L.N., Bau III, D.: Numerical linear algebra, vol. 50. SIAM (1997)
- (57) Tripuraneni, N., Stern, M., Jin, C., Regier, J., Jordan, M.I.: Stochastic cubic regularization for fast nonconvex optimization. Advances in Neural Information Processing Systems 31 (2018)
- (58) Tropp, J.A.: An introduction to matrix concentration inequalities. arXiv preprint arXiv:1501.01571 (2015)
- (59) Wei, Z., Yao, S., Liu, L.: The convergence properties of some new conjugate gradient methods. Applied Mathematics and Computation 183(2), 1341–1350 (2006)
- (60) Weiwei, Y., Yueting, Y., Chenhui, Z., Mingyuan, C.: A Newton-like trust region method for large-scale unconstrained nonconvex minimization. In: Abstract and Applied Analysis, vol. 2013. Hindawi (2013)
- (61) Xu, P., Roosta, F., Mahoney, M.W.: Newton-type methods for non-convex optimization under inexact Hessian information. Mathematical Programming 184(1), 35–70 (2020)
- (62) Xu, Z., Zhao, P., Cao, J., Li, X.: Matrix eigen-decomposition via doubly stochastic Riemannian optimization. In: International Conference on Machine Learning, pp. 1660–1669 (2016)
- (63) Yao, Z., Xu, P., Roosta, F., Mahoney, M.W.: Inexact nonconvex Newton-type methods. INFORMS Journal on Optimization 3(2), 154–182 (2021)
- (64) Yuan, X., Huang, W., Absil, P.A., Gallivan, K.A.: A Riemannian limited-memory BFGS algorithm for computing the matrix geometric mean. Procedia Computer Science 80, 2147–2157 (2016)
- (65) Zhang, H., Reddi, S.J., Sra, S.: Riemannian SVRG: Fast stochastic optimization on Riemannian manifolds. In: Advances in Neural Information Processing Systems, pp. 4592–4600 (2016)
- (66) Zhang, H., Sra, S.: First-order methods for geodesically convex optimization. In: Conference on Learning Theory, pp. 1617–1638 (2016)
- (67) Zhang, J., Zhang, S.: A cubic regularized Newton’s method over Riemannian manifolds. arXiv preprint arXiv:1805.05565 (2018)
- (68) Zhong, Y., Wang, H., Lu, G., Zhang, Z., Jiao, Q., Liu, Y.: Detecting functional connectivity in fMRI using PCA and regression analysis. Brain Topography 22(2), 134–144 (2009)
- (69) Zhou, D., Gu, Q.: Stochastic recursive variance-reduced cubic regularization methods. In: International Conference on Artificial Intelligence and Statistics, pp. 3980–3990. PMLR (2020)
- (70) Zhou, D., Xu, P., Gu, Q.: Stochastic variance-reduced cubic regularization methods. J. Mach. Learn. Res. 20(134), 1–47 (2019)
- (71) Zhu, X.: A Riemannian conjugate gradient method for optimization on the Stiefel manifold. Computational Optimization and Applications 67(1), 73–110 (2017)
Appendix A Appendix: Derivation of Lanczos Method
Instead of solving the subproblem in the tangent space of the manifold dimension , the Lanczos method solves it within a Krylov subspace , where can range from 1 to . This subspace is defined by the span of the following elements:
(64)
where, for , is recursively defined by . Its orthonormal basis , where , is successively constructed to satisfy
(65)
(66) |
for , where denotes the -th element of the matrix . Each element in the Krylov subspace can be expressed as . Storing these in the vector , the subproblem objective function is minimized over the Krylov subspace instead. By substituting into , the objective function becomes
(67) |
The properties for and for are used in the derivation. Therefore, to solve is equivalent to
(68) |
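For intuition, the Euclidean analogue of this construction is the standard three-term Lanczos recurrence; the sketch below uses a plain symmetric matrix as a stand-in for the (subsampled) Hessian acting on tangent vectors, and the function and variable names are illustrative.

```python
import numpy as np

def lanczos_basis(hess_vec, g, k):
    """Build an orthonormal basis Q of the Krylov subspace
    span{g, Hg, ..., H^(k-1) g} and the tridiagonal projection T = Q^T H Q
    via the standard three-term Lanczos recurrence (Euclidean sketch)."""
    n = g.shape[0]
    Q = np.zeros((n, k))
    alpha, beta = np.zeros(k), np.zeros(k)
    q, q_prev = g / np.linalg.norm(g), np.zeros(n)
    for j in range(k):
        Q[:, j] = q
        w = hess_vec(q)
        alpha[j] = q @ w
        w = w - alpha[j] * q - (beta[j - 1] * q_prev if j > 0 else 0.0)
        beta[j] = np.linalg.norm(w)
        if beta[j] < 1e-12:                    # invariant subspace reached
            Q, alpha, beta = Q[:, :j + 1], alpha[:j + 1], beta[:j + 1]
            break
        q_prev, q = q, w / beta[j]
    T = np.diag(alpha) + np.diag(beta[:-1], 1) + np.diag(beta[:-1], -1)
    return Q, T

# Example: a random symmetric matrix standing in for the Hessian.
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 50)); A = (A + A.T) / 2
Q, T = lanczos_basis(lambda v: A @ v, rng.standard_normal(50), 10)
print(np.allclose(Q.T @ Q, np.eye(Q.shape[1]), atol=1e-6))
```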
Appendix B Appendix: Proof of Lemma 1
Proof
Let . The Krylov subspaces are invariant to shifts by scalar matrices, therefore carmon2018analysis , where the definition of follows Eq. (64). Let be the solution found in the Krylov subspace , which is thus an element of expressed by
(69) |
for some values of . According to Section 6.2 of absil2009optimization , a global minimizer of the RTR subproblem without cubic regularization in Eq. (7) is expected to satisfy the Riemannian quasi-Newton equation:
(70) |
where and according to Corollary 7.2.2 of conn2000trust . Using the approximate gradient and Hessian, the inexact minimizer is expected to satisfy
(71) |
Introduce where . When , we start from the fact that , which results in the following:
(72) |
When , it has
(73) |
This concludes that for any , it has .
We introduce the notation to denote the subproblem in Eq. (7) using the inexact Hessian . Let and . Since is the upper bound of , it has . Then, we have
(74)
(75)
(76)
(77) |
To derive Eq. (74), from Eq. (71) is used. To derive Eq. (75), we use the definition in Eq. (35), where for a non-zero , it has
(78) |
Eq. (75) also uses (1) the fact of , which comes from being of the form , and thus , (2) the definition of the smallest and largest eigenvalues in Eqs. (16) and (20), which gives for and any unit tangent vectors and , and (3) the fact that
(79) |
To derive Eq. (77), we use
(80) |
Next, as is the optimal solution in the subspace , we have , and hence
(81) |
We then show that the Lanczos method exhibits at least the same convergence property as above for the subsampled Riemannian cubic-regularization subproblem. Let be the global minimizer for the subproblem in Eq. (10). is equivalent to in the RTR subproblem with and carmon2018analysis . Then, letting be the minimizer of over satisfying , we have
(82) |
Combining this with Eq. (81), it has
(83) |
This completes the proof.
Appendix C Appendix: Proof of Theorem 1
Proof
We first prove the relationship between and . According to Algorithm 3, . Then for , we have
(84)
(85) |
where comes from the first-order Taylor expansion and Eq. (85) follows from Step (11) in Algorithm 3.
The exact line search in Step (3) of Algorithm 3 approximates
(86) |
Zeroing the derivative of Eq. (86) with respect to gives
(87) |
where results from using the subsampled gradient to approximate the full gradient . We then show that each is a sufficient descent direction, i.e., for some constant sakai2021sufficient. When , , and thus . When , from Step (17) in Algorithm 3 and Eq. (85), we have . Taking the inner product of both sides with , we have
(88)
(89) |
with a selected . Here, Eq. (89) builds on Eq. (87). Given the sufficient descent direction and the strong Wolfe conditions satisfied by , Theorem 2.4.1 in qi2011numerical shows that the Zoutendijk Condition holds sato2015new , i.e.,
(90) |
Next, we show that is upper bounded. Using Eq. (85), we have
(91) |
where is some constant.
Now, we prove by contradiction. Assume that , that is, for all , there exists such that . Squaring Step (17) of Algorithm 3 and applying Eqs. (30), (85), (89) and (91), we have
(92) |
where . Applying this repetitively, we have
(93)
where Eq. (93) uses Eq. (91) and , and the last inequality uses . Subsequently, this gives
(94) |
Combining Eqs. (89) and (94), we have
(95) |
This contradicts Eq. (90) and completes the proof.
Appendix D Appendix: Subproblem Solvers
D.1 Proof of Lemma 2
D.1.1 The Case of
Proof
Assumption 1: Regarding the Cauchy condition in Eq. (36), it is satisfied simply because of Eq. (19) and Eq. (33). Regarding the eigenstep condition, the proof is also fairly simple. The solution from Algorithm 2 with is the global minimizer over . As the subspace spanned by Cauchy steps and eigensteps belongs to , i.e., , we have
(96) |
Hence, the solution from Algorithm 2 satisfies the eigenstep condition in Eq. (37) from Assumption 1.
Assumption 2: As stated in Section 3.3 of cartis2011adaptive, any minimizer, including the global minimizer from Algorithm 2, is a stationary point of , and naturally has the property . Hence, we have . Assumption 2 is then satisfied.
Assumption 3: Given the above-mentioned property of the global minimizer of from Algorithm 2, i.e., , and using the definition of , we have
(97) |
Taking the inner product of both sides of Eq. (97) with , we have
(98) |
and this corresponds to Eq. (42). As is a descent direction, we have . Combining this with Eq. (98), we have
(99) |
and this is Eq. (43). This completes the proof.
D.1.2 The Case of
Proof
Cauchy condition in Assumption 1: As implied by Eq. (19), any intermediate solution satisfies the Cauchy condition.
Assumption 2: As stated in Section 3.3 of cartis2011adaptive, any minimizer of admits the property . Since each is the minimizer of over , we have . Then, we have . Assumption 2 is then satisfied.
Assumption 3: According to Lemma 3.2 of cartis2011adaptive , a global minimizer of over a subspace of satisfies Assumption 3. As the solution in each iteration of Algorithm 2 is a global solution over the subspace , each satisfies Assumption 3.
D.2 Proof of Lemma 3
Proof
For the Cauchy condition of Assumption 1: In the first iteration of Algorithm 3, the step size is optimized along the steepest-descent direction, as in the classical method of Cauchy, i.e.,
(100) |
At each iteration of Algorithm 3, the line search process in Eq. (24) aims at finding a step size that achieves a cost decrease; otherwise, the step size will be zero, meaning that no strict decrease can be achieved and the algorithm will stop at Step (4). Because of this, we have
(101) |
Given in Algorithm 3, we have
(102) |
Then, considering all , we have
(103) |
This shows that Algorithm 3 always returns a solution at least as good as the Cauchy step.
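For reference, the Cauchy step of a cubic-regularized model along the negative (approximate) gradient admits a closed-form step size obtained by zeroing a scalar derivative. The Euclidean sketch below, with hypothetical names and without the retraction and tangent-space machinery of the Riemannian setting, is an illustration under these assumptions rather than a transcription of Algorithm 3.

```python
import numpy as np

def cauchy_step(grad, hess_vec, sigma):
    """Cauchy step of the cubic-regularized model along -grad (g != 0, sigma > 0):
    minimize over t >= 0 the scalar function
        -t*||g||^2 + 0.5*t^2*<g, Hg> + (sigma/3)*t^3*||g||^3 .
    Setting its derivative to zero yields a quadratic in t whose unique
    nonnegative root gives the step size (Euclidean sketch)."""
    g_norm = np.linalg.norm(grad)
    gHg = grad @ hess_vec(grad)
    a, b, c = sigma * g_norm ** 3, gHg, -g_norm ** 2
    t = (-b + np.sqrt(b ** 2 - 4.0 * a * c)) / (2.0 * a)
    return -t * grad

# Example with a random symmetric matrix standing in for the Hessian.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 20)); A = (A + A.T) / 2
g = rng.standard_normal(20)
print(np.linalg.norm(cauchy_step(g, lambda v: A @ v, sigma=1.0)))
```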
For : The approximation concept () of interest here builds on the fact that is used as an approximation of the true objective function . Assuming , this leads to
(104) |
Let be the subsampled gradient evaluated at . Based on Theorem 1, we have , where is the resulting subsampled gradient after inner iterations. Since approximates the full gradient , we have . Hence, we have
(105)
which indicates that the equality holds as . Combining this with Eq. (104) completes the proof.
Appendix E Appendix: Proof of Theorem 2
E.1 Matrix Bernstein Inequality
We build the proof on the matrix Bernstein inequality. We restate this inequality in the following lemma.
Lemma 4 (Matrix Bernstein Inequality (gross2011recovering ; tropp2015introduction ))
Let be independent, centered random matrices with the common dimension . Assume that each one is uniformly bounded:
(108) |
Given the matrix sum , we define its variance by
(109) |
Then
(110) |
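For completeness, the standard Euclidean form of this inequality (see tropp2015introduction) can be written as follows, in generic notation that may differ slightly from the symbols used in Eqs. (108)-(110):

```latex
% Matrix Bernstein inequality, standard form (generic notation).
% X_1,\dots,X_n: independent, centered random matrices of common dimension
% d_1 \times d_2 with \|X_i\| \le L almost surely; Z = \sum_{i=1}^{n} X_i and
% v(Z) = \max\{\|\mathbb{E}[Z Z^{*}]\|,\ \|\mathbb{E}[Z^{*} Z]\|\}. Then, for all t \ge 0,
\mathbb{P}\bigl\{\|Z\| \ge t\bigr\}
  \;\le\; (d_1 + d_2)\,
  \exp\!\left(\frac{-t^{2}/2}{\,v(Z) + L t/3\,}\right).
```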
This lemma enables us to prove Theorem 2 as follows.
E.2 Main Proof
Proof
Following the subsampling process, a total of matrices are uniformly sampled from the set of Riemannian gradients . We denote each sampled element by , and we have
(111) |
Define the random matrix
(112) |
Since our focus is the type of problem in Eq. (1), we have
(113) |
Define a random variable
(114) |
Its variance satisfies
(115) |
Taking as an example and applying the definition of in Eq. (53), we have
(116) |
where the first inequality uses . Combining Eqs. (115) and (116), we have
(117) |
Now we are ready to apply Lemma 4. Given and according to the supremum definition , the following is obtained from the matrix Bernstein inequality:
(118)
(119)
(120) |
of which the last equality implies
(121) |
In simple words, with a probability at least , we have . Letting
(122) |
this results in the sample size bound as in Eq. (55)
(123) |
Therefore, we have . Expanding , we have the following satisfied with a probability at least :
(124) |
which is Eq. (51) of Condition 1.
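To give a feel for how such a tail bound translates into a concrete subsample size, one can invert the generic matrix Bernstein inequality for a target accuracy and failure probability, as in the sketch below. The constants follow from a uniform bound on the sampled (centered) gradients and are illustrative; they are not necessarily the sharper problem-specific constants of Eq. (55).

```python
import numpy as np

def gradient_sample_size(K_g, eps, delta, d1, d2):
    """Subsample size making the averaged (centered) gradient estimate
    eps-accurate in operator norm with probability at least 1 - delta.
    Derived by inverting the generic matrix Bernstein tail with per-sample
    bound 2*K_g and per-sample variance bound 4*K_g**2 (illustrative
    constants; not the exact constants of the paper's bound)."""
    variance = 4.0 * K_g ** 2
    factor = 2.0 * (variance + 2.0 * K_g * eps / 3.0) / eps ** 2
    return int(np.ceil(factor * np.log((d1 + d2) / delta)))

# Example: a 1%-level failure probability for a moderate-accuracy estimate.
print(gradient_sample_size(K_g=1.0, eps=0.1, delta=0.01, d1=100, d2=10))
```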
The proof of the other sample size bound follows the same strategy. A total of matrices are uniformly sampled from the set of Riemannian Hessians . We denote each sampled element by , and we have
(125) |
Define the random matrix
(126) |
For the problem defined in Eq. (1) with a second-order retraction, we have
(127)
(128) |
Define a random variable
(129) |
Its variance satisfies
(130) |
Taking as an example and applying the definition of in Eq. (54), we have
(131) |
where the first inequality uses , and is the current moving direction being optimized in Eq. (10). Combining Eqs. (130) and (131), we have
(132)
(133) |
We then apply Lemma 4. Given and according to the supremum definition , the following is obtained from the matrix Bernstein inequality:
(134)
(135)
(136) |
of which the last equality indicates
(137) |
In simple words, with probability at least , we have . By letting
(138) |
we obtain the sample size bound in Eq. (56)
(139) |
Thus, we have . Expanding , we have the following satisfied with a probability at least :
(140) |
which is Eq. (52) of Condition 1.
Appendix F Appendix: Theorem 3 and Corollary 1
F.1 Supporting Lemmas for Theorem 3
Lemma 5
Suppose Condition 1 and Assumptions 1, 2 hold, then for the case of , we have
(141) |
Proof
We start from the left-hand side of Eq. (141), and this leads to
(142) |
where the first inequality uses the Cauchy-Schwarz inequality and the second uses Eqs. (48), (51) and (52). It can be seen that the term cannot be neglected because of the condition based on Eq. (LABEL:eq_m). This then results in the term in Eq. (142).
Lemma 6
Suppose Condition 1 and Assumptions 1, 2 hold, then for the case of and , we have
(143) |
Proof
For each , at least one of the two inequalities is true:
(144)
(145) |
Without loss of generality, we assume , an assumption also adopted by yao2021inexact, and we have
(146) |
The conditions and in this case mean that the term can be neglected based on Eq. (LABEL:eq_m), while the optimization process is not yet finished. The effect is that no longer appears in Eq. (146), unlike in Eq. (142).
Lemma 7
Suppose Condition 1 and Assumptions 1, 2, 3, 4, 5 hold, then when and , we have
(147) |
Proof
(i) If , since , it follows that
(149) |
Lemma 8
Suppose Condition 1 and Assumptions 1, 2, 4, 5 hold, then when and , if , we have
(151) |
Proof
(i) If , since , it follows that
(153) |
Lemma 9
Suppose Assumptions 1, 2, 3, 4, 5 hold, then when , if , the iteration is successful, i.e., and .
Proof
We have that
(155) |
where the first inequality follows from Eqs. (36) and (141), the second from Eqs. (147), (36) and (39), and the third from Eq. (148). The last two inequalities hold given Eqs. (57) and (58) and the fact that the function is decreasing in . Consequently, we have , and the iteration is successful. Based on Step (12) of Algorithm 1, we have .
Lemma 10
Suppose Condition 1 and Assumption 1, 2, 4, 5 hold, then when and , if , the iteration is successful, i.e., and .
Proof
Lemma 11
Suppose Condition 1 and Assumptions 1, 2, 3, 4, 5 hold, then for all , we have
(157) |
Proof
We prove this lemma by contradiction, considering the following two cases.
(i) If , we assume there exists an iteration that is the first unsuccessful iteration such that . This implies and . Since the algorithm fails to terminate at iteration , we have or . Then, according to Lemmas 9 and 10, iteration is successful and thus . This contradicts the earlier statement of . We thus have for all .
(ii) If , similarly, we assume there exists an iteration that is the first unsuccessful iteration such that . This implies and . Since the algorithm fails to terminate at iteration , we have or . According to Lemmas 9 and 10, iteration is successful and thus , which is a contradiction. Thus, we have for all .
F.2 Main Proof of Theorem 3
Proof
Letting , when , from Eq. (36) and Lemma 11 we have
(158) |
When and , from Eq. (37) and Lemma 11, we have
(159) |
Let be the set of successful iterations before Algorithm 1 terminates and be the function minimum. Since is monotonically decreasing, we have
(160)
where . Let be the set of unsuccessful iterations and be the total number of iterations of the algorithm. Then we have . Combining this with the bound from Lemma 11, we have
(161) |
Finally, combining Eqs. (160) and (161), we have
(162)
This completes the proof.
F.3 Main Proof of Corollary 1
Proof
Under the given assumptions, when Theorem 3 holds, Algorithm 1 returns an -optimal solution in iterations. Also, according to Theorem 2, at any iteration, Condition 1 is satisfied with a probability , where the probability at the current iteration can be achieved independently by selecting proper subsampling sizes for the approximate gradient and Hessian. Let be the event that Algorithm 1 returns an -optimal solution and be the event that Condition 1 is satisfied at iteration . According to Theorem 3, for event to happen, Condition 1 must be satisfied at all iterations; thus, we have
(163) |
This completes the proof.
Appendix G Appendix: Theorem 4 and Corollary 2
G.1 Supporting Lemmas for Theorem 4
The proof of Theorem 4 builds on an existing lemma, which we restate as follows.
Lemma 12
(Sufficient Function Decrease (Lemma 3.3 in cartis2011adaptive )) Suppose the solution satisfies Assumption 3, then we have
(164) |
The proof also needs another lemma that we develop as below.
Lemma 13
Proof
By differentiating the approximate model, we have
(166) |
where the first inequality follows from the triangle inequality and the second inequality from Eqs. (49), (51) and (52). Additionally, from Lemma 3.8 in kasai2018riemannian , we have
(167) |
where is a constant. Then, we have
(168) |
with the first inequality following from the triangle inequality and the second from Eqs. (51), (52) and (167). Then, by combining Eqs. (47), (166) and (168), with we obtain
(169)
This results in
(170)
Subsequently, it has
(171) |
In the above derivation, we use the property that parallel transport preserves the length of the transported vector. The last inequality in Eq. (170) is based on Eq. (51) and the triangle inequality.
G.2 Main Proof of Theorem 4
Let , for and be the set of successful iterations such that for . As is monotonically decreasing, we have
(177)
where . The fourth inequality follows from Eqs. (37) and (164), while the fifth from Eq. (165).
Let be the set of successful iterations such that and for . Then there is an iteration in which and . Thus, we have
(178) |
and this results in
(179) |
where . The second and third inequalities follow from Eq. (157) and Eq. (37), respectively. Then, the bound for the total number of successful iterations is
(180) |
where the extra iteration corresponds to the final successful iteration of Algorithm 1 with . Then, similar to Eq. (162), we have the improved iteration bound for Algorithm 1 given as
(181) |
where . This completes the proof. ∎
G.3 Main Proof of Corollary 2
Although the proof follows exactly the same strategy as that of Corollary 1, we repeat it here for the readers' convenience.
Proof
Under the given assumptions, when Theorem 4 holds, Algorithm 1 returns an -optimal solution in iterations. Also, according to Theorem 2, at any iteration, Condition 1 is satisfied with a probability , where the probability at the current iteration can be achieved independently by selecting proper subsampling sizes for the approximate gradient and Hessian. Let be the event that Algorithm 1 returns an -optimal solution and be the event that Condition 1 is satisfied at iteration . According to Theorem 4, for event to happen, Condition 1 must be satisfied at all iterations; thus, we have
(182) |
This completes the proof.