Inexact Adaptive Cubic Regularization Algorithms on Riemannian Manifolds and Application
Abstract
An adaptive cubic regularization algorithm employing an inexact gradient and Hessian is proposed on general Riemannian manifolds, together with its iteration complexity for reaching an approximate second-order optimality under certain accuracy assumptions on the inexact gradient and Hessian. The algorithm extends the inexact adaptive cubic regularization algorithm with the true gradient in [Math. Program., 184(1-2): 35-70, 2020] to more general cases, even in Euclidean settings. As an application, the algorithm is applied to solve the joint diagonalization problem on the Stiefel manifold. Numerical experiments illustrate that the algorithm performs better than the inexact trust-region algorithm in [Advances in Neural Information Processing Systems, 31, 2018].
keywords:
Riemannian manifolds; retraction; inexact adaptive cubic regularization algorithm; iteration complexity; joint diagonalization
1 Introduction
We consider the large-scale separable unconstrained optimization problem on general Riemannian manifolds:
\min_{x \in \mathcal{M}} f(x) := \frac{1}{n} \sum_{i=1}^{n} f_i(x)    (1)
where is a Riemannian manifold, , and each is continuously differentiable (). Such problems frequently appear in machine learning and scientific computing, where each is a loss (or misfit) function corresponding to the th observation (or measurement); see, e.g., [1, 2, 3]. In such “large-scale” settings, since evaluating the gradient or the Hessian of can be computationally expensive, inexact techniques are used to approximate the first and second derivatives, a prominent one being the random sampling technique. As a result, many popular first-order and second-order stochastic algorithms have been proposed to solve problem (1). The stochastic gradient descent (SGD) algorithm is one of the first-order stochastic/inexact algorithms, which uses only gradient information. The idea of SGD goes back to [4], and variants can be found in [5, 6, 27, 10, 11] and the references therein (Riemannian versions of SGD are given in [10, 11]). Second-order stochastic/inexact algorithms use inexact Hessian information; they include the popular sub-sampled Newton method, the sub-sampled/inexact trust-region algorithm, and the sub-sampled/inexact cubic regularization algorithm; see, e.g., [28, 29, 30, 31, 32, 17] (the Riemannian version of the inexact trust-region algorithm appears in [17]). In particular, we note the inexact trust-region algorithm and the inexact adaptive cubic regularization algorithm introduced in [32] for solving problem (1). Both algorithms employ the true gradient and an inexact (sub-sampled) Hessian of , and solve the corresponding sub-problems approximately at each iteration. These two algorithms are shown to achieve the -optimality (see Definition 2.2) with iteration complexities and , respectively.
It is known that the Riemannian geometry framework has advantages in many applications, such as translating certain nonconvex (constrained) optimization problems in Euclidean space into convex (unconstrained) ones over a Riemannian manifold; see, e.g., [12, 34, 35, 36]. As a result, several classical numerical methods for solving optimization problems in Euclidean space, such as Newton’s method, the BFGS algorithm, the trust-region method, the gradient algorithm, and the subgradient algorithm, have been extended to the Riemannian manifold setting; see, e.g., [16, 12, 13, 14, 15, 34, 33, 35, 36]. As noted above, for solving the large-scale problem (1), the stochastic gradient descent algorithm and its variants have been extended from Euclidean spaces to Riemannian manifolds; see, e.g., [10, 11]. Recently, the Riemannian inexact trust-region algorithm was studied in [17]; it uses an inexact gradient and an inexact Hessian at each iteration, together with an inexact solution of the trust-region sub-problem. Since the algorithm employs the inexact gradient instead of the true gradient, it extends the corresponding algorithm in [32] to more general cases, even in Euclidean settings. For achieving the -optimality, the iteration complexity of the algorithm is likewise proven to be . Moreover, numerical results are provided which illustrate that the algorithm performs significantly better than state-of-the-art deterministic and stochastic algorithms in some applications.
Inspired by these works (in particular, [32] and [17]), we propose a Riemannian inexact adaptive cubic regularization algorithm for solving problem (1) on general Riemannian manifolds, which employs an inexact gradient and an inexact Hessian at each iteration. The algorithm is proven to have a similar iteration complexity for obtaining the -optimality under certain conditions on the accuracies of the inexact gradient and Hessian. In particular, iteration complexities of the Riemannian (deterministic) adaptive cubic regularization algorithm and of the Riemannian inexact adaptive cubic regularization algorithm with the true gradient are also established. As an application, the proposed algorithms are applied to solve the joint diagonalization problem on the Stiefel manifold. Numerical results indicate that the inexact algorithms are more efficient than the deterministic algorithm at the same accuracy, and that the inexact Riemannian adaptive cubic regularization algorithm outperforms the inexact Riemannian trust-region algorithm of [17, Algorithm 1].
The paper is organized as follows. Some basic notions and notation on Riemannian manifolds, together with related properties of sub-sampling, are introduced in the next section. The Riemannian inexact adaptive cubic regularization algorithm and its iteration complexity are presented in Section 3, and the application to joint diagonalization on the Stiefel manifold is given in the last section.
2 Notation and Preliminaries
Notation and terminologies used in the present paper are standard; the readers are referred to some textbooks for more details; see, e.g., [12, 18, 19, 22].
Let be a connected and complete -dimensional Riemannian manifold. Let , and let stand for the tangent space at to . is called the tangent bundle on . We denote by the scalar product on with the associated norm , where the subscript is sometimes omitted. For , let be a piecewise smooth curve joining to . Then, the arc-length of is defined by , while the Riemannian distance from to is defined by , where the infimum is taken over all piecewise smooth curves joining to .
We use to denote the Levi-Civita connection on . A vector field is said to be parallel along if . In particular, for a smooth curve , if is parallel along itself, then is called a geodesic; thus, a smooth curve is a geodesic if and only if . A geodesic joining to is minimal if its arc-length equals its Riemannian distance between and . By the Hopf-Rinow theorem [18], is a complete metric space, and there is at least one minimal geodesic joining to for any points and .
Let be a twice continuously differentiable real-valued function defined on . The Riemannian gradient and Hessian of at are denoted by and , which are respectively defined as
On the manifold , we use a retraction, defined below, to move in the direction of a tangent vector; it can be found in [12, Definition 4.1.1 and Proposition 5.5.5].
Definition 2.1 (retraction and second-order retraction).
A retraction on a manifold is a smooth mapping with the following properties, where is the restriction of on :
(1) , where denotes the zero element of .
(2) , where is the identity mapping on with the canonical identification .
Moreover, is said to be a second-order retraction if it further satisfies
where denotes acceleration of the curve .
Remark 1.
(i) admits a second-order retraction defined by the exponential mapping, and one can find more general retractions on matrix manifolds in [12].
(ii) In general, the Euclidean Hessian differs from the Riemannian Hessian , while they are identical under second-order retractions (see [21, Lemma 3.9]).
Let . We consider the large-scale minimization (1):
where , and for each , is a twice continuously differentiable real-valued function. Since may be non-convex, one prefers in practice to find an approximate -optimality, which is defined in [32, Definition 1] (see also [17, Definition 2.1]). As usual, stands for the minimum eigenvalue of a matrix .
Definition 2.2 (-optimality).
Let $\epsilon_g, \epsilon_H > 0$. A point $x \in \mathcal{M}$ is said to be an $(\epsilon_g, \epsilon_H)$-optimality of (1) if
\|\operatorname{grad} f(x)\| \le \epsilon_g \quad \text{and} \quad \lambda_{\min}(\operatorname{Hess} f(x)) \ge -\epsilon_H.
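For illustration, the two conditions in Definition 2.2 can be checked numerically as follows, assuming the Riemannian gradient is available as an array and the Riemannian Hessian is represented as a symmetric matrix in an orthonormal basis of the tangent space; the function name and interface are illustrative and not part of the paper.

```python
import numpy as np

def is_eps_optimal(grad, hess_matrix, eps_g, eps_H):
    """Approximate second-order optimality in the sense of Definition 2.2:
    the gradient norm is at most eps_g and the smallest eigenvalue of the
    Hessian (in an orthonormal tangent basis) is at least -eps_H."""
    grad_small = np.linalg.norm(grad) <= eps_g
    # eigvalsh is appropriate because the Riemannian Hessian is self-adjoint.
    lam_min = np.linalg.eigvalsh(hess_matrix).min()
    return grad_small and lam_min >= -eps_H
```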
In practice, we adopt the sub-sampled inexact gradient and Hessian. To proceed, let and be the sample collections, drawn with or without replacement from , whose cardinalities are denoted by and , respectively. The sub-sampled inexact gradient and Hessian are defined as
G_k := \frac{1}{|\mathcal{S}_g|} \sum_{i \in \mathcal{S}_g} \operatorname{grad} f_i(x_k), \qquad H_k := \frac{1}{|\mathcal{S}_H|} \sum_{i \in \mathcal{S}_H} \operatorname{Hess} f_i(x_k),    (2)
respectively. The following lemma, taken from [17, Theorem 4.1], provides sample sizes of and that are sufficient to guarantee that and approximate the true gradient and Hessian in a probabilistic way. Here we set
Lemma 2.3.
Let and let be a second-order retraction. Assume that the sampling is done uniformly at random to generate and , and let and be defined by (2). If the sample sizes and satisfy
then the following estimates hold for any and any :
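As a sketch of the construction in (2), the routines below average per-component Riemannian gradients and Hessian-vector products over uniformly sampled index sets; the oracle names grad_fi and hess_fi_vec, as well as sampling with replacement, are assumptions made for illustration.

```python
import numpy as np

def subsampled_gradient(grad_fi, x, n, sample_size, rng):
    """Average the Riemannian gradients over a uniformly sampled batch, as in (2);
    grad_fi(i, x) is assumed to return grad f_i(x) as an array."""
    idx = rng.choice(n, size=sample_size, replace=True)
    return sum(grad_fi(i, x) for i in idx) / sample_size

def subsampled_hessian_vec(hess_fi_vec, x, eta, n, sample_size, rng):
    """Average the Riemannian Hessian-vector products over a sampled batch;
    hess_fi_vec(i, x, eta) is assumed to return Hess f_i(x)[eta]."""
    idx = rng.choice(n, size=sample_size, replace=True)
    return sum(hess_fi_vec(i, x, eta) for i in idx) / sample_size
```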
3 Inexact Riemannian adaptive cubic regularization algorithm
In this section, we propose the following inexact Riemannian adaptive cubic regularization algorithm for solving problem (1), which employs the inexact Hessian and gradient at each iteration.
Algorithm 3.1.
(Inexact Riemannian adaptive cubic regularization)
Step 0. Choose , and .
Step 1. Choose and , and set .
Step 2. Construct inexact gradient and approximate Hessian of at , respectively.
Step 3. If and , then return .
Step 4. Solve the following sub-problem approximately:
\min_{\eta \in T_{x_k}\mathcal{M}} m_k(\eta) := f(x_k) + \langle G_k, \eta \rangle_{x_k} + \frac{1}{2} \langle H_k[\eta], \eta \rangle_{x_k} + \frac{\sigma_k}{3} \|\eta\|_{x_k}^3    (3)
Step 5. Set .
Step 6. If , then set and ; otherwise set and .
Step 7. Set and go to Step 2.
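The sketch below assembles Steps 1-7 into one loop. The oracle and parameter names, and the acceptance/update rule for the regularization parameter (ratio test against a threshold tau, division or multiplication by gamma), are the standard adaptive cubic regularization choices and are only assumptions about the details elided above, not the paper's exact specification.

```python
import numpy as np

def inexact_arc(f, inexact_grad, inexact_hess_vec, min_hess_eig, retract,
                solve_subproblem, x0, eps_g, eps_H,
                sigma0=1.0, gamma=2.0, tau=0.1, sigma_min=1e-8, max_iter=1000):
    """Sketch of Algorithm 3.1 with user-supplied oracles:
      inexact_grad(x)          -> tangent vector G_k            (Step 2)
      inexact_hess_vec(x, eta) -> tangent vector H_k[eta]       (Step 2)
      min_hess_eig(x)          -> smallest eigenvalue of H_k    (Step 3)
      retract(x, eta)          -> R_x(eta)
      solve_subproblem(g, hv, sigma) -> approximate minimizer of the cubic
          model in (3), satisfying the Cauchy/eigenpoint conditions (A4).
    The Euclidean inner product is used below, which matches the embedded
    metric of the Stiefel manifold in Section 4."""
    x, sigma = x0, sigma0
    for _ in range(max_iter):
        g = inexact_grad(x)
        if np.linalg.norm(g) <= eps_g and min_hess_eig(x) >= -eps_H:
            return x                                   # Step 3: approximate optimality
        hv = lambda eta, x=x: inexact_hess_vec(x, eta)
        eta = solve_subproblem(g, hv, sigma)           # Step 4
        model_decrease = -(np.vdot(g, eta) + 0.5 * np.vdot(eta, hv(eta))
                           + sigma / 3.0 * np.linalg.norm(eta) ** 3)
        rho = (f(x) - f(retract(x, eta))) / max(model_decrease, 1e-16)   # Step 5
        if rho >= tau:                                 # Step 6: successful iteration
            x, sigma = retract(x, eta), max(sigma / gamma, sigma_min)
        else:                                          # unsuccessful iteration
            sigma = gamma * sigma
    return x
```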
Remark 2.
Algorithm 3.1 is an extension of the adaptive cubic regularization algorithm in Euclidean spaces. As is well known, in the Euclidean setting the algorithm was proposed and extensively studied by Cartis et al. in [23, 24]. Recently, for solving the “large-scale” separable problem (1), Xu et al. proposed an adaptive cubic regularization algorithm with inexact Hessian (and true gradient) in [32, Algorithm 2], and the iteration complexity of that algorithm is proven to be .
Let be a sequence generated by Algorithm 3.1. We make the following blanket assumptions throughout the paper:
(A1) There exists a constant such that for all , satisfies
(A2) There exists a constant such that
(A3) There exist constants such that
Remark 3.
(i) As shown in [21, Appendix B], Assumption (A1) is satisfied in the case when is compact and is a second-order retraction.
(ii) In practice, according to Lemma 2.3, we shall construct and by sub-sampling to ensure Assumption (A3) in a probabilistic way.
As stated in Step 4, we solve the sub-problem (3) approximately at each iteration. The most popular conditions on the approximate solution in the literature are the Cauchy-point and eigenpoint conditions; see, e.g., [23, 24, 32]. To ensure the convergence of the algorithm, we make the following similar assumptions on the approximate solutions :
(A4) , and ,
where and are the approximate optimal solutions of (3) along the negative gradient and the negative curvature directions, respectively, and satisfies
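For example, the Cauchy point in (A4), i.e., the minimizer of the cubic model along the negative (inexact) gradient direction, admits a closed form obtained from the positive root of a scalar quadratic equation; a sketch is given below (the eigenpoint, which requires a negative-curvature direction, is not shown). The interface matches the solve_subproblem oracle used in the sketch of Algorithm 3.1.

```python
import numpy as np

def cauchy_point(g, hess_vec, sigma):
    """Minimize the cubic model along -g:
        phi(a) = -a*||g||^2 + 0.5*a^2*<hess_vec(g), g> + (sigma/3)*a^3*||g||^3,
    whose minimizer over a >= 0 solves sigma*||g||^3*a^2 + <Hg,g>*a - ||g||^2 = 0."""
    gnorm = np.linalg.norm(g)
    if gnorm == 0.0:
        return np.zeros_like(g)
    gHg = np.vdot(g, hess_vec(g))
    c = sigma * gnorm ** 3
    if c == 0.0:                     # sigma = 0: fall back to the quadratic case
        a = gnorm ** 2 / gHg if gHg > 0 else 1.0
    else:                            # positive root of the quadratic equation
        a = (-gHg + np.sqrt(gHg ** 2 + 4.0 * c * gnorm ** 2)) / (2.0 * c)
    return -a * g
```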
Now, let . Noting that the sub-problem (3) of Algorithm 3.1 is actually posed on a Euclidean space (the tangent space), the following properties of and are valid (see [32, Lemmas 6, 7]).
Lemma 3.1.
The following estimates for and hold:
(4) |
(5) |
(6) |
(7) |
Lemma 3.2 below estimates the sufficient decrease in the objective function.
Lemma 3.2.
Assume (A1) and (A3). Then we have
(8) |
Proof.
Lemma 3.3.
Assume (A1)-(A4). Assume further that and . If one of the following conditions is satisfied, then the regularization parameter decreases, that is, :
(a) , .
(b) , .
Proof.
In light of (8) and , we get that
(9)
We claim that the following implication holds:
(10) |
In fact, note that
for all . | (11) |
In view of (by assumption), we see that , and then have
If , we get from (11) that . Therefore, implication (10) is valid by (9).
Below, we show that the following inequality holds under either condition (a) or (b):
(12) |
Granting this, we conclude from Step 6 of Algorithm 3.1 that , completing the proof.
Assume that condition (a) is satisfied. Then, we have
(13) |
where the first inequality is by (6) and the first item of condition (a), and the last inequality is due to the second item of condition (a). We claim that
(14) |
Indeed, in the case when , (14) is evident from (9). Otherwise, . This, together with (13) and implication (10), implies that , and so claim (14) is trivial. Using (13), (14) and , we estimate that
(15) |
Due to assumption (A4) and (4), there holds that
This, together with (15), implies that
where the second inequality is due to the second item of condition (a), which proves (12).
Assume that condition (b) is satisfied. Then, we get from (7) and condition (b) that
(16) |
Thus, we claim that
(17) |
Indeed, if , claim (17) is clear by (9). Otherwise . This, together with (16) and implication (10), implies that . Therefore, claim (17) holds trivially. In view of (16), (17) and , we get that
(18) |
Moreover, in view of by condition (b), we have from assumption (A4) and (5) that
Combining this and (18), there holds that
where the second inequality is due to the second item of condition (b), and then (12) is true. The proof is complete. ∎
In the following lemma, we estimate an upper bound for the cubic regularization parameters before the algorithm terminates. Here, all iterations satisfying (12) are said to be successful, and unsuccessful otherwise.
Lemma 3.4.
Assume (A1)-(A4). Assume further that Algorithm 3.1 does not terminate at the -th iteration, and that the parameters satisfy and
(19) |
then
(20) |
Proof.
By contradiction, we assume that the -th iteration () is the first unsuccessful iteration such that
(21) |
which implies that . Since Algorithm 3.1 does not terminate at the -th iteration, we have that either or . Recalling that , and (19), Lemma 3.3 is applicable, showing that , which contradicts (21). The proof is complete. ∎
Without loss of generality, we assume that . Moreover, from now on, let and stand for the sets of all successful and unsuccessful iterations of Algorithm 3.1, respectively. The number of successful (unsuccessful) iterations is denoted by . The lemma below provides an estimate of the total number of successful iterations before the algorithm terminates.
Lemma 3.5.
Assume that the assumptions made in Lemma 3.4 are satisfied. Then the following estimate holds:
(22) |
where is the minimum of over and
(23) |
Proof.
Noting that is monotonically decreasing, we have that
(24) |
where the second inequality is by Steps 4-5. Assume that and Algorithm 3.1 does not terminate at the -th iteration. Then we have either or , and
(25) |
by (20) of Lemma 3.4 (noting ). In the case when , we get from (4) that
(26) |
where the third inequality is by , and the fourth inequality follows from (25). Similarly, in the case when , we have from (5) that
This, together with (26), yields
(27) |
where the last inequality is due to (23). Combining (24) and (27), we get that
(28) |
Thus, (22) holds, completing the proof. ∎
We can now prove the main theorem, which indicates the iteration complexity of Algorithm 3.1.
Theorem 3.6 (complexity of Algorithm 3.1).
Proof.
As corollaries of Theorem 3.6, we consider the (deterministic) Riemannian adaptive cubic regularization algorithm and the inexact Riemannian adaptive cubic regularization algorithm with the true gradient. The following corollary, which is immediate from Theorem 3.6 (noting that assumption (A3) is satisfied automatically), gives the iteration complexity of the (deterministic) Riemannian adaptive cubic regularization algorithm.
Corollary 3.7.
Following the same line of proof as Theorem 3.6, we obtain the following corollary on the iteration complexity of the inexact Riemannian adaptive cubic regularization algorithm with the true gradient.
Corollary 3.8.
Assume (A1)-(A2), (A3)-b and (A4). Assume further that Algorithm 3.1 employs
and the parameter satisfies
Then, the algorithm terminates after at most iterations.
4 Application and numerical experiments
In this section, we apply Algorithm 3.1 to solve the joint diagonalization (JD for short) problem on the Stiefel manifold, and provide numerical experiments to compare the performance of the resulting algorithms. All numerical comparisons are implemented in MATLAB on a 3.20 GHz Intel Core machine with 8 GB RAM.
We first introduce some notation and results about the Stiefel manifold and the JD problem; the following facts about the Stiefel manifold can be found in [12]. Let , and let (we always assume that ). The symmetric part of is denoted by , which is defined by . The Frobenius norm of is defined as
We use to denote the diagonal matrix formed by the diagonal elements of , and then have . The Stiefel manifold St is the set of all column-orthonormal matrices, i.e.,
where is the identity matrix. Let . The tangent space of St at is defined by
The Riemannian inner product on is given by
In our experiments, we adopt the retraction defined by
where denotes the orthogonal factor of the QR decomposition of .
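The tangent-space projection and the QR retraction described above can be sketched as follows; the sign correction that makes the diagonal of the R factor positive is a standard convention, assumed here so that the orthogonal factor is unique.

```python
import numpy as np

def sym(B):
    """Symmetric part of a square matrix."""
    return 0.5 * (B + B.T)

def proj_tangent(X, Z):
    """Orthogonal projection of Z onto the tangent space of St(p, n) at X:
    P_X(Z) = Z - X sym(X^T Z)."""
    return Z - X @ sym(X.T @ Z)

def qr_retraction(X, xi):
    """QR retraction: the orthogonal factor of the QR decomposition of X + xi."""
    Q, R = np.linalg.qr(X + xi)
    signs = np.sign(np.diag(R))
    signs[signs == 0] = 1.0          # avoid zeroing columns when a diagonal entry is 0
    return Q * signs                 # flip column signs so that diag(R) > 0
```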
The following introduction to blind source separation (BSS for short) and the JD problem can be found in [9, 17, 25, 26] and the references therein. Let be a given -dimensional stochastic process with time index such that
where is a mixing matrix and is the -dimensional source vector; both and are unknown. The BSS problem aims to recover the mixing matrix and the source vector. If we impose additional constraints on (see, e.g., [9]), the BSS problem can be transformed into the JD problem on the Stiefel manifold, which is stated as follows:
\min_{X \in \mathrm{St}(p,n)} f(X) := -\frac{1}{N}\sum_{j=1}^{N} \left\| \operatorname{diag}\!\left(X^{\top} C_j X\right) \right\|_F^2    (30)
where are matrices estimated from the data under certain source conditions (see [9, 26] for more details). The following expressions of the (Riemannian) gradient and Hessian of are taken from [17]. Let . The Riemannian gradient of at is given by
where is the Euclidean gradient of and the orthogonal projection is
The Riemannian Hessian of is defined by
where
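Assuming the cost takes the averaged form reconstructed in (30) with symmetric matrices C_j, a sketch of the cost, its Euclidean gradient, and the projected (Riemannian) gradient is given below; the factor 4 comes from differentiating the squared diagonal entries of X^T C_j X.

```python
import numpy as np

def jd_cost(X, C_list):
    """Joint-diagonalization cost: minus the average squared Frobenius norm
    of the diagonal of X^T C_j X over the given matrices."""
    return -np.mean([np.sum(np.diag(X.T @ C @ X) ** 2) for C in C_list])

def jd_euclidean_grad(X, C_list):
    """Euclidean gradient of jd_cost, valid for symmetric C_j:
    -(1/N) * sum_j 4 * C_j X ddiag(X^T C_j X)."""
    G = np.zeros_like(X)
    for C in C_list:
        D = np.diag(np.diag(X.T @ C @ X))     # ddiag(X^T C X)
        G -= 4.0 * C @ X @ D
    return G / len(C_list)

def jd_riemannian_grad(X, C_list):
    """Riemannian gradient: project the Euclidean gradient onto the tangent space."""
    Z = jd_euclidean_grad(X, C_list)
    return Z - X @ (0.5 * (X.T @ Z + Z.T @ X))
```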
For simplicity, Algorithm 3.1 with the true gradient and true Hessian, with the true gradient and inexact Hessian, and with the inexact gradient and inexact Hessian is denoted by RACR, SRACR, and SSRACR, respectively; the inexact Riemannian trust-region algorithm in [17, Algorithm 1] is denoted by SSRTR. We apply RACR, SRACR, SSRACR, and SSRTR to solve the JD problem (30), and provide numerical comparisons for different . In all numerical experiments, we generate randomly, and all inexact gradients and Hessians are constructed by sampling with sizes and . We set the parameters as and in RACR, SRACR, and SSRACR, and and in SSRTR (see [17, Algorithm 1] for details about the parameters). The stopping criterion of all executed algorithms is set as .
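To illustrate how the pieces above fit together, a toy driver for SSRACR on a synthetic JD instance might look as follows; the data generation, batch size, tolerances, and the use of a projected Euclidean Hessian-vector product as the inexact Hessian are illustrative assumptions only and do not reproduce the experimental setup reported in Table 4.

```python
import numpy as np

def jd_hess_vec_approx(X, eta, C_list):
    """Projected Euclidean Hessian-vector product of the JD cost, used here as
    an inexact Hessian in the sense of (A3); it omits the curvature correction
    of the exact Riemannian Hessian."""
    Z = np.zeros_like(X)
    for C in C_list:
        M = np.diag(np.diag(X.T @ C @ X))
        dM = np.diag(np.diag(eta.T @ C @ X + X.T @ C @ eta))
        Z -= 4.0 * (C @ eta @ M + C @ X @ dM)
    Z /= len(C_list)
    return Z - X @ (0.5 * (X.T @ Z + Z.T @ X))

rng = np.random.default_rng(0)
N, n, p = 500, 10, 10                        # illustrative sizes only
batch = N // 10

# Synthetic symmetric matrices sharing a common diagonalizer.
A = np.linalg.qr(rng.standard_normal((n, n)))[0]
C_list = [A @ np.diag(rng.standard_normal(n)) @ A.T for _ in range(N)]
X0 = np.linalg.qr(rng.standard_normal((n, p)))[0]

X_opt = inexact_arc(
    f=lambda X: jd_cost(X, C_list),
    inexact_grad=lambda X: jd_riemannian_grad(
        X, [C_list[i] for i in rng.choice(N, size=batch)]),
    inexact_hess_vec=lambda X, eta: jd_hess_vec_approx(
        X, eta, [C_list[i] for i in rng.choice(N, size=batch)]),
    min_hess_eig=lambda X: 0.0,              # curvature test skipped in this toy run
    retract=qr_retraction,
    solve_subproblem=cauchy_point,           # Cauchy steps only; see the sketch above
    x0=X0, eps_g=1e-4, eps_H=1e-3,
)
```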
The numerical results are listed in Table 4, where "iter" and "time" denote the number of iterations and the CPU time (in seconds), respectively. Table 4 reveals that SSRACR costs much less CPU time than SRACR, RACR, and SSRTR at the same accuracy.
Table 4: Number of iterations (iter) and CPU time (time, in seconds) for RACR, SRACR, SSRACR, and SSRTR.

                   RACR              SRACR             SSRACR            SSRTR
                   iter    time      iter    time      iter    time      iter     time
(2015,5,5)         317     7.524     788     9.349     220     2.259     398      4.042
(2015,10,10)       1073    48.144    1247    29.432    718     14.048    1936     39.174
(2015,20,20)       2209    108.753   2114    87.839    2570    83.804    5140     172.033
(2015,30,30)       2693    395.710   2398    177.815   2603    202.081   12739    836.978
(2015,43,43)       2913    1280.422  3169    736.996   3563    691.293   12777    2622.408
(7200,43,43)       2873    5526.860  3019    2485.531  2909    2005.734  12837    8501.568
(60000,43,43)      298     4448.468  298     2395.300  303     2118.032  861      4579.865
Funding
Research of the second author was supported in part by the National Natural Science Foundation of China (grant 12161017) and the Guizhou Provincial Science and Technology Projects (grant ZK[2022]110).
References
- [1] Shalev-Shwartz S, Ben-David S. Understanding Machine Learning: From Theory to Algorithms. Cambridge university press, 2014.
- [2] Roosta-Khorasani F, Doel K, Ascher U. Data completion and stochastic algorithms for PDE inversion problems with many measurements. Electron. T. Numer. Ana., 42:177–196, 2014.
- [3] Roosta-Khorasani F, Doel K, Ascher U. Stochastic algorithms for inverse problems involving PDEs and many measurements. SIAM J. Sci. Comput., 36(5):S3–S22, 2014.
- [4] Robbins H, Monro S. A stochastic approximation method. AMS, 22(3):400–407, 1951.
- [5] Roux N, Schmidt M, Bach F. A stochastic gradient method with an exponential convergence rate for finite training sets. Advances in neural information processing systems, 25, 2012.
- [6] Johnson R, Zhang T. Accelerating stochastic gradient descent using predictive variance reduction. In NIPS, 1(3):315–323, 2013.
- [7] Liu DC, Nocedal J. On the limited memory BFGS method for large scale optimization. Math. Program., 45(1-3):503–528, 1989.
- [8] Mishra B, Kasai H, Jawanpuria P, Saroop A. A Riemannian gossip approach to subspace learning on Grassmann manifold. Mach. Learn., 108:1783–1803, 2019.
- [9] Theis FJ, Cason TP, Absil PA. Soft dimension reduction for ICA by joint diagonalization on the Stiefel manifold. In ICA, 2009.
- [10] Bonnabel S. Stochastic gradient descent on Riemannian manifolds. IEEE T. Automat. Contr., 58(9):2217–2229, 2013.
- [11] Sato H, Kasai H, Mishra B. Riemannian stochastic variance reduced gradient. ArXiv preprint arXiv:1702.05594, 2017.
- [12] Absil PA, Mahony R, Sepulchre R. Optimization Algorithms on Matrix Manifolds. Princeton University Press, 2008.
- [13] Huang W, Gallivan KA, Absil PA. A Broyden class of quasi-Newton methods for Riemannian optimization. SIAM J. Optimiz., 25(3):1660–1685, 2015.
- [14] Gabay D. Minimizing a differentiable function over a differential manifold. J. Optimiz. Theory App., 37:177–219, 1982.
- [15] Ring W, Wirth B. Optimization methods on Riemannian manifolds and their application to shape space. SIAM J. Optimiz., 22(2):596–627, 2012.
- [16] Absil PA, Baker CG, Gallivan KA. Trust-region methods on Riemannian manifolds. Found. Comput. Math., 7(3):303–330, 2007.
- [17] Kasai H, Mishra B. Inexact trust-region algorithms on Riemannian manifolds. Advances in neural information processing systems, 31, 2018.
- [18] DoCarmo MP. Riemannian Geometry. Boston MA: Birkhäuser Boston; 1992.
- [19] Udriste C. Convex Functions and Optimization Methods on Riemannian Manifolds. In: Mathematics and Its Applications, Kluwer Academic, Dordrecht, 1994.
- [20] Nocedal J, Wright SJ. Numerical Optimization. Springer Series in Operations Research, Springer-Verlag, New York, 1999.
- [21] Boumal N, Absil PA, Cartis C. Global rates of convergence for nonconvex optimization on manifolds. IMA J. Numer. Anal., 39(1):1–33, 2019.
- [22] Conn AR, Gould NIM, Toint PL. Trust Region Methods. Society for Industrial and Applied Mathematics, 2000.
- [23] Cartis C, Gould NIM, Toint PL. Adaptive cubic regularisation methods for unconstrained optimization. Part I: Motivation, convergence and numerical results. Math. Program., 127(2):245–295, 2011.
- [24] Cartis C, Gould NIM, Toint PL. Adaptive cubic regularisation methods for unconstrained optimization. Part II: Worst-case function and derivative-evaluation complexity. Math. Program., 130(2):295–319, 2011.
- [25] Hyvärinen A, Oja E. Independent component analysis: Algorithms and applications. Neural networks, 13(4-5): 411–430, 2000.
- [26] Theis FJ, Inouye Y. On the use of joint diagonalization in blind signal processing. In: Proc. ISCAS 2006, Kos, Greece 2006.
- [27] Bottou L, Curtis FE, Nocedal J. Optimization methods for large-scale machine learning. SIAM Rev., 60(2):223–311, 2018.
- [28] Byrd RH, Chin GM, Neveitt W, Nocedal J. On the use of stochastic Hessian information in optimization methods for machine learning. SIAM J. Optim., 21(3): 977–995, 2011.
- [29] Erdogdu MA, Montanari A. Convergence rates of sub-sampled Newton methods. In NIPS, 2015.
- [30] Kohler JM, Lucchi A. Sub-sampled cubic regularization for non-convex optimization. In ICML, 2017.
- [31] Yao Z, Xu P, Roosta-Khorasani F, Mahoney MW. Inexact non-convex Newton-type methods. ArXiv preprint arXiv: 1802.06925, 2018.
- [32] Xu P, Roosta F, Mahoney MW. Newton-type methods for non-convex optimization under inexact Hessian information. Math. Program., 184(1-2):35–70, 2020.
- [33] Wang XM. Subgradient algorithms on Riemannian manifolds of lower bounded curvatures. Optimization, 67: 179–194, 2018.
- [34] Wang XM, Li C, Wang JH, Yao JC. Linear convergence of subgradient algorithm for convex feasibility on Riemannian manifolds. SIAM J. Optim., 25:2334–2358, 2015.
- [35] Ferreira OP, Louzeiro MS, Prudente LF. Gradient method for optimization on Riemannian manifolds with lower bounded curvature. SIAM J. Optim., 29: 2517–2541, 2019.
- [36] Bento CG, Bitar SDB, Cruz Neto JX, Oliveira PR, Souza JCO. Computing Riemannian center of mass on Hadamard manifolds. J. Optim. Theory Appl., 183: 977–992, 2019.