
Inexact Adaptive Cubic Regularization Algorithms on Riemannian Manifolds and Application

Z. Y. Li (Email: [email protected]) and X. M. Wang (corresponding author, Email: [email protected])
College of Mathematics and Statistics, Guizhou University, Guiyang 550025, P. R. China
Abstract

We propose an adaptive cubic regularization algorithm employing inexact gradient and Hessian information on general Riemannian manifolds, and establish its iteration complexity for reaching an approximate second-order optimal point under certain accuracy conditions on the inexact gradient and Hessian. The algorithm extends the inexact adaptive cubic regularization algorithm with true gradient in [Math. Program., 184(1-2): 35-70, 2020] to more general settings, even in the Euclidean case. As an application, the algorithm is applied to solve the joint diagonalization problem on the Stiefel manifold. Numerical experiments illustrate that the algorithm performs better than the inexact trust-region algorithm in [Advances in Neural Information Processing Systems, 31, 2018].

keywords:
Riemannian manifolds; retraction; inexact adaptive cubic regularization algorithm; iteration complexity; joint diagonalization

1 Introduction

We consider the large-scale separable unconstrained optimization problem on general Riemannian manifolds:

\min_{x\in\mathcal{M}}f(x)=\frac{1}{n}\sum_{i=1}^{n}f_{i}(x), (1)

where \mathcal{M} is a Riemannian manifold, n\gg 1, and each f_{i}:\mathcal{M}\rightarrow\mathbb{R} (i=1,2,\dots,n) is continuously differentiable. Such problems frequently appear in machine learning and scientific computing, where each f_{i} is a loss (or misfit) function corresponding to the i-th observation (or measurement); see, e.g., [1, 2, 3]. In such “large-scale” settings, since evaluating the gradient or the Hessian of f can be computationally expensive, inexact techniques are used to approximate the first and second derivatives; random sampling is a particularly common one. As a result, many popular first-order and second-order stochastic algorithms have been proposed to solve problem (1). The stochastic gradient descent (SGD) algorithm is a first-order stochastic/inexact algorithm that uses only gradient information. The idea of SGD originates from [4], and its variants can be found in [5, 6, 27, 10, 11] and the references therein (Riemannian versions of SGD appear in [10, 11]). Second-order stochastic/inexact algorithms use inexact Hessian information; they include the popular sub-sampled Newton method, the sub-sampled/inexact trust-region algorithm, and the sub-sampled/inexact cubic regularization algorithm; see, e.g., [28, 29, 30, 31, 32, 17] (a Riemannian version of the inexact trust-region algorithm is given in [17]). In particular, we note the inexact trust-region algorithm and the inexact adaptive cubic regularization algorithm introduced in [32] for solving problem (1). Both algorithms employ the true gradient and an inexact (sub-sampled) Hessian of f and solve the corresponding sub-problems approximately at every iteration. These two algorithms are shown to have iteration complexities \mathcal{O}(\max\{\varepsilon_{g}^{-2}\varepsilon_{H}^{-1},\varepsilon_{H}^{-3}\}) and \mathcal{O}(\max\{\varepsilon_{g}^{-2},\varepsilon_{H}^{-3}\}), respectively, for achieving (\varepsilon_{g},\varepsilon_{H})-optimality (see Definition 2.2).

It is known that the Riemannian geometry framework has advantages in many applications, such as translating certain nonconvex (constrained) optimization problems in Euclidean space into convex (unconstrained) ones over a Riemannian manifold; see, e.g., [12, 34, 35, 36]. As a result, many classical numerical methods for optimization in Euclidean space, such as Newton’s method, the BFGS algorithm, the trust-region method, the gradient algorithm, and the subgradient algorithm, have been extended to the Riemannian manifold setting; see, e.g., [16, 12, 13, 14, 15, 34, 33, 35, 36]. As noted above, for solving the large-scale problem (1), the stochastic gradient descent algorithm and its variants have been extended from Euclidean spaces to Riemannian manifolds; see, e.g., [10, 11]. Recently, a Riemannian inexact trust-region algorithm was studied in [17]; it uses an inexact gradient and an inexact Hessian at every iteration and, in addition, solves the trust-region sub-problem inexactly. Since the algorithm employs an inexact gradient instead of the true gradient, it extends the corresponding algorithm in [32] to more general settings, even in the Euclidean case. For achieving (\varepsilon_{g},\varepsilon_{H})-optimality, the iteration complexity of the algorithm is also proven to be \mathcal{O}(\max\{\varepsilon_{g}^{-2}\varepsilon_{H}^{-1},\varepsilon_{H}^{-3}\}). Moreover, numerical results are provided which illustrate that the algorithm performs significantly better than state-of-the-art deterministic and stochastic algorithms in some applications.

Inspired by these prior works (in particular, [32] and [17]), we propose a Riemannian inexact adaptive cubic regularization algorithm for solving problem (1) on general Riemannian manifolds, which employs an inexact gradient and an inexact Hessian at every iteration. The algorithm is proven to have iteration complexity \mathcal{O}(\max\{\varepsilon_{g}^{-2},\varepsilon_{H}^{-3}\}) for obtaining (\varepsilon_{g},\varepsilon_{H})-optimality under certain accuracy conditions on the inexact gradient and Hessian. In particular, iteration complexities of the Riemannian (deterministic) adaptive cubic regularization algorithm and of the Riemannian inexact adaptive cubic regularization algorithm with true gradient are established as corollaries. As an application, the proposed algorithms are applied to solve the joint diagonalization problem on the Stiefel manifold. Numerical results indicate that the inexact algorithms are more efficient than the deterministic algorithm at the same accuracy. Meanwhile, the inexact Riemannian adaptive cubic regularization algorithm outperforms the inexact Riemannian trust-region algorithm ([17, Algorithm 1]).

The paper is organized as follows. Basic notions and notation on Riemannian manifolds, together with some related properties of sub-sampling, are introduced in the next section. The Riemannian inexact adaptive cubic regularization algorithm and its iteration complexity are presented in Section 3, and the application to joint diagonalization on the Stiefel manifold is given in the last section.

2 Notation and Preliminaries

The notation and terminology used in the present paper are standard; the reader is referred to standard textbooks for more details; see, e.g., [12, 18, 19, 22].

Let \mathcal{M} be a connected and complete n-dimensional Riemannian manifold. Let x\in\mathcal{M}, and let T_{x}\mathcal{M} stand for the tangent space at x to \mathcal{M}. T\mathcal{M}:=\cup_{x\in\mathcal{M}}T_{x}\mathcal{M} is called the tangent bundle on \mathcal{M}. We denote by \langle\cdot,\cdot\rangle_{x} the scalar product on T_{x}\mathcal{M} with the associated norm \|\cdot\|_{x}, where the subscript x is sometimes omitted. For y\in\mathcal{M}, let \gamma:[0,1]\rightarrow\mathcal{M} be a piecewise smooth curve joining x to y. Then, the arc-length of \gamma is defined by l(\gamma):=\int_{0}^{1}\|{\gamma}^{\prime}(t)\|dt, while the Riemannian distance from x to y is defined by {\rm d}(x,y):=\inf_{\gamma}l(\gamma), where the infimum is taken over all piecewise smooth curves \gamma:[0,1]\rightarrow\mathcal{M} joining x to y.

We use \nabla to denote the Levi-Civita connection on \mathcal{M}. A vector field V is said to be parallel along \gamma if \nabla_{{\gamma}^{\prime}}V=0. In particular, for a smooth curve \gamma, if {\gamma}^{\prime} is parallel along itself, then \gamma is called a geodesic; thus, a smooth curve \gamma is a geodesic if and only if \nabla_{{\gamma}^{\prime}}{\gamma}^{\prime}=0. A geodesic \gamma:[0,1]\rightarrow\mathcal{M} joining x to y is minimal if its arc-length equals the Riemannian distance between x and y. By the Hopf-Rinow theorem [18], (\mathcal{M},{\rm d}) is a complete metric space, and there is at least one minimal geodesic joining x to y for any points x and y.

Let f:\mathcal{M}\rightarrow\mathbb{R} be a twice continuously differentiable real-valued function defined on \mathcal{M}. The Riemannian gradient and Hessian of f at x\in\mathcal{M} are denoted by {\rm grad}f(x) and {\rm Hess}f(x), which are respectively defined by

\langle{\rm grad}f(x),\xi\rangle=\xi(f)\quad\mbox{and}\quad{\rm Hess}f(x)(\xi,\eta)=\langle\nabla_{\xi}{\rm grad}f(x),\eta\rangle\quad\forall\xi,\eta\in T_{x}\mathcal{M}.

On the manifold \mathcal{M}, we use the retraction defined below to move in the direction of a tangent vector; see [12, Definition 4.1.1 and Proposition 5.5.5].

Definition 2.1 (retraction and second-order retraction).

A retraction on a manifold \mathcal{M} is a smooth mapping R:T\mathcal{M}\rightarrow\mathcal{M} with the following properties, where R_{x} is the restriction of R to T_{x}\mathcal{M}:

(1) R_{x}(0_{x})=x, where 0_{x} denotes the zero element of T_{x}\mathcal{M}.

(2) DR_{x}(0_{x})=\mathrm{Id}_{T_{x}\mathcal{M}}, where \mathrm{Id}_{T_{x}\mathcal{M}} is the identity mapping on T_{x}\mathcal{M} with the canonical identification T_{0_{x}}T_{x}\mathcal{M}\simeq T_{x}\mathcal{M}.

Moreover, R is said to be a second-order retraction if it further satisfies

\frac{D^{2}}{dt^{2}}R_{x}(t\eta)\Big|_{t=0}=0\quad\forall x\in\mathcal{M}\mbox{ and }\forall\eta\in T_{x}\mathcal{M},

where \frac{D^{2}}{dt^{2}}\gamma denotes the acceleration of the curve \gamma.

Remark 1.

(i) \mathcal{M} admits a second-order retraction defined by the exponential mapping, and one can find more general retractions on matrix manifolds in [12].

(ii) In general, the Euclidean Hessian \nabla^{2}f\circ R_{x}(0_{x}) differs from the Riemannian Hessian \mathrm{Hess}f(x), while they coincide under second-order retractions (see [21, Lemma 3.9]).

Let J:=\{1,2,\cdots,n\}. We consider the large-scale minimization problem (1):

\min_{x\in\mathcal{M}}f(x)=\frac{1}{n}\sum_{i=1}^{n}f_{i}(x),

where n\gg 1, and for each i\in J, f_{i}:\mathcal{M}\rightarrow\mathbb{R} is a twice continuously differentiable real-valued function. Since f may be non-convex, in practice one often aims to find an approximate (\varepsilon_{g},\varepsilon_{H})-optimal point, as defined in [32, Definition 1] (see also [17, Definition 2.1]). As usual, \lambda_{\min}(A) stands for the minimum eigenvalue of a matrix A.

Definition 2.2 ((εg,εH)(\varepsilon_{g},\varepsilon_{H})-optimality).

Let \varepsilon_{g},\varepsilon_{H}\in(0,1). A point x^{*}\in\mathcal{M} is said to be an (\varepsilon_{g},\varepsilon_{H})-optimal point of (1) if

\|{\rm{grad}}f(x^{*})\|\leq\varepsilon_{g}\quad\mbox{and}\quad\lambda_{\min}({\rm{Hess}}f(x^{*}))\geq-\varepsilon_{H}.
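
A direct numerical check of this definition, assuming the gradient and Hessian are available as a coordinate vector and a symmetric matrix in some basis of T_{x}\mathcal{M}, might look as follows (a Python/NumPy sketch; the function name and the dense-matrix representation are our own illustrative choices, not part of the paper):

import numpy as np

def is_eps_optimal(grad, hess_mat, eps_g, eps_H):
    """Check the (eps_g, eps_H)-optimality conditions of Definition 2.2.

    grad     : gradient of f at x, as a coordinate vector in a basis of T_x M
    hess_mat : symmetric matrix representing Hess f(x) in the same basis
    """
    grad_ok = np.linalg.norm(grad) <= eps_g
    lam_min = np.linalg.eigvalsh(hess_mat)[0]   # smallest eigenvalue
    return grad_ok and lam_min >= -eps_H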

In practice, we adopt the sub-sampled inexact gradient and Hessian. To proceed, let S_{g},S_{H}\subset J be sample collections drawn with or without replacement from J, with cardinalities |S_{g}| and |S_{H}|, respectively. The sub-sampled inexact gradient and Hessian are defined as

G:=\frac{1}{|S_{g}|}\sum_{i\in S_{g}}\mathrm{grad}f_{i}\quad\mbox{and}\quad H:=\frac{1}{|S_{H}|}\sum_{i\in S_{H}}\mathrm{Hess}f_{i}, (2)

respectively. The following lemma, taken from [17, Theorem 4.1], provides sufficient sample sizes of S_{g} and S_{H} to guarantee that the inexact G and H approximate {\rm grad}f and {\rm Hess}f in a probabilistic sense. Here we set

K^{\max}_{g}:=\max_{i\in J}\sup_{x\in\mathcal{M}}\|\mathrm{grad}f_{i}(x)\|\quad\mbox{and}\quad K^{\max}_{H}:=\max_{i\in J}\sup_{x\in\mathcal{M}}\|\mathrm{Hess}f_{i}(x)\|.
Lemma 2.3.

Let \delta,\delta_{g},\delta_{H}\in(0,1) and let R be a second-order retraction. Assume that the sampling used to generate S_{g} and S_{H} is done uniformly at random, and let G and H be defined by (2). If the sample sizes |S_{g}| and |S_{H}| satisfy

|S_{g}|\geq\frac{32(K_{g}^{\max})^{2}(\log\frac{1}{\delta})+\frac{1}{4}}{\delta_{g}^{2}}\quad\mbox{and}\quad|S_{H}|\geq\frac{32(K_{H}^{\max})^{2}(\log\frac{1}{\delta})+\frac{1}{4}}{\delta_{H}^{2}},

then the following estimates hold for any x\in\mathcal{M} and any \eta\in T_{x}\mathcal{M}:

\mathrm{Pr}(\|G(x)-\mathrm{grad}f(x)\|\leq\delta_{g})\geq 1-\delta,
\mathrm{Pr}(\|(H(x)-\nabla^{2}f\circ R_{x}(0_{x}))[\eta]\|\leq\delta_{H}\|\eta\|)\geq 1-\delta.
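
In practice, Lemma 2.3 translates into a simple sampling routine: compute the sample sizes from the bounds above and average the corresponding component gradients and Hessian-vector products as in (2). The following Python/NumPy sketch is one possible realization under the assumption that each component f_{i} supplies callable gradient and Hessian-vector routines; the function names, the use of a NumPy random generator (e.g. rng = np.random.default_rng()), and the capping of the sizes at n are our own illustrative choices:

import numpy as np

def sample_sizes(K_g_max, K_H_max, delta, delta_g, delta_H, n):
    """Sample sizes |S_g| and |S_H| suggested by Lemma 2.3, capped at n."""
    s_g = (32.0 * K_g_max ** 2 * np.log(1.0 / delta) + 0.25) / delta_g ** 2
    s_H = (32.0 * K_H_max ** 2 * np.log(1.0 / delta) + 0.25) / delta_H ** 2
    return min(n, int(np.ceil(s_g))), min(n, int(np.ceil(s_H)))

def subsampled_gradient(grad_fns, x, size, rng):
    """Inexact gradient G(x) in (2): average of 'size' uniformly sampled component gradients."""
    idx = rng.choice(len(grad_fns), size=size, replace=True)
    return sum(grad_fns[i](x) for i in idx) / size

def subsampled_hess_vec(hess_vec_fns, x, eta, size, rng):
    """Inexact Hessian of (2) applied to a tangent vector eta: H(x)[eta]."""
    idx = rng.choice(len(hess_vec_fns), size=size, replace=True)
    return sum(hess_vec_fns[i](x, eta) for i in idx) / size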

3 Inexact Riemannian adaptive cubic regularization algorithm

In this section, we propose the following inexact Riemannian adaptive cubic regularization algorithm for solving problem (1), which employs the inexact Hessian and gradient at each iteration.

Algorithm 3.1.

(Inexact Riemannian adaptive cubic regularization)

Step 0. Choose \varepsilon_{g},\varepsilon_{H},\rho_{TH}\in(0,1) and \gamma\in(1,+\infty).

Step 1. Choose x_{0}\in\mathcal{M} and \sigma_{0}\in(0,+\infty), and set k:=0.

Step 2. Construct an inexact gradient G_{k} and an inexact Hessian H_{k} of f at x_{k}.

Step 3. If \|G_{k}\|\leq\varepsilon_{g} and \lambda_{\min}(H_{k})\geq-\varepsilon_{H}, then return x_{k}.

Step 4. Solve the following sub-problem approximately:

\eta_{k}\approx\underset{\eta\in T_{x_{k}}\mathcal{M}}{\arg\min}\,m_{k}(\eta):=\langle G_{k},\eta\rangle+\frac{1}{2}\langle H_{k}[\eta],\eta\rangle+\frac{1}{3}\sigma_{k}\|\eta\|^{3}. (3)

Step 5. Set \rho_{k}:=\frac{f(x_{k})-f\circ R_{x_{k}}(\eta_{k})}{-m_{k}(\eta_{k})}.

Step 6. If \rho_{k}\geq\rho_{TH}, then set x_{k+1}=R_{x_{k}}(\eta_{k}) and \sigma_{k+1}=\frac{\sigma_{k}}{\gamma}; otherwise set x_{k+1}=x_{k} and \sigma_{k+1}=\gamma\sigma_{k}.

Step 7. Set k:=k+1 and go to Step 2.
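
Purely for orientation, the following Python sketch mirrors the control flow of Steps 0-7 above. All problem-specific ingredients (the cost f, the retraction, the norm on the tangent space, the inexact oracles, the sub-problem solver of Step 4, and the smallest-eigenvalue estimate) are assumed to be supplied by the user; the names and signatures are our own placeholders, not part of the paper:

def inexact_riemannian_arc(x0, f, retract, norm, inexact_grad, inexact_hess,
                           solve_subproblem, lambda_min, eps_g, eps_H,
                           rho_TH=0.9, gamma=2.0, sigma0=1e-3, max_iter=10000):
    """Structural sketch of Algorithm 3.1; all oracles are user-supplied callables."""
    x, sigma = x0, sigma0
    for _ in range(max_iter):
        # Step 2: inexact gradient G_k and inexact Hessian operator H_k at x_k
        G, H = inexact_grad(x), inexact_hess(x)
        # Step 3: stop at an approximate (eps_g, eps_H)-optimal point
        if norm(x, G) <= eps_g and lambda_min(x, H) >= -eps_H:
            return x
        # Step 4: approximately minimize the cubic model m_k over T_{x_k}M;
        # solve_subproblem is assumed to return (eta_k, m_k(eta_k))
        eta, m_val = solve_subproblem(x, G, H, sigma)
        # Step 5: actual-to-predicted decrease ratio rho_k
        rho = (f(x) - f(retract(x, eta))) / (-m_val)
        # Step 6: accept the step and shrink sigma, or reject and enlarge sigma
        if rho >= rho_TH:
            x, sigma = retract(x, eta), sigma / gamma
        else:
            sigma = gamma * sigma
    return x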

Remark 2.

Algorithm 3.1 is an extension of the adaptive cubic regularization algorithm in Euclidean spaces. As is well known, in the Euclidean setting the algorithm was proposed and extensively studied by Cartis et al. in [23, 24]. Recently, for solving the “large-scale” separable problem (1), Xu et al. proposed an adaptive cubic regularization algorithm with inexact Hessian (and true gradient) in [32, Algorithm 2], and its iteration complexity is proven to be \mathcal{O}(\max\{\varepsilon_{g}^{-2},\varepsilon_{H}^{-3}\}).

Let \{x_{k}\} be a sequence generated by Algorithm 3.1. We make the following blanket assumptions throughout the paper:

(A1) There exists a constant L_{H}>0 such that, for all k\in\mathbb{N}, f\circ R satisfies

|f\circ R_{x_{k}}(\eta_{k})-f(x_{k})-\langle\mathrm{grad}f(x_{k}),\eta_{k}\rangle-\frac{1}{2}\langle\nabla^{2}f\circ R_{x_{k}}(0_{x_{k}})[\eta_{k}],\eta_{k}\rangle|\leq\frac{1}{2}L_{H}\|\eta_{k}\|^{3}.

(A2) There exists a constant K_{H}>0 such that

\|H_{k}\|:=\sup_{\eta\in T_{x_{k}}\mathcal{M},\|\eta\|\leq 1}\langle\eta,H_{k}[\eta]\rangle\leq K_{H}\quad\forall k\in\mathbb{N}.

(A3) There exist constants \delta_{g},\delta_{H}\in(0,1) such that

(A3)-a \quad \|G_{k}-\mathrm{grad}f(x_{k})\|\leq\delta_{g}\quad\forall k\in\mathbb{N},
(A3)-b \quad \|(H_{k}-\nabla^{2}f\circ R_{x_{k}}(0_{x_{k}}))[\eta_{k}]\|\leq\delta_{H}\|\eta_{k}\|\quad\forall k\in\mathbb{N}.
Remark 3.

(i) As shown in [21, Appendix B], Assumption (A1) is satisfied when \mathcal{M} is compact and R is a second-order retraction.

(ii) In practice, according to Lemma 2.3, we construct \{G_{k}\} and \{H_{k}\} by sub-sampling to ensure Assumption (A3) in a probabilistic sense.

As stated in Step 4, we solve the sub-problem (3) approximately at every iteration. The most popular conditions on the approximate solution in the literature are the Cauchy and eigenpoint conditions; see, e.g., [23, 24, 32]. To ensure the convergence of the algorithm, we make the following assumptions on the approximate solutions \{\eta_{k}\}:

(A4) -m_{k}(\eta_{k})\geq-m_{k}(\eta_{k}^{C}) for all k\in\mathbb{N}, and -m_{k}(\eta_{k})\geq-m_{k}(\eta_{k}^{E}) whenever \lambda_{\min}(H_{k})<0,

where \eta_{k}^{C} and \eta_{k}^{E} are approximate optimal solutions of (3) along the negative gradient direction and a negative curvature direction, respectively, and \eta_{k}^{E} satisfies

\langle\eta_{k}^{E},H_{k}[\eta_{k}^{E}]\rangle\leq\nu\lambda_{\min}(H_{k})\|\eta_{k}^{E}\|^{2}<0,\quad\nu\in(0,1);

see [23, 24] for more details.
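
For concreteness, the Cauchy point \eta_{k}^{C} can be computed in closed form by minimizing the cubic model along -G_{k}: along \eta=-\alpha G_{k} the model is a scalar cubic in \alpha\geq 0 whose minimizer solves a quadratic equation. The following Python/NumPy sketch is one possible realization (written for the Euclidean inner product on the tangent space, with hess_vec a user-supplied routine for H_{k}[\cdot]; it is an illustration, not the authors' implementation):

import numpy as np

def cauchy_point(G, hess_vec, sigma):
    """Minimize m_k along -G: returns eta_C = -alpha * G with alpha >= 0.

    Along eta = -alpha*G the model reads
        m(alpha) = -alpha*||G||^2 + 0.5*alpha^2*<G, H[G]> + (sigma/3)*alpha^3*||G||^3,
    so the stationarity condition is
        sigma*||G||^3*alpha^2 + <G, H[G]>*alpha - ||G||^2 = 0.
    """
    g = np.linalg.norm(G)
    if g == 0.0:
        return np.zeros_like(G)
    hg = float(np.vdot(G, hess_vec(G)))          # <G, H_k[G]>
    alpha = (-hg + np.sqrt(hg ** 2 + 4.0 * sigma * g ** 5)) / (2.0 * sigma * g ** 3)
    return -alpha * G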

Now, let k\in\mathbb{N}. Noting that the sub-problem (3) of Algorithm 3.1 is actually posed on a Euclidean space (the tangent space T_{x_{k}}\mathcal{M}), the following properties of \eta_{k}^{C} and \eta_{k}^{E} are valid (see [32, Lemmas 6, 7]).

Lemma 3.1.

The following estimates for \eta_{k}^{C} and \eta_{k}^{E} hold:

-m_{k}(\eta_{k}^{C})\geq\max\left\{\frac{1}{12}\|\eta_{k}^{C}\|^{2}(\sqrt{K_{H}^{2}+4\sigma_{k}\|G_{k}\|}-K_{H}),\ \frac{\|G_{k}\|}{2\sqrt{3}}\min\left\{\frac{\|G_{k}\|}{K_{H}},\sqrt{\frac{\|G_{k}\|}{\sigma_{k}}}\right\}\right\}, (4)
-m_{k}(\eta_{k}^{E})\geq\frac{\nu|\lambda_{\min}(H_{k})|}{6}\max\left\{\|\eta_{k}^{E}\|^{2},\frac{\nu^{2}|\lambda_{\min}(H_{k})|^{2}}{\sigma_{k}^{2}}\right\}, (5)
\|\eta_{k}^{C}\|\geq\frac{1}{2\sigma_{k}}(\sqrt{K_{H}^{2}+4\sigma_{k}\|G_{k}\|}-K_{H}), (6)
\|\eta_{k}^{E}\|\geq\frac{\nu|\lambda_{\min}(H_{k})|}{\sigma_{k}}. (7)

Lemma 3.2 below estimates the sufficient decrease in the objective function.

Lemma 3.2.

Assume (A1) and (A3). Then we have

f\circ R_{x_{k}}(\eta_{k})-f(x_{k})-m_{k}(\eta_{k})\leq\delta_{g}\|\eta_{k}\|+\frac{1}{2}\delta_{H}\|\eta_{k}\|^{2}+\left(\frac{L_{H}}{2}-\frac{\sigma_{k}}{3}\right)\|\eta_{k}\|^{3}. (8)
Proof.

In view of (3), we have that

\begin{split}&|f\circ R_{x_{k}}(\eta_{k})-f(x_{k})-m_{k}(\eta_{k})+\frac{\sigma_{k}}{3}\|\eta_{k}\|^{3}|\\ =&|f\circ R_{x_{k}}(\eta_{k})-f(x_{k})-\langle G_{k},\eta_{k}\rangle-\frac{1}{2}\langle H_{k}[\eta_{k}],\eta_{k}\rangle|\\ \leq&|f\circ R_{x_{k}}(\eta_{k})-f(x_{k})-\langle{\rm grad}f(x_{k}),\eta_{k}\rangle-\frac{1}{2}\langle\nabla^{2}f\circ R_{x_{k}}(0_{x_{k}})[\eta_{k}],\eta_{k}\rangle|\\ &+|\langle G_{k}-{\rm grad}f(x_{k}),\eta_{k}\rangle|+\frac{1}{2}|\langle(\nabla^{2}f\circ R_{x_{k}}(0_{x_{k}})-H_{k})[\eta_{k}],\eta_{k}\rangle|\\ \leq&\delta_{g}\|\eta_{k}\|+\frac{\delta_{H}}{2}\|\eta_{k}\|^{2}+\frac{L_{H}}{2}\|\eta_{k}\|^{3},\end{split}

where the second inequality is from Assumptions (A1) and (A3). Then, (8) follows immediately. ∎

Lemma 3.3.

Assume (A1)-(A4). Assume further that \sigma_{k}\geq 2L_{H} and \delta_{g}\leq\frac{9\delta_{H}^{2}}{4\sigma_{k}}. If one of the following conditions is satisfied, then the parameter \sigma_{k} decreases, that is, \sigma_{k+1}=\frac{\sigma_{k}}{\gamma}\;(\gamma>1):

(a) \|G_{k}\|>\varepsilon_{g} and \delta_{H}\leq\min\{\frac{1}{18},\frac{1-\rho_{TH}}{9}\}(\sqrt{K_{H}^{2}+4\sigma_{k}\varepsilon_{g}}-K_{H});

(b) \lambda_{\min}(H_{k})<-\varepsilon_{H} and \delta_{H}\leq\min\{\frac{1}{9},\frac{2(1-\rho_{TH})}{9}\}\nu\varepsilon_{H}.

Proof.

In light of (8) and \sigma_{k}\geq 2L_{H}, we get that

\begin{aligned}f\circ R_{x_{k}}(\eta_{k})-f(x_{k})-m_{k}(\eta_{k})&\leq\left(\frac{L_{H}}{2}-\frac{\sigma_{k}}{3}\right)\|\eta_{k}\|^{3}+\frac{1}{2}\delta_{H}\|\eta_{k}\|^{2}+\delta_{g}\|\eta_{k}\|\\ &\leq-\frac{\sigma_{k}}{12}\|\eta_{k}\|^{3}+\frac{1}{2}\delta_{H}\|\eta_{k}\|^{2}+\delta_{g}\|\eta_{k}\|.\end{aligned} (9)

We claim that the following implication holds:

\left[\|\eta_{k}\|\geq\tfrac{9\delta_{H}}{\sigma_{k}}\right]\Longrightarrow\left[f\circ R_{x_{k}}(\eta_{k})-f(x_{k})-m_{k}(\eta_{k})\leq 0\right]. (10)

In fact, note that

-\sigma_{k}t^{2}+6\delta_{H}t+12\delta_{g}\leq 0\quad\mbox{for all }t\geq\frac{6\delta_{H}+\sqrt{36\delta_{H}^{2}+48\sigma_{k}\delta_{g}}}{2\sigma_{k}}=\frac{3\delta_{H}}{\sigma_{k}}\left(1+\sqrt{1+\frac{4\sigma_{k}\delta_{g}}{3\delta_{H}^{2}}}\right). (11)

In view of \delta_{g}\leq\frac{9\delta_{H}^{2}}{4\sigma_{k}} (by assumption), we see that \frac{4\sigma_{k}\delta_{g}}{3\delta_{H}^{2}}\leq 3, and then have

\frac{3\delta_{H}}{\sigma_{k}}\left(1+\sqrt{1+\frac{4\sigma_{k}\delta_{g}}{3\delta_{H}^{2}}}\right)\leq\frac{9\delta_{H}}{\sigma_{k}}.

If \|\eta_{k}\|\geq\frac{9\delta_{H}}{\sigma_{k}}, we get from (11) that -\sigma_{k}\|\eta_{k}\|^{2}+6\delta_{H}\|\eta_{k}\|+12\delta_{g}\leq 0. Therefore, implication (10) is valid by (9).

Below, we show that the following inequality holds under either condition (a) or (b):

\rho_{k}\geq\rho_{TH}. (12)

Granting this and recalling Step 6 of Algorithm 3.1, we conclude that \sigma_{k+1}=\frac{\sigma_{k}}{\gamma}, completing the proof.

Assume that condition (a) is satisfied. Then, we have

\|\eta_{k}^{C}\|\geq\frac{1}{2\sigma_{k}}(\sqrt{K_{H}^{2}+4\sigma_{k}\varepsilon_{g}}-K_{H})\geq\frac{9\delta_{H}}{\sigma_{k}}, (13)

where the first inequality follows from (6) and the first item of condition (a), and the second inequality follows from the second item of condition (a). We claim that

f\circ R_{x_{k}}(\eta_{k})-f(x_{k})-m_{k}(\eta_{k})\leq\frac{1}{2}\delta_{H}\|\eta_{k}^{C}\|^{2}+\delta_{g}\|\eta_{k}^{C}\|. (14)

Indeed, in the case when \|\eta_{k}\|\leq\|\eta_{k}^{C}\|, (14) is evident from (9). Otherwise, \|\eta_{k}\|>\|\eta_{k}^{C}\|. This, together with (13) and implication (10), implies that f\circ R_{x_{k}}(\eta_{k})-f(x_{k})-m_{k}(\eta_{k})\leq 0, and so claim (14) is trivial. Using (13), (14) and \delta_{g}\leq\frac{9\delta_{H}^{2}}{4\sigma_{k}}, we estimate that

f\circ R_{x_{k}}(\eta_{k})-f(x_{k})-m_{k}(\eta_{k})\leq\frac{1}{2}\delta_{H}\|\eta_{k}^{C}\|^{2}+\frac{\delta_{H}}{4}\cdot\frac{9\delta_{H}}{\sigma_{k}}\|\eta_{k}^{C}\|\leq\frac{3}{4}\delta_{H}\|\eta_{k}^{C}\|^{2}. (15)

Due to assumption (A4) and (4), there holds that

-m_{k}(\eta_{k})\geq-m_{k}(\eta_{k}^{C})\geq\frac{1}{12}\|\eta_{k}^{C}\|^{2}(\sqrt{K_{H}^{2}+4\sigma_{k}\varepsilon_{g}}-K_{H}).

This, together with (15), implies that

1-\rho_{k}=\frac{f\circ R_{x_{k}}(\eta_{k})-f(x_{k})-m_{k}(\eta_{k})}{-m_{k}(\eta_{k})}\leq\frac{\frac{3}{4}\delta_{H}\|\eta_{k}^{C}\|^{2}}{\frac{1}{12}\|\eta_{k}^{C}\|^{2}(\sqrt{K_{H}^{2}+4\sigma_{k}\varepsilon_{g}}-K_{H})}\leq 1-\rho_{TH},

where the second inequality follows from the second item of condition (a). This shows (12).

Assume that condition (b) is satisfied. Then, we get from (7) and condition (b) that

\|\eta_{k}^{E}\|\geq\frac{\nu|\lambda_{\min}(H_{k})|}{\sigma_{k}}\geq\frac{\nu\varepsilon_{H}}{\sigma_{k}}\geq\frac{9\delta_{H}}{\sigma_{k}}. (16)

Thus, we claim that

f\circ R_{x_{k}}(\eta_{k})-f(x_{k})-m_{k}(\eta_{k})\leq\frac{1}{2}\delta_{H}\|\eta_{k}^{E}\|^{2}+\delta_{g}\|\eta_{k}^{E}\|. (17)

Indeed, if \|\eta_{k}\|\leq\|\eta_{k}^{E}\|, claim (17) is clear by (9). Otherwise \|\eta_{k}\|>\|\eta_{k}^{E}\|. This, together with (16) and implication (10), implies that f\circ R_{x_{k}}(\eta_{k})-f(x_{k})-m_{k}(\eta_{k})\leq 0. Therefore, claim (17) holds trivially. In view of (16), (17) and \delta_{g}\leq\frac{9\delta_{H}^{2}}{4\sigma_{k}}, we get that

f\circ R_{x_{k}}(\eta_{k})-f(x_{k})-m_{k}(\eta_{k})\leq\frac{1}{2}\delta_{H}\|\eta_{k}^{E}\|^{2}+\frac{\delta_{H}}{4}\cdot\frac{9\delta_{H}}{\sigma_{k}}\|\eta_{k}^{E}\|\leq\frac{3}{4}\delta_{H}\|\eta_{k}^{E}\|^{2}. (18)

Moreover, in view of \lambda_{\min}(H_{k})<0 by condition (b), we have from assumption (A4) and (5) that

-m_{k}(\eta_{k})\geq-m_{k}(\eta_{k}^{E})\geq\frac{\nu|\lambda_{\min}(H_{k})|}{6}\|\eta_{k}^{E}\|^{2}.

Combining this and (18), there holds that

1-\rho_{k}=\frac{f\circ R_{x_{k}}(\eta_{k})-f(x_{k})-m_{k}(\eta_{k})}{-m_{k}(\eta_{k})}\leq\frac{\frac{3}{4}\delta_{H}\|\eta_{k}^{E}\|^{2}}{\frac{\nu|\lambda_{\min}(H_{k})|}{6}\|\eta_{k}^{E}\|^{2}}\leq 1-\rho_{TH},

where the second inequality follows from the second item of condition (b), and then (12) is true. The proof is complete. ∎

In the following lemma, we estimate an upper bound for the cubic regularization parameters \{\sigma_{k}\} before the algorithm terminates. Here an iteration satisfying (12) is said to be successful, and unsuccessful otherwise.

Lemma 3.4.

Assume (A1)-(A4). If Algorithm 3.1 does not terminate at the N-th iteration and the parameters satisfy \delta_{g}\leq\frac{9\delta_{H}^{2}}{4\sigma_{k}} and

\delta_{H}\leq\min\left\{\min\{\tfrac{1}{18},\tfrac{1-\rho_{TH}}{9}\}(\sqrt{K_{H}^{2}+4\sigma_{k}\varepsilon_{g}}-K_{H}),\ \min\{\tfrac{1}{9},\tfrac{2(1-\rho_{TH})}{9}\}\nu\varepsilon_{H}\right\}, (19)

then

\sigma_{k}\leq\max\{\sigma_{0},2\gamma L_{H}\}\quad\forall k=1,2,\cdots,N. (20)
Proof.

By contradiction, we assume that the k-th iteration (k\leq N) is the first unsuccessful iteration such that

\sigma_{k+1}=\gamma\sigma_{k}\geq 2\gamma L_{H}, (21)

which implies that \sigma_{k}\geq 2L_{H}. Since Algorithm 3.1 does not terminate at the k-th iteration, we have that either \|G_{k}\|>\varepsilon_{g} or \lambda_{\min}(H_{k})<-\varepsilon_{H}. Recalling that \sigma_{k}\geq 2L_{H}, \delta_{g}\leq\frac{9\delta_{H}^{2}}{4\sigma_{k}} and (19), Lemma 3.3 is applicable and shows that \sigma_{k+1}=\frac{\sigma_{k}}{\gamma}, which contradicts (21). The proof is complete. ∎

Without loss of generality, we assume that \sigma_{0}\leq 2\gamma L_{H}. Moreover, from now on, let \mathcal{N}_{succ} and \mathcal{N}_{fail} stand for the sets of successful and unsuccessful iterations of Algorithm 3.1, respectively. The numbers of successful and unsuccessful iterations are denoted by |\mathcal{N}_{succ}| and |\mathcal{N}_{fail}|, respectively. The lemma below provides an estimate of the total number of successful iterations before the algorithm terminates.

Lemma 3.5.

Assume that the assumptions made in Lemma 3.4 are satisfied. Then the following estimate holds:

|\mathcal{N}_{succ}|\leq\frac{f(x_{0})-f_{\min}}{\rho_{TH}\kappa_{\sigma}}\max\{\varepsilon_{g}^{-2},\varepsilon_{H}^{-3}\}+1, (22)

where f_{\min} is the minimum of f over \mathcal{M} and

\kappa_{\sigma}:=\min\left\{\frac{\nu^{3}}{24\gamma^{2}L_{H}^{2}},\ \frac{1}{2\sqrt{3}}\min\left\{\frac{1}{K_{H}},\frac{1}{\sqrt{2\gamma L_{H}}}\right\}\right\}. (23)
Proof.

Noting that \{f(x_{k})\} is monotonically decreasing, we have that

f(x_{0})-f_{\min}\geq\sum_{k\in\mathcal{N}_{succ}}\left(f(x_{k})-f(x_{k+1})\right)\geq\sum_{k\in\mathcal{N}_{succ}}\rho_{TH}(-m_{k}(\eta_{k})), (24)

where the second inequality is by Steps 5 and 6. Assume that k\in\mathcal{N}_{succ} and Algorithm 3.1 does not terminate at the k-th iteration. Then we have either \|G_{k}\|>\varepsilon_{g} or \lambda_{\min}(H_{k})<-\varepsilon_{H}, and

\sigma_{k}\leq 2\gamma L_{H} (25)

by Lemma 3.4 (see (20)), noting that \sigma_{0}\leq 2\gamma L_{H}. In the case when \|G_{k}\|>\varepsilon_{g}, we get from (4) that

\begin{aligned}-m_{k}(\eta_{k})&\geq\frac{\|G_{k}\|}{2\sqrt{3}}\min\left\{\frac{\|G_{k}\|}{K_{H}},\sqrt{\frac{\|G_{k}\|}{\sigma_{k}}}\right\}>\frac{\varepsilon_{g}}{2\sqrt{3}}\min\left\{\frac{\varepsilon_{g}}{K_{H}},\sqrt{\frac{\varepsilon_{g}}{\sigma_{k}}}\right\}\\ &>\frac{\varepsilon_{g}^{2}}{2\sqrt{3}}\min\left\{\frac{1}{K_{H}},\sqrt{\frac{1}{\sigma_{k}}}\right\}>\frac{\varepsilon_{g}^{2}}{2\sqrt{3}}\min\left\{\frac{1}{K_{H}},\sqrt{\frac{1}{2\gamma L_{H}}}\right\},\end{aligned} (26)

where the third inequality is by \varepsilon_{g}\in(0,1), and the fourth inequality is because of (25). Similarly, for the case when \lambda_{\min}(H_{k})<-\varepsilon_{H}, we have from (5) that

-m_{k}(\eta_{k})\geq\frac{\nu^{3}|\lambda_{\min}(H_{k})|^{3}}{6\sigma_{k}^{2}}\geq\frac{\nu^{3}\varepsilon_{H}^{3}}{24\gamma^{2}L_{H}^{2}}.

This, together with (26), yields

-m_{k}(\eta_{k})\geq\min\left\{\frac{\varepsilon_{g}^{2}}{2\sqrt{3}}\min\left\{\frac{1}{K_{H}},\frac{1}{\sqrt{2\gamma L_{H}}}\right\},\frac{\nu^{3}\varepsilon_{H}^{3}}{24\gamma^{2}L_{H}^{2}}\right\}\geq\kappa_{\sigma}\min\{\varepsilon_{g}^{2},\varepsilon_{H}^{3}\}, (27)

where the last inequality follows from (23). Combining (24) and (27), we get that

f(x_{0})-f_{\min}\geq(|\mathcal{N}_{succ}|-1)\rho_{TH}\kappa_{\sigma}\min\{\varepsilon_{g}^{2},\varepsilon_{H}^{3}\}. (28)

Thus, (22) holds, completing the proof. ∎
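
To get a concrete feel for the bound (22), the short Python snippet below evaluates it for some illustrative parameter values; all numbers are invented for illustration and are not taken from the paper or its experiments (the formula for \kappa_{\sigma} follows (23)):

import numpy as np

# Illustrative numbers only; not taken from the paper's experiments.
f_gap = 10.0                         # f(x_0) - f_min
rho_TH, nu, gamma, L_H, K_H = 0.9, 0.5, 2.0, 1.0, 1.0
kappa_sigma = min(nu ** 3 / (24 * gamma ** 2 * L_H ** 2),
                  (1 / (2 * np.sqrt(3))) * min(1 / K_H, 1 / np.sqrt(2 * gamma * L_H)))
eps_g, eps_H = 1e-2, 1e-1
bound = f_gap / (rho_TH * kappa_sigma) * max(eps_g ** -2, eps_H ** -3) + 1
print(int(np.ceil(bound)))           # upper bound (22) on |N_succ|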

We can now prove the main theorem, which indicates the iteration complexity of Algorithm 3.1.

Theorem 3.6 (complexity of Algorithm 3.1).

Assume that the assumptions made in Lemma 3.4 are satisfied. Then, Algorithm 3.1 terminates after at most \mathcal{O}(\max\{\varepsilon_{g}^{-2},\varepsilon_{H}^{-3}\}) iterations.

Proof.

Noting that the total number of iterations of Algorithm 3.1 is N=|\mathcal{N}_{succ}|+|\mathcal{N}_{fail}|, we have by the definition of Algorithm 3.1 that

\sigma_{N}=\sigma_{0}\gamma^{|\mathcal{N}_{fail}|-|\mathcal{N}_{succ}|}.

Noting that \sigma_{k}\leq 2\gamma L_{H} for each k=1,2,\cdots,N by Lemma 3.4 (see (20)), it follows that

|\mathcal{N}_{fail}|\leq|\mathcal{N}_{succ}|+\log_{\gamma}\frac{2\gamma L_{H}}{\sigma_{0}}. (29)

This, together with Lemma 3.5(22), implies that

\begin{aligned}N=|\mathcal{N}_{succ}|+|\mathcal{N}_{fail}|&\leq 2|\mathcal{N}_{succ}|+\log_{\gamma}{\frac{2\gamma L_{H}}{\sigma_{0}}}\\ &\leq\frac{2(f(x_{0})-f_{\min})}{\rho_{TH}\kappa_{\sigma}}\max\{\varepsilon_{g}^{-2},\varepsilon_{H}^{-3}\}+2+\log_{\gamma}{\frac{2\gamma L_{H}}{\sigma_{0}}},\end{aligned}

completing the proof. ∎

As corollaries of Theorem 3.6, we consider the (deterministic) Riemannian adaptive cubic regularization algorithm and the inexact Riemannian adaptive cubic regularization algorithm under the true gradient. The following corollary shows the iteration complexity of the (deterministic) Riemannian adaptive cubic regularization algorithm, which is immediate from Theorem 3.6 (noting that assumption (A3) is satisfied naturally).

Corollary 3.7.

Assume (A1)-(A2) and (A4). If Algorithm 3.1 employs

G_{k}:=\mathrm{grad}f(x_{k})\quad\mbox{and}\quad H_{k}:=\mathrm{Hess}f(x_{k})\quad\forall k\in\mathbb{N},

then it terminates after at most \mathcal{O}(\max\{\varepsilon_{g}^{-2},\varepsilon_{H}^{-3}\}) iterations.

Following the same line of proof of Theorem 3.6, we have the following corollary about the iteration complexity of the inexact Riemannian adaptive cubic regularization algorithm under the true gradient.

Corollary 3.8.

Assume (A1)-(A2), (A3)-b and (A4). Assume further that Algorithm 3.1 employs

G_{k}:=\mathrm{grad}f(x_{k})\quad\forall k\in\mathbb{N}

and the parameter \delta_{H} satisfies

\delta_{H}\leq\min\left\{\min\{\tfrac{1}{12},\tfrac{1-\rho_{TH}}{6}\}(\sqrt{K_{H}^{2}+4\sigma_{k}\varepsilon_{g}}-K_{H}),\ \min\{\tfrac{1}{6},\tfrac{1-\rho_{TH}}{3}\}\nu\varepsilon_{H}\right\}.

Then, the algorithm terminates after at most \mathcal{O}(\max\{\varepsilon_{g}^{-2},\varepsilon_{H}^{-3}\}) iterations.

4 Application and numerical experiments

In this section, we apply Algorithm 3.1 to solve joint diagonalization (JD for short) problems on the Stiefel manifold, and provide numerical experiments to compare the performance of several variants. All numerical comparisons are implemented in MATLAB on a 3.20 GHz Intel Core machine with 8 GB RAM.

We first introduce some notation and results about the Stiefel manifold and the JD problem. The following facts about the Stiefel manifold can be found in [12]. Let r,d\in\mathbb{N}, and let A=(a_{ij})\in\mathbb{R}^{d\times r} (we always assume that r\leq d). The symmetric part of A is denoted by \mathrm{sym}(A) and defined by \mathrm{sym}(A):=\frac{A^{T}+A}{2}. The Frobenius norm of A is defined as

\|A\|_{F}^{2}:=\sum_{i=1}^{d}\sum_{j=1}^{r}a_{ij}^{2}.

We use {\rm diag}(A) to denote the diagonal matrix formed by the diagonal elements of A, so that \|\mathrm{diag}(A)\|_{F}^{2}=\sum_{j=1}^{r}a_{jj}^{2}. The Stiefel manifold \mathrm{St}(r,d) is the set of all d\times r matrices with orthonormal columns, i.e.,

\mathrm{St}(r,d):=\{X\in\mathbb{R}^{d\times r}:X^{T}X=I\},

where I\in\mathbb{R}^{r\times r} is the identity matrix. Let U\in\mathrm{St}(r,d). The tangent space of \mathrm{St}(r,d) at U is given by

T_{U}\mathrm{St}(r,d):=\{\xi\in\mathbb{R}^{d\times r}:\xi^{T}U+U^{T}\xi=0\}.

The Riemannian inner product on T_{U}\mathrm{St}(r,d) is given by

\langle\eta,\xi\rangle_{U}:={\rm trace}(\eta^{T}\xi)\quad\forall\eta,\xi\in T_{U}\mathrm{St}(r,d).

In our experiments, we adopt the retraction defined by

R_{U}(\xi):={\rm qf}(U+\xi)\quad\forall\xi\in T_{U}\mathrm{St}(r,d),

where {\rm qf}(U+\xi) denotes the orthogonal factor of the QR decomposition of U+\xi.
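
A minimal NumPy sketch of this retraction is given below; fixing the signs of the QR factor so that R has a nonnegative diagonal is a common convention we add for uniqueness, and is our own choice rather than something stated in the paper:

import numpy as np

def qf(A):
    """Q factor of the thin QR decomposition of A, with column signs fixed
    so that the diagonal of R is nonnegative (uniqueness convention, our choice)."""
    Q, R = np.linalg.qr(A)
    signs = np.sign(np.diag(R))
    signs[signs == 0] = 1.0
    return Q * signs

def retract_stiefel(U, xi):
    """QR-based retraction R_U(xi) = qf(U + xi) on St(r, d)."""
    return qf(U + xi)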

The following background on blind source separation (BSS for short) and the JD problem can be found in [9, 17, 25, 26] and the references therein. Let x(t) be a given d-dimensional stochastic process, with t the time index, such that

x(t)=As(t),

where A=(a_{ij})\in\mathbb{R}^{d\times r} is a mixing matrix and s(t) is the r-dimensional source vector; both A and s(t) are unknown. The BSS problem tries to recover the mixing matrix A and the source vector s(t). If we impose additional constraints on s(t) (see, e.g., [9]), the BSS problem can be transformed into the following JD problem on the Stiefel manifold:

\min_{U\in\mathrm{St}(r,d)}f_{\mathrm{ica}}(U):=-\frac{1}{n}\sum_{i=1}^{n}\|\mathrm{diag}(U^{T}C_{i}U)\|_{F}^{2}, (30)

where \{C_{i}\in\mathbb{R}^{d\times d}:i=1,2,\cdots,n\} are matrices estimated from the data under some source conditions (see [9, 26] for more details). The following expressions of the Riemannian gradient and Hessian of f_{\mathrm{ica}} are taken from [17]. Let U\in\mathrm{St}(r,d). The Riemannian gradient of f_{\mathrm{ica}} at U is given by

\mathrm{grad}f_{\mathrm{ica}}(U)=\mathrm{P}_{U}\,\mathrm{egrad}f_{\mathrm{ica}}(U)=\mathrm{P}_{U}\Big(-\frac{1}{n}\sum_{i=1}^{n}4C_{i}U\,\mathrm{diag}(U^{T}C_{i}U)\Big),

where \mathrm{egrad}f_{\mathrm{ica}}(U) is the Euclidean gradient of f_{\mathrm{ica}} at U and the orthogonal projection \mathrm{P}_{U} is given by

\mathrm{P}_{U}(W):=W-U\,{\rm sym}(U^{T}W)\quad\forall W\in\mathbb{R}^{d\times r}.
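
The cost (30), the Euclidean gradient and the projection \mathrm{P}_{U} translate directly into code. The following Python/NumPy sketch is an illustrative implementation under the assumption that the matrices C_{i} are stored in a 3-dimensional array Cs of shape (n, d, d); this storage choice and the function names are ours. Passing a sampled index set idx yields the sub-sampled (inexact) gradient used by the SSRACR-type variants below:

import numpy as np

def sym(A):
    """Symmetric part sym(A) = (A^T + A) / 2."""
    return 0.5 * (A + A.T)

def proj_stiefel(U, W):
    """Tangent-space projection P_U(W) = W - U sym(U^T W)."""
    return W - U @ sym(U.T @ W)

def f_ica(U, Cs, idx=None):
    """Cost in (30): -(1/|S|) * sum_{i in S} ||diag(U^T C_i U)||_F^2."""
    indices = range(Cs.shape[0]) if idx is None else idx
    total = sum(np.sum(np.diag(U.T @ Cs[i] @ U) ** 2) for i in indices)
    return -total / len(indices)

def egrad_f_ica(U, Cs, idx=None):
    """Euclidean gradient: -(1/|S|) * sum_{i in S} 4 C_i U diag(U^T C_i U)."""
    indices = range(Cs.shape[0]) if idx is None else idx
    G = np.zeros_like(U)
    for i in indices:
        CU = Cs[i] @ U
        G -= 4.0 * CU @ np.diag(np.diag(U.T @ CU))
    return G / len(indices)

def rgrad_f_ica(U, Cs, idx=None):
    """Riemannian gradient grad f_ica(U) = P_U(egrad f_ica(U))."""
    return proj_stiefel(U, egrad_f_ica(U, Cs, idx))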

The Riemannian Hessian of f_{\mathrm{ica}} at U is given by

\mathrm{Hess}f_{\mathrm{ica}}(U)[\xi]=\mathrm{P}_{U}\Big(\mathrm{Degrad}f_{\mathrm{ica}}(U)[\xi]-\xi\,\mathrm{sym}(U^{T}\mathrm{egrad}f_{\mathrm{ica}}(U))-U\,\mathrm{sym}(\xi^{T}\mathrm{egrad}f_{\mathrm{ica}}(U))-U\,\mathrm{sym}(U^{T}\mathrm{Degrad}f_{\mathrm{ica}}(U)[\xi])\Big)\quad\forall\xi\in T_{U}\mathrm{St}(r,d),

where

\mathrm{Degrad}f_{\mathrm{ica}}(U)[\xi]=-\frac{1}{n}\sum_{i=1}^{n}4C_{i}\big(\xi\,\mathrm{diag}(U^{T}C_{i}U)+U\,\mathrm{diag}(\xi^{T}C_{i}U)+U\,\mathrm{diag}(U^{T}C_{i}\xi)\big).
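
This directional derivative, combined with the projection and the Euclidean gradient from the previous sketch, yields a Riemannian Hessian-vector product via the formula above. A possible NumPy sketch (again with Cs of shape (n, d, d) as an assumed storage layout, and reusing sym, proj_stiefel and egrad_f_ica from the previous sketch) is:

import numpy as np

def degrad_f_ica(U, xi, Cs, idx=None):
    """Directional derivative Degrad f_ica(U)[xi] of the Euclidean gradient."""
    indices = range(Cs.shape[0]) if idx is None else idx
    D = np.zeros_like(U)
    for i in indices:
        C = Cs[i]
        CU, Cxi = C @ U, C @ xi
        d1 = np.diag(np.diag(U.T @ CU))     # diag(U^T C_i U)
        d2 = np.diag(np.diag(xi.T @ CU))    # diag(xi^T C_i U)
        d3 = np.diag(np.diag(U.T @ Cxi))    # diag(U^T C_i xi)
        D -= 4.0 * C @ (xi @ d1 + U @ d2 + U @ d3)
    return D / len(indices)

def hess_vec_f_ica(U, xi, Cs, idx=None):
    """Riemannian Hessian-vector product Hess f_ica(U)[xi], following the formula above."""
    eg = egrad_f_ica(U, Cs, idx)
    dg = degrad_f_ica(U, xi, Cs, idx)
    return proj_stiefel(U, dg - xi @ sym(U.T @ eg)
                           - U @ sym(xi.T @ eg) - U @ sym(U.T @ dg))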

For simplicity, Algorithm 3.1 with true gradient and true Hessian, with true gradient and inexact Hessian, and with inexact gradient and inexact Hessian are denoted by RACR, SRACR and SSRACR, respectively; and the inexact Riemannian trust-region algorithm in [17, Algorithm 1] is denoted by SSRTR. We apply RACR, SRACR, SSRACR, and SSRTR to solve the JD problem (30), and provide numerical comparisons for different (n,d,r). In all numerical experiments, we generate \{C_{i}:i=1,2,\cdots,n\} randomly, and all the inexact gradients and inexact Hessians involved are constructed by sampling with sizes |S_{g}|=\frac{n}{4} and |S_{H}|=\frac{n}{40}. We set the parameters as \sigma_{0}=0.001, \rho_{TH}=0.9 and \gamma=2 in RACR, SRACR and SSRACR, and set \triangle_{0}=1, \rho_{TH}=0.9 and \gamma=2 in SSRTR (see [17, Algorithm 1] for details about the parameters). The stopping criterion of all executed algorithms is \|G_{k}\|^{2}\leq 0.001.

The numerical results are listed in the table below, where i and T denote the number of iterations and the CPU time (in seconds), respectively. The results reveal that SSRACR costs much less CPU time than SRACR, RACR and SSRTR at the same accuracy.

Table: Number of iterations (i) and CPU time T (in seconds) for RACR, SRACR, SSRACR, and SSRTR.

(n,d,r)        | RACR            | SRACR           | SSRACR          | SSRTR
               | i      T        | i      T        | i      T        | i      T
(2015,5,5)     | 317    7.524    | 788    9.349    | 220    2.259    | 398    4.042
(2015,10,10)   | 1073   48.144   | 1247   29.432   | 718    14.048   | 1936   39.174
(2015,20,20)   | 2209   108.753  | 2114   87.839   | 2570   83.804   | 5140   172.033
(2015,30,30)   | 2693   395.710  | 2398   177.815  | 2603   202.081  | 12739  836.978
(2015,43,43)   | 2913   1280.422 | 3169   736.996  | 3563   691.293  | 12777  2622.408
(7200,43,43)   | 2873   5526.860 | 3019   2485.531 | 2909   2005.734 | 12837  8501.568
(60000,43,43)  | 298    4448.468 | 298    2395.300 | 303    2118.032 | 861    4579.865

Funding

Research of the second author was supported in part by the National Natural Science Foundation of China (grant 12161017) and the Guizhou Provincial Science and Technology Projects (grant ZK[2022]110).

References

  • [1] Shalev-Shwartz S, Ben-David S. Understanding Machine Learning: From Theory to Algorithms. Cambridge university press, 2014.
  • [2] Roosta-Khorasani F, Doel K, Ascher U. Data completion and stochastic algorithms for PDE inversion problems with many measurements. Electron. T. Numer. Ana., 42:177–196, 2014.
  • [3] Roosta-Khorasani F, Doel K, Ascher U. Stochastic algorithms for inverse problems involving pdes and many measurements. SIAM J. Sci. Comput., 36(5):S3–S22, 2014.
  • [4] Robbins H, Monro S. A stochastic approximation method. AMS, 22(3):400–407, 1951.
  • [5] Roux N, Schmidt M, Bach F. A stochastic gradient method with an exponential convergence rate for finite training sets. Advances in neural information processing systems, 25, 2012.
  • [6] Johnson R, Zhang T. Accelerating stochastic gradient descent using predictive variance reduction. In NIPS, 1(3):315–323, 2013.
  • [7] Liu DC, Nocedal J. On the limited memory BFGS method for large scale optimization. Math. Program., 45(1-3):503–528, 1989.
  • [8] Mishra B, Kasai H, Jawanpuria P, Saroop A. A Riemannian gossip approach to subspace learning on grassmann manifold. Mach. Learn., 108:1783–1803, 2019.
  • [9] Theis FJ, Cason TP, Absil PA. Soft dimension reduction for ICA by joint diagonalization on the Stiefel manifold. In ICA, 2019.
  • [10] Bonnabel S. Stochastic gradient descent on Riemannian manifolds. IEEE T. Automat. Contr., 58(9):2217–2229, 2013.
  • [11] Sato H, Kasai H, Mishra B. Riemannian stochastic variance reduced gradient. ArXiv preprint arXiv:1702.05594, 2017.
  • [12] Absil PA, Mahony R, Sepulchre R. Optimization Algorithms on Matrix Manifolds. Princeton University Press, 2008.
  • [13] Huang W, Gallivan KA, Absil PA. A broyden class of quasi-Newton methods for Riemannian optimization. SIAM J. Optimiz., 25(3):1660–1685, 2015.
  • [14] Gabay D. Minimizing a differentiable function over a differential manifold. J. Optimiz. Theory App., 37:177–219, 1982.
  • [15] Ring W, Wirth B. Optimization methods on Riemannian manifolds and their application to shape space. SIAM J. Optimiz., 22(2):596–627, 2012.
  • [16] Absil PA, Baker CG, Gallivan KA. Trust-Region Methods on Riemannian Manifolds. Found. Comput. Math., 2007;7(3): 303–330.
  • [17] Kasai H, Mishra B. Inexact trust-region algorithms on Riemannian manifolds. Advances in neural information processing systems, 31, 2018.
  • [18] DoCarmo MP. Riemannian Geometry. Boston MA: Birkhäuser Boston; 1992.
  • [19] Udriste C. Convex Functions and Optimization Methods on Riemannian Manifolds. In: Mathematics and Its Applications, Kluwer Academic, Dordrecht, 1994.
  • [20] Jorge N, Stephen JW. Numerical Optimization. Springer Series in Operations Research, Springer-Verlag, New York, 1999.
  • [21] Boumal N, Absil PA, Cartis C. Global rates of convergence for nonconvex optimization on manifolds. IMA J. Numer. Anal., 39(1):1–33, 2019.
  • [22] Conn AR, Gould NIM, Toint PL. Trust Region Methods. Society for Industrial and Applied Mathematics, 2000.
  • [23] Cartis C, Gould NIM, Toint PL. Adaptive cubic regularisation methods for unconstrained optimization. Part I: Motivation, convergence and numerical results. Math. Program., 127(2):245–295, 2011.
  • [24] Cartis C, Gould NIM, Toint PL. Adaptive cubic regularisation methods for unconstrained optimization. Part II: Worst-case function and derivative-evaluation complexity. Math. Program., 130(2):295–319, 2011.
  • [25] Hyvärinen A, Oja E. Independent component analysis: Algorithms and applications. Neural networks, 13(4-5): 411–430, 2000.
  • [26] Theis FJ, Inouye Y. On the use of joint diagonalization in blind signal processing. In: Proc. ISCAS 2006, Kos, Greece 2006.
  • [27] Bottou L, Curtis FE, Nocedal J. Optimization methods for large-scale machine learning. SIAM Rev., 60(2): 223–311, 2018.
  • [28] Byrd RH, Chin GM, Neveitt W, Nocedal J. On the use of stochastic Hessian information in optimization methods for machine learning. SIAM J. Optim., 21(3): 977–995, 2011.
  • [29] Erdogdu MA, Montanari A. Convergence rates of sub-sampled Newton methods. In NIPS, 2015.
  • [30] Kohler JM, Lucchi A. Sub-sampled cubic regularization for non-convex optimization. In ICML, 2017.
  • [31] Yao Z, Xu P, Roosta-Khorasani F, Mahoney MW. Inexact non-convex Newton-type methods. ArXiv preprint arXiv: 1802.06925, 2018.
  • [32] Xu P, Roosta F, Mahoney MW. Newton-type methods for non-convex optimization under inexact Hessian information. Math. Program., 184(1-2):35–70, 2020.
  • [33] Wang XM. Subgradient algorithms on Riemannian manifolds of lower bounded curvatures. Optimization, 67: 179–194, 2018.
  • [34] Wang XM, Li C, Wang JH, Yao JC. Linear convergence of subgradient algorithm for convex feasibility on Riemannian manifolds. SIAM J. Optim., 25: 2334–2358, 2015.
  • [35] Ferreira OP, Louzeiro MS, Prudente LF. Gradient method for optimization on Riemannian manifolds with lower bounded curvature. SIAM J. Optim., 29: 2517–2541, 2019.
  • [36] Bento CG, Bitar SDB, Cruz Neto JX, Oliveira PR, Souza JCO. Computing Riemannian center of mass on Hadamard manifolds. J. Optim. Theory Appl., 183: 977–992, 2019.