
Modified memoryless spectral-scaling Broyden family on Riemannian manifolds

Hiroyuki Sakai    Hideaki Iiduka
Abstract

This paper presents modified memoryless quasi-Newton methods based on the spectral-scaling Broyden family on Riemannian manifolds. The method involves adding one parameter to the search direction of the memoryless self-scaling Broyden family on the manifold. Moreover, it uses a general map instead of vector transport. This idea has already been proposed within a general framework of Riemannian conjugate gradient methods where one can use vector transport, scaled vector transport, or an inverse retraction. We show that the search direction satisfies the sufficient descent condition under some assumptions on the parameters. In addition, we show global convergence of the proposed method under the Wolfe conditions. We numerically compare it with existing methods, including Riemannian conjugate gradient methods and the memoryless spectral-scaling Broyden family. The numerical results indicate that the proposed method with the BFGS formula is suitable for solving an off-diagonal cost function minimization problem on an oblique manifold.

1 Introduction

Riemannian optimization has recently attracted a great deal of attention and has been used in many applications, including low-rank tensor completion [30, 10], machine learning [17], and shape analysis [8].

Iterative methods for solving unconstrained optimization problems in Euclidean space have been studied for a long time [18]. Quasi-Newton methods and nonlinear conjugate gradient methods are especially important ones and have been implemented in various software packages.

Quasi-Newton methods need to store dense matrices, which makes it difficult to apply them to large-scale problems. Shanno [27] proposed a memoryless quasi-Newton method as a way to deal with this problem. This method [9, 12, 13, 14, 15] has proven effective at solving large-scale unconstrained optimization problems. The concept is simple: the approximate matrix is updated by using the identity matrix instead of the previous approximate matrix. As in nonlinear conjugate gradient methods, the search direction can then be computed without storing any matrices, using only inner products, as illustrated below.
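As a concrete illustration (not taken from this paper), the classical memoryless BFGS direction in Euclidean space can be evaluated with inner products alone. The following is a minimal NumPy sketch; the function name and variable layout are illustrative:

```python
import numpy as np

def memoryless_bfgs_direction(g, s, y):
    """Memoryless BFGS direction -H_k g, where H_k is the BFGS update of the
    identity matrix; only inner products are used and no n-by-n matrix is stored."""
    sy = s @ y  # assumed positive (e.g., guaranteed by a Wolfe line search)
    return (-g
            + ((y @ g) / sy) * s
            + ((s @ g) / sy) * y
            - (1.0 + (y @ y) / sy) * ((s @ g) / sy) * s)
```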

Kou and Dai [9] proposed a modified memoryless spectral-scaling BFGS method. Their method involves adding one parameter to the search direction of the memoryless self-scaling BFGS method. In [13], Nakayama used this technique to devise a memoryless spectral-scaling Broyden family. In addition, he showed that the search direction is a sufficient descent direction and has the global convergence property. Nakayama, Narushima, and Yabe [15] proposed memoryless quasi-Newton methods based on the spectral-scaling Broyden family [3]. Their methods generate a sufficient descent direction and have the global convergence property.

Many useful iterative methods for solving unconstrained optimization problems on manifolds have been studied (see [2, 24]). They have been obtained by extending iterative methods in Euclidean space through the concepts of retraction and vector transport. For example, Riemannian quasi-Newton methods [7, 6] and Riemannian conjugate gradient methods [26, 34, 20, 24] have been developed. Sato and Iwai [26] introduced scaled vector transport [26, Definition 2.2] in order to remove the assumption of isometric vector transport from the convergence analysis. Zhu and Sato [34] proposed Riemannian conjugate gradient methods that use an inverse retraction instead of vector transport. In [24], Sato proposed a general framework of Riemannian conjugate gradient methods. This framework uses a general map instead of vector transport and encompasses the existing Riemannian conjugate gradient methods, such as those that use vector transport, scaled vector transport [26], or an inverse retraction [34].

In [19], Ring and Wirth proposed a Riemannian BFGS method that has a global convergence property under some convexity assumptions. Narushima et al. [16] proposed memoryless quasi-Newton methods based on the spectral-scaling Broyden family on Riemannian manifolds. They extended the memoryless spectral-scaling Broyden family in Euclidean space to Riemannian manifolds with an additional modification to ensure a sufficient descent condition. Moreover, they presented a global convergence analysis under the Wolfe conditions. In particular, they did not assume convexity of the objective function or isometric vector transport. The results of the previous studies are summarized in Tables 1 and 2.

In this paper, we propose a modified memoryless quasi-Newton method based on the spectral-scaling Broyden family on Riemannian manifolds, exploiting the idea used in the paper [13].

In the case of Euclidean space, Nakayama [13] reported that the modified memoryless quasi-Newton method based on the spectral-scaling Broyden family shows good experimental performance with parameter tuning. Therefore, it is worth extending it to Riemannian manifolds. Our method is based on the memoryless quasi-Newton methods on Riemannian manifolds proposed by Narushima et al. [16] as well as on the modification by Kou and Dai [9]. It uses a general map to transport vectors, similarly to the general framework of Riemannian conjugate gradient methods [25]. This generalization allows us to use maps such as an inverse retraction [34] instead of vector transport. We show that our method generates a search direction satisfying the sufficient descent condition under some assumptions on the parameters (see Proposition 1). Moreover, we present a global convergence analysis under the Wolfe conditions (see Theorem 2). Furthermore, we describe the results of numerical experiments comparing our method with existing ones, including Riemannian conjugate gradient methods [20] and the memoryless spectral-scaling Broyden family on Riemannian manifolds [16]. The key advantages of the proposed method are the added parameter $\xi_{k-1}$ and the support for maps other than vector transports. As shown in the numerical experiments, the proposed method may outperform the existing methods depending on how the parameter $\xi_{k-1}$ is chosen. It also has an advantage over [16] in that it can use a map such as an inverse retraction, which is not applicable in [16].

This paper is organized as follows. Section 2 reviews the fundamentals of Riemannian geometry and Riemannian optimization. Section 3 proposes the modified memoryless quasi-Newton method based on the spectral-scaling Broyden family. Section 4 gives a global convergence analysis. Section 5 compares the proposed method with the existing methods through numerical experiments. Section 6 concludes the paper.

Table 1: Results of previous studies on quasi-Newton methods in Euclidean space and Riemannian manifolds.

            | BFGS                                            | Broyden family           | spectral-scaling Broyden family
Euclidean   | ——                                              | ——                       | Chen–Cheng (2013) [3]
Riemannian  | Ring–Wirth (2012) [19], Huang et al. (2018) [6] | Huang et al. (2015) [7]  | ——
Table 2: Results of previous studies and ours on memoryless quasi-Newton methods in Euclidean space and Riemannian manifolds.

            | spectral-scaling Broyden family | modified spectral-scaling Broyden family
Euclidean   | Nakayama et al. (2019) [15]     | Nakayama (2018) [13]
Riemannian  | Narushima et al. (2023) [16]    | this work

2 Mathematical preliminaries

Let $M$ be a Riemannian manifold with Riemannian metric $g$. $T_xM$ denotes the tangent space of $M$ at a point $x \in M$. The tangent bundle of $M$ is denoted by $TM$. The Riemannian metric at $x \in M$ is denoted by $\langle\cdot,\cdot\rangle_x : T_xM \times T_xM \to \mathbb{R}$. The induced norm of a tangent vector $\eta \in T_xM$ is defined by $\lVert\eta\rVert_x := \sqrt{\langle\eta,\eta\rangle_x}$. For a given tangent vector $\eta \in T_xM$, $\eta^{\flat}$ denotes the flat of $\eta$, i.e., $\eta^{\flat} : T_xM \to \mathbb{R} : \xi \mapsto \langle\eta,\xi\rangle_x$. Let $F : M \to N$ be a smooth map between smooth manifolds; then, the derivative of $F$ at $x \in M$ is denoted by $\mathrm{D}F(x) : T_xM \to T_{F(x)}N$. For a smooth function $f : M \to \mathbb{R}$, $\mathrm{grad}f(x)$ denotes the Riemannian gradient at $x \in M$, i.e., the unique element of $T_xM$ satisfying

\langle\mathrm{grad}f(x),\eta\rangle_{x} = \mathrm{D}f(x)[\eta]

for all $\eta \in T_xM$. $\mathrm{Hess}f(x)$ denotes the Riemannian Hessian at $x \in M$, which is defined as

\mathrm{Hess}f(x) : T_xM \to T_xM : \eta \mapsto \nabla_{\eta}\mathrm{grad}f(x),

where $\nabla$ denotes the Levi-Civita connection of $M$ (see [2]).

Definition 1.

Any smooth map $R : TM \to M$ is called a retraction on $M$ if it has the following properties.

  • $R_x(0_x) = x$, where $0_x$ denotes the zero element of $T_xM$;

  • $\mathrm{D}R_x(0_x) = \mathrm{id}_{T_xM}$ with the canonical identification $T_{0_x}(T_xM) \simeq T_xM$,

where $R_x$ denotes the restriction of $R$ to $T_xM$.

Definition 2.

Any smooth map $\mathcal{T} : TM \oplus TM \to TM$ is called a vector transport on $M$ if it has the following properties.

  • There exists a retraction $R$ such that $\mathcal{T}_{\eta}(\xi) \in T_{R_x(\eta)}M$ for all $x \in M$ and $\eta, \xi \in T_xM$;

  • $\mathcal{T}_{0_x}(\xi) = \xi$ for all $x \in M$ and $\xi \in T_xM$;

  • $\mathcal{T}_{\eta}(a\xi + b\zeta) = a\mathcal{T}_{\eta}(\xi) + b\mathcal{T}_{\eta}(\zeta)$ for all $x \in M$, $a, b \in \mathbb{R}$, and $\eta, \xi, \zeta \in T_xM$,

where $\mathcal{T}_{\eta}(\xi) := \mathcal{T}(\eta, \xi)$.

Let us consider an iterative method in Riemannian optimization. For an initial point $x_0 \in M$, step size $\alpha_k > 0$, and search direction $\eta_k \in T_{x_k}M$, the $k$-th approximation to the solution is described as

x_{k+1} = R_{x_k}(\alpha_k\eta_k), (1)

where $R$ is a retraction. We define $g_k := \mathrm{grad}f(x_k)$. Various algorithms have been developed to determine the search direction $\eta_k$. We say that $\eta_k$ is a sufficient descent direction if the sufficient descent condition

\langle g_k,\eta_k\rangle_{x_k} \leq -\kappa\lVert g_k\rVert_{x_k}^2 (2)

holds for some constant $\kappa > 0$.

In [7, 6, 16], the search direction $\eta_k \in T_{x_k}M$ of Riemannian quasi-Newton methods is computed as

\eta_k = -\mathcal{H}_k[g_k], (3)

where $\mathcal{H}_k : T_{x_k}M \to T_{x_k}M$ is a symmetric approximation to $\mathrm{Hess}f(x_k)^{-1}$.

In [25], Sato proposed a general framework of Riemannian conjugate gradient methods by using a map $\mathscr{T}^{(k-1)} : T_{x_{k-1}}M \to T_{x_k}M$, which satisfies Assumption 1, to transport $\eta_{k-1} \in T_{x_{k-1}}M$ to $T_{x_k}M$; i.e., the search direction $\eta_k$ is computed as

\eta_k = -g_k + \beta_k\sigma_{k-1}\mathscr{T}^{(k-1)}(\eta_{k-1}),

where $\beta_k \in \mathbb{R}$ and $\sigma_{k-1}$ is a scaling parameter (see [25, Section 4.1]) satisfying

0 < \sigma_{k-1} \leq \min\left\{1, \frac{\lVert\eta_{k-1}\rVert_{x_{k-1}}}{\lVert\mathscr{T}^{(k-1)}(\eta_{k-1})\rVert_{x_k}}\right\}.
Assumption 1.

There exist $C \geq 0$ and $K \subset \mathbb{N}$ such that, for all $k \in K$,

\lVert\mathscr{T}^{(k)}(\eta_k) - \mathrm{D}R_{x_k}(\alpha_k\eta_k)[\eta_k]\rVert_{x_{k+1}} \leq C\alpha_k\lVert\eta_k\rVert_{x_k}^2, (4)

and, for all $k \in \mathbb{N}-K$,

\lVert\mathscr{T}^{(k)}(\eta_k) - \mathrm{D}R_{x_k}(\alpha_k\eta_k)[\eta_k]\rVert_{x_{k+1}} \leq C(\alpha_k+\alpha_k^2)\lVert\eta_k\rVert_{x_k}^2. (5)

Note that inequality (5) is weaker than (4). For $k$ satisfying the stronger condition (4), the assumption of Theorem 1 can be weakened; further details can be found in [25, Remark 4.3]. Assumption 1 requires that $\mathscr{T}^{(k)}$ be an approximation of the differentiated retraction, so the differentiated retraction itself clearly satisfies the assumption. In [25, Examples 4.5 and 4.6], Sato gives examples of maps $\mathscr{T}^{(k)}$ satisfying Assumption 1 on the unit sphere and the Grassmann manifold, respectively. In [34, Proposition 1], Zhu and Sato proved that the inverse of a retraction satisfies Assumption 1.
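For instance, for the projection retraction on the unit sphere the inverse retraction has a closed form. The following NumPy sketch (illustrative function names, not from the paper) shows the pair $R_x$ and $R_x^{-1}$ from which maps of the kind considered in [34] can be built:

```python
import numpy as np

def retract_sphere(x, eta):
    # projection retraction on the unit sphere: R_x(eta) = (x + eta) / ||x + eta||
    v = x + eta
    return v / np.linalg.norm(v)

def inverse_retract_sphere(x, y):
    # inverse of the projection retraction, defined when x @ y > 0:
    # R_x^{-1}(y) = y / (x @ y) - x, which is tangent to the sphere at x
    return y / (x @ y) - x
```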

Sato [25, Section 4.3] generalized the parameter $\beta_k$ of the Fletcher–Reeves (FR) [26], Dai–Yuan (DY) [23], Polak–Ribière–Polyak (PRP), and Hestenes–Stiefel (HS) methods as follows:

\beta_k^{\mathrm{FR}} = \frac{\lVert g_k\rVert_{x_k}^2}{\lVert g_{k-1}\rVert_{x_{k-1}}^2},
\beta_k^{\mathrm{DY}} = \frac{\lVert g_k\rVert_{x_k}^2}{\langle g_k,\sigma_{k-1}\mathscr{T}^{(k-1)}(\eta_{k-1})\rangle_{x_k} - \langle g_{k-1},\eta_{k-1}\rangle_{x_{k-1}}}, (6)
\beta_k^{\mathrm{PRP}} = \frac{\lVert g_k\rVert_{x_k}^2 - \langle g_k,l_{k-1}\mathscr{S}^{(k-1)}(g_{k-1})\rangle_{x_k}}{\lVert g_{k-1}\rVert_{x_{k-1}}^2},
\beta_k^{\mathrm{HS}} = \frac{\lVert g_k\rVert_{x_k}^2 - \langle g_k,l_{k-1}\mathscr{S}^{(k-1)}(g_{k-1})\rangle_{x_k}}{\langle g_k,\sigma_{k-1}\mathscr{T}^{(k-1)}(\eta_{k-1})\rangle_{x_k} - \langle g_{k-1},\eta_{k-1}\rangle_{x_{k-1}}},

where $l_{k-1} > 0$ and $\mathscr{S}^{(k-1)} : T_{x_{k-1}}M \to T_{x_k}M$ is an appropriate mapping. We can also use the Hager–Zhang (HZ) method [22, Section 3] generalized by the same technique:

\beta_k^{\mathrm{HZ}} = \beta_k^{\mathrm{HS}} - \mu\frac{\lVert y_{k-1}\rVert_{x_k}^2\langle g_k,\mathscr{T}^{(k-1)}(\eta_{k-1})\rangle_{x_k}}{\left(\langle g_k,\sigma_{k-1}\mathscr{T}^{(k-1)}(\eta_{k-1})\rangle_{x_k} - \langle g_{k-1},\eta_{k-1}\rangle_{x_{k-1}}\right)^2}, (7)

where $\mu > 1/4$.

We suppose that the search direction $\eta_k \in T_{x_k}M$ is a descent direction. In [25, Section 4.4], Sato introduced a Riemannian version of the Wolfe conditions involving a map $\mathscr{T}^{(k)} : T_{x_k}M \to T_{x_{k+1}}M$, called the $\mathscr{T}^{(k)}$-Wolfe conditions. They are written as

f(R_{x_k}(\alpha_k\eta_k)) \leq f(x_k) + c_1\alpha_k\langle g_k,\eta_k\rangle_{x_k}, (8)
\langle\mathrm{grad}f(R_{x_k}(\alpha_k\eta_k)),\mathscr{T}^{(k)}(\eta_k)\rangle_{R_{x_k}(\alpha_k\eta_k)} \geq c_2\langle g_k,\eta_k\rangle_{x_k}, (9)

where $0 < c_1 < c_2 < 1$. Note that the existence of a step size $\alpha_k > 0$ satisfying the $\mathscr{T}^{(k)}$-Wolfe conditions is discussed in [25, Section 4.4]. Moreover, algorithms for finding step sizes that satisfy the $\mathscr{T}^{(k)}$-Wolfe conditions are given in [21, Algorithm 3] and [23, Section 5.1].

3 Memoryless spectral-scaling Broyden family

Let us start by reviewing the memoryless spectral-scaling Broyden family in Euclidean space. In the Euclidean setting, an iterative optimization algorithm updates the current iterate $x_k$ to the next iterate $x_{k+1}$ with the updating formula

x_{k+1} = x_k + \alpha_k d_k,

where $\alpha_k > 0$ is a positive step size. One often chooses a step size $\alpha_k > 0$ satisfying the Wolfe conditions (see [31, 32]),

f(x_k + \alpha_k d_k) \leq f(x_k) + c_1\alpha_k\nabla f(x_k)^{\top}d_k,
\nabla f(x_k + \alpha_k d_k)^{\top}d_k \geq c_2\nabla f(x_k)^{\top}d_k,

where 0<c1<c2<10<c_{1}<c_{2}<1. The search direction dkd_{k} of the quasi-Newton methods is defined by

dk=Hkf(xk),\displaystyle d_{k}=-H_{k}\nabla f(x_{k}), (10)

where gk=f(xk)g_{k}=\nabla f(x_{k}) and HkH_{k} is a symmetric approximation to 2f(xk)1\nabla^{2}f(x_{k})^{-1}. In this paper, we will focus on the Broyden family, written as

Hk=Hk1Hk1yk1yk1Hk1yk1Hk1yk1+sk1sk1sk1yk1+ϕk1yk1Hk1yk1wk1wk1,\displaystyle~{}\begin{split}H_{k}&=H_{k-1}-\frac{H_{k-1}y_{k-1}y_{k-1}^{\top}H_{k-1}}{y_{k-1}^{\top}H_{k-1}y_{k-1}}+\frac{s_{k-1}s_{k-1}^{\top}}{s_{k-1}^{\top}y_{k-1}}\\ &\quad+\phi_{k-1}y_{k-1}^{\top}H_{k-1}y_{k-1}w_{k-1}w_{k-1}^{\top},\end{split} (11)

where

wk1=sk1sk1yk1Hk1yk1yk1Hk1yk1,\displaystyle w_{k-1}=\frac{s_{k-1}}{s_{k-1}^{\top}y_{k-1}}-\frac{H_{k-1}y_{k-1}}{y_{k-1}^{\top}H_{k-1}y_{k-1}},

$s_{k-1} = x_k - x_{k-1}$, and $y_{k-1} = \nabla f(x_k) - \nabla f(x_{k-1})$. Here, $\phi_{k-1}$ is a parameter; (11) becomes the DFP formula when $\phi_{k-1} = 0$ and the BFGS formula when $\phi_{k-1} = 1$ (see [18, 28]). If $\phi_{k-1} \in [0,1]$, then (11) is a convex combination of the DFP formula and the BFGS formula; we call this interval the convex class. Zhang and Tewarson [33] found a better choice in the case $\phi_{k-1} > 1$; we call this interval the preconvex class. In [3], Chen and Cheng proposed the Broyden family based on the spectral-scaling secant condition [4] as follows:

H_k = H_{k-1} - \frac{H_{k-1}y_{k-1}y_{k-1}^{\top}H_{k-1}}{y_{k-1}^{\top}H_{k-1}y_{k-1}} + \frac{1}{\tau_{k-1}}\frac{s_{k-1}s_{k-1}^{\top}}{s_{k-1}^{\top}y_{k-1}} + \phi_{k-1}y_{k-1}^{\top}H_{k-1}y_{k-1}w_{k-1}w_{k-1}^{\top}, (12)

where $\tau_{k-1} > 0$ is a spectral-scaling parameter.

Shanno [27] proposed memoryless quasi-Newton methods, in which $H_{k-1}$ is replaced with the identity matrix in (11). Memoryless quasi-Newton methods avoid storing matrices and can therefore solve large-scale unconstrained optimization problems. In addition, Nakayama, Narushima, and Yabe [15] proposed memoryless quasi-Newton methods based on the spectral-scaling Broyden family by replacing $H_{k-1}$ with the identity matrix in (12), i.e.,

H_k = I - \frac{y_{k-1}y_{k-1}^{\top}}{y_{k-1}^{\top}y_{k-1}} + \frac{1}{\tau_{k-1}}\frac{s_{k-1}s_{k-1}^{\top}}{s_{k-1}^{\top}y_{k-1}} + \phi_{k-1}y_{k-1}^{\top}y_{k-1}w_{k-1}w_{k-1}^{\top}, (13)

where

w_{k-1} = \frac{s_{k-1}}{s_{k-1}^{\top}y_{k-1}} - \frac{y_{k-1}}{y_{k-1}^{\top}y_{k-1}}.

From (10) and (13), the search direction $d_k$ of memoryless quasi-Newton methods based on the spectral-scaling Broyden family can be computed as

d_k = -g_k + \left(\phi_{k-1}\frac{y_{k-1}^{\top}g_k}{d_{k-1}^{\top}y_{k-1}} - \left(\frac{1}{\tau_{k-1}} + \phi_{k-1}\frac{y_{k-1}^{\top}y_{k-1}}{s_{k-1}^{\top}y_{k-1}}\right)\frac{s_{k-1}^{\top}g_k}{d_{k-1}^{\top}y_{k-1}}\right)d_{k-1} + \left(\phi_{k-1}\frac{d_{k-1}^{\top}g_k}{d_{k-1}^{\top}y_{k-1}} + (1-\phi_{k-1})\frac{y_{k-1}^{\top}g_k}{y_{k-1}^{\top}y_{k-1}}\right)y_{k-1}.

In [15], they also proved global convergence for step sizes satisfying the Wolfe conditions (see [15, Theorems 3.1 and 3.6]). In [9], Kou and Dai proposed a modified memoryless self-scaling BFGS method and showed that it generates a search direction satisfying the sufficient descent condition. Moreover, Nakayama [13] used the modification by Kou and Dai and proposed a search direction $d_k$ defined by

d_k = -g_k + \left(\phi_{k-1}\frac{y_{k-1}^{\top}g_k}{d_{k-1}^{\top}y_{k-1}} - \left(\frac{1}{\tau_{k-1}} + \phi_{k-1}\frac{y_{k-1}^{\top}y_{k-1}}{s_{k-1}^{\top}y_{k-1}}\right)\frac{s_{k-1}^{\top}g_k}{d_{k-1}^{\top}y_{k-1}}\right)d_{k-1} + \xi_{k-1}\left(\phi_{k-1}\frac{d_{k-1}^{\top}g_k}{d_{k-1}^{\top}y_{k-1}} + (1-\phi_{k-1})\frac{y_{k-1}^{\top}g_k}{y_{k-1}^{\top}y_{k-1}}\right)y_{k-1},

where $\xi_{k-1} \in [0,1]$ is a parameter.
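A matrix-free NumPy sketch of this modified direction is given below (names are illustrative); setting xi = 1 recovers the unmodified memoryless direction above:

```python
import numpy as np

def modified_memoryless_direction(g, d, s, y, phi, tau, xi):
    """Modified memoryless spectral-scaling Broyden direction in Euclidean space;
    s = x_k - x_{k-1}, y = grad f(x_k) - grad f(x_{k-1}), d = d_{k-1}."""
    dy = d @ y
    sy = s @ y
    coeff_d = phi * (y @ g) / dy - (1.0 / tau + phi * (y @ y) / sy) * (s @ g) / dy
    coeff_y = xi * (phi * (d @ g) / dy + (1.0 - phi) * (y @ g) / (y @ y))
    return -g + coeff_d * d + coeff_y * y
```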

3.1 Memoryless spectral-scaling Broyden family on Riemannian manifolds

We define $s_{k-1} = \mathcal{T}_{\alpha_{k-1}\eta_{k-1}}(\alpha_{k-1}\eta_{k-1})$ and $y_{k-1} = g_k - \mathcal{T}_{\alpha_{k-1}\eta_{k-1}}(g_{k-1})$. The Riemannian quasi-Newton method with the spectral-scaling Broyden family [16, (23)] is written as

\mathcal{H}_k = \tilde{\mathcal{H}}_{k-1} - \frac{\tilde{\mathcal{H}}_{k-1}y_{k-1}(\tilde{\mathcal{H}}_{k-1}y_{k-1})^{\flat}}{(\tilde{\mathcal{H}}_{k-1}y_{k-1})^{\flat}y_{k-1}} + \frac{1}{\tau_{k-1}}\frac{s_{k-1}s_{k-1}^{\flat}}{s_{k-1}^{\flat}y_{k-1}} + \phi_{k-1}(\tilde{\mathcal{H}}_{k-1}y_{k-1})^{\flat}y_{k-1}w_{k-1}w_{k-1}^{\flat}, (14)

where

w_{k-1} = \frac{s_{k-1}}{s_{k-1}^{\flat}y_{k-1}} - \frac{\tilde{\mathcal{H}}_{k-1}y_{k-1}}{(\tilde{\mathcal{H}}_{k-1}y_{k-1})^{\flat}y_{k-1}},

and

\tilde{\mathcal{H}}_{k-1} = \mathcal{T}_{\alpha_{k-1}\eta_{k-1}} \circ \mathcal{H}_{k-1} \circ (\mathcal{T}_{\alpha_{k-1}\eta_{k-1}})^{-1}.

Here, $\phi_{k-1} \geq 0$ is a parameter, and $\tau_{k-1} > 0$ is a spectral-scaling parameter.

The idea behind the memoryless spectral-scaling Broyden family is very simple: replace $\tilde{\mathcal{H}}_{k-1}$ with the identity map $\mathrm{id}_{T_{x_k}M}$. In [16], a memoryless spectral-scaling Broyden family on a Riemannian manifold was proposed by making this replacement. To guarantee global convergence, $y_{k-1}$ was further replaced by a tangent vector $z_{k-1} \in T_{x_k}M$ satisfying the following conditions [16, (27)]: for positive constants $\underline{\nu}, \overline{\nu} > 0$,

\underline{\nu}\lVert s_{k-1}\rVert_{x_k}^2 \leq s_{k-1}^{\flat}z_{k-1}, (15)
\lVert z_{k-1}\rVert_{x_k} \leq \overline{\nu}\lVert s_{k-1}\rVert_{x_k}. (16)

Here, we can choose $z_{k-1}$ by using Li-Fukushima regularization [11], which is a Levenberg–Marquardt type of regularization, and set

z_{k-1} = y_{k-1} + \nu_{k-1}s_{k-1}, (17)

where

\nu_{k-1} = \begin{cases}0, & \text{if } s_{k-1}^{\flat}y_{k-1} \geq \hat{\nu}\lVert s_{k-1}\rVert_{x_k}^2,\\ \max\left\{0, -\dfrac{s_{k-1}^{\flat}y_{k-1}}{\lVert s_{k-1}\rVert_{x_k}^2}\right\} + \hat{\nu}, & \text{otherwise},\end{cases} (18)

and $\hat{\nu} > 0$. We can also use Powell's damping technique [18], which sets

z_{k-1} = \nu_{k-1}y_{k-1} + (1-\nu_{k-1})s_{k-1}, (19)

where $\hat{\nu} \in (0,1)$ and

\nu_{k-1} = \begin{cases}1, & \text{if } s_{k-1}^{\flat}y_{k-1} \geq \hat{\nu}\lVert s_{k-1}\rVert_{x_k}^2,\\ \dfrac{(1-\hat{\nu})\lVert s_{k-1}\rVert_{x_k}^2}{\lVert s_{k-1}\rVert_{x_k}^2 - s_{k-1}^{\flat}y_{k-1}}, & \text{otherwise}.\end{cases} (20)
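Both safeguards translate directly into code. The sketch below assumes an embedded submanifold with the induced metric, so that tangent vectors are NumPy arrays and $\langle\cdot,\cdot\rangle_{x_k}$ is the Euclidean inner product; the function names are illustrative:

```python
import numpy as np

def z_li_fukushima(s, y, nu_hat=1e-6):
    """z_{k-1} via Li-Fukushima regularization (17)-(18)."""
    sy, ss = s @ y, s @ s
    if sy >= nu_hat * ss:
        nu = 0.0
    else:
        nu = max(0.0, -sy / ss) + nu_hat
    return y + nu * s

def z_powell_damping(s, y, nu_hat=0.1):
    """z_{k-1} via Powell's damping technique (19)-(20)."""
    sy, ss = s @ y, s @ s
    if sy >= nu_hat * ss:
        nu = 1.0
    else:
        nu = (1.0 - nu_hat) * ss / (ss - sy)
    return nu * y + (1.0 - nu) * s
```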

The proof that these choices satisfy conditions (15) and (16) is given in [16, Proposition 4.1]. Thus, a memoryless spectral-scaling Broyden family on a Riemannian manifold [16, (28)] can be described as

\mathcal{H}_k = \gamma_{k-1}\mathrm{id}_{T_{x_k}M} - \gamma_{k-1}\frac{z_{k-1}z_{k-1}^{\flat}}{z_{k-1}^{\flat}z_{k-1}} + \frac{1}{\tau_{k-1}}\frac{s_{k-1}s_{k-1}^{\flat}}{s_{k-1}^{\flat}z_{k-1}} + \phi_{k-1}\gamma_{k-1}z_{k-1}^{\flat}z_{k-1}w_{k-1}w_{k-1}^{\flat},

where

w_{k-1} = \frac{s_{k-1}}{s_{k-1}^{\flat}z_{k-1}} - \frac{z_{k-1}}{z_{k-1}^{\flat}z_{k-1}}.

Here, $\gamma_{k-1} > 0$ is a sizing parameter. From (3), we can compute the search direction of the memoryless spectral-scaling Broyden family on a Riemannian manifold as follows:

\eta_k = -\gamma_{k-1}g_k + \gamma_{k-1}\left(\phi_{k-1}\frac{z_{k-1}^{\flat}g_k}{s_{k-1}^{\flat}z_{k-1}} - \left(\frac{1}{\gamma_{k-1}\tau_{k-1}} + \phi_{k-1}\frac{z_{k-1}^{\flat}z_{k-1}}{s_{k-1}^{\flat}z_{k-1}}\right)\frac{s_{k-1}^{\flat}g_k}{s_{k-1}^{\flat}z_{k-1}}\right)s_{k-1} + \gamma_{k-1}\left(\phi_{k-1}\frac{s_{k-1}^{\flat}g_k}{s_{k-1}^{\flat}z_{k-1}} + (1-\phi_{k-1})\frac{z_{k-1}^{\flat}g_k}{z_{k-1}^{\flat}z_{k-1}}\right)z_{k-1}.

3.2 Proposed algorithm

Let $\mathscr{T}^{(k-1)} : T_{x_{k-1}}M \to T_{x_k}M$ be a map satisfying Assumption 1. Furthermore, we define $y_{k-1} = g_k - \mathscr{T}^{(k-1)}(g_{k-1})$ and $s_{k-1} = \mathscr{T}^{(k-1)}(\alpha_{k-1}\eta_{k-1})$. We propose the following search direction for the modified memoryless spectral-scaling Broyden family on a Riemannian manifold:

\eta_k = -\gamma_{k-1}g_k + \gamma_{k-1}\left(\phi_{k-1}\frac{z_{k-1}^{\flat}g_k}{s_{k-1}^{\flat}z_{k-1}} - \left(\frac{1}{\gamma_{k-1}\tau_{k-1}} + \phi_{k-1}\frac{z_{k-1}^{\flat}z_{k-1}}{s_{k-1}^{\flat}z_{k-1}}\right)\frac{s_{k-1}^{\flat}g_k}{s_{k-1}^{\flat}z_{k-1}}\right)s_{k-1} + \gamma_{k-1}\xi_{k-1}\left(\phi_{k-1}\frac{s_{k-1}^{\flat}g_k}{s_{k-1}^{\flat}z_{k-1}} + (1-\phi_{k-1})\frac{z_{k-1}^{\flat}g_k}{z_{k-1}^{\flat}z_{k-1}}\right)z_{k-1}, (21)

where $\xi_{k-1} \in [0,1]$ is a parameter and $z_{k-1} \in T_{x_k}M$ is a tangent vector satisfying (15) and (16). Note that (21) not only adds the parameter $\xi_{k-1}$ but also changes the definitions of the tangent vectors $y_{k-1}$ and $s_{k-1}$ used to determine $z_{k-1}$. The proposed algorithm is listed in Algorithm 1. Note that Algorithm 1 is a generalization of the memoryless quasi-Newton methods based on the spectral-scaling Broyden family proposed in [16]; indeed, if $\xi_{k-1} = 1$ and $\mathscr{T}^{(k-1)} = \mathcal{T}_{\alpha_{k-1}\eta_{k-1}}(\cdot)$, then Algorithm 1 coincides with them.

Algorithm 1 Modified memoryless quasi-Newton method based on the spectral-scaling Broyden family on Riemannian manifolds.
Input: initial point $x_0 \in M$, $(\gamma_k)_{k=0}^{\infty} \subset (0,\infty)$, $(\phi_k)_{k=0}^{\infty} \subset [0,\infty)$, $(\xi_k)_{k=0}^{\infty} \subset [0,1]$, $(\tau_k)_{k=0}^{\infty} \subset (0,\infty)$.
Output: sequence $(x_k)_{k=0}^{\infty} \subset M$.
1: $k \leftarrow 0$.
2: Set $\eta_0 = -g_0 = -\mathrm{grad}f(x_0)$.
3: loop
4:     Compute a step size $\alpha_k > 0$ satisfying the Wolfe conditions (8) and (9).
5:     Set $x_{k+1} = R_{x_k}(\alpha_k\eta_k)$.
6:     Compute $g_{k+1} := \mathrm{grad}f(x_{k+1})$.
7:     Compute a search direction $\eta_{k+1} \in T_{x_{k+1}}M$ by (21).
8:     $k \leftarrow k+1$.
9: end loop
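The core computation in Algorithm 1 is the evaluation of (21). A minimal NumPy sketch is given below, again assuming an embedded submanifold with the induced metric so that tangent vectors at $x_k$ are arrays and the flat of a vector acts by the Euclidean inner product; the function name is illustrative:

```python
import numpy as np

def proposed_direction(g, s, z, gamma, tau, phi, xi):
    """Search direction (21): g = grad f(x_k), s = T^(k-1)(alpha_{k-1} eta_{k-1}),
    z = safeguarded difference vector satisfying (15)-(16)."""
    sz = s @ z
    coeff_s = (phi * (z @ g) / sz
               - (1.0 / (gamma * tau) + phi * (z @ z) / sz) * (s @ g) / sz)
    coeff_z = xi * (phi * (s @ g) / sz + (1.0 - phi) * (z @ g) / (z @ z))
    return gamma * (-g + coeff_s * s + coeff_z * z)
```

With xi = 1, this reduces to the search direction of [16], as noted above.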

4 Convergence analysis

Assumption 2.

Let $f : M \to \mathbb{R}$ be a smooth function that is bounded below and has the following property: there exists $L > 0$ such that

\lvert\mathrm{D}(f \circ R_x)(t\eta)[\eta] - \mathrm{D}(f \circ R_x)(0_x)[\eta]\rvert \leq Lt,

for all $\eta \in T_xM$ with $\lVert\eta\rVert_x = 1$, $x \in M$, and $t \geq 0$.

Assumption 3.

We suppose that there exists $\Gamma > 0$ such that

\lVert g_k\rVert_{x_k} \leq \Gamma

for all $k$.

Zoutendijk's theorem for the $\mathscr{T}^{(k)}$-Wolfe conditions [25, Theorem 5.3] is described as follows:

Theorem 1.

Suppose that Assumptions 1 and 2 hold. Let $(x_k)_{k=0,1,\ldots}$ be a sequence generated by an iterative method of the form (1), and assume that the step size $\alpha_k$ satisfies the $\mathscr{T}^{(k)}$-Wolfe conditions (8) and (9). If the search direction $\eta_k$ is a descent direction and there exists $\mu > 0$ such that $\lVert g_k\rVert_{x_k} \leq \mu\lVert\eta_k\rVert_{x_k}$ for all $k \in \mathbb{N}-K$, then the following holds:

\sum_{k=0}^{\infty}\frac{\langle g_k,\eta_k\rangle_{x_k}^2}{\lVert\eta_k\rVert_{x_k}^2} < \infty,

where $K$ is the subset of $\mathbb{N}$ in Assumption 1.

We present a proof that the search direction (21) satisfies the sufficient descent condition (2), which involves generalizing the Euclidean case in [13, Proposition 3.1] and [15, Proposition 2.1].

Proposition 1.

Assume that 0<γ¯γk10<\underline{\gamma}\leq\gamma_{k-1} and 0ϕk1ϕ¯20\leq\phi_{k-1}\leq\overline{\phi}^{2} hold, where 1<ϕ¯<21<\overline{\phi}<2. The search direction (21) with

{0ξk1ξ¯,if 0ϕk11,0ξk1<ϕ¯ϕk11,otherwise,\displaystyle\begin{cases}0\leq\xi_{k-1}\leq\overline{\xi},&\text{if }0\leq\phi_{k-1}\leq 1,\\ 0\leq\xi_{k-1}<\dfrac{\overline{\phi}}{\sqrt{\phi_{k-1}}}-1,&\text{otherwise},\end{cases} (22)

where 0ξ¯<10\leq\overline{\xi}<1, satisfies the sufficient descent condition (2) with

κ:=min{3γ¯(1ξ¯)4,γ¯(1ϕ¯24)}>0.\displaystyle\kappa:=\min\left\{\frac{3\underline{\gamma}(1-\overline{\xi})}{4},\underline{\gamma}\left(1-\frac{\overline{\phi}^{2}}{4}\right)\right\}>0.

.

Proof.

The proof involves extending the discussion in [13, Proposition 3.1] to the case of Riemannian manifolds. For convenience, let us set

\Phi_{k-1} = \frac{1}{\gamma_{k-1}\tau_{k-1}} + \phi_{k-1}\frac{z_{k-1}^{\flat}z_{k-1}}{s_{k-1}^{\flat}z_{k-1}}.

From the definition (21) of the search direction, we have

\langle g_k,\eta_k\rangle_{x_k} = -\gamma_{k-1}\lVert g_k\rVert_{x_k}^2 + \gamma_{k-1}(1+\xi_{k-1})\phi_{k-1}\frac{(z_{k-1}^{\flat}g_k)(s_{k-1}^{\flat}g_k)}{s_{k-1}^{\flat}z_{k-1}} - \gamma_{k-1}\Phi_{k-1}\frac{(s_{k-1}^{\flat}g_k)^2}{s_{k-1}^{\flat}z_{k-1}} + \gamma_{k-1}\xi_{k-1}(1-\phi_{k-1})\frac{(z_{k-1}^{\flat}g_k)^2}{z_{k-1}^{\flat}z_{k-1}}.

From the relation $2\langle u,v\rangle \leq \lVert u\rVert^2 + \lVert v\rVert^2$ for any vectors $u$ and $v$ in an inner product space, we obtain

\langle g_k,\eta_k\rangle_{x_k} \leq -\gamma_{k-1}\lVert g_k\rVert_{x_k}^2 + \frac{\gamma_{k-1}\phi_{k-1}}{2}\left(\left\lVert\frac{\sqrt{2}\,s_{k-1}^{\flat}g_k}{s_{k-1}^{\flat}z_{k-1}}z_{k-1}\right\rVert_{x_k}^2 + \left\lVert\frac{1+\xi_{k-1}}{\sqrt{2}}g_k\right\rVert_{x_k}^2\right) - \gamma_{k-1}\Phi_{k-1}\frac{(s_{k-1}^{\flat}g_k)^2}{s_{k-1}^{\flat}z_{k-1}} + \gamma_{k-1}\xi_{k-1}(1-\phi_{k-1})\frac{(z_{k-1}^{\flat}g_k)^2}{z_{k-1}^{\flat}z_{k-1}} = -\gamma_{k-1}\left(1 - \frac{\phi_{k-1}(1+\xi_{k-1})^2}{4}\right)\lVert g_k\rVert_{x_k}^2 - \frac{1}{\tau_{k-1}}\frac{(s_{k-1}^{\flat}g_k)^2}{s_{k-1}^{\flat}z_{k-1}} + \gamma_{k-1}\xi_{k-1}(1-\phi_{k-1})\frac{(z_{k-1}^{\flat}g_k)^2}{z_{k-1}^{\flat}z_{k-1}}.

From (15) (i.e., $0 < \underline{\nu}\lVert s_{k-1}\rVert_{x_k}^2 \leq s_{k-1}^{\flat}z_{k-1}$), we have

\langle g_k,\eta_k\rangle_{x_k} \leq -\gamma_{k-1}\left(1 - \frac{\phi_{k-1}(1+\xi_{k-1})^2}{4}\right)\lVert g_k\rVert_{x_k}^2 + \gamma_{k-1}\xi_{k-1}(1-\phi_{k-1})\frac{(z_{k-1}^{\flat}g_k)^2}{z_{k-1}^{\flat}z_{k-1}}.

Here, we consider the case $0 \leq \phi_{k-1} \leq 1$. From $\xi_{k-1}(1-\phi_{k-1}) \geq 0$ and the Cauchy-Schwarz inequality, we obtain

\langle g_k,\eta_k\rangle_{x_k} \leq -\gamma_{k-1}(1-\xi_{k-1})\left(1 - \frac{\phi_{k-1}}{4}(1-\xi_{k-1})\right)\lVert g_k\rVert_{x_k}^2.

From $0 \leq \xi_{k-1} \leq \overline{\xi} < 1$ and $0 < \underline{\gamma} \leq \gamma_{k-1}$, we have

\langle g_k,\eta_k\rangle_{x_k} \leq -\underline{\gamma}(1-\overline{\xi})\left(1 - \frac{1}{4}\right)\lVert g_k\rVert_{x_k}^2 = -\frac{3\underline{\gamma}(1-\overline{\xi})}{4}\lVert g_k\rVert_{x_k}^2.

Next, let us consider the case $1 < \phi_{k-1} \leq \overline{\phi}^2$. From $\xi_{k-1}(1-\phi_{k-1}) \leq 0$, condition (22) (which implies $\phi_{k-1}(1+\xi_{k-1})^2 < \overline{\phi}^2$), and $0 < \underline{\gamma} \leq \gamma_{k-1}$, we obtain

\langle g_k,\eta_k\rangle_{x_k} \leq -\gamma_{k-1}\left(1 - \frac{\phi_{k-1}(1+\xi_{k-1})^2}{4}\right)\lVert g_k\rVert_{x_k}^2 \leq -\underline{\gamma}\left(1 - \frac{\overline{\phi}^2}{4}\right)\lVert g_k\rVert_{x_k}^2.

Therefore, the search direction (21) satisfies the sufficient descent condition (2), i.e., $\langle g_k,\eta_k\rangle_{x_k} \leq -\kappa\lVert g_k\rVert_{x_k}^2$, where

\kappa := \min\left\{\frac{3\underline{\gamma}(1-\overline{\xi})}{4}, \underline{\gamma}\left(1-\frac{\overline{\phi}^2}{4}\right)\right\} > 0. ∎

Now let us show the global convergence of Algorithm 1.

Theorem 2.

Suppose that Assumptions 1, 2, and 3 are satisfied. Assume further that $0 < \underline{\gamma} \leq \gamma_{k-1} \leq \overline{\gamma}$, $\underline{\tau} \leq \tau_{k-1}$, and $0 \leq \phi_{k-1} \leq \overline{\phi}^2$ hold, where $\underline{\tau} > 0$ and $1 < \overline{\phi} < 2$. Moreover, suppose that $\xi_{k-1} \in [0,1]$ satisfies (22). Let $(x_k)_{k=0,1,\ldots}$ be a sequence generated by Algorithm 1, and let the step size $\alpha_k$ satisfy the $\mathscr{T}^{(k)}$-Wolfe conditions (8) and (9). Then, Algorithm 1 converges in the sense that

\liminf_{k\to\infty}\lVert g_k\rVert_{x_k} = 0

holds.

Proof.

For convenience, let us set

\Phi_{k-1} = \frac{1}{\gamma_{k-1}\tau_{k-1}} + \phi_{k-1}\frac{z_{k-1}^{\flat}z_{k-1}}{s_{k-1}^{\flat}z_{k-1}}.

From (21) and the triangle inequality, we have

\lVert\eta_k\rVert_{x_k} = \left\lVert-\gamma_{k-1}g_k + \gamma_{k-1}\left(\phi_{k-1}\frac{z_{k-1}^{\flat}g_k}{s_{k-1}^{\flat}z_{k-1}} - \Phi_{k-1}\frac{s_{k-1}^{\flat}g_k}{s_{k-1}^{\flat}z_{k-1}}\right)s_{k-1} + \gamma_{k-1}\xi_{k-1}\left(\phi_{k-1}\frac{s_{k-1}^{\flat}g_k}{s_{k-1}^{\flat}z_{k-1}} + (1-\phi_{k-1})\frac{z_{k-1}^{\flat}g_k}{z_{k-1}^{\flat}z_{k-1}}\right)z_{k-1}\right\rVert_{x_k} \leq \gamma_{k-1}\lVert g_k\rVert_{x_k} + \gamma_{k-1}\left\lvert\phi_{k-1}\frac{z_{k-1}^{\flat}g_k}{s_{k-1}^{\flat}z_{k-1}}\right\rvert\lVert s_{k-1}\rVert_{x_k} + \gamma_{k-1}\left\lvert\Phi_{k-1}\frac{s_{k-1}^{\flat}g_k}{s_{k-1}^{\flat}z_{k-1}}\right\rvert\lVert s_{k-1}\rVert_{x_k} + \gamma_{k-1}\xi_{k-1}\left\lvert\phi_{k-1}\frac{s_{k-1}^{\flat}g_k}{s_{k-1}^{\flat}z_{k-1}}\right\rvert\lVert z_{k-1}\rVert_{x_k} + \gamma_{k-1}\xi_{k-1}\left\lvert(1-\phi_{k-1})\frac{z_{k-1}^{\flat}g_k}{z_{k-1}^{\flat}z_{k-1}}\right\rvert\lVert z_{k-1}\rVert_{x_k}.

Here, from the Cauchy-Schwarz inequality, we obtain

\lvert z_{k-1}^{\flat}g_k\rvert = \lvert\langle z_{k-1},g_k\rangle_{x_k}\rvert \leq \lVert z_{k-1}\rVert_{x_k}\lVert g_k\rVert_{x_k}, \quad \lvert s_{k-1}^{\flat}g_k\rvert = \lvert\langle s_{k-1},g_k\rangle_{x_k}\rvert \leq \lVert s_{k-1}\rVert_{x_k}\lVert g_k\rVert_{x_k},

which, together with (15) (i.e., $\underline{\nu}\lVert s_{k-1}\rVert_{x_k}^2 \leq s_{k-1}^{\flat}z_{k-1}$) and $0 \leq \phi_{k-1} < 4$, gives

\lVert\eta_k\rVert_{x_k} \leq \gamma_{k-1}\lVert g_k\rVert_{x_k} + 4\gamma_{k-1}\frac{\lVert z_{k-1}\rVert_{x_k}\lVert g_k\rVert_{x_k}}{\underline{\nu}\lVert s_{k-1}\rVert_{x_k}^2}\lVert s_{k-1}\rVert_{x_k} + \gamma_{k-1}\left(\frac{1}{\gamma_{k-1}\tau_{k-1}} + \frac{4\lVert z_{k-1}\rVert_{x_k}^2}{\underline{\nu}\lVert s_{k-1}\rVert_{x_k}^2}\right)\frac{\lVert s_{k-1}\rVert_{x_k}\lVert g_k\rVert_{x_k}}{\underline{\nu}\lVert s_{k-1}\rVert_{x_k}^2}\lVert s_{k-1}\rVert_{x_k} + \gamma_{k-1}\xi_{k-1}\frac{4\lVert s_{k-1}\rVert_{x_k}\lVert g_k\rVert_{x_k}}{\underline{\nu}\lVert s_{k-1}\rVert_{x_k}^2}\lVert z_{k-1}\rVert_{x_k} + \gamma_{k-1}\xi_{k-1}\frac{3\lVert z_{k-1}\rVert_{x_k}\lVert g_k\rVert_{x_k}}{\lVert z_{k-1}\rVert_{x_k}^2}\lVert z_{k-1}\rVert_{x_k}.

From (16) (i.e., $\lVert z_{k-1}\rVert_{x_k} \leq \overline{\nu}\lVert s_{k-1}\rVert_{x_k}$), $0 \leq \xi_{k-1} \leq 1$, and $\gamma_{k-1} \leq \overline{\gamma}$, we have

\lVert\eta_k\rVert_{x_k} \leq \overline{\gamma}\lVert g_k\rVert_{x_k} + \frac{4\overline{\gamma}\overline{\nu}}{\underline{\nu}}\lVert g_k\rVert_{x_k} + \frac{1}{\underline{\tau}\underline{\nu}}\lVert g_k\rVert_{x_k} + \frac{4\overline{\gamma}\overline{\nu}^2}{\underline{\nu}^2}\lVert g_k\rVert_{x_k} + \frac{4\overline{\gamma}\overline{\nu}}{\underline{\nu}}\lVert g_k\rVert_{x_k} + 3\overline{\gamma}\lVert g_k\rVert_{x_k},

which, together with $\lVert g_k\rVert_{x_k} \leq \Gamma$, gives

\lVert\eta_k\rVert_{x_k} \leq \underbrace{\left(4\overline{\gamma} + \frac{8\overline{\gamma}\overline{\nu}}{\underline{\nu}} + \frac{1}{\underline{\tau}\underline{\nu}} + \frac{4\overline{\gamma}\overline{\nu}^2}{\underline{\nu}^2}\right)}_{\Theta}\Gamma = \Theta\Gamma.

To prove convergence by contradiction, suppose that there exists a positive constant $\varepsilon > 0$ such that

\lVert g_k\rVert_{x_k} \geq \varepsilon

for all $k$. From Proposition 1,

\langle g_k,\eta_k\rangle_{x_k} \leq -\kappa\lVert g_k\rVert_{x_k}^2 \leq -\kappa\varepsilon^2,

where

\kappa := \min\left\{\frac{3\underline{\gamma}(1-\overline{\xi})}{4}, \underline{\gamma}\left(1-\frac{\overline{\phi}^2}{4}\right)\right\} > 0.

It follows from the above inequalities that

\infty = \sum_{k=0}^{\infty}\frac{\kappa^2\varepsilon^4}{\Theta^2\Gamma^2} \leq \sum_{k=0}^{\infty}\frac{\langle g_k,\eta_k\rangle_{x_k}^2}{\lVert\eta_k\rVert_{x_k}^2}.

This contradicts Zoutendijk's theorem (Theorem 1) and thus completes the proof. ∎

5 Numerical experiments

We compared the proposed method with existing methods, including Riemannian conjugate gradient methods and the memoryless spectral-scaling Broyden family. In the experiments, we implemented the proposed method as an optimizer of pymanopt (see [29]) and solved two Riemannian optimization problems (Problems 1 and 2).

Problem 1 is the Rayleigh-quotient minimization problem on the unit sphere [2, Chapter 4.6].

Problem 1.

Let $A \in \mathbb{R}^{n\times n}$ be a symmetric matrix.

minimize f(x) := x^{\top}Ax,
subject to x \in \mathbb{S}^{n-1} := \{x \in \mathbb{R}^n : \lVert x\rVert = 1\},

where $\lVert\cdot\rVert$ denotes the Euclidean norm.

In the experiments, we set $n = 100$ and generated a matrix $B \in \mathbb{R}^{n\times n}$ with randomly chosen elements by using numpy.random.randn. Then, we set the symmetric matrix $A = (B + B^{\top})/2$.
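A minimal NumPy sketch of this setup, together with the cost and the Riemannian gradient on the unit sphere (the Euclidean gradient $2Ax$ projected onto $T_x\mathbb{S}^{n-1}$), is given below; the random seed and function names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
B = rng.standard_normal((n, n))
A = (B + B.T) / 2          # symmetric matrix as in the experiments

def cost(x):
    # f(x) = x^T A x on the unit sphere
    return x @ A @ x

def riemannian_gradient(x):
    # Euclidean gradient 2Ax projected onto the tangent space at x:
    # grad f(x) = 2(Ax - (x^T A x) x)
    egrad = 2.0 * A @ x
    return egrad - (x @ egrad) * x
```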

Absil and Gallivan [1, Section 3] introduced an off-diagonal cost function. Problem 2 is an off-diagonal cost function minimization problem on an oblique manifold.

Problem 2.

Let $C_i \in \mathbb{R}^{n\times n}$ $(i = 1,\ldots,N)$ be symmetric matrices.

minimize f(X) := \sum_{i=1}^{N}\lVert X^{\top}C_iX - \mathrm{ddiag}(X^{\top}C_iX)\rVert_F^2,
subject to X \in \mathcal{OB}(n,p) := \{X \in \mathbb{R}^{n\times p} : \mathrm{ddiag}(X^{\top}X) = I\},

where $\lVert\cdot\rVert_F$ denotes the Frobenius norm and $\mathrm{ddiag}(M)$ denotes the diagonal matrix obtained from $M$ by setting all of its off-diagonal elements to zero.

In the experiments, we set $N = 5$, $n = 10$, and $p = 5$ and generated five matrices $B_i \in \mathbb{R}^{n\times n}$ $(i = 1,\ldots,5)$ with randomly chosen elements by using numpy.random.randn. Then, we set the symmetric matrices $C_i = (B_i + B_i^{\top})/2$ $(i = 1,\ldots,5)$.
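The off-diagonal cost and the $\mathrm{ddiag}$ operator can be written directly in NumPy; the following sketch (illustrative names and seed) also generates the matrices $C_i$ as in the experiments:

```python
import numpy as np

rng = np.random.default_rng(0)
N, n, p = 5, 10, 5
Cs = []
for _ in range(N):
    Bi = rng.standard_normal((n, n))
    Cs.append((Bi + Bi.T) / 2)   # symmetric matrices C_i

def ddiag(M):
    # zero out the off-diagonal elements of M
    return np.diag(np.diag(M))

def offdiag_cost(X):
    # f(X) = sum_i || X^T C_i X - ddiag(X^T C_i X) ||_F^2
    total = 0.0
    for C in Cs:
        W = X.T @ C @ X
        total += np.linalg.norm(W - ddiag(W), "fro") ** 2
    return total
```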

The experiments used a MacBook Air (M1, 2020) with version 12.2 of the macOS Monterey operating system. The algorithms were written in Python 3.11.3 with the NumPy 1.25.0 package and the Matplotlib 3.7.1 package. Python implementations of the methods used in the numerical experiments are available at https://github.com/iiduka-researches/202307-memoryless.

We considered that a sequence had converged to an optimal solution when the stopping condition

\lVert\mathrm{grad}f(x_k)\rVert_{x_k} < 10^{-6}

was satisfied. We set $\mathscr{T}^{(k-1)} = \mathcal{T}^{R}_{\alpha_{k-1}\eta_{k-1}}(\cdot)$,

\gamma_{k-1} = \max\left\{1, \frac{s_{k-1}^{\flat}z_{k-1}}{z_{k-1}^{\flat}z_{k-1}}\right\}, \quad \tau_{k-1} = \min\left\{1, \frac{z_{k-1}^{\flat}z_{k-1}}{s_{k-1}^{\flat}z_{k-1}}\right\}.

We compared the proposed methods with the existing Riemannian optimization algorithms, including Riemannian conjugate gradient methods. Moreover, we compared twelve versions of the proposed method with different parameters $\phi_{k-1}$, $z_{k-1}$, and $\xi_{k-1}$. For $\phi_{k-1}$, we compared the BFGS formula $\phi_{k-1} = 1$ and the preconvex class $\phi_{k-1} \in [0,\infty)$. For the preconvex class (see [16, (43)]), we used

\phi_{k-1} = \frac{0.1\theta^{\ast}_{k-1} - 1}{0.1\theta^{\ast}_{k-1}(1-\mu_{k-1}) - 1},

where

\theta_{k-1}^{\ast} = \max\left\{\frac{1}{1-\mu_{k-1}}, 10^{-5}\right\}, \quad \mu_{k-1} = \frac{(s_{k-1}^{\flat}s_{k-1})(z_{k-1}^{\flat}z_{k-1})}{(s_{k-1}^{\flat}z_{k-1})^2}.
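In code, these experimental parameter choices read as follows (a sketch under the same embedded-submanifold assumption as before; function names are illustrative, and the second function simply mirrors the quoted formula from [16, (43)]):

```python
import numpy as np

def sizing_and_scaling(s, z):
    """gamma_{k-1} and tau_{k-1} as chosen in the experiments."""
    sz, zz = s @ z, z @ z
    return max(1.0, sz / zz), min(1.0, zz / sz)

def preconvex_phi(s, z):
    """phi_{k-1} for the preconvex class, following [16, (43)] as quoted above."""
    sz = s @ z
    mu = (s @ s) * (z @ z) / sz ** 2
    theta = max(1.0 / (1.0 - mu), 1e-5)
    return (0.1 * theta - 1.0) / (0.1 * theta * (1.0 - mu) - 1.0)
```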

Moreover, we compared Li-Fukushima regularization (17) and (18) with $\hat{\nu} = 10^{-6}$ and Powell's damping technique (19) and (20) with $\hat{\nu} = 0.1$. In addition, we used a constant parameter $\xi_{k-1} = \xi \in [0,1]$ and compared our methods with $\xi = 1$ (i.e., the existing methods when $\mathscr{T}^{(k)} = \mathcal{T}_{\alpha_{k-1}\eta_{k-1}}(\cdot)$), $\xi = 0.8$, and $\xi = 0.1$. For comparison, we also tested two Riemannian conjugate gradient methods, i.e., DY (6) and HZ (7).

As the measure for these comparisons, we calculated the performance profile $P_s : \mathbb{R} \to [0,1]$ [5], defined as follows: let $\mathcal{P}$ and $\mathcal{S}$ be the sets of problems and solvers, respectively. For each $p \in \mathcal{P}$ and $s \in \mathcal{S}$,

t_{p,s} := (\text{iterations or time required to solve problem } p \text{ by solver } s).

We defined the performance ratio $r_{p,s}$ as

r_{p,s} := \frac{t_{p,s}}{\min_{s^{\prime}\in\mathcal{S}}t_{p,s^{\prime}}}.

Next, we defined the performance profile $P_s$ for all $\tau \in \mathbb{R}$ as

P_s(\tau) := \frac{\lvert\{p \in \mathcal{P} : r_{p,s} \leq \tau\}\rvert}{\lvert\mathcal{P}\rvert},

where $\lvert A\rvert$ denotes the number of elements in a set $A$. In the experiments, we set $\lvert\mathcal{P}\rvert = 100$ for each of Problems 1 and 2.
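The performance profiles can be computed with a few lines of NumPy; the sketch below takes a matrix of measurements $t_{p,s}$ (illustrative function name, with np.inf marking failed runs) and evaluates $P_s(\tau)$ on a grid of $\tau$ values:

```python
import numpy as np

def performance_profiles(T, taus):
    """T[p, s]: iterations or time needed by solver s on problem p.
    Returns P[s, j] = P_s(taus[j]) = fraction of problems with r_{p,s} <= taus[j]."""
    ratios = T / T.min(axis=1, keepdims=True)   # performance ratios r_{p,s}
    return np.array([[np.mean(ratios[:, s] <= t) for t in taus]
                     for s in range(T.shape[1])])
```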

Figures 1–4 plot the results of our experiments. In particular, Figure 1 shows the numerical results for Problem 1 with Li-Fukushima regularization (17) and (18). It shows that Algorithm 1 with $\xi = 0.1$ performs much better than Algorithm 1 with $\xi = 1$ (i.e., the existing method), regardless of whether the BFGS formula or the preconvex class is used. In addition, we can see that Algorithm 1 with $\xi = 0.8$ and $\xi = 1$ have about the same performance.

Figure 1: Performance profiles of each algorithm versus the number of iterations (a) and the elapsed time (b) for Problem 1. $z_k$ is defined by Li-Fukushima regularization (17) and (18).

Figure 2 shows the numerical results for solving Problem 1 with Powell's damping technique (19) and (20). It shows that Algorithm 1 with $\xi = 0.1$ is superior to Algorithm 1 with $\xi = 1$ (i.e., the existing method), regardless of whether the BFGS formula or the preconvex class is used. Moreover, it can be seen that Algorithm 1 with $\xi = 0.8$ and $\xi = 1$ have about the same performance.

Figure 2: Performance profiles of each algorithm versus the number of iterations (a) and the elapsed time (b) for Problem 1. $z_k$ is defined by Powell's damping technique (19) and (20).

Figure 3 shows the numerical results for Problem 2 with Li-Fukushima regularization (17) and (18). It shows that if we use the BFGS formula (i.e., $\phi_k = 1$), then Algorithm 1 with $\xi = 0.8$ and the HZ method outperform the others. However, Algorithm 1 with the preconvex class is not well suited to the off-diagonal cost function minimization problem on an oblique manifold.

Figure 3: Performance profiles of each algorithm versus the number of iterations (a) and the elapsed time (b) for Problem 2. $z_k$ is defined by Li-Fukushima regularization (17) and (18).

Figure 4 shows the numerical results for Problem 2 with Powell's damping technique (19) and (20). It shows that if we use the BFGS formula (i.e., $\phi_k = 1$), then Algorithm 1 with $\xi = 0.8$ or $\xi = 1$ is superior to the others. However, Algorithm 1 with the preconvex class is not well suited to the off-diagonal cost function minimization problem on an oblique manifold. Therefore, we can see that Algorithm 1 with the BFGS formula (i.e., $\phi_k = 1$) is suitable for solving an off-diagonal cost function minimization problem on an oblique manifold.

Figure 4: Performance profiles of each algorithm versus the number of iterations (a) and the elapsed time (b) for Problem 2. $z_k$ is defined by Powell's damping technique (19) and (20).

6 Conclusion

This paper presented a modified memoryless quasi-Newton method based on the spectral-scaling Broyden family on Riemannian manifolds, i.e., Algorithm 1. Algorithm 1 is a generalization of the memoryless self-scaling Broyden family on Riemannian manifolds; specifically, it adds one parameter to the search direction. It also uses a general map instead of vector transport, similarly to the general framework of Riemannian conjugate gradient methods, so it can utilize maps such as vector transport, scaled vector transport, or an inverse retraction. We proved that the search direction satisfies the sufficient descent condition and that the method converges globally under the Wolfe conditions. Finally, the numerical experiments indicated that the proposed method with the BFGS formula is suitable for solving an off-diagonal cost function minimization problem on an oblique manifold.

Acknowledgements

This work was supported by a JSPS KAKENHI Grant Number JP23KJ2003.

Data availability statements

Data sharing is not applicable to this article as no datasets were generated or analysed during the current study.

References

  • [1] P.-A. Absil and K. A. Gallivan. Joint diagonalization on the oblique manifold for independent component analysis. In 2006 IEEE international conference on acoustics speech and signal processing proceedings, volume 5, pages V–V. IEEE, 2006.
  • [2] P.-A. Absil, R. Mahony, and R. Sepulchre. Optimization Algorithms on Matrix Manifolds. Princeton University Press, 2008.
  • [3] Z. Chen and W. Cheng. Spectral-scaling quasi-Newton methods with updates from the one parameter of the Broyden family. Journal of Computational and Applied Mathematics, 248:88–98, 2013.
  • [4] W. Cheng and D. Li. Spectral scaling BFGS method. Journal of Optimization Theory and Applications, 146(2):305–319, 2010.
  • [5] E. D. Dolan and J. J. Moré. Benchmarking optimization software with performance profiles. Mathematical Programming, 91(2):201–213, 2002.
  • [6] W. Huang, P.-A. Absil, and K. A. Gallivan. A Riemannian BFGS method without differentiated retraction for nonconvex optimization problems. SIAM Journal on Optimization, 28(1):470–495, 2018.
  • [7] W. Huang, K. A. Gallivan, and P.-A. Absil. A Broyden class of quasi-Newton methods for Riemannian optimization. SIAM Journal on Optimization, 25(3):1660–1685, 2015.
  • [8] W. Huang, K. A. Gallivan, A. Srivastava, and P.-A. Absil. Riemannian optimization for registration of curves in elastic shape analysis. Journal of Mathematical Imaging and Vision, 54:320–343, 2016.
  • [9] C.-X. Kou and Y.-H. Dai. A modified self-scaling memoryless Broyden–Fletcher–Goldfarb–Shanno method for unconstrained optimization. Journal of Optimization Theory and Applications, 165:209–224, 2015.
  • [10] D. Kressner, M. Steinlechner, and B. Vandereycken. Low-rank tensor completion by Riemannian optimization. BIT Numerical Mathematics, 54:447–468, 2014.
  • [11] D.-H. Li and M. Fukushima. A modified BFGS method and its global convergence in nonconvex minimization. Journal of Computational and Applied Mathematics, 129(1-2):15–35, 2001.
  • [12] A. U. Moyi and W. J. Leong. A sufficient descent three-term conjugate gradient method via symmetric rank-one update for large-scale optimization. Optimization, 65(1):121–143, 2016.
  • [13] S. Nakayama. A hybrid method of three-term conjugate gradient method and memoryless quasi-Newton method for unconstrained optimization. SUT Journal of Mathematics, 54(1):79–98, 2018.
  • [14] S. Nakayama, Y. Narushima, and H. Yabe. A memoryless symmetric rank-one method with sufficient descent property for unconstrained optimization. Journal of the Operations Research Society of Japan, 61(1):53–70, 2018.
  • [15] S. Nakayama, Y. Narushima, and H. Yabe. Memoryless quasi-Newton methods based on spectral-scaling Broyden family for unconstrained optimization. Journal of Industrial and Management Optimization, 15(4):1773–1793, 2019.
  • [16] Y. Narushima, S. Nakayama, M. Takemura, and H. Yabe. Memoryless quasi-Newton methods based on the spectral-scaling Broyden family for Riemannian optimization. Journal of Optimization Theory and Applications, pages 1–26, 2023.
  • [17] M. Nickel and D. Kiela. Poincaré embeddings for learning hierarchical representations. Advances in Neural Information Processing Systems, 30, 2017.
  • [18] J. Nocedal and S. J. Wright. Numerical Optimization. Springer, 1999.
  • [19] W. Ring and B. Wirth. Optimization methods on Riemannian manifolds and their application to shape space. SIAM Journal on Optimization, 22(2):596–627, 2012.
  • [20] H. Sakai and H. Iiduka. Hybrid Riemannian conjugate gradient methods with global convergence properties. Computational Optimization and Applications, 77:811–830, 2020.
  • [21] H. Sakai and H. Iiduka. Sufficient descent Riemannian conjugate gradient methods. Journal of Optimization Theory and Applications, 190(1):130–150, 2021.
  • [22] H. Sakai, H. Sato, and H. Iiduka. Global convergence of Hager–Zhang type Riemannian conjugate gradient method. Applied Mathematics and Computation, 441:127685, 2023.
  • [23] H. Sato. A Dai-Yuan-type Riemannian conjugate gradient method with the weak Wolfe conditions. Computational Optimization and Applications, 64(1):101–118, 2016.
  • [24] H. Sato. Riemannian Optimization and Its Applications. Springer, 2021.
  • [25] H. Sato. Riemannian conjugate gradient methods: General framework and specific algorithms with convergence analyses. SIAM Journal on Optimization, 32(4):2690–2717, 2022.
  • [26] H. Sato and T. Iwai. A new, globally convergent Riemannian conjugate gradient method. Optimization, 64(4):1011–1031, 2015.
  • [27] D. F. Shanno. Conjugate gradient methods with inexact searches. Mathematics of Operations Research, 3(3):244–256, 1978.
  • [28] W. Sun and Y.-X. Yuan. Optimization Theory and Methods: Nonlinear Programming, volume 1. Springer, 2006.
  • [29] J. Townsend, N. Koep, and S. Weichwald. Pymanopt: A Python toolbox for optimization on manifolds using automatic differentiation. Journal of Machine Learning Research, 17(1):4755–4759, 2016.
  • [30] B. Vandereycken. Low-rank matrix completion by Riemannian optimization. SIAM Journal on Optimization, 23(2):1214–1236, 2013.
  • [31] P. Wolfe. Convergence conditions for ascent methods. SIAM Review, 11(2):226–235, 1969.
  • [32] P. Wolfe. Convergence conditions for ascent methods. II: Some corrections. SIAM Review, 13(2):185–188, 1971.
  • [33] Y. Zhang and R. Tewarson. Quasi-Newton algorithms with updates from the preconvex part of Broyden’s family. IMA Journal of Numerical Analysis, 8(4):487–509, 1988.
  • [34] X. Zhu and H. Sato. Riemannian conjugate gradient methods with inverse retraction. Computational Optimization and Applications, 77:779–810, 2020.