
Riemannian conditional gradient methods for composite optimization problems

Kangming Chen, Ellen H. Fukuda
Abstract

In this paper, we propose Riemannian conditional gradient methods for minimizing composite functions, i.e., those that can be expressed as the sum of a smooth function and a geodesically convex one. We analyze the convergence of the proposed algorithms under three types of step-size strategies: adaptive, diminishing, and those based on the Armijo condition. We establish a convergence rate of \mathcal{O}(1/k) for the adaptive and diminishing step sizes, where k denotes the number of iterations. Additionally, we derive an iteration complexity of \mathcal{O}(1/\epsilon^{2}) for the Armijo step-size strategy to achieve \epsilon-optimality, where \epsilon is the optimality tolerance. Finally, the effectiveness of our algorithms is validated through numerical experiments performed on the sphere and Stiefel manifolds.

keywords:
Conditional gradient methods, Riemannian manifolds, composite optimization problems, Frank-Wolfe methods
journal: Not decided
Affiliation: Graduate School of Informatics, Kyoto University, Japan

1 Introduction

The conditional gradient (CG) method, also known as the Frank-Wolfe algorithm, is a first-order iterative method widely used for solving constrained nonlinear optimization problems. Developed by Frank and Wolfe in 1956 [1], the CG method addresses the problem of minimizing a convex function over a convex constraint set, particularly when the constraint set is difficult, in the sense that the projection onto it is computationally demanding.

The convergence rate of \mathcal{O}(1/k), where k denotes the number of iterations, for the conditional gradient method has been presented in several studies [2, 3]. Recent advancements in the field have focused on accelerating the method [4, 5]. The versatility of this method is also clear, since many applications can be found in the literature (see, for instance, [6, 7]). Furthermore, the CG method has recently been extended to multiobjective optimization [8, 9, 10].

Meanwhile, composite optimization problems, where the objective function can be decomposed into the sum of two functions, have become increasingly important in modern optimization theory and applications. Specifically, the objectives of these problems take the form F(x):=f(x)+g(x), where f is a smooth function and g is a possibly non-smooth and convex function. This structure arises naturally in many areas, including machine learning and signal processing, where f often represents a data fidelity term and g serves as a regularization term to promote desired properties such as sparsity or low rank [11, 12, 13].

To deal with such problems efficiently, various methods have been developed. Classical approaches include proximal gradient methods [14, 15, 16], which address the non-smooth term g using proximal operators, and accelerated variants such as FISTA [17]. Other methods, such as the alternating direction method of multipliers (ADMM) [18], decompose the problem further to enable parallel and distributed computation. For problems with specific structures or manifold constraints, methods like the Riemannian proximal gradient have been proposed in [19, 20], extending proximal techniques to non-Euclidean settings. Recently, the CG method on Riemannian manifolds has been proposed in [21], and the CG method for composite functions, called the generalized conditional gradient (GCG) method, has also been discussed in many works [9, 22, 23, 24].

In this paper, we propose GCG methods on Riemannian manifolds, with general retractions and vector transports. We also focus on three types of step-size strategies: adaptive, diminishing, and those based on the Armijo condition. We provide a detailed discussion on the implementation of subproblem solutions for each strategy. For each step-size approach, we analyze the convergence of the method. Specifically, for the adaptive and diminishing step-size strategies, we establish a convergence rate of \mathcal{O}(1/k), where k represents the number of iterations. For the Armijo step-size strategy, we derive an iteration complexity of \mathcal{O}(1/\epsilon^{2}), where \epsilon represents the desired accuracy in the optimality condition.

This paper is organized as follows. In Section 2, we review fundamental concepts and results related to Riemannian optimization. In Section 3, we introduce the generalized conditional gradient method on Riemannian manifolds with three different types of step size. Section 4 presents the convergence analysis, while Section 5 focuses on the discussion of the subproblem. Section 6 explores the accelerated version of the proposed method. Numerical experiments are presented in Section 7 to demonstrate the effectiveness of our approach. Finally, in Section 8, we conclude the paper and suggest potential directions for future research.

2 Preliminaries

This section summarizes essential definitions, results, and concepts fundamental to Riemannian optimization [25]. The transpose of a matrix A is denoted by A^{T}, and the identity matrix of dimension \ell is written as I_{\ell}. A Riemannian manifold \mathcal{M} is a smooth manifold equipped with a Riemannian metric, which is a smoothly varying inner product \langle\eta_{x},\sigma_{x}\rangle_{x}\in\mathbb{R} defined on the tangent space at each point x\in\mathcal{M}, where \eta_{x} and \sigma_{x} are tangent vectors in the tangent space T_{x}\mathcal{M}. The tangent space at a point x on the manifold \mathcal{M} is denoted by T_{x}\mathcal{M}, while the tangent bundle of \mathcal{M} is denoted by T\mathcal{M}:=\{(x,d)\mid d\in T_{x}\mathcal{M},x\in\mathcal{M}\}. The norm of a tangent vector \eta\in T_{x}\mathcal{M} is defined by \|\eta\|_{x}:=\sqrt{\langle\eta,\eta\rangle_{x}}. When the norm and the inner product are written as \|\cdot\| and \langle\cdot,\cdot\rangle, without the subscript, they are the Euclidean ones. For a map F\colon\mathcal{M}\rightarrow\mathcal{N} between two manifolds \mathcal{M} and \mathcal{N}, the derivative of F at a point x\in\mathcal{M}, denoted by \mathrm{D}F(x), is a linear map \mathrm{D}F(x)\colon T_{x}\mathcal{M}\rightarrow T_{F(x)}\mathcal{N} that maps tangent vectors from the tangent space of \mathcal{M} at x to the tangent space of \mathcal{N} at F(x). The Riemannian gradient \operatorname{grad}f(x) of a smooth function f\colon\mathcal{M}\rightarrow\mathbb{R} at a point x\in\mathcal{M} is defined as the unique tangent vector at x that satisfies \langle\operatorname{grad}f(x),\eta\rangle_{x}=\mathrm{D}f(x)[\eta] for every tangent vector \eta\in T_{x}\mathcal{M}. We also define the Whitney sum of the tangent bundle as T\mathcal{M}\oplus T\mathcal{M}:=\left\{(\xi,d)\mid\xi,d\in T_{x}\mathcal{M},x\in\mathcal{M}\right\}.

In Riemannian optimization, a retraction is employed to map points from the tangent space of a manifold back onto the manifold.

Definition 2.1.

[25] A smooth map R\colon T\mathcal{M}\rightarrow\mathcal{M} is called a retraction on a smooth manifold \mathcal{M} if the restriction of R to the tangent space T_{x}\mathcal{M} at any point x\in\mathcal{M}, denoted by R_{x}, satisfies the following conditions:

  1. R_{x}\left(0_{x}\right)=x,

  2. \mathrm{D}R_{x}\left(0_{x}\right)=\operatorname{id}_{T_{x}\mathcal{M}} for all x\in\mathcal{M},

where 0_{x} and \operatorname{id}_{T_{x}\mathcal{M}} are the zero vector of T_{x}\mathcal{M} and the identity map in T_{x}\mathcal{M}, respectively.

Depending on the problem and the specific properties of the manifold, different types of retractions can be utilized. Many Riemannian optimization algorithms employ a retraction that generalizes the exponential map on \mathcal{M}. For instance, assume that we have an iterative method generating iterates \{x^{k}\}. In Euclidean spaces, the update takes the form x^{k+1}=x^{k}+t_{k}d^{k}, while in the Riemannian case it is generalized as

x^{k+1}=R_{x^{k}}(t_{k}d^{k}),\quad\text{for }k=0,1,2,\ldots,

where d^{k} is a descent direction and t_{k} is a step size.
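As a concrete illustration of this update, the following Python sketch (an illustration only; it assumes the unit sphere \mathbb{S}^{n-1} and the projection-like retraction R_{x}(\eta)=(x+\eta)/\|x+\eta\|, one standard alternative to the exponential map) performs a single retraction-based step along a Riemannian steepest-descent direction.

import numpy as np

def sphere_retraction(x, eta):
    # Projection-like retraction on the unit sphere: R_x(eta) = (x + eta) / ||x + eta||.
    v = x + eta
    return v / np.linalg.norm(v)

def tangent_projection(x, v):
    # Orthogonal projection of an ambient vector v onto T_x S^{n-1} = {d : x^T d = 0}.
    return v - np.dot(x, v) * x

rng = np.random.default_rng(0)
x = rng.standard_normal(5)
x /= np.linalg.norm(x)
euclidean_grad = rng.standard_normal(5)        # stand-in for the Euclidean gradient of a smooth f
d = -tangent_projection(x, euclidean_grad)     # Riemannian steepest-descent direction at x
t = 0.1                                        # step size t_k
x_next = sphere_retraction(x, t * d)           # x^{k+1} = R_{x^k}(t_k d^k)
print(np.linalg.norm(x_next))                  # prints 1.0: the iterate stays on the manifold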

Another essential concept is the vector transport, which is critical in Riemannian optimization to maintain the intrinsic geometry of the manifold during computations.

Definition 2.2.

[25] A map \mathcal{T}\colon T\mathcal{M}\oplus T\mathcal{M}\rightarrow T\mathcal{M} is called a vector transport on \mathcal{M} if it satisfies the following conditions:

  1. There exists a retraction R on \mathcal{M} such that \mathcal{T}_{d}(\xi)\in T_{R_{x}(d)}\mathcal{M} for all x\in\mathcal{M} and \xi,d\in T_{x}\mathcal{M}.

  2. For any x\in\mathcal{M} and \xi\in T_{x}\mathcal{M}, \mathcal{T}_{0_{x}}(\xi)=\xi holds, where 0_{x} is the zero vector in T_{x}\mathcal{M}, i.e., \mathcal{T}_{0_{x}} is the identity map in T_{x}\mathcal{M}.

  3. For any a,b\in\mathbb{R}, x\in\mathcal{M}, and \xi,d,\zeta\in T_{x}\mathcal{M}, \mathcal{T}_{d}(a\xi+b\zeta)=a\mathcal{T}_{d}(\xi)+b\mathcal{T}_{d}(\zeta) holds, i.e., \mathcal{T}_{d} is a linear map from T_{x}\mathcal{M} to T_{R_{x}(d)}\mathcal{M}.

The adjoint operator of a vector transport \mathcal{T}, denoted by \mathcal{T}^{\sharp}, is a vector transport satisfying:

\langle\xi_{y},\mathcal{T}_{\eta_{x}}\zeta_{x}\rangle_{y}=\langle\mathcal{T}_{\eta_{x}}^{\sharp}\xi_{y},\zeta_{x}\rangle_{x}\quad\text{for all }\eta_{x},\zeta_{x}\in T_{x}\mathcal{M}\text{ and }\xi_{y}\in T_{y}\mathcal{M},

where y=R_{x}(\eta_{x}). The inverse operator of a vector transport, denoted by \mathcal{T}_{\eta_{x}}^{-1}, is a vector transport satisfying:

\mathcal{T}_{\eta_{x}}^{-1}\mathcal{T}_{\eta_{x}}=\mathrm{id}\quad\text{for all }\eta_{x}\in T_{x}\mathcal{M},

where \mathrm{id} is the identity operator.

Note that a map \mathcal{T} defined by \mathcal{T}_{d}(\xi):=\mathrm{P}_{\gamma_{x,d}}^{1\leftarrow 0}(\xi) is also a vector transport, where \mathrm{P}_{\gamma_{x,d}}^{1\leftarrow 0} is the parallel transport along the geodesic \gamma_{x,d}(t):=\operatorname{Exp}_{x}(td) connecting \gamma_{x,d}(0)=x and \gamma_{x,d}(1)=\operatorname{Exp}_{x}(d), with the exponential map \operatorname{Exp} as the retraction. Parallel transport is a specific case of vector transport that preserves the length and angle of vectors as they are transported along geodesics, thereby preserving the intrinsic Riemannian metric of the manifold exactly. In contrast, vector transport is a more general operation that moves vectors between tangent spaces in a way that only approximately preserves geometric properties, and it need not follow geodesics or maintain exact parallelism. If there is a unique geodesic between any two points in \mathcal{X}\subset\mathcal{M}, then the exponential map has an inverse \operatorname{Exp}_{x}^{-1}\colon\mathcal{X}\rightarrow T_{x}\mathcal{M}. This geodesic represents the unique shortest path, and the geodesic distance between x and y in \mathcal{X} is given by \left\|\operatorname{Exp}_{x}^{-1}(y)\right\|_{x}=\left\|\operatorname{Exp}_{y}^{-1}(x)\right\|_{y}.
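To make the distinction concrete, the following Python sketch (an illustration restricted to the unit sphere, using the exponential map that appears again in Section 7.1; the closed-form transport below is an assumption of this example, not a formula from the paper) implements parallel transport along a geodesic and checks numerically that the result lies in the correct tangent space and that the norm is preserved.

import numpy as np

def sphere_exp(x, eta):
    # Exponential map on the unit sphere S^{n-1}.
    n = np.linalg.norm(eta)
    if n < 1e-16:
        return x.copy()
    return np.cos(n) * x + np.sin(n) * eta / n

def sphere_parallel_transport(x, d, xi):
    # Parallel transport of xi in T_x S^{n-1} along t -> Exp_x(t d), evaluated at t = 1.
    nd = np.linalg.norm(d)
    if nd < 1e-16:
        return xi.copy()
    u = d / nd
    a = np.dot(u, xi)                  # component of xi along the geodesic direction
    return xi + a * ((np.cos(nd) - 1.0) * u - np.sin(nd) * x)

rng = np.random.default_rng(1)
x = rng.standard_normal(4)
x /= np.linalg.norm(x)
proj = lambda v: v - np.dot(x, v) * x  # projection onto T_x S^{n-1}
d, xi = proj(rng.standard_normal(4)), proj(rng.standard_normal(4))
y = sphere_exp(x, d)
xi_t = sphere_parallel_transport(x, d, xi)
print(abs(np.dot(y, xi_t)) < 1e-10)                             # transported vector lies in T_y S^{n-1}
print(abs(np.linalg.norm(xi_t) - np.linalg.norm(xi)) < 1e-10)   # the length is preserved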

Now, we define geodesically convex sets and functions, L-smoothness, and related concepts on Riemannian manifolds as follows:

Definition 2.3.

A set \mathcal{X} is called geodesically convex if for any x,y\in\mathcal{X}, there is a geodesic \gamma with \gamma(0)=x, \gamma(1)=y, and \gamma(t)\in\mathcal{X} for all t\in[0,1].

Definition 2.4.

A function h\colon\mathcal{M}\rightarrow\mathbb{R} is called geodesically convex if, for any p,q\in\mathcal{M}, we have h(\gamma(t))\leq(1-t)h(p)+th(q) for any t\in[0,1], where \gamma is the geodesic connecting p and q.

Proposition 2.1.

[21] Let h\colon\mathcal{M}\rightarrow\mathbb{R} be a smooth and geodesically convex function. Then, for any x,y\in\mathcal{M}, we have

h(y)-h(x)\geq\left\langle\operatorname{grad}h(x),R_{x}^{-1}(y)\right\rangle_{x}.

Considering the definition of geodesic convexity and defining the retraction through \gamma, we can easily obtain the following result.

Proposition 2.2.

Let h\colon\mathcal{M}\rightarrow\mathbb{R} be a smooth and geodesically convex function. Then, for any x,y\in\mathcal{M} and \lambda\in[0,1], we have

h(R_{x}(\lambda R_{x}^{-1}(y)))\leq(1-\lambda)h(x)+\lambda h(y).
Definition 2.5.

A function h\colon\mathcal{M}\rightarrow\mathbb{R} is called geodesically L-smooth, or said to have an L-Lipschitz continuous gradient, if there exists L>0 such that

\left\|\mathrm{grad}h(y)-\mathcal{T}\mathrm{grad}h(x)\right\|\leq L\,\mathrm{dist}(x,y)\quad\forall x,y\in\mathcal{M}, (1)

where \mathrm{dist}(x,y) is the geodesic distance between x and y, and \mathcal{T} is the vector transport from T_{x}\mathcal{M} to T_{y}\mathcal{M}. This is equivalent to

h(y)\leq h(x)+\left\langle\mathrm{grad}h(x),R^{-1}_{x}(y)\right\rangle_{x}+\frac{L}{2}\mathrm{dist}^{2}(x,y)\quad\forall x,y\in\mathcal{M}.

A similar concept is defined below.

Definition 2.6.

[20] A function h\colon\mathcal{M}\rightarrow\mathbb{R} is called L-retraction-smooth with respect to a retraction R in \mathcal{N}\subseteq\mathcal{M} if for any x\in\mathcal{N} and any \mathcal{S}_{x}\subseteq\mathrm{T}_{x}\mathcal{M} such that R_{x}\left(\mathcal{S}_{x}\right)\subseteq\mathcal{N}, we have

h\left(R_{x}(\eta)\right)\leq h(x)+\left\langle\operatorname{grad}h(x),\eta\right\rangle_{x}+\frac{L}{2}\left\|\eta\right\|_{x}^{2}\quad\forall\eta\in\mathcal{S}_{x}. (2)

3 Riemannian generalized conditional gradient method

In this paper, we consider the following Riemannian optimization problem:

\min\quad F(x):=f(x)+g(x) (3)
\text{s.t.}\quad x\in\mathcal{X},

where \mathcal{X}\subseteq\mathcal{M} is a compact geodesically convex set and F\colon\mathcal{M}\rightarrow\mathbb{R} is a composite function, with f\colon\mathcal{M}\rightarrow\mathbb{R} continuously differentiable and g\colon\mathcal{M}\rightarrow\mathbb{R} closed, geodesically convex, and lower semicontinuous (possibly nonsmooth). Here, we could also consider F as an extended-valued function; all the analyses would still apply, but we omit the infinite values for simplicity. We also make the following assumption.

Assumption 1.

The set \mathcal{X}\subseteq\mathcal{M} is a geodesically convex and complete set on which the retraction R and its inverse R^{-1} are well-defined and smooth. Additionally, the inverse retraction R^{-1}_{x}(y) is continuous with respect to both x and y.

As discussed in [26, Section 10.2], for general retractions, the map (x,y)\mapsto R_{x}^{-1}(y) can be defined smoothly, jointly in x and y, over some domain of \mathcal{M}\times\mathcal{M} containing all pairs (x,x). In this sense, we can say that the above assumption is reasonable.

Now, let us recall the CG method in the Euclidean space. Consider minimizing a convex function \tilde{f}\colon\mathbb{R}^{n}\rightarrow\mathbb{R} over a convex compact set \mathcal{X}\subseteq\mathbb{R}^{n}. At each iteration k, the method solves a linear approximation of the original problem by finding a direction s^{k} that minimizes the linearized objective function over the constraint set \mathcal{X}, i.e., s^{k}=\arg\min_{s\in\mathcal{X}}\langle\nabla\tilde{f}(x^{k}),s\rangle. Then, it updates the current point x^{k} using a step size \lambda_{k} along the direction s^{k}-x^{k}, i.e., x^{k+1}=x^{k}+\lambda_{k}(s^{k}-x^{k}). For composite functions of the form \tilde{f}(x)=\tilde{h}(x)+\tilde{g}(x), where \tilde{h} is smooth and \tilde{g} is convex, the CG method can be adapted to handle the non-smooth term \tilde{g}. The update step becomes:

x^{k+1}=\arg\min_{s\in\mathcal{X}}\langle\nabla\tilde{h}(x^{k}),s\rangle+\tilde{g}(s).

In this case, s\mapsto\langle\nabla\tilde{h}(x^{k}),s\rangle+\tilde{g}(s) is the new objective function for the subproblem, which remains convex due to the convexity of \tilde{g} and the linearity of \langle\nabla\tilde{h}(x^{k}),\cdot\rangle. The compactness and convexity of \mathcal{X} ensure that this subproblem has a solution. The step size \lambda_{k} can be chosen using various strategies. The CG method is particularly useful when the constraint set \mathcal{X} is not a simple set, making projection-based methods computationally expensive.
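As a toy illustration of this Euclidean scheme (a sketch under assumptions specific to this example: \tilde{h}(x)=\frac{1}{2}\|x-b\|^{2}, \tilde{g}(x)=\mu\|x\|_{1}, and the box \mathcal{X}=[-1,1]^{n}, for which the composite linearized subproblem has a coordinate-wise closed-form solution), the following Python sketch combines the subproblem with a convex-combination step and a diminishing step size, mirroring the structure of Algorithm 1 below.

import numpy as np

def composite_lmo_box(c, mu):
    # Solves min_{s in [-1,1]^n} <c, s> + mu * ||s||_1 coordinate-wise:
    # s_i = -sign(c_i) if |c_i| > mu, and s_i = 0 otherwise.
    s = -np.sign(c)
    s[np.abs(c) <= mu] = 0.0
    return s

def euclidean_gcg(b, mu, iters=200):
    # Generalized CG for min_{x in [-1,1]^n} 0.5*||x - b||^2 + mu*||x||_1 (toy instance).
    x = np.zeros_like(b)
    for k in range(iters):
        grad = x - b                      # gradient of the smooth part
        s = composite_lmo_box(grad, mu)   # solution of the composite linearized subproblem
        lam = 2.0 / (k + 2)               # diminishing step size
        x = x + lam * (s - x)             # convex-combination update
    return x

b = np.array([1.5, -0.3, 0.05, -2.0])
print(euclidean_gcg(b, mu=0.1))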

The exponential map, the parallel transport, and a specific step size are used in the Riemannian CG method proposed in [21]. We now extend this approach to a more general framework for composite optimization problems, without specifying the retraction and the vector transport.

Algorithm 1 Riemannian generalized conditional gradient method (RGCG)

Step 0. Initialization:

Choose x^{0}\in\mathcal{X} and initialize k=0.

Step 1. Compute the search direction:

Compute an optimal solution p(x^{k}) and the optimal value \theta(x^{k}) as

p(x^{k})=\arg\min_{u\in\mathcal{X}}\left\langle\mathrm{grad}f(x^{k}),R^{-1}_{x^{k}}(u)\right\rangle_{x^{k}}+g(u)-g(x^{k}), (4)
\theta(x^{k})=\left\langle\mathrm{grad}f(x^{k}),R^{-1}_{x^{k}}(p(x^{k}))\right\rangle_{x^{k}}+g(p(x^{k}))-g(x^{k}). (5)

Define the search direction by d(x^{k})=R^{-1}_{x^{k}}(p(x^{k})).
Step 2. Compute the step size:

Compute the step size \lambda_{k}.

Step 3. Update the iterates:

Update the current iterate x^{k+1}=R_{x^{k}}(\lambda_{k}d(x^{k})).

Step 4. Convergence check:

If a convergence criterion is met, stop; otherwise, set k=k+1 and return to Step 1.

Our method is presented in Algorithm 1. In Step 1, we need to solve the subproblem (4), which, similarly to the Euclidean case, is related to the linear approximation of the objective function. In Step 2, we compute the step size based on one of the strategies mentioned later. In Step 3, we update the iterate by using some retraction. Note that if u=x^{k} in (4), then \left\langle\mathrm{grad}f(x^{k}),R^{-1}_{x^{k}}(u)\right\rangle_{x^{k}}+g(u)-g(x^{k})=0, which implies that \theta(x^{k})\leq 0. Moreover, since d(x^{k})=R^{-1}_{x^{k}}(p(x^{k})), the subproblem is equivalent to

d(x^{k})=\arg\min_{d\in T_{x^{k}}\mathcal{X}}\left\langle\mathrm{grad}f(x^{k}),d\right\rangle_{x^{k}}+g(R_{x^{k}}(d))-g(x^{k}). (6)

In the subsequent sections, we will examine the convergence properties of the sequence generated by Algorithm 1, specifically focusing on three distinct and rigorously defined strategies for step-size determination, which are also discussed in [8], as elaborated below.

Armijo step size: Let \zeta\in(0,1) and 0<\omega_{1}<\omega_{2}<1. The step size \lambda_{k} is chosen according to the following line search algorithm:

Step 0. Set \lambda_{k_{0}}=1 and initialize \ell\leftarrow 0.

Step 1. If

F(R_{x^{k}}(\lambda_{k_{\ell}}d(x^{k})))\leq F(x^{k})+\zeta\lambda_{k_{\ell}}\theta(x^{k}),

then set \lambda_{k}:=\lambda_{k_{\ell}} and return to the main algorithm.

Step 2. Find \lambda_{k_{\ell+1}}\in\left[\omega_{1}\lambda_{k_{\ell}},\omega_{2}\lambda_{k_{\ell}}\right], set \ell\leftarrow\ell+1, and go to Step 1.

Adaptive step size: For all k, set

\lambda_{k}:=\min\left\{1,\frac{-\theta(x^{k})}{L\,\mathrm{dist}^{2}(p(x^{k}),x^{k})}\right\}=\operatorname{argmin}_{\lambda\in(0,1]}\left\{\lambda\theta(x^{k})+\frac{L}{2}\lambda^{2}\mathrm{dist}^{2}(p(x^{k}),x^{k})\right\}.

Diminishing step size: For all k, set

\lambda_{k}:=\frac{2}{k+2}.
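The following Python sketch (a schematic illustration; F, \theta(x^{k}), the retraction, and the distance are supplied as callables, \theta(x^{k})<0 is assumed, and the backtracking uses \omega_{1}=\omega_{2}=1/2) shows how the three step-size rules can be implemented.

def diminishing_step(k):
    # lambda_k = 2 / (k + 2)
    return 2.0 / (k + 2)

def adaptive_step(theta_k, dist_pk_xk, L):
    # lambda_k = min{1, -theta(x^k) / (L * dist^2(p(x^k), x^k))}
    return min(1.0, -theta_k / (L * dist_pk_xk ** 2))

def armijo_step(F, x_k, d_k, theta_k, retract, zeta=1e-4, omega=0.5, max_backtracks=50):
    # Backtracking: accept the first lambda with
    # F(R_{x^k}(lambda * d_k)) <= F(x^k) + zeta * lambda * theta(x^k).
    lam = 1.0
    Fx = F(x_k)
    for _ in range(max_backtracks):
        if F(retract(x_k, lam * d_k)) <= Fx + zeta * lam * theta_k:
            break
        lam *= omega
    return lam

# Tiny check on the Euclidean line (retraction = addition), with F(x) = (x - 0.3)^2 and g = 0.
F = lambda x: (x - 0.3) ** 2
retract = lambda x, d: x + d
x_k, p_k = 1.0, 0.3                        # current iterate and subproblem solution
d_k = p_k - x_k                            # search direction R_{x^k}^{-1}(p(x^k))
theta_k = 2.0 * (x_k - 0.3) * d_k          # <grad f(x^k), d_k>, which is negative here
print(armijo_step(F, x_k, d_k, theta_k, retract), adaptive_step(theta_k, abs(d_k), L=2.0), diminishing_step(0))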

4 Convergence analysis

In this section, we will provide the convergence analysis for three types of step sizes. The following assumption will be used in the convergence results.

Assumption 2.

The function f is geodesically L-smooth, namely, there exists a constant L such that (1) holds.

Since \mathcal{X} is compact, we can assume without loss of generality that \mathcal{X}=\mathrm{dom}(g). Its diameter is then finite and is defined by

\operatorname{diam}(\mathcal{X}):=\sup_{x,y\in\mathcal{X}}\mathrm{dist}(x,y). (7)
Lemma 4.1.

Let x,y,z\in\mathcal{X} with z=p(x) and y=\gamma_{x,z}(\lambda)=R_{x}(\lambda R^{-1}_{x}(z)), where \lambda\in[0,1] and \gamma_{x,z}\colon[0,1]\to\mathcal{M} represents the geodesic joining x=\gamma_{x,z}(0) with z=\gamma_{x,z}(1). Then, we have

F(y)\leq F(x)+\lambda\theta(x)+\frac{L\lambda^{2}}{2}\operatorname{diam}(\mathcal{X})^{2}. (8)
Proof.

From the definition of F, we have

F(y) =f(y)+g(y)
\leq f(x)+\left\langle\mathrm{grad}f(x),R^{-1}_{x}(y)\right\rangle_{x}+\frac{L}{2}\mathrm{dist}^{2}(x,y)+g(y)
\leq f(x)+\left\langle\mathrm{grad}f(x),R^{-1}_{x}(y)\right\rangle_{x}+\frac{L}{2}\mathrm{dist}^{2}(x,y)+(1-\lambda)g(x)+\lambda g(z)
=F(x)+\lambda\left\langle\mathrm{grad}f(x),R^{-1}_{x}(z)\right\rangle_{x}-\lambda g(x)+\lambda g(z)+\frac{L}{2}\mathrm{dist}^{2}(x,y)
=F(x)+\lambda\left(\left\langle\mathrm{grad}f(x),R^{-1}_{x}(z)\right\rangle_{x}-g(x)+g(z)\right)+\frac{L}{2}\mathrm{dist}^{2}(x,y)
=F(x)+\lambda\theta(x)+\frac{L\lambda^{2}}{2}\mathrm{dist}^{2}(x,z)
\leq F(x)+\lambda\theta(x)+\frac{L\lambda^{2}}{2}\operatorname{diam}(\mathcal{X})^{2},

where the first inequality comes from the L-smoothness of f and the second inequality is due to the geodesic convexity of g. Then, from the definition of \theta in (5), and using the fact that \mathrm{dist}^{2}(x,y)=\lambda^{2}\mathrm{dist}^{2}(x,z), we obtain the desired result. ∎

Additionally, we can derive the following proposition.

Proposition 4.2.

Let x^{*}\in\mathcal{X} be an optimal point of (3), and assume that f is geodesically convex. Then, Algorithm 1 generates \{x^{k}\} satisfying

F(x^{k+1})\leq F(x^{*})+\pi_{k}\left(1-\lambda_{0}\right)\left(F(x^{0})-F(x^{*})\right)+\sum_{s=0}^{k}\frac{\pi_{k}}{\pi_{s}}\frac{L\lambda_{s}^{2}}{2}\operatorname{diam}(\mathcal{X})^{2}, (9)

where \pi_{k}:=\prod_{s=1}^{k}\left(1-\lambda_{s}\right) with \pi_{0}=1.

Proof.

From Lemma 4.1, we obtain

F(x^{k+1})\leq F(x^{k})+\lambda_{k}\theta(x^{k})+\frac{L\lambda_{k}^{2}}{2}\operatorname{diam}(\mathcal{X})^{2}.

Define \Delta_{k}:=F(x^{k})-F(x^{*}). Then, we get

\Delta_{k+1}\leq\Delta_{k}+\lambda_{k}\theta(x^{k})+\frac{L\lambda_{k}^{2}}{2}\operatorname{diam}(\mathcal{X})^{2}.

From the geodesic convexity of f, we have

F(x^{*})-F(x^{k})\geq\left\langle\operatorname{grad}f(x^{k}),R_{x^{k}}^{-1}(x^{*})\right\rangle_{x^{k}}+g(x^{*})-g(x^{k})\geq\min_{u\in\mathcal{X}}\left\{\left\langle\mathrm{grad}f(x^{k}),R^{-1}_{x^{k}}(u)\right\rangle_{x^{k}}+g(u)-g(x^{k})\right\}=\theta(x^{k}).

Then we have \Delta_{k}\leq-\theta(x^{k}), which gives

\Delta_{k+1}\leq(1-\lambda_{k})\Delta_{k}+\frac{L\lambda_{k}^{2}}{2}\operatorname{diam}(\mathcal{X})^{2}.

Now, let us unroll this recursion and express \Delta_{k+1} in terms of \Delta_{0}. We obtain

\Delta_{k+1}\leq\prod_{s=0}^{k}(1-\lambda_{s})\Delta_{0}+\sum_{s=0}^{k}\frac{L\lambda_{s}^{2}}{2}\prod_{j=s+1}^{k}(1-\lambda_{j})\operatorname{diam}(\mathcal{X})^{2},

that is,

\Delta_{k+1}\leq\pi_{k}\left(1-\lambda_{0}\right)\Delta_{0}+\sum_{s=0}^{k}\frac{\pi_{k}}{\pi_{s}}\frac{L\lambda_{s}^{2}}{2}\operatorname{diam}(\mathcal{X})^{2}.

Rearranging the terms completes the proof. ∎

Now we present a result associated with general sequences of real numbers, which will be useful for our study of convergence and iteration complexity bounds for the CG method.

Lemma 4.3.

[14, Lemma 13.13] Let \left\{a_{k}\right\} and \left\{b_{k}\right\} be nonnegative sequences of real numbers satisfying

a_{k+1}\leq a_{k}-b_{k}\beta_{k}+\frac{A}{2}\beta_{k}^{2},\quad k=0,1,2,\ldots,

where \beta_{k}:=2/(k+2) and A is a positive number. Suppose that a_{k}\leq b_{k} for all k. Then

  • (i) a_{k}\leq\frac{2A}{k} for all k=1,2,\ldots

  • (ii) \min_{\ell\in\left\{\left\lfloor\frac{k}{2}\right\rfloor+2,\ldots,k\right\}}b_{\ell}\leq\frac{8A}{k-2} for all k=3,4,\ldots, where \lfloor k/2\rfloor=\max\{n\in\mathbb{N}:n\leq k/2\}.

4.1 Analysis with the adaptive and the diminishing step sizes

Theorem 4.4.

Let x^{*} be an optimal point of (3), and assume that f is geodesically convex. Then, the sequence of iterates \{x^{k}\} generated by Algorithm 1 with the adaptive or diminishing step size satisfies F(x^{k})-F(x^{*})=\mathcal{O}(1/k). Specifically, we obtain

\min_{\ell\in\left\{\left\lfloor\frac{k}{2}\right\rfloor+2,\ldots,k\right\}}-\theta(x^{\ell})\leq\frac{8L\operatorname{diam}(\mathcal{X})^{2}}{k-2}.
Proof.

From Lemma 4.1 and the diminishing step size \lambda_{k}=\frac{2}{k+2}\in(0,1), we have

F(x^{k+1})-F(x^{*})\leq F(x^{k})-F(x^{*})+\lambda_{k}\theta(x^{k})+\frac{L\lambda_{k}^{2}}{2}\operatorname{diam}(\mathcal{X})^{2}. (10)

Considering the adaptive step size \eta_{k}=\operatorname{argmin}_{\lambda\in(0,1]}\left\{\lambda\theta(x^{k})+\frac{L}{2}\lambda^{2}\mathrm{dist}^{2}(p(x^{k}),x^{k})\right\}, we have \eta_{k}\theta(x^{k})+\frac{L\eta_{k}^{2}}{2}\operatorname{diam}(\mathcal{X})^{2}\leq\lambda_{k}\theta(x^{k})+\frac{L\lambda_{k}^{2}}{2}\operatorname{diam}(\mathcal{X})^{2}, so inequality (10) still holds.

Define \Delta_{k}:=F(x^{k})-F(x^{*}). Then (10) is equivalent to

\Delta_{k+1}\leq\Delta_{k}+\lambda_{k}\theta(x^{k})+\frac{L\lambda_{k}^{2}}{2}\operatorname{diam}(\mathcal{X})^{2}.

From the geodesic convexity of f, we have

F(x^{*})-F(x^{k})\geq\left\langle\operatorname{grad}f(x^{k}),R_{x^{k}}^{-1}(x^{*})\right\rangle_{x^{k}}+g(x^{*})-g(x^{k})\geq\min_{u\in\mathcal{X}}\left\{\left\langle\mathrm{grad}f(x^{k}),R^{-1}_{x^{k}}(u)\right\rangle_{x^{k}}+g(u)-g(x^{k})\right\}=\theta(x^{k}).

Then we have 0\leq\Delta_{k}\leq-\theta(x^{k}). So, setting a_{k}=\Delta_{k}, b_{k}=-\theta(x^{k})\geq 0, \beta_{k}=\lambda_{k}, and A=L\operatorname{diam}(\mathcal{X})^{2} in Lemma 4.3, we obtain

F(x^{k})-F(x^{*})\leq\frac{2L}{k}\operatorname{diam}(\mathcal{X})^{2},

and

\min_{\ell\in\left\{\left\lfloor\frac{k}{2}\right\rfloor+2,\ldots,k\right\}}-\theta(x^{\ell})\leq\frac{8L\operatorname{diam}(\mathcal{X})^{2}}{k-2},

for all k=3,4,\ldots, where \lfloor k/2\rfloor=\max\{n\in\mathbb{N}:n\leq k/2\}. ∎

The above theorem also indicates that the iteration complexity of the method is \mathcal{O}(1/k). Note that a similar result can be obtained by following the proof of [22, Corollary 5].

4.2 Analysis with the Armijo step size

In the following lemma, we prove that there exist intervals of step sizes satisfying the Armijo condition.

Lemma 4.5.

Let \zeta\in(0,1), x^{k}\in\mathcal{X} and \theta(x^{k})\neq 0. Then, there exists 0<\bar{\lambda}\leq 1 such that

F(R_{x^{k}}(\lambda R_{x^{k}}^{-1}(p(x^{k}))))\leq F(x^{k})-\zeta\lambda\left|\theta(x^{k})\right|\quad\forall\lambda\in(0,\bar{\lambda}]. (11)
Proof.

Since f is differentiable and g is geodesically convex, for all \lambda\in(0,1), we have

F(R_{x^{k}}(\lambda R_{x^{k}}^{-1}(p(x^{k}))))
= g(R_{x^{k}}(\lambda R_{x^{k}}^{-1}(p(x^{k}))))+f(R_{x^{k}}(\lambda R_{x^{k}}^{-1}(p(x^{k}))))
\leq (1-\lambda)g(x^{k})+\lambda g(p(x^{k}))+f(x^{k})+\lambda\left\langle\mathrm{grad}f(x^{k}),R_{x^{k}}^{-1}(p(x^{k}))\right\rangle_{x^{k}}+o(\lambda)
= F(x^{k})+\lambda\left(\left\langle\mathrm{grad}f(x^{k}),R_{x^{k}}^{-1}(p(x^{k}))\right\rangle_{x^{k}}+g(p(x^{k}))-g(x^{k})\right)+o(\lambda)
= F(x^{k})+\lambda\zeta\theta(x^{k})+\lambda\left((1-\zeta)\theta(x^{k})+\frac{o(\lambda)}{\lambda}\right).

The inequality arises from the Taylor expansion of f and the geodesic convexity of g. Since \lim_{\lambda\rightarrow 0}\frac{o(\lambda)}{\lambda}=0, \zeta\in(0,1), and \theta(x^{k})<0 by the definition in (5), there exists \bar{\lambda}\in(0,1] such that (1-\zeta)\theta(x^{k})+\frac{o(\lambda)}{\lambda}\leq 0 for all \lambda\in(0,\bar{\lambda}], and hence (11) holds for all \lambda\in(0,\bar{\lambda}]. ∎

In the subsequent results, we assume that the sequence generated by Algorithm 1 is infinite, which means that \theta(x^{k})<0 for all k.

Assumption 3.

Set \mathcal{L}=\{x\mid F(x)\leq F(x^{0})\}. All monotonically nonincreasing sequences in F(\mathcal{L}) are bounded from below.

Theorem 4.6.

Assume that F satisfies Assumption 3 and the sequence \{x^{k}\} is generated by Algorithm 1 with the Armijo step size. The following statements hold:

  • (1) \lim_{k\rightarrow\infty}\theta(x^{k})=0;

  • (2) Let x^{*} be an accumulation point of the sequence \{x^{k}\}. Then x^{*} is a stationary point of (3) and \lim_{k\rightarrow\infty}F(x^{k})=F(x^{*}).

Proof.

(1) By the Armijo step size, the fact that \theta(x^{k})<0 for all k\in\mathbb{N}, and Lemma 4.5, we have

F(x^{k})-F(x^{k+1})\geq\zeta\lambda_{k}\left|\theta(x^{k})\right|>0, (12)

which implies that the sequence \{F(x^{k})\} is monotonically decreasing. On the other hand, since \{x^{k}\}\subset\mathcal{X} and \mathcal{X} is compact, there exists \bar{x}\in\mathcal{X} that is an accumulation point of a subsequence of \{x^{k}\}. Without loss of generality, assume that x^{k}\rightarrow\bar{x}.

From (12), with k=0,1,\ldots,N-1 for some N, we have F(x^{0})-F_{\min}\geq F(x^{0})-F(x^{N})\geq\zeta\sum_{k=0}^{N-1}\lambda_{k}\left|\theta(x^{k})\right|, where F_{\min} exists from Assumption 3. Taking N\rightarrow\infty, we have \sum_{k=0}^{\infty}\lambda_{k}\left|\theta(x^{k})\right|<+\infty. Therefore, we have \lim_{k\rightarrow\infty}\lambda_{k}\theta(x^{k})=0. If

\lim_{k\rightarrow\infty}\theta(x^{k})=0, (13)

then the assertion holds. So we consider the case

\lim_{k\rightarrow\infty}\lambda_{k}=0. (14)

Assume that \lim_{\mathbb{K}_{1}\ni k\to\infty}\lambda_{k}=0 for some \mathbb{K}_{1}\subseteq\mathbb{N}. Suppose by contradiction that \theta(\bar{x})<0. Then, there exist \delta>0 and \mathbb{K}_{2}\subseteq\mathbb{K}_{1} such that

\theta(x^{k})<-\delta\quad\forall k\in\mathbb{K}_{2}.

Without loss of generality, we assume that there exists \bar{p}\in\mathcal{X} such that

\lim_{\mathbb{K}_{2}\ni k\to\infty}p(x^{k})=\bar{p}.

Since \lambda_{k}<1 for all k\in\mathbb{K}_{2}, by the Armijo step size, there exists \bar{\lambda}_{k}\in(0,\lambda_{k}/\omega_{1}) such that

F(R_{x^{k}}(\bar{\lambda}_{k}R_{x^{k}}^{-1}(p(x^{k}))))>F(x^{k})+\zeta\bar{\lambda}_{k}\theta(x^{k})\quad\forall k\in\mathbb{K}_{2}.

Then, we have

\frac{F(R_{x^{k}}(\bar{\lambda}_{k}R_{x^{k}}^{-1}(p(x^{k}))))-F(x^{k})}{\bar{\lambda}_{k}}>\zeta\theta(x^{k})\quad\forall k\in\mathbb{K}_{2}. (15)

On the other hand, since 0<\bar{\lambda}_{k}\leq 1 and g is geodesically convex, we have

g(R_{x^{k}}(\bar{\lambda}_{k}R_{x^{k}}^{-1}(p(x^{k}))))\leq(1-\bar{\lambda}_{k})g(x^{k})+\bar{\lambda}_{k}g(p(x^{k})),

namely,

\frac{g(R_{x^{k}}(\bar{\lambda}_{k}R_{x^{k}}^{-1}(p(x^{k}))))-g(x^{k})}{\bar{\lambda}_{k}}\leq g(p(x^{k}))-g(x^{k})\quad\forall k\in\mathbb{K}_{2}. (16)

Since f is differentiable and \lim_{\mathbb{K}_{2}\ni k\to\infty}\bar{\lambda}_{k}=0, for all k\in\mathbb{K}_{2} we have

f(R_{x^{k}}(\bar{\lambda}_{k}R_{x^{k}}^{-1}(p(x^{k}))))=f(x^{k})+\bar{\lambda}_{k}\left\langle\operatorname{grad}f(x^{k}),R_{x^{k}}^{-1}(p(x^{k}))\right\rangle_{x^{k}}+o(\bar{\lambda}_{k}), (17)

which, together with (16) and the definition of \theta, gives

\theta(x^{k}) =\left\langle\operatorname{grad}f(x^{k}),R_{x^{k}}^{-1}(p(x^{k}))\right\rangle_{x^{k}}+g(p(x^{k}))-g(x^{k}) (18)
\geq\frac{f(R_{x^{k}}(\bar{\lambda}_{k}R_{x^{k}}^{-1}(p(x^{k}))))-f(x^{k})-o(\bar{\lambda}_{k})}{\bar{\lambda}_{k}}+\frac{g(R_{x^{k}}(\bar{\lambda}_{k}R_{x^{k}}^{-1}(p(x^{k}))))-g(x^{k})}{\bar{\lambda}_{k}}
\geq\frac{F(R_{x^{k}}(\bar{\lambda}_{k}R_{x^{k}}^{-1}(p(x^{k}))))-F(x^{k})}{\bar{\lambda}_{k}}-\frac{o(\bar{\lambda}_{k})}{\bar{\lambda}_{k}}\quad\forall k\in\mathbb{K}_{2}.

Combining (18) with (15), we obtain, for all k\in\mathbb{K}_{2},

\frac{F(R_{x^{k}}(\bar{\lambda}_{k}R_{x^{k}}^{-1}(p(x^{k}))))-F(x^{k})}{\bar{\lambda}_{k}\zeta}>\frac{F(R_{x^{k}}(\bar{\lambda}_{k}R_{x^{k}}^{-1}(p(x^{k}))))-F(x^{k})}{\bar{\lambda}_{k}}-\frac{o(\bar{\lambda}_{k})}{\bar{\lambda}_{k}},

that is,

\left(\frac{1}{\zeta}-1\right)\frac{F(R_{x^{k}}(\bar{\lambda}_{k}R_{x^{k}}^{-1}(p(x^{k}))))-F(x^{k})}{\bar{\lambda}_{k}}>-\frac{o(\bar{\lambda}_{k})}{\bar{\lambda}_{k}}.

Then, we have

\frac{F(R_{x^{k}}(\bar{\lambda}_{k}R_{x^{k}}^{-1}(p(x^{k}))))-F(x^{k})}{\bar{\lambda}_{k}}>\left(\frac{-\zeta}{1-\zeta}\right)\frac{o(\bar{\lambda}_{k})}{\bar{\lambda}_{k}}\quad\forall k\in\mathbb{K}_{2}.

On the other hand, from the fact that \theta(x^{k})<-\delta and (18), we obtain

-\delta+\frac{o(\bar{\lambda}_{k})}{\bar{\lambda}_{k}}>\frac{F(R_{x^{k}}(\bar{\lambda}_{k}R_{x^{k}}^{-1}(p(x^{k}))))-F(x^{k})}{\bar{\lambda}_{k}}\quad\forall k\in\mathbb{K}_{2}.

Combining the two results above, it holds that

-\delta+\frac{o(\bar{\lambda}_{k})}{\bar{\lambda}_{k}}>\left(\frac{-\zeta}{1-\zeta}\right)\frac{o(\bar{\lambda}_{k})}{\bar{\lambda}_{k}}\quad\forall k\in\mathbb{K}_{2}.

Since \frac{o(\bar{\lambda}_{k})}{\bar{\lambda}_{k}}\to 0 as k\to\infty, we obtain \delta\leq 0, which contradicts the fact that \delta>0. Thus, \theta(\bar{x})=0.

(2) Let x^{*} be an accumulation point of the sequence \{x^{k}\}. Assume that x^{*} is not a stationary point of F, namely \theta(x^{*})<0. By the definition of \theta in (5), we have

\theta\left(x^{k}\right)=\left\langle\mathrm{grad}f(x^{k}),R^{-1}_{x^{k}}(p(x^{k}))\right\rangle_{x^{k}}+g(p(x^{k}))-g(x^{k})\leq\left\langle\mathrm{grad}f(x^{k}),R^{-1}_{x^{k}}(u)\right\rangle_{x^{k}}+g(u)-g(x^{k})\quad\forall u\in\mathcal{X},

and by taking the limit we obtain

\limsup_{k\rightarrow\infty}\theta\left(x^{k}\right)\leq\left\langle\mathrm{grad}f(x^{*}),R^{-1}_{x^{*}}(u)\right\rangle_{x^{*}}+g(u)-g(x^{*})\quad\forall u\in\mathcal{X},

since f is continuously differentiable, g is lower semicontinuous, and Assumption 1 holds. Now, taking the minimum over u\in\mathcal{X} and using that x^{*} is not a stationary point of F, we get

\limsup_{k\rightarrow\infty}\theta\left(x^{k}\right)\leq\min_{u\in\mathcal{X}}\left\langle\mathrm{grad}f(x^{*}),R^{-1}_{x^{*}}(u)\right\rangle_{x^{*}}+g(u)-g(x^{*})=\theta\left(x^{*}\right)<0,

which contradicts \lim_{k\rightarrow\infty}\theta(x^{k})=0, so x^{*} is a stationary point of F.

Also, due to the monotonicity of the sequence \{F(x^{k})\} given in (12), we obtain \lim_{k\rightarrow\infty}F(x^{k})=F\left(x^{*}\right), which completes the proof. ∎

The following theorem provides the iteration complexity of Algorithm 1 with the Armijo step size, indicating that after a finite number of steps, the iteration complexity of the method is \mathcal{O}(1/\epsilon^{2}).

Theorem 4.7.

Assume that F satisfies Assumptions 2 and 3. Then, there exists a finite integer J\in\mathbb{N} such that Algorithm 1 with the Armijo step size achieves an \epsilon-optimal solution within J+\mathcal{O}(1/\epsilon^{2}) iterations.

Proof.

First, we consider the case 0<\lambda_{k}<1. From the Armijo step size, we can say that there exists 0<\bar{\lambda}_{k}\leq\min\left\{1,\frac{\lambda_{k}}{\omega_{1}}\right\} such that

F(R_{x^{k}}(\bar{\lambda}_{k}R_{x^{k}}^{-1}(p(x^{k}))))>F(x^{k})+\zeta\bar{\lambda}_{k}\theta(x^{k}).

Recall that from Lemma 4.1, we have

F(R_{x^{k}}(\bar{\lambda}_{k}R_{x^{k}}^{-1}(p(x^{k}))))\leq F(x^{k})+\bar{\lambda}_{k}\theta(x^{k})+\frac{L\bar{\lambda}_{k}^{2}}{2}\operatorname{diam}(\mathcal{X})^{2}.

Combining the above two inequalities, we conclude that

\zeta\bar{\lambda}_{k}\theta(x^{k})<\bar{\lambda}_{k}\theta(x^{k})+\frac{L\bar{\lambda}_{k}^{2}}{2}\operatorname{diam}(\mathcal{X})^{2},

namely,

0<-\frac{2(1-\zeta)}{L\operatorname{diam}(\mathcal{X})^{2}}\theta(x^{k})<\bar{\lambda}_{k}\leq\frac{\lambda_{k}}{\omega_{1}}.

Defining \gamma=\frac{2\omega_{1}(1-\zeta)}{L\operatorname{diam}(\mathcal{X})^{2}}, we get

0<-\gamma\theta(x^{k})<\lambda_{k}. (19)

From Theorem 4.6, we have \lim_{k\rightarrow\infty}\theta(x^{k})=0. Then, without loss of generality, for a sufficiently large J, we have \left|\theta(x^{k})\right|\leq 1/\gamma for all k>J, so (19) also holds when \lambda_{k}=1. Substituting it into (12), we obtain

0<\zeta\gamma\left|\theta(x^{k})\right|^{2}\leq F(x^{k})-F(x^{k+1}). (20)

That is, (20) holds for all k>J. By summing both sides of the second inequality in (20) for k=J+1,\ldots,N, we obtain

\sum_{k=J+1}^{N}\left|\theta(x^{k})\right|^{2}\leq\frac{1}{\zeta\gamma}(F(x^{J+1})-F(x^{N+1}))\leq\frac{1}{\zeta\gamma}(F(x^{0})-F(x^{*})),

where x^{*} is an accumulation point of \{x^{k}\}, as in Theorem 4.6.

This implies \min_{k}\left\{\left|\theta(x^{k})\right|\mid k=J+1,\ldots,N\right\}\leq\sqrt{(F(x^{0})-F(x^{*}))/(\zeta\gamma(N-J))}. Hence, Algorithm 1 returns x^{k} satisfying \left|\theta(x^{k})\right|\leq\epsilon in at most J+(F(x^{0})-F(x^{*}))/(\zeta\gamma\epsilon^{2}) iterations. ∎

Furthermore, we can obtain stronger results by making additional assumptions about the objective function g.

Assumption 4.

Assume that the function g is Lipschitz continuous with constant L_{g}>0 on \mathcal{X}, namely, |g(x)-g(y)|\leq L_{g}\mathrm{dist}(x,y). Moreover, for \{x^{k}\} generated by Algorithm 1, there exists \Omega>0 such that \Omega\geq\max\big\{\max\big\{\|R_{x^{i}}^{-1}(x^{j})\|_{x^{i}}:x^{i},x^{j}\in\{x^{k}\}\big\},\operatorname{diam}(\mathcal{X})\big\}.

As k tends to infinity, we can assume without loss of generality that x^{k} lies within a compact neighborhood of x^{*}. Due to the smoothness of the inverse retraction, this ensures that \|R_{x^{i}}^{-1}(x^{j})\|_{x^{i}} remains bounded. Hence, Assumption 4 is reasonable. Moreover, we define

\rho:=\sup\left\{\left\|\operatorname{grad}f(x)\right\|_{x}\mid x\in\mathcal{X}\right\}, (21)
\gamma:=\min\left\{\frac{1}{(L_{g}+\rho)\Omega},\frac{2\omega_{1}(1-\zeta)}{L\Omega^{2}}\right\}.

Then we present the following lemma, which is crucial for the subsequent convergence rate analysis.

Lemma 4.8.

Let \{x^{k}\} be a sequence generated by Algorithm 1 with the Armijo step size. Assume that Assumptions 2 and 4 hold. Then \lambda_{k}\geq\gamma\left|\theta\left(x^{k}\right)\right|>0.

Proof.

Since \lambda_{k}\in(0,1], let us consider two cases: \lambda_{k}=1 and 0<\lambda_{k}<1. First, we assume that \lambda_{k}=1. From (5), we have

0<-\theta(x^{k})=g(x^{k})-g(p(x^{k}))-\left\langle\operatorname{grad}f(x^{k}),R_{x^{k}}^{-1}(p(x^{k}))\right\rangle_{x^{k}},

thus, it follows from Assumption 4 and the Cauchy-Schwarz inequality that

0<-\theta(x^{k})\leq L_{g}\mathrm{dist}(x^{k},p(x^{k}))+\left\|\operatorname{grad}f(x^{k})\right\|_{x^{k}}\left\|R_{x^{k}}^{-1}(p(x^{k}))\right\|_{x^{k}}.

Using (21), we have 0<-\theta(x^{k})\leq(L_{g}+\rho)\Omega. Furthermore,

0<\gamma|\theta(x^{k})|=-\gamma\theta(x^{k})\leq\frac{-\theta(x^{k})}{(L_{g}+\rho)\Omega}\leq 1,

which demonstrates that the desired inequality holds.

Now, we assume that 0<\lambda_{k}<1. From the Armijo step size, we know that there exists 0<\bar{\lambda}_{k}\leq\min\left\{1,\frac{\lambda_{k}}{\omega_{1}}\right\} such that

F(R_{x^{k}}(\bar{\lambda}_{k}R_{x^{k}}^{-1}(p(x^{k}))))>F(x^{k})+\zeta\bar{\lambda}_{k}\theta(x^{k}).

By using Lemma 4.1, we have

F(R_{x^{k}}(\bar{\lambda}_{k}R_{x^{k}}^{-1}(p(x^{k}))))\leq F(x^{k})+\bar{\lambda}_{k}\theta(x^{k})+\frac{L}{2}\operatorname{diam}(\mathcal{X})^{2}\bar{\lambda}_{k}^{2}.

From the above two inequalities, we have that

-\theta(x^{k})(1-\zeta)<\frac{L}{2}\operatorname{diam}(\mathcal{X})^{2}\bar{\lambda}_{k}\leq\frac{L}{2}\operatorname{diam}(\mathcal{X})^{2}\frac{\lambda_{k}}{\omega_{1}}.

Therefore, from (21), we get

0<\gamma|\theta(x^{k})|=-\gamma\theta(x^{k})\leq-\frac{2\omega_{1}(1-\zeta)}{L\Omega^{2}}\theta(x^{k})<\lambda_{k},

which concludes the proof. ∎

Theorem 4.9.

Let \{x^{k}\} be a sequence generated by Algorithm 1 with the Armijo step size. Assume that Assumptions 2 and 4 hold. Then \lim_{k\rightarrow\infty}F(x^{k})=F(x^{*}) for some x^{*}\in\mathcal{X}. Moreover, the following statements hold:

  • (i) \lim_{k\rightarrow\infty}\theta(x^{k})=0;

  • (ii) \min\left\{\left|\theta(x^{k})\right|\mid k=0,1,\ldots,N-1\right\}\leq\sqrt{(F(x^{0})-F(x^{*}))/(\zeta\gamma N)}.

Proof.

By the Armijo step size, the fact that \theta(x^{k})<0 for all k\in\mathbb{N}, and Lemma 4.5, we have

F(x^{k})-F(x^{k+1})\geq\zeta\lambda_{k}\left|\theta(x^{k})\right|>0, (22)

which implies that the sequence \{F(x^{k})\}_{k\in\mathbb{N}} is monotonically decreasing. Combined with Lemma 4.8, we have

0<\zeta\gamma\left|\theta(x^{k})\right|^{2}\leq F(x^{k})-F(x^{k+1}). (23)

On the other hand, since \{x^{k}\}_{k\in\mathbb{N}}\subset\mathcal{X} and \mathcal{X} is compact, there exists a limit point x^{*}\in\mathcal{X} of \{x^{k}\}_{k\in\mathbb{N}}. Let \{x^{k_{j}}\}_{j\in\mathbb{N}} be a subsequence of \{x^{k}\}_{k\in\mathbb{N}} such that \lim_{j\rightarrow\infty}x^{k_{j}}=x^{*}. Since \{x^{k_{j}}\}_{j\in\mathbb{N}}\subset\mathcal{X} and g is Lipschitz continuous, we have

\left|F(x^{k_{j}})-F(x^{*})\right|=\left|g(x^{k_{j}})-g(x^{*})+f(x^{k_{j}})-f(x^{*})\right|\leq L_{g}\,\mathrm{dist}(x^{k_{j}},x^{*})+\left|f(x^{k_{j}})-f(x^{*})\right|,

for all j\in\mathbb{N}. Considering that f is continuous and \lim_{j\rightarrow\infty}x^{k_{j}}=x^{*}, we conclude from the last inequality that \lim_{j\rightarrow\infty}F(x^{k_{j}})=F(x^{*}). Thus, due to the monotonicity of the sequence \{F(x^{k})\}_{k\in\mathbb{N}}, we obtain that \lim_{k\rightarrow\infty}F(x^{k})=F(x^{*}). Therefore, from (23), \lim_{k\rightarrow\infty}\left|\theta(x^{k})\right|=0, which implies item (i).

By summing both sides of the second inequality in (23) for k=0,1,\ldots,N-1, we obtain

\sum_{k=0}^{N-1}\left|\theta(x^{k})\right|^{2}\leq\frac{1}{\zeta\gamma}(F(x^{0})-F(x^{N}))\leq\frac{1}{\zeta\gamma}(F(x^{0})-F(x^{*})),

which implies item (ii).

Therefore, x^{k} satisfies \left|\theta(x^{k})\right|\leq\epsilon in at most (F(x^{0})-F(x^{*}))/(\zeta\gamma\epsilon^{2}) iterations. ∎

5 Solving the subproblem

In this section, we propose an algorithm for solving the subproblem of the RGCG method. Recall that in Algorithm 1, given the current iterate x, we need to compute an optimal solution p(x) and the optimal value \theta(x) as

p(x)=\arg\min_{u\in\mathcal{X}}\left\langle\mathrm{grad}f(x),R^{-1}_{x}(u)\right\rangle_{x}+g(u)-g(x), (24)
\theta(x)=\left\langle\mathrm{grad}f(x),R^{-1}_{x}(p(x))\right\rangle_{x}+g(p(x))-g(x). (25)

Now define \tilde{p}(x) as the descent direction on the tangent space, i.e.,

\tilde{p}(x)=\arg\min_{d\in T_{x}\mathcal{M}}\left\langle\mathrm{grad}f(x),d\right\rangle_{x}+g(R_{x}(d)). (26)

We also define

\ell_{x}(d)=\left\langle\mathrm{grad}f(x),d\right\rangle_{x}+g(R_{x}(d)). (27)

We observe that in both (24) and (26), the composition with the retraction is not necessarily convex, which implies that the resulting optimization problem is not necessarily convex. Non-convex problems are inherently more challenging to solve. To address this, we aim to transform the subproblem into a series of convex problems, similarly to the general method for solving Riemannian proximal mappings proposed in [20]. To ensure that we can locally approximate the subproblems effectively, we consider the following assumption:

Assumption 5.

The manifold \mathcal{M} is an embedded submanifold of \mathbb{R}^{n} or a quotient manifold whose total space is an embedded submanifold of \mathbb{R}^{n}.

Let us now present a way to find an approximation of \tilde{p}(x)=\arg\min_{d\in T_{x}\mathcal{M}}\ell_{x}(d). Assume that d^{0}\in T_{x}\mathcal{M}, and let d^{k} denote the current estimate of \tilde{p}(x). Observe that

\ell_{x}(d^{k}+\tilde{\xi}_{k}) =\left\langle\operatorname{grad}f\left(x\right),d^{k}+\tilde{\xi}_{k}\right\rangle_{x}+g(R_{x}(d^{k}+\tilde{\xi}_{k}))
=\left\langle\operatorname{grad}f\left(x\right),d^{k}\right\rangle_{x}+\left\langle\operatorname{grad}f\left(x\right),\tilde{\xi}_{k}\right\rangle_{x}+g(R_{x}(d^{k}+\tilde{\xi}_{k})),

for any \tilde{\xi}_{k}\in\mathrm{T}_{x}\mathcal{M}. Let y_{k}=R_{x}\left(d^{k}\right) and \xi_{k}=\mathcal{T}_{R_{d^{k}}}\tilde{\xi}_{k}. Since the retraction R is smooth by definition, we obtain R_{x}(d^{k}+\tilde{\xi}_{k})=y_{k}+\xi_{k}+O\left(\left\|\xi_{k}\right\|_{x}^{2}\right), where y=x+O(z) means \limsup_{z\rightarrow 0}\|y-x\|/\|z\|<\infty. Then, we obtain

\ell_{x}(d^{k}+\tilde{\xi}_{k})= \left\langle\operatorname{grad}f(x),d^{k}\right\rangle_{x}+\left\langle\operatorname{grad}f(x),\mathcal{T}_{R_{d^{k}}}^{-1}\xi_{k}\right\rangle_{x}+g\left(y_{k}+\xi_{k}+O\left(\left\|\xi_{k}\right\|_{x}^{2}\right)\right)
= \left\langle\operatorname{grad}f(x),d^{k}\right\rangle_{x}+\left\langle\operatorname{grad}f(x),\mathcal{T}_{R_{d^{k}}}^{-1}\xi_{k}\right\rangle_{x}+g\left(y_{k}+\xi_{k}\right)+O\left(\left\|\xi_{k}\right\|_{x}^{2}\right),

where the second equality comes from the Lipschitz continuity of g (see Assumption 4). Note that the middle part of the above expression,

\left\langle\operatorname{grad}f(x),\mathcal{T}_{R_{d^{k}}}^{-1}\xi_{k}\right\rangle_{x}+g\left(y_{k}+\xi_{k}\right) (28)

can be interpreted as a simple local model of \ell_{x}\left(d^{k}+\mathcal{T}_{R_{d^{k}}}^{-1}\xi_{k}\right). Thus, to obtain a new estimate from d^{k}, we first compute a search direction \xi_{k}^{*} by minimizing (28) over \mathrm{T}_{y_{k}}\mathcal{M}, and then update d^{k} along the direction \mathcal{T}_{R_{d^{k}}}^{-1}\xi_{k}^{*}, as described in Algorithm 2.

It is easy to see that Algorithm 2 is an application of the Riemannian GCG method to problem (26) with a proper retraction. Moreover, note that (29) can be solved by the subgradient method or the proximal gradient method.

Algorithm 2 Solving the subproblem

Require: An initial iterate d^{0}\in T_{x}\mathcal{M}; a small positive constant \sigma;
1: for k=0,1,\ldots do
2:  y_{k}=R_{x}\left(d^{k}\right)
3:  Compute \xi_{k}^{*} by solving

\xi_{k}^{*}=\arg\min_{\xi\in\mathrm{T}_{y_{k}}\mathcal{M}}\left\langle\mathcal{T}_{R_{d^{k}}}^{-\sharp}\left(\operatorname{grad}f(x)\right),\xi\right\rangle_{y_{k}}+g\left(y_{k}+\xi\right); (29)

4:  \alpha=1
5:  while \ell_{x}\left(d^{k}+\alpha\mathcal{T}_{R_{d^{k}}}^{-1}\xi_{k}^{*}\right)\geq\ell_{x}\left(d^{k}\right)-\sigma\alpha\left\|\xi_{k}^{*}\right\|_{x}^{2} do
6:   \alpha=\frac{1}{2}\alpha;
7:  end while
8:  d^{k+1}=d^{k}+\alpha\mathcal{T}_{R_{d^{k}}}^{-1}\xi_{k}^{*};
9: end for
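For reference, the following Python sketch mirrors the structure of Algorithm 2. It is only a skeleton under assumptions: the manifold is embedded in \mathbb{R}^{n} so that tangent vectors are ambient arrays (Assumption 5), and the problem-specific oracles retract, transport_inv, transport_inv_adjoint, solve_model (which solves (29), e.g., by a subgradient or proximal gradient method as noted above), and ell_x are supplied by the user.

import numpy as np

def solve_subproblem(x, grad_fx, ell_x, retract, transport_inv, transport_inv_adjoint,
                     solve_model, d0, sigma=1e-4, outer_iters=20):
    # Skeleton of Algorithm 2: successive minimization of the local model (28)
    # followed by a backtracking line search on ell_x.
    d = d0
    for _ in range(outer_iters):
        y = retract(x, d)                          # y_k = R_x(d^k)
        v = transport_inv_adjoint(d, grad_fx)      # T_{R_{d^k}}^{-sharp} grad f(x)
        xi = solve_model(y, v)                     # approximate minimizer of (29) on T_{y_k} M
        step = transport_inv(d, xi)                # T_{R_{d^k}}^{-1} xi_k^*
        alpha = 1.0
        while ell_x(d + alpha * step) >= ell_x(d) - sigma * alpha * np.dot(xi, xi):
            alpha *= 0.5
            if alpha < 1e-12:                      # safeguard against endless backtracking
                break
        d = d + alpha * step
    return d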

6 Accelerated conditional gradient methods

Now we consider an accelerated version of the RGCG method. Based on accelerated gradient methods, Li et al. [4] established links between the subproblems of conditional gradient methods and the notion of momentum. They proposed a momentum variant of the CG method and proved that it converges at a fast rate of \mathcal{O}(1/k^{2}). However, the extension of this work to the Riemannian case is not trivial, even though there are works on Nesterov-type accelerated methods on Riemannian manifolds [27]. Based on these existing works, we propose here an accelerated CG method on Riemannian manifolds for composite functions.

Algorithm 3 Riemannian accelerated GCG method

Step 0. Initialization: Choose x^{0}=p_{0}\in\mathcal{X}, set \tau^{k}=\frac{2}{k+3}, and initialize k=0, \beta_{0}=\mathbf{0}, d^{0}=\mathbf{0}.

Step 1.

y_{k}=R_{x^{k}}(\tau^{k}R^{-1}_{x^{k}}(d^{k}))
\beta_{k+1}=\mathcal{T}_{y_{k}}(1-\tau^{k})\beta_{k}+\lambda_{k}\mathrm{grad}f(y_{k})

Step 2. Compute an optimal solution p_{k+1} and the optimal value \theta(x^{k}) as

p_{k+1}=\arg\min_{u\in\mathcal{X}}\left\langle\beta_{k+1},R_{y_{k}}^{-1}(u)\right\rangle_{y_{k}}+g(u)-g(y_{k}),
d^{k+1}=R_{y_{k}}^{-1}(p_{k+1}).

Step 3. Stopping criterion: If \theta(x^{k})=0, then stop.

Step 4. Compute the iterate: Set

x^{k+1}=R_{x^{k}}(\lambda_{k}R^{-1}_{x^{k}}(p_{k+1})).

Step 5. Beginning a new iteration: Set k=k+1 and go to Step 1.

Analyzing the convergence and convergence rate for the above accelerated method is challenging, primarily due to the inherent nonconvexity of the problem and the complexities introduced by transporting tangent vectors. A comprehensive analysis of the accelerated method is beyond the scope of this paper. However, the performance of the accelerated method compared to the non-accelerated method is shown in the numerical experiments in Section 7.

7 Numerical experiments

Sparse principal component analysis (Sparse PCA) plays a crucial role in high-dimensional data analysis by enhancing the interpretability and efficiency of dimensionality reduction. Unlike traditional PCA, which often produces complex results in high-dimensional settings, Sparse PCA introduces sparsity to refine data representation and improve clarity. This is particularly relevant in fields such as bioinformatics, image processing, and signal analysis, where interpretability and reconstruction accuracy are essential. In this section, we consider Sparse PCA problems to check the validity of our proposed methods. We explore two models for Sparse PCA as in [20], where we set \mathcal{X}=\mathcal{M} in each problem.

7.1 Sparse PCA on the sphere manifold

Now consider problems on the sphere manifold \mathbb{S}^{n-1}=\{x\in\mathbb{R}^{n}:\|x\|=1\}. The specific optimization problem is formulated as follows:

\min_{x\in\mathbb{S}^{n-1}}-x^{T}A^{T}Ax+\lambda\|x\|_{1}, (30)

where \left\|\,\cdot\,\right\|_{1} denotes the L_{1}-norm and \lambda>0 is a parameter. Here, we use the exponential map as the retraction, i.e.,

\operatorname{Exp}_{x}\left(\eta_{x}\right)=x\cos\left(\left\|\eta_{x}\right\|\right)+\eta_{x}\sin\left(\left\|\eta_{x}\right\|\right)/\left\|\eta_{x}\right\|,\quad x\in\mathbb{S}^{n-1},\quad\eta_{x}\in\mathrm{T}_{x}\mathbb{S}^{n-1}.

Its inverse is given by the logarithm map on the unit sphere \mathbb{S}^{n-1} [20], i.e.,

\log_{x}(y)=\operatorname{Exp}_{x}^{-1}(y)=\frac{\cos^{-1}\left(x^{T}y\right)}{\sqrt{1-\left(x^{T}y\right)^{2}}}\left(I_{n}-xx^{T}\right)y,\quad x,y\in\mathbb{S}^{n-1}.
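The following Python sketch (an illustration; the Riemannian gradient is obtained, as is standard for embedded submanifolds with the induced metric, by projecting the Euclidean gradient of f(x)=-x^{T}A^{T}Ax onto the tangent space) implements these maps and checks numerically that \operatorname{Exp}_{x} and \log_{x} are inverse to each other.

import numpy as np

def sphere_exp(x, eta):
    # Exponential map on S^{n-1}, as given above.
    n = np.linalg.norm(eta)
    if n < 1e-16:
        return x.copy()
    return np.cos(n) * x + np.sin(n) * eta / n

def sphere_log(x, y):
    # Inverse exponential map log_x(y) on S^{n-1}, as given above.
    c = np.clip(np.dot(x, y), -1.0, 1.0)
    p = y - c * x                              # (I - x x^T) y
    if np.linalg.norm(p) < 1e-16:
        return np.zeros_like(x)
    return (np.arccos(c) / np.sqrt(1.0 - c ** 2)) * p

def riemannian_grad_f(x, A):
    # f(x) = -x^T A^T A x; project the Euclidean gradient -2 A^T A x onto T_x S^{n-1}.
    eg = -2.0 * (A.T @ (A @ x))
    return eg - np.dot(x, eg) * x

rng = np.random.default_rng(2)
A = rng.standard_normal((10, 6))
x = rng.standard_normal(6)
x /= np.linalg.norm(x)
y = sphere_exp(x, 0.3 * riemannian_grad_f(x, A))
print(np.allclose(sphere_exp(x, sphere_log(x, y)), y))   # Exp_x(log_x(y)) recovers y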

We now introduce an important result.

Lemma 7.1.

[20, Lemma D.1] For any xnx\in\mathbb{R}^{n} and λ>0\lambda>0, the minimizer of the optimization problem

miny𝕊n112λyx2+y1\min_{y\in\mathbb{S}^{n-1}}\frac{1}{2\lambda}\|y-x\|^{2}+\|y\|_{1}

is given by

y={z/z, if z0,sign(ximax)eimax, otherwise, y_{*}=\begin{cases}z/\|z\|,&\text{ if }\|z\|\neq 0,\\ \operatorname{sign}\left(x_{i_{\max}}\right)e_{i_{\max}},&\text{ otherwise, }\end{cases}

where imaxi_{\max} is the index of the largest magnitude entry of xx (break ties arbitrarily), eie_{i} denotes the ii-th column in the canonical basis of n\mathbb{R}^{n}, and zz is defined by

zi={0, if |xi|λ,xiλ, if xi>λ,xi+λ, if xi<λ.z_{i}=\begin{cases}0,&\text{ if }\left|x_{i}\right|\leq\lambda,\\ x_{i}-\lambda,&\text{ if }x_{i}>\lambda,\\ x_{i}+\lambda,&\text{ if }x_{i}<-\lambda.\end{cases}
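
The minimizer in Lemma 7.1 amounts to soft-thresholding followed by normalization; a direct translation into NumPy might look as follows (the function name below is illustrative).

```python
import numpy as np

def sphere_l1_prox(x, lam):
    """Closed-form minimizer from Lemma 7.1 (illustrative implementation)."""
    z = np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)   # soft-thresholding of x
    nrm = np.linalg.norm(z)
    if nrm > 0.0:
        return z / nrm
    # all entries of x lie in [-lam, lam]: return a signed canonical basis vector
    i_max = int(np.argmax(np.abs(x)))
    y = np.zeros_like(x)
    y[i_max] = np.sign(x[i_max]) if x[i_max] != 0 else 1.0
    return y
```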

The subproblem (4), applied to (30), can be written as

miny𝕊n1u(y),where u(y)=1λgradf(x),logx(y)+y1.\min_{y\in\mathbb{S}^{n-1}}u(y),\quad\text{where }u(y)=\frac{1}{\lambda}\langle\mathrm{grad}f(x),\log_{x}(y)\rangle+\|y\|_{1}. (31)

Let us now see how to compute its solution. Let $h(y)=\frac{1}{\lambda}\langle\operatorname{grad}f(x),\log_{x}(y)\rangle$, and let $y^{k}$ denote the current estimate of the minimizer of $u(y)$ over the unit sphere. The next estimate, $y^{k+1}$, is obtained by solving the following optimization problem:

miny𝕊n1h(yk)+h(yk)T(yyk)+y1,\min_{y\in\mathbb{S}^{n-1}}h(y^{k})+\nabla h(y^{k})^{T}(y-y^{k})+\|y\|_{1},

which is equivalent to

miny𝕊n1h(yk)Ty+y1.\min_{y\in\mathbb{S}^{n-1}}\nabla h(y^{k})^{T}y+\|y\|_{1}.

In other words, h(y)h(y) is approximated by its first-order Taylor expansion around yky^{k} at each iteration. To solve the above problem, note that since y=1\|y\|=1 for all y𝕊n1y\in\mathbb{S}^{n-1}, the problem is further equivalent to:

miny𝕊n112y+h(yk)2+y1.\min_{y\in\mathbb{S}^{n-1}}\frac{1}{2}\left\|y+\nabla h(y^{k})\right\|^{2}+\|y\|_{1}. (32)

The above problem can then be solved in closed form using Lemma 7.1.
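
A minimal sketch of this inner iteration is given below. It assumes the helper sphere_l1_prox from the previous sketch and takes the Euclidean gradient of $h$ as an input, since its closed form is not derived here; note that (32) is exactly Lemma 7.1 with $\lambda=1$ and $x=-\nabla h(y^{k})$.

```python
import numpy as np

def solve_sphere_subproblem(grad_h, y0, n_iter=5):
    """Solve (31) by repeatedly solving the linearized problem (32).

    grad_h : callable returning the Euclidean gradient of h at the current y
             (taken as an input, since its closed form is not derived here).
    y0     : initial point on the unit sphere.
    """
    y = np.asarray(y0, dtype=float)
    for _ in range(n_iter):
        # (32) is Lemma 7.1 with lambda = 1 and x = -grad h(y^k)
        y = sphere_l1_prox(-grad_h(y), 1.0)
    return y
```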

7.2 Sparse PCA on the Stiefel manifold

Now consider sparse PCA on the Stiefel manifold. The specific optimization problem is formulated as follows:

\min_{X\in\operatorname{st}(p,n)}-\operatorname{trace}\left(X^{T}A^{T}AX\right)+\lambda\|X\|_{1}, (33)

where st(p,n)={Xn×pXTX=Ip}\operatorname{st}(p,n)=\left\{X\in\mathbb{R}^{n\times p}\mid X^{T}X=I_{p}\right\} denotes the Stiefel manifold, and λ>0\lambda>0. Moreover, the term trace(XTATAX)-\operatorname{trace}\left(X^{T}A^{T}AX\right) aims to maximize the variance of the data, X1\|X\|_{1} characterizes the sparsity of the principal components, and AA is the original data matrix. The tangent space is defined as

TXst(p,n)={ηn×pXTη+ηTX=0}.\mathrm{T}_{X}\operatorname{st}(p,n)=\left\{\eta\in\mathbb{R}^{n\times p}\mid X^{T}\eta+\eta^{T}X=0\right\}.

Here we use the Euclidean metric ηX,ξXX=trace(ηXTξX)\left\langle\eta_{X},\xi_{X}\right\rangle_{X}=\operatorname{trace}(\eta_{X}^{T}\xi_{X}) as the Riemannian metric. We also use the polar decomposition as the retraction, i.e.,

RX(ηX)=(X+ηX)(Ip+ηXTηX)1/2.R_{X}\left(\eta_{X}\right)=\left(X+\eta_{X}\right)\left(I_{p}+\eta_{X}^{T}\eta_{X}\right)^{-1/2}.
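
A possible NumPy implementation of this polar retraction, using an eigenvalue decomposition of the symmetric positive definite matrix $I_{p}+\eta_{X}^{T}\eta_{X}$ for the inverse matrix square root, is sketched below.

```python
import numpy as np

def polar_retraction(X, eta):
    """Polar retraction on the Stiefel manifold: R_X(eta) = (X + eta)(I_p + eta^T eta)^{-1/2}."""
    p = X.shape[1]
    M = np.eye(p) + eta.T @ eta              # symmetric positive definite
    w, Q = np.linalg.eigh(M)                 # M = Q diag(w) Q^T
    M_inv_sqrt = Q @ np.diag(1.0 / np.sqrt(w)) @ Q.T
    return (X + eta) @ M_inv_sqrt
```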

The vector transport by differentiated retraction is given in [28, Lemma 10.2.1] by

𝒯ηXξX=YΩ+(InYYT)ξX(YT(X+ηX))1,\mathcal{T}_{\eta_{X}}\xi_{X}=Y\Omega+\left(I_{n}-YY^{T}\right)\xi_{X}\left(Y^{T}\left(X+\eta_{X}\right)\right)^{-1},

where Y=RX(ηX)Y=R_{X}\left(\eta_{X}\right) and Ω\Omega is the solution of the Sylvester equation

(YT(X+ηX))Ω+Ω(YT(X+ηX))=YTξXξXTY.\left(Y^{T}\left(X+\eta_{X}\right)\right)\Omega+\Omega\left(Y^{T}(X+\eta_{X})\right)=Y^{T}\xi_{X}-\xi_{X}^{T}Y.
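
Assuming the polar_retraction sketch above, this transport can be implemented as follows; the Sylvester equation is solved with scipy.linalg.solve_sylvester.

```python
import numpy as np
from scipy.linalg import solve_sylvester

def transport_diff_polar(X, eta, xi):
    """Vector transport by the differentiated polar retraction (formulas above)."""
    Y = polar_retraction(X, eta)                    # Y = R_X(eta)
    S = Y.T @ (X + eta)                             # S = Y^T (X + eta)
    # solve S*Omega + Omega*S = Y^T xi - xi^T Y
    Omega = solve_sylvester(S, S, Y.T @ xi - xi.T @ Y)
    n = X.shape[0]
    return Y @ Omega + (np.eye(n) - Y @ Y.T) @ xi @ np.linalg.inv(S)
```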

For ease of notation, we define the operator 𝒜k\mathcal{A}_{k} by 𝒜k(V):=VTXk+XkTV\mathcal{A}_{k}(V):=V^{T}X_{k}+X_{k}^{T}V and rewrite the subproblem (29) as

Vk:=argmin𝑉\displaystyle V_{k}:=\underset{V}{\operatorname{argmin}} 𝒯Rdkgradf(Xk),V+g(Xk+V)\displaystyle\left\langle\mathcal{T}_{R_{d^{k}}}^{-\sharp}\mathrm{grad}f\left(X_{k}\right),V\right\rangle+g\left(X_{k}+V\right)
s.t. 𝒜k(V)=0.\displaystyle\mathcal{A}_{k}(V)=0.

This can be solved by the subgradient method or the proximal gradient method. For the subgradient method, note that a subgradient of $g(X)$ can be calculated through the function $m(X)$, which is defined elementwise as:

  1. $m_{ij}(X)=\lambda$ if $(X)_{ij}>0$,
  2. $m_{ij}(X)=-\lambda$ if $(X)_{ij}<0$,
  3. $m_{ij}(X)\in[-\lambda,\lambda]$ if $(X)_{ij}=0$,

where $m_{ij}(X)$ denotes the $(i,j)$-th entry of $m(X)$. A subgradient of the objective of the above problem can then be written as $\mathcal{T}_{R_{d^{k}}}^{-\sharp}\mathrm{grad}f(X_{k})+\lambda\operatorname{sign}(X_{k}+V_{k}).$
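
A sketch of this subgradient computation, together with one projected-subgradient step in which the iterate is projected back onto the subspace $\{V:\mathcal{A}_{k}(V)=0\}$ (the standard tangent-space projection, an implementation choice not detailed above), might look as follows.

```python
import numpy as np

def l1_subgradient(X, lam):
    """One element m(X) of the subdifferential of lam*||X||_1 (0 is chosen where X_ij = 0)."""
    return lam * np.sign(X)

def tangent_projection(X, V):
    """Euclidean projection onto {V : X^T V + V^T X = 0}, the tangent space at X."""
    S = X.T @ V
    return V - X @ (S + S.T) / 2.0

def subgradient_step(X, V, transported_grad, lam, step):
    """One projected-subgradient step for the subproblem above (illustrative)."""
    sub = transported_grad + l1_subgradient(X + V, lam)  # subgradient of the objective in V
    return tangent_projection(X, V - step * sub)
```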

7.3 Numerical results

We implemented the algorithms in Python 3.9.12, and the experiments were carried out on a MacBook Pro with an Apple M1 Pro chip and 16GB of RAM.

7.3.1 Sphere experiment

For the example (30), we conducted experiments on an $n$-dimensional problem, using a randomly generated, standardized $n\times n$ matrix $A$ and a regularization parameter $\lambda=0.1$. We evaluate the Armijo step size with parameters $\zeta=0.1$, $\omega_{1}=0.05$, and $\omega_{2}=0.95$. Each experiment is run 10 times on the same problem with different initial points, and stops if $|\theta(x^{k})|$ is less than $10^{-4}$ or if the difference between the latest function value and the function value from five iterations earlier is less than $10^{-4}$. The experiments were conducted on spheres of dimension 10, 100, and 1000. We observed that solving the subproblems typically converges after just one iteration, suggesting that the local model provides an effective approximation of the original subproblem.

As shown in Table 1 and Figure 1, the results demonstrated that the Armijo step-size strategy consistently outperformed the others in terms of computational time and required iterations to achieve function value convergence. Overall, our findings suggest that the Armijo step-size strategy offers significant advantages in efficiency and speed, particularly in high-dimensional settings.

We also examine the performance of the different step-size strategies (Armijo, adaptive, and diminishing) under varying values of $\lambda$. Across all values of $\lambda$ in Table 2, the Armijo step size consistently outperforms the other methods in terms of both time and iterations. Specifically, at $\lambda=0.1$, the Armijo method achieves the fastest convergence, taking an average of 0.3505 seconds and 53.7 iterations. As $\lambda$ increases, the performance gap between Armijo and the other methods becomes more pronounced. For $\lambda=0.5$, Armijo converges in 0.3153 seconds and 44.8 iterations, while the diminishing strategy requires 2.0290 seconds and 530.2 iterations, demonstrating its inefficiency for larger $\lambda$ values. Similarly, the adaptive step size shows intermediate performance, but its time and iteration counts remain higher than those of Armijo. Overall, the results suggest that the Armijo step size is the most robust and efficient strategy across different values of $\lambda$, while the diminishing method struggles as the regularization parameter grows.

7.3.2 Comparison with the accelerated method

We consider the same problem and compare the performance of the accelerated algorithm with the non-accelerated version. To better observe the convergence rate and the differences between the various step sizes, we run 50 iterations. As shown in Figure 2, the accelerated algorithm with the Armijo and diminishing step sizes generally outperforms the non-accelerated algorithm, particularly its adaptive and diminishing variants. This behavior is similar to that of the method in [4]: the acceleration indeed speeds up the decrease in the early iterations.

Table 3 summarizes the results of 10 random runs of this comparison on the sphere with $\lambda=0.1$. Across all dimensions in Table 3, the RGCG method combined with the Armijo step size consistently achieves the lowest computation time. For instance, with $n=1000$, RGCG + Armijo takes 8.47 seconds and 346.8 iterations, outperforming the adaptive and diminishing strategies, which require more time and iterations.

The accelerated method shows similar trends, with the Armijo step size yielding better results than the adaptive and diminishing strategies. However, for higher dimensions, RGCG with the Armijo step size remains faster overall. The diminishing step size leads, in both methods, to significantly longer runtimes and higher iteration counts, especially as $n$ increases. Nevertheless, for $n=1000$ and $n=2000$, the accelerated method with the diminishing step size requires less time than RGCG with the same step size, indicating that the accelerated method benefits more from the diminishing step size in higher dimensions. In summary, the Armijo step size proves to be the most efficient strategy, particularly when paired with the RGCG method, making it the preferred choice for both low- and high-dimensional problems.

Figure 1: Example (30) with fixed $A$, different $n$, and $\lambda=0.1$. Panels (a)-(c) show the function value, (d)-(f) the theta value, and (g)-(i) the CPU time, for $n=10$, $100$, and $1000$.
Table 1: Average results of 10 random runs for sparse PCA on the Sphere manifold with λ=0.1\lambda=0.1 and fixed AA
n     Step size    Time (seconds)  Iterations
10    Armijo       0.0152          14.50
10    Adaptive     0.0208          31.30
10    Diminishing  0.0491          88.80
100   Armijo       0.0771          32.20
100   Adaptive     0.1761          78.70
100   Diminishing  0.3113          244.60
1000  Armijo       8.3842          344.60
1000  Adaptive     17.4690         921.00
1000  Diminishing  13.1272         929.70
Table 2: Average results of 10 random runs for sparse PCA on the Sphere manifold with varying λ\lambda and fixed AA, n=500n=500
λ     Step size    Time (seconds)  Iterations
0.1   Armijo       0.3505          53.70
0.1   Adaptive     0.7006          136.10
0.1   Diminishing  2.2043          522.90
0.3   Armijo       0.3987          51.60
0.3   Adaptive     0.6094          111.70
0.3   Diminishing  1.6226          381.80
0.5   Armijo       0.3153          44.80
0.5   Adaptive     0.5716          111.30
0.5   Diminishing  2.0290          530.20
1.0   Armijo       0.3016          46.50
1.0   Adaptive     0.5137          97.30
1.0   Diminishing  1.8412          465.50
Figure 2: Example (30) with fixed $A$, $\lambda=0.1$, and different $n$, comparing RGCG and the accelerated method. Panels (a)-(c) show the function value and (d)-(f) the theta value, for $n=10$, $100$, and $1000$.
Table 3: Comparison of Accelerated and RGCG Methods: Average Results of 10 Random Runs for Sparse PCA on Sphere Manifolds with λ=0.1\lambda=0.1
n     Method + Step size         Time (seconds)  Iterations
10    RGCG + Armijo              0.0173          15.30
10    RGCG + Adaptive            0.0288          36.10
10    RGCG + Diminishing         0.0796          109.50
10    Accelerated + Armijo       0.0496          36.80
10    Accelerated + Adaptive     0.0507          61.30
10    Accelerated + Diminishing  0.1137          107.30
100   RGCG + Armijo              0.1069          32.40
100   RGCG + Adaptive            0.1670          83.00
100   RGCG + Diminishing         0.6486          295.50
100   Accelerated + Armijo       0.2629          85.30
100   Accelerated + Adaptive     0.3232          169.30
100   Accelerated + Diminishing  0.9453          368.50
500   RGCG + Armijo              0.3750          56.70
500   RGCG + Adaptive            0.8563          160.50
500   RGCG + Diminishing         2.4350          605.30
500   Accelerated + Armijo       1.3190          166.00
500   Accelerated + Adaptive     2.4168          385.60
500   Accelerated + Diminishing  2.8703          417.10
1000  RGCG + Armijo              8.4704          346.80
1000  RGCG + Adaptive            17.8079         830.50
1000  RGCG + Diminishing         14.0213         908.70
1000  Accelerated + Armijo       8.6065          255.30
1000  Accelerated + Adaptive     13.2967         687.70
1000  Accelerated + Diminishing  12.1202         445.60
2000  RGCG + Armijo              14.6191         160.20
2000  RGCG + Adaptive            21.9078         411.40
2000  RGCG + Diminishing         50.3453         1106.70
2000  Accelerated + Armijo       19.0932         188.00
2000  Accelerated + Adaptive     47.8875         752.60
2000  Accelerated + Diminishing  25.3371         396.20

7.3.3 Stiefel experiment

For the example (33), we conduct experiments on an $(n,p)$-dimensional problem, using a randomly generated, standardized symmetric $n\times n$ matrix $A$ and a regularization parameter $\lambda=0.1$. As in the previous experiment, we evaluate the Armijo step size with parameters $\zeta=0.1$, $\omega_{1}=0.05$, and $\omega_{2}=0.95$. Each experiment is run 10 times on the same problem with different initial points, and stops when $|\theta(x^{k})|$ or the change in the function value falls below $10^{-4}$. The maximum number of iterations allowed for solving the subproblems is set to 2. The performance of the RGCG method is analyzed by plotting the convergence behavior of the objective values and of $\theta$ over the iterations, and by reporting average performance in tables.

The experimental results for sparse PCA on the Stiefel manifold (Table 4 and Figure 3) show clear differences in performance between the three step-size strategies across various problem sizes. For smaller problem dimensions (e.g., $n=100$ and $n=200$), the Armijo step size consistently achieves the fewest iterations, indicating its efficiency in reaching convergence faster. In contrast, the diminishing step size, while stable, requires significantly more iterations and time, especially as the problem dimension increases.

For higher dimensions (n=500n=500 and n=1000n=1000), the adaptive step size shows a competitive balance between time and iterations, especially compared to the diminishing one, which becomes increasingly less efficient. However, Armijo continues to outperform both in terms of iterations, making it the preferred choice for high-dimensional problems where computational efficiency is critical. Overall, the Armijo step size demonstrates superior performance in both time and iteration count, particularly for larger problem sizes. The diminishing step size, while robust, is less efficient for larger problems, indicating that it may not be suitable for high-dimensional optimization tasks.

In Table 5, we see a similar trend, with the Armijo step size consistently performing well across varying values of $\lambda$. At $\lambda=0.1$, Armijo converges in 0.6839 seconds and 103.3 iterations, slightly slower than the adaptive method in terms of time (0.5516 seconds) but significantly faster than the diminishing strategy (1.3375 seconds and 409.9 iterations). For other values of $\lambda$, Armijo maintains its competitive performance. For $\lambda=1.0$, it achieves an average of 0.9943 seconds and 147.8 iterations, compared to 1.7331 seconds and 514.4 iterations for the diminishing step size. While the adaptive strategy generally outperforms the diminishing one, it remains less efficient than Armijo in terms of iterations.

Table 4: Average results of 10 random runs for sparse PCA on the Stiefel manifold with λ=0.1\lambda=0.1 and fixed AA
(p, n)      Step size    Time (seconds)  Iterations
(10, 100)   Armijo       0.2344          104.7
(10, 100)   Adaptive     0.2287          233.0
(10, 100)   Diminishing  0.1775          192.4
(10, 200)   Armijo       0.2729          59.2
(10, 200)   Adaptive     0.4160          242.8
(10, 200)   Diminishing  0.5557          347.0
(10, 500)   Armijo       1.8151          115.8
(10, 500)   Adaptive     1.5056          229.5
(10, 500)   Diminishing  5.5653          827.5
(10, 1000)  Armijo       11.9895         219.0
(10, 1000)  Adaptive     12.6799         459.3
(10, 1000)  Diminishing  44.2100         1630.5
Table 5: Average results of 10 random runs for sparse PCA on the Stiefel manifold with varying λ\lambda and fixed AA, n=200,p=10n=200,p=10
λ     Step size    Time (seconds)  Iterations
0.1   Armijo       0.6839          103.3
0.1   Adaptive     0.5516          171.9
0.1   Diminishing  1.3375          409.9
0.3   Armijo       1.0916          166.6
0.3   Adaptive     0.6786          206.7
0.3   Diminishing  1.4409          442.2
0.5   Armijo       0.9410          136.0
0.5   Adaptive     0.5557          167.9
0.5   Diminishing  1.3503          402.8
1.0   Armijo       0.9943          147.8
1.0   Adaptive     0.5600          170.1
1.0   Diminishing  1.7331          514.4
Figure 3: Example (33) with fixed $A$, $\lambda=0.1$, and different $n$ and $p$. Panels (a)-(c) show the function value, (d)-(f) the theta value, and (g)-(i) the CPU time, for $(n,p)=(100,10)$, $(500,10)$, and $(1000,10)$.

The overall results demonstrate that both the adaptive and Armijo step-size methods outperform the diminishing step size in terms of convergence speed and computational time. Specifically, the adaptive strategy often achieves the lowest runtime, while the Armijo step size requires the fewest iterations in the majority of experiments. The performance of the algorithms is significantly influenced by the experimental parameters. Notably, as the dimensions $n$ and $p$ increase, the computational time and the number of iterations of all methods increase, but the adaptive and Armijo step sizes still perform better than the diminishing one. We also observed that the diminishing step-size method slows down considerably as it approaches a stationary point: while it can initially reduce the objective function value quickly, its decreasing step size results in very small updates near the solution, leading to a slower convergence rate. Although all step-size strategies converge effectively for smaller problems, the choice of an appropriate step size becomes crucial for larger ones. These experiments suggest that, for optimization on the Stiefel manifold, the adaptive and Armijo step-size methods are generally more effective than the diminishing one, and their use is recommended in practical applications.

Through all the experiments, we observed that as the iteration point approaches the optimal solution, the descent slows down and requires more iterations to reach optimality. To address this issue, we might consider using a threshold or combining our approach with other methods to improve convergence.

8 Conclusion and future works

In this work, we proposed novel conditional gradient methods specifically designed for composite function optimization on Riemannian manifolds. Our investigation focused on Armijo, adaptive, and diminishing step-size strategies. We proved that the adaptive and diminishing step-size strategies achieve a convergence rate of 𝒪(1/k)\mathcal{O}(1/k), while the Armijo step-size strategy results in an iteration complexity of J+𝒪(1/ϵ2)J+\mathcal{O}(1/\epsilon^{2}). In addition, under the assumption that the function gg is Lipschitz continuous, we derived an iteration complexity of 𝒪(1/ϵ2)\mathcal{O}(1/\epsilon^{2}) for the Armijo step size. Furthermore, we proposed a specialized algorithm to solve the subproblems in the presence of non-convexity, thereby broadening the applicability of our methods. Our discussions and analyses are not confined to specific retraction and transport operations, enhancing the applicability of our methods to a broader range of manifold optimization problems.

Also, from the results of the Stiefel manifold experiments, we observe that the convergence of the adaptive step size method tends to outperform the diminishing step size method. While the theoretical analysis previously focused on proving convergence rates through the diminishing step size, these results suggest that alternative step size strategies, such as adaptive methods, may lead to stronger performance. This opens the possibility of further exploration of different step size techniques to achieve better convergence rates in future studies.

Future work on manifold optimization could significantly benefit from the development and enhancement of accelerated conditional gradient methods. The momentum-guided Frank-Wolfe algorithm, as introduced by Li et al. [4], leverages Nesterov’s accelerated gradient method to enhance convergence rates on constrained domains. This approach could be adapted to various manifold settings to exploit their geometric properties more effectively. Additionally, Zhang et al. [5] proposed accelerating the Frank-Wolfe algorithm through the use of weighted average gradients, presenting a promising direction for improving optimization efficiency on manifolds. Future research could focus on combining these acceleration techniques with Riemannian geometry principles to develop robust, scalable algorithms for complex manifold-constrained optimization problems.

Acknowledgments

This work was supported by JST SPRING (JPMJSP2110), and Japan Society for the Promotion of Science, Grant-in-Aid for Scientific Research (C) (JP19K11840).

References

  • [1] M. Frank, P. Wolfe, An algorithm for quadratic programming, Naval Research Logistics 3 (1-2) (1956) 95–110.
  • [2] F. Bach, Duality between subgradient and conditional gradient methods, SIAM Journal on Optimization 25 (1) (2015) 115–129.
  • [3] M. Jaggi, Revisiting Frank-Wolfe: Projection-free sparse convex optimization, in: S. Dasgupta, D. McAllester (Eds.), Proceedings of the 30th International Conference on Machine Learning, Vol. 28 of Proceedings of Machine Learning Research, PMLR, Atlanta, Georgia, USA, 2013, pp. 427–435.
  • [4] B. Li, M. Coutino, G. B. Giannakis, G. Leus, A momentum-guided Frank-Wolfe algorithm, IEEE Transactions on Signal Processing 69 (2021) 3597–3611.
  • [5] Y. Zhang, B. Li, G. B. Giannakis, Accelerating Frank-Wolfe with weighted average gradients, in: ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2021, pp. 5529–5533.
  • [6] S. Lacoste-Julien, M. Jaggi, M. Schmidt, P. Pletscher, Block-coordinate Frank-Wolfe optimization for structural SVMs, in: International Conference on Machine Learning, PMLR, 2013, pp. 53–61.
  • [7] N. Dalmasso, R. Zhao, M. Ghassemi, V. Potluru, T. Balch, M. Veloso, Efficient event series data modeling via first-order constrained optimization, in: Proceedings of the Fourth ACM International Conference on AI in Finance, 2023, pp. 463–471.
  • [8] P. B. Assunção, O. P. Ferreira, L. F. Prudente, Conditional gradient method for multiobjective optimization, Computational Optimization and Applications 78 (2021) 741–768.
  • [9] P. B. Assunção, O. P. Ferreira, L. F. Prudente, A generalized conditional gradient method for multiobjective composite optimization problems, Optimization (2023) 1–31.
  • [10] A. G. Gebrie, E. H. Fukuda, Adaptive generalized conditional gradient method for multiobjective optimization, arXiv preprint arXiv:2404.04174 (2024).
  • [11] S. J. Wright, R. D. Nowak, M. A. Figueiredo, Sparse reconstruction by separable approximation, IEEE Transactions on Signal Processing 57 (7) (2009) 2479–2493.
  • [12] R. Tibshirani, Regression shrinkage and selection via the Lasso, Journal of the Royal Statistical Society Series B: Statistical Methodology 58 (1) (1996) 267–288.
  • [13] P. L. Combettes, V. R. Wajs, Signal recovery by proximal forward-backward splitting, Multiscale Modeling & Simulation 4 (4) (2005) 1168–1200.
  • [14] A. Beck, First-Order Methods in Optimization, SIAM, 2017.
  • [15] M. Fukushima, H. Mine, A generalized proximal point algorithm for certain non-convex minimization problems, International Journal of Systems Science 12 (8) (1981) 989–1000.
  • [16] N. Parikh, S. Boyd, Proximal algorithms, Foundations and Trends in Optimization 1 (3) (2014) 127–239.
  • [17] A. Beck, M. Teboulle, A fast iterative shrinkage-thresholding algorithm for linear inverse problems, SIAM Journal on Imaging Sciences 2 (1) (2009) 183–202.
  • [18] S. Boyd, N. Parikh, E. Chu, B. Peleato, J. Eckstein, Distributed optimization and statistical learning via the alternating direction method of multipliers, Foundations and Trends® in Machine Learning 3 (1) (2011) 1–122.
  • [19] S. Chen, S. Ma, A. Man-Cho So, T. Zhang, Proximal gradient method for nonsmooth optimization over the Stiefel manifold, SIAM Journal on Optimization 30 (1) (2020) 210–239.
  • [20] W. Huang, K. Wei, Riemannian proximal gradient methods, Mathematical Programming 194 (1) (2022) 371–413.
  • [21] M. Weber, S. Sra, Riemannian optimization via Frank-Wolfe methods, Mathematical Programming 199 (1) (2023) 525–556.
  • [22] Y. Yu, X. Zhang, D. Schuurmans, Generalized conditional gradient for sparse estimation, Journal of Machine Learning Research 18 (144) (2017) 1–46.
  • [23] R. Zhao, R. M. Freund, Analysis of the Frank-Wolfe method for convex composite optimization involving a logarithmically-homogeneous barrier, Mathematical Programming 199 (1) (2023) 123–163.
  • [24] Y.-M. Cheung, J. Lou, Efficient generalized conditional gradient with gradient sliding for composite optimization, in: Twenty-Fourth International Joint Conference on Artificial Intelligence, 2015.
  • [25] H. Sato, Riemannian conjugate gradient methods: General framework and specific algorithms with convergence analyses, SIAM Journal on Optimization 32 (4) (2022) 2690–2717.
  • [26] N. Boumal, An Introduction to Optimization on Smooth Manifolds, Cambridge University Press, 2023.
  • [27] F. Alimisis, A. Orvieto, G. Becigneul, A. Lucchi, Momentum improves optimization on Riemannian manifolds, in: International Conference on Artificial Intelligence and Statistics, PMLR, 2021, pp. 1351–1359.
  • [28] W. Huang, Optimization algorithms on Riemannian manifolds with applications, Ph.D. thesis, The Florida State University (2013).