
Riemannian conditional gradient methods for composite optimization problems

Kangming Chen, Ellen H. Fukuda
Abstract

In this paper, we propose Riemannian conditional gradient methods for minimizing composite functions, i.e., those that can be expressed as the sum of a smooth function and a geodesically convex one. We analyze the convergence of the proposed algorithms under three types of step-size strategies: adaptive, diminishing, and those based on the Armijo condition. We establish a convergence rate of \mathcal{O}(1/k) for the adaptive and diminishing step sizes, where k denotes the number of iterations. Additionally, we derive an iteration complexity of \mathcal{O}(1/\epsilon^{2}) for the Armijo step-size strategy to achieve \epsilon-optimality, where \epsilon is the optimality tolerance. Finally, the effectiveness of our algorithms is validated through numerical experiments performed on the sphere and Stiefel manifolds.

keywords:
Conditional gradient methods, Riemannian manifolds, composite optimization problems, Frank-Wolfe methods
journal: Not decided
Affiliation: Graduate School of Informatics, Kyoto University, Japan

1 Introduction

The conditional gradient (CG) method, also known as the Frank-Wolfe algorithm, is a first-order iterative method widely used for solving constrained nonlinear optimization problems. Developed by Frank and Wolfe in 1956 [1], the CG method addresses the problem of minimizing a convex function over a convex constraint set, particularly when the constraint set is difficult, in the sense that the projection onto it is computationally demanding.

The convergence rate of \mathcal{O}(1/k), where k denotes the number of iterations, for the conditional gradient method has been presented in several studies [2, 3]. Recent advancements in the field have focused on accelerating the method [4, 5]. The versatility of this method is also clear, since many applications can be found in the literature (see, for instance, [6, 7]). Furthermore, the CG method has recently been extended to multiobjective optimization [8, 9, 10].

Meanwhile, composite optimization problems, where the objective function can be decomposed into the sum of two functions, have become increasingly important in modern optimization theory and applications. Specifically, the objectives of these problems take the form F(x):=f(x)+g(x), where f is a smooth function and g is a possibly non-smooth and convex function. This structure arises naturally in many areas, including machine learning and signal processing, where f often represents a data fidelity term and g serves as a regularization term to promote desired properties such as sparsity or low rank [11, 12, 13].

To deal with such problems efficiently, various methods have been developed. Classical approaches include proximal gradient methods [14, 15, 16], which address the non-smooth term g using proximal operators, and accelerated variants such as FISTA [17]. Other methods, such as the alternating direction method of multipliers (ADMM) [18], decompose the problem further to enable parallel and distributed computation. For problems with specific structures or manifold constraints, methods like the Riemannian proximal gradient have been proposed in [19, 20], extending proximal techniques to non-Euclidean settings. Recently, the CG method on Riemannian manifolds has been proposed in [21], and the CG method for composite functions, called the generalized conditional gradient (GCG) method, has also been discussed in many works [9, 22, 23, 24].

In this paper, we propose GCG methods on Riemannian manifolds, with general retractions and vector transports. We also focus on three types of step-size strategies: adaptive, diminishing, and those based on the Armijo condition. We provide a detailed discussion on the implementation of subproblem solutions for each strategy. For each step-size approach, we analyze the convergence of the method. Specifically, for the adaptive and diminishing step-size strategies, we establish a convergence rate of \mathcal{O}(1/k), where k represents the number of iterations. For the Armijo step-size strategy, we derive an iteration complexity of \mathcal{O}(1/\epsilon^{2}), where \epsilon represents the desired accuracy in the optimality condition.

This paper is organized as follows. In Section 2, we review fundamental concepts and results related to Riemannian optimization. In Section 3, we introduce the generalized conditional gradient method on Riemannian manifolds with three different types of step size. Section 4 presents the convergence analysis, while Section 5 focuses on the discussion of the subproblem. Section 6 explores the accelerated version of the proposed method. Numerical experiments are presented in Section 7 to demonstrate the effectiveness of our approach. Finally, in Section 8, we conclude the paper and suggest potential directions for future research.

2 Preliminaries

This section summarizes essential definitions, results, and concepts fundamental to Riemannian optimization [25]. The transpose of a matrix A is denoted by A^{T}, and the identity matrix of dimension \ell is written as I_{\ell}. A Riemannian manifold \mathcal{M} is a smooth manifold equipped with a Riemannian metric, which is a smoothly varying inner product \langle\eta_{x},\sigma_{x}\rangle_{x}\in\mathbb{R} defined on the tangent space at each point x\in\mathcal{M}, where \eta_{x} and \sigma_{x} are tangent vectors in the tangent space T_{x}\mathcal{M}. The tangent space at a point x on the manifold \mathcal{M} is denoted by T_{x}\mathcal{M}, while the tangent bundle of \mathcal{M} is denoted by T\mathcal{M}:=\{(x,d)\mid d\in T_{x}\mathcal{M},x\in\mathcal{M}\}. The norm of a tangent vector \eta\in T_{x}\mathcal{M} is defined by \|\eta\|_{x}:=\sqrt{\langle\eta,\eta\rangle_{x}}. When the norm and the inner product are written as \|\cdot\| and \langle\cdot,\cdot\rangle, without the subscript, they are the Euclidean ones. For a map F\colon\mathcal{M}\rightarrow\mathcal{N} between two manifolds \mathcal{M} and \mathcal{N}, the derivative of F at a point x\in\mathcal{M}, denoted by \mathrm{D}F(x), is a linear map \mathrm{D}F(x)\colon T_{x}\mathcal{M}\rightarrow T_{F(x)}\mathcal{N} that maps tangent vectors from the tangent space of \mathcal{M} at x to the tangent space of \mathcal{N} at F(x). The Riemannian gradient \operatorname{grad}f(x) of a smooth function f\colon\mathcal{M}\rightarrow\mathbb{R} at a point x\in\mathcal{M} is defined as the unique tangent vector at x that satisfies \langle\operatorname{grad}f(x),\eta\rangle_{x}=\mathrm{D}f(x)[\eta] for every tangent vector \eta\in T_{x}\mathcal{M}. We also define the Whitney sum of the tangent bundle as T\mathcal{M}\oplus T\mathcal{M}:=\left\{(\xi,d)\mid\xi,d\in T_{x}\mathcal{M},x\in\mathcal{M}\right\}.

In Riemannian optimization, a retraction is employed to map points from the tangent space of a manifold back onto the manifold.

Definition 2.1.

[25] A smooth map R\colon T\mathcal{M}\rightarrow\mathcal{M} is called a retraction on a smooth manifold \mathcal{M} if the restriction of R to the tangent space T_{x}\mathcal{M} at any point x\in\mathcal{M}, denoted by R_{x}, satisfies the following conditions:

  1. R_{x}\left(0_{x}\right)=x,

  2. \mathrm{D}R_{x}\left(0_{x}\right)=\operatorname{id}_{T_{x}\mathcal{M}} for all x\in\mathcal{M},

where 0_{x} and \operatorname{id}_{T_{x}\mathcal{M}} are the zero vector of T_{x}\mathcal{M} and the identity map in T_{x}\mathcal{M}, respectively.

Depending on the problem and the specific properties of the manifold, different types of retractions can be utilized. Many Riemannian optimization algorithms employ a retraction that generalizes the exponential map on \mathcal{M}. For instance, assume that we have an iterative method generating iterates \{x^{k}\}. In Euclidean spaces, the update takes the form x^{k+1}=x^{k}+t_{k}d^{k}, while in the Riemannian case it is generalized as

x^{k+1}=R_{x^{k}}(t_{k}d^{k}),\quad\text{for }k=0,1,2,\ldots,

where d^{k} is a descent direction and t_{k} is a step size.
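As a concrete illustration of this update, the following Python sketch (an illustration only; it assumes the unit sphere \mathbb{S}^{n-1} and the projection-like retraction R_{x}(\eta)=(x+\eta)/\|x+\eta\|, one standard alternative to the exponential map) performs a single retraction-based step along a Riemannian steepest-descent direction.

import numpy as np

def sphere_retraction(x, eta):
    # Projection-like retraction on the unit sphere: R_x(eta) = (x + eta) / ||x + eta||.
    v = x + eta
    return v / np.linalg.norm(v)

def tangent_projection(x, v):
    # Orthogonal projection of an ambient vector v onto T_x S^{n-1} = {d : x^T d = 0}.
    return v - np.dot(x, v) * x

rng = np.random.default_rng(0)
x = rng.standard_normal(5)
x /= np.linalg.norm(x)
euclidean_grad = rng.standard_normal(5)        # stand-in for the Euclidean gradient of a smooth f
d = -tangent_projection(x, euclidean_grad)     # Riemannian steepest-descent direction at x
t = 0.1                                        # step size t_k
x_next = sphere_retraction(x, t * d)           # x^{k+1} = R_{x^k}(t_k d^k)
print(np.linalg.norm(x_next))                  # prints 1.0: the iterate stays on the manifold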

Another essential concept is the vector transport, which is critical in Riemannian optimization to maintain the intrinsic geometry of the manifold during computations.

Definition 2.2.

[25] A map \mathcal{T}\colon T\mathcal{M}\oplus T\mathcal{M}\rightarrow T\mathcal{M} is called a vector transport on \mathcal{M} if it satisfies the following conditions:

  1. There exists a retraction R on \mathcal{M} such that \mathcal{T}_{d}(\xi)\in T_{R_{x}(d)}\mathcal{M} for all x\in\mathcal{M} and \xi,d\in T_{x}\mathcal{M}.

  2. For any x\in\mathcal{M} and \xi\in T_{x}\mathcal{M}, \mathcal{T}_{0_{x}}(\xi)=\xi holds, where 0_{x} is the zero vector in T_{x}\mathcal{M}, i.e., \mathcal{T}_{0_{x}} is the identity map in T_{x}\mathcal{M}.

  3. For any a,b\in\mathbb{R}, x\in\mathcal{M}, and \xi,d,\zeta\in T_{x}\mathcal{M}, \mathcal{T}_{d}(a\xi+b\zeta)=a\mathcal{T}_{d}(\xi)+b\mathcal{T}_{d}(\zeta) holds, i.e., \mathcal{T}_{d} is a linear map from T_{x}\mathcal{M} to T_{R_{x}(d)}\mathcal{M}.

The adjoint operator of a vector transport \mathcal{T}, denoted by \mathcal{T}^{\sharp}, is a vector transport satisfying:

\langle\xi_{y},\mathcal{T}_{\eta_{x}}\zeta_{x}\rangle_{y}=\langle\mathcal{T}_{\eta_{x}}^{\sharp}\xi_{y},\zeta_{x}\rangle_{x}\quad\text{for all }\eta_{x},\zeta_{x}\in T_{x}\mathcal{M}\text{ and }\xi_{y}\in T_{y}\mathcal{M},

where y=R_{x}(\eta_{x}). The inverse operator of a vector transport, denoted by \mathcal{T}_{\eta_{x}}^{-1}, is a vector transport satisfying:

\mathcal{T}_{\eta_{x}}^{-1}\mathcal{T}_{\eta_{x}}=\mathrm{id}\quad\text{for all }\eta_{x}\in T_{x}\mathcal{M},

where \mathrm{id} is the identity operator.

Note that a map \mathcal{T} defined by \mathcal{T}_{d}(\xi):=\mathrm{P}_{\gamma_{x,d}}^{1\leftarrow 0}(\xi) is also a vector transport, where \mathrm{P}_{\gamma_{x,d}}^{1\leftarrow 0} is the parallel transport along the geodesic \gamma_{x,d}(t):=\operatorname{Exp}_{x}(td) connecting \gamma_{x,d}(0)=x and \gamma_{x,d}(1)=\operatorname{Exp}_{x}(d), with the exponential map \operatorname{Exp} as the retraction. Parallel transport is a specific case of vector transport that preserves the length and angle of vectors as they are transported along geodesics, thereby preserving the intrinsic Riemannian metric of the manifold exactly. In contrast, vector transport is a more general operation that moves vectors between tangent spaces in a way that only approximately preserves geometric properties, and it need not follow geodesics or maintain exact parallelism. If there is a unique geodesic between any two points in \mathcal{X}\subset\mathcal{M}, then the exponential map has an inverse \operatorname{Exp}_{x}^{-1}\colon\mathcal{X}\rightarrow T_{x}\mathcal{M}. This geodesic represents the unique shortest path, and the geodesic distance between x and y in \mathcal{X} is given by \left\|\operatorname{Exp}_{x}^{-1}(y)\right\|_{x}=\left\|\operatorname{Exp}_{y}^{-1}(x)\right\|_{y}.
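To make the distinction concrete, the following Python sketch (an illustration restricted to the unit sphere, using the exponential map that appears again in Section 7.1; the closed-form transport below is an assumption of this example, not a formula from the paper) implements parallel transport along a geodesic and checks numerically that the result lies in the correct tangent space and that the norm is preserved.

import numpy as np

def sphere_exp(x, eta):
    # Exponential map on the unit sphere S^{n-1}.
    n = np.linalg.norm(eta)
    if n < 1e-16:
        return x.copy()
    return np.cos(n) * x + np.sin(n) * eta / n

def sphere_parallel_transport(x, d, xi):
    # Parallel transport of xi in T_x S^{n-1} along t -> Exp_x(t d), evaluated at t = 1.
    nd = np.linalg.norm(d)
    if nd < 1e-16:
        return xi.copy()
    u = d / nd
    a = np.dot(u, xi)                  # component of xi along the geodesic direction
    return xi + a * ((np.cos(nd) - 1.0) * u - np.sin(nd) * x)

rng = np.random.default_rng(1)
x = rng.standard_normal(4)
x /= np.linalg.norm(x)
proj = lambda v: v - np.dot(x, v) * x  # projection onto T_x S^{n-1}
d, xi = proj(rng.standard_normal(4)), proj(rng.standard_normal(4))
y = sphere_exp(x, d)
xi_t = sphere_parallel_transport(x, d, xi)
print(abs(np.dot(y, xi_t)) < 1e-10)                             # transported vector lies in T_y S^{n-1}
print(abs(np.linalg.norm(xi_t) - np.linalg.norm(xi)) < 1e-10)   # the length is preserved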

Now, we define geodesically convex sets and functions, L-smoothness, and related concepts on Riemannian manifolds as follows:

Definition 2.3.

A set \mathcal{X} is called geodesically convex if for any x,y\in\mathcal{X}, there is a geodesic \gamma with \gamma(0)=x, \gamma(1)=y, and \gamma(t)\in\mathcal{X} for all t\in[0,1].

Definition 2.4.

A function h\colon\mathcal{M}\rightarrow\mathbb{R} is called geodesically convex if, for any p,q\in\mathcal{M}, we have h(\gamma(t))\leq(1-t)h(p)+th(q) for any t\in[0,1], where \gamma is the geodesic connecting p and q.

Proposition 2.1.

[21] Let h\colon\mathcal{M}\rightarrow\mathbb{R} be a smooth and geodesically convex function. Then, for any x,y\in\mathcal{M}, we have

h(y)-h(x)\geq\left\langle\operatorname{grad}h(x),R_{x}^{-1}(y)\right\rangle_{x}.

Considering the definition of geodesic convexity and defining the retraction through \gamma, we can easily obtain the following result.

Proposition 2.2.

Let h\colon\mathcal{M}\rightarrow\mathbb{R} be a smooth and geodesically convex function. Then, for any x,y\in\mathcal{M} and \lambda\in[0,1], we have

h(R_{x}(\lambda R_{x}^{-1}(y)))\leq(1-\lambda)h(x)+\lambda h(y).
Definition 2.5.

A function h\colon\mathcal{M}\rightarrow\mathbb{R} is called geodesically L-smooth, or said to have an L-Lipschitz continuous gradient, if there exists L>0 such that

\left\|\mathrm{grad}h(y)-\mathcal{T}\mathrm{grad}h(x)\right\|\leq L\,\mathrm{dist}(x,y)\quad\forall x,y\in\mathcal{M}, (1)

where \mathrm{dist}(x,y) is the geodesic distance between x and y, and \mathcal{T} is the vector transport from T_{x}\mathcal{M} to T_{y}\mathcal{M}. This is equivalent to

h(y)\leq h(x)+\left\langle\mathrm{grad}h(x),R^{-1}_{x}(y)\right\rangle_{x}+\frac{L}{2}\mathrm{dist}^{2}(x,y)\quad\forall x,y\in\mathcal{M}.

A similar concept is defined below.

Definition 2.6.

[20] A function h\colon\mathcal{M}\rightarrow\mathbb{R} is called L-retraction-smooth with respect to a retraction R in \mathcal{N}\subseteq\mathcal{M} if for any x\in\mathcal{N} and any \mathcal{S}_{x}\subseteq\mathrm{T}_{x}\mathcal{M} such that R_{x}\left(\mathcal{S}_{x}\right)\subseteq\mathcal{N}, we have

h\left(R_{x}(\eta)\right)\leq h(x)+\left\langle\operatorname{grad}h(x),\eta\right\rangle_{x}+\frac{L}{2}\left\|\eta\right\|_{x}^{2}\quad\forall\eta\in\mathcal{S}_{x}. (2)

3 Riemannian generalized conditional gradient method

In this paper, we consider the following Riemannian optimization problem:

\min\quad F(x):=f(x)+g(x) (3)
\text{s.t.}\quad x\in\mathcal{X},

where \mathcal{X}\subseteq\mathcal{M} is a compact geodesically convex set and F\colon\mathcal{M}\rightarrow\mathbb{R} is a composite function, with f\colon\mathcal{M}\rightarrow\mathbb{R} continuously differentiable and g\colon\mathcal{M}\rightarrow\mathbb{R} closed, geodesically convex, and lower semicontinuous (possibly nonsmooth). Here, we could also consider F as an extended-valued function; all the analyses would still apply, but we omit the infinite values for simplicity. We also make the following assumption.

Assumption 1.

The set \mathcal{X}\subseteq\mathcal{M} is a geodesically convex and complete set on which the retraction R and its inverse R^{-1} are well-defined and smooth. Additionally, the inverse retraction R^{-1}_{x}(y) is continuous with respect to both x and y.

As discussed in [26, Section 10.2], for general retractions, the map (x,y)\mapsto R_{x}^{-1}(y) can be defined smoothly, jointly in x and y, over some domain of \mathcal{M}\times\mathcal{M} containing all pairs (x,x). In this sense, we can say that the above assumption is reasonable.

Now, let us recall the CG method in the Euclidean space. Consider minimizing a convex function \tilde{f}\colon\mathbb{R}^{n}\rightarrow\mathbb{R} over a convex compact set \mathcal{X}\subseteq\mathbb{R}^{n}. At each iteration k, the method solves a linear approximation of the original problem by finding a direction s^{k} that minimizes the linearized objective function over the constraint set \mathcal{X}, i.e., s^{k}=\arg\min_{s\in\mathcal{X}}\langle\nabla\tilde{f}(x^{k}),s\rangle. Then, it updates the current point x^{k} using a step size \lambda_{k} along the direction s^{k}-x^{k}, i.e., x^{k+1}=x^{k}+\lambda_{k}(s^{k}-x^{k}). For composite functions of the form \tilde{f}(x)=\tilde{h}(x)+\tilde{g}(x), where \tilde{h} is smooth and \tilde{g} is convex, the CG method can be adapted to handle the non-smooth term \tilde{g}. The update step becomes:

x^{k+1}=\arg\min_{s\in\mathcal{X}}\langle\nabla\tilde{h}(x^{k}),s\rangle+\tilde{g}(s).

In this case, s\mapsto\langle\nabla\tilde{h}(x^{k}),s\rangle+\tilde{g}(s) is the new objective function for the subproblem, which remains convex due to the convexity of \tilde{g} and the linearity of \langle\nabla\tilde{h}(x^{k}),\cdot\rangle. The compactness and convexity of \mathcal{X} ensure that this subproblem has a solution. The step size \lambda_{k} can be chosen using various strategies. The CG method is particularly useful when the constraint set \mathcal{X} is not a simple set, making projection-based methods computationally expensive.
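As a toy illustration of this Euclidean scheme (a sketch under assumptions specific to this example: \tilde{h}(x)=\frac{1}{2}\|x-b\|^{2}, \tilde{g}(x)=\mu\|x\|_{1}, and the box \mathcal{X}=[-1,1]^{n}, for which the composite linearized subproblem has a coordinate-wise closed-form solution), the following Python sketch combines the subproblem with a convex-combination step and a diminishing step size, mirroring the structure of Algorithm 1 below.

import numpy as np

def composite_lmo_box(c, mu):
    # Solves min_{s in [-1,1]^n} <c, s> + mu * ||s||_1 coordinate-wise:
    # s_i = -sign(c_i) if |c_i| > mu, and s_i = 0 otherwise.
    s = -np.sign(c)
    s[np.abs(c) <= mu] = 0.0
    return s

def euclidean_gcg(b, mu, iters=200):
    # Generalized CG for min_{x in [-1,1]^n} 0.5*||x - b||^2 + mu*||x||_1 (toy instance).
    x = np.zeros_like(b)
    for k in range(iters):
        grad = x - b                      # gradient of the smooth part
        s = composite_lmo_box(grad, mu)   # solution of the composite linearized subproblem
        lam = 2.0 / (k + 2)               # diminishing step size
        x = x + lam * (s - x)             # convex-combination update
    return x

b = np.array([1.5, -0.3, 0.05, -2.0])
print(euclidean_gcg(b, mu=0.1))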

The exponential map, the parallel transport, and a specific step size are used in the Riemannian CG method proposed in [21]. We now extend this approach to a more general framework for composite optimization problems, without specifying the retraction and the vector transport.

Algorithm 1 Riemannian generalized conditional gradient method (RGCG)

Step 0. Initialization:

Choose x^{0}\in\mathcal{X} and initialize k=0.

Step 1. Compute the search direction:

Compute an optimal solution p(x^{k}) and the optimal value \theta(x^{k}) as

p(x^{k})=\arg\min_{u\in\mathcal{X}}\left\langle\mathrm{grad}f(x^{k}),R^{-1}_{x^{k}}(u)\right\rangle_{x^{k}}+g(u)-g(x^{k}), (4)
\theta(x^{k})=\left\langle\mathrm{grad}f(x^{k}),R^{-1}_{x^{k}}(p(x^{k}))\right\rangle_{x^{k}}+g(p(x^{k}))-g(x^{k}). (5)

Define the search direction by d(x^{k})=R^{-1}_{x^{k}}(p(x^{k})).
Step 2. Compute the step size:

Compute the step size \lambda_{k}.

Step 3. Update the iterates:

Update the current iterate x^{k+1}=R_{x^{k}}(\lambda_{k}d(x^{k})).

Step 4. Convergence check:

If a convergence criterion is met, stop; otherwise, set k=k+1 and return to Step 1.

Our method is presented in Algorithm 1. In Step 1, we need to solve the subproblem (4), which, similarly to the Euclidean case, is related to the linear approximation of the objective function. In Step 2, we compute the step size based on one of the strategies mentioned later. In Step 3, we update the iterate by using some retraction. Note that if u=x^{k} in (4), then \left\langle\mathrm{grad}f(x^{k}),R^{-1}_{x^{k}}(u)\right\rangle_{x^{k}}+g(u)-g(x^{k})=0, which implies that \theta(x^{k})\leq 0. Moreover, since d(x^{k})=R^{-1}_{x^{k}}(p(x^{k})), the subproblem is equivalent to

d(x^{k})=\arg\min_{d\in T_{x^{k}}\mathcal{X}}\left\langle\mathrm{grad}f(x^{k}),d\right\rangle_{x^{k}}+g(R_{x^{k}}(d))-g(x^{k}). (6)

In the subsequent sections, we will examine the convergence properties of the sequence generated by Algorithm 1, specifically focusing on three distinct and rigorously defined strategies for step-size determination, which are also discussed in [8], as elaborated below.

Armijo step size: Let \zeta\in(0,1) and 0<\omega_{1}<\omega_{2}<1. The step size \lambda_{k} is chosen according to the following line search algorithm:

Step 0. Set \lambda_{k_{0}}=1 and initialize \ell\leftarrow 0.

Step 1. If

F(R_{x^{k}}(\lambda_{k_{\ell}}d(x^{k})))\leq F(x^{k})+\zeta\lambda_{k_{\ell}}\theta(x^{k}),

then set \lambda_{k}:=\lambda_{k_{\ell}} and return to the main algorithm.

Step 2. Find \lambda_{k_{\ell+1}}\in\left[\omega_{1}\lambda_{k_{\ell}},\omega_{2}\lambda_{k_{\ell}}\right], set \ell\leftarrow\ell+1, and go to Step 1.

Adaptive step size: For all k, set

\lambda_{k}:=\min\left\{1,\frac{-\theta(x^{k})}{L\,\mathrm{dist}^{2}(p(x^{k}),x^{k})}\right\}=\operatorname{argmin}_{\lambda\in(0,1]}\left\{\lambda\theta(x^{k})+\frac{L}{2}\lambda^{2}\mathrm{dist}^{2}(p(x^{k}),x^{k})\right\}.

Diminishing step size: For all k, set

\lambda_{k}:=\frac{2}{k+2}.
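The following Python sketch (a schematic illustration; F, \theta(x^{k}), the retraction, and the distance are supplied as callables, \theta(x^{k})<0 is assumed, and the backtracking uses \omega_{1}=\omega_{2}=1/2) shows how the three step-size rules can be implemented.

def diminishing_step(k):
    # lambda_k = 2 / (k + 2)
    return 2.0 / (k + 2)

def adaptive_step(theta_k, dist_pk_xk, L):
    # lambda_k = min{1, -theta(x^k) / (L * dist^2(p(x^k), x^k))}
    return min(1.0, -theta_k / (L * dist_pk_xk ** 2))

def armijo_step(F, x_k, d_k, theta_k, retract, zeta=1e-4, omega=0.5, max_backtracks=50):
    # Backtracking: accept the first lambda with
    # F(R_{x^k}(lambda * d_k)) <= F(x^k) + zeta * lambda * theta(x^k).
    lam = 1.0
    Fx = F(x_k)
    for _ in range(max_backtracks):
        if F(retract(x_k, lam * d_k)) <= Fx + zeta * lam * theta_k:
            break
        lam *= omega
    return lam

# Tiny check on the Euclidean line (retraction = addition), with F(x) = (x - 0.3)^2 and g = 0.
F = lambda x: (x - 0.3) ** 2
retract = lambda x, d: x + d
x_k, p_k = 1.0, 0.3                        # current iterate and subproblem solution
d_k = p_k - x_k                            # search direction R_{x^k}^{-1}(p(x^k))
theta_k = 2.0 * (x_k - 0.3) * d_k          # <grad f(x^k), d_k>, which is negative here
print(armijo_step(F, x_k, d_k, theta_k, retract), adaptive_step(theta_k, abs(d_k), L=2.0), diminishing_step(0))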

4 Convergence analysis

In this section, we will provide the convergence analysis for three types of step sizes. The following assumption will be used in the convergence results.

Assumption 2.

The function f is geodesically L-smooth, namely, there exists a constant L such that (1) holds.

Since \mathcal{X} is compact, we can assume without loss of generality that \mathcal{X}=\mathrm{dom}(g). Its diameter is then finite and is defined by

\operatorname{diam}(\mathcal{X}):=\sup_{x,y\in\mathcal{X}}\mathrm{dist}(x,y). (7)
Lemma 4.1.

Let x,y,z\in\mathcal{X} with z=p(x) and y=\gamma_{x,z}(\lambda)=R_{x}(\lambda R^{-1}_{x}(z)), where \lambda\in[0,1] and \gamma_{x,z}\colon[0,1]\to\mathcal{M} represents the geodesic joining x=\gamma_{x,z}(0) with z=\gamma_{x,z}(1). Then, we have

F(y)\leq F(x)+\lambda\theta(x)+\frac{L\lambda^{2}}{2}\operatorname{diam}(\mathcal{X})^{2}. (8)
Proof.

From the definition of F, we have

F(y) =f(y)+g(y)
\leq f(x)+\left\langle\mathrm{grad}f(x),R^{-1}_{x}(y)\right\rangle_{x}+\frac{L}{2}\mathrm{dist}^{2}(x,y)+g(y)
\leq f(x)+\left\langle\mathrm{grad}f(x),R^{-1}_{x}(y)\right\rangle_{x}+\frac{L}{2}\mathrm{dist}^{2}(x,y)+(1-\lambda)g(x)+\lambda g(z)
=F(x)+\lambda\left\langle\mathrm{grad}f(x),R^{-1}_{x}(z)\right\rangle_{x}-\lambda g(x)+\lambda g(z)+\frac{L}{2}\mathrm{dist}^{2}(x,y)
=F(x)+\lambda\left(\left\langle\mathrm{grad}f(x),R^{-1}_{x}(z)\right\rangle_{x}-g(x)+g(z)\right)+\frac{L}{2}\mathrm{dist}^{2}(x,y)
=F(x)+\lambda\theta(x)+\frac{L\lambda^{2}}{2}\mathrm{dist}^{2}(x,z)
\leq F(x)+\lambda\theta(x)+\frac{L\lambda^{2}}{2}\operatorname{diam}(\mathcal{X})^{2},

where the first inequality comes from the L-smoothness of f and the second inequality is due to the geodesic convexity of g. Then, from the definition of \theta in (5), and using the fact that \mathrm{dist}^{2}(x,y)=\lambda^{2}\mathrm{dist}^{2}(x,z), we obtain the desired result. ∎

Additionally, we can derive the following proposition.

Proposition 4.2.

Let x^{*}\in\mathcal{X} be an optimal point of (3), and assume that f is geodesically convex. Then, Algorithm 1 generates \{x^{k}\} satisfying

F(x^{k+1})\leq F(x^{*})+\pi_{k}\left(1-\lambda_{0}\right)\left(F(x^{0})-F(x^{*})\right)+\sum_{s=0}^{k}\frac{\pi_{k}}{\pi_{s}}\frac{L\lambda_{s}^{2}}{2}\operatorname{diam}(\mathcal{X})^{2}, (9)

where \pi_{k}:=\prod_{s=1}^{k}\left(1-\lambda_{s}\right) with \pi_{0}=1.

Proof.

From Lemma 4.1, we obtain

F(x^{k+1})\leq F(x^{k})+\lambda_{k}\theta(x^{k})+\frac{L\lambda_{k}^{2}}{2}\operatorname{diam}(\mathcal{X})^{2}.

Define \Delta_{k}:=F(x^{k})-F(x^{*}). Then, we get

\Delta_{k+1}\leq\Delta_{k}+\lambda_{k}\theta(x^{k})+\frac{L\lambda_{k}^{2}}{2}\operatorname{diam}(\mathcal{X})^{2}.

From the geodesic convexity of f, we have

F(x^{*})-F(x^{k})\geq\left\langle\operatorname{grad}f(x^{k}),R_{x^{k}}^{-1}(x^{*})\right\rangle_{x^{k}}+g(x^{*})-g(x^{k})\geq\min_{u\in\mathcal{X}}\left\{\left\langle\mathrm{grad}f(x^{k}),R^{-1}_{x^{k}}(u)\right\rangle_{x^{k}}+g(u)-g(x^{k})\right\}=\theta(x^{k}).

Then we have \Delta_{k}\leq-\theta(x^{k}), which gives

\Delta_{k+1}\leq(1-\lambda_{k})\Delta_{k}+\frac{L\lambda_{k}^{2}}{2}\operatorname{diam}(\mathcal{X})^{2}.

Now, let us unroll this recursion and express \Delta_{k+1} in terms of \Delta_{0}. We obtain

\Delta_{k+1}\leq\prod_{s=0}^{k}(1-\lambda_{s})\Delta_{0}+\sum_{s=0}^{k}\frac{L\lambda_{s}^{2}}{2}\prod_{j=s+1}^{k}(1-\lambda_{j})\operatorname{diam}(\mathcal{X})^{2},

that is,

\Delta_{k+1}\leq\pi_{k}\left(1-\lambda_{0}\right)\Delta_{0}+\sum_{s=0}^{k}\frac{\pi_{k}}{\pi_{s}}\frac{L\lambda_{s}^{2}}{2}\operatorname{diam}(\mathcal{X})^{2}.

Rearranging the terms completes the proof. ∎

Now we present a result associated with general sequences of real numbers, which will be useful for our study of convergence and iteration complexity bounds for the CG method.

Lemma 4.3.

[14, Lemma 13.13] Let \left\{a_{k}\right\} and \left\{b_{k}\right\} be nonnegative sequences of real numbers satisfying

a_{k+1}\leq a_{k}-b_{k}\beta_{k}+\frac{A}{2}\beta_{k}^{2},\quad k=0,1,2,\ldots,

where \beta_{k}:=2/(k+2) and A is a positive number. Suppose that a_{k}\leq b_{k} for all k. Then

  • (i) a_{k}\leq\frac{2A}{k} for all k=1,2,\ldots

  • (ii) \min_{\ell\in\left\{\left\lfloor\frac{k}{2}\right\rfloor+2,\ldots,k\right\}}b_{\ell}\leq\frac{8A}{k-2} for all k=3,4,\ldots, where \lfloor k/2\rfloor=\max\{n\in\mathbb{N}:n\leq k/2\}.

4.1 Analysis with the adaptive and the diminishing step sizes

Theorem 4.4.

Let x^{*} be an optimal point of (3), and assume that f is geodesically convex. Then, the sequence of iterates \{x^{k}\} generated by Algorithm 1 with the adaptive or diminishing step size satisfies F(x^{k})-F(x^{*})=\mathcal{O}(1/k). Specifically, we obtain

\min_{\ell\in\left\{\left\lfloor\frac{k}{2}\right\rfloor+2,\ldots,k\right\}}-\theta(x^{\ell})\leq\frac{8L\operatorname{diam}(\mathcal{X})^{2}}{k-2}.
Proof.

From Lemma 4.1 and the diminishing step size \lambda_{k}=\frac{2}{k+2}\in(0,1), we have

F(x^{k+1})-F(x^{*})\leq F(x^{k})-F(x^{*})+\lambda_{k}\theta(x^{k})+\frac{L\lambda_{k}^{2}}{2}\operatorname{diam}(\mathcal{X})^{2}. (10)

Considering the adaptive step size \eta_{k}=\operatorname{argmin}_{\lambda\in(0,1]}\left\{\lambda\theta(x^{k})+\frac{L}{2}\lambda^{2}\mathrm{dist}^{2}(p(x^{k}),x^{k})\right\}, we have \eta_{k}\theta(x^{k})+\frac{L\eta_{k}^{2}}{2}\operatorname{diam}(\mathcal{X})^{2}\leq\lambda_{k}\theta(x^{k})+\frac{L\lambda_{k}^{2}}{2}\operatorname{diam}(\mathcal{X})^{2}, so inequality (10) still holds.

Define \Delta_{k}:=F(x^{k})-F(x^{*}). Then (10) is equivalent to

\Delta_{k+1}\leq\Delta_{k}+\lambda_{k}\theta(x^{k})+\frac{L\lambda_{k}^{2}}{2}\operatorname{diam}(\mathcal{X})^{2}.

From the geodesic convexity of f, we have

F(x^{*})-F(x^{k})\geq\left\langle\operatorname{grad}f(x^{k}),R_{x^{k}}^{-1}(x^{*})\right\rangle_{x^{k}}+g(x^{*})-g(x^{k})\geq\min_{u\in\mathcal{X}}\left\{\left\langle\mathrm{grad}f(x^{k}),R^{-1}_{x^{k}}(u)\right\rangle_{x^{k}}+g(u)-g(x^{k})\right\}=\theta(x^{k}).

Then we have 0\leq\Delta_{k}\leq-\theta(x^{k}). So, setting a_{k}=\Delta_{k}, b_{k}=-\theta(x^{k})\geq 0, \beta_{k}=\lambda_{k}, and A=L\operatorname{diam}(\mathcal{X})^{2} in Lemma 4.3, we obtain

F(x^{k})-F(x^{*})\leq\frac{2L}{k}\operatorname{diam}(\mathcal{X})^{2},

and

\min_{\ell\in\left\{\left\lfloor\frac{k}{2}\right\rfloor+2,\ldots,k\right\}}-\theta(x^{\ell})\leq\frac{8L\operatorname{diam}(\mathcal{X})^{2}}{k-2},

for all k=3,4,\ldots, where \lfloor k/2\rfloor=\max\{n\in\mathbb{N}:n\leq k/2\}. ∎

The above theorem also indicates that the iteration complexity of the method is \mathcal{O}(1/k). Note that a similar result can be obtained by following the proof of [22, Corollary 5].

4.2 Analysis with the Armijo step size

In the following lemma, we prove that there exist intervals of step sizes satisfying the Armijo condition.

Lemma 4.5.

Let \zeta\in(0,1), x^{k}\in\mathcal{X} and \theta(x^{k})\neq 0. Then, there exists 0<\bar{\lambda}\leq 1 such that

F(R_{x^{k}}(\lambda R_{x^{k}}^{-1}(p(x^{k}))))\leq F(x^{k})-\zeta\lambda\left|\theta(x^{k})\right|\quad\forall\lambda\in(0,\bar{\lambda}]. (11)
Proof.

Since f is differentiable and g is geodesically convex, for all \lambda\in(0,1), we have

F(R_{x^{k}}(\lambda R_{x^{k}}^{-1}(p(x^{k}))))
= g(R_{x^{k}}(\lambda R_{x^{k}}^{-1}(p(x^{k}))))+f(R_{x^{k}}(\lambda R_{x^{k}}^{-1}(p(x^{k}))))
\leq (1-\lambda)g(x^{k})+\lambda g(p(x^{k}))+f(x^{k})+\lambda\left\langle\mathrm{grad}f(x^{k}),R_{x^{k}}^{-1}(p(x^{k}))\right\rangle_{x^{k}}+o(\lambda)
= F(x^{k})+\lambda\left(\left\langle\mathrm{grad}f(x^{k}),R_{x^{k}}^{-1}(p(x^{k}))\right\rangle_{x^{k}}+g(p(x^{k}))-g(x^{k})\right)+o(\lambda)
= F(x^{k})+\lambda\zeta\theta(x^{k})+\lambda\left((1-\zeta)\theta(x^{k})+\frac{o(\lambda)}{\lambda}\right).

The inequality arises from the Taylor expansion of f and the geodesic convexity of g. Since \lim_{\lambda\rightarrow 0}\frac{o(\lambda)}{\lambda}=0, \zeta\in(0,1), and \theta(x^{k})<0 by the definition in (5), there exists \bar{\lambda}\in(0,1] such that (1-\zeta)\theta(x^{k})+\frac{o(\lambda)}{\lambda}\leq 0 for all \lambda\in(0,\bar{\lambda}], and hence (11) holds for all \lambda\in(0,\bar{\lambda}]. ∎

In the subsequent results, we assume that the sequence generated by Algorithm 1 is infinite, which means that \theta(x^{k})<0 for all k.

Assumption 3.

Set \mathcal{L}=\{x\mid F(x)\leq F(x^{0})\}. All monotonically nonincreasing sequences in F(\mathcal{L}) are bounded from below.

Theorem 4.6.

Assume that F satisfies Assumption 3 and the sequence \{x^{k}\} is generated by Algorithm 1 with the Armijo step size. The following statements hold:

  • (1) \lim_{k\rightarrow\infty}\theta(x^{k})=0;

  • (2) Let x^{*} be an accumulation point of the sequence \{x^{k}\}. Then x^{*} is a stationary point of (3) and \lim_{k\rightarrow\infty}F(x^{k})=F(x^{*}).

Proof.

(1) By the Armijo step size, the fact that \theta(x^{k})<0 for all k\in\mathbb{N}, and Lemma 4.5, we have

F(x^{k})-F(x^{k+1})\geq\zeta\lambda_{k}\left|\theta(x^{k})\right|>0, (12)

which implies that the sequence \{F(x^{k})\} is monotonically decreasing. On the other hand, since \{x^{k}\}\subset\mathcal{X} and \mathcal{X} is compact, there exists \bar{x}\in\mathcal{X} that is an accumulation point of a subsequence of \{x^{k}\}. Without loss of generality, assume that x^{k}\rightarrow\bar{x}.

From (12), with k=0,1,\ldots,N-1 for some N, we have F(x^{0})-F_{\min}\geq F(x^{0})-F(x^{N})\geq\zeta\sum_{k=0}^{N-1}\lambda_{k}\left|\theta(x^{k})\right|, where F_{\min} exists from Assumption 3. Taking N\rightarrow\infty, we have \sum_{k=0}^{\infty}\lambda_{k}\left|\theta(x^{k})\right|<+\infty. Therefore, we have \lim_{k\rightarrow\infty}\lambda_{k}\theta(x^{k})=0. If

\lim_{k\rightarrow\infty}\theta(x^{k})=0, (13)

then the assertion holds. So we consider the case

\lim_{k\rightarrow\infty}\lambda_{k}=0. (14)

Assume that \lim_{\mathbb{K}_{1}\ni k\to\infty}\lambda_{k}=0 for some \mathbb{K}_{1}\subseteq\mathbb{N}. Suppose by contradiction that \theta(\bar{x})<0. Then, there exist \delta>0 and \mathbb{K}_{2}\subseteq\mathbb{K}_{1} such that

\theta(x^{k})<-\delta\quad\forall k\in\mathbb{K}_{2}.

Without loss of generality, we assume that there exists \bar{p}\in\mathcal{X} such that

\lim_{\mathbb{K}_{2}\ni k\to\infty}p(x^{k})=\bar{p}.

Since \lambda_{k}<1 for all k\in\mathbb{K}_{2}, by the Armijo step size, there exists \bar{\lambda}_{k}\in(0,\lambda_{k}/\omega_{1}) such that

F(R_{x^{k}}(\bar{\lambda}_{k}R_{x^{k}}^{-1}(p(x^{k}))))>F(x^{k})+\zeta\bar{\lambda}_{k}\theta(x^{k})\quad\forall k\in\mathbb{K}_{2}.

Then, we have

\frac{F(R_{x^{k}}(\bar{\lambda}_{k}R_{x^{k}}^{-1}(p(x^{k}))))-F(x^{k})}{\bar{\lambda}_{k}}>\zeta\theta(x^{k})\quad\forall k\in\mathbb{K}_{2}. (15)

On the other hand, since 0<\bar{\lambda}_{k}\leq 1 and g is geodesically convex, we have

g(R_{x^{k}}(\bar{\lambda}_{k}R_{x^{k}}^{-1}(p(x^{k}))))\leq(1-\bar{\lambda}_{k})g(x^{k})+\bar{\lambda}_{k}g(p(x^{k})),

namely,

\frac{g(R_{x^{k}}(\bar{\lambda}_{k}R_{x^{k}}^{-1}(p(x^{k}))))-g(x^{k})}{\bar{\lambda}_{k}}\leq g(p(x^{k}))-g(x^{k})\quad\forall k\in\mathbb{K}_{2}. (16)

Since f is differentiable and \lim_{\mathbb{K}_{2}\ni k\to\infty}\bar{\lambda}_{k}=0, for all k\in\mathbb{K}_{2} we have

f(R_{x^{k}}(\bar{\lambda}_{k}R_{x^{k}}^{-1}(p(x^{k}))))=f(x^{k})+\bar{\lambda}_{k}\left\langle\operatorname{grad}f(x^{k}),R_{x^{k}}^{-1}(p(x^{k}))\right\rangle_{x^{k}}+o(\bar{\lambda}_{k}), (17)

which, together with (16) and the definition of \theta, gives

\theta(x^{k}) =\left\langle\operatorname{grad}f(x^{k}),R_{x^{k}}^{-1}(p(x^{k}))\right\rangle_{x^{k}}+g(p(x^{k}))-g(x^{k}) (18)
\geq\frac{f(R_{x^{k}}(\bar{\lambda}_{k}R_{x^{k}}^{-1}(p(x^{k}))))-f(x^{k})-o(\bar{\lambda}_{k})}{\bar{\lambda}_{k}}+\frac{g(R_{x^{k}}(\bar{\lambda}_{k}R_{x^{k}}^{-1}(p(x^{k}))))-g(x^{k})}{\bar{\lambda}_{k}}
\geq\frac{F(R_{x^{k}}(\bar{\lambda}_{k}R_{x^{k}}^{-1}(p(x^{k}))))-F(x^{k})}{\bar{\lambda}_{k}}-\frac{o(\bar{\lambda}_{k})}{\bar{\lambda}_{k}}\quad\forall k\in\mathbb{K}_{2}.

Combining (18) with (15), we obtain, for all k\in\mathbb{K}_{2},

\frac{F(R_{x^{k}}(\bar{\lambda}_{k}R_{x^{k}}^{-1}(p(x^{k}))))-F(x^{k})}{\bar{\lambda}_{k}\zeta}>\frac{F(R_{x^{k}}(\bar{\lambda}_{k}R_{x^{k}}^{-1}(p(x^{k}))))-F(x^{k})}{\bar{\lambda}_{k}}-\frac{o(\bar{\lambda}_{k})}{\bar{\lambda}_{k}},

that is,

\left(\frac{1}{\zeta}-1\right)\frac{F(R_{x^{k}}(\bar{\lambda}_{k}R_{x^{k}}^{-1}(p(x^{k}))))-F(x^{k})}{\bar{\lambda}_{k}}>-\frac{o(\bar{\lambda}_{k})}{\bar{\lambda}_{k}}.

Then, we have

\frac{F(R_{x^{k}}(\bar{\lambda}_{k}R_{x^{k}}^{-1}(p(x^{k}))))-F(x^{k})}{\bar{\lambda}_{k}}>\left(\frac{-\zeta}{1-\zeta}\right)\frac{o(\bar{\lambda}_{k})}{\bar{\lambda}_{k}}\quad\forall k\in\mathbb{K}_{2}.

On the other hand, from the fact that \theta(x^{k})<-\delta and (18), we obtain

-\delta+\frac{o(\bar{\lambda}_{k})}{\bar{\lambda}_{k}}>\frac{F(R_{x^{k}}(\bar{\lambda}_{k}R_{x^{k}}^{-1}(p(x^{k}))))-F(x^{k})}{\bar{\lambda}_{k}}\quad\forall k\in\mathbb{K}_{2}.

Combining the two results above, it holds that

-\delta+\frac{o(\bar{\lambda}_{k})}{\bar{\lambda}_{k}}>\left(\frac{-\zeta}{1-\zeta}\right)\frac{o(\bar{\lambda}_{k})}{\bar{\lambda}_{k}}\quad\forall k\in\mathbb{K}_{2}.

Since \frac{o(\bar{\lambda}_{k})}{\bar{\lambda}_{k}}\to 0 as k\to\infty, we obtain \delta\leq 0, which contradicts the fact that \delta>0. Thus, \theta(\bar{x})=0.

(2) Let x^{*} be an accumulation point of the sequence \{x^{k}\}. Assume that x^{*} is not a stationary point of F, namely \theta(x^{*})<0. By the definition of \theta in (5), we have

\theta\left(x^{k}\right)=\left\langle\mathrm{grad}f(x^{k}),R^{-1}_{x^{k}}(p(x^{k}))\right\rangle_{x^{k}}+g(p(x^{k}))-g(x^{k})\leq\left\langle\mathrm{grad}f(x^{k}),R^{-1}_{x^{k}}(u)\right\rangle_{x^{k}}+g(u)-g(x^{k})\quad\forall u\in\mathcal{X},

and by taking the limit we obtain

\limsup_{k\rightarrow\infty}\theta\left(x^{k}\right)\leq\left\langle\mathrm{grad}f(x^{*}),R^{-1}_{x^{*}}(u)\right\rangle_{x^{*}}+g(u)-g(x^{*})\quad\forall u\in\mathcal{X},

since f is continuously differentiable, g is lower semicontinuous, and Assumption 1 holds. Now, taking the minimum over u\in\mathcal{X} and using that x^{*} is not a stationary point of F, we get

\limsup_{k\rightarrow\infty}\theta\left(x^{k}\right)\leq\min_{u\in\mathcal{X}}\left\langle\mathrm{grad}f(x^{*}),R^{-1}_{x^{*}}(u)\right\rangle_{x^{*}}+g(u)-g(x^{*})=\theta\left(x^{*}\right)<0,

which contradicts \lim_{k\rightarrow\infty}\theta(x^{k})=0, so x^{*} is a stationary point of F.

Also, due to the monotonicity of the sequence \{F(x^{k})\} given in (12), we obtain \lim_{k\rightarrow\infty}F(x^{k})=F\left(x^{*}\right), which completes the proof. ∎

The following theorem provides the iteration complexity of Algorithm 1 with the Armijo step size, indicating that after a finite number of steps, the iteration complexity of the method is \mathcal{O}(1/\epsilon^{2}).

Theorem 4.7.

Assume that F satisfies Assumptions 2 and 3. Then, there exists a finite integer J\in\mathbb{N} such that Algorithm 1 with the Armijo step size achieves an \epsilon-optimal solution within J+\mathcal{O}(1/\epsilon^{2}) iterations.

Proof.

First, we consider the case 0<\lambda_{k}<1. From the Armijo step size, we can say that there exists 0<\bar{\lambda}_{k}\leq\min\left\{1,\frac{\lambda_{k}}{\omega_{1}}\right\} such that

F(R_{x^{k}}(\bar{\lambda}_{k}R_{x^{k}}^{-1}(p(x^{k}))))>F(x^{k})+\zeta\bar{\lambda}_{k}\theta(x^{k}).

Recall that from Lemma 4.1, we have

F(R_{x^{k}}(\bar{\lambda}_{k}R_{x^{k}}^{-1}(p(x^{k}))))\leq F(x^{k})+\bar{\lambda}_{k}\theta(x^{k})+\frac{L\bar{\lambda}_{k}^{2}}{2}\operatorname{diam}(\mathcal{X})^{2}.

Combining the above two inequalities, we conclude that

\zeta\bar{\lambda}_{k}\theta(x^{k})<\bar{\lambda}_{k}\theta(x^{k})+\frac{L\bar{\lambda}_{k}^{2}}{2}\operatorname{diam}(\mathcal{X})^{2},

namely,

0<-\frac{2(1-\zeta)}{L\operatorname{diam}(\mathcal{X})^{2}}\theta(x^{k})<\bar{\lambda}_{k}\leq\frac{\lambda_{k}}{\omega_{1}}.

Defining \gamma=\frac{2\omega_{1}(1-\zeta)}{L\operatorname{diam}(\mathcal{X})^{2}}, we get

0<-\gamma\theta(x^{k})<\lambda_{k}. (19)

From Theorem 4.6, we have \lim_{k\rightarrow\infty}\theta(x^{k})=0. Then, without loss of generality, for a sufficiently large J, we have \left|\theta(x^{k})\right|\leq 1/\gamma for all k>J, so (19) also holds when \lambda_{k}=1. Substituting it into (12), we obtain

0<\zeta\gamma\left|\theta(x^{k})\right|^{2}\leq F(x^{k})-F(x^{k+1}). (20)

That is, (20) holds for all k>J. By summing both sides of the second inequality in (20) for k=J+1,\ldots,N, we obtain

\sum_{k=J+1}^{N}\left|\theta(x^{k})\right|^{2}\leq\frac{1}{\zeta\gamma}(F(x^{J+1})-F(x^{N+1}))\leq\frac{1}{\zeta\gamma}(F(x^{0})-F(x^{*})),

where x^{*} is an accumulation point of \{x^{k}\}, as in Theorem 4.6.

This implies \min_{k}\left\{\left|\theta(x^{k})\right|\mid k=J+1,\ldots,N\right\}\leq\sqrt{(F(x^{0})-F(x^{*}))/(\zeta\gamma(N-J))}. Hence, Algorithm 1 returns x^{k} satisfying \left|\theta(x^{k})\right|\leq\epsilon in at most J+(F(x^{0})-F(x^{*}))/(\zeta\gamma\epsilon^{2}) iterations. ∎

Furthermore, we can obtain stronger results by making additional assumptions about the objective function g.

Assumption 4.

Assume that the function g is Lipschitz continuous with constant L_{g}>0 on \mathcal{X}, namely, |g(x)-g(y)|\leq L_{g}\mathrm{dist}(x,y). Moreover, for \{x^{k}\} generated by Algorithm 1, there exists \Omega>0 such that \Omega\geq\max\big\{\max\big\{\|R_{x^{i}}^{-1}(x^{j})\|_{x^{i}}:x^{i},x^{j}\in\{x^{k}\}\big\},\operatorname{diam}(\mathcal{X})\big\}.

As k tends to infinity, we can assume without loss of generality that x^{k} lies within a compact neighborhood of x^{*}. Due to the smoothness of the inverse retraction, this ensures that \|R_{x^{i}}^{-1}(x^{j})\|_{x^{i}} remains bounded. Hence, Assumption 4 is reasonable. Moreover, we define

\rho:=\sup\left\{\left\|\operatorname{grad}f(x)\right\|_{x}\mid x\in\mathcal{X}\right\}, (21)
\gamma:=\min\left\{\frac{1}{(L_{g}+\rho)\Omega},\frac{2\omega_{1}(1-\zeta)}{L\Omega^{2}}\right\}.

Then we present the following lemma, which is crucial for the subsequent convergence rate analysis.

Lemma 4.8.

Let \{x^{k}\} be a sequence generated by Algorithm 1 with the Armijo step size. Assume that Assumptions 2 and 4 hold. Then \lambda_{k}\geq\gamma\left|\theta\left(x^{k}\right)\right|>0.

Proof.

Since \lambda_{k}\in(0,1], let us consider two cases: \lambda_{k}=1 and 0<\lambda_{k}<1. First, we assume that \lambda_{k}=1. From (5), we have

0<-\theta(x^{k})=g(x^{k})-g(p(x^{k}))-\left\langle\operatorname{grad}f(x^{k}),R_{x^{k}}^{-1}(p(x^{k}))\right\rangle_{x^{k}},

thus, it follows from Assumption 4 and the Cauchy-Schwarz inequality that

0<-\theta(x^{k})\leq L_{g}\mathrm{dist}(x^{k},p(x^{k}))+\left\|\operatorname{grad}f(x^{k})\right\|_{x^{k}}\left\|R_{x^{k}}^{-1}(p(x^{k}))\right\|_{x^{k}}.

Using (21), we have 0<-\theta(x^{k})\leq(L_{g}+\rho)\Omega. Furthermore,

0<\gamma|\theta(x^{k})|=-\gamma\theta(x^{k})\leq\frac{-\theta(x^{k})}{(L_{g}+\rho)\Omega}\leq 1,

which demonstrates that the desired inequality holds.

Now, we assume that 0<\lambda_{k}<1. From the Armijo step size, we know that there exists 0<\bar{\lambda}_{k}\leq\min\left\{1,\frac{\lambda_{k}}{\omega_{1}}\right\} such that

F(R_{x^{k}}(\bar{\lambda}_{k}R_{x^{k}}^{-1}(p(x^{k}))))>F(x^{k})+\zeta\bar{\lambda}_{k}\theta(x^{k}).

By using Lemma 4.1, we have

F(R_{x^{k}}(\bar{\lambda}_{k}R_{x^{k}}^{-1}(p(x^{k}))))\leq F(x^{k})+\bar{\lambda}_{k}\theta(x^{k})+\frac{L}{2}\operatorname{diam}(\mathcal{X})^{2}\bar{\lambda}_{k}^{2}.

From the above two inequalities, we have that

-\theta(x^{k})(1-\zeta)<\frac{L}{2}\operatorname{diam}(\mathcal{X})^{2}\bar{\lambda}_{k}\leq\frac{L}{2}\operatorname{diam}(\mathcal{X})^{2}\frac{\lambda_{k}}{\omega_{1}}.

Therefore, from (21), we get

0<\gamma|\theta(x^{k})|=-\gamma\theta(x^{k})\leq-\frac{2\omega_{1}(1-\zeta)}{L\Omega^{2}}\theta(x^{k})<\lambda_{k},

which concludes the proof. ∎

Theorem 4.9.

Let \{x^{k}\} be a sequence generated by Algorithm 1 with the Armijo step size. Assume that Assumptions 2 and 4 hold. Then \lim_{k\rightarrow\infty}F(x^{k})=F(x^{*}) for some x^{*}\in\mathcal{X}. Moreover, the following statements hold:

  • (i) \lim_{k\rightarrow\infty}\theta(x^{k})=0;

  • (ii) \min\left\{\left|\theta(x^{k})\right|\mid k=0,1,\ldots,N-1\right\}\leq\sqrt{(F(x^{0})-F(x^{*}))/(\zeta\gamma N)}.

Proof.

By the Armijo step size, the fact that \theta(x^{k})<0 for all k\in\mathbb{N}, and Lemma 4.5, we have

F(x^{k})-F(x^{k+1})\geq\zeta\lambda_{k}\left|\theta(x^{k})\right|>0, (22)

which implies that the sequence \{F(x^{k})\}_{k\in\mathbb{N}} is monotonically decreasing. Combined with Lemma 4.8, we have

0<\zeta\gamma\left|\theta(x^{k})\right|^{2}\leq F(x^{k})-F(x^{k+1}). (23)

On the other hand, since \{x^{k}\}_{k\in\mathbb{N}}\subset\mathcal{X} and \mathcal{X} is compact, there exists a limit point x^{*}\in\mathcal{X} of \{x^{k}\}_{k\in\mathbb{N}}. Let \{x^{k_{j}}\}_{j\in\mathbb{N}} be a subsequence of \{x^{k}\}_{k\in\mathbb{N}} such that \lim_{j\rightarrow\infty}x^{k_{j}}=x^{*}. Since \{x^{k_{j}}\}_{j\in\mathbb{N}}\subset\mathcal{X} and g is Lipschitz continuous, we have

\left|F(x^{k_{j}})-F(x^{*})\right|=\left|g(x^{k_{j}})-g(x^{*})+f(x^{k_{j}})-f(x^{*})\right|\leq L_{g}\,\mathrm{dist}(x^{k_{j}},x^{*})+\left|f(x^{k_{j}})-f(x^{*})\right|,

for all j\in\mathbb{N}. Considering that f is continuous and \lim_{j\rightarrow\infty}x^{k_{j}}=x^{*}, we conclude from the last inequality that \lim_{j\rightarrow\infty}F(x^{k_{j}})=F(x^{*}). Thus, due to the monotonicity of the sequence \{F(x^{k})\}_{k\in\mathbb{N}}, we obtain that \lim_{k\rightarrow\infty}F(x^{k})=F(x^{*}). Therefore, from (23), \lim_{k\rightarrow\infty}\left|\theta(x^{k})\right|=0, which implies item (i).

By summing both sides of the second inequality in (23) for k=0,1,\ldots,N-1, we obtain

\sum_{k=0}^{N-1}\left|\theta(x^{k})\right|^{2}\leq\frac{1}{\zeta\gamma}(F(x^{0})-F(x^{N}))\leq\frac{1}{\zeta\gamma}(F(x^{0})-F(x^{*})),

which implies item (ii).

Therefore, x^{k} satisfies \left|\theta(x^{k})\right|\leq\epsilon in at most (F(x^{0})-F(x^{*}))/(\zeta\gamma\epsilon^{2}) iterations. ∎

5 Solving the subproblem

In this section, we propose an algorithm for solving the subproblem of the RGCG method. Recall that in Algorithm 1, given the current iterate x, we need to compute an optimal solution p(x) and the optimal value \theta(x) as

p(x)=\arg\min_{u\in\mathcal{X}}\left\langle\mathrm{grad}f(x),R^{-1}_{x}(u)\right\rangle_{x}+g(u)-g(x), (24)
\theta(x)=\left\langle\mathrm{grad}f(x),R^{-1}_{x}(p(x))\right\rangle_{x}+g(p(x))-g(x). (25)

Now define \tilde{p}(x) as the descent direction on the tangent space, i.e.,

\tilde{p}(x)=\arg\min_{d\in T_{x}\mathcal{M}}\left\langle\mathrm{grad}f(x),d\right\rangle_{x}+g(R_{x}(d)). (26)

We also define

\ell_{x}(d)=\left\langle\mathrm{grad}f(x),d\right\rangle_{x}+g(R_{x}(d)). (27)

We observe that in both (24) and (26), the composition with the retraction is not necessarily convex, which implies that the resulting optimization problem is not necessarily convex. Non-convex problems are inherently more challenging to solve. To address this, we aim to transform the subproblem into a series of convex problems, similarly to the general method for solving Riemannian proximal mappings proposed in [20]. To ensure that we can locally approximate the subproblems effectively, we consider the following assumption:

Assumption 5.

The manifold \mathcal{M} is an embedded submanifold of \mathbb{R}^{n} or a quotient manifold whose total space is an embedded submanifold of \mathbb{R}^{n}.

Let us now present a way to find an approximation of \tilde{p}(x)=\arg\min_{d\in T_{x}\mathcal{M}}\ell_{x}(d). Assume that d^{0}\in T_{x}\mathcal{M}, and let d^{k} denote the current estimate of \tilde{p}(x). Observe that

\ell_{x}(d^{k}+\tilde{\xi}_{k}) =\left\langle\operatorname{grad}f\left(x\right),d^{k}+\tilde{\xi}_{k}\right\rangle_{x}+g(R_{x}(d^{k}+\tilde{\xi}_{k}))
=\left\langle\operatorname{grad}f\left(x\right),d^{k}\right\rangle_{x}+\left\langle\operatorname{grad}f\left(x\right),\tilde{\xi}_{k}\right\rangle_{x}+g(R_{x}(d^{k}+\tilde{\xi}_{k})),

for any \tilde{\xi}_{k}\in\mathrm{T}_{x}\mathcal{M}. Let y_{k}=R_{x}\left(d^{k}\right) and \xi_{k}=\mathcal{T}_{R_{d^{k}}}\tilde{\xi}_{k}. Since the retraction R is smooth by definition, we obtain R_{x}(d^{k}+\tilde{\xi}_{k})=y_{k}+\xi_{k}+O\left(\left\|\xi_{k}\right\|_{x}^{2}\right), where y=x+O(z) means \limsup_{z\rightarrow 0}\|y-x\|/\|z\|<\infty. Then, we obtain

\ell_{x}(d^{k}+\tilde{\xi}_{k})= \left\langle\operatorname{grad}f(x),d^{k}\right\rangle_{x}+\left\langle\operatorname{grad}f(x),\mathcal{T}_{R_{d^{k}}}^{-1}\xi_{k}\right\rangle_{x}+g\left(y_{k}+\xi_{k}+O\left(\left\|\xi_{k}\right\|_{x}^{2}\right)\right)
= \left\langle\operatorname{grad}f(x),d^{k}\right\rangle_{x}+\left\langle\operatorname{grad}f(x),\mathcal{T}_{R_{d^{k}}}^{-1}\xi_{k}\right\rangle_{x}+g\left(y_{k}+\xi_{k}\right)+O\left(\left\|\xi_{k}\right\|_{x}^{2}\right),

where the second equality comes from the Lipschitz continuity of g (see Assumption 4). Note that the middle part of the above expression,

\left\langle\operatorname{grad}f(x),\mathcal{T}_{R_{d^{k}}}^{-1}\xi_{k}\right\rangle_{x}+g\left(y_{k}+\xi_{k}\right) (28)

can be interpreted as a simple local model of \ell_{x}\left(d^{k}+\mathcal{T}_{R_{d^{k}}}^{-1}\xi_{k}\right). Thus, to obtain a new estimate from d^{k}, we first compute a search direction \xi_{k}^{*} by minimizing (28) over \mathrm{T}_{y_{k}}\mathcal{M}, and then update d^{k} along the direction \mathcal{T}_{R_{d^{k}}}^{-1}\xi_{k}^{*}, as described in Algorithm 2.

It is easy to see that Algorithm 2 is an application of the Riemannian GCG method to problem (26) with a proper retraction. Moreover, note that (29) can be solved by the subgradient method or the proximal gradient method.

Algorithm 2 Solving the subproblem

Require: An initial iterate d^{0}\in T_{x}\mathcal{M}; a small positive constant \sigma;
1: for k=0,1,\ldots do
2:  y_{k}=R_{x}\left(d^{k}\right)
3:  Compute \xi_{k}^{*} by solving

\xi_{k}^{*}=\arg\min_{\xi\in\mathrm{T}_{y_{k}}\mathcal{M}}\left\langle\mathcal{T}_{R_{d^{k}}}^{-\sharp}\left(\operatorname{grad}f(x)\right),\xi\right\rangle_{y_{k}}+g\left(y_{k}+\xi\right); (29)

4:  \alpha=1
5:  while \ell_{x}\left(d^{k}+\alpha\mathcal{T}_{R_{d^{k}}}^{-1}\xi_{k}^{*}\right)\geq\ell_{x}\left(d^{k}\right)-\sigma\alpha\left\|\xi_{k}^{*}\right\|_{x}^{2} do
6:   \alpha=\frac{1}{2}\alpha;
7:  end while
8:  d^{k+1}=d^{k}+\alpha\mathcal{T}_{R_{d^{k}}}^{-1}\xi_{k}^{*};
9: end for
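For reference, the following Python sketch mirrors the structure of Algorithm 2. It is only a skeleton under assumptions: the manifold is embedded in \mathbb{R}^{n} so that tangent vectors are ambient arrays (Assumption 5), and the problem-specific oracles retract, transport_inv, transport_inv_adjoint, solve_model (which solves (29), e.g., by a subgradient or proximal gradient method as noted above), and ell_x are supplied by the user.

import numpy as np

def solve_subproblem(x, grad_fx, ell_x, retract, transport_inv, transport_inv_adjoint,
                     solve_model, d0, sigma=1e-4, outer_iters=20):
    # Skeleton of Algorithm 2: successive minimization of the local model (28)
    # followed by a backtracking line search on ell_x.
    d = d0
    for _ in range(outer_iters):
        y = retract(x, d)                          # y_k = R_x(d^k)
        v = transport_inv_adjoint(d, grad_fx)      # T_{R_{d^k}}^{-sharp} grad f(x)
        xi = solve_model(y, v)                     # approximate minimizer of (29) on T_{y_k} M
        step = transport_inv(d, xi)                # T_{R_{d^k}}^{-1} xi_k^*
        alpha = 1.0
        while ell_x(d + alpha * step) >= ell_x(d) - sigma * alpha * np.dot(xi, xi):
            alpha *= 0.5
            if alpha < 1e-12:                      # safeguard against endless backtracking
                break
        d = d + alpha * step
    return d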

6 Accelerated conditional gradient methods

Now we consider an accelerated version of the RGCG method. Based on accelerated gradient methods, Li et al. [4] established links between the subproblems of conditional gradient methods and the notion of momentum. They proposed a momentum variant of the CG method and proved that it converges at a fast rate of \mathcal{O}(1/k^{2}). However, the extension of this work to the Riemannian case is not trivial, even though there are works on Nesterov-type accelerated methods on Riemannian manifolds [27]. Based on these existing works, we propose here an accelerated CG method on Riemannian manifolds for composite functions.

Algorithm 3 Riemannian accelerated GCG method

Step 0. Initialization: Choose x^{0}=p_{0}\in\mathcal{X}, set \tau^{k}=\frac{2}{k+3}, and initialize k=0, \beta_{0}=\mathbf{0}, d^{0}=\mathbf{0}.

Step 1.

y_{k}=R_{x^{k}}(\tau^{k}R^{-1}_{x^{k}}(d^{k}))
\beta_{k+1}=\mathcal{T}_{y_{k}}(1-\tau^{k})\beta_{k}+\lambda_{k}\mathrm{grad}f(y_{k})

Step 2. Compute an optimal solution p_{k+1} and the optimal value \theta(x^{k}) as

p_{k+1}=\arg\min_{u\in\mathcal{X}}\left\langle\beta_{k+1},R_{y_{k}}^{-1}(u)\right\rangle_{y_{k}}+g(u)-g(y_{k}),
d^{k+1}=R_{y_{k}}^{-1}(p_{k+1}).

Step 3. Stopping criterion: If \theta(x^{k})=0, then stop.

Step 4. Compute the iterate: Set

x^{k+1}=R_{x^{k}}(\lambda_{k}R^{-1}_{x^{k}}(p_{k+1})).

Step 5. Beginning a new iteration: Set k=k+1 and go to Step 1.

Analyzing the convergence and convergence rate for the above accelerated method is challenging, primarily due to the inherent nonconvexity of the problem and the complexities introduced by transporting tangent vectors. A comprehensive analysis of the accelerated method is beyond the scope of this paper. However, the performance of the accelerated method compared to the non-accelerated method is shown in the numerical experiments in Section 7.

7 Numerical experiments

Sparse principal component analysis (Sparse PCA) plays a crucial role in high-dimensional data analysis by enhancing the interpretability and efficiency of dimensionality reduction. Unlike traditional PCA, which often produces complex results in high-dimensional settings, Sparse PCA introduces sparsity to refine data representation and improve clarity. This is particularly relevant in fields such as bioinformatics, image processing, and signal analysis, where interpretability and reconstruction accuracy are essential. In this section, we consider Sparse PCA problems to check the validity of our proposed methods. We explore two models for Sparse PCA as in [20], where we set \mathcal{X}=\mathcal{M} in each problem.

7.1 Sparse PCA on the sphere manifold

Now consider problems on the sphere manifold \mathbb{S}^{n-1}=\{x\in\mathbb{R}^{n}:\|x\|=1\}. The specific optimization problem is formulated as follows:

\min_{x\in\mathbb{S}^{n-1}}-x^{T}A^{T}Ax+\lambda\|x\|_{1}, (30)

where \left\|\,\cdot\,\right\|_{1} denotes the L_{1}-norm and \lambda>0 is a parameter. Here, we use the exponential map as the retraction, i.e.,

\operatorname{Exp}_{x}\left(\eta_{x}\right)=x\cos\left(\left\|\eta_{x}\right\|\right)+\eta_{x}\sin\left(\left\|\eta_{x}\right\|\right)/\left\|\eta_{x}\right\|,\quad x\in\mathbb{S}^{n-1},\quad\eta_{x}\in\mathrm{T}_{x}\mathbb{S}^{n-1}.

Its inverse is given by the logarithm map on the unit sphere \mathbb{S}^{n-1} [20], i.e.,

\log_{x}(y)=\operatorname{Exp}_{x}^{-1}(y)=\frac{\cos^{-1}\left(x^{T}y\right)}{\sqrt{1-\left(x^{T}y\right)^{2}}}\left(I_{n}-xx^{T}\right)y,\quad x,y\in\mathbb{S}^{n-1}.
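The following Python sketch (an illustration; the Riemannian gradient is obtained, as is standard for embedded submanifolds with the induced metric, by projecting the Euclidean gradient of f(x)=-x^{T}A^{T}Ax onto the tangent space) implements these maps and checks numerically that \operatorname{Exp}_{x} and \log_{x} are inverse to each other.

import numpy as np

def sphere_exp(x, eta):
    # Exponential map on S^{n-1}, as given above.
    n = np.linalg.norm(eta)
    if n < 1e-16:
        return x.copy()
    return np.cos(n) * x + np.sin(n) * eta / n

def sphere_log(x, y):
    # Inverse exponential map log_x(y) on S^{n-1}, as given above.
    c = np.clip(np.dot(x, y), -1.0, 1.0)
    p = y - c * x                              # (I - x x^T) y
    if np.linalg.norm(p) < 1e-16:
        return np.zeros_like(x)
    return (np.arccos(c) / np.sqrt(1.0 - c ** 2)) * p

def riemannian_grad_f(x, A):
    # f(x) = -x^T A^T A x; project the Euclidean gradient -2 A^T A x onto T_x S^{n-1}.
    eg = -2.0 * (A.T @ (A @ x))
    return eg - np.dot(x, eg) * x

rng = np.random.default_rng(2)
A = rng.standard_normal((10, 6))
x = rng.standard_normal(6)
x /= np.linalg.norm(x)
y = sphere_exp(x, 0.3 * riemannian_grad_f(x, A))
print(np.allclose(sphere_exp(x, sphere_log(x, y)), y))   # Exp_x(log_x(y)) recovers y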

We now introduce an important result.

Lemma 7.1.

[20, Lemma D.1] For any xnx\in\mathbb{R}^{n} and λ>0\lambda>0, the minimizer of the optimization problem

miny𝕊n112λyx2+y1\min_{y\in\mathbb{S}^{n-1}}\frac{1}{2\lambda}\|y-x\|^{2}+\|y\|_{1}

is given by

y={z/z, if z0,sign(ximax)eimax, otherwise, y_{*}=\begin{cases}z/\|z\|,&\text{ if }\|z\|\neq 0,\\ \operatorname{sign}\left(x_{i_{\max}}\right)e_{i_{\max}},&\text{ otherwise, }\end{cases}

where imaxi_{\max} is the index of the largest magnitude entry of xx (break ties arbitrarily), eie_{i} denotes the ii-th column in the canonical basis of n\mathbb{R}^{n}, and zz is defined by

zi={0, if |xi|λ,xiλ, if xi>λ,xi+λ, if xi<λ.z_{i}=\begin{cases}0,&\text{ if }\left|x_{i}\right|\leq\lambda,\\ x_{i}-\lambda,&\text{ if }x_{i}>\lambda,\\ x_{i}+\lambda,&\text{ if }x_{i}<-\lambda.\end{cases}
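
The minimizer in Lemma 7.1 amounts to soft-thresholding followed by normalization; a direct translation into NumPy might look as follows (the function name below is illustrative).

```python
import numpy as np

def sphere_l1_prox(x, lam):
    """Closed-form minimizer from Lemma 7.1 (illustrative implementation)."""
    z = np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)   # soft-thresholding of x
    nrm = np.linalg.norm(z)
    if nrm > 0.0:
        return z / nrm
    # all entries of x lie in [-lam, lam]: return a signed canonical basis vector
    i_max = int(np.argmax(np.abs(x)))
    y = np.zeros_like(x)
    y[i_max] = np.sign(x[i_max]) if x[i_max] != 0 else 1.0
    return y
```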

The subproblem (4), applied to (30), can be written as

miny𝕊n1u(y),where u(y)=1λgradf(x),logx(y)+y1.\min_{y\in\mathbb{S}^{n-1}}u(y),\quad\text{where }u(y)=\frac{1}{\lambda}\langle\mathrm{grad}f(x),\log_{x}(y)\rangle+\|y\|_{1}. (31)

Let us now see how to compute its solution. Let $h(y)=\frac{1}{\lambda}\langle\operatorname{grad}f(x),\log_{x}(y)\rangle$, and let $y^{k}$ denote the current estimate of the minimizer of $u(y)$ over the unit sphere. The next estimate, $y^{k+1}$, is obtained by solving the following optimization problem:

miny𝕊n1h(yk)+h(yk)T(yyk)+y1,\min_{y\in\mathbb{S}^{n-1}}h(y^{k})+\nabla h(y^{k})^{T}(y-y^{k})+\|y\|_{1},

which is equivalent to

miny𝕊n1h(yk)Ty+y1.\min_{y\in\mathbb{S}^{n-1}}\nabla h(y^{k})^{T}y+\|y\|_{1}.

In other words, h(y)h(y) is approximated by its first-order Taylor expansion around yky^{k} at each iteration. To solve the above problem, note that since y=1\|y\|=1 for all y𝕊n1y\in\mathbb{S}^{n-1}, the problem is further equivalent to:

miny𝕊n112y+h(yk)2+y1.\min_{y\in\mathbb{S}^{n-1}}\frac{1}{2}\left\|y+\nabla h(y^{k})\right\|^{2}+\|y\|_{1}. (32)

The above problem can then be solved in closed form using Lemma 7.1.
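
A minimal sketch of this inner iteration is given below. It assumes the helper sphere_l1_prox from the previous sketch and takes the Euclidean gradient of $h$ as an input, since its closed form is not derived here; note that (32) is exactly Lemma 7.1 with $\lambda=1$ and $x=-\nabla h(y^{k})$.

```python
import numpy as np

def solve_sphere_subproblem(grad_h, y0, n_iter=5):
    """Solve (31) by repeatedly solving the linearized problem (32).

    grad_h : callable returning the Euclidean gradient of h at the current y
             (taken as an input, since its closed form is not derived here).
    y0     : initial point on the unit sphere.
    """
    y = np.asarray(y0, dtype=float)
    for _ in range(n_iter):
        # (32) is Lemma 7.1 with lambda = 1 and x = -grad h(y^k)
        y = sphere_l1_prox(-grad_h(y), 1.0)
    return y
```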

7.2 Sparse PCA on the Stiefel manifold

Now consider sparse PCA on the Stiefel manifold. The specific optimization problem is formulated as follows:

\min_{X\in\operatorname{st}(p,n)}-\operatorname{trace}\left(X^{T}A^{T}AX\right)+\lambda\|X\|_{1}, (33)

where st(p,n)={Xn×pXTX=Ip}\operatorname{st}(p,n)=\left\{X\in\mathbb{R}^{n\times p}\mid X^{T}X=I_{p}\right\} denotes the Stiefel manifold, and λ>0\lambda>0. Moreover, the term trace(XTATAX)-\operatorname{trace}\left(X^{T}A^{T}AX\right) aims to maximize the variance of the data, X1\|X\|_{1} characterizes the sparsity of the principal components, and AA is the original data matrix. The tangent space is defined as

TXst(p,n)={ηn×pXTη+ηTX=0}.\mathrm{T}_{X}\operatorname{st}(p,n)=\left\{\eta\in\mathbb{R}^{n\times p}\mid X^{T}\eta+\eta^{T}X=0\right\}.

Here we use the Euclidean metric ηX,ξXX=trace(ηXTξX)\left\langle\eta_{X},\xi_{X}\right\rangle_{X}=\operatorname{trace}(\eta_{X}^{T}\xi_{X}) as the Riemannian metric. We also use the polar decomposition as the retraction, i.e.,

RX(ηX)=(X+ηX)(Ip+ηXTηX)1/2.R_{X}\left(\eta_{X}\right)=\left(X+\eta_{X}\right)\left(I_{p}+\eta_{X}^{T}\eta_{X}\right)^{-1/2}.
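
A possible NumPy implementation of this polar retraction, using an eigenvalue decomposition of the symmetric positive definite matrix $I_{p}+\eta_{X}^{T}\eta_{X}$ for the inverse matrix square root, is sketched below.

```python
import numpy as np

def polar_retraction(X, eta):
    """Polar retraction on the Stiefel manifold: R_X(eta) = (X + eta)(I_p + eta^T eta)^{-1/2}."""
    p = X.shape[1]
    M = np.eye(p) + eta.T @ eta              # symmetric positive definite
    w, Q = np.linalg.eigh(M)                 # M = Q diag(w) Q^T
    M_inv_sqrt = Q @ np.diag(1.0 / np.sqrt(w)) @ Q.T
    return (X + eta) @ M_inv_sqrt
```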

The vector transport by differentiated retraction is given in [28, Lemma 10.2.1] by

𝒯ηXξX=YΩ+(InYYT)ξX(YT(X+ηX))1,\mathcal{T}_{\eta_{X}}\xi_{X}=Y\Omega+\left(I_{n}-YY^{T}\right)\xi_{X}\left(Y^{T}\left(X+\eta_{X}\right)\right)^{-1},

where Y=RX(ηX)Y=R_{X}\left(\eta_{X}\right) and Ω\Omega is the solution of the Sylvester equation

(YT(X+ηX))Ω+Ω(YT(X+ηX))=YTξXξXTY.\left(Y^{T}\left(X+\eta_{X}\right)\right)\Omega+\Omega\left(Y^{T}(X+\eta_{X})\right)=Y^{T}\xi_{X}-\xi_{X}^{T}Y.
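
Assuming the polar_retraction sketch above, this transport can be implemented as follows; the Sylvester equation is solved with scipy.linalg.solve_sylvester.

```python
import numpy as np
from scipy.linalg import solve_sylvester

def transport_diff_polar(X, eta, xi):
    """Vector transport by the differentiated polar retraction (formulas above)."""
    Y = polar_retraction(X, eta)                    # Y = R_X(eta)
    S = Y.T @ (X + eta)                             # S = Y^T (X + eta)
    # solve S*Omega + Omega*S = Y^T xi - xi^T Y
    Omega = solve_sylvester(S, S, Y.T @ xi - xi.T @ Y)
    n = X.shape[0]
    return Y @ Omega + (np.eye(n) - Y @ Y.T) @ xi @ np.linalg.inv(S)
```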

For ease of notation, we define the operator 𝒜k\mathcal{A}_{k} by 𝒜k(V):=VTXk+XkTV\mathcal{A}_{k}(V):=V^{T}X_{k}+X_{k}^{T}V and rewrite the subproblem (29) as

Vk:=argmin𝑉\displaystyle V_{k}:=\underset{V}{\operatorname{argmin}} 𝒯Rdkgradf(Xk),V+g(Xk+V)\displaystyle\left\langle\mathcal{T}_{R_{d^{k}}}^{-\sharp}\mathrm{grad}f\left(X_{k}\right),V\right\rangle+g\left(X_{k}+V\right)
s.t. 𝒜k(V)=0.\displaystyle\mathcal{A}_{k}(V)=0.

This can be solved by the subgradient method or the proximal gradient method. For the subgradient method, note that a subgradient of $g(X)$ can be calculated through the function $m(X)$, which is defined elementwise as:

  1. $m_{ij}(X)=\lambda$ if $(X)_{ij}>0$,
  2. $m_{ij}(X)=-\lambda$ if $(X)_{ij}<0$,
  3. $m_{ij}(X)\in[-\lambda,\lambda]$ if $(X)_{ij}=0$,

where $m_{ij}(X)$ denotes the $(i,j)$-th entry of $m(X)$. A subgradient of the objective of the above problem can then be written as $\mathcal{T}_{R_{d^{k}}}^{-\sharp}\mathrm{grad}f(X_{k})+\lambda\operatorname{sign}(X_{k}+V_{k}).$
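
A sketch of this subgradient computation, together with one projected-subgradient step in which the iterate is projected back onto the subspace $\{V:\mathcal{A}_{k}(V)=0\}$ (the standard tangent-space projection, an implementation choice not detailed above), might look as follows.

```python
import numpy as np

def l1_subgradient(X, lam):
    """One element m(X) of the subdifferential of lam*||X||_1 (0 is chosen where X_ij = 0)."""
    return lam * np.sign(X)

def tangent_projection(X, V):
    """Euclidean projection onto {V : X^T V + V^T X = 0}, the tangent space at X."""
    S = X.T @ V
    return V - X @ (S + S.T) / 2.0

def subgradient_step(X, V, transported_grad, lam, step):
    """One projected-subgradient step for the subproblem above (illustrative)."""
    sub = transported_grad + l1_subgradient(X + V, lam)  # subgradient of the objective in V
    return tangent_projection(X, V - step * sub)
```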

7.3 Numerical results

We implemented the algorithms in Python 3.9.12, and the experiments were carried out on a MacBook Pro with an Apple M1 Pro chip and 16GB of RAM.

7.3.1 Sphere experiment

For the example (30), we conducted experiments on an $n$-dimensional problem, using a randomly generated, standardized $n\times n$ matrix $A$ and a regularization parameter $\lambda=0.1$. We evaluate the Armijo step size with parameters $\zeta=0.1$, $\omega_{1}=0.05$, and $\omega_{2}=0.95$. Each experiment is run 10 times on the same problem with different initial points, and stops if $|\theta(x^{k})|$ is less than $10^{-4}$ or if the difference between the latest function value and the function value from five iterations earlier is less than $10^{-4}$. The experiments were conducted on spheres of dimension 10, 100, and 1000. We observed that solving the subproblems typically converges after just one iteration, suggesting that the local model provides an effective approximation of the original subproblem.

As shown in Table 1 and Figure 1, the results demonstrated that the Armijo step-size strategy consistently outperformed the others in terms of computational time and required iterations to achieve function value convergence. Overall, our findings suggest that the Armijo step-size strategy offers significant advantages in efficiency and speed, particularly in high-dimensional settings.

We also examine the performance of the different step-size strategies (Armijo, adaptive, and diminishing) under varying values of $\lambda$. Across all values of $\lambda$ in Table 2, the Armijo step size consistently outperforms the other methods in terms of both time and iterations. Specifically, at $\lambda=0.1$, the Armijo method achieves the fastest convergence, taking an average of 0.3505 seconds and 53.7 iterations. As $\lambda$ increases, the performance gap between Armijo and the other methods becomes more pronounced. For $\lambda=0.5$, Armijo converges in 0.3153 seconds and 44.8 iterations, while the diminishing strategy requires 2.0290 seconds and 530.2 iterations, demonstrating its inefficiency for larger $\lambda$ values. Similarly, the adaptive step size shows intermediate performance, but its time and iteration counts remain higher than those of Armijo. Overall, the results suggest that the Armijo step size is the most robust and efficient strategy across different values of $\lambda$, while the diminishing method struggles as the regularization parameter grows.

7.3.2 Comparison with the accelerated method

We consider the same problem and compare the performance of the accelerated algorithm with the non-accelerated version. To better observe the convergence rate and the differences between the various step sizes, we run 50 iterations. As shown in Figure 2, the accelerated algorithm with the Armijo and diminishing step sizes generally outperforms the non-accelerated algorithm, particularly its adaptive and diminishing variants. This behavior is similar to that of the method in [4]: the acceleration indeed speeds up the decrease in the early iterations.

Table 3 summarizes the results of 10 random runs of this comparison on the sphere with $\lambda=0.1$. Across all dimensions in Table 3, the RGCG method combined with the Armijo step size consistently achieves the lowest computation time. For instance, with $n=1000$, RGCG + Armijo takes 8.47 seconds and 346.8 iterations, outperforming the adaptive and diminishing strategies, which require more time and iterations.

The accelerated method shows similar trends, with the Armijo step size yielding better results than the adaptive and diminishing strategies. However, for higher dimensions, RGCG with the Armijo step size remains faster overall. The diminishing step size leads, in both methods, to significantly longer runtimes and higher iteration counts, especially as $n$ increases. Nevertheless, for $n=1000$ and $n=2000$, the accelerated method with the diminishing step size requires less time than RGCG with the same step size, indicating that the accelerated method benefits more from the diminishing step size in higher dimensions. In summary, the Armijo step size proves to be the most efficient strategy, particularly when paired with the RGCG method, making it the preferred choice for both low- and high-dimensional problems.

Figure 1: Example (30) with fixed $A$, different $n$, and $\lambda=0.1$. Panels (a)-(c) show the function value, (d)-(f) the theta value, and (g)-(i) the CPU time, for $n=10$, $100$, and $1000$.
Table 1: Average results of 10 random runs for sparse PCA on the Sphere manifold with λ=0.1\lambda=0.1 and fixed AA
n     Step size    Time (seconds)  Iterations
10    Armijo       0.0152          14.50
10    Adaptive     0.0208          31.30
10    Diminishing  0.0491          88.80
100   Armijo       0.0771          32.20
100   Adaptive     0.1761          78.70
100   Diminishing  0.3113          244.60
1000  Armijo       8.3842          344.60
1000  Adaptive     17.4690         921.00
1000  Diminishing  13.1272         929.70
Table 2: Average results of 10 random runs for sparse PCA on the Sphere manifold with varying λ\lambda and fixed AA, n=500n=500
λ     Step size    Time (seconds)  Iterations
0.1   Armijo       0.3505          53.70
0.1   Adaptive     0.7006          136.10
0.1   Diminishing  2.2043          522.90
0.3   Armijo       0.3987          51.60
0.3   Adaptive     0.6094          111.70
0.3   Diminishing  1.6226          381.80
0.5   Armijo       0.3153          44.80
0.5   Adaptive     0.5716          111.30
0.5   Diminishing  2.0290          530.20
1.0   Armijo       0.3016          46.50
1.0   Adaptive     0.5137          97.30
1.0   Diminishing  1.8412          465.50
Figure 2: Example (30) with fixed $A$, $\lambda=0.1$, and different $n$, comparing RGCG and the accelerated method. Panels (a)-(c) show the function value and (d)-(f) the theta value, for $n=10$, $100$, and $1000$.
Table 3: Comparison of Accelerated and RGCG Methods: Average Results of 10 Random Runs for Sparse PCA on Sphere Manifolds with λ=0.1\lambda=0.1
n     Method + Step size         Time (seconds)  Iterations
10    RGCG + Armijo              0.0173          15.30
10    RGCG + Adaptive            0.0288          36.10
10    RGCG + Diminishing         0.0796          109.50
10    Accelerated + Armijo       0.0496          36.80
10    Accelerated + Adaptive     0.0507          61.30
10    Accelerated + Diminishing  0.1137          107.30
100   RGCG + Armijo              0.1069          32.40
100   RGCG + Adaptive            0.1670          83.00
100   RGCG + Diminishing         0.6486          295.50
100   Accelerated + Armijo       0.2629          85.30
100   Accelerated + Adaptive     0.3232          169.30
100   Accelerated + Diminishing  0.9453          368.50
500   RGCG + Armijo              0.3750          56.70
500   RGCG + Adaptive            0.8563          160.50
500   RGCG + Diminishing         2.4350          605.30
500   Accelerated + Armijo       1.3190          166.00
500   Accelerated + Adaptive     2.4168          385.60
500   Accelerated + Diminishing  2.8703          417.10
1000  RGCG + Armijo              8.4704          346.80
1000  RGCG + Adaptive            17.8079         830.50
1000  RGCG + Diminishing         14.0213         908.70
1000  Accelerated + Armijo       8.6065          255.30
1000  Accelerated + Adaptive     13.2967         687.70
1000  Accelerated + Diminishing  12.1202         445.60
2000  RGCG + Armijo              14.6191         160.20
2000  RGCG + Adaptive            21.9078         411.40
2000  RGCG + Diminishing         50.3453         1106.70
2000  Accelerated + Armijo       19.0932         188.00
2000  Accelerated + Adaptive     47.8875         752.60
2000  Accelerated + Diminishing  25.3371         396.20

7.3.3 Stiefel experiment

For the example (33), we conduct experiments on an $(n,p)$-dimensional problem, using a randomly generated, standardized symmetric $n\times n$ matrix $A$ and a regularization parameter $\lambda=0.1$. As in the previous experiment, we evaluate the Armijo step size with parameters $\zeta=0.1$, $\omega_{1}=0.05$, and $\omega_{2}=0.95$. Each experiment is run 10 times on the same problem with different initial points, and stops when $|\theta(x^{k})|$ or the change in the function value falls below $10^{-4}$. The maximum number of iterations allowed for solving the subproblems is set to 2. The performance of the RGCG method is analyzed by plotting the convergence behavior of the objective values and of $\theta$ over the iterations, and by reporting average performance in tables.

The experimental results for sparse PCA on the Stiefel manifold (Table 4 and Figure 3) show clear differences in performance between the three step-size strategies across various problem sizes. For smaller problem dimensions (e.g., $n=100$ and $n=200$), the Armijo step size consistently achieves the fewest iterations, indicating its efficiency in reaching convergence faster. In contrast, the diminishing step size, while stable, requires significantly more iterations and time, especially as the problem dimension increases.

For higher dimensions (n=500n=500 and n=1000n=1000), the adaptive step size shows a competitive balance between time and iterations, especially compared to the diminishing one, which becomes increasingly less efficient. However, Armijo continues to outperform both in terms of iterations, making it the preferred choice for high-dimensional problems where computational efficiency is critical. Overall, the Armijo step size demonstrates superior performance in both time and iteration count, particularly for larger problem sizes. The diminishing step size, while robust, is less efficient for larger problems, indicating that it may not be suitable for high-dimensional optimization tasks.

In Table 5, we see a similar trend, with the Armijo step size consistently performing well across varying values of $\lambda$. At $\lambda=0.1$, Armijo converges in 0.6839 seconds and 103.3 iterations, slightly slower than the adaptive method in terms of time (0.5516 seconds) but significantly faster than the diminishing strategy (1.3375 seconds and 409.9 iterations). For other values of $\lambda$, Armijo maintains its competitive performance. For $\lambda=1.0$, it achieves an average of 0.9943 seconds and 147.8 iterations, compared to 1.7331 seconds and 514.4 iterations for the diminishing step size. While the adaptive strategy generally outperforms the diminishing one, it remains less efficient than Armijo in terms of iterations.

Table 4: Average results of 10 random runs for sparse PCA on the Stiefel manifold with λ=0.1\lambda=0.1 and fixed AA
(p, n)      Step size    Time (seconds)  Iterations
(10, 100)   Armijo       0.2344          104.7
(10, 100)   Adaptive     0.2287          233.0
(10, 100)   Diminishing  0.1775          192.4
(10, 200)   Armijo       0.2729          59.2
(10, 200)   Adaptive     0.4160          242.8
(10, 200)   Diminishing  0.5557          347.0
(10, 500)   Armijo       1.8151          115.8
(10, 500)   Adaptive     1.5056          229.5
(10, 500)   Diminishing  5.5653          827.5
(10, 1000)  Armijo       11.9895         219.0
(10, 1000)  Adaptive     12.6799         459.3
(10, 1000)  Diminishing  44.2100         1630.5
Table 5: Average results of 10 random runs for sparse PCA on the Stiefel manifold with varying λ\lambda and fixed AA, n=200,p=10n=200,p=10
λ     Step size    Time (seconds)  Iterations
0.1   Armijo       0.6839          103.3
0.1   Adaptive     0.5516          171.9
0.1   Diminishing  1.3375          409.9
0.3   Armijo       1.0916          166.6
0.3   Adaptive     0.6786          206.7
0.3   Diminishing  1.4409          442.2
0.5   Armijo       0.9410          136.0
0.5   Adaptive     0.5557          167.9
0.5   Diminishing  1.3503          402.8
1.0   Armijo       0.9943          147.8
1.0   Adaptive     0.5600          170.1
1.0   Diminishing  1.7331          514.4
Figure 3: Example (33) with fixed $A$, $\lambda=0.1$, and different $n$ and $p$. Panels (a)-(c) show the function value, (d)-(f) the theta value, and (g)-(i) the CPU time, for $(n,p)=(100,10)$, $(500,10)$, and $(1000,10)$.

The overall results demonstrate that both the adaptive and Armijo step-size methods outperform the diminishing step size in terms of convergence speed and computational time. Specifically, the adaptive strategy often achieves the lowest runtime, while the Armijo step size requires the fewest iterations in the majority of experiments. The performance of the algorithms is significantly influenced by the experimental parameters. Notably, as the dimensions $n$ and $p$ increase, the computational time and the number of iterations of all methods increase, but the adaptive and Armijo step sizes still perform better than the diminishing one. We also observed that the diminishing step-size method slows down considerably as it approaches a stationary point: while it can initially reduce the objective function value quickly, its decreasing step size results in very small updates near the solution, leading to a slower convergence rate. Although all step-size strategies converge effectively for smaller problems, the choice of an appropriate step size becomes crucial for larger ones. These experiments suggest that, for optimization on the Stiefel manifold, the adaptive and Armijo step-size methods are generally more effective than the diminishing one, and their use is recommended in practical applications.

Through all the experiments, we observed that as the iteration point approaches the optimal solution, the descent slows down and requires more iterations to reach optimality. To address this issue, we might consider using a threshold or combining our approach with other methods to improve convergence.

8 Conclusion and future works

In this work, we proposed novel conditional gradient methods specifically designed for composite function optimization on Riemannian manifolds. Our investigation focused on Armijo, adaptive, and diminishing step-size strategies. We proved that the adaptive and diminishing step-size strategies achieve a convergence rate of 𝒪(1/k)\mathcal{O}(1/k), while the Armijo step-size strategy results in an iteration complexity of J+𝒪(1/ϵ2)J+\mathcal{O}(1/\epsilon^{2}). In addition, under the assumption that the function gg is Lipschitz continuous, we derived an iteration complexity of 𝒪(1/ϵ2)\mathcal{O}(1/\epsilon^{2}) for the Armijo step size. Furthermore, we proposed a specialized algorithm to solve the subproblems in the presence of non-convexity, thereby broadening the applicability of our methods. Our discussions and analyses are not confined to specific retraction and transport operations, enhancing the applicability of our methods to a broader range of manifold optimization problems.

Also, from the results of the Stiefel manifold experiments, we observe that the convergence of the adaptive step size method tends to outperform the diminishing step size method. While the theoretical analysis previously focused on proving convergence rates through the diminishing step size, these results suggest that alternative step size strategies, such as adaptive methods, may lead to stronger performance. This opens the possibility of further exploration of different step size techniques to achieve better convergence rates in future studies.

Future work on manifold optimization could significantly benefit from the development and enhancement of accelerated conditional gradient methods. The momentum-guided Frank-Wolfe algorithm, as introduced by Li et al. [4], leverages Nesterov’s accelerated gradient method to enhance convergence rates on constrained domains. This approach could be adapted to various manifold settings to exploit their geometric properties more effectively. Additionally, Zhang et al. [5] proposed accelerating the Frank-Wolfe algorithm through the use of weighted average gradients, presenting a promising direction for improving optimization efficiency on manifolds. Future research could focus on combining these acceleration techniques with Riemannian geometry principles to develop robust, scalable algorithms for complex manifold-constrained optimization problems.

Acknowledgments

This work was supported by JST SPRING (JPMJSP2110), and Japan Society for the Promotion of Science, Grant-in-Aid for Scientific Research (C) (JP19K11840).

References

  • [1] M. Frank, P. Wolfe, An algorithm for quadratic programming, Naval Research Logistics 3 (1-2) (1956) 95–110.
  • [2] F. Bach, Duality between subgradient and conditional gradient methods, SIAM Journal on Optimization 25 (1) (2015) 115–129.
  • [3] M. Jaggi, Revisiting Frank-Wolfe: Projection-free sparse convex optimization, in: S. Dasgupta, D. McAllester (Eds.), Proceedings of the 30th International Conference on Machine Learning, Vol. 28 of Proceedings of Machine Learning Research, PMLR, Atlanta, Georgia, USA, 2013, pp. 427–435.
  • [4] B. Li, M. Coutino, G. B. Giannakis, G. Leus, A momentum-guided Frank-Wolfe algorithm, IEEE Transactions on Signal Processing 69 (2021) 3597–3611.
  • [5] Y. Zhang, B. Li, G. B. Giannakis, Accelerating Frank-Wolfe with weighted average gradients, in: ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2021, pp. 5529–5533.
  • [6] S. Lacoste-Julien, M. Jaggi, M. Schmidt, P. Pletscher, Block-coordinate Frank-Wolfe optimization for structural SVMs, in: International Conference on Machine Learning, PMLR, 2013, pp. 53–61.
  • [7] N. Dalmasso, R. Zhao, M. Ghassemi, V. Potluru, T. Balch, M. Veloso, Efficient event series data modeling via first-order constrained optimization, in: Proceedings of the Fourth ACM International Conference on AI in Finance, 2023, pp. 463–471.
  • [8] P. B. Assunção, O. P. Ferreira, L. F. Prudente, Conditional gradient method for multiobjective optimization, Computational Optimization and Applications 78 (2021) 741–768.
  • [9] P. B. Assunção, O. P. Ferreira, L. F. Prudente, A generalized conditional gradient method for multiobjective composite optimization problems, Optimization (2023) 1–31.
  • [10] A. G. Gebrie, E. H. Fukuda, Adaptive generalized conditional gradient method for multiobjective optimization, arXiv preprint arXiv:2404.04174 (2024).
  • [11] S. J. Wright, R. D. Nowak, M. A. Figueiredo, Sparse reconstruction by separable approximation, IEEE Transactions on Signal Processing 57 (7) (2009) 2479–2493.
  • [12] R. Tibshirani, Regression shrinkage and selection via the Lasso, Journal of the Royal Statistical Society Series B: Statistical Methodology 58 (1) (1996) 267–288.
  • [13] P. L. Combettes, V. R. Wajs, Signal recovery by proximal forward-backward splitting, Multiscale Modeling & Simulation 4 (4) (2005) 1168–1200.
  • [14] A. Beck, First-Order Methods in Optimization, SIAM, 2017.
  • [15] M. Fukushima, H. Mine, A generalized proximal point algorithm for certain non-convex minimization problems, International Journal of Systems Science 12 (8) (1981) 989–1000.
  • [16] N. Parikh, S. Boyd, Proximal algorithms, Foundations and Trends in Optimization 1 (3) (2014) 127–239.
  • [17] A. Beck, M. Teboulle, A fast iterative shrinkage-thresholding algorithm for linear inverse problems, SIAM Journal on Imaging Sciences 2 (1) (2009) 183–202.
  • [18] S. Boyd, N. Parikh, E. Chu, B. Peleato, J. Eckstein, Distributed optimization and statistical learning via the alternating direction method of multipliers, Foundations and Trends® in Machine Learning 3 (1) (2011) 1–122.
  • [19] S. Chen, S. Ma, A. Man-Cho So, T. Zhang, Proximal gradient method for nonsmooth optimization over the Stiefel manifold, SIAM Journal on Optimization 30 (1) (2020) 210–239.
  • [20] W. Huang, K. Wei, Riemannian proximal gradient methods, Mathematical Programming 194 (1) (2022) 371–413.
  • [21] M. Weber, S. Sra, Riemannian optimization via Frank-Wolfe methods, Mathematical Programming 199 (1) (2023) 525–556.
  • [22] Y. Yu, X. Zhang, D. Schuurmans, Generalized conditional gradient for sparse estimation, Journal of Machine Learning Research 18 (144) (2017) 1–46.
  • [23] R. Zhao, R. M. Freund, Analysis of the Frank-Wolfe method for convex composite optimization involving a logarithmically-homogeneous barrier, Mathematical Programming 199 (1) (2023) 123–163.
  • [24] Y.-M. Cheung, J. Lou, Efficient generalized conditional gradient with gradient sliding for composite optimization, in: Twenty-Fourth International Joint Conference on Artificial Intelligence, 2015.
  • [25] H. Sato, Riemannian conjugate gradient methods: General framework and specific algorithms with convergence analyses, SIAM Journal on Optimization 32 (4) (2022) 2690–2717.
  • [26] N. Boumal, An Introduction to Optimization on Smooth Manifolds, Cambridge University Press, 2023.
  • [27] F. Alimisis, A. Orvieto, G. Becigneul, A. Lucchi, Momentum improves optimization on Riemannian manifolds, in: International Conference on Artificial Intelligence and Statistics, PMLR, 2021, pp. 1351–1359.
  • [28] W. Huang, Optimization algorithms on Riemannian manifolds with applications, Ph.D. thesis, The Florida State University (2013).