
Completely Positive Factorization by a Riemannian Smoothing Method

ZHIJIAN LAI Graduate School of Systems and Information Engineering, University of Tsukuba, Tsukuba, Ibaraki 305-8573, Japan. E-mail:[email protected]    AKIKO YOSHISE Corresponding author. Faculty of Engineering, Information and Systems, University of Tsukuba, Tsukuba, Ibaraki 305-8573, Japan. E-mail:[email protected]
Abstract

Copositive optimization is a special case of convex conic programming, and it consists of optimizing a linear function over the cone of all completely positive matrices under linear constraints. Copositive optimization provides powerful relaxations of NP-hard quadratic or combinatorial problems, but there are still many open questions regarding copositive or completely positive matrices. In this paper, we focus on one such problem: finding a completely positive (CP) factorization for a given completely positive matrix. We treat it as a nonsmooth Riemannian optimization problem, i.e., a minimization problem of a nonsmooth function over a Riemannian manifold. To solve this problem, we present a general smoothing framework for nonsmooth Riemannian optimization problems and show convergence to a stationary point of the original problem. An advantage is that the framework can be implemented quickly with minimal effort by directly using existing standard smooth Riemannian solvers, such as those in Manopt. Numerical experiments show the efficiency of our method, especially for large-scale CP factorizations.

Keywords: completely positive factorization, smoothing method, nonsmooth Riemannian optimization problem

AMS: 15A23, 15B48, 90C48, 90C59

1 Introduction

The space of $n\times n$ real symmetric matrices $\mathcal{S}_{n}$ is endowed with the trace inner product $\langle A,B\rangle:=\operatorname{trace}(AB)$. A matrix $A\in\mathcal{S}_{n}$ is called completely positive if for some $r\in\mathbb{N}$ there exists an entrywise nonnegative matrix $B\in\mathbb{R}^{n\times r}$ such that $A=BB^{\top}$, and we call $B$ a CP factorization of $A$. We define $\mathcal{CP}_{n}$ as the set of $n\times n$ completely positive matrices, equivalently characterized as

\mathcal{CP}_{n}:=\{BB^{\top}\in\mathcal{S}_{n}\mid B\text{ is a nonnegative matrix}\}=\operatorname{conv}\{xx^{\top}\mid x\in\mathbb{R}_{+}^{n}\},

where $\operatorname{conv}(S)$ denotes the convex hull of a given set $S$. We denote the set of $n\times n$ copositive matrices by $\mathcal{COP}_{n}:=\{A\in\mathcal{S}_{n}\mid x^{\top}Ax\geq 0\text{ for all }x\in\mathbb{R}_{+}^{n}\}$. It is known that $\mathcal{COP}_{n}$ and $\mathcal{CP}_{n}$ are duals of each other under the trace inner product; moreover, both $\mathcal{CP}_{n}$ and $\mathcal{COP}_{n}$ are proper convex cones [3, Section 2.2]. For any positive integer $n$, we have the following inclusion relationships among other important cones in conic optimization:

\mathcal{CP}_{n}\subseteq\mathcal{S}_{n}^{+}\cap\mathcal{N}_{n}\subseteq\mathcal{S}_{n}^{+}\subseteq\mathcal{S}_{n}^{+}+\mathcal{N}_{n}\subseteq\mathcal{COP}_{n},

where $\mathcal{S}_{n}^{+}$ is the cone of $n\times n$ symmetric positive semidefinite matrices and $\mathcal{N}_{n}$ is the cone of $n\times n$ symmetric nonnegative matrices. See the monograph [3] for a comprehensive description of $\mathcal{CP}_{n}$ and $\mathcal{COP}_{n}$.

Conic optimization is a subfield of convex optimization that studies the minimization of linear functions over proper cones. If the proper cone is $\mathcal{CP}_{n}$ or its dual cone $\mathcal{COP}_{n}$, we call the conic optimization problem a copositive programming problem. Copositive programming is closely related to many nonconvex, NP-hard quadratic and combinatorial optimization problems [29]. For example, consider the so-called standard quadratic optimization problem,

\min\{x^{\top}Mx\mid\mathbf{e}^{\top}x=1,\;x\in\mathbb{R}_{+}^{n}\}, (1)

where $M\in\mathcal{S}_{n}$ is possibly not positive semidefinite and $\mathbf{e}$ is the all-ones vector. Bomze et al. [9] showed that the following completely positive reformulation,

\min\{\langle M,X\rangle\mid\langle E,X\rangle=1,\;X\in\mathcal{CP}_{n}\},

where $E$ is the all-ones matrix, is equivalent to (1). Burer [15] reported a more general result: any quadratic problem with binary and continuous variables can be rewritten as a linear program over $\mathcal{CP}_{n}$. As an application to combinatorial problems, consider the problem of computing the independence number $\alpha(G)$ of a graph $G$ with $n$ nodes. De Klerk and Pasechnik [24] showed that

\alpha(G)=\max\{\langle E,X\rangle\mid\langle A+I,X\rangle=1,\;X\in\mathcal{CP}_{n}\},

where $A$ is the adjacency matrix of $G$. For surveys on applications of copositive programming, see [6, 10, 16, 28, 29].

The difficulty of the above problems lies entirely in the completely positive conic constraint. Note that because neither $\mathcal{COP}_{n}$ nor $\mathcal{CP}_{n}$ is self-dual, the primal-dual interior point method for conic optimization does not work as is. Besides this, there are many open problems related to completely positive cones. One is checking membership in $\mathcal{CP}_{n}$, which was shown to be NP-hard in [27]. Computing or estimating the cp-rank, as defined later in (3), is also an open problem. We refer the reader to [4, 28] for a detailed discussion of these unresolved issues.

In this paper, we focus on finding a CP factorization for a given $A\in\mathcal{CP}_{n}$, i.e., the CP factorization problem:

\text{Find } B\in\mathbb{R}^{n\times r} \text{ s.t. } A=BB^{\top} \text{ and } B\geq 0, (CPfact)

which seems to be closely related to the membership problem $A\in\mathcal{CP}_{n}$. Sometimes, a matrix is shown to be completely positive through duality, i.e., by verifying that $\langle A,X\rangle\geq 0$ for all $X\in\mathcal{COP}_{n}$, but in that case a CP factorization is not necessarily obtained.
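For concreteness, any candidate factor produced by the methods discussed later can be verified numerically. The following Matlab sketch (the tolerance tol is our own choice, not prescribed by the paper) checks whether a given B certifies that A is completely positive.

```matlab
% Check whether B is (numerically) a CP factorization of A = B*B' with B >= 0.
tol = 1e-12;   % user-chosen tolerance
is_cp_factor = all(B(:) >= 0) && ...
    norm(A - B * B', 'fro') <= tol * max(1, norm(A, 'fro'));
```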

1.1 Related work on CP factorization

Various methods of solving CP factorization problems have been studied. Jarre and Schmallowsky [33] stated a criterion for complete positivity, based on the augmented primal dual method to solve a particular second-order cone problem. Dickinson and Dür [26] dealt with complete positivity of matrices that possess a specific sparsity pattern and proposed a method for finding CP factorizations of these special matrices that can be performed in linear time. Nie [36] formulated the CP factorization problem as an $\mathcal{A}$-truncated $K$-moment problem, for which the author developed an algorithm that solves a series of semidefinite optimization problems. Sponsel and Dür [45] considered the problem of projecting a matrix onto $\mathcal{CP}_{n}$ and $\mathcal{COP}_{n}$ by using polyhedral approximations of these cones. With the help of these projections, they devised a method to compute a CP factorization for any matrix in the interior of $\mathcal{CP}_{n}$. Bomze [7] showed how to construct a CP factorization of an $n\times n$ matrix based on a given CP factorization of an $(n-1)\times(n-1)$ principal submatrix. Dutour Sikirić et al. [43] developed a simplex-like method for a rational CP factorization that works if the input matrix allows a rational CP factorization.

In 2020, Groetzner and Dür [31] applied the alternating projection method to the CP factorization problem by posing it as an equivalent feasibility problem (see (FeasCP)). Shortly afterwards, Chen et al. [19] reformulated the split feasibility problem as a difference-of-convex optimization problem and solved (FeasCP) as a specific application. In fact, we will solve this equivalent feasibility problem (FeasCP) by other means in this paper. In 2021, Boţ and Nguyen [12] proposed a projected gradient method with relaxation and inertia parameters for the CP factorization problem, aimed at solving

\min_{X}\{\|A-XX^{\top}\|^{2}\mid X\in\mathbb{R}_{+}^{n\times r}\cap\mathcal{B}(0,\sqrt{\operatorname{trace}(A)})\}, (2)

where $\mathcal{B}(0,\varepsilon):=\{X\in\mathbb{R}^{n\times r}\mid\|X\|\leq\varepsilon\}$ is the closed ball of radius $\varepsilon$ centered at 0. The authors argued that its optimal value is zero if and only if $A\in\mathcal{CP}_{n}$.

1.2 Our contributions and organization of the paper

Inspired by the idea of Groetzner and Dür [31], wherein (CPfact) is shown to be equivalent to a feasibility problem called (FeasCP), we treat the problem (FeasCP) as a nonsmooth Riemannian optimization problem and solve it through a general Riemannian smoothing method. Our contributions are summarized as follows:

1. Although it is not explicitly stated in [31], (FeasCP) is actually a Riemannian optimization formulation. We propose a new Riemannian optimization technique and apply it to the problem.

2. In particular, we present a general framework of Riemannian smoothing for the nonsmooth Riemannian optimization problem and show convergence to a stationary point of the original problem.

3. We apply the general framework of Riemannian smoothing to CP factorization. Numerical experiments clarify that our method is competitive with other efficient CP factorization methods, especially for large-scale matrices.

In Section 2, we review how (CPfact) is reformulated as a feasibility problem; in particular, we take a different approach to this problem from those of other studies. In Section 3, we describe the general framework of smoothing methods for Riemannian optimization. To apply it to the CP factorization problem, we employ the smoothing function known as LogSumExp. Section 4 is a collection of numerical experiments on CP factorization. As a meaningful supplement, in Section 5, we conduct further experiments (on the FSV problem and robust low-rank matrix completion) to explore the numerical performance of various sub-algorithms and smoothing functions on different applications.

2 Preliminaries

2.1 cp-rank and cp-plus-rank

First, let us recall some basic properties of completely positive matrices. In general, many CP factorizations of a given $A$ may exist, and they may differ in their numbers of columns. This gives rise to the following definitions: the cp-rank of $A\in\mathcal{S}_{n}$, denoted by $\operatorname{cp}(A)$, is defined as

\operatorname{cp}(A):=\min\{r\in\mathbb{N}\mid A=BB^{\top},\;B\in\mathbb{R}^{n\times r},\;B\geq 0\}, (3)

where $\operatorname{cp}(A)=\infty$ if $A\notin\mathcal{CP}_{n}$. Similarly, we can define the cp-plus-rank as

\operatorname{cp}^{+}(A):=\min\{r\in\mathbb{N}\mid A=BB^{\top},\;B\in\mathbb{R}^{n\times r},\;B>0\}.

Immediately, for all $A\in\mathcal{S}_{n}$, we have

\operatorname{rank}(A)\leq\operatorname{cp}(A)\leq\operatorname{cp}^{+}(A). (4)

Every CP factorization $B$ of $A$ has the same rank as $A$, since $\operatorname{rank}(XX^{\top})=\operatorname{rank}(X)$ holds for any matrix $X$. The first inequality of (4) comes from the fact that, for any CP factorization $B$,

\operatorname{rank}(A)=\operatorname{rank}(B)\leq\text{the number of columns of }B.

The second is trivial by definition.

Note that computing or estimating the cp-rank of a given $A\in\mathcal{CP}_{n}$ is still an open problem. The following result gives a tight upper bound on the cp-rank of $A\in\mathcal{CP}_{n}$ in terms of the order $n$.

Theorem 2.1 (Bomze, Dickinson, and Still [8, Theorem 4.1]).

For all $A\in\mathcal{CP}_{n}$, we have

\operatorname{cp}(A)\leq\mathrm{cp}_{n}:=\begin{cases}n&\text{ for }n\in\{2,3,4\},\\ \frac{1}{2}n(n+1)-4&\text{ for }n\geq 5.\end{cases} (5)
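As a small illustration, the bound (5) can be evaluated directly; the following Matlab sketch is only defined for $n\geq 2$.

```matlab
% Upper bound cp_n from (5), valid for n >= 2.
cpn = @(n) n .* (n <= 4) + (n .* (n + 1) / 2 - 4) .* (n >= 5);
cpn(5)   % returns 11, the value cp_5 used in Section 4.4
```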

The following result is useful for determining whether a completely positive matrix lies in the interior or on the boundary of $\mathcal{CP}_{n}$.

Theorem 2.2 (Dickinson [25, Theorem 3.8]).

We have

\operatorname{int}(\mathcal{CP}_{n})=\{A\in\mathcal{S}_{n}\mid\operatorname{rank}(A)=n,\;\operatorname{cp}^{+}(A)<\infty\}
=\{A\in\mathcal{S}_{n}\mid\operatorname{rank}(A)=n,\;A=BB^{\top},\;B\in\mathbb{R}^{n\times r},\;B\geq 0,\;b_{j}>0\text{ for at least one column }b_{j}\text{ of }B\}.

2.2 CP factorization as a feasibility problem

Groetzner and Dür [31] reformulated the CP factorization problem as an equivalent feasibility problem containing an orthogonality constraint.

Given $A\in\mathcal{CP}_{n}$, if we have a CP factorization $B$ with $r$ columns, we can easily obtain another CP factorization $\widehat{B}$ with $r^{\prime}$ columns for every integer $r^{\prime}\geq r$. The simplest way to construct such an $n\times r^{\prime}$ matrix $\widehat{B}$ is to append $k:=r^{\prime}-r$ zero columns to $B$, i.e., $\widehat{B}:=[B,\;0_{n\times k}]\geq 0$. Another way is called column replication, i.e.,

\widehat{B}:=[b_{1},\ldots,b_{n-1},\underbrace{\tfrac{1}{\sqrt{m}}b_{n},\ldots,\tfrac{1}{\sqrt{m}}b_{n}}_{m:=r^{\prime}-n+1\text{ columns}}], (6)

where $b_{i}$ denotes the $i$-th column of $B$. It is easy to see that $\widehat{B}\widehat{B}^{\top}=BB^{\top}=A$. The next lemma is easily derived from the previous discussion, and it implies that there always exists an $n\times\mathrm{cp}_{n}$ CP factorization for any $A\in\mathcal{CP}_{n}$. Recall that the definition of $\mathrm{cp}_{n}$ is given in (5).
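A minimal Matlab sketch of column replication is given below. It follows (6), replicating the last column of the current factor; it is written for an arbitrary number of columns of B, our slight generalization of (6), which is stated for a factor with n columns.

```matlab
% Extend an n-by-r factor B to n-by-rprime columns without changing B*B',
% by replicating the (scaled) last column as in (6). Assumes rprime >= size(B,2).
function Bhat = replicate_columns(B, rprime)
    m = rprime - size(B, 2) + 1;                   % number of scaled copies
    Bhat = [B(:, 1:end-1), repmat(B(:, end) / sqrt(m), 1, m)];
end
```

One can check that Bhat*Bhat' reproduces B*B' up to rounding error.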

Lemma 2.3.

Suppose that $A\in\mathcal{S}_{n}$ and $r\in\mathbb{N}$. Then $r\geq\operatorname{cp}(A)$ if and only if $A$ has a CP factorization $B$ with $r$ columns.

Let $\mathcal{O}(r)$ denote the orthogonal group of order $r$, i.e., the set of $r\times r$ orthogonal matrices. The following lemma is essential to our study. (Note that many authors have proved the existence of such an orthogonal matrix $X$; see, e.g., [17, Lemma 2.1] and [31, Lemma 2.6].)

Lemma 2.4.

Let $B,C\in\mathbb{R}^{n\times r}$. Then $BB^{\top}=CC^{\top}$ if and only if there exists $X\in\mathcal{O}(r)$ with $BX=C$.

The next proposition puts the previous two lemmas together.

Proposition 2.5.

Let $A\in\mathcal{CP}_{n}$, $r\geq\operatorname{cp}(A)$, and $A=\bar{B}\bar{B}^{\top}$, where $\bar{B}\in\mathbb{R}^{n\times r}$ is possibly not nonnegative. Then there exists an orthogonal matrix $X\in\mathcal{O}(r)$ such that $\bar{B}X\geq 0$ and $A=(\bar{B}X)(\bar{B}X)^{\top}$.

This proposition tells us that one can find an orthogonal matrix $X$ that turns a “bad” factorization $\bar{B}$ into a “good” factorization $\bar{B}X$. Let $r\geq\operatorname{cp}(A)$ and let $\bar{B}\in\mathbb{R}^{n\times r}$ be an arbitrary (possibly not nonnegative) initial factorization, $A=\bar{B}\bar{B}^{\top}$. The task of finding a CP factorization of $A$ can then be formulated as the following feasibility problem,

\text{Find } X \text{ s.t. } \bar{B}X\geq 0 \text{ and } X\in\mathcal{O}(r). (FeasCP)

Note that the condition $r\geq\operatorname{cp}(A)$ is necessary; otherwise, (FeasCP) has no solution even if $A\in\mathcal{CP}_{n}$. Since the exact value of $\operatorname{cp}(A)$ is often unknown, one can use $r=\mathrm{cp}_{n}$ from (5). Finding an initial factorization $\bar{B}$ is not difficult: since a completely positive matrix is necessarily positive semidefinite, one can use a Cholesky decomposition or a spectral decomposition and then extend it to $r$ columns by using (6). The following corollary shows that the feasibility of (FeasCP) is precisely a criterion for complete positivity.

Corollary 2.6.

Let $r\geq\operatorname{cp}(A)$ and let $\bar{B}\in\mathbb{R}^{n\times r}$ be an arbitrary initial factorization of $A$. Then $A\in\mathcal{CP}_{n}$ if and only if (FeasCP) is feasible. In this case, for any feasible solution $X$, $\bar{B}X$ is a CP factorization of $A$.

In this study, solving (FeasCP) is the key to finding a CP factorization, but it is still a hard problem because $\mathcal{O}(r)$ is nonconvex.

2.3 Approaches to solving (FeasCP)

Groetzner and Dür [31] applied the so-called alternating projection method to (FeasCP). They defined the polyhedral cone $\mathcal{P}:=\{X\in\mathbb{R}^{r\times r}:\bar{B}X\geq 0\}$ and rewrote (FeasCP) as

\text{Find } X \text{ s.t. } X\in\mathcal{P}\cap\mathcal{O}(r).

The alternating projection method is as follows: choose a starting point $X_{0}\in\mathcal{O}(r)$; then compute $P_{0}=\operatorname{proj}_{\mathcal{P}}(X_{0})$ and $X_{1}=\operatorname{proj}_{\mathcal{O}(r)}(P_{0})$, and iterate this process. Computing the projection onto $\mathcal{P}$ amounts to solving a second-order cone problem (SOCP), while computing the projection onto $\mathcal{O}(r)$ amounts to a singular value decomposition. Note that an SOCP must be solved at every iteration, which is expensive in practice. A modified version, for which convergence is not guaranteed, replaces $\operatorname{proj}_{\mathcal{P}}(X_{k})$ with an approximation computed via the Moore-Penrose inverse of $\bar{B}$; for details, see [31, Algorithm 2].
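The projection onto $\mathcal{O}(r)$ has a closed form via the singular value decomposition (the polar factor), whereas the projection onto $\mathcal{P}$ is computed by an SOCP in [31] or approximated via the Moore-Penrose inverse of $\bar{B}$. A Matlab sketch of the orthogonal-group projection:

```matlab
% Projection onto O(r): the nearest orthogonal matrix to P in the
% Frobenius norm is U*V' from the SVD P = U*S*V'.
function X = proj_orth(P)
    [U, ~, V] = svd(P);
    X = U * V';
end
```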

Our approach, instead, uses an optimization formulation. Here, we denote by $\max(\cdot)$ (resp. $\min(\cdot)$) the max-function (resp. min-function) that selects the largest (resp. smallest) entry of a vector or matrix. Notice that $-\min(\cdot)=\max(-(\cdot))$. We associate (FeasCP) with the following optimization problem:

\max_{X\in\mathcal{O}(r)}\{\min(\bar{B}X)\}.

For consistency of notation, we turn the maximization problem into a minimization problem:

\min_{X\in\mathcal{O}(r)}\{\max(-\bar{B}X)\}. (OptCP)

The feasible set $\mathcal{O}(r)$ is known to be compact [32, Observation 2.1.7]. In accordance with the extreme value theorem [40, Theorem 4.16], (OptCP) attains its global minimum, say $t$. Summarizing these observations together with Corollary 2.6 yields the following proposition.

Proposition 2.7.

Set $r\geq\operatorname{cp}(A)$, and let $\bar{B}\in\mathbb{R}^{n\times r}$ be an arbitrary initial factorization of $A$. Then the following statements are equivalent:

  1. $A\in\mathcal{CP}_{n}$.

  2. (FeasCP) is feasible.

  3. In (OptCP), there exists a feasible solution $X$ such that $\max(-\bar{B}X)\leq 0$; equivalently, $\min(\bar{B}X)\geq 0$.

  4. In (OptCP), the global minimum satisfies $t\leq 0$.

3 Riemannian smoothing method

The problem of minimizing a real-valued function over a Riemannian manifold $\mathcal{M}$, which is called Riemannian optimization, has been actively studied during the last few decades. In particular, the Stiefel manifold,

\operatorname{St}(n,p)=\{X\in\mathbb{R}^{n\times p}\mid X^{\top}X=I\}

(when $n=p$, it reduces to the orthogonal group), is an important case and is our main interest here. We treat the CP factorization problem, i.e., (OptCP), as a problem of minimizing a nonsmooth function over a Riemannian manifold, for which variants of subgradient methods [11], proximal gradient methods [23], and the alternating direction method of multipliers (ADMM) [34] have been studied.

Smoothing methods [21], which use a parameterized smoothing function to approximate the objective function, are effective on a class of nonsmooth optimization problems in Euclidean space. Recently, Zhang, Chen and Ma [47] extended a smoothing steepest descent method to the case of Riemannian submanifolds of $\mathbb{R}^{n}$. This is not the first time that smoothing methods have been studied on manifolds. Liu and Boumal [35] extended the augmented Lagrangian method and the exact penalty method to the Riemannian case; the latter leads to a nonsmooth Riemannian optimization problem, to which they applied smoothing techniques. Cambier and Absil [18] dealt with the problem of robust low-rank matrix completion by solving a Riemannian optimization problem, wherein they applied a smoothing conjugate gradient method.

In this section, we propose a general Riemannian smoothing method and apply it to the CP factorization problem.

3.1 Notation and terminology of Riemannian optimization

Let us briefly review some concepts in Riemannian optimization, following the notation of [13]. Throughout this paper, $\mathcal{M}$ will refer to a complete Riemannian submanifold of a Euclidean space $\mathbb{R}^{n}$. Thus, $\mathcal{M}$ is endowed with the Riemannian metric induced from the Euclidean inner product, i.e., $\langle\xi,\eta\rangle_{x}:=\xi^{\top}\eta$ for any $\xi,\eta\in{\rm T}_{x}\mathcal{M}$, where ${\rm T}_{x}\mathcal{M}\subseteq\mathbb{R}^{n}$ is the tangent space to $\mathcal{M}$ at $x$. The Riemannian metric induces the usual Euclidean norm $\|\xi\|_{x}:=\|\xi\|=\sqrt{\langle\xi,\xi\rangle_{x}}$ for $\xi\in{\rm T}_{x}\mathcal{M}$. The tangent bundle ${\rm T}\mathcal{M}:=\bigsqcup_{x\in\mathcal{M}}{\rm T}_{x}\mathcal{M}$ is the disjoint union of the tangent spaces of $\mathcal{M}$. Let $f:\mathcal{M}\rightarrow\mathbb{R}$ be a smooth function on $\mathcal{M}$. The Riemannian gradient of $f$ is the vector field $\operatorname{grad}f$ on $\mathcal{M}$ uniquely defined by the identity

\mathrm{D}f(x)[v]=\langle v,\operatorname{grad}f(x)\rangle_{x}\quad\text{for all }(x,v)\in{\rm T}\mathcal{M},

where $\mathrm{D}f(x):{\rm T}_{x}\mathcal{M}\to{\rm T}_{f(x)}\mathbb{R}\cong\mathbb{R}$ is the differential of $f$ at $x\in\mathcal{M}$. Since $\mathcal{M}$ is an embedded submanifold of $\mathbb{R}^{n}$, there is a simpler expression for a function $f$ that is also well defined on the whole of $\mathbb{R}^{n}$:

\operatorname{grad}f(x)=\operatorname{Proj}_{x}(\nabla f(x)),

where $\nabla f(x)$ is the usual gradient in $\mathbb{R}^{n}$ and $\operatorname{Proj}_{x}$ denotes the orthogonal projector from $\mathbb{R}^{n}$ onto ${\rm T}_{x}\mathcal{M}$. For a subset $D\subseteq\mathbb{R}^{n}$, $h\in C^{1}(D)$ means that $h$ is smooth, i.e., continuously differentiable, on $D$. Given a point $x\in\mathbb{R}^{n}$ and $\delta>0$, $\mathcal{B}(x,\delta)$ denotes the closed ball of radius $\delta$ centered at $x$. $\mathbb{R}_{++}$ denotes the set of positive real numbers. We use the subscript notation $x_{i}$ to select the $i$th entry of a vector and the superscript notation $x^{k}$ to designate an element of a sequence $\{x^{k}\}$.
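For the Stiefel manifold $\operatorname{St}(n,p)$ (and hence for $\mathcal{O}(r)=\operatorname{St}(r,r)$) viewed as an embedded submanifold of $\mathbb{R}^{n\times p}$, the tangent-space projector is known to be $\operatorname{Proj}_{X}(U)=U-X\,\mathrm{sym}(X^{\top}U)$ with $\mathrm{sym}(A)=(A+A^{\top})/2$, so the Riemannian gradient is obtained from the Euclidean one exactly as in the formula above. A Matlab sketch (in practice, a toolbox such as Manopt performs this conversion internally when the Euclidean gradient is supplied):

```matlab
% Riemannian gradient on St(n,p): project the Euclidean gradient G = nabla f(X)
% onto the tangent space at X.
function rg = rgrad_stiefel(X, G)
    symm = @(A) (A + A') / 2;
    rg = G - X * symm(X' * G);
end
```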

3.2 Ingredients

Now let us consider the nonsmooth Riemannian optimization problem (NROP):

\min_{x\in\mathcal{M}}f(x), (NROP)

where $\mathcal{M}\subseteq\mathbb{R}^{n}$ and $f:\mathbb{R}^{n}\rightarrow\mathbb{R}$ is a proper lower semi-continuous function (possibly nonsmooth or even non-Lipschitz) on $\mathbb{R}^{n}$. For convenience, the term smooth Riemannian optimization problem (SROP) refers to (NROP) when $f(\cdot)$ is continuously differentiable on $\mathbb{R}^{n}$. To avoid confusion in this case, we use $g$ instead of $f$:

\min_{x\in\mathcal{M}}g(x). (SROP)

Throughout this subsection, we will refer to many of the concepts in [47].

First, let us review the usual concepts and properties related to generalized subdifferentials in $\mathbb{R}^{n}$. For a proper lower semi-continuous function $f:\mathbb{R}^{n}\rightarrow\mathbb{R}$, the Fréchet subdifferential and the limiting subdifferential of $f$ at $x\in\mathbb{R}^{n}$ are defined as

\hat{\partial}f(x):=\{\nabla h(x)\mid\exists\delta>0\text{ such that }h\in C^{1}(\mathcal{B}(x,\delta))\text{ and }f-h\text{ attains a local minimum at }x\text{ on }\mathbb{R}^{n}\},

\partial f(x):=\{\lim_{\ell\rightarrow\infty}v^{\ell}\mid v^{\ell}\in\hat{\partial}f(x^{\ell}),\;(x^{\ell},f(x^{\ell}))\rightarrow(x,f(x))\}.

The definition of $\hat{\partial}f(x)$ above is not the standard one: the standard definition follows [39, 8.3 Definition], but the two definitions are equivalent by [39, 8.5 Proposition]. For locally Lipschitz functions, the Clarke subdifferential at $x\in\mathbb{R}^{n}$, denoted $\partial^{\circ}f(x)$, is the convex hull of the limiting subdifferential. These subdifferentials are related as follows:

\hat{\partial}f(x)\subseteq\partial f(x)\subseteq\partial^{\circ}f(x).

Notice that if $f$ is convex, $\partial f(x)$ and $\partial^{\circ}f(x)$ coincide with the classical subdifferential of convex analysis [39, 8.12 Proposition].

Example 1 (Bagirov, Karmitsa, and Mäkelä [2, Theorem 3.23]).

From a result on the pointwise max-function in convex analysis, we have

\partial\max(x)=\operatorname{conv}\{e_{i}\mid i\in\mathcal{I}(x)\},

where the $e_{i}$ are the standard basis vectors of $\mathbb{R}^{n}$ and $\mathcal{I}(x)=\{i\mid x_{i}=\max(x)\}$.

Next, we extend our discussion to generalized subdifferentials of a nonsmooth function on a submanifold $\mathcal{M}$. The Riemannian Fréchet subdifferential and the Riemannian limiting subdifferential of $f$ at $x\in\mathcal{M}$ (see, e.g., [47, Definition 3.1]) are defined as

\hat{\partial}_{\mathcal{R}}f(x):=\{\operatorname{grad}h(x)\mid\exists\delta>0\text{ such that }h\in C^{1}(\mathcal{B}(x,\delta))\text{ and }f-h\text{ attains a local minimum at }x\text{ on }\mathcal{M}\},

\partial_{\mathcal{R}}f(x):=\{\lim_{\ell\rightarrow\infty}v^{\ell}\mid v^{\ell}\in\hat{\partial}_{\mathcal{R}}f(x^{\ell}),\;(x^{\ell},f(x^{\ell}))\rightarrow(x,f(x))\}.

If $\mathcal{M}=\mathbb{R}^{n}$, the above definitions coincide with the usual Fréchet and limiting subdifferentials in $\mathbb{R}^{n}$. Moreover, it follows directly that $\hat{\partial}_{\mathcal{R}}f(x)\subseteq\partial_{\mathcal{R}}f(x)$ for all $x\in\mathcal{M}$. According to [47, Proposition 3.2], if $x$ is a local minimizer of $f$ on $\mathcal{M}$, then $0\in\hat{\partial}_{\mathcal{R}}f(x)$. Thus, we call a point $x\in\mathcal{M}$ a Riemannian limiting stationary point of (NROP) if

0\in\partial_{\mathcal{R}}f(x). (7)

In this paper, we will treat (7) as a necessary optimality condition for (NROP).

The smoothing function is the most important tool of the smoothing method.

Definition 3.1 (Zhang and Chen [46, Definition 3.1]).

A function $\tilde{f}(\cdot,\cdot):\mathbb{R}^{n}\times\mathbb{R}_{++}\rightarrow\mathbb{R}$ is called a smoothing function of $f:\mathbb{R}^{n}\rightarrow\mathbb{R}$ if $\tilde{f}(\cdot,\mu)$ is continuously differentiable in $\mathbb{R}^{n}$ for any $\mu>0$,

\lim_{z\rightarrow x,\,\mu\downarrow 0}\tilde{f}(z,\mu)=f(x),

and there exist a constant $\kappa>0$ and a function $\omega:\mathbb{R}_{++}\rightarrow\mathbb{R}_{++}$ such that

\lvert\tilde{f}(x,\mu)-f(x)\rvert\leq\kappa\omega(\mu)\quad\text{ with }\quad\lim_{\mu\downarrow 0}\omega(\mu)=0. (8)
Example 2 (Chen, Wets, and Zhang [22, Lemma 4.4]).

The LogSumExp function $\operatorname{lse}(x,\mu):\mathbb{R}^{n}\times\mathbb{R}_{++}\rightarrow\mathbb{R}$, given by

\operatorname{lse}(x,\mu):=\mu\log(\textstyle\sum_{i=1}^{n}\exp(x_{i}/\mu)),

is a smoothing function of $\max(x)$, because we can see that:

(i) $\operatorname{lse}(\cdot,\mu)$ is smooth on $\mathbb{R}^{n}$ for any $\mu>0$. Its gradient $\nabla_{x}\operatorname{lse}(x,\mu)$ is given by $\sigma(\cdot,\mu):\mathbb{R}^{n}\rightarrow\Delta^{n-1}$,

\nabla_{x}\operatorname{lse}(x,\mu)=\sigma(x,\mu):=\frac{1}{\sum_{\ell=1}^{n}\exp(x_{\ell}/\mu)}\left[\exp(x_{1}/\mu),\cdots,\exp(x_{n}/\mu)\right]^{\top}, (9)

where $\Delta^{n-1}:=\{x\in\mathbb{R}^{n}\mid\sum_{i=1}^{n}x_{i}=1,\;x_{i}\geq 0\}$ is the unit simplex.

(ii) For all $x\in\mathbb{R}^{n}$ and $\mu>0$, we have $\max(x)<\operatorname{lse}(x,\mu)\leq\max(x)+\mu\log(n)$. Hence, the constant is $\kappa=\log(n)$ and $\omega(\mu)=\mu$. The above inequalities imply that $\lim_{z\rightarrow x,\,\mu\downarrow 0}\operatorname{lse}(z,\mu)=\max(x)$.
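The two properties above are easy to check numerically; the following Matlab sketch (with arbitrary test data of our own choosing) verifies the bounds in (ii) and that the gradient (9) lies in the unit simplex.

```matlab
% Numerical sanity check of Example 2.
mu = 0.1;  x = randn(5, 1);
lse   = mu * log(sum(exp(x / mu)));          % lse(x, mu)
sigma = exp(x / mu) / sum(exp(x / mu));      % gradient (9)
assert(max(x) < lse && lse <= max(x) + mu * log(numel(x)));   % bounds in (ii)
assert(abs(sum(sigma) - 1) < 1e-12 && all(sigma >= 0));       % sigma in Delta^{n-1}
```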

Gradient sub-consistency or consistency is crucial to showing that any limit point of the Riemannian smoothing method is also a limiting stationary point of (NROP).

Definition 3.2 (Zhang, Chen and Ma [47, Definition 3.4 & 3.9]).

A smoothing function $\tilde{f}$ of $f$ is said to satisfy gradient sub-consistency on $\mathbb{R}^{n}$ if, for any $x\in\mathbb{R}^{n}$,

G_{\tilde{f}}(x)\subseteq\partial f(x), (10)

where the subdifferential of $f$ associated with $\tilde{f}$ at $x\in\mathbb{R}^{n}$ is given by

G_{\tilde{f}}(x):=\{u\in\mathbb{R}^{n}\mid\nabla_{x}\tilde{f}(z_{k},\mu_{k})\rightarrow u\text{ for some }z_{k}\rightarrow x,\;\mu_{k}\downarrow 0\}.

Similarly, $\tilde{f}$ is said to satisfy Riemannian gradient sub-consistency on $\mathcal{M}$ if, for any $x\in\mathcal{M}$,

G_{\tilde{f},\mathcal{R}}(x)\subseteq\partial_{\mathcal{R}}f(x), (11)

where the Riemannian subdifferential of $f$ associated with $\tilde{f}$ at $x\in\mathcal{M}$ is given by

G_{\tilde{f},\mathcal{R}}(x):=\{v\in\mathbb{R}^{n}\mid\operatorname{grad}\tilde{f}(z_{k},\mu_{k})\rightarrow v\text{ for some }z_{k}\in\mathcal{M},\;z_{k}\rightarrow x,\;\mu_{k}\downarrow 0\}.

If the inclusion in (10) is replaced with equality, then $\tilde{f}$ is said to satisfy gradient consistency on $\mathbb{R}^{n}$, and similarly in (11) for $\mathcal{M}$. Thanks to the following useful proposition from [47], we can deduce Riemannian gradient sub-consistency on $\mathcal{M}$ from gradient sub-consistency on $\mathbb{R}^{n}$ whenever $f$ is locally Lipschitz.

Proposition 3.3 (Zhang, Chen and Ma [47, Proposition 3.10]).

Let $f$ be a locally Lipschitz function and $\tilde{f}$ a smoothing function of $f$. If gradient sub-consistency holds for $\tilde{f}$ on $\mathbb{R}^{n}$, then Riemannian gradient sub-consistency holds on $\mathcal{M}$ as well.

The next example, combined with Proposition 3.3, yields Riemannian gradient sub-consistency on $\mathcal{M}$ for $\operatorname{lse}(x,\mu)$ in Example 2, since any convex function is locally Lipschitz continuous.

Example 3 (Chen, Wets, and Zhang [22, Lemma 4.4]).

The smoothing function $\operatorname{lse}(x,\mu)$ of $\max(x)$ satisfies gradient consistency on $\mathbb{R}^{n}$. That is, for any $x\in\mathbb{R}^{n}$,

\partial\max(x)=G_{\operatorname{lse}}(x)=\{\lim_{x^{k}\rightarrow x,\,\mu_{k}\downarrow 0}\sigma(x^{k},\mu_{k})\}.

Note that the original assertion of [22, Lemma 4.4] is gradient consistency in the Clarke sense, i.e., $\partial^{\circ}\max(x)=G_{\operatorname{lse}}(x)$; since $\max(x)$ is convex, $\partial\max(x)=\partial^{\circ}\max(x)$, so the two statements coincide.

3.3 Riemannian smoothing method

Motivated by the previous papers [18, 35, 47] on smoothing methods and Riemannian manifolds, we propose a general Riemannian smoothing method. Algorithm 1 is the basic framework of this general method.

  Initialization: Given $\theta\in(0,1)$, $\mu_{0}>0$, and $x^{-1}\in\mathcal{M}$, select a smoothing function $\tilde{f}$ and a Riemannian algorithm (called the sub-algorithm here) for (SROP).
  for $k=0,1,2,\ldots$
    Solve
      x^{k}=\arg\min_{x\in\mathcal{M}}\tilde{f}(x,\mu_{k}) (12)
    approximately by using the chosen sub-algorithm, starting at $x^{k-1}$
    if the final convergence test is satisfied
      stop with approximate solution $x^{k}$
    end if
    Set $\mu_{k+1}=\theta\mu_{k}$
  end for
Algorithm 1: Basic Riemannian smoothing method for (NROP)

Now let us describe the convergence properties of the basic method. First, let us assume that the function $\tilde{f}(x,\mu_{k})$ has a minimizer on $\mathcal{M}$ for each value of $\mu_{k}$.

Theorem 3.4.

Suppose that each $x^{k}$ is an exact global minimizer of (12) in Algorithm 1. Then every limit point $x^{*}$ of the sequence $\{x^{k}\}$ is a global minimizer of problem (NROP).

Proof.

Let $\bar{x}$ be a global solution of (NROP), that is,

f(\bar{x})\leq f(x)\quad\text{ for all }x\in\mathcal{M}.

By Definition 3.1 of the smoothing function, there exist a constant $\kappa>0$ and a function $\omega:\mathbb{R}_{++}\rightarrow\mathbb{R}_{++}$ such that, for all $x\in\mathcal{M}$,

-\kappa\omega(\mu)\leq\tilde{f}(x,\mu)-f(x)\leq\kappa\omega(\mu) (13)

with $\lim_{\mu\downarrow 0}\omega(\mu)=0$. Substituting $x^{k}$ and using the global optimality of $\bar{x}$, we have

\tilde{f}(x^{k},\mu_{k})\geq f(x^{k})-\kappa\omega(\mu_{k})\geq f(\bar{x})-\kappa\omega(\mu_{k}).

By rearranging this expression, we obtain

-\kappa\omega(\mu_{k})\leq\tilde{f}(x^{k},\mu_{k})-f(\bar{x}). (14)

Since $x^{k}$ minimizes $\tilde{f}(x,\mu_{k})$ on $\mathcal{M}$ for each $\mu_{k}$, we have $\tilde{f}(x^{k},\mu_{k})\leq\tilde{f}(\bar{x},\mu_{k})$, which leads to

\tilde{f}(x^{k},\mu_{k})-f(\bar{x})\leq\tilde{f}(\bar{x},\mu_{k})-f(\bar{x})\leq\kappa\omega(\mu_{k}). (15)

The second inequality above follows from (13). Combining (14) and (15), we obtain

\lvert\tilde{f}(x^{k},\mu_{k})-f(\bar{x})\rvert\leq\kappa\omega(\mu_{k}). (16)

Now, suppose that $x^{*}$ is a limit point of $\{x^{k}\}$, so that there is an infinite subsequence $\mathcal{K}$ such that $\lim_{k\in\mathcal{K}}x^{k}=x^{*}$. Note that $x^{*}\in\mathcal{M}$ because $\mathcal{M}$ is complete. Taking the limit as $k\rightarrow\infty$, $k\in\mathcal{K}$, on both sides of (16) and again using the definition of the smoothing function, we obtain

\lvert f(x^{*})-f(\bar{x})\rvert=\lim_{k\in\mathcal{K}}\lvert\tilde{f}(x^{k},\mu_{k})-f(\bar{x})\rvert\leq\lim_{k\in\mathcal{K}}\kappa\omega(\mu_{k})=0.

Thus, it follows that $f(x^{*})=f(\bar{x})$. Since $x^{*}\in\mathcal{M}$ is a point whose objective value is equal to that of the global solution $\bar{x}$, we conclude that $x^{*}$, too, is a global solution. ∎

This strong result requires us to find a global minimizer of each subproblem, which, however, cannot always be done. The next result concerns the convergence properties of the sequence $\tilde{f}(x^{k},\mu_{k})$ under the condition that $\tilde{f}$ has the following additional property:

0<\mu_{2}<\mu_{1}\Longrightarrow\tilde{f}(x,\mu_{2})<\tilde{f}(x,\mu_{1})\text{ for all }x\in\mathbb{R}^{n}. (17)
Example 4.

The above property holds for $\operatorname{lse}(x,\mu)$ in Example 2; i.e., we have $\operatorname{lse}(x,\mu_{2})<\operatorname{lse}(x,\mu_{1})$ on $\mathbb{R}^{n}$, provided that $0<\mu_{2}<\mu_{1}$. Note that, by the equality

\sum_{l=1}^{n}\exp(x_{l}/\mu)=\exp\{\operatorname{lse}(x,\mu)/\mu\},

the $i$-th component of $\sigma(x,\mu)$ can be rewritten as

\sigma_{i}(x,\mu)=\exp\{(x_{i}-\operatorname{lse}(x,\mu))/\mu\}.

For any fixed $x\in\mathbb{R}^{n}$, consider the derivative of the real function $\mu\mapsto\operatorname{lse}(x,\mu)$ on $\mathbb{R}_{++}$. We have

\nabla_{\mu}\operatorname{lse}(x,\mu)=\operatorname{lse}/\mu-\frac{\sum_{i=1}^{n}x_{i}\exp(x_{i}/\mu)}{\mu\exp(\operatorname{lse}/\mu)}=\Big(\operatorname{lse}-\textstyle\sum_{i=1}^{n}x_{i}\exp\{(x_{i}-\operatorname{lse})/\mu\}\Big)/\mu=\Big(\operatorname{lse}-\textstyle\sum_{i=1}^{n}x_{i}\sigma_{i}\Big)/\mu>0,

where “$\operatorname{lse}$, $\sigma$” are shorthand for $\operatorname{lse}(x,\mu)$ and $\sigma(x,\mu)$. For the last (strict) inequality, observe that $\sigma\in\Delta^{n-1}$; hence, the term $\sum_{i=1}^{n}x_{i}\sigma_{i}$ is a convex combination of the entries of $x$, which implies that $\sum_{i=1}^{n}x_{i}\sigma_{i}\leq\max(x)<\operatorname{lse}$. Therefore $\operatorname{lse}(x,\cdot)$ is strictly increasing in $\mu$, which completes the proof of our claims.

In [18], the authors considered a special case of Algorithm 1, wherein the smoothing function $\tilde{f}(x,\mu)=\sqrt{\mu^{2}+x^{2}}$ of $\lvert x\rvert$, which also satisfies (17), is used and a Riemannian conjugate gradient method is employed for (12).

Theorem 3.5.

Suppose that $f^{*}:=\inf_{x\in\mathcal{M}}f(x)$ exists and that the smoothing function $\tilde{f}$ has property (17). Let $f^{k}:=\tilde{f}(x^{k},\mu_{k})$. Then the sequence $\{f^{k}\}$ generated by Algorithm 1 is strictly decreasing and bounded below by $f^{*}$; hence,

\lim_{k\to\infty}\lvert f^{k}-f^{k-1}\rvert=0.
Proof.

For each $k\geq 1$, $x^{k}$ is obtained by approximately solving

\min_{x\in\mathcal{M}}\tilde{f}(x,\mu_{k}),

starting at $x^{k-1}$. Then, at the very least, we have

\tilde{f}(x^{k-1},\mu_{k})\geq\tilde{f}(x^{k},\mu_{k})=f^{k}.

Since $\mu_{k}=\theta\mu_{k-1}<\mu_{k-1}$, property (17) ensures

f^{k-1}=\tilde{f}(x^{k-1},\mu_{k-1})>\tilde{f}(x^{k-1},\mu_{k}).

The claim that the sequence $\{f^{k}\}$ is strictly decreasing follows from these two inequalities.
Suppose now that, for all $\mu>0$ and all $x\in\mathbb{R}^{n}$,

\tilde{f}(x,\mu)\geq f(x). (18)

Then, for each $k$,

f^{k}=\tilde{f}(x^{k},\mu_{k})\geq f(x^{k})\geq\inf_{x\in\mathcal{M}}f(x)=f^{*},

which shows that $\{f^{k}\}$ is bounded below by $f^{*}$; being monotone and bounded, $\{f^{k}\}$ converges, which proves our claims.
It remains to show that (18) holds whenever the smoothing function has property (17). Fix any $x\in\mathbb{R}^{n}$. By (17), $\tilde{f}(x,\cdot)$ is monotonically increasing on $\mu>0$; equivalently, $\tilde{f}(x,\mu)$ strictly decreases as $\mu\downarrow 0$. On the other hand, from the definition of the smoothing function, we have

\lim_{\mu\downarrow 0}\tilde{f}(x,\mu)=f(x).

Hence, $\inf_{\mu>0}\tilde{f}(x,\mu)=f(x)$, and (18) follows, as claimed. ∎

Note that the above weak result does not ensure that $\{f^{k}\}\rightarrow f^{*}$. Next, for better convergence (compared with Theorem 3.5) and an effortless implementation (compared with Theorem 3.4), we propose an enhanced Riemannian smoothing method: Algorithm 2. It is closer to the version in [47], where the authors use the Riemannian steepest descent method for solving the smoothed problem (19).

  Initialization: Given $\theta\in(0,1)$, $\mu_{0}>0$, a nonnegative sequence $\{\delta_{k}\}$ with $\delta_{k}\rightarrow 0$, and $x^{-1}\in\mathcal{M}$, select a smoothing function $\tilde{f}$ and a Riemannian algorithm (called the sub-algorithm here) for (SROP).
  for $k=0,1,2,\ldots$
    Solve
      x^{k}=\arg\min_{x\in\mathcal{M}}\tilde{f}(x,\mu_{k}) (19)
    approximately by using the chosen sub-algorithm, starting at $x^{k-1}$, such that
      \|\operatorname{grad}\tilde{f}(x^{k},\mu_{k})\|<\delta_{k} (20)
    if the final convergence test is satisfied
      stop with approximate solution $x^{k}$
    end if
    Set $\mu_{k+1}=\theta\mu_{k}$
  end for
Algorithm 2: Enhanced Riemannian smoothing method for (NROP)

The following result is adapted from [47, Proposition 4.2 & Theorem 4.3]. Readers are encouraged to refer to [47] for a discussion of the stationary points associated with $\tilde{f}$ on $\mathcal{M}$.

Theorem 3.6.

In Algorithm 2, suppose that the chosen sub-algorithm has the following general convergence property for (SROP):

\liminf_{\ell\rightarrow\infty}\|\operatorname{grad}g(x^{\ell})\|=0. (21)

Moreover, suppose that, for all $\mu_{k}$, the function $\tilde{f}(\cdot,\mu_{k})$ satisfies the convergence assumptions that the sub-algorithm requires of $g$ above, and that $\tilde{f}$ satisfies Riemannian gradient sub-consistency on $\mathcal{M}$. Then:

  1. For each $k$, there exists an $x^{k}$ satisfying (20); hence, Algorithm 2 is well defined.

  2. Every limit point $x^{*}$ of the sequence $\{x^{k}\}$ generated by Algorithm 2 is a Riemannian limiting stationary point of (NROP) (see (7)).

Proof.

Fix any $\mu_{k}$. By (21), we have $\liminf_{\ell\rightarrow\infty}\|\operatorname{grad}\tilde{f}(x^{\ell},\mu_{k})\|=0$. Hence, there is a subsequence of $\{\|\operatorname{grad}\tilde{f}(x^{\ell},\mu_{k})\|\}$ converging to 0. This means that, for any $\epsilon>0$, there exists an integer $\ell_{\epsilon}$ such that $\|\operatorname{grad}\tilde{f}(x^{\ell_{\epsilon}},\mu_{k})\|<\epsilon$. Taking $\epsilon=\delta_{k}$, we may set $x^{k}=x^{\ell_{\epsilon}}$. Thus, statement (1) holds.

Next, suppose that $x^{*}$ is a limit point of $\{x^{k}\}$ generated by Algorithm 2, so that there is an infinite subsequence $\mathcal{K}$ such that $\lim_{k\in\mathcal{K}}x^{k}=x^{*}$. From (20), we have

\lim_{k\in\mathcal{K}}\|\operatorname{grad}\tilde{f}(x^{k},\mu_{k})\|\leq\lim_{k\in\mathcal{K}}\delta_{k}=0,

and we find that $\operatorname{grad}\tilde{f}(x^{k},\mu_{k})\rightarrow 0$ for $k\in\mathcal{K}$, $x^{k}\in\mathcal{M}$, $x^{k}\rightarrow x^{*}$, $\mu_{k}\downarrow 0$. Hence,

0\in G_{\tilde{f},\mathcal{R}}(x^{*})\subseteq\partial_{\mathcal{R}}f(x^{*}). ∎

Now let us consider the selection strategy for the nonnegative sequence $\{\delta_{k}\}$ with $\delta_{k}\rightarrow 0$. In [47], when $\mu_{k+1}=\theta\mu_{k}$ shrinks, the authors set

\delta_{k+1}:=\rho\delta_{k} (22)

with an initial value $\delta_{0}$ and a constant $\rho\in(0,1)$. In the spirit of the usual smoothing methods described in [21], one can instead set

\delta_{k}:=\gamma\mu_{k} (23)

with a constant $\gamma>0$. The latter is an adaptive rule, because $\mu_{k}$ then determines both the subproblem (19) and its stopping criterion. The merits and drawbacks of the two rules require more discussion, but the latter seems to be more reasonable.

We conclude this section by discussing the connections with [18] and [47]. Our work is based on an efficient unification of them. [18] focused on a specific algorithm and did not discuss the underlying generalities, whereas we study a general framework for Riemannian smoothing. Recall that the “smoothing function” is the core tool of the smoothing method. In addition to what is required by its definition (see Definition 3.1), it needs to have the following additional properties (AP) in order for the algorithms to converge:

  (AP1) Approximation from above, i.e., (17). (Needed in Algorithm 1.)

  (AP2) (Riemannian) gradient sub-consistency, i.e., Definition 3.2. (Needed in Algorithm 2.)

We find that not all smoothing functions satisfy (AP1), and for some functions it is hard to prove whether (AP2) holds. For example, all the functions in Table 1 are smoothing functions of $|x|$, but only the first three satisfy (AP1); the last two do not. In [21], the authors showed that the first one in Table 1, $\tilde{f}_{1}(x,\mu)$, has property (AP2). The others remain to be verified, but doing so will not be a trivial exercise. To a certain extent, Algorithm 1 and Theorem 3.5 guarantee a fundamental convergence result even if one has difficulty in showing whether one's smoothing function satisfies (AP2). Therefore, it makes sense to consider Algorithms 1 and 2 together for the sake of the completeness of the general framework.

Table 1: List of smoothing functions of the absolute value function $|x|$, with $\kappa$ and $\omega(\mu)$ as in (8)
Smoothing function of $|x|$   $\kappa$   $\omega(\mu)$
$\tilde{f}_{1}(x,\mu)=\begin{cases}|x|&\text{ if }|x|>\frac{\mu}{2}\\ \frac{x^{2}}{\mu}+\frac{\mu}{4}&\text{ if }|x|\leq\frac{\mu}{2}\end{cases}$   $\frac{1}{4}$   $\mu$
$\tilde{f}_{2}(x,\mu)=\sqrt{\mu^{2}+x^{2}}$   $1$   $\mu$
$\tilde{f}_{3}(x,\mu)=2\mu\log(1+e^{x/\mu})-x$   $2\log(2)$   $\mu$
$\tilde{f}_{4}(x,\mu)=x\tanh(x/\mu)$, where $\tanh(z)$ is the hyperbolic tangent function   $1$   $\mu$
$\tilde{f}_{5}(x,\mu)=x\operatorname{erf}(x/\mu)$, where $\operatorname{erf}(z):=\frac{2}{\sqrt{\pi}}\int_{0}^{z}e^{-t^{2}}\,dt$ is the Gauss error function   $\frac{2}{e\sqrt{\pi}}$   $\mu$
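For illustration, the next Matlab sketch encodes two of the functions in Table 1 as handles and numerically checks the error bound (8) on a grid of our own choosing.

```matlab
% f2 and f3 from Table 1, with kappa = 1 and kappa = 2*log(2), omega(mu) = mu.
f2 = @(x, mu) sqrt(mu.^2 + x.^2);
f3 = @(x, mu) 2 * mu .* log(1 + exp(x ./ mu)) - x;
mu = 1e-2;  x = linspace(-1, 1, 1001);
assert(max(abs(f2(x, mu) - abs(x))) <= 1 * mu + 1e-12);
assert(max(abs(f3(x, mu) - abs(x))) <= 2 * log(2) * mu + 1e-12);
```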

Algorithm 2 expands on the results of [47]. It allows us to use any standard method for (SROP), not just steepest descent, to solve the smoothed problem (19). Various standard Riemannian algorithms for (SROP), such as the Riemannian conjugate gradient method [41] (which often performs better than Riemannian steepest descent), the Riemannian Newton method [1, Chapter 6], and the Riemannian trust region method [1, Chapter 7], have extended the concepts and techniques used in Euclidean space to Riemannian manifolds. As shown by Theorem 3.6, no matter what kind of sub-algorithm is implemented for (19), the final convergence is unaffected as long as the chosen sub-algorithm has property (21). On the other hand, we advocate that the sub-algorithm be viewed as a “black box,” so that the user does not have to care about its implementation details at all. We can directly use an existing solver, e.g., Manopt [14], which includes the standard Riemannian algorithms mentioned above. Hence, we can choose the most suitable sub-algorithm for the application and implement it quickly with minimal effort.
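To make the black-box viewpoint concrete, here is a minimal Matlab/Manopt sketch of the outer loop of Algorithm 2 for a problem on $\mathcal{O}(r)$. The handles smooth_cost and smooth_egrad are user-supplied stand-ins for $\tilde{f}(\cdot,\mu)$ and its Euclidean gradient (a concrete choice for (OptCP) is sketched in Section 4); the parameter values and the iteration cap are placeholders, not recommendations, and we assume Manopt is on the path with its usual stiefelfactory and conjugategradient interfaces.

```matlab
% Sketch of Algorithm 2 with a Manopt solver used as a black-box sub-algorithm.
function X = riemannian_smoothing(smooth_cost, smooth_egrad, r, X0)
    mu = 100;  theta = 0.8;  gamma = 0.5;        % mu_0, shrink factor, rule (23)
    problem.M = stiefelfactory(r, r);            % O(r) = St(r, r)
    X = X0;
    for k = 1:200
        problem.cost  = @(Y) smooth_cost(Y, mu);
        problem.egrad = @(Y) smooth_egrad(Y, mu);  % Manopt projects this itself
        opts.tolgradnorm = gamma * mu;             % delta_k = gamma*mu_k, cf. (20)
        opts.verbosity = 0;
        X = conjugategradient(problem, X, opts);   % warm start at x^{k-1}
        % (application-specific final convergence test would go here)
        mu = theta * mu;
    end
end
```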

4 Numerical experiments on CP factorization

The numerical experiments in Sections 4 and 5 were performed on a computer equipped with an Intel Core i7-10700 (2.90 GHz) and 16 GB of RAM, using Matlab R2022a. Our Algorithm 2 was implemented in the Manopt framework [14] (version 7.0). The number of iterations used to solve the smoothed problem (19) with the sub-algorithm is counted in the total number of iterations. We refer readers to the supplementary material of this paper for the available codes (alternatively, see https://github.com/GALVINLAI/General-Riemannian-Smoothing-Method).

In this section, we describe numerical experiments that we conducted on CP factorization, in which we solved (OptCP) using Algorithm 2, where different Riemannian algorithms were employed as sub-algorithms and $\operatorname{lse}(-\bar{B}X,\mu)$ was used as the smoothing function. To be specific, we used three built-in Riemannian solvers of Manopt 7.0 — steepest descent (SD), conjugate gradient (CG), and trust regions (RTR) — denoted by SM_SD, SM_CG and SM_RTR, respectively. We compared our algorithms with the following non-Riemannian numerical algorithms for CP factorization, which were mentioned in Subsection 1.1. We followed the settings used by the authors in their papers.

  • SpFeasDC_ls [19]: A difference-of-convex functions approach for solving the split feasibility problem, it can be applied to (FeasCP). The implementation details regarding the parameters we used are the same as in the numerical experiments reported in [19, Section 6.1].

  • RIPG_mod [12]: This is a projected gradient method with relaxation and inertia parameters for solving (2). As shown in [12, Section 4.2], RIPG_mod is the best among the many strategies of choosing parameters.

  • APM_mod [31]: A modified alternating projection method for CP factorization; it is described in Section 2.3.

We have shown that $\operatorname{lse}(x,\mu)$ is a smoothing function of $\max(x)$ with gradient consistency. The function $\operatorname{lse}(\cdot,\mu)$ of a matrix argument can be simply derived from entrywise operations. Then, from the properties of compositions of smoothing functions [5, Proposition 1 (3)], we have that $\operatorname{lse}(-\bar{B}X,\mu)$ is a smoothing function of $\max(-\bar{B}X)$ with gradient consistency. In practice, it is important to avoid numerical overflow and underflow when evaluating $\operatorname{lse}(x,\mu)$: overflow occurs when some $x_{i}$ is large, and underflow occurs when all $x_{i}$ are small. To avoid these problems, we can shift each component $x_{i}$ by $\max(x)$ and use the following formula:

\operatorname{lse}(x,\mu)=\mu\log(\textstyle\sum_{i=1}^{n}\exp((x_{i}-\max(x))/\mu))+\max(x),

whose validity is easy to show.
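Assuming, as the max in (OptCP) suggests, that lse is applied to all entries of the matrix $-\bar{B}X$, a numerically stable Matlab sketch of the smoothed cost and of its Euclidean gradient (the latter follows from (9) by the chain rule and is what a sub-solver such as Manopt needs) is:

```matlab
% Smoothed objective lse(-Bbar*X, mu) for (OptCP), computed with the shift
% above, and its Euclidean gradient with respect to X.
function val = lse_cost(X, Bbar, mu)
    M = -Bbar * X;  m = max(M(:));               % shift avoids overflow/underflow
    val = mu * log(sum(exp((M(:) - m) / mu))) + m;
end

function egrad = lse_egrad(X, Bbar, mu)
    M = -Bbar * X;  m = max(M(:));
    E = exp((M - m) / mu);
    Sigma = E / sum(E(:));                       % entrywise analogue of sigma in (9)
    egrad = -Bbar' * Sigma;                      % d lse(-Bbar*X, mu) / dX
end
```

These could be passed to the outer loop sketched at the end of Section 3 as, e.g., @(Y, mu) lse_cost(Y, Bbar, mu) and @(Y, mu) lse_egrad(Y, Bbar, mu).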

The details of the experiments are as follows. If $A\in\mathcal{CP}_{n}$ was of full rank, for accuracy reasons, we obtained an initial $\bar{B}$ by Cholesky decomposition; otherwise, $\bar{B}$ was obtained by spectral decomposition. We then extended $\bar{B}$ to $r$ columns by column replication (6). We set $r=\operatorname{cp}(A)$ if $\operatorname{cp}(A)$ was known; otherwise, $r$ was chosen sufficiently large. We used RandOrthMat.m [42] to generate a random starting point $X^{0}$ on the basis of the Gram-Schmidt process.
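A sketch of this setup is given below (using Matlab's chol and eig; the qr call stands in for RandOrthMat.m, whose Gram-Schmidt construction we assume yields an equivalent random orthogonal matrix; replicate_columns is the replication helper sketched in Section 2.2).

```matlab
% Initial factorization Bbar (n-by-r) and random orthogonal starting point X0.
n = size(A, 1);
if rank(A) == n
    B0 = chol(A, 'lower');                 % full rank: Cholesky factor
else
    [V, D] = eig(A);
    B0 = V * diag(sqrt(max(diag(D), 0)));  % spectral decomposition
end
Bbar = replicate_columns(B0, r);           % extend to r columns via (6)
[X0, ~] = qr(randn(r));                    % random orthogonal start on O(r)
```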

For our three algorithms, we set $\mu_{0}=100$, $\theta=0.8$ and used the adaptive rule (23), $\delta_{k}:=\gamma\mu_{k}$, with $\gamma=0.5$. Except for RIPG_mod, each algorithm was deemed to terminate successfully at iteration $k$ if $\min(\bar{B}X^{k})\geq-10^{-15}$ was attained before the maximum number of iterations (5,000) was reached. In addition, SpFeasDC_ls was deemed to fail when $\bar{L}_{k}>10^{10}$. RIPG_mod was deemed to terminate successfully when $\|A-X_{k}X_{k}^{\top}\|^{2}/\|A\|^{2}<10^{-15}$ was attained within at most 10,000 iterations for $n<100$, and within at most 50,000 iterations in all other cases. In the tables of this section, we report the rounded success rate (Rate) over the total number of trials; note that the definition of “Rate” varies slightly among the experiments described in Sections 4.1-4.4, as explained below.

4.1 Randomly generated instances

We examined the case of randomly generated matrices to see how the methods were affected by the order $n$ or by $r$. The instances were generated in the same way as in [31, Section 7.7]. We computed $C$ by setting $C_{ij}:=\lvert B_{ij}\rvert$ for all $i,j$, where $B$ is a random $n\times 2n$ matrix based on the Matlab command randn, and we took $A=CC^{\top}$ to be factorized. In Table 2, we set $r=1.5n$ and $r=3n$ for the values $n\in\{20,30,40,100,200,400,600,800\}$. For each pair of $n$ and $r$, we generated 50 instances if $n\leq 100$ and 10 instances otherwise. For each instance, we initialized all the algorithms at the same random starting point $X^{0}$ and initial decomposition $\bar{B}$, except for RIPG_mod. Note that each instance $A$ was assigned only one starting point.
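A sketch of the instance generation (following [31, Section 7.7]):

```matlab
% Random completely positive test matrix of order n.
B = randn(n, 2 * n);
C = abs(B);          % C_ij = |B_ij|
A = C * C';          % A is completely positive by construction
```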

Table 2 lists the average time in seconds (Times) and the average number of iterations (Iters) among the successful instances. For our three Riemannian algorithms, Iters includes the number of iterations of the sub-algorithm. Table 2 also lists the rounded success rate (Rate) over the total number (50 or 10) of instances for each pair of $n$ and $r$. Boldface highlights the two best results in each row.

As shown in Table 2, except for APM_mod, each method had a success rate of 1 for all pairs of $n$ and $r$. Our three algorithms outperformed the other methods on the large-scale matrices with $n\geq 100$. In particular, SM_CG with the conjugate gradient method gave the best results.

4.2 A specifically structured instance

Let $\textbf{e}_{n}$ denote the all-ones vector in $\mathbb{R}^{n}$ and consider the matrix [31, Example 7.1],

A_{n}=\left(\begin{array}{cc}0&\textbf{e}_{n-1}^{\top}\\ \textbf{e}_{n-1}&I_{n-1}\end{array}\right)^{\top}\left(\begin{array}{cc}0&\textbf{e}_{n-1}^{\top}\\ \textbf{e}_{n-1}&I_{n-1}\end{array}\right)\in\mathcal{CP}_{n}.

Theorem 2.2 shows that $A_{n}\in\operatorname{int}(\mathcal{CP}_{n})$ for every $n\geq 2$. By construction, it is obvious that $\operatorname{cp}(A_{n})=n$. We tried to factorize $A_{n}$ for the values $n\in\{10,20,50,75,100,150\}$ in Table 3. For each $A_{n}$, using $r=\operatorname{cp}(A_{n})=n$ and the same initial decomposition $\bar{B}$, we tested all the algorithms on the same 50 randomly generated starting points, except for RIPG_mod. Note that each instance was assigned 50 starting points.
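A sketch of the construction of $A_{n}$:

```matlab
% Structured instance A_n from [31, Example 7.1]; cp(A_n) = n.
T  = [0, ones(1, n - 1); ones(n - 1, 1), eye(n - 1)];
An = T' * T;
```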

Table 3 lists the average time in seconds (Times) and the average number of iterations (Iters) among the successful starting points. It also lists the rounded success rate (Rate) over the total number (50) of starting points for each $n$. Boldface highlights the two best results for each $n$.

We can see from Table 3 that the success rates of our three algorithms were always 1, whereas the success rates of the other methods decreased as $n$ increased. Again, SM_CG with the conjugate gradient method gave the best results.

Table 2: CP factorization of random completely positive matrices.
Methods SM_SD SM_CG SM_RTR SpFeasDC_ls RIPG_mod APM_mod
$n$ ($r=1.5n$) Rate Times Iters Rate Times Iters Rate Times Iters Rate Times Iters Rate Times Iters Rate Times Iters
20 1 0.0409 49 1 0.0394 41 1 0.0514 35 1 0.0027 24 1 0.0081 1229 0.32 0.3502 2318
30 1 0.0549 58 1 0.0477 44 1 0.0690 36 1 0.0075 24 1 0.0231 1481 0.04 1.0075 2467
40 1 0.0735 65 1 0.0606 46 1 0.0859 37 1 0.0216 46 1 0.0574 1990 0 - -
100 1 0.2312 104 1 0.1520 56 1 0.4061 45 1 0.2831 109 1 0.8169 4912 0 - -
200 1 1.0723 167 1 0.5485 69 1 1.9855 53 1 2.2504 212 1 5.2908 9616 0 - -
400 1 14.6 314 1 4.1453 86 1 22.1 69 1 36.9 636 1 90.6 17987 0 - -
600 1 50.6 474 1 14.7 105 1 46.4 80 1 140.1 882 1 344.7 26146 0 - -
800 1 133.3 643 1 30.0 109 1 93.5 89 1 413.3 1225 1 891.1 34022 0 - -
Methods SM_SD SM_CG SM_RTR SpFeasDC_ls RIPG_mod APM_mod
$n$ ($r=3n$) Rate Times Iters Rate Times Iters Rate Times Iters Rate Times Iters Rate Times Iters Rate Times Iters
20 1 0.0597 52 1 0.0551 42 1 0.0842 37 1 0.0057 15 1 0.0105 1062 0.30 0.7267 2198
30 1 0.0793 59 1 0.0673 45 1 0.1161 39 1 0.0128 17 1 0.0336 1127 0 - -
40 1 0.1035 67 1 0.0882 48 1 0.1961 41 1 0.0256 19 1 0.0822 1460 0 - -
100 1 0.5632 103 1 0.3395 57 1 1.4128 50 1 0.8115 86 1 1.1909 4753 0 - -
200 1 4.5548 163 1 2.4116 68 1 14.3 65 1 8.1517 184 1 9.2248 9402 0 - -
400 1 46.5 296 1 19.2 89 1 77.1 80 1 124.3 453 1 156.6 17563 0 - -
600 1 209.0 446 1 65.7 99 1 294.5 76 1 981.8 795 1 616.7 25336 0 - -
800 1 609.7 628 1 160.4 114 1 650.7 83 1 4027.4 1070 1 1289.4 26820 0 - -
Table 3: CP factorization of a family of specifically structured instances.
Methods SM_SD SM_CG SM_RTR SpFeasDC_ls RIPG_mod APM_mod
$n$ ($r=n$) Rate Times Iters Rate Times Iters Rate Times Iters Rate Times Iters Rate Times Iters Rate Times Iters
10 1 0.0399 71 1 0.0313 49 1 0.0424 45 1 0.0043 149 1 0.0074 2085 0.80 0.0174 616
20 1 0.0486 85 1 0.0408 63 1 0.0637 55 0.98 0.0139 201 0.74 0.0212 3478 0.90 0.0591 864
50 1 0.2599 295 1 0.1073 101 1 0.2104 76 0.98 0.3389 770 0 - - 0.76 0.6948 1416
75 1 0.3843 329 1 0.1923 135 1 0.4293 93 0.98 1.0706 1186 0 - - 0.64 1.4809 1510
100 1 0.7459 458 1 0.3289 168 1 0.9074 108 0.80 1.6653 1083 0 - - 0.60 2.8150 1690
150 1 1.8076 647 1 0.7837 241 1 2.6030 145 0.70 3.7652 1170 0 - - 0.35 9.9930 2959

4.3 An easy instance on the boundary of $\mathcal{CP}_{n}$

Consider the following matrix from [44, Example 2.7]:

A=\left(\begin{array}{ccccc}41&43&80&56&50\\ 43&62&89&78&51\\ 80&89&162&120&93\\ 56&78&120&104&62\\ 50&51&93&62&65\end{array}\right).

The sufficient condition from [44, Theorem 2.5] ensures that this matrix is completely positive with $\operatorname{cp}(A)=\operatorname{rank}(A)=3$. Theorem 2.2 tells us that $A\in\operatorname{bd}(\mathcal{CP}_{5})$, since $\operatorname{rank}(A)\neq 5$.

We found that all the algorithms could easily factorize this matrix. However, our three algorithms returned a CP factorization $B$ whose smallest entry was as large as possible. In fact, they also maximize the smallest entry in the $n\times r$ symmetric factorization of $A$, since (OptCP) is equivalent to

\max_{A=XX^{\top},\,X\in\mathbb{R}^{n\times r}}\{\min(X)\}.

When we did not terminate as soon as min(B¯Xk)1015\min(\bar{B}X^{k})\geq-10^{-15} but instead ran, for example, 1000 iterations, our algorithms gave the following CP factorization, whose smallest entry is around 2.857310152.8573\gg-10^{-15}:

A=BB , where B(3.57714.47662.85732.85743.06826.66508.38227.00016.53745.75152.85747.92192.85746.77413.3085).A=BB^{\top}\text{ , where }B\approx\left(\begin{array}[]{lll}3.5771&4.4766&\textbf{2.8573}\\ 2.8574&3.0682&6.6650\\ 8.3822&7.0001&6.5374\\ 5.7515&2.8574&7.9219\\ 2.8574&6.7741&3.3085\end{array}\right).

4.4 A hard instance on the boundary of 𝒞𝒫n\mathcal{CP}_{n}

Next, we examined how well these methods worked on a hard matrix on the boundary of 𝒞𝒫n\mathcal{CP}_{n}. Consider the following matrix taken from [30]:

A=(8511558511158511158551158)bd(𝒞𝒫5).A=\left(\begin{array}[]{lllll}8&5&1&1&5\\ 5&8&5&1&1\\ 1&5&8&5&1\\ 1&1&5&8&5\\ 5&1&1&5&8\end{array}\right)\in\operatorname{bd}(\mathcal{CP}_{5}).

Since Abd(𝒞𝒫5)A\in\mathrm{bd}(\mathcal{CP}_{5}) and AA is of full rank, it follows from Theorem 2.2 that cp+(A)=\operatorname{cp}^{+}(A)=\infty; i.e., there is no strictly positive CP factorization for A.A. Hence, the global minimum of (OptCP) is t=0t=0. None of the algorithms could decompose this matrix under our stopping tolerance of 101510^{-15}. As was done in [31, Example 7.3], we therefore investigated slight perturbations of this matrix. Given

MM=:Cint(𝒞𝒫5) with M=(110000101000100100100010100001),MM^{\top}=:C\in\operatorname{int}(\mathcal{CP}_{5})\text{ with }M=\left(\begin{array}[]{cccccc}1&1&0&0&0&0\\ 1&0&1&0&0&0\\ 1&0&0&1&0&0\\ 1&0&0&0&1&0\\ 1&0&0&0&0&1\end{array}\right),

we factorized Aλ:=λA+(1λ)CA_{\lambda}:=\lambda A+(1-\lambda)C for different values of λ[0,1)\lambda\in[0,1) using r=12>cp5=11.r=12>\mathrm{cp}_{5}=11. Note that Aλint(𝒞𝒫5)A_{\lambda}\in\operatorname{int}(\mathcal{CP}_{5}) for 0λ<10\leq\lambda<1 and that AλA_{\lambda} approaches the boundary as λ1\lambda\to 1; the largest value we tested was λ=0.9999\lambda=0.9999. For each Aλ,A_{\lambda}, we ran all of the algorithms from 50 randomly generated starting points and computed the success rate over the total number of starting points.
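
For concreteness, the perturbed instance can be built in a few lines of MATLAB. The sketch below only constructs AλA_{\lambda} for a given λ\lambda; it does not include the factorization runs themselves.

% Sketch: building A_lambda = lambda*A + (1-lambda)*C from the matrices above.
A = [8 5 1 1 5;
     5 8 5 1 1;
     1 5 8 5 1;
     1 1 5 8 5;
     5 1 1 5 8];                      % boundary matrix from [30]
M = [ones(5,1), eye(5)];              % the 5 x 6 matrix M displayed above
C = M * M';                           % C lies in the interior of CP_5
lambda   = 0.9999;                    % the largest value we tested
A_lambda = lambda * A + (1 - lambda) * C;
r = 12;                               % factorization size, r = 12 > cp_5 = 11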

Table 4 shows how the success rate of each algorithm changes as AλA_{\lambda} approaches the boundary. The table sorts the methods from left to right according to their overall performance. Except for SM_RTR, whose success rate was always 1, the success rates of all the other algorithms decreased significantly as λ\lambda increased to 0.9999. Surprisingly, SM_CG, which performed well in the previous examples, seemed unable to handle instances close to the boundary.

Table 4: Success rate of CP factorization of AλA_{\lambda} for values of λ\lambda from 0.6 to 0.9999.
λ\lambda SM_RTR SM_SD RIPG_mod SM_CG SpFeasDC_ls APM_mod
0.6 1 1 1 1 1 0.42
0.65 1 1 1 1 1 0.44
0.7 1 1 1 1 1 0.48
0.75 1 1 1 1 1 0.52
0.8 1 1 1 1 0.96 0.46
0.82 1 1 1 1 0.98 0.4
0.84 1 1 1 1 0.86 0.24
0.86 1 1 1 1 0.82 0.1
0.88 1 1 1 1 0.58 0.18
0.9 1 1 1 1 0.48 0.18
0.91 1 1 1 1 0.4 0.14
0.92 1 1 1 1 0.2 0.18
0.93 1 1 0.98 1 0.22 0.22
0.94 1 1 0.98 1 0.1 0.2
0.95 1 1 1 1 0.12 0.32
0.96 1 1 0.96 0.98 0.06 0.34
0.97 1 1 0.86 0.82 0.06 0.14
0.98 1 1 0.76 0.28 0.02 0
0.99 1 0.68 0.42 0 0 0
0.999 1 0 0.14 0 0 0
0.9999 1 0 0 0 0 0

5 Further numerical experiments: comparison with [18, 47]

As described at the end of Section 3, the algorithms in [47] and [18] are both special cases of our algorithm. In this section, we compare these special cases with other instances of our framework to see whether performance improves when other sub-algorithms or other smoothing functions are used. We applied Algorithm 2 to the two problems implemented in [47] and [18], respectively: finding a sparse vector (FSV) in a subspace and robust low-rank matrix completion. Since both problems involve approximations to the 1\ell_{1} norm, we applied the smoothing functions listed in Table 1.

We used the six solvers built into Manopt 7.0, namely, steepest descent, Barzilai-Borwein (i.e., gradient descent with the BB step size), conjugate gradient, trust regions, BFGS (a limited-memory version), and ARC (i.e., adaptive regularization by cubics).
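
To illustrate how the sub-algorithm is swapped within our framework, the six solvers can be collected as function handles and applied to the same problem structure. The sketch below is only schematic: the solver names are those of Manopt 7.0 (rlbfgs is its limited-memory BFGS implementation), and problem, x0, and options stand for a smoothed problem, an initial point, and solver options prepared as described in the following subsections.

% Schematic: the six Manopt 7.0 sub-solvers, called interchangeably on the
% same (smoothed) problem structure; problem, x0, options are placeholders.
solvers = {@steepestdescent, @barzilaiborwein, @conjugategradient, ...
           @trustregions, @rlbfgs, @arc};
for s = 1:numel(solvers)
    [x, xcost, info] = solvers{s}(problem, x0, options);
end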

5.1 FSV problem

The FSV problem is to find the sparsest vector in an nn-dimensional linear subspace WmW\subseteq\mathbb{R}^{m}; it has applications in robust subspace recovery, dictionary learning, and many other problems in machine learning and signal processing [37, 38]. Let Qm×nQ\in\mathbb{R}^{m\times n} denote a matrix whose columns form an orthonormal basis of WW. The problem can then be formulated as

minxSn1Qx0,\min_{x\in S^{n-1}}\|Qx\|_{0},

where Sn1:={xnx=1}S^{n-1}:=\left\{x\in\mathbb{R}^{n}\mid\|x\|=1\right\} is the sphere manifold, and z0\|z\|_{0} counts the number of nonzero components of zz. Because this discontinuous objective function is unwieldy, the literature focuses instead on solving the following 1\ell_{1}-norm relaxation:

minxSn1Qx1,\min_{x\in S^{n-1}}\|Qx\|_{1}, (24)

where z1:=i|zi|\|z\|_{1}:=\sum_{i}\left|z_{i}\right| is the 1\ell_{1} norm of the vector zz.

Our synthetic problems for the 1\ell_{1} minimization model (24) were generated in the same way as in [47]: we chose m{4n,6n,8n,10n}m\in\{4n,6n,8n,10n\} for n=5n=5 and m{6n,8n,10n,12n}m\in\{6n,8n,10n,12n\} for n=10n=10. We defined a sparse vector en:=(1,,1,0,,0)me_{n}:=(1,\ldots,1,0,\ldots,0)^{\top}\in\mathbb{R}^{m}, whose first nn components are 1 and whose remaining components are 0, and let the subspace WW be the span of ene_{n} and n1n-1 random vectors in m\mathbb{R}^{m}. Using mgson.m [20], we generated an orthonormal basis of WW to form the matrix Qm×nQ\in\mathbb{R}^{m\times n}. With this construction, the minimum value of Qx0\|Qx\|_{0} should equal nn. We chose the initial points with the M.rand() tool of Manopt 7.0, which returns a random point on the manifold M, and set x0 = abs(M.rand()); a nonnegative initial point seemed to work better in our experiments. Regarding the settings of our Algorithm 2, we chose the same smoothing function f~1(x,μ)\tilde{f}_{1}(x,\mu) in Table 1 and the same gradient tolerance strategy (22) as in [47]: μ0=1,θ=0.5,δ0=0.1,ρ=0.5.\mu_{0}=1,\theta=0.5,\delta_{0}=0.1,\rho=0.5. We then compared the numerical performance of the different sub-algorithms. Note that with the steepest-descent sub-algorithm, our Algorithm 2 is exactly the same as the method in [47].
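
The following MATLAB sketch shows this setup for one instance with n=5n=5 and m=4nm=4n. It is not the exact code used in our experiments: orth() stands in for mgson.m, the smoothing function is assumed here to be the pseudo-Huber-type smoothing sqrt(t^2 + mu^2) of |t|, the gradient tolerance is simply multiplied by ρ\rho after each outer iteration (our reading of strategy (22)), and the number of outer iterations is fixed for illustration.

% Sketch of one synthetic FSV instance and of Algorithm 2 with the
% steepest-descent sub-solver. Assumptions: orth() replaces mgson.m,
% f1~(t,mu) = sqrt(t^2 + mu^2), and delta is multiplied by rho per outer step.
n = 5;  m = 4*n;
e_n = [ones(n,1); zeros(m-n,1)];            % planted sparse vector
W = [e_n, randn(m, n-1)];                   % spanning set of the subspace W
Q = orth(W);                                % orthonormal basis of W, Q is m x n

problem.M = spherefactory(n);               % unit sphere S^{n-1} (Manopt 7.0)
x = abs(problem.M.rand());                  % nonnegative random initial point
mu = 1; theta = 0.5; delta = 0.1; rho = 0.5;

for k = 1:30                                % outer smoothing iterations
    problem.cost  = @(x) sum(sqrt((Q*x).^2 + mu^2));            % smoothed ||Qx||_1
    problem.egrad = @(x) Q' * ((Q*x) ./ sqrt((Q*x).^2 + mu^2)); % Euclidean gradient
    options.tolgradnorm = delta;
    options.verbosity   = 0;
    x = steepestdescent(problem, x, options);  % any of the six sub-solvers fits here
    mu    = theta * mu;                        % shrink the smoothing parameter
    delta = rho * delta;                       % tighten the gradient tolerance
end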

For each (n,m)(n,m), we generated 50 pairs of random instances and random initial points. We say that an algorithm terminates successfully if Qxk0=n\|Qx_{k}\|_{0}=n, where xkx_{k} is the kk-th iterate. Here, when counting the number of nonzeros of QxkQx_{k}, we truncate the entries as

(Qxk)i=0 if |(Qxk)i|<τ,(Qx_{k})_{i}=0\quad\text{ if }\left|(Qx_{k})_{i}\right|<\tau, (25)

where τ>0\tau>0 is a tolerance related to the precision of the solution, taking values from 10510^{-5} to 101210^{-12}. Tables 5 and 6 report the number of successful cases out of the 50 runs. Boldface highlights the best result in each row.
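
For completeness, the success test based on the truncation rule (25) is a one-line check; the sketch below assumes Q, x, tau, and n are already in the workspace.

% Declare success if Q*x has exactly n entries at or above the truncation level tau.
success = (nnz(abs(Q*x) >= tau) == n);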

Table 5: Number of successes from 50 pairs of random instances and random initial points for the 1\ell_{1} minimization model (24) and n=5n=5.
(n,m)(n,m) τ\tau Steepest-descent Barzilai-Borwein Conjugate-gradient Trust-regions BFGS ARC
(5,20)(5,20) 10510^{-5} 21 19 0 22 23 23
10610^{-6} 21 19 0 22 23 23
10710^{-7} 21 19 0 22 23 23
10810^{-8} 16 19 0 22 23 23
(5,30)(5,30) 10510^{-5} 36 42 0 34 36 35
10610^{-6} 36 42 0 34 36 35
10710^{-7} 36 42 0 34 36 35
10810^{-8} 34 42 0 34 36 35
(5,40)(5,40) 10510^{-5} 44 47 1 44 47 45
10610^{-6} 44 47 0 44 47 45
10710^{-7} 44 47 0 44 47 45
10810^{-8} 43 47 0 44 47 45
(5,50)(5,50) 10510^{-5} 47 47 2 45 45 45
10610^{-6} 47 47 2 45 45 45
10710^{-7} 47 47 0 45 45 45
10810^{-8} 46 47 0 45 45 45
(n,m)(n,m) τ\tau Steepest-descent Barzilai-Borwein Conjugate-gradient Trust-regions BFGS ARC
(5,20)(5,20) 10910^{-9} 0 19 0 22 23 23
101010^{-10} 0 19 0 22 23 23
101110^{-11} 0 19 0 22 23 19
101210^{-12} 0 18 0 22 22 17
(5,30)(5,30) 10910^{-9} 8 42 0 34 36 35
101010^{-10} 1 42 0 34 36 35
101110^{-11} 0 42 0 34 36 33
101210^{-12} 0 42 0 34 34 29
(5,40)(5,40) 10910^{-9} 3 47 0 44 47 45
101010^{-10} 2 47 0 44 47 45
101110^{-11} 1 47 0 44 47 44
101210^{-12} 0 46 0 44 44 36
(5,50)(5,50) 10910^{-9} 5 47 0 45 45 45
101010^{-10} 2 47 0 45 45 45
101110^{-11} 0 47 0 45 45 45
101210^{-12} 0 47 0 45 45 37

As shown in Tables 5 and 6, surprisingly, the conjugate-gradient method, which performed best on the CP factorization problem in Section 4, performed worst on the FSV problem; in fact, it was almost useless. Moreover, although the steepest-descent method employed in [47] was not bad at obtaining low-precision solutions with τ{105,106,107,108}\tau\in\{10^{-5},10^{-6},10^{-7},10^{-8}\}, it had difficulty obtaining high-precision solutions with τ{109,1010,1011,1012}\tau\in\{10^{-9},10^{-10},10^{-11},10^{-12}\}. The remaining four sub-algorithms easily obtained high-precision solutions, with the Barzilai-Borwein method performing best in most cases. Combined with the results in Section 4, this shows that, in practice, the choice of sub-algorithm in the Riemannian smoothing method (Algorithm 2) is highly problem-dependent. For the other smoothing functions in Table 1, we obtained results similar to those in Tables 5 and 6.

Table 6: Number of successes from 50 pairs of random instances and random initial points for the 1\ell_{1} minimization model (24) and n=10n=10.
(n,m)(n,m) τ\tau Steepest-descent Barzilai-Borwein Conjugate-gradient Trust-regions BFGS ARC
(10,60)(10,60) 10510^{-5} 24 28 0 28 28 25
10610^{-6} 24 28 0 28 28 25
10710^{-7} 24 28 0 28 28 25
10810^{-8} 23 28 0 28 28 25
(10,80)(10,80) 10510^{-5} 39 37 1 40 39 40
10610^{-6} 39 37 0 40 39 40
10710^{-7} 39 37 0 40 39 40
10810^{-8} 39 37 0 40 39 40
(10,100)(10,100) 10510^{-5} 45 48 3 45 43 41
10610^{-6} 45 48 0 45 43 41
10710^{-7} 45 48 0 45 43 41
10810^{-8} 45 48 0 45 43 41
(10,120)(10,120) 10510^{-5} 44 46 1 44 44 43
10610^{-6} 44 46 0 44 44 43
10710^{-7} 44 46 0 44 44 43
10810^{-8} 44 46 0 44 44 43
(n,m)(n,m) τ\tau Steepest-descent Barzilai-Borwein Conjugate-gradient Trust-regions BFGS ARC
(10,60)(10,60) 10910^{-9} 3 28 0 28 28 25
101010^{-10} 0 28 0 28 28 25
101110^{-11} 0 28 0 28 28 22
101210^{-12} 0 28 0 28 27 12
(10,80)(10,80) 10910^{-9} 5 37 0 40 39 40
101010^{-10} 0 37 0 40 39 40
101110^{-11} 0 37 0 40 39 39
101210^{-12} 0 37 0 40 37 30
(10,100)(10,100) 10910^{-9} 13 48 0 45 43 41
101010^{-10} 2 48 0 45 43 41
101110^{-11} 0 48 0 45 43 40
101210^{-12} 0 48 0 45 43 37
(10,120)(10,120) 10910^{-9} 14 46 0 44 44 43
101010^{-10} 0 46 0 44 44 43
101110^{-11} 0 46 0 44 44 43
101210^{-12} 0 46 0 44 43 40
[Figure 1: panels (a)–(e) and (f)–(j) correspond to the smoothing functions f~1\tilde{f}_{1}–f~5\tilde{f}_{5} of Table 1, respectively.]
Figure 1: Perfect low-rank matrix completion of a rank-10 5000×50005000\times 5000 matrix without any outliers using different smoothing functions in Table 1. (a)–(e) comprise the running iteration comparison; (f)–(j) comprise the time comparison.

5.2 Robust low-rank matrix completion

Low-rank matrix completion consists of recovering a rank-rr matrix MM of size m×nm\times n from only a fraction of its entries, with rmin(m,n)r\ll\min(m,n). In robust low-rank matrix completion, a few of the observed entries, called outliers, are perturbed, i.e.,

M=M0+S,M=M_{0}+S,

where M0M_{0} is the unperturbed original data matrix of rank rr and SS is a sparse matrix. This is a case of adding non-Gaussian noise for which the traditional 2\ell_{2} minimization model,

minXr𝒫Ω(XM)2\min_{X\in\mathcal{M}_{r}}\left\|\mathcal{P}_{\Omega}(X-M)\right\|_{2}

is not well suited to recovering M0M_{0}. Here, r:={Xm×nrank(X)=r}\mathcal{M}_{r}:=\left\{X\in\mathbb{R}^{m\times n}\mid\operatorname{rank}(X)=r\right\} is the fixed-rank manifold, Ω\Omega denotes the set of indices of the observed entries, and 𝒫Ω:m×nm×n\mathcal{P}_{\Omega}:\mathbb{R}^{m\times n}\to\mathbb{R}^{m\times n} is the projection onto Ω\Omega, defined as

ZijPΩ{Zij if (i,j)Ω0 if (i,j)Ω.Z_{ij}\stackrel{{\scriptstyle P_{\Omega}}}{{\longmapsto}}\begin{cases}Z_{ij}&\text{ if }(i,j)\in\Omega\\ 0&\text{ if }(i,j)\notin\Omega.\end{cases}

In [18], the authors try to solve

minXr𝒫Ω(XM)1,\min_{X\in\mathcal{M}_{r}}\left\|\mathcal{P}_{\Omega}(X-M)\right\|_{1},

because the sparsity-inducing property of the 1\ell_{1} norm leads one to expect exact recovery when the noise consists of just a few outliers.

In all of the experiments, the problems were generated in the same way as in [18]. In particular, after picking the values of m,n,rm,n,r, we generated the ground truth Um×rU\in\mathbb{R}^{m\times r} and Vn×rV\in\mathbb{R}^{n\times r} with independent and identically distributed (i.i.d.) Gaussian entries of zero mean and unit variance and set M:=UVM:=UV^{\top}. We then sampled k:=ρr(m+nr)k:=\rho r(m+n-r) observed entries uniformly at random, where ρ\rho is the oversampling factor; we set ρ=5\rho=5 throughout. We chose the initial point X0X_{0} as the rank-rr truncated SVD of 𝒫Ω(M)\mathcal{P}_{\Omega}(M).
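
A minimal MATLAB sketch of this construction is given below. It assumes that the observed positions are stored in a 0/1 mask, so that 𝒫Ω\mathcal{P}_{\Omega} reduces to an entrywise product, and it uses svds for the rank-rr truncated SVD; the sizes shown are those of the 500×500500\times 500 instances used in Section 5.2.2.

% Sketch: generating one low-rank matrix completion instance as in [18].
m = 500; n = 500; r = 10; rho = 5;
U = randn(m, r);  V = randn(n, r);        % i.i.d. standard Gaussian factors
M = U * V';                               % ground-truth rank-r matrix
k = round(rho * r * (m + n - r));         % number of observed entries
Omega = false(m, n);
Omega(randperm(m*n, k)) = true;           % observed positions, uniform at random
P_Omega = @(Z) Omega .* Z;                % projection onto the observed entries
[Us, Ss, Vs] = svds(P_Omega(M), r);       % rank-r truncated SVD
X0 = Us * Ss * Vs';                       % initial point X_0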

Regarding the settings of our Algorithm 2, we tested all combinations of the five smoothing functions in Table 1 and the six sub-algorithms mentioned before (30 cases in total). We set μ0=100\mu_{0}=100 and chose an aggressive value of θ=0.05\theta=0.05 for reducing μ\mu, as in [18]. The inner loop of the sub-algorithm was stopped after a maximum of 40 iterations or once the gradient tolerance (23) was reached, whichever came first. We monitored the iterates XkX_{k} through the root mean square error (RMSE), defined as the error over all entries between XkX_{k} and the original low-rank matrix M0M_{0}, i.e.,

RMSE(Xk,M0):=i=1mj=1n(Xk,ijM0,ij)2mn.\operatorname{RMSE}\left(X_{k},M_{0}\right):=\sqrt{\frac{\sum_{i=1}^{m}\sum_{j=1}^{n}\left(X_{k,ij}-M_{0,ij}\right)^{2}}{mn}}.
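
Equivalently, the RMSE is the Frobenius norm of the error divided by the square root of mnmn; a one-line MATLAB rendering of the formula above, assuming Xk and M0 are in the workspace:

% RMSE between the current iterate Xk and the ground-truth matrix M0.
rmse = norm(Xk - M0, 'fro') / sqrt(numel(M0));
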
[Figure 2: panels (a)–(e), (f)–(j), (k)–(o), and (p)–(t) correspond to the smoothing functions f~1\tilde{f}_{1}–f~5\tilde{f}_{5} of Table 1, respectively.]
Figure 2: Low-rank matrix completion with outliers for two rank-10 500×500500\times 500 matrices using different smoothing functions in Table 1. (a)–(j) correspond to one matrix with outliers created using μN=σN=0.1\mu_{N}=\sigma_{N}=0.1, while (k)–(t) correspond to the other with outliers created using μN=σN=1\mu_{N}=\sigma_{N}=1. (a)–(e) and (k)–(o) comprise the running iteration comparison; (f)–(j) and (p)–(t) comprise the time comparison.
[Figure 3: panels (a)–(e), (f)–(j), (k)–(o), and (p)–(t) correspond to the smoothing functions f~1\tilde{f}_{1}–f~5\tilde{f}_{5} of Table 1, respectively.]
Figure 3: Low-rank matrix completion with outliers for two rank-10 5000×50005000\times 5000 matrices using different smoothing functions in Table 1. (a)–(j) correspond to one matrix with outliers created using μN=σN=0.1\mu_{N}=\sigma_{N}=0.1, while (k)–(t) correspond to the other with outliers created using μN=σN=1\mu_{N}=\sigma_{N}=1. (a)–(e) and (k)–(o) comprise the running iteration comparison; (f)–(j) and (p)–(t) comprise the time comparison.

5.2.1 Perfect low-rank matrix completion

As in [18], we first tested all the methods on a simple perfect matrix MM (without any outliers) of size 5000×50005000\times 5000 and rank 10. The results are shown in Figure 1. We can see that the choice of smoothing function does not have much effect on the numerical performance. In terms of the number of iterations ((a)–(e)), our Algorithm 2 inherits the convergence rate of its sub-algorithm, converging at least Q-superlinearly when trust regions or ARC is used. However, the per-iteration cost of trust regions and ARC is high, so they are not efficient in terms of time. In particular, the conjugate-gradient method employed in [18] stagnates at lower precision. Overall, Barzilai-Borwein performed best in terms of time and accuracy.

5.2.2 Low-rank matrix completion with outliers

Given a 500×500500\times 500 matrix whose entries we observed uniformly at random with an oversampling factor ρ\rho of 5, we perturbed 5%5\% of the observed entries by adding noise to them in order to create outliers. The added term was a random variable defined as 𝒪=𝒮±1𝒩(μN,σN2)\mathcal{O}=\mathcal{S}_{\pm 1}\cdot\mathcal{N}(\mu_{N},\sigma_{N}^{2}), where 𝒮±1\mathcal{S}_{\pm 1} is a random variable equal to +1+1 or 1-1 with equal probability and 𝒩(μN,σN2)\mathcal{N}(\mu_{N},\sigma_{N}^{2}) is a Gaussian random variable with mean μN\mu_{N} and variance σN2\sigma_{N}^{2}.
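
The sketch below shows how such outliers can be injected; it reuses M and Omega from the earlier instance-generation sketch and uses the values μN=σN=0.1\mu_{N}=\sigma_{N}=0.1 of the first instance.

% Sketch: perturbing 5% of the observed entries with outliers
% O = S_{+/-1} * N(mu_N, sigma_N^2); reuses M and Omega from the earlier sketch.
muN = 0.1;  sigmaN = 0.1;                       % (or muN = sigmaN = 1)
obs   = find(Omega);                            % linear indices of observed entries
nOut  = round(0.05 * numel(obs));               % 5% of the observed entries
out   = obs(randperm(numel(obs), nOut));        % outlier positions
signs = sign(randn(nOut, 1));                   % +1 or -1 with equal probability
M(out) = M(out) + signs .* (muN + sigmaN * randn(nOut, 1));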

Figure 2 reports the results for two 500×500500\times 500 instances with outliers generated using μN=σN=0.1\mu_{N}=\sigma_{N}=0.1 and μN=σN=1\mu_{N}=\sigma_{N}=1. Again, we can see that the choice of smoothing function does not have much effect. In most cases, BFGS and trust regions were better than the other methods in terms of the number of iterations, and BFGS was the fastest. Furthermore, the conjugate-gradient method employed in [18] still stagnated at solutions of lower precision, approximately 10610^{-6}, while steepest descent, BFGS, and trust regions always obtained solutions with at least 10810^{-8} precision.

Next, we ran the same experiment on larger 5000×50005000\times 5000 matrices with 5%5\% outliers. Figure 3 illustrates the results of these experiments, with μN=σN=0.1\mu_{N}=\sigma_{N}=0.1 and μN=σN=1\mu_{N}=\sigma_{N}=1. In most cases, trust regions still outperformed the other methods in terms of the number of iterations, while BFGS performed poorly. Barzilai-Borwein and the conjugate-gradient method were almost equally good in terms of time.

6 Concluding remarks

We examined the problem of finding a CP factorization of a given completely positive matrix and treated it as a nonsmooth Riemannian optimization problem. To this end, we studied a general framework of Riemannian smoothing for Riemannian optimization. The numerical experiments showed that our method can compete with other efficient CP factorization methods and that it is particularly strong on large-scale matrices.

Let us summarize the relation of our approach to the existing CP factorization methods. Groetzner and Dür [31] and Chen et al. [19] proposed different methods to solve (FeasCP), and Boţ and Nguyen [12] tried to solve another model (2). However, the methods they used are not Riemannian optimization techniques but rather Euclidean ones, since they treated the set 𝒪(r):={Xr×r:XX=I}\mathcal{O}(r):=\{X\in\mathbb{R}^{r\times r}:X^{\top}X=I\} as an ordinary constraint in Euclidean space. By comparison, we recognize the underlying manifold, namely the Stiefel manifold =𝒪(r)\mathcal{M}=\mathcal{O}(r), and use optimization techniques specific to it. This change in perspective opens up the rich variety of Riemannian optimization techniques. As the experiments in Section 4 show, our Riemannian approach is faster and more reliable than the Euclidean methods.

In the future, we plan to extend Algorithm 2 to the case of general manifolds and, in particular, to quotient manifolds. This extension is believed to be possible, although effort will be needed to establish analogous convergence results. In fact, convergence has been verified in a built-in example in Manopt 7.0 [14]: robust_pca.m computes a robust version of PCA on data by optimizing a nonsmooth function over a Grassmann manifold. The nonsmooth term is the (unsquared) l2l_{2} norm, used for robustness, and robust_pca.m applies Riemannian smoothing with a pseudo-Huber loss function in place of the l2l_{2} norm.

As with the other numerical methods, there is no guarantee that Algorithm 2 will find a CP factorization for every A𝒞𝒫nA\in\mathcal{CP}_{n}. It follows from Proposition 2.7 that A𝒞𝒫nA\in\mathcal{CP}_{n} if and only if the global minimum tt of (OptCP) satisfies t0t\leqslant 0. Since our method only converges to a stationary point, Algorithm 2 provides us with a local minimizer at best. Finding a global minimizer of (OptCP) is a topic for our future work.

Acknowledgments

This work was supported by the Research Institute for Mathematical Sciences, an International Joint Usage/Research Center, at Kyoto University, JSPS KAKENHI Grant, number (B)19H02373, and JST SPRING Grant, number JPMJSP2124. The authors would like to sincerely thank the anonymous reviewers and the coordinating editor for their thoughtful and valuable comments which have significantly improved the paper.

Data availability

The data that support the findings of this study are available from the corresponding author upon request.

Conflict of interest

All authors declare that they have no conflicts of interest.

References

  • [1] Absil, P.-A., Mahony, R., Sepulchre, R.: Optimization Algorithms on Matrix Manifolds. Princeton University Press, Princeton (2009)
  • [2] Bagirov, A., Karmitsa, N., Mäkelä, M.M.: Introduction to Nonsmooth Optimization: Theory, Practice and Software. Springer International Publishing, Berlin (2014)
  • [3] Berman, A., Shaked-Monderer, N.: Completely Positive Matrices. World Scientific, Singapore (2003)
  • [4] Berman, A., Dür, M., Shaked-Monderer, N.: Open problems in the theory of completely positive and copositive matrices. Electron. J. Linear Algebra. 29, 46–58 (2015)
  • [5] Bian, W., Chen, X.: Neural network for nonsmooth, nonconvex constrained minimization via smooth approximation. IEEE Trans. Neural Netw. Learn. Syst. 25, 545–556 (2013)
  • [6] Bomze, I.M.: Copositive optimization–recent developments and applications. Eur. J. Oper. Res. 216, 509–520 (2012)
  • [7] Bomze, I.M.: Building a completely positive factorization. Cent. Eur. J. Oper. Res. 26, 287–305 (2018)
  • [8] Bomze, I.M., Dickinson, P.J., Still, G.: The structure of completely positive matrices according to their cp-rank and cp-plus-rank. Linear Algebra Appl. 482, 191–206 (2015)
  • [9] Bomze, I.M., Dür, M., De Klerk, E., Roos, C., Quist, A.J., Terlaky, T.: On copositive programming and standard quadratic optimization problems. J. Glob. Optim. 18, 301–320 (2000)
  • [10] Bomze, I.M., Schachinger, W., Uchida, G.: Think co(mpletely) positive! Matrix properties, examples and a clustered bibliography on copositive optimization. J. Glob. Optim. 52, 423–445 (2012)
  • [11] Borckmans, P.B., Selvan, S.E., Boumal, N., Absil, P.-A.: A Riemannian subgradient algorithm for economic dispatch with valve-point effect. J. Comput. Appl. Math. 255, 848–866 (2014)
  • [12] Boţ, R. I., Nguyen, D.-K.: Factorization of completely positive matrices using iterative projected gradient steps. Numer. Linear Algebra Appl. 28, e2391 (2021)
  • [13] Boumal, N.: An Introduction to Optimization on Smooth Manifolds. To appear with Cambridge University Press, Mar 2022.
  • [14] Boumal, N., Mishra, B., Absil, P.-A., Sepulchre, R.: Manopt, a Matlab toolbox for optimization on manifolds. J. Mach. Learn. Res. 15, 1455–1459 (2014)
  • [15] Burer, S.: On the copositive representation of binary and continuous nonconvex quadratic programs. Math. Program. 120, 479–495 (2009)
  • [16] Burer, S.: A gentle, geometric introduction to copositive optimization. Math. Program. 151, 89–116 (2015)
  • [17] Burer, S., Monteiro, R.D.C.: Local minima and convergence in low-rank semidefinite programming. Math. Program. 103, 427–444 (2005)
  • [18] Cambier, L., Absil, P.-A.: Robust low-rank matrix completion by Riemannian optimization. SIAM J. Sci. Comput. 38, S440–S460 (2016)
  • [19] Chen, C., Pong, T.K., Tan, L., Zeng, L.: A difference-of-convex approach for split feasibility with applications to matrix factorizations and outlier detection. J. Glob. Optim. 78, 107–136 (2020)
  • [20] Chen, M.: Gram-Schmidt orthogonalization. MATLAB Central File Exchange. https://www.mathworks.com/matlabcentral/fileexchange/55881-gram-schmidt-orthogonalization (2022). Accessed 24 June 2022
  • [21] Chen, X.: Smoothing methods for nonsmooth, nonconvex minimization. Math. Program. 134, 71–99 (2012)
  • [22] Chen, X., Wets, R.J.B., Zhang, Y.: Stochastic variational inequalities: residual minimization smoothing sample average approximations. SIAM J. Optim. 22, 649–673 (2012)
  • [23] De Carvalho Bento, G., da Cruz Neto, J.X., Oliveira, P.R.: A new approach to the proximal point method: convergence on general Riemannian manifolds. J. Optim. Theory Appl. 168, 743–755 (2016)
  • [24] De Klerk, E., Pasechnik, D.V.: Approximation of the stability number of a graph via copositive programming. SIAM J. Optim. 12, 875–892 (2002)
  • [25] Dickinson, P.J.: An improved characterization of the interior of the completely positive cone. Electron. J. Linear Algebra. 20, 723–729 (2010)
  • [26] Dickinson, P.J., Dür, M.: Linear-time complete positivity detection and decomposition of sparse matrices. SIAM J. Matrix Anal. Appl. 33, 701–720 (2012)
  • [27] Dickinson, P.J., Gijben, L.: On the computational complexity of membership problems for the completely positive cone and its dual. Comput. Optim. Appl. 57, 403–415 (2014)
  • [28] Dür, M.: Copositive programming — a survey. In: Diehl, M., Glineur, F., Jarlebring, E., Michiels W. (eds.) Recent advances in optimization and its applications in engineering, pp. 3–20. Springer, Berlin, Heidelberg (2010).
  • [29] Dür, M., Rendl, F.: Conic optimization: A survey with special focus on copositive optimization and binary quadratic problems. EURO J. Comput. Optim. 9, 100021 (2021)
  • [30] Dür, M., Still, G.: Interior points of the completely positive cone. Electron. J. Linear Algebra. 17, 48–53 (2008)
  • [31] Groetzner, P., Dür, M.: A factorization method for completely positive matrices. Linear Algebra Appl. 591, 1–24 (2020)
  • [32] Horn, R.A., Johnson, C.R.: Matrix Analysis. Cambridge University Press, Cambridge (2012)
  • [33] Jarre, F., Schmallowsky, K.: On the computation of CC^{*} certificates. J. Glob. Optim. 45, 281–296 (2009)
  • [34] Kovnatsky, A., Glashoff, K., Bronstein, M.M.: MADMM: a generic algorithm for non-smooth optimization on manifolds. In: European Conference on Computer Vision, pp. 680–696. Springer, Cham (2016)
  • [35] Liu, C., Boumal, N.: Simple algorithms for optimization on Riemannian manifolds with constraints. Appl. Math. Optim. 82, 949–981 (2020)
  • [36] Nie, J.: The 𝒜\mathcal{A}-truncated KK-moment problem. Found. Comput. Math. 14, 1243–1276 (2014)
  • [37] Qu, Q., Sun, J., Wright, J.: Finding a sparse vector in a subspace: Linear sparsity using alternating directions. Adv. Neural. Inf. Process. Syst. 27 (2014)
  • [38] Qu, Q., Zhu, Z., Li, X., Tsakiris, M.C., Wright, J., Vidal, R.: Finding the sparsest vectors in a subspace: Theory, algorithms, and applications. arXiv preprint arXiv:2001.06970. (2020)
  • [39] Rockafellar, R.T., Wets, R.J.B.: Variational Analysis. Springer Science & Business Media, Berlin (2009)
  • [40] Rudin, W.: Principles of Mathematical Analysis. McGraw-Hill, New York (1964)
  • [41] Sato, H., Iwai, T.: A new, globally convergent Riemannian conjugate gradient method. Optim. 64, 1011–1031 (2015)
  • [42] Shilon O.: Randorthmat. MATLAB Central File Exchange. https://www.mathworks.com/matlabcentral/fileexchange/11783-randorthmat (2022). Accessed 7 April 2022
  • [43] Dutour Sikirić, M., Schürmann, A., Vallentin, F.: A simplex algorithm for rational cp-factorization. Math. Program. 187, 25–45 (2020)
  • [44] So, W., Xu, C.: A simple sufficient condition for complete positivity. Oper. Matrices. 9, 233–239 (2015)
  • [45] Sponsel, J., Dür, M.: Factorization and cutting planes for completely positive matrices by copositive projection. Math. Program. 143, 211–229 (2014)
  • [46] Zhang, C., Chen, X.: A smoothing active set method for linearly constrained non-Lipschitz nonconvex optimization. SIAM J. Optim. 30, 1–30 (2020)
  • [47] Zhang, C., Chen, X., Ma, S.: A Riemannian smoothing steepest descent method for non-Lipschitz optimization on submanifolds. arXiv preprint arXiv:2104.04199. (2021)