
ACML 2021

Vector Transport Free Riemannian LBFGS for Optimization on Symmetric Positive Definite Matrix Manifolds

Reza Godaz ([email protected])
Department of Computer Engineering, Ferdowsi University of Mashhad, Mashhad, Iran

Benyamin Ghojogh ([email protected])
Department of Electrical and Computer Engineering, University of Waterloo, ON, Canada

Reshad Hosseini ([email protected])
Department of Electrical and Computer Engineering, University of Tehran, Tehran, Iran

Reza Monsefi ([email protected])
Department of Computer Engineering, Ferdowsi University of Mashhad, Mashhad, Iran

Fakhri Karray ([email protected])
Mark Crowley ([email protected])
Department of Electrical and Computer Engineering, University of Waterloo, ON, Canada

The first two authors contributed equally to this work.
Abstract

This work concentrates on optimization on Riemannian manifolds. The Limited-memory Broyden-Fletcher-Goldfarb-Shanno (LBFGS) algorithm is a commonly used quasi-Newton method for numerical optimization in Euclidean spaces. Riemannian LBFGS (RLBFGS) is an extension of this method to Riemannian manifolds. RLBFGS involves computationally expensive vector transports as well as unfolding recursions using adjoint vector transports. In this article, we propose two mappings in the tangent space using the inverse second root and Cholesky decomposition. These mappings make both vector transport and adjoint vector transport identity and therefore isometric. Identity vector transport makes RLBFGS less computationally expensive and its isometry is also very useful in convergence analysis of RLBFGS. Moreover, under the proposed mappings, the Riemannian metric reduces to Euclidean inner product, which is much less computationally expensive. We focus on the Symmetric Positive Definite (SPD) manifolds which are beneficial in various fields such as data science and statistics. This work opens a research opportunity for extension of the proposed mappings to other well-known manifolds.

keywords:
LBFGS, Riemannian optimization, positive definite manifolds, isometric vector transport, quasi-Newton method
editors: Vineeth N Balasubramanian and Ivor Tsang

1 Introduction

Various numerical optimization methods have been developed for Euclidean spaces; they can be categorized into first-order and second-order methods (Nocedal and Wright, 2006). Examples of the former category are steepest descent and gradient descent, and an example of the latter is Newton's method. Computing the Hessian matrix in Newton's method is usually expensive, which encourages many practical problems to either use quasi-Newton methods that approximate the Hessian matrix or use nonlinear conjugate gradient (Hestenes and Stiefel, 1952). The most well-known quasi-Newton algorithm is Broyden-Fletcher-Goldfarb-Shanno (BFGS) (Fletcher, 2013). Limited-memory BFGS (LBFGS) is a simplified version of BFGS which uses less memory (Nocedal, 1980; Liu and Nocedal, 1989); it approximates the descent direction through recursive unfoldings.

Unlike Euclidean spaces, in which the optimization directions lie in a linear coordinate system, Riemannian spaces have curved coordinates. Recently, the extension of optimization methods from Euclidean spaces to Riemannian manifolds has received extensive attention in the literature (Absil et al., 2009; Boumal, 2020; Hu et al., 2020). For example, Euclidean BFGS has been extended to Riemannian manifolds as Riemannian BFGS (RBFGS) (Qi et al., 2010), its convergence has been proven (Ring and Wirth, 2012; Huang et al., 2015), and its properties have been analyzed in the literature (Seibert et al., 2013). As vector transport is computationally expensive in RBFGS, cautious RBFGS was proposed (Huang et al., 2016), which ignores the curvature condition in the Wolfe conditions (Wolfe, 1969) and only checks the Armijo condition (Armijo, 1966). Since the curvature condition guarantees that the Hessian approximation remains positive definite, cautious RBFGS compensates for dropping it by checking a cautious condition (Li and Fukushima, 2001) before updating the Hessian approximation. This cautious RBFGS has been used in the Manopt optimization toolbox (Boumal et al., 2014). Another approach is an extension of Euclidean LBFGS to Riemannian manifolds, named Riemannian LBFGS (RLBFGS), which uses both Wolfe conditions (Wolfe, 1969) in linesearch; it can be found in (Sra and Hosseini, 2015, 2016; Hosseini and Sra, 2020). Some other direct extensions of Euclidean BFGS to Riemannian spaces exist (e.g., see (Ji, 2007, Chapter 7)).

In this paper, we address the computationally expensive parts of the RLBFGS algorithm, namely the computation of vector transports, their adjoints, and Riemannian metrics. To achieve this, we propose two mappings in the tangent space which make RLBFGS free of vector transport. We name the resulting algorithm Vector Transport Free (VTF)-RLBFGS. One mapping uses the inverse second root and the other uses the Cholesky decomposition, which is computationally very efficient (Golub and Van Loan, 2013). The proposed mappings make both the vector transport and the adjoint vector transport used in RLBFGS the identity. This reduction of the transports to the identity makes optimization much less computationally expensive. Moreover, as the vector transports become the identity, they are isometric, a property used in the convergence proofs of the RBFGS and RLBFGS algorithms (Ring and Wirth, 2012; Huang et al., 2015). Furthermore, under the proposed mappings, the Riemannian metric reduces to the Euclidean inner product, which is much less computationally expensive. In this paper, we concentrate on Symmetric Positive Definite (SPD) manifolds (Sra and Hosseini, 2015, 2016; Bhatia, 2009), which are very useful in data science and machine learning, for example in mixture models (Hosseini and Sra, 2020; Hosseini and Mash'al, 2015). This paper opens a new research path for the extension of the proposed mappings to other well-known manifolds such as Grassmann and Stiefel (Edelman et al., 1998).

The remainder of this paper is organized as follows. Section 2 reviews the notations and technical background on Euclidean LBFGS, the Wolfe conditions, Riemannian LBFGS, and the SPD manifold. The proposed mappings using the inverse second root and the Cholesky decomposition are introduced in Sections 3 and 4, respectively. Section 5 provides an analytical discussion and complexity analysis. Simulation results are reported in Section 6. Finally, Section 7 concludes the paper and suggests possible future directions.

2 Background and Notations

2.1 Euclidean BFGS and LBFGS

Consider the minimization of a cost function $f(\boldsymbol{\Sigma})$ where the point $\boldsymbol{\Sigma}$ belongs to some domain. In Newton's method, the descent direction $\boldsymbol{p}_k$ at iteration $k$ is obtained from $\boldsymbol{B}_k\boldsymbol{p}_k=-\nabla f(\boldsymbol{\Sigma}_k)$, i.e., $\boldsymbol{p}_k=-\boldsymbol{B}_k^{-1}\nabla f(\boldsymbol{\Sigma}_k)$ (Nocedal and Wright, 2006), where $\boldsymbol{B}_k$ is the Hessian, or an approximation of it, and $\nabla f(\boldsymbol{\Sigma}_k)$ is the gradient of the function at iteration $k$. The Euclidean BFGS method is a quasi-Newton method which approximates the Hessian as $\boldsymbol{B}_{k+1}:=\boldsymbol{B}_k+(\boldsymbol{y}_k\boldsymbol{y}_k^{\top})/(\boldsymbol{y}_k^{\top}\boldsymbol{s}_k)-(\boldsymbol{B}_k\boldsymbol{s}_k\boldsymbol{s}_k^{\top}\boldsymbol{B}_k^{\top})/(\boldsymbol{s}_k^{\top}\boldsymbol{B}_k\boldsymbol{s}_k)$ (Fletcher, 2013; Nocedal and Wright, 2006), where $\boldsymbol{s}_k:=\boldsymbol{\Sigma}_{k+1}-\boldsymbol{\Sigma}_k$ and $\boldsymbol{y}_k:=\nabla f(\boldsymbol{\Sigma}_{k+1})-\nabla f(\boldsymbol{\Sigma}_k)$. The descent direction $\boldsymbol{p}_k$ is then computed from the expression above.

The Euclidean LBFGS calculates the descent direction recursively, using the approximation of the inverse Hessian $\boldsymbol{H}_{k+1}:=\boldsymbol{V}_k^{\top}\boldsymbol{H}_k\boldsymbol{V}_k+\rho_k\boldsymbol{s}_k\boldsymbol{s}_k^{\top}$ (Nocedal, 1980; Liu and Nocedal, 1989), where $\boldsymbol{H}_k$ denotes the approximation of the inverse of the Hessian at iteration $k$, $\rho_k:=1/(\boldsymbol{y}_k^{\top}\boldsymbol{s}_k)$, and $\boldsymbol{V}_k:=\boldsymbol{I}-\rho_k\boldsymbol{y}_k\boldsymbol{s}_k^{\top}$, in which $\boldsymbol{I}$ denotes the identity matrix. The LBFGS algorithm updates the approximation of the inverse of the Hessian matrix recursively and, for that, always stores a memory window of pairs $\{\boldsymbol{y}_k,\boldsymbol{s}_k\}$ (Liu and Nocedal, 1989).
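For concreteness, the following is a minimal MATLAB sketch of the standard Euclidean LBFGS two-loop recursion that computes the descent direction from the stored pairs; the function and variable names (lbfgs_direction, S, Y, grad, H0) are ours and not taken from the paper's released code.

```matlab
% Minimal sketch of the Euclidean LBFGS two-loop recursion.
% S, Y are n x m matrices whose columns are the stored s_k, y_k pairs
% (oldest first), grad is the current gradient, H0 is a scalar initial
% inverse-Hessian scaling. Returns the descent direction p.
function p = lbfgs_direction(S, Y, grad, H0)
    m = size(S, 2);
    alpha = zeros(m, 1);
    rho = 1 ./ sum(Y .* S, 1)';   % rho_k = 1 / (y_k' * s_k)
    q = grad;
    for i = m:-1:1                % first loop: newest to oldest
        alpha(i) = rho(i) * (S(:, i)' * q);
        q = q - alpha(i) * Y(:, i);
    end
    p = H0 * q;
    for i = 1:m                   % second loop: oldest to newest
        beta = rho(i) * (Y(:, i)' * p);
        p = p + (alpha(i) - beta) * S(:, i);
    end
    p = -p;                       % descent direction
end
```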

2.2 Linesearch and Wolfe Conditions

After finding the descent direction $\boldsymbol{p}_k$ at each iteration $k$ of optimization, one needs to determine the step size $\alpha_k$ to take in that direction. Linesearch should be performed to find the largest step, for faster progress, satisfying the Wolfe conditions $f(\boldsymbol{\Sigma}_k+\alpha_k\boldsymbol{p}_k)\leq f(\boldsymbol{\Sigma}_k)+c_1\alpha_k\boldsymbol{p}_k^{\top}\nabla f(\boldsymbol{\Sigma}_k)$ and $-\boldsymbol{p}_k^{\top}\nabla f(\boldsymbol{\Sigma}_k+\alpha_k\boldsymbol{p}_k)\leq-c_2\boldsymbol{p}_k^{\top}\nabla f(\boldsymbol{\Sigma}_k)$ (Wolfe, 1969), where the parameters $0<c_1<c_2<1$ are recommended to be $c_1=10^{-4}$ and $c_2=0.9$ (Nocedal and Wright, 2006). The former is the Armijo condition, which checks that the cost decreases sufficiently (Armijo, 1966), while the latter is the curvature condition, which ensures that the slope is reduced sufficiently so that the approximation of the Hessian remains positive definite. Note that there also exists a strong curvature condition, i.e., $|\boldsymbol{p}_k^{\top}\nabla f(\boldsymbol{\Sigma}_k+\alpha_k\boldsymbol{p}_k)|\leq c_2|\boldsymbol{p}_k^{\top}\nabla f(\boldsymbol{\Sigma}_k)|$.
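As a minimal illustration (not the paper's code), the following MATLAB sketch checks the two Wolfe conditions for a candidate step size, assuming hypothetical function handles f and gradf for the cost and its Euclidean gradient.

```matlab
% Minimal sketch: check the (weak) Wolfe conditions for a candidate step.
% f and gradf are function handles for the cost and its Euclidean gradient;
% Sigma is the current point, p the descent direction, alpha the step size.
function ok = wolfe_ok(f, gradf, Sigma, p, alpha, c1, c2)
    g0 = gradf(Sigma);
    slope0 = p(:)' * g0(:);                    % directional derivative at alpha = 0
    armijo = f(Sigma + alpha * p) <= f(Sigma) + c1 * alpha * slope0;
    g1 = gradf(Sigma + alpha * p);
    curvature = p(:)' * g1(:) >= c2 * slope0;  % equivalent to -p'*g1 <= -c2*p'*g0
    ok = armijo && curvature;
end
```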

2.3 Riemannian Notations

Consider a Riemannian manifold denoted by $\mathcal{M}$. At every point $\boldsymbol{\Sigma}\in\mathcal{M}$, there is a tangent space to the manifold, denoted by $T_{\boldsymbol{\Sigma}}\mathcal{M}$. A tangent space contains tangent vectors; we denote a tangent vector by $\boldsymbol{\xi}$. For $\boldsymbol{\xi},\boldsymbol{\eta}\in T_{\boldsymbol{\Sigma}}\mathcal{M}$, a metric on this manifold is the inner product defined on it, denoted by $g_{\boldsymbol{\Sigma}}(\boldsymbol{\xi},\boldsymbol{\eta})$. Note that the gradient of a cost function whose domain is a manifold is a tangent vector in the tangent space and is denoted by $\nabla f(\boldsymbol{\Sigma})$ for a point $\boldsymbol{\Sigma}\in\mathcal{M}$. Vector transport is an operator which maps a tangent vector $\boldsymbol{\xi}\in T_{\boldsymbol{\Sigma}_1}\mathcal{M}$ from the tangent space at point $\boldsymbol{\Sigma}_1$ to the tangent space at another point $\boldsymbol{\Sigma}_2$; we denote this vector transport by $\mathcal{T}_{\boldsymbol{\Sigma}_1,\boldsymbol{\Sigma}_2}(\boldsymbol{\xi}):T_{\boldsymbol{\Sigma}_1}\mathcal{M}\mapsto T_{\boldsymbol{\Sigma}_2}\mathcal{M}$. Now, consider a point $\boldsymbol{\Sigma}$ on a manifold $\mathcal{M}$ and a descent direction $\boldsymbol{\xi}$ in the tangent space $T_{\boldsymbol{\Sigma}}\mathcal{M}$. The retraction $\text{Ret}_{\boldsymbol{\Sigma}}(\boldsymbol{\xi}):T_{\boldsymbol{\Sigma}}\mathcal{M}\mapsto\mathcal{M}$ retracts or maps the direction $\boldsymbol{\xi}$ in the tangent space onto the manifold $\mathcal{M}$. The exponential map, denoted by $\text{Exp}_{\boldsymbol{\Sigma}}(\boldsymbol{\xi})$, is also capable of this mapping by moving along the geodesic. One can approximate the exponential map by its second-order Taylor expansion, which is positivity-preserving (Jeuris et al., 2012) in the case of the SPD manifold. In this paper, $\textbf{tr}(\cdot)$ denotes the trace of a matrix and $\|\cdot\|_F$ denotes the Frobenius norm.

2.4 Riemannian BFGS and LBFGS

The Riemannian extension of Euclidean BFGS (RBFGS) (Qi et al., 2010; Ring and Wirth, 2012; Huang et al., 2015) performs the update of the Hessian approximation $\boldsymbol{B}_{k+1}$ (see Section 2.1) using Riemannian operators. RBFGS methods check both Wolfe conditions (see Section 2.2), which are computationally expensive. Cautious RBFGS (Huang et al., 2016) ignores the curvature condition and only checks the Armijo condition in linesearch. However, to ensure that the Hessian approximation remains positive definite, it checks a cautious condition (Li and Fukushima, 2001) before updating the Hessian approximation.

Riemannian LBFGS (Sra and Hosseini, 2015, 2016; Hosseini and Sra, 2020) performs the recursions of LBFGS in the Riemannian space to find the descent direction $\boldsymbol{\xi}_k\in T_{\boldsymbol{\Sigma}_k}\mathcal{M}$ and checks both Wolfe conditions in linesearch. In every iteration $k$ of optimization, the recursion starts with the direction $\boldsymbol{p}=-\nabla f(\boldsymbol{\Sigma}_k)$. Let the recursive function $\text{GetDirection}(\boldsymbol{p},k)$ return the descent direction. Inside every step of this recursion, we have (Hosseini and Sra, 2020, Algorithm 3):

$\boldsymbol{\tilde{p}} := \boldsymbol{p}-\rho_k\, g_{\boldsymbol{\Sigma}_k}(\boldsymbol{s}_k,\boldsymbol{p})\,\boldsymbol{y}_k$, (1)
$\boldsymbol{\widehat{p}} := \mathcal{T}_{\boldsymbol{\Sigma}_{k-1},\boldsymbol{\Sigma}_k}\big(\text{GetDirection}(\mathcal{T}^*_{\boldsymbol{\Sigma}_{k-1},\boldsymbol{\Sigma}_k}(\boldsymbol{\tilde{p}}),\,k-1)\big)$, (2)
$\text{return }\boldsymbol{\xi}_k := \boldsymbol{\widehat{p}}-\rho_k\, g_{\boldsymbol{\Sigma}_k}(\boldsymbol{y}_k,\boldsymbol{\widehat{p}})\,\boldsymbol{s}_k+\rho_k\, g_{\boldsymbol{\Sigma}_k}(\boldsymbol{s}_k,\boldsymbol{s}_k)\,\boldsymbol{p}$, (3)

where $\rho_k:=1/g_{\boldsymbol{\Sigma}_k}(\boldsymbol{y}_k,\boldsymbol{s}_k)$, and, inspired by the $\boldsymbol{s}_k$ and $\boldsymbol{y}_k$ introduced for Euclidean spaces, we have:

$\boldsymbol{\Sigma}_{k+1} := \text{Exp}_{\boldsymbol{\Sigma}_k}(\alpha_k\boldsymbol{\xi}_k)\ \text{or}\ \text{Ret}_{\boldsymbol{\Sigma}_k}(\alpha_k\boldsymbol{\xi}_k)$, (4)
$\boldsymbol{s}_{k+1} := \mathcal{T}_{\boldsymbol{\Sigma}_k,\boldsymbol{\Sigma}_{k+1}}(\alpha_k\boldsymbol{\xi}_k)$, (5)
$\boldsymbol{y}_{k+1} := \nabla f(\boldsymbol{\Sigma}_{k+1})-\mathcal{T}_{\boldsymbol{\Sigma}_k,\boldsymbol{\Sigma}_{k+1}}\big(\nabla f(\boldsymbol{\Sigma}_k)\big)$. (6)

Note that the new point in every iteration is found by retraction, or an exponential map, of the current point along the descent direction onto the manifold. According to Eq. (2) in the recursion and Eq. (5), RLBFGS involves both the adjoint vector transport and the vector transport, which are computationally expensive. Moreover, Eqs. (1) and (3) show that the Riemannian metric is used many times inside the recursions. Our proposed mappings simplify all of the vector transport, adjoint vector transport, and metric used in the RLBFGS algorithm.
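To make the cost of Eqs. (4)-(6) concrete, here is a minimal MATLAB sketch of one RLBFGS bookkeeping step on the SPD manifold, using the exponential map and vector transport of Table 1; the variable names and the handle grad_f are our own illustration, not the authors' code.

```matlab
% Minimal sketch of Eqs. (4)-(6) on the SPD manifold (no mapping).
% Sigma is the current point, xi the descent direction (symmetric matrix),
% alpha the step size, grad_f a handle returning the Riemannian gradient.
Shalf  = sqrtm(Sigma);
Sihalf = inv(Shalf);
Sigma_new = Shalf * expm(Sihalf * (alpha * xi) * Sihalf) * Shalf;   % Eq. (4), exponential map
T = sqrtm(Sigma_new) * Sihalf;                                      % Sigma_new^{1/2} * Sigma^{-1/2}
s_new = T * (alpha * xi) * T';                                      % Eq. (5), vector transport
y_new = grad_f(Sigma_new) - T * grad_f(Sigma) * T';                 % Eq. (6)
```

Each such step needs matrix square roots and inversions for the transports; this is exactly the cost that the proposed mappings remove.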

2.5 Symmetric Positive Definite (SPD) Manifold

Consider the SPD manifold (Sra and Hosseini, 2015, 2016), every point of which is an SPD matrix, i.e., $\boldsymbol{\Sigma}\in\mathcal{M}$ with $\mathbb{S}_{++}^n\ni\boldsymbol{\Sigma}\succ\boldsymbol{0}$. It can be shown that the tangent space to the SPD manifold is the space of symmetric matrices, i.e., $T_{\boldsymbol{\Sigma}}\mathcal{M}=\mathbb{S}^n$ (Bhatia, 2009). In this paper, we focus on SPD manifolds, which are widely used in data science. The operators for metric, gradient, exponential map, and vector transport on SPD manifolds are listed in Table 1 (Sra and Hosseini, 2016; Hosseini and Sra, 2020). In this table, $\nabla_E f(\boldsymbol{\Sigma})$ denotes the Euclidean gradient, and $\boldsymbol{L}_1$ and $\boldsymbol{L}_2$ are the lower-triangular Cholesky factors of the points $\boldsymbol{\Sigma}_1$ and $\boldsymbol{\Sigma}_2$, respectively.

Table 1: Operators on the SPD manifold under the proposed mappings.

No mapping:
Metric, $g_{\boldsymbol{\Sigma}}(\boldsymbol{\xi},\boldsymbol{\eta})$: $\textbf{tr}(\boldsymbol{\Sigma}^{-1}\boldsymbol{\xi}\boldsymbol{\Sigma}^{-1}\boldsymbol{\eta})$
Gradient, $\nabla f(\boldsymbol{\Sigma})$: $\frac{1}{2}\boldsymbol{\Sigma}\big(\nabla_E f(\boldsymbol{\Sigma})+(\nabla_E f(\boldsymbol{\Sigma}))^{\top}\big)\boldsymbol{\Sigma}$
Exponential map, $\text{Exp}_{\boldsymbol{\Sigma}}(\boldsymbol{\xi})$: $\boldsymbol{\Sigma}\,\text{exp}(\boldsymbol{\Sigma}^{-1}\boldsymbol{\xi})=\boldsymbol{\Sigma}^{\frac{1}{2}}\,\text{exp}(\boldsymbol{\Sigma}^{-\frac{1}{2}}\boldsymbol{\xi}\boldsymbol{\Sigma}^{-\frac{1}{2}})\,\boldsymbol{\Sigma}^{\frac{1}{2}}$
Vector transport, $\mathcal{T}_{\boldsymbol{\Sigma}_1,\boldsymbol{\Sigma}_2}(\boldsymbol{\xi})$: $\boldsymbol{\Sigma}_2^{\frac{1}{2}}\boldsymbol{\Sigma}_1^{-\frac{1}{2}}\boldsymbol{\xi}\boldsymbol{\Sigma}_1^{-\frac{1}{2}}\boldsymbol{\Sigma}_2^{\frac{1}{2}}$ or $\boldsymbol{L}_2\boldsymbol{L}_1^{-1}\boldsymbol{\xi}\boldsymbol{L}_1^{-\top}\boldsymbol{L}_2^{\top}$
Approx. Euclidean retraction, $\text{Ret}_{\boldsymbol{\Sigma}}(\boldsymbol{\xi})$: $\boldsymbol{\Sigma}+\boldsymbol{\xi}+\frac{1}{2}\boldsymbol{\xi}\boldsymbol{\Sigma}^{-1}\boldsymbol{\xi}$

Mapping by inverse second root:
Mapping: $\boldsymbol{\xi}^{\prime}:=\boldsymbol{\Sigma}^{-\frac{1}{2}}\,\boldsymbol{\xi}\,\boldsymbol{\Sigma}^{-\frac{1}{2}}$
Metric, $g^{\prime}_{\boldsymbol{\Sigma}}(\boldsymbol{\xi}^{\prime},\boldsymbol{\eta}^{\prime})$: $\textbf{tr}(\boldsymbol{\xi}^{\prime}\boldsymbol{\eta}^{\prime})$
Gradient, $\nabla^{\prime}f(\boldsymbol{\Sigma})$: $\frac{1}{2}\boldsymbol{\Sigma}^{\frac{1}{2}}\big(\nabla_E f(\boldsymbol{\Sigma})+(\nabla_E f(\boldsymbol{\Sigma}))^{\top}\big)\boldsymbol{\Sigma}^{\frac{1}{2}}$
Exponential map, $\text{Exp}_{\boldsymbol{\Sigma}}(\boldsymbol{\xi}^{\prime})$: $\boldsymbol{\Sigma}^{\frac{1}{2}}\,\text{exp}(\boldsymbol{\xi}^{\prime})\,\boldsymbol{\Sigma}^{\frac{1}{2}}$
Vector transport, $\mathcal{T}^{\prime}_{\boldsymbol{\Sigma}_1,\boldsymbol{\Sigma}_2}(\boldsymbol{\xi}^{\prime})$: $\boldsymbol{\xi}^{\prime}$
Approx. Euclidean retraction, $\text{Ret}_{\boldsymbol{\Sigma}}(\boldsymbol{\xi}^{\prime})$: $\boldsymbol{\Sigma}+\boldsymbol{\Sigma}^{\frac{1}{2}}\boldsymbol{\xi}^{\prime}\boldsymbol{\Sigma}^{\frac{1}{2}}+\frac{1}{2}\boldsymbol{\Sigma}^{\frac{1}{2}}\boldsymbol{\xi}^{\prime\,2}\boldsymbol{\Sigma}^{\frac{1}{2}}$

Mapping by Cholesky decomposition:
Mapping: $\boldsymbol{\xi}^{\prime}:=\boldsymbol{L}^{-1}\,\boldsymbol{\xi}\,\boldsymbol{L}^{-\top}$
Metric, $g^{\prime}_{\boldsymbol{\Sigma}}(\boldsymbol{\xi}^{\prime},\boldsymbol{\eta}^{\prime})$: $\textbf{tr}(\boldsymbol{\xi}^{\prime}\boldsymbol{\eta}^{\prime})$
Gradient, $\nabla^{\prime}f(\boldsymbol{\Sigma})$: $\frac{1}{2}\boldsymbol{L}^{\top}\big(\nabla_E f(\boldsymbol{\Sigma})+(\nabla_E f(\boldsymbol{\Sigma}))^{\top}\big)\boldsymbol{L}$
Exponential map, $\text{Exp}_{\boldsymbol{\Sigma}}(\boldsymbol{\xi}^{\prime})$: $\boldsymbol{\Sigma}\,\text{exp}(\boldsymbol{L}^{-\top}\boldsymbol{\xi}^{\prime}\boldsymbol{L}^{\top})$
Vector transport, $\mathcal{T}^{\prime}_{\boldsymbol{\Sigma}_1,\boldsymbol{\Sigma}_2}(\boldsymbol{\xi}^{\prime})$: $\boldsymbol{\xi}^{\prime}$
Approx. Euclidean retraction, $\text{Ret}_{\boldsymbol{\Sigma}}(\boldsymbol{\xi}^{\prime})$: $\boldsymbol{\Sigma}+\boldsymbol{L}\,\boldsymbol{\xi}^{\prime}\boldsymbol{L}^{\top}+\frac{1}{2}\boldsymbol{L}\,\boldsymbol{\xi}^{\prime\,2}\boldsymbol{L}^{\top}$

3 Vector Transport Free Riemannian LBFGS Using Mapping by Inverse Second Root

We propose two mappings on tangent vectors in the tangent space; the first uses the inverse second root. For every point $\boldsymbol{\Sigma}\in\mathcal{M}$, the first proposed mapping in the tangent space is:

$\boldsymbol{\xi}^{\prime}:=\boldsymbol{\Sigma}^{-\frac{1}{2}}\,\boldsymbol{\xi}\,\boldsymbol{\Sigma}^{-\frac{1}{2}}\implies\boldsymbol{\xi}=\boldsymbol{\Sigma}^{\frac{1}{2}}\,\boldsymbol{\xi}^{\prime}\,\boldsymbol{\Sigma}^{\frac{1}{2}}$, (7)

where the mapped tangent vector still remains in the tangent space, i.e., $\boldsymbol{\xi},\boldsymbol{\xi}^{\prime}\in T_{\boldsymbol{\Sigma}}\mathcal{M}=\mathbb{S}^n$. Importantly, the proposed mapping is bijective and keeps the tangent vector in the tangent space while simplifying the vector transport, the adjoint vector transport, and the metric.

Under the proposed mapping (7), the Riemannian operators on an SPD manifold are modified and mostly simplified. These operators are listed in Table 1. In the following, we provide proofs for these modifications.

Proposition 3.1.

After mapping (7), we have

  • Metric: the metric on the SPD manifold reduces to the Euclidean inner product, i.e., $g_{\boldsymbol{\Sigma}}(\boldsymbol{\xi}^{\prime},\boldsymbol{\eta}^{\prime})=\textbf{tr}(\boldsymbol{\xi}^{\prime}\boldsymbol{\eta}^{\prime})$.

  • Gradient: the gradient on an SPD manifold becomes $\nabla^{\prime}f(\boldsymbol{\Sigma})=\frac{1}{2}\boldsymbol{\Sigma}^{\frac{1}{2}}\big(\nabla_E f(\boldsymbol{\Sigma})+(\nabla_E f(\boldsymbol{\Sigma}))^{\top}\big)\boldsymbol{\Sigma}^{\frac{1}{2}}$, where $\nabla_E f(\boldsymbol{\Sigma})$ denotes the Euclidean gradient.

  • Vector transport: the vector transport becomes the identity, i.e., $\mathcal{T}^{\prime}_{\boldsymbol{\Sigma}_1,\boldsymbol{\Sigma}_2}(\boldsymbol{\xi}^{\prime})=\boldsymbol{\xi}^{\prime}$; hence, the optimization becomes vector transport free.

  • Exponential map: the exponential map on an SPD manifold becomes $\text{Exp}_{\boldsymbol{\Sigma}}(\boldsymbol{\xi}^{\prime})=\boldsymbol{\Sigma}^{\frac{1}{2}}\,\text{exp}(\boldsymbol{\xi}^{\prime})\,\boldsymbol{\Sigma}^{\frac{1}{2}}$.

  • Adjoint vector transport: the adjoint of the vector transport on an SPD manifold keeps the same defining relation. In other words, if before the mapping the adjoint vector transport is defined by $g_{\boldsymbol{\Sigma}_1}(\boldsymbol{\xi},\mathcal{T}^*_{\boldsymbol{\Sigma}_1,\boldsymbol{\Sigma}_2}\boldsymbol{\eta})=g_{\boldsymbol{\Sigma}_2}(\mathcal{T}_{\boldsymbol{\Sigma}_1,\boldsymbol{\Sigma}_2}\boldsymbol{\xi},\boldsymbol{\eta})$, $\forall\boldsymbol{\xi}\in T_{\boldsymbol{\Sigma}_1}\mathcal{M},\forall\boldsymbol{\eta}\in T_{\boldsymbol{\Sigma}_2}\mathcal{M}$ (Ring and Wirth, 2012), then after the mapping we have $g^{\prime}_{\boldsymbol{\Sigma}_1}(\boldsymbol{\xi}^{\prime},\mathcal{T}^{\prime*}_{\boldsymbol{\Sigma}_1,\boldsymbol{\Sigma}_2}\boldsymbol{\eta}^{\prime})=g^{\prime}_{\boldsymbol{\Sigma}_2}(\mathcal{T}^{\prime}_{\boldsymbol{\Sigma}_1,\boldsymbol{\Sigma}_2}\boldsymbol{\xi}^{\prime},\boldsymbol{\eta}^{\prime})$, $\forall\boldsymbol{\xi}^{\prime}\in T_{\boldsymbol{\Sigma}_1}\mathcal{M},\forall\boldsymbol{\eta}^{\prime}\in T_{\boldsymbol{\Sigma}_2}\mathcal{M}$; since $\mathcal{T}^{\prime}$ is the identity and the mapped metric is the Euclidean inner product, the adjoint $\mathcal{T}^{\prime*}$ is also the identity.

  • Retraction: the approximation of the Euclidean retraction, using the second-order Taylor expansion, on an SPD manifold becomes $\text{Ret}_{\boldsymbol{\Sigma}}(\boldsymbol{\xi}^{\prime})=\boldsymbol{\Sigma}+\boldsymbol{\Sigma}^{\frac{1}{2}}\boldsymbol{\xi}^{\prime}\boldsymbol{\Sigma}^{\frac{1}{2}}+\frac{1}{2}\boldsymbol{\Sigma}^{\frac{1}{2}}\boldsymbol{\xi}^{\prime\,2}\boldsymbol{\Sigma}^{\frac{1}{2}}$.

Proof 3.2.

\bullet Metric: $g_{\boldsymbol{\Sigma}}(\boldsymbol{\xi}^{\prime},\boldsymbol{\eta}^{\prime})\overset{(a)}{=}\textbf{tr}(\boldsymbol{\Sigma}^{-1}\boldsymbol{\xi}\boldsymbol{\Sigma}^{-1}\boldsymbol{\eta})=\textbf{tr}(\boldsymbol{\Sigma}^{-\frac{1}{2}}\boldsymbol{\Sigma}^{-\frac{1}{2}}\boldsymbol{\xi}\boldsymbol{\Sigma}^{-\frac{1}{2}}\boldsymbol{\Sigma}^{-\frac{1}{2}}\boldsymbol{\eta})\overset{(b)}{=}\textbf{tr}\big((\boldsymbol{\Sigma}^{-\frac{1}{2}}\boldsymbol{\xi}\boldsymbol{\Sigma}^{-\frac{1}{2}})(\boldsymbol{\Sigma}^{-\frac{1}{2}}\boldsymbol{\eta}\boldsymbol{\Sigma}^{-\frac{1}{2}})\big)\overset{(7)}{=}\textbf{tr}(\boldsymbol{\xi}^{\prime}\boldsymbol{\eta}^{\prime})$, where $(a)$ is because of the definition of the metric on SPD manifolds (see Table 1) and $(b)$ is thanks to the cyclic property of trace.

\bullet Gradient: $\nabla^{\prime}f(\boldsymbol{\Sigma})\overset{(7)}{=}\boldsymbol{\Sigma}^{-\frac{1}{2}}\nabla f(\boldsymbol{\Sigma})\boldsymbol{\Sigma}^{-\frac{1}{2}}\overset{(a)}{=}\frac{1}{2}\boldsymbol{\Sigma}^{-\frac{1}{2}}\boldsymbol{\Sigma}\big(\nabla_E f(\boldsymbol{\Sigma})+(\nabla_E f(\boldsymbol{\Sigma}))^{\top}\big)\boldsymbol{\Sigma}\boldsymbol{\Sigma}^{-\frac{1}{2}}=\frac{1}{2}\boldsymbol{\Sigma}^{\frac{1}{2}}\big(\nabla_E f(\boldsymbol{\Sigma})+(\nabla_E f(\boldsymbol{\Sigma}))^{\top}\big)\boldsymbol{\Sigma}^{\frac{1}{2}}$, where $(a)$ is due to the definition of the gradient on an SPD manifold (see Table 1).

\bullet Vector transport: $\mathcal{T}^{\prime}_{\boldsymbol{\Sigma}_1,\boldsymbol{\Sigma}_2}(\boldsymbol{\xi}^{\prime})\overset{(7)}{=}\boldsymbol{\Sigma}_2^{-\frac{1}{2}}\mathcal{T}_{\boldsymbol{\Sigma}_1,\boldsymbol{\Sigma}_2}(\boldsymbol{\xi})\boldsymbol{\Sigma}_2^{-\frac{1}{2}}\overset{(a)}{=}\boldsymbol{\Sigma}_2^{-\frac{1}{2}}\big(\boldsymbol{\Sigma}_2^{\frac{1}{2}}\boldsymbol{\Sigma}_1^{-\frac{1}{2}}\boldsymbol{\xi}\boldsymbol{\Sigma}_1^{-\frac{1}{2}}\boldsymbol{\Sigma}_2^{\frac{1}{2}}\big)\boldsymbol{\Sigma}_2^{-\frac{1}{2}}=\boldsymbol{\Sigma}_1^{-\frac{1}{2}}\boldsymbol{\xi}\boldsymbol{\Sigma}_1^{-\frac{1}{2}}\overset{(7)}{=}\boldsymbol{\xi}^{\prime}$, where $(a)$ is due to the definition of vector transport on an SPD manifold (see Table 1).

\bullet Exponential map: $\text{Exp}_{\boldsymbol{\Sigma}}(\boldsymbol{\xi}^{\prime})\overset{(a)}{=}\boldsymbol{\Sigma}\,\text{exp}(\boldsymbol{\Sigma}^{-1}\boldsymbol{\xi})\overset{(b)}{=}\boldsymbol{\Sigma}^{\frac{1}{2}}\,\text{exp}(\boldsymbol{\Sigma}^{-\frac{1}{2}}\boldsymbol{\xi}\boldsymbol{\Sigma}^{-\frac{1}{2}})\,\boldsymbol{\Sigma}^{\frac{1}{2}}\overset{(7)}{=}\boldsymbol{\Sigma}^{\frac{1}{2}}\,\text{exp}(\boldsymbol{\xi}^{\prime})\,\boldsymbol{\Sigma}^{\frac{1}{2}}$, where $(a)$ is shown in (Sra and Hosseini, 2015, Eq. 3.3) and $(b)$ is shown in (Sra and Hosseini, 2015, Eq. 3.2). Also see Table 1.

\bullet Retraction: $\text{Ret}_{\boldsymbol{\Sigma}}(\boldsymbol{\xi}^{\prime})\overset{(a)}{=}\boldsymbol{\Sigma}+\boldsymbol{\xi}+\frac{1}{2}\boldsymbol{\xi}\boldsymbol{\Sigma}^{-1}\boldsymbol{\xi}\overset{(7)}{=}\boldsymbol{\Sigma}+\boldsymbol{\Sigma}^{\frac{1}{2}}\boldsymbol{\xi}^{\prime}\boldsymbol{\Sigma}^{\frac{1}{2}}+\frac{1}{2}\boldsymbol{\Sigma}^{\frac{1}{2}}\boldsymbol{\xi}^{\prime}\big(\boldsymbol{\Sigma}^{\frac{1}{2}}\boldsymbol{\Sigma}^{-1}\boldsymbol{\Sigma}^{\frac{1}{2}}\big)\boldsymbol{\xi}^{\prime}\boldsymbol{\Sigma}^{\frac{1}{2}}=\boldsymbol{\Sigma}+\boldsymbol{\Sigma}^{\frac{1}{2}}\boldsymbol{\xi}^{\prime}\boldsymbol{\Sigma}^{\frac{1}{2}}+\frac{1}{2}\boldsymbol{\Sigma}^{\frac{1}{2}}\boldsymbol{\xi}^{\prime\,2}\boldsymbol{\Sigma}^{\frac{1}{2}}$, where $(a)$ is because of the approximation of the Euclidean retraction, using the second-order Taylor expansion, on SPD manifolds (see Table 1).
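As a concrete illustration (not the authors' released code), the following minimal MATLAB sketch implements the operators of Table 1 under the inverse-second-root mapping (7) as anonymous functions; Sigma and the Euclidean gradient gradE are assumed to be given.

```matlab
% Minimal sketch of the ISR-mapped operators of Table 1 (mapping (7)).
Shalf  = sqrtm(Sigma);                     % Sigma^{1/2}
Sihalf = inv(Shalf);                       % Sigma^{-1/2}

map    = @(xi)  Sihalf * xi * Sihalf;      % xi' = Sigma^{-1/2} xi Sigma^{-1/2}
unmap  = @(xip) Shalf  * xip * Shalf;      % xi  = Sigma^{1/2} xi' Sigma^{1/2}

metric = @(xip, etap) trace(xip * etap);   % mapped metric = Euclidean inner product
gradp  = @(gradE) 0.5 * Shalf * (gradE + gradE') * Shalf;   % mapped Riemannian gradient
expmap = @(xip) Shalf * expm(xip) * Shalf;                  % mapped exponential map
retr   = @(xip) Sigma + Shalf * xip * Shalf ...
              + 0.5 * Shalf * (xip * xip) * Shalf;          % mapped 2nd-order retraction
% The mapped vector transport is simply the identity: T'(xip) = xip.
```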

4 Vector Transport Free Riemannian LBFGS Using Mapping by Cholesky Decomposition

Our second proposed mapping uses the Cholesky decomposition, which is computationally very efficient. Consider the Cholesky decomposition of a point $\boldsymbol{\Sigma}\in\mathcal{M}$ (Golub and Van Loan, 2013):

$\boldsymbol{0}\prec\boldsymbol{\Sigma}=\boldsymbol{L}\boldsymbol{L}^{\top}\implies\boldsymbol{\Sigma}^{-1}=\boldsymbol{L}^{-\top}\boldsymbol{L}^{-1}$, (8)

where $\boldsymbol{L}\in\mathbb{R}^{n\times n}$ is the lower-triangular matrix in the Cholesky decomposition. It is noteworthy that many of MATLAB's matrix operations (e.g., solving linear systems with symmetric positive definite coefficient matrices), which the Manopt toolbox (Boumal et al., 2014) also uses, apply the Cholesky decomposition internally due to its efficiency.

In the tangent space of every point $\boldsymbol{\Sigma}\in\mathcal{M}$, the proposed mapping is:

$\boldsymbol{\xi}^{\prime}:=\boldsymbol{L}^{-1}\,\boldsymbol{\xi}\,\boldsymbol{L}^{-\top}\implies\boldsymbol{\xi}=\boldsymbol{L}\,\boldsymbol{\xi}^{\prime}\,\boldsymbol{L}^{\top}$, (9)

where $\boldsymbol{\xi},\boldsymbol{\xi}^{\prime}\in T_{\boldsymbol{\Sigma}}\mathcal{M}=\mathbb{S}^n$. Note that, as with the previous mapping, the tangent matrix remains symmetric under this mapping; hence, it stays in the tangent space of the SPD manifold (Bhatia, 2009). Similarly, under the second proposed mapping (9), the Riemannian operators on the SPD manifold are simplified. These operators can be found in Table 1. In the following, we provide proofs for these operators.

Proposition 4.1.

After mapping (9), we have:

  • Metric: the metric on an SPD manifold reduces to the Euclidean inner product, i.e., $g_{\boldsymbol{\Sigma}}(\boldsymbol{\xi}^{\prime},\boldsymbol{\eta}^{\prime})=\textbf{tr}(\boldsymbol{\xi}^{\prime}\boldsymbol{\eta}^{\prime})$.

  • Gradient: the gradient on an SPD manifold becomes $\nabla^{\prime}f(\boldsymbol{\Sigma})=\frac{1}{2}\boldsymbol{L}^{\top}\big(\nabla_E f(\boldsymbol{\Sigma})+(\nabla_E f(\boldsymbol{\Sigma}))^{\top}\big)\boldsymbol{L}$, where $\nabla_E f(\boldsymbol{\Sigma})$ denotes the Euclidean gradient.

  • Vector transport: the vector transport becomes the identity, i.e., $\mathcal{T}^{\prime}_{\boldsymbol{\Sigma}_1,\boldsymbol{\Sigma}_2}(\boldsymbol{\xi}^{\prime})=\boldsymbol{\xi}^{\prime}$; hence, the optimization becomes vector transport free.

  • Exponential map: the exponential map becomes $\text{Exp}_{\boldsymbol{\Sigma}}(\boldsymbol{\xi}^{\prime})=\boldsymbol{\Sigma}\,\text{exp}(\boldsymbol{L}^{-\top}\boldsymbol{\xi}^{\prime}\boldsymbol{L}^{\top})$.

  • Adjoint vector transport: the adjoint vector transport satisfies $g^{\prime}_{\boldsymbol{\Sigma}_1}(\boldsymbol{\xi}^{\prime},\mathcal{T}^{\prime*}_{\boldsymbol{\Sigma}_1,\boldsymbol{\Sigma}_2}\boldsymbol{\eta}^{\prime})=g^{\prime}_{\boldsymbol{\Sigma}_2}(\mathcal{T}^{\prime}_{\boldsymbol{\Sigma}_1,\boldsymbol{\Sigma}_2}\boldsymbol{\xi}^{\prime},\boldsymbol{\eta}^{\prime})$, $\forall\boldsymbol{\xi}^{\prime}\in T_{\boldsymbol{\Sigma}_1}\mathcal{M},\forall\boldsymbol{\eta}^{\prime}\in T_{\boldsymbol{\Sigma}_2}\mathcal{M}$, as in Proposition 3.1; hence it is also the identity.

  • Retraction: the approximation of the Euclidean retraction by the second-order Taylor expansion on an SPD manifold becomes $\text{Ret}_{\boldsymbol{\Sigma}}(\boldsymbol{\xi}^{\prime})=\boldsymbol{\Sigma}+\boldsymbol{L}\,\boldsymbol{\xi}^{\prime}\boldsymbol{L}^{\top}+\frac{1}{2}\boldsymbol{L}\,\boldsymbol{\xi}^{\prime\,2}\boldsymbol{L}^{\top}$.

Proof 4.2.

\bullet Metric: $g_{\boldsymbol{\Sigma}}(\boldsymbol{\xi}^{\prime},\boldsymbol{\eta}^{\prime})\overset{(a)}{=}\textbf{tr}(\boldsymbol{\Sigma}^{-1}\boldsymbol{\xi}\boldsymbol{\Sigma}^{-1}\boldsymbol{\eta})=\textbf{tr}(\boldsymbol{L}^{-\top}\boldsymbol{L}^{-1}\boldsymbol{\xi}\boldsymbol{L}^{-\top}\boldsymbol{L}^{-1}\boldsymbol{\eta})\overset{(b)}{=}\textbf{tr}\big((\boldsymbol{L}^{-1}\boldsymbol{\xi}\boldsymbol{L}^{-\top})(\boldsymbol{L}^{-1}\boldsymbol{\eta}\boldsymbol{L}^{-\top})\big)\overset{(9)}{=}\textbf{tr}(\boldsymbol{\xi}^{\prime}\boldsymbol{\eta}^{\prime})$, where $(a)$ is because of the definition of the metric on SPD manifolds (see Table 1) and $(b)$ is thanks to the cyclic property of trace.

\bullet Gradient: $\nabla^{\prime}f(\boldsymbol{\Sigma})\overset{(9)}{=}\boldsymbol{L}^{-1}\nabla f(\boldsymbol{\Sigma})\boldsymbol{L}^{-\top}\overset{(a)}{=}\frac{1}{2}\boldsymbol{L}^{-1}\boldsymbol{\Sigma}\big(\nabla_E f(\boldsymbol{\Sigma})+(\nabla_E f(\boldsymbol{\Sigma}))^{\top}\big)\boldsymbol{\Sigma}\boldsymbol{L}^{-\top}\overset{(8)}{=}\frac{1}{2}\boldsymbol{L}^{-1}\boldsymbol{L}\boldsymbol{L}^{\top}\big(\nabla_E f(\boldsymbol{\Sigma})+(\nabla_E f(\boldsymbol{\Sigma}))^{\top}\big)\boldsymbol{L}\boldsymbol{L}^{\top}\boldsymbol{L}^{-\top}=\frac{1}{2}\boldsymbol{L}^{\top}\big(\nabla_E f(\boldsymbol{\Sigma})+(\nabla_E f(\boldsymbol{\Sigma}))^{\top}\big)\boldsymbol{L}$, where $(a)$ is due to the definition of the gradient on an SPD manifold (see Table 1).

\bullet Vector transport: $\mathcal{T}^{\prime}_{\boldsymbol{\Sigma}_1,\boldsymbol{\Sigma}_2}(\boldsymbol{\xi}^{\prime})\overset{(9)}{=}\boldsymbol{L}_2^{-1}\mathcal{T}_{\boldsymbol{\Sigma}_1,\boldsymbol{\Sigma}_2}(\boldsymbol{\xi})\boldsymbol{L}_2^{-\top}\overset{(a)}{=}\boldsymbol{L}_2^{-1}\big(\boldsymbol{L}_2\boldsymbol{L}_1^{-1}\boldsymbol{\xi}\boldsymbol{L}_1^{-\top}\boldsymbol{L}_2^{\top}\big)\boldsymbol{L}_2^{-\top}=\boldsymbol{L}_1^{-1}\boldsymbol{\xi}\boldsymbol{L}_1^{-\top}\overset{(9)}{=}\boldsymbol{\xi}^{\prime}$, where $(a)$ is due to the definition of vector transport on an SPD manifold (see Table 1).

\bullet Exponential map: the exponential map is changed to $\text{Exp}_{\boldsymbol{\Sigma}}(\boldsymbol{\xi}^{\prime})\overset{(a)}{=}\boldsymbol{\Sigma}\,\text{exp}(\boldsymbol{\Sigma}^{-1}\boldsymbol{\xi})\overset{(b)}{=}\boldsymbol{\Sigma}\,\text{exp}(\boldsymbol{L}^{-\top}\boldsymbol{L}^{-1}\boldsymbol{L}\,\boldsymbol{\xi}^{\prime}\boldsymbol{L}^{\top})=\boldsymbol{\Sigma}\,\text{exp}(\boldsymbol{L}^{-\top}\boldsymbol{\xi}^{\prime}\boldsymbol{L}^{\top})$, where $(a)$ is shown in (Sra and Hosseini, 2015, Eq. 3.3) and $(b)$ is because of Eqs. (8) and (9).

\bullet Retraction: $\text{Ret}_{\boldsymbol{\Sigma}}(\boldsymbol{\xi}^{\prime})\overset{(a)}{=}\boldsymbol{\Sigma}+\boldsymbol{\xi}+\frac{1}{2}\boldsymbol{\xi}\boldsymbol{\Sigma}^{-1}\boldsymbol{\xi}\overset{(b)}{=}\boldsymbol{\Sigma}+\boldsymbol{L}\,\boldsymbol{\xi}^{\prime}\boldsymbol{L}^{\top}+\frac{1}{2}\boldsymbol{L}\,\boldsymbol{\xi}^{\prime}\boldsymbol{L}^{\top}\boldsymbol{L}^{-\top}\boldsymbol{L}^{-1}\boldsymbol{L}\,\boldsymbol{\xi}^{\prime}\boldsymbol{L}^{\top}=\boldsymbol{\Sigma}+\boldsymbol{L}\,\boldsymbol{\xi}^{\prime}\boldsymbol{L}^{\top}+\frac{1}{2}\boldsymbol{L}\,\boldsymbol{\xi}^{\prime\,2}\boldsymbol{L}^{\top}$, where $(a)$ is because of the approximation of the Euclidean retraction, using the second-order Taylor expansion, on SPD manifolds (see Table 1), and $(b)$ is because of Eqs. (8) and (9).

Noticing Eq. (9), the approximation of the retraction under the mapping by Cholesky decomposition (see Proposition 4.1) can be restated as $\text{Ret}_{\boldsymbol{\Sigma}}(\boldsymbol{\xi}^{\prime})=0.5\,\boldsymbol{\Sigma}+0.5\,\boldsymbol{L}(\boldsymbol{I}+\boldsymbol{\xi}^{\prime})^2\boldsymbol{L}^{\top}$. Defining $\boldsymbol{\Psi}:=\boldsymbol{L}(\boldsymbol{I}+\boldsymbol{\xi}^{\prime})$ restates this retraction as $\text{Ret}_{\boldsymbol{\Sigma}}(\boldsymbol{\xi}^{\prime})=0.5\,\boldsymbol{\Sigma}+0.5\,\boldsymbol{\Psi}\boldsymbol{\Psi}^{\top}$ because $\boldsymbol{\xi}^{\prime}$ is symmetric. The term $\boldsymbol{\Psi}\boldsymbol{\Psi}^{\top}$ is very efficient and fast to compute because it is symmetric.
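Similarly, a minimal MATLAB sketch of the Cholesky-mapped operators of Table 1, including this efficient form of the retraction, is given below; this is again our own illustration under the assumption that Sigma and the Euclidean gradient gradE are given, not the authors' released code.

```matlab
% Minimal sketch of the Cholesky-mapped operators of Table 1 (mapping (9)).
L = chol(Sigma, 'lower');                  % Sigma = L * L'

map    = @(xi)  L \ xi / L';               % xi' = L^{-1} xi L^{-T}
unmap  = @(xip) L * xip * L';              % xi  = L xi' L'

metric = @(xip, etap) trace(xip * etap);   % mapped metric = Euclidean inner product
gradp  = @(gradE) 0.5 * L' * (gradE + gradE') * L;          % mapped Riemannian gradient
expmap = @(xip) Sigma * expm((L' \ xip) * L');              % mapped exponential map
Psi    = @(xip) L * (eye(size(L, 1)) + xip);
retr   = @(xip) 0.5 * Sigma + 0.5 * (Psi(xip) * Psi(xip)'); % efficient mapped retraction
% The mapped vector transport is the identity: T'(xip) = xip.
```

Computing a Cholesky factor is considerably cheaper than the matrix square root required by mapping (7), which is one reason the Cholesky mapping is attractive computationally.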

5 Analytical Discussion and Complexity Analysis

Corollary 5.1.

Propositions 3.1 and 4.1 show that under mapping (7) or (9), both the vector transport and the adjoint vector transport are the identity. As these transports become the identity, they are isometric, because inner products of tangent vectors do not change under them; in particular, they preserve the lengths of tangent vectors under the proposed mappings.

Propositions 3.1 and 4.1 and Corollary 5.1 show that the two proposed mappings simplify the vector transport and the adjoint vector transport to the isometric identity and reduce the Riemannian metric to the Euclidean inner product. These reductions and simplifications decrease the computations significantly during optimization on the manifold. The VTF-RLBFGS algorithm using either of the proposed mappings is shown in Algorithm 1. As the algorithm shows, at every new point $\boldsymbol{\Sigma}_k$, all quantities are computed in the mapped representation because the manifold operators, calculated as in Table 1, are defined in that representation. This simplifies operators such as the metric and removes vector transports from RLBFGS. In case the Riemannian gradient is given directly by the user to RLBFGS, this gradient, which lies in the tangent space, should be mapped explicitly by Eq. (7) or (9) at every iteration (see the sketch after Algorithm 1). However, if the Riemannian gradient is calculated from the Euclidean gradient, it should not be mapped explicitly, since it is already in the mapped representation implicitly through the operators of Table 1.

Input: Initial point $\boldsymbol{\Sigma}_0$
$\boldsymbol{H}_0:=\frac{1}{\sqrt{g^{\prime}_{\boldsymbol{\Sigma}_0}(\nabla^{\prime}f(\boldsymbol{\Sigma}_0),\nabla^{\prime}f(\boldsymbol{\Sigma}_0))}}\boldsymbol{I}$
for $k=0,1,\dots$ do
       Compute $\nabla^{\prime}f(\boldsymbol{\Sigma}_k)$ from the Euclidean gradient by one of the mappings in Table 1
       $\boldsymbol{\xi}^{\prime}_k:=\text{GetDirection}(-\nabla^{\prime}f(\boldsymbol{\Sigma}_k),k)$
       $\alpha_k:=$ linesearch with the Wolfe conditions
       $\boldsymbol{\Sigma}_{k+1}:=\text{Exp}_{\boldsymbol{\Sigma}_k}(\alpha_k\boldsymbol{\xi}^{\prime}_k)\ \text{or}\ \text{Ret}_{\boldsymbol{\Sigma}_k}(\alpha_k\boldsymbol{\xi}^{\prime}_k)$
       $\boldsymbol{s}^{\prime}_{k+1}:=\alpha_k\boldsymbol{\xi}^{\prime}_k$
       $\boldsymbol{y}^{\prime}_{k+1}:=\nabla^{\prime}f(\boldsymbol{\Sigma}_{k+1})-\nabla^{\prime}f(\boldsymbol{\Sigma}_k)$
       $\boldsymbol{H}_{k+1}:=\frac{g^{\prime}_{\boldsymbol{\Sigma}_{k+1}}(\boldsymbol{s}^{\prime}_{k+1},\boldsymbol{y}^{\prime}_{k+1})}{g^{\prime}_{\boldsymbol{\Sigma}_{k+1}}(\boldsymbol{y}^{\prime}_{k+1},\boldsymbol{y}^{\prime}_{k+1})}$
       Store $\boldsymbol{y}^{\prime}_{k+1}$, $\boldsymbol{s}^{\prime}_{k+1}$, $g^{\prime}_{\boldsymbol{\Sigma}_{k+1}}(\boldsymbol{s}^{\prime}_{k+1},\boldsymbol{y}^{\prime}_{k+1})$, $g^{\prime}_{\boldsymbol{\Sigma}_{k+1}}(\boldsymbol{s}^{\prime}_{k+1},\boldsymbol{s}^{\prime}_{k+1})$, and $\boldsymbol{H}_{k+1}$
end for
return $\boldsymbol{\Sigma}_{k+1}$

Function $\text{GetDirection}(\boldsymbol{p}^{\prime},k)$
if $k>0$ then
       $\rho_k:=\frac{1}{g^{\prime}_{\boldsymbol{\Sigma}_k}(\boldsymbol{y}^{\prime}_k,\boldsymbol{s}^{\prime}_k)}$
       $\boldsymbol{\tilde{p}}^{\prime}:=\boldsymbol{p}^{\prime}-\rho_k\,g^{\prime}_{\boldsymbol{\Sigma}_k}(\boldsymbol{s}^{\prime}_k,\boldsymbol{p}^{\prime})\boldsymbol{y}^{\prime}_k$
       $\boldsymbol{\widehat{p}}^{\prime}:=\text{GetDirection}(\boldsymbol{\tilde{p}}^{\prime},k-1)$
       return $\boldsymbol{\widehat{p}}^{\prime}-\rho_k\,g^{\prime}_{\boldsymbol{\Sigma}_k}(\boldsymbol{y}^{\prime}_k,\boldsymbol{\widehat{p}}^{\prime})\boldsymbol{s}^{\prime}_k+\rho_k\,g^{\prime}_{\boldsymbol{\Sigma}_k}(\boldsymbol{s}^{\prime}_k,\boldsymbol{s}^{\prime}_k)\boldsymbol{p}^{\prime}$
else
       return $\boldsymbol{H}_0\,\boldsymbol{p}^{\prime}$
end if
Algorithm 1: The VTF-RLBFGS algorithm
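As noted before Algorithm 1, a user-supplied Riemannian gradient must be mapped explicitly into the primed representation at every iteration. A minimal MATLAB sketch of this explicit mapping under (7) and (9) (our own illustration, with G denoting the given Riemannian gradient at Sigma):

```matlab
% Explicitly map a user-supplied Riemannian gradient G (a tangent vector
% at Sigma) into the primed representation of mapping (7) or (9).
Sihalf = inv(sqrtm(Sigma));
G_isr  = Sihalf * G * Sihalf;   % mapping (7): Sigma^{-1/2} G Sigma^{-1/2}

L      = chol(Sigma, 'lower');
G_chol = L \ G / L';            % mapping (9): L^{-1} G L^{-T}
```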
Lemma 5.2.

Vector transport is valid under both proposed mappings (7) and (9) because they preserve the properties of vector transport.

Proof 5.3.

A valid vector transport should satisfy three properties (Hosseini and Sra, 2020) (also see (Absil et al., 2009, Definition 8.1.1) and (Boumal, 2020, Definition 10.62)):

(1) The property $\exists\boldsymbol{v}\in T_{\boldsymbol{\Sigma}_1}\mathcal{M}:\mathcal{T}_{\boldsymbol{\Sigma}_1,\boldsymbol{\Sigma}_2}(\boldsymbol{\xi})\in T_{\text{Ret}_{\boldsymbol{\Sigma}_1}(\boldsymbol{v})}\mathcal{M},\ \forall\boldsymbol{\xi}\in T_{\boldsymbol{\Sigma}_1}\mathcal{M}$ holds because the tangent space $T_{\text{Ret}_{\boldsymbol{\Sigma}_1}(\boldsymbol{v})}\mathcal{M}$ is isomorphic to the space of symmetric matrices and the identity vector transport returns the tangent vector itself, which is a symmetric matrix.

(2) The vector transport of a tangent vector from one point to itself should be the same tangent vector. This holds because the vector transport is the identity under the proposed mappings: $\mathcal{T}_{\boldsymbol{\Sigma},\boldsymbol{\Sigma}}(\boldsymbol{\xi})=\boldsymbol{\xi},\ \forall\boldsymbol{\xi}\in T_{\boldsymbol{\Sigma}}\mathcal{M}$.

(3) The vector transport $\mathcal{T}_{\boldsymbol{\Sigma}_1,\boldsymbol{\Sigma}_2}(\boldsymbol{\xi})$ should be linear, which holds because it equals $\boldsymbol{\xi}$ under the proposed mappings.

Since, after applying the mappings, the vector transport satisfies the three properties above, it is a valid transport. Q.E.D.

Proposition 5.4.

The time complexity of the recursion part of RLBFGS is improved from $\Theta(mn^3)$ to $\Theta(mn^2)$ after mapping (7) or (9), where $m$ is the memory limit, i.e., the maximum number of recursions in RLBFGS (the proof is available in the Supplementary Material). This time improvement shows up most in problems where the computation of the gradients in Eq. (6), or of line $\boldsymbol{y}^{\prime}_{k+1}$ in Algorithm 1, does not dominate the complexity.

6 Simulations

In this section, we evaluate the effectiveness of the proposed mappings (7) and (9) in the tangent space. We show that these mappings often improve the performance and speed of RLBFGS. The code of this article is available at https://github.com/bghojogh/LBFGS-Vector-Transport-Free. In our reports, we denote the proposed Vector Transport Free (VTF) RLBFGS by VTF-RLBFGS, where ISR and Cholesky (or Chol.) stand for VTF-RLBFGS using the mapping by inverse second root and by Cholesky decomposition, respectively. For RLBFGS, with and without the proposed mappings, we use both Wolfe conditions in linesearch. The experiments were implemented in MATLAB and run on an Intel Core-i7 CPU with a base frequency of 2.20 GHz and 12 GB of RAM. For every experiment, we performed optimization for ten runs and report the average performance over the runs. We evaluated our mappings on various application problems, explained below.

6.1 Gaussian Mixture Model

\bullet Formulation: One optimization problem we selected for evaluation is the Riemannian optimization for the Gaussian Mixture Model (GMM) without the use of expectation maximization. We employ RLBFGS, with and without the proposed mappings, for fitting the GMM problem, whose algorithm can be found in (Hosseini and Sra, 2020; Hosseini and Mash'al, 2015). This is a suitable problem for evaluating the proposed mappings because the covariance matrices are SPD (Bhatia, 2009). For this, we minimize the negative log-likelihood of the GMM where the covariance matrices are constrained to belong to the SPD matrix manifold (Hosseini and Sra, 2020). For an $n$-dimensional GMM, the optimization problem is:

$\underset{\{\alpha_j,\,\boldsymbol{\mu}_j,\,\boldsymbol{\Sigma}_j\}_{j=1}^K}{\text{minimize}}\ \ -\sum_{i=1}^N\log\Big(\sum_{j=1}^K\alpha_j\,\mathcal{N}(\boldsymbol{x}_i;\boldsymbol{\mu}_j,\boldsymbol{\Sigma}_j)\Big)$, (10)
subject to $\boldsymbol{\Sigma}_j\in\mathcal{M}=\mathbb{S}_{++}^n,\quad\forall j\in\{1,\dots,K\}$,

where $N$ denotes the sample size, $K$ denotes the number of components in the mixture model, and $\alpha_j$, $\boldsymbol{\mu}_j$, and $\boldsymbol{\Sigma}_j$ are the mixing probability, mean, and covariance of the $j$-th component, respectively. We use the same reformulation trick as (Hosseini and Sra, 2020) to reformulate the cost function of (10). The mixture parameters were initialized using K-means++ (Arthur and Vassilvitskii, 2007), following (Hosseini and Sra, 2020). Three different levels of separation of the Gaussian components, namely low, mid, and high, were used. The reader can refer to (Hosseini and Sra, 2020) for the mathematical details of these separation levels.
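For reference, a minimal MATLAB sketch of evaluating the negative log-likelihood cost in (10) for given mixture parameters is shown below; this is a plain log-sum-exp evaluation meant only to make the objective concrete, not the reformulated cost of (Hosseini and Sra, 2020), and the function and variable names are ours.

```matlab
% Negative log-likelihood of a GMM, Eq. (10).
% X is n x N (columns are samples), alphas is 1 x K, mus is n x K,
% Sigmas is n x n x K.
function cost = gmm_nll(X, alphas, mus, Sigmas)
    [n, N] = size(X);
    K = numel(alphas);
    logp = zeros(K, N);
    for j = 1:K
        Lj = chol(Sigmas(:, :, j), 'lower');
        Z  = Lj \ (X - mus(:, j));                          % whitened residuals
        logp(j, :) = log(alphas(j)) - 0.5 * sum(Z.^2, 1) ...
                   - sum(log(diag(Lj))) - 0.5 * n * log(2 * pi);
    end
    m = max(logp, [], 1);                                   % log-sum-exp for stability
    cost = -sum(m + log(sum(exp(logp - m), 1)));
end
```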

Table 2: Comparison of average results over ten runs where the exponential map is used in the algorithms and $K\in\{2,5\}$, $n\in\{2,10\}$, $N=10n^2\in\{40,1000\}$. Here, #iters, conv., iter., diff., and std are short for number of iterations, convergence, iteration, difference, and standard deviation, respectively.

K | n | Separation | Algorithm | #iters | conv. time | time diff. std | iter. time | last cost
2 | 2 | Low | VTF (ISR) | 53.100 ± 18.248 | 68.380 ± 52.834 | 13.405 | 1.140 ± 0.504 | 0.364 ± 0.444
2 | 2 | Low | VTF (Chol.) | 52.100 ± 17.866 | 61.096 ± 48.433 | 17.443 | 1.030 ± 0.463 | 0.364 ± 0.444
2 | 2 | Low | RLBFGS | 52.500 ± 16.595 | 65.890 ± 49.208 | | 1.125 ± 0.458 | 0.364 ± 0.444
2 | 2 | Mid | VTF (ISR) | 56.400 ± 21.813 | 76.124 ± 69.189 | 10.712 | 1.150 ± 0.574 | 0.657 ± 0.344
2 | 2 | Mid | VTF (Chol.) | 54.400 ± 19.156 | 68.456 ± 64.290 | 48.016 | 1.099 ± 0.504 | 0.638 ± 0.333
2 | 2 | Mid | RLBFGS | 57.700 ± 19.844 | 81.957 ± 64.463 | | 1.250 ± 0.550 | 0.657 ± 0.344
2 | 2 | High | VTF (ISR) | 25.500 ± 3.064 | 9.518 ± 2.922 | 0.333 | 0.366 ± 0.067 | 0.341 ± 0.371
2 | 2 | High | VTF (Chol.) | 26.000 ± 4.738 | 10.254 ± 5.468 | 3.132 | 0.377 ± 0.108 | 0.341 ± 0.371
2 | 2 | High | RLBFGS | 25.500 ± 3.064 | 10.054 ± 3.141 | | 0.386 ± 0.073 | 0.341 ± 0.371
2 | 10 | Low | VTF (ISR) | 77.600 ± 54.175 | 235.615 ± 466.119 | 12.040 | 1.936 ± 1.751 | 4.210 ± 0.889
2 | 10 | Low | VTF (Chol.) | 84.100 ± 73.260 | 313.398 ± 714.908 | 257.528 | 2.041 ± 2.150 | 4.209 ± 0.889
2 | 10 | Low | RLBFGS | 77.600 ± 52.754 | 242.743 ± 458.641 | | 2.069 ± 1.733 | 4.209 ± 0.889
2 | 10 | Mid | VTF (ISR) | 44.600 ± 7.260 | 43.654 ± 16.872 | 4.139 | 0.948 ± 0.208 | 4.262 ± 1.098
2 | 10 | Mid | VTF (Chol.) | 45.900 ± 8.647 | 45.661 ± 20.088 | 5.615 | 0.955 ± 0.236 | 4.262 ± 1.098
2 | 10 | Mid | RLBFGS | 45.200 ± 8.080 | 48.104 ± 19.981 | | 1.025 ± 0.243 | 4.262 ± 1.098
2 | 10 | High | VTF (ISR) | 44.400 ± 9.252 | 43.684 ± 22.460 | 11.746 | 0.936 ± 0.254 | 3.874 ± 1.395
2 | 10 | High | VTF (Chol.) | 47.100 ± 8.333 | 48.278 ± 21.112 | 8.992 | 0.987 ± 0.240 | 3.874 ± 1.395
2 | 10 | High | RLBFGS | 43.300 ± 7.150 | 43.904 ± 17.626 | | 0.981 ± 0.225 | 3.874 ± 1.395
5 | 2 | Low | VTF (ISR) | 152.400 ± 62.819 | 1344.573 ± 999.949 | 649.595 | 7.560 ± 3.406 | 0.260 ± 0.458
5 | 2 | Low | VTF (Chol.) | 174.800 ± 114.139 | 2168.886 ± 3090.820 | 2347.949 | 8.680 ± 6.348 | 0.265 ± 0.466
5 | 2 | Low | RLBFGS | 166.300 ± 76.884 | 1767.676 ± 1406.454 | | 8.788 ± 4.427 | 0.267 ± 0.454
5 | 2 | Mid | VTF (ISR) | 120.700 ± 69.620 | 942.713 ± 1063.075 | 1120.404 | 5.807 ± 3.877 | 0.781 ± 0.207
5 | 2 | Mid | VTF (Chol.) | 121.600 ± 65.030 | 897.110 ± 988.826 | 1315.410 | 5.720 ± 3.445 | 0.781 ± 0.207
5 | 2 | Mid | RLBFGS | 136.300 ± 91.493 | 1404.654 ± 1986.798 | | 7.057 ± 5.378 | 0.764 ± 0.225
5 | 2 | High | VTF (ISR) | 46.400 ± 17.322 | 94.741 ± 105.045 | 22.071 | 1.739 ± 0.902 | 1.805 ± 0.385
5 | 2 | High | VTF (Chol.) | 46.200 ± 19.803 | 98.065 ± 122.950 | 6.436 | 1.729 ± 1.020 | 1.805 ± 0.385
5 | 2 | High | RLBFGS | 46.200 ± 18.937 | 104.934 ± 126.734 | | 1.879 ± 1.064 | 1.805 ± 0.385
5 | 10 | Low | VTF (ISR) | 295.900 ± 86.577 | 5524.495 ± 3173.855 | 2664.627 | 17.299 ± 5.221 | 6.175 ± 0.745
5 | 10 | Low | VTF (Chol.) | 292.300 ± 83.828 | 5294.565 ± 3165.735 | 5106.693 | 16.737 ± 5.350 | 6.197 ± 0.718
5 | 10 | Low | RLBFGS | 318.800 ± 100.216 | 7083.082 ± 4801.718 | | 20.279 ± 6.859 | 6.173 ± 0.744
5 | 10 | Mid | VTF (ISR) | 134.200 ± 54.956 | 1139.332 ± 919.917 | 306.469 | 7.302 ± 3.227 | 6.753 ± 0.708
5 | 10 | Mid | VTF (Chol.) | 133.300 ± 52.415 | 1100.519 ± 842.840 | 296.996 | 7.183 ± 3.045 | 6.753 ± 0.708
5 | 10 | Mid | RLBFGS | 135.300 ± 54.965 | 1268.707 ± 967.678 | | 8.070 ± 3.580 | 6.753 ± 0.708
5 | 10 | High | VTF (ISR) | 68.600 ± 12.367 | 241.487 ± 87.593 | 18.122 | 3.398 ± 0.764 | 6.599 ± 0.836
5 | 10 | High | VTF (Chol.) | 74.000 ± 14.071 | 279.271 ± 105.828 | 40.158 | 3.632 ± 0.838 | 6.599 ± 0.836
5 | 10 | High | RLBFGS | 68.300 ± 11.982 | 258.475 ± 93.959 | | 3.661 ± 0.790 | 6.599 ± 0.836

\bullet Results: We compared the performance of RLBFGS with and without our proposed mappings. The average number of iterations, convergence time, time per iteration, and cost value of the last iteration are reported in Table 2, where the exponential map is used for retraction. Results with the Taylor approximation of the exponential map used as retraction are available in Table 1 of the Supplementary Material. In these experiments, we report performances for $K\in\{2,5\}$, $n\in\{2,10\}$, $N=10n^2\in\{40,1000\}$. For the sake of fairness, all three compared algorithms start from the same initial points in every run. The reader can find more extensive experiments with larger sample size and dimensionality, i.e., $K\in\{2,5\}$, $n\in\{2,10,100\}$, $N=100n^2\in\{400,10000,1000000\}$, in the Supplementary Material. In these tables, we have also provided the standard deviation (std) of the time differences between VTF-RLBFGS and RLBFGS over the ten runs. This value shows how reliable the average convergence-time improvements over RLBFGS are.

\bullet Discussion by Varying Separability: As Table 2 and Table 1 of the Supplementary Material show, the proposed ISR mapping most often converges faster than no mapping at all separability levels. Both its time per iteration and its number of iterations are often smaller than those of no mapping. These tables also show that the proposed Cholesky mapping converges faster than no mapping in most, although not all, cases. Overall, the two proposed mappings most often make RLBFGS faster and more efficient. This speed-up is more noticeable for the low and mid separability levels because they are harder cases for convergence. In terms of the quality of the local minimum, the proposed mappings often find the same local minimum as no mapping, but faster. The equality of the found local optima for the three methods is because they use the same RLBFGS algorithm as their base and start from the same initial points. In some cases, the proposed mappings have even found better local minima. Moreover, as expected, the std of the time difference decreases for higher separability, which is a simpler task.

\bullet Discussion by Varying Dimensionality: We can also discuss the results of Table 2 and Table 1 of the Supplementary Material by varying the dimensionality, $n\in\{2,10\}$. Tables 2 and 3 of the Supplementary Material also report performance for dimensions $n\in\{2,10,100\}$ and larger sample sizes. As expected, increasing the dimensionality increases the convergence time, and the faster pace of the proposed mappings compared to no mapping becomes more noticeable.

\bullet Discussion by Varying the Number of Components: The results of Table 2 and Table 1 of the Supplementary Material can also be interpreted by varying the number of mixture components, $K\in\{2,5\}$. A larger number of components makes the optimization problem harder. Hence, the speed difference between the proposed mappings and no mapping is more noticeable for $K=5$; although, for both $K$ values, the proposed mappings are often faster than no mapping.

\bullet Discussion by Varying Retraction Type: We can also compare the performance of the algorithms in terms of the type of retraction. Table 2 and Table 1 of the Supplementary Material report performances where the exponential map and the Taylor approximation of the exponential map are used for retraction, respectively. Comparing these tables shows that using the Taylor approximation usually converges with fewer iterations than using the exponential map. This makes sense because the exponential map requires moving along geodesics. This difference is more obvious for a larger number of components, i.e., $K=5$. Our two proposed mappings outperform no mapping for both types of retraction. This shows that our mappings are effective regardless of the details of the operators.

\bullet Discussion by Varying the Sample Size: Table 2 and Table 1 of the Supplementary Material report results for $n\in\{2,10\}$, $N=10n^2\in\{40,1000\}$. More experiments with larger sample size and dimensionality, i.e., $n\in\{2,10,100\}$, $N=100n^2\in\{400,10000,1000000\}$, can be found in Tables 2 and 3 of the Supplementary Material. Comparing those tables with Table 2 and Table 1 of the Supplementary Material shows that a larger sample size and/or dimensionality takes more time to converge, as expected. Still, our proposed mappings often converge faster than no mapping. The speed difference is generally smaller for larger sample sizes, because with a very large sample size most of the time is spent evaluating the cost function, leaving the speed difference less room to show.

Table 3: Comparison of average results over ten runs with various algorithms where the exponential map is used in the algorithms and $K\in\{2,5\}$, $n=2$, $N=1000n^2=4000$.

Algorithm | Low sep. #iters ($K=2$) | Low sep. #iters ($K=5$) | Mid sep. #iters ($K=2$) | Mid sep. #iters ($K=5$) | High sep. #iters ($K=2$) | High sep. #iters ($K=5$)
VTF (ISR) | 51.500 ± 8.885 | 123.400 ± 26.069 | 54.100 ± 20.464 | 106.500 ± 39.328 | 24.500 ± 4.353 | 35.500 ± 6.996
VTF (Chol.) | 54.200 ± 11.263 | 123.000 ± 35.065 | 58.800 ± 24.943 | 107.500 ± 44.490 | 25.100 ± 4.149 | 35.900 ± 7.622
RLBFGS | 52.900 ± 8.825 | 147.400 ± 35.926 | 57.900 ± 29.622 | 110.200 ± 38.064 | 24.400 ± 4.006 | 35.600 ± 7.516
CG | 83.100 ± 22.684 | 142.200 ± 44.236 | 69.500 ± 40.175 | 155.100 ± 48.732 | 28.500 ± 9.192 | 51.600 ± 16.392
EM | 94.400 ± 40.114 | 277.900 ± 142.549 | 118.100 ± 89.794 | 224.000 ± 101.576 | 3.200 ± 0.422 | 3.600 ± 0.966
Figure 1: The cost difference progress for several runs: (a) $K=2$, $n=2$, $N=4000$, low separation, and (b) $K=5$, $n=2$, $N=4000$, mid separation.

\bullet Comparison with Other Algorithms: We compared RLBFGS, with and without the proposed mappings, with some other algorithms, namely nonlinear Conjugate Gradient (CG) and Expectation Maximization (EM), for fitting the GMM. The log-scale cost difference progress of several insightful runs is illustrated in Fig. 1. The average numbers of iterations of the algorithms are compared in Table 3. A complete set of plots of the cost difference progress can be seen in Figs. 1, 2, 3, and 4 of the Supplementary Material. Figs. 1-a and 1-b show that in some cases the ISR mapping is faster than the Cholesky mapping and in other cases the reverse holds. As Fig. 1 and Table 3 show, our proposed mappings outperform CG and EM in terms of the number of iterations.

6.2 Geometric Metric Learning

\bullet Formulation: As another application, we experiment with geometric metric learning. We implemented an iterative version of (Zadeh et al., 2016) whose regularized problem is:

\begin{align*}
\underset{\boldsymbol{W}}{\text{min.}} \quad & f := \sum_{(\boldsymbol{x}_{i},\boldsymbol{x}_{j})\in\mathcal{S}} (\boldsymbol{x}_{i}-\boldsymbol{x}_{j})^{\top}\boldsymbol{W}(\boldsymbol{x}_{i}-\boldsymbol{x}_{j}) + \sum_{(\boldsymbol{x}_{i},\boldsymbol{x}_{j})\in\mathcal{D}} (\boldsymbol{x}_{i}-\boldsymbol{x}_{j})^{\top}\boldsymbol{W}^{-1}(\boldsymbol{x}_{i}-\boldsymbol{x}_{j}) + \frac{1}{2}\|\boldsymbol{W}\|_{F}^{2}, \\
\text{s.t.} \quad & \boldsymbol{W}\in\mathcal{M}=\mathbb{S}_{++}^{n},
\end{align*}

where \mathcal{S} and \mathcal{D} denote the sets of similar and dissimilar points, respectively. We used class labels to randomly sample the points for these sets. This metric learning behaves like a triplet loss in which the intra- and inter-class variances are decreased and increased, respectively, for better discrimination of classes (Ghojogh et al., 2020). The Euclidean gradient of this problem is
\begin{equation*}
\mathbb{R}^{n\times n}\ni\nabla_{E}f(\boldsymbol{W}) = \sum_{(\boldsymbol{x}_{i},\boldsymbol{x}_{j})\in\mathcal{S}}(\boldsymbol{x}_{i}-\boldsymbol{x}_{j})(\boldsymbol{x}_{i}-\boldsymbol{x}_{j})^{\top} - \sum_{(\boldsymbol{x}_{i},\boldsymbol{x}_{j})\in\mathcal{D}}\big(\boldsymbol{W}^{-1}(\boldsymbol{x}_{i}-\boldsymbol{x}_{j})\big)\big(\boldsymbol{W}^{-1}(\boldsymbol{x}_{i}-\boldsymbol{x}_{j})\big)^{\top} + \boldsymbol{W}.
\end{equation*}
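To make the formulation concrete, the following minimal NumPy sketch (ours; function and variable names are hypothetical) evaluates the cost f and the Euclidean gradient above for given sets of similar and dissimilar pairs; a Riemannian solver such as RLBFGS would query these two quantities at every iterate W.

```python
import numpy as np

def metric_learning_cost_grad(W, S_pairs, D_pairs):
    """Cost f(W) and Euclidean gradient of the regularized geometric metric
    learning problem, for SPD W, similar pairs S and dissimilar pairs D."""
    W_inv = np.linalg.inv(W)
    cost = 0.5 * np.linalg.norm(W, 'fro') ** 2       # (1/2) ||W||_F^2
    grad = W.copy()                                  # gradient of the regularizer
    for xi, xj in S_pairs:
        d = xi - xj
        cost += d @ W @ d                            # d^T W d
        grad += np.outer(d, d)                       # + d d^T
    for xi, xj in D_pairs:
        d = xi - xj
        v = W_inv @ d
        cost += d @ v                                # d^T W^{-1} d
        grad -= np.outer(v, v)                       # - (W^{-1} d)(W^{-1} d)^T
    return cost, grad

# Toy usage: same-class pairs go to S, different-class pairs go to D.
rng = np.random.default_rng(0)
class0 = rng.standard_normal((5, 3))
class1 = rng.standard_normal((5, 3)) + 3.0
S = [(class0[0], class0[1]), (class1[0], class1[1])]
D = [(class0[0], class1[0]), (class0[2], class1[3])]
f, g = metric_learning_cost_grad(np.eye(3), S, D)
```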

\bullet Results: We evaluated our mappings with geometric metric learning optimization on three public datasets, i.e., Fisher Iris, USPS digits, and MNIST digits. The average results over ten runs are reported in Table 4. On the Iris data, both mappings converged faster than RLBFGS and reached a better final cost. On the USPS and MNIST data, the Cholesky and ISR mappings, respectively, outperformed RLBFGS without any mapping.

Table 4: Comparison of geometric metric learning by RLBFGS, with and without the proposed mappings, where exponential map is used in algorithms.
Data Algorithm #iters conv. time iter. time last cost
Iris VTF (ISR) 23.500±4.528 2.461±1.174 0.110±0.070 1512.602±347.004
VTF (Chol.) 25.000±5.676 2.474±1.009 0.096±0.028 1594.975±124.087
RLBFGS 25.000±4.714 2.914±1.239 0.113±0.037 1620.207±125.989
USPS VTF (ISR) 14.700±3.234 1.885±1.024 0.133±0.094 14223.215±0.001
VTF (Chol.) 13.100±1.524 1.246±0.217 0.095±0.012 14223.216±0.001
RLBFGS 13.100±2.234 1.307±0.454 0.098±0.025 14223.215±0.001
MNIST VTF (ISR) 11.700±3.335 1.288±0.619 0.108±0.041 6254.283±0.001
VTF (Chol.) 13.100±3.479 1.435±0.793 0.104±0.029 6254.284±0.001
RLBFGS 12.100±3.381 1.266±0.754 0.099±0.024 6254.284±0.001

7 Conclusion and Future Direction

In this paper, we proposed two mappings in the tangent space of SPD manifolds based on the inverse second root and the Cholesky decomposition. The proposed mappings reduce the vector transport and adjoint vector transport, which are used extensively in RLBFGS quasi-Newton optimization, to the identity map. They also reduce the Riemannian metric to the Euclidean inner product, which is computationally more efficient. Simulation results verified the effectiveness of the proposed mappings for two optimization tasks on SPD matrix manifolds. In this work, we focused on mappings for SPD manifolds, which are widely used in machine learning and data science. A possible future direction is to extend the proposed mappings to other well-known Riemannian manifolds, such as the Grassmann and Stiefel manifolds (Edelman et al., 1998), as well as to other Riemannian optimization methods. This paper opens a new research path for such mappings in the tangent space, and we conjecture that such mappings can make numerical Riemannian optimization more efficient.

References

  • Absil et al. (2009) P-A Absil, Robert Mahony, and Rodolphe Sepulchre. Optimization algorithms on matrix manifolds. Princeton University Press, 2009.
  • Armijo (1966) Larry Armijo. Minimization of functions having Lipschitz continuous first partial derivatives. Pacific Journal of mathematics, 16(1):1–3, 1966.
  • Arthur and Vassilvitskii (2007) David Arthur and Sergei Vassilvitskii. k-means++: The advantages of careful seeding. In Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, pages 1027–1035, 2007.
  • Bhatia (2009) Rajendra Bhatia. Positive definite matrices. Princeton University Press, 2009.
  • Boumal (2020) Nicolas Boumal. An introduction to optimization on smooth manifolds. Available online, 2020.
  • Boumal et al. (2014) Nicolas Boumal, Bamdev Mishra, P-A Absil, and Rodolphe Sepulchre. Manopt, a Matlab toolbox for optimization on manifolds. The Journal of Machine Learning Research, 15(1):1455–1459, 2014.
  • Edelman et al. (1998) Alan Edelman, Tomás A Arias, and Steven T Smith. The geometry of algorithms with orthogonality constraints. SIAM journal on Matrix Analysis and Applications, 20(2):303–353, 1998.
  • Fletcher (2013) Roger Fletcher. Practical methods of optimization. John Wiley & Sons, 2013.
  • Ghojogh et al. (2020) Benyamin Ghojogh, Milad Sikaroudi, Sobhan Shafiei, Hamid R Tizhoosh, Fakhri Karray, and Mark Crowley. Fisher discriminant triplet and contrastive losses for training Siamese networks. In International joint conference on neural networks, pages 1–7. IEEE, 2020.
  • Golub and Van Loan (2013) Gene H Golub and Charles F Van Loan. Matrix computations, volume 3. JHU press, 2013.
  • Hestenes and Stiefel (1952) Magnus Rudolph Hestenes and Eduard Stiefel. Methods of conjugate gradients for solving linear systems, volume 49. NBS Washington, DC, 1952.
  • Hosseini and Mash’al (2015) Reshad Hosseini and Mohamadreza Mash’al. Mixest: An estimation toolbox for mixture models. arXiv preprint arXiv:1507.06065, 2015.
  • Hosseini and Sra (2020) Reshad Hosseini and Suvrit Sra. An alternative to EM for Gaussian mixture models: batch and stochastic Riemannian optimization. Mathematical Programming, 181(1):187–223, 2020.
  • Hu et al. (2020) Jiang Hu, Xin Liu, Zai-Wen Wen, and Ya-Xiang Yuan. A brief introduction to manifold optimization. Journal of the Operations Research Society of China, 8(2):199–248, 2020.
  • Huang et al. (2015) Wen Huang, Kyle A Gallivan, and P-A Absil. A Broyden class of quasi-Newton methods for Riemannian optimization. SIAM Journal on Optimization, 25(3):1660–1685, 2015.
  • Huang et al. (2016) Wen Huang, P-A Absil, and Kyle A Gallivan. A Riemannian BFGS method for nonconvex optimization problems. In Numerical Mathematics and Advanced Applications ENUMATH 2015, pages 627–634. Springer, 2016.
  • Jeuris et al. (2012) Ben Jeuris, Raf Vandebril, and Bart Vandereycken. A survey and comparison of contemporary algorithms for computing the matrix geometric mean. Electronic Transactions on Numerical Analysis, 39:379–402, 2012.
  • Ji (2007) Huibo Ji. Optimization approaches on smooth manifolds. PhD thesis, Australian National University, 2007.
  • Li and Fukushima (2001) Dong-Hui Li and Masao Fukushima. On the global convergence of the BFGS method for nonconvex unconstrained optimization problems. SIAM Journal on Optimization, 11(4):1054–1064, 2001.
  • Liu and Nocedal (1989) Dong C Liu and Jorge Nocedal. On the limited memory BFGS method for large scale optimization. Mathematical programming, 45(1):503–528, 1989.
  • Nocedal (1980) Jorge Nocedal. Updating quasi-Newton matrices with limited storage. Mathematics of computation, 35(151):773–782, 1980.
  • Nocedal and Wright (2006) Jorge Nocedal and Stephen Wright. Numerical optimization. Springer Science & Business Media, 2006.
  • Qi et al. (2010) Chunhong Qi, Kyle A Gallivan, and P-A Absil. Riemannian BFGS algorithm with applications. In Recent advances in optimization and its applications in engineering, pages 183–192. Springer, 2010.
  • Ring and Wirth (2012) Wolfgang Ring and Benedikt Wirth. Optimization methods on Riemannian manifolds and their application to shape space. SIAM Journal on Optimization, 22(2):596–627, 2012.
  • Seibert et al. (2013) Matthias Seibert, Martin Kleinsteuber, and Knut Hüper. Properties of the BFGS method on Riemannian manifolds. Mathematical System Theory – Festschrift in Honor of Uwe Helmke on the Occasion of his Sixtieth Birthday, pages 395–412, 2013.
  • Sra and Hosseini (2015) Suvrit Sra and Reshad Hosseini. Conic geometric optimization on the manifold of positive definite matrices. SIAM Journal on Optimization, 25(1):713–739, 2015.
  • Sra and Hosseini (2016) Suvrit Sra and Reshad Hosseini. Geometric optimization in machine learning. In Algorithmic Advances in Riemannian Geometry and Applications, pages 73–91. Springer, 2016.
  • Wolfe (1969) Philip Wolfe. Convergence conditions for ascent methods. SIAM Review, 11(2):226–235, 1969.
  • Zadeh et al. (2016) Pourya Zadeh, Reshad Hosseini, and Suvrit Sra. Geometric mean metric learning. In International conference on machine learning, pages 2464–2471. PMLR, 2016.

Vector Transport Free Riemannian LBFGS for Optimization on Symmetric Positive Definite Matrix Manifolds (Supplementary Material)

1 Proof for Proposition 5

– Upper bound analysis: The most time-consuming part of the RLBFGS algorithm is the recursive function GetDirection(.) defined in Section 2.4. In every recursive call, Eqs. (1)-(3) are performed. The metric used in Eqs. (1) and (3) takes 𝒪(n³) and 𝒪(n²) before and after applying the mappings, respectively (cf. Table 1 in the main paper). This is because matrix inversion takes 𝒪(n³) time and the matrices are vectorized within the trace operator in the algorithm implementation. The vector transport used in Eq. (2) takes 𝒪(n³) and 𝒪(1) before and after applying the mappings, respectively (cf. Table 1 in the main paper). The recursion in RLBFGS is usually unfolded a constant number m of times (see Ring and Wirth, 2020). Overall, the complexity of the recursion is 𝒪(mn³) without the mappings and 𝒪(mn²) with the mappings.

– Lower bound analysis: Computing the matrix inversion in the metric before applying our mappings (cf. Table 1 in the main paper) takes Ω(n² log n) (see article [1] referenced below), while the metric takes Ω(n²) after the mappings. The vector transport takes Ω(n³) and Ω(1) before and after the mappings, respectively. Overall, the complexity of the recursion is Ω(mn³) without the mappings and Ω(mn²) with the mappings.

[1] A. Tveit, “On the complexity of matrix inversion,” Norwegian University of Science and Technology, Trondheim, Technical Report, 2003.

– Tight bound analysis: As the upper and lower bound complexities of each algorithm coincide, the complexity bounds are tight. Therefore, RLBFGS has time complexity Θ(mn³) without our mappings and Θ(mn²) with them. This shows that the proposed mappings reduce the optimization time. The improvement is more noticeable when the computation of the gradients in Eq. (6) does not dominate the overall complexity.
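To make the source of the n³-versus-n² gap explicit, the sketch below (ours, for illustration only) contrasts the affine-invariant metric ⟨U, V⟩_X = tr(X⁻¹ U X⁻¹ V), assumed here as the Riemannian metric evaluated before the mappings, with the Euclidean inner product tr(UᵀV) that the metric reduces to after the mappings.

```python
import numpy as np

def affine_invariant_metric(X, U, V):
    """<U, V>_X = trace(X^{-1} U X^{-1} V) on the SPD manifold: every call
    needs O(n^3) linear solves and matrix products."""
    Xi_U = np.linalg.solve(X, U)        # X^{-1} U
    Xi_V = np.linalg.solve(X, V)        # X^{-1} V
    return np.trace(Xi_U @ Xi_V)

def euclidean_metric(U, V):
    """<U, V> = trace(U^T V) = vec(U) . vec(V): O(n^2) elementwise work,
    which is what the metric reduces to under the proposed mappings."""
    return np.vdot(U, V)
```

Since GetDirection(.) evaluates such inner products in each of its m recursive calls, this reduction, together with the identity vector transport, is what turns the Θ(mn³) recursion into Θ(mn²).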

2 Simulations for Larger Sample Size and Dimensionality

Tables 2 and 3 of this Supplementary Material report the average simulation results for the larger sample size, i.e., N = 100n². In these tables, the exponential map and the Taylor approximation of retraction are used, respectively. We have also reported results for the large dimensionality n = 100 in these two tables.

Table 1: Comparison of average results over ten runs where retraction with Taylor series expansion is used in algorithms and K ∈ {2, 5}, n ∈ {2, 10}, N = 10n² ∈ {40, 1000}. The #iters, conv., iter., diff., and std are short for number of iterations, convergence, iteration, difference, and standard deviation, respectively.
K n Separation Algorithm #iters conv. time time diff. std iter. time last cost
2 2 Low VTF (ISR) 59.800±\pm22.553 93.135±\pm88.437 19.280 1.320±\pm0.711 0.610±\pm0.366
VTF (Chol.) 62.600±\pm29.079 106.545±\pm117.633 48.611 1.355±\pm0.830 0.619±\pm0.382
RLBFGS 59.800±\pm26.803 100.424±\pm120.310 1.363±\pm0.787 0.610±\pm0.366
Mid VTF (ISR) 45.800±\pm16.982 48.155±\pm38.045 10.833 0.900±\pm0.454 0.482±\pm0.497
VTF (Chol.) 47.000±\pm18.074 51.529±\pm41.465 11.146 0.930±\pm0.483 0.482±\pm0.497
RLBFGS 45.000±\pm16.330 48.150±\pm37.285 0.921±\pm0.458 0.482±\pm0.497
High VTF (ISR) 25.600±\pm4.142 9.816±\pm3.837 1.380 0.370±\pm0.093 0.270±\pm0.456
VTF (Chol.) 26.300±\pm4.448 10.540±\pm4.190 1.932 0.385±\pm0.101 0.270±\pm0.456
RLBFGS 25.500±\pm4.353 10.526±\pm4.819 0.396±\pm0.113 0.270±\pm0.456
10 Low VTF (ISR) 75.800±\pm25.607 166.736±\pm137.600 63.108 1.955±\pm0.816 4.129±\pm0.771
VTF (Chol.) 74.500±\pm22.629 152.074±\pm107.531 98.424 1.855±\pm0.682 4.129±\pm0.771
RLBFGS 74.800±\pm25.442 172.641±\pm143.198 2.054±\pm0.834 4.129±\pm0.771
Mid VTF (ISR) 50.900±\pm13.956 64.501±\pm38.055 11.794 1.170±\pm0.395 3.472±\pm1.154
VTF (Chol.) 52.100±\pm13.892 66.775±\pm41.257 7.815 1.184±\pm0.410 3.472±\pm1.154
RLBFGS 51.000±\pm14.063 70.132±\pm46.421 1.264±\pm0.450 3.472±\pm1.154
High VTF (ISR) 43.600±\pm5.522 42.763±\pm11.897 12.828 0.963±\pm0.168 4.353±\pm1.081
VTF (Chol.) 47.400±\pm6.620 50.438±\pm15.187 8.731 1.041±\pm0.182 4.353±\pm1.081
RLBFGS 44.300±\pm6.464 47.830±\pm14.747 1.055±\pm0.192 4.353±\pm1.081
5 2 Low VTF (ISR) 262.500±\pm126.532 4430.577±\pm4455.877 13196.851 13.849±\pm6.991 0.082±\pm0.281
VTF (Chol.) 253.500±\pm124.970 4086.076±\pm3967.586 10698.273 13.084±\pm6.850 0.099±\pm0.279
RLBFGS 270.700±\pm144.145 5305.821±\pm5495.383 15.560±\pm8.452 0.090±\pm0.282
Mid VTF (ISR) 110.900±\pm50.886 731.318±\pm769.639 184.154 5.445±\pm2.806 1.010±\pm0.349
VTF (Chol.) 112.200±\pm60.736 792.756±\pm1053.832 416.142 5.436±\pm3.358 1.010±\pm0.349
RLBFGS 111.900±\pm55.183 814.994±\pm975.737 5.827±\pm3.289 1.010±\pm0.349
High VTF (ISR) 52.000±\pm13.565 116.485±\pm69.535 12.185 2.071±\pm0.728 1.626±\pm0.431
VTF (Chol.) 51.400±\pm12.616 114.032±\pm67.916 19.838 2.057±\pm0.735 1.626±\pm0.431
RLBFGS 53.400±\pm14.081 139.989±\pm89.213 2.408±\pm0.936 1.626±\pm0.431
10 Low VTF (ISR) 219.700±\pm57.908 2946.040±\pm1513.372 1239.375 12.579±\pm3.507 6.643±\pm0.662
VTF (Chol.) 226.100±\pm86.010 3272.111±\pm2808.634 2103.162 12.707±\pm5.156 6.625±\pm0.650
RLBFGS 231.300±\pm64.484 3607.215±\pm1947.580 14.516±\pm4.305 6.642±\pm0.663
Mid VTF (ISR) 87.100±\pm21.502 413.616±\pm223.510 54.437 4.458±\pm1.310 6.585±\pm0.569
VTF (Chol.) 87.200±\pm21.186 405.484±\pm214.985 47.791 4.374±\pm1.262 6.585±\pm0.569
RLBFGS 87.400±\pm20.403 454.434±\pm227.556 4.918±\pm1.339 6.585±\pm0.569
High VTF (ISR) 60.500±\pm10.533 181.113±\pm76.070 16.691 2.892±\pm0.649 6.827±\pm0.836
VTF (Chol.) 62.700±\pm11.295 197.565±\pm93.653 24.531 3.019±\pm0.827 6.827±\pm0.836
RLBFGS 58.900±\pm11.040 187.518±\pm86.254 3.058±\pm0.746 6.827±\pm0.836
Table 2: Comparison of average results over ten runs where exponential map is used in algorithms and n ∈ {2, 10, 100}, N = 100n² ∈ {400, 10000, 1000000}, K = 2.
n Separation Algorithm #iters conv. time time diff. std iter. time last cost
2 Low VTF (ISR) 72.000±\pm22.949 127.902±\pm98.372 31.248 1.616±\pm0.584 0.281±\pm0.394
VTF (Chol.) 69.800±\pm23.136 116.250±\pm86.811 23.937 1.490±\pm0.590 0.281±\pm0.394
RLBFGS 68.900±\pm24.875 123.064±\pm105.594 1.561±\pm0.694 0.281±\pm0.394
Mid VTF (ISR) 59.400±\pm16.460 78.124±\pm53.880 10.139 1.210±\pm0.421 0.520±\pm0.392
VTF (Chol.) 54.900±\pm12.862 63.241±\pm34.382 31.027 1.082±\pm0.332 0.516±\pm0.393
RLBFGS 58.400±\pm17.037 80.042±\pm62.183 1.240±\pm0.497 0.520±\pm0.392
High VTF (ISR) 23.300±\pm2.627 7.434±\pm2.288 0.738 0.313±\pm0.057 0.102±\pm0.288
VTF (Chol.) 22.900±\pm2.685 7.273±\pm2.084 0.873 0.312±\pm0.052 0.102±\pm0.288
RLBFGS 23.300±\pm2.497 7.917±\pm2.135 0.335±\pm0.053 0.102±\pm0.288
10 Low VTF (ISR) 63.400±\pm16.728 112.682±\pm74.903 17.031 1.664±\pm0.489 4.364±\pm1.299
VTF (Chol.) 65.000±\pm17.994 117.215±\pm82.258 13.637 1.670±\pm0.538 4.364±\pm1.299
RLBFGS 64.900±\pm17.866 126.987±\pm87.668 1.819±\pm0.561 4.364±\pm1.299
Mid VTF (ISR) 48.200±\pm8.766 58.096±\pm23.089 10.899 1.164±\pm0.256 4.024±\pm1.627
VTF (Chol.) 49.800±\pm11.811 64.021±\pm34.770 6.571 1.208±\pm0.364 4.024±\pm1.627
RLBFGS 48.400±\pm10.906 64.099±\pm32.746 1.251±\pm0.361 4.024±\pm1.627
High VTF (ISR) 45.100±\pm7.385 49.628±\pm20.550 7.336 1.065±\pm0.243 3.290±\pm1.129
VTF (Chol.) 47.800±\pm8.121 55.217±\pm22.622 6.572 1.117±\pm0.251 3.290±\pm1.129
RLBFGS 45.200±\pm7.131 52.883±\pm20.627 1.137±\pm0.236 3.290±\pm1.129
100 Low VTF (ISR) 131.900±\pm32.402 54826.327±\pm30454.295 5842.849 389.761±\pm121.060 63.703±\pm2.843
VTF (Chol.) 141.200±\pm36.908 59082.278±\pm32385.087 3775.919 392.658±\pm110.916 63.703±\pm2.843
RLBFGS 136.300±\pm36.059 56627.788±\pm32602.224 387.367±\pm119.404 63.703±\pm2.843
Mid VTF (ISR) 87.800±\pm20.225 23635.623±\pm11743.237 5738.406 259.026±\pm52.061 61.263±\pm4.734
VTF (Chol.) 95.800±\pm25.170 28033.655±\pm16196.321 1410.113 278.456±\pm64.670 61.263±\pm4.734
RLBFGS 94.900±\pm25.653 28527.524±\pm17341.755 284.548±\pm69.821 61.263±\pm4.734
High VTF (ISR) 97.000±\pm24.240 29048.836±\pm12857.812 3441.882 286.602±\pm63.250 61.241±\pm3.991
VTF (Chol.) 105.100±\pm27.477 33967.543±\pm15707.841 2014.574 308.668±\pm70.065 61.241±\pm3.991
RLBFGS 103.200±\pm26.389 33101.986±\pm15790.066 305.002±\pm74.361 61.241±\pm3.991
Table 3: Comparison of average results over ten runs where retraction with Taylor series expansion is used in algorithms and n ∈ {2, 10, 100}, N = 100n² ∈ {400, 10000, 1000000}, K = 2.
n Separation Algorithm #iters conv. time time diff. std iter. time last cost
2 Low VTF (ISR) 79.800±\pm49.973 206.935±\pm343.153 105.106 1.839±\pm1.340 0.403±\pm0.578
VTF (Chol.) 72.200±\pm20.004 125.636±\pm83.113 334.239 1.608±\pm0.536 0.402±\pm0.576
RLBFGS 83.900±\pm51.054 237.936±\pm364.382 2.064±\pm1.412 0.404±\pm0.578
Mid VTF (ISR) 48.100±\pm12.688 49.277±\pm30.486 12.185 0.946±\pm0.334 0.196±\pm0.436
VTF (Chol.) 50.100±\pm14.271 56.329±\pm40.420 14.674 1.017±\pm0.422 0.196±\pm0.436
RLBFGS 51.000±\pm13.622 61.036±\pm39.550 1.099±\pm0.412 0.196±\pm0.436
High VTF (ISR) 25.300±\pm3.945 9.887±\pm3.903 3.565 0.377±\pm0.101 0.111±\pm0.261
VTF (Chol.) 26.500±\pm4.927 11.003±\pm5.644 4.938 0.395±\pm0.124 0.111±\pm0.261
RLBFGS 25.100±\pm4.012 10.203±\pm4.224 0.392±\pm0.106 0.111±\pm0.261
10 Low VTF (ISR) 65.500±\pm18.435 123.552±\pm77.898 14.558 1.747±\pm0.560 4.164±\pm1.142
VTF (Chol.) 66.500±\pm19.558 122.281±\pm79.423 19.994 1.688±\pm0.569 4.164±\pm1.142
RLBFGS 65.200±\pm18.743 130.098±\pm84.241 1.835±\pm0.622 4.164±\pm1.142
Mid VTF (ISR) 40.600±\pm5.211 38.915±\pm12.591 8.862 0.938±\pm0.177 4.589±\pm1.278
VTF (Chol.) 41.000±\pm5.538 38.732±\pm12.699 9.319 0.924±\pm0.171 4.589±\pm1.278
RLBFGS 40.400±\pm6.186 41.513±\pm15.303 0.998±\pm0.213 4.589±\pm1.278
High VTF (ISR) 44.400±\pm6.310 46.957±\pm15.325 11.008 1.032±\pm0.198 3.786±\pm1.113
VTF (Chol.) 46.000±\pm6.616 50.393±\pm15.950 12.688 1.070±\pm0.195 3.786±\pm1.113
RLBFGS 44.200±\pm7.099 50.707±\pm18.070 1.116±\pm0.222 3.786±\pm1.113
100 Low VTF (ISR) 118.500±\pm11.404 39986.299±\pm6719.992 2227.745 335.341±\pm26.657 62.241±\pm5.708
VTF (Chol.) 124.400±\pm11.157 42882.658±\pm7219.289 3669.708 342.538±\pm28.830 62.241±\pm5.708
RLBFGS 121.300±\pm11.235 41171.059±\pm7987.985 336.547±\pm36.783 62.241±\pm5.708
Mid VTF (ISR) 103.800±\pm24.679 32533.037±\pm16124.089 3314.944 297.268±\pm76.267 60.256±\pm3.662
VTF (Chol.) 111.900±\pm28.781 37328.925±\pm18924.066 2916.610 314.512±\pm82.724 60.256±\pm3.662
RLBFGS 111.300±\pm27.941 36926.995±\pm17705.262 314.564±\pm76.642 60.256±\pm3.662
High VTF (ISR) 96.300±\pm25.880 28818.656±\pm14923.845 1492.788 283.824±\pm67.267 60.601±\pm4.436
VTF (Chol.) 103.900±\pm26.126 33669.987±\pm16494.757 3368.579 308.353±\pm73.281 60.601±\pm4.436
RLBFGS 102.800±\pm25.258 31630.478±\pm15755.934 292.771±\pm68.396 60.601±\pm4.436

3 Cost Difference Progress for the N = 10n² Sample Size Simulations

The log-scale cost difference progress for the simulations with N = 10n² is depicted in Figs. 1 and 2 of this Supplementary Material, for K = 2 and K = 5, respectively.

Figure 1: Comparison of the proposed VTF-RLBFGS algorithm (using ISR and Cholesky) with RLBFGS in their cost differences. In these experiments, we had K = 2.
Figure 2: Comparison of the proposed VTF-RLBFGS algorithm (using ISR and Cholesky) with RLBFGS in their cost differences. In these experiments, we had K = 5.

4 Cost Difference Progress for the N = 100n² Sample Size Simulations

The log-scale cost difference progress for the simulations with N = 100n² is depicted in Figs. 3 and 4 of this Supplementary Material, where the exponential map and the Taylor approximation of retraction are used, respectively.

Figure 3: Comparison of the proposed VTF-RLBFGS algorithm (using ISR and Cholesky) with RLBFGS in their cost differences. In these experiments, the exponential map is used in the optimization procedure and we had K = 2.
Figure 4: Comparison of the proposed VTF-RLBFGS algorithm (using ISR and Cholesky) with RLBFGS in their cost differences. In these experiments, Taylor series expansion is used for approximation of the retraction operator and we had K = 2.