
A Riemannian Primal-dual Algorithm Based on Proximal Operator and its Application in Metric Learning

1st Shijun Wang Dept. of Artificial Intelligence
Ant Financial Services Group
Seattle, USA
[email protected]
   2nd Baocheng Zhu Dept. of Artificial Intelligence
Ant Financial Services Group
Shanghai, China
[email protected]
   3rd Lintao Ma Dept. of Artificial Intelligence
Ant Financial Services Group
Shanghai, China
[email protected]
   4th Yuan Qi Dept. of Artificial Intelligence
Ant Financial Services Group
Hangzhou, China
[email protected]
Abstract

In this paper, we consider optimizing a smooth, convex, lower semicontinuous function in Riemannian space with constraints. To solve the problem, we first convert it to a dual problem and then propose a general primal-dual algorithm to optimize the primal and dual variables iteratively. In each optimization iteration, we employ a proximal operator to search for the optimal solution in the primal space. We prove convergence of the proposed algorithm and show its non-asymptotic convergence rate. By utilizing the proposed primal-dual optimization technique, we propose a novel metric learning algorithm which learns an optimal feature transformation matrix in the Riemannian space of positive definite matrices. Preliminary experimental results on an optimal fund selection problem in fund of funds (FOF) management for quantitative investment show its efficacy.

I Introduction

Many machine learning problems can be solved by optimization algorithms which minimize or maximize a predefined objective function under certain constraints, if there are any. In past decades, searching for optimal variables in Euclidean space has been the mainstream direction for optimization techniques. In recent years, there has been a shift from Euclidean space to Riemannian space due to the manifold structures that exist in many machine learning problems [1, 2, 3, 4, 5, 6]. To solve optimization problems in Riemannian space, a straightforward approach is to generalize optimization algorithms developed in Euclidean space to Riemannian space while respecting the manifold constraints on the variables to be optimized. Gradient descent methods, Newton's methods and conjugate gradient methods can be naturally extended from Euclidean space to Riemannian space; see [7, 8, 9, 10] and references therein. Studies on Riemannian accelerated gradient methods, quasi-Newton algorithms like BFGS and adaptive optimization methods can be found in [11, 12, 13]. P.-A. Absil et al. proposed a trust-region approach for optimizing a smooth function on a Riemannian manifold in which the trust-region subproblems are solved using a truncated conjugate gradient algorithm [2]. Furthermore, Qi and Agarwal generalized the adaptive regularization with cubics algorithm to Riemannian manifolds and obtained an upper bound on the iteration complexity which is optimal compared to the complexity of steepest descent and trust-region methods [14, 15]. In recent years, variance reduction techniques have drawn tremendous attention for optimizing finite-sum problems [16, 17, 18]. Extensions of the variance reduction idea to optimizing finite sums of geodesically smooth functions on Riemannian manifolds can be found in [19, 20, 21].

Despite intensive studies on Riemannian manifold optimization, all the above works consider only unconstrained problems on the manifold; the only constraint is that the solution has to lie on the manifold. This severely limits the scope of possible applications of those methods. In many problems, additional equality or inequality constraints need to be imposed. Few works have addressed this problem. Hauswirth et al. extended the projected gradient descent algorithm to Riemannian manifolds with inequality constraints and showed its well-behaved convergence behaviour, but the manifold is restricted to a submanifold of Euclidean space [22]. Zhang et al. developed an ADMM-like primal-dual approach to solve nonconvex and nonsmooth multi-block optimization over Riemannian manifolds with coupled linear equality constraints [23].

In this paper, we consider the following general nonlinear primal problem, which is constrained on a complete Riemannian manifold $M$:

\min_{x\in M}f(x),  (1)

subject to the constraints $h(x)\preceq 0$, where $f\in C^{2}(M,\mathbb{R})$ and $h=(h_{1},h_{2},\dots,h_{m})$ with each $h_{k}\in C^{2}(M,\mathbb{R})$ are closed, proper, convex, lower semicontinuous (l.s.c.) and real-valued functions. A function is of class $C^{2}$ if its first and second derivatives both exist and are continuous.

I-A Previous Works

Khuzani and Li studied a stochastic primal-dual method on Riemannian manifolds with bounded sectional curvature [24]. They proved non-asymptotic convergence of the primal-dual method and established a connection between the convergence rate and the lower bound on the sectional curvature. In their algorithm, a standard gradient descent step followed by an exponential map is used to search for the optimal variable in each iteration. In recent years, proximal algorithms have emerged in many machine learning applications due to their capability of handling nonsmooth, constrained, large-scale or distributed problems [25, 26]. Ferreira and Oliveira considered minimization problems on a Riemannian manifold with nonpositive sectional curvature. They solved the problem by extending the proximal method from Euclidean space to Riemannian space [27]. Recent advances on the Riemannian proximal point method can be found in [28, 29].

I-B Our Contributions

Previous works concentrate on either constrained optimization in Euclidean space or unconstrained optimization in Riemannian space. We propose a novel algorithm to solve the constrained optimization problem (1) over a Riemannian manifold. We first convert it to a dual problem and then use a general primal-dual algorithm to optimize the primal and dual variables iteratively. In each optimization iteration, we employ the proximal point algorithm and the gradient ascent method alternately to search for optimal solutions in the primal and dual spaces. We prove convergence of the proposed algorithm and show its non-asymptotic convergence rate. To show the efficacy of the proposed primal-dual algorithm for optimizing nonlinear l.s.c. functions in $C^{2}$ with constraints, we propose a novel metric learning algorithm and solve it using the proposed Riemannian primal-dual method.

II Notation and Preliminaries

Let $M$ be a connected and finite-dimensional manifold with dimensionality $m$. We denote by $T_{p}M$ the tangent space of $M$ at $p$. Let $M$ be endowed with a Riemannian metric $\langle\cdot,\cdot\rangle$, with corresponding norm denoted by $\|\cdot\|$, so that $M$ is now a Riemannian manifold [30]. We use $l(\gamma)=\int_{a}^{b}\|\gamma'(t)\|dt$ to denote the length of a piecewise smooth curve $\gamma:[a,b]\rightarrow M$ joining $x'$ to $x$, i.e., such that $\gamma(a)=x'$ and $\gamma(b)=x$. Minimizing this length functional over the set of all piecewise smooth curves joining $x'$ and $x$, we get the Riemannian distance $d(x',x)$, which induces the original topology on $M$. For $x\in M$, the exponential map $\exp_{x}:T_{x}M\rightarrow M$ is defined by $\exp_{x}v=\gamma_{v}(1,x)$, which maps a tangent vector $v$ at $x$ to a point on $M$ along the curve $\gamma$. For any $x'\in M$, we define the inverse exponential map $\exp_{x'}^{-1}:M\rightarrow T_{x'}M$, which is $C^{\infty}$ and maps a point $x\in M$ to a tangent vector at $x'$ with $d(x',x)=\|\exp_{x'}^{-1}x\|$. We assume $(M,d)$ is a complete and bounded metric space and that all closed subsets of $M$ are compact. For a given convex function $f:M\rightarrow\mathbb{R}$ and $x'\in M$, a vector $s\in T_{x'}M$ is called a subgradient of $f$ at $x'$ if $f(x)\geq f(x')+\langle s,\exp_{x'}^{-1}x\rangle$ for all $x\in M$. The set of all subgradients of $f$ at $x'$ is called the subdifferential of $f$ at $x'$ and is denoted by $\partial f(x')$. If $M$ is a Hadamard manifold, i.e., complete, simply connected and with everywhere non-positive sectional curvature, the subdifferential of $f$ at any point of $M$ is nonempty [27].
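To make these objects concrete for the manifold used later in the paper (the space $\mathbf{S}_{+}^{n}$ of positive definite matrices), here is a minimal sketch of the exponential map, its inverse, and the geodesic distance under the standard affine-invariant metric; this metric choice and all helper names are illustrative assumptions on our part rather than something specified in the paper.

```python
import numpy as np

def _eig_fun(S, fun):
    """Apply a scalar function to the eigenvalues of a symmetric matrix S."""
    vals, vecs = np.linalg.eigh(S)
    return (vecs * fun(vals)) @ vecs.T

def exp_map(X, V):
    """exp_X(V): map a symmetric tangent vector V at X onto the SPD manifold."""
    Xh, Xih = _eig_fun(X, np.sqrt), _eig_fun(X, lambda v: 1 / np.sqrt(v))
    return Xh @ _eig_fun(Xih @ V @ Xih, np.exp) @ Xh

def log_map(X, Y):
    """exp_X^{-1}(Y): tangent vector at X pointing toward Y."""
    Xh, Xih = _eig_fun(X, np.sqrt), _eig_fun(X, lambda v: 1 / np.sqrt(v))
    return Xh @ _eig_fun(Xih @ Y @ Xih, np.log) @ Xh

def dist(X, Y):
    """Riemannian distance d(X, Y) = ||log(X^{-1/2} Y X^{-1/2})||_F."""
    Xih = _eig_fun(X, lambda v: 1 / np.sqrt(v))
    return np.linalg.norm(_eig_fun(Xih @ Y @ Xih, np.log), 'fro')

# Consistency check: exp_X(exp_X^{-1}(Y)) recovers Y, and d(X, X) = 0.
rng = np.random.default_rng(0)
A, B = rng.standard_normal((3, 3)), rng.standard_normal((3, 3))
X, Y = A @ A.T + 3 * np.eye(3), B @ B.T + 3 * np.eye(3)
print(np.allclose(exp_map(X, log_map(X, Y)), Y), np.isclose(dist(X, X), 0.0))
```

The analysis in Sections II-III only needs the distance $d$, the exponential map and its inverse; the metric learning application in Section IV measures closeness with the LogDet divergence rather than this geodesic distance.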

III The Algorithm

By employing duality, we convert the original optimization problem (1) to an augmented Lagrangian function (a generic saddle-point problem):

L(x,\lambda)=f(x)+\langle\lambda,h(x)\rangle-\frac{\alpha}{2}\|\lambda\|^{2},  (2)

where $\lambda\in\mathbb{R}_{+}^{m}$ is the Lagrangian dual vector for the inequality constraints, and $\alpha>0$ is a regularization parameter which weights the norm of the dual variables. The norm of the Lagrangian dual variables, $\|\lambda\|$, is upper bounded, and so is the gradient of the deterministic Lagrangian function $L(x_{t},\lambda_{t})$.

To solve the generic primal-dual problem shown in eq. (2), we propose a primal-dual algorithm based on the proximal operator (see Algorithm 1).

Algorithm 1 Riemannian Primal-dual Algorithm Based on Proximal Operator
  1. Initialize: set initial point $x_{0}\in M$, $\lambda_{0}=0$, and a decreasing step-size sequence $\{\eta_{t}\}_{t=0}^{T}$ with $\eta_{t}\in\mathbb{R}_{+}$.
  2. For $t=0,1,2,\dots,T$ do:
x_{t+1}=\mathrm{prox}_{L}(x_{t})=\arg\min_{x\in M}\left\{L(x,\lambda_{t})+\frac{1}{2\eta_{t}}d^{2}(x_{t},x)\right\},  (3)
\lambda_{t+1}=\left[\lambda_{t}+\eta_{t}\,\mathrm{grad}_{\lambda_{t}}L(x_{t+1},\lambda_{t})\right]_{+},  (4)
where $\mathrm{grad}_{\lambda}L(x_{t+1},\lambda)=h(x_{t+1})-\alpha\lambda$, and $[x]_{+}$ means, for a vector $x$, setting every negative entry of $x$ to zero.
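To illustrate the structure of Algorithm 1, here is a minimal sketch for the special case $M=\mathbb{R}^{n}$, where the exponential map is the identity and $d$ is the Euclidean distance. The proximal subproblem (3) is approximated by a few inner gradient steps, and the objective, constraints and step counts are illustrative placeholders, not anything prescribed by the paper.

```python
import numpy as np

def primal_dual(grad_f, h, jac_h, x0, T=200, alpha=0.1, inner_steps=25, inner_lr=0.05):
    """Sketch of Algorithm 1 on the flat manifold M = R^n.

    grad_f: gradient of the objective f; h: constraint vector with h(x) <= 0;
    jac_h: Jacobian of h.  The prox step (3) is approximated by inner gradient descent
    on L(z, lambda_t) + d^2(x_t, z) / (2 eta_t); the dual step (4) is projected ascent.
    """
    x = x0.copy()
    lam = np.zeros(len(h(x0)))
    for t in range(T):
        eta = 1.0 / np.sqrt(t + 1)                     # decreasing step size (Corollary 1)
        z = x.copy()
        for _ in range(inner_steps):                   # approximate prox_L(x_t)
            grad_L = grad_f(z) + jac_h(z).T @ lam      # grad_z L(z, lambda_t)
            z = z - inner_lr * (grad_L + (z - x) / eta)
        x = z
        lam = np.maximum(lam + eta * (h(x) - alpha * lam), 0.0)   # [.]_+ projection
    return x, lam

# Toy usage: minimize ||x - c||^2 subject to sum(x) - 1 <= 0.
c = np.array([2.0, 2.0])
grad_f = lambda x: 2.0 * (x - c)
h = lambda x: np.array([np.sum(x) - 1.0])
jac_h = lambda x: np.ones((1, x.size))
x_hat, lam_hat = primal_dual(grad_f, h, jac_h, np.zeros(2))
print(x_hat)   # close to the constrained minimizer (0.5, 0.5); the alpha term biases it slightly
```

On a curved manifold, the inner update would instead follow Riemannian gradients and map back to the manifold with the exponential map or a retraction, which is how the metric learning instantiation in Section IV proceeds.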

Assumption 1. Assume $M$ is a compact Riemannian manifold with finite diameter $R=\sup_{x,y\in M}d(x,y)$ and non-positive sectional curvature. For any $x\in M$, the following bounds hold: $\|\mathrm{grad}_{x}f(x)\|\leq C$, $\|\mathrm{grad}_{x}h_{k}(x)\|\leq C$, and $|h_{k}(x)|\leq G$, $k=1,2,\dots,m$.

With Assumption 1, we have the following theorem.

Theorem 1. Assume problem (2) has a saddle point $(x_{*},\lambda_{*})$ and Assumption 1 holds. Let $\{x_{t}\}_{t=0}^{T-1}$ be the finite sequence generated by Algorithm 1, with step sizes satisfying $\eta_{t}\alpha\leq 1$, $t\in[T]=\{0,1,2,\dots,T-1\}$. Then,

\min_{t\in[T]}f(x_{t+1})-f(x_{*})\leq\frac{1}{\sum_{t=0}^{T-1}\eta_{t}}\left(\frac{1}{2}d^{2}(x_{*},x_{0})+2mG^{2}\sum_{t=0}^{T-1}\eta_{t}^{2}\right)  (5)

for all $x_{t}\in M$, $t\in[T]$.

Proof of Theorem 1 is shown in appendix A.

From Theorem 1, we could derive the following corollary:

Corollary 1. (Non-asymptotic Convergence) By choosing the step size $\eta_{t}=\frac{1}{\sqrt{t+1}}$, with $\alpha\eta_{t}\leq 1$ for all $t\in[T]$, the sequence $\{x_{t}\}$, $t\in[T]$, generated by Algorithm 1 converges at rate:

\min_{t\in[T]}f(x_{t+1})-f(x_{*})=\mathcal{O}\left(\frac{\log(T)}{\sqrt{T}-1}\right).  (6)

Proof of Corollary 1 is shown in appendix B.
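As a quick numerical illustration of this rate, the snippet below evaluates the explicit bound obtained in Appendix B, $\frac{1}{2(\sqrt{T}-1)}\left(\frac{1}{2}R^{2}+2mG^{2}(1+\log T)\right)$, for a few horizons $T$; the constants $R$, $m$ and $G$ are arbitrary placeholders chosen only to show the decay.

```python
import numpy as np

R, m, G = 2.0, 5, 1.0   # placeholder diameter, number of constraints, constraint bound

def corollary_bound(T):
    """Right-hand side of the Corollary 1 bound with eta_t = 1/sqrt(t+1)."""
    return (0.5 * R**2 + 2 * m * G**2 * (1 + np.log(T))) / (2 * (np.sqrt(T) - 1))

for T in [10, 100, 1000, 10000, 100000]:
    print(f"T = {T:>6d}   bound = {corollary_bound(T):7.3f}")
# The printed values shrink roughly like log(T)/sqrt(T), matching eq. (6).
```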

IV Application in Metric Learning

Metric learning is a technique to learn a distance metric in the data feature space, and it finds application in various machine learning tasks relying on distance or similarity measures, such as classification, clustering, dimensionality reduction and domain adaptation, to name a few [31, 32, 33, 34, 35, 36, 37, 38]. Most methods learn the metric (a positive definite matrix $\mathbf{W}$) in a weakly-supervised way from pairwise or triplet constraints on data points. In general, metric learning can be formulated as an optimization problem that shares the same form as standard regularized empirical risk minimization:

\min_{\mathbf{W}}\mathcal{L}(\mathbf{W},\mathbf{X})+\lambda\Omega(\mathbf{W}),  (7)

where $\mathbf{X}$ denotes the training samples, $\mathcal{L}$ is the loss function associated with sample constraints, $\Omega$ is the regularizer, and $\lambda$ is the trade-off parameter. Many methods are specified as constrained optimization problems by writing $\mathcal{L}$ explicitly as inequality constraints $h(\mathbf{W},\mathbf{X})\preceq 0$, although one can always transform them into unconstrained problems using the hinge loss or other tricks [31, 37].

Several techniques have been developed to solve the metric learning optimization problem in eq. (7). Projected gradient descent and its stochastic version use traditional (stochastic) gradient descent followed by an orthogonal projection onto the positive semi-definite cone [33, 39, 40, 41]. Bregman projection methods update based on a single constraint at each iteration and perform a general non-orthogonal projection so that the chosen constraint is satisfied; after projecting, an appropriate correction is applied [35].

However, these methods do not fully exploit the intrinsic manifold structure of the problem: the learned metric must lie in the Riemannian space of positive definite matrices, so it is naturally an optimization problem on a Riemannian manifold rather than in Euclidean space. In this section, we apply the proposed method to the metric learning problem and illustrate how to optimize a convex target function on a Riemannian manifold.

IV-A Metric Learning Problem Formulation

We consider the following convex metric learning problem with $\mathbf{W}$ in the Riemannian space $\mathbf{S}_{+}^{n}$ of $n\times n$ positive definite matrices:

\min_{\mathbf{W}\in\mathbf{S}_{+}^{n}}\Omega(\mathbf{W}),  (8)

s.t.

(\mathbf{x}_{i}-\mathbf{x}_{j})\mathbf{W}(\mathbf{x}_{i}-\mathbf{x}_{j})^{\top}\leq u,\ \forall(i,j)\in C^{+},

(\mathbf{x}_{i}-\mathbf{x}_{j})\mathbf{W}(\mathbf{x}_{i}-\mathbf{x}_{j})^{\top}\geq l,\ \forall(i,j)\in C^{-},

where $\Omega(\mathbf{W})=\frac{1}{2}d^{2}(\mathbf{W},\mathbf{W}_{0})$ and $d^{2}(\mathbf{W},\mathbf{W}_{0})=\mathrm{tr}(\mathbf{W}\mathbf{W}_{0}^{-1})-\log\det(\mathbf{W}\mathbf{W}_{0}^{-1})-n$ is the LogDet divergence, a scale-invariant distance measure on the Riemannian manifold of metrics [35]. $\mathbf{W}_{0}$ is a target transformation matrix initialized to the identity (corresponding to the Euclidean distance) or to the inverse of the data covariance matrix (corresponding to the Mahalanobis distance). $C^{+}/C^{-}$ is the set of all sample pairs with the same/different labels. $u$ and $l$ are the upper/lower distance bounds for similar/dissimilar pairs of points and are set to the 5th/95th percentiles of the observed distribution of distances in the following experiments. It is known that the space of all $n\times n$ positive definite Hermitian matrices is a Cartan-Hadamard manifold, i.e., a simply connected, complete Riemannian manifold with non-positive sectional curvature.
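For concreteness, here is a minimal sketch of the LogDet divergence defined above, together with a check of its scale invariance; the function and variable names are our own.

```python
import numpy as np

def logdet_div(W, W0):
    """LogDet divergence d^2(W, W0) = tr(W W0^{-1}) - logdet(W W0^{-1}) - n."""
    n = W.shape[0]
    A = W @ np.linalg.inv(W0)
    sign, logdet = np.linalg.slogdet(A)
    assert sign > 0, "W W0^{-1} must have a positive determinant for PD inputs"
    return np.trace(A) - logdet - n

# Scale invariance: d^2(c W, c W0) = d^2(W, W0) for any c > 0.
rng = np.random.default_rng(0)
B = rng.standard_normal((4, 4))
W, W0 = B @ B.T + 4 * np.eye(4), np.eye(4)
print(logdet_div(W, W0), logdet_div(3.0 * W, 3.0 * W0))   # identical values
```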

IV-B Optimization by the Proposed Riemannian Primal-dual Algorithm

By introducing relaxation variables $\xi$, we have

\min_{\mathbf{W}\in\mathbf{S}_{+}^{n}}\Omega(\mathbf{W})+\frac{C_{1}}{2}\|\xi\|_{2}^{2},  (9)

s.t.

(\mathbf{x}_{i}-\mathbf{x}_{j})\mathbf{W}(\mathbf{x}_{i}-\mathbf{x}_{j})^{\top}\leq u(1+\xi_{ij}),\ \forall(i,j)\in C^{+},

(\mathbf{x}_{i}-\mathbf{x}_{j})\mathbf{W}(\mathbf{x}_{i}-\mathbf{x}_{j})^{\top}\geq l(1-\xi_{ij}),\ \forall(i,j)\in C^{-}.

Let us define $h_{+}(\mathbf{W})=\mathrm{diag}(\mathbf{X}_{+}\mathbf{W}\mathbf{X}_{+}^{\top})-u(\mathbf{e}+\xi_{+})\leq 0$ ($\mathbf{e}$ is a vector whose entries are all one, and $\mathrm{diag}(X)$ extracts the diagonal elements of an input matrix $X$ into a vector) and $h_{-}(\mathbf{W})=-\mathrm{diag}(\mathbf{X}_{-}\mathbf{W}\mathbf{X}_{-}^{\top})+l(\mathbf{e}-\xi_{-})\leq 0$, where $\mathbf{X}_{+}/\mathbf{X}_{-}$ are matrices whose rows are the differences $\mathbf{x}_{i}-\mathbf{x}_{j}$ of sample pairs with the same/different labels (shape: number of pairs by feature dimension), and $\xi_{+}/\xi_{-}$ are the corresponding relaxation vectors, which are greater than or equal to zero.
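Below is a small sketch of how these constraint vectors can be assembled from raw samples; the helper names, and the reading of $\mathbf{X}_{+}/\mathbf{X}_{-}$ as matrices of pair differences, follow the formulation above and are otherwise illustrative.

```python
import numpy as np

def pair_diffs(X, pairs):
    """Stack the differences x_i - x_j for a list of index pairs (rows of X are samples)."""
    return np.stack([X[i] - X[j] for i, j in pairs])

def h_plus(W, X_plus, xi_plus, u):
    """h_+(W) = diag(X_+ W X_+^T) - u (e + xi_+); feasibility requires <= 0 elementwise."""
    d2 = np.einsum('ij,jk,ik->i', X_plus, W, X_plus)   # squared distances of similar pairs
    return d2 - u * (1.0 + xi_plus)

def h_minus(W, X_minus, xi_minus, l):
    """h_-(W) = -diag(X_- W X_-^T) + l (e - xi_-); feasibility requires <= 0 elementwise."""
    d2 = np.einsum('ij,jk,ik->i', X_minus, W, X_minus)
    return -d2 + l * (1.0 - xi_minus)
```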

So we have

\min_{\mathbf{W}\in\mathbf{S}_{+}^{n}}\Omega(\mathbf{W})+\frac{C_{1}}{2}\|\xi_{+}\|_{2}^{2}+\frac{C_{1}}{2}\|\xi_{-}\|_{2}^{2},  (10)

s.t.

h_{+}(\mathbf{W})=\mathrm{diag}(\mathbf{X}_{+}\mathbf{W}\mathbf{X}_{+}^{\top})-u(\mathbf{e}+\xi_{+})\leq 0,

h_{-}(\mathbf{W})=-\mathrm{diag}(\mathbf{X}_{-}\mathbf{W}\mathbf{X}_{-}^{\top})+l(\mathbf{e}-\xi_{-})\leq 0,

-\xi_{+}\leq 0,

-\xi_{-}\leq 0.

Furthermore, we define $h(\mathbf{W})=[h_{+}(\mathbf{W}),h_{-}(\mathbf{W})]^{\top}$ and $\xi=[\xi_{+},\xi_{-}]$; then

\min_{\mathbf{W}\in\mathbf{S}_{+}^{n}}\Omega(\mathbf{W})+\frac{C_{1}}{2}\|\xi\|_{2}^{2},  (11)

s.t.

h(\mathbf{W})\leq 0,

-\xi\leq 0.

By employing duality, we have the following augmented Lagrangian function:

L(\mathbf{W},\xi,\lambda,\gamma)=\Omega(\mathbf{W})+\frac{C_{1}}{2}\|\xi\|^{2}+\langle\lambda,h(\mathbf{W})\rangle+\langle\gamma,-\xi\rangle-\frac{C_{2}}{2}\|\lambda\|^{2}-\frac{C_{2}}{2}\|\gamma\|^{2}.

Now let us solve the above Lagrangian function using Algorithm 1. At each step $t+1$, we have the following updates:

\mathbf{W}_{t+1}=\arg\min_{\mathbf{W}\in\mathbf{S}_{+}^{n}}\left\{L(\mathbf{W},\xi,\lambda_{t},\gamma_{t})+\frac{1}{2\eta_{t}}d^{2}(\mathbf{W}_{t},\mathbf{W})\right\},

\xi_{t+1}=\arg\min_{\xi\geq 0}\left\{L(\mathbf{W},\xi,\lambda_{t},\gamma_{t})+\frac{1}{2\eta_{t}}d^{2}(\xi_{t},\xi)\right\},

\lambda_{t+1}=\left[\lambda_{t}+\eta_{t}\,\mathrm{grad}_{\lambda_{t}}L(\mathbf{W}_{t+1},\xi_{t+1},\lambda_{t},\gamma_{t})\right]_{+},

\gamma_{t+1}=\left[\gamma_{t}+\eta_{t}\,\mathrm{grad}_{\gamma_{t}}L(\mathbf{W}_{t+1},\xi_{t+1},\lambda_{t},\gamma_{t})\right]_{+}.

So,

\mathbf{W}_{t+1}=\arg\min_{\mathbf{W}\in\mathbf{S}_{+}^{n}}\left\{\frac{1}{2}d^{2}(\mathbf{W},\mathbf{W}_{0})+\langle\lambda_{t},h(\mathbf{W})\rangle+\frac{1}{2\eta_{t}}d^{2}(\mathbf{W}_{t},\mathbf{W})\right\},

\xi_{t+1}=\arg\min_{\xi\geq 0}\left\{\frac{C_{1}}{2}\|\xi\|^{2}+\langle\lambda_{t},h(\mathbf{W},\xi)\rangle+\langle\gamma_{t},-\xi\rangle+\frac{1}{2\eta_{t}}d^{2}(\xi_{t},\xi)\right\},

\lambda_{t+1}=\left[\lambda_{t}+\eta_{t}\,\mathrm{grad}_{\lambda_{t}}\left(\langle\lambda_{t},h(\mathbf{W}_{t+1})\rangle-\frac{C_{2}}{2}\|\lambda_{t}\|^{2}\right)\right]_{+},

\gamma_{t+1}=\left[\gamma_{t}+\eta_{t}\,\mathrm{grad}_{\gamma_{t}}\left(\langle\gamma_{t},-\xi_{t+1}\rangle-\frac{C_{2}}{2}\|\gamma_{t}\|^{2}\right)\right]_{+}.

In the following paragraphs, we will show how to update primal and dual variables in each iteration.

(1) We employ the Riemannian gradient descent method to search for the optimal $\mathbf{W}_{t+1}$. Define

J_{\mathbf{W}}=L(\mathbf{W},\xi,\lambda_{t},\gamma_{t})=\frac{1}{2}d^{2}(\mathbf{W},\mathbf{W}_{0})+\langle\lambda_{t},h(\mathbf{W})\rangle.

We have $\mathbf{W}_{t+1}=R_{\mathbf{W}_{t}}\left(-\eta_{t}\,\mathrm{Grad}_{\mathbf{W}_{t}}J_{\mathbf{W}_{t}}\right)$, where $\mathrm{Grad}_{\mathbf{W}}J_{\mathbf{W}}=\frac{1}{2}\left(\mathbf{W}_{0}^{-1}-\mathbf{W}^{-1}\right)+\langle\lambda_{t},\frac{\partial h(\mathbf{W})}{\partial\mathbf{W}}\rangle$ is the Riemannian gradient and the operator $R_{W}$ denotes retraction. See Appendix C for the full procedure.

(2) Define $J_{\xi}=\frac{C_{1}}{2}\|\xi\|^{2}+\langle\lambda_{t},h(\mathbf{W},\xi)\rangle+\langle\gamma_{t},-\xi\rangle+\frac{1}{2\eta_{t}}d^{2}(\xi_{t},\xi)$, with $d^{2}(\xi_{t},\xi)=\|\xi-\xi_{t}\|^{2}$. Setting the gradient to zero,

\mathrm{grad}_{\xi}J_{\xi}=C_{1}\xi-\langle\lambda_{t},\binom{u}{l}\rangle-\gamma_{t}+\eta_{t}(\xi-\xi_{t})=(C_{1}+\eta_{t})\xi-\eta_{t}\xi_{t}-\gamma_{t}-\langle\lambda_{t},\binom{u}{l}\rangle=0,

gives

\xi_{t+1}=\left[\frac{1}{C_{1}+\eta_{t}}\left(\eta_{t}\xi_{t}+\gamma_{t}+\langle\lambda_{t},\binom{u}{l}\rangle\right)\right]_{+}.

(3) \lambda_{t+1}=\left[(1-C_{2}\eta_{t})\lambda_{t}+\eta_{t}h(\mathbf{W}_{t+1})\right]_{+}.

(4) \gamma_{t+1}=\left[(1-C_{2}\eta_{t})\gamma_{t}-\eta_{t}\xi_{t+1}\right]_{+}.
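Putting steps (1)-(4) together, below is a minimal sketch of one RPDML iteration. It relies on our own reading of the notation: the retraction follows Appendix C with negative eigenvalues clipped so the iterate stays in $\mathbf{S}_{+}^{n}$, the shorthand $\langle\lambda_{t},\binom{u}{l}\rangle$ in step (2) is expanded as an elementwise product of the dual variables with the bounds $u$ and $l$, and $h(\mathbf{W}_{t+1})$ in step (3) is evaluated with the freshly updated $\xi_{t+1}$; all names are illustrative, and the outer loop, data handling and convergence checks are omitted.

```python
import numpy as np

def logdet_grad(W, W0_inv):
    """Euclidean gradient of (1/2) d^2(W, W0): (1/2)(W0^{-1} - W^{-1})."""
    return 0.5 * (W0_inv - np.linalg.inv(W))

def constraint_grad(lam, X_plus, X_minus):
    """<lambda, dh/dW>: lambda-weighted sum of outer products of the pair differences."""
    lam_p, lam_m = lam[:len(X_plus)], lam[len(X_plus):]
    G = np.einsum('k,ki,kj->ij', lam_p, X_plus, X_plus)      # contribution of h_+
    G -= np.einsum('k,ki,kj->ij', lam_m, X_minus, X_minus)   # h_- enters with a minus sign
    return G

def retract_spd(A, eps=1e-8):
    """Retraction onto S_+^n: eigendecompose and clip eigenvalues (our reading of Appendix C)."""
    vals, vecs = np.linalg.eigh((A + A.T) / 2)
    return (vecs * np.maximum(vals, eps)) @ vecs.T

def rpdml_step(W, xi, lam, gam, W0_inv, X_plus, X_minus, u, l, eta, C1, C2):
    # (1) Riemannian gradient step on W followed by retraction.
    grad_W = logdet_grad(W, W0_inv) + constraint_grad(lam, X_plus, X_minus)
    W_new = retract_spd(W - eta * grad_W)
    # (2) closed-form xi update; <lam,(u;l)> read as elementwise lam * [u,..,u,l,..,l] (assumption).
    ul = np.concatenate([u * np.ones(len(X_plus)), l * np.ones(len(X_minus))])
    xi_new = np.maximum((eta * xi + gam + lam * ul) / (C1 + eta), 0.0)
    # (3)-(4) dual updates with the [.]_+ projection, h evaluated at (W_new, xi_new).
    xi_p, xi_m = xi_new[:len(X_plus)], xi_new[len(X_plus):]
    h = np.concatenate([
        np.einsum('ij,jk,ik->i', X_plus, W_new, X_plus) - u * (1 + xi_p),
        -np.einsum('ij,jk,ik->i', X_minus, W_new, X_minus) + l * (1 - xi_m)])
    lam_new = np.maximum((1 - C2 * eta) * lam + eta * h, 0.0)
    gam_new = np.maximum((1 - C2 * eta) * gam - eta * xi_new, 0.0)
    return W_new, xi_new, lam_new, gam_new
```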

IV-C Experimental Results

Both investors and machine learning researchers have shown great interest in applying machine learning to finance in recent years. The Holy Grail of quantitative investment is the selection of high-quality financial assets with good timing, to achieve higher returns with less risk [42]. To measure the quality and trend of financial assets, technical, fundamental, and macroeconomic factors or features have been developed. Usually, multi-factor regression models [43] are deployed to find the most effective features for achieving higher asset returns. However, when the number of features is large, or heterogeneity/multicollinearity exists among these features, traditional factor-oriented asset selection models tend to fail and may not give encouraging results.

Each asset can be represented by a data point in a high-dimensional feature space, so a good distance metric in that space is crucial for accurate similarity measurement between assets. In our treatment, asset selection can be regarded as a classification problem: assets are divided into two groups, one with positive return and the other with negative return, and the aim is to find an optimal distance metric in feature space that separates these two groups, which is exactly what eq. (8) formulates. Using a metric learning approach to asset selection, the above-mentioned factor-model problems can be largely alleviated. In the following sections, we apply the proposed Riemannian primal-dual metric learning (RPDML) algorithm to the fund of funds (FOF) management problem. FOF is a multi-manager investment strategy whose portfolios are composed of mutual or hedge funds which invest directly in stocks, bonds or other financial assets.

IV-D Data

In this research, we consider Chinese mutual funds that were publicly traded for at least 12 consecutive months in the period 2012-01-01 to 2018-12-31. We select a total of 697 funds with capital size larger than 100 million RMB. Fund features consist of 70 technical factors in total, computed over different rolling windows (10, 14, 21, 28, 42, 63 and 90 trading days, respectively), including ROC, EMA, MDD, STDDEV, Sharpe ratio, Sortino ratio, Calmar ratio, RSI, MACD, and Stability [https://www.investopedia.com/]. All features are normalized to have zero mean and unit variance.

IV-E Backtest Protocol

For mutual fund management, the typical rebalancing interval is one quarter of a year, so we split the original sequential fund data into quarterly segments. We use a rolling-window prediction scheme: in the training set, we learn a distance metric from the 70 technical factors of the previous quarter, with the quarterly return of the current quarter as target. In the test set, features from the current quarter are used to predict the quarterly return of the next quarter. To validate the learned metric, a simple k-nearest neighbors (k-NN) algorithm is employed: the predicted return of the next quarter for each fund is based on the metric learned from the training set. The hyper-parameter k of k-NN is set to 10. To avoid overfitting, rolling data before 2017-01-01 are used for validation and data from 2017-01-01 to 2018-12-31 for testing. For convenience, the following results are based on both the validation and test sets.

For each fund, given its k nearest neighbor funds, we use the average of the k neighbors' returns in the current quarter as the prediction of the fund's return in the next quarter. On each quarterly rebalance day (the last trading day of each quarter), a top-10 buy trading strategy is used based on the prediction for each fund: we rank all funds by their predicted quarterly return and select the top 10 funds for equal-weight portfolio construction and rebalancing.
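A minimal sketch of the prediction-and-selection step described above; the learned metric $\mathbf{W}$, the feature matrices and the return vectors are placeholders, and the function names are our own.

```python
import numpy as np

def predict_next_returns(W, F_train, r_train, F_test, k=10):
    """k-NN prediction under a learned metric W: for each test fund, average the
    current-quarter returns of its k nearest training funds, where the squared
    distance between funds a and b is (a - b) W (a - b)^T."""
    diffs = F_test[:, None, :] - F_train[None, :, :]        # (n_test, n_train, d)
    d2 = np.einsum('abd,de,abe->ab', diffs, W, diffs)       # pairwise metric distances
    nn = np.argsort(d2, axis=1)[:, :k]                      # indices of the k nearest funds
    return r_train[nn].mean(axis=1)                         # predicted next-quarter returns

def top10_portfolio(pred_returns):
    """Equal-weight top-10 selection used at each quarterly rebalance."""
    top = np.argsort(pred_returns)[::-1][:10]
    return top, np.full(10, 0.1)
```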

We compare RPDML with the following five distance metrics and a baseline fund index:

- Euclidean distance metric.

- Mahalanobis distance metric, whose metric matrix is the inverse of the covariance matrix of the training samples.

- LMNN, a distance metric learning algorithm based on the large margin nearest neighbor classification [34].

- ITML, a distance metric learning algorithm based on information geometry [35].

- GMML, a distance metric learning algorithm whose learned metric can be viewed as a “matrix geometric mean” on the Riemannian manifold of positive definite matrices [37].

- Baseline fund index, the CSI Partial Equity Funds Index [http://www.csindex.com.cn/en/indices/index-detail/930950].

IV-F Result

TABLE I: Spearman's correlation / information coefficient (IC) for different algorithms.
Algorithm   Euclidean       Mahalanobis     LMNN            ITML            GMML            RPDML
IC          0.018 ± 0.138   0.050 ± 0.071   0.017 ± 0.201   0.030 ± 0.071   0.058 ± 0.086   0.069 ± 0.187

In the FOF setting, we care more about the ordering of predicted returns than their absolute values, so we calculate Spearman's rank-order correlation between the predicted return and the true return for each algorithm; a higher correlation means better predictive power. In the financial community, this correlation is often called the information coefficient (IC). The calculation is done on a rolling basis, and we show the mean and standard deviation of the IC in Table I. We can see that RPDML achieves the highest mean correlation. The backtest performance using different prediction models is shown in Figure 1. We also show the CSI Partial Equity Funds Index (dashed curves) as a baseline fund index, which reflects the overall performance of all partial equity funds in China's financial market. From the top panel, we observe that all the experimental algorithms outperform the baseline index, and among them RPDML achieves the best performance, with a total accumulated return of 148% over the whole backtest period, while the worst portfolio, based on the Euclidean distance metric, achieves only 25%, even less than the CSI Partial Equity Funds Index. We also plot the one-year rolling maximum drawdown (MDD) in the bottom panel; the MDD of the RPDML algorithm is quite low considering its superior accumulated return.


Figure 1: Portfolio performance comparison of each metric learning algorithm and a baseline fund index. (Top) Accumulated rate of return. (Bottom) Maximum drawdown.

In Figure 2, we show the annual returns of the FOF portfolio for each algorithm. We can see that the proposed algorithm RPDML achieved the highest returns in most years. We also notice that LMNN performed quite well in some years.


Figure 2: Annual return of FOF for each algorithm.

V Conclusion

In this paper, we propose a Riemannian primal-dual algorithm based on proximal operator for optimizing a smooth, convex, lower semicontinuous function on Riemannian manifolds with constraints. We prove convergence of the proposed algorithm and show its non-asymptotic rate. By utilizing the proposed primal-dual optimization technique, we propose a novel metric learning algorithm which learns an optimal feature transformation matrix in the Riemannian space of positive definite matrices. Preliminary experimental results on an optimal fund selection problem in FOF management for quantitative investment showed its efficacy.

VI Appendix

VI-A Proof of Theorem 1

Due to the convexity of $L(x,\lambda)$, for any $x\in M$ we have

L(x,\lambda_{t})\geq L(x_{t+1},\lambda_{t})+\langle s,\exp_{x_{t+1}}^{-1}x\rangle,  (12)

where $s\in\partial L(x_{t+1},\lambda_{t})$ and $\exp_{x_{t+1}}^{-1}x\in T_{x_{t+1}}M$.

Because $M$ is a Hadamard manifold, which has non-positive curvature, we have

d^{2}(x,x_{t})\geq d^{2}(x,x_{t+1})+d^{2}(x_{t+1},x_{t})-2\langle\exp_{x_{t+1}}^{-1}x_{t},\exp_{x_{t+1}}^{-1}x\rangle  (13)

(see [44], Proposition 1).

Multiplying eq. (13) by $\frac{1}{2\eta_{t}}$ and summing the result with eq. (12), we get the following inequality:

L(x,\lambda_{t})+\frac{1}{2\eta_{t}}d^{2}(x,x_{t})\geq L(x_{t+1},\lambda_{t})+\frac{1}{2\eta_{t}}d^{2}(x,x_{t+1})+\frac{1}{2\eta_{t}}d^{2}(x_{t+1},x_{t})+\langle s-\eta_{t}\exp_{x_{t+1}}^{-1}x_{t},\exp_{x_{t+1}}^{-1}x\rangle.  (14)

From eq. (3), we have $0\in\partial L(x_{t+1},\lambda_{t})+\eta_{t}\exp_{x_{t+1}}^{-1}x_{t}$. So

L(x,\lambda_{t})+\frac{1}{2\eta_{t}}d^{2}(x,x_{t})\geq L(x_{t+1},\lambda_{t})+\frac{1}{2\eta_{t}}d^{2}(x,x_{t+1})+\frac{1}{2\eta_{t}}d^{2}(x_{t+1},x_{t}).

Since $\frac{1}{2\eta_{t}}d^{2}(x_{t+1},x_{t})$ is nonnegative, dropping it yields

L(x_{t+1},\lambda_{t})-L(x,\lambda_{t})\leq\frac{1}{2\eta_{t}}d^{2}(x,x_{t})-\frac{1}{2\eta_{t}}d^{2}(x,x_{t+1}).

Let $(x_{*},\lambda_{*})$ be the saddle (min-max) point, which satisfies

$L(x_{*},\lambda)\leq L(x_{*},\lambda_{*})\leq L(x,\lambda_{*})$. By choosing $x=x_{*}$, we have

L(x_{t+1},\lambda_{t})-L(x_{*},\lambda_{t})\leq\frac{1}{2\eta_{t}}d^{2}(x_{*},x_{t})-\frac{1}{2\eta_{t}}d^{2}(x_{*},x_{t+1}).  (15)

Multiplying eq. (15) by $\eta_{t}$ and summing over $t=0,1,2,\dots,T-1$,

\sum_{t=0}^{T-1}\eta_{t}\left(L(x_{t+1},\lambda_{t})-L(x_{*},\lambda_{t})\right)\leq\frac{1}{2}\sum_{t=0}^{T-1}\left(d^{2}(x_{*},x_{t})-d^{2}(x_{*},x_{t+1})\right).  (16)

For any dual variable $\lambda\in\mathbb{R}_{+}^{m}$, we have

\|\lambda_{t+1}-\lambda\|^{2}=\left\|\left[\lambda_{t}+\eta_{t}\,\mathrm{grad}_{\lambda_{t}}L(x_{t+1},\lambda_{t})\right]_{+}-\lambda\right\|^{2}\leq\|\lambda_{t}-\lambda\|^{2}+2\eta_{t}\langle\mathrm{grad}_{\lambda_{t}}L(x_{t+1},\lambda_{t}),\lambda_{t}-\lambda\rangle+\eta_{t}^{2}\|\mathrm{grad}_{\lambda_{t}}L(x_{t+1},\lambda_{t})\|^{2}.

Summing over $t=0,1,2,\dots,T-1$ and using the telescoping sum, we have

\|\lambda_{T}-\lambda\|^{2}-\|\lambda_{0}-\lambda\|^{2}\leq\sum_{t=0}^{T-1}2\eta_{t}\langle\mathrm{grad}_{\lambda_{t}}L(x_{t+1},\lambda_{t}),\lambda_{t}-\lambda\rangle+\sum_{t=0}^{T-1}\eta_{t}^{2}\|\mathrm{grad}_{\lambda_{t}}L(x_{t+1},\lambda_{t})\|^{2}.

Using the fact that $\lambda_{0}=0$ and $\|\lambda_{T}-\lambda\|^{2}\geq 0$, we have

\sum_{t=0}^{T-1}2\eta_{t}\langle\mathrm{grad}_{\lambda_{t}}L(x_{t+1},\lambda_{t}),\lambda-\lambda_{t}\rangle\leq\|\lambda\|^{2}+\sum_{t=0}^{T-1}\eta_{t}^{2}\|\mathrm{grad}_{\lambda_{t}}L(x_{t+1},\lambda_{t})\|^{2}.  (17)

Because $L(x_{t+1},\lambda)$ is concave with respect to $\lambda$,

\langle\mathrm{grad}_{\lambda_{t}}L(x_{t+1},\lambda_{t}),\lambda-\lambda_{t}\rangle\geq L(x_{t+1},\lambda)-L(x_{t+1},\lambda_{t}).

By replacing $\langle\mathrm{grad}_{\lambda_{t}}L(x_{t+1},\lambda_{t}),\lambda-\lambda_{t}\rangle$ in eq. (17), we have

\sum_{t=0}^{T-1}\eta_{t}\left(L(x_{t+1},\lambda)-L(x_{t+1},\lambda_{t})\right)\leq\frac{1}{2}\|\lambda\|^{2}+\frac{1}{2}\sum_{t=0}^{T-1}\eta_{t}^{2}\|\mathrm{grad}_{\lambda_{t}}L(x_{t+1},\lambda_{t})\|^{2}.  (18)

Combining eq. (16) and eq. (18),

\sum_{t=0}^{T-1}\eta_{t}\left(L(x_{t+1},\lambda)-L(x_{*},\lambda_{t})\right)\leq\frac{1}{2}\sum_{t=0}^{T-1}\left(d^{2}(x_{*},x_{t})-d^{2}(x_{*},x_{t+1})\right)+\frac{1}{2}\|\lambda\|^{2}+\frac{1}{2}\sum_{t=0}^{T-1}\eta_{t}^{2}\|\mathrm{grad}_{\lambda_{t}}L(x_{t+1},\lambda_{t})\|^{2}.  (19)

To bound the gradient term $\|\mathrm{grad}_{\lambda_{t}}L(x_{t+1},\lambda_{t})\|^{2}$ in eq. (19), we employ Lemma 13 from Ref. [24]:

\sum_{t=0}^{T-1}\eta_{t}\left(L(x_{t+1},\lambda)-L(x_{*},\lambda_{t})\right)\leq\frac{1}{2}\sum_{t=0}^{T-1}\left(d^{2}(x_{*},x_{t})-d^{2}(x_{*},x_{t+1})\right)+\frac{1}{2}\|\lambda\|^{2}+2mG^{2}\sum_{t=0}^{T-1}\eta_{t}^{2}\leq\frac{1}{2}d^{2}(x_{*},x_{0})+\frac{1}{2}\|\lambda\|^{2}+2mG^{2}\sum_{t=0}^{T-1}\eta_{t}^{2}.  (20)

Now let us expand the left-hand side of eq. (20):

\sum_{t=0}^{T-1}\eta_{t}\left(L(x_{t+1},\lambda)-L(x_{*},\lambda_{t})\right)=\sum_{t=0}^{T-1}\eta_{t}\left(f(x_{t+1})+\langle\lambda,h(x_{t+1})\rangle-\frac{\alpha}{2}\|\lambda\|^{2}-\left(f(x_{*})+\langle\lambda_{t},h(x_{*})\rangle-\frac{\alpha}{2}\|\lambda_{t}\|^{2}\right)\right).

Since $h(x_{*})\preceq 0$, $\alpha>0$, $\lambda_{t}\succeq 0$ and $\|\lambda_{t}\|^{2}\geq 0$, by removing the nonnegative terms $-\langle\lambda_{t},h(x_{*})\rangle$ and $\frac{\alpha}{2}\|\lambda_{t}\|^{2}$ we have the following inequality:

\sum_{t=0}^{T-1}\eta_{t}\left(f(x_{t+1})-f(x_{*})+\langle\lambda,h(x_{t+1})\rangle-\frac{\alpha}{2}\|\lambda\|^{2}\right)\leq\frac{1}{2}d^{2}(x_{*},x_{0})+\frac{1}{2}\|\lambda\|^{2}+2mG^{2}\sum_{t=0}^{T-1}\eta_{t}^{2}.  (21)

By moving $\frac{1}{2}\|\lambda\|^{2}$ from the r.h.s. of eq. (21) to the l.h.s., we have

\sum_{t=0}^{T-1}\eta_{t}\left(f(x_{t+1})-f(x_{*})\right)+\left(\langle\lambda,\sum_{t=0}^{T-1}\eta_{t}h(x_{t+1})\rangle-\frac{\alpha\left(\sum_{t=0}^{T-1}\eta_{t}\right)+1}{2}\|\lambda\|^{2}\right)\leq\frac{1}{2}d^{2}(x_{*},x_{0})+2mG^{2}\sum_{t=0}^{T-1}\eta_{t}^{2}.  (22)

By maximizing $\langle\lambda,\sum_{t=0}^{T-1}\eta_{t}h(x_{t+1})\rangle-\frac{\alpha\left(\sum_{t=0}^{T-1}\eta_{t}\right)+1}{2}\|\lambda\|^{2}$ over $\lambda$, we have $\lambda_{max}=\left(\alpha\sum_{t=0}^{T-1}\eta_{t}+1\right)^{-1}\left[\sum_{t=0}^{T-1}\eta_{t}h(x_{t+1})\right]_{+}$, and

\max_{\lambda}\left(\langle\lambda,\sum_{t=0}^{T-1}\eta_{t}h(x_{t+1})\rangle-\frac{1}{2}\left(\alpha\sum_{t=0}^{T-1}\eta_{t}+1\right)\|\lambda\|^{2}\right)=\frac{1}{2}\left(\alpha\sum_{t=0}^{T-1}\eta_{t}+1\right)^{-1}\left\|\left[\sum_{t=0}^{T-1}\eta_{t}h(x_{t+1})\right]_{+}\right\|^{2}.

Since $\lambda$ can be any value in $\mathbb{R}_{+}^{m}$,

\sum_{t=0}^{T-1}\eta_{t}\left(f(x_{t+1})-f(x_{*})\right)+\left(\langle\lambda,\sum_{t=0}^{T-1}\eta_{t}h(x_{t+1})\rangle-\frac{\alpha\left(\sum_{t=0}^{T-1}\eta_{t}\right)+1}{2}\|\lambda\|^{2}\right)\leq\sum_{t=0}^{T-1}\eta_{t}\left(f(x_{t+1})-f(x_{*})\right)+\frac{1}{2}\left(\alpha\sum_{t=0}^{T-1}\eta_{t}+1\right)^{-1}\left\|\left[\sum_{t=0}^{T-1}\eta_{t}h(x_{t+1})\right]_{+}\right\|^{2}\leq\frac{1}{2}d^{2}(x_{*},x_{0})+2mG^{2}\sum_{t=0}^{T-1}\eta_{t}^{2}.

By removing the maximum term, which is nonnegative, from the l.h.s. of the above inequality, we have

\sum_{t=0}^{T-1}\eta_{t}\left(f(x_{t+1})-f(x_{*})\right)\leq\frac{1}{2}d^{2}(x_{*},x_{0})+2mG^{2}\sum_{t=0}^{T-1}\eta_{t}^{2}.

Since

\sum_{t=0}^{T-1}\eta_{t}\left(f(x_{t+1})-f(x_{*})\right)\geq\sum_{t=0}^{T-1}\eta_{t}\left(\min_{t\in[T]}f(x_{t+1})-f(x_{*})\right)=\left(\min_{t\in[T]}f(x_{t+1})-f(x_{*})\right)\sum_{t=0}^{T-1}\eta_{t},

we have

\min_{t\in[T]}f(x_{t+1})-f(x_{*})\leq\frac{1}{\sum_{t=0}^{T-1}\eta_{t}}\left(\frac{1}{2}d^{2}(x_{*},x_{0})+2mG^{2}\sum_{t=0}^{T-1}\eta_{t}^{2}\right).  (23)

End of proof.

VI-B Proof of Corollary 1

The proof simply follows the proof of Corollary 8 in ref.[24].

First we have the following bounds:

\sum_{t=0}^{T-1}\eta_{t}=\sum_{t=0}^{T-1}\frac{1}{\sqrt{t+1}}\geq\int_{0}^{T-1}\frac{1}{\sqrt{t+1}}dt=2\sqrt{t+1}\Big|_{0}^{T-1}=2\left(\sqrt{T}-1\right),

\sum_{t=0}^{T-1}\eta_{t}^{2}=\sum_{t=0}^{T-1}\frac{1}{t+1}\leq 1+\int_{0}^{T-1}\frac{1}{t+1}dt=1+\log(T).

By using the fact that $d^{2}(x_{*},x_{0})\leq R^{2}$, we have

\min_{t\in[T]}f(x_{t+1})-f(x_{*})\leq\frac{1}{2\left(\sqrt{T}-1\right)}\left(\frac{1}{2}R^{2}+2mG^{2}\left(1+\log(T)\right)\right)=\mathcal{O}\left(\frac{\log(T)}{\sqrt{T}-1}\right).

End of proof.

VI-C Riemannian Gradient with Retraction

We can write the Euclidean gradient as

\mathrm{grad}_{\mathbf{W}}J_{\mathbf{W}}=\frac{\partial}{\partial\mathbf{W}}\left(\frac{1}{2}d^{2}(\mathbf{W},\mathbf{W}_{0})+\langle\lambda_{t},h(\mathbf{W})\rangle\right)=\frac{1}{2}\left(\mathbf{W}_{0}^{-1}-\mathbf{W}^{-1}\right)+\langle\lambda_{t},\frac{\partial h(\mathbf{W})}{\partial\mathbf{W}}\rangle,

where

\frac{\partial h(\mathbf{W})}{\partial\mathbf{W}}=\frac{\partial}{\partial\mathbf{W}}\left(\left[h_{+}(\mathbf{W}),h_{-}(\mathbf{W})\right]^{\top}\right)=\left[\mathrm{diag}\left(\mathrm{Diag}\left(\mathbf{X}_{+}^{\top}\right)\mathrm{Diag}\left(\mathbf{X}_{+}\right)\right),-\mathrm{diag}\left(\mathrm{Diag}\left(\mathbf{X}_{-}^{\top}\right)\mathrm{Diag}\left(\mathbf{X}_{-}\right)\right)\right]^{\top},

and $\mathrm{Diag}(\mathbf{X}_{+}^{\top})/\mathrm{Diag}(\mathbf{X}_{+})$ means constructing a block diagonal matrix whose diagonal blocks are the columns/rows of $\mathbf{X}_{+}^{\top}/\mathbf{X}_{+}$. In other words, each element of $\frac{\partial h(\mathbf{W})}{\partial\mathbf{W}}$ is the covariance matrix $\mathbf{x}_{i}^{\top}\mathbf{x}_{i}$ of a sample pair vector $\mathbf{x}_{i}$, $i=1,2,\dots,N$ (assuming the expectation of the $\mathbf{x}_{i}$ is zero). $\langle\lambda_{t},\frac{\partial h(\mathbf{W})}{\partial\mathbf{W}}\rangle$ is the tensor dot product between $\lambda_{t}$ and a vector of size $N$ whose $i$-th element is the covariance matrix of the $i$-th sample pair.
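To make the weighted outer-product structure of this term concrete, here is a small numerical check against a finite-difference derivative; only the $\mathbf{W}$-dependent part of $h_{+}$ matters for the derivative, and all names and data are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n, N = 4, 6
Xp = rng.standard_normal((N, n))       # rows: differences of similar sample pairs (placeholder)
lam = rng.random(N)                    # nonnegative dual variables for h_+
W = np.eye(n)

def h_plus_lin(W):
    """W-dependent part of h_+: diag(X_+ W X_+^T)."""
    return np.einsum('ij,jk,ik->i', Xp, W, Xp)

# <lambda, dh_+/dW> as a lambda-weighted sum of outer products x_i^T x_i.
G = np.einsum('k,ki,kj->ij', lam, Xp, Xp)

# Finite-difference check of one entry of the gradient of <lambda, h_+(W)>.
eps, a, b = 1e-6, 1, 2
E = np.zeros((n, n))
E[a, b] = eps
fd = (lam @ h_plus_lin(W + E) - lam @ h_plus_lin(W)) / eps
print(G[a, b], fd)                     # the two values agree (the map is linear in W)
```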

1) Riemannian Gradient

First we show that in our case, $\mathrm{Grad}_{\mathbf{W}}J_{\mathbf{W}}=\mathrm{grad}_{\mathbf{W}}J_{\mathbf{W}}$.

Let us define $M_{W}:=\{W=U\Sigma V^{\top}\}$, s.t. $U\in st(m,d)$, $V\in st(n,d)$, and $\Sigma=\mathrm{diag}(\sigma_{1},\sigma_{2},\dots,\sigma_{d})$ with $\sigma_{i}>0$ for all $i$, where $st(m,d)$ is the Stiefel manifold of $m\times d$ real, orthonormal matrices. $M_{W}$ is a Riemannian manifold with tangent space [45, 3]:

$T_{W}M:=\{UMV^{\top}+U_{p}V^{\top}+UV_{p}^{\top}\}$, where $M$ is an arbitrary $d\times d$ matrix, $U_{p}^{\top}U=0$, and $V_{p}^{\top}V=0$.

For a given objective function $J$ which depends on an input matrix $W$ of size $m\times n$, we use $\mathrm{grad}_{\mathbf{W}}J\in\mathbb{R}^{m\times n}$ to represent the Euclidean gradient of $J$ w.r.t. $W$, and we denote by $\mathrm{Grad}_{\mathbf{W}}J$ its Riemannian gradient, obtained by projecting the Euclidean gradient onto the tangent space of $M_{W}$:

$\mathrm{Grad}_{\mathbf{W}}J=P_{U}^{H}\,\mathrm{grad}_{\mathbf{W}}J\,P_{V}^{H}+P_{U}^{\upsilon}\,\mathrm{grad}_{\mathbf{W}}J\,P_{V}^{H}+P_{U}^{H}\,\mathrm{grad}_{\mathbf{W}}J\,P_{V}^{\upsilon}$, where $P_{U}^{H}:=UU^{\top}$, $P_{U}^{\upsilon}:=I-UU^{\top}$, $P_{V}^{H}:=VV^{\top}$, and $P_{V}^{\upsilon}:=I-VV^{\top}$.

For our problem, since $W$ is a real symmetric positive definite matrix, we have $M_{W}:=\{W=Q\Lambda Q^{\top}\}$ and $T_{W}M:=\{QMQ^{\top}\}$, where $Q$ is an orthogonal matrix and $\Lambda$ is a diagonal matrix whose entries are the eigenvalues of $W$ and are greater than or equal to zero.

Since $P_{U}^{H}:=UU^{\top}=QQ^{\top}=I$, $P_{U}^{\upsilon}:=I-UU^{\top}=0$, $P_{V}^{H}:=VV^{\top}=I$, and $P_{V}^{\upsilon}:=I-VV^{\top}=0$, the projection of the Euclidean gradient $\mathrm{grad}_{\mathbf{W}}J_{\mathbf{W}}$ onto the tangent space of $M_{W}$ is

\mathrm{Grad}_{\mathbf{W}}J_{\mathbf{W}}=P_{U}^{H}\,\mathrm{grad}_{\mathbf{W}}J_{\mathbf{W}}\,P_{V}^{H}+P_{U}^{\upsilon}\,\mathrm{grad}_{\mathbf{W}}J_{\mathbf{W}}\,P_{V}^{H}+P_{U}^{H}\,\mathrm{grad}_{\mathbf{W}}J_{\mathbf{W}}\,P_{V}^{\upsilon}=\mathrm{grad}_{\mathbf{W}}J_{\mathbf{W}}.
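A quick numerical confirmation of this identity, using a random placeholder gradient and the eigendecomposition of $\mathbf{W}$ to supply $U=V=Q$:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5
B = rng.standard_normal((n, n))
W = B @ B.T + n * np.eye(n)            # symmetric positive definite W
G = rng.standard_normal((n, n))        # stand-in Euclidean gradient grad_W J

_, Q = np.linalg.eigh(W)               # W = Q Lambda Q^T, so U = V = Q
PH = Q @ Q.T                           # = I because Q is orthogonal
PV = np.eye(n) - PH                    # = 0
proj = PH @ G @ PH + PV @ G @ PH + PH @ G @ PV
print(np.allclose(proj, G))            # True: the projection leaves grad_W J unchanged
```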

2) Retraction

With $\mathbf{W}_{t}$ and $\mathrm{Grad}_{\mathbf{W}}J_{\mathbf{W}}$ as shown above, we would like to calculate $\mathbf{W}_{t+1}$ using retraction. From Ref. [4], for any tangent vector $\eta\in T_{W}M$, its retraction is $R_{W}(\eta):=\arg\min_{X\in M}\|W+\eta-X\|_{F}$. For our case, $R_{\mathbf{W}_{t}}\left(-\eta_{t}\mathrm{Grad}_{\mathbf{W}_{t}}J_{\mathbf{W}_{t}}\right)=\sum_{i=1}^{n}\sigma_{i}q_{i}q_{i}^{\top}$, where $\sigma_{i}$ and $q_{i}$ are the $i$-th eigenvalue and eigenvector of the matrix $\mathbf{W}_{t}-\eta_{t}\mathrm{Grad}_{\mathbf{W}_{t}}J_{\mathbf{W}_{t}}$.

The calculation and retraction shown above are repeated until $J_{\mathbf{W}}$ converges.
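Finally, a small sketch of this retraction. Reading the Frobenius-norm $\arg\min$ over the positive definite cone as clipping negative eigenvalues is our own interpretation; when $\mathbf{W}_{t}-\eta_{t}\mathrm{Grad}_{\mathbf{W}_{t}}J_{\mathbf{W}_{t}}$ is already positive definite, the eigen-reconstruction returns it unchanged.

```python
import numpy as np

def retract(A, eps=1e-10):
    """R applied to A = W_t - eta_t * Grad: eigendecompose, rebuild sum_i sigma_i q_i q_i^T,
    clipping eigenvalues so the result stays in the positive definite cone (our reading)."""
    A = (A + A.T) / 2                          # symmetrize against round-off
    vals, vecs = np.linalg.eigh(A)
    return (vecs * np.maximum(vals, eps)) @ vecs.T

# If the gradient step stays inside the cone, the retraction changes nothing:
W = np.array([[2.0, 0.3], [0.3, 1.0]])
print(np.allclose(retract(W), W))              # True

# Otherwise it returns the Frobenius-nearest PD matrix; another PD candidate is farther away:
A = np.array([[1.0, 0.2], [0.2, -0.5]])
R, C = retract(A), np.array([[0.8, 0.1], [0.1, 0.1]])
print(np.linalg.norm(A - R), np.linalg.norm(A - C))   # the first value is smaller
```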

References

  • [1] Y. Nishimori, S. Akaho, and M. D. Plumbley, “Riemannian optimization method on the flag manifold for independent subspace analysis,” in International Conference on Independent Component Analysis and Signal Separation.   Springer, 2006, pp. 295–302.
  • [2] P.-A. Absil, C. G. Baker, and K. A. Gallivan, “Trust-region methods on riemannian manifolds,” Foundations of Computational Mathematics, vol. 7, no. 3, pp. 303–330, 2007.
  • [3] B. Vandereycken, “Low-rank matrix completion by Riemannian optimization,” SIAM Journal on Optimization, vol. 23, No. 2, pp. 1214–1236, 2013.
  • [4] L. Cheng, “Riemannian similarity learning,” in International Conference on Machine Learning, 2013, pp. 540–548.
  • [5] A. Cherian and S. Sra, “Riemannian dictionary learning and sparse coding for positive definite matrices,” IEEE transactions on neural networks and learning systems, vol. 28, no. 12, pp. 2859–2871, 2017.
  • [6] P.-A. Absil, R. Mahony, and R. Sepulchre, “Optimization on manifolds: Methods and applications,” in Recent Advances in Optimization and its Applications in Engineering.   Springer, 2010, pp. 125–144.
  • [7] D. Gabay, “Minimizing a differentiable function over a differential manifold,” Journal of Optimization Theory and Applications, vol. 37, no. 2, pp. 177–219, 1982.
  • [8] S. T. Smith, “Optimization techniques on riemannian manifolds,” Fields institute communications, vol. 3, no. 3, pp. 113–135, 1994.
  • [9] G. de Carvalho Bento, O. P. Ferreira, and P. R. Oliveira, “Unconstrained steepest descent method for multicriteria optimization on riemannian manifolds,” J. Optimization Theory and Applications, vol. 154, no. 1, pp. 88–107, 2012.
  • [10] H. Sato and T. Iwai, “A new, globally convergent riemannian conjugate gradient method,” Optimization, vol. 64, no. 4, pp. 1011–1031, 2015.
  • [11] Y. Liu, F. Shang, J. Cheng, H. Cheng, and L. Jiao, “Accelerated first-order methods for geodesically convex optimization on riemannian manifolds,” in Advances in Neural Information Processing Systems, 2017, pp. 4868–4877.
  • [12] C. Qi, K. A. Gallivan, and P.-A. Absil, “Riemannian bfgs algorithm with applications,” in Recent advances in optimization and its applications in engineering.   Springer, 2010, pp. 183–192.
  • [13] G. Bécigneul and O. Ganea, “Riemannian adaptive optimization methods,” CoRR, vol. abs/1810.00760, 2018. [Online]. Available: http://arxiv.org/abs/1810.00760
  • [14] C. Qi, “Numerical optimization methods on riemannian manifolds,” 2011.
  • [15] N. Agarwal, N. Boumal, B. Bullins, and C. Cartis, “Adaptive regularization with cubics on manifolds with a first-order analysis,” arXiv preprint arXiv:1806.00065, 2018.
  • [16] M. Schmidt, N. Le Roux, and F. Bach, “Minimizing finite sums with the stochastic average gradient,” Mathematical Programming, vol. 162, no. 1-2, pp. 83–112, 2017.
  • [17] R. Johnson and T. Zhang, “Accelerating stochastic gradient descent using predictive variance reduction,” in Advances in neural information processing systems, 2013, pp. 315–323.
  • [18] A. Defazio, F. Bach, and S. Lacoste-Julien, “Saga: A fast incremental gradient method with support for non-strongly convex composite objectives,” in Advances in neural information processing systems, 2014, pp. 1646–1654.
  • [19] H. Zhang, S. J. Reddi, and S. Sra, “Riemannian svrg: Fast stochastic optimization on riemannian manifolds,” in Advances in Neural Information Processing Systems, 2016, pp. 4592–4600.
  • [20] H. Sato, H. Kasai, and B. Mishra, “Riemannian stochastic variance reduced gradient,” arXiv preprint arXiv:1702.05594, 2017.
  • [21] H. Kasai, H. Sato, and B. Mishra, “Riemannian stochastic quasi-newton algorithm with variance reduction and its convergence analysis,” arXiv preprint arXiv:1703.04890, 2017.
  • [22] A. Hauswirth, S. Bolognani, G. Hug, and F. Dörfler, “Projected gradient descent on riemannian manifolds with applications to online power system optimization,” in 2016 54th Annual Allerton Conference on Communication, Control, and Computing (Allerton).   IEEE, 2016, pp. 225–232.
  • [23] J. Zhang, S. Ma, and S. Zhang, “Primal-dual optimization algorithms over riemannian manifolds: an iteration complexity analysis,” arXiv preprint arXiv:1710.02236, 2017.
  • [24] M. B. Khuzani and N. Li, “Stochastic primal-dual method on riemannian manifolds of bounded sectional curvature,” in Machine Learning and Applications (ICMLA), 2017 16th IEEE International Conference on.   IEEE, 2017, pp. 133–140.
  • [25] R. T. Rockafellar, “Augmented lagrangians and applications of the proximal point algorithm in convex programming,” Mathematics of operations research, vol. 1, no. 2, pp. 97–116, 1976.
  • [26] N. Parikh, S. Boyd et al., “Proximal algorithms,” Foundations and Trends® in Optimization, vol. 1, no. 3, pp. 127–239, 2014.
  • [27] O. Ferreira and P. Oliveira, “Proximal point algorithm on riemannian manifolds,” Optimization, vol. 51, no. 2, pp. 257–270, 2002.
  • [28] M. Bačák, “The proximal point algorithm in metric spaces,” Israel Journal of Mathematics, vol. 194, no. 2, pp. 689–701, 2013.
  • [29] S. Chen, S. Ma, A. M.-C. So, and T. Zhang, “Proximal gradient method for manifold optimization,” arXiv preprint arXiv:1811.00980, 2018.
  • [30] L. P. Eisenhart, Riemannian geometry.   Princeton university press, 2016.
  • [31] B. Kulis et al., “Metric learning: A survey,” Foundations and Trends® in Machine Learning, vol. 5, no. 4, pp. 287–364, 2013.
  • [32] A. Bellet, A. Habrard, and M. Sebban, “A survey on metric learning for feature vectors and structured data,” arXiv preprint arXiv:1306.6709, 2013.
  • [33] E. P. Xing, M. I. Jordan, S. J. Russell, and A. Y. Ng, “Distance metric learning with application to clustering with side-information,” in Advances in neural information processing systems, 2003, pp. 521–528.
  • [34] K. Q. Weinberger, J. Blitzer, and L. K. Saul, “Distance metric learning for large margin nearest neighbor classification,” in Advances in neural information processing systems, 2006, pp. 1473–1480.
  • [35] J. V. Davis, B. Kulis, P. Jain, S. Sra, and I. S. Dhillon, “Information-theoretic metric learning,” in Proceedings of the 24th international conference on Machine learning.   ACM, 2007, pp. 209–216.
  • [36] S. Wang and R. Jin, “An information geometry approach for distance metric learning,” in Artificial Intelligence and Statistics, 2009, pp. 591–598.
  • [37] P. Zadeh, R. Hosseini, and S. Sra, “Geometric mean metric learning,” in International Conference on Machine Learning, 2016, pp. 2464–2471.
  • [38] S. Mahadevan, B. Mishra, and S. Ghosh, “A unified framework for domain adaptation using metric learning on manifolds,” arXiv preprint arXiv:1804.10834, 2018.
  • [39] E. S. Levitin and B. T. Polyak, “Constrained minimization methods,” Zhurnal Vychislitel’noi Matematiki i Matematicheskoi Fiziki, vol. 6, no. 5, pp. 787–823, 1966.
  • [40] S. Shalev-Shwartz, Y. Singer, and A. Y. Ng, “Online and batch learning of pseudo-metrics,” in Proceedings of the twenty-first international conference on Machine learning.   ACM, 2004, p. 94.
  • [41] P. Jain, B. Kulis, I. S. Dhillon, and K. Grauman, “Online metric learning and fast similarity search,” in Advances in neural information processing systems, 2009, pp. 761–768.
  • [42] R. A. DeFusco, D. W. McLeavey, J. E. Pinto, M. J. Anson, and D. E. Runkle, Quantitative investment analysis.   John Wiley & Sons, 2015.
  • [43] E. F. Fama and K. R. French, “The capital asset pricing model: Theory and evidence,” Journal of economic perspectives, vol. 18, no. 3, pp. 25–46, 2004.
  • [44] G. C. Bento, O. P. Ferreira, and J. G. Melo, “Iteration-complexity of gradient, subgradient and proximal point methods on riemannian manifolds,” Journal of Optimization Theory and Applications, vol. 173, no. 2, pp. 548–562, 2017.
  • [45] P. A. Absil, R. Mahony, and R. Sepulchre, Optimization algorithms on matrix manifolds.   Princeton University Press, 2009.