Deep Optimal Transport for Domain Adaptation on SPD Manifolds
Abstract
The machine learning community has shown increasing interest in addressing the domain adaptation problem on symmetric positive definite (SPD) manifolds. This interest is primarily driven by the complexities of neuroimaging data generated from brain signals, which often exhibit shifts in data distribution across recording sessions. These neuroimaging data, represented by signal covariance matrices, possess the mathematical properties of symmetry and positive definiteness. However, applying conventional domain adaptation methods is challenging because these mathematical properties can be disrupted when operating on covariance matrices. In this study, we introduce a novel geometric deep learning-based approach utilizing optimal transport on SPD manifolds to manage discrepancies in both marginal and conditional distributions between the source and target domains. We evaluate the effectiveness of this approach in three cross-session brain-computer interface scenarios and provide visualized results for further insights. The GitHub repository of this study can be accessed at https://github.com/GeometricBCI/Deep-Optimal-Transport-for-Domain-Adaptation-on-SPD-Manifolds.
Index Terms:
Domain Adaptation, Optimal Transport, Geometric Deep Learning, Neural Signal Processing
I Introduction
In cross-domain problems, the distribution of source data often differs from that of target data, leading to what is widely recognized as a domain adaptation problem in the machine learning community. This issue has gained considerable attention over the past decade [1, 2, 3]. For example, signals in an online testing set frequently differ from those in the calibration set, despite being generated by the same process, highlighting a common domain adaptation challenge.
In domain adaptation, a domain consists of a feature space $\mathcal{X}$ and a marginal probability distribution $P(X)$, denoted as $\mathcal{D} := \{\mathcal{X}, P(X)\}$. Meanwhile, a task comprises a label space $\mathcal{Y}$ and a conditional distribution $P(Y \mid X)$, represented by $\mathcal{T} := \{\mathcal{Y}, P(Y \mid X)\}$. Domain adaptation aims to build a classifier in the target domain $\mathcal{D}_T$ by leveraging the information obtained from the source domain $\mathcal{D}_S$, where $\mathcal{D}_S \neq \mathcal{D}_T$ and $\mathcal{T}_S \neq \mathcal{T}_T$. In particular, $\mathcal{D}_S \neq \mathcal{D}_T$ yields either $\mathcal{X}_S \neq \mathcal{X}_T$ or $P_S(X) \neq P_T(X)$, whereas $\mathcal{T}_S \neq \mathcal{T}_T$ results in either $\mathcal{Y}_S \neq \mathcal{Y}_T$, $P_S(Y \mid X) \neq P_T(Y \mid X)$, or $P_S(Y) \neq P_T(Y)$.
The domain adaptation problem can be broadly categorized into unsupervised and semi-supervised scenarios. In the unsupervised scenario, no labels are available in the target domain, while in the semi-supervised scenario, a limited number of labeled instances are available in the target domain to guide the adaptation process. For instance, the variability of electroencephalography (EEG) signals often results in unreliable and non-robust performance when transitioning from calibration to feedback phases, as these sessions exhibit distinct distributions. This variability in EEG-based classification exemplifies a typical domain adaptation problem [4]. In response to this issue, a covariate shift adaptation approach has been proposed in the machine learning community [5]. This approach assumes that the difference between domains is characterized by a change in the feature space, while the conditional distributions remain unchanged.
Framework | Cost function | Underlying Space | Transformation | Domain Adaptation Scenario
---|---|---|---|---
OT-DA [6, 7] | $\|x_S - x_T\|_2^2$ | Euclidean | Affine | $P_S(X) \neq P_T(X)$
JDOT-DA [8, 9] | $\|x_S - x_T\|_2^2$ + label loss | Euclidean | Affine | $P_S(X) \neq P_T(X)$, $P_S(Y \mid X) \neq P_T(Y \mid X)$
OT-DA on SPD manifolds [10] | $\|S_S - S_T\|_F^2$ | $\mathcal{S}_{++}$ | Bi-Map | $P_S(X) \neq P_T(X)$
OT-DA on SPD manifolds (Proposed) | $d^2_{\mathrm{LEM}}(S_S, S_T)$ | ($\mathcal{S}_{++}$, $g^{\mathrm{LEM}}$) | Neural Networks | $\mathcal{X}_S = \mathcal{X}_T$, $P_S(X) \neq P_T(X)$, and $P_S(Y \mid X) \neq P_T(Y \mid X)$
Among the various methods that have been proposed to address the domain adaptation problem across numerous engineering scenarios, Optimal Transport (OT) has emerged as a novel and effective solution for aligning source and target domains [6, 7]. The OT approach introduces regularized optimal transport, allowing a classifier to learn from labeled source data and apply this knowledge to target data. The goal of the Optimal Transport-Domain Adaptation (OT-DA) framework is to minimize the overall transportation costs between two distributions, quantified by the Wasserstein metric. It seeks to find a push-forward transformation $T$ that satisfies $T_{\#} P_S(X) = P_T(X)$. The strengths of the OT-DA framework lie in its simplified evaluation of empirical data distributions and its enhanced ability to leverage the underlying geometry of discrete samples [6].
While broad in scope, the OT-DA framework can benefit from enhancements derived from insights into specific applications. For instance, in EEG-based motor imagery classification, EEG signals are processed as (average) spatial covariance matrices after segmentation in time and averaging over trials [11, 12]. The cross-session scenario of motor imagery classification therefore takes place in the space of covariance matrices, which forms a symmetric positive definite (SPD) manifold when endowed with a Riemannian metric. In this context, Yair et al. utilized the squared Euclidean distance as the cost function in the OT-DA framework, deriving a closed-form solution through Brenier's polar factorization [10].
Although Yair et al.'s work pioneered the OT-DA framework on SPD manifolds, it presents certain theoretical and practical shortcomings. Theoretically, their chosen cost function, based on the Euclidean distance, undermines the consistency of the mathematical formulation from the viewpoint of differential geometry. Practically, their methodology fails to account for disparities within conditional distributions. Additionally, the unique challenges inherent to the problem, such as the scarcity of labeled data during the feedback phase and the divergence of spatial covariance matrices from distinct frequency bands, pose significant modeling difficulties. These challenges necessitate the development of a more adaptable OT-DA framework to effectively address problems on SPD manifolds.
To address these challenges, we consider a more applicable scenario where $\mathcal{X}_S = \mathcal{X}_T$, $\mathcal{Y}_S = \mathcal{Y}_T$, $P_S(X) \neq P_T(X)$, and $P_S(Y \mid X) \neq P_T(Y \mid X)$. This situation is typically handled through joint distribution adaptation, which concurrently aligns both marginal and conditional distributions across domains, even when dealing with implicit probability distributions, without relying on labeled information from the target domain [13]. To facilitate modeling of this scenario, we formulate the space of SPD matrices using the Log-Euclidean metric on SPD manifolds. The Log-Euclidean metric is a commonly used Riemannian metric on SPD manifolds, allowing operations to be performed in the logarithmic domain where the structure is Euclidean [14]. Moreover, in this context, parallel transport on SPD manifolds with the Log-Euclidean metric simplifies to an identity map, ensuring that the feature spaces of the source and target domains, which are the tangent space at the identity of the SPD manifold, are the same.
In this study, we propose a novel geometric deep learning architecture called Deep Optimal Transport (DOT) to address the joint distribution adaptation problem on SPD manifolds, utilizing the squared Log-Euclidean distance as the cost function in optimal transport, i.e., $c(S_1, S_2) := d^2_{\mathrm{LEM}}(S_1, S_2) = \|\log S_1 - \log S_2\|_F^2$. DOT is predicated on the assumption that information transfer between domains can be effectively estimated through optimal transport. The goal of DOT is to find a transformation $T$ on SPD manifolds such that $P_S\big(T(X)\big) = P_T\big(T(X)\big)$ and $P_S\big(Y \mid T(X)\big) = P_T\big(Y \mid T(X)\big)$. To evaluate the effectiveness of DOT, we focus on cross-session scenarios in motor imagery classification. This problem is notably challenging and often inadequately addressed, primarily due to the extensive patterns of synchronized neuronal activity that continuously shift over time. This results in significant variability in brain signals captured by EEG devices across different sessions. In our evaluation, we calibrate the model in one session and then test it in another, encompassing both unsupervised and semi-supervised domain adaptation scenarios.
The remainder of this paper is organized as follows: Section II delves into the Log-Euclidean geometry, discrete OT problems, joint distribution optimal transport, and SPD matrix-valued motor imagery classifiers. Section III presents the OT-DA framework on SPD manifolds and proposes the architecture of DOT. Section IV applies DOT to motor imagery classification. Section V exhibits results on synthetic and real-world datasets. A comparison of various frameworks is summarized in Table I.
II PRELIMINARY
Notations and Conventions
A matrix $S \in \mathbb{R}^{n \times n}$ is symmetric positive-definite if it is symmetric and all its eigenvalues are positive, i.e., $S = S^\top$ and $x^\top S x > 0$ for all nonzero $x \in \mathbb{R}^n$. The space of $n \times n$ SPD matrices is denoted by $\mathcal{S}_{++}^n$ (or $\mathcal{S}_{++}$ when the dimension is clear). The Frobenius inner product and Frobenius norm of matrices $A$ and $B$ are defined as $\langle A, B \rangle_F := \mathrm{tr}(A^\top B)$ and $\|A\|_F := \sqrt{\mathrm{tr}(A^\top A)}$, respectively.
II-A Log-Euclidean Metric
The Log-Euclidean metric, a widely used Riemannian metric on SPD manifolds, introduces a vector space structure to the space of SPD matrices, endowing it with numerous exceptional properties [14, 15]. These include invariance under inversion, logarithmic multiplication, orthogonal transformations, and scaling. Formally, a commutative Lie group structure, known as the tensor Lie group, is endowed on $\mathcal{S}_{++}^n$ using the logarithmic multiplication $\odot$ defined by $S_1 \odot S_2 := \exp(\log S_1 + \log S_2)$, where $S_1, S_2 \in \mathcal{S}_{++}^n$, and $\exp$ and $\log$ are the matrix exponential and logarithm, respectively. A bi-invariant metric exists on the tensor Lie group, which yields the following scalar product: $g^{\mathrm{LEM}}_S(v_1, v_2) := \langle D_S \log(v_1), D_S \log(v_2) \rangle$, where $v_1$ and $v_2$ are tangent vectors at the matrix $S \in \mathcal{S}_{++}^n$, and $D_S \log$ represents the differential of the matrix logarithm at $S$. We denote an SPD manifold equipped with the Log-Euclidean metric as ($\mathcal{S}_{++}$, $g^{\mathrm{LEM}}$). The sectional curvature of ($\mathcal{S}_{++}$, $g^{\mathrm{LEM}}$) is flat, and thus parallel transport on it simplifies to the identity map. The Riemannian distance between $S_1$ and $S_2$ on this SPD manifold is $d_{\mathrm{LEM}}(S_1, S_2) := \|\log S_1 - \log S_2\|_F$. The Log-Euclidean Fréchet mean of a batch of SPD matrices $\{S_i\}_{i=1}^{N}$ is defined as follows:
$$\bar{S} := \exp\Big(\frac{1}{N}\sum_{i=1}^{N}\log S_i\Big). \qquad (1)$$
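As an illustration, the following NumPy sketch computes the Log-Euclidean distance and the Fréchet mean of Equation (1) via eigendecompositions; the function names are ours and chosen only for exposition.

```python
import numpy as np

def spd_logm(S):
    """Matrix logarithm of an SPD matrix via its eigendecomposition."""
    w, V = np.linalg.eigh(S)
    return (V * np.log(w)) @ V.T

def spd_expm(L):
    """Matrix exponential of a symmetric matrix via its eigendecomposition."""
    w, V = np.linalg.eigh(L)
    return (V * np.exp(w)) @ V.T

def lem_distance(S1, S2):
    """d_LEM(S1, S2) = ||log S1 - log S2||_F."""
    return np.linalg.norm(spd_logm(S1) - spd_logm(S2), "fro")

def lem_frechet_mean(batch):
    """Log-Euclidean Fréchet mean of Equation (1) with uniform weights."""
    return spd_expm(np.mean([spd_logm(S) for S in batch], axis=0))

# Example with random SPD matrices of the form G G^T + eps * I.
rng = np.random.default_rng(0)
batch = [G @ G.T + 1e-2 * np.eye(4) for G in rng.standard_normal((5, 4, 4))]
print(lem_distance(batch[0], lem_frechet_mean(batch)))
```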
In this study, another Riemannian metric on SPD manifolds, the affine-invariant Riemannian metric [16], is introduced for comparison with $g^{\mathrm{LEM}}$. The SPD manifold equipped with this metric is denoted as ($\mathcal{S}_{++}$, $g^{\mathrm{AIRM}}$). Formally, for any $S \in \mathcal{S}_{++}^n$, it is defined as $g^{\mathrm{AIRM}}_S(v_1, v_2) := \langle S^{-\frac{1}{2}} v_1 S^{-\frac{1}{2}}, S^{-\frac{1}{2}} v_2 S^{-\frac{1}{2}} \rangle$, where $v_1$ and $v_2$ are tangent vectors at $S$. ($\mathcal{S}_{++}$, $g^{\mathrm{AIRM}}$) also exhibits nice properties, such as affine invariance, scale invariance, and inversion invariance. Since these properties are not the primary focus of this study, please refer to [17] for details.
II-B Discrete Optimal Transport
Optimal Transport (OT) seeks the minimal-cost transport plan between source and target distributions [18]. In this study, we consider the case where these distributions are available only through a finite number of samples. Given two measurable spaces $\mathcal{X}$ and $\mathcal{Y}$, empirical distributions of the measures $\mu$ and $\nu$ are written as $\mu = \sum_{i \in I} p_i\, \delta_{x_i}$ and $\nu = \sum_{j \in J} q_j\, \delta_{y_j}$, where the discrete variable takes values $\{x_i\}_{i \in I}$ over the countable index set $I$ with weights $\{p_i\}$, and takes values $\{y_j\}_{j \in J}$ over the countable index set $J$ with weights $\{q_j\}$. The Dirac measure $\delta_x$ denotes a unit point mass at the point $x$. The discrete Kantorovich formulation of the OT problem is given as $\gamma^\ast := \arg\min_{\gamma \in \Gamma} \langle \gamma, C \rangle_F$, where the transportation plan $\gamma$ lies in $\Gamma := \{\gamma \in \mathbb{R}_{+}^{|I| \times |J|} : \gamma\,\mathbf{1}_{|J|} = p,\ \gamma^\top \mathbf{1}_{|I|} = q\}$, $C_{ij}$ is the cost of transporting $x_i$ to $y_j$, and $\mathbf{1}_d$ is a $d$-dimensional all-one vector.
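A minimal sketch of this discrete Kantorovich problem, assuming uniform weights and using the exact solver `ot.emd` from the POT package (one possible choice of solver, not necessarily the one used in this study):

```python
import numpy as np
import ot  # POT: Python Optimal Transport

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 3))         # source samples x_i
y = rng.standard_normal((10, 3)) + 1.0  # target samples y_j
p = np.full(8, 1 / 8)                   # source weights p_i
q = np.full(10, 1 / 10)                 # target weights q_j

C = ot.dist(x, y)                       # squared Euclidean cost matrix C_ij
gamma = ot.emd(p, q, C)                 # exact Kantorovich plan (linear program)

# The plan's marginals match the prescribed weights.
assert np.allclose(gamma.sum(axis=1), p) and np.allclose(gamma.sum(axis=0), q)
```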
II-C Joint Distribution Optimal Transport
Assuming that the transfer between source and target distributions can be estimated using optimal transport, Joint Distribution Optimal Transport (JDOT) [8] extends the OT-DA framework by jointly optimizing both the optimal coupling $\gamma$ and the prediction function $f$ for the two distributions, i.e., $\min_{\gamma \in \Gamma,\, f} \sum_{i, j} \gamma_{ij}\, \mathcal{D}\big(x_i^S, y_i^S;\ x_j^T, f(x_j^T)\big)$, where $y_i^S$ is the label of the source sample $x_i^S$, $f$ is the prediction function, and the joint cost function $\mathcal{D}$ is a linear combination of a distance function and a cross-entropy loss, i.e., $\mathcal{D}\big(x_i^S, y_i^S; x_j^T, f(x_j^T)\big) := \alpha\, d^2(x_i^S, x_j^T) + \mathcal{L}\big(y_i^S, f(x_j^T)\big)$. Moreover, DeepJDOT employs neural networks $g$ to map raw data into a feature space and thus extends JDOT by setting the new joint objective $\mathcal{D} := \alpha\, d^2\big(g(x_i^S), g(x_j^T)\big) + \mathcal{L}\big(y_i^S, f(g(x_j^T))\big)$, where $g$ represents the neural networks [9].
II-D SPD Matrix-Valued Motor Imagery Classifier
Extensive research has explored various classes of Riemannian metrics on SPD manifolds for various applications. These explorations have led to the introduction of numerous statistical methods to address computational issues on SPD manifolds, drawing inspiration from manifold-valued neuroimaging data [19, 20]. During the feature engineering era, Barachant et al. utilized the Riemannian distance between EEG spatial covariance matrices for motor imagery classification, resulting in a significant impact on various brain-computer interface applications [21, 22, 23].
In the deep learning era, Ju et al. employed geometric deep learning on SPD manifolds to extract discriminative information from the time-space-frequency domain for motor imagery classification, ultimately enhancing classification performance [24, 25, 26, 27, 28]. The neural network layers utilized in these works are derived from SPDNet [29], which is specifically designed to process SPD matrices while preserving their SPD properties throughout the layers during non-linear learning. These layers include:
- Bi-Map Layer: This layer transforms a covariance matrix by applying the Bi-Map transformation $S^{(l)} := W^{(l)} S^{(l-1)} W^{(l)\top}$, where the transformation matrix $W^{(l)}$ must have full row rank.
- ReEig Layer: This layer introduces non-linearity on SPD manifolds using $S^{(l)} := U^{(l-1)} \max(\epsilon I, \Sigma^{(l-1)})\, U^{(l-1)\top}$, where $S^{(l-1)} = U^{(l-1)} \Sigma^{(l-1)} U^{(l-1)\top}$ is the eigendecomposition, $\epsilon$ is a rectification threshold, and $I$ is the identity matrix.
- LOG Layer: This layer maps elements on SPD manifolds to their tangent space using $S^{(l)} := \log(S^{(l-1)}) = U^{(l-1)} \log(\Sigma^{(l-1)})\, U^{(l-1)\top}$, where $S^{(l-1)} = U^{(l-1)} \Sigma^{(l-1)} U^{(l-1)\top}$. The logarithm applied to a diagonal matrix takes the logarithm of each diagonal element individually.
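For illustration, a minimal PyTorch sketch of the forward computations of these three layers is given below; it omits the Stiefel-manifold constraint and Riemannian optimizer that full SPDNet implementations place on the Bi-Map weights, and the initialization shown is our own simplification.

```python
import torch
import torch.nn as nn

class BiMap(nn.Module):
    """S -> W S W^T with a full-row-rank weight W (Stiefel constraint omitted)."""
    def __init__(self, dim_in, dim_out):
        super().__init__()
        self.W = nn.Parameter(torch.randn(dim_out, dim_in) / dim_in ** 0.5)
    def forward(self, S):
        return self.W @ S @ self.W.transpose(-1, -2)

class ReEig(nn.Module):
    """Eigenvalue rectification: clamp eigenvalues below at a threshold eps."""
    def __init__(self, eps=1e-4):
        super().__init__()
        self.eps = eps
    def forward(self, S):
        w, V = torch.linalg.eigh(S)
        return V @ torch.diag_embed(torch.clamp(w, min=self.eps)) @ V.transpose(-1, -2)

class LogEig(nn.Module):
    """Matrix logarithm, mapping SPD matrices to the tangent space at the identity."""
    def forward(self, S):
        w, V = torch.linalg.eigh(S)
        return V @ torch.diag_embed(torch.log(w)) @ V.transpose(-1, -2)

# Forward pass on a batch of 20x20 SPD matrices (e.g., EEG spatial covariances).
X = torch.randn(4, 20, 20)
S = X @ X.transpose(-1, -2) + 1e-3 * torch.eye(20)
feat = LogEig()(ReEig()(BiMap(20, 20)(S)))
```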
III Methodology
In this section, we generalize the OT-DA framework to ($\mathcal{S}_{++}$, $g^{\mathrm{LEM}}$) and propose a novel geometric deep learning approach to address the problems formulated under this framework.
III-A OT-DA Framework on ($\mathcal{S}_{++}$, $g^{\mathrm{LEM}}$)
We first calculate the gradient of the squared distance function on general Riemannian manifolds.
Lemma 1.
Let $(\mathcal{M}, g)$ be a connected, compact, and smooth Riemannian manifold without boundary. Consider a compact subset $\Omega \subset \mathcal{M}$ and a fixed point $y \in \Omega$. Let $f_y(x) := d^2(x, y)$ denote the squared distance function. Then, for $x$ not on the boundary of $\Omega$, the gradient of $f_y$ is given by $\nabla f_y(x) = -2 \exp_x^{-1}(y)$.
Proof.
Without loss of generality, consider a point $x \in \Omega$ not on the boundary. There exists an open neighborhood $U$ of $x$ and $\epsilon > 0$ such that the exponential map $\exp_x$ maps the ball $B(0, \epsilon) \subset T_x\mathcal{M}$ diffeomorphically onto $U$ [30, Corollary 5.5.2.]. Let $\gamma$ be a geodesic on $\mathcal{M}$, parameterized with constant speed, with $\gamma(0) = x$ and $\gamma(1) = y$. By reversing the direction of the geodesic $\gamma$, we may exchange the roles of $x$ and $y$. We proceed by linearizing the exponential maps around $x$ and $y$ and derive the derivative of $f_y$ along $\gamma$,
where the differential in this computation is the differential of the exponential map at $x$ with respect to its argument. The fourth equality follows from the constant-speed parameterization of $\gamma$ on $[0, 1]$, and the fifth equality follows from Gauss' lemma [31, Section 3.3.5.]. Hence, by the definition of the gradient on Riemannian manifolds, we have $g_x\big(\nabla f_y(x), \cdot\big) = d f_y(x)(\cdot)$, which implies that $\nabla f_y(x) = -2 \exp_x^{-1}(y)$. ∎
We say a function $\varphi: \mathcal{M} \to \mathbb{R} \cup \{-\infty\}$ is c-concave if it is not identically $-\infty$ and there exists $\psi: \mathcal{M} \to \mathbb{R} \cup \{-\infty\}$ such that
$$\varphi(x) = \inf_{y \in \mathcal{M}} \big( c(x, y) - \psi(y) \big),$$
where $c(x, y) \geq 0$ is a nonnegative cost function that measures the cost of transporting mass from $x$ to $y$. Brenier's polar factorization on general Riemannian manifolds is given as follows:
Lemma 2 (Brenier’s Polar Factorization [32]).
Given any compactly supported measures $\mu, \nu \in \mathcal{P}(\mathcal{M})$, if $\mu$ is absolutely continuous with respect to the Riemannian volume, there exists a unique optimal transport map $T: \mathcal{M} \to \mathcal{M}$ pushing $\mu$ forward to $\nu$, given as follows:
$$T(x) = \exp_x\big( -\nabla \varphi(x) \big),$$
where $\varphi$ is a c-concave function with respect to the cost function $c(x, y) := \frac{1}{2} d^2(x, y)$, and $\exp_x$ is the exponential map at $x$ on $\mathcal{M}$.
On the SPD manifold ($\mathcal{S}_{++}$, $g^{\mathrm{LEM}}$), let us consider a source measure $\mu_S$ and a target measure $\mu_T$ that are accessible only through finite samples $\{S_i^S\}_{i=1}^{N_S}$ and $\{S_j^T\}_{j=1}^{N_T}$. Each sample contributes to its respective empirical distribution with weights $p_i^S$ and $p_j^T$. We assume there exists a linear transformation on ($\mathcal{S}_{++}$, $g^{\mathrm{LEM}}$) between the two discrete feature-space distributions of the source and target domains that can be estimated with the discrete Kantorovich problem, i.e., there exists
$$\gamma^\ast := \underset{\gamma \in \Gamma}{\arg\min}\ \sum_{i=1}^{N_S}\sum_{j=1}^{N_T} \gamma_{ij}\, d^2_{\mathrm{LEM}}\big(S_i^S, S_j^T\big), \qquad (2)$$
where $\Gamma := \{\gamma \in \mathbb{R}_{+}^{N_S \times N_T} : \gamma\,\mathbf{1}_{N_T} = p^S,\ \gamma^\top \mathbf{1}_{N_S} = p^T\}$. The pseudocode for the computation of the transport plan is presented in Algorithm 1.
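A sketch of this step is given below, assuming uniform weights on both empirical distributions and using the exact solver `ot.emd` from the POT package; the helper `barycentric_targets` anticipates the barycentric coordinates discussed next, and its name is our own.

```python
import numpy as np
import ot

def spd_logm(S):
    """Matrix logarithm of an SPD matrix via its eigendecomposition."""
    w, V = np.linalg.eigh(S)
    return (V * np.log(w)) @ V.T

def lem_transport_plan(source, target):
    """Exact Kantorovich plan of Problem (2) with the squared LEM cost,
    assuming uniform weights on both empirical distributions."""
    logs_s = [spd_logm(S) for S in source]
    logs_t = [spd_logm(S) for S in target]
    cost = np.array([[np.linalg.norm(ls - lt, "fro") ** 2 for lt in logs_t]
                     for ls in logs_s])
    p = np.full(len(source), 1 / len(source))
    q = np.full(len(target), 1 / len(target))
    return ot.emd(p, q, cost)

def barycentric_targets(plan, source):
    """New coordinates for the target samples: plan-weighted Log-Euclidean
    barycenters of the source samples, computed in the log domain."""
    logs_s = np.stack([spd_logm(S) for S in source])
    q = plan.sum(axis=0)                                  # target marginals
    log_bary = np.einsum("ij,ikl->jkl", plan, logs_s) / q[:, None, None]
    w, V = np.linalg.eigh(log_bary)                       # batched eigendecomposition
    return np.einsum("...ij,...j,...kj->...ik", V, np.exp(w), V)
```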
In the following paragraphs, we employ discrete transport plans to construct c-concave functions and utilize Lemmas 1 and 2 to obtain continuous transport plans. According to Algorithm 1, the new coordinates for target samples on SPD manifolds are given by the following expression:
$$\hat{S}_j := \exp\Big( \frac{1}{p_j^T} \sum_{i=1}^{N_S} \gamma^\ast_{ij}\, \log S_i^S \Big), \quad j = 1, \ldots, N_T.$$
We define $\psi(\hat{S}_j) := 0$ for $j = 1, \ldots, N_T$, and set the cost function $c(S_1, S_2) := \frac{1}{2} d^2_{\mathrm{LEM}}(S_1, S_2)$. This leads to the formulation of the corresponding c-concave function as follows:
$$\varphi(S) := \min_{j} \Big( \frac{1}{2} d^2_{\mathrm{LEM}}(S, \hat{S}_j) - \psi(\hat{S}_j) \Big).$$
For a fixed $S$, we suppose $\hat{S}_{j^\ast}$ achieves the minimum. By invoking Lemmas 1 and 2, we promptly derive the unique optimal transport map as follows,
$$T(S) = \exp_S\big( -\nabla \varphi(S) \big) = \exp_S\big( \exp_S^{-1}(\hat{S}_{j^\ast}) \big) = \hat{S}_{j^\ast},$$
which implies the following theorem:
Theorem 3.
On ($\mathcal{S}_{++}$, $g^{\mathrm{LEM}}$), consider the source measure $\mu_S$ and target measure $\mu_T$, which are only accessible through finite samples $\{S_i^S\}_{i=1}^{N_S}$ and $\{S_j^T\}_{j=1}^{N_T}$. Each sample contributes to the empirical distribution with weights $p_i^S$ and $p_j^T$, respectively. Assuming the existence of a linear transformation between the feature-space distributions of the source and target domains, it can be estimated using the discrete Monge–Kantorovich Problem (2). Therefore, the values of the continuous optimal transport map lie within the set $\{\hat{S}_j\}_{j=1}^{N_T}$.
Theorem 3 is an extension of [6, Theorem 3.1] to ($\mathcal{S}_{++}$, $g^{\mathrm{LEM}}$) within the context of the OT-DA framework. Significantly, our framework is designed to incorporate any supervised neural network-based transformation, distinguishing it from the conventional use of affine transformations, and it employs the squared Riemannian distance. In Section VI-A, we illuminate the connection between our proposed approach and another similar OT-DA framework on SPD manifolds proposed in [10], which utilizes the squared Euclidean distance in its cost function.
III-B Deep Optimal Transport
We propose a novel geometric deep learning approach, Deep Optimal Transport (DOT) on ($\mathcal{S}_{++}$, $g^{\mathrm{LEM}}$), to address joint distribution adaptation problems involving both marginal distribution shifts and conditional distribution shifts. We assume that the transfer between samples from the source and target distributions can be estimated via OT. The proposed architecture is depicted in Figure 1.
Marginal Distribution Adaptation (MDA)
The objective of MDA is to adapt the marginal distributions $P_S(X)$ and $P_T(X)$ on ($\mathcal{S}_{++}$, $g^{\mathrm{LEM}}$). We measure the statistical discrepancy between them using the Riemannian distance between the Log-Euclidean Fréchet means $\bar{S}_S$ and $\bar{S}_T$ of each batch of source and target samples, respectively. The MDA loss is formulated as $\ell_{\mathrm{MDA}} := d^2_{\mathrm{LEM}}\big(\bar{S}_S, \bar{S}_T\big)$.
Conditional Distribution Adaptation (CDA)
The objective of CDA is to adapt the conditional distributions $P_S(Y \mid X)$ and $P_T(Y \mid X)$ on ($\mathcal{S}_{++}$, $g^{\mathrm{LEM}}$). It necessitates pseudo labels predicted by a baseline algorithm for the target domain [13]. We consider the class-conditional distribution $P(X \mid Y)$ instead of the posterior probability and introduce the use of pseudo labels for the unsupervised setting. To quantify the statistical difference between $P_S(X \mid Y = c)$ and $P_T(X \mid Y = c)$, we again employ the Riemannian distance between the Log-Euclidean Fréchet means $\bar{S}_S^{(c)}$ and $\bar{S}_T^{(c)}$ of each batch of source and target samples of class $c$, respectively. The CDA loss is formulated as $\ell_{\mathrm{CDA}} := \sum_{c=1}^{C} d^2_{\mathrm{LEM}}\big(\bar{S}_S^{(c)}, \bar{S}_T^{(c)}\big)$.
Overall Objective Function
The overall objective function of DOT is composed of the cross-entropy loss and the joint distribution adaptation terms, i.e., $\ell := \ell_{\mathrm{CE}} + \lambda_1\, \ell_{\mathrm{MDA}} + \lambda_2\, \ell_{\mathrm{CDA}}$, where $\lambda_1 \geq 0$ and $\lambda_2 \geq 0$ are trade-off parameters.
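The sketch below illustrates one way to assemble these losses in PyTorch, using the fact (noted in the remark below) that the squared LEM distance between Fréchet means reduces to the squared Frobenius distance between mean matrix logarithms; the function names and the trade-off arguments `lam_mda` and `lam_cda` are ours.

```python
import torch
import torch.nn.functional as F

def spd_log(S):
    """Batched matrix logarithm of SPD matrices via eigendecomposition."""
    w, V = torch.linalg.eigh(S)
    return V @ torch.diag_embed(torch.log(w)) @ V.transpose(-1, -2)

def mean_log(S):
    """Log of the Log-Euclidean Fréchet mean: the mean of the matrix logs."""
    return spd_log(S).mean(dim=0)

def mda_loss(S_src, S_tgt):
    """Squared LEM distance between the batch Fréchet means (marginal term)."""
    return (mean_log(S_src) - mean_log(S_tgt)).pow(2).sum()

def cda_loss(S_src, y_src, S_tgt, y_tgt_pseudo, num_classes):
    """Sum of per-class squared LEM distances between class-wise Fréchet means."""
    loss = S_src.new_zeros(())
    for c in range(num_classes):
        src_c, tgt_c = S_src[y_src == c], S_tgt[y_tgt_pseudo == c]
        if len(src_c) and len(tgt_c):
            loss = loss + mda_loss(src_c, tgt_c)
    return loss

def dot_objective(logits_src, y_src, S_src, S_tgt, y_tgt_pseudo, num_classes,
                  lam_mda=1.0, lam_cda=1.0):
    """Cross-entropy on source labels plus the joint distribution adaptation terms."""
    return (F.cross_entropy(logits_src, y_src)
            + lam_mda * mda_loss(S_src, S_tgt)
            + lam_cda * cda_loss(S_src, y_src, S_tgt, y_tgt_pseudo, num_classes))
```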
Remark.
1). $\ell_{\mathrm{MDA}}$ and $\ell_{\mathrm{CDA}}$ should be in the squared form, as required by Brenier's formulation and the associated theoretical results [32].
2). On the space of SPD matrices, there are multiple available Riemannian metrics besides LEM. For example, the affine-invariant Riemannian metric (AIRM) is another well-known Riemannian metric that has been extensively investigated in the literature [16]. However, AIRM is not a suitable choice for our framework: the calculation of the Fréchet mean on ($\mathcal{S}_{++}$, $g^{\mathrm{AIRM}}$) requires iterative methods, which is not feasible as a loss function for deep learning-based approaches.
3). In practice, we substitute Equation (1) into $\ell_{\mathrm{MDA}}$ and $\ell_{\mathrm{CDA}}$, resulting in the simplified expressions
$$\ell_{\mathrm{MDA}} = \Big\|\frac{1}{N_S}\sum_{i=1}^{N_S}\log S_i^S - \frac{1}{N_T}\sum_{j=1}^{N_T}\log S_j^T\Big\|_F^2, \qquad \ell_{\mathrm{CDA}} = \sum_{c=1}^{C}\Big\|\frac{1}{N_S^{(c)}}\sum_{i:\, y_i = c}\log S_i^S - \frac{1}{N_T^{(c)}}\sum_{j:\, \hat{y}_j = c}\log S_j^T\Big\|_F^2.$$
These simplified expressions have two distinct interpretations in domain adaptation. First, they can be seen as an empirical maximum mean discrepancy (MMD) defined as $\mathrm{MMD}(\mathcal{S}, \mathcal{T}) := \big\|\frac{1}{N_S}\sum_{i=1}^{N_S}\phi(S_i^S) - \frac{1}{N_T}\sum_{j=1}^{N_T}\phi(S_j^T)\big\|_{\mathcal{H}}$, where $\phi$ is the feature map (here the matrix logarithm), $\mathcal{H}$ is the reproducing kernel Hilbert space, and the output dimension is that of the final SPD-valued layer. Second, since each matrix in $\mathcal{S}$ or $\mathcal{T}$ is a covariance matrix of multichannel signals, the expressions can also be viewed as a correlation alignment (CORAL) approach [33, 34], which aligns the second-order statistics of the source and target data distributions to minimize the drift between their statistical distributions.
4). For multi-source adaptation, the loss function can be modified by incorporating joint distribution adaptation terms between each pair of source and target domains on ($\mathcal{S}_{++}$, $g^{\mathrm{LEM}}$), as $\ell := \ell_{\mathrm{CE}} + \sum_{k}\big(\ell_{\mathrm{MDA}, k} + \ell_{\mathrm{CDA}, k}\big)$, where the subscript $k$ is used to represent the $k$-th source domain. This modification is based on the theory of multi-marginal optimal transport on Riemannian manifolds [35] and multi-source joint distribution optimal transport [36].
5). DOT does not directly solve an OT problem. Instead, it uses the transportation costs as a guiding principle to direct the learning process of the neural network. Consequently, the trained neural network maps the source and target domains onto a submanifold of the SPD manifold, thereby implementing a subspace method in transfer learning.
6). DOT calculates the transportation cost only for the centroid points and addresses marginal and conditional distribution adaptation without necessitating alternating parameter updates. In contrast, DeepJDOT computes the transportation costs for all points and requires alternating updates of the transport plan and the network parameters, as its transport plan is a deterministic transfer.
IV EXPERIMENTS
IV-A Deep Optimal Transport-Based Motor Imagery Classifier
We model the space of EEG spatial covariance matrices as ($\mathcal{S}_{++}$, $g^{\mathrm{LEM}}$) and employ DOT to address the shift in the statistical characterization of EEG during the feedback phase. Using event-related desynchronization/synchronization to classify motor imagery classes, EEG frequency components in specific frequency bands are found to provide the best discrimination between imagined movements [37]. Hence, it is reasonable to regard the source and target feature spaces as the same, i.e., $\mathcal{X}_S = \mathcal{X}_T$, after transformation to a latent space by the Bi-Map layer, and thus to model the space of EEG spatial covariance matrices using SPD manifolds with the Log-Euclidean metric. Moreover, the Bi-Map layer provides a transitive group action on SPD manifolds, allowing any transition between two points on the manifold. In Fig. 2, we illustrate how the DOT-based MI-EEG classifier functions in a domain adaptation problem.
Specifically, an EEG trial is represented by $X \in \mathbb{R}^{n_c \times n_t}$, where $n_c$ and $n_t$ denote the number of channels and timestamps, respectively. The corresponding EEG spatial covariance matrix $S^{(f, t)} \in \mathcal{S}_{++}^{n_c}$ represents an EEG segment within the frequency band $f$ and time interval $t$. Let $\mathcal{F}$ and $\mathcal{T}$ represent the frequency and time domains, respectively. A given segmentation is written as $\{(f, t)\}_{f \in \mathcal{F},\, t \in \mathcal{T}}$ on the time-frequency domain of the EEG signals. To ensure the theoretical convergence of this classifier in a real-world context, the following three assumptions are necessary:
Assumption 1.
There exist linear transformations between the feature-space distributions of the source and target domains for each segment of the time-frequency domain, which can be estimated using the Kantorovich Problem (2).
Assumption 2.
The weights for the Log-Euclidean Fréchet mean of samples on each time-frequency segment, in both the source and target empirical distributions, are uniform and equal to $1/N$, where $N$ is the number of samples in the corresponding batch.
Assumption 3.
The Riemannian distance between the Log-Euclidean Fréchet means of source and target samples, both under the same time-frequency segment, is minimized, that is:
$$d_{\mathrm{LEM}}\big(\bar{S}_S^{(f, t)}, \bar{S}_T^{(f, t)}\big) \leq d_{\mathrm{LEM}}\big(\bar{S}_S^{(f, t)}, \bar{S}_T^{(f', t')}\big),$$
where $\bar{S}_T^{(f', t')}$ is the Log-Euclidean Fréchet mean of a batch of samples on the target domain generated by another time-frequency segment $(f', t')$, with $f' \neq f$ or $t' \neq t$.
Assumption 1 is a necessary condition for modeling the OT-DA problem. Assumption 2 is straightforward: when we are uncertain about which EEG frequency band carries a higher weight for a specific task, we assume that each frequency band has equal priority for the classification task. Assumption 3 ensures that features extracted from different bands do not overlap.
Corollary 3.1.
Proof.
In this case, the transport plan equals the identity matrix. Consequently, Theorem 3 leads directly to this corollary. ∎
IV-B Evaluation Dataset
Three commonly used EEG-based motor imagery datasets are used for evaluation, accessed from two publicly available data packages, MOABB and BNCI Horizon 2020 (the MOABB package can be accessed at https://github.com/NeuroTechX/moabb, and the BNCI Horizon 2020 project is available at http://bnci-horizon-2020.eu/database/data-sets). The selection criteria require each subject in the dataset to have two-session recordings collected on different days. All digital signals were band-pass filtered into bands of 4 Hz width using Chebyshev Type II filters. The filters were designed to have a maximum loss of 3 dB in the passband and a minimum attenuation of 30 dB in the stopband (a filter-bank sketch follows the dataset descriptions below). The following are the descriptions of these three datasets.
1). Korea University (KU) Dataset: The KU dataset comprises EEG signals obtained from 54 subjects during a binary-class imagery task. The EEG signals were recorded using 62 electrodes at a sampling rate of 1,000 Hz. For evaluation, we selected 20 electrodes located in the motor cortex region. The dataset was divided into two sessions, S1 and S2, each consisting of a training and a testing phase. Each phase included 100 trials, with an equal number of right and left-hand imagery tasks. The EEG signals were epoched from second 1 to second 3.5 relative to the stimulus onset, resulting in a total epoch duration of 2.5 seconds.
2). BNCI2014001 Dataset: The BNCI2014001 dataset (also known as BCIC-IV-2a) involved 9 participants performing a motor imagery task with four classes: left hand, right hand, feet, and tongue. The task was recorded using 22 Ag/AgCl electrodes and three EOG channels at a sampling rate of 250 Hz. The recorded data were filtered using a bandpass filter between 0.5 and 100 Hz, with a 50 Hz notch filter applied. The study was conducted over two sessions on separate days, with each participant completing a total of 288 trials (six runs of 12 cue-based trials for each class). The EEG signals were epoched from the cue onset at second 2 to the end of the motor imagery period at second 6, resulting in a total epoch duration of 4 seconds.
3). BNCI2015001 Dataset: BNCI2015001 involves 12 participants performing a right-hand versus both-feet imagery task. Most of the study consisted of two sessions conducted on consecutive days, with each participant completing 100 trials per class, resulting in a total of 200 trials per participant. Four subjects participated in a third session, but these data were not included in the evaluation. The EEG data were acquired at a sampling rate of 512 Hz and filtered using a bandpass filter ranging from 0.5 to 100 Hz, with a notch filter at 50 Hz. The recordings began at second 3 with the prompt and continued until the end of the cross period at second 8, resulting in a total duration of 5 seconds.
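As a sketch of the filter-bank preprocessing described above (the 2 Hz transition-band width is our assumption; the original filter orders are not specified beyond the 3 dB/30 dB design targets):

```python
import numpy as np
from scipy.signal import cheb2ord, cheby2, sosfiltfilt

def chebyshev2_filter_bank(eeg, fs, bands=((4, 8), (8, 12), (12, 16), (16, 20),
                                           (20, 24), (24, 28), (28, 32),
                                           (32, 36), (36, 40)), trans=2.0):
    """Band-pass EEG (channels x samples) into 4 Hz-wide bands with Chebyshev
    Type II filters: at most 3 dB passband loss, at least 30 dB stopband attenuation."""
    segments = []
    for lo, hi in bands:
        # Smallest filter order meeting the design targets for this band.
        order, wn = cheb2ord(wp=[lo, hi], ws=[lo - trans, hi + trans],
                             gpass=3, gstop=30, fs=fs)
        sos = cheby2(order, rs=30, Wn=wn, btype="bandpass", output="sos", fs=fs)
        segments.append(sosfiltfilt(sos, eeg, axis=-1))
    return np.stack(segments)  # shape: (bands, channels, samples)

# Example: a 20-channel, 2.5 s epoch sampled at 1000 Hz (as in the KU dataset).
epoch = np.random.randn(20, 2500)
banded = chebyshev2_filter_bank(epoch, fs=1000)
```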
IV-C Experimental Settings
The experimental settings for domain adaptation on three evaluated datasets are described as follows, illustrated in Fig. 3: 1). Semi-supervised Domain Adaptation: For the KU dataset, S1 was used for training, and the first half of S2 was used for validation, while the second half was reserved for testing. Early stopping was adopted during the validation phase. 2). Unsupervised Domain Adaptation: For the BNCI2014001 and BNCI2015001 datasets, S1 was used for training, and S2 was used for testing. No validation set was used, and a maximum epoch value was set for stopping in practice.
We specified that the transfer procedure always occurs from S1 (source domain) to S2 (target domain) for each dataset. This is reasonable because transferring in the reverse direction has no practical meaning: the former and latter sessions correspond to the calibration and feedback phases, respectively.
IV-D Evaluated Baselines
The geometric methods in this study can be broadly divided into four categories: parallel transport, deep parallel transport, optimal transport, and deep optimal transport. Each category includes various transfer methods, operating on three types of base machine learning classifiers: the support vector machine, minimum distance to Riemannian mean, and SPDNet. We begin by introducing these base classifiers and then discuss the transfer methods in each category.
1. Support Vector Machine (SVM) [38]: This is a well-known machine learning classifier used for classification and regression tasks. It aims to find an optimal hyperplane that maximally separates data points of different classes, allowing it to handle linear and non-linear classification problems effectively.
2. Minimum Distance to Riemannian Mean (MDM) [21]: This classifier employs the geodesic distance on SPD manifolds. More specifically, a centroid is estimated for each predefined class, and the classification of a new data point is determined by its proximity to the nearest centroid (a minimal sketch follows this list).
3. SPDNet [29]: This was introduced in the preliminaries. In the experiments, we employ a one-depth Bi-Map layer with both the input and output dimensions equal to the number of electrodes over the primary motor cortex: 20 (KU), 22 (BNCI2014001), and 13 (BNCI2015001).
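A minimal nearest-centroid sketch in the spirit of MDM is shown below; note that the MDM of [21] uses the affine-invariant geodesic distance, whereas this sketch uses the Log-Euclidean distance for brevity, and the class name is ours.

```python
import numpy as np

def spd_logm(S):
    """Matrix logarithm of an SPD matrix via its eigendecomposition."""
    w, V = np.linalg.eigh(S)
    return (V * np.log(w)) @ V.T

class LogEuclideanMDM:
    """Nearest-centroid classifier: one Fréchet-mean centroid per class,
    stored as a matrix log; distances are computed in the log domain."""
    def fit(self, covs, labels):
        self.classes_ = np.unique(labels)
        self.centroids_ = {c: np.mean([spd_logm(S) for S in covs[labels == c]], axis=0)
                           for c in self.classes_}
        return self

    def predict(self, covs):
        preds = []
        for S in covs:
            L = spd_logm(S)
            dists = {c: np.linalg.norm(L - M, "fro")
                     for c, M in self.centroids_.items()}
            preds.append(min(dists, key=dists.get))
        return np.array(preds)
```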
As computations on the SPD manifold are related to Riemannian metrics, we will emphasize the specific Riemannian metrics associated with each method. The different transfer methods across categories according to various Riemannian metrics can be seen in the first three columns of Table II.
IV-D1 Category of Parallel Transport
This category refers to the utilization of parallel transport to align the training and test data distributions. One of the representative methods in this category is Riemannian Procrustes Analysis (RPA) [39]. RPA utilizes transformations on ($\mathcal{S}_{++}$, $g^{\mathrm{AIRM}}$), including translation, scaling, and rotation. After the geometric transformation, RPA uses the MDM approach as the base classifier. The RPA approach includes three sequential processes: recentering (RCT), stretching, and rotation (ROT). First, the RCT process re-centers the centroids of the data to the identity matrix; in the implementation, RCT is the parallel transport on ($\mathcal{S}_{++}$, $g^{\mathrm{AIRM}}$) from the Fréchet mean of either the source or the target domain to the identity matrix. Second, the stretching process rescales the distributions of both datasets so that their dispersions around the mean are the same. Finally, the ROT process rotates the matrices from the target domain around the origin to match the orientation of its point cloud with that of the source domain. In particular, the RCT process is unsupervised and can be used in unsupervised domain adaptation, whereas the ROT process is supervised and can only be used in semi-supervised domain adaptation. Specifically, RCT was also proposed as a standalone method for cross-session motor imagery classification [40].
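A sketch of the recentering (RCT) step is given below; for simplicity it uses the Log-Euclidean Fréchet mean as a stand-in for the iteratively computed AIRM mean of the original RPA implementation, and the function names are ours.

```python
import numpy as np

def _spd_logm(S):
    w, V = np.linalg.eigh(S)
    return (V * np.log(w)) @ V.T

def _spd_expm(L):
    w, V = np.linalg.eigh(L)
    return (V * np.exp(w)) @ V.T

def _spd_power(S, p):
    w, V = np.linalg.eigh(S)
    return (V * (w ** p)) @ V.T

def recenter(covs):
    """RCT: congruence by the inverse square root of the batch centroid,
    moving the centroid of the batch to the identity matrix."""
    # Log-Euclidean Fréchet mean used as a simple stand-in for the AIRM mean.
    M = _spd_expm(np.mean([_spd_logm(S) for S in covs], axis=0))
    M_isqrt = _spd_power(M, -0.5)
    return np.stack([M_isqrt @ S @ M_isqrt for S in covs])
```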
IV-D2 Category of Deep Parallel Transport
This category refers to the domain adaptation approaches using deep learning and parallel transport on SPD manifolds. One representative work is known as Riemannian Batch Normalization (RieBN) [41]. RieBN is an SPD matrix-valued network architecture that employs parallel transport for batch centering and biasing on SPD neural networks.
IV-D3 Category of Optimal Transport
This category treats the shift of the covariance-matrix distribution across sessions as an OT problem and solves for the transport plan using different variants of the Wasserstein distance [42]. The classifier is trained within one session and tested in the shifted session; this is an unsupervised method. Three different types of Wasserstein distances are used: the earth mover's distance (EMD), the sliced-Wasserstein distance on SPD matrices (SPDSW), and the log sliced-Wasserstein distance (logSW).
IV-D4 Category of Deep Optimal Transport
This category refers to the approach proposed in this study, which uses SPD neural networks on ($\mathcal{S}_{++}$, $g^{\mathrm{LEM}}$) to map both the source and target domains to a common space induced by optimal transport.
V Main Results
Metric | Category | Method | Classifier | KU (54 Subjects, Semi-supervised): Accuracy | Increment | BNCI2014001 (9 Subjects, Unsupervised): Accuracy | Increment | BNCI2015001 (12 Subjects, Unsupervised): Accuracy | Increment
---|---|---|---|---|---|---|---|---|---
$g^{\mathrm{AIRM}}$ | OT | EMD | SVM | 65.24 (14.81) | 0.07 | 50.89 (16.54) | 14.27 | 82.17 (11.80) | 6.25
$g^{\mathrm{AIRM}}$ | PT | RCT | MDM | 52.69 (6.39) | 0.34 | 55.13 (12.48) | 0.89 | 77.67 (13.23) | 5.46
$g^{\mathrm{AIRM}}$ | PT | ROT | MDM | 51.39 (5.89) | 0.96 | / | / | / | /
$g^{\mathrm{AIRM}}$ | PT | RCT | SPDNet | 69.91 (13.67) | 2.15 | 72.11 (12.99) | 3.51 | 81.25 (10.64) | 4.75
$g^{\mathrm{AIRM}}$ | PT | ROT | SPDNet | 63.44 (14.94) | 4.32 | / | / | / | /
$g^{\mathrm{AIRM}}$ | DPT | RieBN | SPDNet | 67.57 (15.27) | 0.19 | 69.33 (13.49) | 0.73 | 78.83 (14.56) | 2.33
$g^{\mathrm{LEM}}$ | OT | EMD | SVM | 67.70 (15.14) | 0.06 | 68.44 (15.79) | 3.28 | 82.17 (11.80) | 6.25
$g^{\mathrm{LEM}}$ | OT | EMD | SPDNet | 64.22 (14.30) | 3.54 | 62.15 (14.69) | 6.45 | 78.33 (13.80) | 1.83
$g^{\mathrm{LEM}}$ | OT | SPDSW | SVM | 66.67 (14.54) | 1.36 | 67.82 (16.71) | 2.66 | 82.42 (11.74) | 6.50
$g^{\mathrm{LEM}}$ | OT | logSW | SVM | 66.26 (15.00) | 0.95 | 65.82 (16.56) | 0.66 | 80.75 (12.94) | 4.83
$g^{\mathrm{LEM}}$ | DOT | DeepJDOT | SPDNet | 67.38 (14.67) | 0.38 | 61.00 (11.40) | 7.60 | 77.63 (15.12) | 1.13
$g^{\mathrm{LEM}}$ | DOT | MDA | SPDNet | 69.19 (14.37) | 1.43 | 67.78 (12.46) | 0.82 | 78.33 (14.57) | 1.83
$g^{\mathrm{LEM}}$ | DOT | CDA | SPDNet | 68.80 (14.29) | 1.04 | 65.12 (11.34) | 3.48 | 78.25 (14.59) | 1.75
$g^{\mathrm{LEM}}$ | DOT | MDA+CDA | SPDNet | 68.61 (14.79) | 1.05 | 64.85 (11.54) | 3.75 | 78.29 (11.99) | 1.79
/ | Base Classifier | / | SVM | 65.31 (14.42) | / | 65.16 (17.00) | / | 75.92 (16.29) | /
/ | Base Classifier | / | MDM | 52.35 (6.73) | / | 54.24 (13.31) | / | 72.21 (14.67) | /
/ | Base Classifier | / | SPDNet | 67.76 (14.55) | / | 68.60 (13.03) | / | 76.50 (14.41) | /
V-A Synthetic Dataset
We synthesize a source-domain dataset consisting of 50 covariance matrices of size $2 \times 2$, drawn from a Gaussian distribution on the SPD manifold proposed in the generative approach of [43], with a median of 1 and a standard deviation of 0.4. We utilize a fixed transformation matrix $A$ in the generation, and data in the target domain are expressed as $\bar{S} := A S A^\top$ for each source sample $S$.
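A sketch of this construction is given below; the Riemannian Gaussian sampler of [43] is replaced by a simple Wishart-style draw, and the matrix `A` shown is an arbitrary illustrative choice rather than the one used in our experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_spd(dim):
    """Random SPD matrix (Wishart-style stand-in for the Riemannian Gaussian of [43])."""
    G = rng.standard_normal((dim, dim))
    return G @ G.T / dim + 0.1 * np.eye(dim)

source = np.stack([random_spd(2) for _ in range(50)])
A = np.array([[2.0, 0.5],          # arbitrary illustrative transformation matrix
              [0.5, 1.0]])
target = np.einsum("ij,njk,lk->nil", A, source, A)  # S -> A S A^T per sample
```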
Subfigure (a) exhibits red points drawn from the source domain and blue points from the target domain. In Subfigure (b), the blue points represent the relocation of the red points via the EMD method from OT. The associations made by this transformation are represented by thin blue lines. Remarkably, the positions of these transformed blue points are extremely close to the points of the target domain in Subfigure (a). This can be attributed to the fact that the OT-DA method essentially computes a weighted average of the source and target domain data points, based on weights learned to minimize the transportation costs.
The transformed blue points in Subfigures (c) and (d) were derived from mapping the red points via neural network weights learned through the DOT and DeepJDOT methods, respectively. Corresponding point pairs are linked with thin blue lines. These figures reveal the shared fundamental characteristic of the DOT and DeepJDOT methods that both project the source and target domains onto a subspace learned via a network, which, in these two figures, is represented by a straight line. In differential geometry, this is a submanifold, which refers to a lower-dimensional manifold embedded within a higher-dimensional one. In this case, the two-dimensional matrices are represented by a line, essentially projecting the higher dimensional data onto a lower dimensional space. It reduces the complexity of the data while preserving information.
V-B Comparative Experiments on Real-World EEG Datasets
As shown in Table II, we classify all approaches into three categories according to Riemannian metrics. The first category does not require a Riemannian metric for computation, the second employs $g^{\mathrm{AIRM}}$, while the third uses $g^{\mathrm{LEM}}$.
The experimental results indicate that the choice of metric strongly influences the results. Specifically, in the table, EMD+SVM achieved 50.89 and 68.33 on the BNCI2014001 dataset under the two different metric settings. Moreover, several observations follow: 1). Participant Number: The improvement is related to the number of participants. On datasets with fewer participants, the enhancement brought by transfer methods is more significant, with several scenarios showing changes of more than 5%. 2). Base Classifier Results: The extent of improvement is linked to the base classifier's performance. Lower initial results from the base classifier leave more room for transfer methods to achieve significant gains. 3). Methodological Factors: Improvements are associated with the classifier's neural network architecture, parameter selection, and data preprocessing. Therefore, the conclusions of this study may not be universally applicable.
Furthermore, we summarize the observations of all methods affected by various Riemannian metrics as follows:
- ($\mathcal{S}_{++}$, $g^{\mathrm{AIRM}}$) covers the categories of OT, PT, and DPT. Combining the results from all three datasets, the RCT method performs well, especially when paired with the MDM or SPDNet classifier. The EMD method shows the largest variation across datasets. The second step (ROT) of the RPA method, which is limited to semi-supervised scenarios, performs worse than the first step (RCT). The RieBN method provides slight improvements but does not outperform RCT across the board.
- ($\mathcal{S}_{++}$, $g^{\mathrm{LEM}}$) covers the categories of OT and DOT. The EMD method, when paired with the SVM classifier, performs significantly better than when paired with the SPDNet classifier. The SPDSW and logSW methods show clear improvements over their base classifier (SVM), with SPDSW outperforming logSW by 1% to 2% on all three datasets. The MDA and CDA methods perform similarly; except for the BNCI2014001 dataset, the improvements on the other two datasets are between 1% and 2%, with the MDA method slightly better than the CDA method or the mixture of MDA and CDA. Notably, the use of pseudo-labels in the CDA method affects its results. The DeepJDOT method's performance across the three datasets falls short of our proposed approach. The primary reason is likely our decision to replace the computation of transportation costs for every data point with the computation of transportation costs only for the centers of each frequency band's data. Minimizing the transportation costs does not necessarily imply improved classification accuracy, and thus the proposed approach reduces the computational cost while enhancing the model's potential.
Besides, neural network-based approaches generally achieve better classification outcomes than the other categories of approaches, such as EMD, SPDSW, and logSW. To compute the CDA loss in DOT, we used the pseudo-label method, with pseudo labels derived from Graph-CSPNet [26], which currently performs best for subject-specific tasks. Specifically, the results of SPDSW and logSW on the BNCI2014001 dataset were obtained using the data processing procedure and source code provided in their GitHub repository, which outperform the results obtained with our preprocessing; this is because the performance of the SVM methods heavily depends on the selection of handcrafted features.
VI DISCUSSIONS
The experimental results indicate that all transfer learning methods exhibit effectiveness. However, it is important to note that while domain shift is a known factor in these tasks, the classification performance is not solely influenced by it. Consequently, the improvements from transfer learning methods are generally limited. Currently, we lack an effective method to quantify the intensity of domain shift and a system to correlate this intensity with classifier performance, making it difficult to fully assess these methods' effectiveness. In the rest of this section, we discuss several aspects of the proposed approach.
VI-A Theorem 3 with the Squared Euclidean Cost Function
We claim that Theorem 3 with the squared Euclidean cost function $c(S_1, S_2) := \|S_1 - S_2\|_F^2$ is equivalent to [6, Theorem 3.1]. This was first observed and proved by Yair et al. [10], who use the matrix vectorization operator to establish an isometry between the affine transformation and the Bi-Map transformation. We follow their arguments and slightly revise their proof.
Theorem 4.
[6, Theorem 3.1] Let $\mu_S$ and $\mu_T$ be two discrete distributions on $\mathbb{R}^d$ with $N$ Diracs. Given a strictly positive definite matrix $A$, a bias vector $b \in \mathbb{R}^d$, and source samples $\{x_i^S\}_{i=1}^{N}$, suppose the target samples satisfy $x_i^T = A x_i^S + b$ for $i = 1, \ldots, N$, and the weights in both the source and target empirical distributions are $1/N$. Then, the transport $T(x) := A x + b$ is the solution to the OT problem provided with the cost function $c(x, y) := \|x - y\|_2^2$.
To expose the relationship between the Bi-Map transformation in Theorem 3 and the affine transformation in Theorem 4, we introduce the matrix vectorization operator $\mathrm{vec}(\cdot)$, which stacks the columns of a matrix one below another, i.e., for $A = (a_{ij}) \in \mathbb{R}^{m \times n}$,
$$\mathrm{vec}(A) := (a_{11}, \ldots, a_{m1}, a_{12}, \ldots, a_{m2}, \ldots, a_{1n}, \ldots, a_{mn})^\top.$$
Then, we have the following lemma.
Lemma 5.
For matrices $A$, $B$, and $C$ of compatible dimensions, $\mathrm{vec}(ABC) = (C^\top \otimes A)\,\mathrm{vec}(B)$, where $\otimes$ denotes the Kronecker product.
Proof.
Suppose $B \in \mathbb{R}^{n \times p}$ has columns $b_1, \ldots, b_p$, and let $c_k$ denote the $k$-th column of $C$. The $k$-th column of $ABC$ can be written as $A B c_k = \sum_{i=1}^{p} (c_k)_i\, A b_i$, which is exactly the $k$-th block of $(C^\top \otimes A)\,\mathrm{vec}(B)$. Stacking these columns yields the lemma.
∎
Lemma 6.
Suppose $W$ is positive definite; then $W \otimes W$ is also positive definite.
Proof.
This follows from the fact that if $\lambda_1, \ldots, \lambda_n$ are the eigenvalues of $W$, then the eigenvalues of $W \otimes W$ are $\{\lambda_i \lambda_j\}_{i, j = 1}^{n}$, all of which are positive. Hence, $W \otimes W$ is positive definite. ∎
Combining Lemma 5 and Lemma 6, the Bi-Map transformation $S \mapsto W S W^\top$ corresponds, after vectorization, to the affine map $\mathrm{vec}(S) \mapsto (W \otimes W)\,\mathrm{vec}(S)$ with a strictly positive definite matrix $W \otimes W$, and the squared Euclidean cost on SPD matrices equals the squared Euclidean cost on their vectorizations. Applying Theorem 4 therefore yields the claimed equivalence with Theorem 3 under this cost function.
VI-B Parallel Transport
Parallel transport on Riemannian manifolds transfers a tangent vector at one point along a curve on the manifold to another point, such that the transported vector remains parallel to the original vector [30]. Under the two Riemannian metrics $g^{\mathrm{AIRM}}$ and $g^{\mathrm{LEM}}$ on SPD manifolds, the expressions of parallel transport differ markedly and determine two distinct modeling approaches for the problems [44, 10]. Specifically, for any $S_1, S_2 \in \mathcal{S}_{++}^n$ and tangent vector $v$ at $S_1$ on ($\mathcal{S}_{++}$, $g^{\mathrm{AIRM}}$), the parallel transport of $v$ from $S_1$ to $S_2$ can be expressed as $\Gamma_{S_1 \to S_2}(v) := E\, v\, E^\top$, where $E := (S_2 S_1^{-1})^{\frac{1}{2}}$. For any $S_1, S_2 \in \mathcal{S}_{++}^n$ and tangent vector $v$ at $S_1$ on ($\mathcal{S}_{++}$, $g^{\mathrm{LEM}}$), the parallel transport is given by $\Gamma_{S_1 \to S_2}(v) := D_{\log S_2}\exp \circ\, D_{S_1}\log\,(v)$, which reduces to the identity map in the log-domain coordinates.
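For reference, a NumPy sketch of the AIRM parallel transport expression is given below (computed through the SPD matrix $S_1^{-1/2} S_2 S_1^{-1/2}$ to stay in real arithmetic); under the Log-Euclidean metric the corresponding operation is simply the identity in log-domain coordinates.

```python
import numpy as np

def airm_parallel_transport(v, S1, S2):
    """Parallel transport of tangent vector v from S1 to S2 under AIRM:
    Gamma(v) = E v E^T with E = (S2 S1^{-1})^{1/2}."""
    w, V = np.linalg.eigh(S1)
    S1_sqrt = (V * (w ** 0.5)) @ V.T
    S1_isqrt = (V * (w ** -0.5)) @ V.T
    # Principal square root of S2 S1^{-1}, obtained via the SPD matrix
    # M = S1^{-1/2} S2 S1^{-1/2}:  E = S1^{1/2} M^{1/2} S1^{-1/2}.
    M = S1_isqrt @ S2 @ S1_isqrt
    wm, Vm = np.linalg.eigh(M)
    E = S1_sqrt @ ((Vm * (wm ** 0.5)) @ Vm.T) @ S1_isqrt
    return E @ v @ E.T
```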
Keep in mind that DOT seeks a neural network-based transformation under which the transformed source and target distributions are aligned in both their marginal and conditional distributions, i.e., $P_S\big(T(X)\big) = P_T\big(T(X)\big)$ and $P_S\big(Y \mid T(X)\big) = P_T\big(Y \mid T(X)\big)$, where $T$ denotes the transformation applied to the source and target feature spaces. The neural network-based transformation extracts discriminative information from the feature spaces. After this extraction, it is reasonable to view the feature space of the source domain as the same as that of the target domain, since both are transformed by neural networks with the same weights, and so are their transformed feature spaces; we therefore utilize the Log-Euclidean metric to formulate the space of covariance matrices.
VI-C Assumption 3
In this section, we will justify that Assumption 3 is reasonable from the numerical results. Table III records average Log-Euclidean distances in the same frequency band, and diagonal numbers are consistently the smallest among all the numbers in each row for the two datasets. This suggests that the mapping shifts the center of each frequency band in the source domain to the corresponding frequency band center in the target domain. It aligns with our intuition that the difference between the same frequency bands should be smaller since they capture similar signal characteristics in the same frequency range, and the difference between different frequency bands tends to be larger.
The procedure of calculating the average Log-Euclidean distance is specifically given as follows: assuming source and target samples are represented in a format of [trials, frequency bands, channels, channels], with frequency bands fixed (in the 2nd dimension), we first calculate the Fréchet mean by averaging all covariance matrices along the trial dimension. This results in the shape of [frequency bands, channels, channels]. Next, we directly compute the average Log-Euclidean distance for each pair of frequency bands.
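A sketch of this procedure is given below, assuming the samples are stored as NumPy arrays of shape [trials, frequency bands, channels, channels] and that the reported value is the Log-Euclidean distance between the per-band Fréchet means.

```python
import numpy as np

def spd_logm(S):
    """Matrix logarithm of an SPD matrix via its eigendecomposition."""
    w, V = np.linalg.eigh(S)
    return (V * np.log(w)) @ V.T

def band_distance_matrix(source, target):
    """Log-Euclidean distances between per-band Fréchet means of the source
    and target sets (shape: [trials, bands, channels, channels])."""
    n_bands = source.shape[1]
    # Fréchet means per band; distances are computed directly between the
    # mean matrix logarithms, since the log map is an isometry under LEM.
    mean_logs_s = np.stack([np.mean([spd_logm(S) for S in source[:, b]], axis=0)
                            for b in range(n_bands)])
    mean_logs_t = np.stack([np.mean([spd_logm(S) for S in target[:, b]], axis=0)
                            for b in range(n_bands)])
    D = np.zeros((n_bands, n_bands))
    for i in range(n_bands):
        for j in range(n_bands):
            D[i, j] = np.linalg.norm(mean_logs_s[i] - mean_logs_t[j], "fro")
    return D  # D[i, j]: source band i vs. target band j, as in Table III
```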
KU: S1 → S2 | 4–8 Hz | 8–12 Hz | 12–16 Hz | 16–20 Hz | 20–24 Hz | 24–28 Hz | 28–32 Hz | 32–36 Hz | 36–40 Hz
---|---|---|---|---|---|---|---|---|---
4–8 Hz | 4.69 | 4.98 | 5.82 | 6.74 | 7.19 | 7.64 | 8.17 | 8.78 | 9.28
8–12 Hz | 5.09 | 4.21 | 4.82 | 5.81 | 6.25 | 6.70 | 7.22 | 7.82 | 8.31
12–16 Hz | 5.84 | 4.76 | 4.01 | 4.62 | 4.98 | 5.34 | 5.78 | 6.30 | 6.75
16–20 Hz | 6.66 | 5.61 | 4.37 | 3.96 | 4.08 | 4.33 | 4.64 | 5.02 | 5.40
20–24 Hz | 7.08 | 6.01 | 4.66 | 3.97 | 3.86 | 3.99 | 4.27 | 4.63 | 4.97
24–28 Hz | 7.52 | 6.45 | 4.98 | 4.11 | 3.86 | 3.75 | 3.91 | 4.21 | 4.52
28–32 Hz | 8.04 | 6.96 | 5.40 | 4.37 | 4.07 | 3.82 | 3.73 | 3.87 | 4.11
32–36 Hz | 8.65 | 7.56 | 5.93 | 4.77 | 4.42 | 4.13 | 3.88 | 3.74 | 3.82
36–40 Hz | 9.16 | 8.07 | 6.40 | 5.16 | 4.77 | 4.43 | 4.11 | 3.82 | 3.74
BNCI2014001: T → E | 4–8 Hz | 8–12 Hz | 12–16 Hz | 16–20 Hz | 20–24 Hz | 24–28 Hz | 28–32 Hz | 32–36 Hz | 36–40 Hz
---|---|---|---|---|---|---|---|---|---
4–8 Hz | 2.81 | 4.25 | 5.14 | 5.88 | 6.70 | 8.30 | 9.64 | 10.97 | 12.46
8–12 Hz | 4.31 | 2.37 | 3.98 | 5.45 | 6.28 | 8.07 | 9.44 | 10.77 | 12.26
12–16 Hz | 5.16 | 3.91 | 2.29 | 3.53 | 4.23 | 5.83 | 7.15 | 8.42 | 9.87
16–20 Hz | 5.84 | 5.49 | 3.44 | 2.17 | 2.76 | 4.08 | 5.29 | 6.56 | 7.98
20–24 Hz | 6.59 | 6.24 | 4.18 | 2.75 | 2.18 | 3.31 | 4.54 | 5.79 | 7.24
24–28 Hz | 8.11 | 8.02 | 5.72 | 4.01 | 3.14 | 2.20 | 2.96 | 4.03 | 5.38
28–32 Hz | 9.42 | 9.38 | 7.06 | 5.21 | 4.39 | 2.83 | 2.21 | 2.86 | 4.05
32–36 Hz | 10.67 | 10.63 | 8.25 | 6.43 | 5.57 | 3.87 | 2.72 | 2.22 | 3.05
36–40 Hz | 12.16 | 12.12 | 9.69 | 7.85 | 7.02 | 5.18 | 3.85 | 2.70 | 2.22
BNCI2015001: S1 → S2 | 4–8 Hz | 8–12 Hz | 12–16 Hz | 16–20 Hz | 20–24 Hz | 24–28 Hz | 28–32 Hz | 32–36 Hz | 36–40 Hz
---|---|---|---|---|---|---|---|---|---
4–8 Hz | 1.07 | 12.39 | 12.35 | 10.36 | 9.71 | 8.46 | 5.73 | 2.25 | 6.24
8–12 Hz | 12.04 | 1.06 | 1.49 | 2.68 | 3.22 | 4.40 | 7.13 | 12.43 | 18.19
12–16 Hz | 12.10 | 1.60 | 1.09 | 2.52 | 3.13 | 4.32 | 7.08 | 12.41 | 18.18
16–20 Hz | 10.19 | 2.99 | 2.67 | 1.19 | 1.59 | 2.45 | 5.06 | 10.40 | 16.22
20–24 Hz | 9.64 | 3.40 | 3.15 | 1.53 | 1.22 | 1.90 | 4.51 | 9.84 | 15.66
24–28 Hz | 8.36 | 4.51 | 4.30 | 2.34 | 1.80 | 1.19 | 3.21 | 8.52 | 14.34
28–32 Hz | 5.72 | 7.17 | 7.01 | 4.94 | 4.31 | 3.06 | 1.28 | 5.73 | 11.54
32–36 Hz | 2.52 | 12.43 | 12.31 | 10.24 | 9.58 | 8.29 | 5.44 | 1.30 | 6.17
36–40 Hz | 6.41 | 18.22 | 18.12 | 16.09 | 15.42 | 14.14 | 11.27 | 5.88 | 1.25
VII CONCLUSIONS
This study advances the OT-DA framework to SPD manifolds equipped with the Log-Euclidean metric and introduces a geometric deep learning approach to address real-world domain adaptation challenges with SPD matrix-valued data. Specifically, it handles cross-session variabilities in EEG-based motor imagery classification. Experiments on three publicly available EEG datasets validate the proposed approach’s effectiveness. The results highlight its practical potential and advantages in managing complex tasks involving domain adaptation in EEG data analysis.
Acknowledgment
This study is supported under the RIE2020 Industry Alignment Fund–Industry Collaboration Projects (IAF-ICP) Funding Initiative, as well as cash and in-kind contributions from the industry partner(s). This study is also supported by the RIE2020 AME Programmatic Fund, Singapore (No. A20G8b0102).
References
- [1] S. Ben-David, J. Blitzer, K. Crammer, F. Pereira et al., “Analysis of representations for domain adaptation,” Advances in neural information processing systems, vol. 19, p. 137, 2007.
- [2] S. J. Pan and Q. Yang, “A survey on transfer learning,” IEEE Transactions on knowledge and data engineering, vol. 22, no. 10, pp. 1345–1359, 2009.
- [3] M. Wang and W. Deng, “Deep visual domain adaptation: A survey,” Neurocomputing, vol. 312, pp. 135–153, 2018.
- [4] Y.-P. Lin and T.-P. Jung, “Improving eeg-based emotion classification using conditional transfer learning,” Frontiers in human neuroscience, vol. 11, p. 334, 2017.
- [5] M. Sugiyama, M. Krauledat, and K.-R. Müller, “Covariate shift adaptation by importance weighted cross validation.” Journal of Machine Learning Research, vol. 8, no. 5, 2007.
- [6] N. Courty, R. Flamary, D. Tuia, and A. Rakotomamonjy, “Optimal transport for domain adaptation,” IEEE transactions on pattern analysis and machine intelligence, vol. 39, no. 9, pp. 1853–1865, 2016.
- [7] N. Courty, R. Flamary, A. Habrard, and A. Rakotomamonjy, “Joint distribution optimal transportation for domain adaptation,” in Proceedings of the 31st International Conference on Neural Information Processing Systems, ser. NIPS’17. Red Hook, NY, USA: Curran Associates Inc., 2017, p. 3733–3742.
- [8] ——, “Joint distribution optimal transportation for domain adaptation,” Advances in neural information processing systems, vol. 30, 2017.
- [9] B. B. Damodaran, B. Kellenberger, R. Flamary, D. Tuia, and N. Courty, “Deepjdot: Deep joint distribution optimal transport for unsupervised domain adaptation,” in Proceedings of the European conference on computer vision (ECCV), 2018, pp. 447–463.
- [10] O. Yair, F. Dietrich, R. Talmon, and I. G. Kevrekidis, “Domain adaptation with optimal transport on the manifold of spd matrices,” arXiv preprint arXiv:1906.00616, 2019.
- [11] Z. J. Koles, M. S. Lazar, and S. Z. Zhou, “Spatial patterns underlying population differences in the background eeg,” Brain topography, vol. 2, no. 4, pp. 275–284, 1990.
- [12] J. Müller-Gerking, G. Pfurtscheller, and H. Flyvbjerg, “Designing optimal spatial filters for single-trial eeg classification in a movement task,” Clinical neurophysiology, vol. 110, no. 5, pp. 787–798, 1999.
- [13] M. Long, J. Wang, G. Ding, J. Sun, and P. S. Yu, “Transfer feature learning with joint distribution adaptation,” in Proceedings of the IEEE international conference on computer vision, 2013, pp. 2200–2207.
- [14] V. Arsigny, P. Fillard, X. Pennec, and N. Ayache, “Fast and simple computations on tensors with log-euclidean metrics.” Ph.D. dissertation, INRIA, 2005.
- [15] ——, “Log-euclidean metrics for fast and simple calculus on diffusion tensors,” Magnetic Resonance in Medicine: An Official Journal of the International Society for Magnetic Resonance in Medicine, vol. 56, no. 2, pp. 411–421, 2006.
- [16] X. Pennec, P. Fillard, and N. Ayache, “A riemannian framework for tensor computing,” International Journal of computer vision, vol. 66, pp. 41–66, 2006.
- [17] H. Q. Minh and V. Murino, Covariances in computer vision and machine learning. Springer Nature, 2022.
- [18] S. Kolouri, S. R. Park, M. Thorpe, D. Slepcev, and G. K. Rohde, “Optimal mass transport: Signal processing and machine-learning applications,” IEEE signal processing magazine, vol. 34, no. 4, pp. 43–59, 2017.
- [19] X. Pennec, “Statistical computing on manifolds for computational anatomy,” Ph.D. dissertation, Université Nice Sophia Antipolis, 2006.
- [20] ——, “Manifold-valued image processing with spd matrices,” in Riemannian geometric statistics in medical image analysis. Elsevier, 2020, pp. 75–134.
- [21] A. Barachant, S. Bonnet, M. Congedo, and C. Jutten, “Multiclass brain–computer interface classification by riemannian geometry,” IEEE Transactions on Biomedical Engineering, vol. 59, no. 4, pp. 920–928, 2011.
- [22] ——, “Classification of covariance matrices using a riemannian-based kernel for bci applications,” Neurocomputing, vol. 112, pp. 172–178, 2013.
- [23] M. Congedo, A. Barachant, and R. Bhatia, “Riemannian geometry for eeg-based brain-computer interfaces; a primer and a review,” Brain-Computer Interfaces, vol. 4, no. 3, pp. 155–174, 2017.
- [24] C. Ju, D. Gao, R. Mane, B. Tan, Y. Liu, and C. Guan, “Federated transfer learning for eeg signal classification,” in 2020 42nd Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC). IEEE, 2020, pp. 3040–3045.
- [25] C. Ju and C. Guan, “Tensor-cspnet: A novel geometric deep learning framework for motor imagery classification,” IEEE Transactions on Neural Networks and Learning Systems, 2022.
- [26] ——, “Graph neural networks on spd manifolds for motor imagery classification: A perspective from the time-frequency analysis,” IEEE Transactions on Neural Networks and Learning Systems, 2023.
- [27] C. Ju, R. J. Kobler, and C. Guan, “Score-based data generation for eeg spatial covariance matrices: Towards boosting bci performance,” in 2023 45th Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC), 2023.
- [28] C. Ju, R. J. Kobler, L. Tang, C. Guan, and M. Kawanabe, “Deep geodesic canonical correlation analysis for covariance-based neuroimaging data,” in The Twelfth International Conference on Learning Representations (ICLR), 2024.
- [29] Z. Huang and L. Van Gool, “A riemannian network for spd matrix learning,” in Thirty-First AAAI Conference on Artificial Intelligence, 2017.
- [30] P. Petersen, S. Axler, and K. Ribet, Riemannian geometry. Springer, 2006, vol. 171.
- [31] M. P. Do Carmo and J. Flaherty Francis, Riemannian geometry. Springer, 1992, vol. 6.
- [32] R. J. McCann, “Polar factorization of maps on riemannian manifolds,” Geometric & Functional Analysis GAFA, vol. 11, no. 3, pp. 589–608, 2001.
- [33] B. Sun and K. Saenko, “Deep coral: Correlation alignment for deep domain adaptation,” in European conference on computer vision. Springer, 2016, pp. 443–450.
- [34] B. Sun, J. Feng, and K. Saenko, “Correlation alignment for unsupervised domain adaptation,” in Domain Adaptation in Computer Vision Applications. Springer, 2017, pp. 153–171.
- [35] Y.-H. Kim and B. Pass, “Multi-marginal optimal transport on riemannian manifolds,” American Journal of Mathematics, vol. 137, no. 4, pp. 1045–1060, 2015.
- [36] R. Turrisi, R. Flamary, A. Rakotomamonjy, and M. Pontil, “Multi-source domain adaptation via weighted joint distributions optimal transport,” in Uncertainty in Artificial Intelligence. PMLR, 2022, pp. 1970–1980.
- [37] G. Pfurtscheller, C. Neuper, D. Flotzinger, and M. Pregenzer, “Eeg-based discrimination between imagination of right and left hand movement,” Electroencephalography and clinical Neurophysiology, vol. 103, no. 6, pp. 642–651, 1997.
- [38] C. Cortes and V. Vapnik, “Support-vector networks,” Machine learning, vol. 20, pp. 273–297, 1995.
- [39] P. L. C. Rodrigues, C. Jutten, and M. Congedo, “Riemannian procrustes analysis: transfer learning for brain–computer interfaces,” IEEE Transactions on Biomedical Engineering, vol. 66, no. 8, pp. 2390–2401, 2018.
- [40] O. Yair, M. Ben-Chen, and R. Talmon, “Parallel transport on the cone manifold of spd matrices for domain adaptation,” IEEE Transactions on Signal Processing, vol. 67, no. 7, pp. 1797–1811, 2019.
- [41] D. Brooks, O. Schwander, F. Barbaresco, J.-Y. Schneider, and M. Cord, “Riemannian batch normalization for spd neural networks,” Advances in Neural Information Processing Systems, vol. 32, 2019.
- [42] C. Bonet, B. Malézieux, A. Rakotomamonjy, L. Drumetz, T. Moreau, M. Kowalski, and N. Courty, “Sliced-wasserstein on symmetric positive definite matrices for m/eeg signals,” arXiv preprint arXiv:2303.05798, 2023.
- [43] S. Said, L. Bombrun, Y. Berthoumieu, and J. H. Manton, “Riemannian gaussian distributions on the space of symmetric positive definite matrices,” IEEE Transactions on Information Theory, vol. 63, no. 4, pp. 2153–2170, 2017.
- [44] S. Sra and R. Hosseini, “Conic geometric optimization on the manifold of positive definite matrices,” SIAM Journal on Optimization, vol. 25, no. 1, pp. 713–739, 2015.