
Deep Optimal Transport for Domain Adaptation on SPD Manifolds

Ce Ju, and Cuntai Guan Ce Ju and Cuntai Guan are with the College of Computing and Data Science, Nanyang Technological University, 50 Nanyang Avenue, Singapore. E-mail: {juce0001,ctguan}@ntu.edu.sg.
Abstract

The machine learning community has shown increasing interest in addressing the domain adaptation problem on symmetric positive definite (SPD) manifolds. This interest is primarily driven by the complexities of neuroimaging data generated from brain signals, which often exhibit shifts in data distribution across recording sessions. These neuroimaging data, represented by signal covariance matrices, possess the mathematical properties of symmetry and positive definiteness. However, applying conventional domain adaptation methods is challenging because these mathematical properties can be disrupted when operating on covariance matrices. In this study, we introduce a novel geometric deep learning-based approach utilizing optimal transport on SPD manifolds to manage discrepancies in both marginal and conditional distributions between the source and target domains. We evaluate the effectiveness of this approach in three cross-session brain-computer interface scenarios and provide visualized results for further insights. The GitHub repository of this study can be accessed at https://github.com/GeometricBCI/Deep-Optimal-Transport-for-Domain-Adaptation-on-SPD-Manifolds.

Index Terms:
Domain Adaptation, Optimal Transport, Geometric Deep Learning, Neural Signal Processing

I Introduction

In cross-domain problems, the distribution of source data often differs from that of target data, leading to what is widely recognized as a domain adaptation problem in the machine learning community. This issue has gained considerable attention over the past decade [1, 2, 3]. For example, signals in an online testing set frequently differ from those in the calibration set, despite being generated by the same process, highlighting a common domain adaptation challenge.

In domain adaptation, a domain $\mathcal{D}$ consists of a feature space $X$ and a marginal probability distribution $\mathbb{P}(X)$, denoted as $\mathcal{D}=\{X,\mathbb{P}(X)\}$. Meanwhile, a task $\mathcal{T}$ comprises a label space $Y$ and a conditional distribution $\mathbb{Q}(Y|X)$, represented by $\mathcal{T}=\{Y,\mathbb{Q}(Y|X)\}$. Domain adaptation aims to build a classifier in the target domain $\mathcal{D}_{T}$ by leveraging information obtained from the source domain $\mathcal{D}_{S}$, where $\mathcal{D}_{S}\neq\mathcal{D}_{T}$ and $\mathcal{T}_{S}\neq\mathcal{T}_{T}$. In particular, $\mathcal{D}_{S}\neq\mathcal{D}_{T}$ implies either $X_{S}\neq X_{T}$ or $\mathbb{P}_{S}(X_{S})\neq\mathbb{P}_{T}(X_{T})$, whereas $\mathcal{T}_{S}\neq\mathcal{T}_{T}$ implies either $Y_{S}\neq Y_{T}$ or $\mathbb{Q}_{S}(Y_{S}|X_{S})\neq\mathbb{Q}_{T}(Y_{T}|X_{T})$.

The domain adaptation problem can be broadly categorized into unsupervised and semi-supervised scenarios. In the unsupervised scenario, no labels are available in the target domain, while in the semi-supervised scenario, a limited number of labeled instances are available in the target domain to guide the adaptation process. For instance, the variability of electroencephalography (EEG) signals often results in unreliable and non-robust performance when transitioning from calibration to feedback phases, as these sessions exhibit distinct distributions. This variability in EEG-based classification exemplifies a typical domain adaptation problem [4]. In response to this issue, a covariate shift adaptation approach has been proposed in the machine learning community [5]. This approach assumes that the difference between domains is characterized by a change in the feature space, while the conditional distributions remain unchanged.

TABLE I: A comparison of various OT-DA frameworks. $X_{S}$ and $X_{T}$ are the feature spaces of the source and target domains, respectively; $\mathbb{P}$ and $\mathbb{Q}$ denote the marginal and conditional probability distributions, respectively.
Framework | Cost function | Underlying space | Transformation | Domain adaptation scenario
OT-DA [6, 7] | $c(x,\bar{x})=||x-\bar{x}||_{\ell_{2}}^{2}$ | Euclidean | Affine | $X_{S}\neq X_{T}$
JDOT-DA [8, 9] | $c(x,\bar{x})=||x-\bar{x}||_{\ell_{2}}^{2}$ | Euclidean | Affine | $X_{S}\neq X_{T}$, $\mathbb{Q}_{S}(Y_{S}|X_{S})\neq\mathbb{Q}_{T}(Y_{T}|X_{T})$
OT-DA on SPD manifolds [10] | $c(x,\bar{x})=||x-\bar{x}||_{\ell_{2}}^{2}$ | ($\mathcal{S}_{++}$, $g^{AIRM}$) | Bi-Map | $X_{S}\neq X_{T}$
OT-DA on SPD manifolds (proposed) | $c(x,\bar{x})=d_{g^{LEM}}^{2}(x,\bar{x})$ | ($\mathcal{S}_{++}$, $g^{LEM}$) | Neural networks | $X_{S}=X_{T}$, $\mathbb{P}_{S}(X_{S})\neq\mathbb{P}_{T}(X_{T})$, $\mathbb{Q}_{S}(Y_{S}|X_{S})\neq\mathbb{Q}_{T}(Y_{T}|X_{T})$

Among various methods that have been proposed to address the domain adaptation problem across numerous engineering scenarios, Optimal Transport (OT) has emerged as a novel and effective solution for aligning source and target domains [6, 7]. The OT approach introduces regularized optimal transport, allowing a classifier to learn from labeled source data and apply this knowledge to target data. The goal of the Optimal Transport-Domain Adaptation (OT-DA) framework is to minimize the overall transportation costs between two distributions, quantified by the Wasserstein metric. It seeks to find a push-forward transformation φ:XSXT\varphi:X_{S}\longmapsto X_{T} that satisfies S(XS)=T(φ(XS))\mathbb{P}_{S}(X_{S})=\mathbb{P}_{T}(\varphi(X_{S})). The strengths of the OT-DA framework lie in its simplified evaluation of empirical data distribution and its enhanced ability to leverage the underlying geometry of discrete samples [6].

While broad in scope, the OT-DA framework can benefit from enhancements derived from insights into specific applications. For instance, in EEG-based motor imagery classification, EEG signals are processed as (average) spatial covariance matrices after segmentation in time and averaging over trials [11, 12]. This leads to the cross-session scenario of motor imagery classification taking place in the space of covariance matrices, which forms a symmetric and positive definite (SPD) manifold if endowed with a Riemannian metric. In this context, Yair et al. utilized squared 2\ell_{2} distance as the cost function c(x,x¯)=xx¯22c(x,\bar{x})=||x-\bar{x}||_{\ell_{2}}^{2} in the OT-DA framework, deriving a closed-form solution through Brenier’s polar factorization [10].

Although Yair's work pioneered the OT-DA framework on SPD manifolds, it has certain theoretical and practical shortcomings. Theoretically, the chosen cost function, based on the Euclidean distance, undermines the consistency of the mathematical formulation from the viewpoint of differential geometry. Practically, the methodology does not account for disparities within conditional distributions. Additionally, challenges inherent to the problem, such as the scarcity of labeled data during the feedback phase and the divergence of spatial covariance matrices across frequency bands, pose significant modeling difficulties. These challenges necessitate a more adaptable OT-DA framework to effectively address problems on SPD manifolds.

To address these challenges, we consider a more applicable scenario where XS=XTX_{S}=X_{T}, YS=YTY_{S}=Y_{T}, S(XS)T(XT)\mathbb{P}_{S}(X_{S})\neq\mathbb{P}_{T}(X_{T}), and S(YS|XS)T(YT|XT)\mathbb{Q}_{S}(Y_{S}|X_{S})\neq\mathbb{Q}_{T}(Y_{T}|X_{T}). This situation is typically handled through joint distribution adaptation, which concurrently aligns both marginal and conditional distributions across domains, even when dealing with implicit probability distributions, without relying on labeled information from the target domain [13]. To facilitate modeling of XS=XTX_{S}=X_{T}, we formulate the space of SPD matrices using the Log-Euclidean metric gLEMg^{LEM} on SPD manifolds. The Log-Euclidean metric is a commonly used Riemannian metric on SPD manifolds, allowing operations to be performed in the logarithmic domain where the structure is Euclidean [14]. Moreover, in this context, parallel transport on SPD manifolds with the Log-Euclidean metric simplifies to an identity map, ensuring that the feature space of the source and target domains, which is the tangent space at the identity of SPD manifolds, are the same.

In this study, we propose a novel geometric deep learning architecture called Deep Optimal Transport (DOT) to address the joint distribution adaptation problem on SPD manifolds, utilizing a squared Log-Euclidean distance as the cost function in optimal transport, i.e., c(x,x¯)=dgLEM2(x,x¯)c(x,\bar{x})=d_{g^{LEM}}^{2}(x,\bar{x}). DOT is predicated on the assumption that information transfer between domains can be effectively estimated through optimal transport. The goal of DOT is to find a transformation on SPD manifolds φ:XS,XTZ𝒮++\varphi:X_{S},X_{T}\longmapsto Z\subset\mathcal{S}_{++} such that S(φ(XS))=T(φ(XT))\mathbb{P}_{S}(\varphi(X_{S}))=\mathbb{P}_{T}(\varphi(X_{T})) and S(YS|φ(XS))=T(YT|φ(XT))\mathbb{Q}_{S}(Y_{S}|\varphi(X_{S}))=\mathbb{Q}_{T}(Y_{T}|\varphi(X_{T})). To evaluate the effectiveness of DOT, we focus on cross-session scenarios in motor imagery classification. This problem is notably challenging and often inadequately addressed, primarily due to the extensive patterns of synchronized neuronal activity that continuously shift over time. This results in significant variability in brain signals captured by EEG devices across different sessions. In our evaluation, we calibrate the model in one session and then test it in another, encompassing both unsupervised and semi-supervised domain adaptation scenarios.

The remainder of this paper is organized as follows: Section II reviews Log-Euclidean geometry, discrete OT problems, joint distribution optimal transport, and the SPD matrix-valued motor imagery classifier. Section III presents the OT-DA framework on SPD manifolds and proposes the DOT architecture. Section IV applies DOT to motor imagery classification. Section V presents results on synthetic and real-world datasets. A comparison of the various frameworks is summarized in Table I.

II PRELIMINARY

Notations and Conventions

A matrix $S\in\mathbb{R}^{n\times n}$ is symmetric positive definite if it is symmetric and all its eigenvalues are positive, i.e., $S=S^{\top}$ and $x^{\top}Sx>0$ for all $x\in\mathbb{R}^{n}\setminus\{\mathbf{0}\}$. The Frobenius inner product and Frobenius norm of $m\times n$ matrices $A$ and $B$ are defined as $\langle A,B\rangle_{\mathcal{F}}:=\operatorname{Tr}(A^{\top}B)$ and $||A||_{\mathcal{F}}^{2}:=\langle A,A\rangle_{\mathcal{F}}$, respectively.
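To make the notation concrete, the following small NumPy snippet (an illustrative sketch, not part of the paper's pipeline; all names are illustrative) checks symmetry and positive definiteness and evaluates the Frobenius inner product and norm.

```python
import numpy as np

def is_spd(S, tol=1e-10):
    """Check that S is symmetric and all its eigenvalues are positive."""
    symmetric = np.allclose(S, S.T, atol=tol)
    eigvals = np.linalg.eigvalsh((S + S.T) / 2)   # symmetrize for numerical safety
    return symmetric and bool(np.all(eigvals > tol))

def frobenius_inner(A, B):
    """Frobenius inner product <A, B>_F = Tr(A^T B)."""
    return np.trace(A.T @ B)

# Example: a 2x2 SPD matrix built as X X^T + eps * I.
rng = np.random.default_rng(0)
X = rng.standard_normal((2, 2))
S = X @ X.T + 1e-3 * np.eye(2)
print(is_spd(S))                                               # True
print(frobenius_inner(S, S), np.linalg.norm(S, "fro") ** 2)    # equal values
```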

II-A Log-Euclidean Metric

The Log-Euclidean metric, a widely used Riemannian metric on SPD manifolds, introduces a vector space structure to the space of SPD matrices, endowing it with numerous exceptional properties [14, 15], including invariance under inversion, logarithmic multiplication, orthogonal transformations, and scaling. Formally, a commutative Lie group structure, known as the tensor Lie group, is endowed on $\mathcal{S}_{++}$ using the logarithmic multiplication $\odot$ defined by $P_{1}\odot P_{2}:=\exp\big(\log P_{1}+\log P_{2}\big)$, where $P_{1},P_{2}\in\mathcal{S}_{++}$, and $\exp$ and $\log$ are the matrix exponential and logarithm, respectively. A bi-invariant metric exists on the tensor Lie group, which yields the scalar product $g^{LEM}_{P}(v,w):=\langle D_{P}\log v,\,D_{P}\log w\rangle_{\mathcal{F}}$, where $v,w$ are tangent vectors at a matrix $P$ and $D_{P}\log$ denotes the differential of $\log$ at $P$. We denote an SPD manifold equipped with the Log-Euclidean metric as ($\mathcal{S}_{++}$, $g^{LEM}$). The sectional curvature of ($\mathcal{S}_{++}$, $g^{LEM}$) is flat, and thus parallel transport on it simplifies to the identity map. The Riemannian distance between $P_{1}$ and $P_{2}$ on this manifold is $d_{g^{LEM}}(P_{1},P_{2})=||\log{P_{1}}-\log{P_{2}}||_{\mathcal{F}}$. The Log-Euclidean Fréchet mean $\overline{w}(\mathcal{B})$ of a batch $\mathcal{B}$ of SPD matrices $\{S_{i}\}_{i=1}^{|\mathcal{B}|}$ is defined as follows:

w¯():=exp(1||i=1||log(Si)).\displaystyle\overline{w}(\mathcal{B}):=\exp{\bigg{(}\frac{1}{|\mathcal{B}|}\sum_{i=1}^{|\mathcal{B}|}\log{(S_{i})}\bigg{)}}. (1)
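A minimal NumPy/SciPy sketch of Equation (1): under the Log-Euclidean metric, the Fréchet mean is simply the matrix exponential of the arithmetic mean of the matrix logarithms, and the Riemannian distance reduces to a Frobenius norm in the log domain. Function names are illustrative.

```python
import numpy as np
from scipy.linalg import expm, logm

def log_euclidean_mean(batch):
    """Log-Euclidean Frechet mean of a collection of SPD matrices (Eq. 1)."""
    logs = [logm(S) for S in batch]
    return expm(np.mean(logs, axis=0))

def log_euclidean_dist(P1, P2):
    """Riemannian distance under g^LEM: ||log P1 - log P2||_F."""
    return np.linalg.norm(logm(P1) - logm(P2), "fro")

A = np.array([[2.0, 0.5], [0.5, 1.0]])
B = np.array([[1.0, 0.2], [0.2, 3.0]])
M = log_euclidean_mean([A, B])
print(log_euclidean_dist(A, M), log_euclidean_dist(B, M))  # equal: M is the geodesic midpoint
```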

In this study, another Riemannian metric on SPD manifolds, the affine-invariant Riemannian metric [16], is introduced for comparison with $g^{LEM}$. An SPD manifold equipped with this metric is denoted as ($\mathcal{S}_{++}$, $g^{AIRM}$). Formally, for any $P$, it is defined as $g^{AIRM}_{P}(v,w):=\langle P^{-\frac{1}{2}}vP^{-\frac{1}{2}},\,P^{-\frac{1}{2}}wP^{-\frac{1}{2}}\rangle_{\mathcal{F}}$, where $v,w\in\mathcal{T}_{P}\mathcal{S}_{++}$ are tangent vectors. ($\mathcal{S}_{++}$, $g^{AIRM}$) also exhibits appealing properties, such as affine invariance, scale invariance, and inversion invariance; since these properties are not the primary focus of this study, we refer the reader to [17] for details.

II-B Discrete Optimal Transport

Optimal Transport (OT) seeks the minimal-cost transport plan between source and target distributions [18]. In this study, we consider the case where these distributions are available only through a finite number of samples. Given two measurable spaces $X$ and $\bar{X}$, the empirical distributions of the measures $\mu$ on $X$ and $\nu$ on $\bar{X}$ are written as $\mu:=\sum_{i\in I}p_{i}\delta_{x_{i}}$ and $\nu:=\sum_{j\in J}q_{j}\delta_{\bar{x}_{j}}$, where the samples $\{x_{i}\}_{i\in I}\subset X$ are indexed by a countable set $I$ with weights satisfying $\sum_{i\in I}p_{i}=1$, and the samples $\{\bar{x}_{j}\}_{j\in J}\subset\bar{X}$ are indexed by a countable set $J$ with weights satisfying $\sum_{j\in J}q_{j}=1$. The Dirac measure $\delta_{s}$ denotes a unit point mass at the point $s\in X\text{ or }\bar{X}$. The discrete Kantorovich formulation of the OT problem is $\min_{\gamma\in\prod(\mu,\nu)}\big\langle\gamma,c(x,\bar{x})\big\rangle_{\mathcal{F}}$, where the transport plan $\gamma\in\prod(\mu,\nu):=\{\gamma\in\mathbb{R}_{\geq 0}^{|I|\times|J|}\,\big|\,\gamma\mathbbm{1}_{|J|}=p,\text{ and }\gamma^{\top}\mathbbm{1}_{|I|}=q\}$, $p$ and $q$ are the vectors of weights, and $\mathbbm{1}_{d}$ is a $d$-dimensional all-one vector.
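The discrete Kantorovich problem above can be solved with an off-the-shelf linear-program OT solver. A minimal sketch using the POT library's ot.emd with a squared Euclidean cost is shown below; the sample values and weights are illustrative.

```python
import numpy as np
import ot  # POT: Python Optimal Transport

# Empirical distributions: weights p (source) and q (target) summing to one.
x = np.array([[0.0], [1.0], [2.0]])     # source samples {x_i}
x_bar = np.array([[0.5], [2.5]])        # target samples {x_bar_j}
p = np.ones(len(x)) / len(x)
q = np.ones(len(x_bar)) / len(x_bar)

# Cost matrix c(x_i, x_bar_j); here the squared Euclidean distance.
C = ot.dist(x, x_bar, metric="sqeuclidean")

# Transport plan gamma minimizing <gamma, C>_F under the marginal constraints.
gamma = ot.emd(p, q, C)
print(gamma.sum(axis=1))  # recovers p
print(gamma.sum(axis=0))  # recovers q
```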

II-C Joint Distribution Optimal Transport

Assuming that the transfer between source and target distributions can be estimated using optimal transport, Joint Distribution Optimal Transport (JDOT) [8] extends the OT-DA framework by jointly optimizing the optimal coupling and the prediction function $f$ for the two distributions, i.e., $\min_{\gamma\in\prod(\mu,\nu)}\big\langle\gamma,c^{\prime}(x,y;\,\bar{x},f(\bar{x}))\big\rangle_{\mathcal{F}}$, where $y$ is the label of sample $x$, $f$ is the prediction function, and the joint cost function is a linear combination of a distance function and a cross-entropy loss, i.e., $c^{\prime}(x,y;\bar{x},f(\bar{x})):=\alpha_{1}d_{\ell^{2}}(x,\bar{x})+\alpha_{2}\mathcal{L}(y,f(\bar{x}))$. Moreover, DeepJDOT employs neural networks to map raw data into a feature space and extends JDOT with the new joint objective $\mathcal{L}:=\mathcal{L}(y,f(g(x)))+\big\langle\gamma,c^{\prime}\big(g(x),y;\,g(\bar{x}),f(g(\bar{x}))\big)\big\rangle_{\mathcal{F}}$, where $g$ denotes the neural network [9].

II-D SPD Matrix-Valued Motor Imagery Classifier

Extensive research has explored various classes of Riemannian metrics on SPD manifolds for various applications. These explorations have led to the introduction of numerous statistical methods to address computational issues on SPD manifolds, drawing inspiration from manifold-valued neuroimaging data [19, 20]. During the feature engineering era, Barachant et al. utilized the Riemannian distance between EEG spatial covariance matrices for motor imagery classification, resulting in a significant impact on various brain-computer interface applications [21, 22, 23].

In the deep learning era, Ju et al. employed geometric deep learning on SPD manifolds to extract discriminative information from the time-space-frequency domain for motor imagery classification, ultimately enhancing classification performance [24, 25, 26, 27, 28]. The neural network layers utilized in these works are derived from SPDNet [29], which is specifically designed to process SPD matrices while preserving their SPD properties throughout the layers during non-linear learning. These layers include:

  • Bi-Map Layer: This layer transforms the covariance matrix $S$ via the Bi-Map transformation $WSW^{\top}$, where the transformation matrix $W$ must have full row rank.

  • ReEig Layer: This layer introduces non-linearity on SPD manifolds using $U\max(\epsilon I,\Sigma)U^{\top}$, where $S=U\Sigma U^{\top}$, $\epsilon$ is a rectification threshold, and $I$ is the identity matrix.

  • LOG Layer: This layer maps elements of SPD manifolds to their tangent space using $U\log(\Sigma)U^{\top}$, where $S=U\Sigma U^{\top}$; the logarithm of a diagonal matrix is taken element-wise on the diagonal.
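For concreteness, the following PyTorch sketch mirrors the three layers just listed for batched SPD inputs of shape (batch, n, n). It is a simplified illustration: in SPDNet the Bi-Map weight is kept (semi-)orthogonal via Riemannian optimization on the Stiefel manifold, which is omitted here, and the class names are illustrative.

```python
import torch
import torch.nn as nn

class BiMap(nn.Module):
    """S -> W S W^T with a full-row-rank W (Stiefel constraint omitted in this sketch)."""
    def __init__(self, n_in, n_out):
        super().__init__()
        w = torch.empty(n_out, n_in)
        nn.init.orthogonal_(w)          # orthogonal init approximates full row rank
        self.W = nn.Parameter(w)

    def forward(self, S):
        return self.W @ S @ self.W.t()

class ReEig(nn.Module):
    """Rectify small eigenvalues: U max(eps*I, Sigma) U^T."""
    def __init__(self, eps=1e-4):
        super().__init__()
        self.eps = eps

    def forward(self, S):
        evals, evecs = torch.linalg.eigh(S)
        evals = torch.clamp(evals, min=self.eps)
        return evecs @ torch.diag_embed(evals) @ evecs.transpose(-1, -2)

class LogEig(nn.Module):
    """Map to the tangent space at the identity: U log(Sigma) U^T."""
    def forward(self, S):
        evals, evecs = torch.linalg.eigh(S)
        return evecs @ torch.diag_embed(torch.log(evals)) @ evecs.transpose(-1, -2)
```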

III Methodology

In this section, we will generalize the OT-DA framework on (𝒮++\mathcal{S}_{++}, gLEMg^{LEM}) and propose a novel geometric deep learning approach to address the problems formulated on this framework.

III-A OT-DA Framework on (𝒮++\mathcal{S}_{++}, gLEMg^{LEM})

We first calculate the gradient of the squared distance function on a general Riemannian manifold.

Lemma 1.

Let (,g)(\mathcal{M},g) be a connected, compact, and C3C^{3}-smooth Riemannian manifold without a boundary. Consider a compact subset 𝒰\mathcal{U}\subset\mathcal{M} and a fixed point q𝒰q\in\mathcal{U}. Let f(p)=12dg2(p,q)f(p)=\frac{1}{2}d_{g}^{2}(p,q) denote the squared distance function. Then, for pp not on the boundary of \mathcal{M}, the gradient of f(p)f(p) is given by f(p)=logp(q)\nabla f(p)=-\log_{p}(q).

Proof.

Without loss of generality, consider a point pp\in\mathcal{M}^{\circ}. There exists an open neighborhood 𝒩\mathcal{N} of pp and ϵ>0\epsilon>0 such that the exponential map expp\exp_{p} maps the ball (0,ϵ)Tp\mathcal{B}(0,\epsilon)\subset T_{p}\mathcal{M} diffeomorphically onto 𝒩\mathcal{N} [30, Corollary 5.5.2.]. Let c(s)c(s) be a geodesic on 𝒩\mathcal{N}, parameterized with constant speed, with c(0):=q𝒩c(0):=q\in\mathcal{N} and c(1):=pc(1):=p with c˙(0)=logq(p)\dot{c}(0)=\log_{q}(p). By reversing the direction of the geodesic c(1s)c(1-s), we have c˙(1)=logp(q)\dot{c}(1)=-\log_{p}(q). We proceed by linearizing the exponential maps around 0Tp0\in T_{p}\mathcal{M} and c˙(0)Tq\dot{c}(0)\in T_{q}\mathcal{M} and derive the derivative of a function ff as follows:

f(expp(v))\displaystyle f\big{(}\exp_{p}(v)\big{)}
=\displaystyle=\, 12dg2(expp(v),q)\displaystyle\frac{1}{2}\cdot d_{g}^{2}\big{(}\exp_{p}(v),q\big{)}
=\displaystyle=\, 12|logq(expp(v))|q2\displaystyle\frac{1}{2}\cdot\big{|}\log_{q}\big{(}\exp_{p}(v)\big{)}\big{|}_{q}^{2}
=\displaystyle=\, 12|c˙(0)+(Dlogq)c˙(0)((Dexpp)0v)+o(|v|p)|q2\displaystyle\frac{1}{2}\cdot\big{|}\dot{c}(0)+(D\log_{q})_{\dot{c}(0)}\big{(}(D\exp_{p})_{0}v\big{)}+o(|v|_{p})\big{|}_{q}^{2}
=\displaystyle=\, 12|c˙(0)|q2+gq(c˙(0),(Dlogq)c˙(0)v)+o(|v|p)\displaystyle\frac{1}{2}\cdot\big{|}\dot{c}(0)\big{|}_{q}^{2}+g_{q}(\dot{c}(0),(D\log_{q})_{\dot{c}(0)}v)+o(|v|_{p})
=\displaystyle=\, 12dg2(p,q)+gp(c˙(1),v)+o(|v|p),\displaystyle\frac{1}{2}\cdot d_{g}^{2}(p,q)+g_{p}(\dot{c}(1),v)+o(|v|_{p}),

where $(D\exp_{p})_{q}(\cdot):T_{q}(T_{p}\mathcal{M})\longmapsto T_{p}\mathcal{M}$ denotes the differential of the exponential map at $q$ with respect to $p$. The fourth equality follows from $(D\exp_{p})_{0}=I$ on $T_{p}\mathcal{M}$, and the fifth equality follows from Gauss' lemma [31, Section 3.3.5.]. Hence, by the definition of the gradient on Riemannian manifolds, we have $df(p)(v)=g_{p}\big(\nabla f(p),v\big)$, which implies that $\nabla f(p)=\dot{c}(1)=-\log_{p}(q)$. ∎

We say a function ψ:{}\psi:\mathcal{M}\longmapsto\mathbb{R}\cup\{-\infty\} is c-concave if it is not identically -\infty and there exists φ:{±}\varphi:\mathcal{M}\longmapsto\mathbb{R}\cup\{\pm\infty\} such that

ψ(p)=infq{c(p,q)+φ(q)},\psi(p)=\inf_{q\in\mathcal{M}}\big{\{}c(p,q)+\varphi(q)\big{\}},

where c(p,q)c(p,q) is a nonnegative cost function that measures the cost of transporting mass from qq to pp\in\mathcal{M}. Brenier’s polar factorization on general Riemannian manifolds is given as follows:

Lemma 2 (Brenier’s Polar Factorization [32]).

Given two compactly supported measures $\mu,\nu\in\mathbb{P}(\mathcal{M})$, if $\mu$ is absolutely continuous with respect to the Riemannian volume, there exists a unique optimal transport map $T$ from $\mu$ to $\nu$ as follows,

$T(p)=\exp_{p}\big(-\nabla\psi(p)\big),$

where ψ:{±}\psi:\mathcal{M}\longmapsto\mathbb{R}\cup\{\pm\infty\} is a c-concave function with respect to the cost function c(p,q)=12dg2(p,q)c(p,q)=\frac{1}{2}\cdot d_{g}^{2}(p,q), and expp()\exp_{p}(\cdot) is the exponential map at pp on (,g)(\mathcal{M},g).

On the SPD manifold (𝒮++\mathcal{S}_{++}, gLEMg^{LEM}), let’s consider source measure μ\mu and target measure ν(𝒮++)\nu\in\mathbb{P}(\mathcal{S}_{++}) that are accessible only through finite samples {Si}i=1Nμ\{S_{i}\}_{i=1}^{N}\sim\mu and {S¯j}j=1Mν\{\bar{S}_{j}\}_{j=1}^{M}\sim\nu. Each sample contributes to its respective empirical distribution with weights 1N\frac{1}{N} and 1M\frac{1}{M}. We assume there exists a linear transformation on (𝒮++\mathcal{S}_{++}, gLEMg^{LEM}) between two discrete feature space distributions of the source and target domains that can be estimated with the discrete Kantorovich problem, i.e., there exists

γ:=argminγ(μ,ν)γ,dgLEM2(S,S¯),~{}\gamma^{\dagger}:=\arg\min_{\gamma\in\prod(\mu,\nu)}\big{\langle}\gamma,d_{g^{LEM}}^{2}(S,\bar{S})\big{\rangle}_{\mathcal{F}}, (2)

where $\prod(\mu,\nu):=\{\gamma\in\mathbb{R}_{\geq 0}^{N\times M}\,\big|\,\gamma\mathbbm{1}_{M}=\frac{1}{N}\mathbbm{1}_{N},\text{ and }\gamma^{\top}\mathbbm{1}_{N}=\frac{1}{M}\mathbbm{1}_{M}\}$. The pseudocode for the computation of the transport plan is presented in Algorithm 1.

Input  : Source samples {Si}i=1Nμ\{S_{i}\}_{i=1}^{N}\sim\mu, target samples {S¯j}j=1Mν\{\bar{S}_{j}\}_{j=1}^{M}\sim\nu.
Output: Transport plan γ\gamma, and new coordinates for target samples {S¯j}j=1M\{\bar{S}^{\dagger}_{j}\}_{j=1}^{M}.
/* 1. Compute ground metric matrix for OT */
          D(i,j)dgLEM(Si,S¯j)D(i,j)\leftarrow d_{g^{LEM}}(S_{i},\bar{S}_{j}).
/* 2. Compute transport plan γ\gamma using the earth movers distance-based OT algorithm. */
          γOTEMD(N,M,D(i,j))\gamma^{\dagger}\leftarrow OT^{EMD}(N,M,D(i,j)).
/* 3. Compute new coordinates using transport plan on (𝒮++\mathcal{S}_{++}, gLEMg^{LEM}). */
          {S¯j}j=1Mexp(γ[j,:]{log(Si)}i=1N)\{\bar{S}^{\dagger}_{j}\}_{j=1}^{M}\leftarrow\exp{\big{(}\gamma^{\dagger}[j,:]\cdot\{\log(S_{i})\}_{i=1}^{N}\big{)}}.
Algorithm 1 Discrete Monge-Kantorovich Algorithm on (𝒮++\mathcal{S}_{++}, gLEMg^{LEM}). 
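A hedged Python sketch of Algorithm 1, using the earth mover's distance solver from the POT library with the squared Log-Euclidean ground cost of Equation (2). The barycentric step follows step 3 above, with each column of the plan renormalized (the usual barycentric-mapping convention); all names are illustrative.

```python
import numpy as np
import ot
from scipy.linalg import expm, logm

def lem_transport(source, target):
    """Sketch of Algorithm 1 on (S_++, g^LEM).

    source: N SPD matrices ~ mu;  target: M SPD matrices ~ nu.
    Returns the transport plan and new (transported) coordinates for the targets.
    """
    N, M = len(source), len(target)
    log_src = np.array([logm(S) for S in source])
    log_tgt = np.array([logm(S) for S in target])

    # 1. Ground metric: Log-Euclidean distances d(S_i, S_bar_j).
    D = np.array([[np.linalg.norm(ls - lt, "fro") for lt in log_tgt] for ls in log_src])

    # 2. Transport plan via the earth mover's distance (uniform empirical weights),
    #    with the squared distance as the cost, matching Eq. (2).
    gamma = ot.emd(np.full(N, 1.0 / N), np.full(M, 1.0 / M), D ** 2)

    # 3. New coordinates: log-domain weighted combination of the source samples.
    #    Columns of gamma correspond to target samples; renormalizing each column
    #    gives barycentric weights over the source samples (cf. step 3 above).
    weights = gamma / gamma.sum(axis=0, keepdims=True)          # shape (N, M)
    new_targets = [expm(np.tensordot(weights[:, j], log_src, axes=1)) for j in range(M)]
    return gamma, new_targets
```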

In the following paragraph, we will employ discrete transport plans to construct c-concave functions and utilize Lemma 1 and 2 to obtain continuous transport plans. According to Algorithm 1, the new coordinates for target samples on SPD manifolds are given by the following expression:

{S¯j}j=1M:=exp(γ[j,:]{log(Si)}i=1N).\{\bar{S}^{\dagger}_{j}\}_{j=1}^{M}:=\exp{\big{(}\gamma^{\dagger}[j,:]\cdot\{\log(S_{i})\}_{i=1}^{N}\big{)}}.

We define φ(S¯):=+\varphi(\bar{S}):=+\infty for S¯{S¯j}j=1M\bar{S}\notin\{\bar{S}^{\dagger}_{j}\}_{j=1}^{M}, and set the cost function c(S,S¯):=12dgLEM2(S,S¯)c(S,\bar{S}):=\frac{1}{2}\cdot d_{g^{LEM}}^{2}(S,\bar{S}). This leads to the formulation of the corresponding c-concave function as follows:

ψ(S)=minj{1,,M}12dgLEM2(S,S¯j)+φ(S¯j).\psi(S)=\min_{j\in\{1,...,M\}}\frac{1}{2}\cdot d_{g^{LEM}}^{2}(S,\bar{S}^{\dagger}_{j})+\varphi(\bar{S}^{\dagger}_{j}).

For a fixed SS, we suppose j{1,,M}j^{\ast}\in\{1,...,M\} achieves the minimum. By invoking Lemma 1 and 2, we promptly derive the unique optimal transport map as follows,

T(S)=expS(Sψ(S))=S¯j{S¯j}j=1M,T(S)=\exp_{S}\big{(}-\nabla_{S}\psi(S)\big{)}=\bar{S}^{\dagger}_{j^{\ast}}\in\{\bar{S}^{\dagger}_{j}\}_{j=1}^{M},

which implies the following theorem:

Theorem 3.

On ($\mathcal{S}_{++}$, $g^{LEM}$), consider the source measure $\mu$ and target measure $\nu\in\mathbb{P}(\mathcal{S}_{++})$, which are accessible only through finite samples $\{S_{i}\}_{i=1}^{N}\sim\mu$ and $\{\bar{S}_{j}\}_{j=1}^{M}\sim\nu$, each contributing to its empirical distribution with weights $\frac{1}{N}$ and $\frac{1}{M}$, respectively. Assuming the existence of a linear transformation between the feature space distributions of the source and target domains, which can be estimated using the discrete Kantorovich problem (2), the values of the continuous optimal transport $T(S)$ lie within the set $\big\{\exp\big(\gamma^{\dagger}[j,:]\cdot\{\log(S_{i})\}_{i=1}^{N}\big)\big\}_{j=1}^{M}$.

Theorem 3 extends [6, Theorem 3.1] to ($\mathcal{S}_{++}$, $g^{LEM}$) within the context of the OT-DA framework. Significantly, our framework is designed to incorporate any supervised neural network-based transformation, distinguishing it from the conventional use of affine transformations, and employs the squared Riemannian distance. In Section VI-A, we illuminate the connection between our proposed approach and a similar OT-DA framework on SPD manifolds proposed in [10] that uses the squared $\ell_{2}$ Euclidean distance as its cost function.

Figure 1: Illustration of Deep Optimal Transport: Multi-channel signals are first transformed into covariance matrices, situating these matrices on the space of (𝒮++\mathcal{S}_{++}, gLEMg^{LEM}). The SPD matrix-valued data is processed through the Bi-Map, ReEig, and LOG layers, moving it to the tangent space at the identity. The loss function takes into account the cross-entropy loss, marginal distribution adaptation, and conditional distribution adaptation.

III-B Deep Optimal Transport

We propose a novel geometric deep learning approach, Deep Optimal Transport (DOT) on ($\mathcal{S}_{++}$, $g^{LEM}$), to address joint distribution adaptation problems involving both marginal and conditional distribution shifts. We assume that samples from the source and target distributions can be aligned via OT. The proposed architecture is depicted in Figure 1.

Marginal Distribution Adaptation (MDA)

The objective of MDA is to adapt the marginal distributions $\mathbb{P}_{S}(X_{S})$ and $\mathbb{P}_{T}(X_{T})$ on ($\mathcal{S}_{++}$, $g^{LEM}$). We measure the statistical discrepancy between them using the Riemannian distance $d_{g^{LEM}}(\overline{w}(\mathcal{B}_{S}),\overline{w}(\mathcal{B}_{T}))$ between the Log-Euclidean Fréchet means of a batch of source samples $\mathcal{B}_{S}$ and a batch of target samples $\mathcal{B}_{T}$. The MDA loss is formulated as $\mathcal{L}_{MDA}:=||\log\big(\overline{w}(\mathcal{B}_{S})\big)-\log\big(\overline{w}(\mathcal{B}_{T})\big)||_{\mathcal{F}}$.
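A minimal PyTorch sketch of the MDA loss under the Log-Euclidean metric, assuming the network outputs batched SPD features of shape (batch, n, n); the helper names are illustrative.

```python
import torch

def spd_log(S):
    """Matrix logarithm of a batch of SPD matrices via eigendecomposition."""
    evals, evecs = torch.linalg.eigh(S)
    return evecs @ torch.diag_embed(torch.log(evals)) @ evecs.transpose(-1, -2)

def mda_loss(batch_src, batch_tgt):
    """||log(Frechet mean of source batch) - log(Frechet mean of target batch)||_F.
    Under g^LEM (Eq. 1) this equals the Frobenius distance between mean log-matrices."""
    return torch.norm(spd_log(batch_src).mean(dim=0) - spd_log(batch_tgt).mean(dim=0), p="fro")
```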

Conditional Distribution Adaptation (CDA)

The objective of CDA is to adapt the conditional distributions $\mathbb{Q}_{S}(Y_{S}|X_{S})$ and $\mathbb{Q}_{T}(Y_{T}|X_{T})$ on ($\mathcal{S}_{++}$, $g^{LEM}$). This requires pseudo-labels for the target domain predicted by a baseline algorithm [13]. We consider the class-conditional distribution $\mathbb{Q}(X|Y)$ instead of the posterior probability $\mathbb{Q}(Y|X)$ and use pseudo-labels $\hat{y}$ in the unsupervised setting. To quantify the statistical difference between $\mathbb{Q}_{S}(Y_{S}|X_{S})$ and $\mathbb{Q}_{T}(Y_{T}|X_{T})$, we again employ the Riemannian distance $d_{g^{LEM}}(\overline{w}(\mathcal{B}_{S^{\ell}}),\overline{w}(\mathcal{B}_{T^{\ell}}))$ between the Log-Euclidean Fréchet means of each batch of source and target samples $S^{\ell}:=\{x_{S}|\hat{y}_{S}=\ell\}$ and $T^{\ell}:=\{x_{T}|\hat{y}_{T}=\ell\}$, respectively, for each class $\ell\in\{1,...,L\}$. The CDA loss is formulated as $\mathcal{L}_{CDA}:=||\log\big(\overline{w}(\mathcal{B}_{S^{\ell}})\big)-\log\big(\overline{w}(\mathcal{B}_{T^{\ell}})\big)||_{\mathcal{F}}$.

Overall Objective Function

The overall objective function of DOT is composed of the cross-entropy loss $\mathcal{L}_{CE}$ and the joint distribution adaptation terms, i.e., $\mathcal{L}_{DOT}:=\alpha_{1}\mathcal{L}_{CE}+\alpha_{2}\mathcal{L}_{MDA}^{2}+\alpha_{3}\mathcal{L}_{CDA}^{2}$, where $\alpha_{1},\alpha_{2},\alpha_{3}\geq 0$.
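A hedged sketch of the overall objective, reusing the `spd_log` and `mda_loss` helpers from the previous snippet. Here the CDA term is summed over classes and relies on pseudo-labels for the target batch; all hyperparameters and names are illustrative.

```python
import torch
import torch.nn.functional as F

def cda_loss(feat_src, y_src, feat_tgt, y_tgt_pseudo, num_classes):
    """Sum of per-class Log-Euclidean mean discrepancies (pseudo-labels on the target)."""
    total = feat_src.new_zeros(())
    for c in range(num_classes):
        src_c = feat_src[y_src == c]
        tgt_c = feat_tgt[y_tgt_pseudo == c]
        if len(src_c) > 0 and len(tgt_c) > 0:
            total = total + mda_loss(src_c, tgt_c)
    return total

def dot_objective(logits_src, y_src, feat_src, feat_tgt, y_tgt_pseudo,
                  num_classes, a1=1.0, a2=1.0, a3=1.0):
    """L_DOT = a1 * L_CE + a2 * L_MDA^2 + a3 * L_CDA^2 (squared, per Brenier's formulation)."""
    l_ce = F.cross_entropy(logits_src, y_src)
    l_mda = mda_loss(feat_src, feat_tgt)
    l_cda = cda_loss(feat_src, y_src, feat_tgt, y_tgt_pseudo, num_classes)
    return a1 * l_ce + a2 * l_mda ** 2 + a3 * l_cda ** 2
```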

Remark.

1). MDA\mathcal{L}_{MDA} and CDA\mathcal{L}_{CDA} should be in the squared form, as required by Brenier’s formulation and the associated theoretical results [32].

2). On the space $\mathcal{S}_{++}$, multiple Riemannian metrics are available besides LEM. For example, the affine-invariant Riemannian metric (AIRM) $g^{AIRM}$ is another well-known Riemannian metric that has been extensively investigated in the literature [16]. However, AIRM is not a suitable choice for our framework: computing the Fréchet mean on ($\mathcal{S}_{++}$, $g^{AIRM}$) requires iterative methods, which is impractical for a loss function in deep learning-based approaches.

3). In practice, we substitute Equation 1 into MDA\mathcal{L}_{MDA} and CDA\mathcal{L}_{CDA}, resulting in the simplified expressions,

MDA\displaystyle\mathcal{L}_{MDA} =SiSlog(Si)|S|SjTlog(Sj)|T|;\displaystyle=||\sum_{S_{i}\in\mathcal{B}_{S}}\frac{\log{(S_{i})}}{|\mathcal{B}_{S}|}-\sum_{S_{j}\in\mathcal{B}_{T}}\frac{\log{(S_{j})}}{|\mathcal{B}_{T}|}||_{\mathcal{F}};
CDA\displaystyle\mathcal{L}_{CDA} =SiSllog(Si)|S|SjTlog(Sj)|T|.\displaystyle=||\sum_{{S^{\ell}_{i}}\in\mathcal{B}_{S^{l}}}\frac{\log{({S^{\ell}_{i}})}}{|\mathcal{B}_{S^{\ell}}|}-\sum_{{S^{\ell}_{j}}\in\mathcal{B}_{T^{\ell}}}\frac{\log{({S^{\ell}_{j}})}}{|\mathcal{B}_{T^{\ell}}|}||_{\mathcal{F}}.

These simplified expressions admit two distinct interpretations in domain adaptation. First, they can be seen as an empirical maximum mean discrepancy, $\mathcal{L}_{MMD}=||\sum_{i=1}^{n}\varphi(S_{i})/n-\sum_{j=1}^{m}\varphi(S_{j})/m||_{\mathcal{H}}$, where $n=|\mathcal{B}_{S}|$, $m=|\mathcal{B}_{T}|$, $S_{i}\in\mathcal{B}_{S}$, $S_{j}\in\mathcal{B}_{T}$, $\varphi(x):=\log(x)$ is the feature map, $\mathcal{H}:=\mathbb{R}^{d}$ is the reproducing kernel Hilbert space, and $d$ is the output dimension. Second, since each matrix $S$ in $\mathcal{B}_{S}$ or $\mathcal{B}_{T}$ is a covariance matrix of multichannel signals, the expressions can also be viewed as a correlation alignment (CORAL) approach [33, 34], which aligns the second-order statistics of the source and target data distributions to minimize the drift between them.

4). For multi-source adaptation, the loss function can be modified by incorporating joint distribution adaptation between each pair of source and target domains on ($\mathcal{S}_{++}$, $g^{LEM}$), as $\mathcal{L}:=\alpha_{1}\mathcal{L}_{CE}+\sum_{i}\big(\alpha_{2}^{i}\mathcal{L}_{MDA,i}+\alpha_{3}^{i}\mathcal{L}_{CDA,i}\big)$, where the subscript $i$ denotes the $i^{th}$ source domain. This modification is based on the theory of multi-marginal optimal transport on Riemannian manifolds [35] and multi-source joint distribution optimal transport [36].

5). DOT does not directly solve an OT problem. Instead, it uses the transportation costs as a guiding principle to direct the learning process of the neural network. Consequently, the trained neural network maps the source and target domains onto a submanifold of the SPD manifold, thereby implementing a subspace method in transfer learning.

6). DOT computes the transportation cost only between centroid points and addresses marginal and conditional distribution adaptation without requiring alternating parameter updates. In contrast, DeepJDOT computes transportation costs over all points and requires alternating parameter updates, since its transport plan is a deterministic transfer.

IV EXPERIMENTS

IV-A Deep Optimal Transport-Based Motor Imagery Classifier

We model the space of EEG spatial covariance matrices as ($\mathcal{S}_{++}$, $g^{LEM}$) and employ DOT to address the shift in the statistical characteristics of EEG signals during the feedback phase. In event-related desynchronization/synchronization approaches to motor imagery classification, the EEG frequency components that best discriminate between imagined movements lie in specific frequency bands [37]. Hence, after transforming to a latent space using the Bi-Map layer, it is reasonable to regard the source and target feature spaces as identical, i.e., $X_{S}=X_{T}$, and thus to model the space of EEG spatial covariance matrices as an SPD manifold with the Log-Euclidean metric. Moreover, the Bi-Map layer provides a transitive group action on SPD manifolds, allowing transitions between any two points on the manifold. Fig. 2 illustrates how the DOT-based MI-EEG classifier functions in a domain adaptation problem.

Specifically, an EEG trial is represented by $X\in\mathbb{R}^{n_{C}\times n_{T}}$, where $n_{C}$ and $n_{T}$ denote the number of channels and timestamps, respectively. The corresponding EEG spatial covariance matrix $S:=S(\Delta f\times\Delta t)\in\mathcal{S}_{++}^{n_{C}}$ represents an EEG segment within the frequency band $\Delta f$ and time interval $\Delta t$. Let $\mathcal{F}$ and $\mathcal{T}$ represent the frequency and time domains, respectively. A given segmentation $Seg(\mathcal{F},\mathcal{T})$ of the time-frequency domain of the EEG signals is written as $\{\Delta f\times\Delta t\,|\,\Delta f\subset Range(\mathcal{F})\text{ and }\Delta t\subset Range(\mathcal{T})\}$. To ensure the theoretical convergence of this classifier in a real-world context, the following three assumptions are necessary:

Assumption 1.

There exist linear transformations between the feature space distributions on the source and target domains for each segment of Seg(,𝒯)Seg(\mathcal{F},\mathcal{T}), which can be estimated using the Kantorovich Problem 2.

Assumption 2.

The weights for the Log-Euclidean Fréchet mean of samples on each time-frequency segment $\Delta f\times\Delta t\in Seg(\mathcal{F},\mathcal{T})$, in both the source and target empirical distributions, are uniform and equal to $1/|Seg(\mathcal{F},\mathcal{T})|$.

Assumption 3.

The Riemannian distance between the Log-Euclidean Fréchet means of source and target samples under the same time-frequency segment is minimized, that is:

dgLEM(w¯(S),w¯(T))\displaystyle d_{g^{LEM}}\big{(}\overline{w}(\mathcal{B}_{S}),\overline{w}(\mathcal{B}_{T})\big{)} dgLEM(w¯(S),w¯(T)),\displaystyle\leq d_{g^{LEM}}\big{(}\overline{w}(\mathcal{B}_{S}),\overline{w}(\mathcal{B}_{T}^{\prime})\big{)},

where T\mathcal{B}_{T}^{\prime} is a batch of samples on the target domain that is generated by the other time-frequency segments Δf×ΔtSeg(,𝒯)\Delta f^{\prime}\times\Delta t^{\prime}\in Seg(\mathcal{F},\mathcal{T}), where ΔfΔf\Delta f^{\prime}\neq\Delta f or ΔtΔt\Delta t^{\prime}\neq\Delta t.

Figure 2: Illustration of DOT-Based Motor-Imagery Classifier: The proposed classifier simultaneously transports the source and target Fréchet means in each frequency band to a common subspace of the same frequency band. Assumption 2 ensures that each frequency component contributes equally to the classification. Assumption 3 allows each pair of transformations to occur in the same EEG frequency band.

Assumption 1 is a necessary condition of modeling the OT-DA problem. Assumption 2 is straightforward because when we are uncertain about which EEG frequency band has a higher weight for a specific task, we tend to assume that each frequency band has equal priority for the classification task. Assumption 3 ensures that features extracted from different bands do not overlap.

Corollary 3.1.

Suppose EEG signals in the source and target domains have the same segmentation $Seg(\mathcal{F},\mathcal{T})$. Assuming we model EEG spatial covariance matrices in ($\mathcal{S}_{++}^{n_{C}}$, $g^{LEM}$) and that Assumptions 1, 2, and 3 hold, the values of the continuous optimal transport $T(S)$ reside within the set $\{S_{i}\}_{i=1}^{N}$.

Proof.

In this case, the transport plan γ\gamma equals the identity matrix. Consequently, Theorem 3 leads directly to this corollary. ∎

IV-B Evaluation Dataset

Three commonly used EEG-based motor imagery datasets are used for evaluation, accessed from two publicly available data packages, MOABB and BNCI Horizon 2020 (the MOABB package is available at https://github.com/NeuroTechX/moabb, and the BNCI Horizon 2020 project at http://bnci-horizon-2020.eu/database/data-sets). The selection criterion requires each subject in a dataset to have two-session recordings collected on different days. All digital signals were filtered using Chebyshev Type II filters at intervals of 4 Hz; the filters were designed to have a maximum loss of 3 dB in the passband and a minimum attenuation of 30 dB in the stopband. Descriptions of the three datasets follow.

1). Korea University (KU) Dataset: The KU dataset comprises EEG signals obtained from 54 subjects during a binary-class imagery task. The EEG signals were recorded using 62 electrodes at a sampling rate of 1,000 Hz. For evaluation, we selected 20 electrodes located in the motor cortex region. The dataset was divided into two sessions, S1 and S2, each consisting of a training and a testing phase. Each phase included 100 trials, with an equal number of right and left-hand imagery tasks. The EEG signals were epoched from second 1 to second 3.5 relative to the stimulus onset, resulting in a total epoch duration of 2.5 seconds.

2). BNCI2014001 Dataset: The BNCI2014001 dataset (also known as BCIC-IV-2a) involved 9 participants performing a motor imagery task with four classes: left hand, right hand, feet, and tongue. The task was recorded using 22 Ag/AgCl electrodes and three EOG channels at a sampling rate of 250 Hz. The recorded data were filtered using a bandpass filter between 0.5 and 100 Hz, with a 50 Hz notch filter applied. The study was conducted over two sessions on separate days, with each participant completing a total of 288 trials (six runs of 12 cue-based trials for each class). The EEG signals were epoched from the cue onset at second 2 to the end of the motor imagery period at second 6, resulting in a total epoch duration of 4 seconds.

3). BNCI2015001 Dataset: BNCI2015001 involves 12 participants performing a right-hand versus both-feet imagery task. Most of the study consisted of two sessions conducted on consecutive days, with each participant completing 100 trials per class, for a total of 200 trials per participant. Four subjects participated in a third session, but their third-session data were not included in the evaluation. The EEG data were acquired at a sampling rate of 512 Hz and filtered using a bandpass filter ranging from 0.5 to 100 Hz, with a notch filter at 50 Hz. The recordings began at second 3 with the prompt and continued until the end of the cross period at second 8, resulting in a total duration of 5 seconds.
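To make the preprocessing concrete, a hedged SciPy sketch of a 4 Hz-wide Chebyshev Type II filter bank followed by per-band spatial covariance estimation is given below; the filter order, the band grid, and the function name are illustrative choices, with the 3 dB/30 dB values taken from the description above.

```python
import numpy as np
from scipy.signal import cheby2, sosfiltfilt

def bandpass_covariances(eeg, fs, bands=((4, 8), (8, 12), (12, 16), (16, 20), (20, 24))):
    """eeg: array of shape (n_channels, n_timestamps).
    Returns one spatial covariance matrix per 4 Hz frequency band."""
    covs = []
    for lo, hi in bands:
        # Chebyshev Type II band-pass with 30 dB minimum stopband attenuation;
        # the order (4) is an illustrative choice for the 3 dB passband target.
        sos = cheby2(4, 30, [lo, hi], btype="bandpass", fs=fs, output="sos")
        x = sosfiltfilt(sos, eeg, axis=-1)
        # Spatial covariance (n_channels x n_channels); in practice a small ridge
        # term or shrinkage may be added to guarantee positive definiteness.
        covs.append(x @ x.T / x.shape[-1])
    return np.stack(covs)
```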

IV-C Experimental Settings

The experimental settings for domain adaptation on three evaluated datasets are described as follows, illustrated in Fig. 3: 1). Semi-supervised Domain Adaptation: For the KU dataset, S1 was used for training, and the first half of S2 was used for validation, while the second half was reserved for testing. Early stopping was adopted during the validation phase. 2). Unsupervised Domain Adaptation: For the BNCI2014001 and BNCI2015001 datasets, S1 was used for training, and S2 was used for testing. No validation set was used, and a maximum epoch value was set for stopping in practice.

We specified that the transfer procedure must always occur from S1 (source domain) to S2 (target domain), i.e., $S1\rightarrow S2$ for each dataset. This is reasonable because transferring backward from the later session to the earlier one has no practical meaning in the learning process. The former and latter sessions can be interpreted as the calibration and feedback phases, respectively.

Figure 3: Illustrations for experimental settings on three datasets: (a). KU; (b). BNCI2014001; (c). BNCI2015001

IV-D Evaluated Baselines

The geometric methods in this study can be broadly divided into four categories: parallel transport, deep parallel transport, optimal transport, and deep optimal transport. Each category includes various transfer methods, operating on three types of base machine learning classifiers: the support vector machine, minimum distance to Riemannian mean, and SPDNet. We begin by introducing these base classifiers and then discuss the transfer methods in each category.

  1. Support Vector Machine (SVM) [38]: A well-known machine learning classifier for classification and regression tasks, which seeks an optimal hyperplane that maximally separates data points of different classes, handling both linear and non-linear classification problems effectively.

  2. Minimum Distance to Riemannian Mean (MDM) [21]: This classifier employs the geodesic distance on SPD manifolds. A centroid is estimated for each predefined class, and a new data point is classified according to its nearest centroid.

  3. SPDNet [29]: This has been introduced in the preliminary part. In the experiments, we employ a one-depth Bi-Map layer with both input and output dimensions equal to the number of electrodes over the primary motor cortex: 20 (KU), 22 (BNCI2014001), and 13 (BNCI2015001).
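As a concrete illustration of the MDM baseline in item 2 above, the following hedged sketch estimates a per-class Fréchet mean and assigns a test matrix to the nearest centroid. For simplicity it uses the Log-Euclidean mean and distance, whereas MDM [21] is typically formulated with the affine-invariant geodesic distance; all names are illustrative.

```python
import numpy as np
from scipy.linalg import expm, logm

def mdm_fit(covs, labels):
    """covs: array of shape (n_trials, n, n); labels: array of shape (n_trials,).
    Returns per-class Log-Euclidean Frechet means (centroids)."""
    classes = np.unique(labels)
    return {c: expm(np.mean([logm(S) for S in covs[labels == c]], axis=0)) for c in classes}

def mdm_predict(centroids, S):
    """Assign S to the class whose centroid is nearest in Log-Euclidean distance."""
    dists = {c: np.linalg.norm(logm(S) - logm(M), "fro") for c, M in centroids.items()}
    return min(dists, key=dists.get)
```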

As computations on the SPD manifold are related to Riemannian metrics, we will emphasize the specific Riemannian metrics associated with each method. The different transfer methods across categories according to various Riemannian metrics can be seen in the first three columns of Table II.

IV-D1 Category of Parallel Transport

This category refers to the use of parallel transport to align the training and test data distributions. A representative method in this category is Riemannian Procrustes Analysis (RPA) [39]. RPA applies transformations on ($\mathcal{S}_{++}$, $g^{AIRM}$), including translation, scaling, and rotation, and then uses the MDM approach as the base classifier. The RPA approach consists of three sequential processes: recentering (RCT), stretching, and rotation (ROT). First, the RCT process re-centers the centroids of the data to the identity matrix; in the implementation, RCT is realized as parallel transport on ($\mathcal{S}_{++}$, $g^{AIRM}$) from the Fréchet mean of the source or target domain to the identity matrix. Second, the stretching process rescales the distributions of both datasets so that their dispersions around the mean are the same. Finally, the ROT process rotates the matrices from the target domain around the origin to match the orientation of its point cloud with that of the source domain. In particular, the RCT process is unsupervised and can be used in unsupervised domain adaptation, whereas the ROT process is supervised and can only be used in semi-supervised domain adaptation. Notably, RCT was also proposed as a standalone method for cross-session motor imagery classification [40].

IV-D2 Category of Deep Parallel Transport

This category refers to the domain adaptation approaches using deep learning and parallel transport on SPD manifolds. One representative work is known as Riemannian Batch Normalization (RieBN) [41]. RieBN is an SPD matrix-valued network architecture that employs parallel transport for batch centering and biasing on SPD neural networks.

IV-D3 Category of Optimal Transport

This category refers to treating the alignment of covariance-matrix distributions across sessions as an OT problem and solving for the transport plan using different variants of the Wasserstein distance [42]. The classifier is trained within one session and tested on the shifted session; this is an unsupervised method. Three types of Wasserstein distance are considered: the earth mover's distance (EMD), the sliced-Wasserstein distance on SPD matrices (SPDSW), and the log sliced-Wasserstein distance (logSW).

IV-D4 Category of Deep Optimal Transport

This category refers to the approach proposed in this study, which uses SPD neural networks on (𝒮++\mathcal{S}_{++}, gLEMg^{LEM}) to transfer both source domain and target domain to a common space induced by optimal transport.

V Main Results

Figure 4: Synthetic Data of 2-dimensional SPD Cone: The subfigures from left to right and top to bottom are labeled as (a) to (d). In subfigures (a) to (d), the red points represent the source domain, while the blue points indicate their locations after the transformation.
TABLE II: Experimental results. All results in the table are percentages (%). The comparison methods span four categories, nine methods, and three base classifiers; the results of the base classifiers without transfer methods appear at the bottom of the table. The symbol "/" indicates that a method is not applicable in the scenario constructed by that dataset.
Metric | Category | Method | Classifier | KU (54 subjects): Accuracy, Increment | BNCI2014001 (9 subjects): Accuracy, Increment | BNCI2015001 (12 subjects): Accuracy, Increment
Scenarios: KU is S1→S2 (semi-supervised); BNCI2014001 is T→E (unsupervised); BNCI2015001 is S1→S2 (unsupervised).
gAIRMg^{AIRM} OT EMD SVM 65.24 (14.81) -0.07 50.89 (16.54) -14.27 82.17 (11.80) ++6.25
PT RCT MDM 52.69 (6.39) ++0.34 55.13 (12.48) ++0.89 77.67 (13.23) ++5.46
ROT MDM 51.39 (5.89) -0.96 /
RCT SPDNet 69.91 (13.67) ++2.15 72.11 (12.99) ++3.51 81.25 (10.64) ++4.75
ROT SPDNet 63.44 (14.94) -4.32 /
DPT RieBN SPDNet 67.57 (15.27) -0.19 69.33 (13.49) ++0.73 78.83 (14.56) ++2.33
gLEMg^{LEM} OT EMD SVM 67.70 (15.14) -0.06 68.44 (15.79) ++3.28 82.17 (11.80) ++6.25
EMD SPDNet 64.22 (14.30) -3.54 62.15 (14.69) -6.45 78.33 (13.80) ++1.83
SPDSW SVM 66.67 (14.54) ++1.36 67.82 (16.71) ++2.66 82.42 (11.74) ++6.50
logSW SVM 66.26 (15.00) ++0.95 65.82 (16.56) ++0.66 80.75 (12.94) ++4.83
DOT DeepJDOT SPDNet 67.38 (14.67) -0.38 61.00 (11.40) -7.60 77.63 (15.12) ++1.13
MDA SPDNet 69.19 (14.37) ++1.43 67.78 (12.46) -0.82 78.33 (14.57) ++1.83
CDA SPDNet 68.80 (14.29) ++1.04 65.12 (11.34) -3.48 78.25 (14.59) ++1.75
MDA+CDA SPDNet 68.61 (14.79) ++1.05 64.85 (11.54) -3.75 78.29 (11.99) ++1.79
SVM 65.31 (14.42) 65.16 (17.00) 75.92 (16.29)
Base Classifier MDM 52.35 (6.73) 54.24 (13.31) 72.21 (14.67)
SPDNet 67.76 (14.55) 68.60 (13.03) 76.50 (14.41)

V-A Synthetic Dataset

We synthesize a source-domain dataset consisting of 50 covariance matrices of size $2\times 2$, drawn from a Gaussian distribution on the SPD manifold using the generative approach proposed in [43], with a median of 1 and a standard deviation of 0.4. We utilize the following transformation matrix in the generation,

W:=(10.50.51),W:=\begin{pmatrix}1&0.5\\ 0.5&1\\ \end{pmatrix},

and data in the target domain can be expressed as WSWWSW^{\top}.

Subfigure (a) shows red points from the source domain and blue points from the target domain. In Subfigure (b), the blue points represent the relocation of the red points via the EMD method from OT, and the associations made by this transformation are drawn as thin blue lines. Remarkably, these transformed blue points lie very close to the target-domain points in Subfigure (a). This can be attributed to the fact that the OT-DA method essentially computes a weighted average of the source and target domain data points, with weights learned to minimize the transportation cost.

The transformed blue points in Subfigures (c) and (d) are obtained by mapping the red points through neural network weights learned with the DOT and DeepJDOT methods, respectively. Corresponding point pairs are linked with thin blue lines. These figures reveal a fundamental characteristic shared by DOT and DeepJDOT: both project the source and target domains onto a subspace learned by the network, which in these two figures is a straight line. In differential geometry, this is a submanifold, i.e., a lower-dimensional manifold embedded within a higher-dimensional one. In this case, the two-dimensional matrices are represented by a line, essentially projecting the higher-dimensional data onto a lower-dimensional space, which reduces the complexity of the data while preserving information.

V-B Comparative Experiments on Real-World EEG Datasets

As shown in Table II, we classify all approaches into three categories according to Riemannian metrics. The first category does not require a Riemannian metric for computation, the second employs gAIRMg^{AIRM}, while the third uses gLEMg^{LEM}.

The experimental results indicate that the choice of Riemannian metric strongly influences the outcomes. For instance, EMD+SVM achieved 50.89 under $g^{AIRM}$ and 68.33 under $g^{LEM}$ on the BNCI2014001 dataset. Several further observations follow: 1). Participant number: the improvement is related to the number of participants; on datasets with fewer participants, the enhancement brought by transfer methods is more pronounced, with several scenarios showing changes of more than 5%. 2). Base classifier results: the extent of improvement is linked to the base classifier's performance; lower initial results leave more room for transfer methods to achieve significant gains. 3). Methodological factors: improvements are associated with the classifier's neural network architecture, parameter selection, and data preprocessing; therefore, the conclusions of this study may not be universally applicable.

Furthermore, we summarize the observations of all methods affected by various Riemannian metrics as follows:

  • (𝒮++\mathcal{S}_{++}, gAIRMg^{AIRM}) includes categories of OT, PT, and DPT. Combining the results from all three datasets, the RCT method performs well, especially when paired with the MDM or SPDNet classifier. The EMD method shows the largest variation across datasets. The second step (ROT) in the RPA method, which is limited to semi-supervised scenarios, performs worse than the first step (RCT) of RPA. The RieBN method provides slight improvements but does not outperform RCT across the board.

  • ($\mathcal{S}_{++}$, $g^{LEM}$) includes the OT and DOT categories. The EMD method performs significantly better when paired with the SVM classifier than with the SPDNet classifier. The SPDSW and logSW methods clearly improve over their base classifier (SVM), with SPDSW outperforming logSW by 1% to 2% on all three datasets. The MDA and CDA methods perform similarly: apart from the BNCI2014001 dataset, their improvements on the other two datasets are between 1% and 2%, and MDA is slightly better than CDA or the mixture of MDA and CDA. Notably, the use of pseudo-labels in the CDA method affects its results. The DeepJDOT method's performance across the three datasets falls short of our proposed approach; the primary reason is likely our decision to compute transportation costs only for the centers of each frequency band's data rather than for every data point. Minimizing the transportation cost does not necessarily imply improved classification accuracy, so the proposed approach reduces computational cost while enhancing the model's potential.

In addition, neural network-based approaches generally achieve better classification outcomes than the other categories of approaches, such as EMD, SPDSW, and logSW. To compute the CDA loss in DOT, we used the pseudo-label method, with pseudo-labels derived from Graph-CSPNet [26], which currently performs best on subject-specific tasks. The results of SPDSW and logSW on the BNCI2014001 dataset were obtained using the data processing procedure and source code provided in their GitHub repository and outperform ours, because the performance of the SVM methods depends heavily on the selection of handcrafted features.

VI DISCUSSIONS

The experimental results indicate that all transfer learning methods exhibit some effectiveness. However, it is important to note that although domain shift is a known factor in these tasks, classification performance is not influenced by it alone; consequently, the improvements from transfer learning methods are generally limited. Currently, we lack an effective method to quantify the intensity of domain shift and a way to correlate this intensity with classifier performance, making it difficult to fully assess these methods' effectiveness. In the rest of this section, we discuss several aspects of the proposed approach.

VI-A Theorem 3 with cost function $c(x,\bar{x})=||x-\bar{x}||_{\ell^{2}}^{2}$

We claim that Theorem 3 with the cost function $c(x,\bar{x})=||x-\bar{x}||_{\ell^{2}}^{2}$ is equivalent to [6, Theorem 3.1]. This was first observed and proved by Yair et al. [10], who used the matrix vectorization operator to establish an isometry between the affine transformation and the Bi-Map transformation. We follow their argument and slightly revise their proof.

Theorem 4.

[6, Theorem 3.1]  Let μ\mu and ν\nu be two discrete distributions on n\mathbb{R}^{n} with NN Diracs. Given a strictly positive definite matrix AA, bias weight bnb\in\mathbb{R}^{n}, and source samples {xi}i=1Nμ\{x_{i}\}_{i=1}^{N}\sim\mu, suppose target samples x¯i:=Axi+b\bar{x}_{i}:=Ax_{i}+b, for i=1,,Ni=1,...,N, and the weights in both source and target distributions’ empirical distributions are 1/N1/N. Then, transport T(xi)=Axi+bT(x_{i})=Ax_{i}+b is the solution to the OT Problem provided with a cost function c(x,x¯):=xx¯22c(x,\bar{x}):=||x-\bar{x}||_{\ell^{2}}^{2}.

To expose the relationship between the Bi-Map transformation in Theorem 3 and affine transformation in Theorem 4, we introduce the matrix vectorization operator vec()vec(\cdot) to stack the column vectors of matrix A=(a1|a2||an)A=(a_{1}\big{|}a_{2}\big{|}\cdots\big{|}a_{n}) below on another as follows,

vec(A):=(a1a2an).vec(A):=\begin{pmatrix}a_{1}\\ a_{2}\\ \vdots\\ a_{n}\\ \end{pmatrix}.

Then, we have the following lemma.

Lemma 5.

vec(WSW)=(WW)vec(S)vec(WSW^{\top})=(W\otimes W)vec(S), where \otimes is Kronecker product.

Proof.

Suppose SS is an n×nn\times n matrix, and W=(w1|w2||wm)W^{\top}=(w_{1}^{\top}|w_{2}^{\top}|\cdots|w_{m}^{\top}) is the transpose of m×nm\times n matrix WW with each column vector winw_{i}^{\top}\in\mathbb{R}^{n} (i=1, …, m). The k-th column vector of WSWWSW^{\top} can be written as (WSW)(:,k)=WSwk=(wkW)vec(S)(WSW^{\top})_{(:,k)}=WSw_{k}^{\top}=(w_{k}^{\top}\otimes W)vec(S). Hence, we achieve the lemma as follows,

vec(WSW^{\top})=\begin{pmatrix}(WSW^{\top})_{(:,1)}\\ \vdots\\ (WSW^{\top})_{(:,m)}\end{pmatrix}=\begin{pmatrix}w_{1}\otimes W\\ \vdots\\ w_{m}\otimes W\end{pmatrix}vec(S)=(W\otimes W)\,vec(S). ∎

Lemma 6.

Suppose $W\in\mathbb{R}^{n\times n}$ is positive definite; then $W\otimes W$ is also positive definite.

Proof.

This follows from the fact that if $\{\lambda_{1},\dots,\lambda_{n}\}$ are the eigenvalues of $W$, then the eigenvalues of $W\otimes W$ are the pairwise products $\{\lambda_{i}\lambda_{j}\}_{i,j=1}^{n}$, all of which are positive. Hence, $W\otimes W$ is positive definite. ∎

According to Lemmas 5 and 6, under vectorization the Bi-Map transformation $W\cdot S\cdot W^{\top}$ becomes an affine transformation with the strictly positive definite matrix $A=W\otimes W$, which places us in the setting of Theorem 4. Hence, Theorem 3 endowed with the cost function $c\big(vec(x),vec(\bar{x})\big)=\|vec(x)-vec(\bar{x})\|_{\ell^{2}}^{2}$ is equivalent to Theorem 4.
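
Both lemmas are easy to confirm numerically; the following sketch (an illustration, not part of the proof) checks the vectorization identity of Lemma 5 and the positive definiteness claim of Lemma 6 with NumPy.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 4, 3

def vec(M):
    """Stack the columns of M one below another (column-major flattening)."""
    return M.reshape(-1, order="F")

# Lemma 5: vec(W S W^T) = (W ⊗ W) vec(S).
W = rng.standard_normal((m, n))
S = rng.standard_normal((n, n))
S = (S + S.T) / 2                                              # symmetric S
print(np.allclose(vec(W @ S @ W.T), np.kron(W, W) @ vec(S)))   # True

# Lemma 6: if W (square) is positive definite, then so is W ⊗ W.
B = rng.standard_normal((n, n))
W_pd = B @ B.T + n * np.eye(n)                                 # positive definite
print(np.all(np.linalg.eigvalsh(np.kron(W_pd, W_pd)) > 0))     # True
```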

VI-B Parallel Transport

Parallel transport on Riemannian manifolds transfers a tangent vector at one point along a curve to another point such that the transported vector remains parallel to the original vector [30]. Under the two Riemannian metrics $g^{AIRM}$ and $g^{LEM}$ on SPD manifolds, the expressions of parallel transport differ markedly and lead to two distinct modeling approaches [44, 10]. Specifically, for any $S_{1}$ and $S_{2}$ on $(\mathcal{S}_{++}, g^{AIRM})$, the parallel transport from $S_{1}$ to $S_{2}$ of $s\in T_{S_{1}}\mathcal{S}_{++}$ can be expressed as $\Gamma_{S_{1}\rightarrow S_{2}}(s)=EsE^{\top}$, where $E=(S_{2}\cdot S_{1}^{-1})^{\frac{1}{2}}$. For any $S_{1}$ and $S_{2}$ on $(\mathcal{S}_{++}, g^{LEM})$, parallel transport is simply $\Gamma_{S_{1}\rightarrow S_{2}}(s)=s$.
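
For concreteness, both transport maps take only a few lines to implement; the sketch below is our own helper code (not from the paper's repository), using SciPy's matrix square root and assuming the tangent vector s is a symmetric matrix.

```python
import numpy as np
from scipy.linalg import sqrtm

def parallel_transport_airm(S1, S2, s):
    """Parallel transport of s from T_{S1} to T_{S2} under g^{AIRM}:
    Gamma(s) = E s E^T with E = (S2 S1^{-1})^{1/2}."""
    E = np.real(sqrtm(S2 @ np.linalg.inv(S1)))   # real part guards against round-off
    return E @ s @ E.T

def parallel_transport_lem(S1, S2, s):
    """Under g^{LEM}, parallel transport is the identity map on tangent vectors."""
    return s
```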

Keep in mind that DOT seeks a neural network-based transformation $\varphi:X_{S},X_{T}\longmapsto Z\subset\mathcal{S}_{++}$ such that $\mathbb{P}_{S}(\varphi(X_{S}))=\mathbb{P}_{T}(\varphi(X_{T}))$ and $\mathbb{Q}_{S}(Y_{S}|\varphi(X_{S}))=\mathbb{Q}_{T}(Y_{T}|\varphi(X_{T}))$, where $\varphi(X_{S})$ and $\varphi(X_{T})$ denote the images of the source and target feature spaces under $\varphi$. The transformation $\varphi$ extracts discriminative information from the feature spaces. After this extraction, it is reasonable to treat the source feature space as identical to the target feature space, since both are transformed by neural networks sharing the same weights, and the same holds for their transformed feature spaces, i.e., $X_{T}=X_{S}$ and $\varphi(X_{T})=\varphi(X_{S})\subset Z$. We therefore adopt the Log-Euclidean metric to formulate the space of covariance matrices.
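
One simple way to instantiate the marginal part of this objective in the log-domain is to penalize the Log-Euclidean distance between the Fréchet means of the transformed source and target batches. The PyTorch-style sketch below is a minimal illustration under that assumption; the function names are ours, and it is not claimed to be the exact loss used by DOT.

```python
import torch

def logm_spd(S):
    """Matrix logarithm of a batch of SPD matrices via eigendecomposition."""
    w, V = torch.linalg.eigh(S)                            # S = V diag(w) V^T
    return V @ torch.diag_embed(torch.log(w)) @ V.transpose(-1, -2)

def marginal_alignment_loss(z_source, z_target):
    """Squared Log-Euclidean distance between the Fréchet means of the transformed
    source and target batches (each of shape [batch, c, c])."""
    mean_log_s = logm_spd(z_source).mean(dim=0)            # LEM Fréchet mean in the log-domain
    mean_log_t = logm_spd(z_target).mean(dim=0)
    return ((mean_log_s - mean_log_t) ** 2).sum()
```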

VI-C Assumption 3

In this subsection, we justify that Assumption 3 is reasonable based on numerical results. Table III reports the average Log-Euclidean distances between the per-frequency-band Fréchet means of the source and target domains; the diagonal entries, which correspond to the same frequency band in both domains, are consistently the smallest in each row across the datasets. This suggests that the mapping shifts the center of each frequency band in the source domain to the corresponding frequency-band center in the target domain. This aligns with our intuition that differences between identical frequency bands should be smaller, since they capture similar signal characteristics within the same frequency range, whereas differences between distinct frequency bands tend to be larger.

The procedure for calculating the average Log-Euclidean distance is as follows: assuming the source and target samples are stored in the format [trials, frequency bands, channels, channels], with the frequency band fixed in the second dimension, we first calculate the Fréchet mean of the covariance matrices along the trial dimension, yielding an array of shape [frequency bands, channels, channels] per domain. We then directly compute the average Log-Euclidean distance between the source and target means for each pair of frequency bands, as sketched below.
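
A minimal NumPy sketch of this procedure follows, assuming the stated [trials, frequency bands, channels, channels] layout and taking the Fréchet mean under the Log-Euclidean metric (i.e., averaging matrix logarithms); the resulting per-subject distance matrices are averaged over subjects to produce the entries of Table III.

```python
import numpy as np
from scipy.linalg import logm

def band_log_means(covs):
    """covs: [trials, bands, c, c] -> per-band mean of matrix logarithms, [bands, c, c]."""
    logs = np.array([[logm(covs[t, b]) for b in range(covs.shape[1])]
                     for t in range(covs.shape[0])])
    return logs.mean(axis=0)

def band_pairwise_le_distances(source_covs, target_covs):
    """Log-Euclidean distances between the per-band Fréchet means of the source
    and target domains (one subject); Table III averages these over subjects."""
    Ls, Lt = band_log_means(source_covs), band_log_means(target_covs)
    B = Ls.shape[0]
    D = np.zeros((B, B))
    for i in range(B):
        for j in range(B):
            D[i, j] = np.linalg.norm(Ls[i] - Lt[j], "fro")
    return D
```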

TABLE III: Average Log-Euclidean distances between the Fréchet means of EEG spatial covariance matrices generated from different frequency bands in the source and target domains for the KU (54 subjects), BNCI2014001 (9 subjects), and BNCI2015001 (12 subjects) datasets. The shortest average Log-Euclidean distance in each row is highlighted in boldface.
KU: S1 \ S2   4–8 Hz   8–12 Hz   12–16 Hz   16–20 Hz   20–24 Hz   24–28 Hz   28–32 Hz   32–36 Hz   36–40 Hz
4–8 Hz   4.69   4.98   5.82   6.74   7.19   7.64   8.17   8.78   9.28
8–12 Hz   5.09   4.21   4.82   5.81   6.25   6.70   7.22   7.82   8.31
12–16 Hz   5.84   4.76   4.01   4.62   4.98   5.34   5.78   6.30   6.75
16–20 Hz   6.66   5.61   4.37   3.96   4.08   4.33   4.64   5.02   5.40
20–24 Hz   7.08   6.01   4.66   3.97   3.86   3.99   4.27   4.63   4.97
24–28 Hz   7.52   6.45   4.98   4.11   3.86   3.75   3.91   4.21   4.52
28–32 Hz   8.04   6.96   5.40   4.37   4.07   3.82   3.73   3.87   4.11
32–36 Hz   8.65   7.56   5.93   4.77   4.42   4.13   3.88   3.74   3.82
36–40 Hz   9.16   8.07   6.40   5.16   4.77   4.43   4.11   3.82   3.74
BNCI2014001: T \ E   4–8 Hz   8–12 Hz   12–16 Hz   16–20 Hz   20–24 Hz   24–28 Hz   28–32 Hz   32–36 Hz   36–40 Hz
4–8 Hz   2.81   4.25   5.14   5.88   6.70   8.30   9.64   10.97   12.46
8–12 Hz   4.31   2.37   3.98   5.45   6.28   8.07   9.44   10.77   12.26
12–16 Hz   5.16   3.91   2.29   3.53   4.23   5.83   7.15   8.42   9.87
16–20 Hz   5.84   5.49   3.44   2.17   2.76   4.08   5.29   6.56   7.98
20–24 Hz   6.59   6.24   4.18   2.75   2.18   3.31   4.54   5.79   7.24
24–28 Hz   8.11   8.02   5.72   4.01   3.14   2.20   2.96   4.03   5.38
28–32 Hz   9.42   9.38   7.06   5.21   4.39   2.83   2.21   2.86   4.05
32–36 Hz   10.67   10.63   8.25   6.43   5.57   3.87   2.72   2.22   3.05
36–40 Hz   12.16   12.12   9.69   7.85   7.02   5.18   3.85   2.70   2.22
BNCI2015001: S1 \ S2   4–8 Hz   8–12 Hz   12–16 Hz   16–20 Hz   20–24 Hz   24–28 Hz   28–32 Hz   32–36 Hz   36–40 Hz
4–8 Hz   1.07   12.39   12.35   10.36   9.71   8.46   5.73   2.25   6.24
8–12 Hz   12.04   1.06   1.49   2.68   3.22   4.40   7.13   12.43   18.19
12–16 Hz   12.10   1.60   1.09   2.52   3.13   4.32   7.08   12.41   18.18
16–20 Hz   10.19   2.99   2.67   1.19   1.59   2.45   5.06   10.40   16.22
20–24 Hz   9.64   3.40   3.15   1.53   1.22   1.90   4.51   9.84   15.66
24–28 Hz   8.36   4.51   4.30   2.34   1.80   1.19   3.21   8.52   14.34
28–32 Hz   5.72   7.17   7.01   4.94   4.31   3.06   1.28   5.73   11.54
32–36 Hz   2.52   12.43   12.31   10.24   9.58   8.29   5.44   1.30   6.17
36–40 Hz   6.41   18.22   18.12   16.09   15.42   14.14   11.27   5.88   1.25

VII CONCLUSIONS

This study extends the OT-DA framework to SPD manifolds equipped with the Log-Euclidean metric and introduces a geometric deep learning approach to address real-world domain adaptation challenges involving SPD matrix-valued data. Specifically, it handles cross-session variability in EEG-based motor imagery classification. Experiments on three publicly available EEG datasets validate the effectiveness of the proposed approach. The results highlight its practical potential and advantages in handling complex domain adaptation tasks in EEG data analysis.

Acknowledgment

This study is supported under the RIE2020 Industry Alignment Fund–Industry Collaboration Projects (IAF-ICP) Funding Initiative, as well as cash and in-kind contributions from the industry partner(s). This study is also supported by the RIE2020 AME Programmatic Fund, Singapore (No. A20G8b0102).

References

  • [1] S. Ben-David, J. Blitzer, K. Crammer, F. Pereira et al., “Analysis of representations for domain adaptation,” Advances in neural information processing systems, vol. 19, p. 137, 2007.
  • [2] S. J. Pan and Q. Yang, “A survey on transfer learning,” IEEE Transactions on knowledge and data engineering, vol. 22, no. 10, pp. 1345–1359, 2009.
  • [3] M. Wang and W. Deng, “Deep visual domain adaptation: A survey,” Neurocomputing, vol. 312, pp. 135–153, 2018.
  • [4] Y.-P. Lin and T.-P. Jung, “Improving EEG-based emotion classification using conditional transfer learning,” Frontiers in human neuroscience, vol. 11, p. 334, 2017.
  • [5] M. Sugiyama, M. Krauledat, and K.-R. Müller, “Covariate shift adaptation by importance weighted cross validation.” Journal of Machine Learning Research, vol. 8, no. 5, 2007.
  • [6] N. Courty, R. Flamary, D. Tuia, and A. Rakotomamonjy, “Optimal transport for domain adaptation,” IEEE transactions on pattern analysis and machine intelligence, vol. 39, no. 9, pp. 1853–1865, 2016.
  • [7] N. Courty, R. Flamary, A. Habrard, and A. Rakotomamonjy, “Joint distribution optimal transportation for domain adaptation,” in Proceedings of the 31st International Conference on Neural Information Processing Systems, ser. NIPS’17.   Red Hook, NY, USA: Curran Associates Inc., 2017, p. 3733–3742.
  • [8] ——, “Joint distribution optimal transportation for domain adaptation,” Advances in neural information processing systems, vol. 30, 2017.
  • [9] B. B. Damodaran, B. Kellenberger, R. Flamary, D. Tuia, and N. Courty, “DeepJDOT: Deep joint distribution optimal transport for unsupervised domain adaptation,” in Proceedings of the European conference on computer vision (ECCV), 2018, pp. 447–463.
  • [10] O. Yair, F. Dietrich, R. Talmon, and I. G. Kevrekidis, “Domain adaptation with optimal transport on the manifold of SPD matrices,” arXiv preprint arXiv:1906.00616, 2019.
  • [11] Z. J. Koles, M. S. Lazar, and S. Z. Zhou, “Spatial patterns underlying population differences in the background EEG,” Brain topography, vol. 2, no. 4, pp. 275–284, 1990.
  • [12] J. Müller-Gerking, G. Pfurtscheller, and H. Flyvbjerg, “Designing optimal spatial filters for single-trial EEG classification in a movement task,” Clinical neurophysiology, vol. 110, no. 5, pp. 787–798, 1999.
  • [13] M. Long, J. Wang, G. Ding, J. Sun, and P. S. Yu, “Transfer feature learning with joint distribution adaptation,” in Proceedings of the IEEE international conference on computer vision, 2013, pp. 2200–2207.
  • [14] V. Arsigny, P. Fillard, X. Pennec, and N. Ayache, “Fast and simple computations on tensors with Log-Euclidean metrics,” Ph.D. dissertation, INRIA, 2005.
  • [15] ——, “Log-Euclidean metrics for fast and simple calculus on diffusion tensors,” Magnetic Resonance in Medicine: An Official Journal of the International Society for Magnetic Resonance in Medicine, vol. 56, no. 2, pp. 411–421, 2006.
  • [16] X. Pennec, P. Fillard, and N. Ayache, “A Riemannian framework for tensor computing,” International Journal of computer vision, vol. 66, pp. 41–66, 2006.
  • [17] H. Q. Minh and V. Murino, Covariances in computer vision and machine learning.   Springer Nature, 2022.
  • [18] S. Kolouri, S. R. Park, M. Thorpe, D. Slepcev, and G. K. Rohde, “Optimal mass transport: Signal processing and machine-learning applications,” IEEE signal processing magazine, vol. 34, no. 4, pp. 43–59, 2017.
  • [19] X. Pennec, “Statistical computing on manifolds for computational anatomy,” Ph.D. dissertation, Université Nice Sophia Antipolis, 2006.
  • [20] ——, “Manifold-valued image processing with SPD matrices,” in Riemannian geometric statistics in medical image analysis.   Elsevier, 2020, pp. 75–134.
  • [21] A. Barachant, S. Bonnet, M. Congedo, and C. Jutten, “Multiclass brain–computer interface classification by Riemannian geometry,” IEEE Transactions on Biomedical Engineering, vol. 59, no. 4, pp. 920–928, 2011.
  • [22] ——, “Classification of covariance matrices using a Riemannian-based kernel for BCI applications,” Neurocomputing, vol. 112, pp. 172–178, 2013.
  • [23] M. Congedo, A. Barachant, and R. Bhatia, “Riemannian geometry for EEG-based brain-computer interfaces; a primer and a review,” Brain-Computer Interfaces, vol. 4, no. 3, pp. 155–174, 2017.
  • [24] C. Ju, D. Gao, R. Mane, B. Tan, Y. Liu, and C. Guan, “Federated transfer learning for EEG signal classification,” in 2020 42nd Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC).   IEEE, 2020, pp. 3040–3045.
  • [25] C. Ju and C. Guan, “Tensor-CSPNet: A novel geometric deep learning framework for motor imagery classification,” IEEE Transactions on Neural Networks and Learning Systems, 2022.
  • [26] ——, “Graph neural networks on SPD manifolds for motor imagery classification: A perspective from the time-frequency analysis,” IEEE Transactions on Neural Networks and Learning Systems, 2023.
  • [27] C. Ju, R. J. Kobler, and C. Guan, “Score-based data generation for EEG spatial covariance matrices: Towards boosting BCI performance,” in 2023 45th Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC), 2023.
  • [28] C. Ju, R. J. Kobler, L. Tang, C. Guan, and M. Kawanabe, “Deep geodesic canonical correlation analysis for covariance-based neuroimaging data,” in The Twelfth International Conference on Learning Representations (ICLR), 2024.
  • [29] Z. Huang and L. Van Gool, “A Riemannian network for SPD matrix learning,” in Thirty-First AAAI Conference on Artificial Intelligence, 2017.
  • [30] P. Petersen, S. Axler, and K. Ribet, Riemannian geometry.   Springer, 2006, vol. 171.
  • [31] M. P. Do Carmo and J. Flaherty Francis, Riemannian geometry.   Springer, 1992, vol. 6.
  • [32] R. J. McCann, “Polar factorization of maps on Riemannian manifolds,” Geometric & Functional Analysis GAFA, vol. 11, no. 3, pp. 589–608, 2001.
  • [33] B. Sun and K. Saenko, “Deep CORAL: Correlation alignment for deep domain adaptation,” in European conference on computer vision.   Springer, 2016, pp. 443–450.
  • [34] B. Sun, J. Feng, and K. Saenko, “Correlation alignment for unsupervised domain adaptation,” in Domain Adaptation in Computer Vision Applications.   Springer, 2017, pp. 153–171.
  • [35] Y.-H. Kim and B. Pass, “Multi-marginal optimal transport on Riemannian manifolds,” American Journal of Mathematics, vol. 137, no. 4, pp. 1045–1060, 2015.
  • [36] R. Turrisi, R. Flamary, A. Rakotomamonjy, and M. Pontil, “Multi-source domain adaptation via weighted joint distributions optimal transport,” in Uncertainty in Artificial Intelligence.   PMLR, 2022, pp. 1970–1980.
  • [37] G. Pfurtscheller, C. Neuper, D. Flotzinger, and M. Pregenzer, “EEG-based discrimination between imagination of right and left hand movement,” Electroencephalography and clinical Neurophysiology, vol. 103, no. 6, pp. 642–651, 1997.
  • [38] C. Cortes and V. Vapnik, “Support-vector networks,” Machine learning, vol. 20, pp. 273–297, 1995.
  • [39] P. L. C. Rodrigues, C. Jutten, and M. Congedo, “Riemannian Procrustes analysis: transfer learning for brain–computer interfaces,” IEEE Transactions on Biomedical Engineering, vol. 66, no. 8, pp. 2390–2401, 2018.
  • [40] O. Yair, M. Ben-Chen, and R. Talmon, “Parallel transport on the cone manifold of SPD matrices for domain adaptation,” IEEE Transactions on Signal Processing, vol. 67, no. 7, pp. 1797–1811, 2019.
  • [41] D. Brooks, O. Schwander, F. Barbaresco, J.-Y. Schneider, and M. Cord, “Riemannian batch normalization for SPD neural networks,” Advances in Neural Information Processing Systems, vol. 32, 2019.
  • [42] C. Bonet, B. Malézieux, A. Rakotomamonjy, L. Drumetz, T. Moreau, M. Kowalski, and N. Courty, “Sliced-Wasserstein on symmetric positive definite matrices for M/EEG signals,” arXiv preprint arXiv:2303.05798, 2023.
  • [43] S. Said, L. Bombrun, Y. Berthoumieu, and J. H. Manton, “Riemannian Gaussian distributions on the space of symmetric positive definite matrices,” IEEE Transactions on Information Theory, vol. 63, no. 4, pp. 2153–2170, 2017.
  • [44] S. Sra and R. Hosseini, “Conic geometric optimization on the manifold of positive definite matrices,” SIAM Journal on Optimization, vol. 25, no. 1, pp. 713–739, 2015.