
Higher order multi-dimension reduction methods via Einstein-product

A. Zahir¹, K. Jbilou¹,², A. Ratnani¹
¹The UM6P Vanguard Center, Mohammed VI Polytechnic University, Green City, Morocco.
²Université du Littoral Côte d'Opale, LMPA, 50 rue F. Buisson, 62228 Calais-Cedex, France.
Abstract

This paper explores the extension of dimension reduction (DR) techniques to the multi-dimension case by using the Einstein product. Our focus lies on graph-based methods, encompassing both linear and nonlinear approaches, within both supervised and unsupervised learning paradigms. Additionally, we investigate variants such as repulsion graphs and kernel methods for linear approaches. Furthermore, we present two generalizations for each method, based on single or multiple weights. We demonstrate the straightforward nature of these generalizations and provide theoretical insights. Numerical experiments are conducted, and results are compared with original methods, highlighting the efficiency of our proposed methods, particularly in handling high-dimensional data such as color images.

keywords:
Tensor, Dimension reduction, Einstein product, Graph-based methods, Multi-dimensional Data, Trace optimization problem.

1 Introduction

In today's data-driven world, where vast amounts of information are collected and analyzed, the ability to simplify and interpret data has never been more critical. The challenge is particularly evident in data science and machine learning [29], where the curse of dimensionality is a major obstacle. Dimension reduction techniques address this issue by projecting high-dimensional data onto a lower-dimensional space while preserving the underlying structure of the data. These methods have proven to be quite efficient in revealing hidden structures and patterns.

The landscape of DR is quite rich, with a wide range of methods, from linear to non-linear [18], supervised to unsupervised, and versions of these methods that incorporate repulsion-based principles or kernels. Examples include Principal Component Analysis (PCA) [9], Locality Preserving Projections (LPP) [12], Orthogonal Neighborhood Preserving Projections (ONPP) [14, 15], Neighborhood Preserving Projections (NPP) [14], Laplacian Eigenmap (LE) [4], and Locally Linear Embedding (LLE) [23], among others. Each of these techniques has been subject to extensive research and applications, offering insights into data structures that are often hidden in high-dimensional spaces. These methods can also be cast as constrained trace optimization problems [24].

Current approaches often require transforming multi-dimensional data, such as images [13, 3, 26, 19, 30], into flattened (vectorized) matrix form before analysis. While fast, this process can be problematic, as it may discard the inherent structure and relational information within the data.

This paper proposes a novel approach to generalize dimensional reduction techniques, employing the Einstein product, a tool in tensor algebra, which is the natural extension of the usual matrix product. By reformulating the operations of both linear and non-linear methods in the context of tensor operations, the generalization maintains the multi-dimensional integrity of complex datasets. This approach circumvents the need for vectorization, preserving the rich intrinsic structure of the data.
Our contribution lies not only in proposing a generalized framework for dimension reduction, but also in demonstrating its effectiveness through empirical studies. We show that the proposed methods outperform, or at least match, their matrix-based counterparts, while preserving the integrity of the data.

This paper is organized as follows. Section 2 reviews the methods in the matrix case. Section 3 introduces the mathematical background of tensors and the Einstein product. Section 4 presents the generalization of the different methods via the Einstein product. Section 5 is dedicated to variants of these techniques. Section 6 reports the numerical experiments and results. Finally, Section 7 offers concluding remarks and suggestions for future work.

2 Dimension reduction methods in matrix case

Given a set of $n$ data points $\mathbf{x}_{1},\ldots,\mathbf{x}_{n}\in\mathbb{R}^{m}$ and a set of $n$ corresponding points $\mathbf{y}_{1},\ldots,\mathbf{y}_{n}\in\mathbb{R}^{d}$, denote the data matrix $X=\left[\mathbf{x}_{1},\cdots,\mathbf{x}_{n}\right]\in\mathbb{R}^{m\times n}$ and the low-dimensional matrix $Y=\left[\mathbf{y}_{1},\cdots,\mathbf{y}_{n}\right]\in\mathbb{R}^{d\times n}$. The objective is to find a mapping $\Phi:\mathbb{R}^{m}\longrightarrow\mathbb{R}^{d}$ with $\Phi\left(\mathbf{x}_{i}\right)=\mathbf{y}_{i}$, $i=1,\ldots,n$. The mapping is either non-linear, $Y=\Phi(X)$, or linear, $Y=V^{T}X$; in the latter case, the task reduces to finding the projection matrix $V\in\mathbb{R}^{m\times d}$.

We denote the similarity matrix of a graph by $W\in\mathbb{R}^{n\times n}$, the degree matrix by $D$, and the Laplacian matrix by $L=D-W$. For the sake of simplification, we define the matrices

$$L_{n}=D^{-1/2}LD^{-1/2},\quad \widehat{W}=D^{-1/2}WD^{-1/2},\quad M=(I_{n}-W^{T})(I_{n}-W),$$
$$\widehat{X}=XD^{1/2},\quad \widehat{Y}=YD^{1/2},\quad H=I_{n}-\dfrac{1}{n}\mathbf{1}\mathbf{1}^{T},$$

where $H$ is the centering matrix and $\mathbf{1}=\left(1,\ldots,1\right)^{T}\in\mathbb{R}^{n}$.
The usual loss functions used are defined as follows

$$\phi_{1}(Y) := \dfrac{1}{2}\sum_{i,j=1}^{n}W_{ij}\left\|\mathbf{y}_{i}-\mathbf{y}_{j}\right\|_{2}^{2}=\operatorname{Tr}\left[YLY^{H}\right], \quad (1)$$
$$\phi_{2}(Y) := \sum_{i}\left\|\mathbf{y}_{i}-\sum_{j}W_{ij}\mathbf{y}_{j}\right\|_{2}^{2}=\operatorname{Tr}\left[YMY^{H}\right], \quad (2)$$
$$\Phi_{3}(Y) := \sum_{i}\left\|\mathbf{y}_{i}-\dfrac{1}{n}\sum_{j}\mathbf{y}_{j}\right\|_{2}^{2}=\operatorname{Tr}\left[Y\left(I-\dfrac{1}{n}\mathbf{1}\mathbf{1}^{T}\right)Y^{H}\right]. \quad (3)$$

Equations (1) and (3) preserve locality, i.e., a point and its representation stay close, while Equation (2) preserves the local geometry, i.e., each representation point can be written as a linear combination of its neighbours.

For simplicity, we will refer to the $d$ eigenvectors of a matrix corresponding to the largest and smallest eigenvalues, respectively, as the largest and smallest $d$ eigenvectors of the matrix. The same terminology applies to the left or right singular vectors. Table 1 summarizes the various dimension reduction methods, their corresponding optimization problems, and their solutions.

Method | Loss function | Constraint | Solution

Linear methods:
Principal Component Analysis [9] | Maximize Equation (3) | $VV^{T}=I$ | Largest $d$ left singular vectors of $XH$.
Locality Preserving Projections [12] | Minimize Equation (1) | $YDY^{T}=I$ | Solution of $\widehat{X}(I_{n}-\widehat{W})\widehat{X}^{T}u_{i}=\lambda_{i}\widehat{X}\widehat{X}^{T}u_{i}$.
Orthogonal Locality Preserving Projections [14, 15] | Minimize Equation (1) | $VV^{T}=I$ | Smallest $d$ eigenvectors of $XLX^{T}$.
Orthogonal Neighborhood Preserving Projections [14, 15] | Minimize Equation (2) | $VV^{T}=I$ | Smallest $d$ eigenvectors of $XMX^{T}$.
Neighborhood Preserving Projections [14] | Minimize Equation (2) | $YY^{T}=I$ | Solution of $XMX^{T}u_{i}=\lambda_{i}XX^{T}u_{i}$.

Non-linear methods:
Locally Linear Embedding [23] | Minimize Equation (2) | $YY^{T}=I$ | Smallest $d$ eigenvectors of $M$.
Laplacian Eigenmap [4] | Minimize Equation (1) | $YDY^{T}=I$ | Solution of $Lu_{i}=\lambda_{i}Du_{i}$.

Table 1: Objective functions and constraints employed in various dimension reduction methods, along with the corresponding solutions.

Notice that the smallest eigenvalue is disregarded in the solutions; thus, the second through the $(d+1)$-th eigenvectors are taken. The graph-based methods are quite similar: each one tries to give an accurate representation of the data while preserving a desired property. The solution of the optimization problem is given by the eigenvectors or the singular vectors.

Next, we will introduce notations related to the tensor theory (Einstein product) as well as some properties that guarantee the proposed generalization.

3 The Einstein product and its properties

Let $\mathbf{I}=\{I_{1},\ldots,I_{N}\}$ and $\mathbf{J}=\{J_{1},\ldots,J_{M}\}$ be two multi-indices, and let $\mathbf{i}=\{i_{1},\ldots,i_{N}\}$ and $\mathbf{j}=\{j_{1},\ldots,j_{M}\}$ be two indices. The index mapping function $\operatorname{ivec}(\mathbf{i},\mathbf{I})=i_{1}+\sum_{k=2}^{N}\left(i_{k}-1\right)\prod_{l=1}^{k-1}I_{l}$ maps the multi-index $\mathbf{i}$ to the corresponding index in the vectorized form of a tensor of size $I_{1}\times\ldots\times I_{N}$. The unfolding, also known as flattening or matricization, is the map $\Psi:\mathbb{R}^{I_{1}\times I_{2}\times\cdots\times I_{N}\times J_{1}\times J_{2}\times\cdots\times J_{M}}\longrightarrow\mathbb{R}^{|\mathbf{I}|\times|\mathbf{J}|},\;\mathcal{A}\mapsto A$, with $A_{ij}=\mathcal{A}_{i_{1}i_{2}\ldots i_{N}j_{1}j_{2}\ldots j_{M}}$, where $i=\operatorname{ivec}(\mathbf{i},\mathbf{I})$ and $j=\operatorname{ivec}(\mathbf{j},\mathbf{J})$. The mapping $\Psi$ is a linear isomorphism, and its inverse is denoted by $\Psi^{-1}$; it allows several concepts of matrix theory to be generalized more easily.
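As an illustration only, the following minimal Python/NumPy sketch (the helper names `ivec`, `unfold`, and `fold` are ours) implements a zero-based version of the index map and of $\Psi$ and $\Psi^{-1}$ through column-major reshaping.

```python
import numpy as np

def ivec(i, I):
    """Zero-based version of ivec: map a multi-index i to its position in
    the vectorization of a tensor with mode sizes I (column-major)."""
    pos, stride = 0, 1
    for ik, Ik in zip(i, I):
        pos += ik * stride
        stride *= Ik
    return pos

def unfold(A, N):
    """Psi: matricize a tensor whose first N modes index rows and whose
    remaining modes index columns (column-major, to match ivec)."""
    I, J = A.shape[:N], A.shape[N:]
    return A.reshape(int(np.prod(I)), int(np.prod(J)), order="F")

def fold(B, shape):
    """Psi^{-1}: fold a matrix back into a tensor of the given full shape."""
    return B.reshape(shape, order="F")
```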

The frontal slice of the $N$-order tensor $\mathcal{A}\in\mathbb{R}^{I_{1}\times\ldots\times I_{N}}$, denoted by $\mathcal{A}^{(i)}$, is the tensor $\mathcal{A}_{:,\ldots,:,i}$ (the last mode is fixed to $i$). A tensor $\mathcal{A}\in\mathbb{R}^{I_{1}\times\ldots\times I_{N}\times J_{1}\times\ldots\times J_{M}}$ is called even if $N=M$ and square if $I_{i}=J_{i}$ for all $i=1,\ldots,N$ [21].

Definition 1 (m-mode product).

[17] Let $\mathcal{X}\in\mathbb{R}^{I_{1}\times\ldots\times I_{M}}$ and $U\in\mathbb{R}^{J\times I_{m}}$. The $m$-mode (matrix) product of $\mathcal{X}$ and $U$ is the tensor of size $I_{1}\times\ldots\times I_{m-1}\times J\times I_{m+1}\times\ldots\times I_{M}$ whose entries are

$$(\mathcal{X}\times_{m}U)_{i_{1}\ldots i_{m-1}ji_{m+1}\ldots i_{M}}=\sum_{i_{m}=1}^{I_{m}}U_{ji_{m}}\mathcal{X}_{i_{1}\ldots i_{M}}. \quad (4)$$
Definition 2 (Einstein product).

[6] Let $\mathcal{X}\in\mathbb{R}^{I_{1}\times\ldots\times I_{M}\times K_{1}\times\ldots\times K_{N}}$ and $\mathcal{Y}\in\mathbb{R}^{K_{1}\times\ldots\times K_{N}\times J_{1}\times\ldots\times J_{M}}$. The Einstein product of the tensors $\mathcal{X}$ and $\mathcal{Y}$ is the tensor of size $I_{1}\times\ldots\times I_{M}\times J_{1}\times\ldots\times J_{M}$ whose elements are defined by

$$\left(\mathcal{X}*_{N}\mathcal{Y}\right)_{i_{1}\ldots i_{M}j_{1}\ldots j_{M}}=\sum_{k_{1}\ldots k_{N}}\mathcal{X}_{i_{1}\ldots i_{M}k_{1}\ldots k_{N}}\mathcal{Y}_{k_{1}\ldots k_{N}j_{1}\ldots j_{M}}. \quad (5)$$
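For illustration, the Einstein product amounts to a single contraction of the trailing $N$ modes of $\mathcal{X}$ with the leading $N$ modes of $\mathcal{Y}$; a minimal NumPy sketch, assuming nothing beyond Definition 2, is given below.

```python
import numpy as np

def einstein_product(X, Y, N):
    """Einstein product *_N: contract the trailing N modes of X with the
    leading N modes of Y."""
    return np.tensordot(X, Y, axes=N)

# Sanity check: for matrices and N = 1 this is the ordinary matrix product.
A, B = np.random.rand(3, 4), np.random.rand(4, 5)
assert np.allclose(einstein_product(A, B, 1), A @ B)
```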

Next, we have some definitions related to the Einstein product.

Definition 3.
  • Let $\mathcal{A}\in\mathbb{R}^{I_{1}\times\ldots\times I_{N}\times J_{1}\times\ldots\times J_{M}}$; the transpose tensor [21] of $\mathcal{A}$, denoted by $\mathcal{A}^{T}$, is the tensor of size $J_{1}\times\ldots\times J_{M}\times I_{1}\times\ldots\times I_{N}$ whose entries are defined by $(\mathcal{A}^{T})_{j_{1}\dots j_{M}i_{1}\dots i_{N}}=\mathcal{A}_{i_{1}\dots i_{N}j_{1}\dots j_{M}}$.

  • $\mathcal{A}$ is a diagonal tensor if all of its entries are zero except for those on its diagonal, denoted as $(\mathcal{A})_{i_{1}\dots i_{N}i_{1}\dots i_{N}}$, for all $1\leq i_{r}\leq\min(I_{r},J_{r})$, $1\leq r\leq N$.

  • The identity tensor, denoted by $\mathcal{I}_{N}\in\mathbb{R}^{I_{1}\times\ldots\times I_{N}\times I_{1}\times\ldots\times I_{N}}$, is a diagonal tensor with only ones on its diagonal.

  • A square tensor $\mathcal{A}\in\mathbb{R}^{I_{1}\times\ldots\times I_{N}\times I_{1}\times\ldots\times I_{N}}$ is called symmetric if $\mathcal{A}^{T}=\mathcal{A}$.

Remark 1.

When there is no risk of confusion, the identity tensor will be denoted simply by $\mathcal{I}$.

Definition 4.

The inner product of tensors $\mathcal{X},\mathcal{Y}\in\mathbb{R}^{I_{1}\times\ldots\times I_{N}}$ is defined by

$$\langle\mathcal{X},\mathcal{Y}\rangle=\sum_{i_{1},\ldots,i_{N}}\mathcal{X}_{i_{1}i_{2}\ldots i_{N}}\mathcal{Y}_{i_{1}i_{2}\ldots i_{N}}. \quad (6)$$

The inner product induces the Frobenius norm as follows

𝒳F=𝒳,𝒳.\|\mathcal{X}\|_{F}=\sqrt{\langle\mathcal{X},\mathcal{X}\rangle}. (7)
Definition 5.

A square $2N$-order tensor $\mathcal{A}$ is invertible (non-singular) if there is a tensor of the same size, denoted by $\mathcal{A}^{-1}$, such that $\mathcal{A}*_{N}\mathcal{A}^{-1}=\mathcal{A}^{-1}*_{N}\mathcal{A}=\mathcal{I}_{N}$. It is unitary if $\mathcal{A}^{T}*_{N}\mathcal{A}=\mathcal{A}*_{N}\mathcal{A}^{T}=\mathcal{I}_{N}$. It is positive semi-definite if $\langle\mathcal{X},\mathcal{A}*_{N}\mathcal{X}\rangle\geq 0$ for all non-zero $\mathcal{X}\in\mathbb{R}^{I_{1}\times\ldots\times I_{N}}$. It is positive definite if the inequality is strict.

An important relationship that is easy to prove is the stability of the Frobenius norm under the Einstein product with a unitary tensor.

Proposition 6.

Let 𝒳I1××IM×J1××JN\mathcal{X}\in\mathbb{R}^{I_{1}\times\ldots\times I_{M}\times J_{1}\times\ldots\times J_{N}} and 𝒰I1××IM×I1××IM\mathcal{U}\in\mathbb{R}^{I_{1}\times\ldots\times I_{M}\times I_{1}\times\ldots\times I_{M}} be a unitary tensor, then

𝒰M𝒳F=𝒳F.\|\mathcal{U}*_{M}\mathcal{X}\|_{F}=\|\mathcal{X}\|_{F}. (8)

The proof is straightforward using the inner product definition.

Proposition 7.

[27] For even order tensors 𝒳,𝒴I1××IN×J1××JM\mathcal{X},\mathcal{Y}\in\mathbb{R}^{I_{1}\times\ldots\times I_{N}\times J_{1}\times\ldots\times J_{M}}, we have

𝒳,𝒴\displaystyle\langle\mathcal{X},\mathcal{Y}\rangle =Tr(𝒳TN𝒴)\displaystyle=\operatorname{Tr}\left(\mathcal{X}^{T}*_{N}\mathcal{Y}\right) (9)
=Tr(𝒴M𝒳T).\displaystyle=\operatorname{Tr}(\mathcal{Y}*_{M}\mathcal{X}^{T}).
Proposition 8.

[20] Given tensors 𝒳I1××IN×K1××KN\mathcal{X}\in\mathbb{R}^{I_{1}\times\ldots\times I_{N}\times K_{1}\times\ldots\times K_{N}}, 𝒴K1××KN×J1××JM\mathcal{Y}\in\mathbb{R}^{K_{1}\times\ldots\times K_{N}\times J_{1}\times\ldots\times J_{M}}, we have

(𝒳N𝒴)T=𝒴TN𝒳T.\left(\mathcal{X}*_{N}\mathcal{Y}\right)^{T}=\mathcal{Y}^{T}*_{N}\mathcal{X}^{T}. (10)

The isomorphism $\Psi$ has some properties that will be useful in the following.

Proposition 9.

[27] Given tensors $\mathcal{X}$ and $\mathcal{Y}$ of appropriate sizes, $\Psi$ is a multiplicative morphism with respect to the Einstein product, i.e., $\Psi(\mathcal{X}*_{N}\mathcal{Y})=\Psi(\mathcal{X})\Psi(\mathcal{Y})$.

It allows us to prove the Einstein Tensor Spectral Theorem.

Theorem 10 (Einstein Tensor Spectral Theorem).

A symmetric tensor is diagonalizable via the Einstein product.

Proof.

The proof uses the isomorphism and its properties (Proposition 9).

Let $\mathcal{X}$ be a symmetric tensor of size $I_{1}\times\ldots\times I_{N}\times I_{1}\times\ldots\times I_{N}$; then $\Psi(\mathcal{X})$ is symmetric, and by the spectral theorem there exists an orthogonal matrix $U$ such that $U^{T}\Psi(\mathcal{X})U=\Lambda$, where $\Lambda$ is a diagonal matrix.
Then $\Psi(\mathcal{X})=U\Lambda U^{T}$, and $\mathcal{X}=\Psi^{-1}(U\Lambda U^{T})=\Psi^{-1}(U)*_{N}\Psi^{-1}(\Lambda)*_{N}\Psi^{-1}(U^{T})=\Psi^{-1}(U)*_{N}\Psi^{-1}(\Lambda)*_{N}\Psi^{-1}(U)^{T}$, where $\Psi^{-1}(U)$ is unitary and $\Psi^{-1}(\Lambda)$ is a diagonal tensor. ∎

The cyclic property of the trace with Einstein product is also verified, which would be needed in the sequel.

Proposition 11 (Cyclic property of the trace).

Given tensors 𝒳I1××IM×K1××KN\mathcal{X}\in\mathbb{R}^{I_{1}\times\ldots\times I_{M}\times K_{1}\times\ldots\times K_{N}}, 𝒴K1××KN×I1××IM,𝒵K1××KN×K1××KN\mathcal{Y}\in\mathbb{R}^{K_{1}\times\ldots\times K_{N}\times I_{1}\times\ldots\times I_{M}},\mathcal{Z}\in\mathbb{R}^{K_{1}\times\ldots\times K_{N}\times K_{1}\times\ldots\times K_{N}}, we have

Tr(𝒳N𝒵N𝒴)=Tr(𝒴M𝒳N𝒵).\operatorname{Tr}\left(\mathcal{X}*_{N}\mathcal{Z}*_{N}\mathcal{Y}\right)=\operatorname{Tr}\left(\mathcal{Y}*_{M}\mathcal{X}*_{N}\mathcal{Z}\right). (11)
Theorem 12.

[20] Let 𝒳I1××IM×K1××KN\mathcal{X}\in\mathbb{R}^{I_{1}\times\ldots\times I_{M}\times K_{1}\times\ldots\times K_{N}}, the Einstein singular value decomposition (E-SVD) of 𝒳\mathcal{X} is defined by

𝒳=𝒰M𝒮N𝒱T,\mathcal{X}=\mathcal{U}*_{M}\mathcal{S}*_{N}\mathcal{V}^{T}, (12)

where 𝒰I1××IM×I1××IM,𝒮I1××IM×K1××KN,𝒱K1××KN×K1××KN\mathcal{U}\in\mathbb{R}^{I_{1}\times\ldots\times I_{M}\times I_{1}\times\ldots\times I_{M}},\mathcal{S}\in\mathbb{R}^{I_{1}\times\ldots\times I_{M}\times K_{1}\times\ldots\times K_{N}},\mathcal{V}\in\mathbb{R}^{K_{1}\times\ldots\times K_{N}\times K_{1}\times\ldots\times K_{N}} with the following properties
𝒰\mathcal{U} and 𝒱\mathcal{V} are unitary, where 𝒰::i1iM,𝒱::j1jN\mathcal{U}_{:\ldots:i_{1}\ldots i_{M}},\mathcal{V}_{:\ldots:j_{1}\ldots j_{N}} are the left and right singular tensors of 𝒳\mathcal{X}, respectively. If N<MN<M, then

𝒮i1iMk1kN={dk1kNif (i1,,iN)=(k1,,kN) and (iN+1,,iM)=(1,,1)0otherwise.\mathcal{S}_{i_{1}\ldots i_{M}k_{1}\ldots k_{N}}=\begin{cases}d_{k_{1}\ldots k_{N}}&\text{if }(i_{1},\ldots,i_{N})=(k_{1},\ldots,k_{N})\text{ and }(i_{N+1},\ldots,i_{M})=(1,\ldots,1)\\ 0&\text{otherwise.}\end{cases}

.
If N=MN=M, then 𝒮i1iMk1kN={dk1kNif (i1,,iM)=(k1,,kM)0otherwise.\mathcal{S}_{i_{1}\ldots i_{M}k_{1}\ldots k_{N}}=\begin{cases}d_{k_{1}\ldots k_{N}}&\text{if }(i_{1},\ldots,i_{M})=(k_{1},\ldots,k_{M})\\ 0&\text{otherwise.}\end{cases}
The numbers dk1kNd_{k_{1}\ldots k_{N}} are the singular values of 𝒳\mathcal{X} with the decreasing order

d1,,1d2,1,,1dK^1,1,,1d1,2,,1d1,K^2,,1dK^1,,K^P0,d_{{1,\ldots,1}}\geq d_{2,1,\ldots,1}\geq\ldots\geq d_{\widehat{K}_{1},1,\ldots,1}\\ \geq d_{1,2,\ldots,1}\geq\ldots\geq d_{1,\widehat{K}_{2},\ldots,1}\geq\ldots\geq d_{\widehat{K}_{1},\ldots,\widehat{K}_{P}}\geq 0,

with P=min(N,L),K^r=min(Ir,Kr),r=1,,PP=\min(N,L),\;\widehat{K}_{r}=\min(I_{r},K_{r}),\;r=1,\ldots,P.
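Since $\Psi$ is a multiplicative isomorphism (Proposition 9), the E-SVD can be obtained in practice from the matrix SVD of the unfolded tensor and folding the factors back. The sketch below assumes the column-major unfolding used earlier; it is an illustration, not a reference implementation.

```python
import numpy as np

def esvd(X, M):
    """E-SVD of X in R^{I1 x ... x IM x K1 x ... x KN} via the unfolding."""
    I, K = X.shape[:M], X.shape[M:]
    Xmat = X.reshape(int(np.prod(I)), int(np.prod(K)), order="F")
    U, s, Vt = np.linalg.svd(Xmat, full_matrices=True)
    S = np.zeros(Xmat.shape)
    S[:len(s), :len(s)] = np.diag(s)
    # Fold the matrix factors back into tensors of the announced sizes.
    U_t = U.reshape(I + I, order="F")
    S_t = S.reshape(I + K, order="F")
    V_t = Vt.T.reshape(K + K, order="F")
    return U_t, S_t, V_t   # X = U_t *_M S_t *_N V_t^T after unfolding
```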

We define the eigenvalues and eigen-tensors of a tensor with the following.

Definition 13.

[28] Let a square 2-N order tensors 𝒜,I1××IN×I1××IN\mathcal{A},\mathcal{B}\in\mathbb{R}^{I_{1}\times\ldots\times I_{N}\times I_{1}\times\ldots\times I_{N}}, then

  • Tensor Eigenvalue problem: If there is a non null 𝒳I1××IN\mathcal{X}\in\mathbb{R}^{I_{1}\times\ldots\times I_{N}}, and λ\lambda\in\mathbb{R} such that 𝒜N𝒳=λ𝒳\mathcal{A}*_{N}\mathcal{X}=\lambda\mathcal{X}, then 𝒳\mathcal{X} is called an eigen-tensor of 𝒜\mathcal{A}, and λ\lambda is the corresponding eigenvalue.

  • Tensor generalized Eigenvalue problem: If there is a non null 𝒳I1××IN\mathcal{X}\in\mathbb{R}^{I_{1}\times\ldots\times I_{N}}, and λ\lambda\in\mathbb{R} such that 𝒜N𝒳=λN𝒳\mathcal{A}*_{N}\mathcal{X}=\lambda\mathcal{B}*_{N}\mathcal{X}, then 𝒳\mathcal{X} is called an eigen-tensor of the pair {𝒜,}\{\mathcal{A},\mathcal{B}\}, and λ\lambda is the corresponding eigenvalue.

Remark 2.

If N=1N=1, the two definitions above coincide with the eigenvalue and generalized eigenvalue problems, respectively.

We can also show a relationship between the singular values and the eigenvalues of a tensor.

Proposition 14.

Let the E-SVD of 𝒳I1××IM×K1××KN\mathcal{X}\in\mathbb{R}^{I_{1}\times\ldots\times I_{M}\times K_{1}\times\ldots\times K_{N}}, defined as 𝒳=𝒰M𝒮N𝒱T\mathcal{X}=\mathcal{U}*_{M}\mathcal{S}*_{N}\mathcal{V}^{T}, then

  • The eigenvalues of 𝒳N𝒳T\mathcal{X}*_{N}\mathcal{X}^{T} and 𝒳T×M𝒳\mathcal{X}^{T}\times_{M}\mathcal{X} are the squared singular values of 𝒳\mathcal{X}.

  • The eigen-tensors of 𝒳N𝒳T\mathcal{X}*_{N}\mathcal{X}^{T} are the left singular tensors of 𝒳\mathcal{X}.

  • The eigen-tensors of 𝒳TM𝒳\mathcal{X}^{T}*_{M}\mathcal{X} are the right singular tensors of 𝒳\mathcal{X}.

The proof is straightforward.

To simplify matters, we’ll denote the dd eigen-tensors of a tensor, associated with the smallest eigenvalues, as the smallest dd eigen-tensors. Similarly, we will apply the same principle to the largest dd eigen-tensors. This terminology also extends to the left or right singular tensors.

Remark 3.

To generalize the notion of left and right inverse to non-square tensors, a tensor $\mathcal{A}$ is called left or right $\Psi$-invertible if $\Psi(\mathcal{A})$ is left or right invertible, respectively. To avoid confusion, we will sometimes denote $\Psi$ by $\Psi_{j}$ to represent the transformation of a tensor $\mathcal{A}\in\mathbb{R}^{I_{1}\times\ldots\times I_{N}}$ into a matrix in $\mathbb{R}^{(\prod_{k=1}^{j}I_{k})\times(\prod_{k=j+1}^{N}I_{k})}$.

Proposition 15.
  1. A square $2N$-order symmetric tensor $\mathcal{X}$ is positive semi-definite (respectively positive definite) if and only if there is a tensor (respectively an invertible tensor) $\mathcal{B}$ of the same size such that $\mathcal{X}=\mathcal{B}*_{N}\mathcal{B}^{T}$.

  2. Let $\mathcal{X}\in\mathbb{R}^{I_{1}\times\ldots\times I_{N}}$ be a tensor of order $N$, with its transpose in $\mathbb{R}^{I_{N-j+1}\times\ldots\times I_{N}\times I_{1}\times\ldots\times I_{N-j}}$; then $\mathcal{X}*_{j}\mathcal{X}^{T}$ is positive semi-definite for any $1\leq j\leq N$.
     If, moreover, $\mathcal{X}$ is $\Psi_{j}$-invertible for some $1\leq j\leq N$, then $\mathcal{X}*_{j}\mathcal{X}^{T}$ is positive definite.

  3. The eigenvalues of a square symmetric tensor are real.

  4. Let $M\in\mathbb{R}^{K\times K}$ be a symmetric matrix and $\mathcal{X}\in\mathbb{R}^{I_{1}\times\ldots\times I_{N}\times K}$, with its transpose in $\mathbb{R}^{K\times I_{1}\times\ldots\times I_{N}}$; then $\mathcal{X}\times_{N+1}M*_{1}\mathcal{X}^{T}$ is symmetric.

  5. If $M$ is positive semi-definite, then $\mathcal{X}\times_{N+1}M*_{1}\mathcal{X}^{T}$ is positive semi-definite.

  6. If $M$ is positive definite and $\mathcal{X}$ is $\Psi$-invertible, then $\mathcal{X}\times_{N+1}M*_{1}\mathcal{X}^{T}$ is positive definite.

Proof.

The proof of the first one is straightforward using 9.
Let 𝒴I1×INj\mathcal{Y}\in\mathbb{R}^{I_{1}\times\ldots I_{N-j}}, then 𝒴TNj𝒳j𝒳TNj𝒴=𝒳TNj𝒴F20\mathcal{Y}^{T}*_{N-j}\mathcal{X}*_{j}\mathcal{X}^{T}*_{N-j}\mathcal{Y}=\|\mathcal{X}^{T}*_{N-j}\mathcal{Y}\|_{F}^{2}\geq 0.
The proof of the third is similar to the second one.
Let 𝒜I1××IN×I1××IN\mathcal{A}\in\mathbb{R}^{I_{1}\times\ldots\times I_{N}\times I_{1}\times\ldots\times I_{N}} be a symmetric tensor and non-zero tensor 𝒳I1××IN\mathcal{X}\in\mathbb{R}^{I_{1}\times\ldots\times I_{N}} with 𝒜N𝒳=λ𝒳\mathcal{A}*_{N}\mathcal{X}=\lambda\mathcal{X}, then λT𝒳T=(λ𝒳)T=(𝒜N𝒳)T=𝒳TN𝒜T=𝒳TN𝒜=λT𝒳T\lambda^{T}\mathcal{X}^{T}=(\lambda\mathcal{X})^{T}=(\mathcal{A}*_{N}\mathcal{X})^{T}=\mathcal{X}^{T}*_{N}\mathcal{A}^{T}=\mathcal{X}^{T}*_{N}\mathcal{A}=\lambda^{T}\mathcal{X}^{T}, then λ=λT\lambda=\lambda^{T}, which completes the proof.
We have (𝒳×N+1M1𝒳T)T=(M×1𝒳T)T1𝒳T=(𝒳×N+1MT)1𝒳T\left(\mathcal{X}\times_{N+1}M*_{1}\mathcal{X}^{T}\right)^{T}=(M\times_{1}\mathcal{X}^{T})^{T}*_{1}\mathcal{X}^{T}=(\mathcal{X}\times_{N+1}M^{T})*_{1}\mathcal{X}^{T}, then conclude by the symmetry of MM.
Let MM be a positive semi-definite matrix, then there exist a matrix BB such that M=BBTM=BB^{T}, then

𝒳×N+1M1𝒳T=𝒳×N+1BBT1𝒳T=𝒳^1𝒳^T,\mathcal{X}\times_{N+1}M*_{1}\mathcal{X}^{T}=\mathcal{X}\times_{N+1}BB^{T}*_{1}\mathcal{X}^{T}=\widehat{\mathcal{X}}*_{1}\widehat{\mathcal{X}}^{T},

with 𝒳^=𝒳×N+1B\widehat{\mathcal{X}}=\mathcal{X}\times_{N+1}B, then the result follows.
The last has a similar proof; Using the fact that MM is positive definite, then BB is invertible, and 𝒳\mathcal{X} is invertible, then 𝒳^\widehat{\mathcal{X}} is Ψ\Psi-invertible, and the result follows. ∎

We also have a property that relates the tensor generalized eigenvalue problem with the tensor eigenvalue problem.

Proposition 16.

Let the generalized eigenvalue problem 𝒜N𝒳=λN𝒳\mathcal{A}*_{N}\mathcal{X}=\lambda\mathcal{M}*_{N}\mathcal{X}, with 𝒜,\mathcal{A},\mathcal{M} are a square 2-N order tensor, with \mathcal{M} being invertible, then 𝒳^=N𝒳\widehat{\mathcal{X}}=\mathcal{M}*_{N}\mathcal{X} is a solution of the tensor eigen-problem 𝒜^N𝒳^=λ𝒳^\widehat{\mathcal{A}}*_{N}\widehat{\mathcal{X}}=\lambda\widehat{\mathcal{X}} with 𝒜^=𝒜N1\widehat{\mathcal{A}}=\mathcal{A}*_{N}\mathcal{M}^{-1}.

Theorem 17.

Let a symmetric 𝒳I1××IM×I1××IM\mathcal{X}\in\mathbb{R}^{I_{1}\times\ldots\times I_{M}\times I_{1}\times\ldots\times I_{M}}, and \mathcal{B} a positive definite tensor of same size, then

min𝒫I1××IM×d𝒫TMM𝒫=Tr(𝒫TM𝒳M𝒫),\min_{\begin{subarray}{c}\mathcal{P}\in\mathbb{R}^{I_{1}\times\ldots\times I_{M}\times d}\\ \mathcal{P}^{T}*_{M}\mathcal{B}*_{M}\mathcal{P}=\mathcal{I}\end{subarray}}\operatorname{Tr}(\mathcal{P}^{T}*_{M}\mathcal{X}*_{M}\mathcal{P}),

is equivalent to solve the generalized eigenvalue problem 𝒳M𝒫=λM𝒫\mathcal{X}*_{M}\mathcal{P}=\lambda\mathcal{B}*_{M}\mathcal{P}.

Proof.

Since $\Psi$ is an isomorphism, the problem is equivalent to minimizing $\operatorname{Tr}(P^{T}XP)$ subject to $P^{T}BP=I$, where $P=\Psi(\mathcal{P})$, $X=\Psi(\mathcal{X})$, $B=\Psi(\mathcal{B})$. Here $X$ is symmetric and $B$ is positive definite. The solution of this equivalent problem is given by the eigenvectors associated with the $d$ smallest eigenvalues of the generalized problem $Xu=\lambda Bu$; using the fact that $\Psi^{-1}$ is an isomorphism, we obtain the result.
A second proof without using the isomorphism property is the following.
Let the Lagrangian of the problem be

(𝒫,Λ):=Tr(𝒫TM𝒳M𝒫)Tr(ΛTM(𝒫TMM𝒫)),\mathcal{L}(\mathcal{P},\Lambda):=\operatorname{Tr}(\mathcal{P}^{T}*_{M}\mathcal{X}*_{M}\mathcal{P})-\operatorname{Tr}(\Lambda^{T}*_{M}(\mathcal{P}^{T}*_{M}\mathcal{B}*_{M}\mathcal{P}-\mathcal{I})),

with $\Lambda\in\mathbb{R}^{d\times d}$ the Lagrange multiplier. The KKT conditions are

$$\dfrac{\partial\mathcal{L}}{\partial\mathcal{P}}=0,\qquad \dfrac{\partial\mathcal{L}}{\partial\Lambda}=0\;\Longrightarrow\;\mathcal{P}^{T}*_{M}\mathcal{B}*_{M}\mathcal{P}-\mathcal{I}=0.$$

To compute the partial derivative with respect to 𝒫\mathcal{P}, we introduce the functions f1(𝒫)f_{1}(\mathcal{P}) and f2(𝒫)f_{2}(\mathcal{P}), defined as follows

f1(𝒫)=Tr(𝒫TM𝒳M𝒫),f_{1}(\mathcal{P})=\operatorname{Tr}(\mathcal{P}^{T}*_{M}\mathcal{X}*_{M}\mathcal{P}),
f2(𝒫)=Tr(ΛTM(𝒫TMM𝒫)).f_{2}(\mathcal{P})=\operatorname{Tr}\left(\Lambda^{T}*_{M}(\mathcal{P}^{T}*_{M}\mathcal{B}*_{M}\mathcal{P}-\mathcal{I})\right).

Subsequently, we aim to determine the partial derivative.

f1(𝒫+ε)\displaystyle f_{1}(\mathcal{P}+\varepsilon\mathcal{H}) =Tr((𝒫+ε)TM𝒳M(𝒫+ε))\displaystyle=\operatorname{Tr}\left(\left(\mathcal{P}+\varepsilon\mathcal{H})^{T}*_{M}\mathcal{X}*_{M}(\mathcal{P}+\varepsilon\mathcal{H}\right)\right)
=Tr(𝒫TM𝒳M𝒫+εTM𝒳M𝒫+ε𝒫TM𝒳M)\displaystyle=\operatorname{Tr}\left(\mathcal{P}^{T}*_{M}\mathcal{X}*_{M}\mathcal{P}+\varepsilon\mathcal{H}^{T}*_{M}\mathcal{X}*_{M}\mathcal{P}+\varepsilon\mathcal{P}^{T}*_{M}\mathcal{X}*_{M}\mathcal{H}\right)
+Tr(ε2TM𝒳M)\displaystyle\qquad+\operatorname{Tr}(\varepsilon^{2}\mathcal{H}^{T}*_{M}\mathcal{X}*_{M}\mathcal{H})
=Tr(𝒫TM𝒳M𝒫)+εTr(TM𝒳M𝒫)\displaystyle=\operatorname{Tr}(\mathcal{P}^{T}*_{M}\mathcal{X}*_{M}\mathcal{P})+\varepsilon\operatorname{Tr}(\mathcal{H}^{T}*_{M}\mathcal{X}*_{M}\mathcal{P})
+εTr(𝒫TM𝒳M)+ε2Tr(TM𝒳M)\displaystyle\qquad+\varepsilon\operatorname{Tr}(\mathcal{P}^{T}*_{M}\mathcal{X}*_{M}\mathcal{H})+\varepsilon^{2}\operatorname{Tr}(\mathcal{H}^{T}*_{M}\mathcal{X}*_{M}\mathcal{H})
=f1(𝒫)+ε[Tr(TM𝒳M𝒫)+Tr(𝒫TM𝒳M)]+o(ε)\displaystyle=f_{1}(\mathcal{P})+\varepsilon\left[\operatorname{Tr}(\mathcal{H}^{T}*_{M}\mathcal{X}*_{M}\mathcal{P})+\operatorname{Tr}(\mathcal{P}^{T}*_{M}\mathcal{X}*_{M}\mathcal{H})\right]+o(\varepsilon)
=f1(𝒫)+εTr(TM(𝒳+𝒳T)M𝒫)+o(ε).\displaystyle=f_{1}(\mathcal{P})+\varepsilon\operatorname{Tr}\left(\mathcal{H}^{T}*_{M}(\mathcal{X}+\mathcal{X}^{T})*_{M}\mathcal{P}\right)+o(\varepsilon).

Then, as 𝒳\mathcal{X} is symmetric, the partial derivative in the direction \mathcal{H} is

f1𝒫()\displaystyle\dfrac{\partial f_{1}}{\partial\mathcal{P}}(\mathcal{H}) =limε0f1(𝒫+ε)f1(𝒫)ε\displaystyle=\lim_{\varepsilon\rightarrow 0}\dfrac{f_{1}(\mathcal{P}+\varepsilon\mathcal{H})-f_{1}(\mathcal{P})}{\varepsilon}
=2Tr(TM𝒳M𝒫).\displaystyle=2\operatorname{Tr}(\mathcal{H}^{T}*_{M}\mathcal{X}*_{M}\mathcal{P}).

It gives us the partial derivative of f1f_{1} with respect to 𝒫\mathcal{P} as

f1𝒫=2𝒳M𝒫.\dfrac{\partial f_{1}}{\partial\mathcal{P}}=2\mathcal{X}*_{M}\mathcal{P}.

For the second function, we have

$f_{2}(\mathcal{P}+\varepsilon\mathcal{H}) =\operatorname{Tr}\left(\Lambda^{T}*_{M}\left((\mathcal{P}+\varepsilon\mathcal{H})^{T}*_{M}\mathcal{B}*_{M}(\mathcal{P}+\varepsilon\mathcal{H})-\mathcal{I}\right)\right)$
$=\operatorname{Tr}\left(\Lambda^{T}*_{M}(\mathcal{P}^{T}*_{M}\mathcal{B}*_{M}\mathcal{P}-\mathcal{I})\right)+\varepsilon\operatorname{Tr}\left(\Lambda^{T}*_{M}(\mathcal{H}^{T}*_{M}\mathcal{B}*_{M}\mathcal{P}+\mathcal{P}^{T}*_{M}\mathcal{B}*_{M}\mathcal{H})\right)+\varepsilon^{2}\operatorname{Tr}\left(\Lambda^{T}*_{M}\mathcal{H}^{T}*_{M}\mathcal{B}*_{M}\mathcal{H}\right)$
$=f_{2}(\mathcal{P})+\varepsilon\operatorname{Tr}\left(\Lambda^{T}*_{M}(\mathcal{H}^{T}*_{M}\mathcal{B}*_{M}\mathcal{P}+\mathcal{P}^{T}*_{M}\mathcal{B}*_{M}\mathcal{H})\right)+o(\varepsilon)$
$=f_{2}(\mathcal{P})+\varepsilon\operatorname{Tr}\left(\mathcal{H}^{T}*_{M}\left[\mathcal{B}*_{M}\mathcal{P}*_{M}\Lambda^{T}+\mathcal{B}^{T}*_{M}\mathcal{P}*_{M}\Lambda\right]\right)+o(\varepsilon).$

We used the cyclic property 11 and the transpose with trace property 9 in the last equality. Then, as \mathcal{B} is symmetric, the partial derivative in the direction \mathcal{H} is

f2𝒫()\displaystyle\dfrac{\partial f_{2}}{\partial\mathcal{P}}(\mathcal{H}) =limε0f2(𝒫+ε)f2(𝒫)ε\displaystyle=\lim_{\varepsilon\rightarrow 0}\dfrac{f_{2}(\mathcal{P}+\varepsilon\mathcal{H})-f_{2}(\mathcal{P})}{\varepsilon}
=Tr(TMM𝒫M(Λ+ΛT)).\displaystyle=\operatorname{Tr}\left(\mathcal{H}^{T}*_{M}\mathcal{B}*_{M}\mathcal{P}*_{M}(\Lambda+\Lambda^{T})\right).

This yields the partial derivative of \mathcal{L} with respect to 𝒫\mathcal{P} as follows

f2𝒫=M𝒫M(Λ+ΛT).\dfrac{\partial f_{2}}{\partial\mathcal{P}}=\mathcal{B}*_{M}\mathcal{P}*_{M}(\Lambda+\Lambda^{T}).

Subsequently,

$\dfrac{\partial\mathcal{L}}{\partial\mathcal{P}}=0 \iff 2\mathcal{X}*_{M}\mathcal{P}-\mathcal{B}*_{M}\mathcal{P}*_{M}(\Lambda+\Lambda^{T})=0$
$\iff\mathcal{X}*_{M}\mathcal{P}=\mathcal{B}*_{M}\mathcal{P}*_{M}\dfrac{1}{2}(\Lambda+\Lambda^{T})$
$\iff\mathcal{X}*_{M}\mathcal{P}=\mathcal{B}*_{M}\mathcal{P}*_{M}\widehat{\Lambda}$
𝒳M𝒫=M𝒫M𝒬TM𝒟M𝒬\displaystyle\iff\mathcal{X}*_{M}\mathcal{P}=\mathcal{B}*_{M}\mathcal{P}*_{M}\mathcal{Q}^{T}*_{M}\mathcal{D}*_{M}\mathcal{Q}
𝒳M𝒫M𝒬T=M𝒫M𝒬TM𝒟\displaystyle\iff\mathcal{X}*_{M}\mathcal{P}*_{M}\mathcal{Q}^{T}=\mathcal{B}*_{M}\mathcal{P}*_{M}\mathcal{Q}^{T}*_{M}\mathcal{D}
𝒳M𝒫^=M𝒫^M𝒟.\displaystyle\iff\mathcal{X}*_{M}\widehat{\mathcal{P}}=\mathcal{B}*_{M}\widehat{\mathcal{P}}*_{M}\mathcal{D}.

The third line sets $\widehat{\Lambda}=\frac{1}{2}(\Lambda+\Lambda^{T})$, which is symmetric, thus diagonalizable (Theorem 10), i.e., $\widehat{\Lambda}=\mathcal{Q}^{T}*_{M}\mathcal{D}*_{M}\mathcal{Q}$. The last two lines are justified by the fact that $\widehat{\mathcal{P}}=\mathcal{P}*_{M}\mathcal{Q}$ also satisfies $\widehat{\mathcal{P}}^{T}*_{M}\mathcal{B}*_{M}\widehat{\mathcal{P}}=\mathcal{I}$, concluding the proof. ∎

A corollary of this theorem can be deduced.

Corollary 18.

Let 𝒳I1××IM×K1××KN\mathcal{X}\in\mathbb{R}^{I_{1}\times\ldots\times I_{M}\times K_{1}\times\ldots\times K_{N}}, the solution of

argmin𝒫I1××IM×d𝒫TM𝒫=𝒫TM𝒳F2.\arg\min_{\begin{subarray}{c}\mathcal{P}\in\mathbb{R}^{I_{1}\times\ldots\times I_{M}\times d}\\ \mathcal{P}^{T}*_{M}\mathcal{P}=\mathcal{I}\end{subarray}}\|\mathcal{P}^{T}*_{M}\mathcal{X}\|_{F}^{2}.

is the dd smallest left singular tensors of 𝒳\mathcal{X}.

Proof.

We have $\|\mathcal{P}^{T}*_{M}\mathcal{X}\|_{F}^{2}=\operatorname{Tr}\left(\mathcal{P}^{T}*_{M}\mathcal{X}*_{N}\mathcal{X}^{T}*_{M}\mathcal{P}\right)$. Theorem 17 shows that the problem is equivalent to solving $\mathcal{X}*_{N}\mathcal{X}^{T}*_{M}\mathcal{P}=\lambda\mathcal{P}$, i.e., finding the $d$ smallest eigen-tensors of $\mathcal{X}*_{N}\mathcal{X}^{T}$, which correspond exactly to the $d$ smallest left singular tensors of $\mathcal{X}$ (Proposition 14). ∎

4 Multidimensional reduction

In this section, we present a generalized approach to DR methods using the Einstein product.

Given a tensor $\mathcal{X}\in\mathbb{R}^{I_{1}\times\ldots\times I_{M}\times n}$, our objective is to derive a low-dimensional representation $\mathcal{Y}\in\mathbb{R}^{d\times n}$ of $\mathcal{X}$. This involves defining a mapping $\Phi:\mathbb{R}^{I_{1}\times\ldots\times I_{M}}\longrightarrow\mathbb{R}^{d}$.

First, we discuss the determination of the weight matrix, which can be computed in various ways, one common method is using the Gaussian kernel

Wi,j=exp(𝒳(i)𝒳(j)F2σ2).W_{i,j}=\exp\left(-\frac{\left\|\mathcal{X}^{(i)}-\mathcal{X}^{(j)}\right\|_{F}^{2}}{\sigma^{2}}\right).

Additionally, introducing a threshold can yield a sparse matrix (Gaussian-threshold). We also explore another method later in this section, which utilizes the reconstruction error.
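A minimal sketch of this weight construction (Gaussian and Gaussian-threshold), assuming the samples are stacked along the last mode of the data tensor, is given below.

```python
import numpy as np

def gaussian_weights(X, sigma=1.0, threshold=None):
    """X has shape (I1, ..., IM, n); returns the n x n similarity matrix W."""
    n = X.shape[-1]
    flat = X.reshape(-1, n)                       # one column per sample
    d2 = ((flat[:, :, None] - flat[:, None, :]) ** 2).sum(axis=0)
    W = np.exp(-d2 / sigma**2)
    if threshold is not None:
        W[W < threshold] = 0.0                    # Gaussian-threshold variant
    return W
```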

Next, we introduce our proposed methods.

Linear methods: The linear methods can be written as 𝒴=𝒫TM𝒳\mathcal{Y}=\mathcal{P}^{T}*_{M}\mathcal{X}. It is sufficient to find the projection matrix 𝒫I1××IM×d\mathcal{P}\in\mathbb{R}^{I_{1}\times\ldots\times I_{M}\times d}.
Higher-order PCA based on the Einstein product [11] is the natural extension of PCA to higher-order tensors. It extends eigenface-style PCA to color images modeled by a fourth-order tensor: pixels (height and width) are vectorized for each color channel (RGB) to obtain a third-order tensor, and the E-SVD of the centered tensor gives the eigenfaces.
This vectorization is not natural, since it omits the spatial information. The proposed approach avoids the vectorization of the first step by using the tensor directly, and seeks a solution of the following problem

argmax𝒫I1××IM×d𝒫TM𝒫=𝒴=𝒫TM𝒳ΦPCA(𝒴)\displaystyle\arg\max_{\begin{subarray}{c}\mathcal{P}\in\mathbb{R}^{I_{1}\times\ldots\times I_{M}\times d}\\ \mathcal{P}^{T}*_{M}\mathcal{P}=\mathcal{I}\\ \mathcal{Y}=\mathcal{P}^{T}*_{M}\mathcal{X}\end{subarray}}\Phi_{PCA}(\mathcal{Y}) :=i𝒴(i)1nj𝒴(j)F2.\displaystyle:=\sum_{i}\left\|\mathcal{Y}^{(i)}-\dfrac{1}{n}\sum_{j}\mathcal{Y}^{(j)}\right\|_{F}^{2}. (13)

The objective function can be written as

ΦPCA(𝒴)\displaystyle\Phi_{PCA}(\mathcal{Y}) =i𝒫TM(𝒳(i)1nj𝒳(j))F2\displaystyle=\sum_{i}\left\|\mathcal{P}^{T}*_{M}(\mathcal{X}^{(i)}-\dfrac{1}{n}\sum_{j}\mathcal{X}^{(j)})\right\|_{F}^{2}
=i𝒫TM(𝒳(i)𝒬(i))F2\displaystyle=\sum_{i}\left\|\mathcal{P}^{T}*_{M}(\mathcal{X}^{(i)}-\mathcal{Q}^{(i)})\right\|_{F}^{2}
=𝒫TM(𝒳𝒬)F2,\displaystyle=\left\|\mathcal{P}^{T}*_{M}(\mathcal{X}-\mathcal{Q})\right\|_{F}^{2},

with 𝒬I1××IM×n\mathcal{Q}\in\mathbb{R}^{I_{1}\times\ldots\times I_{M}\times n}, where 𝒬(i)=1nj𝒳(j)\mathcal{Q}^{(i)}=\frac{1}{n}\sum_{j}\mathcal{X}^{(j)} represents the mean, the solution of (13) is the largest dd left singular tensors of the centered tensor 𝒳𝒬=𝒳×M+1H\mathcal{X}-\mathcal{Q}=\mathcal{X}\times_{M+1}H.

Since the feature dimension is typically larger than the number of data points nn, computing the E-SVD of 𝒳×M+1H\mathcal{X}\times_{M+1}H can be computationally expensive. It’s preferable to have a runtime that depends on nn instead. To achieve this, we transform the equation 𝒳×M+1H1𝒳TM𝒫=λ𝒫\mathcal{X}\times_{M+1}H*_{1}\mathcal{X}^{T}*_{M}\mathcal{P}=\lambda\mathcal{P} to (𝒳TM𝒳×MH)1𝐳=λ𝐳(\mathcal{X}^{T}*_{M}\mathcal{X}\times_{M}H)*_{1}\mathbf{z}=\lambda\mathbf{z}, with 𝐳=𝒳TM𝒫\mathbf{z}=\mathcal{X}^{T}*_{M}\mathcal{P}. This allows us to find the eigenvectors of a square matrix of size nn. The projected data 𝒴\mathcal{Y} would be these vectors reshaped to the appropriate size.
The algorithm below shows the steps of PCA via the Einstein product.

Algorithm 1 PCA-Einstein

Input: 𝒳\mathcal{X} (Data) dd(dimension output).
      Output: 𝒫\mathcal{P} (Projection space).

1:Compute 𝒬\mathcal{Q}. \triangleright The mean tensor
2:Compute the largest $d$ left singular tensors of $\mathcal{Z}=\mathcal{X}-\mathcal{Q}$.
3:Combine these tensors to get 𝒫\mathcal{P}.
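For illustration, a minimal NumPy sketch of Algorithm 1 through the unfolding $\Psi$ is given below; for clarity it uses the direct SVD of the centered, matricized data rather than the $n\times n$ reformulation discussed above.

```python
import numpy as np

def pca_einstein(X, d):
    """X has shape (I1, ..., IM, n); returns P of shape (I1, ..., IM, d)."""
    feat_shape, n = X.shape[:-1], X.shape[-1]
    Xmat = X.reshape(-1, n, order="F")
    Xmat = Xmat - Xmat.mean(axis=1, keepdims=True)     # centering, X - Q
    U, _, _ = np.linalg.svd(Xmat, full_matrices=False)
    return U[:, :d].reshape(feat_shape + (d,), order="F")

# The projected data Y = P^T *_M X is then a d x n array after unfolding.
```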

4.1 Generalization of ONPP

Given a weight matrix Wn×nW\in\mathbb{R}^{n\times n}, the objective function is to resolve

argmin𝒫I1××IM×d𝒫TM𝒫=𝒴=𝒫TM𝒳ΦONPP(𝒴)\displaystyle\arg\min_{\begin{subarray}{c}\mathcal{P}\in\mathbb{R}^{I_{1}\times\ldots\times I_{M}\times d}\\ \mathcal{P}^{T}*_{M}\mathcal{P}=\mathcal{I}\\ \mathcal{Y}=\mathcal{P}^{T}*_{M}\mathcal{X}\end{subarray}}\Phi_{ONPP}(\mathcal{Y}) :=i𝒴(i)jwi,j𝒴(j)F2.\displaystyle:=\sum_{i}\left\|\mathcal{Y}^{(i)}-\sum_{j}w_{i,j}\mathcal{Y}^{(j)}\right\|_{F}^{2}. (14)

The objective function can be written as

ΦONPP(𝒴)\displaystyle\Phi_{ONPP}(\mathcal{Y}) =ijδi,j𝒴(j)jwi,j𝒴(j)F2\displaystyle=\sum_{i}\left\|\sum_{j}\delta_{i,j}\mathcal{Y}^{(j)}-\sum_{j}w_{i,j}\mathcal{Y}^{(j)}\right\|_{F}^{2}
=ij(δi,jwi,j)𝒴(j)F2\displaystyle=\sum_{i}\left\|\sum_{j}(\delta_{i,j}-w_{i,j})\mathcal{Y}^{(j)}\right\|_{F}^{2}
=ij(InW)i,j𝒴(j)F2\displaystyle=\sum_{i}\left\|\sum_{j}(I_{n}-W)_{i,j}\mathcal{Y}^{(j)}\right\|_{F}^{2}
=i(𝒴×M+1(InW))(i)F2\displaystyle=\sum_{i}\left\|\left(\mathcal{Y}\times_{M+1}(I_{n}-W)\right)^{(i)}\right\|_{F}^{2}
=𝒴×M+1(InW)F2\displaystyle=\left\|\mathcal{Y}\times_{M+1}(I_{n}-W)\right\|_{F}^{2}
=(𝒫TM𝒳)×M+1(InW)F2\displaystyle=\left\|(\mathcal{P}^{T}*_{M}\mathcal{X})\times_{M+1}(I_{n}-W)\right\|_{F}^{2}
=𝒫TM(𝒳×M+1(InW))F2.\displaystyle=\left\|\mathcal{P}^{T}*_{M}\left(\mathcal{X}\times_{M+1}\left(I_{n}-W\right)\right)\right\|_{F}^{2}.

Using Corollary 18, the solution of (14) is the smallest $d$ left singular tensors of $\mathcal{X}\times_{M+1}(I_{n}-W)$.
The algorithm below shows the steps of ONPP via the Einstein product.

Algorithm 2 ONPP-Einstein

Input: 𝒳\mathcal{X} (Data) dd (subspace dimension).
      Output: 𝒫\mathcal{P} (Projection space).

1:Compute WW. \triangleright Using the appropriate method
2:Compute the smallest dd left singular tensors of 𝒵=𝒳×M+1(IW)\mathcal{Z}=\mathcal{X}\times_{M+1}(I-W).
3:Combine these tensors to get 𝒫\mathcal{P}.
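A minimal sketch of Algorithm 2 through the unfolding, assuming the weight matrix $W$ has already been computed, is given below.

```python
import numpy as np

def onpp_einstein(X, W, d):
    """Smallest d left singular tensors of X x_{M+1} (I_n - W)."""
    feat_shape, n = X.shape[:-1], X.shape[-1]
    Xmat = X.reshape(-1, n, order="F")
    Z = Xmat @ (np.eye(n) - W).T          # unfolding of X x_{M+1} (I_n - W)
    U, s, _ = np.linalg.svd(Z, full_matrices=False)
    idx = np.argsort(s)[:d]               # indices of the d smallest singular values
    return U[:, idx].reshape(feat_shape + (d,), order="F")
```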

4.1.1 Multi-weight ONPP

In this section, we propose a generalization of ONPP in which multiple weight matrices are employed along the $I_{M}$ mode. We denote by $\mathcal{W}\in\mathbb{R}^{n\times n\times I_{M}}$ the weight tensor. Let $\mathcal{Y}^{(i)}_{r}$ denote the tensor $\mathcal{Y}$ with its last two indices fixed to $(r,i)$, i.e., $\mathcal{Y}_{:,\ldots,r,i}$; similarly, $\mathcal{X}^{(i)}_{r}$ denotes $\mathcal{X}_{:,\ldots,r,i}$. We assume that the $r$-th frontal slice of the weight tensor is constructed only from the $r$-th frontal slice of the data tensor.
The objective function is

argmin𝒫I1××IM×d𝒫TM𝒫=𝒴=𝒫TM𝒳ΦONPPMW(𝒴)\displaystyle\arg\min_{\begin{subarray}{c}\mathcal{P}\in\mathbb{R}^{I_{1}\times\ldots\times I_{M}\times d}\\ \mathcal{P}^{T}*_{M}\mathcal{P}=\mathcal{I}\\ \mathcal{Y}=\mathcal{P}^{T}*_{M}\mathcal{X}\end{subarray}}\Phi_{ONPP_{MW}}(\mathcal{Y}) :=i,r𝒴r(i)j𝒲i,j(r)𝒴r(j)F2.\displaystyle:=\sum_{i,r}\left\|\mathcal{Y}^{(i)}_{r}-\sum_{j}\mathcal{W}^{(r)}_{i,j}\mathcal{Y}^{(j)}_{r}\right\|_{F}^{2}. (15)

Utilizing the independence of the frontal slices $\mathcal{Y}^{(i)}_{r}$ across $r$, we can divide the objective function into $I_{M}$ independent objective functions, and the solution is obtained by concatenating the solutions of each of them. The $r$-th objective function can be written as

$$\arg\min_{\begin{subarray}{c}\mathcal{P}_{:,\ldots,r,:}\in\mathbb{R}^{I_{1}\times\ldots\times I_{M}\times d}\\ \mathcal{P}_{:,\ldots,r,:}^{T}*_{M}\mathcal{P}_{:,\ldots,r,:}=\mathcal{I}\\ \mathcal{Y}_{:,\ldots,r,:}=\mathcal{P}_{:,\ldots,r,:}^{T}*_{M}\mathcal{X}_{:,\ldots,r,:}\end{subarray}} \sum_{i}\left\|\mathcal{Y}^{(i)}_{r}-\sum_{j}\mathcal{W}^{(r)}_{i,j}\mathcal{Y}^{(j)}_{r}\right\|_{F}^{2}. \quad (16)$$

The solution of this objective function is the smallest dd left singular tensors of
𝒳:,,r,:×M+1(In𝒲(r))\mathcal{X}_{:,\ldots,r,:}\times_{M+1}(I_{n}-\mathcal{W}^{(r)}). The solution of the original problem is obtained by concatenating the solutions of each objective function.

4.2 Generalization of OLPP

Given a weight matrix Wn×nW\in\mathbb{R}^{n\times n}, the optimization problem to solve is

argmin𝒫I1××IM×d𝒫TM𝒫=ΦOLPP(𝒴)\displaystyle\arg\min_{\begin{subarray}{c}\mathcal{P}\in\mathbb{R}^{I_{1}\times\ldots\times I_{M}\times d}\\ \mathcal{P}^{T}*_{M}\mathcal{P}=\mathcal{I}\end{subarray}}\Phi_{OLPP}(\mathcal{Y}) :=12i,jwi,j𝒴(i)𝒴(j)F2.\displaystyle:=\dfrac{1}{2}\sum_{i,j}w_{i,j}\left\|\mathcal{Y}^{(i)}-\mathcal{Y}^{(j)}\right\|_{F}^{2}. (17)

The objective function can be written as

$\Phi_{OLPP}(\mathcal{Y}) =\dfrac{1}{2}\sum_{i,j}w_{i,j}\left(\left\|\mathcal{Y}^{(i)}\right\|_{F}^{2}+\left\|\mathcal{Y}^{(j)}\right\|_{F}^{2}-2\langle\mathcal{Y}^{(i)},\mathcal{Y}^{(j)}\rangle\right)$
$=\dfrac{1}{2}\sum_{i}d_{i}\left\|\mathcal{Y}^{(i)}\right\|_{F}^{2}+\dfrac{1}{2}\sum_{j}d_{j}\left\|\mathcal{Y}^{(j)}\right\|_{F}^{2}-\sum_{i,j}w_{i,j}\langle\mathcal{Y}^{(i)},\mathcal{Y}^{(j)}\rangle$
$=\sum_{i}d_{i}\left\|\mathcal{Y}^{(i)}\right\|_{F}^{2}-\sum_{i,j}w_{i,j}\langle\mathcal{Y}^{(i)},\mathcal{Y}^{(j)}\rangle$
=i,jdi,j𝒴(i),𝒴(j)i,jwi,j𝒴(i),𝒴(j)\displaystyle=\sum_{i,j}d_{i,j}\langle\mathcal{Y}^{(i)},\mathcal{Y}^{(j)}\rangle-\sum_{i,j}w_{i,j}\langle\mathcal{Y}^{(i)},\mathcal{Y}^{(j)}\rangle
=i,jLi,j𝒴(i),𝒴(j)\displaystyle=\sum_{i,j}L_{i,j}\langle\mathcal{Y}^{(i)},\mathcal{Y}^{(j)}\rangle
=𝒴M+1L,𝒴\displaystyle=\langle\mathcal{Y}*_{M+1}L,\mathcal{Y}\rangle
=Tr(𝒫TM(𝒳×M+1L1𝒳T)M𝒫),\displaystyle=\operatorname{Tr}\left(\mathcal{P}^{T}*_{M}(\mathcal{X}\times_{M+1}L*_{1}\mathcal{X}^{T})*_{M}\mathcal{P}\right),

where LL is the Laplacian matrix corresponding to WW.
Using Theorem 17, the solution is the smallest $d$ eigen-tensors of the symmetric tensor (Proposition 15) $\mathcal{X}\times_{M+1}L*_{1}\mathcal{X}^{T}$.
The algorithm below shows the steps of OLPP via the Einstein product.

Algorithm 3 OLPP-Einstein

Input: 𝒳\mathcal{X} (Data) dd (subspace dimension).
      Output: 𝒫\mathcal{P} (Projection space).

1:Compute LL. \triangleright Using the appropriate method
2:Compute the smallest dd eigen-tensors of 𝒳×M+1L1𝒳T\mathcal{X}\times_{M+1}L*_{1}\mathcal{X}^{T}.
3:Combine these tensors to get 𝒫\mathcal{P}.
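A minimal sketch of Algorithm 3 through the unfolding is given below; it forms the (possibly large) matricized tensor $\mathcal{X}\times_{M+1}L*_{1}\mathcal{X}^{T}$ explicitly, so it is only meant for moderate feature dimensions.

```python
import numpy as np

def olpp_einstein(X, W, d):
    """Smallest d eigen-tensors of X x_{M+1} L *_1 X^T, with L = D - W."""
    feat_shape, n = X.shape[:-1], X.shape[-1]
    L = np.diag(W.sum(axis=1)) - W
    Xmat = X.reshape(-1, n, order="F")
    S = Xmat @ L @ Xmat.T                 # unfolding of X x_{M+1} L *_1 X^T
    _, vecs = np.linalg.eigh(S)           # eigenvalues in ascending order
    return vecs[:, :d].reshape(feat_shape + (d,), order="F")
```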

4.2.1 Multi-weight OLPP

In this section, we propose a generalization of OLPP, where multiple weights are utilized on the IMI_{M} mode. The objective function is to solve

argmin𝒫I1××IM×d𝒫TM𝒫=𝒴=𝒫TM𝒳ΦOLPPMW(𝒴)\displaystyle\arg\min_{\begin{subarray}{c}\mathcal{P}\in\mathbb{R}^{I_{1}\times\ldots\times I_{M}\times d}\\ \mathcal{P}^{T}*_{M}\mathcal{P}=\mathcal{I}\\ \mathcal{Y}=\mathcal{P}^{T}*_{M}\mathcal{X}\end{subarray}}\Phi_{OLPP_{MW}}(\mathcal{Y}) :=12i,j,r𝒲i,j(r)𝒴r(i)𝒴r(j)F2.\displaystyle:=\dfrac{1}{2}\sum_{i,j,r}\mathcal{W}^{(r)}_{i,j}\left\|\mathcal{Y}_{r}^{(i)}-\mathcal{Y}_{r}^{(j)}\right\|_{F}^{2}. (18)

Here, we have IMI_{M} independent objective functions, and the solution is obtained by concatenating the solutions of each objective function.

4.3 Generalization of LPP

LPP is akin to the Laplacian Eigenmap, serving as its linear counterpart. The objective function of LPP solves

argmin𝒫I1××IM×d𝒫TM𝒳×M+1D1𝒳TM𝒫=ΦLPP(𝒴)\displaystyle\arg\min_{\begin{subarray}{c}\mathcal{P}\in\mathbb{R}^{I_{1}\times\ldots\times I_{M}\times d}\\ \mathcal{P}^{T}*_{M}\mathcal{X}\times_{M+1}D*_{1}\mathcal{X}^{T}*_{M}\mathcal{P}=\mathcal{I}\end{subarray}}\Phi_{LPP}(\mathcal{Y}) :=12i,jwi,j𝒴(i)𝒴(j)F2.\displaystyle:=\dfrac{1}{2}\sum_{i,j}w_{i,j}\left\|\mathcal{Y}^{(i)}-\mathcal{Y}^{(j)}\right\|_{F}^{2}. (19)

The solution involves finding the smallest dd eigen-tensor of the generalized eigen-problem

𝒳×M+1L1𝒳TM𝒱=λ𝒳M+1D1𝒳TM𝒱.\mathcal{X}\times_{M+1}L*_{1}\mathcal{X}^{T}*_{M}\mathcal{V}=\lambda\mathcal{X}*_{M+1}D*_{1}\mathcal{X}^{T}*_{M}\mathcal{V}. (20)

Using Proposition 15, we deduce that the tensors $\mathcal{X}\times_{M+1}L*_{1}\mathcal{X}^{T}$ and $\mathcal{X}\times_{M+1}D*_{1}\mathcal{X}^{T}$ are symmetric positive semi-definite. Here, we assume definiteness (although this is generally not guaranteed, especially if the number of sample points is smaller than the product of the feature dimensions). The projection tensor is obtained by concatenating these eigen-tensors and reshaping to the desired dimension.
The algorithm below shows the steps of LPP via the Einstein product.

Algorithm 4 LPP-Einstein

Input: 𝒳\mathcal{X} (Data) dd (subspace dimension).
      Output: 𝒫\mathcal{P} (Projection space).

1:Compute LL. \triangleright Using the appropriate method
2:Compute the smallest $d$ eigen-tensors of $\mathcal{X}\times_{M+1}L*_{1}\mathcal{X}^{T}*_{M}\mathcal{V}=\lambda\,\mathcal{X}\times_{M+1}D*_{1}\mathcal{X}^{T}*_{M}\mathcal{V}$.
3:Combine these tensors to get 𝒫\mathcal{P}.
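A minimal sketch of Algorithm 4 through the unfolding is given below; a small ridge `eps` (our addition) is added to the right-hand side to enforce the definiteness assumed above.

```python
import numpy as np
from scipy.linalg import eigh

def lpp_einstein(X, W, d, eps=1e-8):
    """Smallest d eigen-tensors of (X L X^T) v = lambda (X D X^T) v."""
    feat_shape, n = X.shape[:-1], X.shape[-1]
    D = np.diag(W.sum(axis=1))
    L = D - W
    Xmat = X.reshape(-1, n, order="F")
    A = Xmat @ L @ Xmat.T
    B = Xmat @ D @ Xmat.T + eps * np.eye(Xmat.shape[0])   # enforce definiteness
    _, vecs = eigh(A, B)                  # generalized problem, ascending order
    return vecs[:, :d].reshape(feat_shape + (d,), order="F")
```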

Similarly, the multi-weight LPP can be proposed.

4.4 Generalization of NPP

The generalization of NPP resembles that of ONPP; under the LLE constraint, it aims to find

argmin𝒫I1××IM×d𝒫TM𝒳1𝒳TM𝒫=ΦNPP(𝒴)\displaystyle\arg\min_{\begin{subarray}{c}\mathcal{P}\in\mathbb{R}^{I_{1}\times\ldots\times I_{M}\times d}\\ \mathcal{P}^{T}*_{M}\mathcal{X}*_{1}\mathcal{X}^{T}*_{M}\mathcal{P}=\mathcal{I}\end{subarray}}\Phi_{NPP}(\mathcal{Y}) :=Tr(𝒫TM𝒳×M+1(InW)(InW)T×1𝒳TM𝒫).\displaystyle:=\operatorname{Tr}\left(\mathcal{P}^{T}*_{M}\mathcal{X}\times_{M+1}(I_{n}-W)(I_{n}-W)^{T}\times_{1}\mathcal{X}^{T}*_{M}\mathcal{P}\right). (21)

The solution entails finding the smallest dd eigenvectors of the generalized eigen-problem

𝒳×M+1(InW)(InW)T×1𝒳TM𝒱=λ𝒳1𝒳TM𝒱.\mathcal{X}\times_{M+1}(I_{n}-W)(I_{n}-W)^{T}\times_{1}\mathcal{X}^{T}*_{M}\mathcal{V}=\lambda\mathcal{X}*_{1}\mathcal{X}^{T}*_{M}\mathcal{V}.

The projection tensor is obtained by concatenating these eigenvectors into a matrix and then reshaping it to the desired dimension.
The algorithm below shows the steps of NPP via the Einstein product.

Algorithm 5 NPP-Einstein

Input: 𝒳\mathcal{X} (Data) dd (subspace dimension).
      Output: 𝒫\mathcal{P} (Projection space).

1:Compute WW. \triangleright Using the appropriate method
2:Compute the smallest dd eigenvectors of 𝒳×M+1(InW)(InW)T×1𝒳TM𝒱=λ𝒳1𝒳TM𝒱\mathcal{X}\times_{M+1}(I_{n}-W)(I_{n}-W)^{T}\times_{1}\mathcal{X}^{T}*_{M}\mathcal{V}=\lambda\mathcal{X}*_{1}\mathcal{X}^{T}*_{M}\mathcal{V}.
3:Combine these tensors to get 𝒫\mathcal{P}.

Similarly, the multi-weight NPP can be proposed, and the solution is similar.

Nonlinear methods: Nonlinear DR methods are potent tools for uncovering the nonlinear structure within data. However, they present their own challenges, such as the absence of an inverse mapping, which is essential for data reconstruction. Another difficulty encountered is the out-of-sample extension, which involves extending the method to new data. While a variant of these methods utilizing multiple weights could be proposed, it would resemble the approach of linear methods, and thus we will not delve into them here.

4.5 Generalization of Laplacian Eigenmap

Given a weight matrix Wn×nW\in\mathbb{R}^{n\times n}, the objective function is to solve

argmin𝒴×M+1D1𝒴T=ΦLE(𝒴)\displaystyle\arg\min_{\mathcal{Y}\times_{M+1}D*_{1}\mathcal{Y}^{T}=\mathcal{I}}\Phi_{LE}(\mathcal{Y}) :=12i,jwi,j𝒴(i)𝒴(j)F2.\displaystyle:=\dfrac{1}{2}\sum_{i,j}w_{i,j}\left\|\mathcal{Y}^{(i)}-\mathcal{Y}^{(j)}\right\|_{F}^{2}. (22)

The objective function can be written as

ΦLE(𝒴)=Tr(𝒴×M+1L1𝒴T).\Phi_{LE}(\mathcal{Y})=\operatorname{Tr}\left(\mathcal{Y}\times_{M+1}L*_{1}\mathcal{Y}^{T}\right).

For 𝒴^=𝒴×M+1D1/2\widehat{\mathcal{Y}}=\mathcal{Y}\times_{M+1}D^{1/2}, the constraint becomes 𝒴^1𝒴^T=\widehat{\mathcal{Y}}*_{1}\widehat{\mathcal{Y}}^{T}=\mathcal{I}, using 𝒴=𝒴^×M+1D1/2\mathcal{Y}=\widehat{\mathcal{Y}}\times_{M+1}D^{-1/2}, the objective function becomes

ΦLE(𝒴^)=Tr(𝒴^×M+1Ln1𝒴^T).\Phi_{LE}(\widehat{\mathcal{Y}})=\operatorname{Tr}\left(\widehat{\mathcal{Y}}\times_{M+1}L_{n}*_{1}\widehat{\mathcal{Y}}^{T}\right). (23)

Using the isomorphism $\Psi$ and its properties, the problem is equivalent to the matrix problem (1) under the constraint $\Psi(\widehat{\mathcal{Y}})\,\Psi(\widehat{\mathcal{Y}})^{T}=I$. Since the solution of that problem is given by the smallest $d$ eigenvectors of $L_{n}$, the solution of the original problem is $\mathcal{Y}=\Psi^{-1}(\Psi(\widehat{\mathcal{Y}}))\times_{M+1}D^{-1/2}$.
The algorithm below shows the steps of LE via the Einstein product.

Algorithm 6 LE-Einstein

Input: 𝒳\mathcal{X} (Data) dd (subspace dimension).
      Output: 𝒴\mathcal{Y} (Projection data).

1:Compute LnL_{n}. \triangleright Using the appropriate method
2:Compute the smallest $d$ eigenvectors of $L_{n}$.
3:Combine these vectors and reshape them to get Ψ(𝒴^)\Psi(\widehat{\mathcal{Y}}).
4:Compute 𝒴=Ψ1(Ψ(𝒴^))×M+1D1/2\mathcal{Y}=\Psi^{-1}(\Psi(\widehat{\mathcal{Y}}))\times_{M+1}D^{-1/2}.
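A minimal sketch of Algorithm 6, assuming the weight matrix $W$ is given, is shown below.

```python
import numpy as np

def le_einstein(W, d):
    """Laplacian Eigenmap embedding Y (d x n) from the weight matrix W."""
    deg = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
    Ln = D_inv_sqrt @ (np.diag(deg) - W) @ D_inv_sqrt     # normalized Laplacian
    _, vecs = np.linalg.eigh(Ln)                          # ascending eigenvalues
    Y_hat = vecs[:, 1:d + 1].T            # skip the trivial smallest eigenvector
    return Y_hat @ D_inv_sqrt             # Y = Y_hat x_{M+1} D^{-1/2}
```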

4.5.1 Projection on out of Sample Data

The out-of-sample extension is the problem of extending the method to new data. Many approaches have been proposed to solve this problem, such as the Nyström method, kernel mapping, and eigenfunctions [5, 25]. We will propose a method based on the eigenfunctions.
In the matrix case, the out-of-sample extension is done simply by computing the components explicitly as $y_{t_{j}}=\frac{1}{\lambda_{j}}\mathbf{k}_{t}^{T}\mathbf{v}_{j}$, $j=1,\ldots,d$, where $(\lambda_{j},\mathbf{v}_{j})$ is an eigenpair of the kernel matrix $L_{n}$ and $\mathbf{k}_{t}=(K(\mathbf{x}_{t},\mathbf{x}_{1}),\ldots,K(\mathbf{x}_{t},\mathbf{x}_{n}))^{T}$ is the kernel vector of the test point $\mathbf{x}_{t}$. This can be written compactly as $\mathbf{y}_{t}=\operatorname{diag}(\lambda_{1},\ldots,\lambda_{d})^{-1}[\mathbf{v}_{1},\ldots,\mathbf{v}_{d}]^{T}\mathbf{k}_{t}$. Thus, the generalization is straightforward.
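A minimal matrix-case sketch of this out-of-sample formula, assuming a kernel function `kernel` and precomputed eigenpairs of the training kernel matrix, is given below.

```python
import numpy as np

def le_out_of_sample(x_t, X_train, kernel, eigvals, eigvecs):
    """y_t = diag(lambda)^{-1} V^T k_t; eigvals (d,) and eigvecs (n, d) come
    from the eigen-decomposition of the training kernel matrix."""
    n = X_train.shape[-1]
    k_t = np.array([kernel(x_t, X_train[..., i]) for i in range(n)])
    return (eigvecs.T @ k_t) / eigvals
```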

4.6 Generalization of LLE

LLE is a nonlinear method that aims to preserve the local structure of the data. After finding the kk-nearest neighbors, it determines the weights that minimize the reconstruction error. In other words, it seeks to solve the following objective function

Re(W):=i𝒳(i)jwi,j𝒳(j)F2,Re(W):=\sum_{i}\left\|\mathcal{X}^{(i)}-\sum_{j}w_{i,j}\mathcal{X}^{(j)}\right\|_{F}^{2}, (24)

subject to two constraints: a sparse one (the weights are zero if a point is not in the neighborhood of another point), ensuring the locality of the reconstruction weight, and an invariance constraint (the row sum of the weight matrix is 1). Finally, the projected data is obtained by solving the following objective function

argmin𝒴1𝒴T=i𝒴(i)=𝒪ΦLLE\displaystyle\arg\min_{\begin{subarray}{c}\mathcal{Y}*_{1}\mathcal{Y}^{T}=\mathcal{I}\\ \sum_{i}\mathcal{Y}^{(i)}=\mathcal{O}\end{subarray}}\Phi_{LLE} :=i𝒴(i)jwi,j𝒴(j)F2=𝒴×M+1(InW)F2.\displaystyle:=\sum_{i}\left\|\mathcal{Y}^{(i)}-\sum_{j}w_{i,j}\mathcal{Y}^{(j)}\right\|_{F}^{2}=\left\|\mathcal{Y}\times_{M+1}(I_{n}-W)\right\|_{F}^{2}. (25)

4.6.1 Computing the weights

Equation (24) can be decomposed into finding the weights $w_{i,:}$ of each point $\mathcal{X}^{(i)}$ independently. For simplicity, we refer to the neighbors of $\mathcal{X}^{(i)}$ as $\mathcal{N}^{(i,j)}$, and to the weights of its neighbors as $\mathbf{w}_{i}\in\mathbb{R}^{k}$, with $k$ the number of neighbors, which plays the role of the sparseness constraint. Denote the local Grammian matrix $G^{(i)}\in\mathbb{R}^{k\times k}$ by $G^{(i)}_{j,k}=\langle\mathcal{X}^{(i)}-\mathcal{N}^{(i,j)},\mathcal{X}^{(i)}-\mathcal{N}^{(i,k)}\rangle$.
The invariance constraint can be written as 𝟏T𝐰i=1\mathbf{1}^{T}\mathbf{w}_{i}=1, thus, we can write the problem as

$\sum_{i}\left\|\mathcal{X}^{(i)}-\sum_{j}w_{i,j}\mathcal{X}^{(j)}\right\|_{F}^{2} =\sum_{i}\left\|\mathcal{X}^{(i)}-\sum_{j}w_{i,j}\mathcal{N}^{(i,j)}\right\|_{F}^{2}$
=ijwi,j(𝒳(i)𝒩(i,j))F2\displaystyle=\sum_{i}\left\|\sum_{j}w_{i,j}(\mathcal{X}^{(i)}-\mathcal{N}^{(i,j)})\right\|_{F}^{2}
=i,j,kwi,jwi,kGj,k(i)\displaystyle=\sum_{i,j,k}w_{i,j}w_{i,k}G^{(i)}_{j,k}
=i𝐰iTG(i)𝐰i.\displaystyle=\sum_{i}\mathbf{w}_{i}^{T}G^{(i)}\mathbf{w}_{i}.

This constrained problem can be solved using the Lagrangian

({𝐰i}i,λ)=i𝐰iTG(i)𝐰iiλi(𝟏T𝐰i1).\mathcal{L}(\{\mathbf{w}_{i}\}_{i},\lambda)=\sum_{i}\mathbf{w}_{i}^{T}G^{(i)}\mathbf{w}_{i}-\sum_{i}\lambda_{i}(\mathbf{1}^{T}\mathbf{w}_{i}-1).

We compute the partial derivative with respect to $\mathbf{w}_{i}$ and set it to zero

𝐰i\displaystyle\dfrac{\partial\mathcal{L}}{\partial\mathbf{w}_{i}} =2G(i)𝐰iλi𝟏=0\displaystyle=2G^{(i)}\mathbf{w}_{i}-\lambda_{i}\mathbf{1}=0 (26)
𝐰i\displaystyle\implies\mathbf{w}_{i} =λi2G(i)1𝟏.\displaystyle=\frac{\lambda_{i}}{2}G^{(i)^{-1}}\mathbf{1}.

We utilize the fact that the Grammian matrix is symmetric, assuming that G(i)G^{(i)} is full rank, which is typically the case. However, if G(i)G^{(i)} is not full rank, a small value can be added to its diagonal to ensure full rank.
The partial derivative with respect to λi\lambda_{i}, set to zero, yields the invariance constraint of point ii, i.e., 𝟏T𝐰i=1\mathbf{1}^{T}\mathbf{w}_{i}=1. Multiplying Equation (26) by 𝟏T\mathbf{1}^{T}, we can isolate λ\lambda and arrive at the following equation

𝐰i=G(i)1𝟏𝟏TG(i)1𝟏.\mathbf{w}_{i}=\dfrac{G^{(i)^{-1}}\mathbf{1}}{\mathbf{1}^{T}G^{(i)^{-1}}\mathbf{1}}.
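A minimal sketch of this weight computation, assuming the neighbor indices are given and using a small ridge `reg` (our addition) when the local Grammian is rank deficient, is shown below.

```python
import numpy as np

def lle_weights(X, neighbours, reg=1e-6):
    """X has shape (I1, ..., IM, n); neighbours[i] lists the k neighbour
    indices of sample i.  Returns the sparse n x n weight matrix W."""
    n = X.shape[-1]
    W = np.zeros((n, n))
    for i in range(n):
        idx = neighbours[i]
        diff = (X[..., idx] - X[..., [i]]).reshape(-1, len(idx))
        G = diff.T @ diff + reg * np.eye(len(idx))   # local Grammian G^(i)
        w = np.linalg.solve(G, np.ones(len(idx)))
        W[i, idx] = w / w.sum()                      # enforce 1^T w_i = 1
    return W
```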

4.6.2 Computing the projected data

The final step resembles the previous cases, ensuring that the mean constraint is satisfied. As IWI-W can be represented as a Laplacian, we know that the number of components corresponds to the multiplicity of the eigenvalue 0. Hence, there is at least one eigenvalue 0 with multiplicity 1, and the identity tensor serves as the corresponding eigenvector, thereby satisfying the second constraint. For further elaboration, interested readers can refer to [10].
The solution is equivalent to solving the matricized version, where the solution comprises the smallest $d$ eigenvectors of the matrix $(I_{n}-W)(I_{n}-W^{T})$. Consequently, the solution to the original problem is the inverse transform, denoted $\Psi^{-1}$, of the solution to the matricized version.
The algorithm below shows the steps of LLE via the Einstein product.

Algorithm 7 LLE-Einstein

Input: 𝒳\mathcal{X} (Data) dd (subspace dimension).
      Output: 𝒴\mathcal{Y} (Projection data).

1:Find the neighbors of each point.
2:Compute the reconstruction weight WW.
3:Compute the smallest dd eigenvectors of (InW)(InWT)(I_{n}-W)(I_{n}-W^{T}).
4:Compute the Ψ1\Psi^{-1} of these vectors with the appropriate reshaping to get 𝒴\mathcal{Y}.

4.6.3 Projection on out of Sample Data

To extend these methods to new data (test data) not seen in the training set, various approaches can be employed. These include kernel mapping, eigenfunctions (as discussed in Bengio et al. [5]), and linear reconstruction. Here, we opt to generalize the latter approach. We can follow similar steps to those used in matrix-based methods. Specifically, we can perform the following steps without re-running the algorithm on the entire dataset

1. Find, within the training data, the neighbors of each new test point.

2. Compute the reconstruction weights that best reconstruct each test point from its $k$ neighbors in the training data.

3. Compute the low-dimensional representation of the new data using these reconstruction weights.

More formally, after finding the neighbors $\mathcal{N}^{(i,j)}$ of a test point $\mathcal{X}_{t}^{(i)}$, we solve the following problem

\arg\min_{w^{(t)}}\sum_{i}\left\|\mathcal{X}_{t}^{(i)}-\sum_{j}w^{(t)}_{i,j}\mathcal{N}^{(i,j)}\right\|_{F}^{2},

where $\mathbf{w}^{(t)}_{i,:}$ denotes the reconstruction weights of the $i$-th test point, under the same constraint as before, i.e., each row of the weight matrix sums to $1$, with zero entries for training points that are not neighbors of the test point.
The solution is $w^{(t)}_{i}=\dfrac{G_{t}^{(i)^{-1}}\mathbf{1}}{\mathbf{1}^{T}G_{t}^{(i)^{-1}}\mathbf{1}}$, with $G^{(i)}_{t_{j,k}}=\langle\mathcal{X}_{t}^{(i)}-\mathcal{N}^{(i,j)},\mathcal{X}_{t}^{(i)}-\mathcal{N}^{(i,k)}\rangle$.
Finally, the embedding of the test point $\mathcal{Y}_{t}^{(i)}$ is obtained as $\sum_{j}w^{(t)}_{i,j}\mathcal{Y}^{(j)}$, where $\mathcal{Y}^{(j)}=\phi_{LLE}(\mathcal{N}^{(i,j)})$ are the embedded representations of its neighbors.
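A minimal sketch of this out-of-sample extension, reusing the lle_weights helper from the earlier sketch (all names and the flattened data layout are illustrative assumptions):

import numpy as np

def lle_out_of_sample(x_test, X_train, Y_train, k=10):
    """Embed one unseen sample by linear reconstruction from its training neighbors.

    x_test  : flattened test sample, shape (p,).
    X_train : flattened training samples, shape (n, p).
    Y_train : training embeddings, shape (n, d).
    """
    dists = np.linalg.norm(X_train - x_test, axis=1)         # distances to training data
    idx = np.argsort(dists)[:k]                               # k nearest training neighbors
    w = lle_weights(x_test, [X_train[j] for j in idx])        # weights summing to one
    return w @ Y_train[idx]                                   # weighted neighbor embeddings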

5 Other variants via Einstein product

5.1 Kernel methods

Kernels serve as powerful tools enabling linear methods to operate effectively in high-dimensional spaces, allowing for the representation of nonlinear data without explicitly computing data coordinates in the feature space. The kernel trick, a pivotal breakthrough in machine learning, facilitates this process. Formally, instead of directly manipulating the data $\mathcal{X}=\{\mathcal{X}^{(1)},\ldots,\mathcal{X}^{(n)}\}$, we operate within a high-dimensional implicit feature space through a mapping $\Phi$.
We denote the tensor $\Phi(\mathcal{X})=\{\Phi(\mathcal{X}^{(1)}),\ldots,\Phi(\mathcal{X}^{(n)})\}=\{\Phi(\mathcal{X})^{(1)},\ldots,\Phi(\mathcal{X})^{(n)}\}$. The mapping need not be known explicitly; only the Gram matrix $K$ is required. This matrix collects the inner products of the data in the feature space, $K_{i,j}=\langle\Phi(\mathcal{X}^{(i)}),\Phi(\mathcal{X}^{(j)})\rangle$. Consequently, any method expressible in terms of inner products of the data can be reformulated in terms of the Gram matrix, and the kernel trick can then be applied. Moreover, extending kernel methods to multi-linear operations using the Einstein product is straightforward. Commonly used kernels include the Gaussian, polynomial, Laplacian, and Sigmoid kernels, among others.
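For instance, a Gaussian Gram matrix for tensor samples can be built from Frobenius distances; the sketch below is illustrative, and the median-based default bandwidth is an assumption rather than a prescription of the method.

import numpy as np

def gaussian_gram(X, sigma=None):
    """Gram matrix K_ij = exp(-||X_i - X_j||_F^2 / (2 sigma^2)) for tensor samples.

    X : data tensor whose last mode indexes the n samples.
    """
    n = X.shape[-1]
    flat = X.reshape(-1, n).T                                  # each row is one flattened sample
    sq = ((flat[:, None, :] - flat[None, :, :]) ** 2).sum(-1)  # squared Frobenius distances
    if sigma is None:
        sigma = 0.5 * np.median(np.sqrt(sq[sq > 0]))           # a common heuristic bandwidth
    return np.exp(-sq / (2 * sigma ** 2))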
Denote $\mathcal{Y}=\mathcal{P}^{T}*_{M}\Phi(\mathcal{X})$, with $\mathcal{P}\in\mathbb{R}^{I_{1}\times\ldots\times I_{M}\times d}$.

5.1.1 Kernel PCA via Einstein product

The kernel multi-linear PCA solves the following problem

\arg\max_{\substack{\mathcal{P}\in\mathbb{R}^{I_{1}\times\ldots\times I_{M}\times d}\\ \mathcal{P}^{T}*_{M}\mathcal{P}=\mathcal{I}}}\;\left\|\mathcal{P}^{T}*_{1}(\Phi(\mathcal{X})-\mathcal{Q})\right\|_{F}^{2}, \qquad (27)

where $\mathcal{Q}$ is the mean of the mapped points; the solution consists of the largest $d$ left singular tensors of $\Phi(\mathcal{X})-\mathcal{Q}=\widehat{\Phi}(\mathcal{X})$. This requires the eigen-tensors of $\widehat{\Phi}(\mathcal{X})*_{1}\widehat{\Phi}(\mathcal{X})^{T}$, which is not accessible; however, the Grammian $K=\widehat{\Phi}(\mathcal{X})^{T}*_{M}\widehat{\Phi}(\mathcal{X})$ is available, so we can transform the problem into

\widehat{K}\widehat{\mathbf{z}}_{i}=\lambda_{i}\widehat{\mathbf{z}}_{i},\quad\text{with }\widehat{\mathbf{z}}_{i}=(\widehat{\Phi}(\mathcal{X}))^{T}*_{M}\mathcal{Z}^{(i)}\in\mathbb{R}^{n},

where $\widehat{K}$ denotes the Grammian of the centered data, which can be obtained from $K$ alone as $\widehat{K}=K-HK-KH+HKH$, since

\begin{aligned}
\widehat{K}_{i,j} &=\widehat{\Phi}(\mathcal{X})^{(i)^{T}}*_{1}\widehat{\Phi}(\mathcal{X})^{(j)}\\
&=\left(\Phi(\mathcal{X}^{(i)})^{T}-\dfrac{1}{n}\sum_{k}\Phi(\mathcal{X}^{(k)})^{T}\right)*_{1}\left(\Phi(\mathcal{X}^{(j)})-\dfrac{1}{n}\sum_{l}\Phi(\mathcal{X}^{(l)})\right)\\
&=\Phi(\mathcal{X}^{(i)})^{T}*_{1}\Phi(\mathcal{X}^{(j)})-\dfrac{1}{n}\sum_{k}\Phi(\mathcal{X}^{(i)})^{T}*_{1}\Phi(\mathcal{X}^{(k)})-\dfrac{1}{n}\sum_{l}\Phi(\mathcal{X}^{(l)})^{T}*_{1}\Phi(\mathcal{X}^{(j)})+\dfrac{1}{n^{2}}\sum_{k,l}\Phi(\mathcal{X}^{(k)})^{T}*_{1}\Phi(\mathcal{X}^{(l)})\\
&=K_{i,j}-\dfrac{1}{n}\sum_{k}K_{i,k}-\dfrac{1}{n}\sum_{l}K_{l,j}+\dfrac{1}{n^{2}}\sum_{k,l}K_{k,l}\\
&=\left(K-\dfrac{\mathbf{1}}{n}K-K\dfrac{\mathbf{1}}{n}+\dfrac{\mathbf{1}}{n}K\dfrac{\mathbf{1}}{n}\right)_{i,j}.
\end{aligned}

Then, the solution is the same as in the matrix case, i.e., the largest $d$ eigenvectors of $\widehat{K}$, reshaped to the appropriate size.
The algorithm below shows the steps of kernel PCA via the Einstein product.

Algorithm 8 Kernel PCA-Einstein

Input: $\mathcal{X}$ (data), $(d_{i})_{i\leq M}$ (output dimensions), $K$ (Grammian).
      Output: $\mathcal{Y}$ (projected data).

1: Compute $\widehat{K}$. \triangleright Center the Grammian
2: Compute the largest $d$ eigenvectors of $\widehat{K}$.
3: Combine these vectors and reshape them to get $\mathcal{Y}$.
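A minimal matricized sketch of Algorithm 8, assuming the Gram matrix K has already been formed (for example with the gaussian_gram helper above); the eigen-solver and the reshaping left to the caller are implementation choices.

import numpy as np

def kernel_pca_einstein(K, d):
    """Embed n samples from their Gram matrix K (n x n) into d dimensions."""
    n = K.shape[0]
    H = np.ones((n, n)) / n
    K_hat = K - H @ K - K @ H + H @ K @ H        # double centering of the Grammian
    vals, vecs = np.linalg.eigh(K_hat)           # eigenvalues in ascending order
    Y = vecs[:, -d:][:, ::-1]                    # largest d eigenvectors
    return Y                                     # reshape to the tensor format as needed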

5.1.2 Kernel LPP via Einstein product

The kernel multi-linear LPP tackles the following problem

\arg\min_{\substack{\mathcal{P}\in\mathbb{R}^{I_{1}\times\ldots\times I_{M}\times d}\\ \mathcal{P}^{T}*_{M}\Phi(\mathcal{X})\times_{M+1}D*_{1}\Phi(\mathcal{X})^{T}*_{M}\mathcal{P}=\mathcal{I}}}\;\operatorname{Tr}\left(\mathcal{P}^{T}*_{M}\left(\Phi(\mathcal{X})\times_{M+1}L*_{1}\Phi(\mathcal{X})^{T}\right)*_{M}\mathcal{P}\right). \qquad (28)

The solution involves finding the smallest $d$ eigen-tensors of the generalized eigen-problem

\Phi(\mathcal{X})\times_{M+1}L*_{1}\Phi(\mathcal{X})^{T}*_{M}\mathcal{V}=\lambda\,\Phi(\mathcal{X})\times_{M+1}D*_{1}\Phi(\mathcal{X})^{T}*_{M}\mathcal{V}.

Since $\Phi(\mathcal{X})$ is not available, the problem needs to be reformulated. Using the fact that $K$ is invertible, we reformulate it as finding the vectors $\mathbf{z}$ that solve the generalized eigen-problem

L\mathbf{z}=\lambda D\mathbf{z},\quad\text{with }\mathbf{z}=\Phi(\mathcal{X})^{T}*_{M}\mathcal{V}^{(i)}\in\mathbb{R}^{n}.

This formulation reduces to the same minimization problem as in the matrix case.
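A minimal sketch of this reduced problem, using a symmetric generalized eigen-solver; the construction of the affinity matrix W (e.g., a Gaussian-weighted k-nn graph) is assumed to be done beforehand, and discarding the constant eigenvector is a common practical choice rather than part of the statement above.

import numpy as np
from scipy.linalg import eigh

def kernel_lpp_einstein(W, d):
    """Embedding from an affinity matrix W (n x n) via the reduced problem L z = lambda D z."""
    D = np.diag(W.sum(axis=1))              # degree matrix
    L = D - W                               # graph Laplacian
    vals, vecs = eigh(L, D)                 # generalized eigenvalues in ascending order
    return vecs[:, 1:d + 1]                 # smallest d, skipping the constant eigenvector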

5.1.3 Kernel ONPP via Einstein product

The kernel multi-linear ONPP addresses the following optimization problem

\arg\min_{\substack{\mathcal{P}\in\mathbb{R}^{I_{1}\times\ldots\times I_{M}\times d}\\ \mathcal{P}^{T}*_{M}\mathcal{P}=\mathcal{I}}}\;\operatorname{Tr}\left(\mathcal{P}^{T}*_{M}\left(\Phi(\mathcal{X})\times_{M+1}(I-W)(I-W^{T})*_{1}\Phi(\mathcal{X})^{T}\right)*_{M}\mathcal{P}\right). \qquad (29)

The solution involves finding the smallest $d$ eigen-tensors of the problem

\Phi(\mathcal{X})\times_{M+1}(I-W)(I-W^{T})*_{1}\Phi(\mathcal{X})^{T}*_{M}\mathcal{V}=\lambda\mathcal{V}.

By employing similar techniques as before, we can derive the equivalent problem of finding the vectors $\mathbf{z}$ that solve the eigen-problem

K(I-W)(I-W^{T})\mathbf{z}=\lambda\mathbf{z},\quad\text{with }\Phi(\mathcal{X})^{T}*_{M}\mathcal{V}=\mathbf{z}.

This is the same minimization problem as in the matrix case; the solution $\mathcal{Y}$ can be obtained by reshaping the transpose of the concatenated vectors $\mathbf{z}$.
The algorithm below shows the steps of kernel ONPP via the Einstein product.

Algorithm 9 Kernel ONPP-Einstein

Input: $K$ (Grammian), $d$ (subspace dimension).
      Output: $\mathcal{Y}$ (projected data).

1: Compute $W$. \triangleright Using the appropriate method
2: Compute the smallest $d$ eigenvectors of $K(I-W)(I-W^{T})$.
3: Combine these vectors and reshape them to get $\mathcal{Y}$.
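A minimal matricized sketch of Algorithm 9, assuming K and the reconstruction weights W are given; since $K(I-W)(I-W^{T})$ is not symmetric in general, a general eigen-solver is used here, which is an implementation choice and not prescribed by the paper.

import numpy as np

def kernel_onpp_einstein(K, W, d):
    """Embedding from the Gram matrix K and LLE-type weights W (both n x n)."""
    n = K.shape[0]
    M = K @ (np.eye(n) - W) @ (np.eye(n) - W).T   # K (I - W)(I - W)^T
    vals, vecs = np.linalg.eig(M)                 # generally non-symmetric matrix
    order = np.argsort(vals.real)                 # sort eigenvalues by real part
    return vecs[:, order[:d]].real.T              # smallest d eigenvectors, rows = dimensions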

5.1.4 Kernel OLPP via Einstein product

The kernel multi-linear OLPP tackles the optimization problem defined as follows:

\arg\min_{\substack{\mathcal{P}\in\mathbb{R}^{I_{1}\times\ldots\times I_{M}\times d}\\ \mathcal{P}^{T}*_{M}\mathcal{P}=\mathcal{I}}}\;\operatorname{Tr}\left(\mathcal{P}^{T}*_{M}\left(\Phi(\mathcal{X})\times_{M+1}L*_{1}\Phi(\mathcal{X})^{T}\right)*_{M}\mathcal{P}\right). \qquad (30)

The solution of the problem involves the eigen-tensors of $\Phi(\mathcal{X})\times_{M+1}L*_{1}\Phi(\mathcal{X})^{T}*_{M}\mathcal{Z}=\lambda\mathcal{Z}$. By transforming the problem, we arrive at finding the vectors $\mathbf{z}$ that solve the eigen-problem

K\mathbf{z}=\lambda\mathbf{z},\quad\text{with }\Phi(\mathcal{X})^{T}*_{M}\mathcal{Z}=\mathbf{z},

which mirrors the matrix case. Here, the solution $\mathcal{Y}$ can be obtained by reshaping the transpose of the concatenated vectors $\mathbf{z}$.
The algorithm below shows the steps of kernel OLPP via the Einstein product.

Algorithm 10 Kernel OLPP-Einstein

Input: $K$ (Grammian), $d$ (subspace dimension).
      Output: $\mathcal{Y}$ (projected data).

1: Compute the smallest $d$ eigenvectors of $K$.
2: Combine these vectors and reshape them to get $\mathcal{Y}$.

5.2 Supervised learning

In general, supervised learning differs from unsupervised learning primarily in how the weight matrix incorporates class label information. Supervised learning tends to outperform unsupervised learning, particularly with small datasets, due to the utilization of additional class label information.
In supervised learning, each data point is associated with a known class label. The weight matrix can be adapted to include this class label information. For instance, it may take the form of a block diagonal matrix, where $W_{s}(i)\in\mathbb{R}^{n_{i}\times n_{i}}$ are sub-weight matrices and $n_{i}$ denotes the number of data points in class $i$. Let $c(i)$ denote the class of data point $x_{i}$.

Supervised PCA: PCA is the only linear method presented that does not rely on a graph matrix. Consequently, its supervised implementation is not straightforward and requires a brief explanation. Following the approach proposed in [2], we address this challenge by formulating the problem via the empirical Hilbert-Schmidt independence criterion (HSIC):

\arg\max_{P^{T}P=I}\;\operatorname{Tr}(P^{T}XHK_{L}HX^{T}P),

where $K_{L}$ is the kernel of the outcome measurements $Y$. Thus, the generalization would be to solve

\arg\max_{\substack{\mathcal{P}\in\mathbb{R}^{I_{1}\times\ldots\times I_{M}\times d}\\ \mathcal{P}^{T}*_{M}\mathcal{P}=\mathcal{I}}}\operatorname{Tr}\left(\mathcal{P}^{T}*_{M}\mathcal{X}\times_{M+1}HK_{L}H*_{1}\mathcal{X}^{T}*_{M}\mathcal{P}\right). \qquad (31)

The solution of (31) consists of the largest $d$ eigen-tensors of $\mathcal{X}\times_{M+1}HK_{L}H*_{1}\mathcal{X}^{T}$.
Notice that when $K_{L}=I_{n}$, we recover the same problem as in the unsupervised case.
The algorithm below shows the steps of supervised PCA via the Einstein product.

Algorithm 11 Supervised PCA-Einstein

Input: $\mathcal{X}$ (data), $d$ (subspace dimension), $K_{L}$ (kernel of the labels).
      Output: $\mathcal{P}$ (projection space).

1: Compute the largest $d$ eigen-tensors of $\mathcal{X}\times_{M+1}HK_{L}H*_{1}\mathcal{X}^{T}$.
2: Combine these tensors to get $\mathcal{P}$.
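A matricized sketch of Algorithm 11 under the HSIC formulation above, with $H=I_{n}-\frac{1}{n}\mathbf{1}\mathbf{1}^{T}$ the centering matrix; the delta kernel on the labels is an illustrative choice for $K_{L}$, and the function and variable names are assumptions.

import numpy as np

def supervised_pca_einstein(X, labels, d):
    """X: data tensor of shape dims + (n,); labels: length-n array; returns P (d x prod(dims))."""
    n = X.shape[-1]
    Xm = X.reshape(-1, n)                                        # features x samples
    K_L = (labels[:, None] == labels[None, :]).astype(float)     # delta kernel on the labels
    H = np.eye(n) - np.ones((n, n)) / n                          # centering matrix
    M = Xm @ H @ K_L @ H @ Xm.T                                  # X H K_L H X^T
    vals, vecs = np.linalg.eigh(M)
    P = vecs[:, -d:][:, ::-1].T                                  # largest d eigenvectors as rows
    return P                                                     # reshape to I_1 x ... x I_M x d if needed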

Supervised Laplacian Eigenmap: The supervised Laplacian Eigenmap is similar to the Laplacian Eigenmap, with the difference that the weight matrix is computed using the class labels; many approaches have been proposed [22, 8, 25]. We choose a simple approach that replaces the weight matrix by $W_{s}$, and the rest of the algorithm remains the same.

Supervised LLE: There are multiple variants of LLE that use the class labels to improve the performance of the method, e.g., supervised LLE (SLLE), probabilistic SLLE [33], supervised guided LLE using HSIC [1], and enhanced SLLE [31]. The general strategy is to incorporate the class labels either in the distance matrix, in the weight matrix, or in the objective function [10]. We choose the simplest, namely the first strategy: the distance matrix is modified by adding a term that increases the inter-class distances and decreases the intra-class ones. The rest of the steps are the same as in the unsupervised LLE.
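A common realization of this first strategy, used in several of the SLLE variants cited above, inflates the distances between samples of different classes, for instance as $D'=D+\alpha\,\max(D)\,(1-\Delta)$ with $\Delta_{ij}=1$ when $i$ and $j$ share a class; the sketch below implements this illustrative formulation, which is one possible choice rather than the exact one prescribed here.

import numpy as np

def supervised_distances(D, labels, alpha=0.2):
    """Inflate distances between samples of different classes before the k-nn search.

    D      : (n, n) pairwise distance matrix.
    labels : length-n array of class labels.
    alpha  : trade-off between geometry (0) and class information (1).
    """
    same_class = (labels[:, None] == labels[None, :]).astype(float)
    return D + alpha * D.max() * (1.0 - same_class)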

5.2.1 Repulsion approaches

In semi-supervised or supervised learning, how the class labels are used can affect the performance. Commonly, the similarity matrix only tells us whether two points belong to the same class, without incorporating any additional information on data locality, e.g., the closeness of points of different classes. The repulsion technique takes the class label information into account by repulsing points of different classes and attracting points of the same class. It extends the traditional graph-based methods by incorporating repulsion or discrimination elements into the graph Laplacian, leading to a more distinct separation of the classes in the reduced-dimensional space by integrating the class label information directly into the graph structure. The concept of repulsion had been used in DR with different formulations [32, 7] before the k-nn graph was used to derive it. In [16], a generic method was proposed that applies attraction to all points of the same class together with repulsion between nearby points of different classes, and was found to be significantly better than the previous approaches. Thus, we use the same approach and generalize it to the Einstein product.
The repulsion graph $\mathcal{G}^{(r)}=\{\mathcal{V}^{(r)},\mathcal{E}^{(r)}\}$ is derived from the k-nn graph $\mathcal{G}=\{\mathcal{V},\mathcal{E}\}$ based on the class label information; in the simplest form, the edge weights can be computed as

W_{i,j}^{(r)}=\begin{cases}1 & \text{if }(\mathbf{x}_{i},\mathbf{x}_{j})\in\mathcal{E},\;i\neq j,\;c(i)\neq c(j),\\ 0 & \text{otherwise.}\end{cases}

Hence, in the case of a fully connected graph, the repulsion weights take the form

W^{(r)}=\mathbf{1}_{n}-\operatorname{diag}(\mathbf{1}_{n_{i}}).

Other weight values can also be proposed.
The new repulsion algorithms are similar to the previous ones, using the new weight matrix $W_{s}=W+\beta W^{(r)}$, where $\beta$ is a parameter.
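A minimal sketch of this combination on a fully connected graph; the attraction matrix W is assumed to have been built by one of the methods above, and only the repulsion term follows the formula given here.

import numpy as np

def repulsion_weights(W, labels, beta=1.0):
    """Combine an attraction graph W with a class-based repulsion graph: W_s = W + beta * W_r."""
    W_r = (labels[:, None] != labels[None, :]).astype(float)   # 1 for pairs in different classes
    np.fill_diagonal(W_r, 0.0)
    return W + beta * W_r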

6 Experiments

To show the effectiveness of the proposed methods, we use datasets that are commonly used in the literature. The experiments are conducted on the GTDB dataset for facial recognition and on the MNIST dataset for digit recognition. We note that these datasets provide raw images rather than extracted features. The results are compared with the state-of-the-art methods by feeding the projected data to a classifier. A baseline is also used for comparison, namely using the raw data as the input of the classifier, and the recognition rate is used as the evaluation metric for all methods. Images were chosen because the proposed methods are designed to work on multi-linear data, and images are a typical example of such data. The proposed methods that use multiple weights are denoted by adding "MW" to the name of the method. It is intuitive to use multiple weights for images, since the third mode represents the RGB channels while the first two modes represent the pixel locations.
The evaluation metric, the recognition rate (IR), is used to assess the performance of the proposed methods. It is defined as the number of correct classifications over the total number of test samples. A test sample is classified by finding the minimum distance between its projection and the projected training data. The IR is computed on the test data.

For simplicity, we use the supervised versions of the methods, with Gaussian weights, and the parameter recommended in [16] (half the median of the data) for the Gaussian parameter.

IR=100\times\dfrac{\text{number of recognized images in the data}}{\text{number of images in the data}}.
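A minimal sketch of this evaluation protocol (nearest-neighbor classification in the projected space; the variable names are illustrative):

import numpy as np

def recognition_rate(Y_train, labels_train, Y_test, labels_test):
    """IR = 100 * (# correctly recognized test images) / (# test images)."""
    # distance from each test embedding to every training embedding
    dists = np.linalg.norm(Y_test[:, None, :] - Y_train[None, :, :], axis=-1)
    predicted = labels_train[np.argmin(dists, axis=1)]       # nearest-neighbor label
    return 100.0 * np.mean(predicted == labels_test)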

All computations are carried out on a laptop with a 2.1 GHz 8th-generation Intel Core i7 processor and 8 GB of memory, using MATLAB 2021a.

6.1 Digit recognition

The dataset used in these experiments is the MNIST dataset (https://lucidar.me/en/matlab/load-mnist-database-of-handwritten-digits-in-matlab/). It contains 60,000 training images and 10,000 testing images of labeled handwritten digits. The images are normalized gray-scale images of size $28\times 28$. The evaluation metric is the same as for facial recognition. We work with a smaller subset of the data to speed up the computation, e.g., 1000 training images and 200 testing images taken randomly from the data. Observe that the multi-weight methods are not used on this dataset since the images are gray-scale, so there are no multiple weights.
Table 2 shows the performance of the different approaches compared to the state of the art for different subspace dimensions.

d     OLPP    OLPP-E  ONPP    ONPP-E  PCA     PCA-E   Baseline
5     50.50   50.50   56.00   56.00   63.00   63.00   8.50
10    75.50   75.50   81.50   81.50   82.50   82.50   8.50
15    81.00   81.00   80.50   80.50   84.50   84.50   8.50
20    85.00   85.00   83.50   83.50   88.00   88.00   8.50
25    86.00   86.00   87.50   87.50   88.00   88.00   8.50
30    85.50   85.50   87.50   87.50   89.00   89.00   8.50
35    88.00   88.00   89.50   89.50   87.00   87.00   8.50
40    88.00   88.00   89.00   89.00   87.50   87.50   8.50
Table 2: Performance (recognition rate, %) of the methods for different subspace dimensions d.

On the MNIST dataset, the results of each method and of its multi-dimensional counterpart are similar; we attribute this to the fact that vectorizing two modes into one does not affect the accuracy much, which leads to similar results with the proposed parameters.

Note that the objective is to compare each method with its proposed multi-dimensional counterpart via the Einstein product, in order to verify that the generalization works.

6.2 Facial recognition

The dataset used in these experiments is the Georgia Tech database GTDB crop (https://www.anefian.com/research/face_reco.htm). It contains 750 color JPEG images of 50 persons, each represented by exactly 15 images showing different facial expressions, scales, and lighting conditions. Figure 1 shows 12 arbitrary images, out of the possible 15, of one person in the dataset.
Our data in this case form a tensor of size $height\times width\times 3\times 750$ for RGB images and $height\times width\times 750$ for gray-scale images. The height and width of the images are fixed to $60\times 60$. The data is normalized.

Figure 1: Example of images of one person in the GTDB dataset.

The experiment uses 12 images for training and 3 for testing per face. Figure 2 shows the results for different subspace dimensions.

Figure 2: Performance of the methods for different subspace dimensions.

The results show that as the subspace dimension increases, the performance of most methods also increases, suggesting that these methods benefit from a higher-dimensional feature space up to a point that differs from one method to another. The generalized methods using the Einstein product give overall better results over all subspace dimensions compared to their counterparts, except for ONPP in the small-$d$ case. The multiple-weight methods show varying performance and outperform the single-weight ones in some cases. Future work could consider how to aggregate the results of the individual weights in order to obtain more robust results.
The objective is to compare each method with its multi-dimensional counterparts via the Einstein product, e.g., the OLPP method with OLPP-E and OLPP-E-MW.
The superiority of the Einstein-based methods can be explained by the fact that they preserve the multi-linear and nonlinear structure of the data, which is lost when the data are vectorized, as in the other matricized methods.

7 Conclusion

The paper advances the field of dimension reduction by introducing refined graph-based methods and leveraging the Einstein product for tensor data. It extends both linear and nonlinear methods (supervised and unsupervised), as well as their variants, to higher-order tensors. The experiments are conducted on the GTDB and MNIST datasets, and the results are compared with state-of-the-art methods, showing competitive performance. Future work could address the generalization to trace-ratio methods such as Linear Discriminant Analysis. The computation could also be accelerated by using the tensor Golub-Kahan decomposition to approximate the eigen-tensors needed to construct the projected space.

References

  • [1] A. Álvarez-Meza, J. Valencia-Aguirre, G. Daza-Santacoloma, and G. Castellanos-Domínguez, Global and local choice of the number of nearest neighbors in locally linear embedding, Pattern Recognition Letters, 32 (2011), pp. 2171–2177.
  • [2] E. Barshan, A. Ghodsi, Z. Azimifar, and M. Z. Jahromi, Supervised principal component analysis: Visualization, classification and regression on subspaces and submanifolds, Pattern Recognition, 44 (2011), pp. 1357–1371.
  • [3] P. N. Belhumeur, J. P. Hespanha, and D. J. Kriegman, Eigenfaces vs. fisherfaces: Recognition using class specific linear projection, IEEE Transactions on pattern analysis and machine intelligence, 19 (1997), pp. 711–720.
  • [4] M. Belkin and P. Niyogi, Laplacian eigenmaps for dimensionality reduction and data representation, Neural computation, 15 (2003), pp. 1373–1396.
  • [5] Y. Bengio, J.-f. Paiement, P. Vincent, O. Delalleau, N. Roux, and M. Ouimet, Out-of-sample extensions for lle, isomap, mds, eigenmaps, and spectral clustering, Advances in neural information processing systems, 16 (2003).
  • [6] M. Brazell, N. Li, C. Navasca, and C. Tamon, Solving multilinear systems via tensor inversion, SIAM Journal on Matrix Analysis and Applications, 34 (2013), pp. 542–570.
  • [7] H.-T. Chen, H.-W. Chang, and T.-L. Liu, Local discriminant embedding and its variants, in 2005 IEEE computer society conference on computer vision and pattern recognition (CVPR’05), vol. 2, IEEE, 2005, pp. 846–853.
  • [8] J. A. Costa and A. Hero, Classification constrained dimensionality reduction, in Proceedings.(ICASSP’05). IEEE International Conference on Acoustics, Speech, and Signal Processing, 2005., vol. 5, IEEE, 2005, pp. v–1077.
  • [9] K. Pearson, LIII. On lines and planes of closest fit to systems of points in space, The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 2 (1901), pp. 559–572.
  • [10] B. Ghojogh, A. Ghodsi, F. Karray, and M. Crowley, Locally linear embedding and its variants: Tutorial and survey, arXiv preprint arXiv:2011.10925, (2020).
  • [11] A. E. Hachimi, K. Jbilou, M. Hached, and A. Ratnani, Tensor golub kahan based on einstein product, arXiv preprint arXiv:2311.03109, (2023).
  • [12] X. He and P. Niyogi, Locality preserving projections, Advances in neural information processing systems, 16 (2003).
  • [13] X. He, S. Yan, Y. Hu, P. Niyogi, and H.-J. Zhang, Face recognition using laplacianfaces, IEEE transactions on pattern analysis and machine intelligence, 27 (2005), pp. 328–340.
  • [14] E. Kokiopoulou and Y. Saad, Orthogonal neighborhood preserving projections, in Fifth IEEE International Conference on Data Mining (ICDM’05), IEEE, 2005, pp. 8–pp.
  • [15] E. Kokiopoulou and Y. Saad, Orthogonal neighborhood preserving projections: A projection-based dimensionality reduction technique, IEEE Transactions on Pattern Analysis and Machine Intelligence, 29 (2007), pp. 2143–2156.
  • [16] E. Kokiopoulou and Y. Saad, Enhanced graph-based dimensionality reduction with repulsion laplaceans, Pattern Recognition, 42 (2009), pp. 2392–2402.
  • [17] T. G. Kolda and B. W. Bader, Tensor decompositions and applications, SIAM review, 51 (2009), pp. 455–500.
  • [18] J. A. Lee, M. Verleysen, et al., Nonlinear dimensionality reduction, vol. 1, Springer, 2007.
  • [19] Q. Liu, R. Huang, H. Lu, and S. Ma, Face recognition using kernel-based fisher discriminant analysis, in Proceedings of Fifth IEEE International Conference on Automatic Face Gesture Recognition, IEEE, 2002, pp. 197–201.
  • [20] L. Sun, B. Zheng, C. Bu, and Y. Wei, Moore–Penrose inverse of tensors via Einstein product, Linear and Multilinear Algebra, 64 (2016), pp. 686–698.
  • [21] L. Qi and Z. Luo, Tensor analysis: spectral theory and special tensors, SIAM, 2017.
  • [22] B. Raducanu and F. Dornaika, A supervised non-linear dimensionality reduction approach for manifold learning, Pattern Recognition, 45 (2012), pp. 2432–2444.
  • [23] S. T. Roweis and L. K. Saul, Nonlinear dimensionality reduction by locally linear embedding, science, 290 (2000), pp. 2323–2326.
  • [24] L. K. Saul, K. Q. Weinberger, F. Sha, J. Ham, and D. D. Lee, Spectral methods for dimensionality reduction, in Semi-Supervised Learning, The MIT Press, 09 2006.
  • [25] M. Tai, M. Kudo, A. Tanaka, H. Imai, and K. Kimura, Kernelized supervised laplacian eigenmap for visualization and classification of multi-label data, Pattern Recognition, 123 (2022), p. 108399.
  • [26] M. A. Turk and A. P. Pentland, Face recognition using eigenfaces, in Proceedings. 1991 IEEE computer society conference on computer vision and pattern recognition, IEEE Computer Society, 1991, pp. 586–587.
  • [27] Q.-W. Wang and X. Xu, Iterative algorithms for solving some tensor equations, Linear and Multilinear Algebra, 67 (2019), pp. 1325–1349.
  • [28] Y. Wang and Y. Wei, Generalized eigenvalue for even order tensors via einstein product and its applications in multilinear control systems, Computational and Applied Mathematics, 41 (2022), p. 419.
  • [29] A. R. Webb, K. D. Copsey, and G. Cawley, Statistical pattern recognition, vol. 2, Wiley Online Library, 2011.
  • [30] M.-H. Yang, Face recognition using kernel methods, Advances in neural information processing systems, 14 (2001).
  • [31] S.-q. Zhang, Enhanced supervised locally linear embedding, Pattern Recognition Letters, 30 (2009), pp. 1208–1218.
  • [32] W. Zhang, X. Xue, H. Lu, and Y.-F. Guo, Discriminant neighborhood embedding for classification, Pattern Recognition, 39 (2006), pp. 2240–2243.
  • [33] L. Zhao and Z. Zhang, Supervised locally linear embedding with probability-based distance for classification, Computers & Mathematics with Applications, 57 (2009), pp. 919–926.