
Higher order multi-dimension reduction methods via Einstein-product

A. Zahir¹, K. Jbilou¹,², A. Ratnani¹
¹The UM6P Vanguard Center, Mohammed VI Polytechnic University, Green City, Morocco.
²Université du Littoral Côte d'Opale, LMPA, 50 rue F. Buisson, 62228 Calais-Cedex, France.
Abstract

This paper explores the extension of dimension reduction (DR) techniques to the multi-dimension case by using the Einstein product. Our focus lies on graph-based methods, encompassing both linear and nonlinear approaches, within both supervised and unsupervised learning paradigms. Additionally, we investigate variants such as repulsion graphs and kernel methods for linear approaches. Furthermore, we present two generalizations for each method, based on single or multiple weights. We demonstrate the straightforward nature of these generalizations and provide theoretical insights. Numerical experiments are conducted, and results are compared with original methods, highlighting the efficiency of our proposed methods, particularly in handling high-dimensional data such as color images.

keywords:
Tensor, Dimension reduction, Einstein product, Graph-based methods, Multi-dimensional Data, Trace optimization problem.

1 Introduction

In today's data-driven world, where vast amounts of information are collected and analyzed, the ability to simplify and interpret data has never been more critical. The challenge is particularly evident in data science and machine learning [29], where the curse of dimensionality is a major obstacle. Dimension reduction techniques address this issue by projecting high-dimensional data onto a lower-dimensional space while preserving the underlying structure of the data. These methods have proven to be quite efficient in revealing hidden structures and patterns.

The landscape of DR is quite rich, with a wide range of methods, from linear to non-linear [18], supervised to unsupervised, and versions of these methods that incorporate repulsion-based principles or kernels. Examples include Principal Component Analysis (PCA) [9], Locality Preserving Projections (LPP) [12], Orthogonal Neighborhood Preserving Projections (ONPP) [14, 15], Neighborhood Preserving Projections (NPP) [14], Laplacian Eigenmap (LE) [4], and Locally Linear Embedding (LLE) [23], among others. Each of these techniques has been subject to extensive research and applications, offering insights into data structures that are often hidden in high-dimensional spaces. These methods can also be cast as constrained trace optimization problems [24].

Current approaches often require transforming multi-dimensional data, such as images [13, 3, 26, 19, 30], into flattened (vectorized) matrix form before analysis. While fast, this process can be problematic, as it may discard the inherent structure and relational information within the data.

This paper proposes a novel approach to generalize dimensional reduction techniques, employing the Einstein product, a tool in tensor algebra, which is the natural extension of the usual matrix product. By reformulating the operations of both linear and non-linear methods in the context of tensor operations, the generalization maintains the multi-dimensional integrity of complex datasets. This approach circumvents the need for vectorization, preserving the rich intrinsic structure of the data.
Our contribution lies not only in proposing a generalized framework for dimension reduction, but also in demonstrating its effectiveness through empirical studies. We show that the proposed methods outperform, or at least match, their matrix-based counterparts, while preserving the integrity of the data.

This paper is organized as follows. Section 2 reviews the methods in the matrix case. Section 3 introduces the mathematical background of tensors and the Einstein product. Section 4 presents the generalization of the different methods via the Einstein product. Section 5 is dedicated to variants of these techniques. Section 6 reports the numerical experiments and results. Finally, Section 7 offers concluding remarks and suggestions for future work.

2 Dimension reduction methods in matrix case

Given a set of $n$ data points $\mathbf{x}_{1},\ldots,\mathbf{x}_{n}\in\mathbb{R}^{m}$ and a set of $n$ corresponding points $\mathbf{y}_{1},\ldots,\mathbf{y}_{n}\in\mathbb{R}^{d}$, denote the data matrix $X=\left[\mathbf{x}_{1},\cdots,\mathbf{x}_{n}\right]\in\mathbb{R}^{m\times n}$ and the low-dimensional matrix $Y=\left[\mathbf{y}_{1},\cdots,\mathbf{y}_{n}\right]\in\mathbb{R}^{d\times n}$. The objective is to find a mapping $\Phi:\mathbb{R}^{m}\longrightarrow\mathbb{R}^{d}$ with $\Phi\left(\mathbf{x}_{i}\right)=\mathbf{y}_{i}$, $i=1,\ldots,n$. The mapping is either non-linear, $Y=\Phi(X)$, or linear, $Y=V^{T}X$; in the latter case, the task reduces to finding the projection matrix $V\in\mathbb{R}^{m\times d}$.

We denote the similarity matrix of a graph by $W\in\mathbb{R}^{n\times n}$, the degree matrix by $D$, and the Laplacian matrix by $L=D-W$. For the sake of simplification, we define the matrices

$$L_{n}=D^{-1/2}LD^{-1/2},\quad \widehat{W}=D^{-1/2}WD^{-1/2},\quad M=(I_{n}-W^{T})(I_{n}-W),$$
$$\widehat{X}=XD^{1/2},\quad \widehat{Y}=YD^{1/2},\quad H=I_{n}-\dfrac{1}{n}\mathbf{1}\mathbf{1}^{T},$$

where $H$ is the centering matrix and $\mathbf{1}=\left(1,\ldots,1\right)^{T}\in\mathbb{R}^{n}$.
The usual loss functions used are defined as follows

$$\phi_{1}(Y) := \dfrac{1}{2}\sum_{i,j=1}^{n}W_{ij}\left\|\mathbf{y}_{i}-\mathbf{y}_{j}\right\|_{2}^{2}=\operatorname{Tr}\left[YLY^{H}\right], \quad (1)$$
$$\phi_{2}(Y) := \sum_{i}\left\|\mathbf{y}_{i}-\sum_{j}W_{ij}\mathbf{y}_{j}\right\|_{2}^{2}=\operatorname{Tr}\left[YMY^{H}\right], \quad (2)$$
$$\Phi_{3}(Y) := \sum_{i}\left\|\mathbf{y}_{i}-\dfrac{1}{n}\sum_{j}\mathbf{y}_{j}\right\|_{2}^{2}=\operatorname{Tr}\left[Y\left(I-\dfrac{1}{n}\mathbf{1}\mathbf{1}^{T}\right)Y^{H}\right]. \quad (3)$$

Equations (1) and (3) preserve locality, i.e., a point and its representation stay close, while Equation (2) preserves the local geometry, i.e., each representation point can be written as a linear combination of its neighbours.

For simplicity, we will refer to the $d$ eigenvectors of a matrix corresponding to the largest and smallest eigenvalues, respectively, as the largest and smallest $d$ eigenvectors of the matrix. The same terminology applies to the left or right singular vectors. Table 1 summarizes the various dimension reduction methods, their corresponding optimization problems, and their solutions.

Method | Loss function | Constraint | Solution

Linear methods:
Principal Component Analysis [9] | Maximize Equation (3) | $VV^{T}=I$ | Largest $d$ left singular vectors of $XH$.
Locality Preserving Projections [12] | Minimize Equation (1) | $YDY^{T}=I$ | Solution of $\widehat{X}(I_{n}-\widehat{W})\widehat{X}^{T}u_{i}=\lambda_{i}\widehat{X}\widehat{X}^{T}u_{i}$.
Orthogonal Locality Preserving Projections [14, 15] | Minimize Equation (1) | $VV^{T}=I$ | Smallest $d$ eigenvectors of $XLX^{T}$.
Orthogonal Neighborhood Preserving Projections [14, 15] | Minimize Equation (2) | $VV^{T}=I$ | Smallest $d$ eigenvectors of $XMX^{T}$.
Neighborhood Preserving Projections [14] | Minimize Equation (2) | $YY^{T}=I$ | Solution of $XMX^{T}u_{i}=\lambda_{i}XX^{T}u_{i}$.

Non-linear methods:
Locally Linear Embedding [23] | Minimize Equation (2) | $YY^{T}=I$ | Smallest $d$ eigenvectors of $M$.
Laplacian Eigenmap [4] | Minimize Equation (1) | $YDY^{T}=I$ | Solution of $Lu_{i}=\lambda_{i}Du_{i}$.

Table 1: Objective functions and constraints employed in various dimension reduction methods, along with the corresponding solutions.

Notice that the smallest eigenvalue is disregarded in the solutions; thus, the second through the $(d+1)$-th eigenvectors are taken. The graph-based methods are quite similar: each one tries to give an accurate representation of the data while preserving a desired property. The solution of the optimization problem is given by the eigenvectors or the singular vectors.

Next, we will introduce notations related to the tensor theory (Einstein product) as well as some properties that guarantee the proposed generalization.

3 The Einstein product and its properties

Let $\mathbf{I}=\{I_{1},\ldots,I_{N}\}$ and $\mathbf{J}=\{J_{1},\ldots,J_{M}\}$ be two multi-indices, and let $\mathbf{i}=\{i_{1},\ldots,i_{N}\}$ and $\mathbf{j}=\{j_{1},\ldots,j_{M}\}$ be two indices. The index mapping function $\operatorname{ivec}(\mathbf{i},\mathbf{I})=i_{1}+\sum_{k=2}^{N}\left(i_{k}-1\right)\prod_{l=1}^{k-1}I_{l}$ maps the multi-index $\mathbf{i}$ to the corresponding index in the vectorized form of a tensor of size $I_{1}\times\ldots\times I_{N}$. The unfolding, also known as flattening or matricization, is the map $\Psi:\mathbb{R}^{I_{1}\times I_{2}\times\cdots\times I_{N}\times J_{1}\times J_{2}\times\cdots\times J_{M}}\longrightarrow\mathbb{R}^{|\mathbf{I}|\times|\mathbf{J}|},\;\mathcal{A}\mapsto A$, with $A_{ij}=\mathcal{A}_{i_{1}i_{2}\ldots i_{N}j_{1}j_{2}\ldots j_{M}}$, where $i=\operatorname{ivec}(\mathbf{i},\mathbf{I})$ and $j=\operatorname{ivec}(\mathbf{j},\mathbf{J})$. The mapping $\Psi$ is a linear isomorphism, and its inverse is denoted by $\Psi^{-1}$; it allows several concepts of matrix theory to be generalized more easily.
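As an illustration only, the following minimal Python/NumPy sketch (the helper names `ivec`, `unfold`, and `fold` are ours) implements a zero-based version of the index map and of $\Psi$ and $\Psi^{-1}$ through column-major reshaping.

```python
import numpy as np

def ivec(i, I):
    """Zero-based version of ivec: map a multi-index i to its position in
    the vectorization of a tensor with mode sizes I (column-major)."""
    pos, stride = 0, 1
    for ik, Ik in zip(i, I):
        pos += ik * stride
        stride *= Ik
    return pos

def unfold(A, N):
    """Psi: matricize a tensor whose first N modes index rows and whose
    remaining modes index columns (column-major, to match ivec)."""
    I, J = A.shape[:N], A.shape[N:]
    return A.reshape(int(np.prod(I)), int(np.prod(J)), order="F")

def fold(B, shape):
    """Psi^{-1}: fold a matrix back into a tensor of the given full shape."""
    return B.reshape(shape, order="F")
```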

The frontal slice of the $N$-order tensor $\mathcal{A}\in\mathbb{R}^{I_{1}\times\ldots\times I_{N}}$, denoted by $\mathcal{A}^{(i)}$, is the tensor $\mathcal{A}_{:,\ldots,:,i}$ (the last mode is fixed to $i$). A tensor $\mathcal{A}\in\mathbb{R}^{I_{1}\times\ldots\times I_{N}\times J_{1}\times\ldots\times J_{M}}$ is called even if $N=M$ and square if $I_{i}=J_{i}$ for all $i=1,\ldots,N$ [21].

Definition 1 (m-mode product).

[17] Let $\mathcal{X}\in\mathbb{R}^{I_{1}\times\ldots\times I_{M}}$ and $U\in\mathbb{R}^{J\times I_{m}}$. The $m$-mode (matrix) product of $\mathcal{X}$ and $U$ is the tensor of size $I_{1}\times\ldots\times I_{m-1}\times J\times I_{m+1}\times\ldots\times I_{M}$ whose entries are

$$(\mathcal{X}\times_{m}U)_{i_{1}\ldots i_{m-1}ji_{m+1}\ldots i_{M}}=\sum_{i_{m}=1}^{I_{m}}U_{ji_{m}}\mathcal{X}_{i_{1}\ldots i_{M}}. \quad (4)$$
Definition 2 (Einstein product).

[6] Let $\mathcal{X}\in\mathbb{R}^{I_{1}\times\ldots\times I_{M}\times K_{1}\times\ldots\times K_{N}}$ and $\mathcal{Y}\in\mathbb{R}^{K_{1}\times\ldots\times K_{N}\times J_{1}\times\ldots\times J_{M}}$. The Einstein product of the tensors $\mathcal{X}$ and $\mathcal{Y}$ is the tensor of size $I_{1}\times\ldots\times I_{M}\times J_{1}\times\ldots\times J_{M}$ whose elements are defined by

$$\left(\mathcal{X}*_{N}\mathcal{Y}\right)_{i_{1}\ldots i_{M}j_{1}\ldots j_{M}}=\sum_{k_{1}\ldots k_{N}}\mathcal{X}_{i_{1}\ldots i_{M}k_{1}\ldots k_{N}}\mathcal{Y}_{k_{1}\ldots k_{N}j_{1}\ldots j_{M}}. \quad (5)$$
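For illustration, the Einstein product amounts to a single contraction of the trailing $N$ modes of $\mathcal{X}$ with the leading $N$ modes of $\mathcal{Y}$; a minimal NumPy sketch, assuming nothing beyond Definition 2, is given below.

```python
import numpy as np

def einstein_product(X, Y, N):
    """Einstein product *_N: contract the trailing N modes of X with the
    leading N modes of Y."""
    return np.tensordot(X, Y, axes=N)

# Sanity check: for matrices and N = 1 this is the ordinary matrix product.
A, B = np.random.rand(3, 4), np.random.rand(4, 5)
assert np.allclose(einstein_product(A, B, 1), A @ B)
```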

Next, we have some definitions related to the Einstein product.

Definition 3.
  • Let $\mathcal{A}\in\mathbb{R}^{I_{1}\times\ldots\times I_{N}\times J_{1}\times\ldots\times J_{M}}$; the transpose tensor [21] of $\mathcal{A}$, denoted by $\mathcal{A}^{T}$, is the tensor of size $J_{1}\times\ldots\times J_{M}\times I_{1}\times\ldots\times I_{N}$ whose entries are defined by $(\mathcal{A}^{T})_{j_{1}\dots j_{M}i_{1}\dots i_{N}}=\mathcal{A}_{i_{1}\dots i_{N}j_{1}\dots j_{M}}$.

  • $\mathcal{A}$ is a diagonal tensor if all of its entries are zero except for those on its diagonal, denoted as $(\mathcal{A})_{i_{1}\dots i_{N}i_{1}\dots i_{N}}$, for all $1\leq i_{r}\leq\min(I_{r},J_{r})$, $1\leq r\leq N$.

  • The identity tensor, denoted by $\mathcal{I}_{N}\in\mathbb{R}^{I_{1}\times\ldots\times I_{N}\times I_{1}\times\ldots\times I_{N}}$, is a diagonal tensor with only ones on its diagonal.

  • A square tensor $\mathcal{A}\in\mathbb{R}^{I_{1}\times\ldots\times I_{N}\times I_{1}\times\ldots\times I_{N}}$ is called symmetric if $\mathcal{A}^{T}=\mathcal{A}$.

Remark 1.

When there is no risk of confusion, the identity tensor will be denoted simply by $\mathcal{I}$.

Definition 4.

The inner product of tensors $\mathcal{X},\mathcal{Y}\in\mathbb{R}^{I_{1}\times\ldots\times I_{N}}$ is defined by

$$\langle\mathcal{X},\mathcal{Y}\rangle=\sum_{i_{1},\ldots,i_{N}}\mathcal{X}_{i_{1}i_{2}\ldots i_{N}}\mathcal{Y}_{i_{1}i_{2}\ldots i_{N}}. \quad (6)$$

The inner product induces the Frobenius norm as follows

𝒳F=𝒳,𝒳.\|\mathcal{X}\|_{F}=\sqrt{\langle\mathcal{X},\mathcal{X}\rangle}. (7)
Definition 5.

A square $2N$-order tensor $\mathcal{A}$ is invertible (non-singular) if there is a tensor of the same size, denoted by $\mathcal{A}^{-1}$, such that $\mathcal{A}*_{N}\mathcal{A}^{-1}=\mathcal{A}^{-1}*_{N}\mathcal{A}=\mathcal{I}_{N}$. It is unitary if $\mathcal{A}^{T}*_{N}\mathcal{A}=\mathcal{A}*_{N}\mathcal{A}^{T}=\mathcal{I}_{N}$. It is positive semi-definite if $\langle\mathcal{X},\mathcal{A}*_{N}\mathcal{X}\rangle\geq 0$ for all non-zero $\mathcal{X}\in\mathbb{R}^{I_{1}\times\ldots\times I_{N}}$. It is positive definite if the inequality is strict.

An important relationship that is easy to prove is the stability of the Frobenius norm under the Einstein product with a unitary tensor.

Proposition 6.

Let 𝒳I1××IM×J1××JN\mathcal{X}\in\mathbb{R}^{I_{1}\times\ldots\times I_{M}\times J_{1}\times\ldots\times J_{N}} and 𝒰I1××IM×I1××IM\mathcal{U}\in\mathbb{R}^{I_{1}\times\ldots\times I_{M}\times I_{1}\times\ldots\times I_{M}} be a unitary tensor, then

𝒰M𝒳F=𝒳F.\|\mathcal{U}*_{M}\mathcal{X}\|_{F}=\|\mathcal{X}\|_{F}. (8)

The proof is straightforward using the inner product definition.

Proposition 7.

[27] For even order tensors 𝒳,𝒴I1××IN×J1××JM\mathcal{X},\mathcal{Y}\in\mathbb{R}^{I_{1}\times\ldots\times I_{N}\times J_{1}\times\ldots\times J_{M}}, we have

𝒳,𝒴\displaystyle\langle\mathcal{X},\mathcal{Y}\rangle =Tr(𝒳TN𝒴)\displaystyle=\operatorname{Tr}\left(\mathcal{X}^{T}*_{N}\mathcal{Y}\right) (9)
=Tr(𝒴M𝒳T).\displaystyle=\operatorname{Tr}(\mathcal{Y}*_{M}\mathcal{X}^{T}).
Proposition 8.

[20] Given tensors 𝒳I1××IN×K1××KN\mathcal{X}\in\mathbb{R}^{I_{1}\times\ldots\times I_{N}\times K_{1}\times\ldots\times K_{N}}, 𝒴K1××KN×J1××JM\mathcal{Y}\in\mathbb{R}^{K_{1}\times\ldots\times K_{N}\times J_{1}\times\ldots\times J_{M}}, we have

(𝒳N𝒴)T=𝒴TN𝒳T.\left(\mathcal{X}*_{N}\mathcal{Y}\right)^{T}=\mathcal{Y}^{T}*_{N}\mathcal{X}^{T}. (10)

The isomorphism $\Psi$ has some properties that will be useful in the following.

Proposition 9.

[27] Given tensors $\mathcal{X}$ and $\mathcal{Y}$ of appropriate sizes, $\Psi$ is a multiplicative morphism with respect to the Einstein product, i.e., $\Psi(\mathcal{X}*_{N}\mathcal{Y})=\Psi(\mathcal{X})\Psi(\mathcal{Y})$.

It allows us to prove the Einstein Tensor Spectral Theorem.

Theorem 10 (Einstein Tensor Spectral Theorem).

A symmetric tensor is diagonalizable via the Einstein product.

Proof.

The proof uses the isomorphism and its properties (Proposition 9).

Let $\mathcal{X}$ be a symmetric tensor of size $I_{1}\times\ldots\times I_{N}\times I_{1}\times\ldots\times I_{N}$; then $\Psi(\mathcal{X})$ is symmetric, and by the spectral theorem there exists an orthogonal matrix $U$ such that $U^{T}\Psi(\mathcal{X})U=\Lambda$, where $\Lambda$ is a diagonal matrix.
Then $\Psi(\mathcal{X})=U\Lambda U^{T}$, and $\mathcal{X}=\Psi^{-1}(U\Lambda U^{T})=\Psi^{-1}(U)*_{N}\Psi^{-1}(\Lambda)*_{N}\Psi^{-1}(U^{T})=\Psi^{-1}(U)*_{N}\Psi^{-1}(\Lambda)*_{N}\Psi^{-1}(U)^{T}$, where $\Psi^{-1}(U)$ is unitary and $\Psi^{-1}(\Lambda)$ is a diagonal tensor. ∎

The cyclic property of the trace with Einstein product is also verified, which would be needed in the sequel.

Proposition 11 (Cyclic property of the trace).

Given tensors 𝒳I1××IM×K1××KN\mathcal{X}\in\mathbb{R}^{I_{1}\times\ldots\times I_{M}\times K_{1}\times\ldots\times K_{N}}, 𝒴K1××KN×I1××IM,𝒵K1××KN×K1××KN\mathcal{Y}\in\mathbb{R}^{K_{1}\times\ldots\times K_{N}\times I_{1}\times\ldots\times I_{M}},\mathcal{Z}\in\mathbb{R}^{K_{1}\times\ldots\times K_{N}\times K_{1}\times\ldots\times K_{N}}, we have

Tr(𝒳N𝒵N𝒴)=Tr(𝒴M𝒳N𝒵).\operatorname{Tr}\left(\mathcal{X}*_{N}\mathcal{Z}*_{N}\mathcal{Y}\right)=\operatorname{Tr}\left(\mathcal{Y}*_{M}\mathcal{X}*_{N}\mathcal{Z}\right). (11)
Theorem 12.

[20] Let 𝒳I1××IM×K1××KN\mathcal{X}\in\mathbb{R}^{I_{1}\times\ldots\times I_{M}\times K_{1}\times\ldots\times K_{N}}, the Einstein singular value decomposition (E-SVD) of 𝒳\mathcal{X} is defined by

𝒳=𝒰M𝒮N𝒱T,\mathcal{X}=\mathcal{U}*_{M}\mathcal{S}*_{N}\mathcal{V}^{T}, (12)

where 𝒰I1××IM×I1××IM,𝒮I1××IM×K1××KN,𝒱K1××KN×K1××KN\mathcal{U}\in\mathbb{R}^{I_{1}\times\ldots\times I_{M}\times I_{1}\times\ldots\times I_{M}},\mathcal{S}\in\mathbb{R}^{I_{1}\times\ldots\times I_{M}\times K_{1}\times\ldots\times K_{N}},\mathcal{V}\in\mathbb{R}^{K_{1}\times\ldots\times K_{N}\times K_{1}\times\ldots\times K_{N}} with the following properties
𝒰\mathcal{U} and 𝒱\mathcal{V} are unitary, where 𝒰::i1iM,𝒱::j1jN\mathcal{U}_{:\ldots:i_{1}\ldots i_{M}},\mathcal{V}_{:\ldots:j_{1}\ldots j_{N}} are the left and right singular tensors of 𝒳\mathcal{X}, respectively. If N<MN<M, then

𝒮i1iMk1kN={dk1kNif (i1,,iN)=(k1,,kN) and (iN+1,,iM)=(1,,1)0otherwise.\mathcal{S}_{i_{1}\ldots i_{M}k_{1}\ldots k_{N}}=\begin{cases}d_{k_{1}\ldots k_{N}}&\text{if }(i_{1},\ldots,i_{N})=(k_{1},\ldots,k_{N})\text{ and }(i_{N+1},\ldots,i_{M})=(1,\ldots,1)\\ 0&\text{otherwise.}\end{cases}

.
If N=MN=M, then 𝒮i1iMk1kN={dk1kNif (i1,,iM)=(k1,,kM)0otherwise.\mathcal{S}_{i_{1}\ldots i_{M}k_{1}\ldots k_{N}}=\begin{cases}d_{k_{1}\ldots k_{N}}&\text{if }(i_{1},\ldots,i_{M})=(k_{1},\ldots,k_{M})\\ 0&\text{otherwise.}\end{cases}
The numbers dk1kNd_{k_{1}\ldots k_{N}} are the singular values of 𝒳\mathcal{X} with the decreasing order

d1,,1d2,1,,1dK^1,1,,1d1,2,,1d1,K^2,,1dK^1,,K^P0,d_{{1,\ldots,1}}\geq d_{2,1,\ldots,1}\geq\ldots\geq d_{\widehat{K}_{1},1,\ldots,1}\\ \geq d_{1,2,\ldots,1}\geq\ldots\geq d_{1,\widehat{K}_{2},\ldots,1}\geq\ldots\geq d_{\widehat{K}_{1},\ldots,\widehat{K}_{P}}\geq 0,

with P=min(N,L),K^r=min(Ir,Kr),r=1,,PP=\min(N,L),\;\widehat{K}_{r}=\min(I_{r},K_{r}),\;r=1,\ldots,P.
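Since $\Psi$ is a multiplicative isomorphism (Proposition 9), the E-SVD can be obtained in practice from the matrix SVD of the unfolded tensor and folding the factors back. The sketch below assumes the column-major unfolding used earlier; it is an illustration, not a reference implementation.

```python
import numpy as np

def esvd(X, M):
    """E-SVD of X in R^{I1 x ... x IM x K1 x ... x KN} via the unfolding."""
    I, K = X.shape[:M], X.shape[M:]
    Xmat = X.reshape(int(np.prod(I)), int(np.prod(K)), order="F")
    U, s, Vt = np.linalg.svd(Xmat, full_matrices=True)
    S = np.zeros(Xmat.shape)
    S[:len(s), :len(s)] = np.diag(s)
    # Fold the matrix factors back into tensors of the announced sizes.
    U_t = U.reshape(I + I, order="F")
    S_t = S.reshape(I + K, order="F")
    V_t = Vt.T.reshape(K + K, order="F")
    return U_t, S_t, V_t   # X = U_t *_M S_t *_N V_t^T after unfolding
```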

We define the eigenvalues and eigen-tensors of a tensor with the following.

Definition 13.

[28] Let a square 2-N order tensors 𝒜,I1××IN×I1××IN\mathcal{A},\mathcal{B}\in\mathbb{R}^{I_{1}\times\ldots\times I_{N}\times I_{1}\times\ldots\times I_{N}}, then

  • Tensor Eigenvalue problem: If there is a non null 𝒳I1××IN\mathcal{X}\in\mathbb{R}^{I_{1}\times\ldots\times I_{N}}, and λ\lambda\in\mathbb{R} such that 𝒜N𝒳=λ𝒳\mathcal{A}*_{N}\mathcal{X}=\lambda\mathcal{X}, then 𝒳\mathcal{X} is called an eigen-tensor of 𝒜\mathcal{A}, and λ\lambda is the corresponding eigenvalue.

  • Tensor generalized Eigenvalue problem: If there is a non null 𝒳I1××IN\mathcal{X}\in\mathbb{R}^{I_{1}\times\ldots\times I_{N}}, and λ\lambda\in\mathbb{R} such that 𝒜N𝒳=λN𝒳\mathcal{A}*_{N}\mathcal{X}=\lambda\mathcal{B}*_{N}\mathcal{X}, then 𝒳\mathcal{X} is called an eigen-tensor of the pair {𝒜,}\{\mathcal{A},\mathcal{B}\}, and λ\lambda is the corresponding eigenvalue.

Remark 2.

If N=1N=1, the two definitions above coincide with the eigenvalue and generalized eigenvalue problems, respectively.

We can also show a relationship between the singular values and the eigenvalues of a tensor.

Proposition 14.

Let the E-SVD of 𝒳I1××IM×K1××KN\mathcal{X}\in\mathbb{R}^{I_{1}\times\ldots\times I_{M}\times K_{1}\times\ldots\times K_{N}}, defined as 𝒳=𝒰M𝒮N𝒱T\mathcal{X}=\mathcal{U}*_{M}\mathcal{S}*_{N}\mathcal{V}^{T}, then

  • The eigenvalues of 𝒳N𝒳T\mathcal{X}*_{N}\mathcal{X}^{T} and 𝒳T×M𝒳\mathcal{X}^{T}\times_{M}\mathcal{X} are the squared singular values of 𝒳\mathcal{X}.

  • The eigen-tensors of 𝒳N𝒳T\mathcal{X}*_{N}\mathcal{X}^{T} are the left singular tensors of 𝒳\mathcal{X}.

  • The eigen-tensors of 𝒳TM𝒳\mathcal{X}^{T}*_{M}\mathcal{X} are the right singular tensors of 𝒳\mathcal{X}.

The proof is straightforward.

To simplify matters, we’ll denote the dd eigen-tensors of a tensor, associated with the smallest eigenvalues, as the smallest dd eigen-tensors. Similarly, we will apply the same principle to the largest dd eigen-tensors. This terminology also extends to the left or right singular tensors.

Remark 3.

To generalize the notion of left and right inverse to non-square tensors, a tensor $\mathcal{A}$ is called left or right $\Psi$-invertible if $\Psi(\mathcal{A})$ is left or right invertible, respectively. To avoid confusion, we will sometimes denote $\Psi$ by $\Psi_{j}$ to represent the transformation of a tensor $\mathcal{A}\in\mathbb{R}^{I_{1}\times\ldots\times I_{N}}$ into a matrix in $\mathbb{R}^{(\prod_{k=1}^{j}I_{k})\times(\prod_{k=j+1}^{N}I_{k})}$.

Proposition 15.
  1. A square $2N$-order symmetric tensor $\mathcal{X}$ is positive semi-definite (respectively positive definite) if and only if there is a tensor (respectively an invertible tensor) $\mathcal{B}$ of the same size such that $\mathcal{X}=\mathcal{B}*_{N}\mathcal{B}^{T}$.

  2. Let $\mathcal{X}\in\mathbb{R}^{I_{1}\times\ldots\times I_{N}}$ be a tensor of order $N$, with its transpose in $\mathbb{R}^{I_{N-j+1}\times\ldots\times I_{N}\times I_{1}\times\ldots\times I_{N-j}}$; then $\mathcal{X}*_{j}\mathcal{X}^{T}$ is positive semi-definite for any $1\leq j\leq N$.
     If, moreover, $\mathcal{X}$ is $\Psi_{j}$-invertible for some $1\leq j\leq N$, then $\mathcal{X}*_{j}\mathcal{X}^{T}$ is positive definite.

  3. The eigenvalues of a square symmetric tensor are real.

  4. Let $M\in\mathbb{R}^{K\times K}$ be a symmetric matrix and $\mathcal{X}\in\mathbb{R}^{I_{1}\times\ldots\times I_{N}\times K}$, with its transpose in $\mathbb{R}^{K\times I_{1}\times\ldots\times I_{N}}$; then $\mathcal{X}\times_{N+1}M*_{1}\mathcal{X}^{T}$ is symmetric.

  5. If $M$ is positive semi-definite, then $\mathcal{X}\times_{N+1}M*_{1}\mathcal{X}^{T}$ is positive semi-definite.

  6. If $M$ is positive definite and $\mathcal{X}$ is $\Psi$-invertible, then $\mathcal{X}\times_{N+1}M*_{1}\mathcal{X}^{T}$ is positive definite.

Proof.

The proof of the first one is straightforward using 9.
Let 𝒴I1×INj\mathcal{Y}\in\mathbb{R}^{I_{1}\times\ldots I_{N-j}}, then 𝒴TNj𝒳j𝒳TNj𝒴=𝒳TNj𝒴F20\mathcal{Y}^{T}*_{N-j}\mathcal{X}*_{j}\mathcal{X}^{T}*_{N-j}\mathcal{Y}=\|\mathcal{X}^{T}*_{N-j}\mathcal{Y}\|_{F}^{2}\geq 0.
The proof of the third is similar to the second one.
Let 𝒜I1××IN×I1××IN\mathcal{A}\in\mathbb{R}^{I_{1}\times\ldots\times I_{N}\times I_{1}\times\ldots\times I_{N}} be a symmetric tensor and non-zero tensor 𝒳I1××IN\mathcal{X}\in\mathbb{R}^{I_{1}\times\ldots\times I_{N}} with 𝒜N𝒳=λ𝒳\mathcal{A}*_{N}\mathcal{X}=\lambda\mathcal{X}, then λT𝒳T=(λ𝒳)T=(𝒜N𝒳)T=𝒳TN𝒜T=𝒳TN𝒜=λT𝒳T\lambda^{T}\mathcal{X}^{T}=(\lambda\mathcal{X})^{T}=(\mathcal{A}*_{N}\mathcal{X})^{T}=\mathcal{X}^{T}*_{N}\mathcal{A}^{T}=\mathcal{X}^{T}*_{N}\mathcal{A}=\lambda^{T}\mathcal{X}^{T}, then λ=λT\lambda=\lambda^{T}, which completes the proof.
We have (𝒳×N+1M1𝒳T)T=(M×1𝒳T)T1𝒳T=(𝒳×N+1MT)1𝒳T\left(\mathcal{X}\times_{N+1}M*_{1}\mathcal{X}^{T}\right)^{T}=(M\times_{1}\mathcal{X}^{T})^{T}*_{1}\mathcal{X}^{T}=(\mathcal{X}\times_{N+1}M^{T})*_{1}\mathcal{X}^{T}, then conclude by the symmetry of MM.
Let MM be a positive semi-definite matrix, then there exist a matrix BB such that M=BBTM=BB^{T}, then

𝒳×N+1M1𝒳T=𝒳×N+1BBT1𝒳T=𝒳^1𝒳^T,\mathcal{X}\times_{N+1}M*_{1}\mathcal{X}^{T}=\mathcal{X}\times_{N+1}BB^{T}*_{1}\mathcal{X}^{T}=\widehat{\mathcal{X}}*_{1}\widehat{\mathcal{X}}^{T},

with 𝒳^=𝒳×N+1B\widehat{\mathcal{X}}=\mathcal{X}\times_{N+1}B, then the result follows.
The last has a similar proof; Using the fact that MM is positive definite, then BB is invertible, and 𝒳\mathcal{X} is invertible, then 𝒳^\widehat{\mathcal{X}} is Ψ\Psi-invertible, and the result follows. ∎

We also have a property that relates the tensor generalized eigenvalue problem with the tensor eigenvalue problem.

Proposition 16.

Let the generalized eigenvalue problem 𝒜N𝒳=λN𝒳\mathcal{A}*_{N}\mathcal{X}=\lambda\mathcal{M}*_{N}\mathcal{X}, with 𝒜,\mathcal{A},\mathcal{M} are a square 2-N order tensor, with \mathcal{M} being invertible, then 𝒳^=N𝒳\widehat{\mathcal{X}}=\mathcal{M}*_{N}\mathcal{X} is a solution of the tensor eigen-problem 𝒜^N𝒳^=λ𝒳^\widehat{\mathcal{A}}*_{N}\widehat{\mathcal{X}}=\lambda\widehat{\mathcal{X}} with 𝒜^=𝒜N1\widehat{\mathcal{A}}=\mathcal{A}*_{N}\mathcal{M}^{-1}.

Theorem 17.

Let a symmetric 𝒳I1××IM×I1××IM\mathcal{X}\in\mathbb{R}^{I_{1}\times\ldots\times I_{M}\times I_{1}\times\ldots\times I_{M}}, and \mathcal{B} a positive definite tensor of same size, then

min𝒫I1××IM×d𝒫TMM𝒫=Tr(𝒫TM𝒳M𝒫),\min_{\begin{subarray}{c}\mathcal{P}\in\mathbb{R}^{I_{1}\times\ldots\times I_{M}\times d}\\ \mathcal{P}^{T}*_{M}\mathcal{B}*_{M}\mathcal{P}=\mathcal{I}\end{subarray}}\operatorname{Tr}(\mathcal{P}^{T}*_{M}\mathcal{X}*_{M}\mathcal{P}),

is equivalent to solve the generalized eigenvalue problem 𝒳M𝒫=λM𝒫\mathcal{X}*_{M}\mathcal{P}=\lambda\mathcal{B}*_{M}\mathcal{P}.

Proof.

Since $\Psi$ is an isomorphism, the problem is equivalent to minimizing $\operatorname{Tr}(P^{T}XP)$ subject to $P^{T}BP=I$, where $P=\Psi(\mathcal{P})$, $X=\Psi(\mathcal{X})$, $B=\Psi(\mathcal{B})$. Here $X$ is symmetric and $B$ is positive definite. The solution of this equivalent problem is given by the eigenvectors associated with the $d$ smallest eigenvalues of the generalized problem $Xu=\lambda Bu$; using the fact that $\Psi^{-1}$ is an isomorphism, we obtain the result.
A second proof without using the isomorphism property is the following.
Let the Lagrangian of the problem be

(𝒫,Λ):=Tr(𝒫TM𝒳M𝒫)Tr(ΛTM(𝒫TMM𝒫)),\mathcal{L}(\mathcal{P},\Lambda):=\operatorname{Tr}(\mathcal{P}^{T}*_{M}\mathcal{X}*_{M}\mathcal{P})-\operatorname{Tr}(\Lambda^{T}*_{M}(\mathcal{P}^{T}*_{M}\mathcal{B}*_{M}\mathcal{P}-\mathcal{I})),

with $\Lambda\in\mathbb{R}^{d\times d}$ the Lagrange multiplier. The KKT conditions are

$$\dfrac{\partial\mathcal{L}}{\partial\mathcal{P}}=0,\qquad \dfrac{\partial\mathcal{L}}{\partial\Lambda}=0\;\Longrightarrow\;\mathcal{P}^{T}*_{M}\mathcal{B}*_{M}\mathcal{P}-\mathcal{I}=0.$$

To compute the partial derivative with respect to 𝒫\mathcal{P}, we introduce the functions f1(𝒫)f_{1}(\mathcal{P}) and f2(𝒫)f_{2}(\mathcal{P}), defined as follows

f1(𝒫)=Tr(𝒫TM𝒳M𝒫),f_{1}(\mathcal{P})=\operatorname{Tr}(\mathcal{P}^{T}*_{M}\mathcal{X}*_{M}\mathcal{P}),
f2(𝒫)=Tr(ΛTM(𝒫TMM𝒫)).f_{2}(\mathcal{P})=\operatorname{Tr}\left(\Lambda^{T}*_{M}(\mathcal{P}^{T}*_{M}\mathcal{B}*_{M}\mathcal{P}-\mathcal{I})\right).

Subsequently, we aim to determine the partial derivative.

f1(𝒫+ε)\displaystyle f_{1}(\mathcal{P}+\varepsilon\mathcal{H}) =Tr((𝒫+ε)TM𝒳M(𝒫+ε))\displaystyle=\operatorname{Tr}\left(\left(\mathcal{P}+\varepsilon\mathcal{H})^{T}*_{M}\mathcal{X}*_{M}(\mathcal{P}+\varepsilon\mathcal{H}\right)\right)
=Tr(𝒫TM𝒳M𝒫+εTM𝒳M𝒫+ε𝒫TM𝒳M)\displaystyle=\operatorname{Tr}\left(\mathcal{P}^{T}*_{M}\mathcal{X}*_{M}\mathcal{P}+\varepsilon\mathcal{H}^{T}*_{M}\mathcal{X}*_{M}\mathcal{P}+\varepsilon\mathcal{P}^{T}*_{M}\mathcal{X}*_{M}\mathcal{H}\right)
+Tr(ε2TM𝒳M)\displaystyle\qquad+\operatorname{Tr}(\varepsilon^{2}\mathcal{H}^{T}*_{M}\mathcal{X}*_{M}\mathcal{H})
=Tr(𝒫TM𝒳M𝒫)+εTr(TM𝒳M𝒫)\displaystyle=\operatorname{Tr}(\mathcal{P}^{T}*_{M}\mathcal{X}*_{M}\mathcal{P})+\varepsilon\operatorname{Tr}(\mathcal{H}^{T}*_{M}\mathcal{X}*_{M}\mathcal{P})
+εTr(𝒫TM𝒳M)+ε2Tr(TM𝒳M)\displaystyle\qquad+\varepsilon\operatorname{Tr}(\mathcal{P}^{T}*_{M}\mathcal{X}*_{M}\mathcal{H})+\varepsilon^{2}\operatorname{Tr}(\mathcal{H}^{T}*_{M}\mathcal{X}*_{M}\mathcal{H})
=f1(𝒫)+ε[Tr(TM𝒳M𝒫)+Tr(𝒫TM𝒳M)]+o(ε)\displaystyle=f_{1}(\mathcal{P})+\varepsilon\left[\operatorname{Tr}(\mathcal{H}^{T}*_{M}\mathcal{X}*_{M}\mathcal{P})+\operatorname{Tr}(\mathcal{P}^{T}*_{M}\mathcal{X}*_{M}\mathcal{H})\right]+o(\varepsilon)
=f1(𝒫)+εTr(TM(𝒳+𝒳T)M𝒫)+o(ε).\displaystyle=f_{1}(\mathcal{P})+\varepsilon\operatorname{Tr}\left(\mathcal{H}^{T}*_{M}(\mathcal{X}+\mathcal{X}^{T})*_{M}\mathcal{P}\right)+o(\varepsilon).

Then, as 𝒳\mathcal{X} is symmetric, the partial derivative in the direction \mathcal{H} is

f1𝒫()\displaystyle\dfrac{\partial f_{1}}{\partial\mathcal{P}}(\mathcal{H}) =limε0f1(𝒫+ε)f1(𝒫)ε\displaystyle=\lim_{\varepsilon\rightarrow 0}\dfrac{f_{1}(\mathcal{P}+\varepsilon\mathcal{H})-f_{1}(\mathcal{P})}{\varepsilon}
=2Tr(TM𝒳M𝒫).\displaystyle=2\operatorname{Tr}(\mathcal{H}^{T}*_{M}\mathcal{X}*_{M}\mathcal{P}).

It gives us the partial derivative of f1f_{1} with respect to 𝒫\mathcal{P} as

f1𝒫=2𝒳M𝒫.\dfrac{\partial f_{1}}{\partial\mathcal{P}}=2\mathcal{X}*_{M}\mathcal{P}.

For the second function, we have

$f_{2}(\mathcal{P}+\varepsilon\mathcal{H}) =\operatorname{Tr}\left(\Lambda^{T}*_{M}\left((\mathcal{P}+\varepsilon\mathcal{H})^{T}*_{M}\mathcal{B}*_{M}(\mathcal{P}+\varepsilon\mathcal{H})-\mathcal{I}\right)\right)$
$=\operatorname{Tr}\left(\Lambda^{T}*_{M}(\mathcal{P}^{T}*_{M}\mathcal{B}*_{M}\mathcal{P}-\mathcal{I})\right)+\varepsilon\operatorname{Tr}\left(\Lambda^{T}*_{M}(\mathcal{H}^{T}*_{M}\mathcal{B}*_{M}\mathcal{P}+\mathcal{P}^{T}*_{M}\mathcal{B}*_{M}\mathcal{H})\right)+\varepsilon^{2}\operatorname{Tr}\left(\Lambda^{T}*_{M}\mathcal{H}^{T}*_{M}\mathcal{B}*_{M}\mathcal{H}\right)$
$=f_{2}(\mathcal{P})+\varepsilon\operatorname{Tr}\left(\Lambda^{T}*_{M}(\mathcal{H}^{T}*_{M}\mathcal{B}*_{M}\mathcal{P}+\mathcal{P}^{T}*_{M}\mathcal{B}*_{M}\mathcal{H})\right)+o(\varepsilon)$
$=f_{2}(\mathcal{P})+\varepsilon\operatorname{Tr}\left(\mathcal{H}^{T}*_{M}\left[\mathcal{B}*_{M}\mathcal{P}*_{M}\Lambda^{T}+\mathcal{B}^{T}*_{M}\mathcal{P}*_{M}\Lambda\right]\right)+o(\varepsilon).$

We used the cyclic property 11 and the transpose with trace property 9 in the last equality. Then, as \mathcal{B} is symmetric, the partial derivative in the direction \mathcal{H} is

f2𝒫()\displaystyle\dfrac{\partial f_{2}}{\partial\mathcal{P}}(\mathcal{H}) =limε0f2(𝒫+ε)f2(𝒫)ε\displaystyle=\lim_{\varepsilon\rightarrow 0}\dfrac{f_{2}(\mathcal{P}+\varepsilon\mathcal{H})-f_{2}(\mathcal{P})}{\varepsilon}
=Tr(TMM𝒫M(Λ+ΛT)).\displaystyle=\operatorname{Tr}\left(\mathcal{H}^{T}*_{M}\mathcal{B}*_{M}\mathcal{P}*_{M}(\Lambda+\Lambda^{T})\right).

This yields the partial derivative of \mathcal{L} with respect to 𝒫\mathcal{P} as follows

f2𝒫=M𝒫M(Λ+ΛT).\dfrac{\partial f_{2}}{\partial\mathcal{P}}=\mathcal{B}*_{M}\mathcal{P}*_{M}(\Lambda+\Lambda^{T}).

Subsequently,

$\dfrac{\partial\mathcal{L}}{\partial\mathcal{P}}=0 \iff 2\mathcal{X}*_{M}\mathcal{P}-\mathcal{B}*_{M}\mathcal{P}*_{M}(\Lambda+\Lambda^{T})=0$
$\iff\mathcal{X}*_{M}\mathcal{P}=\mathcal{B}*_{M}\mathcal{P}*_{M}\dfrac{1}{2}(\Lambda+\Lambda^{T})$
$\iff\mathcal{X}*_{M}\mathcal{P}=\mathcal{B}*_{M}\mathcal{P}*_{M}\widehat{\Lambda}$
𝒳M𝒫=M𝒫M𝒬TM𝒟M𝒬\displaystyle\iff\mathcal{X}*_{M}\mathcal{P}=\mathcal{B}*_{M}\mathcal{P}*_{M}\mathcal{Q}^{T}*_{M}\mathcal{D}*_{M}\mathcal{Q}
𝒳M𝒫M𝒬T=M𝒫M𝒬TM𝒟\displaystyle\iff\mathcal{X}*_{M}\mathcal{P}*_{M}\mathcal{Q}^{T}=\mathcal{B}*_{M}\mathcal{P}*_{M}\mathcal{Q}^{T}*_{M}\mathcal{D}
𝒳M𝒫^=M𝒫^M𝒟.\displaystyle\iff\mathcal{X}*_{M}\widehat{\mathcal{P}}=\mathcal{B}*_{M}\widehat{\mathcal{P}}*_{M}\mathcal{D}.

The third line sets $\widehat{\Lambda}=\frac{1}{2}(\Lambda+\Lambda^{T})$, which is symmetric, thus diagonalizable (Theorem 10), i.e., $\widehat{\Lambda}=\mathcal{Q}^{T}*_{M}\mathcal{D}*_{M}\mathcal{Q}$. The last two lines are justified by the fact that $\widehat{\mathcal{P}}=\mathcal{P}*_{M}\mathcal{Q}$ also satisfies $\widehat{\mathcal{P}}^{T}*_{M}\mathcal{B}*_{M}\widehat{\mathcal{P}}=\mathcal{I}$, concluding the proof. ∎

A corollary of this theorem can be deduced.

Corollary 18.

Let 𝒳I1××IM×K1××KN\mathcal{X}\in\mathbb{R}^{I_{1}\times\ldots\times I_{M}\times K_{1}\times\ldots\times K_{N}}, the solution of

argmin𝒫I1××IM×d𝒫TM𝒫=𝒫TM𝒳F2.\arg\min_{\begin{subarray}{c}\mathcal{P}\in\mathbb{R}^{I_{1}\times\ldots\times I_{M}\times d}\\ \mathcal{P}^{T}*_{M}\mathcal{P}=\mathcal{I}\end{subarray}}\|\mathcal{P}^{T}*_{M}\mathcal{X}\|_{F}^{2}.

is the dd smallest left singular tensors of 𝒳\mathcal{X}.

Proof.

We have $\|\mathcal{P}^{T}*_{M}\mathcal{X}\|_{F}^{2}=\operatorname{Tr}\left(\mathcal{P}^{T}*_{M}\mathcal{X}*_{N}\mathcal{X}^{T}*_{M}\mathcal{P}\right)$. Theorem 17 shows that the problem is equivalent to solving $\mathcal{X}*_{N}\mathcal{X}^{T}*_{M}\mathcal{P}=\lambda\mathcal{P}$, i.e., finding the $d$ smallest eigen-tensors of $\mathcal{X}*_{N}\mathcal{X}^{T}$, which correspond exactly to the $d$ smallest left singular tensors of $\mathcal{X}$ (Proposition 14). ∎

4 Multidimensional reduction

In this section, we present a generalized approach to DR methods using the Einstein product.

Given a tensor $\mathcal{X}\in\mathbb{R}^{I_{1}\times\ldots\times I_{M}\times n}$, our objective is to derive a low-dimensional representation $\mathcal{Y}\in\mathbb{R}^{d\times n}$ of $\mathcal{X}$. This involves defining a mapping $\Phi:\mathbb{R}^{I_{1}\times\ldots\times I_{M}}\longrightarrow\mathbb{R}^{d}$.

First, we discuss the determination of the weight matrix, which can be computed in various ways, one common method is using the Gaussian kernel

Wi,j=exp(𝒳(i)𝒳(j)F2σ2).W_{i,j}=\exp\left(-\frac{\left\|\mathcal{X}^{(i)}-\mathcal{X}^{(j)}\right\|_{F}^{2}}{\sigma^{2}}\right).

Additionally, introducing a threshold can yield a sparse matrix (Gaussian-threshold). We also explore another method later in this section, which utilizes the reconstruction error.
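A minimal sketch of this weight construction (Gaussian and Gaussian-threshold), assuming the samples are stacked along the last mode of the data tensor, is given below.

```python
import numpy as np

def gaussian_weights(X, sigma=1.0, threshold=None):
    """X has shape (I1, ..., IM, n); returns the n x n similarity matrix W."""
    n = X.shape[-1]
    flat = X.reshape(-1, n)                       # one column per sample
    d2 = ((flat[:, :, None] - flat[:, None, :]) ** 2).sum(axis=0)
    W = np.exp(-d2 / sigma**2)
    if threshold is not None:
        W[W < threshold] = 0.0                    # Gaussian-threshold variant
    return W
```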

Next, we introduce our proposed methods.

Linear methods: The linear methods can be written as 𝒴=𝒫TM𝒳\mathcal{Y}=\mathcal{P}^{T}*_{M}\mathcal{X}. It is sufficient to find the projection matrix 𝒫I1××IM×d\mathcal{P}\in\mathbb{R}^{I_{1}\times\ldots\times I_{M}\times d}.
Higher-order PCA based on the Einstein product [11] is the natural extension of PCA to higher-order tensors. It extends eigenface-style PCA to color images modeled by a fourth-order tensor: pixels (height and width) are vectorized for each color channel (RGB) to obtain a third-order tensor, and the E-SVD of the centered tensor gives the eigenfaces.
This vectorization is not natural, since it omits the spatial information. The proposed approach avoids the vectorization of the first step by using the tensor directly, and seeks a solution of the following problem

argmax𝒫I1××IM×d𝒫TM𝒫=𝒴=𝒫TM𝒳ΦPCA(𝒴)\displaystyle\arg\max_{\begin{subarray}{c}\mathcal{P}\in\mathbb{R}^{I_{1}\times\ldots\times I_{M}\times d}\\ \mathcal{P}^{T}*_{M}\mathcal{P}=\mathcal{I}\\ \mathcal{Y}=\mathcal{P}^{T}*_{M}\mathcal{X}\end{subarray}}\Phi_{PCA}(\mathcal{Y}) :=i𝒴(i)1nj𝒴(j)F2.\displaystyle:=\sum_{i}\left\|\mathcal{Y}^{(i)}-\dfrac{1}{n}\sum_{j}\mathcal{Y}^{(j)}\right\|_{F}^{2}. (13)

The objective function can be written as

ΦPCA(𝒴)\displaystyle\Phi_{PCA}(\mathcal{Y}) =i𝒫TM(𝒳(i)1nj𝒳(j))F2\displaystyle=\sum_{i}\left\|\mathcal{P}^{T}*_{M}(\mathcal{X}^{(i)}-\dfrac{1}{n}\sum_{j}\mathcal{X}^{(j)})\right\|_{F}^{2}
=i𝒫TM(𝒳(i)𝒬(i))F2\displaystyle=\sum_{i}\left\|\mathcal{P}^{T}*_{M}(\mathcal{X}^{(i)}-\mathcal{Q}^{(i)})\right\|_{F}^{2}
=𝒫TM(𝒳𝒬)F2,\displaystyle=\left\|\mathcal{P}^{T}*_{M}(\mathcal{X}-\mathcal{Q})\right\|_{F}^{2},

with 𝒬I1××IM×n\mathcal{Q}\in\mathbb{R}^{I_{1}\times\ldots\times I_{M}\times n}, where 𝒬(i)=1nj𝒳(j)\mathcal{Q}^{(i)}=\frac{1}{n}\sum_{j}\mathcal{X}^{(j)} represents the mean, the solution of (13) is the largest dd left singular tensors of the centered tensor 𝒳𝒬=𝒳×M+1H\mathcal{X}-\mathcal{Q}=\mathcal{X}\times_{M+1}H.

Since the feature dimension is typically larger than the number of data points nn, computing the E-SVD of 𝒳×M+1H\mathcal{X}\times_{M+1}H can be computationally expensive. It’s preferable to have a runtime that depends on nn instead. To achieve this, we transform the equation 𝒳×M+1H1𝒳TM𝒫=λ𝒫\mathcal{X}\times_{M+1}H*_{1}\mathcal{X}^{T}*_{M}\mathcal{P}=\lambda\mathcal{P} to (𝒳TM𝒳×MH)1𝐳=λ𝐳(\mathcal{X}^{T}*_{M}\mathcal{X}\times_{M}H)*_{1}\mathbf{z}=\lambda\mathbf{z}, with 𝐳=𝒳TM𝒫\mathbf{z}=\mathcal{X}^{T}*_{M}\mathcal{P}. This allows us to find the eigenvectors of a square matrix of size nn. The projected data 𝒴\mathcal{Y} would be these vectors reshaped to the appropriate size.
The algorithm below shows the steps of PCA via the Einstein product.

Algorithm 1 PCA-Einstein

Input: 𝒳\mathcal{X} (Data) dd(dimension output).
      Output: 𝒫\mathcal{P} (Projection space).

1:Compute 𝒬\mathcal{Q}. \triangleright The mean tensor
2:Compute the largest $d$ left singular tensors of $\mathcal{Z}=\mathcal{X}-\mathcal{Q}$.
3:Combine these tensors to get 𝒫\mathcal{P}.
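For illustration, a minimal NumPy sketch of Algorithm 1 through the unfolding $\Psi$ is given below; for clarity it uses the direct SVD of the centered, matricized data rather than the $n\times n$ reformulation discussed above.

```python
import numpy as np

def pca_einstein(X, d):
    """X has shape (I1, ..., IM, n); returns P of shape (I1, ..., IM, d)."""
    feat_shape, n = X.shape[:-1], X.shape[-1]
    Xmat = X.reshape(-1, n, order="F")
    Xmat = Xmat - Xmat.mean(axis=1, keepdims=True)     # centering, X - Q
    U, _, _ = np.linalg.svd(Xmat, full_matrices=False)
    return U[:, :d].reshape(feat_shape + (d,), order="F")

# The projected data Y = P^T *_M X is then a d x n array after unfolding.
```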

4.1 Generalization of ONPP

Given a weight matrix Wn×nW\in\mathbb{R}^{n\times n}, the objective function is to resolve

argmin𝒫I1××IM×d𝒫TM𝒫=𝒴=𝒫TM𝒳ΦONPP(𝒴)\displaystyle\arg\min_{\begin{subarray}{c}\mathcal{P}\in\mathbb{R}^{I_{1}\times\ldots\times I_{M}\times d}\\ \mathcal{P}^{T}*_{M}\mathcal{P}=\mathcal{I}\\ \mathcal{Y}=\mathcal{P}^{T}*_{M}\mathcal{X}\end{subarray}}\Phi_{ONPP}(\mathcal{Y}) :=i𝒴(i)jwi,j𝒴(j)F2.\displaystyle:=\sum_{i}\left\|\mathcal{Y}^{(i)}-\sum_{j}w_{i,j}\mathcal{Y}^{(j)}\right\|_{F}^{2}. (14)

The objective function can be written as

ΦONPP(𝒴)\displaystyle\Phi_{ONPP}(\mathcal{Y}) =ijδi,j𝒴(j)jwi,j𝒴(j)F2\displaystyle=\sum_{i}\left\|\sum_{j}\delta_{i,j}\mathcal{Y}^{(j)}-\sum_{j}w_{i,j}\mathcal{Y}^{(j)}\right\|_{F}^{2}
=ij(δi,jwi,j)𝒴(j)F2\displaystyle=\sum_{i}\left\|\sum_{j}(\delta_{i,j}-w_{i,j})\mathcal{Y}^{(j)}\right\|_{F}^{2}
=ij(InW)i,j𝒴(j)F2\displaystyle=\sum_{i}\left\|\sum_{j}(I_{n}-W)_{i,j}\mathcal{Y}^{(j)}\right\|_{F}^{2}
=i(𝒴×M+1(InW))(i)F2\displaystyle=\sum_{i}\left\|\left(\mathcal{Y}\times_{M+1}(I_{n}-W)\right)^{(i)}\right\|_{F}^{2}
=𝒴×M+1(InW)F2\displaystyle=\left\|\mathcal{Y}\times_{M+1}(I_{n}-W)\right\|_{F}^{2}
=(𝒫TM𝒳)×M+1(InW)F2\displaystyle=\left\|(\mathcal{P}^{T}*_{M}\mathcal{X})\times_{M+1}(I_{n}-W)\right\|_{F}^{2}
=𝒫TM(𝒳×M+1(InW))F2.\displaystyle=\left\|\mathcal{P}^{T}*_{M}\left(\mathcal{X}\times_{M+1}\left(I_{n}-W\right)\right)\right\|_{F}^{2}.

Using Corollary 18, the solution of (14) is the smallest $d$ left singular tensors of $\mathcal{X}\times_{M+1}(I_{n}-W)$.
The algorithm below shows the steps of ONPP via the Einstein product.

Algorithm 2 ONPP-Einstein

Input: 𝒳\mathcal{X} (Data) dd (subspace dimension).
      Output: 𝒫\mathcal{P} (Projection space).

1:Compute WW. \triangleright Using the appropriate method
2:Compute the smallest dd left singular tensors of 𝒵=𝒳×M+1(IW)\mathcal{Z}=\mathcal{X}\times_{M+1}(I-W).
3:Combine these tensors to get 𝒫\mathcal{P}.
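A minimal sketch of Algorithm 2 through the unfolding, assuming the weight matrix $W$ has already been computed, is given below.

```python
import numpy as np

def onpp_einstein(X, W, d):
    """Smallest d left singular tensors of X x_{M+1} (I_n - W)."""
    feat_shape, n = X.shape[:-1], X.shape[-1]
    Xmat = X.reshape(-1, n, order="F")
    Z = Xmat @ (np.eye(n) - W).T          # unfolding of X x_{M+1} (I_n - W)
    U, s, _ = np.linalg.svd(Z, full_matrices=False)
    idx = np.argsort(s)[:d]               # indices of the d smallest singular values
    return U[:, idx].reshape(feat_shape + (d,), order="F")
```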

4.1.1 Multi-weight ONPP

In this section, we propose a generalization of ONPP in which multiple weight matrices are employed along the $I_{M}$ mode. We denote by $\mathcal{W}\in\mathbb{R}^{n\times n\times I_{M}}$ the weight tensor. Let $\mathcal{Y}^{(i)}_{r}$ denote the tensor $\mathcal{Y}$ with its last two indices fixed to $(r,i)$, i.e., $\mathcal{Y}_{:,\ldots,r,i}$; similarly, $\mathcal{X}^{(i)}_{r}$ denotes $\mathcal{X}_{:,\ldots,r,i}$. We assume that the $r$-th frontal slice of the weight tensor is constructed only from the $r$-th frontal slice of the data tensor.
The objective function is

argmin𝒫I1××IM×d𝒫TM𝒫=𝒴=𝒫TM𝒳ΦONPPMW(𝒴)\displaystyle\arg\min_{\begin{subarray}{c}\mathcal{P}\in\mathbb{R}^{I_{1}\times\ldots\times I_{M}\times d}\\ \mathcal{P}^{T}*_{M}\mathcal{P}=\mathcal{I}\\ \mathcal{Y}=\mathcal{P}^{T}*_{M}\mathcal{X}\end{subarray}}\Phi_{ONPP_{MW}}(\mathcal{Y}) :=i,r𝒴r(i)j𝒲i,j(r)𝒴r(j)F2.\displaystyle:=\sum_{i,r}\left\|\mathcal{Y}^{(i)}_{r}-\sum_{j}\mathcal{W}^{(r)}_{i,j}\mathcal{Y}^{(j)}_{r}\right\|_{F}^{2}. (15)

Utilizing the independence of the frontal slices $\mathcal{Y}^{(i)}_{r}$ across $r$, we can divide the objective function into $I_{M}$ independent objective functions, and the solution is obtained by concatenating the solutions of each of them. The $r$-th objective function can be written as

$$\arg\min_{\begin{subarray}{c}\mathcal{P}_{:,\ldots,r,:}\in\mathbb{R}^{I_{1}\times\ldots\times I_{M}\times d}\\ \mathcal{P}_{:,\ldots,r,:}^{T}*_{M}\mathcal{P}_{:,\ldots,r,:}=\mathcal{I}\\ \mathcal{Y}_{:,\ldots,r,:}=\mathcal{P}_{:,\ldots,r,:}^{T}*_{M}\mathcal{X}_{:,\ldots,r,:}\end{subarray}} \sum_{i}\left\|\mathcal{Y}^{(i)}_{r}-\sum_{j}\mathcal{W}^{(r)}_{i,j}\mathcal{Y}^{(j)}_{r}\right\|_{F}^{2}. \quad (16)$$

The solution of this objective function is the smallest dd left singular tensors of
𝒳:,,r,:×M+1(In𝒲(r))\mathcal{X}_{:,\ldots,r,:}\times_{M+1}(I_{n}-\mathcal{W}^{(r)}). The solution of the original problem is obtained by concatenating the solutions of each objective function.

4.2 Generalization of OLPP

Given a weight matrix Wn×nW\in\mathbb{R}^{n\times n}, the optimization problem to solve is

argmin𝒫I1××IM×d𝒫TM𝒫=ΦOLPP(𝒴)\displaystyle\arg\min_{\begin{subarray}{c}\mathcal{P}\in\mathbb{R}^{I_{1}\times\ldots\times I_{M}\times d}\\ \mathcal{P}^{T}*_{M}\mathcal{P}=\mathcal{I}\end{subarray}}\Phi_{OLPP}(\mathcal{Y}) :=12i,jwi,j𝒴(i)𝒴(j)F2.\displaystyle:=\dfrac{1}{2}\sum_{i,j}w_{i,j}\left\|\mathcal{Y}^{(i)}-\mathcal{Y}^{(j)}\right\|_{F}^{2}. (17)

The objective function can be written as

$\Phi_{OLPP}(\mathcal{Y}) =\dfrac{1}{2}\sum_{i,j}w_{i,j}\left(\left\|\mathcal{Y}^{(i)}\right\|_{F}^{2}+\left\|\mathcal{Y}^{(j)}\right\|_{F}^{2}-2\langle\mathcal{Y}^{(i)},\mathcal{Y}^{(j)}\rangle\right)$
$=\dfrac{1}{2}\sum_{i}d_{i}\left\|\mathcal{Y}^{(i)}\right\|_{F}^{2}+\dfrac{1}{2}\sum_{j}d_{j}\left\|\mathcal{Y}^{(j)}\right\|_{F}^{2}-\sum_{i,j}w_{i,j}\langle\mathcal{Y}^{(i)},\mathcal{Y}^{(j)}\rangle$
$=\sum_{i}d_{i}\left\|\mathcal{Y}^{(i)}\right\|_{F}^{2}-\sum_{i,j}w_{i,j}\langle\mathcal{Y}^{(i)},\mathcal{Y}^{(j)}\rangle$
=i,jdi,j𝒴(i),𝒴(j)i,jwi,j𝒴(i),𝒴(j)\displaystyle=\sum_{i,j}d_{i,j}\langle\mathcal{Y}^{(i)},\mathcal{Y}^{(j)}\rangle-\sum_{i,j}w_{i,j}\langle\mathcal{Y}^{(i)},\mathcal{Y}^{(j)}\rangle
=i,jLi,j𝒴(i),𝒴(j)\displaystyle=\sum_{i,j}L_{i,j}\langle\mathcal{Y}^{(i)},\mathcal{Y}^{(j)}\rangle
=𝒴M+1L,𝒴\displaystyle=\langle\mathcal{Y}*_{M+1}L,\mathcal{Y}\rangle
=Tr(𝒫TM(𝒳×M+1L1𝒳T)M𝒫),\displaystyle=\operatorname{Tr}\left(\mathcal{P}^{T}*_{M}(\mathcal{X}\times_{M+1}L*_{1}\mathcal{X}^{T})*_{M}\mathcal{P}\right),

where LL is the Laplacian matrix corresponding to WW.
Using Theorem 17, the solution is the smallest $d$ eigen-tensors of the symmetric tensor (Proposition 15) $\mathcal{X}\times_{M+1}L*_{1}\mathcal{X}^{T}$.
The algorithm below shows the steps of OLPP via the Einstein product.

Algorithm 3 OLPP-Einstein

Input: 𝒳\mathcal{X} (Data) dd (subspace dimension).
      Output: 𝒫\mathcal{P} (Projection space).

1:Compute LL. \triangleright Using the appropriate method
2:Compute the smallest dd eigen-tensors of 𝒳×M+1L1𝒳T\mathcal{X}\times_{M+1}L*_{1}\mathcal{X}^{T}.
3:Combine these tensors to get 𝒫\mathcal{P}.
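A minimal sketch of Algorithm 3 through the unfolding is given below; it forms the (possibly large) matricized tensor $\mathcal{X}\times_{M+1}L*_{1}\mathcal{X}^{T}$ explicitly, so it is only meant for moderate feature dimensions.

```python
import numpy as np

def olpp_einstein(X, W, d):
    """Smallest d eigen-tensors of X x_{M+1} L *_1 X^T, with L = D - W."""
    feat_shape, n = X.shape[:-1], X.shape[-1]
    L = np.diag(W.sum(axis=1)) - W
    Xmat = X.reshape(-1, n, order="F")
    S = Xmat @ L @ Xmat.T                 # unfolding of X x_{M+1} L *_1 X^T
    _, vecs = np.linalg.eigh(S)           # eigenvalues in ascending order
    return vecs[:, :d].reshape(feat_shape + (d,), order="F")
```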

4.2.1 Multi-weight OLPP

In this section, we propose a generalization of OLPP, where multiple weights are utilized on the IMI_{M} mode. The objective function is to solve

argmin𝒫I1××IM×d𝒫TM𝒫=𝒴=𝒫TM𝒳ΦOLPPMW(𝒴)\displaystyle\arg\min_{\begin{subarray}{c}\mathcal{P}\in\mathbb{R}^{I_{1}\times\ldots\times I_{M}\times d}\\ \mathcal{P}^{T}*_{M}\mathcal{P}=\mathcal{I}\\ \mathcal{Y}=\mathcal{P}^{T}*_{M}\mathcal{X}\end{subarray}}\Phi_{OLPP_{MW}}(\mathcal{Y}) :=12i,j,r𝒲i,j(r)𝒴r(i)𝒴r(j)F2.\displaystyle:=\dfrac{1}{2}\sum_{i,j,r}\mathcal{W}^{(r)}_{i,j}\left\|\mathcal{Y}_{r}^{(i)}-\mathcal{Y}_{r}^{(j)}\right\|_{F}^{2}. (18)

Here, we have IMI_{M} independent objective functions, and the solution is obtained by concatenating the solutions of each objective function.

4.3 Generalization of LPP

LPP is akin to the Laplacian Eigenmap, serving as its linear counterpart. The objective function of LPP solves

argmin𝒫I1××IM×d𝒫TM𝒳×M+1D1𝒳TM𝒫=ΦLPP(𝒴)\displaystyle\arg\min_{\begin{subarray}{c}\mathcal{P}\in\mathbb{R}^{I_{1}\times\ldots\times I_{M}\times d}\\ \mathcal{P}^{T}*_{M}\mathcal{X}\times_{M+1}D*_{1}\mathcal{X}^{T}*_{M}\mathcal{P}=\mathcal{I}\end{subarray}}\Phi_{LPP}(\mathcal{Y}) :=12i,jwi,j𝒴(i)𝒴(j)F2.\displaystyle:=\dfrac{1}{2}\sum_{i,j}w_{i,j}\left\|\mathcal{Y}^{(i)}-\mathcal{Y}^{(j)}\right\|_{F}^{2}. (19)

The solution involves finding the smallest dd eigen-tensor of the generalized eigen-problem

𝒳×M+1L1𝒳TM𝒱=λ𝒳M+1D1𝒳TM𝒱.\mathcal{X}\times_{M+1}L*_{1}\mathcal{X}^{T}*_{M}\mathcal{V}=\lambda\mathcal{X}*_{M+1}D*_{1}\mathcal{X}^{T}*_{M}\mathcal{V}. (20)

Using Proposition 15, we deduce that the tensors $\mathcal{X}\times_{M+1}L*_{1}\mathcal{X}^{T}$ and $\mathcal{X}\times_{M+1}D*_{1}\mathcal{X}^{T}$ are symmetric positive semi-definite. Here, we assume definiteness (although this is generally not guaranteed, especially if the number of sample points is smaller than the product of the feature dimensions). The projection tensor is obtained by concatenating these eigen-tensors and reshaping to the desired dimension.
The algorithm below shows the steps of LPP via the Einstein product.

Algorithm 4 LPP-Einstein

Input: 𝒳\mathcal{X} (Data) dd (subspace dimension).
      Output: 𝒫\mathcal{P} (Projection space).

1:Compute LL. \triangleright Using the appropriate method
2:Compute the smallest $d$ eigen-tensors of $\mathcal{X}\times_{M+1}L*_{1}\mathcal{X}^{T}*_{M}\mathcal{V}=\lambda\,\mathcal{X}\times_{M+1}D*_{1}\mathcal{X}^{T}*_{M}\mathcal{V}$.
3:Combine these tensors to get 𝒫\mathcal{P}.
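A minimal sketch of Algorithm 4 through the unfolding is given below; a small ridge `eps` (our addition) is added to the right-hand side to enforce the definiteness assumed above.

```python
import numpy as np
from scipy.linalg import eigh

def lpp_einstein(X, W, d, eps=1e-8):
    """Smallest d eigen-tensors of (X L X^T) v = lambda (X D X^T) v."""
    feat_shape, n = X.shape[:-1], X.shape[-1]
    D = np.diag(W.sum(axis=1))
    L = D - W
    Xmat = X.reshape(-1, n, order="F")
    A = Xmat @ L @ Xmat.T
    B = Xmat @ D @ Xmat.T + eps * np.eye(Xmat.shape[0])   # enforce definiteness
    _, vecs = eigh(A, B)                  # generalized problem, ascending order
    return vecs[:, :d].reshape(feat_shape + (d,), order="F")
```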

Similarly, the multi-weight LPP can be proposed.

4.4 Generalization of NPP

The generalization of NPP resembles that of ONPP; under the LLE constraint, it aims to find

argmin𝒫I1××IM×d𝒫TM𝒳1𝒳TM𝒫=ΦNPP(𝒴)\displaystyle\arg\min_{\begin{subarray}{c}\mathcal{P}\in\mathbb{R}^{I_{1}\times\ldots\times I_{M}\times d}\\ \mathcal{P}^{T}*_{M}\mathcal{X}*_{1}\mathcal{X}^{T}*_{M}\mathcal{P}=\mathcal{I}\end{subarray}}\Phi_{NPP}(\mathcal{Y}) :=Tr(𝒫TM𝒳×M+1(InW)(InW)T×1𝒳TM𝒫).\displaystyle:=\operatorname{Tr}\left(\mathcal{P}^{T}*_{M}\mathcal{X}\times_{M+1}(I_{n}-W)(I_{n}-W)^{T}\times_{1}\mathcal{X}^{T}*_{M}\mathcal{P}\right). (21)

The solution entails finding the smallest dd eigenvectors of the generalized eigen-problem

𝒳×M+1(InW)(InW)T×1𝒳TM𝒱=λ𝒳1𝒳TM𝒱.\mathcal{X}\times_{M+1}(I_{n}-W)(I_{n}-W)^{T}\times_{1}\mathcal{X}^{T}*_{M}\mathcal{V}=\lambda\mathcal{X}*_{1}\mathcal{X}^{T}*_{M}\mathcal{V}.

The projection tensor is obtained by concatenating these eigenvectors into a matrix and then reshaping it to the desired dimension.
The algorithm below shows the steps of NPP via the Einstein product.

Algorithm 5 NPP-Einstein

Input: 𝒳\mathcal{X} (Data) dd (subspace dimension).
      Output: 𝒫\mathcal{P} (Projection space).

1:Compute WW. \triangleright Using the appropriate method
2:Compute the smallest dd eigenvectors of 𝒳×M+1(InW)(InW)T×1𝒳TM𝒱=λ𝒳1𝒳TM𝒱\mathcal{X}\times_{M+1}(I_{n}-W)(I_{n}-W)^{T}\times_{1}\mathcal{X}^{T}*_{M}\mathcal{V}=\lambda\mathcal{X}*_{1}\mathcal{X}^{T}*_{M}\mathcal{V}.
3:Combine these tensors to get 𝒫\mathcal{P}.

Similarly, the multi-weight NPP can be proposed, and the solution is similar.

Nonlinear methods: Nonlinear DR methods are potent tools for uncovering the nonlinear structure within data. However, they present their own challenges, such as the absence of an inverse mapping, which is essential for data reconstruction. Another difficulty encountered is the out-of-sample extension, which involves extending the method to new data. While a variant of these methods utilizing multiple weights could be proposed, it would resemble the approach of linear methods, and thus we will not delve into them here.

4.5 Generalization of Laplacian Eigenmap

Given a weight matrix Wn×nW\in\mathbb{R}^{n\times n}, the objective function is to solve

argmin𝒴×M+1D1𝒴T=ΦLE(𝒴)\displaystyle\arg\min_{\mathcal{Y}\times_{M+1}D*_{1}\mathcal{Y}^{T}=\mathcal{I}}\Phi_{LE}(\mathcal{Y}) :=12i,jwi,j𝒴(i)𝒴(j)F2.\displaystyle:=\dfrac{1}{2}\sum_{i,j}w_{i,j}\left\|\mathcal{Y}^{(i)}-\mathcal{Y}^{(j)}\right\|_{F}^{2}. (22)

The objective function can be written as

ΦLE(𝒴)=Tr(𝒴×M+1L1𝒴T).\Phi_{LE}(\mathcal{Y})=\operatorname{Tr}\left(\mathcal{Y}\times_{M+1}L*_{1}\mathcal{Y}^{T}\right).

For 𝒴^=𝒴×M+1D1/2\widehat{\mathcal{Y}}=\mathcal{Y}\times_{M+1}D^{1/2}, the constraint becomes 𝒴^1𝒴^T=\widehat{\mathcal{Y}}*_{1}\widehat{\mathcal{Y}}^{T}=\mathcal{I}, using 𝒴=𝒴^×M+1D1/2\mathcal{Y}=\widehat{\mathcal{Y}}\times_{M+1}D^{-1/2}, the objective function becomes

ΦLE(𝒴^)=Tr(𝒴^×M+1Ln1𝒴^T).\Phi_{LE}(\widehat{\mathcal{Y}})=\operatorname{Tr}\left(\widehat{\mathcal{Y}}\times_{M+1}L_{n}*_{1}\widehat{\mathcal{Y}}^{T}\right). (23)

Using the isomorphism $\Psi$ and its properties, the problem is equivalent to the matrix problem (1) under the constraint $\Psi(\widehat{\mathcal{Y}})\,\Psi(\widehat{\mathcal{Y}})^{T}=I$. Since the solution of that problem is given by the smallest $d$ eigenvectors of $L_{n}$, the solution of the original problem is $\mathcal{Y}=\Psi^{-1}(\Psi(\widehat{\mathcal{Y}}))\times_{M+1}D^{-1/2}$.
The algorithm below shows the steps of LE via the Einstein product.

Algorithm 6 LE-Einstein

Input: 𝒳\mathcal{X} (Data) dd (subspace dimension).
      Output: 𝒴\mathcal{Y} (Projection data).

1:Compute LnL_{n}. \triangleright Using the appropriate method
2:Compute the smallest $d$ eigenvectors of $L_{n}$.
3:Combine these vectors and reshape them to get Ψ(𝒴^)\Psi(\widehat{\mathcal{Y}}).
4:Compute 𝒴=Ψ1(Ψ(𝒴^))×M+1D1/2\mathcal{Y}=\Psi^{-1}(\Psi(\widehat{\mathcal{Y}}))\times_{M+1}D^{-1/2}.
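A minimal sketch of Algorithm 6, assuming the weight matrix $W$ is given, is shown below.

```python
import numpy as np

def le_einstein(W, d):
    """Laplacian Eigenmap embedding Y (d x n) from the weight matrix W."""
    deg = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
    Ln = D_inv_sqrt @ (np.diag(deg) - W) @ D_inv_sqrt     # normalized Laplacian
    _, vecs = np.linalg.eigh(Ln)                          # ascending eigenvalues
    Y_hat = vecs[:, 1:d + 1].T            # skip the trivial smallest eigenvector
    return Y_hat @ D_inv_sqrt             # Y = Y_hat x_{M+1} D^{-1/2}
```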

4.5.1 Projection on out of Sample Data

The out-of-sample extension is the problem of extending the method to new data. Many approaches have been proposed to solve this problem, such as the Nyström method, kernel mapping, and eigenfunctions [5, 25]. We will propose a method based on the eigenfunctions.
In the matrix case, the out-of-sample extension is done simply by computing the components explicitly as $y_{t_{j}}=\frac{1}{\lambda_{j}}\mathbf{k}_{t}^{T}\mathbf{v}_{j}$, $j=1,\ldots,d$, where $(\lambda_{j},\mathbf{v}_{j})$ is an eigenpair of the kernel matrix $L_{n}$ and $\mathbf{k}_{t}=(K(\mathbf{x}_{t},\mathbf{x}_{1}),\ldots,K(\mathbf{x}_{t},\mathbf{x}_{n}))^{T}$ is the kernel vector of the test point $\mathbf{x}_{t}$. This can be written compactly as $\mathbf{y}_{t}=\operatorname{diag}(\lambda_{1},\ldots,\lambda_{d})^{-1}[\mathbf{v}_{1},\ldots,\mathbf{v}_{d}]^{T}\mathbf{k}_{t}$. Thus, the generalization is straightforward.
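A minimal matrix-case sketch of this out-of-sample formula, assuming a kernel function `kernel` and precomputed eigenpairs of the training kernel matrix, is given below.

```python
import numpy as np

def le_out_of_sample(x_t, X_train, kernel, eigvals, eigvecs):
    """y_t = diag(lambda)^{-1} V^T k_t; eigvals (d,) and eigvecs (n, d) come
    from the eigen-decomposition of the training kernel matrix."""
    n = X_train.shape[-1]
    k_t = np.array([kernel(x_t, X_train[..., i]) for i in range(n)])
    return (eigvecs.T @ k_t) / eigvals
```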

4.6 Generalization of LLE

LLE is a nonlinear method that aims to preserve the local structure of the data. After finding the kk-nearest neighbors, it determines the weights that minimize the reconstruction error. In other words, it seeks to solve the following objective function

Re(W):=i𝒳(i)jwi,j𝒳(j)F2,Re(W):=\sum_{i}\left\|\mathcal{X}^{(i)}-\sum_{j}w_{i,j}\mathcal{X}^{(j)}\right\|_{F}^{2}, (24)

subject to two constraints: a sparse one (the weights are zero if a point is not in the neighborhood of another point), ensuring the locality of the reconstruction weight, and an invariance constraint (the row sum of the weight matrix is 1). Finally, the projected data is obtained by solving the following objective function

argmin𝒴1𝒴T=i𝒴(i)=𝒪ΦLLE\displaystyle\arg\min_{\begin{subarray}{c}\mathcal{Y}*_{1}\mathcal{Y}^{T}=\mathcal{I}\\ \sum_{i}\mathcal{Y}^{(i)}=\mathcal{O}\end{subarray}}\Phi_{LLE} :=i𝒴(i)jwi,j𝒴(j)F2=𝒴×M+1(InW)F2.\displaystyle:=\sum_{i}\left\|\mathcal{Y}^{(i)}-\sum_{j}w_{i,j}\mathcal{Y}^{(j)}\right\|_{F}^{2}=\left\|\mathcal{Y}\times_{M+1}(I_{n}-W)\right\|_{F}^{2}. (25)

4.6.1 Computing the weights

Equation (24) can be decomposed into finding the weights $w_{i,:}$ of each point $\mathcal{X}^{(i)}$ independently. For simplicity, we refer to the neighbors of $\mathcal{X}^{(i)}$ as $\mathcal{N}^{(i,j)}$, and to the weights of its neighbors as $\mathbf{w}_{i}\in\mathbb{R}^{k}$, with $k$ the number of neighbors, which plays the role of the sparseness constraint. Denote the local Grammian matrix $G^{(i)}\in\mathbb{R}^{k\times k}$ by $G^{(i)}_{j,k}=\langle\mathcal{X}^{(i)}-\mathcal{N}^{(i,j)},\mathcal{X}^{(i)}-\mathcal{N}^{(i,k)}\rangle$.
The invariance constraint can be written as 𝟏T𝐰i=1\mathbf{1}^{T}\mathbf{w}_{i}=1, thus, we can write the problem as

$\sum_{i}\left\|\mathcal{X}^{(i)}-\sum_{j}w_{i,j}\mathcal{X}^{(j)}\right\|_{F}^{2} =\sum_{i}\left\|\mathcal{X}^{(i)}-\sum_{j}w_{i,j}\mathcal{N}^{(i,j)}\right\|_{F}^{2}$
=ijwi,j(𝒳(i)𝒩(i,j))F2\displaystyle=\sum_{i}\left\|\sum_{j}w_{i,j}(\mathcal{X}^{(i)}-\mathcal{N}^{(i,j)})\right\|_{F}^{2}
=i,j,kwi,jwi,kGj,k(i)\displaystyle=\sum_{i,j,k}w_{i,j}w_{i,k}G^{(i)}_{j,k}
=i𝐰iTG(i)𝐰i.\displaystyle=\sum_{i}\mathbf{w}_{i}^{T}G^{(i)}\mathbf{w}_{i}.

This constrained problem can be solved using the Lagrangian

({𝐰i}i,λ)=i𝐰iTG(i)𝐰iiλi(𝟏T𝐰i1).\mathcal{L}(\{\mathbf{w}_{i}\}_{i},\lambda)=\sum_{i}\mathbf{w}_{i}^{T}G^{(i)}\mathbf{w}_{i}-\sum_{i}\lambda_{i}(\mathbf{1}^{T}\mathbf{w}_{i}-1).

We compute the partial derivative with respect to $\mathbf{w}_{i}$ and set it to zero

𝐰i\displaystyle\dfrac{\partial\mathcal{L}}{\partial\mathbf{w}_{i}} =2G(i)𝐰iλi𝟏=0\displaystyle=2G^{(i)}\mathbf{w}_{i}-\lambda_{i}\mathbf{1}=0 (26)
𝐰i\displaystyle\implies\mathbf{w}_{i} =λi2G(i)1𝟏.\displaystyle=\frac{\lambda_{i}}{2}G^{(i)^{-1}}\mathbf{1}.

We utilize the fact that the Grammian matrix is symmetric, assuming that G(i)G^{(i)} is full rank, which is typically the case. However, if G(i)G^{(i)} is not full rank, a small value can be added to its diagonal to ensure full rank.
The partial derivative with respect to λi\lambda_{i}, set to zero, yields the invariance constraint of point ii, i.e., 𝟏T𝐰i=1\mathbf{1}^{T}\mathbf{w}_{i}=1. Multiplying Equation (26) by 𝟏T\mathbf{1}^{T}, we can isolate λ\lambda and arrive at the following equation

𝐰i=G(i)1𝟏𝟏TG(i)1𝟏.\mathbf{w}_{i}=\dfrac{G^{(i)^{-1}}\mathbf{1}}{\mathbf{1}^{T}G^{(i)^{-1}}\mathbf{1}}.
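A minimal sketch of this weight computation, assuming the neighbor indices are given and using a small ridge `reg` (our addition) when the local Grammian is rank deficient, is shown below.

```python
import numpy as np

def lle_weights(X, neighbours, reg=1e-6):
    """X has shape (I1, ..., IM, n); neighbours[i] lists the k neighbour
    indices of sample i.  Returns the sparse n x n weight matrix W."""
    n = X.shape[-1]
    W = np.zeros((n, n))
    for i in range(n):
        idx = neighbours[i]
        diff = (X[..., idx] - X[..., [i]]).reshape(-1, len(idx))
        G = diff.T @ diff + reg * np.eye(len(idx))   # local Grammian G^(i)
        w = np.linalg.solve(G, np.ones(len(idx)))
        W[i, idx] = w / w.sum()                      # enforce 1^T w_i = 1
    return W
```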

4.6.2 Computing the projected data

The final step resembles the previous cases, ensuring that the mean constraint is satisfied. As IWI-W can be represented as a Laplacian, we know that the number of components corresponds to the multiplicity of the eigenvalue 0. Hence, there is at least one eigenvalue 0 with multiplicity 1, and the identity tensor serves as the corresponding eigenvector, thereby satisfying the second constraint. For further elaboration, interested readers can refer to [10].
The solution is equivalent to solving the matricized version, where the solution comprises the smallest $d$ eigenvectors of the matrix $(I_{n}-W)(I_{n}-W^{T})$. Consequently, the solution to the original problem is the inverse transform, denoted $\Psi^{-1}$, of the solution to the matricized version.
The algorithm below shows the steps of LLE via the Einstein product.

Algorithm 7 LLE-Einstein

Input: 𝒳\mathcal{X} (Data) dd (subspace dimension).
      Output: 𝒴\mathcal{Y} (Projection data).

1:Find the neighbors of each point.
2:Compute the reconstruction weight WW.
3:Compute the smallest dd eigenvectors of (InW)(InWT)(I_{n}-W)(I_{n}-W^{T}).
4:Compute the Ψ1\Psi^{-1} of these vectors with the appropriate reshaping to get 𝒴\mathcal{Y}.

4.6.3 Projection on out of Sample Data

To extend these methods to new data (test data) not seen in the training set, various approaches can be employed. These include kernel mapping, eigenfunctions (as discussed in Bengio et al. [5]), and linear reconstruction. Here, we opt to generalize the latter approach. We can follow similar steps to those used in matrix-based methods. Specifically, we can perform the following steps without re-running the algorithm on the entire dataset

1. Find, within the training data, the neighbors of each new test point.

2. Compute the reconstruction weights that best reconstruct each test point from its $k$ neighbors in the training data.

3. Compute the low-dimensional representation of the new data using these reconstruction weights.

More formally, after finding the neighbors $\mathcal{N}^{(i,j)}$ of a test point $\mathcal{X}_{t}^{(i)}$, we solve the following problem

\arg\min_{w^{(t)}}\sum_{i}\left\|\mathcal{X}_{t}^{(i)}-\sum_{j}w^{(t)}_{i,j}\mathcal{N}^{(i,j)}\right\|_{F}^{2},

where $\mathbf{w}^{(t)}_{i,:}$ denotes the reconstruction weights of the $i$-th test point, under the same constraint as before, i.e., each row of the weight matrix sums to $1$, with zero entries for training points that are not neighbors of the test point.
The solution is $w^{(t)}_{i}=\dfrac{G_{t}^{(i)^{-1}}\mathbf{1}}{\mathbf{1}^{T}G_{t}^{(i)^{-1}}\mathbf{1}}$, with $G^{(i)}_{t_{j,k}}=\langle\mathcal{X}_{t}^{(i)}-\mathcal{N}^{(i,j)},\mathcal{X}_{t}^{(i)}-\mathcal{N}^{(i,k)}\rangle$.
Finally, the embedding of the test point $\mathcal{Y}_{t}^{(i)}$ is obtained as $\sum_{j}w^{(t)}_{i,j}\mathcal{Y}^{(j)}$, where $\mathcal{Y}^{(j)}=\phi_{LLE}(\mathcal{N}^{(i,j)})$ are the embedded representations of its neighbors.
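A minimal sketch of this out-of-sample extension, reusing the lle_weights helper from the earlier sketch (all names and the flattened data layout are illustrative assumptions):

import numpy as np

def lle_out_of_sample(x_test, X_train, Y_train, k=10):
    """Embed one unseen sample by linear reconstruction from its training neighbors.

    x_test  : flattened test sample, shape (p,).
    X_train : flattened training samples, shape (n, p).
    Y_train : training embeddings, shape (n, d).
    """
    dists = np.linalg.norm(X_train - x_test, axis=1)         # distances to training data
    idx = np.argsort(dists)[:k]                               # k nearest training neighbors
    w = lle_weights(x_test, [X_train[j] for j in idx])        # weights summing to one
    return w @ Y_train[idx]                                   # weighted neighbor embeddings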

5 Other variants via Einstein product

5.1 Kernel methods

Kernels serve as powerful tools enabling linear methods to operate effectively in high-dimensional spaces, allowing for the representation of nonlinear data without explicitly computing data coordinates in the feature space. The kernel trick, a pivotal breakthrough in machine learning, facilitates this process. Formally, instead of directly manipulating the data $\mathcal{X}=\{\mathcal{X}^{(1)},\ldots,\mathcal{X}^{(n)}\}$, we operate within a high-dimensional implicit feature space through a mapping $\Phi$.
We denote the tensor $\Phi(\mathcal{X})=\{\Phi(\mathcal{X}^{(1)}),\ldots,\Phi(\mathcal{X}^{(n)})\}=\{\Phi(\mathcal{X})^{(1)},\ldots,\Phi(\mathcal{X})^{(n)}\}$. The mapping need not be known explicitly; only the Gram matrix $K$ is required. This matrix collects the inner products of the data in the feature space, $K_{i,j}=\langle\Phi(\mathcal{X}^{(i)}),\Phi(\mathcal{X}^{(j)})\rangle$. Consequently, any method expressible in terms of inner products of the data can be reformulated in terms of the Gram matrix, and the kernel trick can then be applied. Moreover, extending kernel methods to multi-linear operations using the Einstein product is straightforward. Commonly used kernels include the Gaussian, polynomial, Laplacian, and Sigmoid kernels, among others.
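For instance, a Gaussian Gram matrix for tensor samples can be built from Frobenius distances; the sketch below is illustrative, and the median-based default bandwidth is an assumption rather than a prescription of the method.

import numpy as np

def gaussian_gram(X, sigma=None):
    """Gram matrix K_ij = exp(-||X_i - X_j||_F^2 / (2 sigma^2)) for tensor samples.

    X : data tensor whose last mode indexes the n samples.
    """
    n = X.shape[-1]
    flat = X.reshape(-1, n).T                                  # each row is one flattened sample
    sq = ((flat[:, None, :] - flat[None, :, :]) ** 2).sum(-1)  # squared Frobenius distances
    if sigma is None:
        sigma = 0.5 * np.median(np.sqrt(sq[sq > 0]))           # a common heuristic bandwidth
    return np.exp(-sq / (2 * sigma ** 2))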
Denote $\mathcal{Y}=\mathcal{P}^{T}*_{M}\Phi(\mathcal{X})$, with $\mathcal{P}\in\mathbb{R}^{I_{1}\times\ldots\times I_{M}\times d}$.

5.1.1 Kernel PCA via Einstein product

The kernel multi-linear PCA solves the following problem

\arg\max_{\substack{\mathcal{P}\in\mathbb{R}^{I_{1}\times\ldots\times I_{M}\times d}\\ \mathcal{P}^{T}*_{M}\mathcal{P}=\mathcal{I}}}\;\left\|\mathcal{P}^{T}*_{1}(\Phi(\mathcal{X})-\mathcal{Q})\right\|_{F}^{2}, \qquad (27)

where $\mathcal{Q}$ is the mean of the mapped points; the solution consists of the largest $d$ left singular tensors of $\Phi(\mathcal{X})-\mathcal{Q}=\widehat{\Phi}(\mathcal{X})$. This requires the eigen-tensors of $\widehat{\Phi}(\mathcal{X})*_{1}\widehat{\Phi}(\mathcal{X})^{T}$, which is not accessible; however, the Grammian $K=\widehat{\Phi}(\mathcal{X})^{T}*_{M}\widehat{\Phi}(\mathcal{X})$ is available, so we can transform the problem into

\widehat{K}\widehat{\mathbf{z}}_{i}=\lambda_{i}\widehat{\mathbf{z}}_{i},\quad\text{with }\widehat{\mathbf{z}}_{i}=(\widehat{\Phi}(\mathcal{X}))^{T}*_{M}\mathcal{Z}^{(i)}\in\mathbb{R}^{n},

where $\widehat{K}$ denotes the Grammian of the centered data, which can be obtained from $K$ alone as $\widehat{K}=K-HK-KH+HKH$, since

\begin{aligned}
\widehat{K}_{i,j} &=\widehat{\Phi}(\mathcal{X})^{(i)^{T}}*_{1}\widehat{\Phi}(\mathcal{X})^{(j)}\\
&=\left(\Phi(\mathcal{X}^{(i)})^{T}-\dfrac{1}{n}\sum_{k}\Phi(\mathcal{X}^{(k)})^{T}\right)*_{1}\left(\Phi(\mathcal{X}^{(j)})-\dfrac{1}{n}\sum_{l}\Phi(\mathcal{X}^{(l)})\right)\\
&=\Phi(\mathcal{X}^{(i)})^{T}*_{1}\Phi(\mathcal{X}^{(j)})-\dfrac{1}{n}\sum_{k}\Phi(\mathcal{X}^{(i)})^{T}*_{1}\Phi(\mathcal{X}^{(k)})-\dfrac{1}{n}\sum_{l}\Phi(\mathcal{X}^{(l)})^{T}*_{1}\Phi(\mathcal{X}^{(j)})+\dfrac{1}{n^{2}}\sum_{k,l}\Phi(\mathcal{X}^{(k)})^{T}*_{1}\Phi(\mathcal{X}^{(l)})\\
&=K_{i,j}-\dfrac{1}{n}\sum_{k}K_{i,k}-\dfrac{1}{n}\sum_{l}K_{l,j}+\dfrac{1}{n^{2}}\sum_{k,l}K_{k,l}\\
&=\left(K-\dfrac{\mathbf{1}}{n}K-K\dfrac{\mathbf{1}}{n}+\dfrac{\mathbf{1}}{n}K\dfrac{\mathbf{1}}{n}\right)_{i,j}.
\end{aligned}

Then, the solution is the same as in the matrix case, i.e., the largest $d$ eigenvectors of $\widehat{K}$, reshaped to the appropriate size.
The algorithm below shows the steps of kernel PCA via the Einstein product.

Algorithm 8 Kernel PCA-Einstein

Input: $\mathcal{X}$ (data), $(d_{i})_{i\leq M}$ (output dimensions), $K$ (Grammian).
      Output: $\mathcal{Y}$ (projected data).

1: Compute $\widehat{K}$. \triangleright Center the Grammian
2: Compute the largest $d$ eigenvectors of $\widehat{K}$.
3: Combine these vectors and reshape them to get $\mathcal{Y}$.
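A minimal matricized sketch of Algorithm 8, assuming the Gram matrix K has already been formed (for example with the gaussian_gram helper above); the eigen-solver and the reshaping left to the caller are implementation choices.

import numpy as np

def kernel_pca_einstein(K, d):
    """Embed n samples from their Gram matrix K (n x n) into d dimensions."""
    n = K.shape[0]
    H = np.ones((n, n)) / n
    K_hat = K - H @ K - K @ H + H @ K @ H        # double centering of the Grammian
    vals, vecs = np.linalg.eigh(K_hat)           # eigenvalues in ascending order
    Y = vecs[:, -d:][:, ::-1]                    # largest d eigenvectors
    return Y                                     # reshape to the tensor format as needed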

5.1.2 Kernel LPP via Einstein product

The kernel multi-linear LPP tackles the following problem

\arg\min_{\substack{\mathcal{P}\in\mathbb{R}^{I_{1}\times\ldots\times I_{M}\times d}\\ \mathcal{P}^{T}*_{M}\Phi(\mathcal{X})\times_{M+1}D*_{1}\Phi(\mathcal{X})^{T}*_{M}\mathcal{P}=\mathcal{I}}}\;\operatorname{Tr}\left(\mathcal{P}^{T}*_{M}\left(\Phi(\mathcal{X})\times_{M+1}L*_{1}\Phi(\mathcal{X})^{T}\right)*_{M}\mathcal{P}\right). \qquad (28)

The solution involves finding the smallest $d$ eigen-tensors of the generalized eigen-problem

\Phi(\mathcal{X})\times_{M+1}L*_{1}\Phi(\mathcal{X})^{T}*_{M}\mathcal{V}=\lambda\,\Phi(\mathcal{X})\times_{M+1}D*_{1}\Phi(\mathcal{X})^{T}*_{M}\mathcal{V}.

Since $\Phi(\mathcal{X})$ is not available, the problem needs to be reformulated. Using the fact that $K$ is invertible, we reformulate it as finding the vectors $\mathbf{z}$ that solve the generalized eigen-problem

L\mathbf{z}=\lambda D\mathbf{z},\quad\text{with }\mathbf{z}=\Phi(\mathcal{X})^{T}*_{M}\mathcal{V}^{(i)}\in\mathbb{R}^{n}.

This formulation reduces to the same minimization problem as in the matrix case.
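A minimal sketch of this reduced problem, using a symmetric generalized eigen-solver; the construction of the affinity matrix W (e.g., a Gaussian-weighted k-nn graph) is assumed to be done beforehand, and discarding the constant eigenvector is a common practical choice rather than part of the statement above.

import numpy as np
from scipy.linalg import eigh

def kernel_lpp_einstein(W, d):
    """Embedding from an affinity matrix W (n x n) via the reduced problem L z = lambda D z."""
    D = np.diag(W.sum(axis=1))              # degree matrix
    L = D - W                               # graph Laplacian
    vals, vecs = eigh(L, D)                 # generalized eigenvalues in ascending order
    return vecs[:, 1:d + 1]                 # smallest d, skipping the constant eigenvector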

5.1.3 Kernel ONPP via Einstein product

The kernel multi-linear ONPP addresses the following optimization problem

\arg\min_{\substack{\mathcal{P}\in\mathbb{R}^{I_{1}\times\ldots\times I_{M}\times d}\\ \mathcal{P}^{T}*_{M}\mathcal{P}=\mathcal{I}}}\;\operatorname{Tr}\left(\mathcal{P}^{T}*_{M}\left(\Phi(\mathcal{X})\times_{M+1}(I-W)(I-W^{T})*_{1}\Phi(\mathcal{X})^{T}\right)*_{M}\mathcal{P}\right). \qquad (29)

The solution involves finding the smallest $d$ eigen-tensors of the problem

\Phi(\mathcal{X})\times_{M+1}(I-W)(I-W^{T})*_{1}\Phi(\mathcal{X})^{T}*_{M}\mathcal{V}=\lambda\mathcal{V}.

By employing similar techniques as before, we can derive the equivalent problem of finding the vectors $\mathbf{z}$ that solve the eigen-problem

K(I-W)(I-W^{T})\mathbf{z}=\lambda\mathbf{z},\quad\text{with }\Phi(\mathcal{X})^{T}*_{M}\mathcal{V}=\mathbf{z}.

This is the same minimization problem as in the matrix case; the solution $\mathcal{Y}$ can be obtained by reshaping the transpose of the concatenated vectors $\mathbf{z}$.
The algorithm below shows the steps of kernel ONPP via the Einstein product.

Algorithm 9 Kernel ONPP-Einstein

Input: $K$ (Grammian), $d$ (subspace dimension).
      Output: $\mathcal{Y}$ (projected data).

1: Compute $W$. \triangleright Using the appropriate method
2: Compute the smallest $d$ eigenvectors of $K(I-W)(I-W^{T})$.
3: Combine these vectors and reshape them to get $\mathcal{Y}$.
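A minimal matricized sketch of Algorithm 9, assuming K and the reconstruction weights W are given; since $K(I-W)(I-W^{T})$ is not symmetric in general, a general eigen-solver is used here, which is an implementation choice and not prescribed by the paper.

import numpy as np

def kernel_onpp_einstein(K, W, d):
    """Embedding from the Gram matrix K and LLE-type weights W (both n x n)."""
    n = K.shape[0]
    M = K @ (np.eye(n) - W) @ (np.eye(n) - W).T   # K (I - W)(I - W)^T
    vals, vecs = np.linalg.eig(M)                 # generally non-symmetric matrix
    order = np.argsort(vals.real)                 # sort eigenvalues by real part
    return vecs[:, order[:d]].real.T              # smallest d eigenvectors, rows = dimensions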

5.1.4 Kernel OLPP via Einstein product

The kernel multi-linear OLPP tackles the optimization problem defined as follows:

\arg\min_{\substack{\mathcal{P}\in\mathbb{R}^{I_{1}\times\ldots\times I_{M}\times d}\\ \mathcal{P}^{T}*_{M}\mathcal{P}=\mathcal{I}}}\;\operatorname{Tr}\left(\mathcal{P}^{T}*_{M}\left(\Phi(\mathcal{X})\times_{M+1}L*_{1}\Phi(\mathcal{X})^{T}\right)*_{M}\mathcal{P}\right). \qquad (30)

The solution of the problem involves the eigen-tensors of $\Phi(\mathcal{X})\times_{M+1}L*_{1}\Phi(\mathcal{X})^{T}*_{M}\mathcal{Z}=\lambda\mathcal{Z}$. By transforming the problem, we arrive at finding the vectors $\mathbf{z}$ that solve the eigen-problem

K\mathbf{z}=\lambda\mathbf{z},\quad\text{with }\Phi(\mathcal{X})^{T}*_{M}\mathcal{Z}=\mathbf{z},

which mirrors the matrix case. Here, the solution $\mathcal{Y}$ can be obtained by reshaping the transpose of the concatenated vectors $\mathbf{z}$.
The algorithm below shows the steps of kernel OLPP via the Einstein product.

Algorithm 10 Kernel OLPP-Einstein

Input: $K$ (Grammian), $d$ (subspace dimension).
      Output: $\mathcal{Y}$ (projected data).

1: Compute the smallest $d$ eigenvectors of $K$.
2: Combine these vectors and reshape them to get $\mathcal{Y}$.

5.2 Supervised learning

In general, supervised learning differs from unsupervised learning primarily in how the weight matrix incorporates class label information. Supervised learning tends to outperform unsupervised learning, particularly with small datasets, due to the utilization of additional class label information.
In supervised learning, each data point is associated with a known class label. The weight matrix can be adapted to include this class label information. For instance, it may take the form of a block diagonal matrix, where $W_{s}(i)\in\mathbb{R}^{n_{i}\times n_{i}}$ are sub-weight matrices and $n_{i}$ denotes the number of data points in class $i$. Let $c(i)$ denote the class of data point $x_{i}$.

Supervised PCA: PCA is the only linear method presented that does not rely on a graph matrix. Consequently, its supervised implementation is not straightforward and requires a brief explanation. Following the approach proposed in [2], we address this challenge by formulating the problem via the empirical Hilbert-Schmidt independence criterion (HSIC):

\arg\max_{P^{T}P=I}\;\operatorname{Tr}(P^{T}XHK_{L}HX^{T}P),

where $K_{L}$ is the kernel of the outcome measurements $Y$. Thus, the generalization would be to solve

\arg\max_{\substack{\mathcal{P}\in\mathbb{R}^{I_{1}\times\ldots\times I_{M}\times d}\\ \mathcal{P}^{T}*_{M}\mathcal{P}=\mathcal{I}}}\operatorname{Tr}\left(\mathcal{P}^{T}*_{M}\mathcal{X}\times_{M+1}HK_{L}H*_{1}\mathcal{X}^{T}*_{M}\mathcal{P}\right). \qquad (31)

The solution of (31) consists of the largest $d$ eigen-tensors of $\mathcal{X}\times_{M+1}HK_{L}H*_{1}\mathcal{X}^{T}$.
Notice that when $K_{L}=I_{n}$, we recover the same problem as in the unsupervised case.
The algorithm below shows the steps of supervised PCA via the Einstein product.

Algorithm 11 Supervised PCA-Einstein

Input: $\mathcal{X}$ (data), $d$ (subspace dimension), $K_{L}$ (kernel of the labels).
      Output: $\mathcal{P}$ (projection space).

1: Compute the largest $d$ eigen-tensors of $\mathcal{X}\times_{M+1}HK_{L}H*_{1}\mathcal{X}^{T}$.
2: Combine these tensors to get $\mathcal{P}$.
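A matricized sketch of Algorithm 11 under the HSIC formulation above, with $H=I_{n}-\frac{1}{n}\mathbf{1}\mathbf{1}^{T}$ the centering matrix; the delta kernel on the labels is an illustrative choice for $K_{L}$, and the function and variable names are assumptions.

import numpy as np

def supervised_pca_einstein(X, labels, d):
    """X: data tensor of shape dims + (n,); labels: length-n array; returns P (d x prod(dims))."""
    n = X.shape[-1]
    Xm = X.reshape(-1, n)                                        # features x samples
    K_L = (labels[:, None] == labels[None, :]).astype(float)     # delta kernel on the labels
    H = np.eye(n) - np.ones((n, n)) / n                          # centering matrix
    M = Xm @ H @ K_L @ H @ Xm.T                                  # X H K_L H X^T
    vals, vecs = np.linalg.eigh(M)
    P = vecs[:, -d:][:, ::-1].T                                  # largest d eigenvectors as rows
    return P                                                     # reshape to I_1 x ... x I_M x d if needed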

Supervised Laplacian Eigenmap: The supervised Laplacian Eigenmap is similar to the Laplacian Eigenmap, with the difference that the weight matrix is computed using the class labels; many approaches have been proposed [22, 8, 25]. We choose a simple approach that replaces the weight matrix by $W_{s}$, and the rest of the algorithm remains the same.

Supervised LLE: There are multiple variants of LLE that use the class labels to improve the performance of the method, e.g., supervised LLE (SLLE), probabilistic SLLE [33], supervised guided LLE using HSIC [1], and enhanced SLLE [31]. The general strategy is to incorporate the class labels either in the distance matrix, in the weight matrix, or in the objective function [10]. We choose the simplest, namely the first strategy: the distance matrix is modified by adding a term that increases the inter-class distances and decreases the intra-class ones. The rest of the steps are the same as in the unsupervised LLE.
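A common realization of this first strategy, used in several of the SLLE variants cited above, inflates the distances between samples of different classes, for instance as $D'=D+\alpha\,\max(D)\,(1-\Delta)$ with $\Delta_{ij}=1$ when $i$ and $j$ share a class; the sketch below implements this illustrative formulation, which is one possible choice rather than the exact one prescribed here.

import numpy as np

def supervised_distances(D, labels, alpha=0.2):
    """Inflate distances between samples of different classes before the k-nn search.

    D      : (n, n) pairwise distance matrix.
    labels : length-n array of class labels.
    alpha  : trade-off between geometry (0) and class information (1).
    """
    same_class = (labels[:, None] == labels[None, :]).astype(float)
    return D + alpha * D.max() * (1.0 - same_class)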

5.2.1 Repulsion approaches

In semi-supervised or supervised learning, how the class labels are used can affect the performance. Commonly, the similarity matrix only tells us whether two points belong to the same class, without incorporating any additional information on data locality, e.g., the closeness of points of different classes. The repulsion technique takes the class label information into account by repulsing points of different classes and attracting points of the same class. It extends the traditional graph-based methods by incorporating repulsion or discrimination elements into the graph Laplacian, leading to a more distinct separation of the classes in the reduced-dimensional space by integrating the class label information directly into the graph structure. The concept of repulsion had been used in DR with different formulations [32, 7] before the k-nn graph was used to derive it. In [16], a generic method was proposed that applies attraction to all points of the same class together with repulsion between nearby points of different classes, and was found to be significantly better than the previous approaches. Thus, we use the same approach and generalize it to the Einstein product.
The repulsion graph $\mathcal{G}^{(r)}=\{\mathcal{V}^{(r)},\mathcal{E}^{(r)}\}$ is derived from the k-nn graph $\mathcal{G}=\{\mathcal{V},\mathcal{E}\}$ based on the class label information; in the simplest form, the edge weights can be computed as

W_{i,j}^{(r)}=\begin{cases}1 & \text{if }(\mathbf{x}_{i},\mathbf{x}_{j})\in\mathcal{E},\;i\neq j,\;c(i)\neq c(j),\\ 0 & \text{otherwise.}\end{cases}

Hence, in the case of a fully connected graph, the repulsion weights take the form

W^{(r)}=\mathbf{1}_{n}-\operatorname{diag}(\mathbf{1}_{n_{i}}).

Other weight values can also be proposed.
The new repulsion algorithms are similar to the previous ones, using the new weight matrix $W_{s}=W+\beta W^{(r)}$, where $\beta$ is a parameter.
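A minimal sketch of this combination on a fully connected graph; the attraction matrix W is assumed to have been built by one of the methods above, and only the repulsion term follows the formula given here.

import numpy as np

def repulsion_weights(W, labels, beta=1.0):
    """Combine an attraction graph W with a class-based repulsion graph: W_s = W + beta * W_r."""
    W_r = (labels[:, None] != labels[None, :]).astype(float)   # 1 for pairs in different classes
    np.fill_diagonal(W_r, 0.0)
    return W + beta * W_r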

6 Experiments

To show the effectiveness of the proposed methods, we use datasets that are commonly used in the literature. The experiments are conducted on the GTDB dataset for facial recognition and on the MNIST dataset for digit recognition. We note that these datasets provide raw images rather than extracted features. The results are compared with the state-of-the-art methods by feeding the projected data to a classifier. A baseline is also used for comparison, namely using the raw data as the input of the classifier, and the recognition rate is used as the evaluation metric for all methods. Images were chosen because the proposed methods are designed to work on multi-linear data, and images are a typical example of such data. The proposed methods that use multiple weights are denoted by adding "MW" to the name of the method. It is intuitive to use multiple weights for images, since the third mode represents the RGB channels while the first two modes represent the pixel locations.
The evaluation metric, the recognition rate (IR), is used to assess the performance of the proposed methods. It is defined as the number of correct classifications over the total number of test samples. A test sample is classified by finding the minimum distance between its projection and the projected training data. The IR is computed on the test data.

For simplicity, we use the supervised versions of the methods, with Gaussian weights, and the parameter recommended in [16] (half the median of the data) for the Gaussian parameter.

IR=100\times\dfrac{\text{number of recognized images in the data}}{\text{number of images in the data}}.
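A minimal sketch of this evaluation protocol (nearest-neighbor classification in the projected space; the variable names are illustrative):

import numpy as np

def recognition_rate(Y_train, labels_train, Y_test, labels_test):
    """IR = 100 * (# correctly recognized test images) / (# test images)."""
    # distance from each test embedding to every training embedding
    dists = np.linalg.norm(Y_test[:, None, :] - Y_train[None, :, :], axis=-1)
    predicted = labels_train[np.argmin(dists, axis=1)]       # nearest-neighbor label
    return 100.0 * np.mean(predicted == labels_test)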

All computations are carried out on a laptop with a 2.1 GHz 8th-generation Intel Core i7 processor and 8 GB of memory, using MATLAB 2021a.

6.1 Digit recognition

The dataset used in these experiments is the MNIST dataset (https://lucidar.me/en/matlab/load-mnist-database-of-handwritten-digits-in-matlab/). It contains 60,000 training images and 10,000 testing images of labeled handwritten digits. The images are normalized gray-scale images of size $28\times 28$. The evaluation metric is the same as for facial recognition. We work with a smaller subset of the data to speed up the computation, e.g., 1000 training images and 200 testing images taken randomly from the data. Observe that the multi-weight methods are not used on this dataset since the images are gray-scale, so there are no multiple weights.
Table 2 shows the performance of the different approaches compared to the state of the art for different subspace dimensions.

d     OLPP    OLPP-E  ONPP    ONPP-E  PCA     PCA-E   Baseline
5     50.50   50.50   56.00   56.00   63.00   63.00   8.50
10    75.50   75.50   81.50   81.50   82.50   82.50   8.50
15    81.00   81.00   80.50   80.50   84.50   84.50   8.50
20    85.00   85.00   83.50   83.50   88.00   88.00   8.50
25    86.00   86.00   87.50   87.50   88.00   88.00   8.50
30    85.50   85.50   87.50   87.50   89.00   89.00   8.50
35    88.00   88.00   89.50   89.50   87.00   87.00   8.50
40    88.00   88.00   89.00   89.00   87.50   87.50   8.50
Table 2: Performance (recognition rate, %) of the methods for different subspace dimensions d.

On the MNIST dataset, the results of each method and of its multi-dimensional counterpart are similar; we attribute this to the fact that vectorizing two modes into one does not affect the accuracy much, which leads to similar results with the proposed parameters.

Note that the objective is to compare each method with its proposed multi-dimensional counterpart via the Einstein product, in order to verify that the generalization works.

6.2 Facial recognition

The dataset used in these experiments is the Georgia Tech database GTDB crop (https://www.anefian.com/research/face_reco.htm). It contains 750 color JPEG images of 50 persons, each represented by exactly 15 images showing different facial expressions, scales, and lighting conditions. Figure 1 shows 12 arbitrary images, out of the possible 15, of one person in the dataset.
Our data in this case form a tensor of size $height\times width\times 3\times 750$ for RGB images and $height\times width\times 750$ for gray-scale images. The height and width of the images are fixed to $60\times 60$. The data is normalized.

Figure 1: Example of images of one person in the GTDB dataset.

The experiment uses 12 images for training and 3 for testing per face. Figure 2 shows the results for different subspace dimensions.

Figure 2: Performance of the methods for different subspace dimensions.

The results show that as the subspace dimension increases, the performance of most methods also increases, suggesting that these methods benefit from a higher-dimensional feature space up to a point that differs from one method to another. The generalized methods using the Einstein product give overall better results over all subspace dimensions compared to their counterparts, except for ONPP in the small-$d$ case. The multiple-weight methods show varying performance and outperform the single-weight ones in some cases. Future work could consider how to aggregate the results of the individual weights in order to obtain more robust results.
The objective is to compare each method with its multi-dimensional counterparts via the Einstein product, e.g., the OLPP method with OLPP-E and OLPP-E-MW.
The superiority of the Einstein-based methods can be explained by the fact that they preserve the multi-linear and nonlinear structure of the data, which is lost when the data are vectorized, as in the other matricized methods.

7 Conclusion

The paper advances the field of dimension reduction by introducing refined graph-based methods and leveraging the Einstein product for tensor data. It extends both linear and nonlinear methods (supervised and unsupervised), as well as their variants, to higher-order tensors. The experiments are conducted on the GTDB and MNIST datasets, and the results are compared with state-of-the-art methods, showing competitive performance. Future work could address the generalization to trace-ratio methods such as Linear Discriminant Analysis. The computation could also be accelerated by using the tensor Golub-Kahan decomposition to approximate the eigen-tensors needed to construct the projected space.

References

  • [1] A. Álvarez-Meza, J. Valencia-Aguirre, G. Daza-Santacoloma, and G. Castellanos-Domínguez, Global and local choice of the number of nearest neighbors in locally linear embedding, Pattern Recognition Letters, 32 (2011), pp. 2171–2177.
  • [2] E. Barshan, A. Ghodsi, Z. Azimifar, and M. Z. Jahromi, Supervised principal component analysis: Visualization, classification and regression on subspaces and submanifolds, Pattern Recognition, 44 (2011), pp. 1357–1371.
  • [3] P. N. Belhumeur, J. P. Hespanha, and D. J. Kriegman, Eigenfaces vs. fisherfaces: Recognition using class specific linear projection, IEEE Transactions on pattern analysis and machine intelligence, 19 (1997), pp. 711–720.
  • [4] M. Belkin and P. Niyogi, Laplacian eigenmaps for dimensionality reduction and data representation, Neural computation, 15 (2003), pp. 1373–1396.
  • [5] Y. Bengio, J.-f. Paiement, P. Vincent, O. Delalleau, N. Roux, and M. Ouimet, Out-of-sample extensions for lle, isomap, mds, eigenmaps, and spectral clustering, Advances in neural information processing systems, 16 (2003).
  • [6] M. Brazell, N. Li, C. Navasca, and C. Tamon, Solving multilinear systems via tensor inversion, SIAM Journal on Matrix Analysis and Applications, 34 (2013), pp. 542–570.
  • [7] H.-T. Chen, H.-W. Chang, and T.-L. Liu, Local discriminant embedding and its variants, in 2005 IEEE computer society conference on computer vision and pattern recognition (CVPR’05), vol. 2, IEEE, 2005, pp. 846–853.
  • [8] J. A. Costa and A. Hero, Classification constrained dimensionality reduction, in Proceedings.(ICASSP’05). IEEE International Conference on Acoustics, Speech, and Signal Processing, 2005., vol. 5, IEEE, 2005, pp. v–1077.
  • [9] K. Pearson, LIII. On lines and planes of closest fit to systems of points in space, The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 2 (1901), pp. 559–572.
  • [10] B. Ghojogh, A. Ghodsi, F. Karray, and M. Crowley, Locally linear embedding and its variants: Tutorial and survey, arXiv preprint arXiv:2011.10925, (2020).
  • [11] A. E. Hachimi, K. Jbilou, M. Hached, and A. Ratnani, Tensor golub kahan based on einstein product, arXiv preprint arXiv:2311.03109, (2023).
  • [12] X. He and P. Niyogi, Locality preserving projections, Advances in neural information processing systems, 16 (2003).
  • [13] X. He, S. Yan, Y. Hu, P. Niyogi, and H.-J. Zhang, Face recognition using laplacianfaces, IEEE transactions on pattern analysis and machine intelligence, 27 (2005), pp. 328–340.
  • [14] E. Kokiopoulou and Y. Saad, Orthogonal neighborhood preserving projections, in Fifth IEEE International Conference on Data Mining (ICDM’05), IEEE, 2005, pp. 8–pp.
  • [15] E. Kokiopoulou and Y. Saad, Orthogonal neighborhood preserving projections: A projection-based dimensionality reduction technique, IEEE Transactions on Pattern Analysis and Machine Intelligence, 29 (2007), pp. 2143–2156.
  • [16] E. Kokiopoulou and Y. Saad, Enhanced graph-based dimensionality reduction with repulsion laplaceans, Pattern Recognition, 42 (2009), pp. 2392–2402.
  • [17] T. G. Kolda and B. W. Bader, Tensor decompositions and applications, SIAM review, 51 (2009), pp. 455–500.
  • [18] J. A. Lee, M. Verleysen, et al., Nonlinear dimensionality reduction, vol. 1, Springer, 2007.
  • [19] Q. Liu, R. Huang, H. Lu, and S. Ma, Face recognition using kernel-based fisher discriminant analysis, in Proceedings of Fifth IEEE International Conference on Automatic Face Gesture Recognition, IEEE, 2002, pp. 197–201.
  • [20] L. Sun, B. Zheng, C. Bu, and Y. Wei, Moore–Penrose inverse of tensors via Einstein product, Linear and Multilinear Algebra, 64 (2016), pp. 686–698.
  • [21] L. Qi and Z. Luo, Tensor analysis: spectral theory and special tensors, SIAM, 2017.
  • [22] B. Raducanu and F. Dornaika, A supervised non-linear dimensionality reduction approach for manifold learning, Pattern Recognition, 45 (2012), pp. 2432–2444.
  • [23] S. T. Roweis and L. K. Saul, Nonlinear dimensionality reduction by locally linear embedding, science, 290 (2000), pp. 2323–2326.
  • [24] L. K. Saul, K. Q. Weinberger, F. Sha, J. Ham, and D. D. Lee, Spectral methods for dimensionality reduction, in Semi-Supervised Learning, The MIT Press, 09 2006.
  • [25] M. Tai, M. Kudo, A. Tanaka, H. Imai, and K. Kimura, Kernelized supervised laplacian eigenmap for visualization and classification of multi-label data, Pattern Recognition, 123 (2022), p. 108399.
  • [26] M. A. Turk and A. P. Pentland, Face recognition using eigenfaces, in Proceedings. 1991 IEEE computer society conference on computer vision and pattern recognition, IEEE Computer Society, 1991, pp. 586–587.
  • [27] Q.-W. Wang and X. Xu, Iterative algorithms for solving some tensor equations, Linear and Multilinear Algebra, 67 (2019), pp. 1325–1349.
  • [28] Y. Wang and Y. Wei, Generalized eigenvalue for even order tensors via einstein product and its applications in multilinear control systems, Computational and Applied Mathematics, 41 (2022), p. 419.
  • [29] A. R. Webb, K. D. Copsey, and G. Cawley, Statistical pattern recognition, vol. 2, Wiley Online Library, 2011.
  • [30] M.-H. Yang, Face recognition using kernel methods, Advances in neural information processing systems, 14 (2001).
  • [31] S.-q. Zhang, Enhanced supervised locally linear embedding, Pattern Recognition Letters, 30 (2009), pp. 1208–1218.
  • [32] W. Zhang, X. Xue, H. Lu, and Y.-F. Guo, Discriminant neighborhood embedding for classification, Pattern Recognition, 39 (2006), pp. 2240–2243.
  • [33] L. Zhao and Z. Zhang, Supervised locally linear embedding with probability-based distance for classification, Computers & Mathematics with Applications, 57 (2009), pp. 919–926.