
Uniform tensor clustering by jointly exploring sample affinities of various orders

Hongmin Cai1,∗, Fei Qi1,2, Junyu Li1, Yu Hu1, Yue Zhang3, Yiu-ming Cheung5, and Bin Hu4,∗ 1School of Computer Science and Engineering, South China University of Technology, Guangzhou, 510006, China. 2School of Data Science and Information Engineering, Guizhou Minzu University, China. 3School of Computer Science, Guangdong Polytechnic Normal University, Guangzhou 510665, China. 4School of Medical Technology, Beijing Institute of Technology, Beijing, China. 5Department of Computer Science, Hong Kong Baptist University, Hong Kong SAR, China. ∗Corresponding authors, E-mail: [email protected], [email protected].
Abstract

Conventional clustering methods based on pairwise affinity usually suffer from the concentration effect when processing data with huge feature dimensionality yet low sample sizes, so the sample proximity is encoded inaccurately and the clustering performance becomes suboptimal. To address this issue, we propose a unified tensor clustering method (UTC) that characterizes sample proximity using the affinity among multiple samples, thereby supplying rich spatial information on the sample distribution to boost clustering. Specifically, we find that the triadic tensor affinity can be constructed via the Khatri-Rao product of two affinity matrices. Furthermore, our early work shows that the fourth-order tensor affinity is defined by the Kronecker product. Therefore, we utilize the arithmetical, Khatri-Rao, and Kronecker products to mathematically integrate affinities of different orders into a unified tensor clustering framework, and UTC learns a joint low-dimensional embedding that combines the various orders. Finally, a numerical scheme is designed to solve the problem. Experiments on synthetic and real-world datasets demonstrate that 1) high-order tensor affinity provides a characterization of sample proximity that is complementary to the popular affinity matrix; and 2) the proposed UTC enhances clustering by exploiting affinities of different orders when processing high-dimensional data.

Index Terms:
High-order affinity, Clustering, Fusing affinity, Tensor, Spectral graph.

1 Introduction

Clustering aims to partition unlabeled data into distinct groups [1]. Traditional methods, such as K-means [2] and spectral clustering [3, 4], have been well studied, and many variants have been designed to cater to various needs. However, with advances in observational equipment, an object of interest can be associated with extremely high-dimensional features. For example, in bioinformatics, most single cells isolated from tissue are damaged during sequencing. Thus, only a limited number of cells can be successfully sequenced, resulting in a sample matrix with a few samples and millions of gene reads [5, 6]. Similar scenarios are arising across various fields, such as computer vision [7, 8] and natural language processing. Accurately clustering high-dimension (m) yet low-sample-size (n) (HDLSS) data remains challenging when n\ll m [9, 10, 11].

The critical challenge that hinders the clustering of HDLSS data is the so-called concentration effect [12, 13], which refers to the phenomenon that the pairwise Euclidean distances among samples collapse to a constant, so that the sample-wise affinity becomes indiscriminative in high feature dimensions [14, 15, 16]. This effect is an insurmountable roadblock for most clustering methods, which rely on the pairwise affinity of samples, and prevents them from achieving precise clustering. Early works attempted to learn adaptive sample affinities as data-driven metrics. For example, CAN [17] dynamically learns the affinity matrix by assigning adaptive neighbors to each data point based on local distances. However, the local distance remains a pairwise characterization of sample affinity; it therefore still suffers from the concentration effect and performs unsatisfactorily.

To overcome such drawbacks, many researchers have proposed to incorporate the spatial distribution of multiple samples to enhance clustering. Such high-order characterization of the local structure of multiple points can effectively overcome the shortcomings of pairwise similarity and improve the clustering results [18, 19]. For example, [20] generalizes spectral methods operating on a pairwise similarity graph to a hypergraph that can represent high-order affinities and develops an algorithm to obtain the hypergraph embedding for clustering. [21] proposes to maintain a weighted graph that approximates the hypergraph generated by high-order associations and then uses a spectral method to accomplish graph partitioning. [22] treats the hypergraph clustering problem as a multiplayer noncooperative game from a game-theoretic perspective. Similar to the CAN method, [23] first proposes a way to learn a dynamic hypergraph for clustering. However, all of the above methods must eventually convert the hypergraph into an approximate weighted graph. The clustering task is therefore still performed on pairwise sample affinities, so the higher-order affinity is not properly exploited to overcome the drawbacks of pairwise relations.

Recently, the use of tensors to encode high-order affinities has been studied in clustering. [24] introduces the characteristic tensor of a hypergraph and its associated Z-eigenvalue, providing natural relations for the structure of data. [25] denotes the order-k affinities by a symmetric tensor and relaxes hypergraph clustering to a multilinear singular value decomposition problem. However, these methods can use only a single specific order of affinity for clustering, which cannot completely express the data structure. Our early work IPS2 [26] concentrates on using the high-order structural information of data obtained from relationships between two sample pairs to enhance clustering performance, and establishes an equivalence between the even-order affinity and the pairwise second-order affinity. However, two problems remain: (1) the connection between odd-order tensor affinity and the second-order matrix affinity is unknown; and (2) how to jointly utilize affinities of different orders to accomplish clustering has yet to be studied.

To address these two questions, this paper first establishes a connection between the second-order and third-order tensor affinities via the Khatri-Rao product. Then, we propose a unified tensor clustering (UTC) model to learn a low-dimensional embedding by jointly using affinities of various orders. The popular second-order matrix affinity and the triadic and tetradic tensor affinities are fused to help characterize the proximity among samples.

The main advantages of the method proposed in this paper are summarized as follows.

  • 1)

    A connection between pairwise matrix affinity and the triadic tensor affinity among three samples is established. An undecomposable third-order tensor is designed to provide the geometric proximity among three samples, thus serving as a supplement to the pairwise similarity matrix.

  • 2)

    The third-order tensor affinity is defined via the Khatri-Rao product, while the Kronecker product is used to define the fourth-order tensor affinity. Therefore, the popular matrix affinity defined by the arithmetical product is surprisingly related to the Khatri-Rao and Kronecker products.

  • 3)

    A uniform tensor clustering method (UTC) is developed. The UTC learns a low-dimensional embedding jointly based on affinities of various orders. The UTC is formulated elegantly and works on affinities of any order.

  • 4)

    Extensive experiments on HDLSS data demonstrate that UTC achieves superior performance compared with baseline methods. In addition, the experiments show that using high-order affinities to characterize the sample spatial distribution improves clustering performance, especially in cases of small sample sizes.

Figure 1: Overview of the UTC method, which learns a uniform embedding via joint optimization over not only the pairwise matrix affinity but also the triadic and tetradic tensor affinities among more than two samples. The high-order tensor affinities provide complementary spatial characterization where the pairwise affinity fails, so the low-dimensional embedding learned by UTC remains stable and robust to the high-dimensional concentration effect.

2 NOTATIONS AND PRELIMINARIES

2.1 Notations

Throughout the paper, lowercase letters represent scalars, bold lowercase letters represent vectors, such as \bm{v}\in\mathbb{R}^{m}, bold uppercase letters denote matrices, such as \bm{M}\in\mathbb{R}^{m\times n}, and bold calligraphic letters represent tensors, such as \mathcal{T}\in\mathbb{R}^{m\times n\times l}. Let \mathcal{T}(:,:,i), \mathcal{T}(:,i,:), and \mathcal{T}(i,:,:) represent the i-th frontal, lateral, and horizontal slices of an order-3 tensor \mathcal{T}\in\mathbb{R}^{n_{1}\times n_{2}\times n_{3}}, respectively. \mathcal{T}(:,:,i) can be abbreviated as \mathcal{T}^{(i)}. A tensor can be reformatted into a matrix via a series of operations called unfolding. For example, the third-order and fourth-order tensor unfolding operations are defined as follows:

Definition 2.1.

Unfolding a 3-order Tensor: Let \mathcal{T}_{3}\in\mathbb{R}^{m\times m\times m} be an order-3 m-dimensional tensor. It can be unfolded into an m^{2}\times m matrix \hat{\bm{T}}_{3} as follows:

𝑻3^=unfold(𝒯3)=[𝒯3(1)𝒯3(2)𝒯3(m)]\displaystyle\hat{\bm{T}_{3}}=\operatorname{unfold}(\mathcal{T}_{3})=\left[\begin{array}[]{c}\mathcal{T}_{3}^{(1)}\\ \mathcal{T}_{3}^{(2)}\\ \vdots\\ \mathcal{T}_{3}^{\left(m\right)}\end{array}\right] (1)
Definition 2.2.

Unfolding a 4-order Tensor: Let \mathcal{T}_{4}\in\mathbb{R}^{m\times m\times m\times m} represent a fourth-order m-dimensional tensor. It can be unfolded into an m^{2}\times m^{2} matrix \hat{\bm{T}}_{4}, with its (r,s)-th entry given by:

𝑻^4rs=𝒯4ijkl\displaystyle\hat{\bm{T}}_{4_{rs}}=\mathcal{T}_{4_{ijkl}} (2)

with r=m(j-1)+i, s=m(l-1)+k, and i,j,k,l\in[1,m].
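To make the index conventions concrete, the following short NumPy sketch (our own illustration, not part of the original formulation; the function names are ours) implements the unfolding operations of Definitions 2.1 and 2.2:

import numpy as np

def unfold3(T):
    # T: (m, m, m) tensor. Stack the frontal slices T[:, :, k] vertically,
    # giving an (m^2 x m) matrix as in Eq. (1).
    m = T.shape[0]
    return np.vstack([T[:, :, k] for k in range(m)])

def unfold4(T):
    # T: (m, m, m, m) tensor. Entry (r, s) of the (m^2 x m^2) matrix equals
    # T[i, j, k, l] with r = m(j-1)+i and s = m(l-1)+k (1-based), as in Eq. (2).
    m = T.shape[0]
    M = np.zeros((m * m, m * m))
    for i in range(m):
        for j in range(m):
            for k in range(m):
                for l in range(m):
                    M[m * j + i, m * l + k] = T[i, j, k, l]
    return M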

Three matrix multiplications are considered, namely, Hadamard product, Kronecker product and Khatri-Rao product [27].

Definition 2.3.

Hadamard Product: The Hadamard product of two matrices \bm{A}\in\mathbb{R}^{m\times n} and \bm{B}\in\mathbb{R}^{m\times n} is defined as:

𝑨𝑩=[a11b11a1nb1nam1bm1amnbmn]m×n\displaystyle\bm{A}\odot\bm{B}=\left[\begin{matrix}a_{11}b_{11}&\cdots&a_{1n}b_{1n}\\ \vdots&\ddots&\vdots\\ a_{m1}b_{m1}&\cdots&a_{mn}b_{mn}\\ \end{matrix}\right]\in\mathbb{R}^{m\times n} (3)
Definition 2.4.

Kronecker Product: The Kronecker product of two matrices \bm{A}\in\mathbb{R}^{m_{1}\times n_{1}} and \bm{B}\in\mathbb{R}^{m_{2}\times n_{2}} is defined as:

𝑨𝑩=[a11𝑩a1n𝑩am1𝑩amn𝑩]m1m2×n1n2\displaystyle\bm{A}\otimes\bm{B}=\left[\begin{matrix}a_{11}\bm{B}&\cdots&a_{1n}\bm{B}\\ \vdots&\ddots&\vdots\\ a_{m1}\bm{B}&\cdots&a_{mn}\bm{B}\\ \end{matrix}\right]\in\mathbb{R}^{m_{1}m_{2}\times n_{1}n_{2}} (4)
Definition 2.5.

Khatri-Rao Product: The Khatri-Rao product of two matrices \bm{A}\in\mathbb{R}^{m_{1}\times n} and \bm{B}\in\mathbb{R}^{m_{2}\times n} is defined as the matrix:

𝑨𝑩=[a:1b:1a:nb:n]m1m2×n\displaystyle\bm{A}*\bm{B}=\left[\begin{matrix}a_{:1}\otimes b_{:1}&\cdots&a_{:n}\otimes b_{:n}\end{matrix}\right]\in\mathbb{R}^{m_{1}m_{2}\times n} (5)
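The three products map directly onto standard NumPy primitives. The minimal sketch below (ours, for illustration only) mirrors Definitions 2.3-2.5, with the Khatri-Rao product written as the column-wise Kronecker product:

import numpy as np

A = np.arange(6, dtype=float).reshape(2, 3)
B = np.arange(6, 12, dtype=float).reshape(2, 3)

hadamard = A * B               # Eq. (3): entry-wise product, shape (2, 3)
kronecker = np.kron(A, B)      # Eq. (4): shape (m1*m2, n1*n2) = (4, 9)

def khatri_rao(A, B):
    # Eq. (5): column-wise Kronecker product; A and B must have the same number of columns.
    assert A.shape[1] == B.shape[1]
    return np.vstack([np.kron(A[:, j], B[:, j]) for j in range(A.shape[1])]).T

kr = khatri_rao(A, B)          # shape (m1*m2, n) = (4, 3)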

A popular product between a tensor and a matrix is called the k-mode product.

Definition 2.6.

k-mode Product: The k-mode product between an order-m tensor \bm{\mathcal{T}}\in\mathbb{R}^{n_{1}\times n_{2}\times\dots\times n_{m}} and a matrix \bm{V}\in\mathbb{R}^{p\times n_{k}}, denoted by \bm{\mathcal{T}}\otimes_{k}\bm{V}\in\mathbb{R}^{n_{1}\times\dots\times n_{k-1}\times p\times n_{k+1}\times\dots\times n_{m}}, is defined entrywise by

(𝓣k𝑽)i1ik1jik+1im=ik=1nk𝓣i1ik1ikik+1im𝑽jik\displaystyle(\bm{\mathcal{T}}\otimes_{k}\bm{V})_{i_{1}\dots i_{k-1}ji_{k+1}\dots i_{m}}=\sum^{n_{k}}_{i_{k}=1}\bm{\mathcal{T}}_{i_{1}\dots i_{k-1}i_{k}i_{k+1}\dots i_{m}}\bm{V}_{ji_{k}} (6)
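Definition 2.6 can be realized by bringing mode k to the front, multiplying by the matrix, and folding back. The following unoptimized NumPy sketch (our own illustration; the function name is ours) implements Eq. (6) with 0-based mode indexing:

import numpy as np

def mode_k_product(T, V, k):
    # Eq. (6): contract mode k (0-based here) of tensor T with matrix V of shape (p, n_k);
    # the k-th dimension n_k is replaced by p in the result.
    Tk = np.moveaxis(T, k, 0)                        # bring mode k to the front
    front, rest = Tk.shape[0], Tk.shape[1:]
    out = V @ Tk.reshape(front, -1)                  # (p, n_k) x (n_k, prod(rest))
    return np.moveaxis(out.reshape((V.shape[0],) + rest), 0, k)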

For convenience, we summarize the frequently used notations and definitions in Table I.

TABLE I: Summary of notations
𝒗m\bm{v}\in\mathbb{R}^{m} A vector
𝑴m×n\bm{M}\in\mathbb{R}^{m\times n} A matrix
𝒯n1×n2××nk\mathcal{T}\in\mathbb{R}^{n_{1}\times n_{2}\times\dots\times n_{k}} An order-k tensor
𝑻^k\hat{\bm{T}}_{k} The unfolding order-k tensor
Tr()Tr(\cdot) The trace operation
\odot The Hadamard product
* The Khatri-Rao product
\otimes The Kronecker product
k\otimes_{k} The k-mode product

2.2 Introduction of Spectral Clustering with Pairwise Similarity

Spectral clustering [3] starts by computing a sample-wise affinity matrix \bm{S}_{2}\in\mathbb{R}^{m\times m}. Every entry \bm{S}_{2}(i,j) measures the proximity between two samples x_{i} and x_{j}. Then, a normalized affinity matrix \hat{\bm{L}}_{2} is estimated as \hat{\bm{L}}_{2}=\bm{D}_{2}^{-\frac{1}{2}}\bm{S}_{2}\bm{D}_{2}^{-\frac{1}{2}}, where \bm{D}_{2} denotes the degree matrix of \bm{S}_{2} with \bm{D}_{2_{ii}}=\sum_{j}\bm{S}_{2_{ij}}. The Laplacian matrix \bm{L} is then given by \bm{L}=\bm{I}-\hat{\bm{L}}_{2}. Spectral graph clustering seeks a low-dimensional embedding of the samples that preserves the local proximity of paired samples:

min𝑽tr(𝑽T𝑳𝑽) Subject to: 𝑽T𝑽=𝑰\displaystyle\begin{aligned} &\min_{\bm{V}}\,\,tr(\bm{V}^{T}\bm{L}\bm{V})\\ \text{ Subject to:~{}}\quad&\bm{V}^{T}\bm{V}=\bm{I}\\ \end{aligned} (7)

This is a standard eigenvalue problem and thus can be solved efficiently by computing the eigenvectors \bm{V} of the Laplacian matrix \bm{L}, e.g., via singular value decomposition. Typically, the normalized cut of the graph is minimized in this model, but as stated in [28], the model can also be expressed as a maximization problem on the normalized pairwise affinity \hat{\bm{L}}_{2}. Using the tensor k-mode product, the spectral clustering problem in Eq. (7) can be rewritten as:

max𝑽tr(𝑽T𝑳^2𝑽)\displaystyle\max_{\bm{V}}\,\,tr(\bm{V}^{T}\hat{\bm{L}}_{2}\bm{V}) (8)
=\displaystyle= max𝑽j=1k𝑳^22𝒗jT1𝒗jT\displaystyle\max_{\bm{V}}\,\,\sum_{j=1}^{k}\hat{\bm{L}}_{2}\otimes_{2}\bm{v}_{j}^{T}\otimes_{1}\bm{v}_{j}^{T}
Subject to: 𝑽T𝑽=𝑰\displaystyle\text{ Subject to:~{}}\bm{V}^{T}\bm{V}=\bm{I}

where \bm{v}_{j} is the j-th column of the partition matrix \bm{V}. This equation establishes the relationship between the membership \bm{v}_{j} and the pairwise affinity. A clustering task is then performed on the obtained embedding.
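For reference, the sketch below (ours; a standard implementation rather than code released with this paper) computes \hat{\bm{L}}_{2} from a Gaussian-kernel affinity, an illustrative choice of kernel and bandwidth, and extracts the spectral embedding of Eqs. (7)-(8):

import numpy as np
from scipy.spatial.distance import cdist

def spectral_embedding(X, k, sigma=1.0):
    # X: (m, d) data matrix. Returns the top-k eigenvectors of the normalized affinity,
    # i.e., the V that maximizes tr(V^T L2_hat V) subject to V^T V = I (Eq. (8)).
    S2 = np.exp(-cdist(X, X, 'sqeuclidean') / (2 * sigma ** 2))  # Gaussian pairwise affinity
    d = S2.sum(axis=1)
    L2_hat = S2 / np.sqrt(np.outer(d, d))                        # D^{-1/2} S2 D^{-1/2}
    _, V = np.linalg.eigh(L2_hat)                                # eigenvalues in ascending order
    return V[:, -k:]
# A clustering step (e.g., K-means on the rows of the embedding) would follow.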

2.3 Introduction of IPS2 Clustering with Pair-to-Pair Similarity

The pairwise affinity is easily broken by noise contamination or concentration effects in HDLSS data. To address this issue, our recent work IPS2 [26] used high-order affinities among more than two samples. Specifically, IPS2 used a fourth-order tensor affinity to measure the proximity between two sample pairs. The decomposable fourth-order tensor affinity is defined as the product of two pairwise similarities:

𝒯4ijkl=𝑺2ik𝑺2jl,i,j,k,lm.\displaystyle\mathcal{T}_{4_{ijkl}}=\bm{S}_{2_{ik}}\bm{S}_{2_{jl}},\quad i,j,k,l\in m. (9)

Let 𝑻^4\hat{\bm{T}}_{4} denote the matrix unfolded from a fourth-order decomposable tensor affinity 𝒯4\mathcal{T}_{4}; it has been shown in  [26] that

𝑻^4=𝑺2𝑺2\displaystyle\hat{\bm{T}}_{4}=\bm{S}_{2}\otimes\bm{S}_{2} (10)

This result establishes a direct connection between the fourth-order decomposable tensor affinity and the usual second-order matrix affinity \bm{S}_{2}. Now, consider a normalized affinity matrix \hat{\bm{L}}_{4}, computed as \hat{\bm{L}}_{4}=\hat{\bm{D}}_{4}^{-\frac{1}{2}}\hat{\bm{T}}_{4}\hat{\bm{D}}_{4}^{-\frac{1}{2}}. Here, the diagonal matrix \hat{\bm{D}}_{4} is the degree matrix of \hat{\bm{T}}_{4} with diagonal elements \hat{\bm{D}}_{4_{ii}}=\sum_{j}\hat{\bm{T}}_{4_{ij}}. It has been shown that the fourth-order and second-order normalized affinity matrices \hat{\bm{L}}_{4} and \hat{\bm{L}}_{2} share a relation similar to Eq. (10):

Theorem 2.1.

The decomposable 4-order tensor [26]: Let \hat{\bm{L}}_{2} and \mathcal{L}_{4} denote the normalized similarity matrix and the fourth-order normalized affinity tensor with unfolded form \hat{\bm{L}}_{4}, respectively. Then, the following equality holds:

𝑳^4=𝑳^2𝑳^2\displaystyle\hat{\bm{L}}_{4}=\hat{\bm{L}}_{2}\otimes\hat{\bm{L}}_{2} (11)

Moreover, the eigenvectors \bm{v} and \hat{\bm{v}} of the matrix \hat{\bm{L}}_{2} and the unfolded matrix \hat{\bm{L}}_{4} satisfy:

𝒗^=𝒗𝒗\displaystyle\hat{\bm{v}}=\bm{v}\otimes\bm{v} (12)
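Theorem 2.1 is easy to verify numerically; the sketch below (our own check, not code from the original work) confirms Eq. (11) and the eigenvector relation of Eq. (12) for a random symmetric affinity:

import numpy as np

m = 5
S2 = np.random.rand(m, m); S2 = (S2 + S2.T) / 2       # random symmetric pairwise affinity
d = S2.sum(axis=1)
L2_hat = S2 / np.sqrt(np.outer(d, d))

T4_hat = np.kron(S2, S2)                              # Eq. (10)
d4 = T4_hat.sum(axis=1)
L4_hat = T4_hat / np.sqrt(np.outer(d4, d4))
assert np.allclose(L4_hat, np.kron(L2_hat, L2_hat))   # Eq. (11)

w, V = np.linalg.eigh(L2_hat)
v = V[:, -1]                                          # an eigenvector of L2_hat
assert np.allclose(L4_hat @ np.kron(v, v),
                   w[-1] ** 2 * np.kron(v, v))        # Eq. (12): v kron v is an eigenvector of L4_hat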

One can reshape the eigenvector \hat{\bm{v}} into a matrix \bm{V}\in\mathbb{R}^{m\times m}, and this matrix \bm{V} is an eigenmatrix of the unfolded fourth-order tensor affinity \hat{\bm{T}}_{4} in Eq. (10). The established equivalence between the decomposable tensor affinity \mathcal{T}_{4} and any similarity matrix \bm{S} opens a new path to understanding high-order affinity. However, the fourth-order decomposable affinity inherits the drawback of the low-order similarity [29]: it is vulnerable to noise corruption and concentration effects when applied to high-dimensional samples. Alternatively, one can use an undecomposable tensor affinity to provide complementary sample affinities that the pairwise matrix affinity cannot provide. IPS2 uses a Fisher-ratio-like tensor affinity:

𝒯ijkl=exp(σdij+dkldik+djl+ε)\mathcal{T}_{ijkl}=exp(-\sigma\frac{d_{ij}+d_{kl}}{d_{ik}+d_{jl}+\varepsilon}) (13)

for i,j,k,l\in[1,m], where d_{ij} denotes the distance between samples x_{i} and x_{j}, and the parameter \sigma is a scaling constant.
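A direct, dense construction of the tensor in Eq. (13) can be sketched as follows (our own illustration; it stores all m^4 entries and is only meant to make the indexing explicit, whereas a KNN-restricted sparse construction would be used in practice, cf. Section 4.5):

import numpy as np
from scipy.spatial.distance import cdist

def fourth_order_affinity(X, sigma=1.0, eps=1e-8):
    # X: (m, d) samples. Dense tensor of Eq. (13):
    # T[i, j, k, l] = exp(-sigma * (d_ij + d_kl) / (d_ik + d_jl + eps)).
    d = cdist(X, X)
    num = d[:, :, None, None] + d[None, None, :, :]    # d_ij + d_kl
    den = d[:, None, :, None] + d[None, :, None, :]    # d_ik + d_jl
    return np.exp(-sigma * num / (den + eps))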

One can calculate the normalized tensor affinity \mathcal{L}_{4} from the fourth-order tensor affinity \mathcal{T}_{4}. Then, a low-dimensional embedding \bm{V} can be learned from the fourth-order tensor affinity by solving the following maximization problem:

max𝑽\displaystyle\max_{\bm{V}} j=1k44𝒗j3𝒗j2𝒗j1𝒗j\displaystyle\sum_{j=1}^{k}\mathcal{L}_{4}\otimes_{4}\bm{v}_{j}\otimes_{3}\bm{v}_{j}\otimes_{2}\bm{v}_{j}\otimes_{1}\bm{v}_{j} (14)
=tr((𝑽𝑽)T𝑳^4(𝑽𝑽))\displaystyle=tr((\bm{V}*\bm{V})^{T}\hat{\bm{L}}_{4}(\bm{V}*\bm{V}))
s.t.𝑽T𝑽=𝑰\displaystyle s.t.\bm{V}^{T}\bm{V}=\bm{I}

3 Method

The performance of clustering based on pairwise affinity suffers on high-dimensional, small-sample-size datasets. Incorporating the spatial distribution of multiple samples, instead of using only pairwise affinity, improves the clustering performance. IPS2 [26] built a bridge between the fourth-order affinity and a low-dimensional embedding, showing that high-order information can effectively complement the description of the spatial data structure given by low-order information. However, it is restricted to even-order affinities, such as the fourth-order affinity, and does not apply to odd-order affinities. Moreover, it fails to fuse affinities of different orders within a uniform formulation, making it cumbersome to use in empirical applications. To address these two issues, we (1) introduce triadic and tetradic tensor affinities for three and four samples, and (2) propose a uniform framework to learn a low-dimensional embedding by fusing affinities of various orders. An illustrative overview of UTC is shown in Fig. 1.

3.1 Characterizing Sample’s Affinity via Third-Order Tensor Affinity

3.1.1 Decomposable Third-Order Tensor Affinity

Popular clustering methods are based on the pairwise similarity of two samples. We are interested in the tensor affinity among three samples, denoted by a third-order tensor \mathcal{T}_{3}=[\mathcal{T}_{ijk}]\in\mathbb{R}^{m\times m\times m}. A natural way to define such an affinity is to use the composite affinity based on paired affinities, such that each entry \mathcal{T}_{ijk} is given by:

𝒯3ijk=𝑺2ij𝑺2kj\displaystyle\mathcal{T}_{3_{ijk}}=\bm{S}_{2_{ij}}\bm{S}_{2_{kj}} (15)

Under this definition, one can generalize the pairwise affinity 𝑺2\bm{S}_{2} to a third-order tensor affinity 𝒯3\mathcal{T}_{3}. Specifically, the results satisfy the following properties.

Theorem 3.1.

Given the decomposable third-order tensor affinity \mathcal{T}_{3} defined in Eq. (15), \hat{\bm{T}}_{3}=\bm{S}_{2}*\bm{S}_{2}, where \hat{\bm{T}}_{3} is the matrix unfolded from the tensor \mathcal{T}_{3}.

Proof.

By the definition in Eq. (1), the unfolded matrix \hat{\bm{T}}_{3}\in\mathbb{R}^{m^{2}\times m} can be written as:

\displaystyle\hat{\bm{T}}_{3}=\left[\begin{matrix}\mathcal{T}_{3_{111}}&\dots&\mathcal{T}_{3_{1m1}}\\ \vdots&\ddots&\vdots\\ \mathcal{T}_{3_{m1m}}&\dots&\mathcal{T}_{3_{mmm}}\\ \end{matrix}\right] \quad (16)
=\left[\begin{matrix}\bm{S}_{2_{11}}\bm{S}_{2_{11}}&\dots&\bm{S}_{2_{1m}}\bm{S}_{2_{1m}}\\ \vdots&\ddots&\vdots\\ \bm{S}_{2_{11}}\bm{S}_{2_{m1}}&\dots&\bm{S}_{2_{1m}}\bm{S}_{2_{mm}}\\ \vdots&\ddots&\vdots\\ \bm{S}_{2_{m1}}\bm{S}_{2_{11}}&\dots&\bm{S}_{2_{mm}}\bm{S}_{2_{1m}}\\ \vdots&\ddots&\vdots\\ \bm{S}_{2_{m1}}\bm{S}_{2_{m1}}&\dots&\bm{S}_{2_{mm}}\bm{S}_{2_{mm}}\\ \end{matrix}\right]
=\left[\begin{matrix}\bm{S}_{2_{11}}\bm{S}_{2_{:1}}&\dots&\bm{S}_{2_{1m}}\bm{S}_{2_{:m}}\\ \vdots&\ddots&\vdots\\ \bm{S}_{2_{m1}}\bm{S}_{2_{:1}}&\dots&\bm{S}_{2_{mm}}\bm{S}_{2_{:m}}\\ \end{matrix}\right]
=\left[\begin{matrix}\bm{S}_{2_{:1}}\otimes\bm{S}_{2_{:1}}&\dots&\bm{S}_{2_{:m}}\otimes\bm{S}_{2_{:m}}\\ \end{matrix}\right]
=\bm{S}_{2}*\bm{S}_{2}

Surprisingly, this theorem shows that a decomposable third-order tensor affinity can be obtained via the Khatri-Rao product of two affinity matrices. We now proceed to define a normalized affinity tensor \mathcal{L}_{3}. Let the matrix \hat{\bm{L}}_{3} be calculated as \hat{\bm{L}}_{3}=\hat{\bm{D}}_{3_{1}}^{-\frac{1}{2}}\hat{\bm{T}}_{3}\hat{\bm{D}}_{3_{2}}^{-\frac{1}{2}}, where the two diagonal matrices are \hat{\bm{D}}_{3_{1}}=\bm{D}_{2}\otimes\bm{D}_{2} and \hat{\bm{D}}_{3_{2}}=\bm{D}_{2}\odot\bm{D}_{2}, with \bm{D}_{2} being the degree matrix of \bm{S}_{2}. The normalized third-order affinity tensor \mathcal{L}_{3} is obtained by folding the matrix \hat{\bm{L}}_{3}.

This defined affinity tensor shares a similar property to that of the fourth-order affinity tensor.

Theorem 3.2.

Let \bm{S}_{2} be a pairwise affinity matrix with \hat{\bm{L}}_{2} its normalized affinity matrix, and let the decomposable third-order tensor affinity \mathcal{T}_{3} and its normalized tensor \mathcal{L}_{3} be obtained as above. Then we have

𝑳^3=𝑳^2𝑳^2\displaystyle\hat{\bm{L}}_{3}=\hat{\bm{L}}_{2}*\hat{\bm{L}}_{2} (17)

where 𝐋^3\hat{\bm{L}}_{3} is the unfolded matrix from the tensor 3\mathcal{L}_{3}.

Proof.

By the definitions of the normalized affinity matrix 𝑳^2\hat{\bm{L}}_{2} and 𝑳^3\hat{\bm{L}}_{3}, we have:

𝑳^3=\displaystyle\hat{\bm{L}}_{3}= 𝑫^3112𝑻3^𝑫^3212\displaystyle\hat{\bm{D}}_{3_{1}}^{-\frac{1}{2}}\hat{\bm{T}_{3}}\hat{\bm{D}}_{3_{2}}^{-\frac{1}{2}} (18)
=\displaystyle= (𝑫2𝑫2)12(𝑺2𝑺2)(𝑫2𝑫2)12\displaystyle(\bm{D}_{2}\otimes\bm{D}_{2})^{-\frac{1}{2}}(\bm{S}_{2}*\bm{S}_{2})(\bm{D}_{2}\odot\bm{D}_{2})^{-\frac{1}{2}}
=\displaystyle= (𝑫212𝑺2𝑫212)(𝑫212𝑺2𝑫212)\displaystyle(\bm{D}_{2}^{-\frac{1}{2}}\bm{S}_{2}\bm{D}_{2}^{-\frac{1}{2}})*(\bm{D}_{2}^{-\frac{1}{2}}\bm{S}_{2}\bm{D}_{2}^{-\frac{1}{2}})
=\displaystyle= 𝑳^2𝑳^2\displaystyle\hat{\bm{L}}_{2}*\hat{\bm{L}}_{2}
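Theorems 3.1 and 3.2 can be checked numerically on a random affinity matrix. The sketch below (ours) builds the decomposable tensor of Eq. (15), unfolds it by Eq. (1), and compares the results with the Khatri-Rao products \bm{S}_{2}*\bm{S}_{2} and \hat{\bm{L}}_{2}*\hat{\bm{L}}_{2}:

import numpy as np

def khatri_rao(A, B):
    return np.vstack([np.kron(A[:, j], B[:, j]) for j in range(A.shape[1])]).T

m = 5
S2 = np.random.rand(m, m); S2 = (S2 + S2.T) / 2           # random symmetric pairwise affinity
T3 = np.einsum('ij,kj->ijk', S2, S2)                      # Eq. (15): T[i, j, k] = S_ij * S_kj
T3_hat = np.vstack([T3[:, :, k] for k in range(m)])       # unfolding, Eq. (1)
assert np.allclose(T3_hat, khatri_rao(S2, S2))            # Theorem 3.1

d = S2.sum(axis=1)
L2_hat = S2 / np.sqrt(np.outer(d, d))
D31 = np.kron(d, d)                                       # diagonal of D2 kron D2
D32 = d * d                                               # diagonal of D2 Hadamard D2
L3_hat = (T3_hat / np.sqrt(D31)[:, None]) / np.sqrt(D32)[None, :]
assert np.allclose(L3_hat, khatri_rao(L2_hat, L2_hat))    # Theorem 3.2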

Theorems 3.1 and 3.2 build a connection between the third-order affinity \mathcal{T}_{3} and the pairwise similarity \bm{S}_{2} via the Khatri-Rao product. To illustrate this, we construct a synthetic dataset consisting of eighty samples evenly drawn from two different clusters and visualize the pairwise affinity \bm{S}_{2} together with the decomposable third-order and fourth-order tensor affinities in Fig. 2. One can observe that all of them preserve the sample proximity satisfactorily: each affinity can be clearly segmented into distinct blocks, with large proximity within each diagonal block and small proximity across blocks.

Figure 2: Demonstration of the (a) pairwise, (b) third-order, and (c) fourth-order affinities on a synthetic dataset drawn from a normal distribution. Obvious block structures appear in the heatmaps of the unfolded affinities of different orders, indicating that the decomposable similarities of different orders behave consistently.

3.1.2 Indecomposable Third-Order Tensor Affinity

After establishing the connection between the decomposable affinity and the pairwise similarity, we analyze the local spatial relations to obtain information that supplements the pairwise relations. As stated in [30], when the order is two, the spectral method can be viewed as extracting an eigenvector \bm{v} that contains the important underlying manifold information of the data. We generalize this idea to high-order affinity. Different third-order information can be extracted by examining three samples from different views. As an example, the affinity among three samples is defined as follows: for any three samples x_{i}, x_{j}, and x_{k}, we can always use one sample x_{j} as an anchor point for the other samples x_{i} and x_{k} and measure the relationship between the two other points from the anchor's perspective. If x_{i} and x_{k} are sufficiently similar, they remain similar no matter where the anchor point is located; for outliers, in contrast, the similarity to other data should be small under most anchors' perspectives. Therefore, the undecomposable affinity \mathcal{T}_{3}\in\mathbb{R}^{m\times m\times m} among three points is defined by:

𝒯3ijk=<(xixj),(xkxj)>dijdjk\mathcal{T}_{3_{ijk}}=\frac{<(x_{i}-x_{j}),(x_{k}-x_{j})>}{d_{ij}d_{jk}} (19)

for i,j,k\in[1,m], where d_{ij} denotes the distance between samples x_{i} and x_{j}. The entry \mathcal{T}_{3_{ijk}} denotes the similarity between x_{i} and x_{k} under x_{j}'s view: if the distance between x_{i} and x_{k} is small, the entry takes a large value regardless of the location of x_{j}. We can then unfold \mathcal{T}_{3} into a matrix \hat{\bm{T}}_{3}\in\mathbb{R}^{m^{2}\times m} along the mode-3 direction. As for the decomposable tensor affinity, we obtain the third-order normalized affinity tensor \mathcal{L}_{3}, whose mode-3 unfolding satisfies \hat{\bm{L}}_{3}=\hat{\bm{D}}_{3_{1}}^{-\frac{1}{2}}\hat{\bm{T}}_{3}\hat{\bm{D}}_{3_{2}}^{-\frac{1}{2}}, with \hat{\bm{D}}_{3_{1}} and \hat{\bm{D}}_{3_{2}} the corresponding degree matrices of \hat{\bm{T}}_{3}. Analogously to the pairwise and tetradic cases, we seek a low-dimensional embedding \bm{V} from the third-order tensor affinity by solving:

max𝑽\displaystyle\max_{\bm{V}} j=1k33𝒗j2𝒗j1𝒗j\displaystyle\sum_{j=1}^{k}\mathcal{L}_{3}\otimes_{3}\bm{v}_{j}\otimes_{2}\bm{v}_{j}\otimes_{1}\bm{v}_{j} (20)
=max𝑽\displaystyle=\max_{\bm{V}}{} tr((𝑽𝑽)T𝑳^3𝑽)\displaystyle tr((\bm{V}*\bm{V})^{T}\hat{\bm{L}}_{3}\bm{V})
s.t.𝑽T𝑽=𝑰\displaystyle s.t.\bm{V}^{T}\bm{V}=\bm{I}
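A dense construction of the anchor-based triadic affinity in Eq. (19) is sketched below (our own illustration; as with the tetradic affinity, a KNN-restricted sparse version would be preferred in practice):

import numpy as np
from scipy.spatial.distance import cdist

def triadic_affinity(X, eps=1e-8):
    # X: (m, d) samples. Dense tensor of Eq. (19):
    # T[i, j, k] = <x_i - x_j, x_k - x_j> / (d_ij * d_jk), the cosine of the angle at anchor x_j.
    d = cdist(X, X) + eps                            # eps avoids division by zero when i == j
    diff = X[:, None, :] - X[None, :, :]             # diff[i, j] = x_i - x_j
    inner = np.einsum('ija,kja->ijk', diff, diff)    # <x_i - x_j, x_k - x_j>
    return inner / (d[:, :, None] * d[None, :, :])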

3.2 Uniform Tensor Clustering

We now formulate a uniform framework to learn a low-dimensional embedding from affinities of various orders. Suppose we have m samples \bm{X}=[\bm{x}_{1},\bm{x}_{2},...,\bm{x}_{m}]\in\mathbb{R}^{n\times m}, each described by n features. The aim of clustering is to assign the samples to k disjoint clusters C_{1},C_{2},\dots,C_{k}. We introduce a quantity, called the total normalized similarity, to evaluate a sample assignment:

Definition 3.1.

The total N-similarity of a cluster: Let \bm{C}\subseteq\bm{X} be a group of samples and \mathcal{S}\in\mathbb{R}^{n_{1}\times n_{2}\dots\times n_{k}} be the order-k similarity tensor. The total normalized similarity of \bm{C} is defined as

Sim(𝑪)=xi1,xi2,,xik𝑪xi1,xi2,,xik\displaystyle Sim(\bm{C})=\sum_{x_{i_{1}},x_{i_{2}},\dots,x_{i_{k}}\in\bm{C}}\mathcal{L}_{x_{i_{1}},x_{i_{2}},\dots,x_{i_{k}}} (21)

where \mathcal{L} is the normalized affinity tensor calculated by 𝒮\mathcal{S}. In addition, for a partition 𝐂1,𝐂2,,𝐂k\bm{C}_{1},\bm{C}_{2},\dots,\bm{C}_{k} of 𝐗\bm{X}, we define the normalized associativity of the clustering as:

\displaystyle N\text{-}Assoc(\bm{C}_{1},\bm{C}_{2},\dots,\bm{C}_{k})=\sum^{k}_{i=1}\frac{Sim(\bm{C}_{i})}{|\bm{C}_{i}|^{m}}, \quad (22)

where |\bm{C}_{i}| denotes the number of samples in cluster \bm{C}_{i}.

Let an indicator matrix \bm{H}=[\bm{h}_{1},\bm{h}_{2},\dots,\bm{h}_{k}]\in\mathbb{R}^{m\times k} encode the sample assignment such that \bm{H}_{ij}=|\bm{C}_{j}|^{-1} if x_{i}\in\bm{C}_{j} and zero otherwise.

One can maximize the normalized associativity in Eq. (22) to seek an optimal sample assignment \bm{C}_{1},\dots,\bm{C}_{k}. Via algebraic transformation, this problem is equivalent to solving:

max𝑪1,,𝑪kj=1k(m𝒉jm1𝒉j1𝒉j)\displaystyle\max_{\bm{C}_{1},\cdots,\bm{C}_{k}}\sum_{j=1}^{k}(\mathcal{L}\otimes_{m}\bm{h}_{j}\otimes_{m-1}\bm{h}_{j}\dots\otimes_{1}\bm{h}_{j}) (23)

where 𝒉j\bm{h}_{j} denotes the jthj^{th} column of 𝑯\bm{H}.

One can verify that maximizing the normalized associativity is NP-hard. Alternatively, one can relax the binary assignment matrix to an orthonormal matrix \bm{V}\in\mathbb{R}^{m\times k} with \bm{V}^{T}\bm{V}=\bm{I}, dropping the binary assignment requirement. The simplified problem then becomes solving

max𝑽T𝑽=𝑰j=1k1𝒗j2𝒗jm𝒗j\displaystyle\max_{\bm{V}^{T}\bm{V}=\bm{I}}\sum_{j=1}^{k}\mathcal{L}\otimes_{1}\bm{v}_{j}\otimes_{2}\bm{v}_{j}\dots\otimes_{m}\bm{v}_{j} (24)

where 𝒗j\bm{v}_{j} represents the jthj^{th} column of 𝑽\bm{V}.

We now fuse the aforementioned undecomposable third-order and fourth-order tensor affinities with the pairwise affinity matrix to seek a uniform low-dimensional embedding. The high-order tensor affinities help address the vulnerability of pairwise affinity on high-dimensional datasets. The proposed model, named uniform tensor clustering (UTC), effectively fuses the similarity measurements among samples to yield a uniform low-dimensional embedding that approximates the partition matrix. It achieves this goal by maximizing the cluster associativity of various orders. The proposed model is formulated as:

min𝑽\displaystyle\min_{\bm{V}} j=1kL^22𝒗j1𝒗j33𝒗j2𝒗j1𝒗j\displaystyle\sum_{j=1}^{k}-{\hat{L}}_{2}\otimes_{2}\bm{v}_{j}\otimes_{1}\bm{v}_{j}-\mathcal{L}_{3}\otimes_{3}\bm{v}_{j}\otimes_{2}\bm{v}_{j}\otimes_{1}\bm{v}_{j} (25)
44𝒗j3𝒗j2𝒗j1𝒗j\displaystyle-\mathcal{L}_{4}\otimes_{4}\bm{v}_{j}\otimes_{3}\bm{v}_{j}\otimes_{2}\bm{v}_{j}\otimes_{1}\bm{v}_{j}
s.t𝑽T𝑽=𝑰\displaystyle s.t\,\,\bm{V}^{T}\bm{V}=\bm{I}

where \bm{v}_{j} is the j-th column of the matrix \bm{V}. This unified model fuses affinity matrices of various orders to yield a uniform low-dimensional embedding that is robust to noise contamination and high-dimensional concentration effects, and it enhances the clustering performance by exploiting spatial information from multiple (more than two) samples. Upon solving this problem, one obtains the embedding \bm{V}; the clustering task is then accomplished by applying K-means to the rows of the embedding matrix \bm{V} to obtain the final groups. Therefore, once the indecomposable tensor affinities of different orders are available, computing the fused low-dimensional embedding brings in the spatial information of the different orders and enhances the clustering performance.

3.3 Numerical Scheme to Solve UTC

The objective in Eq. (25) is differentiable. However, its gradient involves a third-order polynomial system that has no explicit solution and is therefore computationally intractable. To circumvent this issue, we introduce a slack variable \bm{V}_{2} to represent the term \bm{V}_{1}*\bm{V}_{1}, thereby avoiding the high-order polynomial. The model can be rewritten as

min𝑽1,𝑽2\displaystyle\min_{\bm{V}_{1},\bm{V}_{2}} tr(𝑽1T𝑳^2𝑽1+𝑽2T𝑳^3𝑽1+𝑽2T𝑳^4𝑽2)\displaystyle-tr(\bm{V}_{1}^{T}\hat{\bm{L}}_{2}\bm{V}_{1}+\bm{V}_{2}^{T}\hat{\bm{L}}_{3}\bm{V}_{1}+\bm{V}_{2}^{T}\hat{\bm{L}}_{4}\bm{V}_{2}) (26)
s.t.\displaystyle s.t. 𝑽1T𝑽1=𝑰\displaystyle\bm{V}_{1}^{T}\bm{V}_{1}=\bm{I}
𝑽1𝑽1=𝑽2\displaystyle\bm{V}_{1}*\bm{V}_{1}=\bm{V}_{2}

This problem can be solved by alternating direction minimization. Its augmented Lagrangian formulation is as follows:

min𝑽1,𝑽2\displaystyle\min_{\bm{V}_{1},\bm{V}_{2}} tr(𝑽1T𝑳^2𝑽1)tr(𝑽2T𝑳^3𝑽1)tr(𝑽2T𝑳^4𝑽2)\displaystyle-tr(\bm{V}_{1}^{T}\hat{\bm{L}}_{2}\bm{V}_{1})-tr(\bm{V}_{2}^{T}\hat{\bm{L}}_{3}\bm{V}_{1})-tr(\bm{V}_{2}^{T}\hat{\bm{L}}_{4}\bm{V}_{2}) (27)
+\displaystyle+ <𝒀1,𝑽1𝑽1𝑽2>+<𝒀2,𝑽1T𝑽1𝑰>\displaystyle<\bm{Y}_{1},\bm{V}_{1}*\bm{V}_{1}-\bm{V}_{2}>+<\bm{Y}_{2},\bm{V}_{1}^{T}\bm{V}_{1}-\bm{I}>
+\displaystyle+ μ2[𝑽1𝑽1𝑽2F2+𝑽1T𝑽1𝑰F2]\displaystyle\frac{\mu}{2}[\|\bm{V}_{1}*\bm{V}_{1}-\bm{V}_{2}\|_{F}^{2}+\|\bm{V}_{1}^{T}\bm{V}_{1}-\bm{I}\|_{F}^{2}]

where \bm{Y}_{1} and \bm{Y}_{2} are Lagrange multipliers and the constant \mu>0 is a penalty parameter. We then split the problem into two subproblems with respect to the variables \bm{V}_{1} and \bm{V}_{2}, solving for each variable while fixing the other and iterating until convergence.

Step 1): Solving the subproblem with respect to the variable V1\bm{V}_{1}

By fixing the variable 𝑽2\bm{V}_{2}, the problem can be simplified as:

\displaystyle\begin{aligned}\min_{\bm{V}_{1}}\ &-tr(\bm{V}_{1}^{T}\hat{\bm{L}}_{2}\bm{V}_{1})-tr(\bm{V}_{2}^{T}\hat{\bm{L}}_{3}\bm{V}_{1})\\ &+<\bm{Y}_{1},\bm{V}_{1}*\bm{V}_{1}-\bm{V}_{2}>+<\bm{Y}_{2},\bm{V}_{1}^{T}\bm{V}_{1}-\bm{I}>\\ &+\frac{\mu}{2}\left[\|\bm{V}_{1}*\bm{V}_{1}-\bm{V}_{2}\|_{F}^{2}+\|\bm{V}_{1}^{T}\bm{V}_{1}-\bm{I}\|_{F}^{2}\right]\end{aligned} \quad (28)

The gradient of the quadratic function is:

\displaystyle\frac{\partial L}{\partial\bm{V}_{1}}=-2\hat{\bm{L}}_{2}\bm{V}_{1}-\hat{\bm{L}}_{3}^{T}\bm{V}_{2}+\mu\sum_{j=1}^{k}(\bm{V}_{1_{:j}}\otimes\bm{I}+\bm{I}\otimes\bm{V}_{1_{:j}})^{T}(\bm{V}_{1_{:j}}*\bm{V}_{1_{:j}}-\bm{V}_{2_{:j}}+\bm{Y}_{1_{:j}}/\mu)+2\mu\bm{V}_{1}(\bm{V}_{1}^{T}\bm{V}_{1}-\bm{I}+\bm{Y}_{2}/\mu) \quad (29)

Thus, the problem in Eq. (28) can be solved via gradient descent or by setting the gradient function Eq. (29) to be zero to obtain the solution directly.

Step 2): Solving the subproblem with respect to the variable V2\bm{V}_{2}

By fixing the variable 𝑽1\bm{V}_{1}, the augmented Lagrange function can be simplified as:

\displaystyle\begin{aligned}\min_{\bm{V}_{2}}\ &-tr(\bm{V}_{2}^{T}\hat{\bm{L}}_{3}\bm{V}_{1})-tr(\bm{V}_{2}^{T}\hat{\bm{L}}_{4}\bm{V}_{2})\\ &+<\bm{Y}_{1},\bm{V}_{1}*\bm{V}_{1}-\bm{V}_{2}>+\frac{\mu}{2}\|\bm{V}_{1}*\bm{V}_{1}-\bm{V}_{2}\|_{F}^{2}\end{aligned} \quad (30)

The gradient of the objective function is:

L𝑽2=𝑳^3𝑽12𝑳^4𝑽2+μ(𝑽2𝑽1𝑽1𝒀1μ)\displaystyle\frac{\partial L}{\partial\bm{V}_{2}}=-\hat{\bm{L}}_{3}\bm{V}_{1}-2\hat{\bm{L}}_{4}\bm{V}_{2}+\mu(\bm{V}_{2}-\bm{V}_{1}*\bm{V}_{1}-\frac{\bm{Y}_{1}}{\mu}) (31)

By setting the gradient to zero, one can obtain the implicit solution as:

\displaystyle\bm{V}_{2}^{*}=(\mu\bm{I}-2\hat{\bm{L}}_{4})^{-1}(\mu\bm{V}_{1}*\bm{V}_{1}+\hat{\bm{L}}_{3}\bm{V}_{1}+\bm{Y}_{1}) \quad (32)

Step 3): Updating the multipliers Y1\bm{Y}_{1} and Y2\bm{Y}_{2}:

\displaystyle\bm{Y}_{1}^{(k+1)}=\bm{Y}_{1}^{(k)}+\mu(\bm{V}_{1}^{(k)}*\bm{V}_{1}^{(k)}-\bm{V}_{2}^{(k)}) \quad (33)
\displaystyle\bm{Y}_{2}^{(k+1)}=\bm{Y}_{2}^{(k)}+\mu({\bm{V}_{1}^{(k)}}^{T}\bm{V}_{1}^{(k)}-\bm{I}) \quad (34)

The three steps are updated iteratively until convergence or until the stopping criterion is met: \max(\|\bm{V}_{1}^{k+1}-\bm{V}_{1}^{k}\|_{\infty},\|\bm{V}_{2}^{k+1}-\bm{V}_{2}^{k}\|_{\infty},\|\bm{V}_{1}^{k+1}*\bm{V}_{1}^{k+1}-\bm{V}_{2}^{k+1}\|_{\infty})<\epsilon. The alternating strategy used to solve the problem is summarized in Algorithm 1.

Algorithm 1 High-Order Affinity Clustering

Input: Data \mathcal{X}\in\mathbb{R}^{m\times d}, cluster number c
Output: The fused low-dimensional embedding 𝑽\bm{V} and the cluster results 𝒀pred\bm{Y}_{pred}

1:Compute the pairwise affinity 𝑳^2\hat{\bm{L}}_{2}. Set 𝑽10=𝑽20=𝒀10=𝒀20=𝟎\bm{V}_{1}^{0}=\bm{V}_{2}^{0}=\bm{Y}_{1}^{0}=\bm{Y}_{2}^{0}=\bm{0}, μ0=103\mu^{0}=10^{-3}, μmax=102\mu^{max}=10^{2}, ϵ=102\epsilon=10^{-2}, ε=103\varepsilon=10^{-3}, ρ=1.1\rho=1.1, α=103\alpha=10^{-3}, k=t=0k=t=0, and σ=106\sigma=10^{-6}.
2:Construct triadic affinity tensor 3\mathcal{L}_{3} by Eq. (19).
3:Construct tetradic affinity tensor 4\mathcal{L}_{4} by Eq. (13).
4:Unfold order-3 tensor 3\mathcal{L}_{3} to matrix 𝑳3^\hat{\bm{L}_{3}} by Eq. (1).
5:Unfold order-4 tensor 4\mathcal{L}_{4} to matrix 𝑳4^\hat{\bm{L}_{4}} by Eq. (2).
6:while Not Converged do
7:     𝑽1tempt=𝑽1k\bm{V}_{1_{temp}}^{t}=\bm{V}_{1}^{k}.
8:     while Not Converged do
9:         Fix \bm{V}_{2}^{k}, compute \frac{\partial L}{\partial\bm{V}_{1_{temp}}^{t}} by Eq. (29).
10:         Update \bm{V}_{1_{temp}}^{t+1} by \bm{V}_{1_{temp}}^{t+1}=\bm{V}_{1_{temp}}^{t}-\alpha\frac{\partial L}{\partial\bm{V}_{1_{temp}}^{t}}.
11:         Check the convergence conditions:    𝑽1tempt+1𝑽1temptϵ\|\bm{V}_{1_{temp}}^{t+1}-\bm{V}_{1_{temp}}^{t}\|_{\infty}\leq\epsilon .
12:         t=t+1t=t+1.      
13:     𝑽1k+1=𝑽1tempt\bm{V}_{1}^{k+1}=\bm{V}_{1_{temp}}^{t}
14:     Fix 𝑽1k+1\bm{V}_{1}^{k+1}, update 𝑽2k+1\bm{V}_{2}^{k+1} by Eq. (32).
15:     Update 𝒀1k+1\bm{Y}_{1}^{k+1} by Eq. (33).
16:     Update 𝒀2k+1\bm{Y}_{2}^{k+1} by Eq. (34).
17:     μk+1=min(ρμk,μmax)\mu^{k+1}=min(\rho\mu^{k},\mu^{max})
18:     Check the convergence conditions: \|\bm{V}_{1}^{k+1}-\bm{V}_{1}^{k}\|_{\infty}\leq\varepsilon, \|\bm{V}_{2}^{k+1}-\bm{V}_{2}^{k}\|_{\infty}\leq\varepsilon, \|\bm{Y}_{1}^{k+1}-\bm{Y}_{1}^{k}\|_{\infty}\leq\varepsilon, \|\bm{Y}_{2}^{k+1}-\bm{Y}_{2}^{k}\|_{\infty}\leq\varepsilon.
19:     k=k+1k=k+1.
20:     t=0t=0
21:return 𝑽1k\bm{V}_{1}^{k}
22:Perform spectral clustering on 𝑽𝑽T\bm{V}\bm{V}^{T} to obtain 𝒀pred\bm{Y}_{pred}
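To make the update order of Algorithm 1 concrete, the following compact NumPy sketch (our simplified illustration; the warm start from the spectral embedding, the step size, and the tolerances are placeholder choices rather than the authors' tuned settings) takes the unfolded normalized affinities L2_hat (m x m), L3_hat (m^2 x m), and L4_hat (m^2 x m^2) and alternates the three update steps:

import numpy as np

def khatri_rao(A, B):
    return np.vstack([np.kron(A[:, j], B[:, j]) for j in range(A.shape[1])]).T

def utc_embedding(L2_hat, L3_hat, L4_hat, c, mu=1e-3, mu_max=1e2, rho=1.1,
                  alpha=1e-3, inner_iters=50, outer_iters=100, tol=1e-3):
    m = L2_hat.shape[0]
    V1 = np.linalg.eigh(L2_hat)[1][:, -c:]        # warm start from the spectral embedding (our choice)
    V2 = khatri_rao(V1, V1)
    Y1, Y2, I_c = np.zeros((m * m, c)), np.zeros((c, c)), np.eye(c)
    for _ in range(outer_iters):
        V1_old = V1.copy()
        for _ in range(inner_iters):              # gradient descent on the V1 subproblem, Eq. (29)
            grad = -2 * L2_hat @ V1 - L3_hat.T @ V2 + 2 * mu * V1 @ (V1.T @ V1 - I_c + Y2 / mu)
            for j in range(c):                    # column-wise contribution of the Khatri-Rao term
                R = (np.kron(V1[:, j], V1[:, j]) - V2[:, j] + Y1[:, j] / mu).reshape(m, m)
                grad[:, j] += mu * (R + R.T) @ V1[:, j]
            V1 = V1 - alpha * grad
        rhs = mu * khatri_rao(V1, V1) + L3_hat @ V1 + Y1
        V2 = np.linalg.solve(mu * np.eye(m * m) - 2 * L4_hat, rhs)   # closed-form V2 update, Eq. (32)
        Y1 = Y1 + mu * (khatri_rao(V1, V1) - V2)                     # Eq. (33)
        Y2 = Y2 + mu * (V1.T @ V1 - I_c)                             # Eq. (34)
        mu = min(rho * mu, mu_max)
        if np.max(np.abs(V1 - V1_old)) < tol:
            break
    return V1   # cluster with K-means on the rows of V1 (or spectral clustering on V1 V1^T)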

4 EXPERIMENTS

In this section, we illustrate the superiority of the proposed UTC method on both synthetic and real datasets. We first construct an example (Syndata-1) to demonstrate the scenario in which clustering through the standard pairwise affinity fails. In comparison, the triadic affinity can be employed to adequately characterize the spatial distribution of multiple samples; thus, the proposed UTC can perfectly cluster the sample using the triadic affinity. We then construct a second synthetic dataset (Syndata-2) with the HDLSS property to demonstrate the scenario in which clustering through standard pairwise affinity is unsatisfactory when the feature dimension increases. In comparison, the proposed UTC can perform clustering tasks stably by integrating affinities of different orders. Finally, we conduct experiments on six real datasets to verify the effectiveness of the proposed UTC by comparison with five popular clustering methods.

4.1 Experimental Settings

We employ five popular clustering methods to benchmark the proposed UTC. A brief introduction to these methods is given below:

  • Spectral clustering (SC) [3]: The classic spectral clustering gives a baseline on behalf of pairwise similarity.

  • Clustering with Adaptive Neighbors (CAN) [17]: Dynamically learns the affinity matrix by assigning adaptive neighbors to each data point based on local distances. A rank constraint forces the number of connected components of the final similarity matrix to equal the cluster number exactly.

  • Clique averaging+ncut (CAVE)  [21]: The hypergraph, on behalf of the high-order affinity, is approximated using clique averaging, and the resulting graph is partitioned using the normalized cuts algorithm.

  • Pair-to-pair similarity clustering (PPS) [26]: Tetradic undecomposable affinities are used to obtain a low-dimensional embedding that serves as the final similarity matrix for the spectral method.

  • Integrating tensor similarity and pairwise similarity (IPS2) [26]: Applying the spectral clustering method on the fused similarity that combines the pair-to-pair affinities and pairwise similarity.

  • Uniform Tensor Clustering (UTC): The proposed method.

In terms of the type of the affinity, the five methods can be divided into three categories, i.e., pairwise affinity, high-order affinity, and fused-order affinity.

The performance of each method is evaluated with respect to five popular metrics: accuracy (ACC), adjusted Rand index (ARI), F-score, normalized mutual information (NMI), and purity (PURITY). The larger the value is, the better the performance is.

Figure 3: Experiment on Syndata-1. (a) Syndata-1 is constructed to consist of two subgroups, which are perfectly distributed along two cross-lines, highlighted by \cdot and *; (b) The assignments obtained using pairwise affinity mistakenly mix the two groups, highlighted in orange and blue. Most of the samples are labeled incorrectly; (c) The assignments obtained by UTC using triadic affinity perfectly segment the samples into the correct subgroups, highlighted in orange and blue.

4.2 Experiment on Syndata-1 to Validate the Discrimination Potential of UTC via Triadic Affinity

Syndata-1 consists of 40 samples from two different subgroups. Each group is perfectly distributed along a straight line, while the two groups are distributed perpendicularly to each other. We use circles and crosses to denote the samples from the two subgroups, and their distribution is shown in Fig.  3(a).

We applied spectral clustering, which works on pairwise affinity, to Syndata-1. Since the samples from the two subgroups lie on perpendicular lines, the resulting affinity matrix is indiscriminative. Therefore, SC fails to yield the correct assignment: it mistakenly splits the samples into an upper part and a lower part. In comparison, UTC identifies the sample assignments perfectly with 100% accuracy. The results indicate that the proposed undecomposable high-order affinity can provide complementary information that improves the clustering performance.

Figure 4: Experiment to demonstrate the stability of UTC when dealing with high-dimensional feature corruption. The feature dimension is artificially increased from 10 to 10000. The accuracy of most of the comparative methods gradually decreased from 100% to around 35%. In comparison, the performance of UTC was stable; the accuracy was maintained at approximately 100% as the dimension increased.
TABLE II: Clustering results on Syndata-2 when the dimensionality is 10000
Datasets Methods ACC ARI F-SCORE NMI PURITY
Syndata2 SC 0.55 0.4427 0.6633 0.6544 0.68
CAN 0.6449 0.2306 0.5342 0.3323 0.6449
CAVE 0.3551 -0.013 0.3734 0.0018 0.3551
PPS 0.5636 0.3166 0.4472 0.4453 0.5636
IPS2 0.7218 0.4847 0.5848 0.5862 0.7818
UTC(Fusing pairwise and triadic affinity) 0.86 0.6719 0.7865 0.759 0.86
UTC(Fusing pairwise and tetradic affinity) 0.77 0.5204 0.6799 0.5776 0.77
UTC 𝟏\bm{1} 𝟏\bm{1} 𝟏\bm{1} 𝟏\bm{1} 𝟏\bm{1}
Figure 5: Visualization results on Syndata-2 with a feature dimensionality of 10000. We apply t-SNE to the embedding features derived from UTC and show the corresponding affinity matrices for three settings. Most of the samples are mixed together when applying t-SNE to the raw data (a), and the corresponding affinity matrix has no clear-cut structure (b). When fusing the triadic affinity with the pairwise affinity, the samples in group A are well segmented while the other groups remain undistinguished (c), and the upper left of the corresponding affinity matrix shows blocky structures (d). Although group C can be identified accurately by combining the fourth-order association with pairwise similarity, the remaining two Gaussian-mixture groups are merged (e), (f). By combining the third-order and fourth-order affinities with pairwise similarity, each group can be identified accurately (h), and three distinct blocks appear on the diagonal of the corresponding affinity matrix (g).

4.3 Experiment on Syndata-2 to Validate the Stability of UTC over Dimension Expanding via High-order Affinities

Syndata-2 consists of three subgroups, each derived from independent and identically distributed normal distributions with an equal standard deviation of 0.5 and mean of 2. The data from different groups are orthogonal, and each group contains 30-40 samples.

The pairwise affinity tends to become ambiguous as the feature dimensionality expands, limiting the clustering performance. We applied the proposed UTC to Syndata-2 to validate its stability in handling high-dimensional data, comparing it with popular methods including SC, CAN, CAVE, PPS, and IPS2. The results are shown in Fig. 4. As the feature dimension varies from 10 to 10000, UTC remains robust and performs accurate clustering, whereas the baseline methods gradually lose the ability to identify the data structure.

Moreover, when the number of data dimensions reaches 10000, Tab. II shows that the proposed UTC variants, fusing pairwise affinity with triadic affinity, with tetradic affinity, or with both, substantially outperform all other state-of-the-art methods. The ACC of fusing triadic and fusing tetradic affinities exceeds that of the second-highest baseline method by 14% and 28%, respectively, demonstrating the effectiveness of fusing high-order affinity with pairwise relations on HDLSS data. Additionally, the ACC reaches 100% when utilizing both higher-order affinities, indicating that fusing affinities of different orders explains the data structure more completely than merging only a single higher-order affinity.

We also employ t-SNE to visualize the differences in the embedded features produced by the different settings. Most of the samples mix together when visualizing the raw data with t-SNE, as shown in Fig. 5(a). The pairwise affinity, shown in Fig. 5(b), likewise has no clear-cut structure, except for the large diagonal elements; it is therefore challenging to discriminate the three subgroups using the pairwise affinity matrix directly. Next, we fuse the third-order tensor affinity with the pairwise affinity and apply UTC to obtain a local embedding, to which t-SNE is applied to visualize the sample distribution. The result is shown in Fig. 5(c): the samples in one subgroup (Subgroup A) are clearly separated from the others, while the remaining two subgroups (Subgroups B and C) are not well distinguished. The upper left of the affinity matrix shows an obvious blocky structure, but the remaining part lacks a clear structure, as shown in Fig. 5(d), which confirms the t-SNE result. Alternatively, if we fuse the fourth-order tensor affinity with the pairwise matrix affinity and apply UTC, the t-SNE visualization of the resulting embedding is shown in Fig. 5(e): the samples in Subgroup C are clearly separated from the other two subgroups, while those two subgroups are merged. The affinity heatmap validates this observation, with one subgroup in the lower-right corner possessing distinct values compared with the others, as shown in Fig. 5(f). Finally, we fuse the pairwise affinity with both the third-order and fourth-order affinities, and UTC is employed to learn a uniform embedding that is then visualized via t-SNE, as shown in Fig. 5(g). The three subgroups are perfectly separated in this case, and the diagonal of the corresponding affinity matrix shows three obvious block-like structures with clear affinity differences between the subgroups. Therefore, an accurate sample assignment can easily be obtained from the embedding produced by UTC. In summary, these results validate that the high-order tensor affinities compensate for the insufficiency of the pairwise affinity matrix when it is corrupted by high-dimensional features.

TABLE III: Statistics on six tested real-world datasets.
Dataset Type Samples Features Clusters
Yale Facial Image 55 4096 5
Coil20 Object Image 100 16384 5
Lymphoma Bioinformatics 62 4702 3
DriveFace Facial Image 78 307200 3
Lung Bioinformatics 100 12600 3
GLI-85 Bioinformatics 85 22283 2

4.4 Experiments on Real Datasets

We then validate the power of UTC by applying it to six public benchmark datasets, chosen to represent various sources, including facial images, object images, and bioinformatics data. The statistics of the six datasets are summarized in Tab. III, and a detailed description of each dataset is provided below.

TABLE IV: Clustering performance on six real datasets
Datasets Methods ACC ARI F-SCORE NMI PURITY
YALE SC 0.4727 0.2528 0.423 0.4097 0.4909
CAN 0.5091 0.3316 0.4839 0.5046 0.5273
CAVE 0.5818 0.2871 0.4337 0.4307 0.6182
PPS 0.5636 0.3166 0.4472 0.4453 0.5636
IPS2 0.7218 0.4847 0.5848 0.5862 0.7818
UTC 0.8\bm{0.8} 0.5019\bm{0.5019} 0.6036\bm{0.6036} 0.67\bm{0.67} 0.8\bm{0.8}
Coil SC 0.83 0.9068 0.9248 0.9352 0.96
CAN 0.86 0.7660 0.7807 0.8886 0.81
CAVE 0.84 0.6852 0.7462 0.7575 0.84
PPS 0.93 0.8383 0.8694 0.869 0.93
IPS2 0.97\bm{0.97} 0.9280\bm{0.9280} 0.9419\bm{0.9419} 0.9460\bm{0.9460} 0.97\bm{0.97}
UTC 0.97\bm{0.97} 0.9280\bm{0.9280} 0.9419\bm{0.9419} 0.9460\bm{0.9460} 0.97\bm{0.97}
DriveFace SC 0.641 0.3304 0.5921 0.4505 0.641
CAN 0.7436 0.3671 0.5941 0.4486 0.7436
CAVE 0.8974 0.7274 0.8182 0.7449 0.8974
PPS 0.7436 0.4655 0.6433 0.5144 0.718
IPS2 0.7436 0.4711 0.6549 0.5731 0.7436
UTC 0.9615\bm{0.9615} 0.8869\bm{0.8869} 0.9237\bm{0.9237} 0.8675\bm{0.8675} 0.9615\bm{0.9615}
Lymphoma SC 0.8065 0.5059 0.7167 0.6634 0.8548
CAN 0.9839 0.9471 0.9733 0.9255 0.9839
CAVE 0.8065 0.5060 0.7167 0.6634 0.8548
PPS 0.9194 0.7570 0.8696 0.7823 0.9194
IPS2 0.9839 0.9471 0.9733 0.9255 0.9839
UTC 𝟏\bm{1} 𝟏\bm{1} 𝟏\bm{1} 𝟏\bm{1} 𝟏\bm{1}
Lung SC 0.8 0.5855 0.675 0.6848 0.81
CAN 0.77 0.5973 0.6971 0.6995 0.85
CAVE 0.8 0.5855 0.675 0.6848 0.81
PPS 0.73 0.541 0.6472 0.6311 0.8
IPS2 0.82 0.541 0.7427 0.7289\bm{0.7289} 0.85
UTC 0.85\bm{0.85} 0.6941\bm{0.6941} 0.7634\bm{0.7634} 0.7219 0.85\bm{0.85}
GLI-85 SC 0.6471 0.0731 0.5721 0.1385 0.6941
CAN 0.6588 -0.0353 0.697 0.042 0.6941
CAVE 0.7412 0.1225 0.7376 0.171 0.7412
PPS 0.5294 -0.0447 0.5777 0.1103 0.6941
IPS2 0.7294 0.2012 0.6267 0.2754 0.7294
UTC 0.8118\bm{0.8118} 0.3772\bm{0.3772} 0.7177\bm{0.7177} 0.2798\bm{0.2798} 0.8118\bm{0.8118}
  • Yale dataset contains 165 grayscale images of 15 individuals. Each individual has 11 images with different facial expressions and configurations, such as with glasses, without glasses, center-light, left-light, happy, sad, and so on. We extracted 4096-dimensional raw pixel values for every image of 5 classes for our experiment.

  • Coil20 dataset contains 1440 images of 20 categories of objects, with 72 images per category taken from different views. We sample 20 images from each of 5 classes, and all images have a size of 128\times 128.

  • Lymphoma dataset concerns diffuse large B-cell lymphoma (DLBCL), one of the most common subtypes of non-Hodgkin's lymphoma, which has two molecularly distinct forms whose gene expression patterns indicate different stages of B-cell differentiation. The dataset contains a total of 62 samples, each described by 4702 expression features.

  • DriveFace dataset contains image sequences of subjects while driving in real scenarios. It is composed of 606 samples of 6400 features each, acquired over different days from 3 drivers with various facial characteristics, such as glasses and beards. For each driver, we pick 26 samples for the experiment.

  • Lung dataset contains a total of 203 samples that can be divided into 5 classes, with 149, 21, 20, 6, and 17 samples, respectively. Each sample has 12,600 genes. We selected 40 samples of the first category and all samples of the remaining four categories for experiments.

  • GLI-85 dataset consists of the transcriptional data of 85 diffuse infiltrating gliomas from 74 patients. Those gliomas can be divided into two kinds of tumor subclasses. Each instance has 22283 features.

Figure 6: Visualization of the latent representations with t-SNE and corresponding affinity heatmaps of the (a) Yale; (b) Coil; and (c) DriveFace datasets. The first and second columns are the visualizations of the embedding by t-SNE on the raw data and the samples’ affinity heatmap. The third and the fourth columns are the visualization results of the obtained uniform embedding by UTC and the samples’ affinity heatmap, respectively.
Figure 7: Visualization of the latent representations with t-SNE and corresponding affinity heatmaps of the (a) Lung; (b) Lymphoma, and (c) GLI-85 datasets. The first and second columns are the visualization by t-SNE of the raw data and the samples’ affinity heatmap. The third and fourth columns are the visualization results of the obtained uniform embedding by UTC and the samples’ affinity heatmap, respectively.

We applied the proposed UTC as well as the five comparative methods to the six datasets; the results are summarized in Tab. IV, where bold values indicate the best result. The clustering methods that use higher-order affinity consistently outperform the standard methods that use only pairwise affinity, indicating that the higher-order affinity constructed from relationships among multiple samples better describes the spatial structure of the data, which is advantageous for HDLSS data. Furthermore, the proposed UTC significantly outperforms the competing methods on nearly all five evaluation metrics across all types of data, including facial image, object image, and bioinformatics data.

On the Yale dataset, the ACC clustering result of our method is over 33% higher than that of SC and over 8% higher than the second-highest IPS2. On the GLI-85 dataset, our results are more than 16% and 6% higher than those of the baseline and second-highest IPS2 methods. There are also 1%, 7%, 10%, 2%, and 3% improvements compared with the second-best performance on the Coil, DriveFace, Colon, Lymphoma, and Lung datasets, respectively. IPS2 and UTC achieve the top two results on almost all datasets with respect to nearly all the metrics.

We also visualize the sample spatial distributions after t-SNE [31] and the heatmaps of the affinity matrices for the Yale, Coil, and DriveFace datasets in Fig. 6 and for the Lymphoma, Lung, and GLI-85 datasets in Fig. 7. In each figure, the t-SNE distributions of the raw data and of the resulting embedding are shown in the first and third columns, while the heatmaps of the raw affinity and of the affinity fused over various orders are shown in the second and fourth columns. In most cases, the low-dimensional embedding produced by UTC displays better separation and less overlap, making clustering easier. The similarity heatmaps produced by SC and UTC further demonstrate the effectiveness of our method (deep yellow represents a large proximity value): the affinity heatmap on the raw samples has highly blurred boundaries and no apparent block structure, whereas the affinity matrix computed from the low-dimensional embedding of UTC has clear-cut boundaries, with the large values concentrated in the diagonal blocks. This implies that the fused affinity reflects the real relationships among the samples more accurately, yielding the superior clustering performance reported in Tab. IV. Taking the Yale dataset as an example, the ACC obtained by UTC is 33% higher than that of SC, and the visualization results confirm this superiority intuitively, as shown in Fig. 6(a): most of the samples are mixed together in the visualization based on pairwise affinity, making it impossible to distinguish the subgroups accurately (far left of Fig. 6(a)), and the heatmap of the pairwise affinity shows only one cluster that differs markedly from the other samples. Correspondingly, in the t-SNE visualization of the resulting embedding, most samples are grouped with members of their own cluster, and five obvious blocks appear on the diagonal of the heatmap. The right side of Fig. 6(a) confirms that UTC reflects the real structure of the data better than single-order methods, which is critical for clustering.

In summary, UTC performs well because it exploits the complementarity of high- and low-order affinities to capture the spatial structure of the data comprehensively, which is decisive for clustering performance.

4.5 Computational Complexity Analysis

The UTC consists of two major computational components: constructing the high-order tensor affinities and solving the minimization problem in Eq. (27). In the first component, we calculate both the triadic and tetradic affinities by means of $k$-nearest neighbors (KNN). Given $m$ samples, the KNN strategy reduces the time complexity from $\mathcal{O}(m^{4})$ to $\mathcal{O}(mk^{4})$, so the two affinity tensors are sparse, with $s$ nonzero elements. For the second component, we solve the model by means of an iterative strategy. The subproblem of $\bm{V}_{1}$, solved by gradient descent, has a computational complexity of $\mathcal{O}(ncsm^{2})$, where $c$ denotes the number of clusters and $n$ is the maximum number of inner iterations. Solving the other subproblem $\bm{V}_{2}$ takes $\mathcal{O}(csm^{2})$ using the Germs algorithm. Thus, each iteration has a total computational cost of $\mathcal{O}(ncsm^{2})+\mathcal{O}(csm^{2})=\mathcal{O}((n+1)csm^{2})$. In total, the complexity of learning the uniform embedding in Algorithm 1 is $\mathcal{O}(mk^{4}+K(n+1)csm^{2})$, with $K$ being the number of outer iterations.
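To make the first component concrete, the sketch below constructs a sparse triadic affinity restricted to triples drawn from each sample’s $k$ nearest neighbours, so that only a small number of entries is stored; the Gaussian-style decay over the triple’s pairwise distances is an illustrative assumption, not the exact affinity definition used by UTC.

```python
import itertools
import numpy as np
from sklearn.neighbors import NearestNeighbors

def sparse_triadic_affinity(X, k=5, sigma=1.0):
    """Store triadic affinities only for triples drawn from each sample's
    k nearest neighbours, giving roughly O(m * k^2) nonzeros instead of
    the dense O(m^3)."""
    X = np.asarray(X, dtype=float)
    m = X.shape[0]
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X)                     # idx[i, 0] is sample i itself
    affinity = {}
    for i in range(m):
        for j, l in itertools.combinations(idx[i, 1:], 2):
            # Illustrative Gaussian-style decay over the triple's pairwise distances.
            d2 = (np.sum((X[i] - X[j]) ** 2) +
                  np.sum((X[i] - X[l]) ** 2) +
                  np.sum((X[j] - X[l]) ** 2))
            affinity[tuple(sorted((i, int(j), int(l))))] = np.exp(-d2 / (2 * sigma ** 2))
    return affinity
```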

5 Conclusion

Clustering samples of small size yet large feature dimensionality is a bottleneck problem. We propose a unified learning model that effectively fuses samples’ affinities of different orders to obtain a uniform low-dimensional embedding of the samples. Tensor-vector products uniformly formulate the sample affinities of various orders, which opens a new door for studying relationships among multiple samples. Our method involves three matrix products: the arithmetical, Khatri-Rao, and Kronecker products, which are jointly optimized to yield the uniform low-dimensional embedding. Experiments on synthetic data demonstrate the power of fusing different-order affinities, and experiments on several real datasets with small sample sizes yet large feature dimensionality show the effectiveness and superiority of the method over other popular approaches.

References

  • [1] T. Hofmann and J. M. Buhmann, “Pairwise data clustering by deterministic annealing,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 19, no. 1, pp. 1–14, 1997.
  • [2] T. Kanungo, D. M. Mount, N. S. Netanyahu, C. D. Piatko, R. Silverman, and A. Y. Wu, “An efficient k-means clustering algorithm: Analysis and implementation,” IEEE transactions on pattern analysis and machine intelligence, vol. 24, no. 7, pp. 881–892, 2002.
  • [3] J. Shi and J. Malik, “Normalized cuts and image segmentation,” IEEE Transactions on pattern analysis and machine intelligence, vol. 22, no. 8, pp. 888–905, 2000.
  • [4] A. Y. Ng, M. I. Jordan, and Y. Weiss, “On spectral clustering: Analysis and an algorithm,” in Advances in neural information processing systems, 2002, pp. 849–856.
  • [5] H. Li, S. Tian, Y. Li, Q. Fang, R. Tan, Y. Pan, C. Huang, Y. Xu, and X. Gao, “Modern deep learning in bioinformatics,” Journal of molecular cell biology, vol. 12, no. 11, pp. 823–827, 2020.
  • [6] T. Tian, J. Wan, Q. Song, and Z. Wei, “Clustering single-cell RNA-seq data with a model-based deep learning approach,” Nature Machine Intelligence, vol. 1, no. 4, pp. 191–198, 2019.
  • [7] X. Peng, J. Feng, J. T. Zhou, Y. Lei, and S. Yan, “Deep subspace clustering,” IEEE Transactions on Neural Networks and Learning Systems, pp. 1–13, 2020.
  • [8] Y. Lu, Y. Cheung, and Y. Y. Tang, “Self-adaptive multiprototype-based competitive learning approach: A k-means-type algorithm for imbalanced data clustering,” IEEE Transactions on Cybernetics, pp. 1–15, 2019.
  • [9] W. Yang, C. Hui, D. Sun, X. Sun, and Q. Liao, “Clustering through probability distribution analysis along eigenpaths,” IEEE Transactions on Systems, Man, and Cybernetics: Systems, 2018.
  • [10] Y. Chen, L. Zhou, S. Pei, Z. Yu, Y. Chen, X. Liu, J. Du, and N. Xiong, “KNN-block DBSCAN: Fast clustering for large-scale data,” IEEE Transactions on Systems, Man, and Cybernetics: Systems, 2019.
  • [11] X. Xu, J. Li, M. Zhou, J. Xu, and J. Cao, “Accelerated two-stage particle swarm optimization for clustering not-well-separated data,” IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 50, no. 11, pp. 4212–4223, 2018.
  • [12] C. C. Aggarwal, A. Hinneburg, and D. A. Keim, “On the surprising behavior of distance metrics in high dimensional space,” in International conference on database theory.   Springer, 2001, pp. 420–434.
  • [13] D. François, V. Wertz, and M. Verleysen, “The concentration of fractional distances,” IEEE Transactions on Knowledge and Data Engineering, vol. 19, no. 7, pp. 873–886, 2007.
  • [14] S. Sarkar and A. K. Ghosh, “On perfect clustering of high dimension, low sample size data,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.
  • [15] P. Borysov, J. Hannig, and J. S. Marron, “Asymptotics of hierarchical clustering for growing dimension,” Journal of Multivariate Analysis, vol. 124, pp. 465–479, 2014.
  • [16] P. Hall, J. S. Marron, and A. Neeman, “Geometric representation of high dimension, low sample size data,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 67, no. 3, pp. 427–444, 2005.
  • [17] F. Nie, X. Wang, and H. Huang, “Clustering and projected clustering with adaptive neighbors,” in Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, 2014, pp. 977–986.
  • [18] H. Wang, G. Xiao, Y. Yan, and D. Suter, “Searching for representative modes on hypergraphs for robust geometric model fitting,” IEEE transactions on pattern analysis and machine intelligence, vol. 41, no. 3, pp. 697–711, 2018.
  • [19] H. C. Nguyen and H. Mamitsuka, “Learning on hypergraphs with sparsity,” IEEE transactions on pattern analysis and machine intelligence, 2020.
  • [20] D. Zhou, J. Huang, and B. Schölkopf, “Learning with hypergraphs: Clustering, classification, and embedding,” Advances in neural information processing systems, vol. 19, pp. 1601–1608, 2006.
  • [21] S. Agarwal, J. Lim, L. Zelnik-Manor, P. Perona, D. Kriegman, and S. Belongie, “Beyond pairwise clustering,” in 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), vol. 2.   IEEE, 2005, pp. 838–845.
  • [22] S. R. Bulo and M. Pelillo, “A game-theoretic approach to hypergraph clustering,” in NIPS, vol. 22.   Citeseer, 2009, pp. 1571–1579.
  • [23] Z. Zhang, H. Lin, and Y. Gao, “Dynamic hypergraph structure learning,” in IJCAI, 2018, pp. 3162–3169.
  • [24] G. Li, L. Qi, and G. Yu, “The z-eigenvalues of a symmetric tensor and its application to spectral hypergraph theory,” Numerical Linear Algebra with Applications, vol. 20, no. 6, pp. 1001–1029, 2013.
  • [25] D. Ghoshdastidar and A. Dukkipati, “Spectral clustering using multilinear SVD: Analysis, approximations and applications,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 29, no. 1, 2015.
  • [26] H. Peng, Y. Hu, J. Chen, H. Wang, Y. Li, and H. Cai, “Integrating tensor similarity to enhance clustering performance,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.
  • [27] S. Rabanser, O. Shchur, and S. Günnemann, “Introduction to tensor decompositions and their applications in machine learning,” arXiv preprint arXiv:1711.10781, 2017.
  • [28] U. Von Luxburg, “A tutorial on spectral clustering,” Statistics and computing, vol. 17, no. 4, pp. 395–416, 2007.
  • [29] S. Sarkar and A. K. Ghosh, “On perfect clustering of high dimension, low sample size data,” IEEE transactions on pattern analysis and machine intelligence, vol. 42, no. 9, pp. 2257–2272, 2019.
  • [30] A. Kumar, P. Rai, and H. Daume, “Co-regularized multi-view spectral clustering,” Advances in neural information processing systems, vol. 24, pp. 1413–1421, 2011.
  • [31] L. Van der Maaten and G. Hinton, “Visualizing data using t-SNE,” Journal of Machine Learning Research, vol. 9, no. 11, 2008.