
Uniform tensor clustering by jointly exploring sample affinities of various orders

Hongmin Cai1,∗, Fei Qi1,2, Junyu Li1, Yu Hu1, Yue Zhang3, Yiu-ming Cheung5, and Bin Hu4,∗ 1School of Computer Science and Engineering, South China University of Technology, Guangzhou, 510006, China. 2School of Data Science and Information Engineering, Guizhou Minzu University, China. 3School of Computer Science, Guangdong Polytechnic Normal University, Guangzhou 510665, China. 4School of Medical Technology, Beijing Institute of Technology, Beijing, China. 5Department of Computer Science, Hong Kong Baptist University, Hong Kong SAR, China. ∗Corresponding authors, E-mail: [email protected], [email protected].
Abstract

Conventional clustering methods based on pairwise affinity usually suffer from the concentration effect when processing data with huge feature dimensionality yet low sample sizes, so the sample proximity is encoded inaccurately and the clustering performance becomes suboptimal. To address this issue, we propose a unified tensor clustering method (UTC) that characterizes sample proximity using the affinity among multiple samples, thereby supplying rich spatial information on the sample distribution to boost clustering. Specifically, we find that the triadic tensor affinity can be constructed via the Khatri-Rao product of two affinity matrices. Furthermore, our early work shows that the fourth-order tensor affinity is defined by the Kronecker product. Therefore, we utilize the arithmetical, Khatri-Rao, and Kronecker products to mathematically integrate affinities of different orders into a unified tensor clustering framework, and UTC learns a joint low-dimensional embedding that combines the various orders. Finally, a numerical scheme is designed to solve the problem. Experiments on synthetic and real-world datasets demonstrate that 1) high-order tensor affinity provides a characterization of sample proximity that is complementary to the popular affinity matrix; and 2) the proposed UTC enhances clustering by exploiting affinities of different orders when processing high-dimensional data.

Index Terms:
High-order affinity, Clustering, Fusing affinity, Tensor, Spectral graph.

1 Introduction

Clustering aims to partition unlabeled data into distinct groups [1]. Traditional methods, such as K-means [2] and spectral clustering [3, 4], have been well studied, and many variants have been designed to cater to various needs. However, with advances in observational equipment, an object of interest can be associated with extremely high-dimensional features. For example, in bioinformatics, most single cells isolated from tissue are damaged during sequencing. Thus, only a limited number of cells can be successfully sequenced, resulting in a sample matrix with a few samples and millions of gene reads [5, 6]. Similar scenarios are arising across various fields, such as computer vision [7, 8] and natural language processing. Accurately clustering high-dimension (m) yet low-sample-size (n) (HDLSS) data remains challenging when n\ll m [9, 10, 11].

The critical challenge that hinders the clustering of HDLSS data is the so-called concentration effect [12, 13], which refers to the phenomenon that the pairwise Euclidean distances among samples collapse to a constant, so that the sample-wise affinity becomes indiscriminative in high feature dimensions [14, 15, 16]. This effect is an insurmountable roadblock for most clustering methods, which rely on the pairwise affinity of samples, and prevents them from achieving precise clustering. Early works attempted to learn adaptive sample affinities as data-driven metrics. For example, CAN [17] dynamically learns the affinity matrix by assigning adaptive neighbors to each data point based on local distances. However, the local distance remains a pairwise characterization of sample affinity; it therefore still suffers from the concentration effect and performs unsatisfactorily.

To overcome such drawbacks, many researchers have proposed to incorporate the spatial distribution of multiple samples to enhance clustering. Such high-order characterization of the local structure of multiple points can effectively overcome the shortcomings of pairwise similarity and improve the clustering results [18, 19]. For example, [20] generalizes spectral methods operating on a pairwise similarity graph to a hypergraph that can represent high-order affinities and develops an algorithm to obtain the hypergraph embedding for clustering. [21] proposes to maintain a weighted graph that approximates the hypergraph generated by high-order associations and then uses a spectral method to accomplish graph partitioning. [22] treats the hypergraph clustering problem as a multiplayer noncooperative game from a game-theoretic perspective. Similar to the CAN method, [23] first proposes a way to learn a dynamic hypergraph for clustering. However, all of the above methods must eventually convert the hypergraph into an approximate weighted graph. The clustering task is therefore still performed on pairwise sample affinities, so the higher-order affinity is not properly exploited to overcome the drawbacks of pairwise relations.

Recently, the use of tensors to encode high-order affinities has been studied in clustering. [24] introduces the characteristic tensor of a hypergraph and its associated Z-eigenvalue, providing natural relations for the structure of data. [25] denotes the order-k affinities by a symmetric tensor and relaxes hypergraph clustering to a multilinear singular value decomposition problem. However, these methods can use only a single specific order of affinity for clustering, which cannot completely express the data structure. Our early work IPS2 [26] concentrates on using the high-order structural information of data obtained from relationships between two sample pairs to enhance clustering performance, and establishes an equivalence between the even-order affinity and the pairwise second-order affinity. However, two problems remain: (1) the connection between odd-order tensor affinity and the second-order matrix affinity is unknown; and (2) how to jointly utilize affinities of different orders to accomplish clustering has yet to be studied.

To address these two questions, this paper first establishes a connection between the second-order and third-order tensor affinities via the Khatri-Rao product. Then, we propose a unified tensor clustering (UTC) model to learn a low-dimensional embedding by jointly using affinities of various orders. The popular second-order matrix affinity and the triadic and tetradic tensor affinities are fused to help characterize the proximity among samples.

The main advantages of the method proposed in this paper are summarized as follows.

  • 1)

    A connection between pairwise matrix affinity and the triadic tensor affinity among three samples is established. An undecomposable third-order tensor is designed to provide the geometric proximity among three samples, thus serving as a supplement to the pairwise similarity matrix.

  • 2)

    The third-order tensor affinity is defined via the Khatri-Rao product, while the Kronecker product is used to define the fourth-order tensor affinity. Therefore, the popular matrix affinity defined by the arithmetical product is surprisingly related to the Khatri-Rao and Kronecker products.

  • 3)

    A uniform tensor clustering method (UTC) is developed. The UTC learns a low-dimensional embedding jointly based on affinities of various orders. The UTC is formulated elegantly and works on affinities of any order.

  • 4)

    Extensive experiments on HDLSS data demonstrate that UTC achieves superior performance compared with baseline methods. In addition, the experiments show that using high-order affinities to characterize the sample spatial distribution improves clustering performance, especially in cases of small sample sizes.

Figure 1: Overview of the UTC method, which learns a uniform embedding via joint optimization over not only the pairwise matrix affinity but also the triadic and tetradic tensor affinities among more than two samples. The high-order tensor affinities provide complementary spatial characterization where the pairwise affinity fails, so the low-dimensional embedding learned by UTC remains stable and robust to the high-dimensional concentration effect.

2 NOTATIONS AND PRELIMINARIES

2.1 Notations

Throughout the paper, lowercase letters represent scalars, bold lowercase letters represent vectors, such as \bm{v}\in\mathbb{R}^{m}, bold uppercase letters denote matrices, such as \bm{M}\in\mathbb{R}^{m\times n}, and bold calligraphic letters represent tensors, such as \mathcal{T}\in\mathbb{R}^{m\times n\times l}. Let \mathcal{T}(:,:,i), \mathcal{T}(:,i,:), and \mathcal{T}(i,:,:) represent the i-th frontal, lateral, and horizontal slices of an order-3 tensor \mathcal{T}\in\mathbb{R}^{n_{1}\times n_{2}\times n_{3}}, respectively. \mathcal{T}(:,:,i) can be abbreviated as \mathcal{T}^{(i)}. A tensor can be reformatted into a matrix via a series of operations called unfolding. For example, the third-order and fourth-order tensor unfolding operations are defined as follows:

Definition 2.1.

Unfolding a 3-order Tensor: Let \mathcal{T}_{3}\in\mathbb{R}^{m\times m\times m} be an order-3 m-dimensional tensor. It can be unfolded into an m^{2}\times m matrix \hat{\bm{T}}_{3} as follows:

𝑻3^=unfold(𝒯3)=[𝒯3(1)𝒯3(2)𝒯3(m)]\displaystyle\hat{\bm{T}_{3}}=\operatorname{unfold}(\mathcal{T}_{3})=\left[\begin{array}[]{c}\mathcal{T}_{3}^{(1)}\\ \mathcal{T}_{3}^{(2)}\\ \vdots\\ \mathcal{T}_{3}^{\left(m\right)}\end{array}\right] (1)
Definition 2.2.

Unfolding a 4-order Tensor: Let \mathcal{T}_{4}\in\mathbb{R}^{m\times m\times m\times m} represent a fourth-order m-dimensional tensor. It can be unfolded into an m^{2}\times m^{2} matrix \hat{\bm{T}}_{4}, with its (r,s)-th entry given by:

𝑻^4rs=𝒯4ijkl\displaystyle\hat{\bm{T}}_{4_{rs}}=\mathcal{T}_{4_{ijkl}} (2)

with r=m(j-1)+i, s=m(l-1)+k, and i,j,k,l\in[1,m].
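To make the index conventions concrete, the following short NumPy sketch (our own illustration, not part of the original formulation; the function names are ours) implements the unfolding operations of Definitions 2.1 and 2.2:

import numpy as np

def unfold3(T):
    # T: (m, m, m) tensor. Stack the frontal slices T[:, :, k] vertically,
    # giving an (m^2 x m) matrix as in Eq. (1).
    m = T.shape[0]
    return np.vstack([T[:, :, k] for k in range(m)])

def unfold4(T):
    # T: (m, m, m, m) tensor. Entry (r, s) of the (m^2 x m^2) matrix equals
    # T[i, j, k, l] with r = m(j-1)+i and s = m(l-1)+k (1-based), as in Eq. (2).
    m = T.shape[0]
    M = np.zeros((m * m, m * m))
    for i in range(m):
        for j in range(m):
            for k in range(m):
                for l in range(m):
                    M[m * j + i, m * l + k] = T[i, j, k, l]
    return M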

Three matrix multiplications are considered, namely, Hadamard product, Kronecker product and Khatri-Rao product [27].

Definition 2.3.

Hadamard Product: The Hadamard product of two matrices \bm{A}\in\mathbb{R}^{m\times n} and \bm{B}\in\mathbb{R}^{m\times n} is defined as:

𝑨𝑩=[a11b11a1nb1nam1bm1amnbmn]m×n\displaystyle\bm{A}\odot\bm{B}=\left[\begin{matrix}a_{11}b_{11}&\cdots&a_{1n}b_{1n}\\ \vdots&\ddots&\vdots\\ a_{m1}b_{m1}&\cdots&a_{mn}b_{mn}\\ \end{matrix}\right]\in\mathbb{R}^{m\times n} (3)
Definition 2.4.

Kronecker Product: The Kronecker product of two matrices \bm{A}\in\mathbb{R}^{m_{1}\times n_{1}} and \bm{B}\in\mathbb{R}^{m_{2}\times n_{2}} is defined as:

𝑨𝑩=[a11𝑩a1n𝑩am1𝑩amn𝑩]m1m2×n1n2\displaystyle\bm{A}\otimes\bm{B}=\left[\begin{matrix}a_{11}\bm{B}&\cdots&a_{1n}\bm{B}\\ \vdots&\ddots&\vdots\\ a_{m1}\bm{B}&\cdots&a_{mn}\bm{B}\\ \end{matrix}\right]\in\mathbb{R}^{m_{1}m_{2}\times n_{1}n_{2}} (4)
Definition 2.5.

Khatri-Rao Product: The Khatri-Rao product of two matrices \bm{A}\in\mathbb{R}^{m_{1}\times n} and \bm{B}\in\mathbb{R}^{m_{2}\times n} is defined as the matrix:

𝑨𝑩=[a:1b:1a:nb:n]m1m2×n\displaystyle\bm{A}*\bm{B}=\left[\begin{matrix}a_{:1}\otimes b_{:1}&\cdots&a_{:n}\otimes b_{:n}\end{matrix}\right]\in\mathbb{R}^{m_{1}m_{2}\times n} (5)
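The three products map directly onto standard NumPy primitives. The minimal sketch below (ours, for illustration only) mirrors Definitions 2.3-2.5, with the Khatri-Rao product written as the column-wise Kronecker product:

import numpy as np

A = np.arange(6, dtype=float).reshape(2, 3)
B = np.arange(6, 12, dtype=float).reshape(2, 3)

hadamard = A * B               # Eq. (3): entry-wise product, shape (2, 3)
kronecker = np.kron(A, B)      # Eq. (4): shape (m1*m2, n1*n2) = (4, 9)

def khatri_rao(A, B):
    # Eq. (5): column-wise Kronecker product; A and B must have the same number of columns.
    assert A.shape[1] == B.shape[1]
    return np.vstack([np.kron(A[:, j], B[:, j]) for j in range(A.shape[1])]).T

kr = khatri_rao(A, B)          # shape (m1*m2, n) = (4, 3)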

A popular product between a tensor and a matrix is called the k-mode product.

Definition 2.6.

k-mode Product: The k-mode product between an order-m tensor \bm{\mathcal{T}}\in\mathbb{R}^{n_{1}\times n_{2}\times\dots\times n_{m}} and a matrix \bm{V}\in\mathbb{R}^{p\times n_{k}}, denoted by \bm{\mathcal{T}}\otimes_{k}\bm{V}\in\mathbb{R}^{n_{1}\times\dots\times n_{k-1}\times p\times n_{k+1}\times\dots\times n_{m}}, is defined entrywise by

(𝓣k𝑽)i1ik1jik+1im=ik=1nk𝓣i1ik1ikik+1im𝑽jik\displaystyle(\bm{\mathcal{T}}\otimes_{k}\bm{V})_{i_{1}\dots i_{k-1}ji_{k+1}\dots i_{m}}=\sum^{n_{k}}_{i_{k}=1}\bm{\mathcal{T}}_{i_{1}\dots i_{k-1}i_{k}i_{k+1}\dots i_{m}}\bm{V}_{ji_{k}} (6)
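Definition 2.6 can be realized by bringing mode k to the front, multiplying by the matrix, and folding back. The following unoptimized NumPy sketch (our own illustration; the function name is ours) implements Eq. (6) with 0-based mode indexing:

import numpy as np

def mode_k_product(T, V, k):
    # Eq. (6): contract mode k (0-based here) of tensor T with matrix V of shape (p, n_k);
    # the k-th dimension n_k is replaced by p in the result.
    Tk = np.moveaxis(T, k, 0)                        # bring mode k to the front
    front, rest = Tk.shape[0], Tk.shape[1:]
    out = V @ Tk.reshape(front, -1)                  # (p, n_k) x (n_k, prod(rest))
    return np.moveaxis(out.reshape((V.shape[0],) + rest), 0, k)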

For convenience, we summarize the frequently used notations and definitions in Table I.

TABLE I: Summary of notations
𝒗m\bm{v}\in\mathbb{R}^{m} A vector
𝑴m×n\bm{M}\in\mathbb{R}^{m\times n} A matrix
𝒯n1×n2××nk\mathcal{T}\in\mathbb{R}^{n_{1}\times n_{2}\times\dots\times n_{k}} An order-k tensor
𝑻^k\hat{\bm{T}}_{k} The unfolding order-k tensor
Tr()Tr(\cdot) The trace operation
\odot The Hadamard product
* The Khatri-Rao product
\otimes The Kronecker product
k\otimes_{k} The k-mode product

2.2 Introduction of Spectral Clustering with Pairwise Similarity

Spectral clustering [3] starts by computing a sample-wise affinity matrix \bm{S}_{2}\in\mathbb{R}^{m\times m}. Every entry \bm{S}_{2}(i,j) measures the proximity between two samples x_{i} and x_{j}. Then, a normalized affinity matrix \hat{\bm{L}}_{2} is estimated as \hat{\bm{L}}_{2}=\bm{D}_{2}^{-\frac{1}{2}}\bm{S}_{2}\bm{D}_{2}^{-\frac{1}{2}}, where \bm{D}_{2} denotes the degree matrix of \bm{S}_{2} with \bm{D}_{2_{ii}}=\sum_{j}\bm{S}_{2_{ij}}. The Laplacian matrix \bm{L} is then given by \bm{L}=\bm{I}-\hat{\bm{L}}_{2}. Spectral graph clustering seeks a low-dimensional embedding of the samples that preserves the local proximity of paired samples:

min𝑽tr(𝑽T𝑳𝑽) Subject to: 𝑽T𝑽=𝑰\displaystyle\begin{aligned} &\min_{\bm{V}}\,\,tr(\bm{V}^{T}\bm{L}\bm{V})\\ \text{ Subject to:~{}}\quad&\bm{V}^{T}\bm{V}=\bm{I}\\ \end{aligned} (7)

This is a standard eigenvalue problem and thus can be solved efficiently by computing the eigenvectors \bm{V} of the Laplacian matrix \bm{L}, e.g., via singular value decomposition. Typically, the normalized cut of the graph is minimized in this model, but as stated in [28], the model can also be expressed as a maximization problem on the normalized pairwise affinity \hat{\bm{L}}_{2}. Using the tensor k-mode product, the spectral clustering problem in Eq. (7) can be rewritten as:

max𝑽tr(𝑽T𝑳^2𝑽)\displaystyle\max_{\bm{V}}\,\,tr(\bm{V}^{T}\hat{\bm{L}}_{2}\bm{V}) (8)
=\displaystyle= max𝑽j=1k𝑳^22𝒗jT1𝒗jT\displaystyle\max_{\bm{V}}\,\,\sum_{j=1}^{k}\hat{\bm{L}}_{2}\otimes_{2}\bm{v}_{j}^{T}\otimes_{1}\bm{v}_{j}^{T}
Subject to: 𝑽T𝑽=𝑰\displaystyle\text{ Subject to:~{}}\bm{V}^{T}\bm{V}=\bm{I}

where \bm{v}_{j} is the j-th column of the partition matrix \bm{V}. This equation establishes the relationship between the membership \bm{v}_{j} and the pairwise affinity. A clustering task is then performed on the obtained embedding.
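For reference, the sketch below (ours; a standard implementation rather than code released with this paper) computes \hat{\bm{L}}_{2} from a Gaussian-kernel affinity, an illustrative choice of kernel and bandwidth, and extracts the spectral embedding of Eqs. (7)-(8):

import numpy as np
from scipy.spatial.distance import cdist

def spectral_embedding(X, k, sigma=1.0):
    # X: (m, d) data matrix. Returns the top-k eigenvectors of the normalized affinity,
    # i.e., the V that maximizes tr(V^T L2_hat V) subject to V^T V = I (Eq. (8)).
    S2 = np.exp(-cdist(X, X, 'sqeuclidean') / (2 * sigma ** 2))  # Gaussian pairwise affinity
    d = S2.sum(axis=1)
    L2_hat = S2 / np.sqrt(np.outer(d, d))                        # D^{-1/2} S2 D^{-1/2}
    _, V = np.linalg.eigh(L2_hat)                                # eigenvalues in ascending order
    return V[:, -k:]
# A clustering step (e.g., K-means on the rows of the embedding) would follow.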

2.3 Introduction of IPS2 Clustering with Pair-to-Pair Similarity

The pairwise affinity is easily broken by noise contamination or concentration effects in HDLSS data. To address this issue, our recent work IPS2 [26] used high-order affinities among more than two samples. Specifically, IPS2 used a fourth-order tensor affinity to measure the proximity between two sample pairs. The decomposable fourth-order tensor affinity is defined as the product of two pairwise similarities:

𝒯4ijkl=𝑺2ik𝑺2jl,i,j,k,lm.\displaystyle\mathcal{T}_{4_{ijkl}}=\bm{S}_{2_{ik}}\bm{S}_{2_{jl}},\quad i,j,k,l\in m. (9)

Let 𝑻^4\hat{\bm{T}}_{4} denote the matrix unfolded from a fourth-order decomposable tensor affinity 𝒯4\mathcal{T}_{4}; it has been shown in  [26] that

𝑻^4=𝑺2𝑺2\displaystyle\hat{\bm{T}}_{4}=\bm{S}_{2}\otimes\bm{S}_{2} (10)

This result establishes a direct connection between the fourth-order decomposable tensor affinity and the usual second-order matrix affinity \bm{S}_{2}. Now, consider a normalized affinity matrix \hat{\bm{L}}_{4}, computed as \hat{\bm{L}}_{4}=\hat{\bm{D}}_{4}^{-\frac{1}{2}}\hat{\bm{T}}_{4}\hat{\bm{D}}_{4}^{-\frac{1}{2}}. Here, the diagonal matrix \hat{\bm{D}}_{4} is the degree matrix of \hat{\bm{T}}_{4} with diagonal elements \hat{\bm{D}}_{4_{ii}}=\sum_{j}\hat{\bm{T}}_{4_{ij}}. It has been shown that the fourth-order and second-order normalized affinity matrices \hat{\bm{L}}_{4} and \hat{\bm{L}}_{2} share a relation similar to Eq. (10):

Theorem 2.1.

The decomposable 4-order tensor [26]: Let \hat{\bm{L}}_{2} and \mathcal{L}_{4} denote the normalized similarity matrix and the fourth-order normalized affinity tensor with unfolded form \hat{\bm{L}}_{4}, respectively. Then, the following equality holds:

𝑳^4=𝑳^2𝑳^2\displaystyle\hat{\bm{L}}_{4}=\hat{\bm{L}}_{2}\otimes\hat{\bm{L}}_{2} (11)

Moreover, the eigenvectors \bm{v} and \hat{\bm{v}} of the matrix \hat{\bm{L}}_{2} and the unfolded matrix \hat{\bm{L}}_{4} satisfy:

𝒗^=𝒗𝒗\displaystyle\hat{\bm{v}}=\bm{v}\otimes\bm{v} (12)
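Theorem 2.1 is easy to verify numerically; the sketch below (our own check, not code from the original work) confirms Eq. (11) and the eigenvector relation of Eq. (12) for a random symmetric affinity:

import numpy as np

m = 5
S2 = np.random.rand(m, m); S2 = (S2 + S2.T) / 2       # random symmetric pairwise affinity
d = S2.sum(axis=1)
L2_hat = S2 / np.sqrt(np.outer(d, d))

T4_hat = np.kron(S2, S2)                              # Eq. (10)
d4 = T4_hat.sum(axis=1)
L4_hat = T4_hat / np.sqrt(np.outer(d4, d4))
assert np.allclose(L4_hat, np.kron(L2_hat, L2_hat))   # Eq. (11)

w, V = np.linalg.eigh(L2_hat)
v = V[:, -1]                                          # an eigenvector of L2_hat
assert np.allclose(L4_hat @ np.kron(v, v),
                   w[-1] ** 2 * np.kron(v, v))        # Eq. (12): v kron v is an eigenvector of L4_hat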

One can reshape the eigenvector \hat{\bm{v}} into a matrix \bm{V}\in\mathbb{R}^{m\times m}, and this matrix \bm{V} is an eigenmatrix of the unfolded fourth-order tensor affinity \hat{\bm{T}}_{4} in Eq. (10). The established equivalence between the decomposable tensor affinity \mathcal{T}_{4} and any similarity matrix \bm{S} opens a new path to understanding high-order affinity. However, the fourth-order decomposable affinity inherits the drawback of the low-order similarity [29]: it is vulnerable to noise corruption and concentration effects when applied to high-dimensional samples. Alternatively, one can use an undecomposable tensor affinity to provide complementary sample affinities that the pairwise matrix affinity cannot provide. IPS2 uses a Fisher-ratio-like tensor affinity:

𝒯ijkl=exp(σdij+dkldik+djl+ε)\mathcal{T}_{ijkl}=exp(-\sigma\frac{d_{ij}+d_{kl}}{d_{ik}+d_{jl}+\varepsilon}) (13)

for i,j,k,l\in[1,m], where d_{ij} denotes the distance between samples x_{i} and x_{j}, and the parameter \sigma is a scaling constant.
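A direct, dense construction of the tensor in Eq. (13) can be sketched as follows (our own illustration; it stores all m^4 entries and is only meant to make the indexing explicit, whereas a KNN-restricted sparse construction would be used in practice, cf. Section 4.5):

import numpy as np
from scipy.spatial.distance import cdist

def fourth_order_affinity(X, sigma=1.0, eps=1e-8):
    # X: (m, d) samples. Dense tensor of Eq. (13):
    # T[i, j, k, l] = exp(-sigma * (d_ij + d_kl) / (d_ik + d_jl + eps)).
    d = cdist(X, X)
    num = d[:, :, None, None] + d[None, None, :, :]    # d_ij + d_kl
    den = d[:, None, :, None] + d[None, :, None, :]    # d_ik + d_jl
    return np.exp(-sigma * num / (den + eps))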

One can calculate the normalized tensor affinity \mathcal{L}_{4} from the fourth-order tensor affinity \mathcal{T}_{4}. Then, a low-dimensional embedding \bm{V} can be learned from the fourth-order tensor affinity by solving the following maximization problem:

max𝑽\displaystyle\max_{\bm{V}} j=1k44𝒗j3𝒗j2𝒗j1𝒗j\displaystyle\sum_{j=1}^{k}\mathcal{L}_{4}\otimes_{4}\bm{v}_{j}\otimes_{3}\bm{v}_{j}\otimes_{2}\bm{v}_{j}\otimes_{1}\bm{v}_{j} (14)
=tr((𝑽𝑽)T𝑳^4(𝑽𝑽))\displaystyle=tr((\bm{V}*\bm{V})^{T}\hat{\bm{L}}_{4}(\bm{V}*\bm{V}))
s.t.𝑽T𝑽=𝑰\displaystyle s.t.\bm{V}^{T}\bm{V}=\bm{I}

3 Method

The performance of clustering based on pairwise affinity suffers on high-dimensional, small-sample-size datasets. Incorporating the spatial distribution of multiple samples, instead of using only pairwise affinity, improves the clustering performance. IPS2 [26] built a bridge between the fourth-order affinity and a low-dimensional embedding, showing that high-order information can effectively complement the description of the spatial data structure given by low-order information. However, it is restricted to even-order affinities, such as the fourth-order affinity, and does not apply to odd-order affinities. Moreover, it fails to fuse affinities of different orders within a uniform formulation, making it cumbersome to use in empirical applications. To address these two issues, we (1) introduce triadic and tetradic tensor affinities for three and four samples, and (2) propose a uniform framework to learn a low-dimensional embedding by fusing affinities of various orders. An illustrative overview of UTC is shown in Fig. 1.

3.1 Characterizing Sample’s Affinity via Third-Order Tensor Affinity

3.1.1 Decomposable Third-Order Tensor Affinity

Popular clustering methods are based on the pairwise similarity of two samples. We are interested in the tensor affinity among three samples, denoted by a third-order tensor \mathcal{T}_{3}=[\mathcal{T}_{ijk}]\in\mathbb{R}^{m\times m\times m}. A natural way to define such an affinity is to use the composite affinity based on paired affinities, such that each entry \mathcal{T}_{ijk} is given by:

𝒯3ijk=𝑺2ij𝑺2kj\displaystyle\mathcal{T}_{3_{ijk}}=\bm{S}_{2_{ij}}\bm{S}_{2_{kj}} (15)

Under this definition, one can generalize the pairwise affinity 𝑺2\bm{S}_{2} to a third-order tensor affinity 𝒯3\mathcal{T}_{3}. Specifically, the results satisfy the following properties.

Theorem 3.1.

Given the decomposable third-order tensor affinity \mathcal{T}_{3} defined in Eq. (15), \hat{\bm{T}}_{3}=\bm{S}_{2}*\bm{S}_{2}, where \hat{\bm{T}}_{3} is the matrix unfolded from the tensor \mathcal{T}_{3}.

Proof.

By the definition in Eq. (1), the unfolded matrix \hat{\bm{T}}_{3}\in\mathbb{R}^{m^{2}\times m} can be written as:

\displaystyle\hat{\bm{T}}_{3}=\left[\begin{matrix}\mathcal{T}_{3_{111}}&\dots&\mathcal{T}_{3_{1m1}}\\ \vdots&\ddots&\vdots\\ \mathcal{T}_{3_{m1m}}&\dots&\mathcal{T}_{3_{mmm}}\\ \end{matrix}\right] \quad (16)
=\left[\begin{matrix}\bm{S}_{2_{11}}\bm{S}_{2_{11}}&\dots&\bm{S}_{2_{1m}}\bm{S}_{2_{1m}}\\ \vdots&\ddots&\vdots\\ \bm{S}_{2_{11}}\bm{S}_{2_{m1}}&\dots&\bm{S}_{2_{1m}}\bm{S}_{2_{mm}}\\ \vdots&\ddots&\vdots\\ \bm{S}_{2_{m1}}\bm{S}_{2_{11}}&\dots&\bm{S}_{2_{mm}}\bm{S}_{2_{1m}}\\ \vdots&\ddots&\vdots\\ \bm{S}_{2_{m1}}\bm{S}_{2_{m1}}&\dots&\bm{S}_{2_{mm}}\bm{S}_{2_{mm}}\\ \end{matrix}\right]
=\left[\begin{matrix}\bm{S}_{2_{11}}\bm{S}_{2_{:1}}&\dots&\bm{S}_{2_{1m}}\bm{S}_{2_{:m}}\\ \vdots&\ddots&\vdots\\ \bm{S}_{2_{m1}}\bm{S}_{2_{:1}}&\dots&\bm{S}_{2_{mm}}\bm{S}_{2_{:m}}\\ \end{matrix}\right]
=\left[\begin{matrix}\bm{S}_{2_{:1}}\otimes\bm{S}_{2_{:1}}&\dots&\bm{S}_{2_{:m}}\otimes\bm{S}_{2_{:m}}\\ \end{matrix}\right]
=\bm{S}_{2}*\bm{S}_{2}

Surprisingly, this theorem shows that a decomposable third-order tensor affinity can be obtained via the Khatri-Rao product of two affinity matrices. We now proceed to define a normalized affinity tensor \mathcal{L}_{3}. Let the matrix \hat{\bm{L}}_{3} be calculated as \hat{\bm{L}}_{3}=\hat{\bm{D}}_{3_{1}}^{-\frac{1}{2}}\hat{\bm{T}}_{3}\hat{\bm{D}}_{3_{2}}^{-\frac{1}{2}}, where the two diagonal matrices are \hat{\bm{D}}_{3_{1}}=\bm{D}_{2}\otimes\bm{D}_{2} and \hat{\bm{D}}_{3_{2}}=\bm{D}_{2}\odot\bm{D}_{2}, with \bm{D}_{2} being the degree matrix of \bm{S}_{2}. The normalized third-order affinity tensor \mathcal{L}_{3} is obtained by folding the matrix \hat{\bm{L}}_{3}.

This defined affinity tensor shares a similar property to that of the fourth-order affinity tensor.

Theorem 3.2.

Let \bm{S}_{2} be a pairwise affinity matrix with \hat{\bm{L}}_{2} its normalized affinity matrix, and let the decomposable third-order tensor affinity \mathcal{T}_{3} and its normalized tensor \mathcal{L}_{3} be obtained as above. Then we have

𝑳^3=𝑳^2𝑳^2\displaystyle\hat{\bm{L}}_{3}=\hat{\bm{L}}_{2}*\hat{\bm{L}}_{2} (17)

where 𝐋^3\hat{\bm{L}}_{3} is the unfolded matrix from the tensor 3\mathcal{L}_{3}.

Proof.

By the definitions of the normalized affinity matrix 𝑳^2\hat{\bm{L}}_{2} and 𝑳^3\hat{\bm{L}}_{3}, we have:

𝑳^3=\displaystyle\hat{\bm{L}}_{3}= 𝑫^3112𝑻3^𝑫^3212\displaystyle\hat{\bm{D}}_{3_{1}}^{-\frac{1}{2}}\hat{\bm{T}_{3}}\hat{\bm{D}}_{3_{2}}^{-\frac{1}{2}} (18)
=\displaystyle= (𝑫2𝑫2)12(𝑺2𝑺2)(𝑫2𝑫2)12\displaystyle(\bm{D}_{2}\otimes\bm{D}_{2})^{-\frac{1}{2}}(\bm{S}_{2}*\bm{S}_{2})(\bm{D}_{2}\odot\bm{D}_{2})^{-\frac{1}{2}}
=\displaystyle= (𝑫212𝑺2𝑫212)(𝑫212𝑺2𝑫212)\displaystyle(\bm{D}_{2}^{-\frac{1}{2}}\bm{S}_{2}\bm{D}_{2}^{-\frac{1}{2}})*(\bm{D}_{2}^{-\frac{1}{2}}\bm{S}_{2}\bm{D}_{2}^{-\frac{1}{2}})
=\displaystyle= 𝑳^2𝑳^2\displaystyle\hat{\bm{L}}_{2}*\hat{\bm{L}}_{2}
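Theorems 3.1 and 3.2 can be checked numerically on a random affinity matrix. The sketch below (ours) builds the decomposable tensor of Eq. (15), unfolds it by Eq. (1), and compares the results with the Khatri-Rao products \bm{S}_{2}*\bm{S}_{2} and \hat{\bm{L}}_{2}*\hat{\bm{L}}_{2}:

import numpy as np

def khatri_rao(A, B):
    return np.vstack([np.kron(A[:, j], B[:, j]) for j in range(A.shape[1])]).T

m = 5
S2 = np.random.rand(m, m); S2 = (S2 + S2.T) / 2           # random symmetric pairwise affinity
T3 = np.einsum('ij,kj->ijk', S2, S2)                      # Eq. (15): T[i, j, k] = S_ij * S_kj
T3_hat = np.vstack([T3[:, :, k] for k in range(m)])       # unfolding, Eq. (1)
assert np.allclose(T3_hat, khatri_rao(S2, S2))            # Theorem 3.1

d = S2.sum(axis=1)
L2_hat = S2 / np.sqrt(np.outer(d, d))
D31 = np.kron(d, d)                                       # diagonal of D2 kron D2
D32 = d * d                                               # diagonal of D2 Hadamard D2
L3_hat = (T3_hat / np.sqrt(D31)[:, None]) / np.sqrt(D32)[None, :]
assert np.allclose(L3_hat, khatri_rao(L2_hat, L2_hat))    # Theorem 3.2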

Theorems 3.1 and 3.2 build a connection between the third-order affinity \mathcal{T}_{3} and the pairwise similarity \bm{S}_{2} via the Khatri-Rao product. To illustrate this, we construct a synthetic dataset consisting of eighty samples evenly drawn from two different clusters and visualize the pairwise affinity \bm{S}_{2} together with the decomposable third-order and fourth-order tensor affinities in Fig. 2. One can observe that all of them preserve the sample proximity satisfactorily: each affinity can be clearly segmented into distinct blocks, with large proximity within each diagonal block and small proximity across blocks.

Figure 2: Demonstration of the (a) pairwise, (b) third-order, and (c) fourth-order affinities on a synthetic dataset drawn from a normal distribution. Obvious block structures appear in the heatmaps of the unfolded affinities of different orders, indicating that the decomposable similarities of different orders behave consistently.

3.1.2 Indecomposable Third-Order Tensor Affinity

After establishing the connection between the decomposable affinity and the pairwise similarity, we analyze the local spatial relations to obtain information that supplements the pairwise relations. As stated in [30], when the order is two, the spectral method can be viewed as extracting an eigenvector \bm{v} that contains the important underlying manifold information of the data. We generalize this idea to high-order affinity. Different third-order information can be extracted by examining three samples from different views. As an example, the affinity among three samples is defined as follows: for any three samples x_{i}, x_{j}, and x_{k}, we can always use one sample x_{j} as an anchor point for the other samples x_{i} and x_{k} and measure the relationship between the two other points from the anchor's perspective. If x_{i} and x_{k} are sufficiently similar, they remain similar no matter where the anchor point is located; for outliers, in contrast, the similarity to other data should be small under most anchors' perspectives. Therefore, the undecomposable affinity \mathcal{T}_{3}\in\mathbb{R}^{m\times m\times m} among three points is defined by:

𝒯3ijk=<(xixj),(xkxj)>dijdjk\mathcal{T}_{3_{ijk}}=\frac{<(x_{i}-x_{j}),(x_{k}-x_{j})>}{d_{ij}d_{jk}} (19)

for i,j,k\in[1,m], where d_{ij} denotes the distance between samples x_{i} and x_{j}. The entry \mathcal{T}_{3_{ijk}} denotes the similarity between x_{i} and x_{k} under x_{j}'s view: if the distance between x_{i} and x_{k} is small, the entry takes a large value regardless of the location of x_{j}. We can then unfold \mathcal{T}_{3} into a matrix \hat{\bm{T}}_{3}\in\mathbb{R}^{m^{2}\times m} along the mode-3 direction. As for the decomposable tensor affinity, we obtain the third-order normalized affinity tensor \mathcal{L}_{3}, whose mode-3 unfolding satisfies \hat{\bm{L}}_{3}=\hat{\bm{D}}_{3_{1}}^{-\frac{1}{2}}\hat{\bm{T}}_{3}\hat{\bm{D}}_{3_{2}}^{-\frac{1}{2}}, with \hat{\bm{D}}_{3_{1}} and \hat{\bm{D}}_{3_{2}} the corresponding degree matrices of \hat{\bm{T}}_{3}. Analogously to the pairwise and tetradic cases, we seek a low-dimensional embedding \bm{V} from the third-order tensor affinity by solving:

max𝑽\displaystyle\max_{\bm{V}} j=1k33𝒗j2𝒗j1𝒗j\displaystyle\sum_{j=1}^{k}\mathcal{L}_{3}\otimes_{3}\bm{v}_{j}\otimes_{2}\bm{v}_{j}\otimes_{1}\bm{v}_{j} (20)
=max𝑽\displaystyle=\max_{\bm{V}}{} tr((𝑽𝑽)T𝑳^3𝑽)\displaystyle tr((\bm{V}*\bm{V})^{T}\hat{\bm{L}}_{3}\bm{V})
s.t.𝑽T𝑽=𝑰\displaystyle s.t.\bm{V}^{T}\bm{V}=\bm{I}
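A dense construction of the anchor-based triadic affinity in Eq. (19) is sketched below (our own illustration; as with the tetradic affinity, a KNN-restricted sparse version would be preferred in practice):

import numpy as np
from scipy.spatial.distance import cdist

def triadic_affinity(X, eps=1e-8):
    # X: (m, d) samples. Dense tensor of Eq. (19):
    # T[i, j, k] = <x_i - x_j, x_k - x_j> / (d_ij * d_jk), the cosine of the angle at anchor x_j.
    d = cdist(X, X) + eps                            # eps avoids division by zero when i == j
    diff = X[:, None, :] - X[None, :, :]             # diff[i, j] = x_i - x_j
    inner = np.einsum('ija,kja->ijk', diff, diff)    # <x_i - x_j, x_k - x_j>
    return inner / (d[:, :, None] * d[None, :, :])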

3.2 Uniform Tensor Clustering

We now formulate a uniform framework to learn a low-dimensional embedding from affinities of various orders. Suppose we have m samples \bm{X}=[\bm{x}_{1},\bm{x}_{2},...,\bm{x}_{m}]\in\mathbb{R}^{n\times m}, each described by n features. The aim of clustering is to assign the samples to k disjoint clusters C_{1},C_{2},\dots,C_{k}. We introduce a quantity, called the total normalized similarity, to evaluate a sample assignment:

Definition 3.1.

The total N-similarity of a cluster: Let \bm{C}\subseteq\bm{X} be a group of samples and \mathcal{S}\in\mathbb{R}^{n_{1}\times n_{2}\dots\times n_{k}} be the order-k similarity tensor. The total normalized similarity of \bm{C} is defined as

Sim(𝑪)=xi1,xi2,,xik𝑪xi1,xi2,,xik\displaystyle Sim(\bm{C})=\sum_{x_{i_{1}},x_{i_{2}},\dots,x_{i_{k}}\in\bm{C}}\mathcal{L}_{x_{i_{1}},x_{i_{2}},\dots,x_{i_{k}}} (21)

where \mathcal{L} is the normalized affinity tensor calculated by 𝒮\mathcal{S}. In addition, for a partition 𝐂1,𝐂2,,𝐂k\bm{C}_{1},\bm{C}_{2},\dots,\bm{C}_{k} of 𝐗\bm{X}, we define the normalized associativity of the clustering as:

\displaystyle N\text{-}Assoc(\bm{C}_{1},\bm{C}_{2},\dots,\bm{C}_{k})=\sum^{k}_{i=1}\frac{Sim(\bm{C}_{i})}{|\bm{C}_{i}|^{m}}, \quad (22)

where |\bm{C}_{i}| denotes the number of samples in cluster \bm{C}_{i}.

Let an indicator matrix \bm{H}=[\bm{h}_{1},\bm{h}_{2},\dots,\bm{h}_{k}]\in\mathbb{R}^{m\times k} encode the sample assignment such that \bm{H}_{ij}=|\bm{C}_{j}|^{-1} if x_{i}\in\bm{C}_{j} and zero otherwise.

One can maximize the normalized associativity in Eq. (22) to seek an optimal sample assignment \bm{C}_{1},\dots,\bm{C}_{k}. Via algebraic transformation, this problem is equivalent to solving:

max𝑪1,,𝑪kj=1k(m𝒉jm1𝒉j1𝒉j)\displaystyle\max_{\bm{C}_{1},\cdots,\bm{C}_{k}}\sum_{j=1}^{k}(\mathcal{L}\otimes_{m}\bm{h}_{j}\otimes_{m-1}\bm{h}_{j}\dots\otimes_{1}\bm{h}_{j}) (23)

where 𝒉j\bm{h}_{j} denotes the jthj^{th} column of 𝑯\bm{H}.

One can verify that maximizing the normalized associativity is NP-hard. Alternatively, one can relax the binary assignment matrix to an orthonormal matrix \bm{V}\in\mathbb{R}^{m\times k} with \bm{V}^{T}\bm{V}=\bm{I}, dropping the binary assignment requirement. The simplified problem then becomes solving

max𝑽T𝑽=𝑰j=1k1𝒗j2𝒗jm𝒗j\displaystyle\max_{\bm{V}^{T}\bm{V}=\bm{I}}\sum_{j=1}^{k}\mathcal{L}\otimes_{1}\bm{v}_{j}\otimes_{2}\bm{v}_{j}\dots\otimes_{m}\bm{v}_{j} (24)

where 𝒗j\bm{v}_{j} represents the jthj^{th} column of 𝑽\bm{V}.

We now fuse the aforementioned undecomposable third-order and fourth-order tensor affinities with the pairwise affinity matrix to seek a uniform low-dimensional embedding. The high-order tensor affinities help address the vulnerability of pairwise affinity on high-dimensional datasets. The proposed model, named uniform tensor clustering (UTC), effectively fuses the similarity measurements among samples to yield a uniform low-dimensional embedding that approximates the partition matrix. It achieves this goal by maximizing the cluster associativity of various orders. The proposed model is formulated as:

min𝑽\displaystyle\min_{\bm{V}} j=1kL^22𝒗j1𝒗j33𝒗j2𝒗j1𝒗j\displaystyle\sum_{j=1}^{k}-{\hat{L}}_{2}\otimes_{2}\bm{v}_{j}\otimes_{1}\bm{v}_{j}-\mathcal{L}_{3}\otimes_{3}\bm{v}_{j}\otimes_{2}\bm{v}_{j}\otimes_{1}\bm{v}_{j} (25)
44𝒗j3𝒗j2𝒗j1𝒗j\displaystyle-\mathcal{L}_{4}\otimes_{4}\bm{v}_{j}\otimes_{3}\bm{v}_{j}\otimes_{2}\bm{v}_{j}\otimes_{1}\bm{v}_{j}
s.t𝑽T𝑽=𝑰\displaystyle s.t\,\,\bm{V}^{T}\bm{V}=\bm{I}

where \bm{v}_{j} is the j-th column of the matrix \bm{V}. This unified model fuses affinity matrices of various orders to yield a uniform low-dimensional embedding that is robust to noise contamination and high-dimensional concentration effects, and it enhances the clustering performance by exploiting spatial information from multiple (more than two) samples. Upon solving this problem, one obtains the embedding \bm{V}; the clustering task is then accomplished by applying K-means to the rows of the embedding matrix \bm{V} to obtain the final groups. Therefore, once the indecomposable tensor affinities of different orders are available, computing the fused low-dimensional embedding brings in the spatial information of the different orders and enhances the clustering performance.

3.3 Numerical Scheme to Solve UTC

The objective in Eq. (25) is differentiable. However, its gradient involves a third-order polynomial system that has no explicit solution and is therefore computationally intractable. To circumvent this issue, we introduce a slack variable \bm{V}_{2} to represent the term \bm{V}_{1}*\bm{V}_{1}, thereby avoiding the high-order polynomial. The model can be rewritten as

min𝑽1,𝑽2\displaystyle\min_{\bm{V}_{1},\bm{V}_{2}} tr(𝑽1T𝑳^2𝑽1+𝑽2T𝑳^3𝑽1+𝑽2T𝑳^4𝑽2)\displaystyle-tr(\bm{V}_{1}^{T}\hat{\bm{L}}_{2}\bm{V}_{1}+\bm{V}_{2}^{T}\hat{\bm{L}}_{3}\bm{V}_{1}+\bm{V}_{2}^{T}\hat{\bm{L}}_{4}\bm{V}_{2}) (26)
s.t.\displaystyle s.t. 𝑽1T𝑽1=𝑰\displaystyle\bm{V}_{1}^{T}\bm{V}_{1}=\bm{I}
𝑽1𝑽1=𝑽2\displaystyle\bm{V}_{1}*\bm{V}_{1}=\bm{V}_{2}

This problem can be solved by alternating direction minimization. Its augmented Lagrangian formulation is as follows:

min𝑽1,𝑽2\displaystyle\min_{\bm{V}_{1},\bm{V}_{2}} tr(𝑽1T𝑳^2𝑽1)tr(𝑽2T𝑳^3𝑽1)tr(𝑽2T𝑳^4𝑽2)\displaystyle-tr(\bm{V}_{1}^{T}\hat{\bm{L}}_{2}\bm{V}_{1})-tr(\bm{V}_{2}^{T}\hat{\bm{L}}_{3}\bm{V}_{1})-tr(\bm{V}_{2}^{T}\hat{\bm{L}}_{4}\bm{V}_{2}) (27)
+\displaystyle+ <𝒀1,𝑽1𝑽1𝑽2>+<𝒀2,𝑽1T𝑽1𝑰>\displaystyle<\bm{Y}_{1},\bm{V}_{1}*\bm{V}_{1}-\bm{V}_{2}>+<\bm{Y}_{2},\bm{V}_{1}^{T}\bm{V}_{1}-\bm{I}>
+\displaystyle+ μ2[𝑽1𝑽1𝑽2F2+𝑽1T𝑽1𝑰F2]\displaystyle\frac{\mu}{2}[\|\bm{V}_{1}*\bm{V}_{1}-\bm{V}_{2}\|_{F}^{2}+\|\bm{V}_{1}^{T}\bm{V}_{1}-\bm{I}\|_{F}^{2}]

where \bm{Y}_{1} and \bm{Y}_{2} are Lagrange multipliers and the constant \mu>0 is a penalty parameter. We then split the problem into two subproblems with respect to the variables \bm{V}_{1} and \bm{V}_{2}, solving for each variable while fixing the other and iterating until convergence.

Step 1): Solving the subproblem with respect to the variable V1\bm{V}_{1}

By fixing the variable 𝑽2\bm{V}_{2}, the problem can be simplified as:

\displaystyle\begin{aligned}\min_{\bm{V}_{1}}\ &-tr(\bm{V}_{1}^{T}\hat{\bm{L}}_{2}\bm{V}_{1})-tr(\bm{V}_{2}^{T}\hat{\bm{L}}_{3}\bm{V}_{1})\\ &+<\bm{Y}_{1},\bm{V}_{1}*\bm{V}_{1}-\bm{V}_{2}>+<\bm{Y}_{2},\bm{V}_{1}^{T}\bm{V}_{1}-\bm{I}>\\ &+\frac{\mu}{2}\left[\|\bm{V}_{1}*\bm{V}_{1}-\bm{V}_{2}\|_{F}^{2}+\|\bm{V}_{1}^{T}\bm{V}_{1}-\bm{I}\|_{F}^{2}\right]\end{aligned} \quad (28)

The gradient of the quadratic function is:

\displaystyle\frac{\partial L}{\partial\bm{V}_{1}}=-2\hat{\bm{L}}_{2}\bm{V}_{1}-\hat{\bm{L}}_{3}^{T}\bm{V}_{2}+\mu\sum_{j=1}^{k}(\bm{V}_{1_{:j}}\otimes\bm{I}+\bm{I}\otimes\bm{V}_{1_{:j}})^{T}(\bm{V}_{1_{:j}}*\bm{V}_{1_{:j}}-\bm{V}_{2_{:j}}+\bm{Y}_{1_{:j}}/\mu)+2\mu\bm{V}_{1}(\bm{V}_{1}^{T}\bm{V}_{1}-\bm{I}+\bm{Y}_{2}/\mu) \quad (29)

Thus, the problem in Eq. (28) can be solved via gradient descent or by setting the gradient function Eq. (29) to be zero to obtain the solution directly.

Step 2): Solving the subproblem with respect to the variable V2\bm{V}_{2}

By fixing the variable 𝑽1\bm{V}_{1}, the augmented Lagrange function can be simplified as:

\displaystyle\begin{aligned}\min_{\bm{V}_{2}}\ &-tr(\bm{V}_{2}^{T}\hat{\bm{L}}_{3}\bm{V}_{1})-tr(\bm{V}_{2}^{T}\hat{\bm{L}}_{4}\bm{V}_{2})\\ &+<\bm{Y}_{1},\bm{V}_{1}*\bm{V}_{1}-\bm{V}_{2}>+\frac{\mu}{2}\|\bm{V}_{1}*\bm{V}_{1}-\bm{V}_{2}\|_{F}^{2}\end{aligned} \quad (30)

The gradient of the objective function is:

L𝑽2=𝑳^3𝑽12𝑳^4𝑽2+μ(𝑽2𝑽1𝑽1𝒀1μ)\displaystyle\frac{\partial L}{\partial\bm{V}_{2}}=-\hat{\bm{L}}_{3}\bm{V}_{1}-2\hat{\bm{L}}_{4}\bm{V}_{2}+\mu(\bm{V}_{2}-\bm{V}_{1}*\bm{V}_{1}-\frac{\bm{Y}_{1}}{\mu}) (31)

By setting the gradient to zero, one can obtain the implicit solution as:

\displaystyle\bm{V}_{2}^{*}=(\mu\bm{I}-2\hat{\bm{L}}_{4})^{-1}(\mu\bm{V}_{1}*\bm{V}_{1}+\hat{\bm{L}}_{3}\bm{V}_{1}+\bm{Y}_{1}) \quad (32)

Step 3): Updating the multipliers Y1\bm{Y}_{1} and Y2\bm{Y}_{2}:

\displaystyle\bm{Y}_{1}^{(k+1)}=\bm{Y}_{1}^{(k)}+\mu(\bm{V}_{1}^{(k)}*\bm{V}_{1}^{(k)}-\bm{V}_{2}^{(k)}) \quad (33)
\displaystyle\bm{Y}_{2}^{(k+1)}=\bm{Y}_{2}^{(k)}+\mu({\bm{V}_{1}^{(k)}}^{T}\bm{V}_{1}^{(k)}-\bm{I}) \quad (34)

The three steps are updated iteratively until convergence or until the stopping criterion is met: \max(\|\bm{V}_{1}^{k+1}-\bm{V}_{1}^{k}\|_{\infty},\|\bm{V}_{2}^{k+1}-\bm{V}_{2}^{k}\|_{\infty},\|\bm{V}_{1}^{k+1}*\bm{V}_{1}^{k+1}-\bm{V}_{2}^{k+1}\|_{\infty})<\epsilon. The alternating strategy used to solve the problem is summarized in Algorithm 1.

Algorithm 1 High-Order Affinity Clustering

Input: Data \mathcal{X}\in\mathbb{R}^{m\times d}, cluster number c
Output: The fused low-dimensional embedding 𝑽\bm{V} and the cluster results 𝒀pred\bm{Y}_{pred}

1:Compute the pairwise affinity 𝑳^2\hat{\bm{L}}_{2}. Set 𝑽10=𝑽20=𝒀10=𝒀20=𝟎\bm{V}_{1}^{0}=\bm{V}_{2}^{0}=\bm{Y}_{1}^{0}=\bm{Y}_{2}^{0}=\bm{0}, μ0=103\mu^{0}=10^{-3}, μmax=102\mu^{max}=10^{2}, ϵ=102\epsilon=10^{-2}, ε=103\varepsilon=10^{-3}, ρ=1.1\rho=1.1, α=103\alpha=10^{-3}, k=t=0k=t=0, and σ=106\sigma=10^{-6}.
2:Construct triadic affinity tensor 3\mathcal{L}_{3} by Eq. (19).
3:Construct tetradic affinity tensor 4\mathcal{L}_{4} by Eq. (13).
4:Unfold order-3 tensor 3\mathcal{L}_{3} to matrix 𝑳3^\hat{\bm{L}_{3}} by Eq. (1).
5:Unfold order-4 tensor 4\mathcal{L}_{4} to matrix 𝑳4^\hat{\bm{L}_{4}} by Eq. (2).
6:while Not Converged do
7:     𝑽1tempt=𝑽1k\bm{V}_{1_{temp}}^{t}=\bm{V}_{1}^{k}.
8:     while Not Converged do
9:         Fix \bm{V}_{2}^{k}, compute \frac{\partial L}{\partial\bm{V}_{1_{temp}}^{t}} by Eq. (29).
10:         Update \bm{V}_{1_{temp}}^{t+1} by \bm{V}_{1_{temp}}^{t+1}=\bm{V}_{1_{temp}}^{t}-\alpha\frac{\partial L}{\partial\bm{V}_{1_{temp}}^{t}}.
11:         Check the convergence conditions:    𝑽1tempt+1𝑽1temptϵ\|\bm{V}_{1_{temp}}^{t+1}-\bm{V}_{1_{temp}}^{t}\|_{\infty}\leq\epsilon .
12:         t=t+1t=t+1.      
13:     𝑽1k+1=𝑽1tempt\bm{V}_{1}^{k+1}=\bm{V}_{1_{temp}}^{t}
14:     Fix 𝑽1k+1\bm{V}_{1}^{k+1}, update 𝑽2k+1\bm{V}_{2}^{k+1} by Eq. (32).
15:     Update 𝒀1k+1\bm{Y}_{1}^{k+1} by Eq. (33).
16:     Update 𝒀2k+1\bm{Y}_{2}^{k+1} by Eq. (34).
17:     μk+1=min(ρμk,μmax)\mu^{k+1}=min(\rho\mu^{k},\mu^{max})
18:     Check the convergence conditions: \|\bm{V}_{1}^{k+1}-\bm{V}_{1}^{k}\|_{\infty}\leq\varepsilon, \|\bm{V}_{2}^{k+1}-\bm{V}_{2}^{k}\|_{\infty}\leq\varepsilon, \|\bm{Y}_{1}^{k+1}-\bm{Y}_{1}^{k}\|_{\infty}\leq\varepsilon, \|\bm{Y}_{2}^{k+1}-\bm{Y}_{2}^{k}\|_{\infty}\leq\varepsilon.
19:     k=k+1k=k+1.
20:     t=0t=0
21:return 𝑽1k\bm{V}_{1}^{k}
22:Perform spectral clustering on 𝑽𝑽T\bm{V}\bm{V}^{T} to obtain 𝒀pred\bm{Y}_{pred}
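To make the update order of Algorithm 1 concrete, the following compact NumPy sketch (our simplified illustration; the warm start from the spectral embedding, the step size, and the tolerances are placeholder choices rather than the authors' tuned settings) takes the unfolded normalized affinities L2_hat (m x m), L3_hat (m^2 x m), and L4_hat (m^2 x m^2) and alternates the three update steps:

import numpy as np

def khatri_rao(A, B):
    return np.vstack([np.kron(A[:, j], B[:, j]) for j in range(A.shape[1])]).T

def utc_embedding(L2_hat, L3_hat, L4_hat, c, mu=1e-3, mu_max=1e2, rho=1.1,
                  alpha=1e-3, inner_iters=50, outer_iters=100, tol=1e-3):
    m = L2_hat.shape[0]
    V1 = np.linalg.eigh(L2_hat)[1][:, -c:]        # warm start from the spectral embedding (our choice)
    V2 = khatri_rao(V1, V1)
    Y1, Y2, I_c = np.zeros((m * m, c)), np.zeros((c, c)), np.eye(c)
    for _ in range(outer_iters):
        V1_old = V1.copy()
        for _ in range(inner_iters):              # gradient descent on the V1 subproblem, Eq. (29)
            grad = -2 * L2_hat @ V1 - L3_hat.T @ V2 + 2 * mu * V1 @ (V1.T @ V1 - I_c + Y2 / mu)
            for j in range(c):                    # column-wise contribution of the Khatri-Rao term
                R = (np.kron(V1[:, j], V1[:, j]) - V2[:, j] + Y1[:, j] / mu).reshape(m, m)
                grad[:, j] += mu * (R + R.T) @ V1[:, j]
            V1 = V1 - alpha * grad
        rhs = mu * khatri_rao(V1, V1) + L3_hat @ V1 + Y1
        V2 = np.linalg.solve(mu * np.eye(m * m) - 2 * L4_hat, rhs)   # closed-form V2 update, Eq. (32)
        Y1 = Y1 + mu * (khatri_rao(V1, V1) - V2)                     # Eq. (33)
        Y2 = Y2 + mu * (V1.T @ V1 - I_c)                             # Eq. (34)
        mu = min(rho * mu, mu_max)
        if np.max(np.abs(V1 - V1_old)) < tol:
            break
    return V1   # cluster with K-means on the rows of V1 (or spectral clustering on V1 V1^T)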

4 EXPERIMENTS

In this section, we illustrate the superiority of the proposed UTC method on both synthetic and real datasets. We first construct an example (Syndata-1) to demonstrate the scenario in which clustering through the standard pairwise affinity fails. In comparison, the triadic affinity can be employed to adequately characterize the spatial distribution of multiple samples; thus, the proposed UTC can perfectly cluster the sample using the triadic affinity. We then construct a second synthetic dataset (Syndata-2) with the HDLSS property to demonstrate the scenario in which clustering through standard pairwise affinity is unsatisfactory when the feature dimension increases. In comparison, the proposed UTC can perform clustering tasks stably by integrating affinities of different orders. Finally, we conduct experiments on six real datasets to verify the effectiveness of the proposed UTC by comparison with five popular clustering methods.

4.1 Experimental Settings

We employ five popular clustering methods to benchmark the proposed UTC. A brief introduction to these methods is given below:

  • Spectral clustering (SC) [3]: The classic spectral clustering gives a baseline on behalf of pairwise similarity.

  • Clustering with Adaptive Neighbors (CAN) [17]: Dynamically learns the affinity matrix by assigning adaptive neighbors to each data point based on local distances. A rank constraint forces the number of connected components of the final similarity matrix to equal the cluster number exactly.

  • Clique averaging+ncut (CAVE)  [21]: The hypergraph, on behalf of the high-order affinity, is approximated using clique averaging, and the resulting graph is partitioned using the normalized cuts algorithm.

  • Pair-to-pair similarity clustering (PPS) [26]: Tetradic undecomposable affinities are used to obtain a low-dimensional embedding that serves as the final similarity matrix for the spectral method.

  • Integrating tensor similarity and pairwise similarity (IPS2) [26]: Applying the spectral clustering method on the fused similarity that combines the pair-to-pair affinities and pairwise similarity.

  • Uniform Tensor Clustering (UTC): The proposed method.

In terms of the type of the affinity, the five methods can be divided into three categories, i.e., pairwise affinity, high-order affinity, and fused-order affinity.

The performance of each method is evaluated with respect to five popular metrics: accuracy (ACC), adjusted Rand index (ARI), F-score, normalized mutual information (NMI), and purity (PURITY). The larger the value is, the better the performance is.

Figure 3: Experiment on Syndata-1. (a) Syndata-1 is constructed to consist of two subgroups, which are perfectly distributed along two cross-lines, highlighted by \cdot and *; (b) The assignments obtained using pairwise affinity mistakenly mix the two groups, highlighted in orange and blue. Most of the samples are labeled incorrectly; (c) The assignments obtained by UTC using triadic affinity perfectly segment the samples into the correct subgroups, highlighted in orange and blue.

4.2 Experiment on Syndata-1 to Validate the Discrimination Potential of UTC via Triadic Affinity

Syndata-1 consists of 40 samples from two different subgroups. Each group is perfectly distributed along a straight line, while the two groups are distributed perpendicularly to each other. We use circles and crosses to denote the samples from the two subgroups, and their distribution is shown in Fig.  3(a).

We applied spectral clustering, which works on pairwise affinity, to Syndata-1. Since the samples from the two subgroups lie on perpendicular lines, the resulting affinity matrix is indiscriminative. Therefore, SC fails to yield the correct assignment: it mistakenly splits the samples into an upper part and a lower part. In comparison, UTC identifies the sample assignments perfectly with 100% accuracy. The results indicate that the proposed undecomposable high-order affinity can provide complementary information that improves the clustering performance.

Figure 4: Experiment to demonstrate the stability of UTC when dealing with high-dimensional feature corruption. The feature dimension is artificially increased from 10 to 10000. The accuracy of most of the comparative methods gradually decreased from 100% to around 35%. In comparison, the performance of UTC was stable; the accuracy was maintained at approximately 100% as the dimension increased.
TABLE II: Clustering results on Syndata-2 when the dimensionality is 10000
Datasets Methods ACC ARI F-SCORE NMI PURITY
Syndata2 SC 0.55 0.4427 0.6633 0.6544 0.68
CAN 0.6449 0.2306 0.5342 0.3323 0.6449
CAVE 0.3551 -0.013 0.3734 0.0018 0.3551
PPS 0.5636 0.3166 0.4472 0.4453 0.5636
IPS2 0.7218 0.4847 0.5848 0.5862 0.7818
UTC(Fusing pairwise and triadic affinity) 0.86 0.6719 0.7865 0.759 0.86
UTC(Fusing pairwise and tetradic affinity) 0.77 0.5204 0.6799 0.5776 0.77
UTC 𝟏\bm{1} 𝟏\bm{1} 𝟏\bm{1} 𝟏\bm{1} 𝟏\bm{1}
Figure 5: Visualization results on Syndata-2 with a feature dimensionality of 10000. We apply t-SNE to the embedding features derived from UTC and show the corresponding affinity matrices for three settings. Most of the samples are mixed together when applying t-SNE to the raw data (a), and the corresponding affinity matrix has no clear-cut structure (b). When fusing the triadic affinity with the pairwise affinity, the samples in group A are well segmented while the other groups remain undistinguished (c), and the upper left of the corresponding affinity matrix shows blocky structures (d). Although group C can be identified accurately by combining the fourth-order association with pairwise similarity, the remaining two Gaussian-mixture groups are merged (e), (f). By combining the third-order and fourth-order affinities with pairwise similarity, each group can be identified accurately (h), and three distinct blocks appear on the diagonal of the corresponding affinity matrix (g).

4.3 Experiment on Syndata-2 to Validate the Stability of UTC over Dimension Expanding via High-order Affinities

Syndata-2 consists of three subgroups, each derived from independent and identically distributed normal distributions with an equal standard deviation of 0.5 and mean of 2. The data from different groups are orthogonal, and each group contains 30-40 samples.

The pairwise affinity tends to become ambiguous as the feature dimensionality expands, limiting the clustering performance. We applied the proposed UTC to Syndata-2 to validate its stability in handling high-dimensional data, comparing it with popular methods including SC, CAN, CAVE, PPS, and IPS2. The results are shown in Fig. 4. As the feature dimension varies from 10 to 10000, UTC remains robust and performs accurate clustering, whereas the baseline methods gradually lose the ability to identify the data structure.

Moreover, when the number of data dimensions reaches 10000, Tab. II shows that the proposed UTC variants, fusing pairwise affinity with triadic affinity, with tetradic affinity, or with both, substantially outperform all other state-of-the-art methods. The ACC of fusing triadic and fusing tetradic affinities exceeds that of the second-highest baseline method by 14% and 28%, respectively, demonstrating the effectiveness of fusing high-order affinity with pairwise relations on HDLSS data. Additionally, the ACC reaches 100% when utilizing both higher-order affinities, indicating that fusing affinities of different orders explains the data structure more completely than merging only a single higher-order affinity.

We also employ t-SNE to visualize the differences in the embedded features produced by the different settings. Most of the samples mix together when visualizing the raw data with t-SNE, as shown in Fig. 5(a). The pairwise affinity, shown in Fig. 5(b), likewise has no clear-cut structure, except for the large diagonal elements; it is therefore challenging to discriminate the three subgroups using the pairwise affinity matrix directly. Next, we fuse the third-order tensor affinity with the pairwise affinity and apply UTC to obtain a local embedding, to which t-SNE is applied to visualize the sample distribution. The result is shown in Fig. 5(c): the samples in one subgroup (Subgroup A) are clearly separated from the others, while the remaining two subgroups (Subgroups B and C) are not well distinguished. The upper left of the affinity matrix shows an obvious blocky structure, but the remaining part lacks a clear structure, as shown in Fig. 5(d), which confirms the t-SNE result. Alternatively, if we fuse the fourth-order tensor affinity with the pairwise matrix affinity and apply UTC, the t-SNE visualization of the resulting embedding is shown in Fig. 5(e): the samples in Subgroup C are clearly separated from the other two subgroups, while those two subgroups are merged. The affinity heatmap validates this observation, with one subgroup in the lower-right corner possessing distinct values compared with the others, as shown in Fig. 5(f). Finally, we fuse the pairwise affinity with both the third-order and fourth-order affinities, and UTC is employed to learn a uniform embedding that is then visualized via t-SNE, as shown in Fig. 5(g). The three subgroups are perfectly separated in this case, and the diagonal of the corresponding affinity matrix shows three obvious block-like structures with clear affinity differences between the subgroups. Therefore, an accurate sample assignment can easily be obtained from the embedding produced by UTC. In summary, these results validate that the high-order tensor affinities compensate for the insufficiency of the pairwise affinity matrix when it is corrupted by high-dimensional features.

TABLE III: Statistics on six tested real-world datasets.
Dataset Type Samples Features Clusters
Yale Facial Image 55 4096 5
Coil20 Object Image 100 16384 5
Lymphoma Bioinformatics 62 4702 3
DriveFace Facial Image 78 307200 3
Lung Bioinformatics 100 12600 3
GLI-85 Bioinformatics 85 22283 2

4.4 Experiments on Real Datasets

We then validate the power of UTC by applying it to six public benchmark datasets, chosen to represent various sources, including facial images, object images, and bioinformatics data. The statistics of the six datasets are summarized in Tab. III, and a detailed description of each dataset is provided below.

TABLE IV: Clustering performance on six real datasets
Datasets Methods ACC ARI F-SCORE NMI PURITY
YALE SC 0.4727 0.2528 0.423 0.4097 0.4909
CAN 0.5091 0.3316 0.4839 0.5046 0.5273
CAVE 0.5818 0.2871 0.4337 0.4307 0.6182
PPS 0.5636 0.3166 0.4472 0.4453 0.5636
IPS2 0.7218 0.4847 0.5848 0.5862 0.7818
UTC 0.8\bm{0.8} 0.5019\bm{0.5019} 0.6036\bm{0.6036} 0.67\bm{0.67} 0.8\bm{0.8}
Coil SC 0.83 0.9068 0.9248 0.9352 0.96
CAN 0.86 0.7660 0.7807 0.8886 0.81
CAVE 0.84 0.6852 0.7462 0.7575 0.84
PPS 0.93 0.8383 0.8694 0.869 0.93
IPS2 0.97\bm{0.97} 0.9280\bm{0.9280} 0.9419\bm{0.9419} 0.9460\bm{0.9460} 0.97\bm{0.97}
UTC 0.97\bm{0.97} 0.9280\bm{0.9280} 0.9419\bm{0.9419} 0.9460\bm{0.9460} 0.97\bm{0.97}
DriveFace SC 0.641 0.3304 0.5921 0.4505 0.641
CAN 0.7436 0.3671 0.5941 0.4486 0.7436
CAVE 0.8974 0.7274 0.8182 0.7449 0.8974
PPS 0.7436 0.4655 0.6433 0.5144 0.718
IPS2 0.7436 0.4711 0.6549 0.5731 0.7436
UTC 0.9615\bm{0.9615} 0.8869\bm{0.8869} 0.9237\bm{0.9237} 0.8675\bm{0.8675} 0.9615\bm{0.9615}
Lymphoma SC 0.8065 0.5059 0.7167 0.6634 0.8548
CAN 0.9839 0.9471 0.9733 0.9255 0.9839
CAVE 0.8065 0.5060 0.7167 0.6634 0.8548
PPS 0.9194 0.7570 0.8696 0.7823 0.9194
IPS2 0.9839 0.9471 0.9733 0.9255 0.9839
UTC 𝟏\bm{1} 𝟏\bm{1} 𝟏\bm{1} 𝟏\bm{1} 𝟏\bm{1}
Lung SC 0.8 0.5855 0.675 0.6848 0.81
CAN 0.77 0.5973 0.6971 0.6995 0.85
CAVE 0.8 0.5855 0.675 0.6848 0.81
PPS 0.73 0.541 0.6472 0.6311 0.8
IPS2 0.82 0.541 0.7427 0.7289\bm{0.7289} 0.85
UTC 0.85\bm{0.85} 0.6941\bm{0.6941} 0.7634\bm{0.7634} 0.7219 0.85\bm{0.85}
GLI-85 SC 0.6471 0.0731 0.5721 0.1385 0.6941
CAN 0.6588 -0.0353 0.697 0.042 0.6941
CAVE 0.7412 0.1225 0.7376 0.171 0.7412
PPS 0.5294 -0.0447 0.5777 0.1103 0.6941
IPS2 0.7294 0.2012 0.6267 0.2754 0.7294
UTC 0.8118\bm{0.8118} 0.3772\bm{0.3772} 0.7177\bm{0.7177} 0.2798\bm{0.2798} 0.8118\bm{0.8118}
  • Yale dataset contains 165 grayscale images of 15 individuals. Each individual has 11 images with different facial expressions and configurations, such as with glasses, without glasses, center-light, left-light, happy, sad, and so on. We extracted 4096-dimensional raw pixel values for every image of 5 classes for our experiment.

  • Coil20 dataset contains 1440 images of 20 categories of objects, with 72 images per category taken from different views. We sample 20 images from each of 5 classes, and all images have a size of 128\times 128.

  • Lymphoma dataset concerns diffuse large B-cell lymphoma (DLBCL), one of the most common subtypes of non-Hodgkin's lymphoma, which has two molecularly distinct forms whose gene expression patterns indicate different stages of B-cell differentiation. The dataset contains a total of 62 samples, each described by 4702 expression features.

  • DriveFace dataset contains image sequences of subjects while driving in real scenarios. It is composed of 606 samples of 6400 features each, acquired over different days from 3 drivers with various facial characteristics, such as glasses and beards. For each driver, we pick 26 samples for the experiment.

  • Lung dataset contains a total of 203 samples that can be divided into 5 classes, with 149, 21, 20, 6, and 17 samples, respectively. Each sample has 12,600 genes. We selected 40 samples of the first category and all samples of the remaining four categories for experiments.

  • GLI-85 dataset consists of the transcriptional data of 85 diffuse infiltrating gliomas from 74 patients. Those gliomas can be divided into two kinds of tumor subclasses. Each instance has 22283 features.

Figure 6: Visualization of the latent representations with t-SNE and corresponding affinity heatmaps of the (a) Yale; (b) Coil; and (c) DriveFace datasets. The first and second columns are the visualizations of the embedding by t-SNE on the raw data and the samples’ affinity heatmap. The third and the fourth columns are the visualization results of the obtained uniform embedding by UTC and the samples’ affinity heatmap, respectively.
Figure 7: Visualization of the latent representations with t-SNE and corresponding affinity heatmaps of the (a) Lung; (b) Lymphoma, and (c) GLI-85 datasets. The first and second columns are the visualization by t-SNE of the raw data and the samples’ affinity heatmap. The third and fourth columns are the visualization results of the obtained uniform embedding by UTC and the samples’ affinity heatmap, respectively.

We applied the proposed UTC as well as the five comparative methods to the six datasets; the results are summarized in Tab. IV, where bold values indicate the best result. The clustering methods that use higher-order affinity consistently outperform the standard methods that use only pairwise affinity, indicating that the higher-order affinity constructed from relationships among multiple samples better describes the spatial structure of the data, which is advantageous for HDLSS data. Furthermore, the proposed UTC significantly outperforms the competing methods on nearly all five evaluation metrics across all types of data, including facial image, object image, and bioinformatics data.

On the Yale dataset, the ACC clustering result of our method is over 33% higher than that of SC and over 8% higher than the second-highest IPS2. On the GLI-85 dataset, our results are more than 16% and 6% higher than those of the baseline and second-highest IPS2 methods. There are also 1%, 7%, 10%, 2%, and 3% improvements compared with the second-best performance on the Coil, DriveFace, Colon, Lymphoma, and Lung datasets, respectively. IPS2 and UTC achieve the top two results on almost all datasets with respect to nearly all the metrics.

We also visualize the sample spatial distributions after t-SNE [31] and the heatmaps of the affinity matrices for the Yale, Coil, and DriveFace datasets in Fig. 6 and for the Lymphoma, Lung, and GLI-85 datasets in Fig. 7. In each figure, the t-SNE distributions of the raw data and of the resulting embedding are shown in the first and third columns, while the heatmaps of the raw affinity and of the affinity fused over various orders are shown in the second and fourth columns. In most cases, the low-dimensional embedding produced by UTC displays better separation and less overlap, making clustering easier. The similarity heatmaps produced by SC and UTC further demonstrate the effectiveness of our method (deep yellow represents a large proximity value): the affinity heatmap on the raw samples has highly blurred boundaries and no apparent block structure, whereas the affinity matrix computed from the low-dimensional embedding of UTC has clear-cut boundaries, with the large values concentrated in the diagonal blocks. This implies that the fused affinity reflects the real relationships among the samples more accurately, yielding the superior clustering performance reported in Tab. IV. Taking the Yale dataset as an example, the ACC obtained by UTC is 33% higher than that of SC, and the visualization results confirm this superiority intuitively, as shown in Fig. 6(a): most of the samples are mixed together in the visualization based on pairwise affinity, making it impossible to distinguish the subgroups accurately (far left of Fig. 6(a)), and the heatmap of the pairwise affinity shows only one cluster that differs markedly from the other samples. Correspondingly, in the t-SNE visualization of the resulting embedding, most samples are grouped with members of their own cluster, and five obvious blocks appear on the diagonal of the heatmap. The right side of Fig. 6(a) confirms that UTC reflects the real structure of the data better than single-order methods, which is critical for clustering.

In summary, UTC performs well because it exploits the complementarity of high- and low-order affinities to capture the spatial structure of the data comprehensively, which is decisive for clustering performance.

4.5 Computational Complexity Analysis

The UTC consists of two major computational components: constructing the high-order tensor affinities and solving the minimization problem in Eq. (27). In the first component, we calculate both the triadic and tetradic affinities by means of $k$-nearest neighbors (KNN). Given $m$ samples, the KNN strategy reduces the time complexity from $\mathcal{O}(m^{4})$ to $\mathcal{O}(mk^{4})$, so the two affinity tensors are sparse, with $s$ nonzero elements. For the second component, we solve the model by means of an iterative strategy. The subproblem of $\bm{V}_{1}$, solved by gradient descent, has a computational complexity of $\mathcal{O}(ncsm^{2})$, where $c$ denotes the number of clusters and $n$ is the maximum number of inner iterations. Solving the other subproblem $\bm{V}_{2}$ takes $\mathcal{O}(csm^{2})$ using the Germs algorithm. Thus, each iteration has a total computational cost of $\mathcal{O}(ncsm^{2})+\mathcal{O}(csm^{2})=\mathcal{O}((n+1)csm^{2})$. In total, the complexity of learning the uniform embedding in Algorithm 1 is $\mathcal{O}(mk^{4}+K(n+1)csm^{2})$, with $K$ being the number of outer iterations.
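To make the first component concrete, the sketch below constructs a sparse triadic affinity restricted to triples drawn from each sample’s $k$ nearest neighbours, so that only a small number of entries is stored; the Gaussian-style decay over the triple’s pairwise distances is an illustrative assumption, not the exact affinity definition used by UTC.

```python
import itertools
import numpy as np
from sklearn.neighbors import NearestNeighbors

def sparse_triadic_affinity(X, k=5, sigma=1.0):
    """Store triadic affinities only for triples drawn from each sample's
    k nearest neighbours, giving roughly O(m * k^2) nonzeros instead of
    the dense O(m^3)."""
    X = np.asarray(X, dtype=float)
    m = X.shape[0]
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X)                     # idx[i, 0] is sample i itself
    affinity = {}
    for i in range(m):
        for j, l in itertools.combinations(idx[i, 1:], 2):
            # Illustrative Gaussian-style decay over the triple's pairwise distances.
            d2 = (np.sum((X[i] - X[j]) ** 2) +
                  np.sum((X[i] - X[l]) ** 2) +
                  np.sum((X[j] - X[l]) ** 2))
            affinity[tuple(sorted((i, int(j), int(l))))] = np.exp(-d2 / (2 * sigma ** 2))
    return affinity
```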

5 Conclusion

Clustering samples of small size yet large feature dimensionality is a bottleneck problem. We propose a unified learning model that effectively fuses samples’ affinities of different orders to obtain a uniform low-dimensional embedding of the samples. Tensor-vector products uniformly formulate the sample affinities of various orders, which opens a new door for studying relationships among multiple samples. Our method involves three matrix products: the arithmetical, Khatri-Rao, and Kronecker products, which are jointly optimized to yield the uniform low-dimensional embedding. Experiments on synthetic data demonstrate the power of fusing different-order affinities, and experiments on several real datasets with small sample sizes yet large feature dimensionality show the effectiveness and superiority of the method over other popular approaches.

References

  • [1] T. Hofmann and J. M. Buhmann, “Pairwise data clustering by deterministic annealing,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 19, no. 1, pp. 1–14, 1997.
  • [2] T. Kanungo, D. M. Mount, N. S. Netanyahu, C. D. Piatko, R. Silverman, and A. Y. Wu, “An efficient k-means clustering algorithm: Analysis and implementation,” IEEE transactions on pattern analysis and machine intelligence, vol. 24, no. 7, pp. 881–892, 2002.
  • [3] J. Shi and J. Malik, “Normalized cuts and image segmentation,” IEEE Transactions on pattern analysis and machine intelligence, vol. 22, no. 8, pp. 888–905, 2000.
  • [4] A. Y. Ng, M. I. Jordan, and Y. Weiss, “On spectral clustering: Analysis and an algorithm,” in Advances in neural information processing systems, 2002, pp. 849–856.
  • [5] H. Li, S. Tian, Y. Li, Q. Fang, R. Tan, Y. Pan, C. Huang, Y. Xu, and X. Gao, “Modern deep learning in bioinformatics,” Journal of molecular cell biology, vol. 12, no. 11, pp. 823–827, 2020.
  • [6] T. Tian, J. Wan, Q. Song, and Z. Wei, “Clustering single-cell RNA-seq data with a model-based deep learning approach,” Nature Machine Intelligence, vol. 1, no. 4, pp. 191–198, 2019.
  • [7] X. Peng, J. Feng, J. T. Zhou, Y. Lei, and S. Yan, “Deep subspace clustering,” IEEE Transactions on Neural Networks and Learning Systems, pp. 1–13, 2020.
  • [8] Y. Lu, Y. Cheung, and Y. Y. Tang, “Self-adaptive multiprototype-based competitive learning approach: A k-means-type algorithm for imbalanced data clustering,” IEEE Transactions on Cybernetics, pp. 1–15, 2019.
  • [9] W. Yang, C. Hui, D. Sun, X. Sun, and Q. Liao, “Clustering through probability distribution analysis along eigenpaths,” IEEE Transactions on Systems, Man, and Cybernetics: Systems, 2018.
  • [10] Y. Chen, L. Zhou, S. Pei, Z. Yu, Y. Chen, X. Liu, J. Du, and N. Xiong, “KNN-block DBSCAN: Fast clustering for large-scale data,” IEEE Transactions on Systems, Man, and Cybernetics: Systems, 2019.
  • [11] X. Xu, J. Li, M. Zhou, J. Xu, and J. Cao, “Accelerated two-stage particle swarm optimization for clustering not-well-separated data,” IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 50, no. 11, pp. 4212–4223, 2018.
  • [12] C. C. Aggarwal, A. Hinneburg, and D. A. Keim, “On the surprising behavior of distance metrics in high dimensional space,” in International conference on database theory.   Springer, 2001, pp. 420–434.
  • [13] D. François, V. Wertz, and M. Verleysen, “The concentration of fractional distances,” IEEE Transactions on Knowledge and Data Engineering, vol. 19, no. 7, pp. 873–886, 2007.
  • [14] S. Sarkar and A. K. Ghosh, “On perfect clustering of high dimension, low sample size data,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.
  • [15] P. Borysov, J. Hannig, and J. S. Marron, “Asymptotics of hierarchical clustering for growing dimension,” Journal of Multivariate Analysis, vol. 124, pp. 465–479, 2014.
  • [16] P. Hall, J. S. Marron, and A. Neeman, “Geometric representation of high dimension, low sample size data,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 67, no. 3, pp. 427–444, 2005.
  • [17] F. Nie, X. Wang, and H. Huang, “Clustering and projected clustering with adaptive neighbors,” in Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, 2014, pp. 977–986.
  • [18] H. Wang, G. Xiao, Y. Yan, and D. Suter, “Searching for representative modes on hypergraphs for robust geometric model fitting,” IEEE transactions on pattern analysis and machine intelligence, vol. 41, no. 3, pp. 697–711, 2018.
  • [19] H. C. Nguyen and H. Mamitsuka, “Learning on hypergraphs with sparsity,” IEEE transactions on pattern analysis and machine intelligence, 2020.
  • [20] D. Zhou, J. Huang, and B. Schölkopf, “Learning with hypergraphs: Clustering, classification, and embedding,” Advances in neural information processing systems, vol. 19, pp. 1601–1608, 2006.
  • [21] S. Agarwal, J. Lim, L. Zelnik-Manor, P. Perona, D. Kriegman, and S. Belongie, “Beyond pairwise clustering,” in 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), vol. 2.   IEEE, 2005, pp. 838–845.
  • [22] S. R. Bulo and M. Pelillo, “A game-theoretic approach to hypergraph clustering,” in NIPS, vol. 22.   Citeseer, 2009, pp. 1571–1579.
  • [23] Z. Zhang, H. Lin, and Y. Gao, “Dynamic hypergraph structure learning,” in IJCAI, 2018, pp. 3162–3169.
  • [24] G. Li, L. Qi, and G. Yu, “The z-eigenvalues of a symmetric tensor and its application to spectral hypergraph theory,” Numerical Linear Algebra with Applications, vol. 20, no. 6, pp. 1001–1029, 2013.
  • [25] D. Ghoshdastidar and A. Dukkipati, “Spectral clustering using multilinear SVD: Analysis, approximations and applications,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 29, no. 1, 2015.
  • [26] H. Peng, Y. Hu, J. Chen, H. Wang, Y. Li, and H. Cai, “Integrating tensor similarity to enhance clustering performance,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.
  • [27] S. Rabanser, O. Shchur, and S. Günnemann, “Introduction to tensor decompositions and their applications in machine learning,” arXiv preprint arXiv:1711.10781, 2017.
  • [28] U. Von Luxburg, “A tutorial on spectral clustering,” Statistics and computing, vol. 17, no. 4, pp. 395–416, 2007.
  • [29] S. Sarkar and A. K. Ghosh, “On perfect clustering of high dimension, low sample size data,” IEEE transactions on pattern analysis and machine intelligence, vol. 42, no. 9, pp. 2257–2272, 2019.
  • [30] A. Kumar, P. Rai, and H. Daume, “Co-regularized multi-view spectral clustering,” Advances in neural information processing systems, vol. 24, pp. 1413–1421, 2011.
  • [31] L. Van der Maaten and G. Hinton, “Visualizing data using t-SNE,” Journal of Machine Learning Research, vol. 9, no. 11, 2008.